Scikit-learn Wrapper for Regularized Greedy Forest
rgf_python
A Python wrapper of the machine learning algorithm *Regularized Greedy Forest (RGF)*.
Features
Scikit-learn interface and support for multiclass classification.
The original RGF implementation handles only regression and binary classification, but rgf_python also supports multiclass classification via the "One-vs-Rest" method.
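The One-vs-Rest idea can be illustrated without RGF itself: one binary model is fit per class, and prediction picks the class whose model scores highest. Below is a minimal, self-contained sketch using a toy "nearest-mean" binary scorer (hypothetical, for illustration only; rgf_python uses RGF binary models internally):

```python
def fit_binary(X, y_binary):
    """Fit a trivial scorer: the mean feature vector of the positive class."""
    pos = [x for x, y in zip(X, y_binary) if y == 1]
    dim = len(X[0])
    return [sum(x[d] for x in pos) / len(pos) for d in range(dim)]

def score(model, x):
    """Negative squared distance to the positive-class mean."""
    return -sum((a - b) ** 2 for a, b in zip(model, x))

def ovr_fit(X, y, classes):
    # One binary problem per class: "is this sample of class c?"
    return {c: fit_binary(X, [1 if label == c else 0 for label in y])
            for c in classes}

def ovr_predict(models, x):
    # The class whose binary model scores highest wins.
    return max(models, key=lambda c: score(models[c], x))

X = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0], [9.1, 0.2]]
y = [0, 0, 1, 1, 2, 2]
models = ovr_fit(X, y, classes=[0, 1, 2])
print([ovr_predict(models, x) for x in X])  # → [0, 0, 1, 1, 2, 2]
```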
Example:

```python
from sklearn import datasets
from sklearn.utils.validation import check_random_state
from sklearn.model_selection import StratifiedKFold, cross_val_score
from rgf.sklearn import RGFClassifier

iris = datasets.load_iris()
rng = check_random_state(0)
perm = rng.permutation(iris.target.size)
iris.data = iris.data[perm]
iris.target = iris.target[perm]

rgf = RGFClassifier(max_leaf=400,
                    algorithm="RGF_Sib",
                    test_interval=100,
                    verbose=True)

n_folds = 3
rgf_scores = cross_val_score(rgf,
                             iris.data,
                             iris.target,
                             cv=StratifiedKFold(n_folds))
rgf_score = sum(rgf_scores) / n_folds
print('RGF Classifier score: {0:.5f}'.format(rgf_score))
```
More examples can be found here.
Software Requirements
Python (2.7 or >= 3.4)
scikit-learn (>= 0.18)
RGF C++ (link)
If you can’t access the above URL, you can alternatively download RGF C++ from this page. Please see the README in the zip file for instructions on building the RGF executable.
Installation
```
git clone https://github.com/fukatani/rgf_python.git
cd rgf_python
python setup.py install
```
or using pip:
```
pip install git+git://github.com/fukatani/rgf_python@master
```
You have to place the RGF executable in a directory included in the environment variable ‘PATH’. Alternatively, you can specify the path directly by manually editing rgf/sklearn.py:
```python
## Edit this ##################################################
# Location of the RGF executable
loc_exec = 'C:\\Program Files\\RGF\\bin\\rgf.exe'
# Location for RGF temp files
loc_temp = 'temp/'
## End Edit ###################################################
```
Set the actual location of the RGF executable by editing ‘loc_exec’. The variable ‘loc_temp’ can be changed to specify the directory for temporary files.
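A misconfigured ‘loc_exec’ typically surfaces only at training time, so a quick sanity check after editing can save a debugging round-trip. The following sketch is not part of rgf_python; the paths shown are placeholders for whatever values you configured:

```python
import os

loc_exec = '/usr/local/bin/rgf'   # placeholder: your RGF executable location
loc_temp = 'temp/'                # placeholder: your RGF temp-file directory

def check_rgf_setup(exec_path, temp_dir):
    """Return a list of problems found with the configured paths."""
    problems = []
    if not (os.path.isfile(exec_path) and os.access(exec_path, os.X_OK)):
        problems.append('RGF executable not found or not executable: %s' % exec_path)
    if not os.path.isdir(temp_dir):
        problems.append('temp directory does not exist: %s' % temp_dir)
    return problems

for problem in check_rgf_setup(loc_exec, loc_temp):
    print(problem)
```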
Tuning Hyper-parameters
You can tune hyper-parameters as follows.
max_leaf: Appropriate values are data-dependent and usually range from 1000 to 10000.
test_interval: For efficiency, it should be a multiple or a divisor of 100 (the default value of the optimization interval).
algorithm: You can select “RGF”, “RGF_Opt” or “RGF_Sib”.
loss: You can select “LS”, “Log” or “Expo”.
reg_depth: Must be no smaller than 1. Meant to be used with algorithm = “RGF_Opt” or “RGF_Sib”.
l2: Either 1, 0.1, or 0.01 often produces good results though with exponential loss (loss = “Expo”) and logistic loss (loss = “Log”), some data requires smaller values such as 1e-10 or 1e-20.
sl2: Default value is equal to l2. On some data, l2/100 works well.
normalize: If turned on, training targets are normalized so that the average becomes zero.
min_samples_leaf: Smaller values may slow down training. Too large values may degrade model accuracy.
n_iter: Number of iterations of coordinate descent to optimize weights.
n_tree_search: Number of trees to be searched for the nodes to split. The most recently grown trees are searched first.
opt_interval: Weight optimization interval in terms of the number of leaf nodes.
learning_rate: Step size of Newton updates used in coordinate descent to optimize weights.
Detailed instructions on tuning hyper-parameters are here.
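Since several of the parameters above have a small set of recommended values, a common approach is to enumerate candidate combinations and evaluate each one (e.g. with cross-validation). A minimal sketch of the enumeration step, using illustrative grid values taken from the ranges suggested above:

```python
from itertools import product

# Illustrative grid built from the suggested ranges above
param_grid = {
    'max_leaf': [1000, 5000, 10000],
    'algorithm': ['RGF', 'RGF_Opt', 'RGF_Sib'],
    'l2': [1, 0.1, 0.01],
}

keys = sorted(param_grid)
combinations = [dict(zip(keys, values))
                for values in product(*(param_grid[k] for k in keys))]

print(len(combinations))  # → 27 candidate settings (3 * 3 * 3)
```

Each resulting dict can be passed as keyword arguments to RGFClassifier; the same grid also works directly with scikit-learn's GridSearchCV.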
Using at Kaggle Kernel
Kaggle Kernels now supports rgf_python. Please see this page.
Other
Much of the implementation is shamelessly based on the following code. Thanks!