Scikit-learn Wrapper for Regularized Greedy Forest
rgf_python
A Python wrapper of the machine learning algorithm Regularized Greedy Forest (RGF).
Features
Scikit-learn interface and support for multiclass classification problems.
The original RGF implementation is available only for regression and binary classification, but rgf_python also supports multiclass classification via the "One-vs-Rest" method.
Example:
```python
from sklearn import datasets
from sklearn.utils.validation import check_random_state
from sklearn.model_selection import StratifiedKFold, cross_val_score
from rgf.sklearn import RGFClassifier

iris = datasets.load_iris()
rng = check_random_state(0)
perm = rng.permutation(iris.target.size)
iris.data = iris.data[perm]
iris.target = iris.target[perm]

rgf = RGFClassifier(max_leaf=400,
                    algorithm="RGF_Sib",
                    test_interval=100,
                    verbose=True)

n_folds = 3
rgf_scores = cross_val_score(rgf,
                             iris.data,
                             iris.target,
                             cv=StratifiedKFold(n_folds))
rgf_score = sum(rgf_scores) / n_folds
print('RGF Classifier score: {0:.5f}'.format(rgf_score))
```
More examples can be found here.
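The multiclass support described above relies on the One-vs-Rest scheme: one binary model is fitted per class, and prediction picks the class whose model scores highest. A minimal pure-Python sketch of that decision rule (the per-class scores here are toy stand-ins, not real RGF model outputs):

```python
def one_vs_rest_predict(scores_per_class):
    """Given {class_label: binary_score}, return the label with the
    highest score -- the One-vs-Rest decision rule."""
    return max(scores_per_class, key=scores_per_class.get)

# Toy stand-in scores, as if three binary models had been fitted
# (one per iris class) and evaluated on a single sample.
scores = {'setosa': 0.1, 'versicolor': 0.7, 'virginica': 0.2}
print(one_vs_rest_predict(scores))  # -> versicolor
```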
Software Requirements
Python (2.7 or >= 3.4)
scikit-learn (>= 0.18)
RGF C++ (link)
If you can’t access the above URL, you can alternatively download RGF C++ from this page. Please see the README in the zip file to build the RGF executable.
Installation
From PyPI using pip:
```shell
pip install rgf_python
```
or from GitHub:
```shell
git clone https://github.com/fukatani/rgf_python.git
cd rgf_python
python setup.py install
```
You have to place the RGF executable in a directory included in the environment variable PATH. Alternatively, you may specify the actual location of the RGF executable and the directory for temporary files with the corresponding options in the configuration file .rgfrc, which you should create in your home directory. The default values are platform-dependent: on Windows, exe_location=$HOME/rgf.exe and temp_location=$HOME/temp/rgf; on other platforms, exe_location=$HOME/rgf and temp_location=/tmp/rgf. Here is an example .rgfrc file:
```
exe_location=C:/Program Files/RGF/bin/rgf.exe
temp_location=C:/Program Files/RGF/temp
```
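The .rgfrc file above is plain key=value lines, one setting per line. As an illustration of that format, here is a minimal sketch of how such a file could be parsed (the `parse_rgfrc` helper is hypothetical, not part of rgf_python, which has its own loader):

```python
import os

def parse_rgfrc(path):
    """Parse key=value lines from an .rgfrc-style config file.

    Hypothetical helper for illustration only.
    """
    settings = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            key, _, value = line.partition('=')
            settings[key.strip()] = value.strip()
    return settings

# Example: write a sample config and read it back.
sample = ("exe_location=C:/Program Files/RGF/bin/rgf.exe\n"
          "temp_location=C:/Program Files/RGF/temp\n")
with open('sample.rgfrc', 'w') as f:
    f.write(sample)
print(parse_rgfrc('sample.rgfrc'))
os.remove('sample.rgfrc')
```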
Tuning Hyper-parameters
You can tune hyper-parameters as follows.
max_leaf: Appropriate values are data-dependent and typically range from 1000 to 10000.
test_interval: For efficiency, it must be a multiple or a divisor of 100 (the default value of the optimization interval).
algorithm: You can select "RGF", "RGF_Opt" or "RGF_Sib".
loss: You can select "LS", "Log" or "Expo".
reg_depth: Must be no smaller than 1. Meant to be used with algorithm = "RGF_Opt" or "RGF_Sib".
l2: Either 1, 0.1, or 0.01 often produces good results, though with exponential loss (loss = "Expo") or logistic loss (loss = "Log") some data requires smaller values such as 1e-10 or 1e-20.
sl2: Default value is equal to l2. On some data, l2/100 works well.
normalize: If turned on, training targets are normalized so that the average becomes zero.
min_samples_leaf: Smaller values may slow down training. Too large values may degrade model accuracy.
n_iter: Number of iterations of coordinate descent to optimize weights.
n_tree_search: Number of trees to be searched for the nodes to split. The most recently grown trees are searched first.
opt_interval: Weight optimization interval in terms of the number of leaf nodes.
learning_rate: Step size of Newton updates used in coordinate descent to optimize weights.
Detailed instructions for tuning hyper-parameters are available here.
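Values for the hyper-parameters above are often chosen by grid search. A minimal sketch of enumerating candidate settings with the standard library (the candidate values are illustrative picks from the ranges above; actually fitting and scoring each candidate, e.g. with scikit-learn's GridSearchCV, is left out):

```python
from itertools import product

# Illustrative candidate values for a few of the hyper-parameters above.
param_grid = {
    'max_leaf': [1000, 5000, 10000],
    'algorithm': ['RGF', 'RGF_Opt', 'RGF_Sib'],
    'l2': [1, 0.1, 0.01],
}

def grid_candidates(grid):
    """Yield every combination of the grid as a parameter dict."""
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

candidates = list(grid_candidates(param_grid))
print(len(candidates))  # 3 * 3 * 3 = 27 combinations
```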
Using in Kaggle Kernels
Kaggle Kernels now support rgf_python. Please see this page.
Other
Much of the implementation is shamelessly based on the following code. Thanks!