Skip to main content

napkinXC is an extremely simple and fast library for extreme multi-class and multi-label classification.

Project description

napkinXC

PyPI version Build Status Documentation Status License

napkinXC is an extremely simple and fast library for extreme multi-class and multi-label classification. It allows to train a classifier for very large datasets in few lines of code with minimal resources.

Right now, napkinXC implements the following features both in Python and C++:

  • Probabilistic Label Trees (PLTs) and Online Probabilistic Label Trees (OPLTs),
  • Hierarchical softmax (HSM),
  • Binary Relevance (BR),
  • One Versus Rest (OVR),
  • fast online prediction of top-k labels or labels above the given threshold,
  • hierarchical k-means clustering for tree building and other tree building methods,
  • support for predefined hierarchies,
  • LIBLINEAR, SGD, and AdaGrad solvers for base classifiers,
  • efficient ensembles tree-based model,
  • helpers to download and load data from XML Repository,
  • helpers to measure performance.

Documentation is Please note that this library is still under development and also serves as a base for experiments. Some of the experimental features may not be documented.

The napkinXC is distributed under MIT license. All contributions to the project are welcome!

Roadmap

Coming soon:

  • Possibility to use any type of binary classifier from Python
  • Improved dataset loading from Python
  • More datasets from XML Repository

Python Quick Start and Documentation

napkinXC's documentation is available at https://napkinxc.readthedocs.io and is generated from this repository.

Python version of napkinXC can be easily installed from PyPy repository on Linux and MacOS, it requires modern C++ compiler, CMake and Git installed:

pip install napkinxc

or the latest master version directly from the GitHub repository:

pip install pip install git+https://github.com/mwydmuch/napkinXC.git

Minimal example of usage:

from napkinxc.datasets import load_dataset
from napkinxc.models import PLT
from napkinxc.measures import precision_at_k

X_train, Y_train = load_dataset("eurlex-4k", "train")
X_test, Y_test = load_dataset("eurlex-4k", "test")
plt = PLT("eurlex-model")
plt.fit(X_train, Y_train)
Y_pred = plt.predict(X_test, top_k=1)
print(precision_at_k(Y_test, Y_pred, k=1))

More examples can be found under python/examples directory.

Building executable

napkinXC can also be built using:

cmake .
make

Command line options

Usage: nxc <command> <args>

Commands:
    train                   Train model on given input data
    test                    Test model on given input data
    predict                 Predict for given data
    ofo                     Use online f-measure optimization
    version                 Print napkinXC version
    help                    Print help

Args:
    General:
    -i, --input             Input dataset
    -o, --output            Output (model) dir
    -m, --model             Model type (default = plt):
                            Models: ovr, br, hsm, plt, oplt, svbopFull, svbopHf, brMips, svbopMips
    --ensemble              Number of models in ensemble (default = 1)
    -t, --threads           Number of threads to use (default = 0)
                            Note: -1 to use #cpus - 1, 0 to use #cpus
    --hash                  Size of features space (default = 0)
                            Note: 0 to disable hashing
    --featuresThreshold     Prune features below given threshold (default = 0.0)
    --seed                  Seed (default = system time)
    --verbose               Verbose level (default = 2)

    Base classifiers:
    --optimizer             Optimizer used for training binary classifiers (default = libliner)
                            Optimizers: liblinear, sgd, adagrad, fobos
    --bias                  Value of the bias features (default = 1)
    --inbalanceLabelsWeighting     Increase the weight of minority labels in base classifiers (default = 1)
    --weightsThreshold      Threshold value for pruning models weights (default = 0.1)

    LIBLINEAR:              (more about LIBLINEAR: https://github.com/cjlin1/liblinear)
    -s, --liblinearSolver   LIBLINEAR solver (default for log loss = L2R_LR_DUAL, for l2 loss = L2R_L2LOSS_SVC_DUAL)
                            Supported solvers: L2R_LR_DUAL, L2R_LR, L1R_LR,
                                               L2R_L2LOSS_SVC_DUAL, L2R_L2LOSS_SVC, L2R_L1LOSS_SVC_DUAL, L1R_L2LOSS_SVC
    -c, --liblinearC        LIBLINEAR cost co-efficient, inverse of regularization strength, must be a positive float,
                            smaller values specify stronger regularization (default = 10.0)
    --eps, --liblinearEps   LIBLINEAR tolerance of termination criterion (default = 0.1)

    SGD/AdaGrad:
    -l, --lr, --eta         Step size (learning rate) for online optimizers (default = 1.0)
    --epochs                Number of training epochs for online optimizers (default = 1)
    --adagradEps            Defines starting step size for AdaGrad (default = 0.001)

    Tree:
    -a, --arity             Arity of tree nodes (default = 2)
    --maxLeaves             Maximum degree of pre-leaf nodes. (default = 100)
    --tree                  File with tree structure
    --treeType              Type of a tree to build if file with structure is not provided
                            tree types: hierarchicalKmeans, huffman, completeKaryInOrder, completeKaryRandom,
                                        balancedInOrder, balancedRandom, onlineComplete

    K-Means tree:
    --kmeansEps             Tolerance of termination criterion of the k-means clustering
                            used in hierarchical k-means tree building procedure (default = 0.001)
    --kmeansBalanced        Use balanced K-Means clustering (default = 1)

    Prediction:
    --topK                  Predict top-k labels (default = 5)
    --threshold             Predict labels with probability above the threshold (default = 0)
    --thresholds            Path to a file with threshold for each label
    --setUtility            Type of set-utility function for prediction using svbopFull, svbopHf, svbopMips models.
                            Set-utility functions: uP, uF1, uAlfa, uAlfaBeta, uDeltaGamma
                            See: https://arxiv.org/abs/1906.08129

    Set-Utility:
    --alpha
    --beta
    --delta
    --gamma

    Test:
    --measures              Evaluate test using set of measures (default = "p@1,r@1,c@1,p@3,r@3,c@3,p@5,r@5,c@5")
                            Measures: acc (accuracy), p (precision), r (recall), c (coverage), hl (hamming loos)
                                      p@k (precision at k), r@k (recall at k), c@k (coverage at k), s (prediction size)

Data Format

napkinXC supports multi-label svmlight/libsvm format and format of datasets from The Extreme Classification Repository, which has an additional header line with a number of data points, features, and labels.

label,label,... feature(:value) feature(:value) ...

References and acknowledgments

This library implements methods from following papers:

Another implementation of PLT model is available in extremeText library, that implements approach described in this NeurIPS paper.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

napkinxc-0.4.1.tar.gz (147.9 kB view details)

Uploaded Source

File details

Details for the file napkinxc-0.4.1.tar.gz.

File metadata

  • Download URL: napkinxc-0.4.1.tar.gz
  • Upload date:
  • Size: 147.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.23.0 setuptools/42.0.0 requests-toolbelt/0.9.1 tqdm/4.46.0 CPython/3.7.5

File hashes

Hashes for napkinxc-0.4.1.tar.gz
Algorithm Hash digest
SHA256 b705535adda7c0bcc2db81cfb36f96ebf91ea813bfc632267a4d198ac54f1162
MD5 ddfc1ae8aba5363115d0c36255c6c3aa
BLAKE2b-256 967da826f21fab908a61d89307e820179b95679ae0042fdd608fcf6f71845da6

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page