spark-sklearn

Integration tools for running scikit-learn on Spark

These details have been verified by PyPI

Maintainers

josephkb mengxr seanowen smurching thunterdb

These details have not been verified by PyPI

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: Apache Software License
Natural Language
- English
Operating System
- OS Independent
Programming Language
Topic
- Scientific/Engineering

Project description

This package contains tools to integrate the Spark computing framework with the popular scikit-learn machine learning library. Among other things, it can:

- train and evaluate multiple scikit-learn models in parallel, as a distributed analog to the multicore implementation included by default in scikit-learn
- convert Spark DataFrames seamlessly into NumPy ndarrays or sparse matrices
- (experimental) distribute SciPy sparse matrices as a dataset of sparse vectors

It focuses on problems that have a small amount of data and that can be run in parallel. For small datasets, it distributes the search for estimator parameters (GridSearchCV in scikit-learn) using Spark. For datasets that do not fit in memory, we recommend using the distributed implementation in Spark MLlib.

This package distributes simple tasks like grid-search cross-validation. It does not distribute individual learning algorithms (unlike Spark MLlib).
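
The distinction between distributing tasks and distributing algorithms can be illustrated with a small, standard-library-only sketch. It is not the package's implementation: a thread pool stands in for Spark executors, and `fit_and_score` is a hypothetical placeholder for "train one complete model on one cross-validation fold and score it". The point is only that each task is independent and runs whole on a single worker.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def fit_and_score(task):
    """Hypothetical stand-in for fitting one full model on one fold."""
    params, fold = task
    # Toy deterministic score, purely illustrative: peaks at C == 10.
    score = 1.0 / (1.0 + abs(params["C"] - 10)) - 0.01 * fold
    return params["C"], fold, score


candidates = [{"C": 1}, {"C": 10}, {"C": 100}]  # parameter settings to try
folds = range(3)                                # 3-fold cross-validation

# One independent task per (candidate, fold) pair: 3 x 3 = 9 tasks.
tasks = list(product(candidates, folds))

# The pool plays the role of the cluster: each worker runs whole tasks.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fit_and_score, tasks))

# Collect the per-fold scores of each candidate and average them.
scores_by_c = {}
for c, fold, score in results:
    scores_by_c.setdefault(c, []).append(score)
mean_scores = {c: sum(s) / len(s) for c, s in scores_by_c.items()}

best_c = max(mean_scores, key=mean_scores.get)
print(best_c, mean_scores[best_c])
```

Because no task depends on another, the work scales with the number of workers, but every single fit must still fit in one worker's memory. That is why the approach suits small datasets.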

Installation

This package is available on PyPI:

pip install spark-sklearn

This project is also available as a Spark package.

The developer version has the following requirements:

- scikit-learn 0.18 or 0.19. Later versions may work, but the tests are currently incompatible with 0.20.
- Spark >= 2.1.1. Spark may be downloaded from the Spark website. To use this package, you need the pyspark interpreter or another Spark-compliant Python interpreter. See the Spark guide for more details.
- nose (testing dependency only)
- pandas, if using the pandas integration or testing. pandas==0.18 has been tested.

If you want to use a developer version, you just need to make sure the python/ subdirectory is in the PYTHONPATH when launching the pyspark interpreter:

PYTHONPATH=$PYTHONPATH:./python $SPARK_HOME/bin/pyspark

You can directly run tests:

cd python && ./run-tests.sh

This requires the environment variable SPARK_HOME to point to your local copy of Spark.
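
A quick pre-flight check can catch a missing or wrong SPARK_HOME before the test script runs. The helper below is a hypothetical convenience, not part of the package, and it assumes the usual Unix layout in which the launcher lives at bin/pyspark under the Spark directory.

```python
import os


def find_pyspark(env=None):
    """Return the path to bin/pyspark under SPARK_HOME, or raise RuntimeError.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        raise RuntimeError("SPARK_HOME is not set")
    launcher = os.path.join(spark_home, "bin", "pyspark")
    if not os.path.isfile(launcher):
        raise RuntimeError("no pyspark launcher found at " + launcher)
    return launcher
```

Calling `find_pyspark()` with no arguments inspects the current environment and either returns the launcher path or explains what is missing.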

Example

Here is a simple example that runs a grid search with Spark. See the Installation section on how to install the package.

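from sklearn import svm, datasets
from spark_sklearn import GridSearchCV

# `sc` is the SparkContext that the pyspark interpreter creates at startup.
iris = datasets.load_iris()
# Candidate hyperparameters: 2 kernels x 2 values of C = 4 combinations.
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svr = svm.SVC(gamma='auto')
# Same arguments as scikit-learn's GridSearchCV, with the SparkContext first.
clf = GridSearchCV(sc, svr, parameters)
clf.fit(iris.data, iris.target)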

This classifier can be used as a drop-in replacement for any scikit-learn classifier, with the same API.
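
To get a feel for how much work the grid in the example represents, the parameter dictionary can be expanded by hand. The sketch below is a plain-Python approximation of what scikit-learn's ParameterGrid does; `expand_grid` is a hypothetical helper name, and the code needs neither Spark nor scikit-learn.

```python
from itertools import product

# The same grid as in the example above.
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}


def expand_grid(grid):
    """Return one dict per combination of parameter values."""
    keys = sorted(grid)  # fixed key order so the output is reproducible
    value_lists = [grid[k] for k in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


candidates = expand_grid(parameters)
for params in candidates:
    print(params)
```

The grid yields four candidate settings. Each one is fitted once per cross-validation fold, and those fits are the units of work that get spread across the cluster.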

Documentation

API documentation is currently hosted on GitHub Pages. To build the docs yourself, see the instructions in docs/.

Release history

- 0.3.0 (Jan 30, 2019), this version
- 0.2.3 (Sep 29, 2017)
- 0.2.2 (Sep 20, 2017)
- 0.2.1 (Sep 11, 2017)
- 0.2.0 (Aug 16, 2016)
- 0.1.2 (Mar 17, 2016)
- 0.1.1 (Jan 11, 2016)
- 0.1.0 (Jan 11, 2016)

Download files

Source Distribution

spark-sklearn-0.3.0.tar.gz (28.2 kB)

Uploaded Jan 30, 2019 Source

File details

Details for the file spark-sklearn-0.3.0.tar.gz.

File metadata

Download URL: spark-sklearn-0.3.0.tar.gz
Upload date: Jan 30, 2019
Size: 28.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.12.1 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.6.3 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.7.2

File hashes

Hashes for spark-sklearn-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`d78d4f08a3849b243232ef78b63b4babfdb04ec529f996f4699923f40cfce827`
MD5	`4460d6c8402a5b46d361c442c2e47f19`
BLAKE2b-256	`b03f34b8dec7d2cfcfe0ba99d637b4f2d306c1ca0b404107c07c829e085f6b38`