
Experimental tools for parallel machine learning


# Pyrallel - Parallel Data Analytics in Python

Overview: experimental project to investigate distributed computation patterns for machine learning and other semi-interactive data analytics tasks.

Scope:

  • focus on small to medium datasets that fit in memory on a small (10+ nodes) to medium (100+ nodes) cluster.

  • focus on small to medium data (with data locality when possible).

  • focus on CPU bound tasks (e.g. training Random Forests) while trying to limit disk / network access to a minimum.

  • do not focus on HA / Fault Tolerance (yet).

  • do not try to invent a new set of high level programming abstractions (yet): use a low level programming model (IPython.parallel) to finely control the cluster elements and the messages transferred, and to help identify the practical underlying constraints in distributed machine learning settings.

Disclaimer: the public API of this library is unlikely to stabilize soon, as the current goal of this project is to experiment.

## Dependencies

The usual suspects: Python 2.7, NumPy, SciPy.

Fetch the development version (master branch) from:

The StarCluster develop branch and its IPCluster plugin are also required to easily start up a bunch of nodes with IPython.parallel set up.

## Patterns currently under investigation

  • Asynchronous & randomized hyper-parameter search (a.k.a. Randomized Grid Search) for machine learning models.

  • Share numerical arrays efficiently over the nodes and make them available to concurrently running Python processes without making copies in memory using memory-mapped files.

  • Distributed Random Forests fitting.

  • Ensembling heterogeneous library models.

  • Parallel implementation of online averaged models using an MPI AllReduce, for instance using MiniBatchKMeans on partitioned data.
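To make the first pattern above more concrete, here is a minimal sketch of an asynchronous randomized hyper-parameter search. It is not this project's API: it uses only the standard library, a thread pool stands in for the IPython.parallel engines, and `sample_params`, `toy_score` and `randomized_search` are hypothetical names introduced for illustration. The scoring function is a toy stand-in for a cross-validated model score.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed


def sample_params(rng, space):
    """Draw one random configuration: one value per hyper-parameter."""
    return {name: rng.choice(values) for name, values in sorted(space.items())}


def toy_score(params):
    """Hypothetical stand-in for a cross-validated score (higher is better)."""
    return -((params["C"] - 1.0) ** 2) - (params["gamma"] - 0.1) ** 2


def randomized_search(space, score, n_iter=20, seed=0, max_workers=4):
    """Evaluate n_iter random configurations and keep the best one.

    Results are consumed as soon as they complete, in any order, so a
    slow evaluation never blocks the bookkeeping for faster ones.
    """
    rng = random.Random(seed)
    candidates = [sample_params(rng, space) for _ in range(n_iter)]
    best_score, best_params = float("-inf"), None
    # Assumption: a local thread pool replaces the remote cluster engines.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(score, p): p for p in candidates}
        for future in as_completed(futures):
            current = future.result()
            if current > best_score:
                best_score, best_params = current, futures[future]
    return best_params, best_score


space = {
    "C": [0.01, 0.1, 1.0, 10.0, 100.0],
    "gamma": [0.001, 0.01, 0.1, 1.0],
}
best_params, best_score = randomized_search(space, toy_score)
print(best_params, best_score)
```

Because the candidates are sampled up front and scored independently, the search can be stopped at any time and still report the best configuration seen so far, which is what makes it suitable for semi-interactive use.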

See the content of the examples/ folder for more details.
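As an illustration of the memory-mapped array sharing pattern listed above, the following sketch maps one file of raw float64 values twice and reads it through zero-copy views. It is an assumption-laden toy, not this project's code: it uses only the standard library (`mmap`, `struct`, `memoryview`), the helper names `write_array` and `open_shared` are hypothetical, and the two mappings live in one process where a real setup would have one mapping per worker process. With NumPy available, `numpy.memmap` would be the more usual tool for the same idea.

```python
import mmap
import os
import shutil
import struct
import tempfile


def write_array(path, values):
    """Dump floats to disk as raw native-endian float64 values."""
    with open(path, "wb") as f:
        f.write(struct.pack("%dd" % len(values), *values))


def open_shared(path):
    """Map the file read-only and expose it as a zero-copy float64 view."""
    f = open(path, "rb")
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return f, mm, memoryview(mm).cast("d")


tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "data.f64")
write_array(path, [0.5, 1.5, 2.5, 3.5])

# Two independent mappings of the same file: the operating system backs
# both with the same pages, so the data is not duplicated in memory.
f1, mm1, a = open_shared(path)
f2, mm2, b = open_shared(path)

total = sum(a)   # read through the first mapping
second = b[1]    # read through the second mapping
n = len(a)
print(n, total, second)

# Views must be released before the mappings can be closed.
a.release()
b.release()
mm1.close()
mm2.close()
f1.close()
f2.close()
shutil.rmtree(tmp_dir)
```

Read-only mappings keep concurrent readers safe without locking; any scheme that lets workers write into the shared file would need its own coordination.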

## License

Simplified BSD.

## History

This project started at the [PyCon 2012 PyData sprint](http://wiki.ipython.org/PyCon12Sprint) as a set of proof of concept [IPython.parallel scripts](https://github.com/ogrisel/pycon-pydata-sprint).


