Lightweight pipelining: using Python functions as pipeline jobs.

Project description

Joblib is a set of tools to provide lightweight pipelining in Python. In particular, joblib offers:

  1. transparent disk-caching of the output values and lazy re-evaluation (memoize pattern)

  2. simple parallel computing

  3. logging and tracing of the execution

Joblib is optimized to be fast and robust, in particular on large, long-running functions, and has specific optimizations for numpy arrays.

Joblib is BSD-licensed.

Vision

Joblib came out of long-running data-analysis Python scripts. The long-term vision is to provide tools for scientists to achieve better reproducibility when running jobs, without changing the way numerical code looks. However, Joblib can also be used to provide a lightweight make replacement.

The main problems identified are:

  1. Lazy evaluation: People need to rerun the same script over and over as it is tuned, and end up commenting and uncommenting steps as needed, because those steps take a long time to run.

  2. Persistence: It is difficult to efficiently persist arbitrary objects containing large numpy arrays. In addition, hand-written persistence to disk does not easily link the file on disk to the corresponding Python object it was persisted from in the script. This leads to people having a hard time resuming a job, e.g. after a crash, and to persistence getting in the way of work.

The approach taken by Joblib to address these problems is not to build a heavy framework and coerce users into using it (e.g. with an explicit pipeline). Instead, it strives to leave your code and your flow control as unmodified as possible.
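The idea of caching a step's output on disk and re-evaluating it only when its inputs change can be sketched with the standard library alone. This is an illustration of the memoize pattern, not Joblib's implementation; the names `disk_memoize` and `preprocess` are hypothetical, and the sketch assumes the arguments can be pickled.

```python
import functools
import hashlib
import os
import pickle
import tempfile


def disk_memoize(cache_dir):
    """Cache a function's return values on disk, keyed by its arguments.

    Hypothetical helper for illustration only; not part of Joblib.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Key the result on the function name and the pickled arguments.
            payload = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            key = hashlib.sha256(payload).hexdigest()
            path = os.path.join(cache_dir, key + ".pkl")
            if os.path.exists(path):
                # Result already on disk: skip the computation.
                with open(path, "rb") as fh:
                    return pickle.load(fh)
            # First call with these arguments: compute, then store.
            result = func(*args, **kwargs)
            with open(path, "wb") as fh:
                pickle.dump(result, fh)
            return result
        return wrapper
    return decorator


evaluations = []  # records each real evaluation of the step


@disk_memoize(tempfile.mkdtemp())
def preprocess(n):
    """Stand-in for a long-running step."""
    evaluations.append(n)
    return [i * i for i in range(n)]


first = preprocess(4)   # computed and written to disk
second = preprocess(4)  # read back from disk, no recomputation
```

The calling code stays an ordinary function call, which is the property the approach aims for: no step has to be commented out by hand to avoid rerunning it.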

Current features

  1. Transparent and fast disk-caching of output values: a make-like functionality for Python functions that works well for arbitrary Python objects, including very large numpy arrays. The goal is to separate operations into a set of steps with well-defined inputs and outputs that are saved and rerun only if necessary, by using standard Python functions:

    >>> from joblib import Memory
    >>> mem = Memory(cachedir='/tmp/joblib')
    >>> import numpy as np
    >>> a = np.vander(np.arange(3))
    >>> square = mem.cache(np.square)
    >>> b = square(a)
    ________________________________________________________________________________
    [Memory] Calling square...
    square(array([[0, 0, 1],
           [1, 1, 1],
           [4, 2, 1]]))
    ___________________________________________________________square - 0.0s, 0.0min
    
    >>> c = square(a)
    >>> # The above call did not trigger an evaluation

  2. Embarrassingly parallel helper: to make it easy to write readable parallel code and debug it quickly:

    >>> from joblib import Parallel, delayed
    >>> from math import sqrt
    >>> Parallel(n_jobs=1)(delayed(sqrt)(i**2) for i in range(10))
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
    
  3. Logging/tracing: The different functionalities will progressively acquire better logging mechanisms to help track what has been run and capture I/O easily. In addition, Joblib will provide a few I/O primitives to easily define logging and display streams, and provide a way of compiling a report. We want to be able to quickly inspect what has been run.
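The caching and parallel helpers listed above can also be applied to user-defined functions. The following is a minimal sketch: `slow_square` and `scaled` are hypothetical example functions, and the cache lives in a temporary directory. The cache directory is passed positionally because the keyword shown above (`cachedir`) belongs to this release and may be named differently in other Joblib versions.

```python
import tempfile

from joblib import Memory, Parallel, delayed

# Hypothetical scratch directory for the cache; any writable path works.
cache_dir = tempfile.mkdtemp()

# Directory passed positionally (assumption: the keyword name may differ
# between Joblib releases); verbose=0 silences the call trace.
mem = Memory(cache_dir, verbose=0)

calls = []  # records each real evaluation of the function


def slow_square(x):
    """Stand-in for an expensive step (hypothetical example function)."""
    calls.append(x)
    return x * x


cached_square = mem.cache(slow_square)

first = cached_square(3)   # evaluated, result stored on disk
second = cached_square(3)  # same input: result read back from the cache
third = cached_square(4)   # new input: evaluated again


def scaled(x, factor=1):
    """Hypothetical function with a keyword argument."""
    return x * factor


# delayed() captures the function and its arguments without calling it.
# n_jobs=1 runs the calls sequentially, which keeps debugging simple.
results = Parallel(n_jobs=1)(delayed(scaled)(i, factor=2) for i in range(5))
```

The `calls` list makes the lazy re-evaluation visible: only the first call for each distinct input reaches the function body.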

Contributing

The code is hosted on GitHub. It is easy to clone the project and experiment with making your own modifications. If you need extra features, don't hesitate to contribute them.

Download files

Source Distribution

joblib-0.4.3.dev.tar.gz (142.3 kB view details)

Uploaded Source

Built Distributions

joblib-0.4.3.dev-py2.6.egg (77.6 kB view details)

Uploaded Source

joblib-0.4.3.dev-py2.5.egg (77.3 kB view details)

Uploaded Source

File details

Details for the file joblib-0.4.3.dev.tar.gz.

File metadata

  • Download URL: joblib-0.4.3.dev.tar.gz
  • Upload date:
  • Size: 142.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for joblib-0.4.3.dev.tar.gz
Algorithm Hash digest
SHA256 bca063bababd0af34cdddb540cf16dc0d5ff8aba7ad88a1e7b4162da216a4a91
MD5 b170543f0a2512b991b55903469b2ff3
BLAKE2b-256 d0a0670002d1b7fcd3d1e6782f294968e06eadc091b21b003cfa2cf77214c987
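A downloaded archive can be checked against the SHA256 digest listed above using the standard library. This is a minimal sketch: the helper name `sha256_of` and the local path are hypothetical, and the check only runs if the file is present.

```python
import hashlib
import os


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest listed above for joblib-0.4.3.dev.tar.gz.
EXPECTED = "bca063bababd0af34cdddb540cf16dc0d5ff8aba7ad88a1e7b4162da216a4a91"

# Hypothetical local path; point it at wherever the archive was saved.
archive = "joblib-0.4.3.dev.tar.gz"

if os.path.exists(archive):
    print("match" if sha256_of(archive) == EXPECTED else "MISMATCH")
```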

File details

Details for the file joblib-0.4.3.dev-py2.6.egg.

File hashes

Hashes for joblib-0.4.3.dev-py2.6.egg
Algorithm Hash digest
SHA256 20e4809b44326549f00591d5242cf49d5e0bc8c7304764da93dc3fed0ffa5e04
MD5 5d916edf78f006d579fe7ac2ae8b4a92
BLAKE2b-256 96c30437450acfe294d1590523a19bc9e9c0e4853126634484772bcd770e408f

File details

Details for the file joblib-0.4.3.dev-py2.5.egg.

File hashes

Hashes for joblib-0.4.3.dev-py2.5.egg
Algorithm Hash digest
SHA256 ea199ed9ad8da491c79eff9b51ef235463765e5c6585b485fa308283c8b8256b
MD5 12f1ea77d54a5831919b71c5ea67f62c
BLAKE2b-256 51864a8a9a4cd151b49b5141ac969d204ce3c01b3683fef7fdc7108573efffa4
