A pure-Python, highly distributed MapReduce cluster.
Having just finished reading the original Google MapReduce paper, I obviously felt the need to try to implement such a system in Python.
My goal is to implement enough of the functionality described in the paper to be usable, though I strongly warn against ever using this code for anything real.
Since one of the goals (see Goals, below) is simplicity from an end-user standpoint, I am following some of Kenneth Reitz's advice and starting with a README and documentation.
Examples
The canonical word-count example:
# myjob.py
from pluribus import job

@job.map_
def emit_words(key, value):
    # key: document name
    # value: document contents
    for word in value.split():
        yield word, 1

@job.reduce_
def sum_occurences(key, values):
    # key: a word
    # values: a list of counts
    return sum(values)
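To make the dataflow concrete, here is a minimal local simulation of the map/shuffle/reduce cycle that pluribus distributes across workers. This is purely illustrative: run_local is a made-up helper, not part of the pluribus API, and the two functions are undecorated copies of the ones above.

from collections import defaultdict

def run_local(documents, map_fn, reduce_fn):
    # Simulate map -> shuffle -> reduce over a dict of {name: contents}.
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):  # map: emit (word, 1) pairs
            grouped[key].append(value)             # shuffle: group values by key
    # reduce: one call per distinct key
    return dict((key, reduce_fn(key, values)) for key, values in grouped.items())

def emit_words(key, value):
    for word in value.split():
        yield word, 1

def sum_occurences(key, values):
    return sum(values)

docs = {"a.txt": "the quick brown fox", "b.txt": "the lazy dog"}
print(run_local(docs, emit_words, sum_occurences))
# -> {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1} (order may vary)

A real run farms the map calls out to workers and partitions the grouped keys among reducers; the local version only makes the semantics explicit.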
Assuming you’re running everything on one host, you can ignore the network connection information.
Start a pluribus master:
$ pluribus master
Start a pluribus worker (or several hundred):
$ pluribus worker
On the master or on another machine that can talk to the master:
$ pluribus job myjob
# ... wait
<results>
Goals
Explicit goals are:
Simple to use, both as an administrator and end-user.
Well-documented.
Robust to worker failure (see the sketch after these lists).
Fast enough.
Use only the Python (2.7+) standard library (at least to run).
Explicit non-goals are:
Be a filesystem.
Robust to master failure.
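For reference on the worker-failure goal above: in the original paper the master achieves this by re-execution. It pings workers periodically and reschedules the tasks of any worker that stops responding (even completed map tasks must be re-run, since their output lives on the dead worker's local disk). A minimal sketch of that bookkeeping, with entirely hypothetical names rather than pluribus's actual internals:

import time

HEARTBEAT_TIMEOUT = 10.0  # seconds of silence before a worker is presumed dead

class WorkerState(object):
    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.last_heartbeat = time.time()  # updated on every ping response
        self.assigned_tasks = []           # tasks currently placed on this worker

def reap_dead_workers(workers, pending_tasks):
    # workers: dict of worker_id -> WorkerState (hypothetical master-side state)
    # pending_tasks: the master's queue of unscheduled tasks
    now = time.time()
    for worker_id, state in list(workers.items()):
        if now - state.last_heartbeat > HEARTBEAT_TIMEOUT:
            pending_tasks.extend(state.assigned_tasks)  # reschedule elsewhere
            del workers[worker_id]

Because any task can be re-run on any live worker, a loop like this is all the master needs to survive worker churn; master failure (a non-goal above) would require checkpointing this state somewhere durable.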