pympipool - scale python functions over multiple compute nodes
Up-scaling Python functions for high performance computing (HPC) can be challenging. While the Python standard library provides interfaces for multiprocessing and asynchronous task execution, namely multiprocessing and concurrent.futures, both are limited to execution on a single compute node. A series of Python libraries has therefore been developed to address the up-scaling of Python functions for HPC, starting in the data-science and machine-learning community with solutions like dask, over more HPC-focused solutions like parsl, up to the Python bindings for the message passing interface (MPI) named mpi4py. Each of these solutions has its advantages and disadvantages; in particular, mixing MPI-parallel Python functions and serial Python functions in combined workflows remains challenging.
To address these challenges, pympipool is developed with three goals in mind:
- Reimplement the standard Python library interfaces, namely multiprocessing.pool.Pool and concurrent.futures.Executor, as closely as possible, to minimize the barrier of up-scaling an existing workflow to HPC resources.
- Integrate MPI-parallel Python functions based on mpi4py on the same level as serial Python functions, so both can be combined in a single workflow. This allows users to parallelize their workflows one function at a time. Internally this is achieved by coupling a serial Python process to an MPI-parallel Python process.
- Embrace Jupyter notebooks for the interactive development of HPC workflows, as they allow users to document their thought process right next to the Python code and the results, all within one document.
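The two standard-library interfaces named in the first goal can be sketched as follows. This is a minimal illustration using only the standard library (thread-based variants are used here for portability; the process-based classes expose the same methods), not pympipool itself — the point is that pympipool aims to reproduce exactly these call patterns:

```python
from concurrent.futures import ThreadPoolExecutor
from multiprocessing.pool import ThreadPool

def square(x):
    return x * x

# map()/starmap()-style interface (multiprocessing.pool.Pool)
with ThreadPool(processes=2) as pool:
    results = pool.map(square, [1, 2, 3, 4])   # [1, 4, 9, 16]
    pairs = pool.starmap(pow, [(2, 3), (3, 2)])  # [8, 9]

# submit()-style interface (concurrent.futures.Executor)
with ThreadPoolExecutor(max_workers=2) as exe:
    future = exe.submit(square, 5)
    value = future.result()  # 25
```

A workflow written against these interfaces should, by the stated design goal, need little more than an import change to run on HPC resources.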
Features
As different users and different workflows have different requirements in terms of the level of parallelization, pympipool implements a series of five different interfaces:
- pympipool.Pool: Following multiprocessing.pool.Pool, the pympipool.Pool class implements the map() and starmap() functions. Internally these connect to an MPI-parallel subprocess running the mpi4py.futures.MPIPoolExecutor. So by increasing the number of workers via the max_workers parameter, pympipool.Pool can scale the execution of serial Python functions beyond a single compute node. For MPI-parallel Python functions, pympipool.MPISpawnPool is derived from pympipool.Pool and uses MPI_Spawn() to execute those; for more details see below.
- pympipool.Executor: The easiest way to execute MPI-parallel Python functions right next to serial Python functions is pympipool.Executor. It implements the executor interface defined by concurrent.futures.Executor: functions are submitted to pympipool.Executor using the submit() function, which returns a concurrent.futures.Future object. With these Future objects, asynchronous workflows can be constructed which periodically check whether the computation is completed with done() and then query the results using the result() function. The limitation of pympipool.Executor is its lack of load balancing: each pympipool.Executor acts as a serial first-in-first-out (FIFO) queue, so it is the task of the user to balance the load of many different tasks over multiple pympipool.Executor instances.
- pympipool.PoolExecutor: To combine the functionality of pympipool.Pool and pympipool.Executor, pympipool.PoolExecutor again connects to the mpi4py.futures.MPIPoolExecutor. In contrast to pympipool.Pool, however, it does not implement the map() and starmap() functions but rather the submit() function based on the concurrent.futures.Executor interface. In this case the load balancing happens internally, and the maximum number of workers, max_workers, defines the maximum number of parallel tasks. But only serial Python tasks can be executed, in contrast to pympipool.Executor, which can also execute MPI-parallel Python tasks.
- pympipool.MPISpawnPool: An alternative way to support MPI-parallel functions, in addition to pympipool.Executor, is pympipool.MPISpawnPool. Just like pympipool.Pool, it supports the map() and starmap() functions. The additional ranks_per_task parameter defines how many MPI ranks are used per task; all functions are executed with the same number of MPI ranks. The limitation of this approach is that it uses MPI_Spawn() to create new MPI ranks for the execution of the individual tasks. Consequently, this approach is not as scalable as pympipool.Executor, but it offers load balancing for a large number of similar MPI-parallel tasks.
- pympipool.SocketInterface: The key functionality of the pympipool package is the coupling of a serial Python process with an MPI-parallel Python process. This happens in the background using a combination of the zero message queue (ZeroMQ) and cloudpickle to communicate binary Python objects. pympipool.SocketInterface is an abstraction of this interface, which is used in the other classes inside pympipool and might also be helpful for other projects.
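The asynchronous submit()/done()/result() workflow described for pympipool.Executor is the standard concurrent.futures pattern. A minimal sketch using the standard-library executor (pympipool.Executor implements the same interface, so under the stated design only the constructor would differ):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_add(a, b):
    time.sleep(0.05)  # stand-in for an expensive computation
    return a + b

with ThreadPoolExecutor(max_workers=2) as exe:
    # submit() returns concurrent.futures.Future objects immediately
    futures = [exe.submit(slow_add, i, i) for i in range(4)]
    # periodically check done() until all computations have finished
    while not all(f.done() for f in futures):
        time.sleep(0.01)
    totals = [f.result() for f in futures]  # [0, 2, 4, 6]
```

In practice, concurrent.futures.wait() or as_completed() replaces the manual polling loop; the loop is spelled out here only to mirror the done()/result() description above.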
In addition to using MPI to start a number of processes on different HPC computing resources, pympipool also supports the flux-framework as an additional backend. By setting the optional enable_flux_backend parameter to True, the flux backend can be enabled for pympipool.Pool, pympipool.Executor and pympipool.PoolExecutor. Other optional parameters include cwd, which selects the working directory where the Python function is executed, and oversubscribe, an OpenMPI-specific feature to oversubscribe MPI tasks, which can be enabled by setting it to True. For more details on the pympipool classes and their application, the extended documentation is linked below.
Documentation
License
pympipool is released under the BSD license (https://github.com/pyiron/pympipool/blob/main/LICENSE). It is a spin-off of the pyiron project (https://github.com/pyiron/pyiron); therefore, if you use pympipool for calculations which result in a scientific publication, please cite:
@article{pyiron-paper,
title = {pyiron: An integrated development environment for computational materials science},
journal = {Computational Materials Science},
volume = {163},
pages = {24--36},
year = {2019},
issn = {0927-0256},
doi = {10.1016/j.commatsci.2018.07.043},
url = {http://www.sciencedirect.com/science/article/pii/S0927025618304786},
author = {Jan Janssen and Sudarsan Surendralal and Yury Lysogorskiy and Mira Todorova and Tilmann Hickel and Ralf Drautz and Jörg Neugebauer},
keywords = {Modelling workflow, Integrated development environment, Complex simulation protocols},
}