adaptive-scheduler

Run many `adaptive.Learner`s on many cores (>10k) using `mpi4py.futures`, `ipyparallel`, or `dask-mpi`.

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
- Science/Research
License
- OSI Approved :: BSD License
Operating System
- OS Independent
Programming Language
Topic
- Scientific/Engineering
- System :: Distributed Computing

Project description

Run many adaptive.Learners on many cores (>10k) using mpi4py.futures, ipyparallel, or dask.distributed.

What is this?

adaptive-scheduler solves the following problem: you need to run more learners than a single runner can handle, and/or you have access to more than 1k cores.

ipyparallel and dask.distributed provide very powerful engines for interactive sessions. However, when you want to connect to >1k cores, they start to struggle. Besides that, on a shared cluster it is often hard to start an interactive session with enough free resources.

Our approach is to schedule a different job for each adaptive.Learner. The creation and running of these jobs are managed by adaptive-scheduler. This means that your calculation will run eventually, even if the cluster is fully occupied at the moment. Because of this approach, there is almost no limit to how many cores you can use. You can either use 10 nodes for 1 job (learner) or 1 core for 1 job (learner) while scheduling hundreds of jobs.

Everything is written such that the computation is maximally local. This means that if one of the jobs crashes, there is no problem: adaptive-scheduler will automatically schedule a new one and continue the calculation where it left off (because of Adaptive’s periodic saving functionality). Even if the central “job manager” dies, the jobs will continue to run (although no new jobs will be scheduled).
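The crash recovery described above rests on periodic saving: a restarted job loads the last checkpoint and continues from there. The following library-free sketch illustrates only that idea. `ToyLearner`, `run`, and the `crash_at` parameter are hypothetical stand-ins and not part of the Adaptive or adaptive-scheduler API.

```python
import os
import pickle
import tempfile


class ToyLearner:
    """Hypothetical stand-in for a learner: it only stores evaluated points."""

    def __init__(self, function):
        self.function = function
        self.data = {}

    @property
    def npoints(self):
        return len(self.data)

    def save(self, fname):
        # Write to a temporary file first and rename it afterwards, so a
        # crash during saving never leaves a half-written checkpoint.
        tmp = fname + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(self.data, f)
        os.replace(tmp, fname)

    def load(self, fname):
        # A missing file simply means that this is the first run.
        if os.path.exists(fname):
            with open(fname, "rb") as f:
                self.data = pickle.load(f)


def run(learner, fname, goal, save_every=10, crash_at=None):
    """Evaluate integer points until `goal(learner)` holds.

    `crash_at` (an assumption for this demo) simulates a node failure.
    """
    learner.load(fname)
    x = learner.npoints  # continue where the previous job stopped
    while not goal(learner):
        if crash_at is not None and learner.npoints == crash_at:
            raise RuntimeError("simulated node failure")
        learner.data[x] = learner.function(x)
        x += 1
        if learner.npoints % save_every == 0:
            learner.save(fname)
    learner.save(fname)


def square(x):
    return x * x


def goal(learner):
    return learner.npoints >= 50


fname = os.path.join(tempfile.mkdtemp(), "learner.pickle")

try:
    run(ToyLearner(square), fname, goal, crash_at=37)
except RuntimeError:
    pass  # the first "job" died after its last checkpoint

# A fresh "job" picks up the checkpoint and finishes the calculation.
resumed = ToyLearner(square)
resumed.load(fname)
points_after_crash = resumed.npoints
run(resumed, fname, goal)
```

In this toy run only the points computed since the last checkpoint are lost, which is the trade-off that a save interval controls.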

Design goals

Needs to be able to run efficiently on >30k cores
Works seamlessly with the Adaptive package
Minimal load on the file system
Removes all boilerplate of working with a scheduler:
    1. writes job scripts
    2. (re)submits job scripts
Handles random crashes (or node evictions) with minimal data loss
Preserves Python kernel and variables inside a job (in contrast to submitting jobs for every parameter)
Separates the simulation definition code from the code that runs the simulation
Maximizes computation locality, jobs continue to run when the main process dies
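The two boilerplate steps named in the design goals (writing and (re)submitting job scripts) can be pictured with a small sketch. It assumes a SLURM cluster and an MPI-based `run_learner.py`; the function names, the `.sbatch` file extension, and the exact script contents are illustrative assumptions, not the template that adaptive-scheduler itself generates.

```python
import os
import tempfile


def make_job_script(name, cores, run_script="run_learner.py"):
    """Return the text of a SLURM batch script for one learner job."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={name}",
        f"#SBATCH --ntasks={cores}",
        f"#SBATCH --output={name}-%j.out",  # %j expands to the job ID
        "",
        # mpi4py.futures needs to be started through its own entry point
        f"srun -n {cores} python -m mpi4py.futures {run_script}",
    ]
    return "\n".join(lines) + "\n"


def write_job_scripts(job_names, cores, directory):
    """Write one batch script per job name and return the file paths."""
    paths = []
    for name in job_names:
        path = os.path.join(directory, f"{name}.sbatch")
        with open(path, "w") as f:
            f.write(make_job_script(name, cores))
        paths.append(path)
    return paths


directory = tempfile.mkdtemp()
job_names = [f"test-job-{i}" for i in range(3)]
paths = write_job_scripts(job_names, cores=12, directory=directory)
script = make_job_script("test-job-0", cores=12)
# On a real cluster each script would then be submitted with `sbatch <path>`.
```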

How does it work?

You create a file where you define a bunch of learners and corresponding fnames such that they can be imported, like:

# learners_file.py
import adaptive
from functools import partial

def h(x, pow, a):
    return a * x**pow

combos = adaptive.utils.named_product(
    pow=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    a=[0.1, 0.5],
)  # returns list of dicts, cartesian product of all values

learners = [adaptive.Learner1D(partial(h, **combo),
            bounds=(-1, 1)) for combo in combos]
fnames = [f"data/{combo}" for combo in combos]
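To make the comment on `named_product` concrete, here is a standard-library sketch that produces the same kind of result (a list of dicts, one per combination of values). `named_product_sketch` is a name chosen for this illustration and is not claimed to be the library's own implementation.

```python
from itertools import product


def named_product_sketch(**items):
    """Cartesian product of keyword arguments, returned as a list of dicts."""
    names = list(items)
    return [dict(zip(names, values)) for values in product(*items.values())]


combos = named_product_sketch(pow=[0, 1, 2], a=[0.1, 0.5])
# One filename per combination, built the same way as in learners_file.py
fnames = [f"data/{combo}" for combo in combos]
```

With three values for `pow` and two for `a` this gives six combinations, so six learners and six filenames. Each filename embeds the dict's string representation, which keeps the parameters readable in the name.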

Then you start a process that creates and submits as many job scripts as there are learners, like:

import adaptive_scheduler

def goal(learner):
    return learner.npoints > 200

run_manager = adaptive_scheduler.server_support.RunManager(
    learners_file="learners_file.py",
    goal=goal,
    cores_per_job=12,  # every learner is one job
    log_interval=30,  #  write info such as npoints, cpu_usage, time, etc. to the job log file
    save_interval=300,  # save the data every 300 seconds
)
run_manager.start()

That’s it! You can run run_manager.info(), which displays an interactive ipywidget that shows the number of running, pending, and finished jobs, buttons to cancel your jobs, and other useful information.

But how does it really work?

The adaptive_scheduler.server_support.RunManager basically does what is written below. You need to create a learners_file.py that defines learners and fnames (like in the section above). Then a “job manager” writes and submits as many jobs as there are learners, but it doesn’t know which learner each job is going to run! This is the responsibility of the “database manager”, which keeps a database of job_id <--> learner.
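The job_id <--> learner bookkeeping can be sketched with a small in-memory class. This is a simplified assumption of what such a database has to do (hand out a free learner, mark it done, free learners whose job vanished); `ToyDatabase` and its method names are hypothetical and do not mirror the package's actual database format or messaging.

```python
class ToyDatabase:
    """Hypothetical in-memory version of the job_id <--> learner mapping."""

    def __init__(self, fnames):
        self.entries = [
            {"fname": fname, "job_id": None, "is_done": False} for fname in fnames
        ]

    def start(self, job_id):
        """Assign the first learner that is neither done nor taken to `job_id`."""
        for entry in self.entries:
            if entry["job_id"] is None and not entry["is_done"]:
                entry["job_id"] = job_id
                return entry["fname"]
        return None  # nothing left to run

    def stop(self, fname):
        """Mark the learner stored in `fname` as finished."""
        for entry in self.entries:
            if entry["fname"] == fname:
                entry["job_id"] = None
                entry["is_done"] = True

    def release_missing(self, running_job_ids):
        """Free learners whose job is no longer in the queue (e.g. it crashed)."""
        for entry in self.entries:
            if entry["job_id"] is not None and entry["job_id"] not in running_job_ids:
                entry["job_id"] = None


db = ToyDatabase(["data/a", "data/b"])
first = db.start("101")    # job 101 gets the first free learner
second = db.start("102")   # job 102 gets the next one
third = db.start("103")    # no free learner is left
db.release_missing({"102"})  # job 101 disappeared from the queue
retry = db.start("104")    # a new job takes over the freed learner
db.stop("data/b")          # job 102 reports that its learner reached the goal
```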

In another Python file (the file that is run on the nodes) we do something like:

# run_learner.py
import adaptive
from adaptive_scheduler import client_support
from mpi4py.futures import MPIPoolExecutor

# the file that defines the learners we created above
from learners_file import learners, fnames


if __name__ == "__main__":  # ← use this, see warning @ https://bit.ly/2HAk0GG
    # the address of the "database manager"
    url = "tcp://10.75.0.5:37371"

    # ask the database for a learner that we can run
    learner, fname = client_support.get_learner(url, learners, fnames)

    # load the data
    learner.load(fname)

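    # run until `some_goal` is reached with an `MPIPoolExecutor`
    # (`some_goal` is a placeholder; as an example, stop after 200 points)
    def some_goal(learner):
        return learner.npoints > 200

    # you can also use an ipyparallel.Client or a dask.distributed.Client
    runner = adaptive.Runner(
        learner, executor=MPIPoolExecutor(), shutdown_executor=True, goal=some_goal
    )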

    # periodically save the data (in case the job dies)
    runner.start_periodic_saving(dict(fname=fname), interval=600)

    # log progress info in the job output script, optional
    client_support.log_info(runner, interval=600)

    # block until runner goal reached
    runner.ioloop.run_until_complete(runner.task)

    # tell the database that this learner has reached its goal
    client_support.tell_done(url, fname)

In a Jupyter notebook we can start the “job manager” and the “database manager” like:

from adaptive_scheduler import server_support
from learners_file import learners, fnames

# create a new database
db_fname = "running.json"
server_support.create_empty_db(db_fname, fnames)

# create unique names for the jobs
n_jobs = len(learners)
job_names = [f"test-job-{i}" for i in range(n_jobs)]

# start the "job manager" and the "database manager"
database_task = server_support.start_database_manager("tcp://10.75.0.5:37371", db_fname)

job_task = server_support.start_job_manager(
    job_names,
    db_fname,
    cores=200,  # number of cores per job
    run_script="run_learner.py",
)
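What a job manager has to decide on each polling round can be sketched as a pure function: given the job names, the names currently in the queue, and the number of unfinished learners, which jobs should be (re)submitted? The function name and its arguments are assumptions for this illustration. The real job manager additionally queries the scheduler and the database.

```python
def jobs_to_submit(job_names, queued_names, n_unfinished):
    """Return the job names to (re)submit.

    The aim is to have one queued or running job per unfinished learner.
    """
    n_missing = max(0, n_unfinished - len(queued_names))
    idle = [name for name in job_names if name not in queued_names]
    return idle[:n_missing]


job_names = [f"test-job-{i}" for i in range(4)]

# Two jobs are in the queue and three learners are unfinished,
# so exactly one more job is needed.
to_submit = jobs_to_submit(job_names, {"test-job-0", "test-job-2"}, n_unfinished=3)

# Once all learners are done, nothing is submitted anymore.
nothing = jobs_to_submit(job_names, set(), n_unfinished=0)
```

Because the decision depends only on the current queue and database state, a crashed job needs no special handling: it shows up as a missing job in the next round and is submitted again.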

So in summary, you have three files:

learners_file.py which defines the learners and their filenames
run_learner.py which picks a learner and runs it
a Jupyter notebook where you run the “database manager” and the “job manager”

You don’t actually ever have to leave the Jupyter notebook; take a look at the example notebook.

Jupyter notebook example

See example.ipynb.

Installation

WARNING: This is still in the pre-alpha development stage.

Install the latest stable version from conda with (recommended)

conda install adaptive-scheduler

or from PyPI with

pip install adaptive_scheduler

or install master with

pip install -U https://github.com/basnijholt/adaptive-scheduler/archive/master.zip

or clone the repository and do a dev install (recommended for dev)

git clone git@github.com:basnijholt/adaptive-scheduler.git
cd adaptive-scheduler
pip install -e .

Development

In order not to pollute the history with the output of the notebooks, please set up the git filter by executing

python ipynb_filter.py

in the repository.

We also use pre-commit, so pip install pre_commit and run

pre-commit install

in the repository.

Limitations

Right now adaptive_scheduler only works with SLURM and PBS; however, only the functions in adaptive_scheduler/slurm.py would have to be implemented for another type of scheduler. Also, there are no tests at all!

Download files

Source Distribution

adaptive_scheduler-0.3.0.tar.gz (32.8 kB)

Uploaded Jun 29, 2019 Source

Built Distribution

adaptive_scheduler-0.3.0-py3-none-any.whl (33.5 kB)

Uploaded Jun 29, 2019 Python 3

File details

Details for the file adaptive_scheduler-0.3.0.tar.gz.

File metadata

Download URL: adaptive_scheduler-0.3.0.tar.gz
Upload date: Jun 29, 2019
Size: 32.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.0.0 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.7.3

File hashes

Hashes for adaptive_scheduler-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`0708a9f7f21a946cb97a8d7328cd520f278401bf08679a838bb62eaf572e846b`
MD5	`1b2423b6d631bf09b478832e12932bfb`
BLAKE2b-256	`66c25c06f71132869339881bf7e1e442ac69791c80425d125a2110d46ae41632`

File details

Details for the file adaptive_scheduler-0.3.0-py3-none-any.whl.

File metadata

Download URL: adaptive_scheduler-0.3.0-py3-none-any.whl
Upload date: Jun 29, 2019
Size: 33.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.0.0 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.7.3

File hashes

Hashes for adaptive_scheduler-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`716f448ba616e5566cd5c8843d77ed0674c6e59d6547110d2fb1e1ef7e778b80`
MD5	`67e6e8037531acb57778607005b7fda2`
BLAKE2b-256	`7a61550ca90e4750accb90a0d16244a9f7c3c03c9406eaa347a3c9c45778f55c`