PyTorch Elastic Training
TorchElastic
TorchElastic allows you to launch distributed PyTorch jobs in a fault-tolerant and elastic manner. For the latest documentation, please refer to our website.
Requirements
torchelastic requires:
- python3 (3.8+)
- torch
- etcd
Installation
pip install torchelastic
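The etcd rendezvous backend used in the examples below needs a reachable etcd server. As a sketch (assuming an etcd v3.4+ binary; torchelastic's etcd rendezvous speaks the v2 API, which newer etcd releases disable unless turned on explicitly), a single-node development server could be started with:

etcd --enable-v2 \
    --listen-client-urls http://0.0.0.0:2379 \
    --advertise-client-urls http://ETCD_HOST:2379

ETCD_HOST:2379 would then be the value passed as --rdzv_endpoint.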
Quickstart
Fault-tolerant on 4 nodes, 8 trainers/node, total 4 * 8 = 32 trainers.
Run the following on all nodes.
python -m torchelastic.distributed.launch \
    --nnodes=4 \
    --nproc_per_node=8 \
    --rdzv_id=JOB_ID \
    --rdzv_backend=etcd \
    --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
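For each of the 8 workers it spawns per node, the launcher sets the standard torch.distributed environment variables (LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT), so init_process_group can pick them up via the default env:// method. A minimal sketch of what YOUR_TRAINING_SCRIPT.py might look like (the linear model and random data are placeholders):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Set per worker by the launcher.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # Reads MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE from the environment.
    dist.init_process_group(backend="nccl")

    model = DDP(torch.nn.Linear(10, 1).cuda(local_rank),
                device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(1000):
        inputs = torch.randn(32, 10).cuda(local_rank)
        loss = model(inputs).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()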
Elastic on 1 ~ 4 nodes, 8 trainers/node, total 8 ~ 32 trainers. The job starts as soon as 1 node is healthy; you may add up to 4 nodes.
python -m torchelastic.distributed.launch \
    --nnodes=1:4 \
    --nproc_per_node=8 \
    --rdzv_id=JOB_ID \
    --rdzv_backend=etcd \
    --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
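In elastic mode, a membership change (a node joining or failing) restarts the surviving workers, so the training script should checkpoint periodically and resume from the latest checkpoint on startup. A minimal sketch, assuming a path on shared storage (/shared/ckpt.pt is a placeholder):

import os
import torch

CKPT_PATH = "/shared/ckpt.pt"  # placeholder: must be visible to all nodes

def load_checkpoint(model, optimizer):
    """Resume from the last checkpoint if one exists; return the next step."""
    if os.path.exists(CKPT_PATH):
        state = torch.load(CKPT_PATH, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["step"]
    return 0

def save_checkpoint(model, optimizer, step):
    """Write a checkpoint from rank 0 only, to avoid concurrent writers."""
    if int(os.environ.get("RANK", "0")) == 0:
        tmp = CKPT_PATH + ".tmp"
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, tmp)
        os.replace(tmp, CKPT_PATH)  # atomic rename: no partial files on restart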
Contributing
We welcome PRs. See the CONTRIBUTING file.
License
torchelastic is BSD licensed, as found in the LICENSE file.
Download files
Download the file for your platform.
Source Distribution
torchelastic-0.2.1.tar.gz (64.4 kB)
Built Distributions
torchelastic-0.2.1-py3.8.egg (180.0 kB)
torchelastic-0.2.1-py3-none-any.whl
Hashes for torchelastic-0.2.1-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | d5cde4de50cfca3930bf952aaaee83b7e8425d3f1976b9f1df9626d9f4f7ae89
MD5 | e7c8174b8136dc6877d545938d255656
BLAKE2b-256 | c12b8d8b9227905c8aa7a8c06fc3191072345c0e74615af1d050b9e5adec3d88
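To have pip verify the download against the published SHA256 digest, hash-checking mode can be used with a pinned requirements file, e.g.:

# requirements.txt
torchelastic==0.2.1 \
    --hash=sha256:d5cde4de50cfca3930bf952aaaee83b7e8425d3f1976b9f1df9626d9f4f7ae89

pip install --require-hashes -r requirements.txt

Note that in hash-checking mode pip requires a hash for every requirement, including dependencies.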