PyTorch Elastic Training
TorchElastic
TorchElastic allows you to launch distributed PyTorch jobs in a fault-tolerant and elastic manner. For the latest documentation, please refer to our website.
Requirements
torchelastic requires:
- python3 (3.6+)
- torch
- etcd
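The etcd server must be reachable from every node. torchelastic's etcd rendezvous uses the v2 API, so recent etcd releases need it enabled explicitly. For local experimentation you can start a single-member dev server with the standard etcd flags (replace ETCD_HOST with a host the other nodes can reach):

etcd --enable-v2 --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://ETCD_HOST:2379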
Installation
pip install torchelastic
Quickstart
Fault-tolerant on 4 nodes, 8 trainers/node, total 4 * 8 = 32 trainers.
Run the following on all nodes.
python -m torchelastic.distributed.launch \
    --nnodes=4 \
    --nproc_per_node=8 \
    --rdzv_id=JOB_ID \
    --rdzv_backend=etcd \
    --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
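The launcher spawns --nproc_per_node copies of YOUR_TRAINING_SCRIPT.py on each node. A minimal sketch of such a script, assuming the launcher exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT to each worker (the model and training loop are placeholders, not part of torchelastic):

# Minimal sketch of YOUR_TRAINING_SCRIPT.py. Assumes the launcher exports
# RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT; the model and
# training loop below are placeholders.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # env:// init reads MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE from the env
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU hosts
    local_rank = int(os.environ["LOCAL_RANK"])
    # on GPU hosts: torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 1)  # placeholder model
    ddp_model = DDP(model)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for step in range(100):  # placeholder training loop
        optimizer.zero_grad()
        loss = ddp_model(torch.randn(32, 10)).sum()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()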
Elastic on 1 ~ 4 nodes, 8 trainers/node, total 8 ~ 32 trainers. The job starts as soon as 1 node is healthy; you may add up to 4 nodes.
python -m torchelastic.distributed.launch \
    --nnodes=1:4 \
    --nproc_per_node=8 \
    --rdzv_id=JOB_ID \
    --rdzv_backend=etcd \
    --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
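When membership changes (a node joins or fails), torchelastic tears down and restarts all workers, so elastic jobs should periodically checkpoint and resume on startup. A sketch of that pattern; the checkpoint path and helper names are illustrative, not a torchelastic API:

# Hypothetical checkpoint/resume pattern for elastic restarts; the path and
# the save/load helpers are illustrative, not part of torchelastic.
import os

import torch

CHECKPOINT_PATH = "/mnt/shared/checkpoint.pt"  # must be visible to all nodes


def save_checkpoint(step, model, optimizer):
    if int(os.environ["RANK"]) == 0:  # only one worker writes
        torch.save(
            {
                "step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
            },
            CHECKPOINT_PATH,
        )


def load_checkpoint(model, optimizer):
    if not os.path.exists(CHECKPOINT_PATH):
        return 0  # fresh start
    state = torch.load(CHECKPOINT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1  # resume after the last completed step

In the training loop, call load_checkpoint once at startup and save_checkpoint every N steps; after a restart the job picks up from the last saved step rather than from scratch.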
Contributing
We welcome PRs. See the CONTRIBUTING file.
License
torchelastic is BSD licensed, as found in the LICENSE file.