NVIDIA Resiliency Package
Nvidia Resiliency Extension
This project combines multiple resiliency-related solutions.
- Fault Tolerance package
- Straggler Detection package
- PyTorch Lightning callbacks
Installation:
From sources
git clone --recursive <this repo URL>
cd <repo>
pip install .
Requirements:
- Python >= 3.10
- gcc >= 8.0
- CUDA >= 11.8
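From PyPI
Prebuilt wheels for Python 3.10-3.12 are also published (see the file details below), so installing directly from PyPI should work as well; the project name here is inferred from the wheel file names:
pip install nvidia-resiliency-ext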
Fault Tolerance integration guide
This section describes Fault Tolerance callback integration with a PTL-based workload (e.g. NeMo).
Let's define some terms used in this section:
- PTL is PyTorch Lightning.
- FT is the fault_tolerance package, included in nvidia_resiliency_ext.
- FT callback, FaultToleranceCallback, is a PTL callback defined in the ptl_resiliency package, included in nvidia_resiliency_ext.
- ft_launcher is a launcher tool included in the FT, which is based on torchrun.
- heartbeat is a lightweight message sent from a rank to its rank monitor that indicates that the rank is alive.
- rank monitor is a special side-process started by ft_launcher that monitors heartbeats from its rank.
- timeouts are time intervals used by a rank monitor to detect that a rank is not alive. There are two separate timeouts: one for the initial heartbeat and one for subsequent heartbeats.
- launcher script is a bash script that invokes ft_launcher.
0. Use ft_launcher to start the workload
ft_launcher is similar to torchrun, but it starts a rank monitor for each started rank.
ft_launcher takes the FT configuration in a YAML file (--fault-tol-cfg-path) or via CLI args (--ft-param-...).
FT configuration items are described in the FaultToleranceConfig docstring.
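For example, a workload could be started as follows. This is an illustrative sketch only: the torchrun-style arguments (--nnodes, --nproc-per-node), the config file name fault_tolerance.yaml, and the script name train.py are assumptions, not part of this package's documented interface.
# Start 8 ranks on a single node; ft_launcher also starts a rank monitor for each rank.
ft_launcher --nnodes=1 --nproc-per-node=8 \
    --fault-tol-cfg-path=fault_tolerance.yaml \
    train.py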
1. Add FT callback to the trainer
Add FT callback to PTL callbacks.
import pytorch_lightning as pl
# FaultToleranceCallback is defined in the ptl_resiliency package shipped with nvidia_resiliency_ext.
from nvidia_resiliency_ext.ptl_resiliency import FaultToleranceCallback

fault_tol_cb = FaultToleranceCallback(
    autoresume=True,
    calculate_timeouts=True,
    logger_name="test_logger",
    exp_dir=tmp_path,  # directory where the FT state is saved
)
trainer = pl.Trainer(
    ...
    callbacks=[..., fault_tol_cb],
)
Core FT callback functionality is:
- Establishing a connection with a rank monitor
- Sending heartbeats during training and evaluation steps
- Disconnecting from a rank monitor
Optionally, it can also:
- Compute timeouts that will be used instead of timeouts defined in the FT config
- Create a flag file when the training is completed
FT callback initialization params:
def __init__(
    self,
    autoresume: bool,
    calculate_timeouts: bool,
    simulated_fault_params: Optional[Any] = None,
    exp_dir: Union[str, pathlib.Path, None] = None,
    logger_name: Optional[str] = "nemo_logger.FaultToleranceCallback",
):
    """
    Initialize callback instance.

    This is a lightweight initialization. Most of the initialization is conducted in the 'setup' hook.

    Args:
        autoresume (bool): Set to `True` if the FT auto-resume feature is used (e.g., there are multiple training jobs to be run).
        calculate_timeouts (bool): Set to `True` if FT timeouts should be calculated based on observed heartbeat intervals.
            Calculated timeouts overwrite the timeouts from the FT config.
            Timeouts are computed at the end of a training job, if there was checkpoint loading and saving.
            For example, for training started from scratch, the timeouts are computed at the end of the second job.
        simulated_fault_params (Optional[Any], optional): Simulated fault spec. It's for debugging only. Defaults to None.
        exp_dir (Union[str, pathlib.Path, None], optional): Directory where the FT state should be saved.
            Must be available for all training jobs. NOTE: Beware that PTL/NeMo can move files written directly to `trainer.log_dir`.
            Defaults to None, in which case it defaults to `trainer.log_dir/ft_state/`.
        logger_name (Optional[str], optional): Logger name to be used.
            Defaults to "nemo_logger.FaultToleranceCallback".
    """
2. Implementing auto-resume
Auto-resume is a feature that simplifies running a training consisting of multiple subsequent training jobs.
NOTE: Auto-resume is not a part of the FT package. It is entirely implemented in a launcher script and the FaultToleranceCallback.
FaultToleranceCallback exposes an "interface" that allows implementing an auto-resume launcher script.
Specifically, if autoresume=True, the FT callback creates a special marker file when the training is completed.
The marker file location is expected to be set in the FAULT_TOL_FINISHED_FLAG_FILE environment variable.
The following mechanism can be used to implement an auto-resuming launcher script:
- The launcher script starts ranks with ft_launcher. FAULT_TOL_FINISHED_FLAG_FILE should be passed to the rank processes.
- When ft_launcher exits, the launcher script checks if the FAULT_TOL_FINISHED_FLAG_FILE file was created.
- If FAULT_TOL_FINISHED_FLAG_FILE exists, the auto-resume loop can be broken, as the training is completed.
- If FAULT_TOL_FINISHED_FLAG_FILE does not exist, a continuation job can be issued (other conditions can be checked, e.g. whether the maximum number of failures has been reached).
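A minimal bash sketch of such an auto-resume loop is shown below. The ft_launcher arguments, the flag file path, the maximum-failures check, and train.py are illustrative assumptions, not something prescribed by the FT package:
#!/bin/bash
# Hypothetical auto-resume launcher script (illustrative sketch).
export FAULT_TOL_FINISHED_FLAG_FILE=/shared/exp/ft_finished.flag
MAX_FAILURES=3

for attempt in $(seq 1 "$MAX_FAILURES"); do
    # The environment (including FAULT_TOL_FINISHED_FLAG_FILE) is inherited by the rank processes.
    ft_launcher --nnodes=1 --nproc-per-node=8 \
        --fault-tol-cfg-path=fault_tolerance.yaml train.py
    if [ -f "$FAULT_TOL_FINISHED_FLAG_FILE" ]; then
        echo "Training completed; stopping the auto-resume loop."
        break
    fi
    echo "Training not finished (attempt $attempt/$MAX_FAILURES); issuing a continuation job."
done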
Straggler Detection integration guide
Include ptl_resiliency.StragglerDetectionCallback in the PTL trainer callbacks.
import pytorch_lightning as pl
# StragglerDetectionCallback is defined in the ptl_resiliency package shipped with nvidia_resiliency_ext.
from nvidia_resiliency_ext.ptl_resiliency import StragglerDetectionCallback

straggler_cb_args = dict(
    report_time_interval=300.0,
    calc_relative_gpu_perf=True,
    calc_individual_gpu_perf=True,
    num_gpu_perf_scores_to_log=3,
    gpu_relative_perf_threshold=0.7,
    gpu_individual_perf_threshold=0.7,
    stop_if_detected=False,
    logger_name="test_logger",
)
straggler_det_cb = StragglerDetectionCallback(**straggler_cb_args)
trainer = pl.Trainer(
    ...
    callbacks=[..., straggler_det_cb],
)
StragglerDetectionCallback initialization params:
def __init__(
    self,
    report_time_interval: float,
    calc_relative_gpu_perf: bool,
    calc_individual_gpu_perf: bool,
    num_gpu_perf_scores_to_log: int,
    gpu_relative_perf_threshold: float,
    gpu_individual_perf_threshold: float,
    stop_if_detected: bool,
    logger_name: Optional[str] = "nemo_logger.StragglerDetectionCallback",
):
    """
    Initialize straggler detection callback instance.

    Args:
        report_time_interval (float): Interval [seconds] of the straggler check
        calc_relative_gpu_perf (bool): Calculate relative GPU performance
        calc_individual_gpu_perf (bool): Calculate individual GPU performance
        num_gpu_perf_scores_to_log (int): How many best and worst scores to log (0 - does not log periodically, but only if stragglers are detected)
        gpu_relative_perf_threshold (float): Threshold for relative GPU performance scores
        gpu_individual_perf_threshold (float): Threshold for individual GPU performance scores
        stop_if_detected (bool): Set to True to terminate the workload if stragglers are detected
        logger_name (Optional[str], optional): Defaults to "nemo_logger.StragglerDetectionCallback".

    Raises:
        ValueError: If invalid config was provided.
    """
More info on straggler detection can be found in the straggler package's README.
File details
Details for the file nvidia_resiliency_ext-0.1.3-cp312-cp312-manylinux_2_31_x86_64.whl.
File metadata
- Download URL: nvidia_resiliency_ext-0.1.3-cp312-cp312-manylinux_2_31_x86_64.whl
- Upload date:
- Size: 3.4 MB
- Tags: CPython 3.12, manylinux: glibc 2.31+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.12
File hashes
Algorithm | Hash digest
---|---
SHA256 | 332997f4a9237d137a0b74be5b18e87923d2afbed86b0382334c9dba36db2652
MD5 | b33ed4b9478acc07f8ec62456305516b
BLAKE2b-256 | 60064517300290520936391abd7ebbb59a7e65d047d5c8cfd2db14adf95aeff3
File details
Details for the file nvidia_resiliency_ext-0.1.3-cp311-cp311-manylinux_2_31_x86_64.whl.
File metadata
- Download URL: nvidia_resiliency_ext-0.1.3-cp311-cp311-manylinux_2_31_x86_64.whl
- Upload date:
- Size: 3.4 MB
- Tags: CPython 3.11, manylinux: glibc 2.31+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.12
File hashes
Algorithm | Hash digest
---|---
SHA256 | 5b07eb0a65096677bfe9a71162808a5f4a41a7145e9fa57fb93955ed22f24218
MD5 | 246e210b75c87c9984473b528cf92948
BLAKE2b-256 | 9270c3d7f91929ff76e9a95809fc28be81fbe62ff049d1277d1d9e671948a66a
File details
Details for the file nvidia_resiliency_ext-0.1.3-cp310-cp310-manylinux_2_31_x86_64.whl.
File metadata
- Download URL: nvidia_resiliency_ext-0.1.3-cp310-cp310-manylinux_2_31_x86_64.whl
- Upload date:
- Size: 3.4 MB
- Tags: CPython 3.10, manylinux: glibc 2.31+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.12
File hashes
Algorithm | Hash digest
---|---
SHA256 | c4c3d963f66f3ae20de5860e16204439e69b096466fb9f36d75a9bc61fd7c328
MD5 | d21dd5789c80de0d44764347e5f355e5
BLAKE2b-256 | ed1125854e1c68940b281532f1016eb737102e1d540f8a1d84307bca00baa497