InstructLab Training Library
- Installing
- Using the library
- Learning about the training arguments
- Learning about TorchrunArgs arguments
- Example training run with arguments
To simplify the process of fine-tuning models with the LAB method, this library provides a simple training interface.
Installing the library
To get started, install the library via pip:
pip install instructlab-training
For development, clone this repository and install the library in editable mode:
pip install -e ./training
Additional NVIDIA packages
This library uses the flash-attn package, as well as other packages that rely on NVIDIA-specific CUDA tooling being installed.
If you are using NVIDIA hardware with CUDA, you need to install the following additional dependencies.
Basic install
pip install .[cuda]
Editable install (development)
pip install -e .[cuda]
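Because flash-attn depends on NVIDIA CUDA tooling, it can help to check whether it is importable before deciding on the disable_flash_attn setting described in the TrainingArgs table below. The following is a minimal sketch using only the Python standard library; the helper name flash_attn_available is hypothetical and not part of this library.

```python
import importlib.util


def flash_attn_available() -> bool:
    """Return True if the flash_attn module can be found in this environment.

    Hypothetical helper: it only checks that the package is installed, not that
    a compatible NVIDIA GPU or CUDA runtime is present.
    """
    return importlib.util.find_spec("flash_attn") is not None


# Fall back to disabling flash attention when the package is missing.
disable_flash_attn = not flash_attn_available()
print(f"disable_flash_attn={disable_flash_attn}")
```

The resulting boolean could then be passed as the disable_flash_attn field of TrainingArgs.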
Using the library
You can use this training library by importing the necessary items:
from instructlab.training import (
    run_training,
    TorchrunArgs,
    TrainingArgs,
    DeepSpeedOptions,
)
You can then define various training arguments, which serve as the parameters for your training runs. See the sections below on the training arguments and on an example training run.
Learning about training arguments
The TrainingArgs class provides most of the customization options for training jobs. You can specify a number of options, such as setting DeepSpeed config values or running a LoRA training job instead of a full fine-tune.
TrainingArgs
Field | Description |
---|---|
model_path | Either a reference to a HuggingFace repo or a path to a model saved in the HuggingFace format. |
data_path | A path to the .jsonl training dataset. This is expected to be in the messages format. |
ckpt_output_dir | Directory where trained model checkpoints will be saved. |
data_output_dir | Directory where the processed training data is stored (after filtering, tokenization, and masking). |
max_seq_len | The maximum sequence length to be included in the training set. Samples exceeding this length will be dropped. |
max_batch_len | Maximum tokens per GPU for each batch that will be handled in a single step. Used as part of the multipack calculation. If you run into out-of-memory errors, try lowering this value, but not below max_seq_len. |
num_epochs | Number of epochs to run through before stopping. |
effective_batch_size | The number of samples in a batch to see before updating the model parameters. |
save_samples | Number of samples the model should see before saving a checkpoint. Consider this to be the checkpoint save frequency. |
learning_rate | How fast we optimize the weights during gradient descent. Higher values may lead to unstable learning performance. It's generally recommended to have a low learning rate with a high effective batch size. |
warmup_steps | The number of steps a model should go through before reaching the full learning rate. We start at 0 and linearly climb up to learning_rate. |
is_padding_free | Boolean value to indicate whether or not we're training a padding-free transformer model such as Granite. |
random_seed | The random seed PyTorch will use. |
mock_data | Whether or not to use mock, randomly generated data during training. For debug purposes. |
mock_data_len | Max length of a single mock data sample. Equivalent to max_seq_len but for mock data. |
deepspeed_options | Config options to specify for the DeepSpeed optimizer. |
lora | Options to specify if you intend to perform a LoRA train instead of a full fine-tune. |
chat_tmpl_path | Specifies the chat template / special tokens for training. |
checkpoint_at_epoch | Whether or not we should save a checkpoint at the end of each epoch. |
fsdp_options | The settings for controlling FSDP when it's selected as the distributed backend. |
distributed_backend | Specifies which distributed training backend to use. Supported options are "fsdp" and "deepspeed". |
disable_flash_attn | Disables flash attention when set to true. This allows for training on older devices. |
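To make the interaction between max_seq_len and max_batch_len concrete, here is a minimal sketch that drops samples longer than max_seq_len and then greedily groups the remaining samples into batches whose total token count stays within max_batch_len. It assumes token lengths have already been computed, and it uses simple first-fit-decreasing packing as a stand-in; this is not the library's actual multipack implementation, and the helper names are hypothetical.

```python
from typing import List


def filter_by_length(lengths: List[int], max_seq_len: int) -> List[int]:
    """Drop samples whose token length exceeds max_seq_len."""
    return [n for n in lengths if n <= max_seq_len]


def pack_first_fit_decreasing(lengths: List[int], max_batch_len: int) -> List[List[int]]:
    """Group sample lengths into batches whose token sum is <= max_batch_len.

    First-fit decreasing: place each sample (longest first) into the first
    batch with enough remaining room, opening a new batch if none fits.
    max_batch_len must be >= max_seq_len so every kept sample fits somewhere.
    """
    batches: List[List[int]] = []
    totals: List[int] = []
    for n in sorted(lengths, reverse=True):
        for i, total in enumerate(totals):
            if total + n <= max_batch_len:
                batches[i].append(n)
                totals[i] += n
                break
        else:
            batches.append([n])
            totals.append(n)
    return batches


# Toy token lengths; the 5000-token sample exceeds max_seq_len and is dropped.
lengths = [1200, 5000, 3000, 800, 2500, 400]
kept = filter_by_length(lengths, max_seq_len=4096)
batches = pack_first_fit_decreasing(kept, max_batch_len=4096)
print(kept)     # [1200, 3000, 800, 2500, 400]
print(batches)  # [[3000, 800], [2500, 1200], [400]]
```

Lowering max_batch_len produces more, smaller batches, which is why it helps with out-of-memory errors; it cannot go below max_seq_len without leaving some kept samples unplaceable.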
DeepSpeedOptions
This library currently supports only a few options in DeepSpeedOptions. The default is to run with DeepSpeed, so these options currently only allow you to customize aspects of the ZeRO stage 2 optimizer.
Field | Description |
---|---|
cpu_offload_optimizer | Whether or not to do CPU offloading in DeepSpeed stage 2. |
cpu_offload_optimizer_ratio | A floating-point value between 0 and 1 that specifies the ratio of parameter updates (i.e., the optimizer step) performed on the CPU side. |
cpu_offload_optimizer_pin_memory | If true, offload to page-locked CPU memory. This could boost throughput at the cost of extra memory overhead. |
save_samples | The number of samples to see before saving a DeepSpeed checkpoint. |
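As a rough illustration of cpu_offload_optimizer_ratio, the sketch below splits a model's parameter count between CPU-side and GPU-side optimizer updates for a given ratio. It is plain arithmetic under the assumption that the ratio applies uniformly to all parameters; the helper name split_optimizer_updates is hypothetical, and DeepSpeed's actual partitioning may differ.

```python
def split_optimizer_updates(num_params: int, ratio: float) -> tuple[int, int]:
    """Return (cpu_params, gpu_params) for a CPU offload ratio in [0, 1].

    Hypothetical helper: assumes the ratio applies uniformly across parameters.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("cpu_offload_optimizer_ratio must be between 0 and 1")
    cpu_params = round(num_params * ratio)
    return cpu_params, num_params - cpu_params


cpu, gpu = split_optimizer_updates(7_000_000_000, ratio=0.25)
print(f"CPU updates: {cpu:,}  GPU updates: {gpu:,}")
# CPU updates: 1,750,000,000  GPU updates: 5,250,000,000
```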
For more information about DeepSpeed, see deepspeed.ai.
FSDPOptions
As with DeepSpeed, we only expose a limited number of FSDP parameters for you to modify. They are listed below:
Field | Description |
---|---|
cpu_offload_params | When set to true, offload parameters from the accelerator onto the CPU. This is an all-or-nothing option. |
sharding_strategy | Specifies the model sharding strategy that FSDP should use. Valid options are: FULL_SHARD (ZeRO-3), HYBRID_SHARD (ZeRO-3*), SHARD_GRAD_OP (ZeRO-2), and NO_SHARD. |
[!NOTE] For sharding_strategy, only SHARD_GRAD_OP has been extensively tested and is actively supported by this library.
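To give a feel for the trade-offs between sharding strategies, the sketch below estimates per-GPU memory for model states (parameters, gradients, and optimizer states). It uses the common rule of thumb from the ZeRO paper of 16 bytes per parameter for mixed-precision Adam: 2 bytes each for 16-bit parameters and gradients, plus 12 bytes of fp32 optimizer state. It ignores activations, buffers, and fragmentation, and it models HYBRID_SHARD as full sharding within a node plus replication across nodes; treat the numbers as rough estimates, not measurements.

```python
BYTES_PARAM = 2   # 16-bit parameters
BYTES_GRAD = 2    # 16-bit gradients
BYTES_OPTIM = 12  # fp32 master weights + Adam momentum + Adam variance


def model_state_bytes_per_gpu(num_params: float, strategy: str,
                              world_size: int, gpus_per_node: int) -> float:
    """Rough per-GPU bytes for model states under an FSDP sharding strategy."""
    p = num_params
    total = BYTES_PARAM + BYTES_GRAD + BYTES_OPTIM
    if strategy == "NO_SHARD":
        # Everything is replicated on every GPU (like plain data parallelism).
        return p * total
    if strategy == "SHARD_GRAD_OP":
        # ZeRO-2 style: parameters replicated, gradients and optimizer states sharded.
        return p * BYTES_PARAM + p * (BYTES_GRAD + BYTES_OPTIM) / world_size
    if strategy == "FULL_SHARD":
        # ZeRO-3 style: all model states sharded across every GPU.
        return p * total / world_size
    if strategy == "HYBRID_SHARD":
        # Full sharding within a node, replication across nodes.
        return p * total / gpus_per_node
    raise ValueError(f"unknown sharding strategy: {strategy}")


# Example: a 7B-parameter model on 2 machines with 8 GPUs each.
for s in ["NO_SHARD", "SHARD_GRAD_OP", "FULL_SHARD", "HYBRID_SHARD"]:
    gb = model_state_bytes_per_gpu(7e9, s, world_size=16, gpus_per_node=8) / 1e9
    print(f"{s:>14}: {gb:.1f} GB")
```

Under these assumptions, SHARD_GRAD_OP keeps a full copy of the 16-bit parameters on every GPU but still removes most of the optimizer-state footprint, which is the bulk of the memory for Adam.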
LoraOptions
LoRA options currently supported:
Field | Description |
---|---|
rank | The rank parameter for LoRA training. |
alpha | The alpha parameter for LoRA training. |
dropout | The dropout rate for LoRA training. |
target_modules | The list of target modules for LoRA training. |
quantize_data_type | The data type for quantization in LoRA training. Valid options are None and "nf4". |
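To see why LoRA is cheap to train, the sketch below counts the trainable parameters LoRA adds to a single linear layer and computes the scaling factor applied to the low-rank update. It assumes the standard LoRA formulation, in which a d_out by d_in weight receives an update B @ A, with A of shape rank by d_in and B of shape d_out by rank, scaled by alpha / rank (also the default convention in Hugging Face PEFT). The layer size is illustrative, and the helper names are hypothetical.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


def lora_scaling(alpha: float, rank: int) -> float:
    """Scale applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


d_in = d_out = 4096          # illustrative hidden size
full = d_in * d_out          # parameters in the frozen weight matrix
lora = lora_trainable_params(d_in, d_out, rank=4)
print(f"full: {full:,}  lora: {lora:,}  ({lora / full:.2%})")
print(f"scaling: {lora_scaling(alpha=32, rank=4)}")
# full: 16,777,216  lora: 32,768  (0.20%)
# scaling: 8.0
```

With rank = 4 and alpha = 32 (the values used in the LoRA example), the update is scaled by 8, so raising alpha relative to rank strengthens the adapter's contribution.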
Example run with LoRA options
If you'd like to do a LoRA train, you can specify a LoRA option to TrainingArgs via the LoraOptions object.
from instructlab.training import LoraOptions, TrainingArgs

training_args = TrainingArgs(
    lora = LoraOptions(
        rank = 4,
        alpha = 32,
        dropout = 0.1,
    ),
    # ...
)
Learning about TorchrunArgs arguments
When running the training script, we always invoke torchrun.
If you are running a single-GPU system or something that doesn't otherwise require distributed training configuration, you can create a default object:
run_training(
    torchrun_args=TorchrunArgs(),
    training_args=TrainingArgs(
        # ...
    ),
)
However, if you want to specify a more complex configuration, the library currently supports all the options that torchrun accepts today.
[!NOTE] For more information about the torchrun arguments, please consult the torchrun documentation.
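The TorchrunArgs fields used in this document (nnodes, nproc_per_node, node_rank, rdzv_id, rdzv_endpoint) share their names with torchrun command-line flags. The sketch below shows how such a set of options could be turned into a torchrun invocation; the helper build_torchrun_cmd and the script name train.py are hypothetical placeholders, and this is not how the library itself constructs the command.

```python
from typing import Dict, List, Union


def build_torchrun_cmd(options: Dict[str, Union[int, str]], script: str) -> List[str]:
    """Turn option names/values into a torchrun argument list (hypothetical helper)."""
    cmd = ["torchrun"]
    for name, value in options.items():
        cmd.append(f"--{name}={value}")
    cmd.append(script)
    return cmd


options = {
    "nnodes": 1,
    "nproc_per_node": 8,
    "node_rank": 0,
    "rdzv_id": 123,
    "rdzv_endpoint": "127.0.0.1:12345",
}
print(" ".join(build_torchrun_cmd(options, "train.py")))
# torchrun --nnodes=1 --nproc_per_node=8 --node_rank=0 --rdzv_id=123 --rdzv_endpoint=127.0.0.1:12345 train.py
```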
Example training run with TorchrunArgs arguments
For example, in an 8-GPU, 2-machine system (4 GPUs per machine), we would specify the following torchrun config:
# on machine 1
import os

from instructlab.training import run_training, TorchrunArgs

MASTER_ADDR = os.getenv('MASTER_ADDR')
MASTER_PORT = os.getenv('MASTER_PORT')
RDZV_ENDPOINT = f'{MASTER_ADDR}:{MASTER_PORT}'

torchrun_args = TorchrunArgs(
    nnodes = 2, # number of machines
    nproc_per_node = 4, # num GPUs per machine
    node_rank = 0, # node rank for this machine
    rdzv_id = 123,
    rdzv_endpoint = RDZV_ENDPOINT
)

# training_args: a TrainingArgs object, defined as in the example run below
run_training(
    torchrun_args=torchrun_args,
    training_args=training_args
)

# on machine 2
import os

from instructlab.training import run_training, TorchrunArgs

MASTER_ADDR = os.getenv('MASTER_ADDR')
MASTER_PORT = os.getenv('MASTER_PORT')
RDZV_ENDPOINT = f'{MASTER_ADDR}:{MASTER_PORT}'

torchrun_args = TorchrunArgs(
    nnodes = 2, # number of machines
    nproc_per_node = 4, # num GPUs per machine
    node_rank = 1, # node rank for this machine
    rdzv_id = 123,
    rdzv_endpoint = RDZV_ENDPOINT
)

run_training(
    torchrun_args=torchrun_args,
    training_args=training_args
)
Example training run with arguments
Define the training arguments that will serve as the parameters for our training run:
# define training-specific arguments
training_args = TrainingArgs(
    # define data-specific arguments
    model_path = "ibm-granite/granite-7b-base",
    data_path = "path/to/dataset.jsonl",
    ckpt_output_dir = "data/saved_checkpoints",
    data_output_dir = "data/outputs",

    # define model-training parameters
    max_seq_len = 4096,
    max_batch_len = 60000,
    num_epochs = 10,
    effective_batch_size = 3840,
    save_samples = 250000,
    learning_rate = 2e-6,
    warmup_steps = 800,
    is_padding_free = True, # set this to true when using Granite-based models
    random_seed = 42,
)
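Several of these values are expressed in samples rather than optimizer steps. Assuming each optimizer step consumes effective_batch_size samples, the sketch below converts the example's save_samples into an approximate checkpoint interval in steps and evaluates the linear warmup described for warmup_steps. It is plain arithmetic under those assumptions; the helper names are hypothetical, and the learning rate after warmup is simply held constant here, which may not match the library's actual scheduler.

```python
def warmup_lr(step: int, learning_rate: float, warmup_steps: int) -> float:
    """Linear warmup from 0 to learning_rate over warmup_steps, then constant."""
    if warmup_steps <= 0:
        return learning_rate
    return learning_rate * min(1.0, step / warmup_steps)


def samples_to_steps(samples: int, effective_batch_size: int) -> float:
    """Approximate number of optimizer steps needed to see `samples` samples."""
    return samples / effective_batch_size


effective_batch_size = 3840
print(f"checkpoint every ~{samples_to_steps(250_000, effective_batch_size):.1f} steps")
print(f"warmup covers {800 * effective_batch_size:,} samples")
# checkpoint every ~65.1 steps
# warmup covers 3,072,000 samples
for step in (0, 400, 800, 1600):
    print(step, warmup_lr(step, learning_rate=2e-6, warmup_steps=800))
```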
We'll also need to define the settings for running a multi-process job via torchrun. To do this, create a TorchrunArgs object.
[!TIP] For single-GPU jobs, you can simply set nnodes = 1 and nproc_per_node = 1.
torchrun_args = TorchrunArgs(
    nnodes = 1, # number of machines
    nproc_per_node = 8, # num GPUs per machine
    node_rank = 0, # node rank for this machine
    rdzv_id = 123,
    rdzv_endpoint = '127.0.0.1:12345'
)
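With nproc_per_node = 8 on one machine, each optimizer step's effective_batch_size is spread across eight processes. The sketch below computes the per-GPU share of the example's effective_batch_size of 3840 and, for an assumed number of samples that fit in one forward/backward pass per GPU, how many gradient-accumulation micro-steps would be needed. The per-pass sample count (32 here) is a hypothetical input; in practice it depends on sample lengths and max_batch_len, and this sketch does not reproduce the library's own calculation.

```python
import math


def grad_accum_steps(effective_batch_size: int, world_size: int,
                     samples_per_pass: int) -> int:
    """Micro-steps per optimizer step so all GPUs together see effective_batch_size samples."""
    per_gpu = math.ceil(effective_batch_size / world_size)
    return math.ceil(per_gpu / samples_per_pass)


world_size = 1 * 8  # nnodes * nproc_per_node
per_gpu = 3840 // world_size
print(f"samples per GPU per optimizer step: {per_gpu}")
print(f"accumulation steps at 32 samples/pass: {grad_accum_steps(3840, world_size, 32)}")
# samples per GPU per optimizer step: 480
# accumulation steps at 32 samples/pass: 15
```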
Finally, you can just call run_training and this library will handle the rest 🙂.
run_training(
    torchrun_args=torchrun_args,
    training_args=training_args,
)