InstructLab Training Library
To simplify the process of fine-tuning models through the LAB method, this library provides a simple training interface.
Installation
To get started with the library, you must clone this repo and install it from source via pip:
# clone the repo and switch to the directory
git clone https://github.com/instructlab/training
cd training
# install the library
pip install .
For development, install it with pip install -e . instead to make local changes while using this library elsewhere.
Installing Additional NVIDIA packages
We make use of flash-attn and other packages which rely on NVIDIA-specific CUDA tooling.
If you are using NVIDIA hardware with CUDA, please install the additional dependencies via:
# for a regular install
pip install .[cuda]
# or, for an editable install (development)
pip install -e .[cuda]
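If you'd like to verify that the CUDA stack is usable after installing the extras, a quick sanity check along these lines can help (a minimal sketch; it assumes torch is installed and that flash-attn exposes the flash_attn import name):
# sanity check: confirm CUDA is visible and flash-attn imports cleanly
import torch
import flash_attn

print("CUDA available:", torch.cuda.is_available())
print("flash-attn version:", flash_attn.__version__)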
Usage
Using the library is fairly straightforward. First, import the necessary items:
from instructlab.training import (
    run_training,
    TorchrunArgs,
    TrainingArgs,
    DeepSpeedOptions
)
Then, define the training arguments which will serve as the parameters for our training run:
# define training-specific arguments
training_args = TrainingArgs(
    # define data-specific arguments
    model_path = "ibm-granite/granite-7b-base",
    data_path = "path/to/dataset.jsonl",
    ckpt_output_dir = "data/saved_checkpoints",
    data_output_dir = "data/outputs",

    # define model-training parameters
    max_seq_len = 4096,
    max_batch_len = 60000,
    num_epochs = 10,
    effective_batch_size = 3840,
    save_samples = 250000,
    learning_rate = 2e-6,
    warmup_steps = 800,
    is_padding_free = True, # set this to true when using Granite-based models
    random_seed = 42,
)
We'll also need to define the settings for running a multi-process job via torchrun. To do this, create a TorchrunArgs object.
[!TIP] Note, for single-GPU jobs, you can simply set nnodes = 1 and nproc_per_node = 1, as sketched after the example below.
torchrun_args = TorchrunArgs(
    nnodes = 1, # number of machines
    nproc_per_node = 8, # num GPUs per machine
    node_rank = 0, # node rank for this machine
    rdzv_id = 123,
    rdzv_endpoint = '127.0.0.1:12345'
)
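For instance, building on the tip above, a single-GPU run might look like the following sketch (the rendezvous values are just placeholders, mirroring the example above):
# single-GPU: one machine, one process
torchrun_args = TorchrunArgs(
    nnodes = 1, # single machine
    nproc_per_node = 1, # single GPU
    node_rank = 0,
    rdzv_id = 123,
    rdzv_endpoint = '127.0.0.1:12345'
)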
Finally, you can just call run_training and this library will handle the rest 🙂.
run_training(
    torchrun_args=torchrun_args,
    training_args=training_args,
)
Customizing TrainingArgs
The TrainingArgs
class provides most of the customization options
for the training job itself. There are a number of options you can specify, such as setting
DeepSpeed config values or running a LoRA training job instead of a full fine-tune.
Here is a breakdown of the general options:
Field | Description |
---|---|
model_path | Either a reference to a HuggingFace repo or a path to a model saved in the HuggingFace format. |
data_path | A path to the .jsonl training dataset. This is expected to be in the messages format; see the sketch after this table. |
ckpt_output_dir | Directory where trained model checkpoints will be saved. |
data_output_dir | Directory where we'll store all other intermediary data such as log files, the processed dataset, etc. |
max_seq_len | The maximum sequence length to be included in the training set. Samples exceeding this length will be dropped. |
max_batch_len | The maximum length of all training batches that we intend to handle in a single step. Used as part of the multipack calculation. If running into out-of-memory errors, try to lower this value, but not below the max_seq_len . |
num_epochs | Number of epochs to run through before stopping. |
effective_batch_size | The number of samples in a batch to see before we update the model parameters. Higher values lead to better learning performance. |
save_samples | Number of samples the model should see before saving a checkpoint. Consider this to be the checkpoint save frequency. The amount of storage used for a single training run will usually be 4GB * len(dataset) / save_samples. |
learning_rate | How fast we optimize the weights during gradient descent. Higher values may lead to unstable learning performance. It's generally recommended to have a low learning rate with a high effective batch size. |
warmup_steps | The number of steps a model should go through before reaching the full learning rate. We start at 0 and linearly climb up to learning_rate . |
is_padding_free | Boolean value to indicate whether or not we're training a padding-free transformer model such as Granite. |
random_seed | The random seed PyTorch will use. |
mock_data | Whether or not to use mock, randomly generated data during training. For debugging purposes. |
mock_data_len | Max length of a single mock data sample. Equivalent to max_seq_len but for mock data. |
deepspeed_options | Config options to specify for the DeepSpeed optimizer. |
lora | Options to specify if you intend to perform a LoRA train instead of a full fine-tune. |
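For reference, here is a minimal sketch of how a messages-format .jsonl dataset might be written out with plain Python; treat the exact record layout as an illustration rather than a schema, since the fields your data pipeline expects may differ:
import json

# each line is one training sample holding a list of role/content messages
samples = [
    {
        "messages": [
            {"role": "user", "content": "What is the capital of France?"},
            {"role": "assistant", "content": "The capital of France is Paris."},
        ]
    },
]

with open("path/to/dataset.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")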
DeepSpeedOptions
We currently support only a few options in DeepSpeedOptions:
The default is to run with DeepSpeed, so these options currently only allow you to customize aspects of the ZeRO stage 2 optimizer. A usage sketch follows the table below.
Field | Description |
---|---|
cpu_offload_optimizer | Whether or not to do CPU offloading in DeepSpeed stage 2. |
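As a sketch, enabling optimizer CPU offloading means passing a DeepSpeedOptions object through TrainingArgs (the remaining TrainingArgs fields are elided here for brevity):
from instructlab.training import DeepSpeedOptions, TrainingArgs

training_args = TrainingArgs(
    # ... the fields shown in the Usage example above ...
    deepspeed_options = DeepSpeedOptions(
        cpu_offload_optimizer = True,
    ),
)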
LoraOptions
If you'd like to run a LoRA training job, you can specify LoRA options to TrainingArgs via the LoraOptions object.
from instructlab.training import LoraOptions, TrainingArgs
training_args = TrainingArgs(
    lora = LoraOptions(
        rank = 4,
        alpha = 32,
        dropout = 0.1,
    ),
    # ...
)
Here is a breakdown of what we currently support today; a sketch using the remaining fields follows the table:
Field | Description |
---|---|
rank | The rank parameter for LoRA training. |
alpha | The alpha parameter for LoRA training. |
dropout | The dropout rate for LoRA training. |
target_modules | The list of target modules for LoRA training. |
quantize_data_type | The data type for quantization in LoRA training. Valid options are None and "nf4". |
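For instance, a quantized LoRA configuration using these fields might look like the sketch below; the target_modules names are an assumption that depends on your model's architecture, so substitute the module names your checkpoint actually uses:
from instructlab.training import LoraOptions, TrainingArgs

training_args = TrainingArgs(
    lora = LoraOptions(
        rank = 4,
        alpha = 32,
        dropout = 0.1,
        # assumed attention projection names; adjust for your model
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
        quantize_data_type = "nf4",
    ),
    # ...
)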
Customizing TorchrunArgs
When running the training script, we always invoke torchrun.
If you are running a single-GPU system or something that doesn't otherwise require distributed training configuration, you can just create a default object:
run_training(
    torchrun_args=TorchrunArgs(),
    training_args=TrainingArgs(
        # ...
    ),
)
However, if you want to specify a more complex configuration, we currently expose all of the options that torchrun accepts today.
[!NOTE] For more information about the torchrun arguments, please consult the torchrun documentation.
For example, in an 8-GPU, 2-machine system, we would specify the following torchrun config:
import os

MASTER_ADDR = os.getenv('MASTER_ADDR')
MASTER_PORT = os.getenv('MASTER_PORT')
RDZV_ENDPOINT = f'{MASTER_ADDR}:{MASTER_PORT}'

# on machine 1
torchrun_args = TorchrunArgs(
    nnodes = 2, # number of machines
    nproc_per_node = 4, # num GPUs per machine
    node_rank = 0, # node rank for this machine
    rdzv_id = 123,
    rdzv_endpoint = RDZV_ENDPOINT
)

run_training(
    torchrun_args=torchrun_args,
    training_args=training_args
)
import os

MASTER_ADDR = os.getenv('MASTER_ADDR')
MASTER_PORT = os.getenv('MASTER_PORT')
RDZV_ENDPOINT = f'{MASTER_ADDR}:{MASTER_PORT}'

# on machine 2
torchrun_args = TorchrunArgs(
    nnodes = 2, # number of machines
    nproc_per_node = 4, # num GPUs per machine
    node_rank = 1, # node rank for this machine
    rdzv_id = 123,
    rdzv_endpoint = RDZV_ENDPOINT
)

run_training(
    torchrun_args=torchrun_args,
    training_args=training_args
)