
Package to retrieve Goodput of jobs running on Cloud TPU.

Project description

ML Goodput Measurement

Overview

ML Goodput Measurement is a library intended to be used with Cloud TPU to log the information needed to compute a job's Goodput. It can be pip installed, and its modules retrieve information about a training job's overall productive Goodput. The package exposes APIs to log useful information from the user application and to query Goodput for the job run, giving insight into the productivity of ML workloads and the utilization of compute resources.
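Conceptually, Goodput is the share of a job's total wall-clock time spent on productive work such as training steps. A minimal, library-independent sketch of that arithmetic (the function and sample times below are illustrative, not part of this package):

```python
import datetime


def goodput_percent(productive_seconds: float,
                    job_start: datetime.datetime,
                    job_end: datetime.datetime) -> float:
    """Fraction of total job time that was productive, as a percentage."""
    total_seconds = (job_end - job_start).total_seconds()
    return 100.0 * productive_seconds / total_seconds


# A 10-hour job in which 9 hours were spent in training steps.
start = datetime.datetime(2024, 1, 1, 0, 0, 0)
end = start + datetime.timedelta(hours=10)
print(goodput_percent(9 * 3600, start, end))  # 90.0
```

The library automates exactly this bookkeeping: the recorder captures the timestamps, and the calculator derives productive versus total time from them.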

The package also exposes Goodput Monitoring APIs, which allow asynchronous querying and export of the job's Goodput to TensorBoard with a configurable upload interval.

Components

The ML Goodput Measurement library consists of the following main components:

  • GoodputRecorder

  • GoodputCalculator

  • GoodputMonitor

The GoodputRecorder exposes APIs to the client to export key timestamps while a training job makes progress, namely APIs that allow logging of productive step time and total job run time. The library serializes and stores this data in Google Cloud Logging.

The GoodputCalculator exposes APIs to compute Goodput based on the recorded data. Cloud Logging handles its internal operations asynchronously. The recommended way to compute Goodput is to run an analysis program separate from the training application, either on a CPU instance or on the user's development machine.

The GoodputMonitor exposes APIs to query and upload Goodput data to TensorBoard asynchronously. It does this by instantiating a GoodputCalculator under the hood.

Installation

To install the ML Goodput Measurement package, run the following command on the TPU VM:

pip install ml-goodput-measurement

Usage

Using this package requires a Google Cloud project with billing enabled, because it writes to Google Cloud Logging. If you don't have a Google Cloud project, or if you don't have billing enabled for your Google Cloud project, then do the following:

  1. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

  2. Make sure that billing is enabled for your Google Cloud project. Instructions can be found here

To run your training on Cloud TPU, set up the Cloud TPU environment by following instructions here.

To learn more about Google Cloud Logging, visit this page.

Import

To use this package, import the goodput module:

from ml_goodput_measurement import goodput

Define the name of the Google Cloud Logging logger.

Create a run-specific logger name where Cloud Logging entries can be written to and read from.

For example:

goodput_logger_name = f'goodput_{config.run_name}'

Create a GoodputRecorder object

Next, create a recorder object with the following parameters:

  1. job_name: The full run name of the job.
  2. logger_name: The name of the Cloud Logging logger object (created in the previous step).
  3. logging_enabled: Whether or not this process has Cloud Logging enabled.

NOTE: For a multi-worker setup, please ensure that only one worker writes the logs, to avoid duplication. In JAX, for example, the check could be if jax.process_index() == 0.

NOTE: logging_enabled defaults to False, and Goodput computations cannot be completed if no logs are ever written.

For example:

goodput_recorder = goodput.GoodputRecorder(
    job_name=config.run_name,
    logger_name=goodput_logger_name,
    logging_enabled=(jax.process_index() == 0),
)
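For non-JAX setups, the same single-writer guard can be expressed with whatever worker-index mechanism your framework provides. A hypothetical sketch using an environment variable (the variable name WORKER_INDEX is illustrative, not a convention of this package):

```python
import os

# Hypothetical convention: each worker process exports its own index.
worker_index = int(os.environ.get("WORKER_INDEX", "0"))

# Only worker 0 writes Goodput logs, avoiding duplicate entries.
logging_enabled = (worker_index == 0)
```

The resulting boolean would then be passed as the logging_enabled argument when constructing the GoodputRecorder.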

Record Data with GoodputRecorder

Record Job Start and End Time

Use the recorder object to record the job's overall start and end time.

For example:

def main(argv: Sequence[str]) -> None:
  # Initialize configs…
  goodput_recorder.record_job_start_time(datetime.datetime.now())
  # TPU initialization and device scanning…
  # Set up other things for the main training loop…
  # Main training loop
  train_loop(config)
  goodput_recorder.record_job_end_time(datetime.datetime.now())

Record Step Time

Use the recorder object to record each step's start time with record_step_start_time(step_count).

For example:

def train_loop(config, state=None):
  # Set up mesh, model, state, checkpoint manager…

  # Initialize functional train arguments and model parameters…

  # Define the compilation…

  for step in np.arange(start_step, config.steps):
    goodput_recorder.record_step_start_time(step)
    # Training step…

  return state

Retrieve Goodput with GoodputCalculator

To retrieve the Goodput of a job run, instantiate a GoodputCalculator object with the job's run name and the Cloud Logging logger name used to record data for that job run. Then call the get_job_goodput API to get the computed Goodput for the job run.

It is recommended to call get_job_goodput for a job run from an instance separate from your training machine.

Create a GoodputCalculator object

Create the calculator object:

goodput_logger_name = f'goodput_{config.run_name}'  # You can choose your own logger name.
goodput_calculator = goodput.GoodputCalculator(
    job_name=config.run_name,
    logger_name=goodput_logger_name,
)

Retrieve Goodput

Finally, call the get_job_goodput API to retrieve Goodput for the entire job run.

total_goodput = goodput_calculator.get_job_goodput()
print(f"Total job goodput: {total_goodput:.2f}%")

Monitor Goodput with GoodputMonitor

To monitor the Goodput of a job run on TensorBoard, instantiate a GoodputMonitor object with the job's run name, Cloud Logging logger name, and Goodput monitoring configurations (as described below). Then call the start_goodput_uploader API to asynchronously query and upload measured Goodput to the specified TensorBoard directory.

Create a GoodputMonitor object

Create a GoodputMonitor object with the following parameters:

  1. job_name: The full run name of the job.
  2. logger_name: The name of the Cloud Logging logger object (created in the previous step).
  3. tensorboard_dir: The directory to write TensorBoard data to.
  4. upload_interval: The time interval at which to query and upload data to TensorBoard.
  5. monitoring_enabled: Whether or not monitoring is enabled. If the application wants to monitor Goodput, set this value to True. Only one worker should enable monitoring.

NOTE: Please ensure that only one worker enables monitoring of Goodput. In JAX, for example, the check could be if jax.process_index() == 0.

For example:

goodput_logger_name = f'goodput_{config.run_name}'  # You can choose your own logger name.
goodput_monitoring_enabled = config.monitor_goodput and jax.process_index() == 0  # Check configs for whether or not to enable monitoring.

goodput_monitor = goodput.GoodputMonitor(
    job_name=config.run_name,
    logger_name=goodput_logger_name,
    tensorboard_dir=config.tensorboard_dir,
    upload_interval=config.goodput_upload_interval_seconds,
    monitoring_enabled=goodput_monitoring_enabled,
)

Start asynchronous "query and upload" of Goodput

Call the start_goodput_uploader API to spin off a thread which continuously queries and uploads Goodput.

goodput_monitor.start_goodput_uploader()

Download files

Source Distribution: ml_goodput_measurement-0.0.3.tar.gz (16.6 kB)

Built Distribution: ml_goodput_measurement-0.0.3-py3-none-any.whl (15.6 kB)

File details

Hashes for ml_goodput_measurement-0.0.3.tar.gz:

SHA256: ed23fdc4a824076e6b964f19cd5a3eaf6b058e67d0116bdbb73730e7b683f62b
MD5: 831f23045815b41c6c694cc538c2cf34
BLAKE2b-256: 1f476bcf7746e3b362d234e43e99f3fa986607ffb5649db11448a844c7c737ab

Hashes for ml_goodput_measurement-0.0.3-py3-none-any.whl:

SHA256: 12f85921255caf0b47b5f748d536844b1ac5b5021e203fa8d542d158b7ce802e
MD5: 2f22d5f5b7176654b7a5a4e61f38b937
BLAKE2b-256: 972b4726e3fa97d828470aa24d2d1148503516a093eb3aeb15410d3e0f26f178
