
Monitor, debug and profile the jobs running on Cloud accelerators like TPUs and GPUs.


Cloud Accelerator Diagnostics

Overview

Cloud Accelerator Diagnostics is a library for monitoring, debugging, and profiling workloads running on Cloud accelerators such as TPUs and GPUs. It also provides a streamlined way to upload data automatically to Tensorboard Experiments in Vertex AI: the package lets you create a Tensorboard instance and Experiments in Vertex AI, and upload logs to them.

Installation

To install the Cloud Accelerator Diagnostics package, run the following command:

pip install cloud-accelerator-diagnostics
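To confirm that the install succeeded, one option is to query the installed distribution metadata from Python. The sketch below uses only the standard library; the helper name installed_version is hypothetical and is not part of the cloud-accelerator-diagnostics package.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version("cloud-accelerator-diagnostics"))
```

A result of None suggests the package was installed into a different Python environment than the one running the check.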

Automating Uploads to Vertex AI Tensorboard

Before creating a Vertex AI Tensorboard and uploading logs to it, you must enable the Vertex AI API in your Google Cloud console. Also assign the Vertex AI User IAM role to the service account that will call the APIs in the cloud-accelerator-diagnostics package. This role is required to create and access the Vertex AI Tensorboard in the Google Cloud console.

Create Vertex AI Tensorboard

To learn about Vertex AI Tensorboard, visit this page.

Here is an example script that creates a Vertex AI Tensorboard instance named test-instance in the Google Cloud project test-project.

Note: Vertex AI is available in only these regions.

from cloud_accelerator_diagnostics import tensorboard

instance_id = tensorboard.create_instance(project="test-project",
                                          location="us-central1",
                                          tensorboard_name="test-instance")
print("Vertex AI Tensorboard created: ", instance_id)

Create Vertex AI Experiment

To learn about Vertex AI Experiments, visit this page.

The following script will create a Vertex AI Experiment named test-experiment in your Google Cloud Project test-project. Here's how it handles attaching a Tensorboard instance:

Scenario 1: Tensorboard Instance Exists

If a Tensorboard instance named test-instance already exists in your project, the script will attach it to the new Experiment.

Scenario 2: No Tensorboard Instance Present

If test-instance does not exist, the script will create a new Tensorboard instance with that name and attach it to the Experiment.

from cloud_accelerator_diagnostics import tensorboard

instance_id, tensorboard_url = tensorboard.create_experiment(project="test-project",
                                                             location="us-central1",
                                                             experiment_name="test-experiment",
                                                             tensorboard_name="test-instance")

print("View your Vertex AI Tensorboard here: ", tensorboard_url)

If a Vertex AI Experiment with the specified name exists, a new one will not be created, and the existing Experiment's URL will be returned.

Note: You can attach multiple Vertex AI Experiments to a single Vertex AI Tensorboard.
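As a rough illustration of attaching several Experiments to one Tensorboard, the sketch below loops over experiment names and calls a creation function once per name. The helper create_experiments is hypothetical. It takes the creation function as a parameter and assumes that function accepts the keyword arguments shown in the example above and returns an (instance_id, tensorboard_url) pair, as tensorboard.create_experiment does there.

```python
from typing import Callable, Dict, Iterable, Tuple

# Assumed shape of tensorboard.create_experiment, based on the example above:
# keyword arguments in, (instance_id, tensorboard_url) out.
CreateFn = Callable[..., Tuple[str, str]]


def create_experiments(create_fn: CreateFn,
                       project: str,
                       location: str,
                       tensorboard_name: str,
                       experiment_names: Iterable[str]) -> Dict[str, str]:
    """Create (or reuse) one Experiment per name, all attached to the same Tensorboard.

    Returns a mapping from experiment name to the URL returned by create_fn.
    """
    urls: Dict[str, str] = {}
    for name in experiment_names:
        _instance_id, url = create_fn(project=project,
                                      location=location,
                                      experiment_name=name,
                                      tensorboard_name=tensorboard_name)
        urls[name] = url
    return urls


# Possible usage (requires the package and Google Cloud credentials):
#   from cloud_accelerator_diagnostics import tensorboard
#   urls = create_experiments(tensorboard.create_experiment, "test-project",
#                             "us-central1", "test-instance",
#                             ["test-experiment-a", "test-experiment-b"])
```

Passing the creation function in as an argument keeps the helper easy to exercise with a stand-in function before pointing it at a real project.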

Upload Logs to Vertex AI Tensorboard

The following script continuously monitors the directory (logdir) for new data and uploads it to your Vertex AI Tensorboard Experiment. Note that after start_upload_to_tensorboard() is called, the uploader thread is kept alive even if an exception is thrown. To ensure the thread gets shut down, put any code that runs after start_upload_to_tensorboard() and before stop_upload_to_tensorboard() in a try block, and call stop_upload_to_tensorboard() in a finally block. This example shows how to upload the profile logs collected for your JAX workload to Vertex AI Tensorboard.

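import jax

from cloud_accelerator_diagnostics import uploader

# Start uploading new data found in logdir to the Experiment.
uploader.start_upload_to_tensorboard(project="test-project",
                                     location="us-central1",
                                     experiment_name="test-experiment",
                                     tensorboard_name="test-instance",
                                     logdir="gs://test-directory/testing")
try:
  # Write profile traces to the same directory the uploader is watching.
  jax.profiler.start_trace("gs://test-directory/testing")
  # <your code goes here>
  jax.profiler.stop_trace()
finally:
  # Always stop the uploader thread, even if the workload raised.
  uploader.stop_upload_to_tensorboard()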
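The try/finally pattern above can also be packaged as a context manager so that the stop call is harder to forget. This is a minimal sketch, not part of the package: upload_session is a hypothetical helper that takes the start and stop functions as arguments and forwards any keyword arguments to the start function.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def upload_session(start: Callable[..., object],
                   stop: Callable[[], object],
                   **start_kwargs) -> Iterator[None]:
    """Call start(**start_kwargs) on entry and stop() on exit, even on error.

    As in the try/finally example, stop() is not called if start() itself raises.
    """
    start(**start_kwargs)
    try:
        yield
    finally:
        stop()


# Possible usage (requires the package and Google Cloud credentials):
#   from cloud_accelerator_diagnostics import uploader
#   with upload_session(uploader.start_upload_to_tensorboard,
#                       uploader.stop_upload_to_tensorboard,
#                       project="test-project",
#                       location="us-central1",
#                       experiment_name="test-experiment",
#                       tensorboard_name="test-instance",
#                       logdir="gs://test-directory/testing"):
#       ...  # profiling and workload code goes here
```

Any exception raised inside the with block still propagates to the caller after stop() has run.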

