
Monitor, debug and profile the jobs running on Cloud accelerators like TPUs and GPUs.


Cloud Accelerator Diagnostics

Overview

Cloud Accelerator Diagnostics is a library to monitor, debug and profile the workloads running on Cloud accelerators like TPUs and GPUs. Additionally, this library provides a streamlined approach to automatically upload data to Tensorboard Experiments in Vertex AI. The package allows users to create a Tensorboard instance and Experiments in Vertex AI, and upload logs to them.

Installation

To install the Cloud Accelerator Diagnostics package, run the following command:

pip install cloud-accelerator-diagnostics

Automating Uploads to Vertex AI Tensorboard

Before creating and uploading logs to Vertex AI Tensorboard, you must enable the Vertex AI API in your Google Cloud console. Also, make sure to assign the Vertex AI User IAM role to the service account that will call the APIs in the cloud-accelerator-diagnostics package. This role is required to create and access the Vertex AI Tensorboard in the Google Cloud console.

Create Vertex AI Tensorboard

To learn about Vertex AI Tensorboard, visit this page.

Here is an example script that creates a Vertex AI Tensorboard instance named test-instance in the Google Cloud project test-project.

Note: Vertex AI is available in only these regions.

from cloud_accelerator_diagnostics import tensorboard

instance_id = tensorboard.create_instance(project="test-project",
                                          location="us-central1",
                                          tensorboard_name="test-instance")
print("Vertex AI Tensorboard created: ", instance_id)

Create Vertex AI Experiment

To learn about Vertex AI Experiments, visit this page.

The following script will create a Vertex AI Experiment named test-experiment in your Google Cloud Project test-project. Here's how it handles attaching a Tensorboard instance:

Scenario 1: Tensorboard Instance Exists

If a Tensorboard instance named test-instance already exists in your project, the script will attach it to the new Experiment.

Scenario 2: No Tensorboard Instance Present

If test-instance does not exist, the script will create a new Tensorboard instance with that name and attach it to the Experiment.

from cloud_accelerator_diagnostics import tensorboard

instance_id, tensorboard_url = tensorboard.create_experiment(project="test-project",
                                                             location="us-central1",
                                                             experiment_name="test-experiment",
                                                             tensorboard_name="test-instance")

print("View your Vertex AI Tensorboard here: ", tensorboard_url)

If a Vertex AI Experiment with the specified name exists, a new one will not be created, and the existing Experiment's URL will be returned.
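The reuse-or-create behavior described above can be illustrated with a small in-memory model. This is only a sketch of the semantics, not the library's implementation: FakeProject, the generated instance ids, and the example.invalid URL are all hypothetical stand-ins, and no Vertex AI calls are made.

```python
# Illustrative in-memory model only; this is NOT how the library is implemented.
from dataclasses import dataclass, field


@dataclass
class FakeProject:
    """Hypothetical stand-in for the Vertex AI state of one project."""
    tensorboards: dict = field(default_factory=dict)  # name -> instance id
    experiments: dict = field(default_factory=dict)   # name -> (instance id, url)


def create_experiment(project, experiment_name, tensorboard_name):
    """Return (instance_id, url), reusing existing resources when present."""
    # An existing Experiment is returned as-is; no new one is created.
    if experiment_name in project.experiments:
        return project.experiments[experiment_name]

    # Scenario 1: reuse the Tensorboard instance if it already exists.
    # Scenario 2: otherwise create it before attaching the Experiment.
    instance_id = project.tensorboards.get(tensorboard_name)
    if instance_id is None:
        instance_id = f"tb-{len(project.tensorboards) + 1}"  # made-up id format
        project.tensorboards[tensorboard_name] = instance_id

    url = f"https://example.invalid/{instance_id}/{experiment_name}"  # placeholder URL
    project.experiments[experiment_name] = (instance_id, url)
    return instance_id, url


project = FakeProject()
first = create_experiment(project, "test-experiment", "test-instance")
second = create_experiment(project, "test-experiment", "test-instance")
other = create_experiment(project, "other-experiment", "test-instance")
```

In this model the second call returns the same pair as the first, and other-experiment is attached to the same single Tensorboard instance.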

Note: You can attach multiple Vertex AI Experiments to a single Vertex AI Tensorboard.

Upload Logs to Vertex AI Tensorboard

The following script continuously monitors the directory (logdir) for new data and uploads it to your Vertex AI Tensorboard Experiment. Note that after start_upload_to_tensorboard() is called, the upload thread stays alive even if an exception is thrown. To ensure the thread is shut down, put any code that runs after start_upload_to_tensorboard() and before stop_upload_to_tensorboard() in a try block, and call stop_upload_to_tensorboard() in a finally block. This example shows how to upload the profile logs collected for your JAX workload to Vertex AI Tensorboard.

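import jax

from cloud_accelerator_diagnostics import uploader

# Start the background upload of new data found in logdir.
uploader.start_upload_to_tensorboard(project="test-project",
                                     location="us-central1",
                                     experiment_name="test-experiment",
                                     tensorboard_name="test-instance",
                                     logdir="gs://test-directory/testing")
try:
  # Write the JAX profile trace to the same directory the uploader watches.
  jax.profiler.start_trace("gs://test-directory/testing")
  # <your code goes here>
  jax.profiler.stop_trace()
finally:
  # Always stop the upload thread, even if the workload raised.
  uploader.stop_upload_to_tensorboard()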
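The try/finally pattern can also be wrapped in a context manager so that the stop call cannot be forgotten. The following is a minimal sketch: tensorboard_upload is a hypothetical helper that is not part of this package, and fake_start and fake_stop are stubs so that the example runs without cloud access. With the real package, one would presumably pass uploader.start_upload_to_tensorboard and uploader.stop_upload_to_tensorboard instead.

```python
from contextlib import contextmanager


@contextmanager
def tensorboard_upload(start, stop, **start_kwargs):
    """Hypothetical helper: start an uploader and always stop it on exit."""
    # As in the script above, start runs outside the try block,
    # so stop is only called once start has succeeded.
    start(**start_kwargs)
    try:
        yield
    finally:
        stop()


# Stub callables standing in for the real uploader functions.
events = []


def fake_start(**kwargs):
    events.append(("start", kwargs["logdir"]))


def fake_stop():
    events.append(("stop", None))


try:
    with tensorboard_upload(fake_start, fake_stop,
                            logdir="gs://test-directory/testing"):
        raise RuntimeError("simulated failure in the workload")
except RuntimeError:
    pass  # the failure still propagates to the caller; stop has already run
```

Even though the body raises, the stub stop function is recorded after the start call, which mirrors calling stop_upload_to_tensorboard() in a finally block.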
