Monitor, debug and profile the jobs running on Cloud TPU.

Cloud TPU Diagnostics

This is a comprehensive library to monitor, debug and profile the jobs running on Cloud TPU. To learn about Cloud TPU, refer to the full documentation.

Features

1. Debugging

1.1 Collect Stack Traces

This module dumps the Python stack traces when a fault such as a segmentation fault, floating-point exception, or illegal operation exception occurs in the program. It also collects stack traces periodically to help debug a program running on Cloud TPU that is stuck or hung.
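To make the fault half of this behavior concrete, here is a minimal sketch that uses Python's standard `faulthandler` module. It is a stand-in for illustration only and is not presented as how this package is implemented. The child program and its `crash` function are hypothetical; the fault is simulated by raising SIGSEGV in a subprocess so that the parent process survives.

```python
import subprocess
import sys
import textwrap

# Hypothetical child program: enable the fault handler, then crash on purpose.
CHILD = textwrap.dedent("""
    import faulthandler
    import signal

    def crash():
        # Simulate a segmentation fault in the running program.
        signal.raise_signal(signal.SIGSEGV)

    # Dump the Python traceback to stderr when a fatal signal arrives.
    faulthandler.enable()
    crash()
""")

# Run the child in a separate interpreter so only the child dies.
result = subprocess.run(
    [sys.executable, "-c", CHILD], capture_output=True, text=True)

# The child's stderr holds the fault message and the Python traceback.
print(result.stderr)
```

The traceback names the Python function that was executing when the fault occurred, which is the information a stack trace dump is meant to preserve.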

Installation

To install the package, run the following command on the TPU VM:

pip install cloud-tpu-diagnostics

Usage

To use this package, first import the modules:

from cloud_tpu_diagnostics import diagnostic
from cloud_tpu_diagnostics.configuration import debug_configuration
from cloud_tpu_diagnostics.configuration import diagnostic_configuration
from cloud_tpu_diagnostics.configuration import stack_trace_configuration

Then, create a configuration object for stack traces. The module collects stack traces only when the collect_stack_trace parameter is set to True. The following scenarios are currently supported:

Scenario 1: Do not collect stack traces on faults
stack_trace_config = stack_trace_configuration.StackTraceConfig(
                      collect_stack_trace=False)

With this configuration, no stack traces are collected in the event of a fault or process hang.

Scenario 2: Collect stack traces on faults and display on console
stack_trace_config = stack_trace_configuration.StackTraceConfig(
                      collect_stack_trace=True,
                      stack_trace_to_cloud=False)

If there is a fault or process hang, this configuration will show the stack traces on the console (stderr).

Scenario 3: Collect stack traces on faults and upload to cloud
stack_trace_config = stack_trace_configuration.StackTraceConfig(
                      collect_stack_trace=True,
                      stack_trace_to_cloud=True)

If there is a fault or process hang, this configuration temporarily collects stack traces inside the /tmp/debugging directory on the TPU host. The traces collected in TPU host memory are then uploaded to Google Cloud Logging, which makes it easier to troubleshoot and fix problems.
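The collect-then-upload flow described above can be sketched with the standard library alone. This is an illustrative assumption, not the package's code: `dump_and_forward`, the file name `stack_trace.txt`, and the use of a local `logging` logger in place of Google Cloud Logging are all hypothetical, and deleting the file after forwarding is one reading of "temporarily".

```python
import faulthandler
import logging
import pathlib
import tempfile


def dump_and_forward(trace_dir, logger):
    """Write current stack traces to a file in trace_dir, forward, then delete."""
    trace_dir = pathlib.Path(trace_dir)
    trace_dir.mkdir(parents=True, exist_ok=True)
    trace_path = trace_dir / "stack_trace.txt"  # hypothetical file name
    with open(trace_path, "w") as trace_file:
        # Write the Python stack of every thread to the file.
        faulthandler.dump_traceback(file=trace_file, all_threads=True)
    text = trace_path.read_text()
    # Stand-in for the upload step: a real setup would send this text to a
    # cloud logging backend instead of a local logger.
    logger.error("stack trace collected:\n%s", text)
    trace_path.unlink()  # assumption: the on-disk copy is only temporary
    return text


logging.basicConfig(level=logging.ERROR)
with tempfile.TemporaryDirectory() as tmp:
    # A throwaway directory stands in for the directory on the TPU host.
    forwarded = dump_and_forward(
        pathlib.Path(tmp) / "debugging", logging.getLogger("sketch"))
```

Keeping the local dump and the forwarding step separate means a trace is still written to disk even if the forwarding step fails.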

By default, stack traces are collected every 10 minutes. To change the interval between two stack trace collection events, set the stack_trace_interval_seconds parameter:

stack_trace_config = stack_trace_configuration.StackTraceConfig(
                      collect_stack_trace=True,
                      stack_trace_to_cloud=True,
                      stack_trace_interval_seconds=300)

This configuration collects stack traces and uploads them to the cloud every 5 minutes.

Then, create a configuration object for debugging.

debug_config = debug_configuration.DebugConfig(
                stack_trace_config=stack_trace_config)

Then, create a configuration object for diagnostics.

diagnostic_config = diagnostic_configuration.DiagnosticConfig(
                      debug_config=debug_config)

Finally, call the diagnose() method in a with statement, and place inside the context manager the statements for which you want to collect stack traces.

with diagnostic.diagnose(diagnostic_config):
    run_job(...)
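For readers unfamiliar with this pattern, the sketch below shows how a configuration flag and a context manager can scope periodic stack trace collection to a block of code. It is built on the standard `faulthandler` and `contextlib` modules as an assumption for illustration. `SketchTraceConfig` and `sketch_diagnose` are hypothetical names and are not the package's `StackTraceConfig` or `diagnose()`.

```python
import contextlib
import dataclasses
import faulthandler
import sys
import tempfile
import time


@dataclasses.dataclass
class SketchTraceConfig:  # hypothetical; not the package's StackTraceConfig
    collect_stack_trace: bool = False
    interval_seconds: float = 600.0  # mirrors the 10-minute default


@contextlib.contextmanager
def sketch_diagnose(config, file=None):
    """Collect stack traces only while the with-block is running."""
    if not config.collect_stack_trace:
        yield  # collection disabled: run the block untouched
        return
    target = sys.stderr if file is None else file
    # Dump tracebacks on fatal signals for the duration of the block.
    faulthandler.enable(file=target, all_threads=True)
    # Also dump every thread's stack each interval_seconds.
    faulthandler.dump_traceback_later(
        config.interval_seconds, repeat=True, file=target)
    try:
        yield
    finally:
        # Stop collecting once the wrapped statements finish or raise.
        faulthandler.cancel_dump_traceback_later()
        faulthandler.disable()


config = SketchTraceConfig(collect_stack_trace=True, interval_seconds=0.2)
with tempfile.TemporaryFile() as out:
    with sketch_diagnose(config, file=out):
        time.sleep(0.7)  # stands in for run_job(...)
    out.seek(0)
    collected = out.read().decode(errors="replace")
print(collected)
```

Because the cleanup runs in a `finally` clause, collection stops when the block exits, whether it completes normally or raises.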
