GenAI Perf Analyzer CLI - CLI tool to simplify profiling LLMs and Generative AI models with Perf Analyzer

Project description

GenAI-Perf

GenAI-Perf is a command line tool for measuring the throughput and latency of generative AI models as served through an inference server. For large language models (LLMs), GenAI-Perf provides metrics such as output token throughput, time to first token, inter token latency, and request throughput. For a full list of metrics please see the Metrics section.

Users specify a model name, an inference server URL, the type of inputs to use (synthetic or from a dataset defined via a file), and the type of load to generate (number of concurrent requests, request rate).

GenAI-Perf generates the specified load, measures the performance of the inference server, and reports the metrics in a simple table as console output. The tool also logs all results in CSV and JSON files that can be used to derive additional metrics and visualizations. The inference server must already be running when GenAI-Perf is run.

You can use GenAI-Perf to run performance benchmarks on LLMs served via Triton or OpenAI-compatible endpoints, as well as on embedding, ranking, and image retrieval endpoints.

Note: GenAI-Perf is currently in early release and under rapid development. While we will try to remain consistent, command line options and functionality are subject to change as the tool matures.


Installation

The easiest way to install GenAI-Perf is through the Triton Server SDK container. Install the latest release using the following commands:

export RELEASE="24.10"

docker run -it --net=host --gpus=all  nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

# Check out genai_perf command inside the container:
genai-perf --help
Alternatively, to install from source:

Since GenAI-Perf depends on Perf Analyzer, you'll need to install the Perf Analyzer binary:

Install Perf Analyzer (Ubuntu, Python 3.10+)

NOTE: you must already have CUDA 12 installed (see the CUDA installation guide).

pip install tritonclient

You can also build Perf Analyzer from source.

Install GenAI-Perf from source

pip install git+https://github.com/triton-inference-server/perf_analyzer.git#subdirectory=genai-perf

Quick Start

In this quick start, we will use GenAI-Perf to run performance benchmarking on the GPT-2 model running on Triton Inference Server with a TensorRT-LLM engine.

Serve GPT-2 TensorRT-LLM model using Triton CLI

You can follow the quickstart guide in the Triton CLI GitHub repository to serve GPT-2 on the Triton server with the TensorRT-LLM backend. The full instructions are copied below for convenience:

# This container comes with all of the dependencies for building TRT-LLM engines
# and serving the engine with Triton Inference Server.
docker run -ti \
    --gpus all \
    --network=host \
    --shm-size=1g --ulimit memlock=-1 \
    -v /tmp:/tmp \
    -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
    nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3

# Install the Triton CLI
pip install git+https://github.com/triton-inference-server/triton_cli.git@0.0.11

# Build TRT LLM engine and generate a Triton model repository pointing at it
triton remove -m all
triton import -m gpt2 --backend tensorrtllm

# Start Triton pointing at the default model repository
triton start

Running GenAI-Perf

Now we can run GenAI-Perf inside the Triton Inference Server SDK container:

genai-perf profile -m gpt2 --service-kind triton --backend tensorrtllm --streaming

Example output:

                              NVIDIA GenAI-Perf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃                         Statistic ┃    avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│          Time to first token (ms) │  16.26 │  12.39 │  17.25 │  17.09 │  16.68 │  16.56 │
│          Inter token latency (ms) │   1.85 │   1.55 │   2.04 │   2.02 │   1.97 │   1.92 │
│              Request latency (ms) │ 499.20 │ 451.01 │ 554.61 │ 548.69 │ 526.13 │ 514.19 │
│            Output sequence length │ 261.90 │ 256.00 │ 298.00 │ 296.60 │ 270.00 │ 265.00 │
│             Input sequence length │ 550.06 │ 550.00 │ 553.00 │ 551.60 │ 550.00 │ 550.00 │
│ Output token throughput (per sec) │ 520.87 │    N/A │    N/A │    N/A │    N/A │    N/A │
│      Request throughput (per sec) │   1.99 │    N/A │    N/A │    N/A │    N/A │    N/A │
└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘

See Tutorial for additional examples.


Visualization

GenAI-Perf can also generate various plots that visualize the performance of the current profile run. This is disabled by default but users can easily enable it by passing the --generate-plots option when running the benchmark:

genai-perf profile \
  -m gpt2 \
  --service-kind triton \
  --backend tensorrtllm \
  --streaming \
  --concurrency 1 \
  --generate-plots

This will generate a set of default plots such as:

  • Time to first token (TTFT) analysis
  • Request latency analysis
  • TTFT vs Input sequence lengths
  • Inter token latencies vs Token positions
  • Input sequence lengths vs Output sequence lengths

Using compare Subcommand to Visualize Multiple Runs

The compare subcommand in GenAI-Perf lets you compare multiple profile runs and visualize the differences through plots.

Usage

Assuming you have two profile export JSON files, profile1.json and profile2.json, you can run the compare subcommand using the --files option:

genai-perf compare --files profile1.json profile2.json

Running the above command performs the following actions in the compare directory:

  1. Generate a YAML configuration file (e.g. config.yaml) containing the metadata for each plot generated during the comparison process.
  2. Automatically generate the default set of plots (e.g. TTFT vs. Input Sequence Lengths) that compare the two profile runs.
compare
├── config.yaml
├── distribution_of_input_sequence_lengths_to_output_sequence_lengths.jpeg
├── request_latency.jpeg
├── time_to_first_token.jpeg
├── time_to_first_token_vs_input_sequence_lengths.jpeg
├── token-to-token_latency_vs_output_token_position.jpeg
└── ...

Customization

You can iteratively modify the generated YAML configuration file to suit your specific requirements, altering the plots as needed, and then run the command with the --config option followed by the path to the modified configuration file:

genai-perf compare --config compare/config.yaml

This command regenerates the plots based on the updated configuration settings, letting you refine the visual representation of the comparison results.

See Compare documentation for more details.


Model Inputs

GenAI-Perf supports model input prompts either from synthetically generated inputs or from a dataset defined via a file.

When the dataset is synthetic, you can specify the following options:

  • --num-prompts <int>: The number of unique prompts to generate as stimulus, >= 1.
  • --synthetic-input-tokens-mean <int>: The mean number of tokens in the generated prompts when using synthetic data, >= 1.
  • --synthetic-input-tokens-stddev <int>: The standard deviation of the number of tokens in the generated prompts when using synthetic data, >= 0.
  • --random-seed <int>: The seed used to generate random values, >= 0.
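
For example, building on the Quick Start command, a run with tighter control over the synthetic input distribution might look like the sketch below (the specific values are arbitrary):

genai-perf profile -m gpt2 --service-kind triton --backend tensorrtllm --streaming \
  --num-prompts 100 \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 10 \
  --random-seed 0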

When the dataset is coming from a file, you can specify the following options:

  • --input-file <path>: The input file or directory containing the prompts or filepaths to images to use for benchmarking as JSON objects.
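
As a sketch, assuming a JSONL file named prompts.jsonl (the filename is arbitrary; each line is a JSON object with a 'text' field, per the --input-file description in the Command Line Options section):

# prompts.jsonl contains lines such as:
# {"text": "Summarize the plot of Hamlet in two sentences."}
# {"text": "Explain what a hash table is."}

genai-perf profile -m gpt2 --service-kind triton --backend tensorrtllm --streaming \
  --input-file prompts.jsonl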

For any dataset, you can specify the following options:

  • --output-tokens-mean <int>: The mean number of tokens in each output. Ensure the --tokenizer value is set correctly, >= 1.
  • --output-tokens-stddev <int>: The standard deviation of the number of tokens in each output. This is only used when --output-tokens-mean is provided, >= 1.
  • --output-tokens-mean-deterministic: When using --output-tokens-mean, this flag can be set to improve precision by setting the minimum number of tokens equal to the requested number of tokens. This is currently supported with the Triton service-kind. Note that there is still some variability in the requested number of output tokens, but GenAI-Perf makes a best effort with your model to get the right number of output tokens.
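
For instance, to target outputs of roughly 100 tokens on the Quick Start setup (a sketch; the values are illustrative):

genai-perf profile -m gpt2 --service-kind triton --backend tensorrtllm --streaming \
  --output-tokens-mean 100 \
  --output-tokens-stddev 10 \
  --output-tokens-mean-deterministic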

You can optionally set additional model inputs with the following option:

  • --extra-inputs <input_name>:<value>: An additional input for use with the model with a singular value, such as stream:true or max_tokens:5. This flag can be repeated to supply multiple extra inputs.
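
For example, reusing the two sample inputs named above (a sketch; whether a given extra input is accepted depends on the model, backend, and endpoint):

genai-perf profile -m gpt2 --service-kind openai --endpoint-type chat \
  --extra-inputs max_tokens:5 \
  --extra-inputs stream:true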

For Large Language Models, there is no batch size (i.e. batch size is always 1). Each request includes the inputs for one individual inference. Other modes such as the embeddings and rankings endpoints support client-side batching, where --batch-size-text N means that each request sent will include the inputs for N separate inferences, allowing them to be processed together.
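
For example, an embeddings benchmark with client-side batching might look like the sketch below (the model name is a placeholder):

genai-perf profile -m my-embedding-model --service-kind openai --endpoint-type embeddings \
  --batch-size-text 4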


Authentication

GenAI-Perf can benchmark secure endpoints, such as OpenAI's, that require API key authentication. To do so, add your API key directly to the command by appending the flags below, replacing the key with your own. Arguments after the -- flag are passed directly through to Perf Analyzer; the -H flag adds HTTP headers.

-- -H "Authorization: Bearer ${API_KEY}" -H "Accept: text/event-stream"

Metrics

GenAI-Perf collects a diverse set of metrics that capture the performance of the inference server.

Each metric is reported with the aggregations listed below.

  • Time to First Token: Time between when a request is sent and when its first response is received; one value per request in the benchmark. Aggregations: avg, min, max, p99, p90, p75.
  • Inter Token Latency: Time between intermediate responses for a single request, divided by the number of generated tokens of the latter response; one value per response per request in the benchmark. Aggregations: avg, min, max, p99, p90, p75.
  • Request Latency: Time between when a request is sent and when its final response is received; one value per request in the benchmark. Aggregations: avg, min, max, p99, p90, p75.
  • Output Sequence Length: Total number of output tokens of a request; one value per request in the benchmark. Aggregations: avg, min, max, p99, p90, p75.
  • Input Sequence Length: Total number of input tokens of a request; one value per request in the benchmark. Aggregations: avg, min, max, p99, p90, p75.
  • Output Token Throughput: Total number of output tokens from the benchmark divided by the benchmark duration. Aggregations: none; one value per benchmark.
  • Request Throughput: Number of final responses from the benchmark divided by the benchmark duration. Aggregations: none; one value per benchmark.
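
As a rough cross-check using the Quick Start output above: request throughput (about 1.99 requests/sec) multiplied by the average output sequence length (about 261.9 tokens per request) gives roughly 521 output tokens/sec, in line with the reported output token throughput of 520.87.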

Command Line Options

-h
--help

Show the help message and exit.

Endpoint Options:

-m <list>
--model <list>

The names of the models to benchmark. A single model is recommended, unless you are profiling multiple LoRA adapters. (default: None)

--model-selection-strategy {round_robin, random}

When multiple models are specified, this is how a specific model is assigned to a prompt. Round robin means that each model receives a request in order. Random means that assignment is uniformly random. (default: round_robin)

--backend {tensorrtllm,vllm}

When using the "triton" service-kind, this is the backend of the model. For the TRT-LLM backend, you currently must set exclude_input_in_output to true in the model config to not echo the input tokens in the output. (default: tensorrtllm)

--endpoint <str>

Set a custom endpoint that differs from the OpenAI defaults. (default: None)

--endpoint-type {chat,completions,embeddings,rankings}

The endpoint-type to send requests to on the server. This is only used with the openai service-kind. (default: None)

--service-kind {triton,openai}

The kind of service perf_analyzer will generate load for. In order to use openai, you must specify an API via --endpoint-type. (default: triton)

--streaming

An option to enable the use of the streaming API. (default: False)

-u <url>
--url <url>

URL of the endpoint to target for benchmarking. (default: None)
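
For illustration, the endpoint options combine as in this sketch, where the URL is a placeholder for a local OpenAI-compatible server:

genai-perf profile -m gpt2 --service-kind openai --endpoint-type completions \
  --streaming -u http://localhost:8000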

Input Options

-b <int>
--batch-size <int>
--batch-size-text <int>

The text batch size of the requests GenAI-Perf should send. This is currently only supported with the embeddings and rankings endpoint types. (default: 1)

--batch-size-image <int>

The image batch size of the requests GenAI-Perf should send. This is currently only supported with the image retrieval endpoint type. (default: 1)

--extra-inputs <str>

Provide additional inputs to include with every request. You can repeat this flag for multiple inputs. Inputs should be in an input_name:value format. Alternatively, a string representing a JSON-formatted dict can be provided. (default: None)

--input-file <path>

The input file or directory containing the content to use for profiling. To use synthetic files for a converter that needs multiple files, prefix the path with 'synthetic:', followed by a comma-separated list of filenames. The synthetic filenames should not have extensions. For example, 'synthetic:queries,passages'. Each line should be a JSON object with a 'text' or 'image' field in JSONL format. Example: {"text": "Your prompt here"}

--num-prompts <int>

The number of unique prompts to generate as stimulus. (default: 100)

--output-tokens-mean <int>
--osl

The mean number of tokens in each output. Ensure the --tokenizer value is set correctly. (default: -1)

--output-tokens-mean-deterministic

When using --output-tokens-mean, this flag can be set to improve precision by setting the minimum number of tokens equal to the requested number of tokens. This is currently supported with the Triton service-kind. Note that there is still some variability in the requested number of output tokens, but GenAI-Perf makes a best effort with your model to get the right number of output tokens. (default: False)

--output-tokens-stddev <int>

The standard deviation of the number of tokens in each output. This is only used when --output-tokens-mean is provided. (default: 0)

--random-seed <int>

The seed used to generate random values. (default: 0)

--synthetic-input-tokens-mean <int>
--isl

The mean number of tokens in the generated prompts when using synthetic data. (default: 550)

--synthetic-input-tokens-stddev <int>

The standard deviation of the number of tokens in the generated prompts when using synthetic data. (default: 0)

Profiling Options

--concurrency <int>

The concurrency value to benchmark. (default: None)

--measurement-interval <int>
-p <int>

The time interval used for each measurement in milliseconds. Perf Analyzer will sample the specified time interval and take measurements over the requests completed within that interval. (default: 10000)

--request-rate <float>

Sets the request rate for the load generated by Perf Analyzer. (default: None)

-s <float>
--stability-percentage <float>

The allowed variation in latency measurements when determining if a result is stable. A measurement is considered stable if the max/min ratio across the most recent 3 measurements is within the stability percentage for both inferences per second and latency. (default: 999)
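
As a sketch, a fixed-concurrency run with a longer measurement window might look like this (values are arbitrary):

genai-perf profile -m gpt2 --service-kind triton --backend tensorrtllm --streaming \
  --concurrency 4 \
  --measurement-interval 20000

# Alternatively, drive the load with a fixed request rate instead of a fixed concurrency:
# --request-rate 2.0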

Output Options

--artifact-dir

The directory to store all the (output) artifacts generated by GenAI-Perf and Perf Analyzer. (default: artifacts)

--generate-plots

An option to enable the generation of plots. (default: False)

--profile-export-file <path>

The path where the perf_analyzer profile export will be generated. By default, the profile export will be written to profile_export.json. The genai-perf files will be exported to <profile_export_file>_genai_perf.json and <profile_export_file>_genai_perf.csv. For example, if the profile export file is profile_export.json, the genai-perf file will be exported to profile_export_genai_perf.csv. (default: profile_export.json)
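
For example, to redirect artifacts and rename the profile export (a sketch; the paths are arbitrary), with the exported files then following the naming convention described above:

genai-perf profile -m gpt2 --service-kind triton --backend tensorrtllm --streaming \
  --generate-plots \
  --artifact-dir ./my_artifacts \
  --profile-export-file my_run.json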

Other Options

--tokenizer <str>

The HuggingFace tokenizer to use to interpret token metrics from prompts and responses. The value can be the name of a tokenizer or the filepath of the tokenizer. (default: hf-internal-testing/llama-tokenizer)

--tokenizer-revision <str>

The specific tokenizer model version to use. It can be a branch name, tag name, or commit ID. (default: main)

--tokenizer-trust-remote-code

Allow custom tokenizer to be downloaded and executed. This carries security risks and should only be used for repositories you trust. This is only necessary for custom tokenizers stored in HuggingFace Hub. (default: False)
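
For example, to interpret token counts with the tokenizer matching the benchmarked model (a sketch; 'gpt2' here names the HuggingFace tokenizer of the same name):

genai-perf profile -m gpt2 --service-kind triton --backend tensorrtllm --streaming \
  --tokenizer gpt2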

-v
--verbose

An option to enable verbose mode. (default: False)

--version

An option to print the version and exit.


Known Issues

  • GenAI-Perf can be slow to finish if a high request-rate is provided
  • Token counts may not be exact

