This directory contains several examples that demonstrate how to use the Executor API:

- The example defined in `executorExampleBasic.cpp` shows how you can generate output tokens for a single prompt in only a few lines of code (a minimal sketch of this usage follows this list).
- The example defined in `executorExampleAdvanced.cpp` supports more options, such as providing an arbitrary number of input requests with arbitrary tokens per request and running in streaming mode.
- The example defined in `executorExampleLogitsProcessor.cpp` shows how to use `LogitsPostProcessor` to control output tokens.
- The example defined in `executorExampleFastLogits.cpp` shows how to use `ExternalDraftTokensConfig` for speculative decoding and optionally use the fast logits feature.
- The example defined in `executorExampleKvEvents.cpp` shows how to use the KV cache event API.
- The example defined in `executorExampleDisaggregated.cpp` shows how to use the disaggregated executor API.
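For reference, here is a minimal sketch of the kind of code `executorExampleBasic.cpp` contains, based on the public `tensorrt_llm::executor` API. Exact types, header paths, and constructor arguments may differ between TensorRT-LLM versions, so treat it as illustrative rather than a copy of the example:

```cpp
#include "tensorrt_llm/executor/executor.h"

#include <iostream>
#include <string>

namespace tle = tensorrt_llm::executor;

int main(int argc, char* argv[])
{
    // Path to the directory containing the TensorRT engine (passed on the command line).
    std::string enginePath = argv[1];

    // Create the executor for a decoder-only engine with a default configuration.
    auto executor = tle::Executor(enginePath, tle::ModelType::kDECODER_ONLY, tle::ExecutorConfig());

    // Enqueue a single request: a short prompt of token ids, asking for up to 10 new tokens.
    tle::VecTokens inputTokens{1, 2, 3, 4};
    auto requestId = executor.enqueueRequest(tle::Request(inputTokens, /* maxNewTokens */ 10));

    // Wait for the response and print the generated tokens for beam 0.
    auto responses = executor.awaitResponses(requestId);
    auto const& outputTokens = responses.at(0).getResult().outputTokenIds.at(0);
    for (auto token : outputTokens)
    {
        std::cout << token << " ";
    }
    std::cout << std::endl;
    return 0;
}
```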
To build the examples, you first need to build the TensorRT-LLM C++ shared libraries (`libtensorrt_llm.so` and `libnvinfer_plugin_tensorrt_llm.so`) using the `build_wheel.py` script. Alternatively, if you have already built the TensorRT-LLM libraries, you can modify the provided `CMakeLists.txt` such that `libtensorrt_llm.so` and `libnvinfer_plugin_tensorrt_llm.so` are imported properly.
Once the TensorRT-LLM libraries are built, you can run

```
mkdir build
cd build
cmake ..
make -j
```

from the `./examples/cpp/executor/` folder to build the basic and advanced examples.
Before you run the examples, please make sure that you have already built the engine(s) using the TensorRT-LLM API. Use `trtllm-build` to build the TRT-LLM engine.
From the `examples/cpp/executor/build` folder, you can run the `executorExampleBasic` example with:

```
./executorExampleBasic <path_to_engine_dir>
```

where `<path_to_engine_dir>` is the path to the directory containing the TensorRT engine files.
The `executorExampleDebug` example shows how you can define which engine IO tensors should be dumped to numpy files.
From the `examples/cpp/executor/build` folder, you can run the `executorExampleDebug` example with:

```
./executorExampleDebug <path_to_engine_dir>
```

where `<path_to_engine_dir>` is the path to the directory containing the TensorRT engine files.
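As a rough illustration of how dumping IO tensors can be configured, the sketch below sets a `DebugConfig` on the `ExecutorConfig`. The exact `DebugConfig` constructor arguments are an assumption here and may differ between TensorRT-LLM versions, so check `executorExampleDebug.cpp` for the authoritative usage:

```cpp
#include "tensorrt_llm/executor/executor.h"

namespace tle = tensorrt_llm::executor;

// Hypothetical helper: build an ExecutorConfig that asks the executor to dump
// engine IO tensors to numpy files. Constructor arguments are an assumption
// and may differ between TensorRT-LLM versions.
tle::ExecutorConfig makeDebugExecutorConfig()
{
    // Dump both input and output tensors; a list of tensor names can also be
    // provided to restrict which tensors are written out.
    tle::DebugConfig debugConfig(/* debugInputTensors */ true, /* debugOutputTensors */ true);

    tle::ExecutorConfig executorConfig;
    executorConfig.setDebugConfig(debugConfig);
    return executorConfig;
}
```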
From the `examples/cpp/executor/build` folder, you can also run the `executorExampleAdvanced` example. To get the full list of supported input arguments, type

```
./executorExampleAdvanced -h
```

For example, you can run:

```
./executorExampleAdvanced --engine_dir <path_to_engine_dir> --input_tokens_csv_file ../inputTokens.csv
```

to run with the provided dummy input tokens from `inputTokens.csv`. Upon successful completion, you should see the following in the logs:
```
[TensorRT-LLM][INFO] Creating request with 6 input tokens
[TensorRT-LLM][INFO] Creating request with 4 input tokens
[TensorRT-LLM][INFO] Creating request with 10 input tokens
[TensorRT-LLM][INFO] Got 20 tokens for beam 0 for requestId 3
[TensorRT-LLM][INFO] Request id 3 is completed.
[TensorRT-LLM][INFO] Got 14 tokens for beam 0 for requestId 2
[TensorRT-LLM][INFO] Request id 2 is completed.
[TensorRT-LLM][INFO] Got 16 tokens for beam 0 for requestId 1
[TensorRT-LLM][INFO] Request id 1 is completed.
[TensorRT-LLM][INFO] Writing output tokens to outputTokens.csv
[TensorRT-LLM][INFO] Exiting.
```
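The advanced example also demonstrates streaming mode, where tokens are returned incrementally as they are generated. Below is a minimal sketch of a streaming loop with the Executor API; it assumes the `setStreaming` setter on `Request` and the `isFinal` flag on `Result`, which may be named differently in your TensorRT-LLM version:

```cpp
#include "tensorrt_llm/executor/executor.h"

#include <iostream>

namespace tle = tensorrt_llm::executor;

// Minimal streaming loop: enqueue one request in streaming mode and print
// tokens as they arrive. Assumes `executor` was created as in the basic sketch.
void streamOneRequest(tle::Executor& executor, tle::VecTokens const& inputTokens)
{
    tle::Request request(inputTokens, /* maxNewTokens */ 20);
    request.setStreaming(true);

    auto requestId = executor.enqueueRequest(request);

    bool done = false;
    while (!done)
    {
        // Wait for the next batch of responses for this request.
        auto responses = executor.awaitResponses(requestId);
        for (auto const& response : responses)
        {
            auto const& result = response.getResult();
            // In streaming mode, each result typically carries the newly generated token(s).
            for (auto token : result.outputTokenIds.at(0))
            {
                std::cout << token << " ";
            }
            done = result.isFinal;
        }
    }
    std::cout << std::endl;
}
```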
To run the `executorExampleAdvanced` example on models that require multiple GPUs, you can run the example using MPI as follows:

```
mpirun -n <num_ranks> --allow-run-as-root ./executorExampleAdvanced --engine_dir <path_to_engine_dir> --input_tokens_csv_file ../inputTokens.csv
```

where `<num_ranks>` must be equal to `tp*pp` for the TensorRT engine. By default, GPU device IDs `[0...(num_ranks-1)]` will be used.
Alternatively, it is also possible to run multi-GPU models by using the so-called `Orchestrator` communication mode, where the `Executor` instance will automatically spawn additional processes to run the model on multiple GPUs. To use the `Orchestrator` communication mode, you can run the example with:

```
./executorExampleAdvanced --engine_dir <path_to_engine_dir> --input_tokens_csv_file ../inputTokens.csv --use_orchestrator_mode --worker_executable_path <path_to_executor_worker>
```

where `<path_to_executor_worker>` is the absolute path to the stand-alone executor worker executable, located at `cpp/build/tensorrt_llm/executor_worker/executorWorker` by default.
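Under the hood, orchestrator mode is selected through the executor's parallel configuration. The sketch below shows one plausible way to set this up with `ParallelConfig` and `OrchestratorConfig`; the exact constructor and setter signatures are assumptions and may differ between TensorRT-LLM versions:

```cpp
#include "tensorrt_llm/executor/executor.h"

#include <string>

namespace tle = tensorrt_llm::executor;

// Hypothetical helper: configure an executor to use orchestrator communication
// mode, where worker processes are spawned from the given worker executable.
// Signatures are assumptions; check the Executor API headers for your version.
tle::ExecutorConfig makeOrchestratorExecutorConfig(std::string const& workerExecutablePath)
{
    // The orchestrator process itself does not run the model; it spawns workers.
    tle::OrchestratorConfig orchestratorConfig(/* isOrchestrator */ true, workerExecutablePath);

    tle::ParallelConfig parallelConfig(tle::CommunicationType::kMPI, tle::CommunicationMode::kORCHESTRATOR);
    parallelConfig.setOrchestratorConfig(orchestratorConfig);

    tle::ExecutorConfig executorConfig;
    executorConfig.setParallelConfig(parallelConfig);
    return executorConfig;
}
```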
To run the `executorExampleFastLogits` example, you need two GPUs (one for the draft model and one for the target model). You can run it as follows:

```
mpirun -n 3 --allow-run-as-root ./executorExampleFastLogits --engine_dir <path_to_target_engine> --draft_engine_dir <path_to_draft_engine> --num_draft_tokens=3
```

The example uses 3 MPI ranks (one for the orchestrator, one for the draft model, and one for the target model). Use `--fast_logits=false` to disable the fast logits feature.
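For context, speculative decoding with externally provided draft tokens is attached to a request roughly as sketched below. The `ExternalDraftTokensConfig` constructor arguments shown here (draft tokens, optional draft logits, optional acceptance threshold, optional fast-logits flag) are an assumption based on the public Executor API and may vary by version; see `executorExampleFastLogits.cpp` for the exact usage:

```cpp
#include "tensorrt_llm/executor/executor.h"

#include <optional>

namespace tle = tensorrt_llm::executor;

// Hypothetical helper: build a request for the target model that carries draft
// tokens produced by a separate draft model. Argument order and names of
// ExternalDraftTokensConfig are assumptions; see executorExampleFastLogits.cpp.
tle::Request makeTargetRequest(tle::VecTokens const& inputTokens, tle::VecTokens const& draftTokens)
{
    tle::Request request(inputTokens, /* maxNewTokens */ 20);

    // Draft logits and the acceptance threshold are optional; the fast logits
    // feature (when enabled) transfers draft logits directly between ranks.
    tle::ExternalDraftTokensConfig draftConfig(
        draftTokens, /* logits */ std::nullopt, /* acceptanceThreshold */ std::nullopt, /* fastLogits */ true);
    request.setExternalDraftTokensConfig(draftConfig);
    return request;
}
```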
From the `examples/cpp/executor/build` folder, you can run the `executorExampleKvEvents` example with:

```
./executorExampleKvEvents --engine_dir <path_to_engine_dir>
```

where `<path_to_engine_dir>` is the path to the directory containing the TensorRT engine files.
This example shows how the KV Cache Event API can be used to reconstruct the state of TRT-LLM's internal radix tree. This can be used in applications such as smart routing to route requests between multiple executor instances to maximize KV Cache reuse. Events are emitted when blocks are stored, removed, or updated in the radix tree.
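A rough sketch of polling KV cache events through the Executor API is shown below. The event manager accessor, method, and field names used here are assumptions, so refer to `executorExampleKvEvents.cpp` and the Executor headers for the exact API:

```cpp
#include "tensorrt_llm/executor/executor.h"

#include <chrono>
#include <iostream>
#include <variant>

namespace tle = tensorrt_llm::executor;

// Hypothetical polling loop: drain KV cache events from the executor's event
// manager and react to stored/removed blocks (for example, to mirror the radix
// tree in a router). Accessor and field names are assumptions; check the headers.
void pollKvCacheEvents(tle::Executor& executor)
{
    auto eventManager = executor.getKVCacheEventManager();
    if (!eventManager.has_value())
    {
        return; // Event buffer not enabled in the KV cache config.
    }

    // Block for up to 100 ms waiting for new events.
    auto events = eventManager.value()->getLatestEvents(std::chrono::milliseconds(100));
    for (auto const& event : events)
    {
        // event.data is a variant over created/stored/removed/updated payloads.
        if (std::holds_alternative<tle::KVCacheStoredData>(event.data))
        {
            std::cout << "Blocks stored, event id " << event.eventId << std::endl;
        }
        else if (std::holds_alternative<tle::KVCacheRemovedData>(event.data))
        {
            std::cout << "Blocks removed, event id " << event.eventId << std::endl;
        }
    }
}
```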
From the `examples/cpp/executor/build` folder, you can also run the `executorExampleDisaggregated` example. To get the full list of supported input arguments, type

```
./executorExampleDisaggregated -h
```

Note that setting `TRTLLM_USE_MPI_KVCACHE=1` is required to run the disaggregated executor.

For example, you can run:
```
export TRTLLM_USE_MPI_KVCACHE=1
mpirun -n <num_ranks> --allow-run-as-root --oversubscribe ./executorExampleDisaggregated --context_engine_dir <path_to_context_engine_dir> --context_rank_size <num_ranks_for_context> --generation_engine_dir <path_to_generation_engine_dir> --generation_rank_size <num_ranks_for_generation> --input_tokens ../inputTokens.csv
```

where `<num_ranks_for_context>` must be equal to `tp*pp` for the context engine and `<num_ranks_for_generation>` must be equal to `tp*pp` for the generation engine; the context engine and generation engine can be heterogeneous in parallelism. `<num_ranks>` must be equal to `<num_ranks_for_context>+<num_ranks_for_generation>+1`, where the additional rank is used as the orchestrator process.
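At the API level, disaggregated serving splits a request into a context-only phase on one executor and a generation-only phase on another. The sketch below illustrates the general flow using a `RequestType` enum and context-phase parameters handed from the context response to the generation request; these names are assumptions based on the Executor API and may not match the exact interface in your TensorRT-LLM version, so consult `executorExampleDisaggregated.cpp`:

```cpp
#include "tensorrt_llm/executor/executor.h"

namespace tle = tensorrt_llm::executor;

// Hypothetical flow: run the context phase on one executor and the generation
// phase on another, forwarding the context-phase parameters between them.
// Names such as setRequestType/contextPhaseParams are assumptions.
tle::VecTokens runDisaggregated(
    tle::Executor& contextExecutor, tle::Executor& generationExecutor, tle::VecTokens const& inputTokens)
{
    // 1) Context-only request: fills the KV cache and returns context-phase parameters.
    tle::Request contextRequest(inputTokens, /* maxNewTokens */ 1);
    contextRequest.setRequestType(tle::RequestType::REQUEST_TYPE_CONTEXT_ONLY);
    auto contextId = contextExecutor.enqueueRequest(contextRequest);
    auto contextResult = contextExecutor.awaitResponses(contextId).at(0).getResult();

    // 2) Generation-only request: reuses the context phase produced above.
    tle::Request generationRequest(inputTokens, /* maxNewTokens */ 20);
    generationRequest.setRequestType(tle::RequestType::REQUEST_TYPE_GENERATION_ONLY);
    generationRequest.setContextPhaseParams(contextResult.contextPhaseParams.value());
    auto generationId = generationExecutor.enqueueRequest(generationRequest);
    auto generationResult = generationExecutor.awaitResponses(generationId).at(0).getResult();

    return generationResult.outputTokenIds.at(0);
}
```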