Command R

This document explains how to build the C4AI Command-R, C4AI Command R+, Aya-23-8B, and Aya-23-35B models using TensorRT-LLM and run them on a single GPU or on a single node with multiple GPUs.

Overview

The TensorRT-LLM Command-R implementation can be found in tensorrt_llm/models/commandr/model.py. The TensorRT-LLM Command-R example code is located in examples/commandr. There is one main file, convert_checkpoint.py, which converts a checkpoint from the HuggingFace (HF) Transformers format to the TensorRT-LLM format.

In addition, there are two shared files in the parent folder examples for inference and evaluation: ../run.py, which runs inference on an input text, and ../summarize.py, which runs the summarization task.

Support Matrix

  • FP16
  • INT8 & INT4 Weight-Only
  • Tensor Parallel

Usage

The following sections describe how to build the engines and run the inference demo.

1. Download repo and weights from HuggingFace Transformers

pip install -r requirements.txt
apt-get update
apt-get install git-lfs
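# Depending on your environment, Git LFS may also need to be initialized once
# so that the large weight files are actually fetched when cloning (a common
# extra step; adjust to your setup).
git lfs install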

# clone one or more models we want to build
git clone https://huggingface.co/CohereForAI/c4ai-command-r-v01         command_r_v01
git clone https://huggingface.co/CohereForAI/c4ai-command-r-plus        command_r_plus
git clone https://huggingface.co/CohereForAI/aya-23-8B                  aya_23_8B
git clone https://huggingface.co/CohereForAI/aya-23-35B                 aya_23_35B

2. Convert weights from HF Transformers to TensorRT-LLM format

The convert_checkpoint.py script converts HF weights to TensorRT-LLM checkpoints. The number of checkpoint files (in .safetensors format) matches the number of GPUs used to run inference.

# Command-R: single gpu, dtype float16
python3 convert_checkpoint.py --model_dir command_r_v01 --output_dir trt_ckpt/command_r_v01/fp16/1-gpu

# Command-R+: 4-way tensor parallelism
python3 convert_checkpoint.py --model_dir command_r_plus --tp_size 4 --output_dir trt_ckpt/command_r_plus/fp16/4-gpu

# Aya-23-8B: single gpu, dtype float16
python3 convert_checkpoint.py --model_dir aya_23_8B --output_dir trt_ckpt/aya_23_8B/fp16/1-gpu

# Aya-23-35B: single gpu, dtype float16
python3 convert_checkpoint.py --model_dir aya_23_35B --output_dir trt_ckpt/aya_23_35B/fp16/1-gpu
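
As a quick sanity check (the exact file list can vary between TensorRT-LLM versions), a converted checkpoint directory typically contains a config.json plus one rank<i>.safetensors file per GPU, so the 4-way tensor-parallel Command-R+ checkpoint should hold four weight shards:

# List the converted Command-R+ checkpoint; expect config.json and one
# rank<i>.safetensors file per GPU (four shards for --tp_size 4).
ls trt_ckpt/command_r_plus/fp16/4-gpu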

3. Build TensorRT engine(s)

The trtllm-build command builds TensorRT-LLM engines from TensorRT-LLM checkpoints. The number of engine files also matches the number of GPUs used to run inference.

Normally, the trtllm-build command only requires a single GPU, but you can enable parallel building by passing the number of parallel build workers to the --workers argument (see the variant after the build commands below).

# Command-R: single-gpu engine with dtype float16, GPT Attention plugin, Gemm plugin
trtllm-build --checkpoint_dir trt_ckpt/command_r_v01/fp16/1-gpu \
        --gemm_plugin float16 \
        --output_dir trt_engines/command_r_v01/fp16/1-gpu

# Command-R+: 4-way tensor parallelism
trtllm-build --checkpoint_dir trt_ckpt/command_r_plus/fp16/4-gpu \
        --gemm_plugin float16 \
        --output_dir trt_engines/command_r_plus/fp16/4-gpu

# Aya-23-8B: single-gpu engine with dtype float16, GPT Attention plugin, Gemm plugin
trtllm-build --checkpoint_dir trt_ckpt/aya_23_8B/fp16/1-gpu \
        --gemm_plugin float16 \
        --output_dir trt_engines/aya_23_8B/fp16/1-gpu

# Aya-23-35B: single-gpu engine with dtype float16, GPT Attention plugin, Gemm plugin
trtllm-build --checkpoint_dir trt_ckpt/aya_23_35B/fp16/1-gpu \
        --gemm_plugin float16 \
        --output_dir trt_engines/aya_23_35B/fp16/1-gpu
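
For multi-rank engines such as the Command-R+ example, the build can optionally be parallelized with the --workers argument mentioned above. A possible variant, assuming four GPUs are available on the build machine:

# Command-R+: build the four ranks in parallel with 4 worker processes.
trtllm-build --checkpoint_dir trt_ckpt/command_r_plus/fp16/4-gpu \
        --gemm_plugin float16 \
        --workers 4 \
        --output_dir trt_engines/command_r_plus/fp16/4-gpu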

If the engines are built successfully, you should see output similar to the following (Command-R used as the example):

......
[09/19/2024-03:34:30] [TRT] [I] Engine generation completed in 26.9495 seconds.
[09/19/2024-03:34:30] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 4 MiB, GPU 70725 MiB
[09/19/2024-03:34:55] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 176260 MiB
[09/19/2024-03:34:55] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:52
[09/19/2024-03:34:55] [TRT] [I] Serialized 26 bytes of code generator cache.
[09/19/2024-03:34:55] [TRT] [I] Serialized 315007 bytes of compilation cache.
[09/19/2024-03:34:55] [TRT] [I] Serialized 12 timing cache entries
[09/19/2024-03:34:55] [TRT-LLM] [I] Timing cache serialized to model.cache
[09/19/2024-03:34:55] [TRT-LLM] [I] Build phase peak memory: 176257.29 MB, children: 17.65 MB
[09/19/2024-03:34:55] [TRT-LLM] [I] Serializing engine to trt_engines/command_r_v01/fp16/1-gpu/rank0.engine...
[09/19/2024-03:35:20] [TRT-LLM] [I] Engine serialized. Total time: 00:00:25
[09/19/2024-03:35:23] [TRT-LLM] [I] Total time of building all engines: 00:01:47

4. Run inference

Single node, single GPU

# Run the default engine of Command-R on a single GPU.
python3 ../run.py --max_output_len 50 \
        --tokenizer_dir command_r_v01 \
        --engine_dir trt_engines/command_r_v01/fp16/1-gpu

# Run the default engine of Command-R on a single GPU, using streaming output.
python3 ../run.py --max_output_len 50 \
        --tokenizer_dir command_r_v01 \
        --engine_dir trt_engines/command_r_v01/fp16/1-gpu \
        --streaming

# Run the default engine of Aya-23-8B on a single GPU.
python3 ../run.py --max_output_len 50 \
        --tokenizer_dir aya_23_8B \
        --engine_dir trt_engines/aya_23_8B/fp16/1-gpu

# Run the default engine of Aya-23-35B on a single GPU.
python3 ../run.py --max_output_len 50 \
        --tokenizer_dir aya_23_35B \
        --engine_dir trt_engines/aya_23_35B/fp16/1-gpu
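
The commands above use the default prompt built into ../run.py. To try your own prompt, the script also accepts an --input_text argument; for example, with an arbitrary prompt:

# Run Command-R on a single GPU with a custom input prompt.
python3 ../run.py --max_output_len 50 \
        --input_text "How does a TensorRT engine differ from an ONNX model?" \
        --tokenizer_dir command_r_v01 \
        --engine_dir trt_engines/command_r_v01/fp16/1-gpu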

Single node, multi GPU

# Run the 4-way tensor-parallel engine of Command-R+ on 4 GPUs.
mpirun -n 4 \
    python ../run.py  --max_output_len 50 \
        --tokenizer_dir command_r_plus \
        --engine_dir trt_engines/command_r_plus/fp16/4-gpu

If the engine runs successfully, you should see output similar to the following (Command-R used as the example):

......
Input [Text 0]: "<BOS_TOKEN>Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: " chef in Paris and worked in the kitchens of the French royal family. He came to England in 1814 and worked in a number of London hotels and restaurants, including the Reform Club and the London Tavern. He also opened his own restaurant"

5. Run summarization task

# Run the summarization task with Command-R.
python3 ../summarize.py --test_trt_llm \
        --hf_model_dir command_r_v01 \
        --engine_dir trt_engines/command_r_v01/fp16/1-gpu
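
If you also want a baseline from the original HF checkpoint for comparison, ../summarize.py additionally supports a --test_hf flag (running the HF model may require enough GPU memory to hold it alongside the engine):

# Run the summarization task with both the TensorRT-LLM engine and the HF model.
python3 ../summarize.py --test_trt_llm --test_hf \
        --hf_model_dir command_r_v01 \
        --engine_dir trt_engines/command_r_v01/fp16/1-gpu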

If the engine runs successfully, you should see output similar to the following (Command-R used as the example):

......
[01/26/2024-02:51:56] [TRT-LLM] [I] TensorRT-LLM (total latency: 81.05689692497253 sec)
[01/26/2024-02:51:56] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 2000)
[01/26/2024-02:51:56] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 24.67402621952367)
[01/26/2024-02:51:56] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[01/26/2024-02:51:56] [TRT-LLM] [I]   rouge1 : 24.06804397902119
[01/26/2024-02:51:56] [TRT-LLM] [I]   rouge2 : 6.456513335555016
[01/26/2024-02:51:56] [TRT-LLM] [I]   rougeL : 16.77644999660741
[01/26/2024-02:51:56] [TRT-LLM] [I]   rougeLsum : 20.57359472317834

Weight-Only Quantization

Use --use_weight_only to enable INT8 weight-only quantization, which significantly lowers the latency and memory footprint. Furthermore, use --weight_only_precision int8 or --weight_only_precision int4 to configure the data type of the weights.

# Command-R: single gpu, int8 weight only quantization
python3 convert_checkpoint.py --model_dir command_r_v01 \
        --use_weight_only \
        --weight_only_precision int8 \
        --output_dir trt_ckpt/command_r_v01/int8_wo/1-gpu

# Command-R: single-gpu engine with int8 weight only quantization, GPT Attention plugin, Gemm plugin
trtllm-build --checkpoint_dir trt_ckpt/command_r_v01/int8_wo/1-gpu \
        --gemm_plugin float16 \
        --output_dir trt_engines/command_r_v01/int8_wo/1-gpu

# Run inference.
python3 ../run.py --max_output_len 50 \
        --tokenizer_dir command_r_v01 \
        --engine_dir trt_engines/command_r_v01/int8_wo/1-gpu
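
The same flow applies to INT4 weight-only quantization. A sketch of the corresponding commands, mirroring the INT8 example above (the int4_wo directory names are just a naming convention):

# Command-R: single gpu, int4 weight only quantization
python3 convert_checkpoint.py --model_dir command_r_v01 \
        --use_weight_only \
        --weight_only_precision int4 \
        --output_dir trt_ckpt/command_r_v01/int4_wo/1-gpu

# Command-R: single-gpu engine with int4 weight only quantization, GPT Attention plugin, Gemm plugin
trtllm-build --checkpoint_dir trt_ckpt/command_r_v01/int4_wo/1-gpu \
        --gemm_plugin float16 \
        --output_dir trt_engines/command_r_v01/int4_wo/1-gpu

# Run inference.
python3 ../run.py --max_output_len 50 \
        --tokenizer_dir command_r_v01 \
        --engine_dir trt_engines/command_r_v01/int4_wo/1-gpu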