Name	Name	Last commit message	Last commit date
parent directory ..
README.md	README.md

Name

Last commit message

Last commit date

Granite

This document shows how to build and run a Granite 3.0 model in TensorRT-LLM.

The TensorRT-LLM Granite implementation is based on the LLaMA model, with Mixture of Experts (MoE) enabled. The implementation can be found in llama/model.py. See the LLaMA example examples/llama for details.

Granite 3.0

Download model checkpoints

First, download the HuggingFace BF16 checkpoints of Granite 3.0 model.

HF_MODEL="granite-3.0-8b-instruct" # or granite-3.0-3b-a800m-instruct
# clone the model we want to build
git clone https://huggingface.co/ibm-granite/${HF_MODEL} tmp/hf_checkpoints/${HF_MODEL}

Convert weights from HF Transformers to TensorRT-LLM format

Set environment variables and necessary directory:

PREC_RAW="bfloat16"
TP=1
mkdir -p tmp/trt_engines

BF16

Convert the weights using the convert_checkpoint.py script:

ENGINE="${HF_MODEL}_${PREC_RAW}_tp${TP}"
export TRTLLM_DISABLE_UNIFIED_CONVERTER=1  # The current checkpoint conversion code requires legacy path
python3 ../llama/convert_checkpoint.py --model_dir tmp/hf_checkpoints/${HF_MODEL} \
                                       --output_dir tmp/tllm_checkpoints/${ENGINE} \
                                       --dtype ${PREC_RAW} \
                                       --tp_size ${TP} \
                                       --use_embedding_sharing

FP8 PTQ

Notes:

Currently quantize.py does not support Expert Parallelism (EP) mode yet. User should use ../llama/convert_checkpoint.py and specify --moe_ep_size 1 instead, if needed.
TensorRT-LLM uses static quantization methods, which is expected to be faster at runtime as compared to dynamic quantization methods. This comes at a cost of an offline calibration step during quantization. batch_size and calib_size can be adjusted to shorten the calibration time. Please refer to ../quantization/README.md for explanation.

PREC_QUANT="fp8"
ENGINE="${HF_MODEL}_${PREC_QUANT}_tp${TP}"
python ../quantization/quantize.py --model_dir tmp/hf_checkpoints/${HF_MODEL} \
                                   --dtype ${PREC_RAW} \
                                   --qformat ${PREC_QUANT} \
                                   --kv_cache_dtype ${PREC_QUANT} \
                                   --output_dir tmp/tllm_checkpoints/${ENGINE} \
                                   --batch_size 1 \
                                   --calib_size 128 \
                                   --tp_size ${TP}

Build TensorRT engine

# Enable fp8 context fmha to get further acceleration by setting `--use_fp8_context_fmha enable`
# Use --workers to enable parallel build
trtllm-build --checkpoint_dir ./tmp/tllm_checkpoints/${ENGINE} \
             --output_dir ./tmp/trt_engines/${ENGINE} \
             --gpt_attention_plugin ${PREC_RAW} \
             --gemm_plugin ${PREC_RAW} \
             --workers ${TP}

Run Engine

Test your engine with the run.py script:

mpirun -n ${TP} --allow-run-as-root python ../run.py --engine_dir ./tmp/trt_engines/${ENGINE} --tokenizer_dir tmp/hf_checkpoints/${HF_MODEL} --max_output_len 20 --input_text "The future of AI is"

For more usage examples see examples/llama/README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Files

granite

granite

README.md

Granite

Download model checkpoints

Convert weights from HF Transformers to TensorRT-LLM format

BF16

FP8 PTQ

Build TensorRT engine

Run Engine

Files

granite

Directory actions

More options

Directory actions

More options

Latest commit

History

granite

Folders and files

parent directory

README.md

Granite

Download model checkpoints

Convert weights from HF Transformers to TensorRT-LLM format

BF16

FP8 PTQ

Build TensorRT engine

Run Engine