Granite

This document shows how to build and run a Granite 3.0 model in TensorRT-LLM.

The TensorRT-LLM Granite implementation is based on the LLaMA model, with Mixture of Experts (MoE) enabled. The implementation can be found in llama/model.py; see the LLaMA example in examples/llama for details.

Download model checkpoints

First, download the HuggingFace BF16 checkpoint of the Granite 3.0 model:

HF_MODEL="granite-3.0-8b-instruct" # or granite-3.0-3b-a800m-instruct
# clone the model we want to build
git clone https://huggingface.co/ibm-granite/${HF_MODEL} tmp/hf_checkpoints/${HF_MODEL}
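
Note that the checkpoint weights are stored with Git LFS. If the clone only produced small pointer files, install git-lfs and pull the actual weights (standard Git LFS commands, unrelated to TensorRT-LLM):

# Fetch the real weight files in case only LFS pointers were cloned
git lfs install
git -C tmp/hf_checkpoints/${HF_MODEL} lfs pull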

Convert weights from HF Transformers to TensorRT-LLM format

Set the environment variables and create the necessary directory:

PREC_RAW="bfloat16"
TP=1
mkdir -p tmp/trt_engines

BF16

Convert the weights using the convert_checkpoint.py script:

ENGINE="${HF_MODEL}_${PREC_RAW}_tp${TP}"
export TRTLLM_DISABLE_UNIFIED_CONVERTER=1  # The current checkpoint conversion code requires the legacy path
python3 ../llama/convert_checkpoint.py --model_dir tmp/hf_checkpoints/${HF_MODEL} \
                                       --output_dir tmp/tllm_checkpoints/${ENGINE} \
                                       --dtype ${PREC_RAW} \
                                       --tp_size ${TP} \
                                       --use_embedding_sharing

FP8 PTQ

Notes:

  • Currently, quantize.py does not yet support Expert Parallelism (EP) mode. If needed, use ../llama/convert_checkpoint.py and specify --moe_ep_size 1 instead (see the sketch after the FP8 command below).
  • TensorRT-LLM uses static quantization methods, which are expected to be faster at runtime than dynamic quantization methods. This comes at the cost of an offline calibration step during quantization. batch_size and calib_size can be adjusted to shorten the calibration time. Please refer to ../quantization/README.md for an explanation.

PREC_QUANT="fp8"
ENGINE="${HF_MODEL}_${PREC_QUANT}_tp${TP}"
python ../quantization/quantize.py --model_dir tmp/hf_checkpoints/${HF_MODEL} \
                                   --dtype ${PREC_RAW} \
                                   --qformat ${PREC_QUANT} \
                                   --kv_cache_dtype ${PREC_QUANT} \
                                   --output_dir tmp/tllm_checkpoints/${ENGINE} \
                                   --batch_size 1 \
                                   --calib_size 128 \
                                   --tp_size ${TP}
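
As referenced in the notes above, when quantize.py's lack of EP support is a blocker, the legacy converter can be used with an explicit --moe_ep_size. The following is a minimal sketch that reuses the BF16 conversion flags from above; the value 1 follows the note, so adjust it to your parallelism layout:

# Sketch: legacy conversion path with --moe_ep_size set explicitly (per the note above)
ENGINE="${HF_MODEL}_${PREC_RAW}_tp${TP}"
export TRTLLM_DISABLE_UNIFIED_CONVERTER=1
python3 ../llama/convert_checkpoint.py --model_dir tmp/hf_checkpoints/${HF_MODEL} \
                                       --output_dir tmp/tllm_checkpoints/${ENGINE} \
                                       --dtype ${PREC_RAW} \
                                       --tp_size ${TP} \
                                       --moe_ep_size 1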

Build TensorRT engine

# Enable fp8 context fmha to get further acceleration by setting `--use_fp8_context_fmha enable`
# Use --workers to enable parallel build
trtllm-build --checkpoint_dir ./tmp/tllm_checkpoints/${ENGINE} \
             --output_dir ./tmp/trt_engines/${ENGINE} \
             --gpt_attention_plugin ${PREC_RAW} \
             --gemm_plugin ${PREC_RAW} \
             --workers ${TP}
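
For an FP8 checkpoint, the comment above suggests enabling FP8 context FMHA. Below is a sketch of the same build command with that flag added and all other flags unchanged:

# Same build as above, with FP8 context FMHA enabled for additional acceleration
trtllm-build --checkpoint_dir ./tmp/tllm_checkpoints/${ENGINE} \
             --output_dir ./tmp/trt_engines/${ENGINE} \
             --gpt_attention_plugin ${PREC_RAW} \
             --gemm_plugin ${PREC_RAW} \
             --use_fp8_context_fmha enable \
             --workers ${TP}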

Run Engine

Test your engine with the run.py script:

mpirun -n ${TP} --allow-run-as-root \
    python ../run.py --engine_dir ./tmp/trt_engines/${ENGINE} \
                     --tokenizer_dir tmp/hf_checkpoints/${HF_MODEL} \
                     --max_output_len 20 \
                     --input_text "The future of AI is"

For more usage examples, see examples/llama/README.md.
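
For instance, the summarization test used in the LLaMA example can be pointed at the Granite engine. The command below is a sketch only; it assumes the summarize.py script one directory up accepts the same flags as in the LLaMA example:

# Assumed invocation, mirroring the LLaMA example's summarize.py usage
mpirun -n ${TP} --allow-run-as-root \
    python ../summarize.py --test_trt_llm \
                           --hf_model_dir tmp/hf_checkpoints/${HF_MODEL} \
                           --engine_dir ./tmp/trt_engines/${ENGINE}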