
Sample Weight-Stripping

Overview
- Build Weights Stripped Engine
- Engine Refitter
Prerequisites
Weight-Stripping Workflow Example
Engine Plan File Size Results
Experimental
- Checkpoint Pruner
  - Pruning a TensorRT-LLM Checkpoint

Overview

This workflow introduces a new script trtllm-refit. trtllm-refit allows you to refit the generated engine with weights from any TensorRT-LLM checkpoint matching the same architecture, so long as you build the engine as refittable or stripped.

Build Weights Stripped Engine

TensorRT can generate refittable engines with the same performance as non-refittable ones when the TensorRT builder optimizes under the assumption that the engine will be refitted with weights identical to those provided at build time. Those refittable weights can be stripped to reduce the engine plan file size and supplied later through the refit interface.

trtllm-build introduces a new option, --strip_plan, for this purpose:

trtllm-build --strip_plan --checkpoint_dir ${CHECKPOINT_DIR} --output_dir ${ENGINE_DIR} ...

Engine Refitter

The refitter allows you to refit an engine with the weights in a TensorRT-LLM checkpoint. It does this by textually matching engine weight names against checkpoint weight names. For the refitter to work, the engine must be built with refitting enabled, which you can do by passing --strip_plan to trtllm-build.

After building a stripped engine via trtllm-build, run

trtllm-refit --checkpoint_dir ${CHECKPOINT_DIR} --engine_dir ${ENGINE_DIR}
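
The name matching described above can be pictured with a small sketch. This is not the trtllm-refit implementation; it only illustrates pairing weights by identical name and reporting the ones a checkpoint does not provide. The function `match_weights` and the weight names are hypothetical.

```python
def match_weights(engine_names, checkpoint):
    """Pair each refittable engine weight with the checkpoint tensor of the same name.

    Hypothetical helper for illustration only, not the trtllm-refit code.
    """
    matched = {name: checkpoint[name] for name in engine_names if name in checkpoint}
    # Engine weights with no same-named checkpoint entry cannot be refitted.
    missing = sorted(set(engine_names) - set(checkpoint))
    return matched, missing


# Hypothetical weight names; real names depend on the model architecture.
engine_names = ["transformer.layers.0.attention.qkv.weight", "lm_head.weight"]
checkpoint = {
    "transformer.layers.0.attention.qkv.weight": [0.1, 0.2],
    "transformer.vocab_embedding.weight": [0.3],
}

matched, missing = match_weights(engine_names, checkpoint)
print("matched:", sorted(matched))
print("missing:", missing)
```

A match on names alone is why the checkpoint must have the same architecture as the one used to build the engine: a different architecture would leave engine weights without a counterpart.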

Prerequisites

Install TensorRT-LLM either through pip or from the source (Linux, Windows).

Weight-Stripping Workflow Example

GPT-J

Download the weights.

# 1. Weights & config
git clone https://huggingface.co/EleutherAI/gpt-j-6b
pushd gpt-j-6b && \
  rm -f pytorch_model.bin && \
  wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/pytorch_model.bin && \
popd

# 2. Vocab and merge table
wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/vocab.json
wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/merges.txt

Convert the Hugging Face checkpoint into TensorRT-LLM format. Run the commands below in the examples/gptj directory.

# Build a float16 checkpoint using HF weights.
python convert_checkpoint.py --model_dir ./gpt-j-6b \
                                     --dtype float16 \
                                     --output_dir ./trt_ckpt/gptj_fp16_tp1/

# Build an int8 weight-only checkpoint using HF weights.
python convert_checkpoint.py --model_dir ./gpt-j-6b \
                                     --dtype float16 \
                                     --use_weight_only \
                                     --weight_only_precision int8 \
                                     --output_dir ./trt_ckpt/gptj_int8_tp1/

Build the weights stripped engine.

# Build with --strip_plan. Requires TRT>=10.0.0
trtllm-build --checkpoint_dir ./trt_ckpt/gptj_fp16_tp1/ \
             --output_dir ./trt_engines/gptj_fp16_tp1/ \
             --gemm_plugin float16 \
             --max_batch_size=32 \
             --max_input_len=1919 \
             --max_seq_len=2047 \
             --strip_plan

Refit the engine. The refit engine lives at ${ENGINE_DIR}.refit.

# --checkpoint_dir points to the weights you want to refit the engine with, in this case the original weights.
trtllm-refit --checkpoint_dir ./trt_ckpt/gptj_fp16_tp1/ --engine_dir ./trt_engines/gptj_fp16_tp1/ --output_dir ./trt_engines/gptj_fp16_tp1.refit/

Verify the engine.

# Run the summarization task.
python3 ../summarize.py --engine_dir ./trt_engines/gptj_fp16_tp1.refit \
                        --hf_model_dir ./gpt-j-6b \
                        --batch_size 1 \
                        --test_trt_llm \
                        --tensorrt_llm_rouge1_threshold 14 \
                        --data_type fp16 \
                        --check_accuracy
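
The build and refit steps above can also be driven from Python. The following is a minimal sketch that only assembles the command lines; it uses the flags shown in this README and nothing else. The helper name `weight_stripping_commands` is hypothetical, and the default output directory mirrors the ${ENGINE_DIR}.refit convention mentioned above.

```python
import shlex


def weight_stripping_commands(checkpoint_dir, engine_dir, refit_dir=None):
    """Return the trtllm-build and trtllm-refit commands as argv lists.

    Hypothetical helper: it only assembles flags that appear in this README.
    """
    engine_dir = engine_dir.rstrip("/")
    # Follow the ${ENGINE_DIR}.refit convention unless told otherwise.
    refit_dir = refit_dir or engine_dir + ".refit"
    build = ["trtllm-build", "--checkpoint_dir", checkpoint_dir,
             "--output_dir", engine_dir, "--strip_plan"]
    refit = ["trtllm-refit", "--checkpoint_dir", checkpoint_dir,
             "--engine_dir", engine_dir, "--output_dir", refit_dir]
    return [build, refit]


commands = weight_stripping_commands("./trt_ckpt/gptj_fp16_tp1/",
                                     "./trt_engines/gptj_fp16_tp1/")
for argv in commands:
    # Print a copy-pasteable shell command for each step.
    print(shlex.join(argv))
```

To execute the steps rather than print them, each argv list could be passed to `subprocess.run(argv, check=True)`, assuming TensorRT-LLM is installed and both commands are on the PATH. Model-specific build flags such as --gemm_plugin are omitted here and would need to be appended.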

Llama-7b INT4

Download the llama-7b-hf checkpoint and save it in /llm-models/llama-models/llama-7b-hf/.
Calibrate the checkpoint and convert it into TensorRT-LLM format. Run the commands below in the examples/llama directory.

# Calibrate INT4 using AMMO.
python ../quantization/quantize.py --model_dir  /llm-models/llama-models/llama-7b-hf/ \
               --dtype float16 \
               --qformat int4_awq \
               --awq_block_size 128 \
               --output_dir ./quantized_int4-awq \
               --calib_size 32

Build the weights stripped engine.

# Build with --strip_plan. Requires TRT>=10.0.0
trtllm-build --checkpoint_dir ./quantized_int4-awq \
                --strip_plan \
                --gemm_plugin float16 \
                --output_dir trt_int4_AWQ

Refit the engine.

trtllm-refit --checkpoint_dir ./quantized_int4-awq \
                --engine_dir trt_int4_AWQ \
                --output_dir trt_int4_AWQ_full_from_wtless

Verify the engine.

python3 ../summarize.py --engine_dir trt_int4_AWQ_full_from_wtless \
                --hf_model_dir /llm-models/llama-models/llama-7b-hf/ \
                --batch_size 1 \
                --test_trt_llm \
                --check_accuracy

Llama-7b FP16 + WoQ INT8

Download the llama-7b-hf checkpoint and save it in /llm-models/llama-models/llama-7b-hf/.
Convert the checkpoint into TensorRT-LLM format. Run the commands below in the examples/llama directory.

python3 convert_checkpoint.py --model_dir /llm-models/llama-models/llama-7b-hf/ \
                --output_dir ./llama-7b-hf-fp16-woq \
                --dtype float16 \
                --use_weight_only \
                --weight_only_precision int8

Build the weights stripped engine.

# Build with --strip_plan. Requires TRT>=10.0.0
trtllm-build --checkpoint_dir ./llama-7b-hf-fp16-woq \
                --output_dir ./engines/llama-7b-hf-fp16-woq-1gpu-wtless \
                --strip_plan \
                --gemm_plugin float16

Refit the engine.

trtllm-refit --checkpoint_dir ./llama-7b-hf-fp16-woq \
                --engine_dir ./engines/llama-7b-hf-fp16-woq-1gpu-wtless \
                --output_dir ./engines/llama-7b-hf-fp16-woq-1gpu-wtless-to-full

Verify the engine.

python3 ../summarize.py --engine_dir ./engines/llama-7b-hf-fp16-woq-1gpu-wtless-to-full \
                --hf_model_dir /llm-models/llama-models/llama-7b-hf/ \
                --batch_size 1 \
                --test_trt_llm \
                --check_accuracy

Llama2-70b FP8 with TP=2

Download the llama-v2-70b-hf checkpoint and save it in /llm-models/llama-models-v2/llama-v2-70b-hf/.
Calibrate the checkpoint and convert it into TensorRT-LLM format. Run the commands below in the examples/llama directory.

# Calibrate FP8 using AMMO.
python ../quantization/quantize.py --model_dir /llm-models/llama-models-v2/llama-v2-70b-hf/ \
               --dtype float16 \
               --qformat fp8 \
               --kv_cache_dtype fp8 \
               --output_dir ./llama2-70b-hf-fp8-tp2 \
               --calib_size 512 \
               --tp_size 2

Build the weights stripped engine.

# Build with --strip_plan. Requires TRT>=10.0.0
trtllm-build --checkpoint_dir ./llama2-70b-hf-fp8-tp2 \
                --output_dir engines/llama2-70b-hf-fp8-tp2 \
                --strip_plan \
                --gemm_plugin float16 \
                --workers 2

Refit the engine.

trtllm-refit --checkpoint_dir ./llama2-70b-hf-fp8-tp2 \
                --engine_dir engines/llama2-70b-hf-fp8-tp2 \
                --output_dir engines/llama2-70b-hf-fp8-tp2.refit

Verify the engine.

python3 ../summarize.py --engine_dir engines/llama2-70b-hf-fp8-tp2.refit \
                --hf_model_dir /llm-models/llama-models-v2/llama-v2-70b-hf/ \
                --batch_size 1 \
                --test_trt_llm \
                --check_accuracy

Engine Plan File Size Results

Model	Full Engine Plan Size	Weight-Stripped Engine Plan Size
llama-7b INT4	3.7GB	5.3MB
llama-7b FP16 + WoQ INT8	6.54GB	28.69MB
llama2-70b FP8 + TP=2	64.78GB	60.61MB
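
To put the table in relative terms, the sketch below computes how much smaller each stripped plan is than the full one. The sizes are copied from the table; the helper names are hypothetical, and the conversion assumes 1 GB = 1024 MB, which the table does not state.

```python
import re

# Assumption: binary multiples (1 GB = 1024 MB); the table does not specify.
MB_PER_UNIT = {"MB": 1.0, "GB": 1024.0}


def to_mb(size):
    """Convert a string such as '3.7GB' or '5.3MB' to megabytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(MB|GB)", size)
    if match is None:
        raise ValueError(f"unrecognized size: {size!r}")
    return float(match.group(1)) * MB_PER_UNIT[match.group(2)]


def reduction_percent(full, stripped):
    """Percentage by which stripping shrinks the engine plan file."""
    return 100.0 * (1.0 - to_mb(stripped) / to_mb(full))


# (full plan size, weight-stripped plan size), copied from the table above.
results = {
    "llama-7b INT4": ("3.7GB", "5.3MB"),
    "llama-7b FP16 + WoQ INT8": ("6.54GB", "28.69MB"),
    "llama2-70b FP8 + TP=2": ("64.78GB", "60.61MB"),
}

for model, (full, stripped) in results.items():
    print(f"{model}: {reduction_percent(full, stripped):.2f}% smaller")
```

Under this assumption every stripped plan in the table comes out more than 99% smaller than its full counterpart; using decimal units instead changes the figures only slightly.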

Experimental

Checkpoint Pruner

The checkpoint pruner allows you to strip Conv and Gemm weights out of a TensorRT-LLM checkpoint. Since these make up the vast majority of the weights, the pruner can decrease the size of your checkpoint by up to 99%.

When building an engine with a pruned checkpoint, TensorRT-LLM fills in the missing weights with random ones. These weights should later be refit with the original weights to preserve the intended behavior.

Building an engine from a pruned checkpoint will also allow the engine to be refit.
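
The size claim can be illustrated with a toy model of a checkpoint as a map from tensor name to element count. This is not the trtllm-prune implementation: the helper `prune_checkpoint`, the tensor names, the shapes, and the rule used to decide which tensors count as Gemm weights are all assumptions made for the example.

```python
def prune_checkpoint(sizes, is_prunable):
    """Drop prunable tensors and report the fraction of elements removed.

    Hypothetical helper; `sizes` maps tensor name to element count.
    """
    kept = {name: n for name, n in sizes.items() if not is_prunable(name)}
    total = sum(sizes.values())
    removed = total - sum(kept.values())
    return kept, removed / total


def is_gemm_weight(name):
    # Assumed naming rule for this example only: matrix weights end in
    # ".weight" and are not layer-norm parameters.
    return name.endswith(".weight") and "layernorm" not in name


# Hypothetical tensors for one transformer layer with hidden size 4096.
sizes = {
    "layers.0.attention.qkv.weight": 4096 * 12288,
    "layers.0.attention.dense.weight": 4096 * 4096,
    "layers.0.attention.qkv.bias": 12288,
    "layers.0.input_layernorm.weight": 4096,
}

kept, fraction = prune_checkpoint(sizes, is_gemm_weight)
print("kept tensors:", sorted(kept))
print(f"fraction of elements removed: {fraction:.4f}")
```

In this toy layer the two matrix weights hold almost all of the elements, which is why removing only those tensors shrinks the checkpoint so sharply while biases and normalization parameters remain in place.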

Pruning a TensorRT-LLM Checkpoint

Install TensorRT-LLM either through pip or from the source (Linux, Windows).
Download a model of your choice and convert it to a TensorRT-LLM checkpoint (llama instructions).
(Optional) Run the trtllm-prune command.

# Prunes the TRT-LLM checkpoint at ${CHECKPOINT_DIR}, and stores it in the directory ${CHECKPOINT_DIR}.pruned
trtllm-prune --checkpoint_dir ${CHECKPOINT_DIR}

By default, the pruned checkpoint is written to ${CHECKPOINT_DIR}.pruned. You can override this location with the --out_dir flag.

Build the stripped engine.

# From pruned checkpoint.
trtllm-build --checkpoint_dir ${CHECKPOINT_DIR}.pruned \
             --output_dir ${ENGINE_OUT_DIR} \
             ${EXTRA_ARGS}
