This example implements a NanoGPT model using Tripy:

- `model.py` defines the model as an `nvtripy.Module`.
- `weight_loader.py` loads weights from a HuggingFace checkpoint.
- `example.py` runs inference in `float16` on input text and displays the output.
- Install prerequisites:

  ```bash
  python3 -m pip install -r requirements.txt
  ```

- Run the example:

  ```bash
  python3 example.py --input-text "What is the answer to life, the universe, and everything?"
  ```

- [Optional] Use a fixed seed for predictable outputs:

  ```bash
  python3 example.py --input-text "What is the answer to life, the universe, and everything?" --seed=0
  ```
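To see why a fixed seed makes the generated text reproducible, note that next-token sampling draws from a softmax distribution over logits, and seeding the RNG fixes those draws. Below is a minimal, hypothetical sketch with a toy vocabulary, not the actual sampler in `example.py`:

```python
import math
import random

def sample_token(logits, rng):
    # Softmax over the logits, then one draw from the resulting distribution.
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    r = rng.random()
    acc = 0.0
    for tok, e in enumerate(exps):
        acc += e / total
        if r <= acc:
            return tok
    return len(exps) - 1  # guard against floating-point round-off

logits = [0.1, 2.0, -1.0, 0.5]  # toy 4-token vocabulary

# Two RNGs seeded identically (the effect of passing --seed=0 twice).
rng_a = random.Random(0)
rng_b = random.Random(0)
seq_a = [sample_token(logits, rng_a) for _ in range(8)]
seq_b = [sample_token(logits, rng_b) for _ in range(8)]
# Identically seeded runs produce identical token sequences.
```

Without a seed, each run draws different samples, so the generated continuation varies from run to run.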
`quantization.py` uses NVIDIA TensorRT Model Optimizer to quantize the PyTorch model. `load_quant_weights_from_hf` in `weight_loader.py` converts the quantization parameters to scales and loads them into the Tripy model.
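The conversion from quantization parameters to scales can be sketched as follows. This is an illustrative, hypothetical helper (not the actual `weight_loader.py` code), assuming the checkpoint stores symmetric absolute-maximum ("amax") calibration values, which map to scales by dividing by the largest representable quantized magnitude:

```python
def amax_to_scale(amax: float, num_bits: int = 8) -> float:
    """Map an absolute-maximum calibration value to a symmetric quant scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8, 7 for int4
    return amax / qmax

def quantize(x: float, scale: float, num_bits: int = 8) -> int:
    qmax = 2 ** (num_bits - 1) - 1
    q = round(x / scale)
    return max(-qmax - 1, min(qmax, q))  # clamp to the signed integer range

def dequantize(q: int, scale: float) -> float:
    return q * scale

# Round-trip a weight through int8 quantization.
scale = amax_to_scale(0.5)  # scale for weights calibrated with |w| <= 0.5
w = 0.25
w_hat = dequantize(quantize(w, scale), scale)
# w_hat differs from w by at most half a quantization step (scale / 2).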
Use `--quant-mode` in `example.py` to enable quantization. Supported modes:
- Weight-only `int8` quantization:

  ```bash
  python3 example.py --input-text "What is the answer to life, the universe, and everything?" --seed=0 --quant-mode int8-weight-only
  ```
- Weight-only `int4` quantization:

  ```bash
  python3 example.py --input-text "What is the answer to life, the universe, and everything?" --seed=0 --quant-mode int4-weight-only
  ```

  > **Warning:** For this model, `int4` quantization may result in poor accuracy. We include it only to demonstrate the workflow.
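The accuracy warning above comes down to quantization step size: `int4` has only 16 levels versus 256 for `int8`, so its rounding error is roughly an order of magnitude larger. A small self-contained comparison (illustrative only, not part of the example code) on random weights:

```python
import random

random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(1024)]
amax = max(abs(w) for w in weights)  # symmetric calibration value

def mean_abs_error(num_bits: int) -> float:
    """Mean absolute error of symmetric weight-only quantize/dequantize."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8, 7 for int4
    scale = amax / qmax
    err = 0.0
    for w in weights:
        q = max(-qmax - 1, min(qmax, round(w / scale)))
        err += abs(w - q * scale)
    return err / len(weights)

err8 = mean_abs_error(8)
err4 = mean_abs_error(4)
# The int4 step size is ~18x coarser, so err4 is far larger than err8.
```

The same error gap applies to the real model's weight matrices, which is why `int4-weight-only` can noticeably degrade generation quality here.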