Google Kubernetes Engine (GKE) provides fine-grained control for large language model (LLM) inference with optimal performance and cost. This guide describes best practices for optimizing inference and serving of open LLMs with GPUs on GKE using the vLLM and Text Generation Inference (TGI) serving frameworks.
For a summarized checklist of all the best practices, see the Checklist summary.
Objectives
This guide is intended for Generative AI customers, new or existing GKE users, ML Engineers, and LLMOps (DevOps) engineers who are interested in optimizing their LLM workloads using GPUs with Kubernetes.
By the end of this guide, you'll be able to:
- Choose post-training LLM optimization techniques, including quantization, tensor parallelism, and memory optimization.
- Weigh the high-level tradeoffs when considering these optimization techniques.
- Deploy open LLM models to GKE using serving frameworks such as vLLM or TGI with optimization settings enabled.
Overview of LLM serving optimization techniques
Unlike non-AI workloads, LLM workloads typically exhibit higher latency and lower throughput due to their reliance on matrix multiplication operations. To enhance LLM inference performance, you can use specialized hardware accelerators (for example, GPUs and TPUs) and optimized serving frameworks.
You can apply one or more of the following best practices to reduce LLM workload latency while improving throughput and cost-efficiency:
- Quantization
- Tensor parallelism
- Model memory optimization
The examples in this guide use the Gemma 7B LLM together with the vLLM or TGI serving frameworks to apply these best practices; however, the concepts and features described are applicable to most popular open LLMs.
Before you begin
Before you try the examples in this guide, complete these prerequisite tasks:
Follow the instructions in these guides to get access to the Gemma model, prepare your environment, and create and configure Google Cloud resources:
- Serve Gemma open models using GPUs on GKE with vLLM
- Serve Gemma open models using GPUs on GKE with Hugging Face TGI
Make sure to save the Hugging Face access token to your Kubernetes secret.
Clone the https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/ samples repository to your local development environment.
Change your working directory to /kubernetes-engine-samples/ai-ml/llm-serving-gemma/.
Best practice: Quantization
Quantization is a technique analogous to lossy image compression that reduces model size by representing weights in lower precision formats (8-bit or 4-bit), thus lowering memory requirements. However, like image compression, quantization involves a trade-off: decreased model size can lead to reduced accuracy.
Various quantization methods exist, each with its own advantages and disadvantages. Some, like AWQ and GPTQ, require pre-quantization and are available on platforms like Hugging Face or Kaggle. For example, applying GPTQ to the Llama-2 13B model or AWQ to the Gemma 7B model lets you serve each model on a single L4 GPU instead of the two L4 GPUs it would otherwise require.
You can also perform quantization yourself using tools like AutoAWQ and AutoGPTQ. These methods can improve latency and throughput. In contrast, techniques that use EETQ and the bitsandbytes library don't require pre-quantized models, so they can be a suitable choice when pre-quantized versions aren't available.
The best quantization technique to use depends on your specific goals, and the technique's compatibility with the serving framework you want to use. To learn more, see the Quantization guide from Hugging Face.
Select one of these tabs to see an example of applying quantization using the TGI or vLLM frameworks:
TGI
GKE supports these quantization options with TGI:
- awq
- gptq
- eetq
- bitsandbytes
- bitsandbytes-nf4
- bitsandbytes-fp4
AWQ and GPTQ quantization methods require pre-quantized models, while EETQ and bitsandbytes quantization can be applied to any model. To learn more about these options, see this Hugging Face article.
To use quantization, set the --quantize parameter when starting the model server.
The following snippet shows how to optimize Gemma 7B with bitsandbytes quantization using TGI on GKE:
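The manifest in the samples repository isn't reproduced here. As a rough sketch only, assuming an illustrative TGI image tag and a Kubernetes Secret named hf-secret with key hf_api_token for the access token (these names and the resource values are assumptions, not the exact contents of tgi/tgi-7b-bitsandbytes.yaml), the Deployment looks something like this:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-gemma-7b-bitsandbytes
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-tgi
  template:
    metadata:
      labels:
        app: gemma-tgi
    spec:
      containers:
      - name: inference-server
        # Illustrative image tag; use the TGI image referenced in the samples repository.
        image: ghcr.io/huggingface/text-generation-inference:2.0
        args:
        - --model-id=google/gemma-7b
        # bitsandbytes quantizes the weights at load time, so no pre-quantized model is needed.
        - --quantize=bitsandbytes
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret      # Secret created in the prerequisite guide (name assumed)
              key: hf_api_token
        resources:
          limits:
            nvidia.com/gpu: "1"
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
```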
To apply this configuration, use the following command:
kubectl apply -f tgi/tgi-7b-bitsandbytes.yaml
vLLM
GKE supports these quantization options with vLLM:
- gptq with 4-bit quantization (not supported for Gemma, but available for other models)
- awq
- squeezellm
- KV cache quantization using the FP8 (8-bit floating point) E5M2 and E4M3 formats
To use model quantization with vLLM, the models must be pre-quantized. When you start the runtime, set the --quantization parameter.
The following snippet shows how to optimize the Gemma 7B model with awq quantization using vLLM on GKE:
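As with the TGI example, the repository manifest isn't reproduced here. The following is a minimal sketch, assuming the public vllm/vllm-openai image, the same hf-secret Secret, and a placeholder model ID that you'd replace with an AWQ-quantized Gemma 7B checkpoint:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-7b-awq
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-vllm
  template:
    metadata:
      labels:
        app: gemma-vllm
    spec:
      containers:
      - name: inference-server
        # The vllm/vllm-openai image starts the OpenAI-compatible API server,
        # so the args below are passed directly to vLLM.
        image: vllm/vllm-openai:latest
        args:
        # AWQ requires a pre-quantized checkpoint; replace the placeholder with
        # the AWQ-quantized Gemma 7B model ID you want to serve.
        - --model=AWQ_QUANTIZED_GEMMA_7B_MODEL_ID
        - --quantization=awq
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret      # Secret holding the Hugging Face token (name assumed)
              key: hf_api_token
        resources:
          limits:
            nvidia.com/gpu: "1"
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
```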
To apply this configuration, use the following command:
kubectl apply -f vllm/vllm-7b-awq.yaml
Improve latency by using KV cache quantization
You can use FP8 E5M2 KV Cache quantization to significantly decrease KV cache memory footprint and improve latency, especially for large batch sizes. However, this reduces inference accuracy.
To enable FP8 E5M2 KV cache quantization, set the --kv-cache-dtype fp8_e5m2 parameter:
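Relative to the vLLM Deployment sketch above, only the server arguments change. As an illustration of what vllm/vllm-7b-kvcache.yaml configures (not its exact contents), the container's args would look like this:

```yaml
args:
- --model=google/gemma-7b
# Store the KV cache in FP8 E5M2 instead of 16-bit precision to roughly halve its footprint.
- --kv-cache-dtype=fp8_e5m2
```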
To apply this configuration, use the following command:
kubectl apply -f vllm/vllm-7b-kvcache.yaml
Best practice: Tensor parallelism
Tensor parallelism is a technique that distributes computational load across multiple GPUs, which is essential when you run large models that exceed single GPU memory capacity. This approach can be more cost-effective as it lets you use multiple affordable GPUs instead of a single expensive one. It can also enhance model inference throughput. Tensor parallelism leverages the fact that tensor operations can be performed independently on smaller data chunks.
To learn more about this technique, see the Tensor Parallelism guide from Hugging Face.
Select one of these tabs to see an example of applying tensor parallelism using the TGI or vLLM frameworks:
TGI
With TGI, the serving runtime uses all GPUs available to the Pod by default. To set the number of GPUs to use, specify the --num-shard parameter with the number of GPUs as the value.
See the Hugging Face documentation for the list of models supported for tensor parallelism.
The following snippet shows how to optimize the Gemma 7B instruction-tuned model using tensor parallelism and two L4 GPUs:
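The repository manifest isn't reproduced here; as a sketch under the same assumptions as the TGI Deployment sketch above, the container settings that matter for tensor parallelism are the shard count and the GPU request:

```yaml
args:
- --model-id=google/gemma-7b-it
# Shard the model across two GPUs using tensor parallelism.
- --num-shard=2
resources:
  limits:
    nvidia.com/gpu: "2"    # two NVIDIA L4 GPUs on the same node
```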
To apply this configuration, use the following command:
kubectl apply -f tgi/tgi-7b-it-tensorparallelism.yaml
In GKE Autopilot clusters, running this command creates a Pod with minimum resource requirements of 21 vCPU and 78 GiB memory.
vLLM
vLLM supports distributed tensor-parallel inference. To enable it, set the --tensor-parallel-size parameter to the number of GPUs to use.
The following snippet shows how you can optimize the Gemma 7B instruction-tuned model using tensor parallelism and two L4 GPUs:
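Sketching the equivalent container settings for vllm/vllm-7b-it-tensorparallelism.yaml (same assumptions as the vLLM Deployment sketch above; not the manifest's exact contents):

```yaml
args:
- --model=google/gemma-7b-it
# Split the model's weight tensors across two GPUs.
- --tensor-parallel-size=2
resources:
  limits:
    nvidia.com/gpu: "2"    # two NVIDIA L4 GPUs on the same node
```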
To apply this configuration, use the following command:
kubectl apply -f vllm/vllm-7b-it-tensorparallelism.yaml
In GKE Autopilot clusters, running this command creates a Pod with minimum resource requirements of 21 vCPU and 78 GiB memory.
Best practice: Model memory optimization
Optimizing the memory usage of LLMs is crucial for efficient inference. This section introduces attention layer optimization strategies, such as paged attention and flash attention. These strategies enhance memory efficiency, allowing for longer input sequences and reduced GPU idle time. This section also describes how you can adjust model input and output sizes to fit memory constraints and optimize for specific serving frameworks.
Attention layer optimization
Self-attention layers let models understand context in language processing tasks, because word meanings can shift depending on the surrounding text. However, these layers store input token weights, keys (K), and values (V) in GPU vRAM, so as the input sequence grows longer, memory use and computation time grow quadratically.
Using KV caching is particularly useful for long input sequences, where the overhead of self-attention would otherwise become significant. By reusing the keys and values computed for earlier tokens, this optimization reduces the per-token computation to linear complexity.
Specific techniques for optimizing attention mechanisms in LLMs include:
- Paged attention: Paged attention improves memory management for large models and long input sequences by using paging techniques, similar to OS virtual memory. This effectively reduces fragmentation and duplication in the KV cache, allowing for longer input sequences without running out of GPU memory.
- Flash attention: Flash attention reduces GPU memory bottlenecks by minimizing data transfers between GPU RAM and L1 cache during token generation. This eliminates idle time for computing cores, significantly improving inference and training performance for GPUs.
Model input and output size tuning
Memory requirements depend on input and output size. Longer output and more context require more resources, while shorter output and less context can save costs by using a smaller, cheaper GPU.
Select one of these tabs to see an example of tuning the model input and output memory requirements in the TGI or vLLM frameworks:
TGI
The TGI serving runtime checks memory requirements during startup and doesn't start if the maximum possible model memory footprint doesn't fit into the available GPU memory. This check helps prevent out-of-memory (OOM) crashes in memory-intensive workloads.
GKE supports the following TGI parameters for optimizing model memory requirements:
- --max-input-length: the maximum allowed length, in tokens, of a request's input.
- --max-total-tokens: the maximum total tokens (input plus generated output) per request.
- --max-batch-prefill-tokens: the maximum number of tokens processed in a prefill batch.
The following snippet shows how you can serve a Gemma 7B instruction-tuned model with a single L4 GPU, with the parameter settings --max-total-tokens=3072, --max-batch-prefill-tokens=512, and --max-input-length=512:
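As a sketch of the relevant container settings in tgi/tgi-7b-token.yaml (other fields follow the TGI Deployment sketch above, and are not the manifest's exact contents):

```yaml
args:
- --model-id=google/gemma-7b-it
# Cap the input, prefill batch, and total token counts so the memory footprint fits one L4 GPU.
- --max-input-length=512
- --max-batch-prefill-tokens=512
- --max-total-tokens=3072
resources:
  limits:
    nvidia.com/gpu: "1"
```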
To apply this configuration, use the following command:
kubectl apply -f tgi/tgi-7b-token.yaml
vLLM
In vLLM, configure the model's context length, which directly impacts the KV cache size and GPU RAM requirements. Smaller context lengths allow for the use of more affordable GPUs. The default value is the maximum number of tokens the model accepts. Limit the maximum context length with --max-model-len MAX_MODEL_LEN if needed.
For example, the Gemma 7B instruction-tuned model, with its default context length of 8192, exceeds the memory capacity of a single NVIDIA L4 GPU. To deploy on an L4, limit the combined length of prompts and outputs by setting --max-model-len to a value under 640. This adjustment enables running the model on a single L4 GPU despite its large default context length.
To deploy with the modified token limit, use the following snippet:
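The repository manifest isn't reproduced here; as a sketch of the relevant container settings (the exact limit in vllm/vllm-7b-token.yaml may differ, but any value under 640 follows the guidance above):

```yaml
args:
- --model=google/gemma-7b-it
# Limit the combined prompt and output length so the KV cache fits a single L4 GPU.
- --max-model-len=512
resources:
  limits:
    nvidia.com/gpu: "1"
```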
To apply this configuration, use the following command:
kubectl apply -f vllm/vllm-7b-token.yaml
Checklist summary
Optimization goal | Practice |
---|---|
Latency | Use quantization (for example, FP8 E5M2 KV cache quantization in vLLM or EETQ in TGI), and rely on attention layer optimizations such as paged attention and flash attention. |
Throughput | Use tensor parallelism to serve the model across multiple GPUs, and use pre-quantized models (AWQ or GPTQ) to fit larger batches in GPU memory. |
Cost-efficiency | Use quantization to serve the model on fewer or smaller GPUs, and tune input and output token limits (for example, --max-model-len in vLLM or --max-total-tokens in TGI) so the workload fits on a more affordable GPU. |
What's next
- For an end-to-end guide that covers container configuration, refer to Serve an LLM with multiple GPUs on GKE.
- If you need a cloud-managed LLM serving solution, deploy your model through Vertex AI Model Garden.