HLSTransform: Energy-Efficient Llama 2 Inference on FPGAs Via High Level Synthesis
Andy He*, Darren Key*, Mason Bulling*, Andrew Chang*, Skyler Shapiro*, Everett Lee
1. Introduction
For applications requiring real-time inference on the edge, in addition to monetary reasons, a dedicated GPU is often impractical, as it cannot draw sufficient and sustained power.

While GPU acceleration will likely remain dominant in the near future despite the power disadvantage, there is value in exploring different avenues of hardware acceleration as deep learning tasks continue to diverge into highly specific applications. Further, as transformers become more and more ubiquitous, there is a case to be made for designing model-specific hardware accelerators solely to optimize inference. To that end, Field Programmable Gate Arrays (FPGAs) are another desirable choice for accelerators, as they offer hardware that is reconfigurable for specific tasks through a large number of programmable logic gates, making them inexpensive to iterate hardware designs on. Furthermore, FPGAs are distinguished by their reduced power consumption, which on average is only 28% of that of GPUs (Cong et al., 2018).

What limits the adoption of FPGAs currently is the high barrier of entry and the relative lack of research compared to GPUs. FPGAs are commonly used to prototype hardware designs for system-on-chip (SoC) and Application Specific Integrated Circuit (ASIC) development, which is typically done at the register-transfer level (RTL) using hardware description languages like Verilog. However, the design and verification of RTL modules are known to be extremely complex and time-consuming. High Level Synthesis (HLS) is a methodology that seeks to address that complexity by allowing developers to write hardware descriptions in more accessible, high-level languages like C or C++. HLS tools convert the high-level code into RTL code that is optimized for performance, area, and energy consumption, leading to faster prototyping and iteration on FPGAs. Furthermore, the nature of HLS tools and the availability of Vitis C/RTL co-simulation make it simple to verify the correctness of the synthesized hardware designs; these factors allow HLS to significantly shorten the traditional hardware development cycle.

In this work, we employ HLS tools to design FPGAs for accelerating Llama 2 inference. In addition to the large GPU power footprint of LLMs that may be addressed with FPGAs, the complex data flow of transformer models (Li et al., 2020) often comprises nonlinearities or token encoding subroutines (such as RoPE) that are difficult to accelerate on GPUs but may be better suited to FPGAs. Llama 2 is chosen in particular due to its open-source implementations and superb performance (Touvron et al., 2023b), making it a popular and well-researched choice. We use Andrej Karpathy's llama2.c repository (Karpathy, 2023) to develop our methods on a relatively small (110M parameter) model to accommodate our financial and compute constraints. We focus on inference over training due to its higher energy usage and greater suitability for FPGAs.

In summary, through our methods, which we name HLSTransform, we demonstrate the following:

1. Low power and energy consumption
Energy savings of up to a 12.75x reduction in total energy consumption compared to CPU and an 8.25x reduction in total energy consumption compared to GPU.

2. Fast inference speeds and low latency
Acceleration of up to 2.46x in inference speed compared to CPU, while maintaining up to 0.53x the inference speed of GPU, despite the GPU having a 4x higher base clock rate.

3. Verification of HLS tools for faster deployment
Ensuring HLS tools run properly to synthesize appropriate FPGA designs for this study. We also assess the learning curve of the tools for any developer without an extensive hardware background.

We open-source our code and document our FPGA synthesis to the public, available in our GitHub repo here: github.com/HLSTransform/submission. To the best of our knowledge, our model is one of the first open-source HLS-based implementations for transformers. In our research process, the lack of documentation for many steps of the process, combined with the absence of existing open-source FPGA accelerators for transformers, served as a high barrier to entry, and we hope our work serves as a step forward in democratizing the usage and research of FPGAs for transformer inference.

2. Related Work

We delineate a few studies that relate to FPGA accelerators for transformers and the application of high level synthesis.

2.1. Existing Hardware Accelerators for Transformers on FPGA

Existing hardware accelerators for transformers on FPGA incorporate specialized techniques to optimize performance on FPGAs. Column Balanced Block Pruning (Peng et al., 2021) and FTrans (Li et al., 2020) are two novel frameworks for transformer models suitable for FPGA acceleration. By incorporating weight pruning to employ sparse matrix multiplication, these papers are able to achieve multiple folds of improvement in transformer inference compared to CPUs and GPUs in terms of performance and energy efficiency. We instead strive to maintain dense matrix multiplication in our methods to allow for general application to existing transformer models. Similarly, NPE (Khan et al., 2021) introduces a framework for FPGA acceleration of transformers, utilizing piecewise linear approximations for nonlinear functions.
... quantization algorithm "Q8_0", explored further in Section 3.2.

2.2. hls4ml

[Figure: FPGA execution model — a host executable and kernel binary running through the Xilinx Runtime on the hardware platform.]
... for efficient CPU transformer inference and referred to as "Q8_0" quantization in the library (Gerganov). We quantize the embedding, attention, and feedforward weights. The RMSNorm parameters, which are sensitive to error, are kept in float32 precision.

Although quantization leads to decreased model accuracy, the accuracy dropoff is minimal, and we explore the effects of quantization in Section 4.1. Quantization allows for smaller weights, which permits us to better utilize the limited memory bandwidth on the FPGA and perform integer-only calculations, which provides inference speedups through lower precision arithmetic (Kim et al., 2021).
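To make the scheme concrete, the sketch below shows the general shape of Q8_0-style group quantization, in which weights are stored as int8 values in fixed-size groups that each carry a single float scaling factor. The group size, struct layout, and function names are illustrative rather than taken from llama.cpp or our kernel.

#include <math.h>
#include <stdint.h>

#define GROUP_SIZE 32   /* illustrative; Q8_0 groups weights into small fixed-size blocks */

typedef struct {
    float scale;              /* shared scaling factor for the group */
    int8_t q[GROUP_SIZE];     /* quantized int8 weights */
} QuantGroup;

/* Quantize one group: choose the scale so the largest magnitude maps to 127. */
static void quantize_group(const float *w, QuantGroup *out) {
    float amax = 0.0f;
    for (int i = 0; i < GROUP_SIZE; i++) {
        float a = fabsf(w[i]);
        if (a > amax) amax = a;
    }
    out->scale = amax / 127.0f;
    float inv = (out->scale != 0.0f) ? 1.0f / out->scale : 0.0f;
    for (int i = 0; i < GROUP_SIZE; i++) {
        out->q[i] = (int8_t)roundf(w[i] * inv);
    }
}

/* Dequantization for reference: each weight is recovered as approximately scale * q[i]. */
static void dequantize_group(const QuantGroup *g, float *w) {
    for (int i = 0; i < GROUP_SIZE; i++) {
        w[i] = g->scale * (float)g->q[i];
    }
}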
3.3. Optimization of the Llama 2 Accelerator Using HLS Pragmas

Pragmas in High-Level Synthesis (HLS) are directives used to guide the HLS compiler in converting high-level code into a hardware description; they are typically used to indicate to the compiler that a specific optimization should be performed on some section of the code. For example:

for(int i = 0; i < 2; i++) {
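    /* Assumed illustrative completion of this listing: in llama2.c, a two-iteration
       loop of this shape applies the RoPE rotation to both the query and key vectors.
       The unroll pragma replicates the body so the two rotations run in parallel.
       q, k, j, fcr, and fci are hypothetical names for the surrounding buffers and
       precomputed cos/sin factors. */
#pragma HLS unroll
    float *vec = (i == 0) ? q : k;
    float v0 = vec[j];
    float v1 = vec[j + 1];
    vec[j]     = v0 * fcr - v1 * fci;
    vec[j + 1] = v0 * fci + v1 * fcr;
}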
3.3.1. Pipelining

... to increased throughput. The pipeline pragma is applied to the main loops responsible for computing matrix-vector multiplication and rotary position embeddings.
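As an illustration of how such a pragma is applied, the sketch below pipelines the inner loop of a matrix-vector multiply; the function signature and initiation-interval target are illustrative rather than the exact kernel used in this work.

/* Illustrative matrix-vector multiply: out = W x, with W stored row-major as n x d. */
void matvec(const float *w, const float *x, float *out, int n, int d) {
    for (int i = 0; i < n; i++) {
        float acc = 0.0f;
        for (int j = 0; j < d; j++) {
#pragma HLS pipeline II=1
            /* The pipeline pragma asks the tool to start a new iteration of this
               loop every clock cycle, overlapping the loads of w and x with the
               multiply and accumulate of earlier iterations; the floating-point
               accumulation may limit the achievable interval in practice. */
            acc += w[i * d + j] * x[j];
        }
        out[i] = acc;
    }
}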
3.3.2. Loop Unrolling

Loop unrolling is an optimization technique that increases the efficiency of hardware implementations derived from high-level code. This process involves expanding the loop body multiple times in order to reduce the number of iterations. By doing this, loop unrolling enables the simultaneous execution of multiple consecutive loop iterations, as long as there are no intra-loop data dependencies.

In other words, if a loop is executed N times and we unroll it M times, the loop body will be replicated M times within each iteration, thereby reducing the total number of iterations to N/M. This technique is especially useful in hardware design because it can lead to more parallel operations, allowing the hardware to perform more tasks simultaneously at the cost of chip space.
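A partial unroll makes the N/M relationship concrete: with an unroll factor of M, the synthesized datapath contains M copies of the loop body and the loop runs N/M times. The factor and array size below are illustrative.

#define N 768   /* matches the model's embedding dimension */

/* Element-wise residual addition over a length-N vector. */
void residual_add(float x[N], const float r[N]) {
    for (int i = 0; i < N; i++) {
#pragma HLS unroll factor=4
        /* factor=4 replicates the body four times, so the loop runs N/4 = 192
           times; the four copies can execute in parallel provided the arrays
           expose enough memory ports. */
        x[i] += r[i];
    }
}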
3.3.3. Memory Partitioning

The application of HLS partitioning pragmas is a critical step in the design of the Llama 2 deep learning accelerator.
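A minimal sketch of the idea follows; the array size, partition type, and factor are chosen for illustration rather than taken from the actual accelerator.

#define DIM 768

void scale_buffer(float buf[DIM], float s) {
#pragma HLS array_partition variable=buf cyclic factor=8
    /* Cyclic partitioning splits buf into 8 physical memories (elements 0, 8, 16, ...
       land in bank 0; elements 1, 9, 17, ... in bank 1; and so on), so the unrolled
       loop below is not serialized on the one or two ports of a single BRAM. */
    for (int i = 0; i < DIM; i++) {
#pragma HLS unroll factor=8
        buf[i] *= s;
    }
}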
We benchmark generation at 256 tokens and at the max context length of 1024 tokens to test both the short and long text generation domains.

Our FPGA designs were synthesized targeting the Ultrascale+ VU9P platform available on AWS, and the synthesized designs were then exported to an Amazon Machine Image (AMI) using a custom toolchain provided by Amazon (AWS). We use the f1.2xlarge instance from AWS to host the FPGA, and we use the t2.2xlarge instance for our CPU benchmarks (8 vCPUs, 2.3 GHz Intel Xeon Broadwell E5-2686 v4), the same CPUs used in the FPGA instance, and an NVIDIA RTX 3090 GPU for our GPU benchmarks. We use the original Llama 2 implementation provided by Meta for our GPU experiments. We run all samples with non-batched inference (batch size 1).

While we run benchmarks of FPGA performance against CPUs and GPUs, we are unable to provide equitable quantized benchmarks for GPUs, as the different scaling factors per section in the quantization algorithm used would require specialized kernels to run efficiently. To provide equitable comparisons, we also report perplexity, a common metric for model quality, along with inference latency and energy consumption benchmarks, to demonstrate minimal tradeoffs in accuracy while fully utilizing the optimized integer-arithmetic abilities of FPGAs.

4.1. Perplexity

We measure perplexity on the validation dataset for TinyStories, for both the quantized and unquantized versions of the 110M parameter model; perplexity measures a model's uncertainty about its predictions. Our experimental setup is detailed further in the Appendix.
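For reference, perplexity is conventionally computed as the exponentiated average negative log-likelihood of the evaluation tokens under the model (lower is better):

\[
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\!\left(x_i \mid x_{<i}\right)\right)
\]

A rise in this value after quantization therefore directly reflects the additional uncertainty introduced by the int8 weights.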
... results from the system simulations (Khan et al., 2021), and we provide a report of our full timings in the Appendix.

Table 2. Inference speed (tokens per second)

Hardware    256 tokens ↑     1024 tokens ↑
CPU         23.21 tok/s      19.63 tok/s
GPU         107.00 tok/s     107.24 tok/s
FPGA        57.11 tok/s      57.11 tok/s

Table 3. Inference latency (milliseconds)

Hardware    256 tokens ↓     1024 tokens ↓
CPU         43.08 ms         50.94 ms
GPU         9.34 ms          9.32 ms
FPGA        17.51 ms         17.51 ms

According to Table 2, the FPGA runs at 2.46x the inference speed of the CPU and 0.53x the inference speed of the GPU.
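These ratios follow from the 256-token column of Table 2:

\[
\frac{57.11}{23.21} \approx 2.46, \qquad \frac{57.11}{107.00} \approx 0.53 .
\]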
Although the GPU performs inference faster than the FPGA, one of the primary bottlenecks of deep learning inference is memory bandwidth and the availability of on-chip memory (Balasubramanian et al., 2021). An RTX 3090 has 24 GB of VRAM running at 1219 MHz with a base core clock of 1395 MHz (TechPowerUp, 2024). In comparison, a VU9P FPGA has 345.9 MB of combined on-chip BRAM and URAM, running at a much slower clock speed of around 200-300 MHz depending on the module; however, even with much lower clock speeds, the FPGA is able to achieve better efficiency in power and energy consumption, as shown below.
Karpathy, A. llama2.c. 2023. URL https://github.com/karpathy/llama2.c.

Khan, H., Khan, A., Khan, Z., Huang, L. B., Wang, K., and He, L. NPE: An FPGA-based overlay processor for natural language processing. 2021. doi: 10.1145/3431920.3439477.

Kim, S., Gholami, A., Yao, Z., Mahoney, M. W., and Keutzer, K. I-BERT: Integer-only BERT quantization. 2021.

Li, B., Pandey, S., Fang, H., Lyv, Y., Li, J., Chen, J., Xie, M., Wan, L., Liu, H., and Ding, C. FTrans: Energy-efficient acceleration of transformers using FPGA. In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design, pp. 175–180, 2020.

McDonald, J., Li, B., Frey, N., Tiwari, D., Gadepally, V., and Samsi, S. Great power, great responsibility: Recommendations for reducing energy for training language models. arXiv preprint arXiv:2205.09646, 2022.

Merritt, R. What is accelerated computing? NVIDIA Blog, 2021.

Patterson, D. Good news about the carbon footprint of machine learning training. 2022. URL https://blog.research.google/2022/02/good-news-about-carbon-footprint-of.html.

Peng, H., Huang, S., Geng, T., Li, A., Jiang, W., Liu, H., Wang, S., and Ding, C. Accelerating transformer-based deep learning models on FPGAs using column balanced block pruning. In 2021 22nd International Symposium on Quality Electronic Design (ISQED), pp. 142–148, 2021. doi: 10.1109/ISQED51717.2021.9424344.

Samsi, S., Zhao, D., McDonald, J., Li, B., Michaleas, A., Jones, M., Bergeron, W., Kepner, J., Tiwari, D., and Gadepally, V. From words to watts: Benchmarking the energy costs of large language model inference. In 2023 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–9. IEEE, 2023.

Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., and Liu, Y. RoFormer: Enhanced transformer with rotary position embedding. 2021.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023a.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. 2023.

Workshop, B., Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., et al. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.

Xiong, C. and Xu, N. Performance comparison of BLAS on CPU, GPU and FPGA. In 2020 IEEE 9th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), 2020.

Zhang, B. and Sennrich, R. Root mean square layer normalization. 2019.

Zhang, P., Zeng, G., Wang, T., and Lu, W. TinyLlama: An open-source small language model. 2024.
A. Appendix
A.1. Experimental Setup
For all our experiments, we use a sampling temperature of 1, an empty prompt (prompt is “”), and top-p sampling at 1. We
run all our experiments 100 times and take the average for our results.
We use Karpathy's provided 110M parameter model, which has an embedding dimension of 768, 12 layers, 12 heads, 12 KV heads, and a max context length of 1024.
Table 7. We obtain our timing results from synthesis, as shown below.
Module Name Start Interval Best (cycles) Avg (cycles) Worst (cycles) Best (absolute) Avg (absolute) Worst (absolute)
forward Pipeline 1 771 771 771 771 3.084 us 3.084 us 3.084 us
rmsnorm 768 Pipeline 1 770 770 770 770 3.080 us 3.080 us 3.080 us
rmsnorm 768 Pipeline 2 771 771 771 771 3.084 us 3.084 us 3.084 us
rmsnorm 768 Pipeline sum of squares 5413 5413 5413 5413 21.652 us 21.652 us 21.652 us
rmsnorm 768 Pipeline norm and scale 23 23 23 23 92.000 ns 92.000 ns 92.000 ns
rmsnorm 768 Pipeline 5 770 770 770 770 3.080 us 3.080 us 3.080 us
rmsnorm 768 s 7822 7822 7822 7822 31.288 us 31.288 us 31.288 us
round 1 1 1 1 4.000 ns 4.000 ns 4.000 ns
p hls fptosi float i8 1 1 1 1 4.000 ns 4.000 ns 4.000 ns
quantize 768 Pipeline main loop 198 198 198 198 0.792 us 0.792 us 0.792 us
quantize 768 Pipeline 2 770 770 770 770 3.080 us 3.080 us 3.080 us
quantize 768 Pipeline 3 14 14 14 14 56.000 ns 56.000 ns 56.000 ns
quantize 768 s 971 971 971 971 3.884 us 3.884 us 3.884 us
matmul 768 768 Pipeline x buff 50 50 50 50 0.200 us 0.200 us 0.200 us
matmul 768 768 Pipeline xs buff 5 5 5 5 20.000 ns 20.000 ns 20.000 ns
matmul 768 768 Pipeline VITIS LOOP 225 1 20900 20900 20900 20900 83.600 us 83.600 us 83.600 us
matmul 768 768 s 20977 20977 20977 20977 83.908 us 83.908 us 83.908 us
pow generic float s 1 15 15 15 60.000 ns 60.000 ns 60.000 ns
sin or cos float s 1 18 18 18 72.000 ns 72.000 ns 72.000 ns
forward Pipeline rotation1 119 119 119 119 0.476 us 0.476 us 0.476 us
forward Pipeline 3 839 839 839 839 3.356 us 3.356 us 3.356 us
forward Pipeline 4 839 839 839 839 3.356 us 3.356 us 3.356 us
forward Pipeline iterate 530 ~ 1554 530 1042 1554 2.120 us 4.168 us 6.216 us
forward Pipeline max 2 ~ 261 2 133 261 8.000 ns 0.532 us 1.044 us
forward Pipeline exp 24 ~ 56 24 40 56 96.000 ns 0.160 us 0.224 us
forward Pipeline sum 10 ~ 1546 10 778 1546 40.000 ns 3.112 us 6.184 us
forward Pipeline norm 9 ~ 25 9 17 25 36.000 ns 68.000 ns 0.100 us
forward Pipeline 10 66 66 66 66 0.264 us 0.264 us 0.264 us
forward Pipeline acc 89 ~ 1625 89 857 1625 0.356 us 3.428 us 6.500 us
forward Pipeline residual 61 61 61 61 0.244 us 0.244 us 0.244 us
matmul 768 2048 Pipeline x buff 50 50 50 50 0.200 us 0.200 us 0.200 us
matmul 768 2048 Pipeline xs buff 5 5 5 5 20.000 ns 20.000 ns 20.000 ns
matmul 768 2048 Pipeline VITIS LOOP 225 1 55460 55460 55460 55460 0.222 ms 0.222 ms 0.222 ms
matmul 768 2048 s 55537 55537 55537 55537 0.222 ms 0.222 ms 0.222 ms
forward Pipeline swi glu 552 552 552 552 2.208 us 2.208 us 2.208 us
forward Pipeline 14 2050 2050 2050 2050 8.200 us 8.200 us 8.200 us
quantize 2048 Pipeline main loop 221 221 221 221 0.884 us 0.884 us 0.884 us
quantize 2048 Pipeline 2 2050 2050 2050 2050 8.200 us 8.200 us 8.200 us
quantize 2048 Pipeline 3 34 34 34 34 0.136 us 0.136 us 0.136 us
quantize 2048 s 2274 2274 2274 2274 9.096 us 9.096 us 9.096 us
matmul 2048 768 Pipeline x buff 130 130 130 130 0.520 us 0.520 us 0.520 us
matmul 2048 768 Pipeline xs buff 10 10 10 10 40.000 ns 40.000 ns 40.000 ns
matmul 2048 768 Pipeline VITIS LOOP 225 1 52526 52526 52526 52526 0.210 ms 0.210 ms 0.210 ms
matmul 2048 768 s 52659 52659 52659 52659 0.211 ms 0.211 ms 0.211 ms
forward Pipeline residual2 58 58 58 58 0.232 us 0.232 us 0.232 us
matmul 768 32000 Pipeline x buff 50 50 50 50 0.200 us 0.200 us 0.200 us
matmul 768 32000 Pipeline xs buff 5 5 5 5 20.000 ns 20.000 ns 20.000 ns
matmul 768 32000 Pipeline VITIS LOOP 225 1 864190 864190 864190 864190 3.457 ms 3.457 ms 3.457 ms
matmul 768 32000 s 864311 864311 864311 864311 3.457 ms 3.457 ms 3.457 ms
forward 4160108 ~ 4892636 4160107 4377403 4892635 16.640 ms 17.510 ms 19.571 ms