Design Optimization For High-Performance Computing Using FPGA
Abstract
Reconfigurable architectures such as Field-Programmable Gate Arrays (FPGAs) have been used to accelerate computations in several domains because of their unique combination of flexibility, performance, and power efficiency. However, FPGAs have not been widely adopted for high-performance computing, primarily because of their programming complexity and the difficulty of optimizing their performance. In this paper, we optimize Tensil AI's open-source inference accelerator for maximum performance using ResNet-20 trained on CIFAR-10 in order to gain insight into the use of FPGAs for high-performance computing. We show how improving the hardware design, using Xilinx Ultra RAM, and applying advanced compiler strategies lead to improved inference performance. We also demonstrate that running the CIFAR-10 test set shows very little accuracy drop when quantizing down from the original 32-bit floating point. The heterogeneous computing model of our platform allows us to achieve a frame rate of 293.58 frames per second (FPS) and 90% accuracy on ResNet-20 trained on CIFAR-10. The experimental results show that the proposed accelerator achieves a throughput of 21.12 giga-operations per second (GOP/s) with 5.21 W of on-chip power consumption at 100 MHz. The comparison results with off-the-shelf
1 Introduction
Real-time vision-based motion tracking is necessary for many applications.
Real-time video streaming for surveillance applications requires advanced
encoding and decoding techniques as well as compute-intensive image pro-
cessing designs. The ability to operate in real time is especially important for applications in which speed is paramount, such as production lines, traffic speed control systems, or settings where camera activity must be synchronized with other system components. Developers of machine vision frameworks must decide which of these processing steps to implement in dedicated hardware while building the rest of the framework; a software implementation is commonly chosen when the system is first prototyped. The number of frames the application must process at any given moment depends on the application's real-time requirements. Researchers are developing methods to design embedded systems that
require less power, which is critical for most applications in modern embed-
ded systems [1] [2]. High-resolution image processing applications demand fast, configurable, high-throughput frameworks with high efficiency for processing large data sets [3] [4] [5]. FPGAs (Field-Programmable Gate Arrays) can play an important role, since they provide the configurability, adaptability, and parallelism needed to match the throughput requirements of the application under consideration [6]. The execution model of an FPGA makes it well suited to such real-life applications. FPGAs have sig-
nificantly increased the flexibility of hardware in general. A wider community
of builders can now make use of these devices thanks to advancements in the
toolchains for developing applications on them. Applications that require con-
currency, high transfer speeds, and re-programmability typically use FPGAs.
Modern digital life is increasingly reliant on image-processing applications.
They are used in a variety of applications, including medical imaging, secu-
rity, autonomous vehicles, and entertainment. In order to meet the increasing
demand for more accurate and faster image processing, high-performance com-
puting systems are needed. Image processing systems can be improved through
FPGA-based design optimization. Several factors drive the need for higher performance in image processing; the most important are discussed in more detail below.
• Resolution and Image Size: An image processing system’s performance is
strongly influenced by image resolution and file size. The complexity of
and high throughput processing it provides make it an ideal solution for high-
performance computing applications. The Tensil AI inference accelerator is not
only highly scalable but also highly efficient. The technology can be employed
in edge devices such as smartphones, smart cameras, and IoT devices, as well
as in cloud-based applications that require high-performance machine learning
inference. Tensil AI accelerators can be deployed on a wide range of FPGA
platforms, including Xilinx’s Alveo accelerator cards, making them ideal for
high-performance computing applications. The Tensil AI open-source infer-
ence accelerator is a powerful tool for accelerating machine learning inference
on FPGA platforms. A wide range of input sizes and shapes can be supported,
making it a highly scalable and versatile solution. High-performance comput-
ing will likely become even more reliant on solutions like the Tensil AI inference
accelerator as machine learning becomes more important [7] [8].
The rest of the paper is organized as follows: Section II presents the motivation and related work. Section III introduces open-source ML inference accelerators. The proposed method and its experimental results and analysis are reported in Sections IV and V. Section VI concludes the paper and outlines future work.
2 Motivation
FPGAs have been around for several decades and are used in many different applications, but their adoption in high-performance computing has been limited by a number of challenges. FPGAs have not been widely used in high-performance computing due to their high development cost and complexity: the tools and technologies required for FPGA development are often expensive and complex, which makes FPGA-based systems difficult to build. FPGA-based solutions have therefore proven challenging to adopt, especially for smaller organizations or those with limited resources. The limited availability of high-level software tools is another challenge for FPGAs in high-performance computing. Developing software for FPGAs requires a deep understanding of the underlying hardware architecture, which makes it more difficult than programming traditional processors, and high-level synthesis tools are not as mature as the toolchains for traditional processors, making development more challenging [9] [10].
Some high-performance computing applications are also limited by the small amount of on-chip memory on FPGAs, which forces significant data transfer between the FPGA and external memory, slowing performance and increasing latency. FPGAs also offer only limited native support for the floating-point operations that many high-performance computing applications rely on. In addition, FPGAs used in high-performance computing have a limited selection of pre-built intellectual property (IP) blocks. The development of FPGA-based solutions often requires such blocks, for example memory controllers and data interfaces, and their limited availability makes developing FPGA-based systems more difficult and time-consuming.
High-performance computing applications benefit from the advantages of
FPGAs, despite these challenges. FPGAs can be highly optimized for specific tasks and often outperform traditional processors in those applications, and their hardware-level parallelism further enhances performance for certain workloads. Recent developments have made FPGAs more accessible
for high-performance computing, thus addressing these challenges. The avail-
ability of high-level synthesis tools for FPGAs makes software development
easier, for example. A number of pre-built IP blocks are also being developed
and made available for FPGAs. A number of FPGA-based solutions are now
available that require less specialized hardware design knowledge and are eas-
ier to use. Because of these development and implementation challenges, FPGAs have not yet been widely adopted for high-performance computing, but efforts are under way to resolve these issues and make FPGA-based solutions more accessible and usable. The adoption of FPGAs in high-performance computing will increase as development tools, IP blocks, and FPGA-based solutions continue to improve [11].
FPGAs have attracted significant interest for high-performance computing applications in recent years. FPGA-based systems can be highly optimized for specific tasks and often outperform traditional processors in specific applications. The image and video processing industry has
extensively used FPGAs for high-performance computing. The processing of
high-resolution images and video can be carried out in real time using FPGAs.
A high-level synthesis tool called Vivado HLS has been used by researchers at
UCLA to develop an FPGA-based system for real-time image processing [12].
A throughput of 52 frames per second was achieved when filtering images,
and 20 frames per second when segmenting images. High-performance com-
puting has also been done using FPGAs in the financial industry. Complex
mathematical operations are often involved in financial calculations, which are
well suited for FPGAs. A high-frequency trading system developed by the
Tokyo Stock Exchange (TSE) can process trades in less than one microsec-
ond using FPGAs [13] [14]. The system uses FPGAs to price financial instruments such as options and futures. Machine learning and artificial intel-
ligence are other areas where FPGAs have been used for high-performance
computing. FPGAs can be highly optimized for neural network computations,
making it possible to process large amounts of data faster and more efficiently. Scientific calculations can likewise be highly optimized on FPGAs. Furthermore, a
number of existing works focus on optimizing FPGA-based systems for high-
performance computing in general. Researchers have developed a tool called
FireSim to simulate large-scale FPGA-based systems using cloud resources
[15]. The tool can be used to optimize system performance and evaluate differ-
ent design options. There are many existing works that focus on using FPGAs
for high-performance computing. Several applications, including image and
simulates complex behaviors over long periods of time using large-scale neural
models. A potential drawback of Tensil AI is limited flexibility, stemming from its specialized nature as a hardware accelerator: users may not be able to create custom models or algorithms and may have to rely on pre-built models and architectures instead. In summary, Nengo and Tensil AI are both powerful frameworks for developing high-performance com-
puting applications. A variety of applications can be carried out with Nengo,
whereas Tensil AI is more suited for specific tasks, such as machine learning
inference. Developers should carefully evaluate the strengths and weaknesses
of each framework before selecting one, and ultimately their choice will depend
on the specific needs of the application [7] [8].
4 Method
We apply several approaches, including improved Vivado hardware design, leveraging Xilinx Ultra RAM, and advanced compiler strategies, to improve inference performance. In the ResNet20-ZCU104 tutorial by Tensil AI,
several methods are used to optimize the design of the ResNet20 neural net-
work for implementation on the Xilinx ZCU104 development board using their
open-source inference accelerator. ResNet-20 is trained on CIFAR-10, a dataset that contains 60,000 32x32 RGB images categorized into ten classes. The network is trained with the PyTorch framework and reaches an accuracy of approximately 91%. After training, several optimization steps prepare the network for deployment on the ZCU104. Pruning removes unnecessary connections from the trained network, reducing the number of parameters and the computation and memory requirements. A further optimization used in the tutorial is quantization, which reduces the precision of the network's weights and activations. The network is quantized to 8-bit fixed-point precision using the TensorRT framework, further reducing memory and computation requirements. Tensil AI's open-source inference accelerator, designed to accelerate sparse neural network execution on FPGAs, implements the optimized network on the ZCU104. High performance and energy efficiency are achieved by exploiting the reconfigurability and parallelism of FPGAs. The ResNet20-ZCU104 tutorial thus demonstrates a variety of techniques, including pruning and quantization, for optimizing neural network designs for implementation on FPGA-based accelerators.
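As a concrete illustration of the pruning step, the sketch below applies magnitude-based pruning to the convolution layers of a trained network using PyTorch's built-in utilities. It is a minimal sketch rather than the tutorial's exact script: the stand-in ResNet-18 model, the checkpoint path, and the 30% pruning amount are illustrative assumptions.

    import torch
    import torch.nn.utils.prune as prune
    import torchvision

    # Stand-in model and hypothetical checkpoint path (illustrative assumptions).
    model = torchvision.models.resnet18(weights=None)
    model.load_state_dict(torch.load("resnet_cifar10.pt"))

    for module in model.modules():
        if isinstance(module, torch.nn.Conv2d):
            # Zero out the 30% of weights with the smallest L1 magnitude.
            prune.l1_unstructured(module, name="weight", amount=0.3)
            # Fold the pruning mask into the weight tensor permanently.
            prune.remove(module, "weight")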
file system functionality to read the ResNet model and a set of images to test
it with. The set we will be using is the test set for the original CIFAR-10.
The ResNet model is trained with separate training and validation sets from
the CIFAR-10. The test set is what the model hasn’t seen in training and
therefore gives an objective estimate of its accuracy. CIFAR-10 provides a test set of 10,000 images in several formats; we use the binary format, which is more suitable for the embedded application.
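For reference, each record in the CIFAR-10 binary format is 3,073 bytes: one label byte followed by 3,072 pixel bytes stored channel-planar (1,024 bytes each for red, green, and blue). A minimal NumPy sketch for reading the standard test_batch.bin file:

    import numpy as np

    RECORD_BYTES = 1 + 32 * 32 * 3  # 1 label byte + 3072 pixel bytes

    def read_cifar10_bin(path):
        # Each record: label, then 1024 R, 1024 G, 1024 B bytes (row-major planes).
        raw = np.fromfile(path, dtype=np.uint8).reshape(-1, RECORD_BYTES)
        labels = raw[:, 0]
        images = raw[:, 1:].reshape(-1, 3, 32, 32)
        return labels, images

    labels, images = read_cifar10_bin("test_batch.bin")
    print(labels.shape, images.shape)  # (10000,) (10000, 3, 32, 32)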
With the SD card inserted, containing the CIFAR-10 test data set and the ResNet model compiled for Tensil, you should see a prediction printed every 100 images along with the measured inferences (frames) per second. After running inference on the entire test data set, the program prints the final average frames per second and the accuracy of the inference. For the baseline solution, we get an average of 133.54 frames per second with 90% accuracy. Note that the accuracy we see when testing the same ResNet model with TensorFlow is 92%; the 2% drop is due to changing the data type from 32-bit floating point in TensorFlow to 16-bit fixed point in Tensil.
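To illustrate where this drop comes from, the sketch below simulates rounding 32-bit floats to a 16-bit fixed-point representation in NumPy. The 16-bit width matches Tensil's data type as described above; the split of 8 fractional bits is an assumption chosen for illustration.

    import numpy as np

    def to_fixed(x, total_bits=16, frac_bits=8):
        # Round to the nearest representable fixed-point value and saturate.
        scale = float(1 << frac_bits)
        lo = -(1 << (total_bits - 1))
        hi = (1 << (total_bits - 1)) - 1
        q = np.clip(np.round(x * scale), lo, hi)
        return q.astype(np.float32) / scale

    w = np.random.randn(100_000).astype(np.float32)
    err = np.abs(w - to_fixed(w))
    print(err.max())  # at most half an LSB, 2**-(frac_bits+1) ~= 0.002, for in-range values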
Resource   Utilization (XCZU7EV)
LUT        181,440
DSP        1,054
BRAM       293
URAM       96
Table 1: Resource usage for the Ultra RAM design.
KV) to accumulators and all of the Ultra RAM (48 KV) to local memory. For the Ultra RAM solution, we get an average of 170.16 frames per second, another meaningful improvement. This improvement comes purely from having larger on-chip memory. With a small on-chip memory, the Tensil compiler is forced to partition ResNet convolution layers into multiple load-compute-save blocks. This, in turn, requires the same input activations to be loaded multiple times while the weights are loaded only once; this is called weight-stationary dataflow. In the future, we will add an option for input-stationary dataflow, in which, when partitioned, the input activations are loaded once and the same weights are loaded multiple times. FPGA utilization for the Ultra RAM design is shown in Table 1.
Figure 3 shows such a 3-partitioned compilation. Layer N has 2 stages. In
each stage, a unique subset of weights is loaded. Then, each stage is further
split into 2 partitions. A partition is defined by the largest amount of weights, input and output activations, and intermediate results that fits in local memory and the accumulators.
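To make these schedules concrete, the following conceptual sketch (not the Tensil compiler's actual scheduler; the load/compute/save functions are placeholders) contrasts weight-stationary and input-stationary execution of a 2-stage, 2-partition layer:

    # Placeholders standing in for DMA transfers and systolic-array compute.
    def load_weights(s): print(f"load weight subset {s}")
    def load_activations(p): print(f"load activation partition {p}")
    def compute_and_save(s, p): print(f"compute stage {s}, partition {p}")

    def weight_stationary(stages, partitions):
        # Each weight subset is loaded once; the same activations are
        # re-loaded for every stage (the current compiler's behavior).
        for s in range(stages):
            load_weights(s)
            for p in range(partitions):
                load_activations(p)
                compute_and_save(s, p)

    def input_stationary(stages, partitions):
        # Planned alternative: activations loaded once per partition;
        # the same weights are re-loaded for every partition.
        for p in range(partitions):
            load_activations(p)
            for s in range(stages):
                load_weights(s)
                compute_and_save(s, p)

    weight_stationary(stages=2, partitions=2)  # the 2-stage, 2-partition example above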
5 Results
Our results demonstrate the effectiveness of Tensil AI’s open-source inference
accelerator for optimizing neural networks and implementing them on FPGAs
for high-performance computing applications. Neural network inference has been implemented on CPUs, GPUs, and FPGAs. CPU/GPU-based implementations consume a lot of power and have limited throughput due to limited memory bandwidth, as shown in Table 2. In Table 3, many researchers have developed FPGA-based designs
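As a quick derivation from the performance figures reported in this paper, the snippet below computes the accelerator's energy efficiency and the speedups of the optimized solutions over the baseline:

    throughput_gops = 21.12  # GOP/s at 100 MHz
    power_w = 5.21           # on-chip power consumption (W)
    baseline_fps = 133.54    # baseline solution
    uram_fps = 170.16        # Ultra RAM solution
    final_fps = 293.58       # final optimized solution

    print(f"{throughput_gops / power_w:.2f} GOP/s/W")           # 4.05 GOP/s/W
    print(f"Ultra RAM: {uram_fps / baseline_fps:.2f}x baseline") # 1.27x
    print(f"Final:     {final_fps / baseline_fps:.2f}x baseline") # 2.20x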
6 Conclusions
The ResNet20-ZCU104 project demonstrates the potential of using FPGA-
based acceleration for machine learning tasks. By leveraging the unique
capabilities of FPGAs, such as low latency and efficient memory usage, the
project achieved impressive results in terms of both performance and accuracy.
The model is implemented for hardware acceleration on various heterogeneous devices, resulting in an energy-efficient, reconfigurable system. In the next phase of our work, we plan to bring Dynamic Partial Reconfiguration, a state-of-the-art capability of reconfigurable hardware, into the high-performance framework achieved here. With this feature, we will address reshaping and offloading the pre- and post-processing of data for high-performance computing with Tensil AI. Tensil AI can already target a different model by compiling a new one, but the data going in and out may require manipulation between the accelerator's input and output stages.
7 Declarations
7.1 Ethical Approval
Not applicable
7.4 Funding
This research did not receive any specific grant from funding agencies in the
public, commercial, or not-for-profit sectors.
References
[1] Wang, H., Zhang, X., Kong, D., Lu, G., Zhen, D., Zhu, F., Xu, K.: Convo-
lutional neural network accelerator on FPGA. In: 2019 IEEE International
Conference on Integrated Circuits, Technologies and Applications (ICTA),
pp. 61–62 (2019). IEEE
[3] Blaiech, A.G., Khalifa, K.B., Valderrama, C., Fernandes, M.A., Bedoui,
M.H.: A survey and taxonomy of FPGA-based deep learning accelerators.
Journal of Systems Architecture 98, 331–345 (2019)
[4] Zou, D., Dou, Y., Guo, S., Ni, S.: High performance sparse matrix-vector
multiplication on FPGA. IEICE Electronics Express 10(17), 20130529–
20130529 (2013)
[5] Isik, M., Oldland, M., Zhou, L.: An energy-efficient reconfigurable autoen-
coder implementation on FPGA. arXiv preprint arXiv:2301.07050 (2023)
[6] Woods, R., McAllister, J., Lightbody, G., Yi, Y.: FPGA-based Implemen-
tation of Signal Processing Systems. John Wiley & Sons (2008)
[10] Sklyarov, V., Skliarova, I., Utepbergenov, I., Akhmediyarova, A., et al.:
Hardware accelerators for information processing in high-performance
computing systems. Int J Innov Comput Inf Control 15(1), 321–335
(2019)
[11] Huang, S., Pearson, C., Nagi, R., Xiong, J., Chen, D., Hwu, W.-m.:
Accelerating sparse deep neural networks on FPGAs. In: 2019 IEEE High Performance Extreme Computing Conference (HPEC) (2019). IEEE
[12] Chen, Z., Zhou, J., Blair, G.J., Blair, H.T., Cong, J.: FPGA-based in-
vivo calcium image decoding for closed-loop feedback applications. arXiv
preprint arXiv:2212.04736 (2022)
[15] Karandikar, S., Biancolin, D., Amid, A., Pemberton, N., Ou, A., Katz,
R., Nikolic, B., Bachrach, J., Asanovic, K.: Using FireSim to enable agile end-to-end RISC-V computer architecture research (2019)
[16] Moreau, T., Chen, T., Vega, L., Roesch, J., Yan, E., Zheng, L., Fromm,
J., Jiang, Z., Ceze, L., Guestrin, C., et al.: A hardware–software blueprint
for flexible deep learning specialization. IEEE Micro 39(5), 8–16 (2019)
[17] Zunin, V.: Intel OpenVINO toolkit for computer vision: Object detection
and semantic segmentation. In: 2021 International Russian Automation
Conference (RusAutoCon), pp. 847–851 (2021). IEEE
[19] Morcos, B.: NengoFPGA: an FPGA backend for the Nengo neural simulator.
Master’s thesis, University of Waterloo (2019)
[20] DeWolf, T., Jaworski, P., Eliasmith, C.: Nengo and low-power AI hard-
ware for robust, embedded neurorobotics. Frontiers in Neurorobotics 14,
568359 (2020)
[22] Ma, Y., Cao, Y., Vrudhula, S., Seo, J.-s.: Optimizing loop operation and
dataflow in FPGA acceleration of deep convolutional neural networks. In:
Proceedings of the 2017 ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays, pp. 45–54 (2017)
[23] Mei, C., Liu, Z., Niu, Y., Ji, X., Zhou, W., Wang, D.: A 200MHz 202.4GFLOPS@10.8W VGG16 accelerator in Xilinx VX690T. In: 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pp. 784–788 (2017). IEEE
[24] Zhang, M., Li, L., Wang, H., Liu, Y., Qin, H., Zhao, W.: Optimized
compression for implementing convolutional neural networks on FPGA.
Electronics 8(3), 295 (2019)
[25] Blott, M., Preußer, T.B., Fraser, N.J., Gambardella, G., O'Brien, K., Umuroglu, Y., Leeser, M., Vissers, K.: FINN-R: An end-to-end deep-
learning framework for fast exploration of quantized neural networks.
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
11(3), 1–23 (2018)
[26] Zhang, X., Wei, X., Sang, Q., Chen, H., Xie, Y.: An efficient FPGA-based
implementation for quantized remote sensing image scene classification
network. Electronics 9(9), 1344 (2020)
[27] Li, L., Zhang, S., Wu, J.: Efficient object detection framework and hard-
ware architecture for remote sensing images. Remote Sensing 11(20), 2376
(2019)
[28] Suda, N., Chandra, V., Dasika, G., Mohanty, A., Ma, Y., Vrudhula,
S., Seo, J.-s., Cao, Y.: Throughput-optimized OpenCL-based FPGA accel-
erator for large-scale convolutional neural networks. In: Proceedings of
the 2016 ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays, pp. 16–25 (2016)