
Computer Networks 210 (2022) 108931


FPGA-accelerated SmartNIC for supporting 5G virtualized Radio Access Network✩

Justine Cris Borromeo a,∗, Koteswararao Kondepu b, Nicola Andriolli c, Luca Valcarenghi a

a Scuola Superiore Sant'Anna, Via Moruzzi 1, 56124 Pisa, Italy
b Indian Institute of Technology Dharwad, Dharwad, Karnataka, India
c National Research Council of Italy (CNR-IEIIT), via Caruso 16, 56122 Pisa, Italy

ARTICLE INFO

Keywords: Hardware Acceleration; Network function virtualization; OpenCL

ABSTRACT

Disaggregated, virtualized, and open next generation NodeB (gNB) could bring several benefits to the Next Generation Radio Access Network (NG-RAN) by enabling more market competition and customer choice, lower equipment costs, and improved network performance. This can be achieved through gNB-central unit (CU)-control plane (CP), gNB-CU-user plane (UP), and gNB-distributed unit (DU) separation, CU and DU function virtualization, and zero touch RAN management and control. However, to achieve the performance required by specific foreseen 5G usage scenarios (e.g., Ultra Reliable Low Latency Communications, URLLC), offloading selected disaggregated gNB functions into accelerated hardware becomes a necessity.

To this aim, this study proposes the implementation of 5G DU Low-PHY layer functions into an FPGA-based SmartNIC exploiting the Open Computing Language (OpenCL) framework to facilitate the integration of accelerated 5G functions within the mobile protocol stack. The proposed implementation is compared against (i) a CPU-based OpenAirInterface implementation, and (ii) a GPU-based implementation of the IFFT exploiting the clfft and cufft libraries. Experimental results show that the different optimization techniques implemented in the proposed solution reduce the Low-PHY processing time and the use of FPGA resources. Moreover, the GPU-based implementation of the cufft and the proposed FPGA-based implementation have a lower processing time and power consumption compared to a CPU-based implementation with up to two cores. Finally, the implementation in a SmartNIC reduces the delay added by the host-to-device communication through the Peripheral Component Interconnect Express (PCIe) interface, considering both functional split options 2 and 7-1.

1. Introduction

5G architecture implements a very scalable and flexible network technology that provides a resilient cloud-native mobile network with end-to-end support for network slicing. It aims to support new services based on three major usage scenarios, namely: (i) enhanced mobile broadband (eMBB), supporting higher broadband access capabilities, faster connections, and higher resolution; (ii) massive machine-type communications (mMTC), for high density connections of low cost and energy efficient IoT devices; and (iii) ultra-reliable low-latency communications (URLLC), which supports mission critical applications requiring very low latency and high reliability [1].

With the constant evolution of 5G networks, Network Function Virtualization (NFV) is explored to provide rapid and cost-effective deployment, upgrade, and scaling of network services and functions in an integrated fronthaul/backhaul network infrastructure [2]. NFV aims to implement the following improvements: (i) decoupling software from hardware, allowing separate timelines and maintenance for software and hardware; (ii) flexible function deployment, where software and hardware can perform different functions at various times; and (iii) dynamic scaling of the Virtualized Network Function (VNF) performance [3,4]. Virtualization prevents network service providers from investing in expensive hardware components. It can also accelerate the installation time, thereby providing faster services to customers.

✩ This work received funding from the ECSEL JU grant agreement No 876967. The JU receives support from the EU Horizon 2020 research and innovation programme and the Italian Ministry of Education, University, and Research (MIUR), Italy. Intel University Program and Terasic Inc. are gratefully acknowledged for donating the FPGA hardware. This work is also partly supported by DST SERB Startup Research Grant (SRG-2021-001522).
∗ Corresponding author.
E-mail address: [email protected] (J.C. Borromeo).

https://doi.org/10.1016/j.comnet.2022.108931
Received 7 October 2021; Received in revised form 7 March 2022; Accepted 24 March 2022; Available online 31 March 2022
1389-1286/© 2022 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

The concept of NFV also extends to the deployment of Radio Access Networks (RAN) with the aim of a faster scaling to improve user experience as the network capacity grows, especially for IoT devices, where millions of devices are expected to be connected to the 5G network. Open-source mobile platforms such as OpenAirInterface (OAI) [5] and open radio access network architectures like O-RAN [6] have been developed, where mobile networks and equipment are software-driven, virtualized, flexible, and energy-efficient. The Open RAN initiative is also part of the Telecom Infra Project (TIP) [7], which aims to accelerate innovation and commercialization in the RAN domain with multi-vendor interoperable products and solutions that are easy to integrate in the operators' network and are verified for different deployment scenarios. The TIP OpenRAN program supports the development of disaggregated and interoperable 2G/3G/4G/5G NR RAN solutions based on service provider requirements.

Thus, the 5G RAN is evolving towards the Next Generation RAN (NG-RAN), where disaggregated next generation NodeBs (gNBs) are utilized. Each gNB is composed of a Central Unit (CU) that is connected to one or more Distributed Units (DUs) through the midhaul interface, and each DU is connected to one or more Radio Units (RUs) that implement Radio Frequency (RF) functions, through the fronthaul interface [8].

From the data center perspective, accelerated edge cloud micro data centers [9] featuring the integration and interconnection of Central Processing Units (CPUs), Graphical Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), and the recently proposed Data Processing Units (DPUs) [10] are emerging. Thus, accelerated edge cloud micro data centers will play an important role in the disaggregated and virtualized 5G RAN and beyond, bringing several benefits, such as reduced latency and power consumption [4].

This paper proposes the implementation of an FPGA-based SmartNIC to be installed in the edge cloud micro data center, near the antenna site where a DU is hosted or directly in the RU, to accelerate the gNB Low-PHY functions. Specifically, the offloaded functions are the Inverse Fast Fourier Transform (IFFT) and the Cyclic Prefix (CP) addition of the Orthogonal Frequency Division Multiplexing (OFDM) in downlink transmission. The considered function implementation is based on the Open Computing Language (OpenCL) framework to ease the integration with other mobile software functions that are not accelerated (e.g., exploiting the CPU).

The FPGA-based implementation is evaluated and compared with the CPU-based Low-PHY of OpenAirInterface (OAI) [5] and with GPU-based FFT/IFFT libraries (i.e., clfft [11] and cufft [12]) in terms of processing time and energy consumption. The experimental evaluation shows that for large FFT/IFFT sizes (i.e., ≥ 2048), the FPGA-based implementation outperforms the OAI Low-PHY implementation processed in a high-end single and dual core CPU and the GPU-based clfft, but it shows a higher processing time compared to the GPU-based cufft. However, the cufft energy consumption is high, while the FPGA-based Low-PHY experiences the lowest energy consumption. Moreover, the FPGA-based SmartNIC overcomes the host-to-device memory transfer bottleneck, thus making FPGA-based accelerators effective in providing deterministic latency and high processing capacity per Watt.

2. Considered 5G RAN Architecture

Different functional split options are currently under investigation for the deployment of the NG-RAN; they distribute several functions between the RU, DU, and CU, resulting in different delay, jitter, and capacity requirements. The split options that currently receive most of the attention are Option 8, Option 7 (in particular Option 7-1, Option 7-2, and also Option 7.2x in O-RAN), and Option 2. They can also be utilized in combination when RU, DU, and CU are deployed in different devices and interconnected by fronthaul and midhaul interfaces [13–16]. Table 1 shows the requirements of the interfaces (either fronthaul or midhaul) connecting the network elements hosting the functions based on the listed split options. As shown, lower layer functional split options (i.e., Option 7 and Option 8) have higher fronthaul capacity and stricter one-way latency requirements due to the remote implementation of the Hybrid Automatic Repeat Request (HARQ) [17].

Table 1
Bandwidth and one-way latency requirements of different functional split options.

Functional Split   Required Downlink Capacity   Required Uplink Capacity   One-way Latency
Option 2           4016 Mb/s                    3024 Mb/s                  1–10 ms
Option 7-1         9.2 Gb/s                     60.4 Gb/s                  250 μs
Option 7-2         9.8 Gb/s                     15.2 Gb/s                  250 μs
Option 8           157.3 Gb/s                   157.3 Gb/s                 250 μs

3GPP Release-15 [18,19] also finalized the specification of the 5G New Radio (5G NR), which supports operation with frequency bands ranging from sub-1 GHz up to mmWave. Two operating frequency ranges (FRs) have been defined: FR1, 450 MHz–6 GHz (commonly referred to as sub-6), and FR2, 24.25 GHz–52.6 GHz (also referred to as millimeter wave). In FR1 and FR2, the maximum bandwidth is 100 MHz and 400 MHz, respectively, both being much larger than the maximum LTE bandwidth of 20 MHz. Moreover, to support a wide range of use cases and application scenarios, 5G NR features flexible subcarrier spacing, which can be obtained by

Δf = 2^μ × 15 kHz;  μ ∈ {−1, 0, 1, 2, 3, 4, 5}   (1)

Also, the slot duration is scaled by a factor T_slot = 2^−μ with respect to the transmission time interval of LTE. This means that the slot duration (T_slot), the cyclic prefix (CP) length, and the OFDM symbol duration (T_OFDM = 1/Δf) reduce as the subcarrier spacing increases, as illustrated in Fig. 1 [19]. In this case, the elaboration time needed to perform FFT/IFFT and CP addition/removal is shorter when using subcarrier spacings larger than 15 kHz, which makes these functions one of the best candidates for hardware acceleration.
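For illustration, the short C program below evaluates Eq. (1) and the resulting symbol and slot timing for the first numerologies; it is a worked example, not code from the proposed implementation:

```c
#include <stdio.h>
#include <math.h>

/* Worked example of Eq. (1): subcarrier spacing, OFDM symbol duration
 * T_OFDM = 1/df, and slot duration T_slot = 2^-mu ms (relative to the
 * 1 ms LTE transmission time interval) for numerologies mu = 0..3. */
int main(void) {
    for (int mu = 0; mu <= 3; mu++) {
        double df_khz    = pow(2, mu) * 15.0;  /* Eq. (1): 2^mu x 15 kHz */
        double t_ofdm_us = 1000.0 / df_khz;    /* symbol duration in us  */
        double t_slot_ms = pow(2, -mu);        /* slot duration in ms    */
        printf("mu=%d: df=%.0f kHz, T_OFDM=%.2f us, T_slot=%.3f ms\n",
               mu, df_khz, t_ofdm_us, t_slot_ms);
    }
    return 0;
}
```

For μ = 1, for instance, this yields Δf = 30 kHz, T_OFDM ≈ 33.3 μs, and a 0.5 ms slot, i.e., half the per-symbol elaboration time budget of the 15 kHz case.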
tral Processing Units (CPUs), Graphical Processing Units (GPUs), Field The approach proposed in this paper accelerates the 5G gNB stack
Programmable Gate Arrays (FPGAs), and the recently proposed Data Low-PHY functions in an FPGA-based SmartNIC. This work focuses on
Processing Units (DPUs) [10] are emerging. Thus, accelerated edge a dual-split scenario where Option 8 is implemented in the fronthaul
cloud micro data centers will play an important role in the disaggre- interface, while two different functional split options are considered in
gated and virtualized 5G RAN and beyond, bringing several benefits, the midhaul (options 2 and 7-1). Fig. 2 shows the proposed hardware
such as reduced latency and power consumption [4]. offloading setup with two different scenarios distributing Packet Data
Convergence Protocol (PDCP) to Low-PHY functions within the DU and
This paper proposes the implementation of an FPGA-based Smart-
CU using option 2 and 7-1 functional splits in the midhaul interface.
NIC to be installed in the edge cloud micro data center, near the The RU hosts RF functions only (option 8 fronthaul functional split).
antenna site where a DU is hosted or directly in the RU to accelerate The DU and CU components are deployed in the edge cloud. In the
the gNB Low-PHY functions. Specifically, the offloaded functions are accelerated edge, the Low-PHY functions in the DU are offloaded onto
the Inverse Fast Fourier Transform (IFFT) and the Cyclic Prefix (CP) and accelerated by an FPGA using the OpenCL Framework.
addition of the Orthogonal Frequency Division Multiplexing (OFDM) The considered solution slightly differs from the approach currently
in downlink transmission. The considered function implementation adopted for the O-RAN fronthaul, known as option 7.2x split. In that
is based on the Open Computing Language (OpenCL) framework to case, as reported in [15], OFDM phase compensation, iFFT, CP ad-
ease the integration with other mobile software functions that are not dition, and digital beamforming functions in the downlink direction
reside in the O-RU. The remaining PHY functions, including resource
accelerated (e.g., exploiting the CPU).
element mapping, precoding, layer mapping, modulation, scrambling,
The FPGA-based implementation is evaluated and compared with rate matching and coding reside in the O-DU. By reducing the number
the CPU-based Low-PHY of OpenAirInterface (OAI) [5] and GPU-based of functions hosted in the RU, the approach considered in this paper
FFT/IFFT libraries (i.e., clfft [11] and cufft [12]) in terms of processing features most of the advantages listed in [15] for split option 7.2x,
time and energy consumption. The experimental evaluation shows that such as interoperability, advanced receivers and inter-cell coordination,
for large FFT/IFFT sizes (i.e., ≥ 2048), the FPGA-based implementation and future proofness. In addition, it even features lower RU complexity,
outperforms the OAI Low-PHY implementation processed in a high- energy consumption, and cost at the expenses of transport bandwidth
end single and dual core CPU and GPU-based clfft, but it shows a scalability. Indeed, it scales with the number of antennas and not with
the number of streams, as in split option 7.2x. In addition, user data
higher processing time compared to GPU-based cufft. However, cufft
transfer cannot be optimized to send only Physical Resource Blocks
energy consumption is high, while FPGA-based Low-PHY experiences
(PRBs) that contain user data for the purpose of reducing transport
the lowest energy consumption. Moreover, the FPGA-based smartNIC bandwidth, as in split option 7.2x. However, the proposed FPGA-based
overcomes the host-to-device memory transfer bottleneck, thus making implementation can be fully compatible with split option 7.2x because
FPGA-based accelerators effective in providing deterministic latency it can be deployed in the RU and accelerate the RU functions listed in
and high processing capacity per Watt. split option 7.2x.

2
J.C. Borromeo et al. Computer Networks 210 (2022) 108931

Fig. 1. 5G New Radio Flexible Numerology.

3. Considered Low-PHY Implementation

The Low-PHY functions implemented in the FPGA are shown in Fig. 3 for both downlink and uplink directions. In the downlink direction, IQ samples in the frequency domain, consisting of 32 bits each (16 bits for the real part and 16 bits for the imaginary part), are received in the FPGA, which then performs the IFFT to convert the samples to the time domain. The cyclic prefix (CP) is then inserted as a guard interval to avoid inter-symbol interference (ISI). In the uplink direction, IQ samples in the time domain are received with CP; the FPGA then performs CP removal to take out the guard interval, and the FFT operation to convert the samples to the frequency domain. The CP insertion (removal) is performed by adding (removing) redundant samples before each OFDM symbol. The number of added or removed CP samples is shown in Table 2.

Fig. 3. Low-PHY layer protocol to be implemented in hardware.

Table 2
Number of Cyclic Prefix samples with respect to the FFT/IFFT size.

FFT/IFFT size   CP length
128             10
256             20
512             40
1024            80
2048            160
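As an illustration of the CP insertion described above, the following C sketch prepends the last cp_len time-domain samples of an OFDM symbol to the symbol itself (e.g., cp_len = 160 for the 2048-point IFFT, as in Table 2); the function and type names are illustrative and do not come from the paper's FPGA kernels:

```c
#include <string.h>
#include <complex.h>

/* Minimal sketch of CP insertion: the last cp_len time-domain samples
 * of each OFDM symbol are copied in front of the symbol as a guard
 * interval. The float complex sample type is illustrative only. */
void add_cyclic_prefix(const float complex *symbol, /* n IFFT outputs     */
                       float complex *out,          /* n + cp_len samples */
                       int n, int cp_len) {
    /* cyclic prefix: tail of the symbol */
    memcpy(out, symbol + (n - cp_len), cp_len * sizeof(float complex));
    /* followed by the symbol body itself */
    memcpy(out + cp_len, symbol, n * sizeof(float complex));
}
```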
The FFT/IFFT implementation is one of the key components, and the most complex and computationally intensive module, in Orthogonal Frequency Division Multiplexing (OFDM). The FFT is an optimized computation of the Discrete Fourier Transform (DFT), which can be computed using the formula [20]:

X(k) = Σ_{n=0}^{N−1} x(n) · W_N^{nk};  k = 0, 1, …, N − 1   (2)

where X(k) are the samples in the frequency domain, x(n) are the samples in the time domain, N is the number of FFT/IFFT points, and W_N^{nk} is the twiddle factor. The latter is computed as:

W_N^{nk} = e^{−j2πnk/N}   (3)

The computation of the FFT/IFFT can be divided into two parts: the data reordering and the radix butterfly configuration. The orders of the FFT/IFFT input and output samples are different: the former is in natural order, while the latter is in bit-reversed order. A data reordering step is used to convert the input from natural order to bit-reversed order and vice versa.

The radix butterfly configuration decomposes the computation of the FFT into different stages. The number of radix butterfly stages needed to compute the FFT/IFFT is given as log_M(N), where N is the number of FFT/IFFT points and M is the radix number. The higher the radix number, the fewer the stages needed, balanced by a more complex twiddle factor computation.

Fig. 4 shows a 16-point FFT computed in two ways: (a) 4 stages using Radix-2; and (b) 2 stages using Radix-4. In terms of multiplication cost, Radix-2 brings twiddle factors at 0° and 180°, while Radix-4 has twiddle factors at angles 0°, 90°, 180°, and 270° [21]. However, as shown in Fig. 4, the former needs 4 stages, while the latter needs only 2 stages to compute the 16-point FFT. In computing an FFT with more than 16 points, a combination of Radix-4 and Radix-2 stages can be considered for implementation.

Fig. 4. Radix Butterfly Configuration: (a) 16-point Radix-2 FFT, (b) 16-point Radix-4 FFT.

This paper focuses on implementing up to 2048 FFT/IFFT points; the number of radix butterfly stages in each implementation is shown in Fig. 5. We implemented a Radix-4 butterfly configuration on the first few stages of each FFT/IFFT size to achieve fewer computation stages, then added a Radix-2 butterfly configuration on the last stage of the 128-, 512-, and 2048-point FFT/IFFT.

Fig. 5. Radix Butterfly Configuration of different FFT/IFFT points.
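The two parts can be fixed in mind with a minimal, purely didactic radix-2 FFT in C (bit-reversed data reordering followed by log2(N) butterfly stages); the actual FPGA implementation uses the mixed Radix-4/Radix-2 pipeline described above, so this sketch only illustrates the structure:

```c
#include <complex.h>
#include <math.h>

/* Didactic iterative radix-2 FFT: (1) reorder the input from natural
 * to bit-reversed order, (2) apply log2(n) butterfly stages with
 * twiddle factors W_len^k = exp(-j*2*pi*k/len), per Eqs. (2)-(3). */
void fft_radix2(double complex *x, int n) {   /* n must be a power of 2 */
    /* 1) data reordering: natural order -> bit-reversed order */
    for (int i = 1, j = 0; i < n; i++) {
        int bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) { double complex t = x[i]; x[i] = x[j]; x[j] = t; }
    }
    /* 2) radix-2 butterfly stages */
    for (int len = 2; len <= n; len <<= 1) {
        double complex wl = cexp(-2.0 * I * M_PI / len); /* stage twiddle */
        for (int i = 0; i < n; i += len) {
            double complex w = 1.0;
            for (int k = 0; k < len / 2; k++) {
                double complex u = x[i + k];
                double complex v = x[i + k + len / 2] * w;
                x[i + k]           = u + v;   /* butterfly upper output */
                x[i + k + len / 2] = u - v;   /* butterfly lower output */
                w *= wl;
            }
        }
    }
}
```

The IFFT is obtained analogously by conjugating the twiddle factors and scaling the output by 1/N.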
4. OpenCL Framework

Low-PHY functions are implemented by using the OpenCL framework [22], which can execute a kernel on an FPGA platform using a software development kit (SDK) provided by Intel. This section introduces the OpenCL programming language with some references to the application of OpenCL in different scenarios (e.g., block ciphers and Low-PHY). The specific software development kit used to program OpenCL on Intel FPGAs is also discussed.

4.1. OpenCL Programming Language

OpenCL is a parallel computing application programming interface (API) that is capable of executing a kernel on different platforms such as CPUs, GPUs, and FPGAs [22]. A single program can be written once and run on heterogeneous platforms; however, to maximize the performance on each platform, different types of optimization techniques have to be implemented. OpenCL also supports a parallel computing approach to enhance application performance.

As shown in Fig. 6, the OpenCL platform always includes a single host that acts as a master capable of interacting with one or more OpenCL devices. The OpenCL device is where a stream of instructions, called a kernel, is executed. OpenCL devices are called Compute Devices (CDs) and they can be a CPU, GPU, DSP, or FPGA. Thus, OpenCL is suitable for the implementation of the considered DU, where some functions are implemented in the FPGA and some others are implemented in the CPU.

Fig. 6. OpenCL Platform model.
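A minimal host-side sketch of this platform model is shown below, using standard OpenCL C API calls; error handling is reduced to a single check, and the accelerator device type is an assumption (FPGA boards are typically exposed as CL_DEVICE_TYPE_ACCELERATOR):

```c
#include <stdio.h>
#include <CL/cl.h>

/* Sketch of the Fig. 6 model: the host discovers a platform, selects
 * one compute device, and creates the context and command queue
 * through which kernels and buffers are managed. */
int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    err  = clGetPlatformIDs(1, &platform, NULL);
    /* use CL_DEVICE_TYPE_GPU or CL_DEVICE_TYPE_CPU for the other CDs */
    err |= clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1,
                          &device, NULL);
    if (err != CL_SUCCESS) { fprintf(stderr, "no OpenCL device\n"); return 1; }

    cl_context ctx     = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    /* ... create program and kernels, enqueue buffers and work ... */

    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    return 0;
}
```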
4.2. OpenCL Utilization Survey

Researchers are already exploiting the OpenCL framework on FPGAs for different applications, such as image processing (typically based on convolutional neural networks) and cryptographic accelerators.

FPGA-based OpenCL implementations of block ciphers to achieve high throughput and low energy consumption were investigated in [23]. Nine different ISO standard block ciphers were compared with the CPU-based implementations. Results show that the OpenCL implementation in FPGA achieves a higher throughput compared to the CPU in 8 different block ciphers aside from the Advanced Encryption Standard (AES). The authors were also able to achieve an energy improvement of 22.78x compared to the pure software implementation of ISO standard block ciphers.

The utilization of hardware acceleration on a virtualized Cloud-RAN is a use case reported in [4] with the aim of leveraging resource utilization for load balancing among different base stations to provide cost reduction, high resource and spectrum utilization, and energy efficient networks. One of the main issues addressed in that work concerns the computationally intensive signal processing tasks of the physical layer (i.e., channel coding/decoding, FFT/IFFT). This research motivates our work, since the authors recommended to implement these tasks in a dedicated CPU processor or on general-purpose layer 1 (L1) accelerators.

The authors of [24] assess the suitability of employing OpenCL-driven reconfigurable hardware in the context of a 5G virtualized gNB DU. Using a Terasic DE5-Net development board with an Intel Stratix V GX FPGA, the authors focus on the implementation of the Low-PHY level functionalities at the DU using the Option 7-1 functional split. Results show that OpenCL has a better processing time when data sizes increase (more than 2048 OFDM symbols) due to its pipelined approach. However, the kernel implementation does not fit into the FPGA for OFDM symbol sizes larger than 512. Another bottleneck presented in that research is the data transfer and synchronization between the host and the device memory, as well as the reading and writing from/to global and local memory inside the FPGA.

This paper focuses on a further optimized implementation compared to the one proposed in [24], aimed at deploying kernels with more than 512 OFDM symbols in the FPGA through the OpenCL framework. Different optimization techniques are utilized to further improve the processing time and the FPGA area overhead. This paper is also an extended version of [25]: it includes the processing time of GPU-based IFFT libraries, and proposes the use of kernel autorun through OpenCL channels to reduce the data transfer between the device and the host. With the use of OpenCL channels, IQ samples can be either received from or sent through one of the FPGA interfaces (QSFP for the DE10-pro FPGA).
4.3. Intel FPGA SDK for OpenCL

The Intel FPGA SDK for OpenCL provides the necessary APIs and runtime library to program the FPGAs attached to the PCIe interfaces, very similarly to a GPU or any other kind of hardware accelerator. The IP cores needed for the communication between the FPGA, the external DDR memory, and the PCIe, alongside the necessary PCIe and DMA drivers for the communication between the host and the FPGA, are also provided by the board manufacturers in the form of a Board Support Package (BSP) [26].

Fig. 7. Intel FPGA SDK for OpenCL flow; AOC is the Intel FPGA SDK for OpenCL Offline Compiler.

The flow of the Intel FPGA SDK for OpenCL to compile the host code and convert the kernel code to an FPGA bitstream is shown in Fig. 7. Unlike for CPUs and GPUs, the OpenCL kernel needs to be compiled offline using the Intel Altera Offline Compiler (AOC), due to the long placement and routing time in FPGAs. The AOC converts the kernel code into Verilog code, the hardware description language for the FPGA. The Verilog code is converted to an FPGA bitstream after placement and routing. After compilation, the AOC generates the Altera Offline Compiler Object file (.aoco) containing kernel and configuration information required at runtime, the Altera Offline Compiler Executable file (.aocx) (i.e., the hardware configuration file), and a kernel folder or subdirectory that contains the information necessary to create the .aocx file, including an area and timing report file for analysis. Note that LLVM is a compiler and toolchain technology designed around an intermediate code representation called the LLVM Intermediate Representation (IR) [27].

The host side can be programmed using the C or C++ programming language. The C/C++ compiler compiles the host program and links it to the Intel FPGA SDK using the OpenCL runtime libraries. After compilation, the host runs the host application, which programs and executes the hardware image into the FPGA. In our implementation, the host is programmed using the C++ programming language.
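A hedged sketch of the host-side counterpart of this flow is shown below: the pre-built .aocx binary is loaded with clCreateProgramWithBinary(), so that clBuildProgram() performs no online compilation for the FPGA; the file handling and function name are illustrative:

```c
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Loads an AOC-generated .aocx bitstream as a pre-built OpenCL
 * program binary. Context and device are assumed to exist already. */
cl_program load_aocx(cl_context ctx, cl_device_id dev, const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    fseek(f, 0, SEEK_END);
    size_t len = ftell(f);
    rewind(f);
    unsigned char *bin = malloc(len);
    size_t rd = fread(bin, 1, len, f);
    fclose(f);
    if (rd != len) { free(bin); return NULL; }

    cl_int err, status;
    cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &len,
                          (const unsigned char **)&bin, &status, &err);
    if (err == CL_SUCCESS)              /* links only; no compilation */
        err = clBuildProgram(prog, 1, &dev, "", NULL, NULL);
    free(bin);
    return (err == CL_SUCCESS) ? prog : NULL;
}
```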
5. Low-PHY Function Optimization using OpenCL

To optimize the OpenCL kernels for FPGA, two main strategies can be considered, namely the improvement of the pipeline throughput and the exploitation of data parallelism. A few procedures to improve the FPGA-based implementation have been discussed in [9], such as replicating a kernel pipeline to increase data parallelism and using sliding windows to improve pipeline throughput. In the following we provide a list of optimization techniques applicable in OpenCL to further reduce the processing time of the Low-PHY function and to reduce the resource utilization in the FPGA.

5.1. Loop Unrolling

The loop unrolling command replicates the loop body multiple times, executing the loop iterations in a parallel manner [26]. In cases where there are no loop-carried dependencies, unrolling loops can reduce the processing time of each 'for' loop implemented in the FPGA. However, fully unrolling the loop iterations may also significantly increase the resource utilization of the FPGA. A pragma unroll (N) directive is placed before the 'for' loop to instruct the compiler to unroll the loop, resulting in an N-times speedup in the execution performance of the loop, as in the sketch below.
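A minimal OpenCL C sketch of the directive follows; the kernel name, the 2048-sample loop, the packed 16-bit I/Q type, and the unroll factor 8 are illustrative, not the paper's kernel code:

```c
/* short2 packs one 16-bit I and one 16-bit Q sample, matching the
 * 32-bit IQ format described in Section 3. The output buffer must
 * hold 2048 + 160 elements (symbol plus CP). */
__kernel void shift_past_cp(__global const short2 *restrict in,
                            __global short2 *restrict out) {
    #pragma unroll 8   /* omit the factor for a full unroll */
    for (int i = 0; i < 2048; i++) {
        out[i + 160] = in[i];  /* place symbol body after the CP region */
    }
}
```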

5.2. Avoid Function Calling

Writing a separate function and calling it inside the main kernel code is implemented as a separate circuit on the FPGA. This results in an additional use of FPGA resources, since more routing resources are needed to connect the function to the main kernel code. To further minimize the use of FPGA resources, function calls should be avoided in favor of loops inside the kernel code, which are then partially or fully unrolled depending on the availability of the FPGA resources, as sketched below.
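The sketch below illustrates the guideline: a butterfly-like operation is written as an in-line loop in the kernel body (with the twiddle factor multiplication omitted for brevity) rather than as a separate helper function that would be synthesized as its own circuit; names and sizes are illustrative:

```c
/* In-line radix-2 butterflies instead of calling a butterfly()
 * helper; OpenCL C vector types support component-wise + and -.
 * A 2048-element buffer (1024 butterfly pairs) is assumed. */
__kernel void radix2_stage(__global float2 *restrict x) {
    #pragma unroll
    for (int k = 0; k < 1024; k++) {
        float2 u = x[k];
        float2 v = x[k + 1024];   /* twiddle multiplication omitted */
        x[k]        = u + v;      /* butterfly upper output */
        x[k + 1024] = u - v;      /* butterfly lower output */
    }
}
```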
5.3. Matrix instead of Vector Representation

Arithmetic operations in an FPGA may involve arrays with a large number of components; this is especially the case when implementing the FFT/IFFT with 2048 points, where arithmetic additions and twiddle factor multiplications are computed on 4096 components of 16 bits each. The computational complexity can be reduced by representing these components as a matrix rather than a vector, especially when implementing a nested for loop. Furthermore, lowering the computational complexity of the arithmetic operations increases the kernel operating frequency, resulting in a reduction of the processing time.
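One possible sketch of this representation is shown below, where the 4096 components are viewed as a 64 x 64 matrix addressed by two short indices inside a nested loop; the kernel name, the dimensions, and the placeholder arithmetic are illustrative assumptions, not the paper's kernel:

```c
#define ROWS 64
#define COLS 64   /* 64 x 64 = 4096 components, as in the 2048-point IFFT */

__kernel void matrix_view(__global const short *restrict in,
                          __global short *restrict out) {
    short m[ROWS][COLS];   /* on-chip 2-D view of the flat input buffer */
    for (int r = 0; r < ROWS; r++) {
        #pragma unroll
        for (int c = 0; c < COLS; c++)
            m[r][c] = in[r * COLS + c];
    }
    for (int r = 0; r < ROWS; r++) {
        #pragma unroll
        for (int c = 0; c < COLS; c++)
            out[r * COLS + c] = m[r][c] << 1;  /* placeholder arithmetic */
    }
}
```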
5.4. Maximizing Global Memory Bandwidth

The DE10-pro Terasic FPGA board contains 2 banks of DDR4 memory with a bus width of 64 bits running at 2133 MHz (1066 MHz double data-rate), which provides 34.1 GB/s of external memory bandwidth. Since the memory controller on the FPGA runs at 1/8 of the clock of the external memory (i.e., 266 MHz), just 128 bytes per clock can saturate the memory bandwidth. Increasing the memory transfer per clock above the memory bandwidth increases the area overhead without decreasing the memory transfer time. For the Low-PHY implementation in the FPGA, only 128 bytes of data are sent from the global to the local memory per clock cycle to maximize the global memory bandwidth, along the lines of the sketch below.
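The sketch stages 128 bytes per loop iteration (two 64-byte float16 words) from global to on-chip local memory; buffer sizes and names are assumptions for illustration:

```c
#define WORDS 256   /* 2 x 256 x float16 = 32 KB staged locally */

/* Copy from global (DDR4) to local memory at 128 bytes per loop
 * iteration, matching the external memory saturation point. */
__kernel void stage_to_local(__global const float16 *restrict g) {
    __local float16 l[2 * WORDS];
    for (int i = 0; i < WORDS; i++) {
        l[2 * i]     = g[2 * i];      /* 64 bytes                    */
        l[2 * i + 1] = g[2 * i + 1];  /* + 64 bytes = 128 B per trip */
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    /* ... IFFT and CP addition then operate on the local copy ... */
}
```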
5.5. SmartNIC-based Implementation

The implementation of the DU with the considered split Option 2 in the downlink direction is shown in Fig. 8(a). Here, it is assumed that a Commercial Off-the-Shelf (COTS) server equipped with an FPGA is utilized. The RLC, MAC, and High-PHY functions are implemented in a CPU that receives the data from the CU (through a Network Interface Card – NIC, such as a QSFP) where the PDCP layer is implemented. After performing the High-PHY function, the IQ samples in the frequency domain are sent from the CPU to the FPGA through the CPU's PCIe interface for Low-PHY elaboration in the FPGA. Finally, the IQ samples in the time domain are returned to the CPU, again through the PCIe, and sent to the RU through another NIC (e.g., SFP+). However, the data transfer and synchronization between the host and the device memory become a bottleneck in this implementation scenario, due to the contribution of the host-to-device transfer latency. Aside from the host-to-device memory transfer, the data transfer between the FPGA's global and local memory also adds to the FPGA-based Low-PHY processing time. Compared to GPUs, where the memory bandwidth offered by GDDR5X or HBM2 is in the order of hundreds of GB/s, FPGA boards usually offer a much lower memory bandwidth (e.g., DDR4 with around 32 GB/s).

The proposed solution is shown in Fig. 8(b). This implementation reduces the Low-PHY processing time, since data are transferred from the CPU to the FPGA through the PCIe interface only once. This is possible by exploiting auto-run kernels and OpenCL channels. The auto-run kernels allow executing the processing in hardware without interaction with the host and the global memory. Indeed, the host starts the auto-run kernel that forwards the data to the NIC interface (e.g., QSFP) of the FPGA after the Low-PHY elaboration, by means of the I/O OpenCL channels, as sketched below.
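A hedged sketch of this mechanism, based on the Intel FPGA OpenCL channels extension, follows; the channel identifiers (in particular the BSP-specific I/O channel name) and kernel names are illustrative assumptions:

```c
#pragma OPENCL EXTENSION cl_intel_channels : enable

/* Internal channel between the Low-PHY kernel and the forwarder. */
channel float2 LOWPHY_OUT __attribute__((depth(512)));
/* I/O channel mapped by the BSP to the board's QSFP interface;
 * the identifier string is board-specific and assumed here. */
channel float2 QSFP_TX __attribute__((io("kernel_output_ch0")));

__kernel void low_phy(__global const float2 *restrict freq_iq, int n) {
    for (int i = 0; i < n; i++) {
        /* ... IFFT and CP addition would happen here ... */
        write_channel_intel(LOWPHY_OUT, freq_iq[i]);
    }
}

/* Autorun kernel: starts with the bitstream, is never launched by the
 * host, and touches neither host nor global memory. */
__attribute__((autorun))
__attribute__((max_global_work_dim(0)))
__kernel void forward_to_qsfp(void) {
    while (1) {
        float2 s = read_channel_intel(LOWPHY_OUT);
        write_channel_intel(QSFP_TX, s);  /* straight to the NIC port */
    }
}
```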

Fig. 8. DU implementation scenarios: (a) COTS server equipped with FPGA; (b) FPGA-based SmartNIC implementation.

6. Performance Evaluation

This section evaluates the impact of the different optimization techniques on the processing time (i.e., the time required to process the considered FFT size), the resource utilization (logic gates, DSPs, memory bits, RAM), and the maximum kernel operating frequency of a Low-PHY function in the FPGA using the OpenCL framework. The best performing version is then assessed in terms of resource utilization and maximum kernel operating frequency by varying the number of IFFT points from 128 to 2048. Results of the FPGA-based Low-PHY in terms of processing time and energy consumption are then compared to a CPU-based Low-PHY implementation with OpenAirInterface, and to a GPU-based implementation of the clfft and cufft libraries. The clfft is an OpenCL-based software library containing FFT functions [11]. The cufft is a CUDA-based FFT library developed by NVIDIA [12].

As shown in Fig. 2, the RU hosts just the RF functions, implemented in an Ettus X310 Universal Software Radio Peripheral (USRP), and is connected to the DU through a 10 Gbps optical Ethernet fronthaul. The DU and CU components are deployed in the edge cloud exploiting Dell PowerEdge R740 servers. A midhaul link with 10/25 Gbps is used to connect the DU and the CU. In the accelerated edge, the DU Low-PHY functions are offloaded onto a DE10-pro development board with a Stratix 10 FPGA, two 8 GB DDR4 memory modules, and PCIe v3.0 with 16 slots at 32 GB/s bandwidth [26]. The CPU-based implementation is executed in an Intel Core [email protected] GHz, based on Intel Advanced Vector Extension 2 (AVX2), where arithmetic operations are performed on 256-bit vectors to achieve better performance with floating point calculations and data organization. The clfft and cufft are implemented in an NVIDIA Tesla T4 GPU featuring 320 NVIDIA Turing tensor cores, 16 GB GDDR6 memory modules, and PCIe v3.0 with 16 slots [28]. Host-to-device transfer latency results are also detailed.

6.1. Low-PHY Layer Optimization

Table 3 shows the performance of five implementations of the Low-PHY function of the 5G RAN with 128 IFFT points, considering the different optimization techniques discussed in Section 5. The five implementations exploit different optimization techniques available in the considered Intel FPGA SDK for OpenCL: (i) version 1 features the implementation of IFFT and CP addition without any optimization; (ii) version 2 uses the loop unrolling method; (iii) version 3 removes function calls inside the main kernel code; (iv) version 4 implements a matrix instead of a vector representation of the array to increase the kernel frequency, thus reducing the computation time; (v) version 5 is the same as version 4, but with the kernel code compiled by using Intel FPGA SDK for OpenCL version 20.3, instead of the older 19.1 used in the previous cases. The different versions are compared in terms of processing time, utilization of logic gates, DSP utilization, memory utilization, RAM utilization, and kernel operating frequency.

Table 3
OpenCL optimization results on 128 OFDM symbols.

                         Version 1   Version 2   Version 3   Version 4   Version 5
Processing time [μs]     34.37       23.5        23.45       21.4        15.43
Logic gate utilization   14%         25%         21%         19%         21%
DSP utilization          <1%         5%          4%          3%          3%
Memory utilization       2%          2%          2%          3%          5%
RAM utilization          4%          6%          6%          7%          11%
Kernel frequency [MHz]   239.23      366.7       285.63      484.26      484.78

As shown in Table 3, the processing time decreases from 34.37 μs to 23.5 μs when the loop is fully unrolled, with the trade-off of an increased utilization of logic gates (14% to 25%), DSPs (<1% to 5%), and RAM (4% to 6%). When function calling is avoided in the implementation, there is a decrease of about 4% in the utilization of logic elements. Using the matrix representation on the array of components further decreases the processing time and the used logic gates, and increases the kernel frequency by a factor of 1.7. The fastest processing time, 15.43 μs, is achieved in implementation version 5, with a utilization of 21% for logic gates, 3% for DSPs, 5% for memory, and 11% for RAM, and with the kernel operating at a 484.78 MHz frequency, achieved with Intel FPGA SDK for OpenCL version 20.3.

6.2. Hardware Performance

Considering the best performing version in Table 3, namely version 5, 5G Low-PHY functions with 128 up to 2048 IFFT points have been implemented in the FPGA using the OpenCL framework. Results in terms of logic gate, DSP, memory, and RAM utilization, and of kernel operating frequency, are shown in Table 4.

Table 4
FPGA resources and kernel operating frequency of Low-PHY layer functions with different IFFT points.

                         128      256      512      1024     2048
Logic gate utilization   21%      26%      36%      51%      66%
DSP utilization          3%       3%       8%       14%      14%
Memory utilization       5%       6%       11%      15%      15%
RAM utilization          11%      13%      16%      23%      23%
Kernel frequency [MHz]   484.78   461.68   390.93   299.67   146.26

It can be noted that the use of hardware resources increases as the IFFT points increase. This is because the size of the array increases and more FPGA resources are needed to parallelize the IFFT and CP addition computation. Also, the kernel frequency decreases with increasing IFFT points due to the increasing complexity, given by O(N log N) with N being the number of IFFT points.

6.3. Processing Time

Fig. 9 shows the processing time as a function of the IFFT size of the 5G Low-PHY in OpenCL exploiting an FPGA, a CPU with a different number of processing cores (from one to four), and a GPU with the clfft and cufft libraries. The open-source benchmark package gearshifft [29] is used to evaluate the processing performance of the IFFT libraries in the GPU.

Fig. 9. FPGA vs. CPU vs. GPU processing time.

Fig. 10. Energy Usage per Low-PHY operation in FPGA, CPU, and GPU with different IFFT points.

The CPU-based processing time decreases for increasing CPU cores. This is expected, since multiple cores allow running multiple processes at the same time with greater ease compared to a single core, increasing the performance when handling multiple tasks or demanding computations. Such a processing time improvement cannot be appreciated for lower IFFT sizes (i.e., 128 and 256 IFFT points) due to the lower complexity, which can be handled by a single CPU core. However, a notable performance difference can be highlighted for 2048 IFFT points, where it takes only around 6.38 μs to run the implementation with 4 CPU cores, while it needs 31.84 μs with 1 CPU core. This is because the computational effort for IFFT and CP addition is shared among 4 CPU cores, resulting in an approximately four times lower processing time.

Moreover, the processing time for the CPU-, GPU-, and FPGA-based implementations increases as a function of the IFFT points. However, just a small increase in the processing time (around 1 μs) occurs with the FPGA-based implementation, because of the data parallelism with full loop unrolling on the radix butterfly and CP addition computation. The same happens with the GPU-based implementation, since it has more parallel computing resources (320 Turing Tensor Cores and 2560 CUDA Cores for the NVIDIA Tesla T4) compared to a CPU. Also, the GPU-based cufft has a faster processing time compared to clfft, since cufft is based on CUDA, which is also developed by NVIDIA, and thus better matches the computing characteristics of the GPU, offering better performance.

In comparing the processing time of the different implementations, results show that the GPU-based implementation of the clfft library has the longest processing time, making it less suited to be deployed in the 5G Low-PHY. CPU-based implementations (from 1 to 4 CPU cores) have the shortest processing times up to 512 IFFT size. However, at 2048 IFFT points, the FPGA-based and GPU-based cufft implementations are 1.45x and 1.91x faster compared to the CPU-based implementation with up to 2 CPU cores, respectively. The CPU-based Low-PHY implementation performs better at smaller IFFT sizes since it operates at a higher clock frequency (i.e., 4.2 GHz) compared to the GPU (i.e., 1.59 GHz) and the FPGA (i.e., 480 MHz). However, when the computational complexity increases (i.e., more than 1024 IFFT points), the data parallelism of the FPGA and the parallel computing resources of the GPU outperform the CPU-based implementations.

Although the CPU-based implementation with 3 or 4 cores has a shorter processing time compared to the FPGA-based Low-PHY and the cufft library implementation in the Tesla T4 GPU, these results are only possible if there are no other 5G functions implemented in the CPU. With a 5G RAN testbed using a higher layer split (Option 2), the DU does not only implement the Low-PHY, but also processes the High-PHY, MAC, and RLC functions in the CPU. In this case, CPU resources are shared among four different 5G functions, resulting in a longer processing time. This is why implementing the 5G Low-PHY function in the FPGA, or offloading the IFFT implementation to the GPU using the cufft library, can improve the overall gNB performance: they can free some of the CPU cores (e.g., 2), and the freed CPU cores can be exploited to perform the remaining RAN functions.

6.4. Energy Consumption

Fig. 10 shows the energy consumption per Low-PHY operation in the FPGA, a single CPU core, and the GPU (for both clfft and cufft). It is measured as the energy consumed per Low-PHY operation, which is obtained by multiplying the power consumption by the Low-PHY processing time in each device. The energy consumption allows a fair comparison among the different implementations because it is the product of power consumption and processing time; thus, low energy consumption can be achieved by low power consumption or by a short processing time. The s-tui tool [30] is used to measure the CPU power, quartus_pow (included in the de10_pro board support package) is used to estimate the power dissipated in the FPGA, and the nvidia-smi command is used to measure the power consumption in the GPU. As shown in Fig. 10, the energy usage per operation increases as a function of the considered IFFT points when using either the FPGA, the CPU, or the GPU. Results also show that the GPU-based clfft consumes the highest amount of energy per operation, also due to the large processing time shown in Fig. 9, making it the least optimal solution to be integrated into the 5G Low-PHY. On the other hand, the CPU-based implementation of the Low-PHY has the lowest energy consumption up to 512 IFFT size because of the short processing time. However, at higher IFFT sizes, the FPGA-based implementation has the lowest energy consumption (1.03x lower than the single-core CPU for 1024 IFFT size and 2.22x lower than the single-core CPU for 2048 IFFT size), followed by the GPU-based cufft implementation. In fact, the long processing time of the FPGA-based implementation (see Fig. 9) is compensated by a low power consumption (see Fig. 11). Also, implementing the Low-PHY with more CPU cores further increases the energy consumption. This means that, as the IFFT size increases, offloading the Low-PHY function into the FPGA becomes more energy efficient compared to processing it in the CPU or GPU (for both the clfft and cufft libraries).
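In compact form, the metric plotted in Fig. 10 is simply the power-time product; the numeric example uses a hypothetical power figure (only the 15.43 μs processing time is taken from Table 3):

```latex
E_{\mathrm{op}} = P \cdot t_{\mathrm{proc}}
% e.g., with a hypothetical draw of P = 10 W and t_proc = 15.43 us:
% E_op = 10 W x 15.43 us = 154.3 uJ per Low-PHY operation
```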
The proposed solution is flexible. If the available fronthaul/midhaul latency budget is large, an increased processing time can be traded for a lower power consumption, provided that the fronthaul/midhaul latency constraint is satisfied. Otherwise, the hardware guaranteeing the lowest processing time shall be utilized.

6.5. FPGA-based SmartNIC Scenario

Fig. 12 shows the overall execution time of the Low-PHY implementation when a COTS server is equipped with an FPGA, i.e., the scenario depicted in Fig. 8(a). In this implementation, although the FPGA-based implementation of the 5G Low-PHY layer provides a faster processing for larger IFFT sizes with lower energy consumption, the host-to-device/device-to-host and off-chip to on-chip/on-chip to off-chip memory transfer time is still a bottleneck. Indeed, writing to and reading from memory (i.e., Write and Read in the figure) requires more time than the kernel execution itself (i.e., Kernel in the figure), especially for the 2048 IFFT size.

Fig. 11. Power Consumption of the Low-PHY operation in FPGA, CPU, and GPU with different IFFT sizes.

Fig. 12. FPGA overall execution time.

However, the proposed SmartNIC implementation, depicted in Fig. 8(b), mitigates this issue because data are directly sent to the RU through the QSFP interface after the IFFT and CP addition execution; therefore, just the data writing to the kernel (i.e., Write) and the kernel processing time (i.e., Kernel) contribute to the overall execution time for the Option 2 functional split. Thus, the SmartNIC implementation reduces the processing time by about 54.33% for the 128 IFFT size and by 53.08% for the 2048 IFFT size with respect to the one achieved by the COTS server implementation.

Moreover, in case split Option 7-1 is implemented, where only the Low-PHY functions are implemented in the DU, while the upper layer functions are implemented in the CU, the computation time can be further reduced. Indeed, in this scenario the IQ frequency domain samples coming from the CU can be input directly to the FPGA-based SmartNIC, processed there, and the resulting IQ time domain samples can be sent to the RU through another SmartNIC QSFP output. Thus, in this scenario also the time required to write the data to the kernel (i.e., Write in Fig. 12) can be saved.

Another bottleneck of implementing OpenCL in the FPGA is the writing to and reading from the global memory, which is the DDR4 RAM of the FPGA development board. Since data are sent to the global memory from the host through the PCIe interface, they have to be forwarded first to the local memory inside the FPGA for the IFFT and CP addition processing. To further reduce the overall processing time of the FPGA-based Low-PHY, OpenCL host pipes [31] can be utilized to have a direct communication between the host and the kernel running in the FPGA. This solution bypasses the latency contributed by the global to local memory transfer inside the FPGA. However, this implementation is left as future work, since host pipes are supported by Arria 10 GX FPGAs only.

7. Conclusion

This paper proposed the implementation of the Low-PHY functions of a disaggregated gNB distributed unit (DU) in an FPGA-accelerated SmartNIC. The proposed Low-PHY implementation has been compared against the CPU-based implementation of the Low-PHY utilized by OpenAirInterface running in a CPU with 1 to 4 cores, and with the clfft and cufft libraries running in a GPU.

Results showed that for low IFFT sizes the FPGA-based and GPU-based cufft implementations experience a higher processing time compared to the CPU-based implementation. However, at 2048 IFFT points, the FPGA-based Low-PHY function and the GPU-based cufft can free up to 2 CPU cores thanks to a 1.45x and a 1.91x processing time reduction, respectively. The GPU-based implementation showed a higher energy consumption than the FPGA-based one. The FPGA-based implementation also showed the lowest energy usage per operation. Finally, the utilization of the FPGA-based SmartNIC avoided the latency contributed by the host-to-device/device-to-host and off-chip to on-chip/on-chip to off-chip memory transfer.

Although OpenCL provides an easy integration of the Low-PHY with other 5G functions implemented in the CPU, there are still some improvements to be addressed in future works. One is the utilization of host pipes for direct communication between the host and the kernel running in the FPGA. The implementation of the Low-PHY into the RU using a lower layer split (the 7.2x split used by O-RAN) will also be considered in an SoC FPGA to further reduce the processing time, since the CPU and the FPGA share the same memory. Finally, the FPGA-based Low-PHY will be integrated with the rest of the 5G protocol stack implemented by OpenAirInterface to analyze the overall gNB performance.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] A. Ghosh, A. Maeder, M. Baker, D. Chandramouli, 5G evolution: A view on 5G cellular technology beyond 3GPP Release 15, IEEE Access 7 (2019) 127639–127651.
[2] R. Mijumbi, J. Serrat, J.-L. Gorricho, N. Bouten, F. De Turck, R. Boutaba, Network function virtualization: State-of-the-art and research challenges, IEEE Commun. Surv. Tutor. 18 (1) (2016) 236–262.
[3] Technical Specification Group Radio Access Network, Study on New Radio Access Technology; Radio Access Architecture and Interfaces, Technical Report (TR) 38.801, 3GPP, 2017, Version 2.0.0.
[4] Network Functions Virtualization; Acceleration Technologies; Report on Acceleration Technologies & Use Cases, v1.1.1, ETSI GS NFV-IFA 001.
[5] OpenAirInterface | 5G software alliance for democratising wireless innovation, 2021, https://openairinterface.org/ (Last accessed: 2021-07-19).
[6] O-RAN Alliance, 2021, https://www.o-ran.org/ (Last accessed: 2021-07-21).
[7] A new approach to building and deploying telecom network infrastructure, 2021, https://telecominfraproject.com/?section=access (Last accessed: 2021-20-04).
[8] F. Civerchia, K. Kondepu, F. Giannone, S. Doddikrinda, P. Castoldi, L. Valcarenghi, Encapsulation techniques and traffic characterisation of an ethernet-based 5G fronthaul, in: 20th International Conference on Transparent Optical Networks (ICTON), 2018, http://dx.doi.org/10.1109/ICTON.2018.8473737.
[9] J.C. Borromeo, K. Kondepu, N. Andriolli, L. Valcarenghi, An overview of hardware acceleration techniques for 5G functions, in: 22nd International Conference on Transparent Optical Networks (ICTON), 2020, http://dx.doi.org/10.1109/ICTON51198.2020.9203242.
[10] NVIDIA DPUs, 2021, https://www.nvidia.com/en-us/networking/products/data-processing-unit/ (Last accessed: 2021-07-19).
[11] clFFT, 2021, https://github.com/clMathLibraries/clFFT (Last accessed: 2021-07-19).
[12] cuFFT, 2021, https://docs.nvidia.com/cuda/cufft/index.html (Last accessed: 2021-07-19).
[13] L.M.P. Larsen, A. Checko, H.L. Christiansen, A survey of the functional splits proposed for 5G mobile crosshaul networks, IEEE Commun. Surv. Tutor. 21 (1) (2019) 146–172.
[14] ITU-T, 5G wireless fronthauls requirements in a passive optical network context, 2020, ITU-T G.Sup66 (09/2020).
[15] O-RAN Fronthaul Working Group, Control, User and Synchronization Plane Specification, O-RAN.WG4.CUS.0-v07.00, Version 1.0, 2021.
[16] Huber+Suhner functional split, 2022, https://www.hubersuhner.com/en/documents-repository/technologies/pdf/fiber-optics-documents/5g-fundamentals-functional-split-overview (Last accessed: 2022-01-10).

[17] Common Public Radio Interface: eCPRI Interface Specification, Version 2.0, 2019.
[18] 3GPP, 3GPP TR 38.912 V15.0.0 (2018-06): Study on New Radio (NR) access technology (Release 15), 3GPP, 2018.
[19] J.K. Chaudhary, A. Kumar, J. Bartelt, G. Fettweis, C-RAN employing xRAN functional split: Complexity analysis for 5G NR remote radio unit, in: 2019 European Conference on Networks and Communications (EuCNC), 2019, pp. 580–585, http://dx.doi.org/10.1109/EuCNC.2019.8801953.
[20] J.W. Cooley, J.W. Tukey, An algorithm for the machine calculation of complex Fourier series, Mathematics of Computation 19 (1965) 297–301.
[21] G. Polat, S. Ozturk, M. Yakut, Design and implementation of 256-point radix-4 100 Gbit/s FFT algorithm into FPGA for high-speed applications, ETRI J. 37 (4) (2015) 667–676.
[22] A. Munshi, B.R. Gaster, T.G. Mattson, J. Fung, D. Ginsburg, OpenCL Programming Guide, Addison-Wesley, 2011.
[23] A. Barenghi, M. Madaschi, N. Mainardi, G. Pelosi, OpenCL HLS based design of FPGA accelerators for cryptographic primitives, in: International Conference on High Performance Computing & Simulation (HPCS), 2018, pp. 634–641, http://dx.doi.org/10.1109/HPCS.2018.00105.
[24] F. Civerchia, M. Pelcat, L. Maggiani, K. Kondepu, P. Castoldi, L. Valcarenghi, Is OpenCL driven reconfigurable hardware suitable for virtualising 5G infrastructure? IEEE Trans. Netw. Serv. Manage. 17 (2) (2020) 849–863.
[25] J.C. Borromeo, K. Kondepu, N. Andriolli, L. Valcarenghi, Experimental evaluation of 5G vRAN function implementation in an accelerated edge cloud, in: 2021 European Conference on Optical Communication (ECOC), 2021.
[26] Intel Corporation, Intel FPGA SDK for OpenCL Pro Edition: Programming guide, 2021, https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/opencl-sdk/aocl-best-practices-guide.pdf (Last accessed: 2021-07-21).
[27] The LLVM compiler infrastructure project, 2022, https://llvm.org/ (Last accessed: March 7, 2022).
[28] NVIDIA Tesla T4, 2021, https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-t4/t4-tensor-core-product-brief.pdf (Last accessed: 2021-07-21).
[29] P. Steinbach, M. Werner, gearshifft - The FFT benchmark suite for heterogeneous platforms, 2017, CoRR abs/1702.00629, arXiv:1702.00629, http://arxiv.org/abs/1702.00629.
[30] The stress terminal UI: s-tui, 2021, https://github.com/amanusk/s-tui (Last accessed: 2021-07-21).
[31] Intel, OpenCL host pipe design example, 2021, https://www.intel.com/content/www/us/en/programmable/support/support-resources/design-examples/design-software/opencl/host-pipe.html (Last accessed: 2021-07-21).

Justine Cris Borromeo received his BS Electronics Engineering degree at Mindanao State University - Iligan Institute of Technology in 2015, and the MS Electronics Engineering at Ateneo de Manila University in 2019. He is currently a Ph.D. student in Emerging Digital Technologies at Scuola Superiore Sant'Anna, Pisa. His research interests include radio access networks in 5G technologies, and FPGA- and GPU-based hardware acceleration.

Koteswararao Kondepu is an Assistant Professor at Indian Institute of Technology Dharwad, Dharwad, India. He obtained his Ph.D. degree in Computer Science and Engineering from the Institute for Advanced Studies Lucca (IMT), Italy, in July 2012. His research interests are 5G, optical networks design, energy-efficient schemes in communication networks, and sparse sensor networks.

Nicola Andriolli received the Laurea degree in telecommunications engineering from the University of Pisa in 2002, and the Diploma and Ph.D. degrees from Scuola Superiore Sant'Anna, Pisa, in 2003 and 2006, respectively. He was a Visiting Student at DTU, Copenhagen, Denmark, and a Guest Researcher at NICT, Tokyo, Japan. In 2007-2019 he was an Assistant Professor at Scuola Superiore Sant'Anna. Since 2019 he is a Researcher at CNR-IEIIT. He has a background in the design and the performance analysis of optical circuit-switched and packet-switched networks and nodes. His research interests have extended to photonic integration technologies for telecom, datacom, and computing applications, working in the field of optical processing, optical interconnection network architectures, and scheduling. Recently he has been investigating integrated transceivers, frequency comb generators, and architectures and subsystems for photonic neural networks. He authored more than 180 publications in international journals and conferences, contributed to one IETF RFC, and filed 11 patents.

Luca Valcarenghi is an Associate Professor at the Scuola Superiore Sant'Anna of Pisa, Italy, since 2014. He published almost three hundred papers (source Google Scholar, May 2020) in international journals and conference proceedings. Dr. Valcarenghi received a Fulbright Research Scholar Fellowship in 2009 and a JSPS Invitation Fellowship Program for Research in Japan (Long Term) in 2013. His main research interests are optical networks design, analysis, and optimization; communication networks reliability; energy efficiency in communications networks; optical access networks; zero touch network and service management; experiential networked intelligence; 5G technologies and beyond.
