FPGA-accelerated SmartNIC For Supporting 5G Virtualized Radio Access
Computer Networks
Keywords: Hardware Acceleration; Network function virtualization; OpenCL

Disaggregated, virtualized, and open next-generation eNodeB (gNB) could bring several benefits to the Next Generation Radio Access Network (NG-RAN) by enabling more market competition and customer choice, lower equipment costs, and improved network performance. This can be achieved through gNB-central unit (CU)-control plane (CP), gNB-CU-user plane (UP) and gNB-distributed unit (DU) separation, CU and DU function virtualization, and zero touch RAN management and control. However, to achieve the performance required by specific foreseen 5G usage scenarios (e.g., Ultra Reliable Low Latency Communications, URLLC), offloading selected disaggregated gNB functions into accelerated hardware becomes a necessity.
To this aim, this study proposes the implementation of 5G DU Low-PHY layer functions in an FPGA-based SmartNIC exploiting the Open Computing Language (OpenCL) framework to facilitate the integration of accelerated 5G functions within the mobile protocol stack. The proposed implementation is compared against (i) a CPU-based OpenAirInterface implementation, and (ii) a GPU-based implementation of the IFFT exploiting the clfft and cufft libraries. Experimental results show that the different optimization techniques implemented in the proposed solution reduce the Low-PHY processing time and the use of FPGA resources. Moreover, the GPU-based implementation with cufft and the proposed FPGA-based implementation have a lower processing time and power consumption compared to a CPU-based implementation with up to two cores. Finally, the implementation in a SmartNIC reduces the delay added by the host-to-device communication through the Peripheral Component Interconnect Express (PCIe) interface, considering both functional split options 2 and 7-1.
✩ This work received funding from the ECSEL JU grant agreement No 876967. The JU receives support from the EU Horizon 2020 research and innovation
programme and the Italian Ministry of Education, University, and Research (MIUR), Italy. Intel University Program and Terasic Inc are gratefully acknowledged
for donating the FPGA hardware. This work is also partly supported by DST SERB Startup Research Grant (SRG-2021-001522).
∗ Corresponding author.
E-mail address: [email protected] (J.C. Borromeo).
https://doi.org/10.1016/j.comnet.2022.108931
Received 7 October 2021; Received in revised form 7 March 2022; Accepted 24 March 2022
Available online 31 March 2022
1389-1286/© 2022 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Table 2
Number of Cyclic Prefix samples with respect to the FFT/IFFT size.
FFT/IFFT size CP length
128 10
256 20
512 40
1024 80
2048 160
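The CP lengths in Table 2 can be exercised with a short sketch. This is an illustrative model, not the paper's OpenCL kernel; the helper names `add_cp` and `remove_cp` are ours. In OFDM the cyclic prefix is a copy of the tail of the time-domain symbol prepended as a guard interval, so removal is a simple round-trip:

```python
# Illustrative sketch (not from the paper): cyclic-prefix insertion/removal.
# CP_LENGTH mirrors Table 2.
CP_LENGTH = {128: 10, 256: 20, 512: 40, 1024: 80, 2048: 160}

def add_cp(symbol):
    """Prepend the last CP-length samples of the time-domain symbol."""
    cp = CP_LENGTH[len(symbol)]
    return symbol[-cp:] + symbol

def remove_cp(with_cp, fft_size):
    """Drop the guard interval, recovering the original symbol."""
    return with_cp[CP_LENGTH[fft_size]:]

symbol = list(range(128))            # dummy 128-point time-domain symbol
framed = add_cp(symbol)
assert len(framed) == 128 + 10       # CP length from Table 2
assert remove_cp(framed, 128) == symbol
```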
Fig. 2. 5G RAN architecture with proposed hardware offloading setup using Options 2 and 7-1 functional split.

Fig. 3. Low-PHY layer protocol to be implemented in hardware.

3. Considered Low-PHY Implementation

The Low-PHY functions implemented in the FPGA are shown in Fig. 3 for both downlink and uplink directions. In the downlink direction, IQ samples in the frequency domain consisting of 32 bits (16 bits for the real part and 16 bits for the imaginary part) are received in the FPGA, which then performs the IFFT to convert the samples to the time domain. The cyclic prefix (CP) is then inserted as a guard interval to avoid inter-symbol interference (ISI). In the uplink direction, IQ samples in the time domain are received with CP, then the FPGA performs CP removal to take out the guard interval, and the FFT operation to convert the samples to the frequency domain. The CP insertion (removal) is performed by adding (removing) redundant samples before each OFDM symbol. The number of added or removed CP samples is shown in Table 2.

The FFT/IFFT implementation is one of the key components, and the most complex and computationally intensive module in Orthogonal Frequency Division Multiplexing (OFDM). The FFT is an optimized computation of the Discrete Fourier Transform (DFT), which can be written as:

X(k) = Σ_{n=0}^{N−1} x(n) W_N^{nk},  k = 0, 1, …, N−1,

where X(k) are the samples in the frequency domain, x(n) are the samples in the time domain, N is the number of FFT/IFFT points, and W_N^{nk} is the twiddle factor. The latter is computed as:

W_N^{nk} = e^{−j2πnk/N}.

The computation of the FFT/IFFT can be divided into two parts: the data reordering and the radix butterfly configuration. The order of the samples of the FFT/IFFT input and output are different; the former is in natural order, while the latter is in bit-reversed order. A data reordering step is used to convert the input from the natural order to bit-reversed order and vice versa.

The radix butterfly configuration decomposes the computation of the FFT into different stages. The number of radix butterfly stages needed to compute the FFT/IFFT is given as log_M(N), where N is the number of FFT/IFFT points, while M represents the radix number. The higher the radix number, the fewer the stages needed, balanced by a more complex twiddle factor computation.

Fig. 4 shows a 16-point FFT computed in two ways: (a) 4 stages using Radix-2; and (b) 2 stages using Radix-4. In terms of multiplication cost, Radix-2 brings twiddle factors at 0° and 180°, while Radix-4 has twiddle factors at angles 0°, 90°, 180°, and 270° [21]. However, as shown in Fig. 4, the former needs 4 stages, while the latter only needs 2 stages to compute the 16-point FFT. In computing an FFT with N points (higher than 16), a combination of Radix-4 and Radix-2 stages can be considered for implementation.

This paper focuses on implementing up to 2048 FFT/IFFT points; the number of radix butterfly stages in each implementation is shown in Fig. 5. We implemented a Radix-4 butterfly configuration in the first few stages of the FFT/IFFT to achieve fewer computation stages, then added a Radix-2 butterfly configuration as the last stage of the 128-, 512- and 2048-point FFT/IFFT.

4. OpenCL Framework

Low-PHY functions are implemented by using the OpenCL framework [22], which can execute a kernel on an FPGA platform using
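The FFT building blocks described in this section (bit-reversed data reordering, butterfly stages, and the Radix-4/Radix-2 stage decomposition) can be sketched in Python. This is an illustrative software model, not the paper's OpenCL kernel, and all function names are ours; a Radix-2 pipeline is shown for brevity, checked against the direct DFT definition above:

```python
import cmath

def bit_reversed(n):
    """Indices 0..n-1 with their log2(n)-bit binary representation reversed."""
    bits = n.bit_length() - 1
    return [int(format(i, "0{}b".format(bits))[::-1], 2) for i in range(n)]

def fft_radix2(x):
    """Iterative decimation-in-time Radix-2 FFT: reorder, then log2(N) butterfly stages."""
    n = len(x)
    a = [x[i] for i in bit_reversed(n)]            # data reordering step
    size = 2
    while size <= n:
        w_step = cmath.exp(-2j * cmath.pi / size)  # twiddle factor W_size^1
        for start in range(0, n, size):
            w = 1.0
            for k in range(size // 2):             # one butterfly per sample pair
                t = w * a[start + k + size // 2]
                a[start + k + size // 2] = a[start + k] - t
                a[start + k] = a[start + k] + t
                w *= w_step
        size *= 2
    return a

def dft(x):
    """Direct O(N^2) DFT: X(k) = sum_n x(n) * W_N^{nk}."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def radix_stages(n):
    """Radix-4 stages first, with a final Radix-2 stage when N = 2 * 4^m."""
    stages = []
    while n % 4 == 0:
        stages.append(4)
        n //= 4
    if n == 2:
        stages.append(2)
        n //= 2
    assert n == 1, "N must be a power of two"
    return stages

x = [complex(i % 5, -i % 3) for i in range(16)]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft_radix2(x), dft(x)))
assert radix_stages(128) == [4, 4, 4, 2]          # 3 Radix-4 stages + 1 Radix-2 stage
assert radix_stages(2048) == [4, 4, 4, 4, 4, 2]
```

The `radix_stages` helper reproduces the mixed decomposition described for Fig. 5: the 256- and 1024-point sizes need only Radix-4 stages, while 128, 512, and 2048 require a final Radix-2 stage.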
Fig. 8. DU implementation scenarios: (a) COTS server equipped with FPGA; (b) FPGA-based SmartNIC implementation.
This section evaluates the impact of the different optimization techniques on the processing time (i.e., the time required to process the considered FFT size), the resource utilization (logic gates, DSP, memory bits, RAM), and the maximum kernel operating frequency of a Low-PHY function in an FPGA using the OpenCL framework. The best performing version is then assessed in terms of resource utilization and maximum kernel operating frequency by varying the number of IFFT points from 128 to 2048. Results of the FPGA-based Low-PHY in terms of processing time and energy consumption are then compared to a CPU-based Low-PHY implementation with OpenAirInterface, and a GPU-based implementation of the clfft and cufft libraries. The clfft is an OpenCL-based software library containing FFT functions [11]. The cufft is a CUDA-based FFT library developed by NVIDIA [12].

As shown in Fig. 2, the RU hosts just the RF functions, implemented in an Ettus X310 Universal Software Radio Peripheral (USRP), and is connected to the DU through a 10 Gbps optical Ethernet fronthaul. The DU and CU components are deployed in the edge cloud exploiting Dell PowerEdge R740 servers. A midhaul link with 10/25 Gbps is used to connect the DU and the CU. In the accelerated edge, the DU Low-PHY functions are offloaded onto a DE10-Pro development board with a Stratix 10 FPGA, two 8 GB DDR4 memory modules, and a 16-lane PCIe v3.0 interface with 32 GB/s bandwidth [26]. The CPU-based implementation is executed on an Intel Core CPU running at 4.2 GHz and based on Intel Advanced Vector Extensions 2 (AVX2), where arithmetic operations are performed on 256-bit vectors to achieve better performance with floating point calculations and data organization. The clfft and cufft are implemented in an NVIDIA Tesla T4 GPU featuring 320 NVIDIA Turing tensor cores, 16 GB GDDR6 memory modules, and a 16-lane PCIe v3.0 interface [28]. Host-to-device transfer latency results are also detailed.

6.1. Low-PHY Layer Optimization

Table 3 shows the performance of five implementations of the Low-PHY function of the 5G RAN with 128 IFFT points, considering the different optimization techniques discussed in Section 3.4. The five implementations exploit different optimization techniques available in the considered Intel FPGA SDK for OpenCL: (i) version 1 features the implementation of IFFT and CP addition without any optimization; (ii) version 2 uses the loop unrolling method; (iii) version 3 removes function calls inside the main kernel code; (iv) version 4 implements a matrix instead of a vector representation of the array to increase the kernel frequency, thus reducing the computation time; and (v) version 5 is the same as version 4, but with the kernel code compiled using Intel FPGA SDK for OpenCL version 20.3, instead of the older 19.1 used in the previous cases. The different versions are compared in terms of processing time, logic gate utilization, DSP utilization, memory utilization, RAM utilization, and kernel operating frequency.

As shown in Table 3, the processing time decreases from 34.37 μs to 23.5 μs when the loop is fully unrolled, with the trade-off of an increased utilization of logic gates (14% to 25%), DSP (<1% to 5%), and RAM (4% to 6%). When function calling is avoided in the implementation, there is a decrease of about 4% in the utilization of logic elements. Using a matrix representation of the array of components further decreases the processing time and the used logic gates, and increases the kernel frequency by a factor of 1.7. The fastest processing time, 15.43 μs, is achieved in implementation version 5 with a utilization of 21% for logic gates, 3% for DSP, 5% for memory, and 11% for RAM, and the kernel operates at a 484.78 MHz frequency, achieved with Intel FPGA SDK for OpenCL version 20.3.

Table 3
Processing time, resource utilization, and kernel operating frequency of the five Low-PHY implementation versions (128 IFFT points).
                          Version 1  Version 2  Version 3  Version 4  Version 5
Processing time [μs]      34.37      23.5       23.45      21.4       15.43
Logic gate utilization    14%        25%        21%        19%        21%
DSP utilization           <1%        5%         4%         3%         3%
Memory utilization        2%         2%         2%         3%         5%
RAM utilization           4%         6%         6%         7%         11%
Kernel frequency [MHz]    239.23     366.7      285.63     484.26     484.78

6.2. Hardware Performance

Considering the best performing version in Table 3, namely version 5, 5G Low-PHY functions with 128 up to 2048 IFFT points have been implemented in the FPGA using the OpenCL framework. Results in terms of logic gate utilization, DSP utilization, memory utilization, RAM utilization, and kernel operating frequency are shown in Table 4.

It can be noted that the use of hardware resources increases as the IFFT points increase. This is because the size of the array increases and more FPGA resources are needed to parallelize the IFFT and CP addition computation. Also, the kernel frequency decreases with increasing IFFT points due to the increasing complexity, given by O(N log N) with N being the number of IFFT points.

Table 4
FPGA resources and kernel operating frequency of Low-PHY layer functions with different IFFT points.
                          128     256     512     1024    2048
Logic gate utilization    21%     26%     36%     51%     66%
DSP utilization           3%      3%      8%      14%     14%
Memory utilization        5%      6%      11%     15%     15%
RAM utilization           11%     13%     16%     23%     23%
Kernel frequency [MHz]    484.78  461.68  390.93  299.67  146.26

6.3. Processing Time

Fig. 9 shows the processing time as a function of the IFFT size of the 5G Low-PHY in OpenCL exploiting an FPGA, a CPU with a different number of processing cores (from one to four), and a GPU with the clfft and cufft libraries. The open-source benchmark package Gearshifft [29] is used to evaluate the processing performance of the IFFT libraries on the GPU.
Fig. 9. FPGA vs. CPU vs. GPU processing time.

Fig. 10. Energy usage per Low-PHY operation in FPGA, CPU, and GPU with different IFFT points.
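The energy-per-operation metric plotted in Fig. 10 is simply the product of power consumption and Low-PHY processing time. The sketch below works through that computation; the power figures and the FPGA time are hypothetical placeholders (the paper measures power with s-tui, quartus_pow, and nvidia-smi but does not tabulate raw values here), and only the 31.84 μs single-core CPU time for 2048 points comes from the text:

```python
# Energy per Low-PHY operation = power draw x processing time.
# NOTE: the power values (and the FPGA time) below are HYPOTHETICAL
# placeholders for illustration, not measurements from the paper.
def energy_per_op_uj(power_w, time_us):
    """Energy in microjoules: watts x microseconds = microjoules."""
    return power_w * time_us

# 31.84 us is the single-core CPU time for 2048 points quoted in Section 6.3;
# the 65 W and the (10 W, 22 us) pair are invented for illustration only.
cpu_energy = energy_per_op_uj(65.0, 31.84)   # hypothetical CPU package power
fpga_energy = energy_per_op_uj(10.0, 22.0)   # hypothetical FPGA power and time

# A longer processing time can still yield lower energy if the power draw is
# low enough, which is the trade-off discussed for the FPGA at large IFFT sizes.
assert fpga_energy < cpu_energy
```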
The CPU-based processing time decreases for increasing CPU cores. This is expected since multiple cores allow running multiple processes at the same time with greater ease compared to a single core, increasing the performance when handling multiple tasks or demanding computations. Such a processing time improvement cannot be appreciated for lower IFFT sizes (i.e., 128 and 256 IFFT points) due to the lower complexity, which can be handled by a single CPU core. However, a notable performance difference can be highlighted for 2048 IFFT points, where it takes only around 6.38 μs to run the implementation with 4 CPU cores, while it needs 31.84 μs with 1 CPU core. This is because the computational effort for IFFT and CP addition is shared among 4 CPU cores, resulting in an approximately four times lower processing time.

Moreover, the processing time for the CPU-, GPU-, and FPGA-based implementations increases as a function of the IFFT points. However, just a small increase in the processing time (around 1 μs) occurs with the FPGA-based implementation because of data parallelism with full loop unrolling on the radix butterfly and CP addition computation. The same happens with the GPU-based implementation since it has more parallel computing resources (320 Turing Tensor Cores and 2560 CUDA Cores for the NVIDIA Tesla T4) compared to a CPU. Also, GPU-based cufft has a faster processing time compared to clfft, since cufft is based on CUDA, also developed by NVIDIA where the library is executed, better matching the computing characteristics of the GPU and thus offering better performance.

In comparing the processing time of the different implementations, results show that the GPU-based implementation of the clfft library has the longest processing time, making it less suited to be deployed in the 5G Low-PHY. CPU-based implementations (from 1 to 4 CPU cores) have the shortest processing times up to 512 IFFT size. However, at 2048 IFFT points, the FPGA-based and GPU-based cufft implementations are 1.45x and 1.91x faster compared to the CPU-based implementation with up to 2 CPU cores, respectively. The CPU-based Low-PHY implementation performs better at smaller IFFT sizes since it operates at a higher clock frequency (i.e., 4.2 GHz) compared to the GPU (i.e., 1.59 GHz) and the FPGA (i.e., 480 MHz). However, when the computational complexity increases (i.e., N > 1024 IFFT points), the data parallelism of the FPGA and the parallel computing resources of the GPU outperform CPU-based implementations.

Although the CPU-based implementation with 3 or 4 cores has a shorter processing time compared to the FPGA-based Low-PHY and the cufft library implementation in the Tesla T4 GPU, these results are only possible if there are no other 5G functions implemented in the CPU. With a 5G RAN testbed using a higher layer split (option 2), the DU does not only implement the Low-PHY, but also processes the High-PHY, MAC, and RLC functions in the CPU. In this case, CPU resources are shared among four different 5G functions, resulting in a longer processing time. This is why implementing the 5G Low-PHY function in the FPGA or offloading the IFFT implementation to the GPU using the cufft library can improve the overall gNB performance, since they can free some of the CPU cores (e.g., 2) and the freed CPU cores can be exploited to perform the remaining RAN functions.

6.4. Energy Consumption

Fig. 10 shows the energy consumption per Low-PHY operation in the FPGA, a single CPU core, and the GPU (for both clfft and cufft). It is measured as the energy consumed per Low-PHY operation, which is obtained by multiplying the power consumption by the Low-PHY processing time in each device. The energy consumption allows a fair comparison among the different implementations because it is the product of power consumption and processing time. Thus, low energy consumption can be achieved by low power consumption or short processing time. The s-tui [30] tool is used to measure the CPU power, quartus_pow (included in the DE10-Pro board support package) is used to estimate the power dissipated in the FPGA, and the nvidia-smi command is used to measure the power consumption in the GPU. As shown in Fig. 10, the energy usage per operation increases as a function of the considered IFFT points when using either the FPGA, CPU, or GPU. Results also show that the GPU-based clfft consumes the highest amount of energy per operation, also due to the large processing time shown in Fig. 9, making it the least optimal solution to be integrated into the 5G Low-PHY. On the other hand, the CPU-based implementation of the Low-PHY has the lowest energy consumption up to 512 IFFT size because of the short processing time. However, at higher IFFT sizes, the FPGA-based implementation has the lowest energy consumption (1.03x lower than single-core CPU for 1024 IFFT size and 2.22x lower than single-core CPU for 2048 IFFT size), followed by the GPU-based cufft implementation. In fact, the long processing time of the FPGA-based implementation (see Fig. 9) is compensated by a low power consumption (see Fig. 11).

Also, implementing the Low-PHY with more CPU cores further increases the energy consumption. This means that, as the IFFT size increases, offloading the Low-PHY function into the FPGA becomes more energy efficient compared to processing it in the CPU or GPU (for both clfft and cufft libraries).

The proposed solution is flexible. If the available fronthaul/midhaul latency budget is large, an increased processing time can be traded for a lower power consumption, provided that the fronthaul/midhaul latency constraint is satisfied. Otherwise, the hardware guaranteeing the lowest processing time shall be utilized.

6.5. FPGA-based SmartNIC Scenario

Fig. 12 shows the overall execution time of the Low-PHY implementation when a COTS server is equipped with an FPGA, i.e., the scenario depicted in Fig. 8(a). In this implementation, although the FPGA-based implementation of the 5G Low-PHY layer provides faster processing for larger IFFT sizes with lower energy consumption, the host-to-device/device-to-host and off-chip to on-chip/on-chip to off-chip memory transfer time is still a bottleneck. Indeed, writing to and reading from memory (i.e., Write and Read in the figure) requires more time than the kernel execution itself (i.e., Kernel in the figure), especially for the 2048 IFFT size. However, the proposed SmartNIC
implementation, depicted in Fig. 8(b), mitigates this issue because data are directly sent to the RU through the QSFP interface after the IFFT and CP addition execution; therefore just the data writing to the kernel (i.e., Write) and the kernel processing time (i.e., Kernel) contribute to the overall execution time for the option 2 functional split. Thus, the SmartNIC implementation reduces the processing time by about 54.33% for the 128 IFFT size and by 53.08% for the 2048 IFFT size with respect to the one achieved by the COTS server implementation.

Moreover, in case split Option 7-1 is implemented, where only the Low-PHY functions are implemented in the DU, while the upper layer functions are implemented in the CU, the computation time can be further reduced. Indeed, in this scenario IQ frequency domain samples coming from the CU can be input directly to the FPGA-based SmartNIC, processed there, and the resulting IQ time domain samples can be sent to the RU through another SmartNIC QSFP output. Thus, in this scenario also the time required to write the data to the kernel (i.e., Write in Fig. 12) can be saved.

Another bottleneck of implementing OpenCL in the FPGA is the writing to and reading from the global memory, which is the DDR4 RAM of the FPGA development board. Since data are sent to the global memory from the host through the PCIe interface, they have to be forwarded first to the local memory inside the FPGA for the IFFT and CP addition processing. To further reduce the overall processing time of the FPGA-based Low-PHY, OpenCL host pipes [31] can be utilized to have a direct communication between the host and the kernel running in the FPGA. This solution bypasses the latency contributed by the global-to-local memory transfer inside the FPGA. However, this implementation is left as future work since host pipes are supported by Arria 10 GX FPGAs only.

7. Conclusion

This paper proposed the implementation of the Low-PHY functions of a disaggregated gNB distributed unit (DU) in an FPGA-accelerated

References

[1] A. Ghosh, A. Maeder, M. Baker, D. Chandramouli, 5G evolution: A view on 5G cellular technology beyond 3GPP release 15, IEEE Access 7 (2019) 127639–127651.
[2] R. Mijumbi, J. Serrat, J.-L. Gorricho, N. Bouten, F. De Turck, R. Boutaba, Network function virtualization: State-of-the-art and research challenges, IEEE Commun. Surv. Tutor. 18 (1) (2016) 236–262.
[3] Technical Specification Group Radio Access Network, Study on New Radio Access Technology; Radio Access Architecture and Interfaces, Technical Report (TR) 38.801, 3GPP, 2017, Version 2.0.0.
[4] Network Functions Virtualisation; Acceleration Technologies; Report on Acceleration Technologies & Use Cases, v1.1.1, ETSI GS NFV-IFA 001.
[5] OpenAirInterface | 5G software alliance for democratising wireless innovation, 2021, https://openairinterface.org/ (Last accessed: 2021-07-19).
[6] O-RAN Alliance, 2021, https://www.o-ran.org/ (Last accessed: 2021-07-21).
[7] A new approach to building and deploying telecom network infrastructure, 2021, https://telecominfraproject.com/?section=access (Last accessed: 2021-20-04).
[8] F. Civerchia, K. Kondepu, F. Giannone, S. Doddikrinda, P. Castoldi, L. Valcarenghi, Encapsulation techniques and traffic characterisation of an ethernet-based 5G fronthaul, in: 20th International Conference on Transparent Optical Networks (ICTON), 2018, http://dx.doi.org/10.1109/ICTON.2018.8473737.
[9] J.C. Borromeo, K. Kondepu, N. Andriolli, L. Valcarenghi, An overview of hardware acceleration techniques for 5G functions, in: 22nd International Conference on Transparent Optical Networks (ICTON), 2020, http://dx.doi.org/10.1109/ICTON51198.2020.9203242.
[10] NVIDIA DPUs, 2021, https://www.nvidia.com/en-us/networking/products/data-processing-unit/ (Last accessed: 2021-07-19).
[11] clFFT, 2021, https://github.com/clMathLibraries/clFFT (Last accessed: 2021-07-19).
[12] cuFFT, 2021, https://docs.nvidia.com/cuda/cufft/index.html (Last accessed: 2021-07-19).
[13] L.M.P. Larsen, A. Checko, H.L. Christiansen, A survey of the functional splits proposed for 5G mobile crosshaul networks, IEEE Commun. Surv. Tutor. 21 (1) (2019) 146–172.
[14] ITU-T, 5G wireless fronthaul requirements in a passive optical network context, 2020, ITU-T G.Sup66 (09/2020).
[15] O-RAN Fronthaul Working Group, Control, User and Synchronization Plane Specification, O-RAN.WG4.CUS.0-v07.00, Version 1.0, 2021.
[16] Huber+Suhner functional split, 2022, https://www.hubersuhner.com/en/documents-repository/technologies/pdf/fiber-optics-documents/5g-fundamentals-functional-split-overview (Last accessed: 2022-01-10).
[17] Common Public Radio Interface: eCPRI Interface Specification, Version 2.0, 2019.
[18] 3GPP, 3GPP TR 38.912 v15.0.0 (2018-06): Study on New Radio (NR) access technology (release 15), (38.912) 3GPP, 2018, Version 15.0.0.
[19] J.K. Chaudhary, A. Kumar, J. Bartelt, G. Fettweis, C-RAN employing xRAN functional split: Complexity analysis for 5G NR remote radio unit, in: 2019 European Conference on Networks and Communications (EuCNC), 2019, pp. 580–585, http://dx.doi.org/10.1109/EuCNC.2019.8801953.
[20] J.W. Cooley, J.W. Tukey, An algorithm for the machine calculation of complex Fourier series, in: Mathematics of Computation, 1965, pp. 297–301.
[21] G. Polat, S. Ozturk, M. Yakut, Design and implementation of 256-point radix-4 100 Gbit/s FFT algorithm into FPGA for high-speed applications, ETRI J. 37 (4) (2015) 667–676.
[22] A. Munshi, B.R. Gaster, T.G. Mattson, J. Fung, D. Ginsburg, OpenCL Programming Guide, Addison-Wesley, 2011.
[23] A. Barenghi, M. Madaschi, N. Mainardi, G. Pelosi, OpenCL HLS based design of FPGA accelerators for cryptographic primitives, in: International Conference on High Performance Computing & Simulation (HPCS), 2018, pp. 634–641, http://dx.doi.org/10.1109/HPCS.2018.00105.
[24] F. Civerchia, M. Pelcat, L. Maggiani, K. Kondepu, P. Castoldi, L. Valcarenghi, Is OpenCL driven reconfigurable hardware suitable for virtualising 5G infrastructure? IEEE Trans. Netw. Serv. Manage. 17 (2) (2020) 849–863.
[25] J.C. Borromeo, K. Kondepu, N. Andriolli, L. Valcarenghi, Experimental evaluation of 5G vRAN function implementation in an accelerated edge cloud, in: 2021 European Conference on Optical Communication (ECOC), 2021.
[26] Intel Corporation, Intel FPGA SDK for OpenCL Pro Edition: Programming guide, 2021, https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/opencl-sdk/aocl-best-practices-guide.pdf (Last accessed: 2021-07-21).
[27] The LLVM compiler infrastructure project, 2022, https://llvm.org/ (Last accessed: 2022-03-07).
[28] NVIDIA Tesla T4, 2021, https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-t4/t4-tensor-core-product-brief.pdf (Last accessed: 2021-07-21).
[29] P. Steinbach, M. Werner, Gearshifft - the FFT benchmark suite for heterogeneous platforms, 2017, CoRR abs/1702.00629, arXiv:1702.00629, URL http://arxiv.org/abs/1702.00629.
[30] The stress terminal UI: s-tui, 2021, https://github.com/amanusk/s-tui (Last accessed: 2021-07-21).
[31] Intel, OpenCL host pipe design example, 2021, https://www.intel.com/content/www/us/en/programmable/support/support-resources/design-examples/design-software/opencl/host-pipe.html (Last accessed: 2021-07-21).

Justine Cris Borromeo received his BS Electronics Engineering degree at Mindanao State University - Iligan Institute of Technology in 2015, and the MS Electronics Engineering at Ateneo de Manila University in 2019. He is currently a Ph.D. student in Emerging Digital Technologies at Scuola Superiore Sant'Anna, Pisa. His research interests include radio access networks in 5G technologies, and FPGA- and GPU-based hardware acceleration.

Koteswararao Kondepu is an Assistant Professor at Indian Institute of Technology Dharwad, Dharwad, India. He obtained his Ph.D. degree in Computer Science and Engineering from Institute for Advanced Studies Lucca (IMT), Italy in July 2012. His research interests are 5G, optical networks design, energy-efficient schemes in communication networks, and sparse sensor networks.

Nicola Andriolli received the Laurea degree in telecommunications engineering from the University of Pisa in 2002, and the Diploma and Ph.D. degrees from Scuola Superiore Sant'Anna, Pisa, in 2003 and 2006, respectively. He was a Visiting Student at DTU, Copenhagen, Denmark and a Guest Researcher at NICT, Tokyo, Japan. In 2007-2019 he was an Assistant Professor at Scuola Superiore Sant'Anna. Since 2019 he is a Researcher at CNR-IEIIT. He has a background in the design and the performance analysis of optical circuit-switched and packet-switched networks and nodes. His research interests have extended to photonic integration technologies for telecom, datacom and computing applications, working in the field of optical processing, optical interconnection network architectures and scheduling. Recently he has been investigating integrated transceivers, frequency comb generators, and architectures and subsystems for photonic neural networks. He authored more than 180 publications in international journals and conferences, contributed to one IETF RFC, and filed 11 patents.

Luca Valcarenghi is an Associate Professor at the Scuola Superiore Sant'Anna of Pisa, Italy, since 2014. He published almost three hundred papers (source Google Scholar, May 2020) in International Journals and Conference Proceedings. Dr. Valcarenghi received a Fulbright Research Scholar Fellowship in 2009 and a JSPS Invitation Fellowship Program for Research in Japan (Long Term) in 2013. His main research interests are optical networks design, analysis, and optimization; communication networks reliability; energy efficiency in communications networks; optical access networks; zero touch network and service management; experiential networked intelligence; 5G technologies and beyond.