CycloneNTT: An NTT/FFT Architecture Using Quasi-Streaming of Large Datasets on DDR- and HBM-based FPGA Platforms
performing the butterfly operations, and employing FIFOs of different depth to match the required stride access. Because of the data-access pattern, the architecture needs to block data in on-chip SRAM (performing a matrix transpose) before it is able to write it back to off-chip memory. Although it is an interesting architecture, it targets a limited range of input sizes, from $2^{14}$ to $2^{20}$ elements, and being an ASIC design means that it loses the deployment flexibility and time-to-market offered by FPGAs.

NTTGen [YKKP22] is a hardware generation framework targeting FPGAs, optimized for homomorphic encryption. The inputs to the framework are application parameters (such as latency, polynomial degree, and a list of prime moduli) and hardware resource constraints (e.g., DSP, BRAM, and I/O bandwidth), while the output is synthesizable Verilog code. It exploits data-, pipeline-, and batch-parallelism, and offers two flavours of cores: general purpose, and low-latency customized for generalized Mersenne primes. Input and output polynomials are stored in off-chip memory, and results are shown for polynomial degrees ranging from $2^{10}$ to $2^{14}$. At the heart of the design is the use of Streaming Permutation Networks (SPN) [CP15], which allow for arbitrary permutation strides while reducing interconnect complexity and avoiding expensive crossbars. An SPN consists of three sub-networks: two of them for spatial permutations (within the same cycle) and one for temporal permutation (across cycles). NTTGen actually extends the original SPN to support runtime control of the underlying routing tables and address generation. In comparison, CycloneNTT takes a different approach: we avoid permutations entirely by using a multi-FIFO architecture backed by off-chip memory.

Worth mentioning as well is HEAX [RLPD20], a highly parallelizable hardware architecture targeting fully homomorphic encryption, with pipelined NTT cores and a word size of 54 bits to maximize resource utilization (DSPs). It requires on-chip multiplexers to distribute data and twiddle factors to the cores, and is mainly optimized for on-chip memory usage up to $2^{13}$ elements. Beyond that, it employs off-chip memory, but only to a very limited extent due to the latency overhead.

Previous relevant art targeting FPGAs also includes [SRTJ+19] and [KLC+20] for homomorphic encryption, [DM22] for lattice-based cryptography, [MK20] for homomorphically encrypted deep neural network inference, as well as [MKO+20] for a post-quantum digital signature scheme. Nevertheless, these designs primarily make use of on-chip BRAM and do not directly provision the use of off-chip memory. All of these works could benefit from the CycloneNTT approach to scale to larger NTTs and possibly allow for richer applications. Other interesting references, although not directly applicable to CycloneNTT, include: in-memory computation, such as CryptoPIM [NGI+20] and MeNTT [LPY22]; optimized GPU-based implementations, such as [DCH+21]; as well as RISC-V architecture extensions for NTT, such as [PS22] and [KA20].

The Discrete Fourier Transform (DFT) is a linear transformation $F_N : \mathbb{F}^N \to \mathbb{F}^N$ defined for $y = F_N(x)$ by:

$$y_i = \sum_{j=0}^{N-1} \omega_N^{ij}\, x_j$$

When $\mathbb{F}$ is a finite field of characteristic $p > 0$, the DFT is usually called the Number Theoretic Transform (NTT). Note that in a practical implementation, a DFT over $\mathbb{C}$ will be an approximation, whereas in the case of the NTT we are interested in exact solutions.

A naive implementation of the DFT has complexity $O(N^2)$. An efficient algorithm that computes the DFT with complexity $O(N \log N)$ or better is referred to as a Fast Fourier Transform (FFT). For practical applications, we typically set $N = 2^n$ (radix-2 FFT/NTT), extending the $x_i$ with zeros.

Since the $\omega_N^{ij}$ are $N^2$ coefficients, but the $\omega_N^k$ are only $N$ distinct numbers, there is periodicity in the matrix coefficients. This led to a number of "factorization lemmas", of which the Cooley-Tukey framework [CT65] is the most well-known and fundamental.

When $N$ factors as $n_1 n_2$, the Cooley-Tukey framework decomposes a given order-$N$ DFT into $n_2$ smaller order-$n_1$ DFTs and $n_1$ smaller order-$n_2$ DFTs, combined with so-called twiddles, which are proper powers of $\omega_N$. It directly follows that the complexity reduces from $N^2$ to $n_1 n_2^2 + n_1^2 n_2 < N^2$.

By repeatedly factoring $N$, one gets to smaller DFTs of prime order. These "minimal" FFTs (combined with their twiddle multiplications) are called butterflies. In particular, if $N$ is a power of 2, factoring all the way leads to repeated order-2 DFTs, for a total complexity of $N \log N$.

Two common ways to compute an FFT are decimation in time (DIT) and decimation in frequency (DIF). These represent the two cases of factoring $N$ "from the left" or "from the right": DIF when $n_1$ is a small radix, DIT when $n_2$ is a radix. We stress that DIT and DIF differ both in 1) how they "loop" over the input values, and 2) the butterfly function. We add that it is not necessary and, as we will see, not always efficient to factor "all the way down": repeated higher-order (also called higher-radix) factors reduce the number of accesses to memory.

3.2 Goldilocks Field
While our design is generic and can be adapted to any field, including complex numbers, our current implementation focuses on the field $\mathbb{F}_p$ with $p = 2^{64} - 2^{32} + 1 =$ 0xFFFF_FFFF_0000_0001, called the Goldilocks field [Gou21, Tea22]. Elements $a, b \in \mathbb{F}_p$ can be represented as 64-bit (unsigned) integers, and common field operations (addition, subtraction, and multiplication with modular reduction) are extremely efficient on 64-bit CPUs.
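As an illustration of why this prime is convenient, the following is a minimal software sketch (ours, not the hardware pipeline of Sec. 5; the gl_* names are illustrative) of Goldilocks arithmetic in Rust. The only facts it uses are the identities $2^{64} \equiv 2^{32} - 1$ and $2^{96} \equiv -1 \pmod{p}$:

```rust
/// Goldilocks prime p = 2^64 - 2^32 + 1.
const P: u64 = 0xFFFF_FFFF_0000_0001;
/// epsilon = 2^32 - 1 is the residue of 2^64 modulo p.
const EPSILON: u64 = 0xFFFF_FFFF;

/// (a + b) mod p, for a, b < p.
fn gl_add(a: u64, b: u64) -> u64 {
    let (mut s, carry) = a.overflowing_add(b);
    if carry {
        // A dropped 2^64 re-enters as 2^32 - 1; this cannot overflow again.
        s = s.wrapping_add(EPSILON);
    }
    if s >= P {
        s -= P;
    }
    s
}

/// (a - b) mod p, for a, b < p.
fn gl_sub(a: u64, b: u64) -> u64 {
    let (mut d, borrow) = a.overflowing_sub(b);
    if borrow {
        // A borrowed 2^64 is worth 2^32 - 1, so take it back out.
        d = d.wrapping_sub(EPSILON);
    }
    d
}

/// (a * b) mod p. Split the 128-bit product as lo + hi_lo*2^64 + hi_hi*2^96
/// and reduce with 2^64 ≡ 2^32 - 1 and 2^96 ≡ -1 (mod p).
fn gl_mul(a: u64, b: u64) -> u64 {
    let n = (a as u128) * (b as u128);
    let lo = n as u64;
    let hi = (n >> 64) as u64;
    let hi_lo = hi & EPSILON; // 32-bit limb with weight 2^64
    let hi_hi = hi >> 32;     // 32-bit limb with weight 2^96

    // t = lo - hi_hi (mod p); t stays below 2^64 but may be unreduced.
    let (mut t, borrow) = lo.overflowing_sub(hi_hi);
    if borrow {
        t = t.wrapping_sub(EPSILON);
    }
    // Fold in hi_lo * (2^32 - 1), which fits in 64 bits.
    let (mut r, carry) = t.overflowing_add((hi_lo << 32) - hi_lo);
    if carry {
        r = r.wrapping_add(EPSILON);
    }
    if r >= P {
        r -= P;
    }
    r
}
```

As a sanity check, gl_mul(P - 1, P - 1) returns 1, since $p - 1 \equiv -1 \pmod{p}$. No division or generic Montgomery/Barrett reduction is needed; everything is shifts, adds, and one 64x64-bit multiply, which is what makes the same operations cheap in DSP-based hardware as well.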
ntt_send(values) Send input values to FPGA.
(1) Host sends the input values over PCIe, in bit-reverse order. (Bit-reverse order indexes can be computed on the fly, and the host can access the input values in RAM in bit-reverse order; this seems more efficient than in-place reordering.)
(2) FPGA receives the input values and stores them in the external memory.

ntt_compute() Compute NTT.
(1) Host sends compute command over PCIe.
(2) FPGA repeats multiple times:
  (a) Load twiddles and values into input FIFOs (from external memory into SRAM).
  (b) Process twiddles and values via Sub-NTT.
  (c) Send output values into output FIFOs (from SRAM to external memory).
Details depend on the architecture and are explained in Sec. 5, 6.

ntt_recv() → values Receive output values from FPGA.
(1) Host sends recv command over PCIe.
(2) FPGA returns output values from external memory over PCIe.

Figure 2: CycloneNTT butterfly unit optimized for the Goldilocks field. This 8-stage pipeline outputs $e' = e + o$, $o' = \omega(e - o)$. $e$ and $o$ are the "even" and "odd" inputs to the butterfly, $\omega$ is the twiddle factor, $\epsilon = 2^{32} - 1$ is used for modular reduction, and reg(s) are pipeline registers that align data in time.
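Returning to step (1) of ntt_send: the on-the-fly ordering noted there can be sketched as below (our illustration; ntt_send and send_word are hypothetical stand-ins for the real PCIe driver, which the text leaves out of scope):

```rust
/// Bit-reverse the low `log_n` bits of `i` (assumes log_n >= 1).
fn bit_reverse(i: usize, log_n: u32) -> usize {
    i.reverse_bits() >> (usize::BITS - log_n)
}

/// Hypothetical host-side send: stream `values` (length N = 2^log_n) in
/// bit-reversed order, one word at a time, without reordering them in RAM.
fn ntt_send(values: &[u64], mut send_word: impl FnMut(u64)) {
    let log_n = values.len().trailing_zeros(); // N assumed a power of two
    for i in 0..values.len() {
        send_word(values[bit_reverse(i, log_n)]);
    }
}
```

A real driver would coalesce these words into DMA-sized bursts; the point is only that the permutation can live entirely in the read addresses, so no in-place reordering pass over the input is needed.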
We would like to note that optimizing data transfer between the FPGA and the host is an orthogonal problem that depends on the actual platform and the type of communication used (PCIe/Ethernet); hence, we consider it out of the scope of this work.
5 SINGLE-LAYER STREAMING
In this version of the architecture, the NTT network is processed one layer at a time. For each layer, the entire dataset is read from off-chip memory, processed through an array of butterfly units, and the results are sent back to the off-chip memory, forming a cyclone of data streaming. Along with the dataset, twiddle factors are also streamed in to be used by the butterfly array. This architecture streams the entire dataset in and out as many times as there are layers in the NTT network. For $N$ numbers, the network has $\log N$ layers, hence the time complexity of this architecture is determined by the bandwidth of the off-chip memory to stream the entire dataset in and out $\log N$ times.

5.1 Streaming Twiddles
Since we use [BLDS10]'s network, within a layer, twiddle factors are used in sequential order. As we progress through the layers, the number of twiddle factors needed is cut in half, as shown in Fig 3. However, the stream always starts from index 0. Therefore, for layer $L$, we simply stream in twiddles $[0 \ldots \frac{N}{2^{L+1}})$ from off-chip memory.

We make an important observation here that as we progress through layers, the bandwidth required for streaming twiddles is cut in half. This is not only because half the twiddles are used, but also due to the fact that in layer $L$, $2^L$ neighboring butterflies use the same twiddle factor. This provides an opportunity for twiddle reuse, given we process butterflies in sequential order within the layer.

5.2 Parallel Butterflies
We note that the data dependency in the butterfly network is only inter-layer. Within a layer, we can execute in parallel as many butterflies as we can afford to feed input data to, and fit the circuit on the chip. In this single-layer CycloneNTT architecture, we employ an array of $B$ fully-pipelined butterflies in parallel. In order to achieve maximum throughput, we need to feed the butterfly array every cycle to avoid bubbles. Therefore, we require $2 \times B$ numbers to be read from off-chip memory every cycle, regardless of the layer we are processing.

It should be noted that this architecture is memory bound by design: for every butterfly unit in the system, two numbers and a twiddle factor need to be read from off-chip memory.

5.3 2-FIFO Architecture
As demonstrated in Fig 3, there exists only inter-layer data dependency in the butterfly network. Furthermore, every output of a butterfly feeds only a single butterfly, meaning that a single butterfly feeds two butterflies in the next layer. However, we make a further important observation that a single butterfly, depending on its position in the layer, either feeds the left input or the right input of its target butterflies, as highlighted in Fig 3. We use this observation to propose a 2-FIFO architecture as shown in Fig 4. Every butterfly can be seen as reading its inputs from two FIFOs, left and right. However, both its outputs will be pushed into either the left or the right FIFO (Fig 4). In addition, as evidenced in Fig 3, the two outputs are $2^L$ rows apart, as their target butterflies in the next layer are $2^L$ apart. For a given butterfly in position $P$ in layer $L$, we have:

$$\text{O-FIFO: } left/right = \left\lfloor \frac{P}{2^L} \right\rfloor \bmod 2, \qquad \text{O-FIFO: } positions = 0,\ 2^L \qquad (1)$$

We start by placing every other input number into the left and right FIFOs (Fig 4). Processing every layer of the network amounts to reading $\frac{N}{2}$ numbers from each FIFO ($N$ total), running them through the array of butterflies in parallel, and writing the results back to their corresponding FIFO.

6 MULTI-LAYER STREAMING
The single-layer architecture proposed in Section 5 requires a single butterfly unit per two input data, as we arrange the units in a one-dimensional array. Therefore, if off-chip memory bandwidth allows for streaming in a maximum of $2B$ elements per cycle, we can only utilize $B$ butterfly units. However, if the chip's capacity allows for more butterfly units, we have a memory-bound architecture.

To overcome the under-utilization of our computing resources, we can arrange the butterfly units into a sub-network of butterflies rather than a vector. This sub-network is similar in shape to the overall butterfly network shown in Fig 3; it is only smaller in size. A sub-network with $B$ butterfly units in each layer (a total of $\log(2B)$ layers by design) requires $2B$ inputs and produces $2B$ outputs, hence its memory bandwidth requirements are the same as in the single-layer architecture.

Said another way, if we have the bandwidth to stream $2B$ elements per cycle, then we can construct a sub-network with $B$ butterfly units per layer that sequentially processes $\log(2B)$ layers, for a total of $B \log(2B)$ butterfly units. In return, we will only need to stream the dataset from/to external memory $\frac{1}{\log(2B)} \log N$ times. To provide a concrete example, we can instantiate a 6-layer architecture with $B = 32$ butterflies per layer, or a total of $6 \times 32 = 192$ butterflies. This architecture will only need to stream the dataset from/to external memory $\frac{1}{6} \log N$ times.

6.1 Streaming Twiddles
In multi-layer processing, we process a sub-network of the overall network at a time. This means that for every sub-network execution, we require twiddle factors for all of the $\log(2B)$ layers at the same time. Consequently, we require higher memory bandwidth to stream in twiddle factors. On a positive note, recall that the bandwidth requirement for twiddle factors halves with every layer deeper into the network. Therefore, for a sub-network of width $B$, the overall bandwidth requirement for the twiddle factors is only $\sum_{i=0}^{B-1} \frac{1}{2^i}$, i.e. $< 2$ times larger compared to what is needed for single-layer processing, regardless of the value of $B$.
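To make the routing concrete, here is a small software model of one butterfly and the Eq. 1 decision (our sketch; it reuses the hypothetical gl_add/gl_sub/gl_mul from the Goldilocks sketch in Sec. 3.2):

```rust
/// Fig 2 butterfly over the Goldilocks field: e' = e + o, o' = ω(e - o).
fn butterfly(e: u64, o: u64, omega: u64) -> (u64, u64) {
    (gl_add(e, o), gl_mul(omega, gl_sub(e, o)))
}

/// Eq. (1): the butterfly at position `p` in layer `l` pushes BOTH of its
/// outputs into one FIFO: 0 = left, 1 = right.
fn o_fifo(p: u64, l: u32) -> u64 {
    (p >> l) & 1 // ⌊p / 2^l⌋ mod 2
}

/// Eq. (1): within that FIFO, the two outputs land 2^l positions apart.
fn o_fifo_offsets(l: u32) -> [u64; 2] {
    [0, 1 << l]
}
```

Note that o_fifo(p, l) alternates every $2^l$ butterflies, so within a layer, writes to a given FIFO arrive in runs of $2^l$ consecutive butterflies, which is convenient for batching the writes discussed later in Sec. 7.1.3.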
6.4 Quasi-Streaming
The primary reason behind streaming the data to/from off-chip memory, rather than randomly accessing it, is the internal structure of DRAM architectures. In a typical DRAM (DDR/HBM), accessing a row of the data is expensive, hence the memory is arranged in a very wide format. To minimize overhead, applications must strive to access and consume an entire row before moving on to the next. Therefore, streaming data within a row is very beneficial. However, we make a key observation: if an application's access quantum is an entire DRAM row, then for such an application random access to different rows yields the same throughput as streaming rows sequentially.

In Fig 5 we show the throughput we achieve by accessing random and sequential addresses, while varying the access size. As can be seen, streaming data reaches maximum throughput with an access size of 256 bytes; however, random access only reaches the same throughput at an access size of 4096 bytes or higher. This demonstrates that when accessing a DRAM one row at a time, the application can view the DRAM as a truly random-access memory, and no consideration needs to be made for sequential addressing.

Using this observation, we propose exploiting quasi-streaming in CycloneNTT to set the output buffer capacity of FIFOs to a few multiples of a DRAM row, irrespective of $B$. Since we can randomly access ingress FIFOs at DRAM-row granularity, we strategically choose a sequence of addresses to read and feed the sub-network that produces frequently-changing egress FIFO indexes. Fig 6 shows an example of such a sequence for quasi-streaming of data that enables switching the egress FIFO index quickly. Alternatively, for a large enough $B$ ($\geq 8$), which leads to reading one or more DRAM rows at a time, one can conveniently choose $B$ to be the quantum of FIFO reads, simplifying the read sequence further. It should be noted that in this architecture, the elements stored in the FIFOs are not popped sequentially, but rather with pre-defined read addresses.

Figure 6: Quasi-streaming read order to change the O-FIFO index every two beats. In this example we highlight that reading index-4 out of order leads to output FIFO index-1.

6.5 Power vs Delay
CycloneNTT is a configurable architecture. The primary parameter to tune is the number of inputs to the sub-network ($B$), which also determines the number of FIFOs ($2B$). As we grow $B$, fewer passes over the dataset are required ($\frac{\log N}{\log 2B}$), lowering execution time. However, the sub-network size grows proportional to $B \log 2B$, resulting in increased power dissipation. Depending on the application and environmental properties, one can choose the right $B$ to create the best-fitting solution.

Another major design decision to consider is clock frequency. Depending on the platform, after a certain clock speed the memory interface will be saturated, and increasing the clock speed returns no discernible gains. On the other hand, increases in clock speed result in more power dissipation. For example, on the C1100 HBM-based platform [AX21, AX22], each port is rated at a theoretical maximum speed of 14.2 GB/s, excluding DRAM timing overheads. Considering the 512-bit interface to the memory, this translates to a maximum clock speed of 222 MHz, after which the memory interface is saturated.

Figure 7: Memory interfaces used in single-layer streaming, connecting left/right FIFOs and providing twiddle access.
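These trade-offs are easy to tabulate in a few lines (our sketch; the 14.2 GB/s and 512-bit figures are the ones quoted above):

```rust
/// Passes over the dataset for an N-point NTT with a B-wide sub-network:
/// log N / log(2B), since each pass processes log(2B) layers.
fn passes(log_n: u32, b: u32) -> f64 {
    let layers_per_pass = (2 * b).ilog2(); // log(2B), assumes b >= 1
    log_n as f64 / layers_per_pass as f64
}

/// Number of butterfly units in the sub-network: B * log(2B).
fn butterfly_units(b: u32) -> u32 {
    b * (2 * b).ilog2()
}

/// Clock frequency (Hz) at which a memory port saturates: port bandwidth
/// divided by the bytes moved per beat on the data bus.
fn saturation_clock_hz(port_bytes_per_s: f64, bus_bits: u32) -> f64 {
    port_bytes_per_s / (bus_bits as f64 / 8.0)
}

fn main() {
    // The B = 32 example of Sec 6: 6 layers per pass, 192 butterflies.
    assert_eq!(butterfly_units(32), 192);
    assert_eq!(passes(24, 32), 4.0); // N = 2^24 needs 4 passes
    // C1100 HBM port: 14.2 GB/s over a 512-bit interface ≈ 222 MHz.
    println!("{:.0} MHz", saturation_clock_hz(14.2e9, 512) / 1e6);
}
```

Doubling $B$ adds one layer per pass but doubles the width of every layer, which is why execution time falls roughly logarithmically while power grows super-linearly.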
7 MEMORY INTERFACE
In this section we discuss the architecture and complexity of the memory interface needed for reading and writing to off-chip DRAM storage. CycloneNTT uses DRAM to store twiddle factors and input vectors, and as the back-end for the $2B$ FIFOs employed.

Figure 8: Output FIFO index and position in the FIFO based on the beat-index in each layer $l \in [0 \ldots \log N)$.

7.1 Single-Layer
In this section we discuss the relatively simpler memory interface needed for the single-layer CycloneNTT. As we only require two FIFOs, we can simply allocate one memory port per FIFO, as shown in Fig 7. In addition, we require a third memory port to access twiddle factors. Table 1 shows the relative bandwidth requirement per port in each layer. For example, when processing the first layer of the entire network, at every beat for every butterfly, we require one number from each FIFO, along with one twiddle factor. Therefore, while processing the first layer, the bandwidth requirement is the same for all ports.

7.1.1 Twiddles. As described in Sec 3.3, our butterfly network uses twiddles in a streaming fashion, in bit-reverse order. Therefore, at initialization time we store all twiddle factors into the corresponding memory bank, in bit-reverse order. During the computation of layer $l$ we simply stream in twiddles $[0 \ldots \frac{N}{2^l})$ as required by the network.

7.1.2 Reads. At least in the case of the Goldilocks field and our targeted platforms, the memory interfaces are wider (512 bits) than the individual numbers (64 bits), and we expect this will be the case for most applications. Consequently, every read from the memory provides multiple numbers. This provides the opportunity to process multiple numbers at the same time using a vector of butterflies, as explained in Section 5.

7.1.3 Writes. As shown in Fig 3 and Eq 1, both outputs of the butterfly unit are directed to the same FIFO, either left or right. In addition, the two outputs are not placed sequentially, but are separated as given by Eq 1. Consequently, for each write port, we require two small SRAM-based caches (one for each butterfly output) to hold the outputs as they become available. This caching is necessary, as writing individual numbers to DRAM yields very low bandwidth utilization.

7.2 Multi-Layer
The multi-layer configuration of CycloneNTT improves memory bandwidth utilization while presenting new challenges with regards to data write-backs.

Similar to single-layer, memory ports are assumed to be wider than individual numbers. However, unlike single-layer, in the same cycle we cannot consume multiple numbers from the same FIFO, as that would require creating a vector of sub-networks, which would be prohibitively expensive. Instead, we map multiple FIFOs onto the same memory port, which results in the one-number-per-FIFO-per-cycle read throughput we require. Considering our targeted field and platforms, each port can accommodate $\frac{512}{64} = 8$ FIFOs.

7.2.1 Reads. Each memory port independently fetches numbers from its allotted eight FIFOs (or the input vector, for the first round) in a quasi-streaming fashion. To minimize read overhead, read requests are batched into $D$ beats, yielding all the numbers required for $D$ cycles.
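A sketch of the FIFO-to-port mapping implied here (our code; the constants come from the 512-bit ports and 64-bit numbers quoted above, and batch_bytes is an illustrative helper):

```rust
/// With 512-bit memory ports and 64-bit numbers, one beat carries eight
/// numbers, so eight FIFOs share a port (Sec 7.2).
const PORT_BITS: usize = 512;
const NUM_BITS: usize = 64;
const FIFOS_PER_PORT: usize = PORT_BITS / NUM_BITS; // = 8

/// Port serving a given FIFO, and the 64-bit lane it occupies in each beat.
fn fifo_to_port(fifo: usize) -> (usize, usize) {
    (fifo / FIFOS_PER_PORT, fifo % FIFOS_PER_PORT)
}

/// Bytes fetched by one batched read of D beats on a single port.
fn batch_bytes(d_beats: usize) -> usize {
    d_beats * PORT_BITS / 8
}
```

For instance, with $B = 32$ (so $2B = 64$ FIFOs), fifo_to_port spreads the FIFOs over 8 ports, and a batch of $D = 64$ beats moves 4096 bytes per port, which is the access size at which Fig 5's random-access curve catches up with streaming.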
Platform  Memory Type  Single/Multi-Layer
AWS-F1    DDR          Single
C1100     HBM          Single
C1100     HBM          3 (B=4)
C1100     HBM          4 (B=8)
C1100     HBM          5 (B=16)
C1100     HBM          6 (B=32)

Table 3: All configurations used for evaluation.

Platform  Layers  Total Power (W)  Design Power (W)
AWS-F1    Single  35.243           0.906
C1100     Single  28.45            1.727
C1100     3       27.766           1.265
C1100     4       35.071           3.919
C1100     5       48.800           14.739
C1100     6       52.752           23.499

Table 4: Power dissipation as estimated by Vivado.
Platform  Single/Multi-Layer  Clock Speed (MHz)
AWS-F1    Single              250
C1100     Single              300
C1100     3                   300
C1100     4                   300
C1100     5                   176
C1100     6                   161

Table 6: Maximum clock speed attainable for each configuration.

Layers  2^18      2^20       2^21        2^24
3       655-1310  -          6116-12233  55924-111848
4       -         1092-2184  -           20971-41943
5       -         732-1464   -           -
6       64-129    -          -           5518-1137

Table 7: Execution latency lower and upper bound, in microseconds, for various architectures with applicable input sizes.
Layers  2^18  2^20   2^21   2^24
Single  3536  16806  36388  358937
3       1137  -      10859  101589
4       -     2117   -      42570
5       -     1036   -      -
6       101   -      -      8083

Table 8: Execution latency, in microseconds, for various architectures with applicable input sizes.

Figure 11: Throughput in millions of numbers per second processed by each architecture and prior work.

Alternative platforms that may also be considered in future work include, but are not limited to, the AMD-Xilinx Versal AI Series (making use of their AI Engines) as well as the Versal HBM series.

9 CONCLUSION AND FUTURE WORK
CycloneNTT is a hardware solution for computing NTT on large datasets ($\geq 2^{24}$ 64-bit numbers) that require external memory. By applying a series of algorithmic- and implementation-level optimizations, CycloneNTT achieves a quasi-streaming data access pattern that maximizes throughput. The architecture is configurable, and it has been applied to DDR- and HBM-based platforms. Moreover, the designer can trade off power for delay, depending on the target system and environmental conditions. To the best of our knowledge, CycloneNTT is the first architecture to tackle this problem for such large datasets in an efficient manner.

Future work relates to the current limitations on external memory, and it is two-fold. First, as datasets grow larger, even off-chip memory capacity becomes a limiting factor, which would require the deployment of multiple FPGAs. Efficient inter-FPGA communication through various channels (PCIe/Ethernet/etc.) will be key to achieving the same level of throughput as in a single-FPGA solution. Second, DRAM technology suffers, in general, from high and non-deterministic access latency, as well as temperature and reliability issues (for a detailed analysis, refer to [JWW21]). An alternative here is to explore off-chip SRAM, which offers lower and deterministic access latency at the cost of lower capacity and bandwidth. Historically, off-chip SRAM has been limited to an order of magnitude lower capacity compared to DRAM. However, recent process technology advancements have enabled higher-capacity chips. The use of smaller, yet faster, off-chip memory, coupled with the deployment of multiple FPGAs, could provide higher system-level performance and energy efficiency.

REFERENCES
[AWS] AWS. Amazon EC2 F1 instances. https://aws.amazon.com/ec2/instance-types/f1/. Accessed: 2022-09-23.
[AX21] AMD-Xilinx. Varium C1100 compute adaptor data sheet, DS1003 (v1.0). https://docs.xilinx.com/v/u/en-US/ds1003-varium-c1100, September 2021. Accessed: 2022-09-23.
[AX22] AMD-Xilinx. Vitis unified software platform documentation: Application acceleration development (UG1393): HBM configuration and use. https://docs.xilinx.com/r/en-US/ug1393-vitis-application-acceleration/HBM-Configuration-and-Use, May 2022. Accessed: 2022-09-23.
[BLDS10] Kevin J. Bowers, Ross A. Lippert, Ron O. Dror, and David E. Shaw. Improved twiddle access for fast Fourier transforms. IEEE Transactions on Signal Processing, 58(3):1122–1130, 2010.
[CP15] Ren Chen and Viktor K. Prasanna. Automatic generation of high throughput energy efficient streaming architectures for arbitrary fixed permutations. In 2015 25th International Conference on Field Programmable Logic and Applications (FPL), pages 1–8, 2015.
[CT65] James Cooley and John Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297–301, 1965.
[DCH+21] Sultan Durrani, Muhammad Saad Chughtai, Mert Hidayetoglu, Rashid Tahir, Abdul Dakkak, Lawrence Rauchwerger, Fareed Zaffar, and Wen-mei Hwu. Accelerating Fourier and number theoretic transforms using tensor cores and warp shuffles. In 2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 345–355, 2021.
[DM22] Kemal Derya, Ahmet Can Mert, Erdinç Öztürk, and Erkay Savaş. CoHA-NTT: A configurable hardware accelerator for NTT-based polynomial multiplication. Microprocessors and Microsystems, 89:104451, 2022.
[Gou21] A. P. Goucher. An efficient prime for number-theoretic transforms. https://cp4space.hatsya.com/2021/09/01/an-efficient-prime-for-number-theoretic-transforms, September 2021. Accessed: 2022-09-23.
[JWW21] Matthias Jung, Christian Weis, and Norbert Wehn. The Dynamic Random Access Memory Challenge in Embedded Computing Systems, pages 19–36. Springer International Publishing, Cham, 2021.
[KA20] Emre Karabulut and Aydin Aysu. RANTT: A RISC-V architecture extension for the number theoretic transform. In 2020 30th International Conference on Field-Programmable Logic and Applications (FPL), pages 26–32, 2020.
[KLC+20] Sunwoong Kim, Keewoo Lee, Wonhee Cho, Yujin Nam, Jung Hee Cheon, and Rob A. Rutenbar. Hardware architecture of a number theoretic transform for a bootstrappable RNS-based homomorphic encryption scheme. In 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 56–64, 2020.
[KMG+19] Hans Kanders, Tobias Mellqvist, Mario Garrido, Kent Palmkvist, and Oscar Gustafsson. A 1 million-point FFT on a single FPGA. IEEE Transactions on Circuits and Systems I: Regular Papers, 66(10):3863–3873, 2019.
[LPY22] Dai Li, Akhil Pakala, and Kaiyuan Yang. MeNTT: A compact and efficient processing-in-memory number theoretic transform (NTT) accelerator. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 30(5):579–588, 2022.
[MK20] Ahmet Can Mert, Emre Karabulut, Erdinç Öztürk, Erkay Savaş, Michela Becchi, and Aydin Aysu. A flexible and scalable NTT hardware: Applications from homomorphically encrypted deep learning to post-quantum cryptography. In 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 346–351, 2020.
[MKO+20] Ahmet Can Mert, Emre Karabulut, Erdinc Ozturk, Erkay Savas, and Aydin Aysu. An extensive study of flexible design methods for the number theoretic transform. IEEE Transactions on Computers, pages 1–1, 2020.
[NGI+20] Hamid Nejatollahi, Saransh Gupta, Mohsen Imani, Tajana Simunic Rosing, Rosario Cammarota, and Nikil Dutt. CryptoPIM: In-memory acceleration for lattice-based cryptographic hardware. In 2020 57th ACM/IEEE Design Automation Conference (DAC), pages 1–6, 2020.
[PS22] Rogério Paludo and Leonel Sousa. NTT architecture for a Linux-ready RISC-V fully-homomorphic encryption accelerator. IEEE Transactions on Circuits and Systems I: Regular Papers, 69(7):2669–2682, 2022.
[RLPD20] M. Sadegh Riazi, Kim Laine, Blake Pelton, and Wei Dai. HEAX: An architecture for computing on encrypted data. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '20, pages 1295–1309, New York, NY, USA, 2020. Association for Computing Machinery.
[SRTJ+19] Sujoy Sinha Roy, Furkan Turan, Kimmo Jarvinen, Frederik Vercauteren, and Ingrid Verbauwhede. FPGA-based high-performance parallel architecture for homomorphic computing on encrypted data. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 387–398, 2019.
[Tea22] Polygon Zero Team. Plonky2: Fast recursive arguments with PLONK and FRI. https://github.com/mir-protocol/plonky2/blob/main/plonky2/plonky2.pdf, September 2022. Accessed: 2022-09-23.
[YKKP22] Yang Yang, Sanmukh R. Kuppannagari, Rajgopal Kannan, and Viktor K. Prasanna. NTTGen: A framework for generating low latency NTT implementations on FPGA. In Proceedings of the 19th ACM International Conference on Computing Frontiers, CF '22, pages 30–39, New York, NY, USA, 2022. Association for Computing Machinery.
[ZWZ+21] Ye Zhang, Shuo Wang, Xian Zhang, Jiangbin Dong, Xingzhong Mao, Fan Long, Cong Wang, Dong Zhou, Mingyu Gao, and Guangyu Sun. PipeZK: Accelerating zero-knowledge proof with a pipelined architecture. In Proceedings of the 48th Annual International Symposium on Computer Architecture, ISCA '21, pages 416–428. IEEE Press, 2021.