
CycloneNTT: An NTT/FFT Architecture Using Quasi-Streaming of

Large Datasets on DDR- and HBM-based FPGA Platforms


Kaveh Aasaraai, Emanuele Cesena, Rahul Maganti, Nicolas Stalder, Javier Varela, Kevin Bowers
{kaasaraai,jvarela,kbowers}@jumptrading.com
{ecesena,rmaganti,nicolas}@jumpcrypto.com

ABSTRACT

Number-Theoretic-Transform (NTT) is a variation of the Fast-Fourier-Transform (FFT) on finite fields. NTT is being increasingly used in blockchain and zero-knowledge proof applications. Although FFT and NTT are widely studied for FPGA implementation, we believe CycloneNTT is the first to solve this problem for large datasets (≥ 2^24 64-bit numbers) that would not fit in the on-chip RAM. CycloneNTT uses a state-of-the-art butterfly network and maps the dataflow to hybrid FIFOs composed of on-chip SRAM and external memory. This manifests in a quasi-streaming data access pattern, minimizing external memory access latency and maximizing throughput. We implement two variants of CycloneNTT optimized for DDR and HBM external memories. Although historically this problem has been shown to be memory-bound, CycloneNTT's quasi-streaming access pattern is optimized to the point that, when using HBM (Xilinx C1100), the architecture becomes compute-bound. On the DDR-based platform (AWS F1), the latency of the application is equal to the streaming of the entire dataset log N times to/from external memory. Moreover, exploiting HBM's larger number of channels, and following a series of additional optimizations, CycloneNTT only requires (1/6) log N passes.

ACM Reference Format:
Kaveh Aasaraai, Emanuele Cesena, Rahul Maganti, Nicolas Stalder, Javier Varela, Kevin Bowers. 2022. CycloneNTT: An NTT/FFT Architecture Using Quasi-Streaming of Large Datasets on DDR- and HBM-based FPGA Platforms. In Proceedings of (Jump Trading / Jump Crypto). ACM, New York, NY, USA, 12 pages.

Jump Trading / Jump Crypto, 2022, Chicago, IL. © 2022

1 INTRODUCTION

The Number Theoretic Transform (NTT) is a generalization of the Fast Fourier Transform (FFT). The FFT, first presented in [CT65] and credited back to Gauss, remains unparalleled in its impact on modern computer science, having largely been responsible for the birth of digital signal processing. As a result, numerous attempts have been made over the last half century to optimize and improve both the asymptotic and concrete efficiency of the FFT.

Classically, the FFT is computed over the field of complex numbers, which has an interpretation as transforming periodic functions from the time to the frequency domain. When the computation is done over finite fields, the FFT is usually referred to as NTT. It has an interpretation as transforming polynomials over the finite field from their coefficient to their value representation, and it is used, for example, for large-polynomial multiplication or for interpolating polynomials at given values.

Modern applications in cryptography, ranging from homomorphic encryption to lattice-based and other post-quantum primitives, to zero-knowledge (ZK) proofs, have motivated a recent review of classical FFT algorithms and research into efficient implementations over finite fields. And, as richer applications are built or proposed, the need to compute NTT over larger datasets grows.

Our work is motivated by ZK applications, particularly in the context of blockchain technology. ZK proofs are a building block for both scalability solutions (e.g., ZK-rollups) and privacy-enhanced applications for blockchains. In many concrete schemes, generating a ZK proof requires computing an NTT, typically on a fairly large dataset.

Contribution. In this paper we present CycloneNTT, an FPGA solution for computing NTT on large datasets (≥ 2^24 64-bit elements) that thus require external memory.
By leveraging the algorithm introduced in [BLDS10] to reduce memory accesses, and applying a series of algorithmic- and implementation-level optimizations, CycloneNTT achieves a quasi-streaming data access pattern that maximizes throughput.
The architecture is configurable, and we apply it to DDR- and HBM-based platforms. Moreover, we show how the designer can trade off power for delay, depending on the target system and environmental conditions.

Organization. In Sec. 2 we present related work, and in Sec. 3 we define terminology and notation. We introduce CycloneNTT in Sec. 4, describing requirements, architecture and the high-level interaction between host and FPGA. In Sec. 5 and 6 we present two instances of CycloneNTT: a single-layer streaming implementation that is memory-bound, and a multi-layer streaming one that can overcome memory bandwidth constraints and become compute-bound. We describe the memory interface in Sec. 7, show experimental results in Sec. 8, and conclude in Sec. 9, highlighting future work.

2 RELATED WORK

Following the growing interest in ZK proofs, recent efforts have been made to map them to specialized hardware architectures. Of particular relevance are PipeZk for ASIC [ZWZ+21], and NTTGen for FPGA [YKKP22].

PipeZk is a pipelined accelerator for zkSNARKs targeting an ASIC architecture, composed of two subsystems: one for polynomial computation (including NTT) and one for multi-scalar multiplication. The former performs a recursive decomposition of large NTT kernels into smaller tiles, which allows them to fit into on-chip compute resources while complying with off-chip bandwidth limitations. In PipeZk, multiple modules can run in parallel to optimize data utilization. A module is composed of NTT cores, each

performing the butterfly operations, and employing FIFOs of different depth to match the required stride access. Because of the data-access pattern, the architecture needs to block data in on-chip SRAM (performing a matrix transpose) before it is able to write it back to off-chip memory. Although it is an interesting architecture, it targets a limited range of input sizes, from 2^14 to 2^20 elements. And being an ASIC design means that it loses the deployment flexibility and time-to-market offered by FPGAs.

NTTGen is a hardware generation framework targeting FPGAs, optimized for homomorphic encryption. The inputs to the framework are application parameters (such as latency, polynomial degree, and a list of prime moduli) and hardware resource constraints (e.g., DSP, BRAM and I/O bandwidth), while the output is synthesizable Verilog code. It exploits data-, pipeline- and batch-parallelism, and offers two flavours of cores: general purpose, and low-latency customized for generalized Mersenne primes. Input and output polynomials are stored in off-chip memory, and results are shown for polynomial degrees ranging from 2^10 to 2^14. At the heart of the design is the use of Streaming Permutation Networks (SPN) [CP15], which allow for arbitrary permutation strides while reducing the interconnect complexity and avoiding expensive crossbars. An SPN consists of three sub-networks: two of them for spatial permutations (in the same cycle) and one for temporal permutation (across cycles). NTTGen actually extends the original SPN to support runtime control of the underlying routing tables and address generation. In comparison, CycloneNTT takes a different approach: we avoid permutations entirely by using a multi-FIFO architecture backed by off-chip memory.

Also worth mentioning is HEAX [RLPD20], a highly parallelizable hardware architecture targeting fully homomorphic encryption, with pipelined NTT cores and a word size of 54 bits to maximize resource (DSP) utilization. It requires on-chip multiplexers to distribute data and twiddle factors to the cores, and is mainly optimized for on-chip memory usage up to 2^13 elements. Beyond that, it employs off-chip memory, but only to a very limited extent due to the latency overhead.

Previous relevant art targeting FPGAs also includes [SRTJ+19] and [KLC+20] for homomorphic encryption, [DM22] for lattice-based cryptography, [MK20] for homomorphically encrypted deep neural network inference, as well as [MKO+20] for a post-quantum digital signature scheme. Nevertheless, these designs primarily make use of on-chip BRAM and do not directly provision for off-chip memory. All these works could benefit from the CycloneNTT approach to scale to larger NTTs and possibly allow for richer applications.

Other interesting references, although not directly applicable to CycloneNTT, include: in-memory computation, such as CryptoPIM [NGI+20] and MeNTT [LPY22]; optimized GPU-based implementations, such as [DCH+21]; as well as RISC-V architecture extensions for NTT, such as [PS22] and [KA20].

3 BACKGROUND

3.1 Number Theoretic Transform (NTT)

Let N be a positive integer and F a field containing a primitive N-th root of unity, denoted ω_N. That is, ω_N^N = 1 in the field, but ω_N^k ≠ 1 for 0 < k < N.
The Discrete Fourier Transform (DFT) is a linear transformation F_N : F^N → F^N defined for y = F_N(x) by:

$$y_i = \sum_{j=0}^{N-1} \omega_N^{ij}\, x_j$$

When F is a finite field of characteristic p > 0, the DFT is usually called Number Theoretic Transform (NTT). Note that in a practical implementation, a DFT over the complex numbers will be an approximation, whereas in the case of NTT we are interested in exact solutions.
A naive implementation of DFT has complexity O(N^2). An efficient algorithm to compute DFT with complexity O(N log N) or better is referred to as Fast Fourier Transform (FFT). For practical applications, we typically set N = 2^n (radix-2 FFT/NTT), extending the x_i with zeros.
Since the ω_N^{ij} are N^2 coefficients, but the ω_N^k are only N distinct numbers, there is periodicity in the matrix coefficients. This led to a number of "factorization lemmas", of which the Cooley-Tukey framework [CT65] is the most well-known and fundamental.
When N factors as n_1 n_2, the Cooley-Tukey framework decomposes a given order-N DFT into n_2 smaller order-n_1 DFTs, and n_1 smaller order-n_2 DFTs, combined with so-called twiddles, which are suitable powers of ω_N. It directly follows that the complexity reduces from N^2 to n_1 n_2^2 + n_1^2 n_2 < N^2.
By repeatedly factoring N, one gets to smaller DFTs of prime order. These "minimal" FFTs (combined with their twiddle multiplications) are called butterflies. In particular, if N is a power of 2, factoring all the way leads to repeated order-2 DFTs, for a total complexity of N log N.
Two common ways to compute an FFT are decimation in time (DIT) and decimation in frequency (DIF). These represent the two cases of factoring N "from the left" or "from the right": DIF when n_1 is a small radix, DIT when n_2 is a radix. We stress that DIT and DIF differ both in 1) how they "loop" over the input values, and 2) the butterfly function. We add that it is not necessary and, as we will see, not always efficient to factor "all the way down": repeated higher-order (also called higher-radix) factors reduce the number of accesses to memory.

3.2 Goldilocks Field

While our design is generic and can be adapted to any field, including complex numbers, our current implementation focuses on the field F_p with p = 2^64 − 2^32 + 1 = 0xFFFF_FFFF_0000_0001, called the Goldilocks field [Gou21, Tea22].
Elements a, b ∈ F_p can be represented as 64-bit (unsigned) integers. Common field operations are extremely efficient on 64-bit CPUs, including:
• a ± b mod p, as a single addition/subtraction, optionally followed by subtracting/adding ε = 2^32 − 1 = 0xFFFF_FFFF in case of overflow/underflow.
• a · b mod p, as a single 64 × 64 → 128-bit multiplication followed by reduction mod p, which can be done efficiently (a handful of assembly instructions), again using the sparse representation of p.
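To make these two bullet points concrete, the following is a minimal software model of Goldilocks arithmetic, assuming nothing beyond the identities stated above (since p = 2^64 − 2^32 + 1, we have 2^64 ≡ ε = 2^32 − 1 mod p). The function names are ours, not from the CycloneNTT codebase.

```python
P    = 2**64 - 2**32 + 1   # Goldilocks prime
EPS  = 2**32 - 1           # epsilon: 2^64 is congruent to eps (mod p)
MASK = 2**64 - 1

def gl_add(a: int, b: int) -> int:
    """a + b mod p: 64-bit add, then add eps on overflow (equivalent to subtracting p)."""
    s = a + b
    if s > MASK:                  # 64-bit overflow
        s = (s & MASK) + EPS      # s - 2^64 + eps == s - p
    return s - P if s >= P else s

def gl_sub(a: int, b: int) -> int:
    """a - b mod p: 64-bit subtract, then subtract eps on underflow (equivalent to adding p)."""
    d = a - b
    if d < 0:                     # 64-bit underflow
        d = (d + 2**64) - EPS     # d + 2^64 - eps == d + p
    return d

def gl_mul(a: int, b: int) -> int:
    """a * b mod p: one 64x64 -> 128-bit multiply, then fold the high half
    repeatedly using 2^64 == eps (mod p); one final conditional subtract."""
    x = a * b
    while x >> 64:
        x = (x & MASK) + (x >> 64) * EPS
    return x - P if x >= P else x
```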

Algorithm 1 Radix-2 gNTT

1:  function gNTT(values, logN, twiddles)
2:      ⊲ values are expected in bit-reversed order
3:      N ← 2^logN
4:      for i = 0 to logN do                    ⊲ DIT loop
5:          m ← 2^(i+1)
6:          for k = 0 to N step m do
7:              ω ← twiddles[k/m]               ⊲ one twiddle in inner loop
8:              for j = 0 to m/2 do
9:                  e ← values[k + j]
10:                 o ← values[k + j + m/2]
11:                 values[k + j] ← e + o       ⊲ DIF butterfly
12:                 values[k + j + m/2] ← ω(e − o)
13:             end for
14:         end for
15:     end for
16:     return values
17: end function

3.3 gNTT Algorithm

In [BLDS10], Bowers et al. introduce the gFFT algorithm to compute DFT (corresponding to a factorization they call G). Their key observation is that it is possible to re-arrange the computation and drastically reduce the number of memory accesses to the twiddle factors, asymptotically from O(N log N) in the classical Cooley-Tukey down to O(N).
Surprisingly, the resulting algorithm looks exactly like a Cooley-Tukey DIT, but with butterflies that resemble those of a DIF. The derivation is independent of the base field, therefore it also applies to NTT.
We call gNTT the analogous algorithm for NTT, and we present a radix-2, in-place, iterative version in Alg. 1. This is at the core of CycloneNTT. Note the DIT loop starting at line 4 (which implicitly requires the input values to be sorted in bit-reverse order), and the DIF butterfly in lines 11-12. The key feature of gNTT is the twiddle ω at line 7, which is constant throughout the inner loop.
As we will see in more detail, NTT is generally a memory-bound computation, and CycloneNTT, in some instances, results in a compute-bound one. Reducing the number of twiddle accesses via gNTT is the first step towards achieving this goal.
We also highlight that gNTT is often more efficient than Cooley-Tukey even for software implementations. For example, as a corollary of this work, we implemented gNTT in the open source library plonky2 (https://github.com/mir-protocol/plonky2/), resulting in a 10-15% performance improvement.
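For reference, the following is a direct Python transcription of Alg. 1 (ours), reading the pseudocode's loop bounds as exclusive upper bounds. It assumes, as the algorithm does, that `values` is already in bit-reversed order and that `twiddles` is the precomputed table (all N twiddles, sorted in bit-reverse order, as ntt_init in Sec. 4.2 stores them).

```python
def gntt(values: list[int], logn: int, twiddles: list[int], p: int) -> list[int]:
    """Radix-2, in-place, iterative gNTT, transcribed from Algorithm 1."""
    n = 1 << logn
    for i in range(logn):                # DIT loop (Alg. 1, line 4)
        m = 1 << (i + 1)
        for k in range(0, n, m):
            w = twiddles[k // m]         # one twiddle per inner loop (line 7)
            half = m // 2
            for j in range(half):
                e = values[k + j]
                o = values[k + j + half]
                values[k + j] = (e + o) % p               # DIF butterfly (line 11)
                values[k + j + half] = (w * (e - o)) % p  # (line 12)
    return values
```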
4 CYCLONENTT

In this section we present the CycloneNTT architecture and its unique properties. It performs in-place operations on the input values. For the sake of this study, input data is assumed to be stored in off-chip memory, although streaming data to the FPGA from the host can be easily accommodated.
The following lists key design goals for CycloneNTT:
• All computation performed entirely within the FPGA and the DRAM directly connected to the FPGA.
• Support large datasets that do not fit in on-chip SRAM.
• Avoid random access to off-chip memory.
• Provide configuration to the designer to trade off power for delay.

[Figure 1: CycloneNTT System Architecture and Dataflow]

4.1 Architecture

The overall system architecture is shown in Fig 1. The system is composed of the following main components:
• External memory, which is assumed to be DRAM-based (DDR, HBM).
• Multiple large FIFOs for streaming the dataset in and out, backed by on-chip and off-chip memory.
• A Sub-NTT module, which can be either a single array of butterfly units, or a multi-layered network of butterfly units connected just like a regular NTT network.
• Some connectivity to a host, either through PCIe or network, to load twiddle factors and input data into the external memory.
• A host processor to initialize the system, with no interaction during the computation.
In Sec. 5 and 6 we present two instances of this architecture that primarily differ in the Sub-NTT module and in how data is streamed from/to external memory.
The first instance, Single-Layer Streaming, computes NTT by moving values in and out of external memory log N times (e.g., 24 times). The second instance, Multi-Layer Streaming, improves on that by computing C layers at a time (e.g., C = 3 or C = 6), thus reducing the number of passes over the values to (1/C) log N (resp., 8 or 4, compared to the 24 for Single-Layer Streaming).

4.2 High-Level Protocol

Host and FPGA implement the following protocol:
ntt_init(N) Initialize the FPGA and twiddles.
(1) Host pre-computes all N twiddles, sorted in bit-reverse order.
(2) Host sends twiddles over PCIe.
(3) FPGA receives the twiddles and stores them in the external memory.
ntt_send(values) Send input values into FPGA.

(1) Host sends the input values over PCIe, in bit-reverse order. (Bit-reverse order indexes can be computed "on the fly", and the Host can access the input values in RAM in bit-reverse order; this seems more efficient than in-place reordering.)
(2) FPGA receives the input values and stores them in the external memory.
ntt_compute() Compute NTT.
(1) Host sends compute command over PCIe.
(2) FPGA repeats multiple times:
(a) Load twiddles and values into input FIFOs (from exter-
nal memory into SRAM).
(b) Process twiddles and values via Sub-NTT.
(c) Send output values into output FIFOs (from SRAM to
external memory).
Details depend on the architecture and are explained in Sec. 5 and 6.
ntt_recv() → values Receive output values from FPGA.
(1) Host sends recv command over PCIe.
(2) FPGA returns output values from external memory over PCIe.

[Figure 2: CycloneNTT butterfly unit optimized for the Goldilocks field. This 8-stage pipeline outputs e′ = e + o, o′ = ω(e − o). e and o are the "even" and "odd" inputs to the butterfly, ω is the twiddle factor, ε = 2^32 − 1 is used for modular reduction, and reg(s) are pipeline registers to align data in time.]
We would like to note that optimizing data transfer between the
FPGA and the host is an orthogonal problem that depends on the ac-
tual platform, and the type of communication used (PCIe/Ethernet),
hence we consider it out of the scope of this work.
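As a minimal illustration of the bit-reverse ordering used by ntt_init and ntt_send, and of the "on-the-fly" remark above, the following host-side sketch (ours) streams values in bit-reversed order without reordering them in place:

```python
def bit_reverse(i: int, logn: int) -> int:
    """Reverse the low `logn` bits of index i (e.g., 0b001 -> 0b100 for logn = 3)."""
    r = 0
    for _ in range(logn):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def iter_bit_reversed(values, logn: int):
    """Yield `values` in bit-reversed order, computing indexes on the fly."""
    for i in range(1 << logn):
        yield values[bit_reverse(i, logn)]
```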

4.3 Butterfly Unit

We design and implement a fully-pipelined DIF butterfly unit for the Goldilocks field F_p, where elements are represented as 64-bit integers. Fig 2 shows the 8-stage pipeline design with an initiation interval of 1. The DIF butterfly computes

$$e' = e + o, \qquad o' = \omega\,(e - o),$$

where e, o are two input values, ω is the twiddle factor, and all operations are modulo p.
We employ multiple techniques as outlined in [Gou21] to reduce the complexity of the computations involved, including modular multiplication:
• Multiplication by the field's ε = 2^32 − 1 is simplified to a single subtraction, as the constant is of the form 2^n − 1.
• Due to the unique properties of the Goldilocks field and its prime, the modular multiplication by ω can be reduced to a single 64-bit unsigned multiplication along with a series of 64-bit additions and subtractions.

[Figure 3: An 8-input butterfly network (starting from left), showing twiddle indexes used in each layer. All right-inputs are highlighted with a bubble.]
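A behavioral model of this butterfly (ours; it ignores the 8-stage pipelining and register alignment of Fig 2) can reuse the Goldilocks helpers sketched in Sec. 3.2:

```python
def butterfly(e: int, o: int, w: int) -> tuple[int, int]:
    """One DIF butterfly over the Goldilocks field: (e', o') = (e + o, w * (e - o)).
    gl_add/gl_sub/gl_mul are the helper functions from the Sec. 3.2 sketch."""
    return gl_add(e, o), gl_mul(w, gl_sub(e, o))
```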

5 SINGLE-LAYER STREAMING

In this version of the architecture, the NTT network is processed one layer at a time. For each layer, the entire dataset is read from off-chip memory, processed through an array of butterfly units, and the results are sent back to the off-chip memory, forming a cyclone of data streaming.
Along with the dataset, twiddle factors are also streamed in to be used by the butterfly array. This architecture streams the entire dataset in and out as many times as there are layers in the NTT network. For N numbers, the network has log N layers, hence the time complexity of this architecture is determined by the bandwidth of the off-chip memory to stream the entire dataset in and out log N times.

5.1 Streaming Twiddles

Since we use the network of [BLDS10], twiddle factors are used in sequential order within a layer. As we progress through the layers, the number of twiddle factors needed is cut in half, as shown in Fig 3. However, the stream always starts from index 0. Therefore, for layer L, we simply stream in twiddles [0 ... N/2^(L+1)) from off-chip memory.
We make an important observation here that as we progress through layers, the bandwidth required for streaming twiddles is

cut in half. This is not only because half the twiddles are used, but also because in layer L, 2^L neighboring butterflies use the same twiddle factor. This provides an opportunity for twiddle reuse, given that we process butterflies in sequential order within the layer.

5.2 Parallel Butterflies

We note that the data dependency in the butterfly network is only inter-layer. Within a layer, we can execute in parallel as many butterflies as we can afford to feed input data to and fit on the chip. In this single-layer CycloneNTT architecture, we employ an array of B fully-pipelined butterflies in parallel. In order to achieve maximum throughput, we need to feed the butterfly array every cycle to avoid bubbles. Therefore, we require 2 × B numbers to be read from off-chip memory every cycle, regardless of the layer we are processing.
It should be noted that this architecture is memory-bound by design: for every butterfly unit in the system, two numbers and a twiddle factor need to be read from off-chip memory.

5.3 2-FIFO Architecture

As demonstrated in Fig 3, there exists only inter-layer data dependency in the butterfly network. Furthermore, every output of a butterfly feeds only a single butterfly, meaning that a single butterfly feeds two butterflies in the next layer. However, we make a further important observation: a single butterfly, depending on its position in the layer, feeds either the left input or the right input of its target butterflies, as highlighted in Fig 3. We use this observation to propose a 2-FIFO architecture, as shown in Fig 4.
Every butterfly can be seen as reading its inputs from two FIFOs, left and right. However, both of its outputs will be pushed into either the left or the right FIFO (Fig 4). In addition, as evidenced in Fig 3, the two outputs are 2^L rows apart, as their target butterflies in the next layer are 2^L apart. For a given butterfly in position P in layer L, we have:

$$\text{O-FIFO : left/right} = \left\lfloor \frac{P}{2^L} \right\rfloor \bmod 2, \qquad \text{O-FIFO : positions} = 0,\ 2^L \tag{1}$$

We start by placing every other input number into the left and right FIFOs (Fig 4). Processing every layer of the network amounts to reading N/2 numbers from each FIFO (N total), running them through the array of butterflies in parallel, and writing the results back to their corresponding FIFO.
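The following sketch (ours) models Eq. (1): it returns, for the butterfly at position P in layer L, which FIFO receives both outputs and the two relative positions at which they are written:

```python
def out_fifo_single(P: int, L: int) -> tuple[int, tuple[int, int]]:
    """Eq. (1): both outputs of the butterfly at position P in layer L go to
    one FIFO (0 = left, 1 = right), at relative positions 0 and 2^L."""
    fifo = (P >> L) & 1          # floor(P / 2^L) mod 2
    return fifo, (0, 1 << L)
```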

[Figure 4: Input data is first bit-reverse reordered and placed in two FIFOs. The butterfly unit pops data from both FIFOs and the twiddle FIFO, but pushes data only into one FIFO.]

5.4 FIFO Memory

The two FIFOs employed in this CycloneNTT architecture are backed by both on-chip and off-chip memory. We construct the FIFOs as ring buffers in off-chip memory, and read from off-chip memory directly. However, when pushing into the FIFOs, we first store the data and its position in the FIFO in a small, on-chip buffer. When enough data has been buffered to fill an entire DRAM row, we flush the buffer to off-chip memory at the corresponding ring address. Flushing data at DRAM-row granularity ensures the lowest overhead given the internal structure of the DRAM, yielding the highest throughput possible.

6 MULTI-LAYER STREAMING

The single-layer architecture proposed in Section 5 requires a single butterfly unit per two input data, as we arrange the units in a one-dimensional array. Therefore, if off-chip memory bandwidth allows for streaming in a maximum of 2B elements per cycle, we can only utilize B butterfly units. However, if the chip's capacity allows for more butterfly units, we have a memory-bound architecture.
To overcome this under-utilization of our computing resources, we can arrange the butterfly units into a sub-network of butterflies rather than a vector. This sub-network is similar in shape to the overall butterfly network shown in Fig 3; it is only smaller in size. A sub-network with B butterfly units in each layer (a total of log(2B) layers by design) requires 2B inputs and produces 2B outputs, hence its memory bandwidth requirements are the same as in the single-layer architecture.
Said another way, if we have the bandwidth to stream 2B elements per cycle, then we can construct a sub-network with B butterfly units per layer that sequentially processes log(2B) layers, for a total of B log(2B) butterfly units. In return, we only need to stream the dataset from/to external memory (1/log(2B)) log N times.
Just to provide a concrete example, we can instantiate a 6-layer architecture with B = 32 butterflies per layer, for a total of 6 × 32 = 192 butterflies. This architecture only needs to stream the dataset from/to external memory (1/6) log N times.
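The pass-count arithmetic can be checked directly; a small sketch (ours), where `layers` stands for log(2B):

```python
import math

def dataset_passes(logn: int, layers: int) -> int:
    """Full in/out passes over the dataset: ceil(log N / log 2B)."""
    return math.ceil(logn / layers)

# N = 2^24: single-layer -> 24 passes; C = 3 -> 8; C = 6 (B = 32) -> 4.
assert (dataset_passes(24, 1), dataset_passes(24, 3), dataset_passes(24, 6)) == (24, 8, 4)
```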
6.1 Streaming Twiddles

In multi-layer processing, we process a sub-network of the overall network at a time. This means that for every sub-network execution, we require twiddle factors for all the log(2B) layers at the same time. Consequently, we require higher memory bandwidth to stream in twiddle factors. On a positive note, recall that the bandwidth requirement for twiddle factors halves with every layer into the network. Therefore, for a sub-network of width B, the overall bandwidth requirement for the twiddle factors is only $\sum_{i=0}^{B-1} \frac{1}{2^i}$, i.e., less than 2 times larger compared to what is needed for single-layer processing, regardless of the value of B.

6.2 2B-FIFO Architecture

In the case of single-layer processing, we employ 2 FIFOs. For multi-layer processing, we require a larger number of FIFOs to be able to feed all the sub-network inputs at the same time. Fortunately, HBM-based FPGA platforms provide a large number of ports to off-chip memory banks (e.g., 32). Furthermore, each port is significantly wider (typically 512 bits) than Goldilocks field numbers (64 bits). Since we stream in data from all FIFOs at the same time, we can combine multiple FIFOs (512/64 = 8) into the same memory bank to be able to access all FIFOs in parallel.
As with the single-layer architecture, the sub-network is fed from all FIFOs in parallel, and all outputs of the sub-network are directed to a single FIFO, identified by the position (P), layer (L), and size of the sub-network (B):

$$\text{O-FIFO : index} = \left\lfloor \frac{P}{2^L} \right\rfloor \bmod 2B, \qquad \text{O-FIFO : positions} = \{\, i \times (2B)^L \;;\; i \in [0, 2B) \,\} \tag{2}$$
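Analogously to the single-layer sketch after Eq. (1), the following models Eq. (2) as printed; the names are ours:

```python
def out_fifo_multi(P: int, L: int, B: int) -> tuple[int, list[int]]:
    """Eq. (2): egress FIFO index and the 2B relative write positions for a
    sub-network output at position P, super-layer L, sub-network size B."""
    index = (P >> L) % (2 * B)                       # floor(P / 2^L) mod 2B
    positions = [i * (2 * B) ** L for i in range(2 * B)]
    return index, positions
```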

6.3 Output Buffer Analysis

As stated in Eq. (2), the egress FIFO index switches every (2B)^L beats, yet the sub-network is fed from all FIFOs in parallel. Therefore, for every FIFO we need a buffer to absorb the rate disparity until the egress FIFO switches to the next index. However, as we advance through network layers (L), the egress FIFO changes very infrequently, up until the very last layer, where we would require a buffer as large as B × N/(2B) = N/2 numbers. Since we target very large datasets, this is not a practical buffer size we can afford on-chip. In the following subsections we show how we solve this issue by employing a technique we call quasi-streaming.

6.4 Quasi-Streaming

The primary reason for streaming, rather than randomly accessing, the data to/from off-chip memory is the internal structure of DRAM architectures. In a typical DRAM (DDR/HBM), accessing a row of data is expensive, hence the memory is arranged in a very wide format. To minimize overhead, applications must strive to access and consume an entire row before moving on to the next. Therefore, streaming data within a row is very beneficial. However, we make a key observation: if an application's access quantum is an entire DRAM row, then random access to different rows yields the same throughput as streaming rows sequentially.
In Fig 5 we show the throughput we achieve by accessing random and sequential addresses, while varying the access size. As can be seen, streaming data reaches maximum throughput with an access size of 256 bytes; however, random access only reaches the same throughput with an access size of 4096 bytes or higher. This demonstrates that when accessing a DRAM one row at a time, the application can view the DRAM as a truly random-access memory, and no consideration needs to be made for sequential addressing.

[Figure 5: Bandwidth comparison between streaming and random access of various numbers of bytes. This data was captured on the C1100 platform using HBM channel-0.]

Using this observation, we propose exploiting quasi-streaming in CycloneNTT to set the output buffer capacity of the FIFOs to a few multiples of a DRAM row, irrespective of B. Since we can randomly access ingress FIFOs at DRAM-row granularity, we strategically choose a sequence of addresses to read and feed the sub-network that produces frequently-changing egress FIFO indexes. Fig 6 shows an example of such a sequence for quasi-streaming of data, enabling the egress FIFO index to switch quickly. Alternatively, for a large enough B (≥ 8), which leads to reading one or more DRAM rows at a time, one can conveniently choose B to be the quantum of FIFO reads, simplifying the read sequence further. It should be noted that in this architecture, the elements stored in the FIFOs are not popped sequentially, but rather with pre-defined read addresses.

[Figure 6: Quasi-streaming read order to change the O-FIFO index every two beats. In this example we highlight that reading index-4 out of order leads to output FIFO index-1.]

6.5 Power vs Delay

CycloneNTT is a configurable architecture. The primary parameter to tune is the number of inputs to the sub-network (B), which also determines the number of FIFOs (2B).

As we grow B, fewer passes over the dataset are required (log N / log 2B instead of log N), lowering execution time. However, the sub-network size grows proportionally to B log 2B, resulting in increased power dissipation. Depending on the application and environmental properties, one can choose the right B to create the best-fitting solution.
Another major design decision to consider is clock frequency. Depending on the platform, after a certain clock speed the memory interface will be saturated, and increasing the clock speed returns no discernible gains. On the other hand, increases in clock speed result in more power dissipation. For example, on the C1100 HBM-based platform [AX21, AX22], each port is rated at a theoretical maximum speed of 14.2GB/s, excluding DRAM timing overheads. Considering the 512-bit interface to the memory, this translates to a maximum clock speed of 222MHz, after which the memory interface is saturated.

Port     | Layer-1 | Layer-2 | Last Layer (log N)
Twiddles | 1       | 1/2     | 2/N
FIFO-0   | 1       | 1       | 1
FIFO-1   | 1       | 1       | 1
Table 1: Relative bandwidth requirement for each port.

[Figure 7: Memory interfaces used in single-layer streaming, connecting left/right FIFOs and providing twiddle access.]

7 MEMORY INTERFACE

In this section we discuss the architecture and complexity of the memory interface needed for reading and writing off-chip DRAM storage. CycloneNTT uses DRAM to store the twiddle factors and input vectors, and as the backing store for the 2B FIFOs employed.

[Figure 8: Output FIFO index and position in the FIFO based on the beat-index in each layer l ∈ [0 ... log N).]
7.1 Single-Layer

In this section we discuss the relatively simpler memory interface needed for the single-layer CycloneNTT. As we only require two FIFOs, we can simply allocate one memory port per FIFO, as shown in Fig 7. In addition, we require a third memory port to access twiddle factors. Table 1 shows the relative bandwidth requirement per port in each layer. For example, when processing the first layer of the entire network, at every beat for every butterfly we require one number from each FIFO, along with one twiddle factor. Therefore, while processing the first layer, the bandwidth requirement is the same for all ports.

7.1.1 Twiddles. As described in Sec 3.3, our butterfly network uses twiddles in a streaming fashion, in bit-reverse order. Therefore, at initialization time we store all twiddle factors into the corresponding memory bank, in bit-reverse order. During the computation of layer l we simply stream in twiddles [0 ... N/2^l) as required by the network.

7.1.2 Reads. At least in the case of the Goldilocks field and our targeted platforms, the memory interfaces are wider (512 bits) than the individual numbers (64 bits), and we expect this to be the case for most applications. Consequently, every read from the memory provides multiple numbers. This provides the opportunity to process multiple numbers at the same time using a vector of butterflies, as explained in Section 5.

7.1.3 Writes. As shown in Fig 3 and Eq 1, both outputs of the butterfly unit are directed to the same FIFO, either left or right. In addition, the two outputs are not placed sequentially, but are separated as given by Eq 1. Consequently, for each write port we require two small SRAM-based caches (one for each butterfly output) to hold the butterfly outputs as they become available. This caching is necessary, as writing individual numbers to DRAM yields very low bandwidth utilization.

7.2 Multi-Layer

The multi-layer configuration of CycloneNTT improves memory bandwidth utilization while presenting new challenges with regard to data write-backs.
Similar to single-layer, memory ports are assumed to be wider than individual numbers. However, unlike single-layer, in the same cycle we cannot consume multiple numbers from the same FIFO, as that would require creating a vector of sub-networks, which would be prohibitively expensive. Instead, we map multiple FIFOs onto the same memory port, which results in the one-number-per-FIFO-per-read-cycle throughput we require. Considering our targeted field and platforms, each port can accommodate 512/64 = 8 FIFOs.

7.2.1 Reads. Each memory port independently fetches numbers from its allotted eight FIFOs (or from the input vector for the first round) in a quasi-streaming fashion. To minimize read overhead, read requests are batched into D beats, yielding all numbers required for D

Configuration        | Number of Ports | Utilization
Single Layer - F1    | 3               | 3/4 = 75%
Single Layer - C1100 | 3               | 3/32 = 9%
3-Layer - C1100      | 3               | 3/32 = 9%
4-Layer - C1100      | 6               | 6/32 = 18%
6-Layer - C1100      | 24              | 24/32 = 75%
Table 2: Memory port usage and overall memory bandwidth utilization of the platform.

[Figure 9: Logical and physical memory ports in multi-layer architecture. Each cache is implemented with a single blockram.]
beats of the sub-network. D is chosen such that 512 × D ≥ DRAM-row. Next, read addresses follow a FIFO-first scheme to provide a fast-switching output FIFO, as per Eq 2 and Fig 8. For the first and last layers, the output FIFO either switches with every beat or does not switch at all, therefore read addresses end up being sequential. However, starting from the second layer, each port requests D beats, then jumps the gap to the next output FIFO. In practice, this is done by carefully decomposing the read address into limbs and cycling through them out of order. This is very similar to DRAM row/bank/bank-group interleaving, with the added complexity that the bit width allocated to each part changes as we progress through the layers. Nevertheless, this yields a compute-heavy, yet relatively simple, circuit to determine read addresses.

7.2.2 Writes. Writing data back to DRAM is substantially more complicated compared to reads. We have to overcome two challenges: a) multiple FIFOs need to be combined into the same port; b) as evident in Fig 8, the output data needs to be transposed.
In order to combine multiple outputs into the same port, we need an SRAM-based cache to store the results as they become available. The cache is as wide as the sub-network, and requires F × D rows, where F = 8 in this case. It is only after every F × D beats that we have the data for all FIFOs allotted to each port, at which time we can write them to the memory after transposing them.

7.2.3 Twiddles. Similar to the single-layer architecture, twiddles have the same bandwidth requirement as numbers in the first layer. We opt to allocate the same number of ports for twiddles as for numbers, which is 2B/F, with F = 8.

7.3 Data Transposition

Row-column data transposition is required starting from the second layer. This is shown in Fig 8, as sequential numbers for each FIFO are spread across beats. The challenge here is that each write port cache needs to support writing the entire beat (2B numbers) every cycle as they come out of the sub-network, yet it also needs to be able to transpose F = 8 beats into a write word for DRAM, also every cycle. The first operation requires a wide memory arrangement, while the second operation requires multiple read ports.
Fortunately, the individual numbers being written (64 bits) are about the same width as typical blockrams found in FPGAs (36/72 bits). We propose the architecture shown in Fig 9. Each column is stored in a separate blockram with acceptable storage waste. To provide multiple read ports for data transpose, we carefully rotate the data through blockram columns as they are being written. In this way, no two numbers belonging to the same column are ever written to the same blockram, hence we can read them in parallel, one from each ram. This architecture requires 2B multiplexers of size 2B × 1 inside the soft crossbar, and F multiplexers of size 2B × 1 for data transposition. For B > 4 this can yield long critical paths, hence we opt to use a 2-cycle multiplexer design.
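A behavioral model (ours, not the RTL) of this rotation scheme illustrates why the column reads never collide: beat t's column c lands in blockram (c + t) mod 2B, so any one column of F ≤ 2B consecutive beats occupies F distinct blockrams.

```python
class SkewedWriteCache:
    """Behavioral model of the rotated (skewed) blockram layout."""

    def __init__(self, two_b: int, depth: int):
        self.two_b = two_b
        self.bram = [[None] * depth for _ in range(two_b)]  # one list per blockram

    def write_beat(self, t: int, beat: list):
        """Write one beat of 2B numbers with a cyclic rotation by t."""
        assert len(beat) == self.two_b
        for c, v in enumerate(beat):
            self.bram[(c + t) % self.two_b][t] = v

    def read_column(self, c: int, t0: int, f: int) -> list:
        """Gather column c over beats t0..t0+f-1: one element from each of f brams."""
        return [self.bram[(c + t) % self.two_b][t] for t in range(t0, t0 + f)]

# Usage: after writing F = 8 beats, a whole column is readable in one "cycle".
cache = SkewedWriteCache(two_b=8, depth=8)
for t in range(8):
    cache.write_beat(t, [(t, c) for c in range(8)])
assert cache.read_column(3, 0, 8) == [(t, 3) for t in range(8)]
```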
7.4 Simultaneous Reads and Writes

At any given time, we are reading from all FIFOs to feed the sub-network, and writing back its outputs to FIFOs. In order to avoid read/write clashes into the same memory bank, we allocate two HBM banks for each logical memory port of the architecture, as shown in Fig 9. Across layers, we ping-pong between the two underlying banks for reads and writes, hence no HBM port is ever used for reads and writes at the same time. This comes at the expense of extra memory ports used.
Accessing adjacent HBM memory ports comes with almost no overhead in HBM architectures. HBM provides a hardened, low-latency crossbar that provides a high-speed all-to-all configuration only for adjacent ports. Therefore, ping-ponging between two ports for each logical port amounts to changing the MSB of the address, and using the same port for reads and writes with no multiplexers required in the fabric.

7.5 Memory Interface Utilization

Following the memory interfacing explained in the previous sections, Table 2 reports the number of memory ports used and the overall utilization of the platform memory interfaces under various CycloneNTT configurations. Note that the reported number includes the ports needed for accessing numbers and twiddle factors. Overall, in the case of the multi-layer architecture, 3 × 2B/F ports are needed (with F = 8), making the 6-layer architecture the largest a 32-port HBM platform can support.

8 EVALUATION

In this section we evaluate CycloneNTT on two different platforms with various design parameters. We implement all networks with Goldilocks butterfly units; however, any number system can be substituted. For example, if the butterfly unit is replaced by one that handles complex numbers, the entire system would calculate the FFT of the input data.

Platform | Memory Type | Single/Multi-Layer
AWS-F1   | DDR         | Single
C1100    | HBM         | Single
C1100    | HBM         | 3 (B=4)
C1100    | HBM         | 4 (B=8)
C1100    | HBM         | 5 (B=16)
C1100    | HBM         | 6 (B=32)
Table 3: All configurations used for evaluation.

Platform | Layers | Total Power (W) | Design Power (W)
AWS-F1   | Single | 35.243          | 0.906
C1100    | Single | 28.45           | 1.727
C1100    | 3      | 27.766          | 1.265
C1100    | 4      | 35.071          | 3.919
C1100    | 5      | 48.800          | 14.739
C1100    | 6      | 52.752          | 23.499
Table 4: Power dissipation as estimated by Vivado.

Platform | Layers | LUTs           | Regs           | BRAM        | DSP
AWS-F1   | Single | 11255 (10%)    | 15325 (10%)    | 54 (10%)    | 160 (10%)
C1100    | Single | 20234 (2.57%)  | 23061 (1.46%)  | 320 (5.66%) | 54 (4.54%)
C1100    | 3      | 16771 (1.9%)   | 16998 (1.0%)   | 144 (2.4%)  | 29 (2.2%)
C1100    | 4      | 53026 (6.7%)   | 46026 (2.9%)   | 384 (6.8%)  | 74 (6.2%)
C1100    | 5      | 170099 (20%)   | 121075 (6.9%)  | 960 (16%)   | 212 (16%)
C1100    | 6      | 563677 (65%)   | 319104 (18%)   | 2304 (39%)  | 691 (51%)
Table 5: Resource utilization count and the percentage of the platform used for various CycloneNTT configurations.

8.1 Methodology

Table 3 lists all CycloneNTT configurations used in this work. The focus of this work is on the C1100 platform, as it is the more suitable platform for such an application. However, to demonstrate the portability of our design, we showcase a single-layer configuration on the AWS F1 platform as well.
We implement the entire system in SystemVerilog RTL, and have made the code publicly available on GitHub; however, for anonymity we omit the link until after publication. We value the reproducibility of all the results presented here, and chose platforms that are accessible to most designers.
We report five metrics for each configuration of CycloneNTT:
• Resource Utilization: We report LUTs, registers, BRAMs, URAMs (if any), and DSPs used in each configuration.
• Power: We rely on Vivado's power report to get an estimate of the power dissipation of the design. It should be noted that these are estimates by the tool; actual energy consumption can vary at runtime due to input data and environmental properties, and would require live measurements.
• Clock Speed: We report the clock speed achieved for each design. As mentioned in Section 6.5, pushing the clock speed beyond a certain point provides no discernible gains and will only result in higher power dissipation.
• Latency: We report the time it takes to process the entire NTT network. We exclude data transfer times between the FPGA and the host, as we find data transfer optimization to be orthogonal to this work, and it can vary depending on the platform; e.g., in a host-less, network-attached appliance the interface speed can vary.
• Throughput: To facilitate comparison across platforms and configurations, and with prior art, we report the throughput of the design in terms of millions of numbers per second.
8.1.1 AWS-F1. This cloud platform provides access to cards with AMD-Xilinx VU9P FPGAs [AWS]. The card has four independent DDR channels, each with 16GB of capacity and a theoretical throughput of 16GB/s. The interface exposed to the design is 512 bits wide, and would be saturated at 250MHz. On this platform, we rely on the Amazon-provided shell for all communications with the host and DDR. We find the platform and shell easy to use, taking about 20% of the VU9P chip's resources. The use of this platform in this study is to demonstrate CycloneNTT's performance on a DDR-based platform where only a small number of memory ports are available.

8.1.2 C1100. This is a PCIe card provided by AMD-Xilinx directly, and is curated towards cryptocurrency mining [AX21]. This platform is equipped with a VU55P HBM-based FPGA. This FPGA has 32 HBM channels, each with 2Gbits of capacity and 14GB/s of theoretical bandwidth. HBM channels are almost independent [AX22]. The interface exposed to the design is 512 bits wide, and would saturate at 222MHz. On this platform we use the Vitis/XRT environment for development and deployment. However, we use pure RTL kernels, and only rely on the platform for communication to the host and HBM.

8.2 Results

In this subsection we discuss our findings regarding the CycloneNTT architecture across these metrics.

8.2.1 Power. Table 4 reports the estimated power dissipation for various CycloneNTT configurations on both platforms. Since we compare power profiles on two different platforms, we include power reports for both the entire chip and the design. The single-layer architecture consumes more energy compared to the 3-layer design. This can be explained by its inefficient use of resources, as shown in the next section. The 6-layer architecture demonstrates a super-linear power increase with respect to B. As reported in the next subsection, this design has a significantly higher resource utilization, hence the significant increase in its power dissipation.

8.2.2 Resource. In Table 5 we report the resource utilization of various CycloneNTT configurations. The single-layer architecture demonstrates a clear inefficiency in using resources, as it consumes more than a 3-layer architecture. Moreover, a super-linear growth is visible with an increasing number of layers in the case of multi-layer architectures.
As discussed in Section 6.3, in the case of multi-layer architectures, buffers as wide as 2B and as deep as a DRAM row are required to absorb the rate disparity between ingress and egress of the FIFOs. Moreover, as discussed in Section 7, a separate buffer per memory write port is also required, which also scales with respect to B. As a result, and as evidenced in these results, BRAM usage grows exponentially with respect to the number of layers.
The only DSP usage in the design is inside the butterfly units. The single-layer design requires B butterfly units multiplied by the butterfly vector size (8 in our case), whereas the multi-layer design requires B log 2B units. Considering DSPs only, the largest CycloneNTT architecture to fit on the C1100 platform would be 8 layers. However, due to the excessive wiring required inside the sub-network of butterfly units, and to connect the soft crossbar required inside the write caches (as described in Section 7), a 6-layer design (B = 32) is the largest configuration that fits on the C1100 platform. Fig 10 shows the final placement of a 6-layer CycloneNTT on C1100.

Platform | Single/Multi-Layer | Clock Speed (MHz)
AWS-F1   | Single             | 250
C1100    | Single             | 300
C1100    | 3                  | 300
C1100    | 4                  | 300
C1100    | 5                  | 176
C1100    | 6                  | 161
Table 6: Maximum clock speed attainable for each configuration.

Layers | 2^18    | 2^20      | 2^21        | 2^24
3      | 655-1310 | -        | 6116-12233  | 55924-111848
4      | -        | 1092-2184 | -          | 20971-41943
5      | -        | 732-1464  | -          | -
6      | 64-129   | -         | -          | 5518-11037
Table 7: Execution latency lower and upper bounds, in microseconds, for various architectures with applicable input sizes.

8.2.3 Clock Frequency. Table 6 reports the maximum clock speed achieved for each design configuration. Given the relatively flat architecture of CycloneNTT, the clock speed is not significantly affected by the number of layers used in the sub-network. However, the routability of the design is affected, as the chip simply runs out of wires to connect all the butterfly units for 7-layer or larger designs.
Although a higher clock frequency is generally desirable, due to memory bandwidth constraints the overall system performance gain is negligible beyond a certain point. In addition, when power dissipation is considered, one can choose to lower the clock speed to lower the power profile of the design, at minimal expense of performance.

8.2.4 Estimated Latency. Given the clock frequency of each configuration, we can estimate the latency of a multi-layer design for a given input size. In an L-layer design with input size N, all numbers are read from the external memory and written back (log N)/L times. Given that we employ 2B = 2^L FIFOs in parallel, the lower bound for the time it takes to cycle through the numbers is

$$\frac{\log N}{L} \times \frac{N}{2^L} \times \text{period}.$$

However, due to the write caching required for data transposition, the entire process can be delayed. In the worst-case scenario, all reads and all writes happen sequentially, which gives us an upper bound for execution time of two times the lower bound. In Table 7 we report the range of execution latencies for various configurations and input sizes.

[Figure 10: The final placement of a 6-layer CycloneNTT on the C1100 platform.]
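The bound above can be transcribed directly; the sketch below (ours) returns the estimated lower and upper latency in microseconds. It is only the first-order model just derived; we do not claim it reproduces Table 7's entries exactly, since the paper's precise per-pass accounting may differ slightly.

```python
import math

def latency_bounds_us(logn: int, layers: int, clock_mhz: float) -> tuple[float, float]:
    """Lower bound: ceil(log N / L) passes times N / 2^L cycles per pass
    (2B = 2^L FIFOs feed one element each per cycle); upper bound is 2x
    the lower bound (fully serialized reads and writes)."""
    cycles = math.ceil(logn / layers) * ((1 << logn) >> layers)
    lower_us = cycles / clock_mhz        # cycles x period, with period = 1/f
    return lower_us, 2.0 * lower_us

# e.g., a 6-layer design at 161 MHz (Table 6) on N = 2^18 inputs:
lo, hi = latency_bounds_us(18, 6, 161.0)   # roughly (76, 153) microseconds
```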
8.2.5 Latency. Table 8 compares the execution time of the different designs for different dataset sizes. As expected, the single-layer architecture is the slowest, as it spends log N passes over the entire dataset. In comparison, the multi-layer architectures significantly cut down the latency by processing multiple layers at the same time. Of course, as evidenced before, this speed gain comes at a relatively significant cost in power and resources. Comparing Table 7 with Table 8, we see that all obtained numbers fall within the estimated ranges.

8.2.6 Throughput. Fig 11 shows the improvement in throughput as we add layers to CycloneNTT. The exponential growth with respect to the number of layers is evident. Here we also compare CycloneNTT with prior works. [KMG+19] demonstrates a throughput of 233 million numbers per second for one million points, as that is the largest that can fit in their platform's on-chip memory. [YKKP22] is an NTT generator platform, but it does not accommodate a million-point NTT. We use their largest sample size (2^14) and assume linear scaling to calculate their equivalent millions-per-second throughput. We should note that CycloneNTT utilizes external memory to accommodate large datasets while providing superior throughput.

[Figure 11: Throughput in millions of numbers per second processed by each architecture and prior work.]

Layers | 2^18  | 2^20  | 2^21  | 2^24
Single | 3536  | 16806 | 36388 | 358937
3      | 1137  | -     | 10859 | 101589
4      | -     | 2117  | -     | 42570
5      | -     | 1036  | -     | -
6      | 101   | -     | -     | 8083
Table 8: Execution latency, in microseconds, for various architectures with applicable input sizes.
9 CONCLUSION AND FUTURE WORK

CycloneNTT is a hardware solution for computing NTT on large datasets (≥ 2^24 64-bit numbers) that require external memory. By applying a series of algorithmic- and implementation-level optimizations, CycloneNTT achieves a quasi-streaming data access pattern that maximizes throughput. The architecture is configurable, and it has been applied to DDR- and HBM-based platforms. Moreover, the designer can trade off power for delay, depending on the target system and environmental conditions. To the best of our knowledge, CycloneNTT is the first architecture to tackle this problem for such large datasets in an efficient manner.
Future work relates to the current limitations of external memory, and it is two-fold. First, as datasets grow larger, even off-chip memory capacity becomes a limiting factor, which would require the deployment of multiple FPGAs. Efficient inter-FPGA communication through various channels (PCIe/Ethernet/etc.) will be key to achieving the same level of throughput as in a single-FPGA solution. Second, DRAM technology suffers, in general, from high and non-deterministic access latency, as well as temperature and reliability issues (for a detailed analysis, refer to [JWW21]). An alternative here is to explore off-chip SRAM, which offers lower and deterministic access latency at the cost of lower capacity and bandwidth. Historically, off-chip SRAM has been limited to an order of magnitude lower capacity compared to DRAM. However, recent process technology advancements have enabled higher-capacity chips. The use of smaller, yet faster, off-chip memory, coupled with the deployment of multiple FPGAs, could provide higher system-level performance and energy efficiency.
Alternative platforms that may also be considered in future work include, but are not limited to, the AMD-Xilinx Versal AI Series (making use of their AI Engines) as well as the Versal HBM series.

REFERENCES

[AWS] AWS. Amazon EC2 F1 instances. https://aws.amazon.com/ec2/instance-types/f1/. Accessed: 2022-09-23.
[AX21] AMD-Xilinx. Varium C1100 compute adaptor data sheet, DS1003 (v1.0). https://docs.xilinx.com/v/u/en-US/ds1003-varium-c1100, September 2021. Accessed: 2022-09-23.
[AX22] AMD-Xilinx. Vitis unified software platform documentation: Application acceleration development (UG1393): HBM configuration and use. https://docs.xilinx.com/r/en-US/ug1393-vitis-application-acceleration/HBM-Configuration-and-Use, May 2022. Accessed: 2022-09-23.
[BLDS10] Kevin J. Bowers, Ross A. Lippert, Ron O. Dror, and David E. Shaw. Improved twiddle access for fast Fourier transforms. IEEE Transactions on Signal Processing, 58(3):1122–1130, 2010.
[CP15] Ren Chen and Viktor K. Prasanna. Automatic generation of high throughput energy efficient streaming architectures for arbitrary fixed permutations. In 2015 25th International Conference on Field Programmable Logic and Applications (FPL), pages 1–8, 2015.
[CT65] James Cooley and John Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297–301, 1965.
[DCH+21] Sultan Durrani, Muhammad Saad Chughtai, Mert Hidayetoglu, Rashid Tahir, Abdul Dakkak, Lawrence Rauchwerger, Fareed Zaffar, and Wen-mei Hwu. Accelerating Fourier and number theoretic transforms using tensor cores and warp shuffles. In 2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 345–355, 2021.
[DM22] Kemal Derya, Ahmet Can Mert, Erdinç Öztürk, and Erkay Savaş. CoHA-NTT: A configurable hardware accelerator for NTT-based polynomial multiplication. Microprocessors and Microsystems, 89:104451, 2022.
[Gou21] A. P. Goucher. An efficient prime for number-theoretic transforms. https://cp4space.hatsya.com/2021/09/01/an-efficient-prime-for-number-theoretic-transforms, September 2021. Accessed: 2022-09-23.
[JWW21] Matthias Jung, Christian Weis, and Norbert Wehn. The Dynamic Random Access Memory Challenge in Embedded Computing Systems, pages 19–36. Springer International Publishing, Cham, 2021.
[KA20] Emre Karabulut and Aydin Aysu. RANTT: A RISC-V architecture extension for the number theoretic transform. In 2020 30th International Conference on Field-Programmable Logic and Applications (FPL), pages 26–32, 2020.
[KLC+20] Sunwoong Kim, Keewoo Lee, Wonhee Cho, Yujin Nam, Jung Hee Cheon, and Rob A. Rutenbar. Hardware architecture of a number theoretic transform for a bootstrappable RNS-based homomorphic encryption scheme. In 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 56–64, 2020.
[KMG+19] Hans Kanders, Tobias Mellqvist, Mario Garrido, Kent Palmkvist, and Oscar Gustafsson. A 1 million-point FFT on a single FPGA. IEEE Transactions on Circuits and Systems I: Regular Papers, 66(10):3863–3873, 2019.
[LPY22] Dai Li, Akhil Pakala, and Kaiyuan Yang. MeNTT: A compact and efficient processing-in-memory number theoretic transform (NTT) accelerator. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 30(5):579–588, 2022.
[MK20] Ahmet Can Mert, Emre Karabulut, Erdinç Öztürk, Erkay Savaş, Michela Becchi, and Aydin Aysu. A flexible and scalable NTT hardware: Applications from homomorphically encrypted deep learning to post-quantum cryptography. In 2020 Design, Automation and Test in Europe Conference and Exhibition (DATE), pages 346–351, 2020.
[MKO+20] Ahmet Can Mert, Emre Karabulut, Erdinc Ozturk, Erkay Savas, and Aydin Aysu. An extensive study of flexible design methods for the number theoretic transform. IEEE Transactions on Computers, pages 1–1, 2020.
[NGI+20] Hamid Nejatollahi, Saransh Gupta, Mohsen Imani, Tajana Simunic Rosing, Rosario Cammarota, and Nikil Dutt. CryptoPIM: In-memory acceleration for lattice-based cryptographic hardware. In 2020 57th ACM/IEEE Design Automation Conference (DAC), pages 1–6, 2020.
[PS22] Rogério Paludo and Leonel Sousa. NTT architecture for a Linux-ready RISC-V fully-homomorphic encryption accelerator. IEEE Transactions on Circuits and Systems I: Regular Papers, 69(7):2669–2682, 2022.
[RLPD20] M. Sadegh Riazi, Kim Laine, Blake Pelton, and Wei Dai. HEAX: An architecture for computing on encrypted data. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '20, pages 1295–1309, New York, NY, USA, 2020. Association for Computing Machinery.
[SRTJ+19] Sujoy Sinha Roy, Furkan Turan, Kimmo Jarvinen, Frederik Vercauteren, and Ingrid Verbauwhede. FPGA-based high-performance parallel architecture for homomorphic computing on encrypted data. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 387–398, 2019.
[Tea22] Polygon Zero Team. Plonky2: Fast recursive arguments with PLONK and FRI. https://github.com/mir-protocol/plonky2/blob/main/plonky2/plonky2.pdf, September 2022. Accessed: 2022-09-23.
[YKKP22] Yang Yang, Sanmukh R. Kuppannagari, Rajgopal Kannan, and Viktor K. Prasanna. NTTGen: A framework for generating low latency NTT implementations on FPGA. In Proceedings of the 19th ACM International Conference on Computing Frontiers, CF '22, pages 30–39, New York, NY, USA, 2022. Association for Computing Machinery.
[ZWZ+21] Ye Zhang, Shuo Wang, Xian Zhang, Jiangbin Dong, Xingzhong Mao, Fan Long, Cong Wang, Dong Zhou, Mingyu Gao, and Guangyu Sun. PipeZK: Accelerating zero-knowledge proof with a pipelined architecture. In Proceedings of the 48th Annual International Symposium on Computer Architecture, ISCA '21, pages 416–428. IEEE Press, 2021.