
Rosetta: A Realistic High-Level Synthesis Benchmark Suite for Software Programmable FPGAs

Yuan Zhou¹*, Udit Gupta²⋆, Steve Dai¹, Ritchie Zhao¹, Nitish Srivastava¹, Hanchen Jin¹, Joseph Featherston¹, Yi-Hsiang Lai¹, Gai Liu¹, Gustavo Angarita Velasquez³⋆, Wenping Wang⁴⋆, Zhiru Zhang¹*

¹ School of Electrical and Computer Engineering, Cornell University, USA
² Computer Science, Harvard University, USA
³ Systems Engineering and Computer Science, National University of Colombia, Colombia
⁴ Electronic and Information Engineering, Zhejiang University, China

*{yz882,zhiruz}@cornell.edu

⋆ Udit, Gustavo, and Wenping conducted this research when they were affiliated with or visiting Cornell.

ABSTRACT
Modern high-level synthesis (HLS) tools greatly reduce the turnaround time of designing and implementing complex FPGA-based accelerators. They also expose various optimization opportunities which cannot be easily explored at the register-transfer level. With the increasing adoption of the HLS design methodology and continued advances in synthesis optimization, there is a growing need for realistic benchmarks to (1) facilitate comparisons between tools, (2) evaluate and stress-test new synthesis techniques, and (3) establish meaningful performance baselines to track progress of the HLS technology. While several HLS benchmark suites already exist, they are primarily comprised of small textbook-style function kernels instead of complete and complex applications. To address this limitation, we introduce Rosetta, a realistic benchmark suite for software programmable FPGAs. Designs in Rosetta are fully developed applications. They are associated with realistic performance constraints and optimized with advanced features of modern HLS tools. We believe that Rosetta is not only useful for the HLS research community, but can also serve as a set of design tutorials for non-expert HLS users. In this paper we describe the characteristics of our benchmarks and the optimization techniques applied to them. We further report experimental results on an embedded FPGA device as well as a cloud FPGA platform.

ACM Reference Format:
Yuan Zhou, Udit Gupta, Steve Dai, Ritchie Zhao, Nitish Srivastava, Hanchen Jin, Joseph Featherston, Yi-Hsiang Lai, Gai Liu, Gustavo Angarita Velasquez, Wenping Wang, Zhiru Zhang. 2018. Rosetta: A Realistic High-Level Synthesis Benchmark Suite for Software Programmable FPGAs. In FPGA '18: 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, February 25–27, 2018, Monterey, CA, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3174243.3174255

1 INTRODUCTION
Field-programmable gate arrays (FPGAs) have become an attractive option for realizing specialized accelerators thanks to their reconfigurability, massive fine-grained parallelism, and performance-per-watt advantage. With the extreme-scale integration of modern systems-on-chip (SoCs) and the escalating design complexity of emerging applications, designing at a higher level of abstraction has become crucial to achieving high productivity. To address this challenge, high-level synthesis (HLS) tools have emerged that allow application developers to describe a hardware accelerator in common software programming languages like C/C++ and automatically generate RTL from the behavioral description [7, 14]. With the recent advances in HLS techniques and algorithms, modern HLS tools enable designers to explore optimization opportunities that are infeasible at the register-transfer level.

Programming FPGAs with HLS tools is drastically different from writing traditional software code. HLS users typically need to apply many optimization pragmas/directives to meet design constraints. The success of such manual optimization often requires nontrivial hardware design knowledge. For example, in image/video processing, the right combination of SRAM-based line buffers and shift registers is needed to achieve the ideal throughput and resource usage when pipelining stencil code in hardware. With a more complex dataflow structure, the user needs to further calculate and specify the right FIFO depths to obtain the best pipeline rate without causing too much area overhead. However, these advanced HLS optimizations are rarely used, or even required, in the existing HLS benchmark suites (e.g., [11], [23]), which primarily include relatively small kernels designed to test some of the basic capabilities of an HLS tool, such as the synthesis support of high-level language constructs. In addition, for HLS tool developers and the HLS research community at large, there is a growing demand for a common set of realistic and complex designs to evaluate the efficacy of new synthesis techniques.

To this end, we introduce Rosetta¹, a suite of realistic HLS benchmarks for software programmable FPGAs. Rosetta includes popular machine learning workloads such as logistic regression and neural network inference, as well as real-time video processing applications including image rendering and face detection. Unlike previous efforts, Rosetta presents fully developed applications instead of small kernel programs, and specifies realistic design constraints for each application.

¹ Rosetta gets its name following the convention of a plethora of "stone" benchmark suites. It also symbolizes that our benchmarks are specified in multiple languages (i.e., C++, OpenCL) and useful for evaluating HLS across different tools and platforms.


These design constraints are satisfied by applying advanced optimizations of state-of-the-art HLS tools, which are not exercised by existing benchmark suites. With these features, Rosetta is not only a set of practical benchmarks for the HLS community, but also a design tutorial on how to build specialized FPGA accelerators with advanced HLS optimizations. More concretely, our main contributions are threefold:

• We design and present Rosetta, which couples a range of realistic applications with real-world design constraints under different programming models. Current Rosetta designs are written in C++ and OpenCL. The synthesized hardware accelerators are tested on both embedded and cloud FPGA platforms.

• Rosetta demonstrates how to effectively apply advanced optimizations provided by modern HLS tools to meet the design constraints and achieve high quality of results. Examples of these optimizations include fixed-point optimization, dataflow pipelining, and data reuse through customized memory.

• The proposed benchmark suite is freely available in open-source format². We plan to continuously improve Rosetta by strengthening current cases and adding new applications from other domains.

² Released on GitHub at https://github.com/cornell-zhang/rosetta

The rest of this paper is organized as follows: Section 2 introduces related work on HLS benchmarking and optimizations; Section 3 outlines the Rosetta applications and the key HLS optimization techniques they leverage; Section 4 describes each benchmark in detail; Section 5 presents our experimental results; Section 6 concludes this work.

2 RELATED WORK
FPGA programming currently differs significantly from the common practice of software programming, even with the use of HLS tools. Instead of simply focusing on functional correctness and execution time, FPGA programmers often have to explore various complex design trade-offs involving performance, power, area, and cost. Therefore, traditional software benchmark suites cannot be directly applied to HLS evaluation. In response, a number of HLS-specific benchmark suites have been developed by the research community for evaluating various aspects of hardware synthesis techniques and tool flows. CHStone [11] is a widely used C-based HLS benchmark suite which contains function kernels selected from application domains such as arithmetic, signal processing, and security. MachSuite [23] is another popular HLS benchmark suite; it includes a more diverse set of kernels and provides different algorithms for the same kernel to facilitate comparisons at the algorithmic level. A more recent effort, Spector [10], offers OpenCL benchmarks that are ready to be executed on Intel (formerly Altera) FPGA platforms. Kernels in Spector are designed to have large design spaces, which is useful for experimenting with automatic design space exploration (DSE) techniques. Additionally, HLS researchers have adopted benchmarks from other communities. For example, Rodinia [5], originally designed for GPU benchmarking, has been used to test OpenCL-based HLS flows [29, 31]. PolyBench [21] from the software compiler community has been adopted for assessing HLS-targeted polyhedral transformations [22, 34, 42] and DSE techniques [24, 29, 38, 40].

While the popular kernel benchmarks are simple to run and analyze, they are insufficient for evaluating the increased capabilities of HLS optimizations and new technology advances in FPGA devices. In particular, state-of-the-art HLS tools provide many advanced features for achieving high design quality. Examples include arbitrary-precision datatypes, parameterized hardware data structures (e.g., line buffers), and hierarchical dataflow pipelining. These features are often used in combination with other common HLS optimizations such as unrolling, loop pipelining [9, 15, 37], and array partitioning [30, 41]. Moreover, they are typically applied across multiple kernels exhibiting different characteristics to meet stringent application-level design constraints.

We believe that a new set of full-application benchmarks is desirable to enable more realistic performance reporting of HLS tools and FPGA-based acceleration. Along this line, Liu et al. [16] conducted a comprehensive case study on an H.264 decoder and open-sourced their HLS implementation. Rosetta goes one step further by providing a suite of application benchmarks that can be used to (1) facilitate comparisons between HLS tools, (2) evaluate new synthesis techniques, and (3) establish meaningful baselines to track progress of the HLS and FPGA technologies. Each application in Rosetta includes a set of enforceable application-level design constraints based on real-world specifications. These constraints model the realistic use cases of FPGA-based hardware accelerators, which helps standardize the evaluation of future advancements in HLS tools. Furthermore, the applications in Rosetta leverage advanced features of HLS tools to achieve high quality of results (QoRs) across a distinct set of hardware designs. Hence these benchmarks can also serve as useful design tutorials for FPGA programmers building high-performance hardware accelerators with HLS.

3 ROSETTA OVERVIEW
Rosetta currently contains six realistic benchmarks selected from the machine learning and video processing fields, where FPGAs are competitive with CPUs and GPUs in energy efficiency.³ For each Rosetta design, we provide the unoptimized software version and the optimized HLS implementations written in either C++ or OpenCL. Table 1 lists the current Rosetta collection. Two of these benchmarks, binarized neural network and face detection, are adopted from our previously published work [25, 39], while the rest are new designs.

³ For the time being, we are not targeting traditional benchmarks from cryptography (e.g., AES) and digital signal processing (e.g., DCT, FFT), since they are already included in several other benchmark suites [10, 23].

Rosetta contains both compute-bound and memory-bound applications comprised of a rich set of kernels. These applications and kernels expose diverse sources of parallelism. Our current HLS implementations typically exploit instruction-level parallelism (ILP) through fine-grained pipelining, and in some cases also expose task-level parallelism (TLP) by overlapping the execution of different kernels. Additionally, each benchmark is associated with realistic design objectives — the machine learning applications require either low latency or high throughput depending on the use-case scenario, while the video processing applications must meet a real-time throughput target of at least 30 frames per second. In order to achieve these application-level constraints, Rosetta designs are customized using a variety of HLS optimization techniques, which are concisely summarized as follows:


• Datatype customization – Customized data types such as fixed-point types allow an FPGA accelerator to compute at the desired numerical accuracy, and often lead to significant performance and area improvements over a design using full-precision floating-point types (a short sketch follows this list).

• Compute customization – Compute customization improves the latency and/or throughput of the design through parallelization and pipelining. Loop unrolling, loop pipelining, and dataflow pipelining fall into this category.

• Memory customization – FPGA accelerators typically demand very high on-chip memory bandwidth to enable highly distributed control and computation. Therefore, it is critical to set up a customized memory hierarchy that provides the required bandwidth through data reuse and memory banking.

• Communication customization – The limited data bandwidth between off-chip memories and the FPGA accelerator often becomes the performance bottleneck for memory-bound applications. Hence it is crucial to customize the communication channel and protocol used by the hardware accelerator to fully utilize the off-chip memory bandwidth through proper data packing and careful design of the data layout.
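As a concrete illustration of datatype customization, the sketch below replaces a 32-bit float with the Vivado HLS ap_fixed arbitrary-precision type. This is our own minimal example rather than code from a Rosetta benchmark, and the chosen bit widths are arbitrary placeholders:

#include "ap_fixed.h"

// 16 bits in total: 4 integer bits, 12 fractional bits
typedef ap_fixed<16, 4> data_t;

// A multiply-accumulate on the customized type synthesizes to narrow
// integer arithmetic instead of a full floating-point core.
data_t mac(data_t a, data_t b, data_t acc) {
  return acc + a * b;
}

Rosetta's own designs wrap such types behind typedefs (e.g., the fixed<...> types in Figures 4 and 5) so that the precision can be tuned in one place.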


Table 1: The current set of the Rosetta applications — Rosetta contains both compute-bound and memory-bound applications with different workloads. Kernels in each application expose different sources of parallelism: SLP = subword-level parallelism; DLP = data-level parallelism; ILP = instruction-level parallelism. The types of parallelism available in each compute kernel are listed in parentheses.

Application | Categorization | Major Compute Kernels | Major HLS Optimizations
3D Rendering | Video processing; compute bound; integer operation intensive | Integer arithmetic (ILP) | Dataflow pipelining; communication customization
Digit Recognition | Machine learning; compute bound; bitwise operation intensive | Hamming distance (SLP, DLP, ILP); KNN voting (ILP) | Loop unrolling; loop pipelining
Spam Filtering | Machine learning; memory bound; fixed-point arithmetic intensive | Dot product (DLP, ILP); scalar multiplication (DLP, ILP); vector addition (DLP, ILP); sigmoid function (ILP) | Dataflow pipelining; datatype customization; communication customization
Optical Flow | Video processing; memory bound; floating-point arithmetic intensive | 1D convolution (DLP, ILP); outer product (DLP, ILP) | Dataflow pipelining; memory customization; communication customization
Binarized Neural Network (BNN) [39] | Machine learning; compute bound; bitwise operation intensive | Binarized 2D convolution (SLP, DLP, ILP); binarized dot product (SLP, DLP, ILP) | Memory customization; datatype customization; communication customization
Face Detection [25] | Video processing; compute bound; integer arithmetic intensive | Image scaling (DLP, ILP); cascaded classifiers (DLP, ILP) | Memory customization; datatype customization

4 BENCHMARK DESCRIPTION
This section discusses the Rosetta applications in detail. For each benchmark, we first briefly introduce its functionality and design constraints; we then describe its major compute kernels, explain the rationale behind our categorizations in Table 1, and discuss the key HLS optimizations applied to the design.

4.1 3D Rendering
The 3D rendering benchmark renders 2D images from 3D triangle mesh models [20]. Taking in the 3D coordinates of triangle vertices, the application projects the triangles onto a 2D image, and colors the image pixels according to the "altitude" of the projected triangle. Our implementation works on 256x256 images where pixels are represented with 8-bit integers. The provided dataset contains the coordinates of 3192 triangles. The target throughput is 30 frames per second.

The HLS design contains a typical image processing pipeline, as shown in Figure 1. The coordinates of each triangle go through four kernel functions before updating the output frame buffer in coloringFB. Integer operations form the primary workload inside the kernels: projection and rasterization2 are rich in integer arithmetic, while rasterization1 and zculling are heavy on integer comparisons.

 1 TRIANGLES: for (int i = 0; i < NUM_3D_TRI; i++) {
 2   #pragma HLS dataflow
 3   // five stages for processing each 3D triangle
 4   projection(triangle_3ds, &triangle_2ds, angle);
 5   flag = rasterization1(triangle_2ds, max_min,
 6                         &triangle_2ds_same, max_index);
 7   size = rasterization2(flag, max_min, max_index,
 8                         triangle_2ds_same, fragment);
 9   size_pixels = zculling(i, fragment, size, pixels);
10   coloringFB(i, size_pixels, pixels, frame_buffer);
11 }

Figure 1: Main loop for 3D rendering. One triangle is processed by five image processing stages in each iteration.

Figure 2: Dataflow optimization overlaps different pipeline stages in 3D rendering. (Timing diagram: the projection, rast1, rast2, zculling, and coloringFB stages of consecutive triangles execute in an overlapped, pipelined fashion.)

Each triangle requires a large amount of computation relative to its memory size. Therefore, the application is categorized as compute-bound.

3D rendering is a prime example of dataflow optimization, which is applied in the HLS code on line 2 of Figure 1. Dataflow optimization exploits task-level parallelism by overlapping different stages of the image processing pipeline, as shown in Figure 2. Although the latency of processing each triangle is not reduced, dataflow optimization improves throughput and ensures that no hardware module in the pipeline is idle in the steady state.

Design parameters. We provide a switch in the source code to enable/disable dataflow optimization.

4.2 Digit Recognition
Digit recognition classifies hand-written digits using the K-nearest-neighbor (KNN) algorithm. The application works on a downsampled subset of the MNIST database [13], with 18000 training samples and 2000 test samples evenly split amongst the ten digit classes. Each MNIST image is downsampled to 14x14 and each pixel is represented as a single bit; thus, each image can be stored as a 196-bit unsigned integer. The KNN algorithm computes the Hamming distance between a test input and each training sample, stores the labels of the training samples with the K shortest distances, and votes among the K labels to decide the label of the test sample. The design objective for digit recognition is to minimize the total latency of classifying the 2000 test samples.

Digit recognition includes two major compute kernels: Hamming distance calculation and KNN voting. The Hamming distance kernel computes the Manhattan distance between two samples; since each sample is comprised of 1-bit pixels, this reduces to the Hamming distance and is done via a bitwise XOR on the inputs, followed by a population count of the result. The kernel is therefore rich in bitwise logic. The Hamming distance must be calculated between a test input and every training sample. As a result, Hamming distance calculation is the dominant workload of digit recognition. The KNN voting kernel examines the list of Hamming distances to find the K nearest training samples, and outputs the classification result as the most frequent label amongst them. The main workload in this kernel is integer comparison and sorting.
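A minimal sketch of this XOR-plus-popcount computation is shown below, assuming the 196-bit ap_uint digit representation described above (our illustration, not the benchmark's exact code):

#include "ap_int.h"

typedef ap_uint<196> WholeDigitType; // one 14x14 binary digit image

int hamming_distance(WholeDigitType a, WholeDigitType b) {
  ap_uint<196> diff = a ^ b; // set bits mark differing pixels
  int count = 0;
  POPCOUNT: for (int i = 0; i < 196; i++) {
#pragma HLS unroll
    count += diff[i]; // population count of the XOR result
  }
  return count;
}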
These two kernels have very different characteristics: while we can easily exploit the bit-level and data-level parallelism in the Hamming distance kernel, the KNN voting kernel is harder to parallelize.

Digit recognition has a high compute-to-communication ratio. For each test instance, Hamming distance calculation requires 100s-1000s of cycles depending on the parallelization factor, and KNN voting requires 10s-100s of cycles depending on K and the parallelization factor. The training samples and their labels are stored on-chip and reused for all test instances; at 196 bits per image, the 18000 training samples occupy roughly 3.5 Mb of on-chip storage. As a result, digit recognition is a compute-bound application.

Figure 3 shows the main compute loop nest for KNN calculation, alongside the key HLS optimizations. TRAINING_LOOP iterates over training samples, while the inner loop, LANES, instantiates different Hamming distance units.

 1 __local WholeDigitType training_set[NUM_TRAINING]
 2 __attribute__((xcl_array_partition(block,PAR_FACTOR,1)));
 3
 4 __attribute__((xcl_pipeline_loop))
 5 TRAINING_LOOP:
 6 for (int i = 0; i < NUM_TRAINING / PAR_FACTOR; i ++) {
 7   __attribute__((opencl_unroll_hint))
 8   LANES:
 9   for (int j = 0; j < PAR_FACTOR; j ++) {
10     // Read a new instance from the training set
11     int train_id = j * NUM_TRAINING / PAR_FACTOR + i;
12     WholeDigitType training_instance;
13     training_instance = training_set[train_id];
14     // Update the KNN set
15     update_knn(test_instance, training_instance,
16                &knn_set[j*K_CONST]);
17   }
18 }

Figure 3: Main compute loop nest for KNN calculation in OpenCL.

In addition to compute optimizations in the form of loop pipelining and unrolling (lines 4 and 7 of Figure 3), memory optimization is needed: the default implementation of the on-chip array training_set has only two memory ports and cannot supply PAR_FACTOR training instances per cycle, so the array is partitioned in line 2. With these optimizations, we can exploit the data-level parallelism between training instances.
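The update_knn call in Figure 3 maintains a per-lane set of the K smallest distances seen so far. The benchmark does not list its body above, but one plausible implementation (ours, purely for illustration, with K_CONST and WholeDigitType as in Figure 3 and hamming_distance as sketched earlier) is:

// Keep the K smallest distances observed so far for one lane.
void update_knn(WholeDigitType test_inst, WholeDigitType train_inst,
                int min_distances[K_CONST]) {
  int dist = hamming_distance(test_inst, train_inst);
  // find the current largest of the K stored distances
  int max_dist = -1, max_idx = 0;
  for (int k = 0; k < K_CONST; k++) {
    if (min_distances[k] > max_dist) {
      max_dist = min_distances[k];
      max_idx = k;
    }
  }
  // replace it if the new candidate is closer
  if (dist < max_dist)
    min_distances[max_idx] = dist;
}

In the full design, the labels of the retained neighbors would be tracked alongside the distances; the KNN voting kernel then selects the most frequent label among all lanes' candidates.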
Design parameters. The user can tune the following knobs:

• K: the number of nearest neighbors.
• PAR_FACTOR: the number of parallel Hamming distance units.

These two parameters present an interesting trade-off between classification accuracy, latency, and resource utilization. Increasing PAR_FACTOR reduces the latency of the Hamming distance kernel, but complicates the KNN voting kernel. Parallelization also causes the frequency to drop. Furthermore, the complexity of both kernels increases with K. Additional results and analysis on the design space are presented in Section 5.

4.3 Spam Filtering
The spam filtering application uses stochastic gradient descent (SGD) to train a logistic regression (LR) model for spam email classification [19]. The input is a dataset containing 5000 emails, 4500 for training and 500 for testing [26]. Each email is represented as a 1024-dimensional vector whose elements are relative word frequencies stored as 16-bit fixed-point numbers. The SGD training process produces a vector of 32-bit fixed-point parameters for the LR model. We use five training epochs and a minibatch size of one; each epoch processes every training sample once and updates the parameters after each sample.

The performance target of spam filtering is to minimize training latency. The critical resource constraints are the number of hardened DSP blocks and the size of on-chip storage, which limit the level of compute parallelization and the amount of data stored on the FPGA. The SGD algorithm contains kernels commonly found in machine learning applications, including dot product, vector addition, and the sigmoid function.
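To make the structure of one SGD step concrete, here is a simplified, unoptimized sketch in plain C++ (our illustration; the benchmark implements the same kernels with the fixed-point types and LUT-based sigmoid of Figure 4, and parallelizes the vector loops):

#include <math.h>

const int N_FEATURES = 1024;

// One SGD update for logistic regression on a single training sample.
void sgd_step(const float feature[N_FEATURES], float label,
              float theta[N_FEATURES], float lr) {
  // dot product kernel
  float dot = 0.0f;
  for (int i = 0; i < N_FEATURES; i++)
    dot += theta[i] * feature[i];
  // sigmoid kernel (the hardware uses the look-up table of Figure 4)
  float prob = 1.0f / (1.0f + expf(-dot));
  // scalar multiplication and vector addition kernels
  float err = prob - label;
  for (int i = 0; i < N_FEATURES; i++)
    theta[i] -= lr * err * feature[i];
}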
Our spam filtering design exploits datatype customization and approximation of complex arithmetic operations on the FPGA. Figure 4 shows the optimized sigmoid function. Lines 1-3 show the customized datatypes used to avoid expensive floating-point arithmetic. We also eliminate most of the compute by taking advantage of the properties of the sigmoid function: sigmoid asymptotically approaches one when the input is large and zero when the input is small (i.e., a large negative value). Sigmoid values for inputs between minus four and four are hardcoded in a look-up table.


 1 typedef fixed<F_TWIDTH,F_IWIDTH> FeatureType;
 2 typedef uint<LUT_TWIDTH> IdxFixed;
 3 typedef fixed<LUT_TWIDTH,LUT_IWIDTH> LutInFixed;
 4 // values of sigmoid function stored in a look-up table
 5 FeatureType useLUT(LutInFixed in) {
 6   IdxFixed index;
 7   if (in < 0) {
 8     in = -in;
 9     index = LUT_SIZE - (in << (LUT_TWIDTH - LUT_IWIDTH));
10   }
11   else
12     index = (in << (LUT_TWIDTH - LUT_IWIDTH));
13   return lut[index];
14 }
15 // sigmoid function
16 FeatureType Sigmoid(FeatureType exponent) {
17   if (exponent > 4)
18     return 1.0;
19   else if (exponent < -4)
20     return 0.0;
21   else {
22     LutInFixed inLut = (LutInFixed)exponent;
23     return useLUT(inLut);
24   }
25 }

Figure 4: Datatype and compute optimization of the Sigmoid function — Specialized datatypes are used throughout the whole hardware function to avoid expensive floating-point arithmetic. We use a look-up table to store the values of the sigmoid function so that the complex arithmetic operations can be reduced. In our implementation, F_TWIDTH = 32, F_IWIDTH = 13, LUT_TWIDTH = 12, LUT_IWIDTH = 4.
typedef uint<VDWIDTH> VectorDataType;
typedef fixed<D_TWIDTH,D_IWIDTH> DataType;
void read_data(VectorDataType* data,
               DataType* training,
               int tid)
{
  for (int i = 0; i < N_FEATURES/(VDWIDTH/D_TWIDTH); i++) {
#pragma HLS pipeline
    // read in the data
    int idx = tid * N_FEATURES / (VDWIDTH/D_TWIDTH) + i;
    VectorDataType tmp = data[idx];
    // distribute into local buffer
    for (int j = 0; j < (VDWIDTH/D_TWIDTH); j++) {
      int loc_idx = i * (VDWIDTH/D_TWIDTH) + j;
      training[loc_idx] = tmp((j+1)*D_TWIDTH-1, j*D_TWIDTH);
    }
  }
}

Figure 5: Communication optimization for spam filtering — In our implementation, D_TWIDTH = 16, D_IWIDTH = 4, N_FEATURES = 1024. Users can tune the VDWIDTH parameter to control the off-chip communication bandwidth.

Our target FPGA devices do not have sufficient on-chip memory to store the complete training set, necessitating the streaming of training instances from off-chip memory. Dataflow optimization (introduced in Section 4.1) is applied to overlap communication and compute. To fully utilize the off-chip memory bandwidth, we apply element packing as shown in Figure 5. Data is transferred from off-chip storage as VectorDataType, a wide, custom-bitwidth integer type. Inside the FPGA, the data is unpacked into 16-bit training vector elements, resulting in a communication throughput of multiple elements per cycle; for instance, with VDWIDTH = 512 and D_TWIDTH = 16, each transfer carries 512 / 16 = 32 elements, so streaming one 1024-dimensional email vector takes only 32 transfers. Despite this optimization, the throughput of the dataflow pipeline is still determined by the communication latency, because the compute units for LR are relatively simple and highly parallelized. Therefore, spam filtering is classified as a memory-bound application.

Design parameters. The design space of spam filtering consists of the following parameters:

• PAR_FACTOR: the parallelization factor of the vector compute kernels.
• VDWIDTH: the width of the packed vector datatype, which controls the upper bound of the off-chip communication bandwidth of the hardware function.

Our results and analysis on the design space are shown in Section 5.

4.4 Optical Flow
Optical flow captures the motion pattern of objects between consecutive image frames. It is an important step in object detection and is integrated into several image/video processing toolsets such as OpenCV and the Computer Vision toolbox of MATLAB. Our implementation is based on the Lucas-Kanade method, which maps well onto FPGAs [32]. The output is a 2D vector field of the same size as the input frames, where each vector shows the movement of the corresponding pixel between the input image frames. Currently, pixels of the input images are represented as 8-bit integers, while the output and all intermediate results are represented as 32-bit floating-point numbers. We use the MPI Sintel dataset [4] for testing this benchmark; the resolution of the image frames in this dataset is 436x1024.

Figure 6: Hardware diagram for optical flow — The kernels are connected by FIFOs for streaming dataflow pipelining. (Diagram: an Unpack stage feeds the Gradient xy/z, Weight_y, Outer product, Tensor_y, Weight_x, Tensor_x, and Compute flow stages; image frames are read from, and output flow vectors written to, off-chip memory.)

Optical flow must satisfy a real-time throughput constraint of 30 frames per second. In addition, the limited amount of on-chip storage prevents us from buffering entire image frames on chip. Figure 6 shows the image processing pipeline with eight stages. The main compute kernel for the Gradient, Weight, and Tensor stages is 1D convolution; the Outer product stage performs the outer product of three-dimensional vectors. The output is generated in the Compute flow stage. Currently, we use floating-point arithmetic in these kernels. The data packing optimization introduced in Section 4.3 is applied to avoid contention on the off-chip memory: each packet contains one pixel from each input image frame, and the Unpack stage distributes the pixels to on-chip FIFOs.
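As a back-of-the-envelope check (our own arithmetic based on the numbers above): a 436x1024 frame holds 446,464 pixels, so a pipeline that accepts one pixel per cycle at, say, a 250 MHz clock needs roughly 1.8 ms per frame, comfortably inside the 33 ms budget implied by the 30 frames-per-second constraint. The achievable frame rate is therefore limited mainly by how fast pixels can be streamed through off-chip memory.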

Similar to 3D rendering, we use dataflow optimization to construct channels between the stages of the image processing pipeline. The major difference between the two benchmarks is that all pipeline stages in optical flow produce and consume pixels in a strict sequential order. In addition, the pipeline stages have perfectly balanced rates. Therefore, the channels between pipeline stages can be implemented as fixed-depth FIFOs, as shown in Figure 6. The whole accelerator is a very deep, fine-grained pipeline with different stages perfectly overlapped.
with different stages perfectly overlapped. Convolvers fout output streams
Memory customization is also necessary for optical flow to achieve Input
high throughput. Here we introduce the common specialized mem- words
BitSel ∗
ory structures for image processing applications: line buffer and fin
Pooling, Output
+ + Bnorm,

window buffer. Figure 7 gives a pictorial illustration of a 2-row line Input Binarize words
buffer and a 3x3 window buffer. The line buffer reads in one pixel per words
BitSel ∗ Integer
buffer
cycle and stores pixels in recently visited rows. The window buffer Variable-width fout Conv
is completely partitioned into registers for parallel data access, and Line Buffer Weights

it consistently reads from the line buffer. These specialized memory


structures exploit the data reuse in stencil applications with sliding Figure 9: Hardware structure of the BNN accelerator (figure
processing windows, and minimize memory accesses to the next- adapted from [39]).
level memory hierarchy. The convolution kernels in optical flow are
good candidates for this memory customization. Figure 8 shows how 4.5 Binarized Neural Network
we construct and maintain a line buffer and a window buffer in the Accelerating convolutional neural networks (CNNs) has become
gradient_xy kernel. Proper conditions need to be applied to avoid an important research topic for the FPGA community. Academic
out-of-bound array accesses. and industry researchers have implemented different CNN models
With the optimizations described above, we classify optical flow on a variety of FPGA platforms [3, 18, 35, 36]. Recently, binarized
as a memory-bound application because the off-chip memory band- neural networks (BNNs) were shown to be a natural fit for FPGA
width directly determines the throughput of the streaming dataflow hardware [6, 27, 33, 39]. BNNs constrain weights and intermediate
pipeline. However, this is because our current implementation does activations to +1 or -1; this converts most of its multiplies to binary
not exploit data reuse between input frames. We plan to further XORs and takes full advantage of the FPGA logic fabric. We adopt
optimize this design to achieve a higher throughput. an open-source implementation of BNN by Zhao et al. [39] as a
representative neural network application in Rosetta.


Zhao et al. implement the BNN model described in [8], which operates on the CIFAR-10 dataset [12]. It contains six convolutional layers, three pooling layers, and three fully-connected layers. Figure 9 shows the hardware diagram of the BNN accelerator, which uses a configurable number of convolvers to exploit data-level parallelism in a scalable manner. The authors target a small FPGA device with limited on-chip storage. As a result, the BNN weights cannot fit on chip, and the accelerator must be invoked multiple times to classify an image; each time, new weights are loaded from off-chip memory.

There are two major kernels in BNN: binarized convolution and binarized dot product. Both kernels are intensive in bitwise logic operations. Binarized convolution comprises the majority of the operations in classifying an image, and is heavily parallelized as a result. In contrast, the binarized fully-connected layers, which use the dot product kernel, are limited by off-chip memory bandwidth. We categorize BNN as compute-bound since the latency improvement mostly comes from accelerating compute in the convolutional layers.
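To illustrate why these kernels reduce to bitwise logic, here is a sketch of a binarized dot product, assuming the +1/-1 values are packed one bit per element into 64-bit words (our illustration of the standard XOR-popcount trick, not code from [39]):

#include <stdint.h>

// Dot product of two vectors of +1/-1 values, packed one bit per element
// (bit = 0 encodes +1, bit = 1 encodes -1).
int bin_dot(const uint64_t *a, const uint64_t *b, int n_words) {
  int pop = 0;
  for (int w = 0; w < n_words; w++) {
    uint64_t diff = a[w] ^ b[w];       // bit set where the signs differ
    pop += __builtin_popcountll(diff); // count the differing positions
  }
  // matching positions contribute +1, differing positions -1
  return 64 * n_words - 2 * pop;
}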
Since the 2D convolutional layers have a sliding-window access pattern, line buffers are used to exploit data locality. In particular, a variable-width line buffer (VWLB) is designed to keep the hardware convolvers fully utilized despite the varying sizes of the feature maps. Figure 10 shows how the VWLB works for different input widths. For an input feature map with a width of 32, the VWLB operates identically to a conventional line buffer. For a smaller feature map with a width of 8, each row in the VWLB stores multiple rows of the input. The rows are carefully arranged in the VWLB so that the convolutional filter can slide through and produce correct results.

Figure 10: Example usage of the variable-width line buffer for 8-wide and 32-wide feature maps (figure adapted from [39]).

Design parameters. The BNN benchmark allows users to tune the number of convolvers in the accelerator. Other parameters, such as the sizes of the buffers, are scaled automatically.

4.6 Face Detection
The face detection application is adopted from [25]. It uses the Viola-Jones algorithm [28] to detect human faces in a given image. More specifically, the accelerator takes a 320x240 greyscale image as input, which is scaled to construct an image pyramid; afterwards, an integral image is constructed from each image in the pyramid, and a set of cascaded classifiers is applied to a fixed-size window which scans through the integral image; eventually, the positions and sizes of the human faces are returned.

As mentioned in [25], the throughput target for face detection is 30 frames per second. In addition, the application is subject to hardware constraints including limited on-chip storage and routing resources. The two major compute kernels in face detection are image scaling and the cascaded classifiers. Image scaling is a common kernel in feature extraction applications such as SIFT [17], as well as in the pooling layers of CNNs. The cascaded classifiers are the dominant workload of the face detection application. The authors of [25] parallelize the first three classifier stages and pipeline the rest of the stages to exploit data-level parallelism. This kernel also exposes an irregular memory access pattern — each classifier accesses either eight or twelve pixels, and different classifiers have different access patterns. This feature makes the kernel interesting for HLS memory optimization techniques. Customized memory partitioning is applied to improve kernel frequency and reduce routing effort [41].

The cascaded classifiers operate on a sliding window of the integral image. As a result, face detection can also benefit from the line buffer and window buffer optimization introduced in Section 4.4. However, constructing the whole integral image before applying the classifiers would require a significant amount of on-chip storage and incur a performance loss. Therefore, the authors of [25] modified the window buffer to construct the integral image efficiently. The operation of this buffer is depicted in Figure 11, where the modified image window buffer accumulates pixels on the diagonal to compute the pixel values of the integral image.

Figure 11: Specialized line buffer and window buffer for face detection [25] — Here we show a 3x3 example, but the actual implementation uses 25x25 windows. Solid arrows refer to normal register shifting, while dashed arrows refer to addition. The image window buffer accumulates the incoming pixels and constructs the integral image on the fly. The integral image window buffer accesses the image window buffer for new data.
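For reference, the underlying integral image is simply a 2D prefix sum: each entry holds the sum of all pixels above and to its left, so any rectangular sum needed by a classifier costs only four lookups (ii[D] - ii[B] - ii[C] + ii[A] for the four corners A, B, C, D). A straightforward software version is sketched below (our illustration; the hardware in [25] instead computes this incrementally inside the window buffer):

const int H = 240, W = 320; // input image size used by the benchmark

void integral_image(const unsigned char img[H][W], unsigned int ii[H][W]) {
  for (int r = 0; r < H; r++)
    for (int c = 0; c < W; c++)
      // 2D prefix-sum recurrence
      ii[r][c] = img[r][c]
               + (r > 0 ? ii[r-1][c] : 0)
               + (c > 0 ? ii[r][c-1] : 0)
               - (r > 0 && c > 0 ? ii[r-1][c-1] : 0);
}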
Table 2: Device capacity of the two FPGA platforms and the resource utilization of the platform logic (shell) on AWS F1 — The last row reports the average resource utilization of the shell, with the standard deviation in parentheses.

 | # LUTs | # FFs | # BRAMs | # DSPs
AWS F1 Total | 1181768 | 2363536 | 2160 | 6840
ZC706 Total | 218600 | 437200 | 545 | 900
AWS F1 Shell | 293209 (±3693) | 381853 (±5138) | 545 (±0) | 12 (±0)


5 EXPERIMENTAL RESULTS
We have synthesized the Rosetta benchmarks targeting an embedded FPGA as well as a cloud FPGA instance. We use the Xilinx ZC706 as the embedded platform, which contains a Kintex-7 FPGA with a target clock frequency of 140 MHz. For the cloud FPGA platform, we choose the AWS f1.2xlarge instance (F1), which is equipped with a Xilinx VU9P FPGA. The target clock frequency for our experiments on F1 is 250 MHz. These two platforms have different memory systems — on the ZC706, the FPGA shares the same DRAM with the embedded CPU, while on F1 the FPGA has its own on-board DRAM and communicates with the CPU through PCIe. In the rest of this section, we use the term global memory to refer to the DRAM on the FPGA side, and host memory for the DRAM on the CPU side. The BNN benchmark was originally designed for embedded FPGA platforms and requires nontrivial effort to be retargeted to AWS F1. We leave this for future work and only present BNN results on the ZC706 in this paper. For the other benchmarks, the HLS code for the two platforms shares the same optimization techniques, with some platform-dependent variations such as datatypes and interfaces. Xilinx SDSoC 2017.1 is used to generate bitstreams for the ZC706, and SDAccel 2017.1 is used for F1.

We run the F1 applications remotely through the FPGA developer AMI flow provided by AWS, whereas the experiments on the ZC706 are performed locally. Table 2 shows the available resource counts of the two platforms. On the F1 platform, the AWS platform logic (or shell) consumes a considerable amount of resources to provide peripheral connections for PCIe data transfer, DRAM access, and interrupts [2]. In the third row of Table 2, we report statistics of the resource usage of this shell across different applications. For the ZC706, Xilinx SDSoC also automatically generates shell logic for communication among accelerators, processors, and DRAM. However, the sizes of these shells vary greatly across designs and are typically small compared to that of the core logic. Hence we choose to simply report the total resource utilization for the ZC706 results.

Table 3: Rosetta results on the Xilinx ZC706 platform — The Runtime column shows overall execution time. Resource numbers show the total resource usage of the designs, including both kernel function and shell logic. Bitstreams are generated by Xilinx SDSoC 2017.1.

Benchmark | # LUTs | # FFs | # BRAMs | # DSPs | Runtime (ms) | Throughput
3D Rendering | 8893 | 12471 | 48 | 11 | 4.7 | 213 frames/s
Digit Recognition¹ | 41238 | 26468 | 338 | 1 | 10.6 | 189k digits/s
Spam Filtering² | 12678 | 22134 | 49 | 160 | 78.9 | 285k samples/s
Optical Flow | 42878 | 61078 | 54 | 454 | 24.3 | 41.2 frames/s
Binarized Neural Network³ | 46899 | 46760 | 102 | 4 | 4995.2 | 200 images/s
Face Detection | 62688 | 83804 | 121 | 79 | 33.0 | 30.3 frames/s

¹ K = 3, PAR_FACTOR = 40. ² Five epochs, PAR_FACTOR = 32, VDWIDTH = 512. ³ Eight convolvers, 1000 test images.

Table 4: Rosetta results on the AWS F1 platform — Kernel: execution time on the FPGA; Comm.: time of data transfer between host and global memory; Runtime: overall execution time. The performance-cost ratio is calculated based on the hourly rate (in US dollars) of the AWS f1.2xlarge instance [1]. Resource numbers are for the kernel functions only. Bitstreams are generated by Xilinx SDAccel 2017.1.

Benchmark | # LUTs | # FFs | # BRAMs | # DSPs | Kernel (ms) | Comm. (ms) | Runtime (ms) | Throughput | Performance-Cost Ratio
3D Rendering | 6763 | 7916 | 36 | 11 | 3.6 | 0.19 | 4.4 | 227 frames/s | 496k frames/$
Digit Recognition¹ | 39971 | 33853 | 207 | 0 | 9.9 | 0.55 | 11.1 | 180k digits/s | 393M digits/$
Spam Filtering² | 7207 | 17434 | 90 | 224 | 25.1 | 4.8 | 30.9 | 728k samples/s | 1.6G samples/$
Optical Flow | 38094 | 63438 | 55 | 484 | 2.6 | 4.8 | 8.4 | 119 frames/s | 260k frames/$
Face Detection | 48217 | 54206 | 92 | 72 | 20.2 | 0.47 | 21.5 | 46.5 frames/s | 101k frames/$

¹ K = 3, PAR_FACTOR = 40. ² Five epochs, PAR_FACTOR = 32, VDWIDTH = 512.

Tables 3 and 4 show our experimental results on the two platforms. All resource usage numbers are extracted from Vivado reports after place and route. Resource numbers in Table 3 show the total resource utilization of the designs on the ZC706, while Table 4 reports resource usage on F1 without the shell logic. The total runtime of the applications, including hardware kernel time, communication time, and the overhead of necessary software function calls, is measured on both platforms. On AWS F1, we further break down the kernel and communication time with the help of the SDAccel profiler. The Rosetta benchmarks generally perform better on AWS F1 because of its higher frequency and off-chip memory bandwidth, with the exception of digit recognition. For some applications, however, this performance gap is narrow due to the communication latency and the additional overhead incurred by the OpenCL runtime.

Since cost efficiency is an important aspect of platform selection and accelerator design, we further provide the performance-cost ratio as a metric for the F1 applications, based on the hourly rate of the f1.2xlarge instance (currently $1.65 per hour).
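For example, 3D rendering sustains 227 frames/s, i.e., 227 × 3600 = 817,200 frames per instance-hour; dividing by the $1.65 hourly rate gives roughly 495k frames per dollar, which matches the 496k frames/$ entry in Table 4 up to rounding of the throughput.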
In the remainder of this section we summarize the results for the four new benchmarks. As for BNN and face detection, interested readers can refer to [39] and [25], respectively, for more results and detailed performance analysis.

3D Rendering. For our test dataset, the total execution time of 3D rendering is 4.7 ms and 4.4 ms on the two platforms, respectively. Converted to throughput, our design achieves 213 frames per second on the ZC706 and 227 frames per second on F1. While the throughput calculated with our test input is much higher than the target, both kernel time and communication time increase with more triangles in the input. Communication latency is not significant on F1, but the software API calls in the OpenCL runtime incur a 0.6 ms overhead, which is not negligible for this specific application. These API calls initiate data transfers, enqueue the kernel function, and set the proper kernel arguments.


Table 5: 3D rendering without dataflow optimization on AWS F1.

# LUTs | # FFs | # BRAMs | # DSPs | Kernel (ms)
6323 | 7737 | 36 | 11 | 5.3

Table 5 shows the resource utilization and kernel time of a baseline design where dataflow optimization is not applied. Compared with the first row of Table 4, enabling dataflow optimization improves the kernel time by around 30% without significant resource overhead. This result demonstrates the efficacy of dataflow optimization in image processing pipelines.

Digit Recognition. In contrast to the other benchmarks, the performance of digit recognition is currently slightly worse on F1 than on the ZC706. The overall throughput is 189k digits per second on the ZC706 and 180k digits per second on F1. Although F1 has a shorter kernel time of 9.9 ms, the latency of communication and other overhead in the OpenCL runtime seem to have offset this advantage. According to our analysis, this is likely due to a missing feature in the specific version of the tool we are using, where async_group_copy is not pipelined to the full extent. Hence we expect to achieve higher performance on F1 in the near future once this issue is resolved.

As mentioned in Section 4.2, digit recognition has a complex design space. Table 6 shows the classification accuracy for different K values, and Figure 12 shows the kernel time and resource utilization of different design points. We only show kernel time in Figure 12a because host-global memory communication time is not affected by the kernel implementation. In Figure 12b, only the most critical resource, LUTs, is shown. As we can see from Table 6 and Figure 12, the two design parameters expose interesting trade-offs. Increasing the K value improves classification accuracy at the cost of a significant increase in kernel time, caused by the frequency drop and the worsened latency of the KNN voting kernel. Additionally, the benefit of increasing PAR_FACTOR diminishes when PAR_FACTOR is already large. When the Hamming distance kernel is highly parallelized, the KNN voting kernel, which is highly sequential, becomes the performance bottleneck. The performance can be further improved by optimizing the KNN voting kernel and by finding an optimal combination of the K value and PAR_FACTOR.

Table 6: Digit recognition accuracy vs. K value.

K | 2 | 3 | 4 | 5
Accuracy (%) | 92.9 | 93.9 | 94.3 | 94.3

Figure 12: Digit recognition design space on the AWS F1 platform — (a) Kernel time vs. K value; differences in kernel time are caused by variations in latency and kernel frequency. (b) LUT usage vs. K value.

Spam Filtering. The performance of spam filtering differs significantly between the two platforms. The kernel time on F1 is 3.1x shorter than on the ZC706, and the total execution time on F1 is 2.6x shorter, despite the additional 4.8 ms latency for host-global memory communication. In addition to the frequency improvement, this performance gap is mainly caused by the difference in off-chip memory bandwidth. Since we apply dataflow optimization to overlap communication and compute, the overall latency of the design is determined by the maximum of the compute and communication latencies. Because the compute kernels are highly parallel, the low communication bandwidth on the ZC706 results in a much longer latency for the dataflow pipeline.

Figure 13 shows the kernel time on AWS F1 for different combinations of PAR_FACTOR and VDWIDTH. Here PAR_FACTOR specifies the degree of parallelism in the vector kernels, and VDWIDTH controls the off-chip communication bandwidth. With the same off-chip bandwidth, increasing PAR_FACTOR beyond 64 does not result in much performance gain, since the communication latency already dominates the compute latency. When the off-chip bandwidth is reduced, communication latency further increases, and kernel time degrades for all PAR_FACTOR values we tested. The best achievable performance improves with higher off-chip memory bandwidth. These results confirm that spam filtering is a memory-bound application.

Figure 13: Spam filtering design space on the AWS F1 platform — Off-chip memory bandwidth is controlled by VDWIDTH. This parameter strictly limits the performance of the hardware kernel, showing that spam filtering is a memory-bound application.

Optical Flow. The total execution time of optical flow is 8.4 ms on F1 and 24.3 ms on the ZC706. Both implementations satisfy the throughput constraint. On the AWS F1 platform, host-global memory communication takes up approximately 60% of the total execution time due to the large input/output data size. If we only consider kernel time, it is 9.3x shorter on F1 than on the ZC706. Similar to spam filtering, this behavior is also caused by the difference in off-chip memory bandwidth. The optical flow accelerator reads from and writes to the off-chip memory at the same time due to the streaming dataflow optimization. The F1 platform has multiple off-chip DDR banks to handle concurrent read and write requests. On the ZC706, however, these concurrent requests cause contention on the off-chip memory, and the accelerator is often stalled due to the lack of input data.

6 CONCLUSIONS AND FUTURE WORK
We have presented Rosetta, an open-source, realistic benchmark suite for high-level synthesis targeting modern FPGA platforms. Rosetta is designed to be a collection of real applications which are optimized for performance and resource constraints. All Rosetta applications are ready to be executed on the supported embedded and cloud platforms. We believe that Rosetta can serve as a useful benchmark suite for HLS algorithms and tools, as well as a set of design tutorials for application developers interested in FPGA-based accelerated computing.

Rosetta will be continuously improved in the future. We will extend Rosetta to include more realistic applications from emerging domains. For the existing benchmarks, we plan to provide both C++ and OpenCL implementations of every benchmark to embrace the different programming models commonly supported by HLS tools. The benchmarks will also be further optimized to achieve higher performance and resource efficiency.

ACKNOWLEDGEMENTS
This research was supported in part by a DARPA Young Faculty Award, NSF Awards #1337240 and #1453378, and a research gift from Xilinx, Inc. We thank Dr. Sumit Roy from Xilinx for providing helpful feedback on the Rosetta designs. We also thank Ackerley Tng, Edgar Munoz, Wendian Jiang, Lin Wang, Yun Qing, Nithya Subramanian, Nikita Patil, Surabhi Singh, Judy Stephen, and Ian Thompson for their contributions to the baseline designs of digit recognition, 3D rendering, spam filtering, and optical flow.

REFERENCES
[1] Amazon Web Services. AWS FPGA Developer AMI. https://aws.amazon.com/marketplace/pp/B06VVYBLZZ, Dec 2017.
[2] Amazon Web Services. AWS Shell Interface Specification. https://github.com/aws/aws-fpga/blob/master/hdk/docs/AWS_Shell_Interface_Specification.md, Dec 2017.
[3] U. Aydonat, S. O'Connell, D. Capalija, A. C. Ling, and G. R. Chiu. An OpenCL Deep Learning Accelerator on Arria 10. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[4] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A Naturalistic Open Source Movie for Optical Flow Evaluation. European Conf. on Computer Vision (ECCV), Oct 2012.
[5] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A Benchmark Suite for Heterogeneous Computing. Int'l Symp. on Workload Characterization (IISWC), Oct 2009.
[6] P. Colangelo, R. Huang, E. Luebbers, M. Margala, and K. Nealis. Fine-Grained Acceleration of Binary Neural Networks Using Intel Xeon Processor with Integrated FPGA. Int'l Symp. on Field-Programmable Custom Computing Machines (FCCM), Apr/May 2017.
[7] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang. High-Level Synthesis for FPGAs: From Prototyping to Deployment. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 30(4):473–491, 2011.
[8] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv preprint arXiv:1602.02830, Mar 2016.
[9] S. Dai, R. Zhao, G. Liu, S. Srinath, U. Gupta, C. Batten, and Z. Zhang. Dynamic Hazard Resolution for Pipelining Irregular Loops in High-Level Synthesis. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[10] Q. Gautier, A. Althoff, P. Meng, and R. Kastner. Spector: An OpenCL FPGA Benchmark Suite. Int'l Conf. on Field Programmable Technology (FPT), Dec 2016.
[11] Y. Hara, H. Tomiyama, S. Honda, and H. Takada. Proposal and Quantitative Analysis of the CHStone Benchmark Program Suite for Practical C-Based High-Level Synthesis. Journal of Information Processing, 17:242–254, Oct 2008.
[12] A. Krizhevsky and G. Hinton. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, Apr 2009.
[13] Y. LeCun. The MNIST Database of Handwritten Digits. http://yann.lecun.com/exdb/mnist/, Dec 2017.
[14] Y. Liang, K. Rupnow, Y. Li, D. Min, M. N. Do, and D. Chen. High-Level Synthesis: Productivity, Performance, and Software Constraints. Journal of Electrical and Computer Engineering, 2012:1:1–1:1, Jan 2012.
[15] G. Liu, M. Tan, S. Dai, R. Zhao, and Z. Zhang. Architecture and Synthesis for Area-Efficient Pipelining of Irregular Loop Nests. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2017.
[16] X. Liu, Y. Chen, T. Nguyen, S. Gurumani, K. Rupnow, and D. Chen. High Level Synthesis of Complex Applications: An H.264 Video Decoder. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2016.
[17] D. G. Lowe. Object Recognition from Local Scale-Invariant Features. Int'l Conf. on Computer Vision (ICCV), Oct 1999.
[18] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo. Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[19] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[20] J. Pineda. A Parallel Algorithm for Polygon Rasterization. ACM SIGGRAPH Computer Graphics, 22(4):17–20, 1988.
[21] L.-N. Pouchet. PolyBench: The Polyhedral Benchmark Suite. http://www.cs.ucla.edu/pouchet/software/polybench, Dec 2017.
[22] L.-N. Pouchet, P. Zhang, P. Sadayappan, and J. Cong. Polyhedral-Based Data Reuse Optimization for Configurable Computing. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2013.
[23] B. Reagen, R. Adolf, Y. S. Shao, G.-Y. Wei, and D. Brooks. MachSuite: Benchmarks for Accelerator Design and Customized Architectures. Int'l Symp. on Workload Characterization (IISWC), Oct 2014.
[24] Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks. Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. Int'l Symp. on Computer Architecture (ISCA), Jun 2014.
[25] N. K. Srivastava, S. Dai, R. Manohar, and Z. Zhang. Accelerating Face Detection on Programmable SoC Using C-Based Synthesis. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[26] The Apache Software Foundation. Public Corpus. http://spamassassin.apache.org/old/publiccorpus/, Apr 2017.
[27] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers. FINN: A Framework for Fast, Scalable Binarized Neural Network Inference. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[28] P. Viola, M. J. Jones, and D. Snow. Detecting Pedestrians Using Patterns of Motion and Appearance. International Journal of Computer Vision, 63(2):153–161, Jul 2005.
[29] S. Wang, Y. Liang, and W. Zhang. FlexCL: An Analytical Performance Model for OpenCL Workloads on Flexible FPGAs. Design Automation Conf. (DAC), Jun 2017.
[30] Y. Wang, P. Li, and J. Cong. Theory and Algorithm for Generalized Memory Partitioning in High-Level Synthesis. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2014.
[31] Z. Wang, B. He, W. Zhang, and S. Jiang. A Performance Analysis Framework for Optimizing OpenCL Applications on FPGAs. Int'l Symp. on High Performance Computer Architecture (HPCA), Mar 2016.
[32] Z. Wei, L. Dah-Jye, and B. E. Nelson. FPGA-Based Real-Time Optical Flow Algorithm Design and Implementation. Journal of Multimedia, 2:38–45, Sep 2007.
[33] H. Yonekawa and H. Nakahara. On-Chip Memory Based Binarized Convolutional Deep Neural Network Applying Batch Normalization Free Technique on an FPGA. Int'l Parallel and Distributed Processing Symp. Workshops (IPDPSW), May 2017.
[34] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong. Optimizing FPGA-Based Accelerator Design for Deep Convolutional Neural Networks. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2015.
[35] C. Zhang and V. K. Prasanna. Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[36] J. Zhang and J. Li. Improving the Performance of OpenCL-Based FPGA Accelerator for Convolutional Neural Network. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[37] Z. Zhang and B. Liu. SDC-Based Modulo Scheduling for Pipeline Synthesis. Int'l Conf. on Computer-Aided Design (ICCAD), Nov 2013.
[38] J. Zhao, L. Feng, S. Sharad, W. Zhang, Y. Liang, and B. He. COMBA: A Comprehensive Model-Based Analysis Framework for High Level Synthesis of Real Applications. Int'l Conf. on Computer-Aided Design (ICCAD), Nov 2017.
[39] R. Zhao, W. Song, W. Zhang, T. Xing, J.-H. Lin, M. B. Srivastava, R. Gupta, and Z. Zhang. Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[40] G. Zhong, A. Prakash, Y. Liang, T. Mitra, and S. Niar. Lin-Analyzer: A High-Level Performance Analysis Tool for FPGA-Based Accelerators. Design Automation Conf. (DAC), Jun 2016.
[41] Y. Zhou, K. M. Al-Hawaj, and Z. Zhang. A New Approach to Automatic Memory Banking Using Trace-Based Address Mining. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[42] W. Zuo, P. Li, D. Chen, L.-N. Pouchet, S. Zhong, and J. Cong. Improving Polyhedral Code Generation for High-Level Synthesis. Int'l Conf. on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Sep/Oct 2013.
