Pen-Chung Yew · Per Stenström · Junjie Wu · Xiaoli Gong · Tao Li (Eds.)

Advanced Parallel Processing Technologies

13th International Symposium, APPT 2019
Tianjin, China, August 15–16, 2019
Proceedings

Lecture Notes in Computer Science 11719

Founding Editors
Gerhard Goos, Karlsruhe Institute of Technology, Karlsruhe, Germany
Juris Hartmanis, Cornell University, Ithaca, NY, USA
Editors
Pen-Chung Yew, University of Minnesota, Minneapolis, MN, USA
Per Stenström, Chalmers University of Technology, Gothenburg, Sweden
Junjie Wu, National University of Defense Technology, Changsha, China
Xiaoli Gong, Nankai University, Tianjin, China
Tao Li, Nankai University, Tianjin, China

This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.
Preface

Organization

General Chairs
Ke Gong Nankai University, China
Xiangke Liao National University of Defense Technology, China
Steering Committee
Zhenzhou Ji Harbin Institute of Technology, China
Dongsheng Wang Tsinghua University, China
Xingwei Wang Northeastern University, China
Chenggang Wu Institute of Computing Technology, Chinese Academy
of Sciences, China
Gongxuan Zhang Nanjing University of Science and Technology, China
Junjie Wu National University of Defense Technology, China
Organization Chairs
Tao Li Nankai University, China
Xiangfei Meng National Supercomputer Center in Tianjin, China
Dezun Dong National University of Defense Technology, China
Organization Committee
Hong An University of Science and Technology of China, China
Qiang Cao Huazhong University of Science and Technology,
China
Yunji Chen Institute of Computing Technology, Chinese Academy
of Sciences, China
Yun Liang Peking University, China
Kuanjiu Zhou Dalian University of Technology, China
Songwen Pei University of Shanghai for Science and Technology, China
Tian Song Beijing Institute of Technology, China
Program Chairs
Pen-Chung Yew University of Minnesota, USA
Per Stenström Chalmers University of Technology, Sweden
Program Committee
Manuel E. Acacio University of Murcia, Spain
Trevor E. Carlson National University of Singapore, Singapore
Paul Carpenter Barcelona Supercomputing Center, Spain
Yong Chen Texas Tech University, USA
Rudolf Eigenmann University of Delaware, USA
Zhenman Fang Simon Fraser University, Canada
Bok-Min Goi Universiti Tunku Abdul Rahman, Malaysia
Anup Holey Nvidia, USA
Guoliang Jin North Carolina State University, USA
Jangwoo Kim Seoul National University, South Korea
John Kim Korea Advanced Institute of Science and Technology,
South Korea
Zhiyuan Li Purdue University, USA
Chen Liu Clarkson University, USA
Lei Liu Institute of Computing Technology, Chinese Academy
of Sciences, China
Vassilis Papaefstathiou FORTH-ICS, Greece
Miquel Pericas Chalmers University of Technology, Sweden
Cristina Silvano Politecnico di Milano, Italy
Magnus Själander Norwegian University of Science and Technology,
Norway
Shuaiwen Song Pacific Northwest National Lab, USA
James Tuck North Carolina State University, USA
Nian-Feng Tzeng Center for Advanced Computer Studies,
University of Louisiana at Lafayette, USA
Hans Vandierendonck Queen’s University Belfast, UK
Bo Wu Colorado School of Mines, USA
Liao Xiaofei Huazhong University of Science and Technology,
China
Zhibin Yu Shenzhen Institute of Advanced Technology, China
Mohamed Zahran New York University, USA
Antonia Zhai University of Minnesota, USA
Jidong Zhai Tsinghua University, China
Weihua Zhang Fudan University, China
Huiyang Zhou North Carolina State University, USA
Publication Chairs
Junjie Wu National University of Defense Technology, China
Xiaoli Gong Nankai University, China
Workshop Chairs
Chao Li Shanghai Jiaotong University, China
Lifang Wen China Machine Press, Beijing Huazhang Graphics
& Information Co. Ltd., China
Local Chair
Ye Lu Nankai University, China
Poster Chair
Yong Xie Xiamen University of Technology, China
Contents
RV-CNN: Flexible and Efficient Instruction Set for CNNs Based on RISC-V

1 Introduction
The rest of this paper is organized as follows. Section 2 briefly introduces our motivation and a few design preferences. Section 3 describes the details of the new ISA. Section 4 illustrates the overall architecture. Section 5 presents the experimental setup and evaluation results. Section 6 concludes the paper.
FPGA, in which the network structure is often not reconfigured because reconfiguration is too time-consuming. We therefore aim to provide more flexibility for CNN techniques by abstracting the compute-intensive operations in CNNs into dedicated instructions, so that users can write assembly code with these instructions to build a particular CNN.
Efficiency. Typically, an accelerator serves as a peripheral to the host CPU. Hence, the host CPU is in charge of transferring data from main memory to the accelerator over a bus. This overhead is far from negligible because of the additional processing in the operating system and the massive amount of data, and the bus bandwidth further limits the performance of such accelerators. Therefore, instead of following this pattern, we deploy the acceleration unit inside the processor's pipeline and then optimize its memory accesses to satisfy its data bandwidth requirements, thereby improving efficiency.
operation with the overlapping region to generate the input data of the next layer. Nevertheless, in this process, the operations between different feature maps and their corresponding convolution kernels are independent of each other. To make full use of this data parallelism, we adopt a mapping technique (the im2col algorithm) to transform the 3-D convolution operation into an MM operation (see Fig. 2 for an illustration). Moreover, the computing unit can be reused by fully-connected layers, since they follow an analogous MM (row = 1) computing pattern. Note that instead of storing the entire input feature data first, the rearrangement is performed before the data are stored in the FPGA on-chip memory.
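To make the mapping concrete, the following is a minimal NumPy sketch of the im2col rearrangement and of convolution expressed as a single MM operation. It is an illustration only: the (C, H, W) data layout, the omission of padding, and the function names are our assumptions, not details taken from the paper's hardware.

```python
import numpy as np

def im2col(x, kh, kw, stride=1):
    """Rearrange (C, H, W) input patches into columns so that a 3-D
    convolution becomes one matrix-matrix multiplication."""
    c, h, w = x.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    cols = np.zeros((c * kh * kw, out_h * out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i*stride:i*stride+kh, j*stride:j*stride+kw]
            cols[:, i * out_w + j] = patch.reshape(-1)
    return cols

def conv_as_mm(x, weights, stride=1):
    """weights: (F, C, kh, kw) -> output: (F, out_h, out_w)."""
    f, c, kh, kw = weights.shape
    cols = im2col(x, kh, kw, stride)        # (C*kh*kw, out_h*out_w)
    w_mat = weights.reshape(f, -1)          # (F, C*kh*kw)
    out = w_mat @ cols                      # plain MM on the rearranged data
    out_h = (x.shape[1] - kh) // stride + 1
    out_w = (x.shape[2] - kw) // stride + 1
    return out.reshape(f, out_h, out_w)
```

A fully-connected layer is the same MM with the input flattened to a single column, which is why the MM unit can be reused for it.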
The Matrix Sigmoid (MSIG) and Matrix Softmax (MSFMX) instructions are essential to complete the entire computation. By default, we employ the MSIG instruction to apply the sigmoid activation to the input data and thereby define the outputs of neurons. Alternatively, users can choose the MRELU or MTANH instruction to apply the ReLU or tanh function by modifying the Inst[31:27] field (see Fig. 4). Correspondingly, to obtain the prediction results, the MSFMX instruction is used to normalize the output data.
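As a software analogue of this dispatch, the sketch below selects an activation according to the 5-bit Inst[31:27] field. The field itself comes from the paper, but the concrete opcode values and the function name are invented here for illustration.

```python
import numpy as np

# Hypothetical encodings of the Inst[31:27] field; the paper defines the
# field, but these particular values are assumptions.
MSIG, MRELU, MTANH, MSFMX = 0b00001, 0b00010, 0b00011, 0b00100

def apply_matrix_activation(inst, data):
    """Dispatch on bits 31:27 of a 32-bit instruction word and apply the
    selected element-wise activation (or row-wise softmax) to `data`."""
    opcode = (inst >> 27) & 0x1F
    if opcode == MSIG:
        return 1.0 / (1.0 + np.exp(-data))
    if opcode == MRELU:
        return np.maximum(data, 0.0)
    if opcode == MTANH:
        return np.tanh(data)
    if opcode == MSFMX:
        e = np.exp(data - data.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)
    raise ValueError(f"unknown matrix-activation opcode {opcode:#07b}")
```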
4 Implementation Details
Furthermore, through appending a data buffer between the matrix unit and the
scratchpad memory, we can effectively reduce the data delay. In a nutshell, the
scratchpad memory and cache are relatively independent, and the control mod-
ule will detect the data dependence and decide whether to stall the pipeline or
not. By default, data involved in the execution of matrix computational logical
instructions should already exist in the scratchpad memory, which requires strict
control from the program.
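As a toy software model of that dependence check, the sketch below tracks which scratchpad regions are resident and which are still being written; the region granularity, method names, and stall policy are our assumptions rather than the paper's actual control logic.

```python
class ScratchpadTracker:
    """Issue a matrix instruction only when its operands are resident in the
    scratchpad and no in-flight instruction is still writing them."""

    def __init__(self):
        self.resident = set()        # regions already loaded into the scratchpad
        self.pending_writes = set()  # regions being produced by in-flight ops

    def load_complete(self, region):
        self.resident.add(region)

    def can_issue(self, srcs, dst):
        missing = [r for r in srcs if r not in self.resident]
        hazard = [r for r in list(srcs) + [dst] if r in self.pending_writes]
        return not missing and not hazard   # otherwise the pipeline stalls

    def issue(self, srcs, dst):
        assert self.can_issue(srcs, dst), "pipeline must stall"
        self.pending_writes.add(dst)

    def retire(self, dst):
        self.pending_writes.discard(dst)
        self.resident.add(dst)
```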
4.3 Optimization
Since we reuse the MM unit to complete the computation-intensive convolutional
layers and memory-intensive fully-connected layers, the performance of the MM
unit exerts a significant impact on that of the whole matrix unit. Therefore,
we adopt an adder tree and data reuse to optimize the MM unit in terms of
computation and data access (see Fig. 7).
Fig. 7. The architecture of the matrix unit (left); optimization details for the MM unit (right).
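The adder-tree optimization can be illustrated in software as a log-depth reduction of the partial products of one dot product. This is only a behavioral sketch of the structure shown in Fig. 7, with invented function names.

```python
def adder_tree_sum(values):
    """Reduce partial sums in log2(n) levels, mirroring a hardware adder tree
    in which each level adds disjoint pairs in parallel."""
    level = list(values)
    if not level:
        return 0
    while len(level) > 1:
        if len(level) % 2:                      # pad odd-length levels
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

def dot_with_adder_tree(a_row, b_col):
    partials = [x * y for x, y in zip(a_row, b_col)]   # multipliers in parallel
    return adder_tree_sum(partials)                     # log-depth reduction
```

A typical data-reuse scheme (the exact one used by the MM unit is shown in Fig. 7) keeps one operand block in local registers across several such dot products, so each value is fetched from the scratchpad only once.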
– GPU. For comparison, we implement a GPU solution on a Tesla K40c and measure the processing time and running power by using the cuEventElapsedTime() function and the nvprof command (both provided by NVIDIA), respectively.
5.2 Results
In this subsection, we first report the resource utilization and power consump-
tion of the system in the FPGA board, and then compare our design on the
FPGA with CPU, GPU, and existing FPGA-based accelerators in three aspects
respectively.
Area and power. We obtain the utilization of resources and the power con-
sumption of FPGA by checking the implementation report in Vivado tools (LUT:
39.09%, 24780; FF: 26.49%, 33594; BRAM: 21.85%, 29.5; DSP: 50.42%, 121; and
Power: 0.331 W).
Flexibility. The dedicated ISA we propose is not only suitable for accelerating CNN applications but also supports other deep learning algorithms with similar computing patterns, such as DNNs. We implement three popular CNNs (LeNet-5, AlexNet, VGG) using the specific instructions and measure their average code size. Compared with the GPU, x86, and MIPS, RV-CNN achieves a 1.25x, 6.70x, and 9.51x reduction in code length, respectively.
Energy Efficiency. We compare the energy consumption of our system with that of the CPU and GPU during the CNN inference process. As shown in Fig. 8, the power consumption of the CPU and GPU is 91.03x and 228.66x that of our design, and their energy consumption is 36.09x and 11.42x that of our design, respectively. The experimental results indicate that our design is significantly better than the CPU and GPU in terms of energy consumption.
Fig. 8. Power ratios (left) and Energy ratios (right) vs CPU and GPU (Based on
LeNet-5).
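Since energy is power integrated over run time, the reported power and energy ratios also imply how the run times compare. The back-of-the-envelope check below uses only the numbers above; the derived time ratios are our own arithmetic, not figures reported by the authors.

```python
# Ratios relative to the FPGA design, as reported for LeNet-5 inference.
power_ratio = {"CPU": 91.03, "GPU": 228.66}
energy_ratio = {"CPU": 36.09, "GPU": 11.42}

# energy = power * time  =>  time_ratio = energy_ratio / power_ratio
for dev in ("CPU", "GPU"):
    t = energy_ratio[dev] / power_ratio[dev]
    print(f"{dev}: ~{t:.2f}x the FPGA's run time, {energy_ratio[dev]}x its energy")
# CPU: ~0.40x, GPU: ~0.05x -- both finish sooner, yet at a much higher energy cost.
```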
6 Conclusion
In this work, we present an easy-to-implement CNN-specific instruction set, called RV-CNN, to provide more flexibility for CNN structures. By studying the computational patterns of popular CNN techniques, we design nine coarse-grained matrix instructions in RV-CNN and extend the base RISC-V ISA with them. We then embed the corresponding acceleration unit in a classic five-stage pipeline architecture. Implemented on a Xilinx Artix-7 100T and compared with an Intel Core i7 processor and a Tesla K40c GPU, our design achieves 36.09x and 11.42x better energy efficiency and 6.70x and 1.25x higher code density, respectively. Moreover, compared with existing accelerators, it also achieves promising energy efficiency.
References
1. Banakar, R., Steinke, S., Lee, B.S., Balakrishnan, M., Marwedel, P.: Scratchpad memory: a design alternative for cache on-chip memory in embedded systems. In: International Symposium on Hardware/Software Codesign (2002)
Compiling Optimization for Neural Network Accelerators

Abstract. Nowadays, artificial neural networks are among the most common computational models in intelligent methods. To cope with the ever-growing scale of neural networks and the constraints on system energy consumption, a number of neural network (NN) accelerators have emerged. However, owing to their dedicated architectures, programming NN accelerators differs from programming general-purpose processors. To improve performance, it is necessary to use the global structural information of the NN model to optimize compilation. In this paper, we introduce a series of layer-based compilation optimizations for NN accelerators. From top to bottom, we define a type of computational graph that carries the necessary information, such as the relationships between layer nodes and data nodes. Then, according to the pattern of an NN layer's computation process, we apply intra-layer loop unrolling and pipelining at both fine-grained and coarse-grained levels. Similarly, we apply a layer-fusion optimization based on our computational graph and abstract pipelining stages. After expanding the pipelining stages of layers, we can remove some redundant IO operations, which we call layer-elimination optimization. The experimental results show that, with the proposed optimizations, the inference process achieves up to a 1.34x speedup over not using the fusion optimization.
1 Introduction
At present, intelligent applications such as image recognition [1–3], target detection [4,
5] and natural language processing [6, 7] have become one of the hottest spots both in
commercial and research areas. Artificial Intelligence (AI) are not only used in a lot of
smart applications but even in the complex strategy games like Go [8, 9], Dota2 [10].
To some extent, AI has started to beat human in the man-machine matches.
However, because deep learning algorithms are highly intensive in computation and memory access, traditional processors can no longer meet the demands of intelligent applications. For instance, in the field of intelligent driving, the forward-inference operations have strict sequential requirements. In this context, many machine learning accelerators have emerged [11–15, 17, 18]. Theory and practice have shown that, by using a dedicated intelligent processor to accelerate machine learning algorithms, the energy efficiency ratio can be improved by dozens or even hundreds of times compared with general-purpose processors. In practical applications, most NN accelerators adopt proven methods (e.g., low-precision representation, quantized calculation, sparse weights, the ReLU (Rectified Linear Unit) layer, and so on) to gain performance. An instruction set architecture (ISA) for NNs [19] has also been proposed for flexibility and effectiveness. However, these different optimizations add complexity to NN compilation.
When programming an NN accelerator, manually optimizing the whole network structure is impractical because both the scale and the number of parameters in NNs are massive and the complexity grows exponentially [20, 21, 23]. Moreover, the computation methods in NN models keep changing as algorithms evolve. For example, a dilated convolution layer adds a new parameter, "dilation", to expand the receptive field [24], which changes the kernel data-fetching order. AlexNet [1] has a group convolution structure; MobileNet [25] has a depthwise separable convolution (depth-wise conv) structure; in ShuffleNet [3], the convolution layer follows a shuffle layer; and so on. Owing to the complexity of NN algorithms and the deepening of models, programming and optimization for NN accelerators must be performed automatically by software rather than by a human programmer.
There are several popular deep learning frameworks (such as TensorFlow [27], Caffe [28], MXNet [29], etc.) with which AI developers build their models and then train, fine-tune, and share them with other researchers. However, the description of a model is usually expressed with operators or layers as the basic units. Thus, we use a layer-based computation graph as a high-level intermediate representation (IR) for compilation optimization, and the subsequent low-level optimizations are also based on layers. We propose a novel programming style called stage-level parallelism, which takes advantage of the parallel execution of NN accelerator instructions on different types of on-chip resources. During inference, some parameters and the scale of NN layers are constant, so we can apply intra-layer pipelining and loop unrolling to reduce redundant instructions. Furthermore, between layers, based on stage-level-parallel instruction blocks, we can apply layer-fusion and layer-elimination optimizations to obtain further performance gains.
2 Compiling Optimization
Fig. 1. Node information of the computation graph. Rectangles represent data nodes and ovals represent layer nodes; lines with arrows show the relationship of each layer-data node pair. (The figure lists per-node attributes such as Id, Name, Data type, Layer type, Mode, the constant flag, Consumer, and Producer; for example, a normal-mode convolution layer L1 and a sparse-mode convolution layer L2 both produce data consumed by L3.)
For example, a convolution layer L1 receives the user's input image as the input data for its computation, and its output data is the input data of L3 (as shown in Fig. 1). Besides the indices of their data nodes, layer nodes carry further attributes such as an identification number, the layer type, and the computation mode. Data nodes contain tags such as the off-chip memory allocation type, whether the data is constant, the data type, and so on.
An even more important piece of information is the relationship of each layer-data node pair. We choose a producer-consumer model to describe the dependencies between nodes. A layer node can have zero or more input data nodes, and each of these data nodes lists this layer in its consumer list. A layer node can have one or more output data nodes, and each of these data nodes records this layer as its producer. In other words, a data node's consumers form a list, whereas it has exactly one producer, either direct user input or another layer. These pieces of information are used in the data-layout and back-end instruction-generation stages.
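A minimal Python rendering of this graph IR is sketched below. The class and field names follow the attributes listed in Fig. 1, but the concrete API is our own illustration, not the compiler's actual data structures.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DataNode:
    node_id: int
    dtype: str = "Float32"
    constant: bool = False
    producer: Optional["LayerNode"] = None          # exactly one producer
    consumers: List["LayerNode"] = field(default_factory=list)

@dataclass
class LayerNode:
    node_id: int
    name: str                                       # e.g. "L1"
    layer_type: str                                 # e.g. "Convolution"
    mode: str = "normal"                            # e.g. "normal" or "sparse"
    inputs: List[DataNode] = field(default_factory=list)
    outputs: List[DataNode] = field(default_factory=list)

def connect(layer: LayerNode, ins: List[DataNode], outs: List[DataNode]) -> None:
    """Wire up producer/consumer lists as the producer-consumer model describes."""
    for d in ins:
        layer.inputs.append(d)
        d.consumers.append(layer)
    for d in outs:
        layer.outputs.append(d)
        d.producer = layer
```

For instance, L1's output data node has L1 as its single producer and L3 in its consumer list, while a constant filter loaded from the model is marked constant=True and has no layer producer.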
The computational graph is built in the kernel or operator functions of high-level NN frameworks such as TensorFlow or MXNet. After several intermediate optimization steps, the graph back end calls the NN accelerator compiler library APIs to generate binary NN instructions; that part is not discussed in this work.
and the parameters of NN layers are saved in the model. When compiling a neural network for accelerators, some of the values used during optimization or placed in instruction fields can be replaced by constants or immediate numbers. For example, a convolutional neural network (CNN) applied to image recognition usually starts from an input image. The image size can be fixed or dynamic; for a dynamic image size, a preprocessing transformation (resize, crop, etc.) first adjusts the image to the fixed scale defined by the network, and the forward inference is then performed.
In this situation, the branch conditions and loop counts of the NN instructions are known ahead of time, which reduces jumps and useless code to some extent. Therefore, we can apply loop unrolling and pipelining, setting loop bounds with immediate numbers instead of computing them through scalar register calculations, as sketched below.
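The sketch below contrasts the two forms of generated code. The mnemonics (li, addi, blt, compute_window) are placeholders invented for illustration, not instructions from the accelerator's real ISA.

```python
def emit_looped(trip_count_reg: str) -> list:
    """General case: the loop bound lives in a scalar register, so the
    generated code needs loop bookkeeping and a conditional branch."""
    return [
        "li   r1, 0",
        "loop:",
        "  compute_window r1          # one sliding-window step",
        "  addi r1, r1, 1",
        f"  blt  r1, {trip_count_reg}, loop",
    ]

def emit_unrolled(ho: int, wo: int) -> list:
    """Inference case: ho and wo are compile-time constants, so every window
    index becomes an immediate and the branch disappears entirely."""
    return [f"compute_window {i}          # immediate index, no loop bookkeeping"
            for i in range(ho * wo)]
```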
Fig. 2. A fine-grained pipelining example of a pooling layer. The loop over the ho * wo sliding-window positions (with strides on the channel, height, and width dimensions) is unrolled with static immediate numbers, and the hardware units overlap each window's initialize, compute, and write-back steps to produce the final result.
In more detail, NN instructions of the same type are guaranteed to execute in order, while instructions of different types can execute at the same time. Thus, we can use a ping-pong strategy for on-chip resource usage to overlap the memory-access stage with the computation stage: the data-fetching stage can run in parallel with the calculation and write-back stages, as shown in Fig. 2. In addition, data tiling can be implemented with a naive static memory allocation, dividing the tensors into several tiled data slices within a layer's computation.
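As a sketch of that static tiling (the buffer capacity and the row-wise slicing are assumptions on our part), the tensor is cut into fixed-size slices sized to one half of a ping-pong buffer, so loading slice i+1 can overlap computing on slice i:

```python
import numpy as np

ON_CHIP_ROWS = 64   # assumed capacity of one ping-pong buffer half, in rows

def tile_slices(tensor):
    """Statically cut a 2-D tensor into row slices that each fit on chip."""
    return [tensor[i:i + ON_CHIP_ROWS]
            for i in range(0, tensor.shape[0], ON_CHIP_ROWS)]

slices = tile_slices(np.zeros((200, 512), dtype=np.float32))
# -> slices of 64, 64, 64, and 8 rows, each loaded while the previous one is computed
```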
Fig. 3. Coarse-grained pipelining across data slices: while the kernel-compute loop works on input slice D1 and writes back result R1, the next input slice D2 is loaded, so the load, compute, and store stages overlap.
If we balance the execution time across the whole pipeline, meaning that the computation functional units and the IO units of the specific hardware back end work at comparable time costs, we can hide the latency to some extent by using a form of coarse-grained pipelining (see Fig. 3). The period between synchronization instructions is called one time slice. During one time slice, the store part of the previous pipeline stage executes first so that it releases the corresponding on-chip memory space, which the next load stage then allocates and uses for the next input data slice, while the other half of the resources executes instructions of a different type. The IO and computing functional units therefore work in parallel during each time slice, and so on. Data dependencies between layers are resolved through the computation graph.
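The schedule below is a small illustration of how the stages overlap across time slices; the three-stage depth and the dictionary-based representation are our own simplification of the mechanism described above.

```python
def time_slice_schedule(num_slices):
    """For each time slice, list which data slice the store, compute, and load
    stages touch: store(i-1) frees its buffer half first, load(i+1) refills it,
    and compute(i) runs on the other half in parallel."""
    schedule = []
    for t in range(num_slices + 2):               # pipeline fill + drain
        stages = {}
        if 0 <= t - 2 < num_slices:
            stages["store"] = t - 2
        if 0 <= t - 1 < num_slices:
            stages["compute"] = t - 1
        if t < num_slices:
            stages["load"] = t
        schedule.append(stages)
    return schedule

# time_slice_schedule(3) ->
# [{'load': 0}, {'compute': 0, 'load': 1}, {'store': 0, 'compute': 1, 'load': 2},
#  {'store': 1, 'compute': 2}, {'store': 2}]
```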