Pen-Chung Yew · Per Stenström · Junjie Wu · Xiaoli Gong · Tao Li (Eds.)

Advanced Parallel Processing Technologies

13th International Symposium, APPT 2019
Tianjin, China, August 15–16, 2019
Proceedings

Lecture Notes in Computer Science 11719

Founding Editors
Gerhard Goos, Karlsruhe Institute of Technology, Karlsruhe, Germany
Juris Hartmanis, Cornell University, Ithaca, NY, USA
Editors
Pen-Chung Yew, University of Minnesota, Minneapolis, MN, USA
Per Stenström, Chalmers University of Technology, Gothenburg, Sweden
Junjie Wu, National University of Defense Technology, Changsha, China
Xiaoli Gong, Nankai University, Tianjin, China
Tao Li, Nankai University, Tianjin, China

This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.
Preface

Organization

General Chairs
Ke Gong Nankai University, China
Xiangke Liao National University of Defense Technology, China
Steering Committee
Zhenzhou Ji Harbin Institute of Technology, China
Dongsheng Wang Tsinghua University, China
Xingwei Wang Northeastern University, China
Chenggang Wu Institute of Computing Technology, Chinese Academy
of Sciences, China
Gongxuan Zhang Nanjing University of Science and Technology, China
Junjie Wu National University of Defense Technology, China
Organization Chairs
Tao Li Nankai University, China
Xiangfei Meng National Supercomputer Center in Tianjin, China
Dezun Dong National University of Defense Technology, China
Organization Committee
Hong An University of Science and Technology of China, China
Qiang Cao Huazhong University of Science and Technology,
China
Yunji Chen Institute of Computing Technology, Chinese Academy
of Sciences, China
Yun Liang Peking University, China
Kuanjiu Zhou Dalian University of Technology, China
Songwen Pei University of Shanghai for Science and Technology, China
Tian Song Beijing Institute of Technology, China
Program Chairs
Pen-Chung Yew University of Minnesota, USA
Per Stenström Chalmers University of Technology, Sweden
Program Committee
Manuel E. Acacio University of Murcia, Spain
Trevor E. Carlson National University of Singapore, Singapore
Paul Carpenter Barcelona Supercomputing Center, Spain
Yong Chen Texas Tech University, USA
Rudolf Eigenmann University of Delaware, USA
Zhenman Fang Simon Fraser University, Canada
Bok-Min Goi Universiti Tunku Abdul Rahman, Malaysia
Anup Holey Nvidia, USA
Guoliang Jin North Carolina State University, USA
Jangwoo Kim Seoul National University, South Korea
John Kim Korea Advanced Institute of Science and Technology,
South Korea
Zhiyuan Li Purdue University, USA
Chen Liu Clarkson University, USA
Lei Liu Institute of Computing Technology, Chinese Academy
of Sciences, China
Vassilis Papaefstathiou FORTH-ICS, Greece
Miquel Pericas Chalmers University of Technology, Sweden
Cristina Silvano Politecnico di Milano, Italy
Magnus Själander Norwegian University of Science and Technology,
Norway
Shuaiwen Song Pacific Northwest National Lab, USA
James Tuck North Carolina State University, USA
Nian-Feng Tzeng Center for Advanced Computer Studies,
University of Louisiana at Lafayette, USA
Hans Vandierendonck Queen’s University Belfast, UK
Bo Wu Colorado School of Mines, USA
Liao Xiaofei Huazhong University of Science and Technology,
China
Zhibin Yu Shenzhen Institute of Advanced Technology, China
Mohamed Zahran New York University, USA
Antonia Zhai University of Minnesota, USA
Jidong Zhai Tsinghua University, China
Weihua Zhang Fudan University, China
Huiyang Zhou North Carolina State University, USA
Publication Chairs
Junjie Wu National University of Defense Technology, China
Xiaoli Gong Nankai University, China
Workshop Chairs
Chao Li Shanghai Jiaotong University, China
Lifang Wen China Machine Press, Beijing Huazhang Graphics
& Information Co. Ltd., China
Local Chair
Ye Lu Nankai University, China
Poster Chair
Yong Xie Xiamen University of Technology, China
Contents
RV-CNN: Flexible and Efficient Instruction Set for CNNs Based on RISC-V

1 Introduction
The rest of this paper is organized as follows. Section 2 briefly introduces our motivation and a few design preferences. Section 3 describes the details of the new ISA. Section 4 illustrates the overall architecture. Section 5 presents the experimental setup and evaluation results. Section 6 concludes the paper.
FPGA, in which the network structure is often not reconfigured because reconfiguration is too time-consuming. We therefore aim to provide more flexibility for CNN techniques by abstracting the compute-intensive operations in CNNs into dedicated instructions, so that users can write assembly code with these instructions to build a particular CNN.
Efficiency. Typically, an accelerator serves as a peripheral to the host CPU. Hence, the host CPU is in charge of transferring data from main memory to the accelerator over a bus. This overhead is far from negligible because of the additional processing in the operating system and the massive amount of data, and the bus bandwidth further limits the performance of such accelerators. Therefore, instead of following this pattern, we deploy the acceleration unit inside the processor's pipeline and then optimize its memory accesses to satisfy its data bandwidth requirements, thereby improving efficiency.
operation with the overlapping region to generate the input data of the next layer. Nevertheless, in this process, the operations between different feature maps and their corresponding convolution kernels are independent of each other. To make full use of this data parallelism, we adopt a mapping technique (the im2col algorithm) to transform the 3-D convolution operation into an MM operation (see Fig. 2 for an illustration). Moreover, the computing unit can be reused by fully-connected layers, since they follow an analogous MM (row = 1) computing pattern. Note that instead of storing the entire input feature data first, the rearrangement is performed before the data are stored in the FPGA on-chip memory.
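To make the mapping concrete, the following is a minimal NumPy sketch of the im2col rearrangement and of convolution expressed as a single MM operation. It is an illustration only: the (C, H, W) data layout, the omission of padding, and the function names are our assumptions, not details taken from the paper's hardware.

```python
import numpy as np

def im2col(x, kh, kw, stride=1):
    """Rearrange (C, H, W) input patches into columns so that a 3-D
    convolution becomes one matrix-matrix multiplication."""
    c, h, w = x.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    cols = np.zeros((c * kh * kw, out_h * out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i*stride:i*stride+kh, j*stride:j*stride+kw]
            cols[:, i * out_w + j] = patch.reshape(-1)
    return cols

def conv_as_mm(x, weights, stride=1):
    """weights: (F, C, kh, kw) -> output: (F, out_h, out_w)."""
    f, c, kh, kw = weights.shape
    cols = im2col(x, kh, kw, stride)        # (C*kh*kw, out_h*out_w)
    w_mat = weights.reshape(f, -1)          # (F, C*kh*kw)
    out = w_mat @ cols                      # plain MM on the rearranged data
    out_h = (x.shape[1] - kh) // stride + 1
    out_w = (x.shape[2] - kw) // stride + 1
    return out.reshape(f, out_h, out_w)
```

A fully-connected layer is the same MM with the input flattened to a single column, which is why the MM unit can be reused for it.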
The Matrix Sigmoid (MSIG) and Matrix Softmax (MSFMX) instructions are essential to complete the entire computation. By default, we employ the MSIG instruction to apply the sigmoid activation to the input data and thereby define the outputs of neurons. Alternatively, users can choose the MRELU or MTANH instruction to apply the ReLU or tanh function by modifying the Inst[31:27] field (see Fig. 4). Correspondingly, to obtain the prediction results, the MSFMX instruction is used to normalize the output data.
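As a software analogue of this dispatch, the sketch below selects an activation according to the 5-bit Inst[31:27] field. The field itself comes from the paper, but the concrete opcode values and the function name are invented here for illustration.

```python
import numpy as np

# Hypothetical encodings of the Inst[31:27] field; the paper defines the
# field, but these particular values are assumptions.
MSIG, MRELU, MTANH, MSFMX = 0b00001, 0b00010, 0b00011, 0b00100

def apply_matrix_activation(inst, data):
    """Dispatch on bits 31:27 of a 32-bit instruction word and apply the
    selected element-wise activation (or row-wise softmax) to `data`."""
    opcode = (inst >> 27) & 0x1F
    if opcode == MSIG:
        return 1.0 / (1.0 + np.exp(-data))
    if opcode == MRELU:
        return np.maximum(data, 0.0)
    if opcode == MTANH:
        return np.tanh(data)
    if opcode == MSFMX:
        e = np.exp(data - data.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)
    raise ValueError(f"unknown matrix-activation opcode {opcode:#07b}")
```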
4 Implementation Details
Furthermore, through appending a data buffer between the matrix unit and the
scratchpad memory, we can effectively reduce the data delay. In a nutshell, the
scratchpad memory and cache are relatively independent, and the control mod-
ule will detect the data dependence and decide whether to stall the pipeline or
not. By default, data involved in the execution of matrix computational logical
instructions should already exist in the scratchpad memory, which requires strict
control from the program.
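As a toy software model of that dependence check, the sketch below tracks which scratchpad regions are resident and which are still being written; the region granularity, method names, and stall policy are our assumptions rather than the paper's actual control logic.

```python
class ScratchpadTracker:
    """Issue a matrix instruction only when its operands are resident in the
    scratchpad and no in-flight instruction is still writing them."""

    def __init__(self):
        self.resident = set()        # regions already loaded into the scratchpad
        self.pending_writes = set()  # regions being produced by in-flight ops

    def load_complete(self, region):
        self.resident.add(region)

    def can_issue(self, srcs, dst):
        missing = [r for r in srcs if r not in self.resident]
        hazard = [r for r in list(srcs) + [dst] if r in self.pending_writes]
        return not missing and not hazard   # otherwise the pipeline stalls

    def issue(self, srcs, dst):
        assert self.can_issue(srcs, dst), "pipeline must stall"
        self.pending_writes.add(dst)

    def retire(self, dst):
        self.pending_writes.discard(dst)
        self.resident.add(dst)
```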
4.3 Optimization
Since we reuse the MM unit to complete the computation-intensive convolutional
layers and memory-intensive fully-connected layers, the performance of the MM
unit exerts a significant impact on that of the whole matrix unit. Therefore,
we adopt an adder tree and data reuse to optimize the MM unit in terms of
computation and data access (see Fig. 7).
Fig. 7. The architecture of the matrix unit (left); optimization details for the MM unit (right).
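The adder-tree optimization can be illustrated in software as a log-depth reduction of the partial products of one dot product. This is only a behavioral sketch of the structure shown in Fig. 7, with invented function names.

```python
def adder_tree_sum(values):
    """Reduce partial sums in log2(n) levels, mirroring a hardware adder tree
    in which each level adds disjoint pairs in parallel."""
    level = list(values)
    if not level:
        return 0
    while len(level) > 1:
        if len(level) % 2:                      # pad odd-length levels
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

def dot_with_adder_tree(a_row, b_col):
    partials = [x * y for x, y in zip(a_row, b_col)]   # multipliers in parallel
    return adder_tree_sum(partials)                     # log-depth reduction
```

A typical data-reuse scheme (the exact one used by the MM unit is shown in Fig. 7) keeps one operand block in local registers across several such dot products, so each value is fetched from the scratchpad only once.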
– GPU. For comparison, we implement a GPU solution on a Tesla K40c and measure the processing time and running power by using the cuEventElapsedTime() function and the nvprof command (both provided by NVIDIA), respectively.
5.2 Results
In this subsection, we first report the resource utilization and power consump-
tion of the system in the FPGA board, and then compare our design on the
FPGA with CPU, GPU, and existing FPGA-based accelerators in three aspects
respectively.
Area and power. We obtain the utilization of resources and the power con-
sumption of FPGA by checking the implementation report in Vivado tools (LUT:
39.09%, 24780; FF: 26.49%, 33594; BRAM: 21.85%, 29.5; DSP: 50.42%, 121; and
Power: 0.331 W).
Flexibility. The dedicated ISA we propose is not only suitable for accelerating CNN applications but also supports other deep learning algorithms with similar computing patterns, such as DNNs. We implement three popular CNNs (LeNet-5, AlexNet, VGG) using the specific instructions and measure their average code size. Compared with the GPU, x86, and MIPS, RV-CNN achieves a 1.25x, 6.70x, and 9.51x reduction in code length, respectively.
Energy Efficiency. We compare the energy consumption of our system with that of the CPU and GPU during the CNN inference process. As shown in Fig. 8, the power consumption of the CPU and GPU is 91.03x and 228.66x that of our design, and their energy consumption is 36.09x and 11.42x that of our design, respectively. The experimental results indicate that our design is significantly better than the CPU and GPU in terms of energy consumption.
Fig. 8. Power ratios (left) and Energy ratios (right) vs CPU and GPU (Based on
LeNet-5).
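Since energy is power integrated over run time, the reported power and energy ratios also imply how the run times compare. The back-of-the-envelope check below uses only the numbers above; the derived time ratios are our own arithmetic, not figures reported by the authors.

```python
# Ratios relative to the FPGA design, as reported for LeNet-5 inference.
power_ratio = {"CPU": 91.03, "GPU": 228.66}
energy_ratio = {"CPU": 36.09, "GPU": 11.42}

# energy = power * time  =>  time_ratio = energy_ratio / power_ratio
for dev in ("CPU", "GPU"):
    t = energy_ratio[dev] / power_ratio[dev]
    print(f"{dev}: ~{t:.2f}x the FPGA's run time, {energy_ratio[dev]}x its energy")
# CPU: ~0.40x, GPU: ~0.05x -- both finish sooner, yet at a much higher energy cost.
```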
6 Conclusion
In this work, we present an easy-to-implement CNN-specific instruction set, called RV-CNN, to provide more flexibility for CNN structures. By studying the computational patterns of popular CNN techniques, we design nine coarse-grained matrix instructions in RV-CNN and extend the base RISC-V ISA with them. We then embed the corresponding acceleration unit in a classic five-stage pipeline architecture. Implemented on a Xilinx Artix-7 100T and compared with an Intel Core i7 processor and a Tesla K40c GPU, our design achieves 36.09x and 11.42x better energy efficiency and 6.70x and 1.25x higher code density, respectively. Moreover, compared with existing accelerators, it also achieves promising energy efficiency.
References
1. Banakar, R., Steinke, S., Lee, B.S., Balakrishnan, M., Marwedel, P.: Scratchpad memory: a design alternative for cache on-chip memory in embedded systems. In: International Symposium on Hardware/Software Codesign (2002)
Compiling Optimization for Neural Network Accelerators

Abstract. Nowadays, artificial neural networks are among the most common computational models in intelligent methods. To cope with the ever-growing scale of neural networks and the constraints on system energy consumption, a number of neural network (NN) accelerators have emerged. However, owing to their dedicated architectures, programming NN accelerators differs from programming general-purpose processors. To improve performance, it is necessary to use the global structural information of the NN model to optimize compilation. In this paper, we introduce a series of layer-based compilation optimizations for NN accelerators. From top to bottom, we define a type of computational graph that carries the necessary information, such as the relationships between layer nodes and data nodes. Then, according to the pattern of an NN layer's computation process, we apply intra-layer loop unrolling and pipelining at both fine-grained and coarse-grained levels. Similarly, we apply a layer-fusion optimization based on our computational graph and abstract pipelining stages. After expanding the pipelining stages of layers, we can remove some redundant IO operations, which we call layer-elimination optimization. The experimental results show that, with the proposed optimizations, the inference process achieves up to a 1.34x speedup over not using the fusion optimization.
1 Introduction
At present, intelligent applications such as image recognition [1–3], target detection [4,
5] and natural language processing [6, 7] have become one of the hottest spots both in
commercial and research areas. Artificial Intelligence (AI) are not only used in a lot of
smart applications but even in the complex strategy games like Go [8, 9], Dota2 [10].
To some extent, AI has started to beat human in the man-machine matches.
However, because deep learning algorithms are highly intensive in computation and memory access, traditional processors can no longer meet the demands of intelligent applications. For instance, in the field of intelligent driving, the forward-inference operations have strict sequential requirements. In this context, many machine learning accelerators have emerged [11–15, 17, 18]. Theory and practice have shown that, by using a dedicated intelligent processor to accelerate machine learning algorithms, the energy efficiency ratio can be improved by dozens or even hundreds of times compared with general-purpose processors. In practical applications, most NN accelerators adopt proven methods (e.g., low-precision representation, quantized calculation, sparse weights, the ReLU (Rectified Linear Unit) layer, and so on) to gain performance. An instruction set architecture (ISA) for NNs [19] has also been proposed for flexibility and effectiveness. However, these different optimizations add complexity to NN compilation.
When programming an NN accelerator, manually optimizing the whole network structure is impractical because both the scale and the number of parameters in NNs are massive and the complexity grows exponentially [20, 21, 23]. Moreover, the computation methods in NN models keep changing as algorithms evolve. For example, a dilated convolution layer adds a new parameter, "dilation", to expand the receptive field [24], which changes the kernel data-fetching order. AlexNet [1] has a group convolution structure; MobileNet [25] has a depthwise separable convolution (depth-wise conv) structure; in ShuffleNet [3], the convolution layer follows a shuffle layer; and so on. Owing to the complexity of NN algorithms and the deepening of models, programming and optimization for NN accelerators must be performed automatically by software rather than by a human programmer.
There are several popular deep learning frameworks (such as TensorFlow [27], Caffe [28], MXNet [29], etc.) with which AI developers build their models and then train, fine-tune, and share them with other researchers. However, the description of a model is usually expressed with operators or layers as the basic units. Thus, we use a layer-based computation graph as a high-level intermediate representation (IR) for compilation optimization, and the subsequent low-level optimizations are also based on layers. We propose a novel programming style called stage-level parallelism, which takes advantage of the parallel execution of NN accelerator instructions on different types of on-chip resources. During inference, some parameters and the scale of NN layers are constant, so we can apply intra-layer pipelining and loop unrolling to reduce redundant instructions. Furthermore, between layers, based on stage-level-parallel instruction blocks, we can apply layer-fusion and layer-elimination optimizations to obtain further performance gains.
2 Compiling Optimization
Fig. 1. Node information of the computation graph. Rectangles represent data nodes and ovals represent layer nodes; lines with arrows show the relationship of each layer-data node pair. (The figure lists per-node attributes such as Id, Name, Data type, Layer type, Mode, the constant flag, Consumer, and Producer; for example, a normal-mode convolution layer L1 and a sparse-mode convolution layer L2 both produce data consumed by L3.)
For example, a convolution layer L1 receives the user's input image as the input data for its computation, and its output data is the input data of L3 (as shown in Fig. 1). Besides the indices of their data nodes, layer nodes carry further attributes such as an identification number, the layer type, and the computation mode. Data nodes contain tags such as the off-chip memory allocation type, whether the data is constant, the data type, and so on.
An even more important piece of information is the relationship of each layer-data node pair. We choose a producer-consumer model to describe the dependencies between nodes. A layer node can have zero or more input data nodes, and each of these data nodes lists this layer in its consumer list. A layer node can have one or more output data nodes, and each of these data nodes records this layer as its producer. In other words, a data node's consumers form a list, whereas it has exactly one producer, either direct user input or another layer. These pieces of information are used in the data-layout and back-end instruction-generation stages.
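A minimal Python rendering of this graph IR is sketched below. The class and field names follow the attributes listed in Fig. 1, but the concrete API is our own illustration, not the compiler's actual data structures.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DataNode:
    node_id: int
    dtype: str = "Float32"
    constant: bool = False
    producer: Optional["LayerNode"] = None          # exactly one producer
    consumers: List["LayerNode"] = field(default_factory=list)

@dataclass
class LayerNode:
    node_id: int
    name: str                                       # e.g. "L1"
    layer_type: str                                 # e.g. "Convolution"
    mode: str = "normal"                            # e.g. "normal" or "sparse"
    inputs: List[DataNode] = field(default_factory=list)
    outputs: List[DataNode] = field(default_factory=list)

def connect(layer: LayerNode, ins: List[DataNode], outs: List[DataNode]) -> None:
    """Wire up producer/consumer lists as the producer-consumer model describes."""
    for d in ins:
        layer.inputs.append(d)
        d.consumers.append(layer)
    for d in outs:
        layer.outputs.append(d)
        d.producer = layer
```

For instance, L1's output data node has L1 as its single producer and L3 in its consumer list, while a constant filter loaded from the model is marked constant=True and has no layer producer.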
The computational graph is built in the kernel or operator functions of high-level NN frameworks such as TensorFlow or MXNet. After several intermediate optimization steps, the graph back end calls the NN accelerator compiler library APIs to generate binary NN instructions; that part is not discussed in this work.
and the parameters of NN layers are saved in the model. When compiling a neural network for accelerators, some of the values used during optimization or placed in instruction fields can be replaced by constants or immediate numbers. For example, a convolutional neural network (CNN) applied to image recognition usually starts from an input image. The image size can be fixed or dynamic; for a dynamic image size, a preprocessing transformation (resize, crop, etc.) first adjusts the image to the fixed scale defined by the network, and the forward inference is then performed.
In this situation, the branch conditions and loop counts of the NN instructions are known ahead of time, which reduces jumps and useless code to some extent. Therefore, we can apply loop unrolling and pipelining, setting loop bounds with immediate numbers instead of computing them through scalar register calculations, as sketched below.
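The sketch below contrasts the two forms of generated code. The mnemonics (li, addi, blt, compute_window) are placeholders invented for illustration, not instructions from the accelerator's real ISA.

```python
def emit_looped(trip_count_reg: str) -> list:
    """General case: the loop bound lives in a scalar register, so the
    generated code needs loop bookkeeping and a conditional branch."""
    return [
        "li   r1, 0",
        "loop:",
        "  compute_window r1          # one sliding-window step",
        "  addi r1, r1, 1",
        f"  blt  r1, {trip_count_reg}, loop",
    ]

def emit_unrolled(ho: int, wo: int) -> list:
    """Inference case: ho and wo are compile-time constants, so every window
    index becomes an immediate and the branch disappears entirely."""
    return [f"compute_window {i}          # immediate index, no loop bookkeeping"
            for i in range(ho * wo)]
```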
Fig. 2. A fine-grained pipelining example of a pooling layer. The loop over the ho * wo sliding-window positions (with strides on the channel, height, and width dimensions) is unrolled with static immediate numbers, and the hardware units overlap each window's initialize, compute, and write-back steps to produce the final result.
In more detail, NN instructions of the same type are guaranteed to execute in order, while instructions of different types can execute at the same time. Thus, we can use a ping-pong strategy for on-chip resource usage to overlap the memory-access stage with the computation stage: the data-fetching stage can run in parallel with the calculation and write-back stages, as shown in Fig. 2. In addition, data tiling can be implemented with a naive static memory allocation, dividing the tensors into several tiled data slices within a layer's computation.
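As a sketch of that static tiling (the buffer capacity and the row-wise slicing are assumptions on our part), the tensor is cut into fixed-size slices sized to one half of a ping-pong buffer, so loading slice i+1 can overlap computing on slice i:

```python
import numpy as np

ON_CHIP_ROWS = 64   # assumed capacity of one ping-pong buffer half, in rows

def tile_slices(tensor):
    """Statically cut a 2-D tensor into row slices that each fit on chip."""
    return [tensor[i:i + ON_CHIP_ROWS]
            for i in range(0, tensor.shape[0], ON_CHIP_ROWS)]

slices = tile_slices(np.zeros((200, 512), dtype=np.float32))
# -> slices of 64, 64, 64, and 8 rows, each loaded while the previous one is computed
```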
Fig. 3. Coarse-grained pipelining across data slices: while the kernel-compute loop works on input slice D1 and writes back result R1, the next input slice D2 is loaded, so the load, compute, and store stages overlap.
If we balance the execution time across the whole pipeline, meaning that the computation functional units and the IO units of the specific hardware back end work at comparable time costs, we can hide the latency to some extent by using a form of coarse-grained pipelining (see Fig. 3). The period between synchronization instructions is called one time slice. During one time slice, the store part of the previous pipeline stage executes first so that it releases the corresponding on-chip memory space, which the next load stage then allocates and uses for the next input data slice, while the other half of the resources executes instructions of a different type. The IO and computing functional units therefore work in parallel during each time slice, and so on. Data dependencies between layers are resolved through the computation graph.
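The schedule below is a small illustration of how the stages overlap across time slices; the three-stage depth and the dictionary-based representation are our own simplification of the mechanism described above.

```python
def time_slice_schedule(num_slices):
    """For each time slice, list which data slice the store, compute, and load
    stages touch: store(i-1) frees its buffer half first, load(i+1) refills it,
    and compute(i) runs on the other half in parallel."""
    schedule = []
    for t in range(num_slices + 2):               # pipeline fill + drain
        stages = {}
        if 0 <= t - 2 < num_slices:
            stages["store"] = t - 2
        if 0 <= t - 1 < num_slices:
            stages["compute"] = t - 1
        if t < num_slices:
            stages["load"] = t
        schedule.append(stages)
    return schedule

# time_slice_schedule(3) ->
# [{'load': 0}, {'compute': 0, 'load': 1}, {'store': 0, 'compute': 1, 'load': 2},
#  {'store': 1, 'compute': 2}, {'store': 2}]
```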