Analyzing CUDA Workloads Using A Detailed GPU Simulator

Download as pdf or txt
Download as pdf or txt
You are on page 1of 12

Analyzing CUDA Workloads Using a Detailed GPU Simulator

Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong and Tor M. Aamodt
University of British Columbia,
Vancouver, BC, Canada
{bakhoda,gyuan,wwlfung,henryw,aamodt}@ece.ubc.ca

Abstract widespread interest in using GPU hardware to accelerate non-


Modern Graphic Processing Units (GPUs) provide suffi- graphics applications.
Since its introduction by NVIDIA Corporation in February
ciently flexible programming models that understanding their
2007, the CUDA programming model [29,33] has been used to
performance can provide insight in designing tomorrow’s
develop many applications for GPUs. CUDA provides an easy
manycore processors, whether those are GPUs or other-
to learn extension of the ANSI C language. The programmer
wise. The combination of multiple, multithreaded, SIMD cores
specifies parallel threads, each of which runs scalar code.
makes studying these GPUs useful in understanding trade-
While short vector data types are available, their use by the
offs among memory, data, and thread level parallelism. While
programmer is not required to achieve peak performance, thus
modern GPUs offer orders of magnitude more raw comput-
making CUDA a more attractive programming model to those
ing power than contemporary CPUs, many important ap-
less familiar with traditional data parallel architectures. This
plications, even those with abundant data level parallelism,
execution model has been dubbed a single instruction, multiple
do not achieve peak performance. This paper characterizes
thread (SIMT) model [22] to distinguish it from the more
several non-graphics applications written in NVIDIA’s CUDA
traditional single instruction, multiple data (SIMD) model.
programming model by running them on a novel detailed
As of February 2009, NVIDIA has listed 209 third-party
microarchitecture performance simulator that runs NVIDIA’s
applications on their CUDA Zone website [30]. Of the 136
parallel thread execution (PTX) virtual instruction set. For
applications listed with performance claims, 52 are reported to
this study, we selected twelve non-trivial CUDA applications
obtain a speedup of 50× or more, and of these 29 are reported
demonstrating varying levels of performance improvement on
to obtain a speedup of 100× or more. As these applications
GPU hardware (versus a CPU-only sequential version of
already achieve tremendous benefits, this paper instead focuses
the application). We study the performance of these applica-
on evaluating CUDA applications with reported speedups
tions on our GPU performance simulator with configurations
below 50× since this group of applications appears most in
comparable to contemporary high-end graphics cards. We
need of software tuning or changes to hardware design.
characterize the performance impact of several microarchitec-
This paper makes the following contributions:
ture design choices including choice of interconnect topology,
• It presents data characterizing the performance of twelve
use of caches, design of memory controller, parallel work-
existing CUDA applications collected on a research GPU
load distribution mechanisms, and memory request coalescing
simulator (GPGPU-Sim).
hardware. Two observations we make are (1) that for the appli-
• It shows that the non-graphics applications we study
cations we study, performance is more sensitive to interconnect
tend to be more sensitive to bisection bandwidth versus
bisection bandwidth rather than latency, and (2) that, for some
latency.
applications, running fewer threads concurrently than on-chip
• It shows that, for certain applications, decreasing the
resources might otherwise allow can improve performance by
number of threads running concurrently on the hardware
reducing contention in the memory system.
can improve performance by reducing contention for on-
chip resources.
1. Introduction • It provides an analysis of application characteristics in-
While single-thread performance of commercial superscalar cluding the dynamic instruction mix, SIMD warp branch
microprocessors is still increasing, a clear trend today is for divergence properties, and DRAM locality characteristics.
computer manufacturers to provide multithreaded hardware We believe the observations made in this paper will provide
that strongly encourages software developers to provide ex- useful guidance for directing future architecture and software
plicit parallelism when possible. One important class of paral- research.
lel computer hardware is the modern graphics processing unit The rest of this paper is organized as follows. In Section 2
(GPU) [22,25]. With contemporary GPUs recently crossing the we describe our baseline architecture and the microarchitecture
teraflop barrier [2,34] and specific efforts to make GPUs easier design choices that we explore before describing our simu-
to program for non-graphics applications [1, 29, 33], there is lation infrastructure and the benchmarks used in this study.

978-1-4244-4184-6/09/$25.00 ©2009 IEEE 163

Authorized licensed use limited to: CINVESTAV. Downloaded on February 10,2022 at 18:34:48 UTC from IEEE Xplore. Restrictions apply.
Shader Core
Thread Warp
Thread Warp

Thread Warp
Scheduler

Custom kernel SIMD


Application libcuda PTX code Pipeline
Fetch

Shader Cores Decode


CPU
Core Core Core Core Core Core local/global access
(or L1 miss); texture RF RF RF RF
Memory or const cache miss

Interconnection Network L1
Thread Warp L1 L1 local& Shared
Statistics tex const global Mem.
To interconnect Thread Warp
L2 L2 L2 All threads
Memory Memory Memory Data hit in L1?
Controller Controller Controller
MSHRs
DRAM DRAM DRAM Writeback
cudaMemcpy
GPGPU-Sim

(a) Overview (b) Detail of Shader Core


Figure 1. Modeled system and GPU architecture [11]. Dashed portions (L1 and L2 for local/global accesses) omitted from baseline.

Our experimental methodology is described in Section 3 and execution. Our simulator omits graphics specific hardware not
Section 4 presents and analyzes results. Section 5 reviews exposed to CUDA.
related work and Section 6 concludes the paper. Figure 1(b) shows the detailed implementation of a single
shader core. In this paper, each shader core has a SIMD
2. Design and Implementation width of 8 and uses a 24-stage, in-order pipeline without
In this section we describe the GPU architecture we simu- forwarding. The 24-stage pipeline is motivated by details in the
lated, provide an overview of our simulator infrastructure and CUDA Programming Guide [33], which indicates that at least
then describe the benchmarks we selected for our study. 192 active threads are needed to avoid stalling for true data
dependencies between consecutive instructions from a single
2.1. Baseline Architecture thread (in the absence of long latency memory operations).
We model this pipeline with six logical pipeline stages (fetch,
Figure 1(a) shows an overview of the system we simulated.
decode, execute, memory1, memory2, writeback) with super-
The applications evaluated in this paper were written using
pipelining of degree 4 (memory1 is an empty stage in our
CUDA [29, 33]. In the CUDA programming model, the GPU
model). Threads are scheduled to the SIMD pipeline in a fixed
is treated as a co-processor onto which an application running
group of 32 threads called a warp [22]. All 32 threads in a
on a CPU can launch a massively parallel compute kernel. The
given warp execute the same instruction with different data
kernel is comprised of a grid of scalar threads. Each thread is
values over four consecutive clock cycles in all pipelines (the
given an unique identifier which can be used to help divide up
SIMD cores are effectively 8-wide). We use the immediate
work among the threads. Within a grid, threads are grouped
post-dominator reconvergence mechanism described in [11] to
into blocks, which are also referred to as cooperative thread
handle branch divergence where some scalar threads within a
arrays (CTAs) [22]. Within a single CTA threads have access
warp evaluate a branch as “taken” and others evaluate it as
to a common fast memory called the shared memory and can,
“not taken”.
if desired, perform barrier synchronizations.
Figure 1(a) also shows our baseline GPU architecture. The Threads running on the GPU in the CUDA programming
GPU consists of a collection of small data-parallel compute model have access to several memory regions (global, local,
cores, labeled shader cores in Figure 1, connected by an constant, texture, and shared [33]) and our simulator models
interconnection network to multiple memory modules (each accesses to each of these memory spaces. In particular, each
labeled memory controller). Each shader core is a unit similar shader core has access to a 16KB low latency, highly-banked
in scope to a streaming multiprocessor (SM) in NVIDIA per-core shared memory; to global texture memory with a per-
terminology [33]. Threads are distributed to shader cores at core texture cache; and to global constant memory with a
the granularity of entire CTAs, while per-CTA resources, such per-core constant cache. Local and global memory accesses
as registers, shared memory space, and thread slots, are not always require off chip memory accesses in our baseline
freed until all threads within a CTA have completed execution. configuration. For the per-core texture cache, we implement
If resources permit, multiple CTAs can be assigned to a a 4D blocking address scheme as described in [14], which
single shader core, thus sharing a common pipeline for their essentially permutes the bits in requested addresses to promote

164

Authorized licensed use limited to: CINVESTAV. Downloaded on February 10,2022 at 18:34:48 UTC from IEEE Xplore. Restrictions apply.
2.2. GPU Architectural Exploration
This section describes some of the GPU architectural design
options explored in this paper. Evaluations of these design
Shader Core
options are presented in Section 4.

Memory Controller
2.2.1. Interconnect. The on-chip interconnection network can
be designed in various ways based on its cost and performance.
Figure 2. Layout of memory controller nodes in mesh3 Cost is determined by complexity and number of routers as
well as density and length of wires. Performance depends on
spatial locality in a 2D space rather than in linear space. For latency, bandwidth and path diversity of the network [9]. (Path
the constant cache, we allow single cycle access as long as all diversity indicates the number of routes a message can take
threads in a warp are requesting the same data. Otherwise, a from the source to the destination.)
port conflict occurs, forcing data to be sent out over multiple Butterfly networks offer minimal hop count for a given
cycles and resulting in pipeline stalls [33]. Multiple memory router radix while having no path diversity and requiring very
accesses from threads within a single warp to a localized long wires. A crossbar interconnect can be seen as a 1-stage
region are coalesced into fewer wide memory accesses to im- butterfly and scales quadratically in area as the number of
prove DRAM efficiency1 . To alleviate the DRAM bandwidth ports increase. A 2D torus interconnect can be implemented
bottleneck that many applications face, a common technique on chip with nearly uniformly short wires and offers good path
used by CUDA programmers is to load frequently accessed diversity, which can lead to a more load balanced network.
data into the fast on-chip shared memory [40]. Ring and mesh interconnects are both special types of torus
interconnects. The main drawback of a mesh network is its
Thread scheduling inside a shader core is performed with
relatively higher latency due to a larger hop count. As we
zero overhead on a fine-grained basis. Every 4 cycles, warps
will show in Section 4, our benchmarks are not particularly
ready for execution are selected by the warp scheduler and
sensitive to latency so we chose a mesh network as our
issued to the SIMD pipelines in a loose round robin fashion
baseline while exploring the other choices for interconnect
that skips non-ready warps, such as those waiting on global
topology.
memory accesses. In other words, whenever any thread inside
a warp faces a long latency operation, all the threads in the 2.2.2. CTA distribution. GPUs can use the abundance of par-
warp are taken out of the scheduling pool until the long allelism in data-parallel applications to tolerate memory access
latency operation is over. Meanwhile, other warps that are latency by interleaving the execution of warps. These warps
not waiting are sent to the pipeline for execution in a round may either be from the same CTA or from different CTAs
robin order. The many threads running on each shader core running on the same shader core. One advantage of running
thus allow a shader core to tolerate long latency operations multiple smaller CTAs on a shader core rather than using a
without reducing throughput. single larger CTA relates to the use of barrier synchronization
In order to access global memory, memory requests must points within a CTA [40]. Threads from one CTA can make
be sent via an interconnection network to the corresponding progress while threads from another CTA are waiting at a
memory controllers, which are physically distributed over barrier. For a given number of threads per CTA, allowing more
the chip. To avoid protocol deadlock, we model physically CTAs to run on a shader core provides additional memory
separate send and receive interconnection networks. Using latency tolerance, though it may imply increasing register
separate logical networks to break protocol deadlock is another and shared memory resource use. However, even if sufficient
alternative, but one we did not explore. Each on-chip memory on-chip resources exist to allow more CTAs per core, if a
controller then interfaces to two off-chip GDDR3 DRAM compute kernel is memory-intensive, completely filling up all
chips2 . Figure 2 shows the physical layout of the memory CTA slots may reduce performance by increasing contention in
controllers in our 6x6 mesh configuration as shaded areas3 . the interconnection network and DRAM controllers. We issue
The address decoding scheme is designed in a way such CTAs in a breadth-first manner across shader cores, selecting
that successive 2KB DRAM pages [19] are distributed across a shader core that has a minimum number of CTAs running on
different banks and different chips to maximize row locality it, so as to spread the workload as evenly as possible among
while spreading the load among the memory controllers. all cores.
2.2.3. Memory Access Coalescing. The minimum granularity
access for GDDR3 memory is 16 bytes and typically scalar
1. When memory accesses within a warp cannot be coalesced into a single
memory access, the memory stage will stall until all memory accesses are
threads in CUDA applications access 4 bytes per scalar
issued from the shader core. In our design, the shader core can issue a thread [19]. To improve memory system efficiency, it thus
maximum of 1 access every 2 cycles. makes sense to group accesses from multiple, concurrently-
2. GDDR3 stands for Graphics Double Data Rate 3 [19]. Graphics DRAM issued, scalar threads into a single access to a small, contigu-
is typically optimized to provide higher peak data bandwidth.
3. Note that with area-array (i.e., “flip-chip”) designs it is possible to place ous memory region. The CUDA programming guide indicates
I/O buffers anywhere on the die [6]. that parallel memory accesses from every half-warp of 16

165

Authorized licensed use limited to: CINVESTAV. Downloaded on February 10,2022 at 18:34:48 UTC from IEEE Xplore. Restrictions apply.
Application

Source code Source code Application


Tool (.cpp) (.cu)
Source code Source code
File (.cpp) (.cu)
cudafe + nvopencc
Nvidia Toolkit
cudafe + nvopencc
GPGPU-Sim Host C .ptx
code
User-specific Host C
tool/file .ptx
code
ptxas

cubin.bin ptxas
libcuda.a
Custom
libcuda.a
per thread
C/C++ compiler register
C/C++ compiler usage

Executable Executable
NVIDIA PCI-E Custom Function
GPU libcuda GPGPU-Sim Statistics
libcuda call

(a) CUDA Flow with GPU Hardware (b) GPGPU-Sim


Figure 3. Compilation Flow for GPGPU-Sim from a CUDA application in comparison to the normal CUDA compilation flow.

threads can be coalesced into fewer wide memory accesses if memory controller. While threads can only read from texture
they all access a contiguous memory region [33]. Our baseline and constant memory, they can both read and write to local
models similar intra-warp memory coalescing behavior (we and global memory. In our evaluation of caches for local and
attempt to coalesce memory accesses from all 32 threads in a global memory we model non-coherent caches. (Note that
warp). threads from different CTAs in the applications we study do
A related issue is that since the GPU is heavily multi- not communicate through global memory.)
threaded a balanced design must support many outstanding
memory requests at once. While microprocessors typically 2.3. Extending GPGPU-Sim to Support CUDA
employ miss-status holding registers (MSHRs) [21] that use We extended GPGPU-Sim, the cycle-accurate simulator we
associative comparison logic merge simultaneous requests for developed for our earlier work [11]. GPGPU-Sim models
the same cache block, the number of outstanding misses that various aspects of a massively parallel architecture with highly
can be supported is typically small (e.g., the original Intel programmable pipelines similar to contemporary GPU archi-
Pentium 4 used four MSHRs [16]). One way to support a tectures. A drawback of the previous version of GPGPU-
far greater number of outstanding memory requests is to use Sim was the difficult and time-consuming process of convert-
a FIFO for outstanding memory requests [17]. Similarly, our ing/parallelizing existing applications [11]. We overcome this
baseline does not attempt to eliminate multiple requests for difficulty by extending GPGPU-Sim to support the CUDA
the same block of memory on cache misses or local/global Parallel Thread Execution (PTX) [35] instruction set. This
memory accesses. However, we also explore the possibility of enables us to simulate the numerous existing, optimized
improving performance by coalescing read memory requests CUDA applications on GPGPU-Sim. Our current simulator
from later warps that require access to data for which a mem- infrastructure runs CUDA applications without source code
ory request is already in progress due to another warp running modifications on Linux based platforms, but does require
on the same shader core. We call this inter-warp memory access to the application’s source code. To build a CUDA
coalescing. We observe that inter-warp memory coalescing application for our simulator, we replace the common.mk
can significantly reduce memory traffic for applications that makefile used in the CUDA SDK with a version that builds the
contain data dependent accesses to memory. The data for application to run on our microarchitecture simulator (while
inter-warp merging quantifies the benefit of supporting large other more complex build scenarios may require more complex
capacity MSHRs that can detect a secondary access to an makefile changes).
outstanding request [45]. Figure 3 shows how a CUDA application can be compiled
2.2.4. Caching. While coalescing memory requests captures for simulation on GPGPU-Sim and compares this compila-
spatial locality among threads, memory bandwidth require- tion flow to the normal CUDA compilation flow [33]. Both
ments may be further reduced with caching if an application compilation flows use cudafe to transform the source code of
contains temporal locality or spatial locality within the access a CUDA application into host C code running on the CPU
pattern of individual threads. We evaluate the performance and device C code running on the GPU. The GPU C code is
impact of adding first level, per-core L1 caches for local and then compiled into PTX assembly (labeled “.ptx” in Figure 3)
global memory access to the design described in Section 2.1. by nvopencc, an open source compiler provided by NVIDIA
We also evaluate the effects of adding a shared L2 cache based on Open64 [28, 36]. The PTX assembler (ptxas) then
on the memory side of the interconnection network at the assembles the PTX assembly code into the target GPU’s native

166

Authorized licensed use limited to: CINVESTAV. Downloaded on February 10,2022 at 18:34:48 UTC from IEEE Xplore. Restrictions apply.
ISA (labeled “cubin.bin” in Figure 3(a)). The assembled code Transform, Binomial Option Pricing, Separable Convolution,
is then combined with the host C code and compiled into a 64-bin Histogram, Matrix Multiply, Parallel Reduction, Scalar
single executable linked with the CUDA Runtime API library Product, Scan of Large Arrays, and Matrix Transpose. Due to
(labeled “libcuda.a” in Figure 3) by a standard C compiler. In space limitations, and since most of these benchmarks already
the normal CUDA compilation flow (used with NVIDIA GPU perform well on GPUs, we only report details for Black-
hardware), the resulting executable calls the CUDA Runtime Scholes (BLK), a financial options pricing application, and
API to set up and invoke compute kernels onto the GPU via Fast Walsh Transform (FWT), widely used in signal and image
the NVIDIA CUDA driver. processing and compression. We also report the harmonic
When a CUDA application is compiled to use GPGPU-Sim, mean of all SDK applications simulated, denoted as SDK in
many steps remain the same. However, rather than linking the data bar charts in Section 4.
against the NVIDIA supplied libcuda.a binary, we link against Below, we describe the CUDA applications not in the SDK
our own libcuda.a binary. Our libcuda.a implements “stub” that we use as benchmarks in our study. These applications
functions for the interface defined by the header files supplied were developed by the researchers cited below and run un-
with CUDA. These stub functions set up and invoke simulation modified on our simulator.
sessions of the compute kernels on GPGPU-Sim (as shown AES Encryption (AES) [24] This application, developed
in Figure 3(b)). Before the first simulation session, GPGPU- by Manavski [24], implements the Advanced Encryption Stan-
Sim parses the text format PTX assembly code generated by dard (AES) algorithm in CUDA to encrypt and decrypt files.
nvopencc to obtain code for the compute kernels. Because The application has been optimized by the developer so that
the PTX assembly code has no restriction on register usage constants are stored in constant memory, the expanded key
(to improve portability between different GPU architectures), stored in texture memory, and the input data processed in
nvopencc performs register allocation using far more registers shared memory. We encrypt a 256KB picture using 128-bit
than typically required to avoid spilling. To improve the encryption.
realism of our performance model, we determine the register Graph Algorithm: Breadth-First Search (BFS) [15]
usage per thread and shared memory used per CTA using Developed by Harish and Narayanan [15], this application
ptxas4 . We then use this information to limit the number performs breadth-first search on a graph. As each node in
of CTAs that can run concurrently on a shader core. The the graph is mapped to a different thread, the amount of
GPU binary (cubin.bin) produced by ptxas is not used by parallelism in this applications scales with the size of the input
GPGPU-Sim. After parsing the PTX assembly code, but before graph. BFS suffers from performance loss due to heavy global
beginning simulation, GPGPU-Sim performs an immediate memory traffic and branch divergence. We perform breadth-
post-dominator analysis on each kernel to annotate branch first search on a random graph with 65,536 nodes and an
instructions with reconvergence points for the stack-based average of 6 edges per node.
SIMD control flow handling mechanism described by Fung Coulombic Potential (CP) [18,41] CP is part of the Parboil
et al. [11]. During a simulation, a PTX functional simulator Benchmark suite developed by the IMPACT research group at
executes instructions from multiple threads according to their UIUC [18,41]. CP is useful in the field of molecular dynamics.
scheduling order as specified by the performance simulator. Loops are manually unrolled to reduce loop overheads and
When the simulation completes, the host CPU code is then the point charge data is stored in constant memory to take
allowed to resume execution. In our current implementation, advantage of caching. CP has been heavily optimized (it
host code runs on a normal CPU, thus our performance has been shown to achieve a 647× speedup versus a CPU
measurements are for the GPU code only. version [40]). We simulate 200 atoms on a grid size of
2.4. Benchmarks 256×256.
gpuDG (DG) [46] gpuDG is a discontinuous Galerkin
Our benchmarks are listed in Table 1 along with the main time-domain solver, used in the field of electromagnetics to
application properties, such as the organization of threads into calculate radar scattering from 3D objects and analyze wave
CTAs and grids as well as the different memory spaces on the guides, particle accelerators, and EM compatibility [46]. Data
GPU exploited by each application. Multiple entries separated is loaded into shared memory from texture memory. The inner
by semi-colons in the grid and CTA dimensions indicate the loop consists mainly of matrix-vector products. We use the 3D
application runs multiple kernels. version with polynomial order of N=6 and reduce time steps
For comparison purposes we also simulated the following to 2 to reduce simulation time.
benchmarks from NVIDIA’s CUDA software development
3D Laplace Solver (LPS) [12] Laplace is a highly parallel
kit (SDK) [32]: Black-Scholes Option Pricing, Fast Walsh
finance application [12]. As well as using shared memory, care
4. By default, the version of ptxas in CUDA 1.1 appears to attempt to avoid
was taken by the application developer to ensure coalesced
spilling registers provided the number of registers per thread is less than 128 global memory accesses. We observe that this benchmark
and none of the applications we studied reached this limit. Directing ptxas to suffers some performance loss due to branch divergence. We
further restrict the number of registers leads to an increase in local memory
usage above that explicitly used in the PTX assembly, while increasing the
run one iteration on a 100x100x100 grid.
register limit does not increase the number of registers used. LIBOR Monte Carlo (LIB) [13] LIBOR performs Monte

167

Authorized licensed use limited to: CINVESTAV. Downloaded on February 10,2022 at 18:34:48 UTC from IEEE Xplore. Restrictions apply.
Table 1. Benchmark Properties
Benchmark Abr. Grid CTA Concurrent Total Instruction Shared Constant Texture Barriers?
Dimensions Dimensions CTAs/core Threads Count Memory? Memory? Memory?
AES Cryptography [24] AES (257,1,1) (256,1,1) 2 65792 28M Yes Yes 1D Yes
Graph Algorithm: BFS (128,1,1) (512,1,1) 4 65536 17M No No No No
Breadth First Search [15]
Coulombic Potential [18, 41] CP (8,32,1) (16,8,1) 8 32768 126M No Yes No No
gpuDG [46] DG (268,1,1); (84,1,1); 5 22512; 596M Yes No 1D Yes
(268,1,1); (112,1,1); 6 30016; Yes
(603,1,1) (256,1,1) 4 154368 No
3D Laplace Solver [12] LPS (4,25,1) (32,4,1) 6 12800 82M Yes No No Yes
LIBOR Monte Carlo [13] LIB (64,1,1) (64,1,1) 8 4096 907M No Yes No No
MUMmerGPU [42] MUM (782,1,1) (64,1,1) 3 50000 77M No No 2D No
Neural Network NN (6,28,1); (13,13,1); 5 28392; 68M No No No No
Digit Recognition [5] (50,28,1); (5,5,1); 8 35000; No
(100,28,1); (1,1,1); 8 2800; No
(10,28,1) (1,1,1) 8 280 No
N-Queens Solver [37] NQU (223,1,1) (96,1,1) 1 21408 2M Yes No No Yes
Ray Tracing [26] RAY (16,32,1) (16,8,1) 3 65536 71M No Yes No Yes
StoreGPU [4] STO (384,1,1) (128,1,1) 1 49152 134M Yes No No No
Weather Prediction [27] WP (9,8,1) (8,8,1) 3 4608 215M No No No No
Black-Scholes BLK (256,1,1) (256,1,1) 3 65536 236M No No No No
option pricing [32]
Fast Walsh Transform [32] FWT (512,1,1); (256,1,1); 4 131072; 240M Yes No No Yes
(256,1,1); (512,1,1) 2 131072 Yes

Carlo simulations based on the London Interbank Offered tions. The search space implies that the execution time grows
Rate Market Model [13]. Each thread reads a large number exponentially with N. Our analysis shows that most of the
of variables stored in constant memory. We find the working computation is performed by a single thread, which explains
set for constant memory fits inside the 8KB constant cache the low IPC. We simulate N=10.
per shader core that we model. However, we find memory Ray Tracing (RAY) [26] Ray-tracing is a method of
bandwidth is still a bottleneck due to a large fraction of local rendering graphics with near photo-realism. In this implemen-
memory accesses. We use the default inputs, simulating 4096 tation, each pixel rendered corresponds to a scalar thread in
paths for 15 options. CUDA [26]. Up to 5 levels of reflections and shadows are
MUMmerGPU (MUM) [42] MUMmerGPU is a parallel taken into account, so thread behavior depends on what object
pairwise local sequence alignment program that matches query the ray hits (if it hits any at all), making the kernel susceptible
strings consisting of standard DNA nucleotides (A,C,T,G) to to branch divergence. We simulate rendering of a 256x256
a reference string for purposes such as genotyping, genome image.
resequencing, and metagenomics [42]. The reference string StoreGPU (STO) [4] StoreGPU is a library that accelerates
is stored as a suffix tree in texture memory and has been hashing-based primitives designed for middleware [4]. We
arranged to exploit the texture cache’s optimization for 2D chose to use the sliding-window implementation of the MD5
locality. Nevertheless, the sheer size of the tree means high algorithm on an input file of size 192KB. The developers
cache miss rates, causing MUM to be memory bandwidth- minimize off-chip memory traffic by using the fast shared
bound. Since each thread performs its own query, the nature memory. We find STO performs relatively well.
of the search algorithm makes performance also susceptible Weather Prediction (WP) [27] Numerical weather pre-
to branch divergence. We use the first 140,000 characters of diction uses the GPU to accelerate a portion of the Weather
the Bacillus anthracis str. Ames genome as the reference string Research and Forcast model (WRF), which can model and pre-
and 50,000 25-character queries generated randomly using the dict condensation, fallout of various precipitation and related
complete genome as the seed. thermodynamic effects of latent heat release [27]. The kernel
Neural Network (NN) [5] Neural network uses a convo- has been optimized to reduce redundant memory transfer by
lutional neural network to recognize handwritten digits [5]. storing the temporary results for each altitude level in the cell
Pre-determined neuron weights are loaded into global memory in registers. However, this requires a large amount of registers,
along with the input digits. We modified the original source thus limiting the maximum allowed number of threads per
code to allow recognition of multiple digits at once to increase shader core to 192, which is not enough to cover global and
parallelism. Nevertheless, the last two kernels utilize blocks local memory access latencies. We simulate the kernel using
of only a single thread each, which results in severe under- the default test sample for 10 timesteps.
utilization of the shader core pipelines. We simulate recog-
nition of 28 digits from the Modified National Institute of 3. Methodology
Standards Technology database of handwritten digits. Table 2 shows the simulator’s configuration. Rows with
N-Queens Solver (NQU) [37] The N-Queen solver tackles multiple entries show the different configurations that we have
a classic puzzle of placing N queens on a NxN chess board simulated. Bold values show our baseline. To simulate the
such that no queen can capture another [37]. It uses a simple mesh network, we used a detailed interconnection network
backtracking algorithm to try to determine all possible solu- model, incorporating the configurable interconnection network

168

Authorized licensed use limited to: CINVESTAV. Downloaded on February 10,2022 at 18:34:48 UTC from IEEE Xplore. Restrictions apply.
Table 2. Hardware Configuration actually causes an overall slowdown since now 4 CTAs need
Number of Shader Cores 28 to be completed by the first core, requiring a total time of at
Warp Size 32 least 3.6T. We carefully verified that this behavior occurs by
SIMD Pipeline Width 8
Number of Threads / Core 256 / 512 / 1024 / 1536 / 2048 plotting the distribution of CTAs to shader cores versus time
Number of CTAs / Core 2 / 4 / 8 / 12 / 16 for both configurations being compared. This effect would
Number of Registers / Core 4096 / 8192 / 16384 / 24576 / 32768
Shared Memory / Core (KB) 4/8/16/24/32 (16 banks, 1 access/cycle/bank)5
be less significant in a real system with larger data sets and
Constant Cache Size / Core 8KB (2-way set assoc. 64B lines LRU) therefore grids with a larger number of CTAs. Rather than
Texture Cache Size / Core 64KB (2-way set assoc. 64B lines LRU) attempt to eliminate the effect by modifying the scheduler (or
Number of Memory Channels 8
L1 Cache None / 16KB / 32KB / 64KB the benchmarks) we simply note where it occurs.
4-way set assoc. 64B lines LRU In Section 4.7 we measure the impact of running greater
L2 Cache None / 128KB / 256KB
8-way set assoc. 64B lines LRU or fewer numbers of threads. We model this by varying the
GDDR3 Memory Timing tCL =9, tRP =13, tRC =34 number of concurrent CTAs permitted by the shader cores,
tRAS =21, tRCD =12, tRRD =8
Bandwidth per Memory Module 8 (Bytes/Cycle) which is possible by scaling the amount of on-chip resources
DRAM request queue capacity 32 / 128 available to each shader core. There are four such resources:
Memory Controller out of order (FR-FCFS) /
in order (FIFO) [39] Number of concurrent threads, number of registers, amount
Branch Divergence Method Immediate Post Dominator [11] of shared memory, and number of CTAs. The values we use
Warp Scheduling Policy Round Robin among ready warps
are shown in Table 2. The amount of resources available per
shader core is a configurable simulation option, while the
Table 3. Interconnect Configuration
amount of resources required by each kernel is extracted using
Topology Mesh / Torus / Butterfly / Crossbar / Ring
Routing Mechanism Dimension Order / Destination Tag
ptxas.
Routing delay 1
Virtual channels 2
Virtual channel buffers 4 4. Experimental Results
Virtual channel allocator iSLIP / PIM
Alloc iters 1 In this section we evaluate the designs introduced in Sec-
VC alloc delay 1 tion 2. Figure 4.1 shows the classification of each benchmark’s
Input Speedup 2
Flit size (Bytes) 8 / 16 / 32 / 64 instruction type (dynamic instruction frequency). The Fused
Multiply-Add and ALU Ops (other) sections of each bar show
simulator introduced by Dally et al. [9]. Table 3 shows the the proportion of total ALU operations for each benchmark
interconnection configuration used in our simulations. (which varies from 58% for NQU to 87% for BLK). Only
We simulate all benchmarks to completion to capture all the DG, CP and NN utilize the Fused Multiply-Add operations
distinct phases of each kernel in the benchmarks, especially extensively. Special Function Unit (SFU)6 instructions are also
the behavior at the tail end of the kernels, which can vary only used by a few benchmarks. CP is the only benchmark that
drastically compared to the beginning. If the kernels are has more than 10% SFU instructions.
relatively short and are frequently launched, the difference in The memory operations portion of Figure 4.1 is further
performance when not simulating the benchmark to comple- broken down in terms of type as shown in Figure 5. Note that
tion can be significant. “param” memory refers to parameters passed through the GPU
We note that the breadth-first CTA distribution heuristic kernel call, which we always treat as cache hits. There is a
described in Section 2.2.2 can occasionally lead to counter- large variation in the memory instruction types used among
intuitive performance results due to a phenomina we will refer benchmarks: for CP over 99% of accesses are to constant
to as CTA load imbalance. This CTA load imbalance can occur memory while for NN most accesses are to global memory.
when the number of CTAs in a grid exceeds the number that
can run concurrently on the GPU. For example, consider six 4.1. Baseline
CTAs on a GPU with two shader cores where at most two
CTAs can run concurrently on a shader core. Assume running We first simulated our baseline GPU configuration with
one CTA on one core takes time T and running two CTAs the bolded parameters shown in Table 2. Figure 6 shows
on one core takes time 2T (e.g., no off-chip accesses and six the performance of our baseline configuration (for the GPU
or more warps per CTA—enough for one CTA to keep our only) measured in terms of scalar instructions per cycle (IPC).
24 stage pipeline full). If each CTA in isolation takes equal For comparison, we also show the performance assuming
time T, total time is 3T (2T for the first round of four CTAs a perfect memory system with zero memory latency. Note
plus T for the last two CTAs which run on separate shader that the maximum achievable IPC for our configuration is
cores). Suppose we introduce an enhancement that causes
5. We model the shared memory to service up to 1 access per cycle in each
CTAs to run in time 0.90T to 0.91T when run alone (i.e., bank. This may be more optimistic than what can be inferred from the CUDA
faster). If both CTAs on the first core now finish ahead of Programming Guide (1 access/2 cycles/bank) [33].
those on the other core at time 1.80T versus 1.82T, then our 6. The architecture of the NVIDIA GeForce 8 Series GPUs includes a spe-
cial function unit for transcendental and attribute interpolation operations [22].
CTA distributor will issue the remaining 2 CTAs onto the first We include the following PTX instructions in our SFU Ops classification:
core, causing the load imbalance. With the enhancement, this cos, ex2, lg2, rcp, rsqrt, sin, sqrt.

169

Authorized licensed use limited to: CINVESTAV. Downloaded on February 10,2022 at 18:34:48 UTC from IEEE Xplore. Restrictions apply.
100%
SFU Ops Control Flow Fused Multiply-Add ALU Ops (other) Mem Ops architectures with intra-warp branch divergence [11]. Figure 7
shows warp occupancies (number of active threads in an issued
Classification

80%
Instruction

60%
40%
warp) over the entire runtime of the benchmarks. This metric
20% can be seen as a measure of how much GPU throughput
0%
AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP BLK FWT potential is wasted due to unfilled warps. The control flow
Figure 4. Instruction Classification. portion of the bars in Figure 4 shows that BFS, LPS, and NQU
Shared Tex Const Param Local Global
contain from 13% to 28% control flow operations. However,
100%
Memory Instruction

intensive control flow does not necessarily translate into high


Classification

75%
branch divergence; it depends more on whether or not all
50%
threads in a warp branch in the same direction. NN has the
25%
0%
lowest warp occupancy while it contains only 7% control flow
AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP BLK FWT operations. On the other hand, LPS with 19% control flow has
Figure 5. Memory Instructions Breakdown full warp occupancy 75% of the time. It is best to analyze
Figure 7 with Table 1 in mind, particularly in the case of
Baseline Perfect Memory Maximum IPC = 224
224 NN. In NN, two of the four kernels have only a single thread
192 in a block and they take up the bulk of the execution time,
160
128
meaning that the unfilled warps in NN are not due to branch
IPC

96
64
32 divergence. Some benchmarks (such as AES, CP, LIB, and
0
AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP HM BLK FWT SDK
STO) do not incur significant branch divergence, while others
Figure 6. Baseline performance (HM=Harmonic Mean) do. MUM experiences severe performance loss in particular
because more than 60% of its warps have less than 5 active
1-4 5-8 9-12 13-16 17-20 21-24 25-28 29-32 threads. BFS also performs poorly since threads in adjacent
Warp Occupancy

100%
80% nodes in the graph (which are grouped into warps) behave
60%
40% differently, causing more than 75% of its warps to have less
20%
0%
than 50% occupancy. Warp occupancy for NN and NQU is
AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP BLK FWT low due to large portions of the code being spent in a single
Figure 7. Warp Occupancy thread.
224 (28 shader cores x 8-wide pipelines). We also validated
our simulator against an Nvidia Geforce 8600GTS (a “low 4.3. Interconnect Topology
end” graphics card) by configuring our simulator to use 4 Figure 9 shows the speedup of various interconnection
shaders and two memory controllers. The IPC of the GPU network topologies compared to a mesh with 16 Byte channel
hardware, as shown in Figure 8(a), was estimated by dividing bandwidth. On average our baseline mesh interconnect per-
the dynamic instruction count measured (in PTX instructions) forms comparable to a crossbar with input speedup of two
in simulation by the product of the measured runtime on for the workloads that we consider. We also have evaluated
hardware and the shader clock frequency [31]. Figure 8(b) two torus topologies: “Torus - 16 Byte Channel BW”, which
shows the scatter plot of IPC obtained with our simulations has double the bisection bandwidth of the baseline “Mesh” (a
mimicking the 8600GTS normalized to the theoretical peak determining factor in the implementation cost of a network);
IPC versus the normalized IPC data measured using the and “Torus - 8 Byte Channel BW”, which has the same
8600GTS. The correlation coefficient was calculated to be bisection bandwidth as “Mesh”. The “Ring” topology that
0.899. One source of difference, as highlighted by the data we evaluated has a channel bandwidth of 64. The “Crossbar”
for CP which actually achieves a normalized IPC over 1, topology has a parallel iterative matching (PIM) allocator as
is likely due to compiler optimizations in ptxas which may opposed to an iSLIP allocator for other topologies. The two-
reduce the instruction count on real hardware7 . Overall, the stage butterfly and crossbar employ destination tag routing
data shows that applications that perform well in real GPU while others use dimension-order routing. The ring and mesh
hardware perform well in our simulator and applications that networks are the simplest and least expensive networks to build
perform poorly in real GPU hardware also perform poorly in in terms of area.
our simulator. In the following sections, we explore reasons As Figure 9 suggests, most of the benchmarks are fairly
why some benchmarks do not achieve peak performance. insensitive to the topology used. In most cases, a change in
topology results in less than 20% change in performance from
4.2. Branch Divergence the baseline, with the exception of the Ring and Torus with
8 Byte channel bandwidth. BLK experiences a performance
Branch divergence was highlighted by Fung et al. as a
gain with Ring due to the CTA load imbalance phenomena
major source of performance loss for multithreaded SIMD
described in Section 3. BLK has 256 CTAs. For the Ring
7. We only simulate the input PTX code which, in CUDA, ptxas then configuration, the number of CTAs executed per shader core
assembles into a proprietary binary format that we are unable to simulate. varies from 9 to 10. However, for the baseline configuration,

170

Authorized licensed use limited to: CINVESTAV. Downloaded on February 10,2022 at 18:34:48 UTC from IEEE Xplore. Restrictions apply.
1

Simulated 8600 GTS


LIB
0.8 AES CP
LPS

Normalized IPC
Estimated 8600GTS (HW) Simulated 8600GTS STO
48 0.6
RAY
DG
40 Max IPC = 32 0.4 MUM
IPC 32 WP
24 0.2
16 BFS, NN, NQU
8 0
0 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6
AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP 8600GTS Normalized IPC

(a) (b)
Figure 8. Performance Comparison with 8600GTS
Crossbar - 16 Byte Channel BW 2 Stage Butterfly - 16 Byte Channel BW 1.2 Extra 4 cycles Extra 8 cycles Extra 16 cycles
Torus - 16 Byte Channel BW Torus - 8 Byte Channel BW

Speedup
1.2 Ring - 64 Byte Channel BW 1
0.8
1
Speedup

0.6
0.8 0.4
0.6 AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP HM BLK FWT SDK

0.4 Figure 10. Interconnection Network Latency Sensitivity


0.2 1.4
AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP HM BLK FWT SDK 8B 16B (Baseline) 32B 64B
1.2

Speedup
Figure 9. Interconnection Network Topology 1
0.8

one of the shader cores is assigned 11 CTAs due to small 0.6


0.4
variations in time coupled with our greedy CTA distribution AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP HM BLK FWT SDK

heuristic. When more CTAs run on a shader core, all CTAs Figure 11. Interconnection Network Bandwidth Sensitivity
on that shader core take longer to complete, resulting in a
performance loss for the baseline configuration for BLK. interconnect has already been overprovisioned. Our analysis
As we will show in the next section, one of the reasons shows that for the baseline configuration, the input port to
why different topologies do not change the performance of the return interconnect from memory to the shader cores is
most benchmarks dramatically is that the benchmarks are stalled 16% of the time on average. Increasing the flit size to
not sensitive to small variations in latency, as long as the 32 completely eliminates these stalls, which is why there is no
interconnection network provides sufficient bandwidth. further speedup for interconnect flit size of 64. Note that our
memory read request packet sizes are 8 bytes, allowing them
4.4. Interconnection Latency and Bandwidth to be sent to the memory controllers in a single flit for all of
the configurations shown in Figure 11.
Figure 10 shows the IPC results for various mesh network
Overall, the above data suggests that performance is more
router latencies. Without affecting peak throughput, we add sensitive to interconnect bandwidth than to latency for the non-
an extra pipelined latency of 4, 8, or 16 cycles to each router
graphics workloads that we study. In other words, restricting
on top of our baseline router’s 2-cycle latency. An extra 4
channel bandwidth causes the interconnect to become a bot-
cycle latency per router is easily tolerated for most benchmarks tleneck.
and causes only 3.5% performance loss when harmonically
averaged across all benchmarks. BLK and CP experience a
performance gain due to the CTA load imbalance phenomena
4.5. DRAM Utilization and Efficiency
described in Section 3. With 8 extra cycles of latency per In this section we explore the impact that memory controller
router, the performance degradation is noticeable (slowdown design has on performance. Our baseline configuration uses an
by 9% on average) and becomes much worse (slowdown by Out-of-Order (OoO) First-Ready First-Come First-Serve (FR-
25% on average) at 16 cycles of extra latency. Note that these FCFS) [39] memory controller with a capacity of 32 memory
experiments are only intended to show the latency sensitivity requests. Each cycle, the OoO memory controller prioritizes
of benchmarks. memory requests that hit an open row in the DRAM over
We also modify the mesh interconnect bandwidth by varying requests that require a precharge and activate to open a new
the channel bandwidth from 8 bytes to 64 bytes. Figure 11 row. Against this baseline, we compare a simple First-In First-
shows that halving the channel bandwidth from 16 bytes to 8 Out (FIFO) memory controller that services memory requests
bytes has a noticeable negative impact on most benchmarks, in the order that they are received, as well as a more aggressive
but doubling and quadrupling channel bandwidth only results FR-FCFS OoO controller with an input buffer capacity of
in a small gain for a few workloads i.e., BFS and DG. 128 (OoO128). We measure two metrics besides performance:
DG is the most bandwidth sensitive workload, getting a The first is DRAM efficiency, which is the percentage of time
31% speedup and 53% slowdown for flit sizes of 32 and 8 spent sending data across the pins of DRAM over the time
respectively. The reason why DG does not exhibit further when there are any memory requests being serviced or pending
speedup with flit size of 64 is because at this point, the in the memory controller input buffer; the second is DRAM

171

Authorized licensed use limited to: CINVESTAV. Downloaded on February 10,2022 at 18:34:48 UTC from IEEE Xplore. Restrictions apply.
1.4 First-In First-Out Out-of-Order 128 3.5 None 16kB L1 32kB L1 64kB L1 64kB L1; 128kB L2 64kB L1; 256kB L2
1.2 3

Speedup
2.5
Speedup

1
2
0.8
1.5
0.6 1
0.4 0.5
AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP HM BLK FWT SDK AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP HM BLK FWT SDK

Figure 12. Impact of DRAM memory controller optimizations Figure 15. Effects of adding an L1 or L2
60%
25% 50% 100% (Baseline) 150% 200%
DRAM Utilization

First-In First-Out Out-of-Order 32 Out-of-Order 128 1.5


40% 1.25

Speedup
20% 1
0.75
0%
AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP BLK FWT 0.5
AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP HM BLK FWT SDK
Figure 13. DRAM Utilization
Figure 16. Effects of varying number of CTAs
100%
DRAM Efficiency

First-In First-Out Out-of-Order 32 Out-of-Order 128


80% caches for local/global accesses) writes to memory only cause
60% the memory controller to read data out of DRAM if a portion
40% of a 16B is modified due to writes that are not coalesced. When
20%
AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP BLK FWT
caches are added for local and global accesses, for simplicity,
Figure 14. DRAM Efficiency a write miss prevents a warp from being scheduled until the
cache block is read from DRAM. Furthermore, when a dirty
utilization, which is the percentage of time spent sending data line is evicted, the entire line is written to DRAM even if only
across the DRAM data pins over the entire kernel execution a single word of that line is modified. We leave exploration
time. These two measures can differ if an application contains of better cache policies to future work.
GPU computation phases during which it does not access Benchmarks that make extensive use of “shared memory”,
DRAM (e.g., if it has been heavily optimized to use “shared namely AES, LPS, NQU, and STO, do not respond signifi-
memory”). cantly to caches. On the other hand, BFS and NN have the
Figure 12 compares the performance of our baseline to highest ratio of global memory instructions to all instructions
FIFO and OoO128. We observe that AES, CP, NQU, and (at 19% and 27% respectively) and so they experience the
STO exhibit almost no slowdown for FIFO. Figure 14 shows highest speedup among workloads.
AES and STO obtain over 75% DRAM efficiency. Close
examination reveals that at any point in time all threads access 4.7. Are More Threads Better?
at most two rows in each bank of each DRAM, meaning that
Increasing the number of simultaneously running threads
a simple DRAM controller policy suffices. Furthermore, Fig-
can improve performance by having a greater ability to hide
ure 13 shows that AES and STO have low DRAM utilization
memory access latencies. However, doing so may result in
despite the fact that they process large amounts of data. Both
higher contention for shared resources, such as interconnect
these applications make extensive use of shared memory (see
and memory. We explored the effects of varying the resources
Figure 5). NQU and CP have very low DRAM utilization,
that limit the number of threads and hence CTAs that can run
making them insensitive to memory controller optimizations
concurrently on a shader core, without modifying the source
(CP slows down for OoO128 due to variations in CTA load
code for the benchmarks. We vary the amount of registers,
distribution). Performance is reduced by over 40% when using
shared memory, threads, and CTAs between 25% to 200% of
FIFO for BFS, LIB, MUM, RAY, and WP. These benchmarks
those available to the baseline. The results are shown in Figure
all show drastically reduced DRAM efficiency and utilization
16. For the baseline configuration, some benchmarks are
with this simple controller.
already resource-constrained to only 1 or 2 CTAs per shader
core, making them unable to run using a configuration with
4.6. Cache Effects less resources. We do not show bars for configurations that for
Figure 15 shows the effects on IPC of adding caches to the this reason are unable to run. NQU shows little change when
system. The first 3 bars show the relative speedup of adding varying the number of CTAs since it has very few memory
a 16KB, 32KB or 64KB cache to each shader core. The last operations. For LPS, NN, and STO, performance increases
two bars show the effects of adding a 128KB or 256KB L2 as more CTAs per core are used. LPS cannot take advantage
cache to each memory controller in addition to a 64KB L1 of additional resources beyond the baseline (100%) because
cache in each shader. CP, RAY and FWT exhibit a slowdown all CTAs in the benchmark can run simultaneously for the
with the addition of L1 caches. Close examination shows that baseline configuration. Each CTA in STO uses all the shared
CP experiences a slowdown due to the CTA load imbalance memory in the baseline configuration, therefore increasing
phenomena described in Section 3, whereas RAY and FWT shared memory by half for the 150% configuration results
experience a slowdown due to the way write misses and in no increase in the number of concurrently running CTAs.
evictions of dirty lines are handled. For the baseline (without AES and MUM show clear trends in decreasing performance

172

Authorized licensed use limited to: CINVESTAV. Downloaded on February 10,2022 at 18:34:48 UTC from IEEE Xplore. Restrictions apply.
AES and MUM show clear trends in decreasing performance as the number of CTAs increases. We observed that with more concurrent CTAs, AES and MUM experience increased contention in the memory system, resulting in 8.6× and 5.4× worse average memory latency, respectively, when comparing 200% resources to 50%. BFS, RAY, and WP show distinct optima in performance when the CTA limit is at 100%, 100%, and 150% of the baseline shader, respectively. Above these limits, we observe that DRAM efficiencies decrease and memory latencies increase, again suggesting increased contention in the memory system. For configurations with limits below the optima, the lack of warps to hide memory latencies reduces performance. CP suffers CTA load imbalance due to CTA scheduling for the 50% and 100% configurations. Similarly, DG suffers CTA load imbalance in the 150% configuration.

Given the widely-varying, workload-dependent behavior, scheduling the maximal number of CTAs supported by a shader core is not always the best policy. We leave for future work the design of dynamic scheduling algorithms that adapt to the workload behavior.
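To make the resource limits discussed above concrete, the per-core CTA limit can be computed as the minimum of the limits implied by each on-chip resource. The sketch below is illustrative only; the resource amounts and the cta_limit() helper are assumptions chosen for the example, not the exact parameters of the configurations evaluated in this section.

/* Illustrative per-core CTA limit calculation. All resource amounts
 * below are example assumptions, not the exact simulated configuration. */
#include <stdio.h>

static int imin(int a, int b) { return a < b ? a : b; }

/* Maximum CTAs that fit on one shader core given its resource budget. */
int cta_limit(int threads_per_cta, int shmem_per_cta, int regs_per_thread,
              int core_threads, int core_shmem, int core_regs, int core_max_ctas)
{
    int by_threads = core_threads / threads_per_cta;
    int by_shmem   = core_shmem   / shmem_per_cta;
    int by_regs    = core_regs    / (regs_per_thread * threads_per_cta);
    return imin(imin(by_threads, by_shmem), imin(by_regs, core_max_ctas));
}

int main(void)
{
    /* A kernel whose CTA uses an entire (assumed) 16KB of shared memory
     * is limited to one CTA per core; raising shared memory to 24KB
     * (a 150% configuration) still yields only one concurrent CTA,
     * mirroring the behavior described for STO above.                 */
    printf("%d\n", cta_limit(256, 16384, 16, 1024, 16384, 16384, 8)); /* prints 1 */
    printf("%d\n", cta_limit(256, 16384, 16, 1024, 24576, 16384, 8)); /* prints 1 */
    return 0;
}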
4.8. Memory Request Coalescing

Figure 17. Inter-Warp Memory Coalescing (speedup versus intra-warp coalescing for AES, BFS, CP, DG, LPS, LIB, MUM, NN, NQU, RAY, STO, WP, HM, BLK, FWT, and SDK).

Figure 17 presents data showing the improvement in performance when enabling the inter-warp memory coalescing described in Section 2.2.3. The harmonic mean speedup versus intra-warp coalescing is 6.1%. CP's slowdown with inter-warp coalescing is due to load imbalance in CTA distribution. Accesses in AES, DG, and MUM are to data-dependent locations, which makes it harder to use the explicitly managed shared memory to capture locality. These applications use the texture cache to capture this locality, and inter-warp merging effectively eliminates additional requests for the same cache block at the expense of associative search hardware.

It is interesting to observe that the harmonic mean speedup of the CUDA SDK benchmarks is less than 1%, showing that these highly optimized benchmarks do not benefit from inter-warp memory coalescing. Their careful program optimizations ensure less redundancy in the memory requests generated by each thread.
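The inter-warp merging mechanism itself can be sketched as an associative search over a shader core's pending memory requests: a new warp-level access whose cache-block address matches an outstanding request is folded into that request instead of generating additional traffic. The code below is a simplified illustration under assumed parameters (64-byte blocks, a 64-entry pending table, at most 32 warps per core); it is not the hardware design or the simulator's implementation.

/* Simplified sketch of inter-warp memory request merging.
 * Block size, table size, and warp count are assumptions for the example. */
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_BYTES 64u   /* assumed cache block size        */
#define MAX_PENDING 64    /* assumed pending-request entries */

typedef struct {
    uint64_t block_addr;  /* block-aligned address of the request      */
    uint32_t warp_mask;   /* warps waiting on this block (warp id < 32) */
    int      valid;
} pending_req_t;

static pending_req_t pending[MAX_PENDING];

/* Returns true if the access merged with an outstanding request,
 * false if a new memory request had to be allocated (or stalled). */
bool issue_access(uint64_t addr, int warp_id)
{
    uint64_t block = addr / BLOCK_BYTES;
    for (int i = 0; i < MAX_PENDING; i++)           /* associative search */
        if (pending[i].valid && pending[i].block_addr == block) {
            pending[i].warp_mask |= 1u << warp_id;  /* inter-warp merge   */
            return true;
        }
    for (int i = 0; i < MAX_PENDING; i++)           /* allocate new entry */
        if (!pending[i].valid) {
            pending[i].valid = 1;
            pending[i].block_addr = block;
            pending[i].warp_mask = 1u << warp_id;
            return false;
        }
    return false;                                   /* table full: stall  */
}

In hardware, this search grows with the number of pending-request entries, which is the associative search cost noted above.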
5. Related Work

Existing graphics-oriented GPU simulators include Qsilver [43], which does not model programmable shaders, and ATTILA [10], which focuses on graphics-specific features. Ryoo et al. [41] use CUDA to speed up a variety of relatively easily parallelizable scientific applications. They explore the use of conventional code optimization techniques and take advantage of the different memory types available on NVIDIA's 8800GTX to obtain speedup. While their analysis is performed by writing and optimizing applications to run on actual CUDA hardware, we use our novel performance simulator to observe the detailed behavior of CUDA applications upon varying architectural parameters.

There have been acceleration architectures proposed besides the GPU model that we analyze in this paper. Mahesri et al. introduce a class of applications for visualization, interaction, and simulation [23]. They propose using an accelerator architecture (xPU) separate from the GPU to improve performance of their benchmark suite. The Cell processor [7, 38] is a hardware architecture that can function like a stream processor with appropriate software support. It consists of a controlling processor and a set of SIMD co-processors, each with independent program counters and instruction memory. Merrimac [8] and Imagine [3] are both streaming processor architectures developed at Stanford.

Khailany et al. [20] explore VLSI costs and performance of a stream processor as the number of streaming clusters and ALUs per cluster scales. They use an analytical cost model. The benchmarks they use also have a high ratio of ALU operations to memory references, a property that eases the memory requirements of streaming applications. The UltraSPARC T2 [44] microprocessor is a multithreaded, multicore CPU in the SPARC family that comes in 4, 6, and 8 core variations, with each core capable of running 8 threads concurrently. It has a crossbar between the L2 and the processor cores (similar to our placement of the L2 in Figure 1(a)). Although the T1 and T2 support many concurrent threads (32 and 64, respectively) compared to other contemporary CPUs, this number is very small compared to the number on a high-end contemporary GPU (e.g., the Geforce 8800 GTX supports 12,288 threads per chip).

We quantified the effects of varying cache size, DRAM bandwidth, and other parameters, an analysis which, to our knowledge, has not been published previously. While the authors of the CUDA applications that we use as benchmarks have published work, the emphasis of their papers was not on how changes in the GPU architecture can affect their applications [4, 5, 12, 13, 15, 24, 26, 27, 37, 41, 42, 46]. In terms of streaming multiprocessor design, all of the above-mentioned works have different programming models from the CUDA programming model that we employ.

6. Conclusions

In this paper we studied the performance of twelve contemporary CUDA applications by running them on a detailed performance simulator that simulates NVIDIA's parallel thread execution (PTX) virtual instruction set architecture. We presented performance data and detailed analysis of performance bottlenecks, which differ in type and scope from application to application. First, we found that the performance of these applications is generally more sensitive to interconnection network bisection bandwidth than to (zero-load) latency: reducing interconnect bandwidth by 50% is even more harmful than increasing the per-router latency by 5.3× from 3 cycles to 19 cycles.
Second, we showed that caching global and local memory accesses can cause performance degradation for benchmarks where these accesses do not exhibit temporal or spatial locality. Third, we observed that sometimes running fewer CTAs concurrently than the limit imposed by on-chip resources can improve performance by reducing contention in the memory system. Finally, aggressive inter-warp memory coalescing can improve performance in some applications by up to 41%.

Acknowledgments

We thank Kevin Skadron, Michael Shebanow, John Kim, Andreas Moshovos, Xi Chen, Johnny Kuan and the anonymous reviewers for their valuable comments on this work. This work was partly supported by the Natural Sciences and Engineering Research Council of Canada.

References
[1] Advanced Micro Devices, Inc. ATI CTM Guide, 1.01 edition, 2006.
[2] Advanced Micro Devices, Inc. Press Release: AMD Delivers Enthusiast Performance Leadership with the Introduction of the ATI Radeon HD 3870 X2, 28 January 2008.
[3] J. H. Ahn, W. J. Dally, B. Khailany, U. J. Kapasi, and A. Das. Evaluating the Imagine stream architecture. In Proc. 31st Int'l Symp. on Computer Architecture, page 14, 2004.
[4] S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, and M. Ripeanu. StoreGPU: exploiting graphics processing units to accelerate distributed storage systems. In Proc. 17th Int'l Symp. on High Performance Distributed Computing, pages 165–174, 2008.
[5] Billconan and Kavinguy. A Neural Network on GPU. http://www.codeproject.com/KB/graphics/GPUNN.aspx.
[6] P. Buffet, J. Natonio, R. Proctor, Y. Sun, and G. Yasar. Methodology for I/O cell placement and checking in ASIC designs using area-array power grid. In IEEE Custom Integrated Circuits Conference, 2000.
[7] S. Clark, K. Haselhorst, K. Imming, J. Irish, D. Krolak, and T. Ozguner. Cell Broadband Engine interconnect and memory interface. In Hot Chips 17, Palo Alto, CA, August 2005.
[8] W. J. Dally, F. Labonte, A. Das, P. Hanrahan, J.-H. Ahn, J. Gummaraju, M. Erez, N. Jayasena, I. Buck, T. J. Knight, and U. J. Kapasi. Merrimac: Supercomputing with streams. In SC '03: Proc. 2003 ACM/IEEE Conf. on Supercomputing, page 35, 2003.
[9] W. J. Dally and B. Towles. Interconnection Networks. Morgan Kaufmann, 2004.
[10] V. del Barrio, C. Gonzalez, J. Roca, A. Fernandez, and E. E. ATTILA: a cycle-level execution-driven simulator for modern GPU architectures. In Int'l Symp. on Performance Analysis of Systems and Software, pages 231–241, March 2006.
[11] W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic warp formation and scheduling for efficient GPU control flow. In Proc. 40th IEEE/ACM Int'l Symp. on Microarchitecture, 2007.
[12] M. Giles. Jacobi iteration for a Laplace discretisation on a 3D structured grid. http://people.maths.ox.ac.uk/~gilesm/hpc/NVIDIA/laplace3d.pdf.
[13] M. Giles and S. Xiaoke. Notes on using the NVIDIA 8800 GTX graphics card. http://people.maths.ox.ac.uk/~gilesm/hpc/.
[14] Z. S. Hakura and A. Gupta. The design and analysis of a cache architecture for texture mapping. In Proc. 24th Int'l Symp. on Computer Architecture, pages 108–120, 1997.
[15] P. Harish and P. J. Narayanan. Accelerating Large Graph Algorithms on the GPU Using CUDA. In HiPC, pages 197–208, 2007.
[16] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel. The Microarchitecture of the Pentium 4 Processor. Intel Technology Journal, 5(1), 2001.
[17] H. Igehy, M. Eldridge, and K. Proudfoot. Prefetching in a texture cache architecture. In Proc. SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, 1998.
[18] Illinois Microarchitecture Project utilizing Advanced Compiler Technology Research Group. Parboil benchmark suite. http://www.crhc.uiuc.edu/IMPACT/parboil.php.
[19] Infineon. 256Mbit GDDR3 DRAM, Revision 1.03 (Part No. HYB18H256321AF). http://www.infineon.com, December 2005.
[20] B. Khailany, W. J. Dally, S. Rixner, U. J. Kapasi, J. D. Owens, and B. Towles. Exploring the VLSI scalability of stream processors. In Proc. 9th Int'l Symp. on High Performance Computer Architecture, page 153, 2003.
[21] D. Kroft. Lockup-free Instruction Fetch/Prefetch Cache Organization. In Proc. 8th Int'l Symp. on Computer Architecture, pages 81–87, 1981.
[22] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, 28(2):39–55, 2008.
[23] A. Mahesri, D. Johnson, N. Crago, and S. J. Patel. Tradeoffs in designing accelerator architectures for visual computing. In Proc. 41st IEEE/ACM Int'l Symp. on Microarchitecture, 2008.
[24] S. A. Manavski. CUDA compatible GPU as an efficient hardware accelerator for AES cryptography. In ICSPC 2007: Proc. of IEEE Int'l Conf. on Signal Processing and Communication, pages 65–68, 2007.
[25] M. Chiappetta. ATI Radeon HD 2900 XT - R600 Has Arrived. http://www.hothardware.com/printarticle.aspx?articleid=966.
[26] Maxime. Ray tracing. http://www.nvidia.com/cuda.
[27] J. Michalakes and M. Vachharajani. GPU acceleration of numerical weather prediction. In IPDPS 2008: IEEE Int'l Symp. on Parallel and Distributed Processing, pages 1–7, April 2008.
[28] M. Murphy. NVIDIA's Experience with Open64. In 1st Annual Workshop on Open64, 2008.
[29] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable Parallel Programming with CUDA. ACM Queue, 6(2):40–53, Mar.–Apr. 2008.
[30] NVIDIA. CUDA Zone. http://www.nvidia.com/cuda.
[31] NVIDIA. Geforce 8 series. http://www.nvidia.com/page/geforce8.html.
[32] NVIDIA Corporation. NVIDIA CUDA SDK code samples. http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html.
[33] NVIDIA Corporation. NVIDIA CUDA Programming Guide, 1.1 edition, 2007.
[34] NVIDIA Corporation. Press Release: NVIDIA Tesla GPU Computing Processor Ushers In the Era of Personal Supercomputing, 20 June 2007.
[35] NVIDIA Corporation. PTX: Parallel Thread Execution ISA, 1.1 edition, 2007.
[36] Open64. The open research compiler. http://www.open64.net/.
[37] Pcchen. N-Queens Solver. http://forums.nvidia.com/index.php?showtopic=76893.
[38] D. Pham, S. Asano, M. Bolliger, M. D., H. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, D. S. M. Riley, D. Stasiak, M. Suzuoki, M. Wang, J. Warnock, S. W. D. Wendel, T. Yamazaki, and K. Yazawa. The design and implementation of a first-generation Cell processor. In Digest of Technical Papers, IEEE Int'l Solid-State Circuits Conference (ISSCC), pages 184–592, Vol. 1, Feb. 2005.
[39] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens. Memory access scheduling. In Proc. 27th Int'l Symp. on Computer Architecture, pages 128–138, 2000.
[40] S. Ryoo, C. Rodrigues, S. Stone, S. Baghsorkhi, S.-Z. Ueng, J. Stratton, and W. W. Hwu. Program optimization space pruning for a multithreaded GPU. In Proc. 6th Int'l Symp. on Code Generation and Optimization (CGO), pages 195–204, April 2008.
[41] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W. W. Hwu. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proc. 13th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, pages 73–82, 2008.
[42] M. Schatz, C. Trapnell, A. Delcher, and A. Varshney. High-throughput sequence alignment using Graphics Processing Units. BMC Bioinformatics, 8(1):474, 2007.
[43] J. W. Sheaffer, D. Luebke, and K. Skadron. A flexible simulation framework for graphics architectures. In Proc. ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pages 85–94, 2004.
[44] Sun Microsystems, Inc. OpenSPARC T2 Core Microarchitecture Specification, 2007.
[45] J. Tuck, L. Ceze, and J. Torrellas. Scalable Cache Miss Handling for High Memory-Level Parallelism. In Proc. 39th IEEE/ACM Int'l Symp. on Microarchitecture, pages 409–422, 2006.
[46] T. C. Warburton. Mini Discontinuous Galerkin Solvers. http://www.caam.rice.edu/~timwar/RMMC/MIDG.html.

