OpenACC Programming Guide

Contents
1 Introduction 1
1.1 Writing Portable Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Standard Programming Languages . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.3 Compiler Directives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.4 Parallel Programming Extensions . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 What is OpenACC? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 The OpenACC Accelerator Model . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Benefits and Limitations of OpenACC . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Accelerating an Application with OpenACC . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.1 OpenACC Directive Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.2 Porting Cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.3 Heterogeneous Computing Best Practices . . . . . . . . . . . . . . . . . . . . . 7
1.4 Case Study - Jacobi Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Parallelize Loops 15
3.1 The Kernels Construct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 The Parallel Construct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Differences Between Parallel and Kernels . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4 The Loop Construct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4.1 private . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4.2 reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.5 Routine Directive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.5.1 C++ Class Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.6 Atomic Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.6.1 Atomic Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.7 Case Study - Parallelize . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.7.1 Parallel Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.7.2 Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5 Optimize Loops 41
5.1 Efficient Loop Ordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2 OpenACC’s 3 Levels of Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2.1 Understanding OpenACC’s Three Levels of Parallelism . . . . . . . . . . . . 43
5.3 Mapping Parallelism to the Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.4 Collapse Clause . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.5 Routine Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.6 Case Study - Optimize Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6 OpenACC Interoperability 53
6.1 The Host Data Region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.2 Using Device Pointers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.3 Obtaining Device and Host Pointer Addresses . . . . . . . . . . . . . . . . . . . . . . 56
6.4 Additional Vendor-Specific Interoperability Features . . . . . . . . . . . . . . . . . . 56
6.4.1 Asynchronous Queues and CUDA Streams (NVIDIA) . . . . . . . . . . . . . 56
6.4.2 CUDA Managed Memory (NVIDIA) . . . . . . . . . . . . . . . . . . . . . . . 56
6.4.3 Using CUDA Device Kernels (NVIDIA) . . . . . . . . . . . . . . . . . . . . . 57
A References 70
Chapter 1
Introduction
This guide presents methods and best practices for accelerating applications in an incremental,
performance portable way. Although some of the examples may show results using a given compiler
or accelerator, the information presented in this document is intended to address all architectures
both available at publication time and well into the future. Readers should be comfortable with C,
C++, or Fortran, but do not need experience with parallel programming or accelerated computing,
although such experience will be helpful.
Note: This guide is a community effort. To contribute, please visit the project on GitHub.
Because of these complexities, it’s important that developers choose a programming model that
balances the need for portability with the need for performance. Below are four programming models
of varying degrees of both portability and performance. In a real application it’s frequently best to
use a mixture of approaches to ensure a good balance between high portability and performance.
1.1.1 Libraries
Standard (and de facto standard) libraries provide the highest degree of portability because the
programmer can frequently replace only the library used without even changing the source code
itself when changing compute architectures. Since many hardware vendors provide highly-tuned
versions of common libraries, using libraries can also result in very high performance. Although
libraries can provide both high portability and high performance, few applications are able to use
only libraries because of their limited scope.
Some vendors provide additional libraries as a value-add for their platform, but which implement
non-standard APIs. These libraries provide high performance, but little portability. Fortunately
because libraries provide modular APIs, the impact of using non-portable libraries can be isolated
to limit the impact on overall application portability.
There is no one programming model that fits all needs. An application developer needs to evaluate
the priorities of the project and make decisions accordingly. A best practice is to begin with the
most portable and productive programming models and move to lower level programming models
only as needed and in a modular fashion. In doing so the programmer can accelerate much of the
application very quickly, which is often more beneficial than attempting to get the absolute highest
performance out of a particular routine before moving to the next. When development time is
limited, focusing on accelerating as much of the application as possible is generally more productive
than focusing solely on the top time consuming routine.
a high level diagram of the OpenACC abstract accelerator, but remember that the devices and
memories may be physically the same on some architectures.
More details of OpenACC’s abstract accelerator model will be presented throughout this guide
when they are pertinent.
Best Practice: For developers coming to OpenACC from other accelerator programming models,
such as CUDA or OpenCL, where host and accelerator memory is frequently represented by two
distinct variables (host_A[] and device_A[], for instance), it’s important to remember that when
using OpenACC a variable should be thought of as a single object, regardless of whether it’s backed
by memory in one or more memory spaces. If one assumes that a variable represents two separate
memories, depending on where it is used in the program, then it is possible to write programs that
access the variable in unsafe ways, resulting in code that would not be portable to devices that share
a single memory between the host and device. As with any parallel or asynchronous programming
paradigm, accessing the same variable from two sections of code simultaneously could result in
a race condition that produces inconsistent results. By assuming that you are always accessing a
single variable, regardless of how it is stored in memory, the programmer will avoid making mistakes
that could cost a significant amount of effort to debug.
memories when absolutely necessary. Programmers will often realize the largest performance gains
after optimizing data movement during this step.
This process is by no means the only way to accelerate using OpenACC, but it has been proven
successful in numerous applications. Doing the same steps in different orders may cause both
frustration and difficulty debugging, so it’s advisable to perform each step of the process in the
order shown above.
76 iter++;
77 }
55 do j=1,m-2
56 do i=1,n-2
57 Anew(i,j) = 0.25_fp_kind * ( A(i+1,j ) + A(i-1,j ) + &
58 A(i ,j-1) + A(i ,j+1) )
63 do j=1,m-2
64 do i=1,n-2
65 A(i,j) = Anew(i,j)
66 end do
67 end do
68
72 end do
The outermost loop in each example will be referred to as the convergence loop, since it loops
until the answer has converged by reaching some maximum error tolerance or number of iterations.
Notice that whether another iteration of the convergence loop occurs depends on the error value computed in the previous iteration. Also, the value of each element of A is calculated from the values of the previous iteration, which is known as a data dependency. These two facts mean that this loop cannot be run in parallel.
The first loop nest within the convergence loop calculates the new value for each element based
on the current values of its neighbors. Notice that it is necessary to store this new value into
a different array. If each iteration stored the new value back into the same array, a data dependency would exist between the data elements, as the order in which each element is calculated would affect the
final answer. By storing into a temporary array we ensure that all values are calculated using the
current state of A before A is updated. As a result, each loop iteration is completely independent
of each other iteration. These loop iterations may safely be run in any order or in parallel and the
final result would be the same. This loop also calculates a maximum error value. The error value
is the difference between the new value and the old. If the maximum amount of change between
two iterations is within some tolerance, the problem is considered converged and the outer loop will
exit.
The second loop nest simply updates the value of A with the values calculated into Anew. If this is
the last iteration of the convergence loop, A will be the final, converged value. If the problem has
not yet converged, then A will serve as the input for the next iteration. As with the above loop nest, each iteration of this loop nest is independent of the others and is safe to parallelize.
In the coming sections we will accelerate this simple application using the method described in this
document.
Chapter 2

Assess Application Performance
A variety of tools can be used to evaluate application performance, and which tools are available will depend on your development environment. From simple application timers to graphical performance analyzers, the choice of performance analysis tool is outside the scope of this document. The purpose of this section is to provide guidance on choosing important sections of code for acceleration, which is independent of the profiling tools available.
Throughout this guide the NVIDIA Nsight Systems performance analysis tool, which is provided with the CUDA toolkit, will be used for CPU profiling. When accelerator profiling is needed, the application will be run on an NVIDIA GPU and the Nsight Systems profiler will again be used.
• Computational intensity - How many operations are performed on a data element per load or store from memory.
• Available parallelism - Examine the loops within the hotspots to understand how much work each loop nest performs. Do the loops iterate tens, hundreds, or thousands of times (or more)? Do the loop iterations operate independently of each other? Look not only at the individual loops, but also at the nest of loops as a whole to understand the bigger picture of the entire nest.
Gathering baseline data like the above both helps inform the developer where to focus efforts for
the best results and provides a basis for comparing performance throughout the rest of the process.
It’s important to choose input that will realistically reflect how the application will be used once it
has been accelerated. It’s tempting to use a known benchmark problem for profiling, but frequently
these benchmark problems use a reduced problem size or reduced I/O, which may lead to incorrect
assumptions about program performance. Many developers also use the baseline profile to gather
the expected output of the application to use for verifying the correctness of the application as it
is accelerated.
58, Generated vector simd code for the loop containing reductions
68, Memory copy idiom, loop replaced by call to __c_mcopy8
79, GetTimer inlined, size=10 (inline) file laplace2d.c (54)
Once the executable has been built, the nsys command will run the executable and generate a profiling report that can be viewed offline in the NVIDIA Nsight Systems GUI.
$ nsys profile ./a.out
Processing [==============================================================100%]
Saved report file to "/tmp/nsys-report-2f5b-f32e-7dec-9af0.qdrep"
Report file moved to "/home/ubuntu/openacc-programming-guide/examples/laplace/ch2/report1.qdrep"
Once the data has been collected and the .qdrep report has been generated, it can be visualized using the Nsight Systems GUI. You must first copy the report (report1.qdrep in the example above) to a machine that has graphical capabilities and download the Nsight Systems interface. Next, open the application and select your report file via the file manager.
When we open the report in Nsight Systems, we see that the vast majority of the time is spent in two routines: main and __c_mcopy8. A screenshot of the initial screen for Nsight Systems is shown in figure 2.1. Since the code for this case study is completely within the main function of the program, it’s not surprising that nearly all of the time is spent in main, but in larger applications it’s likely that the time will be spent in several other routines.
Clicking into the main function we can see that nearly all of the runtime within main comes from
the loop that calculates the next value for A. This is shown in figure 2.2. What is not obvious from
the profiler output, however, is that the time spent in the memory copy routine shown in the initial
screen is actually the second loop nest, which performs the array swap at the end of each iteration.
The compiler output shown above indicates that the loop at line 68 was replaced by a memory copy, because doing so is more efficient than copying each element individually. So what the profiler is really showing us is that the major hotspots for our application are the loop nest that calculates Anew from A and the loop nest that copies from Anew to A for the next iteration, so we’ll concentrate our efforts on these two loop nests.
Figure 2.1: Nsight Systems initial window in the GUI. You must use the toolbar at the top to find
your target report file
Figure 2.2: Nsight initial profile window showing 81% of runtime in main and 17% in a memory
copy routine.
Chapter 3

Parallelize Loops
Now that the important hotspots in the application have been identified, the programmer should incrementally accelerate these hotspots by adding OpenACC directives to the important loops within
those routines. There is no reason to think about the movement of data at this point in the process; the OpenACC compiler will analyze the data needed in the identified region and automatically ensure that the data is available on the accelerator. By focusing solely on the parallelism during this
step, the programmer can move as much computation to the device as possible and ensure that the
program is still giving correct results before optimizing away data motion in the next step. During
this step in the process it is common for the overall runtime of the application to increase, even if
the execution of the individual loops is faster using the accelerator. This is because the compiler
must take a cautious approach to data movement, frequently copying more data to and from the
accelerator than is actually necessary. Even if overall execution time increases during this step, the
developer should focus on expressing a significant amount of parallelism in the code before moving
on to the next step and realizing a benefit from the directives.
OpenACC provides two different approaches for exposing parallelism in the code: parallel and
kernels regions. Each of these directives will be detailed in the sections that follow.
4 {
5 y[i] = 0.0f;
6 x[i] = (float)(i+1);
7 }
8
1 !$acc kernels
2 do i=1,N
3 y(i) = 0
4 x(i) = i
5 enddo
6
7 do i=1,N
8 y(i) = 2.0 * x(i) + y(i)
9 enddo
10 !$acc end kernels
In this example the code is initializing two arrays and then performing a simple calculation on them.
Notice that we have identified a block of code, using curly braces in C and starting and ending
directives in Fortran, that contains two candidate loops for acceleration. The compiler will analyze
these loops for data independence and parallelize both loops by generating an accelerator kernel for
each. The compiler is given complete freedom to determine how best to map the parallelism available
in these loops to the hardware, meaning that we will be able to use this same code regardless of the
accelerator we are building for. The compiler will use its own knowledge of the target accelerator
to choose the best path for acceleration. One caution about the kernels directive, however, is that
if the compiler cannot be certain that a loop is data independent, it will not parallelize the loop.
Common reasons why a compiler may misidentify a loop as non-parallel will be discussed in a later section.
parallelism in the code and remove anything in the code that may be unsafe to parallelize. If the
programmer asserts incorrectly that the loop may be parallelized then the resulting application may
produce incorrect results.
To put things another way: the kernels construct may be thought of as a hint to the compiler of
where it should look for parallelism while the parallel directive is an assertion to the compiler of
where there is parallelism.
An important thing to note about the kernels construct is that the compiler will analyze the code
and only parallelize when it is certain that it is safe to do so. In some cases the compiler may not
have enough information at compile time to determine whether a loop is safe to parallelize, in which
case it will not parallelize the loop, even if the programmer can clearly see that the loop is safely
parallel. For example, in the case of C/C++ code, where arrays are represented as pointers, the
compiler may not always be able to determine that two arrays do not reference the same memory,
otherwise known as pointer aliasing. If the compiler cannot know that two pointers are not aliased
it will not be able to parallelize a loop that accesses those arrays.
Best Practice: C programmers should use the restrict keyword (or the __restrict decorator
in C++) whenever possible to inform the compiler that the pointers are not aliased, which will
frequently give the compiler enough information to then parallelize loops that it would not have
otherwise. In addition to the restrict keyword, declaring constant variables using the const
keyword may allow the compiler to use a read-only memory for that variable if such a memory
exists on the accelerator. Use of const and restrict is a good programming practice in general,
as it gives the compiler additional information that can be used when optimizing the code.
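As a minimal sketch of this practice (the function and array names are illustrative, not taken from the guide's examples), declaring the pointers restrict and the input const gives the compiler the information it needs to parallelize the loop:

void scale(float *restrict out, const float *restrict in, float c, int n)
{
    #pragma acc kernels
    for (int i = 0; i < n; i++)
    {
        /* out and in are guaranteed not to alias, so the loop is safe to parallelize */
        out[i] = c * in[i];
    }
}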
Fortran programmers should also note that an OpenACC compiler will parallelize Fortran array
syntax that is contained in a kernels construct. When using parallel instead, it will be necessary
to explicitly introduce loops over the elements of the arrays.
One more notable benefit that the kernels construct provides is that if data is moved to the device
for use in loops contained in the region, that data will remain on the device for the full extent of the
region, or until it is needed again on the host within that region. This means that if multiple loops
access the same data it will only be copied to the accelerator once. When parallel loop is used
on two subsequent loops that access the same data a compiler may or may not copy the data back
and forth between the host and the device between the two loops. In the examples shown in the
previous section the compiler generates implicit data movement for both parallel loops, but only
generates data movement once for the kernels approach, which may result in less data motion by
default. This difference will be revisited in the case study later in this chapter.
For more information on the differences between the kernels and parallel directives, please see
[http://www.pgroup.com/lit/articles/insider/v4n2a1.htm].
At this point many programmers will be left wondering which directive they should use in their code.
More experienced parallel programmers, who may have already identified parallel loops within their
code, will likely find the parallel loop approach more desirable. Programmers with less parallel
programming experience or whose code contains a large number of loops that need to be analyzed
may find the kernels approach much simpler, as it puts more of the burden on the compiler. Both
approaches have advantages, so new OpenACC programmers should determine for themselves which
approach is a better fit for them. A programmer may even choose to use kernels in one part of
the code, but parallel in another if it makes sense to do so.
Note: For the remainder of the document the phrase parallel region will be used to describe either a parallel or kernels region. When referring to the parallel construct, a terminal font will be used, as shown in this sentence.
3.4.1 private
The private clause specifies that each loop iteration requires its own copy of the listed variables. For example, if each loop iteration uses a small, temporary array named tmp during its calculation, then this variable must be made private to each loop iteration in order to ensure correct results. If tmp is not declared private, then threads executing different loop iterations may access this shared tmp variable in unpredictable ways, resulting in a race condition and potentially incorrect results. Below is the syntax for the private clause.
private(var1, var2, var3, ...)
There are a few special cases that must be understood about scalar variables within loops. First,
loop iterators will be privatized by default, so they do not need to be listed as private. Second,
unless otherwise specified, any scalar accessed within a parallel loop will be made firstprivate by default, meaning a private copy will be made of the variable for each loop iteration and it will be initialized with the value of that scalar upon entering the region. Finally, any variables (scalar or
not) that are declared within a loop in C or C++ will be made private to the iterations of that
loop by default.
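As a minimal sketch of the private clause (the names are illustrative and tmp is assumed to be a small array declared outside the loop), each iteration receives its own copy of the temporary array:

float tmp[3];
#pragma acc parallel loop private(tmp)
for (int i = 0; i < N; i++)
{
    /* scratch work performed in a per-iteration copy of tmp */
    for (int j = 0; j < 3; j++)
        tmp[j] = coef[j] * a[i];
    b[i] = tmp[0] + tmp[1] + tmp[2];
}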
Note: The parallel construct also has a private clause which will privatize the listed variables
for each gang in the parallel region.
3.4.2 reduction
The reduction clause works similarly to the private clause in that a private copy of the affected
variable is generated for each loop iteration, but reduction goes a step further to reduce all of those
private copies into one final result, which is returned from the region. For example, the maximum
of all private copies of the variable may be required. A reduction may only be specified on a scalar
variable and only common, specified operations can be performed, such as +, *, min, max, and
various bitwise operations (see the OpenACC specification for a complete list). The format of the
reduction clause is as follows, where operator should be replaced with the operation of interest and
variable should be replaced with the variable being reduced:
reduction(operator:variable)
An example of using the reduction clause will come in the case study below.
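As a quick illustration ahead of the case study (the array names are illustrative), a sum reduction that computes a dot product might look like the following sketch:

double sum = 0.0;
#pragma acc parallel loop reduction(+:sum)
for (int i = 0; i < N; i++)
{
    /* each iteration's partial result is combined into a single value */
    sum += x[i] * y[i];
}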
1 !$acc kernels
2 h(:) = 0
3 !$acc end kernels
4 !$acc parallel loop
5 do i=1,N
6 !$acc atomic
7 h(a(i)) = h(a(i)) + 1
8 enddo
9 !$acc end parallel loop
Notice that updates to the histogram array h are performed atomically. Because we are incrementing
the value of the array element, an update operation is used to read the value, modify it, and then
write it back.
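A hedged C equivalent of the same histogram update might look like the sketch below (the array names mirror the Fortran version above):

#pragma acc parallel loop
for (int i = 0; i < N; i++)
{
    /* many iterations may hit the same bin, so the read-modify-write
       must be performed atomically */
    #pragma acc atomic update
    h[a[i]] += 1;
}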
80 iter++;
81 }
3.7.2 Kernels
Using the kernels construct to accelerate the loops we’ve identified requires inserting just one
directive in the code and allowing the compiler to perform the parallel analysis. Adding a kernels
construct around the two computational loop nests results in the following code.
51 while ( error > tol && iter < iter_max )
52 {
53 error = 0.0;
54
65 }
66
78 iter++;
79 }
54 !$acc kernels
55 do j=1,m-2
56 do i=1,n-2
57 Anew(i,j) = 0.25_fp_kind * ( A(i+1,j ) + A(i-1,j ) + &
58 A(i ,j-1) + A(i ,j+1) )
59 error = max( error, abs(Anew(i,j) - A(i,j)) )
60 end do
61 end do
62
63 do j=1,m-2
64 do i=1,n-2
65 A(i,j) = Anew(i,j)
66 end do
67 end do
68 !$acc end kernels
69
with the kernels directive it is necessary for the programmer to do some amount of analysis to
determine where parallelism may be found.
Taking a look at the compiler output points to some more subtle differences between the two
approaches.
$ nvc -acc -Minfo=accel laplace2d-kernels.c
main:
56, Generating implicit copyin(A[:][:]) [if not already present]
Generating implicit copyout(Anew[1:4094][1:4094],A[1:4094][1:4094]) [if not already present]
58, Loop is parallelizable
60, Loop is parallelizable
Generating Tesla code
58, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
60, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
64, Generating implicit reduction(max:error)
68, Loop is parallelizable
70, Loop is parallelizable
Generating Tesla code
68, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
70, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
The first thing to notice from the above output is that the compiler correctly identified all four loops
as being parallelizable and generated kernels from those loops. Also notice that the compiler only
generated implicit data movement directives at line 54 (the beginning of the kernels region), rather
than at the beginning of each parallel loop. This means that the resulting code should perform
fewer copies between host and device memory in this version than the version from the previous
section. A more subtle difference between the output is that the compiler chose a different loop
decomposition scheme (as is evident by the implicit acc loop directives in the compiler output)
than the parallel loop because kernels allowed it to do so. More details on how to interpret this
decomposition feedback and how to change the behavior will be discussed in a later chapter.
At this point we have expressed all of the parallelism in the example code and the compiler has
parallelized it for an accelerator device. Analyzing the performance of this code may yield surprising
results on some accelerators, however. The results below demonstrate the performance of this code
on 1 - 16 CPU threads on an AMD Threadripper CPU and an NVIDIA Volta V100 GPU using
both implementations above. The y axis for figure 3.1 is execution time in seconds, so smaller is
better. For the two OpenACC versions, each bar is divided between time spent transferring data between the host and device and time spent executing on the device.
The performance of this code improves as more CPU threads are added to the calculation; however, since the code is memory-bound, the performance benefit of adding additional threads quickly diminishes. Also, both the OpenACC kernels and parallel loop versions perform worse than the serial CPU baseline.
It is also clear that the parallel loop version spends significantly more time in data transfer
than the kernels version. Further performance analysis is necessary to identify the source of this
slowdown. This analysis has already been applied to the graph above, which breaks down time
spent computing the solution and copying data to and from the accelerator.
A variety of tools are available for performing this analysis, but since this case study was compiled for an NVIDIA GPU, NVIDIA Nsight Systems will be used to understand the application performance. The screenshot in figure 3.2 shows the Nsight Systems profile for 2 iterations of the convergence loop
in the parallel loop version of the code.
Since the test machine has two distinct memory spaces, one for the CPU and one for the GPU,
it’s necessary to copy data between the two memories. In this screenshot, the tool represents data
transfers using the tan colored boxes in the two MemCpy rows and the computation time in the
green and purple boxes in the rows below Compute. It should be obvious from the timeline displayed
that significantly more time is being spent copying data to and from the accelerator before and
after each compute kernel than actually computing on the device. In fact, the majority of the time
is spent either in memory copies or in overhead incurred by the runtime scheduling memory copies.
In the next chapter we will fix this inefficiency, but first, why does the kernels version outperform
the parallel loop version?
When an OpenACC compiler parallelizes a region of code it must analyze the data that is needed
within that region and copy it to and from the accelerator if necessary. This analysis is done at a
per-region level and will typically default to copying arrays used on the accelerator both to and from
the device at the beginning and end of the region respectively. Since the parallel loop version
has two compute regions, as opposed to only one in the kernels version, data is copied back and
forth between the two regions. As a result, the copy and overhead times are roughly twice that of
the kernels region, although the compute kernel times are roughly the same.
Figure 3.2: Screenshot of NVIDIA Nsight Systems Profile on 2 steps of the Jacobi Iteration showing
a high amount of data transfer compared to computation.
Chapter 4

Optimize Data Locality
At the end of the previous chapter we saw that although we’ve moved the most compute intensive
parts of the application to the accelerator, sometimes the process of copying data from the host to
the accelerator and back will be more costly than the computation itself. This is because it’s difficult
for a compiler to determine when (or if) the data will be needed in the future, so it must be cautious
and ensure that the data will be copied in case it’s needed. To improve upon this, we’ll exploit
the data locality of the application. Data locality means that data used in device or host memory
should remain local to that memory for as long as it’s needed. This idea is sometimes referred to
as optimizing data reuse or optimizing away unnecessary data copies between the host and device
memories. However you think of it, providing the compiler with the information necessary to only
relocate data when it needs to do so is frequently the key to success with OpenACC.
After expressing the parallelism of a program’s important regions it’s frequently necessary to provide
the compiler with additional information about the locality of the data used by the parallel regions.
As noted in the previous section, a compiler will take a cautious approach to data movement,
always copying data that may be required, so that the program will still produce correct results.
A programmer will have knowledge of what data is really needed and when it will be needed. The
programmer will also have knowledge of how data may be shared between two functions, something
that is difficult for a compiler to determine. Profiling tools can help the programmer identify when
excess data movement occurs, as will be shown in the case study at the end of this chapter.
The next step in the acceleration process is to provide the compiler with additional information
about data locality to maximize reuse of data on the device and minimize data transfers. It is after
this step that most applications will observe the benefit of OpenACC acceleration. This step will
be primarily beneficial on machines where the host and device have separate memories.
level in the program call tree to enable data to be shared between regions in multiple functions.
The data construct is a structured construct, meaning that it must begin and end in the same
scope (such as the same function or subroutine). A later section will discuss how to handle cases
where a structured construct is not useful. A data region may be added to the earlier parallel
loop example to enable data to be shared between both loop nests as follows.
1 #pragma acc data
2 {
3 #pragma acc parallel loop
4 for (i=0; i<N; i++)
5 {
6 y[i] = 0.0f;
7 x[i] = (float)(i+1);
8 }
9
1 !$acc data
2 !$acc parallel loop
3 do i=1,N
4 y(i) = 0
5 x(i) = i
6 enddo
7
construct to inform the compiler of the data needs of that region of code. The data directives,
along with a brief description of their meanings, follow.
• copy - Create space for the listed variables on the device, initialize the variable by copying
data to the device at the beginning of the region, copy the results back to the host at the end
of the region, and finally release the space on the device when done.
• copyin - Create space for the listed variables on the device, initialize the variable by copying
data to the device at the beginning of the region, and release the space on the device when
done without copying the data back to the host.
• copyout - Create space for the listed variables on the device but do not initialize them. At
the end of the region, copy the results back to the host and release the space on the device.
• create - Create space for the listed variables and release it at the end of the region, but do
not copy to or from the device.
• present - The listed variables are already present on the device, so no further action needs
to be taken. This is most frequently used when a data region exists in a higher-level routine.
• deviceptr - The listed variables use device memory that has been managed outside of Ope-
nACC, therefore the variables should be used on the device without any address translation.
This clause is generally used when OpenACC is mixed with another programming model, as
will be discussed in the interoperability chapter.
In the case of the copy, copyin, copyout, and create clauses, their intended functionality will not occur if the variable referenced already exists within device memory. It may be helpful to think of these clauses as having an implicit present clause attached to them: if the variable is found to be present on the device, the rest of the clause will be ignored. An important example of this behavior is that using the copy clause when the variable already exists within device memory will not copy any data between the host and device. There is a different directive for copying data between the host and device from within a data region, which will be discussed shortly.
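As a minimal sketch of this behavior (the function and array names are illustrative), an array made present by an enclosing data region is not copied again by the copy clause on the inner compute construct; the outer region performs the only transfers:

void scale(float *a, int n)
{
    /* a[] is copied to the device here and back to the host at the end of the region */
    #pragma acc data copy(a[0:n])
    {
        /* a[] is already present, so this copy clause moves no data;
           it effectively behaves like a present clause */
        #pragma acc parallel loop copy(a[0:n])
        for (int i = 0; i < n; i++)
            a[i] *= 2.0f;
    }
}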
known at compile time. Shaping is also useful when only a part of the array needs to be stored on
the device.
As an example of array shaping, the code below modifies the previous example by adding shape
information to each of the arrays.
1 #pragma acc data create(x[0:N]) copyout(y[0:N])
2 {
3 #pragma acc parallel loop
4 for (i=0; i<N; i++)
5 {
6 y[i] = 0.0f;
7 x[i] = (float)(i+1);
8 }
9
In this example, the programmer knows that both x and y will be populated with data on the device,
so neither need to have data copied from the host. However, since y is used within a copyout clause,
the data contained within y will be copied from the device to the host when the end of the data
region is reached. This is useful in a situation where you need the results stored in y later in host
code.
wanting to manage device data across different code files. For example, in a C++ class data is
frequently allocated in a class constructor, deallocated in the destructor, and cannot be accessed
outside of the class. This makes using structured data regions impossible because there is no single,
structured scope where the construct can be placed. For these situations we can use unstructured
data lifetimes. The enter data and exit data directives can be used to identify precisely when
data should be allocated and deallocated on the device.
The enter data directive accepts the create and copyin data clauses and may be used to specify
when data should be created on the device.
The exit data directive accepts the copyout and a special delete data clause to specify when
data should be removed from the device.
If a variable appears in multiple enter data directives, it will only be deleted from the device if an
equivalent number of exit data directives are used. To ensure that the data is deleted, you can
add the finalize clause to the exit data directive. Additionally, if a variable appears in multiple enter data directives, only the first instance will perform any host-to-device data movement. If you need
to move data between the host and device any time after data is allocated with enter data, you
should use the update directive, which is discussed later in this chapter.
9 public:
10 /// Class constructor
11 Data(int length)
12 {
13 len = length;
14 arr = new ctype[len];
15 #pragma acc enter data copyin(this)
The same technique used in the class constructor and destructor above can be used in other programming languages as well. For instance, it’s common practice in Fortran codes to have a subroutine
that allocates and initializes all arrays contained within a module. Such a routine is a natural place to use enter data directives, as the allocation of both the host and device memory will appear within
the same routine in the code. Placing enter data and exit data directives in close proximity to
the usual allocation and deallocation of data within the code simplifies code maintenance.
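The same pairing can be written in C, where the unstructured directives sit next to the usual malloc and free. A minimal sketch with illustrative names:

#include <stdlib.h>

float *allocate_field(int n)
{
    float *arr = (float *)malloc(n * sizeof(float));
    /* create the device copy right where the host copy is allocated */
    #pragma acc enter data create(arr[0:n])
    return arr;
}

void free_field(float *arr, int n)
{
    /* remove the device copy right where the host copy is freed */
    #pragma acc exit data delete(arr[0:n])
    free(arr);
}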
2 do i=2,N-1
3 ! calculate internal values
4 A(i) = 1
5 end do
6 !$acc parallel
7 A(1) = 0;
8 A(N) = 0;
9 !$acc end parallel
In the above example, the second parallel region will generate and launch a small kernel for
setting the first and last elements. Small kernels generally do not run long enough to overcome
the cost of a kernel launch on some offloading devices, such as GPUs. It’s important that the data
transfer saved by employing this technique is large enough to overcome the high cost of a kernel
launch on some devices. Both the parallel loop and the second parallel region could be made
asynchronous (discussed in a later chapter) to reduce the cost of the second kernel launch.
Note: Because the kernels directive instructs the compiler to search for parallelism, there is no
similar technique for kernels, but the parallel approach above can be easily placed between kernels
regions.
63 + A[j-1][i] + A[j+1][i]);
64 error = fmax( error, fabs(Anew[j][i] - A[j][i]));
65 }
66 }
67
80 iter++;
81 }
Figure 4.1: NVIDIA Nsight Systems showing a single iteration of the Jacobi solver after adding the
OpenACC data region.
Looking at the final performance of this code, we see that the OpenACC version running on a GPU is now much faster than even the best threaded CPU code. Although only the parallel
loop version is shown in the performance graph, the kernels version performs equally well once
the data region has been added.
This ends the Jacobi Iteration case study. Because of its simplicity, this implementation generally shows very good speed-ups with OpenACC, often leaving little potential for further improvement. The
reader should feel encouraged, however, to revisit this code to see if further improvements are
possible on the device of interest to them.
Figure 4.2: Runtime of Jacobi Iteration after adding OpenACC data region
Chapter 5
Optimize Loops
Once data locality has been expressed, developers may wish to further tune the code for the hardware of interest. It’s important to understand that the more the loops are tuned for a particular type of hardware, the less performance portable the code becomes to other architectures. If you’re generally
running on one particular accelerator, however, there may be some gains to be had by tuning how
the loops are mapped to the underlying hardware.
It’s tempting to begin tuning the loops before all of the data locality has been expressed in the
code. However, because data copies are frequently the limiter to application performance on the
current generation of accelerators the performance impact of tuning a particular loop may be too
difficult to measure until data locality has been optimized. For this reason the best practice is to
wait to optimize particular loops until after all of the data locality has been expressed in the code,
reducing the data transfer time to a minimum.
Vector parallelism is fine-grained parallelism, where a single instruction operates on multiple pieces of data (much like SIMD parallelism on a modern CPU
or SIMT parallelism on a modern GPU). Vector operations are performed with a particular vector
length, indicating how many data elements may be operated on with the same instruction. Gang
parallelism is coarse-grained parallelism, where gangs work independently of each other and may
not synchronize. Worker parallelism sits between vector and gang levels. A gang consists of 1 or
more workers, each of which operates on a vector of some length. Within a gang the OpenACC
model exposes a cache memory, which can be used by all workers and vectors within the gang,
and it is legal to synchronize within a gang, although OpenACC does not expose synchronization
to the user. Using these three levels of parallelism, plus sequential, a programmer can map the
parallelism in the code to any device. OpenACC does not require the programmer to do this
mapping explicitly, however. If the programmer chooses not to explicitly map loops to the device
of interest the compiler will implicitly perform this mapping using what it knows about the target
device. This makes OpenACC highly portable, since the same code may be mapped to any number
of target devices. The more explicit mapping of parallelism the programmer adds to the code,
however, the less portable they make the code to other architectures.
Workers perform the same instruction on multiple elements of data using vector operations. So,
gangs consist of at least one worker, which operates on a vector of data.
1 !$acc kernels
2 !$acc loop gang
3 do j=1,M
4 !$acc loop vector(128)
5 do i=1,N
6
that describes where in a given row these non-zero elements would reside, and a third describing
the columns in which the data would reside. The code for this exercise is below.
1 #pragma acc parallel loop
2 for(int i=0;i<num_rows;i++) {
3 double sum=0;
4 int row_start=row_offsets[i];
5 int row_end=row_offsets[i+1];
6 #pragma acc loop reduction(+:sum)
7 for(int j=row_start;j<row_end;j++) {
8 unsigned int Acol=cols[j];
9 double Acoef=Acoefs[j];
10 double xcoef=xcoefs[Acol];
11 sum+=Acoef*xcoef;
12 }
13 ycoefs[i]=sum;
14 }
Notice that I have now explicitly informed the compiler that the innermost loop should be a vector
loop, to ensure that the compiler will map the parallelism exactly how I wish. I can try different
vector lengths to find the optimal value for my accelerator by modifying the vector_length clause.
Below is a graph showing the relative speed-up of varying the vector length compared to the
compiler-selected value.
Figure 5.2: Relative speed-up from varying vector_length from the default value of 128
Notice that the best performance comes from the smallest vector length. Again, this is because the
number of non-zeros per row is very small, so a small vector length results in fewer wasted compute
resources. On the particular chip I’m using, the smallest possible vector length, 32, achieves the
best possible performance. On this particular accelerator, I also know that the hardware will not
perform efficiently at this vector length unless we can identify further parallelism another way. In
this case, we can use the worker level of parallelism to fill each gang with more of these short vectors.
Below is the modified code.
1 #pragma acc parallel loop gang worker num_workers(4) vector_length(32)
2 for(int i=0;i<num_rows;i++) {
3 double sum=0;
4 int row_start=row_offsets[i];
5 int row_end=row_offsets[i+1];
6 #pragma acc loop vector
7 for(int j=row_start;j<row_end;j++) {
8 unsigned int Acol=cols[j];
9 double Acoef=Acoefs[j];
10 double xcoef=xcoefs[Acol];
11 sum+=Acoef*xcoef;
12 }
13 ycoefs[i]=sum;
14 }
Figure 5.3: Speed-up from varying number of workers for a vector length of 32.
On this particular hardware, the best performance comes from a vector length of 32 and 4 workers,
which is similar to the simpler loop with a default vector length of 128. In this case, we observed
a 2.5X speed-up from decreasing the vector length and another 1.26X speed-up from varying the
number of workers within each gang, resulting in an overall 3.15X performance improvement from
the untuned OpenACC code.
Best Practice: Although not shown in order to save space, it’s generally best to use the
device_type clause whenever specifying the sorts of optimizations demonstrated in this section,
because these clauses will likely differ from accelerator to accelerator. By using the device_type
clause it’s possible to provide this information only on accelerators where the optimizations
apply and allow the compiler to make its own decisions on other architectures. The OpenACC
specification specifically suggests nvidia, radeon, and host as three common device type strings.
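A hedged sketch of this practice, applied to the SpMV loop above and assuming an NVIDIA target (the tuning values are only meaningful for that device type):

/* num_workers and vector_length follow device_type(nvidia), so they apply
   only when compiling for NVIDIA devices; other targets use compiler defaults */
#pragma acc parallel loop gang worker \
        device_type(nvidia) num_workers(4) vector_length(32)
for (int i = 0; i < num_rows; i++) {
    double sum = 0;
    #pragma acc loop vector reduction(+:sum)
    for (int j = row_offsets[i]; j < row_offsets[i+1]; j++)
        sum += Acoefs[j] * xcoefs[cols[j]];
    ycoefs[i] = sum;
}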
Chapter 6
OpenACC Interoperability
The authors of OpenACC recognized that it may sometimes be beneficial to mix OpenACC code
with code accelerated using other parallel programming languages, such as CUDA or OpenCL, or
accelerated math libraries. This interoperability means that a developer can choose the programming paradigm that makes the most sense in the particular situation and leverage code and libraries that may already be available. Developers don’t need to decide at the beginning of a project between OpenACC and something else; they can choose to use OpenACC and other technologies.
NOTE: The examples used in this chapter can be found online at https://github.com/jefflarkin/openacc-interoperability
9 }
10 }
11 void set(int n, float val, float * restrict arr)
12 {
13 #pragma acc kernels deviceptr(arr)
14 {
15 for(int i=0; i<n; i++)
16 {
17 arr[i] = val;
18 }
19 }
20 }
21 int main(int argc, char **argv)
22 {
23 float *x, *y, tmp;
24 int n = 1<<20;
25
26 x = acc_malloc((size_t)n*sizeof(float));
27 y = acc_malloc((size_t)n*sizeof(float));
28
29 set(n,1.0f,x);
30 set(n,0.0f,y);
31
1 module saxpy_mod
2 contains
3 subroutine saxpy(n, a, x, y)
4 integer :: n
5 real :: a, x(:), y(:)
6 !$acc parallel deviceptr(x,y)
7 y(:) = y(:) + a * x(:)
8 !$acc end parallel
9 end subroutine
10 end module
Notice that in the set and saxpy routines, where the OpenACC compute regions are found,
each compute region is informed that the pointers being passed in are already device point-
ers by using the deviceptr clause. This example also uses the acc_malloc, acc_free, and
acc_memcpy_from_device routines for memory management. Although the above example uses
acc_malloc and acc_memcpy_from_device, which are provided by the OpenACC specification for
portable memory management, a device-specific API may have also been used, such as cudaMalloc
and cudaMemcpy.
Using CUDA Managed Memory is similar to OpenACC memory management, in that only a single reference to the memory is necessary and the runtime will handle the complexities of data movement. The advantage that managed memory sometimes has is that it is better able to handle complex data structures, such as C++ classes or structures containing pointers, since pointer references are valid on both the host
and the device. More information about CUDA Managed Memory can be obtained from NVIDIA.
To use managed memory within an OpenACC program the developer can simply declare pointers
to managed memory as device pointers using the deviceptr clause so that the OpenACC runtime
will not attempt to create a separate device allocation for the pointers.
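A minimal sketch of this approach, assuming CUDA managed memory allocated with cudaMallocManaged (the function and variable names are illustrative):

#include <cuda_runtime.h>

void saxpy_managed(int n, float a)
{
    float *x, *y;
    /* managed allocations are accessible from both the host and the device */
    cudaMallocManaged((void **)&x, n * sizeof(float), cudaMemAttachGlobal);
    cudaMallocManaged((void **)&y, n * sizeof(float), cudaMemAttachGlobal);

    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 0.0f; }

    /* deviceptr tells the OpenACC runtime not to allocate or copy
       a separate device version of these pointers */
    #pragma acc parallel loop deviceptr(x, y)
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];

    cudaFree(x);
    cudaFree(y);
}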
It is also worth noting that the NVIDIA HPC compiler (formerly PGI compiler) has direct support
for using CUDA Managed Memory by way of a compiler option. See the compiler documentation
for more details.
7 // Function declaration
8 #pragma acc routine seq
9 extern "C" void f1dev( float*, float*, int );
10
11 // Function call-site
12 #pragma acc parallel loop present( a[0:n], b[0:n] )
13 for( int i = 0; i < n; ++i )
14 {
15 // f1dev is a __device__ function built with CUDA
16 f1dev( a, b, i );
17 }
Chapter 7

Advanced OpenACC Features
This chapter will discuss OpenACC features and techniques that do not fit neatly into other sections
of the guide. These techniques are considered advanced, so readers should feel comfortable with
the features discussed in previous chapters before proceeding to this chapter.
4 end do
5 !$acc update self(c) async
In the case above, the host thread will enqueue the parallel region into the default asynchronous
queue, then execution will return to the host thread so that it can also enqueue the update, and
finally the CPU thread will continue execution. Eventually, however, the host thread will need
the results computed on the accelerator and copied back to the host using the update, so it must
synchronize with the accelerator to ensure that these operations have finished before attempting to
use the data. The wait directive instructs the runtime to wait for past asynchronous operations to
complete before proceeding. So, the above examples can be extended to include a synchronization before the host uses the data copied back by the update directive.
1 #pragma acc parallel loop async
2 for (int i=0; i<N; i++)
3 {
4 c[i] = a[i] + b[i];
5 }
6 #pragma acc update self(c[0:N]) async
7 #pragma acc wait
For this example we will be modifying a simple application that generates a Mandelbrot set, such as the picture shown above. Since each pixel of the image can be independently calculated, the code
is trivial to parallelize, but because of the large size of the image itself, the data transfer to copy
the results back to the host before writing to an image file is costly. Since this data transfer must
occur, it’d be nice to overlap it with the computation, but as the code is written below, the entire computation must occur before the copy can occur; therefore there is nothing to overlap. (Note: The
mandelbrot function is a sequential function used to calculate the value of each pixel. It is left out
of this chapter to save space, but is included in the full examples.)
1 #pragma acc parallel loop
2 for(int y=0;y<HEIGHT;y++) {
3 for(int x=0;x<WIDTH;x++) {
4 image[y*WIDTH+x]=mandelbrot(x,y);
5 }
6 }
7
The mandelbrot code can use this same technique by chunking up the image generation and data
transfers into smaller, independent pieces. This will be done in multiple steps to reduce the likelihood of introducing an error. The first step is to introduce a blocking loop to the calculation, but
keep the data transfers the same. This will ensure that the work itself is properly divided to give
correct results. After each step the developer should build and run the code to ensure the resulting
image is still correct.
1 num_batches=8
2 batch_size=WIDTH/num_batches
3 do yp=0,num_batches-1
4 ystart = yp * batch_size + 1
5 yend = ystart + batch_size - 1
6 !$acc parallel loop
7 do iy=ystart,yend
8 do ix=1,HEIGHT
9 image(ix,iy) = min(max(int(mandelbrot(ix-1,iy-1)),0),MAXCOLORS)
10 enddo
11 enddo
12 enddo
13
1 num_batches=8
2 batch_size=WIDTH/num_batches
3 call cpu_time(startt)
4 !$acc data create(image)
5 do yp=0,NUM_BATCHES-1
6 ystart = yp * batch_size + 1
7 yend = ystart + batch_size - 1
8 !$acc parallel loop
9 do iy=ystart,yend
10 do ix=1,HEIGHT
11 image(ix,iy) = mandelbrot(ix-1,iy-1)
12 enddo
13 enddo
14 !$acc update self(image(:,ystart:yend))
15 enddo
16 !$acc end data
By the end of this step we are calculating and copying each block of the image independently, but
this is still being done sequentially, each block after the previous. The performance at the end of
this step is generally comparable to the original version.
1 num_batches=8
2 batch_size=WIDTH/num_batches
3 call cpu_time(startt)
4 !$acc data create(image)
5 do yp=0,NUM_BATCHES-1
6 ystart = yp * batch_size + 1
7 yend = ystart + batch_size - 1
8 !$acc parallel loop async(yp)
9 do iy=ystart,yend
10 do ix=1,HEIGHT
11 image(ix,iy) = mandelbrot(ix-1,iy-1)
12 enddo
13 enddo
14 !$acc update self(image(:,ystart:yend)) async(yp)
15 enddo
16 !$acc wait
17 !$acc end data
With this modification it’s now possible for the computational part of one block to operate simultaneously with the data transfer of another. The developer should now experiment with varying block sizes to determine what the optimal value is on the architecture of interest. It’s important to note, however, that on some architectures the cost of creating an asynchronous queue the first time it is
used can be quite expensive. In long-running applications, where the queues may be created once at
the beginning of a many-hour run and reused throughout, this cost is amortized. In short-running
codes, such as the demonstration code used in this chapter, this cost may outweigh the benefit of the
pipelining. Two solutions to this are to introduce a simple block loop at the beginning of the code
that pre-creates the asynchronous queues before the timed section, or to use a modulus operation
to reuse the same smaller number of queues among all of the blocks. For instance, by using the
block number modulus 2 as the asynchronous handle, only two queues will be used and the cost
of creating those queues will be amortized by their reuse. Two queues is generally sufficient to see
a gain in performance, since it still allows computation and updates to overlap, but the developer
should experiment to find the best value on a given machine.
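As a sketch of the modulus technique in C (the block sizes, queue count, and the compute_pixel helper are illustrative, not from the guide's example code):

#define NUM_QUEUES 2

for (int block = 0; block < num_blocks; block++)
{
    int start = block * block_size;
    /* reusing only two queues still lets one block's computation overlap
       another block's transfer, while amortizing the queue-creation cost */
    #pragma acc parallel loop async(block % NUM_QUEUES)
    for (int i = start; i < start + block_size; i++)
        image[i] = compute_pixel(i);
    #pragma acc update self(image[start:block_size]) async(block % NUM_QUEUES)
}
#pragma acc wait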
Below we see a screenshot showing before and after profiles from applying these changes to the code
on an NVIDIA GPU platform. Similar results should be possible on any accelerated platform. Using
16 blocks and two asynchronous queues, as shown below, roughly a 2X performance improvement
was observed on the test machine over the performance without pipelining.
Figure 7.3: NVIDIA Nsight Systems profiler timelines for the original mandelbrot code (Top) and
the pipelined code using 16 blocks over 2 asynchronous queues (Bottom).
7.2.1 acc_get_num_devices()
The acc_get_num_devices() routine may be used to query how many devices of a given architecture are available on the system. It accepts one parameter of type acc_device_t and returns an integer count of the available devices of that type.
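A small sketch of how this query might be used together with acc_set_device_num (NVIDIA devices are assumed for illustration):

#include <openacc.h>
#include <stdio.h>

int main(void)
{
    /* ask the runtime how many NVIDIA devices are visible */
    int ndev = acc_get_num_devices(acc_device_nvidia);
    printf("Found %d NVIDIA device(s)\n", ndev);

    /* make the last device current for subsequent compute regions */
    if (ndev > 0)
        acc_set_device_num(ndev - 1, acc_device_nvidia);
    return 0;
}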
1 batch_size=WIDTH/num_batches
2 do gpu=0,1
3 call acc_set_device_num(gpu,acc_device_nvidia)
4 !$acc enter data create(image)
5 enddo
6 do yp=0,NUM_BATCHES-1
7 call acc_set_device_num(mod(yp,2),acc_device_nvidia)
8 ystart = yp * batch_size + 1
9 yend = ystart + batch_size - 1
10 !$acc parallel loop async(yp)
11 do iy=ystart,yend
12 do ix=1,HEIGHT
13 image(ix,iy) = mandelbrot(ix-1,iy-1)
14 enddo
15 enddo
16 !$acc update self(image(:,ystart:yend)) async(yp)
17 enddo
18 do gpu=0,1
19 call acc_set_device_num(gpu,acc_device_nvidia)
20 !$acc wait
21 !$acc exit data delete(image)
22 enddo
Although this example over-allocates device memory by placing the entire image array on the device,
it does serve as a simple example of how the acc_set_device_num() routine can be used to operate
on a machine with multiple devices. In production codes the developer will likely want to partition
the work such that only the parts of the array needed by a specific device are available there.
Additionally, by using CPU threads it may be possible to issue work to the devices more quickly and improve overall performance. Figure 7.3 shows a screenshot of NVIDIA Nsight Systems with the mandelbrot computation divided across two NVIDIA GPUs.
Appendix A

References
• OpenACC.org
• OpenACC on the NVIDIA Parallel Forall Blog
• PGI Insider Newsletter
• OpenACC at the NVIDIA GPU Technology Conference
• OpenACC on Stack Exchange
• OpenACC Community Slack