Performance Measurement Tools and Techniques
Pekka Manninen
CSC, the Finnish IT center for science
Part I: Introduction
Motivation
Traditional and petascale optimization
Performance data and optimization
Motivation
Traditional optimization process overview
Parallel optimization stage
- Apply parallelization
- Perform parallel optimization
Petascale optimization flowchart
1. Choose algorithms, data structures, and a parallelization strategy
2. Develop correct code
3. Apply parallelization and assess the scalability of the parallel code
4. If the scaling is not yet sufficient, reduce the overhead from suboptimal communication and load imbalance, then reassess
5. Once scaling is sufficient, link optimized libraries and assess scalability again
6. Iterate until the performance has converged: the result is optimized code
Optimization considerations
1. Load balance
2. Minimal time dedicated to communication
   - Minimize communication
   - Overlap computation and communication (see the sketch after this list)
3. CPU utilization
   - Optimal memory access (cache utilization)
   - Pipeline performance (branch prediction, prefetching)
   - SIMD operations
4. Efficient I/O
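As an aside on point 2, here is a minimal sketch of overlapping computation with communication using nonblocking MPI; the halo-exchange setting, the function name halo_step, and all buffer names are hypothetical illustrations, not taken from any code discussed here:

  #include <mpi.h>

  /* Start nonblocking halo transfers, update the interior points that
     do not depend on the incoming halo, then wait for completion. */
  void halo_step(double *halo_in, double *halo_out, int n,
                 int left, int right, double *interior, int m)
  {
      MPI_Request reqs[2];

      MPI_Irecv(halo_in,  n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
      MPI_Isend(halo_out, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

      for (int i = 0; i < m; ++i)        /* computation overlaps the transfers */
          interior[i] *= 0.5;

      MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* halo now safe to use */
  }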
Examples of relevant measurements
Performance data collection
Two "dimensions":
• When collection is triggered: asynchronously (sampling) or synchronously (code instrumentation)
• How data is recorded: as a profile or as a trace file

  Acquisition       Presentation
  Sampling          Profile
  Instrumentation   Timeline
Things to be kept in mind
Part II: Cray performance analysis tools as an example
Overview
Usage
Cray performance analysis infrastructure
CrayPat
• pat_build - a utility for application instrumentation that requires no source code modification
• Transparent run-time library for measurements
• pat_report - generates performance reports and visualization files
• pat_help - interactive help utility
Cray Apprentice2
• An advanced graphical performance analysis and visualization tool
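A typical first-pass workflow combines these utilities as follows; this is a sketch only, with a hypothetical executable name my_app and process count, and the exact data-file names vary from run to run:

  pat_build -O apa my_app        # create the instrumented binary my_app+pat
  aprun -n 512 ./my_app+pat      # run it; the run produces an .xf data file
  pat_report my_app+pat+*.xf     # generate a report plus .ap2 and .apa files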
Instrumentation with pat_build
Trace groups
biolibs    Cray bioinformatics library routines
blas       BLAS subroutines
heap       dynamic heap
io         stdio and sysio trace groups
lapack     LAPACK subroutines
math       ANSI math
mpi        MPI statistics
omp        OpenMP API
omp-rtl    OpenMP runtime library
pthreads   POSIX threads
shmem      SHMEM API
stdio      all functions that accept or return the FILE* construct
sysio      I/O system calls
system     system calls
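For example, to trace the mpi and io groups (my_app is a hypothetical executable name, not from the case study):

  pat_build -g mpi,io my_app

pat_build writes a new instrumented executable, my_app+pat, and leaves the original untouched.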
Instrumentation with pat_build
Fine-grained instrumentation
Fortran:
  include "pat_apif.h"
  ...
  call PAT_region_begin(id, "label", ierr)
  <code segment>
  call PAT_region_end(id, ierr)
C:
  #include <pat_api.h>
  ...
  ierr = PAT_region_begin(id, "label");
  <code segment>
  ierr = PAT_region_end(id);
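Below is a complete, minimal C sketch of the same API; the region id 1, the label "compute", and the loop body are arbitrary choices for illustration, and the header is available once the performance tools environment is loaded:

  #include <pat_api.h>

  int main(void)
  {
      double sum = 0.0;

      /* Open a user-defined region; the same id must be
         passed to the matching PAT_region_end call. */
      PAT_region_begin(1, "compute");

      for (long i = 0; i < 100000000L; ++i)   /* work to be measured */
          sum += 1.0e-9 * (double)i;

      PAT_region_end(1);

      return sum > 0.0 ? 0 : 1;
  }

In the pat_report output, the measurements for this segment appear under the label "compute".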
Collecting data
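Data collection happens when the instrumented binary is run; its behavior is steered with PAT_RT_* runtime environment variables, such as the hardware counter selection that also appears in the APA file later in this talk. A hypothetical launch (my_app+pat being the pat_build output) might look like:

  export PAT_RT_HWPC=2          # collect hardware performance counter group 2
  aprun -n 512 ./my_app+pat     # the run writes the measurement (.xf) data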
Analysis with pat_report
ct -O calltree
defaults Tables that would appear by default.
heap -O heap_program,heap_hiwater,heap_leaks
io -O read_stats,write_stats
lb -O load_balance
load_balance -O lb_program,lb_group,lb_function
mpi -O mpi_callers
Analysis with pat_report
load_balance_sm Load Balance with MPI Sent Message Stats
loops Loop Stats from -hprofile_generate
mpi_callers MPI Sent Message Stats by Caller
mpi_dest_bytes MPI Sent Message Stats by Destination PE
mpi_dest_counts MPI Sent Message Stats by Destination PE
mpi_rank_order Suggested MPI Rank Order
mpi_sm_rank_order Sent Message Stats and Suggested MPI Rank Order
pgo_details Loop Stats detail from -hprofile_generate
profile Profile by Function Group and Function
profile+src Profile by Group, Function, and Line
profile_pe.th Profile by Function Group and Function
profile_pe_th Profile by Function Group and Function
profile_th_pe Profile by Function Group and Function
program_time Program Wall Clock Time
read_stats File Input Stats by Filename
samp_profile Profile by Function
samp_profile+src Profile by Group, Function, and Line
thread_times Program Wall Clock Time
write_stats File Output Stats by Filename
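The keywords in these tables are given to pat_report via the -O option, several at a time if desired; the data file name below is hypothetical:

  pat_report -O mpi_callers,load_balance my_app+pat+12345-67sdt.xf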
Part III: Case study
Workplan
Demonstration of tools
Analysis
Case study: CP2K performance analysis
Sampling profile output
Table 1: Profile by Group, Function, and Line
Sampling profile
APA file

# You can edit this file, if desired, and use it
# to reinstrument the program for tracing like this:
#
#     pat_build -O cp2k.pat+pat+4779-896sdt.apa
#
# These suggested trace options are based on data from:
#
#     /wrk/pmannin/Prace/CP2K/cp2k/tests/scala/512/sg/cp2k.pat+pat+4779-896sdt.ap2,
#     /wrk/pmannin/Prace/CP2K/cp2k/tests/scala/512/sg/cp2k.pat+pat+4779-896sdt.xf
# ----------------------------------------------------------------------
# HWPC group to collect by default.
#   -Drtenv=PAT_RT_HWPC=0 # Summary with instructions metrics.
-Drtenv=PAT_RT_HWPC=2
# ----------------------------------------------------------------------
# Libraries to trace.
-g mpi
# ----------------------------------------------------------------------

(Note: select the hardware counter group here, or give the PAPI calls explicitly; here group 2 is collected. Insert the desired trace groups under "Libraries to trace".)
APA file continued
Performance analysis

Table 1: Profile by Function Group and Function
(Note how the instrumentation affects the performance.)

Experiment=1 / Group / Function / PE='HIDE'
========================================================================
Totals for program
------------------------------------------------------------------------
Time% 100.0%
Time 1527.856590
Imb.Time --
Imb.Time% --
Calls 167616828
REQUESTS_TO_L2:DATA 12.127M/sec 16047354721 req
DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 8.449M/sec 11180256256 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 3.139M/sec 4153438330 fills
PAPI_L1_DCA 1519.962M/sec 2011407347660 refs
User time (approx) 1323.327 secs 3043652030586 cycles
Cycles 1323.327 secs 3043652030586 cycles
User time (approx) 1323.327 secs 3043652030586 cycles
Utilization rate 86.6%
LD & ST per D1 miss 131.18 refs/miss
D1 cache hit ratio 99.2%
LD & ST per D2 miss 484.28 refs/miss
D2 cache hit ratio 72.9%
D1+D2 cache hit ratio 99.8%
Effective D1+D2 Reuse 7.57 refs/byte
System to D1 refill 3.139M/sec 4153438330 lines
System to D1 bandwidth 191.567MB/sec 265820053147 bytes
L2 to Dcache bandwidth 515.661MB/sec 715536400364 bytes
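The derived metrics are consistent with the raw counters above; for the program totals, counting a D1 miss as a refill from either L2 or system memory:

  D1 misses           = 11180256256 + 4153438330 = 15333694586 fills
  LD & ST per D1 miss = 2011407347660 / 15333694586 ≈ 131.2 refs/miss
  D1 cache hit ratio  = 1 - 15333694586 / 2011407347660 ≈ 99.2%
  D2 cache hit ratio  = 1 - 4153438330 / 15333694586 ≈ 72.9%
  Utilization rate    = 1323.327 / 1527.857 ≈ 86.6%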
MPI_SYNC
(Note: this is all load imbalance!)
------------------------------------------------------------------------
Time% 51.9%
Time 792.264406
Imb.Time --
Imb.Time% --
Calls 347497
REQUESTS_TO_L2:DATA 0.474M/sec 373392744 req
DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 0.371M/sec 291730786 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 0.025M/sec 19302776 fills
PAPI_L1_DCA 1444.128M/sec 1136805936028 refs
User time (approx) 787.192 secs 1810541859938 cycles
Cycles 787.192 secs 1810541859938 cycles
User time (approx) 787.192 secs 1810541859938 cycles
Utilization rate 99.4%
LD & ST per D1 miss 3654.93 refs/miss
D1 cache hit ratio 100.0%
LD & ST per D2 miss 58893.39 refs/miss
D2 cache hit ratio 93.8%
D1+D2 cache hit ratio 100.0%
Effective D1+D2 Reuse 920.21 refs/byte
System to D1 refill 0.025M/sec 19302776 lines
System to D1 bandwidth 1.497MB/sec 1235377634 bytes
L2 to Dcache bandwidth 22.619MB/sec 18670770284 bytes
USER
------------------------------------------------------------------------
Time% 31.1%
Time 475.803299
Imb.Time --
Imb.Time% --
Calls 141943266
REQUESTS_TO_L2:DATA 43.794M/sec 13550844676 req
DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 31.395M/sec 9714170564 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 10.629M/sec 3288776269 fills
PAPI_L1_DCA 1861.004M/sec 575832343661 refs
User time (approx) 309.420 secs 711666603720 cycles
Cycles 309.420 secs 711666603720 cycles
User time (approx) 309.420 secs 711666603720 cycles
Utilization rate 65.0%
LD & ST per D1 miss 44.28 refs/miss
D1 cache hit ratio 97.7%
LD & ST per D2 miss 175.09 refs/miss
D2 cache hit ratio 74.7%
(Note: cache statistics; there is room to improve both L1 and L2 utilization.)
D1+D2 cache hit ratio 99.4%
Effective D1+D2 Reuse 2.74 refs/byte
System to D1 refill 10.629M/sec 3288776269 lines
System to D1 bandwidth 648.732MB/sec 210481681247 bytes
L2 to Dcache bandwidth 1916.183MB/sec 621706916112 bytes
Table 2: Load Balance with MPI Sent Message Stats
Table 3: MPI Sent Message Stats by Caller
Table 5: Heap Stats during Main Program
Table 7: File Input Stats by Filename
Table 13: Load Balance across PE's
...
========================================================================
pe.2
(Note: compare the number of calls and the time spent.)
------------------------------------------------------------------------
Time% 0.2%
Time 1312.932656
Calls 318135867
REQUESTS_TO_L2:DATA 20.391M/sec 19283776533 req
DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 14.358M/sec 13578864880 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 5.022M/sec 4749508915 fills
PAPI_L1_DCA 1616.882M/sec 1529092888549 refs
User time (approx) 945.705 secs 2175120420977 cycles
Cycles 945.705 secs 2175120420977 cycles
User time (approx) 945.705 secs 2175120420977 cycles
Utilization rate 72.0%
LD & ST per D1 miss 83.43 refs/miss
D1 cache hit ratio 98.8%
LD & ST per D2 miss 321.95 refs/miss
D2 cache hit ratio 74.1%
D1+D2 cache hit ratio 99.7%
Effective D1+D2 Reuse 5.03 refs/byte
System to D1 refill 5.022M/sec 4749508915 lines
System to D1 bandwidth 306.530MB/sec 303968570560 bytes
L2 to Dcache bandwidth 876.371MB/sec 869047352320 bytes
========================================================================
Table 15: Load Balance across PE's by Function
...
========================================================================
USER / PW_NN_COMPOSE_R_WORK.in.PW_SPLINE_UTILS / pe.216
------------------------------------------------------------------------
Time% 0.0%
Time 342.454473
Calls 576
REQUESTS_TO_L2:DATA 1084579465 req
DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 1011292217 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 42841511 fills
PAPI_L1_DCA 177829252850 refs
User time (approx) 311763447500 cycles
User time (approx) 311763447500 cycles
========================================================================
(Note: why doesn't this routine employ all the processes?)
...
========================================================================
USER / PW_NN_COMPOSE_R_WORK.in.PW_SPLINE_UTILS / pe.361
------------------------------------------------------------------------
Time% 0.0%
Time 0.001128
Calls 576
REQUESTS_TO_L2:DATA 23918 req
DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 21906 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 1474 fills
PAPI_L1_DCA 1334202 refs
User time (approx) 0 cycles
User time (approx) 0 cycles
Table 15: Load Balance across PE's by Function
...
========================================================================
USER / PW_COMPOSE_STRIPE.in.PW_SPLINE_UTILS / pe.408
------------------------------------------------------------------------
Time% 0.0%
Time 181.970770
Calls 283115520
REQUESTS_TO_L2:DATA 1921736331 req
DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 1725999972 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 2667162 fills
PAPI_L1_DCA 299580285973 refs
User time (approx) 152982200000 cycles
User time (approx) 152982200000 cycles
========================================================================
USER / PW_COMPOSE_STRIPE.in.PW_SPLINE_UTILS / pe.127
------------------------------------------------------------------------
Time 0.000000
(Note: or this one?)
========================================================================
USER / PW_COMPOSE_STRIPE.in.PW_SPLINE_UTILS / pe.425
------------------------------------------------------------------------
Time 0.000000
========================================================================
Table 20: HW Performance Counter Data
Experiment=1 / PE='HIDE'
========================================================================
Totals for program
------------------------------------------------------------------------
REQUESTS_TO_L2:DATA 12.130M/sec 16047201243 req
DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 8.451M/sec 11180135091 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 3.140M/sec 4153365739 fills
PAPI_L1_DCA 1520.227M/sec 2011113494425 refs
User time (approx) 1322.904 secs 3042678719117 cycles
Cycles 1322.904 secs 3042678719117 cycles
User time (approx) 1322.904 secs 3042678719117 cycles
Utilization rate 86.6%
LD & ST per D1 miss 131.16 refs/miss
D1 cache hit ratio 99.2%
LD & ST per D2 miss 484.21 refs/miss
D2 cache hit ratio 72.9%
D1+D2 cache hit ratio 99.8%
Effective D1+D2 Reuse 7.57 refs/byte
System to D1 refill 3.140M/sec 4153365739 lines
System to D1 bandwidth 191.625MB/sec 265815407319 bytes
L2 to Dcache bandwidth 515.821MB/sec 715528645846 bytes
========================================================================
Table 21: Sent Message Stats and Suggested MPI Rank Order
Sent Msg Total Bytes per MPI rank

  Max           Avg          Min          Max   Min
  Total Bytes   Total Bytes  Total Bytes  Rank  Rank
  288785211952  55382285231  43823619696  8     511
------------------------------------------------------------
Dual core: Sent Msg Total Bytes per node

  Rank   Max           Avg           Min          Max Node  Min Node
  Order  Total Bytes   Total Bytes   Total Bytes  Ranks     Ranks
  d      332608831648  110764570462  91978303568  8,511     178,269
  u      332608831648  110764570462  91978303568  8,511     178,269
  2      332753774320  110764570462  91702010072  503,8     409,102
  0      335218166984  110764570462  89167505112  8,264     255,511
  1      573867407720  110764570462  88992394184  8,9       502,503

(Note: however, this custom placement of ranks did not do much in practice.)
------------------------------------------------------------
Quad core: Sent Msg Total Bytes per node

  Rank   Max            Avg           Min           Max Node       Min Node
  Order  Total Bytes    Total Bytes   Total Bytes   Ranks          Ranks
  d      424587135216   221529140924  184022706744  8,511,178,269  344,209,68,473
  u      424587135216   221529140924  184022706744  8,511,178,269  344,209,68,473
  2      662859801904   221529140924  183540043808  502,9,503,8    374,137,375,136
  0      665630542072   221529140924  181038379488  8,264,9,265    246,502,247,503
  1      1145872706976  221529140924  178161802184  8,9,10,11      500,501,502,503

(Note: according to CrayPat, the default SMP-like placement of ranks is the worst choice.)
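On Cray systems a suggested order can usually be tried without recompiling, by placing the custom rank list in a MPICH_RANK_ORDER file and selecting it with an environment variable (details vary by MPT version; the launch line is hypothetical):

  export MPICH_RANK_REORDER_METHOD=3   # read the custom order from ./MPICH_RANK_ORDER
  aprun -n 512 ./my_app+pat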
Visualization with Apprentice2
• Hold the mouse cursor over a slice for more information; clicking will show the load balance of the function (smallest, average, and largest individual times).
• The same profile is available as a list; a click on a function provides its HW counter data.
• HW counter overview: this would be useful if cache-miss or cycle-stall counters have been recorded.
• Routine call-flow window (who calls whom) shows how the execution time is divided: the largest execution time on the left, the smallest on the right.
Final remarks on CP2K analysis