Performance Measurement Tools and Techniques
Pekka Manninen
CSC, the Finnish IT center for science
Part I: Introduction
Motivation
Traditional and petascale optimization
Performance data and optimization
Motivation
Traditional optimization process overview
Parallel optimization stage
- Apply parallelization
- Perform parallel optimization
Petascale optimization flowchart
1. Choose algorithms, data structures, and a parallelization strategy
2. Develop correct code
3. Apply parallelization and assess the scalability of the parallel code
4. If the scaling is not yet sufficient, reduce the overhead from suboptimal communication and load imbalance, then reassess
5. Once scaling is sufficient, link optimized libraries and assess scalability again
6. Iterate until the performance has converged: the result is optimized code
Optimization considerations
1. Load balance
2. Minimal time dedicated to communication
   - Minimize communication
   - Overlap computation and communication (see the sketch after this list)
3. CPU utilization
   - Optimal memory access (cache utilization)
   - Pipeline performance (branch prediction, prefetching)
   - SIMD operations
4. Efficient I/O
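As an aside on point 2, here is a minimal sketch of overlapping computation with communication using nonblocking MPI; the halo-exchange setting, the function name halo_step, and all buffer names are hypothetical illustrations, not taken from any code discussed here:

  #include <mpi.h>

  /* Start nonblocking halo transfers, update the interior points that
     do not depend on the incoming halo, then wait for completion. */
  void halo_step(double *halo_in, double *halo_out, int n,
                 int left, int right, double *interior, int m)
  {
      MPI_Request reqs[2];

      MPI_Irecv(halo_in,  n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
      MPI_Isend(halo_out, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

      for (int i = 0; i < m; ++i)        /* computation overlaps the transfers */
          interior[i] *= 0.5;

      MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* halo now safe to use */
  }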
Examples of relevant measurements
Performance data collection
Two "dimensions":
• When collection is triggered: asynchronously (sampling) or synchronously (code instrumentation)
• How data is recorded: as a profile or as a trace file

  Acquisition       Presentation
  Sampling          Profile
  Instrumentation   Timeline
Things to be kept in mind
Part II: Cray performance analysis tools as an example
Overview
Usage
Cray performance analysis infrastructure
CrayPat
• pat_build - a utility for application instrumentation that requires no source code modification
• Transparent run-time library for measurements
• pat_report - generates performance reports and visualization files
• pat_help - interactive help utility
Cray Apprentice2
• An advanced graphical performance analysis and visualization tool
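A typical first-pass workflow combines these utilities as follows; this is a sketch only, with a hypothetical executable name my_app and process count, and the exact data-file names vary from run to run:

  pat_build -O apa my_app        # create the instrumented binary my_app+pat
  aprun -n 512 ./my_app+pat      # run it; the run produces an .xf data file
  pat_report my_app+pat+*.xf     # generate a report plus .ap2 and .apa files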
Instrumentation with pat_build
Trace groups
biolibs    Cray bioinformatics library routines
blas       BLAS subroutines
heap       dynamic heap
io         stdio and sysio trace groups
lapack     LAPACK subroutines
math       ANSI math
mpi        MPI statistics
omp        OpenMP API
omp-rtl    OpenMP runtime library
pthreads   POSIX threads
shmem      SHMEM API
stdio      all functions that accept or return the FILE* construct
sysio      I/O system calls
system     system calls
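For example, to trace the mpi and io groups (my_app is a hypothetical executable name, not from the case study):

  pat_build -g mpi,io my_app

pat_build writes a new instrumented executable, my_app+pat, and leaves the original untouched.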
Instrumentation with pat_build
Fine-grained instrumentation
Fortran:
  include "pat_apif.h"
  ...
  call PAT_region_begin(id, "label", ierr)
  <code segment>
  call PAT_region_end(id, ierr)
C:
  #include <pat_api.h>
  ...
  ierr = PAT_region_begin(id, "label");
  <code segment>
  ierr = PAT_region_end(id);
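Below is a complete, minimal C sketch of the same API; the region id 1, the label "compute", and the loop body are arbitrary choices for illustration, and the header is available once the performance tools environment is loaded:

  #include <pat_api.h>

  int main(void)
  {
      double sum = 0.0;

      /* Open a user-defined region; the same id must be
         passed to the matching PAT_region_end call. */
      PAT_region_begin(1, "compute");

      for (long i = 0; i < 100000000L; ++i)   /* work to be measured */
          sum += 1.0e-9 * (double)i;

      PAT_region_end(1);

      return sum > 0.0 ? 0 : 1;
  }

In the pat_report output, the measurements for this segment appear under the label "compute".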
Collecting data
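Data collection happens when the instrumented binary is run; its behavior is steered with PAT_RT_* runtime environment variables, such as the hardware counter selection that also appears in the APA file later in this talk. A hypothetical launch (my_app+pat being the pat_build output) might look like:

  export PAT_RT_HWPC=2          # collect hardware performance counter group 2
  aprun -n 512 ./my_app+pat     # the run writes the measurement (.xf) data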
Analysis with pat_report
ct -O calltree
defaults Tables that would appear by default.
heap -O heap_program,heap_hiwater,heap_leaks
io -O read_stats,write_stats
lb -O load_balance
load_balance -O lb_program,lb_group,lb_function
mpi -O mpi_callers
Analysis with pat_report
load_balance_sm Load Balance with MPI Sent Message Stats
loops Loop Stats from -hprofile_generate
mpi_callers MPI Sent Message Stats by Caller
mpi_dest_bytes MPI Sent Message Stats by Destination PE
mpi_dest_counts MPI Sent Message Stats by Destination PE
mpi_rank_order Suggested MPI Rank Order
mpi_sm_rank_order Sent Message Stats and Suggested MPI Rank Order
pgo_details Loop Stats detail from -hprofile_generate
profile Profile by Function Group and Function
profile+src Profile by Group, Function, and Line
profile_pe.th Profile by Function Group and Function
profile_pe_th Profile by Function Group and Function
profile_th_pe Profile by Function Group and Function
program_time Program Wall Clock Time
read_stats File Input Stats by Filename
samp_profile Profile by Function
samp_profile+src Profile by Group, Function, and Line
thread_times Program Wall Clock Time
write_stats File Output Stats by Filename
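The keywords in these tables are given to pat_report via the -O option, several at a time if desired; the data file name below is hypothetical:

  pat_report -O mpi_callers,load_balance my_app+pat+12345-67sdt.xf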
Part III: Case study
Workplan
Demonstration of tools
Analysis
Case study: CP2K performance analysis
Sampling profile output
Table 1: Profile by Group, Function, and Line
Sampling profile
APA file

# You can edit this file, if desired, and use it
# to reinstrument the program for tracing like this:
#
#     pat_build -O cp2k.pat+pat+4779-896sdt.apa
#
# These suggested trace options are based on data from:
#
#     /wrk/pmannin/Prace/CP2K/cp2k/tests/scala/512/sg/cp2k.pat+pat+4779-896sdt.ap2,
#     /wrk/pmannin/Prace/CP2K/cp2k/tests/scala/512/sg/cp2k.pat+pat+4779-896sdt.xf
# ----------------------------------------------------------------------
# HWPC group to collect by default.
#   -Drtenv=PAT_RT_HWPC=0 # Summary with instructions metrics.
-Drtenv=PAT_RT_HWPC=2
# ----------------------------------------------------------------------
# Libraries to trace.
-g mpi
# ----------------------------------------------------------------------

(Note: select the hardware counter group here, or give the PAPI calls explicitly; here group 2 is collected. Insert the desired trace groups under "Libraries to trace".)
APA file continued
Performance analysis

Table 1: Profile by Function Group and Function
(Note how the instrumentation affects the performance.)

Experiment=1 / Group / Function / PE='HIDE'
========================================================================
Totals for program
------------------------------------------------------------------------
Time% 100.0%
Time 1527.856590
Imb.Time --
Imb.Time% --
Calls 167616828
REQUESTS_TO_L2:DATA 12.127M/sec 16047354721 req
DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 8.449M/sec 11180256256 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 3.139M/sec 4153438330 fills
PAPI_L1_DCA 1519.962M/sec 2011407347660 refs
User time (approx) 1323.327 secs 3043652030586 cycles
Cycles 1323.327 secs 3043652030586 cycles
User time (approx) 1323.327 secs 3043652030586 cycles
Utilization rate 86.6%
LD & ST per D1 miss 131.18 refs/miss
D1 cache hit ratio 99.2%
LD & ST per D2 miss 484.28 refs/miss
D2 cache hit ratio 72.9%
D1+D2 cache hit ratio 99.8%
Effective D1+D2 Reuse 7.57 refs/byte
System to D1 refill 3.139M/sec 4153438330 lines
System to D1 bandwidth 191.567MB/sec 265820053147 bytes
L2 to Dcache bandwidth 515.661MB/sec 715536400364 bytes
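The derived metrics are consistent with the raw counters above; for the program totals, counting a D1 miss as a refill from either L2 or system memory:

  D1 misses           = 11180256256 + 4153438330 = 15333694586 fills
  LD & ST per D1 miss = 2011407347660 / 15333694586 ≈ 131.2 refs/miss
  D1 cache hit ratio  = 1 - 15333694586 / 2011407347660 ≈ 99.2%
  D2 cache hit ratio  = 1 - 4153438330 / 15333694586 ≈ 72.9%
  Utilization rate    = 1323.327 / 1527.857 ≈ 86.6%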
MPI_SYNC
(Note: this is all load imbalance!)
------------------------------------------------------------------------
Time% 51.9%
Time 792.264406
Imb.Time --
Imb.Time% --
Calls 347497
REQUESTS_TO_L2:DATA 0.474M/sec 373392744 req
DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 0.371M/sec 291730786 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 0.025M/sec 19302776 fills
PAPI_L1_DCA 1444.128M/sec 1136805936028 refs
User time (approx) 787.192 secs 1810541859938 cycles
Cycles 787.192 secs 1810541859938 cycles
User time (approx) 787.192 secs 1810541859938 cycles
Utilization rate 99.4%
LD & ST per D1 miss 3654.93 refs/miss
D1 cache hit ratio 100.0%
LD & ST per D2 miss 58893.39 refs/miss
D2 cache hit ratio 93.8%
D1+D2 cache hit ratio 100.0%
Effective D1+D2 Reuse 920.21 refs/byte
System to D1 refill 0.025M/sec 19302776 lines
System to D1 bandwidth 1.497MB/sec 1235377634 bytes
L2 to Dcache bandwidth 22.619MB/sec 18670770284 bytes
USER
------------------------------------------------------------------------
Time% 31.1%
Time 475.803299
Imb.Time --
Imb.Time% --
Calls 141943266
REQUESTS_TO_L2:DATA 43.794M/sec 13550844676 req
DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 31.395M/sec 9714170564 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 10.629M/sec 3288776269 fills
PAPI_L1_DCA 1861.004M/sec 575832343661 refs
User time (approx) 309.420 secs 711666603720 cycles
Cycles 309.420 secs 711666603720 cycles
User time (approx) 309.420 secs 711666603720 cycles
Utilization rate 65.0%
LD & ST per D1 miss 44.28 refs/miss
D1 cache hit ratio 97.7%
LD & ST per D2 miss 175.09 refs/miss
D2 cache hit ratio 74.7%
(Note: cache statistics; there is room to improve both L1 and L2 utilization.)
D1+D2 cache hit ratio 99.4%
Effective D1+D2 Reuse 2.74 refs/byte
System to D1 refill 10.629M/sec 3288776269 lines
System to D1 bandwidth 648.732MB/sec 210481681247 bytes
L2 to Dcache bandwidth 1916.183MB/sec 621706916112 bytes
Table 2: Load Balance with MPI Sent Message Stats
Table 3: MPI Sent Message Stats by Caller
Table 5: Heap Stats during Main Program
Table 7: File Input Stats by Filename
Table 13: Load Balance across PE's
...
========================================================================
pe.2
(Note: compare the number of calls and the time spent.)
------------------------------------------------------------------------
Time% 0.2%
Time 1312.932656
Calls 318135867
REQUESTS_TO_L2:DATA 20.391M/sec 19283776533 req
DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 14.358M/sec 13578864880 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 5.022M/sec 4749508915 fills
PAPI_L1_DCA 1616.882M/sec 1529092888549 refs
User time (approx) 945.705 secs 2175120420977 cycles
Cycles 945.705 secs 2175120420977 cycles
User time (approx) 945.705 secs 2175120420977 cycles
Utilization rate 72.0%
LD & ST per D1 miss 83.43 refs/miss
D1 cache hit ratio 98.8%
LD & ST per D2 miss 321.95 refs/miss
D2 cache hit ratio 74.1%
D1+D2 cache hit ratio 99.7%
Effective D1+D2 Reuse 5.03 refs/byte
System to D1 refill 5.022M/sec 4749508915 lines
System to D1 bandwidth 306.530MB/sec 303968570560 bytes
L2 to Dcache bandwidth 876.371MB/sec 869047352320 bytes
========================================================================
Table 15: Load Balance across PE's by Function
...
========================================================================
USER / PW_NN_COMPOSE_R_WORK.in.PW_SPLINE_UTILS / pe.216
------------------------------------------------------------------------
Time% 0.0%
Time 342.454473
Calls 576
REQUESTS_TO_L2:DATA 1084579465 req
DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 1011292217 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 42841511 fills
PAPI_L1_DCA 177829252850 refs
User time (approx) 311763447500 cycles
User time (approx) 311763447500 cycles
========================================================================
(Note: why doesn't this routine employ all the processes?)
...
========================================================================
USER / PW_NN_COMPOSE_R_WORK.in.PW_SPLINE_UTILS / pe.361
------------------------------------------------------------------------
Time% 0.0%
Time 0.001128
Calls 576
REQUESTS_TO_L2:DATA 23918 req
DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 21906 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 1474 fills
PAPI_L1_DCA 1334202 refs
User time (approx) 0 cycles
User time (approx) 0 cycles
Table 15: Load Balance across PE's by Function
...
========================================================================
USER / PW_COMPOSE_STRIPE.in.PW_SPLINE_UTILS / pe.408
------------------------------------------------------------------------
Time% 0.0%
Time 181.970770
Calls 283115520
REQUESTS_TO_L2:DATA 1921736331 req
DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 1725999972 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 2667162 fills
PAPI_L1_DCA 299580285973 refs
User time (approx) 152982200000 cycles
User time (approx) 152982200000 cycles
========================================================================
USER / PW_COMPOSE_STRIPE.in.PW_SPLINE_UTILS / pe.127
------------------------------------------------------------------------
Time 0.000000
(Note: or this one?)
========================================================================
USER / PW_COMPOSE_STRIPE.in.PW_SPLINE_UTILS / pe.425
------------------------------------------------------------------------
Time 0.000000
========================================================================
Table 20: HW Performance Counter Data
Experiment=1 / PE='HIDE'
========================================================================
Totals for program
------------------------------------------------------------------------
REQUESTS_TO_L2:DATA 12.130M/sec 16047201243 req
DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 8.451M/sec 11180135091 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 3.140M/sec 4153365739 fills
PAPI_L1_DCA 1520.227M/sec 2011113494425 refs
User time (approx) 1322.904 secs 3042678719117 cycles
Cycles 1322.904 secs 3042678719117 cycles
User time (approx) 1322.904 secs 3042678719117 cycles
Utilization rate 86.6%
LD & ST per D1 miss 131.16 refs/miss
D1 cache hit ratio 99.2%
LD & ST per D2 miss 484.21 refs/miss
D2 cache hit ratio 72.9%
D1+D2 cache hit ratio 99.8%
Effective D1+D2 Reuse 7.57 refs/byte
System to D1 refill 3.140M/sec 4153365739 lines
System to D1 bandwidth 191.625MB/sec 265815407319 bytes
L2 to Dcache bandwidth 515.821MB/sec 715528645846 bytes
========================================================================
Table 21: Sent Message Stats and Suggested MPI Rank Order
Sent Msg Total Bytes per MPI rank

  Max           Avg          Min          Max   Min
  Total Bytes   Total Bytes  Total Bytes  Rank  Rank
  288785211952  55382285231  43823619696  8     511
------------------------------------------------------------
Dual core: Sent Msg Total Bytes per node

  Rank   Max           Avg           Min          Max Node  Min Node
  Order  Total Bytes   Total Bytes   Total Bytes  Ranks     Ranks
  d      332608831648  110764570462  91978303568  8,511     178,269
  u      332608831648  110764570462  91978303568  8,511     178,269
  2      332753774320  110764570462  91702010072  503,8     409,102
  0      335218166984  110764570462  89167505112  8,264     255,511
  1      573867407720  110764570462  88992394184  8,9       502,503

(Note: however, this custom placement of ranks did not do much in practice.)
------------------------------------------------------------
Quad core: Sent Msg Total Bytes per node

  Rank   Max            Avg           Min           Max Node       Min Node
  Order  Total Bytes    Total Bytes   Total Bytes   Ranks          Ranks
  d      424587135216   221529140924  184022706744  8,511,178,269  344,209,68,473
  u      424587135216   221529140924  184022706744  8,511,178,269  344,209,68,473
  2      662859801904   221529140924  183540043808  502,9,503,8    374,137,375,136
  0      665630542072   221529140924  181038379488  8,264,9,265    246,502,247,503
  1      1145872706976  221529140924  178161802184  8,9,10,11      500,501,502,503

(Note: according to CrayPat, the default SMP-like placement of ranks is the worst choice.)
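On Cray systems a suggested order can usually be tried without recompiling, by placing the custom rank list in a MPICH_RANK_ORDER file and selecting it with an environment variable (details vary by MPT version; the launch line is hypothetical):

  export MPICH_RANK_REORDER_METHOD=3   # read the custom order from ./MPICH_RANK_ORDER
  aprun -n 512 ./my_app+pat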
Visualization with Apprentice2
• Hold the mouse cursor over a slice for more information; clicking will show the load balance of the function (smallest, average, and largest individual times).
• The same profile is available as a list; a click on a function provides its HW counter data.
• HW counter overview: this would be useful if cache-miss or cycle-stall counters have been recorded.
• Routine call-flow window (who calls whom) shows how the execution time is divided: the largest execution time on the left, the smallest on the right.
Final remarks on CP2K analysis