KNL Presentation TACC Summer School - Shared


Introduction to the
Intel® Xeon Phi™ processor
(codename “Knights Landing”)

Dr. Harald Servat - HPC Software Engineer
Data Center Group – Innovation Performing and Architecture Group

Summer School in Advanced Scientific Computing 2016


February 21st, 2016 – Braga, Portugal
Legal Disclaimers
Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies
depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at
[intel.com].

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as
SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors
may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and
roadmaps.

Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes.
Any differences in your system hardware, software or configuration may affect your actual performance.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies
depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at
https://www-ssl.intel.com/content/www/us/en/high-performance-computing/path-to-aurora.html.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual
performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance
and benchmark results, visit http://www.intel.com/performance.

Intel, the Intel logo, Xeon, Intel Xeon Phi, Intel Optane and 3D XPoint are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the
United States or other countries.

*Other names and brands may be claimed as the property of others.

© 2016 Intel Corporation. All rights reserved.

2
Agenda
1. Introduction
2. Micro-architecture
i. Tile architecture
ii. Untile architecture
3. AVX512 Instruction set
4. High-Bandwidth memory

3
Introduction
Moore’s Law and Parallelism

5
CPU parallelism is already a must
[Chart: symbiotic extension of the Intel® Xeon® processor stack — from Scalar & Single-Threaded, through Vectorized & Single-Threaded and Scalar & Parallelized, to Vectorized & Parallelized code (>100x).]

              Intel® Xeon® processors                                              Intel® Xeon Phi™
            5100     5500     5600     Sandy      Ivy        Haswell   Knights      Knights
            series   series   series   Bridge EP  Bridge EP  EP        Corner       Landing¹
Core(s)     2        4        6        8          12         18        61           72
Threads     2        8        12       16         24         36        244          288
SIMD Width  128      128      128      256        256        256       512          512

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause
the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Configurations: Intel Performance Projections as of Q1 2015. For more information go to
http://www.intel.com/performance. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. Copyright © 2015, Intel Corporation.
Chart illustrates relative performance of the Binomial Options DP workload running on Intel® Xeon® processors across adjacent generations.
*Product specifications for launched and shipped products are available on ark.intel.com.
1 Not launched

3
Parallelism and Performance

Peak GFLOP/s (single precision) = clock rate x cores x ops/cycle x SIMD width

• 2 x Intel® Xeon® Processor E5-2670 v2:
  2.5 GHz x 2 x 10 cores x 2 ops x 8 SIMD = 800 GFLOP/s

• Intel® Xeon Phi™ Coprocessor 7120P:
  1.24 GHz x 61 cores x 2 ops x 16 SIMD = 2420.48 GFLOP/s

[Chart: peak single-precision GFLOP/s, log scale]
                 Scalar & ST   Vector & ST   Scalar & MT   Vector & MT
2 x Processor    5             40            100           800
Coprocessor      1.24          19.84         151.28        2420.48

7
Parallelism and Performance

• On modern hardware, performance = parallelism.
• A flat programming model on parallel hardware is not effective.
• Parallel programming is not optional.
• Codes need to be made parallel (“modernized”) before they can be tuned for the hardware (“optimized”).

[Chart: the same peak GFLOP/s comparison as on the previous slide.]

8
A Paradigm Shift

Knights Landing: a server processor combining on-package memory and an optional integrated fabric.

• Memory bandwidth: 400+ GB/s STREAM
• Memory capacity: over 25x the KNC coprocessor
• Power efficiency: over 25% better than the coprocessor card
• I/O: 200 Gb/s per direction with the integrated fabric
• Cost: less costly than discrete parts
• Flexibility: limitless configurations
• Density: 3+ KNL with fabric in 1U

9
Knights Landing (Host or PCIe)

Knights Landing host processors (optionally with integrated fabric, “+F”)
• Host processor for the Groveport platform
• Solution for future clusters with both Xeon and Xeon Phi

Knights Landing PCIe coprocessors
• Ingredient of the Grantley and Purley platforms
• Solution for general-purpose servers and workstations

5
Stampede-KNL (or Stampede 1.5)

Intel S7200AP cluster:
• 484x Intel Xeon Phi 7250 (68 cores @ 1.4 GHz)
• 32,912 total cores
• 1,474 teraFLOP/s
• Intel Omni-Path fabric

5
Intel® Xeon Phi™ x200 processor:
micro-architecture

Intel® Xeon Phi™ Processor Family Architecture Overview
Codenamed “Knights Landing” or KNL

• Comprises 38 physical tiles, of which at most 36 are active
  • The remaining tiles allow for yield recovery
• Introduces a new 2D cache-coherent mesh interconnect (the untile) connecting:
  • Tiles
  • Memory controllers
  • I/O controllers
  • Other agents
• Enhanced Intel® Atom™ cores based on the Silvermont microarchitecture

[Figure: KNL package. Each tile has 2 cores (2 VPUs each) sharing 1 MB of L2 and a HUB; the package also integrates MCDRAM and DDR4 interfaces. Legend: Tile; EDC (embedded DRAM controller); IMC (integrated memory controller); IIO (integrated I/O controller).]

13
Tile architecture

KNL processor tile
[Figure: one tile = 2 cores with 2 VPUs each, a shared 1 MB L2 and a HUB.]

Tile
• 2 cores, each with 2 vector processing units (VPUs)
• 1 MB L2 cache shared between the cores

Core
• Binary compatible with Xeon
• Enhanced Silvermont (Atom)-based core for HPC with 4 threads
• Out-of-order core
• 2-wide decode, 6-wide execute (2 integer, 2 FP, 2 memory), 2-wide retire

2 VPUs
• 512-bit SIMD (AVX-512), 32 SP / 16 DP elements per unit
• Legacy x87, SSE, AVX and AVX2 support

15
KNL processor tile: caches and TLBs

Structure   Characteristics
L1 cache    32 KB 8-way instruction cache; 32 KB 8-way data cache
L2 cache    1 MB 16-way unified cache
L1 TLB      48-entry fully-associative ITLB; 64-entry 8-way DTLB for 4 KB pages
L2 TLB      256-entry 8-way DTLB for 4 KB pages; 128-entry 8-way DTLB for 2/4 MB pages;
            16-entry fully-associative DTLB for 1 GB pages

16
KNL processor tile: Caching/Home Agent (CHA, or HUB)

• 2D-mesh connection point for the tile
• Holds part of the distributed tag directory that keeps the L2 caches coherent
• MESIF protocol

… more to come in the UNTILE section!

17
Intel® XEON PHI™ PROCESSOR EARLY SHIP TURBO SPECS

        TDP   Active  Active  Single-Tile  All-Tile     TDP Freq  AVX Freq  Mesh Freq  OPIO    DDR
SKU     (W)   Tiles   Cores   Turbo (GHz)  Turbo (GHz)  (GHz)     (GHz)     (GHz)      (GT/s)  (MHz)
7250    215   34      68      1.6          1.5          1.4       1.2       1.7        7.2     2400
7230    215   32      64      1.5          1.4          1.3       1.1       1.7        7.2     2400
7210    215   32      64      1.5          1.4          1.3       1.1       1.6        6.4     2133

Turbo is an opportunistic increase in frequency over the TDP frequency.
• KNL has two turbo modes:
  • Single-tile turbo – any one tile increases its frequency while all other tiles are in the C6 idle state
  • All-tile turbo – all tiles run at an increased frequency
• Frequency varies depending on the workload, power budget and SKU
• When running AVX-intensive code, the frequency may decrease
  • Compare the UNHALTED_CORE_CYCLES and UNHALTED_REFERENCE_CYCLES performance counters

OPIO is Intel's On-Package I/O technology for high-speed connections between multiple chips on a single
package.

18
Intel® Xeon Phi™ x200 vs Silvermont comparison

FEATURE                                  SILVERMONT             KNL
Vector ISA                               Up to Intel® SSE4.2    Up to Intel® AVX-512
(Enhanced) Vector Processing Unit (VPU)  2x 128-bit VPU / core  2x 512-bit VPU / core
Physical / virtual addressing            36 bits / 48 bits      46 bits / 48 bits
HW-based gather/scatter                  No                     Yes
Reorder buffer entries                   32                     72
Threads / core                           1                      4
Memory operations per cycle              1 (16 bytes each)      2 (64 bytes each)
Vector/FP reservation station policy     In-order               Out-of-order
L1 cache size                            24 KB                  32 KB
L2-cache-to-D-cache bandwidth            1x                     2x
Micro-TLB                                32                     64
Data TLB                                 4K pages: 128          4K pages: 256
                                         2M pages: 16           2M pages: 128
                                         1G pages: N/A          1G pages: 16

19
Knights Landing vs. Knights Corner Feature Comparison

FEATURE               INTEL® XEON PHI™ COPROCESSOR 7120P            KNIGHTS LANDING PRODUCT FAMILY
Processor cores       Up to 61 enhanced P54C cores                  Up to 72 enhanced Silvermont cores
Key core features     In-order; 4 threads/core (back-to-back        Out-of-order; 4 threads/core; 2-wide
                      scheduling restriction); 2-wide
Peak FLOPS1           SP: 2.416 TFLOPS, DP: 1.208 TFLOPS            Up to 3x higher
Scalar performance1   1x                                            Up to 3x higher
Vector ISA            x87 (no Intel® SSE or MMX™), Intel IMCI       x87, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2,
                                                                    Intel® AVX, AVX2, AVX-512 (no Intel® TSX)
Interprocessor bus    Bidirectional ring interconnect               Mesh of rings interconnect

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software,
operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that
product when combined with other products. See benchmark tests and configurations in the speaker notes. For more information go to http://www.intel.com/performance
1- Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes.
Any differences in your system hardware, software or configuration may affect your actual performance.

20
KNL Core organization
Front-End Unit (FEU)

• Decodes and allocates 2 instructions/cycle
• 32-entry instruction queue
• Gskew-style branch predictor

21
KNL Core organization
Allocation Unit

• 72-entry reorder buffer (ROB)
• 72 rename buffers
• 16 store data buffers
• 4 gather/scatter data tables

22
KNL Core organization
Integer Execution Unit (IEU)

• 2 IEUs per core
  • 2 uops dispatched / cycle
  • 12 entries each
• Out-of-order
• Most operations take 1 cycle
  • Some operations take 3-5 cycles and are supported on only one IEU (e.g. multiplies)

23
KNL Core organization
Memory Execution Unit (MEU)

• Dispatches 2 uops per cycle (loads or stores)
  • In-order dispatch
  • But they can complete in any order
• 2 64-byte load ports and 1 64-byte store port for the D-cache
• L2 supports 1 line read and ½ line write per cycle
• L1-to-L2 prefetcher
  • Tracks up to 48 access patterns
• Fast unaligned and cache-line-split support
• Fast gather/scatter support

24
KNL Core organization
Vector Processing Unit (VPU)

• 2 VPUs tightly integrated with the core pipeline
  • 20-entry FP reservation station
  • Executes out-of-order
• 2 512-bit FMAs / cycle
• Most FP operations take 6 cycles
• One VPU additionally provides legacy x87, MMX and a subset of SSE instructions

25
KNL Core organization
Retire

2 instructions / cycle

26
KNL Hardware Threading

• 4 threads per core (SMT)
• Resources dynamically partitioned
  • Re-order buffer, rename buffers, reservation stations
  • Partitioning changes as threads wake up and go to sleep
• Resources shared
  • Caches
  • TLBs
• Several thread-selection points in the pipeline
  • Maximize throughput while being fair
  • Account for available resources, stalls and forward progress

27
Taking benefit of the core
Threading
• Ensure that thread affinities are set.
• Understand affinity and how it affects your application (i.e. which threads share data?).
• Understand how threads share core resources (a small run-time check is sketched right after this list).
• An individual thread has the highest performance when running alone in a core.
• Running 2 or 4 threads in a core may result in higher per-core performance but lower per-thread performance.
• Due to resource partitioning, a 3-thread configuration has fewer aggregate resources than 1, 2 or 4 threads
  per core; 3 threads in a core is unlikely to perform better than 2 or 4 threads.
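The minimal C sketch below (an illustration, not part of the original deck; it assumes Linux with the GNU sched_getcpu() extension) prints the logical CPU each OpenMP thread lands on, which is a quick way to confirm that affinity settings behave as expected:

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>   /* sched_getcpu() - GNU extension */
#include <omp.h>

/* Print which logical CPU each OpenMP thread is running on. */
int main(void)
{
    #pragma omp parallel
    {
        #pragma omp critical
        printf("thread %d of %d runs on logical CPU %d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}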

Vectorization
• Prefer AVX-512 instructions and avoid mixing SSE, AVX and AVX-512 instructions.
• Avoid cache-line splits; align data structures to 64 bytes.
• Avoid gathers/scatters; replace them with shuffles/permutes for known sequences.
• Use hardware transcendentals (fast-math) whenever possible.
• AVX-512 achieves best performance when not using masking.
• KNC intrinsic code is unlikely to generate optimal KNL code; recompile from high-level language source.

28
Data Locality: Nested Parallelism
• Recall that KNL cores are grouped into tiles, with two cores sharing an L2.

• Effective L2 capacity depends on locality:
  • 2 cores sharing no data => 2 x 512 KB
  • 2 cores sharing all data => 1 x 1 MB

• Ensuring good locality (e.g. through blocking or nested parallelism) is likely to improve performance.

#pragma omp parallel for num_threads(ntiles)
for (int i = 0; i < N; ++i)
{
    #pragma omp parallel for num_threads(8)
    for (int j = 0; j < M; ++j)
    {
        /* work on block (i, j); the inner team of 8 threads shares one tile's L2 */
    }
}

29
unTile architecture

KNL PROCESSOR UNTILE
• Comprises a mesh connecting the tiles with the MCDRAM and DDR memories
  • Also with the I/O controllers and other agents
• The Caching Home Agent (CHA) holds a portion of the distributed tag directory and serves as the
  connection point between tile and mesh
  • There is no L3 cache as in Xeon
• Cache coherence uses the MESIF protocol (Modified, Exclusive, Shared, Invalid, Forward)

[Figure: untile diagram. Legend: Tile; EDC (embedded DRAM controller); IMC (integrated memory controller); IIO (integrated I/O controller).]

31
KNL MESH INTERCONNECT

Mesh of rings
• Every row and column is a ring
• YX routing: go in Y, turn, then go in X
  • 1 cycle to go in Y, 2 cycles to go in X
• Messages arbitrate at injection and on the turn

The mesh runs at a fixed frequency of 1.7 GHz.

Distributed-directory coherence protocol.

KNL supports three cluster modes:
1) All-to-all
2) Quadrant
3) Sub-NUMA clustering
The selection is done at boot time.

[Figure: mesh diagram with the tiles, EDCs (MCDRAM), iMCs (DDR), IIO (PCIe) and OPIO blocks.]


Cluster mode: all-to-all

• Addresses are uniformly hashed across all distributed directories
• No affinity between tile, directory and memory
• Lower-performance mode compared to the other modes; mainly for fall-back

Typical read L2 miss:
1. L2 miss encountered
2. Send request to the distributed directory
3. Miss in the directory; forward to memory
4. Memory sends the data to the requestor

[Figure: mesh diagram showing the four steps of the L2-miss flow crossing the whole chip.]

33
Cluster mode: quadrant

• Chip divided into four quadrants
• Affinity between the directory and memory
• Lower latency and higher bandwidth than all-to-all
• Software-transparent

Typical read L2 miss:
1. L2 miss encountered
2. Send request to the distributed directory
3. Miss in the directory; forward to memory
4. Memory sends the data to the requestor

[Figure: mesh diagram showing the L2-miss flow contained within one quadrant.]

34
Cluster mode: sub-NUMA clustering (SNC4)

• Each quadrant (cluster) is exposed as a separate NUMA domain to the OS
• Analogous to a 4-socket Xeon
• Software-visible

Typical read L2 miss:
1. L2 miss encountered
2. Send request to the distributed directory
3. Miss in the directory; forward to memory
4. Memory sends the data to the requestor

[Figure: mesh diagram showing the L2-miss flow contained within one sub-NUMA cluster.]

35
How to DETECT / USE the cluster modes?

Detection
• CPUID instruction
• /proc/cpuinfo
• hwloc commands
  • lstopo --no-io
• numactl / libnuma

Use
• numactl / libnuma
• Memkind
• MPI/OpenMP

[Diagram: lstopo output for a 2-socket, 8-core-per-socket Xeon with HT and 16 GB per socket.]

36
Implications for Parallel Runtimes
OpenMP
• No changes for the all-to-all or quadrant modes
• In SNC4 with multiple MPI ranks per processor, use the affinity descriptors
  • compact, scatter
• In SNC4 with no MPI, NUMA bindings need to be handled manually

MPI
• Use the existing (Intel) MPI mechanisms for affinity control
  - I_MPI_PIN, I_MPI_PIN_MODE, I_MPI_PIN_PROCESSOR_LIST, I_MPI_PIN_DOMAIN
• Don't limit yourself to 1 MPI rank per SNC

37
Intel® Xeon Phi™ x200 processor:
AVX512 Instruction set

SIMD: Single Instruction, Multiple Data

for (i=0; i<n; i++)
    z[i] = x[i] + y[i];

• Scalar mode: one instruction produces one result (e.g. vaddss, vaddsd).
• Vector (SIMD) mode: one instruction can produce multiple results (e.g. vaddps, vaddpd).

[Figure: element-wise addition of vectors X and Y, contrasting one result per instruction with eight results per AVX instruction.]
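As an illustration (not part of the original slides), the same loop can be written with AVX-512 intrinsics so that each vaddps adds 16 single-precision elements; the sketch assumes n is a multiple of 16 and the arrays do not overlap:

#include <immintrin.h>

/* Element-wise z = x + y with 512-bit vectors (16 floats per instruction). */
void vadd512(float *z, const float *x, const float *y, int n)
{
    for (int i = 0; i < n; i += 16) {
        __m512 vx = _mm512_loadu_ps(&x[i]);
        __m512 vy = _mm512_loadu_ps(&y[i]);
        _mm512_storeu_ps(&z[i], _mm512_add_ps(vx, vy));
    }
}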
KNL Hardware instruction set

Instruction sets by generation:
  E5-2600 (SNB1):     x87/MMX, SSE, AVX
  E5-2600 v3 (HSW1):  x87/MMX, SSE, AVX, AVX2, BMI, TSX
  KNL (Xeon Phi2):    x87/MMX, SSE, AVX, AVX2, BMI (no TSX),
                      plus AVX-512F, AVX-512CD, AVX-512PF, AVX-512ER
                      (each reported under a separate CPUID bit)

KNL implements all legacy instructions
• Legacy binaries run without recompilation
• KNC binaries require recompilation

KNL introduces the AVX-512 extensions
• 512-bit FP/integer vectors
• 32 registers & 8 mask registers
• Gather/scatter
• AVX-512CD – conflict detection: improves vectorization
• AVX-512PF – prefetch: gather and scatter prefetch
• AVX-512ER – exponential and reciprocal instructions

1. Previous code names of Intel® Xeon® processors
2. Xeon Phi = Intel® Xeon Phi™ processor

40
KNL AVX-512 Instruction Set

AVX-512PF – Intel AVX-512 Prefetch Instructions (PFI): prefetch sparse vector memory locations in advance
  PREFETCHWT1              Prefetch cache line into the L2 cache with intent to write
  VGATHERPF{D,Q}{0,1}PS    Prefetch vector of D/Qword indexes into the L1/L2 cache
  VSCATTERPF{D,Q}{0,1}PS   Prefetch vector of D/Qword indexes into the L1/L2 cache with intent to write

AVX-512ER – Intel AVX-512 Exponential and Reciprocal Instructions (ERI): fast math functions for transcendental sequences
  VEXP2{PS,PD}             Computes approximation of 2^x with maximum relative error of 2^-23
  VRCP28{PS,PD}            Computes approximation of the reciprocal with maximum relative error of 2^-28 before rounding
  VRSQRT28{PS,PD}          Computes approximation of the reciprocal square root with maximum relative error of 2^-28 before rounding

AVX-512CD – Intel AVX-512 Conflict Detection Instructions (CDI): automatic conflict detection for alias disambiguation
  VPCONFLICT{D,Q}          Detect duplicate values within a vector and create conflict-free subsets
  VPLZCNT{D,Q}             Count the number of leading zero bits in each element
  VPBROADCASTM{B2Q,W2D}    Broadcast vector mask into vector elements

41
Motivation for Conflict Detection

Sparse computations are common in HPC, but hard to vectorize due to race conditions.
Consider the “scatter” or “histogram” problem:

for (i=0; i<16; i++) { A[B[i]]++; }

index   = vload &B[i]          // Load 16 B[i] indices
old_val = vgather A, index     // Grab A[B[i]]
new_val = vadd old_val, +1.0   // Compute new values
vscatter A, index, new_val     // Update A[B[i]]

• There is a problem if two vector lanes try to increment the same histogram bin.
• The code above is wrong if any values within B[i] are duplicated
  − only one update from the repeated index would be registered!
• A solution to the problem is to avoid executing the gather-op-scatter sequence
  with a vector of indexes that contains conflicts.
Conflict Detection Instructions (CDI)

AVX-512 CDI introduces three new instructions:

• vpconflict{d,q} zmm1 {k1}, zmm2/mem
  Compares (for equality) each element in zmm2 with “earlier” elements and outputs a bit vector.
• vpbroadcastm{b2q,w2d} zmm1, k2
• vplzcnt{d,q} zmm1 {k1}, zmm2/mem
  These manipulate the bit vector from vpconflict to construct a useful mask, together with
  vptestnm{d,q} k2 {k1}, zmm1, zmm2/mem (from AVX-512F).

43
Conflict Detection Instructions (CDI)

Vectorization with these instructions looks like this (pseudocode):

for (int i = 0; i < N; i += 16)
{
    __m512i indices = vload &B[i]
    vpconflictd comparisons, indices    // comparisons = __m512i
    __mmask to_do = 0xffff;

    do
    {
        vpbroadcastmd tmp, to_do                    // tmp = __m512i
        vptestnmd mask {to_do}, comparisons, tmp    // do work for an element only if there are no
                                                    // conflicts with remaining earlier elements
        do_work(mask);                              // gather-compute-scatter
        to_do ^= mask;
    } while (to_do);
}
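For reference, a hedged C intrinsics sketch of the same pattern (not part of the original slides) for a 32-bit integer histogram, assuming AVX-512F/CD support and N a multiple of 16:

#include <immintrin.h>

/* Vectorized histogram update with conflict detection: lanes with
   duplicate indices are retired over several passes of the do/while loop. */
void histogram16(int *A, const int *B, int N)
{
    for (int i = 0; i < N; i += 16) {
        __m512i idx       = _mm512_loadu_si512(&B[i]);
        __m512i conflicts = _mm512_conflict_epi32(idx);            /* vpconflictd */
        __mmask16 to_do   = 0xffff;
        do {
            __m512i tmp    = _mm512_broadcastmw_epi32(to_do);      /* vpbroadcastmw2d */
            __mmask16 mask = _mm512_mask_testn_epi32_mask(to_do, conflicts, tmp); /* vptestnmd */
            __m512i vals   = _mm512_mask_i32gather_epi32(_mm512_setzero_si512(),
                                                         mask, idx, A, 4);
            vals = _mm512_add_epi32(vals, _mm512_set1_epi32(1));
            _mm512_mask_i32scatter_epi32(A, mask, idx, vals, 4);
            to_do ^= mask;                                         /* retire conflict-free lanes */
        } while (to_do);
    }
}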

44
Conflict Detection Instructions (CDI) – Example

[Figure: worked example on an index register containing repeated values (… 2 2 1 2).]

1) Compare (for equality) each element with “earlier” elements and output a bit vector (vpconflict).
2) Combine the bit vector with the to_do mask to work out which elements can be updated in this iteration (vpbroadcast + vptest).
3) Loop until to_do is all zeros.
45
Conflict Detection Instructions (CDI) – Compiler

The Intel® compiler (15.0 onwards) will recognise potential run-time conflicts and
generate vpconflict loops automatically:

for (int i = 0; i < N; ++i)
{
    histogram[index[i]]++;
}

Such loops would originally have resulted in:

remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed FLOW dependence between histogram line 22 and histogram line 22
remark #15346: vector dependence: assumed ANTI dependence between histogram line 22 and histogram line 22

If you know that conflicts cannot occur, you should still say so
(e.g. with #pragma ivdep, #pragma simd or #pragma omp simd), as sketched below.
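For instance (an illustrative sketch, not from the original slides), if index[] is known to contain no duplicates, an explicit SIMD pragma lets the compiler vectorize without generating the vpconflict loop:

/* Safe only if the programmer guarantees that index[] holds no duplicates:
   the pragma asserts that the iterations are independent. */
#pragma omp simd
for (int i = 0; i < N; ++i)
    histogram[index[i]]++;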

46
Guidelines for writing vectorizable code

Prefer simple “for” or “DO” loops.

Write straight-line code. Try to avoid:
• function calls (unless inlined or SIMD-enabled functions)
• branches that can't be treated as masked assignments

Avoid dependencies between loop iterations
• or at least, avoid read-after-write dependencies

Prefer arrays to the use of pointers
• Without help, the compiler often cannot tell whether it is safe to vectorize code containing pointers.
• Try to use the loop index directly in array subscripts, instead of incrementing a separate counter for use as an
  array address.
• Disambiguate function arguments, e.g. -fargument-noalias

Use efficient memory accesses
• Favor inner loops with unit stride
• Minimize indirect addressing, a[i] = b[ind[i]]
• Align your data consistently where possible (to 16-, 32- or 64-byte boundaries); a short example combining
  these guidelines follows.
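A minimal sketch (illustrative, not part of the original deck) of a loop the compiler can vectorize directly, combining unit stride, restrict-qualified pointers and 64-byte alignment:

/* z[i] = a*x[i] + y[i]; the arrays are assumed to be 64-byte aligned,
   e.g. allocated with _mm_malloc(n * sizeof(float), 64). */
void saxpy(float * restrict z, const float * restrict x,
           const float * restrict y, float a, int n)
{
    #pragma omp simd aligned(x, y, z : 64)
    for (int i = 0; i < n; ++i)
        z[i] = a * x[i] + y[i];
}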
Processor Dispatch (fat binaries)

The compiler can generate multiple code paths:
• Optimized for different processors
• Only when likely to help performance
• One default code path, plus one or more optimized paths
• Optimized paths are for Intel processors only

Examples:
• -axavx
  • Default path optimized for Intel® SSE2 (Intel or non-Intel) (-msse2)
  • Second path optimized for Intel® AVX (code name Sandy Bridge, etc.)
• -axcore-avx2,avx -xsse4.2
  • Default path optimized for Intel® SSE4.2 (code names Nehalem, Westmere)
  • Second path optimized for Intel® AVX (code name Sandy Bridge, etc.)
  • Third path optimized for Intel® AVX2 (code name Haswell)
Intel® Compiler Switches Targeting Intel® AVX-512

Switch               Description
-xmic-avx512         KNL only. Not a fat binary.
-xcore-avx512        Future Xeon only. Not a fat binary.
-xcommon-avx512      AVX-512 subset common to both. Not a fat binary.
-axmic-avx512 etc.   Fat binaries. Allows targeting KNL and other Intel® Xeon® processors.

• Don't use -mmic with KNL!
• Best would be to use -axcore-avx512,mic-avx512 -xcommon-avx512
• All supported in the 16.0 and forthcoming 17.0 compilers
• Binaries built for earlier Intel® Xeon® processors will run unchanged on KNL.
• Binaries built for Intel® Xeon Phi™ coprocessors will not.

49
Intel® Xeon Phi™ x200 processor:
High-Bandwidth memory

Intel® Xeon Phi™ x200 processor Overview

Compute (up to 72 cores)
• Intel® Xeon® processor binary-compatible
• 3+ TFLOPS; 3x single-thread (ST) performance vs. KNC
• 2D mesh architecture
• Out-of-order cores

Platform memory
• Up to 384 GB DDR4

On-package memory (MCDRAM)
• Up to 16 GB at launch
• Over 5x STREAM bandwidth vs. DDR4 at launch

Integrated fabric

[Figure: processor package diagram.]
51
Heterogeneous memory Architecture

The Intel® Xeon Phi™ x200 processor uses two types of memory:
• standard DDR4 (DIMMs)
• high-bandwidth MCDRAM (on-package)

What are the usage models?

How can software benefit from them?

52
MCDRAM Modes

Cache mode
• MCDRAM acts as a direct-mapped cache in front of DDR, with 64-byte cache lines
• Inclusive cache
• Misses have higher latency: they need an MCDRAM access plus a DDR access
• No source changes needed; automatically managed by hardware as if it were an LLC

Flat mode
• MCDRAM (8/16 GB) mapped into the physical address space, alongside up to 384 GB of DRAM
• Exposed as a NUMA node
• Use numactl --hardware or lscpu to display the configuration
• Accessed through the memkind library or numactl

Hybrid
• Combination of the above two; split options are 25/75% or 50/50%
  (8 or 12 GB flat + 8 or 4 GB cache)
• E.g. 8 GB in cache mode + 8 GB in flat mode

53
MCDRAM as Cache

Upside
• No software modifications required
• Bandwidth benefit (over DDR)

Downside
• Higher latency for DDR access (i.e. for cache misses)
• Sustained misses limited by DDR bandwidth
• All memory is transferred as DDR -> MCDRAM -> L2
• Less addressable memory

MCDRAM in Flat Mode

Upside
• Isolation of MCDRAM for high-performance application use only; the OS and applications use DDR memory
• No software modifications required if the data fits in MCDRAM
• Lower latency (i.e. no MCDRAM cache misses)
• Maximum addressable memory

Downside
• Generally, software modifications (or an interposer library) are required to use DDR and MCDRAM in the same application
• Which data structures should go where?
• MCDRAM is a finite resource, and tracking it adds complexity

54
Take-away message: Cache vs Flat Mode

Configurations compared: DDR only, MCDRAM as cache, MCDRAM only, and flat DDR + MCDRAM (hybrid) – the recommended configuration.

• Software effort: no software changes are required for DDR-only, MCDRAM-as-cache or MCDRAM-only use;
  flat DDR + MCDRAM requires changing allocations for bandwidth-critical data.
• Performance: DDR only is not peak performance; MCDRAM only gives the best performance but limited
  memory capacity; flat DDR + MCDRAM gives optimal hardware utilization plus the opportunity for new algorithms.

55
How to ACCESS MCDRAM in Flat Mode?

New mechanisms proposed by Intel (the scope of this presentation):
• Memkind library
  • User-space library
  • C/C++ language interface
  • Needs source modification
• Fortran FASTMEM compiler directives
  • Internally use the memkind library
  • Ongoing language standardization efforts
• AutoHBW for C/C++
  • Interposer library based on memkind
  • No source modification needed (based on the size of allocations)
  • No fine control over individual allocations

Standard OS mechanisms:
• numactl
• Direct OS system calls
  • mmap(2), mbind(2)
  • Not the preferred method: page-only granularity, OS serialization, no pool management

*Other names and brands may be claimed as the property of others.

56
Memkind library Architecture

57
A Heterogeneous Memory Management Framework

The memkind library
– Defines a plug-in architecture
– Each plug-in is called a “kind” of memory
– Built on top of jemalloc
– High-level memory management functions can be overridden
– Available via github: https://github.com/memkind

The hbwmalloc interface
– The high-bandwidth memory interface
– Implemented on top of memkind
– Simplifies memkind plug-in (kind) selection
– Uses all kinds featuring on-package memory on the Knights Landing architecture
– Provides support for 2 MB and 1 GB pages
– Selects the fallback behavior when on-package memory does not exist or is exhausted
– Checks for the existence of on-package memory (see the sketch below)
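A minimal sketch of that existence check (an illustration, not from the original slides), using hbw_check_available() from <hbwmalloc.h>:

#include <stdio.h>
#include <hbwmalloc.h>

/* Probe for usable on-package (high-bandwidth) memory at run time.
   hbw_check_available() returns 0 when HBW memory can be allocated. */
int main(void)
{
    if (hbw_check_available() == 0)
        printf("High-bandwidth (MCDRAM) memory is available\n");
    else
        printf("No HBW memory: allocations follow the fallback policy\n");
    return 0;
}

As with the later examples, link with the memkind library.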
memkind – “Kinds” of Memory

Many “kinds” of memory are supported by memkind:
– MEMKIND_DEFAULT
  Default allocation using standard memory and the default page size.
– MEMKIND_HBW
  Allocate from the closest high-bandwidth memory NUMA node at the time of allocation.
– MEMKIND_HBW_PREFERRED
  If there is not enough HBW memory to satisfy the request, fall back to standard memory.
– MEMKIND_HUGETLB
  Allocate using 2 MB pages.
– MEMKIND_GBTLB
  Allocate using 1 GB pages.
– MEMKIND_INTERLEAVE
  Allocate pages interleaved across all NUMA nodes.
– MEMKIND_PMEM
  Allocate from a file-backed heap.

The page-size and interleave kinds can all be combined with HBW (e.g. MEMKIND_HBW_HUGETLB);
all but INTERLEAVE can be combined with HBW_PREFERRED.

59
memkind & hbwmalloc – Early Experiments

AutoHBW: interposer library that comes with memkind
• Automatically allocates memory from MCDRAM
• If a heap allocation (e.g. malloc/calloc) is larger than a given threshold

LD_PRELOAD=libautohbw.so ./application

Run-time configuration options are passed through environment variables:
– AUTO_HBW_SIZE=x[:y]
  Any allocation larger than x (and smaller than y) should be allocated in HBW memory.
– AUTO_HBW_MEM_TYPE
  Sets the “kind” of HBW memory that should be allocated (e.g. MEMKIND_HBW).
– AUTO_HBW_LOG and AUTO_HBW_DEBUG for extra information.

It is easy to integrate similar functionality into other libraries, C++ allocators, etc.

60
memkind – C “Hello World!” Example

#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <memkind.h>

int main(int argc, char **argv)
{
    const size_t size = 512;
    char *default_str = NULL;
    char *hbw_str = NULL;

    default_str = (char *)memkind_malloc(MEMKIND_DEFAULT, size);
    if (default_str == NULL) {
        perror("memkind_malloc()");
        fprintf(stderr, "Unable to allocate default string\n");
        return errno ? -errno : 1;
    }
    hbw_str = (char *)memkind_malloc(MEMKIND_HBW, size);
    if (hbw_str == NULL) {
        perror("memkind_malloc()");
        fprintf(stderr, "Unable to allocate hbw string\n");
        return errno ? -errno : 1;
    }
    sprintf(default_str, "Hello world from standard memory\n");
    sprintf(hbw_str, "Hello world from high bandwidth memory\n");
    fprintf(stdout, "%s", default_str);
    fprintf(stdout, "%s", hbw_str);
    memkind_free(MEMKIND_HBW, hbw_str);          /* free with the kind used to allocate */
    memkind_free(MEMKIND_DEFAULT, default_str);
    return 0;
}

Link with the memkind library (otherwise linking fails due to unresolved references).

Based on:
https://github.com/memkind/memkind/blob/dev/examples/hello_memkind_example.c

61
Using the Memkind Library to Access MCDRAM (Fortran)
• Unlike C, Fortran does not rely on a malloc-type API to perform allocations of dynamic memory
• The intrinsic ALLOCATE statement is used for all dynamic allocations
• The intrinsic DEALLOCATE statement deallocates memory
• NOTE: the Fortran 2003 standard requires ALLOCATABLE variables to be automatically deallocated
  when they go out of scope

c     Declare arrays to be dynamic
      REAL, ALLOCATABLE :: A(:), B(:), C(:)

!DEC$ ATTRIBUTES FASTMEM :: A

      NSIZE=1024
c     Allocate array 'A' from MCDRAM
      ALLOCATE (A(1:NSIZE))

c     Allocate arrays that will come from DDR
      ALLOCATE (B(NSIZE), C(NSIZE))

Link with the memkind library (otherwise the arrays are silently allocated in DDR).

62
Fortran FASTMEM STATUS
• To use ATTRIBUTES FASTMEM, the ALLOCATABLE attribute is required
• In version 16 of the Intel compiler, FASTMEM is not allowed for:
  • Variables with the POINTER attribute
        REAL, POINTER :: array(:)
  • Automatic (stack) variables
        SUBROUTINE SUB1(n)
          INTEGER :: n
          REAL :: A(n,n)
          ...
        END SUBROUTINE
  • Components of derived types
        TYPE mytype
          REAL, ALLOCATABLE :: array(:)
        END TYPE mytype
  • COMMON blocks
        INTEGER, PARAMETER :: NARR = 1000
        REAL ARRAY(NARR,NARR)
        COMMON /MATRIX/ ARRAY, N

63
Running memkind
The following commands allocate all of the application's data from DDR (NUMA node 0), except for
the MEMKIND allocations, which go to HBW memory (NUMA node 1):

export MEMKIND_HBW_NODES=1
numactl --membind=0 --cpunodebind=0 <binary>

64
hbwmalloc – C “Hello World!” Example

The fallback policy is controlled with hbw_set_policy():
– HBW_POLICY_BIND
– HBW_POLICY_PREFERRED
– HBW_POLICY_INTERLEAVE

Page sizes can be passed to hbw_posix_memalign_psize():
– HBW_PAGESIZE_4KB
– HBW_PAGESIZE_2MB
– HBW_PAGESIZE_1GB

#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <hbwmalloc.h>

int main(int argc, char **argv)
{
    const size_t size = 512;
    char *default_str = NULL;
    char *hbw_str = NULL;

    default_str = (char *)malloc(size);
    if (default_str == NULL) {
        perror("malloc()");
        fprintf(stderr, "Unable to allocate default string\n");
        return errno ? -errno : 1;
    }
    hbw_str = (char *)hbw_malloc(size);
    if (hbw_str == NULL) {
        perror("hbw_malloc()");
        fprintf(stderr, "Unable to allocate hbw string\n");
        return errno ? -errno : 1;
    }
    sprintf(default_str, "Hello world from standard memory\n");
    sprintf(hbw_str, "Hello world from high bandwidth memory\n");
    fprintf(stdout, "%s", default_str);
    fprintf(stdout, "%s", hbw_str);
    hbw_free(hbw_str);
    free(default_str);
    return 0;
}

Link with the memkind library (otherwise linking fails due to unresolved references).
Based on: https://github.com/memkind/memkind/blob/dev/examples/hello_hbw_example.c

65
hbwmalloc – C++ STL allocator example

#include <iostream>
#include <vector>

#include <hbw_allocator.h>

int main(int argc, char **argv)
{
    const int length = 10;

    std::vector<double, hbw::allocator<double> > data(length);

    for (int i = 0; i < length; ++i) {
        data[i] = (double)(i);
    }

    std::cout << data[length-1] << std::endl;

    return 0;
}

Link with the memkind library (otherwise linking fails due to unresolved references).

66
STANDARD WAYS OF Accessing MCDRAM

MCDRAM is exposed to the OS/software as a NUMA node.
The numactl utility is the standard utility for NUMA system control:
• See “man numactl”
• Run “numactl --hardware” to see the NUMA configuration of your system

[Figure: a KNL in flat mode appears as 2 NUMA nodes (node 0 = cores + DDR, node 1 = MCDRAM),
much like a 2-socket Intel® Xeon® system with one DDR node per socket.]

If the total memory footprint of your app is smaller than the size of MCDRAM:
• Use numactl to allocate all of its memory from MCDRAM
  • numactl --membind=mcdram_id <command>
  • where mcdram_id is the ID of the MCDRAM “node”
• Allocations that don't fit into MCDRAM make the application fail

If the total memory footprint of your app is larger than the size of MCDRAM:
• You can still use numactl to allocate part of your app in MCDRAM
  • numactl --preferred=mcdram_id <command>
• Allocations that don't fit into MCDRAM spill over to DDR

67
Software visible memory configuration

[Figures: NUMA topologies reported by the OS (DDR and MCDRAM nodes) for:]
1. Cache mode / Quadrant
2. Flat mode / Quadrant
3. Cache mode / SNC-4
4. Flat mode with sub-NUMA clustering (SNC-4)

68
Obtaining the Memkind Library

Homepage: http://memkind.github.io/memkind

Download a package
• On RHEL* 7:
  • yum install epel-release; yum install memkind
• For other distros, install from
  http://download.opensuse.org/repositories/home:/cmcantalupo/

Alternatively, you can build from source
• git clone https://github.com/memkind/memkind.git
• See the CONTRIBUTING file for build instructions
  • Building from source is needed to get the AutoHBW library
• Requires libnuma (development files and libraries)
  • yum install numactl-devel

*Other names and brands may be claimed as the property of others.

69
MKL and HBM

The Intel MKL 2017 memory manager tries to allocate memory in MCDRAM through the
memkind library.

By default the amount of MCDRAM available to Intel MKL is unlimited. To control the
amount of MCDRAM available to Intel MKL, use either of the following (a sketch of the first option follows):

• Call mkl_set_memory_limit(MKL_MEM_MCDRAM, <limit in MB>)
• Set the MKL_FAST_MEMORY_LIMIT=<limit in MB> environment variable
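A minimal sketch of the call-based approach (illustrative, not from the original slides; it assumes the MKL 2017 C interface, with the limit given in megabytes):

#include <mkl.h>

int main(void)
{
    /* Allow Intel MKL to place at most 2048 MB of its buffers in MCDRAM. */
    mkl_set_memory_limit(MKL_MEM_MCDRAM, 2048);

    /* ... call MKL routines as usual ... */
    return 0;
}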

70
MPI and HBM (Intel MPI, PSXE 2017 Beta U1)

The I_MPI_HBW_POLICY environment variable controls the memory policy for MPI processes.
There are three kinds of memory it can control:
• User code memory (emulates the “numactl -m” command)
• MPI buffers
• User buffers that are allocated by Intel MPI for MPI_Win_allocate_shared / MPI_Win_allocate

The suggested format for the environment variable is the following:
I_MPI_HBW_POLICY=<user buffers policy>[,[mpi buffers policy][,win_allocate policy]]

where each of the comma-separated values can be one of the following:

hbw_preferred    Memory allocations go first to the process-local MCDRAM, then to local DRAM
hbw_bind         Only allocate memory on the process-local MCDRAM
hbw_interleave   Memory is interleaved between the MCDRAM and DRAM on the local SNC node

71
Take-away messages

The Intel Xeon Phi x200 is a highly capable processor:
• It can run your already-built Xeon applications
• It is a highly parallel system that can execute many processes/threads at the same time
• It has a highly vectorized architecture that performs multiple operations per instruction
• It has High-Bandwidth Memory to narrow the memory gap
• … all with a low energy-consumption footprint

Intel tools, libraries and compilers are here to help you take advantage of all these properties.

72
Thank you!
Questions?
