COMPUTER ORGANIZATION AND DESIGN

The Hardware/Software Interface


6th Edition

Chapter 6
Parallel Processors from
Client to Cloud
§6.1 Introduction
Introduction
 Connecting multiple processors (or cores) to get higher
performance
 Scalability, availability, and power efficiency
1. Task-level (or process-level) parallelism
 High throughput for independent jobs
2. Parallel processing program (or parallel software)
 Single program run on multiple processors
 Multicore microprocessors
 Chips with multiple processors (cores)
 Shared memory processors (SMPs)

Chapter 6 — Parallel Processors from Client to Cloud — 2


Hardware and Software

 Hardware
 Serial: e.g., Pentium 4
 Parallel: e.g., quad-core Xeon e5345
 Software
 Sequential: e.g., matrix multiplication
 Concurrent: e.g., operating system
 Sequential/concurrent software can run on serial/parallel hardware
 Challenge: making effective use of parallel hardware

What We’ve Already Covered
 §2.11: Parallelism and Instructions
 Synchronization

 §3.6: Parallelism and Computer Arithmetic
 Subword Parallelism
 §4.10: Parallelism and Advanced Instruction-Level Parallelism
 §5.10: Parallelism and Memory Hierarchies
 Cache Coherence

§6.2 The Difficulty of Creating Parallel Processing Programs
Parallel Programming
 Parallel software is the problem
 Need to get significant performance improvement
 Otherwise, just use a faster uniprocessor, since it’s
easier!

 Difficulties
 Partitioning
 Coordination
 Communications overhead

Speed-up Challenge: Parallel Program

 Sequential part can limit speedup significantly


 Example: 100 processors, 90× speedup?
 Tnew = Tparallelizable/100 + Tsequential (Amdahl’s Law)

$$\text{Speedup} = \frac{1}{(1 - F_{\text{parallelizable}}) + F_{\text{parallelizable}}/100} = 90$$

 Solving: Fparallelizable = 0.999

 Need sequential part to be 0.1% of original time
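
The 0.999 figure comes from solving the speedup equation for the parallelizable fraction. A sketch of the algebra, writing F for Fparallelizable (a shorthand introduced here only for brevity):

$$
\begin{aligned}
90 &= \frac{1}{(1 - F) + F/100} \\
90\,(1 - F) + 0.9\,F &= 1 \\
90 - 89.1\,F &= 1 \\
F &= \frac{89}{89.1} \approx 0.999
\end{aligned}
$$

So the sequential fraction 1 - F is roughly 0.001, or about 0.1% of the original execution time.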

Scaling Example (1/2)
 Workload: (1) sum of 10 scalars, and (2) 10 × 10 matrix
sum. And, assume that only the matrix sum is parallelizable
 What speed-up do you get with 10 versus 40 processors ?
 Solution:
 Execution time for single processor
 Time = (10 + 100) × tadd = 110 × tadd
 Assumes load can be balanced across processors
 Execution time with 10 processors
 Time = 10 × tadd + (100/10) × tadd = 20 × tadd
 Speedup = 110/20 = 5.5 (55% of potential speed-up)
 Execution time with 40 processors
 Time = 10 × tadd + (100/40) × tadd = 12.5 × tadd
 Speedup = 110/12.5 = 8.8 (22% of potential speed-up)
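
The arithmetic in this example can be reproduced with a short C sketch. The function names `exec_time` and `speedup` are illustrative, not from the slides, and the model assumes perfect load balance with time measured in units of tadd.

```c
#include <assert.h>

/* Execution time, in units of t_add, for a workload with `seq_adds`
   additions that cannot be parallelized and `par_adds` additions that are
   split evenly (perfect load balance assumed) across `procs` processors. */
double exec_time(double seq_adds, double par_adds, int procs)
{
    assert(procs > 0);
    return seq_adds + par_adds / procs;
}

/* Speed-up relative to one processor running the same workload. */
double speedup(double seq_adds, double par_adds, int procs)
{
    return exec_time(seq_adds, par_adds, 1) / exec_time(seq_adds, par_adds, procs);
}
```

With 10 scalar additions and 100 matrix additions, `speedup(10, 100, 10)` evaluates to 5.5 and `speedup(10, 100, 40)` to about 8.8.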

Scaling Example (2/2)
 What if matrix size is 20 × 20?
 Solution
 Execution time for single processor
 Time = (10 + 400) × tadd = 410 × tadd
 Assuming load balanced
 Execution time with 10 processors
 Time = 10 × tadd + (400/10) × tadd = 50 × tadd
 Speedup = 410/50 = 8.2 (82% of potential speed-up)
 Execution time with 40 processors
 Time = 10 × tadd + (400/40) × tadd = 20 × tadd
 Speedup = 410/20 = 20.5 (51% of potential speed-up)

Strong vs Weak Scaling
 Strong scaling: problem size fixed
 As in example
 Weak scaling: problem size proportional to number of
processors
 10 processors, 10 × 10 matrix
 Time = 10 × tadd + (100/10) × tadd = 20 × tadd
 40 processors, 20 × 20 matrix
 Time = 10 × tadd + (400/40) × tadd = 20 × tadd
 Constant performance in this example

Speed-up Challenge: Balancing Load (1/2)
 Assume matrix size is 20 × 20 with 40 processors
 Show the impact on speed-up if load is unbalanced
 Case I : if one processor has 5% of the parallel load
 Case II: if one processor has 12.5% of the parallel load
 Solution
 5% of the parallel load: (5% × 400) × tadd = 20 × tadd
 The other 39 processors share the remaining 380 × tadd
 Execution time = Max(380 × tadd/39, 20 × tadd/1) + 10 × tadd = 30 × tadd
 Speed-up = 410/30 ≈ 14
 The remaining 39 processors are utilized less than half the time

Speed-up Challenge: Balancing Load (2/2)
 Solution (cont.)
 12.5% of the parallel load: (12.5% × 400) × tadd = 50 × tadd
 The other 39 processors share the remaining 350 × tadd
 Execution time = Max(350 × tadd/39, 50 × tadd/1) + 10 × tadd = 60 × tadd
 Speed-up = 410/60 ≈ 7
 The remaining 39 processors are utilized less than 20% of the
time

 A single processor with twice the load of the others cuts speed-up
by a third, and five times the load on just one processor reduces
speed-up by almost a factor of three.
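
The effect of one overloaded processor can be modeled with a small C sketch. The function name `unbalanced_time` and its parameters are hypothetical, and the model assumes the other processors share the remaining work evenly.

```c
#include <assert.h>

/* Execution time in units of t_add when one processor receives
   `hot_fraction` of the `par_adds` parallel additions and the remaining
   `procs - 1` processors share the rest evenly. The parallel phase ends
   when the slowest processor finishes; the `seq_adds` sequential
   additions are added on top. */
double unbalanced_time(double seq_adds, double par_adds, int procs, double hot_fraction)
{
    assert(procs > 1 && hot_fraction >= 0.0 && hot_fraction <= 1.0);
    double hot = hot_fraction * par_adds;
    double others = (par_adds - hot) / (procs - 1);
    double parallel = hot > others ? hot : others;
    return seq_adds + parallel;
}
```

For the 20 × 20 matrix on 40 processors, `unbalanced_time(10, 400, 40, 0.05)` is about 30 and `unbalanced_time(10, 400, 40, 0.125)` is 60. Dividing the single-processor time of 410 by these values gives speed-ups of roughly 14 and 7.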

§6.3 SISD, MIMD, SIMD, SPMD, and Vector
An Alternate Classification
 SISD, MIMD, SIMD, SPMD, and Vector
                                Data Streams
                                Single                    Multiple
Instruction Streams  Single     SISD: Intel Pentium 4     SIMD: SSE instructions of x86
                     Multiple   MISD: No examples today   MIMD: Intel Xeon e5345

 SIMD : Exploit data-level parallelism


 SPMD : Single Program Multiple Data
 A parallel program on a MIMD computer

 Vector : an elegant interpretation of SIMD

Vector Processors
 Highly pipelined function units
 Stream data from/to vector registers to units
 Data collected from memory into registers
 Results stored from registers to memory

 Example: Vector extension to MIPS
 32 × 64-element registers (64-bit elements)
 Vector instructions
 lv, sv: load/store vector
 addv.d: add vectors of double
 addvs.d: add scalar to each element of vector of double

 Significantly reduces instruction-fetch bandwidth


Example: DAXPY (Y = a × X + Y)
 Conventional MIPS code
      l.d   $f0,a($sp)      ;load scalar a
      addiu $t0,$s0,#512    ;upper bound of what to load
loop: l.d   $f2,0($s0)      ;load x(i)
      mul.d $f2,$f2,$f0     ;a × x(i)
      l.d   $f4,0($s1)      ;load y(i)
      add.d $f4,$f4,$f2     ;a × x(i) + y(i)
      s.d   $f4,0($s1)      ;store into y(i)
      addiu $s0,$s0,#8      ;increment index to x
      addiu $s1,$s1,#8      ;increment index to y
      subu  $t1,$t0,$s0     ;compute bound
      bne   $t1,$zero,loop  ;check if done
 Vector MIPS code

      l.d     $f0,a($sp)    ;load scalar a
      lv      $v1,0($s0)    ;load vector x
      mulvs.d $v2,$v1,$f0   ;vector-scalar multiply
      lv      $v3,0($s1)    ;load vector y
      addv.d  $v4,$v2,$v3   ;add y to product
      sv      $v4,0($s1)    ;store the result
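
For reference, the loop that both assembly versions implement can be sketched in C. The function name `daxpy` and its signature are illustrative. The assembly above covers 64 double-precision elements (512 bytes at 8 bytes each), while this sketch takes the length as a parameter.

```c
#include <assert.h>
#include <stddef.h>

/* DAXPY: y[i] = a * x[i] + y[i] for i in [0, n).
   Each iteration is independent of the others (no loop-carried
   dependence), which is what lets a vector unit process many
   elements per instruction. */
void daxpy(size_t n, double a, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The scalar MIPS code executes this loop body once per element, whereas the vector code expresses the whole 64-element operation in six instructions.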

Vector vs. Scalar
 Vector architectures and compilers
 Simplify data-parallel programming
 Explicit statement of absence of loop-carried
dependences
 Reduced checking in hardware
 Regular access patterns benefit from
interleaved and burst memory
 Avoid control hazards by avoiding loops
 More general than ad-hoc media
extensions (such as MMX, SSE)
 Better match with compiler technology

SIMD
 Operate elementwise on vectors of data
 E.g., MMX and SSE instructions in x86
 Multiple data elements in 128-bit wide registers
 All processors execute the same
instruction at the same time
 Each with different data address, etc.
 Simplifies synchronization
 Reduced instruction control hardware
 Works best for highly data-parallel
applications

Vector vs. Multimedia Extensions
 Vector instructions have a variable vector width,
multimedia extensions have a fixed width
 Vector instructions support strided access,
multimedia extensions do not
 Vector units can be a combination of pipelined and
arrayed functional units

§6.4 Hardware Multithreading
Multithreading
 Performing multiple threads of execution in
parallel
 Replicate registers, PC, etc.
 Fast switching between threads
 Fine-grain multithreading
 Switch threads after each cycle
 Interleave instruction execution
 If one thread stalls, others are executed
 Coarse-grain multithreading
 Only switch on long stall (e.g., L2-cache miss)
 Simplifies hardware, but doesn’t hide short stalls
(e.g., data hazards)

Simultaneous Multithreading
 In multiple-issue dynamically scheduled
processor
 Schedule instructions from multiple threads
 Instructions from independent threads execute
when function units are available
 Within threads, dependencies handled by
scheduling and register renaming
 Example: Intel Pentium-4 HT
 Two threads: duplicated registers, shared
function units and caches

Multithreading Example

Future of Multithreading
 Will it survive? In what form?
 Power considerations ⇒ simplified
microarchitectures
 Simpler forms of multithreading
 Tolerating cache-miss latency
 Thread switch may be most effective
 Multiple simple cores might share
resources more effectively

§6.5 Multicore and Other Shared Memory Multiprocessors
Shared Memory
 SMP: shared memory multiprocessor
 Hardware provides single physical
address space for all processors
 Synchronize shared variables using locks
 Memory access time
 UMA (uniform) vs. NUMA (nonuniform)

Example: Sum Reduction
 Sum 100,000 numbers on 100 processor UMA
 Each processor has ID: 0 ≤ Pn ≤ 99
 Partition 1000 numbers per processor
 Initial summation on each processor
sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
    sum[Pn] = sum[Pn] + A[i];
 Now need to add these partial sums
 Reduction: divide and conquer
 Half the processors add pairs, then quarter, …
 Need to synchronize between reduction steps

Example: Sum Reduction

half = 100;
repeat
    synch();
    if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor0 gets missing element */
    half = half/2; /* dividing line on who sums */
    if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
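
To check the reduction logic without a multiprocessor, the steps can be simulated sequentially in C. This is a sketch under stated assumptions: `tree_reduce` is a hypothetical name, the inner loop stands in for the processors that would run in parallel, and a `while (half > 1)` loop replaces repeat/until so that a single processor is also handled.

```c
#include <assert.h>

/* Sequential simulation of the tree reduction: `sum` holds one partial
   sum per processor, and each pass of the while loop stands for one
   synchronized step. The inner for loop plays the role of the processors
   with Pn < half all executing the same statement in parallel. */
double tree_reduce(double *sum, int procs)
{
    assert(sum != 0 && procs >= 1);
    int half = procs;
    while (half > 1) {
        /* synch() would go here on a real multiprocessor */
        if (half % 2 != 0)
            sum[0] += sum[half - 1];   /* processor 0 picks up the odd element */
        half /= 2;
        for (int pn = 0; pn < half; ++pn)
            sum[pn] += sum[pn + half];
    }
    return sum[0];
}
```

Within a step, each simulated processor reads only from the upper half and writes only to the lower half, so running them one after another gives the same result as running them at once. With 100 processors the sequence of `half` values is 100, 50, 25, 12, 6, 3, 1, so the odd-element case is exercised twice.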

§6.6 Introduction to Graphics Processing Units
History of GPUs
 Early video cards
 Frame buffer memory with address generation for
video output
 3D graphics processing
 Originally high-end computers (e.g., SGI)
 Moore’s Law ⇒ lower cost, higher density
 3D graphics cards for PCs and game consoles
 Graphics Processing Units
 Processors oriented to 3D graphics tasks
 Vertex/pixel processing, shading, texture mapping,
rasterization

Graphics in the System

GPU Architectures
 Processing is highly data-parallel
 GPUs are highly multithreaded
 Use thread switching to hide memory latency
 Less reliance on multi-level caches
 Graphics memory is wide and high-bandwidth
 Trend toward general purpose GPUs
 Heterogeneous CPU/GPU systems
 CPU for sequential code, GPU for parallel code
 Programming languages/APIs
 DirectX, OpenGL
 C for Graphics (Cg), High Level Shader Language
(HLSL)
 Compute Unified Device Architecture (CUDA)

Example: NVIDIA Tesla
[Figure: a streaming multiprocessor containing 8 × streaming processors]

Example: NVIDIA Tesla
 Streaming Processors
 Single-precision FP and integer units
 Each SP is fine-grained multithreaded
 Warp: group of 32 threads
 Executed in parallel, SIMD style
 8 SPs × 4 clock cycles
 Hardware contexts for 24 warps
 Registers, PCs, …

Classifying GPUs
 Don’t fit nicely into SIMD/MIMD model
 Conditional execution in a thread allows an
illusion of MIMD
 But with performance degradation
 Need to write general purpose code with care

                                Static: Discovered    Dynamic: Discovered
                                at Compile Time       at Runtime
Instruction-Level Parallelism   VLIW                  Superscalar
Data-Level Parallelism          SIMD or Vector        Tesla Multiprocessor

GPU Memory Structures

Putting GPUs into Perspective
Feature                                                             Multicore with SIMD   GPU
SIMD processors                                                     4 to 8                8 to 16
SIMD lanes/processor                                                2 to 4                8 to 16
Multithreading hardware support for SIMD threads                    2 to 4                16 to 32
Typical ratio of single-precision to double-precision performance   2:1                   2:1
Largest cache size                                                  8 MB                  0.75 MB
Size of memory address                                              64-bit                64-bit
Size of main memory                                                 8 GB to 256 GB        4 GB to 6 GB
Memory protection at level of page                                  Yes                   Yes
Demand paging                                                       Yes                   No
Integrated scalar processor/SIMD processor                          Yes                   No
Cache coherent                                                      Yes                   No

Guide to GPU Terms

§6.7 Clusters, WSC, and Other Message-Passing MPs
Message Passing
 Each processor has private physical
address space
 Hardware sends/receives messages
between processors

Loosely Coupled Clusters
 Network of independent computers
 Each has private memory and OS
 Connected using I/O system
 E.g., Ethernet/switch, Internet
 Suitable for applications with independent tasks
 Web servers, databases, simulations, …
 High availability, scalable, affordable
 Problems
 Administration cost (prefer virtual machines)
 Low interconnect bandwidth
 cf. processor/memory bandwidth on an SMP

Sum Reduction (Again)
 Sum 100,000 numbers on 100 processors
 First distribute 1000 numbers to each
 Then do partial sums
sum = 0;
for (i = 0; i<1000; i = i + 1)
    sum = sum + AN[i];
 Reduction
 Half the processors send, other half receive
and add
 Then a quarter send, a quarter receive and add, …

Sum Reduction (Again)
 Given send() and receive() operations
limit = 100; half = 100; /* 100 processors */
repeat
    half = (half+1)/2; /* send vs. receive dividing line */
    if (Pn >= half && Pn < limit)
        send(Pn - half, sum);
    if (Pn < (limit/2))
        sum = sum + receive();
    limit = half; /* upper limit of senders */
until (half == 1); /* exit with final sum */

 Send/receive also provide synchronization


 Assumes send/receive take similar time to addition

Grid Computing
 Separate computers interconnected by
long-haul networks
 E.g., Internet connections
 Work units farmed out, results sent back
 Can make use of idle time on PCs
 E.g., SETI@home, World Community Grid

§6.8 Introduction to Multiprocessor Network Topologies
Interconnection Networks
 Network topologies
 Arrangements of processors, switches, and links

Bus Ring

N-cube (N = 3)
2D Mesh
Fully connected

Multistage Networks

Network Characteristics
 Performance
 Latency per message (unloaded network)
 Throughput
 Link bandwidth
 Total network bandwidth
 Bisection bandwidth
 Congestion delays (depending on traffic)
 Cost
 Power
 Routability in silicon
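
As a rough illustration of total versus bisection bandwidth, the following C sketch compares two of the topologies named earlier, the ring and the fully connected network. The function names are hypothetical, links are assumed to be bidirectional, and bandwidth is counted in units of one link's bandwidth.

```c
#include <assert.h>

/* Bandwidths are in units of one link's bandwidth; p is the number of
   processors. */

/* Ring: one link per processor; cutting the ring in half severs 2 links. */
long ring_total_bw(long p)     { assert(p >= 3); return p; }
long ring_bisection_bw(long p) { assert(p >= 3); return 2; }

/* Fully connected: a link between every pair of processors; each
   processor in one half has a link to each processor in the other half. */
long full_total_bw(long p)     { assert(p >= 2); return p * (p - 1) / 2; }
long full_bisection_bw(long p) { assert(p >= 2 && p % 2 == 0); return (p / 2) * (p / 2); }
```

For 64 processors the ring has a total bandwidth of 64 and a bisection bandwidth of 2, while the fully connected network has 2016 and 1024. The gap suggests why cost and routability appear alongside performance in the list above.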

§6.10 Multiprocessor Benchmarks and Performance Models
Parallel Benchmarks
 Linpack: matrix linear algebra
 SPECrate: parallel run of SPEC CPU programs
 Job-level parallelism
 SPLASH: Stanford Parallel Applications for
Shared Memory
 Mix of kernels and applications, strong scaling
 NAS (NASA Advanced Supercomputing) suite
 computational fluid dynamics kernels
 PARSEC (Princeton Application Repository for
Shared Memory Computers) suite
 Multithreaded applications using Pthreads and
OpenMP

Code or Applications?
 Traditional benchmarks
 Fixed code and data sets
 Parallel programming is evolving
 Should algorithms, programming languages,
and tools be part of the system?
 Compare systems, provided they implement a
given application
 E.g., Linpack, Berkeley Design Patterns
 Would foster innovation in approaches to
parallelism

Modeling Performance
 Assume performance metric of interest is
achievable GFLOPs/sec
 Measured using computational kernels from
Berkeley Design Patterns
 Arithmetic intensity of a kernel
 FLOPs per byte of memory accessed
 For a given computer, determine
 Peak GFLOPS (from data sheet)
 Peak memory bytes/sec (using Stream
benchmark)

Roofline Diagram

Attainable GFLOPs/sec
= Min ( Peak Memory BW × Arithmetic Intensity, Peak FP Performance )
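
The roofline caps attainable performance at the lower of two limits: the memory-bound rate and the peak floating-point rate. A minimal C sketch follows; the function name `roofline` and its parameter names are illustrative.

```c
#include <assert.h>

/* Attainable GFLOPs/sec under the roofline model.
   peak_gflops: peak floating-point rate (GFLOPs/sec)
   peak_bw_gbs: peak memory bandwidth (GB/sec)
   intensity:   arithmetic intensity of the kernel (FLOPs/byte) */
double roofline(double peak_gflops, double peak_bw_gbs, double intensity)
{
    assert(peak_gflops > 0.0 && peak_bw_gbs > 0.0 && intensity >= 0.0);
    double memory_bound = peak_bw_gbs * intensity;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

Take a made-up machine with a 16 GFLOPs/sec peak and 8 GB/sec of memory bandwidth. A kernel with intensity 0.5 is memory bound at 4 GFLOPs/sec, and a kernel with intensity 4 reaches the 16 GFLOPs/sec compute roof. The two limits meet at an intensity of 2 FLOPs/byte.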

Comparing Systems
 Example: Opteron X2 vs. Opteron X4
 2-core vs. 4-core, 2× FP performance/core, 2.2GHz vs.
2.3GHz
 Same memory system

 To get higher performance on X4 than X2
 Need high arithmetic intensity
 Or working set must fit in X4’s 2MB L3 cache

Optimizing Performance
 Optimize FP performance
 Balance adds & multiplies
 Improve superscalar ILP
and use of SIMD
instructions
 Optimize memory usage
 Software prefetch
 Avoid load stalls
 Memory affinity
 Avoid non-local data
accesses

Optimizing Performance
 Choice of optimization depends on
arithmetic intensity of code

 Arithmetic intensity is
not always fixed
 May scale with
problem size
 Caching reduces
memory accesses
 Increases arithmetic
intensity

§6.11 Real Stuff: Benchmarking and Rooflines i7 vs. Tesla
i7-960 vs. NVIDIA Tesla 280/480

Rooflines

Benchmarks

Performance Summary
 GPU (480) has 4.4 X the memory bandwidth
 Benefits memory bound kernels
 GPU has 13.1 X the single precision throughput, 2.5 X
the double precision throughput
 Benefits FP compute bound kernels
 CPU cache prevents some kernels from becoming
memory bound when they otherwise would on GPU
 GPUs offer scatter-gather, which assists with kernels
with strided data
 Lack of synchronization and memory consistency
support on GPU limits performance for some kernels

§6.12 Going Faster: Multiple Processors and Matrix Multiply
Multi-threading DGEMM
 Use OpenMP:

void dgemm (int n, double* A, double* B, double* C)
{
#pragma omp parallel for
    for ( int sj = 0; sj < n; sj += BLOCKSIZE )
        for ( int si = 0; si < n; si += BLOCKSIZE )
            for ( int sk = 0; sk < n; sk += BLOCKSIZE )
                do_block(n, si, sj, sk, A, B, C);
}
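
The slide does not show `do_block` or define `BLOCKSIZE`. One plausible sketch is given below, assuming column-major n × n matrices, n a multiple of `BLOCKSIZE`, and a block size of 32. All of these are assumptions rather than details taken from the slide.

```c
#include <assert.h>

#define BLOCKSIZE 32   /* assumed value; the slide does not define it */

/* Hypothetical helper: accumulate one BLOCKSIZE x BLOCKSIZE sub-block of
   C += A * B. Matrices are n x n in column-major order, and n is assumed
   to be a multiple of BLOCKSIZE. */
void do_block(int n, int si, int sj, int sk, double *A, double *B, double *C)
{
    assert(n % BLOCKSIZE == 0);
    for (int i = si; i < si + BLOCKSIZE; ++i)
        for (int j = sj; j < sj + BLOCKSIZE; ++j) {
            double cij = C[i + j * n];               /* running sum for C[i][j] */
            for (int k = sk; k < sk + BLOCKSIZE; ++k)
                cij += A[i + k * n] * B[k + j * n];
            C[i + j * n] = cij;
        }
}
```

Under this layout, parallelizing the outermost `sj` loop gives each thread its own set of columns of C. No two threads write the same element, so no locking is needed.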

Multithreaded DGEMM

Multithreaded DGEMM

§6.13 Fallacies and Pitfalls
Fallacies
 Amdahl’s Law doesn’t apply to parallel
computers
 Since we can achieve linear speedup
 But only on applications with weak scaling
 Peak performance tracks observed
performance
 Marketers like this approach!
 But compare Xeon with others in example
 Need to be aware of bottlenecks

Pitfalls
 Not developing the software to take
account of a multiprocessor architecture
 Example: using a single lock for a shared
composite resource
 Serializes accesses, even if they could be done in
parallel
 Use finer-granularity locking

§6.14 Concluding Remarks
Concluding Remarks
 Goal: higher performance by using multiple
processors
 Difficulties
 Developing parallel software
 Devising appropriate architectures
 SaaS importance is growing and clusters are a
good match
 Performance per dollar and performance per
Joule drive both mobile and WSC

Concluding Remarks (cont’d)
 SIMD and vector operations match multimedia applications
and are easy to program
