Chapter 6
Parallel Processors from Client to Cloud
§6.1 Introduction
Introduction
Connecting multiple processors (or cores) to get higher performance
Scalability, availability, and power efficiency
1. Task-level (or process-level) parallelism
High throughput for independent jobs
2. Parallel processing program (or parallel software)
Single program run on multiple processors
Multicore microprocessors
Chips with multiple processors (cores)
Shared memory multiprocessors (SMPs)
Hardware
Serial: e.g., Pentium 4
Parallel: e.g., quad-core Xeon E5345
Software
Sequential: e.g., matrix multiplication
Concurrent: e.g., operating system
Sequential/concurrent software can run on serial/parallel hardware
Challenge: making effective use of parallel hardware
Difficulties
Partitioning
Coordination
Communications overhead
Amdahl's Law: to get a speedup of 90 from 100 processors
Speedup = 1 / ((1 - Fparallelizable) + Fparallelizable/100) = 90
Solving: Fparallelizable = 0.999, i.e. the sequential part can be at most 0.1% of the original execution time
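A few lines of C make this easy to check; the helper name speedup below is purely illustrative and not from the text:

    #include <stdio.h>

    /* Amdahl's Law speedup on p processors when a fraction f of the
       original execution time is parallelizable. */
    static double speedup(double f, int p) {
        return 1.0 / ((1.0 - f) + f / p);
    }

    int main(void) {
        printf("f = 0.999: %.1f\n", speedup(0.999, 100));  /* ~91: meets the goal of 90 */
        printf("f = 0.990: %.1f\n", speedup(0.990, 100));  /* ~50: far short of 90      */
        return 0;
    }

Even 99% parallelizable code reaches a speedup of only about 50 on 100 processors.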
If one processor has twice the load of the others: speed-up = 410/30 ≈ 14, and the remaining 39 processors are utilized less than half the time
If one processor has five times the load of the others: speed-up = 410/60 ≈ 7, and the remaining 39 processors are utilized less than 20% of the time
Giving a single processor twice the load of the others cuts speed-up by a third, and five times the load on just one processor reduces speed-up by almost a factor of three (the arithmetic is sketched below)
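A minimal sketch of the arithmetic behind these figures. The setup is inferred from the numbers above (410 total time units on one processor, 39 other processors), assumed here to be 10 sequential additions plus 400 parallelizable additions spread over 40 processors; the balanced speedup of 20.5 follows from that assumption:

    #include <stdio.h>

    int main(void) {
        double total = 410.0;             /* single-processor time (time units) */
        double seq = 10.0, par = 400.0;   /* assumed sequential/parallel split  */
        int p = 40;
        double share = par / p;           /* balanced per-processor load = 10   */

        printf("balanced:   %.1f\n", total / (seq + share));        /* 20.5 */
        printf("2x on one:  %.1f\n", total / (seq + 2.0 * share));  /* ~14  */
        printf("5x on one:  %.1f\n", total / (seq + 5.0 * share));  /* ~7   */
        return 0;
    }

The drop from 20.5 to 14 is roughly a third; the drop to 7 is almost a factor of three.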
half = 100;
repeat
    synch();
    if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor0 gets missing element */
    half = half/2; /* dividing line on who sums */
    if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
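The loop above is pseudocode (synch(), Pn, and repeat/until are not C). Below is a minimal sequential simulation of the same halving logic, where the inner loop over pn stands in for the processors running between barriers; the sample data and names are illustrative only:

    #include <stdio.h>

    #define P 100   /* number of processors / partial sums */

    int main(void) {
        double sum[P];
        for (int i = 0; i < P; i++) sum[i] = i + 1.0;  /* sample partial sums */

        int half = P;
        do {
            /* synch(): all processors have finished the previous step here */
            if (half % 2 != 0)                /* odd count: processor 0 picks up  */
                sum[0] += sum[half - 1];      /* the element with no partner      */
            half = half / 2;                  /* dividing line on who sums        */
            for (int pn = 0; pn < half; pn++) /* "if (Pn < half)" for each Pn     */
                sum[pn] += sum[pn + half];
        } while (half != 1);

        printf("total = %g (expected %g)\n", sum[0], P * (P + 1) / 2.0);
        return 0;
    }

In a real shared-memory implementation each processor would execute the body for its own Pn, with synch() implemented as a barrier.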
Figure: streaming multiprocessor with 8 × streaming processors
Network topologies (figure): bus, ring, 2D mesh, N-cube (N = 3), fully connected
Attainable GFLOPs/sec
= Min ( Peak Memory BW × Arithmetic Intensity, Peak FP Performance )
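A direct translation of this bound into C; the peak bandwidth and peak FP numbers below are placeholders, not figures from the text:

    #include <stdio.h>

    /* Roofline model: attainable GFLOPs/sec for a kernel with a given
       arithmetic intensity (FLOPs per byte of DRAM traffic). */
    static double attainable(double peak_gflops, double peak_bw_gbs,
                             double arith_intensity) {
        double memory_bound = peak_bw_gbs * arith_intensity;
        return memory_bound < peak_gflops ? memory_bound : peak_gflops;
    }

    int main(void) {
        double peak_gflops = 100.0;   /* peak FP performance (placeholder)        */
        double peak_bw     = 25.0;    /* peak memory bandwidth, GB/s (placeholder) */

        /* Low intensity: memory-bound.  High intensity: compute-bound. */
        printf("AI = 0.5: %.1f GFLOPs/sec\n", attainable(peak_gflops, peak_bw, 0.5));
        printf("AI = 8.0: %.1f GFLOPs/sec\n", attainable(peak_gflops, peak_bw, 8.0));
        return 0;
    }

With these placeholder peaks, any kernel with arithmetic intensity below 4 FLOPs/byte is limited by memory bandwidth rather than by peak floating-point performance.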
Arithmetic intensity is not always fixed
May scale with problem size
Caching reduces memory accesses
Increases arithmetic intensity (illustrated in the sketch below)
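As one example of intensity scaling with problem size, a dense n × n matrix multiply performs about 2n³ FLOPs on 3n² doubles. Assuming the caches capture all reuse so each matrix moves between DRAM and the chip only once (an idealized assumption for this sketch), arithmetic intensity grows linearly with n:

    #include <stdio.h>

    /* Arithmetic intensity of an n x n dense matrix multiply, assuming each
       of the three matrices is transferred to/from DRAM exactly once. */
    static double matmul_intensity(int n) {
        double flops = 2.0 * n * (double)n * n;    /* one multiply-add per inner step */
        double bytes = 3.0 * n * (double)n * 8.0;  /* A, B, C stored as doubles       */
        return flops / bytes;                      /* = n / 12, grows with n          */
    }

    int main(void) {
        for (int n = 64; n <= 1024; n *= 4)
            printf("n = %4d: %.1f FLOPs/byte\n", n, matmul_intensity(n));
        return 0;
    }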
§6.11 Real Stuff: Benchmarking and Rooflines i7 vs. Tesla
i7-960 vs. NVIDIA Tesla 280/480