And Motivation: Presenter
MaxAcademy Lecture Series V1.0, September 2011
Lecture Overview
Challenges of the Exaflop Supercomputer
How much of a processor does processing?
Custom computers
FPGA accelerator hardware
Programmability
Rise of x86 Supercomputers
The Exaflop Supercomputer (2018)
1 exaflop = 10^18 FLOPS
How do we program this?
Using processor cores with 8 FLOPS/clock at 2.5 GHz
50M CPU cores
What about power?
Assume a power envelope of 100 W per chip
Moore's Law scaling: 6 cores today -> ~100 cores/chip
Who pays for this?
500k CPU chips
50 MW (just for CPUs!), 100 MW likely
Jaguar power consumption: 6MW
What do 50M cores look like?
Spatial decomposition on a 10000^3 regular grid
1.0 Terapoints
20k points per core
27^3 region per core
Computing a 13x13x13 convolution stencil:
66% halo
Power Efficiency
The Green500 list identifies the most energy-efficient supercomputers from the Top500 list
Power efficiency (GFLOPS/W)
  Best (BlueGene/Q)          1.6
  Average accelerator        0.76
  Average non-accelerator    0.21
  (average accelerator is 3.6x the average non-accelerator)
Intel 6-Core X5680 Westmere
[Figure: annotated die photo; labelled regions of one core:]
Execution units (Computation)
L1 data cache
L2 cache & interrupt servicing
Memory ordering and execution
Paging
Out-of-order scheduling & retirement
Branch prediction
Instruction decode and microcode
Instruction fetch & L1 cache
Memory controller
A Special Purpose Computer
A custom chip for a specific application
No instructions -> no instruction decode logic
No branches -> no branch prediction
Explicit parallelism -> no out-of-order scheduling
Data streamed onto chip -> no multi-level caches
[Figure: MyApplication chip connected to (lots of) memory and to the rest of the world]
A Special Purpose Computer
But we have more than one application
Generally impractical to have machines that are
completely optimized for only one code
Need to run many applications on a typical cluster
[Figure: several MyApplication chips and an OtherApplication chip, each with its own memory, connected through a network to the rest of the world]
A Special Purpose Computer
Use a reconfigurable chip that can be reprogrammed
at runtime to implement:
Different applications
Or different versions of the same application
[Figure: reconfigurable chip with memory and network; a configuration (Config 1) optimized for Application A, B, C, D or E]
Instruction Processors
Dataflow/Stream Processors
Accelerating Real Applications
The majority of lines of code (LoC) in most applications are scalar
CPUs are good for: latency-sensitive, control-intensive, non-repetitive code
Dataflow engines are good for: high-throughput repetitive processing on large data volumes
A system should contain both
Lines of code
  Total application          1,000,000
  Kernel to accelerate           2,000
  Software to restructure       20,000
Custom Computing in a PC
Embedded Systems
... and hardware (custom architecture) is called Hardware Software Co-design
System-on-a-Chip (SoC)
Custom architecture as extension of the processor instruction set.
[Figure labels: Custom Architecture, Processor, Register file, Data]
Is there an optimal location?
Depends on the application
More specifically, it depends on the system's bottleneck for the application
Possible Bottlenecks:
Memory access latency
Memory access bandwidth
Memory size
Processor local memory size
Processor ALU resource
Processor ALU operation latency
Various bus bandwidths
Major Bottlenecks: Examples
Throughput Latency
Examples
for (int i = 0; i < N; i++) {
    a[i] = b[i];
}
is limited by: ______
for (int i = 0; i < N; i++) {
    for (int j = 0; j < 1000; j++) {
        a[i] = a[i] + j;
    }
}
is limited by: ______
Reconfigurable Computing with FPGAs
Xilinx Virtex-6 FPGA
[Figure: FPGA fabric with DSP blocks, IO blocks and logic cells (10^5 elements)]
[Chart: CPU transistors and FPGA registers, 1993 to 2012, logarithmic axes]
High Density Compute with FPGAs
1U Form Factor
4x MAX3 cards
with Virtex-6 FPGAs
12 Intel Xeon cores
Up to 192GB FPGA RAM
Up to 192GB host RAM
MaxRing interconnect
Infiniband/10GE
Exercises
1. Consider a computer system with N Mb of memory and a processor with 2 ALUs, which is never
limited by the memory bus (write down any additional assumptions you make). For each of the
points below, write a pseudo-program whose performance is limited by:
a) Memory access latency
b) Memory size
c) Processor ALU resources
2. Find three research projects on the web that work on something related to this lecture, and
describe in your own words what they do and why.