
Introduction and Motivation

Presenter
MaxAcademy Lecture Series V1.0, September 2011
Lecture Overview
Challenges of the Exaflop Supercomputer
How much of a processor does processing?
Custom computers
FPGA accelerator hardware
Programmability

Rise of x86 Supercomputers

The Exaflop Supercomputer (2018)
1 exaflop = 10^18 FLOPS
How do we program this?
Using processor cores with 8 FLOPS/clock at 2.5 GHz
50M CPU cores
What about power?
Assume power envelope of 100 W per chip
Moore's Law scaling: 6 cores today -> ~100 cores/chip
Who pays for this?
500k CPU chips
50 MW (just for CPUs!) -> 100 MW likely
Jaguar power consumption: 6 MW
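
The figures on this slide follow from simple arithmetic, which a minimal C sketch can make explicit. The names `ExaflopEstimate` and `estimate_exaflop` are illustrative and not part of the lecture; the inputs are the slide's own assumptions.

```c
/* Back-of-envelope sizing of an exaflop machine.
   All names here are hypothetical; only the input figures come from the slide. */
typedef struct {
    double cores;     /* CPU cores needed to reach the target */
    double chips;     /* CPU chips needed at the assumed cores per chip */
    double power_mw;  /* power drawn by the CPU chips alone, in megawatts */
} ExaflopEstimate;

ExaflopEstimate estimate_exaflop(double target_flops, double flops_per_clock,
                                 double clock_hz, double cores_per_chip,
                                 double watts_per_chip)
{
    ExaflopEstimate e;
    /* Peak FLOPS of one core is FLOPS/clock times clock rate. */
    e.cores = target_flops / (flops_per_clock * clock_hz);
    e.chips = e.cores / cores_per_chip;
    /* Convert watts to megawatts. */
    e.power_mw = e.chips * watts_per_chip / 1e6;
    return e;
}
```

With the slide's inputs, `estimate_exaflop(1e18, 8.0, 2.5e9, 100.0, 100.0)` gives 50M cores, 500k chips, and 50 MW for the CPUs alone.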

What do 50M cores look like?
Spatial decomposition on a 10000^3 regular grid
1.0 Terapoints
20k points per core
27^3 region per core
Computing a 13x13x13 convolution stencil: 66% halo
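
The halo figure can be checked with a short calculation. The sketch below assumes a cubic region per core and a cubic stencil with an odd edge length, so the halo is a shell of thickness (stencil - 1) / 2 around the region; the function names are illustrative, not from the lecture.

```c
/* Points per core when an n x n x n grid is split evenly across `cores`. */
double points_per_core(double n, double cores)
{
    return n * n * n / cores;
}

/* Fraction of the points a core must hold that lie in the halo,
   for a cubic region of edge `region` and a cubic stencil of odd edge `stencil`. */
double halo_fraction(int region, int stencil)
{
    int radius = (stencil - 1) / 2;           /* 13 -> 6 points on each side */
    double padded = region + 2.0 * radius;    /* 27 -> 39 points per edge */
    double interior = (double)region * region * region;
    double total = padded * padded * padded;
    return (total - interior) / total;
}
```

For a 27^3 region and a 13x13x13 stencil the padded block is 39^3 points, of which roughly two thirds are halo, matching the slide's 66%.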

Power Efficiency
The Green500 list identifies the most energy-efficient supercomputers from the Top500 list

System                     Power efficiency (GFLOPS/W)
Best (BlueGene/Q)          1.6
Average accelerator        0.76
Average non-accelerator    0.21

Accelerator systems average 3.6x the efficiency of non-accelerator systems

At 1.6 GFLOPS/W, 1 exaflop = 625 MW


To deliver 1 Exaflop at 6MW we need 170 GFLOPS/W
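
Both power figures are unit conversions between FLOPS, GFLOPS/W, and MW. A minimal sketch, with hypothetical function names:

```c
/* Power in megawatts needed to sustain `target_flops`
   at an efficiency of `gflops_per_watt`. */
double required_power_mw(double target_flops, double gflops_per_watt)
{
    double watts = target_flops / (gflops_per_watt * 1e9);
    return watts / 1e6;
}

/* Efficiency in GFLOPS/W needed to sustain `target_flops`
   within a power budget of `budget_mw` megawatts. */
double required_gflops_per_watt(double target_flops, double budget_mw)
{
    double flops_per_watt = target_flops / (budget_mw * 1e6);
    return flops_per_watt / 1e9;
}
```

`required_power_mw(1e18, 1.6)` gives 625 MW, and `required_gflops_per_watt(1e18, 6.0)` gives about 167 GFLOPS/W, which the slide rounds to 170.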

Intel 6-Core X5680 Westmere

[Figure: annotated die photo. Chip level: memory controller, six cores, I/O and QPI, uncore, shared L3 cache. Within one core: L2 cache and interrupt servicing, L1 data cache, execution units (labeled "Computation"), memory ordering and execution, paging, out-of-order scheduling and retirement, branch prediction, instruction decode and microcode, instruction fetch and L1 cache.]

A Special Purpose Computer
A custom chip for a specific application
No instructions -> no instruction decode logic
No branches -> no branch prediction
Explicit parallelism -> no out-of-order scheduling
Data streamed onto chip -> no multi-level caches

[Figure: a MyApplication chip connected to (lots of) memory and to the rest of the world.]

A Special Purpose Computer
But we have more than one application
Generally impractical to have machines that are completely optimized for only one code
Need to run many applications on a typical cluster

[Figure: several MyApplication chips and an OtherApplication chip, each with its own memory, connected by a network to the rest of the world.]

A Special Purpose Computer
Use a reconfigurable chip that can be reprogrammed at runtime to implement:
Different applications
Or different versions of the same application

[Figure: a reconfigurable chip with memory and network connections, labeled "Config 1" and "Optimized for Application" A, B, C, D, or E.]
Instruction Processors

Dataflow/Stream Processors

Accelerating Real Applications
The majority of LoC in most applications are scalar
CPUs are good for: latency-sensitive, control-intensive, non-repetitive code
Dataflow engines are good for: high-throughput repetitive processing on large data volumes
A system should contain both

                           Lines of code
Total application          1,000,000
Kernel to accelerate       2,000
Software to restructure    20,000

Custom Computing in a PC

Where is the Custom Architecture?
On-chip w/ access to register file
Co-processor w/ access to level 1 cache
Next to level 2 cache
In an adjacent processor socket, connected using QPI/HyperTransport
As memory controller instead of North/South Bridge
As main memory (DIMMs)
As a peripheral on PCI Express bus
Inside the peripheral, e.g. a customizable disk controller

[Figure: PC block diagram showing the processor with register file, L1$ and L2$, the North/South Bridge, DIMMs, the PCI bus, and a disk.]

Embedded Systems

Harvard Architecture
Partitioning of programs into software and hardware (custom architecture) is called hardware/software co-design
System-on-a-Chip (SoC)
Custom architecture as an extension of the processor instruction set

[Figure: processor with register file and an attached custom architecture, with inputs labeled "Instructions" and "Data".]

Is there an optimal location?
Depends on the application
More specifically, it depends on the system's bottleneck for the application
Possible bottlenecks:
Memory access latency
Memory access bandwidth
Memory size
Processor local memory size
Processor ALU resource
Processor ALU operation latency
Various bus bandwidths

Major Bottlenecks: Examples

          Throughput     Latency
Memory    Convolution    Graph algorithms
CPU       Monte Carlo    Optimization

Examples
for (int i = 0; i < N; i++) {
    a[i] = b[i];
}

is limited by: ______

for (int i = 0; i < N; i++) {
    for (int j = 0; j < 1000; j++) {
        a[i] = a[i] + j;
    }
}

is limited by: ______
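
One way to reason about the two loops above is to count additions per byte of memory traffic. The sketch below is an illustration under stated assumptions, not part of the lecture: elements are 4-byte integers, each element costs one load and one store, and in the second loop `a[i]` is assumed to stay in a register across the inner loop. The names `LoopCost`, `copy_loop_cost`, `accumulate_loop_cost`, and `adds_per_byte` are hypothetical.

```c
#include <stddef.h>

typedef struct {
    double adds;   /* additions performed on array data */
    double bytes;  /* bytes moved between memory and processor */
} LoopCost;

/* First loop: a[i] = b[i]. One 4-byte load and one 4-byte store per element,
   and no arithmetic on the data. */
LoopCost copy_loop_cost(size_t n)
{
    LoopCost c = { 0.0, 8.0 * (double)n };
    return c;
}

/* Second loop: `inner` additions per element. Assumes a[i] is loaded once,
   kept in a register for the inner loop, and stored once. */
LoopCost accumulate_loop_cost(size_t n, size_t inner)
{
    LoopCost c = { (double)n * (double)inner, 8.0 * (double)n };
    return c;
}

/* Arithmetic intensity: additions per byte of memory traffic. */
double adds_per_byte(LoopCost c)
{
    return c.adds / c.bytes;
}
```

Under these assumptions the first loop performs 0 additions per byte and the second performs 125, which points each loop toward a different row of the bottleneck table.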

Reconfigurable Computing with FPGAs
Xilinx Virtex-6 FPGA
[Figure: FPGA fabric showing logic cells (10^5 elements), DSP blocks, IO blocks, and block RAM (20 TB/s).]

CPU and FPGA Scaling
FPGAs on the same curve as CPUs (Moore's law)
[Figure: plot of CPU transistors and FPGA registers against year, 1993 to 2012, on two log-scale axes spanning 10^6 to 10^10 and 10^3 to 10^6.]

High Density Compute with FPGAs
1U Form Factor
4x MAX3 cards with Virtex-6 FPGAs
12 Intel Xeon cores
Up to 192GB FPGA RAM
Up to 192GB host RAM
MaxRing interconnect
Infiniband/10GE

Exercises
1. Consider a computer system that is never limited by the memory bus, with N Mb of memory and
a processor with 2 ALUs (write down any additional assumptions you make). For each of the
points below, write a pseudo-program whose performance is limited by:
a) Memory access latency
b) Memory size
c) Processor ALU resources

2. Find 3 research projects on the web that work on something related to this lecture, and
describe in your own words what they do and why.
