Introduction and Motivation
Presenter
MaxAcademy Lecture Series V1.0, September 2011
Lecture Overview
Challenges of the Exaflop Supercomputer
How much of a processor does processing?
Custom computers
FPGA accelerator hardware
Programmability
Rise of x86 Supercomputers
The Exaflop Supercomputer (2018)
1 exaflop = 10^18 FLOPS
How do we program this?
Using processor cores with 8 FLOPS/clock at 2.5 GHz
50M CPU cores
What about power?
Assume power envelope of 100W per chip
Moore's Law scaling: 6 cores today → ~100 cores/chip
Who pays for this?
500k CPU chips
50 MW (just for CPUs!) → 100 MW likely
Jaguar power consumption: 6MW
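The figures on this slide follow from a short chain of divisions. A minimal sketch of that arithmetic, assuming the slide's own inputs (8 FLOPS/clock, 2.5 GHz, 100 cores/chip, 100 W/chip); the struct and function names are illustrative and not part of the lecture:

```c
#include <assert.h>

/* Back-of-envelope sizing for the exaflop machine on this slide.
   All inputs are the slide's assumptions, not measured values. */
typedef struct {
    double cores;     /* CPU cores needed */
    double chips;     /* CPU chips needed */
    double power_mw;  /* CPU power in megawatts */
} ExaflopEstimate;

ExaflopEstimate estimate_machine(double target_flops,
                                 double flops_per_clock,
                                 double clock_hz,
                                 double cores_per_chip,
                                 double watts_per_chip)
{
    assert(flops_per_clock > 0 && clock_hz > 0);
    assert(cores_per_chip > 0);

    ExaflopEstimate e;
    /* cores = target / (FLOPS per clock * clocks per second) */
    e.cores = target_flops / (flops_per_clock * clock_hz);
    e.chips = e.cores / cores_per_chip;
    /* watts -> megawatts */
    e.power_mw = e.chips * watts_per_chip / 1e6;
    return e;
}
```

Calling estimate_machine(1e18, 8.0, 2.5e9, 100.0, 100.0) gives roughly 50M cores, 500k chips and 50 MW, matching the slide.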
What do 50M cores look like?
Spatial decomposition on a 10000^3 regular grid
1.0 Terapoints
20k points per core
27^3 region per core
Computing a 13x13x13 convolution stencil: 66% halo
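The 66% figure can be checked with a few lines of arithmetic. This sketch assumes the halo is counted as a fraction of the padded region a core must hold (the 27^3 interior plus a border one stencil radius wide on every side); that reading is an assumption, chosen because it reproduces the slide's number. The function name is illustrative:

```c
#include <assert.h>

/* Fraction of the points a core must hold that are halo (copies of
   neighbours' data) for a cubic region of edge `region` and a cubic
   stencil of edge `stencil` (odd, so the radius is (stencil - 1) / 2). */
double halo_fraction(long region, long stencil)
{
    assert(region > 0 && stencil > 0 && stencil % 2 == 1);
    long radius = (stencil - 1) / 2;
    long padded = region + 2 * radius;   /* edge length including halo */
    double interior = (double)region * region * region;
    double total = (double)padded * padded * padded;
    return (total - interior) / total;
}
```

For halo_fraction(27, 13) the padded edge is 39, so the result is (39^3 - 27^3) / 39^3, which is between 0.66 and 0.67.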
Power Efficiency
Green500 list identifies the most energy efficient supercomputers from the Top500 list

System                     Power efficiency (GFLOPS/W)
Best (BlueGene/Q)          1.6
Average accelerator        0.76
Average non-accelerator    0.21

Average accelerator is 3.6x the average non-accelerator

At 1.6 GFLOPS/W, 1 exaflop = 625 MW

To deliver 1 Exaflop at 6MW we need 170 GFLOPS/W
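Both numbers above are unit conversions between FLOPS, GFLOPS/W and MW. A small sketch of the two directions, with illustrative function names that do not come from the lecture:

```c
#include <assert.h>

/* Power in megawatts to sustain `flops` at `gflops_per_watt`. */
double power_mw(double flops, double gflops_per_watt)
{
    assert(gflops_per_watt > 0);
    return flops / (gflops_per_watt * 1e9) / 1e6;
}

/* Efficiency in GFLOPS/W needed to sustain `flops` within `megawatts`. */
double required_gflops_per_watt(double flops, double megawatts)
{
    assert(megawatts > 0);
    return flops / (megawatts * 1e6) / 1e9;
}
```

power_mw(1e18, 1.6) comes to 625 MW, and required_gflops_per_watt(1e18, 6.0) comes to about 167 GFLOPS/W, which the slide rounds to 170.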
Intel 6-Core X5680 Westmere
Figure: annotated die layout with six cores, memory controller, I/O and QPI, and shared L3 cache (uncore)
Labels within one core:
Execution units (marked "Computation")
L1 data cache
L2 cache & interrupt servicing
Memory ordering and execution
Paging
Out-of-order scheduling & retirement
Branch prediction
Instruction decode and microcode
Instruction fetch & L1 cache
A Special Purpose Computer
A custom chip for a specific application
No instructions → no instruction decode logic
No branches → no branch prediction
Explicit parallelism → no out-of-order scheduling
Data streamed onto chip → no multi-level caches
Figure: MyApplication chip with (lots of) memory, connected to the rest of the world
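As a rough software analogy for "data streamed onto chip", the sketch below reads each input element exactly once, in order, and keeps a three-element window in local variables instead of relying on a cache. The kernel (a 3-point running sum) and its name are chosen only for illustration; on a real custom chip the loop body would be a fixed hardware pipeline rather than instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Software analogy of a streaming kernel: each input element is read
   exactly once, in order, and a 3-element window held in local
   variables stands in for on-chip registers. */
void stream_sum3(const float *in, float *out, size_t n)
{
    assert(in != NULL && out != NULL);
    float prev2 = 0.0f, prev1 = 0.0f;  /* window is zero before the stream starts */
    for (size_t i = 0; i < n; i++) {
        float cur = in[i];             /* the only read of in[i] */
        out[i] = prev2 + prev1 + cur;
        prev2 = prev1;                 /* shift the window by one element */
        prev1 = cur;
    }
}
```

Because the access pattern is fixed and sequential, nothing in the loop needs branch prediction, out-of-order scheduling or a cache hierarchy, which is the point the slide makes.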
But we have more than one application
Generally impractical to have machines that are
completely optimized for only one code
Need to run many applications on a typical cluster
Figure: three MyApplication chips and one OtherApplication chip, each with its own memory, connected over a network to the rest of the world
Use a reconfigurable chip that can be reprogrammed
at runtime to implement:
Different applications
Or different versions of the same application
Figure: reconfigurable chip with memory and network; labels: Config 1, Optimized for Application A, B, C, D, E
Instruction Processors
Dataflow/Stream Processors
Accelerating Real Applications
The majority of LoC in most applications are scalar
CPUs are good for: latency-sensitive, control-intensive, non-repetitive code
Dataflow engines are good for: high throughput repetitive processing on large data volumes
A system should contain both
                          Lines of code
Total Application         1,000,000
Kernel to accelerate      2,000
Software to restructure   20,000
Custom Computing in a PC
Where is the Custom Architecture?
On-chip, with access to the register file
Co-processor with access to the level 1 cache
Next to the level 2 cache
In an adjacent processor socket, connected using QPI/HyperTransport
As memory controller instead of North/South Bridge
As main memory (DIMMs)
As a peripheral on the PCI Express bus
Inside the peripheral, i.e. a customizable disk controller
Figure labels: Processor, Register file, L1$, L2$, North/South Bridge, PCI Bus, DIMMs, Disk
Embedded Systems
Harvard Architecture
Partitioning of programs into software and hardware (custom architecture) is called Hardware Software Co-design
System-on-a-Chip (SoC)
Custom architecture as extension of the processor instruction set
Figure labels: Instructions, Data, Processor, Register file, Custom Architecture
Is there an optimal location?
Depends on the application
More specifically, it depends on the system's bottleneck for the application
Possible Bottlenecks:
Memory access latency
Memory access bandwidth
Memory size
Processor local memory size
Processor ALU resource
Processor ALU operation latency
Various bus bandwidths
Major Bottlenecks: Examples
          Throughput     Latency
Memory    Convolution    Graph algorithms
CPU       Monte Carlo    Optimization
Examples
for (int i = 0; i < N; i++) {
    a[i] = b[i];
}
is limited by: .
for (int i = 0; i < N; i++) {
    for (int j = 0; j < 1000; j++) {
        a[i] = a[i] + j;
    }
}
is limited by: .
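One way to approach the two questions is to tally, for each loop, how many array accesses and how many arithmetic operations it performs. The sketch below wraps both loops in functions that keep such tallies. The OpCount struct and function names are illustrative, the counts are source-level only, and the second loop assumes a compiler would keep a[i] in a register across the inner loop.

```c
#include <assert.h>
#include <stddef.h>

/* Per-loop tallies; these count source-level operations only and
   ignore what a compiler or cache would actually do. */
typedef struct {
    unsigned long long mem_accesses;  /* array loads + stores */
    unsigned long long alu_ops;       /* additions on array data */
} OpCount;

/* First loop from the slide: a[i] = b[i]. */
OpCount copy_loop(int *a, const int *b, size_t n)
{
    OpCount c = {0, 0};
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i];
        c.mem_accesses += 2;  /* one load of b[i], one store to a[i] */
    }
    return c;
}

/* Second loop from the slide: a[i] = a[i] + j for j = 0..999.
   The running sum is kept in a local (assumed register-resident). */
OpCount accumulate_loop(int *a, size_t n)
{
    OpCount c = {0, 0};
    for (size_t i = 0; i < n; i++) {
        int acc = a[i];
        for (int j = 0; j < 1000; j++) {
            acc = acc + j;
            c.alu_ops += 1;
        }
        a[i] = acc;
        c.mem_accesses += 2;  /* one load and one store of a[i] */
    }
    return c;
}
```

Comparing the ratio of alu_ops to mem_accesses for the two loops is a starting point for deciding which system resource each one stresses.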
Reconfigurable Computing with FPGAs
Xilinx Virtex-6 FPGA
Logic Cell (10^5 elements)
DSP Block
IO Block
Block RAM (20 TB/s)
CPU and FPGA Scaling
FPGAs on the same curve as CPUs (Moore's law)
Figure: CPU Transistors (axis 1.00E+006 to 1.00E+010) and FPGA Registers (axis 1000 to 1000000) on log scales, 1993 to 2012
High Density Compute with FPGAs
1U Form Factor
4x MAX3 cards
with Virtex-6 FPGAs
12 Intel Xeon cores
Up to 192GB FPGA RAM
Up to 192GB host RAM
MaxRing interconnect
Infiniband/10GE
Exercises
1. Consider a computer system with N Mb of memory and a processor with 2 ALUs that is never
limited by the memory bus (write down any additional assumptions you make). For each of the
points below, write a pseudo-program whose performance is limited by:
a) Memory access latency
b) Memory size
c) Processor ALU resources
2. Find 3 research projects on the web that work on something related to this lecture, and
describe in your own words what they do and why.