EE292A Lecture 2.ML - Hardware


EE292A Lecture 2

Machine Learning Network to Custom Hardware

Patrick Groeneveld
AMD
[email protected]

Copyright ©2024 by Raúl Camposano, Antun Domic and Patrick Groeneveld


Major Steps in the Hardware Mapping Flow
[Flow diagram] C++/TensorFlow/PyTorch → System (assembled from IP blocks) → RTL → Circuit → Physical → Mask → Chip tapeout, sent to the IC fab.

Instructor annotations on the flow: Patrick covers the front end today; Raúl, Antun and Patrick cover the later stages (RTL, circuit, physical/mask) in the next weeks.


EDA flow: Hundreds of Algorithmic Steps
From C++/TensorFlow/PyTorch source down to the mask data, the flow alternates optimization and synthesis steps, each producing a refined intermediate representation:

• Architectural synthesis, ML kernel matching → MLIR
• Datapath optimization, high-level synthesis → RTL
• Logic/area optimization, logic synthesis (logic minimization, technology mapping) → mapped netlist, minimum area
• Timing optimization: rough sizing, high-fanout buffering → conditioned netlist
• Global placement (place for minimal wire length and congestion), global routing → placed gates
• Physical optimization 1: gate sizing, buffering, global routing, incremental placement → placed gates, global route
• Detailed placement: all cells in legal locations → placed gates, global route + clock wires
• Physical optimization 2: gate sizing, buffering, incremental global route → detailed-placed gates, global + track route, scan wires
• Detailed routing: generate the wire pattern → routed design
• Physical optimization 3: gate sizing, buffering, fill insertion → GDS2, chip tapeout, sent to the IC fab


Today: Machine Learning Network to Custom Hardware

• Computation for machine learning
  • Dense (Fully Connected) and convolution layers
• Speeding up computation: floating point formats, multiplication
• ML hardware classes:
  • CPU, GPU, TPU, custom
  • GPU structure
• Layer-pipelined execution of ML
• TensorFlow to hardware flow, with a real-life example:
  • IR, kernel graph
• Other approaches:
  • FPGA ML flows, Google TPU


Apple A12 iPhone SoC

• 7nm TSMC FinFET
• 9.89 x 8.42 mm = 83.27 mm²
• 6.9 billion transistors
• 4 GPU cores (~18% of area): 9 blocks
• 6 CPU cores (~14%): 13 blocks
  • 4 'Tempest' low-power CPU cores
  • 2 'Vortex' high-performance CPU cores
  • L2 & L3 caches
• 8 NPU/TPU cores (~7%): 4 blocks
• DDR (~3%): 1 block
• Misc (~57%): 50 unique blocks
• Total: ~75 unique blocks

[Die photo with the GPU, HP CPU, LP CPU and NPU blocks annotated]
Source: TechInsights/Anandtech
The Past 6 Years of Apple Mobile SoCs
[Die photos, each annotated with the LP CPU, HP CPU, GPU and NPU blocks and the L2/L3 caches:]
A12: iPhone Xs (2018), A13: iPhone 11 (2019), A14: iPhone 12 (2020), A15: iPhone 13 (2021), A16: iPhone 14 (2022), A17: iPhone 15 Pro (2023)

Sources: TechInsights/Anandtech/Angstronomics
A Neural Network with 7 Layers

[Figure: a 28x28 input image (MNIST database) passes through a convolutional filter, ReLU and max-pool down-sampling, a second filter and down-sample stage, and two fully connected network layers, producing 10 outputs, one per digit 0-9.]

Layer types: convolutional filter, rectified linear unit (ReLU), max-pool, fully connected network.
Convolutional Neural Network for Self-Driving

[Figure: camera images feeding a convolutional network]

Source: Tesla (https://www.youtube.com/watch?v=Ucp0TTmvqOE)
Automotive: Full Self-Driving Tesla

Tesla "Full Self Driving" computer board

Source: @Tesla on twitter
Autonomous Driving: Software & Hardware

Less than 100 milliseconds

Image Processing: Perception: Localization: Planning and


Cameras, Radar

Noise Reduction Feature Detection Cars, Traffic signs, Actions:


Lens Distortion Sensor fusion People, Motion Navigation Brake
Accel.
Correction Tracking Steering,
Planning Steer
High Dynamic Range Road Tracking Braking
Processing Threat Detection Acceleration

Hardware: Deep Learning TPU, GPU or CPU? CPU


Dedicated Signal Inference:
processor TPU or GPU or CPU?
April 2, 2024 Stanford EE292A Lecture 1 11
Tesla Chip Designer

Still from the Tesla Autonomy Day video: Pete Bannon, lead chip engineer, with Elon Musk.
Tesla's Full Self Driving Chip: Inference Engine

Hardware with CPUs, GPUs and TPUs on one die (image processor, video encoder, 16X GPU, 12X CPU, 2X TPU):
• 6 billion transistors
• 14nm FF, 2.6 cm²
• 250M standard cells
• ~30 Watt
• Almost a slicing floorplan
• 12X CPU (ARM A72): ~20% of the area
• 16X GPU (ARM Mali): ~20% of the area
• 2X TPU (Tensor Processing Unit): ~45% of the area

Source: Tesla (https://www.youtube.com/watch?v=Ucp0TTmvqOE); Tesla Autonomy Day, April 22, 2019
Form Follows Function in the TPU

[Dataflow figure: a 32 MByte image-data SRAM with 1 TByte/sec bandwidth feeds the 96x96 array at 256 Bytes/cycle; weights are read at 128 Bytes/cycle; the array of 8-bit multipliers with 32-bit adders feeds ReLU and pooling units ("layer fusion") and writes results back to the buffer at 128 Bytes/cycle, all in one clock cycle.]

• Runs instructions generated by a custom compiler: DMA read/write, convolution, FC, scale, etc.
• 36 TOPS @ 2GHz; 2X on chip = 7.5 Watts
• Performance: 2100 frames/sec
  • Compare 16X GPU: 17 fps (~128X slower)
  • Compare 12X CPU: 1.5 fps (~1400X slower)

Source: Tesla (https://www.youtube.com/watch?v=Ucp0TTmvqOE)
Tesla's New FSD Hardware 4

FSD Hardware 3 (2019):
• 12X CPU (ARM A72)
• 16X GPU (ARM Mali)
• 2X TPU
• 14nm Samsung process

FSD Hardware 4 (2023):
• 20X CPU (ARM A72)
• 16X GPU (ARM Mali)
• 3X TPU
• 7nm Samsung process??

Source: https://twitter.com/greentheonly/status/1625905234076741641?s=20
The Dense (Fully Connected) Layer

An input vector of 7 values is mapped to an output vector of 5 values through a 7x5 weight matrix. These weights are 'trainable'; in practice the weight matrix is generally quite sparse.

$out_j = \sum_{i=1}^{n} in_i \cdot w(i,j)$

Each output is a sum of products over all inputs: one multiply and one add per weight.

tensor dense_layer(const tensor &in, const size_t outSize) {
  tensor out(outSize);                        // accumulators, initialized to zero
  for (size_t j = 0; j < in.size(); j++) {    // input index
    for (size_t i = 0; i < out.size(); i++) { // output index
      out[i] += w[i][j] * in[j];              // w: the layer's trainable weight matrix
    }
  }
  return out;
}
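Below is a minimal self-contained version of the same loop nest, using std::vector instead of the slide's tensor class so it compiles and runs stand-alone; the weights[output][input] layout is an assumption made for this illustration.

// Self-contained dense layer sketch: one multiply-accumulate per weight.
#include <cstddef>
#include <cstdio>
#include <vector>

std::vector<float> dense_layer(const std::vector<float>& in,
                               const std::vector<std::vector<float>>& weights) {
    std::vector<float> out(weights.size(), 0.0f);        // one output per weight row
    for (std::size_t i = 0; i < weights.size(); ++i)     // output index
        for (std::size_t j = 0; j < in.size(); ++j)      // input index
            out[i] += weights[i][j] * in[j];             // one multiply-accumulate
    return out;
}

int main() {
    std::vector<std::vector<float>> w = {{1.f, 2.f, 3.f}, {4.f, 5.f, 6.f}};  // 2 outputs, 3 inputs
    std::vector<float> out = dense_layer({1.0f, 0.0f, 2.0f}, w);
    std::printf("%g %g\n", out[0], out[1]);   // 1*1 + 3*2 = 7 and 4*1 + 6*2 = 16
}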
Convolutional Layer: Code Sketch

tensor rgb_convolution_layer(const tensor &input, const convFilter &weights)
{
  // One output plane per filter; boundary handling / zero-padding is omitted in this sketch.
  tensor output(weights.size(), input.width(), input.height());

  for (int color = red; color <= blue; color++) {                   // 3 colors
    for (int row = 0; row < input.width(); row++) {                 // image rows
      for (int column = 0; column < input.height(); column++) {     // image columns
        for (int filter = 0; filter < weights.size(); filter++) {   // 32 filters
          for (int i = 0; i < weights[filter].width(); i++) {       // 5
            for (int j = 0; j < weights[filter].height(); j++) {    // 5
              output[filter][row][column] +=                        // one multiply-add
                weights[filter][i][j] * input[color][row+i][column+j];
            }
          }
        }
      }
    }
  }
  return output;
}

A typical smartphone picture has 3 x 3840 x 2160 pixels (RGB); here there are 32 filters of 5x5 pixels each. So this convolution layer requires 3 x 3840 x 2160 x 32 x 5 x 5 = 19,906,560,000 multiply-accumulates, i.e. about 20 billion floating-point multiply-adds.
With 1 MAC per clock cycle at 2GHz, that would take roughly 10 seconds…
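As a quick check of that timing claim (assuming, as stated, one multiply-accumulate completes per clock cycle):

$$\frac{1.99 \times 10^{10}\ \text{MACs}}{2 \times 10^{9}\ \text{MACs/s}} \approx 10\ \text{seconds for this single layer}$$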
Convolution Layer = Matrix Multiplication

• Example on the right: one 3x3 RGB input image (so 3x3x3 values)
• Two sets of 2x2 convolution weights for feature extraction over the 3 colors
  • So 2x2x2x3 = 24 filter weights
• No 0-padding, 4 outputs per convolution filter
• Construct the 4x12 image data matrix D: repeats each data element
• Construct the 12x2 filter weight matrix F: stores all weights
• Multiply the matrices: D x F = O, with O of size 4x2

Matrix multiplication D x F = O: we need hardware that is good at this.
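A minimal sketch of this im2col construction follows, assuming a CxHxW channel-major input layout, stride 1 and no padding; the function names and the Mat alias are illustrative, not any library's API.

// im2col: one row of D per output position, one column per (channel, filter-row, filter-column).
#include <cstddef>
#include <cstdio>
#include <vector>

using Mat = std::vector<std::vector<float>>;

Mat im2col(const std::vector<float>& in, int C, int H, int W, int R, int S) {
    int outH = H - R + 1, outW = W - S + 1;              // no zero-padding
    Mat D(outH * outW, std::vector<float>(C * R * S));
    for (int y = 0; y < outH; ++y)
        for (int x = 0; x < outW; ++x)
            for (int c = 0; c < C; ++c)
                for (int i = 0; i < R; ++i)
                    for (int j = 0; j < S; ++j)
                        D[y * outW + x][(c * R + i) * S + j] =
                            in[(c * H + y + i) * W + (x + j)];
    return D;
}

// Plain matrix product O = D x F; each row of O holds the filter responses for one output position.
Mat matmul(const Mat& D, const Mat& F) {
    Mat O(D.size(), std::vector<float>(F[0].size(), 0.0f));
    for (std::size_t r = 0; r < D.size(); ++r)
        for (std::size_t k = 0; k < F[0].size(); ++k)
            for (std::size_t c = 0; c < F.size(); ++c)
                O[r][k] += D[r][c] * F[c][k];
    return O;
}

int main() {
    // The slide's shapes: 3 channels, 3x3 image, 2x2 filters, 2 filters.
    std::vector<float> img(3 * 3 * 3, 1.0f);            // all-ones image for brevity
    Mat D = im2col(img, 3, 3, 3, 2, 2);                 // 4 x 12
    Mat F(12, std::vector<float>(2, 0.5f));             // 12 x 2, two all-0.5 filters
    Mat O = matmul(D, F);                               // 4 x 2, every entry 12*0.5 = 6
    std::printf("O is %zu x %zu, O[0][0] = %g\n", O.size(), O[0].size(), O[0][0]);
}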
IEEE754 Floating Point Format Standard

16-bit half-precision format: 1 sign bit S, 5 exponent bits E, 10 mantissa bits M.
The normalized mantissa has an implicit 1 as its most significant bit; the stored mantissa bits have weights 1/2, 1/4, ..., 1/1024. The exponent is stored with a bias offset of 15:

$X = (-1)^S \cdot (1.M)_{bin} \cdot 2^{E-15}$

IEEE 754 sizes (C/C++ and numpy names):
• long double: 128 bits total, 15 exponent bits, 112 mantissa bits, bias 16383
• double (the Python default): 64 bits, 11 exponent bits, 52 mantissa bits, bias 1023
• float: 32 bits, 8 exponent bits, 23 mantissa bits, bias 127
• half (not standard in C/C++): 16 bits, 5 exponent bits, 10 mantissa bits, bias 15

Examples:
• 0 01111 0000000000: mantissa 1.0000000000_bin = 1.0, so X = (-1)^0 * 1.0 * 2^(15-15) = 1.0
• 1 10001 0110000000: mantissa 1.0110000000_bin = 1.375, so X = (-1)^1 * 1.375 * 2^(17-15) = -5.5

Try it on-line: https://www.h-schmidt.net/FloatConverter/IEEE754.html
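A small sketch that decodes a raw 16-bit pattern using the formula above (normal numbers only; zero, denormals, infinity and NaN follow on the next slides). It is written directly from the format definition, not against any particular library:

#include <cmath>
#include <cstdint>
#include <cstdio>

double decode_fp16_normal(uint16_t bits) {
    int sign     = (bits >> 15) & 0x1;
    int exponent = (bits >> 10) & 0x1F;   // 5 bits, bias 15
    int mantissa =  bits        & 0x3FF;  // 10 bits
    double m = 1.0 + mantissa / 1024.0;   // implicit leading 1
    return (sign ? -1.0 : 1.0) * m * std::pow(2.0, exponent - 15);
}

int main() {
    std::printf("%g\n", decode_fp16_normal(0x3C00)); // 0 01111 0000000000 ->  1.0
    std::printf("%g\n", decode_fp16_normal(0xC580)); // 1 10001 0110000000 -> -5.5
}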
Special Cases: IEEE754 Floating Point

In the 16-bit half-precision format (1 sign bit, 5 exponent bits, 10 mantissa bits) the all-zero and all-one exponent fields are reserved:

• x 00000 0000000000 : zero (exponent all zeroes and mantissa all zeroes)
• 0 11111 0000000000 : +Infinity (sign 0, exponent all ones, mantissa all zeroes)
• 1 11111 0000000000 : -Infinity (sign 1, exponent all ones, mantissa all zeroes)
• x 11111 mmmmmmmmmm : NaN (Not a Number: exponent all ones and mantissa not all zeroes)
Binary Integer Multiplication

[Figure: long multiplication of two 8-bit binary numbers: one shifted partial product per multiplier bit, summed into a 16-bit result.]

Multiplier complexity is quadratic with the number of bits.
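A sketch of the grade-school shift-and-add scheme that such an array multiplier implements in hardware: one shifted partial product per multiplier bit, which is why the hardware cost grows quadratically with the operand width.

#include <cstdint>
#include <cstdio>

uint32_t shift_add_multiply(uint16_t a, uint16_t b) {
    uint32_t product = 0;
    for (int i = 0; i < 16; ++i)                        // one partial product per bit of b
        if ((b >> i) & 1u)
            product += static_cast<uint32_t>(a) << i;   // shifted copy of a
    return product;
}

int main() {
    std::printf("%u\n", static_cast<unsigned>(shift_add_multiply(29, 220)));  // 6380
}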
IEEE754 16-bit Floating Point Multiplication Hardware

$X = (-1)^S \cdot (1.M)_{bin} \cdot 2^{E-15}$

Datapath for multiplying two FP16 numbers:
• Sign: XOR of the two sign bits
• Exponent: add the two 5-bit exponents, then subtract the fixed exponent bias (15)
• Mantissa: prepend the implicit 1 to each 10-bit mantissa; an 11x11-bit binary multiplier produces a 22-bit product
• Normalization: if the product is 2.0 or larger, a mux selects the version shifted right by one bit and the exponent is incremented
• Rounding logic reduces the 22-bit product back to a 10-bit mantissa
• Result: packed back into S (1 bit), E (5 bits), M (10 bits)
IEEE754 16-bit Floating Point Multiplication Hardware: worked example 1.0 x (-4.0)

• Inputs: 0 01111 0000000000 (= 1.0) and 1 10001 0000000000 (= -4.0)
• Sign: 0 XOR 1 = 1 (negative)
• Exponents: 15 + 17 = 32; subtract the bias 15, giving 17
• Mantissas with the implicit 1: 1024 x 1024 = 1,048,576, the 22-bit product 01 0000000000 0000000000; already normalized
• Result: 1 10001 0000000000 = -4.0
IEEE754 16-bit Floating Point Multiplication Hardware: worked example 1.5 x 1.5

• Inputs: both 0 01111 1000000000 (= 1.5)
• Sign: 0 XOR 0 = 0
• Exponents: 15 + 15 = 30; subtract the bias 15, giving 15
• Mantissas with the implicit 1: 1536 x 1536 = 2,359,296, the 22-bit product 10 0100000000 0000000000; the value 2.25 is 2 or larger, so the product is shifted right by one bit and the exponent incremented to 16
• Result: 0 10000 0010000000 = (-1)^0 * 1.125 * 2^(16-15) = 2.25
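A behavioural sketch of the datapath above, covering normal inputs only and truncating instead of performing full IEEE rounding; the helper names and bit manipulation are illustrative:

#include <cstdint>
#include <cstdio>

uint16_t fp16_mul(uint16_t a, uint16_t b) {
    uint32_t sign = ((a >> 15) ^ (b >> 15)) & 0x1;                 // XOR of sign bits
    int32_t  exp  = ((a >> 10) & 0x1F) + ((b >> 10) & 0x1F) - 15;  // add, subtract bias
    uint32_t ma   = 0x400 | (a & 0x3FF);                           // implicit 1 -> 11 bits
    uint32_t mb   = 0x400 | (b & 0x3FF);
    uint32_t prod = ma * mb;                                       // 22-bit product, 2^20 = 1.0

    if (prod & (1u << 21)) {                    // product >= 2.0: normalize
        prod >>= 1;
        exp += 1;
    }
    uint32_t mantissa = (prod >> 10) & 0x3FF;   // drop the implicit 1, truncate
    return (uint16_t)((sign << 15) | ((exp & 0x1F) << 10) | mantissa);
}

int main() {
    std::printf("%04x\n", fp16_mul(0x3C00, 0xC400));  // 1.0 * -4.0 -> c400 (-4.0)
    std::printf("%04x\n", fp16_mul(0x3E00, 0x3E00));  // 1.5 *  1.5 -> 4080 (2.25)
}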
Speeding up Multiply-Add: Selecting a Floating-Point Format

Format                          Total  Exp  Mantissa  Range                   Used by
FP64  Floating Point, 64 bit     64    11     52      [5.4E-79, 7.2E+75]      IEEE 754 double
FP32  Floating Point, 32 bit     32     8     23      [1.75E-38, 3.7E+38]     IEEE 754 float
TF32  Tensor Floating Point      19     8     10      [1.75E-38, 3.7E+38]     Nvidia Hopper, Blackwell
BF16  Brain Floating Point       16     8      7      [1.75E-38, 3.7E+38]     Google, Nvidia, others
CF16  Floating Point, 16 bit     16     6      9      [9.31E-10, 4.29E+9]     Cerebras
FP16  Floating Point, 16 bit     16     5     10      [6.135E-5, 6.5504E+5]   IEEE 754 half
FP8   Minifloat E5M2              8     5      2      [6.135E-5, 6.5504E+5]   Nvidia Hopper, Blackwell
FP8   Minifloat E4M3              8     4      3      [7.81E-3, 3.65E+2]      Nvidia Hopper, Blackwell
FP4   Microfloat E3M0             4     3      0      [2.5E-1, 8.0E+0]        Nvidia Blackwell

Each format additionally has 1 sign bit; total bits = 1 + exponent bits + mantissa bits.
IEEE754 FP16: the Good, the Bad and the Ugly

• FP16 requires about 4X less multiplier hardware than FP32 (e8 m23), which is why it is popular in ML.
• But it sacrifices range and resolution:
  • The upper end of the range is just 65504
  • The smallest 'normal' number is just 0.00006135
  • And small numbers are not rare in ML :-(
• Alternatives: CF16 (e6 m9) trades some resolution for more range; BF16 (e8 m7) trades a lot of resolution for much more range.

Multiplier hardware complexity is quadratic with the number of mantissa bits.
MiniFloats are a Numerical Minefield

Example: a 6-bit E3M2 minifloat (1 sign, 3 exponent, 2 mantissa bits) with range [-14, +14].

[Figure: the multiplication table Z = A * B and the addition table Z = A + B over all representable values; large regions of both tables saturate to +Inf or -Inf.]

Image source: https://en.wikipedia.org/wiki/Minifloat
FP16 Underflow: Denormal Numbers

• Denormals extend the range, allowing smaller numbers to be represented; this is important for ML with half precision.
• Encoding: exponent field = 0, and an implicit 0 (instead of 1) as the most significant bit of the mantissa:

if E > 0:  $X = (-1)^S \cdot (1.M)_{bin} \cdot 2^{E-15}$   /* normal */
else:      $X = (-1)^S \cdot (0.M)_{bin} \cdot 2^{1-15}$   /* denormal */

Examples:
• Normal FP16 number (E > 0): 0 01111 0000000000 → X = 1.0 * 2^(15-15) = 1.0
• Smallest FP16 normal number: 0 00001 0000000000 → X = 1.0 * 2^(1-15) = 0.000061035
• Largest FP16 denormal number: 0 00000 1111111111 → 0.1111111111_bin = 0.99902344, so X = 0.99902344 * 2^(1-15) ≈ 0.000060976
• Smallest FP16 denormal number: 0 00000 0000000001 → 0.0000000001_bin = 1/1024 = 0.0009765625, so X = (1/1024) * 2^(1-15) ≈ 0.0000000596
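Extending the earlier decode sketch with the denormal branch: when the exponent field is 0 the implicit bit becomes 0 and the exponent is fixed at 2^(1-15). The all-ones exponent (infinity/NaN) is still ignored in this illustrative sketch.

#include <cmath>
#include <cstdint>
#include <cstdio>

double decode_fp16(uint16_t bits) {
    int sign     = (bits >> 15) & 0x1;
    int exponent = (bits >> 10) & 0x1F;
    int mantissa =  bits        & 0x3FF;
    double value;
    if (exponent > 0)
        value = (1.0 + mantissa / 1024.0) * std::pow(2.0, exponent - 15);  // normal
    else
        value = (mantissa / 1024.0) * std::pow(2.0, 1 - 15);               // denormal (or zero)
    return sign ? -value : value;
}

int main() {
    std::printf("%.10g\n", decode_fp16(0x0400));  // smallest normal:   0.00006103515625
    std::printf("%.10g\n", decode_fp16(0x03FF));  // largest denormal:  ~0.0000609756
    std::printf("%.10g\n", decode_fp16(0x0001));  // smallest denormal: ~0.0000000596
}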


Adding Denormal Multiplication to the Hardware

The pipelined 11x11-bit multiplier datapath is extended with normal/denormal control: when an input's exponent field is all zeroes, a 0 (instead of the implicit 1) is prepended to its 10-bit mantissa, and an extra mux stage handles re-normalization of the result.

Worked example: 2.0 (0 10000 0000000000) times the smallest denormal, 0.0000000596 (0 00000 0000000001), gives 0.000000119 (0 00000 0000000010).

Schwarz, E.M.; Schmookler, M.; Son Dao Trong: "Hardware Implementations of Denormalized Numbers".
ML Implementation: Software, Hardware or a Hybrid of Both?

Four classes, from most to least specialized:
• Application-specific (custom) hardware, as an ASIC (+ highest performance, + lowest unit cost, - high NRE, - not configurable) or an FPGA (+ low NRE, - high unit cost)
  + custom precision, + best performance, - higher design effort
• TPU/NPU: dedicated multiply-add hardware
  + fast, low-cost, low power, + exploits memory locality, - single integer/FP precision
• GPU: software on 100s to thousands of CUDA (and Tensor) cores
  + floating point, - fixed precision, - memory overhead
• CPU: software on general-purpose CPU cores
  + easy compilation, + universal, - ~1000X slower
Machine Learning Hardware: Different Philosophies

• Intel: add custom multiply-accumulate instructions to the CPU
• Tesla: Full Self Driving chip with TPUs, GPUs and CPUs
• Apple: TPU/NPU for Face ID, plus GPU
• Nvidia: GPU extension with multiply-adds (Tensor cores)
• Cerebras: massive multiprocessor
• Xilinx: FPGA fabric + systolic array
• Google: dedicated TPU
Bringing Memory and Processing Closer

[Figure: three organizations compared.]
• TPU/NPU (dedicated ML hardware): an array of processing elements (p), each with local weights (w), fed directly from on-chip data memory (m)
• GPU: many cores with local weights, backed by 40G of weights RAM and a separate data RAM
• CPU: a few cores, with weights and data in off-chip 40G RAM
Note that Tesla's FSD Chip and Smartphone SoCs Deploy a Hybrid of All Options

• Tesla FSD chip: 12X CPU, 16X GPU, 2X TPU, plus an image processor and video encoder
• A14 iPhone SoC: 4X LP CPU, 2X HP CPU, 4X GPU, NPU, plus caches

Source: Tesla (https://www.youtube.com/watch?v=Ucp0TTmvqOE); Tesla Autonomy Day, April 22, 2019
The Baseline: CPU Performance Example

• Intel Xeon Phi, 72 cores, 4 threads per core, 215W
• 1024x1024 floating-point matrix multiplication
• ~1B MACs in ~0.01 s = 0.1 TMACS
• Max throughput @ 1.3GHz: 64 x 1.3B = 0.09 TOPS

Note: TOPS = Tera Operations Per Second

What is the problem with CPUs?

https://software.intel.com/en-us/articles/performance-of-classic-matrix-multiplication-algorithm-on-intel-xeon-phi-processor-system
NVidia Grace Hopper H100 (2022)
Successor to Volta (2018) and Ampere (2020)

GPU die: 80B transistors, 50MB L2 cache, 4nm TSMC, ~400W, about 2.8cm x 2.8cm (814 mm²), on a silicon interposer with 40Gb of HBM3 memory.
A datacenter board with 8 of these retails at ~$200K.

NVidia Blackwell B200 (2024)

GPU die: 208B transistors on two dies on an interposer, 4nm TSMC.
A datacenter board with 8 of these retails at over $200K.
GPU Architecture: Nvidia Hopper

• 8 Graphics Processing Clusters
• 7 Texture Processing Clusters
• 14 (Volta-style) Streaming Multiprocessors
• 144 Streaming Multiprocessors in total
• Each SM: 64 FP32 cores, 64 INT32 cores, 32 FP64 cores, 8 Tensor cores
• 50 MB L2 cache
• 6 memory controllers
• 30 TFLOPS

https://devblogs.nvidia.com/inside-volta/
http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/
GPU Structure: the 'Streaming Multiprocessor'

• 8 Graphics Processing Clusters
  – 7 Texture Processing Clusters
  – 16 Ampere Streaming Multiprocessors
• 144 Streaming Multiprocessors
• Each SM has 4 times:
  – 16 FP64 cores, or
  – 32 FP32 cores, or
  – 16 INT32 cores, or
  – 1 Tensor core for MAC

A "Tensor core" does a 4x4 multiply-accumulate.

Hopper Tensor core peak throughput by floating-point format:
TF32: 500 TFLOPS, FP16: 1000 TFLOPS, BF16: 1000 TFLOPS, FP8: 2000 TFLOPS, INT8: 2000 TOPS
Nvidia Hardware for Exploiting Weight Sparsity

4:2 (i.e. 2:4 structured) sparse matrix multiplication: two non-zero weights in every group of four.

Claims: double the throughput, half the memory.

Source: https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/
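A sketch of the structured-sparsity idea, assuming the weights have already been pruned to at most two non-zeros per group of four (and that the row length is a multiple of four); the compressed format here is illustrative and is not Nvidia's actual on-chip encoding.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

struct SparseRow {                  // compressed row: half the values plus metadata
    std::vector<float>   values;    // the 2 kept values per group of 4
    std::vector<uint8_t> indices;   // their positions (0..3) within the group
};

SparseRow compress_2_of_4(const std::vector<float>& dense) {
    SparseRow row;
    for (std::size_t g = 0; g < dense.size(); g += 4) {
        int kept = 0;
        for (int i = 0; i < 4 && kept < 2; ++i)
            if (dense[g + i] != 0.0f) {
                row.values.push_back(dense[g + i]);
                row.indices.push_back(static_cast<uint8_t>(i));
                ++kept;
            }
        while (kept++ < 2) {                    // pad groups that have fewer than 2 non-zeros
            row.values.push_back(0.0f);
            row.indices.push_back(0);
        }
    }
    return row;
}

// Dot product against a dense activation vector: only 2 MACs per group of 4.
float sparse_dot(const SparseRow& w, const std::vector<float>& act) {
    float sum = 0.0f;
    for (std::size_t k = 0; k < w.values.size(); ++k)
        sum += w.values[k] * act[(k / 2) * 4 + w.indices[k]];
    return sum;
}

int main() {
    std::vector<float> w   = {0, 3, 0, 2,   5, 0, 0, 1};   // already 2:4-pruned
    std::vector<float> act = {1, 1, 1, 1,   2, 2, 2, 2};
    SparseRow sw = compress_2_of_4(w);
    std::printf("%g\n", sparse_dot(sw, act));   // (3+2)*1 + (5+1)*2 = 17
}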
NVidia GPUs: Programmed in CUDA with cuDNN (Deep Neural Network Library)

• Nvidia CUDA: a C++ programming API abstraction for GPUs
  • Basic primitive: single-precision multiply-add, aX + Y
• Nvidia cuDNN: ML toolkit built on CUDA

cuDNN: Efficient Primitives for Deep Learning, arXiv. Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, Evan Shelhamer.
Google: Tensor Processing Unit

• 28nm process, 700MHz, 12.5GB/s of effective bandwidth, 40W
• Matrix Multiplier Unit (MXU): 65,536 8-bit multiply-and-add units
• Unified Buffer (UB): 24MB of SRAM that works as registers

https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
An Old New Idea: Systolic Arrays

Definition: a systolic array is a network of processors that rhythmically computes and passes data through the system.

Running example: a 2x3 weight matrix W times a 3x3 data matrix X gives the 2x3 result Y = W x X.

[Figure: the X values enter the array in a skewed order (x11 first; then x21 and x12; then x31, x22, x13; and so on) while the W values enter from the other side, so that matching operands meet at the right processing element at the right cycle.]

H. T. Kung, C. E. Leiserson: Algorithms for VLSI processor arrays; in: C. Mead, L. Conway (eds.): Introduction to VLSI Systems; Addison-Wesley, 1979
Cycle-by-cycle operation of the array:

Cycle 1: Y11 = w11*x11
Cycle 2: Y11 += w12*x21;  Y21 = w21*x11;  Y12 = w11*x12
Cycle 3: Y11 += w13*x31;  Y21 += w22*x21;  Y12 += w12*x22;  Y22 = w21*x12;  Y13 = w11*x13
Cycle 4: Y21 += w23*x31;  Y12 += w13*x32;  Y22 += w22*x22;  Y13 += w12*x23;  Y23 = w21*x13
Cycle 5: Y22 += w23*x32;  Y13 += w13*x33;  Y23 += w22*x23
Cycle 6: Y23 += w23*x33

After 6 cycles all six results are complete, each being the full dot product Yij = wi1*x1j + wi2*x2j + wi3*x3j.

6 cycles, 3x3x2 = 18 multiply-accumulates: once the array is full, several MACs complete every cycle.
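A small model of the schedule above, assuming (as the animation suggests) that the MAC for output Y[i][j] using w[i][k] and x[k][j] fires at cycle t = i + k + j; it models the wavefront timing only, not the physical register transfers. For the 2x3 by 3x3 example it reproduces the slide's count of 6 cycles and 18 MACs.

#include <cstdio>

int main() {
    const int R = 2, K = 3, J = 3;          // Y(RxJ) = W(RxK) * X(KxJ)
    double W[R][K] = {{1, 2, 3}, {4, 5, 6}};
    double X[K][J] = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}};   // identity: Y should equal W
    double Y[R][J] = {};

    int cycles = 0, macs = 0;
    for (int t = 0; t <= (R - 1) + (K - 1) + (J - 1); ++t, ++cycles)
        for (int i = 0; i < R; ++i)
            for (int k = 0; k < K; ++k)
                for (int j = 0; j < J; ++j)
                    if (i + k + j == t) {               // MACs firing this cycle
                        Y[i][j] += W[i][k] * X[k][j];
                        ++macs;
                    }

    std::printf("%d cycles, %d MACs\n", cycles, macs);   // 6 cycles, 18 MACs
    for (int i = 0; i < R; ++i)
        std::printf("Y%d: %g %g %g\n", i + 1, Y[i][0], Y[i][1], Y[i][2]);
}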


Matrix Multiplier Unit

• 65,536 8-bit integer multipliers
• At 700MHz that is 45.9 TMACS maximum throughput

https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
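A quick check of the peak-throughput figure:

$$ 65{,}536\ \text{MAC units} \times 0.7 \times 10^{9}\ \text{cycles/s} \approx 4.59 \times 10^{13}\ \text{MAC/s} = 45.9\ \text{TMACS} $$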
Bringing Memory and Processing Closer (continued)

[Figure: four organizations compared side by side: a mesh processor (every PE holds its own weights and data streams through the fabric), a TPU (a dedicated MAC array with weights and data in on-chip memory), a GPU and a CPU (weights and data in off-chip 40G RAM).]

Execution styles: layer-pipelined (weights stay resident, data streams through) versus layer-sequential and batched.
Cerebras Wafer Scale Engine: an AI Supercomputer for Layer-Pipelined Computing

• Startup based in Sunnyvale, ~250 people
• Unicorn with >$800M funding
• Focus: supercomputer for ML training
• Single 21.5cm by 21.5cm chip in 7nm
• 2.4 trillion transistors
• 850,000 'AI processor cores' in an 800 by 1060 array
• Total on-chip memory: ~40Gb of fast SRAM
• ~100 PetaByte/second fabric bandwidth

[Photo: the Cerebras Wafer Scale Engine]
[Photo: the CS-1 hardware, the box that hosts the wafer]
Wavelets Flowing on the Network-on-Chip

• The fabric transmits one 32-bit wavelet per clock cycle in each direction
• Each PE has wavelet buffers, with a forwarding table per buffer

[Figure: a grid of PEs exchanging wavelets over the network-on-chip]
A Dense Layer Kernel in Hardware

• The ML layer implementation on the CS-1 is massively parallel multiply-accumulate
• The data operand is efficiently streamed in at a high rate
• The weight operand is stationary, held in the ~40Gb of high-speed on-chip memory
• The result (sum of products) is streamed out to the next layer

[Figure: a region of the 850,000 (1020x830) processors. Data elements d stream across rows of PEs; each PE multiplies an incoming element by its locally stored weight w and adds it to a running sum, so the array computes Result = [D] x [W]. Each PE has 48K of SRAM and FIFO queues on its input links (in[0..7]) and output links (out[0..5]).]
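A toy model of that dataflow, assuming one stationary weight per PE and one streamed data element per step; the grid size and values are made up for illustration.

#include <cstdio>

int main() {
    // Result = [D] x [W]: D is a 1xN vector streamed in, W is the NxM stationary weight block.
    const int N = 4, M = 3;
    double W[N][M] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {1, 1, 1}};   // stationary weights
    double out[M] = {};                                              // per-column accumulators

    double stream[N] = {1, 0, 2, 1};   // data operand arriving one element per step
    for (int t = 0; t < N; ++t)        // each step, element stream[t] passes row t of weights
        for (int j = 0; j < M; ++j)
            out[j] += stream[t] * W[t][j];   // each PE performs one multiply-accumulate
    std::printf("%g %g %g\n", out[0], out[1], out[2]);   // 16 19 22
}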
Layer-Pipelined Execution

Input streams in continuously; output (e.g. cat/dog classifications) streams out continuously.

• Map each element in the TensorFlow network to a kernel in the library
• Place, size and shape the kernels such that as many PEs as possible are used
  – The slowest kernel determines the throughput!
• Rotate and flip such that wire length and twisting are minimized
  – Wire length does not matter!
• Connect the kernels
• Output: an interconnect bit stream and a machine-language program for each PE
Xilinx' ACAP: "Adaptive Compute Acceleration Platform"

• Two dual-core CPUs, mainly for control
• FPGA fabric (~80% of the die):
  • 900K LUTs = ~2M gates
  • 1900 DSP cells, each with an FP32 multiply-accumulate, 2x INT18 MAC or 3x INT8 MAC
  • 27Mb distributed RAM
• Systolic array of VLIW processors (~20%), in a 4x40 array

Example application: 5-channel LTE20 wireless (Xilinx.com white paper)
Xilinx: Systolic Array of VLIW RISC Processors

• INT8: 128 parallel MACs
• FP32: 8 parallel MACs
Summary

• Computation for machine learning


• Dense (Fully Connected) and convolution layers
• Speeding up Computation: Floating point formats
• ML hardware classes:
• CPU, GPU, TPU, custom
• GPU structure
• Layer-Pipelined execution of ML
• TensorFlow to hardware flow:
• IR, Kernel Graph
• Other approaches:
• FPGA ML flows, Google TPU



References

1. H. T. Kung, C. E. Leiserson: "Algorithms for VLSI processor arrays", in: C. Mead, L. Conway (eds.): Introduction to VLSI Systems; Addison-Wesley, 1979.
2. Schwarz, E.M.; Schmookler, M.; Son Dao Trong (July 2005): "Hardware Implementations of Denormalized Numbers", IEEE Transactions on Computers 54(7): 825-836. http://www.acsel-lab.com/arithmetic/arith16/papers/ARITH16_Schwarz.pdf
3. IEEE 754 float converter: https://www.h-schmidt.net/FloatConverter/IEEE754.html
4. TensorFlow: https://www.tensorflow.org
5. Google TPU: https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
6. ACAP architecture: https://www.xilinx.com/products/silicon-devices/acap/versal.html
7. Cerebras: http://www.cerebras.net
