EE292A Lecture 2.ML - Hardware


EE292A Lecture 2

Machine Learning Network to Custom Hardware

Patrick Groeneveld
AMD
[email protected]

Copyright ©2024 by Raúl Camposano, Antun Domic and Patrick Groeneveld


Major Steps in the Hardware Mapping Flow
[Flow diagram] C++/TensorFlow/PyTorch → System (assembled from IP blocks) → RTL → Circuit → Physical → Mask → Chip tapeout, sent to the IC fab.

Instructor annotations on the flow: Patrick covers the front end today; Raúl, Antun and Patrick cover the later stages (RTL, circuit, physical/mask) in the next weeks.


EDA flow: Hundreds of Algorithmic Steps
From C++/TensorFlow/PyTorch source down to the mask data, the flow alternates optimization and synthesis steps, each producing a refined intermediate representation:

• Architectural synthesis, ML kernel matching → MLIR
• Datapath optimization, high-level synthesis → RTL
• Logic/area optimization, logic synthesis (logic minimization, technology mapping) → mapped netlist, minimum area
• Timing optimization: rough sizing, high-fanout buffering → conditioned netlist
• Global placement (place for minimal wire length and congestion), global routing → placed gates
• Physical optimization 1: gate sizing, buffering, global routing, incremental placement → placed gates, global route
• Detailed placement: all cells in legal locations → placed gates, global route + clock wires
• Physical optimization 2: gate sizing, buffering, incremental global route → detailed-placed gates, global + track route, scan wires
• Detailed routing: generate the wire pattern → routed design
• Physical optimization 3: gate sizing, buffering, fill insertion → GDS2, chip tapeout, sent to the IC fab


Today: Machine Learning Network to Custom Hardware

• Computation for machine learning
  • Dense (Fully Connected) and convolution layers
• Speeding up computation: floating point formats, multiplication
• ML hardware classes:
  • CPU, GPU, TPU, custom
  • GPU structure
• Layer-pipelined execution of ML
• TensorFlow to hardware flow, with a real-life example:
  • IR, kernel graph
• Other approaches:
  • FPGA ML flows, Google TPU


Apple A12 iPhone SoC

• 7nm TSMC FinFET
• 9.89 x 8.42 mm = 83.27 mm²
• 6.9 billion transistors
• 4 GPU cores (~18% of area): 9 blocks
• 6 CPU cores (~14%): 13 blocks
  • 4 'Tempest' low-power CPU cores
  • 2 'Vortex' high-performance CPU cores
  • L2 & L3 caches
• 8 NPU/TPU cores (~7%): 4 blocks
• DDR (~3%): 1 block
• Misc (~57%): 50 unique blocks
• Total: ~75 unique blocks

[Die photo with the GPU, HP CPU, LP CPU and NPU blocks annotated]
Source: TechInsights/Anandtech
The Past 6 Years of Apple Mobile SoCs
[Die photos, each annotated with the LP CPU, HP CPU, GPU and NPU blocks and the L2/L3 caches:]
A12: iPhone Xs (2018), A13: iPhone 11 (2019), A14: iPhone 12 (2020), A15: iPhone 13 (2021), A16: iPhone 14 (2022), A17: iPhone 15 Pro (2023)

Sources: TechInsights/Anandtech/Angstronomics
A Neural Network with 7 Layers

[Figure: a 28x28 input image (MNIST database) passes through a convolutional filter, ReLU and max-pool down-sampling, a second filter and down-sample stage, and two fully connected network layers, producing 10 outputs, one per digit 0-9.]

Layer types: convolutional filter, rectified linear unit (ReLU), max-pool, fully connected network.
Convolutional Neural Network for Self-Driving

[Figure: camera images feeding a convolutional network]

Source: Tesla (https://www.youtube.com/watch?v=Ucp0TTmvqOE)
Automotive: Full Self-Driving Tesla

Tesla "Full Self Driving" computer board

Source: @Tesla on twitter
Autonomous Driving: Software & Hardware

Less than 100 milliseconds

Image Processing: Perception: Localization: Planning and


Cameras, Radar

Noise Reduction Feature Detection Cars, Traffic signs, Actions:


Lens Distortion Sensor fusion People, Motion Navigation Brake
Accel.
Correction Tracking Steering,
Planning Steer
High Dynamic Range Road Tracking Braking
Processing Threat Detection Acceleration

Hardware: Deep Learning TPU, GPU or CPU? CPU


Dedicated Signal Inference:
processor TPU or GPU or CPU?
April 2, 2024 Stanford EE292A Lecture 1 11
Tesla Chip Designer

Still from the Tesla Autonomy Day video: Pete Bannon, lead chip engineer, with Elon Musk.
Tesla's Full Self Driving Chip: Inference Engine

Hardware with CPUs, GPUs and TPUs on one die (image processor, video encoder, 16X GPU, 12X CPU, 2X TPU):
• 6 billion transistors
• 14nm FF, 2.6 cm²
• 250M standard cells
• ~30 Watt
• Almost a slicing floorplan
• 12X CPU (ARM A72): ~20% of the area
• 16X GPU (ARM Mali): ~20% of the area
• 2X TPU (Tensor Processing Unit): ~45% of the area

Source: Tesla (https://www.youtube.com/watch?v=Ucp0TTmvqOE); Tesla Autonomy Day, April 22, 2019
Form Follows Function in the TPU

[Dataflow figure: a 32 MByte image-data SRAM with 1 TByte/sec bandwidth feeds the 96x96 array at 256 Bytes/cycle; weights are read at 128 Bytes/cycle; the array of 8-bit multipliers with 32-bit adders feeds ReLU and pooling units ("layer fusion") and writes results back to the buffer at 128 Bytes/cycle, all in one clock cycle.]

• Runs instructions generated by a custom compiler: DMA read/write, convolution, FC, scale, etc.
• 36 TOPS @ 2GHz; 2X on chip = 7.5 Watts
• Performance: 2100 frames/sec
  • Compare 16X GPU: 17 fps (~128X slower)
  • Compare 12X CPU: 1.5 fps (~1400X slower)

Source: Tesla (https://www.youtube.com/watch?v=Ucp0TTmvqOE)
Tesla's New FSD Hardware 4

FSD Hardware 3 (2019):
• 12X CPU (ARM A72)
• 16X GPU (ARM Mali)
• 2X TPU
• 14nm Samsung process

FSD Hardware 4 (2023):
• 20X CPU (ARM A72)
• 16X GPU (ARM Mali)
• 3X TPU
• 7nm Samsung process??

Source: https://twitter.com/greentheonly/status/1625905234076741641?s=20
The Dense (Fully Connected) Layer

An input vector of 7 values is mapped to an output vector of 5 values through a 7x5 weight matrix. These weights are 'trainable'; in practice the weight matrix is generally quite sparse.

$out_j = \sum_{i=1}^{n} in_i \cdot w(i,j)$

Each output is a sum of products over all inputs: one multiply and one add per weight.

tensor dense_layer(const tensor &in, const size_t outSize) {
  tensor out(outSize);                        // accumulators, initialized to zero
  for (size_t j = 0; j < in.size(); j++) {    // input index
    for (size_t i = 0; i < out.size(); i++) { // output index
      out[i] += w[i][j] * in[j];              // w: the layer's trainable weight matrix
    }
  }
  return out;
}
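Below is a minimal self-contained version of the same loop nest, using std::vector instead of the slide's tensor class so it compiles and runs stand-alone; the weights[output][input] layout is an assumption made for this illustration.

// Self-contained dense layer sketch: one multiply-accumulate per weight.
#include <cstddef>
#include <cstdio>
#include <vector>

std::vector<float> dense_layer(const std::vector<float>& in,
                               const std::vector<std::vector<float>>& weights) {
    std::vector<float> out(weights.size(), 0.0f);        // one output per weight row
    for (std::size_t i = 0; i < weights.size(); ++i)     // output index
        for (std::size_t j = 0; j < in.size(); ++j)      // input index
            out[i] += weights[i][j] * in[j];             // one multiply-accumulate
    return out;
}

int main() {
    std::vector<std::vector<float>> w = {{1.f, 2.f, 3.f}, {4.f, 5.f, 6.f}};  // 2 outputs, 3 inputs
    std::vector<float> out = dense_layer({1.0f, 0.0f, 2.0f}, w);
    std::printf("%g %g\n", out[0], out[1]);   // 1*1 + 3*2 = 7 and 4*1 + 6*2 = 16
}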
Convolutional Layer: Code Sketch

tensor rgb_convolution_layer(const tensor &input, const convFilter &weights)
{
  // One output plane per filter; boundary handling / zero-padding is omitted in this sketch.
  tensor output(weights.size(), input.width(), input.height());

  for (int color = red; color <= blue; color++) {                   // 3 colors
    for (int row = 0; row < input.width(); row++) {                 // image rows
      for (int column = 0; column < input.height(); column++) {     // image columns
        for (int filter = 0; filter < weights.size(); filter++) {   // 32 filters
          for (int i = 0; i < weights[filter].width(); i++) {       // 5
            for (int j = 0; j < weights[filter].height(); j++) {    // 5
              output[filter][row][column] +=                        // one multiply-add
                weights[filter][i][j] * input[color][row+i][column+j];
            }
          }
        }
      }
    }
  }
  return output;
}

A typical smartphone picture has 3 x 3840 x 2160 pixels (RGB); here there are 32 filters of 5x5 pixels each. So this convolution layer requires 3 x 3840 x 2160 x 32 x 5 x 5 = 19,906,560,000 multiply-accumulates, i.e. about 20 billion floating-point multiply-adds.
With 1 MAC per clock cycle at 2GHz, that would take roughly 10 seconds…
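As a quick check of that timing claim (assuming, as stated, one multiply-accumulate completes per clock cycle):

$$\frac{1.99 \times 10^{10}\ \text{MACs}}{2 \times 10^{9}\ \text{MACs/s}} \approx 10\ \text{seconds for this single layer}$$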
Convolution Layer = Matrix Multiplication

• Example on the right: one 3x3 RGB input image (so 3x3x3 values)
• Two sets of 2x2 convolution weights for feature extraction over the 3 colors
  • So 2x2x2x3 = 24 filter weights
• No 0-padding, 4 outputs per convolution filter
• Construct the 4x12 image data matrix D: repeats each data element
• Construct the 12x2 filter weight matrix F: stores all weights
• Multiply the matrices: D x F = O, with O of size 4x2

Matrix multiplication D x F = O: we need hardware that is good at this.
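A minimal sketch of this im2col construction follows, assuming a CxHxW channel-major input layout, stride 1 and no padding; the function names and the Mat alias are illustrative, not any library's API.

// im2col: one row of D per output position, one column per (channel, filter-row, filter-column).
#include <cstddef>
#include <cstdio>
#include <vector>

using Mat = std::vector<std::vector<float>>;

Mat im2col(const std::vector<float>& in, int C, int H, int W, int R, int S) {
    int outH = H - R + 1, outW = W - S + 1;              // no zero-padding
    Mat D(outH * outW, std::vector<float>(C * R * S));
    for (int y = 0; y < outH; ++y)
        for (int x = 0; x < outW; ++x)
            for (int c = 0; c < C; ++c)
                for (int i = 0; i < R; ++i)
                    for (int j = 0; j < S; ++j)
                        D[y * outW + x][(c * R + i) * S + j] =
                            in[(c * H + y + i) * W + (x + j)];
    return D;
}

// Plain matrix product O = D x F; each row of O holds the filter responses for one output position.
Mat matmul(const Mat& D, const Mat& F) {
    Mat O(D.size(), std::vector<float>(F[0].size(), 0.0f));
    for (std::size_t r = 0; r < D.size(); ++r)
        for (std::size_t k = 0; k < F[0].size(); ++k)
            for (std::size_t c = 0; c < F.size(); ++c)
                O[r][k] += D[r][c] * F[c][k];
    return O;
}

int main() {
    // The slide's shapes: 3 channels, 3x3 image, 2x2 filters, 2 filters.
    std::vector<float> img(3 * 3 * 3, 1.0f);            // all-ones image for brevity
    Mat D = im2col(img, 3, 3, 3, 2, 2);                 // 4 x 12
    Mat F(12, std::vector<float>(2, 0.5f));             // 12 x 2, two all-0.5 filters
    Mat O = matmul(D, F);                               // 4 x 2, every entry 12*0.5 = 6
    std::printf("O is %zu x %zu, O[0][0] = %g\n", O.size(), O[0].size(), O[0][0]);
}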
IEEE754 Floating Point Format Standard

16-bit half-precision format: 1 sign bit S, 5 exponent bits E, 10 mantissa bits M.
The normalized mantissa has an implicit 1 as its most significant bit; the stored mantissa bits have weights 1/2, 1/4, ..., 1/1024. The exponent is stored with a bias offset of 15:

$X = (-1)^S \cdot (1.M)_{bin} \cdot 2^{E-15}$

IEEE 754 sizes (C/C++ and numpy names):
• long double: 128 bits total, 15 exponent bits, 112 mantissa bits, bias 16383
• double (the Python default): 64 bits, 11 exponent bits, 52 mantissa bits, bias 1023
• float: 32 bits, 8 exponent bits, 23 mantissa bits, bias 127
• half (not standard in C/C++): 16 bits, 5 exponent bits, 10 mantissa bits, bias 15

Examples:
• 0 01111 0000000000: mantissa 1.0000000000_bin = 1.0, so X = (-1)^0 * 1.0 * 2^(15-15) = 1.0
• 1 10001 0110000000: mantissa 1.0110000000_bin = 1.375, so X = (-1)^1 * 1.375 * 2^(17-15) = -5.5

Try it on-line: https://www.h-schmidt.net/FloatConverter/IEEE754.html
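A small sketch that decodes a raw 16-bit pattern using the formula above (normal numbers only; zero, denormals, infinity and NaN follow on the next slides). It is written directly from the format definition, not against any particular library:

#include <cmath>
#include <cstdint>
#include <cstdio>

double decode_fp16_normal(uint16_t bits) {
    int sign     = (bits >> 15) & 0x1;
    int exponent = (bits >> 10) & 0x1F;   // 5 bits, bias 15
    int mantissa =  bits        & 0x3FF;  // 10 bits
    double m = 1.0 + mantissa / 1024.0;   // implicit leading 1
    return (sign ? -1.0 : 1.0) * m * std::pow(2.0, exponent - 15);
}

int main() {
    std::printf("%g\n", decode_fp16_normal(0x3C00)); // 0 01111 0000000000 ->  1.0
    std::printf("%g\n", decode_fp16_normal(0xC580)); // 1 10001 0110000000 -> -5.5
}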
Special Cases: IEEE754 Floating Point

In the 16-bit half-precision format (1 sign bit, 5 exponent bits, 10 mantissa bits) the all-zero and all-one exponent fields are reserved:

• x 00000 0000000000 : zero (exponent all zeroes and mantissa all zeroes)
• 0 11111 0000000000 : +Infinity (sign 0, exponent all ones, mantissa all zeroes)
• 1 11111 0000000000 : -Infinity (sign 1, exponent all ones, mantissa all zeroes)
• x 11111 mmmmmmmmmm : NaN (Not a Number: exponent all ones and mantissa not all zeroes)
Binary Integer Multiplication

[Figure: long multiplication of two 8-bit binary numbers: one shifted partial product per multiplier bit, summed into a 16-bit result.]

Multiplier complexity is quadratic with the number of bits.
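A sketch of the grade-school shift-and-add scheme that such an array multiplier implements in hardware: one shifted partial product per multiplier bit, which is why the hardware cost grows quadratically with the operand width.

#include <cstdint>
#include <cstdio>

uint32_t shift_add_multiply(uint16_t a, uint16_t b) {
    uint32_t product = 0;
    for (int i = 0; i < 16; ++i)                        // one partial product per bit of b
        if ((b >> i) & 1u)
            product += static_cast<uint32_t>(a) << i;   // shifted copy of a
    return product;
}

int main() {
    std::printf("%u\n", static_cast<unsigned>(shift_add_multiply(29, 220)));  // 6380
}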
IEEE754 16-bit Floating Point Multiplication Hardware

$X = (-1)^S \cdot (1.M)_{bin} \cdot 2^{E-15}$

Datapath for multiplying two FP16 numbers:
• Sign: XOR of the two sign bits
• Exponent: add the two 5-bit exponents, then subtract the fixed exponent bias (15)
• Mantissa: prepend the implicit 1 to each 10-bit mantissa; an 11x11-bit binary multiplier produces a 22-bit product
• Normalization: if the product is 2.0 or larger, a mux selects the version shifted right by one bit and the exponent is incremented
• Rounding logic reduces the 22-bit product back to a 10-bit mantissa
• Result: packed back into S (1 bit), E (5 bits), M (10 bits)
IEEE754 16-bit Floating Point Multiplication Hardware: worked example 1.0 x (-4.0)

• Inputs: 0 01111 0000000000 (= 1.0) and 1 10001 0000000000 (= -4.0)
• Sign: 0 XOR 1 = 1 (negative)
• Exponents: 15 + 17 = 32; subtract the bias 15, giving 17
• Mantissas with the implicit 1: 1024 x 1024 = 1,048,576, the 22-bit product 01 0000000000 0000000000; already normalized
• Result: 1 10001 0000000000 = -4.0
IEEE754 16-bit Floating Point Multiplication Hardware: worked example 1.5 x 1.5

• Inputs: both 0 01111 1000000000 (= 1.5)
• Sign: 0 XOR 0 = 0
• Exponents: 15 + 15 = 30; subtract the bias 15, giving 15
• Mantissas with the implicit 1: 1536 x 1536 = 2,359,296, the 22-bit product 10 0100000000 0000000000; the value 2.25 is 2 or larger, so the product is shifted right by one bit and the exponent incremented to 16
• Result: 0 10000 0010000000 = (-1)^0 * 1.125 * 2^(16-15) = 2.25
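A behavioural sketch of the datapath above, covering normal inputs only and truncating instead of performing full IEEE rounding; the helper names and bit manipulation are illustrative:

#include <cstdint>
#include <cstdio>

uint16_t fp16_mul(uint16_t a, uint16_t b) {
    uint32_t sign = ((a >> 15) ^ (b >> 15)) & 0x1;                 // XOR of sign bits
    int32_t  exp  = ((a >> 10) & 0x1F) + ((b >> 10) & 0x1F) - 15;  // add, subtract bias
    uint32_t ma   = 0x400 | (a & 0x3FF);                           // implicit 1 -> 11 bits
    uint32_t mb   = 0x400 | (b & 0x3FF);
    uint32_t prod = ma * mb;                                       // 22-bit product, 2^20 = 1.0

    if (prod & (1u << 21)) {                    // product >= 2.0: normalize
        prod >>= 1;
        exp += 1;
    }
    uint32_t mantissa = (prod >> 10) & 0x3FF;   // drop the implicit 1, truncate
    return (uint16_t)((sign << 15) | ((exp & 0x1F) << 10) | mantissa);
}

int main() {
    std::printf("%04x\n", fp16_mul(0x3C00, 0xC400));  // 1.0 * -4.0 -> c400 (-4.0)
    std::printf("%04x\n", fp16_mul(0x3E00, 0x3E00));  // 1.5 *  1.5 -> 4080 (2.25)
}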
Speeding up Multiply-Add: Selecting a Floating-Point Format

Format                          Total  Exp  Mantissa  Range                   Used by
FP64  Floating Point, 64 bit     64    11     52      [5.4E-79, 7.2E+75]      IEEE 754 double
FP32  Floating Point, 32 bit     32     8     23      [1.75E-38, 3.7E+38]     IEEE 754 float
TF32  Tensor Floating Point      19     8     10      [1.75E-38, 3.7E+38]     Nvidia Hopper, Blackwell
BF16  Brain Floating Point       16     8      7      [1.75E-38, 3.7E+38]     Google, Nvidia, others
CF16  Floating Point, 16 bit     16     6      9      [9.31E-10, 4.29E+9]     Cerebras
FP16  Floating Point, 16 bit     16     5     10      [6.135E-5, 6.5504E+5]   IEEE 754 half
FP8   Minifloat E5M2              8     5      2      [6.135E-5, 6.5504E+5]   Nvidia Hopper, Blackwell
FP8   Minifloat E4M3              8     4      3      [7.81E-3, 3.65E+2]      Nvidia Hopper, Blackwell
FP4   Microfloat E3M0             4     3      0      [2.5E-1, 8.0E+0]        Nvidia Blackwell

Each format additionally has 1 sign bit; total bits = 1 + exponent bits + mantissa bits.
IEEE754 FP16: the Good, the Bad and the Ugly

• FP16 requires about 4X less multiplier hardware than FP32 (e8 m23), which is why it is popular in ML.
• But it sacrifices range and resolution:
  • The upper end of the range is just 65504
  • The smallest 'normal' number is just 0.00006135
  • And small numbers are not rare in ML :-(
• Alternatives: CF16 (e6 m9) trades some resolution for more range; BF16 (e8 m7) trades a lot of resolution for much more range.

Multiplier hardware complexity is quadratic with the number of mantissa bits.
MiniFloats are a Numerical Minefield

Example: a 6-bit E3M2 minifloat (1 sign, 3 exponent, 2 mantissa bits) with range [-14, +14].

[Figure: the multiplication table Z = A * B and the addition table Z = A + B over all representable values; large regions of both tables saturate to +Inf or -Inf.]

Image source: https://en.wikipedia.org/wiki/Minifloat
FP16 Underflow: Denormal Numbers

• Denormals extend the range, allowing smaller numbers to be represented; this is important for ML with half precision.
• Encoding: exponent field = 0, and an implicit 0 (instead of 1) as the most significant bit of the mantissa:

if E > 0:  $X = (-1)^S \cdot (1.M)_{bin} \cdot 2^{E-15}$   /* normal */
else:      $X = (-1)^S \cdot (0.M)_{bin} \cdot 2^{1-15}$   /* denormal */

Examples:
• Normal FP16 number (E > 0): 0 01111 0000000000 → X = 1.0 * 2^(15-15) = 1.0
• Smallest FP16 normal number: 0 00001 0000000000 → X = 1.0 * 2^(1-15) = 0.000061035
• Largest FP16 denormal number: 0 00000 1111111111 → 0.1111111111_bin = 0.99902344, so X = 0.99902344 * 2^(1-15) ≈ 0.000060976
• Smallest FP16 denormal number: 0 00000 0000000001 → 0.0000000001_bin = 1/1024 = 0.0009765625, so X = (1/1024) * 2^(1-15) ≈ 0.0000000596
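Extending the earlier decode sketch with the denormal branch: when the exponent field is 0 the implicit bit becomes 0 and the exponent is fixed at 2^(1-15). The all-ones exponent (infinity/NaN) is still ignored in this illustrative sketch.

#include <cmath>
#include <cstdint>
#include <cstdio>

double decode_fp16(uint16_t bits) {
    int sign     = (bits >> 15) & 0x1;
    int exponent = (bits >> 10) & 0x1F;
    int mantissa =  bits        & 0x3FF;
    double value;
    if (exponent > 0)
        value = (1.0 + mantissa / 1024.0) * std::pow(2.0, exponent - 15);  // normal
    else
        value = (mantissa / 1024.0) * std::pow(2.0, 1 - 15);               // denormal (or zero)
    return sign ? -value : value;
}

int main() {
    std::printf("%.10g\n", decode_fp16(0x0400));  // smallest normal:   0.00006103515625
    std::printf("%.10g\n", decode_fp16(0x03FF));  // largest denormal:  ~0.0000609756
    std::printf("%.10g\n", decode_fp16(0x0001));  // smallest denormal: ~0.0000000596
}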


Adding Denormal Multiplication to the Hardware

The pipelined 11x11-bit multiplier datapath is extended with normal/denormal control: when an input's exponent field is all zeroes, a 0 (instead of the implicit 1) is prepended to its 10-bit mantissa, and an extra mux stage handles re-normalization of the result.

Worked example: 2.0 (0 10000 0000000000) times the smallest denormal, 0.0000000596 (0 00000 0000000001), gives 0.000000119 (0 00000 0000000010).

Schwarz, E.M.; Schmookler, M.; Son Dao Trong: "Hardware Implementations of Denormalized Numbers".
ML Implementation: Software, Hardware or a Hybrid of Both?

Four classes, from most to least specialized:
• Application-specific (custom) hardware, as an ASIC (+ highest performance, + lowest unit cost, - high NRE, - not configurable) or an FPGA (+ low NRE, - high unit cost)
  + custom precision, + best performance, - higher design effort
• TPU/NPU: dedicated multiply-add hardware
  + fast, low-cost, low power, + exploits memory locality, - single integer/FP precision
• GPU: software on 100s to thousands of CUDA (and Tensor) cores
  + floating point, - fixed precision, - memory overhead
• CPU: software on general-purpose CPU cores
  + easy compilation, + universal, - ~1000X slower
Machine Learning Hardware: Different Philosophies

• Intel: add custom multiply-accumulate instructions to the CPU
• Tesla: Full Self Driving chip with TPUs, GPUs and CPUs
• Apple: TPU/NPU for Face ID, plus GPU
• Nvidia: GPU extension with multiply-adds (Tensor cores)
• Cerebras: massive multiprocessor
• Xilinx: FPGA fabric + systolic array
• Google: dedicated TPU
Bringing Memory and Processing Closer

[Figure: three organizations compared.]
• TPU/NPU (dedicated ML hardware): an array of processing elements (p), each with local weights (w), fed directly from on-chip data memory (m)
• GPU: many cores with local weights, backed by 40G of weights RAM and a separate data RAM
• CPU: a few cores, with weights and data in off-chip 40G RAM
Note that Tesla's FSD Chip and Smartphone SoCs Deploy a Hybrid of All Options

• Tesla FSD chip: 12X CPU, 16X GPU, 2X TPU, plus an image processor and video encoder
• A14 iPhone SoC: 4X LP CPU, 2X HP CPU, 4X GPU, NPU, plus caches

Source: Tesla (https://www.youtube.com/watch?v=Ucp0TTmvqOE); Tesla Autonomy Day, April 22, 2019
The Baseline: CPU Performance Example

• Intel Xeon Phi, 72 cores, 4 threads per core, 215W
• 1024x1024 floating-point matrix multiplication
• ~1B MACs in ~0.01 s = 0.1 TMACS
• Max throughput @ 1.3GHz: 64 x 1.3B = 0.09 TOPS

Note: TOPS = Tera Operations Per Second

What is the problem with CPUs?

https://software.intel.com/en-us/articles/performance-of-classic-matrix-multiplication-algorithm-on-intel-xeon-phi-processor-system
NVidia Grace Hopper H100 (2022)
Successor to Volta (2018) and Ampere (2020)

GPU die: 80B transistors, 50MB L2 cache, 4nm TSMC, ~400W, about 2.8cm x 2.8cm (814 mm²), on a silicon interposer with 40Gb of HBM3 memory.
A datacenter board with 8 of these retails at ~$200K.

NVidia Blackwell B200 (2024)

GPU die: 208B transistors on two dies on an interposer, 4nm TSMC.
A datacenter board with 8 of these retails at over $200K.
GPU Architecture: Nvidia Hopper

• 8 Graphics Processing Clusters
• 7 Texture Processing Clusters
• 14 (Volta-style) Streaming Multiprocessors
• 144 Streaming Multiprocessors in total
• Each SM: 64 FP32 cores, 64 INT32 cores, 32 FP64 cores, 8 Tensor cores
• 50 MB L2 cache
• 6 memory controllers
• 30 TFLOPS

https://devblogs.nvidia.com/inside-volta/
http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/
GPU Structure: the 'Streaming Multiprocessor'

• 8 Graphics Processing Clusters
  – 7 Texture Processing Clusters
  – 16 Ampere Streaming Multiprocessors
• 144 Streaming Multiprocessors
• Each SM has 4 times:
  – 16 FP64 cores, or
  – 32 FP32 cores, or
  – 16 INT32 cores, or
  – 1 Tensor core for MAC

A "Tensor core" does a 4x4 multiply-accumulate.

Hopper Tensor core peak throughput by floating-point format:
TF32: 500 TFLOPS, FP16: 1000 TFLOPS, BF16: 1000 TFLOPS, FP8: 2000 TFLOPS, INT8: 2000 TOPS
Nvidia Hardware for Exploiting Weight Sparsity

4:2 (i.e. 2:4 structured) sparse matrix multiplication: two non-zero weights in every group of four.

Claims: double the throughput, half the memory.

Source: https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/
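A sketch of the structured-sparsity idea, assuming the weights have already been pruned to at most two non-zeros per group of four (and that the row length is a multiple of four); the compressed format here is illustrative and is not Nvidia's actual on-chip encoding.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

struct SparseRow {                  // compressed row: half the values plus metadata
    std::vector<float>   values;    // the 2 kept values per group of 4
    std::vector<uint8_t> indices;   // their positions (0..3) within the group
};

SparseRow compress_2_of_4(const std::vector<float>& dense) {
    SparseRow row;
    for (std::size_t g = 0; g < dense.size(); g += 4) {
        int kept = 0;
        for (int i = 0; i < 4 && kept < 2; ++i)
            if (dense[g + i] != 0.0f) {
                row.values.push_back(dense[g + i]);
                row.indices.push_back(static_cast<uint8_t>(i));
                ++kept;
            }
        while (kept++ < 2) {                    // pad groups that have fewer than 2 non-zeros
            row.values.push_back(0.0f);
            row.indices.push_back(0);
        }
    }
    return row;
}

// Dot product against a dense activation vector: only 2 MACs per group of 4.
float sparse_dot(const SparseRow& w, const std::vector<float>& act) {
    float sum = 0.0f;
    for (std::size_t k = 0; k < w.values.size(); ++k)
        sum += w.values[k] * act[(k / 2) * 4 + w.indices[k]];
    return sum;
}

int main() {
    std::vector<float> w   = {0, 3, 0, 2,   5, 0, 0, 1};   // already 2:4-pruned
    std::vector<float> act = {1, 1, 1, 1,   2, 2, 2, 2};
    SparseRow sw = compress_2_of_4(w);
    std::printf("%g\n", sparse_dot(sw, act));   // (3+2)*1 + (5+1)*2 = 17
}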
NVidia GPUs: Programmed in CUDA with cuDNN (Deep Neural Network Library)

• Nvidia CUDA: a C++ programming API abstraction for GPUs
  • Basic primitive: single-precision multiply-add, aX + Y
• Nvidia cuDNN: ML toolkit built on CUDA

cuDNN: Efficient Primitives for Deep Learning, arXiv. Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, Evan Shelhamer.
Google: Tensor Processing Unit

• 28nm process, 700MHz, 12.5GB/s of effective bandwidth, 40W
• Matrix Multiplier Unit (MXU): 65,536 8-bit multiply-and-add units
• Unified Buffer (UB): 24MB of SRAM that works as registers

https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
An Old New Idea: Systolic Arrays

Definition: a systolic array is a network of processors that rhythmically computes and passes data through the system.

Running example: a 2x3 weight matrix W times a 3x3 data matrix X gives the 2x3 result Y = W x X.

[Figure: the X values enter the array in a skewed order (x11 first; then x21 and x12; then x31, x22, x13; and so on) while the W values enter from the other side, so that matching operands meet at the right processing element at the right cycle.]

H. T. Kung, C. E. Leiserson: Algorithms for VLSI processor arrays; in: C. Mead, L. Conway (eds.): Introduction to VLSI Systems; Addison-Wesley, 1979
Cycle-by-cycle operation of the array:

Cycle 1: Y11 = w11*x11
Cycle 2: Y11 += w12*x21;  Y21 = w21*x11;  Y12 = w11*x12
Cycle 3: Y11 += w13*x31;  Y21 += w22*x21;  Y12 += w12*x22;  Y22 = w21*x12;  Y13 = w11*x13
Cycle 4: Y21 += w23*x31;  Y12 += w13*x32;  Y22 += w22*x22;  Y13 += w12*x23;  Y23 = w21*x13
Cycle 5: Y22 += w23*x32;  Y13 += w13*x33;  Y23 += w22*x23
Cycle 6: Y23 += w23*x33

After 6 cycles all six results are complete, each being the full dot product Yij = wi1*x1j + wi2*x2j + wi3*x3j.

6 cycles, 3x3x2 = 18 multiply-accumulates: once the array is full, several MACs complete every cycle.
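A small model of the schedule above, assuming (as the animation suggests) that the MAC for output Y[i][j] using w[i][k] and x[k][j] fires at cycle t = i + k + j; it models the wavefront timing only, not the physical register transfers. For the 2x3 by 3x3 example it reproduces the slide's count of 6 cycles and 18 MACs.

#include <cstdio>

int main() {
    const int R = 2, K = 3, J = 3;          // Y(RxJ) = W(RxK) * X(KxJ)
    double W[R][K] = {{1, 2, 3}, {4, 5, 6}};
    double X[K][J] = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}};   // identity: Y should equal W
    double Y[R][J] = {};

    int cycles = 0, macs = 0;
    for (int t = 0; t <= (R - 1) + (K - 1) + (J - 1); ++t, ++cycles)
        for (int i = 0; i < R; ++i)
            for (int k = 0; k < K; ++k)
                for (int j = 0; j < J; ++j)
                    if (i + k + j == t) {               // MACs firing this cycle
                        Y[i][j] += W[i][k] * X[k][j];
                        ++macs;
                    }

    std::printf("%d cycles, %d MACs\n", cycles, macs);   // 6 cycles, 18 MACs
    for (int i = 0; i < R; ++i)
        std::printf("Y%d: %g %g %g\n", i + 1, Y[i][0], Y[i][1], Y[i][2]);
}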


Matrix Multiplier Unit

• 65,536 8-bit integer multipliers
• At 700MHz that is 45.9 TMACS maximum throughput

https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
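A quick check of the peak-throughput figure:

$$ 65{,}536\ \text{MAC units} \times 0.7 \times 10^{9}\ \text{cycles/s} \approx 4.59 \times 10^{13}\ \text{MAC/s} = 45.9\ \text{TMACS} $$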
Bringing Memory and Processing Closer (continued)

[Figure: four organizations compared side by side: a mesh processor (every PE holds its own weights and data streams through the fabric), a TPU (a dedicated MAC array with weights and data in on-chip memory), a GPU and a CPU (weights and data in off-chip 40G RAM).]

Execution styles: layer-pipelined (weights stay resident, data streams through) versus layer-sequential and batched.
Cerebras Wafer Scale Engine: an AI Supercomputer for Layer-Pipelined Computing

• Startup based in Sunnyvale, ~250 people
• Unicorn with >$800M funding
• Focus: supercomputer for ML training
• Single 21.5cm by 21.5cm chip in 7nm
• 2.4 trillion transistors
• 850,000 'AI processor cores' in an 800 by 1060 array
• Total on-chip memory: ~40Gb of fast SRAM
• ~100 PetaByte/second fabric bandwidth

[Photo: the Cerebras Wafer Scale Engine]
[Photo: the CS-1 hardware, the box that hosts the wafer]
Wavelets Flowing on the Network-on-Chip

• The fabric transmits one 32-bit wavelet per clock cycle in each direction
• Each PE has wavelet buffers, with a forwarding table per buffer

[Figure: a grid of PEs exchanging wavelets over the network-on-chip]
A Dense Layer Kernel in Hardware

• The ML layer implementation on the CS-1 is massively parallel multiply-accumulate
• The data operand is efficiently streamed in at a high rate
• The weight operand is stationary, held in the ~40Gb of high-speed on-chip memory
• The result (sum of products) is streamed out to the next layer

[Figure: a region of the 850,000 (1020x830) processors. Data elements d stream across rows of PEs; each PE multiplies an incoming element by its locally stored weight w and adds it to a running sum, so the array computes Result = [D] x [W]. Each PE has 48K of SRAM and FIFO queues on its input links (in[0..7]) and output links (out[0..5]).]
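A toy model of that dataflow, assuming one stationary weight per PE and one streamed data element per step; the grid size and values are made up for illustration.

#include <cstdio>

int main() {
    // Result = [D] x [W]: D is a 1xN vector streamed in, W is the NxM stationary weight block.
    const int N = 4, M = 3;
    double W[N][M] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {1, 1, 1}};   // stationary weights
    double out[M] = {};                                              // per-column accumulators

    double stream[N] = {1, 0, 2, 1};   // data operand arriving one element per step
    for (int t = 0; t < N; ++t)        // each step, element stream[t] passes row t of weights
        for (int j = 0; j < M; ++j)
            out[j] += stream[t] * W[t][j];   // each PE performs one multiply-accumulate
    std::printf("%g %g %g\n", out[0], out[1], out[2]);   // 16 19 22
}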
Layer-Pipelined Execution

Input streams in continuously; output (e.g. cat/dog classifications) streams out continuously.

• Map each element in the TensorFlow network to a kernel in the library
• Place, size and shape the kernels such that as many PEs as possible are used
  – The slowest kernel determines the throughput!
• Rotate and flip such that wire length and twisting are minimized
  – Wire length does not matter!
• Connect the kernels
• Output: an interconnect bit stream and a machine-language program for each PE
Xilinx' ACAP: "Adaptive Compute Acceleration Platform"

• Two dual-core CPUs, mainly for control
• FPGA fabric (~80% of the die):
  • 900K LUTs = ~2M gates
  • 1900 DSP cells, each with an FP32 multiply-accumulate, 2x INT18 MAC or 3x INT8 MAC
  • 27Mb distributed RAM
• Systolic array of VLIW processors (~20%), in a 4x40 array

Example application: 5-channel LTE20 wireless (Xilinx.com white paper)
Xilinx: Systolic Array of VLIW RISC Processors

• INT8: 128 parallel MACs
• FP32: 8 parallel MACs
Summary

• Computation for machine learning


• Dense (Fully Connected) and convolution layers
• Speeding up Computation: Floating point formats
• ML hardware classes:
• CPU, GPU, TPU, custom
• GPU structure
• Layer-Pipelined execution of ML
• TensorFlow to hardware flow:
• IR, Kernel Graph
• Other approaches:
• FPGA ML flows, Google TPU



References

1. H. T. Kung, C. E. Leiserson: "Algorithms for VLSI processor arrays", in: C. Mead, L. Conway (eds.): Introduction to VLSI Systems; Addison-Wesley, 1979.
2. Schwarz, E.M.; Schmookler, M.; Son Dao Trong (July 2005): "Hardware Implementations of Denormalized Numbers", IEEE Transactions on Computers 54(7): 825-836. http://www.acsel-lab.com/arithmetic/arith16/papers/ARITH16_Schwarz.pdf
3. IEEE 754 float converter: https://www.h-schmidt.net/FloatConverter/IEEE754.html
4. TensorFlow: https://www.tensorflow.org
5. Google TPU: https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
6. ACAP architecture: https://www.xilinx.com/products/silicon-devices/acap/versal.html
7. Cerebras: http://www.cerebras.net
