06 Pipeline PDF
CIS 501
Introduction to Computer Architecture
Unit 6: Pipelining
[Figure: system layers: Application, OS, Compiler, Firmware, CPU, I/O, Memory, Digital Circuits]
Basic Pipelining
Data Hazards
Hardware: stalling and bypassing
Software: pipeline scheduling
Control Hazards
Branch prediction
Precise state
Readings
H+P
Background slides
http:///~amir/cse371/lecture_notes/pipeline.pdf
Quick Review
Single-cycle
Multi-cycle
CIS 501 (Martin/Roth): Pipelining
Pipelining
[Figure: multi-cycle datapath vs. pipelined datapath (PC, I$, register file, D$)]
Pipeline Terminology
[Figure: five-stage pipelined datapath (PC, I$, register file, D$) with latches F/D, D/X, X/M, M/W]
Pipeline Control
[Figure: the same datapath with a CTRL block; control signals xC, mC, wC travel down the latches with each insn]
Abstract Pipeline
[Figure: abstract pipeline: I$, regfile, D$ with latches F/D, D/X, X/M, M/W]
Pipeline Diagram
Cycles across, insns down
                 1  2  3  4  5  6  7
add r3,r2,r1     F  D  X  M  W
ld r4,0(r5)         F  D  X  M  W
st r6,4(r7)            F  D  X  M  W
Convention: X means ld r4,0(r5) finishes execute stage and writes into X/M latch at end of cycle 4
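The diagram convention above (cycles across, insns down, one new insn fetched per cycle) can be sketched in a few lines of Python. This is only an illustration for a stall-free five-stage pipeline; the helper names `pipeline_diagram` and `stage_in_cycle` are made up for this sketch and are not from the slides.

```python
STAGES = ["F", "D", "X", "M", "W"]

def pipeline_diagram(insns):
    """Return text rows for a stall-free pipeline; insn i enters F in cycle i + 1."""
    width = max(len(text) for text in insns) + 2
    total_cycles = len(insns) + len(STAGES) - 1
    header = " " * width + "".join(f"{c:<3}" for c in range(1, total_cycles + 1))
    rows = [header.rstrip()]
    for i, text in enumerate(insns):
        # Insn i is shifted right by i empty cells, then occupies one stage per cycle.
        cells = ["   "] * i + [f"{s:<3}" for s in STAGES]
        rows.append((text.ljust(width) + "".join(cells)).rstrip())
    return rows

def stage_in_cycle(insn_index, cycle):
    """Stage occupied by insn `insn_index` (0-based) in `cycle` (1-based), or None."""
    offset = cycle - 1 - insn_index
    return STAGES[offset] if 0 <= offset < len(STAGES) else None

for row in pipeline_diagram(["add r3,r2,r1", "ld r4,0(r5)", "st r6,4(r7)"]):
    print(row)
```

With this convention, `stage_in_cycle(1, 4)` returns `"X"`, matching the statement that ld r4,0(r5) is in execute during cycle 4.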
Pipelined
Clock period = 12ns
CPI = 1 (each insn takes 5 cycles, but 1 completes each cycle)
Performance = 12ns/insn
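Principles of Pipelining
Notation: N stages with latencies $t_n$, un-pipelined single-insn latency $L_1$, M insns; the subscript +P marks the pipelined version
$L_{1+P} = L_1$
$T_{+P} = 1/\max(t_n) \le N/L_1$
If the $t_n$ are equal (i.e., $\max(t_n) = L_1/N$), throughput $= N/L_1$
$L_{M+P} = M \cdot \max(t_n) \ge M \cdot L_1/N$
$S_{+P}$ (speedup) $= M \cdot L_1 / L_{M+P} \le N$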
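A small numeric sketch of these relations, assuming stage latencies are given as a list and ignoring latch overhead and pipeline fill time (as the formulas do). The stage latencies below are made-up values chosen only to show the balanced and unbalanced cases; `pipeline_metrics` is a hypothetical helper.

```python
def pipeline_metrics(stage_latencies, num_insns):
    """Throughput, M-insn latency and speedup for an N-stage pipeline."""
    n = len(stage_latencies)
    l1 = sum(stage_latencies)       # un-pipelined single-insn latency L1
    clock = max(stage_latencies)    # pipelined clock period is set by the slowest stage
    return {
        "stages": n,
        "throughput_unpipelined": 1 / l1,
        "throughput_pipelined": 1 / clock,
        "latency_m_unpipelined": num_insns * l1,
        "latency_m_pipelined": num_insns * clock,   # valid for M >> 1
        "speedup": (num_insns * l1) / (num_insns * clock),
    }

balanced = pipeline_metrics([10, 10, 10, 10, 10], 1000)
unbalanced = pipeline_metrics([10, 20, 10, 5, 5], 1000)
print(balanced["speedup"])    # 5.0: equal stages reach the bound N
print(unbalanced["speedup"])  # 2.5: the slowest stage limits the speedup
```

Both examples have the same total latency (50 time units), so the gap between 5.0 and 2.5 comes entirely from stage imbalance.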
Structural: two insns want to use same structure, one must wait
Data: two insns use same storage location
Control: one instruction affects whether another executes at all
+ Frequency increases
- IPC decreases
Ultimate metric is IPC * frequency
But people buy frequency, not IPC * frequency
Managing a Pipeline
[Table: pipeline depth N (5, 10, 20) vs. IPC, freq = 1/(80/N+1), and IPC*freq]
Structural Hazards
                 1  2  3  4  5  6  7  8
ld r2,0(r1)      F  D  X  M  W
add r1,r3,r4        F  D  X  M  W
sub r1,r3,r5           F  D  X  M  W
st r6,0(r1)               F  D  X  M  W
Operation I: stall
                 1  2  3  4  5  6  7  8  9
ld r2,0(r1)      F  D  X  M  W
add r1,r3,r4        F  D  X  M  W
sub r1,r3,r5           F  D  X  M  W
st r6,0(r1)               s* F  D  X  M  W
Data Hazards
+ No IPC degradation
- Increased area, power, latency (interconnect delay?)
For cheap, divisible, or highly contended resources (e.g., I$/D$)
Read-after-write (RAW)
True-dependence
add r2,r3→r1
sub r1,r4→r2
or r6,r3→r1
Write-after-read (WAR)
Anti-dependence
add r2,r3→r1
sub r5,r4→r2
or r6,r3→r1
Write-after-write (WAW)
Output-dependence
add r2,r3→r1
sub r1,r4→r2
or r6,r3→r1
Only one dependence between any two insns (RAW has priority)
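The three dependence types can be told apart mechanically from each insn's destination and source registers. The sketch below is a minimal illustration: the tuple encoding `(dst, srcs)` and the function name `classify_dependence` are assumptions made for this example, and the WAR-before-WAW ordering is an arbitrary choice (the slides only say that RAW has priority).

```python
def classify_dependence(first, second):
    """Classify the register dependence from `first` to the later insn `second`.

    Each insn is (dst, srcs). Returns "RAW", "WAR", "WAW", or None.
    """
    dst1, srcs1 = first
    dst2, srcs2 = second
    if dst1 in srcs2:      # later insn reads what the earlier one writes
        return "RAW"
    if dst2 in srcs1:      # later insn overwrites what the earlier one reads
        return "WAR"
    if dst1 == dst2:       # both write the same register
        return "WAW"
    return None

add = ("r1", ("r2", "r3"))       # add r2,r3→r1
sub_raw = ("r2", ("r1", "r4"))   # sub r1,r4→r2
sub_war = ("r2", ("r5", "r4"))   # sub r5,r4→r2
or_ = ("r1", ("r6", "r3"))       # or r6,r3→r1

print(classify_dependence(add, sub_raw))  # RAW
print(classify_dependence(add, sub_war))  # WAR
print(classify_dependence(add, or_))      # WAW
```

Note that add and sub r1,r4→r2 also have a WAR relation through r2; the function reports only RAW, following the priority rule.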
RAW
Read-after-write (RAW)
ldf f2,X(r1)
mulf f2,f0,f4
ldf f6,Y(r1)
addf f4,f6,f8
stf f8,Z(r1)
addi r1,8,r1
cmplti r1,800,r2
beq r2,Loop
add r2,r3→r1
sub r1,r4→r2
or r6,r3→r1
Problem: swap would mean sub uses wrong value for r1
True: value flows through this dependence
Using different output register for add doesn't help
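[Figure: pipelined datapath (PC, I$, regfile, D$) with latches F/D, D/X, X/M, M/W]
Stalling when the register file cannot be read in the cycle it is written:
                 1  2  3  4  5  6  7  8  9  10
add r2,r3→r1     F  D  X  M  W
sub r1,r4→r2        F  d* d* d* D  X  M  W
add r5,r6→r7           p* p* p* F  D  X  M  W
Stalling when sub can read r1 in the same cycle that add writes it:
                 1  2  3  4  5  6  7  8  9
add r2,r3→r1     F  D  X  M  W
sub r1,r4→r2        F  d* d* D  X  M  W
add r5,r6→r7           p* p* F  D  X  M  W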
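The stall behaviour in these diagrams can be reproduced with a short model. This is a sketch under stated assumptions: a five-stage F D X M W pipeline with no bypassing, in-order decode, and insns encoded as `(dst, srcs)` tuples; the function name and the `same_cycle_write_read` flag are invented for the example.

```python
def decode_cycles(insns, same_cycle_write_read):
    """Return the cycle in which each insn completes D (reads its registers)."""
    writeback = {}    # register -> cycle of the W stage of its latest producer
    cycles = []
    prev_d = 1        # chosen so that the first insn decodes in cycle 2
    for dst, srcs in insns:
        d = prev_d + 1                     # in-order: at best one cycle behind
        for reg in srcs:
            if reg in writeback:
                w = writeback[reg]
                ready = w if same_cycle_write_read else w + 1
                d = max(d, ready)          # stall in D until the value is readable
        writeback[dst] = d + 3             # D, then X, M, W
        cycles.append(d)
        prev_d = d
    return cycles

prog = [("r1", ("r2", "r3")),   # add r2,r3→r1
        ("r2", ("r1", "r4")),   # sub r1,r4→r2
        ("r7", ("r5", "r6"))]   # add r5,r6→r7
print(decode_cycles(prog, same_cycle_write_read=False))  # [2, 6, 7]
print(decode_cycles(prog, same_cycle_write_read=True))   # [2, 5, 6]
```

The second insn decodes in cycle 6 or 5, i.e. after three or two d* stall cycles, and the independent third insn is delayed by the same amount.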
Bypass Logic
[Figure: bypass paths from the X/M and M/W latches back to the ALU inputs after D/X]
Load-Use Stalls
Without a stall, sub would be in X while ld is still in M:
                 1  2  3  4  5  6
ld [r3+4]→r1     F  D  X  M  W
sub r1,r4→r2        F  D  X  M  W
Load-use stall
Load value not ready at beginning of M → can't use MX bypass
Use WX bypass
                 1  2  3  4  5  6  7
ld [r3+4]→r1     F  D  X  M  W
sub r1,r4→r2        F  d* D  X  M  W
Example: WM bypass
Compiler Scheduling
Before
ld r2,4(sp)
ld r3,8(sp)
add r3,r2,r1 //stall
st r1,0(sp)
ld r5,16(sp)
ld r6,20(sp)
sub r5,r6,r4 //stall
st r4,12(sp)
After
ld r2,4(sp)
ld r3,8(sp)
ld r5,16(sp)
add r3,r2,r1 //no stall
ld r6,20(sp)
st r1,0(sp)
sub r5,r6,r4 //no stall
st r4,12(sp)
With a single computation in scope there is no independent insn to move between the load and its use:
Before
ld r2,4(sp)
ld r3,8(sp)
add r3,r2,r1 //stall
st r1,0(sp)
After
ld r2,4(sp)
ld r3,8(sp)
add r3,r2,r1 //stall
st r1,0(sp)
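The effect of the schedules above can be checked by counting back-to-back load-use pairs. The sketch below assumes full bypassing, so that only an insn immediately following a load and reading its result stalls, and by one cycle. It conservatively counts a store of just-loaded data as a stall even though a WM bypass could avoid it. The parser handles only the `ld`, `st` and three-operand ALU forms used on these slides; `parse` and `load_use_stalls` are hypothetical names.

```python
import re

def parse(insn):
    """Split an insn string into (op, dst, srcs)."""
    op, operands = insn.split(None, 1)
    parts = [p.strip() for p in operands.split(",")]
    if op == "ld":                          # ld rd,offset(base)
        base = re.search(r"\((\w+)\)", parts[1]).group(1)
        return op, parts[0], [base]
    if op == "st":                          # st rs,offset(base): writes no register
        base = re.search(r"\((\w+)\)", parts[1]).group(1)
        return op, None, [parts[0], base]
    return op, parts[2], parts[:2]          # ALU: op rs1,rs2,rd

def load_use_stalls(schedule):
    """Count loads whose result is read by the very next insn."""
    stalls = 0
    for producer, consumer in zip(schedule, schedule[1:]):
        op, dst, _ = parse(producer)
        _, _, srcs = parse(consumer)
        if op == "ld" and dst in srcs:
            stalls += 1
    return stalls

before = ["ld r2,4(sp)", "ld r3,8(sp)", "add r3,r2,r1", "st r1,0(sp)",
          "ld r5,16(sp)", "ld r6,20(sp)", "sub r5,r6,r4", "st r4,12(sp)"]
after = ["ld r2,4(sp)", "ld r3,8(sp)", "ld r5,16(sp)", "add r3,r2,r1",
         "ld r6,20(sp)", "st r1,0(sp)", "sub r5,r6,r4", "st r4,12(sp)"]
print(load_use_stalls(before))  # 2
print(load_use_stalls(after))   # 0
```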
Enough registers
Original
ld r2,4(sp)
ld r1,8(sp)
add r1,r2,r1 //stall
st r1,0(sp)
ld r2,16(sp)
ld r1,20(sp)
sub r2,r1,r1 //stall
st r1,12(sp)
Wrong!
ld r2,4(sp)
ld r1,8(sp)
ld r2,16(sp)
add r1,r2,r1 //WAR
ld r1,20(sp)
st r1,0(sp) //WAR
sub r2,r1,r1
st r1,12(sp)
Alias analysis
Before
ld r2,4(sp)
ld r3,8(sp)
add r3,r2,r1 //stall
st r1,0(sp)
ld r5,0(r8)
ld r6,4(r8)
sub r5,r6,r4 //stall
st r4,8(r8)
Wrong(?)
ld r2,4(sp)
ld r3,8(sp)
ld r5,0(r8)
add r3,r2,r1
ld r6,4(r8)
st r1,0(sp)
sub r5,r6,r4
st r4,8(r8)
WAW Hazards
Write-after-write (WAW)
add r2,r3,r1
sub r1,r4,r2
or r6,r3,r1
Compiler effects
Scheduling problem: reordering would leave wrong value in r1
Later instruction reading r1 would get wrong value
Artificial: no value flows through dependence
Eliminate using different output register name for or
Pipeline effects
Doesn't affect in-order pipeline with single-cycle operations
One reason for making ALU operations go through M stage
Can happen with multi-cycle operations (e.g., FP or cache misses)
                 1  2  3  4  5  6  7  8  9
div f0,f1→f2     F  D  E/ E/ E/ E/ E/ W
stf f2→[r1]         F  D  d* d* d* X  M  W
addf f0,f1→f2          F  D  E+ E+ W
What to do?
Handling Interrupts/Exceptions
                 1  2  3  4  5  6  7  8  9
divf f0,f1→f2    F  D  E/ E/ E/ E/ E/ W
stf f2→[r1]         F  D  d* d* d* X  M  W
addf f0,f1→f2          F  D  E+ E+ W
Von Neumann says
Insn execution should appear sequential and atomic
Insn X should complete before insn X+1 begins
+ Doesn't physically have to be this way (e.g., pipeline)
- But be ready to restore to this state at a moment's notice
Called precise state or precise interrupts
[Figures: further pipeline diagrams for the divf/stf/addf sequence; datapath diagram]
WAR Hazards
Write-after-read (WAR)
add r2,r3,r1
sub r5,r4,r2
or r6,r3,r1
Compiler effects
Scheduling problem: reordering would mean add uses wrong value for r2
Artificial: solve using different output register name for sub
Pipeline effects
Can't happen in simple in-order pipeline
Can happen with out-of-order execution (after mid-term)
Memory data dependences
st r2→[r1]
ld [r1]→r4
st r5→[r1]
Read-after-write (RAW): st r2→[r1] then ld [r1]→r4
Write-after-read (WAR): ld [r1]→r4 then st r5→[r1]
Write-after-write (WAW): st r2→[r1] then st r5→[r1]
Control Hazards
                 1  2  3  4  5  6  7  8  9
addi r1,1→r3     F  D  X  M  W
bnez r3,targ        F  D  X  M  W
st r6→[r7+4]           c* c* F  D  X  M  W
44
Speculation
Mechanics
Guess branch target, start fetching at guessed position
Execute branch to verify (check) guess
Correct speculation? Keep going
Mis-speculation? Flush mis-speculated insns
Don't write registers or memory until prediction verified
Speculative execution
Execute before all parameters known with certainty
Correct speculation
+ Avoid stall, improve performance
Correct speculation: the insns fetched after bnez are speculative
                     1  2  3  4  5  6  7  8
addi r1,1→r3         F  D  X  M  W
bnez r3,targ            F  D  X  M  W
st r6→[r7+4]               F  D  X  M  W
targ:add r4,r5→r4             F  D  X  M  W
[Figure: mis-speculation recovery; the insns fetched after bnez are flushed (shown as --) and fetch restarts at targ]
Recovery:
Gain = 2 cycles
Penalty = 0 cycles
No penalty → mis-speculation no worse than stalling
%correct = branch prediction
Static (compiler) OK, dynamic (hardware) much better
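The gain and penalty figures above translate directly into an expected benefit per branch. The sketch below uses the slide's gain of 2 cycles and penalty of 0 cycles; the branch fraction (20%) and prediction accuracy (75%) in the usage lines are made-up illustration values, and both function names are hypothetical.

```python
def cycles_saved_per_branch(accuracy, gain=2, penalty=0):
    """Expected cycles saved per branch, relative to always stalling."""
    return accuracy * gain - (1 - accuracy) * penalty

def cpi_with_prediction(branch_fraction, accuracy, stall_cycles=2, base_cpi=1.0):
    """CPI when only mispredicted branches pay the `stall_cycles` bubble."""
    return base_cpi + branch_fraction * (1 - accuracy) * stall_cycles

# Assumed workload: 20% branches, 75% of them predicted correctly.
print(cycles_saved_per_branch(0.75))     # 1.5 cycles saved per branch
print(cpi_with_prediction(0.2, 0.75))    # about 1.1
print(cpi_with_prediction(0.2, 0.0))     # about 1.4, same as always stalling
```

Because the penalty is 0, an accuracy of 0 degenerates to the stall-on-every-branch CPI, which is the sense in which mis-speculation is no worse than stalling.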
[Figure: next-PC selection in fetch: direction predictor (DIRP), branch target buffer (BTB) indexed and tagged by PC bits, return address stack (RAS), and PC+4]
for (i=0;i<100;i++)
  for (j=0;j<3;j++)
    // whatever
[Tables: predictor state/prediction vs. outcome for the inner-loop branch]
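How a simple direction predictor copes with the inner-loop branch of this loop nest can be sketched with a single saturating counter. The assumptions here are not from the slides: the loop is taken to be bottom-tested, so each inner-loop execution produces the outcomes taken, taken, not-taken, and the counter starts in the strongly not-taken state. With `bits=1` the counter is a last-outcome predictor; with `bits=2` it is a two-bit saturating counter.

```python
def mispredictions(outcomes, bits):
    """Count mispredictions of one saturating counter with `bits` bits."""
    top = (1 << bits) - 1
    threshold = (top + 1) // 2     # predict taken when counter >= threshold
    counter = 0                    # assumed initial state: strongly not-taken
    misses = 0
    for taken in outcomes:
        if (counter >= threshold) != taken:
            misses += 1
        # Move toward taken or not-taken, saturating at the ends.
        counter = min(top, counter + 1) if taken else max(0, counter - 1)
    return misses

# Inner-loop branch over 100 outer iterations (assumed T, T, N pattern).
inner = [True, True, False] * 100
print(mispredictions(inner, bits=1))  # 200
print(mispredictions(inner, bits=2))  # 103
```

The one-bit predictor mispredicts twice per inner-loop execution (on loop exit and again on re-entry), while the two-bit counter settles to one misprediction per execution after warm-up.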
Correlated Predictor
[Tables: state/prediction vs. outcome with one counter per BHR pattern, for 2-bit histories (NN, NT, TN, TT) and 3-bit histories (NNN through TTT)]
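A correlated predictor keeps one counter per branch history register (BHR) pattern, so a short repeating outcome pattern becomes predictable once each pattern's counter has warmed up. The sketch below models a single branch with two-bit counters. The T, T, N outcome stream, the all-not-taken initial state, and the function name are assumptions made for illustration.

```python
def correlated_mispredictions(outcomes, history_bits=2):
    """Mispredictions with one two-bit counter per BHR pattern, single branch."""
    counters = [0] * (1 << history_bits)   # assumed start: strongly not-taken
    mask = (1 << history_bits) - 1
    bhr = 0                                # assumed start: history all not-taken
    misses = 0
    for taken in outcomes:
        counter = counters[bhr]            # the active pattern selects the counter
        if (counter >= 2) != taken:
            misses += 1
        counters[bhr] = min(3, counter + 1) if taken else max(0, counter - 1)
        bhr = ((bhr << 1) | int(taken)) & mask   # shift in the newest outcome
    return misses

inner = [True, True, False] * 100
print(correlated_mispredictions(inner))  # 5
```

After warm-up every history pattern maps to a single outcome (TT is followed by N, TN and NT by T), so the remaining 290 predictions are all correct under these assumptions.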
Hybrid Predictor
[Figure: a chooser table selects between two BHT-based predictors, indexed by PC and by BHR]
[Figure: predictor that compares Σ Fi*BHRi against a threshold]
Research: Razor
[Figure: pipelined datapath (regfile, I$, D$) with an added comparator (==)]
Confidence estimation
Summary
Principles of pipelining
Effects of overhead and hazards
Pipeline diagrams
Data hazards
Stalling and bypassing
Control hazards
Branch prediction
Power techniques
Dynamic power: speculation gating
Static and dynamic power: razor latches