06 Pipeline PDF

CIS 501
Introduction to Computer Architecture
Unit 6: Pipelining

This Unit: Pipelining
[Figure: system layers: Application, OS, Compiler, Firmware, CPU, I/O, Memory, Digital Circuits, Gates & Transistors]
Basic Pipelining
  Single, in-order issue
  Clock rate vs. IPC
Data Hazards
  Hardware: stalling and bypassing
  Software: pipeline scheduling
Control Hazards
  Branch prediction
Precise state
CIS 501 (Martin/Roth): Pipelining
Readings
H+P
  Appendix A.1 - A.6
Background slides
  http:///~amir/cse371/lecture_notes/pipeline.pdf

Quick Review
[Figure: single-cycle timeline (insn0.fetch, dec, exec; insn1.fetch, dec, exec) vs. multi-cycle timeline (insn0.fetch, insn0.dec, insn0.exec)]
Basic datapath: fetch, decode, execute
Single-cycle control: hardwired
  + Low CPI (1)
  - Long clock period (to accommodate slowest instruction)
Multi-cycle control: micro-programmed
  + Short clock period
  - High CPI
Can we have both low CPI and short clock period?
  Not if datapath executes only one instruction at a time
  No good way to make a single instruction go faster
Pipelining
[Figure: multi-cycle vs. pipelined instruction timelines]
Important performance technique
  Improves instruction throughput rather than instruction latency
Begin with multi-cycle design
  When instruction advances from stage 1 to 2
  Allow next instruction to enter stage 1
  Form of parallelism: "insn-stage parallelism"
  Individual instruction takes the same number of stages
  + But instructions enter and leave at a much faster rate
Automotive assembly line analogy

5 Stage Pipelined Datapath
[Figure: five-stage datapath: PC, I$, register file (s1 s2 d), ALU, D$, with PC, IR, A, B, O, D latches between stages]
Temporary values (PC,IR,A,B,O,D) re-latched every stage
  Why? 5 insns may be in pipeline at once, they share a single PC?
  Notice, PC not latched after ALU stage (why not?)
Pipeline Terminology
[Figure: five-stage datapath with pipeline latches PC, F/D, D/X, X/M, M/W]
Five stage: Fetch, Decode, eXecute, Memory, Writeback
  Nothing magical about the number 5 (Pentium 4 has 22 stages)
Latches (pipeline registers) named by stages they separate
  PC, F/D, D/X, X/M, M/W

Pipeline Control
[Figure: datapath with CTRL block in decode; control signals xC, mC, wC latched in D/X, mC, wC in X/M, wC in M/W]
One single-cycle controller, but pipeline the control signals
Abstract Pipeline
[Figure: abstract pipeline: PC, I$, regfile, D$ separated by F/D, D/X, X/M, M/W latches]

Floating Point Pipelines
[Figure: integer pipeline (I-regfile, I$, D$) alongside FP pipelines (F-regfile; E+, E*, E/ functional units)]
This is an integer pipeline
  Execution stages are X,M,W
Usually also one or more floating-point (FP) pipelines
  Separate FP register file
  One "pipeline" per functional unit: E+, E*, E/
  "Pipeline": functional unit need not be pipelined (e.g., E/)
  Execution stages are E+,E+,W (no M)

Pipeline Diagram
               1  2  3  4  5  6  7  8  9
add r3,r2,r1   F  D  X  M  W
ld r4,0(r5)       F  D  X  M  W
st r6,4(r7)          F  D  X  M  W
Pipeline diagram
  Cycles across, insns down
  Convention: X means ld r4,0(r5) finishes execute stage and writes into X/M latch at end of cycle 4
Reverse stream analogy
  "Downstream": earlier stages, younger insns
  "Upstream": later stages, older insns
  Reverse? instruction stream fixed, pipeline flows over it
  Architects see instruction stream as fixed by program/compiler

Pipeline Performance Calculation
Back of the envelope calculation
  Branch: 20%, load: 20%, store: 10%, other: 50%
Single-cycle
  Clock period = 50ns, CPI = 1
  Performance = 50ns/insn
Pipelined
  Clock period = 12ns
  CPI = 1 (each insn takes 5 cycles, but 1 completes each cycle)
  Performance = 12ns/insn
Principles of Pipelining
Let: insn execution require N stages, each takes $t_n$ time
  $L_1$ (1-insn latency) $= \sum t_n$
  $T$ (throughput) $= 1/L_1$
  $L_M$ (M-insn latency, where $M \gg 1$) $= M \cdot L_1$
Now: N-stage pipeline
  $L_{1+P} = L_1$
  $T_{+P} = 1/\max(t_n) \le N/L_1$
  If $t_n$ are equal (i.e., $\max(t_n) = L_1/N$), throughput $= N/L_1$
  $L_{M+P} = M \cdot \max(t_n) \ge M \cdot L_1/N$
  $S_{+P}$ (speedup) $= M \cdot L_1 / L_{M+P} \le N$
Q: for arbitrarily high speedup, use arbitrarily high N?

No, Part I: Pipeline Overhead
Let: O be extra delay per pipeline stage
  Latch overhead: pipeline latches take time
  Clock/data skew
Now: N-stage pipeline with overhead
  Assume $\max(t_n) = L_1/N$
  $L_{1+P+O} = L_1 + N \cdot O$
  $T_{+P+O} = 1/(L_1/N + O) = 1/(1/T_{+P} + O) \le T_{+P}$, and $\le 1/O$
  $L_{M+P+O} = M \cdot L_1/N + M \cdot O = L_{M+P} + M \cdot O$
  $S_{+P+O} = M \cdot L_1 / (M \cdot L_1/N + M \cdot O) \le N = S_{+P}$, and $\le L_1/O$
O limits throughput and speedup -> useful N
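The overhead model above can be tried numerically. This is a minimal sketch assuming equal stage delays; the values L1 = 50 ns and O = 2 ns are assumptions picked for illustration and are not given in the notes (with them, N = 5 happens to yield a 12 ns clock). The function names are hypothetical.

```python
def clock_period(l1, n, overhead):
    """Per-stage time L1/N plus per-stage overhead O (equal stage delays assumed)."""
    return l1 / n + overhead


def speedup(l1, n, overhead):
    """S_{+P+O} = L1 / (L1/N + O): unpipelined latency over pipelined clock period."""
    return l1 / clock_period(l1, n, overhead)


L1_NS = 50.0  # assumed single-instruction latency
O_NS = 2.0    # assumed latch/skew overhead per stage

for n in (1, 5, 10, 20, 100):
    print(f"N={n:3d}  clock={clock_period(L1_NS, n, O_NS):6.2f} ns  "
          f"speedup={speedup(L1_NS, n, O_NS):5.2f}")
```

As N grows, the speedup approaches but never reaches L1/O, which is the bound stated in the notes.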
No, Part II: Hazards
Dependence: relationship that serializes two insns
  Structural: two insns want to use same structure, one must wait
  Data: two insns use same storage location
  Control: one instruction affects whether another executes at all
Hazard: dependence and both insns in pipeline together
  Possibility for getting order wrong
  Often fixed with stalls: insn stays in same stage for multiple cycles
Let: H be average number of hazard stall cycles per instruction
  $L_{1+P+H} = L_{1+P}$ (no hazards for one instruction)
  $T_{+P+H} = [N/(N+H)] \cdot N/L_1 = [N/(N+H)] \cdot T_{+P}$
  $L_{M+P+H} = M \cdot L_1/N \cdot [(N+H)/N] = [(N+H)/N] \cdot L_{M+P}$
  $S_{+P+H} = M \cdot L_1 / (M \cdot L_1/N \cdot [(N+H)/N]) = [N/(N+H)] \cdot S_{+P}$
H also limits throughput, speedup -> useful N
  N up -> H up (more insns in flight -> dependences become hazards)
  Exact H depends on program, requires detailed simulation

Clock Rate vs. IPC
Deeper pipeline (bigger N)
  + frequency goes up
  - IPC goes down
  Ultimate metric is IPC * frequency
  But people buy frequency, not IPC * frequency
Trend has been for deeper pipelines
  Intel example:
    486: 5 stages (50+ gate delays / clock)
    Pentium: 7 stages
    Pentium II/III: 12 stages
    Pentium 4: 22 stages (10 gate delays / clock)
    800 MHz Pentium III was faster than 1 GHz Pentium4
    Next Intel core: fewer pipeline stages than Pentium 4
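The IPC-versus-frequency trade-off can be explored with the simple depth model these notes use (G gate delays of work per insn, O gate delays of overhead per stage, X average stall per insn per stage). The sketch below evaluates IPC * frequency over a range of depths; the function names and the search range 1..100 are assumptions for illustration, and the model itself is, as the notes say, simplistic.

```python
def ipc(n, x):
    """IPC = 1 / (1 + X * N): stalls grow with pipeline depth."""
    return 1.0 / (1.0 + x * n)


def freq(n, g, o):
    """f = 1 / (G / N + O), in units of 1 / gate delay."""
    return 1.0 / (g / n + o)


def perf(n, g=80, o=1, x=0.16):
    """The metric that matters: IPC * frequency."""
    return ipc(n, x) * freq(n, g, o)


for n in (5, 10, 20):
    print(f"N={n:2d}  IPC={ipc(n, 0.16):.2f}  freq={freq(n, 80, 1):.3f}  "
          f"IPC*freq={perf(n):.3f}")

# Depth that maximizes IPC * frequency under this model (search range is arbitrary).
best_n = max(range(1, 101), key=perf)
print("best N in 1..100:", best_n)
```

The curve is quite flat near its peak, so the exact optimum matters less than the observation that both very shallow and very deep pipelines lose performance.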
Optimizing Pipeline Depth
Parameterize clock cycle in terms of gate delays
  G gate delays to process (fetch, decode, execute) a single insn
  O gate delays overhead per stage
  X average stall per instruction per stage
    Simplistic: real X function much, much more complex
Compute optimal N (pipeline stages) given G,O,X
  IPC = 1 / (1 + X * N)
  f = 1 / (G / N + O)
  Example: G = 80, O = 1, X = 0.16

  N                      5      10     20
  IPC = 1/(1+0.16*N)     0.56   0.38   0.24
  freq = 1/(80/N+1)      0.059  0.110  0.200
  IPC*freq               0.033  0.042  0.048

Managing a Pipeline
Proper flow requires two pipeline operations
  Mess with latch write-enable and clear signals to achieve
Operation I: stall
  Effect: stops some insns in their current stages
  Use: make younger insns wait for older ones to complete
  Implementation: de-assert write-enable
Operation II: flush
  Effect: removes insns from current stages
  Use: see later
  Implementation: assert clear signals
Both stall and flush must be propagated to younger insns

Structural Hazards
               1  2  3  4  5  6  7  8  9
ld r2,0(r1)    F  D  X  M  W
add r1,r3,r4      F  D  X  M  W
sub r1,r3,r5         F  D  X  M  W
st r6,0(r1)             F  D  X  M  W
Structural hazard: resource needed twice in one cycle
  Example: shared I/D$

Fixing Structural Hazards
               1  2  3  4  5  6  7  8  9
ld r2,0(r1)    F  D  X  M  W
add r1,r3,r4      F  D  X  M  W
sub r1,r3,r5         F  D  X  M  W
st r6,0(r1)             s* F  D  X  M  W
Can fix structural hazards by stalling
  s* = structural stall
  Q: which one to stall: ld or st?
    Always safe to stall younger instruction (here st)
    Fetch stall logic: (D/X.op == ld || D/X.op == st)
    But not always the best thing to do performance wise (?)
  + Low cost, simple
  - Decreases IPC
  Upshot: better to avoid by design than to fix
Avoiding Structural Hazards
Replicate the contended resource
  + No IPC degradation
  - Increased area, power, latency (interconnect delay?)
  For cheap, divisible, or highly contended resources (e.g., I$/D$)
Pipeline the contended resource
  + No IPC degradation, low area, power overheads
  - Sometimes tricky to implement (e.g., for RAMs)
  For multi-cycle resources (e.g., multiplier)
Design ISA/pipeline to reduce structural hazards (RISC)
  Each insn uses a resource at most once (same insn hazards)
  Always in same pipe stage (hazards between two of same insn)
    Reason why integer operations forced to go through M stage
  And always for one cycle

Data Hazards
Real insn sequences pass values via registers/memory
  Three kinds of data dependences (where's the fourth?)

  Read-after-write (RAW)   Write-after-read (WAR)   Write-after-write (WAW)
  True-dependence          Anti-dependence          Output-dependence
  add r2,r3->r1            add r2,r3->r1            add r2,r3->r1
  sub r1,r4->r2            sub r5,r4->r2            sub r1,r4->r2
  or r6,r3->r1             or r6,r3->r1             or r6,r3->r1

  Only one dependence between any two insns (RAW has priority)
Data hazards: function of data dependences and pipeline
  Potential for executing dependent insns in wrong order
  Require both insns to be in pipeline ("in flight") simultaneously
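The three dependence kinds can be checked mechanically for any ordered pair of instructions. The sketch below is illustrative only: the `Insn` type and `classify` function are hypothetical names, and instructions are reduced to their source and destination registers. It applies the "RAW has priority" rule stated above.

```python
from typing import NamedTuple, Optional, Tuple


class Insn(NamedTuple):
    op: str
    srcs: Tuple[str, ...]  # registers read
    dst: str               # register written


def classify(older: Insn, younger: Insn) -> Optional[str]:
    """Return the data dependence from older to younger, or None if independent."""
    if older.dst in younger.srcs:
        return "RAW"  # younger reads what older writes (true dependence)
    if younger.dst in older.srcs:
        return "WAR"  # younger overwrites what older reads (anti-dependence)
    if older.dst == younger.dst:
        return "WAW"  # both write the same register (output dependence)
    return None


add = Insn("add", ("r2", "r3"), "r1")
print(classify(add, Insn("sub", ("r1", "r4"), "r2")))  # RAW on r1
print(classify(add, Insn("sub", ("r5", "r4"), "r2")))  # WAR on r2
print(classify(add, Insn("or", ("r6", "r3"), "r1")))   # WAW on r1
```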
Dependences and Loops
Data dependences in loops
  Intra-loop: within same iteration
  Inter-loop: across iterations
  Example: DAXPY (Double precision A X Plus Y)

  for (i=0;i<100;i++)
    Z[i]=A*X[i]+Y[i];

  0: ldf f2,X(r1)
  1: mulf f2,f0,f4
  2: ldf f6,Y(r1)
  3: addf f4,f6,f8
  4: stf f8,Z(r1)
  5: addi r1,8,r1
  6: cmplti r1,800,r2
  7: beq r2,Loop

  RAW intra: 0->1(f2), 1->3(f4), 2->3(f6), 3->4(f8), 5->6(r1), 6->7(r2)
  RAW inter: 5->0(r1), 5->2(r1), 5->4(r1), 5->5(r1)
  WAR intra: 0->5(r1), 2->5(r1), 4->5(r1)
  WAR inter: 1->0(f2), 3->1(f4), 3->2(f6), 4->3(f8), 6->5(r1), 7->6(r2)
  WAW intra: none
  WAW inter: 0->0(f2), 1->1(f4), 2->2(f6), 3->3(f8), 6->6(r2)

RAW
Read-after-write (RAW)
  add r2,r3->r1
  sub r1,r4->r2
  or r6,r3->r1
Problem: swap would mean sub uses wrong value for r1
True: value flows through this dependence
  Using different output register for add doesn't help
RAW: Detect and Stall
[Figure: pipeline with regfile, I$, D$ and F/D, D/X, X/M, M/W latches; stall logic compares F/D source registers with downstream destination registers]
Stall logic: detect and stall reader in D
  (F/D.rs1 & (F/D.rs1==D/X.rd | F/D.rs1==X/M.rd | F/D.rs1==M/W.rd)) |
  (F/D.rs2 & (F/D.rs2==D/X.rd | F/D.rs2==X/M.rd | F/D.rs2==M/W.rd))
  Re-evaluated every cycle until no longer true
  + Low cost, simple
  - IPC degradation, dependences are the common case

Two Stall Timings (without bypassing)
Depend on how D and W stages share regfile
  Each gets regfile for half a cycle
  1st half D reads, 2nd half W writes: 3 cycle stall
    d* = data stall, p* = propagated stall

                  1  2  3  4  5  6  7  8  9  10
  add r2,r3->r1   F  D  X  M  W
  sub r1,r4->r2      F  d* d* d* D  X  M  W
  add r5,r6->r7         p* p* p* F  D  X  M  W

  + 1st half W writes, 2nd half D reads: 2 cycle stall
    How does the stall logic change here?

                  1  2  3  4  5  6  7  8  9  10
  add r2,r3->r1   F  D  X  M  W
  sub r1,r4->r2      F  d* d* D  X  M  W
  add r5,r6->r7         p* p* F  D  X  M  W

Reducing RAW Stalls with Bypassing
[Figure: pipeline with bypass paths from X/M and M/W latches back to the ALU inputs]
Why wait until W stage? Data available after X or M stage
  Bypass (aka forward) data directly to input of X or M
    MX: from beginning of M (X output) to input of X
    WX: from beginning of W (M output) to input of X
    WM: from beginning of W (M output) to data input of M
  Two each of MX, WX (figure shows 1) + WM = full bypassing
  + Reduces stalls in a big way
  - Additional wires and muxes may increase clock cycle

Bypass Logic
[Figure: bypass logic driving the ALU input mux from D/X, X/M, M/W latches]
Bypass logic: similar to but separate from stall logic
  Stall logic controls latches, bypass logic controls mux inputs
  Complement one another: can't bypass -> must stall
  ALU input mux bypass logic
    (D/X.rs2 & X/M.rd==D/X.rs2) -> 2 // check first
    (D/X.rs2 & M/W.rd==D/X.rs2) -> 1 // check second
    (D/X.rs2) -> 0                   // check last
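The stall and bypass conditions in these notes translate almost directly into code. The sketch below is a software model, not a hardware description: latch fields are passed as plain arguments, `None` stands for "no register" (an assumption made here in place of a valid bit), and the function names are hypothetical.

```python
from typing import Optional


def must_stall(fd_rs1: Optional[str], fd_rs2: Optional[str],
               dx_rd: Optional[str], xm_rd: Optional[str],
               mw_rd: Optional[str]) -> bool:
    """Stall logic without bypassing: the insn in D waits while any older
    in-flight insn still has to write one of its source registers."""
    def conflicts(rs: Optional[str]) -> bool:
        return rs is not None and rs in (dx_rd, xm_rd, mw_rd)
    return conflicts(fd_rs1) or conflicts(fd_rs2)


def alu_input_select(dx_rs: Optional[str], xm_rd: Optional[str],
                     mw_rd: Optional[str]) -> int:
    """ALU input mux select: 2 = MX bypass, 1 = WX bypass, 0 = register file."""
    if dx_rs is None:
        return 0
    if xm_rd == dx_rs:  # check first: X/M holds the most recent producer
        return 2
    if mw_rd == dx_rs:  # check second
        return 1
    return 0            # check last: value comes from the register file


# sub r1,r4->r2 sits in D while add r2,r3->r1 is in X: stall without bypassing.
print(must_stall("r1", "r4", dx_rd="r1", xm_rd=None, mw_rd=None))
# With bypassing, one cycle later the same value is taken from the X/M latch.
print(alu_input_select("r1", xm_rd="r1", mw_rd=None))
```

Checking X/M before M/W matters when both latches hold a write to the same register: the younger producer, in X/M, has the value the consumer should see.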
Pipeline Diagrams with Bypassing
If bypass exists, "from"/"to" stages execute in same cycle
  Example: full bypassing, use MX bypass

                  1  2  3  4  5  6
  add r2,r3->r1   F  D  X  M  W
  sub r1,r4->r2      F  D  X  M  W

  Example: full bypassing, use WX bypass

                  1  2  3  4  5  6  7
  add r2,r3->r1   F  D  X  M  W
  ld [r7]->r5        F  D  X  M  W
  sub r1,r4->r2         F  D  X  M  W

  Example: WM bypass

                  1  2  3  4  5  6
  add r2,r3->r1   F  D  X  M  W
  ?                  F  D  X  M  W

  Can you think of a code example that uses the WM bypass?

Load-Use Stalls
Even with full bypassing, stall logic is unavoidable
  Load-use stall
    Load value not ready at beginning of M -> can't use MX bypass
    Use WX bypass

                  1  2  3  4  5  6  7
  ld [r3+4]->r1   F  D  X  M  W
  sub r1,r4->r2      F  D  d* X  M  W

  Aside: with WX bypassing, stall logic can be in D or X

                  1  2  3  4  5  6  7
  ld [r3+4]->r1   F  D  X  M  W
  sub r1,r4->r2      F  d* D  X  M  W

  Aside II: how does stall/bypass logic handle cache misses?

Compiler Scheduling
Compiler can schedule (move) insns to reduce stalls
  Basic pipeline scheduling: eliminate back-to-back load-use pairs
  Example code sequence: a = b + c; d = f - e;
    MIPS Notation:
      "ld r2,4(sp)" is "ld [sp+4]->r2", "st r1,0(sp)" is "st r1->[sp+0]"

  Before                       After
  ld r2,4(sp)                  ld r2,4(sp)
  ld r3,8(sp)                  ld r3,8(sp)
  add r3,r2,r1   //stall       ld r5,16(sp)
  st r1,0(sp)                  add r3,r2,r1   //no stall
  ld r5,16(sp)                 ld r6,20(sp)
  ld r6,20(sp)                 st r1,0(sp)
  sub r5,r6,r4   //stall       sub r5,r6,r4   //no stall
  st r4,12(sp)                 st r4,12(sp)

Compiler Scheduling Requires
Large scheduling scope
  Independent instruction to put between load-use pairs
  + Original example: large scope, two independent computations
  - This example: small scope, one computation

  Before                       After
  ld r2,4(sp)                  ld r2,4(sp)
  ld r3,8(sp)                  ld r3,8(sp)
  add r3,r2,r1   //stall       add r3,r2,r1   //stall
  st r1,0(sp)                  st r1,0(sp)

Enough registers
  To hold additional "live" values
  Example code contains 7 different values (including sp)
  Before: max 3 values live at any time -> 3 registers enough
  After: max 4 values live -> 3 registers not enough -> WAR violations

  Original                     Wrong!
  ld r2,4(sp)                  ld r2,4(sp)
  ld r1,8(sp)                  ld r1,8(sp)
  add r1,r2,r1   //stall       ld r2,16(sp)
  st r1,0(sp)                  add r1,r2,r1   //WAR
  ld r2,16(sp)                 ld r1,20(sp)
  ld r1,20(sp)                 st r1,0(sp)    //WAR
  sub r2,r1,r1   //stall       sub r2,r1,r1
  st r1,12(sp)                 st r1,12(sp)

Alias analysis
  Ability to tell whether load/store reference same memory locations
    Effectively, whether load/store can be rearranged
  Example code: easy, all loads/stores use same base register (sp)
  New example: can compiler tell that r8 = sp?

  Before                       Wrong(?)
  ld r2,4(sp)                  ld r2,4(sp)
  ld r3,8(sp)                  ld r3,8(sp)
  add r3,r2,r1   //stall       ld r5,0(r8)
  st r1,0(sp)                  add r3,r2,r1
  ld r5,0(r8)                  ld r6,4(r8)
  ld r6,4(r8)                  st r1,0(sp)
  sub r5,r6,r4   //stall       sub r5,r6,r4
  st r4,8(r8)                  st r4,8(r8)
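The effect of scheduling on load-use stalls can be counted with a few lines of code. This sketch assumes full bypassing, so the only stall is a load immediately followed by an insn that needs the loaded value in X; store data is assumed to arrive through the WM bypass. The parser handles only the operand forms used in these examples, and the function names are hypothetical.

```python
import re


def parse(insn):
    """Return (op, registers needed in X, destination register).
    Notation as in the notes: ALU ops write their last operand, ld its first."""
    op, operands = insn.split(None, 1)
    regs = re.findall(r"r\d+|sp", operands)
    if op == "ld":
        return op, regs[1:], regs[0]
    if op == "st":
        # Only the address base is needed in X; the data can use the WM bypass.
        return op, regs[1:], None
    return op, regs[:-1], regs[-1]


def load_use_stalls(program):
    """Count adjacent pairs where a load feeds the very next instruction."""
    parsed = [parse(insn) for insn in program]
    stalls = 0
    for (prev_op, _, prev_dst), (_, srcs, _) in zip(parsed, parsed[1:]):
        if prev_op == "ld" and prev_dst in srcs:
            stalls += 1
    return stalls


before = ["ld r2,4(sp)", "ld r3,8(sp)", "add r3,r2,r1", "st r1,0(sp)",
          "ld r5,16(sp)", "ld r6,20(sp)", "sub r5,r6,r4", "st r4,12(sp)"]
after = ["ld r2,4(sp)", "ld r3,8(sp)", "ld r5,16(sp)", "add r3,r2,r1",
         "ld r6,20(sp)", "st r1,0(sp)", "sub r5,r6,r4", "st r4,12(sp)"]

print("before:", load_use_stalls(before), "stalls")
print("after: ", load_use_stalls(after), "stalls")
```

The count says nothing about whether a reordering is legal; that still depends on having enough registers and on alias analysis.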
WAW Hazards
Write-after-write (WAW)
  add r2,r3,r1
  sub r1,r4,r2
  or r6,r3,r1
Compiler effects
  Scheduling problem: reordering would leave wrong value in r1
    Later instruction reading r1 would get wrong value
  Artificial: no value flows through dependence
    Eliminate using different output register name for or
Pipeline effects
  Doesn't affect in-order pipeline with single-cycle operations
    One reason for making ALU operations go through M stage
  Can happen with multi-cycle operations (e.g., FP or cache misses)

Handling WAW Hazards
                   1  2  3  4  5  6  7  8  9  10
  divf f0,f1->f2   F  D  E/ E/ E/ E/ E/ W
  stf f2->[r1]        F  D  d* d* d* X  M  W
  addf f0,f1->f2         F  D  E+ E+ W
What to do?
  Option I: stall younger instruction (addf) at writeback
    + Intuitive, simple
    - Lower performance, cascading W structural hazards
  Option II: cancel older instruction (divf) writeback
    + No performance loss
    - What if divf or stf cause an exception (e.g., /0, page fault)?
Handling Interrupts/Exceptions
How are interrupts/exceptions handled in a pipeline?
  Interrupt: external, e.g., timer, I/O device requests
  Exception: internal, e.g., /0, page fault, illegal instruction
  We care about restartable interrupts (e.g., stf page fault)

                   1  2  3  4  5  6  7  8  9  10
  divf f0,f1->f2   F  D  E/ E/ E/ E/ E/ W
  stf f2->[r1]        F  D  d* d* d* X  M  W
  addf f0,f1->f2         F  D  E+ E+ W

Von Neumann says
  "Insn execution should appear sequential and atomic"
  Insn X should complete before instruction X+1 should begin
  + Doesn't physically have to be this way (e.g., pipeline)
  But be ready to restore to this state at a moment's notice
  Called precise state or precise interrupts

Handling Interrupts
                   1  2  3  4  5  6  7  8  9  10
  divf f0,f1->f2   F  D  E/ E/ E/ E/ E/ W
  stf f2->[r1]        F  D  d* d* d* X  M  W
  addf f0,f1->f2         F  D  E+ E+ W
In this situation
  Make it appear as if divf finished and stf, addf haven't started
    Allow divf to writeback
    Flush stf and addf (so that's what a flush is for)
      But addf has already written back
        Keep an "undo" register file? Complicated
        Force in-order writebacks? Slow
  Invoke exception handler
  Restart stf

More Interrupt Nastiness
                   1  2  3  4  5  6  7  8  9  10
  divf f0,f1->f2   F  D  E/ E/ E/ E/ E/ W
  stf f2->[r1]        F  D  d* d* d* X  M  W
  divf f0,f4->f2         F  D  E/ E/ E/ E/ E/ W
What about two simultaneous in-flight interrupts
  Example: stf page fault, divf /0
  Interrupts must be handled in program order (stf first)
    Handler for stf must see program as if divf hasn't started
  Must defer interrupts until writeback and force in-order writeback
  Kind of a bogus example, /0 is non-restartable
In general: interrupts are really nasty
  Some processors (Alpha) only implement precise integer interrupts
    Easier because fewer WAW scenarios
  Most floating-point interrupts are non-restartable anyway

Research: Runahead Execution
[Figure: pipeline with a shadow register file (S-regfile) alongside the main regfile]
In-order writebacks essentially imply stalls on D$ misses
  Can save power or use idle time for performance
Runahead execution [Dundas+]
  Shadow regfile kept in sync with main regfile (write to both)
  D$ miss: continue executing using shadow regfile (disable stores)
  D$ miss returns: flush pipe and restart with stalled PC
  + Acts like a smart prefetch engine
  + Performs better as cache t_miss grows (relative to clock period)
WAR Hazards
Write-after-read (WAR)
  add r2,r3,r1
  sub r5,r4,r2
  or r6,r3,r1
Compiler effects
  Scheduling problem: reordering would mean add uses wrong value for r2
  Artificial: solve using different output register name for sub
Pipeline effects
  Can't happen in simple in-order pipeline
  Can happen with out-of-order execution (after mid-term)

Memory Data Hazards
So far, have seen/dealt with register dependences
  Dependences also exist through memory

  Read-after-write (RAW)   Write-after-read (WAR)   Write-after-write (WAW)
  st r2->[r1]              st r2->[r1]              st r2->[r1]
  ld [r1]->r4              ld [r1]->r4              ld [r1]->r4
  st r5->[r1]              st r5->[r1]              st r5->[r1]

  But in an in-order pipeline like ours, they do not become hazards
    Memory read and write happen at the same stage
    Register read happens three stages earlier than register write
  In general: memory dependences more difficult than register

                 1  2  3  4  5  6
  st r2->[r1]    F  D  X  M  W
  ld [r1]->r4       F  D  X  M  W
Control Hazards
Control hazards
  Must fetch post branch insns before branch outcome is known
  Default: assume "not-taken" (at fetch, can't tell it's a branch)
  Control hazards indicated with c* (or not at all)
  Taken branch penalty is 2 cycles

                  1  2  3  4  5  6  7  8  9  10
  addi r1,1->r3   F  D  X  M  W
  bnez r3,targ       F  D  X  M  W
  st r6->[r7+4]         c* c* F  D  X  M  W

Back of the envelope calculation
  Branch: 20%, other: 80%, 75% of branches are taken
  CPI_BASE = 1
  CPI_BASE+BRANCH = 1 + 0.20*0.75*2 = 1.3
  - Branches cause 30% slowdown

ISA Branch Techniques
Fast branch: resolves at D, not X
  Test must be comparison to zero or equality, no time for ALU
  + New taken branch penalty is 1
  - Additional comparison insns (e.g., cmplt, slt) for complex tests
  - Must bypass into decode now, too
Delayed branch: branch that takes effect one insn later
  Insert insns that are independent of branch into "branch delay slot"
    Preferably from before branch (always helps then)
    But from after branch OK too
      As long as no undoable effects (e.g., a store)
  Upshot: short-sighted feature (MIPS regrets it)
    - Not a big win in today's pipelines
    - Complicates interrupt handling
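The back-of-the-envelope CPI calculation generalizes to any branch penalty. In this sketch the function and parameter names are hypothetical; the numbers come from these notes (20% branches, 75% taken, 2-cycle penalty, 1-cycle penalty with fast branches, and a 5% penalized fraction when branches are predicted with 95% accuracy).

```python
def cpi_with_branches(branch_frac, penalized_frac, penalty_cycles, base_cpi=1.0):
    """CPI = base + (fraction of insns that are branches)
                  * (fraction of those that pay the penalty)
                  * (penalty in cycles)."""
    return base_cpi + branch_frac * penalized_frac * penalty_cycles


# Assume not-taken, branch resolves in X: every taken branch costs 2 cycles.
print(f"resolve in X:   {cpi_with_branches(0.20, 0.75, 2):.2f}")
# Fast branch resolving in D: taken branch penalty drops to 1 cycle.
print(f"fast branch:    {cpi_with_branches(0.20, 0.75, 1):.2f}")
# Dynamic prediction at 95% accuracy: only mis-predictions pay the 2 cycles.
print(f"95% prediction: {cpi_with_branches(0.20, 0.05, 2):.2f}")
```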
Big Idea: Speculation
Speculation
  "Engagement in risky transactions on the chance of profit"
Speculative execution
  Execute before all parameters known with certainty
Correct speculation
  + Avoid stall, improve performance
Incorrect speculation (mis-speculation)
  - Must abort/flush/squash incorrect instructions
  - Must undo incorrect changes (recover pre-speculation state)
The "game": [%correct * gain] - [(1 - %correct) * penalty]

Control Hazards: Control Speculation
Deal with control hazards with control speculation
  Unknown parameter: are these the correct insns to execute next?
Mechanics
  Guess branch target, start fetching at guessed position
  Execute branch to verify (check) guess
    Correct speculation? keep going
    Mis-speculation? Flush mis-speculated insns
  Don't write registers or memory until prediction verified
Speculation game for in-order 5 stage pipeline
  Gain = 2 cycles
  + Penalty = 0 cycles
    No penalty -> mis-speculation no worse than stalling
  %correct = branch prediction
    Static (compiler) OK, dynamic (hardware) much better

Control Speculation and Recovery
Correct:
                       1  2  3  4  5  6  7  8  9
  addi r1,1->r3        F  D  X  M  W
  bnez r3,targ            F  D  X  M  W
  st r6->[r7+4]              F  D  X  M  W      (speculative)
  targ:add r4,r5->r4            F  D  X  M  W   (speculative)

Mis-speculation recovery: what to do on wrong guess
  Not too painful in an in-order pipeline
    Branch resolves in X
    + Younger insns (in F, D) haven't changed permanent state
  Flush insns currently in F/D and D/X (i.e., replace with nops)
Recovery:
                       1  2  3  4  5  6  7  8  9
  addi r1,1->r3        F  D  X  M  W
  bnez r3,targ            F  D  X  M  W
  st r6->[r7+4]              F  D  -- -- --
  targ:add r4,r5->r4            F  -- -- -- --
  targ:add r4,r5->r4               F  D  X  M  W

Dynamic Branch Prediction
[Figure: pipeline (regfile, I$, D$) with branch predictor (BP) supplying the next PC at fetch]
BP part I: target predictor
  Applies to all control transfers
  Supplies target PC, tells if insn is a branch prior to decode
  + Easy
BP part II: direction predictor
  Applies to conditional branches only
  Predicts taken/not-taken
  - Harder
Branch Target Buffer
[Figure: BTB indexed by low-order PC bits (e.g., [9:2]) and tagged with higher-order PC bits; outputs branch? and target-PC]
Branch target buffer (BTB)
  A small cache: address = PC, data = target-PC
  Hit? This is a control insn and it's going to target-PC (if "taken")
  Miss? Not a control insn, or one I have never seen before
  Partial data/tags: full tag not necessary, target-PC is just a guess
  Aliasing: tag match, but not actual match (OK for BTB)
  Pentium4 BTB: 2K entries, 4-way set-associative

Why Does a BTB Work?
Because control insn targets are stable
  Direct means constant target, indirect means register target
  + Direct conditional branches? Check
  + Direct calls? Check
  + Direct unconditional jumps? Check
  + Indirect conditional branches? Not that useful -> not widely supported
  Indirect calls? Two idioms
    + Dynamically linked functions (DLLs)? Check
    + Dynamically dispatched (virtual) functions? Pretty much check
  Indirect unconditional jumps? Two idioms
    Switches? Not really, but these are rare
    - Returns? Nope, but...

Return Address Stack (RAS)
[Figure: next-PC selected among PC+4, BTB target-PC and RAS, steered by DIRP and pre-decoded instruction bits from I$]
Return addresses are easy to predict without a BTB
Hardware return address stack (RAS) tracks call sequence
  Calls push PC+4 onto RAS
  Prediction for returns is RAS[TOS]
  Q: how can you tell if an insn is a return before decoding it?
    RAS is not a cache
    A: attach pre-decode bits to I$
      Written after first time insn executes
      Two useful bits: return?, conditional-branch?

Branch Direction Prediction
Direction predictor (DIRP)
  Map conditional-branch PC to taken/not-taken (T/N) decision
  Seemingly innocuous, but quite difficult
  Individual conditional branches often unbiased or weakly biased
    90%+ one way or the other considered "biased"
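The two target-prediction structures described in these notes, the BTB and the return address stack, can be modeled in a few lines. This is a sketch under stated assumptions: a direct-mapped BTB with 8 index bits and a 10-bit partial tag, and an 8-entry RAS that drops its oldest entry on overflow. Those sizes, the overflow policy, and all class and method names are illustrative choices, not taken from the notes.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB with partial tags: address = PC, data = target-PC."""

    def __init__(self, index_bits=8, tag_bits=10):
        self.index_bits = index_bits
        self.tag_bits = tag_bits
        self.entries = [None] * (1 << index_bits)  # each: (partial_tag, target_pc)

    def _split(self, pc):
        word = pc >> 2  # drop the byte offset within a 4-byte insn
        index = word & ((1 << self.index_bits) - 1)
        tag = (word >> self.index_bits) & ((1 << self.tag_bits) - 1)
        return index, tag

    def predict(self, pc):
        """Target-PC on a hit, None on a miss (not a known control insn)."""
        index, tag = self._split(pc)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None

    def update(self, pc, target_pc):
        index, tag = self._split(pc)
        self.entries[index] = (tag, target_pc)


class ReturnAddressStack:
    """Calls push PC+4; a predicted return pops the top of stack."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # assumed policy: overwrite the oldest entry
        self.stack.append(pc + 4)

    def predict_return(self):
        return self.stack.pop() if self.stack else None


btb = BranchTargetBuffer()
btb.update(0x1000, 0x2000)
print(hex(btb.predict(0x1000)))    # hit: predicted target
print(btb.predict(0x1004))         # miss
print(hex(btb.predict(0x101000)))  # aliasing: same index and partial tag, wrong insn

ras = ReturnAddressStack()
ras.on_call(0x100)
ras.on_call(0x200)
print(hex(ras.predict_return()), hex(ras.predict_return()))
```

The aliased hit returns a wrong target, which is acceptable because a BTB prediction is only a guess that the pipeline verifies later.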
Branch History Table (BHT)
Branch history table (BHT): simplest direction predictor
  PC indexes table of bits (0 = N, 1 = T), no tags
  Essentially: branch will go same way it went last time
  Problem: consider inner loop branch below (* = mis-prediction)

  for (i=0;i<100;i++)
    for (j=0;j<3;j++)
      // whatever

  State/prediction  N* T  T  T* N* T  T  T* N* T  T  T*
  Outcome           T  T  T  N  T  T  T  N  T  T  T  N

  Two "built-in" mis-predictions per inner loop iteration
  Branch predictor "changes its mind too quickly"

Two-Bit Saturating Counters (2bc)
Two-bit saturating counters (2bc) [Smith]
  Replace each single-bit prediction
    (0,1,2,3) = (N,n,t,T)
  Force DIRP to mis-predict twice before "changing its mind"

  State/prediction  N* n* t  T* t  T  T  T* t  T  T  T*
  Outcome           T  T  T  N  T  T  T  N  T  T  T  N

  + Fixes this pathology (which is not contrived, by the way)
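The difference between one-bit and two-bit predictors on the inner-loop branch can be reproduced with a short simulation. This is a sketch for a single branch with a single table entry; it assumes the counter starts at 0 (strongly not-taken) and that the branch outcome pattern is T, T, T, N per outer iteration. The function name is hypothetical.

```python
def count_mispredictions(outcomes, counter_bits):
    """Simulate one saturating counter of the given width on a branch's outcomes.
    counter_bits=1 models a plain BHT entry, counter_bits=2 models a 2bc."""
    max_state = (1 << counter_bits) - 1
    threshold = (max_state + 1) // 2  # predict taken in the upper half of states
    state = 0                         # assumed initial state: not-taken
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = state >= threshold
        if predicted_taken != taken:
            mispredictions += 1
        # Saturating update: count up on taken, down on not-taken.
        state = min(state + 1, max_state) if taken else max(state - 1, 0)
    return mispredictions


# Inner-loop branch over 100 outer iterations.
outcomes = [True, True, True, False] * 100

print("1-bit:", count_mispredictions(outcomes, 1))
print("2-bit:", count_mispredictions(outcomes, 2))
```

In steady state the one-bit entry mis-predicts twice per group of four outcomes, the two-bit counter only once, at the loop exit.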
Correlated Predictor
Correlated (two-level) predictor [Patt]
  Exploits observation that branch outcomes are correlated
  Maintains separate prediction per (PC, BHR)
    Branch history register (BHR): recent branch outcomes
  Simple working example: assume program has one branch
    BHT: one 1-bit DIRP entry
    BHT+2BHR: 4 1-bit DIRP entries
  [Table: state/prediction of the four entries (BHR=NN, NT, TN, TT), the active pattern, and the outcome for each branch instance]
  We didn't make anything better, what's the problem?

Correlated Predictor
What happened?
  BHR wasn't long enough to capture the pattern
  Try again: BHT+3BHR: 8 1-bit DIRP entries
  [Table: state/prediction of the eight entries (BHR=NNN through TTT), the active pattern, and the outcome for each branch instance]
  + No mis-predictions after predictor learns all the relevant patterns
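The working example (one branch, 1-bit entries selected by the BHR) can be simulated to see why two history bits do not help and three do. This sketch assumes the same T, T, T, N outcome pattern as the inner-loop example, an all-N initial BHR, and all entries initialized to N; those initial conditions and the function name are assumptions for illustration.

```python
def misprediction_flags(outcomes, history_bits):
    """One branch, one 1-bit entry per BHR value; history_bits=0 is a plain BHT.
    Returns a list of booleans, True where the prediction was wrong."""
    table = [False] * (1 << history_bits)  # 1-bit entries, all start not-taken
    mask = (1 << history_bits) - 1
    bhr = 0                                # history register, all not-taken
    flags = []
    for taken in outcomes:
        flags.append(table[bhr] != taken)
        table[bhr] = taken                 # 1-bit entry: remember last outcome
        bhr = ((bhr << 1) | int(taken)) & mask
    return flags


outcomes = [True, True, True, False] * 100

for bits in (0, 2, 3):
    steady = sum(misprediction_flags(outcomes, bits)[-40:])
    print(f"{bits}-bit BHR: {steady} mis-predictions in the last 40 outcomes")
```

With two history bits, the history TT precedes both a taken and a not-taken outcome, so that entry keeps flipping; with three bits every history is followed by exactly one outcome.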
Correlated Predictor
Design choice I: one global BHR or one per PC (local)?
  Each one captures different kinds of patterns
  Global is better, captures local patterns for tight loop branches
Design choice II: how many history bits (BHR size)?
  Tricky one
  + Given unlimited resources, longer BHRs are better, but...
  - BHT utilization decreases
    - Many history patterns are never seen
    - Many branches are history independent (don't care)
    PC ^ BHR allows multiple PCs to dynamically share BHT
    BHR length < log2(BHT size)
  - Predictor takes longer to train
  Typical length: 8-12

Hybrid Predictor
Hybrid (tournament) predictor [McFarling]
  Attacks correlated predictor BHT utilization problem
  Idea: combine two predictors
    Simple BHT predicts history independent branches
    Correlated predictor predicts only branches that need history
    Chooser assigns branches to one predictor or the other
    Branches start in simple BHT, move mis-prediction threshold
  + Correlated predictor can be made smaller, handles fewer branches
  + 90-95% accuracy
Research: Perceptron Predictor
[Figure: PC indexes coefficient table F; sum of F_i*BHR_i is compared against a threshold]
Perceptron predictor [Jimenez]
  Attacks BHR size problem using machine learning approach
  BHT replaced by table of function coefficients $F_i$ (signed)
  Predict taken if $\sum_i (BHR_i \cdot F_i) >$ threshold
  + Table size #PC*|BHR|*|F| (can use long BHR: ~60 bits)
    Equivalent correlated predictor would be #PC*2^|BHR|
  How does it learn? Update $F_i$ when branch is taken
    BHR_i == 1 ? F_i++ : F_i--;
    "don't care" F_i bits stay near 0, important F_i bits saturate
  + Hybrid BHT/perceptron accuracy: 95-98%

Branch Prediction Performance
Same parameters
  Branch: 20%, load: 20%, store: 10%, other: 50%
  75% of branches are taken
Dynamic branch prediction
  Branches predicted with 95% accuracy
  CPI = 1 + 0.20*0.05*2 = 1.02
Pipeline Performance Summary
Base CPI is 1, but hazards increase it
Nothing magical about a 5 stage pipeline
  Pentium4 has 22 stage pipeline
Increasing pipeline depth
  + Increases clock frequency (that's why companies do it)
  - But decreases IPC
    Branch mis-prediction penalty becomes longer
      More stages between fetch and whenever branch computes
    Non-bypassed data hazard stalls become longer
      More stages between register read and write
  At some point, CPI losses offset clock gains, question is when?

Dynamic Pipeline Power
Remember control-speculation game
  [2 cycles * %correct] - [0 cycles * (1 - %correct)]
  No penalty -> mis-speculation no worse than stalling
  This is a performance-only view
  From a power standpoint, mis-speculation is worse than stalling
Power control-speculation game
  [0 nJ * %correct] - [X nJ * (1 - %correct)]
  No benefit -> correct speculation no better than stalling
    Not exactly, increased execution time increases static power
  How to balance the two?

Research: Speculation Gating
[Figure: pipeline with branch predictor (BP) extended by a confidence check]
Speculation gating [Manne+]
  Extend branch predictor to give prediction + confidence
  Speculate on high-confidence (mis-prediction unlikely) branches
  Stall (save energy) on low-confidence branches
Confidence estimation
  What kind of hardware circuit estimates confidence?
  Hard in absolute sense, but easy relative to given threshold
  Counter-scheme similar to %miss threshold for cache resizing
  Example: assume 90% accuracy is high confidence
    PC-indexed table of confidence-estimation counters
    Correct prediction? table[PC]+=1 : table[PC]-=9;
    Prediction for PC is confident if table[PC] > 0;

Research: Razor
Razor [Uht, Ernst+]
  Identify pipeline stages with narrow signal margins (e.g., X)
  Add "Razor" X/M latch: relatches X/M input signals after safe delay
  Compare X/M latch with "safe" razor X/M latch, different?
    Flush F,D,X & M
    Restart M using X/M razor latch, restart F using D/X latch
  + Pipeline will not "break" -> reduce V_DD until flush rate too high
  + Alternatively: "over-clock" until flush rate too high
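The confidence-estimation counters used for speculation gating (add 1 on a correct prediction, subtract 9 on a wrong one, confident when positive) can be sketched as below. The class name is hypothetical, and the counters here are unbounded Python integers for simplicity, which is an assumption; a hardware table would use saturating counters.

```python
from collections import defaultdict


class ConfidenceEstimator:
    """PC-indexed counters; +1/-9 makes 90% accuracy the break-even point."""

    def __init__(self, reward=1, punish=9):
        self.reward = reward
        self.punish = punish
        self.table = defaultdict(int)  # unbounded counters (simplification)

    def update(self, pc, prediction_was_correct):
        if prediction_was_correct:
            self.table[pc] += self.reward
        else:
            self.table[pc] -= self.punish

    def is_confident(self, pc):
        return self.table[pc] > 0


est = ConfidenceEstimator()
# Branch at 0x40: 19 of every 20 predictions correct (95%).
for correct in ([True] * 19 + [False]) * 5:
    est.update(0x40, correct)
# Branch at 0x80: 4 of every 5 predictions correct (80%).
for correct in ([True] * 4 + [False]) * 5:
    est.update(0x80, correct)

print("0x40 confident:", est.is_confident(0x40))  # speculate
print("0x80 confident:", est.is_confident(0x80))  # stall to save energy
```

A branch predicted correctly more than 90% of the time gains more from its correct predictions than it loses from its wrong ones, so its counter drifts upward.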
Summary
Principles of pipelining
Effects of overhead and hazards
Pipeline diagrams
Data hazards
Stalling and bypassing
Control hazards
Branch prediction
Power techniques
Dynamic power: speculation gating
Static and dynamic power: razor latches