Instruction Level Parallelism and Its Exploitation: Unit II by Raju K, CSE Dept.


INSTRUCTION LEVEL PARALLELISM

AND ITS EXPLOITATION


UNIT II
By
RAJU K, CSE Dept.
Outline
1. Instruction-Level Parallelism: Concepts and Challenges
2. Basic Compiler Techniques for Exposing ILP
3. Reducing Branch Costs with Prediction
4. Overcoming Data Hazards with Dynamic Scheduling
5. Dynamic Scheduling: Examples and the Algorithm
6. Hardware-Based Speculation
7. Exploiting ILP Using Multiple Issue and Static Scheduling
8. Exploiting ILP Using Dynamic Scheduling, Multiple Issue, and Speculation
9. Advanced Techniques for Instruction Delivery and Speculation
1. ILP: Concepts and Challenges
Outline:
 Introduction to ILP and loop-level parallelism
 Dependences
• Name dependences
• Data dependences
• Control dependences
 Data hazards
What is ILP?
• ILP (Instruction Level Parallelism) - overlap the execution of unrelated instructions
• Techniques that increase the amount of parallelism exploited among instructions:
– reduce the impact of data and control hazards
– increase the processor's ability to exploit parallelism
• Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
• Reducing each term on the right-hand side minimizes CPI and thus increases instruction throughput
ILP Within and Across a Basic Block
• Basic block : —a straight-line code sequence with no branches
in except to the entry and no branches out except at the exit

• For typical MIPS programs, branch frequency is often


between 15% and 25%,
• i.e, 3 to 6 instructions execute between a pair of branches.

• And these instructions are likely to depend upon one another.


So, the amount of parallelism available within a basic block is
quite small
• This implies that we must exploit ILP across a basic block
Loop-level parallelism

for (i=1; i<=1000; i++)
x[i]= x[i] + y[i];

• The body of each loop iteration is a basic block
• Loop-level parallelism - ILP across basic blocks
• The simple way to increase ILP is to exploit parallelism among the many iterations of a loop - called loop-level parallelism
• In this example, every iteration of the loop can overlap with any other iteration
Loop-level parallelism
• But within each loop iteration there is little or no opportunity for overlap, as there are very few instructions
• Loop unrolling - a technique to convert loop-level parallelism into instruction-level parallelism
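The conversion the slide describes can be sketched in C. This is a hypothetical hand-unrolling of the x[i] = x[i] + y[i] loop (the function names and the trip count N are invented for illustration, and N is assumed to be a multiple of 4): the four adds in the unrolled body are independent, so a scheduler can overlap them.

```c
#include <assert.h>

#define N 1000  /* trip count, assumed to be a multiple of 4 */

/* Rolled loop: one add per iteration; each iteration is a tiny basic block. */
void add_rolled(double *x, const double *y) {
    for (int i = 0; i < N; i++)
        x[i] = x[i] + y[i];
}

/* Unrolled by 4: the four adds are independent of one another,
   so loop-level parallelism becomes instruction-level parallelism. */
void add_unrolled(double *x, const double *y) {
    for (int i = 0; i < N; i += 4) {
        x[i]     = x[i]     + y[i];
        x[i + 1] = x[i + 1] + y[i + 1];
        x[i + 2] = x[i + 2] + y[i + 2];
        x[i + 3] = x[i + 3] + y[i + 3];
    }
}
```

Both versions compute the same result; only the shape of the basic blocks differs.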
Dependences
• To have ILP, instructions should have no dependences
• If two instructions are dependent, they are not parallel and must be executed in order
• 3 types of dependences:
– Data dependences
– Name dependences
– Control dependences
Data Dependences
• Also called true data dependence
• An instruction j is data dependent on instruction i if either of the following holds:
– Instruction i produces a result that may be used by instruction j, or
– Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (a chain of dependences i → k → j)
• An unhandled true data dependence creates a RAW hazard
Data Dependence - Example
• The arrows (in the figure) show the sequence of dependences and the execution order that must be preserved
• If two instructions are data dependent, their execution cannot be overlapped
To overcome data dependence
• Data dependence
– Causes hazards that stall the pipeline
– Dictates the execution order that must be preserved
– Limits the parallelism
• Techniques to avoid dependence limitations:
– Maintain the dependences but avoid hazards
• Code scheduling
– Hardware based
– Software (compiler) based
– Eliminate dependences by code transformations
• Complex
• Compiler-based
Name Dependences
• A name dependence occurs when two instructions use the same register or memory location, called a name
• But there is no flow of data between the instructions associated with that name
• A name dependence is not a true dependence
• Instructions involved in a name dependence can execute simultaneously. They can also be reordered, if the name used in the instructions is changed
Types of name dependences
Two types of name dependences between an instruction i that precedes instruction j:
• An antidependence between instruction i and instruction j occurs when instruction j writes a register or memory location that instruction i reads
• The ordering must be preserved to ensure that i reads the correct value
• E.g., the antidependence between S.D and DADDIU on register R1 (see the earlier slide with the Data Dependence Example)
Types of name dependences
• An output dependence occurs when instruction i and instruction j write the same register or memory location

I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7

• The ordering must be preserved to ensure that the value finally written corresponds to instruction j
To overcome name dependence
• renaming – change the register number or
memory location

• done either statically by a compiler or


dynamically by the hardware
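A hypothetical C analogue of renaming (the function and variable names are invented for illustration): reusing one temporary creates WAR/WAW name dependences between otherwise unrelated computations; giving each value a fresh name removes them, so the two groups could be reordered or overlapped.

```c
#include <assert.h>

/* Name dependences: 't' is reused, so the second group cannot be
   reordered with the first even though no data flows between them. */
int sum_reused(int a, int b, int c, int d) {
    int t;
    t = a + b;        /* write t */
    int r1 = t * 2;   /* read t  */
    t = c + d;        /* WAW with the first write, WAR with the read */
    int r2 = t * 2;
    return r1 + r2;
}

/* Renamed: each value gets its own name; the two groups are now
   independent and could execute in either order or in parallel. */
int sum_renamed(int a, int b, int c, int d) {
    int t1 = a + b;
    int r1 = t1 * 2;
    int t2 = c + d;   /* fresh name removes the WAW/WAR dependences */
    int r2 = t2 * 2;
    return r1 + r2;
}
```

Renaming changes only the names, never the values computed.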
Data Hazards
Read After Write (RAW) hazard:
InstrJ tries to read an operand before InstrI writes it

I: add r1,r2,r3
J: sub r4,r1,r3

• This hazard results from an actual need for communication (i.e., true data dependence)
• Reordering instructions i and j is not possible
• Program order must be preserved to ensure that j receives the value from i
Data Hazards
Write After Read (WAR) hazard:
InstrJ writes an operand before InstrI reads it

I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7

• This hazard results from antidependence
• In this example, it results from the use of the name "r1" in instruction i for reading and in instruction j for writing
• This hazard can't happen in the MIPS 5-stage pipeline because:
– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5
• But a WAR hazard can occur when instructions are reordered, i.e., when instruction j executes before instruction i
Data Hazards
Write After Write (WAW) hazard:
InstrJ writes an operand before InstrI writes it

I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7

• This hazard results from output dependence
• In this example, it results from the use of the name "r1" in instructions i and j for writing
• Can't happen in the MIPS 5-stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
• But a WAW hazard can occur when instructions are reordered, i.e., when instruction j executes before instruction i
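The r1 sequence from the slides can be mimicked in C to show why the write order matters (a sketch with invented function names; the registers become plain variables): K must see the value finally written by J, and swapping the two writes to r1 changes K's result.

```c
#include <assert.h>

/* Program order: I: r1 = r4 - r3;  J: r1 = r2 + r3;  K: r6 = r1 * r7.
   K reads the value finally written, which is J's. */
int in_order(int r2, int r3, int r4, int r7) {
    int r1 = r4 - r3;   /* I */
    r1 = r2 + r3;       /* J: output (WAW) dependence on I via the name r1 */
    return r1 * r7;     /* K: multiplies J's value */
}

/* If I and J are (illegally) reordered, I's value of r1 survives
   and K multiplies the wrong value. */
int reordered(int r2, int r3, int r4, int r7) {
    int r1 = r2 + r3;   /* J first */
    r1 = r4 - r3;       /* I last */
    return r1 * r7;
}
```

With r2=2, r3=3, r4=10, r7=5: in order, K sees r1 = 5; reordered, K sees r1 = 7.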
Control Dependences
• Determine the ordering of an instruction i with respect to a branch instruction, such that:
– instruction i is executed in correct program order, and
– instruction i is executed only when it should be

e.g.
if p1 {
instri;
};
Control Dependences : Example

if p1 {
S1;
};
if p2 {
S2;
}

• S1 is control dependent on p1
• S2 is control dependent on p2 but not on p1
Control Dependences
Two constraints imposed by control
dependences:
•An instruction that is control dependent on a branch
cannot be moved before the branch
– Eg. we cannot take an instruction from the then portion of an if
statement and move it before the if statement.
•An instruction that is not control dependent on a
branch cannot be moved after the branch
– Eg. we cannot take a statement before the if statement and move
it into the then portion.
How Does the Simple Pipeline Preserve Control Dependence?
– Ensure that an instruction that is control dependent on a branch is not executed until the branch direction is known.
Can We Violate Control Dependence?
• Yes, we can…
– If we can ensure that violating the control dependence will not result in incorrectness of the program, control dependence is not a critical property that must be preserved.
– Instead, the exception behavior and the data flow are preserved via the data and control dependences
– These two properties are critical to the correctness of the program
Properties critical to program correctness

1. Exception behavior - preserving the exception behavior means that the reordering of instruction execution must not cause any new exceptions in the program

2. Data flow - the actual flow of data values among instructions that produce results and those that consume them
Preserving Exception Behavior
• If we ignore the control dependence and move the load instruction before the branch (in the figure's example), then
– The load instruction will cause a memory protection exception when R2=0, as the address 0(R2) becomes 0
– This will not happen if the order of the branch and the load instruction is preserved.
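A hedged C analogue of the same situation (the function name is invented; NULL plays the role of R2 = 0): the branch guards the load, so hoisting the dereference above the check would introduce a fault the original program never raises.

```c
#include <assert.h>
#include <stddef.h>

/* The branch guards the load, like a BEQZ on R2 guarding LW 0(R2).
   Executing the load only on the safe path preserves exception behavior;
   hoisting '*r2' above the NULL test could fault. */
int guarded_load(const int *r2) {
    if (r2 == NULL)      /* branch */
        return 0;
    return *r2;          /* load: control dependent on the test above */
}
```

A compiler or hardware may only move such a load speculatively if it can guarantee no new exception becomes visible.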
Preserving Data Flow
• A branch makes the data flow dynamic (i.e., a value can come from multiple points)
• The value of R1 used by the OR instruction depends on whether the branch is taken or not
Preserving Data Flow
• Thus, the OR instruction is data dependent on both the DADDU and DSUBU instructions
• "Preserving data flow" means that if the branch is not taken, the value of R1 computed by DSUBU is used by OR; otherwise, the value of R1 computed by DADDU is used.
• By preserving the control dependence of the OR on the branch, we prevent an illegal change to the data flow
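The DADDU/DSUBU/OR pattern maps onto a small C sketch (invented function name; the condition stands in for the branch): which producer's value of r1 reaches the OR depends on the path taken, so the consumer is data dependent on both producers.

```c
#include <assert.h>

/* r1 is produced by an add on one path and a subtract on the other;
   the OR consumes whichever value the branch selects. */
int or_result(int cond, int r5, int r6, int r7) {
    int r1;
    if (cond)
        r1 = r5 + r6;   /* DADDU path */
    else
        r1 = r5 - r6;   /* DSUBU path */
    return r1 | r7;     /* OR: data dependent on both producers */
}
```

Moving the OR above the branch would fix one producer's value into it and change the data flow on the other path.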
2. Basic Compiler Techniques for
Exposing ILP

 Pipeline Scheduling and Loop Unrolling


Instruction Level Parallelism
• Need: potential overlap among instructions
• But there are few possibilities in a basic block
– Blocks are small (6-7 instructions)
– Even these few instructions are dependent
• Remedy - exploit ILP across multiple basic blocks, i.e., across the iterations of a loop:

for (i = 1000; i > 0; i=i-1)
x[i] = x[i] + s;
A pipeline that supports multiple outstanding FP
operations

Latencies
Pipeline latency - the number of stages from the EX stage to the stage that produces the result

The actual latencies used in our examples are listed in the table below.
Sample Pipeline and Intervening Latencies

IF ID FP1 FP2 FP3 FP4 DM WB   (FP1-FP4 form the EX stage for FP operations)

Intervening latency - the number of stages by which a dependent instruction must be separated from the source instruction to avoid a stall

FP Add/Sub IF ID FP1 FP2 FP3 FP4 DM WB
FP Add/Sub IF ID stall stall stall FP1 FP2 FP3

FP Add/Sub IF ID FP1 FP2 FP3 FP4 DM WB
SD         IF ID EX stall stall DM WB
Intervening Latencies
LD IF ID EX DM WB

FP ALU IF ID stall FP1 FP2 FP3 FP4

LD IF ID EX DM WB

SD IF ID EX DM WB

Int IF ID EX DM WB
Add/Sub
Int IF ID EX DM WB
Add/Sub
Intervening Latencies

Int IF ID EX DM WB
Add/Sub
Branch IF stall ID EX DM WB
Intervening Latency Assumptions in Our Examples

Instruction producing result Instruction using result Latency


FP Add/Sub FP Add/Sub 3
FP Add/Sub SD 2
LD FP Add/Sub 1
LD SD 0
Int Add/Sub Int Add/Sub 0
Int Add/Sub Branch 1
Basic Pipeline Scheduling: Example
• Simple loop:

for(i=1; i<=1000; i++)
x[i]=x[i] + s;

• Equivalent MIPS code:

;R1 points to the last element in the array
;for simplicity, we assume that x[0] is at the address 0
Loop: LD F0, 0(R1) ;F0=array element
ADDD F4,F0,F2 ;add scalar in F2
SD 0(R1),F4 ;store result
SUBI R1,R1,#8 ;decrement pointer
BNEZ R1, Loop ;branch
Pipelined execution considering the intervening latencies
1. Loop: LD F0, 0(R1)
2. Stall
3. ADDD F4,F0,F2
4. Stall
5. Stall
6. SD 0(R1),F4
7. SUBI R1,R1,#8
8. Stall
9. BNEZ R1, Loop
10. Stall

 10 clocks per iteration (5 stalls)
 Rewrite the code to minimize stalls?

Instruction producing result Instruction using result Latency


FP Add/Sub FP Add/Sub 3
FP Add/Sub SD 2
LD FP Add/Sub 1
LD SD 0
Int Add/Sub Int Add/Sub 0
Int Add/Sub Branch 1
Scheduled pipelined execution to minimise stalls
1. Loop: LD F0, 0(R1)
2. SUBI R1,R1,#8
3. ADDD F4,F0,F2
4. Stall
5. BNEZ R1, Loop
6. SD 8(R1),F4

• 6 clocks per iteration (1 stall); but only 3 instructions do the actual work of processing the array (LD, ADDD, SD)
• Unroll the loop 4 times to improve the potential for instruction scheduling
• After scheduling, observe how:
– the stall slots are filled by independent instructions through reordering
– the necessary latency between dependent instructions is maintained

Before Scheduling:
1. Loop: LD F0, 0(R1)
2. Stall
3. ADDD F4,F0,F2
4. Stall
5. Stall
6. SD 0(R1),F4
7. SUBI R1,R1,#8
8. Stall
9. BNEZ R1, Loop
10. Stall

After Scheduling:
1. Loop: LD F0, 0(R1)
2. SUBI R1,R1,#8
3. ADDD F4,F0,F2
4. Stall
5. BNEZ R1, Loop ;delayed branch
6. SD 8(R1),F4 ;offset altered; interchanged with SUBI
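The two cycle counts (10 unscheduled, 6 scheduled) can be reproduced with a toy C model of in-order issue - a sketch under stated assumptions, not the real pipeline: one instruction issues per cycle, a consumer issues at least latency+1 cycles after its producer, and one branch-delay stall is added when the delay slot is not filled. All names here are invented for illustration; the model reproduces the totals, though not necessarily the exact placement of each stall.

```c
#include <assert.h>

/* An instruction: index of the instruction it depends on (-1 = none)
   and the required latency (number of stalls) before the consumer. */
struct instr { int dep; int lat; };

/* In-order issue, 1 instr/cycle:
   issue[i] = max(issue[i-1] + 1, issue[dep] + lat + 1). */
static int cycles(const struct instr *seq, int n, int branch_delay_stall) {
    int issue[16];
    for (int i = 0; i < n; i++) {
        int c = (i == 0) ? 1 : issue[i - 1] + 1;
        if (seq[i].dep >= 0) {
            int ready = issue[seq[i].dep] + seq[i].lat + 1;
            if (ready > c) c = ready;
        }
        issue[i] = c;
    }
    return issue[n - 1] + branch_delay_stall;
}

/* LD, ADDD(dep LD, lat 1), SD(dep ADDD, lat 2), SUBI, BNEZ(dep SUBI, lat 1),
   plus one stall for the unfilled branch delay slot. */
int unscheduled_cycles(void) {
    struct instr seq[] = { {-1,0}, {0,1}, {1,2}, {-1,0}, {3,1} };
    return cycles(seq, 5, 1);
}

/* LD, SUBI, ADDD(dep LD), BNEZ(dep SUBI), SD(dep ADDD) -
   SD fills the delay slot, so no extra stall. */
int scheduled_cycles(void) {
    struct instr seq[] = { {-1,0}, {-1,0}, {0,1}, {1,1}, {2,2} };
    return cycles(seq, 5, 0);
}
```

Running the model gives 10 cycles for the original order and 6 for the scheduled one, matching the slides.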
Loop Unrolling
• Pros:
– Larger basic block
– More scope for scheduling
– and eliminating dependencies
• Cons:
– Increases code size
• Comment:
– Often a basic step for other optimizations
Unrolled Loop (4 iterations)
Loop: LD F0, 0(R1)
ADDD F4, F0, F2
SD 0(R1), F4
SUBI R1, R1, #8
BEQZ R1, Exit
LD F6, 0(R1)
ADDD F8, F6, F2
SD 0(R1), F8
SUBI R1, R1, #8
BEQZ R1, Exit
LD F10, 0(R1)
ADDD F12, F10, F2
SD 0(R1), F12
SUBI R1, R1, #8
BEQZ R1, Exit
LD F14, 0(R1)
ADDD F16, F14, F2
SD 0(R1), F16
SUBI R1, R1, #8
BNEZ R1, Loop
Exit:

The intermediate BEQZ branches are never taken (the iteration count is a multiple of 4) - eliminate them!
1. Eliminating Data Dependences
Loop: LD F0, 0(R1)
ADDD F4, F0, F2
SD 0(R1), F4
SUBI R1, R1, #8
LD F6, 0(R1)
ADDD F8, F6, F2
SD 0(R1), F8
SUBI R1, R1, #8
LD F10, 0(R1)
ADDD F12, F10, F2
SD 0(R1), F12
SUBI R1, R1, #8
LD F14, 0(R1)
ADDD F16, F14, F2
SD 0(R1), F16
SUBI R1, R1, #8
BNEZ R1, Loop

• The data dependences through R1 (SUBI → LD, SD) force sequential execution of the iterations
• The compiler removes this dependence by:
– computing the intermediate R1 values (folding them into load/store offsets)
– eliminating the intermediate SUBIs
– changing the final SUBI
• Data flow analysis:
– can be done on registers
– cannot be done easily on memory locations
After Eliminating Data Dependences
Loop: LD F0, 0(R1)
ADDD F4, F0, F2
SD 0(R1), F4
LD F0, -8(R1)
ADDD F4, F0, F2
SD -8(R1), F4
LD F0, -16(R1)
ADDD F4, F0, F2
SD -16(R1), F4
LD F0, -24(R1)
ADDD F4, F0, F2
SD -24(R1), F4
SUBI R1, R1, #32
BNEZ R1, Loop

 the index values for LD/SD are changed
 the intermediate SUBIs are eliminated
 the final SUBI is changed
2. Eliminating Name Dependences

Before register renaming:
Loop: LD F0, 0(R1)
ADDD F4, F0, F2
SD 0(R1), F4
LD F0, -8(R1)
ADDD F4, F0, F2
SD -8(R1), F4
LD F0, -16(R1)
ADDD F4, F0, F2
SD -16(R1), F4
LD F0, -24(R1)
ADDD F4, F0, F2
SD -24(R1), F4
SUBI R1, R1, #32
BNEZ R1, Loop

After register renaming:
Loop: LD F0, 0(R1)
ADDD F4, F0, F2
SD 0(R1), F4
LD F6, -8(R1)
ADDD F8, F6, F2
SD -8(R1), F8
LD F10, -16(R1)
ADDD F12, F10, F2
SD -16(R1), F12
LD F14, -24(R1)
ADDD F16, F14, F2
SD -24(R1), F16
SUBI R1, R1, #32
BNEZ R1, Loop
Unrolled Loop (4 iterations)
LD F0, 0(R1) 1 cycle stall
2 cycles stall
This loop needs 28 clocks
ADDD F4,F0,F2
SD 0(R1),F4 =14 stalls per iteration+
LD F6, -8(R1) 14 instruction issue
ADDD F8,F6,F2
cycles
SD -8(R1),F8
LD F10, -16(R1) or 28/4=7 for each
iteration of the array
ADDD F12,F10,F2
SD -16(R1),F12
even slower
LD F14, -24(R1) than the scheduled
ADDD F16,F14,F2 version!
SD -24(R1),F16
SUBI R1,R1,#32 Rewrite loop to
BNEZ R1,Loop minimize stalls
3. Scheduled Unrolled Loop that Minimises Stalls

Loop: LD F0,0(R1)
LD F6,-8(R1)
LD F10,-16(R1)
LD F14,-24(R1)
ADDD F4,F0,F2
ADDD F8,F6,F2
ADDD F12,F10,F2
ADDD F16,F14,F2
SD 0(R1),F4
SD -8(R1),F8
SUBI R1,R1,#32
SD 16(R1),F12 ;offset adjusted after SUBI
BNEZ R1,Loop
SD 8(R1),F16 ;fills the branch delay slot

This loop runs in 14 cycles (no stalls) per iteration, or 14/4 = 3.5 cycles for each element of the array!
Steps Compiler Performed to Unroll
• Determine that is OK to move the S.D after SUBUI and BNEZ, and
find amount to adjust SD offset
• Determine that unrolling the loop would be useful
by finding that the loop iterations were independent
• Rename registers to avoid name dependencies
• Eliminate extra test and branch instructions and adjust the loop
termination and iteration code
• Determine loads and stores in unrolled loop can be interchanged by
observing that the loads and stores from different iterations are
independent
– requires analyzing memory addresses and finding that they do not
refer to the same address.
• Schedule the code, preserving any dependences needed to yield
same result as the original code
Task 1: Surprise Test
• Show the following loop unrolled so that there
are two copies of the loop body, assuming the
number of loop iterations is a multiple of 2.
Eliminate any obviously redundant computations
and do not reuse any of the registers.

Loop: L.D F0,0(R1)


ADD.D F4,F0,F2
S.D F4,0(R1)
SUBI R1,R1,#8
BNEZ R1,Loop
Intervening latency assumptions
Consider the following intervening latencies between two dependent
instructions:

Instruction producing result Instruction using result Latency


FP ALU op FP ALU op 3
FP ALU op SD 2
LD FP ALU op 1
LD SD 0
Int ALU op Branch 1
Int ALU op Int ALU op 0
Given Loop
Loop: L.D F0,0(R1)
Stall
ADD.D F4,F0,F2
Stall
Stall
S.D F4,0(R1)
SUBI R1,R1,#8
Stall
BNEZ R1,Loop
Stall
The loop after unrolling and eliminating name
dependences
Loop: L.D F0,0(R1)
ADD.D F4,F0,F2
S.D F4,0(R1)
L.D F6, -8(R1)
ADD.D F8,F6,F2
S.D F8,-8(R1)
SUBI R1,R1,#16
BNEZ R1,Loop
Unrolled and scheduled loop

Loop: L.D F0,0(R1)
L.D F6,-8(R1)
ADD.D F4,F0,F2
ADD.D F8,F6,F2
SUBI R1,R1,#16
S.D F4,16(R1) ;offset adjusted after SUBI
BNEZ R1,Loop
S.D F8,8(R1) ;fills the branch delay slot
3. Reducing Branch Costs with
Prediction
 Static branch prediction

 Dynamic branch prediction


• One bit
• 2 bit
• Correlating
• Tournament
Dynamic Hardware Prediction
• Importance of control dependences
– Branches and jumps are frequent
– Limiting factor as ILP increases (Amdahl's law)
• Schemes to attack control dependences
– Static (compiler based)
• Basic (stall the pipeline)
• Predict-not-taken and predict-taken
• Delayed branch and canceling branch
– Dynamic predictors (hardware based)
• The effectiveness of a branch prediction scheme depends on
– the accuracy of the branch prediction scheme
– the cost of a branch when the prediction is incorrect
Basic Branch Prediction and Branch-Prediction Buffer
– A branch prediction buffer (or branch history table) is a small memory indexed by the lower portion of the address of the branch instruction
– The memory contains bits that say whether the branch was recently taken or not
Basic Branch Prediction
a.k.a. Branch History Table (BHT) - a small direct-mapped cache of T/NT bits

[Figure: the lower bits of the branch instruction's PC index the BHT; the stored bit selects the predicted next PC - the branch target if T (predict taken), or PC + 4 if NT (predict not taken).]
Branch Prediction using a BHT of one-bit predictors
• The lower portion of the PC indexes a table of bits (0 = N, 1 = T)
• Essentially: the branch will go the same way it went last time
• Problem: consider the inner loop branch below (* indicates a misprediction)

for ( i=0; i<100; i++ )
for ( j=0; j<3; j++ )
// whatever code

• Two "built-in" mispredictions per execution of the inner loop
• This is because the branch predictor "changes its mind too quickly"
Branch Prediction using a BHT of two-bit predictors
• Use an n-bit saturating counter
• Only the loop exit causes a misprediction
• A 2-bit predictor is almost as good as any general n-bit predictor
Two-bit predictors
• Force the predictor to mispredict twice before "changing its mind"
• One misprediction per loop execution (rather than two with the 1-bit predictor)

State/Prediction: N* n* t  T* t  T  T  T* t  T  T  T*
Outcome:          T  T  T  N  T  T  T  N  T  T  T  N

(* marks a misprediction; upper case = strong state, lower case = weak state)
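The steady-state behavior on the repeating T T T N pattern can be checked with a small simulation (a sketch; the function names and the 10-period warm-up are invented for illustration): the 1-bit scheme flips on both edges of the pattern and mispredicts twice per period, while the 2-bit saturating counter mispredicts only at the loop exit.

```c
#include <assert.h>

static const int pattern[4] = { 1, 1, 1, 0 };  /* T T T N; 1 = taken */

/* 1-bit: predict whatever the branch did last time.
   Warm up for 9 periods, count mispredictions in the 10th. */
int misses_1bit(void) {
    int last = 0, miss = 0;
    for (int i = 0; i < 40; i++) {
        int outcome = pattern[i % 4];
        if (i >= 36 && last != outcome) miss++;
        last = outcome;
    }
    return miss;
}

/* 2-bit saturating counter (0..3); counter >= 2 predicts taken. */
int misses_2bit(void) {
    int ctr = 0, miss = 0;
    for (int i = 0; i < 40; i++) {
        int outcome = pattern[i % 4];
        int predict = (ctr >= 2);
        if (i >= 36 && predict != outcome) miss++;
        if (outcome) { if (ctr < 3) ctr++; }
        else         { if (ctr > 0) ctr--; }
    }
    return miss;
}
```

In the steady state the 1-bit predictor misses twice per T T T N period, the 2-bit predictor once.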
Prediction Accuracy of a 4K-entry 2-bit
Prediction Buffer
Correlated (two-level) predictor
• Branch predictors that use the behavior of other branches to make a
prediction are called correlating predictors or two-level predictors.

• Exploits observation that branch outcomes are correlated

• Consider the following simplified code fragment,


if (d ==0 )
d =1;
if (d == 1) {

• Equivalent code fragment is

BNEZ R1, L1 ; branch b1 (d != 0)


DADDIU R1, R0, #1
L1: DADDIU R3, R1, # -1
BNEZ R3, L2 ; branch b2 (d != 1)

L2:

• if b1 is not taken, b2 will be not taken.
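The correlation the slide states can be verified directly in C (the function name is invented; b1 and b2 report whether each branch would be taken): whenever b1 is not taken (d == 0), d is set to 1, so b2 (d != 1) is provably not taken - exactly the pattern a two-level predictor can learn.

```c
#include <assert.h>

/* The slide's fragment: if (d == 0) d = 1; if (d == 1) ...
   b1 = "is d != 0 taken?", b2 = "is d != 1 taken?" */
void branches(int d, int *b1_taken, int *b2_taken) {
    *b1_taken = (d != 0);
    if (d == 0)
        d = 1;
    *b2_taken = (d != 1);
}
```

Sweeping d over a range shows the implication "b1 not taken ⇒ b2 not taken" holds for every value.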


Correlating Branch Predictors
The figure shows a branch predictor that uses the behavior of the last one branch:

[Figure: the lower bits of the PC index the BHT as before, but a 1-bit global branch history register (storing the NT/T behavior of the previous branch) selects which of the two prediction bits to use; the prediction then chooses between the branch target (T, predict taken) and PC + 4 (NT, predict not taken).]
Correlated predictors
• Maintain a separate prediction per (PC, BHR) pair
• Branch history register (BHR): stores recent branch outcomes
A (2,2) Predictor
Correlated predictors
• Consider the inner loop branch below:

for ( i=0; i<100; i++ )
for ( j=0; j<3; j++ )
// whatever code

• The BHT uses 1-bit DIRection Predictors (DIRP)
• A 2-bit global history register stores the behavior of the previous 2 branches, i.e., 2^2 = 4 1-bit DIRP entries per branch

We didn't make anything better - what's the problem?
Correlated predictors
•BHR wasn’t long enough to capture the pattern
•Try again: BHT+3BHR: 23 = 8 1-bit DIRP entries

•No mis-predictions after predictor learns all the relevant patterns


(m, n) Predictors
• Use the behavior of the last m branches
• 2^m n-bit predictors for each branch
• Simple implementation
– Use an m-bit shift register to record the behavior of the last m branches

[Figure: the PC chooses a BHT entry for the current branch; each entry holds 2^m n-bit predictors, and the m-bit global branch history chooses which of them supplies the prediction.]
Size of the BHT
• Number of bits in an (m,n) predictor:
– 2^m predictors for each branch
– n bits in each predictor
– N entries in the BHT corresponding to N branches
– Size of the BHT = 2^m × n × N bits
• Example - assume 8K bits in the BHT:
– (0,1): 8K entries
– (0,2): 4K entries
– (2,2): 1K entries
– (12,2): 1 entry!
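The arithmetic in the example is just the bit budget divided by the cost of one entry; a one-line helper (invented name, for illustration) reproduces the four cases:

```c
#include <assert.h>

/* Number of BHT entries for an (m, n) predictor fitting in a fixed
   bit budget: each entry holds 2^m n-bit counters. */
int bht_entries(int total_bits, int m, int n) {
    return total_bits / ((1 << m) * n);
}
```

With an 8K-bit budget this gives 8192 entries for (0,1), 4096 for (0,2), 1024 for (2,2), and a single entry for (12,2).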
Performance Comparison of 2-bit
Predictors
Hybrid (tournament) predictor
• Simple 2 bit predictor uses only local information; fails on
some branches
• Correlated predictor uses global (history) information;
improved performance

• Idea: combine simple BHT predictor and correlated


predictor
– Simple BHT predicts history independent branches
– Correlated predictor predicts only branches that need history
– Chooser assigns branches to one predictor or the other
• Better accuracy at medium size
• 90–95% accuracy
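One common way to build the chooser (a hypothetical sketch, not a specific machine's design) is a 2-bit saturating counter per entry that moves toward whichever predictor was right when the two disagree:

```c
#include <assert.h>

/* 2-bit chooser: counter < 2 selects predictor P1, otherwise P2. */
struct chooser { int ctr; };   /* 0..3 */

int choose_p2(const struct chooser *c) { return c->ctr >= 2; }

/* Update only matters when the predictors disagree:
   move toward the one that was correct. */
void chooser_update(struct chooser *c, int p1_right, int p2_right) {
    if (p1_right && !p2_right && c->ctr > 0) c->ctr--;
    if (p2_right && !p1_right && c->ctr < 3) c->ctr++;
}
```

If P1 keeps winning on some branch, the counter saturates at 0 and the chooser hands that branch to P1, which is how the tournament assigns branches to the predictor that serves them best.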
Tournament predictor
4. Overcoming Data Hazards with Dynamic Scheduling

5. Dynamic Scheduling: Examples and the Algorithm
Tomasulo's Algorithm
• Used in the IBM 360/91 FPU (before caches)
• Goal: high FP performance without special compilers
• Conditions:
– The small number of floating point registers (4 in the 360) prevented interesting compiler scheduling of operations
– Long memory accesses and long FP delays
– This led Tomasulo to try to figure out how to get more effective registers - renaming in hardware!
• Why study a 1966 computer?
– The descendants of this design have flourished!
• Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, …
Tomasulo's Algorithm (cont'd)
• Control & buffers are distributed with the Function Units (FU)
– The FU buffers are called "reservation stations" => they buffer the operands of instructions waiting to issue
• Registers in instructions are replaced by values or by pointers to reservation stations (RS) => register renaming
– avoids WAR and WAW hazards
– There are more reservation stations than registers, so the hardware can do optimizations compilers can't
• Results go to the FUs from the RSs, not through the registers, over the Common Data Bus, which broadcasts results to all FUs
• Loads and stores are treated as FUs with RSs as well
Tomasulo-based FPU for MIPS

[Figure: instructions enter an FP op queue from the instruction unit. Six load buffers (Load1-Load6) and three store buffers (Store1-Store3) connect to memory; reservation stations Add1-Add3 feed the FP adders and Mult1-Mult2 feed the FP multipliers, with the FP registers above them. All results return over the Common Data Bus (CDB), which broadcasts to the reservation stations, store buffers, and registers.]
Reservation Station Components
• Op: operation to perform in the unit (e.g., + or –)
• Vj, Vk: values of the source operands
– Store buffers have a V field, the result to be stored
• Qj, Qk: the reservation stations producing the source registers (the value to be written)
– Note: Qj/Qk = 0 => the source operand is already available in Vj/Vk
– Store buffers only have Qi, for the RS producing the result
• Busy: indicates that the reservation station or FU is busy

Register result status - indicates which functional unit will write each register, if one exists. Blank when there are no pending instructions that will write that register.
Three Stages of the Tomasulo Algorithm
1. Issue - get an instruction from the FP op queue
– If a reservation station is free (no structural hazard), control issues the instruction & sends the operands (renames registers)
2. Execute - operate on the operands (EX)
– When both operands are ready, execute; if not ready, watch the Common Data Bus for the result
3. Write result - finish execution (WB)
– Write the result on the Common Data Bus to all awaiting units; mark the reservation station available
• Normal data bus: data + destination ("go to" bus)
• Common data bus: data + source ("come from" bus)
– 64 bits of data + 4 bits of functional-unit source address
– A unit writes the value in if the tag matches the functional unit it expects (the one producing the result)
– The bus does the broadcast
• Example speeds: 2 clocks for FP +,–; 10 for ×; 40 clocks for ÷
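The CDB broadcast in stage 3 can be sketched in C (a minimal illustration with invented names, not the 360/91's actual circuitry): every busy reservation station compares the broadcast tag against its Qj/Qk fields, captures the value on a match, and becomes ready once both Q fields are clear.

```c
#include <assert.h>

#define NRS 3

/* A reservation station entry: Qj/Qk are producing-RS tags
   (0 = the operand value is already present in Vj/Vk). */
struct rs {
    int busy;
    int qj, qk;
    double vj, vk;
};

/* Stage 3: broadcast (tag, value) on the CDB; every waiting
   station snoops the bus and captures a matching result. */
void cdb_broadcast(struct rs *stations, int tag, double value) {
    for (int i = 0; i < NRS; i++) {
        if (!stations[i].busy) continue;
        if (stations[i].qj == tag) { stations[i].vj = value; stations[i].qj = 0; }
        if (stations[i].qk == tag) { stations[i].vk = value; stations[i].qk = 0; }
    }
}

/* Stage 2 gate: an instruction may execute once both operands arrived. */
int rs_ready(const struct rs *s) {
    return s->busy && s->qj == 0 && s->qk == 0;
}
```

One broadcast can wake up several consumers at once, which is exactly why results travel over the CDB instead of through the register file.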
Tomasulo Example (initial state)

Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 Load1 No
LD F2 45+ R3 Load2 No
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2 (3 load buffers)
DIVD F10 F0 F6
ADDD F6 F8 F2

Reservation Stations (3 FP adder R.S., 2 FP mult R.S.; the Time field counts down): S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No

Register result status (Clock = cycle counter):
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
0 FU
Tomasulo Example Cycle 1
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 Load1 Yes 34+R2
LD F2 45+ R3 Load2 No
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
1 FU Load1
Tomasulo Example Cycle 2
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 Load1 Yes 34+R2
LD F2 45+ R3 2 Load2 Yes 45+R3
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
2 FU Load2 Load1

Note: Can have multiple loads outstanding


Tomasulo Example Cycle 3
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 Load1 Yes 34+R2
LD F2 45+ R3 2 Load2 Yes 45+R3
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes MULTD R(F4) Load2
Mult2 No

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
3 FU Mult1 Load2 Load1

• Note: register names are removed ("renamed") in the Reservation Stations; MULTD issued
• Load1 completing; what is waiting for Load1?
Tomasulo Example Cycle 4
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 Load2 Yes 45+R3
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4
DIVD F10 F0 F6
ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 Yes SUBD M(A1) Load2
Add2 No
Add3 No
Mult1 Yes MULTD R(F4) Load2
Mult2 No

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
4 FU Mult1 Load2 M(A1) Add1

• Load2 completing; what is waiting for Load2?


Tomasulo Example Cycle 5
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4
DIVD F10 F0 F6 5
ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
2 Add1 Yes SUBD M(A1) M(A2)
Add2 No
Add3 No
10 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
5 FU Mult1 M(A2) M(A1) Add1 Mult2

• Timers start counting down for Add1, Mult1


Tomasulo Example Cycle 6
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
1 Add1 Yes SUBD M(A1) M(A2)
Add2 Yes ADDD M(A2) Add1
Add3 No
9 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
6 FU Mult1 M(A2) Add2 Add1 Mult2

• Issue ADDD here despite name dependency on F6?


Tomasulo Example Cycle 7
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
0 Add1 Yes SUBD M(A1) M(A2)
Add2 Yes ADDD M(A2) Add1
Add3 No
8 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
7 FU Mult1 M(A2) Add2 Add1 Mult2

• Add1 (SUBD) completing; what is waiting for it?


Tomasulo Example Cycle 8
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
2 Add2 Yes ADDD (M-M) M(A2)
Add3 No
7 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 M(A2) Add2 (M-M) Mult2
Tomasulo Example Cycle 9
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
1 Add2 Yes ADDD (M-M) M(A2)
Add3 No
6 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
9 FU Mult1 M(A2) Add2 (M-M) Mult2
Tomasulo Example Cycle 10
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
0 Add2 Yes ADDD (M-M) M(A2)
Add3 No
5 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
10 FU Mult1 M(A2) Add2 (M-M) Mult2

• Add2 (ADDD) completing; what is waiting for it?


Tomasulo Example Cycle 11
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
4 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
11 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

• Write result of ADDD here?


• All quick instructions complete in this cycle!
Tomasulo Example Cycle 12
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
3 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
12 FU Mult1 M(A2) (M-M+M)(M-M) Mult2
Tomasulo Example Cycle 13
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
2 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
13 FU Mult1 M(A2) (M-M+M)(M-M) Mult2
Tomasulo Example Cycle 14
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
1 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
14 FU Mult1 M(A2) (M-M+M)(M-M) Mult2
Tomasulo Example Cycle 15
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 15 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
0 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
15 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

• Mult1 (MULTD) completing; what is waiting for it?


Tomasulo Example Cycle 16
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 15 16 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
40 Mult2 Yes DIVD M*F4 M(A1)

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
16 FU M*F4 M(A2) (M-M+M)(M-M) Mult2

• Just waiting for Mult2 (DIVD) to complete


Tomasulo Example Cycle 55
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 15 16 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
1 Mult2 Yes DIVD M*F4 M(A1)

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
55 FU M*F4 M(A2) (M-M+M)(M-M) Mult2
Tomasulo Example Cycle 56
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 15 16 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5 56
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
0 Mult2 Yes DIVD M*F4 M(A1)

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
56 FU M*F4 M(A2) (M-M+M)(M-M) Mult2

• Mult2 (DIVD) is completing; what is waiting for it?


Tomasulo Example Cycle 57
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 15 16 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5 56 57
ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No

Register result status:


Clock F0 F2 F4 F6 F8 F10 F12 ... F30
57 FU M*F4 M(A2) (M-M+M)(M-M) Result

• Once again: in-order issue, out-of-order execution, and out-of-order completion.
Tomasulo’s scheme offers two major advantages
• (1) the distribution of the hazard detection logic
– distributed reservation stations and the CDB
– If multiple instructions waiting on single result, & each
instruction has other operand, then instructions can be
released simultaneously by broadcast on CDB
– If a centralized register file were used, the units would
have to read their results from the registers when register
buses are available.
• (2) the elimination of stalls for WAW and WAR hazards
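Advantage (1) can be sketched in a few lines. This is a hypothetical toy model (the station names, field layout, and values are assumptions, not the real hardware): one CDB broadcast delivers a tagged result to every reservation station waiting on it in a single step, so multiple instructions can be released simultaneously.

```python
# Hypothetical sketch: one CDB broadcast releases every reservation
# station waiting on the producing tag (here "Mult1").
stations = {
    "Add1":  {"Qj": "Mult1", "Vj": None, "Qk": None,    "Vk": 2.0},
    "Add2":  {"Qj": None,    "Vj": 1.5,  "Qk": "Mult1", "Vk": None},
    "Mult2": {"Qj": "Load1", "Vj": None, "Qk": None,    "Vk": 4.0},
}

def broadcast(tag, value):
    """Deliver (tag, value) to all stations in one step, like the CDB."""
    ready = []
    for name, rs in stations.items():
        if rs["Qj"] == tag:
            rs["Qj"], rs["Vj"] = None, value
        if rs["Qk"] == tag:
            rs["Qk"], rs["Vk"] = None, value
        if rs["Qj"] is None and rs["Qk"] is None:
            ready.append(name)  # both operands present: may start executing
    return ready

print(broadcast("Mult1", 3.0))  # Add1 and Add2 released together
```

Note that Mult2 stays waiting: it depends on a different tag, so the broadcast leaves it untouched.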
Tomasulo Drawbacks
• Complexity
• Many associative stores (CDB) at high speed
• Performance limited by Common Data Bus
– Each CDB must go to multiple functional units
→ high capacitance, high wiring density
– Number of functional units that can complete per cycle limited to one!
• Multiple CDBs → more FU logic for parallel assoc stores
• Non-precise interrupts
Hardware Based Speculation
Hardware-Based Speculation
– Why?
• To resolve control dependence to increase ILP
– Overcoming control dependence is done by
• Speculating on the outcome of branches and executing the
program as if our guess is correct.
– Three key ideas combined in hardware-based speculation:
• Dynamic branch prediction,
• Speculative execution, and
• Dynamic scheduling.
– Instruction commit
• When an instruction is no longer speculative, we allow it to
update the register file or memory
– The key idea behind implementing speculations is to allow
instructions to execute out of order but to force them to
commit in order.
Extend the Tomasulo’s Approach to
Support Speculation
– Separate the bypassing of results among instructions
from the actual completion of an instruction.
• By doing so, the result of an instruction can be used by other
instruction without allowing the instruction to perform any
irrecoverable update, until the instruction is no longer
speculative.
– A reorder buffer is employed to pass results among
instructions that may be speculated.
• The reorder buffer holds the results of an instruction between
the time the operation associated with the instruction
completes and the time the instruction commits.
• The store buffers in the original Tomasulo’s algorithm are
integrated into reorder buffer.
• The renaming function of reservation station (RS) is replaced by
the reorder buffer. Thus, a result is usually tagged by using the
reorder buffer entry number.
Motivation for speculative execution
Increasing available ILP
• Multiple-issue increases demand for
independent instructions
• May need to predict one branch per cycle
– Hard to maintain high clock rate
• Speculation attempts to overcome problem
– Execute program as if all branches were predicted
correctly
– Dynamic scheduling waits for branch resolution
Motivation for speculative execution
Increasing available ILP
• Three elements of speculation
– Dynamic branch prediction
– Ability to undo speculative instructions from
wrong path
– Dynamic scheduling
Data flow details
• Instructions pass results to other instructions through
CDB
• Speculative instructions pass results to following
speculative instructions without committing to registers
or memory, and without throwing exceptions
• Update register or memory only when we know that
instruction is non-speculative
• Additional commit step in instruction execution
– Execute out-of-order, commit in-order
– Avoid irrevocable change of state or exception before
checking speculation status of instruction
ROB
• Separate instruction completion from commit
of results
• Reorder buffer (ROB) used to hold temporarily
results of speculative instructions
– Results forwarded through ROB
Out-of-Order Execution
• We’re now executing instructions in data-flow order
– More performance
• But the outside world can’t know about this
– Must maintain the illusion of sequentiality
Remember the Toll Booth?
• One at a time = 45s (5s + 5s + 30s + 5s)
• With a “4-issue” toll booth, OOO = 30s
– One car hands the toll-booth agent a $100 bill in lane L1; it takes a while to count the change, but lanes L2–L4 keep moving
• We’ll add the equivalent of the “shoulder” to the CPU: the Re-Order Buffer (ROB)
Re-Order Buffer (ROB)
• Separates architected vs. physical registers
• Tracks program order of all in-flight insts
– Enables in-order completion or “commit”
Tomasulo’s algorithm with speculation: hardware organization
[Block diagram: RAT, architected register file, and instruction buffers feed the ROB (a circular buffer with a “head” pointer; each entry holds type, dest, value, fin) and the reservation stations (each holding op, Qj, Qk, Vj, Vk) for the Add and Mult units]
Issue
• Read inst from inst buffer
• Check if resources available:
– Appropriate RS entry
– ROB entry
• Read RAT, read (available) sources, update RAT
• Write to RS and ROB
• Stall issue if any needed resource is not available
Exec
• Same as before
– Wait for all operands to arrive
– Compete to use functional unit
– Execute!
Write Result
• Broadcast result on CDB
– (any dependents will grab the value)
• Write result back to your ROB entry
– The ARF holds the “official” register state, which
we will only update in program order
– Mark ready/finished bit in ROB (note that this inst
has completed execution)
New: Commit
• When an inst is the oldest in the ROB
– i.e. ROB-head points to it
• Write result (if ready/finished bit is set)
– If register-producing instruction: write to
architected register file
– If store: write to memory
• Advance ROB-head to next instruction

• This is what the outside world sees


– And it’s all in-order
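The commit step above can be sketched as a small loop. This is a minimal toy model, not the real pipeline (the entry fields, register names, and values are assumptions): only the ROB head may commit, and only once its result is finished, so the architected state is updated strictly in program order.

```python
from collections import deque

# Hypothetical sketch: commit only from the ROB head, in program order,
# and only once the entry's result is finished.
arf = {"F6": 0.0, "F8": 0.0}          # architected register file
rob = deque([
    {"dest": "F8", "value": 4.5, "finished": True},
    {"dest": "F6", "value": 7.0, "finished": False},  # still executing
])

def commit():
    committed = []
    while rob and rob[0]["finished"]:
        entry = rob.popleft()                 # advance ROB head
        arf[entry["dest"]] = entry["value"]   # irrevocable, in-order update
        committed.append(entry["dest"])
    return committed

print(commit())  # only F8 commits; the unfinished F6 entry blocks everything behind it
```

A store would update memory instead of the ARF at the same point; the in-order rule is identical.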
Commit Illustrated
• Make instruction execution “visible” to the outside world
– “Commit” the changes to the architected state
• As each ROB entry (A, B, C, …) writes its result back to the ARF in program order, the outside world “sees”: A executed, B executed, C executed, …
• Instructions execute out of program order, but the outside world still “believes” it’s in-order
Revisiting Register Renaming
Code:  R1 = R2 + R3;  R3 = R5 – R6;  R1 = R1 * R7;  R1 = R4 + R8;  R2 = R9 + R3
ROB:   ROB1: R1 ← R2+R3,  ROB2: R3 ← R5–R6,  ROB3: R1 ← ROB1*R7,  ROB4: R1 ← R4+R8
RAT:   R1 → ROB4,  R3 → ROB2
• If we issue R2=R9+R3 to the ROB now, R3 comes from ROB2 (ROB5: R2 ← R9+ROB2)
• However, if R3=R5–R6 commits first, then update the RAT so that when we issue R2=R9+R3, it will read its source from the ARF.
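The RAT rule illustrated above can be written as a one-line check. This is a hypothetical sketch (tag and register names are assumptions): at commit, a register is redirected back to the ARF only if the RAT still points at the committing ROB entry; if a younger instruction has renamed the register again, the RAT is left alone.

```python
# Hypothetical sketch of the commit-time RAT update rule.
rat = {"R1": "ROB3", "R3": "ROB2"}   # current renames

def commit_entry(rob_tag, reg):
    if rat.get(reg) == rob_tag:
        rat[reg] = "ARF"             # future readers use the architected file

commit_entry("ROB2", "R3")           # R3's latest writer commits: RAT redirected
commit_entry("ROB1", "R1")           # an older R1 writer commits, but R1 was
                                     # renamed again (ROB3), so RAT is untouched
print(rat)
```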
Example
Latencies: Add: 2 cycles, Mult: 10 cycles, Divide: 40 cycles

Inst sequence:
DIV R2, R3, R4
MUL R1, R5, R6
ADD R3, R7, R8
MUL R1, R1, R3
SUB R4, R1, R5
ADD R1, R4, R2

Sequentially, this would take 40+10+2+10+2+2 = 66 cycles (+ other pipeline stalls)

Initial register values: R1=-23, R2=16, R3=45, R4=5, R5=3, R6=4, R7=1, R8=2
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished

RS (Adder) RS (Mul/Div)
DIV ROB1 45 5

I E W C
ARF RAT ROB
1 1
R1 -23 R1 ARF1 ROB1 R2 2
R2 16 R2 ARF2 ROB1 ROB2 3
R3 45 R3 ARF3 ROB3 4
R4 5 R4 ARF4 ROB4 5
R5 3 R5 ARF5 ROB5 6
R6 4 R6 ARF6 ROB6
R7 1 R7 ARF7
2 ARF8
Cycle: 1
R8 R8
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished

RS (Adder) RS (Mul/Div)
DIV ROB1 45 5
MUL ROB2 3 4
I E W C
ARF RAT ROB
1 1 2
R1 -23 R1 ARF1 ROB2 ROB1 R2 2 2
R2 16 R2 ROB1 ROB2 R1 3
R3 45 R3 ARF3 ROB3 4
R4 5 R4 ARF4 ROB4 5
R5 3 R5 ARF5 ROB5 6
R6 4 R6 ARF6 ROB6
R7 1 R7 ARF7
2 ARF8
Cycle: 2
R8 R8
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished

RS (Adder) RS (Mul/Div)
ADD ROB3 1 2 DIV ROB1 45 5
MUL ROB2 3 4
I E W C
ARF RAT ROB
1 1 2
R1 -23 R1 ROB2 ROB1 R2 2 2 3
R2 16 R2 ROB1 ROB2 R1 3 3
R3 45 R3 ARF3 ROB3 ROB3 R3 4
R4 5 R4 ARF4 ROB4 5
R5 3 R5 ARF5 ROB5 6
R6 4 R6 ARF6 ROB6
R7 1 R7 ARF7
2 ARF8
Cycle: 3
R8 R8
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished

RS (Adder) RS (Mul/Div)
ADD ROB3 1 2 DIV ROB1 45 5
MUL ROB2 3 4
I E W C
ARF RAT ROB
1 1 2
R1 -23 R1 ROB2 ROB1 R2 2 2 3
R2 16 R2 ROB1 ROB2 R1 3 3 4
R3 45 R3 ROB3 ROB3 R3 4
R4 5 R4 ARF4 ROB4 5
R5 3 R5 ARF5 ROB5 6
R6 4 R6 ARF6 ROB6
R7 1 R7 ARF7
2 ARF8
Cycle: 4
R8 R8
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished

RS (Adder) RS (Mul/Div)
ADD ROB3 1 2 DIV ROB1 45 5
MUL ROB2 3 4
I E W C
ARF RAT ROB
1 1 2
R1 -23 R1 ROB2 ROB1 R2 2 2 3
R2 16 R2 ROB1 ROB2 R1 3 3 4 6
R3 45 R3 ROB3 ROB3 R3 3 Y 4
R4 5 R4 ARF4 ROB4 5
R5 3 R5 ARF5 ROB5 6
R6 4 R6 ARF6 ROB6
R7 1 R7 ARF7
2 ARF8
Cycle: 6
R8 R8
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished

RS (Adder) RS (Mul/Div)
DIV ROB1 45 5
MUL ROB2 3 4
I E W C
ARF RAT ROB
1 1 2
R1 -23 R1 ROB2 ROB1 R2 2 2 3 13
R2 16 R2 ROB1 ROB2 R1 12 Y 3 3 4 6
R3 45 R3 ROB3 ROB3 R3 3 Y 4
R4 5 R4 ARF4 ROB4 5
R5 3 R5 ARF5 ROB5 6
R6 4 R6 ARF6 ROB6
R7 1 R7 ARF7
2 ARF8
Cycle: 13
R8 R8
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished

RS (Adder) RS (Mul/Div)
DIV ROB1 45 5
MUL ROB4 12 3
I E W C
ARF RAT ROB
1 1 2
R1 -23 R1 ROB2 ROB4 ROB1 R2 2 2 3 13
R2 16 R2 ROB1 ROB2 R1 12 Y 3 3 4 6
R3 45 R3 ROB3 ROB3 R3 3 Y 4 14
R4 5 R4 ARF4 ROB4 R1 5
R5 3 R5 ARF5 ROB5 6
R6 4 R6 ARF6 ROB6
R7 1 R7 ARF7
2 ARF8
Cycle: 14
R8 R8
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished

RS (Adder) RS (Mul/Div)
SUB ROB5 ROB4 3 DIV ROB1 45 5
MUL ROB4 12 3
I E W C
ARF RAT ROB
1 1 2
R1 -23 R1 ROB4 ROB1 R2 2 2 3 13
R2 16 R2 ROB1 ROB2 R1 12 Y 3 3 4 6
R3 45 R3 ROB3 ROB3 R3 3 Y 4 14 15
R4 5 R4 ARF4 ROB5 ROB4 R1 5 15
R5 3 R5 ARF5 ROB5 R4 6
R6 4 R6 ARF6 ROB6
R7 1 R7 ARF7
2 ARF8
Cycle: 15
R8 R8
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished

RS (Adder) RS (Mul/Div)
SUB ROB5 ROB4 3 DIV ROB1 45 5
ADD ROB6 ROB5 ROB1 MUL ROB4 12 3
I E W C
ARF RAT ROB
1 1 2
R1 -23 R1 ROB4 ROB6 ROB1 R2 2 2 3 13
R2 16 R2 ROB1 ROB2 R1 12 Y 3 3 4 6
R3 45 R3 ROB3 ROB3 R3 3 Y 4 14 15
R4 5 R4 ROB5 ROB4 R1 5 15
R5 3 R5 ARF5 ROB5 R4 6 16
R6 4 R6 ARF6 ROB6 R1
R7 1 R7 ARF7
2 ARF8
Cycle: 16
R8 R8
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished

RS (Adder) RS (Mul/Div)
SUB ROB5 ROB4 3 DIV ROB1 45 5
ADD ROB6 ROB5 ROB1 MUL ROB4 12 3
I E W C
ARF RAT ROB
1 1 2
R1 -23 R1 ROB6 ROB1 R2 2 2 3 13
R2 16 R2 ROB1 ROB2 R1 12 Y 3 3 4 6
R3 45 R3 ROB3 ROB3 R3 3 Y 4 14 15
R4 5 R4 ROB5 ROB4 R1 5 15
R5 3 R5 ARF5 ROB5 R4 6 16
R6 4 R6 ARF6 ROB6 R1
R7 1 R7 ARF7
2 ARF8
Cycle:
R8 R8
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished

RS (Adder) RS (Mul/Div)
SUB ROB5 36 3 DIV ROB1 45 5
ADD ROB6 ROB5 ROB1

I E W C
ARF RAT ROB
1 1 2
R1 -23 R1 ROB6 ROB1 R2 2 2 3 13
R2 16 R2 ROB1 ROB2 R1 12 Y 3 3 4 6
R3 45 R3 ROB3 ROB3 R3 3 Y 4 14 15 25
R4 5 R4 ROB5 ROB4 R1 36 Y 5 15
R5 3 R5 ARF5 ROB5 R4 6 16
R6 4 R6 ARF6 ROB6 R1
R7 1 R7 ARF7
2 ARF8
Cycle: 26
R8 R8
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished

RS (Adder) RS (Mul/Div)
SUB ROB5 36 3 DIV ROB1 45 5
ADD ROB6 ROB5 ROB1

I E W C
ARF RAT ROB
1 1 2
R1 -23 R1 ROB6 ROB1 R2 2 2 3 13
R2 16 R2 ROB1 ROB2 R1 12 Y 3 3 4 6
R3 45 R3 ROB3 ROB3 R3 3 Y 4 14 15 25
R4 5 R4 ROB5 ROB4 R1 36 Y 5 15 26
R5 3 R5 ARF5 ROB5 R4 6 16
R6 4 R6 ARF6 ROB6 R1
R7 1 R7 ARF7
2 ARF8
Cycle: 28
R8 R8
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished

RS (Adder) RS (Mul/Div)
ADD ROB6 ROB1 33 DIV ROB1 45 5

I E W C
ARF RAT ROB
1 1 2
R1 -23 R1 ROB6 ROB1 R2 2 2 3 13
R2 16 R2 ROB1 ROB2 R1 12 Y 3 3 4 6
R3 45 R3 ROB3 ROB3 R3 3 Y 4 14 15 25
R4 5 R4 ROB5 ROB4 R1 36 Y 5 15 26 28
R5 3 R5 ARF5 ROB5 R4 33 Y 6 16
R6 4 R6 ARF6 ROB6 R1
R7 1 R7 ARF7
2 ARF8
Cycle:
R8 R8
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished

RS (Adder) RS (Mul/Div)
ADD ROB6 33 9

I E W C
ARF RAT ROB
1 1 2 42
R1 -23 R1 ROB6 ROB1 R2 9 Y 2 2 3 13
R2 16 R2 ROB1 ROB2 R1 12 Y 3 3 4 6
R3 45 R3 ROB3 ROB3 R3 3 Y 4 14 15 25
R4 5 R4 ROB5 ROB4 R1 36 Y 5 15 26 28
R5 3 R5 ARF5 ROB5 R4 33 Y 6 16
R6 4 R6 ARF6 ROB6 R1
R7 1 R7 ARF7
2 ARF8
Cycle: 43
R8 R8
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished

RS (Adder) RS (Mul/Div)
ADD ROB6 33 9

I E W C
ARF RAT ROB
1 1 2 42 43
R1 -23 R1 ROB6 ROB1 2 2 3 13
R2 9 R2 ARF2 ROB2 R1 12 Y 3 3 4 6
R3 45 R3 ROB3 ROB3 R3 3 Y 4 14 15 25
R4 5 R4 ROB5 ROB4 R1 36 Y 5 15 26 28
R5 3 R5 ARF5 ROB5 R4 33 Y 6 16 43
R6 4 R6 ARF6 ROB6 R1
R7 1 R7 ARF7
2 ARF8
Cycle: 44
R8 R8
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished

RS (Adder) RS (Mul/Div)
ADD ROB6 33 9

I E W C
ARF RAT ROB
1 1 2 42 43
R1 12 R1 ROB6 ROB1 2 2 3 13 44
R2 9 R2 ARF2 ROB2 3 3 4 6
R3 45 R3 ROB3 ROB3 R3 3 Y 4 14 15 25
R4 5 R4 ROB5 ROB4 R1 36 Y 5 15 26 28
R5 3 R5 ARF5 ROB5 R4 33 Y 6 16 43
R6 4 R6 ARF6 ROB6 R1
R7 1 R7 ARF7
2 ARF8
Cycle: 45
R8 R8
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished

RS (Adder) RS (Mul/Div)

I E W C
ARF RAT ROB
1 1 2 42 43
R1 12 R1 ROB6 ROB1 2 2 3 13 44
R2 9 R2 ARF2 ROB2 3 3 4 6 45
R3 3 R3 ARF3 ROB3 4 14 15 25
R4 5 R4 ROB5 ROB4 R1 36 Y 5 15 26 28
R5 3 R5 ARF5 ROB5 R4 33 Y 6 16 43 45
R6 4 R6 ARF6 ROB6 R1 42 Y
R7 1 R7 ARF7
2 ARF8
Cycle: 46
R8 R8
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished

RS (Adder) RS (Mul/Div)

I E W C
ARF RAT ROB
1 1 2 42 43
R1 36 R1 ROB6 ROB1 2 2 3 13 44
R2 9 R2 ARF2 ROB2 3 3 4 6 45
R3 3 R3 ARF3 ROB3 4 14 15 25 46
R4 5 R4 ROB5 ROB4 5 15 26 28
R5 3 R5 ARF5 ROB5 R4 33 Y 6 16 43 45
R6 4 R6 ARF6 ROB6 R1 42 Y
R7 1 R7 ARF7
2 ARF8
Cycle: 47
R8 R8
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished

RS (Adder) RS (Mul/Div)

I E W C
ARF RAT ROB
1 1 2 42 43
R1 36 R1 ROB6 ROB1 2 2 3 13 44
R2 9 R2 ARF2 ROB2 3 3 4 6 45
R3 3 R3 ARF3 ROB3 4 14 15 25 46
R4 33 R4 ARF4 ROB4 5 15 26 28 47
R5 3 R5 ARF5 ROB5 6 16 43 45
R6 4 R6 ARF6 ROB6 R1 42 Y
R7 1 R7 ARF7
2 ARF8
Cycle: 48
R8 R8
Task
In a processor having Tomasulo-based dynamic
scheduling and speculation (using ROB) feature,
determine the sequence of clock cycles in which the
write-back and commit operations are performed for the
following instruction sequence.

ADD.D F8, F2, F6


MULT.D F10, F0, F6
SUB.D F6, F8, F2
MULT.D F6, F10, F6
Answer

Instr sequence         Issue  Exec begin  Exec end  Write  Commit
1 ADD.D F8, F2, F6       1        2          3        4       5
2 MULT.D F10, F0, F6     2        3         12       13      14
3 SUB.D F6, F8, F2       3        5          6        7      15
4 MULT.D F6, F10, F6     4       14         23       24      25

7. Exploiting ILP Using Multiple Issue
and Static Scheduling
Taking Advantage of More ILP with
Multiple Issues
– How can we reduce the CPI to less than one?
• Multiple issues
– Allow multiple instructions to issue in a clock cycle.
– Multiple-issue processors:
• Superscalar processor
– Issue varying number of instructions per clock cycle and may be
either statically scheduled by a compiler or dynamically
scheduled using techniques based on scoreboarding and
Tomasulo’s algorithm.
• Very long instruction word (VLIW) processor
– Issues a fixed number of instructions formatted either as one
large instruction or as a fixed instruction packet. Inherently
scheduled by a compiler.
Statically Versus Dynamically
Scheduled Superscalar Processors
• Statically scheduled superscalar
– Instructions are issued in order and are executed in
order
– All pipeline hazards are checked at issue time
– Dynamically issue a varying number of instructions
• Dynamically scheduled superscalar
– Allow out-of-order execution.
Statically Scheduled Superscalar
• E.g., four-issue static superscalar
– 4 instructions make one issue packet
– Fetch examines each instruction in the packet in
the program order
• An instruction cannot be issued if it would cause a structural or data hazard, either due to an instruction earlier in the issue packet or due to an instruction already in execution
• Can issue from 0 to 4 instructions per clock cycle
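The in-order packet check above can be sketched as a short scan. This is a toy model (the packet format and the hazard rule are assumptions for illustration): issue walks the packet in program order and stops at the first hazard, so anywhere from 0 to 4 instructions issue per cycle.

```python
# Hypothetical sketch: in-order issue of an issue packet.
# Issue stops at the first instruction that hits a hazard; everything
# after it in the packet waits.
def issue_packet(packet, has_hazard):
    issued = []
    for inst in packet:              # examine in program order
        if has_hazard(inst, issued):
            break                    # stall this and all later instructions
        issued.append(inst)
    return issued

# toy hazard rule (assumption): an instruction stalls if it reads a
# register written by an earlier instruction issued this same cycle
def raw_hazard(inst, issued):
    return any(inst["src"] & {i["dst"]} for i in issued)

packet = [
    {"op": "LD",   "dst": "F0", "src": set()},
    {"op": "ADDD", "dst": "F4", "src": {"F0", "F2"}},  # depends on the LD above
    {"op": "SUBI", "dst": "R1", "src": {"R1"}},
]
print([i["op"] for i in issue_packet(packet, raw_hazard)])  # only "LD" issues
```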

Superscalar MIPS
• Superscalar MIPS: 2 instructions, 1 FP & 1 anything else
– Fetch 64-bits/clock cycle; Int on left, FP on right
– Can only issue 2nd instruction if 1st instruction issues
– If the second instruction depends on the first, it cannot be issued.
– More ports for FP registers to do FP load & FP op in a pair
Time [clocks] →
Int  IF ID Ex Mem WB
FP   IF ID Ex Mem WB
Int     IF ID Ex Mem WB
FP      IF ID Ex Mem WB
Int        IF ID Ex Mem WB
FP         IF ID Ex Mem WB
Note: FP operations extend the EX cycle
Intervening Latency assumptions in
our example

Latency assumptions for the examples :

Instruction producing result Instruction using result Latency


FP ALU op FP ALU op 3
FP ALU op SD 2
LD FP ALU op 1
LD SD 0
Int ALU op Int ALU op 0
Loop Unrolling in Superscalar: Example
• Simple loop:
for (i=1; i<=1000; i++)
    x[i] = x[i] + s;

• Equivalent MIPS code:
;R1 points to the last element in the array
;for simplicity, we assume that x[0] is at the address 0
Loop: LD F0, 0(R1)    ;F0=array el.
      ADDD F4,F0,F2   ;add scalar in F2
      SD 0(R1),F4     ;store result
      SUBI R1,R1,#8   ;decrement pointer
      BNEZ R1, Loop   ;branch
Loop Unrolling in Superscalar
1. Loop: LD F0, 0(R1)
2. Stall
3. ADDD F4,F0,F2
4. Stall
5. Stall
6. SD 0(R1),F4
7. SUBI R1,R1,#8
8. Stall
9. BNEZ R1, Loop
10. Stall
→ 10 clocks per iteration (5 stalls)
→ Rewrite code to minimize stalls?

Instruction producing result Instruction using result Latency


FP ALU op FP ALU op 3
FP ALU op SD 2
LD FP ALU op 1
LD SD 0
Int ALU op Int ALU op 0
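The 10-clock count can be reproduced from the latency table. This is a hedged sketch: the table covers the two data stalls (LD→ADDD and ADDD→SD), while the stall before the branch and the branch delay slot are added as separate assumptions, since the table does not list them.

```python
# Hedged sketch: count the stall cycles in the original loop body using
# the latency table above.
latency = {("FP", "FP"): 3, ("FP", "SD"): 2, ("LD", "FP"): 1,
           ("LD", "SD"): 0, ("INT", "INT"): 0}

# (producer class, consumer class) pairs in issue order for the loop body
pairs = [("LD", "FP"),   # LD F0  -> ADDD
         ("FP", "SD")]   # ADDD   -> SD
data_stalls = sum(latency[p] for p in pairs)      # 1 + 2 = 3
branch_stalls = 2   # assumption: SUBI->BNEZ stall + 1 branch delay slot
instructions = 5
print(instructions + data_stalls + branch_stalls)  # 10 clocks per iteration
```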
Loop Unrolling in Superscalar
Integer Instr.          FP Instr.
1  Loop: LD F0,0(R1)
2  LD F6,-8(R1)
3  LD F10,-16(R1)       ADDD F4,F0,F2
4  LD F14,-24(R1)       ADDD F8,F6,F2
5  LD F18,-32(R1)       ADDD F12,F10,F2
6  SD 0(R1),F4          ADDD F16,F14,F2
7  SD -8(R1),F8         ADDD F20,F18,F2
8  SD -16(R1),F12
9  SUBI R1,R1,#40
10 SD -24(R1),F16
11 BNEZ R1,Loop
12 SD 8(R1),F20
Unrolled 5 times to avoid delays. This loop runs in 12 cycles (no stalls) per iteration, or 12/5 = 2.4 cycles for each element of the array.
The Basic VLIW Approach
• VLIW uses multiple, independent functional units
• VLIWs package the multiple operations into one very
long instruction
• Compiler is responsible to choose instructions to be
issued simultaneously
• A VLIW instruction might include one integer/branch
instruction, two memory references, and two floating-
point operations.
– E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
– If each operation requires a 16 to 24 bits field, the length of
each VLIW instruction is of 112 to 168 bits.
Loop Unrolling in VLIW
Mem. Ref1 Mem Ref. 2 FP1 FP2 Int/Branch

1 LD F2,0(R1) LD F6,-8(R1)

2 LD F10,-16(R1) LD F14,-24(R1)

3 LD F18,-32(R1) LD F22,-40(R1) ADDD F4,F0,F2 ADDD F8,F0,F6

4 LD F26,-48(R1) ADDD F12,F0,F10 ADDD F16,F0,F14

5 ADDD F20,F0,F18 ADDD F24,F0,F22 SUBI R1,R1,#56

6 SD 0(R1),F4 SD -8(R1),F8 ADDD F28,F0,F26

7 SD -16(R1),F12 SD -24(R1),F16

8 SD 24(R1),F20 SD 16(R1),F24 BNEZ R1,Loop

9 SD 8(R1),F28

Unrolled 7 times to avoid delays
• 7 elements in 9 clocks, or ~1.3 clocks per element (1.8X over the superscalar version)
• Average: 2.5 ops per clock, 50% efficiency (23/45 operation slots used)
Note: Need more registers in VLIW
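The throughput numbers from the three schedules above fall out of simple division: clocks per array element for the original loop, the 5x-unrolled superscalar schedule, and the 7x-unrolled VLIW schedule.

```python
# Arithmetic from the schedules above: clocks per array element.
original = 10 / 1        # 10 clocks, 1 element
superscalar = 12 / 5     # 12 clocks, 5 elements -> 2.4
vliw = 9 / 7             # 9 clocks, 7 elements -> ~1.29
print(original, superscalar, round(vliw, 2))
print(round(superscalar / vliw, 2))  # ~1.87, the "1.8X" quoted above
```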
Limitations to VLIW Implementation
• Limitations
– Technical problem
• To generate enough straight-line code fragment requires ambitiously
unrolling loops, which increases code size.
– Poor code density
• Whenever the instructions are not full, the unused functional units translate
into wasted bits in the instruction encoding (only 60% full).
– Logistical problem
• Binary code compatibility; it depends on
– Instruction set definition,
– The detailed pipeline structure, including both functional units and their
latencies.
• Advantages of a superscalar processor over a VLIW processor
– Little impact on code density.
– Even unscheduled programs, or those compiled for older
implementations, can be run.
Multiple Issue Challenges
• While Integer/FP split is simple for the HW,
get CPI of 0.5 only for programs with:
– Exactly 50% FP operations
– No hazards
• If more instructions issue at same time,
greater difficulty of decode and issue
– Even 2-scalar => examine 2 opcodes, 6 register specifiers,
& decide if 1 or 2 instructions can issue
• VLIW: tradeoff instruction space for simple decoding
– The long instruction word has room for many operations
– By definition, all the operations the compiler puts in
the long instruction word are independent => execute in parallel
– E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
• 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
– Need compiling technique that schedules across several branches

8. Exploiting ILP Using Dynamic
Scheduling, Multiple Issue, and
Speculation
Example
• A loop to increment each element of the array

Loop: LD R2, 0(R1)      ;R2=array el.
      DADDIU R2,R2,#1   ;increment R2
      SD R2,0(R1)       ;store result
      DADDIU R1,R1,#8   ;increment pointer
      BNE R2,R3,Loop    ;branch if not last element

• Assume that there are separate integer functional units for address calculation, for ALU operations, and for branch condition evaluation.
• Up to 2 instructions of any type can commit per clock.
Comparison of execution with and without Speculation
(1)

• Fig. 3.33 on page 236


Comparison of with and without
Speculation (2)
• Fig. 3.34 on page 237
9. Advanced Techniques for
Instruction Delivery and Speculation
Advanced Techniques for
Instruction Delivery and Speculation

Techniques for increasing instruction fetch bandwidth:
– Branch-Target Buffers
– Return Address Predictors
– Integrated Instruction Fetch Units
Branch Target Buffer
– For a branch instruction, branch predictor can be used only
after decoding the instruction
– Hence, the outcome of branch is determined in ID stage
– In the following example code the branch outcome is
determined only at the end of 2nd clock cycle. (i.e branch
instruction being in the ID stage)
– Thus, we don’t know which (ie. target or fall through) instruction
to fetch in the 2nd clock cycle
– Hence we can’t fetch any instruction in the 2nd clock cycle

Example:
clock:               1   2   3
BNEZ R1, L1          IF  ID
DADDIU R1, R0, #1        IF  ?
……
L1: DADDIU R3, R1, #-1
Branch Target Buffer
• Thus, even for easily-computed branch targets, we need to wait until the instruction is decoded and its direction is predicted in the ID stage (still at least one cycle too late)
• For some branches (e.g., indirect branches) the target address is known only after the EX stage, which is way too late
• To reduce the branch penalty further, we need to determine, while the instruction is being fetched (in the IF stage):
– whether it is a branch
– if so, what the next PC should be (i.e., we need to know the target address, not just predict taken or not taken)
Branch Target Buffer (BTB)
• BTB is a cache that contains the predicted target PC
value for branches encountered so far
• Is the current instruction a branch ?
– When current instruction is being fetched, its PC is sent to BTB
– If this PC matches a BTB entry, it is certain that current instruction is a
branch and NextPC address can be obtained from BTB
– Thus, BTB provides the answer before the current instruction is
decoded and therefore enables fetching to begin after IF-stage

• What is the branch target ?


– The NextPC will be branch target if the prediction is a taken direct
branch (else PC+4 )
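The BTB lookup above can be sketched with a small map. This is a hypothetical toy model (addresses, entry fields, and a fixed 4-byte instruction size are assumptions): a hit on the fetch PC means the instruction is known to be a branch, and the stored target is used as NextPC; a miss, or a not-taken prediction, falls through to PC+4.

```python
# Hypothetical sketch of a BTB lookup in the IF stage.
btb = {0x4000: {"target": 0x4040, "taken": True}}   # filled by earlier branches

def next_pc(pc):
    entry = btb.get(pc)
    if entry and entry["taken"]:
        return entry["target"]       # predicted-taken branch: redirect fetch
    return pc + 4                    # not a known branch (or predicted fall-through)

print(hex(next_pc(0x4000)))  # 0x4040
print(hex(next_pc(0x4004)))  # 0x4008
```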
A Branch-Target Buffer
Steps involved in handling an instruction with a BTB
Integrated Instruction Fetch Unit
 Separate out IF from the pipeline and integrate it with the following components, so the pipeline consists of the Issue, EX and WB stages (Tomasulo):
1. Integrated branch prediction – the branch predictor is part of the IFU.
2. Instruction prefetch – fetch instructions from the IM ahead of the PC with the help of the branch predictor and store them in a prefetch buffer.
3. Instruction memory access and buffering – keep filling the instruction queue independent of execution => decoupled execution
Return Address Stack (RAS)
• Function returns are frequent, yet
– the return address is difficult to compute
• have to wait until the return instruction completes the EX stage
– the address is difficult to predict with a BTB
• a function can be called from multiple places
• But the return address is actually easy to predict
– it is the address after the last call instruction that we haven’t returned from yet
• Hence the Return Address Stack
Return Address Stack (RAS)
• A call pushes its return address onto the RAS
– here, pushing the return address onto the stack is treated as the prediction
• When a return instruction is decoded, pop the predicted return address from the RAS
• Accurate prediction even with a small RAS
End of
Unit II
Case Study: Intel Pentium 4
• The Pentium 4 processor is Intel's new
microprocessor that was introduced in
November of 2000
• Implements a new Intel NetBurst microarchitecture
• Supports multithreading
• Supports Hyper-Threading (HT) Technology
• The Front-end-decoder
– Translates IA-32 instruction to a series of micro
operations (uops)
– uops are executed by a dynamically scheduled
speculative pipeline
– Up to 3 simple IA-32 instructions may be decoded
& translated every cycle ; generating up to 6 uops
– If a complex instruction requires more than 3
uops, uop is generated from microcode ROM
• Execution Trace cache
– Holds sequences of uops
– Uops are fetched from the trace cache for execution
by the pipeline
• Avoids the need to redecode IA-32 instructions whenever
trace cache hits
– Exploits temporal sequencing of instruction execution
– On trace cache misses, instructions are
• fetched from L2 cache
• Decoded and trace cache is refilled with uops
– Execution trace cache has its own BTB
• Predicts the outcome of uop branches
– The execution trace cache has a high hit rate
• 85% for SPEC CPUINT
• Hence the fetch and decode unit is rarely used
The pipeline:
– The Out-of-order speculative pipeline uses
register renaming (not reorder buffer)
– Up to 3 uops per clock can be renamed and dispatched to functional unit queues
– the functional unit queue has 4 dispatch ports-
one each for
• load,
• store,
• basic ALU operations,
• FP and integer operations
– Up to 3 uops can be committed per clock
• Has Deeper pipeline – (20 pipeline stages) to achieve high
clock rate (3.2 GHz in 2002)
– A simple instruction takes 31 clock cycles to go from fetch to retire
– A two level cache is used to reduce the miss
penalty
– A front end BTB is used when the execution trace
cache misses
– Uses a two level predictor with both global and local branch
histories
Intel NetBurst Microarchitecture
[Block diagram slide of the NetBurst pipeline organization]
New Intel Microarchitecture
[Block diagram slide: instruction fetch and predecode,
instruction queue, decode (with uCode ROM), rename/alloc,
retirement unit (reorder buffer), schedulers feeding ALU,
branch, FAdd, FMul, MMX/SSE, FPmove, load and store units;
L1 D-cache and D-TLB; 2M/4M shared L2 cache; up to
10.4 Gb/s FSB]
Limits to ILP
Initial HW Model here; MIPS compilers.
Assumptions for ideal/perfect machine to start:
1. Register renaming – infinite virtual registers
=> all register WAW & WAR hazards are
avoided
2. Branch prediction – perfect; no
mispredictions
3. Jump prediction – all jumps perfectly
predicted
2 & 3 => machine with perfect speculation &
an unbounded buffer of instructions available
4. Memory-address alias analysis – addresses are
known & a store can be moved before a load
provided the addresses are not equal:
– SD cannot be moved before LD here (same address):
LD F0, 0(R1)
SD F4, 0(R1)
– SD can be moved before LD here (different addresses):
LD F0, 0(R1)
SD F4, 0(R2)
Also: unlimited number of instructions issued per clock
cycle; perfect caches;
1-cycle latency for all instructions (including FP * and /)
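Assumption 4 boils down to a single address comparison; the sketch below shows the rule with made-up register values (illustrative only):

```python
def can_hoist_store_above_load(load_addr, store_addr):
    """Perfect alias analysis: a later store may be moved above an
    earlier load only when the two addresses are known to differ."""
    return load_addr != store_addr

r1, r2 = 0x1000, 0x2000      # hypothetical register contents

# LD F0, 0(R1) then SD F4, 0(R1): same address -> cannot reorder
print(can_hoist_store_above_load(r1 + 0, r1 + 0))   # False

# LD F0, 0(R1) then SD F4, 0(R2): different addresses -> can reorder
print(can_hoist_store_above_load(r1 + 0, r2 + 0))   # True
```

The "perfect" in the ideal-machine model means this comparison is always resolvable at schedule time, which real hardware and compilers can only approximate.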
Upper Limit to ILP: Ideal Machine
[Bar chart: instruction issues per cycle on the ideal machine]
– gcc: 54.8   espresso: 62.6   li: 17.9
– fpppp: 75.2   doducd: 118.7   tomcatv: 150.1
– FP programs: 75 – 150 IPC; integer programs: 18 – 60 IPC
How is this data generated?
Window impact
• Window – set of instructions that is examined for
simultaneous execution
• No. of comparisons required every clock cycle =
completion rate × window size × no. of operands
• (ex: 6 × 200 × 2 = 2400)
• Window size is limited by the required storage,
comparisons, and the limited issue rate
– If the issue rate is low, a larger window is of little use
– A window larger than the issue rate helps when there are
dependences and cache misses
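The comparison-count formula can be checked directly; the 6/200/2 figures are the slide's own example:

```python
def comparisons_per_cycle(completion_rate, window_size, operands_per_instr=2):
    """Tag comparisons a wakeup network must make each cycle:
    every completing result is compared against every operand
    of every instruction waiting in the window."""
    return completion_rate * window_size * operands_per_instr

# The slide's example: 6 completions/cycle, 200-entry window, 2 operands
print(comparisons_per_cycle(6, 200))       # 2400
# A more modest hypothetical machine: 3 completions, 32-entry window
print(comparisons_per_cycle(3, 32))        # 192
```

The linear growth in window size is what makes very large windows expensive: doubling the window doubles the wakeup comparisons every cycle.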
Contd..
• No. of instructions that may issue, begin
execution, or commit in a cycle is usually
smaller than the window size
Window Impact
[Bar chart: instruction issues per cycle vs. window size
(Infinite, 256, 128, 64, 32, 16, 8, 4) for gcc, espresso,
li, fpppp, doducd, tomcatv]
– FP programs: 8 – 45 IPC; integer programs: 6 – 12 IPC
– IPC falls steadily as the window shrinks; with a window
of 4 every program manages only a few issues per cycle
Contd..
• With a small window, loop-carried dependences
cannot be exploited.
• Hence there is little performance difference between
integer and floating point programs (ref the plot)
• Better compilers can do loop unrolling and
scheduling; even with a small window size it is
possible to get better performance
Branch Impact
Assumption: FP: 15 - 45
61 60
60 window size = 2K 58

IPC = 64
50 48
46 46 45 45
45

41
40
35

Integer: 6 - 12
30 29
IPC

20 19
16
15
13 14
12
10
10 9
6 7 6 6 7
6
4
2 2 2

0
gcc espresso li fpppp doducd tomcatv
Program

Perfect Selective predictor Standard 2-bit Static None

Perfect Tournament BHT (512) Profile No prediction


Contd..
• Integer programs and doduc : have more
complex branches
– Difficult to predict
– Hence less parallelism
• Fpppp and tomcatv : have fewer branches
and are simple
– Easy to predict
– Hence more parallelism
Renaming Register Impact
[Bar chart: IPC for gcc, espresso, li, fpppp, doducd, tomcatv
with Infinite, 256, 128, 64, 32, and no extra renaming registers]
– Assumptions: window size = 2K, issue rate = 64,
branch predictor = 8K-entry 2-level predictor
– FP programs: 11 – 45 IPC; integer programs: 5 – 15 IPC
Contd..
• From the graph: the impact of the number of registers is
significant only if extensive parallelism exists
• Large impact on FP programs
• Small impact on integer programs
– Less parallelism
– Moreover, limits on window size and branch
prediction limit ILP (making little use of renaming)
Memory Address Alias Impact
– about analyzing memory & register name dependences
[Bar chart: IPC for gcc, espresso, li, fpppp, doducd, tomcatv
under four alias-analysis models: Perfect, Global/stack perfect
(heap references conflict), Inspection, and None]
– Assumptions: window size = 2K, issue rate = 64,
branch predictor = 8K-entry 2-level predictor,
256 renaming registers
– FP programs: 4 – 45 IPC (Fortran, no heap);
integer programs: 4 – 9 IPC
Contd..
The three alias analysis models:
1. Global/stack perfect
– Perfect predictions for global and stack references
– Heap references conflict
– This is the best compiler based analysis scheme
2. Inspection
– Examine the accesses at compile time to check if
they interfere
– Used in commercial compilers
3. None : all memory references are assumed
to conflict
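A compile-time "inspection" test might look like the sketch below, assuming references are expressed as a base register plus a constant offset (a simplification; real compilers inspect much richer access patterns):

```python
def may_alias(ref_a, ref_b):
    """Inspection-style alias test on references of the form
    (base_register, constant_offset). Returns False only when
    the accesses provably cannot overlap."""
    base_a, off_a = ref_a
    base_b, off_b = ref_b
    if base_a == base_b:
        # Same base register: distinct constant offsets cannot overlap
        # (assuming aligned accesses no wider than the offset gap).
        return off_a == off_b
    return True  # different bases: conservatively assume a conflict

print(may_alias(("R1", 0), ("R1", 8)))   # False: provably disjoint
print(may_alias(("R1", 0), ("R1", 0)))   # True: same address
print(may_alias(("R1", 0), ("R2", 0)))   # True: unknown, assume conflict
```

The last case shows why inspection sits between the "perfect" and "none" models: whenever it cannot prove independence, it must fall back to assuming a conflict.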
From the graph:
• No difference between perfect and global/stack
analysis
– Because no heap references exist in Fortran
• In practice, dynamically scheduled processors
rely on dynamic memory disambiguation
Simultaneous Multi Threading
Haowen Chan
Vladislav Shkapenyuk
Superscalar Processors
• Issues multiple instructions in each cycle. Typically 4.
• Several functional units of the same type, e.g. ALUs
• Dispatcher reads instructions, decides which can run
in parallel
• Limited by instruction dependencies and long-
latency operations
• Results in horizontal & vertical waste
• Low utilization even with higher-issue machines: an
8-issue machine runs at roughly 20% utilization
Vertical Waste & Horizontal Waste
• Vertical waste is introduced when
the processor issues no
instructions in a cycle
• Horizontal waste is introduced
when not all issue slots can be
filled in a cycle
• 61% of the wasted cycles are
vertical waste
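The two kinds of waste can be counted from an issue trace; the sketch below uses made-up cycle counts for a hypothetical 4-wide machine:

```python
def issue_waste(trace, width):
    """Classify wasted issue slots in a superscalar trace.
    `trace` lists how many instructions issued each cycle on a
    `width`-wide machine (illustrative numbers, not measured data)."""
    vertical = sum(width for n in trace if n == 0)            # empty cycles
    horizontal = sum(width - n for n in trace if 0 < n < width)
    used = sum(trace)
    total = width * len(trace)
    assert used + vertical + horizontal == total
    return vertical, horizontal, used / total

# 4-wide machine over 6 cycles: two idle cycles, partial issue elsewhere
v, h, util = issue_waste([0, 2, 4, 0, 1, 3], width=4)
print(v, h, util)   # 8 vertical slots, 6 horizontal slots, ~41.7% utilization
```

Multithreading attacks vertical waste (an idle cycle can run another thread), while SMT also attacks horizontal waste by filling leftover slots from other threads in the same cycle.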
Superscalar Processors- 2 issue
Superscalar Processors
• Does well when there is lots
of ILP, but ILP in general
applications is limited due to
dependencies
• Cannot exploit TLP effectively
since only one context can
be served at any one time
Simultaneous Multi Threading : Idea
• Combine Superscalar and Multithreading
such that;
1. Issue multiple instructions per cycle – Superscalar
2. Hardware state for several programs/threads –
Multithreading
• So; issue multiple instructions from multiple
threads in each cycle
Simultaneous Multi Threading
• Best of both worlds – run
multiple threads
simultaneously on a
unified processor so that
resource utilization is
maximized
• Good for applications with
either good ILP or TLP
Implementation
• Processor is presented to OS as n logical processors,
hardware provides support for n independent contexts
• Context-supporting resources are either replicated n times or
shared
• Replicated Processor Resources
– Program counters
– Subroutine return stacks
– I-cache ports
– Instruction TLB
– Architectural registers
• Shared Processor Resources
– Functional units
– Caches
– Physical registers
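The replicated-versus-shared split can be sketched as a toy data model (names and sizes here are illustrative, not any real core's parameters):

```python
from dataclasses import dataclass, field

@dataclass
class HardwareContext:
    """Per-thread state an SMT core must replicate."""
    pc: int = 0
    return_stack: list = field(default_factory=list)   # subroutine returns
    arch_regs: dict = field(default_factory=dict)      # architectural regs

@dataclass
class SMTCore:
    """Shared resources are held once; contexts are replicated n times."""
    n_contexts: int = 2
    functional_units: int = 6      # shared among all threads
    physical_regs: int = 256       # shared renaming pool

    def __post_init__(self):
        # The OS sees n logical processors, one per hardware context.
        self.contexts = [HardwareContext() for _ in range(self.n_contexts)]

core = SMTCore(n_contexts=2)
print(len(core.contexts))   # 2 logical processors presented to the OS
```

Only the small per-thread state is duplicated; the expensive resources (functional units, caches, physical registers) stay shared, which is why SMT adds little die area.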
Benefits
• Higher resource utilization => more speedup
• Good performance over wider range of
applications
• Unified cache => No cache coherency
overhead
Tradeoffs
• Register file is the limiting factor in SMT
machines
• Larger register file – higher cost and slower
access time (may increase the cycle time)
• Memory bandwidth requirement is more
• Contention- threads tend to have the same
resource needs at the same time
Hyper-Threading Technology (HTT)
• Hyper-Threading Technology is Intel’s
trademark for their implementation of the
simultaneous multithreading technology
• The Intel Xeon and 3 GHz Pentium 4 implement a
2-context SMT processor
• Intel reports only 5% added to the die area
when SMT was added to Xeon, with an
approximate 30% gain in performance
