Instruction Level Parallelism and Its Exploitation: Unit Ii by Raju K, Cse Dept
Dependences
• Name dependences
• Data dependences
• Control dependences
Data hazards
What is ILP?
• ILP (Instruction Level Parallelism) –
overlap execution of unrelated instructions
• 3 types of dependences
– Data dependences
– Name dependences
– Control dependences
Data Dependences
• Also called true data dependence
• An instruction j is data dependent on instruction i if either of the following holds:
– Instruction i produces a result that may be used by instruction j, or
– Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (a chain of dependences)
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Read After Write (RAW) hazard :
I: add r1,r2,r3
J: sub r4,r1,r3
Write After Read (WAR) hazard :
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• But a WAR hazard occurs when instructions are reordered, i.e. when instruction J writes r1 before instruction I has read it
Data Hazards
Write After Write (WAW) hazard :
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• In this example, it results from the use of the name “r1” as the destination of both instructions I and J
• Can’t happen in the MIPS 5-stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
• But a WAW hazard occurs when instructions are reordered, i.e. when instruction J's write completes before instruction I's write
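The three hazard classes can be checked mechanically from the register names. The following is a minimal Python sketch, not part of the original slides; the `Instr` type and the `parse` and `hazards` helpers are hypothetical names, and the parser assumes the three-operand "op dest,src1,src2" form used in the examples above.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    """A three-operand register instruction: op dest,src1,src2."""
    op: str
    dest: str
    srcs: Tuple[str, ...]


def parse(text: str) -> Instr:
    """Parse text such as 'add r1,r2,r3' (destination register first)."""
    op, operands = text.split(None, 1)
    regs = [r.strip() for r in operands.split(",")]
    return Instr(op, regs[0], tuple(regs[1:]))


def hazards(i: Instr, j: Instr) -> Set[str]:
    """Hazards between i and a later instruction j (program order: i, then j)."""
    found = set()
    if i.dest in j.srcs:      # j reads what i writes: true data dependence
        found.add("RAW")
    if j.dest in i.srcs:      # j overwrites what i reads: antidependence
        found.add("WAR")
    if i.dest == j.dest:      # both write the same name: output dependence
        found.add("WAW")
    return found


raw = hazards(parse("add r1,r2,r3"), parse("sub r4,r1,r3"))
war = hazards(parse("sub r4,r1,r3"), parse("add r1,r2,r3"))
waw = hazards(parse("sub r1,r4,r3"), parse("add r1,r2,r3"))
```

Only RAW reflects a real flow of data; WAR and WAW come from reuse of the name r1, which is why register renaming can remove them.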
Control Dependences
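• Determines the ordering of an instruction, i, with respect to a branch instruction, so that i is executed in correct program order and only when it should be
• e.g. instri is control dependent on the branch on p1:
if p1 {
instri;
}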
Control Dependences : Example
if p1 {
S1;
};
if p2 {
S2;
}
• S1 is control dependent on p1
• S2 is control dependent on p2 but not on p1
Control Dependences
Two constraints imposed by control
dependences:
•An instruction that is control dependent on a branch
cannot be moved before the branch
– Eg. we cannot take an instruction from the then portion of an if
statement and move it before the if statement.
•An instruction that is not control dependent on a
branch cannot be moved after the branch
– Eg. we cannot take a statement before the if statement and move
it into the then portion.
How the Simple Pipeline Preserves Control
Dependence ?
Latencies
Pipeline latency - the number of stages from the
EX stage to the stage that produces the result
SD IF ID EX stall stall DM WB
Intervening Latencies
LD IF ID EX DM WB
LD IF ID EX DM WB
SD IF ID EX DM WB
Int Add/Sub IF ID EX DM WB
Int Add/Sub IF ID EX DM WB
Intervening Latencies
Int Add/Sub IF ID EX DM WB
Branch IF stall ID EX DM WB
Intervening Latency assumptions in
our example
Loop: LD F0,0(R1)
LD F6,-8(R1)
LD F10,-16(R1)
LD F14,-24(R1)
ADDD F4,F0,F2
ADDD F8,F6,F2
ADDD F12,F10,F2
ADDD F16,F14,F2
SD 0(R1),F4
SD -8(R1),F8
SUBI R1,R1,#32
SD 16(R1),F12 ; 16-32 = -16
BNEZ R1,Loop
SD 8(R1),F16 ; 8-32 = -24
This loop will run in 14 cycles (no stalls) per unrolled iteration, or 14/4 = 3.5 cycles for each original iteration!
Steps Compiler Performed to Unroll
• Determine that it is OK to move the SD after the SUBI and BNEZ, and
find the amount to adjust the SD offset
• Determine that unrolling the loop would be useful
by finding that the loop iterations were independent
• Rename registers to avoid name dependencies
• Eliminate extra test and branch instructions and adjust the loop
termination and iteration code
• Determine loads and stores in unrolled loop can be interchanged by
observing that the loads and stores from different iterations are
independent
– requires analyzing memory addresses and finding that they do not
refer to the same address.
• Schedule the code, preserving any dependences needed to yield
same result as the original code
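The effect of these steps can be illustrated at a higher level. The sketch below is an illustration in Python rather than the compiler's actual output: it assumes the loop adds a scalar to every array element while counting an index down to zero, and the function names are hypothetical. The unrolled version uses distinct temporaries (the register renaming step), keeps a single index update and exit test per four elements, and groups the loads, adds and stores as the scheduler would.

```python
from typing import List


def add_scalar_rolled(x: List[float], s: float) -> List[float]:
    """Original loop: one element, one index update, one branch per iteration."""
    out = list(x)
    i = len(out)                      # plays the role of R1 (in elements, not bytes)
    while i != 0:
        out[i - 1] = out[i - 1] + s
        i -= 1
    return out


def add_scalar_unrolled4(x: List[float], s: float) -> List[float]:
    """Unrolled by 4; assumes the element count is a multiple of 4."""
    if len(x) % 4 != 0:
        raise ValueError("element count must be a multiple of 4")
    out = list(x)
    i = len(out)
    while i != 0:
        # Four loads into distinct temporaries (renamed registers).
        t0, t1, t2, t3 = out[i - 1], out[i - 2], out[i - 3], out[i - 4]
        # Four independent adds.
        a0, a1, a2, a3 = t0 + s, t1 + s, t2 + s, t3 + s
        # Four stores; iterations touch different addresses, so order is free.
        out[i - 1], out[i - 2], out[i - 3], out[i - 4] = a0, a1, a2, a3
        i -= 4                        # single index update and test per 4 elements
    return out


data = [float(n) for n in range(8)]
same = add_scalar_rolled(data, 2.5) == add_scalar_unrolled4(data, 2.5)
```

Both versions produce the same array, which is the property the final scheduling step must preserve.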
Task 1: Surprise Test
• Show the following loop unrolled so that there
are two copies of the loop body, assuming the
number of loop iterations is a multiple of 2.
Eliminate any obviously redundant computations
and do not reuse any of the registers.
[Figure: Branch Instruction held in IR; an adder on PC produces the Branch Target; the fall-through address is PC + 4]
Branch Prediction using BHT of one bit
predictors
• Lower portion of PC indexes table of bits (0 = N, 1 = T)
• Essentially: branch will go same way it went last time
• Problem: consider inner loop branch below ( * indicates misprediction)
Outcome T T T N T T T N T T T N
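The cost of this behaviour can be counted with a small simulation. This is a sketch under stated assumptions: a single branch with its own table entry, an initial prediction of taken for the one-bit scheme, and a 2-bit saturating counter starting at strongly taken for comparison. The function names are hypothetical.

```python
from typing import Iterable


def mispredictions_1bit(outcomes: Iterable[str], start: str = "T") -> int:
    """One-bit predictor: predict whatever the branch did last time."""
    last = start
    misses = 0
    for outcome in outcomes:
        if outcome != last:
            misses += 1
        last = outcome                # the single bit records the latest outcome
    return misses


def mispredictions_2bit(outcomes: Iterable[str], counter: int = 3) -> int:
    """Two-bit saturating counter: states 0-1 predict N, states 2-3 predict T."""
    misses = 0
    for outcome in outcomes:
        predicted = "T" if counter >= 2 else "N"
        if predicted != outcome:
            misses += 1
        if outcome == "T":
            counter = min(counter + 1, 3)
        else:
            counter = max(counter - 1, 0)
    return misses


OUTCOMES = "T T T N T T T N T T T N".split()
one_bit_misses = mispredictions_1bit(OUTCOMES)
two_bit_misses = mispredictions_2bit(OUTCOMES)
```

Under these assumptions the one-bit predictor misses twice per loop exit (the exit itself and the first iteration of the next run), while the two-bit counter misses only on the exit.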
Prediction Accuracy of a 4K-entry 2-bit
Prediction Buffer
Correlated (two-level) predictor
• Branch predictors that use the behavior of other branches to make a
prediction are called correlating predictors or two-level predictors.
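A correlating predictor can be sketched as a table of 2-bit counters selected by both the branch address and a short global history of recent outcomes. The class below is an illustrative sketch, not taken from the slides: the table sizes, the `pc >> 2` indexing, and the initial counter value (weakly not taken) are all assumptions.

```python
from typing import Iterable


class CorrelatingPredictor:
    """Global history selects one of several 2-bit counters per branch entry."""

    def __init__(self, history_bits: int = 2, index_bits: int = 4) -> None:
        self.history_mask = (1 << history_bits) - 1
        self.index_mask = (1 << index_bits) - 1
        self.history = 0              # outcomes of the most recent branches
        # counters[entry][history] is a 2-bit saturating counter, initially 1.
        self.counters = [[1] * (1 << history_bits) for _ in range(1 << index_bits)]

    def _row(self, pc: int) -> list:
        return self.counters[(pc >> 2) & self.index_mask]

    def predict(self, pc: int) -> bool:
        return self._row(pc)[self.history] >= 2

    def update(self, pc: int, taken: bool) -> None:
        row = self._row(pc)
        value = row[self.history]
        row[self.history] = min(value + 1, 3) if taken else max(value - 1, 0)
        # Shift the new outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def count_mispredictions(predictor: CorrelatingPredictor, pc: int,
                         outcomes: Iterable[bool]) -> int:
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


alternating = [True, False] * 20
misses = count_mispredictions(CorrelatingPredictor(), 0x40, alternating)
```

On a strictly alternating branch, the history tells the predictor which half of the pattern comes next, so after a short warm-up it stops mispredicting; a single 2-bit counter without history cannot learn this pattern.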
[Figure: Branch Instruction held in IR; PC indexes the BHT, which returns T (predict taken); an adder produces the Branch Target]
Clock cycle
counter
Tomasulo Example Cycle 1
Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD    F6     34+  R2   1
LD    F2     45+  R3
MULTD F0     F2   F4
SUBD  F8     F6   F2
DIVD  F10    F0   F6
ADDD  F6     F8   F2
Load buffers:
Name   Busy  Address
Load1  Yes   34+R2
Load2  No
Load3  No
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No
Tomasulo Example Cycle 2
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No
Tomasulo Example Cycle 3
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes MULTD R(F4) Load2
Mult2 No
Tomasulo Example Cycle 4
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 Yes SUBD M(A1) Load2
Add2 No
Add3 No
Mult1 Yes MULTD R(F4) Load2
Mult2 No
Tomasulo Example Cycle 5
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
2 Add1 Yes SUBD M(A1) M(A2)
Add2 No
Add3 No
10 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 6
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
1 Add1 Yes SUBD M(A1) M(A2)
Add2 Yes ADDD M(A2) Add1
Add3 No
9 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 7
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
0 Add1 Yes SUBD M(A1) M(A2)
Add2 Yes ADDD M(A2) Add1
Add3 No
8 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 8
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
2 Add2 Yes ADDD (M-M) M(A2)
Add3 No
7 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 9
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
1 Add2 Yes ADDD (M-M) M(A2)
Add3 No
6 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 10
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
0 Add2 Yes ADDD (M-M) M(A2)
Add3 No
5 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 11
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
4 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 12
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
3 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 13
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
2 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 14
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
1 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 15
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
0 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 16
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
40 Mult2 Yes DIVD M*F4 M(A1)
Tomasulo Example Cycle 55
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
1 Mult2 Yes DIVD M*F4 M(A1)
Tomasulo Example Cycle 56
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
0 Mult2 Yes DIVD M*F4 M(A1)
Tomasulo Example Cycle 57
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 Yes DIVD M*F4 M(A1)
[Figure: Tomasulo organization with a reorder buffer: Instruction Buffers; Reservation Stations and ALUs for Add and Mult (entry fields: op, Qj, Qk, Vj, Vk); ROB with a “head” pointer (entry fields: type, dest, value, fin)]
Issue
• Check if resources available
– Appropriate RS entry
– ROB entry
• Read RAT, read available sources, update RAT
• Write to RS and ROB
• Stall issue if any needed resource not available
Exec
• Same as before
– Wait for all operands to arrive
– Compete to use functional unit
– Execute!
Write Result
• Broadcast result on CDB
– (any dependents will grab the value)
• Write result back to your ROB entry
– The ARF holds the “official” register state, which
we will only update in program order
– Mark ready/finished bit in ROB (note that this inst
has completed execution)
New: Commit
• When an inst is the oldest in the ROB
– i.e. ROB-head points to it
• Write result (if ready/finished bit is set)
– If register-producing instruction: write to
architected register file
– If store: write to memory
• Advance ROB-head to next instruction
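The commit rule can be sketched in a few lines of Python. This is an illustrative model rather than a description of any particular processor: the `RobEntry` fields follow the "type dest value fin" layout named in the slides, while the `commit` function, its `width` parameter, and the dictionary-based ARF and memory are assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Optional, Union


@dataclass
class RobEntry:
    type: str                         # "reg" for register-producing, "store" for stores
    dest: Union[str, int]             # register name, or memory address for a store
    value: Optional[int] = None
    finished: bool = False            # set in the Write Result stage


def commit(rob: Deque[RobEntry], arf: Dict[str, int], memory: Dict[int, int],
           width: int = 1) -> List[RobEntry]:
    """Retire up to `width` finished instructions from the ROB head, in order."""
    retired: List[RobEntry] = []
    while rob and rob[0].finished and len(retired) < width:
        entry = rob.popleft()         # advance ROB-head to the next instruction
        if entry.type == "store":
            memory[entry.dest] = entry.value
        else:
            arf[entry.dest] = entry.value
        retired.append(entry)
    return retired


arf = {"R1": -23, "R2": 16}
memory: Dict[int, int] = {}
rob = deque([RobEntry("reg", "R2"), RobEntry("reg", "R1", 12, True)])
blocked = commit(rob, arf, memory, width=2)   # head not finished: nothing retires
```

The younger instruction has finished, but it cannot update the ARF until the unfinished head commits; this is what keeps the architected state in program order.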
…
ROB entry, destination, operation:
ROB2 R3 R5-R6
ROB3 R1 ROB1*R7
ROB4 R1 R4+R8
ROB5 R2 R9+R3 (R3 renamed to ROB2)
ROB6
ROB7
ROB8
• If we issue R2=R9+R3 to the ROB now, R3 comes from ROB2
• However, if R3=R5-R6 commits first, then update the RAT so that when we issue R2=R9+R3, it reads its source from the ARF.
Example
Latencies: Add: 2 cycles, Mult: 10 cycles, Divide: 40 cycles
Inst Operands
DIV R2, R3, R4
MUL R1, R5, R6
ADD R3, R7, R8
MUL R1, R1, R3
SUB R4, R1, R5
ADD R1, R4, R2
Sequentially, this would take: 40+10+2+10+2+2 = 66 cycles (+ other pipeline stalls)
Initial register values:
R1 -23
R2 16
R3 45
R4 5
R5 3
R6 4
R7 1
R8 2
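The 66-cycle sequential figure, and the lower bound that out-of-order execution can approach, can both be computed from the latencies. The sketch below is a simplification: it counts execution latency only, treats each instruction as "op dest, src1, src2", assumes unlimited functional units, and ignores issue, write-back and commit overheads. The names `PROGRAM`, `sequential_cycles` and `dataflow_cycles` are hypothetical.

```python
from typing import Dict, List, Tuple

LATENCY = {"ADD": 2, "SUB": 2, "MUL": 10, "DIV": 40}

# (op, destination, sources) in program order.
PROGRAM: List[Tuple[str, str, Tuple[str, str]]] = [
    ("DIV", "R2", ("R3", "R4")),
    ("MUL", "R1", ("R5", "R6")),
    ("ADD", "R3", ("R7", "R8")),
    ("MUL", "R1", ("R1", "R3")),
    ("SUB", "R4", ("R1", "R5")),
    ("ADD", "R1", ("R4", "R2")),
]


def sequential_cycles(program) -> int:
    """Each instruction starts only after the previous one has finished."""
    return sum(LATENCY[op] for op, _, _ in program)


def dataflow_cycles(program) -> int:
    """Only true (RAW) dependences delay an instruction; names are renamed away."""
    ready: Dict[str, int] = {}        # register -> cycle its latest value is ready
    last_finish = 0
    for op, dest, srcs in program:
        start = max(ready.get(src, 0) for src in srcs)
        finish = start + LATENCY[op]
        ready[dest] = finish          # later readers depend on this newest writer
        last_finish = max(last_finish, finish)
    return last_finish


seq = sequential_cycles(PROGRAM)
crit = dataflow_cycles(PROGRAM)
```

Under these assumptions the critical path is the DIV feeding the final ADD, so the dataflow bound is far below the sequential total; a real pipeline adds issue, write and commit cycles on top of it.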
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished
RS (Adder) RS (Mul/Div)
DIV ROB1 45 5
I E W C
ARF RAT ROB
1 1
R1 -23 R1 ARF1 ROB1 R2 2
R2 16 R2 ARF2 ROB1 ROB2 3
R3 45 R3 ARF3 ROB3 4
R4 5 R4 ARF4 ROB4 5
R5 3 R5 ARF5 ROB5 6
R6 4 R6 ARF6 ROB6
R7 1 R7 ARF7
R8 2 R8 ARF8
Cycle: 1
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished
RS (Adder) RS (Mul/Div)
DIV ROB1 45 5
MUL ROB2 3 4
I E W C
ARF RAT ROB
1 1 2
R1 -23 R1 ARF1 ROB2 ROB1 R2 2 2
R2 16 R2 ROB1 ROB2 R1 3
R3 45 R3 ARF3 ROB3 4
R4 5 R4 ARF4 ROB4 5
R5 3 R5 ARF5 ROB5 6
R6 4 R6 ARF6 ROB6
R7 1 R7 ARF7
R8 2 R8 ARF8
Cycle: 2
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished
RS (Adder) RS (Mul/Div)
ADD ROB3 1 2 DIV ROB1 45 5
MUL ROB2 3 4
I E W C
ARF RAT ROB
1 1 2
R1 -23 R1 ROB2 ROB1 R2 2 2 3
R2 16 R2 ROB1 ROB2 R1 3 3
R3 45 R3 ARF3 ROB3 ROB3 R3 4
R4 5 R4 ARF4 ROB4 5
R5 3 R5 ARF5 ROB5 6
R6 4 R6 ARF6 ROB6
R7 1 R7 ARF7
R8 2 R8 ARF8
Cycle: 3
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished
RS (Adder) RS (Mul/Div)
ADD ROB3 1 2 DIV ROB1 45 5
MUL ROB2 3 4
I E W C
ARF RAT ROB
1 1 2
R1 -23 R1 ROB2 ROB1 R2 2 2 3
R2 16 R2 ROB1 ROB2 R1 3 3 4
R3 45 R3 ROB3 ROB3 R3 4
R4 5 R4 ARF4 ROB4 5
R5 3 R5 ARF5 ROB5 6
R6 4 R6 ARF6 ROB6
R7 1 R7 ARF7
R8 2 R8 ARF8
Cycle: 4
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished
RS (Adder) RS (Mul/Div)
ADD ROB3 1 2 DIV ROB1 45 5
MUL ROB2 3 4
I E W C
ARF RAT ROB
1 1 2
R1 -23 R1 ROB2 ROB1 R2 2 2 3
R2 16 R2 ROB1 ROB2 R1 3 3 4 6
R3 45 R3 ROB3 ROB3 R3 3 Y 4
R4 5 R4 ARF4 ROB4 5
R5 3 R5 ARF5 ROB5 6
R6 4 R6 ARF6 ROB6
R7 1 R7 ARF7
R8 2 R8 ARF8
Cycle: 6
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished
RS (Adder) RS (Mul/Div)
DIV ROB1 45 5
MUL ROB2 3 4
I E W C
ARF RAT ROB
1 1 2
R1 -23 R1 ROB2 ROB1 R2 2 2 3 13
R2 16 R2 ROB1 ROB2 R1 12 Y 3 3 4 6
R3 45 R3 ROB3 ROB3 R3 3 Y 4
R4 5 R4 ARF4 ROB4 5
R5 3 R5 ARF5 ROB5 6
R6 4 R6 ARF6 ROB6
R7 1 R7 ARF7
R8 2 R8 ARF8
Cycle: 13
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished
RS (Adder) RS (Mul/Div)
DIV ROB1 45 5
MUL ROB4 12 3
I E W C
ARF RAT ROB
1 1 2
R1 -23 R1 ROB2 ROB4 ROB1 R2 2 2 3 13
R2 16 R2 ROB1 ROB2 R1 12 Y 3 3 4 6
R3 45 R3 ROB3 ROB3 R3 3 Y 4 14
R4 5 R4 ARF4 ROB4 R1 5
R5 3 R5 ARF5 ROB5 6
R6 4 R6 ARF6 ROB6
R7 1 R7 ARF7
R8 2 R8 ARF8
Cycle: 14
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished
RS (Adder) RS (Mul/Div)
SUB ROB5 ROB4 3 DIV ROB1 45 5
MUL ROB4 12 3
I E W C
ARF RAT ROB
1 1 2
R1 -23 R1 ROB4 ROB1 R2 2 2 3 13
R2 16 R2 ROB1 ROB2 R1 12 Y 3 3 4 6
R3 45 R3 ROB3 ROB3 R3 3 Y 4 14 15
R4 5 R4 ARF4 ROB5 ROB4 R1 5 15
R5 3 R5 ARF5 ROB5 R4 6
R6 4 R6 ARF6 ROB6
R7 1 R7 ARF7
R8 2 R8 ARF8
Cycle: 15
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished
RS (Adder) RS (Mul/Div)
SUB ROB5 ROB4 3 DIV ROB1 45 5
ADD ROB6 ROB5 ROB1 MUL ROB4 12 3
I E W C
ARF RAT ROB
1 1 2
R1 -23 R1 ROB4 ROB6 ROB1 R2 2 2 3 13
R2 16 R2 ROB1 ROB2 R1 12 Y 3 3 4 6
R3 45 R3 ROB3 ROB3 R3 3 Y 4 14 15
R4 5 R4 ROB5 ROB4 R1 5 15
R5 3 R5 ARF5 ROB5 R4 6 16
R6 4 R6 ARF6 ROB6 R1
R7 1 R7 ARF7
R8 2 R8 ARF8
Cycle: 16
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished
RS (Adder) RS (Mul/Div)
SUB ROB5 ROB4 3 DIV ROB1 45 5
ADD ROB6 ROB5 ROB1 MUL ROB4 12 3
I E W C
ARF RAT ROB
1 1 2
R1 -23 R1 ROB6 ROB1 R2 2 2 3 13
R2 16 R2 ROB1 ROB2 R1 12 Y 3 3 4 6
R3 45 R3 ROB3 ROB3 R3 3 Y 4 14 15
R4 5 R4 ROB5 ROB4 R1 5 15
R5 3 R5 ARF5 ROB5 R4 6 16
R6 4 R6 ARF6 ROB6 R1
R7 1 R7 ARF7
R8 2 R8 ARF8
Cycle:
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished
RS (Adder) RS (Mul/Div)
SUB ROB5 36 3 DIV ROB1 45 5
ADD ROB6 ROB5 ROB1
I E W C
ARF RAT ROB
1 1 2
R1 -23 R1 ROB6 ROB1 R2 2 2 3 13
R2 16 R2 ROB1 ROB2 R1 12 Y 3 3 4 6
R3 45 R3 ROB3 ROB3 R3 3 Y 4 14 15 25
R4 5 R4 ROB5 ROB4 R1 36 Y 5 15
R5 3 R5 ARF5 ROB5 R4 6 16
R6 4 R6 ARF6 ROB6 R1
R7 1 R7 ARF7
R8 2 R8 ARF8
Cycle: 26
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished
RS (Adder) RS (Mul/Div)
SUB ROB5 36 3 DIV ROB1 45 5
ADD ROB6 ROB5 ROB1
I E W C
ARF RAT ROB
1 1 2
R1 -23 R1 ROB6 ROB1 R2 2 2 3 13
R2 16 R2 ROB1 ROB2 R1 12 Y 3 3 4 6
R3 45 R3 ROB3 ROB3 R3 3 Y 4 14 15 25
R4 5 R4 ROB5 ROB4 R1 36 Y 5 15 26
R5 3 R5 ARF5 ROB5 R4 6 16
R6 4 R6 ARF6 ROB6 R1
R7 1 R7 ARF7
R8 2 R8 ARF8
Cycle: 28
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished
RS (Adder) RS (Mul/Div)
ADD ROB6 ROB1 33 DIV ROB1 45 5
I E W C
ARF RAT ROB
1 1 2
R1 -23 R1 ROB6 ROB1 R2 2 2 3 13
R2 16 R2 ROB1 ROB2 R1 12 Y 3 3 4 6
R3 45 R3 ROB3 ROB3 R3 3 Y 4 14 15 25
R4 5 R4 ROB5 ROB4 R1 36 Y 5 15 26 28
R5 3 R5 ARF5 ROB5 R4 33 Y 6 16
R6 4 R6 ARF6 ROB6 R1
R7 1 R7 ARF7
R8 2 R8 ARF8
Cycle:
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished
RS (Adder) RS (Mul/Div)
ADD ROB6 33 9
I E W C
ARF RAT ROB
1 1 2 42
R1 -23 R1 ROB6 ROB1 R2 9 Y 2 2 3 13
R2 16 R2 ROB1 ROB2 R1 12 Y 3 3 4 6
R3 45 R3 ROB3 ROB3 R3 3 Y 4 14 15 25
R4 5 R4 ROB5 ROB4 R1 36 Y 5 15 26 28
R5 3 R5 ARF5 ROB5 R4 33 Y 6 16
R6 4 R6 ARF6 ROB6 R1
R7 1 R7 ARF7
R8 2 R8 ARF8
Cycle: 43
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished
RS (Adder) RS (Mul/Div)
ADD ROB6 33 9
I E W C
ARF RAT ROB
1 1 2 42 43
R1 -23 R1 ROB6 ROB1 2 2 3 13
R2 9 R2 ARF2 ROB2 R1 12 Y 3 3 4 6
R3 45 R3 ROB3 ROB3 R3 3 Y 4 14 15 25
R4 5 R4 ROB5 ROB4 R1 36 Y 5 15 26 28
R5 3 R5 ARF5 ROB5 R4 33 Y 6 16 43
R6 4 R6 ARF6 ROB6 R1
R7 1 R7 ARF7
R8 2 R8 ARF8
Cycle: 44
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished
RS (Adder) RS (Mul/Div)
ADD ROB6 33 9
I E W C
ARF RAT ROB
1 1 2 42 43
R1 12 R1 ROB6 ROB1 2 2 3 13 44
R2 9 R2 ARF2 ROB2 3 3 4 6
R3 45 R3 ROB3 ROB3 R3 3 Y 4 14 15 25
R4 5 R4 ROB5 ROB4 R1 36 Y 5 15 26 28
R5 3 R5 ARF5 ROB5 R4 33 Y 6 16 43
R6 4 R6 ARF6 ROB6 R1
R7 1 R7 ARF7
R8 2 R8 ARF8
Cycle: 45
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished
RS (Adder) RS (Mul/Div)
I E W C
ARF RAT ROB
1 1 2 42 43
R1 12 R1 ROB6 ROB1 2 2 3 13 44
R2 9 R2 ARF2 ROB2 3 3 4 6 45
R3 3 R3 ARF3 ROB3 4 14 15 25
R4 5 R4 ROB5 ROB4 R1 36 Y 5 15 26 28
R5 3 R5 ARF5 ROB5 R4 33 Y 6 16 43 45
R6 4 R6 ARF6 ROB6 R1 42 Y
R7 1 R7 ARF7
R8 2 R8 ARF8
Cycle: 46
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished
RS (Adder) RS (Mul/Div)
I E W C
ARF RAT ROB
1 1 2 42 43
R1 36 R1 ROB6 ROB1 2 2 3 13 44
R2 9 R2 ARF2 ROB2 3 3 4 6 45
R3 3 R3 ARF3 ROB3 4 14 15 25 46
R4 5 R4 ROB5 ROB4 5 15 26 28
R5 3 R5 ARF5 ROB5 R4 33 Y 6 16 43 45
R6 4 R6 ARF6 ROB6 R1 42 Y
R7 1 R7 ARF7
R8 2 R8 ARF8
Cycle: 47
In Detail
Inst Operands Assume you can bypass and execute in the same cycle
1 DIV R2, R3, R4
2 MUL R1, R5, R6
3 ADD R3, R7, R8 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2
4 MUL R1, R1, R3
5 SUB R4, R1, R5
6 ADD R1, R4, R2 ROB fields: Type Dest Value Finished
RS (Adder) RS (Mul/Div)
I E W C
ARF RAT ROB
1 1 2 42 43
R1 36 R1 ROB6 ROB1 2 2 3 13 44
R2 9 R2 ARF2 ROB2 3 3 4 6 45
R3 3 R3 ARF3 ROB3 4 14 15 25 46
R4 33 R4 ARF4 ROB4 5 15 26 28 47
R5 3 R5 ARF5 ROB5 6 16 43 45
R6 4 R6 ARF6 ROB6 R1 42 Y
R7 1 R7 ARF7
R8 2 R8 ARF8
Cycle: 48
Task
In a processor having Tomasulo-based dynamic
scheduling and speculation (using ROB) feature,
determine the sequence of clock cycles in which the
write-back and commit operations are performed for the
following instruction sequence.
CLOCK  INSTR SEQUENCE     ISSUE  EXECUTION BEGIN  EXECUTION END  W  C
1      ADD.D F8, F2, F6   1      2                3              4  5
Superscalar MIPS
• Superscalar MIPS: 2 instructions, 1 FP & 1 anything else
– Fetch 64-bits/clock cycle; Int on left, FP on right
– Can only issue 2nd instruction if 1st instruction issues
– If the second instruction depends on the first instruction, it cannot be
issued in the same cycle.
– More ports for FP registers to do FP load & FP op in a pair
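The pairing rules above can be written as a small check. This is a sketch under assumptions, not the actual issue logic: instructions are modelled as hypothetical `(op, dest, sources)` tuples, the set `FP_OPS` lists only the FP operations used in these slides, and only a register dependence of the FP operation on its partner is tested.

```python
from typing import Optional, Tuple

Instr = Tuple[str, Optional[str], Tuple[str, ...]]   # (op, dest, sources)

FP_OPS = {"ADDD", "SUBD", "MULTD", "DIVD"}           # assumed FP operation set


def can_dual_issue(first: Instr, second: Instr) -> bool:
    """True if `first` (left slot) and `second` (right, FP slot) may issue together."""
    op1, dest1, _ = first
    op2, _, srcs2 = second
    if op1 in FP_OPS:                 # left slot takes the non-FP instruction
        return False
    if op2 not in FP_OPS:             # right slot takes the FP operation
        return False
    if dest1 is not None and dest1 in srcs2:
        return False                  # second depends on first: issue it later
    return True


independent = can_dual_issue(("LD", "F6", ("R1",)), ("ADDD", "F4", ("F0", "F2")))
dependent = can_dual_issue(("LD", "F0", ("R1",)), ("ADDD", "F4", ("F0", "F2")))
```

In the second pair the ADDD reads F0, which the load in the same packet produces, so the ADDD has to wait for a later cycle.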
8 SD -16(R1),F12
9 SUBI R1,R1,#40
10 SD -24(R1),F16
11 BNEZ R1,Loop
12 SD 8(R1),F20
The Basic VLIW Approach
• VLIW uses multiple, independent functional units
• VLIWs package the multiple operations into one very
long instruction
• The compiler is responsible for choosing the instructions to be
issued simultaneously
• A VLIW instruction might include one integer/branch
instruction, two memory references, and two floating-point operations.
– E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
– If each of these 7 operations requires a 16 to 24 bit field, the length of
each VLIW instruction is 112 to 168 bits.
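The instruction length follows directly from the slot count. A short sketch of the arithmetic, where the `SLOTS` dictionary and the helper name are illustrative choices:

```python
# Operation slots in the example VLIW instruction.
SLOTS = {"integer": 2, "fp": 2, "memory": 2, "branch": 1}


def vliw_width_bits(slots: dict, bits_per_operation: int) -> int:
    """Total instruction length when every slot uses the same field width."""
    return sum(slots.values()) * bits_per_operation


narrow = vliw_width_bits(SLOTS, 16)   # 7 operations x 16 bits
wide = vliw_width_bits(SLOTS, 24)     # 7 operations x 24 bits
```

Every slot occupies its field whether or not the compiler finds a useful operation for it, which is why unfilled slots cost code size in a VLIW.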
Loop Unrolling in VLIW
Mem. Ref1 Mem Ref. 2 FP1 FP2 Int/Branch
1 LD F2,0(R1) LD F6,-8(R1)
2 LD F10,-16(R1) LD F14,-24(R1)
7 SD -16(R1),F12 SD -24(R1),F16
9 SD 8(R1),F28
8. Exploiting ILP Using Dynamic
Scheduling, Multiple Issue, and
Speculation
Example
• A loop to increment each element of the array
• Assume that there are separate integer functional units for address
calculation, for ALU operations, for branch condition evaluation.
• Up to 2 instructions of any type can commit per clock.
Comparison of execution with and without Speculation
(1)
Example
clock 3
Intel Pentium 4
• The Pentium 4 processor is an Intel
microprocessor that was introduced in
November 2000
• Implements a new Intel NetBurst microarchitecture
• Supports multithreading
• Supports Hyper-Threading (HT) Technology
• The Front-end-decoder
– Translates IA-32 instruction to a series of micro
operations (uops)
– uops are executed by a dynamically scheduled
speculative pipeline
– Up to 3 simple IA-32 instructions may be decoded
& translated every cycle ; generating up to 6 uops
– If a complex instruction requires more than 3
uops, its uops are generated from the microcode ROM
• Execution Trace cache
– Holds sequences of uops
– Uops are fetched from the trace cache for execution
by the pipeline
• Avoids the need to redecode IA-32 instructions whenever
trace cache hits
– Exploits temporal sequencing of instruction execution
– On trace cache misses, instructions are
• fetched from L2 cache
• Decoded and trace cache is refilled with uops
– Execution trace cache has its own BTB
• Predicts the outcome of uop branches
– Execution trace cache has High hit rate
• 85% for SPEC CPUINT
• Hence the fetch and decode unit is rarely used
The pipeline:
– The Out-of-order speculative pipeline uses
register renaming (not reorder buffer)
– Up to 3 uops per clock can be renamed and
dispatched to the functional unit queues
– The functional unit queues have 4 dispatch ports,
one each for
• load,
• store,
• basic ALU operations,
• FP and integer operations
– Up to 3 uops can be committed per clock
• Has a deeper pipeline (20 pipeline stages) to achieve a high
clock rate (3.2 GHz in 2002)
– A simple instruction takes 31 clock cycles to go from fetch to retire
– A two level cache is used to reduce the miss
penalty
– A front end BTB is used when the execution trace
cache misses
– Uses a two level predictor with both global and local branch
histories
Intel NetBurst Microarchitecture
New Intel Microarchitecture
[Figure: block diagram with Instruction Fetch and PreDecode, Rename/Alloc, Retirement Unit (ReOrder Buffer), Schedulers, and execution ports (ALU/Branch, ALU/FAdd, ALU/FMul, each with MMX/SSE and FPmove; Load; Store); FSB up to 10.4 Gb/s]
Also:
unlimited number of instructions issued/clock
cycle; perfect caches;
1 cycle latency for all instructions (FP *,/);
Upper Limit to ILP: Ideal Machine
[Figure: bar chart of instruction issues per cycle (IPC) for the programs gcc, espresso, li, fpppp, doducd, tomcatv; bar values range from 17.9 to 150.1; Integer: 18 - 60, FP: 75 - 150]
How is this data generated?
Window impact
• Window – set of instructions that is examined for
simultaneous execution
• No. of comparisons required every clock cycle =
completion rate * window size * no. of operands per instruction
• (ex: 6 * 200 * 2 = 2400)
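The formula is simple enough to check directly. A sketch of the arithmetic, where the function name and the default of 2 source operands per instruction are illustrative assumptions:

```python
def comparisons_per_cycle(completion_rate: int, window_size: int,
                          operands_per_instruction: int = 2) -> int:
    """Tag comparisons needed each clock: every completing result is checked
    against every source operand of every instruction in the window."""
    return completion_rate * window_size * operands_per_instruction


example = comparisons_per_cycle(completion_rate=6, window_size=200)
doubled_window = comparisons_per_cycle(completion_rate=6, window_size=400)
```

The count grows linearly with both the window size and the completion rate, which is one reason practical windows are much smaller than the ideal machine assumes.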
[Figure: bar chart of instruction issues per cycle (IPC) by program; annotated ranges FP: 8 - 45, Integer: 6 - 12]
[Figure: bar chart of IPC for the programs gcc, espresso, li, fpppp, doducd, tomcatv; labelled “IPC = 64”; annotated range Integer: 6 - 12]
[Figure: bar chart of IPC for the programs gcc, espresso, li, fpppp, doducd, tomcatv; annotated range Integer: 5 - 15]
[Figure: bar chart of IPC by program; annotated range Integer: 4 - 9]