Hwang Sol
ASSIGNMENT – 5
CHAPTER 6
PIPELINING AND SUPERSCALAR TECHNIQUES
ARWIN
NIM. 232 06 008
Problem 6.1 – Consider the execution of a program of 15,000 instructions by a linear pipeline
processor with a clock rate of 25 MHz. Assume that the instruction pipeline has five stages and that
one instruction is issued per clock cycle. The penalties due to branch instructions and
out-of-sequence executions are ignored.
a. Calculate the speedup factor using this pipeline to execute the program as compared with
the use of an equivalent nonpipelined processor with an equal amount of flow-through delay.
b. What are the efficiency and throughput of this pipelined processor?
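Answer :
With n = 15,000 instructions, k = 5 stages and f = 1/τ = 25 MHz:
\[ S_k = \frac{T_1}{T_k} = \frac{nk\tau}{k\tau + (n-1)\tau} = \frac{nk}{k + (n-1)} = \frac{(15{,}000)(5)}{5 + (15{,}000 - 1)} = \frac{75{,}000}{15{,}004} = 4.999 \]
\[ E_k = \frac{S_k}{k} = \frac{4.999}{5} \approx 0.9997 \]
\[ H_k = \frac{nf}{k + (n-1)} = \frac{(15{,}000)(25)}{5 + (15{,}000 - 1)} = \frac{375{,}000}{15{,}004} = 24.99 \text{ MIPS} \]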
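The three results above can be checked numerically. The following is a minimal sketch, assuming an ideal linear pipeline with no hazards; the function name `linear_pipeline_metrics` and its parameters are illustrative and not part of the original answer.

```python
def linear_pipeline_metrics(n, k, f_mhz):
    """Return (speedup, efficiency, throughput in MIPS) for n instructions
    on a k-stage linear pipeline clocked at f_mhz, ignoring all hazards."""
    cycles = k + (n - 1)              # clock cycles needed by the pipeline
    speedup = n * k / cycles          # S_k = nk / (k + n - 1)
    efficiency = speedup / k          # E_k = S_k / k
    throughput = n * f_mhz / cycles   # H_k = n f / (k + n - 1), f in MHz
    return speedup, efficiency, throughput


s, e, h = linear_pipeline_metrics(n=15_000, k=5, f_mhz=25)
print(f"S_k = {s:.3f}, E_k = {e:.4f}, H_k = {h:.2f} MIPS")
# S_k = 4.999, E_k = 0.9997, H_k = 24.99 MIPS
```

Because n is much larger than k, the speedup is close to its ideal value of k = 5 and the throughput is close to the clock rate.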
Arwin – 23206008@2006
Problem 6.2 – Study the DEC Alpha architecture in Example 6.13, find more information in the
DEC Alpha handbook, and then answer the following questions with reasoning :
Answer :
ο $m = 2$.
ο $H_{7}(\mathrm{int}) = 300$ MIPS and $H_{10}(\mathrm{flop}) = 150$ Mflops.
It also features :
a. From the superscalar and superpipeline perspectives, we compare it with a scalar
processor that has a clock rate of 25 MHz. The superpipelining degree of the
machine is then:
\[ T(m, n) = k + \frac{N - m}{mn} \]
For the EBOX (k = 7):
\[ T(2, 6) = 7 + \frac{10{,}000 - 2}{(2)(6)} = 7 + \frac{9{,}998}{12} = 7 + 833.2 = 840.2 \approx 840 \text{ cycles} \]
For the FBOX (k = 10):
\[ T(2, 6) = 10 + \frac{10{,}000 - 2}{(2)(6)} = 10 + \frac{9{,}998}{12} = 10 + 833.2 = 843.2 \approx 843 \text{ cycles} \]
ο Speedup, S(2, 6), for the EBOX:
\[ S(m, n) = \frac{T(1, 1)}{T(m, n)} = \frac{mn(k + N - 1)}{mnk + N - m} \]
\[ S(2, 6) = \frac{(2)(6)(7 + 10{,}000 - 1)}{(2)(6)(7) + 10{,}000 - 2} = \frac{(12)(10{,}006)}{84 + 9{,}998} = \frac{120{,}072}{10{,}082} = 11.91 \approx 12 \text{ times} \]
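The execution time and speedup can be reproduced with a short script. This is a sketch only: it takes T(m, n) = k + (N − m)/(mn) as given above and computes the speedup as T(1, 1)/T(m, n); the function names are illustrative.

```python
def execution_time(k, n_instr, m=1, n=1):
    """T(m, n) = k + (N - m) / (m n), measured in base cycles."""
    return k + (n_instr - m) / (m * n)


def speedup(k, n_instr, m, n):
    """S(m, n) = T(1, 1) / T(m, n), relative to the base scalar machine."""
    return execution_time(k, n_instr) / execution_time(k, n_instr, m, n)


for unit, k in (("EBOX", 7), ("FBOX", 10)):
    t = execution_time(k, 10_000, m=2, n=6)
    s = speedup(k, 10_000, m=2, n=6)
    print(f"{unit}: T(2,6) = {t:.1f} cycles, S(2,6) = {s:.2f}")
```

For long programs the pipeline depth k matters little, so both units approach the limiting speedup of mn = 12.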
ο The DEC-21064-A was also equipped with a 34-bit address bus that could cover
$2^{34}$ machine addresses. In a shared-memory multiprocessor system, the processors share
memory locations with one another. This feature enables processors to exchange data
via a large shared memory.
Problem 6.4 – Find the optimal number of pipeline stages k0 given in Eq. 6.7 using the
Answer :
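\[ \mathrm{PCR} = \frac{f}{c + kh} \]
where f is the clock rate, c is the cost of all logic stages and h represents the
cost of each latch on a k-stage pipeline. In terms of PCR, the number of pipeline stages can
be expressed as:
\[ \mathrm{PCR} = \frac{f}{c + kh} \;\leftrightarrow\; \mathrm{PCR}\,(c + kh) = f \]
\[ c\,\mathrm{PCR} + kh\,\mathrm{PCR} = f \]
\[ kh\,\mathrm{PCR} = f - c\,\mathrm{PCR} \]
\[ k = \frac{f - c\,\mathrm{PCR}}{h\,\mathrm{PCR}} \rightarrow k_0 \]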
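The rearrangement above expresses k in terms of PCR but does not yet locate the optimum. A sketch of the usual maximization follows, assuming the clock period of a k-stage pipeline is modeled as τ = t/k + d. Here t (the total flow-through delay of the nonpipelined logic) and d (the latch delay) are symbols introduced for this sketch and are not defined in the answer above.

```latex
\begin{align*}
f &= \frac{1}{t/k + d} \\
\mathrm{PCR}(k) &= \frac{f}{c + kh} = \frac{1}{(t/k + d)(c + kh)} \\
g(k) &= \left(\frac{t}{k} + d\right)(c + kh) = \frac{tc}{k} + th + dc + dhk \\
\frac{dg}{dk} &= -\frac{tc}{k^{2}} + dh = 0
\quad\Longrightarrow\quad k_0 = \sqrt{\frac{t\,c}{d\,h}}
\end{align*}
```

Maximizing PCR is the same as minimizing the denominator g(k), and the second derivative $2tc/k^{3}$ is positive for k > 0, so the stationary point is a minimum of g and hence a maximum of PCR.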
Problem 6.5 – Prove the lower bound and upper bound on the minimal average latency (MAL) as
specified on page 277.
Answer :
Proof :
Take the reservation table of function X on page 271 as an example:
     1   2   3   4   5   6   7   8
S1   X                   X       X
S2       X       X
S3           X       X       X
From the table we see that S1 has 3 checkmarks, S2 has 2 checkmarks and S3 has 3 checkmarks,
which means that the lower bound on the MAL for function X is 3. The forbidden latencies are 2, 4, 5
and 7, while the permissible latencies are 1, 3 and 6. We can then obtain the collision vector
$C_X = (1011010)$. Shifting this vector bit by bit to the right gives the state transition
diagram.
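The forbidden latencies and the collision vector can be derived mechanically from a reservation table. The sketch below assumes the table is stored as a mapping from stage name to the clock cycles in which that stage is used. The column positions in `table_x` are an assumption chosen to agree with the forbidden latencies stated above, and the function names are illustrative.

```python
from itertools import combinations


def forbidden_latencies(table):
    """Distances between any two checkmarks in the same row of the table."""
    forbidden = set()
    for cycles in table.values():
        for earlier, later in combinations(sorted(cycles), 2):
            forbidden.add(later - earlier)
    return forbidden


def collision_vector(forbidden):
    """Bit string C_m ... C_2 C_1 with C_i = 1 when latency i is forbidden."""
    m = max(forbidden)
    return "".join("1" if i in forbidden else "0" for i in range(m, 0, -1))


# Assumed layout of the reservation table for function X (1-based columns).
table_x = {"S1": [1, 6, 8], "S2": [2, 4], "S3": [3, 5, 7]}
forbidden = forbidden_latencies(table_x)
print(sorted(forbidden), collision_vector(forbidden))
# [2, 4, 5, 7] 1011010
```

The same two functions apply to any other reservation table written in this form.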
a. The simple cycles are (3), (6), (1,8) and (3,8), and the greedy cycles are (3) and
(1,8). The greedy cycle with the lowest average latency gives the MAL; here it is (3), so the MAL is 3,
which equals the maximum number of checkmarks in any row (1st statement proved).
b. The collision vector $C_X = (1011010)$ has four 1-bits, so the upper bound on the MAL is the number
of 1-bits plus 1, that is 5. The largest average latency of any greedy cycle in the state diagram of
function X is (1 + 8)/2 = 4.5, taken from the greedy cycle (1,8), which does not exceed 5 (2nd statement
proved).
Problem 6.6 – Consider the following reservation table for a four-stage pipeline with a clock cycle
τ = 20 ns.
     1   2   3   4   5   6
S1   X                   X
S2       X       X
S3           X
S4               X   X
a. What are the forbidden latencies and the initial collision vector?
b. Draw the state transition diagram for scheduling the pipeline.
c. Determine the MAL associated with the shortest greedy cycle.
d. Determine the pipeline throughput corresponding to the MAL and given τ .
e. Determine the lower bound on the MAL for this pipeline.
Answer :
a. The forbidden latencies are 1, 2, and 5 (S1: 5; S2: 2; S3: none; S4: 1) and the collision
vector is $C_X = (C_5 C_4 C_3 C_2 C_1) = (10011)$. The permissible latencies are 3 and 4.
3rd shift
shifted bit 00010
Cx 10011
new value 10011
4th shift
shifted bit 00001
Cx 10011
new value 10011
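1) The throughput at the MAL is 1/3 = 0.33 initiations per clock cycle; with τ = 20 ns this is 1/(3 × 20 ns) ≈ 16.67 million initiations per second.
2) The τ is 1/2 = 0.5 (at clock cycle 4 there are two tasks initiated).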
e. The lower bound on the MAL of this function according to [Shar72] is 2 (the maximum
number of checkmarks in any row). The optimal latency has not been reached; this requires a
modification of the reservation table.
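The shift-and-OR procedure used for the state transition diagram can be automated. The following sketch builds every reachable state from a collision vector, follows the smallest permissible latency from each state to find the greedy cycles, and reports the lowest average latency among them. It assumes, as the textbook states, that the MAL is attained by a greedy cycle; all function names are illustrative.

```python
def permissible_latencies(state, nbits):
    """Latencies whose collision bit is 0, plus nbits + 1 (always collision-free)."""
    free = [lat for lat in range(1, nbits + 1) if not (state >> (lat - 1)) & 1]
    return free + [nbits + 1]


def reachable_states(initial, nbits):
    """All states of the transition diagram, found by depth-first search."""
    seen, stack = {initial}, [initial]
    while stack:
        state = stack.pop()
        for lat in permissible_latencies(state, nbits):
            nxt = (state >> lat) | initial  # shift right, then OR with C_X
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def greedy_cycle(start, initial, nbits):
    """Follow the smallest permissible latency until a state repeats."""
    first_visit, latencies, state = {}, [], start
    while state not in first_visit:
        first_visit[state] = len(latencies)
        lat = permissible_latencies(state, nbits)[0]
        latencies.append(lat)
        state = (state >> lat) | initial
    cycle = latencies[first_visit[state]:]
    # Rotate to a canonical form so that (3, 1) and (1, 3) count as one cycle.
    return min(tuple(cycle[i:] + cycle[:i]) for i in range(len(cycle)))


def greedy_cycles_and_mal(collision_vector):
    """Return (set of greedy cycles, lowest average latency among them)."""
    nbits, initial = len(collision_vector), int(collision_vector, 2)
    cycles = {greedy_cycle(s, initial, nbits)
              for s in reachable_states(initial, nbits)}
    return cycles, min(sum(c) / len(c) for c in cycles)


cycles, mal = greedy_cycles_and_mal("10011")
print(cycles, mal)  # {(3,)} 3.0
```

For the collision vector 10011 the diagram has a single state, so the only greedy cycle is (3) and the MAL of 3 stays above the lower bound of 2.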
Problem 6.7 – You are allowed to insert one noncompute delay stage into the pipeline in Problem 6.6
to make latency of 1 permissible in the shortest greedy cycle. The purpose is to yield a new
reservation table leading to an optimal latency equal to the lower bound.
a. Show the modified reservation table with five rows and seven columns.
b. Draw the new state transition diagram for obtaining the optimal cycle.
c. List all the simple cycles and greedy cycles from the state diagram.
d. Prove that the new MAL equals the lower bound.
e. What is the optimal throughput of this pipeline ? Indicate the percentage of throughput
improvement compared with that obtained in part (d) of Problem 6.6.
Answer :
1 2 3 4 5 6 7
S1 X X
S2 X X
S3 X
S4 X X1
D D1
The forbidden latencies are 2 and 6, so the collision vector is $C_X = (100010)$. The
5th shift
shifted bit 000001
Cx 100010
new value 100011
1-shift from 3rd shift 1-shift from 4th shift 1-shift from 5th shift
shifted bit 010011 shifted bit 010001 shifted bit 010001
Cx 100010 Cx 100010 Cx 100010
new value 110011 new value 110011 new value 110011
3-shift from 1st shift 4-shift from 1st shift 5-shift from 1st shift
shifted bit 000110 shifted bit 000011 shifted bit 000001
Cx 100010 Cx 100010 Cx 100010
new value 100110 new value 100011 new value 100011
3-shift from 3rd shift 4-shift from 3rd shift 5-shift from 3rd shift
shifted bit 000100 shifted bit 000010 shifted bit 000001
Cx 100010 Cx 100010 Cx 100010
new value 100110 new value 100010 new value 100011
3-shift from 4th shift 4-shift from 4th shift 5-shift from 4th shift
shifted bit 000100 shifted bit 000010 shifted bit 000001
Cx 100010 Cx 100010 Cx 100010
new value 100110 new value 100010 new value 100011
3-shift from 5th shift 4-shift from 5th shift 5-shift from 5th shift
shifted bit 000100 shifted bit 000010 shifted bit 000001
Cx 100010 Cx 100010 Cx 100010
new value 100110 new value 100010 new value 100011
c. The simple cycles are (3), (4), (5), (1,3), (1,4), (1,7), (3,4), (3,5), (3,7), (4,5), (5,7),
(1,3,4), (1,3,7), (1,4,4), (1,4,7), (3,5,4), (3,5,7), and (5,1,7). The greedy cycles are (3) and
(1,3).
d. The greedy cycle (1,3) has the lowest average latency, (1 + 3)/2 = 2, so
the MAL of this pipeline is 2. The reservation table shows that
this MAL equals the maximum number of checkmarks in any row, which is the
lower bound (proved).
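e. The throughput is 1/MAL = 1/2 = 0.5 initiations per clock cycle. Compared with the
throughput of 1/3 ≈ 0.33 obtained in part (d) of Problem 6.6, the ratio is 0.5 / (1/3) = 1.5,
that is, the new throughput is 150% of the old one, a 50% improvement.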
Problem 6.8 – Consider an adder pipeline with four stages which consists of input lines X and Y
and output line Z. The pipeline has a register R as its output where the temporary result can be
stored and fed back to S1 at a later point in time. The inputs X and Y are multiplexed with the
outputs R and Z.
a. Assume the elements of the vector A are fed into the pipeline through input X, one
element per cycle. What is the minimum number of clock cycles required to compute the sum
adder with a flow through delay of 4τ . Find the actual speedup S4 ( 64 ) and efficiency
infinity.
d. Find $N_{1/2}$, the minimum vector length required to achieve half of the maximum
speedup.
Answer :
a. The minimum number of clock cycles can be obtained by writing the assembly code,
creating the reservation table and working through the instruction schedule. The outputs
Z and R are obtained from input line X or Y in the same manner, so the code for the two cases
differs only in the extra line that executes the store command when input
line X is used. The code is as follows :
     1   2   3   4
S1   X
S2       X
S3           X
S4               X
Therefore, the minimum number of clock cycles required to complete one process is 7.
\[ T_k = [k + (n - 1)]\tau \]
\[ T_4 = [4 + (64 - 1)]\tau = 67\tau \]
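\[ S_k = \frac{T_1}{T_k} \;\leftrightarrow\; S_4 = \frac{256\tau}{67\tau} = 3.82 \]
\[ E_k = \frac{S_k}{k} \;\leftrightarrow\; E_4 = \frac{3.82}{4} = 0.96 \text{ or } 96\% \]
\[ S_\infty = \lim_{N \to \infty} \frac{4N}{4 + (N - 1)} = 4 \quad \text{(this is an ideal condition)} \]
\[ E_\infty = \frac{S_\infty}{4} = 1 \quad \text{(this is an ideal condition)} \]
Half of the maximum speedup is $S_4 = 2$. Solving the speedup formula for N:
\[ S_4 = \frac{Nk}{k + (N - 1)} \;\leftrightarrow\; kS_4 + NS_4 - S_4 = Nk \]
\[ S_4(k - 1) = N(k - S_4) \]
\[ N = \frac{S_4(k - 1)}{k - S_4} \]
\[ N = \frac{2(4 - 1)}{4 - 2} = \frac{6}{2} = 3 \]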
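The speedup, efficiency and half-speedup vector length can be checked with a few lines of code. This sketch assumes the ideal speedup formula S = Nk/(k + N − 1) used above; the function names are illustrative.

```python
import math


def speedup(n, k):
    """Ideal speedup of a k-stage pipeline over n operations."""
    return n * k / (k + n - 1)


def min_vector_length(target, k):
    """Smallest integer N with speedup(N, k) >= target, using
    N = S(k - 1) / (k - S); only valid for target < k."""
    return math.ceil(target * (k - 1) / (k - target))


s4 = speedup(64, 4)
print(round(s4, 2), round(s4 / 4, 2), min_vector_length(2, 4))
# 3.82 0.96 3
```

A vector of only three elements already reaches half of the maximum speedup, because the fill time of a four-stage pipeline is short.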
     1   2   3   4
S1   X           X
S2       X
S3           X
Answer :
a. The forbidden latency is 3. From the table we obtain the collision vector
$C_X = (C_3 C_2 C_1) = (100)$. The permissible latencies are 1 and 2.
4th shift 1st shift from 110 2nd shift from 110
shifted bit 000 shifted bit 011 shifted bit 001
Cx 100 Cx 100 Cx 100
new value 100 new value 111 new value 101
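c. The simple cycles are (2), (4), (1,4), (2,4), and (1,1,4). The greedy cycles are (2) and
(1,1,4): from state 110 the smallest permissible latency is 1, not 4, so (1,4) is not greedy.
Both greedy cycles have an average latency of 2.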
\[ H_k = \frac{n}{[k + (n - 1)]\tau} \]
\[ H_3 = \frac{2}{[3 + (2 - 1)] \cdot 20 \cdot 10^{-9}} = \frac{1}{40 \cdot 10^{-9}} = 25 \text{ MIPS} \]
Problem 6.10 – Consider the five-staged pipelined processor specified by the following reservation
table:
1 2 3 4 5 6
S1 X X
S2 X X
S3 X
S4 X
S5 X X
Answer :
1-shift from 2nd shift 2-shift from 2nd shift 2-shift from 1st shift
shifted bit 01111 shifted bit 00111 shifted bit 00111
Cx 11100 Cx 11100 Cx 11100
new value 11111 new value 11111 new value 11111
f. The greedy cycle is also the constant cycle, which is equal to the MAL.
g. The maximum throughput is 1/MAL = 1/2 = 0.5, or only 50%.
h. The minimum constant cycle is 2, so the maximum throughput does not change;
it remains 50%.
Problem 6.11 – The following assembly code is to be executed in a three-stage pipelined processor
with hazard detection and resolution in each stage. The stages are instruction fetch, operand fetch
(one or more as required), and execution (including write-back operation). Explain all possible
hazards in the execution of the code.
I1 : Inc R0 / R0 ← (R0)+ 1/
I2 : Mul ACC, R0 / ACC ← (ACC) x (R0)/
I3 : Store R1, ACC / R1 ← (ACC)/
I4 : Add ACC, R0 / ACC ← (ACC) + (R0)/
I5 : Store M, ACC / M ← (ACC)/
Answer :
Analyzing ……………….
a. RAW hazards are I1-I2, I1-I4, I2-I3, I2-I4, I2-I5, and I4-I5.
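The hazard pairs can be enumerated with the set-intersection test from the textbook. The sketch below assumes each instruction is described by its domain (operands read) and range (operands written), and flags a RAW hazard when R(I) ∩ D(J) is non-empty, WAR when D(I) ∩ R(J) is non-empty, and WAW when R(I) ∩ R(J) is non-empty, for I before J. The encoding of the five instructions is an illustration derived from the register-transfer comments in the problem statement.

```python
from itertools import combinations

# (name, domain = operands read, range = operands written)
program = [
    ("I1", {"R0"}, {"R0"}),          # Inc R0
    ("I2", {"ACC", "R0"}, {"ACC"}),  # Mul ACC, R0
    ("I3", {"ACC"}, {"R1"}),         # Store R1, ACC
    ("I4", {"ACC", "R0"}, {"ACC"}),  # Add ACC, R0
    ("I5", {"ACC"}, {"M"}),          # Store M, ACC
]


def find_hazards(instructions):
    """Return potential RAW, WAR and WAW pairs (earlier, later)."""
    hazards = {"RAW": [], "WAR": [], "WAW": []}
    for (ni, di, ri), (nj, dj, rj) in combinations(instructions, 2):
        if ri & dj:  # J reads what I writes
            hazards["RAW"].append((ni, nj))
        if di & rj:  # J writes what I reads
            hazards["WAR"].append((ni, nj))
        if ri & rj:  # both write the same location
            hazards["WAW"].append((ni, nj))
    return hazards


for kind, pairs in find_hazards(program).items():
    print(kind, pairs)
```

This test reports potential hazards only. It does not account for intervening writes, so a pair such as I2-I5 is listed even though I4 overwrites ACC in between.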
Problem 6.13 – Consider the following pipelined processor with four stages. This pipeline has a total
evaluation time of six clock cycles. All successor stages must be used after each clock cycle.
a. Specify the reservation table for this pipeline with six columns and four rows.
b. List the set of forbidden latencies between initiations.
c. Draw the state diagram which shows all possible latency cycles.
d. List all greedy cycles from the state diagram.
e. What is the value of the minimal average latency (MAL) ?
f. What is the maximal throughput of this pipeline ?
Answer :
a. Reservation table
1 2 3 4 5 6
S1 X
S2 X X
S3 X X
S4 X X
b. The forbidden latencies are 2 and 4, with collision vector $C_X = (01010)$. The
1-shift from 3rd shift 1-shift from 5th shift 3-shift from 3rd shift
shifted bit 00101 shifted bit 00101 shifted bit 00001
Cx 01010 Cx 01010 Cx 01010
new value 01111 new value 01111 new value 01011
5-shift from 3rd shift 3-shift from 5th shift 5-shift from 5th shift
shifted bit 00000 shifted bit 00001 shifted bit 00000
Cx 01010 Cx 01010 Cx 01010
new value 01010 new value 01011 new value 01010
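d. With the collision vector (01010), the cycles in the state diagram are (3), (5), (6), (1,5), (1,6),
(3,5) and (3,6). Latency 3 is forbidden in state 01111 and latency 1 is forbidden in state 01011, so a
cycle such as (1,3) cannot occur. The greedy cycles are (3) and (1,5).
e. The value of MAL is 3, the average latency of both greedy cycles obtained from the state
diagram.
f. The maximum throughput is the inverse of MAL, or 1/3 ≈ 0.33 or 33%.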
Problem 6.14 – Three functional pipelines f1 , f 2 , and f 3 are characterized by the following
reservation tables. Using these three pipelines, a composite pipeline network is formed below :
Each task going through this composite pipeline uses the pipeline in the following order: f1 first, f 2
and f 3 next, f1 again, and then the output is obtained. The dual multiplexer selects a pair of inputs,
(A,B) or (X,Y), and feeds them into the input of f1 . The use of composite pipeline is described by
1 2 3 4 1 2 3 4 1 2 3 4
S1 X T1 X X U1 X X
S2 X T2 X U2 X
S3 X X T3 X U3 X
Answer :
1 2 3 4 5 6 7 8 9 10 11 12
S1 X X
S2 X X
S3 X X X X
T1 X X
T2 X
T3 X
U1 X X
U2 X
U3 X
4-shift from 4th shift 5-shift from 4th shift 6-shift from 4th shift
shifted bit 00000011101 shifted bit 00000001111 shifted bit 00000000111
Cx 00111000111 Cx 00111000111 Cx 00111000111
new value 00111011111 new value 00111001111 new value 00111000111
4-shift from 5th shift 5-shift from 5th shift 6-shift from 5th shift
shifted bit 00000011100 shifted bit 00000001110 shifted bit 00000000111
Cx 00111000111 Cx 00111000111 Cx 00111000111
new value 00111011111 new value 00111001111 new value 00111000111
4-shift from 6th shift 5-shift from 6th shift 6-shift from 6th shift
shifted bit 00000011100 shifted bit 00000001110 shifted bit 00000000111
Cx 00111000111 Cx 00111000111 Cx 00111000111
new value 00111011111 new value 00111001111 new value 00111000111
A 1-shift from the states reached by the 10th and 11th shifts leads to the state of the 4th shift, and the remaining shifts follow the first round of shifts.
d. The simple cycles are (4), (5), (6), (10), (11), (12), (4,5), (4,6), (4,10), (4,11), (4,12),
(5,6), (5,10), (5,11), (5,12), (4, 5, 6), (4, 5, 10), (4, 5, 11) and (4, 5, 12). The greedy cycles are
(4) and (4,5).
Problem 6.15 – A nonpipelined processor X has a clock rate of 25 MHz and an average CPI of 4.
Processor Y, an improved successor of X, is designed with a five-stage linear instruction pipeline.
However, due to the latch delay and clock skew effects, the clock rate of Y is only 20 MHz.
Answer :
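\[ T_X = \mathrm{CPI}_X \cdot I_c \cdot \tau_X = \frac{\mathrm{CPI}_X \cdot I_c}{f_X} = \frac{(4)(100)}{25 \cdot 10^{6}} = 16\ \mu\text{s} \]
and
\[ T_Y = [k + (n - 1)]\tau_Y = \frac{k + (n - 1)}{f_Y} = \frac{5 + (100 - 1)}{20 \cdot 10^{6}} = 5.2\ \mu\text{s} \]
\[ S_k = \frac{T_X}{T_Y} = \frac{16}{5.2} = 3.077 \approx 3.08 \]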
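\[ \mathrm{MIPS}_X = \frac{f_X}{\mathrm{CPI}_X \cdot 10^{6}} = \frac{25 \cdot 10^{6}}{4 \cdot 10^{6}} = 6.25 \text{ MIPS} \]
\[ T_Y = \mathrm{CPI}_Y \cdot I_c \cdot \tau_Y \;\leftrightarrow\; \mathrm{CPI}_Y = \frac{T_Y}{I_c \cdot \tau_Y} = \frac{T_Y \cdot f_Y}{I_c} = \frac{(5.2 \cdot 10^{-6})(20 \cdot 10^{6})}{100} = 1.04 \]
\[ \mathrm{MIPS}_Y = \frac{f_Y}{\mathrm{CPI}_Y \cdot 10^{6}} = \frac{20 \cdot 10^{6}}{(1.04) \cdot 10^{6}} = 19.23 \text{ MIPS} \]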
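The comparison of processors X and Y can be reproduced numerically. This is a sketch that assumes a program of 100 instructions, as used in the calculation above, and an ideal five-stage pipeline for Y; the function names are illustrative.

```python
def nonpipelined_time(cpi, n_instr, f_hz):
    """Execution time in seconds: CPI * I_c / f."""
    return cpi * n_instr / f_hz


def pipelined_time(k, n_instr, f_hz):
    """Execution time in seconds of an ideal k-stage pipeline: (k + n - 1) / f."""
    return (k + n_instr - 1) / f_hz


def mips(f_hz, cpi):
    """MIPS rate: f / (CPI * 10^6)."""
    return f_hz / (cpi * 1e6)


N = 100                                  # instruction count used in the answer
t_x = nonpipelined_time(4, N, 25e6)      # processor X
t_y = pipelined_time(5, N, 20e6)         # processor Y
cpi_y = t_y * 20e6 / N                   # effective CPI of Y

print(f"T_X = {t_x * 1e6:.1f} us, T_Y = {t_y * 1e6:.1f} us, speedup = {t_x / t_y:.2f}")
print(f"MIPS_X = {mips(25e6, 4):.2f}, CPI_Y = {cpi_y:.2f}, MIPS_Y = {mips(20e6, cpi_y):.2f}")
```

The effective CPI of Y stays slightly above 1 because the pipeline fill time of k − 1 = 4 cycles is spread over only 100 instructions.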