Assignment Questions


CS605 - High Performance Computer Architecture

Assignment

Deadline for submission: 31.12.2021

1. Consider the sequence of machine instructions given below:


MUL R5, R0, R1
DIV R6, R2, R3
ADD R7, R5, R6
SUB R8, R7, R4
In the above sequence, R0 to R8 are general purpose registers. In the instructions
shown, the first register stores the result of the operation performed on the second and
the third registers. This sequence of instructions is to be executed in a pipelined
instruction processor with the following 4 stages: (1) Instruction Fetch and Decode (IF),
(2) Operand Fetch (OF), (3) Perform Operation (PO) and (4) Write back the Result (WB).
The IF, OF and WB stages take 1 clock cycle each for any instruction. The PO stage takes
1 clock cycle for an ADD or SUB instruction, 3 clock cycles for a MUL instruction, and 5
clock cycles for a DIV instruction. The pipelined processor uses operand forwarding from
the PO stage to the OF stage. How many clock cycles will the above sequence of
instructions take?

2. Consider the execution of the following code segment


DIV.D F4, F10, F9
MUL.D F5, F1, F4
ADD.D F1, F2, F3
MUL.D F7, F8, F9
on a processor which uses Tomasulo’s algorithm to dynamically schedule instructions
(single issue per cycle - no speculation) with the following non-pipelined execution units:
a 2-cycle FP add unit,
a 3-cycle FP multiply unit, and
a 6-cycle FP divide unit.
Trace the execution by showing the instruction status, the reservation station status and
register results status at the end of cycles 3, 6, 9, 12 and 15. Assume that two reservation
stations are used for each functional unit (add1, add2, mult1, mult2, div1, div2).

3. 30% of a benchmark program’s execution time is from multiply operations. Uber cool
hardware speeds up these operations by a factor of 12! If the program took 20 seconds to
execute without the enhanced hardware, what is the overall speedup achieved? During its
enhanced operation, what is the new execution time, and what is the percentage of time
multiply operations take?
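A quick way to sanity-check the arithmetic for this question is Amdahl's law. The sketch below is not the graded derivation, just a check; all input values come from the problem statement.

```python
# Amdahl's law check for question 3 (all inputs from the problem statement).
frac_multiply = 0.30   # fraction of execution time spent in multiplies
speedup_mult = 12      # hardware speedup for multiply operations
t_old = 20.0           # original execution time in seconds

# Overall speedup: 1 / ((1 - f) + f / s)
overall_speedup = 1 / ((1 - frac_multiply) + frac_multiply / speedup_mult)
t_new = t_old / overall_speedup

# Time still spent in multiplies after the speedup, as a share of t_new
mult_time_new = (frac_multiply / speedup_mult) * t_old
mult_pct = 100 * mult_time_new / t_new

print(f"overall speedup = {overall_speedup:.4f}")  # 1.3793
print(f"new exec time   = {t_new:.2f} s")          # 14.50 s
print(f"multiply share  = {mult_pct:.2f} %")       # 3.45 %
```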

4. Assume a five-stage single-pipeline microarchitecture (fetch, decode, execute, memory,
write-back) and the code given below. All ops are one cycle except LW and SW, which
are 1 + 2 cycles, and branches, which are 1 + 1 cycles. There is no forwarding. Show the
phases of each instruction per clock cycle for one iteration of the loop.
a) How many clock cycles per loop iteration are lost to branch overhead?
b) Assume a static branch predictor capable of recognizing a backwards branch in the
Decode stage. Now how many clock cycles are wasted on branch overhead?
c) Assume a dynamic branch predictor. How many cycles are lost on a correct
prediction?
Loop: LW R3,0(R0)
LW R1,0(R3)
ADDI R1,R1,#1
SUB R4,R3,R2
SW R1,0(R3)
BNZ R4, Loop

5. The largest configuration of a Cray T90 (Cray T932) has 32 processors, each
capable of generating 4 loads and 2 stores per clock cycle. The processor clock
cycle is 2.167 ns, while the cycle time of the SRAMs used in the memory system is 15 ns.
Calculate the minimum number of memory banks required to allow all processors to run at
full memory bandwidth.
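The usual approach here is: total memory references per clock, times the number of processor clocks each bank stays busy serving one reference. The sketch below follows that approach as a sanity check; all numbers come from the problem statement.

```python
import math

# Minimum memory bank count for question 5 (inputs from the problem statement).
processors = 32
refs_per_clock = 4 + 2   # 4 loads + 2 stores per processor per clock cycle
clock_ns = 2.167         # processor clock cycle time
sram_ns = 15.0           # SRAM cycle time = time a bank is busy per access

total_refs_per_clock = processors * refs_per_clock   # 192 references/clock
bank_busy_clocks = math.ceil(sram_ns / clock_ns)     # bank busy for 7 clocks
min_banks = total_refs_per_clock * bank_busy_clocks  # 1344 banks

print(min_banks)  # 1344
```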

6. Consider a loop like this one:


for (i = 0; i < 100; i = i + 1) {
    A[i+1] = A[i] + C[i];   /* S1 */
    B[i+1] = B[i] + A[i+1]; /* S2 */
}
Assume that A, B, and C are distinct, non-overlapping arrays. (In practice, the
arrays may sometimes be the same or may overlap. Because the arrays may be
passed as parameters to a procedure that includes this loop, determining whether
arrays overlap or are identical often requires sophisticated, interprocedural analysis of the
program.) What are the data dependences among the statements S1 and S2 in the loop?

7. Assume a GPU architecture that contains 10 SIMD processors. Each SIMD instruction
has a width of 32 and each SIMD processor contains 8 lanes for single-precision
arithmetic and load/store instructions, meaning that each non-diverged SIMD instruction
can produce 32 results every 4 cycles. Assume a kernel that has divergent branches that
cause on average 80% of threads to be active. Assume that 70% of all SIMD instructions
executed are single-precision arithmetic and 20% are load/store. Since not all memory
latencies are covered, assume an average SIMD instruction issue rate of 0.85. Assume
that the GPU has a clock speed of 1.5 GHz. Compute the throughput, in GFLOP/sec, for
this kernel on this GPU.
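One way to organize this computation: peak single-precision results per cycle across all SIMD processors, scaled down by the arithmetic fraction, the active-thread fraction, and the issue rate. The sketch below uses only values given in the problem statement and is offered as an arithmetic check, not the required derivation.

```python
# Throughput check for question 7 (all inputs from the problem statement).
simd_processors = 10
results_per_instr = 32   # SIMD width: results per non-diverged instruction
cycles_per_instr = 4     # 8 lanes -> 32 results every 4 cycles
clock_ghz = 1.5
frac_arith = 0.70        # fraction of SIMD instructions that are SP arithmetic
active_fraction = 0.80   # average fraction of threads active under divergence
issue_rate = 0.85        # average SIMD instruction issue rate

gflops = (simd_processors
          * (results_per_instr / cycles_per_instr)  # 8 results/cycle/processor
          * clock_ghz
          * frac_arith * active_fraction * issue_rate)

print(f"{gflops:.2f} GFLOP/sec")  # 57.12 GFLOP/sec
```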

8. You have been asked to investigate the relative performance of a banked versus pipelined
L1 data cache for a new microprocessor. Assume a 64 KB two-way set associative cache
with 64 byte blocks. The pipelined cache would consist of three pipe stages, similar in
capacity to the Alpha 21264 data cache. A banked implementation would consist of two
32 KB two-way set associative banks. Use CACTI and assume a 65 nm (0.065 μm)
technology to answer the following questions. The cycle time output in the Web version
shows at what frequency a cache can operate without any bubbles in the pipeline. What
is the cycle time of the cache in comparison to its access time, and how many pipe stages
will the cache take up (to two decimal places)?

9. Show how the following code sequence lays out in convoys, assuming a single
copy of each vector functional unit:
How many chimes will this vector sequence take? How many cycles per FLOP
(floating-point operation) are needed, ignoring vector instruction issue overhead?

******