Assignment Nov 19


ASSIGNMENT QUESTIONS

1. Assume that we make an enhancement to a computer that improves some mode of
execution by a factor of 10. Enhanced mode is used 50% of the time, measured as a
percentage of the execution time when the enhanced mode is in use. Recall that
Amdahl’s law depends on the fraction of the original, unenhanced execution time that
could make use of enhanced mode. Thus, we cannot directly use this 50%
measurement to compute speedup with Amdahl’s law.
a. What is the speedup we have obtained from fast mode?
b. What percentage of the original execution time has been converted to fast mode?
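The question hinges on inverting Amdahl's law: the 50% is measured on the enhanced execution time, so the original enhanceable fraction must be solved for first. A minimal sketch of that inversion (variable names are my own, not from the text):

```python
# Sketch: inverting Amdahl's law when the 50% usage is measured on the
# ENHANCED execution time rather than the original one.
S = 10          # speedup of the enhanced mode
used_new = 0.5  # fraction of the enhanced run spent in enhanced mode

# If f is the fraction of ORIGINAL time that is enhanceable, then
#   (f/S) / ((1 - f) + f/S) = used_new;  solving for f:
f = used_new / (used_new + (1 - used_new) / S)

# Overall speedup from Amdahl's law
speedup = 1 / ((1 - f) + f / S)

print(f"f = {f:.4f}, speedup = {speedup:.2f}")  # f = 0.9091, speedup = 5.50
```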

2. Some microprocessors today are designed to have adjustable voltage, so a 15%
reduction in voltage may result in a 15% reduction in frequency. What would be
the impact on dynamic energy and on dynamic power?
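Under the standard CMOS approximations (dynamic energy per task proportional to V², dynamic power proportional to V²·f), the ratios fall out directly; a small sketch:

```python
# Sketch: dynamic energy scales with V^2, dynamic power with V^2 * f
# (standard CMOS approximations; 0.85 mirrors the 15% reduction given).
scale = 0.85

energy_ratio = scale ** 2          # energy per task: ~0.72 of original
power_ratio = scale ** 2 * scale   # power: ~0.61 of original
```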

3. Find the number of dies per 300 mm (30 cm) wafer for a die that is 1.5 cm on a
side and for a die that is 1.0 cm on a side.
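The dies-per-wafer question uses the standard estimate (wafer area over die area, minus an edge-loss term); a sketch assuming square dies:

```python
import math

# Sketch of the standard dies-per-wafer estimate for square dies:
#   dies = pi*(d/2)^2 / A  -  pi*d / sqrt(2*A)
def dies_per_wafer(wafer_cm, die_side_cm):
    area = die_side_cm ** 2
    dies = (math.pi * (wafer_cm / 2) ** 2) / area \
         - (math.pi * wafer_cm) / math.sqrt(2 * area)
    return int(dies)  # partial dies at the edge are wasted

print(dies_per_wafer(30, 1.5))  # 269
print(dies_per_wafer(30, 1.0))  # 640
```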

4. Assume that there are half as many D-cache accesses as I-cache accesses, and
that the I-cache and D-cache are responsible for 25% and 15% of the processor’s
power consumption in a normal four-way set associative implementation. Determine
if way selection improves performance per watt, based on the estimate that average
cache power consumption relative to a normal four-way set associative cache is
0.28 for the I-cache and 0.35 for the D-cache.

5. Which is more important for floating-point programs: two-way set associativity or
hit under one miss for the primary data caches? What about integer programs?
Assume the following average miss rates for 32 KB data caches: 5.2% for
floating-point programs with a direct-mapped cache, 4.9% for these programs with a
two-way set associative cache, 3.5% for integer programs with a direct-mapped
cache, and 3.2% for integer programs with a two-way set associative cache. Assume the miss
penalty to L2 is 10 cycles, and the L2 misses and penalties are the same.
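A useful first step is the average miss-penalty cost per data access (miss rate times L1-to-L2 penalty) for each configuration; the sketch below computes only that, leaving the hit-under-miss comparison to the reader:

```python
# First step only: average miss-penalty cycles per data access
# (miss rate x L1 miss penalty to L2) for each configuration.
penalty = 10  # cycles to L2

miss_rates = {
    ("fp", "direct"): 0.052, ("fp", "2-way"): 0.049,
    ("int", "direct"): 0.035, ("int", "2-way"): 0.032,
}
miss_cycles = {k: rate * penalty for k, rate in miss_rates.items()}

# Going two-way saves the same 0.03 cycles/access for both workloads;
# hit under one miss instead hides part of the 10-cycle penalty on the
# (more frequent) FP misses, which is the trade-off the question probes.
for k, c in sorted(miss_cycles.items()):
    print(k, round(c, 2))
```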

6. Your company has just bought a new Intel Core i5 dual core processor, and you
have been tasked with optimizing your software for this processor. You will run two
applications on this dual core, but the resource requirements are not equal. The first
application requires 80% of the resources, and the other only 20% of the resources.
Assume that when you parallelize a portion of the program, the speedup for that
portion is 2.

a. Given that 40% of the first application is parallelizable, how much speedup would
you achieve with that application if run in isolation?

b. Given that 99% of the second application is parallelizable, how much speedup
would this application observe if run in isolation?

c. Given that 40% of the first application is parallelizable, how much overall system
speedup would you observe if you parallelized it?

d. Given that 99% of the second application is parallelizable, how much overall
system speedup would you observe if you parallelized it?
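All four parts are applications of Amdahl's law with a parallel-portion speedup of 2; parts (c) and (d) additionally treat the 80%/20% resource shares as fractions of total system time (one reasonable reading of the question). A sketch:

```python
# Sketch: Amdahl's law with a parallel-portion speedup of 2.
def amdahl(frac_enhanced, speedup):
    return 1 / ((1 - frac_enhanced) + frac_enhanced / speedup)

# (a), (b): each application run in isolation
s1 = amdahl(0.40, 2)  # app 1: 1/(0.6 + 0.2) = 1.25
s2 = amdahl(0.99, 2)  # app 2: 1/(0.01 + 0.495) ~= 1.98

# (c), (d): the 80%/20% resource shares taken as the affected fraction
# of total system time (an assumption about the question's intent)
sys_c = 1 / (0.20 + 0.80 / s1)  # ~= 1.19
sys_d = 1 / (0.80 + 0.20 / s2)  # ~= 1.11
```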

7. You are designing a system for a real-time application in which specific deadlines
must be met. Finishing the computation faster gains nothing. You find that your
system can execute the necessary code, in the worst case, twice as fast as necessary.

a. How much energy do you save if you execute at the current speed and turn off the
system when the computation is complete?

b. How much energy do you save if you set the voltage and frequency to be half as
much?
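One common way to reason about this: dynamic power is proportional to V²·f and the computation takes a fixed number of cycles, so energy per task is proportional to V². A sketch under those assumptions, with the baseline taken as running at current speed for the full period:

```python
# Sketch under standard CMOS assumptions: power ~ V^2 * f, and the task
# is a fixed number of cycles, so energy per task ~ V^2.
baseline = 1.0  # energy if we run at current speed for the full period

# (a) run at full speed for half the period, then power off
energy_a = baseline * 0.5
saving_a = 1 - energy_a / baseline   # 50%

# (b) halve V and f: same cycle count, quarter the energy per cycle,
# and the run now just fills the deadline
energy_b = baseline * 0.5 ** 2
saving_b = 1 - energy_b / baseline   # 75%
```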

8. Your company is trying to choose between purchasing the Opteron or Itanium 2.
You have analyzed your company’s applications, and 60% of the time it will be
running applications similar to wupwise, 20% of the time applications similar to
ammp, and 20% of the time applications similar to apsi.

a. If you were choosing just based on overall SPEC performance, which would you
choose and why?

b. What is the weighted average of execution time ratios for this mix of applications
for the Opteron and Itanium 2?

9. Consider the usage of critical word first and early restart on L2 cache misses.
Assume a 1 MB L2 cache with 64 byte blocks and a refill path that is 16 bytes wide.
Assume that the L2 can be written with 16 bytes every 4 processor cycles, the time to
receive the first 16 byte block from the memory controller is 120 cycles, each
additional 16 byte block from main memory requires 16 cycles, and data can be
bypassed directly into the read port of the L2 cache. Ignore any cycles to transfer the
miss request to the L2 cache and the requested data to the L1 cache.

a. How many cycles would it take to service an L2 cache miss with and without
critical word first and early restart?

b. Do you think critical word first and early restart would be more important for L1
caches or L2 caches, and what factors would contribute to their relative importance?
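For part (a), the arithmetic reduces to when the needed 16-byte chunk arrives; a sketch taking the worst case (critical word last) for the no-optimization path:

```python
# Sketch for part (a): 64-byte block refilled in 16-byte chunks.
block, chunk = 64, 16
first = 120      # cycles until the first 16-byte chunk arrives
per_extra = 16   # cycles for each additional chunk from memory

# Without critical word first: worst case, wait for the whole block
without_cwf = first + (block // chunk - 1) * per_extra   # 168 cycles

# With critical word first + early restart: the needed chunk is sent
# first and bypassed directly into the L2 read port
with_cwf = first                                         # 120 cycles
```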

10. You are designing a write buffer between a write-through L1 cache and a
write-back L2 cache. The L2 cache write data bus is 16 B wide and can perform a
write to an independent cache address every 4 processor cycles.

a. How many bytes wide should each write buffer entry be?

b. What speedup could be expected in the steady state by using a merging write
buffer instead of a nonmerging buffer when zeroing memory by the execution of
64-bit stores if all other instructions could be issued in parallel with the stores and
the blocks are present in the L2 cache?

c. What would the effect of possible L1 misses be on the number of required write
buffer entries for systems with blocking and nonblocking caches?
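For part (b), the steady-state speedup follows from how many store bytes each 16 B bus transfer carries; a sketch assuming part (a)'s answer is one entry per bus transfer:

```python
# Sketch for part (b), assuming entries match the 16 B L2 write bus.
entry_bytes = 16   # part (a): one entry per 16 B bus transfer
store_bytes = 8    # 64-bit stores
bus_cycles = 4     # one L2 write every 4 processor cycles

# Non-merging: each 8 B store occupies its own 16 B bus transfer.
# Merging: two consecutive 8 B stores to the same block share one.
speedup = entry_bytes / store_bytes   # 2.0
```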

11. In this exercise, we examine how pipelining affects the clock cycle time of the
processor. Problems in this exercise assume that individual stages of the datapath
have the following latencies:

Also, assume that instructions executed by the processor are broken down as follows:

What is the clock cycle time in a pipelined and non-pipelined processor?
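The latency table did not survive in this copy, so the stage latencies below are hypothetical placeholders; the method, however, is standard: the pipelined clock is set by the slowest stage, while the non-pipelined clock must cover all stages in sequence.

```python
# HYPOTHETICAL stage latencies (in ps) -- the question's table is
# missing here; these values only illustrate the method.
stages = {"IF": 250, "ID": 350, "EX": 150, "MEM": 300, "WB": 200}

pipelined = max(stages.values())      # slowest stage sets the clock
non_pipelined = sum(stages.values())  # one instruction crosses all stages

print(pipelined, non_pipelined)  # 350 1250
```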

12. Let us consider a computer executing the following mix of instructions:

What is the average CPI, assuming a clock frequency of 500 MHz? What is the
throughput expressed in MIPS?
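The instruction-mix table is also missing from this copy, so the mix below is hypothetical; only the 500 MHz clock comes from the question. The calculation itself is the standard weighted-average CPI followed by MIPS = clock(MHz)/CPI.

```python
# HYPOTHETICAL instruction mix (fraction, CPI) -- placeholder values;
# only the 500 MHz clock is given by the question.
clock_mhz = 500
mix = [
    (0.50, 1),  # ALU
    (0.30, 2),  # load/store
    (0.20, 3),  # branch
]

cpi = sum(f * c for f, c in mix)   # weighted-average CPI
mips = clock_mhz / cpi             # MIPS = clock (MHz) / CPI
```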

13. Consider the following program code. What are the contents of ROB at the time
when we have issued all the instructions in the loop twice? Let’s also assume that the
fload and fmul.d from the first iteration have committed and all other instructions
have completed execution.

Loop: fload R1, 0(R2)
      fmul.d R3, R1, R4
      fsd R3, 0(R2)
      addi R2, R2, -8
      bne R2, R5, Loop

14. How many bits are in a (0,4) branch predictor with 8K entries? Calculate the
number of entries in a (2,2) predictor with the same number of bits.
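An (m, n) correlating predictor keeps 2^m n-bit counters per entry, which gives both answers directly; a sketch:

```python
# Sketch: an (m, n) predictor keeps 2^m n-bit counters per entry,
# so total bits = 2^m * n * entries.
def predictor_bits(m, n, entries):
    return (2 ** m) * n * entries

bits_0_4 = predictor_bits(0, 4, 8 * 1024)   # bits in the (0,4) predictor
entries_2_2 = bits_0_4 // ((2 ** 2) * 2)    # (2,2) entries in that budget

print(bits_0_4, entries_2_2)  # 32768 4096
```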

15. Using the code segment given below, show what the status tables look like when
the first three instructions have completed write result stage and all other instructions
have completed execution.

16. Assume a five-stage single-pipeline microarchitecture (fetch, decode, execute,
memory, write-back) and the code is given below. All ops are one cycle except LW
and SW, which are 1 + 2 cycles, and branches, which are 1 + 1 cycles. There is no
forwarding. Show the phases of each instruction per clock cycle for one iteration of
the loop.
a. How many clock cycles per loop iteration are lost to branch overhead?
b. Assume a static branch predictor, capable of recognizing a backwards
branch in the Decode stage. Now how many clock cycles are wasted on branch
overhead?
c. Assume a dynamic branch predictor. How many cycles are lost on a correct
prediction?

Loop: LW R3,0(R0)
LW R1,0(R3)
ADDI R1,R1,#1
SUB R4,R3,R2
SW R1,0(R3)
BNZ R4, Loop

17. If you ever get confused about what a register renamer has to do, go
back to the assembly code you’re executing, and ask yourself what has to happen for
the right result to be obtained. For example, consider a three-way superscalar
machine renaming these three instructions concurrently:
ADDI R1, R1, R1
ADDI R1, R1, R1
ADDI R1, R1, R1
If the value of R1 starts out as 5, what should its value be when this sequence has
executed?
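The point of the exercise is that each instruction must read the previous instruction's renamed result: every ADDI doubles R1, so the correct final value is fixed regardless of how renaming is implemented. A one-line check:

```python
# Each ADDI R1, R1, R1 doubles R1; correct renaming must chain the
# three results: 5 -> 10 -> 20 -> 40.
r1 = 5
for _ in range(3):
    r1 = r1 + r1
print(r1)  # 40
```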

18. The largest configuration of a Cray T90 (Cray T932) has 32 processors, each
capable of generating 4 loads and 2 stores per clock cycle. The processor clock
cycle is 2.167 ns, while the cycle time of the SRAMs used in the memory system
is 15 ns. Calculate the minimum number of memory banks required to allow all
processors to run at full memory bandwidth.
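Since each bank is busy for one 15 ns SRAM cycle per access, the bank count must cover all references issued during that window; a sketch:

```python
import math

# Sketch: each bank is busy for one SRAM cycle per access, so banks
# must cover all references issued while a bank is still busy.
refs_per_clock = 32 * (4 + 2)        # 192 references per CPU clock
busy_clocks = math.ceil(15 / 2.167)  # SRAM cycle in CPU clocks = 7

banks = refs_per_clock * busy_clocks
print(banks)  # 1344
```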

19. Suppose we have 8 memory banks with a bank busy time of 6 clocks and a total
memory latency of 12 cycles. How long will it take to complete a 64-element
vector load with a stride of 1? With a stride of 32?
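The two strides behave very differently because 32 mod 8 = 0: unit stride rotates through all banks, while stride 32 serializes on a single bank. A sketch of the usual accounting:

```python
# Sketch: 8 banks, 6-clock bank busy time, 12-cycle latency, 64 elements.
latency, busy, banks, n = 12, 6, 8, 64

# Stride 1: consecutive addresses rotate through all 8 banks, and
# 8 banks >= 6-clock busy time, so one element completes per clock.
stride_1 = latency + n                    # 76 cycles

# Stride 32: 32 mod 8 == 0, so every access hits the same bank and
# must wait out the busy time of its predecessor.
stride_32 = latency + 1 + busy * (n - 1)  # 391 cycles
```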

20. Consider a loop like this one:


for (i=0; i<100; i=i+1) {
A[i+1] = A[i] + C[i]; /* S1 */
B[i+1] = B[i] + A[i+1]; /* S2 */
}
Assume that A, B, and C are distinct, nonoverlapping arrays. (In practice, the
arrays may sometimes be the same or may overlap. Because the arrays may be
passed as parameters to a procedure that includes this loop, determining whether
arrays overlap or are identical often requires sophisticated, interprocedural analysis
of the program.) What are the data dependences among the statements S1 and
S2 in the loop?
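One way to see the dependences concretely is to unroll two iterations: S1 of iteration i+1 reads A[i+1], which S1 of iteration i wrote (a loop-carried dependence), and S2 of iteration i reads A[i+1], which S1 of the same iteration wrote. A sketch that simulates the loop to confirm the value flow (array contents are my own placeholders):

```python
# Sketch: simulate the loop on small placeholder arrays to see which
# writes feed which reads (values are arbitrary, not from the text).
N = 5
A = [1.0] * (N + 1)
B = [2.0] * (N + 1)
C = [3.0] * (N + 1)

for i in range(N):
    A[i + 1] = A[i] + C[i]      # S1: reads A[i] written by S1 of iter i-1
    B[i + 1] = B[i] + A[i + 1]  # S2: reads A[i+1] written by S1, same iter

# S1 -> S1 across iterations (loop-carried, via A);
# S1 -> S2 within an iteration (via A[i+1]);
# S2 -> S2 across iterations (loop-carried, via B).
```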

21. Assume a GPU architecture that contains 10 SIMD processors. Each SIMD
instruction has a width of 32 and each SIMD processor contains 8 lanes for single-
precision arithmetic and load/store instructions, meaning that each non-diverged
SIMD instruction can produce 32 results every 4 cycles. Assume a kernel that has
divergent branches that cause on average 80% of threads to be active. Assume that
70% of all SIMD instructions executed are single-precision arithmetic and 20% are
load/store. Since not all memory latencies are covered, assume an average SIMD
instruction issue rate of 0.85. Assume that the GPU has a clock speed of 1.5 GHz.
Compute the throughput, in GFLOP/sec, for this kernel on this GPU.
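The throughput falls out of multiplying the peak result rate by each utilization factor the question lists; a sketch:

```python
# Sketch: peak result rate times the utilization factors given.
clock_hz = 1.5e9
simd_processors = 10
results_per_cycle = 32 / 4   # 32 results every 4 cycles per processor

gflops = (clock_hz * simd_processors * results_per_cycle
          * 0.85   # SIMD instruction issue rate
          * 0.70   # fraction of instructions that are SP arithmetic
          * 0.80   # average fraction of active threads
          ) / 1e9

print(round(gflops, 2))  # 57.12
```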

22. Show the code for MIPS and VMIPS for the following DAXPY loop. Assume
that the starting addresses of X and Y are in Rx and Ry, respectively.

23. You have been asked to investigate the relative performance of a banked versus
pipelined L1 data cache for a new microprocessor. Assume a 64 KB two-way set
associative cache with 64 byte blocks. The pipelined cache would consist of three
pipe stages, similar in capacity to the Alpha 21264 data cache. A banked
implementation would consist of two 32 KB two-way set associative banks. Use
CACTI and assume a 65 nm (0.065 μm) technology to answer the following
questions. The cycle time output in the Web version shows at what frequency a cache
can operate without any bubbles in the pipeline. What is the cycle time of the cache
in comparison to its access time, and how many pipestages will the cache take up (to
two decimal places)?

24. Assume a hypothetical GPU with the following characteristics:

■ Clock rate 1.5 GHz
■ Contains 16 SIMD processors, each containing 16 single-precision floating point
units
■ Has 100 GB/sec off-chip memory bandwidth

Without considering memory bandwidth, what is the peak single-precision
floating-point throughput for this GPU in GFLOP/sec, assuming that all memory
latencies can be hidden? Is this throughput sustainable given the memory
bandwidth limitation?
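A sketch of the peak calculation, counting one FLOP per FPU per cycle (i.e., not crediting fused multiply-add as two, which is an assumption):

```python
# Sketch: peak SP throughput = processors x FPUs x clock, assuming
# 1 FLOP per FPU per cycle (no FMA counted as 2 -- an assumption).
peak_gflops = 16 * 16 * 1.5     # 384 GFLOP/sec

# Sustainability check: at 4 bytes per SP operand, 100 GB/sec delivers
# at most 25 G operands/sec, far below 384 GFLOP/sec, so the peak is
# only sustainable if most operands come from on-chip reuse.
operands_per_sec_g = 100 / 4
```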

25. Show how the following code sequence lays out in convoys, assuming a single
copy of each vector functional unit:
How many chimes will this vector sequence take? How many cycles per FLOP
(floating-point operation) are needed, ignoring vector instruction issue overhead?
