ACA Notes
A microprocessor consists of an ALU, a register array, and a control unit. The ALU performs
arithmetic and logical operations on data received from memory or an input device.
The instruction set architecture is the part of the processor architecture needed to create
machine-level programs that perform mathematical or logical operations. The instruction set
architecture acts as an interface between hardware and software, defining how the processor
responds to the commands issued by the programmer.
The performance of a processor is strongly influenced by its instruction set architecture.
The Instruction Set Architecture (ISA) is the part of the processor that is visible to the programmer
or compiler writer; it serves as the boundary between software and hardware. We briefly
describe the instruction set classes found in many of the microprocessors used today.
• Stack
• Accumulator
• GPR (general-purpose register)
Advantages: Makes code generation easy. Data can be stored for long periods in registers.
Disadvantages: All operands must be named, leading to longer instructions.
Earlier CPUs were of the first two types, but in the last 15 years all CPUs made have been GPR processors.
The two major reasons are that registers are faster than memory, so the more data that can be kept
internally in the CPU the faster the program will run, and that registers are easier
for a compiler to use.
Programmability
– easy to express programs
– backward/forward/upward compatibility
– implementability & programmability across generations
2.3 RISC vs CISC
In the early days, machines were programmed in assembly language and memory access was
slow. To perform complex arithmetic operations, compilers had to generate long sequences of
machine code.
This led designers to build architectures that access memory less frequently and reduce the
burden on the compiler, which resulted in very powerful but complex instruction sets.
Microprogramming is easy to implement and much less expensive than hard-wiring a control
unit.
It is easy to add new commands to the chip without changing the structure of the instruction
set, because the architecture uses general-purpose hardware to carry out commands.
This architecture makes efficient use of main memory, since the complexity (or greater
capability) of each instruction means that fewer instructions are needed to achieve a given task.
The compiler need not be very complicated, as the microprogram instruction sets can be
written to match the constructs of high-level languages.
Each succeeding version of a CISC processor includes the instruction set of the earlier
generations as a subset. Therefore, the chip hardware and the instruction set become more complex
with each generation of the processor.
The overall performance of the machine is reduced because of the slower clock speed.
CISC designs also include complex hardware and on-chip software in order to perform many
functions.
In the RISC architecture, the instruction set of the processor is simplified to reduce execution time. It
uses a small and highly optimized set of instructions, which are generally register-to-register
operations.
Execution speed is increased by using a smaller number of instructions, and a pipelining
technique is used to execute each instruction.
The performance of RISC processors is often two to four times that of CISC processors
because of the simplified instruction set.
This architecture uses less chip area because of the reduced instruction set, which makes room for
extra functions such as floating-point arithmetic units or memory management units on the same chip.
Per-chip cost is reduced because the smaller chips allow more parts to be produced on a single
silicon wafer.
RISC processors can be designed more quickly than CISC processors because of their simpler
architecture.
Instruction throughput in RISC processors is high because many registers are available for
holding and passing operands, compared with CISC processors.
The performance of a RISC processor depends on the code that is being executed. When the
compiler does a poor job of scheduling instruction execution, the processor spends much time
waiting for the result of one instruction before it can proceed with the next.
RISC processors require very fast memory systems to feed them instructions. Typically, a large
memory cache is provided on the chip in most RISC-based systems.
Feature | RISC | CISC
Acronym | Stands for 'Reduced Instruction Set Computer'. | Stands for 'Complex Instruction Set Computer'.
Definition | RISC processors have a smaller set of instructions with few addressing modes. | CISC processors have a larger set of instructions with many addressing modes.
Memory unit | Has no memory unit and uses separate hardware to implement instructions. | Has a memory unit to implement complex instructions.
Program | Has a hard-wired programming unit. | Has a micro-programming unit.
Design | Easy compiler design. | Complex compiler design.
Calculations | Calculations are faster and precise. | Calculations are slow and precise.
Decoding | Decoding of instructions is simple. | Decoding of instructions is complex.
Time | Execution time is very low. | Execution time is very high.
External memory | Does not require external memory for calculations. | Requires external memory for calculations.
Pipelining | Pipelining functions correctly. | Pipelining does not function correctly.
Stalling | Stalling is mostly reduced. | The processors often stall.
Code expansion | Code expansion can be a problem. | Code expansion is not a problem.
Disc space | Space is saved. | Space is wasted.
Applications | Used in high-end applications such as video processing, telecommunications and image processing. | Used in low-end applications such as security systems, home automation, etc.
The hardware level works on dynamic parallelism, whereas the software level works on static parallelism:
• Dynamic parallelism means the processor decides at run time which instructions to execute in parallel.
• Static parallelism means the compiler decides which instructions to execute in parallel.
If the stages are perfectly balanced, then under ideal conditions the time per instruction on the
pipelined processor is equal to the time per instruction on the unpipelined machine divided by the
number of pipe stages.
The speedup of a k-stage pipeline executing n tasks is
S k = Time taken in non-pipelined system / Time taken in a k-stage pipelined system
= n Tn / ((k + n - 1) Tp)
where Tn is the time to complete one task without pipelining and Tp is the pipeline stage (clock) time.
As the number of tasks increases, n becomes much larger than k - 1, and (k + n - 1) approaches
the value of n, so the speedup becomes
S k = Tn / Tp
If we assume that the time it takes to process a task is the same in the pipelined and the
non-pipelined circuits, we have Tn = k Tp. With this assumption, the speedup becomes:
S k = k Tp / Tp = k
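As a quick illustration of these formulas, the short C sketch below (a minimal example, not taken from the notes; the stage count, task count, and stage time are assumed values) computes the speedup of a 5-stage pipeline processing 100 tasks and shows that it approaches k.

    #include <stdio.h>

    int main(void) {
        int k = 5;           /* number of pipeline stages (assumed)           */
        int n = 100;         /* number of tasks/instructions (assumed)        */
        double Tp = 1.0;     /* pipeline stage (clock) time in ns (assumed)   */
        double Tn = k * Tp;  /* time per task without pipelining, Tn = k * Tp */

        /* S_k = n*Tn / ((k + n - 1) * Tp) */
        double Sk = (n * Tn) / ((k + n - 1) * Tp);
        printf("Speedup with %d tasks: %.2f\n", n, Sk);   /* about 4.81   */
        printf("Limiting speedup as n grows: %d\n", k);   /* approaches k */
        return 0;
    }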
Example: Consider an unpipelined processor with a 1 ns clock cycle, in which ALU operations and
branches take 4 clock cycles and memory operations take 5 clock cycles, with relative frequencies
of 40%, 20%, and 40%, respectively. Suppose that pipelining the processor adds 0.2 ns of overhead
to the clock cycle. How much speedup do we gain from pipelining?
Answer
The average instruction execution time on the unpipelined processor is
Average instruction execution time = Clock cycle x Average CPI
= 1 ns x [(40% + 20%) × 4 + 40% x 5]
= 1 ns x 4.4
= 4.4 ns
In the pipelined implementation, the clock must run at the speed of the slowest stage plus
overhead, which will be 1 + 0.2 or 1.2 ns; this is the average instruction execution time. Thus,
the speedup from pipelining is 4.4 ns / 1.2 ns ≈ 3.7 times.
The 0.2 ns overhead essentially establishes a limit on the effectiveness of pipelining. If the
overhead is not affected by changes in the clock cycle, Amdahl’s law tells us that the overhead
limits the speedup.
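The same calculation can be written out in a few lines of C (a sketch of this worked example, using the figures above):

    #include <stdio.h>

    int main(void) {
        double clock = 1.0;                            /* unpipelined clock cycle, ns  */
        double cpi = 0.4 * 4 + 0.2 * 4 + 0.4 * 5;      /* average CPI = 4.4            */
        double unpipelined = clock * cpi;              /* 4.4 ns per instruction       */
        double pipelined = 1.0 + 0.2;                  /* slowest stage + overhead, ns */
        printf("Speedup = %.2f\n", unpipelined / pipelined);   /* about 3.67           */
        return 0;
    }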
4. Execution (EX)
5. Write-back (WB)
Consider two operations on a variable A whose initial value is 5: first A <- 3 + A, then B <- 4 × A.
When these operations are performed in the order given, the result is B = 32. But if they
are performed concurrently, the value of A used in computing B would be the original
value, 5, leading to an incorrect result (B = 20).
RAW (read-after-write) dependence: I2 uses the result of I1.
I1. R2 <- R1 + R3
I2. R4 <- R2 + R3
WAR (write-after-read) dependence: I2 writes R5, which I1 reads.
I1. R4 <- R1 + R5
I2. R5 <- R1 + R2
WAW (write-after-write) dependence: both instructions write R2.
I1. R2 <- R4 + R7
I2. R2 <- R1 + R3
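The checks behind these classifications can be written out explicitly. The C sketch below is illustrative only; the encoding of an instruction as one destination and two source registers is an assumption made for brevity.

    #include <stdio.h>

    /* Simplified instruction: one destination and two source registers. */
    struct instr { int dst, src1, src2; };

    static void classify(struct instr i1, struct instr i2) {
        if (i2.src1 == i1.dst || i2.src2 == i1.dst)
            printf("RAW hazard: I2 reads R%d, which I1 writes\n", i1.dst);
        if (i2.dst == i1.src1 || i2.dst == i1.src2)
            printf("WAR hazard: I2 writes R%d, which I1 reads\n", i2.dst);
        if (i2.dst == i1.dst)
            printf("WAW hazard: both instructions write R%d\n", i2.dst);
    }

    int main(void) {
        struct instr a = {2, 1, 3};   /* I1. R2 <- R1 + R3    */
        struct instr b = {4, 2, 3};   /* I2. R4 <- R2 + R3    */
        classify(a, b);               /* reports the RAW case */
        return 0;
    }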
3.6.2.1 OPERAND FORWARDING
Operand forwarding (or data forwarding) is an optimization in pipelined CPUs to limit performance
deficits which occur due to pipeline stalls. A data hazard can lead to a pipeline stall when the current
operation has to wait for the results of an earlier operation which has not yet finished.
The data hazard arises because one instruction, instruction I2, is waiting for data to be written in the
register file. However, these data are available at the output of the ALU once the Execute stage
completes step E (Execution) of I1. Hence, the delay can be reduced, or possibly eliminated, if we
arrange for the result of instruction I1 to be forwarded directly for use in step E (Execution) of I2.
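A minimal sketch of that forwarding decision is shown below (not from the notes; the pipeline-register fields are assumptions). The EX stage uses the ALU output of the previous instruction instead of the stale register-file value whenever the destination held in the EX/MEM pipeline register matches a source of the instruction currently executing.

    #include <stdio.h>

    /* Contents of the EX/MEM pipeline register (simplified). */
    struct ex_mem {
        int reg_write;    /* will the earlier instruction write a register? */
        int dst;          /* its destination register                       */
        int alu_result;   /* value already available at the ALU output      */
    };

    /* Operand value the EX stage should use for source register 'src'. */
    static int forward_operand(struct ex_mem prev, int src, int regfile_value) {
        if (prev.reg_write && prev.dst == src)
            return prev.alu_result;   /* forward from the ALU output, avoiding a stall */
        return regfile_value;         /* no hazard: use the register-file value        */
    }

    int main(void) {
        struct ex_mem i1 = {1, 2, 42};               /* I1 writes R2; ALU result is 42 */
        printf("%d\n", forward_operand(i1, 2, 7));   /* prints 42, not the stale 7     */
        return 0;
    }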
Consider the pipelined execution of these instructions:
1 2 3 4 5 6 7 8 9
F1 D1 E1 M1 W1
F2 D2 E2 M2 W2
F3 D3 E3 M3 W3
F4 D4 E4 M4 W4
F5 D5 E5 M5 W5
Data Hazards Requiring Stalls
1 2 3 4 5 6 7 8 9 10 11 12
F1 D1 E1 M1 W1
F2 stall stall stall D2 E2 M2 W2
F3 D3 E3 M3 W3
F4 D4 E4 M4 W4
F5 D5 E5 M5 W5
1 2 3 4 5 6 7 8 9 10 11
F1 D1 E1 M1 W1
F2 stall stall D2 E2 M2 W2
F3 D3 E3 M3 W3
F4 D4 E4 M4 W4
F5 D5 E5 M5 W5
Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline
stalls.
The decision to branch cannot be made until the execution of that instruction has been completed.
Branch instructions represent about 20% of the dynamic instruction count of most programs.
The effectiveness of the delayed branch approach depends on how often it is possible to reorder
instructions.
3.6.3.5 Loop Unrolling
Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to
optimize a program's execution speed at the expense of its binary size, which is an approach known as
the space-time tradeoff. The transformation can be undertaken manually by the programmer or by an
optimizing compiler.
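For instance (an illustrative sketch, not taken from the notes), unrolling a simple summation loop by a factor of four reduces the loop-control overhead per element at the cost of a larger loop body:

    #include <stdio.h>

    /* Original loop. */
    long sum_simple(const int *a, int n) {
        long s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* The same loop unrolled by a factor of 4 (n is assumed to be a multiple of 4). */
    long sum_unrolled(const int *a, int n) {
        long s = 0;
        for (int i = 0; i < n; i += 4) {
            s += a[i];
            s += a[i + 1];
            s += a[i + 2];
            s += a[i + 3];
        }
        return s;
    }

    int main(void) {
        int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        printf("%ld %ld\n", sum_simple(a, 8), sum_unrolled(a, 8));   /* both print 36 */
        return 0;
    }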
Branch folding is a technique where, on the prediction of most branches, the branch instruction is
completely removed from the instruction stream presented to the execution pipeline. Branch folding
can significantly improve the performance of branches, taking the CPI for branches significantly below 1.
Superscalar Architectures
A superscalar processor is a computer designed to improve the performance of the execution of scalar instructions.
A scalar is a variable that can hold only one atomic value at a time, e.g., an integer or a real.
A scalar architecture processes one data item at a time
Examples of non-scalar variables: Arrays, Matrices, Records
In a superscalar architecture (SSA), several scalar instructions can be initiated simultaneously and executed
independently.
Pipelining allows also several instructions to be executed at the same time, but they have to be in different pipeline
stages at a given moment.
SSA includes all features of pipelining but, in addition, there can be several instructions executing simultaneously in
the same pipeline stage.
In a superscalar processor, multiple instruction pipelines are required. This implies that multiple instructions are
issued per cycle and multiple results are generated per cycle.
Superscalar processors are designed to exploit more instruction-level parallelism in user programs. Only independent
instructions can be executed in parallel without causing a wait state.
The effective CPI of a superscalar processor should be lower than that of a generic scalar RISC processor.
Example: IBM RS/6000
Vector Pipelines:
• In a scalar processor, each scalar instruction executes only one operation over one data element.
• Each vector instruction executes a string of operations, one for each element in the vector
It is a load/store architecture: operations are always performed on registers. It uses the "register window" concept, thus
offering a large number of registers. It uses a delay slot to optimize branch instructions, and passes arguments using
registers and the stack.
Modules
ARM architecture
• 32-bit RISC-processor core (32-bit instructions)
• 37 32-bit integer registers (16 available at a time)
• Pipelined (ARM7: 3 stages)
• Cached (depending on the implementation)
• Von Neumann-type bus structure (ARM7), Harvard (ARM9)
• 8 / 16 / 32 -bit data types
• 7 modes of operation (usr, fiq, irq, svc, abt, sys, und)
• Simple structure -> reasonably good speed / power consumption ratio
ARM7TDMI
ARM7TDMI is a core processor module embedded in many ARM7 microprocessors, such as ARM720T, ARM710T,
ARM740T, and Samsung’s KS32C50100. It is the most complex processor core module in ARM7 series.
– T: capable of executing the Thumb instruction set
– D: features the IEEE Std. 1149.1 JTAG boundary-scan debugging interface
– M: features a Multiply-and-Accumulate (MAC) unit for DSP applications
– I: features support for an embedded In-Circuit Emulator
• Three pipeline stages: instruction fetch, decode, and execute
Features
• A 32-bit RISC processor core capable of executing 16-bit instructions (Von Neumann architecture)
- Memory Access
– Data can be
• 8-bit (bytes)
• 16-bit (half words)
• 32-bit (words)
• Memory Interface
– Can interface to SRAM, ROM, DRAM
– Has four basic types of memory cycle
• idle cycle
• nonsequential cycle
• sequential cycle
• coprocessor register cycle
• 32-bit address bus
• 32-bit data bus
– D[31:0]: Bidirectional data bus
– DIN[31:0]: Unidirectional input bus
– DOUT[31:0]: Unidirectional output bus
• Control signals
– Specify the size of the data to be transferred and the direction
of the transfer
ARM7TDMI Block Diagram
Benefit
The major benefit of superpipelining is the increase in the number of instructions which can be in the
pipeline at one time and hence the level of parallelism.
Drawbacks
The larger number of instructions "in flight" (i.e., in some part of the pipeline) at any time increases the
potential for data dependencies to introduce stalls. Simulation studies have suggested that a pipeline
depth of more than 8 stages tends to be counter-productive.
Note that some recent processors, e.g. the MIPS R10000, can be described as both superscalar
(they have multiple processing units) and superpipelined (there are more than 5 stages in the
pipeline).
4 MEMORY
• Cache includes tags to identify which block of main memory is in each cache slot
Direct Mapping
The simplest way to determine the cache location in which to store a memory block is the direct-mapping technique. In
this technique, block j of the main memory maps onto block j modulo 128 of a cache with 128 blocks.
Associative Mapping
This is the most flexible mapping method, in which a main memory block can be placed into any cache block position. In
this case, 12 tag bits are required to identify a memory block when it is resident in the cache. The tag bits of an
address received from the processor are compared to the tag bits of each block of the cache to see if the desired
block is present. This is called the associative-mapping technique.
Set-Associative Mapping
The blocks of the cache are grouped into sets, and the mapping allows a block of the main memory to reside in any
block of a specific set. Hence, the contention problem of the direct method is eased by having a few choices for
block placement. At the same time, the hardware cost is reduced by decreasing the size of the associative search.
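The C sketch below (illustrative only) shows how a 16-bit word address is split into tag, block, and word fields for the geometry used above: a direct-mapped cache of 128 blocks with 16 words per block. Dropping the block field gives the fully associative split with its 12-bit tag.

    #include <stdio.h>

    #define WORD_BITS   4      /* 16 words per block -> 4 word-offset bits */
    #define BLOCK_BITS  7      /* 128 cache blocks   -> 7 block-index bits */

    int main(void) {
        unsigned addr = 0xABCD;     /* an arbitrary 16-bit word address */

        unsigned word  =  addr & ((1u << WORD_BITS) - 1);
        unsigned block = (addr >> WORD_BITS) & ((1u << BLOCK_BITS) - 1);
        unsigned tag   =  addr >> (WORD_BITS + BLOCK_BITS);   /* remaining 5 bits */

        printf("direct-mapped: tag=%u block=%u word=%u\n", tag, block, word);

        /* Fully associative: no block field, so the tag is everything above the offset. */
        printf("associative:   tag=%u word=%u\n", addr >> WORD_BITS, word);
        return 0;
    }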
4.3 Cache coherence
In computer architecture, cache coherence is the uniformity of shared resource data that ends up
stored in multiple local caches. When clients in a system maintain caches of a common memory
resource, problems may arise with incoherent data, which is particularly the case with CPUs in a
multiprocessing system.
Given a fixed total size (in bytes) for the cache, is it better to have two caches, one for instructions and
one for data; or is it better to have a single unified cache?
Unified is better because it automatically performs load balancing. If the current program needs
more data references than instruction references, the cache will accommodate. Similarly if more
instruction references are needed.
Split is better because it can do two references at once (one instruction reference and one data
reference).
The better cache organization is the split I and D (at least for L1), but the unified cache has the
better (i.e., higher) hit ratio.
Global miss rate—The number of misses in the cache divided by the total number of memory accesses
generated by the processor. Using the terms above, the global miss rate for the first-level cache is still
just Miss rateL1, but for the second-level cache it is Miss rateL1 × Miss rateL2.
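As a quick numeric sketch (the miss rates below are made-up values): if the L1 cache misses on 4% of processor accesses and the L2 cache misses on 50% of the accesses it receives, the global L2 miss rate is 0.04 × 0.5 = 2% of all processor accesses.

    #include <stdio.h>

    int main(void) {
        double miss_l1 = 0.04;   /* L1 miss rate (local = global), assumed */
        double miss_l2 = 0.50;   /* local L2 miss rate, assumed            */
        printf("global L2 miss rate = %.3f\n", miss_l1 * miss_l2);   /* 0.020 */
        return 0;
    }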
The memory interface unit is responsible for arbitrating access to the memory space between reading
instructions (based upon the current program counter) and passing data back and forth with the processor
and its internal registers.
It might at first seem that the memory interface unit is a bottleneck between the processor and the
variable/RAM space (especially with the requirement for fetching instructions at the same time); however,
in many Princeton-architected processors this is not the case, because the time required to execute a
given instruction can be used to fetch the next instruction (this is known as pre-fetching and is a feature
of many Princeton-architected processors).
The Von Neumann architecture's largest advantage is that it simplifies the microcontroller chip
design because only one memory is accessed. For microcontrollers, its biggest asset is that the
contents of RAM (random-access memory) can be used for both variable (data) storage as well as
program instruction storage. An advantage for some applications is the program counter stack
contents that are available for access by the program. This allows greater flexibility in developing
software, primarily in the areas of real-time operating systems.
4.8 Harvard Architecture
Harvard's response was a design that used separate memory banks for program store, the processor
stack, and variable RAM.
The Harvard architecture executes instructions in fewer instruction cycles than the Von Neumann
architecture. This is because a much greater amount of instruction parallelism is possible in the Harvard
architecture. Parallelism means that fetches for the next instruction can take place during the execution
of the current instruction, without having to either wait for a "dead" cycle of the instruction's execution
or stop the processor's operation while the next instruction is being fetched.
4.9 Uniform Memory Access (UMA)
In this model, all the processors share the physical memory uniformly. All the processors have
equal access time to all the memory words. Each processor may have a private cache memory.
The same rule is followed for the peripheral devices.
When all the processors have equal access to all the peripheral devices, the system is called a
symmetric multiprocessor. When only one or a few processors can access the peripheral
devices, the system is called an asymmetric multiprocessor.
4.10 Non-uniform Memory Access (NUMA)
In the NUMA multiprocessor model, the access time varies with the location of the memory word.
Here, the shared memory is physically distributed among all the processors, called local
memories. The collection of all local memories forms a global address space which can be
accessed by all the processors.
Virtual Memory
In most modern computer systems, the physical main memory is not as large as the address space of the
processor. For example, a processor that issues 32-bit addresses has an addressable space of 4G bytes.
The size of the main memory in a typical computer with a 32-bit processor may range from 1G to 4G bytes.
If a program does not completely fit into the main memory, the parts of it not currently being executed
are stored on a secondary storage device, typically a magnetic disk. As these parts are needed for
execution, they must first be brought into the main memory, possibly replacing other parts that are
already in the memory. These actions are performed automatically by the operating system, using a
scheme known as virtual memory.
Under a virtual memory system, programs, and hence the processor, reference instructions and data in an address
space that is independent of the available physical main memory space. The binary addresses that the processor
issues for either instructions or data are called virtual or logical addresses. These addresses are translated into
physical addresses by a combination of hardware and software actions. If a virtual address refers to a part of the
program or data space that is currently in the physical memory, then the contents of the appropriate location in
the main memory are accessed immediately. Otherwise, the contents of the referenced address must be brought
into a suitable location in the memory before they can be used.
Address Translation
A virtual-memory address-translation method based on the concept of fixed-length pages is shown schematically
in Figure. Each virtual address generated by the processor, whether it is for an instruction fetch or an operand
load/store operation, is interpreted as a virtual page number (high-order bits) followed by an offset (low-order
bits) that specifies the location of a particular byte (or word) within a page. Information about the main memory
location of each page is kept in a page table. This information includes the main memory address where the page
is stored and the current status of the page. An area in the main memory that can hold one page is called a page
frame. The starting address of the page table is kept in a page table base register. By adding the virtual page
number to the contents of this register, the address of the corresponding entry in the page table is obtained. The
contents of this location give the starting address of the page if that page currently resides in the main memory.
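As a concrete sketch (the page size, frame number, and table contents are assumptions, not from the notes), the following C fragment shows how a virtual address is split into a virtual page number and an offset, and how the physical address is formed once the page table supplies the page frame, for 4 KB pages.

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_SHIFT 12                    /* 4 KB pages (assumed) */
    #define PAGE_SIZE  (1u << PAGE_SHIFT)

    int main(void) {
        uint32_t virt = 0x00403A2C;          /* an arbitrary virtual address */

        uint32_t vpn    = virt >> PAGE_SHIFT;        /* virtual page number     */
        uint32_t offset = virt & (PAGE_SIZE - 1);    /* byte offset within page */

        /* Tiny stand-in page table: entry i holds the page frame number of page i. */
        uint32_t page_table[2048] = {0};
        page_table[vpn] = 0x123;             /* pretend the OS put the page in frame 0x123 */

        uint32_t frame = page_table[vpn];
        uint32_t phys  = (frame << PAGE_SHIFT) | offset;

        printf("VPN=0x%X offset=0x%X -> physical=0x%08X\n", vpn, offset, phys);
        return 0;
    }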
Translation Look-aside Buffer
The page table information is used by the MMU for every read and write access. Ideally, the page table should be
situated within the MMU. Unfortunately, the page table may be rather large. Since the MMU is normally
implemented as part of the processor chip, it is impossible to include the complete table within the MMU. Instead,
a copy of only a small portion of the table is accommodated within the MMU, and the complete table is kept in the
main memory. The portion maintained within the MMU consists of the entries corresponding to the most recently
accessed pages. They are stored in a small table, usually called the Translation Lookaside Buffer (TLB). The TLB
functions as a cache for the page table in the main memory. Each entry in the TLB includes a copy of the information
in the corresponding entry in the page table. In addition, it includes the virtual address of the page, which is needed
to search the TLB for a particular page.
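A minimal sketch of a TLB lookup follows (the size and fields are assumptions; a fully associative TLB searched linearly). Each entry pairs a virtual page number with its page frame; on a miss, the page table in main memory would be consulted and the translation installed in the TLB.

    #include <stdio.h>
    #include <stdint.h>

    #define TLB_ENTRIES 8                    /* small fully associative TLB (assumed) */

    struct tlb_entry { int valid; uint32_t vpn; uint32_t frame; };
    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Returns 1 and sets *frame on a TLB hit, 0 on a miss. */
    static int tlb_lookup(uint32_t vpn, uint32_t *frame) {
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *frame = tlb[i].frame;
                return 1;                    /* hit: translated without touching memory */
            }
        }
        return 0;                            /* miss: consult the page table in memory */
    }

    int main(void) {
        tlb[0] = (struct tlb_entry){1, 0x403, 0x123};
        uint32_t frame;
        printf("hit=%d\n", tlb_lookup(0x403, &frame));   /* hit=1, frame=0x123 */
        printf("hit=%d\n", tlb_lookup(0x777, &frame));   /* hit=0 (miss)       */
        return 0;
    }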
5 INTERCONNECTION NETWORK
An ICN could be either static or dynamic. Connections in a static network are fixed links, while connections in a
dynamic network are established on the fly as needed.
Static networks can be further classified according to their interconnection pattern as one-dimension (1D), two-
dimension (2D), or hypercube (HC).
Dynamic networks can be classified based on interconnection scheme as bus-based versus switch-based.
Static Interconnection Networks:
• Static (fixed) interconnection networks are characterized by having fixed paths, unidirectional or bidirectional,
between processors. Two types of static networks can be identified: completely connected networks (CCNs) and
limited connection networks (LCNs).
CCN Characteristics:
• CCNs guarantee fast delivery of messages from any source node to any destination node. Only one link has to be
traversed.
• CCNs are expensive in terms of the number of links needed for their construction. This disadvantage becomes
more and more apparent for higher values of N.
• The number of links in a CCN is given by N(N-1)/2.
• The delay complexity of CCNs, measured in terms of the number of links traversed as messages are routed from
any source to any destination is constant.
Example: a CCN having N = 6 nodes requires a total of 15 links in order to satisfy the complete interconnectivity of the
network.
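A one-line check of the link-count formula (a trivial sketch):

    #include <stdio.h>

    int main(void) {
        int N = 6;
        printf("links = %d\n", N * (N - 1) / 2);   /* 15 links for N = 6 */
        return 0;
    }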
Network Topologies
• Commercial machines often implement hybrids of multiple topologies for reasons of packaging,
cost, and available components.
Network Topologies: Buses
• Some of the simplest and earliest parallel machines used buses.
• The distance between any two nodes is O(1) in a bus. The bus also provides a convenient
broadcast media.
• Typical bus based machines are limited to dozens of nodes. Sun Enterprise servers and Intel
Pentium based shared-bus multiprocessors are examples of such architectures.
• Since much of the data accessed by processors is local to the processor, a local memory can
improve the performance of bus-based machines.
• Examples of machines that employ crossbars include the Sun Ultra HPC 10000 and the
Fujitsu VPP500.
Network Topologies: Multistage Networks
Switches
Omega Network
Network Topologies: Completely Connected Network
Tree Properties
• The distance between any two nodes is no more than 2 log p.
• Links higher up the tree potentially carry more traffic than those at the lower
levels.
• For this reason, a variant called a fat tree fattens the links as we go up the
tree.
• Trees can be laid out in 2D with no wire crossings. This is an attractive
property of trees.
Evaluating Interconnection Networks
• Diameter: The distance between the farthest two nodes in the network. The
diameter of a linear array is p − 1, that of a tree and hypercube is log p, and
that of a completely connected network is O(1).
• Bisection Width: The minimum number of wires you must cut to divide the
network into two equal parts. The bisection width of a linear array and a tree is
1, that of a hypercube is p/2, and that of a completely connected network is
p^2/4.
• Cost: The number of links or switches (whichever is asymptotically higher)
is a meaningful measure of the cost. However, a number of other factors,
such as the ability to layout the network, the length of wires, etc., also factor
in to the cost.
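The formulas above can be tabulated directly; the C sketch below (illustrative, with p assumed to be a power of two) prints the diameter and bisection width of a few topologies.

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        int p = 16;                               /* number of nodes (assumed) */
        int logp = (int)round(log2((double)p));

        printf("linear array:         diameter=%d  bisection=%d\n", p - 1, 1);
        printf("hypercube:            diameter=%d  bisection=%d\n", logp, p / 2);
        printf("completely connected: diameter=%d  bisection=%d\n", 1, p * p / 4);
        return 0;
    }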
6 I/O Organization
Programmed I/O
Programmed I/O (PIO) refers to data transfers initiated by a CPU under driver software control
to access registers or memory on a device.
The CPU issues a command and then waits for the I/O operation to complete. Since the CPU is faster
than the I/O module, the problem with programmed I/O is that the CPU has to wait a long time
for the I/O module of concern to be ready for either reception or transmission of data. While
waiting, the CPU must repeatedly check the status of the I/O module; this process is known as
polling. As a result, the performance of the entire system is severely degraded.
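A minimal polling sketch (the register addresses and bit positions below are invented for illustration; a real device defines its own): the CPU spins on a status register until the device reports it is ready, then performs the transfer itself, one byte at a time.

    #include <stdint.h>

    /* Hypothetical memory-mapped device registers (addresses are made up). */
    #define DEV_STATUS (*(volatile uint8_t *)0x40001000u)
    #define DEV_DATA   (*(volatile uint8_t *)0x40001004u)
    #define READY_BIT  0x01u

    /* Programmed I/O: busy-wait (poll) until the device is ready, then write one byte. */
    void pio_write_byte(uint8_t b) {
        while ((DEV_STATUS & READY_BIT) == 0)
            ;              /* polling loop: the CPU does no useful work while waiting */
        DEV_DATA = b;      /* the transfer itself is performed by the CPU             */
    }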
Interrupt
The CPU issues commands to the I/O module then proceeds with its normal work until
interrupted by I/O device on completion of its work.
For input, the device interrupts the CPU when new data has arrived and is ready to be retrieved
by the system processor. The actual actions to perform depend on whether the device uses I/O
ports or memory mapping.
For output, the device delivers an interrupt either when it is ready to accept new data or to
acknowledge a successful data transfer. Memory-mapped and DMA-capable devices usually
generate interrupts to tell the system they are done with the buffer.
Although interrupt-driven I/O relieves the CPU of having to wait for the devices, it is still inefficient
for transferring large amounts of data, because the CPU has to transfer the data word by word between the I/O
module and memory.
Direct Memory Access (DMA) means the CPU grants the I/O module authority to read from or write to
memory without CPU involvement. The DMA module controls the exchange of data between main memory
and the I/O device. Because a DMA device can transfer data directly to and from memory,
rather than using the CPU as an intermediary, it can relieve congestion on the bus. The CPU is
involved only at the beginning and end of the transfer and is interrupted only after the entire block has
been transferred.
Direct memory access needs special hardware called a DMA controller (DMAC) that manages
the data transfers and arbitrates access to the system bus. The controller is programmed with
source and destination pointers (where to read/write the data), counters to track the number of
transferred bytes, and settings, which include the I/O and memory types, interrupts, and states for
the CPU cycles.
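A sketch of how a driver might program such a controller is shown below; the register layout and addresses are entirely hypothetical, since every real DMAC defines its own.

    #include <stdint.h>

    /* Hypothetical DMA controller registers (layout invented for illustration). */
    struct dmac_regs {
        volatile uint32_t src;       /* source address                              */
        volatile uint32_t dst;       /* destination address                         */
        volatile uint32_t count;     /* number of bytes to transfer                 */
        volatile uint32_t control;   /* start bit, direction, interrupt enable, ... */
    };

    #define DMAC       ((struct dmac_regs *)0x40002000u)   /* made-up base address */
    #define DMA_START  0x1u
    #define DMA_IRQ_EN 0x2u

    /* Program one block transfer; the CPU is then free to do other work until the
       completion interrupt arrives. */
    void dma_copy(uint32_t src, uint32_t dst, uint32_t nbytes) {
        DMAC->src     = src;
        DMAC->dst     = dst;
        DMAC->count   = nbytes;
        DMAC->control = DMA_START | DMA_IRQ_EN;   /* kick off the transfer */
    }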
DMA increases system concurrency by allowing the CPU to perform tasks while the DMA
system transfers data via the system and memory busses. Hardware design is complicated
because the DMA controller must be integrated into the system, and the system must allow the
DMA controller to be a bus master. Cycle stealing may also be necessary to allow the CPU and
DMA controller to share use of the memory bus.