This document provides an overview of basic microprocessor architecture and the instruction execution cycle. It discusses concepts like the Von Neumann and Harvard architectures, the central processing unit (CPU), main system memory, and how instructions and data are stored and accessed. It also covers the roles of masters like the CPU initiating operations on slave components like memory, and how direct memory access allows the CPU to focus on computation rather than data transfers.


ELEC-H-473 – Microprocessor architecture

Th03: Basic processor architecture


and the execution model
Dragomir Milojevic
Université libre de Bruxelles
2023
Today

1. Basic computer architecture concepts

2. Minimum system architecture

3. Instruction execution cycle

4. Different ISAs: RISC vs. CISC

5. Instruction execution cycle & circuit

6. Pipeline execution
1. Basic computer architecture concepts

ELEC-H-473 Th03 3/65


What really defines a computer?
• The definition of a computer is made with respect to the electronic calculator:
“As opposed to electronic calculators, computers store electron-
ically the information that controls the computational process”
. What is specific to electronic calculators?

• The above definition of computers is due to
John von Neumann, mathematician
• He co-authored “First Draft of a Report on
the EDVAC” in 1945, an unfinished &
unpublished paper of 101 pages; there is
some controversy about the actual authorship ...
. Concepts described in the above paper were used
to design ENIAC (USA) & Colossus (UK),
machines considered to be the first computers

ELEC-H-473 Th03 4/65


Von Neumann architecture: HW view
• The First Draft defines the von Neumann architecture, a model of a
computer structure (HW), a term widely used today to describe
different computer architectures (IoT, mobile & supercomputer)
• Central Processing Unit (CPU) – executes the
instructions defined in the program
. Arithmetic-Logic Unit (ALU) + Control Unit
(CTRL) + Register File (RF)
• Memory stores both instructions & data – a key
concept that defines a computer, since
instructions can change, just like normal data
. What is the consequence of it?
• Input/Output (IO) – connects to the outside world
• Communication – typically using buses that allow
one pair of nodes to exchange data at a time

[Figure: block diagram – CPU (ALU, RF, CTRL) connected to Memory and IO]
ELEC-H-473 Th03 5/65
Harvard architecture
• Variant of the Von Neumann architecture model
in which Data & Instruction memories are
different functional block instances!
• Motivation – instruction & data memories
could be different circuits, even if
implemented on the same PCB or within the
same IC; then they can have:
. different geometrical properties (capacity)
. different electrical properties (R/W time)
. but most importantly they allow concurrent
access: instruction & data accesses can overlap
• Pure Harvard architectures with fully
separated instruction & data memories are
not mainstream; used very occasionally in
dedicated DSPs & micro-controllers

[Figure: block diagram – CPU (ALU, RF, CTRL) with separate Data Memory and Instruction Memory, plus IO]
ELEC-H-473 Th03 6/65
Modified Harvard architecture
• Today’s computer architectures most often use the modified Harvard
model, a combination of the pure Von Neumann & Harvard approaches
• This is because memory is implemented using cache hierarchies to
overcome the problem of the CMOS SRAM/DRAM vs. logic scaling
discrepancy (we will look deeper into this later on)
• Modified Harvard architectures take the advantage of the von
Neumann model, where instructions are treated as any other data:
. everything is stored in the same memory, simple to control in HW
. run-time compilation, since the same memory stores data & instructions
. it is even possible to write self-modifying code (exotic, but cool!)
• ... with the advantages of the Harvard architecture:
. concurrent instruction/data access for more throughput
• In practice computers will have separate L1 caches for Instructions &
Data, while for higher cache levels (L2, L3, etc.) and central
memory they use the same memory for both, often called unified memory
ELEC-H-473 Th03 7/65
CPU
• The ALU executes a limited number of arithmetic & logical operations
decided at CPU design-time, i.e. they can’t be changed
afterwards; they are hardwired in HW & the IC manufacturing process
. Unless you implement your CPU in an FPGA circuit, and modify the
micro-architecture on the fly – this exists, and is called reconfigurable
computing; it can even be dynamic (config generated at run-time)
• The CPU may contain multiple ALUs that are composed of different
sub-circuits specialized for specific arithmetic & logic operations
and data types (typically integers, floating point)
• Depending on the operation the ALU may require one or two (rarely
more) operands, or arguments, stored in memory or the RF
• Access to operands through the RF is preferred because of faster
R/W access times to data; RFs sometimes allow simultaneous
access to a few words in parallel (especially for READ)
• The ALU writes back the result in the RF, or memory
ELEC-H-473 Th03 8/65
Main system memory
• Main (central) memory stores data & instructions in a unified way
• Data & instructions are binary data; the machine can’t tell the two
apart by content: is 0x32 an opcode or data?
• Thus, instructions are placed in a specific region of memory, often
access-protected to prevent data corruption
. Accidental modifications of the program will most likely cause the
system to crash; Can you explain why?
• Memories are accessed through ports – collections of wires
connected to pins, the electrical interfaces of the IC to the external world
• Ports are grouped based on the functionality of the pins:
. Address port ADD – location where we want to read/write
. Control port CTRL – drives memory control logic
. Data port DATA – actual data stored at address ADD
◦ Read – data becomes available for other components in the system
◦ Write – data is written in the memory
ELEC-H-473 Th03 9/65
Practical memories
• Address – ADD[n-1:0] is an input port with n
pins, i.e. address bits → they provide 2^n
addressable locations
• Control – CTRL[c-1:0] is also an input port, but
has only a few pins (e.g. CLK, EN, R/W’)
• Data – D[m-1:0] has m pins that act as input
or output port depending on the operation
direction (W and R respectively); sometimes
R/W ports could be separated, at the
cost of an increased pin count at circuit level
• Typical memory “locations” at circuit level store 1 bit
• Actual memories are assembled from a few memories connected in
parallel, so that one address accesses multiple bits & allows word- or
bit-level parallelism (often 1 byte, but could be more)

[Figure: memory block with ADD and CTRL input ports and a DATA port]
ELEC-H-473 Th03 10/65
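The port model above (n address bits, a small control port, an m-bit data port) can be sketched as a toy Python class. This is an illustrative sketch only, not the lab simulator: the names `SimpleMemory` and `access` are assumptions, and one call to `access` stands for one single-ported access cycle.

```python
# Minimal sketch of the single-ported memory described above:
# n address bits give 2**n locations, each holding an m-bit word.
class SimpleMemory:
    def __init__(self, n_addr_bits, m_data_bits):
        self.size = 2 ** n_addr_bits          # 2^n addressable locations
        self.mask = (1 << m_data_bits) - 1    # m-bit data port width
        self.cells = [0] * self.size

    def access(self, add, rw, data=None):
        """One access per cycle: rw='W' latches DATA in, rw='R' drives DATA out."""
        if rw == 'W':
            self.cells[add] = data & self.mask  # word is truncated to m bits
            return None
        return self.cells[add]                  # 'R': data becomes available

mem = SimpleMemory(10, 8)   # 1024 locations of 1 byte each
mem.access(5, 'W', 0xAB)
print(mem.access(5, 'R'))   # → 171 (0xAB)
```

Note how the data "port" plays both input and output roles depending on the R/W control, mirroring the bidirectional DATA pins of a real single-ported memory.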
On memory performance
• Key memory parameters are capacity and R/W access time; both
are intertwined: more capacity means longer R/W access times
and, inversely, less capacity means faster memories (important)
• Often memory performance is indicated using the number of transfers
per second, correlated with the R/W access times of the memory, but
also with the speed of the interconnect between the logic
and the memory (or R/W frequency FRW )
• The number of transfers per second multiplied by the transferred word
width (so m) gives us the memory bandwidth BW = FRW × m
• We see that we can get more BW either by reducing the R/W
access time or with wider words in a single access
• Improving the R/W time of the memory and the interconnect has become
difficult over the past years; this is why today we see
memories with very wide words (could be thousands of bits!)
ELEC-H-473 Th03 11/65
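The bandwidth formula BW = FRW × m can be checked numerically. The numbers below (200 MHz transfer rate, 64-bit word) are illustrative assumptions, not values from the slides:

```python
# BW = F_RW * m: transfers per second times word width, in bits per second.
def memory_bandwidth(f_rw_hz, m_bits):
    """Return memory bandwidth in bits per second."""
    return f_rw_hz * m_bits

bw = memory_bandwidth(200_000_000, 64)   # 200 MHz x 64-bit word
print(bw)               # → 12800000000 (12.8 Gbit/s)
print(bw / 8 / 1e9, "GB/s")   # → 1.6 GB/s
```

Doubling the word width m doubles BW at the same FRW, which is exactly the "very wide words" trend mentioned above.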
On masters and slaves
• Masters can initiate transfer operations on one (one-to-one, the
most frequent case) or more slaves (one-to-many)
• Slaves can only accept or refuse access operations requested by
a master, i.e. they cannot initiate transfers themselves
• In our model the CPU is the master & the memory is the slave
• The CPU needs some time to initiate & execute a memory access; during
the memory access the CPU is not computing anything really
useful (rather than computing, the CPU is used to move data around)
• Today most CPUs don’t deal with memory accesses themselves; they
outsource this task to other components in the system, just
passing the information on what should go where to dedicated
HW blocks that only move data around – Direct Memory Access
(DMA); this allows CPUs to concentrate on what they should do:
useful computation! (and not data transfers)
ELEC-H-473 Th03 12/65
Putting it together
• On the memory side we have to read &
write; the Data port is typically bi-directional
(on the CPU side ports are mirrored)
• The Data port can be input OR output
(exclusive), depending on the control
logic (seen from the CPU) – single-ported
memory – only one address/data couple is
accessed at a time (1 clock cycle)
• The CPU translates R/W instructions into
appropriate HW operations
. CPU initiates transfers by setting Ctrl
signals & ADD to the correct value
. Memory accepts the transfer (Ctrl)
. HW decides on the data direction depending
on the operation (R/W)

[Figure: CPU and Memory connected through ADD and CTRL ports, with a bidirectional Data port]
ELEC-H-473 Th03 13/65
Closer look into the CPU: RF + CTRL
• ALUs implement adders, multipliers, dividers, hardwired
mathematical operations (dedicated circuits) or microcoded
complex operations (a kind of sub-program for more complex
operations for which dedicated HW would be too expensive)
• The Register File is connected to the ALU through a set of multiplexer
circuits that select the right paths
• Registers can be source (SRC) or destination (DST)
• ALU, RF & associated logic need control signals to steer their operation
• These are generated by the CTRL unit, a sub-system that interprets
the instruction being executed and translates it into HW operations

[Figure: CPU (ALU, RF, CTRL) connected to Memory through ADD, Data and CTRL ports]
ELEC-H-473 Th03 14/65
Register Files (RF)
• RFs are memories built using SRAMs or sets of FFs; the simplest RFs
are single-ported, i.e. only one read OR write (the OR is exclusive) can
be performed at a time & at one address location
• Typical RFs are multi-ported memories,
i.e. they allow multiple data to be read or
written at the same time; each port is
then addressed with a separate address
bus; this impacts memory pin count,
and thus routability of the IC
. Multi-ported RFs allow more BW to ALUs,
assuming the same memory/interconnect
speed, which is the case
• Practical RFs have a few read ports & only one write port to avoid
data coherency problems
. In case of multiple write ports we need to handle eventual
simultaneous writes to the same location → arbitration

[Figure: multi-ported memory with several ADD ports, two Read ports, one Write port and CTRL]
ELEC-H-473 Th03 15/65
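The typical "few read ports, one write port" RF above can be sketched as a small Python class. Class, method and parameter names (`RegisterFile`, `read`, `write`, `we`) are illustrative assumptions:

```python
# Sketch of a multi-ported register file: two read ports, one write port,
# each read port with its own address, matching the slide's description.
class RegisterFile:
    def __init__(self, n_regs=8, width_bits=16):
        self.mask = (1 << width_bits) - 1
        self.regs = [0] * n_regs

    def read(self, add_a, add_b):
        """Two read ports: both values are available in the same cycle."""
        return self.regs[add_a], self.regs[add_b]

    def write(self, add, value, we=True):
        """Single write port; WE (write-enable) gates the update, so no
        arbitration between simultaneous writers is ever needed."""
        if we:
            self.regs[add] = value & self.mask

rf = RegisterFile()
rf.write(1, 42)
rf.write(2, 7)
print(rf.read(1, 2))   # → (42, 7)
```

With a single write port the "simultaneous writes to the same location" problem mentioned above simply cannot arise, which is precisely why practical RFs are built this way.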
On HW performance
Assume that the CPU runs at a certain frequency and can process N
instructions per second; then:
• Memory should deliver the necessary data & instructions to the CPU, i.e.
we need to R/W data from memory fast enough to feed the computation
• Wires between memory & CPU should be fast enough to transport
the above data with minimum delay and latency
. Wires (e.g. Cu conductors) have parasitic resistance, capacitance &
inductance (RLC) that at higher frequencies act as filters, defining the
Fmax at which the wires can still operate correctly
. RLC depends on wire material, geometry & distance: shorter wires, smaller
delay, better Fmax ; this is why DRAMs are placed close to the CPU
. If the delay is too big, FFs are inserted at the expense of increased transfer
latency – i.e., the number of clock cycles to go from A to B
• CPU-to-memory communication is designed to match the speed of the
memory and CPU interfaces, but today this is complex to achieve
If the above is not the case, the CPU will be stalled, i.e. the
computation will stop and CPU time is wasted; this is not good!
ELEC-H-473 Th03 16/65
On SW performance
• Because of the HW reality (RLC), any SW execution time will be
bounded either by:
. Computation – there are many complex computations that run on
small data sets; memory is not accessed often, the bottleneck is in the
computational part of the CPU (ALUs)
. Memory – few simple computations on different data; memory is
accessed often; the CPU waits for memory to deliver data; the bottleneck is
in memory & the memory-logic interconnect, not computation
• The good operating point for any SW is when computation, memory
accesses and communication are in perfect balance; this is hard to
achieve, but this is what the programming art is all about; knowledge
of HW can & will help to eventually reach this point
• Today CPUs compute fast thanks to high speed logic &
parallelism (more on this later); the major problem is in memory &
the memory-to-logic interconnect – compute systems face the memory
wall (performance & power limitations)
ELEC-H-473 Th03 17/65
Computer architecture is not just a HW model
Descriptions of the ALU, RF, CTRL logic & memory are not enough to
make a computer → we still miss a few (key) ingredients!
• Architectural details necessary to make a real computer; the next
section will introduce a minimal system architecture, a (very)
simplified model of an actual computer architecture
• Instruction execution model, or how instructions in the context of
a computer program are handled during execution (at run-time);
this is also part of the Von Neumann model
• Instruction Set Architecture (ISA) – the actual specification of
instructions, an abstracted model of the computing system
. The ISA defines the supported instructions, data types, local registers, central
memory & how it is managed, general input/output; the ISA defines
what a system can do, not how it does it; you could write an
ISA simulator of a computing system using another architecture
(ISA simulators or run-time binary translators such as Rosetta)
ELEC-H-473 Th03 18/65
2. Minimum system architecture

ELEC-H-473 Th03 19/65


Extra components added to the previous model
• Program counter (PC) – register that stores the address of the next
instruction to be executed; after instruction execution, the PC is
incremented to access the next instruction in the program, or set to an
arbitrary program address in case of a jump or branch
• Instruction register (IR) – stores the opcode of the instruction that
will be executed; this opcode is decoded to generate control signals for
the other blocks in the system (we have seen a simple decoder)
• Accumulator (ACC) – an optional register that can be R/W directly
from the ALU (or bypassed for a direct connection to the register file)

[Figure: Memory connected via Address/Data buses to PC, IR + Decoder, Register File, and ALU with A/B inputs and ACC]

ELEC-H-473 Th03 20/65


Different data exchange paths
a) From central memory to the Register File (RF)
b) From central memory to the Instruction Register (IR)
c) ALU can compute from the RF (or from the central memory)
d) or it could use the result of the previous operation (ACC)
e) PC can be updated using a computation from the ALU (addresses are
used as any other data, just think of pointers in the “C” language)

[Figure: same block diagram with paths a–e annotated]

ELEC-H-473 Th03 21/65


RISC16 used in labs ... or x86

ADD: 000 rA rB 0000 rC

[Figure: RISC16 datapath for ADD – Program Counter (+1), Instruction Memory, three-port Register File (SRC1, SRC2, TGT, WE), ALU with ADD control; shaded boxes represent registers, thick lines represent 16-bit buses]

This figure illustrates the flow of control for the ADD instruction. All three ports of the register
file are used, and the write-enable bit (WE) is set for the register file. The ALU control signal is a
simple ADD function.
Not that much different!
ELEC-H-473 Th03 22/65
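The 16-bit ADD encoding shown above (000 rA rB 0000 rC) can be decoded with a few shifts and masks. The field layout below follows the RiSC-16-style encoding from the slide; the function name `decode` and the TGT/SRC1/SRC2 role assignments are illustrative assumptions:

```python
# Decode a 16-bit word laid out as: opcode(3) | rA(3) | rB(3) | 0000 | rC(3)
def decode(word):
    opcode = (word >> 13) & 0b111   # 000 for ADD
    rA     = (word >> 10) & 0b111   # TGT register (write port of the RF)
    rB     = (word >> 7)  & 0b111   # SRC1 register (first read port)
    rC     =  word        & 0b111   # SRC2 register (second read port)
    return opcode, rA, rB, rC

# Encode "ADD r1, r2, r3": 000 001 010 0000 011
word = (0 << 13) | (1 << 10) | (2 << 7) | 3
print(decode(word))   # → (0, 1, 2, 3)
```

Note that all three register fields address the three RF ports of the figure simultaneously: two reads for the sources and one write (with WE set) for the target.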
3. Instruction execution cycle

ELEC-H-473 Th03 23/65


Instruction execution model
• Program is a list of instructions read and executed one by one in
a sequential manner, and in the order specified by the program
• Von Neumann specification defines instruction execution model:
. Fetch instruction – Current instruction is fetched from the memory,
i.e. the instruction opcode is transported from the Instruction
Memory to the Instruction Register (IR)
. Decode instruction – Internal logic interprets the instruction opcode,
i.e., it generates the appropriate control signals for all associated
logic that will be involved in the execution of this operation
. Fetch operands – Memory read for data; necessary only if data is
not in the RF (for simplicity we merge this operation with next step)
. Execute instruction – Perform actual operation on operand(s)
. Write result – memory or RF, depending on instruction
• Every instruction executes in 4 steps: F, D, Ex, W in simplified
model & notation (fetch operand & execute steps are merged)
ELEC-H-473 Th03 24/65
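The F, D, Ex, W steps above can be sketched as a toy execution loop. This is a deliberately minimal sketch, not the lab simulator: instructions are Python tuples instead of binary opcodes, and the instruction names (`li`, `add`) are assumptions for illustration:

```python
# Toy Von Neumann execution loop over the simplified 4-step model.
def run(program, n_regs=4):
    regs = [0] * n_regs
    pc = 0
    while pc < len(program):
        ir = program[pc]                 # Fetch: instruction into the IR
        op, dst, s1, s2 = ir             # Decode: split opcode into fields
        if op == 'add':                  # Execute: perform the operation
            result = regs[s1] + regs[s2]
        elif op == 'li':                 # load immediate (operand fetch merged)
            result = s1
        regs[dst] = result               # Write: result goes to the RF
        pc += 1                          # Sequential model: PC += 1
    return regs

print(run([('li', 0, 5, None), ('li', 1, 7, None), ('add', 2, 0, 1)]))
# → [5, 7, 12, 0]
```

Each loop iteration walks exactly the F → D → Ex → W sequence of the slide, with the operand fetch merged into the execute step as in the simplified notation.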
A. Instruction fetch
1. Control logic (not shown in the figure) places the value of the PC
on the address bus of the system
2. A control signal generates a read memory access; the data read is
placed on the data bus – this is the opcode of the instruction
stored at the given address
3. The IR knows that the data on the bus is to be taken thanks to the
control logic; it reads the opcode from the bus and stores it for
further processing in the local register

[Figure: fetch steps 1–3 annotated on the block diagram]

ELEC-H-473 Th03 25/65


B. Instruction decode
1. Control logic reads the content of the IR & passes it to the decoder circuit
2. Depending on the opcode, the decoder generates different control
signals for all other units in the system (e.g. ALU operation
selection & operand preparation)
. In the simplest form this is just an x : y decoder (What are x, y ?)
. Decoders are simple combinatorial circuits
. We have seen an example in Th01

[Figure: decode steps 1–2 annotated on the block diagram]

ELEC-H-473 Th03 26/65


C. Instruction execute
1. The ALU can now execute the selected operation (logical AND,
arithmetical + etc.) on the operand(s)
2. The result is written to the destination register (or accumulator)
3. The PC is updated according to the executed instruction
. For an arithmetic/logical operation on data we will have PC+=1;
instruction flow-control instructions could set the PC to any value

[Figure: execute steps 1–3 annotated on the block diagram]

ELEC-H-473 Th03 27/65


D. Write
• Depending on the architecture and the instruction used, the result
is written to the appropriate destination: either the RF (1)
or memory (2)
• It is the instruction opcode that specifies the result destination
• Note that we said that the RF is faster than memory ... you should
understand the trade-off between the two write destinations

[Figure: write destinations 1 (RF) and 2 (memory) annotated on the block diagram]

ELEC-H-473 Th03 28/65


Different architectures depending on memory R/W
Depending on the possible combination(s) of the source/destination
operands and/or result location (RF, memory, ...), we can have the
following computer architectures:
• Pure Load/Store architectures – operands must be in the RF, so
any computation is always preceded by an explicit R/W
operation from/to memory to/from the RF
• Register/Memory architectures – operands can be either in the RF
or in memory (and this is an exclusive or: both RF and memory in
one instruction is not possible)
• Register + Memory – any operation can have operands in
memory and/or the RF (this is the general case)

According to you, what do these differences really mean,
and what are the trade-offs?

ELEC-H-473 Th03 29/65


4. Different ISAs: RISC vs. CISC

ELEC-H-473 Th03 30/65


Instruction Set Architecture – ISA
• Until now we have defined a (minimal) computer micro-architecture
and the way it operates (instruction execution), but we didn’t say
anything about WHAT this machine can really do!
• The ISA is the (missing) link between the micro-architecture and the
programmer who writes programs in assembly or any other high-level
language to be compiled into an executable file ...
• The ISA is about the definition of WHAT the machine can do, as opposed to
HOW this is done
. After all, the programmer doesn’t need to know how the addition is done
• The ISA typically defines:
. Native data types (integer and floating point sizes)
. Register number (names), sizes & types, types of instructions
. Addressing modes, memory architecture, interrupts, exception
handling and external I/O
• A computer is defined as: micro-architecture + ISA
ELEC-H-473 Th03 31/65
CPU instructions classes
• Data movement – from any to any point including:
. Load (from memory) or Store (to memory)
. Move memory-to-memory or register-to-register
. Move from memory to I/O devices (printer, screen, mouse etc.)

• Arithmetic ops – depend on data type due to encoding differences


. Integer operations
. Floating point operations
. How do we encode floating point / integer?

• Shifts – at word level (e.g. sleft, sright)


• Logical operations – e.g. and, not, set, clear
• Control instructions – jump/branch conditional or not
• Subroutine handling – call, return
• Interrupt handling

ELEC-H-473 Th03 32/65


Addressing modes
Not all modes shown below are implemented & syntax may vary
Register add R4,R3 R4←R4+R3
Immediate add R4,#3 R4←R4+3
Register Indirect add R4,(R1) R4←R4+Mem[R1]
Displacement add R4,100(R1) R4←R4+Mem[100+R1]
Indexed/Base add R4,(R1+R2) R4←R4+Mem[R1+R2]
Direct/ Absolute add R4,(1001) R4←R4+Mem[1001]
Memory indirect add R4,@(R3) R4←R4+Mem[Mem[R3]]
Auto-increment add R4,(R1)+ R4←R4+Mem[R1]
R1←R1+c
Auto-decrement add R4,-(R1) R4←R1-c
R1←R4+Mem[R1]
Scaled add R4,100(R2)(R3) R4←R4+Mem[100+R2+R3*c]

← used to note the assignment of the result


ELEC-H-473 Th03 33/65
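A few of the addressing modes above can be mimicked in Python, using a dict as the register file and a list as memory. This is an illustrative sketch only; the function names are assumptions and each function implements the effect shown in the corresponding table row:

```python
# Each function mirrors one row of the addressing-modes table.
def add_register(R, rd, rs):               # add R4,R3     → R4 ← R4+R3
    R[rd] += R[rs]

def add_immediate(R, rd, imm):             # add R4,#3     → R4 ← R4+3
    R[rd] += imm

def add_register_indirect(R, M, rd, rs):   # add R4,(R1)   → R4 ← R4+Mem[R1]
    R[rd] += M[R[rs]]

def add_displacement(R, M, rd, d, rs):     # add R4,100(R1) → R4 ← R4+Mem[100+R1]
    R[rd] += M[d + R[rs]]

R = {1: 1, 3: 2, 4: 10}     # register file: R1=1, R3=2, R4=10
M = [0, 5, 0, 7]            # memory
add_register(R, 4, 3)       # R4 ← 10+2
print(R[4])                 # → 12
```

Note how the register mode touches only the RF, while the indirect and displacement modes need an extra memory access to form the operand — exactly the cost difference the table hides.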
Classification of ISA
• Depending on the way we implement the instruction set, we can
have different flavors of ISA (not that many variants though)
• In computing, most of the time the computer executes the same
operations again & again: memory moves, some basic
logic/arithmetic operations etc.
• The table shows instruction execution frequency: the first 3
account for 58%, while only 10 instructions represent 96% of all
executed instructions in a program

Rank  Instruction         Average [%]
 1    load                22
 2    conditional branch  20
 3    compare             16
 4    store               12
 5    add                  8
 6    and                  6
 7    sub                  5
 8    move Reg-To-Reg      4
 9    call                 1
10    return               1
      Total               96
ELEC-H-473 Th03 34/65
Classification of Instruction Sets
• Conclusion → only a few instructions are used most of the time;
this is the main motivation for two orthogonal approaches to ISA:
. Reduced Instruction Set Computer – RISC
. Complex Instruction Set Computer – CISC
• Both HW & SW communities have been fighting for decades to
prove that one approach is better than the other
• In reality both are cool, and have their advantages and
disadvantages, though the following observations can be made:
. CISC are more complex architectures, require more silicon and target
general-purpose CPUs in High Performance Computing (HPC), so
servers, supercomputers; adopted by companies that make their own
silicon; Can you guess why?
. RISC tend to be simpler, used for general purpose servers and of
course the mobile computing that flourished over the past decade
ELEC-H-473 Th03 35/65


RISC philosophy
• Instructions are simplified in content and therefore can be
implemented more efficiently in HW: most RISC instructions
require only one clock cycle to execute
• More complex instructions are implemented as “subroutines”,
called on a per-need basis
• RISC computers are often load/store architectures: if an operation is
to be done on memory, it is split into separate load, execute
and store instructions at assembly level
• Efficient instructions have higher throughput, and the system
as a whole will be more performant
• In other words, if 99% of the instructions to be executed are faster,
we can forget about the other 1%; the instruction set appears
SMALL, with instructions HIGHLY optimized in HW

ELEC-H-473 Th03 36/65


CISC philosophy
• As opposed to RISC, CISC CPUs have many instructions & some
of them can be very complex
• In CISC you could possibly load from memory, compute and store
in a single instruction; or you could perform some dedicated math
operation using a specialized HW unit (CPU vendors sell more IC
area, so better profit, and we get better systems)
• Obviously any operation directly from/to memory will be much
slower than the same operation in the RF, because we still need to
perform the memory transfer; the question is who is doing this:
computer HW (CISC) or SW (RISC)?
• Rich instruction sets with many different instructions cannot be
implemented with a fixed cost in terms of computation cycles
• This has a strong impact on performance, as we will see
later with actual architectures!
ELEC-H-473 Th03 37/65
RISC/CISC example

RISC version:                 CISC version:
1 load ax, a                  1 ; here we use a
2 load bx, b                  2 ; single instruction
3 mul ax, bx                  3 ; that does everything
4 store a, ax                 4 mul a, b

RISC:
• Emphasis is on software
• Single-clock, reduced instructions only
• Register-to-register: LOAD and STORE are independent instructions
• Low cycles per instruction, large code sizes
• More RF storage resources

CISC:
• Emphasis is on hardware
• Includes multi-clock complex instructions
• Memory-to-memory: LOAD and STORE incorporated in instructions
• Small code sizes, high cycles per instruction
• More computational hardware
ELEC-H-473 Th03 38/65


RISC/CISC performance trade-off
• Software execution time tCPU is given by:

tCPU = NInst × CPI × 1/FClk

where NInst is the number of instructions in the program, CPI is
the number of cycles required to execute one instruction (on
average) and FClk the CPU clock frequency
• Having the above in mind:
. RISC → increases NInst , but reduces CPI and allows a higher
FClk , since simple instructions map to small, fast circuits
. CISC → does the opposite: decreases NInst , but increases
CPI, which could even be variable for different instructions!
• Which one reduces tCPU is algorithm, program, compiler &
architecture dependent, so no quick judgments here
ELEC-H-473 Th03 39/65
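The tCPU trade-off can be played with numerically. The instruction counts, CPI values and clock frequencies below are made-up assumptions purely to illustrate that neither side wins by construction:

```python
# t_CPU = N_Inst * CPI * 1/F_Clk, in seconds.
def t_cpu(n_inst, cpi, f_clk):
    return n_inst * cpi / f_clk

# RISC-ish: more instructions, CPI ~ 1, higher clock (assumed numbers)
risc = t_cpu(n_inst=1_200_000, cpi=1.0, f_clk=2.0e9)
# CISC-ish: fewer instructions, higher average CPI, lower clock (assumed)
cisc = t_cpu(n_inst=800_000, cpi=2.5, f_clk=1.5e9)
print(risc, cisc)   # which one wins depends entirely on the numbers
```

Changing any of the three factors flips the outcome, which is the slide's point: the winner is algorithm, compiler and architecture dependent.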


RISC in academia
RISC was and still is very popular in academia, a few examples:
• Berkeley RISC – SPARC: Scalable Processor Architecture
. A lot of registers for local computation
. Low-latency instructions (typically 1 cycle)
. Still in action ... but less and less, we will see a concrete example
• Stanford – MIPS: Microprocessor without Interlocked Pipelined
Stages (not the same as Million Instructions Per Second)
. Load/store architecture (works on registers only)
. Found in game consoles (Nintendo 64, Sony PlayStation, etc.)
. About to become open-source
• RISC-V – originally from Berkeley, but with contributions from all
around; highly configurable, fast, low-power & silicon proven
• RISC16 – simple simulator for learning (used during labs)

ELEC-H-473 Th03 40/65


Commercial RISC/CISC
• During the ’90s RISC was mostly associated with mobile applications:
ARM & ARM-derived CPUs found in tablets & smartphones
• But not only → RISC-based HPC machines:
. Sun Microsystems (acquired by Oracle) SPARC line of CPUs from
T1 to T5 (2013) targets servers, 1 CPU = 16 & 32 cores
. Fujitsu K computer – super-computer (2011) based on SPARC
cores (10 Peta-flops, 80k 8-core processors @ 2GHz)
. Fujitsu A64FX – the fastest super-computer in 2020 uses 48 ARM-v8
cores per CPU in 7nm CMOS
• On the CISC side: Intel (obviously), AMD and IBM


ELEC-H-473 Th03 41/65
Today
• RISC vs. CISC is hardly relevant today; the tendency is to abandon big,
fat cores in favor of more heterogeneous architectures; this is true even in
the HPC/server communities; there are many low-power ARM-based
servers on the market
• Apple abandoned Intel → M1 Max
• Highly optimized CPUs made
using state-of-the-art 5nm
CMOS (57 billion transistors)
• 8 high-performance, 2 low-power
cores; 32 GPUs; 2×16 cores
neural engine; dedicated video
encoder & decoder engines
• Loads of local memory
ELEC-H-473 Th03 42/65


5. Instruction execution cycle & circuit

ELEC-H-473 Th03 43/65


Instruction execution model
• Defines a sequence of steps that need to be executed one after
the other & in a given order (we can’t execute before decode)
• So the execution of F, D, Ex, W of a single instruction, assuming
different execution times ti per stage:

      t1      t2      t3      t4
   |   F   |   D   |   Ex  |   W   |
   t0      t1      t2      t3      t4

• A valid question from the HW perspective: how to execute these 4
steps for subsequent instructions? → different ways of doing this
will result in different computer architectures
ELEC-H-473 Th03 44/65
Single Issue Base Machine – SIBM (1/2)
• Simple approach or KISS – Keep It Simple & Stupid: the next
instruction starts to execute only after the previous one has finished
• The proposed execution model is referred to as a Single Issue Base Machine
(SIBM); “single” → only 1 instruction can be issued & executed
at a time (“single” will become “multiple” in the next lectures)
• In the example below instruction i1 will start execution only at t4

   Instructions
   i1              F  D  Ex W
   i0  F  D  Ex W
       t0 t1 t2 t3 t4 t5 t6 t7   → Time

ELEC-H-473 Th03 45/65


Single Issue Base Machine – SIBM (2/2)
• Let’s assume for now that each of the 4 steps takes exactly the same
amount of time t to execute:
. While the above assumption is not necessarily easy to achieve in the
real world, CPU designers try very hard to come close to it! We
will see why this is SO important in computer architectures
• So, the execution of any instruction takes 4 × t; a program with N
instructions takes N × 4 × t to execute; instructions can be issued
and completed with a frequency of FClk = 1/(4 × t)
• In practical systems, such as CPUs implemented using CMOS ICs,
the FClk of a SIBM will be very low
• Thus SIBMs are not good at all from a performance perspective: we
need to complete F, D, Ex, W for each instruction before issuing the
next instruction for execution
• No practical computer works like this, performance would be way
too low → we need to (seriously!) improve things
ELEC-H-473 Th03 46/65
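The SIBM timing above (N × 4 × t total, FClk = 1/(4 × t)) can be checked with a couple of lines. Times are expressed in nanoseconds, and the 1 ns step delay is an assumption for illustration:

```python
# SIBM: 4 equal steps of t_step each, executed strictly one instruction
# at a time, so N instructions take N * 4 * t_step.
def sibm_exec_time_ns(n_inst, t_step_ns, n_steps=4):
    """Total SIBM execution time of n_inst instructions, in ns."""
    return n_inst * n_steps * t_step_ns

total = sibm_exec_time_ns(1000, 1)   # 1000 instructions, t = 1 ns per step
print(total)                         # → 4000 (ns)
f_clk_hz = 1 / (4 * 1e-9)            # issue frequency for t = 1 ns: 250 MHz
```

Even with a fast 1 ns step, the machine can only issue an instruction every 4 ns — the motivation for breaking the circuit up, as the next slides do.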
CPUs and ICs
• Before going any further: do we agree on the following?
. The CPU is a logic circuit (and a sequential one, with FFs)
. Any logic circuit has a critical path – the path that exhibits the biggest
delay – the sum of all gate and interconnect delays
. The performance of the whole circuit (so the performance of the CPU) is
limited by the performance of the critical path
• If this is clear, then we should also agree on the following:
. The bigger the circuit, the better the chances that the critical path is longer,
so larger t & lower FClk ; and inversely: the smaller the circuit, the smaller
the critical path should be (so, smaller t & higher FClk )
• We can then roughly conclude: smaller circuits are FASTER and
bigger circuits are SLOWER; thus making a CPU that executes
all 4 instruction steps as a “single” circuit doesn’t
make sense, as in the SIBM: it is better to divide the CPU into multiple
& smaller logic sub-circuits
ELEC-H-473 Th03 47/65
Reducing the critical path delay 1/3
• The idea of the link between circuit size & critical path delay
is used when designing any circuit: we can split one big
circuit into 2 (or more) smaller sub-circuits
• Let’s split a circuit so that 1/2 of the critical path is in one
sub-circuit (SmallCircuit1), and the other 1/2 in the other
sub-circuit (SmallCircuit2):

[Figure: one Big Circuit (one big critical path) vs. SmallCircuit1 + SmallCircuit2 (two smaller critical paths)]

• But this alone doesn’t change things, the sum of delays remains the same ...
ELEC-H-473 Th03 48/65
Reducing the critical path delay 2/3
• To really gain, on each wire connecting the 2 sub-circuits we insert a
Flip-Flop (FF) – a synchronous memory element driven by the rising Clk edge
• FFs will store the result of the operation of the SmallCircuit1:

[Figure: SmallCircuit1 → bank of FFs (D/Q inputs, common Clk) → SmallCircuit2]

• FF insertion makes the two circuits independent; now they can
operate in parallel, and the critical path is really cut in 2



Reducing the critical path delay 3/3
• This is possible because FFs update their content only after a
certain amount of time (the next Clk rising edge); the “time distance”
between the two circuits is 1 clock cycle
• In t1 , SmallCircuit1 calculates a result from input data d1
(SmallCircuit2 is idle); in t2 SmallCircuit2 receives as input
the result of the SmallCircuit1 computation from the FFs
• Since the result of the SmallCircuit1 computation on d1 is stored in
the FFs, we can start computing d2 with SmallCircuit1
[Figure: SmallCircuit1 → bank of FFs (D/Q inputs, common Clk) → SmallCircuit2]

The above circuit, compared to the one big circuit, is said to be pipelined
with one stage
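A cycle-accurate toy model of this one-stage pipeline (the two “halves” are placeholder functions standing in for real logic) shows the two sub-circuits working on different data items in the same cycle:

```python
# Toy model of the one-stage pipeline above: a FF bank between
# SmallCircuit1 and SmallCircuit2 lets them process *different* data
# items during the same clock cycle.
def small_circuit1(x):
    return x + 1            # placeholder for the first half of the logic

def small_circuit2(x):
    return x * 2            # placeholder for the second half

def run_pipelined(inputs):
    ff = None               # the flip-flop bank between the two halves
    outputs = []
    for cycle_input in list(inputs) + [None]:   # one extra cycle to drain
        # Both halves compute during the same cycle, on different items:
        if ff is not None:
            outputs.append(small_circuit2(ff))
        ff = small_circuit1(cycle_input) if cycle_input is not None else None
    return outputs

print(run_pipelined([1, 2, 3]))   # [(1+1)*2, (2+1)*2, (3+1)*2] = [4, 6, 8]
```

Each result is two cycles away from its input, but a new input can enter every cycle.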
Logic circuits are sequential, CPUs too (1/2)
• They already have FFs to avoid race conditions since they use
state machines (I suppose you understand this?); and they do use FFs
from Inputs to Outputs (the RTL paradigm we saw in logic circuits)
• In other words, they already operate Register-to-Register
• In one big circuit (Single Issue Base Machine), the result is one cycle
away, with a long cycle period because of circuit complexity:

[Figure: input FF bank (D/Q, Clk) → Big Circuit → output FF bank (D/Q, Clk);
the whole critical path sits between the two register banks]



Typical logic circuits are sequential, CPU too (2/2)
• In a pipelined circuit, the result is two cycles away, because we broke
the critical path in two and inserted FFs between the two sub-circuits:

[Figure: input FF bank → SmallCircuit1 → middle FF bank → SmallCircuit2
→ output FF bank (D/Q, Clk)]

• Functionally, “the big” & “the broken in two” circuits are
identical ... if we allow enough execution cycles
• The difference is in the timing references of the intermediate results



Pipeline stage insertion: consequence
• When inserting a pipeline stage in a circuit we hopefully break the
critical path equally, cutting the delay of each sub-circuit in half
• This means FClk increases by a factor of 2
• But we do insert one clock cycle of delay: circuit latency

[Figure: input FF bank → SmallCircuit1 → middle FF bank → SmallCircuit2
→ output FF bank (D/Q, Clk)]

Pipelined circuits tend to increase the operating frequency, but
introduce delay expressed in extra clock cycles
6. Pipeline execution



Pipelining and CPU instruction execution model
• Each execution model step (so F, D, Ex, W) could be implemented as a
distinct logic sub-circuit; all sub-circuits are linked to each other with
FFs that isolate the computation of the different stages & pass results
from one pipeline stage to another
• Because F, D, Ex, W are sub-circuits, they need fewer logic gates and
wires; and being smaller they can be faster: the critical path is
“shorter” in smaller circuits since we have fewer gates & interconnect
• First instruction execution is therefore delayed 4 cycles, one clock
cycle delay per pipeline stage; when an instruction enters the instruction
pipeline, the result will pop up at the output 4 cycles later
• We define instruction latency as the number of clock cycles elapsed
between instruction fetch (F) and operand write (W)
• We want to minimize the individual latency of all instructions
to maximize program performance
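The 4-stage schedule described above can be sketched as follows (an ideal pipeline with no stalls is assumed):

```python
# Ideal 4-stage pipeline schedule (F, D, Ex, W): instruction i enters F at
# cycle i and writes back at cycle i + 3, so its latency is 4 cycles while
# the steady-state throughput is 1 instruction per cycle.
STAGES = ["F", "D", "Ex", "W"]

def schedule(n_instructions):
    """Return {instruction: {cycle: stage}} for an ideal pipeline."""
    plan = {}
    for i in range(n_instructions):
        plan[i] = {i + s: STAGES[s] for s in range(len(STAGES))}
    return plan

plan = schedule(4)
print(plan[0])                       # {0: 'F', 1: 'D', 2: 'Ex', 3: 'W'}
# Total cycles for N instructions: 4 cycles to fill, then 1 per instruction.
total_cycles = max(c for stages in plan.values() for c in stages) + 1
print(total_cycles)                  # 4 instructions finish in 7 cycles
```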
Pipeline – timing diagram 1/2
Instructions
i3              F   D   Ex  W
i2          F   D   Ex  W
i1      F   D   Ex  W
i0  F   D   Ex  W
    t0  t1  t2  t3  t4  t5  t6   Time

• In t0 – we fetch the first instruction i0
• In t1 – i0 fetch done, i1 can be fetched AND (&) i0 decoded
• In t2 – we fetch i2 , decode i1 & execute i0
• In t3 – we fetch i3 , decode i2 , execute i1 & write i0
• and so on; the pipeline is said to be full
Pipeline – timing diagram 2/2
Instructions
i5                      F   D   Ex  W
i4                  F   D   Ex  W
i3              F   D   Ex  W
i2          F   D   Ex  W
i1      F   D   Ex  W
i0  F   D   Ex  W
    t0  t1  t2  t3  t4  t5  t6  t7  t8   Time

• After t3 , on every clock cycle, 1 new instruction is fetched,
decoded, executed and written (terminated)
• For an external observer, once the pipeline is full, every
instruction appears to take only one cycle to execute
• The latency of an instruction however remains 4 cycles, so the total
“time” an instruction takes to execute in the CPU remains the same
Computing pipeline acceleration
• For N instructions and n pipeline stages, the total cycle count is:
. for a Single Issue Base Machine:

Csibm = N × n

. for an n-stage pipeline CPU:

Cpipe = n + N − 1

• The acceleration is then:


Cpipe
Acc =
Csibm
• If N  n (N is big compared to n, true for big programs):
Cpipe n+N −1 1
Acc = lim = lim =
N→∞ Csibm N→∞ N × n n
Acceleration is proportional to the number of pipeline stages!
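A quick numeric check of the formulas above (Python used only to evaluate them):

```python
# Acc = Csibm / Cpipe = (N * n) / (n + N - 1) tends to n for large N.
def acceleration(N, n):
    c_sibm = N * n           # SIBM: n cycles per instruction
    c_pipe = n + N - 1       # pipeline: n cycles to fill, then 1 per cycle
    return c_sibm / c_pipe

n = 4
for N in (10, 1_000, 1_000_000):
    print(N, acceleration(N, n))   # approaches n = 4 as N grows
```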
A 100-year-old concept, due to Henry Ford!

(1) Place the tools and the men in the sequence of the operation so that each
component part shall travel the least possible distance while in the process of finishing.
(2) Use work slides or some other form of carrier so that when a workman completes
his operation, he drops the part always in the same place—which place must always
be the most convenient place to his hand—and if possible have gravity carry the part to
the next workman for his own.
(3) Use sliding assembling lines by which the parts to be assembled are delivered at
convenient distances.
(1913)



Analogy with CPU
[Figure: assembly-line analogy, with stations labeled FETCH, DECODE,
EXECUTE, WRITE]

Explain the concept of latency in this picture



Pipeline & Super-pipeline
• If we speed things up with deeper pipelines, should we insert an
infinite number of pipeline stages?
NEVER!
• Each extra pipeline stage adds an extra latency cycle; an infinite number
of pipeline stages would mean infinite latency!
• An acceleration factor of n is possible only if we execute a huge
number of instructions taking the same amount of time
• Thus RISC ISAs can use the pipeline better than other ISAs
• In practice the number of stages is a trade-off between critical path
depth (minimum period, so maximum frequency) and latency
• Some CPUs are super-pipelined – we can have a few dozen
pipeline stages (not anymore these days)
. Past Intel x86 CPUs have been heavily pipelined with >30 stages,
good for marketing since FClk goes up, but latency too; these days <15
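The stage-count trade-off can be illustrated numerically (the total logic depth and the per-stage FF overhead below are assumed numbers, not data for any real CPU):

```python
# More stages raise FClk but add latency cycles; once the per-stage FF
# overhead (setup time, clock-to-Q) dominates, latency in *seconds* only
# gets worse.
LOGIC_DEPTH_NS = 8.0     # assumed total combinational delay of the CPU
FF_OVERHEAD_NS = 0.2     # assumed per-stage flip-flop overhead

def pipeline_stats(n_stages):
    period = LOGIC_DEPTH_NS / n_stages + FF_OVERHEAD_NS
    f_clk_ghz = 1.0 / period
    latency_ns = n_stages * period       # n cycles per instruction
    return f_clk_ghz, latency_ns

for n in (1, 4, 15, 30):
    f, lat = pipeline_stats(n)
    print(f"{n:2d} stages: FClk = {f:.2f} GHz, latency = {lat:.1f} ns")
# FClk keeps rising with n, but the latency in ns only grows:
# 8.2, 8.8, 11.0, 14.0 ns for 1, 4, 15, 30 stages.
```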
Core organization at IC-level – layout of 8008 CPU
Note the placement of different functional blocks (RF close to ALU,
instruction register close to instruction decoding stage etc.)



Pipeline & the execution model 1/2
• Pipeline execution introduces something special → parallelism,
i.e., computations are happening at the same time!
• Pipelining is one of the few techniques said to exploit what is called
Instruction Level Parallelism (ILP); ILP means that we execute
some things at the same time, true even in a single core, single
ALU machine!
• In the case of a pipeline, thanks to pipelined logic circuit operation,
ILP is possible because we have overlapped execution of different
stages of different instructions
• To measure the efficiency of the pipeline we use Instructions Per
Cycle – IPC (different from the CPI that we already mentioned)
• For a system with a single ALU, an IPC of 1 means that ILP is
used at its best! In practice the IPC is rarely at this value; in the next
lecture we will see what causes IPC < 1
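A minimal sketch of the IPC metric (the instruction and stall counts below are invented for illustration):

```python
# Instructions Per Cycle (IPC) as a pipeline-efficiency metric. For a
# single-ALU, 4-stage pipeline the ideal is IPC = 1; stall cycles push
# the cycle count up and the IPC below 1.
def ipc(instructions, cycles):
    return instructions / cycles

n, stages = 1_000, 4
ideal_cycles = stages + n - 1            # full pipeline, no stalls
print(ipc(n, ideal_cycles))              # just under 1 (1000/1003)
print(ipc(n, ideal_cycles + 200))        # 200 stall cycles push IPC down
```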
Pipeline & the execution model 2/2
• Pipeline execution is handled automatically by the HW of the
CPU; you do not see this happening!
• But don’t expect miracles: you can write code that is
gentle to the execution pipeline ... just as you could write code that
kills pipelined execution! and then the performance of your
program will be poor
• There is a clear link between HW & SW: think of the hypothesis
we made to calculate the acceleration of a pipelined architecture!
• It is in general a good idea to check how well the pipeline is used
during SW execution, especially in loops with many iterations
. Why loops with many iterations?
• Because pipeline execution is hidden from us, we need specific
tools (CPU simulators) that will help us understand how well
the pipeline is used
. Example: VTune profiling software for Intel processors
Things to take
• Even single-CPU computing systems, with 1 core & 1 ALU, are
executing things in parallel because of:
. Bit-level parallelism at word, operand level – e.g. an adder circuit will
add multiple bits of a word in a single clock cycle
. Pipelined HW for the F, D, Ex, W operations that enables ILP
• Pipelined execution of instructions:
. Enables higher instruction throughput, with computation
acceleration proportional to the number of pipeline stages; but to
make the most of it we need nicely filled pipeline stages
. More pipeline stages enable a higher operating FClk , since the
smaller logic sub-circuits have shorter critical paths
. But more stages increase instruction latency
• Computing systems today most of the time use a modified
Harvard architecture, a blend of CISC/RISC ISA, with more or
fewer pipeline stages, depending on the application they target