Advanced Computer Architecture: 1.0 Objective
MODULE 1
1.0 Objective
The main aim of this chapter is to learn about the evolution of computer systems, the
attributes by which system performance is measured, the classification of computers by
their ability to perform multiprocessing, and the trends towards parallel processing.
1.1 Introduction
From an application point of view, mainstream computer usage is experiencing a trend
of four ascending levels of sophistication:
• Data processing
• Information processing
• Knowledge processing
• Intelligence processing
With more and more data structures developed, many users are shifting their use of
computers from pure data processing to information processing. A high degree of
parallelism has been found at these levels. As accumulated knowledge bases expanded
rapidly in recent years, a strong demand grew to use computers for knowledge processing.
Intelligence is very difficult to create; its processing even more so. Today's computers are
very fast and obedient and have many reliable memory cells, which qualifies them for
data-information-knowledge processing.
Parallel processing is emerging as one of the key technologies in the area of modern
computers. Parallelism appears in various forms, such as lookahead, vectorization,
concurrency, simultaneity, data parallelism, interleaving, overlapping, multiplicity,
replication, multiprogramming, multithreading and distributed computing, at different
processing levels.
1.2 The state of computing
Modern computers are equipped with powerful hardware technology and, at the same time,
loaded with sophisticated software packages. To assess the state of the art of computing,
we first review the history of computers and then study the attributes used to analyze the
performance of computers.
1.2.1 Evolution of computer system
At present, the technology used in computer hardware components and in overall computer
architecture is changing very rapidly. For example, processor clock rates increase by
about 20% a year and logic capacity improves by about 30% a year; memory speed increases
by about 10% a year and memory capacity by about 60% a year; disk capacity also increases
by about 60% a year; and the overall cost per bit improves by about 25% a year.
But before we go further with the design and organization issues of parallel computer
architecture, it is necessary to understand how computers have evolved. Initially, man used
simple mechanical devices – abacus (about 500 BC), knotted string, and the slide rule for
implementations. And designers always tried to manufacture a new machine that should
be upward compatible with the older machines.
3) The concept of specialized registers was introduced: for example, index registers were
introduced in the Ferranti Mark I, a register that saves the return address in the
UNIVAC I, immediate operands in the IBM 704, and the detection of invalid operations in
the IBM 650.
4) Punch cards or paper tape were used at that time for storing programs. By the end of
the 1950s the IBM 650 had become one of the most popular computers of its time; it used
drum memory, onto which programs were loaded from punch cards or paper tape. Some
high-end machines also introduced core memory, which provided higher speeds. Hard disks
also started becoming popular.
5) As noted earlier, computers of the early 1950s were design specific: most of them were
designed for particular numerical processing tasks. Many of them even used decimal
numbers as the base number system for their instruction sets. In such machines there were
actually ten vacuum tubes per digit in each register.
6) Software used was machine level language and assembly language.
7) Mostly designed for scientific calculation and later some systems were developed for
simple business systems.
8) Architecture features
Vacuum tubes and relay memories
CPU driven by a program counter (PC) and accumulator
Machines had only fixed-point arithmetic
9) Software and Applications
Machine and assembly language
Single user at a time
No subroutine linkage mechanisms
Programmed I/O required continuous use of CPU
10) examples: ENIAC, Princeton IAS, IBM 701
smaller, cheaper and dissipate less heat as compared to vacuum tube. Transistors were now
used instead of vacuum tubes to construct computers. Another major invention was the
magnetic core for storage. These cores were used to build large random access memories.
Computers of this generation had better processing speed, larger memory capacity, and
smaller size compared to the previous generation.
The key features of this generation of computers were:
1) The IInd generation computers were designed using germanium transistors; this
technology was much more reliable than vacuum tube technology.
2) Transistor technology reduced the switching time to 1 to 10 microseconds, thus
providing an overall speedup.
3) Magnetic cores were used for main memory, with a capacity of 100 KB. Tapes and disk
peripheral memory were used as secondary memory.
4) Introduction of the concept of instruction sets, so that the same program can be
executed on different systems.
5) High-level languages (FORTRAN, COBOL, Algol) and batch operating systems.
6) Computers were now used for extensive business applications, engineering design,
optimization using linear programming, and scientific research.
7) The binary number system was widely used.
8) Technology and Architecture
Discrete transistors and core memories
I/O processors, multiplexed memory access
Floating-point arithmetic available
Register Transfer Language (RTL) developed
9) Software and Applications
High-level languages (HLL): FORTRAN, COBOL, ALGOL with compilers and
subroutine libraries
Batch operating system was used, although mostly single user at a time
10) Examples: CDC 1604, UNIVAC LARC, IBM 7090
IIIrd Generation computers (1965 to 1974)
In the 1950s and 1960s discrete components (transistors, resistors, capacitors) were
manufactured and packaged in separate containers. To build a computer, these discrete
units were soldered or wired together on circuit boards. Another revolution in computer
design came in the 1960s, when the Apollo guidance computer and the Minuteman missile
programs drove the development of the integrated circuit (commonly called the IC). ICs
made circuit design more economical and practical. IC-based computers are called third
generation computers. Because an integrated circuit contains transistors, resistors, and
capacitors on a single chip, eliminating wired interconnections, the space required for
the computer was greatly reduced. By the mid-1970s, the use of ICs in computers had become
very common, and the price of transistors had fallen greatly. It was now possible to put
all the components required for a CPU on a single printed circuit board. This advance in
technology resulted in the development of minicomputers, usually with a 16-bit word size
and memory in the range of 4K to 64K. This began a new era of microelectronics in which
it became possible to design small identical chips (each a thin wafer of silicon). Each
chip has many gates plus a number of input/output pins.
the program into small instructions and the processor works on these instructions in
different stages of completion. For example, while calculating the result of the current
instruction, the processor also retrieves the operands for the next instruction. Building
on this concept, superscalar processors were later designed; here, to execute multiple
instructions in parallel, there are multiple execution units, i.e., separate
arithmetic-logic units (ALUs). Instead of executing a single instruction at a time, the
CPU looks for several instructions that are not dependent on each other and executes them
in parallel. Related designs that execute independent instructions in parallel include
VLIW and EPIC.
1) Technology and Architecture features
ULSI/VHSIC processors, memory, and switches
High-density packaging
Scalable architecture
Vector processors
2) Software and Applications
Massively parallel processing
Grand challenge applications
Heterogeneous processing
3) Examples : Fujitsu VPP500, Cray MPP, TMC CM-5, Intel Paragon
Elements of Modern Computers
The hardware, software, and programming elements of modern computer systems can be
characterized by looking at a variety of factors. In the context of parallel computing,
these factors are:
• Computing problems
• Algorithms and data structures
• Hardware resources
• Operating systems
• System software support
• Compiler support
Computing Problems
• Numerical computing: complex mathematical formulations; tedious integer or
floating-point computation
• Transaction processing: accurate transactions; large database management;
information retrieval
• Logical reasoning: logic inferences; symbolic manipulations
Algorithms and Data Structures
• Traditional algorithms and data structures are designed for sequential machines.
• New, specialized algorithms and data structures are needed to exploit the
capabilities of parallel architectures.
• These often require interdisciplinary interactions among theoreticians,
experimentalists, and programmers.
Hardware Resources
• The architecture of a system is shaped only partly by the hardware resources.
• The operating system and applications also significantly influence the overall
architecture.
• Not only must the processor and memory architectures be considered, but also the
architecture of the device interfaces (which often include their own advanced
processors).
Operating System
• Operating systems manage the allocation and deallocation of resources during
user program execution.
• UNIX, Mach, and OSF/1 provide support for multiprocessors and multicomputers,
including multithreaded kernel functions, virtual memory management, file subsystems,
and network communication services.
• An OS plays a significant role in mapping hardware resources to algorithms and
data structures.
System Software Support
• Compilers, assemblers, and loaders are traditional tools for developing programs
in high-level languages. Together with the operating system, these tools determine the
binding of resources to applications, and the effectiveness of this binding determines
the efficiency of hardware utilization and the system's programmability.
• Most programmers still employ a sequential mind set, abetted by a lack of popular
parallel software support.
• Parallel software can be developed using entirely new languages designed
specifically with parallel support as its goal, or by using extensions to existing
sequential languages.
• New languages have obvious advantages (like new constructs specifically for
parallelism), but require additional programmer education and system software.
• The most common approach is to extend an existing language.
Compiler Support
• Preprocessors use existing sequential compilers and specialized libraries to
implement parallel constructs.
• Precompilers perform some program flow analysis, dependence checking, and
limited parallel optimizations.
• Parallelizing compilers require full detection of parallelism in source code, and
transformation of sequential code into parallel constructs.
• Compiler directives are often inserted into source code to aid the compiler's
parallelizing efforts.
1.2.3 Flynn's Classical Taxonomy
Among the classifications mentioned above, the one most widely used, since 1966, is
Flynn's Taxonomy. This taxonomy distinguishes multiprocessor computer architectures along
two independent dimensions: instruction stream and data stream. An instruction stream is
the sequence of instructions executed by the machine, and a data stream is the sequence of
data, including input and partial or temporary results, used by the instruction stream.
Each of these dimensions can have only one of two possible states: Single or Multiple.
Flynn's classification depends on the distinction between the performance of the control
unit and the data processing unit rather than on their operational and structural
interconnections. The following are the four categories of Flynn's classification and the
characteristic features of each.
1. Single instruction stream, single data stream (SISD)
Figure 1.1 Execution of instruction in SISD processors
Figure 1.1 represents the organization of a simple SISD computer having one control
unit, one processor unit and a single memory unit.
• Multiple data: Each processing unit can operate on a different data element. As
shown in the figure below, the processors are connected to shared memory or to an
interconnection network that provides multiple data streams to the processing units.
c) Multiple instruction stream, single data stream (MISD)
• A single data stream is fed into multiple processing units.
• Each processing unit operates on the data independently via independent
instruction streams. As shown in figure 1.5, a single data stream is forwarded to
different processing units, each connected to its own control unit and executing
the instructions given to it by that control unit.
d) Multiple instruction stream, multiple data stream (MIMD)
• Multiple Instruction: every processor may be executing a different instruction
stream
• Multiple Data: every processor may be working with a different data stream. As
shown in figure 1.7, the multiple data streams are provided by shared memory.
• Can be categorized as loosely coupled or tightly coupled depending on sharing of
data and control
• Execution can be synchronous or asynchronous, deterministic or
non-deterministic
Some popular computer architectures and their types:
Type                                     Examples
SISD                                     IBM 701; IBM 1620; IBM 7090; PDP VAX 11/780
SISD (with multiple functional units)    IBM 360/91 (3); IBM 370/168 UP
SIMD (word-slice processing)             Illiac IV; PEPE
SIMD (bit-slice processing)              STARAN; MPP; DAP
MIMD (loosely coupled)                   IBM 370/168 MP; Univac 1100/80
MIMD (tightly coupled)                   Burroughs D-825
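The two-dimensional nature of Flynn's classification can be sketched in a few lines of Python. This is only an illustration: the function name flynn_class and the idea of passing stream counts as integers are assumptions made for the example, not part of the taxonomy itself.

```python
def flynn_class(instruction_streams, data_streams):
    """Return the Flynn category for the given numbers of streams.

    Each dimension has only two states: Single (exactly 1) or
    Multiple (more than 1).
    """
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a machine needs at least one stream of each kind")
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"


# Hypothetical machines described only by their stream counts.
print(flynn_class(1, 1))    # uniprocessor
print(flynn_class(1, 64))   # array processor with 64 data streams
print(flynn_class(4, 4))    # multiprocessor
```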
1.2.4 PERFORMANCE ATTRIBUTES
Performance of a system depends on
• hardware technology
• architectural features
• efficient resource management
• algorithm design
• data structures
• language efficiency
• programmer skill
• compiler technology
When we talk about performance of computer system we would describe how quickly a
given system can execute a program or programs. Thus we are interested in knowing the
turnaround time. Turnaround time depends on:
• disk and memory accesses
• input and output
• compilation time
• operating system overhead
• CPU time
An ideal performance of a computer system means a perfect match between the machine
capability and program behavior. The machine capability can be improved by using
better hardware technology and efficient resource management. But as far as program
behavior is concerned it depends on code used, compiler used and other run time
conditions. Also a machine performance may vary from program to program. Because
there are too many programs and it is impractical to test a CPU's speed on all of them,
benchmarks were developed. Computer architects have come up with a variety of metrics
to describe the computer performance.
Clock rate and CPI / IPC: Since I/O and system overhead frequently overlap
processing by other programs, it is fair to consider only the CPU time used by a program,
and the user CPU time is the most important factor. The CPU is driven by a clock with a
constant cycle time τ (usually measured in nanoseconds), which controls the rate of
internal operations in the CPU. The inverse of the cycle time is the clock rate (f = 1/τ,
measured in megahertz). A shorter clock cycle time, or equivalently a larger number of
cycles per second, implies more operations can be performed per unit time. The size of a
program is determined by its instruction count, Ic, the number of machine instructions to
be executed by the program. Different machine instructions require different numbers of
clock cycles to execute. CPI (cycles per instruction) is thus an important parameter.
Average CPI
It is easy to determine the average number of cycles per instruction for a particular
processor if we know the frequency of occurrence of each instruction type.
Of course, any estimate is valid only for a specific set of programs (which defines the
instruction mix), and then only if there is a sufficiently large number of instructions.
In general, the term CPI is used with respect to a particular instruction set and a given
program mix. The time required to execute a program containing Ic instructions is just T
= Ic * CPI * τ.
Each instruction must be fetched from memory, decoded, then operands fetched from
memory, the instruction executed, and the results stored.
The time required to access memory is called the memory cycle time, which is usually k
times the processor cycle time τ. The value of k depends on the memory technology and
the processor-memory interconnection scheme. The processor cycles required for each
instruction (CPI) can be attributed to cycles needed for instruction decode and execution
(p), and cycles needed for memory references (m* k).
The total time needed to execute a program can then be rewritten as
T = Ic* (p + m*k)*τ.
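The two forms of the execution-time formula above can be checked with a short Python sketch. The instruction mix, cycle time and the values chosen for p, m and k below are hypothetical numbers picked only for illustration; the function names are likewise assumptions.

```python
def average_cpi(mix):
    """Weighted average CPI from (frequency, cycles) pairs.

    The frequencies are fractions of the instruction mix and must sum to 1.
    """
    total = sum(freq for freq, _ in mix)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("instruction frequencies must sum to 1")
    return sum(freq * cycles for freq, cycles in mix)


def cpu_time(ic, cpi, tau):
    """T = Ic * CPI * tau, with tau in seconds."""
    return ic * cpi * tau


def cpu_time_detailed(ic, p, m, k, tau):
    """T = Ic * (p + m*k) * tau, splitting CPI into processor and memory cycles."""
    return ic * (p + m * k) * tau


# Hypothetical mix: 50% 1-cycle, 30% 2-cycle, 20% 4-cycle instructions.
mix = [(0.5, 1), (0.3, 2), (0.2, 4)]
cpi = average_cpi(mix)

# Hypothetical program of one million instructions on a 25 ns clock.
t_simple = cpu_time(1_000_000, cpi, 25e-9)

# The same CPI expressed as p = 1.5 processor cycles plus
# m = 0.1 memory references of k = 4 cycles each.
t_detailed = cpu_time_detailed(1_000_000, 1.5, 0.1, 4, 25e-9)

print(cpi, t_simple, t_detailed)
```

Both calls give the same time because p + m*k was chosen to equal the average CPI of the mix.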
MIPS: Millions of instructions per second. This is calculated by dividing the number
of instructions executed in a running program by the time required to run the program
(and by 10^6). The MIPS rate is directly proportional to the clock rate and inversely
proportional to the CPI. All four system attributes (instruction set, compiler, processor,
and memory technologies) affect the MIPS rate, which also varies from program to
program. MIPS has not proved to be an effective measure, as it does not account for the
fact that different systems often require different numbers of instructions to implement
the same program; it says nothing about how many instructions are required to perform a
given task. With the variation in instruction styles, internal organization, and number of
processors per system, it is almost meaningless for comparing two systems.
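As a rough illustration, the MIPS rate can be computed either from a measured run or from the clock rate and CPI; the two agree when T = Ic * CPI / f. The function names and the sample numbers are assumptions chosen for the example.

```python
def mips_from_run(instruction_count, exec_time_s):
    """MIPS = Ic / (T * 10^6), from a measured run."""
    return instruction_count / (exec_time_s * 1e6)


def mips_from_clock(clock_hz, cpi):
    """MIPS = f / (CPI * 10^6): proportional to f, inversely proportional to CPI."""
    return clock_hz / (cpi * 1e6)


# Hypothetical machine: 40 MHz clock, average CPI of 2.
f = 40e6
cpi = 2.0
ic = 2_000_000
t = ic * cpi / f          # execution time in seconds

print(mips_from_run(ic, t), mips_from_clock(f, cpi))
```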
MFLOPS (pronounced "megaflops") stands for "millions of floating point operations
per second." This is often used as a "bottom-line" figure. If one knows ahead of time how
many operations a program needs to perform, one can divide the number of operations by
the execution time to come up with a MFLOPS rating. For example, the standard
algorithm for multiplying n*n matrices requires 2n^3 – n^2 operations (n^2 inner products,
with n multiplications and n-1 additions in each product). Suppose you compute the
product of two 100*100 matrices in 0.35 seconds. Then the computer achieves
(2(100)^3 – (100)^2)/0.35 ≈ 5,686,000 ops/sec ≈ 5.686 MFLOPS
The term "theoretical peak MFLOPS" refers to how many operations per second would
be possible if the machine did nothing but numerical operations. It is obtained by
calculating the time it takes to perform one operation and then computing how many of
them could be done in one second. For example, if it takes 8 cycles to do one floating
point multiplication, the cycle time on the machine is 20 nanoseconds, and arithmetic
operations are not overlapped with one another, it takes 160 ns for one multiplication, and
(1,000,000,000 nanoseconds / 1 sec) * (1 multiplication / 160 nanoseconds) = 6.25 * 10^6
multiplications/sec, so the theoretical peak performance is 6.25 MFLOPS. Of course,
programs are not just long sequences of multiply and add instructions, so a machine
rarely comes close to this level of performance on any real program. Most machines will
achieve less than 10% of their peak rating, but vector processors or other machines with
internal pipelines that have an effective CPI near 1.0 can often achieve 70% or more of
their theoretical peak on small programs.
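The two MFLOPS calculations can be reproduced with a small Python sketch. It counts operations directly from the description of the standard algorithm (n^2 inner products, each with n multiplications and n-1 additions); the function names are assumptions made for the example.

```python
def matmul_operation_count(n):
    """Operations in the standard n*n matrix product:
    n^2 inner products, each with n multiplications and n-1 additions."""
    return n * n * (n + (n - 1))


def achieved_mflops(operations, exec_time_s):
    """Achieved rate: operations / time, scaled to millions per second."""
    return operations / exec_time_s / 1e6


def peak_mflops(cycles_per_op, cycle_time_ns):
    """Theoretical peak when operations are not overlapped."""
    ns_per_op = cycles_per_op * cycle_time_ns
    return (1e9 / ns_per_op) / 1e6


ops = matmul_operation_count(100)
print(ops, achieved_mflops(ops, 0.35))   # roughly 5.7 MFLOPS
print(peak_mflops(8, 20))                # 8 cycles of 20 ns per multiplication
```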
Throughput rate: Another important measure of system performance is the system
throughput Ws, which is the number of programs a system can execute per unit time. In
multiprogramming, the system throughput is often lower than the CPU throughput Wp,
which is defined as
Wp = f/(Ic * CPI)
The unit of Wp is programs/second.
Ws < Wp because a multiprogramming environment always incurs additional overheads, such
as the timesharing operating system. Ideal behavior is not achieved in parallel computers
because, while executing a parallel algorithm, the processing elements cannot devote
100% of their time to the computations of the algorithm. Efficiency is a measure of the
fraction of time for which a processing element (PE) is usefully employed. In an ideal
parallel system efficiency is equal to one. In practice, efficiency is between zero and
one, depending on the overhead associated with parallel execution.
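A minimal sketch of the CPU throughput formula Wp = f/(Ic * CPI), using hypothetical values for the clock rate, instruction count and CPI:

```python
def cpu_throughput(clock_hz, instruction_count, cpi):
    """Wp = f / (Ic * CPI), in programs per second."""
    return clock_hz / (instruction_count * cpi)


# Hypothetical: 40 MHz clock, programs of 2 million instructions, CPI of 2.
wp = cpu_throughput(40e6, 2_000_000, 2.0)
print(wp)   # programs per second; the system throughput Ws would be lower
```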
Speed or Throughput (W/Tn) - the execution rate on an n-processor system, measured in
FLOPs/unit-time or instructions/unit-time.
Speedup (Sn = T1/Tn) - how much faster an actual machine with n processors performs the
workload compared to one processor. The ratio T1/T∞ is called the asymptotic speedup.
Efficiency (En = Sn/n) - the fraction of the theoretical maximum speedup achieved by n
processors.
Degree of Parallelism (DOP) - for a given piece of the workload, the number of
processors that can be kept busy sharing that piece of computation equally. Neglecting
overhead, we assume that if k processors work together on any workload, the workload
gets done k times as fast as a sequential execution.
Scalability - The attributes of a computer system which allow it to be gracefully and
linearly scaled up or down in size, to handle smaller or larger workloads, or to obtain
proportional decreases or increase in speed on a given application. The applications run
on a scalable machine may not scale well. Good scalability requires the algorithm and the
machine to have the right properties
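The speedup and efficiency definitions above translate directly into code. The timings used below are hypothetical and serve only to show the arithmetic.

```python
def speedup(t1, tn):
    """Sn = T1 / Tn."""
    return t1 / tn


def efficiency(t1, tn, n):
    """En = Sn / n, the fraction of the ideal n-fold speedup achieved."""
    return speedup(t1, tn) / n


# Hypothetical timings: 100 s on one processor, 16 s on 8 processors.
s8 = speedup(100.0, 16.0)
e8 = efficiency(100.0, 16.0, 8)
print(s8, e8)
```

With these numbers the speedup is below 8 and the efficiency is below one, which is the usual situation once parallel overhead is present.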
Thus in general there are five performance factors (Ic, p, m, k, τ), which are influenced
by four system attributes:
• instruction-set architecture (affects Ic and p)
• compiler technology (affects Ic, p and m)
• CPU implementation and control (affects p * τ)
• cache and memory hierarchy (affects memory access latency, k * τ)
Total CPU time can be used as a basis in estimating the execution rate of a
processor.
Programming Environments
Programmability depends on the programming environment provided to the users.
Conventional computers are used in a sequential programming environment with tools
developed for a uniprocessor computer. Parallel computers need parallel tools that allow
specification or easy detection of parallelism and operating systems that can perform
parallel scheduling of concurrent events, shared memory allocation, and shared peripheral
and communication links.
Implicit Parallelism
Use a conventional language (like C, Fortran, Lisp, or Pascal) to write the program.
Use a parallelizing compiler to translate the source code into parallel code.
The compiler must detect parallelism and assign target machine resources.
Success relies heavily on the quality of the compiler.
Explicit Parallelism
Programmer writes explicit parallel code using parallel dialects of common languages.
Compiler has reduced need to detect parallelism, but must still preserve existing
parallelism and assign target machine resources.
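As a small illustration of the explicit style, the sketch below has the programmer split the work and hand the pieces to a pool of workers. It uses Python's standard concurrent.futures module; the function name parallel_sum and the chunking scheme are assumptions made for this example. In standard CPython builds the global interpreter lock typically prevents CPU-bound threads from running simultaneously, so this shows the structure of explicitly parallel code rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum(data, workers=4):
    """Sum a list by explicitly splitting it into chunks, one task per chunk."""
    # Ceiling division so that at most `workers` chunks are created.
    size = max(1, -(-len(data) // workers))
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The programmer, not the compiler, decides what runs concurrently.
        partial_sums = list(pool.map(sum, chunks))
    # Combine the partial results sequentially.
    return sum(partial_sums)


print(parallel_sum(list(range(1, 101))))
```

Under implicit parallelism the same computation would be written as an ordinary sequential loop, and detecting the independent partial sums would be left to a parallelizing compiler.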
Needed Software Tools
Parallel extensions of conventional high-level languages.
Integrated environments to provide:
• different levels of program abstraction
• validation, testing and debugging
• performance prediction and monitoring
• visualization support to aid program development, performance measurement, and
graphics display and animation of computational results
1.3 MULTIPROCESSOR AND MULTICOMPUTERS
Two categories of parallel computers are discussed below: those with shared common
memory and those with unshared distributed memory.
1.3.1 Shared memory multiprocessors
• Shared memory parallel computers vary widely, but generally have in common
the ability for all processors to access all memory as global address space.
• Multiple processors can operate independently but share the same memory
resources.
• Changes in a memory location effected by one processor are visible to all other
processors.
• Shared memory machines can be divided into three main classes based upon
memory access times: UMA, NUMA and COMA.
Non-Uniform Memory Access (NUMA):
If cache coherency is maintained, then may also be called CC-NUMA - Cache Coherent
NUMA
Disadvantages:
CPU path, and for cache coherent systems, geometrically increase traffic
associated with cache/memory management.
• Programmer responsibility for synchronization constructs that ensure "correct"
access of global memory.
• Expense: it becomes increasingly difficult and expensive to design and produce
shared memory machines with ever increasing numbers of processors.
• Like shared memory systems, distributed memory systems vary widely but share
a common characteristic. Distributed memory systems require a communication
network to connect inter-processor memory.
• Processors have their own local memory. Memory addresses in one processor do
not map to another processor, so there is no concept of global address space
across all processors.
• Because each processor has its own local memory, it operates independently.
Changes it makes to its local memory have no effect on the memory of other
processors. Hence, the concept of cache coherency does not apply.
• When a processor needs access to data in another processor, it is usually the task
of the programmer to explicitly define how and when data is communicated.
Synchronization between tasks is likewise the programmer's responsibility.
• Modern multicomputers use hardware routers to pass messages. Based on the
interconnection, routers and channels used, multicomputers are divided into
generations:
o 1st generation: based on board technology, using hypercube architecture
and software-controlled message switching.
o 2nd generation: implemented with mesh-connected architecture, hardware
message routing and a software environment for medium-grained
distributed computing.
o 3rd generation: fine-grained multicomputers like the MIT J-Machine.
• The network "fabric" used for data transfer varies widely, though it can be as
simple as Ethernet.
Advantages:
Disadvantages:
• The programmer is responsible for many of the details associated with data
communication between processors.
• It may be difficult to map existing data structures, based on global memory, to
this memory organization.
• Non-uniform memory access (NUMA) times
A vector processor consists of a scalar processor and a vector unit, which could be
thought of as an independent functional unit capable of efficient vector operations.
1.4.1 Vector Hardware
Vector computers have hardware to perform vector operations efficiently. Operands
cannot be used directly from memory but rather are loaded into registers and are put
back in registers after the operation. Vector hardware has the special ability to overlap
or pipeline operand processing.
1.4.2 SIMD Array Processors
Synchronous parallel architectures coordinate concurrent operations in lockstep
through global clocks, central control units, or vector unit controllers. A synchronous
array of parallel processors is called an array processor. These processors are composed
of N identical processing elements (PEs) under the supervision of one control unit (CU).
This control unit is a computer with high-speed registers, local memory and an
arithmetic logic unit. An array processor is basically a single instruction, multiple
data (SIMD) computer. There are N data streams, one per processor, so different data can
be used in each processor. The figure below shows a typical SIMD or array processor.
clock is used. Thus at each step, i.e., at each global clock pulse, all processors
execute the same instruction, each on different data (single instruction, multiple data).
SIMD machines are particularly useful for solving problems involving vector
calculations, where one can easily exploit data parallelism. In such calculations the same
set of instructions is applied to all subsets of the data. Consider adding two vectors, each
having N elements, on a SIMD machine with N/2 processing elements. The same addition
instruction is issued to all N/2 processing elements, and all of them execute the
instruction simultaneously. It takes 2 steps to add the two vectors, compared to N steps on
a SISD machine. The distributed data can be loaded into the PEMs from an external source
via the system bus, or via system broadcast mode using the control bus.
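The step count in the vector addition example can be illustrated with a small sequential simulation of lockstep execution. This is only a sketch: the function name simd_vector_add is an assumption, and the inner loop stands in for processing elements that would all operate at the same time on real hardware.

```python
def simd_vector_add(a, b, num_pes):
    """Simulate lockstep addition of two vectors on num_pes processing elements.

    Returns the result vector and the number of lockstep steps used.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    if num_pes < 1:
        raise ValueError("need at least one processing element")
    n = len(a)
    result = [0] * n
    steps = 0
    for base in range(0, n, num_pes):
        # One step: every PE receives the same add instruction and
        # applies it to its own element (PEs past the end stay idle).
        for pe in range(num_pes):
            i = base + pe
            if i < n:
                result[i] = a[i] + b[i]
        steps += 1
    return result, steps


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10, 20, 30, 40, 50, 60, 70, 80]
print(simd_vector_add(a, b, num_pes=4))   # N = 8 elements on N/2 = 4 PEs
print(simd_vector_add(a, b, num_pes=1))   # one PE behaves like a SISD machine
```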
Array processors can be classified into two categories depending on how the memory units
are organized:
a. Dedicated memory organization
b. Global memory organization
A SIMD computer C is characterized by the following set of parameters:
C = <N, F, I, M>
where
N = the number of PEs in the system. For example, the Illiac IV has N = 64 and the
BSP has N = 16.
F = the set of data routing functions provided by the interconnection network.
I = the set of machine instructions for scalar, vector, data routing and network
manipulation operations.
M = the set of masking schemes, where each mask partitions the set of PEs into
disjoint subsets of enabled PEs and disabled PEs.
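The effect of a mask can be sketched as follows: enabled PEs apply the broadcast operation to their local value, while disabled PEs leave theirs unchanged. The function name masked_step and the representation of a mask as a list of booleans are assumptions made for illustration.

```python
def masked_step(values, mask, op):
    """Apply op on enabled PEs only; disabled PEs keep their value.

    values[i] is the local value held by PE i, mask[i] is True if PE i is enabled.
    """
    if len(values) != len(mask):
        raise ValueError("need exactly one mask bit per PE")
    return [op(v) if enabled else v for v, enabled in zip(values, mask)]


# Hypothetical 4-PE machine: only the even-numbered PEs are enabled.
values = [1, 2, 3, 4]
mask = [True, False, True, False]
print(masked_step(values, mask, lambda v: v * 10))
```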
1. The machine size n can be arbitrarily large
2. The machine is synchronous at the instruction level. That is, each processor is
executing its own series of instructions, and the entire machine operates at a basic time
step (cycle). Within each cycle, each processor executes exactly one operation or does
nothing, i.e. it is idle. An instruction can be any random access machine instruction, such
as: fetch some operands from memory, perform an ALU operation on the data, and store
the result back in memory.
3. All processors implicitly synchronize on each cycle and the synchronization overhead
is assumed to be zero. Communication is done through reading and writing of shared
variables.
4. Memory access can be specified to be UMA, NUMA, EREW, CREW, or CRCW with
a defined conflict policy.
The PRAM model can apply to SIMD class machines if all processors execute identical
instructions on the same cycle, or to MIMD class machines if the processors are
executing different instructions. Load imbalance is the only form of overhead in the
PRAM model.
• EREW - Exclusive read, exclusive write; any memory location may only be
accessed once in any one step. Thus forbids more than one processor from reading
or writing the same memory cell simultaneously.
• CREW - Concurrent read, exclusive write; any memory location may be read any
number of times during a single step, but only written to once, with the write
taking place after the reads.
• ERCW - Exclusive read, concurrent write; this allows exclusive reads and concurrent
writes to the same memory location.
• CRCW - Concurrent read, concurrent write; any memory location may be written
to or read from any number of times during a single step. A CRCW PRAM model
must define some rule for resolving multiple writes, such as giving priority to the
lowest-numbered processor or choosing amongst processors randomly.
The PRAM is popular because it is theoretically tractable and because it gives
algorithm designers a common target. However, PRAMs cannot be emulated
optimally on all architectures.
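The access rules of the four PRAM variants, and one possible CRCW conflict policy, can be sketched in Python. This is a simplified illustration: the function names are assumptions, reads and writes within a step are checked separately, and the priority rule shown (lowest-numbered processor wins) is just one of the policies mentioned above.

```python
from collections import Counter


def step_is_legal(model, reads, writes):
    """Check one PRAM step against the access rules of the given model.

    reads and writes map a processor id to the memory address it accesses.
    Simplification: conflicts between a read and a write to the same
    address are not checked.
    """
    if model not in ("EREW", "CREW", "ERCW", "CRCW"):
        raise ValueError("unknown PRAM model: " + model)
    read_counts = Counter(reads.values())
    write_counts = Counter(writes.values())
    if model[:2] == "ER" and any(c > 1 for c in read_counts.values()):
        return False   # exclusive read violated
    if model[2:] == "EW" and any(c > 1 for c in write_counts.values()):
        return False   # exclusive write violated
    return True


def resolve_priority_writes(memory, writes):
    """CRCW priority policy: the lowest-numbered processor wins.

    writes maps a processor id to an (address, value) pair.
    """
    # Apply writes from highest to lowest id so the lowest id lands last.
    for pid in sorted(writes, reverse=True):
        address, value = writes[pid]
        memory[address] = value
    return memory


print(step_is_legal("EREW", {0: 5, 1: 5}, {}))   # two readers of one cell
print(step_is_legal("CREW", {0: 5, 1: 5}, {}))
print(resolve_priority_writes([0, 0, 0, 0], {2: (1, "c"), 0: (1, "a"), 1: (1, "b")}))
```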