What Is Parallel Computing


Traditionally, software has been written for serial computation:

It runs on a single computer with a single Central Processing Unit (CPU).
A problem is broken into a discrete series of instructions.
Instructions are executed one after another.
Only one instruction may execute at any moment in time.

Parallel computing is the simultaneous use of multiple compute resources to solve a
computational problem:
It runs using multiple CPUs.
A problem is broken into discrete parts that can be solved concurrently.
Each part is further broken down into a series of instructions.
Instructions from each part execute simultaneously on different CPUs.
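
The decomposition described above can be sketched in a few lines of Python. This is only an illustration: the names split_into_parts, solve_part and parallel_sum are made up for this example, and summing numbers stands in for a real computational problem. A thread pool is used because it is the simplest portable choice; in standard CPython, threads interleave rather than run truly simultaneously on CPU-bound work, so a process pool or several machines would be needed for real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_parts(data, num_parts):
    """Break the problem into roughly equal, independent parts."""
    size = max(1, (len(data) + num_parts - 1) // num_parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def solve_part(part):
    """Each part is itself a series of instructions (here, a running sum)."""
    total = 0
    for value in part:
        total += value
    return total


def parallel_sum(data, num_workers=4):
    """Hand each part to a separate worker, then combine the partial results."""
    parts = split_into_parts(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(solve_part, parts))
    return sum(partial_sums)
```

Calling parallel_sum(list(range(1, 101))) splits the list into four parts, sums each part in its own worker, and adds the four partial sums.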

What are the Resources for Parallel Computing?
The compute resources can include:
A single computer with multiple processors;
A single computer with one or more processors and some specialized compute
resources (GPU, FPGA);
An arbitrary number of computers connected by a network;
A combination of the above.

What are the Applications of Parallel Computing?
weather and climate
chemical and nuclear reactions
biological, human genome
geological, seismic activity
mechanical devices - from prosthetics to spacecraft
electronic circuits
manufacturing processes

Flynn's Classification

Shared Memory Multiprocessing

Shared memory systems form a major category of multiprocessors. In this category, all
processors share a global memory.

Communication between tasks running on different processors is performed through
writing to and reading from the global memory.
All interprocessor coordination and synchronization is also accomplished via the global
memory.
The address space is identical for all processors.
The memory does not know which CPU is making a request.
Each CPU executes as if the other CPUs did not exist.
A shared memory system is relatively easy to program since all processors share a single
view of data and the communication between processors can be as fast as memory
accesses to the same location.
Two main problems need to be addressed when designing a shared memory system:
1. Performance degradation due to contention. Performance degradation might
happen when multiple processors are trying to access the shared memory
simultaneously. A typical design might use caches to solve the contention problem.
2. Coherence problems. Having multiple copies of data, spread throughout the
caches, might lead to a coherence problem. The copies in the caches are coherent
if they are all equal to the same value. However, if one of the processors writes
over the value of one of the copies, then the copy becomes inconsistent because it
no longer equals the value of the other copies.
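
The coherence problem can be reproduced with a toy model. The sketch below is an illustration only: the class SharedMemorySystem and the function demo are hypothetical names, and the write-through and write-invalidate policies are assumptions chosen for simplicity, since the text above does not name a particular protocol.

```python
class SharedMemorySystem:
    """Toy model: one global memory plus a private cache per CPU."""

    def __init__(self, num_cpus, invalidate_on_write):
        self.memory = {}                               # address -> value
        self.caches = [{} for _ in range(num_cpus)]    # one dict per CPU
        self.invalidate_on_write = invalidate_on_write

    def read(self, cpu, address):
        cache = self.caches[cpu]
        if address not in cache:
            # Cache miss: fetch a copy from the global memory.
            cache[address] = self.memory.get(address, 0)
        return cache[address]

    def write(self, cpu, address, value):
        self.caches[cpu][address] = value
        self.memory[address] = value                   # assumed write-through
        if self.invalidate_on_write:
            # Assumed write-invalidate: drop every other cached copy.
            for other, cache in enumerate(self.caches):
                if other != cpu:
                    cache.pop(address, None)


def demo(invalidate_on_write):
    """Return what CPU 1 sees after CPU 0 overwrites a shared value."""
    system = SharedMemorySystem(num_cpus=2, invalidate_on_write=invalidate_on_write)
    system.memory[0x10] = 1
    system.read(0, 0x10)        # both CPUs now hold a cached copy of 1
    system.read(1, 0x10)
    system.write(0, 0x10, 2)    # CPU 0 writes over its copy
    return system.read(1, 0x10)
```

Without invalidation, demo(False) returns the stale value 1 because CPU 1 keeps reading its old copy; with invalidation, demo(True) returns 2 because CPU 1 is forced to fetch the value again.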
Scalability remains the main drawback of a shared memory system.
The simplest shared memory system consists of one memory module (M) that can be
accessed from two processors, P1 and P2.

1. Requests arrive at the memory module through its two ports. An arbitration unit
within the memory module passes requests through to a memory controller.
2. If the memory module is not busy and a single request arrives, then the arbitration
unit passes that request to the memory controller and the request is satisfied.
3. The module is placed in the busy state while a request is being serviced. If a new
request arrives while the memory is busy servicing a previous request, the
memory module sends a wait signal, through the memory controller, to the
processor making the new request.
4. In response, the requesting processor may hold its request on the line until the
memory becomes free or it may repeat its request some time later.
5. If the arbitration unit receives two requests, it selects one of them and passes it to
the memory controller. Again, the denied request can be either held to be served
next or it may be repeated some time later.
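
The five steps above can be sketched as a small cycle-by-cycle simulation. This is a simplified model under stated assumptions, not a description of real hardware: MemoryModule and simulate are hypothetical names, every request takes a fixed service_time, the arbitration unit always picks the first pending request, and denied requests are held on the line rather than reissued later.

```python
class MemoryModule:
    """Toy two-port memory module with an arbitration unit."""

    def __init__(self, service_time=2):
        self.service_time = service_time
        self.busy_until = 0  # first cycle at which the module is free again

    def cycle(self, now, requests):
        """Return (granted processor or None, requests that must wait)."""
        if now < self.busy_until:
            return None, list(requests)      # busy: everyone gets a wait signal
        if not requests:
            return None, []
        granted = requests[0]                # assumed fixed-priority arbitration
        self.busy_until = now + self.service_time
        return granted, list(requests[1:])   # denied requests are held


def simulate(arrivals, service_time=2):
    """arrivals maps a cycle number to the processors issuing a request then.

    Returns a list of (cycle, processor) pairs, one per granted request.
    """
    module = MemoryModule(service_time)
    pending, log, now = [], [], 0
    last_arrival = max(arrivals, default=-1)
    while pending or now <= last_arrival:
        pending.extend(arrivals.get(now, []))
        granted, pending = module.cycle(now, pending)
        if granted is not None:
            log.append((now, granted))
        now += 1
    return log
```

With simulate({0: ["P1", "P2"]}), both requests arrive in cycle 0; P1 is served first and P2 is held until the module leaves the busy state two cycles later.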
In computer software, shared memory is either:
a method of inter-process communication (IPC), i.e. a way of exchanging data between
programs running at the same time, in which one process creates an area in RAM that other
processes can access; or
a method of conserving memory space by directing accesses to what would ordinarily be
copies of a piece of data to a single instance instead, by using virtual memory mappings or
with explicit support of the program in question. This is most often used for shared libraries.
Support on UNIX platforms
POSIX provides a standardized API for using shared memory, POSIX Shared Memory. This uses
the function shm_open from sys/mman.h.
POSIX interprocess communication (part of the POSIX:XSI Extension) includes the shared-memory
functions shmat, shmctl, shmdt and shmget. UNIX System V provides an API for shared
memory as well. This uses shmget from sys/shm.h. BSD systems provide "anonymous mapped
memory" which can be used by several processes.
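
As a rough illustration of shared memory as an IPC mechanism, the sketch below uses Python's standard multiprocessing.shared_memory module (Python 3.8 or later), which on UNIX-like platforms is backed by POSIX-style shared memory. The function name shared_memory_roundtrip is made up for this example, and for simplicity the second handle is opened in the same process; in real use it would be opened by a different process that has been given the region's name.

```python
from multiprocessing import shared_memory


def shared_memory_roundtrip(message: bytes) -> bytes:
    """Write bytes through one handle and read them back through another."""
    # "Process A": create a named shared region large enough for the message.
    region = shared_memory.SharedMemory(create=True, size=len(message))
    try:
        region.buf[:len(message)] = message
        # "Process B": attach to the same region using only its name.
        peer = shared_memory.SharedMemory(name=region.name)
        try:
            # The region may be rounded up to a page size, so slice explicitly.
            received = bytes(peer.buf[:len(message)])
        finally:
            peer.close()
    finally:
        region.close()
        region.unlink()  # ask the OS to destroy the region once we are done
    return received
```

The data is never copied through a pipe or socket: both handles map the same region, which is what makes this form of IPC as fast as a memory access.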

Uniform Memory Access (UMA)

In the UMA system a shared memory is accessible by all processors through an
interconnection network in the same way a single processor accesses its memory.
All processors have equal access time to any memory location. The interconnection
network used in the UMA can be a single bus, multiple buses, or a crossbar switch.




Because access to shared memory is balanced, these systems are also called SMP
(symmetric multiprocessor) systems. Each processor has equal opportunity to read/write
to memory, including equal access speed.
o A typical bus-structured SMP computer attempts to reduce contention for the bus
by fetching instructions and data directly from each individual cache, as much as
possible.
o In the extreme, the bus contention might be reduced to zero after the cache
memories are loaded from the global memory, because it is possible for all
instructions and data to be completely contained within the cache.
This memory organization is the most popular among shared memory systems. Examples
of this architecture are Sun Starfire servers, HP V series, Compaq AlphaServer GS, and
Silicon Graphics Inc. multiprocessor servers.
Nonuniform Memory Access (NUMA)
In the NUMA system, each processor has part of the shared memory attached.
The memory has a single address space. Therefore, any processor could access any
memory location directly using its real address. However, the access time to modules
depends on the distance to the processor. This results in a nonuniform memory access
time.
A processor can also have a built-in memory controller, as in Intel's QuickPath
Interconnect (QPI) NUMA architecture.
Unlike in a distributed memory architecture, the memory of other processors is accessible, but
the latency to access it is not the same. Memory that is local to another processor is
called remote memory or foreign memory.
A number of architectures are used to interconnect processors to memory modules in a
NUMA. Among these are the tree and the hierarchical bus networks.
Examples of NUMA architecture are BBN TC-2000, SGI Origin 3000, and Cray T3E.

Distributed Memory Multiprocessing

Distributed memory refers to a multiple-processor computer system in which
each processor has its own private memory.
Computational tasks can only operate on local data, and if remote data is required, the
computational task must communicate with one or more remote processors.

There is typically a processor, a memory, and some form of interconnection that allows
programs on each processor to interact with each other.
If a CPU needs data held in the local memory of another CPU, CPU-to-CPU
communication takes place to fetch the data from that other memory.
The interconnection can be organised with point-to-point links, or separate hardware can
provide a switching network.
A good organization keeps all the data a CPU needs in its local memory, so that the only
traffic on the interconnection network is messages between CPUs.
The network topology is a key factor in determining how the multi-processor machine scales.
The key issue in programming distributed memory systems is how to distribute the data over
the memories. Depending on the problem solved, the data can be distributed statically, or it
can be moved through the nodes. Data can be moved on demand, or data can be pushed to
the new nodes in advance.
Data can be kept statically in nodes if most computations happen locally, and only changes
on edges have to be reported to other nodes. An example of this is simulation where data is
modeled using a grid, and each node simulates a small part of the larger grid. On every
iteration, nodes inform all neighboring nodes of the new edge data.
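
The grid example can be sketched as follows. This is a minimal single-process model, assuming a 1D grid, a simple three-point averaging rule, and a grid length divisible by the number of nodes; Node, step, run_distributed and run_single are hypothetical names. Each Node touches only its own cells, and the only information crossing between nodes is the edge value placed in a neighbor's inbox, which stands in for a message on the interconnection network.

```python
def step(cells, left_edge, right_edge):
    """One iteration on a local block; the edge values come from neighbors."""
    padded = [left_edge] + cells + [right_edge]
    return [(padded[i - 1] + padded[i] + padded[i + 1]) / 3
            for i in range(1, len(padded) - 1)]


class Node:
    """Holds one part of the grid in its private (local) memory."""

    def __init__(self, cells):
        self.cells = list(cells)
        self.inbox = {}  # "left" / "right" -> edge value received as a message


def run_distributed(grid, num_nodes, iterations, boundary=0.0):
    if len(grid) % num_nodes != 0:
        raise ValueError("this sketch assumes the grid divides evenly")
    size = len(grid) // num_nodes
    nodes = [Node(grid[k * size:(k + 1) * size]) for k in range(num_nodes)]
    for _ in range(iterations):
        # Communication phase: every node sends its edge cells to its neighbors.
        for k, node in enumerate(nodes):
            if k > 0:
                nodes[k - 1].inbox["right"] = node.cells[0]
            if k < num_nodes - 1:
                nodes[k + 1].inbox["left"] = node.cells[-1]
        # Computation phase: purely local work on each node.
        for node in nodes:
            node.cells = step(node.cells,
                              node.inbox.get("left", boundary),
                              node.inbox.get("right", boundary))
    return [value for node in nodes for value in node.cells]


def run_single(grid, iterations, boundary=0.0):
    """Reference: the same simulation on one undivided grid."""
    cells = list(grid)
    for _ in range(iterations):
        cells = step(cells, boundary, boundary)
    return cells
```

Running both versions on the same input gives the same result, which shows that exchanging only the edge cells per iteration is enough to keep the distributed parts consistent with the whole grid.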
The advantage of (distributed) shared memory is that it offers a unified address space in
which all data can be found.
The advantage of distributed memory is that it excludes race conditions, and that it forces the
programmer to think about data distribution.
The advantage of distributed (shared) memory is that it is easier to design a machine that
scales with the algorithm.
Distributed shared memory hides the mechanism of communication - it does not hide the
latency of communication.
How is Parallelism Achieved in Sequential Machines?
1. Multiplicity of functional units
Use of multiple processing elements under one controller
Many of the ALU functions can be distributed to multiple specialized units
These multiple Functional Units are independent of each other
The CDC-6600
10 functional execution units built into its CPU.
The 6600 CP included 10 parallel functional units, allowing multiple instructions to be worked on
at the same time. Today this is known as a superscalar design, while at the time it was simply
"unique". The system read and decoded instructions from memory as fast as possible, generally faster
than they could be completed, and fed them off to the units for processing. The units included:
floating point multiply (2 copies)
floating point divide
floating point add
IBM 360/91
2 parallel execution units:
Fixed point arithmetic
Floating point arithmetic (2 functional units):
Floating point add-sub
Floating point multiply-div
2. Parallelism & pipelining within the CPU
Parallelism is provided by building parallel adders in almost all ALUs.
Pipelining: each task is divided into subtasks, and the subtasks of successive tasks are
executed in parallel in different pipeline stages.
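
The benefit of pipelining can be made concrete with a small model. The sketch below assumes an idealized pipeline in which every stage takes exactly one cycle and there are no stalls; the function names are made up for this illustration. Under those assumptions, n tasks on a k-stage pipeline finish in k + n - 1 cycles instead of n * k.

```python
def sequential_cycles(num_tasks, num_stages):
    """Each task runs all of its subtasks before the next task starts."""
    return num_tasks * num_stages


def pipelined_cycles(num_tasks, num_stages):
    """The first task fills the pipeline; then one task completes per cycle."""
    if num_tasks == 0:
        return 0
    return num_stages + num_tasks - 1


def pipeline_schedule(num_tasks, num_stages):
    """For every cycle, list which task occupies each stage (None if idle)."""
    schedule = []
    for cycle in range(pipelined_cycles(num_tasks, num_stages)):
        row = []
        for stage in range(num_stages):
            task = cycle - stage  # task t enters stage s in cycle t + s
            row.append(task if 0 <= task < num_tasks else None)
        schedule.append(row)
    return schedule
```

For example, four tasks on a three-stage pipeline need 6 cycles rather than 12, and each row of pipeline_schedule(4, 3) shows up to three different tasks being worked on in the same cycle.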

