Why Parallel Computing?: Peter Pacheco


An Introduction to Parallel Programming

Peter Pacheco

Chapter 1
Why Parallel Computing?

Copyright © 2010, Elsevier Inc. All rights Reserved 1


Roadmap
 Why we need ever-increasing performance.
 Why we’re building parallel systems.
 Why we need to write parallel programs.
 How do we write parallel programs?
 What we’ll be doing.
 Concurrent, parallel, distributed!

Copyright © 2010, Elsevier Inc. All rights Reserved 2


Changing times
 From 1986 to 2002, microprocessor performance increased like a rocket, at an average rate of about 50% per year.

 Since then, the rate has dropped to roughly 20% per year.

Copyright © 2010, Elsevier Inc. All rights Reserved 3


An intelligent solution
 Instead of designing and building faster
microprocessors, put multiple processors
on a single integrated circuit.

Copyright © 2010, Elsevier Inc. All rights Reserved 4


Now it’s up to the programmers
 Adding more processors doesn’t help much if programmers aren’t aware of them…
 … or don’t know how to use them.

 Serial programs don’t benefit from this approach (in most cases).

Copyright © 2010, Elsevier Inc. All rights Reserved 5


Why we need ever-increasing
performance
 Computational power is increasing, but so
are our computational problems and needs.
 Problems we never dreamed of have been
solved because of past increases, such as
decoding the human genome.
 More complex problems are still waiting to
be solved.

Copyright © 2010, Elsevier Inc. All rights Reserved 6


Climate modeling

Copyright © 2010, Elsevier Inc. All rights Reserved 7


Protein folding

Copyright © 2010, Elsevier Inc. All rights Reserved 8


Drug discovery

Copyright © 2010, Elsevier Inc. All rights Reserved 9


Energy research

Copyright © 2010, Elsevier Inc. All rights Reserved 10


Data analysis

Copyright © 2010, Elsevier Inc. All rights Reserved 11


Why we’re building parallel
systems
 Up to now, performance increases have been attributable to the increasing density of transistors.

 But there are inherent problems.

Copyright © 2010, Elsevier Inc. All rights Reserved 12


A little physics lesson
 Smaller transistors = faster processors.
 Faster processors = increased power
consumption.
 Increased power consumption = increased
heat.
 Increased heat = unreliable processors.

Copyright © 2010, Elsevier Inc. All rights Reserved 13


Solution
 Move away from single-core systems to
multicore processors.
 “core” = central processing unit (CPU)

 Introducing parallelism!!!

Copyright © 2010, Elsevier Inc. All rights Reserved 14


Why we need to write parallel
programs
 Running multiple instances of a serial program often isn’t very useful.
 Think of running multiple instances of your favorite game.

 What you really want is for it to run faster.

Copyright © 2010, Elsevier Inc. All rights Reserved 15


Approaches to the serial problem
 Rewrite serial programs so that they’re parallel.

 Write translation programs that automatically convert serial programs into parallel programs.
 This is very difficult to do.
 Success has been limited.

Copyright © 2010, Elsevier Inc. All rights Reserved 16


More problems
 Some coding constructs can be
recognized by an automatic program
generator, and converted to a parallel
construct.
 However, it’s likely that the result will be a
very inefficient program.
 Sometimes the best parallel solution is to
step back and devise an entirely new
algorithm.

Copyright © 2010, Elsevier Inc. All rights Reserved 17


Example
 Compute n values and add them together.
 Serial solution:
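The slide’s code listing does not survive the text extraction; the following is a minimal C sketch of the serial loop, with Compute_next_value as a hypothetical placeholder for whatever produces each value:

#include <stdio.h>

/* Hypothetical placeholder for whatever computes the i-th value. */
double Compute_next_value(int i) {
    return (double)(i % 10);
}

int main(void) {
    int n = 24;
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += Compute_next_value(i);   /* compute each value and add it in */
    printf("sum = %f\n", sum);
    return 0;
}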

Copyright © 2010, Elsevier Inc. All rights Reserved 18


Example (cont.)
 We have p cores, p much smaller than n.
 Each core performs a partial sum of approximately n/p values.

 Each core uses its own private variables and executes a block of code like the sketch below, independently of the other cores.
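The per-core block is also shown as a code figure in the original slides; here is a sketch of what it looks like, assuming a simple block distribution of the n values and reusing the Compute_next_value placeholder from the serial sketch:

/* Sketch of the block core my_rank executes: sum its block of roughly
   n/p values.  Assumes the placeholder defined in the serial sketch. */
double Compute_next_value(int i);            /* defined in the serial sketch above */

double partial_sum(int my_rank, int p, int n) {
    int my_first_i = my_rank * n / p;        /* first index in this core's block */
    int my_last_i  = (my_rank + 1) * n / p;  /* one past the last index          */
    double my_sum = 0.0;
    for (int my_i = my_first_i; my_i < my_last_i; my_i++)
        my_sum += Compute_next_value(my_i);
    return my_sum;
}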

Copyright © 2010, Elsevier Inc. All rights Reserved 19


Example (cont.)
 After each core completes execution of the code, its private variable my_sum contains the sum of the values computed by its calls to Compute_next_value.

 Ex., 8 cores, n = 24, then the calls to Compute_next_value return:
1,4,3, 9,2,8, 5,1,1, 6,2,7, 2,5,0, 4,1,8, 6,5,1, 2,3,9

Copyright © 2010, Elsevier Inc. All rights Reserved 20


Example (cont.)
 Once all the cores are done computing their private my_sum, they form a global sum by sending their results to a designated “master” core, which adds them to produce the final result.

Copyright © 2010, Elsevier Inc. All rights Reserved 21


Example (cont.)
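The code figure on this slide is also lost in extraction. Below is a serial simulation of the idea, with the “master” (core 0) adding in every other core’s my_sum one at a time; the values are the partial sums from the 8-core example. In a real program each “receive + add” would be a message from another core.

#include <stdio.h>

int main(void) {
    int p = 8;
    int my_sum[8] = {8, 19, 7, 15, 7, 13, 12, 14}; /* partial sums from the example */

    int sum = my_sum[0];               /* the master (core 0) starts from its own sum       */
    for (int q = 1; q < p; q++)
        sum += my_sum[q];              /* one "receive" and one addition per other core: 7 each */
    printf("Global sum = %d\n", sum);  /* prints 95 */
    return 0;
}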

Copyright © 2010, Elsevier Inc. All rights Reserved 22


Example (cont.)

Core 0 1 2 3 4 5 6 7
my_sum 8 19 7 15 7 13 12 14

Global sum
8 + 19 + 7 + 15 + 7 + 13 + 12 + 14 = 95

Core 0 1 2 3 4 5 6 7
my_sum 95 19 7 15 7 13 12 14

Copyright © 2010, Elsevier Inc. All rights Reserved 23


But wait!
There’s a much better way
to compute the global sum.

Copyright © 2010, Elsevier Inc. All rights Reserved 24


Better parallel algorithm
 Don’t make the master core do all the work.
 Share it among the other cores.
 Pair the cores so that core 0 adds its result to core 1’s result.
 Core 2 adds its result to core 3’s result, etc.
 Work with odd- and even-numbered pairs of cores.
Copyright © 2010, Elsevier Inc. All rights Reserved 25
Better parallel algorithm (cont.)
 Repeat the process, now with only the evenly ranked cores.
 Core 0 adds the result from core 2.
 Core 4 adds the result from core 6, etc.

 Now the cores divisible by 4 repeat the process, and so forth, until core 0 has the final result (see the sketch below).
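A serial simulation of this tree-structured sum, using the same eight partial sums; in a real parallel program each core would execute only its own branch of the pairing, but the loop structure shows the pattern:

#include <stdio.h>

int main(void) {
    int p = 8;                                   /* number of cores                  */
    int my_sum[8] = {8, 19, 7, 15, 7, 13, 12, 14};

    for (int stride = 1; stride < p; stride *= 2) {
        for (int q = 0; q < p; q += 2 * stride) {
            if (q + stride < p)
                my_sum[q] += my_sum[q + stride]; /* core q "receives" from core q+stride */
        }
    }
    printf("Global sum = %d\n", my_sum[0]);      /* prints 95 */
    return 0;
}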

Copyright © 2010, Elsevier Inc. All rights Reserved 26


Multiple cores forming a global
sum

Copyright © 2010, Elsevier Inc. All rights Reserved 27


Analysis
 In the first example, the master core performs 7 receives and 7 additions.

 In the second example, the master core performs 3 receives and 3 additions.

 The improvement is more than a factor of 2!

Copyright © 2010, Elsevier Inc. All rights Reserved 28


Analysis (cont.)
 The difference is more dramatic with a larger number of cores.
 If we have 1000 cores:
 The first example would require the master to perform 999 receives and 999 additions.
 The second example would only require 10 receives and 10 additions (since ceil(log2(1000)) = 10).

 That’s an improvement of almost a factor of 100!
Copyright © 2010, Elsevier Inc. All rights Reserved 29
How do we write parallel
programs?
 Task parallelism
 Partition the various tasks carried out in solving the problem among the cores.

 Data parallelism
 Partition the data used in solving the problem among the cores.
 Each core carries out similar operations on its part of the data.

Copyright © 2010, Elsevier Inc. All rights Reserved 30


Professor P

15 questions
300 exams

Copyright © 2010, Elsevier Inc. All rights Reserved 31


Professor P’s grading assistants

TA#1, TA#2, TA#3

Copyright © 2010, Elsevier Inc. All rights Reserved 32


Division of work –
data parallelism

TA#1: 100 exams    TA#2: 100 exams    TA#3: 100 exams

Copyright © 2010, Elsevier Inc. All rights Reserved 33


Division of work –
task parallelism

TA#1: Questions 1 - 5    TA#2: Questions 6 - 10    TA#3: Questions 11 - 15
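Purely as an illustration (none of this is from the book’s code), a small C sketch that prints the two ways of dividing Professor P’s workload: partitioning the exams (data parallelism) versus partitioning the questions (task parallelism):

#include <stdio.h>

#define N_EXAMS     300
#define N_QUESTIONS  15
#define N_TAS         3

int main(void) {
    /* Data parallelism: partition the data (the exams) among the TAs. */
    for (int ta = 0; ta < N_TAS; ta++) {
        int first = ta * N_EXAMS / N_TAS;
        int last  = (ta + 1) * N_EXAMS / N_TAS - 1;
        printf("TA#%d grades every question on exams %d-%d\n",
               ta + 1, first + 1, last + 1);
    }

    /* Task parallelism: partition the tasks (the questions) among the TAs. */
    for (int ta = 0; ta < N_TAS; ta++) {
        int firstq = ta * N_QUESTIONS / N_TAS;
        int lastq  = (ta + 1) * N_QUESTIONS / N_TAS - 1;
        printf("TA#%d grades questions %d-%d on every exam\n",
               ta + 1, firstq + 1, lastq + 1);
    }
    return 0;
}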

Copyright © 2010, Elsevier Inc. All rights Reserved 34


Division of work –
data parallelism

Copyright © 2010, Elsevier Inc. All rights Reserved 35


Division of work –
task parallelism

Tasks:
1) Receiving
2) Addition

Copyright © 2010, Elsevier Inc. All rights Reserved 36


Coordination
 Cores usually need to coordinate their work.
 Communication – one or more cores send their current partial sums to another core.
 Load balancing – share the work evenly among the cores so that no core is overloaded.
 Synchronization – because each core works at its own pace, make sure cores do not get too far ahead of the rest (a Pthreads sketch of shared-memory coordination follows).
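A minimal Pthreads sketch (an assumption for illustration, not the book’s code) of shared-memory coordination: each thread computes a partial sum and adds it into a shared total, with a mutex synchronizing the update.

#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4

double total = 0.0;                      /* shared memory location              */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    long rank = (long) arg;
    double my_sum = rank + 1.0;          /* stand-in for a real partial sum     */

    pthread_mutex_lock(&lock);           /* synchronize access to the total     */
    total += my_sum;                     /* communicate through shared memory   */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *) t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);
    printf("total = %f\n", total);       /* 1 + 2 + 3 + 4 = 10 */
    return 0;
}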

Copyright © 2010, Elsevier Inc. All rights Reserved 37


What we’ll be doing
 Learning to write programs that are explicitly parallel.
 Using the C language.
 Using three different extensions to C (an OpenMP preview of the global sum is sketched below):
 Message-Passing Interface (MPI)
 Posix Threads (Pthreads)
 OpenMP
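As a preview, a minimal sketch of the running global-sum example written with the OpenMP extension; Compute_next_value is again a hypothetical placeholder, and the reduction clause lets OpenMP combine the per-thread partial sums:

/* Compile with OpenMP enabled, e.g. gcc -fopenmp. */
#include <stdio.h>

double Compute_next_value(int i) {       /* hypothetical placeholder computation */
    return (double)(i % 10);
}

int main(void) {
    const int n = 24;
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)   /* each thread sums a share, OpenMP combines */
    for (int i = 0; i < n; i++)
        sum += Compute_next_value(i);

    printf("sum = %f\n", sum);
    return 0;
}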

Copyright © 2010, Elsevier Inc. All rights Reserved 38


Type of parallel systems
 Shared-memory
 The cores can share access to the computer’s memory.
 Coordinate the cores by having them examine and update shared memory locations.
 Distributed-memory
 Each core has its own, private memory.
 The cores must communicate explicitly by sending messages across a network (see the message-passing sketch below).
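A minimal sketch of distributed-memory coordination, assuming MPI: each process computes a partial sum and explicitly sends it to rank 0. A real program would more likely use MPI_Reduce; this version spells out the sends and receives.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double my_sum = rank + 1.0;            /* stand-in for a real partial sum */

    if (rank == 0) {
        double total = my_sum, value;
        for (int q = 1; q < size; q++) {   /* receive one message per other process */
            MPI_Recv(&value, 1, MPI_DOUBLE, q, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            total += value;
        }
        printf("Global sum = %f\n", total);
    } else {
        MPI_Send(&my_sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}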

Copyright © 2010, Elsevier Inc. All rights Reserved 39


Type of parallel systems

Shared-memory Distributed-memory

Copyright © 2010, Elsevier Inc. All rights Reserved 40


Terminology
 Concurrent computing – a program is one in which multiple tasks can be in progress at any instant.
 Parallel computing – a program is one in which multiple tasks cooperate closely to solve a problem.
 Distributed computing – a program is one that may need to cooperate with other programs to solve a problem.

Copyright © 2010, Elsevier Inc. All rights Reserved 41


Concluding Remarks (1)
 The laws of physics have brought us to the
doorstep of multicore technology.
 Serial programs typically don’t benefit from
multiple cores.
 Automatic parallel program generation
from serial program code isn’t the most
efficient approach to get high performance
from multicore computers.

Copyright © 2010, Elsevier Inc. All rights Reserved 42


Concluding Remarks (2)
 Learning to write parallel programs
involves learning how to coordinate the
cores.
 Parallel programs are usually very complex and therefore require sound programming techniques and development.

Copyright © 2010, Elsevier Inc. All rights Reserved 43


Distributed and Cloud Computing

Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 44


Data Deluge Enabling New Challenges

(Courtesy of Judy Qiu, Indiana University, 2011)

Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 45


From Desktop/HPC/Grids to
Internet Clouds in 30 Years
 HPC has moved over the last 30 years from centralized supercomputers to geographically distributed desktops, desksides, clusters, and grids, and on to clouds.

 R&D efforts on HPC, clusters, grids, P2P, and virtual machines have laid the foundation of cloud computing, which has been greatly advocated since 2007.

 Computing infrastructure is being located in areas with lower costs in hardware, software, datasets, space, and power – moving from desktop computing to datacenter-based clouds.


Interactions among 4 Technical Challenges:
Data Deluge, Cloud Technology, eScience, and Multicore/Parallel Computing

(Courtesy of Judy Qiu, Indiana University, 2011)

Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 47


Clouds and Internet of Things
HPC: High-Performance Computing
HTC: High-Throughput Computing
P2P: Peer to Peer
MPP: Massively Parallel Processors

Source: K. Hwang, G. Fox, and J. Dongarra, Distributed and Cloud Computing, Morgan Kaufmann, 2012.

Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 48


Technology Convergence toward HPC for
Science and HTC for Business

(Courtesy of Raj Buyya, University of Melbourne, 2011)

Copyright © 2012, Elsevier Inc. All rights reserved.


2011 Gartner “IT Hype Cycle” for Emerging Technologies
(figure: hype-cycle curves for 2007–2011)

Copyright © 2012, Elsevier Inc. All rights reserved.


Architecture of a Many-Core Multiprocessor GPU Interacting with a CPU

Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 55


Datacenter and Server Cost Distribution

Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 58


Virtual Machine Architecture

(Courtesy of VMWare, 2010)

Copyright © 2012, Elsevier Inc. All rights reserved.


Primitive Operations in Virtual Machines:

Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 60


Concept of Virtual Clusters

(Source: W. Emeneker, et al., “Dynamic Virtual Clustering with Xen and Moab,” ISPA 2006, Springer-Verlag LNCS 4331, 2006, pp. 440-451)

Copyright © 2012, Elsevier Inc. All rights reserved.


A Typical Cluster Architecture

Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 63


A Typical Computational Grid

Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 65


The Cloud
 Historical roots in today’s Internet apps
 Search, email, social networks
 File storage (Live Mesh, MobileMe, Flickr, …)
 A cloud infrastructure provides a framework to manage scalable, reliable, on-demand access to applications
 A cloud is the “invisible” backend to many of our mobile applications
 A model of computation and data storage based on “pay as you go” access to “unlimited” remote data center capabilities

Copyright © 2012, Elsevier Inc. All rights reserved.


Basic Concept of Internet Clouds

Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 69


The Next Revolution in IT
Cloud Computing
 Classical Computing
 Buy & Own: hardware, system software, applications, often to meet peak needs
 Install, Configure, Test, Verify, Evaluate
 Manage
 …
 Finally, use it
 $$$$….$ (high CapEx)
 Every 18 months?

 Cloud Computing
 Subscribe
 Use
 $ – pay for what you use, based on QoS

(Courtesy of Raj Buyya, 2012)

Copyright © 2012, Elsevier Inc. All rights reserved.


Cloud Computing Challenges:
Dealing with too many issues (Courtesy of R. Buyya)

(word cloud of issues): Pricing, Scalability, Virtualization, Resource Metering, Reliability, QoS, Billing, Service Level Agreements, Energy Efficiency, Provisioning on Demand, Utility & Risk Management, Security, Legal & Regulatory, Privacy, Trust, Software Eng. Complexity, Programming Env. & Application Dev.

Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 72


The Internet of Things (IoT)

The Internet → Internet of Clouds → Internet of Things
(IBM’s “Smart Earth” dream)

Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 73


Opportunities of IoT in 3 Dimensions

(courtesy of Wikipedia, 2010)

Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 74


System Scalability vs. OS Multiplicity

Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 75


System Availability vs. Configuration Size :

Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 76


Transparent Cloud Computing Environment
Parallel and Distributed Programming

Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 79


Grid Standards and Middleware :

Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 80


Energy Efficiency :

Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 82


System Attacks and Network Threats

Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 83


Four Reference Books:
1. K. Hwang, G. Fox, and J. Dongarra, Distributed and Cloud Computing: From Parallel Processing to the Internet of Things, Morgan Kaufmann Publishers, 2011.

2. R. Buyya, J. Broberg, and A. Goscinski (eds.), Cloud Computing: Principles and Paradigms, ISBN-13: 978-0470887998, Wiley Press, USA, February 2011.

3. T. Chou, Introduction to Cloud Computing: Business and Technology, Lecture Notes at Stanford University and at Tsinghua University, Active Book Press, 2010.

4. T. Hey, S. Tansley, and K. Tolle (eds.), The Fourth Paradigm: Data-Intensive Scientific Discovery, Microsoft Research, 2009.

Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 84
