Barry Wilkinson and Michael Allen Prentice Hall, 1999
Contents
Chapter 1 Parallel Computers
Chapter 2 Message-Passing Computing
Chapter 3 Embarrassingly Parallel Computations
Chapter 4 Partitioning and Divide-and-Conquer Strategies
Chapter 5 Pipelined Computations
Chapter 6 Synchronous Computations
Chapter 7 Load Balancing and Termination Detection
Chapter 8 Programming with Shared Memory
Chapter 9 Sorting Algorithms
Chapter 10 Numerical Algorithms
Chapter 11 Image Processing
Chapter 12 Searching and Optimization
Slides for Parallel Programming: Techniques and Applications using Networked Workstations and Parallel Computers Barry Wilkinson and Michael Allen Prentice Hall, 1999. All rights reserved. Page 1
Weather Forecasting
Atmosphere is modeled by dividing it into three-dimensional regions or cells. The calculations of each cell are repeated many times to model the passage of time.
Example
Whole global atmosphere divided into cells of size 1 mile × 1 mile × 1 mile, to a height of 10 miles (10 cells high) - about 5 × 10^8 cells. Suppose each calculation requires 200 floating point operations. In one time step, 10^11 floating point operations are necessary.
To forecast the weather over 10 days using 10-minute intervals, a computer operating at 100 Mflops (10^8 floating point operations/sec) would take 10^7 seconds, or over 100 days.
To perform the calculation in 10 minutes would require a computer operating at 1.7 Tflops (1.7 × 10^12 floating point operations/sec).
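The chain of figures above can be checked with a few lines (a sketch; the time-step count is left implicit in the quoted 10^7-second runtime):

```python
# Checking the weather-forecasting arithmetic using the slide's round figures.

cells = 5e8                    # about 5 x 10^8 cells in the model
flops_per_cell = 200           # floating point operations per cell per step
flops_per_step = cells * flops_per_cell
print(flops_per_step)          # = 10^11, as stated

rate = 1e8                     # 100 Mflops machine
total_seconds = 1e7            # quoted total runtime at that rate
total_flops = rate * total_seconds

needed_rate = total_flops / (10 * 60)   # finish in 10 minutes instead
print(needed_rate)             # about 1.7 x 10^12 flops/sec, as stated
```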
N-Body Problem
If there are N bodies, there are N - 1 forces to calculate for each body, or approximately N^2 calculations in total. (N log2 N for an efficient approximate algorithm.) After determining the new positions of the bodies, the calculations must be repeated.
A galaxy might have, say, 10^11 stars. Even if each calculation could be done in 1 µs (10^-6 seconds, an extremely optimistic figure), it would take 10^9 years for one iteration using the N^2 algorithm and almost a year for one iteration using the N log2 N efficient approximate algorithm.
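The rough arithmetic behind these estimates (a sketch; the slide's figures fold in generous rounding of the constant factors):

```python
import math

# Order-of-magnitude cost of one N-body iteration at the slide's figures.
N = 1e11          # stars in the galaxy
t_calc = 1e-6     # seconds per force calculation (extremely optimistic)
year = 3.15e7     # seconds per year, roughly

brute_years = N * N * t_calc / year            # N^2 algorithm
approx_years = N * math.log2(N) * t_calc / year  # N log2 N algorithm

print(f"N^2 algorithm: {brute_years:.1e} years per iteration")
print(f"N log2 N algorithm: {approx_years:.2f} years per iteration")
```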
Figure 1.1 Astrophysical N-body simulation by Scott Linssen (undergraduate University of North Carolina at Charlotte [UNCC] student).
... There is therefore nothing new in the idea of parallel programming, but its application to computers. The author cannot believe that there will be any insuperable difficulty in extending it to computers. It is not to be expected that the necessary programming techniques will be worked out overnight. Much experimenting remains to be done. After all, the techniques that are commonly used in programming today were only won at the cost of considerable toil several years ago. In fact the advent of parallel programming may do something to revive the pioneering spirit in programming which seems at the present to be degenerating into a rather dull and routine occupation ...
Gill, S. (1958), "Parallel Programming," The Computer Journal, vol. 1, April, pp. 2-10.
Notwithstanding the long history, Flynn and Rudd (1996) write that "... leads us to one simple conclusion: the future is parallel." We concur.
Figure 1.2 Conventional computer having a single processor and memory. Each location in the main memory is identified by a number called its address. Addresses start at 0 and extend to 2^n - 1 when there are n bits (binary digits) in the address.
Interconnection network
Threads
Threads can be used that contain regular high-level language code sequences for individual processors. These code sequences can then access shared locations.
Message-Passing Multicomputer
Complete computers connected through an interconnection network:
Local memory
Programming
Still involves dividing the problem into parts that are intended to be executed simultaneously to solve the problem
Common approach is to use message-passing library routines linked to conventional sequential program(s).
Processes will communicate by sending messages; this will be the only way to distribute data and results between processes.
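The book's later examples use message-passing libraries such as PVM and MPI; as a rough stand-in, the style can be mimicked in plain Python by restricting threads to communicate only through explicit message queues (the helper names below are illustrative, not from the book):

```python
import threading
import queue

def worker(inbox, outbox):
    # The worker shares nothing with the master: it receives its portion
    # of the data as a message and sends back a partial result as a message.
    data = inbox.get()
    outbox.put(sum(data))

def scatter_and_gather(chunks):
    # Distribute one chunk to each worker, then collect the partial sums.
    outbox = queue.Queue()
    workers = []
    for chunk in chunks:
        inbox = queue.Queue()
        t = threading.Thread(target=worker, args=(inbox, outbox))
        t.start()
        inbox.put(chunk)              # distribute data by message
        workers.append(t)
    total = sum(outbox.get() for _ in chunks)   # gather results by message
    for t in workers:
        t.join()
    return total

print(scatter_and_gather([[1, 2], [3, 4], [5, 6]]))   # -> 21
```

In a real multicomputer the queues would be network sends and receives, but the program structure - scatter the data, compute locally, gather the results - is the same.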
Developed because there are a number of important applications that mostly operate upon arrays of data.
The source program can be constructed so that parts of the program are executed by certain computers and not others depending upon the identity of the computer.
Message-Passing Multicomputers
Static Network Message-Passing Multicomputers
[Figure: a static network message-passing multicomputer - each computer (node) contains a processor (P), memory (M), and a communication switch (C) linking it to the network.]
Figure 1.9 A link between two nodes with separate wires in each direction.
Network Criteria
Cost - indicated by the number of links in the network. (Ease of construction is also important.)
Bandwidth - number of bits that can be transmitted in unit time, given as bits/sec.
Network latency - time to make a message transfer through the network.
Communication latency - total time to send a message, including software overhead and interface delays.
Message latency, or startup time - time to send a zero-length message; essentially the software and hardware overhead in sending a message, plus the actual transmission time.
Diameter - minimum number of links between the two farthest nodes in the network, with only shortest routes used. Used to determine worst-case delays.
Bisection width - number of links (or sometimes wires) that must be cut to divide the network into two equal parts. Can provide a lower bound for messages in a parallel algorithm.
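Two of these criteria can be computed directly for common static networks; a small sketch using the standard formulas for these topologies:

```python
# Diameter and bisection width, as defined above, for two static networks.

def mesh_metrics(side):
    # Square two-dimensional mesh, side x side nodes, no wraparound links.
    nodes = side * side
    diameter = 2 * (side - 1)     # corner to opposite corner
    bisection = side              # links cut by a line down the middle
    return nodes, diameter, bisection

def hypercube_metrics(d):
    # d-dimensional hypercube with 2^d nodes.
    nodes = 2 ** d
    diameter = d                  # = log2(number of nodes)
    bisection = nodes // 2        # one link per node pair crosses the cut
    return nodes, diameter, bisection

print(mesh_metrics(4))        # (16, 6, 4)
print(hypercube_metrics(4))   # (16, 4, 8)
```

For the same number of nodes, the hypercube has a smaller diameter and a larger bisection width, at the price of more links per node.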
Interconnection Networks
[Figure: a three-dimensional hypercube with eight nodes labeled 000 to 111, and a four-dimensional hypercube with sixteen nodes labeled 0000 to 1111; nodes are connected when their addresses differ in exactly one bit.]
Embedding
Describes mapping nodes of one network onto another network. Example - a ring can be embedded in a torus.
[Figure: a ring with nodes labeled 00, 01, 11, 10 embedded in a torus.]
The dilation is the maximum number of links in the embedding network corresponding to one link in the embedded network.
Perfect embeddings, such as a line/ring into mesh/torus or a mesh onto a hypercube, have a dilation of 1.
For example, mapping a tree onto a mesh or hypercube does not result in a dilation of 1 except for very small trees of height 2.
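The dilation-1 claim for a ring can be checked concretely for an embedding into a hypercube, using a binary-reflected Gray code (a standard construction, sketched here; the function names are mine):

```python
def gray(i):
    # i-th value of the binary-reflected Gray code.
    return i ^ (i >> 1)

def ring_in_hypercube(d):
    # Map ring node i (0 .. 2^d - 1) onto hypercube node gray(i).
    # Consecutive Gray code values differ in exactly one bit, so each
    # ring link maps onto a single hypercube link: dilation 1.
    return [gray(i) for i in range(2 ** d)]

nodes = ring_in_hypercube(3)
print(nodes)   # [0, 1, 3, 2, 6, 7, 5, 4]

# Check dilation 1: every neighbouring pair, including the wraparound
# link, differs in exactly one bit.
for i in range(len(nodes)):
    a, b = nodes[i], nodes[(i + 1) % len(nodes)]
    assert bin(a ^ b).count("1") == 1
```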
Communication Methods
Circuit Switching
Involves establishing path and maintaining all links in path for message to pass, uninterrupted, from source to destination. All links are reserved for the transfer until message transfer is complete.
Simple telephone system (not using advanced digital techniques) is an example of a circuit-switched system. Once a telephone connection is made, the connection is maintained until the completion of the telephone call.
Circuit switching suffers from forcing all the links in the path to be reserved for the complete transfer. None of the links can be used for other messages until the transfer is completed.
Packet Switching
Message divided into packets of information, each of which includes the source and destination addresses for routing the packet through the interconnection network. There is a maximum size for the packet, say 1000 data bytes. If the message is larger than this, more than one packet must be sent through the network. Buffers are provided inside nodes to hold packets before they are transferred onward to the next node. This form is called store-and-forward packet switching.
Mail system is an example of a packet-switched system. Letters are moved from the mailbox to the post office and handled at intermediate sites before being delivered to the destination.
Enables links to be used by other packets once the current packet has been forwarded. Incurs a significant latency, since packets must first be stored in buffers within each node, whether or not an outgoing link is available.
Virtual Cut-Through
Can eliminate the storage latency. If the outgoing link is available, the message is immediately passed forward without being stored in the nodal buffer; i.e., it is cut through. If the complete path were available, the message would pass immediately through to the destination. However, if the path is blocked, storage is needed for the complete message/packet being received.
Wormhole routing
Message divided into smaller units called flits (flow control digits). Only the head of the message is initially transmitted from the source node to the next node, when the connecting link is available. Subsequent flits of the message are transmitted when links become available. The flits can become distributed through the network.
[Figure: wormhole routing - the head of the packet moves forward, with the following flits held in a flit buffer at each node.]
Request/acknowledge system
A way to pull flits along. Only requires a single wire between the sending node and the receiving node, called R/A (request/acknowledge).
Figure 1.19 A signaling method between processors for wormhole routing (Ni and McKinley, 1993). R/A is reset to 0 by the receiving node when it is ready to receive a flit (its flit buffer is empty). R/A is set to 1 by the sending node when it is about to send a flit. The sending node must wait for R/A = 0 before setting it to 1 and sending the flit. The sending node knows the data has been received when the receiving node resets R/A to 0.
[Figure: latency against distance (number of nodes between source and destination) for packet switching.]
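The contrast between store-and-forward packet switching and wormhole routing can be captured in a simple latency model (a sketch; the parameter values are invented, not from the book):

```python
# Latency models for the two switching techniques discussed above.

def store_and_forward(msg_words, distance, t_word):
    # The whole packet is received and stored at every intermediate node,
    # so the full packet time is paid once per hop.
    return distance * msg_words * t_word

def wormhole(msg_words, distance, t_word):
    # Only the head flit pays the per-hop cost; the rest of the message
    # pipelines behind it, so latency is nearly independent of distance.
    return distance * t_word + msg_words * t_word

t_word = 1.0   # time to move one word across one link (arbitrary units)
for d in (1, 5, 10):
    print(d, store_and_forward(100, d, t_word), wormhole(100, d, t_word))
```

Store-and-forward latency grows linearly with distance; wormhole latency barely grows at all, which is the behaviour the latency-against-distance figure illustrates.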
Deadlock
Occurs when packets cannot be forwarded to the next node because they are blocked by other packets waiting to be forwarded, and those packets are blocked in a similar way, so that none of the packets can move.
Example
Node 1 wishes to send a message through node 2 to node 3. Node 2 wishes to send a message through node 3 to node 4. Node 3 wishes to send a message through node 4 to node 1. Node 4 wishes to send a message through node 1 to node 2.
[Figure: four nodes (1-4) connected in a ring, with messages queued at each; every node waits on the next and none can proceed.]
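The four blocked transfers form a cycle in a "waits-for" graph, and a cycle is exactly the deadlock condition; a small illustrative check (the graph encoding is mine, not the book's):

```python
# Each entry maps a node to the node whose link it is waiting for.
waits_for = {1: 2, 2: 3, 3: 4, 4: 1}

def has_cycle(graph, start):
    # Follow the chain of waits from start; revisiting a node means
    # the waits form a cycle, i.e. deadlock.
    seen = set()
    node = start
    while node in graph:
        if node in seen:
            return True
        seen.add(node)
        node = graph[node]
    return False

print(has_cycle(waits_for, 1))   # True: every node waits on the next
```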
Virtual Channels
A general solution to deadlock. The physical links or channels are the actual hardware links between nodes. Multiple virtual channels are associated with a physical channel and time-multiplexed onto the physical channel.
Figure 1.22 Multiple virtual channels mapped onto a single physical channel.
Very high performance workstations and PCs are readily available at low cost.
The latest processors can easily be incorporated into the system as they become available.
Ethernet
Common communication network for workstations
[Figure: workstations connected to a single Ethernet; the packet format includes a variable-length data field.]
Ring Structures
Examples - token rings/FDDI networks
Figure 1.25 Network of workstations (with a workstation/file server) connected via a ring.
Point-to-point Communication
Provides the highest interconnection bandwidth. Various point-to-point configurations can be created using hubs and switches.
Examples - High Performance Parallel Interface (HIPPI), Fast (100 Mb/s) and Gigabit Ethernet, and fiber optics.
Speedup Factor
S(n) = (Execution time using one processor (single processor system)) / (Execution time using a multiprocessor with n processors) = ts / tp
where ts is the execution time on a single processor and tp is the execution time on a multiprocessor with n processors. S(n) gives the increase in speed from using a multiprocessor. The underlying algorithm for the parallel implementation might be (and usually is) different.
The speedup factor can also be cast in terms of computational steps:
S(n) = (Number of computational steps using one processor) / (Number of parallel computational steps with n processors)
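Both definitions written out directly (a trivial sketch; the timings are invented):

```python
# Speedup factor, in both of the forms defined above.

def speedup_time(ts, tp):
    # ts: execution time on one processor (best sequential algorithm)
    # tp: execution time on a multiprocessor with n processors
    return ts / tp

def speedup_steps(seq_steps, par_steps):
    # Same ratio, cast in terms of computational steps.
    return seq_steps / par_steps

print(speedup_time(100.0, 12.5))   # 8.0
print(speedup_steps(1000, 125))    # 8.0
```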
Superlinear Speedup
where S(n) > n, may be seen on occasion, but usually this is due to using a suboptimal sequential algorithm or some unique feature of the architecture that favors the parallel formulation.
One common reason for superlinear speedup is the extra memory in the multiprocessor system, which can hold more of the problem data at any instant; this leads to less traffic to relatively slow disk memory. Superlinear speedup can also occur in search algorithms.
[Figure: space-time diagram of a parallel program, showing alternating computing and message-passing periods.]
Maximum Speedup
With a fraction f of the computation that cannot be divided into concurrent parts (the serial section), one processor takes time ts: fts on the serial section and (1 - f)ts on the parallelizable sections. With n processors, the parallelizable sections take (1 - f)ts/n, so the parallel execution time is
tp = fts + (1 - f)ts/n
giving a maximum speedup of
S(n) = ts / (fts + (1 - f)ts/n) = n / (1 + (n - 1)f)
(Amdahl's law).
[Figure 1.30 plots: (a) speedup curves for f = 0%, 5%, 10%, and 20% against 4 to 20 processors; (b) curves for n = 16 and n = 256 against serial fraction f from 0 to 1.]
Figure 1.30 (a) Speedup against number of processors. (b) Speedup against serial fraction, f. Even with an infinite number of processors, the maximum speedup is limited to 1/f. For example, with only 5% of the computation being serial, the maximum speedup is 20, irrespective of the number of processors.
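The 1/f limit follows directly from Amdahl's law, S(n) = n/(1 + (n - 1)f); a quick numerical check (sketch):

```python
def amdahl_speedup(f, n):
    # Maximum speedup with serial fraction f on n processors:
    #   S(n) = n / (1 + (n - 1) f)
    return n / (1 + (n - 1) * f)

print(round(amdahl_speedup(0.05, 20), 2))     # 10.26 with 20 processors
print(round(amdahl_speedup(0.05, 10**9), 2))  # 20.0: approaching 1/f
```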
Efficiency
E = (Execution time using one processor) / ((Execution time using a multiprocessor) × (number of processors)) = ts / (tp × n)
which leads to
E = (S(n) / n) × 100%
when expressed as a percentage.
Efficiency gives the fraction of time that the processors are being used on the computation.
Cost
The processor-time product, or cost (or work), of a computation is defined as
Cost = (execution time) × (total number of processors used)
The cost of a sequential computation is simply its execution time, ts. The cost of a parallel computation is tp × n. The parallel execution time tp is given by ts/S(n); hence the cost of a parallel computation is
Cost = (ts × n) / S(n) = ts / E
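The cost relationships with some invented figures (a sketch):

```python
# Cost of a parallel computation, computed two equivalent ways.

ts = 100.0        # sequential execution time (invented figure)
n = 16            # number of processors
S = 8.0           # speedup actually achieved

tp = ts / S       # parallel execution time = 12.5
E = S / n         # efficiency as a fraction = 0.5

cost = tp * n
print(cost)       # 200.0
print(ts / E)     # 200.0 -- the same value: Cost = ts / E
```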
Scalability
Used to indicate a hardware design that allows the system to be increased in size and, in doing so, to obtain increased performance - could be described as architecture or hardware scalability.
Scalability is also used to indicate that a parallel algorithm can accommodate increased data items with a low and bounded increase in computational steps - could be described as algorithmic scalability.
Problem Size
Intuitively, we would think of the number of data elements being processed in the algorithm as a measure of size.
However, doubling the problem size would not necessarily double the number of computational steps. It will depend upon the problem.
For example, adding two matrices has this effect, but multiplying matrices does not. The number of computational steps for multiplying matrices quadruples.
Hence, scaling different problems implies different computational requirements. An alternative definition of problem size is to equate it with the number of basic steps in the best sequential algorithm.
Gustafson's Law
Rather than assume that the problem size is fixed, assume that the parallel execution time is fixed. In increasing the problem size, Gustafson also makes the case that the serial section of the code does not increase with the problem size. With a serial fraction f measured on the parallel system, the scaled speedup is S(n) = f + (1 - f)n.
Example
Suppose a serial section of 5% and 20 processors; the speedup according to the formula is 0.05 + 0.95(20) = 19.05, instead of 10.26 according to Amdahl's law. (Note, however, the different assumptions.)
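The comparison in the example, computed directly (sketch):

```python
def amdahl(f, n):
    # Fixed problem size: S(n) = n / (1 + (n - 1) f)
    return n / (1 + (n - 1) * f)

def gustafson(f, n):
    # Fixed parallel execution time: S(n) = f + (1 - f) n
    return f + (1 - f) * n

print(round(gustafson(0.05, 20), 2))  # 19.05
print(round(amdahl(0.05, 20), 2))     # 10.26
```

The two laws answer different questions: Amdahl's asks how much faster a fixed problem runs; Gustafson's asks how much more problem can be solved in the same time.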