Lecture 9: Architecture Independent (MPI) Algorithm Design, Parallel Computing, Spring 2010

Matrix Computations
  • SPMD program design stipulates that all processors execute a single program on different pieces of data.
  • For matrix computations it makes sense to distribute a matrix evenly among the p processors of a parallel computer. The distribution should also take into account how the matrix is stored in memory (as laid out by, say, the compiler), so that locality is preserved and cache lines are filled efficiently to speed up computation.
  • There are various ways to divide a matrix. Some of the most common ones are described below.
  • One way to distribute a matrix is a block distribution. Split the n × n array into blocks of size n/p1 × n/p2, where p = p1 × p2, and assign the i-th block to processor i. This distribution is suitable as long as the amount of work is the same for every element of the matrix.
  • The most common block distributions are:
    • Column-wise (block) distribution. Split the matrix into p column stripes so that n/p consecutive columns form the i-th stripe, which is stored in processor i. This is the case p1 = 1 and p2 = p.
    • Row-wise (block) distribution. Split the matrix into p row stripes so that n/p consecutive rows form the i-th stripe, which is stored in processor i. This is the case p1 = p and p2 = 1.
    • Block or square distribution. This is the case p1 = p2 = √p, i.e. the blocks are of size n/√p × n/√p and block i is stored in processor i.
  • In certain cases (e.g. LU decomposition, Cholesky factorization) the amount of work differs for different elements of the matrix. For these cases block distributions are not suitable.
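The three block distributions differ only in the choice of p1 and p2, which a small owner function makes concrete. This is a minimal sketch, not part of the slides: the name block_owner is hypothetical, blocks are assumed to be numbered column by column, and p1 and p2 are assumed to divide n.

```python
def block_owner(i, j, n, p1, p2):
    """Processor owning element (i, j) of an n x n matrix that is split
    into a p1 x p2 grid of blocks of size (n/p1) x (n/p2).

    Assumption: blocks are numbered column by column, so the block in
    block-row br and block-column bc has number bc * p1 + br.
    """
    if n % p1 or n % p2:
        raise ValueError("sketch assumes p1 and p2 divide n")
    block_row = i // (n // p1)
    block_col = j // (n // p2)
    return block_col * p1 + block_row


# Column-wise (p1 = 1, p2 = p), row-wise (p1 = p, p2 = 1), square (p1 = p2 = sqrt(p)).
column_wise = block_owner(0, 5, n=8, p1=1, p2=4)
row_wise = block_owner(5, 0, n=8, p1=4, p2=1)
square = block_owner(5, 6, n=8, p1=2, p2=2)
```

With p1 = 1 the owner depends only on the column index, and with p2 = 1 only on the row index, which matches the stripe descriptions above.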

Matrix block distributions

Matrix-Vector Multiplication
  • Sequential algorithm: the running time is O(n^2), from n^2 multiplications and additions.

    MAT_VECT(A, x, y) {
      for i = 0 to n-1 do {
        y[i] = 0;
        for j = 0 to n-1 do
          y[i] = y[i] + A[i][j] * x[j];
      }
    }

Matrix-Vector Multiplication: Rowwise 1-D Partitioning
  • Assume p = n (p is the number of processors).
  • Steps:
    • Step 1: Initial partition of matrix and vector
      • Matrix distribution: each process gets one complete row of the matrix.
      • Vector distribution: the n × 1 vector is distributed so that each process owns one of its elements.
    • Step 2: All-to-all broadcast
      • Every process has one element of the vector, but every process needs the entire vector.
    • Step 3: Computation
      • Process Pi computes y[i] = A[i][0]·x[0] + A[i][1]·x[1] + ... + A[i][n-1]·x[n-1].
  • Running time:
    • All-to-all broadcast: Θ(n) on any architecture.
    • Multiplication of a single row of A with the vector x: Θ(n).
    • Total running time: Θ(n).
    • Total work: Θ(n^2), which is cost-optimal.
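The three steps for p = n can be traced in a single-process simulation. This is a sketch under stated assumptions: the function names are hypothetical, the "processes" are just list entries, and the all-to-all broadcast is modelled by handing every process a copy of the gathered vector rather than by real message passing.

```python
def all_to_all_broadcast(local_items):
    """Simulated all-to-all broadcast: every process ends up with a copy
    of the items contributed by all processes, in rank order."""
    gathered = list(local_items)
    return [list(gathered) for _ in local_items]


def matvec_rowwise_p_equals_n(A, x):
    n = len(A)
    # Step 1: process i owns row i of A and the single element x[i].
    rows = [A[i] for i in range(n)]
    x_local = [x[i] for i in range(n)]
    # Step 2: after the all-to-all broadcast every process holds all of x.
    x_full = all_to_all_broadcast(x_local)
    # Step 3: process i computes y[i] as the dot product of its row with x.
    return [sum(a * b for a, b in zip(rows[i], x_full[i])) for i in range(n)]


y = matvec_rowwise_p_equals_n([[1, 2], [3, 4]], [5, 6])
```

In an MPI program step 2 would typically be an allgather collective; the simulation only mirrors the data movement, not its cost.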

Matrix-Vector Multiplication: Rowwise 1-D Partitioning

Matrix-Vector Multiplication: Rowwise 1-D Partitioning
  • Assume p < n (p is the number of processors).
  • Three steps:
    • Initial partition of matrix and vector
      • Each process initially stores n/p complete rows of the matrix and a portion of the vector of size n/p.
    • All-to-all broadcast
      • Among p processes, involving messages of size n/p.
    • Computation
      • Each process multiplies its n/p rows of the matrix with the vector x to produce n/p elements of the result vector.
  • Running time:
    • All-to-all broadcast:
      • T = (ts + (n/p) tw)(p - 1) on any architecture
      • T = ts log p + (n/p) tw (p - 1) on a hypercube
    • Computation: T = n · (n/p) = Θ(n^2/p)
    • Total running time: T = Θ(n^2/p + ts log p + n tw)
    • Total work: W = Θ(n^2 + ts p log p + n p tw), which is cost-optimal.
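The running-time formulas above are easy to evaluate for concrete values. The sketch below is illustrative only: the function name is hypothetical, ts and tw are read in their usual sense as message startup time and per-word transfer time (the slide does not define them), the logarithm is assumed to be base 2, and one multiply-add is counted as one time unit.

```python
import math


def rowwise_1d_time(n, p, ts, tw, hypercube=False):
    """Estimated time of rowwise 1-D matrix-vector multiplication,
    using the communication and computation terms from the slide."""
    m = n / p  # rows of A (and elements of x) held by each process
    if hypercube:
        comm = ts * math.log2(p) + m * tw * (p - 1)
    else:
        comm = (ts + m * tw) * (p - 1)
    comp = n * m  # n multiply-adds for each of the n/p local rows
    return comp + comm


t_any = rowwise_1d_time(1024, 16, ts=10, tw=1)
t_cube = rowwise_1d_time(1024, 16, ts=10, tw=1, hypercube=True)
work_cube = 16 * t_cube  # total work W = p * T
```

For these sample values the computation term n^2/p dominates, so the total work stays close to the sequential n^2.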

Matrix-Vector Multiplication: Columnwise 1-D Partitioning
  • Similar to rowwise 1-D partitioning.

Matrix-Vector Multiplication: 2-D Partitioning
  • Assume p = n^2.
  • Steps:
    • Step 1: Initial partitioning
      • Each process gets one element of the matrix.
      • The vector is distributed only among the processes on the diagonal, each of which owns one element.
    • Step 2: Broadcast
      • The i-th element of the vector must be available to the i-th element of each row of the matrix. This step therefore consists of n simultaneous one-to-all broadcast operations, one in each column of processes.
    • Step 3: Computation
      • Each process multiplies its matrix element with the corresponding element of x.
    • Step 4: All-to-one reduction of partial results
      • The products computed for each row must be added, leaving the sums in the last column of processes.
  • Running time:
    • One-to-all broadcast: Θ(log n)
    • Computation in each process: Θ(1)
    • All-to-one reduction: Θ(log n)
    • Total running time: Θ(log n)
    • Total work: Θ(n^2 log n), which is not cost-optimal.

Matrix-Vector Multiplication: 2-D Partitioning

Matrix-Vector Multiplication: 2-D Partitioning
  • Assume p < n^2.
  • Steps:
    • Step 1: Initial partitioning
      • Each process gets an (n/√p) × (n/√p) block of the matrix.
      • The vector is distributed only among the processes on the diagonal, each of which owns n/√p elements.
    • Step 2: Columnwise one-to-all broadcast
      • The i-th group of vector elements must be available to the i-th group of columns in each row of the matrix. This step therefore consists of √p simultaneous one-to-all broadcast operations, one in each column of processes.
    • Step 3: Computation
      • Each process multiplies its (n/√p) × (n/√p) block of the matrix with the corresponding n/√p elements of x.
    • Step 4: All-to-one reduction of partial results
      • The products computed for each row must be added, leaving the sums in the last column of processes.
  • Running time:
    • Columnwise one-to-all broadcast: T = (ts + (n/√p) tw) log √p on any architecture
    • Computation in each process: T = (n/√p) · (n/√p) = n^2/p
    • All-to-one reduction: T = (ts + (n/√p) tw) log √p on any architecture
    • Total running time: T = n^2/p + 2 (ts + (n/√p) tw) log √p on any architecture
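The four steps of the 2-D scheme can be followed in a small single-process simulation. This is a sketch rather than an MPI implementation: the name matvec_2d is hypothetical, p is assumed to be a perfect square whose root divides n, and the broadcast and reduction are modelled by plain copies and sums.

```python
import math


def matvec_2d(A, x, p):
    n = len(A)
    q = math.isqrt(p)  # the processes form a q x q mesh
    if q * q != p or n % q:
        raise ValueError("sketch assumes p is a perfect square and sqrt(p) divides n")
    m = n // q  # side length of each matrix block

    # Step 1: process (r, c) owns the block with rows r*m.. and columns c*m..;
    # diagonal process (c, c) owns the vector piece x[c*m:(c+1)*m].
    x_diag = {c: x[c * m:(c + 1) * m] for c in range(q)}

    # Step 2: one-to-all broadcast of each vector piece down its process column.
    x_local = {(r, c): x_diag[c] for r in range(q) for c in range(q)}

    # Step 3: every process multiplies its block with its vector piece.
    partial = {}
    for r in range(q):
        for c in range(q):
            block_rows = [A[r * m + i][c * m:(c + 1) * m] for i in range(m)]
            partial[(r, c)] = [
                sum(a * b for a, b in zip(row, x_local[(r, c)]))
                for row in block_rows
            ]

    # Step 4: all-to-one reduction (sum) of the partial results along each process row.
    y = []
    for r in range(q):
        acc = [0] * m
        for c in range(q):
            acc = [s + t for s, t in zip(acc, partial[(r, c)])]
        y.extend(acc)
    return y


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
x = [1, 0, -1, 2]
results = [matvec_2d(A, x, p) for p in (1, 4, 16)]
```

Setting p = 16 for this 4 × 4 matrix reproduces the p = n^2 case, where each block is a single element.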

Matrix-Vector Multiplication: 1-D Partitioning vs. 2-D Partitioning
  • Matrix-vector multiplication is faster with block 2-D partitioning of the matrix than with block 1-D partitioning for the same number of processes.
  • If the number of processes is greater than n, then 1-D partitioning cannot be used.
  • If the number of processes is less than or equal to n, 2-D partitioning is preferable.

Matrix Distributions: Block cyclic
  • In block cyclic distributions the rows (similarly for columns) are split into q groups of n/q consecutive rows per group, where potentially q > p, and the i-th group is assigned to a processor in a cyclic fashion.
    • Column-cyclic distribution. This is a one-dimensional cyclic distribution. Split the matrix into q column stripes so that n/q consecutive columns form the i-th stripe, which is stored in processor i % p. The symbol % is the mod (remainder of the division) operator. Usually q > p. The term wrapped-around column distribution is sometimes used for the case n/q = 1, i.e. q = n.
    • Row-cyclic distribution. This is a one-dimensional cyclic distribution. Split the matrix into q row stripes so that n/q consecutive rows form the i-th stripe, which is stored in processor i % p. Usually q > p. The term wrapped-around row distribution is sometimes used for the case n/q = 1, i.e. q = n.
    • Scattered distribution. Let the p = qi · qj processors be divided into qj groups, each group Pj consisting of qi processors. In particular, Pj = {j·qi + l | 0 ≤ l ≤ qi − 1}. Processor j·qi + l is called the l-th processor of group Pj. Matrix element (i, j), 0 ≤ i, j < n, is then assigned to the (i mod qi)-th processor of group P(j mod qj). A scattered distribution refers to the special case qi = qj = √p.
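Both mappings reduce to a line of index arithmetic, which the following sketch spells out. The function names are hypothetical, and the one-dimensional version assumes that q divides n.

```python
def cyclic_owner(index, n, q, p):
    """Processor holding row (or column) `index` under a 1-D block-cyclic
    distribution with q stripes of n/q consecutive rows (or columns) each."""
    if n % q:
        raise ValueError("sketch assumes q divides n")
    stripe = index // (n // q)  # which stripe the row or column falls in
    return stripe % p           # stripes are dealt out to processors cyclically


def scattered_owner(i, j, qi, qj):
    """Processor holding element (i, j) in the scattered distribution:
    the (i mod qi)-th processor of group P(j mod qj), whose global
    number is (j mod qj) * qi + (i mod qi)."""
    return (j % qj) * qi + (i % qi)


# Wrapped-around distribution (q = n): consecutive rows go to consecutive processors.
wrapped = [cyclic_owner(r, n=8, q=8, p=3) for r in range(8)]
```

Because neighbouring rows land on different processors in the wrapped-around case, the work stays spread across processors even when a computation touches only part of the matrix.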

Block cyclic distributions

Scattered Distribution

Matrix Multiplication – Serial algorithm

Matrix Multiplication
  • The algorithm for matrix multiplication described below was presented in the seminal work of Valiant. It works for p ≤ n^2.
  • Three steps:
    • Initial partitioning: Matrices A and B are partitioned into p blocks Ai,j and Bi,j (0 ≤ i, j < √p) of size n/√p × n/√p each. These blocks are mapped onto a √p × √p logical mesh of processes. The processes are labeled from P0,0 to P√p-1,√p-1.
    • All-to-all broadcasting: Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the result matrix. Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 ≤ k < √p. To acquire all the required blocks, an all-to-all broadcast of matrix A's blocks is performed in each row of processes, and an all-to-all broadcast of matrix B's blocks is performed in each column.
    • Computation: After Pi,j acquires Ai,0, Ai,1, …, Ai,√p-1 and B0,j, B1,j, …, B√p-1,j, it performs the submatrix multiplication and addition step of lines 7 and 8 in Alg. 8.3.
  • Running time:
    • All-to-all broadcast:
      • T = (ts + (n^2/p) tw)(√p - 1) on any architecture
      • T = ts log √p + (n^2/p) tw (√p - 1) on a hypercube
    • Computation: T = √p · (n/√p)^3 = n^3/p
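The block algorithm can be checked against ordinary matrix multiplication with a single-process simulation. The sketch makes several assumptions: the names block and block_matmul are hypothetical, p is a perfect square whose root divides n, and the two all-to-all broadcasts are modelled by simply reading the needed blocks from A and B.

```python
import math


def block(M, r, c, m):
    """The m x m block of M at block-row r and block-column c."""
    return [row[c * m:(c + 1) * m] for row in M[r * m:(r + 1) * m]]


def block_matmul(A, B, p):
    n = len(A)
    q = math.isqrt(p)  # the processes form a q x q logical mesh
    if q * q != p or n % q:
        raise ValueError("sketch assumes p is a perfect square and sqrt(p) divides n")
    m = n // q
    C = [[0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            # After the row and column broadcasts, process (i, j) holds
            # A(i, 0..q-1) and B(0..q-1, j).
            a_row = [block(A, i, k, m) for k in range(q)]
            b_col = [block(B, k, j, m) for k in range(q)]
            # Local computation: C(i, j) is the sum over k of A(i, k) * B(k, j).
            for k in range(q):
                for r in range(m):
                    for c in range(m):
                        C[i * m + r][j * m + c] += sum(
                            a_row[k][r][t] * b_col[k][t][c] for t in range(m)
                        )
    return C


product = block_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], p=4)
```

Each process performs √p block products of size n/√p, which is where the n^3/p computation term comes from.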

Matrix Multiplication
  • The input matrices A and B are divided into p block-submatrices, each of dimension m × m, where m = n/√p. We call this distribution of the input among the processors a block distribution.
  • In this way element A(i, j), 0 ≤ i < n, 0 ≤ j < n, belongs to block number ⌊j/m⌋·√p + ⌊i/m⌋, which is assigned to the memory of the same-numbered processor.
  • Let Ai (respectively, Bi) denote the i-th block of A (respectively, B) stored in processor i.
  • With these conventions the algorithm is described in Figure 1. The following Proposition describes the performance of this algorithm.
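The block numbering formula runs down each block column before moving to the next one, which is easy to misread. The sketch below applies it to every element and collects the resulting blocks; the name distribute_blocks is hypothetical, and p is assumed to be a perfect square whose root divides n.

```python
import math


def distribute_blocks(A, p):
    """Map each processor number k to its m x m block of A, using the
    numbering k = (j // m) * sqrt(p) + (i // m) for element (i, j)."""
    n = len(A)
    q = math.isqrt(p)
    if q * q != p or n % q:
        raise ValueError("sketch assumes p is a perfect square and sqrt(p) divides n")
    m = n // q
    blocks = {k: [[None] * m for _ in range(m)] for k in range(p)}
    for i in range(n):
        for j in range(n):
            k = (j // m) * q + (i // m)       # block (and processor) number
            blocks[k][i % m][j % m] = A[i][j]  # position inside the block
    return blocks


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
blocks = distribute_blocks(A, 4)
```

For this 4 × 4 example, processor 1 receives the lower-left block and processor 2 the upper-right one.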

Matrix Multiplication