Cluster ParallelTechniques PDF

Parallel Programming in R PS236b Spring 2010

Mark Huberty
February 24, 2010
Concepts Parallelization in R snow foreach multicore Three benchmarks Considerations Good habits: testing, portability, and error handling The Political Science Cluster GPU Computing
Parallel Programming in R PS236b Spring 2010 Concepts
Parallelization: basic concepts
The commodity computing revolution
The fast computing revolution
The three aspects of the revolution
Cheap computing now comes in three forms: Many cores on the same chip Many chips in the same computer Many computers joined with high-speed connections We can refer to any of these parallel processing units as nodes.
Kinds of parallelization
Taking advantage of these forms of compute power requires code that can do one of several kinds of parallelization: Bit -based parallelization Already have this: the move up the chain via 4-8-16-32-64 bit machines changes the number of steps required to run a single instruction Instruction-based parallelization Processor/program layer Data-based parallelization Decompose large data structures into independent chunks, on which you perform the same operation Task - based parallelization Perform different, independent tasks on the same data For R, we are mostly interested in data and task parallelization
Data Parallelization, contd
Data parallelization is very common: Bootstrapping: Sample N times from data D and apply function F to each sample Genetic matching: generate N realizations of matches between groups T and C and calculate the balance on each; repeat for G generations Monte Carlo simulations Google does a ton of this kind of work via its MapReduce framework
Task parallelization
Task parallelization is a little less obvious. Ideas include: Given N possible estimators of treatment effect , test all against data set D Machine learning: given N different classication schemes for some data set D , generate some test statistics S for all of them
Data parallelization: an overview
So implement data parallelization, we must: Conceptualize the problem as a set of operations against independent data sets Break up this set of operations into independent components Assign each component to a node for processing Collect the output of each component and return it to the master process
Technical requirements for parallelization
For the hardware geeks: Multiple cores or servers
For the hardware geeks: Multiple cores or servers Some means to connect them
For the hardware geeks: Multiple cores or servers Some means to connect them A way to communicate among them
Sockets MPI (Message Passing Interface)
A means of sharing programs and data
A means of sharing programs and data A framework to organize the division of tasks and the collection of results
How fast can we get?
For some program containing a function F that will operate on some data set D, decompose D into di , i N and perform F on each, splitting the tasks across M nodes For the program containing F , the maximum gain from parallelization is given by Amdahls Law. For a program with P percent parallizability running on M nodes: S= 1 1P +
P M
(1)
The Political Science Compute Cluster
Heres what we have: 9 2-chip Opteron 248 servers Gigabit ethernet interconnects OpenMPI message passing A Network File System
Parallel Programming in R PS236b Spring 2010 Parallelization in R
Parallelization in R
Parallel Programming in R PS236b Spring 2010 Parallelization in R
Frameworks for parallelization
R has several frameworks to manage data parallelization. Three mature and effective ones are: snow, which uses the apply model for task division foreach, which uses a for loop model for division multicore, which is only suitable for the many-cores hardware model There are several other possibilities (nws, mapreduce, pvm) at different levels of obsolescence or instability.
Parallel Programming in R PS236b Spring 2010 Parallelization in R snow
The snow model
snow is a master/worker model: From N nodes create 1 master and N 1 workers1 ; farm jobs out to the workers. This is a little weird when using MPI-based systems where the nodes are undifferentiated; keep this in mind when using MPI for R jobs.
parlance uses the master/slave terminology, which has been abandoned for obvious reasons
1 Older
snow and parallelization
The snow library makes parallization straightforward: Create a cluster (usually with either sockets or MPI) Use parallel versions of the apply functions to run stuff across the nodes of the cluster So this is pretty easy: we already know how to use apply
Cluster creation in snow
See the clusterCreate function in the clusterSetup.R code.
snow example
1 ## assuming you ve a l r e a d y c r e a t e d a c l u s t e r c l : 2 m < matrix ( rnorm ( 1 6 ) , 4 , 4 ) 3 c l u s t e r E x p o r t ( m ) 4 parSapply ( c l , 1 : n c o l s (m) , function ( x ) { 5 mean (m[ , x ] ) 6 } 7 ) Notice there that parSapply has replaced sapply, but nothing much else has changed.
Data vs. Task parallelization with snow

Data: 1 2 3 4 5 6 7 p a r L a p p l y ( c l , 1 : nSims , function ( x ) { n < dim ( data ) [ 1 ] simdata < data [ sample ( 1 : n , n , replace =TRUE ) , ] o u t < myfunc ( simdata ) return ( out ) } ) Task: 1 2 3 4 5 f u n c l i s t < l i s t ( func1 , func2 , func3 ) p a r L a p p l y ( c l , 1 : length ( f u n c l i s t ) , function ( x ) { o u t < f u n c l i s t [ [ x ] ] ( data ) return ( out ) }
Object management in snow

snow requires some additional object management: Libraries must be called on all nodes Data objects must be exported to all nodes
Object management in snow

snow requires some additional object management: Libraries must be called on all nodes Data objects must be exported to all nodes As in: 1 2 3 4 ## Given a c l u s t e r c a l l e d c l m < matrix ( rnorm ( 1 0 0 ) , ncol =10 , nrow =10) c l u s t e r E x p o r t ( c l , m ) c l u s t e r E v a l Q ( c l , l i b r a r y ( Matching ) ) Notice that this doesnt apply to objects created inside a call to the cluster (i.e. parLapply().
Parallel Programming in R PS236b Spring 2010 Parallelization in R foreach
foreach and parallelization
REvolution Computing released the foreach libraries. To use them, you install: foreach doSNOW for using snow clusters doMPI for working directly with MPI doMC for use on multicore machines The basic idea: looks like a for loop, performs like an apply, and portable
foreach example
1 ## Load t h e l i b r a r i e s . I assume I m on an MPIbased 2 ## c l u s t e r ; o t h e r o p t i o n s are doSNOW and doMC 3 l i b r a r y ( f o r e a c h ) ; l i b r a r y ( doMPI ) 4 5 ## Get t h e c l u s t e r c o n f i g u r a t i o n . 6 c l < s t a r t C l u s t e r ( ) 7 8 ## T e l l doMPI t h a t t h e c l u s t e r e x i s t s 9 re g i st e r D o M P I ( c l ) 10 11 m < matrix ( rnorm ( 1 6 ) , 4 , 4 ) 12 13 ## Run t h e f o r l o o p t o c a l c u l a t e t h e column means 14 f o r e a c h ( i =1: ncol (m) ) %dopar% { 15 mean (m[ , i ] ) # makes m a v a i l a b l e on nodes 16 }
foreach continued
Notice the important bit: ## Run t h e f o r l o o p t o c a l c u l a t e t h e mean ## o f each column i n m. f o r e a c h ( i =1: ncol (m) ) %dopar% { mean (m[ , i ] ) # makes m a v a i l a b l e on nodes } Here, the foreach term controls the repetition, but the %dopar% term does the actual work. If this werent running on a cluster, you would use %do% instead
Parallel Programming in R PS236b Spring 2010 Parallelization in R multicore
multicore and parallelization

multicore only works on chips with > 1 core (dual-core, quad-core, etc). Its basic function is mclapply: 1 library ( multicore ) 2 m < matrix ( rnorm ( 1 6 ) , 4 , 4 ) 3 4 ## By d e f a u l t t h e f u n c t i o n f i n d s a l l cores on 5 ## t h e c h i p and uses them a l l 6 mclapply ( 1 : n c o l s (m) , function ( x ) { 7 mean (m[ , x ] ) 8 }, 9 mc . cores = getOption ( cores ) 10 ) 11 ## N o t i c e t h e l a s t argument ; can s p e c i f y a 12 ## number o f cores i f d e s i r e d
Parallel Programming in R PS236b Spring 2010 Parallelization in R Three benchmarks
Three benchmarks, three lessons
Parallel computing on your laptop
New laptops today will almost certainly come with dual-core chips.
Usual speed gains run around 50%
Parallel computing on your laptop

New laptops today will almost certainly come with dual-core chips.
Benchmarked code for MBP 13, 2.2gHz, 4gb RAM
Compute time (s)
10
20
30
40
Bootstrap (1 core)
Bootstrap (2 cores)
Matching (1 core)
Matching (2 cores)
10 October 2009
Usual speed gains run around 50%
Matrix multiplication
Results for:
1 2 3 4 5 6 7 8 9 10 11 12 13 l i b r a r y ( snow ) t e s t m a t < matrix ( rnorm (10000000) , ncol =10000) mm. s e r i a l < system . time ( t e s t m a t %% t ( t e s t m a t ) ) t e s t m a t . t < t ( t e s t m a t ) source ( setupCode . R ) clusterCreate ( ) clusterExport ( cl , c ( testmat , testmat . t ) ) mm. p a r a l l e l < system . time ( parMM ( c l , t e s t m a t , t e s t m a t . t ) ) save . image ( mm. r e s u l t s . RData ) clusterShutdown ( )
mm.parallel \mm.serial = 0.6 for an 8-node cluster
Parallelization and speed: an example
So why choose one? Lets look at speed and features. The idea: write the same bootstrap as a serial job, a snow job, and a foreach job, and see what we get. All the code is available at http://pscluster.berkeley.edu
Results: timing
Table: Benchmark results for 500 repetitions of a 1000-trial bootstrap for different coding methods; parallel methods use 8 nodes Mean time (s) 176.4 33.4 27.3 27.2 Pct of Serial time 100.0 18.9 15.5 15.4 2.5 pct CI 171.8 32.9 26.8 26.7 97.5 pct CI 181.3 33.5 27.8 27.8
Serial parSapply foreach, snow foreach, dompi
Results: distributions
Variation in compute times for bootstrap, 500 repetitions of 1000 trials
parSapply foreach, snow foreach, doMPI
Variation in compute times for the serial bootstrap, 500 repetitions of 1000 trials
0.15
26
28
30 Compute time (s)
32
34
0.00
0.05
0.10
170
175 Compute time (s)
180
185
Figure: Benchmark distributions for serial and parallel bootstraps
doMPI vs. doSNOW
Given the identical performance, why choose one vs. the other? doMPI is not compatible with an environment that has a snow cluster running in it. Thus use doMPI when running things without snow, and doSNOW when combining code that requires snow with foreach-based routines.
Parallel Programming in R PS236b Spring 2010 Considerations
Parallelization and data structures

Lists dont get much use in generic R but are very helpful for parallelization: For N data sets of the same format, do some identical analysis A on each of them

Solution:
1. Create some list L of length N , containing all the data sets; 2. then loop across the list N times (using P chips) 3. and apply function A to each data set

Solution:
1. Create some list L of length N , containing all the data sets; 2. then loop across the list N times (using P chips) 3. and apply function A to each data set
Can be referred to as a map/reduce problem R has a mapReduce package that claims to do this; but its basically just an parLapply with more overhead.
Some complications
There are a few more issues to worry about: Memory management:

Processors parallelize but memory does not: the master node still has to hold all the results in RAM If R runs out of RAM, the entire process will die and take your data with it Solutions:
Reduce objects on the nodes as much as possible Run things in recursive loops
Memory mgmt and object size
Comparative object sizes for the output of a regression, N = 1000, P = 10: Output from lm: 321k Output from lm$coefficients: 0.6k
Comparative object sizes for the output of a regression, N = 1000, P = 10: Output from lm: 321k Output from lm$coefficients: 0.6k Factor of 535 difference!
Comparative object sizes for the output of a regression, N = 1000, P = 10: Output from lm: 321k Output from lm$coefficients: 0.6k Factor of 535 difference! So if you only need the coefcients, you use much less memoryand lm is a simple object compared to, say, MatchBalance
Recursive loops for memory mgmt: a GenMatch example

Common desire: want to run GenMatch with large population sizes. Common problem: the MemoryMatrix grows very large and the kills off the R session with an out of memory error. Solution: recursion: 1 2 3 4 5 6 7 ## Assuming a gen . o u t from a dummy run o f GenMatch par < gen . o u t for ( i in 1:10){ par < GenMatch ( . . . , s t a r t i n g . v a l u e s = par$ s t a r t i n g . values , hard . g e n e r a t i o n . l i m i t =TRUE, max . g e n e r a t i o n s =25 )} Notice that here, the MemoryMatrix never grows large; but GenMatch still retains the history via recursion
Parallel Programming in R PS236b Spring 2010 Good habits: testing, portability, and error handling
Good habits: testing, portability, and error handling
Portability
To achieve portability of code across platforms, use if statements to set the appropriate environment variables. Example: the working directory 1 2 3 4 5 6 i f ( o n c l u s t e r ==TRUE) { setwd ( / projectname / ) } else { setwd ( / r e s e a r c h / m y p r o j e c t s / projectname / ) }
Code testing and management
Parallel jobs are often long jobs, posing some issues: How to catch errors while writing code? How to test code functions? How to verify output before running? How to catch errors when running? You want to check your code before starting, rather than have the process fail while running.
codetools for syntax checking

Luke Tierneys codetools package for R will do basic syntax checking for you: 1 2 3 4 5 6 7 8 9 10 11 12 library ( codetools ) nSims < 500 t e s t F u n c t i o n < function ( simcount ) { s a p p l ( 1 : simcount , function ( x ) { mean ( rnorm ( 1 0 0 0 ) ) save . ( nSims , f i l e = nsims . RData ) } ) } Here, we should expect errors with sappl and save..
codetools output
Heres what we get: 1 2 3 4 5 6 7 > checkUsage ( t e s t F u n c t i o n ) <anonymous > : no v i s i b l e g l o b a l function d e f i n i t i o n for sappl <anonymous> : <anonymous > : no v i s i b l e g l o b a l function d e f i n i t i o n f o r save . <anonymous> : <anonymous > : no v i s i b l e b i n d i n g f o r g l o b a l v a r i a b l e nSims So checkUsage() will help catch unidentied variables, bad functions, and other typos before you actually run your jobs.
Good operating practice

1. Write code as a set of functions 2. Use conditional statements to make code portable between your laptop and the parallel environment 3. Check functions with codetools 4. Run trials of the code on small N Note that to check a set of functions, load them up and use checkUsageEnv() to loop through all those functions, as in: 1 checkUsageEnv ( env = . GlobalEnv )
Parallel Programming in R PS236b Spring 2010 The Political Science Cluster
The Political Science Compute Cluster
Technology
The basic conguration A master server with a dual-core 2.6 gHz Opteron chip and 8gb of RAM 9 2-chip 64-bit worker servers w/ 4-8gb of RAM each, for a total of 18 nodes CentOS linux The Perceus cluster administration software Gigabit ethernet interconnects OpenMPI message-passing SLURM job management and resource allocation 1tb of RAID-1 storage for users (actually about 800gb)
Accounts
Accounts are available on these terms: Polisci faculty and grad students: 2 years, renewable Non-Polisci students in 200-level courses: 1 semester For other purposes: on request All accounts come with 5gb of storage on the cluster itself. To get an account, send email to [email protected]
Logistical details
There are some user services available: 1. The cluster administrators are available at [email protected] 2. Cluster users should sign up for the listserv, at [email protected] 3. Benchmarks, code, and documentation are available at the cluster webpage: http://pscluster.berkeley.edu Finally, there is a comprehensive README le that all users should review. It can be found on the cluster webpage.
Resources
As of right now, we have the following resources available: 64-bit R, compiled against GOTO BLAS The C, FORTRAN, and MPICC compilers Emacs + ESS (in both X and terminal avors) git for version control If you want something else on the servers (Matlab, Mathematica, Stata) and can get the right licenses, wed be happy to look at setting it up.
Access
Access is available both on-campus and off: On campus: via ssh to pscluster.polisci.berkeley.edu Off campus: through the VPN via ssh to the same address The VPN software is available for free via the Software Central service (software.berkeley.edu). The README le at http://pscluster.berkeley.edu has more information on access and software conguration.
ssh clients
ssh and scp require a client program. Which program depends on your OS: OS X, Linux: Use the Terminal application Windows: PuTTY is free; HostExplorer is a commercial alternative, available free at http://software.berkeley.edu Note that any of these will give command-line access. There is no GUI.
Using the cluster
Two ways: 1. Serial jobs. No special programming needed. But you can only make use of a single node. Nice for long-running, single-threaded jobs. 2. Parallel jobs Some special programming required, but can take advantage of >1 node for speedup
A generic R session
Sessions will generally follow this pattern: Copy your code and data to your home directory on the cluster (via scp) Log into the cluster (ssh) Execute your code by requesting a certain number of nodes from SLURM and initiating the batch job Pull the output and the R transcript le back to your own computer (again via scp)
Running a job on 1 node

To run code against a single job, ask SLURM for 1 node and run your code against it: That looks like:
s a l l o c n 1 o r t e r u n n 1 R s c r i p t <yourcode . R> Here: salloc asks SLURM for nodes orterun invokes MPI to choose a node Rscript runs your code le (where you have the code to pick up the cluster you just created)
Running a le on >1 node

To run code against multiple nodes, you need to (1) ask SLURM for the nodes, (2) invoke the cluster setup in MPI, and (3) call your own code in R. That looks like:
s a l l o c n <number o f nodes> o r t e r u n n <number o f nodes> R s c r i p t <yourcode . R> Here: salloc asks SLURM for nodes orterun creates a cluster with those nodes Rscript runs your code le (where you have the code to pick up the cluster you just created)
Running the job: convenience

For your convenience, we have a script that internalizes all this stuff: R n o t i f y . sh < y o u r c o d e f i l e . R> <number o f nodes> < e m a i l address > This will run your code against the number of requested nodes, and send you an email when the job is complete. As in: R n o t i f y . sh mycode . R 4 me@berkeley . edu
Running the job: while you are away

Usually, you want to start the job and log out. To do that, a little extra is needed: nohup R n o t i f y . sh < y o u r c o d e f i l e > . . . e t c . . . Here, nohup is the Unix no hangup routine, which keeps stuff running even after you log out. This will point the console output to a le called nohup.out. If you want it to go elsewhere, use this syntax: nohup R n o t i f y . sh < y o u r c o d e f i l e > <nodecount > <email > > f i l e n a m e . o u t
Monitoring the job
SLURM provides command-line tools to monitor and manage your jobs and check the status of the cluster: squeue, which prints a list of your jobs and their status and runtimes sinfo, which prints the status of each node in the cluster scancel, which allows you to cancel a job You also have access to the normal Unix commands top and ps to look at your system processes.
Choosing the number of nodes
You might be tempted to always ask for lots of nodes. There are three reasons this is a bad idea: 1. Bad form: this a commons, dont abuse it 2. Wait time: SLURM will force your job to wait until nodes are available; the job might start (and nish) sooner if you asked for fewer nodes 3. Processing timemore not always faster.
Choosing the number of nodes: a benchmark

This is GenMatch() run on 1-17 nodes:
Parallel Programming in R PS236b Spring 2010 GPU Computing
The next generation: GPU computing and hundreds of cores
Blood, Gore, and Parallel computation
High-end computer graphics cards for gaming now have 128-256 cores and can be bought for $200-400. A supercomputer on your desktop R has enabled use of these GPUs through a package called gputools
A supercomputer on your laptop
You may already have this at your disposal: All new MacBooks come with the nVidia 9400m GPU Many PC notebooks have a similar chip Information on how to install gputools can be found at the authors website:
http://brainarray.mbni.med.umich.edu/brainarray/rgpgpu/
Information on how to install gputools on a Macbook Pro can be found at:

http://markhuberty.berkeley.edu/tech.html
gputools results

Cluster ParallelTechniques PDF

Uploaded by

Copyright:

Available Formats

Cluster ParallelTechniques PDF

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Cluster ParallelTechniques PDF

Uploaded by

Copyright:

Available Formats

Parallel Programming in R PS236b Spring 2010

Parallel Programming in R PS236b Spring 2010

February 24, 2010

Parallel Programming in R PS236b Spring 2010

Parallel Programming in R PS236b Spring 2010 Concepts

Parallelization: basic concepts

Parallel Programming in R PS236b Spring 2010 Concepts

The commodity computing revolution

Parallel Programming in R PS236b Spring 2010 Concepts

The fast computing revolution

Parallel Programming in R PS236b Spring 2010 Concepts

The three aspects of the revolution

Parallel Programming in R PS236b Spring 2010 Concepts

Parallel Programming in R PS236b Spring 2010 Concepts

Data Parallelization, contd

Parallel Programming in R PS236b Spring 2010 Concepts

Parallel Programming in R PS236b Spring 2010 Concepts

Data parallelization: an overview

Parallel Programming in R PS236b Spring 2010 Concepts

Technical requirements for parallelization

For the hardware geeks: Multiple cores or servers

Parallel Programming in R PS236b Spring 2010 Concepts

Technical requirements for parallelization

Parallel Programming in R PS236b Spring 2010 Concepts

Technical requirements for parallelization

Parallel Programming in R PS236b Spring 2010 Concepts

Technical requirements for parallelization

A means of sharing programs and data

Parallel Programming in R PS236b Spring 2010 Concepts

Technical requirements for parallelization

Parallel Programming in R PS236b Spring 2010 Concepts

How fast can we get?

Parallel Programming in R PS236b Spring 2010 Concepts

The Political Science Compute Cluster

Parallel Programming in R PS236b Spring 2010 Parallelization in R

Parallel Programming in R PS236b Spring 2010 Parallelization in R

Frameworks for parallelization

Parallel Programming in R PS236b Spring 2010 Parallelization in R snow

The snow model

Parallel Programming in R PS236b Spring 2010 Parallelization in R snow

snow and parallelization

Parallel Programming in R PS236b Spring 2010 Parallelization in R snow

Cluster creation in snow

See the clusterCreate function in the clusterSetup.R code.

Parallel Programming in R PS236b Spring 2010 Parallelization in R snow

Parallel Programming in R PS236b Spring 2010 Parallelization in R snow

Data vs. Task parallelization with snow

Parallel Programming in R PS236b Spring 2010 Parallelization in R snow

Object management in snow

Parallel Programming in R PS236b Spring 2010 Parallelization in R snow

Object management in snow

Parallel Programming in R PS236b Spring 2010 Parallelization in R foreach

foreach and parallelization

Parallel Programming in R PS236b Spring 2010 Parallelization in R foreach

Parallel Programming in R PS236b Spring 2010 Parallelization in R foreach

Parallel Programming in R PS236b Spring 2010 Parallelization in R multicore

multicore and parallelization

Parallel Programming in R PS236b Spring 2010 Parallelization in R Three benchmarks