Cluster Parallel Techniques
Outline
- Concepts
- Parallelization in R: snow, foreach, multicore
- Three benchmarks
- Considerations
- Good habits: testing, portability, and error handling
- The Political Science Cluster
- GPU Computing
Cheap computing now comes in three forms:
- Many cores on the same chip
- Many chips in the same computer
- Many computers joined with high-speed connections
We can refer to any of these parallel processing units as nodes.
Kinds of parallelization
Taking advantage of these forms of compute power requires code that can do one of several kinds of parallelization:
- Bit-based parallelization. We already have this: the move up the chain of 4-, 8-, 16-, 32-, and 64-bit machines changes the number of steps required to run a single instruction.
- Instruction-based parallelization, at the processor/program layer.
- Data-based parallelization: decompose large data structures into independent chunks, on which you perform the same operation.
- Task-based parallelization: perform different, independent tasks on the same data.
For R, we are mostly interested in data and task parallelization.
Data parallelization is very common:
- Bootstrapping: sample N times from data D and apply function F to each sample
- Genetic matching: generate N realizations of matches between groups T and C and calculate the balance on each; repeat for G generations
- Monte Carlo simulations
Google does a ton of this kind of work via its MapReduce framework.
Task parallelization
Task parallelization is a little less obvious. Ideas include:
- Given N possible estimators of a treatment effect, test all of them against data set D
- Machine learning: given N different classification schemes for some data set D, generate some test statistics S for all of them
So to implement data parallelization, we must:
- Conceptualize the problem as a set of operations against independent data sets
- Break up this set of operations into independent components
- Assign each component to a node for processing
- Collect the output of each component and return it to the master process
A sketch of these steps appears below.
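As a rough illustration (not from the original slides), here is a minimal sketch of those four steps using the snow package introduced later in this deck; the data set, the function, and the cluster size are all made up:

library(snow)

D <- rnorm(1000)              # the full data set
F <- function(d) mean(d)      # the operation to perform on each chunk

cl <- makeCluster(4, type = "SOCK")                    # a socket cluster with 4 workers
chunks <- split(D, rep(1:4, length.out = length(D)))   # break D into independent pieces
results <- parLapply(cl, chunks, F)                    # assign each chunk to a node
stopCluster(cl)

unlist(results)               # collect and combine the output on the master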
For the hardware geeks, parallelization requires:
- Multiple cores or servers
- Some means to connect them
- A way to communicate among them (sockets or MPI, the Message Passing Interface)
- A means of sharing programs and data
- A framework to organize the division of tasks and the collection of results
For some program containing a function F that will operate on some data set D, decompose D into chunks d_i, i = 1, ..., N, and perform F on each, splitting the tasks across M nodes.

For the program containing F, the maximum gain from parallelization is given by Amdahl's Law. For a program with parallelizable fraction P running on M nodes, the speedup is

    S = 1 / ((1 - P) + P / M)    (1)
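As a quick illustration, Amdahl's Law can be written as a one-line R function; the values of P and M below are just examples, not from the original slides:

## Speedup for a program with parallelizable fraction P run on M nodes
amdahl <- function(P, M) 1 / ((1 - P) + P / M)

amdahl(P = 0.95, M = 8)     # ~5.9x: even 95%-parallel code gains less than 8x
amdahl(P = 0.95, M = 1000)  # ~19.6x: the serial 5% caps the speedup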
Here's what we have:
- 9 two-chip Opteron 248 servers
- Gigabit ethernet interconnects
- OpenMPI message passing
- A Network File System
Parallelization in R
R has several frameworks to manage data parallelization. Three mature and effective ones are:
- snow, which uses the apply model for task division
- foreach, which uses a for-loop model for division
- multicore, which is only suitable for the many-cores hardware model
There are several other possibilities (nws, mapreduce, pvm) at different levels of obsolescence or instability.
snow is a master/worker model: from N nodes, create 1 master and N - 1 workers, then farm jobs out to the workers. (Older parlance uses the master/slave terminology, which has been abandoned for obvious reasons.) This is a little weird when using MPI-based systems, where the nodes are undifferentiated; keep this in mind when using MPI for R jobs.
The snow library makes parallelization straightforward:
- Create a cluster (usually with either sockets or MPI), as in the sketch below
- Use parallel versions of the apply functions to run stuff across the nodes of the cluster
So this is pretty easy: we already know how to use apply.
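For reference, a minimal sketch of cluster creation with snow; the node counts here are placeholders, and the MPI variant assumes Rmpi is installed:

library(snow)

## A socket-based cluster on the local machine:
cl <- makeCluster(4, type = "SOCK")

## ...or an MPI-based cluster:
## cl <- makeCluster(8, type = "MPI")

## parLapply/parSapply/parApply calls go here

stopCluster(cl)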
snow example
## assuming you've already created a cluster cl:
m <- matrix(rnorm(16), 4, 4)
clusterExport(cl, "m")
parSapply(cl, 1:ncol(m), function(x) {
  mean(m[, x])
})

Notice there that parSapply has replaced sapply, but nothing much else has changed.
REvolution Computing released the foreach libraries. To use them, you install:
- foreach
- doSNOW, for using snow clusters
- doMPI, for working directly with MPI
- doMC, for use on multicore machines
The basic idea: it looks like a for loop, performs like an apply, and is portable.
foreach example
## Load the libraries. I assume I'm on an MPI-based
## cluster; other options are doSNOW and doMC
library(foreach); library(doMPI)

## Get the cluster configuration.
cl <- startMPIcluster()

## Tell doMPI that the cluster exists
registerDoMPI(cl)

m <- matrix(rnorm(16), 4, 4)

## Run the for loop to calculate the column means
foreach(i = 1:ncol(m)) %dopar% {
  mean(m[, i])  # makes m available on nodes
}
foreach continued
Notice the important bit:

## Run the for loop to calculate the mean
## of each column in m.
foreach(i = 1:ncol(m)) %dopar% {
  mean(m[, i])  # makes m available on nodes
}

Here, the foreach term controls the repetition, but the %dopar% term does the actual work. If this weren't running on a cluster, you would use %do% instead.
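For comparison, a minimal sketch of the same loop run serially with %do%, which needs no cluster or registered backend; the .combine = c argument, added here for illustration, just collapses the result list into a vector:

library(foreach)

m <- matrix(rnorm(16), 4, 4)

## %do% evaluates the loop sequentially in the current R session
foreach(i = 1:ncol(m), .combine = c) %do% {
  mean(m[, i])
}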
New laptops today will almost certainly come with dual-core chips.
[Figure: compute times for the Bootstrap (1 core), Bootstrap (2 cores), Matching (1 core), and Matching (2 cores) benchmarks]
Matrix multiplication
Results for:
library(snow)

testmat <- matrix(rnorm(10000000), ncol = 10000)

## Serial matrix multiplication
mm.serial <- system.time(testmat %*% t(testmat))

testmat.t <- t(testmat)

## Parallel version; setupCode.R presumably defines the helper
## functions clusterCreate() and clusterShutdown()
source("setupCode.R")
clusterCreate()
clusterExport(cl, c("testmat", "testmat.t"))
mm.parallel <- system.time(parMM(cl, testmat, testmat.t))

save.image("mm.results.RData")
clusterShutdown()
So why choose one? Let's look at speed and features. The idea: write the same bootstrap as a serial job, a snow job, and a foreach job, and see what we get. All the code is available at http://pscluster.berkeley.edu
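This is not the actual benchmark code (that lives at the URL above), just a minimal sketch of the shape of the job being timed, in its serial and snow variants; the benchmark repeats this kind of run 500 times and records the timings:

library(snow)

x <- rnorm(1000)
one.boot <- function(i, x) mean(sample(x, length(x), replace = TRUE))

## Serial version: 1000 bootstrap trials
serial.time <- system.time(sapply(1:1000, one.boot, x = x))

## snow version of the same 1000 trials on an 8-node cluster
cl <- makeCluster(8, type = "SOCK")
parallel.time <- system.time(parSapply(cl, 1:1000, one.boot, x = x))
stopCluster(cl)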
Results: timing
Table: Benchmark results for 500 repetitions of a 1000-trial bootstrap for different coding methods; parallel methods use 8 nodes

Method              Mean time (s)   Pct of serial time   2.5 pct CI   97.5 pct CI
Serial                      176.4                100.0        171.8         181.3
parSapply (snow)             33.4                 18.9         32.9          33.5
foreach, doSNOW              27.3                 15.5         26.8          27.8
foreach, doMPI               27.2                 15.4         26.7          27.8
Results: distributions
[Figure: variation in compute times for the parallel bootstrap methods (parSapply; foreach with doSNOW; foreach with doMPI), 500 repetitions of 1000 trials]

[Figure: variation in compute times for the serial bootstrap, 500 repetitions of 1000 trials]
Given the identical performance, why choose one vs. the other? doMPI is not compatible with an environment that has a snow cluster running in it. Thus use doMPI when running things without snow, and doSNOW when combining code that requires snow with foreach-based routines.
This can be referred to as a map/reduce problem. R has a mapReduce package that claims to do this, but it's basically just a parLapply with more overhead.
Some complications
Comparative object sizes for the output of a regression with N = 1000 and P = 10:
- Output from lm: 321k
- Output from lm$coefficients: 0.6k
That is a factor of 535 difference! So if you only need the coefficients, you use much less memory, and lm is a simple object compared to, say, MatchBalance.
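A rough sketch of that comparison; exact sizes will vary by R version and data:

n <- 1000; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

fit <- lm(y ~ X)
object.size(fit)               # the full lm object: hundreds of kilobytes
object.size(fit$coefficients)  # just the coefficients: under a kilobyte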
Good habits: testing, portability, and error handling
Portability
To achieve portability of code across platforms, use if statements to set the appropriate environment variables. Example: the working directory.

if (oncluster == TRUE) {
  setwd("/projectname/")
} else {
  setwd("/research/myprojects/projectname/")
}
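The oncluster flag itself has to come from somewhere; one hypothetical way to set it is from the machine's hostname (the "pscluster" string below is an assumption, not necessarily the compute nodes' actual hostname):

## TRUE when the session is running on a machine whose hostname
## contains "pscluster", FALSE otherwise
oncluster <- grepl("pscluster", Sys.info()["nodename"])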
Parallel jobs are often long jobs, posing some issues:
- How to catch errors while writing code?
- How to test code functions?
- How to verify output before running?
- How to catch errors when running?
You want to check your code before starting, rather than have the process fail while running.
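One way to handle the last point, catching errors at run time, is to wrap each unit of work in tryCatch() so a single failing iteration returns NA instead of killing the whole job. A minimal sketch, with made-up data and one deliberately failing iteration:

library(snow)

cl <- makeCluster(2, type = "SOCK")

results <- parSapply(cl, 1:10, function(i) {
  tryCatch({
    if (i == 5) stop("simulated failure")  # pretend iteration 5 goes wrong
    mean(rnorm(100))
  }, error = function(e) NA)               # record NA and keep going
})

stopCluster(cl)
results   # NA in position 5, estimates everywhere else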
codetools output
Here's what we get:

> checkUsage(testFunction)
<anonymous>: no visible global function definition for 'sapply'
<anonymous>: <anonymous>: no visible global function definition for 'save.image'
<anonymous>: <anonymous>: no visible binding for global variable 'nSims'

So checkUsage() will help catch unidentified variables, bad functions, and other typos before you actually run your jobs.
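checkUsage() comes from the codetools package. The original testFunction is not shown in these notes, but a hypothetical function of the same flavor illustrates the idea:

library(codetools)

## nSims is never defined anywhere, so checkUsage() flags it
## before the job is ever run
testFunction <- function() {
  sapply(1:nSims, function(i) mean(rnorm(100)))
}

checkUsage(testFunction)
## reports: <anonymous>: no visible binding for global variable 'nSims'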
Technology
The basic configuration:
- A master server with a dual-core 2.6 GHz Opteron chip and 8 GB of RAM
- 9 two-chip 64-bit worker servers with 4-8 GB of RAM each, for a total of 18 nodes
- CentOS Linux
- The Perceus cluster administration software
- Gigabit ethernet interconnects
- OpenMPI message passing
- SLURM job management and resource allocation
- 1 TB of RAID-1 storage for users (actually about 800 GB)
Accounts
Accounts are available on these terms:
- Polisci faculty and grad students: 2 years, renewable
- Non-Polisci students in 200-level courses: 1 semester
- For other purposes: on request
All accounts come with 5 GB of storage on the cluster itself. To get an account, send email to [email protected]
Logistical details
There are some user services available:
1. The cluster administrators are available at [email protected]
2. Cluster users should sign up for the listserv, at [email protected]
3. Benchmarks, code, and documentation are available at the cluster webpage: http://pscluster.berkeley.edu
Finally, there is a comprehensive README file that all users should review. It can be found on the cluster webpage.
Resources
As of right now, we have the following resources available:
- 64-bit R, compiled against GotoBLAS
- The C, FORTRAN, and MPICC compilers
- Emacs + ESS (in both X and terminal flavors)
- git for version control
If you want something else on the servers (Matlab, Mathematica, Stata) and can get the right licenses, we'd be happy to look at setting it up.
Access
Access is available both on-campus and off:
- On campus: via ssh to pscluster.polisci.berkeley.edu
- Off campus: through the VPN, via ssh to the same address
The VPN software is available for free via the Software Central service (software.berkeley.edu). The README file at http://pscluster.berkeley.edu has more information on access and software configuration.
ssh clients
ssh and scp require a client program. Which program depends on your OS:
- OS X, Linux: use the Terminal application
- Windows: PuTTY is free; HostExplorer is a commercial alternative, available free at http://software.berkeley.edu
Note that any of these will give command-line access. There is no GUI.
There are two ways to run jobs on the cluster:
1. Serial jobs: no special programming needed, but you can only make use of a single node. Nice for long-running, single-threaded jobs.
2. Parallel jobs: some special programming required, but they can take advantage of more than one node for speedup.
A generic R session
Sessions will generally follow this pattern:
- Copy your code and data to your home directory on the cluster (via scp)
- Log into the cluster (ssh)
- Execute your code by requesting a certain number of nodes from SLURM and initiating the batch job
- Pull the output and the R transcript file back to your own computer (again via scp)
salloc -n 1 orterun -n 1 Rscript <yourcode.R>

Here:
- salloc asks SLURM for nodes
- orterun invokes MPI to choose a node
- Rscript runs your code file (where you have the code to pick up the cluster you just created)
salloc -n <number of nodes> orterun -n <number of nodes> Rscript <yourcode.R>

Here:
- salloc asks SLURM for nodes
- orterun creates a cluster with those nodes
- Rscript runs your code file (where you have the code to pick up the cluster you just created)
SLURM provides command-line tools to monitor and manage your jobs and check the status of the cluster:
- squeue, which prints a list of your jobs and their status and runtimes
- sinfo, which prints the status of each node in the cluster
- scancel, which allows you to cancel a job
You also have access to the normal Unix commands top and ps to look at your system processes.
You might be tempted to always ask for lots of nodes. There are three reasons this is a bad idea:
1. Bad form: this is a commons, don't abuse it
2. Wait time: SLURM will force your job to wait until nodes are available; the job might start (and finish) sooner if you asked for fewer nodes
3. Processing time: more nodes are not always faster
High-end computer graphics cards for gaming now have 128-256 cores and can be bought for $200-400: a supercomputer on your desktop. R has enabled use of these GPUs through a package called gputools.
You may already have this at your disposal:
- All new MacBooks come with the nVidia 9400m GPU
- Many PC notebooks have a similar chip
Information on how to install gputools can be found at the author's website:
http://brainarray.mbni.med.umich.edu/brainarray/rgpgpu/
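A minimal sketch of what using gputools looks like, assuming the package is installed and a supported nVidia GPU is present; gpuMatMult() is gputools' GPU-based matrix multiply:

library(gputools)

A <- matrix(rnorm(1000 * 1000), 1000, 1000)

system.time(A %*% A)           # CPU matrix multiply
system.time(gpuMatMult(A, A))  # the same operation on the GPU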
gputools results