
Advances in Water Resources 34 (2011) 1124–1139


Parallel distributed computing using Python


Lisandro D. Dalcin ⇑, Rodrigo R. Paz, Pablo A. Kler, Alejandro Cosimo
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC), Instituto de Desarrollo Tecnológico para la Industria Química (INTEC),
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Universidad Nacional del Litoral (UNL), S3000GLN Santa Fe, Argentina

Article history: Available online 22 April 2011
Keywords: Python, MPI, PETSc

Abstract

This work presents two software components aimed to relieve the costs of accessing high-performance parallel computing resources within a Python programming environment: MPI for Python and PETSc for Python.

MPI for Python is a general-purpose Python package that provides bindings for the Message Passing Interface (MPI) standard using any back-end MPI implementation. Its facilities allow parallel Python programs to easily exploit multiple processors using the message passing paradigm.

PETSc for Python provides access to the Portable, Extensible Toolkit for Scientific Computation (PETSc) libraries. Its facilities allow sequential and parallel Python applications to exploit state of the art algorithms and data structures readily available in PETSc for the solution of large-scale problems in science and engineering.

MPI for Python and PETSc for Python are fully integrated to PETSc-FEM, an MPI and PETSc based parallel, multiphysics, finite elements code developed at the CIMEC laboratory. This software infrastructure supports research activities related to simulation of fluid flows with applications ranging from the design of microfluidic devices for biochemical analysis to modeling of large-scale stream/aquifer interactions.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

The popularity of high-level, general purpose scientific computing environments – such as MATLAB and IDL on the commercial side or Octave and Scilab on the open source side – has increased considerably during the last decade. Users simply feel much more productive in such interactive environments providing tight integration of simulation and visualization. They are alleviated of low-level details associated with the compilation and linking steps, memory management and input/output of more traditional scientific programming languages like Fortran, C, and C++. The Python programming language is a distinguished member among these modern computing environments. Since it was born in the early 1990s, Python has steadily grown in popularity among the scientific community to arguably become the de facto standard for computation-driven scientific research.

This paper reports on two open source tools that facilitate the access to high-performance parallel computing resources within a Python programming environment: MPI for Python [1] (known in short as mpi4py) and PETSc for Python [2] (known in short as petsc4py). They target hardware platforms ranging from desktop computers with multiple-processor and/or multiple-core architectures to clusters of workstations or dedicated computing nodes (with standard or special network interconnects), or even high-performance shared memory machines.

The rest of this section presents a brief description of the Python programming language and some of the more important tools available for scientific computing, as well as some comments about MPI and PETSc. Sections 2 and 3 provide details about the design and capabilities of MPI for Python and PETSc for Python. Section 4 presents some performance tests aimed at measuring the overhead introduced by the Python layer in comparison to pure C code. Finally, Section 5 presents two different application examples.

1.1. Python

Python [3,4] is a modern, powerful programming language. It has efficient high-level data structures, a simple but effective approach to object-oriented programming, and it is easy to learn and highly extensible. It supports modules and packages, which encourages program modularity and code reuse.

Python's elegant syntax, together with its dynamic nature, makes it an excellent language for scripting and rapid application development. Sophisticated but easy to use and well integrated solutions are available for interactive command-line work, efficient multi-dimensional array processing, linear algebra, 2D and 3D visualization, and other scientific computing tasks.

⇑ Corresponding author. Tel./fax: +54 342 4511594.
E-mail addresses: dalcinl@gmail.com (L.D. Dalcin), rodrigo.r.paz@gmail.com (R.R. Paz), pabloakler@gmail.com (P.A. Kler), alecosimo@gmail.com (A. Cosimo).

doi:10.1016/j.advwatres.2011.04.013

Python is easily extended with new functions and data structures implemented in other languages. This feature allows skilled users to build their own computing environment, tailored to their specific needs and based on their favorite high-performance Fortran, C, or C++ codes. Such capabilities prove to be an advantage for modern scientific computing: users have a high-level and productive environment at hand, yet they can reuse existing library code and optimize performance critical bottlenecks.

The Python programming language, augmented with a set of open source packages that have been developed over the last decade by scientists and engineers, provides a "computational ecosystem" that is quite capable of supporting a wide range of applications – from casual scripting and lightweight tools to full-fledged systems. For a thorough discussion about the role of Python in scientific computing and additional information about selected Python packages, see [5–9].

1.1.1. NumPy

The NumPy project [10] started in the mid-90s as a collaborative effort of an international team of volunteers aimed at developing a data structure for efficient array computation in Python. Since then, the NumPy package has found widespread adoption in academia and industry. Today, NumPy is one of the core packages for numerical computation in Python.

NumPy provides a powerful multi-dimensional array object with advanced and efficient general-purpose array operations. Additionally, NumPy contains three sub-libraries with numerical routines providing basic linear algebra operations, basic Fourier transforms and sophisticated capabilities for random number generation. It also provides facilities to support interoperability with C, C++, and Fortran.

Besides its obvious scientific applications, NumPy can also be used as an efficient multi-dimensional container of generic data. New structured data types with fixed storage layout can be defined by combining fundamental data types like integers and floats. This allows NumPy to seamlessly and speedily integrate with a wide variety of database formats.

1.1.2. F2PY

Although NumPy provides similar and higher-level capabilities, there are situations where selected, numerically intensive parts of Python applications still require the efficiency of compiled code for processing huge amounts of data in deeply-nested loops. Fortran (especially Fortran 90 and above) is a language for efficiently implementing lengthy computations involving multi-dimensional arrays. State of the art implementations of many commonly used algorithms are readily available and implemented in Fortran.

F2PY [11] is a development tool that provides a connection between the Python and Fortran programming languages. It works by creating Python extension modules from special signature files or directly from annotated Fortran source files. These files, with additional annotations included as comments, contain all the information (function names, arguments and their types, etc.) that is needed to construct convenient Python bindings to Fortran functions. F2PY-generated Python extension modules enable Python codes to call those Fortran 77/90/95 routines. In addition, F2PY provides the required support for transparently accessing Fortran 77 common blocks or Fortran 90/95 module data.

In a Python programming environment, F2PY is then the tool of choice for taking advantage of the speed-up of compiled Fortran code and integrating existing Fortran libraries.

1.1.3. Cython

Cython [12] is a recent development that provides access to low-level C data types and functionalities in a Python programming environment.

The Cython language is similar to Python, supporting most Python language constructs and libraries while adding syntax for declaring types, calling C functions, and manipulating C values. Cython code is compiled via C and the result runs within the Python runtime environment. When static type declarations are used in Cython source, it typically executes many times faster than Python and sometimes approaches the speed of C.

Using Cython, code which manipulates Python values and C values can be freely intermixed, with conversions occurring automatically wherever possible. Error checking of Python operations is also automatic, and the full power of Python exception handling facilities is available even in the midst of manipulating C data.

1.1.4. SWIG

SWIG [13], the Simplified Wrapper and Interface Generator, is an interface compiler that connects programs written in C and C++ with a variety of scripting languages.

Originally developed in 1995, SWIG was first used by scientists (in the Theoretical Physics Division at Los Alamos National Laboratory, USA) for building user interfaces to molecular dynamics simulation codes running on the Connection Machine 5 supercomputer. In this environment, scientists needed to work with huge amounts of simulation data, complex hardware, and a constantly changing code base. The use of a Python scripting language interface provided a simple yet highly flexible foundation for solving these types of problems [14,15].

SWIG works by parsing the declarations found in C/C++ header files and using them to generate the wrapper code needed by scripting languages, in particular Python, to access the underlying C/C++ code. In addition, SWIG provides many customization features that let developers tailor the wrapping process to suit specific application needs.

Although SWIG was originally developed for scientific applications, it has since evolved into a general purpose tool that is used in a wide variety of applications – almost anywhere C/C++ and scripting programming are involved.

1.2. MPI

MPI, the Message Passing Interface [16,17], is a standardized, portable message-passing system designed to work on a wide variety of parallel computers. The standard defines a set of library routines (MPI is not a programming language extension) and allows users to write portable programs in the main scientific programming languages (Fortran, C, and C++).

The paradigm of message-passing is especially suited for (but not limited to) distributed memory architectures and is used in today's most demanding scientific and engineering applications related to modeling, simulation, design, and signal processing.

MPI defines a high-level abstraction for fast and portable inter-process communication [18,19]. Applications can run in clusters of (possibly heterogeneous) workstations or dedicated compute nodes, (symmetric) multiprocessor machines, or even a mixture of both. MPI hides all the low-level details, like networking or shared memory management, simplifying development and maintaining portability, without sacrificing performance.

Implementations are available from vendors of high-performance computers to well known open source projects like MPICH [20,21] and Open MPI [22,23].

1.3. PETSc

PETSc [24,25], the Portable Extensible Toolkit for Scientific Computation, is a suite of algorithms and data structures for the solution of problems arising in scientific and engineering applications, especially those modeled by partial differential equations, of

large-scale nature, and targeted for high performance parallel computing environments [26].

PETSc is written in C (thus making it usable from C++); a Fortran interface (very similar to the C one) is also available. PETSc's complete functionality is only exercised by parallel applications, but serial applications are fully supported.

PETSc employs the MPI standard for inter-process communication, thus it is based on the message-passing model for parallel computing. Despite that, PETSc provides high-level interfaces with collective semantics so that typical users rarely have to make message-passing calls themselves.

PETSc is designed with an object-oriented style. Almost all user-visible types are abstract interfaces with implementations that may be chosen at runtime. Those objects are managed through handles to opaque data structures which are created, accessed and destroyed by calling appropriate library routines.

PETSc consists of a variety of components. Each component manipulates a particular family of objects and the operations one would like to perform on these objects. Some of the PETSc modules deal with:

• Index sets, including permutations, indexing into vectors, renumbering, etc.
• Vectors.
• Matrices (generally sparse).
• Distributed arrays for parallelizing regular grid-based problems.
• Krylov subspace methods.
• Preconditioners, including multigrid and sparse direct solvers.
• Nonlinear solvers.
• Timesteppers for solving time-dependent, nonlinear partial differential equations.

PETSc provides a rich environment for modeling scientific applications as well as for rapid algorithm design and prototyping. The libraries enable easy customization and extension of both algorithms and implementations. This approach promotes code reuse and flexibility.

Finally, PETSc is designed to be highly modular, enabling interoperability with several specialized parallel libraries like Hypre [27], Trilinos/ML [28], MUMPS [29], and others through a unified interface.

2. MPI for Python

This section is devoted to describing MPI for Python, an open source software project that provides bindings of the MPI standard for the Python programming language.

MPI for Python is a general-purpose and full-featured package targeting the development of parallel applications in Python. It provides core facilities that allow parallel Python programs to exploit multiple processors. Sequential Python applications can also take advantage of MPI for Python by communicating through the MPI layer with external, independent parallel modules, possibly written in other languages like C, C++, or Fortran. MPI for Python employs a back-end MPI implementation, thus being usable on any parallel environment supporting MPI.

MPI for Python is implemented with Cython. The rationale for this choice is twofold. A high-level, object oriented interface with Python look and feel can be easily developed and maintained in terms of lower-level MPI types and calls handled in C. Additionally, MPI for Python can expose its internals to other C codes in such a way that MPI handles can be recovered from Python objects. Third-party tools aimed to bridge C and Python (e.g., SWIG) can take advantage of this and couple MPI for Python with other Python wrappers to MPI-based libraries.

MPI for Python implements the entire specification of the MPI-2 standard, revision 2.2 [30].¹ Naming conventions of the MPI-2 C++ bindings are adopted,² so users familiar with the C++ bindings can use MPI for Python without learning a new interface.

¹ At the time of this writing, MPI for Python does not support MPI_Alltoallw (see the MPI standard document [30], Section 5.8, p. 160). Support for this functionality is under development.
² The C++ bindings for MPI were deprecated in revision 2.2 of the MPI standard; they are scheduled for removal in the upcoming MPI-3 standard.

2.1. Communicating Python objects and array data

The Python standard library supports different mechanisms for data persistence. Many of them rely on disk storage, but pickling can also work with raw memory buffers. The pickle module provides user-extensible facilities to serialize general Python objects using ASCII or binary formats. MPI for Python can communicate any built-in or user-defined Python object implementing the pickle protocol. These facilities are transparently used to build binary representations of objects to communicate (at sending processes) and to restore them back (at receiving processes).

Although simple and general, the serialization approach (i.e., pickling and unpickling) imposes important overheads in memory as well as processor usage, especially when objects with large memory footprints are being communicated. Pickling general Python objects, ranging from primitive or container built-in types to user-defined classes, necessarily requires some processing for dispatching the appropriate serialization method (that depends on the type of the object) and processor usage to perform the actual packing. Additional memory is always needed, and if its total amount is not known a priori, many memory reallocations can occur. Indeed, in the case of large numeric arrays, this is certainly unacceptable and precludes communication of objects occupying half or more of the available memory resources.

MPI for Python also supports direct communication of any object implementing the Python buffer interface. This interface is a standard Python mechanism provided by some types (e.g., byte strings and numeric arrays), allowing access from the C side to a contiguous memory buffer (i.e., address and length) containing the relevant data. This feature, in conjunction with the capability of constructing user-defined MPI datatypes describing complicated memory layouts, enables the implementation of many algorithms involving multidimensional numeric arrays (e.g., image processing, fast Fourier transforms, finite difference schemes on structured Cartesian grids) directly in Python, with negligible space or time overhead compared to compiled C, C++ or Fortran codes.
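As an illustration of the pickle-based communication path, the following minimal sketch (an assumption of this text, not taken from the original paper, and requiring at least two MPI processes) sends a user-defined Python object from one process to another using mpi4py's lowercase send()/recv() variants for pickled objects; the uppercase Send()/Recv() counterparts described in Section 2.2.2 operate on memory buffers instead.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # a generic, pickle-able Python object (nested containers)
    data = {'step': 7, 'residuals': [1.0e-1, 3.2e-3, 5.4e-6]}

    if rank == 0:
        # the object is pickled behind the scenes before transmission
        comm.send(data, dest=1, tag=11)
    elif rank == 1:
        # received bytes are unpickled back into an equivalent object
        data = comm.recv(source=0, tag=11)
        print('process 1 received:', data)

Such a script would typically be launched with, e.g., mpiexec -n 2 python script.py.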
2.2. Using MPI for Python

Here, a general overview of MPI concepts and functionalities readily available in MPI for Python is presented. Discussed features range from classical MPI-1 message-passing communication operations to more advanced MPI-2 operations like dynamic process management, one-sided communication, and parallel input/output.

2.2.1. Communicators

In MPI for Python, Comm is the base class of communicators. Communicator size and calling process rank can be respectively obtained with the methods Get_size( ) and Get_rank( ).

The Intracomm and Intercomm classes are derived from the Comm class. The Is_inter( ) method (and Is_intra( ), provided for convenience although it is not part of the MPI specification) is defined for communicator objects and can be used to determine the particular communicator class.

Two predefined intracommunicator instances are available: COMM_WORLD and COMM_SELF. From them, new communicators can be created as needed. New communicator instances can be obtained with the Clone( ) method of Comm objects, the Dup( ) and Split( ) methods of Intracomm and Intercomm objects, and the Create_intercomm( ) and Merge( ) methods of Intracomm and Intercomm objects, respectively.

The associated process group can be retrieved from a communicator by calling the Get_group( ) method, which returns an instance of the Group class. Set operations with Group objects like Union( ), Intersect( ) and Difference( ) are fully supported, as well as the creation of new communicators from these groups.
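A short sketch of these communicator facilities follows; it is an illustrative assumption (not part of the original paper) and uses only the methods named above.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    size = comm.Get_size()   # number of processes in the communicator
    rank = comm.Get_rank()   # rank of the calling process

    # split COMM_WORLD into two disjoint sub-communicators
    # (even-ranked and odd-ranked processes)
    color = rank % 2
    subcomm = comm.Split(color, key=rank)
    print('world rank', rank, 'has rank', subcomm.Get_rank(),
          'in a sub-communicator of size', subcomm.Get_size())

    subcomm.Free()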
2.2.2. Blocking point-to-point communications

The Send( ), Recv( ) and Sendrecv( ) methods of communicator objects provide support for blocking point-to-point communications within Intracomm and Intercomm instances. These methods can communicate either general Python objects or raw memory buffers. The buffered, ready, and synchronous (Bsend( ), Rsend( ) and Ssend( )) modes are also supported.
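The following sketch (assuming at least two processes; the sizes and tag values are arbitrary) exchanges a NumPy array with the uppercase, buffer-oriented Send( )/Recv( ) methods; the three-item list [data, count, MPI.DOUBLE] describes the memory buffer explicitly, in the same style as the timing code shown in Section 4.1.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    n = 1000
    if rank == 0:
        data = np.arange(n, dtype='d')
        # send the raw memory buffer, no pickling involved
        comm.Send([data, n, MPI.DOUBLE], dest=1, tag=77)
    elif rank == 1:
        data = np.empty(n, dtype='d')
        comm.Recv([data, n, MPI.DOUBLE], source=0, tag=77)
        assert data[-1] == n - 1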
2.2.3. Nonblocking point-to-point communications

On many systems, performance can be significantly increased by overlapping communication and computation. This is particularly true on systems where communication can be executed autonomously by an intelligent, dedicated communication controller. Nonblocking communication is a mechanism provided by MPI in order to support such overlap.

The Isend( ) and Irecv( ) methods of the Comm class initiate a send and a receive operation, respectively. These methods return a Request instance, uniquely identifying the started operation. Its completion can be managed using the Test( ), Wait( ), and Cancel( ) methods of the Request class. The management of Request objects and the associated memory buffers involved in communication requires careful, rather low-level coordination. Users must ensure that objects exposing their memory buffers are not accessed at the Python level while they are involved in nonblocking message-passing operations.

Often a communication with the same argument list is repeatedly executed within an inner loop. In such cases, communication can be further optimized by using persistent communication, a particular case of nonblocking communication allowing the reduction of the overhead between processes and communication controllers. Furthermore, this kind of optimization can also alleviate the extra call overheads associated with dynamic languages like Python. The Send_init( ) and Recv_init( ) methods of the Comm class create a persistent request for a send and a receive operation, respectively. These methods return an instance of the Prequest class, a subclass of the Request class. The actual communication is started with the Start( ) method.
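A minimal nonblocking sketch follows (an illustrative assumption, written for exactly two processes): both operations are posted, some local work is overlapped with the transfers, and completion is enforced before the buffers are reused.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    sendbuf = np.full(100, rank, dtype='d')
    recvbuf = np.empty(100, dtype='d')

    peer = 1 - rank  # assumes exactly two processes
    rreq = comm.Irecv([recvbuf, MPI.DOUBLE], source=peer, tag=0)
    sreq = comm.Isend([sendbuf, MPI.DOUBLE], dest=peer, tag=0)

    local_work = sendbuf.sum()  # computation overlapped with communication

    # the buffers must not be touched until completion
    MPI.Request.Waitall([sreq, rreq])
    assert recvbuf[0] == peer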
2.2.4. Collective communications

The Bcast( ), Scatter( ), Gather( ), Allgather( ) and Alltoall( ) methods of Intracomm instances provide support for collective communications. Those methods can communicate either general Python objects or raw memory buffers. The "vector" variants (which can communicate varying amounts of data at each process) Scatterv( ), Gatherv( ), Allgatherv( ) and Alltoallv( ) are also supported; they can only communicate objects exposing raw memory buffers. All these collective operations are supported on both intracommunicators and intercommunicators.

Global reduction operations are accessible through the Reduce( ), Allreduce( ), Scan( ) and Exscan( ) methods. All the predefined (i.e., SUM, PROD, MAX, etc.) and user-defined reduction operations can be applied to general Python objects (however, the actual required computations are performed sequentially at some process). User-defined reduction operations on memory buffers are also supported. Reductions are supported on both intracommunicators and intercommunicators (except the inclusive and exclusive scan operations, which MPI-2 defines only for intracommunicators).
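The sketch below (an illustrative assumption using only Bcast( ), Reduce( ) and the predefined SUM operation) broadcasts a small parameter array from the root and reduces per-process partial results to a global sum.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # broadcast a buffer from the root process to all others
    params = np.zeros(3, dtype='d')
    if rank == 0:
        params[:] = (0.1, 0.2, 0.3)
    comm.Bcast([params, MPI.DOUBLE], root=0)

    # reduce a local partial result to a global sum at the root
    local = np.array([float(rank)], dtype='d')
    total = np.empty(1, dtype='d')
    comm.Reduce([local, MPI.DOUBLE], [total, MPI.DOUBLE],
                op=MPI.SUM, root=0)
    if rank == 0:
        assert total[0] == size * (size - 1) / 2.0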
2.2.5. Dynamic process management

In MPI for Python, new independent process groups can be created by calling the Spawn( ) method within an intracommunicator (i.e., an Intracomm instance). This call returns a new intercommunicator (i.e., an Intercomm instance) at the parent process group. The child process group can retrieve the matching intercommunicator by calling the Get_parent( ) method defined in the Comm class. At each side, the new intercommunicator can be used to perform point-to-point and collective communications between the parent and child groups of processes.

Alternatively, disjoint groups of processes can establish communication using a client/server approach. Any server application must first call the Open_port( ) function to open a "port" and the Publish_name( ) function to publish a provided "service", and next call the Accept( ) method within an Intracomm instance. Any client application can first find a published "service" by calling the Lookup_name( ) function, which returns the "port" where a server can be contacted, and next call the Connect( ) method within an Intracomm instance. Both the Accept( ) and Connect( ) methods return an Intercomm instance. When the connection between client/server processes is no longer needed, all of them must cooperatively call the Disconnect( ) method of the Comm class. Additionally, server applications should release resources by calling the Unpublish_name( ) and Close_port( ) functions.
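A minimal spawning sketch follows; the worker script name (worker.py) and the gathered quantities are assumptions made for illustration only. The parent collects one number from each spawned worker over the intercommunicator returned by Spawn( ).

    # parent.py -- spawn 4 worker processes and collect one value from each
    import sys
    from mpi4py import MPI
    import numpy as np

    workers = MPI.COMM_SELF.Spawn(sys.executable, args=['worker.py'],
                                  maxprocs=4)
    result = np.empty(4, dtype='d')
    workers.Gather(None, [result, MPI.DOUBLE], root=MPI.ROOT)
    workers.Disconnect()

    # worker.py -- executed by the spawned processes
    from mpi4py import MPI
    import numpy as np

    parent = MPI.Comm.Get_parent()
    value = np.array([float(parent.Get_rank())], dtype='d')
    parent.Gather([value, MPI.DOUBLE], None, root=0)
    parent.Disconnect()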
2.2.6. One-sided operations

In MPI for Python, one-sided operations are available by using instances of the Win class. New window objects are created by calling the Create( ) method at every process in a communicator and specifying a memory buffer (i.e., a base address and length). When a window instance is no longer needed, the Free( ) method should be called.

The three one-sided MPI operations for remote write, read and reduction are available through calling the methods Put( ), Get( ), and Accumulate( ), respectively, within a Win instance. These methods need an integer rank identifying the target process and an integer offset relative to the base address of the remote memory block being accessed.
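The sketch below is an illustrative assumption (two processes); it exposes a small NumPy array through a window and remotely writes into the window of process 1. Synchronization is done here with Fence( ), a window synchronization call not discussed in the text above.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # each process exposes a small array through a window object
    memory = np.zeros(4, dtype='d')
    win = MPI.Win.Create(memory, comm=comm)

    win.Fence()
    if rank == 0:
        # remotely write into the window of process 1 (target rank 1)
        update = np.full(4, 3.14, dtype='d')
        win.Put([update, MPI.DOUBLE], 1)
    win.Fence()

    if rank == 1:
        assert memory[0] == 3.14
    win.Free()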
2.2.7. Parallel input/output operations

In MPI for Python, all MPI input/output operations are performed through instances of the File class. File handles are obtained by calling the Open( ) method at every process in a communicator and providing a file name and the intended access mode. After use, they must be closed by calling the Close( ) method. Files can even be deleted by calling the Delete( ) method.

After creation, files are typically associated with a per-process view. The view defines the current set of data visible and accessible from an open file as an ordered set of elementary datatypes. This data layout can be set and queried with the Set_view( ) and Get_view( ) methods, respectively.

Actual input/output operations are achieved by many methods combining read and write calls with different behavior regarding positioning, coordination, and synchronism. Summing up, MPI for Python supports around 30 different methods defined in MPI-2 for reading from or writing to files using explicit offsets or file pointers (individual or shared), in blocking or nonblocking and collective or noncollective versions.
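As a small illustration (the file name and block layout are assumptions), every process writes its own block of a shared binary file using an explicit byte offset.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # every process writes its own block of a shared binary file
    data = np.full(8, float(rank), dtype='d')
    amode = MPI.MODE_WRONLY | MPI.MODE_CREATE
    fh = MPI.File.Open(comm, 'output.dat', amode)

    offset = rank * data.nbytes        # explicit offset, in bytes
    fh.Write_at(offset, [data, MPI.DOUBLE])
    fh.Close()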

2.3. Related projects

pyMPI [31] is a pioneering project bringing general-purpose MPI support to Python. It is implemented in C with a modified Python interpreter (as opposed to employing the stock Python interpreter readily available on the computing platform) and a companion MPI module exposing core MPI functionalities and other facilities that are beyond the MPI standard. It permits basic interactive parallel runs, which are useful for learning and debugging, and provides an interface suitable for basic parallel programming. General Python objects supporting the pickle protocol as well as NumPy arrays can be communicated. However, there is only partial support for MPI-1 features like nonblocking communication, communication domains, and process topologies. Support for user-defined MPI datatypes is absent. Advanced MPI-2 features like dynamic process management, one-sided communication, and parallel input/output are not available.

Pypar [32] is a minimalistic and intuitive Python interface to MPI. It is a lightweight wrapper implemented with a mixture of high-level Python code and a low-level extension module written in C; the Python interpreter does not require modification. General Python objects can be communicated using the pickle protocol. There is good support for communicating NumPy arrays, and practically full MPI bandwidth can be achieved. However, there is no support for basic MPI-1 features of common use like user-defined MPI datatypes, nonblocking communication, communication domains, and process topologies. Advanced MPI-2 features like dynamic process management, one-sided communication, and parallel input/output are not available.

pyMPI and Pypar provide many of the features available in MPI for Python. However, differences in design and additional capabilities distinguish MPI for Python from these packages. These differences are summarized below.

• MPI for Python does not require a modified Python interpreter. The stock Python interpreter readily available on the computing platform is employed.
• MPI for Python is implemented with Cython (as opposed to low-level C code), facilitating development and maintenance.
• MPI for Python features support for well-known wrapper generator tools like SWIG and F2PY, lowering the barrier to reusing existing C, C++, and Fortran MPI-based libraries.
• The MPI for Python Application Program Interface (API) follows closely the MPI standard specification.
• MPI for Python can communicate general Python objects and NumPy arrays. However, fast communication of array data is not limited to NumPy – any Python type implementing the Python buffer interface can participate.
• MPI for Python supports the complete MPI-1 specification. All blocking/nonblocking point-to-point and collective operations are available, as well as user-defined MPI datatypes, multiple communication domains and Cartesian/graph process topologies.
• MPI for Python supports the complete MPI-2 specification, providing full coverage of advanced features like dynamic process management, one-sided communications, and parallel input/output.

3. PETSc for Python

This section describes PETSc for Python, an open-source software project that provides bindings to the PETSc libraries for the Python programming language.

PETSc for Python is a general-purpose and full-featured package. Its facilities allow sequential and parallel Python applications to exploit state of the art algorithms and data structures readily implemented in PETSc and targeted to large-scale numerical simulations arising in many problems of science and engineering.

PETSc for Python is implemented with Cython. The rationale for this choice is the same as that already discussed in Section 2.

PETSc presents a rather large API accessible from C and Fortran. Although PETSc is designed with an object-oriented style, its API is limited to the procedural style of C and Fortran – these programming languages do not natively support some more advanced concepts such as classes, inheritance and polymorphism, or exception-based error handling. PETSc for Python was designed from the ground up to present to users an easy-to-use, high-level, pythonic interface. This interface is certainly easier and more pleasant to use than the native ones available in PETSc for C and Fortran.

PETSc for Python has coverage of the most important PETSc features. Among them, we can mention assembling distributed vectors and sparse matrices in parallel, solving systems of linear equations with Krylov-based iterative methods and direct methods, and solving systems of nonlinear equations with Newton-based iterative methods including matrix-free techniques. It is not feasible to provide here a detailed listing of all the supported features; the complete API reference is available in the project documentation.

The rest of this section presents a general overview and some examples of the many PETSc concepts and functionalities readily available in PETSc for Python. The examples are simple, self-contained, and implemented in a few lines of Python code. Nevertheless, they show general usage patterns of PETSc for Python for implementing linear algebra algorithms, assembling sparse matrices, and solving systems of linear and nonlinear equations within a Python programming environment.

3.1. Vectors

PETSc for Python provides access to PETSc vectors, index sets and general vector scatter/gather operations through the Vec, IS, and Scatter classes, respectively. By using them, the management of distributed field data is highly simplified in parallel applications.

Besides their use as containers for field data, PETSc vectors also represent algebraic entities of finite-dimensional vector spaces. For this case, the Vec class provides many methods for performing common linear algebra operations, like computing vector updates (axpy( ), aypx( ), scale( )), inner products (dot( )) and different kinds of norms (norm( )).

Fig. 1 shows a basic implementation of a simple Krylov-based iterative linear solver, the (unpreconditioned) conjugate gradient method.

Fig. 1. Basic implementation of the conjugate gradient method.
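The listing of Fig. 1 is not reproduced in this text; as a stand-in, the short sketch below (an illustrative assumption, using only the vector methods named above plus standard creation calls) exercises the Vec interface; the global size and values are arbitrary.

    from petsc4py import PETSc

    n = 100                                  # arbitrary global size
    x = PETSc.Vec().createMPI(n)             # vector distributed over all processes
    y = x.duplicate()

    x.set(1.0)                               # x_i = 1
    y.set(2.0)                               # y_i = 2

    x.axpy(3.0, y)                           # x <- x + 3*y, so x_i = 7
    dot = x.dot(y)                           # parallel inner product
    nrm = x.norm(PETSc.NormType.NORM_2)      # 2-norm of x

    if PETSc.COMM_WORLD.getRank() == 0:
        print('dot =', dot, 'norm =', nrm)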
3.2. Matrices

PETSc for Python provides access to PETSc matrices through the Mat class. New Mat instances are obtained by calling the create( ) method. Next, the user specifies the row and column sizes by calling the setSizes( ) method. Finally, a call to the setType( ) method selects a particular matrix implementation.

Matrix entries can be set (or added to existing entries) by calling the setValues( ) method. PETSc simplifies the assembling of parallel matrices. Any process can contribute to any entry. However, off-process entries are internally cached. Because of this, a final call to the assemblyBegin( ) and assemblyEnd( ) methods is required in order to communicate off-process entries to the actual owning process. Additionally, those calls prepare some internal data structures for performing efficient parallel operations like matrix–vector product. The latter operation is available by calling the mult( ) method.

Fig. 2 shows the basic steps for creating and assembling a sparse matrix in parallel. The assembled matrix is a discrete representation

of the two-dimensional Laplace operator on the unit square equipped with homogeneous boundary conditions after a 5-point finite differences discretization. The grid supporting the discretization scheme is structured and regularly spaced. Furthermore, the grid nodes have a simple contiguous block-distribution by rows on a group of processes.

Fig. 2. Assembling a sparse matrix in parallel.
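The listing of Fig. 2 is not reproduced in this text; the simplified sketch below (an assumption for illustration) assembles the analogous one-dimensional, tridiagonal Laplace matrix in parallel, using the calls named above plus the ownership-range query getOwnershipRange( ); the global size is arbitrary.

    from petsc4py import PETSc

    n = 100                                   # arbitrary global size
    A = PETSc.Mat().create()
    A.setSizes([n, n])
    A.setType('aij')                          # sparse (CSR) storage
    A.setUp()

    # each process fills only the rows it owns
    rstart, rend = A.getOwnershipRange()
    for i in range(rstart, rend):
        A.setValue(i, i, 2.0)
        if i > 0:
            A.setValue(i, i - 1, -1.0)
        if i < n - 1:
            A.setValue(i, i + 1, -1.0)

    A.assemblyBegin()
    A.assemblyEnd()

    x = PETSc.Vec().createMPI(n)
    y = x.duplicate()
    x.set(1.0)
    A.mult(x, y)                              # y <- A x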
3.3. Linear solvers

PETSc for Python provides access to PETSc linear solvers and preconditioners through the KSP and PC classes.

New KSP instances are obtained by calling the create( ) method. This call automatically creates a companion inner preconditioner (i.e., a PC instance) that can be retrieved with the getPC( ) method for further manipulations. The KSP and PC classes provide the setType( ) method for the selection of a specific iterative method and preconditioning strategy. The setTolerances( ) method enables the specification of the different tolerances for declaring convergence; other algorithmic parameters can also be set. Additionally, PETSc for Python supports attaching a user-defined Python function for monitoring the iterative process (by calling the setMonitor( ) method) and defining a custom convergence criterion (by calling the setConvergenceTest( ) method).

KSP objects have to be associated with a matrix (i.e., a Mat instance) representing the operator of the linear problem and a (possibly different) matrix for defining the preconditioner. This is done by calling the setOperators( ) method. Additional options set from command line arguments, configuration files, and environment variables can be specified by calling the setFromOptions( ) method. In order to actually solve a system of linear equations, the solve( ) method has to be called with appropriate vector arguments (i.e., Vec instances) specifying the right hand side and the location where to build the solution.

Fig. 3 presents an example showing the basic steps required for creating and configuring a linear solver and its inner preconditioner in PETSc for Python.

This linear solver and preconditioner combination is employed for solving a linear system involving a previously assembled parallel sparse matrix (see Fig. 2).

Fig. 3. Solving a linear problem.
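The listing of Fig. 3 is not reproduced in this text; the following sketch shows an equivalent sequence of calls, reusing the matrix A and size n assembled in the previous sketch. The particular choices (conjugate gradients with Jacobi preconditioning and the tolerance value) are assumptions made for illustration.

    from petsc4py import PETSc

    # right hand side and solution vectors compatible with A
    b = PETSc.Vec().createMPI(n)   # n and A as assembled above
    x = b.duplicate()
    b.set(1.0)

    ksp = PETSc.KSP().create()
    ksp.setOperators(A)            # operator and preconditioning matrix
    ksp.setType('cg')              # Krylov method: conjugate gradients

    pc = ksp.getPC()               # companion preconditioner
    pc.setType('jacobi')

    ksp.setTolerances(rtol=1e-8)   # relative tolerance for convergence
    ksp.setFromOptions()           # allow command-line overrides
    ksp.solve(b, x)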
3.4. Nonlinear solvers

PETSc for Python provides access to PETSc nonlinear solvers through the SNES class.

SNES objects have to be associated with a user-defined Python function in charge of evaluating the nonlinear residual vector and, optionally, a function for the Jacobian matrix evaluation, at each nonlinear iteration step. Those user routines can be set with the methods setFunction( ) and setJacobian( ).

New SNES instances are obtained by calling the create( ) method. This call automatically creates a companion inner linear solver (i.e., a KSP instance) that can be retrieved with the getKSP( ) method for further manipulations. The setTolerances( ) method enables the specification of the different tolerances for declaring convergence; other algorithmic parameters can also be set. Additionally, PETSc for Python supports attaching user-defined Python functions for monitoring the iterative process (by calling the setMonitor( ) method) and defining a custom convergence criterion (by calling the setConvergenceTest( ) method). In order to actually solve a system of nonlinear equations, the solve( ) method has to be called with appropriate vector arguments (i.e., Vec instances) specifying an optional right hand side (usually not provided, as it is the zero vector) and the location where to build the solution (which additionally can specify an initial guess for starting the nonlinear loop).

Consider the following boundary value problem in two dimensions:

\[ -\Delta U(x) = \alpha \exp[U(x)], \quad x \in \Omega, \qquad U(x) = 0, \quad x \in \partial\Omega, \]

where Ω is the unit square (0, 1)² and ∂Ω is its boundary, Δ is the two-dimensional Laplace operator, U is a scalar field defined on Ω, and α is a constant. The equation is nonlinear and usually called the Bratu problem. The nonlinear system has a bifurcation (turning point) at α_max ≈ 6.80812; there is no solution for α > α_max. The standard 5-point finite differences stencil is employed for performing a spatial discretization on a structured, regularly spaced grid. As a result of the discretization process, a system of nonlinear equations is obtained. Fig. 4 shows the Python implementation of the nonlinear residual function for the Bratu problem and the basic steps required for creating and configuring a nonlinear solver. The inner Krylov linear solver is configured to use the conjugate gradient method. Additionally, the nonlinear solver is configured to use a matrix-free method (i.e., the Jacobian is not explicitly computed).

Fig. 4. Solving a nonlinear problem.
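The listing of Fig. 4 is not reproduced in this text; the reduced sketch below shows the same configuration pattern on a much simpler nonlinear system (the componentwise equation x_i² = 2, chosen only for illustration). The residual is supplied through setFunction( ), the inner Krylov solver is set to conjugate gradients, and a matrix-free Jacobian is requested through setUseMF( ), which is assumed here to correspond to PETSc's -snes_mf option.

    from petsc4py import PETSc

    n = 10
    x = PETSc.Vec().createMPI(n)   # solution (also holds the initial guess)
    f = x.duplicate()              # storage for the nonlinear residual

    def residual(snes, X, F):
        # componentwise residual F_i = X_i**2 - 2
        xx = X.getArray()
        F.setArray(xx * xx - 2.0)

    snes = PETSc.SNES().create()
    snes.setFunction(residual, f)
    snes.setTolerances(rtol=1e-8)
    snes.setUseMF(True)            # matrix-free Jacobian (finite differencing)
    snes.getKSP().setType('cg')
    snes.setFromOptions()

    x.set(1.0)                     # starting point of the Newton loop
    snes.solve(None, x)            # right hand side omitted (zero vector)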

4. Performance tests

This section presents some performance tests aimed at measuring the overhead introduced by the Python layer in comparison to pure C codes. First, wall-clock times³ of some selected point-to-point and collective MPI communication calls using MPI for Python are compared to the ones obtained with an equivalent pure C implementation; both Ethernet and shared memory communication channels were exercised. Next, PETSc for Python and an equivalent pure C code are employed for driving the solution of a model transient, nonlinear, partial differential equation problem using matrix-free methods; the heavy computations at grid-level loops are implemented in Fortran 90.

The hardware employed to perform these tests was

• a small research cluster consisting of a server and 6 compute nodes with two quad-core Intel Xeon E5420 2.5 GHz processors, 1333 MHz front-side bus, 8 GB DDR2-667 memory (10.6 GB/s per-socket theoretical peak memory bandwidth), interconnected via a switched Gigabit Ethernet network;
• a high-end desktop with a single quad-core Intel i7 950 3.07 GHz processor, QuickPath interconnect, 12 GB DDR3-1066 memory (25.6 GB/s per-socket theoretical peak memory bandwidth)

and the software stack consisted of

• Linux 2.6.32 and GCC 4.4.4;
• MPICH2 1.2.1p1 and PETSc 3.1;
• Python 2.6.2 and NumPy 1.3.0.

³ Wall-clock time or wall time is the total amount of time (as determined by a chronometer) that a task takes to complete. It includes the time required for computations, input/output and communications. Wall-clock time should not be confused with CPU time or processor time, which measures only the time a processor was assigned to actively work on a certain task.

4.1. MPI for Python

The first test consisted of blocking send and receive operations (MPI_SEND and MPI_RECV) between a pair of nodes. Messages were numeric arrays of double precision (64 bits) floating-point values. The two supported communication mechanisms, serialization (via pickle) and direct communication of memory buffers, were compared against compiled C code. A basic implementation of this test using MPI for Python with direct communication of memory buffers (translation to C or C++ is straightforward) is shown below. The actual implementation took into account memory preallocation (in order to avoid paging effects) and parallel synchronization (in order to avoid asynchronous skew in the start-up phase).

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    msglen = 2**16
    array1 = np.empty(msglen, dtype='d')
    array2 = np.empty(msglen, dtype='d')
    sendbuf = [array1, msglen, MPI.DOUBLE]
    recvbuf = [array2, msglen, MPI.DOUBLE]

    wt = MPI.Wtime()
    if rank == 0:
        comm.Send(sendbuf, 1, tag=0)
        comm.Recv(recvbuf, 1, tag=0)
    elif rank == 1:
        comm.Recv(recvbuf, 0, tag=0)
        comm.Send(sendbuf, 0, tag=0)
    wt = MPI.Wtime() - wt

For increasing message sizes, the wall-clock time required for communication is measured many times and then averaged. Throughput is computed as the ratio of message size (in bytes) and wall-clock time (in seconds) required to accomplish the communication. Python overhead is computed from wall-clock times as (T_Python − T_C)/T_C.

Results obtained on the switched Gigabit Ethernet network are shown in Fig. 5. As expected, the overhead introduced by object serialization degrades overall efficiency. Compared to communication performed in C, the overhead of pickle communication is around 80% for small messages and around 30% for large messages. However, fast communication of array data is quite efficient. The overhead of buffer communication is below 10% for small messages and below 5% for large messages.

Fig. 5. PingPong – Gigabit Ethernet.

Results obtained on shared memory are shown in Fig. 6. In this case, the overhead introduced by the Python layer is quite more noticeable. Compared to communication performed in C, the pickle communication overhead is 150× for small messages and around 5× for large messages, while the overhead of buffer communication is around 2.5× for small messages and around 0.5× for large messages.

The second test consisted of wall-clock time measurements of Broadcast and All-to-All collective operations on four processes. Messages were again numeric arrays of double precision floating-point values.

Results obtained on the switched Gigabit Ethernet network are shown in Fig. 7. For small messages, the overhead of pickle communication is significant – 12× for Broadcast and 3.5× for All-to-All – while for large messages it is less noticeable. The overhead of buffer communication is small, particularly for All-to-All – less than 10% for all message sizes. Results obtained on shared memory are shown in Fig. 8; they follow the trends of the previous results. However, the overhead of pickle communication is quite more noticeable for all message sizes.

Finally, it is worth remarking that all the previous tests involved communication of contiguous NumPy arrays. For these objects, the pickle protocol is implemented quite efficiently – the total amount of memory required for serialization is known in advance and the array items have a common data type corresponding to a C primitive type. For more general, user-defined Python objects containing deeply nested data structures, pickle communication is expected to achieve lower performance than reported here.

4.2. PETSc for Python

Consider the following diffusive, unsteady, non-linear, scalar problem in the unit cube Ω, x = (x1, x2, x3) ∈ (0, 1)³:

Fig. 6. PingPong – shared memory.
Fig. 7. Broadcast and All-to-All – Gigabit Ethernet.

\[ \frac{\partial\phi}{\partial t} - \nabla\cdot\big(\kappa(\phi)\,\nabla\phi\big) = G \quad \text{on } \Omega\times(0,T], \tag{1} \]

with homogeneous Neumann conditions at the boundary Γ = ∂Ω and given initial conditions,

\[ \frac{\partial\phi}{\partial n} = 0 \ \text{ at } \ \Gamma\times[0,T], \qquad \phi = \phi_0 \ \text{ at } \ t = 0, \tag{2} \]

where n is the outer normal to the boundary.

The diffusion coefficient κ depends on φ in the following way:

\[ \kappa(\phi) = \begin{cases} 1 & \text{if } \phi \ge 0, \\ \dfrac{1}{1+\phi^2} & \text{if } \phi < 0, \end{cases} \tag{3} \]

and the source term G is the line source

\[ G\left(x_1=\tfrac14,\ x_2=\tfrac14,\ \tfrac12 \le x_3 \le 1\right) = 300. \tag{4} \]

After time discretization using the backward-Euler scheme with time step Δt and space discretization with centered finite differences on a structured grid of N1 × N2 × N3 points, the following system of equations is obtained:

\[ \frac{1}{\Delta t}\left(\phi^{n+1}_{i,j,k} - \phi^{n}_{i,j,k}\right) - L_{i,j,k}(\kappa,\phi^{n+1}) - G_{i,j,k} = 0, \qquad i,j,k = 1,\ldots,N_1,N_2,N_3, \tag{5} \]

where φⁿ_{i,j,k} and φⁿ⁺¹_{i,j,k} are φ at the point ((i−1)Δx1, (j−1)Δx2, (k−1)Δx3) and time levels tⁿ and tⁿ⁺¹ = tⁿ + Δt, respectively, and

\[ \begin{aligned}
L_{i,j,k}(\kappa,\phi) ={}& \frac{\kappa[i-\tfrac12]}{(\Delta x_1)^2}\,\phi[i-1]
 - \frac{\kappa[i-\tfrac12]+\kappa[i+\tfrac12]}{(\Delta x_1)^2}\,\phi[0]
 + \frac{\kappa[i+\tfrac12]}{(\Delta x_1)^2}\,\phi[i+1] \\
&+ \frac{\kappa[j-\tfrac12]}{(\Delta x_2)^2}\,\phi[j-1]
 - \frac{\kappa[j-\tfrac12]+\kappa[j+\tfrac12]}{(\Delta x_2)^2}\,\phi[0]
 + \frac{\kappa[j+\tfrac12]}{(\Delta x_2)^2}\,\phi[j+1] \\
&+ \frac{\kappa[k-\tfrac12]}{(\Delta x_3)^2}\,\phi[k-1]
 - \frac{\kappa[k-\tfrac12]+\kappa[k+\tfrac12]}{(\Delta x_3)^2}\,\phi[0]
 + \frac{\kappa[k+\tfrac12]}{(\Delta x_3)^2}\,\phi[k+1],
\end{aligned} \tag{6} \]

where, in shorthand notation, [0] = (i, j, k), [i − 1/2] = (i − 1/2, j, k) and so on, and
• φ at staggered points is obtained with the averaging scheme φ[i − 1/2] = ½(φ[i−1] + φ[0]), and so on.

Fig. 8. Broadcast and All-to-All – shared memory.

The difference scheme detailed above is implemented in a Fortran 90 subroutine. This subroutine loops over the i, j, k grid nodes, computing nonlinear residuals according to Eq. (5) and taking into account the boundary conditions (Eq. (2)).

The Fortran 90 subroutine in charge of computing residuals is employed in both Python and C driver codes employing PETSc data structures and algorithms. F2PY makes it possible to access the Fortran code from Python, while accessing the Fortran code from C is just a matter of accounting for the name mangling conventions of the Fortran compiler.

A TS timestepper is configured to run 10 time steps with t₀ = 0, Δt = 0.01 and initial conditions φ⁰_{i,j,k} = 0. At each time step, the SNES nonlinear solver iterates until the nonlinear residual norm is reduced to 10⁻⁶ of the initial one. At each nonlinear iteration, the inner KSP linear solver performs conjugate gradient iterations with no preconditioning until the linear residual norm is reduced by 10⁻⁶ with respect to the initial one.

Matrix entries of the Jacobian of Eq. (5) are not computed or stored; instead, the action of the Jacobian is approximated by finite differencing the nonlinear residual.

The overhead of dispatch through Python, (T_Python − T_C)/T_C using wall-clock times, is shown in Fig. 9. The horizontal axis indicates the number of grid points; the vertical axis indicates the Python overhead, determined as the quotient between Python and C wall-clock timings. For the smallest problem, the overhead in using Python is around 13%; it then decreases as the problem size grows. For medium sized to large problems the overhead is around 3%.

Fig. 9. Overhead of Python versus C codes for the solution of Eqs. (1) and (2).

5. Application examples

PETSc-FEM [33,34] has been developed since 1999 at the CIMEC laboratory and it is publicly available under the GPL license. PETSc-FEM is a parallel multiphysics code implemented in C++ and primarily targeted to 2D and 3D finite element computations on general unstructured grids.

PETSc-FEM provides a core library to manage parallel data distribution and assembly of residual vectors and Jacobian matrices, as well as facilities for general tensor algebra computations at the level of problem-specific finite element routines. Driver programs use the core library and other utility data structures and routines to implement different applications tailored to the problem at hand. Input data is provided through text files with a specific structure describing physical and algorithmic simulation parameters, finite element meshes and boundary conditions. These input files are usually generated in a preprocessing step using external tools, typically by scripts written in Octave and Perl. Those tools are also employed for postprocessing, data analysis, and visualization of output data. Complex simulations involving the coupling of different models require the execution of separate processes cooperating through interprocess communication (IPC) mechanisms like POSIX pipes and sockets using ad hoc protocols. IPC is usually performed at each time step and in some cases at the inner nonlinear iterations. Setting up new applications requires hard-wiring customizations in the source code or employing techniques like dynamic loading to "hook" application-specific code implemented by end users. A series of shell scripts and makefiles control other aspects of this framework: compiling and linking, program execution, and data flow through the preprocessing/simulation/postprocessing chain.

Simulation frameworks like the one previously described are common in home-grown scientific codes. However, the expertise required to handle these complexities is beyond the regular training of scientists and engineers. End-users – and particularly beginners – have to invest excessive time learning new skills to manage the plethora of different components.

A computing environment based on the Python programming language, as well as a companion "stack" of publicly available, flexible, well-thought-out Python-based tools, is a viable alternative to more traditional frameworks. The following list summarizes the benefits we experienced after switching to a Python-centered computing environment.

• A single language takes the place of several shell scripts, makefiles, desktop computing and visualization environments, and scripting programming languages.
• Prototyping and testing of new model components as well as reuse and coupling of existing models become remarkably simpler.

 Performance critical components developed in traditional sci- eration due to gravity. The bottom shear stresses are approximated
entific programming languages are incorporated to the frame- by using the Chèzy or Manning equations,
work with the help of readily available tools. 8 v 2 PðhÞ
 Preprocessing, postprocessing, data analysis and visualization < C2 AðhÞ zy model;
Che
are tightly integrated to simulation. Sf ¼ h 2 ð9Þ
: n 4=3
2 P ðhÞ
v Manning model;
 The edit/compile/run cycle is considerably reduced and overall a A4=3 ðhÞ

end-user productivity increases. Beginners are able to get


where P is the wetted perimeter of the channel, n is Manning rough-
started in a substantially shorter time frame.
Functionalities from the core PETSc-FEM library are made available to Python by using SWIG. These Python wrappers to PETSc-FEM interoperate with MPI for Python and PETSc for Python. MPI for Python is employed for communication and coordination of parallel runs. PETSc for Python provides the data structures and algorithms readily available in PETSc. PETSc-FEM primarily contributes the evaluation of residual vectors and Jacobian matrices. Python code drives the entire application, gluing the pieces together. In order to illustrate the capabilities of the tool chain consisting of MPI for Python, PETSc for Python and PETSc-FEM, two application examples are presented.

5.1. Hydrology: coupled subsurface/surface flow

This section summarizes some results obtained when modeling a large scale basin in Santa Fe, Argentina. A fully coupled model was developed to achieve a better understanding of the dynamics and interactions of stream/aquifer water flow and the impact of changing land and hydrology at regional levels. This model comprises the groundwater flow equation, coupled to the Saint–Venant equations for flow in open channels. The water exchange at the aquifer/river interface depends on the head difference between them and on a resistivity coefficient that characterizes such an interface.

5.1.1. Subsurface flow

The equation for the flow in a confined (phreatic) aquifer integrated in the vertical direction is [35]

S \frac{\partial \phi}{\partial t} = \nabla \cdot (K \nabla \phi) + G_{aq}, \quad \text{on } \Omega_{aq} \times (0, t].    (7)

The corresponding unknown for each node is the piezometric height, or the level of the phreatic surface at that point, \phi; \Omega_{aq} is the aquifer domain, S the storativity, K the hydraulic conductivity, and G_{aq} the source term accounting for rain and for losses from streams or other aquifers.

5.1.2. Surface flow

When velocity variations over the channel cross section are neglected, the flow can be treated as one dimensional. The equations of mass and momentum conservation on a stream of variable cross section (in conservation form) are [36,37]

\frac{\partial A}{\partial t} + \frac{\partial Q(A)}{\partial s} = G_{st},
\frac{1}{A} \frac{\partial Q}{\partial t} + \frac{1}{A} \frac{\partial}{\partial s}\left(\frac{Q^2}{A}\right) + g (S_0 - S_f) + g \frac{\partial}{\partial s}(h + h_b) = 0, \quad \text{on } \Omega_{st} \times (0, t],    (8)

where A = A(h) is the section of the channel occupied by water for a given water depth h and h_b is the channel bottom elevation. For instance, in rectangular channels A(h) = wh, where w is the channel width. Q is the discharge, G_{st} represents the gain or loss of the stream (i.e., the lateral inflow per unit length), s is the arc-length along the channel, v = Q/A is the average velocity in the s-direction, and v_t is the velocity component in the s-direction of the lateral flow from tributaries. S_0 is the bottom slope, S_f is the friction slope, and g is the acceleration due to gravity. The friction slope is evaluated with the Manning law, in which n is the Manning roughness coefficient and a is a conversion factor (a = 1 for SI units).

5.1.3. River/aquifer coupling

The stream/aquifer interaction process occurs between a stream and its adjacent flood-plain aquifer. A typical discretization is shown in Fig. 10, where an element representing the stream loss is connected to two nodes on the stream and two on the aquifer. If the stream level is above the phreatic aquifer level (h_b + h > \phi) then the stream loses water to the aquifer, and vice versa. The stream gain (loss) at a point is

G_s = \frac{P}{R_f} (\phi - h_b - h),    (10)

where R_f is the resistivity factor per unit arc length of the perimeter. The corresponding loss (gain) to the aquifer is

G_a = -G_s \, \delta_{\Gamma_s},    (11)

where \Gamma_s represents the planar curve of the stream and \delta_{\Gamma_s} is a Dirac delta distribution with unit intensity per unit length, such that

\int_{\Omega} f(x) \, \delta_{\Gamma_s} \, d\Omega = \int_{\Gamma_s} f(x(s)) \, ds.    (12)

Thus, in the context of a weighted formulation, the coupling term can be computed as

\int_{\Omega} W(x) \, G_a(x) \, d\Omega = -\int_{\Omega} W(x) \, G_s(x) \, \delta_{\Gamma_s} \, d\Omega = -\int_{\Gamma_s} W(x(s)) \, G_s(x(s)) \, ds,    (13)

where W are the weighting functions.

Upon using the SUPG Galerkin finite element discretization procedure with linear triangles and/or bilinear rectangular elements, and the trapezoidal rule for time integration, we obtain the system to be solved at each time step [38]

R = K(U^{k+\theta}) \, U^{k+\theta} + B(U^{k+\theta}) \, \frac{U^{k+1} - U^{k}}{\Delta t} - G^{k+\theta} = 0,    (14)

where U^{k+\theta} = \theta U^{k+1} + (1 - \theta) U^{k}, U is the state of the coupled problem (i.e., phreatic level and Saint–Venant state variables), \theta is the time-weighting factor satisfying 0 \le \theta \le 1, \Delta t is the time increment, and k denotes the number of time steps. K and B are the non-symmetric stiffness matrix and the symmetric mass matrix, respectively (both K and B depend on U), G is the source vector and R is the residual vector.

Fig. 10. Stream/aquifer coupling.
5.1.4. Numerical simulations

An example of surface and subsurface interaction flow is presented. The study area represents a third of the total area of Santa Fe province (Argentina), amounting to roughly 33,000 km^2 (see Fig. 11). A period of 12 months is simulated in which the total precipitation is the annual average observed in recent years (1000 mm/year), but divided into two wet seasons with a rainfall rate of 2000 mm/year (March–April and September–October) and dry seasons of 500 mm/year (the rest of the year).

At time t = 0 s (January 1) the piezometric head in the phreatic aquifer is 30 m above the aquifer bottom, while the water depth in the stream is 10 m above the stream bed. The hydraulic conductivity and storativity of the phreatic aquifer are K = 2 × 10^-3 m/s and S = 2.5 × 10^-2, respectively. The Manning friction law is adopted for this case. The stream channel roughness is n = 3 × 10^-3 and the river width is w = 10 m. The average value of the resistivity of the river walls is R_f = 10^5 s. The computational mesh has 1.7 million triangles (see a detail in Fig. 12), and the drainage network has more than 150 branches discretized with 70 thousand elements, giving an average spacing of 100 m between river nodes. The time step adopted in the simulations is Δt = 1 day.

Fig. 11. Terrain elevation of the computational domain and its parallel partition.

Fig. 12. Mesh over topography detail.

Fig. 13 shows the phreatic elevation at four different days of the simulated period. The phreatic levels increase after the first two dry months (January–February), once the wet period (March–April) starts. At the same time, considering the region near the rivers where the subsurface/surface flow interaction process takes place, an increment of the river water level is observed due to the recharge from the elevated aquifer phreatic levels in wet seasons. The opposite process is observed in dry periods.

Fig. 13. Aquifer phreatic elevation during a periodic rainfall.

5.2. Microfluidics: lab-on-a-chip simulations

A Lab-on-a-chip (LOC) performs the functions of classical analytical devices in small units of a few square centimeters in size [39]. They are used in a variety of chemical, biological and medical applications. The benefits of LOC are the reduction of consumption
of samples and reagents, shorter analysis time, greater sensitivity, portability and disposability. There has been a huge interest in these devices in the past decade, which has led to a wide commercial range of products.

Historically, the most important techniques developed in LOC devices are electrophoretic separations [40,41]. They are based on the mobility of ions under the action of an external electric field. These techniques are widely used in chemical and biochemical analysis. As microchips for electrophoresis become increasingly complex, simulation tools are required to prototype these devices numerically, as well as to control and optimize their handling [42].

Numerical simulation of a two dimensional electrophoresis (2DE) device is presented. Simulations were carried out by using a 3D time-dependent finite element model for electrophoretic processes in microfluidic chips. Two-dimensional electrophoretic separations consist of two independent mechanisms that are employed sequentially. The separation efficiency is estimated as the product of the independent efficiencies of each method, provided the methods are uncoupled. Two such mechanisms, satisfying uncoupling, are free flow isoelectric focusing (FFIEF) and capillary zone electrophoresis (CZE). FFIEF is a technique in which an electric field and a pH gradient are established perpendicularly to a flowing sample solution, allowing components to focus at their stable isoelectric points (pI) [43–45]. CZE is based on the application of an electric potential difference along a capillary channel; electric forces then generate electroosmotically-driven fluid flow and induce species migration along the channel axis, yielding the separation according to their electrophoretic mobilities [46,47].

5.2.1. Modeling

Mathematical modeling of electrophoretic separations carried out on LOC involves fluid, electric and concentration fields, and the strong coupling between them. In this particular case, due to the high reaction rate, the fluid and the electric fields can be treated in a quasi-stationary form, reducing the complexity of the solving process [48]. The set of differential equations solved on the LOC geometry \Omega_{loc} can be summarized as

\nabla \cdot u = 0, \quad \text{on } \Omega_{loc},    (15)
\rho (u \cdot \nabla u) = \nabla \cdot \left( -p I + \mu (\nabla u + \nabla u^T) \right), \quad \text{on } \Omega_{loc},    (16)
\nabla \cdot \left( F^2 \sum_{j=1}^{N} z_j^2 \Omega_j c_j \nabla \phi - F \sum_{j=1}^{N} z_j D_j \nabla c_j + F \sum_{j=1}^{N} z_j c_j u \right) = 0, \quad \text{on } \Omega_{loc},    (17)
\frac{\partial c_j}{\partial t} + \nabla \cdot \left( z_j \Omega_j \nabla \phi \, c_j + u \, c_j - D_j \nabla c_j \right) - r_j = 0, \quad \text{on } \Omega_{loc} \times (0, t].    (18)

Eqs. (15) and (16) are the Navier–Stokes equations for solving the fluid field, where u is the velocity, p is the pressure, and \mu is the dynamic viscosity. In this simulation, in order to model the electroosmotic flow, a slip velocity is set as boundary condition. The magnitude of this velocity (u_{eo}) is based on the Helmholtz–Smoluchowski approximation [46]:

u_{eo} = -\frac{\epsilon \, \zeta_w \, \nabla \phi}{\mu}, \quad \text{on } \Gamma_{eo},    (19)

where \epsilon is the electric permittivity, \zeta_w is the electrokinetic potential of the solid walls \Gamma_{eo}, and \phi is the electric potential.

Eq. (17) expresses the electric charge conservation for the domain as a combination of migrative, diffusive and advective components for the motion of all charges present in the solution. In this case, for the j-th species, z_j represents the valence, \Omega_j the mobility, c_j the concentration in mol m^-3, and D_j the diffusion coefficient.
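As a concrete illustration of the slip boundary condition in Eq. (19), the short NumPy sketch below evaluates u_eo from a sampled gradient of the electric potential. The numerical values chosen for the permittivity, the wall zeta potential and the viscosity are illustrative placeholders, not the parameters used in the simulations reported here.

# Evaluate the Helmholtz-Smoluchowski slip velocity of Eq. (19) at wall
# nodes; all parameter values below are illustrative placeholders.
import numpy as np

eps = 80.0 * 8.854e-12   # electric permittivity of the liquid [F/m], assumed
zeta_w = -0.1            # electrokinetic (zeta) potential of the wall [V], assumed
mu = 1.0e-3              # dynamic viscosity [Pa s], assumed

def slip_velocity(grad_phi):
    # u_eo = -(eps * zeta_w / mu) * grad(phi)
    return -(eps * zeta_w / mu) * np.asarray(grad_phi)

# gradient of the electric potential sampled at three wall nodes [V/m]
grad_phi = np.array([[1.0e4, 0.0], [0.0, 2.0e4], [5.0e3, 5.0e3]])
print(slip_velocity(grad_phi))   # slip velocities [m/s]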
Finally, Eq. (18) is the mass transport equation for a generic j-th species, where r_j is the reaction term. Different electrolytes (acids, bases and ampholytes), analytes, and particularly the hydrogen ion have to be considered in order to determine the reaction terms. In electrolyte chemistry the processes of association and dissociation are much faster than the electrokinetic transport processes; hence, it is a good approximation to adopt chemical equilibrium constants to model the reactions of weak electrolytes [49], while strong electrolytes are considered as completely dissociated.

In solving 2DE, amphoteric species are mainly involved. The reactions associated to a generic ampholyte AH can be summarized as [49]:

AH \underset{k_{a2}}{\overset{k_{a1}}{\rightleftharpoons}} A^- + H^+,    (20)
AH_2^+ \underset{k_{b2}}{\overset{k_{b1}}{\rightleftharpoons}} AH + H^+,    (21)

where k_{a1}, k_{b1} are the dissociation rates, and k_{a2}, k_{b2} the association rates for the ampholyte AH. Then the equilibrium state is characterized by

\frac{k_{a2}}{k_{a1}} = \frac{[A^-][H^+]}{[AH]} = K_a,    (22)
\frac{k_{b2}}{k_{b1}} = \frac{[AH][H^+]}{[AH_2^+]} = K_b,    (23)

where K_a and K_b are the dissociation constants for the equilibrium state, and the square brackets represent the concentration of the given species. The corresponding expressions of r_j are obtained as follows:

r_{A^-} = -k_{a1}[A^-][H^+] + k_{a2}[AH],    (24)
r_{AH} = k_{a1}[A^-][H^+] - k_{a2}[AH] - k_{b1}[AH][H^+] + k_{b2}[AH_2^+],    (25)
r_{AH_2^+} = k_{b1}[AH][H^+] - k_{b2}[AH_2^+],    (26)
r_{H^+} = -k_{a1}[A^-][H^+] + k_{a2}[AH] - k_{b1}[AH][H^+] + k_{b2}[AH_2^+].    (27)

In Eq. (27) the water dissociation term is not included, due to the fact that this reaction is several orders of magnitude faster than reactions (20) and (21) [49]; then [OH^-] can be calculated directly as

[OH^-] = \frac{K_w}{[H^+]},    (28)

where K_w = 10^-14 mol^2 m^-6 is the dissociation constant of pure water at 25 °C.
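The reaction model above translates directly into a few lines of Python. The sketch below evaluates the source terms (24)–(27) and the hydroxide concentration (28) for a single ampholyte; the rate constants and concentrations passed in at the bottom are illustrative placeholders rather than values from the simulations.

# Reaction source terms (24)-(27) and [OH-] from Eq. (28) for one
# ampholyte AH; the numerical inputs below are illustrative placeholders.
KW = 1.0e-14   # water dissociation constant, value quoted in the text

def reaction_terms(cA, cAH, cAH2, cH, ka1, ka2, kb1, kb2):
    # Returns (r_A-, r_AH, r_AH2+, r_H+) following Eqs. (24)-(27).
    r_A   = -ka1 * cA * cH + ka2 * cAH
    r_AH  =  ka1 * cA * cH - ka2 * cAH - kb1 * cAH * cH + kb2 * cAH2
    r_AH2 =  kb1 * cAH * cH - kb2 * cAH2
    r_H   = -ka1 * cA * cH + ka2 * cAH - kb1 * cAH * cH + kb2 * cAH2
    return r_A, r_AH, r_AH2, r_H

def hydroxide(cH):
    # [OH-] = Kw / [H+], Eq. (28)
    return KW / cH

print(reaction_terms(cA=1e-3, cAH=1e-3, cAH2=1e-4, cH=1e-7,
                     ka1=1e6, ka2=1e2, kb1=1e6, kb2=1e2))
print(hydroxide(1e-7))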

5.2.2. Simulation results

A 2DE separation involving FFIEF and CZE is simulated on a LOC prototype. FFIEF is carried out on the left part (10 × 3500 × 7000 µm^3); then samples flow through five CZE channels (10 × 1000 × 16,000 µm^3) on the right part. The mesh consists of 175 thousand linear triangles, with a total of 6 million degrees of freedom.

Boundary conditions for the fluid field are those stated by Eq. (19), and the pressure is set to 0 Pa at the outlets. In the case of the electric field, Dirichlet boundary conditions are set where the electric potential is applied, and natural Neumann conditions are set on the other walls. Finally, for the concentration field, advective flux is set at the inlets and outlets, and natural Neumann conditions are set on the walls.

The applied electric potential differences (Fig. 14(a)) are fixed during the operation to provide the system with a transverse electric field in the FFIEF region and an axial electric field in the CZE channels. The pH gradient for FFIEF is established by focusing 20 ampholytes between two sheath flows of basic and acidic solutions. A concentrated basic buffer solution is continuously injected from the inlet at the right. When stationary conditions are reached, a near-linear pH gradient is developed (Fig. 14(b)).

Fig. 14. Initial distributions of electric potential and pH.

The proposed numerical prototype is employed to separate a sample of 10 proteins. Proteins are injected from the central channel. After a few seconds, the different bands of isoelectric points are developed. In this particular case there are eight bands; consequently, there are three or four proteins that cannot be effectively separated by FFIEF, and thus CZE is employed as an additional separation method. After leaving the FFIEF chamber, proteins separate electrophoretically, completing the successful separation of the 10 sample compounds. Total sample distributions at two different instants of time during the separation process are shown in Fig. 15.

Fig. 15. Total sample distribution at 20 and 65 s.

One of the aims of this tool is to provide information about the separative performance of LOC. In analytical and bioanalytical chemistry, the separative performance of two-dimensional electrophoresis assays can be evaluated by using a customary representation in a two-dimensional map. This map contains information on the isoelectric points and the electrophoretic mobilities of the analytes present in the sample. Results in this format are shown in Fig. 16.

Fig. 16. Two dimensional map for the separation.

6. Conclusions

Python is an attractive language for rapid development of small scripts and code prototypes as well as large applications and highly portable and reusable modules and libraries. Running Python on parallel computers is a feasible alternative for decreasing the costs of software development targeted to HPC systems.

In this work, two software components facilitating the access to parallel distributed computing resources within a Python programming environment were presented: MPI for Python and PETSc for Python. These packages are able to support serious medium and large scale parallel applications.

Efficiency tests have shown that performance degradation is not prohibitive. In comparison to pure C codes, MPI for Python can communicate Python array data at nearly full speed over Gigabit Ethernet and at around half speed over shared memory channels. The overhead of PETSc for Python is consistently less than 10%.

This software suite is supporting research activities in a variety of fields. Application examples related to finite element simulations of hydrology and microfluidics problems were presented.
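As an illustration of the array communication path quoted above, the following minimal mpi4py example sends a NumPy array between two processes; the uppercase Send and Recv methods operate directly on the array buffer with an MPI datatype, which is what keeps the Python overhead small. The suggested launch command is only an example.

# Buffer-based NumPy array communication with MPI for Python.
# Run with two processes, e.g.: mpiexec -n 2 python example.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n = 1 << 20                                # array of 2**20 doubles
if rank == 0:
    data = np.arange(n, dtype='d')
    comm.Send([data, MPI.DOUBLE], dest=1, tag=77)
elif rank == 1:
    data = np.empty(n, dtype='d')
    comm.Recv([data, MPI.DOUBLE], source=0, tag=77)
    assert data[-1] == n - 1               # payload arrived intact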
Acknowledgments

The authors extend sincere thanks to Christopher Kees, Jed Brown, and the anonymous reviewer for their kind advice, insightful comments and constructive suggestions.

This work has received financial support from Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET, Argentina, Grant PIP 5271/05), Universidad Nacional del Litoral (UNL, Argentina, Grant CAI+D 2009 65/334), and Agencia Nacional de Promoción Científica y Tecnológica (ANPCyT, Argentina, Grants PICT 01141/2007, PICT 0270/2008, PICT-1506/2006).

Appendix A. Project development and support

MPI for Python and PETSc for Python are active software projects. New features and enhancements are added on a regular basis in order to keep the Python interfaces in accordance with updates of the MPI standard and PETSc. The testing process is supported by automated unit testing based on the unittest package from the Python standard library. These tests are run regularly on a variety of platforms and computer architectures.
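A test in that style might look like the sketch below, which is illustrative rather than taken from the actual test suites; it assumes the module is launched under an MPI process manager, for example mpiexec -n 4 python -m unittest test_collectives.

# Illustrative unittest-based check of a collective operation.
import unittest
from mpi4py import MPI

class TestAllreduce(unittest.TestCase):

    COMM = MPI.COMM_WORLD

    def test_sum_of_ranks(self):
        size = self.COMM.Get_size()
        rank = self.COMM.Get_rank()
        total = self.COMM.allreduce(rank, op=MPI.SUM)
        # every process must obtain the sum 0 + 1 + ... + (size - 1)
        self.assertEqual(total, size * (size - 1) // 2)

if __name__ == '__main__':
    unittest.main()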
MPI for Python is hosted on the Google Code project hosting service (http://mpi4py.googlecode.com). This service provides a version control repository (http://mpi4py.googlecode.com/svn/), an issue tracker (http://code.google.com/p/mpi4py/issues/list), and release downloads (http://code.google.com/p/mpi4py/downloads/). The Google Groups service hosts an on-line discussion and support forum (http://groups.google.com/group/mpi4py) and a mailing list (mpi4py@googlegroups.com). PETSc for Python is hosted on the Google Code project hosting service (http://petsc4py.googlecode.com). This service provides a version control repository (http://petsc4py.googlecode.com/hg/), an issue tracker (http://code.google.com/p/petsc4py/issues/list), and release downloads (http://code.google.com/p/petsc4py/downloads/). PETSc for Python uses the same support channels as PETSc (the petsc-users@mcs.anl.gov, petsc-maint@mcs.anl.gov, and petsc-dev@mcs.anl.gov mailing lists).

References

[1] Dalcin L. MPI for Python; 2005–2010. <http://mpi4py.googlecode.com>.
[2] Dalcin L. PETSc for Python; 2005–2010. <http://petsc4py.googlecode.com/>.
[3] van Rossum G. Python programming language; 1990–2010. <http://www.python.org/>.
[4] van Rossum G. Python reference manual; 2010. <http://docs.python.org/ref/ref.html>.
[5] Millman KJ, Aivazis M. Python for scientists and engineers. Comput Sci Eng 2011;13(2):9–12. doi:10.1109/MCSE.2011.36.
[6] Pérez F, Granger B, Hunter J. Python: an ecosystem for scientific computing. Comput Sci Eng 2011;13(2):13–21. doi:10.1109/MCSE.2010.119.
[7] van der Walt S, Colbert S, Varoquaux G. The NumPy array: a structure for efficient numerical computation. Comput Sci Eng 2011;13(2):22–30. doi:10.1109/MCSE.2011.37.
[8] Behnel S, Bradshaw R, Citro C, Dalcin L, Seljebotn D, Smith K. Cython: the best of both worlds. Comput Sci Eng 2011;13(2):31–9. doi:10.1109/MCSE.2010.118.
[9] Ramachandran P, Varoquaux G. Mayavi: 3D visualization of scientific data. Comput Sci Eng 2011;13(2):40–51. doi:10.1109/MCSE.2011.35.
[10] Oliphant T. NumPy: numerical Python; 2005–2010. <http://numpy.scipy.org/>.
[11] Peterson P. F2PY: Fortran to Python interface generator; 2000–2010. <http://cens.ioc.ee/projects/f2py2e/>.
[12] Cython Team. Cython: C-extensions for Python; 2007–2010. <http://www.cython.org>.
[13] Beazley DM. SWIG: simplified wrapper and interface generator; 1996–2010. <http://www.swig.org/>.
[14] Beazley DM, Lomdahl PS. Feeding a large scale physics application to Python. In: Proceedings of 6th international Python conference, San Jose, California; 1997. p. 21–9.
[15] Kadau K, Germann TC, Lomdahl PS. Molecular dynamics comes of age: 320 billion atom simulation on BlueGene/L. Int J Modern Phys C 2006;17:1755–61.
[16] MPI Forum. MPI: a message passing interface standard. Int J Supercomput Appl 1994;8(3/4):159–416.
[17] MPI Forum. MPI-2: a message passing interface standard. High Perform Comput Appl 1998;12(1–2):1–299.
[18] Snir M, Otto S, Huss-Lederman S, Walker D, Dongarra J. MPI – the complete reference. The MPI core of scientific and engineering computation, vol. 1. Cambridge, MA, USA: MIT Press; 1998.
[19] Gropp W, Huss-Lederman S, Lumsdaine A, Lusk E, Nitzberg B, Saphir W, et al. MPI – the complete reference. The MPI-2 extensions of scientific and engineering computation, vol. 2. Cambridge, MA, USA: MIT Press; 1998.
[20] MPICH2 Team. MPICH2: a portable implementation of MPI; 2003–2010. <http://www-unix.mcs.anl.gov/mpi/mpich2/>.
[21] Gropp W, Lusk E, Doss N, Skjellum A. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Comput 1996;22(6):789–828.
[22] Open MPI Team. Open MPI: open source high performance computing; 2004–2010. <http://www.open-mpi.org/>.
[23] Gabriel E, Fagg GE, Bosilca G, Angskun T, Dongarra JJ, Squyres JM, et al. Open MPI: goals, concept, and design of a next generation MPI implementation. In: Proceedings of the 11th European PVM/MPI users' group meeting, Budapest, Hungary; 2004. p. 97–104.
[24] Balay S, Buschelman K, Gropp WD, Kaushik D, Knepley MG, McInnes LC, et al. PETSc web page; 2010. <http://www.mcs.anl.gov/petsc>.
[25] Balay S, Buschelman K, Eijkhout V, Gropp WD, Kaushik D, Knepley MG, et al. PETSc users manual. Tech. Rep. ANL-95/11 – Revision 3.1. Argonne National Laboratory; 2010.
[26] Balay S, Gropp WD, McInnes LC, Smith BF. Efficient management of parallelism in object oriented numerical software libraries. In: Arge E, Bruaset AM, Langtangen HP, editors. Modern software tools in scientific computing. Birkhäuser Press; 1997. p. 163–202.
[27] Falgout R, Jones J, Yang U. Numerical solution of partial differential equations on parallel computers, vol. 51. Springer-Verlag; 2006. p. 267–94 [chapter: the design and implementation of hypre, a library of parallel high performance preconditioners].
[28] Heroux M, Bartlett R, Hoekstra VHR, Hu J, Kolda T, Lehoucq R, et al. An overview of Trilinos. Tech. Rep. SAND2003-2927. Sandia National Laboratories; 2003.
[29] Amestoy PR, Duff IS, L'Excellent J-Y, Koster J. A fully asynchronous multifrontal solver using distributed dynamic scheduling. SIAM J Matrix Anal Appl 2001;23(1):15–41.
[30] MPI Forum. MPI: a message passing interface standard, version 2.2; 2009. <http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf>.
[31] Miller P. pyMPI project page; 2000–2011. <http://pympi.sourceforge.net/>.
[32] Nielsen O. Pypar project page; 2002–2011. <http://code.google.com/p/pypar/>.
[33] Sonzogni VE, Yommi AM, Nigro NM, Storti MA. A parallel finite element program on a Beowulf cluster. Adv Eng Softw 2002;33(7–10):427–43.
[34] Storti MA, Nigro N, Paz R. PETSc-FEM: a general purpose, parallel, multi-physics FEM program; 1999–2010. <http://www.cimec.org.ar/petscfem>.
[35] Rodríguez L. Investigation of stream–aquifer interactions using a coupled surface-water and ground-water flow model. PhD thesis. University of Arizona; 1995.
[36] Whitham G. Linear and nonlinear waves. Pure and applied mathematics: a Wiley-Interscience series of texts, monographs, and tracts. John Wiley & Sons Inc.; 1974.
[37] Hirsch C. Numerical computation of internal and external flows. Wiley series in numerical methods in engineering, vol. II. John Wiley & Sons Inc.; 1990.
[38] Paz R, Storti M. An interface strip preconditioner for domain decomposition methods: application to hydrology. Int J Numer Methods Eng 2005;62(13):1873–94.
[39] Manz A, Graber N, Widmer H. Miniaturized total chemical analysis systems: a novel concept for chemical sensing. Sensor Actuator B 1990;1:244–8.
[40] Landers JP. Handbook of capillary and microchip electrophoresis and associated microtechniques. 3rd ed. CRC Press; 2007.
[41] Tian W-C, Finehout E. Microfluidics for biological applications. 1st ed. Springer; 2008.
[42] Erickson D. Towards numerical prototyping of labs-on-chip: modeling for integrated microfluidic devices. Microfluid Nanofluid 2005;1(4):301–18.
[43] Kohlheyer D, Eijkel JCT, van den Berg A, Schasfoort RBM. Miniaturizing free-flow electrophoresis – a critical review. Electrophoresis 2008;29(5):977–93.
[44] Turgeon RT, Bowser MT. Micro free-flow electrophoresis: theory and applications. Anal Bioanal Chem 2009;394(1):187–98.
[45] Sommer G, Hatch A. IEF in microfluidic devices. Electrophoresis 2009;30:742–57.
[46] Probstein R. Physicochemical hydrodynamics. An introduction. 2nd ed. Wiley-Interscience; 2003.
[47] Hunter R. Foundations of colloid science. 2nd ed. Oxford University Press; 2001.
[48] Kler PA, Berli CLA, Guarnieri FA. Modelling and high performance simulation of electrophoretic techniques in microfluidic chips. Microfluid Nanofluid 2010;10(1):187–98.
[49] Arnaud I, Josserand J, Rossier J, Girault H. Finite element simulation of off-gel buffering. Electrophoresis 2002;23:3253–61.
