Parallel Distributed Computing Using Python
Article info
Article history: Available online 22 April 2011.
Keywords: Python; MPI; PETSc.

Abstract
This work presents two software components aimed to relieve the costs of accessing high-performance parallel computing resources within a Python programming environment: MPI for Python and PETSc for Python.
MPI for Python is a general-purpose Python package that provides bindings for the Message Passing Interface (MPI) standard using any back-end MPI implementation. Its facilities allow parallel Python programs to easily exploit multiple processors using the message passing paradigm.
PETSc for Python provides access to the Portable, Extensible Toolkit for Scientific Computation (PETSc) libraries. Its facilities allow sequential and parallel Python applications to exploit state of the art algorithms and data structures readily available in PETSc for the solution of large-scale problems in science and engineering.
MPI for Python and PETSc for Python are fully integrated with PETSc-FEM, an MPI and PETSc based parallel, multiphysics, finite elements code developed at the CIMEC laboratory. This software infrastructure supports research activities related to simulation of fluid flows, with applications ranging from the design of microfluidic devices for biochemical analysis to modeling of large-scale stream/aquifer interactions.
Python is easily extended with new functions and data structures implemented in other languages. This feature allows skilled users to build their own computing environment, tailored to their specific needs and based on their favorite high-performance Fortran, C, or C++ codes. Such capabilities prove to be an advantage for modern scientific computing: users have a high-level and productive environment at hand, yet they can reuse existing library code and optimize performance-critical bottlenecks.
The Python programming language, augmented with a set of open source packages that have been developed over the last decade by scientists and engineers, provides a "computational ecosystem" that is quite capable of supporting a wide range of applications – from casual scripting and lightweight tools to full-fledged systems. For a thorough discussion of the role of Python in scientific computing and additional information about selected Python packages, see [5–9].

1.1.1. NumPy
The NumPy project [10] started in the mid-90s as a collaborative effort of an international team of volunteers aimed at developing a data structure for efficient array computation in Python. Since then, the NumPy package has found widespread adoption in academia and industry. Today, NumPy is one of the core packages for numerical computation in Python.
NumPy provides a powerful multi-dimensional array object with advanced and efficient general-purpose array operations. Additionally, NumPy contains three sub-libraries with numerical routines providing basic linear algebra operations, basic Fourier transforms and sophisticated capabilities for random number generation. It also provides facilities to support interoperability with C, C++, and Fortran.
Besides its obvious scientific applications, NumPy can also be used as an efficient multi-dimensional container of generic data. New structured data types with fixed storage layout can be defined by combining fundamental data types like integers and floats. This allows NumPy to seamlessly and speedily integrate with a wide variety of database formats.

1.1.2. F2PY
Although NumPy provides similar and higher-level capabilities, there are situations where selected, numerically intensive parts of Python applications still require the efficiency of compiled code for processing huge amounts of data in deeply nested loops. Fortran (especially Fortran 90 and above) is a language for efficiently implementing lengthy computations involving multi-dimensional arrays. State of the art implementations of many commonly used algorithms are readily available and implemented in Fortran.
F2PY [11] is a development tool that provides a connection between the Python and Fortran programming languages. It works by creating Python extension modules from special signature files or directly from annotated Fortran source files. These files, with additional annotations included as comments, contain all the information (function names, arguments and their types, etc.) that is needed to construct convenient Python bindings to Fortran functions. F2PY-generated Python extension modules enable Python codes to call those Fortran 77/90/95 routines. In addition, F2PY provides the required support for transparently accessing Fortran 77 common blocks or Fortran 90/95 module data.
In a Python programming environment, F2PY is then the tool of choice for taking advantage of the speed of compiled Fortran code and for integrating existing Fortran libraries.

1.1.3. Cython
Cython [12] is a recent development that provides access to low-level C data types and functionality in a Python programming environment. The Cython language is similar to Python, supporting most Python language constructs and libraries while adding syntax for declaring types, calling C functions, and manipulating C values. Cython code is compiled via C and the result runs within the Python runtime environment. When static type declarations are used in Cython source, it typically executes many times faster than Python and sometimes approaches the speed of C.
Using Cython, code which manipulates Python values and C values can be freely intermixed, with conversions occurring automatically wherever possible. Error checking of Python operations is also automatic, and the full power of Python exception handling facilities is available even in the midst of manipulating C data.

1.1.4. SWIG
SWIG [13], the Simplified Wrapper and Interface Generator, is an interface compiler that connects programs written in C and C++ with a variety of scripting languages.
Originally developed in 1995, SWIG was first used by scientists (in the Theoretical Physics Division at Los Alamos National Laboratory, USA) for building user interfaces to molecular dynamics simulation codes running on the Connection Machine 5 supercomputer. In this environment, scientists needed to work with huge amounts of simulation data, complex hardware, and a constantly changing code base. The use of a Python scripting language interface provided a simple yet highly flexible foundation for solving these types of problems [14,15].
SWIG works by parsing the declarations found in C/C++ header files and using them to generate the wrapper code needed by scripting languages, in particular Python, to access the underlying C/C++ code. In addition, SWIG provides many customization features that let developers tailor the wrapping process to suit specific application needs.
Although SWIG was originally developed for scientific applications, it has since evolved into a general-purpose tool that is used in a wide variety of applications – almost anywhere C/C++ and scripting programming are involved.

1.2. MPI
MPI, the Message Passing Interface [16,17], is a standardized, portable message-passing system designed to work on a wide variety of parallel computers. The standard defines a set of library routines (MPI is not a programming language extension) and allows users to write portable programs in the main scientific programming languages (Fortran, C, and C++).
The message-passing paradigm is especially suited for (but not limited to) distributed memory architectures and is used in today's most demanding scientific and engineering applications related to modeling, simulation, design, and signal processing.
MPI defines a high-level abstraction for fast and portable inter-process communication [18,19]. Applications can run on clusters of (possibly heterogeneous) workstations or dedicated compute nodes, on (symmetric) multiprocessor machines, or even on a mixture of both. MPI hides all the low-level details, like networking or shared memory management, simplifying development and maintaining portability, without sacrificing performance.
Implementations are available from vendors of high-performance computers as well as from well-known open source projects like MPICH [20,21] and Open MPI [22,23].

1.3. PETSc
PETSc [24,25], the Portable, Extensible Toolkit for Scientific Computation, is a suite of algorithms and data structures for the solution of problems arising in scientific and engineering applications, especially those modeled by partial differential equations, of
large-scale nature, and targeted for high-performance parallel computing environments [26].
PETSc is written in C (thus making it usable from C++); a Fortran interface (very similar to the C one) is also available. PETSc's complete functionality is only exercised by parallel applications, but serial applications are fully supported.
PETSc employs the MPI standard for inter-process communication, thus it is based on the message-passing model for parallel computing. Despite that, PETSc provides high-level interfaces with collective semantics so that typical users rarely have to make message-passing calls themselves.
PETSc is designed with an object-oriented style. Almost all user-visible types are abstract interfaces with implementations that may be chosen at runtime. Those objects are managed through handles to opaque data structures which are created, accessed and destroyed by calling appropriate library routines.
PETSc consists of a variety of components. Each component manipulates a particular family of objects and the operations one would like to perform on these objects. Some of the PETSc modules deal with:

- Index sets, including permutations, indexing into vectors, renumbering, etc.
- Vectors.
- Matrices (generally sparse).
- Distributed arrays for parallelizing regular grid-based problems.
- Krylov subspace methods.
- Preconditioners, including multigrid and sparse direct solvers.
- Nonlinear solvers.
- Timesteppers for solving time-dependent, nonlinear partial differential equations.

PETSc provides a rich environment for modeling scientific applications as well as for rapid algorithm design and prototyping. The libraries enable easy customization and extension of both algorithms and implementations. This approach promotes code reuse and flexibility. Finally, PETSc is designed to be highly modular, enabling interoperability with several specialized parallel libraries like Hypre [27], Trilinos/ML [28], MUMPS [29], and others through a unified interface.

2. MPI for Python

MPI for Python implements the entire specification of the MPI-2 standard, revision 2.2 [30]. Naming conventions of the MPI-2 C++ bindings are adopted, so users familiar with the C++ bindings can use MPI for Python without learning a new interface.

2.1. Communicating Python objects and array data

The Python standard library supports different mechanisms for data persistence. Many of them rely on disk storage, but pickling can also work with raw memory buffers. The pickle module provides user-extensible facilities to serialize general Python objects using ASCII or binary formats. MPI for Python can communicate any built-in or user-defined Python object implementing the pickle protocol. These facilities are used transparently to build binary representations of the objects to communicate (at sending processes) and to restore them back (at receiving processes).
Although simple and general, the serialization approach (i.e., pickling and unpickling) imposes important overheads in memory as well as processor usage, especially when objects with large memory footprints are being communicated. Pickling general Python objects, ranging from primitive or container built-in types to user-defined classes, necessarily requires some processing for dispatching the appropriate serialization method (which depends on the type of the object) and processor usage to perform the actual packing. Additional memory is always needed, and if its total amount is not known a priori, many memory reallocations can occur. Indeed, in the case of large numeric arrays, this is certainly unacceptable and precludes communication of objects occupying half or more of the available memory resources.
MPI for Python also supports direct communication of any object implementing the Python buffer interface. This interface is a standard Python mechanism provided by some types (e.g., byte strings and numeric arrays), allowing access on the C side to a contiguous memory buffer (i.e., address and length) containing the relevant data. This feature, in conjunction with the capability of constructing user-defined MPI datatypes describing complicated memory layouts, enables the implementation of many algorithms involving multidimensional numeric arrays (e.g., image processing, fast Fourier transforms, finite difference schemes on structured Cartesian grids) directly in Python, with negligible space or time overhead compared to compiled C, C++ or Fortran codes.
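To make the two communication mechanisms concrete, the short sketch below is illustrative rather than taken from the paper; it assumes the mpi4py convention in which the lowercase send()/recv() methods handle pickled Python objects while the uppercase Send()/Recv() methods handle buffer-like objects such as NumPy arrays (tags, sizes and the two-process layout are arbitrary choices).

```python
# Minimal mpi4py sketch: pickle-based vs. buffer-based communication.
# Run with, e.g., "mpiexec -n 2 python demo.py" (script name is illustrative).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# 1) Generic Python objects: send()/recv() rely on the pickle protocol.
if rank == 0:
    comm.send({'step': 1, 'data': [1.0, 2.0, 3.0]}, dest=1, tag=11)
elif rank == 1:
    obj = comm.recv(source=0, tag=11)

# 2) Array data: Send()/Recv() use the buffer interface directly,
#    avoiding serialization overhead.
if rank == 0:
    a = np.arange(100, dtype='d')
    comm.Send([a, MPI.DOUBLE], dest=1, tag=22)
elif rank == 1:
    a = np.empty(100, dtype='d')
    comm.Recv([a, MPI.DOUBLE], source=0, tag=22)
```

The buffer-based calls are the ones whose performance is examined in Section 4.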
Two predefined intracommunicator instances are available: COMM_WORLD and COMM_SELF. From them, new communicators can be created as needed. New communicator instances can be obtained with the Clone() method of Comm objects, the Dup() and Split() methods of Intracomm and Intercomm objects, and the Create_intercomm() and Merge() methods of Intracomm and Intercomm objects, respectively.
The associated process group can be retrieved from a communicator by calling the Get_group() method, which returns an instance of the Group class. Set operations with Group objects like Union(), Intersect() and Difference() are fully supported, as well as the creation of new communicators from these groups.

2.2.2. Blocking point-to-point communications
The Send(), Recv() and Sendrecv() methods of communicator objects provide support for blocking point-to-point communications within Intracomm and Intercomm instances. These methods can communicate either general Python objects or raw memory buffers. The buffered, ready, and synchronous modes (Bsend(), Rsend() and Ssend()) are also supported.

2.2.3. Nonblocking point-to-point communications
On many systems, performance can be significantly increased by overlapping communication and computation. This is particularly true on systems where communication can be executed autonomously by an intelligent, dedicated communication controller. Nonblocking communication is a mechanism provided by MPI in order to support such overlap.
The Isend() and Irecv() methods of the Comm class initiate a send and a receive operation, respectively. These methods return a Request instance, uniquely identifying the started operation. Its completion can be managed using the Test(), Wait(), and Cancel() methods of the Request class. The management of Request objects and of the associated memory buffers involved in communication requires careful, rather low-level coordination. Users must ensure that objects exposing their memory buffers are not accessed at the Python level while they are involved in nonblocking message-passing operations.
Often a communication with the same argument list is repeatedly executed within an inner loop. In such cases, communication can be further optimized by using persistent communication, a particular case of nonblocking communication allowing the reduction of the overhead between processes and communication controllers. Furthermore, this kind of optimization can also alleviate the extra call overheads associated with dynamic languages like Python. The Send_init() and Recv_init() methods of the Comm class create a persistent request for a send and a receive operation, respectively. These methods return an instance of the Prequest class, a subclass of the Request class. The actual communication is started with the Start() method.

2.2.4. Collective communications
The Bcast(), Scatter(), Gather(), Allgather() and Alltoall() methods of Intracomm instances provide support for collective communications. Those methods can communicate either general Python objects or raw memory buffers. The "vector" variants (which can communicate a varying amount of data at each process), Scatterv(), Gatherv(), Allgatherv() and Alltoallv(), are also supported; however, they can only communicate objects exposing raw memory buffers. All these collective operations are supported on both intracommunicators and intercommunicators.
Global reduction operations are accessible through the Reduce(), Allreduce(), Scan() and Exscan() methods. All the predefined (i.e., SUM, PROD, MAX, etc.) and user-defined reduction operations can be applied to general Python objects (however, the actual required computations are performed sequentially at some process). User-defined reduction operations on memory buffers are also supported. Reductions are supported on both intracommunicators and intercommunicators (except for the inclusive and exclusive scan operations, which MPI-2 defines only for intracommunicators).

2.2.5. Dynamic process management
In MPI for Python, new independent process groups can be created by calling the Spawn() method within an intracommunicator (i.e., an Intracomm instance). This call returns a new intercommunicator (i.e., an Intercomm instance) at the parent process group. The child process group can retrieve the matching intercommunicator by calling the Get_parent() method defined in the Comm class. On each side, the new intercommunicator can be used to perform point-to-point and collective communications between the parent and child groups of processes.
Alternatively, disjoint groups of processes can establish communication using a client/server approach. Any server application must first call the Open_port() function to open a "port" and the Publish_name() function to publish a provided "service", and next call the Accept() method within an Intracomm instance. Any client application can first find a published "service" by calling the Lookup_name() function, which returns the "port" where a server can be contacted, and next call the Connect() method within an Intracomm instance. Both the Accept() and Connect() methods return an Intercomm instance. When the connection between client/server processes is no longer needed, all of them must cooperatively call the Disconnect() method of the Comm class. Additionally, server applications should release resources by calling the Unpublish_name() and Close_port() functions.

2.2.6. One-sided operations
In MPI for Python, one-sided operations are available through instances of the Win class. New window objects are created by calling the Create() method at every process in a communicator and specifying a memory buffer (i.e., a base address and length). When a window instance is no longer needed, the Free() method should be called.
The three one-sided MPI operations for remote write, read and reduction are available through the methods Put(), Get(), and Accumulate(), respectively, of a Win instance. These methods need an integer rank identifying the target process and an integer offset relative to the base address of the remote memory block being accessed.

2.2.7. Parallel input/output operations
In MPI for Python, all MPI input/output operations are performed through instances of the File class. File handles are obtained by calling the Open() method at every process in a communicator and providing a file name and the intended access mode. After use, they must be closed by calling the Close() method. Files can even be deleted by calling the Delete() method.
After creation, files are typically associated with a per-process view. The view defines the current set of data visible and accessible from an open file as an ordered set of elementary datatypes. This data layout can be set and queried with the Set_view() and Get_view() methods, respectively.
Actual input/output operations are achieved by many methods combining read and write calls with different behavior regarding positioning, coordination, and synchronism. Summing up, MPI for Python supports around 30 different methods defined in MPI-2 for reading from or writing to files using explicit offsets or file pointers (individual or shared), in blocking or nonblocking and collective or noncollective versions.
2.3. Related projects

pyMPI [31] is a pioneering project bringing general-purpose MPI support to Python. It is implemented in C with a modified Python interpreter (as opposed to employing the stock Python interpreter readily available on the computing platform) and a companion MPI module exposing core MPI functionalities and other facilities that are beyond the MPI standard. It permits basic interactive parallel runs, which are useful for learning and debugging, and provides an interface suitable for basic parallel programming. General Python objects supporting the pickle protocol, as well as NumPy arrays, can be communicated. However, there is only partial support for MPI-1 features like nonblocking communication, communication domains, and process topologies. Support for user-defined MPI datatypes is absent. Advanced MPI-2 features like dynamic process management, one-sided communication, and parallel input/output are not available.
Pypar [32] is a minimalistic and intuitive Python interface to MPI. It is a lightweight wrapper implemented with a mixture of high-level Python code and a low-level extension module written in C; the Python interpreter does not require modification. General Python objects can be communicated using the pickle protocol. There is good support for communicating NumPy arrays, and practically full MPI bandwidth can be achieved. However, there is no support for basic MPI-1 features of common use like user-defined MPI datatypes, nonblocking communication, communication domains, and process topologies. Advanced MPI-2 features like dynamic process management, one-sided communication, and parallel input/output are not available.
pyMPI and Pypar provide many of the features available in MPI for Python. However, differences in design and additional capabilities distinguish MPI for Python from these packages. These differences are summarized below.

- MPI for Python does not require a modified Python interpreter. The stock Python interpreter readily available on the computing platform is employed.
- MPI for Python is implemented with Cython (as opposed to low-level C code), facilitating development and maintenance.
- MPI for Python features support for well-known wrapper generator tools like SWIG and F2PY, lowering the barrier to reusing existing C, C++, and Fortran MPI-based libraries.
- The MPI for Python Application Program Interface (API) closely follows the MPI standard specification.
- MPI for Python can communicate general Python objects and NumPy arrays. However, fast communication of array data is not limited to NumPy – any Python type implementing the Python buffer interface can participate.
- MPI for Python supports the complete MPI-1 specification. All blocking/nonblocking point-to-point and collective operations are available, as well as user-defined MPI datatypes, multiple communication domains and Cartesian/graph process topologies.
- MPI for Python supports the complete MPI-2 specification, providing full coverage of advanced features like dynamic process management, one-sided communications, and parallel input/output.

3. PETSc for Python

This section describes PETSc for Python, an open-source software project that provides bindings to the PETSc libraries for the Python programming language.
PETSc for Python is a general-purpose and full-featured package. Its facilities allow sequential and parallel Python applications to exploit state of the art algorithms and data structures readily implemented in PETSc and targeted at large-scale numerical simulations arising in many problems of science and engineering.
PETSc for Python is implemented with Cython. The rationale for this choice is the same as that already discussed in Section 2.
PETSc presents a rather large API accessible from C and Fortran. Although PETSc is designed with an object-oriented style, its API is limited to the procedural style of C and Fortran; these programming languages do not natively support some more advanced concepts such as classes, inheritance and polymorphism, or exception-based error handling. PETSc for Python was designed from the ground up to present users with an easy-to-use, high-level, pythonic interface. This interface is certainly easier and more pleasant to use than the native ones available in PETSc for C and Fortran.
PETSc for Python has coverage for the most important PETSc features. Among them, we can mention assembling distributed vectors and sparse matrices in parallel, solving systems of linear equations with Krylov-based iterative methods and direct methods, and solving systems of nonlinear equations with Newton-based iterative methods including matrix-free techniques. It is not feasible to provide here a detailed listing of all the supported features; the complete API reference is available in the project documentation.
The rest of this section presents a general overview and some examples of the many PETSc concepts and functionalities readily available in PETSc for Python. The examples are simple, self-contained, and implemented in a few lines of Python code. Nevertheless, they show general usage patterns of PETSc for Python for implementing linear algebra algorithms, assembling sparse matrices, and solving systems of linear and nonlinear equations within a Python programming environment.

3.1. Vectors

PETSc for Python provides access to PETSc vectors, index sets and general vector scatter/gather operations through the Vec, IS, and Scatter classes, respectively. By using them, the management of distributed field data is highly simplified in parallel applications.
Besides their use as containers for field data, PETSc vectors also represent algebraic entities of finite-dimensional vector spaces. For this case, the Vec class provides many methods for performing common linear algebra operations, like computing vector updates (axpy(), aypx(), scale()), inner products (dot()) and different kinds of norms (norm()).
Fig. 1 shows a basic implementation of a simple Krylov-based iterative linear solver, the (unpreconditioned) conjugate gradient method.

3.2. Matrices

PETSc for Python provides access to PETSc matrices through the Mat class. New Mat instances are obtained by calling the create() method. Next, the user specifies the row and column sizes by calling the setSizes() method. Finally, a call to the setType() method selects a particular matrix implementation. Matrix entries can be set (or added to existing entries) by calling the setValues() method. PETSc simplifies the assembling of parallel matrices: any process can contribute to any entry; off-process entries are internally cached. Because of this, a final call to the assemblyBegin() and assemblyEnd() methods is required in order to communicate off-process entries to the actual owning process. Additionally, those calls prepare some internal data structures for performing efficient parallel operations like the matrix–vector product. The latter operation is available by calling the mult() method.
Fig. 2 shows the basic steps for creating and assembling a sparse matrix in parallel. The assembled matrix is a discrete representation […]
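Since Fig. 2 itself is not reproduced here, the following is a minimal petsc4py sketch of the assembly sequence just described (create, setSizes, setType, setValues, assembly), applied to an illustrative 1D three-point stencil rather than the paper's example; the matrix size and values are assumptions.

```python
# Parallel sparse matrix assembly with petsc4py.
from petsc4py import PETSc

n = 100                                   # global problem size (illustrative)
A = PETSc.Mat().create()
A.setSizes([n, n])                        # global row and column sizes
A.setType('aij')                          # sparse (CSR-like) storage
A.setUp()

rstart, rend = A.getOwnershipRange()      # rows owned by this process
for i in range(rstart, rend):
    cols, vals = [i], [2.0]               # diagonal entry
    if i > 0:
        cols.append(i - 1); vals.append(-1.0)
    if i < n - 1:
        cols.append(i + 1); vals.append(-1.0)
    A.setValues([i], cols, vals)          # any process may set any entry
A.assemblyBegin()                         # communicate cached off-process
A.assemblyEnd()                           # entries and finalize the matrix

x = PETSc.Vec().createMPI(n)              # compatible distributed vectors
y = x.duplicate()
x.set(1.0)
A.mult(x, y)                              # y = A * x
```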
[…] combination is employed for solving a linear system involving a previously assembled parallel sparse matrix (see Fig. 2).

3.4. Nonlinear solvers

PETSc for Python provides access to PETSc nonlinear solvers through the SNES class.
SNES objects have to be associated with a user-defined Python function in charge of evaluating the nonlinear residual vector and, optionally, a function for the Jacobian matrix evaluation, at each nonlinear iteration step. Those user routines can be set with the methods setFunction() and setJacobian().
New SNES instances are obtained by calling the create() method. This call automatically creates a companion inner linear solver (i.e., a KSP instance) that can be retrieved with the getKSP() method for further manipulations. The setTolerances() method enables the specification of the different tolerances for declaring convergence; other algorithmic parameters can also be set. Additionally, PETSc for Python supports attaching user-defined Python functions for monitoring the iterative process (by calling the setMonitor() method) and for defining a custom convergence criterion (by calling the setConvergenceTest() method).
In order to actually solve a system of nonlinear equations, the solve() method has to be called with appropriate vector arguments (i.e., Vec instances) specifying an optional right-hand side (usually not provided, as it is the zero vector) and the location where to build the solution (which additionally can specify an initial guess for starting the nonlinear loop).
Consider the following boundary value problem in two dimensions:

−ΔU(x) = α exp[U(x)],  x ∈ Ω,
U(x) = 0,  x ∈ ∂Ω,

where Ω is the unit square (0, 1)², ∂Ω is its boundary, Δ is the two-dimensional Laplace operator, U is a scalar field defined on Ω, and α is a constant. The equation is nonlinear and usually called the Bratu problem. The nonlinear system has a bifurcation (turning point) at α_max ≈ 6.80812; there is no solution for α > α_max. The standard 5-point finite difference stencil is employed for performing a spatial discretization on a structured, regularly spaced grid. As a result of the discretization process, a system of nonlinear equations is obtained. Fig. 4 shows the Python implementation of the nonlinear residual function for the Bratu problem and the basic steps required for creating and configuring a nonlinear solver. The inner Krylov linear solver is configured to use the conjugate gradient method. Additionally, the nonlinear solver is configured to use a matrix-free method (i.e., the Jacobian is not explicitly computed).
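Fig. 4 is not reproduced here; the sketch below is a much simpler stand-in (a componentwise nonlinear system rather than the discretized Bratu problem) that exercises the same petsc4py SNES workflow: a Python residual callback, an unpreconditioned CG inner linear solver, and a matrix-free Jacobian requested through the options database. All problem data are illustrative assumptions.

```python
# Matrix-free Newton-Krylov solve of F(x) = x + x**3 - b = 0 with petsc4py.
from petsc4py import PETSc

n = 100
b = PETSc.Vec().createSeq(n); b.set(1.0)      # problem data (illustrative)
x = PETSc.Vec().createSeq(n)                  # solution, initial guess 0
r = PETSc.Vec().createSeq(n)                  # storage for the residual

def residual(snes, X, F):
    # Evaluate F(X) = X + X^3 - b into the vector F.
    xx = X.getArray(readonly=True)
    F.getArray()[:] = xx + xx**3 - b.getArray()

opts = PETSc.Options()
opts['snes_mf'] = 1                           # finite-difference, matrix-free Jacobian

snes = PETSc.SNES().create()
snes.setFunction(residual, r)
snes.setTolerances(rtol=1e-8, max_it=20)
ksp = snes.getKSP()
ksp.setType(PETSc.KSP.Type.CG)                # (unpreconditioned) conjugate gradients
ksp.getPC().setType(PETSc.PC.Type.NONE)
snes.setFromOptions()
snes.solve(None, x)
```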
4. Performance tests

For increasing message sizes, the wall-clock time required for communication is measured many times and then averaged. Throughput is computed as the ratio of message size (in bytes) to the wall-clock time (in seconds) required to accomplish the communication. Python overhead is computed from wall-clock times as (T_Python − T_C)/T_C.
Results obtained on the switched Gigabit Ethernet network are shown in Fig. 5. As expected, the overhead introduced by object serialization degrades overall efficiency. Compared to communication performed in C, the overhead of pickle communication is around 80% for small messages and around 30% for large messages. However, fast communication of array data is quite efficient: the overhead of buffer communication is below 10% for small messages and below 5% for large messages.

[Fig. 5. PingPong – Gigabit Ethernet.]

Results obtained on shared memory are shown in Fig. 6. In this case, the overhead introduced by the Python layer is considerably more noticeable. Compared to communication performed in C, the pickle communication overhead is 150 for small messages and around 5 for large messages, while the overhead of buffer communication is around 2.5 for small messages and around 0.5 for large messages.

[Fig. 6. PingPong – shared memory.]

The second test consisted of wall-clock time measurements of Broadcast and All-to-All collective operations on four processes. Messages were again numeric arrays of double-precision floating-point values.
Results obtained on the switched Gigabit Ethernet network are shown in Fig. 7. For small messages, the overhead of pickle communication is significant (12 for Broadcast and 3.5 for All-to-All); for large messages it is less noticeable. The overhead of buffer communication is small, particularly for All-to-All: less than 10% for all message sizes. Results obtained on shared memory are shown in Fig. 8; they follow the trends of the previous results. However, the overhead of pickle communication is considerably more noticeable for all message sizes.
Finally, it is worth remarking that all the previous tests involved communication of contiguous NumPy arrays. For these objects, the pickle protocol is implemented quite efficiently: the total amount of memory required for serialization is known in advance and the array items have a common data type corresponding to a C primitive type. For more general, user-defined Python objects containing deeply nested data structures, pickle communication is expected to achieve lower performance than reported here.

4.2. PETSc for Python

Consider the following diffusive, unsteady, non-linear, scalar problem in the unit cube Ω = (0, 1)³, with coordinates x = (x₁, x₂, x₃):
∂φ/∂t − ∇·(κ(φ) ∇φ) = G,   on Ω × (0, T],   (1)

with homogeneous Neumann conditions at the boundary Γ = ∂Ω and given initial conditions,

∂φ/∂n = 0 at Γ × [0, T];   φ = φ₀ at t = 0,   (2)

where n is the outer normal to the boundary.
The diffusion coefficient κ depends on φ in the following way:

κ(φ) = 1 if φ ≥ 0;   κ(φ) = 1/(1 + φ²) if φ < 0,   (3)

[…] where φⁿ_{i,j,k} and φⁿ⁺¹_{i,j,k} are φ at the point ((i − 1)Δx₁, (j − 1)Δx₂, (k − 1)Δx₃) and at time levels tⁿ and tⁿ⁺¹ = tⁿ + Δt, respectively, and

L_{i,j,k}(κ, φ) = [κ_{i−1/2} φ[i−1] − (κ_{i−1/2} + κ_{i+1/2}) φ[0] + κ_{i+1/2} φ[i+1]] / (Δx₁)²
               + [κ_{j−1/2} φ[j−1] − (κ_{j−1/2} + κ_{j+1/2}) φ[0] + κ_{j+1/2} φ[j+1]] / (Δx₂)²
               + [κ_{k−1/2} φ[k−1] − (κ_{k−1/2} + κ_{k+1/2}) φ[0] + κ_{k+1/2} φ[k+1]] / (Δx₃)²,   (6)

where, in shorthand notation, [0] = (i, j, k), the subscript i − 1/2 denotes the staggered point (i − 1/2, j, k), and so on, and φ at the staggered points is obtained with the averaging scheme φ_{i−1/2} = ½(φ[i−1] + φ[0]) and so on.
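As a concrete illustration of Eqs. (3) and (6), the NumPy sketch below evaluates the operator L on the interior of a regular grid, with κ evaluated at the staggered points from the averaged φ as described above. This is not the paper's benchmark code (which uses PETSc distributed data structures); the handling of boundary planes is a simplifying assumption.

```python
# Evaluate L_{i,j,k}(kappa, phi) of Eq. (6) on the interior of a 3D grid.
import numpy as np

def kappa(phi):
    # Diffusion coefficient of Eq. (3).
    return np.where(phi >= 0.0, 1.0, 1.0 / (1.0 + phi**2))

def apply_L(phi, dx1, dx2, dx3):
    # kappa at staggered points, using the averaged phi (e.g. phi_{i+1/2}).
    kx = kappa(0.5 * (phi[:-1, :, :] + phi[1:, :, :]))
    ky = kappa(0.5 * (phi[:, :-1, :] + phi[:, 1:, :]))
    kz = kappa(0.5 * (phi[:, :, :-1] + phi[:, :, 1:]))
    L = np.zeros_like(phi)
    # x1-direction terms of Eq. (6); boundary planes are left at zero here
    # (the paper applies homogeneous Neumann conditions there instead).
    L[1:-1, :, :] += (kx[:-1, :, :] * phi[:-2, :, :]
                      - (kx[:-1, :, :] + kx[1:, :, :]) * phi[1:-1, :, :]
                      + kx[1:, :, :] * phi[2:, :, :]) / dx1**2
    # x2-direction terms
    L[:, 1:-1, :] += (ky[:, :-1, :] * phi[:, :-2, :]
                      - (ky[:, :-1, :] + ky[:, 1:, :]) * phi[:, 1:-1, :]
                      + ky[:, 1:, :] * phi[:, 2:, :]) / dx2**2
    # x3-direction terms
    L[:, :, 1:-1] += (kz[:, :, :-1] * phi[:, :, :-2]
                      - (kz[:, :, :-1] + kz[:, :, 1:]) * phi[:, :, 1:-1]
                      + kz[:, :, 1:] * phi[:, :, 2:]) / dx3**2
    return L
```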
5. Application examples

- Performance-critical components developed in traditional scientific programming languages are incorporated into the framework with the help of readily available tools.
- Preprocessing, postprocessing, data analysis and visualization are tightly integrated with the simulation.
- The edit/compile/run cycle is considerably reduced and overall […]

S ∂φ/∂t = ∇·(K ∇φ) + G_aq,   on Ω_aq × (0, t].   (7)

The corresponding unknown for each node is the piezometric height, or the level of the phreatic surface at that point, φ; Ω_aq is the aquifer domain, S the storativity, K the hydraulic conductivity, and G_aq is the source term accounting for rain and for losses from streams or other aquifers.

5.1.2. Surface flow

When velocity variations over the channel cross section are neglected, the flow can be treated as one-dimensional. The equations of mass and momentum conservation on a stream of variable cross section (in conservation form) are [36,37]

∂A/∂t + ∂Q(A)/∂s = G_st,
(1/A) ∂Q/∂t + (1/A) ∂(Q²/A)/∂s + g (S₀ − S_f) + g ∂(h + h_b)/∂s = 0,   on Ω_st × (0, t],   (8)

where A = A(h) is the section of the channel occupied by water for a given water depth h and h_b is the channel bottom elevation. For in[…]

[…]eration due to gravity. The bottom shear stresses are approximated by using the Chèzy or Manning equations,

S_f = v² P(h) / (C_h² A(h))   (Chèzy model);   S_f = n² v² P^{4/3}(h) / A^{4/3}(h)   (Manning model),   (9)

[…] where W are weighting functions. Upon using the SUPG Galerkin finite element discretization procedure with linear triangles and/or bilinear rectangular elements and the trapezoidal rule for time integration, we obtain the system to be solved at each time step [38]

R = K(U^{k+θ}) U^{k+θ} + B(U^{k+θ}) (U^{k+1} − U^k)/Δt − G^{k+θ} = 0,   (14)

where U^{k+θ} = θU^{k+1} + (1 − θ)U^k, U is the state for the coupled problem (i.e., phreatic level and Saint–Venant state variables), θ is the time-weighting factor satisfying 0 ≤ θ ≤ 1, Δt is the time increment and k denotes the number of time steps. K and B are the non-symmetric stiffness matrix and the symmetric mass matrix, respectively (K and B depend on U), G is the source vector and R is the residual vector.

[Figure: schematic of a stream node coupled to aquifer nodes n1–n5 in the x–y plane.]
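To connect Eq. (14) with the PETSc objects discussed in Section 3, the sketch below shows schematically how such a residual could be evaluated with petsc4py vectors and matrices. It is not the PETSc-FEM implementation; assemble_K, assemble_B and the source vector G are assumed to be supplied by the application.

```python
# Schematic evaluation of the residual of Eq. (14) for one time step.
def time_step_residual(U_new, U_old, dt, theta, assemble_K, assemble_B, G):
    # U_{k+theta} = theta*U_{k+1} + (1 - theta)*U_k
    U_th = U_new.duplicate()
    U_old.copy(U_th)               # U_th <- U_old
    U_th.scale(1.0 - theta)
    U_th.axpy(theta, U_new)
    # Discrete time derivative (U_{k+1} - U_k) / dt
    dUdt = U_new.duplicate()
    U_new.copy(dUdt)
    dUdt.axpy(-1.0, U_old)
    dUdt.scale(1.0 / dt)
    # R = K(U_th)*U_th + B(U_th)*dUdt - G
    K = assemble_K(U_th)           # non-symmetric stiffness matrix (Mat)
    B = assemble_B(U_th)           # symmetric mass matrix (Mat)
    R = U_new.duplicate()
    K.mult(U_th, R)                # R <- K*U_th
    tmp = U_new.duplicate()
    B.mult(dUdt, tmp)
    R.axpy(1.0, tmp)               # R <- R + B*dUdt
    R.axpy(-1.0, G)                # R <- R - G  (G evaluated at k+theta)
    return R
```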
5.1.4. Numerical simulations

An example of surface and subsurface flow interaction is presented. The study area represents a third of the total area of Santa Fe province (Argentina), amounting to roughly 33,000 km² (see Fig. 11). A period of 12 months is simulated in which the total precipitation is the annual average observed in recent years (1000 mm/year), but divided into two wet seasons with a rainfall rate of 2000 mm/year (March–April and September–October) and dry seasons of 500 mm/year (the rest of the year).

[Fig. 11. Terrain elevation of the computational domain and its parallel partition.]

At time t = 0 (January 1) the piezometric head in the phreatic aquifer is 30 m above the aquifer bottom, while the water depth in the stream is 10 m above the stream bed. The hydraulic conductivity and storativity of the phreatic aquifer are K = 2 × 10⁻³ m/s and S = 2.5 × 10⁻², respectively. The Manning friction law is adopted for this case. The stream channel roughness is n = 3 × 10⁻³ and the river width is w = 10 m. The average value of the river-wall resistivity is R_f = 10⁵ s. The computational mesh has 1.7 million triangles (see a detail in Fig. 12) and the drainage network has more than 150 branches discretized with 70 thousand elements, giving an average spacing of 100 m between river nodes. The time step adopted in the simulations is Δt = 1 day.
Fig. 13 shows the phreatic elevation on four different days of the simulated period. The phreatic levels increase after the first two dry months (January–February), when the wet period starts (March–April). At the same time, considering the region in the vicinity of the rivers, where the subsurface/surface flow interaction process takes place, we see an increment of the river water level due to the recharge from the elevated aquifer phreatic levels in wet seasons. The opposite process is observed in dry periods.

5.2. Microfluidics: lab-on-a-chip simulations

A lab-on-a-chip (LOC) performs the functions of classical analytical devices in small units of a few square centimeters in size [39]. They are used in a variety of chemical, biological and medical applications. The benefits of LOC are the reduction of consumption
of samples and reagents, shorter analysis time, greater sensitivity, portability and disposability. There has been a huge interest in these devices in the past decade, which has led to a wide commercial range of products.
Historically, the most important techniques developed in LOC devices are electrophoretic separations [40,41]. They are based on the mobility of ions under the action of an external electric field. These techniques are widely used in chemical and biochemical analysis. As microchips for electrophoresis become increasingly complex, simulation tools are required to prototype these devices numerically, as well as to control and optimize handling [42].
A numerical simulation of a two-dimensional electrophoresis (2DE) device is presented. Simulations were carried out by using a 3D time-dependent finite element model for electrophoretic processes in microfluidic chips. Two-dimensional electrophoretic separations consist of two independent mechanisms that are employed sequentially. The separation efficiency is estimated as the product of the independent efficiencies of each method, provided the methods are uncoupled. Two such mechanisms, satisfying uncoupling, are free-flow isoelectric focusing (FFIEF) and capillary zone electrophoresis (CZE). FFIEF is a technique in which an electric field and a pH gradient are established perpendicularly to a flowing sample solution, allowing components to focus at their stable isoelectric point (pI) [43–45]. CZE is based on the application of an electric potential difference along a capillary channel; electric forces then generate electroosmotically driven fluid flow and induce species migration along the channel axis, yielding separation according to their electrophoretic mobilities [46,47].

5.2.1. Modeling

Mathematical modeling of electrophoretic separations carried out on LOC involves fluid, electric and concentration fields, and the strong coupling between them. In this particular case, due to the high reaction rate, the fluid and the electric fields can be treated in a quasi-stationary form, reducing the complexity of the solving process [48]. The set of differential equations solved on the LOC geometry Ω_loc can be summarized as:

∇·u = 0,   on Ω_loc,   (15)
ρ (u·∇u) = ∇·(−p I + μ (∇u + ∇uᵀ)),   on Ω_loc,   (16)
∇·( −F² Σ_{j=1..N} z_j² Ω_j c_j ∇φ − F Σ_{j=1..N} z_j D_j ∇c_j + F Σ_{j=1..N} z_j c_j u ) = 0,   on Ω_loc,   (17)
∂c_j/∂t + ∇·( −z_j Ω_j ∇φ c_j + u c_j − D_j ∇c_j ) − r_j = 0,   on Ω_loc × (0, t].   (18)

Eqs. (15) and (16) are the Navier–Stokes equations for solving the fluid field, where u is the velocity, p is the pressure, and μ is the dynamic viscosity. In this simulation, in order to model the electroosmotic flow, a slip velocity is set as a boundary condition. The magnitude of this velocity (u_eo) is based on the Helmholtz–Smoluchowski approximation [46]:

u_eo = (ε ζ_w / μ) (∇φ),   on Γ_eo,   (19)

where ε is the electric permittivity, ζ_w is the electrokinetic potential of the solid walls Γ_eo, and φ is the electric potential.
Eq. (17) expresses the electric charge conservation for the domain as a combination of migrative, diffusive and advective components for the motion of all charges present in the solution. In this case, for the j-th species, z_j represents the valence, Ω_j is the mobility, c_j the concentration in mol m⁻³, and D_j the diffusion coefficient.
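As a rough order-of-magnitude check of Eq. (19), the snippet below evaluates the Helmholtz–Smoluchowski slip velocity for illustrative property values; the permittivity, wall zeta potential, viscosity and field strength used here are typical textbook figures, not parameters taken from the simulations in this paper.

```python
# Order-of-magnitude evaluation of the Helmholtz-Smoluchowski slip velocity.
eps = 78.5 * 8.854e-12   # electric permittivity of water [F/m] (illustrative)
zeta_w = 0.1             # magnitude of the wall zeta potential [V] (illustrative)
mu = 1.0e-3              # dynamic viscosity of water [Pa*s]
grad_phi = 1.0e4         # magnitude of the applied potential gradient [V/m]

u_eo = eps * zeta_w * grad_phi / mu
print(f"u_eo ~ {u_eo:.2e} m/s")   # about 7e-4 m/s, i.e. fractions of a mm/s
```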
Finally, Eq. (18) is the mass transport equation for a generic j-species, where r_j is the reaction term. Different electrolytes (acids, bases and ampholytes), analytes, and particularly the hydrogen ion have to be considered in order to determine the reaction terms. In electrolyte chemistry the processes of association and dissociation are much faster than the electrokinetic transport processes; hence, it is a good approximation to adopt chemical equilibrium constants to model the reactions of weak electrolytes [49], while strong electrolytes are considered as completely dissociated.
In solving 2DE, amphoteric species are mainly involved. The reactions associated with a generic ampholyte AH can be summarized as [49]:

AH ⇌ A⁻ + H⁺   (forward rate k_a1, backward rate k_a2),   (20)
AH₂⁺ ⇌ AH + H⁺   (forward rate k_b1, backward rate k_b2),   (21)

where k_a1, k_b1 are the dissociation rates, and k_a2, k_b2 the association rates for the ampholyte AH. The equilibrium state is then characterized by

k_a2/k_a1 = [A⁻][H⁺]/[AH] = K_a,   (22)
k_b2/k_b1 = [AH][H⁺]/[AH₂⁺] = K_b,   (23)

where K_a and K_b are the dissociation constants for the equilibrium state, and the square brackets represent the concentration of the given species. The corresponding expressions for r_j are obtained as follows:

r_{A⁻} = −k_a1 [A⁻][H⁺] + k_a2 [AH],   (24)
r_{AH} = k_a1 [A⁻][H⁺] − k_a2 [AH] − k_b1 [AH][H⁺] + k_b2 [AH₂⁺],   (25)
r_{AH₂⁺} = k_b1 [AH][H⁺] − k_b2 [AH₂⁺],   (26)
r_{H⁺} = −k_a1 [A⁻][H⁺] + k_a2 [AH] − k_b1 [AH][H⁺] + k_b2 [AH₂⁺].   (27)

In Eq. (27) the water dissociation term is not included, due to the fact that this reaction is several orders of magnitude faster than reactions (20) and (21) [49]; [OH⁻] can then be calculated directly as

[OH⁻] = K_w / [H⁺],   (28)
where K_w = 10⁻¹⁴ mol² m⁻⁶ is the dissociation constant of pure water at 25 °C.

5.2.2. Simulation results

A 2DE separation involving FFIEF and CZE is simulated on a LOC prototype. FFIEF is carried out in the left part (10 × 3500 × 7000 μm³); the samples then flow through five CZE channels (10 × 1000 × 16,000 μm³) in the right part. The mesh consists of 175 thousand linear triangles, with a total of 6 million degrees of freedom.
Boundary conditions for the fluid field are those stated by Eq. (19), and the pressure is set to 0 Pa at the outlets. In the case of the electric field, Dirichlet boundary conditions are set where the electric potential is applied, and natural Neumann conditions are set on the other walls. Finally, for the concentration field, advective flux is set at the inlets and outlets, and natural Neumann conditions are set on the walls.
The applied electric potential differences (Fig. 14(a)) are fixed during the operation to provide the system with a transverse electric field in the FFIEF region and an axial electric field in the CZE channels. The pH gradient for FFIEF is established by focusing 20 ampholytes between two sheath flows of basic and acidic solutions. A concentrated basic buffer solution is continuously injected from the inlet at the right. When stationary conditions are reached, a near-linear pH gradient is developed (Fig. 14(b)).
The proposed numerical prototype is employed to separate a sample of 10 proteins. Proteins are injected from the central channel. After a few seconds, the different bands of isoelectric points are developed. In this particular case there are eight bands; consequently, there are three or four proteins that cannot be effectively separated by FFIEF, thus CZE is employed as an additional separation method. After leaving the FFIEF chamber, the proteins separate electrophoretically, completing the successful separation of the 10 sample compounds. Total sample distributions at two different instants of time during the separation process are shown in Fig. 15.
One of the aims of this tool is to provide information about the separative performance of LOC. In analytical and bioanalytical chemistry, the separative performance of two-dimensional electrophoresis assays can be evaluated by using a customary representation in a two-dimensional map. This map contains information on the isoelectric points and the electrophoretic mobilities of the analytes present in the sample. Results in this format are shown in Fig. 16.

6. Conclusions

Python is an attractive language for rapid development of small scripts and code prototypes as well as large applications and highly portable and reusable modules and libraries. Running Python on parallel computers is a feasible alternative for decreasing the costs of software development targeted to HPC systems.
In this work, two software components facilitating the access to parallel distributed computing resources within a Python programming environment were presented: MPI for Python and PETSc for Python. These packages are able to support serious medium and large scale parallel applications.
Efficiency tests have shown that performance degradation is not prohibitive. In comparison to pure C codes, MPI for Python can communicate Python array data at nearly full speed over Gigabit Ethernet and at around half speed over shared memory channels. The PETSc for Python overhead is consistently less than 10%.
This software suite is supporting research activities in a variety of fields. Application examples related to finite element simulations of hydrology and microfluidics problems were presented.

Acknowledgments

The authors extend sincere thanks to Christopher Kees, Jed Brown, and the anonymous reviewer for their kind advice, insightful comments and constructive suggestions.
This work has received financial support from Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET, Argentina, Grant PIP 5271/05), Universidad Nacional del Litoral (UNL, Argentina, Grant CAI+D 2009 65/334), and Agencia Nacional de Promoción Científica y Tecnológica (ANPCyT, Argentina, Grants PICT 01141/2007, PICT 0270/2008, PICT-1506/2006).

Appendix A. Project development and support

MPI for Python and PETSc for Python are active software projects. New features and enhancements are added on a regular basis in order to keep the Python interfaces in accordance with the updates of the MPI standard and PETSc. The testing process is supported by automated unit testing based on the unittest package from the Python standard library. These tests are run regularly on a variety of platforms and computer architectures.
MPI for Python is hosted on the Google Code project hosting service (http://mpi4py.googlecode.com). This service provides a version control repository (http://mpi4py.googlecode.com/svn/), an issue tracker (http://code.google.com/p/mpi4py/issues/list), and release downloads (http://code.google.com/p/mpi4py/downloads/). The Google Groups service hosts an on-line discussion and support forum (http://groups.google.com/group/mpi4py) and a mailing list (mpi4py@googlegroups.com).
PETSc for Python is hosted on the Google Code project hosting service (http://petsc4py.googlecode.com). This service provides a version control repository (http://petsc4py.googlecode.com/hg/), an issue tracker (http://code.google.com/p/petsc4py/issues/list), and release downloads (http://code.google.com/p/petsc4py/downloads/). PETSc for Python uses the same support channels as PETSc (the petsc-users@mcs.anl.gov, petsc-maint@mcs.anl.gov, and petsc-dev@mcs.anl.gov mailing lists).
References
[5] Millman KJ, Aivazis M. Python for scientists and engineers. Comput Sci Eng 2011;13(2):9–12. doi:10.1109/MCSE.2011.36.
[6] Pérez F, Granger B, Hunter J. Python: an ecosystem for scientific computing. Comput Sci Eng 2011;13(2):13–21. doi:10.1109/MCSE.2010.119.
[7] van der Walt S, Colbert S, Varoquaux G. The NumPy array: a structure for efficient numerical computation. Comput Sci Eng 2011;13(2):22–30. doi:10.1109/MCSE.2011.37.
[8] Behnel S, Bradshaw R, Citro C, Dalcin L, Seljebotn D, Smith K. Cython: the best of both worlds. Comput Sci Eng 2011;13(2):31–9. doi:10.1109/MCSE.2010.118.
[9] Ramachandran P, Varoquaux G. Mayavi: 3D visualization of scientific data. Comput Sci Eng 2011;13(2):40–51. doi:10.1109/MCSE.2011.35.
[10] Oliphant T. NumPy: numerical Python; 2005–2010. <http://numpy.scipy.org/>.
[11] Peterson P. F2PY: Fortran to Python interface generator; 2000–2010. <http://cens.ioc.ee/projects/f2py2e/>.
[12] Cython Team. Cython: C-extensions for Python; 2007–2010. <http://www.cython.org>.
[13] Beazley DM. SWIG: simplified wrapper and interface generator; 1996–2010. <http://www.swig.org/>.
[14] Beazley DM, Lomdahl PS. Feeding a large scale physics application to Python. In: Proceedings of the 6th international Python conference, San Jose, California; 1997. p. 21–9.
[15] Kadau K, Germann TC, Lomdahl PS. Molecular dynamics comes of age: 320 billion atom simulation on BlueGene/L. Int J Modern Phys C 2006;17:1755–61.
[16] MPI Forum. MPI: a message passing interface standard. Int J Supercomput Appl 1994;8(3/4):159–416.
[17] MPI Forum. MPI-2: a message passing interface standard. High Perform Comput Appl 1998;12(1–2):1–299.
[18] Snir M, Otto S, Huss-Lederman S, Walker D, Dongarra J. MPI – the complete reference: the MPI core. Scientific and engineering computation, vol. 1. Cambridge, MA, USA: MIT Press; 1998.
[19] Gropp W, Huss-Lederman S, Lumsdaine A, Lusk E, Nitzberg B, Saphir W, et al. MPI – the complete reference: the MPI-2 extensions. Scientific and engineering computation, vol. 2. Cambridge, MA, USA: MIT Press; 1998.
[20] MPICH2 Team. MPICH2: a portable implementation of MPI; 2003–2010. <http://www-unix.mcs.anl.gov/mpi/mpich2/>.
[21] Gropp W, Lusk E, Doss N, Skjellum A. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Comput 1996;22(6):789–828.
[22] Open MPI Team. Open MPI: open source high performance computing; 2004–2010. <http://www.open-mpi.org/>.
[23] Gabriel E, Fagg GE, Bosilca G, Angskun T, Dongarra JJ, Squyres JM, et al. Open MPI: goals, concept, and design of a next generation MPI implementation. In: Proceedings of the 11th European PVM/MPI users' group meeting, Budapest, Hungary; 2004. p. 97–104.
[24] Balay S, Buschelman K, Gropp WD, Kaushik D, Knepley MG, McInnes LC, et al. PETSc web page; 2010. <http://www.mcs.anl.gov/petsc>.
[25] Balay S, Buschelman K, Eijkhout V, Gropp WD, Kaushik D, Knepley MG, et al. PETSc users manual. Tech. Rep. ANL-95/11 – Revision 3.1. Argonne National Laboratory; 2010.
[26] Balay S, Gropp WD, McInnes LC, Smith BF. Efficient management of parallelism in object oriented numerical software libraries. In: Arge E, Bruaset AM, Langtangen HP, editors. Modern software tools in scientific computing. Birkhäuser Press; 1997. p. 163–202.
[27] Falgout R, Jones J, Yang U. The design and implementation of hypre, a library of parallel high performance preconditioners. In: Numerical solution of partial differential equations on parallel computers, vol. 51. Springer-Verlag; 2006. p. 267–94.
[28] Heroux M, Bartlett R, Howle V, Hoekstra R, Hu J, Kolda T, Lehoucq R, et al. An overview of Trilinos. Tech. Rep. SAND2003-2927. Sandia National Laboratories; 2003.
[29] Amestoy PR, Duff IS, L'Excellent J-Y, Koster J. A fully asynchronous multifrontal solver using distributed dynamic scheduling. SIAM J Matrix Anal Appl 2001;23(1):15–41.
[30] MPI Forum. MPI: a message passing interface standard, version 2.2; 2009. <http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf>.
[31] Miller P. pyMPI project page; 2000–2011. <http://pympi.sourceforge.net/>.
[32] Nielsen O. Pypar project page; 2002–2011. <http://code.google.com/p/pypar/>.
[33] Sonzogni VE, Yommi AM, Nigro NM, Storti MA. A parallel finite element program on a Beowulf cluster. Adv Eng Softw 2002;33(7–10):427–43.
[34] Storti MA, Nigro N, Paz R. PETSc-FEM: a general purpose, parallel, multi-physics FEM program; 1999–2010. <http://www.cimec.org.ar/petscfem>.
[35] Rodríguez L. Investigation of stream–aquifer interactions using a coupled surface-water and ground-water flow model. PhD thesis. University of Arizona; 1995.
[36] Whitham G. Linear and nonlinear waves. Pure and applied mathematics: a Wiley-Interscience series of texts, monographs, and tracts. John Wiley & Sons Inc.; 1974.
[37] Hirsch C. Numerical computation of internal and external flows. Wiley series in numerical methods in engineering, vol. II. John Wiley & Sons Inc.; 1990.
[38] Paz R, Storti M. An interface strip preconditioner for domain decomposition methods: application to hydrology. Int J Numer Methods Eng 2005;62(13):1873–94.
[39] Manz A, Graber N, Widmer H. Miniaturized total chemical analysis systems: a novel concept for chemical sensing. Sensor Actuator B 1990;1:244–8.
[40] Landers JP. Handbook of capillary and microchip electrophoresis and associated microtechniques. 3rd ed. CRC Press; 2007.
[41] Tian W-C, Finehout E. Microfluidics for biological applications. 1st ed. Springer; 2008.
[42] Erickson D. Towards numerical prototyping of labs-on-chip: modeling for integrated microfluidic devices. Microfluid Nanofluid 2005;1(4):301–18.
[43] Kohlheyer D, Eijkel JCT, van den Berg A, Schasfoort RBM. Miniaturizing free-flow electrophoresis – a critical review. Electrophoresis 2008;29(5):977–93.
[44] Turgeon RT, Bowser MT. Micro free-flow electrophoresis: theory and applications. Anal Bioanal Chem 2009;394(1):187–98.
[45] Sommer G, Hatch A. IEF in microfluidic devices. Electrophoresis 2009;30:742–57.
[46] Probstein R. Physicochemical hydrodynamics: an introduction. 2nd ed. Wiley-Interscience; 2003.
[47] Hunter R. Foundations of colloid science. 2nd ed. Oxford University Press; 2001.
[48] Kler PA, Berli CLA, Guarnieri FA. Modelling and high performance simulation of electrophoretic techniques in microfluidic chips. Microfluid Nanofluid 2010;10(1):187–98.
[49] Arnaud I, Josserand J, Rossier J, Girault H. Finite element simulation of off-gel buffering. Electrophoresis 2002;23:3253–61.