Message Passing Interface
Message Passing Interface (MPI) is a specification for software that allows many computers to communicate with one another. It is commonly used in computer clusters.
Overview
MPI is a language-independent communications protocol used to program parallel computers.
MPI's goals are high performance, scalability, and portability. While it is generally considered to have been successful in meeting these goals, it has also been criticized for being too low level and difficult to use. Despite this complaint, it remains a crucial part of parallel programming[citation needed], since no effective alternative has come forth to take its place.
MPI is not sanctioned by any major standards body; nevertheless, it has become the de facto standard for communication among processes that model a parallel program running on a distributed memory system. Actual distributed memory supercomputers such as computer clusters often run these programs. The principal MPI-1 model has no shared memory concept, and MPI-2 has only a limited distributed shared memory concept.
Although MPI belongs in layers 5 and higher of the OSI Reference Model, implementations may cover most layers of the reference model, with sockets and TCP being used in the transport layer.
Most MPI implementations consist of a specific set of routines (an API) callable from Fortran, C, or C++ and from any language capable of interfacing with such routine libraries. The advantages of MPI over older message passing libraries are portability (because MPI has been implemented for almost every distributed memory architecture) and speed (because each implementation is in principle optimized for the hardware on which it runs). MPI is supported on shared-memory and NUMA (Non-Uniform Memory Access) architectures as well, where it serves both as an important portability layer and as a way to achieve high performance in applications that are naturally owner-computes oriented.
MPI is a specification, not an implementation. MPI has Language Independent Specifications (LIS) for the function calls and language bindings. The first MPI standard specified ANSI C and Fortran-77 language bindings together with the LIS. The draft of this standard was presented at Supercomputing 1994 (November 1994) and finalized soon thereafter. About 128 functions comprise the MPI-1.2 standard as it is now defined.
There are two versions of the standard that are currently popular[citation needed]: version 1.2, which emphasizes message passing and has a static runtime environment (a fixed size of world), and MPI-2.1, which includes new features such as scalable file I/O, dynamic process management, and collective communication between two groups of processes. MPI-2's LIS specifies over 500 functions and provides language bindings for ANSI C, ANSI Fortran (Fortran90), and ANSI C++. Interoperability of objects defined in MPI was also added to allow for easier mixed-language message passing programming. A side effect of MPI-2 standardization (completed in 1996) was clarification of the MPI-1 standard, creating the MPI-1.2 level.
MPI-1.2 programs, now deemed "legacy MPI-1 programs," still work under the MPI-2 standard, although some functions have been deprecated. This matters because many older programs use only the MPI-1 subset.
MPI is often compared with PVM, which is a popular distributed environment and message passing system developed in 1989, and which was one of the systems that motivated the need for standard parallel message passing systems. Most computer science students who study parallel programming are taught both Pthreads and MPI programming as complementary programming models.[citation needed]
Functionality
The MPI interface is meant to provide essential virtual topology, synchronization, and communication functionality between a set of processes (that have been mapped to nodes/servers/computer instances) in a language-independent way, with language-specific syntax (bindings), plus a few features that are language-specific. MPI programs always work with processes, although people commonly talk about processors. When one tries to get maximum performance, one process per processor (or, more recently, per core) is selected as part of the mapping activity; this mapping happens at runtime, through the agent that starts the MPI program, normally called mpirun or mpiexec.
Such functions include, but are not limited to, point-to-point rendezvous-type send/receive operations, choosing between a Cartesian or graph-like logical process topology, exchanging data between process pairs (send/receive operations), combining partial results of computations (gather and reduce operations), synchronizing nodes (barrier operation), as well as obtaining network-related information such as the number of processes in the computing session, the identity of the processor to which a process is mapped, neighboring processes accessible in a logical topology, and so on. Point-to-point operations come in synchronous, asynchronous, buffered, and ready forms, to allow both relatively stronger and weaker semantics for the synchronization aspects of a rendezvous-send. Many outstanding operations are possible in asynchronous mode in most implementations.
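As an illustration of these send modes, here is a minimal sketch, assuming a job of at least two processes, in which rank 0 delivers the same integer to rank 1 with the standard, synchronous, and ready sends; the buffered mode, MPI_Bsend, would additionally require MPI_Buffer_attach and is omitted.
/* Sketch: rank 0 sends an integer to rank 1 using three of the blocking
   send modes described above. */
#include <mpi.h>
int main(int argc, char *argv[])
{
    int myid, value = 42;
    MPI_Status  stat;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) {
        MPI_Send (&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* standard mode */
        MPI_Ssend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* synchronous: completes only after the matching receive has started */
        MPI_Barrier(MPI_COMM_WORLD);                          /* ensure rank 1 has posted its receive... */
        MPI_Rsend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* ...so that the "ready" send is legal */
    } else if (myid == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &stat);
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &stat);
        MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req); /* post the receive before the barrier */
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Wait(&req, &stat);
    } else {
        MPI_Barrier(MPI_COMM_WORLD); /* other ranks only take part in the barrier */
    }
    MPI_Finalize();
    return 0;
}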
MPI guarantees that asynchronous messages make progress independently of the subsequent MPI calls made by user processes (threads). This rule is often neglected in practical implementations, but it is an important underlying principle when one considers using asynchronous operations. The relative value of overlapping communication and computation, of asynchronous vs. synchronous transfers, and of low-latency vs. low-overhead communication remain important controversies in the MPI user and implementer communities, although recent advances in multi-core architecture are likely to re-enliven this debate. MPI-1 and MPI-2 both enable implementations that do good work in overlapping communication and computation, but practice and theory differ. MPI also specifies thread-safe interfaces, which have cohesion and coupling strategies that help avoid the manipulation of unsafe hidden state within the interface. As such, it is relatively easy to write multithreaded point-to-point MPI code, and some implementations support such code. Multithreaded collective communication is best accomplished by using multiple copies of communicators, as described below.
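The overlap of communication and computation referred to above is typically expressed with the non-blocking calls, as in the following minimal sketch (again assuming at least two processes):
/* Sketch: overlap communication and computation with non-blocking calls.
   Rank 0 starts a receive, does unrelated work, then waits for the data. */
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
    int myid, data = 0;
    MPI_Request req;
    MPI_Status  stat;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) {
        MPI_Irecv(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req); /* start the receive */
        /* ... computation that does not touch 'data' can proceed here ... */
        MPI_Wait(&req, &stat);                                    /* block only when the data is actually needed */
        printf("rank 0 received %d\n", data);
    } else if (myid == 1) {
        data = 99;
        MPI_Send(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}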
Concepts
There are nine basic concepts of MPI, five of which are only applicable to MPI-2.
Although MPI has many functions, a few concepts are very important, and these concepts, taken a few at a time, help people learn MPI quickly and decide what functionality to use in their application programs.
Communicator
Communicators are groups of processes in the MPI session, each with a rank ordering and its own virtual communication fabric for point-to-point operations. They also have an independent communication address space for collective communication. MPI also has explicit groups, but these are mainly good for organizing and reorganizing subsets of processes before another communicator is made. MPI understands single-group intracommunicator operations and bi-partite (two-group) intercommunicator communication. In MPI-1, single-group operations are most prevalent, with bi-partite operations finding their biggest role in MPI-2, where their use is expanded to include collective communication and dynamic process management.
Communicators can be partitioned using several MPI commands. These include MPI_COMM_SPLIT, a graph-coloring-type operation that is commonly used to derive topological and other logical subgroupings in an efficient way.
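As a minimal illustration, the following sketch splits MPI_COMM_WORLD into sub-communicators of even-ranked and odd-ranked processes; the color argument plays the role of the graph coloring mentioned above:
/* Sketch: partition MPI_COMM_WORLD into two sub-communicators,
   one holding the even world ranks and one holding the odd ones. */
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
    int world_rank, sub_rank;
    MPI_Comm subcomm;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    /* processes passing the same 'color' end up in the same new communicator;
       the 'key' argument orders the ranks inside it */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &subcomm);
    MPI_Comm_rank(subcomm, &sub_rank);
    printf("world rank %d has rank %d in its sub-communicator\n", world_rank, sub_rank);
    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}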
Point-to-point basics
Collective basics
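Collective operations involve every process in a communicator, and every participant must make the same call. As a minimal illustrative sketch, the program below broadcasts a parameter from rank 0 and then sums one contribution per process back onto rank 0:
/* Sketch: broadcast a value from rank 0, then sum one contribution
   per process back onto rank 0 with a reduction. */
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
    int myid, numprocs, param = 0, contribution, total = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    if (myid == 0)
        param = 10;                 /* only the root knows the value initially */
    MPI_Bcast(&param, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* now every rank has it */
    contribution = param + myid;    /* each rank computes something local */
    MPI_Reduce(&contribution, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0)
        printf("sum over %d processes: %d\n", numprocs, total);
    MPI_Finalize();
    return 0;
}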
One-sided communication (MPI-2)
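As a minimal illustrative sketch of the MPI-2 one-sided model (assuming at least two processes), each process below exposes one integer in a window; rank 0 then writes directly into rank 1's window with MPI_Put, and MPI_Win_fence calls delimit the access epoch:
/* Sketch: MPI-2 one-sided communication.  Every rank exposes one integer
   in a window; rank 0 puts a value directly into rank 1's window. */
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
    int myid, local = -1, value = 123;
    MPI_Win win;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Win_create(&local, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);
    MPI_Win_fence(0, win);                    /* open the access epoch */
    if (myid == 0)
        MPI_Put(&value, 1, MPI_INT, 1 /* target rank */, 0 /* displacement */,
                1, MPI_INT, win);             /* no receive call on rank 1 */
    MPI_Win_fence(0, win);                    /* close the epoch; the data is now visible */
    if (myid == 1)
        printf("rank 1 window now holds %d\n", local);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}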
Collective extensions (MPI-2)
Dynamic process management (MPI-2)
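As a minimal illustrative sketch, the program below spawns two copies of a hypothetical worker executable (./worker is assumed to exist) and obtains an intercommunicator that connects the parent group to the children:
/* Sketch: MPI-2 dynamic process management.  The parent job spawns two
   copies of a hypothetical "./worker" executable; the resulting
   intercommunicator links the parent group to the children. */
#include <mpi.h>
int main(int argc, char *argv[])
{
    MPI_Comm children;
    int errcodes[2];
    MPI_Init(&argc, &argv);
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                   0 /* root rank */, MPI_COMM_WORLD, &children, errcodes);
    /* the parent can now talk to the children through the
       intercommunicator, e.g. with point-to-point calls or collectives */
    MPI_Finalize();
    return 0;
}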
MPI I/O (MPI-2)
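As a minimal illustrative sketch, the program below has each process write its own rank into a shared file (the file name out.dat is arbitrary) at a rank-dependent offset, so that no two processes overlap:
/* Sketch: MPI-2 parallel I/O.  Each rank writes one integer into a shared
   file at an offset derived from its rank. */
#include <mpi.h>
int main(int argc, char *argv[])
{
    int myid;
    MPI_File fh;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* each rank writes to its own, non-overlapping offset */
    MPI_File_write_at(fh, (MPI_Offset)myid * sizeof(int),
                      &myid, 1, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}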
Miscellaneous improvements of MPI-2
Guidelines for writing multithreaded MPI-1 and MPI-2 programs
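As one concrete illustration, a program that intends to make MPI calls from several threads requests a thread support level at start-up with MPI_Init_thread instead of MPI_Init, and checks which level the implementation actually grants; the following is a minimal sketch:
/* Sketch: requesting a thread support level instead of plain MPI_Init.
   The implementation reports the level it can actually provide. */
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
    int provided;
    /* ask for full multithreaded support; the lesser levels are
       MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED and MPI_THREAD_SERIALIZED */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        printf("warning: this MPI only provides thread level %d\n", provided);
    /* ... threads may now be created; what they may call depends on 'provided' ... */
    MPI_Finalize();
    return 0;
}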
Implementations
'Classical' cluster and supercomputer implementations
The implementation language of MPI is, in general, different from the language or languages it seeks to support at runtime. Most MPI implementations are written in a combination of C, C++, and assembly language, and target C, C++, and Fortran programmers. However, the implementation language and the end-user language are in principle always decoupled.
The initial implementation of the MPI 1.x standard was MPICH, from Argonne National Laboratory (correctly pronounced MPI-C-H, not as a single syllable) and Mississippi State University. IBM was also an early implementor of the MPI standard, and most supercomputer companies of the early 1990s either commercialized MPICH or built their own implementation of the MPI 1.x standard. LAM/MPI from the Ohio Supercomputer Center was another early open implementation. Argonne National Laboratory has continued developing MPICH for over a decade, and now offers MPICH 2, an implementation of the MPI-2.1 standard. LAM/MPI and a number of other MPI efforts recently merged to form a new worldwide project, Open MPI; the name does not imply any connection with a particular form of the standard. There are many other efforts derived from MPICH, LAM, and other works, too numerous to name here. Recently, Microsoft added an MPI effort to its Cluster Computing Kit (2005), based on MPICH 2. MPI remains a vital interface for concurrent programming to this date.
Many Linux distributions include MPI (MPICH and/or LAM, as particular examples), but it is best to get the newest versions from the MPI developers' sites. Many vendors distribute customized versions of existing free software implementations which focus on better performance and stability.
Beyond its mainstream use in high-performance programming, MPI has been used widely with Python, Perl, and Java, and these communities are growing. MATLAB-based MPI appears in many forms, but no consensus on a single way of using MPI with MATLAB yet exists. The next sections detail some of these efforts.
Python
There are at least five known attempts to implement MPI for Python: mpi4py, PyPar, PyMPI, MYMPI, and the MPI submodule in ScientificPython. PyMPI is notable because it is a variant Python interpreter: the multi-node application is the interpreter itself, rather than code the interpreter runs. PyMPI implements most of the MPI specification and automatically works with compiled code that needs to make MPI calls. PyPar, MYMPI, and ScientificPython's module are all designed to work like a typical module, used with nothing but an import statement. They make it the coder's job to decide when and where the call to MPI_Init belongs.
OCaml
The OCamlMPI module implements a large subset of MPI functions and is in active use in scientific computing. To give a sense of its maturity, it was reported on caml-list that an eleven-thousand-line OCaml program was "MPI-ified" using the module, with an additional 500 lines of code and slight restructuring, and ran with excellent results on up to 170 nodes of a supercomputer.
Java
Although Java does not have an official MPI binding, there have been several attempts to bridge Java with MPI, with different degrees of success and compatibility. One of the first attempts was Bryan Carpenter's mpiJava, essentially a collection of JNI wrappers to a local C MPI library, resulting in a hybrid implementation with limited portability, which also has to be recompiled against the specific MPI library being used.
However, this original project also defined the mpiJava API (a de facto MPI API for Java that closely follows the equivalent C++ bindings), which other subsequent Java MPI projects followed. An alternative, though less used, API is the MPJ API, designed to be more object-oriented and closer to Sun Microsystems' coding conventions. Beyond the API used, Java MPI libraries can either depend on a local MPI library or implement the message passing functions in Java, while some, like P2P-MPI, also provide peer-to-peer functionality and allow mixed-platform operation (e.g. mixed Linux and Windows clusters).
Some of the most challenging parts of any MPI implementation for Java arise from the language's own limitations and peculiarities, such as the lack of explicit pointers and of a linear memory address space for its objects, which make transferring multi-dimensional arrays and complex objects inefficient. The workarounds usually used involve transferring one line at a time and/or performing explicit de-serialization and casting at both the sending and receiving ends, simulating C- or Fortran-like arrays by the use of a one-dimensional array, and simulating pointers to primitive types by the use of single-element arrays, resulting in programming styles quite removed from Java's conventions.
One major recent development is MPJ Express by Aamir Shafi, a project supervised by Bryan Carpenter and Mark Baker. On commodity platforms like Fast Ethernet, advances in JVM technology now enable networking applications written in Java to rival their C counterparts. On the other hand, improvements in specialized networking hardware have continued, cutting communication costs down to a couple of microseconds. Keeping both in mind, the key issue at present is not to debate the JNI approach versus the pure Java approach, but to provide a flexible mechanism for applications to swap communication protocols. The aim of the project is to provide a reference Java messaging system based on the MPI standard. The implementation follows a layered architecture based on an idea of device drivers, analogous to UNIX device drivers.
Microsoft Windows
Windows Compute Cluster Server uses the Microsoft Message Passing Interface v2 (MS-MPI) to communicate between the processing nodes on the cluster network. The application programming interface consists of over 160 functions. MS-MPI was designed, with some exceptions because of security considerations, to cover the complete set of MPI-2 functionality as implemented in MPICH2. Dynamic process spawning and publishing are planned for the future.
MATLAB
Hardware implementations
There has been research over time into implementing MPI directly into the hardware of the system, for example by means of Processor-in-memory, where the MPI operations are actually built into the microcircuitry of the RAM chips in each node. By implication, this type of implementation would be independent of the language, OS or CPU on the system, but cannot be readily updated or unloaded.
Another approach has been to add hardware acceleration to one or more parts of the operation. This may include hardware processing of the MPI queues or the use of RDMA to directly transfer data between memory and the network interface without needing CPU or kernel intervention.
Example program
Here is "Hello World" in MPI written in C. In this example, we send a "hello" message to each processor, manipulate it trivially, send the results back to the main process, and print the messages out.
/*
  "Hello World" Type MPI Test Program
*/
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define BUFSIZE 128
#define TAG 0

int main(int argc, char *argv[])
{
    char idstr[32];
    char buff[BUFSIZE];
    int numprocs;
    int myid;
    int i;
    MPI_Status stat;

    MPI_Init(&argc, &argv);                   /* all MPI programs start with MPI_Init; all 'N' processes exist thereafter */
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs); /* find out how big the SPMD world is */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);     /* and this process's rank within it */

    /* At this point, all the programs are running equivalently; the rank is used to
       distinguish the roles of the programs in the SPMD model, with rank 0 often used
       specially... */
    if (myid == 0)
    {
        printf("%d: We have %d processors\n", myid, numprocs);
        for (i = 1; i < numprocs; i++)
        {
            sprintf(buff, "Hello %d! ", i);
            MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
        }
        for (i = 1; i < numprocs; i++)
        {
            MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);
            printf("%d: %s\n", myid, buff);
        }
    }
    else
    {
        /* receive from rank 0: */
        MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &stat);
        sprintf(idstr, "Processor %d ", myid);
        strcat(buff, idstr);
        strcat(buff, "reporting for duty\n");
        /* send to rank 0: */
        MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
    }

    MPI_Finalize();                           /* MPI programs end with MPI_Finalize; this is a weak synchronization point */
    return 0;
}
The runtime environment for the MPI implementation used (often called mpirun or mpiexec) spawns multiple copies of this program text, with the total number of copies determining the number of process ranks in MPI_COMM_WORLD, which is an opaque descriptor for communication among the set of processes. A Single-Program-Multiple-Data (SPMD) programming model is thereby facilitated, but not required; many MPI implementations allow multiple, different executables to be started in the same MPI job. Each process knows its own rank and the total number of processes in the world, and can communicate with the others either by point-to-point (send/receive) communication or by collective communication among the group. It is enough for MPI to provide an SPMD-style program with MPI_COMM_WORLD, its own rank, and the size of the world for algorithms to decide what to do based on their rank. In more realistic examples, additional real-world I/O is of course needed. MPI does not guarantee how POSIX I/O, as used in the example, would actually work on a given system, but it commonly does work, at least from rank 0. Even where it does work, POSIX I/O such as printf() is not particularly scalable and should be used sparingly.
MPI uses the notion of process rather than processor. Copies of the program are mapped to processors by the MPI runtime environment. In that sense, the parallel machine can map to one physical processor, or to N, where N is the total number of processors available, or to something in between. For maximal potential parallel speedup, more physical processors are used, but the ability to separate the mapping from the design of the program is an essential value for development, as well as for practical situations where resources are limited. Note also that this example adjusts its behavior to the size of the world N, so it scales to whatever size is given at runtime. There is no separate compilation for each size of concurrency, although different decisions might be taken internally depending on the absolute amount of concurrency provided to the program.
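In practice, with a typical implementation, a program such as the one above is compiled with a wrapper script (commonly named mpicc, though the name varies between implementations) and started with a command along the lines of mpiexec -n 4 ./hello, which would produce one "reporting for duty" line for each non-zero rank.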
Adoption of MPI-2
While the adoption of MPI-1.2 has been universal, including in almost all cluster computing, the acceptance of MPI-2.1 has been more limited. Some of the reasons follow.
- While MPI-1.2 emphasizes message passing and a minimal, static runtime environment, full MPI-2 implementations include I/O and dynamic process management, and the size of the middleware implementation is substantially larger. Furthermore, most sites that use batch scheduling systems cannot support dynamic process management. Parallel I/O is well accepted as a key value of MPI-2.
- Many legacy MPI-1.2 programs were already developed by the time MPI-2 came out, and they work fine. The threat of potentially lost portability by using MPI-2 functions kept people from using the enhanced standard for many years, though this is lessening in the mid-2000s, with wider support for MPI-2.
- Many MPI-1.2 applications use only a subset of that standard (16-25 functions). This minimalism of use contrasts with the huge availability of functionality now afforded in MPI-2.
Other inhibiting factors can be cited too, although these may amount more to perceptions and belief than fact. MPI-2 has been well supported in free and commercial implementations since at least the early 2000s, with some implementations coming earlier than that.
The future of MPI
There are several schools of thought on this. The MPI Forum was dormant for nearly a decade but maintained its mailing list. In late 2006, however, the list was revived for the purpose of clarifying MPI-2 issues and possibly defining a new standard level. On February 9, 2007, work on the "MPI-2.1" standard was kicked off, with a new web presence and a new mailing list.[2] Its initial scope is to revive errata discussions, renew membership and interest, and then explore future opportunities.
- MPI as a legacy interface is guaranteed to exist at the MPI-1.2 and MPI-2.1 levels for many years to come. Like Fortran, it is ubiquitous in technical computing, taught everywhere, and used everywhere. The body of free and commercial products that require MPI helps ensure that this will go on indefinitely, as will new ports of the existing free and commercial implementations to new target platforms.
- Architectures are changing, with greater internal concurrency (multi-core), better fine-grain concurrency control (threading, affinity), and more levels of memory hierarchy. This has already yielded separate, complementary standards for SMP programming, namely OpenMP. In the future, however, both massive scale and multi-granular concurrency will expose limitations of the MPI standard, which is only tangentially friendly to multithreaded programming and does not specify enough about how multithreaded programs should be written. While thread-capable MPI implementations do exist, multithreaded message-passing applications are still few. The drive to achieve multi-level concurrency entirely within MPI is both a challenge and an opportunity for the standard in the future.
- The number of functions is huge, though as noted above the number of concepts is relatively small. However, given that many users do not use the majority of MPI-2's capabilities, a future standard might be smaller as well as more focused, or have profiles that allow different users to get what they need without waiting for a complete implementation suite, or for all of that code to be validated from a software-engineering point of view.
- Grid computing and virtual grid computing pose particular problems of 'fit' for MPI's way of handling static and dynamic process management. While it is possible to force the MPI model to work on a grid, the idea of a fault-free, long-running MPI program can be hard to achieve satisfactorily. Grids may want to instantiate MPI APIs between sets of running processes, but multi-level middleware that addresses concurrency, faults, and message traffic is needed. Fault-tolerant MPIs and grid MPIs have been attempted, but the original design of MPI itself constrains what can be done.
- People want a higher-productivity interface; MPI programs are often called the assembly language of parallel programming. Achieving this goal, whether through semi-automated compilation, through model-driven architecture and component engineering, or both, means that MPI would have to evolve and, in some sense, move into the background.
These areas, some well funded by DARPA and other agencies, others underway in academic groups worldwide, have yet to produce a consensus that can fundamentally disrupt MPI's key values: performance, portability, and ubiquitous support.
See also
- MPICH
- LAM/MPI
- Open MPI
- OpenMP
- Unified Parallel C
- Occam programming language
- Linda (coordination language)
- Parallel Virtual Machine
- Calculus of communicating systems
- Calculus of Broadcasting Systems
- Actor model
- Interconnect Driven Server
- DDT Debugging tool for MPI programs
External links
- MPI specification
- HP-MPI
- MPI DMOZ category
- Open MPI web site
- LAM/MPI web site
- MPICH
- SCore MPI
- Scali MPI
- MVAPICH: MPI over InfiniBand
- Parawiki page for MPI
- Global Arrays
- PVM/MPI Users' Group Meeting (2006 edition)
- MPI Samples
- MPICH over Myrinet (GM, classic driver)
- MPICH over Myrinet (MX, next-gen driver)
- Parallel Programming with MatlabMPI
- MPI Tutorial
- Parallel Programming with MPI
- MacMPI
- MPI over SCTP
- IPython allows MPI applications to be steered interactively.
- MPJ Express An Implementation of MPI-like bindings in Java
- mpiP is an open-source, lightweight, scalable MPI profiling tool.
References
This article is based on material taken from the Free On-line Dictionary of Computing prior to 1 November 2008 and incorporated under the "relicensing" terms of the GFDL, version 1.3 or later.