Computational Optimal Transport
@article{COTFNT,
  author  = {Gabriel Peyr\'e and Marco Cuturi},
  title   = {Computational Optimal Transport},
  journal = {Foundations and Trends in Machine Learning},
  year    = {2019},
  volume  = {11},
  number  = {5-6},
  pages   = {355--607}
}
Contents
1 Introduction 3
2 Theoretical Foundations 7
2.1 Histograms and Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Assignment and Monge Problem . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Kantorovich Relaxation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Metric Properties of Optimal Transport . . . . . . . . . . . . . . . . . . . 19
2.5 Dual Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6 Special Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3 Algorithmic Foundations 37
3.1 The Kantorovich Linear Programs . . . . . . . . . . . . . . . . . . . . . . 38
3.2 C-Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3 Complementary Slackness . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Vertices of the Transportation Polytope . . . . . . . . . . . . . . . . . . . 42
3.5 A Heuristic Description of the Network Simplex . . . . . . . . . . . . . . . 45
3.6 Dual Ascent Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.7 Auction Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6 W1 Optimal Transport 96
6.1 W1 on Metric Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2 W1 on Euclidean Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.3 W1 on a Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
References 179
Abstract
Optimal transport (OT) theory can be informally described using the words of the
French mathematician Gaspard Monge (1746–1818): A worker with a shovel in hand
has to move a large pile of sand lying on a construction site. The goal of the worker is
to erect with all that sand a target pile with a prescribed shape (for example, that of a
giant sand castle). Naturally, the worker wishes to minimize her total effort, quantified
for instance as the total distance or time spent carrying shovelfuls of sand. Mathe-
maticians interested in OT cast that problem as that of comparing two probability
distributions—two different piles of sand of the same volume. They consider all of the
many possible ways to morph, transport or reshape the first pile into the second, and
associate a “global” cost to every such transport, using the “local” consideration of
how much it costs to move a grain of sand from one place to another. Mathematicians
are interested in the properties of that least costly transport, as well as in its efficient
computation. That smallest cost not only defines a distance between distributions, but
it also entails a rich geometric structure on the space of probability distributions. That
structure is canonical in the sense that it borrows key geometric properties of the un-
derlying “ground” space on which these distributions are defined. For instance, when
the underlying space is Euclidean, key concepts such as interpolation, barycenters, con-
vexity or gradients of functions extend naturally to the space of distributions endowed
with an OT geometry.
OT has been (re)discovered in many settings and under different forms, giving it
a rich history. While Monge’s seminal work was motivated by an engineering problem,
Tolstoi in the 1920s and Hitchcock, Kantorovich and Koopmans in the 1940s estab-
lished its significance to logistics and economics. Dantzig solved it numerically in 1949
within the framework of linear programming, giving OT a firm footing in optimization.
OT was later revisited by analysts in the 1990s, notably Brenier, while also gaining
fame in computer vision under the name of earth mover’s distances. Recent years have
witnessed yet another revolution in the spread of OT, thanks to the emergence of ap-
proximate solvers that can scale to large problem dimensions. As a consequence, OT is
being increasingly used to unlock various problems in imaging sciences (such as color
or texture processing), graphics (for shape manipulation) or machine learning (for re-
gression, classification and generative modeling).
This paper reviews OT with a bias toward numerical methods, and covers the
theoretical properties of OT that can guide the design of new algorithms. We focus in
particular on the recent wave of efficient algorithms that have helped OT find relevance
in data sciences. We give a prominent place to the many generalizations of OT that
have been proposed in but a few years, and connect them with related approaches
originating from statistical inference, kernel methods and information theory. All of
the figures can be reproduced using code made available on a companion website. This
website hosts the book project Computational Optimal Transport, along with slides and
computational resources.
1 Introduction
The shortest path principle guides most decisions in life and sciences: When a commod-
ity, a person or a single bit of information is available at a given point and needs to be
sent at a target point, one should favor using the least possible effort. This is typically
reached by moving an item along a straight line when in the plane or along geodesic
curves in more involved metric spaces. The theory of optimal transport generalizes that
intuition in the case where, instead of moving only one item at a time, one is concerned
with the problem of moving simultaneously several items (or a continuous distribution
thereof) from one configuration onto another. As schoolteachers might attest, planning
the transportation of a group of individuals, with the constraint that they reach a given
target configuration upon arrival, is substantially more involved than carrying it out
for a single individual. Indeed, thinking in terms of groups or distributions requires a
more advanced mathematical formalism which was first hinted at in the seminal work
of Monge [1781]. Yet, no matter how complicated that formalism might look at first
sight, that problem has deep and concrete connections with our daily life. Transporta-
tion, be it of people, commodities or information, very rarely involves moving only
one item. All major economic problems, in logistics, production planning or network
routing, involve moving distributions, and that thread appears in all of the seminal
references on optimal transport. Indeed Tolstoï [1930], Hitchcock [1941] and Kantorovich
[1942] were all guided by practical concerns. It was only a few years later, mostly after
the 1980s, that mathematicians discovered, thanks to the works of Brenier [1991] and
others, that this theory provided a fertile ground for research, with deep connections
to convexity, partial differential equations and statistics. At the turn of the millennium,
researchers in computer, imaging and more generally data sciences understood that optimal
transport theory provided very powerful tools to study distributions in a different
and more abstract context, that of comparing distributions readily available to them
under the form of bags-of-features or descriptors.
Several reference books have been written on optimal transport, including the two
recent monographs by Villani [2003, 2009], those by Rachev and Rüschendorf [1998a,b]
and, more recently, that by Santambrogio [2015]. As exemplified by these books,
the more formal and abstract concepts in that theory deserve, in and of themselves,
several hundred pages. Now that optimal transport has gradually established itself as
an applied tool (for instance, in economics, as put forward recently by Galichon [2016]),
we have tried to balance that rich literature with a computational viewpoint, centered
on applications to data science, notably imaging sciences and machine learning. We
follow in that sense the motivation of the recent review by Kolouri et al. [2017] but
try to cover more ground. Ultimately, our goal is to present an overview of the main
theoretical insights that support the practical effectiveness of OT and spend more time
explaining how to turn these insights into fast computational schemes. The main body
of Chapters 2, 3, 4, 9, and 10 is devoted solely to the study of the geometry induced by
optimal transport in the space of probability vectors or discrete histograms. Targeting
more advanced readers, we also give in the same chapters, in light gray boxes, a more
general mathematical exposition of optimal transport tailored for discrete measures.
Discrete measures are defined by their probability weights, but also by the location
at which these weights are defined. These locations are usually taken in a continuous
metric space, giving a second important degree of freedom to model random phenomena.
Lastly, the third and most technical layer of exposition is indicated in dark gray boxes
and deals with arbitrary measures that need not be discrete, and which can have in
particular a density w.r.t. a base measure. This is traditionally the default setting for
most classic textbooks on OT theory, but one that plays a less important role in general
for practical applications. Chapters 5 to 8 deal with the interplay between continuous
and discrete measures and are thus targeting a more mathematically inclined audience.
The field of computational optimal transport is at the time of this writing still an
extremely active one. There are therefore a wide variety of topics that we have not
touched upon in this survey. Let us cite in no particular order the subjects of distri-
butionally robust optimization [Shafieezadeh Abadeh et al., 2015, Esfahani and Kuhn,
2018, Lee and Raginsky, 2018, Gao et al., 2018], in which parameter estimation is
carried out by minimizing the worst possible empirical risk of any data measure taken
within a certain Wasserstein distance of the input data; convergence of the Langevin
Monte Carlo sampling algorithm in the Wasserstein geometry [Dalalyan and Karag-
ulyan, 2017, Dalalyan, 2017, Bernton, 2018]; other numerical methods to solve OT
with a squared Euclidean cost in low-dimensional settings using the Monge-Ampère
equation [Froese and Oberman, 2011, Benamou et al., 2014, Sulman et al., 2011] which
Notation
• $\mathbf{1}_{n,m}$: matrix of $\mathbb{R}^{n\times m}$ with all entries identically set to 1. $\mathbf{1}_n$: vector of ones.
• $\Sigma_n$: probability simplex with $n$ bins, namely the set of probability vectors in $\mathbb{R}_+^n$.
• $c(x, y)$: ground cost, with associated pairwise cost matrix $\mathbf{C} = (c(x_i, y_j))_{i,j}$ evaluated on the supports of $\alpha, \beta$.
• $\pi$: coupling measure between $\alpha$ and $\beta$, namely such that for any $A \subset \mathcal{X}$, $\pi(A \times \mathcal{Y}) = \alpha(A)$, and for any subset $B \subset \mathcal{Y}$, $\pi(\mathcal{X} \times B) = \beta(B)$. For discrete measures, $\pi = \sum_{i,j} \mathbf{P}_{i,j}\, \delta_{(x_i, y_j)}$.
• U(α, β): set of coupling measures, for discrete measures U(a, b).
• (f, g): dual potentials, for discrete measures (f, g) are dual variables.
• $\mathrm{L}_{\mathbf{C}}(a, b)$ and $\mathrm{L}_c(\alpha, \beta)$: value of the optimization problem associated to the OT with cost $\mathbf{C}$ (histograms) and $c$ (arbitrary measures).
• $\langle \cdot, \cdot \rangle$: the usual Euclidean dot-product between vectors; for two matrices of the same size $A$ and $B$, $\langle A, B \rangle = \mathrm{tr}(A^{\mathrm{T}} B)$ is the Frobenius dot-product.
• $f \oplus g(x, y) \stackrel{\text{def.}}{=} f(x) + g(y)$, for two functions $f : \mathcal{X} \to \mathbb{R}$, $g : \mathcal{Y} \to \mathbb{R}$, defines $f \oplus g : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$.
• $\mathbf{f} \oplus \mathbf{g} \stackrel{\text{def.}}{=} \mathbf{f}\mathbf{1}_m^{\mathrm{T}} + \mathbf{1}_n\mathbf{g}^{\mathrm{T}} \in \mathbb{R}^{n \times m}$ for two vectors $\mathbf{f} \in \mathbb{R}^n$, $\mathbf{g} \in \mathbb{R}^m$.
• $\alpha \otimes \beta$ is the product measure on $\mathcal{X} \times \mathcal{Y}$, i.e. $\int_{\mathcal{X} \times \mathcal{Y}} g(x, y)\, d(\alpha \otimes \beta)(x, y) \stackrel{\text{def.}}{=} \int_{\mathcal{X} \times \mathcal{Y}} g(x, y)\, d\alpha(x)\, d\beta(y)$.
• $a \otimes b \stackrel{\text{def.}}{=} ab^{\mathrm{T}} \in \mathbb{R}^{n \times m}$.
2 Theoretical Foundations
This chapter describes the basics of optimal transport, introducing first the related
notions of optimal matchings and couplings between probability vectors (a, b), gen-
eralizing gradually this computation to transport between discrete measures (α, β), to
cover lastly the general setting of arbitrary measures. At first reading, these last nuances
may be omitted, and the reader can focus only on computations between probability
vectors, namely histograms, which are the only prerequisite to implement the algorithms
detailed in Chapters 3 and 4. More experienced readers will reach a better understanding
of the problem by considering the formulation that applies to arbitrary measures, and
will be able to apply it to more advanced problems (e.g. to move the positions of point
clouds, or in a statistical setting where points are sampled from continuous densities).
2.1 Histograms and Measures
We will use interchangeably the terms histogram and probability vector for any element
a ∈ Σn that belongs to the probability simplex
$$\Sigma_n \stackrel{\text{def.}}{=} \left\{ a \in \mathbb{R}_+^n : \sum_{i=1}^n a_i = 1 \right\}.$$
A large part of this review focuses exclusively on the study of the geometry induced by
optimal transport on the simplex.
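A histogram in $\Sigma_n$ is typically produced from raw data by normalizing counts. The following minimal NumPy sketch (ours, not the book's companion code; the helper name `histogram` is illustrative) builds such a probability vector:

```python
import numpy as np

def histogram(labels, n):
    """Turn raw class labels {0, ..., n-1} into a probability vector in Sigma_n."""
    a = np.bincount(labels, minlength=n).astype(float)
    return a / a.sum()

a = histogram(np.array([0, 0, 1, 2, 2, 2]), n=3)  # counts [2, 1, 3] normalized
assert np.isclose(a.sum(), 1.0) and np.all(a >= 0)
```

Any nonnegative vector with positive total mass can be mapped to the simplex the same way, by dividing by its sum.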
Remark 2.1 (Discrete measures). A discrete measure with weights a and locations
x1 , . . . , xn ∈ X reads
$$\alpha = \sum_{i=1}^n a_i \delta_{x_i}, \qquad (2.1)$$
where δx is the Dirac at position x, intuitively a unit of mass which is infinitely
concentrated at location x. Such a measure describes a probability measure if,
additionally, a ∈ Σn and more generally a positive measure if all the elements of
vector a are nonnegative. To avoid degeneracy issues where locations with no mass
are accounted for, we will assume when considering discrete measures that all the
elements of a are positive.
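In code, a discrete measure of the form (2.1) is conveniently represented by a pair of arrays: the weights $a$ and the support locations $x_i$. A small NumPy sketch (ours; the array names are illustrative):

```python
import numpy as np

# A discrete measure alpha = sum_i a_i * delta_{x_i} is stored as a pair:
# positive weights a in the simplex and support locations x in R^d.
n, d = 5, 2
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d))   # locations x_1, ..., x_n in R^2
a = rng.random(n)
a /= a.sum()                  # normalize so that a lies in Sigma_n

assert np.all(a > 0) and np.isclose(a.sum(), 1.0)

# Sampling from alpha: draw a location x_i with probability a_i.
samples = x[rng.choice(n, size=1000, p=a)]
```

The positivity check mirrors the nondegeneracy assumption above: locations carrying no mass are simply not stored.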
An arbitrary measure α ∈ M(X ) (which need not have a density nor be a sum
of Diracs) is defined by the fact that it can be integrated against any continuous
function $f \in \mathcal{C}(\mathcal{X})$, yielding $\int_{\mathcal{X}} f(x)\, d\alpha(x) \in \mathbb{R}$. If $\mathcal{X}$ is not compact, one should
also impose that f has compact support or at least has 0 limit at infinity. Measures
are thus in some sense “less regular” than functions but more regular than distribu-
tions (which are dual to smooth functions). For instance, the derivative of a Dirac
is not a measure. We denote $\mathcal{M}_+(\mathcal{X})$ the set of all positive measures on $\mathcal{X}$. The
set of probability measures is denoted $\mathcal{M}_+^1(\mathcal{X})$, which means that any $\alpha \in \mathcal{M}_+^1(\mathcal{X})$
is positive, and that $\alpha(\mathcal{X}) = \int_{\mathcal{X}} d\alpha = 1$. Figure 2.1 offers a visualization of the
different classes of measures, beyond histograms, considered in this work.
2.2 Assignment and Monge Problem
Given a cost matrix $(\mathbf{C}_{i,j})_{i \in [\![n]\!], j \in [\![m]\!]}$, assuming $n = m$, the optimal assignment problem
seeks a bijection $\sigma$ in the set $\mathrm{Perm}(n)$ of permutations of $n$ elements solving
$$\min_{\sigma \in \mathrm{Perm}(n)} \frac{1}{n} \sum_{i=1}^n \mathbf{C}_{i,\sigma(i)}. \qquad (2.2)$$
One could naively evaluate the cost function above using all permutations in the set
Perm(n). However, that set has size n!, which is gigantic even for small n. Consider,
for instance, that such a set has more than $10^{100}$ elements [Dantzig, 1983] when n is as
small as 70. That problem can therefore be solved only if there exist efficient algorithms
to optimize that cost function over the set of permutations, which is the subject of §3.7.
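To see the contrast between naive enumeration and dedicated solvers, the following sketch (ours) compares a brute-force search over Perm(n) with SciPy's `linear_sum_assignment`, used here merely as a stand-in for the algorithms discussed in §3.7:

```python
import itertools

import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n = 7
C = rng.random((n, n))
idx = np.arange(n)

# Naive search over all n! permutations (already ~5000 evaluations for n = 7).
brute = min(C[idx, list(perm)].sum() / n
            for perm in itertools.permutations(range(n)))

# Polynomial-time assignment solver, standing in for the dedicated
# methods (e.g. the auction algorithm) covered in Section 3.7.
rows, cols = linear_sum_assignment(C)
fast = C[rows, cols].sum() / n

assert np.isclose(brute, fast)
```

Already at $n = 20$ the brute-force loop would require on the order of $10^{18}$ evaluations, while the solver finishes in milliseconds.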
Remark 2.3 (Uniqueness). Note that the optimal assignment problem may have several
optimal solutions. Suppose, for instance, that n = m = 2 and that the matrix C is the
pairwise distance matrix between the four corners of a 2-D square of side length 1, as
represented in the left plot of Figure 2.2. In that case only two assignments exist, and
they are both optimal.
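The non-uniqueness of Remark 2.3 can be checked numerically; in this sketch (ours), the four points are placed at the corners of the unit square so that all four pairwise distances are equal:

```python
import numpy as np

# Blue points x_1, x_2 and red points y_1, y_2 at the corners of the unit
# square, chosen so that all four pairwise distances equal 1 (as in the
# left plot of Figure 2.2).
x = np.array([[0.0, 0.0], [1.0, 1.0]])
y = np.array([[1.0, 0.0], [0.0, 1.0]])
C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)

# Both assignments sigma = (1, 2) and sigma = (2, 1) have the same cost.
cost_identity = (C[0, 0] + C[1, 1]) / 2
cost_swap = (C[0, 1] + C[1, 0]) / 2
assert np.isclose(cost_identity, cost_swap)
```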
Remark 2.4 (Monge problem between discrete measures). For discrete measures
$$\alpha = \sum_{i=1}^n a_i \delta_{x_i} \quad \text{and} \quad \beta = \sum_{j=1}^m b_j \delta_{y_j}, \qquad (2.3)$$
the Monge problem [1781] seeks a map that associates to each point xi a single
point yj and which must push the mass of α toward the mass of β, namely, such a
Figure 2.2: Left: blue dots from measure α and red dots from measure β are pairwise equidistant.
Hence, either matching σ = (1, 2) (full line) or σ = (2, 1) (dotted line) is optimal. Right: a Monge map
can associate the blue measure α to the red measure β. The weights αi are displayed proportionally
to the area of the disk marked at each location. The mapping here is such that T (x1 ) = T (x2 ) = y2 ,
T (x3 ) = y3 , whereas for 4 ≤ i ≤ 7 we have T (xi ) = y1 .
where the inverse σ −1 (j) is to be understood as the preimage set of j. In the special
case when n = m and all weights are uniform, that is, ai = bj = 1/n, then the
mass conservation constraint implies that T is a bijection, such that T (xi ) = yσ(i) ,
and the Monge problem is equivalent to the optimal matching problem (2.2), where
the cost matrix is
$$\mathbf{C}_{i,j} \stackrel{\text{def.}}{=} c(x_i, y_j).$$
When $n \neq m$, note that, optimality aside, Monge maps may not even exist from one
discrete measure to another. This happens when their weight vectors are not
compatible, which is always the case when the target measure has more points
than the source measure, $n < m$. For instance, the right plot in Figure 2.2 shows
an (optimal) Monge map between α and β, but there is no Monge map from β to
α.
For more general measures, for instance, for those with a density, the notion of
push-forward plays a fundamental role to describe the spatial modification (or
transport) of a probability measure. The formal definition reads as follows.
Note that $T_\sharp$ preserves positivity and total mass, so that if $\alpha \in \mathcal{M}_+^1(\mathcal{X})$ then
$T_\sharp\alpha \in \mathcal{M}_+^1(\mathcal{Y})$.
Remark 2.6 (Push-forward for multivariate densities). Explicitly doing the change
of variables in formula (2.6) for measures with densities (ρα , ρβ ) on Rd (assum-
ing T is smooth and bijective) shows that a push-forward acts on densities linearly
where T 0 (x) ∈ Rd×d is the Jacobian matrix of T (the matrix formed by taking the
gradient of each coordinate of T ). This implies
$$|\det(T'(x))| = \frac{\rho_\alpha(x)}{\rho_\beta(T(x))}.$$
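This density identity can be verified numerically. The sketch below (ours, not from the book) pushes $\alpha = \mathcal{N}(0, 1)$ through the affine map $T(x) = 2x + 1$, so that $\beta = \mathcal{N}(1, 4)$ and $|\det(T'(x))| = 2$ everywhere:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of a Gaussian N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Push alpha = N(0, 1) forward through T(x) = 2x + 1, giving beta = N(1, 2^2).
x = np.linspace(-3.0, 3.0, 101)
rho_alpha = normal_pdf(x, 0.0, 1.0)
rho_beta_at_Tx = normal_pdf(2 * x + 1, 1.0, 2.0)

# |det T'(x)| = rho_alpha(x) / rho_beta(T(x)); here T'(x) = 2 for all x.
assert np.allclose(rho_alpha / rho_beta_at_Tx, 2.0)
```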
Remark 2.7 (Monge problem between arbitrary measures). The Monge prob-
lem (2.5) can be extended to the case where two arbitrary probability measures
(α, β), supported on two spaces $(\mathcal{X}, \mathcal{Y})$, can be linked through a map $T : \mathcal{X} \to \mathcal{Y}$
that minimizes
$$\min_{T} \left\{ \int_{\mathcal{X}} c(x, T(x))\, d\alpha(x) : T_\sharp\alpha = \beta \right\}. \qquad (2.9)$$
The constraint $T_\sharp\alpha = \beta$ means that $T$ pushes forward the mass of α to β, using
the push-forward operator defined in Remark 2.5.
Note that even if (α, β) have densities $(\rho_\alpha, \rho_\beta)$ with respect to a fixed measure (e.g.
Lebesgue on $\mathbb{R}^d$), $T_\sharp\alpha$ does not have $T^\sharp\rho_\beta$ as density, because of the presence of
the Jacobian in (2.8). This explains why OT should be used with caution to per-
form image registration, because it does not operate as an image warping method.
Figure 2.3 illustrates the distinction between these push-forward and pull-back
operators.
Remark 2.9 (Measures and random variables). Radon measures can also be viewed
as representing the distributions of random variables. A random variable X on X
is actually a map X : Ω → X from some abstract (often unspecified) probability
space $(\Omega, \mathbb{P})$, and its distribution α is the Radon measure $\alpha \in \mathcal{M}_+^1(\mathcal{X})$ such that
$\mathbb{P}(X \in A) = \alpha(A) = \int_A d\alpha(x)$. Equivalently, it is the push-forward of $\mathbb{P}$ by $X$,
$\alpha = X_\sharp\mathbb{P}$. Applying another push-forward $\beta = T_\sharp\alpha$ for $T : \mathcal{X} \to \mathcal{Y}$, following (2.6),
[Figure: push-forward of measures, $T_\sharp\big(\sum_i a_i \delta_{x_i}\big) \stackrel{\text{def.}}{=} \sum_i a_i \delta_{T(x_i)}$; pull-back of functions, $T^\sharp g \stackrel{\text{def.}}{=} g \circ T$.]
Figure 2.3: Comparison of the push-forward operator $T_\sharp$, which can take as an input any measure,
and the pull-back operator $T^\sharp$, which operates on functions, notably densities.
2.3 Kantorovich Relaxation
The assignment problem, and its generalization found in the Monge problem laid out in
Remark 2.4, is not always relevant to studying discrete measures, such as those found
in practical problems. Indeed, because the assignment problem is formulated as a per-
mutation problem, it can only be used to compare uniform histograms of the same size.
A direct generalization to discrete measures with nonuniform weights can be carried
out using Monge’s formalism of push-forward maps, but that formulation may also be
degenerate in the absence of feasible solutions satisfying the mass conservation constraint
(2.4) (see the end of Remark 2.4). Additionally, the assignment problem (2.5) is
combinatorial, and the feasible set for the Monge problem (2.9), despite being continuously
parameterized as the set consisting of all push-forward measures that satisfy the
mass conservation constraint, is nonconvex. Both are therefore difficult to solve when
approached in their original formulations.
The key idea of Kantorovich [1942] is to relax the deterministic nature of trans-
portation, namely the fact that a source point $x_i$ can only be assigned to a single point
or location $y_{\sigma(i)}$ or $T(x_i)$. Kantorovich proposes instead that the mass at any point
$x_i$ be potentially dispatched across several locations. He moves away from the
idea that mass transportation should be deterministic to consider instead a probabilistic
transport, which allows what is commonly known now as mass splitting from a source
toward several targets. This flexibility is encoded using, in place of a permutation σ or a
map $T$, a coupling matrix $\mathbf{P} \in \mathbb{R}_+^{n \times m}$, where $\mathbf{P}_{i,j}$ describes the amount of mass flowing
from bin $i$ toward bin $j$, or from the mass found at $x_i$ toward $y_j$ in the formalism of
discrete measures (2.3). Admissible couplings admit a far simpler characterization than
Monge maps,
$$\mathbf{U}(a, b) \stackrel{\text{def.}}{=} \left\{ \mathbf{P} \in \mathbb{R}_+^{n \times m} : \mathbf{P}\mathbf{1}_m = a \ \text{and} \ \mathbf{P}^{\mathrm{T}}\mathbf{1}_n = b \right\}. \qquad (2.10)$$
The set of matrices U(a, b) is bounded and defined by n + m equality constraints, and
therefore is a convex polytope (the convex hull of a finite set of matrices) [Brualdi,
2006, §8.1].
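These marginal constraints are easy to check numerically. The sketch below (ours) verifies that the independent coupling $ab^{\mathrm{T}}$ always belongs to $\mathbf{U}(a, b)$, so the polytope is never empty:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 6
a = rng.random(n); a /= a.sum()
b = rng.random(m); b /= b.sum()

# The independent coupling a b^T always belongs to U(a, b).
P = np.outer(a, b)
assert np.allclose(P @ np.ones(m), a)      # row marginals: P 1_m = a
assert np.allclose(P.T @ np.ones(n), b)    # column marginals: P^T 1_n = b
```

Transposing `P` exchanges the two marginals, illustrating the symmetry of the relaxation noted below.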
Additionally, whereas the Monge formulation (as illustrated in the right plot of
Figure 2.2) was intrinsically asymmetric, Kantorovich's relaxed formulation is always
symmetric, in the sense that a coupling $\mathbf{P}$ is in $\mathbf{U}(a, b)$ if and only if $\mathbf{P}^{\mathrm{T}}$ is in $\mathbf{U}(b, a)$.
Kantorovich's optimal transport problem now reads
$$\mathrm{L}_{\mathbf{C}}(a, b) \stackrel{\text{def.}}{=} \min_{\mathbf{P} \in \mathbf{U}(a, b)} \langle \mathbf{C}, \mathbf{P} \rangle \stackrel{\text{def.}}{=} \sum_{i,j} \mathbf{C}_{i,j}\mathbf{P}_{i,j}. \qquad (2.11)$$
This is a linear program (see Chapter 3), and as is usually the case with such programs,
its optimal solutions are not necessarily unique.
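Since (2.11) is a finite-dimensional linear program, it can be solved directly with a generic LP solver. The sketch below (ours; the helper name `kantorovich_lp` is illustrative) encodes the marginal constraints (2.10) and calls SciPy's `linprog`; it is meant as an illustration, not as an efficient solver:

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich_lp(a, b, C):
    """Solve L_C(a, b) = min_{P in U(a, b)} <C, P> as a linear program."""
    n, m = C.shape
    # Equality constraints on the flattened P: row sums equal a, column sums equal b.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0     # P 1_m = a
    for j in range(m):
        A_eq[n + j, j::m] = 1.0              # P^T 1_n = b
    b_eq = np.concatenate([a, b])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun, res.x.reshape(n, m)

a = np.array([0.5, 0.5])
b = np.array([0.5, 0.5])
C = np.array([[0.0, 1.0], [1.0, 0.0]])
cost, P = kantorovich_lp(a, b, C)   # optimal coupling keeps each bin in place
```

One of the $n + m$ equality constraints is redundant (both marginals sum to 1), which the solver's presolve handles without intervention.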
Remark 2.10 (Mines and factories). The Kantorovich problem finds a very natural
illustration in the following resource allocation problem (see also Hitchcock [1941]).
Suppose that an operator runs n warehouses and m factories. Each warehouse contains
a valuable raw material that is needed by the factories to run properly. More precisely,
each warehouse is indexed with an integer i and contains ai units of the raw material.
These raw materials must all be moved to the factories, with a prescribed quantity bj
needed at factory j to function properly. To transfer resources from a warehouse i to
a factory j, the operator can use a transportation company that will charge Ci,j to
move a single unit of the resource from location i to location j. We assume that the
transportation company has the monopoly to transport goods and applies the same
linear pricing scheme to all actors of the economy: the cost of shipping a units of the
resource from i to j is equal to a × Ci,j .
Faced with the problem described above, the operator chooses to solve the linear
program described in Equation (2.11) to obtain a transportation plan $\mathbf{P}^\star$ that quantifies,
for each pair $(i, j)$, the amount of goods $\mathbf{P}_{i,j}$ that must be transported from warehouse $i$ to
factory $j$. The operator pays on aggregate a total of $\langle \mathbf{P}^\star, \mathbf{C} \rangle$ to the transportation
company to execute that plan.
which shows that the assignment problem (2.2) can be recast as a Kantorovich prob-
lem (2.11) where the couplings P are restricted to be exactly permutation matrices:
$$\min_{\sigma \in \mathrm{Perm}(n)} \frac{1}{n} \sum_{i=1}^n \mathbf{C}_{i,\sigma(i)} = \min_{\sigma \in \mathrm{Perm}(n)} \langle \mathbf{C}, \mathbf{P}_\sigma \rangle.$$
Next, one can easily check that the set of permutation matrices (rescaled by $1/n$) is strictly included in
the Birkhoff polytope $\mathbf{U}(\mathbf{1}_n/n, \mathbf{1}_n/n)$. Indeed, for any permutation σ we have $\mathbf{P}_\sigma\mathbf{1}_n =
\mathbf{1}_n/n$ and $\mathbf{P}_\sigma^{\mathrm{T}}\mathbf{1}_n = \mathbf{1}_n/n$, whereas $\mathbf{1}_n\mathbf{1}_n^{\mathrm{T}}/n^2$ is a valid coupling but not a permutation
matrix. Therefore, the minimum of $\langle \mathbf{C}, \mathbf{P} \rangle$ is necessarily smaller when considering all
couplings than when considering only permutation matrices:
The following proposition shows that these problems result in fact in the same
optimum, namely that one can always find a permutation matrix that minimizes Kan-
torovich’s problem (2.11) between two uniform measures a = b = 1n /n. The Kan-
torovich relaxation is therefore tight when considered on assignment problems. Fig-
ure 2.4 shows on the left a 2-D example of optimal matching corresponding to this
special case.
Proof. Birkhoff’s theorem [1946] states that the set of extremal points of U(1n /n, 1n /n)
is equal to the set of permutation matrices. A fundamental theorem of linear program-
ming [Bertsimas and Tsitsiklis, 1997, Theorem 2.7] states that the minimum of a linear
objective in a nonempty polyhedron, if finite, is reached at an extremal point of the
polyhedron.
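The tightness asserted by Proposition 2.1 can be checked numerically. In the sketch below (ours), the optimal assignment cost computed by SciPy's `linear_sum_assignment` coincides with the LP optimum over the Birkhoff polytope:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment, linprog

rng = np.random.default_rng(1)
n = 6
C = rng.random((n, n))

# Optimal assignment cost (1/n) * sum_i C[i, sigma(i)].
rows, cols = linear_sum_assignment(C)
assign_cost = C[rows, cols].sum() / n

# Kantorovich LP between uniform marginals a = b = 1_n / n.
A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1.0   # row sums
    A_eq[n + i, i::n] = 1.0            # column sums
b_eq = np.full(2 * n, 1.0 / n)
res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))

assert np.isclose(res.fun, assign_cost)  # the relaxation is tight
```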
Figure 2.4: Comparison of optimal matching and generic couplings. A black segment between xi
and yj indicates a nonzero element in the displayed optimal coupling Pi,j solving (2.11). Left: optimal
matching, corresponding to the setting of Proposition 2.1 (empirical measures with the same number
n = m of points). Right: these two weighted point clouds cannot be matched; instead a Kantorovich
coupling can be used to associate two arbitrary discrete measures.
Remark 2.11 (Kantorovich problem between discrete measures). For discrete mea-
sures α, β of the form (2.3), we store in the matrix $\mathbf{C}$ all pairwise costs between
points in the supports of α, β, namely $\mathbf{C}_{i,j} \stackrel{\text{def.}}{=} c(x_i, y_j)$, to define
$$\mathrm{L}_c(\alpha, \beta) \stackrel{\text{def.}}{=} \mathrm{L}_{\mathbf{C}}(a, b). \qquad (2.13)$$
Remark 2.12 (Using optimal assignments and couplings). The optimal transport
plan itself (either as a coupling P or a Monge map T when it exists) has found
many applications in data sciences, and in particular image processing. It has,
for instance, been used for contrast equalization [Delon, 2004] and texture synthesis
[Gutierrez et al., 2017]. A significant part of applications of OT to imaging
sciences is for image matching [Zhu et al., 2007, Wang et al., 2013, Museyko et al.,
2009, Li et al., 2013], image fusion [Courty et al., 2016], medical imaging [Wang
et al., 2011] and shape registration [Makihara and Yagi, 2010, Lai and Zhao, 2017,
[Figure panels: Discrete, Semidiscrete, Continuous.]
Figure 2.5: Schematic view of input measures (α, β) and couplings $\mathcal{U}(\alpha, \beta)$ encountered in the three
main scenarios for Kantorovich OT. Chapter 5 is dedicated to the semidiscrete setup.
Here $P_{\mathcal{X}\sharp}$ and $P_{\mathcal{Y}\sharp}$ are the push-forwards (see Definition 2.1) of the projections
$P_{\mathcal{X}}(x, y) = x$ and $P_{\mathcal{Y}}(x, y) = y$. Figure 2.5 shows how these coupling con-
straints translate for different classes of problems (discrete measures and den-
sities). Using (2.7), these marginal constraints are equivalent to imposing that
π(A × Y) = α(A) and π(X × B) = β(B) for sets A ⊂ X and B ⊂ Y. The Kan-
Figure 2.6: Left: “continuous” coupling π solving (2.14) between two 1-D measures with density. The
coupling is localized along the graph of the Monge map $(x, T(x))$ (displayed in black). Right: “discrete”
coupling $\mathbf{P}$ solving (2.11) between two discrete measures of the form (2.3). The positive entries $\mathbf{P}_{i,j}$ are
displayed as black disks at position $(i, j)$ with radius proportional to $\mathbf{P}_{i,j}$.
Figure 2.7: Four simple examples of optimal couplings between 1-D distributions, represented as
maps above (arrows) and couplings below. Inspired by Lévy and Schwindt [2018].
2.4 Metric Properties of Optimal Transport
Remark 2.15 (The cases 0 < p ≤ 1). Note that if $0 < p \leq 1$, then $d(x, y)^p$ is itself a distance.
This implies that while for $p \geq 1$, $\mathrm{W}_p(a, b)$ is a distance, in the case $p \leq 1$ it is actually
$\mathrm{W}_p(a, b)^p$ which defines a distance on the simplex.
Remark 2.16 (Applications of Wasserstein distances). The fact that the OT dis-
tance automatically “lifts” a ground metric between bins to a metric between
histograms on such bins makes it a method of choice for applications in computer
vision and machine learning to compare histograms. In these fields, a classical ap-
proach is to “pool” local features (for instance, image descriptors) and compute
a histogram of the empirical distribution of features (a so-called bag of features)
to perform retrieval, clustering or classification; see, for instance, [Oliva and Tor-
ralba, 2001]. Along a similar line of ideas, OT distances can be used over some lifted
feature spaces to perform signal and image analysis [Thorpe et al., 2017]. Appli-
cations to retrieval and clustering were initiated by the landmark paper [Rubner
et al., 2000], with renewed applications following faster algorithms for threshold
matrices C that fit for some applications, for example, in computer vision [Pele and
Werman, 2008, 2009]. More recent applications stress the use of the earth mover’s
distance for bags-of-words, either to carry out dimensionality reduction [Rolet et al.,
2016] and classify texts [Kusner et al., 2015, Huang et al., 2016], or to define an
alternative loss to train multiclass classifiers that output bags-of-words [Frogner
et al., 2015]. Kolouri et al. [2017] provides a recent overview of such applications
to signal processing and machine learning.
Proposition 2.3. We assume $\mathcal{X} = \mathcal{Y}$ and that for some $p \geq 1$, $c(x, y) = d(x, y)^p$,
where $d$ is a distance on $\mathcal{X}$, i.e.
Proof. The proof follows the same approach as that for Proposition 2.2 and relies on
the existence of a coupling between (α, γ) obtained by “gluing” optimal couplings
between (α, β) and (β, γ).
Remark 2.18 (Geometric intuition and weak convergence). The Wasserstein distance
$\mathrm{W}_p$ has many important properties, the most important being that it is
a weak distance, i.e. it allows one to compare singular distributions (for instance,
discrete ones) whose supports do not overlap and to quantify the spatial shift
between the supports of two distributions. In particular, “classical” distances (or
divergences) are not even defined between discrete distributions (the L2 norm can
only be applied to continuous measures with a density with respect to a base measure,
and the discrete $\ell_2$ norm requires that positions $(x_i, y_j)$ take values in a predetermined
discrete set to work properly). In sharp contrast, one has that for any
$p > 0$, $\mathrm{W}_p(\delta_x, \delta_y) = d(x, y)$. Indeed, it suffices to notice that $\mathcal{U}(\delta_x, \delta_y) = \{\delta_{(x,y)}\}$, and
therefore, the Kantorovich problem having only one feasible solution, $\mathrm{W}_p(\delta_x, \delta_y)$
is necessarily $(d(x, y)^p)^{1/p} = d(x, y)$. This shows that $\mathrm{W}_p(\delta_x, \delta_y) \to 0$ as $x \to y$.
This property corresponds to the fact that $\mathrm{W}_p$ is a way to quantify weak
convergence, as we now define.
Remark 2.19 (Translations). A nice feature of the Wasserstein distance over a Eu-
clidean space X = Rd for the ground cost c(x, y) = kx − yk2 is that one can factor
out translations; indeed, denoting Tτ : x 7→ x − τ the translation operator, one has
$$\mathrm{W}_2(T_{\tau\sharp}\alpha, T_{\tau'\sharp}\beta)^2 = \mathrm{W}_2(\alpha, \beta)^2 - 2\langle \tau - \tau',\, m_\alpha - m_\beta \rangle + \|\tau - \tau'\|^2,$$
where $m_\alpha \stackrel{\text{def.}}{=} \int_{\mathcal{X}} x\, d\alpha(x) \in \mathbb{R}^d$ is the mean of α. In particular, this implies the nice
decomposition of the distance as
$$\mathrm{W}_2(\alpha, \beta)^2 = \mathrm{W}_2(\tilde\alpha, \tilde\beta)^2 + \|m_\alpha - m_\beta\|^2,$$
where $(\tilde\alpha, \tilde\beta)$ are the “centered” zero-mean measures, $\tilde\alpha \stackrel{\text{def.}}{=} T_{m_\alpha\sharp}\alpha$.
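For 1-D empirical measures with uniform weights, the classical fact that quadratic OT pairs sorted samples makes this factorization easy to check numerically (sketch ours; the closed form is not established in this section):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=50), rng.normal(loc=2.0, size=50)

def W2_1d(x, y):
    # 1-D quadratic OT between uniform empirical measures: sort and pair.
    return np.sqrt(np.mean((np.sort(x) - np.sort(y)) ** 2))

# W2(alpha, beta)^2 = W2(centered alpha, centered beta)^2 + |m_alpha - m_beta|^2
lhs = W2_1d(x, y) ** 2
rhs = W2_1d(x - x.mean(), y - y.mean()) ** 2 + (x.mean() - y.mean()) ** 2
assert np.allclose(lhs, rhs)
```

The identity holds exactly here because centering shifts every sorted sample by the same amount, so the cross term reduces to the squared difference of the means.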
where the sup should be understood as the essential supremum according to the
measure π on $\mathcal{X}^2$. In contrast to the cases $p < +\infty$, this is a nonconvex optimization
problem, which is difficult to solve numerically and to study theoretically. The $\mathrm{W}_\infty$
distance is related to the Hausdorff distance between the supports of (α, β); see
§10.6.1. We refer to [Champion et al., 2008] for details.
We exchange the min and the max above, which is always possible when considering
linear programs (in finite dimension), to obtain
$$\max_{(\mathbf{f}, \mathbf{g}) \in \mathbb{R}^n \times \mathbb{R}^m} \langle a, \mathbf{f} \rangle + \langle b, \mathbf{g} \rangle + \min_{\mathbf{P} \geq 0}\, \langle \mathbf{C} - \mathbf{f}\mathbf{1}_m^{\mathrm{T}} - \mathbf{1}_n\mathbf{g}^{\mathrm{T}}, \mathbf{P} \rangle.$$
The primal-dual optimality relation for the Lagrangian (2.22) allows us to locate
the support of the optimal transport plan (see also §3.3)
$$\{(i, j) \in [\![n]\!] \times [\![m]\!] : \mathbf{P}_{i,j} > 0\} \subset \{(i, j) \in [\![n]\!] \times [\![m]\!] : \mathbf{f}_i + \mathbf{g}_j = \mathbf{C}_{i,j}\}. \qquad (2.23)$$
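Both the primal problem (2.11) and its dual can be solved with a generic LP solver, and the support inclusion (2.23) then holds for any pair of optimal solutions. A SciPy sketch (ours; the variable names mirror the notation above):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
n, m = 4, 5
a = np.full(n, 1.0 / n)
b = np.full(m, 1.0 / m)
C = rng.random((n, m))

# Primal Kantorovich LP: min <C, P> over U(a, b).
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0   # P 1_m = a
for j in range(m):
    A_eq[n + j, j::m] = 1.0            # P^T 1_n = b
primal = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                 bounds=(0, None))
P = primal.x.reshape(n, m)

# Dual LP: max <a, f> + <b, g>  s.t.  f_i + g_j <= C_ij, with (f, g) free.
A_ub = np.zeros((n * m, n + m))
for i in range(n):
    for j in range(m):
        A_ub[i * m + j, i] = 1.0
        A_ub[i * m + j, n + j] = 1.0
dual = linprog(-np.concatenate([a, b]), A_ub=A_ub, b_ub=C.ravel(),
               bounds=(None, None))
f, g = dual.x[:n], dual.x[n:]

# Strong duality and the support inclusion (2.23).
assert np.isclose(primal.fun, -dual.fun)
support = P > 1e-9
assert np.allclose((f[:, None] + g[None, :])[support], C[support])
```

Note that the dual potentials are only defined up to an additive constant split between `f` and `g`, in line with the pricing discussion of Remark 2.21.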
Remark 2.21. Following the interpretation given to the Kantorovich problem in Remark 2.10, we continue with an intuitive presentation of the dual. Recall that in that setup, an operator wishes to move at the least possible cost an overall amount of resources from warehouses to factories. The operator can do so by solving (2.11), following the instructions set out in P⋆, and paying ⟨P⋆, C⟩ to the transportation company.
Outsourcing logistics. Suppose that the operator does not have the computational
means to solve the linear program (2.11). He decides instead to outsource that task to
a vendor. The vendor chooses a pricing scheme with the following structure: the vendor
splits the logistic task into that of collecting and then delivering the goods and will
apply a collection price f_i to collect a unit of resource at each warehouse i (no matter where that unit is sent) and a delivery price g_j to deliver a unit of resource to factory j (no matter which warehouse that unit comes from). On aggregate, since there
are exactly ai units at warehouse i and bj needed at factory j, the vendor asks as a
consequence of that pricing scheme a price of hf, ai + hg, bi to solve the operator’s
logistic problem.
Setting prices. Note that the pricing system used by the vendor allows quite nat-
urally for arbitrarily negative prices. Indeed, if the vendor applies a price vector f for
warehouses and a price vector g for factories, then the total bill will not be changed
by simultaneously decreasing all entries in f by an arbitrary number and increasing all
entries of g by that same number, since the total amount of resources in all warehouses
is equal to those that have to be delivered to the factories. In other words, the vendor
can give the illusion of giving an extremely good deal to the operator by paying him to
collect some of his goods, but compensate that loss by simply charging him more for
delivering them. Knowing this, the vendor, wishing to charge as much as she can for
that service, sets vectors f and g to be as high as possible.
Checking prices. In the absence of another competing vendor, the operator must
therefore think of a quick way to check that the vendor’s prices are reasonable. A
possible way to do so would be for the operator to compute the price LC (a, b) of the
most efficient plan by solving problem (2.11) and check if the vendor’s offer is at the
very least no larger than that amount. However, recall that the operator cannot afford
such a lengthy computation in the first place. Luckily, there is a far more efficient way
for the operator to check whether the vendor has a competitive offer. Recall that fi
is the price charged by the vendor for picking a unit at i and gj to deliver one at j.
Therefore, the vendor’s pricing scheme implies that transferring one unit of the resource
from i to j costs exactly fi + gj . Yet, the operator also knows that the cost of shipping
one unit from i to j as priced by the transporting company is Ci,j . Therefore, if for any
pair i, j the aggregate price f_i + g_j is strictly larger than C_{i,j}, the vendor is charging
more than the fair price charged by the transportation company for that task, and the
operator should refuse the vendor’s offer.
Figure 2.8: Consider in the left plot the optimal transport problem between two discrete measures α
and β, represented respectively by blue dots and red squares. The area of these markers is proportional
to the weight at each location. That plot also displays the optimal transport P⋆ using a quadratic Euclidean cost. The corresponding dual (Kantorovich) potentials f⋆ and g⋆ for that configuration are also displayed on the right plot. Since there is a “price” f⋆_i for each point in α (and
conversely for g and β), the color at that point represents the obtained value using the color map on
the right. These potentials can be interpreted as relative prices in the sense that they indicate the
individual cost, under the best possible transport scheme, to move a mass away at each location in α,
or on the contrary to send a mass toward any point in β. The optimal transport cost is therefore equal
to the sum of the squared lengths of all the arcs on the left weighted by their thickness or, alternatively,
using the dual formulation, to the sum of the values (encoded with colors) multiplied by the area of
each marker on the right plot.
n × m inequalities that for any transport plan P (including the optimal one P⋆), the marginal constraints imply

Σ_{i,j} P_{i,j} C_{i,j} ≥ Σ_{i,j} P_{i,j} (f_i + g_j) = Σ_i f_i Σ_j P_{i,j} + Σ_j g_j Σ_i P_{i,j} = ⟨f, a⟩ + ⟨g, b⟩.
Knowing this, the vendor must therefore find a set of prices f, g that maximize ⟨f, a⟩ + ⟨g, b⟩ but that satisfy, at the very least for all i, j, the basic inequality f_i + g_j ≤ C_{i,j} for her offer to be accepted, which results in Problem (2.20). One can show, as we do later in §3.1, that the best price obtained by the vendor is in fact exactly equal to the best possible cost the operator would obtain by computing L_C(a, b).
Figure 2.8 illustrates the primal and dual solutions resulting from the same transport
problem. On the left, blue dots represent warehouses and red dots stand for factories; the
areas of these dots stand for the probability weights a, b, links between them represent
an optimal transport, and their width is proportional to transferred amounts. Optimal
prices obtained by the vendor as a result of optimizing Problem (2.20) are shown on
the right. Prices have been chosen so that their mean is equal to 0. The highest relative
prices come from collecting goods at an isolated warehouse on the lower left of the
figure, and delivering goods at the factory located in the upper right area.
Remark 2.22 (Dual problem between arbitrary measures). To extend this primal-
dual construction to arbitrary measures, it is important to realize that measures are
naturally paired in duality with continuous functions (a measure can be accessed
only through integration against continuous functions). The duality is formalized
in the following proposition, which boils down to Proposition 2.4 when dealing with
discrete measures.
Here, (f, g) is a pair of continuous functions and are also called, as in the discrete
case, “Kantorovich potentials.”
The discrete case (2.20) corresponds to the dual vectors being samples of the
continuous potentials, i.e. (fi , gj ) = (f (xi ), g(yj )). The primal-dual optimality con-
ditions allow us to track the support of the optimal plan, and (2.23) is generalized
as
Supp(π) ⊂ {(x, y) ∈ X × Y : f (x) + g(y) = c(x, y)} . (2.26)
Note that in contrast to the primal problem (2.15), showing the existence of solutions to (2.24) is nontrivial, because the constraint set R(c) is not compact and the objective function is noncoercive. Using the machinery of the c-transform detailed
in § 5.1, in the case c(x, y) = d(x, y)p with p ≥ 1, one can, however, show that
optimal (f, g) are necessarily Lipschitz regular, which enables us to replace the
constraint by a compact one.
Remark 2.23 (Unconstrained dual). In the case ∫_X dα = ∫_Y dβ = 1, the constrained dual problem (2.24) can be replaced by an unconstrained one,

L_c(α, β) = sup_{(f,g) ∈ C(X)×C(Y)}  ∫_X f dα + ∫_Y g dβ + min_{X×Y} (c − f ⊕ g),   (2.27)

where we denoted (f ⊕ g)(x, y) ≝ f(x) + g(y). Here the minimum should be considered as the essential infimum associated to the measure α ⊗ β, i.e., it does not change if f or g is modified on sets of zero measure for α and β. This alternative dual formulation was pointed out to us by Francis Bach. It is obtained from the primal problem (2.15) by adding the redundant constraint ∫ dπ = 1.
Proof. We sketch the main ingredients of the proof; more details can be found, for instance, in [Santambrogio, 2015]. We remark that ∫ c dπ = C_{α,β} − 2 ∫ ⟨x, y⟩ dπ(x, y),
28 Theoretical Foundations
where the constant is C_{α,β} = ∫ ‖x‖² dα(x) + ∫ ‖y‖² dβ(y). Instead of solving (2.15),
see also §3.2 and §5.1, where that idea is applied respectively in the discrete
setting and for generic costs c(x, y). By iterating this argument twice, one
can replace ϕ by ϕ∗∗ , which is a convex function, and thus impose in (2.31)
that ϕ is convex. Condition (2.26) shows that an optimal π is supported on
{(x, y) : ϕ(x) + ϕ∗ (y) = hx, yi}, which shows that such a y is optimal for the
minimization (2.30) of the Legendre transform, whose optimality condition reads
y ∈ ∂ϕ(x). Since ϕ is convex, it is differentiable almost everywhere, and since α has
a density, it is also differentiable α-almost everywhere. This shows that for each
x, the associated y is uniquely defined α-almost everywhere as y = ∇ϕ(x), and it
shows that necessarily π = (Id, ∇ϕ)] α.
This result shows that in the setting of W_2 with nonsingular densities, the
Monge problem (2.9) and its Kantorovich relaxation (2.15) are equal (the relaxation
is tight). This is the continuous counterpart of Proposition 2.1 for the assignment
case (2.1), which states that the minimum of the optimal transport problem is
achieved at a permutation matrix (a discrete map) when the marginals are equal
and uniform. Brenier’s theorem, stating that an optimal transport map must be
the gradient of a convex function, provides a useful generalization of the notion
of increasing functions in dimension more than one. This is the main reason why
Remark 2.25 (Monge–Ampère equation). For measures with densities, using (2.8),
one obtains that ϕ is the unique (up to the addition of a constant) convex function
which solves the following Monge–Ampère-type equation:

det(∂²ϕ(x)) ρ_β(∇ϕ(x)) = ρ_α(x),   (2.32)

where ∂²ϕ(x) ∈ ℝ^{d×d} is the Hessian of ϕ. The Monge–Ampère operator det(∂²ϕ(x)) can be understood as a nonlinear degenerate Laplacian. In the limit of small displacements, ∇ϕ = Id + ε∇ψ, one indeed recovers the Laplacian ∆ as a linearization, since for smooth maps

det(∂²ϕ(x)) = det(I_d + ε ∂²ψ(x)) ≈ 1 + ε ∆ψ(x).
The convexity constraint forces det(∂ 2 ϕ(x)) ≥ 0 and is necessary for this equation
to have a solution. There is a large body of literature on the theoretical analysis of
the Monge–Ampère equation, and in particular the regularity of its solution—see,
for instance, [Gutiérrez, 2016]; we refer the interested reader to the review paper
by Caffarelli [2003]. A major difficulty is that in full generality, solutions need not
be smooth, and one has to resort to the machinery of Alexandrov solutions when the
input measures are arbitrary (e.g. Dirac masses). Many solvers have been proposed
in the simpler case of the Monge–Ampère equation det(∂ 2 ϕ(x)) = f (x) for a fixed
right-hand-side f ; see, for instance, [Benamou et al., 2016b] and the references
therein. In particular, capturing anisotropic convex functions requires special care,
and usual finite differences can be inaccurate. For optimal transport, where f
actually depends on ∇ϕ, the discretization of Equation (2.32), and the boundary
condition result in technical challenges outlined in [Benamou et al., 2014] and the
references therein. Note also that related solvers based on fixed-point iterations
have been applied to image registration [Haker et al., 2004].
Remark 2.26 (Binary cost matrix and 1-norm). One can easily check that when the cost matrix C is 0 on the diagonal and 1 elsewhere, namely, when C = 1_{n×n} − I_n, the 1-Wasserstein distance between a and b is equal to half the 1-norm of their difference,
L_C(a, b) = ½ ‖a − b‖₁.
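A quick numerical check, solving (2.11) as a linear program with scipy (the helper ot_cost is our own): with the 0–1 cost, all mass shared by a and b stays in place for free, so the optimal cost is 1 − Σ_i min(a_i, b_i) = ½‖a − b‖₁.

```python
import numpy as np
from scipy.optimize import linprog

def ot_cost(a, b, C):
    # Solve min <C, P> over U(a, b) as a linear program, with P
    # vectorized column by column (p[i + n*j] = P[i, j]).
    n, m = len(a), len(b)
    A_eq = np.vstack([np.kron(np.ones((1, m)), np.eye(n)),   # row sums = a
                      np.kron(np.eye(m), np.ones((1, n)))])  # col sums = b
    res = linprog(C.flatten(order="F"), A_eq=A_eq,
                  b_eq=np.concatenate([a, b]), bounds=(0, None))
    return res.fun

a = np.array([0.2, 0.5, 0.3])
b = np.array([0.6, 0.1, 0.3])
C = 1.0 - np.eye(3)   # 0 on the diagonal, 1 elsewhere
print(np.isclose(ot_cost(a, b, C), 0.5 * np.abs(a - b).sum()))  # → True
```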
Remark 2.27 (Kronecker cost function and total variation). In addition to Re-
mark 2.26 above, one can also easily check that this result extends to arbitrary
measures in the case where c(x, y) is 0 if x = y and 1 when x ≠ y. The OT distance between two measures α and β is then equal to their total variation distance (see also Example 8.2).
i.e. locally (if one assumes distinct points), W p (α, β) is the `p norm between two
vectors of ordered values of α and β. That statement is valid only locally, in the
sense that the order (and those vector representations) might change whenever
some of the values change. That formula is a simple consequence of the more
general setting detailed in Remark 2.30. Figure 2.9, top row, illustrates the 1-D
transportation map between empirical measures with the same number of points.
The bottom row shows how this monotone map generalizes to arbitrary discrete
measures.
It is also possible to leverage this 1-D computation to efficiently compute OT on the circle, as shown by Delon et al. [2010]. Note that if the cost is a concave
function of the distance, notably when p < 1, the behavior of the optimal transport
plan is very different, yet efficient solvers also exist [Delon et al., 2012].
2.6. Special Cases 31
Figure 2.9: 1-D optimal couplings: each arrow x_i → y_j indicates a nonzero P_{i,j} in the optimal coupling. Top: empirical measures with the same number of points (optimal matching). Bottom: generic case. This corresponds to monotone rearrangements: if x_i ≤ x_{i′} are such that P_{i,j} ≠ 0 and P_{i′,j′} ≠ 0, then necessarily y_j ≤ y_{j′}.
optimal assignment between the two discrete distributions. For image processing
applications, (ȳσ(i) )i defines the color values of an equalized version of x̄, whose
empirical distribution matches exactly the one of ȳ. The equalized version of that
image can be recovered by folding back that nm-dimensional vector as an image
of size n × m. Also, t ∈ [0, 1] 7→ (1 − t)x̄i + tȳσ(i) defines an interpolation between
the original image and the equalized one, whose empirical distribution of pixels is
the displacement interpolation (as defined in (7.7)) between those of the inputs.
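The sorting construction described above can be sketched in a few lines of numpy (a toy sketch with synthetic “pixel values”; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.5, 0.1, size=1000)   # pixel values of the input image
y = rng.uniform(0.0, 1.0, size=1000)  # pixel values of the target image

# Optimal assignment for 1-D convex costs: match values by rank, i.e. the
# i-th smallest entry of x is sent to the i-th smallest entry of y.
sigma = np.argsort(y)[np.argsort(np.argsort(x))]

def interpolate(t):
    # Displacement interpolation between original and equalized pixel values.
    return (1 - t) * x + t * y[sigma]

z = interpolate(1.0)  # fully equalized: same histogram as y
print(np.allclose(np.sort(z), np.sort(y)))  # → True
```

At intermediate t ∈ (0, 1), interpolate(t) gives the displacement interpolation of the pixel histograms, as in Figure 2.10.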
That function is also called the generalized quantile function of α. For any p ≥ 1,
Figure 2.10: Histogram equalization for image processing, where t parameterizes the displacement
interpolation between the histograms.
one has

W_p(α, β)^p = ‖C_α^{−1} − C_β^{−1}‖^p_{L^p([0,1])} = ∫_0^1 |C_α^{−1}(r) − C_β^{−1}(r)|^p dr.   (2.36)
This means that through the map α ↦ C_α^{−1}, the Wasserstein distance is isometric
to a linear space equipped with the Lp norm or, equivalently, that the Wasserstein
distance for measures on the real line is a Hilbertian metric. This makes the geom-
etry of 1-D optimal transport very simple but also very different from its geometry
in higher dimensions, which is not Hilbertian as discussed in Proposition 8.1 and
more generally in §8.3. For p = 1, one even has the simpler formula
W_1(α, β) = ‖C_α − C_β‖_{L¹(ℝ)} = ∫_ℝ |C_α(x) − C_β(x)| dx   (2.37)

          = ∫_ℝ | ∫_{−∞}^{x} d(α − β) | dx,   (2.38)
which shows that W_1 is a norm (see §6.2 for the generalization to arbitrary dimensions). An optimal Monge map T such that T♯α = β is then defined by

T = C_β^{−1} ∘ C_α.   (2.39)
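Formula (2.37) can be checked numerically for weighted discrete measures, since both CDFs are piecewise constant; the sketch below compares the exact integral with scipy.stats.wasserstein_distance, which computes W_1 on the line (the toy measures are our own choice):

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two weighted discrete measures on the real line.
xs, a = np.array([0.0, 1.0, 3.0]), np.array([0.5, 0.2, 0.3])
ys, b = np.array([0.5, 2.0]),      np.array([0.4, 0.6])

# W1 via (2.37): the CDFs are piecewise constant, so the integral of
# |C_alpha - C_beta| is accumulated exactly over consecutive support points.
pts = np.sort(np.concatenate([xs, ys]))
w1 = sum(abs(a[xs <= l].sum() - b[ys <= l].sum()) * (r - l)
         for l, r in zip(pts[:-1], pts[1:]))

print(np.isclose(w1, wasserstein_distance(xs, ys, a, b)))  # → True
```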
Figure 2.11: Computation of OT and displacement interpolation between two 1-D measures, using cumulative distribution functions as detailed in (2.39).
N(m_β, Σ_β) are two Gaussians in ℝ^d, then one can show that the following map,

T : x ↦ m_β + A(x − m_α),   (2.40)

where

A = Σ_α^{−1/2} (Σ_α^{1/2} Σ_β Σ_α^{1/2})^{1/2} Σ_α^{−1/2} = A^T,
is such that T♯ρ_α = ρ_β. Indeed, one simply has to notice that the change of variables formula (2.8) is satisfied since

ρ_β(T(x)) = det(2πΣ_β)^{−1/2} exp(−⟨T(x) − m_β, Σ_β^{−1}(T(x) − m_β)⟩/2)
          = det(2πΣ_β)^{−1/2} exp(−⟨x − m_α, A^T Σ_β^{−1} A (x − m_α)⟩/2)
          = det(2πΣ_β)^{−1/2} exp(−⟨x − m_α, Σ_α^{−1}(x − m_α)⟩/2),
Both that map T and the corresponding potential ψ are illustrated in Figures 2.12 and 2.13.
With additional calculations involving first and second order moments of ρ_α, we obtain that the transport cost of this map is

W_2(α, β)² = ‖m_α − m_β‖² + B(Σ_α, Σ_β)²,   (2.41)
where B is the so-called Bures metric [1969] between positive definite matrices (see also Forrester and Kieburg [2016]), defined as

B(Σ_α, Σ_β)² ≝ tr(Σ_α + Σ_β − 2 (Σ_α^{1/2} Σ_β Σ_α^{1/2})^{1/2}),   (2.42)
where Σ^{1/2} is the matrix square root. One can show that B is a distance on covariance matrices and that B² is convex with respect to both its arguments. In the
case where Σ_α = diag(r_i)_i and Σ_β = diag(s_i)_i are diagonal, the Bures metric is the Hellinger distance

B(Σ_α, Σ_β) = ‖√r − √s‖₂.
For 1-D Gaussians, W_2 is thus the Euclidean distance on the 2-D plane plotting the mean and the standard deviation of a Gaussian, (m, √Σ), as illustrated in Figure 2.14. For a detailed treatment of the Wasserstein geometry of Gaus-
sian distributions, we refer to Takatsu [2011], and for additional considerations
on the Bures metric the reader can consult the very recent references [Malagò
et al., 2018, Bhatia et al., 2018]. One can also consult [Muzellec and Cuturi, 2018] for a recent application of this metric to compute probabilistic embeddings for
words, [Shafieezadeh Abadeh et al., 2018] to see how it is used to compute a robust
extension to Kalman filtering, or [Mallasto and Feragen, 2017] in which it is applied
to covariance functions in reproducing kernel Hilbert spaces.
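The map (2.40) and the cost (2.41)–(2.42) are straightforward to evaluate numerically; below is a sketch (the helper names psd_sqrt, bures_sq, gaussian_w2_sq, and brenier_map are ours) that also checks empirically that T pushes samples of the first Gaussian onto the second:

```python
import numpy as np

def psd_sqrt(S):
    # Matrix square root of a symmetric positive definite matrix.
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(w)) @ V.T

def bures_sq(S_a, S_b):
    # Squared Bures metric (2.42).
    R = psd_sqrt(S_a)
    return np.trace(S_a + S_b - 2 * psd_sqrt(R @ S_b @ R))

def gaussian_w2_sq(m_a, S_a, m_b, S_b):
    # W2^2 between Gaussians, assuming the decomposition (2.41).
    return np.sum((m_a - m_b) ** 2) + bures_sq(S_a, S_b)

def brenier_map(m_a, S_a, m_b, S_b):
    # Optimal map (2.40): T(x) = m_b + A (x - m_a), with A symmetric.
    R_inv = np.linalg.inv(psd_sqrt(S_a))
    A = R_inv @ psd_sqrt(psd_sqrt(S_a) @ S_b @ psd_sqrt(S_a)) @ R_inv
    return lambda x: m_b + (x - m_a) @ A  # A = A.T

m_a, S_a = np.array([-2.0, 0.0]), np.array([[1.0, -0.5], [-0.5, 1.0]])
m_b, S_b = np.array([3.0, 1.0]),  np.array([[2.0, 0.5], [0.5, 1.0]])

# Empirical check that T pushes N(m_a, S_a) forward to N(m_b, S_b).
x = np.random.default_rng(0).multivariate_normal(m_a, S_a, size=200_000)
Tx = brenier_map(m_a, S_a, m_b, S_b)(x)
print(np.allclose(Tx.mean(0), m_b, atol=0.02),
      np.allclose(np.cov(Tx.T), S_b, atol=0.02))
```

For diagonal covariances the Bures term reduces to the Hellinger distance noted above, e.g. bures_sq(diag(1, 4), diag(9, 16)) = (1−3)² + (2−4)² = 8.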
Figure 2.12: Two Gaussians ρ_α and ρ_β, represented using the contour plots of their densities, with respective means and covariance matrices m_α = (−2, 0), Σ_α = ½ (1, −½; −½, 1) and m_β = (3, 1), Σ_β = (2, ½; ½, 1). The arrows originate at random points x taken on the plane and end at the corresponding mappings of those points T(x) = m_β + A(x − m_α).
ρ_α(x) = (1/√det(A)) h(⟨x − m_α, A^{−1}(x − m_α)⟩),
ρ_β(x) = (1/√det(B)) h(⟨x − m_β, B^{−1}(x − m_β)⟩),

for the same nonnegative valued function h such that the integral

∫_{ℝ^d} h(⟨x, x⟩) dx = 1,
then their optimal transport map is also the linear map (2.40) and their Wasserstein distance is also given by the expression (2.41), with a slightly different scaling of the Bures metric that depends only on the generator function h. For instance, that scaling is 1 for Gaussians (h(t) = e^{−t/2}) and 1/(d+2) for uniform distributions on ellipsoids
(h the indicator function for [0, 1]). This result follows from the fact that the
covariance matrix of an elliptic distribution is a constant times its positive definite
parameter [Gómez et al., 2003, Theo. 4(ii)] and that the Wasserstein distance
between elliptic distributions is a function of the Bures distance between their
covariance matrices [Gelbrich, 1990, Cor. 2.5].
Figure 2.13: Same Gaussians ρα and ρβ as defined in Figure 2.12, represented this time as surfaces.
The surface above is the Brenier potential ψ defined up to an additive constant (here +50) such that
T = ∇ψ. For visual purposes, both Gaussian densities have been multiplied by a factor of 100.
Figure 2.14: Computation of displacement interpolation between two 1-D Gaussians. Denoting G_{m,σ}(x) ≝ (1/(√(2π) σ)) e^{−(x−m)²/(2σ²)} the Gaussian density, it thus shows the interpolation G_{(1−t)m₀+tm₁, (1−t)σ₀+tσ₁}.
3
Algorithmic Foundations
This chapter describes the most common algorithmic tools from combinatorial opti-
mization and linear programming that can be used to solve the discrete formulation
of optimal transport, as described in the primal problem (2.11) or alternatively its
dual (2.20).
The origins of these algorithms can be traced back to World War II, either right
before with Tolstoı’s seminal work [1930] or during the war itself, when Hitchcock
[1941] and Kantorovich [1942] formalized the generic problem of dispatching available
resources toward consumption sites in an optimal way. Both of these formulations, as
well as the later contribution by Koopmans [1949], fell short of providing a provably cor-
rect algorithm to solve that problem (the cycle violation method was already proposed
as a heuristic by Tolstoı [1939]). One had to wait until the field of linear programming fully blossomed, with the proposal of the simplex method, to be at last able to solve these problems rigorously.
The goal of linear programming is to solve optimization problems whose objective
function is linear and whose constraints are linear (in)equalities in the variables of in-
terest. The optimal transport problem fits that description and is therefore a particular
case of that wider class of problems. One can argue, however, that optimal transport is truly special among all linear programs. First, Dantzig's early motivation to solve linear programs was closely related to that of solving transportation problems [Dantzig, 1949, p. 210]. Second, despite being only a particular case, the optimal transport problem
remained in the spotlight of optimization, because it was understood shortly after that
optimal transport problems were related, and in fact equivalent, to an important class
of linear programs known as minimum cost network flows [Korte and Vygen, 2012, p.
213, Lem. 9.3] thanks to a result by Ford and Fulkerson [1962]. As such, the OT prob-
lem has been the subject of particular attention, ever since the birth of mathematical
programming [Dantzig, 1951], and is still widely used to introduce optimization to a
new audience [Nocedal and Wright, 1999, §1, p. 4].
To make the link with the linear programming literature, one can cast the equation
above as a linear program in standard form, that is, a linear program with a linear
objective; equality constraints defined with a matrix and a constant vector; and non-
negative constraints on variables. Let In stand for the identity matrix of size n and let
⊗ be Kronecker’s product. The (n + m) × nm matrix
A = [ 1_m^T ⊗ I_n ; I_m ⊗ 1_n^T ] ∈ ℝ^{(n+m)×nm}
can be used to encode the row-sum and column-sum constraints that need to be satisfied
for any P to be in U(a, b). To do so, simply cast a matrix P ∈ Rn×m as a vector p ∈ Rnm
such that the i + n(j − 1)’s element of p is equal to Pij (P is enumerated columnwise)
to obtain the following equivalence:
P ∈ U(a, b)  ⇔  p ∈ ℝ₊^{nm},  Ap = [a; b].

The primal problem (2.11) then reads, in standard form,

L_C(a, b) = min { c^T p : p ∈ ℝ₊^{nm}, Ap = [a; b] },   (3.2)

where the nm-dimensional vector c is equal to the stacked columns contained in the cost matrix C.
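Note that the ordering of the Kronecker factors must match the enumeration of p; with the column-major enumeration p[i + n·j] = P[i, j] used here, the two blocks read 1_m^T ⊗ I_n and I_m ⊗ 1_n^T. A small numpy sketch checking this, together with the rank deficiency discussed in Remark 3.1:

```python
import numpy as np

n, m = 3, 4
# Row-sum and column-sum constraints for p = P.flatten(order="F"),
# i.e. the column-major enumeration p[i + n*j] = P[i, j].
A = np.vstack([np.kron(np.ones((1, m)), np.eye(n)),   # P @ 1_m   = a
               np.kron(np.eye(m), np.ones((1, n)))])  # P.T @ 1_n = b

P = np.random.default_rng(0).random((n, m))
p = P.flatten(order="F")
print(np.allclose(A @ p, np.concatenate([P.sum(axis=1), P.sum(axis=0)])))  # → True

# Remark 3.1: one of the n + m constraints is redundant.
print(np.linalg.matrix_rank(A) == n + m - 1)  # → True
```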
Remark 3.1. Note that one of the n + m constraints described above is redundant or, in other words, that the row vectors of the matrix A are not linearly independent. Indeed, summing the first n rows and summing the last m rows both result in the same vector (namely [1_n; 0_m]^T A = [0_n; 1_m]^T A = 1_{nm}^T). One can show that removing any one row of A and the corresponding entry in [a; b] yields a properly defined linear system. For simplicity, and to avoid treating a and b asymmetrically, we retain in what follows a redundant formulation, keeping in mind that degeneracy will pop up in some of our computations.
3.1. The Kantorovich Linear Programs 39
The dual problem corresponding to Equation (3.2) is, following duality in linear programming [Bertsimas and Tsitsiklis, 1997, p. 143], defined as

L_C(a, b) = max_{h ∈ ℝ^{n+m}, A^T h ≤ c}  [a; b]^T h.   (3.3)

Note that this program is exactly equivalent to that presented in Equation (2.20).
Remark 3.2. We provide a simple derivation of the duality result above, which can be
seen as a direct formulation of the arguments developed in Remark 2.21. Strong duality,
namely the fact that the optima of both primal (3.2) and dual (3.3) problems do indeed
coincide, requires a longer proof [Bertsimas and Tsitsiklis, 1997, §4.10]. To simplify
notation, we write q = [a; b] ∈ ℝ₊^{n+m}. Consider now a relaxed primal problem of the optimal transport problem, where the marginal constraints Ap = q are moved into the objective through a vector h ∈ ℝ^{n+m} of Lagrange multipliers:

H(h) ≝ min_{p ∈ ℝ₊^{nm}}  c^T p − h^T (Ap − q).

Note first that this relaxed problem has no marginal constraints on p. Because that
minimization allows for many more p solutions, we expect H(h) to be smaller than
z̄ = LC (a, b). Indeed, writing p? for any optimal solution of the primal problem (3.1),
we obtain
min_{p ∈ ℝ₊^{nm}}  c^T p − h^T (Ap − q) ≤ c^T p⋆ − h^T (Ap⋆ − q) = c^T p⋆ = z̄.
The approach above therefore defines a problem which can be used to compute a lower bound on the optimal value of the original problem (3.1), for any vector h; that function H is called the Lagrange dual function of L. The goal of duality theory is now to compute the best lower bound z by maximizing H over all vectors h, namely

z = max_h H(h) = max_h ( h^T q + min_{p ∈ ℝ₊^{nm}} (c − A^T h)^T p ).
3.2 C-Transforms
We present in this section an important property of the dual optimal transport prob-
lem (3.3) which takes a more important meaning when used for the semidiscrete optimal
transport problem in §5.1. This section builds upon the original formulation (2.20) that
splits dual variables according to row and column sum constraints:
Consider any dual feasible pair (f, g). If we “freeze” the value of f, we can notice that there is no better vector solution for g than the C-transform vector of f, denoted f^C ∈ ℝ^m and defined as

(f^C)_j = min_{i ∈ ⟦n⟧}  C_{i,j} − f_i,
since it is indeed easy to prove that (f, f C ) ∈ R(C) and that f C is the largest possible
vector such that this constraint is satisfied. We therefore have that
This result allows us first to reformulate the dual problem as a piecewise affine concave
maximization problem expressed in a single variable f as
L_C(a, b) = max_{f ∈ ℝ^n}  ⟨f, a⟩ + ⟨f^C, b⟩.   (3.5)
Putting that result aside, the same reasoning applies of course if we now “freeze” the values of g and consider instead the C̄-transform of g, namely the vector g^{C̄} ∈ ℝ^n defined as

(g^{C̄})_i = min_{j ∈ ⟦m⟧}  C_{i,j} − g_j.
One may hope for a strict increase in the objective at each of these iterations. However,
this does not work because alternating C and C̄ transforms quickly hits a plateau.
Proposition 3.1. The following identities, in which the inequality sign between vectors
should be understood elementwise, hold:
(i) f ≤ f′ ⇒ f^C ≥ f′^C,
3.3. Complementary Slackness 41
The relation g^{C̄C} ≥ g is obtained in the same way. Now, set g = f^C. Then, g^{C̄} = f^{CC̄} ≥ f. Therefore, using result (i) we have f^{CC̄C} ≤ f^C. Result (ii) yields f^{CC̄C} ≥ f^C, proving the equality.
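The plateau is easy to observe numerically; in the sketch below (the helpers ct and ct_bar are our own shorthand for the two transforms), one alternation improves f (f^{CC̄} ≥ f), but the subsequent C-transform returns exactly the same g:

```python
import numpy as np

rng = np.random.default_rng(3)
C = rng.random((4, 5))

def ct(f):      # f in R^n  ->  f^C in R^m
    return (C - f[:, None]).min(axis=0)

def ct_bar(g):  # g in R^m  ->  g^{C-bar} in R^n
    return (C - g[None, :]).min(axis=1)

f = rng.random(4)
g1 = ct(f)        # freeze f, take the best g
f1 = ct_bar(g1)   # freeze g, take the best f: improves, f1 >= f
g2 = ct(f1)       # ...but then g stalls: f^{C Cbar C} = f^C
print(np.allclose(g2, g1), np.all(f1 >= f - 1e-12))  # → True True
```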
Primal (3.2) and dual (3.3), (2.20) problems can be solved independently to obtain
optimal primal P? and dual (f? , g? ) solutions. The following proposition characterizes
their relationship.
Proposition 3.2. Let P⋆ and (f⋆, g⋆) be optimal solutions of the primal (2.11) and dual (2.20) problems, respectively. Then, for any pair (i, j) ∈ ⟦n⟧ × ⟦m⟧, P⋆_{i,j} (C_{i,j} − (f⋆_i + g⋆_j)) = 0 holds. In other words, if P⋆_{i,j} > 0, then necessarily f⋆_i + g⋆_j = C_{i,j}; if f⋆_i + g⋆_j < C_{i,j}, then necessarily P⋆_{i,j} = 0.
Proof. We have by strong duality that ⟨P⋆, C⟩ = ⟨f⋆, a⟩ + ⟨g⋆, b⟩. Recall that P⋆ 1_m = a and P⋆^T 1_n = b; therefore

⟨f⋆, a⟩ + ⟨g⋆, b⟩ = ⟨f⋆, P⋆ 1_m⟩ + ⟨g⋆, P⋆^T 1_n⟩ = ⟨f⋆ 1_m^T, P⋆⟩ + ⟨1_n g⋆^T, P⋆⟩,

which results in

⟨P⋆, C − f⋆ ⊕ g⋆⟩ = 0.
Because (f⋆, g⋆) belongs to the polyhedron of dual constraints (2.21), each entry of the matrix C − f⋆ ⊕ g⋆ is necessarily nonnegative. Therefore, since all the entries of P⋆ are nonnegative, the constraint that the dot product above is equal to 0 enforces that, for any pair of indices (i, j) such that P⋆_{i,j} > 0, C_{i,j} − (f⋆_i + g⋆_j) must be zero, and for any pair of indices (i, j) such that C_{i,j} > f⋆_i + g⋆_j, that P⋆_{i,j} = 0.
The converse result is also true. We define first the idea that two variables for the
primal and dual problems are complementary.
Definition 3.1. A matrix P ∈ Rn×m and a pair of vectors (f, g) are complementary
w.r.t. C if for all pairs of indices (i, j) such that Pi,j > 0 one also has Ci,j = fi + gj .
If a pair of feasible primal and dual variables is complementary, then we can conclude
they are optimal.
Proposition 3.3. If P and (f, g) are complementary and feasible solutions for the primal (2.11) and dual (2.20) problems, respectively, then P and (f, g) are both primal and dual optimal.
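These two propositions can be checked numerically by solving the primal (3.2) and the dual (3.3) independently with scipy's linprog (a sketch under the column-major vectorization of P; the tolerance is solver-dependent):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
n, m = 4, 5
a = rng.random(n); a /= a.sum()
b = rng.random(m); b /= b.sum()
C = rng.random((n, m))
c = C.flatten(order="F")

A = np.vstack([np.kron(np.ones((1, m)), np.eye(n)),
               np.kron(np.eye(m), np.ones((1, n)))])
q = np.concatenate([a, b])

# Primal (3.2) and dual (3.3), solved independently of each other.
primal = linprog(c, A_eq=A, b_eq=q, bounds=(0, None))
dual = linprog(-q, A_ub=A.T, b_ub=c, bounds=(None, None))  # max <=> min of -q.h

P = primal.x.reshape((n, m), order="F")
f, g = dual.x[:n], dual.x[n:]

print(np.isclose(primal.fun, -dual.fun))                         # strong duality
print(np.all(np.abs(P * (C - f[:, None] - g[None, :])) < 1e-7))  # Prop. 3.2
```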
and therefore P and (f, g) are respectively primal and dual optimal.
Recall that a vertex or an extremal point of a convex set is formally a point x in that set such that, if there exist y and z in that set with x = (y + z)/2, then necessarily x = y = z. A linear program with a nonempty and bounded feasible set attains its
minimum at a vertex (or extremal point) of the feasible set [Bertsimas and Tsitsiklis,
1997, p. 65, Theo. 2.7]. Since the feasible set U(a, b) of the primal optimal transport
problem (3.2) is bounded, one can restrict the search for an optimal P to the set of
extreme points of the polytope U(a, b). Matrices P that are extremal in U(a, b) have
an interesting structure that has been the subject of extensive research [Brualdi, 2006,
§8]. That structure requires describing the transport problem using the formalism of
bipartite graphs.
Figure 3.1: The optimal transport problem as a bipartite network flow problem. Here n = 3, m = 4. All coordinates of the source histogram a are depicted as source nodes on the left, labeled 1, 2, 3, whereas all coordinates of the target histogram b are labeled as nodes 1′, 2′, 3′, 4′. The graph is bipartite in the sense that all source nodes are connected to all target nodes, with no additional edges. To each edge (i, j′) is associated a cost C_{i,j}. A feasible flow is represented on the right. Proposition 3.4 shows that this flow is not extremal since it has at least one cycle, given by ((1, 1′), (2, 1′), (2, 4′), (1, 4′)).
Figure 3.2: A solution P with a cycle in the graph of its support can be perturbed to obtain two
feasible solutions Q and R such that P is their average, therefore disproving that P is extremal.
0}. Then the graph G(P) = (V ∪ V′, S(P)) has no cycles. In particular, P cannot have more than n + m − 1 nonzero entries.
Proof. Suppose that G(P) contains a cycle, namely a sequence of edges of the form

H ≝ ((i₁, j₁′), (i₂, j₁′), (i₂, j₂′), . . . , (i_k, j_k′), (i₁, j_k′)).
Denote by H̄ the cycle H with alternating edge orientations, and choose ε < min_{(i,j′)∈H} P_{i,j}. Consider a perturbation matrix E whose (i, j) entry is equal to ε if i → j′ ∈ H̄, −ε if j′ → i ∈ H̄, and zero otherwise. Define matrices Q = P + E and
R = P−E as illustrated in Figure 3.2. Because ε is small enough, all elements in Q and
R are nonnegative. By construction, E has either lines (resp., columns) with all entries
equal to 0 or exactly one entry equal to ε and another equal to −ε for those indexed
by i1 , . . . , ik (resp., j1 , . . . , jk ). Therefore, E is such that E1m = 0n and ET 1n = 0m ,
and we have that Q and R have the same marginals as P, and are therefore feasible.
Finally P = (Q + R)/2 which, since Q, R 6= P, contradicts the fact that P is an
extremal point. Since a graph with k nodes and no cycles cannot have more than k − 1
edges, we conclude that S(P) cannot have more than n + m − 1 edges, and therefore
P cannot have more than n + m − 1 nonzero entries.
[0.2 0 0; 0.3 0.1 •; 0 0 0] → [0.2 0 0; 0.3 0.1 0.1; 0 0 •] → [0.2 0 0; 0.3 0.1 0.1; 0 0 0.3]
We write NW(a, b) for the unique plan that can be obtained through this heuristic.
Note that there is, however, a much larger number of NW corner solutions that
can be obtained by permuting arbitrarily the order of a and b first, computing
the corresponding NW corner table, and recovering a table of U(a, b) by invert-
ing again the order of columns and rows: setting σ = (3, 1, 2), σ′ = (3, 2, 1) gives
3.5. A Heuristic Description of the Network Simplex 45
a_σ = [0.3, 0.2, 0.5], b_{σ′} = [0.4, 0.1, 0.5], and σ⁻¹ = (2, 3, 1), σ′⁻¹ = (3, 2, 1). Observe that

NW(a_σ, b_{σ′}) = [0.3 0 0; 0.1 0.1 0; 0 0 0.5] ∈ U(a_σ, b_{σ′}),

NW_{σ⁻¹ σ′⁻¹}(a_σ, b_{σ′}) = [0 0.1 0.1; 0.5 0 0; 0 0 0.3] ∈ U(a, b).
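The NW corner rule itself can be sketched in a few lines (our own implementation, for illustration):

```python
import numpy as np

def nw_corner(a, b):
    # Fill the plan greedily from the top-left ("north-west") corner:
    # each step saturates either the current row or the current column.
    a, b = a.astype(float).copy(), b.astype(float).copy()
    P = np.zeros((len(a), len(b)))
    i = j = 0
    while i < len(a) and j < len(b):
        t = min(a[i], b[j])
        P[i, j] = t
        a[i] -= t
        b[j] -= t
        if a[i] == 0.0 and i < len(a) - 1:
            i += 1   # row i is saturated, move down
        else:
            j += 1   # column j is saturated, move right
    return P

a = np.array([0.2, 0.5, 0.3])
b = np.array([0.5, 0.1, 0.4])
print(nw_corner(a, b))
# → [[0.2 0.  0. ]
#    [0.3 0.1 0.1]
#    [0.  0.  0.3]]
```

The result always has at most n + m − 1 nonzero entries, since each step saturates a row or a column.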
Let N (a, b) be the set of all NW corner solutions that can be produced this way:
Consider a feasible matrix P whose graph G(P) = (V ∪ V 0 , S(P)) has no cycles. P has
therefore no more than n + m − 1 nonzero entries and is a vertex of U(a, b) by Propo-
sition 3.4. Following Proposition 3.3, it is therefore sufficient to obtain a dual solution
(f, g) which is feasible (i.e. C − f ⊕ g has nonnegative entries) and complementary to P (pairs of indices (i, j′) in S(P) are such that C_{i,j} = f_i + g_j) to prove that P is optimal.
The network simplex relies on two simple principles: to each feasible primal solution
P one can associate a complementary pair (f, g). If that pair is feasible, then we have
reached optimality. If not, one can consider a modification of P that remains feasible
and whose complementary pair (f, g) is modified so that it becomes closer to feasibility.
The simplex proceeds by associating first to any extremal solution P a pair of (f, g)
complementary dual variables. This is simply carried out by finding two vectors f and
g such that for any (i, j′) in S(P), f_i + g_j is equal to C_{i,j}. Note that this, in itself, does
not guarantee that (f, g) is feasible.
Let s be the cardinality of S(P). Because P is extremal, s ≤ n + m − 1. Because
G(P) has no cycles, G(P) is either a tree or a forest (a union of trees), as illustrated
in Figure 3.3. Aiming for a pair (f, g) that is complementary to P, we consider the set of s linear equations
$$\forall k \in \{1, \ldots, s\}, \qquad f_{i_k} + g_{j_k} = C_{i_k, j_k}, \qquad (3.6)$$
[Figure 3.3 shows a feasible transport P between n = 5 source nodes, with weights (0.16, 0.4, 0.06, 0.24, 0.14), and m = 6 target nodes, with weights (0.1, 0.16, 0.3, 0.1, 0.2, 0.14), where
S(P) = {{1, 1′}, {1, 2′}, {2, 2′}, {2, 3′}, {3, 4′}, {4, 4′}, {4, 5′}, {5, 6′}},
G(P) = ({1, 2, 1′, 2′, 3′}, {{1, 1′}, {1, 2′}, {2, 2′}, {2, 3′}}) ∪ ({3, 4, 4′, 5′}, {{3, 4′}, {4, 4′}, {4, 5′}}) ∪ ({5, 6′}, {{5, 6′}}).]
Figure 3.3: A feasible transport P and its corresponding set of edges S(P) and graph G(P). As can be seen, the graph G(P) = ({1, . . . , 5, 1′, . . . , 6′}, S(P)) is a forest, meaning that it can be expressed as the union of tree graphs, three in this case.
where the elements of S(P) are enumerated as $(i_1, j_1'), \ldots, (i_s, j_s')$.
Since s ≤ n + m − 1 < n + m, the linear system (3.6) above is always underdetermined.
This degeneracy stems in part from the fact that the parameterization of U(a, b)
with n + m constraints results in n + m dual variables. A more careful formulation,
outlined in Remark 3.1, would have resulted in an equivalent formulation with only
n + m − 1 constraints and therefore n + m − 1 dual variables. However, s can also be
strictly smaller than n + m − 1: This happens when G(P) is the disjoint union of two
or more trees. For instance, there are 5 + 6 = 11 dual variables (one for each node) in
Figure 3.3, but only 8 edges among these 11 nodes, namely 8 linear equations to define
(f, g). Therefore, there will be as many undetermined dual variables under that setting
as there will be connected components in G(P).
Consider a tree among those listed in G(P). Suppose that tree has k nodes i₁, . . . , i_k among source nodes and l nodes j₁′, . . . , j_l′ among target nodes, resulting in r = k + l nodes linked by r − 1 edges, and hence r − 1 of the equations in (3.6) involving the r dual variables attached to these nodes. Those variables can be determined by fixing one of them and propagating the equalities along the tree, as illustrated in Figure 3.4.
[Figure 3.4 shows the first tree of Figure 3.3, with dual variables computed as f₁ := 0, g₁ := C₁,₁ − f₁, f₂ := C₂,₁ − g₁, g₂ := C₂,₂ − f₂, g₃ := C₂,₃ − f₂.]
Figure 3.4: The five dual variables f1 , f2 , g1 , g2 , g3 corresponding to the five nodes appearing in the
first tree of the graph G(P) illustrated in Figure 3.3 are linked through four linear equations that
involve corresponding entries in the cost matrix C. Because that system is degenerate, we choose a
root in that tree (node 1 in this example) and set its corresponding variable to 0 and proceed then by
traversing the tree (either breadth-first or depth-first) from the root to obtain iteratively the values of
the four remaining dual variables.
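The traversal described in this caption can be sketched as follows (a toy implementation; the node encoding and function name are ours, not the book's):

```python
from collections import deque

def duals_from_tree(edges, C):
    # Solve f_i + g_j = C[i][j] over a tree of edges (i, j) by rooting
    # the tree at the first source node, setting its dual variable to 0,
    # and traversing breadth-first (the procedure of Figure 3.4).
    adj = {}
    for i, j in edges:
        adj.setdefault(('f', i), []).append(('g', j))
        adj.setdefault(('g', j), []).append(('f', i))
    root = ('f', edges[0][0])
    val = {root: 0.0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in val:
                i = node[1] if node[0] == 'f' else nb[1]
                j = nb[1] if nb[0] == 'g' else node[1]
                val[nb] = C[i][j] - val[node]   # enforce f_i + g_j = C_ij
                queue.append(nb)
    f = {i: v for (t, i), v in val.items() if t == 'f'}
    g = {j: v for (t, j), v in val.items() if t == 'g'}
    return f, g
```

On the first tree of Figure 3.3 (0-based indices), this reproduces the chain of assignments listed in the caption.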
(b) G now has a cycle. In that case, we need to remove an edge in G to ensure that G
is still a forest, yet also modify P so that P is feasible and G(P) remains included
in G. These operations can all be carried out by increasing the value of Pi,j and
modifying the other entries of P appearing in the detected cycle, in a manner
very similar to the one we used to prove Proposition 3.4. To be more precise, let
us write that cycle $(i_1, j_1'), (j_1', i_2), (i_2, j_2'), \ldots, (i_l, j_l'), (j_l', i_{l+1})$ with the convention
that $i_1 = i_{l+1} = i$ to ensure that the path is a cycle that starts and ends at i,
whereas $j_1 = j$, to highlight the fact that the cycle starts with the added edge
{i, j′}, going in the right direction. Increase now the flow of all “positive” edges
$(i_k, j_k')$ (for k ≤ l), and decrease that of “negative” edges $(j_k', i_{k+1})$ (for k ≤ l), to
obtain an updated primal solution P̃, equal to P for all but the following entries:
$$\forall k \le l, \qquad \tilde{P}_{i_k, j_k} := P_{i_k, j_k} + \theta, \qquad \tilde{P}_{i_{k+1}, j_k} := P_{i_{k+1}, j_k} - \theta.$$
Here, θ is the largest possible increase at index i, j using that cycle, namely the smallest flow value among the edges that the update decreases.
We now use the dual vectors (f, g) computed at the end of the previous iteration. They
are such that $f_{i_k} + g_{j_k} = C_{i_k, j_k}$ and $f_{i_{k+1}} + g_{j_k} = C_{i_{k+1}, j_k}$ for all edges initially in G,
resulting in the identity
$$\sum_{k=1}^{l} C_{i_k, j_k} - \sum_{k=1}^{l} C_{i_{k+1}, j_k} = C_{i,j} + \sum_{k=2}^{l} \left(f_{i_k} + g_{j_k}\right) - \sum_{k=1}^{l} \left(f_{i_{k+1}} + g_{j_k}\right) = C_{i,j} - (f_i + g_j).$$
That term is, by definition, negative, since i, j were chosen because $C_{i,j} < f_i + g_j$.
Therefore, if θ > 0, we have that
$$\langle \tilde{P}, C \rangle = \langle P, C \rangle + \theta \left( C_{i,j} - (f_i + g_j) \right) < \langle P, C \rangle,$$
namely a strict decrease of the objective.
If θ = 0, which can happen if G and G(P) differ, the graph G is simply changed, but
P is not.
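The update described in case (b) can be illustrated in code; here the cycle is assumed to have already been detected and is passed as a list of entries alternating between "positive" and "negative" positions (the representation and indices are ours):

```python
def apply_cycle(P, cycle):
    # cycle alternates "positive" entries (flow increased) at even
    # positions and "negative" entries (flow decreased) at odd ones;
    # theta is the smallest flow among the negative entries, so one of
    # them drops to zero and leaves the support of P.
    theta = min(P[i][j] for k, (i, j) in enumerate(cycle) if k % 2 == 1)
    for k, (i, j) in enumerate(cycle):
        P[i][j] += theta if k % 2 == 0 else -theta
    return P, theta
```

On a toy plan, adding an edge that closes a cycle and applying this update moves θ units of mass around the cycle while preserving both marginals.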
The network simplex algorithm can therefore be summarized as follows: Initialize
the algorithm with an extremal solution P, given for instance by the NW corner rule
as covered in §3.4.2. Initialize the graph G with G(P). Compute a pair of dual variables (f, g) complementary to P by solving the linear system (3.6) using the tree structure(s) in G, as described in §3.5.1. (i) Look for a pair of indices violating the constraint C − f ⊕ g ≥ 0; if there is none, P is optimal: stop. If there is a violating pair
(i, j 0 ), (ii) add the edge (i, j 0 ) to G. If G still has no cycles, update (f, g) accordingly;
if there is a cycle, direct it making sure (i, j 0 ) is labeled as positive, and remove a neg-
ative edge in that cycle with the smallest flow value, updating P, G as illustrated in
Figure 3.5, then build a complementary pair f, g accordingly; return to (i). Some of the
operations above require graph operations (cycle detection, tree traversals) which can
be implemented efficiently in this context, as described in ([Bertsekas, 1998, §5]).
Figure 3.5: Adding an edge {i, j} to the graph G(P) can result in either (a) the graph remains a
forest after this addition, in which case f, g can be recomputed following the approach outlined in §3.5.1;
(b.1) the addition of that edge creates a cycle, from which we can define a directed path; (b.2) the path
can be used to increase the value of Pi,j and propagate that change along the cycle to maintain the
flow feasibility constraints, until the flow of one of the edges that is negatively impacted by the cycle
is decreased to 0. This removes the cycle and updates P.
Orlin [1997] was the first to prove the polynomial time complexity of the
network simplex. Tarjan [1997] provided shortly after an improved bound in
O((n + m)nm log(n + m) log((n + m)‖C‖∞)), which relies on more efficient data
structures to help select pivoting edges.
3.6 Dual Ascent Methods

Dual ascent methods precede the network simplex by a few decades, since they can be
traced back to work by Borchardt and Jacobi [1865] and later König and Egerváry, as
recounted by Kuhn [1955]. The Hungarian algorithm is the best known algorithm in
that family, and it can work only in the particular case when a and b are equal and are
both uniform, namely a = b = 1n /n. We provide in what follows a concise description
of the more general family of dual ascent methods. This requires the knowledge of
the maximum flow problem ([Bertsimas and Tsitsiklis, 1997, §7.5]). By contrast to the
network simplex, presented above in the primal, dual ascent methods maintain at each
iteration dual feasible solutions whose objective is progressively improved by adding a
sparse vector to f and g. Our presentation is mostly derived from that of ([Bertsimas
and Tsitsiklis, 1997, §7.7]) and starts with the following definition.
Definition 3.2. For S ⊂ ⟦n⟧ and S′ ⊂ ⟦m⟧′ := {1′, . . . , m′}, we write 1_S for the vector in Rⁿ of zeros except for ones at the indices enumerated in S, and likewise for the vector 1_{S′} in Rᵐ with indices in S′.
In what follows, (f, g) is a feasible dual pair in R(C). Recall that this simply means
that for all pairs (i, j′) ∈ ⟦n⟧ × ⟦m⟧′, f_i + g_j ≤ C_{i,j}. We say that (i, j′) is a balanced pair
(or edge) if f_i + g_j = C_{i,j} and inactive otherwise, namely if f_i + g_j < C_{i,j}. With this
convention, we start with a simple result describing how a feasible dual pair (f, g) can
be perturbed using sparse vectors indexed by sets S and S 0 and still remain feasible.
Proposition 3.5. (f̃, g̃) := (f, g) + ε(1_S, −1_{S′}) is dual feasible for a small enough ε > 0
if for all i ∈ S, the fact that (i, j′) is balanced implies that j′ ∈ S′.
Proof. For any i ∈ S, consider the set I_i of all j′ ∈ ⟦m⟧′ such that (i, j′) is inactive,
namely such that f_i + g_j < C_{i,j}. Define ε_i := min_{j′ ∈ I_i} (C_{i,j} − f_i − g_j), the smallest margin
by which f_i can be increased without violating the constraints corresponding to j′ ∈ I_i.
Indeed, one has that if ε ≤ ε_i, then f̃_i + g̃_j ≤ C_{i,j} for any j′ ∈ I_i. Consider now the set
B_i of balanced edges associated with i. Note that B_i = ⟦m⟧′ \ I_i. The assumption above
is that j′ ∈ B_i ⇒ j′ ∈ S′. Therefore, one has for j′ ∈ B_i that f̃_i + g̃_j = f_i + g_j = C_{i,j}. As
a consequence, the inequality f̃_i + g̃_j ≤ C_{i,j} is ensured for any j′ ∈ ⟦m⟧′. Choosing now
an increase ε smaller than the smallest allowed, namely min_{i∈S} ε_i, we recover
that (f̃, g̃) is dual feasible.
The main motivation behind the iteration of the network simplex presented in §3.5.1
is to obtain, starting from a feasible primal solution P, a complementary feasible dual
pair (f, g). To reach that goal, P is progressively modified such that its complementary
dual pair reaches dual feasibility. A symmetric approach, starting from a feasible dual
variable to obtain a feasible primal P, motivates dual ascent methods. The proposi-
tion below is the main engine of dual ascent methods in the sense that it guarantees
(constructively) the existence of an ascent direction for (f, g) that maintains feasibil-
ity. That direction is built, similarly to the network simplex, by designing a candidate
primal solution P whose infeasibility guides an update for (f, g).
Proposition 3.6. Either (f, g) is optimal for Problem (3.4), or there exist S ⊂ ⟦n⟧ and S′ ⊂ ⟦m⟧′ such that (f̃, g̃) := (f, g) + ε(1_S, −1_{S′}) is feasible for a small enough ε > 0 and has a strictly better objective, that is, 1_S^T a − 1_{S′}^T b > 0.
Proof. We consider first a complementary primal variable P to (f, g). To that effect, let
B be the set of balanced edges, namely all pairs (i, j′) ∈ ⟦n⟧ × ⟦m⟧′ such that f_i + g_j =
C_{i,j}, and form the bipartite graph whose vertices {1, . . . , n, 1′, . . . , m′} are linked with
edges in B only, complemented by a source node s connected with capacitated edges to
all nodes i ∈ ⟦n⟧ with respective capacities a_i, and a terminal node t also connected
to all nodes j′ ∈ ⟦m⟧′ with edges of respective capacities b_j, as seen in Figure 3.6.
The Ford–Fulkerson algorithm ([Bertsimas and Tsitsiklis, 1997, p. 305]) can be used to
compute a maximal flow F on that network, namely a family of n + m + |B| nonnegative
values f_{si} ≤ a_i for i ∈ ⟦n⟧, f_{ij′} for (i, j′) ∈ B, and f_{j′t} ≤ b_j for j′ ∈ ⟦m⟧′, that obey
flow conservation constraints and such that Σ_i f_{si} is maximal. If the throughput of that
flow F is equal to 1, then a feasible primal solution P, complementary to (f, g) by
construction, can be extracted from F by defining P_{i,j} = f_{ij′} for (i, j′) ∈ B and zero
elsewhere, resulting in the optimality of (f, g)
and P by Proposition 3.3. If the throughput of F is strictly smaller than 1, the labeling
algorithm proceeds by labeling (identifying) those nodes reached iteratively from s for
which F does not saturate capacity constraints, as well as those nodes that contribute
flow to any of the labeled nodes. Labeled nodes are stored in a nonempty set Q, which
does not contain the terminal node t per optimality of F (see Bertsimas and Tsitsiklis
1997, p. 308, for a rigorous presentation of the algorithm). Q can be split into two sets
S = Q ∩ ⟦n⟧ and S′ = Q ∩ ⟦m⟧′. Because we have assumed that the total throughput
is strictly smaller than 1, S ≠ ∅. Note first that if i ∈ S and (i, j′) is balanced, then j′
is necessarily in S 0 . Indeed, since all edges (i, j 0 ) have infinite capacity by construction,
the labeling algorithm will necessarily reach j 0 if it includes i in S. By Proposition 3.5,
there exists thus a small enough ε to ensure the feasibility of f̃, g̃. One still needs to
prove that 1_S^T a − 1_{S′}^T b > 0 to ensure that (f̃, g̃) has a better objective than (f, g).
Let S̄ = ⟦n⟧ \ S and S̄′ = ⟦m⟧′ \ S′, and define
$$A = \sum_{i \in S} f_{si}, \qquad B = \sum_{i \in \bar{S}} f_{si}, \qquad C = \sum_{j' \in S'} f_{j't}, \qquad D = \sum_{j' \in \bar{S}'} f_{j't}.$$
The total maximal flow starts from s and is therefore equal to A + B, but also arrives
at t and is therefore equal to C + D. Flow conservation constraints also impose that the
very same flow is equal to B + C; therefore A = C. On the other hand, by definition
of the labeling algorithm, we have for all i in S that f_{si} < a_i, whereas f_{j′t} = b_j for
j′ ∈ S′, since t would otherwise be labeled, contradicting the optimality of the considered flow. We therefore
have A < 1_S^T a and C = 1_{S′}^T b. Therefore 1_S^T a − 1_{S′}^T b > A − C = 0.
The dual ascent method proceeds by modifying any feasible solution (f, g) by any
vector generated by sets S, S 0 that ensure feasibility and improve the objective. When
the sets S, S 0 are those given by construction in the proof of Proposition 3.6, and
the steplength ε is defined as in the proof of Proposition 3.5, we recover a method
known as the primal-dual method. That method reduces to the Hungarian algorithm
for matching problems. Dual ascent methods share similarities with the dual variant
of the network simplex, yet they differ in at least two important aspects. Simplex-type
methods always ensure that the current solution is an extreme point of the feasible
set, R(C) for the dual, whereas dual ascent as presented here does not make such
an assumption, and can freely produce iterates that lie in the interior of the feasible
set. Additionally, whereas the dual network simplex would proceed by modifying (f, g)
to produce a primal solution P that satisfies linear (marginal constraints) but only
nonnegativity upon convergence, dual ascent builds instead a primal solution P that is
always nonnegative but which does not necessarily satisfy marginal constraints.
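The network construction used in the proof of Proposition 3.6 can be sketched numerically. The toy implementation below (ours) uses Edmonds–Karp (BFS augmenting paths) rather than the plain Ford–Fulkerson of the references; the labeled set is read off as the nodes reachable from s in the final residual graph:

```python
from collections import deque

def maxflow_labeled(a, b, balanced):
    # Network of Proposition 3.6's proof: s -> i with capacity a[i],
    # i -> j' with infinite capacity for balanced pairs (i, j'),
    # j' -> t with capacity b[j]. Returns the max-flow value and the
    # labeled sets S, S' (nodes reachable from s in the residual graph).
    nodes = ['s', 't'] + [('r', i) for i in range(len(a))] \
                       + [('c', j) for j in range(len(b))]
    cap = {u: {} for u in nodes}
    def add(u, v, c):
        cap[u][v] = c
        cap[v].setdefault(u, 0.0)   # reverse edge for the residual graph
    for i, ai in enumerate(a):
        add('s', ('r', i), ai)
    for j, bj in enumerate(b):
        add(('c', j), 't', bj)
    for i, j in balanced:
        add(('r', i), ('c', j), float('inf'))
    flow = {u: {v: 0.0 for v in cap[u]} for u in nodes}
    while True:
        parent, queue = {'s': None}, deque(['s'])
        while queue:                 # BFS in the residual graph
            u = queue.popleft()
            for v in cap[u]:
                if v not in parent and cap[u][v] - flow[u][v] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        if 't' not in parent:        # no augmenting path: flow is maximal
            S = {u[1] for u in parent if isinstance(u, tuple) and u[0] == 'r'}
            Sp = {u[1] for u in parent if isinstance(u, tuple) and u[0] == 'c'}
            return sum(flow['s'].values()), S, Sp
        path, v = [], 't'            # augment along the path found
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        theta = min(cap[u][w] - flow[u][w] for u, w in path)
        for u, w in path:
            flow[u][w] += theta
            flow[w][u] -= theta
```

On a toy instance where both sources are balanced only with the first target, the flow saturates at 0.5 < 1 and the labeled sets satisfy 1_S^T a − 1_{S′}^T b > 0, yielding an ascent direction.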
[Figure 3.6, panels (a)–(d): (a) given balanced edges, what is the maximal flow possible? (b) max-flow = 0.74; (c) the labeling algorithm identifies S = {2, 3, 4, 5} and S′ = {2′, 3′, 4′, 6′}; (d) total flows A, B, C, D through the nodes of S, S̄, S′, S̄′.]
Figure 3.6: Consider a transportation problem involving the marginals introduced first in Figure 3.3,
with n = 5, m = 6. Given two feasible dual vectors f, g, we try to obtain the “best” flow matrix P that
is complementary to (f, g). Recall that this means that P can only take positive values on those edges
(i, j′) corresponding to indices for which f_i + g_j = C_{i,j}, here represented with dotted lines in plot (a).
The best flow that can be achieved with that graph structure can be formulated as a max-flow problem
in a capacitated network, starting from an abstract source node s connected to all nodes labeled i ∈ JnK,
terminating at an abstract terminal node t connected to all nodes labeled j′ ∈ ⟦m⟧′, and such
that the capacities of the edges (s, i) and (j′, t), for i ∈ ⟦n⟧ and j′ ∈ ⟦m⟧′, are respectively a_i and b_j, all others being infinite.
The Ford–Fulkerson algorithm ([Bertsimas and Tsitsiklis, 1997, p. 305]) can be applied to compute such
a max-flow, which, as represented in plot (b), only achieves 0.74 units of mass out of 1 needed to solve
the problem. One of the subroutines used by max-flow algorithms, the labeling algorithm ([Bertsimas
and Tsitsiklis, 1997, p. 308]), can be used to identify nodes that receive an unsaturated flow from s
(and recursively, all of its successors), denoted by orange lines in plot (c). The labeling algorithm also
adds by default nodes that send a positive flow to any labeled node, which is the criterion used to select
node 3, which contributes with a red line to 30 . Labeled nodes can be grouped in sets S, S 0 to identify
nodes which can be better exploited to obtain a higher flow, by modifying f, g to obtain a different
graph. The proof involves partial sums of flows, as described in plot (d).
3.7 Auction Algorithm

The auction algorithm was originally proposed by Bertsekas [1981] and later refined
in [Bertsekas and Eckstein, 1988]. Several economic interpretations of this algorithm
have been proposed (see e.g. Bertsekas [1992]). The algorithm can be adapted for arbi-
trary marginals, but we present it here in its formulation to solve optimal assignment
problems.
Partial assignments and ε-complementary slackness. The goal of the auction algo-
rithm is to modify iteratively a triplet S, ξ, g, where S is a subset of JnK, ξ a partial
assignment vector, namely an injective map from S to JnK, and g a dual vector. The
dual vector is meant to converge toward a solution satisfying an approximate com-
plementary slackness property (3.8), whereas S grows to cover JnK as ξ describes a
permutation. The algorithm works by maintaining the three following properties after
each iteration:
(a) ∀i ∈ S, $C_{i,\xi_i} - g_{\xi_i} \le \varepsilon + \min_j\,(C_{i,j} - g_j)$ (ε-CS).
(b) The size of S can only increase at each iteration.
(c) There exists an index i such that gi decreases by at least ε.
Auction algorithm updates. Given an index i ∈ ⟦n⟧, the auction algorithm uses not only the
optimum appearing in the usual C-transform but also a second best,
$$j_i^1 \in \operatorname*{argmin}_{j}\,(C_{i,j} - g_j), \qquad j_i^2 \in \operatorname*{argmin}_{j \ne j_i^1}\,(C_{i,j} - g_j),$$
and performs the following updates:
1. update g: set $g_{j_i^1} \leftarrow C_{i,j_i^1} - (C_{i,j_i^2} - g_{j_i^2}) - \varepsilon$.
2. update S and ξ: if there exists an index i′ ∈ S such that $\xi_{i'} = j_i^1$, remove it by
updating S ← S \ {i′}. Set $\xi_i = j_i^1$ and add i to S, S ← S ∪ {i}.
Thus ξⁿ is equal to ξ except for its ith element, equal to $j_i^1$, and Sⁿ is equal to the union of
{i} with S (with possibly one element removed). The update of gⁿ can be rewritten
$$g^n_{j_i^1} = C_{i,j_i^1} - (C_{i,j_i^2} - g_{j_i^2}) - \varepsilon;$$
therefore we have
$$C_{i,j_i^1} - g^n_{j_i^1} = \varepsilon + (C_{i,j_i^2} - g_{j_i^2}) = \varepsilon + \min_{j \ne j_i^1}\,(C_{i,j} - g_j),$$
and since the inequality is also obviously true for $j = j_i^1$, we therefore obtain the ε-complementary slackness property for index i. For other indices i′ ≠ i, since gⁿ ≤ g, the following sequence of inequalities holds:
$$C_{i',\xi^n_{i'}} - g^n_{\xi^n_{i'}} = C_{i',\xi_{i'}} - g_{\xi_{i'}} \le \varepsilon + \min_j\,(C_{i',j} - g_j) \le \varepsilon + \min_j\,(C_{i',j} - g^n_j).$$
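These updates can be sketched on a small assignment instance (our own minimal bookkeeping with dictionaries; ties are broken arbitrarily by the sort, and the matrix below is a made-up example):

```python
import itertools

def auction(C, eps):
    # Auction algorithm for the assignment problem (minimization form):
    # an unassigned index i bids on its best column j1 at current prices,
    # pushing g[j1] down by at least eps and possibly evicting the
    # previous owner of j1; terminates with an n*eps-suboptimal matching.
    n = len(C)
    g = [0.0] * n               # dual prices on columns
    xi, owner = {}, {}          # partial assignment i -> j and its inverse
    unassigned = list(range(n))
    while unassigned:
        i = unassigned.pop()
        order = sorted(range(n), key=lambda j: C[i][j] - g[j])
        j1, j2 = order[0], order[1]     # best and second-best columns
        g[j1] = C[i][j1] - (C[i][j2] - g[j2]) - eps
        if j1 in owner:                 # evict the previous owner of j1
            unassigned.append(owner[j1])
            del xi[owner[j1]]
        xi[i], owner[j1] = j1, i
    return xi, g

# Compare against brute force on a small instance.
C = [[4, 1, 3, 2], [2, 0, 5, 3], [3, 2, 2, 4], [1, 3, 4, 0]]
xi, _ = auction(C, eps=0.01)
cost = sum(C[i][j] for i, j in xi.items())
opt = min(sum(C[i][s[i]] for i in range(4))
          for s in itertools.permutations(range(4)))
```

By Proposition 3.9 the gap between `cost` and `opt` is at most nε, here 0.04.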
Remark 3.3. Note that this result yields a naive number of operations of O(N³‖C‖∞/ε)
for the algorithm to terminate. That complexity can be reduced to O(N³ log ‖C‖∞) when
using a clever method known as ε-scaling, designed to decrease the value of ε with each
iteration ([Bertsekas, 1998, p. 264]).
Proposition 3.9. The auction algorithm finds an assignment whose cost is nε subopti-
mal.
Proof. Let σ, g? be the primal and dual optimal solutions of the assignment problem
of matrix C, with optimum
$$t^\star = \sum_i C_{i,\sigma_i} = \sum_i \min_j\,(C_{i,j} - g^\star_j) + \sum_j g^\star_j.$$
Let ξ, g be the solutions output by the auction algorithm upon termination. The ε-CS
conditions yield that for any i ∈ S,
$$\min_j\,(C_{i,j} - g_j) \ge C_{i,\xi_i} - g_{\xi_i} - \varepsilon.$$
Therefore,
$$t^\star \ge \sum_i \min_j\,(C_{i,j} - g_j) + \sum_j g_j \ge \sum_i \left(C_{i,\xi_i} - g_{\xi_i} - \varepsilon\right) + \sum_j g_j = \sum_i C_{i,\xi_i} - n\varepsilon \ge t^\star - n\varepsilon,$$
where the second inequality comes from ε-CS, the next equality by cancellation of the
sum of terms in gξi and gj , and the last inequality by the suboptimality of ξ as a
permutation.
The auction algorithm can therefore be regarded as an alternative way to use the
machinery of C-transforms. Next we explore another approach grounded on regulariza-
tion, the so-called Sinkhorn algorithm, which also bears similarities with the auction
algorithm as discussed in [Schmitzer, 2016b].
Note finally that, on low-dimensional regular grids in Euclidean space, it is possible
to couple these classical linear solvers with multiscale strategies, to obtain a significant
speed-up [Schmitzer, 2016a, Oberman and Ruan, 2015].
4 Entropic Regularization of Optimal Transport
4.1 Entropic Regularization

The discrete entropy of a coupling matrix is defined as
$$H(P) := -\sum_{i,j} P_{i,j} \left( \log(P_{i,j}) - 1 \right),$$
with an analogous definition for vectors, with the convention that H(a) = −∞ if one of
the entries aj is 0 or negative. The function H is 1-strongly concave, because its Hessian
is ∂ 2 H(P ) = − diag(1/Pi,j ) and Pi,j ≤ 1. The idea of the entropic regularization
of optimal transport is to use −H as a regularizing function to obtain approximate
solutions to the original transport problem (2.11):
$$L_C^{\varepsilon}(a, b) := \min_{P \in U(a,b)} \langle P, C \rangle - \varepsilon H(P). \qquad (4.2)$$
Since the objective is an ε-strongly convex function, Problem (4.2) has a unique optimal
solution. The idea to regularize the optimal transport problem by an entropic term can
be traced back to modeling ideas in transportation theory [Wilson, 1969]: Actual traffic
patterns in a network do not agree with those predicted by the solution of the optimal
transport problem. Indeed, the former are more diffuse than the latter, which tend
to rely on a few routes as a result of the sparsity of optimal couplings for (2.11).
To mitigate this sparsity, researchers in transportation proposed a model, called the
“gravity” model [Erlander, 1980], that is able to form a more “blurred” prediction of
traffic given marginals and transportation costs.
Figure 4.1 illustrates the effect of the entropy to regularize a linear program over the
simplex Σ3 (which can thus be visualized as a triangle in two dimensions). Note how
the entropy pushes the original LP solution away from the boundary of the triangle.
The optimal Pε progressively moves toward an “entropic center” of the triangle. This
is further detailed in the proposition below. The convergence of the solution of that
regularized problem toward an optimal solution of the original linear program has been
studied by Cominetti and San Martín [1994], with precise asymptotics.
Figure 4.1: Impact of ε on the optimization of a linear function on the simplex, solving $P_\varepsilon = \operatorname*{argmin}_{P \in \Sigma_3} \langle C, P \rangle - \varepsilon H(P)$ for a varying ε.
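The effect illustrated in Figure 4.1 can be checked in closed form when the feasible set is the whole simplex: first-order conditions give p_i ∝ e^{−c_i/ε}, a softmin interpolating between an LP vertex and the uniform point (a small illustration; the function is ours):

```python
import math

def reg_lp_simplex(c, eps):
    # argmin over the simplex of <c, p> - eps * H(p): first-order
    # conditions c_i + eps * log(p_i) = const give p_i proportional to
    # exp(-c_i / eps), interpolating between an LP vertex (eps -> 0)
    # and the uniform "entropic center" (eps -> infinity).
    w = [math.exp(-ci / eps) for ci in c]
    total = sum(w)
    return [wi / total for wi in w]
```

For small ε the mass concentrates on the smallest cost coordinate; for large ε the solution approaches the uniform distribution, as in the figure.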
Proposition 4.1 (Convergence with ε). The unique solution Pε of (4.2) converges to
the optimal solution with maximal entropy within the set of all optimal solutions of
the Kantorovich problem, namely
$$P_\varepsilon \overset{\varepsilon \to 0}{\longrightarrow} \operatorname*{argmin}_{P} \left\{ -H(P) : P \in U(a, b),\ \langle P, C \rangle = L_C(a, b) \right\}, \qquad (4.3)$$
so that in particular
$$L_C^\varepsilon(a, b) \overset{\varepsilon \to 0}{\longrightarrow} L_C(a, b).$$
One also has
$$P_\varepsilon \overset{\varepsilon \to \infty}{\longrightarrow} a \otimes b = a b^{\top} = (a_i b_j)_{i,j}. \qquad (4.4)$$
Proof. We consider a sequence (ε` )` such that ε` → 0 and ε` > 0. We denote P` the
solution of (4.2) for ε = ε` . Since U(a, b) is bounded, we can extract a sequence (that
we do not relabel for the sake of simplicity) such that P` → P? . Since U(a, b) is closed,
P? ∈ U(a, b). We consider any P such that hC, Pi = LC (a, b). By optimality of P
and P_ℓ for their respective optimization problems (for ε = 0 and ε = ε_ℓ), one has
$$0 \le \langle C, P_\ell \rangle - \langle C, P \rangle \le \varepsilon_\ell \left( H(P_\ell) - H(P) \right). \qquad (4.5)$$
Since H is continuous, taking the limit ℓ → +∞ in this expression shows that ⟨C, P⋆⟩ =
⟨C, P⟩, so that P⋆ is a feasible point of (4.3). Furthermore, dividing by ε_ℓ in (4.5) and
taking the limit shows that H(P) ≤ H(P⋆), which shows that P⋆ is a solution of (4.3).
Since the solution P?0 to this program is unique by strict convexity of −H, one has
P? = P?0 , and the whole sequence is converging. In the limit ε → +∞, a similar proof
shows that one should rather consider the problem
$$\min_{P \in U(a,b)} \; -H(P),$$
whose unique solution is a ⊗ b.
Formula (4.3) states that for a small regularization ε, the solution converges to the
maximum entropy optimal transport coupling. In sharp contrast, (4.4) shows that for
a large regularization, the solution converges to the coupling with maximal entropy be-
tween two prescribed marginals a, b, namely the joint probability between two indepen-
dent random variables distributed following a, b. A refined analysis of this convergence
is performed in Cominetti and San Martín [1994], including a first order expansion in
ε (resp., 1/ε) near ε = 0 (resp., ε = +∞). Figures 4.2 and 4.3 show visually the effect
of these two convergences. A key insight is that, as ε increases, the optimal coupling
becomes less and less sparse (in the sense of having entries larger than a prescribed
threshold), which in turn has the effect of both accelerating computational algorithms
(as we study in §4.2) and leading to faster statistical convergence (as shown in §8.5).
Defining the Kullback–Leibler divergence between couplings as
$$\mathrm{KL}(P|K) := \sum_{i,j} P_{i,j} \log\left(\frac{P_{i,j}}{K_{i,j}}\right) - P_{i,j} + K_{i,j}, \qquad (4.6)$$
the unique solution Pε of (4.2) is a projection onto U(a, b) of the Gibbs kernel associated to the cost matrix C,
$$K_{i,j} := e^{-\frac{C_{i,j}}{\varepsilon}},$$
namely
$$P_\varepsilon = \operatorname{Proj}^{\mathrm{KL}}_{U(a,b)}(K) := \operatorname*{argmin}_{P \in U(a,b)} \mathrm{KL}(P|K). \qquad (4.7)$$
Figure 4.2: Impact of ε on the couplings between two 1-D densities, illustrating Proposition 4.1.
Top row: between two 1-D densities. Bottom row: between two 2-D discrete empirical densities with
the same number n = m of points (only entries of the optimal (Pi,j )i,j above a small threshold are
displayed as segments between xi and yj ).
Figure 4.3: Impact of ε on coupling between two 2-D discrete empirical densities with the same
number n = m of points (only entries of the optimal (Pi,j )i,j above a small threshold are displayed as
segments between xi and yj ).
Remark 4.1 (Entropic regularization between discrete measures). For discrete measures of the form (2.1), the definition of regularized transport extends naturally to
$$L_c^\varepsilon(\alpha, \beta) := L_C^\varepsilon(a, b), \qquad (4.8)$$
with cost C_{i,j} = c(x_i, y_j), to emphasize the dependency with respect to the positions (x_i, y_j) supporting the input measures.
Remark 4.2 (General formulation). One can consider arbitrary measures by replacing the discrete entropy by the relative entropy with respect to the product measure d(α ⊗ β)(x, y) := dα(x) dβ(y), and propose a regularized counterpart to (2.15) using
$$L_c^\varepsilon(\alpha, \beta) := \min_{\pi \in U(\alpha,\beta)} \int_{X \times Y} c(x, y)\, d\pi(x, y) + \varepsilon\, \mathrm{KL}(\pi | \alpha \otimes \beta), \qquad (4.9)$$
Proposition 4.2. For any π ∈ U(α, β), and for any (α′, β′) having the same null measure sets as (α, β) (so that they both have densities with respect to one another), one has
$$\mathrm{KL}(\pi | \alpha \otimes \beta) = \mathrm{KL}(\pi | \alpha' \otimes \beta') - \mathrm{KL}(\alpha \otimes \beta | \alpha' \otimes \beta').$$
where K is the Gibbs distribution dK(x, y) := e^{−c(x,y)/ε} dα(x) dβ(y).
Remark 4.3 (Mutual entropy). Similarly to (2.16), one can rephrase (4.9) using random variables
$$L_c^\varepsilon(\alpha, \beta) = \min_{(X,Y)} \left\{ \mathbb{E}_{(X,Y)}\left(c(X, Y)\right) + \varepsilon I(X, Y) : X \sim \alpha, Y \sim \beta \right\},$$
where, denoting π the distribution of (X, Y), I(X, Y) := KL(π|α ⊗ β) is the so-called mutual information between the two random variables. One has I(X, Y) ≥ 0, and I(X, Y) = 0 if and only if the two random variables are independent.
4.2 Sinkhorn's Algorithm and Its Convergence

The following proposition shows that the solution of (4.2) has a specific form, which can
be parameterized using n + m variables. That parameterization is therefore essentially
dual, in the sense that a coupling P in U(a, b) has nm variables but n + m constraints.
Figure 4.4: Top: evolution with ε of the solution πε of (4.9). Bottom: evolution of the copula function
ξπε .
Proposition 4.3. The solution to (4.2) is unique and has the form
$$\forall (i, j), \qquad P_{i,j} = u_i K_{i,j} v_j$$
for two (unknown) scaling variables (u, v) ∈ Rⁿ₊ × Rᵐ₊.
Proof. Introducing two dual variables f ∈ Rⁿ, g ∈ Rᵐ for each marginal constraint, the
Lagrangian of (4.2) reads
$$E(P, f, g) = \langle P, C \rangle - \varepsilon H(P) - \langle f, P \mathbb{1}_m - a \rangle - \langle g, P^{\top} \mathbb{1}_n - b \rangle.$$
First order conditions then yield
$$\frac{\partial E(P, f, g)}{\partial P_{i,j}} = C_{i,j} + \varepsilon \log(P_{i,j}) - f_i - g_j = 0,$$
which result, for an optimal coupling, in the expression $P_{i,j} = e^{f_i/\varepsilon} e^{-C_{i,j}/\varepsilon} e^{g_j/\varepsilon}$, which has the form stated in the proposition using the nonnegative vectors $u := (e^{f_i/\varepsilon})_i$ and $v := (e^{g_j/\varepsilon})_j$, so that P = diag(u)K diag(v). The variables (u, v) must therefore satisfy the following nonlinear equations, which correspond to the mass conservation constraints inherent to U(a, b):
$$\operatorname{diag}(u) K \operatorname{diag}(v) \mathbb{1}_m = a, \qquad \operatorname{diag}(v) K^{\top} \operatorname{diag}(u) \mathbb{1}_n = b.$$
These two equations can be further simplified, since diag(v)1_m is simply v, and the multiplication of diag(u) against Kv is entrywise, resulting in
$$u \odot (K v) = a, \qquad v \odot (K^{\top} u) = b.$$
An intuitive way to handle these equations is to solve them iteratively, alternating the two updates (with entrywise division)
$$u^{(\ell+1)} := \frac{a}{K v^{(\ell)}}, \qquad v^{(\ell+1)} := \frac{b}{K^{\top} u^{(\ell+1)}}, \qquad (4.15)$$
initialized with an arbitrary positive vector, for instance v^{(0)} = 1_m.
Remark 4.5 (Historical perspective). The iterations (4.15) first appeared in [Yule, 1912,
Kruithof, 1937]. They were later known as the iterative proportional fitting procedure
(IPFP) Deming and Stephan [1940] and RAS [Bacharach, 1965] methods [Idel, 2016].
The proof of their convergence is attributed to Sinkhorn [1964], hence the name of the
algorithm. This algorithm was later extended in infinite dimensions by Rüschendorf
[1995]. This regularization was used in the field of economics to obtain approximate
solutions to optimal transport problems, under the name of gravity models [Wilson,
1969, Erlander, 1980, Erlander and Stewart, 1990]. It was rebranded as “softassign”
by Kosowsky and Yuille [1994] in the assignment case, namely when a = b = 1n /n,
and used to solve matching problems in economics more recently by Galichon and
Salanié [2009]. This regularization has received renewed attention in data sciences (including machine learning, vision, graphics and imaging) following [Cuturi, 2013].
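The iterations (4.15) can be sketched in a few lines (a minimal dense, list-based implementation of ours, for illustration; practical code would vectorize these updates):

```python
import math

def sinkhorn(C, a, b, eps, n_iter=1000):
    # Alternately rescale rows and columns of the Gibbs kernel
    # K = exp(-C/eps): u <- a / (K v), v <- b / (K^T u); the returned
    # coupling is diag(u) K diag(v).
    n, m = len(a), len(b)
    K = [[math.exp(-C[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Upon convergence, the returned matrix satisfies both marginal constraints up to numerical precision.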
Figure 4.5: Top: evolution of the coupling π_ε^{(ℓ)} = diag(u^{(ℓ)}) K diag(v^{(ℓ)}) computed at iteration ℓ of Sinkhorn's iterations, for 1-D densities on X = [0, 1], c(x, y) = |x − y|², and ε = 0.1. Bottom: impact of ε (ε = 10, 0.1, 10⁻³) on the convergence rate of Sinkhorn, as measured in terms of marginal constraint violation log(‖π_ε^{(ℓ)} 1_m − b‖₁).
This yields a matrix P̂ ∈ U(a, b) such that the 1-norm between P̂ and
diag(u)K diag(v) is controlled by the marginal violations of diag(u)K diag(v),
namely
$$\left\| \hat{P} - \operatorname{diag}(u) K \operatorname{diag}(v) \right\|_1 \le \left\| a - u \odot (K v) \right\|_1 + \left\| b - v \odot (K^{\top} u) \right\|_1.$$
This field remains active, as shown by the recent improvement on the result above
by Dvurechensky et al. [2018].
Remark 4.7 (Numerical stability of Sinkhorn iterations). As we discuss in Remarks 4.14
and 4.15, the convergence of Sinkhorn's algorithm deteriorates as ε → 0. In numerical
practice, however, that slowdown is rarely observed, for a simpler reason:
Sinkhorn’s algorithm will often fail to terminate as soon as some of the elements of
the kernel K become too negligible to be stored in memory as positive numbers, and
become instead null. This can then result in a matrix product Kv or KT u with ever
smaller entries that become null and result in a division by 0 in the Sinkhorn update
of Equation (4.15). Such issues can be partly resolved by carrying out computations
on the multipliers u and v in the log domain. That approach is carefully presented in
Remark 4.23 and is related to a direct resolution of the dual of Problem (4.2).
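The log-domain approach can be sketched as follows (a minimal illustration of ours, not the implementation of Remark 4.23: the dual potentials are updated through a stabilized soft-minimum so that no kernel entry is ever exponentiated out of range):

```python
import math

def softmin(zs, eps):
    # Stabilized eps-soft minimum: -eps * log(sum_j exp(-z_j / eps)).
    zmin = min(zs)
    return zmin - eps * math.log(sum(math.exp(-(z - zmin) / eps) for z in zs))

def sinkhorn_log(C, a, b, eps, n_iter=500):
    # Log-domain Sinkhorn: iterate on dual potentials (f, g) so that
    # P_ij = exp((f_i + g_j - C_ij)/eps) never under- or overflows.
    n, m = len(a), len(b)
    f, g = [0.0] * n, [0.0] * m
    for _ in range(n_iter):
        f = [softmin([C[i][j] - g[j] for j in range(m)], eps)
             + eps * math.log(a[i]) for i in range(n)]
        g = [softmin([C[i][j] - f[i] for i in range(n)], eps)
             + eps * math.log(b[j]) for j in range(m)]
    return [[math.exp((f[i] + g[j] - C[i][j]) / eps)
             for j in range(m)] for i in range(n)]
```

Each f update enforces the row constraint and each g update the column constraint, exactly as in the scaling form, but without ever storing the kernel.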
Remark 4.8 (Relation with iterative projections). Denoting
$$C_a^1 := \{P : P \mathbb{1}_m = a\} \qquad \text{and} \qquad C_b^2 := \{P : P^{\top} \mathbb{1}_n = b\}$$
the rows and columns constraints, one has U(a, b) = C_a^1 ∩ C_b^2. One can use Bregman iterative projections [Bregman, 1967],
$$P^{(\ell+1)} := \operatorname{Proj}^{\mathrm{KL}}_{C_a^1}(P^{(\ell)}) \qquad \text{and} \qquad P^{(\ell+2)} := \operatorname{Proj}^{\mathrm{KL}}_{C_b^2}(P^{(\ell+1)}). \qquad (4.16)$$
Since the sets C_a^1 and C_b^2 are affine, these iterations are known to converge to the solution of (4.7); see [Bregman, 1967]. These iterates are equivalent to the Sinkhorn iterations (4.15), since defining
$$P^{(2\ell)} := \operatorname{diag}(u^{(\ell)}) K \operatorname{diag}(v^{(\ell)}),$$
one has
$$P^{(2\ell+1)} = \operatorname{diag}(u^{(\ell+1)}) K \operatorname{diag}(v^{(\ell)}).$$
In practice, however, one should prefer using (4.15), which only requires manipulating
scaling vectors and multiplication against a Gibbs kernel, which can often be accelerated
(see Remarks 4.17 and 4.19 below).
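Both projections in (4.16) have closed forms, since projecting onto the row (resp. column) constraint simply rescales rows (resp. columns). A small sketch (ours; the instance is arbitrary) checking that alternating them from the Gibbs kernel recovers both marginals:

```python
import math

def proj_rows(P, a):
    # KL projection onto {P : P 1_m = a}: rescale each row of P.
    return [[a[i] * x / sum(P[i]) for x in P[i]] for i in range(len(a))]

def proj_cols(P, b):
    # KL projection onto {P : P^T 1_n = b}: rescale each column of P.
    col = [sum(row[j] for row in P) for j in range(len(b))]
    return [[row[j] * b[j] / col[j] for j in range(len(b))] for row in P]

# Alternate the two projections starting from the Gibbs kernel K.
C = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0]]
a, b = [0.4, 0.6], [0.3, 0.5, 0.2]
eps = 0.5
P = [[math.exp(-cij / eps) for cij in row] for row in C]
for _ in range(1000):
    P = proj_cols(proj_rows(P, a), b)
```

Two consecutive projections correspond exactly to one pair of scaling updates in (4.15), which is why maintaining only the vectors u, v is preferable in practice.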
Remark 4.9 (Proximal point algorithm). In order to approximate a solution of the un-
regularized (ε = 0) problem (2.11), it is possible to use iteratively the Sinkhorn algorithm, using the so-called proximal point algorithm for the KL metric. We denote F(P) := ⟨P, C⟩ + ι_{U(a,b)}(P) the unregularized objective function. The proximal point
iterations for the KL divergence compute a minimizer of F, and hence a solution of
the unregularized OT problem (2.11), by computing iteratively
$$P^{(\ell+1)} := \operatorname{Prox}^{\mathrm{KL}}_{\frac{1}{\varepsilon} F}\left(P^{(\ell)}\right) := \operatorname*{argmin}_{P \in \mathbb{R}^{n \times m}_{+}} \; \mathrm{KL}\left(P | P^{(\ell)}\right) + \frac{1}{\varepsilon} F(P), \qquad (4.17)$$
starting from an arbitrary P^{(0)} (see also (4.52)). The proximal point algorithm is the
most basic proximal splitting method. Initially introduced for the Euclidean metric
(see, for instance, (Rockafellar 1976)), it extends to any Bregman divergence [Censor
and Zenios, 1992], so in particular it can be applied here for the KL divergence (see
Remark 8.1). The proximal operator is usually not available in closed form, so some
form of subiterations are required. The optimization appearing in (4.17) is very similar
to the entropy regularized problem (4.2), with the relative entropy KL(·|P(`) ) used in
place of the negative entropy −H. Proposition 4.3 and Sinkhorn iterations (4.15) carry
C
over to this more general setting when defining the Gibbs kernel as K = e− ε P(`) =
Ci,j
(`)
(e− ε Pi,j )i,j . Iterations (4.17) can thus be implemented by running the Sinkhorn
algorithm at each iteration. Assuming for simplicity $P^{(0)} = \mathbf{1}_n \mathbf{1}_m^{\mathsf{T}}$, these iterations thus have the form
$P^{(\ell+1)} = \operatorname{diag}(u^{(\ell)})\big(e^{-\frac{C}{\varepsilon}} \odot P^{(\ell)}\big)\operatorname{diag}(v^{(\ell)}) = \operatorname{diag}(u^{(\ell)} \odot \cdots \odot u^{(0)})\, e^{-\frac{(\ell+1)C}{\varepsilon}} \operatorname{diag}(v^{(\ell)} \odot \cdots \odot v^{(0)}).$
The proximal point iterates therefore apply Sinkhorn's algorithm iteratively with a kernel $e^{-\frac{C}{\varepsilon/\ell}}$, i.e., with a decaying regularization parameter $\varepsilon/\ell$. This method is thus tightly connected to a series of works which combine Sinkhorn with some decaying
tightly connected to a series of works which combine Sinkhorn with some decaying
schedule on the regularization; see, for instance, [Kosowsky and Yuille, 1994]. They
are efficient in small spatial dimensions, when combined with a multigrid strategy to approximate the coupling on an adaptive sparse grid [Schmitzer, 2016b].
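The decaying-regularization idea can be sketched as follows, warm-starting each stage through the dual potentials $(f, g) = \varepsilon(\log u, \log v)$, which remain on a comparable scale across stages; the schedule `eps_list` and all names are illustrative choices of ours, not from the text:

```python
import numpy as np

def sinkhorn_eps_scaling(a, b, C, eps_list=(1.0, 0.3, 0.1, 0.03), n_iters=200):
    """Sinkhorn with a decaying regularization schedule: each stage runs
    plain Sinkhorn at its own eps and warm-starts the next stage through
    the dual potentials (f, g) = eps*(log u, log v)."""
    f, g = np.zeros_like(a), np.zeros_like(b)
    for eps in eps_list:
        K = np.exp(-C / eps)
        u, v = np.exp(f / eps), np.exp(g / eps)   # warm start
        for _ in range(n_iters):
            u = a / (K @ v)
            v = b / (K.T @ u)
        f, g = eps * np.log(u), eps * np.log(v)
    return u[:, None] * K * v[None, :], f, g

a = np.array([0.5, 0.5])
b = np.array([0.2, 0.8])
C = np.array([[0.0, 1.0], [1.0, 0.0]])
P, f, g = sinkhorn_eps_scaling(a, b, C)
```

For very small final $\varepsilon$ this plain-domain sketch can still underflow; the log-domain variants of §4.4 are then preferable.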
Remark 4.10 (Other regularizations). It is possible to replace the entropic term −H(P)
in (4.2) by any strictly convex penalty R(P), as detailed, for instance, in [Dessein et al.,
2018]. A typical example is the squared $\ell^2$ norm
$R(P) = \sum_{i,j} P_{i,j}^2 + \iota_{\mathbb{R}_+}(P_{i,j}); \tag{4.18}$
68 Entropic Regularization of Optimal Transport
see [Essid and Solomon, 2017]. Another example is the family of Tsallis entropies [Muzellec et al., 2017]. Note, however, that if the penalty function is defined even when entries
of P are nonpositive, which is, for instance, the case for a quadratic regularization (4.18),
then one must add back a nonnegativity constraint P ≥ 0, in addition to the marginal
constraints $P\mathbf{1}_m = a$ and $P^{\mathsf{T}}\mathbf{1}_n = b$. Indeed, one can afford to drop the nonnegativity constraint only with the entropic penalty, whose logarithmic term forces the entries of $P$ to stay in the positive orthant. With the nonnegativity constraint added back, the set
of constraints is no longer affine and iterative Bregman projections do not converge
anymore to the solution. A workaround is to use instead Dykstra's algorithm [Dykstra, 1983, 1985] (see also [Bauschke and Lewis, 2000]), as detailed in [Benamou et al., 2015]. This
algorithm uses projections according to the Bregman divergence associated to R. We
refer to Remark 8.1 for more details regarding Bregman divergences. An issue is that in
general these projections cannot be computed explicitly. For the squared norm (4.18), this corresponds to computing the Euclidean projection onto $(\mathcal{C}_a^1, \mathcal{C}_b^2)$ (with the extra positivity constraints), which can be solved efficiently using projection algorithms on
simplices [Condat, 2015]. The main advantage of the quadratic regularization over entropy is that it produces sparse approximations of the optimal coupling, yet this comes at
the expense of a slower algorithm that cannot be parallelized as efficiently as Sinkhorn
to compute several optimal transports simultaneously (as discussed in §4.16). Figure 4.6
contrasts the approximation achieved by entropic and quadratic regularizers.
Figure 4.6: Comparison of entropic regularization $R = -H$ (top row) and quadratic regularization $R = \|\cdot\|^2 + \iota_{\mathbb{R}_+}$ (bottom row). The $(\alpha, \beta)$ marginals are the same as for Figure 4.4.
Remark 4.11 (Barycentric projection). Consider again the setting of Remark 4.1 in
which we use entropic regularization to approximate OT between discrete measures.
The Kantorovich formulation in (2.11) and its entropic regularization (4.2) both
yield a coupling P ∈ U(a, b). In order to define a transportation map T : X → Y,
in the case where $\mathcal{Y} = \mathbb{R}^d$, one can define the so-called barycentric projection map
$T : x_i \in \mathcal{X} \longmapsto \frac{1}{a_i} \sum_j P_{i,j}\, y_j \in \mathcal{Y}, \tag{4.19}$
where the input measures are discrete of the form (2.3). Note that this map is only defined for points $(x_i)_i$ in the support of $\alpha$. In the case where $P$ is a permutation matrix (as detailed in Proposition 2.1), $T$ is equal to a Monge map, and as $\varepsilon \to 0$, the barycentric projection progressively converges to that map if it is unique.
For arbitrary (not necessarily discrete) measures, solving (2.15) or its regularized
version (4.9) defines a coupling $\pi \in U(\alpha, \beta)$. Note that this coupling $\pi$ always has a density $\frac{\mathrm{d}\pi(x,y)}{\mathrm{d}\alpha(x)\mathrm{d}\beta(y)}$ with respect to $\alpha \otimes \beta$. A map can thus be retrieved by the formula
$T : x \in \mathcal{X} \longmapsto \int_{\mathcal{Y}} y\, \frac{\mathrm{d}\pi(x, y)}{\mathrm{d}\alpha(x)\mathrm{d}\beta(y)}\, \mathrm{d}\beta(y). \tag{4.20}$
In the case where, for ε = 0, π is supported on the graph of the Monge map
(see Remark 2.24), then using ε > 0 produces a smooth approximation of this
map. Such a barycentric projection is useful to apply the OT Monge map to solve
problems in imaging; see Figure 9.6 for an application to color modification. It has
also been used to compute approximations of principal geodesics in the space of
probability measures endowed with the Wasserstein metric; see [Seguy and Cuturi,
2015].
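In the discrete case, the barycentric projection (4.19) is a single matrix product; this short sketch (names are ours) illustrates it on a scaled permutation coupling, for which $T$ recovers the Monge assignment:

```python
import numpy as np

def barycentric_projection(P, a, Y):
    """Barycentric projection (4.19): T(x_i) = (1/a_i) * sum_j P_{i,j} y_j."""
    return (P @ Y) / a[:, None]

# toy check: a (scaled) permutation coupling sends x_1 -> y_2 and x_2 -> y_1
P = np.array([[0.0, 0.5], [0.5, 0.0]])
a = np.array([0.5, 0.5])
Y = np.array([[0.0], [1.0]])       # support of beta in R^1
T = barycentric_projection(P, a, Y)
```

For an entropic coupling ($\varepsilon > 0$), each $T(x_i)$ is instead a convex combination of several $y_j$, which is the smoothing effect described above.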
$\forall\, (u, u') \in (\mathbb{R}^n_{+,*})^2, \quad d_{\mathcal{H}}(u, u') \stackrel{\text{def.}}{=} \log \max_{i,j} \frac{u_i u'_j}{u_j u'_i}.$
It can be shown to be a distance on the projective cone $\mathbb{R}^n_{+,*}/\sim$, where $u \sim u'$ means that $\exists r > 0, u = r u'$ (the vectors are equal up to rescaling, hence the name "projective"). This means that $d_{\mathcal{H}}$ satisfies the triangular inequality and $d_{\mathcal{H}}(u, u') = 0$ if and only if $u \sim u'$. This is a projective version of Hilbert's original distance on bounded open convex sets [Hilbert, 1895]. The projective cone $\mathbb{R}^n_{+,*}/\sim$
is a complete metric space for this distance. By a logarithmic change of variables,
the Hilbert metric on the rays of the positive cone is isometric to the variation
seminorm (it is a norm between vectors that are defined up to an additive constant)
$\forall\, (u, u') \in (\mathbb{R}^n_{+,*})^2, \quad d_{\mathcal{H}}(u, u') = \|\log(u) - \log(u')\|_{\mathrm{var}}, \quad \text{where} \quad \|f\|_{\mathrm{var}} \stackrel{\text{def.}}{=} \max_i f_i - \min_i f_i.$
Figure 4.7: Left: the Hilbert metric dH is a distance over rays in cones (here positive vectors). Right:
visualization of the contraction induced by the iteration of a positive matrix K.
for any $p_0 \in \Sigma_n$, $d_{\mathcal{H}}(K^{\ell} p_0, p^{\star}) \le \lambda(K)^{\ell}\, d_{\mathcal{H}}(p_0, p^{\star})$; i.e., one has linear convergence of the iterates of the matrix toward $p^{\star}$. This is illustrated in Figure 4.8.
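The projective invariance of the Hilbert metric and the contraction induced by a positive matrix can be checked numerically with a short sketch (function and variable names are ours):

```python
import numpy as np

def hilbert_dist(u, v):
    """Hilbert projective metric d_H(u, v) = log max_{i,j} (u_i v_j)/(u_j v_i),
    computed as the variation seminorm of log(u) - log(v)."""
    r = np.log(u) - np.log(v)
    return r.max() - r.min()

u = np.array([1.0, 2.0, 3.0])
v = np.array([3.0, 1.0, 2.0])
K = np.array([[1.0, 2.0, 1.0],
              [2.0, 1.0, 3.0],
              [1.0, 1.0, 1.0]])   # any matrix with positive entries contracts
```

Rescaling either argument leaves the distance unchanged, and applying $K$ strictly decreases it, which is exactly the mechanism behind the linear convergence of Sinkhorn's scalings.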
where we used Theorem 4.1. This shows (4.22). One also has, using the triangular
inequality,
which gives the first part of (4.23), since $u^{(\ell)} \odot (K v^{(\ell)}) = P^{(\ell)} \mathbf{1}_m$ (the second one being similar). The proof of (4.24) follows from [Franklin and Lorenz, 1989, Lem. 3].
The bound (4.23) shows that certain measures of the violation of the marginal constraints, for instance $\|P^{(\ell)} \mathbf{1}_m - a\|_1$ and $\|P^{(\ell)\mathsf{T}} \mathbf{1}_n - b\|_1$, are useful stopping criteria to monitor the convergence. Note that thanks to (4.21), these Hilbert metric rates on the scaling variables $(u^{(\ell)}, v^{(\ell)})$ give a linear rate on the dual variables $(f^{(\ell)}, g^{(\ell)}) \stackrel{\text{def.}}{=} (\varepsilon \log(u^{(\ell)}), \varepsilon \log(v^{(\ell)}))$ for the variation norm $\|\cdot\|_{\mathrm{var}}$.
Figure 4.5, bottom row, highlights this linear rate on the constraint violation
and shows how this rate degrades as ε → 0. These results are proved in [Franklin
and Lorenz, 1989] and are tightly connected to nonlinear Perron–Frobenius the-
ory [Lemmens and Nussbaum, 2012]. Perron–Frobenius theory corresponds to the
linearization of the iterations; see (4.25). This convergence analysis is extended
by [Linial et al., 1998], who show that each iteration of Sinkhorn increases the permanent of the scaled coupling matrix.
Remark 4.15 (Local convergence). The global linear rate (4.24) is often quite pessimistic, typically in $\mathcal{X} = \mathcal{Y} = \mathbb{R}^d$ for cases where there exists a Monge map when $\varepsilon = 0$ (see Remark 2.7). The global rate is in contrast rather sharp for more difficult situations where the cost matrix $C$ is close to being random; in these cases, the rate degrades exponentially fast with $\varepsilon$, $1 - \lambda(K) \sim e^{-1/\varepsilon}$. To obtain a
finer asymptotic analysis of the convergence (e.g. if one is interested in a high-
precision solution and performs a large number of iterations), one usually rather
studies the local convergence rate. One can write a Sinkhorn update as iterations
4.3. Speeding Up Sinkhorn’s Iterations 73
For optimal $(f, g)$ solving (4.30), denoting $P = \operatorname{diag}(e^{f/\varepsilon})\, K \operatorname{diag}(e^{g/\varepsilon})$ the optimal coupling solving (4.2), one has the following Jacobian:
This Jacobian is a positive matrix with $\partial\Phi(f)\mathbf{1}_n = \mathbf{1}_n$, and thus by the Perron–Frobenius theorem, it has a single dominant eigenvector $\mathbf{1}_n$ with associated eigenvalue 1. Since $f$ is defined up to a constant, it is actually the second eigenvalue $1 - \kappa < 1$ which governs the local linear rate, and this shows that for $\ell$ large enough,
$\|f^{(\ell)} - f\| = O((1 - \kappa)^{\ell}).$
Numerically, in “simple cases” (such as when there exists a smooth Monge map
when ε = 0), this rate scales like κ ∼ ε. We refer to [Knight, 2008] for more details
in the bistochastic (assignment) case.
$U^{(\ell+1)} \stackrel{\text{def.}}{=} \frac{A}{K V^{(\ell)}} \quad \text{and} \quad V^{(\ell+1)} \stackrel{\text{def.}}{=} \frac{B}{K^{\mathsf{T}} U^{(\ell+1)}}, \tag{4.26}$
initialized with $V^{(0)} = \mathbf{1}_{m \times N}$, where the columns of $A$ and $B$ store the $N$ input histograms. Here $\frac{\cdot}{\cdot}$ corresponds to the entrywise division of matrices. One can further check that upon convergence of $V$ and $U$, the (row) vector of regularized distances simplifies to
$\mathbf{1}_n^{\mathsf{T}} \big( U \odot \log U \odot ((K \odot C)V) + U \odot ((K \odot C)(V \odot \log V)) \big) \in \mathbb{R}^N.$
Note that the basic Sinkhorn iterations described in Equation (4.15) are intrinsically GPU friendly, since they only consist in matrix-vector products; this was exploited, for instance, to solve matching problems in [Slomp et al., 2011]. However, the matrix-matrix operations presented in Equation (4.26) present even better opportunities for parallelism, which explains the success of Sinkhorn's algorithm for computing OT distances between histograms at large scale.
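A sketch of the batched iterations (4.26), assuming the $N$ histogram pairs are stored as columns of matrices `A` and `B` (names are ours):

```python
import numpy as np

def sinkhorn_parallel(A, B, C, eps=0.1, n_iters=500):
    """Batched Sinkhorn (4.26): columns of A (n x N) and B (m x N) hold the N
    histogram pairs; all couplings are updated jointly with matrix-matrix
    products against the shared kernel K (divisions are entrywise)."""
    K = np.exp(-C / eps)
    V = np.ones_like(B)
    for _ in range(n_iters):
        U = A / (K @ V)
        V = B / (K.T @ U)
    return U, V

C = np.array([[0.0, 1.0], [1.0, 0.0]])
A = np.array([[0.5, 0.3], [0.5, 0.7]])   # two source histograms
B = np.array([[0.4, 0.6], [0.6, 0.4]])   # two target histograms
U, V = sinkhorn_parallel(A, B, C)
```

The coupling for pair $s$ is $\operatorname{diag}(U_{:,s})\, K \operatorname{diag}(V_{:,s})$; on a GPU the two dense products per iteration dominate and parallelize across all $N$ problems at once.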
Remark 4.17 (Speed-up for separable kernels). We consider in this section an important
particular case for which the complexity of each Sinkhorn iteration can be significantly
reduced. That particular case happens when each index $i$ and $j$ of the cost matrix can be described as a $d$-tuple taken in the Cartesian product of $d$ finite sets $[\![n_1]\!], \ldots, [\![n_d]\!]$:
$i = (i_k)_{k=1}^d,\quad j = (j_k)_{k=1}^d \in [\![n_1]\!] \times \cdots \times [\![n_d]\!].$
In that setting, if the cost Cij between indices i and j is additive along these sub-indices,
namely if there exist $d$ matrices $C^1, \ldots, C^d$, of respective sizes $n_1 \times n_1, \ldots, n_d \times n_d$, such that
$C_{i,j} = \sum_{k=1}^{d} C^{k}_{i_k, j_k},$
then one obtains as a direct consequence that the kernel appearing in the Sinkhorn
iterations has a separable multiplicative structure,
$K_{i,j} = \prod_{k=1}^{d} K^{k}_{i_k, j_k}. \tag{4.27}$
Such a separable multiplicative structure allows for a very fast (exact) evaluation of
$Ku$. Indeed, instead of instantiating $K$ as a matrix of size $n \times n$, which would have a prohibitive size since $n = \prod_k n_k$ is usually exponential in the dimension $d$, one can instead recover $Ku$ by simply applying $K^k$ along each "slice" of $u$. If $n = m$, the complexity reduces to $O(n^{1+1/d})$ in place of $O(n^2)$.
An important example of this speed-up arises when $\mathcal{X} = \mathcal{Y} = [0,1]^d$; the ground cost is the $q$-th power of the $\ell^q$ norm,
$c(x, y) = \|x - y\|_q^q = \sum_{i=1}^{d} |x_i - y_i|^q, \quad q > 0;$
and the space is discretized using a regular grid in which only points $x_i = (i_1/n_1, \ldots, i_d/n_d)$ for $i = (i_1, \ldots, i_d) \in [\![n_1]\!] \times \cdots \times [\![n_d]\!]$ are considered. In that case a multiplication by $K$ can be carried out more efficiently by applying each 1-D $n_k \times n_k$ convolution matrix
$K^{k} = \Big[ e^{-\left|\frac{r-s}{n_k}\right|^{q} / \varepsilon} \Big]_{1 \le r,s \le n_k}$
to u reshaped as a tensor whose first dimension has been permuted to match the k-th
set of indices. For instance, if d = 2 (planar case) and q = 2 (2-Wasserstein, resulting in
Gaussian convolutions), histograms a and as a consequence Sinkhorn multipliers u can
be instantiated as n1 ×n2 matrices. We write U to underline the fact that the multiplier
$u$ is reshaped as an $n_1 \times n_2$ matrix, rather than a vector of length $n_1 n_2$. Then $Ku$, which would require $(n_1 n_2)^2$ operations with a naive implementation, can be obtained by applying the two 1-D convolutions separately, as $K^1 U (K^2)^{\mathsf{T}}$, to recover an $n_1 \times n_2$ matrix in $n_1^2 n_2 + n_1 n_2^2$ operations instead of $n_1^2 n_2^2$ operations. Note that this example agrees with the exponent $(1 + 1/d)$ given above. With larger $d$, one needs to apply these very same 1-D convolutions to each slice of $u$ (reshaped as a tensor of suitable size), an operation which is extremely efficient on GPUs.
This important observation underlies many of the practical successes of applying optimal transport to shape data in 2-D and 3-D, as highlighted in [Solomon et al., 2015, Bonneel et al., 2016], in which distributions supported on grids of sizes as large as $200^3 = 8 \times 10^6$ are handled.
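The separable speed-up for $d = 2$ can be sketched and checked against the dense Kronecker kernel; the toy sizes and names below are ours:

```python
import numpy as np

def apply_separable_kernel(U, K1, K2):
    """Apply the separable kernel (4.27) K = K1 (x) K2 to u stored as an
    n1 x n2 array U: the result is K1 @ U @ K2.T, i.e. two 1-D convolutions
    instead of one (n1*n2) x (n1*n2) matrix-vector product."""
    return K1 @ U @ K2.T

# consistency check against the dense Kronecker kernel (toy sizes)
n1, n2, eps = 4, 5, 0.5
x1, x2 = np.linspace(0, 1, n1), np.linspace(0, 1, n2)
K1 = np.exp(-(x1[:, None] - x1[None, :])**2 / eps)   # 1-D Gaussian kernels
K2 = np.exp(-(x2[:, None] - x2[None, :])**2 / eps)
U = np.random.default_rng(0).random((n1, n2))
dense = (np.kron(K1, K2) @ U.ravel()).reshape(n1, n2)
```

The dense version materializes a $(n_1 n_2) \times (n_1 n_2)$ matrix; the separable version never does, which is what makes grid sizes like $200^3$ tractable.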
Remark 4.19 (Geodesic in heat approximation). For nonplanar domains, the kernel K
is not a convolution, but in the case where the cost is $C_{i,j} = d_M(x_i, y_j)^p$, where $d_M$ is a geodesic distance on a surface $M$ (or a more general manifold), it is also possible to perform fast approximations of the application of $K = e^{-\frac{d_M}{\varepsilon}}$ to a vector. Indeed, Varadhan's formulas [1967] assert that this kernel is close to the Laplacian kernel (for $p = 1$) and the heat kernel (for $p = 2$). The first formula of Varadhan states
$\sqrt{t}\, \log(\mathcal{P}_t(x, y)) = -\frac{d_M(x, y)}{2} + o(t) \quad \text{where} \quad \mathcal{P}_t \stackrel{\text{def.}}{=} (\mathrm{Id} - t\Delta_M)^{-1}, \tag{4.28}$
while the second formula involves the heat kernel,
$t\, \log(\mathcal{H}_t(x, y)) = -\frac{d_M(x, y)^2}{4} + o(t), \tag{4.29}$
where $\mathcal{H}_t$ is the integral kernel defined so that $g_t = \int_M \mathcal{H}_t(x, y) f(y)\, \mathrm{d}y$ is the solution at time $t$ of the heat equation
$\frac{\partial g_t(x)}{\partial t} = (\Delta_M g_t)(x).$
The convergence in these formulas (4.28) and (4.29) is uniform on compact manifolds.
Numerically, the domain M is discretized (for instance, using finite elements) and
∆M is approximated by a discrete Laplacian matrix L. A typical example is when
using piecewise linear finite elements, so that L is the celebrated cotangent Laplacian
(see [Botsch et al., 2010] for a detailed account for this construction). These formulas
can be used to approximate efficiently the multiplication by the Gibbs kernel $K_{i,j} = e^{-\frac{d(x_i, y_j)^p}{\varepsilon}}$. Equation (4.28) suggests, for the case $p = 1$, to use $\varepsilon = 2\sqrt{t}$ and to replace the multiplication by $K$ by the multiplication by $(\mathrm{Id} - tL)^{-1}$, which necessitates the
resolution of a positive symmetric linear system. Equation (4.29), coupled with R steps
of implicit Euler steps for the stable resolution of the heat flow, suggests for $p = 2$ to trade the multiplication by $K$ for the multiplication by $(\mathrm{Id} - \frac{t}{R} L)^{-R}$ with $4t = \varepsilon$, which in turn
necessitates R resolutions of linear systems. Fortunately, since these linear systems
are supposed to be solved at each Sinkhorn iteration, one can solve them efficiently
by precomputing a sparse Cholesky factorization. By performing a reordering of the
rows and columns of the matrix [George and Liu, 1989], one obtains a nearly linear
sparsity for 2-D manifolds and thus each Sinkhorn iteration has linear complexity (the
performance degrades with the dimension of the manifold). The use of Varadhan’s
formula to approximate geodesic distances was initially proposed in [Crane et al., 2013]
and its use in conjunction with Sinkhorn iterations in [Solomon et al., 2015].
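A sketch of the resulting kernel approximation, with the sparse factorization computed once and reused at every Sinkhorn iteration; the Laplacian `L` (with the sign convention of the heat equation above) and the number `R` of implicit Euler steps are left to the user, and all names are ours:

```python
import numpy as np
from scipy.sparse import diags, identity, csc_matrix
from scipy.sparse.linalg import factorized

def make_heat_applier(L, eps, R=10):
    """Approximate multiplication by the Gibbs kernel exp(-d^2/eps) via R
    implicit Euler steps of the heat flow (4.29), with 4t = eps.  The sparse
    factorization of (Id - (t/R) L) is computed once and reused."""
    t = eps / 4.0
    n = L.shape[0]
    solve = factorized(csc_matrix(identity(n) - (t / R) * L))
    def apply_K(u):
        for _ in range(R):
            u = solve(u)
        return u
    return apply_K
```

Each call then costs $R$ sparse triangular solves, which for a 2-D mesh with a good reordering is nearly linear in the number of vertices.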
As briefly mentioned in Remark 4.7, the Sinkhorn algorithm suffers from numerical
overflow when the regularization parameter ε is small compared to the entries of the cost
matrix C. This concern can be alleviated to some extent by carrying out computations
in the log domain. The relevance of this approach is made more clear by considering
the dual problem associated to (4.2), in which these log-domain computations arise
naturally.
The optimal $(f, g)$ are linked to the scalings $(u, v)$ appearing in (4.12) through $(u, v) = (e^{f/\varepsilon}, e^{g/\varepsilon}). \tag{4.31}$
Proof. We start from the end of the proof of Proposition 4.3, which links the optimal
primal solution P and dual multipliers f and g for the marginal constraints as
The negative entropy of $P$ scaled by $\varepsilon$, namely $\varepsilon \langle P, \log P - \mathbf{1}_{n \times m} \rangle$, can be stated explicitly as a function of $f, g, C$; the first term in (4.32) then cancels out with the first term of this entropy, and the remaining terms are those appearing in (4.30).
Remark 4.21 (Sinkhorn as a block coordinate ascent on the dual problem). A simple
approach to solving the unconstrained maximization problem (4.30) is to use an exact
block coordinate ascent strategy, namely to update alternatively f and g to cancel the
respective gradients in these variables of the objective of (4.30). Indeed, one can notice
after a few elementary computations that, writing Q(f, g) for the objective of (4.30),
$\nabla_f Q(f, g) = a - e^{f/\varepsilon} \odot (K e^{g/\varepsilon}), \tag{4.33}$
$\nabla_g Q(f, g) = b - e^{g/\varepsilon} \odot (K^{\mathsf{T}} e^{f/\varepsilon}). \tag{4.34}$
Such iterations are mathematically equivalent to the Sinkhorn iterations (4.15) when
considering the primal-dual relations highlighted in (4.31). Indeed, we recover that at
any iteration
$(f^{(\ell)}, g^{(\ell)}) = \varepsilon\,(\log(u^{(\ell)}), \log(v^{(\ell)})).$
Remark 4.22 (Soft-min rewriting). Iterations (4.35) and (4.36) can be given an alterna-
tive interpretation, using the following notation. Given a vector z of real numbers we
write $\min{}_{\varepsilon} z$ for the soft-minimum of its coordinates, namely
$\min{}_{\varepsilon} z = -\varepsilon \log \sum_i e^{-z_i/\varepsilon}. \tag{4.37}$
Note that minε (z) converges to min z for any vector z as ε → 0. Indeed, minε can be
interpreted as a differentiable approximation of the min function, as shown in Figure 4.9.
Figure 4.9: Display of the function minε (z) in 2-D, z ∈ R2 , for varying ε.
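The soft-minimum (4.37) is one line of code when evaluated with a stable log-sum-exp; this sketch (names are ours) also checks the $\varepsilon \to 0$ limit numerically:

```python
import numpy as np
from scipy.special import logsumexp

def softmin(z, eps):
    """Soft-minimum (4.37): min_eps z = -eps * log sum_i exp(-z_i/eps).
    logsumexp shifts by the max internally, so small eps does not underflow."""
    return -eps * logsumexp(-np.asarray(z) / eps)

z = [0.3, 1.2, 0.5]
# softmin(z, eps) lies below min(z) and converges to it as eps -> 0
```

This differentiability of $\min{}_\varepsilon$ is what makes the regularized dual objective smooth, in contrast to the unregularized constrained dual.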
Note that these operations are equivalent to the entropic c-transform introduced in §5.3
(see in particular (5.11)). Using this notation, Sinkhorn’s iterates read
$f^{(\ell+1)} = \mathrm{Min}^{\mathrm{row}}_{\varepsilon}\big(C - \mathbf{1}_n (g^{(\ell)})^{\mathsf{T}}\big) + \varepsilon \log a, \tag{4.40}$
$g^{(\ell+1)} = \mathrm{Min}^{\mathrm{col}}_{\varepsilon}\big(C - f^{(\ell)} \mathbf{1}_m^{\mathsf{T}}\big) + \varepsilon \log b. \tag{4.41}$
Note that as ε → 0, minε converges to min, but the iterations do not converge anymore
in the limit ε = 0, because alternate minimization does not converge for constrained
problems, which is the case for the unregularized dual (2.20).
Remark 4.23 (Log-domain Sinkhorn). While mathematically equivalent to the Sinkhorn
updates (4.15), iterations (4.38) and (4.39) suggest using the log-sum-exp stabilization
trick to avoid underflow for small values of $\varepsilon$. Writing $\underline{z} = \min z$, that trick suggests evaluating $\min{}_{\varepsilon} z$ as
$\min{}_{\varepsilon} z = \underline{z} - \varepsilon \log \sum_i e^{-(z_i - \underline{z})/\varepsilon}. \tag{4.42}$
Instead of subtracting $\underline{z}$ to stabilize the log-domain iterations as in (4.42), one can actually subtract the previously computed scalings. This leads to the stabilized iterations
$f^{(\ell+1)} = \mathrm{Min}^{\mathrm{row}}_{\varepsilon}\big(S(f^{(\ell)}, g^{(\ell)})\big) + f^{(\ell)} + \varepsilon \log(a), \tag{4.43}$
$g^{(\ell+1)} = \mathrm{Min}^{\mathrm{col}}_{\varepsilon}\big(S(f^{(\ell+1)}, g^{(\ell)})\big) + g^{(\ell)} + \varepsilon \log(b), \tag{4.44}$
where we defined
$S(f, g) \stackrel{\text{def.}}{=} \big(C_{i,j} - f_i - g_j\big)_{i,j}.$
In contrast to the original iterations (4.15), these log-domain iterations (4.43) and (4.44)
are stable for arbitrary ε > 0, because the quantity S(f, g) stays bounded during the
iterations. The downside is that it requires $nm$ evaluations of $\exp$ at each step. Computing a $\mathrm{Min}^{\mathrm{row}}_{\varepsilon}$ or a $\mathrm{Min}^{\mathrm{col}}_{\varepsilon}$ is typically substantially slower than matrix multiplications, since it requires computing line-by-line soft-minima of the matrix $S$. There is therefore no efficient way to parallelize the application of Sinkhorn maps for several marginals simultaneously. In Euclidean domains of small dimension, it is possible to develop efficient multiscale solvers with a decaying-$\varepsilon$ strategy to significantly speed up the computation using sparse grids [Schmitzer, 2016b].
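A sketch of the stabilized log-domain iterations (4.43)-(4.44); the update order follows the text, while the function names and the fixed iteration budget are ours:

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_log(a, b, C, eps=0.01, n_iters=2000):
    """Log-domain Sinkhorn (4.43)-(4.44), with S(f, g)_{ij} = C_ij - f_i - g_j
    and row/column soft-minima computed by a stable log-sum-exp."""
    f, g = np.zeros_like(a), np.zeros_like(b)
    softmin = lambda S, axis: -eps * logsumexp(-S / eps, axis=axis)
    for _ in range(n_iters):
        f = softmin(C - f[:, None] - g[None, :], axis=1) + f + eps * np.log(a)
        g = softmin(C - f[:, None] - g[None, :], axis=0) + g + eps * np.log(b)
    P = np.exp(-(C - f[:, None] - g[None, :]) / eps)
    return P, f, g

a = np.array([0.4, 0.6])
b = np.array([0.7, 0.3])
C = np.array([[0.0, 1.0], [1.0, 0.0]])
P, f, g = sinkhorn_log(a, b, C, eps=0.01)
```

Unlike the plain-domain iterations (4.15), no quantity here ever leaves the representable floating-point range, even for $\varepsilon$ far below the entries of $C$.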
Remark 4.24 (Dual for generic measures). For generic and not necessarily discrete
input measures (α, β), the dual problem (4.30) reads
$\sup_{(f,g) \in \mathcal{C}(\mathcal{X}) \times \mathcal{C}(\mathcal{Y})}\ \int_{\mathcal{X}} f\, \mathrm{d}\alpha + \int_{\mathcal{Y}} g\, \mathrm{d}\beta - \varepsilon \int_{\mathcal{X} \times \mathcal{Y}} e^{\frac{-c(x,y) + f(x) + g(y)}{\varepsilon}}\, \mathrm{d}\alpha(x) \mathrm{d}\beta(y). \tag{4.45}$
The proof of existence (i.e. that the sup is actually a max) of these Kantorovich potentials $(f, g)$ in the case of entropic transport is less easy than for classical OT, because one cannot use the $c$-transform and potentials are not automatically Lipschitz. A proof of existence can be obtained using the convergence of Sinkhorn iterations; see [Chizat et al., 2018b] for more details.
which achieves the same optimal value as (4.45). Similarly to (4.37), the soft-
minimum (here on X × Y) is defined as
$\forall\, S \in \mathcal{C}(\mathcal{X} \times \mathcal{Y}), \quad \min{}_{\varepsilon} S \stackrel{\text{def.}}{=} -\varepsilon \log \int_{\mathcal{X} \times \mathcal{Y}} e^{\frac{-S(x,y)}{\varepsilon}}\, \mathrm{d}\alpha(x) \mathrm{d}\beta(y)$
(note that it depends on (α, β)). As ε → 0, minε → min, as used in the unregular-
ized and unconstrained formulation (2.27). Note that while both (4.45) and (4.46)
are unconstrained problems, a chief advantage of (4.46) is that it is better condi-
tioned, in the sense that the Hessian of the functional is uniformly bounded by ε.
Another way to obtain such a conditioning improvement is to consider semidual
problems; see §5.3 and in particular Remark 5.1. A disadvantage of this alterna-
tive dual formulation is that the presence of a log prevents the use of stochastic
optimization methods as detailed in §5.4; see in particular Remark 5.3.
Proposition 4.5. Any pair of optimal solutions $(f^{\star}, g^{\star})$ to (4.30) is such that $(f^{\star}, g^{\star}) \in \mathcal{R}(C)$, the set of feasible Kantorovich potentials defined in (2.21). As a consequence, we have that for any $\varepsilon$,
$\langle f^{\star}, a \rangle + \langle g^{\star}, b \rangle \le \mathrm{L}_C(a, b).$
Proof. The primal-dual optimality conditions in (4.4), with the constraint that $P$ is a probability and therefore $P_{i,j} \le 1$ for all $i, j$, yield that $\exp((f^{\star}_i + g^{\star}_j - C_{i,j})/\varepsilon) \le 1$ and therefore that $f^{\star}_i + g^{\star}_j \le C_{i,j}$.
A chief advantage of the regularized transportation cost LεC defined in (4.2) is that
it is smooth and convex, which makes it a perfect fit for integrating as a loss function
4.5. Regularized Approximations of the Optimal Transport Cost 81
Proposition 4.6. LεC (a, b) is a jointly convex function of a and b for ε ≥ 0. When
ε > 0, its gradient is equal to
" #
f?
∇LεC (a, b) = ? ,
g
where f? and g? are the optimal solutions of Equation (4.30) chosen so that their
coordinates sum to 0.
In [Cuturi, 2013], lower and upper bounds to approximate the Wasserstein distance
between two histograms were proposed. These bounds consist in evaluating the primal
and dual objectives at the solutions provided by the Sinkhorn algorithm.
Furthermore
$\mathrm{P}^{\varepsilon}_{C}(a, b) - \mathrm{D}^{\varepsilon}_{C}(a, b) = \varepsilon\,\big(H(P^{\star}) + 1\big). \tag{4.47}$
Proof. Equation (4.47) is obtained by writing that the primal and dual problems have
the same values at the optima (see (4.30)), and hence
$\mathrm{L}^{\varepsilon}_{C}(a, b) = \mathrm{P}^{\varepsilon}_{C}(a, b) - \varepsilon H(P^{\star}) = \mathrm{D}^{\varepsilon}_{C}(a, b) - \varepsilon \langle e^{f^{\star}/\varepsilon}, K e^{g^{\star}/\varepsilon} \rangle.$
The final result can be obtained by remarking that $\langle e^{f^{\star}/\varepsilon}, K e^{g^{\star}/\varepsilon} \rangle = 1$, since the latter amounts to computing the sum of all entries of $P^{\star}$.
The relationships given above suggest a practical way to bound the actual OT distance, but they are valid only upon convergence of the Sinkhorn algorithm and are therefore rarely useful as such, since in practice Sinkhorn iterations are always terminated once a certain accuracy threshold is reached. When a predetermined number $L$ of iterations is set and used to evaluate $\mathrm{D}^{\varepsilon}_{C}$ using iterates $f^{(L)}$ and $g^{(L)}$ instead of the optimal solutions $f^{\star}$ and $g^{\star}$, one recovers, however, a lower bound: using the notation
appearing in Equations (4.43) and (4.44), we thus introduce the following finite step
approximation of LεC :
$\mathrm{D}^{(L)}_{C}(a, b) \stackrel{\text{def.}}{=} \langle f^{(L)}, a \rangle + \langle g^{(L)}, b \rangle. \tag{4.48}$
This “algorithmic” Sinkhorn functional lower bounds the regularized cost function as
soon as L ≥ 1.
Proposition 4.8 (Finite Sinkhorn divergences). The following relationship holds:
$\mathrm{D}^{(L)}_{C}(a, b) \le \mathrm{L}^{\varepsilon}_{C}(a, b).$
Proof. Similarly to the proof of Proposition 4.5, we exploit the fact that after even just
one single Sinkhorn iteration, we have, following (4.35) and (4.36), that $f^{(L)}$ and $g^{(L)}$ are such that the matrix with elements $\exp((f^{(L)}_i + g^{(L)}_j - C_{i,j})/\varepsilon)$ has column sums equal to $b$; its elements are therefore each upper bounded by 1, which results in the dual feasibility of $(f^{(L)}, g^{(L)})$.
Remark 4.26 (Primal infeasibility of the Sinkhorn iterates). Note that the primal iterates provided in (4.8) are not primal feasible, since, by definition, these iterates satisfy the marginal constraints only upon convergence. Therefore, it is not valid to consider $\langle C, P^{(2L+1)} \rangle$ as an approximation of $\mathrm{L}_C(a, b)$, since $P^{(2L+1)}$ is not feasible. Using the
rounding scheme of Altschuler et al. [2017] laid out in Remark 4.6 one can, however,
yield an upper bound on LεC (a, b) that can, in addition, be conveniently computed
using matrix operations in parallel for several pairs of histograms, in the same fashion
as Sinkhorn’s algorithm [Lacombe et al., 2018].
Remark 4.27 (Nonconvexity of finite dual Sinkhorn divergence). Unlike the regularized expression $\mathrm{L}^{\varepsilon}_{C}$ in (4.30), the finite Sinkhorn divergence $\mathrm{D}^{(L)}_{C}(a, b)$ is not, in general, a convex function of its arguments (this can be easily checked numerically). $\mathrm{D}^{(L)}_{C}(a, b)$
is, however, a differentiable function which can be differentiated using automatic differ-
entiation techniques (see Remark 9.1.3) with respect to any of its arguments, notably
C, a, or b.
Indeed, defining $F = \iota_{\{a\}}$ and $G = \iota_{\{b\}}$, where the indicator function of a closed convex set $\mathcal{C}$ is
$\iota_{\mathcal{C}}(x) = \begin{cases} 0 & \text{if } x \in \mathcal{C}, \\ +\infty & \text{otherwise}, \end{cases} \tag{4.50}$
4.6. Generalized Sinkhorn 83
one retrieves the hard marginal constraints defining U(a, b). The proof of Proposi-
tion 4.3 carries to this more general problem (4.49), so that the unique solution of (4.49)
also has the form (4.12).
As shown in [Peyré, 2015, Frogner et al., 2015, Chizat et al., 2018b, Karlsson and
Ringh, 2016], Sinkhorn iterations (4.15) can hence be extended to this problem, and
they read
$u \leftarrow \frac{\operatorname{Prox}^{\mathrm{KL}}_{F}(Kv)}{Kv} \quad \text{and} \quad v \leftarrow \frac{\operatorname{Prox}^{\mathrm{KL}}_{G}(K^{\mathsf{T}} u)}{K^{\mathsf{T}} u}, \tag{4.51}$
where the proximal operator for the KL divergence is
$\forall\, u \in \mathbb{R}^{N}_{+}, \quad \operatorname{Prox}^{\mathrm{KL}}_{F}(u) = \underset{u' \in \mathbb{R}^{N}_{+}}{\operatorname{argmin}}\ \mathrm{KL}(u'|u) + F(u'). \tag{4.52}$
For some functions F, G it is possible to prove the linear rate of convergence for iter-
ations (4.51), and these schemes can be generalized to arbitrary measures; see [Chizat
et al., 2018b] for more details.
Iterations (4.51) are thus interesting in the cases where $\operatorname{Prox}^{\mathrm{KL}}_{F}$ and $\operatorname{Prox}^{\mathrm{KL}}_{G}$ can be computed in closed form or very efficiently. This is in particular the case for separable functions of the form $F(u) = \sum_i F_i(u_i)$, since then
$\operatorname{Prox}^{\mathrm{KL}}_{F}(u) = \big(\operatorname{Prox}^{\mathrm{KL}}_{F_i}(u_i)\big)_i.$
Remark 4.28 (Duality and Legendre transform). The dual problem to (4.49) reads
$\max_{f, g}\ -F^{*}(f) - G^{*}(g) - \varepsilon \sum_{i,j} e^{\frac{f_i + g_j - C_{i,j}}{\varepsilon}}, \tag{4.53}$
so that $(u, v) = (e^{f/\varepsilon}, e^{g/\varepsilon})$ are the associated scalings appearing in (4.12). Here, $F^{*}$ and $G^{*}$ are the Fenchel–Legendre conjugates, which are the convex functions defined as
$\forall\, f \in \mathbb{R}^n, \quad F^{*}(f) \stackrel{\text{def.}}{=} \max_{a \in \mathbb{R}^n} \langle f, a \rangle - F(a). \tag{4.54}$
The generalized Sinkhorn iterates (4.51) are a special case of Dykstra's algorithm [Dykstra, 1983, 1985] (extended to Bregman divergences [Bauschke and Lewis, 2000, Censor and Reich, 1998]; see also Remark 8.1) and constitute an alternate maximization scheme on the dual problem (4.53).
The formulation (4.49) can be further generalized to more than two functions and
more than a single coupling; we refer to [Chizat et al., 2018b] for more details. This
includes as a particular case the Sinkhorn algorithm (10.2) for the multimarginal prob-
lem, as detailed in §10.1. It is also possible to rewrite the regularized barycenter prob-
lem (9.15) this way, and the iterations (9.18) are in fact a special case of this generalized
Sinkhorn.
5
Semidiscrete Optimal Transport
This chapter studies methods to tackle the optimal transport problem when one of the
two input measures is discrete (a sum of Dirac masses) and the other one is arbitrary,
including notably the case where it has a density with respect to the Lebesgue measure.
When the ambient space has low dimension, this problem has a strong geometrical
flavor because one can show that the optimal transport from a continuous density
toward a discrete one is a piecewise constant map, where the preimage of each point
in the support of the discrete measure is a union of disjoint cells. When the cost is
the squared Euclidean distance, these cells correspond to an important concept from
computational geometry, the so-called Laguerre cells, which are Voronoi cells offset by
a constant. This connection allows us to borrow tools from computational geometry to
obtain fast computational schemes. In high dimensions, the semidiscrete formulation
can also be interpreted as a stochastic programming problem, which can also benefit
from a bit of regularization, extending therefore the scope of applications of the entropic
regularization scheme presented in Chapter 4. All these constructions rely heavily on
the notion of the c-transform, this time for general cost functions and not only matrices
as in §3.2. The c-transform is a generalization of the Legendre transform from convex
analysis and plays a pivotal role in the theory and algorithms for OT.
where we used the useful indicator function notation (4.50). Keeping either dual poten-
tial f or g fixed and optimizing w.r.t. g or f , respectively, leads to closed form solutions
that provide the definition of the c-transform:
Note that these partial minimizations define maximizers on the support of respectively
α and β, while the definitions (5.1) actually define functions on the whole spaces X
and Y. This is thus a way to extend in a canonical way solutions of (2.24) on the
whole spaces. When $\mathcal{X} = \mathbb{R}^d$ and $c(x, y) = \|x - y\|_2^p = (\sum_{i=1}^d |x_i - y_i|^2)^{p/2}$, the $c$-transform (5.1) $f^c$ is the so-called inf-convolution between $-f$ and $\|\cdot\|^p$. The definition of $f^c$ is also often referred to as a "Hopf–Lax formula."
The map (f, g) ∈ C(X ) × C(Y) 7→ (g c̄ , f c ) ∈ C(X ) × C(Y) replaces dual potentials
by “better” ones (improving the dual objective E). Functions that can be written in
the form f c and g c̄ are called c-concave and c̄-concave functions. In the special case
c(x, y) = hx, yi in X = Y = Rd , this definition coincides with the usual notion of
concave functions. Extending naturally Proposition 3.1 to a continuous case, one has
the property that
$f^{c\bar{c}c} = f^{c} \quad \text{and} \quad g^{\bar{c}c\bar{c}} = g^{\bar{c}},$
where we denoted $f^{c\bar{c}} \stackrel{\text{def.}}{=} (f^c)^{\bar{c}}$. This invariance property shows that one can "improve" the dual potentials only once this way. Alternatively, this means that alternate maximization does not converge (it immediately enters a cycle), which is classical for functionals
involving a nonsmooth (a constraint) coupling of the optimized variables. This is in
sharp contrast with entropic regularization of OT as shown in Chapter 4. In this case,
because of the regularization, the dual objective (4.30) is smooth, and alternate maxi-
mization corresponds to Sinkhorn iterations (4.43) and (4.44). These iterates, written
over the dual variables, define entropically smoothed versions of the c-transform, where
min operations are replaced by a “soft-min.”
Using (5.3), one can reformulate (2.24) as an unconstrained convex program over a
single potential,
$\mathrm{L}_{c}(\alpha, \beta) = \sup_{f \in \mathcal{C}(\mathcal{X})}\ \int_{\mathcal{X}} f(x)\, \mathrm{d}\alpha(x) + \int_{\mathcal{Y}} f^{c}(y)\, \mathrm{d}\beta(y) \tag{5.4}$
$= \sup_{g \in \mathcal{C}(\mathcal{Y})}\ \int_{\mathcal{X}} g^{\bar{c}}(x)\, \mathrm{d}\alpha(x) + \int_{\mathcal{Y}} g(y)\, \mathrm{d}\beta(y). \tag{5.5}$
Since one can iterate the map (f, g) 7→ (g c̄ , f c ), it is possible to add the constraint that
f is c̄-concave and g is c-concave, which is important to ensure enough regularity on
these potentials and show, for instance, existence of solutions to (2.24).
This transform maps a vector g to a continuous function gc̄ ∈ C(X ). Note that this
definition coincides with (5.1) when imposing that the space X is equal to the support
of β. Figure 5.1 shows some examples of such discrete c̄-transforms in one and two
dimensions.
Crucially, using the discrete c̄-transform in the semidiscrete problem (5.4) yields a
finite-dimensional optimization,
$\mathrm{L}_{c}(\alpha, \beta) = \max_{g \in \mathbb{R}^{m}}\ \mathcal{E}(g) \stackrel{\text{def.}}{=} \int_{\mathcal{X}} g^{\bar{c}}(x)\, \mathrm{d}\alpha(x) + \sum_{j} g_{j} b_{j}. \tag{5.7}$
decomposition corresponds to the Voronoi diagram partition of the space. Figure 5.1,
bottom row, shows examples of Laguerre cells segmentations in two dimensions.
This allows one to conveniently rewrite the optimized energy as
$\mathcal{E}(g) = \sum_{j=1}^{m} \int_{\mathbb{L}_{j}(g)} \big(c(x, y_{j}) - g_{j}\big)\, \mathrm{d}\alpha(x) + \langle g, b \rangle. \tag{5.8}$
Figure 5.2 displays iterations of a gradient descent on $\mathcal{E}$. Once the optimal $g$ is computed, the optimal transport map $T$ from $\alpha$ to $\beta$ maps any $x \in \mathbb{L}_{j}(g)$ to $y_{j}$; it is thus piecewise constant.
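A stochastic sketch of one step on the energy (5.8), for $\alpha$ uniform on the unit square and $c(x, y) = \|x - y\|^2$: the gradient coordinate $b_j - \alpha(\mathbb{L}_j(g))$ is estimated by Monte Carlo sampling of the Laguerre cells. All function names, the step size, and the sample budget are illustrative choices of ours:

```python
import numpy as np

def laguerre_assign(X, Y, g):
    """Assign each sample x to its Laguerre cell: argmin_j c(x, y_j) - g_j."""
    D = ((X[:, None, :] - Y[None, :, :])**2).sum(-1)   # pairwise squared costs
    return np.argmin(D - g[None, :], axis=1)

def semidiscrete_gradient_step(Y, b, g, n_samples=10000, step=1.0, rng=None):
    """One stochastic step on (5.8): move g along b_j - alpha(L_j(g)),
    with alpha(L_j) estimated from uniform samples on [0,1]^2."""
    rng = np.random.default_rng(rng)
    X = rng.random((n_samples, 2))                      # samples from alpha
    cells = laguerre_assign(X, Y, g)
    mass = np.bincount(cells, minlength=len(b)) / n_samples
    return g + step * (b - mass)

Y = np.array([[0.25, 0.5], [0.75, 0.5]])
b = np.array([0.5, 0.5])
g1 = semidiscrete_gradient_step(Y, b, np.zeros(2), rng=0)
```

At a stationary point each cell carries exactly the mass $b_j$ prescribed to its site, which is the optimality condition behind Figure 5.2; exact-geometry solvers replace the Monte Carlo estimate with the computational-geometry tools discussed next.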
In the special case c(x, y) = kx − yk2 , the decomposition in Laguerre cells is also
known as a “power diagram.” The cells are polyhedral and can be computed efficiently
Figure 5.1: Top: examples of semidiscrete $\bar{c}$-transforms $g^{\bar{c}}$ in one dimension, for ground cost $c(x, y) = |x - y|^p$ for varying $p$ (see colorbar). The red points are at locations $(y_j, -g_j)_j$. Bottom: examples of semidiscrete $\bar{c}$-transforms $g^{\bar{c}}$ in two dimensions, for ground cost $c(x, y) = \|x - y\|_2^p = (\sum_{i=1}^d |x_i - y_i|^2)^{p/2}$ for varying $p$. The red points are at locations $y_j \in \mathbb{R}^2$, and their size is proportional to $g_j$. The regions delimited by bold black curves are the Laguerre cells $(\mathbb{L}_j(g))_j$ associated to these points $(y_j)_j$.
using computational geometry algorithms; see [Aurenhammer, 1987]. The most widely used algorithm relies on the fact that the power diagram of points in R^d is equal to the projection on R^d of the convex hull of the set of points ((y_j, ‖y_j‖² − g_j))_{j=1}^m ⊂ R^{d+1}.
There are numerous algorithms to compute convex hulls; for instance, that of Chan
[1996] in two and three dimensions has complexity O(m log(Q)), where Q is the number
of vertices of the convex hull.
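This lifting can be sketched with SciPy's Qhull bindings; the helper below (our naming, a sketch rather than a production implementation) returns the indices of sites whose power cell is nonempty, i.e. the vertices appearing in some lower-hull facet of the lifted points.

```python
import numpy as np
from scipy.spatial import ConvexHull

def power_cells_nonempty(Y, g):
    # Lift each site y_j to z_j = (y_j, ||y_j||^2 - g_j) in R^{d+1}.
    lift = np.hstack([Y, (np.sum(Y ** 2, axis=1) - g)[:, None]])
    hull = ConvexHull(lift)
    # Lower-hull facets are those whose outward normal points downward
    # (negative last coordinate); their projection gives the power diagram.
    lower = hull.equations[:, -2] < 0
    return np.unique(hull.simplices[lower])
```

With g = 0 the power diagram reduces to the Voronoi diagram, so every site should be reported.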
The initial idea of a semidiscrete solver for Monge–Ampère equations was proposed
by Oliker and Prussner [1989], and its relation to the dual variational problem was
shown by Aurenhammer et al. [1998]. A theoretical analysis and its application to the
reflector problem in optics is detailed in [Caffarelli et al., 1999]. The semidiscrete for-
mulation was used in [Carlier et al., 2010] in conjunction with a continuation approach
based on Knothe’s transport. The recent revival of these methods in various fields is due
to Mérigot [2011], who proposed a quasi-Newton solver and clarified the link with con-
cepts from computational geometry. We refer to [Lévy and Schwindt, 2018] for a recent
Figure 5.2: Iterations of the semidiscrete OT algorithm minimizing (5.8) (here a simple gradient descent is used). The support (y_j)_j of the discrete measure β is indicated by the colored points, while the continuous measure α is the uniform measure on a square. The colored cells display the Laguerre partition (L_j(g^{(ℓ)}))_j, where g^{(ℓ)} is the discrete dual potential computed at iteration ℓ.
overview. The use of a Newton solver which is applied to sampling in computer graphics
is proposed in [De Goes et al., 2012]; see also [Lévy, 2015] for applications to 3-D volume
and surface processing. An important area of application of the semidiscrete method is the resolution of incompressible fluid dynamics (Euler’s equations) using Lagrangian methods [de Goes et al., 2015, Gallouët and Mérigot, 2017]. The semidiscrete OT solver enforces incompressibility at each iteration by imposing that the (possibly weighted) point cloud approximates a uniform distribution inside the domain. The
convergence (with linear rate) of damped Newton iterations is proved in [Mirebeau,
2015] for the Monge–Ampère equation and is refined in [Kitagawa et al., 2016] for op-
timal transport. Semidiscrete OT finds important applications to illumination design,
notably reflectors; see [Meyron et al., 2018].
5.3 Entropic Semidiscrete Formulation

The dual of the entropic regularized problem between arbitrary measures (4.9) is a smooth unconstrained optimization problem:
$$\mathcal{L}_c^{\varepsilon}(\alpha, \beta) = \sup_{(f, g) \in \mathcal{C}(\mathcal{X}) \times \mathcal{C}(\mathcal{Y})} \int_{\mathcal{X}} f\,\mathrm{d}\alpha + \int_{\mathcal{Y}} g\,\mathrm{d}\beta - \varepsilon \int_{\mathcal{X} \times \mathcal{Y}} e^{\frac{-c + f \oplus g}{\varepsilon}}\,\mathrm{d}\alpha\,\mathrm{d}\beta, \qquad (5.9)$$
where we denoted (f ⊕ g)(x, y) := f(x) + g(y).
Similarly to the unregularized problem (5.1), one can minimize explicitly with respect to either f or g in (5.9), which yields a smoothed c-transform
$$\forall\, y \in \mathcal{Y}, \quad f^{c,\varepsilon}(y) \stackrel{\text{def.}}{=} -\varepsilon \log \int_{\mathcal{X}} e^{\frac{-c(x, y) + f(x)}{\varepsilon}}\,\mathrm{d}\alpha(x),$$
$$\forall\, x \in \mathcal{X}, \quad g^{\bar c,\varepsilon}(x) \stackrel{\text{def.}}{=} -\varepsilon \log \int_{\mathcal{Y}} e^{\frac{-c(x, y) + g(y)}{\varepsilon}}\,\mathrm{d}\beta(y).$$
In the case of a discrete measure β = Σ_{j=1}^m b_j δ_{y_j}, the problem simplifies as with (5.7) to a finite-dimensional problem expressed as a function of the discrete dual potential g ∈ R^m,
$$\forall\, x \in \mathcal{X}, \quad g^{\bar c,\varepsilon}(x) \stackrel{\text{def.}}{=} -\varepsilon \log \sum_{j=1}^m e^{\frac{-c(x, y_j) + g_j}{\varepsilon}}\, b_j. \qquad (5.10)$$
One defines similarly f^{c,ε} in the case of a discrete measure α. Note that the rewriting (4.40) and (4.41) of Sinkhorn using the soft-min operator min_ε corresponds to the alternate computation of entropic smoothed c-transforms,
$$f_i^{(\ell+1)} = g^{\bar c,\varepsilon}(x_i) \quad\text{and}\quad g_j^{(\ell+1)} = f^{c,\varepsilon}(y_j). \qquad (5.11)$$
Instead of maximizing (5.9), one can thus solve the following finite-dimensional optimization problem:
$$\max_{g \in \mathbb{R}^m} E^{\varepsilon}(g) \stackrel{\text{def.}}{=} \int_{\mathcal{X}} g^{\bar c,\varepsilon}(x)\,\mathrm{d}\alpha(x) + \langle g, b \rangle. \qquad (5.12)$$
Note that this optimization problem is still valid even in the unregularized case ε = 0: then g^{c̄,ε=0} = g^{c̄} is the c̄-transform defined in (5.6), so that (5.12) is in fact (5.8). The gradient of this functional reads
$$\forall\, j \in \llbracket m \rrbracket, \quad \nabla E^{\varepsilon}(g)_j = -\int_{\mathcal{X}} \chi_j^{\varepsilon}(x)\,\mathrm{d}\alpha(x) + b_j, \qquad (5.13)$$
where χ_j^ε is a smoothed version of the indicator χ_j^0 of the Laguerre cell L_j(g),
$$\chi_j^{\varepsilon}(x) = \frac{e^{\frac{-c(x, y_j) + g_j}{\varepsilon}}}{\sum_{\ell} e^{\frac{-c(x, y_\ell) + g_\ell}{\varepsilon}}}.$$
Note once again that this formula (5.13) is still valid for ε = 0. Note also that the family of functions (χ_j^ε)_j is a partition of unity, i.e. Σ_j χ_j^ε = 1 and χ_j^ε ≥ 0; Figure 5.3, bottom row, illustrates this.
Remark 5.1 (Second-order methods and connection with logistic regression). A crucial aspect of the smoothed semidiscrete formulation (5.12) is that it corresponds to the minimization of a smooth function. Indeed, as shown in [Genevay et al., 2016], the Hessian of E^ε is upper bounded by 1/ε, so that ∇E^ε is (1/ε)-Lipschitz continuous. In fact, this problem is very closely related to a multiclass logistic regression problem (see Figure 5.3 for a display of the resulting fuzzy classification boundary) and enjoys the same favorable properties (see [Hosmer Jr et al., 2013]), which are generalizations of self-concordance; see [Bach, 2010]. In particular, the Newton method converges quadratically, and one can use in practice quasi-Newton techniques, such as L-BFGS, as advocated in [Cuturi and Peyré, 2016]. Note that [Cuturi and Peyré, 2016] studies the more general barycenter problem detailed in §9.2, but it is equivalent to this semidiscrete setting when considering only a pair of input measures. The use of second-order methods is thus particularly attractive in this setting.
Figure 5.3: Top: examples of entropic semidiscrete c̄-transforms g^{c̄,ε} in one dimension, for ground cost c(x, y) = |x − y| for varying ε (see colorbar). The red points are at locations (y_j, −g_j)_j. Bottom: examples of entropic semidiscrete c̄-transforms g^{c̄,ε} in two dimensions, for ground cost c(x, y) = ‖x − y‖₂ for varying ε. The black curves are the level sets of the function g^{c̄,ε}, while the colors indicate the smoothed indicator functions of the Laguerre cells χ_j^ε. The red points are at locations y_j ∈ R², and their size is proportional to g_j.
The functional E^ε can also be seen as a smoothed version of a Legendre transform of G(α, β) := L_c(α, β).
5.4 Stochastic Optimization Methods

The semidiscrete formulation (5.8) and its smoothed version (5.12) are appealing because the energies to be minimized are written as an expectation with respect to the probability distribution α,
$$E^{\varepsilon}(g) = \int_{\mathcal{X}} E^{\varepsilon}(g, x)\,\mathrm{d}\alpha(x) = \mathbb{E}_X\big( E^{\varepsilon}(g, X) \big), \quad\text{where}\quad E^{\varepsilon}(g, x) \stackrel{\text{def.}}{=} g^{\bar c,\varepsilon}(x) + \langle g, b \rangle,$$
and X denotes a random vector distributed on X according to α. Note that the gradient of each of the involved functionals reads
$$\nabla_g E^{\varepsilon}(g, x) = -\big( \chi_j^{\varepsilon}(x) \big)_{j=1}^m + b \in \mathbb{R}^m.$$
One can thus use stochastic optimization methods to perform the maximization, as proposed in Genevay et al. [2016]. This allows us to obtain provably convergent algorithms without the need to discretize the measure α.
Stochastic gradient descent with averaging. SGD is slow because of the fast decay of the stepsize τ_ℓ toward zero. To improve the convergence speed, it is possible to average the past iterates, which is equivalent to running a “classical” SGD on auxiliary variables (g̃^{(ℓ)})_ℓ,
$$\tilde g^{(\ell+1)} \stackrel{\text{def.}}{=} \tilde g^{(\ell)} + \tau_\ell\, \nabla_g E^{\varepsilon}(\tilde g^{(\ell)}, x_\ell),$$
where x_ℓ is drawn according to α (and all the (x_ℓ)_ℓ are independent) and output as estimated weight vector the average
$$g^{(\ell)} \stackrel{\text{def.}}{=} \frac{1}{\ell} \sum_{k=1}^{\ell} \tilde g^{(k)}.$$
This defines the stochastic gradient descent with averaging (SGA) algorithm. Note that it is possible to avoid explicitly storing all the iterates by simply updating a running average as follows:
$$g^{(\ell+1)} = \frac{1}{\ell+1}\, \tilde g^{(\ell+1)} + \frac{\ell}{\ell+1}\, g^{(\ell)}.$$

Figure 5.4: Evolution of the energy E^ε(g^{(ℓ)}), for ε = 0 (no regularization), during the SGD iterations (5.14). Each colored curve shows a different randomized run. The images display the evolution of the Laguerre cells (L_j(g^{(ℓ)}))_j through the iterations.
In this case, a typical choice of decay is rather of the form
$$\tau_\ell \stackrel{\text{def.}}{=} \frac{\tau_0}{\sqrt{1 + \ell/\ell_0}}.$$
Notice that the step size now decays toward 0 much more slowly than for (5.15), at rate ℓ^{−1/2}. Bach [2014] proves that SGA leads to a faster convergence (the constants involved are smaller) than SGD, since, in contrast to SGD, SGA is adaptive to the local strong convexity (or concavity for maximization problems) of the functional.
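The SGA scheme above can be sketched compactly in NumPy; the function name, the squared-distance cost, and the default parameters are ours, and the per-sample gradient b − χ^ε(x) follows the expectation formulation of §5.4.

```python
import numpy as np

def sga_semidiscrete(sample_alpha, Y, b, eps, n_iter=2000, tau0=1.0, l0=10):
    # Stochastic gradient ascent with averaging on E^eps, c = squared distance.
    g_tilde = np.zeros(len(b))
    g_avg = np.zeros(len(b))
    for l in range(1, n_iter + 1):
        x = sample_alpha()
        z = (g_tilde - np.sum((x - Y) ** 2, axis=1)) / eps
        z = z - z.max()
        chi = np.exp(z)
        chi = chi / chi.sum()                  # smoothed Laguerre indicator
        tau = tau0 / np.sqrt(1.0 + l / l0)     # slow O(l^{-1/2}) stepsize decay
        g_tilde = g_tilde + tau * (b - chi)    # per-sample gradient b - chi^eps(x)
        g_avg = g_avg + (g_tilde - g_avg) / l  # running Cesaro average
    return g_avg
```

Since each per-sample gradient sums to zero, the sum of the potentials is conserved along the iterations, consistent with the invariance of the dual to additive constants.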
When both α and β are continuous, the dual potentials can no longer be represented by a finite-dimensional vector; one option is to restrict them to a much smaller subset, such as that spanned by multilayer neural networks [Seguy et al., 2018]. This approach leads to nonconvex finite-dimensional optimization problems with no approximation guarantees, but it can provide an effective way to compute a proxy for the Wasserstein distance in high-dimensional scenarios. Another solution is to use nonparametric families, which is equivalent to considering some sort of progressive refinement, as proposed by Genevay et al. [2016] using reproducing kernel Hilbert spaces, whose dimension is proportional to the number of iterations of the SGD algorithm.
6 W1 Optimal Transport
This chapter focuses on optimal transport problems in which the ground cost is equal
to a distance. Historically, this corresponds to the original problem posed by Monge
in 1781; this setting was also that chosen in early applications of optimal transport in
computer vision [Rubner et al., 2000] under the name of “earth mover’s distances”.
Unlike the case where the ground cost is a squared Hilbertian distance (studied
in particular in Chapter 7), transport problems where the cost is a metric are more
difficult to analyze theoretically. In contrast to Remark 2.24 that states the uniqueness
of a transport map or coupling between two absolutely continuous measures when
using a squared metric, the optimal Kantorovich coupling is in general not unique
when the cost is the ground distance itself. Hence, in this regime it is often impossible
to recover a uniquely defined Monge map, making this class of problems ill-suited for
interpolation of measures. We refer to works by Trudinger and Wang [2001], Caffarelli
et al. [2002], Sudakov [1979], Evans and Gangbo [1999] for proofs of existence of optimal
W 1 transportation plans and detailed analyses of their geometric structure.
Although more difficult to analyze in theory, optimal transport with a linear ground
distance is usually more robust to outliers and noise than a quadratic cost. Further-
more, a cost that is a metric results in an elegant dual reformulation involving local
flow, divergence constraints, or Lipschitzness of the dual potential, suggesting cheaper
numerical algorithms that align with minimum-cost flow methods over networks in
graph theory. This setting is also popular because the associated OT distances define
a norm that can compare arbitrary distributions, even if they are not positive; this
property is shared by a larger class of so-called dual norms (see §8.2 and Remark 10.6
for more details).
6.1 W1 on Metric Spaces
Here we assume that d is a distance on X = Y, and we solve the OT problem with the ground cost c(x, y) = d(x, y). The following proposition highlights key properties of the c-transform (5.1) in this setup. In the following, we denote the Lipschitz constant of a function f ∈ C(X) as
$$\mathrm{Lip}(f) \stackrel{\text{def.}}{=} \sup\left\{ \frac{|f(x) - f(y)|}{d(x, y)} : (x, y) \in \mathcal{X}^2,\; x \ne y \right\}.$$
We define Lipschitz functions to be those functions f satisfying Lip(f) < +∞; they form a convex subset of C(X).
Proposition 6.1. Suppose X = Y and c(x, y) = d(x, y). Then, there exists g such that f = g^c if and only if Lip(f) ≤ 1. Furthermore, if Lip(f) ≤ 1, then f^c = −f.
Proof. First, suppose f = g^c. Then, for x, y ∈ X,
$$|f(x) - f(y)| = \left| \inf_{z \in \mathcal{X}} \big[ d(x, z) - g(z) \big] - \inf_{z \in \mathcal{X}} \big[ d(y, z) - g(z) \big] \right| \le \sup_{z \in \mathcal{X}} |d(x, z) - d(y, z)| \le d(x, y).$$
The first equality follows from the definition of g^c, the next inequality from the identity |inf_z f(z) − inf_z g(z)| ≤ sup_z |f(z) − g(z)|, and the last from the triangle inequality. This shows that Lip(f) ≤ 1.
Now, suppose Lip(f) ≤ 1, and define g := −f. By the Lipschitz property, for all x, y ∈ X, f(y) − d(x, y) ≤ f(x) ≤ f(y) + d(x, y). Applying these inequalities,
$$g^c(y) = \inf_{x \in \mathcal{X}} \big[ d(x, y) + f(x) \big] \ge \inf_{x \in \mathcal{X}} \big[ d(x, y) + f(y) - d(x, y) \big] = f(y),$$
$$g^c(y) = \inf_{x \in \mathcal{X}} \big[ d(x, y) + f(x) \big] \le \inf_{x \in \mathcal{X}} \big[ d(x, y) + f(y) + d(x, y) \big] = f(y).$$
This shows f^c = −f.
Starting from the single-potential formulation (5.4), one can iterate the construction and replace the couple (g, g^c) by (g^c, (g^c)^c). The last proposition shows that one can thus use (g^c, −g^c), which in turn is equivalent to any pair (f, −f) such that Lip(f) ≤ 1. This leads to the following alternative expression for the W₁ distance:
$$W_1(\alpha, \beta) = \max_{f} \left\{ \int_{\mathcal{X}} f(x)\, \big( \mathrm{d}\alpha(x) - \mathrm{d}\beta(x) \big) : \mathrm{Lip}(f) \le 1 \right\}. \qquad (6.1)$$
This expression shows that W₁ is actually a norm, i.e. W₁(α, β) = ‖α − β‖_{W₁}, and that it is still valid for any measures (not necessarily positive) as long as ∫_X dα = ∫_X dβ. This norm is often called the Kantorovich–Rubinstein norm [Kantorovich and Rubinstein, 1958].
For discrete measures of the form (2.1), writing α − β = Σ_k m_k δ_{z_k} with z_k ∈ X and Σ_k m_k = 0, the optimization (6.1) can be rewritten as
$$W_1(\alpha, \beta) = \max_{(f_k)_k} \left\{ \sum_k f_k m_k : \forall\,(k, \ell),\; |f_k - f_\ell| \le d(z_k, z_\ell) \right\}, \qquad (6.2)$$
which is a linear program. Note that, furthermore, in the 1-D case, a closed-form expression for W₁ using cumulative functions is given in (2.37).
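The linear program (6.2) can be solved directly with an off-the-shelf LP solver; below is a small sketch (our function name) for points on the real line, where d(z_k, z_ℓ) = |z_k − z_ℓ|, using SciPy's `linprog`.

```python
import numpy as np
from scipy.optimize import linprog

def w1_lp(z, m):
    # Solve (6.2) on the real line with scipy's LP solver.
    n = len(z)
    D = np.abs(z[:, None] - z[None, :])
    A, rhs = [], []
    for k in range(n):
        for l in range(n):
            if k != l:                       # f_k - f_l <= d(z_k, z_l)
                row = np.zeros(n)
                row[k], row[l] = 1.0, -1.0
                A.append(row)
                rhs.append(D[k, l])
    # Potentials are defined up to a constant; pin f_0 = 0 for a bounded program.
    bounds = [(0.0, 0.0)] + [(None, None)] * (n - 1)
    res = linprog(-np.asarray(m), A_ub=np.array(A), b_ub=np.array(rhs),
                  bounds=bounds, method="highs")
    return -res.fun
```

In 1-D the result can be cross-checked against the cumulative-function formula (2.37).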
Remark 6.1 (W_p with 0 < p ≤ 1). If 0 < p ≤ 1, then d̃(x, y) := d(x, y)^p satisfies the triangle inequality, and hence d̃ is itself a distance. One can thus apply the results and algorithms detailed above for W₁ to compute W_p by simply using d̃ in place of d. This is equivalent to stating that W_p is the dual of p-Hölder functions {f : Lip_p(f) ≤ 1}, where
$$\mathrm{Lip}_p(f) \stackrel{\text{def.}}{=} \sup\left\{ \frac{|f(x) - f(y)|}{d(x, y)^p} : (x, y) \in \mathcal{X}^2,\; x \ne y \right\}.$$
6.2 W1 on Euclidean Spaces

In the special case of Euclidean spaces X = Y = R^d, using c(x, y) = ‖x − y‖, the global Lipschitz constraint appearing in (6.1) can be made local, as a uniform bound on the gradient of f:
$$W_1(\alpha, \beta) = \max_{f} \left\{ \int_{\mathbb{R}^d} f(x)\, \big( \mathrm{d}\alpha(x) - \mathrm{d}\beta(x) \big) : \|\nabla f\|_\infty \le 1 \right\}. \qquad (6.3)$$
Here the constraint ‖∇f‖_∞ ≤ 1 signifies that the norm of the gradient of f at any point x is upper bounded by 1, i.e. ‖∇f(x)‖₂ ≤ 1 for any x.
Considering the dual problem to (6.3), one obtains an optimization problem under a fixed divergence constraint,
$$W_1(\alpha, \beta) = \min_{s} \left\{ \int_{\mathbb{R}^d} \|s(x)\|_2\,\mathrm{d}x : \mathrm{div}(s) = \alpha - \beta \right\}, \qquad (6.4)$$
which is often called the Beckmann formulation [Beckmann, 1952]. Here the vector field s(x) ∈ R^d can be interpreted as a flow, describing locally the movement of mass. Outside the support of the two input measures, div(s) = 0, which is the conservation-of-mass constraint. Once properly discretized using finite elements, Problems (6.3) and (6.4) become nonsmooth convex optimization problems. It is possible to use an off-the-shelf interior-point quadratic-cone optimization solver, but as advocated
in §7.3, large-scale problems require the use of simpler but more adapted first order
methods. One can thus use, for instance, Douglas–Rachford (DR) iterations (7.14) or
the related alternating direction method of multipliers method. Note that on a uniform
grid, projecting on the divergence constraint is conveniently handled using the fast
Fourier transform. We refer to Solomon et al. [2014a] for a detailed account for these
approaches and application to OT on triangulated meshes. See also Li et al. [2018a],
Ryu et al. [2017b,a] for similar approaches using primal-dual splitting schemes. Ap-
proximation schemes that relax the Lipschitz constraint on the dual potentials f have
also been proposed, using, for instance, a constraint on wavelet coefficients leading to
an explicit formula [Shirdhonkar and Jacobs, 2008], or by considering only functions f
parameterized as multilayer neural networks with “rectified linear” max(0, ·) activation
function and clipped weights [Arjovsky et al., 2017].
6.3 W1 on a Graph
The previous formulations (6.3) and (6.4) of W 1 can be generalized to the setting where
X is a geodesic space, i.e. c(x, y) = d(x, y) where d is a geodesic distance. We refer
to Feldman and McCann [2002] for a theoretical analysis in the case where X is a
Riemannian manifold. When X = J1, nK is a discrete set, equipped with undirected
edges (i, j) ∈ E ⊂ X 2 labeled with a weight (length) wi,j , we recover the important
case where X is a graph equipped with the geodesic distance (or shortest path metric):
$$D_{i,j} \stackrel{\text{def.}}{=} \min_{K \ge 0,\ (i_k)_k : i \to j} \left\{ \sum_{k=1}^{K-1} w_{i_k, i_{k+1}} : \forall\, k \in \llbracket 1, K-1 \rrbracket,\; (i_k, i_{k+1}) \in E \right\},$$
where i → j indicates that i1 = i and iK = j, namely that the path starts at i and
ends at j.
We consider two vectors (a, b) ∈ (R^n)² defining (signed) discrete measures on the graph X such that Σ_i a_i = Σ_i b_i (these weights do not need to be positive). The
goal is now to compute W1 (a, b), as introduced in (2.17) for p = 1, when the ground
metric is the graph geodesic distance. This computation should be carried out without
going as far as having to compute a “full” coupling P of size n × n, to rely instead on
local operators thanks to the underlying connectivity of the graph. These operators are
discrete formulations for the gradient and divergence differential operators.
100 W 1 Optimal Transport
A flow s = (s_{i,j})_{i,j} is defined on edges, and the divergence operator div : R^E → R^n, which is the adjoint of the gradient ∇, maps flows to vectors defined on vertices. It is defined as
$$\forall\, i \in \llbracket 1, n \rrbracket, \quad \mathrm{div}(s)_i \stackrel{\text{def.}}{=} \sum_{j : (i,j) \in E} (s_{i,j} - s_{j,i}).$$
On the graph, the Lipschitz-dual formulation (6.1) becomes the finite-dimensional program
$$W_1(a, b) = \max_{f \in \mathbb{R}^n} \left\{ \sum_{i=1}^n f_i (a_i - b_i) : \forall\,(i, j) \in E,\; |f_i - f_j| \le w_{i,j} \right\}, \qquad (6.5)$$
while the divergence-constrained formulation (6.4) becomes
$$W_1(a, b) = \min_{s \in \mathbb{R}_+^{E}} \left\{ \sum_{(i,j) \in E} w_{i,j}\, s_{i,j} : \mathrm{div}(s) = a - b \right\}. \qquad (6.6)$$
This is a linear program and more precisely an instance of min-cost flow problems.
Highly efficient dedicated simplex solvers have been devised to solve it; see, for in-
stance, [Ling and Okada, 2007]. Figure 6.1 shows an example of primal and dual solu-
tions. Formulation (6.6) is the so-called Beckmann formulation [Beckmann, 1952] and
has been used and extended to define and study traffic congestion models; see, for
instance, [Carlier et al., 2008].
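A generic LP solver already suffices for a small graph (a dedicated network simplex implementation is much faster at scale); the sketch below (our function name) uses one nonnegative flow variable per edge orientation and enforces the divergence constraint as in the Beckmann-type formulation above.

```python
import numpy as np
from scipy.optimize import linprog

def w1_graph(edges, w, a, b):
    # Min-cost flow LP on a graph: one nonnegative flow variable per edge
    # orientation, minimizing sum_e w_e s_e subject to div(s) = a - b.
    n = len(a)
    arcs = list(edges) + [(j, i) for (i, j) in edges]
    A = np.zeros((n, len(arcs)))
    for e, (i, j) in enumerate(arcs):
        A[i, e] += 1.0   # flow on (i, j) leaves i ...
        A[j, e] -= 1.0   # ... and enters j
    res = linprog(np.concatenate([w, w]), A_eq=A,
                  b_eq=np.asarray(a) - np.asarray(b),
                  bounds=[(0, None)] * len(arcs), method="highs")
    return res.fun
```

On a path graph with unit weights, the optimal cost equals the geodesic distance between the source and sink of the mass.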
Figure 6.1: Example of computation of W₁(a, b) on a planar graph with uniform weights w_{i,j} = 1. Left: potential f solution of (6.5) (increasing value from red to blue). The green color of the edges is proportional to |(∇f)_{i,j}|. Right: flow s solution of (6.6), where bold black edges display nonzero s_{i,j}, which saturate to w_{i,j} = 1. These saturating flow edges on the right match the light green edges on the left where |(∇f)_{i,j}| = 1.
7 Dynamic Formulations
This chapter presents the geodesic (also called dynamic) point of view of optimal trans-
port when the cost is a squared geodesic distance. This describes the optimal transport
between two measures as a curve in the space of measures minimizing a total length.
The dynamic point of view offers an alternative and intuitive interpretation of optimal
transport, which not only allows us to draw links with fluid dynamics but also results
in an efficient numerical tool to compute OT in small dimensions when interpolating
between two densities. The drawback of that approach is that it cannot scale to large-
scale sparse measures and works only in low dimensions on regular domains (because
one needs to grid the space) with a squared geodesic cost.
In this chapter, we use the notation (α0 , α1 ) in place of (α, β) in agreement with
the idea that we start at time t = 0 from one measure to reach another one at time
t = 1.
7.1 Continuous Formulation

A time-dependent measure α_t advected by a velocity field v_t satisfies the continuity equation
$$\frac{\partial \alpha_t}{\partial t} + \mathrm{div}(\alpha_t v_t) = 0, \quad \alpha_{t=0} = \alpha_0,\; \alpha_{t=1} = \alpha_1, \qquad (7.1)$$
where the equation above should be understood in the sense of distributions on R^d. The infinitesimal length of such a vector field is measured using the L² norm associated to the measure α_t, defined as
$$\|v_t\|_{L^2(\alpha_t)} = \left( \int_{\mathbb{R}^d} \|v_t(x)\|^2\,\mathrm{d}\alpha_t(x) \right)^{1/2}.$$
Figure 7.1: Displacement interpolation α_t satisfying (7.2). Top: for two measures (α₀, α₁) with densities with respect to the Lebesgue measure. Bottom: for two discrete empirical measures with the same number of points.
This definition might seem complicated, but it is crucial to impose that the momentum J_t(x) should vanish when α_t(x) = 0. Note also that (7.3) is written in an informal way, as if the measures (α_t, J_t) were density functions, but this is acceptable because θ is a 1-homogeneous function (and hence defined even if the measures do not have a density with respect to the Lebesgue measure) and can thus be extended in an unambiguous way from densities to measures.
Remark 7.1 (Links with McCann’s interpolation). In the case (see Equation (2.28)) where there exists an optimal Monge map T : R^d → R^d with T_♯ α₀ = α₁, then α_t is equal to McCann’s interpolation
$$\alpha_t = \big( (1 - t)\,\mathrm{Id} + t\,T \big)_\sharp\, \alpha_0.$$
In the 1-D case, using Remark 2.30, this interpolation can be computed thanks to the relation
$$\mathcal{C}_{\alpha_t}^{-1} = (1 - t)\, \mathcal{C}_{\alpha_0}^{-1} + t\, \mathcal{C}_{\alpha_1}^{-1}; \qquad (7.7)$$
see Figure 2.11. We refer to Gangbo and McCann [1996] for a detailed review on
the Riemannian geometry of the Wasserstein space. In the case that there is “only”
an optimal coupling π that is not necessarily supported on a Monge map, one can compute this interpolant as
$$\alpha_t = (P_t)_\sharp\, \pi, \quad\text{where}\quad P_t : (x, y) \mapsto (1 - t)x + ty. \qquad (7.8)$$
For instance, in the discrete setup (2.3), denoting P a solution to (2.11), an interpolation is defined as
$$\alpha_t = \sum_{i,j} \mathbf{P}_{i,j}\, \delta_{(1-t)x_i + t y_j}. \qquad (7.9)$$
Such interpolations are supported on points moving along straight paths. McCann’s interpolation finds many applications, for instance, color, shape, and illumination interpolations in computer graphics [Bonneel et al., 2011].
Figure 7.2: Comparison of displacement interpolation (7.8) of discrete measures. Top: point clouds
(empirical measures (α0 , α1 ) with the same number of points). Bottom: same but with varying weights.
For 0 < t < 1, the top example corresponds to an empirical measure interpolation αt with N points,
while the bottom one defines a measure supported on 2N − 1 points.
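Given a discrete coupling P (computed, e.g., by any exact OT solver), the interpolation (7.9) is a one-liner; the helper below is a minimal sketch with our naming.

```python
import numpy as np

def displacement_interpolation(P, X, Y, t):
    # (7.9): alpha_t = sum_ij P_ij delta_{(1 - t) x_i + t y_j}.
    # Returns the support points and weights of the interpolated measure.
    I, J = np.nonzero(P)
    points = (1.0 - t) * X[I] + t * Y[J]
    return points, P[I, J]
```

As noted above, if P has N nonzero entries on the diagonal the support keeps N points, while a generic coupling spreads mass over more traveling Diracs.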
which are both defined at grid points, thus forming arrays of R^{T×n₁×n₂}. The interpolation operator for the momentum averages the two staggered values adjacent to each grid point,
$$\mathcal{I}_J(J)_{k,i} \stackrel{\text{def.}}{=} \Big( \mathcal{I}\big( J^1_{k, i_1+1, i_2}, J^1_{k, i_1, i_2} \big),\; \mathcal{I}\big( J^2_{k, i_1, i_2+1}, J^2_{k, i_1, i_2} \big) \Big).$$
The simplest choice is to use the linear operator I(r, s) = (r + s)/2, which is the one we consider next. The discrete counterpart to (7.3) then reads
$$\min_{(a, J) \in \mathcal{C}(a_0, a_1)} \Theta\big( \mathcal{I}_a(a), \mathcal{I}_J(J) \big), \quad\text{where}\quad \Theta(\tilde a, \tilde J) \stackrel{\text{def.}}{=} \sum_{k=1}^{T} \sum_{i_1=1}^{n_1} \sum_{i_2=1}^{n_2} \theta(\tilde a_{k,i}, \tilde J_{k,i}), \qquad (7.11)$$
and where the constraint now reads
$$\mathcal{C}(a_0, a_1) \stackrel{\text{def.}}{=} \big\{ (a, J) : \partial_t a + \mathrm{div}(J) = 0,\; (a_{0,\cdot}, a_{T,\cdot}) = (a_0, a_1) \big\},$$
where a ∈ R^{(T+1)×n₁×n₂} and J = (J¹, J²), with J¹ ∈ R^{T×(n₁+1)×n₂} and J² ∈ R^{T×n₁×(n₂+1)}. Figure 7.3 shows an example of an evolution (α_t)_t approximated using this discretization scheme.
Remark 7.2 (Dynamic formulation on graphs). In the case where X is a graph and
c(x, y) = dX (x, y)2 is the squared geodesic distance, it is possible to derive faithful
discretization methods that use a discrete divergence associated to the graph structure
in place of the uniform grid discretization (7.10). In order to ensure that the heat
equation has a gradient flow structure (see §9.3 for more details about gradient flows)
for the corresponding dynamic Wasserstein distance, Maas [2011] and later Mielke
[2013] proposed to use a logarithmic mean I(r, s) (see also [Solomon et al., 2016b, Chow
et al., 2012, 2017b,a]).
These problems can be tackled using splitting schemes for convex programs of the form
$$\min_{x \in \mathcal{H}} F(x) + G(x), \qquad (7.12)$$
where H is some Euclidean space, and where F, G : H → R ∪ {+∞} are two closed convex functions for which one can “easily” (e.g. in closed form or using a rapidly converging scheme) compute the so-called proximal operator
$$\forall\, x \in \mathcal{H}, \quad \mathrm{Prox}_{\tau F}(x) \stackrel{\text{def.}}{=} \operatorname*{argmin}_{x' \in \mathcal{H}}\; \frac{1}{2}\, \|x - x'\|^2 + \tau F(x') \qquad (7.13)$$
for a parameter τ > 0. Note that this corresponds to the proximal map for the Euclidean metric and that this definition can be extended to more general Bregman divergences in place of ‖x − x′‖²; see (4.52) for an example using the KL divergence. The iterations of the DR algorithm define a sequence (x^{(ℓ)}, w^{(ℓ)}) ∈ H², using an initialization (x^{(0)}, w^{(0)}) ∈ H² and
$$w^{(\ell+1)} \stackrel{\text{def.}}{=} w^{(\ell)} + \alpha\big( \mathrm{Prox}_{\gamma F}(2x^{(\ell)} - w^{(\ell)}) - x^{(\ell)} \big),$$
$$x^{(\ell+1)} \stackrel{\text{def.}}{=} \mathrm{Prox}_{\gamma G}(w^{(\ell+1)}). \qquad (7.14)$$
108 Dynamic Formulations
If 0 < α < 2 and γ > 0, one can show that x^{(ℓ)} → z⋆, where z⋆ is a solution of (7.12); see [Combettes and Pesquet, 2007] for more details. This algorithm is closely related to
another popular method, the alternating direction method of multipliers [Gabay and
Mercier, 1976, Glowinski and Marroco, 1975] (see also [Boyd et al., 2011] for a review),
which can be recovered by applying DR on a dual problem; see [Papadakis et al., 2014]
for more details on the equivalence between the two, first shown by [Eckstein and
Bertsekas, 1992].
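The generic DR iterations (7.14) fit in a few lines; this is an illustrative sketch (our naming), parameterized by the two proximal maps.

```python
import numpy as np

def douglas_rachford(prox_F, prox_G, x0, gamma=1.0, alpha=1.0, n_iter=200):
    # DR iterations (7.14); convergence of x^(l) requires 0 < alpha < 2, gamma > 0.
    x = np.asarray(x0, dtype=float).copy()
    w = x.copy()
    for _ in range(n_iter):
        w = w + alpha * (prox_F(2.0 * x - w) - x)
        x = prox_G(w)
    return x
```

For instance, with F the indicator of a constraint set and G a quadratic, the iterates converge to the constrained minimizer.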
There are many ways to recast Problem (7.11) in the form (7.12), and we refer
to [Papadakis et al., 2014] for a detailed account of these approaches. A simple way to
achieve this is by setting x = (a, J, ã, J̃) and letting
$$F(x) \stackrel{\text{def.}}{=} \Theta(\tilde a, \tilde J) + \iota_{\mathcal{C}(a_0, a_1)}(a, J) \quad\text{and}\quad G(x) \stackrel{\text{def.}}{=} \iota_{\mathcal{D}}(a, J, \tilde a, \tilde J),$$
$$\text{where}\quad \mathcal{D} \stackrel{\text{def.}}{=} \big\{ (a, J, \tilde a, \tilde J) : \tilde a = \mathcal{I}_a(a),\; \tilde J = \mathcal{I}_J(J) \big\}.$$
The proximal operators of these two functions can be computed efficiently. Indeed, one has
$$\mathrm{Prox}_{\tau F}(x) = \big( \mathrm{Prox}_{\tau \Theta}(\tilde a, \tilde J),\; \mathrm{Proj}_{\mathcal{C}(a_0, a_1)}(a, J) \big).$$
The proximal operator Proxτ Θ is computed by solving a cubic polynomial equation at
each grid position. The orthogonal projection on the affine constraint C(a0 , a1 ) involves
the resolution of a Poisson equation, which can be achieved in O(N log(N )) operations
using the fast Fourier transform, where N = T n1 n2 is the number of grid points.
Lastly, the proximal operator Proxτ G is a linear projector, which requires the inversion
of a small linear system. We refer to Papadakis et al. [2014] for more details on these
computations. Figure 7.3 shows an example in which that method is used to compute
a dynamical interpolation inside a complicated planar domain. This class of proximal
methods for dynamical OT has also been used to solve related problems such as mean
field games [Benamou and Carlier, 2015].
Various generalizations of OT allowing for creation or destruction of mass have been proposed in the literature, for instance, using an L² cost [Piccoli and Rossi, 2014]. In order to avoid having to “teleport” mass (mass which travels at infinite speed and suddenly grows in a region where there was no mass before), the associated cost should be infinite. It turns out that this can be achieved in a simple convex way, by also allowing s_t to be an arbitrary measure (e.g. using a 1-homogeneous cost) and by penalizing s_t in the same way as the momentum J_t, leading to the energy
$$\Theta(\alpha, J, s) \stackrel{\text{def.}}{=} \int_0^1 \int_{\mathbb{R}^d} \Big( \theta\big( \alpha_t(x), J_t(x) \big) + \tau\, \theta\big( \alpha_t(x), s_t(x) \big) \Big)\, \mathrm{d}x\, \mathrm{d}t,$$
Figure 7.4: Comparison of Hellinger (first row), Wasserstein (row 2), partial optimal transport (row
3), and Wasserstein–Fisher–Rao (row 4) dynamic interpolations.
where the parameter should satisfy p ≥ 1 and s ∈ [1, p] in order for θ to be convex.
Note that this definition should be handled with care in the case 1 < s ≤ p because θ
does not have a linear growth at infinity, so that solutions to (7.3) must be constrained
to have a density with respect to the Lebesgue measure.
The case s = 1 corresponds to the classical OT problem and the optimal value
of (7.3) defines W p (α, β). In this case, θ is 1-homogeneous, so that solutions to (7.3)
can be arbitrary measures. The case (s = 1, p = 2) is the initial setup considered in (7.3)
to define W 2 .
for 1/q + 1/p = 1. In the limit (p = s, q) → (1, ∞), one recovers the W 1 norm. The
case s = p = 2 corresponds to the Sobolev H −1 (Rd ) Hilbert norm defined in (8.15).
OT over the space of paths. The dynamical version of classical OT (2.15), formulated over the space of paths, then reads
$$W_2^2(\alpha_0, \alpha_1) = \min_{\bar\pi \in \bar{\mathcal{U}}(\alpha_0, \alpha_1)} \int_{\bar{\mathcal{X}}} L(\gamma)^2\, \mathrm{d}\bar\pi(\gamma), \qquad (7.17)$$
where L(γ)² = ∫₀¹ |γ′(s)|² ds is the kinetic energy of a path s ∈ [0, 1] ↦ γ(s) ∈ X. The
connection between the optimal couplings π⋆ and π̄⋆ solving respectively (2.15) and (7.17) is that π̄⋆ only gives mass to geodesics joining pairs of points in proportion prescribed
by π⋆. In the particular case of discrete measures, this means that
$$\pi^\star = \sum_{i,j} \mathbf{P}_{i,j}\, \delta_{(x_i, y_j)} \quad\text{and}\quad \bar\pi^\star = \sum_{i,j} \mathbf{P}_{i,j}\, \delta_{\gamma_{x_i, y_j}},$$
where γ_{x_i, y_j} is the geodesic between x_i and y_j. Furthermore, the measures defined by the distribution of the curve points γ(t) at time t, where γ is drawn following π̄⋆, i.e.
$$t \in [0, 1] \mapsto \alpha_t \stackrel{\text{def.}}{=} (P_t)_\sharp\, \bar\pi^\star \quad\text{where}\quad P_t(\gamma) \stackrel{\text{def.}}{=} \gamma(t) \in \mathcal{X}, \qquad (7.18)$$
recover the displacement interpolation between α₀ and α₁.
Entropic OT over the space of paths. We now turn to the reinterpretation of entropic OT, defined in Chapter 4, using the space of paths. Similarly to (4.11), this is defined using a Kullback–Leibler projection, but this time of a reference measure K̄ over the space of paths, namely the distribution of a reversible Brownian motion (a Wiener process) having a uniform distribution at the initial and final times:
$$\bar\pi^\star_\varepsilon \stackrel{\text{def.}}{=} \operatorname*{argmin}_{\bar\pi \in \bar{\mathcal{U}}(\alpha_0, \alpha_1)} \mathrm{KL}(\bar\pi \,|\, \bar{\mathcal{K}}). \qquad (7.19)$$
We refer to the review paper by Léonard [2014] for an overview of this problem and a historical account of the work of Schrödinger [1931]. One can show that the (unique) solution π̄⋆_ε to (7.19) converges to a solution of (7.17) as ε → 0. Furthermore, this solution is linked to the solution of the static entropic OT problem (4.9) using Brownian bridges γ̄^ε_{x,y} ∈ X̄ (which are similar to fuzzy geodesics and converge to δ_{γ_{x,y}} as ε → 0). In the discrete setting, this means that
$$\pi^\star_\varepsilon = \sum_{i,j} \mathbf{P}^\star_{\varepsilon, i, j}\, \delta_{(x_i, y_j)} \quad\text{and}\quad \bar\pi^\star_\varepsilon = \sum_{i,j} \mathbf{P}^\star_{\varepsilon, i, j}\, \bar\gamma^\varepsilon_{x_i, y_j}, \qquad (7.20)$$
where P⋆_{ε,i,j} can be computed using Sinkhorn’s algorithm. Similarly to (7.18), one can then define an entropic interpolation as
$$\alpha_{\varepsilon, t} \stackrel{\text{def.}}{=} (P_t)_\sharp\, \bar\pi^\star_\varepsilon.$$
Since the law P_{t♯} γ̄^ε_{x,y} of the position at time t along a Brownian bridge is a Gaussian G_{t(1-t)ε²}(· − γ_{x,y}(t)) of variance t(1 − t)ε², centered at γ_{x,y}(t), one can deduce that α_{ε,t} is a Gaussian blurring of a set of traveling Diracs,
$$\alpha_{\varepsilon, t} = \sum_{i,j} \mathbf{P}^\star_{\varepsilon, i, j}\, G_{t(1-t)\varepsilon^2}\big( \cdot - \gamma_{x_i, y_j}(t) \big).$$
Figure 7.5: Samples from Brownian bridge paths associated to the Schrödinger entropic interpolation (7.20) over path space. Blue corresponds to t = 0 and red to t = 1.
8 Statistical Divergences

We study in this chapter the statistical properties of the Wasserstein distance. More
specifically, we compare it to other major distances and divergences routinely used
in data sciences. We quantify how one can approximate the distance between two
probability distributions when having only access to samples from said distributions.
To introduce these subjects, §8.1 and §8.2 review respectively divergences and integral
probability metrics between probability distributions. A divergence D typically satisfies
D(α, β) ≥ 0 and D(α, β) = 0 if and only if α = β, but it does not need to be symmetric
or satisfy the triangular inequality. An integral probability metric for measures is a
dual norm defined using a prescribed family of test functions. These quantities are
sound alternatives to Wasserstein distances and are routinely used as loss functions
to tackle inference problems, as will be covered in §9. We show first in §8.3 that the
optimal transport distance is not Hilbertian, i.e. one cannot approximate it efficiently
using a Hilbertian metric on a suitable feature representation of probability measures.
We show in §8.4 how to approximate D(α, β) from discrete samples (xi )i and (yj )j
drawn from α and β. A good statistical understanding of that problem is crucial when
using the Wasserstein distance in machine learning. Note that this section will be chiefly
concerned with the statistical approximation of optimal transport between distributions
supported on continuous sets. The very same problem when the ground space is finite
has received some attention in the literature following the work of Sommerfeld and
Munk [2018], extended to entropic regularized quantities by Bigot et al. [2017a].
8.1 ϕ-Divergences
Before detailing in the following section “weak” norms, whose construction shares sim-
ilarities with W 1 , let us detail a generic construction of so-called divergences between
measures, which can then be used as loss functions when estimating probability dis-
tributions. Such divergences compare two input measures by comparing their mass
pointwise, without introducing any notion of mass transportation. Divergences are func-
tionals which, by looking at the pointwise ratio between two measures, give a sense of
how close they are. They have nice analytical and computational properties and build
upon entropy functions.
Definition 8.1 (Entropy function). A function ϕ : R → R ∪ {∞} is an entropy function if
it is lower semicontinuous, convex, dom ϕ ⊂ [0, ∞[, and satisfies the following feasibility
condition: dom ϕ ∩ ]0, ∞[ ≠ ∅. The speed of growth of ϕ at ∞ is described by

ϕ′∞ = lim_{x→+∞} ϕ(x)/x ∈ R ∪ {∞}.
If ϕ′∞ = ∞, then ϕ grows faster than any linear function and ϕ is said to be superlinear.
Any entropy function ϕ induces a ϕ-divergence (also known as a Csiszár divergence [Csiszár,
1967, Ali and Silvey, 1966] or f-divergence) as follows.
Definition 8.2 (ϕ-Divergences). Let ϕ be an entropy function. For α, β ∈ M(X ), let
(dα/dβ) β + α⊥ be the Lebesgue decomposition¹ of α with respect to β. The divergence Dϕ
is defined by

Dϕ(α|β) def.= ∫X ϕ(dα/dβ) dβ + ϕ′∞ α⊥(X )    (8.1)

if α, β are nonnegative and ∞ otherwise.
The additional term ϕ′∞ α⊥(X ) in (8.1) is important to ensure that Dϕ defines a
continuous functional (for the weak topology of measures) even if ϕ has a linear growth
at infinity, as this is, for instance, the case for the absolute value (8.8) defining the TV
norm. If ϕ has superlinear growth, e.g. the usual entropy (8.4), then ϕ′∞ = +∞, so
that Dϕ(α|β) = +∞ if α does not have a density with respect to β.
In the discrete setting, assuming

α = Σi ai δxi and β = Σi bi δxi    (8.2)

are supported on the same set of n points (xi)ni=1 ⊂ X , (8.1) defines a divergence on
Σn,

Dϕ(a|b) = Σ_{i∈Supp(b)} ϕ(ai/bi) bi + ϕ′∞ Σ_{i∉Supp(b)} ai,    (8.3)
¹The Lebesgue decomposition theorem asserts that, given β, α admits a unique decomposition as
the sum of two measures αs + α⊥ such that αs is absolutely continuous with respect to β and α⊥ and
β are singular.
where Supp(b) def.= {i ∈ ⟦n⟧ : bi ≠ 0}.
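The discrete expression (8.3) is direct to implement. The following minimal sketch (function names are ours, not from the text) evaluates Dϕ for the KL entropy function (8.4); the ϕ′∞ term accounts for mass of a outside the support of b, and is +∞ for the superlinear ϕKL.

```python
import math

def phi_kl(s):
    # Shannon-Boltzmann entropy function (8.4)
    if s > 0:
        return s * math.log(s) - s + 1.0
    if s == 0:
        return 1.0
    return math.inf

def phi_divergence(a, b, phi, phi_inf):
    # discrete phi-divergence (8.3); phi_inf is the asymptotic slope phi'_inf
    val = 0.0
    for ai, bi in zip(a, b):
        if bi > 0:
            val += phi(ai / bi) * bi
        elif ai > 0:
            val += phi_inf * ai   # mass of a outside Supp(b)
    return val

# KL is superlinear, so phi'_inf = +inf
a, b = [0.5, 0.5], [0.25, 0.75]
kl = phi_divergence(a, b, phi_kl, math.inf)
```

On these weights the sum reduces to Σi ai log(ai/bi), and the divergence is +∞ as soon as a puts mass where b does not.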
The proof of the following proposition can be found in [Liero et al., 2018, Thm 2.7].
Figure 8.1: Examples of entropy functionals ϕ (KL, TV, Hellinger, χ²).
Remark 8.1 (Dual expression). A ϕ-divergence can be expressed using the Legendre
transform

ϕ∗(s) def.= sup_{t∈R} st − ϕ(t).
We now review a few popular instances of this framework. Figure 8.1 displays the
associated entropy functionals, while Figure 8.2 reviews the relationship between them.
Example 8.1 (Kullback–Leibler divergence). The Kullback–Leibler divergence KL def.= DϕKL,
also known as the relative entropy, was already introduced in (4.10) and (4.6).
It is the divergence associated to the Shannon–Boltzmann entropy function ϕKL, given by

ϕKL(s) = s log(s) − s + 1 for s > 0, 1 for s = 0, and +∞ otherwise.    (8.4)
Figure 8.2: Diagram of relationships between divergences (inspired by Gibbs and Su [2002]),
displaying bounds such as W1 ≤ dmax TV and TV ≤ W1/dmin. For X a metric space with
ground distance d, dmax def.= sup(x,x′) d(x, x′) is the diameter of X . When X is discrete,
dmin def.= minx≠x′ d(x, x′).
Remark 8.1 (Bregman divergence). The discrete KL divergence, KL def.= DϕKL, has the
unique property of being both a ϕ-divergence and a Bregman divergence. For discrete
vectors in Rn, the Bregman divergence [Bregman, 1967] associated to a smooth strictly
convex function ψ : Rn → R is defined as

Bψ(a|b) def.= ψ(a) − ψ(b) − ⟨∇ψ(b), a − b⟩,    (8.5)

where ⟨·, ·⟩ is the canonical inner product on Rn. Note that Bψ(a|b) is a convex function
of a and a linear function of ψ. Similarly to ϕ-divergences, a Bregman divergence satisfies
Bψ(a|b) ≥ 0 and Bψ(a|b) = 0 if and only if a = b. The KL divergence is the Bregman
divergence for minus the entropy ψ = −H defined in (4.1), i.e. KL = B−H. A Bregman
divergence is locally a squared Euclidean distance, since a Taylor expansion of ψ around b
shows that Bψ(a|b) = ⟨∇²ψ(b)(a − b), a − b⟩/2 + o(‖a − b‖²).
For two univariate Gaussians α = N(mα, σα²) and β = N(mβ, σβ²), the KL divergence
has the closed form

KL(α|β) = (1/2) ( σα²/σβ² + log(σβ²/σα²) + |mα − mβ|²/σβ² − 1 ).    (8.6)
This expression shows that the divergence between α and β diverges to infinity as σβ
diminishes to 0 and β becomes a Dirac mass. In that sense, one can say that singular
Gaussians are infinitely far from all other Gaussians in the KL geometry. That geometry
is thus useful when one wants to avoid dealing with singular covariances. To simplify
the analysis, one can look at the infinitesimal geometry of KL, which is obtained by
performing a Taylor expansion at order 2,
KL(N(m + δm, (σ + δσ)²) | N(m, σ²)) = (1/σ²) ( δm²/2 + δσ² ) + o(δm², δσ²).

This local Riemannian metric, the so-called Fisher metric, expressed over (m/√2, σ) ∈
R × R+,∗, matches exactly that of the hyperbolic Poincaré half plane. Geodesics over
this space are half circles centered along the σ = 0 line and have an exponential speed,
i.e. they only reach the limit σ = 0 after an infinite time. Note in particular that if
σα = σβ but mα ≠ mβ, then the geodesic between (α, β) over this hyperbolic half plane
does not have a constant standard deviation.
The KL hyperbolic geometry over the space of Gaussian parameters (m, σ) should be
contrasted with the Euclidean geometry associated to OT as described in Remark 2.31,
since in the univariate case

W2(α, β)² = (mα − mβ)² + (σα − σβ)².    (8.7)
Figure 8.3 shows a visual comparison of these two geometries and their respective
geodesics. This interesting comparison was suggested to us by Jean Feydy.
Example 8.2 (Total variation). The total variation distance TV def.= DϕTV is the
divergence associated to

ϕTV(s) = |s − 1| for s ≥ 0, and +∞ otherwise.    (8.8)

It actually defines a norm on the full space of measures M(X ), where

TV(α|β) = ‖α − β‖TV, where ‖α‖TV = |α|(X ) = ∫X d|α|(x).    (8.9)
Figure 8.3: Comparisons of interpolation between Gaussians using KL (hyperbolic) and OT (Eu-
clidean) geometries.
Remark 8.2 (Strong vs. weak topology). The total variation norm (8.9) defines the so-
called “strong” topology on the space of measures. On a compact domain X of radius
R, one has

W1(α, β) ≤ R ‖α − β‖TV,

so that this strong notion of convergence implies the weak convergence metrized by
Wasserstein distances. The converse is, however, not true, since δx does not converge
strongly to δy as x → y (note that ‖δx − δy‖TV = 2 if x ≠ y). A chief advantage is that
M1+ (X ) (once again on a compact ground space X ) is compact for the weak topology,
so that from any sequence of probability measures (αk )k , one can always extract a con-
verging subsequence, which makes it a suitable space for several optimization problems,
such as those considered in Chapter 9.
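This contrast between the strong and weak topologies can be checked numerically for discrete measures on the real line, where the TV norm sums the absolute mass differences over the union of supports, while W1 is the L¹ distance between cumulative distribution functions (see §2.6). A minimal pure-Python sketch (helper names are ours):

```python
def tv_discrete(xs, a, ys, b):
    # ||alpha - beta||_TV: sum of |mass differences| over the union of supports
    diff = {}
    for x, w in zip(xs, a):
        diff[x] = diff.get(x, 0.0) + w
    for y, w in zip(ys, b):
        diff[y] = diff.get(y, 0.0) - w
    return sum(abs(v) for v in diff.values())

def w1_discrete(xs, a, ys, b):
    # W1 on the real line: integral of |F_alpha - F_beta| between support points
    events = sorted([(x, w) for x, w in zip(xs, a)]
                    + [(y, -w) for y, w in zip(ys, b)])
    w1, fdiff, prev = 0.0, 0.0, None
    for pos, w in events:
        if prev is not None:
            w1 += abs(fdiff) * (pos - prev)
        fdiff += w
        prev = pos
    return w1

# delta_0 vs delta_t: TV equals 2 for every t > 0, while W1 = t -> 0
tv_val = tv_discrete([0.0], [1.0], [0.001], [1.0])
w1_val = w1_discrete([0.0], [1.0], [0.001], [1.0])
```

As t → 0 the TV distance stays equal to 2 while W1 vanishes, which is exactly the failure of TV to metrize weak convergence discussed above.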
Example 8.3 (Hellinger). The Hellinger distance h def.= DϕH^{1/2} is the square root of the
divergence associated to

ϕH(s) = |√s − 1|² for s ≥ 0, and +∞ otherwise.

Example 8.4 (χ² divergence). The chi-squared divergence χ² def.= Dϕχ² is associated to the
entropy function ϕχ²(s) = (s − 1)² for s ≥ 0, and +∞ otherwise. If (α, β) are discrete as
in (8.2) and have the same support, then

χ²(α|β) = Σi (ai − bi)²/bi.
8.2 Integral Probability Metrics

Formulation (6.3) is a special case of a dual norm. A dual norm is a convenient way to
design “weak” norms that can deal with arbitrary measures. For a symmetric convex
set B of measurable functions, one defines

‖α‖B def.= max { ∫X f(x) dα(x) : f ∈ B }.    (8.10)
These dual norms are often called “integral probability metrics”; see [Sriperumbudur
et al., 2012].
Example 8.6 (Total variation). The total variation norm (Example 8.2) is a dual norm
associated to the whole space of continuous functions
B = {f ∈ C(X ) : kf k∞ ≤ 1} .
The total variation distance is the only nontrivial divergence that is also a dual norm;
see [Sriperumbudur et al., 2009].
Remark 8.3 (Metrizing the weak convergence). By using smaller “balls” B, which typ-
ically only contain continuous (and sometimes regular) functions, one defines weaker
dual norms. In order for k·kB to metrize the weak convergence (see Definition 2.2), it is
sufficient for the space spanned by B to be dense in the set of continuous functions for
the sup-norm k·k∞ (i.e. for the topology of uniform convergence); see [Ambrosio et al.,
2006, para. 5.1].
Figure 8.4 displays a comparison of several such dual norms, which we now detail.
Figure 8.4: Comparison of several dual norms (Energy, Gauss, W1, Flat) for
(α, β) = (δ0, δt) (left) and (α, β) = (δ0, ½(δ−t/2 + δt/2)) (right).
If the set B is bounded, then ‖·‖B is a norm on the whole space M(X ) of measures.
This is not the case of W1, which is only defined for α such that ∫X dα = 0 (otherwise
‖α‖B = +∞). This can be alleviated by imposing a bound on the value of the potential
f, in order to define for instance the flat norm.
Example 8.7 (W1 norm). W1, as defined in (6.3), is a special case of dual norm (8.10),
using

B = {f : Lip(f) ≤ 1},

the set of 1-Lipschitz functions.
Example 8.8 (Flat norm and Dudley metric). The flat norm is defined using

B = {f : Lip(f) ≤ 1 and ‖f‖∞ ≤ 1}.

It metrizes the weak convergence on the whole space M(X ). Formula (6.2) is extended
to compute the flat norm by adding the constraint |fk| ≤ 1. The flat norm is sometimes
called the “Kantorovich–Rubinstein” norm [Hanin, 1992] and has been used as a fidelity
term for inverse problems in imaging [Lellmann et al., 2014]. The flat norm is similar
to the Dudley metric, which uses

B = {f : ‖∇f‖∞ + ‖f‖∞ ≤ 1}.
The kernel is said to be conditionally positive if positivity only holds in (8.12) for zero
mean vectors r (i.e. such that hr, 1n i = 0).
If k is conditionally positive, one defines the following norm:

‖α‖k² def.= ∫X×X k(x, y) dα(x) dα(y).    (8.13)
These norms are often referred to as “maximum mean discrepancy” (MMD) (see [Gret-
ton et al., 2007]) and have also been called “kernel norms” in shape analysis [Glaunes
et al., 2004]. This expression (8.13) can be rephrased, introducing two independent
random vectors (X, X′) on X distributed with law α, as

‖α‖k² = E_{X,X′}(k(X, X′)).
One can show that k·k2k is the dual norm in the sense of (8.10) associated to the unit
ball B of the RKHS associated to k. We refer to [Berlinet and Thomas-Agnan, 2003,
Hofmann et al., 2008, Schölkopf and Smola, 2002] for more details on RKHS functional
spaces.
Remark 8.4 (Universal kernels). According to Remark 8.3, the MMD norm ‖·‖k metrizes
the weak convergence if the span of the dual ball B is dense in the space of continuous
functions C(X ). This means that finite sums of the form Σni=1 ai k(xi, ·) (for arbitrary
choice of n and points (xi)i) are dense in C(X ) for the uniform norm ‖·‖∞. For
translation-invariant kernels over X = Rd, k(x, y) = k0(x − y), this is equivalent to k0
having a nonvanishing Fourier transform, k̂0(ω) > 0.
In the special case where α is a discrete measure of the form (2.3), one thus has the
simple expression

‖α‖k² = Σni=1 Σni′=1 ai ai′ ki,i′ = ⟨ka, a⟩ where ki,i′ def.= k(xi, xi′).
In particular, when α = Σni=1 ai δxi and β = Σni=1 bi δxi are supported on the same set
of points, ‖α − β‖k² = ⟨k(a − b), a − b⟩, so that ‖·‖k is a Euclidean norm (proper if
k is positive definite, degenerate otherwise if k is semidefinite) on the simplex Σn. To
compute the discrepancy between two discrete measures of the form (2.3), one can use

‖α − β‖k² = Σi,i′ ai ai′ k(xi, xi′) + Σj,j′ bj bj′ k(yj, yj′) − 2 Σi,j ai bj k(xi, yj).    (8.14)
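Formula (8.14) reduces the computation of an MMD between discrete measures to three kernel-matrix quadratic forms. A minimal NumPy sketch (assuming a Gaussian kernel; function names are ours):

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), positive definite on R^d
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(X, a, Y, b):
    # squared kernel norm ||alpha - beta||_k^2, following (8.14)
    return (a @ gaussian_kernel(X, X) @ a
            + b @ gaussian_kernel(Y, Y) @ b
            - 2.0 * a @ gaussian_kernel(X, Y) @ b)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
Y = rng.normal(size=(30, 2)) + 2.0        # second cloud, shifted
a = np.full(30, 1.0 / 30)
b = np.full(30, 1.0 / 30)
```

The quantity is symmetric in (α, β), vanishes when the two measures coincide, and is positive here because the Gaussian kernel is positive definite.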
Example 8.9 (Gaussian RKHS). One of the most popular kernels is the Gaussian one
k(x, y) = e^{−‖x−y‖²/(2σ²)}, which is a positive universal kernel on X = Rd. An attractive
feature of the Gaussian kernel is that it is separable as a product of 1-D kernels,
which facilitates computations when working on regular grids (see also Remark 4.17).
However, an important issue that arises when using the Gaussian kernel is that one
needs to select the bandwidth parameter σ. This bandwidth should match the “typical
scale” between observations in the measures to be compared. If the measures have
multiscale features (some regions may be very dense, others very sparsely populated),
a Gaussian kernel is thus not well adapted, and one should consider a “scale-free” kernel
as we detail next. An issue with such scale-free kernels is that they are global (have
slow polynomial decay), which makes them typically computationally more expensive,
since no compact support approximation is possible. Figure 8.5 shows a comparison
between several kernels.
Figure 8.5: Top row: display of ψ such that ‖α − β‖k = ‖ψ ⋆ (α − β)‖L²(R²), formally
defined over Fourier as ψ̂(ω) = √(k̂0(ω)), where k(x, x′) = k0(x − x′). Bottom row: display
of ψ ⋆ (α − β). (G, σ) stands for the Gaussian kernel of variance σ². The kernel for
ED(R², ‖·‖) is ψ(x) = 1/√‖x‖.
Example 8.10 (H−1(Rd)). Another important dual norm is H−1(Rd), the dual (over
distributions) of the Sobolev space H1(Rd) of functions having derivatives in L2(Rd).
It is defined using the primal RKHS norm ‖∇f‖²L²(Rd). It is not defined for singular
measures (e.g. Diracs) unless d = 1, because functions in the Sobolev space H1(Rd) are
in general not continuous. This H−1 norm (defined on the space of zero mean measures
with densities) can also be formulated in divergence form,

‖α − β‖²H−1(Rd) = min_s { ∫Rd ‖s(x)‖2² dx : div(s) = α − β },    (8.15)
which should be contrasted with (6.4), where an L1 norm of the vector field s was used
in place of the L2 norm used here. The “weighted” version of this Sobolev dual norm is

‖ρ‖²H−1(α) = min_{div(s)=ρ} ∫Rd ‖s(x)‖2² dα(x);

see [Santambrogio, 2015, Theo. 5.34], and see [Peyre, 2011] for sharp constants.
Example 8.11 (Negative Sobolev spaces). One can generalize this construction by
considering the Sobolev space H−r(Rd) of arbitrary negative index, which is the dual of
the functional Sobolev space Hr(Rd) of functions having r derivatives (in the sense of
distributions) in L2(Rd). In order to metrize the weak convergence, one needs functions
in Hr(Rd) to be continuous, which is the case when r > d/2. As the dimension d
increases, one thus needs to consider higher regularity. For arbitrary (not necessarily
integer) r, these spaces are defined using the Fourier transform, and for a measure α
with Fourier transform α̂(ω) (written here as a density with respect to the Lebesgue
measure dω),

‖α‖²H−r(Rd) def.= ∫Rd ‖ω‖−2r |α̂(ω)|² dω.

This corresponds to a dual RKHS norm with a convolutive kernel k(x, y) = k0(x − y)
with k̂0(ω) = ‖ω‖−2r. Taking the inverse Fourier transform, one sees that (up to a
multiplicative constant) k0 is a power of the norm, k0(z) ∝ ‖z‖2r−d.
A chief advantage of the energy distance over more usual kernels such as the Gaussian
(Example 8.9) is that it is scale-free and does not depend on a bandwidth parameter
σ. More precisely, one has the following scaling behavior on X = Rd, when denoting
fs(x) = sx the dilation by a factor s > 0,

‖fs♯(α − β)‖ED(Rd,‖·‖p) = s^{p/2} ‖α − β‖ED(Rd,‖·‖p),

while the Wasserstein distance exhibits a perfect linear scaling,

Wp(fs♯ α, fs♯ β) = s Wp(α, β).

Note, however, that for the energy distance, the parameter p must satisfy 0 < p < 2,
and that for p = 2, it degenerates to the distance between the means,

‖α − β‖ED(Rd,‖·‖2) = ‖ ∫Rd x (dα(x) − dβ(x)) ‖,

so it is not a norm anymore. This shows that it is not possible to get the same linear
scaling under fs♯ with the energy distance as for the Wasserstein distance.
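The s^{p/2} scaling of the energy distance can be verified numerically by plugging the conditionally positive kernel k(x, y) = −‖x − y‖^p into (8.14); a NumPy sketch (names are ours):

```python
import numpy as np

def energy_dist2(X, a, Y, b, p=1.5):
    # (8.14) with the conditionally positive kernel k(x, y) = -||x - y||^p, 0 < p < 2
    def K(U, V):
        return -np.linalg.norm(U[:, None, :] - V[None, :, :], axis=-1) ** p
    return a @ K(X, X) @ a + b @ K(Y, Y) @ b - 2.0 * a @ K(X, Y) @ b

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
Y = rng.normal(size=(20, 3)) + 1.0
a = np.full(20, 1.0 / 20)
b = np.full(20, 1.0 / 20)

s, p = 3.0, 1.5
ratio = np.sqrt(energy_dist2(s * X, a, s * Y, b, p) / energy_dist2(X, a, Y, b, p))
# dilating both measures by s rescales the energy distance by exactly s^(p/2)
```

Every kernel entry picks up a factor s^p under the dilation, so the squared distance scales by s^p and the distance by s^{p/2}, in contrast with the linear scaling of Wp.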
8.3 Wasserstein Spaces Are Not Hilbertian

Some of the special cases of the Wasserstein geometry outlined earlier in §2.6 have
highlighted the fact that the optimal transport distance can sometimes be computed
in closed form. They also illustrate that in such cases the optimal transport distance is
a Hilbertian metric between probability measures, in the sense that there exists a map
φ from the space of input measures onto a Hilbert space, as defined below.
Definition 8.4. A distance d defined on a set Z × Z is said to be Hilbertian if there
exists a Hilbert space H and a mapping φ : Z → H such that for any pair z, z′ in Z we
have that d(z, z′) = ‖φ(z) − φ(z′)‖H.
For instance, Remark 2.30 shows that the Wasserstein metric is a Hilbert norm
between univariate distributions, simply by defining φ to be the map that associates
to a measure its generalized quantile function. Remark 2.31 shows that for univariate
Gaussians, as written in (8.7) in this chapter, the Wasserstein distance between two
univariate Gaussians is simply the Euclidean distance between their mean and standard
deviation.
Hilbertian distances have many favorable properties when used in a data analysis
context [Dattorro, 2017]. First, they can be easily cast as radial basis function kernels:
for any Hilbertian distance d, it is indeed known that e^{−d^p/t} is a positive definite kernel
for any value 0 ≤ p ≤ 2 and any positive scalar t, as shown in [Berg et al., 1984,
Cor. 3.3.3, Prop. 3.2.7]. The Gaussian (p = 2) and Laplace (p = 1) kernels are simple
applications of that result using the usual Euclidean distance. The entire field of kernel
methods [Hofmann et al., 2008] builds upon the positive definiteness of a kernel function
to define convex learning algorithms operating on positive definite kernel matrices.
Points living in a Hilbertian space can also be efficiently embedded in lower dimensions
with low distortion factors [Johnson and Lindenstrauss, 1984], [Barvinok, 2002, §V.6.2]
using simple methods such as multidimensional scaling [Borg and Groenen, 2005].
Because Hilbertian distances have such properties, one might hope that the Wasser-
stein distance remains Hilbertian in more general settings than those outlined above,
notably when the dimension of X is 2 and more. This can be disproved using the
following equivalence.
can be expanded, taking advantage of the fact that Σi ri = 0, to
−2 Σi,j ri rj ⟨φ(zi), φ(zj)⟩H = −2 ‖Σi ri φ(zi)‖²H,
which is nonpositive by definition of a Hilbert dot product. If, on the contrary, d² is negative
definite, then the fact that d is Hilbertian proceeds from a key result by Schoenberg
[1938] outlined in [Berg et al., 1984, p. 82, Prop. 3.2].
It is therefore sufficient to show that the squared Wasserstein distance is not negative
definite to show that it is not Hilbertian, as stated in the following proposition.
Proposition 8.2. If X = Rd with d ≥ 2 and the ground cost is set to d(x, y) = ‖x − y‖2,
then the p-Wasserstein distance is not Hilbertian for p = 1, 2.
Proof. It suffices to prove the result for d = 2, since any counterexample in that
dimension suffices to obtain a counterexample in any higher dimension. We provide a
nonrandom counterexample which works using measures supported on four vectors
x1, x2, x3, x4 ∈ R² defined as follows: x1 = [0, 0], x2 = [1, 0], x3 = [0, 1], x4 = [1, 1]. We
now consider all histograms on these four points whose coordinates lie on the regular grid
{0, ¼, ½, ¾, 1} and are such that Σ4j=1 aij = 1. For a given p, the 35 × 35 pairwise
Wasserstein distance matrix Dp between these histograms can be computed. Dp is not
negative definite if and only if its elementwise square D²p is such that JD²pJ has positive
eigenvalues, where J is the centering matrix J = In − (1/n) 1n,n, which is the case as
illustrated in Figure 8.6.
Figure 8.6: One can show that a distance is not Hilbertian by looking at the spectrum of the centered
matrix JD2p J corresponding to the pairwise squared-distance matrix D2p of a set of points. The spectrum
of such a matrix is necessarily non-positive if the distance is Hilbertian. Here we plot the values of the
maximal eigenvalue of that matrix for points selected in the proof of Proposition 8.2. We do so for
varying values of p, and display the maximal eigenvalues we obtain. These eigenvalues are all positive,
which shows that for all these values of p, the p-Wasserstein distance is not Hilbertian.
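The eigenvalue test used in this proof applies to any candidate distance matrix: by Schoenberg's characterization, a Hilbertian distance matrix must have JD²J negative semidefinite. A NumPy sketch of the test (names ours), sanity-checked on Euclidean distances, which are Hilbertian; to reproduce Figure 8.6 one would instead plug in the 35 × 35 matrix Dp computed by an exact OT solver:

```python
import numpy as np

def max_centered_eigenvalue(D):
    # Schoenberg: D comes from points in a Hilbert space only if J D^2 J
    # (elementwise square) is negative semidefinite, J = I - (1/n) 1 1^T
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return float(np.linalg.eigvalsh(J @ (D ** 2) @ J).max())

# sanity check on a Hilbertian case: Euclidean distances of 35 random points
rng = np.random.default_rng(0)
Z = rng.normal(size=(35, 2))
D_euc = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
lam = max_centered_eigenvalue(D_euc)
# lam <= 0 up to numerical error; a strictly positive value would certify
# that the distance is not Hilbertian, as in Figure 8.6
```

This is the classical multidimensional-scaling criterion: −½ JD²J is the centered Gram matrix of any Euclidean realization, so it must be positive semidefinite when one exists.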
We show later in §10.4 that the sliced approximation to Wasserstein distances, essen-
tially a sum of 1-D directional transportation distance computed on random push-
forwards of measures projected on lines, is negative definite as the sum of negative
definite functions [Berg et al., 1984, §3.1.11]. This result can be used to define a pos-
itive definite kernel [Kolouri et al., 2016]. Another way to recover a positive definite
kernel is to cast the optimal transport problem as a soft-min problem (over all possible
transportation tables) rather than a minimum, as proposed by Kosowsky and Yuille
[1994] to introduce entropic regularization. That soft-min defines a term whose neg-
exponential (also known as a generating function) is positive definite [Cuturi, 2012].
8.4 Empirical Estimators for OT, MMD and ϕ-divergences

In an applied setting, given two input measures (α, β) ∈ M1+(X )², an important
statistical problem is to approximate the (usually unknown) divergence D(α, β) using
only samples (xi)ni=1 from α and (yj)mj=1 from β. These samples are assumed to be
independently identically distributed from their respective distributions.
For both Wasserstein distances Wp (see 2.18) and MMD norms (see §8.2), a straightforward
estimator of the unknown distance between distributions is to compute it directly
between the empirical measures α̂n def.= (1/n) Σi δxi and β̂m def.= (1/m) Σj δyj, hoping
ideally that one can control the resulting approximation error.
Note that here both α̂n and β̂m are random measures, so D(α̂n , β̂m ) is a random
number. For simplicity, we assume that X is compact (handling unbounded domains
requires extra constraints on the moments of the input measures).
For such a dual distance that metrizes the weak convergence (see Definition 2.2),
since there is the weak convergence α̂n → α, one has D(α̂n , β̂n ) → D(α, β) as n → +∞.
But an important question is the speed of convergence of D(α̂n , β̂n ) toward D(α, β),
and this rate is often called the “sample complexity” of D.
Note that for D = ‖·‖TV, since the TV norm does not metrize the weak
convergence, ‖α̂n − β̂n‖TV is not a consistent estimator, namely it does not converge
toward ‖α − β‖TV. Indeed, with probability 1, ‖α̂n − β̂n‖TV = 2, since the supports of the
two discrete measures do not overlap. Similar issues arise with other ϕ-divergences,
which cannot be estimated using divergences between empirical distributions.
Rates for OT. For X = Rd and measures supported on a bounded domain, it is shown
by [Dudley, 1969] that for d > 2 and 1 ≤ p < +∞,

E(| Wp(α̂n, β̂n) − Wp(α, β) |) = O(n−1/d),
where the expectation E is taken with respect to the random samples (xi , yi )i . This
rate is tight in Rd if one of the two measures has a density with respect to the Lebesgue
measure. This result was proved for general metric spaces [Dudley, 1969] using the
notion of covering numbers and was later refined, in particular for X = Rd in [Dereich
et al., 2013, Fournier and Guillin, 2015]. This rate can be refined when the measures are
supported on low-dimensional subdomains: Weed and Bach [2017] show that, indeed,
the rate depends on the intrinsic dimensionality of the support. Weed and Bach also
study the nonasymptotic behavior of that convergence, such as for measures which are
discretely approximated (e.g. mixture of Gaussians with small variances). It is also
possible to prove concentration of W p (α̂n , β̂n ) around its mean W p (α, β); see [Bolley
et al., 2007, Boissard, 2011, Weed and Bach, 2017].
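The decay of the estimation error can be observed even in dimension 1, where W1 between two empirical measures with the same number of samples is simply the average gap between sorted samples (the monotone coupling is optimal on the line). A pure-Python Monte Carlo sketch (names ours), with α = β so that the computed value is pure estimation error:

```python
import random

def w1_empirical(xs, ys):
    # 1-D W1 between empirical measures with n samples each: sorted matching
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

def mean_w1(n, trials=50, seed=0):
    # Monte Carlo estimate of E W1(alpha_n, beta_n) for alpha = beta = U([0, 1]);
    # the population distance is 0, so this is pure estimation error
    rng = random.Random(seed)
    tot = 0.0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        ys = [rng.random() for _ in range(n)]
        tot += w1_empirical(xs, ys)
    return tot / trials

err_small, err_big = mean_w1(10), mean_w1(640)
# the error shrinks as n grows (at the 1-D rate n^{-1/2})
```

Running the same experiment in higher dimension (with an LP-based OT solver) would exhibit the much slower n^{−1/d} rate stated above.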
Rates for MMD. For weak norms ‖·‖k which are dual of RKHS norms (also called
MMD), as defined in (8.13), and contrary to Wasserstein distances, the sample
complexity does not depend on the ambient dimension:

E(| ‖α̂n − β̂n‖k − ‖α − β‖k |) = O(n−1/2);
see [Sriperumbudur et al., 2012]. Figure 8.7 shows a numerical comparison of the sample
complexity rates for Wasserstein and MMD distances. Note, however, that the plug-in
estimator ‖α̂n − β̂n‖k² is a biased estimator of ‖α − β‖k²; an unbiased estimator can be
obtained by discarding the diagonal terms i = i′ and j = j′ in (8.14).
where σ should be adapted to the number n of samples and to the dimension d. It is also
possible to devise nonparametric estimators, bypassing the choice of a fixed bandwidth
σ to select instead a number k of nearest neighbors. These methods typically make use
of the distance between nearest neighbors [Loftsgaarden and Quesenberry, 1965], which
is similar to locally adapting the bandwidth σ to the local sampling density. Denoting
∆k(x) the distance between x ∈ Rd and its kth nearest neighbor among the (xi)ni=1, a
density estimator is defined as

ρkα̂n(x) def.= (k/n) / ( |Bd| ∆k(x)^d ),    (8.21)
where |Bd| is the volume of the unit ball in Rd. Instead of somehow “counting” the
number of samples falling in an area of width σ in (8.20), this formula (8.21) estimates
the radius required to encapsulate k samples. Figure 8.8 compares the estimators (8.20)
and (8.21). A typical example of application is detailed in (4.1) for the entropy func-
tional, which is the KL divergence with respect to the Lebesgue measure. We refer
to [Moon and Hero, 2014] for more details.
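A minimal sketch of the k-nearest-neighbor estimator (8.21) in dimension d = 1, where |B1| = 2 (names ours):

```python
import random

def knn_density(x, samples, k):
    # (8.21) in dimension d = 1: rho(x) = (k/n) / (|B_1| Delta_k(x)), |B_1| = 2
    n = len(samples)
    delta_k = sorted(abs(x - s) for s in samples)[k - 1]
    return (k / n) / (2.0 * delta_k)

rng = random.Random(0)
samples = [rng.random() for _ in range(4000)]   # U([0, 1]), true density is 1
est_mid = knn_density(0.5, samples, k=100)      # should be close to 1
est_far = knn_density(2.0, samples, k=100)      # far from the support: small
```

Near the middle of the support, ∆100(0.5) ≈ 50/4000, so the estimate is close to the true density 1; far from the support the estimate decays, since the radius needed to capture k samples grows.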
Figure 8.8: Comparison of kernel density estimation α̂n ⋆ hσ (top, using a Gaussian kernel h) and
k-nearest neighbors estimation ρkα̂n (bottom) for n = 200 samples from a mixture of two Gaussians,
for k = 1, 50 and 100.
8.5 Entropic Regularization: Between OT and MMD

where P⋆ is the solution of (4.2), while (f⋆, g⋆) are solutions of (4.30). Assuming Ci,j =
d(xi, yj)p for some distance d on X , for two discrete probability distributions of the
form (2.3), this defines a regularized Wasserstein cost

W̃p,ε(α, β)p def.= 2 Wp,ε(α, β)p − Wp,ε(α, α)p − Wp,ε(β, β)p.
It is proved in [Feydy et al., 2019] that if e^{−c/ε} is a positive kernel, then a related
corrected divergence (obtained by using LεC in place of PεC) is positive. Note that it is
possible to define other renormalization schemes using regularized optimal transport,
as proposed, for instance, by Amari et al. [2018].
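A minimal NumPy sketch of this debiasing, using Sinkhorn iterations for the entropic cost ⟨P, C⟩ with C the squared distance matrix (function names and the cost-only debiasing convention are ours; the text's W̃p,ε is defined through the regularized Wasserstein cost):

```python
import numpy as np

def sinkhorn_cost(a, x, b, y, eps=0.1, iters=2000):
    # entropic OT cost <P, C> between sum_i a_i delta_{x_i} and sum_j b_j delta_{y_j}
    # on the line, with C_ij = |x_i - y_j|^2, via Sinkhorn's iterations
    C = (x[:, None] - y[None, :]) ** 2
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]
    return float((P * C).sum())

def corrected_divergence(a, x, b, y, eps=0.1):
    # debiased quantity in the spirit of the renormalization above:
    # 2 W(alpha, beta) - W(alpha, alpha) - W(beta, beta)
    return (2.0 * sinkhorn_cost(a, x, b, y, eps)
            - sinkhorn_cost(a, x, a, x, eps)
            - sinkhorn_cost(b, y, b, y, eps))

x = np.array([0.0, 1.0]); a = np.array([0.5, 0.5])
y = np.array([0.2, 0.9]); b = np.array([0.3, 0.7])
```

The two self-terms remove the entropic bias: the corrected quantity vanishes when α = β, whereas the raw entropic cost does not.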
Figure 8.9: Decay of E(log10(W̃p,ε(α̂n, α̂n))), for p = 3/2 and various ε, as a function of log10(n),
where α is the same as in Figure 8.7, for d = 2 (left) and d = 5 (right).
The following proposition, whose proof can be found in [Ramdas et al., 2017], shows
that this regularized divergence interpolates between the Wasserstein distance and the
energy distance defined in Example 8.12.
Figure 8.9 shows numerically the impact of ε on the sample complexity rates. It is
proved in Genevay et al. [2019], in the case of c(x, y) = kx − yk2 on X = Rd , that these
rates interpolate between the ones of OT and MMD.
9 Variational Wasserstein Problems
shown in §8.2.2) are routinely used for shape matching (represented as measures over
a lifted space, often called currents) in computational anatomy [Vaillant and Glaunès,
2005], but OT distances offer an interesting alternative [Feydy et al., 2017]. To re-
duce the dimensionality of a dataset of histograms, Lee and Seung have shown that the
nonnegative matrix factorization problem can be cast using the Kullback–Leibler diver-
gence to quantify a reconstruction loss [Lee and Seung, 1999]. When prior information
is available on the geometry of the bins of those histograms, the Wasserstein distance
can be used instead, with markedly different results [Sandler and Lindenbaum, 2011,
Zen et al., 2014, Rolet et al., 2016].
All of these problems have in common that they require access to the gradients of
Wasserstein distances, or approximations thereof. We start this section by presenting
methods to approximate such gradients, then follow with three important applications
that can be cast as variational Wasserstein problems.
9.1 Differentiating the Wasserstein Loss

In statistics, text processing or imaging, one must usually compare a probability
distribution β arising from measurements to a model, namely a parameterized family of
distributions {αθ, θ ∈ Θ}, where Θ is a subset of a Euclidean space. Such a comparison
is done through a “loss” or a “fidelity” term, which is the Wasserstein distance in this
section. In the simplest scenario, the computation of a suitable parameter θ is obtained
by minimizing directly

min_{θ∈Θ} E(θ) def.= Lc(αθ, β).    (9.1)
Of course, one can consider more complicated problems: for instance, the barycenter
problem described in §9.2 consists in a sum of such terms. However, most of these more
advanced problems can be usually solved by adapting tools defined for the basic case
above, either using the chain rule to compute explicitly derivatives or using automatic
differentiation as advocated in §9.1.3.
Convexity. The Wasserstein distance between two histograms or two densities is convex
with respect to its two inputs, as shown by (2.20) and (2.24), respectively. Therefore,
when the parameter θ is itself a histogram, namely Θ = Σn and αθ = θ, or more
generally when θ describes K weights in the simplex, Θ = ΣK, and αθ = ΣKi=1 θi αi is a
convex combination of known atoms α1, . . . , αK in ΣN, Problem (9.1) remains convex
(the first case corresponds to the barycenter problem, the second to one iteration of
the dictionary learning problem with a Wasserstein loss [Rolet et al., 2016]). However,
for more general parameterizations θ 7→ αθ , Problem (9.1) is in general not convex.
Simple cases. For those simple cases where the Wasserstein distance has a closed
form, such as univariate (see §2.30) or elliptically contoured (see §2.31) distributions,
simple workarounds exist. They consist mostly in casting the Wasserstein distance as
a simpler distance between suitable representations of these distributions (Euclidean
on quantile functions for univariate measures, Bures metric for covariance matrices
for elliptically contoured distributions of the same family) and solving Problem (9.1)
directly on such representations.
In most cases, however, one has to resort to a careful discretization of αθ to com-
pute a local minimizer for Problem (9.1). Two approaches can be envisioned: Eulerian
or Lagrangian. Figure 9.1 illustrates the difference between these two fundamental dis-
cretization schemes. At the risk of oversimplifying this argument, one may say that
a Eulerian discretization is the most suitable when measures are supported on a low-
dimensional space (as when dealing with shapes or color spaces), or for intrinsically
discrete problems (such as those arising from string or text analysis). When applied
to fitting problems where observations can take continuous values in high-dimensional
spaces, a Lagrangian perspective is usually the only suitable choice.
as a family of vector embeddings for all words in a given dictionary [Kusner et al., 2015,
Rolet et al., 2016]). The parameterized measure αθ is in that case entirely represented
through the weight vector a : θ 7→ a(θ) ∈ Σn , which, in practice, might be very sparse
if the grid is large. This setting corresponds to the so-called class of Eulerian discretiza-
tion methods. In its original form, the objective of Problem (9.1) is not differentiable.
In order to obtain a smooth minimization problem, we use the entropic regularized OT
and approximate (9.1) using

min_{θ∈Θ} EE(θ) def.= LεC(a(θ), b) where Ci,j def.= c(xi, yj).
We recall that Proposition 4.6 shows that the entropic loss function is differentiable
and convex with respect to the input histograms, with the following gradient.

Proposition 9.1 (Derivative with respect to histograms). For ε > 0, (a, b) 7→ LεC(a, b)
is convex and differentiable. Its gradient reads

∇LεC(a, b) = (f, g),    (9.2)

where (f, g) is the unique solution to (4.30), centered such that Σi fi = Σj gj = 0. For
ε = 0, this formula defines the elements of the subdifferential of LεC, and the function
is differentiable if they are unique.
The zero mean condition on (f, g) is important when using gradient descent to
guarantee conservation of mass. Using the chain rule, one thus obtains that EE is
smooth and that its gradient is
∇EE (θ) = [∂a(θ)]> (f ), (9.3)
where ∂a(θ) ∈ Rn×dim(Θ) is the Jacobian (differential) of the map a(θ), and where
f ∈ Rn is the dual potential vector associated to the dual entropic OT (4.30) between
a(θ) and b for the cost matrix C (which is fixed in a Eulerian setting, and in particular
independent of θ). This result can be used to minimize locally EE through gradient
descent.
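Proposition 9.1 can be checked by finite differences: run Sinkhorn to convergence, read off the dual potentials f = ε log ũ and g = ε log ṽ, and compare ⟨f, δ⟩ with the variation of the loss along a mass-preserving direction δ. A NumPy sketch (names ours; the loss here is the entropic OT cost in the dual form ⟨f, a⟩ + ⟨g, b⟩ for the formulation with entropy relative to a ⊗ b, a stand-in for the precise LεC of (4.2)):

```python
import numpy as np

def entropic_loss(a, b, C, eps=0.5, iters=3000):
    # Sinkhorn scalings for the dual of entropic OT with KL(P | a (x) b):
    # at convergence f = eps log(ut), g = eps log(vt), loss = <f, a> + <g, b>
    K = np.exp(-C / eps)
    ut = np.ones_like(a)
    for _ in range(iters):
        vt = 1.0 / (K.T @ (a * ut))
        ut = 1.0 / (K @ (b * vt))
    f, g = eps * np.log(ut), eps * np.log(vt)
    return float(f @ a + g @ b), f - f.mean()   # loss, zero-mean potential (9.2)

rng = np.random.default_rng(0)
n = 5
C = rng.random((n, n))
a = np.full(n, 1.0 / n)
b = rng.random(n); b /= b.sum()

loss, f = entropic_loss(a, b, C)
delta = rng.normal(size=n); delta -= delta.mean()   # mass-preserving direction
t = 1e-4
fd = (entropic_loss(a + t * delta, b, C)[0]
      - entropic_loss(a - t * delta, b, C)[0]) / (2.0 * t)
# fd matches the directional derivative <f, delta>, as in Proposition 9.1
```

Since δ has zero mean, the centering of f does not affect ⟨f, δ⟩, which is why the zero-mean normalization is compatible with gradient descent over the simplex.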
Proposition 9.2 (Derivative with respect to the cost). For fixed input histograms (a, b)
and for ε > 0, the mapping C 7→ R(C) def.= LεC(a, b) is concave and smooth, and

∇R(C) = P,    (9.5)

where P is the unique optimal solution of (4.2). For ε = 0, this formula defines the set
of upper gradients.
where ∇₁c is the gradient with respect to the first variable. For instance, for c(s, t) = ‖s − t‖² on X = Y = R^d, one has
∇F(x) = 2 ( a_i x_i − Σ_{j=1}^m P_{i,j} y_j )_{i=1}^n, (9.7)
where ∂x(θ) ∈ Rdim(Θ)×(nd) is the Jacobian of the map x(θ) and where ∇F is imple-
mented as in (9.6) or (9.7) using for P the optimal coupling matrix between αθ and
β. One can thus implement a gradient descent to compute a local minimizer of EL , as
used, for instance, in [Cuturi and Doucet, 2014].
An alternative approach consists in using the Sinkhorn divergences introduced in (4.48), rather than the quantity L^ε_C in (4.2), which incorporates the entropy of the regularized optimal transport, and in differentiating them directly as a composition of simple maps using the inputs, either the histogram in the Eulerian case or the cost matrix in the Lagrangian case. Using the definitions introduced in §4.5, this is equivalent to differentiating
D^(L)_C(a(θ), b) or D^(L)_{C(x(θ))}(a, b)
with respect to θ, in, respectively, the Eulerian and the Lagrangian cases for L large
enough.
The cost for computing the gradient of functionals involving Sinkhorn divergences
is the same as that of computation of the functional itself; see, for instance, [Bon-
neel et al., 2016, Genevay et al., 2018] for some applications of this approach. We also
refer to [Adams and Zemel, 2011] for an early work on differentiating Sinkhorn iter-
ations with respect to the cost matrix (as done in the Lagrangian framework), with
applications to learning rankings. Further details on automatic differentiation can be
found in [Griewank and Walther, 2008, Rall, 1981, Neidinger, 2010], in particular on the
“reverse mode,” which is the fastest way to compute gradients. In terms of implementation, all recent deep-learning Python frameworks feature state-of-the-art reverse-mode differentiation and support for GPU/TPU computations [Al-Rfou et al., 2016, Abadi et al., 2016, Pytorch, 2017]; they should be adopted for any large-scale application of Sinkhorn losses. We strongly encourage the use of such automatic differentiation techniques, since they have the same complexity as computing (9.3) and (9.8), these formulas being mostly useful to obtain a theoretical understanding of what automatic differentiation is computing. The only downside is that reverse-mode automatic differentiation is memory intensive (the memory grows proportionally with the number of iterations). There exist, however, subsampling strategies that mitigate this problem [Griewank, 1992].
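As a sketch of this equivalence, one can compare the gradient obtained by differentiating a solver truncated at L Sinkhorn iterations (here via finite differences, standing in for reverse-mode automatic differentiation) with the dual potential f of Proposition 9.1; the sizes, ε and iteration counts below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps = 6, 0.4
a = np.full(n, 1/n)
b = rng.random(n); b /= b.sum()
C = rng.random((n, n))
K = np.exp(-C / eps)

def value_L(a, L):
    # entropic dual objective after L Sinkhorn iterations (a truncated solver)
    v = np.ones(n)
    for _ in range(L):
        u = a / (K @ v)
        v = b / (K.T @ u)
    f, g = eps * np.log(u), eps * np.log(v)
    return f @ a + g @ b - eps * np.exp((f[:, None] + g[None, :] - C) / eps).sum()

def fd_grad(L, h=1e-6):
    # finite differences stand in for reverse-mode autodiff through the iterations
    g = np.array([(value_L(a + h * np.eye(n)[i], L) - value_L(a - h * np.eye(n)[i], L)) / (2 * h)
                  for i in range(n)])
    return g - g.mean()                 # compare up to an additive constant

# (nearly) converged reference run gives the dual potential f of Proposition 9.1
v = np.ones(n)
for _ in range(3000):
    u = a / (K @ v)
    v = b / (K.T @ u)
f_star = eps * np.log(u); f_star -= f_star.mean()

err = np.abs(fd_grad(200) - f_star).max()
```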
for a given family of weights (λ_s)_s ∈ Σ_S, where p is often set to p = 2. When X = R^d and d(x, y) = ‖x − y‖₂, this leads to the usual definition of the linear average x̄ = Σ_s λ_s x_s
9.2. Wasserstein Barycenters, Clustering and Dictionary Learning 139
for p = 2 and to the more involved notion of a median point when p = 1. One can retrieve various notions of means (e.g. harmonic or geometric means over X = R₊) using this formalism.
This process is often referred to as the “Fréchet” or “Karcher” mean (see Karcher
[2014] for a historical account). For a generic distance d, Problem (9.9) is usually a
difficult nonconvex optimization problem. Fortunately, in the case of optimal transport
distances, the problem can be formulated as a convex program for which existence can
be proved and efficient numerical schemes exist.
Fréchet means over the Wasserstein space. Given input histograms {b_s}_{s=1}^S, where
bs ∈ Σns , and weights λ ∈ ΣS , a Wasserstein barycenter is computed by minimizing
min_{a∈Σ_n} Σ_{s=1}^S λ_s L_{C_s}(a, b_s), (9.10)
where the cost matrices C_s ∈ R^{n×n_s} need to be specified. A typical setup is “Eulerian,” where all the barycenters are defined on the same grid (n_s = n) and C_s = C = D^p is set to be a distance matrix raised to the power p, so that one solves
min_{a∈Σ_n} Σ_{s=1}^S λ_s W_p^p(a, b_s).
The barycenter problem (9.10) was introduced in a more general form involving
arbitrary measures in Agueh and Carlier [2011] following earlier ideas of Carlier and
Ekeland [2010]. That presentation is deferred to Remark 9.1. The barycenter problem
for histograms (9.10) is in fact a linear program, since one can look for the S couplings
(Ps )s between each input and the barycenter itself, which by construction must be
constrained to share the same row marginal,
min_{a∈Σ_n, (P_s∈R^{n×n_s}_+)_s} { Σ_{s=1}^S λ_s ⟨P_s, C_s⟩ : ∀ s, P_s 1_{n_s} = a, P_s^⊤ 1_n = b_s }.
Although this problem is an LP, its scale forbids the use of generic solvers for medium-
scale problems. One can resort to using first order methods such as subgradient descent
on the dual [Carlier et al., 2015].
Remark 9.1 (Barycenter of arbitrary measures). Given a set of input measures (βs)s
defined on some space X , the barycenter problem becomes
min_{α∈M¹₊(X)} Σ_{s=1}^S λ_s L_c(α, β_s). (9.11)
In the case where X = Rd and c(x, y) = kx − yk2 , Agueh and Carlier [2011] show
that if one of the input measures has a density, then this barycenter is unique,
140 Variational Wasserstein Problems
and the support of α⋆ is located in the convex hull of the supports of the (β_s)_s. The
consistency of the approximation of the infinite-dimensional optimization (9.11)
when approximating the input distribution using discrete ones (and thus solving (9.10) in place) is studied in Carlier et al. [2015]. Let us also note that it is
possible to recast (9.11) as a multimarginal OT problem; see Remark 10.2.
here β̂_S is itself a random measure). Problem (9.11) corresponds to the special case of a “discrete” measure M = Σ_s λ_s δ_{β_s}. The convergence (in expectation or with
high probability) of Lc (β̂S , α) to zero (where α is the unique solution to (9.13))
corresponds to the consistency of the barycenters, and is proved in [Bigot and
Klein, 2012a, Le Gouic and Loubes, 2016, Bigot and Klein, 2012b]. This can be
interpreted as a law of large numbers over the Wasserstein space. The extension of
this result to a central limit theorem is an important problem; see [Panaretos and
Zemel, 2016] and [Agueh and Carlier, 2017] for recent formulations of that problem
and solutions in particular cases (1-D distributions and Gaussian measures).
Remark 9.4 (Fixed-point map). When dealing with the Euclidean space X = Rd
with ground cost c(x, y) = kx − yk2 , it is possible to study the barycenter problem
using transportation maps. Indeed, if α has a density, according to Remark 2.24,
one can define optimal transportation maps Ts between α and αs , in particular
such that T_{s,♯} α = α_s. The average map
T^(α) := Σ_{s=1}^S λ_s T_s
(the notation above makes explicit the dependence of this map on α) is itself an optimal map between α and T^(α)_♯ α (a positive combination of optimal maps is
equal by Brenier’s theorem, Remark 2.24, to the sum of gradients of convex func-
tions, equal to the gradient of a sum of convex functions, and therefore optimal
by Brenier’s theorem again). As shown in [Agueh and Carlier, 2011], first order
optimality conditions of the barycenter problem (9.13) actually read T^(α⋆) = I_{R^d} (the identity map) at the optimal measure α⋆ (the barycenter), and it is shown in [Álvarez-Esteban et al., 2016] that the barycenter α⋆ is the unique solution (under regularity conditions clarified in [Zemel and Panaretos, 2018, Theo. 2]) to the fixed-point equation
G(α) = α where G(α) := T^(α)_♯ α. (9.14)
Under mild conditions on the input measures, Álvarez-Esteban et al. [2016]
and Zemel and Panaretos [2018] have shown that α 7→ G(α) strictly decreases
the objective function of (9.13) if α is not the barycenter and that the fixed-point iterations α^(ℓ+1) := G(α^(ℓ)) converge to the barycenter α⋆. This fixed-point algorithm can be used in cases where the optimal transportation maps are known
in closed form (e.g. for Gaussians). Adapting this algorithm for empirical mea-
sures of the same size results in computing optimal assignments in place of Monge
maps. For more general discrete measures of arbitrary size the scheme can also be
Special cases. In general, solving (9.10) or (9.11) is not straightforward, but there
exist some special cases for which solutions are explicit or simple.
where B is the Bures metric (2.42). As studied in [Agueh and Carlier, 2011], the first order optimality condition of this convex problem shows that Σ⋆ is the unique positive definite fixed point of the map
Σ⋆ = Ψ(Σ⋆) where Ψ(Σ) := Σ_s λ_s (Σ^{1/2} Σ_s Σ^{1/2})^{1/2},
where Σ^{1/2} denotes the square root of a positive semidefinite matrix. This result was
known from [Knott and Smith, 1994, Rüschendorf and Uckelmann, 2002] and is
proved in [Agueh and Carlier, 2011]. While Ψ is not strictly contracting, iterating this fixed-point map, i.e. defining Σ^(ℓ+1) := Ψ(Σ^(ℓ)), converges in practice to the solution Σ⋆. This method has been applied to texture synthesis in [Xia et al., 2014].
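A minimal sketch of this fixed-point iteration follows; the two diagonal input covariances are a toy example chosen so that the limit is known in closed form (for commuting matrices, Σ⋆ = (Σ_s λ_s Σ_s^{1/2})²).

```python
import numpy as np

def msqrt(M):
    # principal square root of a symmetric PSD matrix via eigendecomposition
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.maximum(w, 0.0))) @ V.T

def psi(S, covs, lam):
    # the map Psi(Sigma) = sum_s lam_s (Sigma^{1/2} Sigma_s Sigma^{1/2})^{1/2}
    R = msqrt(S)
    return sum(l * msqrt(R @ Sg @ R) for l, Sg in zip(lam, covs))

covs = [np.diag([1.0, 4.0]), np.diag([9.0, 1.0])]   # toy commuting inputs
lam = [0.5, 0.5]
S = np.eye(2)
for _ in range(100):
    S = psi(S, covs, lam)
# here the limit is (0.5*covs[0]**0.5 + 0.5*covs[1]**0.5)**2 = diag(4, 2.25)
```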
Álvarez-Esteban et al. [2016] have also proposed to use an alternative map
Ψ̄(Σ) := Σ^{−1/2} ( Σ_s λ_s (Σ^{1/2} Σ_s Σ^{1/2})^{1/2} )² Σ^{−1/2}
for which the iterations Σ^(ℓ+1) := Ψ̄(Σ^(ℓ)) converge. This is because the fixed-point
map G defined in (9.14) preserves Gaussian distributions, and in fact,
Figure 9.2 shows two examples of computations of barycenters between four 2-D
Gaussians.
Figure 9.2: Barycenters between four Gaussian distributions in 2-D. Each Gaussian is displayed using
an ellipse aligned with the principal axes of the covariance, and with elongations proportional to the
corresponding eigenvalues.
Remark 9.6 (1-D cases). For 1-D distributions, the W_p barycenter can be computed almost in closed form using the fact that the transport is the monotone rearrangement, as detailed in Remark 2.30. The simplest case is for empirical measures with n points, i.e. β_s = (1/n) Σ_{i=1}^n δ_{y_{s,i}}, where the points are assumed to be sorted, y_{s,1} ≤ y_{s,2} ≤ ⋯. Using (2.33), the barycenter α_λ is also an empirical measure on n points,
α_λ = (1/n) Σ_{i=1}^n δ_{x_{λ,i}} where x_{λ,i} = A_λ((y_{s,i})_s),
which can be used, for instance, to compute barycenters between discrete measures
supported on less than n points in O(n log(n)) operations, using a simple sorting
procedure.
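A sketch of this sorting procedure for p = 2, where A_λ is the weighted arithmetic mean of the sorted points (the sample size, weights and Gaussian inputs are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
lam = np.array([0.2, 0.3, 0.5])
# three empirical measures with n points each (here Gaussian samples)
ys = [np.sort(rng.normal(loc=m, scale=1.0, size=n)) for m in (0.0, 2.0, 4.0)]
# W2 barycenter support: weighted arithmetic mean of sorted support points
x_bar = sum(l * y for l, y in zip(lam, ys))
```

For Gaussians N(m_s, 1), the barycenter is N(Σ_s λ_s m_s, 1), here centered near 2.6; the sorted averages inherit monotonicity.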
translation,
α_λ = T_{r⋆,u⋆,♯} α₀ where r⋆ = (Σ_s λ_s / r_s)^{−1} and u⋆ = Σ_s λ_s u_s.
Remark 9.8 (Case S = 2). In the case where X = Rd and c(x, y) = kx − yk2 (this
can be extended more generally to geodesic spaces), the barycenter between S =
2 measures (α0 , α1 ) is the McCann interpolant as already introduced in (7.6).
Denoting by T the Monge map such that T_♯ α₀ = α₁, one has that the barycenter α_λ reads α_λ = (λ₁ Id + λ₂ T)_♯ α₀. Formula (7.9) explains how to perform the computation in the
discrete case.
Entropic approximation of barycenters. One can use entropic smoothing and approx-
imate the solution of (9.10) using
min_{a∈Σ_n} Σ_{s=1}^S λ_s L^ε_{C_s}(a, b_s) (9.15)
for some ε > 0. This is a smooth convex minimization problem, which can be tackled
using gradient descent [Cuturi and Doucet, 2014, Gramfort et al., 2015]. An alternative
is to use descent methods (typically quasi-Newton) on the semi-dual [Cuturi and Peyré,
2016], which is useful to integrate additional regularizations on the barycenter, to im-
pose, for instance, some smoothness w.r.t. a given norm. A simpler yet very effective approach, as remarked by Benamou et al. [2015], is to rewrite (9.15) as a (weighted) KL projection problem
min_{(P_s)_s} { Σ_s λ_s ε KL(P_s | K_s) : ∀ s, P_s^⊤ 1_n = b_s, P_1 1_{n_1} = ⋯ = P_S 1_{n_S} }, (9.16)
where K_s := e^{−C_s/ε},
which shows the desired formula. To show (9.22), since this function is separable, one
needs to compute
∀ (u, k) ∈ R²₊, KL*(u|k) := max_r { u r − (r log(r/k) − r + k) },
whose optimality condition reads u = log(r/k), i.e. r = k e^u, hence the result.
Minimizing (9.21) with respect to each gs , while keeping all the other variables
fixed, is obtained in closed form by (9.18). Minimizing (9.21) with respect to all the
(fs )s requires us to solve for a using (9.20) and leads to the expression (9.19).
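The resulting scheme alternates a Sinkhorn-like scaling enforcing the marginals b_s with a geometric mean defining the current barycenter. A minimal sketch on a 1-D grid follows, in the spirit of Benamou et al. [2015] (grid size, ε, the Gaussian inputs and the iteration count are illustrative choices, not pseudocode from this text):

```python
import numpy as np

n, S, eps = 60, 2, 0.04
x = np.linspace(0.0, 1.0, n)
C = (x[:, None] - x[None, :])**2        # shared squared-distance cost matrix
K = np.exp(-C / eps)
lam = np.array([0.5, 0.5])

def gauss(m, s):
    g = np.exp(-(x - m)**2 / (2 * s**2))
    return g / g.sum()

b = np.stack([gauss(0.25, 0.1), gauss(0.75, 0.1)])   # input histograms b_s
u = np.ones((S, n))
for _ in range(1000):
    v = b / (K.T @ u.T).T               # enforce the marginals b_s (K is symmetric here)
    Kv = (K @ v.T).T
    # KL projection on the shared-marginal constraint: geometric mean of row marginals
    a = np.exp((lam[:, None] * np.log(u * Kv)).sum(0))
    u = a[None, :] / Kv
```

By symmetry of this toy setup, the barycenter a is centered at 0.5 (up to the entropic blur of order √ε).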
Figures 9.3 and 9.4 show applications to 2-D and 3-D shape interpolation. Figure 9.5 shows a computation of barycenters on a surface, where the ground cost is the square of the geodesic distance. For this figure, the computations are performed using the geodesic in heat approximation detailed in Remark 4.19. We refer to [Solomon
et al., 2015] for more details and other applications to computer graphics and imaging
sciences.
Figure 9.3: Barycenters between four input 2-D shapes using entropic regularization (9.15). To display a binary shape, the displayed images show a thresholded density. The weights (λs)s are bilinear with respect to the four corners of the square.
Figure 9.4: Barycenters between four input 3-D shapes using entropic regularization (9.15). The
weights (λs )s are bilinear with respect to the four corners of the square. Shapes are represented as
measures that are uniform within the boundaries of the shape and null outside.
instance, [Seguy and Cuturi, 2015, Bigot et al., 2017b]) and the statistical estimation
of template models [Boissard et al., 2015]. The ability to compute barycenters enables
more advanced clustering methods such as the k-means on the space of probability
measures [del Barrio et al., 2016, Ho et al., 2017].
Figure 9.5: Barycenters interpolation between two input measures on surfaces, computed using the
geodesic in heat fast kernel approximation (see Remark 4.19). Extracted from [Solomon et al., 2015].
Figure 9.6: Interpolation between the two 3-D color empirical histograms of two input images (here
only the 2-D chromatic projection is visualized for simplicity). The modified histogram is then applied
to the input images using barycentric projection as detailed in Remark 4.11. Extracted from [Solomon
et al., 2015].
where the cost matrices Cr,s ∈ Rnr ×ns need to be specified by the user. The
barycenter problem (9.10) is a special case of this problem where the considered
graph G is “star shaped,” where U is a single vertex connected to all the other
vertices V (the weight λs associated to bs can be absorbed in the cost matrix).
Introducing explicitly a coupling Pr,s ∈ U(br , bs ) for each edge (r, s) ∈ G, and
using entropy regularization, one can rewrite this problem similarly as in (9.16),
and one extends Sinkhorn iterations (9.18) to this problem (this can also be de-
rived by recasting this problem in the form of the generalized Sinkhorn algorithm
detailed in §4.6). This discrete variational problem (9.23) on a graph can be gen-
eralized to define a Dirichlet energy when replacing the graph by a continuous do-
main [Solomon et al., 2013]. This in turn leads to the definition of measure-valued
harmonic functions, which find applications in image and surface processing. We
refer also to Lavenant [2017] for a theoretical analysis and to Vogt and Lellmann
[2018] for extensions to nonquadratic (total-variation) functionals and applications
to imaging.
9.3. Gradient Flows 149
Given a smooth function a 7→ F(a), one can use the standard gradient descent
a^(ℓ+1) := a^(ℓ) − τ ∇F(a^(ℓ)), (9.24)
where τ is a small enough step size. This corresponds to a so-called “explicit” minimization scheme and only applies to smooth functions F. For nonsmooth functions, one
can use instead an “implicit” scheme, which is also called the proximal-point algorithm
(see, for instance, Bauschke and Combettes [2011])
a^(ℓ+1) := Prox^{‖·‖}_{τF}(a^(ℓ)) := argmin_a (1/2) ‖a − a^(ℓ)‖² + τ F(a). (9.25)
Note that this corresponds to the Euclidean proximal operator, already encountered
in (7.13). The update (9.24) can be understood as iterating the explicit operator Id −
τ ∇F , while (9.25) makes use of the implicit operator (Id + τ ∇F )−1 . For convex F ,
iterations (9.25) always converge, for any value of τ > 0.
If the function F is defined on the simplex of histograms Σ_n, then it makes sense to use an optimal transport metric in place of the ℓ² norm ‖·‖ in (9.25), in order to solve
a^(ℓ+1) := argmin_{a∈Σ_n} W_p^p(a^(ℓ), a) + τ F(a), (9.26)
and, more generally over the space of measures,
α^(ℓ+1) := argmin_α W_p^p(α^(ℓ), α) + τ F(α), (9.27)
for some function F defined on M¹₊(X).
to construct continuous flows, by formally taking the limit τ → 0 and introducing
the time t = τ `, so that α(`) is intended to approximate a continuous flow t ∈
R+ 7→ αt . For the special case p = 2 and X = Rd , a formal calculus shows that αt
is expected to solve a PDE of the form
∂α_t/∂t = div(α_t ∇(F′(α_t))), (9.28)
where F 0 (α) denotes the derivative of the function F in the sense that it is a
continuous function F 0 (α) ∈ C(X ) such that
F(α + εξ) = F(α) + ε ∫_X F′(α)(x) dξ(x) + o(ε).
A typical example is when using F = −H, where H(α) := −KL(α|L_{R^d}) is minus the relative entropy with respect to the Lebesgue measure L_{R^d} on R^d
(setting H(α) = −∞ when α does not have a density), then (9.28) shows that the
gradient flow of this neg-entropy is the linear heat diffusion
∂α_t/∂t = Δα_t, (9.30)
where Δ is the spatial Laplacian. The heat diffusion can therefore be interpreted either as the “classical” Euclidean flow (somehow performing “vertical” movements with respect to mass amplitudes) of the Dirichlet energy ∫_{R^d} ‖∇ρ_α(x)‖² dx or, alternatively, as the optimal transport flow of the (neg-)entropy (somehow a “horizontal”
movement with respect to mass positions). Interest in Wasserstein gradient flows
was sparked by the seminal paper of Jordan, Kinderlehrer and Otto [Jordan et al.,
1998], and these evolutions are often called “JKO flows” following their work. As
shown in detail in the monograph by Ambrosio et al. [2006], JKO flows are a
special case of gradient flows in metric spaces. We also refer to the recent survey
paper [Santambrogio, 2017]. JKO flows can be used to study in particular non-
linear evolution equations such as the porous medium equation [Otto, 2001], total
variation flows [Carlier and Poon, 2019], quantum drifts [Gianazza et al., 2009],
or heat evolutions on manifolds [Erbar, 2010]. Their flexible formalism allows for
constraints on the solution, such as the congestion constraint (an upper bound on
the density at any point) that Maury et al. used to model crowd motion [Maury
et al., 2010] (see also the review paper [Santambrogio, 2018]).
Remark 9.11 (Gradient flows in metric spaces). The implicit stepping (9.27) is a
special case of a more general formalism to define gradient flows over metric spaces
(X, d), where d is a distance, as detailed in [Ambrosio et al., 2006]. For some function F(x) defined for x ∈ X, the implicit discrete minimization step is then defined as
as
x(`+1) ∈ argmin d(x(`) , x)2 + τ F (x). (9.31)
x∈X
The JKO step (9.27) corresponds to the use of the Wasserstein distance on the
space of probability distributions. In some cases, one can show that (9.31) admits
a continuous flow limit xt as τ → 0 and kτ = t. In the case that X also has a
Euclidean structure, an explicit stepping is defined by linearizing F
Figure 9.7: Comparison of explicit and implicit gradient flow to minimize the function f (x) = kxk2
on X = R2 for the distances d(x, y) = kx − ykp for several values of p.
then replaced by a set of n coupled ODEs prescribing the dynamics of the points X(t) = (x_i(t))_i ∈ X^n. If the energy F is finite for discrete measures, then one can simply define F(X) := F((1/n) Σ_{i=1}^n δ_{x_i}). Typical examples are linear functions F(α) = ∫_X V(x) dα(x) and quadratic interactions F(α) = ∫_{X²} W(x, y) dα(x) dα(y), in which case one can use, respectively,
F(X) = (1/n) Σ_i V(x_i) and F(X) = (1/n²) Σ_{i,j} W(x_i, x_j).
For functions such as generalized entropies, which are only finite for measures having densities, one should apply a density estimator to convert the point cloud into a density, which allows one to also define a function F(X) consistent with F as n → +∞. A typical example is the entropy F(α) = H(α) defined in (9.29), for which a consistent estimator (up to a constant term) can be obtained by summing the logarithms of the distances to nearest neighbors,
F(X) = (1/n) Σ_i log(d_X(x_i)) where d_X(x) := min_{x′∈X, x′≠x} ‖x − x′‖; (9.33)
see Beirlant et al. [1997] for a review of nonparametric entropy estimators. For small enough step sizes τ, assuming X = R^d, the Wasserstein distance W₂ matches the Euclidean distance on the points, i.e. if |t − t′| is small enough, W₂(α_t, α_{t′}) = ‖X(t) − X(t′)‖. The gradient flow is thus equivalent to the Euclidean flow on positions, X′(t) = −∇F(X(t)), which is discretized for times t_k = τk similarly to (9.24) using explicit Euler steps.
Figure 9.8 shows an example of such a discretized explicit evolution for a linear plus entropy functional, resulting in a discretized version of a Fokker–Planck equation. Note that for this particular case of a linear Fokker–Planck equation, it is also possible to resort to stochastic PDE methods, and it can be approximated numerically by evolving a single random particle with a Gaussian drift. The convergence of these schemes (so-called Langevin Monte Carlo) to the stationary distribution can in turn be quantified in terms of the Wasserstein distance; see, for instance, [Dalalyan and Karagulyan, 2017]. If the function F is not smooth, one should discretize
similarly to (9.25) using implicit Euler steps, i.e. consider
X^(ℓ+1) := Prox^{‖·‖}_{τF}(X^(ℓ)) := argmin_{Z∈X^n} (1/2) ‖Z − X^(ℓ)‖² + τ F(Z).
In the simplest case of a linear function F(α) = ∫_X V(x) dα(x), the flow operates independently over each particle x_i(t) and corresponds to the usual Euclidean flow for the function V, x_i′(t) = −∇V(x_i(t)) (and is an advection PDE of the density along the integral curves of the flow).
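The single-random-particle approximation of the Fokker–Planck case mentioned above can be sketched as follows (an illustration with V(x) = ‖x‖²/2, for which the stationary distribution of F(α) = ∫ V dα − H(α) is the standard Gaussian ∝ e^{−V}; step size, particle count and horizon are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, tau, steps = 4000, 0.01, 2000
x = np.zeros(N)                         # N independent copies of one Langevin particle
for _ in range(steps):
    # Euler step on the drift -grad V(x) = -x, plus Gaussian diffusion
    x += -tau * x + np.sqrt(2 * tau) * rng.standard_normal(N)
var = x.var()                           # should approach the stationary variance 1
```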
Figure 9.8: Example of gradient flow evolutions using a Lagrangian discretization, for the function F(α) = ∫ V dα − H(α), for V(x) = ‖x‖². The entropy is discretized using (9.33). The limiting stationary distribution is a Gaussian.
such a function exists, is unique, and is the limit of the discrete stepping (9.27) as τ → 0. It converges to a fixed stationary distribution as t → +∞. The entropy is a typical example of a geodesically convex function, and so are linear functions of the form F(α) = ∫_X V(x) dα(x) and quadratic interaction functions F(α) = ∫_{X×X} W(x, y) dα(x) dα(y) for convex functions V : X → R, W : X × X → R. Note
that while linear functions are convex in the classical sense, quadratic interaction
functions might fail to be. A typical example is W (x, y) = kx − yk2 , which is a
negative semi-definite kernel (see Definition 8.3) and thus corresponds to F (α)
being a concave function in the usual sense (while it is geodesically convex). An
important result of McCann [1997] is that generalized “entropy” functions of the form F(α) = ∫_{R^d} φ(ρ_α(x)) dx on X = R^d are geodesically convex if φ is convex,
[2018b] and theoretically analyzed in Carlier et al. [2017]. With an entropic regular-
ization, Problem (9.26) has the form (4.49) when setting G = ιa(`) and replacing F
by τ F . One can thus use the iterations (4.51) to approximate a(`+1) as proposed ini-
tially in Peyré [2015]. The convergence of this scheme as ε → 0 is proved in Carlier
et al. [2017]. Figure 9.9 shows an example of evolution computed with this method.
An interesting application of gradient flows to machine learning is to learn the underlying function F that best models some observed dynamical evolution of densities. This learning can be achieved by solving a smooth nonconvex optimization using entropic regularized transport and automatic differentiation (see §9.1.3); see Hashimoto et al. [2016].
Analyzing the convergence of gradient flows discretized in both time and space is difficult in general. Due to the polyhedral nature of the linear program defining the distance, using too-small step sizes leads to a “locking” phenomenon (the distribution gets stuck and does not evolve, so that the step size should not be too small, as discussed in [Maury and Preux, 2017]). We refer to [Matthes and Osberger, 2014, 2017] for a convergence analysis of a discretization method for gradient flows in one dimension.
Figure 9.9: Examples of gradient flow evolutions, with drift V and congestion terms (from Peyré [2015]), so that F(α) = ∫_X V(x) dα(x) + ι_{≤κ}(ρ_α).
It is also possible to compute gradient flows for unbalanced optimal transport dis-
tances as detailed in §10.2. This results in evolutions allowing mass creation or de-
struction, which is crucial to model many physical, biological or chemical phenomena.
An example of unbalanced gradient flow is the celebrated Hele-Shaw model for cell
growth [Perthame et al., 2014], which is studied theoretically in [Gallouët and Mon-
saingeon, 2017, Di Marino and Chizat, 2017]. Such an unbalanced gradient flow also
9.4. Minimum Kantorovich Estimators 155
can be approximated using the generalized Sinkhorn algorithm [Chizat et al., 2018b].
Given some discrete samples (x_i)_{i=1}^n ⊂ X from some unknown distribution, the goal is to fit a parametric model θ 7→ α_θ ∈ M(X) to the observed empirical input measure β,
min_{θ∈Θ} L(α_θ, β) where β := (1/n) Σ_i δ_{x_i}, (9.34)
αθ = hθ,] ζ where hθ : Z → X ,
where the push-forward operator is introduced in Definition 2.1. The space Z is usually
low-dimensional, so that the support of αθ is localized along a low-dimensional “mani-
fold” and the resulting density is highly singular (it does not have a density with respect
to Lebesgue measure). Furthermore, computing this density is usually intractable, while
generating i.i.d. samples from αθ is achieved by computing xi = hθ (zi ), where (zi )i are
i.i.d. samples from ζ.
In order to cope with such a difficult scenario, one has to use weak metrics in place
of the MLE functional LMLE , which needs to be written in dual form as
L(α, β) := max_{(f,g)∈C(X)²} { ∫_X f(x) dα(x) + ∫_X g(x) dβ(x) : (f, g) ∈ R }. (9.35)
R = {(f, −f ) : f ∈ B} ,
where ∂h_θ(x) ∈ R^{dim(Θ)×d} is the differential (with respect to θ) of θ ∈ R^{dim(Θ)} 7→ h_θ(x), while ∇f_θ(x) is the gradient (with respect to x) of f_θ. This formula is hard to use numerically, first because it requires computing a continuous function f_θ that solves a semidiscrete problem. As shown in §8.5, for the OT loss, this can be achieved using stochastic optimization, but this is hardly applicable in high dimension. Another option is to impose a parametric form for this potential, for instance an expansion in an RKHS [Genevay et al., 2016] or a deep-network approximation [Arjovsky et al., 2017]. This, however, leads to important approximation errors that are not yet analyzed theoretically. A last issue is that it is numerically unstable, because it requires the computation of the gradient ∇f_θ of the dual potential f_θ.
For the OT loss, an alternative gradient formula is obtained when one rather computes a primal optimal coupling for the following equivalent problem:
L_c(α_θ, β) = min_{γ∈U(ζ,β)} ∫_{Z×X} c(h_θ(z), x) dγ(z, x). (9.37)
Note that in the semidiscrete case considered here, the objective to be minimized can
be actually decomposed as
min_{(γ_i)_{i=1}^n} Σ_{i=1}^n ∫_Z c(h_θ(z), x_i) dγ_i(z) where Σ_{i=1}^n γ_i = ζ, ∫_Z dγ_i(z) = 1/n, (9.38)
where each γi ∈ M1+ (Z). Once an optimal (γθ,i )i solving (9.38) is obtained, the gradient
of E(θ) is computed as
∇E(θ) = Σ_{i=1}^n ∫_Z [∂h_θ(z)]^⊤ ∇₁c(h_θ(z), x_i) dγ_i(z),
where ∇1 c(x, y) ∈ Rd is the gradient of x 7→ c(x, y). Note that as opposed to (9.36),
this formula does not involve computing the gradient of the potentials being solutions
of the dual OT problem.
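A minimal sketch of such an estimator in 1-D, where the optimal coupling of (9.38) for the squared cost is the monotone (sorted) assignment: the model h_θ(z) = z + θ is a hypothetical translation family (not an example from this text), and the gradient uses ∇₁c(x, y) = 2(x − y) through the fixed coupling.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
z = np.sort(rng.normal(size=n))                 # model seeds, samples from zeta
x = np.sort(rng.normal(loc=3.0, size=n))        # observed data samples
theta, lr = 0.0, 0.25
for _ in range(100):
    # 1-D optimal coupling for c = |.|^2 is the sorted assignment; the gradient
    # of E(theta) = (1/n) sum_j |h_theta(z_j) - x_j|^2 through it is:
    grad = 2.0 * np.mean((z + theta) - x)
    theta -= lr * grad
# theta should recover the data shift, here close to 3
```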
The class of estimators obtained using L = Lc , often called “minimum Kantorovich
estimators,” was initially introduced in [Bassetti et al., 2006]; see also [Canas and
Rosasco, 2012]. It has been used in the context of generative models by [Montavon et al.,
2016] to train restricted Boltzmann machines and in [Bernton et al., 2017] in conjunction
with approximate Bayesian computations. Approximations of these computations using deep networks are used to train deep generative models for both GANs [Arjovsky et al., 2017] and VAEs [Bousquet et al., 2017]; see also [Genevay et al., 2018, 2017, Salimans et al., 2018]. Note that Sinkhorn divergences are used routinely for parametric model fitting in shape matching and registration; see [Gold et al., 1998, Chui and Rangarajan, 2000, Myronenko and Song, 2010, Feydy et al., 2017].
Remark 9.14 (Metric learning and transfer learning). Let us insist on the fact that,
for applications in machine learning, the success of OT-related methods very much
depends on the choice of an adapted cost c(x, y) which captures the geometry of the
data. While it is possible to embed many kinds of data in Euclidean spaces (see, for
instance, [Mikolov et al., 2013] for word embeddings), in many cases, some sort of
adaptation or optimization of the metric is needed. Metric learning for supervised
tasks is a classical problem (see, for instance, [Kulis, 2012, Weinberger and Saul,
2009]) and it has been extended to the learning of the ground metric c(x, y) when
some OT distance is used in a learning pipeline [Cuturi and Avis, 2014] (see also Zen
et al. 2014, Wang and Guibas 2012, Huang et al. 2016). Let us also mention the
related inverse problem of learning the cost matrix from the observations of an
optimal coupling P, which can be regularized using a low-rank prior [Dupuy et al.,
2016]. Related problems are transfer learning [Pan and Yang, 2010] and domain
adaptation [Glorot et al., 2011], where one wants to transfer some trained machine
learning pipeline to adapt it to some new dataset. This problem can be modeled
This chapter details several variational problems that are related to (and share the same
structure of) the Kantorovich formulation of optimal transport. The goal is to extend
optimal transport to more general settings: several input histograms and measures,
unnormalized ones, more general classes of measures, and optimal transport between
measures that focuses on local regularities (points nearby in the source measure should
be mapped onto points nearby in the target measure) rather than a total transport
cost, including cases where these two measures live in different metric spaces.
Instead of coupling two input histograms using the Kantorovich formulation (2.11), one can couple S histograms (a_s)_{s=1}^S, where a_s ∈ Σ_{n_s}, by solving the following multimarginal problem:
min_{P∈U((a_s)_s)} ⟨C, P⟩ := Σ_{i_1,…,i_S} C_{i_1,…,i_S} P_{i_1,…,i_S}, (10.1)
160 Extensions of Optimal Transport
and one can then apply Sinkhorn's algorithm to compute the optimal P in scaling form, where each entry indexed by a multi-index vector i = (i_1, …, i_S) reads
P_i = K_i Π_{s=1}^S u_{s,i_s} where K := e^{−C/ε},
where the u_s ∈ R^{n_s}₊ are (unknown) scaling vectors, which are iteratively updated by cycling repeatedly through s = 1, …, S,
u_{s,i_s} ← a_{s,i_s} / Σ_{(i_ℓ)_{ℓ≠s}} K_i Π_{ℓ≠s} u_{ℓ,i_ℓ}. (10.2)
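A sketch of the updates (10.2) for S = 3, with the cost tensor stored densely (O(n^S) memory, so only viable for tiny n; the sizes and ε below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, S, eps = 5, 3, 0.5
a = [rng.random(n) for _ in range(S)]
a = [x / x.sum() for x in a]
C = rng.random((n, n, n))               # dense cost tensor: O(n^S) memory
K = np.exp(-C / eps)
u = [np.ones(n) for _ in range(S)]

def scaled_tensor():
    # current iterate P_i = K_i * prod_s u_{s, i_s}
    return K * u[0][:, None, None] * u[1][None, :, None] * u[2][None, None, :]

for _ in range(300):
    for s in range(S):
        axes = tuple(l for l in range(S) if l != s)
        # update (10.2): rescale u_s so that the s-th marginal matches a_s
        u[s] *= a[s] / scaled_tensor().sum(axis=axes)
P = scaled_tensor()
```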
Remark 10.1 (General measures). The discrete multimarginal problem (10.1) is generalized to measures (α_s)_s on spaces (X_1, …, X_S) by computing a coupling measure solving
min_{π∈U((α_s)_s)} ∫_{X_1×…×X_S} c(x_1, …, x_S) dπ(x_1, …, x_S), (10.3)
where the set of couplings is
U((α_s)_s) := { π ∈ M¹₊(X_1 × … × X_S) : ∀ s = 1, …, S, P_{s,♯} π = α_s },
where P_s : X_1 × … × X_S → X_s denotes the projection on the sth factor.
subject to ∀ s = 1, . . . , S, Ps,] π̄ = αs .
This stems from the “gluing lemma,” which states that given couplings (πs )Ss=1
10.1. Multimarginal Problems 161
For instance, for c(x, y) = ‖x − y‖², one has, removing the constant squared terms,
c(x_1, …, x_S) = − Σ_{r≤s} λ_r λ_s ⟨x_r, x_s⟩,
which is a problem studied in Gangbo and Swiech [1998]. We refer to Agueh and
Carlier [2011] for more details. This formula shows that if all the input measures are discrete, β_s = Σ_{i_s=1}^{n_s} a_{s,i_s} δ_{x_{s,i_s}}, then the barycenter α is also discrete and is
where P is an optimal solution of (10.1) with cost matrix C_{i_1,…,i_S} = c(x_{i_1}, …, x_{i_S}) as defined in (10.5). Since P is a nonnegative tensor with Π_s n_s entries obtained as the solution of a linear program with Σ_s n_s − S + 1 equality constraints, an
This result and other considerations in the discrete case can be found in Anderes
et al. [2016].
can be split. The evolution with time does not necessarily define a diffeomorphism of the underlying space X. The dynamics of the fluid are obtained by minimizing as in (7.17) the energy ∫₀¹ ‖γ′(t)‖² dt of each path. The difference with OT over the
Here R is a penalization constant, chosen large enough to enforce the movement
of particles between the initial and final times, which is prescribed by a permutation
σ : ⟦n⟧ → ⟦n⟧. The resulting multimarginal problem is implemented
efficiently in conjunction with Sinkhorn iterations (10.2) using the special structure
of the cost, as detailed in [Benamou et al., 2015]. Indeed, in place of the O(n^S)
cost required to compute the denominator appearing in (10.2), one can decompose
it as a succession of S matrix-vector multiplications, hence at a low O(Sn²) cost.
Note that other solvers have been proposed, for instance, using the semidiscrete
framework shown in §5.2; see [de Goes et al., 2015, Gallouët and Mérigot, 2017].
A major bottleneck of optimal transport in its usual form is that it requires the two
input measures (α, β) to have the same total mass. While many workarounds have
been proposed (including renormalizing the input measures, or using dual norms such
as detailed in § 8.2), it is only recently that satisfying unifying theories have been
developed. We only sketch here a simple but important particular case.
Following Liero et al. [2018], to account for arbitrary positive histograms (a, b) ∈
R_+^n × R_+^m, the initial Kantorovich formulation (2.11) is "relaxed" by only penalizing
marginal deviations using some divergence D_φ, defined in (8.3). This equivalently corresponds to
10.2. Unbalanced Optimal Transport
where (τ1 , τ2 ) controls how much mass variations are penalized as opposed to trans-
portation of the mass. In the limit τ1 = τ2 → +∞, assuming Σ_i a_i = Σ_j b_j (the
“balanced” case), one recovers the original optimal transport formulation with hard
marginal constraint (2.11).
This formalism recovers many different previous works, for instance introducing
for Dϕ an `2 norm [Benamou, 2003] or an `1 norm as in partial transport [Figalli,
2010, Caffarelli and McCann, 2010]. A case of particular importance is when using
D_φ = KL the Kullback–Leibler divergence, as detailed in Remark 10.5. For this cost,
in the limit τ = τ1 = τ2 → 0, one obtains the so-called squared Hellinger distance (see
also Example 8.3)
$$ L_C^\tau(a, b) \xrightarrow{\tau \to 0} h^2(a, b) = \sum_i \big( \sqrt{a_i} - \sqrt{b_i} \big)^2. $$
Sinkhorn’s iterations (4.15) can be adapted to this problem by making use of the
generalized algorithm detailed in §4.6. The solution has the form (4.12) and the scalings
are updated as
$$ u \leftarrow \Big( \frac{a}{K v} \Big)^{\frac{\tau_1}{\tau_1 + \varepsilon}} \quad \text{and} \quad v \leftarrow \Big( \frac{b}{K^{\mathsf{T}} u} \Big)^{\frac{\tau_2}{\tau_2 + \varepsilon}}. \qquad (10.9) $$
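A minimal NumPy sketch of these unbalanced scaling iterations (10.9), assuming a discrete cost matrix C and KL marginal penalties of strengths (τ1, τ2); the function name is ours, not from any OT library:

```python
import numpy as np

def unbalanced_sinkhorn(C, a, b, eps=0.1, tau1=1.0, tau2=1.0, n_iter=1000):
    """Scaling iterations (10.9) for entropic unbalanced OT with KL marginal
    penalties; returns the coupling P = diag(u) K diag(v)."""
    K = np.exp(-C / eps)                     # Gibbs kernel
    u, v = np.ones(len(a)), np.ones(len(b))
    e1, e2 = tau1 / (tau1 + eps), tau2 / (tau2 + eps)
    for _ in range(n_iter):
        u = (a / (K @ v)) ** e1              # relaxed row update
        v = (b / (K.T @ u)) ** e2            # relaxed column update
    return u[:, None] * K * v[None, :]
```

As τ1 = τ2 → +∞ the exponents tend to 1 and the updates reduce to the standard balanced Sinkhorn iterations (4.15).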
Remark 10.4 (Generic measures). For (α, β) two arbitrary measures, the unbalanced version (also called "log-entropic") of (2.15) reads
$$ L_c^\tau(\alpha, \beta) \stackrel{\text{def.}}{=} \min_{\pi \in M_+(X \times Y)} \int_{X \times Y} c(x, y)\, d\pi(x, y) + \tau D_\varphi(P_{1, \sharp}\pi | \alpha) + \tau D_\varphi(P_{2, \sharp}\pi | \beta), $$
where divergences Dϕ between measures are defined in (8.1). In the special case
c(x, y) = kx − yk2 , Dϕ = KL, Lτc (α, β)1/2 is the Gaussian–Hellinger distance [Liero
et al., 2018], and it is shown to be a distance on M1+ (Rd ).
Remark 10.6 (Connection with dual norms). A particularly simple setup to account
for mass variation is to use dual norms, as detailed in §8.2. By choosing a compact
set B ⊂ C(X ) one obtains a norm defined on the whole space M(X ) (in particular,
the measures do not need to be positive). A particular instance of this setting is the
flat norm (8.11), which is recovered as a special instance of unbalanced transport,
when using Dϕ (α|α0 ) = kα − α0 kTV to be the total variation norm (8.9); see, for in-
stance, [Hanin, 1992, Lellmann et al., 2014]. We also refer to [Schmitzer and Wirth,
2017] for a general framework to define Wasserstein-1 unbalanced transport.
10.3. Problems with Extra Constraints on the Couplings
Figure 10.1: Influence of relaxation parameter τ on unbalanced barycenters. Top to bottom: the
evolution of the barycenter between two input measures.
Many other OT-like problems have been proposed in the literature. They typically
correspond to adding an extra constraint C on the set of feasible couplings appearing in
the original OT problem (2.15):
$$ \min_{\pi \in U(\alpha, \beta)} \Big\{ \int_{X \times Y} c(x, y)\, d\pi(x, y) : \pi \in \mathcal{C} \Big\}. \qquad (10.10) $$
Let us give two representative examples. The optimal transport with capacity con-
straint [Korman and McCann, 2015] corresponds to imposing that the density ρ_π (for
instance, with respect to the Lebesgue measure) is upper bounded:
$$ \mathcal{C} = \{ \pi : \rho_\pi \le \kappa \} \qquad (10.11) $$
for some κ > 0. This constraint rules out singular couplings localized on Monge maps.
The martingale transport problem (see, for instance, Galichon et al. [2014], Dolinsky
and Soner [2014], Tan and Touzi [2013], Beiglböck et al. [2013]), which finds many
applications in finance, imposes the so-called martingale constraint on the conditional
mean of the coupling, when X = Y = R^d:
$$ \mathcal{C} = \Big\{ \pi : \forall\, x \in \mathbb{R}^d, \ \int_{\mathbb{R}^d} y\, \frac{d\pi(x, y)}{d\alpha(x)\, d\beta(y)}\, d\beta(y) = x \Big\}. \qquad (10.12) $$
This constraint imposes that the barycentric projection map (4.20) of any admissible
coupling must be equal to the identity. For arbitrary (α, β), this set C is typically empty,
but necessary and sufficient conditions exist (α and β should be in "convex order") to
ensure C ≠ ∅, so that (α, β) satisfy a martingale constraint. This constraint can be
difficult to enforce numerically when discretizing an existing problem. It also forbids
the solution to concentrate on a single Monge map, and can lead to couplings con-
centrated on the union of several graphs (a “multivalued” Monge map), or even more
complicated support sets. Using an entropic penalization as in (4.9), one can solve ap-
proximately (10.10) using the Dykstra algorithm as explained in Benamou et al. [2015],
which is a generalization of Sinkhorn’s algorithm shown in §4.2. This requires comput-
ing the projection onto C for the KL divergence, which is straightforward for (10.11)
but cannot be done in closed form (10.12) and thus necessitates subiterations; see [Guo
and Obloj, 2017] for more details.
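As an illustration of the Dykstra approach just mentioned, here is a NumPy sketch for the capacity constraint (10.11) in the discrete entropic setting, where the density bound becomes an entrywise bound P ≤ κ: the KL projection onto this set is an entrywise minimum, while the marginal projections are the usual Sinkhorn rescalings. This is only a sketch in the spirit of Benamou et al. [2015], with our own function names:

```python
import numpy as np

def capacity_ot_dykstra(C, a, b, kappa, eps=0.1, n_cycles=2000):
    """Entropic OT with an entrywise capacity constraint P <= kappa, solved by
    Dykstra's algorithm with KL projections (sketch, our naming)."""
    def proj_rows(P):   # KL projection onto {P : P 1 = a} (row rescaling)
        return P * (a / P.sum(axis=1))[:, None]
    def proj_cols(P):   # KL projection onto {P : P^T 1 = b} (column rescaling)
        return P * (b / P.sum(axis=0))[None, :]
    def proj_cap(P):    # KL projection onto {P : P <= kappa} (entrywise min)
        return np.minimum(P, kappa)
    projs = [proj_rows, proj_cols, proj_cap]
    P = np.exp(-C / eps)                     # Gibbs kernel initialization
    Q = [np.ones_like(P) for _ in projs]     # Dykstra correction terms
    for _ in range(n_cycles):
        for k, proj in enumerate(projs):
            P_prev = P * Q[k]                # reapply the stored correction
            P = proj(P_prev)
            Q[k] = P_prev / P                # update the correction
    return P
```

The capacity bound must of course be feasible (κ ≥ max mass forced through any entry); with only the two marginal projections this reduces to plain Sinkhorn.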
One can define a distance between two measures (α, β) defined on Rd by aggregating
1-D Wasserstein distances between their projections onto all directions of the sphere.
This defines
$$ SW(\alpha, \beta)^2 \stackrel{\text{def.}}{=} \int_{\mathbb{S}^d} W_2(P_{\theta, \sharp}\alpha, P_{\theta, \sharp}\beta)^2\, d\theta, \qquad (10.13) $$
where S^d = {θ ∈ R^d : ‖θ‖ = 1} is the unit sphere of R^d, and P_θ : x ∈ R^d ↦ ⟨x, θ⟩ ∈ R is the
projection onto the direction θ. This approach is detailed in [Bonneel et al., 2015], following ideas from Marc
Bernot. It is related to the problem of Radon inversion over measure spaces [Abraham
et al., 2017].
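For two uniform point clouds of equal size, each projected 1-D transport problem in (10.13) is solved in closed form by sorting, so a Monte Carlo estimate of SW² takes a few lines of NumPy. A sketch with our own naming, replacing the integral over the sphere by an average over random directions:

```python
import numpy as np

def sliced_w2(X, Y, n_dirs=500, seed=0):
    """Monte Carlo estimate of SW(alpha, beta)^2 in (10.13) for two uniform
    point clouds X, Y of equal size: each projected 1-D OT problem is solved
    by sorting (monotone rearrangement)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_dirs):
        theta = rng.standard_normal(X.shape[1])
        theta /= np.linalg.norm(theta)          # uniform direction on the sphere
        xp, yp = np.sort(X @ theta), np.sort(Y @ theta)
        total += np.mean((xp - yp) ** 2)        # 1-D squared W2 via sorting
    return total / n_dirs
```

Each direction costs O(n log n) for the sorts, which is what makes the sliced distance attractive compared to a full d-dimensional OT solve.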
using a small enough step size τ > 0. To make the method tractable, one can use
stochastic gradient descent (SGD), replacing this integral with a discrete sum over
randomly drawn directions θ ∈ S^d (see §5.4 for more details on SGD). The flow (10.15)
can be understood as a (Lagrangian implementation of a) Wasserstein gradient flow (in
the sense of §9.3) of the function α ↦ SW(α, β)². Numerically, one finds that this flow
has no local minimizer and that it thus converges to α = β. The usefulness of the
Lagrangian solver is that, at convergence, it defines a matching (similar to a Monge
map) between the two distributions. This method has been used successfully for color
transfer and texture synthesis in [Rabin et al., 2011] and is related to the alternate
minimization approach detailed in [Pitié et al., 2007].
It is simple to extend this Lagrangian scheme to compute approximate "sliced"
barycenters of measures, by mimicking the Fréchet definition of Wasserstein barycenters (9.11) and minimizing
$$ \min_{\alpha \in M_+^1(X)} \sum_{s=1}^{S} \lambda_s\, SW(\alpha, \beta_s)^2, \qquad (10.16) $$
given a set (β_s)_{s=1}^S of fixed input measures. Using a Lagrangian discretization of the
form (10.14) for both α and the (β_s)_s, one can perform the nonconvex minimization
over the positions x = (x_i)_i,
$$ \min_x E(x) \stackrel{\text{def.}}{=} \sum_s \lambda_s E_{\beta_s}(x), \quad \text{and} \quad \nabla E(x) = \sum_s \lambda_s \nabla E_{\beta_s}(x), \qquad (10.17) $$
by gradient descent using formula (10.15) to compute ∇Eβs (x) (coupled with a random
sampling of the direction θ).
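A stochastic version of this Lagrangian descent (10.17) can be sketched as follows, drawing one random direction per step and using the sorted (monotone) matching of the projections to form the gradient. All point sets are assumed to have the same size as the barycenter cloud; names and step-size choices are ours, for illustration only:

```python
import numpy as np

def sw_barycenter(point_sets, lambdas, n_steps=3000, step=1.0, seed=0):
    """Stochastic Lagrangian minimization of sum_s lambda_s SW(alpha, beta_s)^2,
    in the spirit of (10.16)-(10.17): one random direction per step, with the
    1-D gradient given by the monotone (sorted) matching of the projections."""
    rng = np.random.default_rng(seed)
    n, d = point_sets[0].shape
    x = rng.standard_normal((n, d))          # initial positions of alpha
    for _ in range(n_steps):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)       # random direction on the sphere
        xp = x @ theta
        order = np.argsort(xp)
        grad = np.zeros_like(x)
        for lam, Y in zip(lambdas, point_sets):
            yp = np.sort(Y @ theta)
            resid = np.empty(n)
            resid[order] = xp[order] - yp    # monotone rearrangement matching
            grad += lam * resid[:, None] * theta[None, :] / n
        x -= step * grad
    return x
```

With a single input set and λ = 1, this flow drives the positions toward (a matching of) that set.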
Figure 10.2: Example of sliced barycenter computation using the Radon transform (as defined
in (10.20)). Top: barycenters α_t for S = 2 input measures and weights (λ1, λ2) = (1 − t, t). Bottom: their
Radon transforms R(α_t) (the horizontal axis being the orientation angle θ).
Each 1-D barycenter problem is easily computed using the monotone rearrangement, as
detailed in Remark 9.6. The Radon approximation α_R def.= R^+(ρ*) of a sliced barycenter
solving (9.11) is then obtained by applying the inverse Radon transform R^+. Note that in
general, α_R is not a solution to (9.11), because the Radon transform is not surjective,
so that ρ*, which is obtained as a barycenter of the Radon transforms R(β_s), does not
necessarily belong to the range of R. Numerically, however, it is in practice almost
the case [Bonneel et al., 2015]. This Radon transform formulation is very
effective for input measures and barycenters discretized on a fixed grid (e.g., a uniform
grid for images), since R as well as R^+ can be computed approximately on this grid using
fast algorithms (see, for instance, [Averbuch et al., 2001]). Figure 10.2 illustrates this
computation of barycenters (and highlights the way the Radon transforms are interpo-
lated), while Figure 10.3 shows a comparison of the Radon barycenters (10.20) and the
ones obtained by Lagrangian discretization (10.17).
Figure 10.3: Comparison of barycenters computed using Radon transform (10.20) (Eulerian dis-
cretization), Lagrangian discretization (10.17), and Wasserstein OT (computed using Sinkhorn itera-
tions (9.18)).
Euclidean and equipped with some inner product ⟨·, ·⟩ (typically V = R^d and the inner
product is the canonical one). Thanks to this inner product, vector-valued measures
are identified with the dual of continuous functions g : X → V, i.e. for any such g, one
defines its integration against the measure α as
$$ \int_X \langle g(x), d\alpha(x) \rangle \in \mathbb{R}, \qquad (10.21) $$
which is a linear operation in g and in α. A discrete measure has the form α = Σ_i a_i δ_{x_i},
where (x_i, a_i) ∈ X × V, and the integration formula (10.21) then simply reads
$$ \int_X \langle g(x), d\alpha(x) \rangle = \sum_i \langle a_i, g(x_i) \rangle \in \mathbb{R}, $$
where g(x) = (g_s(x))_{s=1}^d are the coordinates of g in the canonical basis.
treating matrices as vectors; see, for instance, [Ning and Georgiou, 2014]. Dynamical
convex formulations for OT over such a cone have been provided [Chen et al., 2016b,
Jiang et al., 2012]. Some static (Kantorovich-like) formulations have also been proposed [Ning et al., 2015, Peyré et al., 2017], but a mathematically sound theoretical
10.5. Transporting Vectors and Matrices
OT over positive matrices. A related but quite different setting is to replace discrete
measures, i.e. histograms a ∈ Σ_n, by positive matrices with unit trace, A ∈ S_+^n with
tr(A) = 1. The rationale is that the eigenvalues λ(A) ∈ Σ_n of A play the role of
a histogram, but one also has to take care of the rotations of the eigenvectors, so that
this problem is more complicated.
One can extend several divergences introduced in §8.1 to this setting. For instance,
the Bures metric (2.42) is a generalization of the Hellinger distance (defined in Re-
mark 8.3), since they are equal on positive diagonal matrices. One can also extend
the Kullback–Leibler divergence (4.6) (see also Remark 8.1), which is generalized to
positive matrices as
$$ KL(A|B) \stackrel{\text{def.}}{=} \operatorname{tr}\big( A \log(A) - A \log(B) - A + B \big), \qquad (10.22) $$
where log(·) is the matrix logarithm. This matrix KL is convex in both of its arguments.
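For symmetric positive definite matrices, (10.22) can be evaluated through an eigendecomposition-based matrix logarithm. A NumPy sketch (our naming), which reduces to the usual vector KL divergence on diagonal matrices:

```python
import numpy as np

def _logm_spd(M):
    # matrix logarithm of a symmetric positive definite matrix via eigh
    w, V = np.linalg.eigh(M)
    return (V * np.log(w)) @ V.T

def matrix_kl(A, B):
    """Matrix KL divergence (10.22), tr(A log A - A log B - A + B),
    for symmetric positive definite A, B."""
    return float(np.trace(A @ _logm_spd(A) - A @ _logm_spd(B) - A + B))
```

Nonnegativity of this quantity for positive definite arguments follows from Klein's inequality.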
It is possible to solve convex dynamic formulations to define OT distances between
such matrices [Carlen and Maas, 2014, Chen et al., 2016b, 2017]. There also exists
an equivalent of Sinkhorn’s algorithm, which is due to Gurvits [2004] and has been
extensively studied in [Georgiou and Pavon, 2015]; see also the review paper [Idel,
2016]. It is known to converge only in some cases but seems empirically to always work.
Figure 10.4: Interpolations between two input fields of positive semidefinite matrices (displayed at
times t ∈ {0, 1} using ellipses) on some domain (here, a 2-D planar square and a surface mesh), using
the method detailed in Peyré et al. [2017]. Unlike linear interpolation schemes, this OT-like method
transports the “mass” of the tensors (size of the ellipses) as well as their anisotropy and orientation.
For some applications such as shape matching, an important weakness of optimal transport distances lies in the fact that they are not invariant to important families of transformations, such as rescalings, translations or rotations. Although some nonconvex variants
of OT that handle such global transformations have been proposed [Cohen and Guibas,
1999, Pele and Taskar, 2013] and recently applied to problems such as cross-lingual
word embedding alignment [Grave et al., 2019, Alvarez-Melis et al., 2019],
these methods require specifying first a subset of invariances, possibly between
different metric spaces, to be relevant. We describe in this section a more general and
very natural extension of OT that can deal with measures defined on different spaces
without requiring the definition of a family of invariances.
see Figure 10.5. This defines a distance on the set K(Z) of compact subsets of Z, and if Z is
compact, then (K(Z), H_Z) is itself compact; see [Burago et al., 2001].
Following Mémoli [2011], one remarks that this distance between sets (A, B) can
be defined similarly to the Wasserstein distance between measures (which should be
somehow understood as "weighted" sets). One replaces the measure couplings (2.14)
by set couplings
$$ \mathcal{R}(A, B) \stackrel{\text{def.}}{=} \Big\{ R \subset A \times B : \ \forall\, a \in A, \exists\, b \in B, (a, b) \in R \ \text{ and } \ \forall\, b \in B, \exists\, a \in A, (a, b) \in R \Big\}. $$
With respect to Kantorovich problem (2.15), one should replace integration (since one
10.6. Gromov–Wasserstein Distances
Note that the support of a measure coupling π ∈ U(α, β) is a set coupling between
the supports, i.e. Supp(π) ∈ R(Supp(α), Supp(β)). The Hausdorff distance is thus
connected to the ∞-Wasserstein distance (see Remark 2.20) and one has H(A, B) ≤
W_∞(α, β) for any measures (α, β) whose supports are (A, B).
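For finite point sets, the Hausdorff distance is a direct computation on the pairwise distance matrix; a small NumPy sketch (our naming):

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between finite point sets A, B in R^d:
    H(A, B) = max( max_a min_b |a - b|, max_b min_a |a - b| )."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    # one-sided distances: farthest point of one set from the other set
    return max(D.min(axis=1).max(), D.min(axis=0).max())
```

The two one-sided terms correspond exactly to the two existential conditions defining the set couplings R(A, B) above.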
Figure 10.6: The GH approach to compare two metric spaces.
For discrete spaces X = (x_i)_{i=1}^n, Y = (y_j)_{j=1}^m represented using distance matrices
D = (d_X(x_i, x_{i'}))_{i,i'} ∈ R^{n×n} and D′ = (d_Y(y_j, y_{j'}))_{j,j'} ∈ R^{m×m}, one can rewrite this optimization
using binary matrices R ∈ {0, 1}^{n×m} indicating the support of the set couplings R as
follows:
$$ GH(D, D') = \frac{1}{2} \inf_{R \mathbb{1} > 0,\ R^{\mathsf{T}} \mathbb{1} > 0}\ \max_{(i, i', j, j')} R_{i,j}\, R_{i',j'}\, |D_{i,i'} - D'_{j,j'}|. \qquad (10.24) $$
The initial motivation of the GH distance is to define and study limits of metric spaces,
as illustrated in Figure 10.7, and we refer to [Burago et al., 2001] for details. There is an
explicit description of the geodesics for the GH distance [Chowdhury and Mémoli, 2016],
which is very similar to the one in Gromov–Wasserstein spaces, detailed in Remark 10.8.
Optimal transport needs a ground cost C to compare histograms (a, b) and thus cannot
be used if the bins of those histograms are not defined on the same underlying space,
or if one cannot preregister these spaces to define a ground cost between any pair of
bins in the first and second histograms, respectively. To address this limitation, one
can instead make a weaker assumption, namely that two matrices D ∈ R^{n×n}
and D′ ∈ R^{m×m} quantify similarity relationships between the points on which the
histograms are defined. A typical scenario is when these matrices are (powers of) distance
matrices. The GW problem reads
$$ GW((a, D), (b, D'))^2 \stackrel{\text{def.}}{=} \min_{P \in U(a, b)} \sum_{i, i', j, j'} |D_{i,i'} - D'_{j,j'}|^2\, P_{i,j}\, P_{i',j'}; \qquad (10.25) $$
see Figure 10.8. This problem is similar to the GH problem (10.24), with the maximization replaced by a sum and set couplings replaced by measure couplings. This is a nonconvex
problem, which can be recast as a quadratic assignment problem [Loiola et al., 2007]
and is in full generality NP-hard to solve for arbitrary inputs. It is in fact equivalent
to a graph matching problem [Lyzinski et al., 2016] for a particular cost.
One can show that GW satisfies the triangle inequality, and in fact it defines a
distance between metric spaces equipped with a probability distribution, here assumed
to be discrete in definition (10.25), up to isometries preserving the measures. This dis-
tance was introduced and studied in detail by Mémoli [2011]. An in-depth mathematical
exposition (in particular, its geodesic structure and gradient flows) is given in [Sturm,
2012]. See also [Schmitzer and Schnörr, 2013a] for applications in computer vision. This
distance is also tightly connected with the GH distance [Gromov, 2001] between metric
spaces, which have been used for shape matching [Mémoli, 2007, Bronstein et al., 2010].
$$ \min_{\pi \in U(\alpha_X, \alpha_Y)} \int_{X^2 \times Y^2} |d_X(x, x') - d_Y(y, y')|^2\, d\pi(x, y)\, d\pi(x', y'). \qquad (10.26) $$
This formula allows one to define and analyze gradient flows which minimize func-
tionals involving metric spaces; see Sturm [2012]. It is, however, difficult to handle
numerically, because it involves computations over the product space X0 × X1 . A
heuristic approach is used in [Peyré et al., 2016] to define geodesics and barycenters
of metric measure spaces while imposing the cardinality of the involved spaces and
making use of the entropic smoothing (10.27) detailed below.
As proposed initially in [Gold and Rangarajan, 1996, Rangarajan et al., 1999], and later
revisited in [Solomon et al., 2016a] for applications in graphics, one can use Sinkhorn's
algorithm iteratively to progressively compute a stationary point of (10.27). Indeed,
successive linearizations of the objective function lead one to consider the succession of
updates
$$ P^{(\ell+1)} \stackrel{\text{def.}}{=} \operatorname*{argmin}_{P \in U(a, b)}\ \langle P, C^{(\ell)} \rangle - \varepsilon H(P), \qquad (10.28) $$
where C^{(ℓ)} denotes the cost obtained by linearizing the quadratic GW objective at the
current iterate P^{(ℓ)}, so that each update amounts to a regularized OT problem solved
with Sinkhorn iterations.
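A NumPy sketch of these entropic GW iterations for the squared loss, with our own naming: each outer step forms the linearized cost through three matrix products (avoiding the four-index sum) and then runs Sinkhorn, as in (10.28).

```python
import numpy as np

def entropic_gw(D, D2, a, b, eps=0.1, n_outer=50, n_inner=1000):
    """Entropic GW iterations (10.28) for the squared loss: each outer step
    linearizes the quadratic objective at the current P and solves the
    resulting entropic OT problem with Sinkhorn."""
    P = np.outer(a, b)                       # initialization P^(0) = a (x) b
    for _ in range(n_outer):
        # linearized cost C[i,j] = sum_{i',j'} (D[i,i'] - D2[j,j'])^2 P[i',j'],
        # expanded into matrix products to avoid the four-index sum
        C = ((D ** 2) @ P.sum(axis=1))[:, None] \
            + ((D2 ** 2) @ P.sum(axis=0))[None, :] \
            - 2 * D @ P @ D2.T
        K = np.exp(-C / eps)                 # one Sinkhorn solve per step
        u, v = np.ones_like(a), np.ones_like(b)
        for _ in range(n_inner):
            u = a / (K @ v)
            v = b / (K.T @ u)
        P = u[:, None] * K * v[None, :]
    return P
```

On generic inputs one often observes the coupling concentrating near an (iso)matching of the two spaces, but since the problem is nonconvex, only a stationary point is guaranteed.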
Figure 10.9: Iterations of the entropic GW algorithm (10.28) between two shapes (x_i)_i and (y_j)_j in
R², initialized with P^(0) = a ⊗ b. The distance matrices are D_{i,i'} = ‖x_i − x_{i'}‖ and D′_{j,j'} = ‖y_j − y_{j'}‖.
Top row: coupling P^(ℓ) displayed as a 2-D image. Bottom row: matching induced by P^(ℓ) (each point
x_i is connected to the three y_j with the three largest values among {P^{(ℓ)}_{i,j}}_j). The shapes have the same
size, but for display purposes, the inner shape (x_i)_i has been reduced.
Figure 10.10: Example of fuzzy correspondences computed by solving GW problem (10.27) with
Sinkhorn iterations (10.28). Extracted from [Solomon et al., 2016a].
Acknowledgements
We would like to thank the many colleagues, collaborators and students who have
helped us at various stages when preparing this survey. Some of their inputs have
shaped this work, and we would like to thank in particular Jean-David Benamou,
Yann Brenier, Guillaume Carlier, Vincent Duval and the entire MOKAPLAN team at
Inria; Francis Bach, Espen Bernton, Mathieu Blondel, Nicolas Courty, Rémi Flamary,
Alexandre Gramfort, Young-Heon Kim, Daniel Matthes, Philippe Rigollet, Filippo San-
tambrogio, Justin Solomon, Jonathan Weed; as well as the feedback by our current and
former students on these subjects, in particular Gwendoline de Bie, Lénaïc Chizat,
Aude Genevay, Hicham Janati, Théo Lacombe, Boris Muzellec, Francois-Pierre Paty,
Vivien Seguy.
References
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro,
Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: large-scale
machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467,
2016.
Isabelle Abraham, Romain Abraham, Maïtine Bergounioux, and Guillaume Carlier. Tomographic reconstruction from a few views: a multi-marginal optimal transport approach. Applied Mathematics & Optimization, 75(1):55–73, 2017.
Ryan Prescott Adams and Richard S Zemel. Ranking via Sinkhorn propagation. arXiv preprint arXiv:1106.1925, 2011.
Martial Agueh and Malcolm Bowles. One-dimensional numerical algorithms for gradient flows
in the p-Wasserstein spaces. Acta Applicandae Mathematicae, 125(1):121–134, 2013.
Martial Agueh and Guillaume Carlier. Barycenters in the Wasserstein space. SIAM Journal
on Mathematical Analysis, 43(2):904–924, 2011.
Martial Agueh and Guillaume Carlier. Vers un théorème de la limite centrale dans l’espace de
Wasserstein? Comptes Rendus Mathematique, 355(7):812–818, 2017.
Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermüller, Dzmitry Bahdanau,
and Nicolas Ballas et al. Theano: A python framework for fast computation of mathematical
expressions. CoRR, abs/1605.02688, 2016.
Syed Mumtaz Ali and Samuel D Silvey. A general class of coefficients of divergence of one
distribution from another. Journal of the Royal Statistical Society. Series B (Methodological),
28(1):131–142, 1966.
Zeyuan Allen-Zhu, Yuanzhi Li, Rafael Oliveira, and Avi Wigderson. Much faster algorithms for
matrix scaling. arXiv preprint arXiv:1704.02315, 2017.
Jason Altschuler, Jonathan Weed, and Philippe Rigollet. Near-linear time approximation al-
gorithms for optimal transport via Sinkhorn iteration. arXiv preprint arXiv:1705.09634,
2017.
Heinz H Bauschke and Patrick L Combettes. Convex analysis and monotone operator theory
in Hilbert spaces. Springer-Verlag, New York, 2011.
Heinz H Bauschke and Adrian S Lewis. Dykstra’s algorithm with Bregman projections: a
convergence proof. Optimization, 48(4):409–427, 2000.
Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods
for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
Martin Beckmann. A continuous model of transportation. Econometrica, 20:643–660, 1952.
Mathias Beiglböck, Pierre Henry-Labordère, and Friedrich Penkner. Model-independent bounds
for option prices: a mass transport approach. Finance and Stochastics, 17(3):477–501, 2013.
Jan Beirlant, Edward J Dudewicz, Laszlo Gyorfi, and Edward C Van der Meulen. Nonparamet-
ric entropy estimation: an overview. International Journal of Mathematical and Statistical
Sciences, 6(1):17–39, 1997.
Jean-David Benamou. Numerical resolution of an “unbalanced” mass transport problem.
ESAIM: Mathematical Modelling and Numerical Analysis, 37(05):851–868, 2003.
Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to the
Monge-Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–393, 2000.
Jean-David Benamou and Guillaume Carlier. Augmented lagrangian methods for transport
optimization, mean field games and degenerate elliptic equations. Journal of Optimization
Theory and Applications, 167(1):1–26, 2015.
Jean-David Benamou, Brittany D Froese, and Adam M Oberman. Numerical solution of the op-
timal transportation problem using the Monge–Ampere equation. Journal of Computational
Physics, 260:107–126, 2014.
Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. It-
erative Bregman projections for regularized transportation problems. SIAM Journal on Sci-
entific Computing, 37(2):A1111–A1138, 2015.
Jean-David Benamou, Guillaume Carlier, Quentin Mérigot, and Edouard Oudet. Discretization
of functionals involving the Monge–Ampère operator. Numerische Mathematik, 134(3):611–
636, 2016a.
Jean-David Benamou, Francis Collino, and Jean-Marie Mirebeau. Monotone and consistent
discretization of the Monge-Ampere operator. Mathematics of Computation, 85(302):2743–
2775, 2016b.
Christian Berg, Jens Peter Reus Christensen, and Paul Ressel. Harmonic Analysis on Semi-
groups. Number 100 in Graduate Texts in Mathematics. Springer Verlag, 1984.
Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability
and Statistics. Kluwer Academic Publishers, 2003.
Espen Bernton. Langevin Monte Carlo and JKO splitting. In Sébastien Bubeck, Vianney
Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference On Learning The-
ory, volume 75 of Proceedings of Machine Learning Research, pages 1777–1798. PMLR, 2018.
Espen Bernton, Pierre E Jacob, Mathieu Gerber, and Christian P Robert. Inference in gener-
ative models using the Wasserstein distance. arXiv preprint arXiv:1701.05146, 2017.
Dimitri P Bertsekas. A new algorithm for the assignment problem. Mathematical Programming,
21(1):152–171, 1981.
Dimitri P Bertsekas. Auction algorithms for network flow problems: a tutorial introduction.
Computational Optimization and Applications, 1(1):7–66, 1992.
Dimitri P Bertsekas. Network Optimization: Continuous and Discrete Models. Athena Scientific,
1998.
Dimitri P Bertsekas and Jonathan Eckstein. Dual coordinate step methods for linear network
flow problems. Mathematical Programming, 42(1):203–243, 1988.
Dimitris Bertsimas and John N Tsitsiklis. Introduction to Linear Optimization. Athena Scien-
tific, 1997.
Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. On the Bures–Wasserstein distance between positive definite matrices. Expositiones Mathematicae, to appear, 2018.
Jérémie Bigot and Thierry Klein. Consistent estimation of a population barycenter in the
Wasserstein space. arXiv Preprint arXiv:1212.2562, 2012a.
Jérémie Bigot and Thierry Klein. Characterization of barycenters in the Wasserstein space by
averaging optimal transport maps. arXiv preprint arXiv:1212.2562, 2012b.
Jérémie Bigot, Elsa Cazelles, and Nicolas Papadakis. Central limit theorems for Sinkhorn divergence between probability distributions on finite spaces and statistical applications. arXiv preprint arXiv:1711.08947, 2017a.
Jérémie Bigot, Raúl Gouet, Thierry Klein, and Alfredo López. Geodesic PCA in the Wasserstein space by convex PCA. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 53(1):1–26, 2017b.
Garrett Birkhoff. Tres observaciones sobre el algebra lineal. Universidad Nacional de Tucumán
Revista Series A, 5:147–151, 1946.
Garrett Birkhoff. Extensions of Jentzsch's theorem. Transactions of the American Mathematical Society, 85(1):219–227, 1957.
Adrien Blanchet and Guillaume Carlier. Optimal transport and Cournot-Nash equilibria. Math-
ematics of Operations Research, 41(1):125–145, 2015.
Adrien Blanchet, Vincent Calvez, and José A Carrillo. Convergence of the mass-transport
steepest descent scheme for the subcritical Patlak-Keller-Segel model. SIAM Journal on
Numerical Analysis, 46(2):691–721, 2008.
Emmanuel Boissard. Simple bounds for the convergence of empirical and occupation measures
in 1-Wasserstein distance. Electronic Journal of Probability, 16:2296–2333, 2011.
Emmanuel Boissard, Thibaut Le Gouic, and Jean-Michel Loubes. Distribution’s template esti-
mate with Wasserstein metrics. Bernoulli, 21(2):740–759, 2015.
François Bolley, Arnaud Guillin, and Cédric Villani. Quantitative concentration inequalities for empirical measures on non-compact spaces. Probability Theory and Related Fields, 137(3):541–593, 2007.
Nicolas Bonneel, Michiel Van De Panne, Sylvain Paris, and Wolfgang Heidrich. Displacement interpolation using Lagrangian mass transport. ACM Transactions on Graphics, 30(6):158, 2011.
Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. Sliced and Radon Wasser-
stein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45,
2015.
Nicolas Bonneel, Gabriel Peyré, and Marco Cuturi. Wasserstein barycentric coordinates: his-
togram regression using optimal transport. ACM Transactions on Graphics, 35(4):71:1–71:10,
2016.
CW Borchardt and CGJ Jacobi. De investigando ordine systematis aequationum differentialium vulgarium cujuscunque. Journal für die reine und angewandte Mathematik, 64:297–320, 1865.
Ingwer Borg and Patrick JF Groenen. Modern Multidimensional Scaling: Theory and Applica-
tions. Springer Science & Business Media, 2005.
Mario Botsch, Leif Kobbelt, Mark Pauly, Pierre Alliez, and Bruno Lévy. Polygon mesh process-
ing. Taylor & Francis, 2010.
Olivier Bousquet, Sylvain Gelly, Ilya Tolstikhin, Carl-Johann Simon-Gabriel, and Bernhard
Schoelkopf. From optimal transport to generative modeling: the VEGAN cookbook. arXiv
preprint arXiv:1705.07642, 2017.
Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed
optimization and statistical learning via the alternating direction method of multipliers.
Foundations and Trends in Machine Learning, 3(1):1–122, January 2011.
Lev M Bregman. The relaxation method of finding the common point of convex sets and
its application to the solution of problems in convex programming. USSR Computational
Mathematics and Mathematical Physics, 7(3):200–217, 1967.
Yann Brenier. Décomposition polaire et réarrangement monotone des champs de vecteurs. C.
R. Acad. Sci. Paris Sér. I Math., 305(19):805–808, 1987.
Yann Brenier. The least action principle and the related concept of generalized flows for in-
compressible perfect fluids. Journal of the AMS, 2:225–255, 1990.
Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions.
Communications on Pure and Applied Mathematics, 44(4):375–417, 1991.
Yann Brenier. The dual least action problem for an ideal, incompressible fluid. Archive for
Rational Mechanics and Analysis, 122(4):323–351, 1993.
Yann Brenier. Minimal geodesics on groups of volume-preserving maps and generalized solutions
of the Euler equations. Communications on Pure and Applied Mathematics, 52(4):411–452,
1999.
Yann Brenier. Generalized solutions and hydrostatic approximation of the Euler equations.
Physica D. Nonlinear Phenomena, 237(14-17):1982–1988, 2008.
Alexander M Bronstein, Michael M Bronstein, and Ron Kimmel. Generalized multidimensional
scaling: a framework for isometry-invariant partial surface matching. Proceedings of the
National Academy of Sciences, 103(5):1168–1172, 2006.
Alexander M Bronstein, Michael M Bronstein, Ron Kimmel, Mona Mahmoudi, and Guillermo
Sapiro. A Gromov-Hausdorff framework with diffusion geometry for topologically-robust non-
rigid shape matching. International Journal on Computer Vision, 89(2-3):266–286, 2010.
Richard A Brualdi. Combinatorial Matrix Classes, volume 108. Cambridge University Press,
2006.
Dmitri Burago, Yuri Burago, and Sergei Ivanov. A Course in Metric Geometry, volume 33.
American Mathematical Society Providence, RI, 2001.
Donald Bures. An extension of Kakutani’s theorem on infinite product measures to the tensor
product of semifinite w∗ -algebras. Transactions of the American Mathematical Society, 135:
199–212, 1969.
Martin Burger, José Antonio Carrillo de la Plata, and Marie-Therese Wolfram. A mixed finite
element method for nonlinear diffusion equations. Kinetic and Related Models, 3(1):59–83,
2010.
Martin Burger, Marzena Franek, and Carola-Bibiane Schönlieb. Regularised regression and
density estimation based on optimal transport. Applied Mathematics Research Express, 2:
209–253, 2012.
Giuseppe Buttazzo, Luigi De Pascale, and Paola Gori-Giorgi. Optimal-transport formulation
of electronic density-functional theory. Physical Review A, 85(6):062502, 2012.
Luis Caffarelli. The Monge-Ampere equation and optimal transportation, an elementary review.
Lecture Notes in Mathematics, Springer-Verlag, pages 1–10, 2003.
Luis Caffarelli, Mikhail Feldman, and Robert McCann. Constructing optimal maps for Monge’s
transport problem as a limit of strictly convex costs. Journal of the American Mathematical
Society, 15(1):1–26, 2002.
Luis A Caffarelli and Robert J McCann. Free boundaries in optimal transport and Monge-
Ampère obstacle problems. Annals of Mathematics, 171(2):673–730, 2010.
Luis A Caffarelli, Sergey A Kochengin, and Vladimir I Oliker. Problem of reflector design with
given far-field scattering data. In Monge Ampère equation: applications to geometry and
optimization, volume 226, page 13, 1999.
Guillermo Canas and Lorenzo Rosasco. Learning probability measures with respect to optimal
transport metrics. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors,
Advances in Neural Information Processing Systems 25, pages 2492–2500. 2012.
Eric A Carlen and Jan Maas. An analog of the 2-Wasserstein metric in non-commutative prob-
ability under which the fermionic Fokker–Planck equation is gradient flow for the entropy.
Communications in Mathematical Physics, 331(3):887–926, 2014.
Guillaume Carlier and Ivar Ekeland. Matching for teams. Economic Theory, 42(2):397–418,
2010.
Guillaume Carlier and Clarice Poon. On the total variation Wasserstein gradient flow and the
TV-JKO scheme. To appear in ESAIM: COCV, 2019.
Guillaume Carlier, Chloé Jimenez, and Filippo Santambrogio. Optimal transportation with
traffic congestion and Wardrop equilibria. SIAM Journal on Control and Optimization, 47
(3):1330–1350, 2008.
Guillaume Carlier, Alfred Galichon, and Filippo Santambrogio. From Knothe's transport to
Brenier's map and a continuation method for optimal transport. SIAM Journal on Mathematical
Analysis, 41(6):2554–2576, 2010.
Guillaume Carlier, Adam Oberman, and Edouard Oudet. Numerical methods for matching
for teams and Wasserstein barycenters. ESAIM: Mathematical Modelling and Numerical
Analysis, 49(6):1621–1642, 2015.
Guillaume Carlier, Victor Chernozhukov, and Alfred Galichon. Vector quantile regression be-
yond correct specification. arXiv preprint arXiv:1610.06833, 2016.
Guillaume Carlier, Vincent Duval, Gabriel Peyré, and Bernhard Schmitzer. Convergence of
entropic schemes for optimal transport and gradient flows. SIAM Journal on Mathematical
Analysis, 49(2):1385–1418, 2017.
José A Carrillo and J Salvador Moll. Numerical simulation of diffusive and aggregation phe-
nomena in nonlinear continuity equations by evolving diffeomorphisms. SIAM Journal on
Scientific Computing, 31(6):4305–4329, 2009.
José A Carrillo, Alina Chertock, and Yanghong Huang. A finite-volume method for nonlin-
ear nonlocal equations with a gradient flow structure. Communications in Computational
Physics, 17:233–258, 1 2015.
Yair Censor and Simeon Reich. The Dykstra algorithm with Bregman projections. Communi-
cations in Applied Analysis, 2:407–419, 1998.
Yair Censor and Stavros Andrea Zenios. Proximal minimization algorithm with d-functions.
Journal of Optimization Theory and Applications, 73(3):451–464, 1992.
Thierry Champion, Luigi De Pascale, and Petri Juutinen. The ∞-Wasserstein distance: local
solutions and existence of optimal transport maps. SIAM Journal on Mathematical Analysis,
40(1):1–20, 2008.
Timothy M Chan. Optimal output-sensitive convex hull algorithms in two and three dimensions.
Discrete & Computational Geometry, 16(4):361–368, 1996.
Yongxin Chen, Tryphon T Georgiou, and Michele Pavon. On the relation between optimal
transport and Schrödinger bridges: A stochastic control viewpoint. Journal of Optimization
Theory and Applications, 169(2):671–691, 2016a.
Yongxin Chen, Tryphon T Georgiou, and Allen Tannenbaum. Matrix optimal mass transport:
a quantum mechanical approach. arXiv preprint arXiv:1610.03041, 2016b.
Yongxin Chen, Wilfrid Gangbo, Tryphon T Georgiou, and Allen Tannenbaum. On the matrix
Monge-Kantorovich problem. arXiv preprint arXiv:1701.02826, 2017.
Lénaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Unbalanced
optimal transport: geometry and Kantorovich formulation. Journal of Functional Analysis,
274(11):3090–3123, 2018a.
Lénaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Scaling
algorithms for unbalanced transport problems. Mathematics of Computation, 87:2563–2609,
2018b.
Lénaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. An
interpolating distance between optimal transport and Fisher–Rao metrics. Foundations of
Computational Mathematics, 18(1):1–44, 2018c.
Shui-Nee Chow, Wen Huang, Yao Li, and Haomin Zhou. Fokker-Planck equations for a free en-
ergy functional or Markov process on a graph. Archive for Rational Mechanics and Analysis,
203(3):969–1008, 2012.
Shui-Nee Chow, Wuchen Li, and Haomin Zhou. A discrete Schrödinger equation via optimal
transport on graphs. arXiv preprint arXiv:1705.07583, 2017a.
Shui-Nee Chow, Wuchen Li, and Haomin Zhou. Entropy dissipation of Fokker-Planck equations
on graphs. arXiv preprint arXiv:1701.04841, 2017b.
Samir Chowdhury and Facundo Mémoli. Constructing geodesics on the space of compact metric
spaces. arXiv preprint arXiv:1603.02385, 2016.
Haili Chui and Anand Rangarajan. A new algorithm for non-rigid point matching. In Computer
Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, volume 2, pages
44–51. IEEE, 2000.
Imre Csiszár. Information-type measures of difference of probability distributions and indirect
observations. Studia Scientiarum Mathematicarum Hungarica, 2:299–318, 1967.
Michael B Cohen, Aleksander Madry, Dimitris Tsipras, and Adrian Vladu. Matrix scaling and
balancing via box constrained Newton’s method and interior point methods. arXiv preprint
arXiv:1704.02310, 2017.
Scott Cohen and Leonidas Guibas. The earth mover’s distance under transformation sets. In
Proceedings of the Seventh IEEE International Conference on Computer vision, volume 2,
pages 1076–1083. IEEE, 1999.
Patrick L Combettes and Jean-Christophe Pesquet. A Douglas-Rachford splitting approach to
nonsmooth convex variational signal recovery. IEEE Journal of Selected Topics in Signal
Processing, 1(4):564 –574, 2007.
Roberto Cominetti and Jaime San Martín. Asymptotic analysis of the exponential penalty
trajectory in linear programming. Mathematical Programming, 67(1-3):169–187, 1994.
Laurent Condat. Fast projection onto the simplex and the ℓ1 ball. Mathematical Programming,
Series A, pages 1–11, 2015.
Sueli IR Costa, Sandra A Santos, and João E Strapasson. Fisher information distance: a
geometrical reading. Discrete Applied Mathematics, 197:59–69, 2015.
Codina Cotar, Gero Friesecke, and Claudia Klüppelberg. Density functional theory and optimal
transportation with Coulomb cost. Communications on Pure and Applied Mathematics, 66
(4):548–599, 2013.
Nicolas Courty, Rémi Flamary, Devis Tuia, and Thomas Corpetti. Optimal transport for data
fusion in remote sensing. In 2016 IEEE International Geoscience and Remote Sensing Sym-
posium, pages 3571–3574. IEEE, 2016.
Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribu-
tion optimal transportation for domain adaptation. In I. Guyon, U. V. Luxburg, S. Bengio,
H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Infor-
mation Processing Systems 30, pages 3730–3739. 2017a.
Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for
domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39
(9):1853–1865, 2017b.
Keenan Crane, Clarisse Weischedel, and Max Wardetzky. Geodesics in heat: a new approach to
computing distance based on heat flow. ACM Transaction on Graphics, 32(5):152:1–152:11,
October 2013.
Juan Antonio Cuesta and Carlos Matran. Notes on the Wasserstein metric in Hilbert spaces.
The Annals of Probability, 17(3):1264–1276, 07 1989.
Marco Cuturi. Positivity and transportation. arXiv preprint 1209.2655, 2012.
Marco Cuturi. Sinkhorn distances: lightspeed computation of optimal transport. In Advances
in Neural Information Processing Systems 26, pages 2292–2300, 2013.
Marco Cuturi and David Avis. Ground metric learning. Journal of Machine Learning Research,
15:533–564, 2014.
Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In Proceedings
of ICML, volume 32, pages 685–693, 2014.
Marco Cuturi and Kenji Fukumizu. Kernels on structured objects through nested histograms.
In P. B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information
Processing Systems 19, pages 329–336. MIT Press, 2007.
Marco Cuturi and Gabriel Peyré. A smoothed dual approach for variational Wasserstein prob-
lems. SIAM Journal on Imaging Sciences, 9(1):320–343, 2016.
Marco Cuturi and Gabriel Peyré. Semidual regularized optimal transport. SIAM Review, 60
(4):941–965, 2018.
Arnak Dalalyan. Further and stronger analogy between sampling and optimization: Langevin
Monte Carlo and gradient descent. In Proceedings of the 2017 Conference on Learning Theory,
volume 65 of Proceedings of Machine Learning Research, pages 678–689. PMLR, 2017.
Arnak S Dalalyan and Avetik G Karagulyan. User-friendly guarantees for the Langevin Monte
Carlo with inaccurate gradient. arXiv preprint arXiv:1710.00095, 2017.
George B. Dantzig. Programming of interdependent activities: II mathematical model. Econo-
metrica, 17(3/4):200–211, 1949.
George B Dantzig. Application of the simplex method to a transportation problem. Activity
Analysis of Production and Allocation, 13:359–373, 1951.
George B. Dantzig. Reminiscences about the origins of linear programming, pages 78–86.
Springer, 1983.
George B. Dantzig. Linear programming. In J. K. Lenstra, A. H. G. Rinnooy Kan, and A. Schri-
jver, editors, History of mathematical programming: a collection of personal reminiscences,
pages 257–282. Elsevier Science Publishers, 1991.
Jon Dattorro. Convex Optimization & Euclidean Distance Geometry. Meboo Publishing, 2017.
Fernando de Goes, Katherine Breeden, Victor Ostromoukhov, and Mathieu Desbrun. Blue
noise through optimal transport. ACM Transactions on Graphics, 31(6):171, 2012.
Fernando de Goes, Corentin Wallez, Jin Huang, Dmitry Pavlov, and Mathieu Desbrun. Power
particles: an incompressible fluid solver based on power diagrams. ACM Transaction Graph-
ics, 34(4):50:1–50:11, July 2015.
Eustasio del Barrio, JA Cuesta-Albertos, C Matrán, and A Mayo-Íscar. Robust clustering tools
based on optimal transportation. arXiv preprint arXiv:1607.01179, 2016.
Julie Delon. Midway image equalization. Journal of Mathematical Imaging and Vision, 21(2):
119–134, 2004.
Julie Delon, Julien Salomon, and Andrei Sobolevski. Fast transport optimization for Monge
costs on the circle. SIAM Journal on Applied Mathematics, 70(7):2239–2258, 2010.
Julie Delon, Julien Salomon, and Andrei Sobolevski. Local matching indicators for transport
problems with concave costs. SIAM Journal on Discrete Mathematics, 26(2):801–827, 2012.
W. Edwards Deming and Frederick F Stephan. On a least squares adjustment of a sampled
frequency table when the expected marginal totals are known. Annals of Mathematical Statistics,
11(4):427–444, 1940.
Steffen Dereich, Michael Scheutzow, and Reik Schottstedt. Constructive quantization: Ap-
proximation by empirical measures. In Annales de l’Institut Henri Poincaré, Probabilités et
Statistiques, volume 49, pages 1183–1203, 2013.
Rachid Deriche. Recursively implementating the Gaussian and its derivatives. PhD thesis,
INRIA, 1993.
Arnaud Dessein, Nicolas Papadakis, and Charles-Alban Deledalle. Parameter estimation in
finite mixture models by regularized optimal transport: a unified framework for hard and
soft clustering. arXiv preprint arXiv:1711.04366, 2017.
Arnaud Dessein, Nicolas Papadakis, and Jean-Luc Rouas. Regularized optimal transport and
the rot mover’s distance. Journal of Machine Learning Research, 19(15):1–53, 2018.
Simone Di Marino and Lénaic Chizat. A tumor growth model of Hele-Shaw type as a gradient
flow. arXiv preprint, 2017.
Khanh Do Ba, Huy L Nguyen, Huy N Nguyen, and Ronitt Rubinfeld. Sublinear time algorithms
for earth mover’s distance. Theory of Computing Systems, 48(2):428–442, 2011.
Jean Dolbeault, Bruno Nazaret, and Giuseppe Savaré. A new class of transport distances
between measures. Calculus of Variations and Partial Differential Equations, 34(2):193–231,
2009.
Yan Dolinsky and H Mete Soner. Martingale optimal transport and robust hedging in continuous
time. Probability Theory and Related Fields, 160(1-2):391–427, 2014.
Richard M. Dudley. The speed of mean Glivenko-Cantelli convergence. Annals of Mathematical
Statistics, 40(1):40–50, 1969.
Arnaud Dupuy, Alfred Galichon, and Yifei Sun. Estimating matching affinity matrix under
low-rank constraints. arXiv preprint arXiv:1612.09585, 2016.
Pavel Dvurechenskii, Darina Dvinskikh, Alexander Gasnikov, Cesar Uribe, and Angelia Nedich.
Decentralize and randomize: Faster algorithm for Wasserstein barycenters. In S. Bengio,
H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances
in Neural Information Processing Systems 31, pages 10783–10793. 2018.
Pavel Dvurechensky, Alexander Gasnikov, and Alexey Kroshnin. Computational optimal
transport: Complexity by accelerated gradient descent is better than by Sinkhorn's algorithm. In
Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference
on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1367–
1376. PMLR, 2018.
Richard L Dykstra. An algorithm for restricted least squares regression. Journal American
Statistical Association, 78(384):839–842, 1983.
Richard L Dykstra. An iterative procedure for obtaining I-projections onto the intersection of
convex sets. Annals of Probability, 13(3):975–984, 1985.
Jonathan Eckstein and Dimitri P Bertsekas. On the Douglas-Rachford splitting method and
the proximal point algorithm for maximal monotone operators. Mathematical Programming,
55:293–318, 1992.
David A Edwards. The structure of superspace. In Studies in topology, pages 121–133. Elsevier,
1975.
Tarek A El Moselhy and Youssef M Marzouk. Bayesian inference with optimal maps. Journal
of Computational Physics, 231(23):7815–7850, 2012.
Dominik Maria Endres and Johannes E Schindelin. A new metric for probability distributions.
IEEE Transactions on Information theory, 49(7):1858–1860, 2003.
Matthias Erbar. The heat equation on manifolds as a gradient flow in the Wasserstein space.
Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 46(1):1–23, 2010.
Matthias Erbar and Jan Maas. Gradient flow structures for discrete porous medium equations.
Discrete and Continuous Dynamical Systems, 34(4):1355–1374, 2014.
Sven Erlander. Optimal Spatial Interaction and the Gravity Model, volume 173. Springer-Verlag,
1980.
Sven Erlander and Neil F Stewart. The Gravity Model in Transportation Analysis: Theory and
Extensions. 1990.
Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization
using the Wasserstein metric: Performance guarantees and tractable reformulations.
Mathematical Programming, 171(1-2):115–166, 2018.
Montacer Essid and Justin Solomon. Quadratically-regularized optimal transport on graphs.
arXiv preprint arXiv:1704.08200, 2017.
Lawrence C. Evans and Wilfrid Gangbo. Differential Equations Methods for the Monge-
Kantorovich Mass Transfer Problem, volume 653. American Mathematical Society, 1999.
Mikhail Feldman and Robert McCann. Monge’s transport problem on a Riemannian manifold.
Transaction AMS, 354(4):1667–1697, 2002.
Jean Feydy, Benjamin Charlier, Francois-Xavier Vialard, and Gabriel Peyré. Optimal transport
for diffeomorphic registration. In Proceedings of MICCAI’17, pages 291–299. Springer, 2017.
Jean Feydy, Thibault Séjourné, François-Xavier Vialard, Shun-Ichi Amari, Alain Trouvé, and
Gabriel Peyré. Interpolating between optimal transport and MMD using Sinkhorn divergences.
In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics,
2019.
Alessio Figalli. The optimal partial transport problem. Archive for Rational Mechanics and
Analysis, 195(2):533–560, 2010.
Rémi Flamary, Cédric Févotte, Nicolas Courty, and Valentin Emiya. Optimal spectral trans-
portation with application to music transcription. In Advances in Neural Information Pro-
cessing Systems, pages 703–711, 2016.
Lester Randolph Ford and Delbert Ray Fulkerson. Flows in Networks. Princeton University
Press, 1962.
Peter J Forrester and Mario Kieburg. Relating the Bures measure to the Cauchy two-matrix
model. Communications in Mathematical Physics, 342(1):151–187, 2016.
Nicolas Fournier and Arnaud Guillin. On the rate of convergence in Wasserstein distance of
the empirical measure. Probability Theory and Related Fields, 162(3-4):707–738, 2015.
Joel Franklin and Jens Lorenz. On the scaling of multidimensional matrices. Linear Algebra
and its Applications, 114:717–735, 1989.
Uriel Frisch, Sabino Matarrese, Roya Mohayaee, and Andrei Sobolevski. A reconstruction of
the initial conditions of the universe by optimal mass transportation. Nature, 417(6886):
260–262, 2002.
Brittany D Froese and Adam M Oberman. Convergent finite difference solvers for viscosity
solutions of the elliptic Monge–Ampère equation in dimensions two and higher. SIAM Journal
on Numerical Analysis, 49(4):1692–1714, 2011.
Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A Poggio.
Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems,
pages 2053–2061, 2015.
Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational
problems via finite element approximation. Computers & Mathematics with Applications, 2
(1):17–40, 1976.
Alfred Galichon. Optimal Transport Methods in Economics. Princeton University Press, 2016.
Alfred Galichon and Bernard Salanié. Matching with trade-offs: revealed preferences over com-
peting characteristics. Technical report, Preprint SSRN-1487307, 2009.
Alfred Galichon, Pierre Henry-Labordère, and Nizar Touzi. A stochastic control approach to
no-arbitrage bounds given marginals, with an application to lookback options. Annals of
Applied Probability, 24(1):312–336, 2014.
Thomas O Gallouët and Quentin Mérigot. A Lagrangian scheme à la Brenier for the
incompressible Euler equations. Foundations of Computational Mathematics, 18:1–31, 2017.
Thomas O Gallouët and Leonard Monsaingeon. A JKO splitting scheme for Kantorovich–
Fisher–Rao gradient flows. SIAM Journal on Mathematical Analysis, 49(2):1100–1130, 2017.
Wilfrid Gangbo and Robert J McCann. The geometry of optimal transportation. Acta Mathe-
matica, 177(2):113–161, 1996.
Wilfrid Gangbo and Andrzej Swiech. Optimal maps for the multidimensional Monge-
Kantorovich problem. Communications on Pure and Applied Mathematics, 51(1):23–45, 1998.
Rui Gao, Liyan Xie, Yao Xie, and Huan Xu. Robust hypothesis testing using Wasserstein
uncertainty sets. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and
R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7913–7923.
2018.
Matthias Gelbrich. On a formula for the L2 Wasserstein metric between measures on Euclidean
and Hilbert spaces. Mathematische Nachrichten, 147(1):185–203, 1990.
Aude Genevay, Marco Cuturi, Gabriel Peyré, and Francis Bach. Stochastic optimization for
large-scale optimal transport. In Advances in Neural Information Processing Systems, pages
3440–3448, 2016.
Aude Genevay, Gabriel Peyré, and Marco Cuturi. GAN and VAE from an optimal transport
point of view. (arXiv preprint arXiv:1706.01807), 2017.
Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorn
divergences. In Proceedings of the 21st International Conference on Artificial Intelligence
and Statistics, pages 1608–1617, 2018.
Aude Genevay, Lénaic Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample
complexity of Sinkhorn divergences. In Proceedings of the 22nd International Conference on
Artificial Intelligence and Statistics, 2019.
Ivan Gentil, Christian Léonard, and Luigia Ripani. About the analogy between optimal trans-
port and minimal entropy. arXiv preprint arXiv:1510.08230, 2015.
Alan George and Joseph WH Liu. The evolution of the minimum degree ordering algorithm.
SIAM Review, 31(1):1–19, 1989.
Tryphon T Georgiou and Michele Pavon. Positive contraction mappings for classical and quan-
tum Schrödinger systems. Journal of Mathematical Physics, 56(3):033301, 2015.
Pascal Getreuer. A survey of Gaussian convolution algorithms. Image Processing On Line,
2013:286–310, 2013.
Ugo Gianazza, Giuseppe Savaré, and Giuseppe Toscani. The Wasserstein gradient flow of the
Fisher information and the quantum drift-diffusion equation. Archive for Rational Mechanics
and Analysis, 194(1):133–220, 2009.
Alison L Gibbs and Francis Edward Su. On choosing and bounding probability metrics. Inter-
national Statistical Review, 70(3):419–435, 2002.
Joan Glaunes, Alain Trouvé, and Laurent Younes. Diffeomorphic matching of distributions:
a new approach for unlabelled point-sets and sub-manifolds matching. In Proceedings of
the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition,
volume 2, 2004.
Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sen-
timent classification: A deep learning approach. In Proceedings of the 28th International
Conference on Machine Learning, pages 513–520, 2011.
Roland Glowinski and A. Marroco. Sur l’approximation, par éléments finis d’ordre un, et
la résolution, par pénalisation-dualité d’une classe de problèmes de Dirichlet non linéaires.
ESAIM: Mathematical Modelling and Numerical Analysis, 9(R2):41–76, 1975.
Steven Gold and Anand Rangarajan. A graduated assignment algorithm for graph matching.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4):377–388, April 1996.
Steven Gold, Anand Rangarajan, Chien-Ping Lu, Suguna Pappu, and Eric Mjolsness. New
algorithms for 2d and 3d point matching: pose estimation and correspondence. Pattern
Recognition, 31(8):1019–1031, 1998.
Eusebio Gómez, Miguel A Gómez-Villegas, and J Miguel Marín. A survey on continuous ellip-
tical vector distributions. Rev. Mat. Complut, 16:345–361, 2003.
Paola Gori-Giorgi, Michael Seidl, and Giovanni Vignale. Density-functional theory for strongly
interacting electrons. Physical Review Letters, 103(16):166402, 2009.
Alexandre Gramfort, Gabriel Peyré, and Marco Cuturi. Fast optimal transport averaging of
neuroimaging data. In Information Processing in Medical Imaging - 24th International Con-
ference, IPMI 2015, pages 261–272, 2015.
Kristen Grauman and Trevor Darrell. The pyramid match kernel: discriminative classification
with sets of image features. In Tenth IEEE International Conference on Computer Vision,
volume 2, pages 1458–1465. IEEE, 2005.
Edouard Grave, Armand Joulin, and Quentin Berthet. Unsupervised alignment of embeddings
with Wasserstein Procrustes. In Proceedings of the 22nd International Conference on Artificial
Intelligence and Statistics, 2019.
Arthur Gretton, Karsten M Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex J Smola.
A kernel method for the two-sample-problem. In Advances in Neural Information Processing
Systems, pages 513–520, 2007.
Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander
Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773,
2012.
Andreas Griewank. Achieving logarithmic growth of temporal and spatial complexity in reverse
automatic differentiation. Optimization Methods and Software, 1(1):35–54, 1992.
Andreas Griewank and Andrea Walther. Evaluating Derivatives: Principles and Techniques of
Algorithmic Differentiation. SIAM, 2008.
Mikhail Gromov. Metric Structures for Riemannian and Non-Riemannian Spaces. Progress in
Mathematics. Birkhäuser, 2001.
Gaoyue Guo and Jan Obloj. Computational methods for martingale optimal transport prob-
lems. arXiv preprint arXiv:1710.07911, 2017.
Leonid Gurvits. Classical complexity and quantum entanglement. Journal of Computer and
System Sciences, 69(3):448–484, 2004.
Cristian E Gutiérrez. The Monge-Ampère Equation. Springer, 2016.
Jorge Gutierrez, Julien Rabin, Bruno Galerne, and Thomas Hurtut. Optimal patch assignment
for statistically constrained texture synthesis. In International Conference on Scale Space
and Variational Methods in Computer Vision, pages 172–183. Springer, 2017.
L. V. Kantorovich and G. S. Rubinstein. On a space of totally additive functions. Vestnik
Leningrad Universitet, 13:52–59, 1958.
Hermann Karcher. Riemannian center of mass and so called Karcher mean. arXiv preprint
arXiv:1407.2087, 2014.
Johan Karlsson and Axel Ringh. Generalized Sinkhorn iterations for regularizing inverse prob-
lems using optimal mass transport. arXiv preprint arXiv:1612.02273, 2016.
Sanggyun Kim, Rui Ma, Diego Mesa, and Todd P Coleman. Efficient Bayesian inference
methods via convex optimization and optimal transport. In IEEE International Symposium on
Information Theory, pages 2259–2263. IEEE, 2013.
David Kinderlehrer and Noel J Walkington. Approximation of parabolic equations using the
Wasserstein metric. ESAIM: Mathematical Modelling and Numerical Analysis, 33(04):837–
852, 1999.
Jun Kitagawa, Quentin Mérigot, and Boris Thibert. A Newton algorithm for semi-discrete
optimal transport. arXiv preprint arXiv:1603.05579, 2016.
Philip A Knight. The Sinkhorn–Knopp algorithm: convergence and applications. SIAM Journal
on Matrix Analysis and Applications, 30(1):261–275, 2008.
Philip A Knight and Daniel Ruiz. A fast algorithm for matrix balancing. IMA Journal of
Numerical Analysis, 33(3):1029–1047, 2013.
Philip A Knight, Daniel Ruiz, and Bora Uçar. A symmetry preserving algorithm for matrix
scaling. SIAM Journal on Matrix Analysis and Applications, 35(3):931–955, 2014.
Martin Knott and Cyril S Smith. On the optimal mapping of distributions. Journal of Opti-
mization Theory and Applications, 43(1):39–49, 1984.
Martin Knott and Cyril S Smith. On a generalization of cyclic monotonicity and distances
among random vectors. Linear Algebra and Its Applications, 199:363–371, 1994.
Soheil Kolouri, Yang Zou, and Gustavo K Rohde. Sliced Wasserstein kernels for probabil-
ity distributions. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 5258–5267, 2016.
Soheil Kolouri, Se Rim Park, Matthew Thorpe, Dejan Slepcev, and Gustavo K Rohde. Optimal
mass transport: signal processing and machine-learning applications. IEEE Signal Processing
Magazine, 34(4):43–59, 2017.
Stanislav Kondratyev, Léonard Monsaingeon, and Dmitry Vofnikov. A new optimal transport
distance on the space of finite Radon measures. Advances in Differential Equations, 21
(11/12):1117–1164, 2016.
Tjalling C Koopmans. Optimum utilization of the transportation system. Econometrica: Jour-
nal of the Econometric Society, pages 136–146, 1949.
Jonathan Korman and Robert McCann. Optimal transportation with capacity constraints.
Transactions of the American Mathematical Society, 367(3):1501–1521, 2015.
Bernhard Korte and Jens Vygen. Combinatorial Optimization. Springer, 2012.
JJ Kosowsky and Alan L Yuille. The invisible hand algorithm: Solving the assignment problem
with statistical physics. Neural Networks, 7(3):477–490, 1994.
Bruno Lévy and Erica L Schwindt. Notions of optimal transport theory and how to implement
them on a computer. Computers & Graphics, 72:135–148, 2018.
Peihua Li, Qilong Wang, and Lei Zhang. A novel earth mover’s distance methodology for
image matching with Gaussian mixture models. In Proceedings of the IEEE International
Conference on Computer Vision, pages 1689–1696, 2013.
Wuchen Li, Ernest K. Ryu, Stanley Osher, Wotao Yin, and Wilfrid Gangbo. A parallel method
for Earth Mover’s distance. Journal of Scientific Computing, 75(1):182–197, 2018a.
Yupeng Li, Wuchen Li, and Guo Cao. Image segmentation via L1 Monge-Kantorovich problem.
CAM report 17-73, 2018b.
Matthias Liero, Alexander Mielke, and Giuseppe Savaré. Optimal transport in competition
with reaction: the Hellinger–Kantorovich distance and geodesic curves. SIAM Journal on
Mathematical Analysis, 48(4):2869–2911, 2016.
Matthias Liero, Alexander Mielke, and Giuseppe Savaré. Optimal entropy-transport problems
and a new Hellinger–Kantorovich distance between positive measures. Inventiones
Mathematicae, 211(3):969–1117, 2018.
Haibin Ling and Kazunori Okada. Diffusion distance for histogram comparison. In IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages
246–253. IEEE, 2006.
Haibin Ling and Kazunori Okada. An efficient earth mover’s distance algorithm for robust
histogram comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence,
29(5):840–853, 2007.
Nathan Linial, Alex Samorodnitsky, and Avi Wigderson. A deterministic strongly polynomial
algorithm for matrix scaling and approximate permanents. In Proceedings of the Thirtieth
Annual ACM Symposium on Theory of Computing, pages 644–652. ACM, 1998.
Pierre-Louis Lions and Bertrand Mercier. Splitting algorithms for the sum of two nonlinear
operators. SIAM Journal on Numerical Analysis, 16:964–979, 1979.
Don O Loftsgaarden and Charles P Quesenberry. A nonparametric estimate of a multivariate
density function. Annals of Mathematical Statistics, 36(3):1049–1051, 1965.
Eliane Maria Loiola, Nair Maria Maia de Abreu, Paulo Oswaldo Boaventura-Netto, Peter Hahn,
and Tania Querido. A survey for the quadratic assignment problem. European Journal
Operational Research, 176(2):657–690, 2007.
Vince Lyzinski, Donniell E Fishkind, Marcelo Fiori, Joshua T Vogelstein, Carey E Priebe, and
Guillermo Sapiro. Graph matching: relax at your own risk. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 38(1):60–73, 2016.
Jan Maas. Gradient flows of the entropy for finite Markov chains. Journal of Functional
Analysis, 261(8):2250–2292, 2011.
Jan Maas, Martin Rumpf, Carola Schönlieb, and Stefan Simon. A generalized model for optimal
transport of images including dissipation and density modulation. ESAIM: Mathematical
Modelling and Numerical Analysis, 49(6):1745–1769, 2015.
Jan Maas, Martin Rumpf, and Stefan Simon. Generalized optimal transport with singular
sources. arXiv preprint arXiv:1607.01186, 2016.
Yasushi Makihara and Yasushi Yagi. Earth mover’s morphing: Topology-free shape morphing
using cluster-based EMD flows. In Asian Conference on Computer Vision, pages 202–215.
Springer, 2010.
Luigi Malagò, Luigi Montrucchio, and Giovanni Pistone. Wasserstein Riemannian geometry of
positive-definite matrices. arXiv preprint arXiv:1801.09269, 2018.
Anton Mallasto and Aasa Feragen. Learning from uncertain curves: The 2-Wasserstein metric
for Gaussian processes. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,
S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30,
pages 5660–5670. 2017.
Stephane Mallat. A Wavelet Tour of Signal Processing: the Sparse Way. Academic press, 2008.
Benjamin Mathon, François Cayre, Patrick Bas, and Benoît Macq. Optimal transport for secure
spread-spectrum watermarking of still images. IEEE Transactions on Image Processing, 23
(4):1694–1705, 2014.
Daniel Matthes and Horst Osberger. Convergence of a variational Lagrangian scheme for a
nonlinear drift diffusion equation. ESAIM: Mathematical Modelling and Numerical Analysis,
48(3):697–726, 2014.
Daniel Matthes and Horst Osberger. A convergent Lagrangian discretization for a nonlinear
fourth-order equation. Foundations of Computational Mathematics, 17(1):73–126, 2017.
Bertrand Maury and Anthony Preux. Pressureless Euler equations with maximal density con-
straint: a time-splitting scheme. Topological Optimization and Optimal Transport: In the
Applied Sciences, 17:333, 2017.
Bertrand Maury, Aude Roudneff-Chupin, and Filippo Santambrogio. A macroscopic crowd
motion model of gradient flow type. Mathematical Models and Methods in Applied Sciences,
20(10):1787–1821, 2010.
Robert J McCann. A convexity principle for interacting gases. Advances in Mathematics, 128
(1):153–179, 1997.
Facundo Mémoli. On the use of Gromov–Hausdorff distances for shape comparison. In Sympo-
sium on Point Based Graphics, pages 81–90. 2007.
Facundo Mémoli. Gromov–Wasserstein distances and the metric approach to object matching.
Foundations of Computational Mathematics, 11(4):417–487, 2011.
Facundo Mémoli and Guillermo Sapiro. A theoretical and computational framework for isometry
invariant recognition of point cloud data. Foundations of Computational Mathematics, 5(3):
313–347, 2005.
Quentin Mérigot. A multiscale approach to optimal transport. Computer Graphics Forum, 30
(5):1583–1592, 2011.
Ludovic Métivier, Romain Brossier, Quentin Mérigot, Edouard Oudet, and Jean Virieux. An
optimal transport approach for seismic tomography: Application to 3D full waveform inversion.
Inverse Problems, 32(11):115008, 2016.
Jocelyn Meyron, Quentin Mérigot, and Boris Thibert. Light in power: a general and parameter-
free algorithm for caustic design. In SIGGRAPH Asia 2018 Technical Papers, page 224. ACM,
2018.
Alexander Mielke. Geodesic convexity of the relative entropy in reversible Markov chains.
Calculus of Variations and Partial Differential Equations, 48(1-2):1–31, 2013.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word
representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
Jean-Marie Mirebeau. Discretization of the 3D Monge–Ampère operator, between wide stencils
and power diagrams. ESAIM: Mathematical Modelling and Numerical Analysis, 49(5):1511–
1523, 2015.
Gaspard Monge. Mémoire sur la théorie des déblais et des remblais. Histoire de l’Académie
Royale des Sciences, pages 666–704, 1781.
Grégoire Montavon, Klaus-Robert Müller, and Marco Cuturi. Wasserstein training of restricted
Boltzmann machines. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett,
editors, Advances in Neural Information Processing Systems 29, pages 3718–3726. 2016.
Kevin Moon and Alfred Hero. Multivariate f-divergence estimation with confidence. In
Advances in Neural Information Processing Systems, pages 2420–2428, 2014.
Oleg Museyko, Michael Stiglmayr, Kathrin Klamroth, and Günter Leugering. On the applica-
tion of the Monge–Kantorovich problem to image registration. SIAM Journal on Imaging
Sciences, 2(4):1068–1097, 2009.
Boris Muzellec and Marco Cuturi. Generalizing point embeddings using the Wasserstein space
of elliptical distributions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-
Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31,
pages 10258–10269. 2018.
Boris Muzellec, Richard Nock, Giorgio Patrini, and Frank Nielsen. Tsallis regularized optimal
transport and ecological inference. In AAAI, pages 2387–2393, 2017.
Andriy Myronenko and Xubo Song. Point set registration: coherent point drift. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 32(12):2262–2275, 2010.
Assaf Naor and Gideon Schechtman. Planar earthmover is not in L1. SIAM Journal on
Computing, 37(3):804–826, 2007.
Richard D Neidinger. Introduction to automatic differentiation and Matlab object-oriented
programming. SIAM Review, 52(3):545–563, 2010.
Arkadi Nemirovski and Uriel Rothblum. On complexity of matrix scaling. Linear Algebra and
its Applications, 302:435–460, 1999.
Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex pro-
gramming, volume 13. SIAM, 1994.
Kangyu Ni, Xavier Bresson, Tony Chan, and Selim Esedoglu. Local histogram based segmenta-
tion using the Wasserstein distance. International Journal of Computer Vision, 84(1):97–111,
2009.
Lipeng Ning and Tryphon T Georgiou. Metrics for matrix-valued measures via test functions.
In 53rd IEEE Conference on Decision and Control, pages 2642–2647. IEEE, 2014.
Lipeng Ning, Tryphon T Georgiou, and Allen Tannenbaum. On matrix-valued Monge–
Kantorovich optimal mass transport. IEEE Transactions on Automatic Control, 60(2):373–
382, 2015.
Gabriel Peyré, Jalal Fadili, and Julien Rabin. Wasserstein active contours. In 19th IEEE
International Conference on Image Processing, pages 2541–2544. IEEE, 2012.
Gabriel Peyré, Marco Cuturi, and Justin Solomon. Gromov-Wasserstein averaging of kernel
and distance matrices. In International Conference on Machine Learning, pages 2664–2672,
2016.
Gabriel Peyré, Lenaic Chizat, François-Xavier Vialard, and Justin Solomon. Quantum entropic
regularization of matrix-valued optimal transport. To appear in European Journal of Applied
Mathematics, 2017.
Rémi Peyre. Comparison between W2 distance and H−1 norm, and localisation of Wasserstein
distance. arXiv preprint arXiv:1104.4631, 2011.
Benedetto Piccoli and Francesco Rossi. Generalized Wasserstein distance and its application
to transport equations with source. Archive for Rational Mechanics and Analysis, 211(1):
335–358, 2014.
François Pitié, Anil C Kokaram, and Rozenn Dahyot. Automated colour grading using colour
distribution transfer. Computer Vision and Image Understanding, 107(1):123–137, 2007.
PyTorch. PyTorch library. http://pytorch.org/, 2017.
Julien Rabin and Nicolas Papadakis. Convex color image segmentation with optimal transport
distances. In Proceedings of SSVM’15, pages 256–269, 2015.
Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its
application to texture mixing. In International Conference on Scale Space and Variational
Methods in Computer Vision, pages 435–446. Springer, 2011.
Svetlozar T Rachev and Ludger Rüschendorf. Mass Transportation Problems: Volume I: Theory.
Springer Science & Business Media, 1998a.
Svetlozar T Rachev and Ludger Rüschendorf. Mass Transportation Problems: Volume II: Ap-
plications. Springer Science & Business Media, 1998b.
Louis B Rall. Automatic Differentiation: Techniques and Applications. Springer, 1981.
Aaditya Ramdas, Nicolás García Trillos, and Marco Cuturi. On Wasserstein two-sample testing
and related families of nonparametric tests. Entropy, 19(2):47, 2017.
Anand Rangarajan, Alan L Yuille, Steven Gold, and Eric Mjolsness. Convergence properties
of the softassign quadratic assignment algorithm. Neural Computation, 11(6):1455–1474,
August 1999.
Sebastian Reich. A nonparametric ensemble transform method for Bayesian inference. SIAM
Journal on Scientific Computing, 35(4):A2013–A2024, 2013.
R Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal
on Control and Optimization, 14(5):877–898, 1976.
Antoine Rolet, Marco Cuturi, and Gabriel Peyré. Fast dictionary learning with a smoothed
Wasserstein loss. In Proceedings of the 19th International Conference on Artificial Intelligence
and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 630–638, 2016.
Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover’s distance as a metric
for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.
Bernhard Schmitzer and Christoph Schnörr. Object segmentation by shape matching with
Wasserstein modes. In International Workshop on Energy Minimization Methods in Com-
puter Vision and Pattern Recognition, pages 123–136. Springer, 2013b.
Bernhard Schmitzer and Benedikt Wirth. A framework for Wasserstein-1-type metrics. arXiv
preprint arXiv:1701.01945, 2017.
Isaac J Schoenberg. Metric spaces and positive definite functions. Transactions of the American
Mathematical Society, 44:522–536, 1938.
Bernhard Schölkopf and Alexander J Smola. Learning with Kernels: Support Vector Machines,
Regularization, Optimization, and Beyond. MIT Press, 2002.
Erwin Schrödinger. Über die Umkehrung der Naturgesetze. Sitzungsberichte Preuss. Akad.
Wiss. Berlin. Phys. Math., 144:144–153, 1931.
Vivien Seguy and Marco Cuturi. Principal geodesic analysis for probability measures under the
optimal transport metric. In Advances in Neural Information Processing Systems 28, pages
3294–3302. 2015.
Vivien Seguy, Bharath Bhushan Damodaran, Rémi Flamary, Nicolas Courty, Antoine Rolet, and
Mathieu Blondel. Large-scale optimal transport and mapping estimation. In Proceedings of
ICLR 2018, 2018.
Soroosh Shafieezadeh Abadeh, Peyman Mohajerin Esfahani, and Daniel Kuhn. Distributionally
robust logistic regression. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama,
and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1576–
1584. 2015.
Soroosh Shafieezadeh Abadeh, Viet Anh Nguyen, Daniel Kuhn, and Peyman Mohajerin
Esfahani. Wasserstein distributionally robust Kalman filtering. In S. Bengio, H. Wallach,
H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural
Information Processing Systems 31, pages 8483–8492. 2018.
Sameer Shirdhonkar and David W Jacobs. Approximate earth mover’s distance in linear time.
In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
Bernard W Silverman. Density Estimation for Statistics and Data Analysis, volume 26. CRC
press, 1986.
Richard Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic
matrices. Annals of Mathematical Statistics, 35:876–879, 1964.
Marcos Slomp, Michihiro Mikamo, Bisser Raytchev, Toru Tamaki, and Kazufumi Kaneda. GPU-
based softassign for maximizing image utilization in photomosaics. International Journal of
Networking and Computing, 1(2):211–229, 2011.
Justin Solomon, Leonidas Guibas, and Adrian Butscher. Dirichlet energy for analysis and
synthesis of soft maps. In Computer Graphics Forum, volume 32, pages 197–206. Wiley
Online Library, 2013.
Justin Solomon, Raif Rustamov, Leonidas Guibas, and Adrian Butscher. Earth mover's distances
on discrete surfaces. ACM Transactions on Graphics, 33(4), 2014a.
Justin Solomon, Raif Rustamov, Leonidas Guibas, and Adrian Butscher. Wasserstein propagation
for semi-supervised learning. In Proceedings of the 31st International Conference on
Machine Learning, pages 306–314, 2014b.
Justin Solomon, Fernando De Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy
Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein distances: efficient optimal
transportation on geometric domains. ACM Transactions on Graphics, 34(4):66:1–66:11,
2015.
Justin Solomon, Gabriel Peyré, Vladimir G Kim, and Suvrit Sra. Entropic metric alignment
for correspondence problems. ACM Transactions on Graphics, 35(4):72:1–72:13, 2016a.
Justin Solomon, Raif Rustamov, Leonidas Guibas, and Adrian Butscher. Continuous-flow graph
transportation distances. arXiv preprint arXiv:1603.06927, 2016b.
Max Sommerfeld and Axel Munk. Inference for empirical Wasserstein distances on finite spaces.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):219–238,
2018.
Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert RG
Lanckriet. On integral probability metrics, ϕ-divergences and binary classification. arXiv
preprint arXiv:0901.2698, 2009.
Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert RG
Lanckriet. On the empirical estimation of integral probability metrics. Electronic Journal of
Statistics, 6:1550–1599, 2012.
Sanvesh Srivastava, Volkan Cevher, Quoc Dinh, and David Dunson. WASP: Scalable Bayes via
barycenters of subset posteriors. In Guy Lebanon and S. V. N. Vishwanathan, editors, Pro-
ceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics,
volume 38 of Proceedings of Machine Learning Research, pages 912–920, San Diego, Califor-
nia, USA, 2015a. PMLR. URL http://proceedings.mlr.press/v38/srivastava15.html.
Matthew Staib, Sebastian Claici, Justin M Solomon, and Stefanie Jegelka. Parallel streaming
Wasserstein barycenters. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,
S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems
30, pages 2647–2658. 2017a.
Leen Stougie. A polynomial bound on the diameter of the transportation polytope. Technical
report, TU/e, Technische Universiteit Eindhoven, Department of Mathematics and Comput-
ing Science, 2002.
Karl-Theodor Sturm. The space of spaces: curvature bounds and gradient flows on the space
of metric measure spaces. arXiv preprint arXiv:1208.0434, 2012.
Zhengyu Su, Yalin Wang, Rui Shi, Wei Zeng, Jian Sun, Feng Luo, and Xianfeng Gu. Optimal
mass transport for shape matching and comparison. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 37(11):2246–2259, 2015.
Vladimir N Sudakov. Geometric Problems in the Theory of Infinite-dimensional Probability
Distributions. Number 141. American Mathematical Society, 1979.
Mahito Sugiyama, Hiroyuki Nakahara, and Koji Tsuda. Tensor balancing on statistical mani-
fold. arXiv preprint arXiv:1702.08142, 2017.
Mohamed M Sulman, JF Williams, and Robert D Russell. An efficient approach for the numerical
solution of the Monge–Ampère equation. Applied Numerical Mathematics, 61(3):298–307,
2011.
Paul Swoboda and Christoph Schnörr. Convex variational image restoration with histogram
priors. SIAM Journal on Imaging Sciences, 6(3):1719–1735, 2013.
Gábor J Székely and Maria L Rizzo. Testing for equal distributions in high dimension. InterStat,
5(16.10), 2004.
Asuka Takatsu. Wasserstein geometry of Gaussian measures. Osaka Journal of Mathematics,
48(4):1005–1026, 2011.
Xiaolu Tan and Nizar Touzi. Optimal transportation under controlled stochastic dynamics.
Annals of Probability, 41(5):3201–3240, 2013.
Robert E. Tarjan. Dynamic trees as search trees via Euler tours, applied to the network simplex
algorithm. Mathematical Programming, 78(2):169–177, 1997.
Guillaume Tartavel, Gabriel Peyré, and Yann Gousseau. Wasserstein loss for image synthesis
and restoration. SIAM Journal on Imaging Sciences, 9(4):1726–1755, 2016.
Matthew Thorpe, Serim Park, Soheil Kolouri, Gustavo K Rohde, and Dejan Slepčev. A transportation
Lp distance for signal analysis. Journal of Mathematical Imaging and Vision, 59
(2):187–210, 2017.
AN Tolstoi. Metody nakhozhdeniya naimen'shego summovogo kilometrazha pri planirovanii
perevozok v prostranstve (Russian; Methods of finding the minimal total kilometrage in cargo
transportation planning in space). TransPress of the National Commissariat of Transportation,
pages 23–55, 1930.
AN Tolstoi. Metody ustraneniya neratsional'nykh perevozok pri planirovanii [Russian; Methods
of removing irrational transportation in planning]. Sotsialisticheskii Transport, 9:28–51, 1939.
Alain Trouvé and Laurent Younes. Metamorphoses through Lie group action. Foundations of
Computational Mathematics, 5(2):173–198, 2005.
Neil S Trudinger and Xu-Jia Wang. On the Monge mass transfer problem. Calculus of Variations
and Partial Differential Equations, 13(1):19–31, 2001.
Marc Vaillant and Joan Glaunès. Surface matching via currents. In Information Processing in
Medical Imaging, pages 1–5. Springer, 2005.
Sathamangalam R Srinivasa Varadhan. On the behavior of the fundamental solution of the
heat equation with variable coefficients. Communications on Pure and Applied Mathematics,
20(2):431–455, 1967.