
arXiv:1803.00567v4 [stat.ML] 18 Mar 2020

Computational Optimal Transport

Gabriel Peyré
CNRS and DMA, ENS

Marco Cuturi
Google and CREST, ENSAE

@article{COTFNT,
  year = {2019},
  volume = {11},
  journal = {Foundations and Trends in Machine Learning},
  title = {Computational Optimal Transport},
  number = {5-6},
  pages = {355--607},
  author = {Gabriel Peyr\'e and Marco Cuturi}
}
Contents

1 Introduction 3

2 Theoretical Foundations 7
2.1 Histograms and Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Assignment and Monge Problem . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Kantorovich Relaxation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Metric Properties of Optimal Transport . . . . . . . . . . . . . . . . . . . 19
2.5 Dual Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6 Special Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3 Algorithmic Foundations 37
3.1 The Kantorovich Linear Programs . . . . . . . . . . . . . . . . . . . . . . 38
3.2 C-Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3 Complementary Slackness . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Vertices of the Transportation Polytope . . . . . . . . . . . . . . . . . . . 42
3.5 A Heuristic Description of the Network Simplex . . . . . . . . . . . . . . . 45
3.6 Dual Ascent Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.7 Auction Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4 Entropic Regularization of Optimal Transport 57


4.1 Entropic Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Sinkhorn’s Algorithm and Its Convergence . . . . . . . . . . . . . . . . . . 62
4.3 Speeding Up Sinkhorn’s Iterations . . . . . . . . . . . . . . . . . . . . . . 73
4.4 Stability and Log-Domain Computations . . . . . . . . . . . . . . . . . . . 77
4.5 Regularized Approximations of the Optimal Transport Cost . . . . . . . . . 80
4.6 Generalized Sinkhorn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82


5 Semidiscrete Optimal Transport 85


5.1 c-Transform and c̄-Transform . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2 Semidiscrete Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3 Entropic Semidiscrete Formulation . . . . . . . . . . . . . . . . . . . . . . 89
5.4 Stochastic Optimization Methods . . . . . . . . . . . . . . . . . . . . . . . 92

6 W1 Optimal Transport 96
6.1 W 1 on Metric Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2 W 1 on Euclidean Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.3 W 1 on a Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

7 Dynamic Formulations 102


7.1 Continuous Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.2 Discretization on Uniform Staggered Grids . . . . . . . . . . . . . . . . . . 105
7.3 Proximal Solvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.4 Dynamical Unbalanced OT . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.5 More General Mobility Functionals . . . . . . . . . . . . . . . . . . . . . . 110
7.6 Dynamic Formulation over the Paths Space . . . . . . . . . . . . . . . . . 111

8 Statistical Divergences 114


8.1 ϕ-Divergences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
8.2 Integral Probability Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 120
8.3 Wasserstein Spaces Are Not Hilbertian . . . . . . . . . . . . . . . . . . . . 125
8.4 Empirical Estimators for OT, MMD and ϕ-divergences . . . . . . . . . . . 128
8.5 Entropic Regularization: Between OT and MMD . . . . . . . . . . . . . . . 131

9 Variational Wasserstein Problems 133


9.1 Differentiating the Wasserstein Loss . . . . . . . . . . . . . . . . . . . . . 134
9.2 Wasserstein Barycenters, Clustering and Dictionary Learning . . . . . . . . 138
9.3 Gradient Flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
9.4 Minimum Kantorovich Estimators . . . . . . . . . . . . . . . . . . . . . . . 155

10 Extensions of Optimal Transport 159


10.1 Multimarginal Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
10.2 Unbalanced Optimal Transport . . . . . . . . . . . . . . . . . . . . . . . . 162
10.3 Problems with Extra Constraints on the Couplings . . . . . . . . . . . . . . 165
10.4 Sliced Wasserstein Distance and Barycenters . . . . . . . . . . . . . . . . . 166
10.5 Transporting Vectors and Matrices . . . . . . . . . . . . . . . . . . . . . . 169
10.6 Gromov–Wasserstein Distances . . . . . . . . . . . . . . . . . . . . . . . . 172

References 179
Abstract

Optimal transport (OT) theory can be informally described using the words of the
French mathematician Gaspard Monge (1746–1818): A worker with a shovel in hand
has to move a large pile of sand lying on a construction site. The goal of the worker is
to erect with all that sand a target pile with a prescribed shape (for example, that of a
giant sand castle). Naturally, the worker wishes to minimize her total effort, quantified
for instance as the total distance or time spent carrying shovelfuls of sand. Mathe-
maticians interested in OT cast that problem as that of comparing two probability
distributions—two different piles of sand of the same volume. They consider all of the
many possible ways to morph, transport or reshape the first pile into the second, and
associate a “global” cost to every such transport, using the “local” consideration of
how much it costs to move a grain of sand from one place to another. Mathematicians
are interested in the properties of that least costly transport, as well as in its efficient
computation. That smallest cost not only defines a distance between distributions, but
it also entails a rich geometric structure on the space of probability distributions. That
structure is canonical in the sense that it borrows key geometric properties of the un-
derlying “ground” space on which these distributions are defined. For instance, when
the underlying space is Euclidean, key concepts such as interpolation, barycenters, con-
vexity or gradients of functions extend naturally to the space of distributions endowed
with an OT geometry.
OT has been (re)discovered in many settings and under different forms, giving it
a rich history. While Monge’s seminal work was motivated by an engineering problem,
Tolstoi in the 1920s and Hitchcock, Kantorovich and Koopmans in the 1940s estab-
lished its significance to logistics and economics. Dantzig solved it numerically in 1949
within the framework of linear programming, giving OT a firm footing in optimization.
OT was later revisited by analysts in the 1990s, notably Brenier, while also gaining
fame in computer vision under the name of earth mover’s distances. Recent years have
witnessed yet another revolution in the spread of OT, thanks to the emergence of ap-
proximate solvers that can scale to large problem dimensions. As a consequence, OT is
being increasingly used to unlock various problems in imaging sciences (such as color
or texture processing), graphics (for shape manipulation) or machine learning (for re-
gression, classification and generative modeling).
This paper reviews OT with a bias toward numerical methods, and covers the
theoretical properties of OT that can guide the design of new algorithms. We focus in
particular on the recent wave of efficient algorithms that have helped OT find relevance
in data sciences. We give a prominent place to the many generalizations of OT that
have been proposed in but a few years, and connect them with related approaches
originating from statistical inference, kernel methods and information theory. All of

the figures can be reproduced using code made available in a companion website.¹ This
website hosts the book project Computational Optimal Transport. You will also find
slides and computational resources.

¹ https://optimaltransport.github.io/
1 Introduction

The shortest path principle guides most decisions in life and sciences: When a commod-
ity, a person or a single bit of information is available at a given point and needs to be
sent at a target point, one should favor using the least possible effort. This is typically
reached by moving an item along a straight line when in the plane or along geodesic
curves in more involved metric spaces. The theory of optimal transport generalizes that
intuition in the case where, instead of moving only one item at a time, one is concerned
with the problem of moving simultaneously several items (or a continuous distribution
thereof) from one configuration onto another. As schoolteachers might attest, planning
the transportation of a group of individuals, with the constraint that they reach a given
target configuration upon arrival, is substantially more involved than carrying it out
for a single individual. Indeed, thinking in terms of groups or distributions requires a
more advanced mathematical formalism which was first hinted at in the seminal work
of Monge [1781]. Yet, no matter how complicated that formalism might look at first
sight, that problem has deep and concrete connections with our daily life. Transporta-
tion, be it of people, commodities or information, very rarely involves moving only
one item. All major economic problems, in logistics, production planning or network
routing, involve moving distributions, and that thread appears in all of the seminal ref-
erences on optimal transport. Indeed Tolstoi [1930], Hitchcock [1941] and Kantorovich
[1942] were all guided by practical concerns. It was only a few years later, mostly after
the 1980s, that mathematicians discovered, thanks to the works of Brenier [1991] and
others, that this theory provided a fertile ground for research, with deep connections
to convexity, partial differential equations and statistics. At the turn of the millennium,
researchers in computer, imaging and more generally data sciences understood that optimal
transport theory provided very powerful tools to study distributions in a different
and more abstract context, that of comparing distributions readily available to them
under the form of bags-of-features or descriptors.
Several reference books have been written on optimal transport, including the two
recent monographs by Villani (2003, 2009), those by Rachev and Rüschendorf (1998a,
1998b) and more recently that by Santambrogio [2015]. As exemplified by these books,
the more formal and abstract concepts in that theory deserve in and by themselves
several hundred pages. Now that optimal transport has gradually established itself as
an applied tool (for instance, in economics, as put forward recently by Galichon [2016]),
we have tried to balance that rich literature with a computational viewpoint, centered
on applications to data science, notably imaging sciences and machine learning. We
follow in that sense the motivation of the recent review by Kolouri et al. [2017] but
try to cover more ground. Ultimately, our goal is to present an overview of the main
theoretical insights that support the practical effectiveness of OT and spend more time
explaining how to turn these insights into fast computational schemes. The main body
of Chapters 2, 3, 4, 9, and 10 is devoted solely to the study of the geometry induced by
optimal transport in the space of probability vectors or discrete histograms. Targeting
more advanced readers, we also give in the same chapters, in light gray boxes, a more
general mathematical exposition of optimal transport tailored for discrete measures.
Discrete measures are defined by their probability weights, but also by the location
at which these weights are defined. These locations are usually taken in a continuous
metric space, giving a second important degree of freedom to model random phenomena.
Lastly, the third and most technical layer of exposition is indicated in dark gray boxes
and deals with arbitrary measures that need not be discrete, and which can have in
particular a density w.r.t. a base measure. This is traditionally the default setting for
most classic textbooks on OT theory, but one that plays a less important role in general
for practical applications. Chapters 5 to 8 deal with the interplay between continuous
and discrete measures and are thus targeting a more mathematically inclined audience.
The field of computational optimal transport is at the time of this writing still an
extremely active one. There are therefore a wide variety of topics that we have not
touched upon in this survey. Let us cite in no particular order the subjects of distri-
butionally robust optimization [Shafieezadeh Abadeh et al., 2015, Esfahani and Kuhn,
2018, Lee and Raginsky, 2018, Gao et al., 2018], in which parameter estimation is
carried out by minimizing the worst possible empirical risk of any data measure taken
within a certain Wasserstein distance of the input data; convergence of the Langevin
Monte Carlo sampling algorithm in the Wasserstein geometry [Dalalyan and Karag-
ulyan, 2017, Dalalyan, 2017, Bernton, 2018]; other numerical methods to solve OT
with a squared Euclidean cost in low-dimensional settings using the Monge–Ampère
equation [Froese and Oberman, 2011, Benamou et al., 2014, Sulman et al., 2011], which
are only briefly mentioned in Remark 2.25.

Notation

• JnK: set of integers {1, . . . , n}.

• 1n,m : matrix of Rn×m with all entries identically set to 1. 1n : vector of ones.

• In : identity matrix of size n × n.

• For u ∈ Rn , diag(u) is the n × n matrix with diagonal u and zero otherwise.

• Σn : probability simplex with n bins, namely the set of probability vectors in Rn+ .

• (a, b): histograms in the simplices Σn × Σm .

• (α, β): measures, defined on spaces (X , Y).



• dα/dβ : relative density of a measure α with respect to β.

• ρ_α = dα/dx : density of a measure α with respect to the Lebesgue measure.

• (α = Σ_i a_i δ_{x_i}, β = Σ_j b_j δ_{y_j}): discrete measures supported on x_1, . . . , x_n ∈ X and
y_1, . . . , y_m ∈ Y.

• c(x, y): ground cost, with associated pairwise cost matrix Ci,j = (c(xi , yj ))i,j
evaluated on the support of α, β.

• π: coupling measure between α and β, namely such that for any A ⊂ X, π(A ×
Y) = α(A), and for any subset B ⊂ Y, π(X × B) = β(B). For discrete measures,
π = Σ_{i,j} P_{i,j} δ_{(x_i, y_j)}.

• U(α, β): set of coupling measures, for discrete measures U(a, b).

• R(c): set of admissible dual potentials; for discrete measures R(C).

• T : X → Y: Monge map, typically such that T] α = β.

• (α_t)_{t=0}^{1} : dynamic measures, with α_{t=0} = α_0 and α_{t=1} = α_1.

• v: speed for Benamou–Brenier formulations; J = αv: momentum.

• (f, g): dual potentials, for discrete measures (f, g) are dual variables.

• (u, v) := (e^{f/ε}, e^{g/ε}): Sinkhorn scalings.

• K := e^{−C/ε}: Gibbs kernel for Sinkhorn.

• s: flow for W 1 -like problem (optimization under divergence constraints).

• LC (a, b) and Lc (α, β): value of the optimization problem associated to the OT
with cost C (histograms) and c (arbitrary measures).

• Wp (a, b) and W p (α, β): p-Wasserstein distance associated to ground distance


matrix D (histograms) and distance d (arbitrary measures).

• λ ∈ ΣS : weight vector used to compute the barycenters of S measures.

• h·, ·i: for the usual Euclidean dot-product between vectors; for two matrices of
the same size A and B, hA, Bi = tr(A> B) is the Frobenius dot-product.
• f ⊕ g(x, y) := f(x) + g(y), for two functions f : X → R, g : Y → R, defines
f ⊕ g : X × Y → R.

• f ⊕ g := f 1_m^T + 1_n g^T ∈ R^{n×m} for two vectors f ∈ R^n, g ∈ R^m.
• α ⊗ β is the product measure on X × Y, i.e. ∫_{X×Y} g(x, y) d(α ⊗ β)(x, y) = ∫_{X×Y} g(x, y) dα(x) dβ(y).

• a ⊗ b := ab^T ∈ R^{n×m}.

• u ⊙ v := (u_i v_i) ∈ R^n for (u, v) ∈ (R^n)^2.


2 Theoretical Foundations

This chapter describes the basics of optimal transport, introducing first the related
notions of optimal matchings and couplings between probability vectors (a, b), gen-
eralizing gradually this computation to transport between discrete measures (α, β), to
cover lastly the general setting of arbitrary measures. At first reading, these last nuances
may be omitted and the reader can only focus on computations between probability
vectors, namely histograms, which is the only requisite to implement algorithms de-
tailed in Chapters 3 and 4. More experienced readers will reach a better understanding
of the problem by considering the formulation that applies to arbitrary measures, and
will be able to apply it for more advanced problems (e.g. in order to move positions of
clouds of points, or in a statistical setting where points are sampled from continuous
densities).

2.1 Histograms and Measures

We will use interchangeably the terms histogram and probability vector for any element
a ∈ Σn that belongs to the probability simplex
Σ_n := { a ∈ R_+^n : Σ_{i=1}^n a_i = 1 }.

A large part of this review focuses exclusively on the study of the geometry induced by
optimal transport on the simplex.


Remark 2.1 (Discrete measures). A discrete measure with weights a and locations
x1 , . . . , xn ∈ X reads
α = Σ_{i=1}^n a_i δ_{x_i},   (2.1)
where δx is the Dirac at position x, intuitively a unit of mass which is infinitely
concentrated at location x. Such a measure describes a probability measure if,
additionally, a ∈ Σn and more generally a positive measure if all the elements of
vector a are nonnegative. To avoid degeneracy issues where locations with no mass
are accounted for, we will assume when considering discrete measures that all the
elements of a are positive.

Remark 2.2 (General measures). A convenient feature of OT is that it can deal


with measures that are either or both discrete and continuous within the same
framework. To do so, one relies on the set of Radon measures M(X ) on the space
X . The formal definition of that set requires that X is equipped with a distance,
usually denoted d, because one can access a measure only by “testing” (integrating)
it against continuous functions, denoted f ∈ C(X ).
Integration of f ∈ C(X ) against a discrete measure α computes a sum
∫_X f(x) dα(x) = Σ_{i=1}^n a_i f(x_i).

More general measures, for instance on X = Rd (where d ∈ N∗ is the dimension),


can have a density dα(x) = ρ_α(x)dx w.r.t. the Lebesgue measure, often denoted
ρ_α = dα/dx, which means that

∀ h ∈ C(R^d),   ∫_{R^d} h(x) dα(x) = ∫_{R^d} h(x) ρ_α(x) dx.

An arbitrary measure α ∈ M(X) (which need not have a density nor be a sum
of Diracs) is defined by the fact that it can be integrated against any continuous
function f ∈ C(X) to obtain ∫_X f(x) dα(x) ∈ R. If X is not compact, one should
also impose that f has compact support or at least has 0 limit at infinity. Measures
are thus in some sense “less regular” than functions but more regular than distribu-
tions (which are dual to smooth functions). For instance, the derivative of a Dirac
is not a measure. We denote M+ (X ) the set of all positive measures on X . The
set of probability measures is denoted M1+ (X ), which means that any α ∈ M1+ (X )
is positive, and that α(X) = ∫_X dα = 1. Figure 2.1 offers a visualization of the
different classes of measures, beyond histograms, considered in this work.

Figure 2.1: Schematic display of discrete distributions α = Σ_{i=1}^n a_i δ_{x_i} (red corresponds to empirical
uniform distribution a_i = 1/n, and blue to arbitrary distributions) and densities dα(x) = ρ_α(x)dx (in
purple), in both one and two dimensions (panels: discrete d = 1, discrete d = 2, density d = 1, density d = 2).
Discrete distributions in one dimension are displayed as stem plots (with length equal to a_i) and in two
dimensions using point clouds (in which case their radius might be equal to a_i or, for a more visually
accurate representation, their area).

2.2 Assignment and Monge Problem

Given a cost matrix (Ci,j )i∈JnK,j∈JmK , assuming n = m, the optimal assignment problem
seeks for a bijection σ in the set Perm(n) of permutations of n elements solving
min_{σ∈Perm(n)} (1/n) Σ_{i=1}^n C_{i,σ(i)}.   (2.2)

One could naively evaluate the cost function above using all permutations in the set
Perm(n). However, that set has size n!, which is gigantic even for small n. Consider,
for instance, that such a set has more than 10100 elements [Dantzig, 1983] when n is as
small as 70. That problem can therefore be solved only if there exist efficient algorithms
to optimize that cost function over the set of permutations, which is the subject of §3.7.
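For small instances, the assignment problem can be solved directly with an off-the-shelf
routine. The minimal sketch below (our own illustration in Python, using SciPy's
linear_sum_assignment solver rather than the algorithms of §3.7) compares a brute-force
search over all n! permutations with the polynomial-time solver.

import numpy as np
from itertools import permutations
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n = 6                                   # brute force is only feasible for tiny n (6! = 720)
C = rng.random((n, n))

# Brute force over all permutations, cf. the n! discussion above.
brute = min(sum(C[i, s[i]] for i in range(n)) / n for s in permutations(range(n)))

# Polynomial-time assignment solver from SciPy.
rows, cols = linear_sum_assignment(C)
fast = C[rows, cols].sum() / n

assert np.isclose(brute, fast)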

Remark 2.3 (Uniqueness). Note that the optimal assignment problem may have several
optimal solutions. Suppose, for instance, that n = m = 2 and that the matrix C is the
pairwise distance matrix between the four corners of a 2-D square of side length 1, as
represented in the left plot of Figure 2.2. In that case only two assignments exist, and
they are both optimal.

Remark 2.4 (Monge problem between discrete measures). For discrete measures
α = Σ_{i=1}^n a_i δ_{x_i}   and   β = Σ_{j=1}^m b_j δ_{y_j},   (2.3)

the Monge problem [1781] seeks a map that associates to each point xi a single
point yj and which must push the mass of α toward the mass of β, namely, such a
Figure 2.2: Left: blue dots from measure α and red dots from measure β are pairwise equidistant.
Hence, either matching σ = (1, 2) (full line) or σ = (2, 1) (dotted line) is optimal. Right: a Monge map
can associate the blue measure α to the red measure β. The weights a_i are displayed proportionally
to the area of the disk marked at each location. The mapping here is such that T(x_1) = T(x_2) = y_2,
T(x_3) = y_3, whereas for 4 ≤ i ≤ 7 we have T(x_i) = y_1.

map T : {x1 , . . . , xn } → {y1 , . . . , ym } must verify that


∀ j ∈ JmK,   b_j = Σ_{i : T(x_i) = y_j} a_i,   (2.4)

which we write in compact form as T] α = β. Because all the elements of b are


positive, that map is necessarily surjective. This map should minimize some trans-
portation cost, which is parameterized by a function c(x, y) defined for points
(x, y) ∈ X × Y,

min_T { Σ_i c(x_i, T(x_i)) : T]α = β }.   (2.5)
Such a map between discrete points can be of course encoded, assuming all x’s
and y’s are distinct, using indices σ : JnK → JmK so that j = σ(i), and the mass
conservation is written as Σ_{i∈σ^{-1}(j)} a_i = b_j,

where the inverse σ −1 (j) is to be understood as the preimage set of j. In the special
case when n = m and all weights are uniform, that is, ai = bj = 1/n, then the
mass conservation constraint implies that T is a bijection, such that T (xi ) = yσ(i) ,
and the Monge problem is equivalent to the optimal matching problem (2.2), where
the cost matrix is
def.
Ci,j = c(xi , yj ).
When n ≠ m, note that, optimality aside, Monge maps may not even exist from
a discrete measure to another. This happens when their weight vectors are not

compatible, which is always the case when the target measure has more points
than the source measure, n < m. For instance, the right plot in Figure 2.2 shows
an (optimal) Monge map between α and β, but there is no Monge map from β to
α.

Remark 2.5 (Push-forward operator). For a continuous map T : X → Y, we define


its corresponding push-forward operator T] : M(X ) → M(Y). For discrete mea-
sures (2.1), the push-forward operation consists simply in moving the positions of
all the points in the support of the measure
T]α := Σ_i a_i δ_{T(x_i)}.

For more general measures, for instance, for those with a density, the notion of
push-forward plays a fundamental role to describe the spatial modification (or
transport) of a probability measure. The formal definition reads as follows.

Definition 2.1 (Push-forward). For T : X → Y, the push-forward measure β =


T] α ∈ M(Y) of some α ∈ M(X ) satisfies
∀ h ∈ C(Y),   ∫_Y h(y) dβ(y) = ∫_X h(T(x)) dα(x).   (2.6)

Equivalently, for any measurable set B ⊂ Y, one has

β(B) = α({x ∈ X : T (x) ∈ B}) = α(T −1 (B)). (2.7)

Note that T] preserves positivity and total mass, so that if α ∈ M1+ (X ) then
T] α ∈ M1+ (Y).

Intuitively, a measurable map T : X → Y can be interpreted as a function


moving a single point from a measurable space to another. T] is an extension of
T that can move an entire probability measure on X toward a new probability
measure on Y. The operator T] pushes forward each elementary mass of a measure
α on X by applying the map T to obtain then an elementary mass in Y. Note that
a push-forward operator T] : M1+ (X ) → M1+ (Y) is linear in the sense that for two
measures α1 , α2 on X , T] (α1 + α2 ) = T] α1 + T] α2 .
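In the discrete case this operation is a few lines of code. The sketch below is an
illustration of our own (the helper push_forward and the map T are not part of the
text): it moves each support point by T and accumulates the weights of points mapped
to the same location.

import numpy as np
from collections import defaultdict

def push_forward(a, x, T):
    # T_sharp(alpha) = sum_i a_i * delta_{T(x_i)}: move each support point and
    # accumulate the weights of points that share the same image.
    out = defaultdict(float)
    for weight, point in zip(a, x):
        out[T(point)] += weight
    return np.array(list(out.values())), np.array(list(out.keys()))

a = np.array([0.25, 0.25, 0.5])
x = np.array([0.0, 1.0, 2.0])
T = lambda t: min(t, 1.0)        # maps x_2 and x_3 to the same location
b, y = push_forward(a, x, T)     # b = [0.25, 0.75], y = [0.0, 1.0]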

Remark 2.6 (Push-forward for multivariate densities). Explicitly doing the change
of variables in formula (2.6) for measures with densities (ρα , ρβ ) on Rd (assum-
ing T is smooth and bijective) shows that a push-forward acts on densities linearly
as a change of variables in the integration formula. Indeed, one has

ρ_α(x) = |det(T′(x))| ρ_β(T(x)),   (2.8)

where T′(x) ∈ R^{d×d} is the Jacobian matrix of T (the matrix formed by taking the
gradient of each coordinate of T). This implies

|det(T′(x))| = ρ_α(x) / ρ_β(T(x)).
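As a quick sanity check of (2.8) (an illustrative example of ours, not taken from the
text): let α be the uniform measure on [0, 1], so ρ_α = 1 on [0, 1], and let T(x) = 2x + 1.
Then β = T]α is the uniform measure on [1, 3], with density ρ_β = 1/2, and indeed

ρ_α(x) = 1 = |det(T′(x))| ρ_β(T(x)) = 2 × (1/2)   for all x ∈ [0, 1].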

Remark 2.7 (Monge problem between arbitrary measures). The Monge prob-
lem (2.5) can be extended to the case where two arbitrary probability measures
(α, β), supported on two spaces (X , Y) can be linked through a map T : X → Y
that minimizes

min_T { ∫_X c(x, T(x)) dα(x) : T]α = β }.   (2.9)
The constraint T] α = β means that T pushes forward the mass of α to β, using
the push-forward operator defined in Remark 2.5.

Remark 2.8 (Push-forward vs. pull-back). The push-forward T] of measures should


not be confused with the pull-back of functions T ] : C(Y) → C(X ) which corre-
sponds to “warping” between functions, defined as the linear map which to g ∈ C(Y)
associates T ] g = g ◦ T . Push-forward and pull-back are actually adjoint to one an-
other, in the sense that
∀ (α, g) ∈ M(X) × C(Y),   ∫_Y g d(T]α) = ∫_X (T ]g) dα.

Note that even if (α, β) have densities (ρα , ρβ ) with respect to a fixed measure (e.g.
Lebesgue on Rd ), T] α does not have T ] ρβ as density, because of the presence of
the Jacobian in (2.8). This explains why OT should be used with caution to per-
form image registration, because it does not operate as an image warping method.
Figure 2.3 illustrates the distinction between these push-forward and pull-back
operators.

Remark 2.9 (Measures and random variables). Radon measures can also be viewed
as representing the distributions of random variables. A random variable X on X
is actually a map X : Ω → X from some abstract (often unspecified) probability
space (Ω, P), and its distribution α is the Radon measure α ∈ M1+ (X ) such that
P(X ∈ A) = α(A) = ∫_A dα(x). Equivalently, it is the push-forward of P by X,
α = X]P. Applying another push-forward β = T]α for T : X → Y, following (2.6),

Figure 2.3: Comparison of the push-forward operator T], which can take as an input any measure,
and the pull-back operator T ], which operates on functions, notably densities. (Left panel: push-forward
of measures, T]α = Σ_i a_i δ_{T(x_i)}; right panel: pull-back of functions, T ]g = g ∘ T.)

is equivalent to defining another random variable Y = T (X) : ω ∈ Ω → T (X(ω)) ∈


Y , so that β is the distribution of Y . Drawing a random sample y from Y is thus
simply achieved by computing y = T (x), where x is drawn from X.
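In code this sampling recipe is one line; the sketch below is an illustration of ours
(the map T is arbitrary and chosen only for the example), not code from the text.

import numpy as np

rng = np.random.default_rng(0)

def T(x):
    # any measurable map; here a smooth nonlinear map chosen for illustration
    return np.tanh(3.0 * x)

x = rng.normal(size=10_000)   # samples of X, with distribution alpha = N(0, 1)
y = T(x)                      # samples of Y = T(X), with distribution beta = T_sharp(alpha)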

2.3 Kantorovich Relaxation

The assignment problem, and its generalization found in the Monge problem laid out in
Remark 2.4, is not always relevant to studying discrete measures, such as those found
in practical problems. Indeed, because the assignment problem is formulated as a per-
mutation problem, it can only be used to compare uniform histograms of the same size.
A direct generalization to discrete measures with nonuniform weights can be carried
out using Monge’s formalism of push-forward maps, but that formulation may also be
degenerate in the absence of feasible solutions satisfying the mass conservation con-
straint (2.4) (see the end of Remark 2.4). Additionally, the assignment problem (2.5) is
combinatorial, and the feasible set for the Monge problem (2.9), despite being continu-
ously parameterized as the set consisting in all push-forward measures that satisfy the
mass conservation constraint, is nonconvex. Both are therefore difficult to solve when
approached in their original formulation.
The key idea of Kantorovich [1942] is to relax the deterministic nature of trans-
portation, namely the fact that a source point x_i can only be assigned to another point
or location y_{σ(i)} or T(x_i). Kantorovich proposes instead that the mass at any point
xi be potentially dispatched across several locations. Kantorovich moves away from the
idea that mass transportation should be deterministic to consider instead a probabilistic
transport, which allows what is commonly known now as mass splitting from a source
toward several targets. This flexibility is encoded using, in place of a permutation σ or a
map T, a coupling matrix P ∈ R_+^{n×m}, where P_{i,j} describes the amount of mass flowing

from bin i toward bin j, or from the mass found at xi toward yj in the formalism of
discrete measures (2.3). Admissible couplings admit a far simpler characterization than
Monge maps,
U(a, b) := { P ∈ R_+^{n×m} : P 1_m = a  and  P^T 1_n = b },   (2.10)

where we used the following matrix-vector notation:

P 1_m = ( Σ_j P_{i,j} )_i ∈ R^n   and   P^T 1_n = ( Σ_i P_{i,j} )_j ∈ R^m.

The set of matrices U(a, b) is bounded and defined by n + m equality constraints, and
therefore is a convex polytope (the convex hull of a finite set of matrices) [Brualdi,
2006, §8.1].
Additionally, whereas the Monge formulation (as illustrated in the right plot of
Figure 2.2) was intrinsically asymmetric, Kantorovich’s relaxed formulation is always
symmetric, in the sense that a coupling P is in U(a, b) if and only if PT is in U(b, a).
Kantorovich’s optimal transport problem now reads

L_C(a, b) := min_{P∈U(a,b)} ⟨C, P⟩ = Σ_{i,j} C_{i,j} P_{i,j}.   (2.11)

This is a linear program (see Chapter 3), and as is usually the case with such programs,
its optimal solutions are not necessarily unique.
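For small problems, (2.11) can be solved directly in its linear-program form. The sketch
below is our own illustration in Python, using SciPy's generic linprog solver rather than
the specialized algorithms of Chapter 3; all names are ours.

import numpy as np
from scipy.optimize import linprog

def kantorovich_lp(a, b, C):
    # Solve (2.11): minimize <C, P> over P >= 0 with P 1_m = a and P^T 1_n = b.
    n, m = C.shape
    A_eq = np.zeros((n + m, n * m))       # P is flattened row by row
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0  # row sums: sum_j P_{i,j} = a_i
    for j in range(m):
        A_eq[n + j, j::m] = 1.0           # column sums: sum_i P_{i,j} = b_j
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.x.reshape(n, m), res.fun   # optimal coupling P and L_C(a, b)

a = np.array([0.2, 0.5, 0.3])
b = np.array([0.25, 0.25, 0.25, 0.25])
C = np.random.default_rng(0).random((3, 4))
P, cost = kantorovich_lp(a, b, C)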

Remark 2.10 (Mines and factories). The Kantorovich problem finds a very natural
illustration in the following resource allocation problem (see also Hitchcock [1941]).
Suppose that an operator runs n warehouses and m factories. Each warehouse contains
a valuable raw material that is needed by the factories to run properly. More precisely,
each warehouse is indexed with an integer i and contains ai units of the raw material.
These raw materials must all be moved to the factories, with a prescribed quantity bj
needed at factory j to function properly. To transfer resources from a warehouse i to
a factory j, the operator can use a transportation company that will charge Ci,j to
move a single unit of the resource from location i to location j. We assume that the
transportation company has the monopoly to transport goods and applies the same
linear pricing scheme to all actors of the economy: the cost of shipping a units of the
resource from i to j is equal to a × Ci,j .
Faced with the problem described above, the operator chooses to solve the linear
program described in Equation (2.11) to obtain a transportation plan P? that quantifies
for each pair i, j the amount of goods Pi,j that must be transported from warehouse i to
factory j. The operator pays on aggregate a total of hP? , Ci to the transportation
company to execute that plan.

Permutation matrices as couplings. For a permutation σ ∈ Perm(n), we write Pσ


for the corresponding permutation matrix,
∀ (i, j) ∈ JnK²,   (P_σ)_{i,j} = 1/n if j = σ_i, and 0 otherwise.   (2.12)

One can check that in that case

⟨C, P_σ⟩ = (1/n) Σ_{i=1}^n C_{i,σ_i},

which shows that the assignment problem (2.2) can be recast as a Kantorovich prob-
lem (2.11) where the couplings P are restricted to be exactly permutation matrices:
min_{σ∈Perm(n)} (1/n) Σ_{i=1}^n C_{i,σ(i)} = min_{σ∈Perm(n)} ⟨C, P_σ⟩.

Next, one can easily check that the set of permutation matrices is strictly included in
the Birkhoff polytope U(1_n/n, 1_n/n). Indeed, for any permutation σ we have P_σ 1_n =
1_n/n and P_σ^T 1_n = 1_n/n, whereas 1_n 1_n^T/n² is a valid coupling but not a permutation
matrix. Therefore, the minimum of ⟨C, P⟩ is necessarily smaller when considering all
transportation plans than when considering only permutation matrices:

L_C(1_n/n, 1_n/n) ≤ min_{σ∈Perm(n)} ⟨C, P_σ⟩.

The following proposition shows that these problems result in fact in the same
optimum, namely that one can always find a permutation matrix that minimizes Kan-
torovich’s problem (2.11) between two uniform measures a = b = 1n /n. The Kan-
torovich relaxation is therefore tight when considered on assignment problems. Fig-
ure 2.4 shows on the left a 2-D example of optimal matching corresponding to this
special case.

Proposition 2.1 (Kantorovich for matching). If m = n and a = b = 1n /n, then there


exists an optimal solution for Problem (2.11) Pσ? , which is a permutation matrix
associated to an optimal permutation σ ? ∈ Perm(n) for Problem (2.2).

Proof. Birkhoff’s theorem [1946] states that the set of extremal points of U(1n /n, 1n /n)
is equal to the set of permutation matrices. A fundamental theorem of linear program-
ming [Bertsimas and Tsitsiklis, 1997, Theorem 2.7] states that the minimum of a linear
objective in a nonempty polyhedron, if finite, is reached at an extremal point of the
polyhedron.
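This tightness is easy to observe numerically. The sketch below is our own check (it relies
on the external POT package, imported as ot, and on SciPy, neither of which is part of
the text): for uniform weights, the Kantorovich value coincides with the assignment
value (2.2).

import numpy as np
import ot                                   # POT: Python Optimal Transport (external package)
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(1)
n = 8
C = rng.random((n, n))
a = b = np.full(n, 1.0 / n)

lp_value = ot.emd2(a, b, C)                 # Kantorovich optimum L_C(1_n/n, 1_n/n)
rows, cols = linear_sum_assignment(C)
assignment_value = C[rows, cols].sum() / n  # assignment optimum (2.2)

assert np.isclose(lp_value, assignment_value)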

Figure 2.4: Comparison of optimal matching and generic couplings. A black segment between x_i
and y_j indicates a nonzero element in the displayed optimal coupling P_{i,j} solving (2.11). Left: optimal
matching, corresponding to the setting of Proposition 2.1 (empirical measures with the same number
n = m of points). Right: these two weighted point clouds cannot be matched; instead a Kantorovich
coupling can be used to associate two arbitrary discrete measures.

Remark 2.11 (Kantorovich problem between discrete measures). For discrete mea-
sures α, β of the form (2.3), we store in the matrix C all pairwise costs between
points in the supports of α, β, namely C_{i,j} := c(x_i, y_j), to define

L_c(α, β) := L_C(a, b).   (2.13)

Therefore, the Kantorovich formulation of optimal transport between discrete mea-


sures is the same as the problem between their associated probability weight vectors
a, b except that the cost matrix C depends on the support of α and β. The nota-
tion Lc (α, β), however, is useful in some situations, because it makes explicit the
dependency with respect to both probability weights and supporting points, the
latter being exclusively considered through the cost function c.

Remark 2.12 (Using optimal assignments and couplings). The optimal transport
plan itself (either as a coupling P or a Monge map T when it exists) has found
many applications in data sciences, and in particular image processing. It has,
for instance, been used for contrast equalization [Delon, 2004] and texture syn-
thesis Gutierrez et al. [2017]. A significant part of applications of OT to imaging
sciences is for image matching [Zhu et al., 2007, Wang et al., 2013, Museyko et al.,
2009, Li et al., 2013], image fusion [Courty et al., 2016], medical imaging [Wang
et al., 2011] and shape registration [Makihara and Yagi, 2010, Lai and Zhao, 2017,
Figure 2.5: Schematic view of input measures (α, β) and couplings U(α, β) encountered in the three
main scenarios for Kantorovich OT (panels: discrete, semidiscrete, continuous). Chapter 5 is dedicated
to the semidiscrete setup.

Su et al., 2015], and image watermarking [Mathon et al., 2014]. In astrophysics,


OT has been used for reconstructing the early universe [Frisch et al., 2002]. OT has
also been used for music transcription [Flamary et al., 2016], and finds numerous
applications in economics to interpret matching data [Galichon, 2016]. Lastly, let us
note that the computation of transportation maps computed using OT techniques
(or inspired from them) is also useful to perform sampling [Reich, 2013, Oliver,
2014] and Bayesian inference [Kim et al., 2013, El Moselhy and Marzouk, 2012].

Remark 2.13 (Kantorovich problem between arbitrary measures). Definition (2.13)


of Lc is extended to arbitrary measures by considering couplings π ∈ M1+ (X × Y)
which are joint distributions over the product space. The discrete case is a
special situation where one imposes this product measure to be of the form
π = Σ_{i,j} P_{i,j} δ_{(x_i, y_j)}. In the general case, the mass conservation constraint (2.10)
should be rewritten as a marginal constraint on joint probability distributions
U(α, β) := { π ∈ M^1_+(X × Y) : P_{X]} π = α  and  P_{Y]} π = β }.   (2.14)

Here PX ] and PY] are the push-forwards (see Definition 2.1) of the projections
PX (x, y) = x and PY (x, y) = y. Figure 2.5 shows how these coupling con-
straints translate for different classes of problems (discrete measures and den-
sities). Using (2.7), these marginal constraints are equivalent to imposing that
π(A × Y) = α(A) and π(X × B) = β(B) for sets A ⊂ X and B ⊂ Y. The Kan-

torovich problem (2.11) is then generalized as


L_c(α, β) := min_{π∈U(α,β)} ∫_{X×Y} c(x, y) dπ(x, y).   (2.15)

This is an infinite-dimensional linear program over a space of measures. If (X , Y)


are compact spaces and c is continuous, then it is easy to show that it always
has solutions. Indeed U(α, β) is compact for the weak topology of measures (see
Remark 2.2), π ↦ ∫ c dπ is a continuous function for this topology and the con-
straint set is nonempty (for instance, α ⊗ β ∈ U(α, β)). Figure 2.6 shows examples
of discrete and continuous optimal coupling solving (2.15). Figure 2.7 shows other
examples of optimal 1-D couplings, involving discrete and continuous marginals.



Figure 2.6: Left: “continuous” coupling π solving (2.14) between two 1-D measures with density. The
coupling is localized along the graph of the Monge map (x, T (x)) (displayed in black). Right: “discrete”
coupling T solving (2.11) between two discrete measures of the form (2.3). The positive entries Ti,j are
displayed as black disks at position (i, j) with radius proportional to Ti,j .


Figure 2.7: Four simple examples of optimal couplings between 1-D distributions, represented as
maps above (arrows) and couplings below. Inspired by Lévy and Schwindt [2018].

Remark 2.14 (Probabilistic interpretation). Kantorovich’s problem can be reinter-


preted through the prism of random variables, following Remark 2.9. Indeed, Prob-
lem (2.15) is equivalent to
L_c(α, β) = min_{(X,Y)} { E_{(X,Y)}(c(X, Y)) : X ∼ α, Y ∼ β },   (2.16)

where (X, Y ) is a couple of random variables over X × Y and X ∼ α (resp., Y ∼ β)


means that the law of X (resp., Y ), represented as a measure, must be α (resp.,
β). The law of the couple (X, Y ) is then π ∈ U(α, β) over the product space X × Y.

2.4 Metric Properties of Optimal Transport

An important feature of OT is that it defines a distance between histograms and prob-


ability measures as soon as the cost matrix satisfies certain suitable properties. Indeed,
OT can be understood as a canonical way to lift a ground distance between points to
a distance between histograms or measures.
We first consider the case where, using a term first introduced by Rubner et al.
[2000], the “ground metric” matrix C is fixed, representing substitution costs between
bins, and shared across several histograms we would like to compare. The following
proposition states that OT provides a valid distance between histograms supported on
these bins.
Proposition 2.2. We suppose n = m and that for some p ≥ 1, C = D^p = (D_{i,j}^p)_{i,j} ∈ R^{n×n},
where D ∈ R_+^{n×n} is a distance on JnK, i.e.
(i) D ∈ R_+^{n×n} is symmetric;
(ii) D_{i,j} = 0 if and only if i = j;
(iii) ∀ (i, j, k) ∈ JnK³, D_{i,k} ≤ D_{i,j} + D_{j,k}.
Then
W_p(a, b) := L_{D^p}(a, b)^{1/p}   (2.17)
(note that Wp depends on D) defines the p-Wasserstein distance on Σn , i.e. Wp is
symmetric, positive, Wp (a, b) = 0 if and only if a = b, and it satisfies the triangle
inequality
∀ a, b, c ∈ Σn , Wp (a, c) ≤ Wp (a, b) + Wp (b, c).
Proof. Symmetry and definiteness of the distance are easy to prove: since C = Dp
has a null diagonal, Wp (a, a) = 0, with corresponding optimal transport matrix P? =
diag(a); by the positivity of all off-diagonal elements of Dp , Wp (a, b) > 0 whenever
a ≠ b (because in this case, an admissible coupling necessarily has a nonzero element
outside the diagonal); by symmetry of Dp , Wp (a, b) is itself a symmetric function.

To prove the triangle inequality of Wasserstein distances for arbitrary measures,


Villani [2003, Theorem 7.3] uses the gluing lemma, which stresses the existence of
couplings with a prescribed structure. In the discrete setting, the explicit construction
of this glued coupling is simple. Let a, b, c ∈ Σn. Let P and Q be two optimal solutions
of the transport problems between a and b, and b and c, respectively. To avoid issues
that may arise from null coordinates in b, we define a vector b̃ such that b̃_j := b_j if
b_j > 0, and b̃_j := 1 otherwise, to write

S := P diag(1/b̃) Q ∈ R_+^{n×n},

and notice that S ∈ U(a, c) because


S1n = P diag(1/b̃)Q1n = P(b/b̃) = P1Supp(b) = a,
where we denoted 1Supp(b) the vector of size n with ones located at those indices j
where bj > 0 and zero otherwise, and we use the fact that P1Supp(b) = P1 = a because
necessarily Pi,j = 0 for those j where bj = 0. Similarly one verifies that ST 1n = c. The
triangle inequality follows then from

W_p(a, c) = ( min_{P∈U(a,c)} ⟨P, D^p⟩ )^{1/p} ≤ ⟨S, D^p⟩^{1/p}
 = ( Σ_{i,k} D_{i,k}^p Σ_j P_{i,j} Q_{j,k} / b̃_j )^{1/p}
 ≤ ( Σ_{i,j,k} (D_{i,j} + D_{j,k})^p P_{i,j} Q_{j,k} / b̃_j )^{1/p}
 ≤ ( Σ_{i,j,k} D_{i,j}^p P_{i,j} Q_{j,k} / b̃_j )^{1/p} + ( Σ_{i,j,k} D_{j,k}^p P_{i,j} Q_{j,k} / b̃_j )^{1/p}.

The first inequality is due to the suboptimality of S, the second is the triangle inequality
for elements in D, and the third comes from Minkowski's inequality. One thus has

W_p(a, c) ≤ ( Σ_{i,j} D_{i,j}^p P_{i,j} Σ_k Q_{j,k} / b̃_j )^{1/p} + ( Σ_{j,k} D_{j,k}^p Q_{j,k} Σ_i P_{i,j} / b̃_j )^{1/p}
 = ( Σ_{i,j} D_{i,j}^p P_{i,j} )^{1/p} + ( Σ_{j,k} D_{j,k}^p Q_{j,k} )^{1/p}
 = W_p(a, b) + W_p(b, c),


which concludes the proof.

Remark 2.15 (The cases 0 < p ≤ 1). Note that if 0 < p ≤ 1, then D^p is itself a distance.
This implies that while for p ≥ 1, W_p(a, b) is a distance, in the case p ≤ 1, it is actually
W_p(a, b)^p which defines a distance on the simplex.
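In code, (2.17) amounts to solving (2.11) with cost D^p and taking a p-th root. The sketch
below is our own illustration (it uses the external POT package, imported as ot, which is
not part of the text); it also checks the triangle inequality proved above on a small example.

import numpy as np
import ot                                   # POT: Python Optimal Transport (external package)

def wasserstein_p(a, b, D, p=2):
    # W_p(a, b) = L_{D^p}(a, b)^{1/p}, cf. (2.17)
    return ot.emd2(a, b, D ** p) ** (1.0 / p)

n = 5
x = np.linspace(0.0, 1.0, n)
D = np.abs(x[:, None] - x[None, :])         # ground distance between n bins on a line
a = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
b = np.array([0.3, 0.3, 0.2, 0.1, 0.1])
c = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

# Triangle inequality W_p(a, c) <= W_p(a, b) + W_p(b, c), as proved above.
assert wasserstein_p(a, c, D) <= wasserstein_p(a, b, D) + wasserstein_p(b, c, D) + 1e-10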

Remark 2.16 (Applications of Wasserstein distances). The fact that the OT dis-
tance automatically “lifts” a ground metric between bins to a metric between
histograms on such bins makes it a method of choice for applications in computer
vision and machine learning to compare histograms. In these fields, a classical ap-
proach is to “pool” local features (for instance, image descriptors) and compute
a histogram of the empirical distribution of features (a so-called bag of features)
to perform retrieval, clustering or classification; see, for instance, [Oliva and Tor-
ralba, 2001]. Along a similar line of ideas, OT distances can be used over some lifted
feature spaces to perform signal and image analysis [Thorpe et al., 2017]. Appli-
cations to retrieval and clustering were initiated by the landmark paper [Rubner
et al., 2000], with renewed applications following faster algorithms for threshold
matrices C that fit for some applications, for example, in computer vision [Pele and
Werman, 2008, 2009]. More recent applications stress the use of the earth mover’s
distance for bags-of-words, either to carry out dimensionality reduction [Rolet et al.,
2016] and classify texts [Kusner et al., 2015, Huang et al., 2016], or to define an
alternative loss to train multiclass classifiers that output bags-of-words [Frogner
et al., 2015]. Kolouri et al. [2017] provides a recent overview of such applications
to signal processing and machine learning.

Remark 2.17 (Wasserstein distance between measures). Proposition 2.2 can be


generalized to deal with arbitrary measures that need not be discrete.

Proposition 2.3. We assume X = Y and that for some p ≥ 1, c(x, y) = d(x, y)p ,
where d is a distance on X , i.e.

(i) d(x, y) = d(y, x) ≥ 0;


(ii) d(x, y) = 0 if and only if x = y;
(iii) ∀ (x, y, z) ∈ X 3 , d(x, z) ≤ d(x, y) + d(y, z).

Then the p-Wasserstein distance on X,

W_p(α, β) := L_{d^p}(α, β)^{1/p}   (2.18)

(note that W p depends on d), is indeed a distance, namely W p is symmetric,


nonnegative, W p (α, β) = 0 if and only if α = β, and it satisfies the triangle
inequality

∀ (α, β, γ) ∈ M1+ (X )3 , W p (α, γ) ≤ W p (α, β) + W p (β, γ).



Proof. The proof follows the same approach as that for Proposition 2.2 and relies on
the existence of a coupling between (α, γ) obtained by “gluing” optimal couplings
between (α, β) and (β, γ).

Remark 2.18 (Geometric intuition and weak convergence). The Wasserstein dis-
tance W p has many important properties, the most important being that it is
a weak distance, i.e. it allows one to compare singular distributions (for instance,
discrete ones) whose supports do not overlap and to quantify the spatial shift
between the supports of two distributions. In particular, “classical” distances (or
divergences) are not even defined between discrete distributions (the L2 norm can
only be applied to continuous measures with a density with respect to a base mea-
sure, and the discrete ℓ² norm requires that positions (x_i, y_j) take values in a pre-
determined discrete set to work properly). In sharp contrast, one has that for any
p > 0, W_p(δ_x, δ_y) = d(x, y). Indeed, it suffices to notice that U(δ_x, δ_y) = {δ_{(x,y)}} and
therefore, the Kantorovich problem having only one feasible solution, W_p(δ_x, δ_y)
is necessarily (d(x, y)^p)^{1/p} = d(x, y). This shows that W_p(δ_x, δ_y) → 0 if x → y.
This property corresponds to the fact that W p is a way to quantify the weak
convergence, as we now define.

Definition 2.2 (Weak convergence). On a compact domain X , (αk )k converges


weakly to α in M^1_+(X) (denoted α_k ⇀ α) if and only if for any continuous function
g ∈ C(X), ∫_X g dα_k → ∫_X g dα. One needs to add additional decay conditions on
g on noncompact domains. This notion of weak convergence corresponds to the
convergence in law of random vectors.

This convergence can be shown to be equivalent to W p (αk , α) → 0 [Villani,


2009, Theorem 6.8] (together with a convergence of the moments up to order p for
unbounded metric spaces).

Remark 2.19 (Translations). A nice feature of the Wasserstein distance over a Eu-
clidean space X = R^d for the ground cost c(x, y) = ‖x − y‖² is that one can factor
out translations; indeed, denoting T_τ : x ↦ x − τ the translation operator, one has

W_2(T_{τ]}α, T_{τ']}β)² = W_2(α, β)² − 2⟨τ − τ', m_α − m_β⟩ + ‖τ − τ'‖²,

where m_α := ∫_X x dα(x) ∈ R^d is the mean of α. In particular, this implies the nice
decomposition of the distance as

W_2(α, β)² = W_2(α̃, β̃)² + ‖m_α − m_β‖²,

where (α̃, β̃) are the “centered” zero mean measures, α̃ = T_{m_α]}α.
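This identity can be checked numerically on empirical measures; the sketch below is
our own verification with made-up data (it relies on the external POT package, imported
as ot, which is not part of the text).

import numpy as np
import ot                                   # POT: Python Optimal Transport (external package)

rng = np.random.default_rng(0)
n, d = 30, 2
x = rng.normal(size=(n, d))                 # support of alpha (uniform weights)
y = rng.normal(size=(n, d)) + 1.0           # support of beta (uniform weights)
a = b = np.full(n, 1.0 / n)
tau = np.array([0.5, -1.0])
tau_p = np.array([-0.3, 0.2])

def w2_sq(u, v):
    # squared 2-Wasserstein distance between the two empirical measures
    M = ((u[:, None, :] - v[None, :, :]) ** 2).sum(-1)
    return ot.emd2(a, b, M)

lhs = w2_sq(x - tau, y - tau_p)             # T_tau(x) = x - tau
rhs = (w2_sq(x, y)
       - 2 * np.dot(tau - tau_p, x.mean(0) - y.mean(0))
       + np.dot(tau - tau_p, tau - tau_p))
assert np.isclose(lhs, rhs)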

Remark 2.20 (The case p = +∞). Informally, the limit of W_p as p → +∞ is

W_∞(α, β) := min_{π∈U(α,β)}  sup_{(x,y)∈Supp(π)} d(x, y),   (2.19)

where the sup should be understood as the essential supremum according to the
measure π on X 2 . In contrast to the cases p < +∞, this is a nonconvex optimization
problem, which is difficult to solve numerically and to study theoretically. The W ∞
distance is related to the Hausdorff distance between the supports of (α, β); see
§ 10.6.1. We refer to [Champion et al., 2008] for details.

2.5 Dual Problem

The Kantorovich problem (2.11) is a constrained convex minimization problem, and as


such, it can be naturally paired with a so-called dual problem, which is a constrained
concave maximization problem. The following fundamental proposition explains the
relationship between the primal and dual problems.
Proposition 2.4. The Kantorovich problem (2.11) admits the dual

L_C(a, b) = max_{(f,g)∈R(C)} ⟨f, a⟩ + ⟨g, b⟩,   (2.20)

where the set of admissible dual variables is

R(C) := { (f, g) ∈ R^n × R^m : ∀ (i, j) ∈ JnK × JmK, f_i + g_j ≤ C_{i,j} }.   (2.21)
Such dual variables are often referred to as “Kantorovich potentials.”
Proof. This result is a direct consequence of the more general result on the strong
duality for linear programs [Bertsimas and Tsitsiklis, 1997, p. 148, Theo. 4.4]. The
easier part of the proof, namely, establishing that the right-hand side of Equation (2.20)
is a lower bound of LC (a, b), is discussed in Remark 3.2 in the next section. For the
sake of completeness, let us derive our result using Lagrangian duality. The Lagrangian
associated to (2.11) reads

min_{P≥0} max_{(f,g)∈R^n×R^m} ⟨C, P⟩ + ⟨a − P 1_m, f⟩ + ⟨b − P^T 1_n, g⟩.   (2.22)

We exchange the min and the max above, which is always possible when considering
linear programs (in finite dimension), to obtain

max_{(f,g)∈R^n×R^m} ⟨a, f⟩ + ⟨b, g⟩ + min_{P≥0} ⟨C − f 1_m^T − 1_n g^T, P⟩.

We conclude by remarking that

min_{P≥0} ⟨Q, P⟩ = 0 if Q ≥ 0, and −∞ otherwise,

so that the constraint reads C − f 1_m^T − 1_n g^T = C − f ⊕ g ≥ 0.

The primal-dual optimality relation for the Lagrangian (2.22) allows us to locate
the support of the optimal transport plan (see also §3.3):

{ (i, j) ∈ JnK × JmK : P_{i,j} > 0 } ⊂ { (i, j) ∈ JnK × JmK : f_i + g_j = C_{i,j} }.   (2.23)
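The dual (2.20) is itself a linear program and can be solved directly. The sketch below
is our own illustration in Python with SciPy's generic linprog solver (all names are
ours): it maximizes ⟨f, a⟩ + ⟨g, b⟩ under the n × m constraints f_i + g_j ≤ C_{i,j}, and by
strong duality the value it returns coincides with L_C(a, b) of (2.11).

import numpy as np
from scipy.optimize import linprog

def kantorovich_dual(a, b, C):
    # Maximize <f, a> + <g, b> subject to f_i + g_j <= C_{i,j}; variables z = (f, g).
    n, m = C.shape
    A_ub = np.zeros((n * m, n + m))
    for i in range(n):
        for j in range(m):
            A_ub[i * m + j, i] = 1.0      # coefficient of f_i
            A_ub[i * m + j, n + j] = 1.0  # coefficient of g_j
    res = linprog(-np.concatenate([a, b]),   # linprog minimizes, so negate the objective
                  A_ub=A_ub, b_ub=C.ravel(),
                  bounds=(None, None), method="highs")
    return res.x[:n], res.x[n:], -res.fun    # potentials f, g and the dual value

a = np.array([0.2, 0.5, 0.3])
b = np.array([0.25, 0.25, 0.25, 0.25])
C = np.random.default_rng(0).random((3, 4))
f, g, dual_value = kantorovich_dual(a, b, C)   # dual_value equals L_C(a, b)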

Remark 2.21. Following the interpretation given to the Kantorovich problem in Re-
mark 2.10, we follow with an intuitive presentation of the dual. Recall that in that
setup, an operator wishes to move at the least possible cost an overall amount of re-
sources from warehouses to factories. The operator can do so by solving (2.11), following
the instructions set out in P⋆, and paying ⟨P⋆, C⟩ to the transportation company.
Outsourcing logistics. Suppose that the operator does not have the computational
means to solve the linear program (2.11). He decides instead to outsource that task to
a vendor. The vendor chooses a pricing scheme with the following structure: the vendor
splits the logistic task into that of collecting and then delivering the goods and will
apply a collection price fi to collect a unit of resource at each warehouse i (no matter
where that unit is sent to) and a price gj to deliver a unit of resource to factory j
(no matter from which warehouse that unit comes from). On aggregate, since there
are exactly ai units at warehouse i and bj needed at factory j, the vendor asks as a
consequence of that pricing scheme a price of hf, ai + hg, bi to solve the operator’s
logistic problem.
Setting prices. Note that the pricing system used by the vendor allows quite nat-
urally for arbitrarily negative prices. Indeed, if the vendor applies a price vector f for
warehouses and a price vector g for factories, then the total bill will not be changed
by simultaneously decreasing all entries in f by an arbitrary number and increasing all
entries of g by that same number, since the total amount of resources in all warehouses
is equal to those that have to be delivered to the factories. In other words, the vendor
can give the illusion of giving an extremely good deal to the operator by paying him to
collect some of his goods, but compensate that loss by simply charging him more for
delivering them. Knowing this, the vendor, wishing to charge as much as she can for
that service, sets vectors f and g to be as high as possible.
Checking prices. In the absence of another competing vendor, the operator must
therefore think of a quick way to check that the vendor’s prices are reasonable. A
possible way to do so would be for the operator to compute the price LC (a, b) of the
most efficient plan by solving problem (2.11) and check if the vendor’s offer is at the
very least no larger than that amount. However, recall that the operator cannot afford
such a lengthy computation in the first place. Luckily, there is a far more efficient way
for the operator to check whether the vendor has a competitive offer. Recall that fi
is the price charged by the vendor for picking a unit at i and gj to deliver one at j.
Therefore, the vendor’s pricing scheme implies that transferring one unit of the resource

from i to j costs exactly fi + gj . Yet, the operator also knows that the cost of shipping
one unit from i to j as priced by the transporting company is Ci,j . Therefore, if for any
pair i, j the aggregate price fi + gj is strictly larger than Ci,j , the vendor is charging
more than the fair price charged by the transportation company for that task, and the
operator should refuse the vendor’s offer.

Figure 2.8: Consider in the left plot the optimal transport problem between two discrete measures α
and β, represented respectively by blue dots and red squares. The area of these markers is proportional
to the weight at each location. That plot also displays the optimal transport P? using a quadratic
Euclidean cost. The corresponding dual (Kantorovich) potentials f? and g? that correspond to that
configuration are also displayed on the right plot. Since there is a “price” f?i for each point in α (and
conversely for g and β), the color at that point represents the obtained value using the color map on
the right. These potentials can be interpreted as relative prices in the sense that they indicate the
individual cost, under the best possible transport scheme, to move a mass away at each location in α,
or on the contrary to send a mass toward any point in β. The optimal transport cost is therefore equal
to the sum of the squared lengths of all the arcs on the left weighted by their thickness or, alternatively,
using the dual formulation, to the sum of the values (encoded with colors) multiplied by the area of
each marker on the right plot.

Optimal prices as a dual problem. It is therefore in the interest of the operator to


check that for all pairs i, j the prices offered by the vendor verify fi +gj ≤ Ci,j . Suppose
that the operator does check that the vendor has provided price vectors that do comply
with these n × m inequalities. Can he conclude that the vendor’s proposal is attractive?
Doing a quick back of the hand calculation, the operator does indeed conclude that it
is in his interest to accept that offer. Indeed, since any of his transportation plans P
would have a cost ⟨P, C⟩ = Σ_{i,j} P_{i,j} C_{i,j}, the operator can conclude by applying these
n × m inequalities that for any transport plan P (including the optimal one P? ), the
marginal constraints imply
   
Σ_{i,j} P_{i,j} C_{i,j} ≥ Σ_{i,j} P_{i,j} (f_i + g_j) = Σ_i f_i ( Σ_j P_{i,j} ) + Σ_j g_j ( Σ_i P_{i,j} )
 = ⟨f, a⟩ + ⟨g, b⟩,


and therefore observe that any attempt at doing the job by himself would necessarily
be more expensive than the vendor’s price.

Knowing this, the vendor must therefore find a set of prices f, g that maximize
hf, ai + hg, bi but that must satisfy at the very least for all i, j the basic inequality
that fi + gj ≤ Ci,j for his offer to be accepted, which results in Problem (2.20). One can
show, as we do later in §3.1, that the best price obtained by the vendor is in fact exactly
equal to the best possible cost the operator would obtain by computing LC (a, b).
Figure 2.8 illustrates the primal and dual solutions resulting from the same transport
problem. On the left, blue dots represent warehouses and red dots stand for factories; the
areas of these dots stand for the probability weights a, b, links between them represent
an optimal transport, and their width is proportional to transferred amounts. Optimal
prices obtained by the vendor as a result of optimizing Problem (2.20) are shown on
the right. Prices have been chosen so that their mean is equal to 0. The highest relative
prices come from collecting goods at an isolated warehouse on the lower left of the
figure, and delivering goods at the factory located in the upper right area.

Remark 2.22 (Dual problem between arbitrary measures). To extend this primal-
dual construction to arbitrary measures, it is important to realize that measures are
naturally paired in duality with continuous functions (a measure can be accessed
only through integration against continuous functions). The duality is formalized
in the following proposition, which boils down to Proposition 2.4 when dealing with
discrete measures.

Proposition 2.5. One has
$$\mathcal{L}_c(\alpha, \beta) = \sup_{(f,g) \in \mathcal{R}(c)} \int_{\mathcal{X}} f(x)\, \mathrm{d}\alpha(x) + \int_{\mathcal{Y}} g(y)\, \mathrm{d}\beta(y), \qquad (2.24)$$
where the set of admissible dual potentials is
$$\mathcal{R}(c) \stackrel{\text{def.}}{=} \big\{(f, g) \in \mathcal{C}(\mathcal{X}) \times \mathcal{C}(\mathcal{Y}) : \forall (x, y),\; f(x) + g(y) \leq c(x, y)\big\}. \qquad (2.25)$$

Here, (f, g) is a pair of continuous functions, also called, as in the discrete case, "Kantorovich potentials."

The discrete case (2.20) corresponds to the dual vectors being samples of the
continuous potentials, i.e. (fi , gj ) = (f (xi ), g(yj )). The primal-dual optimality con-
ditions allow us to track the support of the optimal plan, and (2.23) is generalized
as
Supp(π) ⊂ {(x, y) ∈ X × Y : f (x) + g(y) = c(x, y)} . (2.26)
Note that in contrast to the primal problem (2.15), showing the existence of
solutions to (2.24) is nontrivial, because the constraint set R(c) is not compact and
the objective function is not coercive. Using the machinery of the c-transform detailed

in § 5.1, in the case c(x, y) = d(x, y)p with p ≥ 1, one can, however, show that
optimal (f, g) are necessarily Lipschitz regular, which enables us to replace the
constraint by a compact one.

Remark 2.23 (Unconstrained dual). In the case ∫_X dα = ∫_Y dβ = 1, the constrained dual problem (2.24) can be replaced by an unconstrained one,
$$\mathcal{L}_c(\alpha, \beta) = \sup_{(f,g) \in \mathcal{C}(\mathcal{X}) \times \mathcal{C}(\mathcal{Y})} \int_{\mathcal{X}} f\, \mathrm{d}\alpha + \int_{\mathcal{Y}} g\, \mathrm{d}\beta + \min_{\mathcal{X} \times \mathcal{Y}} (c - f \oplus g), \qquad (2.27)$$
where we denoted (f ⊕ g)(x, y) = f(x) + g(y). Here the minimum should be considered as the essential infimum associated to the measure α ⊗ β, i.e., it does not change if f or g is modified on sets of zero measure for α and β. This alternative dual formulation was pointed out to us by Francis Bach. It is obtained from the primal problem (2.15) by adding the redundant constraint ∫ dπ = 1.

Remark 2.24 (Monge–Kantorovich equivalence—Brenier theorem). The following


theorem is often attributed to Brenier [1991] and ensures that in Rd for p = 2, if
at least one of the two input measures has a density, and for measures with second
order moments, then the Kantorovich and Monge problems are equivalent. The
interested reader should also consult variants of the same result published more
or less at the same time by Cuesta and Matran [1989], Rüschendorf and Rachev
[1990], including notably the original result in [Brenier, 1987] and a precursor
by Knott and Smith [1984].

Theorem 2.1 (Brenier). In the case X = Y = R^d and c(x, y) = ‖x − y‖², if at least one of the two input measures (denoted α) has a density ρ_α with respect to the Lebesgue measure, then the optimal π in the Kantorovich formulation (2.15) is unique and is supported on the graph (x, T(x)) of a "Monge map" T : R^d → R^d. This means that π = (Id, T)_♯ α, i.e.
$$\forall\, h \in \mathcal{C}(\mathcal{X} \times \mathcal{Y}), \quad \int_{\mathcal{X} \times \mathcal{Y}} h(x, y)\, \mathrm{d}\pi(x, y) = \int_{\mathcal{X}} h(x, T(x))\, \mathrm{d}\alpha(x). \qquad (2.28)$$
Furthermore, this map T is uniquely defined as the gradient of a convex function ϕ, T(x) = ∇ϕ(x), where ϕ is the unique (up to an additive constant) convex function such that (∇ϕ)_♯ α = β. This convex function is related to the dual potential f solving (2.24) as ϕ(x) = ‖x‖²/2 − f(x).

Proof. We sketch the main ingredients of the proof; more details can be found, for instance, in [Santambrogio, 2015]. We remark that ∫ c dπ = C_{α,β} − 2 ∫ ⟨x, y⟩ dπ(x, y), where the constant is C_{α,β} = ∫ ‖x‖² dα(x) + ∫ ‖y‖² dβ(y). Instead of solving (2.15), one can thus consider the problem


$$\max_{\pi \in \mathcal{U}(\alpha, \beta)} \int_{\mathcal{X} \times \mathcal{Y}} \langle x, y\rangle\, \mathrm{d}\pi(x, y),$$
whose dual reads
$$\min_{(\varphi, \psi)} \left\{ \int_{\mathcal{X}} \varphi\, \mathrm{d}\alpha + \int_{\mathcal{Y}} \psi\, \mathrm{d}\beta : \forall (x, y),\; \varphi(x) + \psi(y) \geq \langle x, y\rangle \right\}. \qquad (2.29)$$
The relation between these variables and those of (2.25) is (ϕ, ψ) = (‖·‖²/2 − f, ‖·‖²/2 − g). One can replace the constraint by
$$\forall\, y, \quad \psi(y) \geq \varphi^*(y) \stackrel{\text{def.}}{=} \sup_{x} \langle x, y\rangle - \varphi(x). \qquad (2.30)$$

Here ϕ* is the Legendre transform of ϕ and is a convex function as a supremum of linear forms (see also (4.54)). Since the objective appearing in (2.29) is linear and the integrating measures positive, one can minimize explicitly with respect to ψ and set ψ = ϕ* in order to consider the unconstrained problem
$$\min_{\varphi} \int_{\mathcal{X}} \varphi\, \mathrm{d}\alpha + \int_{\mathcal{Y}} \varphi^*\, \mathrm{d}\beta; \qquad (2.31)$$

see also §3.2 and §5.1, where that idea is applied respectively in the discrete
setting and for generic costs c(x, y). By iterating this argument twice, one
can replace ϕ by ϕ∗∗ , which is a convex function, and thus impose in (2.31)
that ϕ is convex. Condition (2.26) shows that an optimal π is supported on
{(x, y) : ϕ(x) + ϕ∗ (y) = hx, yi}, which shows that such a y is optimal for the
minimization (2.30) of the Legendre transform, whose optimality condition reads
y ∈ ∂ϕ(x). Since ϕ is convex, it is differentiable almost everywhere, and since α has
a density, it is also differentiable α-almost everywhere. This shows that for each
x, the associated y is uniquely defined α-almost everywhere as y = ∇ϕ(x), and it
shows that necessarily π = (Id, ∇ϕ)] α.

This result shows that in the setting of W_2 with nonsingular densities, the
Monge problem (2.9) and its Kantorovich relaxation (2.15) are equal (the relaxation
is tight). This is the continuous counterpart of Proposition 2.1 for the assignment
case (2.1), which states that the minimum of the optimal transport problem is
achieved at a permutation matrix (a discrete map) when the marginals are equal
and uniform. Brenier’s theorem, stating that an optimal transport map must be
the gradient of a convex function, provides a useful generalization of the notion
of increasing functions in dimension more than one. This is the main reason why
optimal transport can be used to define quantile functions in arbitrary dimensions,
which is in turn useful for applications to quantile regression problems [Carlier
et al., 2016].
Note also that this theorem can be extended in many directions. The condition
that α has a density can be weakened to the condition that it does not give mass
to “small sets” having Hausdorff dimension smaller than d − 1 (e.g. hypersurfaces).
One can also consider costs of the form c(x, y) = h(x − y), where h is a strictly
convex function.

Remark 2.25 (Monge–Ampère equation). For measures with densities, using (2.8),
one obtains that ϕ is the unique (up to the addition of a constant) convex function
which solves the following Monge–Ampère-type equation:

det(∂ 2 ϕ(x))ρβ (∇ϕ(x)) = ρα (x) (2.32)

where ∂ 2 ϕ(x) ∈ Rd×d is the Hessian of ϕ. The Monge–Ampère operator det(∂ 2 ϕ(x))
can be understood as a nonlinear degenerate Laplacian. In the limit of small dis-
placements, ϕ = Id + εψ, one indeed recovers the Laplacian ∆ as a linearization
since for smooth maps

det(∂ 2 ϕ(x)) = 1 + ε∆ψ(x) + o(ε).
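For completeness, here is the short expansion behind this linearization, under our reading of ϕ = Id + εψ as ϕ(x) = ‖x‖²/2 + εψ(x) (so that ∇ϕ = Id + ε∇ψ and ∂²ϕ = I_d + ε∂²ψ):

```latex
\det\big(\partial^2\varphi(x)\big)
  = \det\big(\mathrm{I}_d + \varepsilon\,\partial^2\psi(x)\big)
  = 1 + \varepsilon\,\mathrm{tr}\big(\partial^2\psi(x)\big) + o(\varepsilon)
  = 1 + \varepsilon\,\Delta\psi(x) + o(\varepsilon).
```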

The convexity constraint forces det(∂ 2 ϕ(x)) ≥ 0 and is necessary for this equation
to have a solution. There is a large body of literature on the theoretical analysis of
the Monge–Ampère equation, and in particular the regularity of its solution—see,
for instance, [Gutiérrez, 2016]; we refer the interested reader to the review paper
by Caffarelli [2003]. A major difficulty is that in full generality, solutions need not
be smooth, and one has to resort to the machinery of Alexandrov solutions when the
input measures are arbitrary (e.g. Dirac masses). Many solvers have been proposed
in the simpler case of the Monge–Ampère equation det(∂ 2 ϕ(x)) = f (x) for a fixed
right-hand-side f ; see, for instance, [Benamou et al., 2016b] and the references
therein. In particular, capturing anisotropic convex functions requires special care,
and usual finite differences can be inaccurate. For optimal transport, where f
actually depends on ∇ϕ, the discretization of Equation (2.32), and the boundary
condition result in technical challenges outlined in [Benamou et al., 2014] and the
references therein. Note also that related solvers based on fixed-point iterations
have been applied to image registration [Haker et al., 2004].

2.6 Special Cases

In general, computing OT distances is numerically involved. Before detailing in §§3,4,


and 7 different numerical solvers, we first review special favorable cases where the
resolution of the OT problem is relatively easy.

Remark 2.26 (Binary cost matrix and 1-norm). One can easily check that when the
cost matrix C is 0 on the diagonal and 1 elsewhere, namely, when C = 1n×n − In ,
the 1-Wasserstein distance between a and b is equal to half the 1-norm of their difference,
L_C(a, b) = ½ ‖a − b‖₁.

Remark 2.27 (Kronecker cost function and total variation). In addition to Re-
mark 2.26 above, one can also easily check that this result extends to arbitrary
measures in the case where c(x, y) is 0 if x = y and 1 when x 6= y. The OT
distance between two discrete measures α and β is equal to their total variation
distance (see also Example 8.2).

Remark 2.28 (1-D case—Empirical measures). Here X = R. Assuming α = (1/n) Σ_{i=1}^n δ_{x_i} and β = (1/n) Σ_{j=1}^n δ_{y_j}, and assuming (without loss of generality) that the points are ordered, i.e. x_1 ≤ x_2 ≤ · · · ≤ x_n and y_1 ≤ y_2 ≤ · · · ≤ y_n, then one has the simple formula
$$W_p(\alpha, \beta)^p = \frac{1}{n} \sum_{i=1}^{n} |x_i - y_i|^p, \qquad (2.33)$$

i.e. locally (if one assumes distinct points), W_p(α, β) is the ℓ^p norm between two
vectors of ordered values of α and β. That statement is valid only locally, in the
sense that the order (and those vector representations) might change whenever
some of the values change. That formula is a simple consequence of the more
general setting detailed in Remark 2.30. Figure 2.9, top row, illustrates the 1-D
transportation map between empirical measures with the same number of points.
The bottom row shows how this monotone map generalizes to arbitrary discrete
measures.
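A minimal NumPy sketch of formula (2.33) (the function name is ours, for illustration): sort both samples and average the p-th powers of the coordinatewise gaps.

```python
import numpy as np

def wasserstein_1d_empirical(x, y, p=2):
    """W_p(alpha, beta)^p for two uniform empirical measures with the same
    number of points, following (2.33): sort and compare rank by rank."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert x.shape == y.shape
    return np.mean(np.abs(x - y) ** p)

# toy usage: two samples of 200 points on the real line
x = np.random.default_rng(0).normal(0.0, 1.0, size=200)
y = np.random.default_rng(1).normal(1.0, 0.5, size=200)
w2_sq = wasserstein_1d_empirical(x, y, p=2)   # estimate of W_2(alpha, beta)^2
```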
It is also possible to leverage this 1-D computation to also compute efficiently
OT on the circle as shown by Delon et al. [2010]. Note that if the cost is a concave
function of the distance, notably when p < 1, the behavior of the optimal transport
plan is very different, yet efficient solvers also exist [Delon et al., 2012].

Figure 2.9: 1-D optimal couplings: each arrow xi → yj indicates a nonzero Pi,j in the optimal
coupling. Top: empirical measures with same number of points (optimal matching). Bottom: generic
case. This corresponds to monotone rearrangements: if x_i ≤ x_{i'} are such that P_{i,j} ≠ 0 and P_{i',j'} ≠ 0, then
necessarily y_j ≤ y_{j'}.

Remark 2.29 (Histogram equalization). One-dimensional optimal transport can be


used to perform histogram equalization, with applications to the normalization of
the palette of grayscale images, see Figure 2.10. In this case, one denotes (x̄i )i and
(ȳj )j the gray color levels (0 for black, 1 for white, and all values in between) of all
pixels of the two input images enumerated in a predefined order (i.e. columnwise).
Assuming the number of pixels in each image is the same and equal to n×m, sorting
these color levels defines xi = x̄σ1 (i) and yj = ȳσ2 (j) as in Remark 2.28, where
σ_1, σ_2 : {1, . . . , nm} → {1, . . . , nm} are permutations, so that σ := σ_2 ∘ σ_1^{-1} is the
optimal assignment between the two discrete distributions. For image processing
applications, (ȳσ(i) )i defines the color values of an equalized version of x̄, whose
empirical distribution matches exactly the one of ȳ. The equalized version of that
image can be recovered by folding back that nm-dimensional vector as an image
of size n × m. Also, t ∈ [0, 1] 7→ (1 − t)x̄i + tȳσ(i) defines an interpolation between
the original image and the equalized one, whose empirical distribution of pixels is
the displacement interpolation (as defined in (7.7)) between those of the inputs.
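A minimal sketch of this equalization procedure, assuming NumPy and two grayscale images stored as arrays with the same number of pixels (function and variable names are ours):

```python
import numpy as np

def equalize(x_img, y_img, t=1.0):
    """Match the pixel distribution of x_img to that of y_img by monotone
    rearrangement (Remark 2.29); t in [0, 1] interpolates between the original
    image (t = 0) and the fully equalized one (t = 1)."""
    x, y = x_img.ravel(), y_img.ravel()
    assert x.size == y.size, "both images must have the same number of pixels"
    sigma1, sigma2 = np.argsort(x), np.argsort(y)   # sorting permutations
    matched = np.empty_like(x)
    matched[sigma1] = y[sigma2]   # pixel of rank k in x receives the value of rank k in y
    return ((1 - t) * x + t * matched).reshape(x_img.shape)
```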

Remark 2.30 (1-D case—Generic case). For a measure α on R, we introduce the cumulative distribution function from R to [0, 1] defined as
$$\forall\, x \in \mathbb{R}, \quad \mathcal{C}_\alpha(x) \stackrel{\text{def.}}{=} \int_{-\infty}^{x} \mathrm{d}\alpha, \qquad (2.34)$$
and its pseudoinverse C_α^{-1} : [0, 1] → R ∪ {−∞},
$$\forall\, r \in [0, 1], \quad \mathcal{C}_\alpha^{-1}(r) = \min_{x} \big\{x \in \mathbb{R} \cup \{-\infty\} : \mathcal{C}_\alpha(x) \geq r\big\}. \qquad (2.35)$$
That function is also called the generalized quantile function of α. For any p ≥ 1,

Figure 2.10: Histogram equalization for image processing, where t ∈ {0, 0.25, 0.5, 0.75, 1} parameterizes the displacement interpolation between the histograms.

one has
$$W_p(\alpha, \beta)^p = \big\| \mathcal{C}_\alpha^{-1} - \mathcal{C}_\beta^{-1} \big\|_{L^p([0,1])}^p = \int_0^1 |\mathcal{C}_\alpha^{-1}(r) - \mathcal{C}_\beta^{-1}(r)|^p\, \mathrm{d}r. \qquad (2.36)$$

This means that through the map α 7→ Cα−1 , the Wasserstein distance is isometric
to a linear space equipped with the Lp norm or, equivalently, that the Wasserstein
distance for measures on the real line is a Hilbertian metric. This makes the geom-
etry of 1-D optimal transport very simple but also very different from its geometry
in higher dimensions, which is not Hilbertian as discussed in Proposition 8.1 and
more generally in §8.3. For p = 1, one even has the simpler formula
$$W_1(\alpha, \beta) = \| \mathcal{C}_\alpha - \mathcal{C}_\beta \|_{L^1(\mathbb{R})} = \int_{\mathbb{R}} |\mathcal{C}_\alpha(x) - \mathcal{C}_\beta(x)|\, \mathrm{d}x \qquad (2.37)$$
$$\qquad\qquad = \int_{\mathbb{R}} \Big| \int_{-\infty}^{x} \mathrm{d}(\alpha - \beta) \Big|\, \mathrm{d}x, \qquad (2.38)$$

which shows that W 1 is a norm (see §6.2 for the generalization to arbitrary dimen-
sions). An optimal Monge map T such that T] α = β is then defined by

T = Cβ−1 ◦ Cα . (2.39)

Figure 2.11 illustrates the computation of 1-D OT through cumulative functions.


It also displays displacement interpolations, computed as detailed in (7.7); see also
Remark 9.6. For a detailed survey of the properties of optimal transport in one
dimension, we refer the reader to [Santambrogio, 2015, Chapter 2].
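The quantile formula (2.36) can be evaluated exactly for discrete measures by merging the jump locations of the two cumulative functions; here is a minimal NumPy sketch (function names are ours, for illustration):

```python
import numpy as np

def quantile(xs, cumw, r):
    """Generalized quantile C^{-1}(r): smallest support point with cumulative weight >= r."""
    return xs[min(np.searchsorted(cumw, r), len(xs) - 1)]

def wasserstein_1d(x, a, y, b, p=2):
    """W_p(alpha, beta)^p between alpha = sum_i a_i delta_{x_i} and
    beta = sum_j b_j delta_{y_j} on R, via the quantile formula (2.36)."""
    ix, iy = np.argsort(x), np.argsort(y)
    x, a = np.asarray(x, float)[ix], np.asarray(a, float)[ix]
    y, b = np.asarray(y, float)[iy], np.asarray(b, float)[iy]
    ca, cb = np.cumsum(a), np.cumsum(b)
    levels = np.union1d(ca, cb)        # points where either quantile function may jump
    cost, prev = 0.0, 0.0
    for r in levels:
        mid = 0.5 * (prev + r)         # both quantile functions are constant on (prev, r]
        cost += (r - prev) * abs(quantile(x, ca, mid) - quantile(y, cb, mid)) ** p
        prev = r
    return cost

# e.g. wasserstein_1d([0.0, 1.0], [0.3, 0.7], [0.0], [1.0], p=1) returns 0.7
```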

Figure 2.11: Computation of OT and displacement interpolation between two 1-D measures, using cumulative functions as detailed in (2.39). The panels display (C_α, C_β), (C_α^{-1}, C_β^{-1}), (T, T^{-1}) and (1 − t)C_α^{-1} + tC_β^{-1}, the latter yielding the interpolation (tT + (1 − t)Id)_♯ α.

Remark 2.31 (Distance between Gaussians). If α = N(m_α, Σ_α) and β = N(m_β, Σ_β) are two Gaussians in R^d, then one can show that the following map

$$T : x \mapsto m_\beta + A(x - m_\alpha), \qquad (2.40)$$
where
$$A = \Sigma_\alpha^{-\frac{1}{2}} \Big( \Sigma_\alpha^{\frac{1}{2}} \Sigma_\beta \Sigma_\alpha^{\frac{1}{2}} \Big)^{\frac{1}{2}} \Sigma_\alpha^{-\frac{1}{2}} = A^{\mathsf{T}},$$
is such that T_♯ ρ_α = ρ_β. Indeed, one simply has to notice that the change of variables formula (2.8) is satisfied since
$$\rho_\beta(T(x)) = \det(2\pi\Sigma_\beta)^{-\frac{1}{2}} \exp\big(-\tfrac{1}{2}\langle T(x) - m_\beta, \Sigma_\beta^{-1}(T(x) - m_\beta)\rangle\big)$$
$$\qquad = \det(2\pi\Sigma_\beta)^{-\frac{1}{2}} \exp\big(-\tfrac{1}{2}\langle x - m_\alpha, A^{\mathsf{T}}\Sigma_\beta^{-1}A(x - m_\alpha)\rangle\big)$$
$$\qquad = \det(2\pi\Sigma_\beta)^{-\frac{1}{2}} \exp\big(-\tfrac{1}{2}\langle x - m_\alpha, \Sigma_\alpha^{-1}(x - m_\alpha)\rangle\big),$$
and since T is a linear map we have that
$$|\det T'(x)| = \det A = \left(\frac{\det \Sigma_\beta}{\det \Sigma_\alpha}\right)^{\frac{1}{2}},$$
and we therefore recover ρ_α(x) = |det T'(x)| ρ_β(T(x)), meaning T_♯ α = β. Notice now that T is the gradient of the convex function ψ : x ↦ ½⟨x − m_α, A(x − m_α)⟩ + ⟨m_β, x⟩ to conclude, using Brenier's theorem [1991] (see Remark 2.24), that T is optimal.

Both that map T and the corresponding potential ψ are illustrated in Figures 2.12 and 2.13.
With additional calculations involving first and second order moments of ρα ,
we obtain that the transport cost of that map is

$$W_2^2(\alpha, \beta) = \|m_\alpha - m_\beta\|^2 + \mathcal{B}(\Sigma_\alpha, \Sigma_\beta)^2, \qquad (2.41)$$

where B is the so-called Bures metric [1969] between positive definite matrices (see also Forrester and Kieburg [2016]),
$$\mathcal{B}(\Sigma_\alpha, \Sigma_\beta)^2 \stackrel{\text{def.}}{=} \operatorname{tr}\Big(\Sigma_\alpha + \Sigma_\beta - 2\big(\Sigma_\alpha^{1/2} \Sigma_\beta \Sigma_\alpha^{1/2}\big)^{1/2}\Big), \qquad (2.42)$$

where Σ1/2 is the matrix square root. One can show that B is a distance on covari-
ance matrices and that B 2 is convex with respect to both its arguments. In the
case where Σ_α = diag(r_i)_i and Σ_β = diag(s_i)_i are diagonal, the Bures metric is the Hellinger distance
$$\mathcal{B}(\Sigma_\alpha, \Sigma_\beta) = \big\| \sqrt{r} - \sqrt{s} \big\|_2.$$
For 1-D Gaussians, W_2 is thus the Euclidean distance on the 2-D plane plotting the mean and the standard deviation of a Gaussian, (m, √Σ), as illustrated in Figure 2.14. For a detailed treatment of the Wasserstein geometry of Gaus-
sian distributions, we refer to Takatsu [2011], and for additional considerations
on the Bures metric the reader can consult the very recent references [Malagò
et al., 2018, Bhatia et al., 2018]. One can also consult [Muzellec and Cuturi, 2018]
for a recent application of this metric to compute probabilistic embeddings for
words, [Shafieezadeh Abadeh et al., 2018] to see how it is used to compute a robust
extension to Kalman filtering, or [Mallasto and Feragen, 2017] in which it is applied
to covariance functions in reproducing kernel Hilbert spaces.
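These closed forms are straightforward to evaluate numerically; here is a minimal sketch using NumPy and SciPy's matrix square root (function names are ours; real parts are taken only to discard numerical imaginary residue from sqrtm):

```python
import numpy as np
from scipy.linalg import sqrtm

def bures_sq(Sa, Sb):
    """Squared Bures metric (2.42) between two positive definite matrices."""
    Sa_half = sqrtm(Sa)
    cross = sqrtm(Sa_half @ Sb @ Sa_half)
    return float(np.trace(Sa + Sb - 2 * cross).real)

def w2_gaussians(ma, Sa, mb, Sb):
    """W_2 distance (2.41) between N(ma, Sa) and N(mb, Sb)."""
    return np.sqrt(np.sum((np.asarray(ma) - np.asarray(mb)) ** 2) + bures_sq(Sa, Sb))

def monge_map(ma, Sa, mb, Sb):
    """Optimal linear Monge map x -> mb + A (x - ma) of Equation (2.40)."""
    Sa_half = sqrtm(Sa)
    Sa_half_inv = np.linalg.inv(Sa_half)
    A = (Sa_half_inv @ sqrtm(Sa_half @ Sb @ Sa_half) @ Sa_half_inv).real
    return lambda x: mb + A @ (x - ma)
```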

Remark 2.32 (Distance between elliptically contoured distributions). Gelbrich pro-


vides a more general result than that provided in Remark 2.31: the Bures met-
ric between Gaussians extends more generally to elliptically contoured distribu-
tions [1990]. In a nutshell, one can first show that for two measures with given
mean and covariance matrices, the distance between the two Gaussians with these
respective parameters is a lower bound of the Wasserstein distance between the
two measures [Gelbrich, 1990, Theorem 2.1]. Additionally, the closed form (2.41)
extends to families of elliptically contoured densities: If two densities ρα and ρβ
belong to such a family, namely when ρα and ρβ can be written for any point x
using a mean and positive definite parameter,
$$\rho_\alpha(x) = \frac{1}{\sqrt{\det(\mathbf{A})}}\, h\big(\langle x - m_\alpha, \mathbf{A}^{-1}(x - m_\alpha)\rangle\big), \qquad \rho_\beta(x) = \frac{1}{\sqrt{\det(\mathbf{B})}}\, h\big(\langle x - m_\beta, \mathbf{B}^{-1}(x - m_\beta)\rangle\big),$$
for the same nonnegative valued function h such that the integral
$$\int_{\mathbb{R}^d} h(\langle x, x\rangle)\, \mathrm{d}x = 1,$$

then their optimal transport map is also the linear map (2.40) and their Wasserstein
distance is also given by the expression (2.41), with a slightly different scaling of the
Bures metric that depends only on the generator function h. For instance, that scaling
is 1 for Gaussians (h(t) = e−t/2 ) and 1/(d+2) for uniform distributions on ellipsoids
(h the indicator function for [0, 1]). This result follows from the fact that the
covariance matrix of an elliptic distribution is a constant times its positive definite
parameter [Gómez et al., 2003, Theo. 4(ii)] and that the Wasserstein distance
between elliptic distributions is a function of the Bures distance between their
covariance matrices [Gelbrich, 1990, Cor. 2.5].
Figure 2.12: Two Gaussians ρ_α and ρ_β, represented using the contour plots of their densities, with respective means and covariance matrices m_α = (−2, 0), Σ_α = ½ [1, −½; −½, 1] and m_β = (3, 1), Σ_β = [2, ½; ½, 1]. The arrows originate at random points x taken on the plane and end at the corresponding mappings of those points T(x) = m_β + A(x − m_α).

Figure 2.13: Same Gaussians ρα and ρβ as defined in Figure 2.12, represented this time as surfaces.
The surface above is the Brenier potential ψ defined up to an additive constant (here +50) such that
T = ∇ψ. For visual purposes, both Gaussian densities have been multiplied by a factor of 100.

Figure 2.14: Computation of displacement interpolation between two 1-D Gaussians. Denoting $G_{m,\sigma}(x) \stackrel{\text{def.}}{=} \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-m)^2}{2\sigma^2}}$ the Gaussian density, it thus shows the interpolation $G_{(1-t)m_0+tm_1,\,(1-t)\sigma_0+t\sigma_1}$.
3 Algorithmic Foundations

This chapter describes the most common algorithmic tools from combinatorial opti-
mization and linear programming that can be used to solve the discrete formulation
of optimal transport, as described in the primal problem (2.11) or alternatively its
dual (2.20).
The origins of these algorithms can be traced back to World War II, either right
before with Tolstoı’s seminal work [1930] or during the war itself, when Hitchcock
[1941] and Kantorovich [1942] formalized the generic problem of dispatching available
resources toward consumption sites in an optimal way. Both of these formulations, as
well as the later contribution by Koopmans [1949], fell short of providing a provably cor-
rect algorithm to solve that problem (the cycle violation method was already proposed
as a heuristic by Tolstoı [1939]). One had to wait until the field of linear programming
fully blossomed, with the proposal of the simplex method, to be at last able to solve
rigorously these problems.
The goal of linear programming is to solve optimization problems whose objective
function is linear and whose constraints are linear (in)equalities in the variables of in-
terest. The optimal transport problem fits that description and is therefore a particular
case of that wider class of problems. One can argue, however, that optimal transport is
truly special among all linear programs. First, Dantzig's early motivation to solve linear
programs was greatly related to that of solving transportation problems [Dantzig, 1949,
p. 210]. Second, despite being only a particular case, the optimal transport problem
remained in the spotlight of optimization, because it was understood shortly after that
optimal transport problems were related, and in fact equivalent, to an important class
of linear programs known as minimum cost network flows [Korte and Vygen, 2012, p.


213, Lem. 9.3] thanks to a result by Ford and Fulkerson [1962]. As such, the OT prob-
lem has been the subject of particular attention, ever since the birth of mathematical
programming [Dantzig, 1951], and is still widely used to introduce optimization to a
new audience [Nocedal and Wright, 1999, §1, p. 4].

3.1 The Kantorovich Linear Programs

We have already introduced in Equation (2.11) the primal OT problem:
$$\mathrm{L}_{C}(a, b) = \min_{P \in U(a, b)} \sum_{i \in \llbracket n\rrbracket,\, j \in \llbracket m\rrbracket} C_{i,j} P_{i,j}. \qquad (3.1)$$

To make the link with the linear programming literature, one can cast the equation
above as a linear program in standard form, that is, a linear program with a linear
objective; equality constraints defined with a matrix and a constant vector; and non-
negative constraints on variables. Let In stand for the identity matrix of size n and let
⊗ be Kronecker’s product. The (n + m) × nm matrix
" #
1 T ⊗ Im
A= n ∈ R(n+m)×nm
In ⊗ 1m T

can be used to encode the row-sum and column-sum constraints that need to be satisfied
for any P to be in U(a, b). To do so, simply cast a matrix P ∈ R^{n×m} as a vector p ∈ R^{nm} such that the (i + n(j − 1))-th element of p is equal to P_{i,j} (P is enumerated columnwise) to obtain the following equivalence:
$$P \in U(a, b) \quad \Leftrightarrow \quad p \in \mathbb{R}_+^{nm},\; A p = \begin{bmatrix} a \\ b \end{bmatrix}.$$

Therefore we can write the original optimal transport problem as
$$\mathrm{L}_{C}(a, b) = \min_{p \in \mathbb{R}_+^{nm},\; A p = [a;\, b]} c^{\mathsf{T}} p, \qquad (3.2)$$

where the nm-dimensional vector c is equal to the stacked columns contained in the
cost matrix C.
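A minimal sketch of this construction, assuming NumPy and SciPy's linprog (HiGHS backend); with the columnwise vectorization used above, the two Kronecker blocks encode the row-sum and column-sum constraints (variable names are ours):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 3, 4
a = np.full(n, 1 / n)
b = np.full(m, 1 / m)
C = rng.random((n, m))

# constraint matrix of the standard-form LP, for the columnwise vectorization of P
A = np.vstack([np.kron(np.ones((1, m)), np.eye(n)),    # row-sum constraints  P 1_m = a
               np.kron(np.eye(m), np.ones((1, n)))])   # column-sum constraints P^T 1_n = b
q = np.concatenate([a, b])
c = C.flatten(order="F")                                # stack the columns of C

res = linprog(c, A_eq=A, b_eq=q, bounds=(0, None), method="highs")
P_opt = res.x.reshape((n, m), order="F")                # optimal transport plan
L_C = res.fun                                           # value of L_C(a, b)
```

Recent SciPy versions also expose dual values for the equality constraints (e.g. through res.eqlin.marginals), which, up to the redundancy discussed in Remark 3.1 below, correspond to the vector h of Equation (3.3).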

Remark 3.1. Note that one of the n + m constraints described above is redundant or that, in other words, the line vectors of matrix A are not linearly independent. Indeed, summing the first n lines and then the subsequent m lines results in the same vector (namely $A^{\mathsf{T}}[\mathbf{1}_n; \mathbf{0}_m] = A^{\mathsf{T}}[\mathbf{0}_n; \mathbf{1}_m] = \mathbf{1}_{nm}$). One can show that removing a line in A and the corresponding entry in [a; b] yields a properly defined linear system. For simplicity, and to avoid treating asymmetrically a and b, we retain in what follows a redundant formulation, keeping in mind that degeneracy will pop up in some of our computations.

The dual problem corresponding to Equation (3.2) is, following duality in linear programming [Bertsimas and Tsitsiklis, 1997, p. 143], defined as
$$\mathrm{L}_{C}(a, b) = \max_{h \in \mathbb{R}^{n+m},\; A^{\mathsf{T}} h \leq c} [a;\, b]^{\mathsf{T}} h. \qquad (3.3)$$
Note that this program is exactly equivalent to that presented in Equation (2.20).
Remark 3.2. We provide a simple derivation of the duality result above, which can be
seen as a direct formulation of the arguments developed in Remark 2.21. Strong duality,
namely the fact that the optima of both primal (3.2) and dual (3.3) problems do indeed
coincide, requires a longer proof [Bertsimas and Tsitsiklis, 1997, §4.10]. To simplify
notation, we write q = [a; b]. Consider now a relaxed primal problem of the optimal transport problem, where the constraint Ap = q is no longer necessarily enforced but bears instead a cost h^T(Ap − q) parameterized by an arbitrary vector of costs h ∈ R^{n+m}. This relaxation, whose optimum depends directly on the cost vector h, can be written as
$$H(h) \stackrel{\text{def.}}{=} \min_{p \in \mathbb{R}_+^{nm}} c^{\mathsf{T}} p - h^{\mathsf{T}}(A p - q).$$

Note first that this relaxed problem has no marginal constraints on p. Because that
minimization allows for many more p solutions, we expect H(h) to be smaller than
z̄ = LC (a, b). Indeed, writing p? for any optimal solution of the primal problem (3.1),
we obtain
$$\min_{p \in \mathbb{R}_+^{nm}} c^{\mathsf{T}} p - h^{\mathsf{T}}(A p - q) \leq c^{\mathsf{T}} p^\star - h^{\mathsf{T}}(A p^\star - q) = c^{\mathsf{T}} p^\star = \bar{z}.$$

The approach above defines therefore a problem which can be used to compute a lower bound for the original problem (3.1), for any cost vector h; that function is called the Lagrange dual function of L. The goal of duality theory is now to compute the best lower bound z by maximizing H over any cost vector h, namely
$$z = \max_{h} H(h) = \max_{h} \left( h^{\mathsf{T}} q + \min_{p \in \mathbb{R}_+^{nm}} (c - A^{\mathsf{T}} h)^{\mathsf{T}} p \right).$$

The second term involving a minimization on p can be easily shown to be −∞ if any coordinate of c − A^T h is negative. Indeed, if for instance for a given index i ≤ nm we have c_i − (A^T h)_i < 0, then it suffices to take for p the canonical vector e_i multiplied by an arbitrarily large positive value to obtain an unbounded value. When trying to maximize the lower bound H(h) it therefore makes sense to restrict vectors h to be such that A^T h ≤ c, in which case the best possible lower bound becomes
$$z = \max_{h \in \mathbb{R}^{n+m},\; A^{\mathsf{T}} h \leq c} h^{\mathsf{T}} q.$$

We have therefore proved a weak duality result, namely that z ≤ z̄.



3.2 C-Transforms

We present in this section an important property of the dual optimal transport prob-
lem (3.3) which takes a more important meaning when used for the semidiscrete optimal
transport problem in §5.1. This section builds upon the original formulation (2.20) that
splits dual variables according to row and column sum constraints:

$$\mathrm{L}_{C}(a, b) = \max_{(f, g) \in \mathcal{R}(C)} \langle f, a\rangle + \langle g, b\rangle. \qquad (3.4)$$

Consider any dual feasible pair (f, g). If we “freeze” the value of f, we can notice that
there is no better vector solution for g than the C-transform vector of f, denoted
f^C ∈ R^m and defined as
$$(f^{C})_j = \min_{i \in \llbracket n\rrbracket} C_{i,j} - f_i,$$

since it is indeed easy to prove that (f, f C ) ∈ R(C) and that f C is the largest possible
vector such that this constraint is satisfied. We therefore have that

hf, ai + hg, bi ≤ hf, ai + hf C , bi.

This result allows us first to reformulate the dual problem as a piecewise affine concave
maximization problem expressed in a single variable f as

$$\mathrm{L}_{C}(a, b) = \max_{f \in \mathbb{R}^n} \langle f, a\rangle + \langle f^{C}, b\rangle. \qquad (3.5)$$

Putting that result aside, the same reasoning applies of course if we now “freeze”
the values of g and consider instead the C̄-transform of g, namely the vector g^{C̄} ∈ R^n defined as
$$(g^{\bar{C}})_i = \min_{j \in \llbracket m\rrbracket} C_{i,j} - g_j,$$

with a different increase in objective

hf, ai + hg, bi ≤ hgC̄ , ai + hg, bi.

Starting from a given f, it is therefore tempting to alternate C and C̄ transforms several


times to improve f. Indeed, we have the sequence of inequalities

hf, ai + hf C , bi ≤ hf CC̄ , ai + hf C , bi ≤ hf CC̄ , ai + hf CC̄C , bi ≤ . . .

One may hope for a strict increase in the objective at each of these iterations. However,
this does not work because alternating C and C̄ transforms quickly hits a plateau.
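A minimal NumPy sketch of these transforms and of the plateau they reach (names ours; the asserted properties are those of Proposition 3.1 below):

```python
import numpy as np

def c_transform(f, C):
    """f -> f^C, with (f^C)_j = min_i C_{ij} - f_i."""
    return np.min(C - f[:, None], axis=0)

def cbar_transform(g, C):
    """g -> g^{Cbar}, with (g^{Cbar})_i = min_j C_{ij} - g_j."""
    return np.min(C - g[None, :], axis=1)

rng = np.random.default_rng(0)
n, m = 5, 6
C = rng.random((n, m))
f = rng.normal(size=n)

g1 = c_transform(f, C)           # f^C
f1 = cbar_transform(g1, C)       # f^{C Cbar} >= f   (Proposition 3.1 (ii))
g2 = c_transform(f1, C)          # f^{C Cbar C} = f^C (Proposition 3.1 (iii)): plateau
assert np.all(f1 >= f - 1e-12)
assert np.allclose(g2, g1)
```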

Proposition 3.1. The following identities, in which the inequality sign between vectors
should be understood elementwise, hold:

(i) f ≤ f 0 ⇒ f C ≥ f 0 C ,

(ii) f CC̄ ≥ f, gC̄C ≥ g,


(iii) f CC̄C = f C .
Proof. The first inequality follows from the definition of C-transforms. Expanding the definition of f^{CC̄} we have
$$(f^{C\bar{C}})_i = \min_{j \in \llbracket m\rrbracket} C_{i,j} - (f^{C})_j = \min_{j \in \llbracket m\rrbracket} \Big( C_{i,j} - \min_{i' \in \llbracket n\rrbracket} \big( C_{i',j} - f_{i'} \big) \Big).$$
Now, since $-\min_{i' \in \llbracket n\rrbracket} \big(C_{i',j} - f_{i'}\big) \geq -(C_{i,j} - f_i)$, we recover
$$(f^{C\bar{C}})_i \geq \min_{j \in \llbracket m\rrbracket} \big( C_{i,j} - C_{i,j} + f_i \big) = f_i.$$

The relation gC̄C ≥ g is obtained in the same way. Now, set g = f C . Then, gC̄ =
f CC̄ ≥ f. Therefore, using result (i) we have f CC̄C ≤ f C . Result (ii) yields f CC̄C ≥ f C ,
proving the equality.

3.3 Complementary Slackness

Primal (3.2) and dual (3.3), (2.20) problems can be solved independently to obtain
optimal primal P? and dual (f? , g? ) solutions. The following proposition characterizes
their relationship.
Proposition 3.2. Let P^⋆ and (f^⋆, g^⋆) be optimal solutions for the primal (2.11) and dual (2.20) problems, respectively. Then, for any pair (i, j) ∈ ⟦n⟧ × ⟦m⟧, P^⋆_{i,j} (C_{i,j} − (f^⋆_i + g^⋆_j)) = 0 holds. In other words, if P^⋆_{i,j} > 0, then necessarily f^⋆_i + g^⋆_j = C_{i,j}; if f^⋆_i + g^⋆_j < C_{i,j}, then necessarily P^⋆_{i,j} = 0.
Proof. We have by strong duality that hP? , Ci = hf? , ai + hg? , bi. Recall that P? 1m =
a and P? T 1n = b; therefore
hf? , ai + hg? , bi = hf? , P? 1m i + hg? , P? T 1n i
= hf? 1m T , P? i + h1n g? T , P? i,
which results in
hP? , C − f? ⊕ g? i = 0.
Because (f? , g? ) belongs to the polyhedron of dual constraints (2.21), each entry of the
matrix C − f? ⊕ g? is necessarily nonnegative. Therefore, since all the entries of P are
nonnegative, the constraint that the dot-product above is equal to 0 enforces that, for
any pair of indices (i, j) such that Pi,j > 0, Ci,j − (fi + gj ) must be zero, and for any
pair of indices (i, j) such that Ci,j > fi + gj that Pi,j = 0.

The converse result is also true. We define first the idea that two variables for the
primal and dual problems are complementary.

Definition 3.1. A matrix P ∈ Rn×m and a pair of vectors (f, g) are complementary
w.r.t. C if for all pairs of indices (i, j) such that Pi,j > 0 one also has Ci,j = fi + gj .

If a pair of feasible primal and dual variables is complementary, then we can conclude
they are optimal.

Proposition 3.3. If P and (f, g) are complementary and feasible solutions for the primal (2.11) and dual (2.20) problems, respectively, then P and (f, g) are both primal
and dual optimal.

Proof. By weak duality, we have that

LC (a, b) ≤ hP, Ci = hP, f ⊕ gi = ha, fi + hb, gi ≤ LC (a, b)

and therefore P and (f, g) are respectively primal and dual optimal.

3.4 Vertices of the Transportation Polytope

Recall that a vertex or an extremal point of a convex set is formally a point x in that
set such that, if there exist y and z in that set with x = (y + z)/2, then necessarily
x = y = z. A linear program with a nonempty and bounded feasible set attains its
minimum at a vertex (or extremal point) of the feasible set [Bertsimas and Tsitsiklis,
1997, p. 65, Theo. 2.7]. Since the feasible set U(a, b) of the primal optimal transport
problem (3.2) is bounded, one can restrict the search for an optimal P to the set of
extreme points of the polytope U(a, b). Matrices P that are extremal in U(a, b) have
an interesting structure that has been the subject of extensive research [Brualdi, 2006,
§8]. That structure requires describing the transport problem using the formalism of
bipartite graphs.

3.4.1 Tree Structure of the Support of All Vertices of U(a, b)


Let V = (1, 2, . . . , n) and V 0 = (10 , 20 , . . . , m0 ) be two sets of nodes. Note that we add a
prime to the labels of set V 0 to disambiguate them from those of V . Consider their union
V ∪V 0 , with n+m nodes, and the set E of all nm directed edges {(i, j 0 ), i ∈ JnK, j ∈ JmK}
between them (here we just add a prime to an integer j ≤ m to form j 0 in V 0 ). To
each edge (i, j 0 ) we associate the corresponding cost value Cij . The complete bipartite
graph G between V and V 0 is (V ∪ V 0 , E). A transport plan is a flow on that graph
satisfying source (ai flowing out of each node i) and sink (bj flowing into each node
j 0 ) constraints, as described informally in Figure 3.1. An extremal point in U(a, b) has
the following property [Brualdi, 2006, p. 338, Theo. 8.1.2].

Proposition 3.4 (Extremal solutions). Let P be an extremal point of the polytope


U(a, b). Let S(P) ⊂ E be the subset of edges {(i, j′) : i ∈ ⟦n⟧, j ∈ ⟦m⟧, P_{i,j} > 0}.

Figure 3.1: The optimal transport problem as a bipartite network flow problem. Here n = 3, m = 4.
All coordinates of the source histogram, a, are depicted as source nodes on the left labeled 1, 2, 3,
whereas all coordinates of the target histogram b are labeled as nodes 10 , 20 , 30 , 40 . The graph is bipartite
in the sense that all source nodes are connected to all target nodes, with no additional edges. To each
edge (i, j 0 ) is associated a cost Cij . A feasible flow is represented on the right. Proposition 3.4 shows
that this flow is not extremal since it has at least one cycle given by ((1, 10 ), (2, 10 ), (2, 40 ), (1, 40 )).

Figure 3.2: A solution P with a cycle in the graph of its support can be perturbed to obtain two
feasible solutions Q and R such that P is their average, therefore disproving that P is extremal.

Then the graph G(P) := (V ∪ V′, S(P)) has no cycles. In particular, P cannot have more than n + m − 1 nonzero entries.

Proof. We proceed by contradiction. Suppose that P is an extremal point of the poly-


tope U(a, b) and that its corresponding set S(P) of edges, denoted F for short, is such
that the graph G = (V ∪ V 0 , F ) contains a cycle, namely there exists k > 1 and a
sequence of distinct indices i1 , . . . , ik−1 ∈ JnK and j1 , . . . , jk−1 ∈ JmK such that the set
of edges H given below forms a subset of F .

H = (i1 , j10 ), (i2 , j10 ), (i2 , j20 ), . . . , (ik , jk0 ), (i1 , jk0 ) .


We now construct two feasible matrices Q and R such that P = (Q + R)/2. To


do so, consider a directed cycle H̄ corresponding to H, namely the sequence of pairs
i1 → j10 , j10 → i2 , i2 → j20 , . . . , ik → jk0 , jk0 → i1 , as well as the elementary amount of flow

ε < min(i,j 0 )∈F Pij . Consider a perturbation matrix E whose (i, j) entry is equal to ε
if i → j 0 ∈ H̄, −ε if j → i0 ∈ H̄, and zero otherwise. Define matrices Q = P + E and
R = P−E as illustrated in Figure 3.2. Because ε is small enough, all elements in Q and
R are nonnegative. By construction, E has either lines (resp., columns) with all entries
equal to 0 or exactly one entry equal to ε and another equal to −ε for those indexed
by i1 , . . . , ik (resp., j1 , . . . , jk ). Therefore, E is such that E1m = 0n and ET 1n = 0m ,
and we have that Q and R have the same marginals as P, and are therefore feasible.
Finally P = (Q + R)/2 which, since Q, R 6= P, contradicts the fact that P is an
extremal point. Since a graph with k nodes and no cycles cannot have more than k − 1
edges, we conclude that S(P) cannot have more than n + m − 1 edges, and therefore
P cannot have more than n + m − 1 nonzero entries.

3.4.2 The North-West Corner Rule


The north-west (NW) corner rule is a heuristic that produces a vertex of the polytope
U(a, b) in up to n + m operations. This heuristic can play a role in initializing any
algorithm working on the primal, such as the network simplex outlined in the next
section.
The rule starts by giving the highest possible value to P1,1 by setting it to
min(a1 , b1 ). At each step, the entry Pi,j is chosen to saturate either the row con-
straint at i, the column constraint at j, or both if possible. The indices i, j are then
updated as follows: i is incremented in the first case, j is in the second, and both i and
j are in the third case. The rule proceeds until Pn,m has received a value.
Formally, the algorithm works as follows: i and j are initialized to 1, r ← a_1, c ← b_1. While i ≤ n and j ≤ m, set t ← min(r, c), P_{i,j} ← t, r ← r − t, c ← c − t; if r = 0 then increment i, and update r ← a_i if i ≤ n; if c = 0 then increment j, and update c ← b_j if j ≤ m; repeat. Here is an example of this sequence assuming a = [0.2, 0.5, 0.3] and
b = [0.5, 0.1, 0.4]:
$$\begin{pmatrix} \bullet & 0 & 0\\ 0 & 0 & 0\\ 0 & 0 & 0\end{pmatrix} \rightarrow \begin{pmatrix} 0.2 & 0 & 0\\ \bullet & 0 & 0\\ 0 & 0 & 0\end{pmatrix} \rightarrow \begin{pmatrix} 0.2 & 0 & 0\\ 0.3 & \bullet & 0\\ 0 & 0 & 0\end{pmatrix} \rightarrow \begin{pmatrix} 0.2 & 0 & 0\\ 0.3 & 0.1 & \bullet\\ 0 & 0 & 0\end{pmatrix} \rightarrow \begin{pmatrix} 0.2 & 0 & 0\\ 0.3 & 0.1 & 0.1\\ 0 & 0 & \bullet\end{pmatrix} \rightarrow \begin{pmatrix} 0.2 & 0 & 0\\ 0.3 & 0.1 & 0.1\\ 0 & 0 & 0.3\end{pmatrix}$$
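A minimal NumPy sketch of this rule (the function name is ours); running it on the example above reproduces, up to rounding, the last table of the sequence:

```python
import numpy as np

def north_west(a, b):
    """NW corner rule: returns a vertex of U(a, b) with at most n + m - 1 nonzeros."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    P = np.zeros((n, m))
    i = j = 0
    r, c = a[0], b[0]
    for _ in range(n + m):            # the rule fills at most n + m - 1 cells
        t = min(r, c)
        P[i, j] += t
        r -= t
        c -= t
        if i == n - 1 and j == m - 1:
            break
        if r <= 1e-12 and i < n - 1:  # row i saturated: move down
            i += 1
            r = a[i]
        if c <= 1e-12 and j < m - 1:  # column j saturated: move right
            j += 1
            c = b[j]
    return P

P = north_west([0.2, 0.5, 0.3], [0.5, 0.1, 0.4])
# up to rounding: P = [[0.2, 0, 0], [0.3, 0.1, 0.1], [0, 0, 0.3]]
```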
We write NW(a, b) for the unique plan that can be obtained through this heuristic.
Note that there is, however, a much larger number of NW corner solutions that
can be obtained by permuting arbitrarily the order of a and b first, computing
the corresponding NW corner table, and recovering a table of U(a, b) by invert-
ing again the order of columns and rows: setting σ = (3, 1, 2), σ 0 = (3, 2, 1) gives
a_σ = [0.3, 0.2, 0.5], b_{σ′} = [0.4, 0.1, 0.5], and σ^{-1} = (2, 3, 1), σ′^{-1} = (3, 2, 1). Observe that
$$\mathrm{NW}(a_\sigma, b_{\sigma'}) = \begin{pmatrix} 0.3 & 0 & 0\\ 0.1 & 0.1 & 0\\ 0 & 0 & 0.5\end{pmatrix} \in U(a_\sigma, b_{\sigma'}), \qquad \mathrm{NW}_{\sigma^{-1}\sigma'^{-1}}(a_\sigma, b_{\sigma'}) = \begin{pmatrix} 0 & 0.1 & 0.1\\ 0.5 & 0 & 0\\ 0 & 0 & 0.3\end{pmatrix} \in U(a, b).$$

Let N(a, b) be the set of all NW corner solutions that can be produced this way:
$$\mathcal{N}(a, b) \stackrel{\text{def.}}{=} \big\{ \mathrm{NW}_{\sigma^{-1}\sigma'^{-1}}(a_\sigma, b_{\sigma'}) : \sigma \in S_n,\, \sigma' \in S_m \big\}.$$
All NW corner solutions have by construction up to n + m − 1 nonzero elements. The


NW corner rule produces a table which is by construction unique for a_σ and b_{σ′}, but
there is an exponential number of pairs or row/column permutations (σ, σ 0 ) that may
yield the same table [Stougie, 2002, p. 2]. N (a, b) forms a subset of (usually strictly
included in) the set of extreme points of U(a, b) [Brualdi, 2006, Cor. 8.1.4].

3.5 A Heuristic Description of the Network Simplex

Consider a feasible matrix P whose graph G(P) = (V ∪ V 0 , S(P)) has no cycles. P has
therefore no more than n + m − 1 nonzero entries and is a vertex of U(a, b) by Propo-
sition 3.4. Following Proposition 3.3, it is therefore sufficient to obtain a dual solution
(f, g) which is feasible (i.e. C − f ⊕ g has nonnegative entries) and complementary to P
(pairs of indices (i, j 0 ) in S(P) are such that Ci,j = fi + gj ), to prove that P is optimal.
The network simplex relies on two simple principles: to each feasible primal solution
P one can associate a complementary pair (f, g). If that pair is feasible, then we have
reached optimality. If not, one can consider a modification of P that remains feasible
and whose complementary pair (f, g) is modified so that it becomes closer to feasibility.

3.5.1 Obtaining a Dual Pair Complementary to P

The simplex proceeds by associating first to any extremal solution P a pair of (f, g)
complementary dual variables. This is simply carried out by finding two vectors f and
g such that for any (i, j 0 ) in S(P), fi + gj is equal to Ci,j . Note that this, in itself, does
not guarantee that (f, g) is feasible.
Let s be the cardinality of S(P). Because P is extremal, s ≤ n + m − 1. Because
G(P) has no cycles, G(P) is either a tree or a forest (a union of trees), as illustrated
in Figure 3.3. Aiming for a pair (f, g) that is complementary to P, we consider the

[Figure 3.3 displays a feasible P with support edges S(P) = {(1, 1′), (1, 2′), (2, 2′), (2, 3′), (3, 4′), (4, 4′), (4, 5′), (5, 6′)} and the induced graph G(P), a forest made of the three trees ({1, 2, 1′, 2′, 3′}, {(1, 1′), (1, 2′), (2, 2′), (2, 3′)}), ({3, 4, 4′, 5′}, {(3, 4′), (4, 4′), (4, 5′)}) and ({5, 6′}, {(5, 6′)}).]

Figure 3.3: A feasible transport P and its corresponding set of edges S(P) and graph G(P). As can
be seen, the graph G(P) = ({1, . . . , 5, 10 , . . . , 60 }, S(P)) is a forest, meaning that it can be expressed as
the union of tree graphs, three in this case.

following set of s linear equality constraints on n + m variables:

$$\begin{aligned} f_{i_1} + g_{j_1} &= C_{i_1, j_1}\\ f_{i_2} + g_{j_2} &= C_{i_2, j_2}\\ &\;\;\vdots \\ f_{i_s} + g_{j_s} &= C_{i_s, j_s}, \end{aligned} \qquad (3.6)$$

where the elements of S(P) are enumerated as (i1 , j10 ), . . . , (is , js0 ).
Since s ≤ n + m − 1 < n + m, the linear system (3.6) above is always undetermined.
This degeneracy can be interpreted in part because the parameterization of U(a, b)
with n + m constraints results in n + m dual variables. A more careful formulation,
outlined in Remark 3.1, would have resulted in an equivalent formulation with only
n + m − 1 constraints and therefore n + m − 1 dual variables. However, s can also be
strictly smaller than n + m − 1: This happens when G(P) is the disjoint union of two
or more trees. For instance, there are 5 + 6 = 11 dual variables (one for each node) in
Figure 3.3, but only 8 edges among these 11 nodes, namely 8 linear equations to define
(f, g). Therefore, there will be as many undetermined dual variables under that setting
as there will be connected components in G(P).
Consider a tree among those listed in G(P). Suppose that tree has k nodes i1 , . . . , ik
among source nodes and l nodes j_1′, . . . , j_l′ among target nodes, resulting in r := k + l nodes and r − 1 edges, corresponding to k variables in f and l variables in g, linked with


r − 1 linear equations. To lift an indetermination, we can choose arbitrarily a root node
in that tree and assign the value 0 to its corresponding dual variable. From there, we
can traverse the tree using a breadth-first or depth-first search to obtain a sequence
of simple variable assignments that determines the values of all other dual variables in
that tree, as illustrated in Figure 3.4. That procedure can then be repeated for all trees
in the graph of P to obtain a pair of dual variables (f, g) that is complementary to P.
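A minimal sketch of this procedure, assuming NumPy (function names are ours): a breadth-first traversal of each tree of the support graph of P, with the potential of one root per tree arbitrarily set to 0.

```python
import numpy as np
from collections import deque

def complementary_duals(P, C, tol=1e-12):
    """Given a vertex P of U(a, b) (its support graph has no cycles), build a pair
    (f, g) complementary to P: f_i + g_j = C_{i,j} on every support edge, fixing one
    arbitrary root per connected component (its potential is set to 0)."""
    n, m = P.shape
    f, g = np.full(n, np.nan), np.full(m, np.nan)
    rows_of = [[i for i in range(n) if P[i, j] > tol] for j in range(m)]
    cols_of = [[j for j in range(m) if P[i, j] > tol] for i in range(n)]
    for root in range(n):                       # one BFS per connected component
        if not np.isnan(f[root]):
            continue
        f[root] = 0.0                           # arbitrary choice lifting the indetermination
        queue = deque([("row", root)])
        while queue:
            side, k = queue.popleft()
            if side == "row":
                for j in cols_of[k]:
                    if np.isnan(g[j]):
                        g[j] = C[k, j] - f[k]
                        queue.append(("col", j))
            else:
                for i in rows_of[k]:
                    if np.isnan(f[i]):
                        f[i] = C[i, k] - g[k]
                        queue.append(("row", i))
    g[np.isnan(g)] = 0.0                        # columns with no support edge: free variables
    return f, g
```

Applied, for instance, to the NW corner vertex of §3.4.2, checking entrywise that C − f ⊕ g ≥ 0 then certifies optimality of that vertex by Proposition 3.3.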

[Figure 3.4 shows the assignments f_1 := 0, g_1 := C_{1,1} − f_1, f_2 := C_{2,1} − g_1, g_2 := C_{2,2} − f_2, g_3 := C_{2,3} − f_2 obtained by traversing the first tree of Figure 3.3 from its root.]

Figure 3.4: The five dual variables f1 , f2 , g1 , g2 , g3 corresponding to the five nodes appearing in the
first tree of the graph G(P) illustrated in Figure 3.3 are linked through four linear equations that
involve corresponding entries in the cost matrix C. Because that system is degenerate, we choose a
root in that tree (node 1 in this example) and set its corresponding variable to 0 and proceed then by
traversing the tree (either breadth-first or depth-first) from the root to obtain iteratively the values of
the four remaining dual variables.

3.5.2 Network Simplex Update


The dual pair (f, g) obtained previously might be feasible, in the sense that for all i, j
we have fi + gj ≤ Ci,j , in which case we have reached the optimum by Proposition 3.3.
When that is not the case, namely when there exists i, j such that fi + gj > Ci,j , the
network simplex algorithm kicks in. We first initialize a graph G to be equal to the
graph G(P) corresponding to the feasible solution P and add the violating edge (i, j 0 )
to G. Two cases can then arise:
(a) G is (still) a forest, which can happen if (i, j 0 ) links two existing subtrees. The
approach outlined in §3.5.1 can be used on graph G to recover a new complemen-
tary dual vector (f, g). Note that this addition simply removes an indetermination
among the n + m dual variables and does not result in any change in the primal
variable P. That update is usually called degenerate in the sense that (i, j 0 ) has
now entered graph G although Pi,j remains 0. G(P) is, however, contained in G.

(b) G now has a cycle. In that case, we need to remove an edge in G to ensure that G
is still a forest, yet also modify P so that P is feasible and G(P) remains included
in G. These operations can all be carried out by increasing the value of Pi,j and
modifying the other entries of P appearing in the detected cycle, in a manner
very similar to the one we used to prove Proposition 3.4. To be more precise, let
us write that cycle (i1 , j10 ), (j10 , i2 ), (i2 , j20 ), . . . , (il , jl0 ), (jl0 , il+1 ) with the convention
that i1 = il+1 = i to ensure that the path is a cycle that starts and ends at i,
whereas j1 = j, to highlight the fact that the cycle starts with the added edge
{i, j}, going in the right direction. Increase now the flow of all “positive” edges
(ik , jk0 ) (for k ≤ l), and decrease that of “negative” edges (jk0 , ik+1 ) (for k ≤ l), to
obtain an updated primal solution P̃, equal to P for all but the following entries:
∀k ≤ l, P̃ik ,jk := Pik ,jk + θ; P̃ik+1 ,jk := Pik+1 ,jk − θ.
Here, θ is the largest possible increase at index i, j using that cycle. The value

of θ is controlled by the smallest flow negatively impacted by the cycle, namely


mink Pik+1 ,jk . That update is illustrated in Figure 3.5. Let k ? be an index that
achieves that minimum. We then close the update by removing (ik? +1 , jk? ) from
G, to compute new dual variables (f, g) using the approach outlined in §3.5.1.

3.5.3 Improvement of the Primal Solution


Although this was not necessarily our initial motivation, one can show that the ma-
nipulation above can only improve the cost of P. If the added edge has not created a
cycle, case (a) above, the primal solution remains unchanged. When a cycle is created,
case (b), P is updated to P̃, and the following equality holds:
$$\langle \tilde{P}, C\rangle - \langle P, C\rangle = \theta \left( \sum_{k=1}^{l} C_{i_k, j_k} - \sum_{k=1}^{l} C_{i_{k+1}, j_k} \right).$$
We now use the dual vectors (f, g) computed at the end of the previous iteration. They are such that f_{i_k} + g_{j_k} = C_{i_k,j_k} and f_{i_{k+1}} + g_{j_k} = C_{i_{k+1},j_k} for all edges initially in G, resulting in the identity
$$\sum_{k=1}^{l} C_{i_k, j_k} - \sum_{k=1}^{l} C_{i_{k+1}, j_k} = C_{i,j} + \sum_{k=2}^{l} \big(f_{i_k} + g_{j_k}\big) - \sum_{k=1}^{l} \big(f_{i_{k+1}} + g_{j_k}\big) = C_{i,j} - (f_i + g_j).$$
That term is, by definition, negative, since i, j were chosen because C_{i,j} < f_i + g_j. Therefore, if θ > 0, we have that
$$\langle \tilde{P}, C\rangle = \langle P, C\rangle + \theta\, \big(C_{i,j} - (f_i + g_j)\big) < \langle P, C\rangle.$$

If θ = 0, which can happen if G and G(P) differ, the graph G is simply changed, but
P is not.
The network simplex algorithm can therefore be summarized as follows: Initialize
the algorithm with an extremal solution P, given for instance by the NW corner rule
as covered in §3.4.2. Initialize the graph G with G(P). Compute a pair of dual vari-
ables (f, g) that are complementary to P by solving the linear system induced by the tree
structure(s) in G, as described in §3.5.1. (i) Look for a violating pair of indices to the
constraint C − f ⊕ g ≥ 0; if none, P is optimal and stop. If there is a violating pair
(i, j 0 ), (ii) add the edge (i, j 0 ) to G. If G still has no cycles, update (f, g) accordingly;
if there is a cycle, direct it making sure (i, j 0 ) is labeled as positive, and remove a neg-
ative edge in that cycle with the smallest flow value, updating P, G as illustrated in
Figure 3.5, then build a complementary pair f, g accordingly; return to (i). Some of the
operations above require graph operations (cycle detection, tree traversals) which can
be implemented efficiently in this context, as described in ([Bertsekas, 1998, §5]).

[Figure 3.5 panels: (a) edge {3, 3′} added; (b.1) edge {1, 3′} added; (b.2) edge {1, 1′} removed.]

Figure 3.5: Adding an edge {i, j} to the graph G(P) can result in either (a) the graph remains a
forest after this addition, in which case f, g can be recomputed following the approach outlined in §3.5.1;
(b.1) the addition of that edge creates a cycle, from which we can define a directed path; (b.2) the path
can be used to increase the value of Pi,j and propagate that change along the cycle to maintain the
flow feasibility constraints, until the flow of one of the edges that is negatively impacted by the cycle
is decreased to 0. This removes the cycle and updates P.

Orlin [1997] was the first to prove the polynomial time complexity of the
network simplex. Tarjan [1997] provided shortly after an improved bound in O((n + m) n m log(n + m) log((n + m) ‖C‖_∞)), which relies on more efficient data structures to help select pivoting edges.

3.6 Dual Ascent Methods

Dual ascent methods precede the network simplex by a few decades, since they can be
traced back to work by Borchardt and Jacobi [1865] and later König and Egerváry, as
recounted by Kuhn [1955]. The Hungarian algorithm is the best known algorithm in
that family, and it can work only in the particular case when a and b are equal and are
both uniform, namely a = b = 1n /n. We provide in what follows a concise description
of the more general family of dual ascent methods. This requires the knowledge of
the maximum flow problem ([Bertsimas and Tsitsiklis, 1997, §7.5]). By contrast to the
network simplex, presented above in the primal, dual ascent methods maintain at each
iteration dual feasible solutions whose objective is progressively improved by adding a
sparse vector to f and g. Our presentation is mostly derived from that of ([Bertsimas
and Tsitsiklis, 1997, §7.7]) and starts with the following definition.

Definition 3.2. For S ⊂ ⟦n⟧ and S′ ⊂ ⟦m⟧′ := {1′, . . . , m′}, we write 1_S for the vector in R^n of zeros except for ones at the indices enumerated in S, and likewise for the vector 1_{S′} in R^m with indices in S′.

In what follows, (f, g) is a feasible dual pair in R(C). Recall that this simply means
that for all pairs (i, j 0 ) ∈ JnK × JmK0 , fi + gj ≤ Cij . We say that (i, j 0 ) is a balanced pair
(or edge) if fi + gj = Cij and inactive otherwise, namely if fi + gj < Cij . With this

convention, we start with a simple result describing how a feasible dual pair (f, g) can
be perturbed using sparse vectors indexed by sets S and S 0 and still remain feasible.
Proposition 3.5. (f̃, g̃) := (f, g) + ε(1_S, −1_{S′}) is dual feasible for a small enough ε > 0 if, for all i ∈ S, the fact that (i, j′) is balanced implies that j′ ∈ S′.

Proof. For any i ∈ S, consider the set Ii of all j 0 ∈ JmK0 such that (i, j 0 ) is inactive,
namely such that f_i + g_j < C_{i,j}. Define ε_i := min_{j′∈I_i} C_{i,j} − f_i − g_j, the smallest margin
by which fi can be increased without violating the constraints corresponding to j 0 ∈ Ii .
Indeed, one has that if ε ≤ εi then f̃i + g̃j < Ci,j for any j 0 ∈ Ii . Consider now the set
Bi of balanced edges associated with i. Note that Bi = JmK0 \ Ii . The assumption above
is that j 0 ∈ Bi ⇒ j 0 ∈ S 0 . Therefore, one has that for j 0 ∈ Bi , f̃i + g̃j = fi + gj = Ci,j . As
a consequence, the inequality f̃i + g̃j ≤ Ci,j is ensured for any j ∈ JmK0 . Choosing now
an increase ε smaller than the smallest possible allowed, namely mini∈S εi , we recover
that (f̃, g̃) is dual feasible.

The main motivation behind the iteration of the network simplex presented in §3.5.1
is to obtain, starting from a feasible primal solution P, a complementary feasible dual
pair (f, g). To reach that goal, P is progressively modified such that its complementary
dual pair reaches dual feasibility. A symmetric approach, starting from a feasible dual
variable to obtain a feasible primal P, motivates dual ascent methods. The proposi-
tion below is the main engine of dual ascent methods in the sense that it guarantees
(constructively) the existence of an ascent direction for (f, g) that maintains feasibil-
ity. That direction is built, similarly to the network simplex, by designing a candidate
primal solution P whose infeasibility guides an update for (f, g).

Proposition 3.6. Either (f, g) is optimal for Problem (3.4) or there exist S ⊂ ⟦n⟧, S′ ⊂ ⟦m⟧′ such that (f̃, g̃) := (f, g) + ε(1_S, −1_{S′}) is feasible for a small enough ε > 0 and has a strictly better objective.

Proof. We consider first a complementary primal variable P to (f, g). To that effect, let
B be the set of balanced edges, namely all pairs (i, j 0 ) ∈ JnK × JmK0 such that fi + gj =
Ci,j , and form the bipartite graph whose vertices {1, . . . , n, 10 , . . . , m0 } are linked with
edges in B only, complemented by a source node s connected with capacitated edges to
all nodes i ∈ JnK with respective capacities ai , and a terminal node t also connected
to all nodes j 0 ∈ JmK0 with edges of respective capacities bj , as seen in Figure 3.6.
The Ford–Fulkerson algorithm ([Bertsimas and Tsitsiklis, 1997, p. 305]) can be used to
compute a maximal flow F on that network, namely a family of n+m+|B| nonnegative
values indexed by (i, j′) ∈ B as f_{si} ≤ a_i, f_{ij′}, f_{j′t} ≤ b_j that obey flow constraints and such that Σ_i f_{si} is maximal. If the throughput of that flow F is equal to 1, then a feasible
primal solution P, complementary to f, g by construction, can be extracted from F by
defining Pi,j = fij 0 for (i, j 0 ) ∈ B and zero elsewhere, resulting in the optimality of (f, g)
3.6. Dual Ascent Methods 51

and P by Proposition 3.3. If the throughput of F is strictly smaller than 1, the labeling
algorithm proceeds by labeling (identifying) those nodes reached iteratively from s for
which F does not saturate capacity constraints, as well as those nodes that contribute
flow to any of the labeled nodes. Labeled nodes are stored in a nonempty set Q, which
does not contain the terminal node t per optimality of F (see Bertsimas and Tsitsiklis
1997, p. 308, for a rigorous presentation of the algorithm). Q can be split into two sets
S = Q ∩ JnK and S 0 = Q ∩ JmK0 . Because we have assumed that the total throughput
is strictly smaller than 1, S 6= ∅. Note first that if i ∈ S and (i, j) is balanced, then j 0
is necessarily in S 0 . Indeed, since all edges (i, j 0 ) have infinite capacity by construction,
the labeling algorithm will necessarily reach j 0 if it includes i in S. By Proposition 3.5,
there exists thus a small enough ε to ensure the feasibility of f̃, g̃. One still needs to
prove that 1S T a − 1S 0 T b > 0 to ensure that (f̃, g̃) has a better objective than (f, g).
Let S̄ = JnK \ S and S̄ 0 = JmK0 \ S 0 and define
$$A = \sum_{i \in S} f_{si}, \qquad B = \sum_{i \in \bar{S}} f_{si}, \qquad C = \sum_{j' \in S'} f_{j't}, \qquad D = \sum_{j' \in \bar{S}'} f_{j't}.$$

The total maximal flow starts from s and is therefore equal to A + B, but also arrives
at t and is therefore equal to C + D. Flow conservation constraints also impose that the
very same flow is equal to B + C, therefore A = C. On the other hand, by definition
of the labeling algorithm, we have for all i in S that f_{si} < a_i, whereas f_{j′t} = b_j for j′ ∈ S′ because t cannot be in S′ by optimality of the considered flow. We therefore have A < 1_S^T a and C = 1_{S′}^T b. Therefore 1_S^T a − 1_{S′}^T b > A − C = 0.

The dual ascent method proceeds by modifying any feasible solution (f, g) by any
vector generated by sets S, S 0 that ensure feasibility and improve the objective. When
the sets S, S 0 are those given by construction in the proof of Proposition 3.6, and
the steplength ε is defined as in the proof of Proposition 3.5, we recover a method
known as the primal-dual method. That method reduces to the Hungarian algorithm
for matching problems. Dual ascent methods share similarities with the dual variant
of the network simplex, yet they differ in at least two important aspects. Simplex-type
methods always ensure that the current solution is an extreme point of the feasible
set, R(C) for the dual, whereas dual ascent as presented here does not make such
an assumption, and can freely produce iterates that lie in the interior of the feasible
set. Additionally, whereas the dual network simplex would proceed by modifying (f, g)
to produce a primal solution P that satisfies linear (marginal constraints) but only
nonnegativity upon convergence, dual ascent builds instead a primal solution P that is
always nonnegative but which does not necessarily satisfy marginal constraints.

[Figure 3.6 panels: (a) given balanced edges, what is the maximal flow possible? (b) max-flow = 0.74; (c) the labeling algorithm identifies the sets S = {2, 3, 4, 5} and S′ = {2′, 3′, 4′, 6′}; (d) total flows A, B, C, D through the nodes in S, S̄, S′, S̄′.]

Figure 3.6: Consider a transportation problem involving the marginals introduced first in Figure 3.3, with n = 5, m = 6. Given two feasible dual vectors f, g, we try to obtain the “best” flow matrix P that is complementary to (f, g). Recall that this means that P can only take positive values on those edges (i, j′) corresponding to indices for which f_i + g_j = C_{i,j}, here represented with dotted lines in plot (a). The best flow that can be achieved with that graph structure can be formulated as a max-flow problem in a capacitated network, starting from an abstract source node s connected to all nodes labeled i ∈ ⟦n⟧, terminating at an abstract terminal node t connected to all nodes labeled j′, where j ∈ ⟦m⟧, and such that the capacities of edges (s, i), (j′, t), i ∈ ⟦n⟧, j ∈ ⟦m⟧ are respectively a_i, b_j, and all others infinite. The Ford–Fulkerson algorithm ([Bertsimas and Tsitsiklis, 1997, p. 305]) can be applied to compute such a max-flow, which, as represented in plot (b), only achieves 0.74 units of mass out of the 1 needed to solve the problem. One of the subroutines used by max-flow algorithms, the labeling algorithm ([Bertsimas and Tsitsiklis, 1997, p. 308]), can be used to identify nodes that receive an unsaturated flow from s (and recursively, all of their successors), denoted by orange lines in plot (c). The labeling algorithm also adds by default nodes that send a positive flow to any labeled node, which is the criterion used to select node 3, which contributes with a red line to 3′. Labeled nodes can be grouped in sets S, S′ to identify nodes which can be better exploited to obtain a higher flow, by modifying f, g to obtain a different graph. The proof involves partial sums of flows described in plot (d).

3.7 Auction Algorithm

The auction algorithm was originally proposed by Bertsekas [1981] and later refined
in [Bertsekas and Eckstein, 1988]. Several economic interpretations of this algorithm
have been proposed (see e.g. Bertsekas [1992]). The algorithm can be adapted for arbi-
trary marginals, but we present it here in its formulation to solve optimal assignment
problems.
Complementary slackness. Notice that in the optimal assignment problem, the primal-dual conditions presented for the optimal transport problem become easier to formulate, because any extremal solution P is necessarily a permutation matrix P_σ for a given σ (see Equation (3.3)). Given primal P_{σ*} and dual f*, g* optimal solutions, we necessarily have that
f*_i + g*_{σ*_i} = C_{i,σ*_i}.
Recall also that, because of the principle of C-transforms enunciated in §3.2, one can choose f* to be equal to g^{C̄}. We therefore have that
C_{i,σ*_i} − g*_{σ*_i} = min_j C_{i,j} − g*_j. (3.7)
Conversely, it is easy to show that if there exists a vector g and a permutation σ such that
C_{i,σ_i} − g_{σ_i} = min_j C_{i,j} − g_j (3.8)
holds, then they are both optimal, in the sense that σ is an optimal assignment and (g^{C̄}, g) is an optimal dual pair.

Partial assignments and ε-complementary slackness. The goal of the auction algo-
rithm is to modify iteratively a triplet S, ξ, g, where S is a subset of JnK, ξ a partial
assignment vector, namely an injective map from S to JnK, and g a dual vector. The
dual vector is meant to converge toward a solution satisfying an approximate com-
plementary slackness property (3.8), whereas S grows to cover JnK as ξ describes a
permutation. The algorithm works by maintaining the three following properties after
each iteration:
(a) ∀ i ∈ S, C_{i,ξ_i} − g_{ξ_i} ≤ ε + min_j C_{i,j} − g_j (ε-CS).
(b) The size of S can only increase at each iteration.
(c) There exists an index i such that gi decreases by at least ε.

Auction algorithm updates. Given an index i ∉ S, the auction algorithm uses not only the optimum appearing in the usual C-transform but also a second best,
j_i^1 ∈ argmin_j C_{i,j} − g_j,   j_i^2 ∈ argmin_{j ≠ j_i^1} C_{i,j} − g_j,
to define the following updates on g, as well as on S and ξ:
1. update g: Remove from the j_i^1-th entry of g the sum of ε and the difference between the second lowest and lowest adjusted costs {C_{i,j} − g_j}_j,
g_{j_i^1} ← g_{j_i^1} − ((C_{i,j_i^2} − g_{j_i^2}) − (C_{i,j_i^1} − g_{j_i^1}) + ε)
         = C_{i,j_i^1} − (C_{i,j_i^2} − g_{j_i^2}) − ε, (3.9)
where the subtracted quantity is larger than or equal to ε > 0.
2. update S and ξ: If there exists an index i′ ∈ S such that ξ_{i′} = j_i^1, remove it by updating S ← S \ {i′}. Set ξ_i = j_i^1 and add i to S, S ← S ∪ {i}.
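As an illustration, here is a minimal sketch of these two update steps in Python/NumPy, assuming a dense square cost matrix C and a fixed slack ε > 0; the function name and the way the unassigned index i is picked are illustrative choices, not part of the algorithm description above.

```python
import numpy as np

def auction_assignment(C, eps):
    """Sketch of the auction iterations: maintain a price vector g and a partial
    assignment xi, and repeat the two updates above until every index is assigned."""
    n = C.shape[0]
    g = np.zeros(n)                      # dual vector g, initialized at 0_n
    xi = -np.ones(n, dtype=int)          # xi[i] = column assigned to i, -1 if i is not in S
    while (xi < 0).any():
        i = int(np.where(xi < 0)[0][0])  # pick any index i not in S
        adjusted = C[i, :] - g           # adjusted costs {C_{i,j} - g_j}_j
        j1 = int(np.argmin(adjusted))    # best index j_i^1
        second_best = np.delete(adjusted, j1).min()   # value C_{i,j_i^2} - g_{j_i^2}
        # update g, Equation (3.9): g_{j1} = C_{i,j1} - (C_{i,j2} - g_{j2}) - eps
        g[j1] = C[i, j1] - second_best - eps
        # update S and xi: steal j1 from its current owner (if any), then assign it to i
        owner = np.where(xi == j1)[0]
        if owner.size > 0:
            xi[owner[0]] = -1
        xi[i] = j1
    return xi, g
```

Upon termination the assignment ξ is nε-suboptimal, as stated in Proposition 3.9.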

Algorithmic properties. The algorithm starts from an empty set of assigned points S = ∅, an empty partial assignment vector ξ, and g = 0_n, and loops through both steps above until it terminates with S = ⟦n⟧. The fact that properties (b) and (c) are valid after each iteration is made obvious by the nature of the updates (it suffices to look at Equation (3.9)). ε-complementary slackness is easy to satisfy at the first iteration since in that case S = ∅. The fact that iterations preserve that property is shown by the following proposition.
Proposition 3.7. The auction algorithm maintains ε-complementary slackness at each
iteration.
Proof. Let g, ξ, S be the three variables at the beginning of a given iteration. We therefore assume that for any i′ ∈ S the relationship
C_{i′,ξ_{i′}} − g_{ξ_{i′}} ≤ ε + min_j C_{i′,j} − g_j
holds. Consider now the particular i ∉ S considered in an iteration. Three updates happen: g, ξ, S are updated to g^n, ξ^n, S^n using indices j_i^1 and j_i^2. More precisely, g^n is equal to g except for element j_i^1, whose value is equal to
g^n_{j_i^1} = g_{j_i^1} − ((C_{i,j_i^2} − g_{j_i^2}) − (C_{i,j_i^1} − g_{j_i^1}) + ε) ≤ g_{j_i^1} − ε,
ξ^n is equal to ξ except for its ith element, equal to j_i^1, and S^n is equal to the union of {i} with S (with possibly one element removed). The update of g^n can be rewritten
g^n_{j_i^1} = C_{i,j_i^1} − (C_{i,j_i^2} − g_{j_i^2}) − ε;
therefore we have
C_{i,j_i^1} − g^n_{j_i^1} = ε + (C_{i,j_i^2} − g_{j_i^2}) = ε + min_{j≠j_i^1} (C_{i,j} − g_j).
Since −g ≤ −g^n this implies that
C_{i,j_i^1} − g^n_{j_i^1} = ε + min_{j≠j_i^1} (C_{i,j} − g_j) ≤ ε + min_{j≠j_i^1} (C_{i,j} − g^n_j),
and since the inequality is also obviously true for j = j_i^1, we therefore obtain the ε-complementary slackness property for index i. For other indices i′ ≠ i, since g^n ≤ g, the following sequence of inequalities holds,
C_{i′,ξ^n_{i′}} − g^n_{ξ^n_{i′}} = C_{i′,ξ_{i′}} − g_{ξ_{i′}} ≤ ε + min_j C_{i′,j} − g_j ≤ ε + min_j C_{i′,j} − g^n_j.
Proposition 3.8. The number of steps of the auction algorithm is at most N = n‖C‖_∞/ε.
Proof. Suppose that the algorithm has not stopped after T > N steps. Then there exists an index j which is not in the image of ξ, namely whose price coordinate g_j has never been updated and is still g_j = 0. In that case, there cannot exist an index j′ such that g_{j′} was updated n times with n > ‖C‖_∞/ε. Indeed, if that were the case, then for any index i
g_{j′} ≤ −nε < −‖C‖_∞ ≤ −C_{i,j} = g_j − C_{i,j},
which would result in, for all i,
C_{i,j′} − g_{j′} > C_{i,j′} + (C_{i,j} − g_j),
which contradicts ε-CS. Therefore, since there cannot be more than ‖C‖_∞/ε updates for each variable, the total number of iterations T cannot be larger than n‖C‖_∞/ε = N.

Remark 3.3. Note that this result yields a naive number of operations of n³‖C‖_∞/ε for the algorithm to terminate. That complexity can be reduced to n³ log ‖C‖_∞ when using a clever method known as ε-scaling, designed to decrease the value of ε with each iteration ([Bertsekas, 1998, p. 264]).
Proposition 3.9. The auction algorithm finds an assignment whose cost is nε subopti-
mal.
Proof. Let σ, g* be the primal and dual optimal solutions of the assignment problem of matrix C, with optimum
t* = Σ_i C_{i,σ_i} = Σ_i min_j (C_{i,j} − g*_j) + Σ_j g*_j.
Let ξ, g be the solutions output by the auction algorithm upon termination. The ε-CS conditions yield that for any i ∈ S,
min_j C_{i,j} − g_j ≥ C_{i,ξ_i} − g_{ξ_i} − ε.
Therefore, by simple suboptimality of g we first have
t* ≥ Σ_i (min_j C_{i,j} − g_j) + Σ_j g_j ≥ Σ_i (−ε + C_{i,ξ_i} − g_{ξ_i}) + Σ_j g_j = −nε + Σ_i C_{i,ξ_i} ≥ −nε + t*,
where the second inequality comes from ε-CS, the next equality by cancellation of the sum of terms in g_{ξ_i} and g_j, and the last inequality by the suboptimality of ξ as a permutation.
The auction algorithm can therefore be regarded as an alternative way to use the
machinery of C-transforms. Next we explore another approach grounded on regulariza-
tion, the so-called Sinkhorn algorithm, which also bears similarities with the auction
algorithm as discussed in [Schmitzer, 2016b].
Note finally that, on low-dimensional regular grids in Euclidean space, it is possible
to couple these classical linear solvers with multiscale strategies, to obtain a significant
speed-up [Schmitzer, 2016a, Oberman and Ruan, 2015].
4
Entropic Regularization of Optimal Transport

This chapter introduces a family of numerical schemes to approximate solutions to the Kantorovich formulation of optimal transport and its many generalizations. It operates by adding an entropic regularization penalty to the original problem. This regularization has several important advantages, which make it, when taken altogether, a very useful tool: the minimization of the regularized problem can be solved using a simple alternate minimization scheme; that scheme translates into iterations that are simple matrix-vector products, making them particularly suited to execution on GPUs; for some applications, these matrix-vector products do not require storing an n × m cost matrix, but instead only require access to a kernel evaluation; in the case where a large group of measures share the same support, all of these matrix-vector products can be cast as matrix-matrix products with significant speedups; and the resulting approximate distance is smooth with respect to input histogram weights and positions of the Diracs, and can be differentiated using automatic differentiation.

4.1 Entropic Regularization

The discrete entropy of a coupling matrix is defined as
H(P) ≝ − Σ_{i,j} P_{i,j} (log(P_{i,j}) − 1), (4.1)
with an analogous definition for vectors, with the convention that H(a) = −∞ if one of the entries a_j is 0 or negative. The function H is 1-strongly concave, because its Hessian is ∂²H(P) = −diag(1/P_{i,j}) and P_{i,j} ≤ 1. The idea of the entropic regularization of optimal transport is to use −H as a regularizing function to obtain approximate
solutions to the original transport problem (2.11):


L^ε_C(a, b) ≝ min_{P ∈ U(a,b)} ⟨P, C⟩ − ε H(P). (4.2)

Since the objective is an ε-strongly convex function, Problem (4.2) has a unique optimal
solution. The idea to regularize the optimal transport problem by an entropic term can
be traced back to modeling ideas in transportation theory [Wilson, 1969]: Actual traffic
patterns in a network do not agree with those predicted by the solution of the optimal
transport problem. Indeed, the former are more diffuse than the latter, which tend
to rely on a few routes as a result of the sparsity of optimal couplings for (2.11).
To mitigate this sparsity, researchers in transportation proposed a model, called the
“gravity” model [Erlander, 1980], that is able to form a more “blurred” prediction of
traffic given marginals and transportation costs.
Figure 4.1 illustrates the effect of the entropy to regularize a linear program over the
simplex Σ3 (which can thus be visualized as a triangle in two dimensions). Note how
the entropy pushes the original LP solution away from the boundary of the triangle.
The optimal Pε progressively moves toward an “entropic center” of the triangle. This
is further detailed in the proposition below. The convergence of the solution of that
regularized problem toward an optimal solution of the original linear program has been
studied by Cominetti and San Martín [1994], with precise asymptotics.

Figure 4.1: Impact of ε on the optimization of a linear function on the simplex, solving P_ε = argmin_{P∈Σ_3} ⟨C, P⟩ − εH(P) for a varying ε.

Proposition 4.1 (Convergence with ε). The unique solution P_ε of (4.2) converges to the optimal solution with maximal entropy within the set of all optimal solutions of the Kantorovich problem, namely
P_ε ⟶_{ε→0} argmin_P {−H(P) : P ∈ U(a, b), ⟨P, C⟩ = L_C(a, b)}, (4.3)
so that in particular
L^ε_C(a, b) ⟶_{ε→0} L_C(a, b).
One also has
P_ε ⟶_{ε→∞} a ⊗ b = a b^⊤ = (a_i b_j)_{i,j}. (4.4)
Proof. We consider a sequence (ε` )` such that ε` → 0 and ε` > 0. We denote P` the
solution of (4.2) for ε = ε` . Since U(a, b) is bounded, we can extract a sequence (that
we do not relabel for the sake of simplicity) such that P` → P? . Since U(a, b) is closed,
P? ∈ U(a, b). We consider any P such that hC, Pi = LC (a, b). By optimality of P
and P` for their respective optimization problems (for ε = 0 and ε = ε` ), one has

0 ≤ hC, P` i − hC, Pi ≤ ε` (H(P` ) − H(P)). (4.5)

Since H is continuous, taking the limit ` → +∞ in this expression shows that hC, P? i =
hC, Pi so that P? is a feasible point of (4.3). Furthermore, dividing by ε` in (4.5) and
taking the limit shows that H(P) ≤ H(P? ), which shows that P? is a solution of (4.3).
Since the solution P?0 to this program is unique by strict convexity of −H, one has
P? = P?0 , and the whole sequence is converging. In the limit ε → +∞, a similar proof
shows that one should rather consider the problem
min_{P∈U(a,b)} −H(P),
the solution of which is a ⊗ b.

Formula (4.3) states that for a small regularization ε, the solution converges to the
maximum entropy optimal transport coupling. In sharp contrast, (4.4) shows that for
a large regularization, the solution converges to the coupling with maximal entropy be-
tween two prescribed marginals a, b, namely the joint probability between two indepen-
dent random variables distributed following a, b. A refined analysis of this convergence
is performed in Cominetti and San Martín [1994], including a first order expansion in
ε (resp., 1/ε) near ε = 0 (resp., ε = +∞). Figures 4.2 and 4.3 show visually the effect
of these two convergences. A key insight is that, as ε increases, the optimal coupling
becomes less and less sparse (in the sense of having entries larger than a prescribed
threshold), which in turn has the effect of both accelerating computational algorithms
(as we study in §4.2) and leading to faster statistical convergence (as shown in §8.5).
Defining the Kullback–Leibler divergence between couplings as
KL(P|K) ≝ Σ_{i,j} P_{i,j} log(P_{i,j}/K_{i,j}) − P_{i,j} + K_{i,j}, (4.6)
the unique solution P_ε of (4.2) is a projection onto U(a, b) of the Gibbs kernel associated to the cost matrix C as
K_{i,j} ≝ e^{−C_{i,j}/ε}.
Indeed one has that, using the definition above,
P_ε = Proj^{KL}_{U(a,b)}(K) ≝ argmin_{P∈U(a,b)} KL(P|K). (4.7)
Figure 4.2: Impact of ε on the couplings between two 1-D densities, illustrating Proposition 4.1 (columns: ε = 10, 1, 10^{−1}, 10^{−2}). Top row: between two 1-D densities. Bottom row: between two 2-D discrete empirical densities with the same number n = m of points (only entries of the optimal (P_{i,j})_{i,j} above a small threshold are displayed as segments between x_i and y_j).

Figure 4.3: Impact of ε on the coupling between two 2-D discrete empirical densities with the same number n = m of points (only entries of the optimal (P_{i,j})_{i,j} above a small threshold are displayed as segments between x_i and y_j).

Remark 4.1 (Entropic regularization between discrete measures). For discrete measures of the form (2.1), the definition of regularized transport extends naturally to
L^ε_c(α, β) ≝ L^ε_C(a, b), (4.8)
with cost C_{i,j} = c(x_i, y_j), to emphasize the dependency with respect to the positions (x_i, y_j) supporting the input measures.
Remark 4.2 (General formulation). One can consider arbitrary measures by replacing the discrete entropy by the relative entropy with respect to the product measure dα ⊗ dβ(x, y) ≝ dα(x)dβ(y), and propose a regularized counterpart to (2.15) using
L^ε_c(α, β) ≝ min_{π∈U(α,β)} ∫_{X×Y} c(x, y) dπ(x, y) + ε KL(π|α ⊗ β), (4.9)
where the relative entropy is a generalization of the discrete Kullback–Leibler divergence (4.6)
KL(π|ξ) ≝ ∫_{X×Y} log(dπ/dξ (x, y)) dπ(x, y) + ∫_{X×Y} (dξ(x, y) − dπ(x, y)), (4.10)
and by convention KL(π|ξ) = +∞ if π does not have a density dπ/dξ with respect to ξ. It is important to realize that the reference measure α ⊗ β chosen in (4.9) to define the entropic regularizing term KL(·|α ⊗ β) plays no specific role; only its support matters, as noted by the following proposition.

Proposition 4.2. For any π ∈ U(α, β), and for any (α′, β′) having the same zero measure sets as (α, β) (so that they both have densities with respect to one another), one has
KL(π|α ⊗ β) = KL(π|α′ ⊗ β′) − KL(α ⊗ β|α′ ⊗ β′).
This proposition shows that choosing KL(·|α′ ⊗ β′) in place of KL(·|α ⊗ β) in (4.9) results in the same solution.
Formula (4.9) can be refactored as a projection problem
min_{π∈U(α,β)} KL(π|K), (4.11)
where K is the Gibbs distribution dK(x, y) ≝ e^{−c(x,y)/ε} dα(x)dβ(y). This problem is often referred to as the “static Schrödinger problem” [Léonard, 2014, Rüschendorf and Thomsen, 1998], since it was initially considered by Schrödinger in statistical physics [Schrödinger, 1931]. As ε → 0, the unique solution to (4.11) converges to the maximum entropy solution to (2.15); see [Léonard, 2012, Carlier et al., 2017]. Section 7.6 details an alternate “dynamic” formulation of the Schrödinger problem over the space of paths connecting the points of two measures.

Remark 4.3 (Mutual entropy). Similarly to (2.16), one can rephrase (4.9) using random variables
L^ε_c(α, β) = min_{(X,Y)} { E_{(X,Y)}(c(X, Y)) + ε I(X, Y) : X ∼ α, Y ∼ β },
where, denoting π the distribution of (X, Y), I(X, Y) ≝ KL(π|α ⊗ β) is the so-called mutual information between the two random variables. One has I(X, Y) ≥ 0, and I(X, Y) = 0 if and only if the two random variables are independent.

Remark 4.4 (Independence and couplings). A coupling π ∈ U(α, β) describes the distribution of a couple of random variables (X, Y) defined on (X, Y), where X (resp., Y) has law α (resp., β). Proposition 4.1 carries over for generic (not necessarily discrete) measures, so that the solution π_ε of (4.9) converges to the tensor product coupling α ⊗ β as ε → +∞. This coupling α ⊗ β corresponds to the random variables (X, Y) being independent. In contrast, as ε → 0, π_ε converges to a solution π_0 of the OT problem (2.15). On X = Y = R^d, if α and β have densities with respect to the Lebesgue measure, as detailed in Remark 2.24, then π_0 is unique and supported on the graph of a bijective Monge map T : R^d → R^d. In this case, (X, Y) are in some sense fully dependent, since Y = T(X) and X = T^{−1}(Y). In the simple 1-D case d = 1, a convenient way to visualize the dependency structure between X and Y is to use the copula ξ_π associated to the joint distribution π. The cumulative function defined in (2.34) is extended to couplings as
∀ (x, y) ∈ R², C_π(x, y) ≝ ∫_{−∞}^{x} ∫_{−∞}^{y} dπ.
The copula is then defined as
∀ (s, t) ∈ [0, 1]², ξ_π(s, t) ≝ C_π(C_α^{−1}(s), C_β^{−1}(t)),
where the pseudoinverse of a cumulative function is defined in (2.35). For independent variables, ε = +∞, i.e. π = α ⊗ β, one has ξ_{π_{+∞}}(s, t) = st. In contrast, for fully dependent variables, ε = 0, one has ξ_{π_0}(s, t) = min(s, t). Figure 4.4 shows how entropic regularization generates copulas ξ_{π_ε} interpolating between these two extreme cases.

4.2 Sinkhorn’s Algorithm and Its Convergence

The following proposition shows that the solution of (4.2) has a specific form, which can
be parameterized using n + m variables. That parameterization is therefore essentially
dual, in the sense that a coupling P in U(a, b) has nm variables but n + m constraints.
Figure 4.4: Top: evolution with ε of the solution π_ε of (4.9) between marginals α and β (columns: ε = 10, 1, 0.5·10^{−1}, 10^{−1}, 10^{−3}). Bottom: evolution of the copula function ξ_{π_ε}.

Proposition 4.3. The solution to (4.2) is unique and has the form
∀ (i, j) ∈ ⟦n⟧ × ⟦m⟧, P_{i,j} = u_i K_{i,j} v_j (4.12)
for two (unknown) scaling variables (u, v) ∈ R^n_+ × R^m_+.

Proof. Introducing two dual variables f ∈ R^n, g ∈ R^m for each marginal constraint, the Lagrangian of (4.2) reads
E(P, f, g) = ⟨P, C⟩ − εH(P) − ⟨f, P1_m − a⟩ − ⟨g, P^⊤1_n − b⟩.
First order conditions then yield
∂E(P, f, g)/∂P_{i,j} = C_{i,j} + ε log(P_{i,j}) − f_i − g_j = 0,
which results, for an optimal coupling P of the regularized problem, in the expression P_{i,j} = e^{f_i/ε} e^{−C_{i,j}/ε} e^{g_j/ε}, which can be rewritten in the form provided above using nonnegative vectors u and v.

Regularized OT as matrix scaling. The factorization of the optimal solution exhibited in Equation (4.12) can be conveniently rewritten in matrix form as P = diag(u) K diag(v). The variables (u, v) must therefore satisfy the following nonlinear equations, which correspond to the mass conservation constraints inherent to U(a, b):
diag(u) K diag(v) 1_m = a, and diag(v) K^⊤ diag(u) 1_n = b. (4.13)

These two equations can be further simplified, since diag(v)1_m is simply v, and the multiplication of diag(u) times Kv is
u ⊙ (Kv) = a and v ⊙ (K^⊤u) = b, (4.14)
where ⊙ corresponds to entrywise multiplication of vectors. That problem is known in the numerical analysis community as the matrix scaling problem (see [Nemirovski and Rothblum, 1999] and references therein). An intuitive way to handle these equations is to solve them iteratively, by modifying first u so that it satisfies the left-hand side of Equation (4.14) and then v to satisfy its right-hand side. These two updates define Sinkhorn’s algorithm,
u^{(ℓ+1)} ≝ a / (K v^{(ℓ)}) and v^{(ℓ+1)} ≝ b / (K^⊤ u^{(ℓ+1)}), (4.15)
initialized with an arbitrary positive vector v(0) = 1m . The division operator used above
between two vectors is to be understood entrywise. Note that a different initialization
will likely lead to a different solution for u, v, since u, v are only defined up to a
multiplicative constant (if u, v satisfy (4.13) then so do λu, v/λ for any λ > 0). It turns
out, however, that these iterations converge (see Remark 4.8 for a justification using
iterative projections, and see Remark 4.14 for a strict contraction result) and all result
in the same optimal coupling diag(u)K diag(v). Figure 4.5, top row, shows the evolution
of the coupling diag(u(`) )K diag(v(`) ) computed by Sinkhorn iterations. It evolves from
the Gibbs kernel K toward the optimal coupling solving (4.2) by progressively shifting
the mass away from the diagonal.
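As an illustration, here is a minimal sketch of iterations (4.15) in Python/NumPy, assuming dense arrays a, b (positive histograms) and C (cost matrix); the function name and the fixed iteration count are illustrative choices.

```python
import numpy as np

def sinkhorn(a, b, C, eps, n_iter=1000):
    """Sketch of Sinkhorn's iterations (4.15) for dense inputs."""
    K = np.exp(-C / eps)                 # Gibbs kernel K_{i,j} = exp(-C_{i,j}/eps)
    v = np.ones_like(b)                  # initialization v^(0) = 1_m
    for _ in range(n_iter):
        u = a / (K @ v)                  # enforce row marginals
        v = b / (K.T @ u)                # enforce column marginals
    P = u[:, None] * K * v[None, :]      # coupling diag(u) K diag(v)
    return P, u, v
```

In practice one monitors a marginal violation such as ‖u ⊙ (Kv) − a‖_1 as a stopping criterion, in line with the convergence bounds of Theorem 4.2 below.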

Remark 4.5 (Historical perspective). The iterations (4.15) first appeared in [Yule, 1912,
Kruithof, 1937]. They were later known as the iterative proportional fitting procedure
(IPFP) Deming and Stephan [1940] and RAS [Bacharach, 1965] methods [Idel, 2016].
The proof of their convergence is attributed to Sinkhorn [1964], hence the name of the
algorithm. This algorithm was later extended in infinite dimensions by Ruschendorf
[1995]. This regularization was used in the field of economics to obtain approximate
solutions to optimal transport problems, under the name of gravity models [Wilson,
1969, Erlander, 1980, Erlander and Stewart, 1990]. It was rebranded as “softassign”
by Kosowsky and Yuille [1994] in the assignment case, namely when a = b = 1n /n,
and used to solve matching problems in economics more recently by Galichon and
Salanié [2009]. This regularization has received renewed attention in data sciences (in-
cluding machine learning, vision, graphics and imaging) following [Cuturi, 2013], who
showed that Sinkhorn’s algorithm provides an efficient and scalable approximation to
optimal transport, thanks to seamless parallelization when solving several OT problems
simultaneously (notably on GPUs; see Remark 4.16), and that this regularized quantity
also defines, unlike the linear programming formulation, a differentiable loss function
(see §4.5). There exist countless extensions and generalizations of the Sinkhorn algo-
rithm (see for instance §4.6). For instance, when a = b, one can use averaged projection
iterations to maintain symmetry [Knight et al., 2014].

Figure 4.5: Top: evolution of the coupling π_ε^{(ℓ)} = diag(u^{(ℓ)}) K diag(v^{(ℓ)}) computed at iteration ℓ of Sinkhorn’s iterations, for 1-D densities on X = [0, 1], c(x, y) = |x − y|², and ε = 0.1. Bottom: impact of ε (columns: ε = 10, 0.1, 10^{−3}) on the convergence rate of Sinkhorn, as measured in terms of the marginal constraint violation log(‖π_ε^{(ℓ)} 1_m − b‖_1) as a function of the iteration counter ℓ.

Remark 4.6 (Overall complexity). By doing a careful convergence analysis (assuming n = m for the sake of simplicity), Altschuler et al. [2017] showed that by setting ε = 4 log(n)/τ, O(‖C‖³_∞ log(n) τ^{−3}) Sinkhorn iterations (with an additional rounding step to compute a valid coupling P̂ ∈ U(a, b)) are enough to ensure that ⟨P̂, C⟩ ≤ L_C(a, b) + τ. This implies that Sinkhorn computes a τ-approximate solution of the unregularized OT problem in O(n² log(n) τ^{−3}) operations. The rounding scheme consists, given two vectors u ∈ R^n, v ∈ R^m, in carrying out the following updates ([Altschuler et al., 2017, Alg. 2]):
u′ ≝ u ⊙ min(a / (u ⊙ (Kv)), 1_n), v′ ≝ v ⊙ min(b / (v ⊙ (K^⊤u′)), 1_m),
Δ_a ≝ a − u′ ⊙ (Kv′), Δ_b ≝ b − v′ ⊙ (K^⊤u′),
P̂ ≝ diag(u′) K diag(v′) + Δ_a (Δ_b)^⊤ / ‖Δ_a‖_1.
This yields a matrix P̂ ∈ U(a, b) such that the 1-norm between P̂ and diag(u)K diag(v) is controlled by the marginal violations of diag(u)K diag(v), namely
‖P̂ − diag(u)K diag(v)‖_1 ≤ ‖a − u ⊙ (Kv)‖_1 + ‖b − v ⊙ (K^⊤u)‖_1.
This field remains active, as shown by the recent improvement on the result above by Dvurechensky et al. [2018].
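A minimal NumPy sketch of this rounding step, under the same dense-array assumptions as before (function name illustrative), reads as follows.

```python
import numpy as np

def round_to_feasible(u, v, K, a, b):
    """Sketch of the rounding step recalled above from [Altschuler et al., 2017, Alg. 2]:
    turns the Sinkhorn iterate diag(u) K diag(v) into a coupling with exact marginals."""
    u2 = u * np.minimum(a / (u * (K @ v)), 1.0)      # scale rows down to at most a
    v2 = v * np.minimum(b / (v * (K.T @ u2)), 1.0)   # scale columns down to at most b
    P = u2[:, None] * K * v2[None, :]
    delta_a = a - P.sum(axis=1)                      # remaining row-marginal deficit
    delta_b = b - P.sum(axis=0)                      # remaining column-marginal deficit
    s = np.abs(delta_a).sum()
    if s == 0:                                       # already feasible
        return P
    return P + np.outer(delta_a, delta_b) / s        # rank-one correction
```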
Remark 4.7 (Numerical stability of Sinkhorn iterations). As we discuss in Remarks 4.14
and 4.15, the convergence of Sinkhorn’s algorithm deteriorates as ε → 0. In numerical practice, however, that slowdown is rarely observed, for a simpler reason:
Sinkhorn’s algorithm will often fail to terminate as soon as some of the elements of
the kernel K become too negligible to be stored in memory as positive numbers, and
become instead null. This can then result in a matrix product Kv or KT u with ever
smaller entries that become null and result in a division by 0 in the Sinkhorn update
of Equation (4.15). Such issues can be partly resolved by carrying out computations
on the multipliers u and v in the log domain. That approach is carefully presented in
Remark 4.23 and is related to a direct resolution of the dual of Problem (4.2).
Remark 4.8 (Relation with iterative projections). Denoting
C^1_a ≝ {P : P1_m = a} and C^2_b ≝ {P : P^⊤1_n = b}
the rows and columns constraints, one has U(a, b) = C^1_a ∩ C^2_b. One can use Bregman iterative projections [Bregman, 1967],
P^{(ℓ+1)} ≝ Proj^{KL}_{C^1_a}(P^{(ℓ)}) and P^{(ℓ+2)} ≝ Proj^{KL}_{C^2_b}(P^{(ℓ+1)}). (4.16)
Since the sets C^1_a and C^2_b are affine, these iterations are known to converge to the solution of (4.7); see [Bregman, 1967]. These iterates are equivalent to Sinkhorn iterations (4.15), since defining
P^{(2ℓ)} ≝ diag(u^{(ℓ)}) K diag(v^{(ℓ)}),
one has
P^{(2ℓ+1)} ≝ diag(u^{(ℓ+1)}) K diag(v^{(ℓ)}) and P^{(2ℓ+2)} ≝ diag(u^{(ℓ+1)}) K diag(v^{(ℓ+1)}).
In practice, however, one should prefer using (4.15), which only requires manipulating
scaling vectors and multiplication against a Gibbs kernel, which can often be accelerated
(see Remarks 4.17 and 4.19 below).
Remark 4.9 (Proximal point algorithm). In order to approximate a solution of the unregularized (ε = 0) problem (2.11), it is possible to use iteratively the Sinkhorn algorithm, using the so-called proximal point algorithm for the KL metric. We denote by F(P) ≝ ⟨P, C⟩ + ι_{U(a,b)}(P) the unregularized objective function. The proximal point iterations for the KL divergence compute a minimizer of F, and hence a solution of the unregularized OT problem (2.11), by computing iteratively
P^{(ℓ+1)} ≝ Prox^{KL}_{F/ε}(P^{(ℓ)}) ≝ argmin_{P∈R^{n×m}_+} KL(P|P^{(ℓ)}) + (1/ε) F(P), (4.17)
starting from an arbitrary P^{(0)} (see also (4.52)). The proximal point algorithm is the most basic proximal splitting method. Initially introduced for the Euclidean metric (see, for instance, [Rockafellar, 1976]), it extends to any Bregman divergence [Censor and Zenios, 1992], so in particular it can be applied here for the KL divergence (see Remark 8.1). The proximal operator is usually not available in closed form, so some form of subiterations is required. The optimization appearing in (4.17) is very similar to the entropy regularized problem (4.2), with the relative entropy KL(·|P^{(ℓ)}) used in place of the negative entropy −H. Proposition 4.3 and Sinkhorn iterations (4.15) carry over to this more general setting when defining the Gibbs kernel as K = e^{−C/ε} ⊙ P^{(ℓ)} = (e^{−C_{i,j}/ε} P^{(ℓ)}_{i,j})_{i,j}. Iterations (4.17) can thus be implemented by running the Sinkhorn algorithm at each iteration. Assuming for simplicity P^{(0)} = 1_n 1_m^⊤, these iterations thus have the form
P^{(ℓ+1)} = diag(u^{(ℓ)}) (e^{−C/ε} ⊙ P^{(ℓ)}) diag(v^{(ℓ)})
         = diag(u^{(ℓ)} ⊙ ⋯ ⊙ u^{(0)}) e^{−(ℓ+1)C/ε} diag(v^{(ℓ)} ⊙ ⋯ ⊙ v^{(0)}).
The proximal point iterates therefore apply iteratively Sinkhorn’s algorithm with a kernel e^{−C/(ε/ℓ)}, i.e., with a decaying regularization parameter ε/ℓ. This method is thus tightly connected to a series of works which combine Sinkhorn with some decaying schedule on the regularization; see, for instance, [Kosowsky and Yuille, 1994]. They are efficient in small spatial dimension, when combined with a multigrid strategy to approximate the coupling on an adaptive sparse grid [Schmitzer, 2016b].
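The following is a small sketch of these proximal point iterations, assuming the dense-array setup of the earlier Sinkhorn sketch; the outer/inner iteration counts are illustrative parameters.

```python
import numpy as np

def proximal_sinkhorn(a, b, C, eps, n_outer=10, n_inner=200):
    """Sketch of iterations (4.17): each outer step runs Sinkhorn with the Gibbs
    kernel exp(-C/eps) rescaled entrywise by the previous coupling, which amounts
    to a regularization parameter decaying like eps/l."""
    P = np.ones((a.size, b.size))        # P^(0) = 1_n 1_m^T
    base = np.exp(-C / eps)
    for _ in range(n_outer):
        K = base * P                     # kernel e^{-C/eps} ⊙ P^(l)
        v = np.ones_like(b)
        for _ in range(n_inner):
            u = a / (K @ v)
            v = b / (K.T @ u)
        P = u[:, None] * K * v[None, :]
    return P
```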
Remark 4.10 (Other regularizations). It is possible to replace the entropic term −H(P) in (4.2) by any strictly convex penalty R(P), as detailed, for instance, in [Dessein et al., 2018]. A typical example is the squared ℓ² norm
R(P) = Σ_{i,j} P_{i,j}² + ι_{R_+}(P_{i,j}); (4.18)
see [Essid and Solomon, 2017]. Another example is the family of Tsallis entropies [Muzel-
lec et al., 2017]. Note, however, that if the penalty function is defined even when entries
of P are nonpositive, which is, for instance, the case for a quadratic regularization (4.18),
then one must add back a nonnegativity constraint P ≥ 0, in addition to the marginal
constraints P1m = a and P> 1n = b. Indeed, one can afford to ignore the nonnega-
tivity constraint using entropy because that penalty incorporates a logarithmic term
which forces the entries of P to stay in the positive orthant. This implies that the set
of constraints is no longer affine and iterative Bregman projections do not converge
anymore to the solution. A workaround is to use instead Dykstra’s algorithm (1983,
1985) (see also Bauschke and Lewis 2000), as detailed in [Benamou et al., 2015]. This
algorithm uses projections according to the Bregman divergence associated to R. We
refer to Remark 8.1 for more details regarding Bregman divergences. An issue is that in
general these projections cannot be computed explicitly. For the squared norm (4.18),
this corresponds to computing the Euclidean projection on (Ca1 , Cb2 ) (with the extra
positivity constraints), which can be solved efficiently using projection algorithms on
simplices [Condat, 2015]. The main advantage of the quadratic regularization over entropy is that it produces sparse approximations of the optimal coupling, yet this comes at the expense of a slower algorithm that cannot be parallelized as efficiently as Sinkhorn to compute several optimal transports simultaneously (as discussed in Remark 4.16). Figure 4.6 contrasts the approximation achieved by entropic and quadratic regularizers.

Figure 4.6: Comparison of entropic regularization R = −H (top row, ε = 10, 1, 0.5·10^{−1}, 10^{−1}, 10^{−3}) and quadratic regularization R = ‖·‖² + ι_{R_+} (bottom row, ε = 5·10³, 10³, 10², 10, 1). The (α, β) marginals are the same as for Figure 4.4.
Remark 4.11 (Barycentric projection). Consider again the setting of Remark 4.1 in which we use entropic regularization to approximate OT between discrete measures. The Kantorovich formulation in (2.11) and its entropic regularization (4.2) both yield a coupling P ∈ U(a, b). In order to define a transportation map T : X → Y, in the case where Y = R^d, one can define the so-called barycentric projection map
T : x_i ∈ X ⟼ (1/a_i) Σ_j P_{i,j} y_j ∈ Y, (4.19)
where the input measures are discrete of the form (2.3). Note that this map is only defined for points (x_i)_i in the support of α. In the case where P is a permutation matrix (as detailed in Proposition 2.1), then T is equal to a Monge map, and as ε → 0, the barycentric projection progressively converges to that map if it is unique. For arbitrary (not necessarily discrete) measures, solving (2.15) or its regularized version (4.9) defines a coupling π ∈ U(α, β). Note that this coupling π always has a density dπ(x,y)/(dα(x)dβ(y)) with respect to α ⊗ β. A map can thus be retrieved by the formula
T : x ∈ X ⟼ ∫_Y y (dπ(x, y)/(dα(x)dβ(y))) dβ(y). (4.20)
In the case where, for ε = 0, π is supported on the graph of the Monge map (see Remark 2.24), then using ε > 0 produces a smooth approximation of this map. Such a barycentric projection is useful to apply the OT Monge map to solve problems in imaging; see Figure 9.6 for an application to color modification. It has also been used to compute approximations of principal geodesics in the space of probability measures endowed with the Wasserstein metric; see [Seguy and Cuturi, 2015].
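In the discrete setting of (4.19), the barycentric projection is a one-line computation; the sketch below assumes a coupling P (n × m), its row marginal a, and the target support stored as an (m, d) array Y (names illustrative).

```python
import numpy as np

def barycentric_projection(P, a, Y):
    """Sketch of the barycentric projection map (4.19): each support point x_i of
    alpha is mapped to the P-weighted average of the support points y_j of beta."""
    return (P @ Y) / a[:, None]
```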

Remark 4.12 (Hilbert metric). As initially explained by [Franklin and Lorenz, 1989], the global convergence analysis of Sinkhorn is greatly simplified using the Hilbert projective metric on R^n_{+,*} (positive vectors), defined as
∀ (u, u′) ∈ (R^n_{+,*})², d_H(u, u′) ≝ log max_{i,j} (u_i u′_j)/(u_j u′_i).
It can be shown to be a distance on the projective cone Rn+,∗ / ∼, where u ∼ u0
means that ∃r > 0, u = ru0 (the vectors are equal up to rescaling, hence the
name “projective”). This means that dH satisfies the triangular inequality and
dH (u, u0 ) = 0 if and only if u ∼ u0 . This is a projective version of Hilbert’s original
distance on bounded open convex sets [Hilbert, 1895]. The projective cone Rn+,∗ / ∼
is a complete metric space for this distance. By a logarithmic change of variables,
the Hilbert metric on the rays of the positive cone is isometric to the variation seminorm (it is a norm between vectors that are defined up to an additive constant):
d_H(u, u′) = ‖log(u) − log(u′)‖_var where ‖f‖_var ≝ (max_i f_i) − (min_i f_i). (4.21)
This variation seminorm is closely related to the `∞ norm since one always has
kfkvar ≤ 2 kfk∞ . If one imposes that fi = 0 for some fixed i, then a converse inequal-
ity also holds since kfk∞ ≤ kfkvar . These bounds are especially useful to analyze
Sinkhorn convergence (see Remark 4.14 below), because dual variables f = log(u)
solving (4.14) are defined up to an additive constant, so that one can impose that
fi = 0 for some i. The Hilbert metric was introduced independently by [Birkhoff,
1957] and [Samelson et al., 1957]. They proved the following fundamental theorem,
which shows that a positive matrix is a strict contraction on the cone of positive
vectors.

Theorem 4.1. Let K ∈ R^{n×m}_{+,*}; then for (v, v′) ∈ (R^m_{+,*})²,
d_H(Kv, Kv′) ≤ λ(K) d_H(v, v′), where λ(K) ≝ (√η(K) − 1)/(√η(K) + 1) < 1
and η(K) ≝ max_{i,j,k,ℓ} (K_{i,k} K_{j,ℓ})/(K_{j,k} K_{i,ℓ}).
Figure 4.7 illustrates this theorem.


dH
(u

!
,u

u0
0
)

2
> 0} K K 2 R+
;r KR2+ K
{ru R2+
u =
!

Figure 4.7: Left: the Hilbert metric dH is a distance over rays in cones (here positive vectors). Right:
visualization of the contraction induced by the iteration of a positive matrix K.
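As a small numerical sketch of these quantities (function names illustrative): d_H can be evaluated through the log-domain identity (4.21), and the Birkhoff contraction coefficient λ(K) of Theorem 4.1 can be computed directly for small matrices (the η(K) computation below materializes an n × n × m × m array and is only meant for toy sizes).

```python
import numpy as np

def hilbert_metric(u, v):
    """Hilbert projective metric between two positive vectors, via (4.21)."""
    d = np.log(u) - np.log(v)
    return d.max() - d.min()

def contraction_ratio(K):
    """Birkhoff contraction coefficient lambda(K) of a positive matrix (Theorem 4.1)."""
    # ratio[i, j, k, l] = K[i, k] * K[j, l] / (K[j, k] * K[i, l])
    ratio = (K[:, None, :, None] * K[None, :, None, :]) / (K[None, :, :, None] * K[:, None, None, :])
    eta = ratio.max()
    return (np.sqrt(eta) - 1) / (np.sqrt(eta) + 1)
```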

Remark 4.13 (Perron–Frobenius). A typical application of Theorem 4.1 is to provide a quantitative proof of the Perron–Frobenius theorem, which, as explained in Remark 4.15, is linked to a local linearization of Sinkhorn’s iterates. A matrix K ∈ R^{n×n}_+ with K^⊤1_n = 1_n maps Σ_n into Σ_n. If furthermore K > 0, then according to Theorem 4.1, it is a strict contraction for the metric d_H, hence there exists a unique invariant probability distribution p* ∈ Σ_n with Kp* = p*. Furthermore, for any p_0 ∈ Σ_n, d_H(K^ℓ p_0, p*) ≤ λ(K)^ℓ d_H(p_0, p*), i.e. one has linear convergence of the iterates of the matrix toward p*. This is illustrated in Figure 4.8.

Figure 4.8: Evolution of K^ℓ Σ_3 → {p*}, the invariant probability distribution of K ∈ R^{3×3}_{+,*} with K^⊤1_3 = 1_3.

Remark 4.14 (Global convergence). The following theorem, proved by [Franklin and Lorenz, 1989], makes use of Theorem 4.1 to show the linear convergence of Sinkhorn’s iterations.

Theorem 4.2. One has (u^{(ℓ)}, v^{(ℓ)}) → (u*, v*) and
d_H(u^{(ℓ)}, u*) = O(λ(K)^{2ℓ}), d_H(v^{(ℓ)}, v*) = O(λ(K)^{2ℓ}). (4.22)
One also has
d_H(u^{(ℓ)}, u*) ≤ d_H(P^{(ℓ)}1_m, a) / (1 − λ(K)²),
d_H(v^{(ℓ)}, v*) ≤ d_H(P^{(ℓ),⊤}1_n, b) / (1 − λ(K)²), (4.23)
where we denoted P^{(ℓ)} ≝ diag(u^{(ℓ)}) K diag(v^{(ℓ)}). Last, one has
‖log(P^{(ℓ)}) − log(P*)‖_∞ ≤ d_H(u^{(ℓ)}, u*) + d_H(v^{(ℓ)}, v*), (4.24)
where P* is the unique solution of (4.2).

Proof. One notices that for any (v, v′) ∈ (R^m_{+,*})², one has
d_H(v, v′) = d_H(v/v′, 1_m) = d_H(1_m/v, 1_m/v′).
This shows that
d_H(u^{(ℓ+1)}, u*) = d_H(a/(Kv^{(ℓ)}), a/(Kv*)) = d_H(Kv^{(ℓ)}, Kv*) ≤ λ(K) d_H(v^{(ℓ)}, v*),
where we used Theorem 4.1. This shows (4.22). One also has, using the triangular inequality,
d_H(u^{(ℓ)}, u*) ≤ d_H(u^{(ℓ+1)}, u^{(ℓ)}) + d_H(u^{(ℓ+1)}, u*)
             ≤ d_H(a/(Kv^{(ℓ)}), u^{(ℓ)}) + λ(K)² d_H(u^{(ℓ)}, u*)
             = d_H(a, u^{(ℓ)} ⊙ (Kv^{(ℓ)})) + λ(K)² d_H(u^{(ℓ)}, u*),
which gives the first part of (4.23) since u^{(ℓ)} ⊙ (Kv^{(ℓ)}) = P^{(ℓ)}1_m (the second one being similar). The proof of (4.24) follows from [Franklin and Lorenz, 1989, Lem. 3].

The bound (4.23) shows that some error measures on the marginal constraints violation, for instance, ‖P^{(ℓ)}1_m − a‖_1 and ‖P^{(ℓ),⊤}1_n − b‖_1, are useful stopping criteria to monitor the convergence. Note that thanks to (4.21), these Hilbert metric rates on the scaling variables (u^{(ℓ)}, v^{(ℓ)}) give a linear rate on the dual variables (f^{(ℓ)}, g^{(ℓ)}) ≝ (ε log(u^{(ℓ)}), ε log(v^{(ℓ)})) for the variation norm ‖·‖_var.
Figure 4.5, bottom row, highlights this linear rate on the constraint violation and shows how this rate degrades as ε → 0. These results are proved in [Franklin and Lorenz, 1989] and are tightly connected to nonlinear Perron–Frobenius theory [Lemmens and Nussbaum, 2012]. Perron–Frobenius theory corresponds to the linearization of the iterations; see (4.25). This convergence analysis is extended by [Linial et al., 1998], who show that each iteration of Sinkhorn increases the permanent of the scaled coupling matrix.

Remark 4.15 (Local convergence). The global linear rate (4.24) is often quite pessimistic, typically in X = Y = R^d for cases where there exists a Monge map when ε = 0 (see Remark 2.7). The global rate is in contrast rather sharp for more difficult situations where the cost matrix C is close to being random, and in these cases, the rate scales exponentially badly with ε, 1 − λ(K) ∼ e^{−1/ε}. To obtain a finer asymptotic analysis of the convergence (e.g. if one is interested in a high-precision solution and performs a large number of iterations), one usually rather studies the local convergence rate. One can write a Sinkhorn update as iterations of a fixed-point map f^{(ℓ+1)} = Φ(f^{(ℓ)}), where
Φ ≝ Φ_2 ∘ Φ_1 with Φ_1(f) ≝ ε log(b) − ε log(K^⊤(e^{f/ε})) and Φ_2(g) ≝ ε log(a) − ε log(K(e^{g/ε})).

For optimal (f, g) solving (4.30), denoting P = diag(ef/ε )K diag(eg/ε ) the optimal
coupling solving (4.2), one has the following Jacobian:

∂Φ(f) = diag(a)−1 P diag(b)−1 PT . (4.25)

This Jacobian is a positive matrix with ∂Φ(f)1_n = 1_n, and thus by the Perron–Frobenius theorem, it has a single dominant eigenvector 1_n with associated eigenvalue 1. Since f is defined up to a constant, it is actually the second eigenvalue
1 − κ < 1 which governs the local linear rate, and this shows that for ` large
enough,
kf(`) − fk = O((1 − κ)` ).
Numerically, in “simple cases” (such as when there exists a smooth Monge map
when ε = 0), this rate scales like κ ∼ ε. We refer to [Knight, 2008] for more details
in the bistochastic (assignment) case.

4.3 Speeding Up Sinkhorn’s Iterations

The main computational bottleneck of Sinkhorn’s iterations is the vector-matrix multiplication against the kernels K and K^⊤, with complexity O(nm) if implemented naively. We now detail several important cases where the complexity can be improved significantly.
Remark 4.16 (Parallel and GPU friendly computation). The simplicity of Sinkhorn’s
algorithm yields an extremely efficient approach to compute simultaneously several
regularized Wasserstein distances between pairs of histograms. Let N be an integer,
a1 , . . . , aN be histograms in Σn , and b1 , . . . , bN be histograms in Σm . We seek to
compute all N approximate distances LεC (a1 , b1 ), . . . , LεC (aN , bN ). In that case, writing
A = [a1 , . . . , aN ] and B = [b1 , . . . , bN ] for the n × N and m × N matrices storing all
histograms, one can notice that all Sinkhorn iterations for all these N pairs can be
carried out in parallel, by setting, for instance,
U^{(ℓ+1)} ≝ A / (K V^{(ℓ)}) and V^{(ℓ+1)} ≝ B / (K^⊤ U^{(ℓ+1)}), (4.26)
initialized with V^{(0)} = 1_{m×N}. Here ·/· corresponds to the entrywise division of matrices. One can further check that upon convergence of V and U, the (row) vector of regularized distances simplifies to
1_n^⊤ (U ⊙ log U ⊙ ((K ⊙ C)V) + U ⊙ ((K ⊙ C)(V ⊙ log V))) ∈ R^N.
Note that the basic Sinkhorn iterations described in Equation (4.15) are intrinsically
GPU friendly, since they only consist in matrix-vector products, and this was exploited,
for instance, to solve matching problems in Slomp et al. [2011]. However, the matrix-
matrix operations presented in Equation (4.26) present even better opportunities for
parallelism, which explains the success of Sinkhorn’s algorithm to compute OT distances
between histograms at large scale.
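A minimal sketch of the batched updates (4.26) in NumPy follows; A and B store the N histograms columnwise, all pairs share the same cost matrix C, and the returned vector is the plain transport cost ⟨P, C⟩ of each pair (a simpler quantity than the regularized-distance formula above). Names and the fixed iteration count are illustrative.

```python
import numpy as np

def sinkhorn_batch(A, B, C, eps, n_iter=500):
    """Sketch of the parallel Sinkhorn updates (4.26) for N problems at once."""
    K = np.exp(-C / eps)
    V = np.ones_like(B)                    # V^(0) = 1_{m x N}
    for _ in range(n_iter):
        U = A / (K @ V)                    # n x N
        V = B / (K.T @ U)                  # m x N
    KC = K * C                             # K ⊙ C
    costs = np.sum(U * (KC @ V), axis=0)   # <P, C> for each column pair
    return U, V, costs
```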

Remark 4.17 (Speed-up for separable kernels). We consider in this section an important
particular case for which the complexity of each Sinkhorn iteration can be significantly
reduced. That particular case happens when each index i and j considered in the cost-
matrix can be described as a d-tuple taken in the Cartesian product of d finite sets ⟦n_1⟧, . . . , ⟦n_d⟧,
i = (i_k)_{k=1}^d, j = (j_k)_{k=1}^d ∈ ⟦n_1⟧ × · · · × ⟦n_d⟧.
In that setting, if the cost C_{ij} between indices i and j is additive along these sub-indices, namely if there exist d matrices C^1, . . . , C^d, of respective sizes n_1 × n_1, . . . , n_d × n_d, such that
C_{ij} = Σ_{k=1}^d C^k_{i_k,j_k},
then one obtains as a direct consequence that the kernel appearing in the Sinkhorn iterations has a separable multiplicative structure,
K_{i,j} = Π_{k=1}^d K^k_{i_k,j_k}. (4.27)

Such a separable multiplicative structure allows for a very fast (exact) evaluation of Ku. Indeed, instead of instantiating K as a matrix of size n × n, which would have a prohibitive size since n = Π_k n_k is usually exponential in the dimension d, one can instead recover Ku by simply applying K^k along each “slice” of u. If n = m, the complexity reduces to O(n^{1+1/d}) in place of O(n²).
An important example of this speed-up arises when X = Y = [0, 1]^d; the ground cost is the q-th power of the q-norm,
c(x, y) = ‖x − y‖_q^q = Σ_{i=1}^d |x_i − y_i|^q, q > 0;
and the space is discretized using a regular grid in which only points x_i = (i_1/n_1, . . . , i_d/n_d) for i = (i_1, . . . , i_d) ∈ ⟦n_1⟧ × · · · × ⟦n_d⟧ are considered. In that case a multiplication by K can be carried out more efficiently by applying each 1-D n_k × n_k convolution matrix
K^k = [ exp(−|(r − s)/n_k|^q / ε) ]_{1≤r,s≤n_k}
to u reshaped as a tensor whose first dimension has been permuted to match the k-th set of indices. For instance, if d = 2 (planar case) and q = 2 (2-Wasserstein, resulting in Gaussian convolutions), histograms a, and as a consequence Sinkhorn multipliers u, can be instantiated as n_1 × n_2 matrices. We write U to underline the fact that the multiplier u is reshaped as an n_1 × n_2 matrix, rather than a vector of length n_1 n_2. Then, computing Ku, which would require (n_1 n_2)² operations with a naive implementation, can be obtained by applying two 1-D convolutions separately, as
(K^2 (K^1 U)^⊤)^⊤ = K^1 U K^2,
to recover an n_1 × n_2 matrix in n_1² n_2 + n_1 n_2² operations instead of n_1² n_2² operations. Note that this example agrees with the exponent (1 + 1/d) given above. With larger d, one needs to apply these very same 1-D convolutions to each slice of u (reshaped as a tensor of suitable size), an operation which is extremely efficient on GPUs.
This important observation underlies many of the practical successes found when applying optimal transport to shape data in 2-D and 3-D, as highlighted in [Solomon et al., 2015, Bonneel et al., 2016], in which distributions supported on grids of sizes as large as 200³ = 8 × 10⁶ are handled.
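The planar (d = 2, q = 2) case described above can be sketched in a few lines; the grid sizes and the Gaussian 1-D kernels below are illustrative assumptions.

```python
import numpy as np

def apply_separable_kernel(U, K1, K2):
    """Sketch of the separable evaluation for d = 2: applying the full (n1 n2) x (n1 n2)
    kernel to u reshaped as an n1 x n2 array U reduces to K1 U K2 (K2 symmetric here)."""
    return K1 @ U @ K2

# Gaussian 1-D convolution matrices on regular grids of [0, 1]
n1, n2, eps = 64, 48, 1e-2
t1, t2 = np.linspace(0, 1, n1), np.linspace(0, 1, n2)
K1 = np.exp(-(t1[:, None] - t1[None, :])**2 / eps)
K2 = np.exp(-(t2[:, None] - t2[None, :])**2 / eps)
U = np.random.rand(n1, n2)
KU = apply_separable_kernel(U, K1, K2)   # O(n1^2 n2 + n1 n2^2) instead of O((n1 n2)^2)
```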

Remark 4.18 (Approximated convolutions). The main computational bottleneck of Sinkhorn’s iterations (4.15) lies in the multiplication of a vector by K or by its adjoint. Besides using separability (4.27), it is also possible to exploit other special structures in the kernel. The simplest case is for translation invariant kernels K_{i,j} = k_{i−j}, which is typically the case when discretizing the measure on a fixed uniform grid in Euclidean space X = R^d. Then Kv = k ⋆ v is a convolution, and there are several algorithms to approximate the convolution in nearly linear time. The most usual one is by Fourier transform F, assuming for simplicity periodic boundary conditions, because F(k ⋆ v) = F(k) ⊙ F(v). This leads, however, to unstable computations and is often unacceptable for small ε. Another popular way to speed up computation is by approximating the convolution using a succession of autoregressive filters, using, for instance, the Deriche filtering method [Deriche, 1993]. We refer to [Getreuer, 2013] for a comparison of various fast filtering methods.

Remark 4.19 (Geodesic in heat approximation). For nonplanar domains, the kernel K is not a convolution, but in the case where the cost is C_{i,j} = d_M(x_i, y_j)^p, where d_M is a geodesic distance on a surface M (or a more general manifold), it is also possible to perform fast approximations of the application of K = e^{−d_M/ε} to a vector. Indeed, Varadhan’s formulas [1967] assert that this kernel is close to the Laplacian kernel (for p = 1) and the heat kernel (for p = 2). The first formula of Varadhan states
−(t/2) log(P_t(x, y)) = d_M(x, y) + o(t) where P_t ≝ (Id − tΔ_M)^{−1}, (4.28)

where Δ_M is the Laplace–Beltrami operator associated to the manifold M (which is negative semidefinite), so that P_t is an integral kernel and g = ∫_M P_t(x, y) f(y) dy is the solution of g − tΔ_M g = f. The second formula of Varadhan states
√(−4t log(H_t(x, y))) = d_M(x, y) + o(t), (4.29)
where H_t is the integral kernel defined so that g_t = ∫_M H_t(x, y) f(y) dy is the solution at time t of the heat equation
∂g_t(x)/∂t = (Δ_M g_t)(x).

The convergence in these formulas (4.28) and (4.29) is uniform on compact manifolds.
Numerically, the domain M is discretized (for instance, using finite elements) and
∆M is approximated by a discrete Laplacian matrix L. A typical example is when
using piecewise linear finite elements, so that L is the celebrated cotangent Laplacian
(see [Botsch et al., 2010] for a detailed account for this construction). These formulas
can be used to approximate efficiently the multiplication by the Gibbs kernel K_{i,j} = e^{−d(x_i,y_j)^p/ε}. Equation (4.28) suggests, for the case p = 1, to use ε = 2t and to replace the multiplication by K by the multiplication by (Id − tL)^{−1}, which necessitates the resolution of a positive symmetric linear system. Equation (4.29), coupled with R steps of implicit Euler for the stable resolution of the heat flow, suggests for p = 2 to trade the multiplication by K for the multiplication by (Id − (t/R)L)^{−R} with 4t = ε, which in turn
necessitates R resolutions of linear systems. Fortunately, since these linear systems
are supposed to be solved at each Sinkhorn iteration, one can solve them efficiently
by precomputing a sparse Cholesky factorization. By performing a reordering of the
rows and columns of the matrix [George and Liu, 1989], one obtains a nearly linear
sparsity for 2-D manifolds and thus each Sinkhorn iteration has linear complexity (the
performance degrades with the dimension of the manifold). The use of Varadhan’s
formula to approximate geodesic distances was initially proposed in [Crane et al., 2013]
and its use in conjunction with Sinkhorn iterations in [Solomon et al., 2015].
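A dense toy sketch of the p = 2 substitution follows: the multiplication Ku is replaced by R implicit Euler steps of the heat equation, i.e. R solves with the matrix Id − (t/R)L for 4t = ε. The discrete Laplacian L is assumed given (and dense here for simplicity); in practice it is sparse and one precomputes a sparse Cholesky factorization, as discussed above.

```python
import numpy as np

def kernel_apply_heat(u, L, eps, R=10):
    """Sketch of the heat-kernel approximation of Ku (case p = 2, 4t = eps):
    R implicit Euler steps, each a solve with A = Id - (t/R) L."""
    n = L.shape[0]
    t = eps / 4.0
    A = np.eye(n) - (t / R) * L     # SPD since L is negative semidefinite
    z = u.copy()
    for _ in range(R):
        z = np.linalg.solve(A, z)   # in practice: reuse a precomputed factorization
    return z
```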

Remark 4.20 (Extrapolation acceleration). Since the Sinkhorn algorithm is a fixed-point algorithm (as shown in Remark 4.15), one can use standard linear or even nonlinear extrapolation schemes to enhance the conditioning of the fixed-point mapping near the solution, and improve the linear convergence rate. This is similar to the successive overrelaxation method (see, for instance, [Hadjidimos, 2000]), so that the local linear rate of convergence is improved from O((1 − κ)^ℓ) to O((1 − √κ)^ℓ) for some κ > 0 (see Remark 4.15). We refer to [Peyré et al., 2017] for more details.
4.4 Stability and Log-Domain Computations

As briefly mentioned in Remark 4.7, the Sinkhorn algorithm suffers from numerical
overflow when the regularization parameter ε is small compared to the entries of the cost
matrix C. This concern can be alleviated to some extent by carrying out computations
in the log domain. The relevance of this approach is made more clear by considering
the dual problem associated to (4.2), in which these log-domain computations arise
naturally.

Proposition 4.4. One has
L^ε_C(a, b) = max_{f∈R^n, g∈R^m} ⟨f, a⟩ + ⟨g, b⟩ − ε ⟨e^{f/ε}, K e^{g/ε}⟩. (4.30)
The optimal (f, g) are linked to the scalings (u, v) appearing in (4.12) through
(u, v) = (e^{f/ε}, e^{g/ε}). (4.31)

Proof. We start from the end of the proof of Proposition 4.3, which links the optimal
primal solution P and dual multipliers f and g for the marginal constraints as

P_{i,j} = e^{f_i/ε} e^{−C_{i,j}/ε} e^{g_j/ε}.
Substituting in the Lagrangian E(P, f, g) of Equation (4.2) the optimal P as a function of f and g, we obtain that the Lagrange dual function equals
f, g ⟼ ⟨e^{f/ε}, (K ⊙ C) e^{g/ε}⟩ − ε H(diag(e^{f/ε}) K diag(e^{g/ε})). (4.32)
The neg-entropy of P scaled by ε, namely ε⟨P, log P − 1_{n×m}⟩, can be stated explicitly as a function of f, g, C,
⟨diag(e^{f/ε}) K diag(e^{g/ε}), f 1_m^⊤ + 1_n g^⊤ − C − ε 1_{n×m}⟩
= −⟨e^{f/ε}, (K ⊙ C) e^{g/ε}⟩ + ⟨f, a⟩ + ⟨g, b⟩ − ε⟨e^{f/ε}, K e^{g/ε}⟩;

therefore, the first term in (4.32) cancels out with the first term in the entropy above.
The remaining terms are those appearing in (4.30).

Remark 4.21 (Sinkhorn as a block coordinate ascent on the dual problem). A simple
approach to solving the unconstrained maximization problem (4.30) is to use an exact
block coordinate ascent strategy, namely to update alternatively f and g to cancel the
respective gradients in these variables of the objective of (4.30). Indeed, one can notice
after a few elementary computations that, writing Q(f, g) for the objective of (4.30),
∇_f Q(f, g) = a − e^{f/ε} ⊙ (K e^{g/ε}), (4.33)
∇_g Q(f, g) = b − e^{g/ε} ⊙ (K^⊤ e^{f/ε}). (4.34)
Block coordinate ascent can therefore be implemented in a closed form by applying successively the following updates, starting from any arbitrary g^{(0)}, for ℓ ≥ 0:
f^{(ℓ+1)} = ε log a − ε log(K e^{g^{(ℓ)}/ε}), (4.35)
g^{(ℓ+1)} = ε log b − ε log(K^⊤ e^{f^{(ℓ+1)}/ε}). (4.36)
Such iterations are mathematically equivalent to the Sinkhorn iterations (4.15) when considering the primal-dual relations highlighted in (4.31). Indeed, we recover that at any iteration
(f^{(ℓ)}, g^{(ℓ)}) = ε (log(u^{(ℓ)}), log(v^{(ℓ)})).
Remark 4.22 (Soft-min rewriting). Iterations (4.35) and (4.36) can be given an alternative interpretation, using the following notation. Given a vector z of real numbers we write min_ε z for the soft-minimum of its coordinates, namely
min_ε z = −ε log Σ_i e^{−z_i/ε}. (4.37)

Note that minε (z) converges to min z for any vector z as ε → 0. Indeed, minε can be
interpreted as a differentiable approximation of the min function, as shown in Figure 4.9.

Figure 4.9: Display of the function min_ε(z) in 2-D, z ∈ R², for ε = 1, 0.5, 10^{−1}, 10^{−2}, 10^{−3}.

Using this notation, Equations (4.35) and (4.36) can be rewritten
(f^{(ℓ+1)})_i = min_ε (C_{ij} − g^{(ℓ)}_j)_j + ε log a_i, (4.38)
(g^{(ℓ+1)})_j = min_ε (C_{ij} − f^{(ℓ)}_i)_i + ε log b_j. (4.39)
Here the term min_ε (C_{ij} − g^{(ℓ)}_j)_j denotes the soft-minimum of all values of the ith row of the matrix (C − 1_n (g^{(ℓ)})^⊤). To simplify notations, we introduce an operator that takes a matrix as input and outputs a column vector of the soft-minimum values of its columns or rows. Namely, for any matrix A ∈ R^{n×m}, we define
Min^row_ε(A) ≝ (min_ε (A_{i,j})_j)_i ∈ R^n,
Min^col_ε(A) ≝ (min_ε (A_{i,j})_i)_j ∈ R^m.
Note that these operations are equivalent to the entropic c-transform introduced in §5.3 (see in particular (5.11)). Using this notation, Sinkhorn’s iterates read
f^{(ℓ+1)} = Min^row_ε(C − 1_n (g^{(ℓ)})^⊤) + ε log a, (4.40)
g^{(ℓ+1)} = Min^col_ε(C − f^{(ℓ)} 1_m^⊤) + ε log b. (4.41)
Note that as ε → 0, minε converges to min, but the iterations do not converge anymore
in the limit ε = 0, because alternate minimization does not converge for constrained
problems, which is the case for the unregularized dual (2.20).
Remark 4.23 (Log-domain Sinkhorn). While mathematically equivalent to the Sinkhorn updates (4.15), iterations (4.38) and (4.39) suggest using the log-sum-exp stabilization trick to avoid underflow for small values of ε. Writing z̲ = min z, that trick suggests evaluating min_ε z as
min_ε z = z̲ − ε log Σ_i e^{−(z_i − z̲)/ε}. (4.42)
Instead of subtracting z̲ to stabilize the log-domain iterations as in (4.42), one can actually subtract the previously computed scalings. This leads to the stabilized iterations
f^{(ℓ+1)} = Min^row_ε(S(f^{(ℓ)}, g^{(ℓ)})) + f^{(ℓ)} + ε log(a), (4.43)
g^{(ℓ+1)} = Min^col_ε(S(f^{(ℓ+1)}, g^{(ℓ)})) + g^{(ℓ)} + ε log(b), (4.44)
where we defined
S(f, g) = (C_{i,j} − f_i − g_j)_{i,j}.
In contrast to the original iterations (4.15), these log-domain iterations (4.43) and (4.44)
are stable for arbitrary ε > 0, because the quantity S(f, g) stays bounded during the
iterations. The downside is that it requires nm computations of exp at each step. Computing a Min^row_ε or Min^col_ε is typically substantially slower than matrix multiplications and requires computing line-by-line soft-minima of matrices S. There is therefore no
efficient way to parallelize the application of Sinkhorn maps for several marginals simul-
taneously. In Euclidean domains of small dimension, it is possible to develop efficient
multiscale solvers with a decaying ε strategy to significantly speed up the computation
using sparse grids [Schmitzer, 2016b].
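A minimal sketch of the stabilized iterations (4.43)–(4.44), using a log-sum-exp to evaluate the soft-minima, is given below; the fixed iteration count is an illustrative choice.

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_log(a, b, C, eps, n_iter=1000):
    """Sketch of the log-domain iterations (4.43)-(4.44), stable for small eps."""
    f, g = np.zeros_like(a), np.zeros_like(b)
    log_a, log_b = np.log(a), np.log(b)
    for _ in range(n_iter):
        S = C - f[:, None] - g[None, :]                    # S(f, g)_{ij} = C_{ij} - f_i - g_j
        f = -eps * logsumexp(-S / eps, axis=1) + f + eps * log_a   # Min^row_eps(S) + f + eps log a
        S = C - f[:, None] - g[None, :]
        g = -eps * logsumexp(-S / eps, axis=0) + g + eps * log_b   # Min^col_eps(S) + g + eps log b
    P = np.exp(-(C - f[:, None] - g[None, :]) / eps)       # coupling from (4.12) and (4.31)
    return P, f, g
```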

Remark 4.24 (Dual for generic measures). For generic and not necessarily discrete
input measures (α, β), the dual problem (4.30) reads
sup_{(f,g)∈C(X)×C(Y)} ∫_X f dα + ∫_Y g dβ − ε ∫_{X×Y} e^{(−c(x,y)+f(x)+g(y))/ε} dα(x) dβ(y). (4.45)

This corresponds to a smoothing of the constraint R(c) appearing in the original problem (2.24), which is retrieved in the limit ε → 0. Proving existence (i.e. the sup is actually a max) of these Kantorovich potentials (f, g) in the case of entropic
transport is less easy than for classical OT, because one cannot use the c-transform
and potentials are not automatically Lipschitz. Proof of existence can be done using
the convergence of Sinkhorn iterations; see [Chizat et al., 2018b] for more details.

Remark 4.25 (Unconstrained entropic dual). As in Remark 2.23, in the case ∫_X dα = ∫_Y dβ = 1, one can consider an alternative dual formulation
sup_{(f,g)∈C(X)×C(Y)} ∫_X f dα + ∫_Y g dβ + min_ε(c − f ⊕ g), (4.46)
which achieves the same optimal value as (4.45). Similarly to (4.37), the soft-minimum (here on X × Y) is defined as
∀ S ∈ C(X × Y), min_ε S ≝ −ε log ∫_{X×Y} e^{−S(x,y)/ε} dα(x) dβ(y)

(note that it depends on (α, β)). As ε → 0, minε → min, as used in the unregular-
ized and unconstrained formulation (2.27). Note that while both (4.45) and (4.46)
are unconstrained problems, a chief advantage of (4.46) is that it is better condi-
tioned, in the sense that the Hessian of the functional is uniformly bounded by ε.
Another way to obtain such a conditioning improvement is to consider semidual
problems; see §5.3 and in particular Remark 5.1. A disadvantage of this alterna-
tive dual formulation is that the presence of a log prevents the use of stochastic
optimization methods as detailed in §5.4; see in particular Remark 5.3.

4.5 Regularized Approximations of the Optimal Transport Cost

The entropic dual (4.30) is a smooth unconstrained concave maximization problem, which approximates the original Kantorovich dual (2.20), as detailed in the following proposition.

Proposition 4.5. Any pair of optimal solutions (f? , g? ) to (4.30) are such that (f? , g? ) ∈
R(C), the set of feasible Kantorovich potentials defined in (2.21). As a consequence,
we have that for any ε,
hf? , ai + hg? , bi ≤ LC (a, b).

Proof. Primal-dual optimality conditions in (4.4), with the constraint that P is a probability and therefore P_{i,j} ≤ 1 for all i, j, yield that exp((f*_i + g*_j − C_{i,j})/ε) ≤ 1 and therefore that f*_i + g*_j ≤ C_{i,j}.

A chief advantage of the regularized transportation cost $L_C^\varepsilon$ defined in (4.2) is that it is smooth and convex, which makes it a perfect fit for use as a loss function in variational problems (see Chapter 9).

Proposition 4.6. $L_C^\varepsilon(a, b)$ is a jointly convex function of a and b for ε ≥ 0. When ε > 0, its gradient is equal to
$$ \nabla L_C^\varepsilon(a, b) = \begin{bmatrix} f^\star \\ g^\star \end{bmatrix}, $$
where $f^\star$ and $g^\star$ are the optimal solutions of Equation (4.30) chosen so that their coordinates sum to 0.

In [Cuturi, 2013], lower and upper bounds to approximate the Wasserstein distance
between two histograms were proposed. These bounds consist in evaluating the primal
and dual objectives at the solutions provided by the Sinkhorn algorithm.

Definition 4.1 (Sinkhorn divergences). Let $f^\star$ and $g^\star$ be optimal solutions to (4.30) and $P^\star$ be the solution to (4.2). The Wasserstein distance is approximated using the following primal and dual Sinkhorn divergences:
$$ P_C^\varepsilon(a, b) \stackrel{\mathrm{def.}}{=} \langle C, P^\star \rangle = \langle e^{f^\star/\varepsilon}, (K \odot C)\, e^{g^\star/\varepsilon} \rangle, $$
$$ D_C^\varepsilon(a, b) \stackrel{\mathrm{def.}}{=} \langle f^\star, a\rangle + \langle g^\star, b\rangle, $$
where ⊙ stands for the elementwise product of matrices.

Proposition 4.7. The following relationship holds:
$$ D_C^\varepsilon(a, b) \leq L_C^\varepsilon(a, b) \leq P_C^\varepsilon(a, b). $$
Furthermore,
$$ P_C^\varepsilon(a, b) - D_C^\varepsilon(a, b) = \varepsilon\big(H(P^\star) + 1\big). \qquad (4.47) $$

Proof. Equation (4.47) is obtained by writing that the primal and dual problems have the same values at the optima (see (4.30)), and hence
$$ L_C^\varepsilon(a, b) = P_C^\varepsilon(a, b) - \varepsilon H(P^\star) = D_C^\varepsilon(a, b) - \varepsilon \langle e^{f^\star/\varepsilon}, K e^{g^\star/\varepsilon} \rangle. $$
The final result can be obtained by remarking that $\langle e^{f^\star/\varepsilon}, K e^{g^\star/\varepsilon}\rangle = 1$, since the latter amounts to computing the sum of all entries of $P^\star$.
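
As an illustration, both Sinkhorn divergences of Definition 4.1 can be evaluated directly from a pair of dual potentials, for instance those returned by a log-domain solver such as the sketch given after (4.44); the helper below and its name are illustrative, not part of the original text.

```python
import numpy as np

def sinkhorn_divergences(C, a, b, f, g, eps):
    """Primal and dual Sinkhorn divergences P^eps_C and D^eps_C from potentials (f, g)."""
    P = np.exp(-(C - f[:, None] - g[None, :]) / eps)   # P = diag(e^{f/eps}) K diag(e^{g/eps})
    P_eps = np.sum(C * P)                              # primal value <C, P>
    D_eps = f @ a + g @ b                              # dual value <f, a> + <g, b>
    return P_eps, D_eps
```

At convergence the gap `P_eps - D_eps` equals ε(H(P⋆) + 1), as stated in (4.47), which provides a convenient numerical sanity check.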

The relationships given above suggest a practical way to bound the actual OT distance, but they are, in fact, valid only upon convergence of the Sinkhorn algorithm and therefore never truly useful in practice. Indeed, in practice Sinkhorn iterations are always terminated after a certain accuracy threshold is reached. When a predetermined number of L iterations is set and used to evaluate $D_C^\varepsilon$ using iterates $f^{(L)}$ and $g^{(L)}$ instead of optimal solutions $f^\star$ and $g^\star$, one recovers, however, a lower bound: using notation appearing in Equations (4.43) and (4.44), we thus introduce the following finite step approximation of $L_C^\varepsilon$:
$$ D_C^{(L)}(a, b) \stackrel{\mathrm{def.}}{=} \langle f^{(L)}, a\rangle + \langle g^{(L)}, b\rangle. \qquad (4.48) $$
This “algorithmic” Sinkhorn functional lower bounds the regularized cost function as soon as L ≥ 1.
Proposition 4.8 (Finite Sinkhorn divergences). The following relationship holds:
$$ D_C^{(L)}(a, b) \leq L_C^\varepsilon(a, b). $$
Proof. Similarly to the proof of Proposition 4.5, we exploit the fact that after even just one single Sinkhorn iteration, we have, following (4.35) and (4.36), that $f^{(L)}$ and $g^{(L)}$ are such that the matrix with elements $\exp(-(f^{(L)}_i + g^{(L)}_j - C_{i,j})/\varepsilon)$ has column sum b and its elements are therefore each upper bounded by 1, which results in the dual feasibility of $(f^{(L)}, g^{(L)})$.

Remark 4.26 (Primal infeasibility of the Sinkhorn iterates). Note that the primal iterates provided in (4.8) are not primal feasible, since, by definition, these iterates are designed to satisfy the marginal constraints only upon convergence. Therefore, it is not valid to consider $\langle C, P^{(2L+1)}\rangle$ as an approximation of $L_C(a, b)$, since $P^{(2L+1)}$ is not feasible. Using the rounding scheme of Altschuler et al. [2017] laid out in Remark 4.6, one can, however, obtain an upper bound on $L_C^\varepsilon(a, b)$ that can, in addition, be conveniently computed using matrix operations in parallel for several pairs of histograms, in the same fashion as Sinkhorn's algorithm [Lacombe et al., 2018].
Remark 4.27 (Nonconvexity of finite dual Sinkhorn divergence). Unlike the regularized expression $L_C^\varepsilon$ in (4.30), the finite Sinkhorn divergence $D_C^{(L)}(a, b)$ is not, in general, a convex function of its arguments (this can be easily checked numerically). $D_C^{(L)}(a, b)$ is, however, a differentiable function which can be differentiated using automatic differentiation techniques (see Remark 9.1.3) with respect to any of its arguments, notably C, a, or b.

4.6 Generalized Sinkhorn

The regularized OT problem (4.2) is a special case of a structured convex optimization problem of the form
$$ \min_{P} \; \sum_{i,j} C_{i,j} P_{i,j} - \varepsilon H(P) + F(P \mathbb{1}_m) + G(P^{\mathsf{T}} \mathbb{1}_n). \qquad (4.49) $$
Indeed, defining $F = \iota_{\{a\}}$ and $G = \iota_{\{b\}}$, where the indicator function of a closed convex set $\mathcal{C}$ is
$$ \iota_{\mathcal{C}}(x) = \begin{cases} 0 & \text{if } x \in \mathcal{C}, \\ +\infty & \text{otherwise,} \end{cases} \qquad (4.50) $$

one retrieves the hard marginal constraints defining U(a, b). The proof of Proposition 4.3 carries over to this more general problem (4.49), so that the unique solution of (4.49) also has the form (4.12).
As shown in [Peyré, 2015, Frogner et al., 2015, Chizat et al., 2018b, Karlsson and Ringh, 2016], Sinkhorn iterations (4.15) can hence be extended to this problem, and they read
$$ u \leftarrow \frac{\operatorname{Prox}^{\mathrm{KL}}_{F}(K v)}{K v} \quad \text{and} \quad v \leftarrow \frac{\operatorname{Prox}^{\mathrm{KL}}_{G}(K^{\mathsf{T}} u)}{K^{\mathsf{T}} u}, \qquad (4.51) $$
where the proximal operator for the KL divergence is
$$ \forall\, u \in \mathbb{R}^N_+, \quad \operatorname{Prox}^{\mathrm{KL}}_F(u) = \operatorname*{argmin}_{u' \in \mathbb{R}^N_+} \; \mathrm{KL}(u'|u) + F(u'). \qquad (4.52) $$

For some functions F, G it is possible to prove the linear rate of convergence for iter-
ations (4.51), and these schemes can be generalized to arbitrary measures; see [Chizat
et al., 2018b] for more details.
Iterations (4.51) are thus interesting in the cases where $\operatorname{Prox}^{\mathrm{KL}}_F$ and $\operatorname{Prox}^{\mathrm{KL}}_G$ can be computed in closed form or very efficiently. This is in particular the case for separable functions of the form $F(u) = \sum_i F_i(u_i)$, since in this case
$$ \operatorname{Prox}^{\mathrm{KL}}_F(u) = \big( \operatorname{Prox}^{\mathrm{KL}}_{F_i}(u_i) \big)_i. $$

Computing each $\operatorname{Prox}^{\mathrm{KL}}_{F_i}$ is usually simple since it is a scalar optimization problem. Note that, similarly to the initial Sinkhorn algorithm, it is also possible to stabilize the computation using log-domain computations [Chizat et al., 2018b].
This algorithm can be used to approximate the solution to various generalizations
of OT, and in particular unbalanced OT problems of the form (10.7) (see §10.2 and in
particular iterations (10.9)) and gradient flow problems of the form (9.26) (see §9.3).
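
The scaling iterations (4.51) can be implemented generically once the two KL proximal operators are available in closed form. The sketch below is illustrative (the function names are not from the original text): passing proximal operators that return the constants a and b, i.e. $F = \iota_{\{a\}}$ and $G = \iota_{\{b\}}$, recovers the standard Sinkhorn updates $u = a/(Kv)$, $v = b/(K^{\mathsf{T}}u)$; other choices of F, G (e.g. the soft marginal penalties of unbalanced OT, §10.2) may require a rescaling of F by ε, see [Chizat et al., 2018b].

```python
import numpy as np

def generalized_sinkhorn(C, eps, prox_F, prox_G, n_iter=500):
    """Generalized Sinkhorn iterations (4.51) for problem (4.49).

    prox_F, prox_G: callables implementing the KL proximal operators (4.52)."""
    K = np.exp(-C / eps)
    u, v = np.ones(C.shape[0]), np.ones(C.shape[1])
    for _ in range(n_iter):
        Kv = K @ v
        u = prox_F(Kv) / Kv
        Ku = K.T @ u
        v = prox_G(Ku) / Ku
    return u[:, None] * K * v[None, :]   # coupling P = diag(u) K diag(v)

# Hard marginal constraints recover the classical Sinkhorn algorithm:
# P = generalized_sinkhorn(C, eps, prox_F=lambda s: a, prox_G=lambda s: b)
```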

Remark 4.28 (Duality and Legendre transform). The dual problem to (4.49) reads
$$ \max_{f, g} \; -F^*(f) - G^*(g) - \varepsilon \sum_{i,j} e^{\frac{f_i + g_j - C_{i,j}}{\varepsilon}} \qquad (4.53) $$

so that $(u, v) = (e^{f/\varepsilon}, e^{g/\varepsilon})$ are the associated scalings appearing in (4.12). Here, $F^*$ and $G^*$ are the Fenchel–Legendre conjugates, which are convex functions defined as
$$ \forall\, f \in \mathbb{R}^n, \quad F^*(f) \stackrel{\mathrm{def.}}{=} \max_{a \in \mathbb{R}^n} \; \langle f, a \rangle - F(a). \qquad (4.54) $$
The generalized Sinkhorn iterates (4.51) are a special case of Dykstra's algorithm [Dykstra, 1983, 1985] (extended to Bregman divergences [Bauschke and Lewis, 2000, Censor and Reich, 1998]; see also Remark 8.1) and correspond to an alternate maximization scheme on the dual problem (4.53).

The formulation (4.49) can be further generalized to more than two functions and
more than a single coupling; we refer to [Chizat et al., 2018b] for more details. This
includes as a particular case the Sinkhorn algorithm (10.2) for the multimarginal prob-
lem, as detailed in §10.1. It is also possible to rewrite the regularized barycenter prob-
lem (9.15) this way, and the iterations (9.18) are in fact a special case of this generalized
Sinkhorn.
5
Semidiscrete Optimal Transport

This chapter studies methods to tackle the optimal transport problem when one of the
two input measures is discrete (a sum of Dirac masses) and the other one is arbitrary,
including notably the case where it has a density with respect to the Lebesgue measure.
When the ambient space has low dimension, this problem has a strong geometrical
flavor because one can show that the optimal transport from a continuous density
toward a discrete one is a piecewise constant map, where the preimage of each point
in the support of the discrete measure is a union of disjoint cells. When the cost is
the squared Euclidean distance, these cells correspond to an important concept from
computational geometry, the so-called Laguerre cells, which are Voronoi cells offset by
a constant. This connection allows us to borrow tools from computational geometry to
obtain fast computational schemes. In high dimensions, the semidiscrete formulation
can also be interpreted as a stochastic programming problem, which can also benefit
from a bit of regularization, extending therefore the scope of applications of the entropic
regularization scheme presented in Chapter 4. All these constructions rely heavily on
the notion of the c-transform, this time for general cost functions and not only matrices
as in §3.2. The c-transform is a generalization of the Legendre transform from convex
analysis and plays a pivotal role in the theory and algorithms for OT.

5.1 c-Transform and c̄-Transform

Recall that the dual OT problem (2.24) reads
$$ \sup_{(f,g)} \; \mathcal{E}(f, g) \stackrel{\mathrm{def.}}{=} \int_{\mathcal{X}} f(x)\,d\alpha(x) + \int_{\mathcal{Y}} g(y)\,d\beta(y) + \iota_{\mathcal{R}(c)}(f, g), $$

where we used the useful indicator function notation (4.50). Keeping either dual potential f or g fixed and optimizing w.r.t. g or f, respectively, leads to closed form solutions that provide the definition of the c-transform:
$$ \forall\, y \in \mathcal{Y}, \quad f^c(y) \stackrel{\mathrm{def.}}{=} \inf_{x \in \mathcal{X}} \; c(x, y) - f(x), \qquad (5.1) $$
$$ \forall\, x \in \mathcal{X}, \quad g^{\bar c}(x) \stackrel{\mathrm{def.}}{=} \inf_{y \in \mathcal{Y}} \; c(x, y) - g(y), \qquad (5.2) $$
where we denoted $\bar c(y, x) \stackrel{\mathrm{def.}}{=} c(x, y)$. Indeed, one can check that
$$ f^c \in \operatorname*{argmax}_{g} \; \mathcal{E}(f, g) \quad \text{and} \quad g^{\bar c} \in \operatorname*{argmax}_{f} \; \mathcal{E}(f, g). \qquad (5.3) $$

Note that these partial minimizations define maximizers on the support of respectively α and β, while the definitions (5.1) actually define functions on the whole spaces $\mathcal{X}$ and $\mathcal{Y}$. This is thus a way to extend in a canonical way solutions of (2.24) on the whole spaces. When $\mathcal{X} = \mathbb{R}^d$ and $c(x, y) = \|x - y\|_2^p = (\sum_{i=1}^d |x_i - y_i|^2)^{p/2}$, then the c-transform (5.1) $f^c$ is the so-called inf-convolution between $-f$ and $\|\cdot\|^p$. The definition of $f^c$ is also often referred to as a “Hopf–Lax formula.”
The map (f, g) ∈ C(X ) × C(Y) 7→ (g c̄ , f c ) ∈ C(X ) × C(Y) replaces dual potentials
by “better” ones (improving the dual objective E). Functions that can be written in
the form f c and g c̄ are called c-concave and c̄-concave functions. In the special case
c(x, y) = hx, yi in X = Y = Rd , this definition coincides with the usual notion of
concave functions. Extending naturally Proposition 3.1 to a continuous case, one has
the property that
f cc̄c = f c and g c̄cc̄ = g c̄ ,
where we denoted $f^{c\bar c} = (f^c)^{\bar c}$. This invariance property shows that one can “improve” the dual potential in this way only once. Alternatively, this means that alternate maximiza-
tion does not converge (it immediately enters a cycle), which is classical for functionals
involving a nonsmooth (a constraint) coupling of the optimized variables. This is in
sharp contrast with entropic regularization of OT as shown in Chapter 4. In this case,
because of the regularization, the dual objective (4.30) is smooth, and alternate maxi-
mization corresponds to Sinkhorn iterations (4.43) and (4.44). These iterates, written
over the dual variables, define entropically smoothed versions of the c-transform, where
min operations are replaced by a “soft-min.”
Using (5.3), one can reformulate (2.24) as an unconstrained convex program over a single potential,
$$ \mathcal{L}_c(\alpha, \beta) = \sup_{f \in \mathcal{C}(\mathcal{X})} \int_{\mathcal{X}} f(x)\,d\alpha(x) + \int_{\mathcal{Y}} f^c(y)\,d\beta(y) \qquad (5.4) $$
$$ \phantom{\mathcal{L}_c(\alpha, \beta)} = \sup_{g \in \mathcal{C}(\mathcal{Y})} \int_{\mathcal{X}} g^{\bar c}(x)\,d\alpha(x) + \int_{\mathcal{Y}} g(y)\,d\beta(y). \qquad (5.5) $$

Since one can iterate the map (f, g) 7→ (g c̄ , f c ), it is possible to add the constraint that
f is c̄-concave and g is c-concave, which is important to ensure enough regularity on
these potentials and show, for instance, existence of solutions to (2.24).

5.2 Semidiscrete Formulation


A case of particular interest is when $\beta = \sum_j b_j \delta_{y_j}$ is discrete (of course the same construction applies if α is discrete by exchanging the roles of α, β). One can adapt the definition of the c̄-transform (5.1) to this setting by restricting the minimization to the support $(y_j)_j$ of β,
$$ \forall\, g \in \mathbb{R}^m, \; \forall\, x \in \mathcal{X}, \quad g_{\bar c}(x) \stackrel{\mathrm{def.}}{=} \min_{j \in \llbracket m\rrbracket} \; c(x, y_j) - g_j. \qquad (5.6) $$

This transform maps a vector g to a continuous function gc̄ ∈ C(X ). Note that this
definition coincides with (5.1) when imposing that the space X is equal to the support
of β. Figure 5.1 shows some examples of such discrete c̄-transforms in one and two
dimensions.
Crucially, using the discrete c̄-transform in the semidiscrete problem (5.4) yields a finite-dimensional optimization,
$$ \mathcal{L}_c(\alpha, \beta) = \max_{g \in \mathbb{R}^m} \; \mathcal{E}(g) \stackrel{\mathrm{def.}}{=} \int_{\mathcal{X}} g_{\bar c}(x)\,d\alpha(x) + \sum_j g_j b_j. \qquad (5.7) $$

The Laguerre cells associated to the dual weights g,
$$ \mathbb{L}_j(g) \stackrel{\mathrm{def.}}{=} \big\{ x \in \mathcal{X} : \forall\, j' \neq j, \; c(x, y_j) - g_j \leq c(x, y_{j'}) - g_{j'} \big\}, $$
induce a disjoint decomposition of $\mathcal{X} = \bigcup_j \mathbb{L}_j(g)$. When g is constant, the Laguerre cell decomposition corresponds to the Voronoi diagram partition of the space. Figure 5.1, bottom row, shows examples of Laguerre cell segmentations in two dimensions.
This allows one to conveniently rewrite the minimized energy as
$$ \mathcal{E}(g) = \sum_{j=1}^{m} \int_{\mathbb{L}_j(g)} \big( c(x, y_j) - g_j \big)\,d\alpha(x) + \langle g, b \rangle. \qquad (5.8) $$

The gradient of this function can be computed as follows:
$$ \forall\, j \in \llbracket m\rrbracket, \quad \nabla \mathcal{E}(g)_j = -\int_{\mathbb{L}_j(g)} d\alpha(x) + b_j. $$

Figure 5.2 displays iterations of a gradient descent to minimize $\mathcal{E}$. Once the optimal g is computed, the optimal transport map T from α to β maps any $x \in \mathbb{L}_j(g)$ toward $y_j$, so it is piecewise constant.
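
The following Monte Carlo sketch illustrates (5.8) and its gradient when α is accessed through samples and $c(x, y) = \|x - y\|^2$; the helper name and the sampling-based approximation are assumptions made for illustration only.

```python
import numpy as np

def semidiscrete_energy_grad(x_samples, y, b, g):
    """Monte Carlo estimate of E(g) in (5.8) and of its gradient.

    x_samples: (N, d) points drawn from alpha; y: (m, d) support of beta; b: (m,) weights."""
    C = np.sum((x_samples[:, None, :] - y[None, :, :]) ** 2, axis=2)  # c(x_i, y_j)
    j_star = np.argmin(C - g[None, :], axis=1)                        # Laguerre cell of each sample
    N = x_samples.shape[0]
    E = np.mean(C[np.arange(N), j_star] - g[j_star]) + g @ b
    cell_mass = np.bincount(j_star, minlength=len(b)) / N             # empirical mass of each cell
    grad = b - cell_mass                                              # = -int_{L_j} dalpha + b_j
    return E, grad
```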
In the special case c(x, y) = kx − yk2 , the decomposition in Laguerre cells is also
known as a “power diagram.” The cells are polyhedral and can be computed efficiently

p = 1/2 p=1 p = 3/2 p=2

Figure 5.1: Top: examples of semidiscrete c̄-transforms $g_{\bar c}$ in one dimension, for ground cost $c(x, y) = |x - y|^p$ for varying p (see colorbar). The red points are at locations $(y_j, -g_j)_j$. Bottom: examples of semidiscrete c̄-transforms $g_{\bar c}$ in two dimensions, for ground cost $c(x, y) = \|x - y\|_2^p = (\sum_{i=1}^d |x_i - y_i|^2)^{p/2}$ for varying p. The red points are at locations $y_j \in \mathbb{R}^2$, and their size is proportional to $g_j$. The regions delimited by bold black curves are the Laguerre cells $(\mathbb{L}_j(g))_j$ associated to these points $(y_j)_j$.

using computational geometry algorithms; see [Aurenhammer, 1987]. The most widely used algorithm relies on the fact that the power diagram of points in $\mathbb{R}^d$ is equal to the projection on $\mathbb{R}^d$ of the convex hull of the set of points $((y_j, \|y_j\|^2 - g_j))_{j=1}^m \subset \mathbb{R}^{d+1}$.

There are numerous algorithms to compute convex hulls; for instance, that of Chan
[1996] in two and three dimensions has complexity O(m log(Q)), where Q is the number
of vertices of the convex hull.
The initial idea of a semidiscrete solver for Monge–Ampère equations was proposed
by Oliker and Prussner [1989], and its relation to the dual variational problem was
shown by Aurenhammer et al. [1998]. A theoretical analysis and its application to the
reflector problem in optics is detailed in [Caffarelli et al., 1999]. The semidiscrete for-
mulation was used in [Carlier et al., 2010] in conjunction with a continuation approach
based on Knothe's transport. The recent revival of these methods in various fields is due
to Mérigot [2011], who proposed a quasi-Newton solver and clarified the link with con-
cepts from computational geometry. We refer to [Lévy and Schwindt, 2018] for a recent

α and β `=1 ` = 20 ` = 40 ` = 100

Figure 5.2: Iterations of the semidiscrete OT algorithm minimizing (5.8) (here a simple gradient
descent is used). The support (yj )j of the discrete measure β is indicated by the colored points, while
the continuous measure α is the uniform measure on a square. The colored cells display the Laguerre
partition (Lj (g(`) ))j where g(`) is the discrete dual potential computed at iteration `.

overview. The use of a Newton solver which is applied to sampling in computer graphics
is proposed in [De Goes et al., 2012]; see also [Lévy, 2015] for applications to 3-D volume
and surface processing. An important area of application of the semidiscrete method
is the resolution of the incompressible fluid dynamics (Euler's equations) using Lagrangian methods [de Goes et al., 2015, Gallouët and Mérigot, 2017]. The semidiscrete OT solver enforces incompressibility at each iteration by imposing that the (possibly weighted) point cloud approximates a uniform distribution inside the domain. The
convergence (with linear rate) of damped Newton iterations is proved in [Mirebeau,
2015] for the Monge–Ampère equation and is refined in [Kitagawa et al., 2016] for op-
timal transport. Semidiscrete OT finds important applications to illumination design,
notably reflectors; see [Meyron et al., 2018].

5.3 Entropic Semidiscrete Formulation

The dual of the entropic regularized problem between arbitrary measures (4.9) is a smooth unconstrained optimization problem:
$$ \mathcal{L}_c^\varepsilon(\alpha, \beta) = \sup_{(f,g)\in\mathcal{C}(\mathcal{X})\times\mathcal{C}(\mathcal{Y})} \int_{\mathcal{X}} f\,d\alpha + \int_{\mathcal{Y}} g\,d\beta - \varepsilon \int_{\mathcal{X}\times\mathcal{Y}} e^{\frac{-c + f\oplus g}{\varepsilon}}\,d\alpha\,d\beta, \qquad (5.9) $$
where we denoted $(f \oplus g)(x, y) \stackrel{\mathrm{def.}}{=} f(x) + g(y)$.
Similarly to the unregularized problem (5.1), one can minimize explicitly with respect to either f or g in (5.9), which yields a smoothed c-transform
$$ \forall\, y \in \mathcal{Y}, \quad f^{c,\varepsilon}(y) \stackrel{\mathrm{def.}}{=} -\varepsilon \log\left( \int_{\mathcal{X}} e^{\frac{-c(x,y)+f(x)}{\varepsilon}}\,d\alpha(x) \right), $$
$$ \forall\, x \in \mathcal{X}, \quad g^{\bar c,\varepsilon}(x) \stackrel{\mathrm{def.}}{=} -\varepsilon \log\left( \int_{\mathcal{Y}} e^{\frac{-c(x,y)+g(y)}{\varepsilon}}\,d\beta(y) \right). $$
In the case of a discrete measure $\beta = \sum_{j=1}^m b_j \delta_{y_j}$, the problem simplifies as with (5.7) to a finite-dimensional problem expressed as a function of the discrete dual potential $g \in \mathbb{R}^m$,
$$ \forall\, x \in \mathcal{X}, \quad g_{\bar c,\varepsilon}(x) \stackrel{\mathrm{def.}}{=} -\varepsilon \log\left( \sum_{j=1}^m e^{\frac{-c(x,y_j)+g_j}{\varepsilon}}\, b_j \right). \qquad (5.10) $$

One defines similarly $f_{c,\varepsilon}$ in the case of a discrete measure α. Note that the rewriting (4.40) and (4.41) of Sinkhorn using the soft-min operator $\min_\varepsilon$ corresponds to the alternate computation of entropic smoothed c-transforms,
$$ f_i^{(\ell+1)} = g_{\bar c,\varepsilon}(x_i) \quad \text{and} \quad g_j^{(\ell+1)} = f_{c,\varepsilon}(y_j). \qquad (5.11) $$

Instead of maximizing (5.9), one can thus solve the following finite-dimensional optimization problem:
$$ \max_{g \in \mathbb{R}^m} \; \mathcal{E}^\varepsilon(g) \stackrel{\mathrm{def.}}{=} \int_{\mathcal{X}} g_{\bar c,\varepsilon}(x)\,d\alpha(x) + \langle g, b \rangle. \qquad (5.12) $$

Note that this optimization problem is still valid even in the unregularized case ε = 0, and in this case $g_{\bar c,\varepsilon=0} = g_{\bar c}$ is the c̄-transform defined in (5.6), so that (5.12) is in fact (5.8). The gradient of this functional reads
$$ \forall\, j \in \llbracket m\rrbracket, \quad \nabla \mathcal{E}^\varepsilon(g)_j = -\int_{\mathcal{X}} \chi_j^\varepsilon(x)\,d\alpha(x) + b_j, \qquad (5.13) $$
where $\chi_j^\varepsilon$ is a smoothed version of the indicator $\chi_j^0$ of the Laguerre cell $\mathbb{L}_j(g)$,
$$ \chi_j^\varepsilon(x) = \frac{e^{\frac{-c(x,y_j)+g_j}{\varepsilon}}}{\sum_\ell e^{\frac{-c(x,y_\ell)+g_\ell}{\varepsilon}}}. $$

Note once again that this formula (5.13) is still valid for ε = 0. Note also that the family of functions $(\chi_j^\varepsilon)_j$ is a partition of unity, i.e. $\sum_j \chi_j^\varepsilon = 1$ and $\chi_j^\varepsilon \geq 0$. Figure 5.3, bottom row, illustrates this.
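
A short sketch of the entropic c̄-transform (5.10) and of the smoothed indicators of (5.13), written for $c(x, y) = \|x - y\|^2$ and evaluated at a batch of points; the function name is an illustrative assumption.

```python
import numpy as np
from scipy.special import logsumexp, softmax

def smoothed_c_transform(x, y, g, b, eps):
    """g_{c,eps}(x) of (5.10) and smoothed Laguerre indicators chi^eps_j(x) of (5.13).

    x: (N, d) evaluation points; y: (m, d) support of beta; b: (m,) weights; g: (m,) potential."""
    C = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=2)   # (N, m) costs c(x_i, y_j)
    Z = (-C + g[None, :]) / eps
    g_ce = -eps * logsumexp(Z, b=b[None, :], axis=1)           # weighted soft-min of (5.10)
    chi = softmax(Z, axis=1)                                    # partition of unity of (5.13)
    return g_ce, chi
```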

Remark 5.1 (Second order methods and connection with logistic regression). A crucial
aspect of the smoothed semidiscrete formulation (5.12) is that it corresponds to the
minimization of a smooth function. Indeed, as shown in [Genevay et al., 2016], the
Hessian of $\mathcal{E}^\varepsilon$ is upper bounded by 1/ε, so that $\nabla\mathcal{E}^\varepsilon$ is $\frac{1}{\varepsilon}$-Lipschitz continuous. In
fact, that problem is very closely related to a multiclass logistic regression problem
(see Figure 5.3 for a display of the resulting fuzzy classification boundary) and enjoys
the same favorable properties (see [Hosmer Jr et al., 2013]), which are generaliza-
tions of self-concordance; see [Bach, 2010]. In particular, the Newton method converges
quadratically, and one can use in practice quasi-Newton techniques, such as L-BFGS,
as advocated in [Cuturi and Peyré, 2016]. Note that [Cuturi and Peyré, 2016] stud-
ies the more general barycenter problem detailed in §9.2, but it is equivalent to this
semidiscrete setting when considering only a pair of input measures. The use of second
5.3. Entropic Semidiscrete Formulation 91

order schemes (Newton or L-BFGS) is also advocated in the unregularized case ε = 0


by [Mérigot, 2011, De Goes et al., 2012, Lévy, 2015]. In [Kitagawa et al., 2016, Theo.
5.1], the Hessian of E 0 (g) is shown to be uniformly bounded as long as the volume
of the Laguerre cells is bounded by below and α has a continuous density. Kitagawa
et al. proceed by showing the linear convergence of a damped Newton algorithm with
a backtracking to ensure that the Laguerre cells never vanish between two iterations.
This result justifies the use of second order methods even in the unregularized case.
The intuition is that, while the conditioning of the entropic regularized problem scales
like 1/ε, when ε = 0, this conditioning is rather driven by m, the number of samples of
the discrete distribution (which controls the size of the Laguerre cells). Other methods
exploiting second order schemes were also recently studied by [Knight and Ruiz, 2013,
Sugiyama et al., 2017, Cohen et al., 2017, Allen-Zhu et al., 2017].

ε=0 ε = 0.01 ε = 0.1 ε = 0.3

Figure 5.3: Top: examples of entropic semidiscrete c̄-transforms gc̄,ε in one dimension, for ground
cost c(x, y) = |x − y| for varying ε (see colorbar). The red points are at locations (yj , −gj )j . Bottom:
examples of entropic semidiscrete c̄-transforms gc̄,ε in two dimensions, for ground cost c(x, y) = kx − yk2
for varying ε. The black curves are the level sets of the function gc̄,ε , while the colors indicate the
smoothed indicator function of the Laguerre cells χεj . The red points are at locations yj ∈ R2 , and their
size is proportional to gj .

Remark 5.2 (Legendre transforms of OT cost functions). As stated in Proposition 4.6, $L_C^\varepsilon(a, b)$ is a convex function of (a, b) (which is also true in the unregularized case ε = 0). It is thus possible to compute its Legendre–Fenchel transform, which is defined in (4.54). Denoting $F_a(b) = L_C^\varepsilon(a, b)$, one has, for a fixed a, following Cuturi and Peyré [2016]:
$$ F_a^*(g) = -\varepsilon H(a) + \sum_i a_i\, g_{\bar c,\varepsilon}(x_i). $$

Here $g_{\bar c,\varepsilon}$ is the entropic-smoothed c-transform introduced in (5.10). In the unregularized case ε = 0, and for generic measures, Carlier et al. [2015] show, denoting $F_\alpha(\beta) \stackrel{\mathrm{def.}}{=} \mathcal{L}_c(\alpha, \beta)$,
$$ \forall\, g \in \mathcal{C}(\mathcal{Y}), \quad F_\alpha^*(g) = \int_{\mathcal{X}} g^{\bar c}(x)\,d\alpha(x), $$
where the c̄-transform $g^{\bar c} \in \mathcal{C}(\mathcal{X})$ of g is defined in §5.1. Note that here, since $\mathcal{M}(\mathcal{X})$ is in duality with $\mathcal{C}(\mathcal{X})$, the Legendre transform is a function of continuous functions. Denoting now $G(a, b) \stackrel{\mathrm{def.}}{=} L_C^\varepsilon(a, b)$, one can derive, as in [Cuturi and Peyré, 2016, 2018], the Legendre transform for both arguments,
$$ \forall\, (f, g) \in \mathbb{R}^n \times \mathbb{R}^m, \quad G^*(f, g) = -\varepsilon \log \sum_{i,j} e^{\frac{-C_{i,j}+f_i+g_j}{\varepsilon}}, $$
which can be seen as a smoothed version of the Legendre transform of $\mathcal{G}(\alpha, \beta) \stackrel{\mathrm{def.}}{=} \mathcal{L}_c(\alpha, \beta)$,
$$ \forall\, (f, g) \in \mathcal{C}(\mathcal{X}) \times \mathcal{C}(\mathcal{Y}), \quad \mathcal{G}^*(f, g) = \inf_{(x,y)\in\mathcal{X}\times\mathcal{Y}} \; c(x, y) - f(x) - g(y). $$

5.4 Stochastic Optimization Methods

The semidiscrete formulation (5.8) and its smoothed version (5.12) are appealing be-
cause the energies to be minimized are written as an expectation with respect to the
probability distribution α,
Z
ε
E (g) = E ε (g, x)dα(x) = EX (E ε (g, X))
X

where E ε (g, x) = gc̄,ε (x) − hg, bi,


def.

and X denotes a random vector distributed on X according to α. Note that the gradient
of each of the involved functional reads

∇g E ε (x, g) = (χεj (x) − bj )m m


j=1 ∈ R .

One can thus use stochastic optimization methods to perform the maximization, as pro-
posed in Genevay et al. [2016]. This allows us to obtain provably convergent algorithms

without the need to resort to an arbitrary discretization of α (either approximating α


using sums of Diracs or using quadrature formula for the integrals). The measure α is
used as a black box from which one can draw independent samples, which is a natural
computational setup for many high-dimensional applications in statistics and machine
learning. This class of methods has been generalized to the computation of Wasserstein
barycenters (as described in §9.2) in [Staib et al., 2017b].

Stochastic gradient descent. Initializing $g^{(0)} = \mathbb{0}_m$, the stochastic gradient descent algorithm (SGD; used here as a maximization method) draws at step ℓ a point $x_\ell \in \mathcal{X}$ according to distribution α (independently from all past and future samples $(x_\ell)_\ell$) to form the update
$$ g^{(\ell+1)} \stackrel{\mathrm{def.}}{=} g^{(\ell)} + \tau_\ell \nabla_g E^\varepsilon(g^{(\ell)}, x_\ell). \qquad (5.14) $$
The step size $\tau_\ell$ should decay fast enough to zero in order to ensure that the “noise” created by using $\nabla_g E^\varepsilon(x_\ell, g)$ as a proxy for the true gradient $\nabla \mathcal{E}^\varepsilon(g)$ is canceled in the limit. A typical choice of schedule is
$$ \tau_\ell \stackrel{\mathrm{def.}}{=} \frac{\tau_0}{1 + \ell/\ell_0}, \qquad (5.15) $$
where $\ell_0$ indicates roughly the number of iterations serving as a warmup phase. One can prove the convergence result
$$ \mathcal{E}^\varepsilon(g^\star) - \mathbb{E}\big(\mathcal{E}^\varepsilon(g^{(\ell)})\big) = O\!\left(\frac{1}{\sqrt{\ell}}\right), $$
where $g^\star$ is a solution of (5.12) and where $\mathbb{E}$ indicates an expectation with respect to the i.i.d. sampling of $(x_\ell)_\ell$ performed at each iteration. Figure 5.4 shows the evolution of the algorithm on a simple 2-D example, where α is the uniform distribution on $[0, 1]^2$.
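
A minimal sketch of the SGD iterations (5.14) with the schedule (5.15) for the entropic semidiscrete problem (ε > 0), assuming black-box sampling access to α and the squared Euclidean cost; the sampler callback and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def sgd_semidiscrete(sample_alpha, y, b, eps, n_iter=20000, tau0=1.0, l0=100):
    """SGD (5.14) on the semidiscrete dual (5.12), with the step-size schedule (5.15).

    sample_alpha(): returns one point x in R^d drawn from alpha (black-box access)."""
    g = np.zeros(len(b))
    for l in range(n_iter):
        x = sample_alpha()
        c = np.sum((x - y) ** 2, axis=1)              # costs c(x, y_j)
        z = (-c + g) / eps
        chi = np.exp(z - z.max()); chi /= chi.sum()   # smoothed Laguerre indicator at x
        tau = tau0 / (1 + l / l0)                     # schedule (5.15)
        g = g + tau * (b - chi)                       # stochastic ascent step (5.14)
    return g
```

For ε = 0 the soft assignment `chi` would be replaced by the hard indicator of the Laguerre cell containing x (an argmin), as in the experiment of Figure 5.4; the averaging strategy described next can be added on top of these iterates.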

Stochastic gradient descent with averaging. SGD is slow because of the fast decay of
the stepsize τ` toward zero. To improve the convergence speed, it is possible to average
the past iterates, which is equivalent to running a “classical” SGD on auxiliary variables
$(\tilde g^{(\ell)})_\ell$:
$$ \tilde g^{(\ell+1)} \stackrel{\mathrm{def.}}{=} \tilde g^{(\ell)} + \tau_\ell \nabla_g E^\varepsilon(\tilde g^{(\ell)}, x_\ell), $$
where $x_\ell$ is drawn according to α (and all the $(x_\ell)_\ell$ are independent) and output as estimated weight vector the average
$$ g^{(\ell)} \stackrel{\mathrm{def.}}{=} \frac{1}{\ell} \sum_{k=1}^{\ell} \tilde g^{(k)}. $$

This defines the stochastic gradient descent with averaging (SGA) algorithm. Note that it is possible to avoid explicitly storing all the iterates by simply updating a running average as follows:
$$ g^{(\ell+1)} = \frac{1}{\ell+1}\,\tilde g^{(\ell+1)} + \frac{\ell}{\ell+1}\, g^{(\ell)}. $$

Figure 5.4: Evolution of the energy $\mathcal{E}^\varepsilon(g^{(\ell)})$, for ε = 0 (no regularization) during the SGD iterations (5.14). Each colored curve shows a different randomized run. The images display the evolution of the Laguerre cells $(\mathbb{L}_j(g^{(\ell)}))_j$ through the iterations.

In this case, a typical choice of decay is rather of the form
$$ \tau_\ell \stackrel{\mathrm{def.}}{=} \frac{\tau_0}{\sqrt{1 + \ell/\ell_0}}. $$

Notice that the step size now goes to 0 much more slowly than for (5.15), at rate $\ell^{-1/2}$.
Bach [2014] proves that SGA leads to a faster convergence (the constants involved are
smaller) than SGD, since in contrast to SGD, SGA is adaptive to the local strong
convexity (or concavity for maximization problems) of the functional.

Remark 5.3 (Continuous-continuous problems). When neither α nor β is a discrete mea-


sure, one cannot resort to semidiscrete strategies involving finite-dimensional dual vari-
ables, such as that given in Problem (5.7). The only option is to use stochastic opti-
mization methods on the dual problem (4.45), as proposed in [Genevay et al., 2016].
A suitable regularization of that problem is crucial, for instance by setting an entropic
regularization strength ε > 0, to obtain an unconstrained problem that can be solved by
stochastic descent schemes. A possible approach to revisit Problem (4.45) is to restrict
that infinite-dimensional optimization problem over a space of continuous functions to a

much smaller subset, such as that spanned by multilayer neural networks [Seguy et al.,
2018]. This approach leads to nonconvex finite-dimensional optimization problems with
no approximation guarantees, but this can provide an effective way to compute a proxy
for the Wasserstein distance in high-dimensional scenarios. Another solution is to use
nonparametric families, which is equivalent to considering some sort of progressive re-
finement, as that proposed by Genevay et al. [2016] using reproducing kernel Hilbert
spaces, whose dimension is proportional to the number of iterations of the SGD algo-
rithm.
6
W 1 Optimal Transport

This chapter focuses on optimal transport problems in which the ground cost is equal
to a distance. Historically, this corresponds to the original problem posed by Monge
in 1781; this setting was also that chosen in early applications of optimal transport in
computer vision [Rubner et al., 2000] under the name of “earth mover’s distances”.
Unlike the case where the ground cost is a squared Hilbertian distance (studied
in particular in Chapter 7), transport problems where the cost is a metric are more
difficult to analyze theoretically. In contrast to Remark 2.24 that states the uniqueness
of a transport map or coupling between two absolutely continuous measures when
using a squared metric, the optimal Kantorovich coupling is in general not unique
when the cost is the ground distance itself. Hence, in this regime it is often impossible
to recover a uniquely defined Monge map, making this class of problems ill-suited for
interpolation of measures. We refer to works by Trudinger and Wang [2001], Caffarelli
et al. [2002], Sudakov [1979], Evans and Gangbo [1999] for proofs of existence of optimal
W 1 transportation plans and detailed analyses of their geometric structure.
Although more difficult to analyze in theory, optimal transport with a linear ground
distance is usually more robust to outliers and noise than a quadratic cost. Further-
more, a cost that is a metric results in an elegant dual reformulation involving local
flow, divergence constraints, or Lipschitzness of the dual potential, suggesting cheaper
numerical algorithms that align with minimum-cost flow methods over networks in
graph theory. This setting is also popular because the associated OT distances define
a norm that can compare arbitrary distributions, even if they are not positive; this
property is shared by a larger class of so-called dual norms (see §8.2 and Remark 10.6
for more details).


6.1 W 1 on Metric Spaces

Here we assume that d is a distance on X = Y, and we solve the OT problem with the
ground cost c(x, y) = d(x, y). The following proposition highlights key properties of the
c-transform (5.1) in this setup. In the following, we denote the Lipschitz constant of a
function $f \in \mathcal{C}(\mathcal{X})$ as
$$ \mathrm{Lip}(f) \stackrel{\mathrm{def.}}{=} \sup\left\{ \frac{|f(x) - f(y)|}{d(x, y)} : (x, y) \in \mathcal{X}^2, \; x \neq y \right\}. $$
We define Lipschitz functions to be those functions f satisfying $\mathrm{Lip}(f) < +\infty$; they form a convex subset of $\mathcal{C}(\mathcal{X})$.

Proposition 6.1. Suppose $\mathcal{X} = \mathcal{Y}$ and $c(x, y) = d(x, y)$. Then, there exists g such that $f = g^c$ if and only if $\mathrm{Lip}(f) \leq 1$. Furthermore, if $\mathrm{Lip}(f) \leq 1$, then $f^c = -f$.
Proof. First, suppose $f = g^c$. Then, for $x, y \in \mathcal{X}$,
$$ |f(x) - f(y)| = \Big| \inf_{z\in\mathcal{X}}\, d(x, z) - g(z) \;-\; \inf_{z\in\mathcal{X}}\, d(y, z) - g(z) \Big| \leq \sup_{z\in\mathcal{X}} |d(x, z) - d(y, z)| \leq d(x, y). $$
The first equality follows from the definition of $g^c$, the next inequality from the identity $|\inf f - \inf g| \leq \sup|f - g|$, and the last from the triangle inequality. This shows that $\mathrm{Lip}(f) \leq 1$.

Now, suppose $\mathrm{Lip}(f) \leq 1$, and define $g \stackrel{\mathrm{def.}}{=} -f$. By the Lipschitz property, for all $x, y \in \mathcal{X}$, $f(y) - d(x, y) \leq f(x) \leq f(y) + d(x, y)$. Applying these inequalities,
$$ g^c(y) = \inf_{x\in\mathcal{X}} [d(x, y) + f(x)] \geq \inf_{x\in\mathcal{X}} [d(x, y) + f(y) - d(x, y)] = f(y), $$
$$ g^c(y) = \inf_{x\in\mathcal{X}} [d(x, y) + f(x)] \leq \inf_{x\in\mathcal{X}} [d(x, y) + f(y) + d(x, y)] = f(y). $$
Hence, $f = g^c$ with $g = -f$. Using the same inequalities shows
$$ f^c(y) = \inf_{x\in\mathcal{X}} [d(x, y) - f(x)] \geq \inf_{x\in\mathcal{X}} [d(x, y) - f(y) - d(x, y)] = -f(y), $$
$$ f^c(y) = \inf_{x\in\mathcal{X}} [d(x, y) - f(x)] \leq \inf_{x\in\mathcal{X}} [d(x, y) - f(y) + d(x, y)] = -f(y). $$
This shows $f^c = -f$.

Starting from the single potential formulation (5.4), one can iterate the construction
and replace the couple (g, g c ) by (g c , (g c )c ). The last proposition shows that one can
thus use (g c , −g c ), which in turn is equivalent to any pair (f, −f ) such that Lip(f ) ≤ 1.
This leads to the following alternative expression for the W 1 distance:
$$ W_1(\alpha, \beta) = \max_{f} \left\{ \int_{\mathcal{X}} f(x)\,\big(d\alpha(x) - d\beta(x)\big) : \mathrm{Lip}(f) \leq 1 \right\}. \qquad (6.1) $$

This expression shows that $W_1$ is actually a norm, i.e. $W_1(\alpha, \beta) = \|\alpha - \beta\|_{W_1}$, and that it is still valid for any measures (not necessarily positive) as long as $\int_{\mathcal{X}} \alpha = \int_{\mathcal{X}} \beta$. This norm is often called the Kantorovich and Rubinstein norm [1958].

For discrete measures of the form (2.1), writing $\alpha - \beta = \sum_k m_k \delta_{z_k}$ with $z_k \in \mathcal{X}$ and $\sum_k m_k = 0$, the optimization (6.1) can be rewritten as
$$ W_1(\alpha, \beta) = \max_{(f_k)_k} \left\{ \sum_k f_k m_k : \forall\, (k, \ell), \; |f_k - f_\ell| \leq d(z_k, z_\ell) \right\}, \qquad (6.2) $$

which is a finite-dimensional convex program with quadratic-cone constraints. It can


be solved using interior point methods or, as we detail next for a similar problem, using
proximal methods.
When using $d(x, y) = |x - y|$ with $\mathcal{X} = \mathbb{R}$, we can reduce the number of constraints by ordering the $z_k$'s via $z_1 \leq z_2 \leq \ldots$. In this case, we only have to solve
$$ W_1(\alpha, \beta) = \max_{(f_k)_k} \left\{ \sum_k f_k m_k : \forall\, k, \; |f_{k+1} - f_k| \leq z_{k+1} - z_k \right\}, $$

which is a linear program. Note that furthermore, in this 1-D case, a closed form
expression for W 1 using cumulative functions is given in (2.37).
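
For completeness, the 1-D closed form alluded to above amounts to integrating the absolute difference of the cumulative functions of α and β, so only a sort and a cumulative sum are needed; the sketch below is illustrative (the function name is an assumption).

```python
import numpy as np

def w1_on_line(z, m):
    """W1 between two discrete measures on R, given alpha - beta = sum_k m_k delta_{z_k}
    (the signed weights m must sum to 0)."""
    z, m = np.asarray(z, float), np.asarray(m, float)
    order = np.argsort(z)
    z, m = z[order], m[order]
    F = np.cumsum(m)[:-1]                 # cumulative function of alpha - beta between atoms
    return np.sum(np.abs(F) * np.diff(z))
```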

Remark 6.1 ($W_p$ with 0 < p ≤ 1). If 0 < p ≤ 1, then $\tilde d(x, y) \stackrel{\mathrm{def.}}{=} d(x, y)^p$ satisfies the triangular inequality, and hence $\tilde d$ is itself a distance. One can thus apply the results and algorithms detailed above for $W_1$ to compute $W_p$ by simply using $\tilde d$ in place of d. This is equivalent to stating that $W_p$ is the dual of p-Hölder functions $\{f : \mathrm{Lip}_p(f) \leq 1\}$, where
$$ \mathrm{Lip}_p(f) \stackrel{\mathrm{def.}}{=} \sup\left\{ \frac{|f(x) - f(y)|}{d(x, y)^p} : (x, y) \in \mathcal{X}^2, \; x \neq y \right\}. $$

6.2 W 1 on Euclidean Spaces

In the special case of Euclidean spaces $\mathcal{X} = \mathcal{Y} = \mathbb{R}^d$, using $c(x, y) = \|x - y\|$, the global Lipschitz constraint appearing in (6.1) can be made local as a uniform bound on the gradient of f,
$$ W_1(\alpha, \beta) = \max_{f} \left\{ \int_{\mathbb{R}^d} f(x)\,\big(d\alpha(x) - d\beta(x)\big) : \|\nabla f\|_\infty \leq 1 \right\}. \qquad (6.3) $$
Here the constraint $\|\nabla f\|_\infty \leq 1$ signifies that the norm of the gradient of f at any point x is upper bounded by 1, i.e. $\|\nabla f(x)\|_2 \leq 1$ for any x.

Considering the dual problem to (6.3), one obtains an optimization problem under a fixed divergence constraint,
$$ W_1(\alpha, \beta) = \min_{s} \left\{ \int_{\mathbb{R}^d} \|s(x)\|_2\,dx : \operatorname{div}(s) = \alpha - \beta \right\}, \qquad (6.4) $$

which is often called the Beckmann formulation [Beckmann, 1952]. Here the vectorial function $s(x) \in \mathbb{R}^d$ can be interpreted as a flow field, describing locally the movement of mass. Outside the support of the two input measures, $\operatorname{div}(s) = 0$, which is the
conservation of mass constraint. Once properly discretized using finite elements, Prob-
lems (6.3) and (6.4) become nonsmooth convex optimization problems. It is possible to
use an off-the-shelf interior points quadratic-cone optimization solver, but as advocated
in §7.3, large-scale problems require the use of simpler but more adapted first order
methods. One can thus use, for instance, Douglas–Rachford (DR) iterations (7.14) or
the related alternating direction method of multipliers method. Note that on a uniform
grid, projecting on the divergence constraint is conveniently handled using the fast
Fourier transform. We refer to Solomon et al. [2014a] for a detailed account for these
approaches and application to OT on triangulated meshes. See also Li et al. [2018a],
Ryu et al. [2017b,a] for similar approaches using primal-dual splitting schemes. Ap-
proximation schemes that relax the Lipschitz constraint on the dual potentials f have
also been proposed, using, for instance, a constraint on wavelet coefficients leading to
an explicit formula [Shirdhonkar and Jacobs, 2008], or by considering only functions f
parameterized as multilayer neural networks with “rectified linear” max(0, ·) activation
function and clipped weights [Arjovsky et al., 2017].

6.3 W 1 on a Graph

The previous formulations (6.3) and (6.4) of W 1 can be generalized to the setting where
X is a geodesic space, i.e. c(x, y) = d(x, y) where d is a geodesic distance. We refer
to Feldman and McCann [2002] for a theoretical analysis in the case where X is a
Riemannian manifold. When X = J1, nK is a discrete set, equipped with undirected
edges (i, j) ∈ E ⊂ X 2 labeled with a weight (length) wi,j , we recover the important
case where X is a graph equipped with the geodesic distance (or shortest path metric):
$$ D_{i,j} \stackrel{\mathrm{def.}}{=} \min_{K \geq 0,\, (i_k)_k : i \to j} \left\{ \sum_{k=1}^{K-1} w_{i_k, i_{k+1}} : \forall\, k \in \llbracket 1, K-1\rrbracket, \; (i_k, i_{k+1}) \in E \right\}, $$

where i → j indicates that i1 = i and iK = j, namely that the path starts at i and
ends at j.
We consider two vectors $(a, b) \in (\mathbb{R}^n)^2$ defining (signed) discrete measures on the graph $\mathcal{X}$ such that $\sum_i a_i = \sum_i b_i$ (these weights do not need to be positive). The

goal is now to compute W1 (a, b), as introduced in (2.17) for p = 1, when the ground
metric is the graph geodesic distance. This computation should be carried out without
going as far as having to compute a “full” coupling P of size n × n, to rely instead on
local operators thanks to the underlying connectivity of the graph. These operators are
discrete formulations for the gradient and divergence differential operators.

A discrete dual Kantorovich potential $f \in \mathbb{R}^n$ is a vector indexed by all vertices of the graph. The gradient operator $\nabla : \mathbb{R}^n \to \mathbb{R}^E$ is defined as
$$ \forall\, (i, j) \in E, \quad (\nabla f)_{i,j} \stackrel{\mathrm{def.}}{=} f_i - f_j. $$
A flow $s = (s_{i,j})_{i,j}$ is defined on edges, and the divergence operator $\operatorname{div} : \mathbb{R}^E \to \mathbb{R}^n$, which is the adjoint of the gradient ∇, maps flows to vectors defined on vertices and is defined as
$$ \forall\, i \in \llbracket 1, n\rrbracket, \quad \operatorname{div}(s)_i \stackrel{\mathrm{def.}}{=} \sum_{j : (i,j) \in E} (s_{i,j} - s_{j,i}). $$

Problem (6.3) becomes, in the graph setting,
$$ W_1(a, b) = \max_{f \in \mathbb{R}^n} \left\{ \sum_{i=1}^n f_i (a_i - b_i) : \forall\, (i, j) \in E, \; |(\nabla f)_{i,j}| \leq w_{i,j} \right\}. \qquad (6.5) $$
The associated dual problem, which is analogous to Formula (6.4), is then
$$ W_1(a, b) = \min_{s \in \mathbb{R}^E_+} \left\{ \sum_{(i,j)\in E} w_{i,j}\, s_{i,j} : \operatorname{div}(s) = a - b \right\}. \qquad (6.6) $$

This is a linear program and more precisely an instance of min-cost flow problems.
Highly efficient dedicated simplex solvers have been devised to solve it; see, for in-
stance, [Ling and Okada, 2007]. Figure 6.1 shows an example of primal and dual solu-
tions. Formulation (6.6) is the so-called Beckmann formulation [Beckmann, 1952] and
has been used and extended to define and study traffic congestion models; see, for
instance, [Carlier et al., 2008].
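
As a small illustration of (6.6), the Beckmann problem on a graph can be written directly as a linear program and handed to a generic LP solver; this is only a sketch for small graphs (dedicated min-cost flow or network simplex solvers, as cited above, are far more efficient), and the function name is an illustrative assumption.

```python
import numpy as np
from scipy.optimize import linprog

def w1_graph(edges, w, a, b):
    """Solve (6.6): min sum_e w_e s_e  s.t.  div(s) = a - b, s >= 0.

    edges: list of undirected edges (i, j); w: their lengths; a, b: vertex weights
    with sum(a) == sum(b)."""
    n = len(a)
    arcs = list(edges) + [(j, i) for (i, j) in edges]   # one variable per orientation
    cost = np.concatenate([w, w])
    A = np.zeros((n, len(arcs)))                        # signed incidence (divergence) matrix
    for k, (i, j) in enumerate(arcs):
        A[i, k] += 1.0                                  # outgoing flow at vertex i
        A[j, k] -= 1.0                                  # incoming flow at vertex j
    res = linprog(cost, A_eq=A, b_eq=np.asarray(a) - np.asarray(b),
                  bounds=(0, None), method="highs")
    return res.fun
```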

f (a, b) and s

Figure 6.1: Example of computation of W1 (a, b) on a planar graph with uniform weights wi,j = 1.
Left: potential f solution of (6.5) (increasing value from red to blue). The green color of the edges is
proportional to |(∇f)i,j |. Right: flow s solution of (6.6), where bold black edges display nonzero si,j ,
which saturate to $w_{i,j} = 1$. These saturating flow edges on the right match the light green edges on the left where $|(\nabla f)_{i,j}| = 1$.
7
Dynamic Formulations

This chapter presents the geodesic (also called dynamic) point of view of optimal trans-
port when the cost is a squared geodesic distance. This describes the optimal transport
between two measures as a curve in the space of measures minimizing a total length.
The dynamic point of view offers an alternative and intuitive interpretation of optimal
transport, which not only allows us to draw links with fluid dynamics but also results
in an efficient numerical tool to compute OT in small dimensions when interpolating
between two densities. The drawback of that approach is that it cannot scale to large-
scale sparse measures and works only in low dimensions on regular domains (because
one needs to grid the space) with a squared geodesic cost.
In this chapter, we use the notation (α0 , α1 ) in place of (α, β) in agreement with
the idea that we start at time t = 0 from one measure to reach another one at time
t = 1.

7.1 Continuous Formulation

In the case $\mathcal{X} = \mathcal{Y} = \mathbb{R}^d$ and $c(x, y) = \|x - y\|^2$, the optimal transport distance $W_2^2(\alpha, \beta) = \mathcal{L}_c(\alpha, \beta)$ as defined in (2.15) can be computed by looking for a minimal length path $(\alpha_t)_{t=0}^1$ between these two measures. This path is described by advecting the measure using a vector field $v_t$ defined at each instant. The vector field $v_t$ and the path $\alpha_t$ must satisfy the conservation of mass formula, resulting in
$$ \frac{\partial \alpha_t}{\partial t} + \operatorname{div}(\alpha_t v_t) = 0 \quad \text{and} \quad \alpha_{t=0} = \alpha_0, \; \alpha_{t=1} = \alpha_1, \qquad (7.1) $$


where the equation above should be understood in the sense of distributions on $\mathbb{R}^d$. The infinitesimal length of such a vector field is measured using the $L^2$ norm associated to the measure $\alpha_t$, that is defined as
$$ \|v_t\|_{L^2(\alpha_t)} = \left( \int_{\mathbb{R}^d} \|v_t(x)\|^2\,d\alpha_t(x) \right)^{1/2}. $$
This definition leads to the following minimal-path reformulation of $W_2$, originally introduced by Benamou and Brenier [2000]:
$$ W_2^2(\alpha_0, \alpha_1) = \min_{(\alpha_t, v_t)_t \text{ sat. } (7.1)} \int_0^1 \int_{\mathbb{R}^d} \|v_t(x)\|^2\,d\alpha_t(x)\,dt, \qquad (7.2) $$

where αt is a scalar-valued measure and vt a vector-valued measure. Figure 7.1 shows


two examples of such paths of measures.

α0 α1/4 α1/2 α3/4 α1

Figure 7.1: Displacement interpolation αt satisfying (7.2). Top: for two measures (α0 , α1 ) with
densities with respect to the Lebesgue measure. Bottom: for two discrete empirical measures with the
same number of points.

The formulation (7.2) is a nonconvex formulation in the variables $(\alpha_t, v_t)_t$ because of the constraint (7.1) involving the product $\alpha_t v_t$. Introducing a vector-valued measure (often called the “momentum”)
$$ J_t \stackrel{\mathrm{def.}}{=} \alpha_t v_t, $$
Benamou and Brenier showed in their landmark paper [2000] that it is instead convex in the variable $(\alpha_t, J_t)_t$ when writing
$$ W_2^2(\alpha_0, \alpha_1) = \min_{(\alpha_t, J_t)_t \in \mathcal{C}(\alpha_0, \alpha_1)} \int_0^1 \int_{\mathbb{R}^d} \theta(\alpha_t(x), J_t(x))\,dx\,dt, \qquad (7.3) $$

where we define the set of constraints as
$$ \mathcal{C}(\alpha_0, \alpha_1) \stackrel{\mathrm{def.}}{=} \left\{ (\alpha_t, J_t) : \frac{\partial \alpha_t}{\partial t} + \operatorname{div}(J_t) = 0, \; \alpha_{t=0} = \alpha_0, \; \alpha_{t=1} = \alpha_1 \right\}, \qquad (7.4) $$
and where $\theta : \mathbb{R}_+ \times \mathbb{R}^d \to \mathbb{R}_+ \cup \{+\infty\}$ is the following lower semicontinuous convex function
$$ \forall\, (a, b) \in \mathbb{R}_+ \times \mathbb{R}^d, \quad \theta(a, b) = \begin{cases} \frac{\|b\|^2}{a} & \text{if } a > 0, \\ 0 & \text{if } (a, b) = 0, \\ +\infty & \text{otherwise.} \end{cases} \qquad (7.5) $$

This definition might seem complicated, but it is crucial to impose that the momentum
Jt (x) should vanish when αt (x) = 0. Note also that (7.3) is written in an informal way
as if the measures (αt , Jt ) were density functions, but this is acceptable because θ is a
1-homogeneous function (and hence defined even if the measures do not have a density
with respect to Lebesgue measure) and can thus be extended in an unambiguous way
from density to functions.

Remark 7.1 (Links with McCann’s interpolation). In the case (see Equation (2.28))
where there exists an optimal Monge map T : Rd → Rd with T] α0 = α1 , then αt is
equal to McCann’s interpolation

αt = ((1 − t)Id + tT )] α0 . (7.6)

In the 1-D case, using Remark 2.30, this interpolation can be computed thanks to the relation
$$ \mathcal{C}_{\alpha_t}^{-1} = (1 - t)\, \mathcal{C}_{\alpha_0}^{-1} + t\, \mathcal{C}_{\alpha_1}^{-1}; \qquad (7.7) $$
see Figure 2.11. We refer to Gangbo and McCann [1996] for a detailed review on
the Riemannian geometry of the Wasserstein space. In the case that there is “only”
an optimal coupling π that is not necessarily supported on a Monge map, one can
compute this interpolant as

αt = Pt] π where Pt : (x, y) ∈ Rd × Rd 7→ (1 − t)x + ty. (7.8)

For instance, in the discrete setup (2.3), denoting P a solution to (2.11), an interpolation is defined as
$$ \alpha_t = \sum_{i,j} P_{i,j}\, \delta_{(1-t)x_i + t y_j}. \qquad (7.9) $$

Such an interpolation is typically supported on n + m − 1 points, which is the


maximum number of nonzero elements of P. Figure 7.2 shows two examples of
such displacement interpolation of discrete measures. This construction can be
generalized to geodesic spaces X by replacing Pt by the interpolation along geodesic

paths. McCann’s interpolation finds many applications, for instance, color, shape,
and illumination interpolations in computer graphics [Bonneel et al., 2011].
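
A direct implementation of the discrete displacement interpolation (7.9) only requires an optimal coupling P (e.g. computed with the solvers of Chapters 3 or 4); the sketch below and its name are illustrative.

```python
import numpy as np

def displacement_interpolation(P, x, y, t):
    """Support points and weights of alpha_t in (7.9), for an optimal coupling P.

    P: (n, m) coupling; x: (n, d), y: (m, d) supports of alpha_0 and alpha_1; t in [0, 1]."""
    I, J = np.nonzero(P)                 # typically n + m - 1 entries for a basic optimal solution
    points = (1 - t) * x[I] + t * y[J]   # barycentric displacement of the supports
    weights = P[I, J]
    return points, weights
```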

α0 α1/5 α2/5 α3/5 α4/5 α1

Figure 7.2: Comparison of displacement interpolation (7.8) of discrete measures. Top: point clouds
(empirical measures (α0 , α1 ) with the same number of points). Bottom: same but with varying weights.
For 0 < t < 1, the top example corresponds to an empirical measure interpolation αt with N points,
while the bottom one defines a measure supported on 2N − 1 points.

7.2 Discretization on Uniform Staggered Grids

For simplicity, we describe the numerical scheme in dimension d = 2; the extension to


higher dimensions is straightforward. We follow the discretization method introduced
by Papadakis et al. [2014], which is inspired by staggered grid techniques which are
commonly used in fluid dynamics. We discretize time as $t_k = k/T \in [0, 1]$ and assume the space is uniformly discretized at points $x_i = (i_1/n_1, i_2/n_2) \in \mathcal{X} = [0, 1]^2$. We use a staggered grid representation, so that $\alpha_t$ is represented using $a \in \mathbb{R}^{(T+1)\times n_1 \times n_2}$ associated to half grid points in time, whereas J is represented using $J = (J^1, J^2)$, where $J^1 \in \mathbb{R}^{T\times (n_1+1)\times n_2}$ and $J^2 \in \mathbb{R}^{T\times n_1 \times (n_2+1)}$ are stored at half grid points in each space direction. Using this representation, for $(k, i_1, i_2) \in \llbracket 1, T\rrbracket \times \llbracket 1, n_1\rrbracket \times \llbracket 1, n_2\rrbracket$, the time derivative is computed as
$$ (\partial_t a)_{k,i} \stackrel{\mathrm{def.}}{=} a_{k+1,i} - a_{k,i} $$
and the spatial divergence as
$$ \operatorname{div}(J)_{k,i} \stackrel{\mathrm{def.}}{=} J^1_{k,i_1+1,i_2} - J^1_{k,i_1,i_2} + J^2_{k,i_1,i_2+1} - J^2_{k,i_1,i_2}, \qquad (7.10) $$
which are both defined at grid points, thus forming arrays of $\mathbb{R}^{T\times n_1\times n_2}$.

In order to evaluate the functional to be optimized, one needs interpolation operators from midgrid points to grid points, for all $(k, i_1, i_2) \in \llbracket 1, T\rrbracket \times \llbracket 1, n_1\rrbracket \times \llbracket 1, n_2\rrbracket$,
$$ \mathcal{I}_a(a)_{k,i} \stackrel{\mathrm{def.}}{=} \mathcal{I}(a_{k+1,i}, a_{k,i}), $$
$$ \mathcal{I}_J(J)_{k,i} \stackrel{\mathrm{def.}}{=} \big( \mathcal{I}(J^1_{k,i_1+1,i_2}, J^1_{k,i_1,i_2}),\; \mathcal{I}(J^2_{k,i_1,i_2+1}, J^2_{k,i_1,i_2}) \big). $$
The simplest choice is to use a linear operator $\mathcal{I}(r, s) = \frac{r+s}{2}$, which is the one we consider next. The discrete counterpart to (7.3) reads
$$ \min_{(a, J) \in \mathcal{C}(a_0, a_1)} \Theta(\mathcal{I}_a(a), \mathcal{I}_J(J)), \qquad (7.11) $$
$$ \text{where} \quad \Theta(\tilde a, \tilde J) \stackrel{\mathrm{def.}}{=} \sum_{k=1}^{T} \sum_{i_1=1}^{n_1} \sum_{i_2=1}^{n_2} \theta(\tilde a_{k,i}, \tilde J_{k,i}), $$
and where the constraint now reads
$$ \mathcal{C}(a_0, a_1) \stackrel{\mathrm{def.}}{=} \big\{ (a, J) : \partial_t a + \operatorname{div}(J) = 0, \; (a_{0,\cdot}, a_{T,\cdot}) = (a_0, a_1) \big\}, $$
where $a \in \mathbb{R}^{(T+1)\times n_1\times n_2}$, $J = (J^1, J^2)$ with $J^1 \in \mathbb{R}^{T\times(n_1+1)\times n_2}$, $J^2 \in \mathbb{R}^{T\times n_1\times(n_2+1)}$.
Figure 7.3 shows an example of evolution (αt )t approximated using this discretization
scheme.
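
The two finite-difference operators of this discretization are straightforward to code; the sketch below (illustrative names, NumPy arrays indexed from 0) mirrors (7.10) and the time derivative defined above.

```python
import numpy as np

def time_derivative(a):
    """(d_t a)_{k,i} = a_{k+1,i} - a_{k,i}; input a has shape (T+1, n1, n2), output (T, n1, n2)."""
    return a[1:] - a[:-1]

def divergence(J1, J2):
    """Staggered divergence (7.10); J1: (T, n1+1, n2), J2: (T, n1, n2+1), output (T, n1, n2)."""
    return (J1[:, 1:, :] - J1[:, :-1, :]) + (J2[:, :, 1:] - J2[:, :, :-1])
```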

Remark 7.2 (Dynamic formulation on graphs). In the case where X is a graph and
c(x, y) = dX (x, y)2 is the squared geodesic distance, it is possible to derive faithful
discretization methods that use a discrete divergence associated to the graph structure
in place of the uniform grid discretization (7.10). In order to ensure that the heat
equation has a gradient flow structure (see §9.3 for more details about gradient flows)
for the corresponding dynamic Wasserstein distance, Maas [2011] and later Mielke
[2013] proposed to use a logarithmic mean I(r, s) (see also [Solomon et al., 2016b, Chow
et al., 2012, 2017b,a]).

7.3 Proximal Solvers

The discretized dynamic OT problem (7.11) is challenging to solve because it requires


us to minimize a nonsmooth optimization problem under affine constraints. Indeed,
the function θ is convex but nonsmooth for measures with vanishing mass ak,i . When
interpolating between two compactly supported input measures (a0 , a1 ), one typically
expects the mass of the interpolated measures (ak )Tk=1 to vanish as well, and the difficult
part of the optimization process is indeed to track this evolution of the support. In
particular, it is not possible to use standard smooth optimization techniques.
There are several ways to recast (7.11) into a quadratic-cone program, either by considering the dual problem or simply by replacing the functional $\theta(a_{k,i}, J_{k,i})$ by a linear function under constraints,
$$ \Theta(\tilde a, \tilde J) = \min_{\tilde z} \left\{ \sum_{k,i} \tilde z_{k,i} : \forall\, (k, i), \; (\tilde z_{k,i}, \tilde a_{k,i}, \tilde J_{k,i}) \in \mathcal{L} \right\}, $$
which thus requires the introduction of an extra variable $\tilde z$. Here $\mathcal{L} \stackrel{\mathrm{def.}}{=} \{(z, a, J) \in \mathbb{R} \times \mathbb{R}_+ \times \mathbb{R}^d : \|J\|^2 \leq z a\}$ is a rotated Lorentz quadratic cone. With this extra
variable, it is thus possible to solve the discretized problem using standard interior point
solvers for quadratic-cone programs [Nesterov and Nemirovskii, 1994]. These solvers
have fast convergence rates and are thus capable of computing a solution with high
precision. Unfortunately, each iteration is costly and requires the resolution of a linear
system of dimension that scales with the number of discretization points. They are
thus not applicable for large-scale multidimensional problems encountered in imaging
applications.
An alternative to these high-precision solvers are low-precision first order methods,
which are well suited for nonsmooth but highly structured problems such as (7.11).
While this class of solvers is not new, it has recently been revitalized in the fields of
imaging and machine learning because they are the perfect fit for these applications,
where numerical precision is not the driving goal. We refer, for instance, to the mono-
graph [Bauschke and Combettes, 2011] for a detailed account on these solvers and their
use for large-scale applications. We here concentrate on a specific solver, but of course
many more can be used, and we refer to [Papadakis et al., 2014] for a study of several
such approaches for dynamical OT. Note that the idea of using a first order scheme for
dynamical OT was initially proposed by Benamou and Brenier [2000].
The DR algorithm [Lions and Mercier, 1979] is specifically tailored to solve nonsmooth structured problems of the form
$$ \min_{x \in \mathcal{H}} \; F(x) + G(x), \qquad (7.12) $$
where $\mathcal{H}$ is some Euclidean space, and where $F, G : \mathcal{H} \to \mathbb{R} \cup \{+\infty\}$ are two closed convex functions, for which one can “easily” (e.g. in closed form or using a rapidly converging scheme) compute the so-called proximal operator
$$ \forall\, x \in \mathcal{H}, \quad \operatorname{Prox}_{\tau F}(x) \stackrel{\mathrm{def.}}{=} \operatorname*{argmin}_{x' \in \mathcal{H}} \; \frac{1}{2}\|x - x'\|^2 + \tau F(x') \qquad (7.13) $$
for a parameter τ > 0. Note that this corresponds to the proximal map for the Euclidean metric and that this definition can be extended to more general Bregman divergences in place of $\|x - x'\|^2$; see (4.52) for an example using the KL divergence. The iterations of the DR algorithm define a sequence $(x^{(\ell)}, w^{(\ell)}) \in \mathcal{H}^2$ using an initialization $(x^{(0)}, w^{(0)}) \in \mathcal{H}^2$ and
$$ w^{(\ell+1)} \stackrel{\mathrm{def.}}{=} w^{(\ell)} + \alpha\big(\operatorname{Prox}_{\gamma F}(2 x^{(\ell)} - w^{(\ell)}) - x^{(\ell)}\big), $$
$$ x^{(\ell+1)} \stackrel{\mathrm{def.}}{=} \operatorname{Prox}_{\gamma G}(w^{(\ell+1)}). \qquad (7.14) $$

If 0 < α < 2 and γ > 0, one can show that x(`) → z ? , where z ? is a solution of (7.12);
see [Combettes and Pesquet, 2007] for more details. This algorithm is closely related to
another popular method, the alternating direction method of multipliers [Gabay and
Mercier, 1976, Glowinski and Marroco, 1975] (see also [Boyd et al., 2011] for a review),
which can be recovered by applying DR on a dual problem; see [Papadakis et al., 2014]
for more details on the equivalence between the two, first shown by [Eckstein and
Bertsekas, 1992].
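
In its generic form, the DR iteration (7.14) only needs the two proximal maps; the following sketch (illustrative names, with the γ factor absorbed into the user-supplied proximal operators) can be instantiated with the operators described below for dynamical OT.

```python
import numpy as np

def douglas_rachford(prox_gF, prox_gG, x0, alpha=1.0, n_iter=500):
    """Douglas-Rachford iterations (7.14) for min_x F(x) + G(x).

    prox_gF, prox_gG: proximal maps of gamma*F and gamma*G for some fixed gamma > 0;
    the iterates x converge to a minimizer when 0 < alpha < 2."""
    w = np.copy(x0)
    x = prox_gG(w)
    for _ in range(n_iter):
        w = w + alpha * (prox_gF(2 * x - w) - x)   # reflected step on F
        x = prox_gG(w)                             # step on G
    return x
```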
There are many ways to recast Problem (7.11) in the form (7.12), and we refer to [Papadakis et al., 2014] for a detailed account of these approaches. A simple way to achieve this is by setting $x = (a, J, \tilde a, \tilde J)$ and letting
$$ F(x) \stackrel{\mathrm{def.}}{=} \Theta(\tilde a, \tilde J) + \iota_{\mathcal{C}(a_0,a_1)}(a, J) \quad \text{and} \quad G(x) \stackrel{\mathrm{def.}}{=} \iota_{\mathcal{D}}(a, J, \tilde a, \tilde J), $$
$$ \text{where} \quad \mathcal{D} \stackrel{\mathrm{def.}}{=} \big\{ (a, J, \tilde a, \tilde J) : \tilde a = \mathcal{I}_a(a), \; \tilde J = \mathcal{I}_J(J) \big\}. $$
The proximal operator of these two functions can be computed efficiently. Indeed, one has
$$ \operatorname{Prox}_{\tau F}(x) = \big( \operatorname{Prox}_{\tau\Theta}(\tilde a, \tilde J),\; \operatorname{Proj}_{\mathcal{C}(a_0,a_1)}(a, J) \big). $$
The proximal operator Proxτ Θ is computed by solving a cubic polynomial equation at
each grid position. The orthogonal projection on the affine constraint C(a0 , a1 ) involves
the resolution of a Poisson equation, which can be achieved in O(N log(N )) operations
using the fast Fourier transform, where N = T n1 n2 is the number of grid points.
Lastly, the proximal operator Proxτ G is a linear projector, which requires the inversion
of a small linear system. We refer to Papadakis et al. [2014] for more details on these
computations. Figure 7.3 shows an example in which that method is used to compute
a dynamical interpolation inside a complicated planar domain. This class of proximal
methods for dynamical OT has also been used to solve related problems such as mean
field games [Benamou and Carlier, 2015].

7.4 Dynamical Unbalanced OT

In order to be able to match input measures with different masses $\alpha_0(\mathcal{X}) \neq \alpha_1(\mathcal{X})$ (the so-called “unbalanced” setting, a terminology introduced by Benamou [2003]), and also to cope with local mass variation, several normalizations or relaxations have been proposed, in particular by relaxing the fixed marginal constraint; see §10.2. A general methodology consists in introducing a source term $s_t(x)$ in the continuity equation (7.4). We thus consider
$$ \bar{\mathcal{C}}(\alpha_0, \alpha_1) \stackrel{\mathrm{def.}}{=} \left\{ (\alpha_t, J_t, s_t) : \frac{\partial \alpha_t}{\partial t} + \operatorname{div}(J_t) = s_t, \; \alpha_{t=0} = \alpha_0, \; \alpha_{t=1} = \alpha_1 \right\}. $$
The crucial question is how to measure the cost associated to this source term and
introduce it in the original dynamic formulation (7.3). Several proposals appear in the

Figure 7.3: Solution αt of dynamic OT computed with a proximal splitting scheme.

literature, for instance, using an $L^2$ cost [Piccoli and Rossi, 2014]. In order to avoid having to “teleport” mass (mass which travels at infinite speed and suddenly grows in a region where there was no mass before), the associated cost should be infinite. It turns out that this can be achieved in a simple convex way, by also allowing $s_t$ to be an arbitrary measure (e.g. using a 1-homogeneous cost) by penalizing $s_t$ in the same way as the momentum $J_t$,
$$ \mathrm{WFR}^2(\alpha_0, \alpha_1) = \min_{(\alpha_t, J_t, s_t)_t \in \bar{\mathcal{C}}(\alpha_0, \alpha_1)} \Theta(\alpha, J, s), \qquad (7.15) $$
$$ \text{where} \quad \Theta(\alpha, J, s) \stackrel{\mathrm{def.}}{=} \int_0^1 \int_{\mathbb{R}^d} \big( \theta(\alpha_t(x), J_t(x)) + \tau\, \theta(\alpha_t(x), s_t(x)) \big)\,dx\,dt, $$

where θ is the convex 1-homogeneous function introduced in (7.5), and τ is a weight


controlling the trade-off between mass transportation and mass creation/destruction.
This formulation was proposed independently by several authors [Liero et al., 2016,
Chizat et al., 2018c, Kondratyev et al., 2016]. This “dynamic” formulation has a “static”
counterpart; see Remark 10.5. The convex optimization problem (7.15) can be solved
using methods similar to those detailed in §7.3. Figure 7.4 displays a comparison of
several unbalanced OT dynamic interpolations. This dynamic formulation resembles
“metamorphosis” models for shape registration [Trouvé and Younes, 2005], and a more
precise connection is detailed in [Maas et al., 2015, 2016].
As τ → 0, and if α0 (X ) = α1 (X ), then one retrieves the classical OT problem,
WFR(α0 , α1 ) → W(α0 , α1 ). In contrast, as τ → +∞, this distance approaches the

Hellinger metric over densities
$$ \frac{1}{\tau}\,\mathrm{WFR}(\alpha_0, \alpha_1)^2 \;\xrightarrow{\;\tau\to+\infty\;}\; \int_{\mathcal{X}} \Big| \sqrt{\rho_{\alpha_0}(x)} - \sqrt{\rho_{\alpha_1}(x)} \Big|^2 \,dx = \int_{\mathcal{X}} \Big| 1 - \sqrt{\tfrac{d\alpha_1}{d\alpha_0}(x)} \Big|^2 \,d\alpha_0(x). $$

Figure 7.4: Comparison of Hellinger (first row), Wasserstein (row 2), partial optimal transport (row
3), and Wasserstein–Fisher–Rao (row 4) dynamic interpolations.

7.5 More General Mobility Functionals

It is possible to generalize the dynamic formulation (7.3) by considering other “mobility functions” θ in place of the one defined in (7.5). A possible choice for this mobility functional is proposed in Dolbeault et al. [2009],
$$ \forall\, (a, b) \in \mathbb{R}_+ \times \mathbb{R}^d, \quad \theta(a, b) = a^{s-p} \|b\|^p, \qquad (7.16) $$
where the parameters should satisfy p ≥ 1 and s ∈ [1, p] in order for θ to be convex.
Note that this definition should be handled with care in the case 1 < s ≤ p because θ
does not have a linear growth at infinity, so that solutions to (7.3) must be constrained
to have a density with respect to the Lebesgue measure.
The case s = 1 corresponds to the classical OT problem and the optimal value
of (7.3) defines W p (α, β). In this case, θ is 1-homogeneous, so that solutions to (7.3)
can be arbitrary measures. The case (s = 1, p = 2) is the initial setup considered in (7.3)
to define W 2 .

The limiting case s = p is also interesting, because it corresponds to a dual Sobolev norm $W^{-1,p}$ and the value of (7.3) is then equal to
$$ \|\alpha - \beta\|^p_{W^{-1,p}(\mathbb{R}^d)} = \min_{f} \left\{ \int_{\mathbb{R}^d} f\,d(\alpha - \beta) : \int_{\mathbb{R}^d} \|\nabla f(x)\|^q\,dx \leq 1 \right\} $$
for $1/q + 1/p = 1$. In the limit $(p = s, q) \to (1, \infty)$, one recovers the $W_1$ norm. The case s = p = 2 corresponds to the Sobolev $H^{-1}(\mathbb{R}^d)$ Hilbert norm defined in (8.15).

7.6 Dynamic Formulation over the Paths Space

There is a natural dynamical formulation of both classical and entropic regularized


(see §4) formulations of OT, which is based on studying abstract optimization problems
on the space X̄ of all possible paths γ : [0, 1] → X (i.e. curves) on the space X . For
simplicity, we assume X = Rd , but this extends to more general spaces such as geodesic
spaces and graphs. Informally, the dynamic of “particles” between two input measures
α0 , α1 at times t = 0, 1 is described by a probability distribution π̄ ∈ M1+ (X̄ ). Such a
distribution should satisfy that the distributions of starting and end points must match
(α0, α1), which is formally written using push-forward as
$$ \bar{\mathcal{U}}(\alpha_0, \alpha_1) \stackrel{\mathrm{def.}}{=} \left\{ \bar\pi \in \mathcal{M}^1_+(\bar{\mathcal{X}}) : \bar P_{0\sharp} \bar\pi = \alpha_0, \; \bar P_{1\sharp} \bar\pi = \alpha_1 \right\}, $$
where, for any path $\gamma \in \bar{\mathcal{X}}$, $\bar P_0(\gamma) = \gamma(0)$ and $\bar P_1(\gamma) = \gamma(1)$.

OT over the space of paths. The dynamical version of classical OT (2.15), formulated over the space of paths, then reads
$$ W_2^2(\alpha_0, \alpha_1) = \min_{\bar\pi \in \bar{\mathcal{U}}(\alpha_0, \alpha_1)} \int_{\bar{\mathcal{X}}} L(\gamma)^2\,d\bar\pi(\gamma), \qquad (7.17) $$
where $L(\gamma) = \int_0^1 |\gamma'(s)|^2\,ds$ is the kinetic energy of a path $s \in [0, 1] \mapsto \gamma(s) \in \mathcal{X}$. The

connection between the optimal couplings $\pi^\star$ and $\bar\pi^\star$ solving respectively (2.15) and (7.17) is that $\bar\pi^\star$ only gives mass to geodesics joining pairs of points in proportions prescribed by $\pi^\star$. In the particular case of discrete measures, this means that
$$ \pi^\star = \sum_{i,j} P_{i,j}\, \delta_{(x_i, y_j)} \quad \text{and} \quad \bar\pi^\star = \sum_{i,j} P_{i,j}\, \delta_{\gamma_{x_i, y_j}}, $$

where γxi ,yj is the geodesic between xi and yj . Furthermore, the measures defined by
the distribution of the curve points γ(t) at time t, where γ is drawn following π̄ ? , i.e.

t ∈ [0, 1] 7→ αt = Pt] π̄ ?
def.
where Pt (γ) = γ(t) ∈ X , (7.18)

is a solution to the dynamical formulation (7.3), i.e. it is the displacement interpolation.


In the discrete case, one recovers (7.9).
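In the Euclidean case $\mathcal{X} = \mathbb{R}^d$, geodesics are straight segments, so (7.18) can be evaluated directly from a discrete optimal coupling P: each entry $\mathrm{P}_{i,j}$ sends its mass along $t \mapsto (1-t)x_i + t y_j$. The short sketch below is ours (it is not part of the original text); it assumes such a coupling has already been computed, for instance with the solvers of Chapter 3.

```python
import numpy as np

def displacement_interpolation(P, X, Y, t):
    # discrete displacement interpolation at time t: for each coupling entry
    # P_ij > 0, a Dirac of mass P_ij located at the geodesic point (1-t) x_i + t y_j
    i, j = np.nonzero(P)
    weights = P[i, j]
    locations = (1 - t) * X[i] + t * Y[j]
    return weights, locations

# toy example: a diagonal coupling between two 2-point measures in R^2
X = np.array([[0.0, 0.0], [1.0, 0.0]])
Y = np.array([[0.0, 1.0], [1.0, 1.0]])
P = np.array([[0.5, 0.0], [0.0, 0.5]])
w, loc = displacement_interpolation(P, X, Y, t=0.5)
```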

Entropic OT over the space of paths. We now turn to the re-interpretation of en-
tropic OT, defined in Chapter 4, using the space of paths. Similarly to (4.11), this is
defined using a Kullback–Leibler projection, but this time of a reference measure over
the space of paths K̄ which is the distribution of a reversible Brownian motion (Wiener
process), which has a uniform distribution at the initial and final times

$$\min_{\bar\pi \in \bar{\mathcal{U}}(\alpha_0,\alpha_1)} \mathrm{KL}(\bar\pi \,|\, \bar{\mathcal{K}}). \tag{7.19}$$

We refer to the review paper by Léonard [2014] for an overview of this problem and an
historical account of the work of Schrödinger [1931]. One can show that the (unique)
solution π̄ε? to (7.19) converges to a solution of (7.17) as ε → 0. Furthermore, this
solution is linked to the solution of the static entropic OT problem (4.9) using Brownian bridges $\bar\gamma^{\varepsilon}_{x,y} \in \bar{\mathcal{X}}$ (which are similar to fuzzy geodesics and converge to $\delta_{\gamma_{x,y}}$ as ε → 0).
In the discrete setting, this means that
$$\pi_\varepsilon^\star = \sum_{i,j} \mathrm{P}^\star_{\varepsilon,i,j}\,\delta_{(x_i,y_j)} \quad\text{and}\quad \bar\pi_\varepsilon^\star = \sum_{i,j} \mathrm{P}^\star_{\varepsilon,i,j}\,\bar\gamma^{\varepsilon}_{x_i,y_j}, \tag{7.20}$$

where P?ε,i,j can be computed using Sinkhorn’s algorithm. Similarly to (7.18), one then
can define an entropic interpolation as
$$\alpha_{\varepsilon,t} \stackrel{\text{def.}}{=} P_{t\sharp}\bar\pi^\star_\varepsilon.$$

Since the law $P_{t\sharp}\bar\gamma^{\varepsilon}_{x,y}$ of the position at time t along a Brownian bridge is a Gaussian $\mathcal{G}_{t(1-t)\varepsilon^2}(\cdot - \gamma_{x,y}(t))$ of variance $t(1-t)\varepsilon^2$ centered at $\gamma_{x,y}(t)$, one can deduce that $\alpha_{\varepsilon,t}$ is a Gaussian blurring of a set of traveling Diracs,
$$\alpha_{\varepsilon,t} = \sum_{i,j} \mathrm{P}^\star_{\varepsilon,i,j}\, \mathcal{G}_{t(1-t)\varepsilon^2}(\cdot - \gamma_{x_i,y_j}(t)).$$

The resulting mixture of Brownian bridges is displayed on Figure 7.5.
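This Gaussian blurring of traveling Diracs is easy to evaluate numerically. The sketch below is ours (not from the original text); it assumes an entropic coupling P_eps has already been obtained with Sinkhorn's algorithm, works in dimension 2, and requires 0 < t < 1 so that the bridge variance is nonzero.

```python
import numpy as np

def entropic_interpolation_density(P_eps, X, Y, t, eps, grid):
    # evaluates alpha_{eps,t} = sum_ij P_ij G_{t(1-t)eps^2}(. - ((1-t) x_i + t y_j))
    # on 2-D grid points, with an isotropic Gaussian of variance t(1-t) eps^2
    var = t * (1 - t) * eps ** 2                                    # requires 0 < t < 1
    centers = (1 - t) * X[:, None, :] + t * Y[None, :, :]           # (n, m, 2)
    diff = grid[None, None, :, :] - centers[:, :, None, :]          # (n, m, g, 2)
    G = np.exp(-(diff ** 2).sum(-1) / (2 * var)) / (2 * np.pi * var)
    return (P_eps[:, :, None] * G).sum(axis=(0, 1))                 # density on the grid
```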

Figure 7.5: Samples from Brownian bridge paths associated to the Schrödinger entropic interpolation (7.20) over path space, for ε ∈ {0, 0.05, 0.2, 1}. Blue corresponds to t = 0 and red to t = 1.

Another way to describe this entropic interpolation (αt )t is using a regularization


of the Benamou–Brenier dynamic formulation (7.2), namely
$$\min_{(\alpha_t, v_t)_t \text{ sat. } (7.1)} \int_0^1 \int_{\mathbb{R}^d} \left( \|v_t(x)\|^2 + \frac{\varepsilon}{4}\,\|\nabla \log(\alpha_t)(x)\|^2 \right) \mathrm{d}\alpha_t(x)\,\mathrm{d}t; \tag{7.21}$$
see [Gentil et al., 2015, Chen et al., 2016a].
8 Statistical Divergences

We study in this chapter the statistical properties of the Wasserstein distance. More
specifically, we compare it to other major distances and divergences routinely used
in data sciences. We quantify how one can approximate the distance between two
probability distributions when having only access to samples from said distributions.
To introduce these subjects, §8.1 and §8.2 review respectively divergences and integral
probability metrics between probability distributions. A divergence D typically satisfies
D(α, β) ≥ 0 and D(α, β) = 0 if and only if α = β, but it does not need to be symmetric
or satisfy the triangular inequality. An integral probability metric for measures is a
dual norm defined using a prescribed family of test functions. These quantities are
sound alternatives to Wasserstein distances and are routinely used as loss functions
to tackle inference problems, as will be covered in §9. We show first in §8.3 that the
optimal transport distance is not Hilbertian, i.e. one cannot approximate it efficiently
using a Hilbertian metric on a suitable feature representation of probability measures.
We show in §8.4 how to approximate D(α, β) from discrete samples (xi )i and (yj )j
drawn from α and β. A good statistical understanding of that problem is crucial when
using the Wasserstein distance in machine learning. Note that this section will be chiefly
concerned with the statistical approximation of optimal transport between distributions
supported on continuous sets. The very same problem when the ground space is finite
has received some attention in the literature following the work of Sommerfeld and
Munk [2018], extended to entropic regularized quantities by Bigot et al. [2017a].


8.1 ϕ-Divergences

Before detailing in the following section “weak” norms, whose construction shares sim-
ilarities with W 1 , let us detail a generic construction of so-called divergences between
measures, which can then be used as loss functions when estimating probability dis-
tributions. Such divergences compare two input measures by comparing their mass
pointwise, without introducing any notion of mass transportation. Divergences are func-
tionals which, by looking at the pointwise ratio between two measures, give a sense of
how close they are. They have nice analytical and computational properties and build
upon entropy functions.
Definition 8.1 (Entropy function). A function ϕ : R → R ∪ {∞} is an entropy function if
it is lower semicontinuous, convex, dom ϕ ⊂ [0, ∞[, and satisfies the following feasibility
condition: dom ϕ ∩ ]0, ∞[ ≠ ∅. The speed of growth of ϕ at ∞ is described by
$$\varphi'_\infty = \lim_{x\to+\infty} \varphi(x)/x \in \mathbb{R}\cup\{\infty\}.$$

If $\varphi'_\infty = \infty$, then ϕ grows faster than any linear function and ϕ is said to be superlinear. Any entropy function ϕ induces a ϕ-divergence (also known as Csiszár divergence [Csiszár, 1967, Ali and Silvey, 1966] or f-divergence) as follows.
Definition 8.2 (ϕ-Divergences). Let ϕ be an entropy function. For $\alpha, \beta \in \mathcal{M}(\mathcal{X})$, let $\frac{\mathrm{d}\alpha}{\mathrm{d}\beta}\beta + \alpha^\perp$ be the Lebesgue decomposition¹ of α with respect to β. The divergence $\mathcal{D}_\varphi$ is defined by
$$\mathcal{D}_\varphi(\alpha|\beta) \stackrel{\text{def.}}{=} \int_{\mathcal{X}} \varphi\left(\frac{\mathrm{d}\alpha}{\mathrm{d}\beta}\right)\mathrm{d}\beta + \varphi'_\infty\, \alpha^\perp(\mathcal{X}) \tag{8.1}$$
if α, β are nonnegative and ∞ otherwise.
The additional term $\varphi'_\infty\, \alpha^\perp(\mathcal{X})$ in (8.1) is important to ensure that $\mathcal{D}_\varphi$ defines a continuous functional (for the weak topology of measures) even if ϕ has a linear growth at infinity, as is, for instance, the case for the absolute value (8.8) defining the TV norm. If ϕ has a superlinear growth, e.g. the usual entropy (8.4), then $\varphi'_\infty = +\infty$, so that $\mathcal{D}_\varphi(\alpha|\beta) = +\infty$ if α does not have a density with respect to β.
In the discrete setting, assuming
$$\alpha = \sum_i a_i \delta_{x_i} \quad\text{and}\quad \beta = \sum_i b_i \delta_{x_i} \tag{8.2}$$
are supported on the same set of n points $(x_i)_{i=1}^n \subset \mathcal{X}$, (8.1) defines a divergence on $\Sigma_n$,
$$\mathcal{D}_\varphi(\mathbf{a}|\mathbf{b}) = \sum_{i \in \operatorname{Supp}(\mathbf{b})} \varphi\left(\frac{a_i}{b_i}\right) b_i + \varphi'_\infty \sum_{i \notin \operatorname{Supp}(\mathbf{b})} a_i, \tag{8.3}$$
¹ The Lebesgue decomposition theorem asserts that, given β, α admits a unique decomposition as the sum of two measures $\alpha^s + \alpha^\perp$ such that $\alpha^s$ is absolutely continuous with respect to β and $\alpha^\perp$ and β are singular.

where $\operatorname{Supp}(\mathbf{b}) \stackrel{\text{def.}}{=} \{i \in \llbracket n \rrbracket \;:\; b_i \neq 0\}$.
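A direct NumPy transcription of (8.3) may help fix ideas. This is our own sketch (the helper names phi_divergence, phi_kl, phi_tv are not from the text); phi and phi_inf parameterize the entropy function and its recession constant $\varphi'_\infty$, and we use the convention $0\cdot\varphi'_\infty = 0$ when no mass of a lives outside the support of b.

```python
import numpy as np

def phi_divergence(a, b, phi, phi_inf):
    # D_phi(a | b) of eq. (8.3): sum over the support of b, plus the recession
    # term phi'_inf applied to the mass of a outside the support of b
    supp = b > 0
    main = np.sum(phi(a[supp] / b[supp]) * b[supp])
    outside = np.sum(a[~supp])
    return main + (phi_inf * outside if outside > 0 else 0.0)

# Kullback-Leibler: phi(s) = s log s - s + 1 (with phi(0) = 1), phi'_inf = +infinity
phi_kl = lambda s: np.where(s > 0, s * np.log(np.maximum(s, 1e-300)) - s + 1, 1.0)
# total variation: phi(s) = |s - 1|, phi'_inf = 1
phi_tv = lambda s: np.abs(s - 1)

a = np.array([0.2, 0.5, 0.3, 0.0])
b = np.array([0.3, 0.3, 0.4, 0.0])
print(phi_divergence(a, b, phi_kl, np.inf), phi_divergence(a, b, phi_tv, 1.0))
```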
The proof of the following proposition can be found in [Liero et al., 2018, Thm 2.7].

Proposition 8.1. If ϕ is an entropy function, then Dϕ is jointly 1-homogeneous, convex


and weakly* lower semicontinuous in (α, β).

Figure 8.1: Examples of entropy functionals (KL, TV, Hellinger, χ²).

Remark 8.1 (Dual expression). A ϕ-divergence can be expressed using the Legendre
transform
$$\varphi^*(s) \stackrel{\text{def.}}{=} \sup_{t\in\mathbb{R}}\; s\,t - \varphi(t)$$
of ϕ (see also (4.54)) as
$$\mathcal{D}_\varphi(\alpha|\beta) = \sup_{f:\mathcal{X}\to\mathbb{R}} \int_{\mathcal{X}} f(x)\,\mathrm{d}\alpha(x) - \int_{\mathcal{X}} \varphi^*(f(x))\,\mathrm{d}\beta(x);$$

see Liero et al. [2018] for more details.

We now review a few popular instances of this framework. Figure 8.1 displays the
associated entropy functionals, while Figure 8.2 reviews the relationship between them.
Example 8.1 (Kullback–Leibler divergence). The Kullback–Leibler divergence $\mathrm{KL} \stackrel{\text{def.}}{=} \mathcal{D}_{\varphi_{\mathrm{KL}}}$, also known as the relative entropy, was already introduced in (4.10) and (4.6). It is the divergence associated to the Shannon–Boltzmann entropy function $\varphi_{\mathrm{KL}}$, given by
$$\varphi_{\mathrm{KL}}(s) = \begin{cases} s\log(s) - s + 1 & \text{for } s > 0,\\ 1 & \text{for } s = 0,\\ +\infty & \text{otherwise.} \end{cases} \tag{8.4}$$


Figure 8.2: Diagram of relationships between the divergences KL, χ², TV, Hellinger ($d_H$) and $\mathrm{W}_1$, displaying inequalities such as $\mathrm{TV} \le \sqrt{\mathrm{KL}/2}$, $d_H \le \sqrt{2\,\mathrm{TV}}$, $\mathrm{W}_1 \le d_{\max}\,\mathrm{TV}$ and $\mathrm{TV} \le \mathrm{W}_1/d_{\min}$ (inspired by Gibbs and Su [2002]). For $\mathcal{X}$ a metric space with ground distance d, $d_{\max} = \sup_{(x,x')} d(x, x')$ is the diameter of $\mathcal{X}$. When $\mathcal{X}$ is discrete, $d_{\min} \stackrel{\text{def.}}{=} \min_{x \neq x'} d(x, x')$.

def.
Remark 8.1 (Bregman divergence). The discrete KL divergence, KL = DϕKL , has the
unique property of being both a ϕ-divergence and a Bregman divergence. For discrete
vectors in Rn , a Bregman divergence [Bregman, 1967] associated to a smooth strictly
convex function ψ : Rn → R is defined as
$$B_\psi(\mathbf{a}|\mathbf{b}) \stackrel{\text{def.}}{=} \psi(\mathbf{a}) - \psi(\mathbf{b}) - \langle \nabla\psi(\mathbf{b}), \mathbf{a} - \mathbf{b}\rangle, \tag{8.5}$$

where h·, ·i is the canonical inner product on Rn . Note that Bψ (a|b) is a convex function
of a and a linear function of ψ. Similarly to ϕ-divergence, a Bregman divergence satisfies
Bψ (a|b) ≥ 0 and Bψ (a|b) = 0 if and only if a = b. The KL divergence is the Bregman
divergence for minus the entropy ψ = −H defined in (4.1), i.e. $\mathrm{KL} = B_{-\mathrm{H}}$. A Bregman
divergence is locally a squared Euclidean distance since

$$B_\psi(\mathbf{a}+\varepsilon|\mathbf{a}+\eta) = \langle \partial^2\psi(\mathbf{a})(\varepsilon-\eta), \varepsilon-\eta\rangle + o(\|\varepsilon-\eta\|^2)$$
and the set of separating points $\{\mathbf{a} : B_\psi(\mathbf{a}|\mathbf{b}) = B_\psi(\mathbf{a}|\mathbf{b}')\}$ is a hyperplane between
b and b′. These properties make Bregman divergences suitable to replace Euclidean distances in first-order optimization methods. The best-known example is mirror gradient descent [Beck and Teboulle, 2003], which is an explicit descent step of the form (9.32).
Bregman divergences are also important in convex optimization and can be used, for
instance, to derive Sinkhorn iterations and study their convergence in finite dimension;
see Remark 4.8.

Remark 8.2 (Hyperbolic geometry of KL). It is interesting to contrast the geometry of


the Kullback–Leibler divergence to that defined by quadratic optimal transport when
comparing Gaussians. As detailed, for instance, by Costa et al. [2015], the Kullback–
Leibler divergence has a closed form for Gaussian densities. In the univariate case,

$d = 1$, if $\alpha = \mathcal{N}(m_\alpha, \sigma_\alpha^2)$ and $\beta = \mathcal{N}(m_\beta, \sigma_\beta^2)$, one has
$$\mathrm{KL}(\alpha|\beta) = \frac{1}{2}\left( \frac{\sigma_\alpha^2}{\sigma_\beta^2} + \log\left(\frac{\sigma_\beta^2}{\sigma_\alpha^2}\right) + \frac{|m_\alpha - m_\beta|^2}{\sigma_\beta^2} - 1 \right). \tag{8.6}$$

This expression shows that the divergence between α and β diverges to infinity as σβ
diminishes to 0 and β becomes a Dirac mass. In that sense, one can say that singular
Gaussians are infinitely far from all other Gaussians in the KL geometry. That geometry
is thus useful when one wants to avoid dealing with singular covariances. To simplify
the analysis, one can look at the infinitesimal geometry of KL, which is obtained by
performing a Taylor expansion at order 2,
$$\mathrm{KL}\big(\mathcal{N}(m+\delta_m, (\sigma+\delta_\sigma)^2)\,\big|\,\mathcal{N}(m, \sigma^2)\big) = \frac{1}{\sigma^2}\left(\frac{1}{2}\delta_m^2 + \delta_\sigma^2\right) + o(\delta_m^2, \delta_\sigma^2).$$

This local Riemannian metric, the so-called Fisher metric, expressed over $(m/\sqrt{2}, \sigma) \in \mathbb{R} \times \mathbb{R}_{+,*}$, matches exactly that of the hyperbolic Poincaré half plane. Geodesics over
this space are half circles centered along the σ = 0 line and have an exponential speed,
i.e. they only reach the limit σ = 0 after an infinite time. Note in particular that if
$\sigma_\alpha = \sigma_\beta$ but $m_\alpha \neq m_\beta$, then the geodesic between (α, β) over this hyperbolic half plane
does not have a constant standard deviation.
The KL hyperbolic geometry over the space of Gaussian parameters (m, σ) should be
contrasted with the Euclidean geometry associated to OT as described in Remark 2.31,
since in the univariate case

$$\mathrm{W}_2^2(\alpha, \beta) = |m_\alpha - m_\beta|^2 + |\sigma_\alpha - \sigma_\beta|^2. \tag{8.7}$$

Figure 8.3 shows a visual comparison of these two geometries and their respective
geodesics. This interesting comparison was suggested to us by Jean Feydy.
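The contrast between the two geometries is easy to explore numerically. The sketch below is our own (not from the original text); it simply evaluates the closed forms (8.6) and (8.7) for univariate Gaussians.

```python
import numpy as np

def kl_gauss_1d(m_a, s_a, m_b, s_b):
    # closed-form KL divergence (8.6) between N(m_a, s_a^2) and N(m_b, s_b^2)
    return 0.5 * (s_a**2 / s_b**2 + np.log(s_b**2 / s_a**2)
                  + (m_a - m_b)**2 / s_b**2 - 1)

def w2_gauss_1d(m_a, s_a, m_b, s_b):
    # closed-form squared 2-Wasserstein distance (8.7) between the same Gaussians
    return (m_a - m_b)**2 + (s_a - s_b)**2

# KL blows up as the second Gaussian degenerates, while W2 stays finite
for s_b in [1.0, 0.1, 0.01]:
    print(s_b, kl_gauss_1d(0.0, 1.0, 1.0, s_b), w2_gauss_1d(0.0, 1.0, 1.0, s_b))
```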

Example 8.2 (Total variation). The total variation distance $\mathrm{TV} \stackrel{\text{def.}}{=} \mathcal{D}_{\varphi_{\mathrm{TV}}}$ is the divergence associated to
$$\varphi_{\mathrm{TV}}(s) = \begin{cases} |s-1| & \text{for } s \ge 0,\\ +\infty & \text{otherwise.} \end{cases} \tag{8.8}$$
It actually defines a norm on the full space of measures $\mathcal{M}(\mathcal{X})$, where
$$\mathrm{TV}(\alpha|\beta) = \|\alpha - \beta\|_{\mathrm{TV}}, \quad\text{where}\quad \|\alpha\|_{\mathrm{TV}} = |\alpha|(\mathcal{X}) = \int_{\mathcal{X}} \mathrm{d}|\alpha|(x). \tag{8.9}$$
If α has a density $\rho_\alpha$ on $\mathcal{X} = \mathbb{R}^d$, then the TV norm is the $L^1$ norm on functions, $\|\alpha\|_{\mathrm{TV}} = \int_{\mathcal{X}} |\rho_\alpha(x)|\,\mathrm{d}x = \|\rho_\alpha\|_{L^1}$. If α is discrete as in (8.2), then the TV norm is the $\ell^1$ norm of vectors in $\mathbb{R}^n$, $\|\alpha\|_{\mathrm{TV}} = \sum_i |a_i| = \|\mathbf{a}\|_{\ell^1}$.

Figure 8.3: Comparisons of interpolation between Gaussians using the KL (hyperbolic) and OT (Euclidean) geometries.

Remark 8.2 (Strong vs. weak topology). The total variation norm (8.9) defines the so-
called "strong" topology on the space of measures. On a compact domain X of radius
R, one has
W 1 (α, β) ≤ R kα − βkTV
so that this strong notion of convergence implies the weak convergence metrized by
Wasserstein distances. The converse is, however, not true, since δx does not converge
strongly to δy if x → y (note that kδx − δy kTV = 2 if x 6= y). A chief advantage is that
M1+ (X ) (once again on a compact ground space X ) is compact for the weak topology,
so that from any sequence of probability measures (αk )k , one can always extract a con-
verging subsequence, which makes it a suitable space for several optimization problems,
such as those considered in Chapter 9.
Example 8.3 (Hellinger). The Hellinger distance $h \stackrel{\text{def.}}{=} \mathcal{D}_{\varphi_H}^{1/2}$ is the square root of the divergence associated to
$$\varphi_H(s) = \begin{cases} |\sqrt{s} - 1|^2 & \text{for } s \ge 0,\\ +\infty & \text{otherwise.} \end{cases}$$
As its name suggests, h is a distance on $\mathcal{M}_+(\mathcal{X})$, which metrizes the strong topology as $\|\cdot\|_{\mathrm{TV}}$. If (α, β) have densities $(\rho_\alpha, \rho_\beta)$ on $\mathcal{X} = \mathbb{R}^d$, then $h(\alpha, \beta) = \|\sqrt{\rho_\alpha} - \sqrt{\rho_\beta}\|_{L^2}$. If (α, β) are discrete as in (8.2), then $h(\alpha, \beta) = \|\sqrt{\mathbf{a}} - \sqrt{\mathbf{b}}\|$. Considering $\varphi_{L^p}(s) = |s^{1/p} - 1|^p$ generalizes the Hellinger (p = 2) and total variation (p = 1) distances, and $\mathcal{D}_{\varphi_{L^p}}^{1/p}$ is a distance which metrizes the strong convergence for $0 < p < +\infty$.

Example 8.4 (Jensen–Shannon distance). The KL divergence is not symmetric and,


while being a Bregman divergence (which are locally quadratic norms), it is not the
square of a distance. On the other hand, the Jensen–Shannon distance JS(α, β), defined
as
$$\mathrm{JS}(\alpha, \beta)^2 \stackrel{\text{def.}}{=} \frac{1}{2}\big(\mathrm{KL}(\alpha|\xi) + \mathrm{KL}(\beta|\xi)\big) \quad\text{where}\quad \xi = \frac{\alpha+\beta}{2},$$
is a distance [Endres and Schindelin, 2003, Österreicher and Vajda, 2003]. JS² can be shown to be a ϕ-divergence for $\varphi(s) = s\log(s) - (s+1)\log(s+1)$. In sharp contrast
with KL, JS(α, β) is always bounded; more precisely, it satisfies 0 ≤ JS(α, β)2 ≤ ln(2).
Similarly to the TV norm and the Hellinger distance, it metrizes the strong convergence.
Example 8.5 (χ²). The χ²-divergence $\chi^2 \stackrel{\text{def.}}{=} \mathcal{D}_{\varphi_{\chi^2}}$ is the divergence associated to
$$\varphi_{\chi^2}(s) = \begin{cases} |s-1|^2 & \text{for } s \ge 0,\\ +\infty & \text{otherwise.} \end{cases}$$
If (α, β) are discrete as in (8.2) and have the same support, then
$$\chi^2(\alpha|\beta) = \sum_i \frac{(a_i - b_i)^2}{b_i}.$$

8.2 Integral Probability Metrics

Formulation (6.3) is a special case of a dual norm. A dual norm is a convenient way to
design “weak” norms that can deal with arbitrary measures. For a symmetric convex
set B of measurable functions, one defines
$$\|\alpha\|_B \stackrel{\text{def.}}{=} \max_f \left\{ \int_{\mathcal{X}} f(x)\,\mathrm{d}\alpha(x) \;:\; f \in B \right\}. \tag{8.10}$$

These dual norms are often called "integral probability metrics"; see [Sriperumbudur
et al., 2012].

Example 8.6 (Total variation). The total variation norm (Example 8.2) is a dual norm
associated to the whole space of continuous functions

B = {f ∈ C(X ) : kf k∞ ≤ 1} .

The total variation distance is the only nontrivial divergence that is also a dual norm;
see [Sriperumbudur et al., 2009].

Remark 8.3 (Metrizing the weak convergence). By using smaller “balls” B, which typ-
ically only contain continuous (and sometimes regular) functions, one defines weaker
dual norms. In order for k·kB to metrize the weak convergence (see Definition 2.2), it is

sufficient for the space spanned by B to be dense in the set of continuous functions for
the sup-norm k·k∞ (i.e. for the topology of uniform convergence); see [Ambrosio et al.,
2006, para. 5.1].

Figure 8.4 displays a comparison of several such dual norms, which we now detail.

Figure 8.4: Comparison of dual norms (Energy, Gauss, W₁, Flat) for (α, β) = (δ₀, δ_t) (left panel) and (α, β) = (δ₀, ½(δ_{−t/2} + δ_{t/2})) (right panel).

8.2.1 W 1 and Flat Norm

If the set B is bounded, then $\|\cdot\|_B$ is a norm on the whole space $\mathcal{M}(\mathcal{X})$ of measures. This is not the case of $\mathrm{W}_1$, which is only defined for α such that $\int_{\mathcal{X}} \mathrm{d}\alpha = 0$ (otherwise $\|\alpha\|_B = +\infty$). This can be alleviated by imposing a bound on the value of the potential f, in order to define for instance the flat norm.

Example 8.7 (W₁ norm). W₁, as defined in (6.3), is a special case of the dual norm (8.10),
using
B = {f : Lip(f ) ≤ 1}
the set of 1-Lipschitz functions.

Example 8.8 (Flat norm and Dudley metric). The flat norm is defined using

B = {f : k∇f k∞ ≤ 1 and kf k∞ ≤ 1} . (8.11)

It metrizes the weak convergence on the whole space M(X ). Formula (6.2) is extended
to compute the flat norm by adding the constraint |fk | ≤ 1. The flat norm is sometimes
called the “Kantorovich–Rubinstein” norm [Hanin, 1992] and has been used as a fidelity
term for inverse problems in imaging [Lellmann et al., 2014]. The flat norm is similar
to the Dudley metric, which uses

B = {f : k∇f k∞ + kf k∞ ≤ 1} .

8.2.2 Dual RKHS Norms and Maximum Mean Discrepancies


It is also possible to define “Euclidean” norms (built using quadratic functionals) on
measures using the machinery of kernel methods and more specifically reproducing
kernel Hilbert spaces (RKHS; see [Schölkopf and Smola, 2002] for a survey of their
applications in data sciences), of which we recall first some basic definitions.
Definition 8.3. A symmetric function k (resp., ϕ) defined on a set $\mathcal{X} \times \mathcal{X}$ is said to be positive (resp., negative) definite if for any $n \ge 0$, family $x_1, \dots, x_n \in \mathcal{X}$, and vector $r \in \mathbb{R}^n$ the following inequality holds:
$$\sum_{i,j=1}^n r_i r_j\, k(x_i, x_j) \ge 0, \qquad \left(\text{resp. } \sum_{i,j=1}^n r_i r_j\, \varphi(x_i, x_j) \le 0\right). \tag{8.12}$$
The kernel is said to be conditionally positive if positivity only holds in (8.12) for zero mean vectors r (i.e. such that $\langle r, \mathbb{1}_n\rangle = 0$).
If k is conditionally positive, one defines the following norm:
$$\|\alpha\|_k^2 \stackrel{\text{def.}}{=} \int_{\mathcal{X}\times\mathcal{X}} k(x, y)\,\mathrm{d}\alpha(x)\,\mathrm{d}\alpha(y). \tag{8.13}$$

These norms are often referred to as “maximum mean discrepancy” (MMD) (see [Gret-
ton et al., 2007]) and have also been called “kernel norms” in shape analysis [Glaunes
et al., 2004]. This expression (8.13) can be rephrased, introducing two independent
random vectors (X, X 0 ) on X distributed with law α, as
$$\|\alpha\|_k^2 = \mathbb{E}_{X,X'}(k(X, X')).$$
One can show that $\|\cdot\|_k$ is the dual norm in the sense of (8.10) associated to the unit
ball B of the RKHS associated to k. We refer to [Berlinet and Thomas-Agnan, 2003,
Hofmann et al., 2008, Schölkopf and Smola, 2002] for more details on RKHS functional
spaces.
Remark 8.4 (Universal kernels). According to Remark 8.3, the MMD norm k·kk metrizes
the weak convergence if the span of the dual ball B is dense in the space of continu-
ous functions $\mathcal{C}(\mathcal{X})$. This means that finite sums of the form $\sum_{i=1}^n a_i k(x_i, \cdot)$ (for arbi-

trary choice of n and points (xi )i ) are dense in C(X ) for the uniform norm k·k∞ . For
translation-invariant kernels over X = Rd , k(x, y) = k0 (x − y), this is equivalent to
having a nonvanishing Fourier transform, k̂0 (ω) > 0.
In the special case where α is a discrete measure of the form (2.3), one thus has the simple expression
$$\|\alpha\|_k^2 = \sum_{i=1}^n \sum_{i'=1}^n a_i a_{i'}\, k_{i,i'} = \langle \mathbf{k}\mathbf{a}, \mathbf{a}\rangle \quad\text{where}\quad k_{i,i'} \stackrel{\text{def.}}{=} k(x_i, x_{i'}).$$

In particular, when $\alpha = \sum_{i=1}^n a_i \delta_{x_i}$ and $\beta = \sum_{i=1}^n b_i \delta_{x_i}$ are supported on the same set of points, $\|\alpha - \beta\|_k^2 = \langle \mathbf{k}(\mathbf{a} - \mathbf{b}), \mathbf{a} - \mathbf{b}\rangle$, so that $\|\cdot\|_k$ is a Euclidean norm (proper if k is positive definite, degenerate otherwise if k is semidefinite) on the simplex $\Sigma_n$. To compute the discrepancy between two discrete measures of the form (2.3), one can use
$$\|\alpha - \beta\|_k^2 = \sum_{i,i'} a_i a_{i'}\, k(x_i, x_{i'}) + \sum_{j,j'} b_j b_{j'}\, k(y_j, y_{j'}) - 2\sum_{i,j} a_i b_j\, k(x_i, y_j). \tag{8.14}$$
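As an illustration, here is a minimal NumPy sketch of the discrepancy (8.14) between two weighted point clouds. It is ours and not part of the original text; the helper names mmd_squared and gaussian_kernel are hypothetical, and the Gaussian kernel anticipates Example 8.9.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # pairwise Gaussian kernel k(x, y) = exp(-|x - y|^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd_squared(a, X, b, Y, kernel):
    # squared kernel norm |alpha - beta|_k^2 of eq. (8.14),
    # for alpha = sum_i a_i delta_{x_i} and beta = sum_j b_j delta_{y_j}
    return (a @ kernel(X, X) @ a
            + b @ kernel(Y, Y) @ b
            - 2 * a @ kernel(X, Y) @ b)

# small example with two discrete measures in R^2
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(5, 2)), rng.normal(size=(7, 2)) + 1.0
a, b = np.full(5, 1 / 5), np.full(7, 1 / 7)
print(mmd_squared(a, X, b, Y, lambda U, V: gaussian_kernel(U, V, sigma=0.5)))
```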

Example 8.9 (Gaussian RKHS). One of the most popular kernels is the Gaussian one $k(x, y) = e^{-\frac{\|x-y\|^2}{2\sigma^2}}$, which is a positive universal kernel on $\mathcal{X} = \mathbb{R}^d$. An attractive
feature of the Gaussian kernel is that it is separable as a product of 1-D kernels,
which facilitates computations when working on regular grids (see also Remark 4.17).
However, an important issue that arises when using the Gaussian kernel is that one
needs to select the bandwidth parameter σ. This bandwidth should match the “typical
scale” between observations in the measures to be compared. If the measures have
multiscale features (some regions may be very dense, others very sparsely populated),
a Gaussian kernel is thus not well adapted, and one should consider a “scale-free” kernel
as we detail next. An issue with such scale-free kernels is that they are global (have
slow polynomial decay), which makes them typically computationally more expensive,
since no compact support approximation is possible. Figure 8.5 shows a comparison
between several kernels.

Figure 8.5: Top row: display of ψ such that $\|\alpha - \beta\|_k = \|\psi \star (\alpha - \beta)\|_{L^2(\mathbb{R}^2)}$, formally defined over Fourier as $\hat\psi(\omega) = \sqrt{\hat k_0(\omega)}$, where $k(x, x') = k_0(x - x')$; the kernels shown are the energy distance $\mathrm{ED}(\mathbb{R}^2, \|\cdot\|)$ and Gaussian kernels (G, σ) for σ ∈ {.005, .02, .05}. Bottom row: display of $\psi \star (\alpha - \beta)$. (G, σ) stands for the Gaussian kernel of variance σ². The kernel for $\mathrm{ED}(\mathbb{R}^2, \|\cdot\|)$ is $\psi(x) = 1/\sqrt{\|x\|}$.

Example 8.10 (H −1 (Rd )). Another important dual norm is H −1 (Rd ), the dual (over
distributions) of the Sobolev space H 1 (Rd ) of functions having derivatives in L2 (Rd ).
It is defined using the primal RKHS norm k∇f k2L2 (Rd ) . It is not defined for singular
measures (e.g. Diracs) unless d = 1 because functions in the Sobolev space H 1 (Rd ) are
in general not continuous. This H −1 norm (defined on the space of zero mean measures
with densities) can also be formulated in divergence form,
$$\|\alpha - \beta\|^2_{H^{-1}(\mathbb{R}^d)} = \min_s \left\{ \int_{\mathbb{R}^d} \|s(x)\|_2^2\,\mathrm{d}x \;:\; \operatorname{div}(s) = \alpha - \beta \right\}, \tag{8.15}$$
which should be contrasted with (6.4), where an $L^1$ norm of the vector field s was used in place of the $L^2$ norm used here. The "weighted" version of this Sobolev dual norm,
$$\|\rho\|^2_{H^{-1}(\alpha)} = \min_{\operatorname{div}(s) = \rho} \int_{\mathbb{R}^d} \|s(x)\|_2^2\,\mathrm{d}\alpha(x),$$

can be interpreted as the natural “linearization” of the Wasserstein W 2 norm, in the


sense that the Benamou–Brenier dynamic formulation can be interpreted infinitesimally
as
$$\mathrm{W}_2(\alpha, \alpha + \varepsilon\rho) = \varepsilon\, \|\rho\|_{H^{-1}(\alpha)} + o(\varepsilon). \tag{8.16}$$
The functionals W 2 (α, β) and kα − βkH −1 (α) can be shown to be equivalent [Peyre,
2011]. The issue is that kα − βkH −1 (α) is not a norm (because of the weighting by α),
and one cannot in general replace it by kα − βkH −1 (Rd ) unless (α, β) have densities. In
this case, if α and β have densities on the same support bounded from below by a > 0
and from above by b < +∞, then

$$b^{-1/2}\, \|\alpha - \beta\|_{H^{-1}(\mathbb{R}^d)} \le \mathrm{W}_2(\alpha, \beta) \le a^{-1/2}\, \|\alpha - \beta\|_{H^{-1}(\mathbb{R}^d)}; \tag{8.17}$$

see [Santambrogio, 2015, Theo. 5.34], and see [Peyre, 2011] for sharp constants.

Example 8.11 (Negative Sobolev spaces). One can generalize this construction by con-
sidering the Sobolev space H −r (Rd ) of arbitrary negative index, which is the dual of
the functional Sobolev space H r (Rd ) of functions having r derivatives (in the sense of
distributions) in L2 (Rd ). In order to metrize the weak convergence, one needs functions
in H r (Rd ) to be continuous, which is the case when r > d/2. As the dimension d in-
creases, one thus needs to consider higher regularity. For arbitrary α (not necessarily
integers), these spaces are defined using the Fourier transform, and for a measure α
with Fourier transform α̂(ω) (written here as a density with respect to the Lebesgue
measure dω) Z
kαk2H −r (Rd ) = kωk−2r |α̂(ω)|2 dω.
def.

Rd
This corresponds to a dual RKHS norm with a convolutive kernel $k(x, y) = k_0(x - y)$ with $\hat k_0(\omega) = \pm\|\omega\|^{-2r}$. Taking the inverse Fourier transform, one sees that (up to constant) one has
$$\forall\, x \in \mathbb{R}^d, \quad k_0(x) = \begin{cases} \dfrac{1}{\|x\|^{d-2r}} & \text{if } r < d/2,\\[2mm] -\|x\|^{2r-d} & \text{if } r > d/2. \end{cases} \tag{8.18}$$
Example 8.12 (Energy distance). The energy distance (or Cramér distance when d = 1) [Székely and Rizzo, 2004] associated to a distance d is defined as
$$\|\alpha - \beta\|_{\mathrm{ED}(\mathcal{X}, d^p)} \stackrel{\text{def.}}{=} \|\alpha - \beta\|_{k_{\mathrm{ED}}} \quad\text{where}\quad k_{\mathrm{ED}}(x, y) = -d(x, y)^p \tag{8.19}$$
for $0 < p < 2$. It is a valid MMD norm over measures if d is negative definite (see Definition 8.3), a typical example being the Euclidean distance $d(x, y) = \|x - y\|$. For $\mathcal{X} = \mathbb{R}^d$ and $d(x, y) = \|x - y\|$, using (8.18), one sees that the energy distance is a Sobolev norm
$$\|\cdot\|_{\mathrm{ED}(\mathbb{R}^d, \|\cdot\|^p)} = \|\cdot\|_{H^{-\frac{d+p}{2}}(\mathbb{R}^d)}.$$
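Under the same conventions as the sketch given after (8.14) (our own, hypothetical helpers), the energy distance is obtained simply by plugging the conditionally negative definite kernel $k_{\mathrm{ED}}(x, y) = -\|x - y\|^p$ into the same quadratic form; since a and b are both probability vectors, only zero-mean test vectors matter and the resulting quantity is nonnegative.

```python
import numpy as np

def energy_kernel(X, Y, p=1.0):
    # k_ED(x, y) = -||x - y||^p, conditionally positive definite for 0 < p < 2
    d = np.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
    return -d ** p

def energy_distance_squared(a, X, b, Y, p=1.0):
    # |alpha - beta|_{ED(X, d^p)}^2, i.e. (8.19) evaluated through (8.14)
    return (a @ energy_kernel(X, X, p) @ a
            + b @ energy_kernel(Y, Y, p) @ b
            - 2 * a @ energy_kernel(X, Y, p) @ b)
```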

A chief advantage of the energy distance over more usual kernels such as the Gaussian (Example 8.9) is that it is scale-free and does not depend on a bandwidth parameter σ. More precisely, one has the following scaling behavior on $\mathcal{X} = \mathbb{R}^d$, when denoting $f_s(x) = sx$ the dilation by a factor $s > 0$,
$$\|f_{s\sharp}(\alpha - \beta)\|_{\mathrm{ED}(\mathbb{R}^d, \|\cdot\|^p)} = s^{\frac{p}{2}}\, \|\alpha - \beta\|_{\mathrm{ED}(\mathbb{R}^d, \|\cdot\|^p)},$$
while the Wasserstein distance exhibits a perfect linear scaling,
$$\mathrm{W}_p(f_{s\sharp}\alpha, f_{s\sharp}\beta) = s\, \mathrm{W}_p(\alpha, \beta).$$
Note, however, that for the energy distance, the parameter p must satisfy 0 < p < 2,
and that for p = 2, it degenerates to the distance between the means
$$\|\alpha - \beta\|_{\mathrm{ED}(\mathbb{R}^d, \|\cdot\|^2)} = \left\| \int_{\mathbb{R}^d} x\,(\mathrm{d}\alpha(x) - \mathrm{d}\beta(x)) \right\|,$$
so it is not a norm anymore. This shows that it is not possible to get the same linear
scaling under fs] with the energy distance as for the Wasserstein distance.

8.3 Wasserstein Spaces Are Not Hilbertian

Some of the special cases of the Wasserstein geometry outlined earlier in §2.6 have
highlighted the fact that the optimal transport distance can sometimes be computed
in closed form. They also illustrate that in such cases the optimal transport distance is
a Hilbertian metric between probability measures, in the sense that there exists a map
φ from the space of input measures onto a Hilbert space, as defined below.
Definition 8.4. A distance d defined on a set Z × Z is said to be Hilbertian if there
exists a Hilbert space H and a mapping φ : Z → H such that for any pair z, z 0 in Z we
have that d(z, z 0 ) = kφ(z) − φ(z 0 )kH .

For instance, Remark 2.30 shows that the Wasserstein metric is a Hilbert norm
between univariate distributions, simply by defining φ to be the map that associates
to a measure its generalized quantile function. Remark 2.31 shows that for univariate
Gaussians, as written in (8.7) in this chapter, the Wasserstein distance between two
univariate Gaussians is simply the Euclidean distance between their mean and standard
deviation.
Hilbertian distances have many favorable properties when used in a data analysis
context [Dattorro, 2017]. First, they can be easily cast as radial basis function kernels: for any Hilbertian distance d, it is indeed known that $e^{-d^p/t}$ is a positive definite kernel for any value $0 \le p \le 2$ and any positive scalar t as shown in [Berg et al., 1984,
Cor. 3.3.3, Prop. 3.2.7]. The Gaussian (p = 2) and Laplace (p = 1) kernels are simple
applications of that result using the usual Euclidean distance. The entire field of kernel
methods [Hofmann et al., 2008] builds upon the positive definiteness of a kernel function
to define convex learning algorithms operating on positive definite kernel matrices.
Points living in a Hilbertian space can also be efficiently embedded in lower dimensions
with low distortion factors [Johnson and Lindenstrauss, 1984], [Barvinok, 2002, §V.6.2]
using simple methods such as multidimensional scaling [Borg and Groenen, 2005].
Because Hilbertian distances have such properties, one might hope that the Wasser-
stein distance remains Hilbertian in more general settings than those outlined above,
notably when the dimension of X is 2 and more. This can be disproved using the
following equivalence.

Proposition 8.1. A distance d is Hilbertian if and only if d2 is negative definite.

Proof. If a distance is Hilbertian, then d² is trivially negative definite. Indeed, given n points in $\mathcal{Z}$, the sum $\sum r_i r_j\, d^2(z_i, z_j)$ can be rewritten as $\sum r_i r_j \|\phi(z_i) - \phi(z_j)\|^2_{\mathcal{H}}$, which can be expanded, taking advantage of the fact that $\sum r_i = 0$, to $-2\sum r_i r_j \langle \phi(z_i), \phi(z_j)\rangle_{\mathcal{H}}$, which is negative by definition of a Hilbert dot product. If, on the contrary, d² is negative definite, then the fact that d is Hilbertian proceeds from a key result by Schoenberg [1938] outlined in [Berg et al., 1984, p. 82, Prop. 3.2].

It is therefore sufficient to show that the squared Wasserstein distance is not negative
definite to show that it is not Hilbertian, as stated in the following proposition.

Proposition 8.2. If X = Rd with d ≥ 2 and the ground cost is set to d(x, y) = kx − yk2 ,
then the p-Wasserstein distance is not Hilbertian for p = 1, 2.

Proof. It suffices to prove the result for d = 2 since any counterexample in that di-
mension suffices to obtain a counterexample in any higher dimension. We provide a
nonrandom counterexample which works using measures supported on four vectors
x1 , x2 , x3 , x4 ∈ R2 defined as follows: x1 = [0, 0], x2 = [1, 0], x3 = [0, 1], x4 = [1, 1]. We
now consider all points on the regular grid on the simplex of four dimensions, with increments of 1/4. There are $35 = \binom{4+4-1}{4}$ such points in the simplex. Each probability vector $\mathbf{a}^i$ on that grid is such that, for $j \le 4$, $a^i_j$ is in the set $\{0, \tfrac{1}{4}, \tfrac{1}{2}, \tfrac{3}{4}, 1\}$, and such that $\sum_{j=1}^4 a^i_j = 1$. For a given p, the $35 \times 35$ pairwise Wasserstein distance matrix $\mathbf{D}_p$ between these histograms can be computed. $\mathbf{D}_p$ is not negative definite if and only if its elementwise square $\mathbf{D}_p^2$ is such that $\mathbf{J}\mathbf{D}_p^2\mathbf{J}$ has positive eigenvalues, where $\mathbf{J}$ is the centering matrix $\mathbf{J} = \mathbb{I}_n - \tfrac{1}{n}\mathbb{1}_{n,n}$, which is the case as illustrated in Figure 8.6.
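The construction in this proof can be checked numerically. The sketch below is our own illustration (not the authors' code): it enumerates the 35 histograms on the grid, solves each small transport problem as a linear program with scipy.optimize.linprog, and reports the largest eigenvalue of $\mathbf{J}\mathbf{D}_p^2\mathbf{J}$; a positive value certifies that the distance matrix is not negative definite.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

# four support points and the 35 histograms with weights in {0, 1/4, 1/2, 3/4, 1}
xs = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
grid = [np.array(a) / 4 for a in itertools.product(range(5), repeat=4) if sum(a) == 4]

def wasserstein(a, b, p=1.0):
    # solve the 4x4 optimal transport LP: min <P, C> s.t. P 1 = a, P^T 1 = b
    C = np.linalg.norm(xs[:, None, :] - xs[None, :, :], axis=-1) ** p
    A_eq, b_eq = [], []
    for i in range(4):          # row marginal constraints
        row = np.zeros((4, 4)); row[i, :] = 1; A_eq.append(row.ravel()); b_eq.append(a[i])
    for j in range(4):          # column marginal constraints
        col = np.zeros((4, 4)); col[:, j] = 1; A_eq.append(col.ravel()); b_eq.append(b[j])
    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, None))
    return res.fun ** (1 / p)

for p in [1.0, 2.0]:
    D = np.array([[wasserstein(a, b, p) for b in grid] for a in grid])
    n = len(grid)
    J = np.eye(n) - np.ones((n, n)) / n
    print(p, np.linalg.eigvalsh(J @ D ** 2 @ J).max())   # > 0: not Hilbertian
```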


Figure 8.6: One can show that a distance is not Hilbertian by looking at the spectrum of the centered
matrix JD2p J corresponding to the pairwise squared-distance matrix D2p of a set of points. The spectrum
of such a matrix is necessarily non-positive if the distance is Hilbertian. Here we plot the values of the
maximal eigenvalue of that matrix for points selected in the proof of Proposition 8.2. We do so for
varying values of p, and display the maximal eigenvalues we obtain. These eigenvalues are all positive,
which shows that for all these values of p, the p-Wasserstein distance is not Hilbertian.

8.3.1 Embeddings and Distortion


An important body of work quantifies the hardness of approximating Wasserstein dis-
tances using Hilbertian embeddings. It has been shown that embedding measures in
$\ell^2$ spaces necessarily incurs an important distortion (Naor and Schechtman [2007], Andoni et al. [2018]) as soon as $\mathcal{X} = \mathbb{R}^d$ with $d \ge 2$.
It is possible to embed quasi-isometrically p-Wasserstein spaces for 0 < p ≤ 1 in `1
(see [Indyk and Thaper, 2003, Andoni et al., 2008, Do Ba et al., 2011]), but the equiv-
alence constant between the distances grows fast with the dimension d. Note also that
for p = 1 the embedding is true only for discrete measures (i.e. the embedding constant
depends on the minimum distance between the spikes). A closely related embedding

technique consists in using the characterization of W 1 as the dual of Lipschitz functions


f (see §6.2) and approximating the Lipschitz constraint k∇f k1 ≤ 1 by a weighted `1
ball over the wavelets coefficients; see [Shirdhonkar and Jacobs, 2008]. This weighted `1
ball of wavelet coefficients defines a so-called Besov space of negative index [Leeb and
Coifman, 2016]. These embedding results are also similar to the bound on the Wasser-
stein distance obtained using dyadic partitions; see [Weed and Bach, 2017, Prop. 1]
and also [Fournier and Guillin, 2015]. This also provides a quasi-isometric embedding
in `1 (this embedding being given by rescaled wavelet coefficients) and comes with the
advantage that this embedding can be computed approximately in linear time when
the input measures are discretized on uniform grids. We refer to [Mallat, 2008] for
more details on wavelets. Note that the idea of using multiscale embeddings to com-
pute Wasserstein-like distances has been used extensively in computer vision; see, for
instance, [Ling and Okada, 2006, Grauman and Darrell, 2005, Cuturi and Fukumizu,
2007, Lazebnik et al., 2006].

8.3.2 Negative/Positive Definite Variants of Optimal Transport

We show later in §10.4 that the sliced approximation to Wasserstein distances, essen-
tially a sum of 1-D directional transportation distance computed on random push-
forwards of measures projected on lines, is negative definite as the sum of negative
definite functions [Berg et al., 1984, §3.1.11]. This result can be used to define a pos-
itive definite kernel [Kolouri et al., 2016]. Another way to recover a positive definite
kernel is to cast the optimal transport problem as a soft-min problem (over all possible
transportation tables) rather than a minimum, as proposed by Kosowsky and Yuille
[1994] to introduce entropic regularization. That soft-min defines a term whose neg-
exponential (also known as a generating function) is positive definite [Cuturi, 2012].

8.4 Empirical Estimators for OT, MMD and ϕ-divergences

In an applied setting, given two input measures (α, β) ∈ M1+ (X )2 , an important sta-
tistical problem is to approximate the (usually unknown) divergence D(α, β) using
only samples (xi )ni=1 from α and (yj )m
j=1 from β. These samples are assumed to be
independently identically distributed from their respective distributions.

8.4.1 Empirical Estimators for OT and MMD

For both Wasserstein distances $\mathrm{W}_p$ (see 2.18) and MMD norms (see §8.2), a straightforward estimator of the unknown distance between distributions is to compute it directly between the empirical measures, hoping ideally that one can control the rate of convergence of the latter to the former,
$$D(\alpha, \beta) \approx D(\hat\alpha_n, \hat\beta_m) \quad\text{where}\quad \hat\alpha_n \stackrel{\text{def.}}{=} \frac{1}{n}\sum_i \delta_{x_i}, \qquad \hat\beta_m \stackrel{\text{def.}}{=} \frac{1}{m}\sum_j \delta_{y_j}.$$

Note that here both α̂n and β̂m are random measures, so D(α̂n , β̂m ) is a random
number. For simplicity, we assume that X is compact (handling unbounded domain
requires extra constraints on the moments of the input measures).
For such a dual distance that metrizes the weak convergence (see Definition 2.2),
since there is the weak convergence α̂n → α, one has D(α̂n , β̂n ) → D(α, β) as n → +∞.
But an important question is the speed of convergence of D(α̂n , β̂n ) toward D(α, β),
and this rate is often called the “sample complexity” of D.
Note that for $D = \|\cdot\|_{\mathrm{TV}}$, since the TV norm does not metrize the weak convergence, $\|\hat\alpha_n - \hat\beta_n\|_{\mathrm{TV}}$ is not a consistent estimator, namely it does not converge
toward kα − βkTV . Indeed, with probability 1, kα̂n − β̂n kTV = 2 since the support of the
two discrete measures does not overlap. Similar issues arise with other ϕ-divergences,
which cannot be estimated using divergences between empirical distributions.

Rates for OT. For $\mathcal{X} = \mathbb{R}^d$ and measures supported on a bounded domain, it is shown by [Dudley, 1969] that for $d > 2$ and $1 \le p < +\infty$,
$$\mathbb{E}(|\,\mathrm{W}_p(\hat\alpha_n, \hat\beta_n) - \mathrm{W}_p(\alpha, \beta)\,|) = O(n^{-\frac{1}{d}}),$$
where the expectation E is taken with respect to the random samples (xi , yi )i . This
rate is tight in Rd if one of the two measures has a density with respect to the Lebesgue
measure. This result was proved for general metric spaces [Dudley, 1969] using the
notion of covering numbers and was later refined, in particular for X = Rd in [Dereich
et al., 2013, Fournier and Guillin, 2015]. This rate can be refined when the measures are
supported on low-dimensional subdomains: Weed and Bach [2017] show that, indeed,
the rate depends on the intrinsic dimensionality of the support. Weed and Bach also
study the nonasymptotic behavior of that convergence, such as for measures which are
discretely approximated (e.g. mixture of Gaussians with small variances). It is also
possible to prove concentration of W p (α̂n , β̂n ) around its mean W p (α, β); see [Bolley
et al., 2007, Boissard, 2011, Weed and Bach, 2017].

Rates for MMD. For weak norms $\|\cdot\|_k^2$ which are dual of RKHS norms (also called MMD), as defined in (8.13), and contrary to Wasserstein distances, the sample complexity does not depend on the ambient dimension:
$$\mathbb{E}(|\,\|\hat\alpha_n - \hat\beta_n\|_k - \|\alpha - \beta\|_k\,|) = O(n^{-\frac{1}{2}});$$
see [Sriperumbudur et al., 2012]. Figure 8.7 shows a numerical comparison of the sample
complexity rates for Wasserstein and MMD distances. Note, however, that kα̂n − β̂n k2k

is a slightly biased estimate of $\|\alpha - \beta\|_k^2$. In order to define an unbiased estimator, and thus to be able to use, for instance, SGD when minimizing such losses, one should rather use the unbiased estimator
$$\mathrm{MMD}_k(\hat\alpha_n, \hat\beta_n)^2 \stackrel{\text{def.}}{=} \frac{1}{n(n-1)}\sum_{i \neq i'} k(x_i, x_{i'}) + \frac{1}{n(n-1)}\sum_{j \neq j'} k(y_j, y_{j'}) - \frac{2}{n^2}\sum_{i,j} k(x_i, y_j),$$
which should be compared to (8.14). It satisfies $\mathbb{E}(\mathrm{MMD}_k(\hat\alpha_n, \hat\beta_n)^2) = \|\alpha - \beta\|_k^2$;


see [Gretton et al., 2012].
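A minimal NumPy sketch of this unbiased estimator (our own helper, here generalized to samples of different sizes n and m, with a Gaussian kernel as an assumed choice) reads:

```python
import numpy as np

def mmd_unbiased(X, Y, kernel):
    # unbiased MMD^2 between the empirical measures of samples X (n, d) and Y (m, d):
    # the diagonal terms k(x_i, x_i) and k(y_j, y_j) are excluded from the first two sums
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = kernel(X, X), kernel(Y, Y), kernel(X, Y)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * Kxy.mean()

gauss = lambda U, V: np.exp(-((U[:, None, :] - V[None, :, :]) ** 2).sum(-1) / 2)
rng = np.random.default_rng(0)
print(mmd_unbiased(rng.normal(size=(200, 3)), rng.normal(size=(300, 3)) + 0.1, gauss))
```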

Figure 8.7: Decay of $\log_{10}(D(\hat\alpha_n, \hat\alpha'_n))$ as a function of $\log_{10}(n)$ for D being the energy distance $D = \|\cdot\|_{H^{-1}}$ (i.e. the $H^{-1}$ norm) as defined in Example 8.12 (left) and the Wasserstein distance $D = \mathrm{W}_2$ (right). Here $(\hat\alpha_n, \hat\alpha'_n)$ are two independent empirical distributions of α, the uniform distribution on the unit cube $[0, 1]^d$, tested for several values of $d \in \{2, 3, 5\}$. The shaded bar displays the confidence interval at ± the standard deviation of $\log(D(\hat\alpha_n, \alpha))$.

8.4.2 Empirical Estimators for ϕ-divergences


It is not possible to approximate $\mathcal{D}_\varphi(\alpha|\beta)$, as defined in (8.1), from discrete samples using $\mathcal{D}_\varphi(\hat\alpha_n|\hat\beta_n)$. Indeed, this quantity is either $+\infty$ (for instance, for the KL divergence) or is not converging to $\mathcal{D}_\varphi(\alpha|\beta)$ as $n \to +\infty$ (for instance, for the TV norm). Instead, it is required to use a density estimator to somehow smooth the discrete empirical measures and replace them by densities; see [Silverman, 1986]. In a Euclidean space $\mathcal{X} = \mathbb{R}^d$, introducing $h_\sigma = h(\cdot/\sigma)$ with a smooth windowing function and a bandwidth $\sigma > 0$, a density estimator for α is defined using a convolution against this kernel,
$$\hat\alpha_n \star h_\sigma = \frac{1}{n}\sum_i h_\sigma(\cdot - x_i). \tag{8.20}$$
One can then approximate the ϕ-divergence using
$$\mathcal{D}_\varphi^\sigma(\hat\alpha_n|\hat\beta_n) \stackrel{\text{def.}}{=} \frac{1}{n}\sum_{j=1}^n \varphi\left(\frac{\sum_i h_\sigma(y_j - x_i)}{\sum_{j'} h_\sigma(y_j - y_{j'})}\right),$$

where σ should be adapted to the number n of samples and to the dimension d. It is also
possible to devise nonparametric estimators, bypassing the choice of a fixed bandwidth
σ to select instead a number k of nearest neighbors. These methods typically make use
of the distance between nearest neighbors [Loftsgaarden and Quesenberry, 1965], which
is similar to locally adapting the bandwidth σ to the local sampling density. Denoting
∆k (x) the distance between x ∈ Rd and its kth nearest neighbor among the (xi )ni=1 , a
density estimator is defined as
k/n
ρkα̂n (x) =
def.
, (8.21)
|Bd |∆k (x)r

where |Bd | is the volume of the unit ball in Rd . Instead of somehow “counting” the
number of sample falling in an area of width σ in (8.20), this formula (8.21) estimates
the radius required to encapsulate k samples. Figure 8.8 compares the estimators (8.20)
and (8.21). A typical example of application is detailed in (4.1) for the entropy func-
tional, which is the KL divergence with respect to the Lebesgue measure. We refer
to [Moon and Hero, 2014] for more details.
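A short sketch of the k-nearest-neighbor estimator (8.21) is given below. It is ours (unit_ball_volume and knn_density are hypothetical helper names) and uses scipy.spatial.cKDTree for the neighbor queries; as in our rewriting of (8.21), the radius is raised to the ambient dimension d.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma

def unit_ball_volume(d):
    # |B_d| = pi^(d/2) / Gamma(d/2 + 1)
    return np.pi ** (d / 2) / gamma(d / 2 + 1)

def knn_density(samples, query, k):
    # rho^k(x) = (k/n) / (|B_d| Delta_k(x)^d), cf. (8.21)
    n, d = samples.shape
    tree = cKDTree(samples)
    # distance from each query point to its k-th nearest neighbor among the samples
    delta_k = tree.query(query, k=k)[0][:, -1]
    return (k / n) / (unit_ball_volume(d) * delta_k ** d)

rng = np.random.default_rng(0)
xs = rng.normal(size=(200, 2))
print(knn_density(xs, np.array([[0.0, 0.0], [3.0, 3.0]]), k=50))
```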

Figure 8.8: Comparison of kernel density estimation $\hat\alpha_n \star h_\sigma$ (top, using a Gaussian kernel h, for σ ∈ {2.5·10⁻³, 15·10⁻³, 25·10⁻³}) and k-nearest neighbors estimation $\rho^k_{\hat\alpha_n}$ (bottom, for k ∈ {1, 50, 100}) for n = 200 samples from a mixture of two Gaussians.

8.5 Entropic Regularization: Between OT and MMD

Following Proposition 4.7, we recall that the Sinkhorn divergence is defined as
$$\mathrm{P}_{\mathrm{C}}^\varepsilon(\mathbf{a}, \mathbf{b}) \stackrel{\text{def.}}{=} \langle \mathrm{P}^\star, \mathrm{C}\rangle = \langle e^{\mathrm{f}^\star/\varepsilon}, (\mathrm{K} \odot \mathrm{C})\, e^{\mathrm{g}^\star/\varepsilon}\rangle,$$

where $\mathrm{P}^\star$ is the solution of (4.2) while $(\mathrm{f}^\star, \mathrm{g}^\star)$ are solutions of (4.30). Assuming $\mathrm{C}_{i,j} = d(x_i, y_j)^p$ for some distance d on $\mathcal{X}$, for two discrete probability distributions of the form (2.3), this defines a regularized Wasserstein cost
$$\mathrm{W}_{p,\varepsilon}(\alpha, \beta)^p \stackrel{\text{def.}}{=} \mathrm{P}_{\mathrm{C}}^\varepsilon(\mathbf{a}, \mathbf{b}).$$

This definition is generalized to any input distribution (not necessarily discrete) as
$$\mathrm{W}_{p,\varepsilon}(\alpha, \beta)^p \stackrel{\text{def.}}{=} \int_{\mathcal{X}\times\mathcal{X}} d(x, y)^p\, \mathrm{d}\pi^\star(x, y),$$
where $\pi^\star$ is the solution of (4.9).


In order to cancel the bias introduced by the regularization (in particular, $\mathrm{W}_{p,\varepsilon}(\alpha, \alpha) \neq 0$), we introduce a corrected regularized divergence
$$\tilde{\mathrm{W}}_{p,\varepsilon}(\alpha, \beta)^p \stackrel{\text{def.}}{=} 2\,\mathrm{W}_{p,\varepsilon}(\alpha, \beta)^p - \mathrm{W}_{p,\varepsilon}(\alpha, \alpha)^p - \mathrm{W}_{p,\varepsilon}(\beta, \beta)^p.$$

It is proved in [Feydy et al., 2019] that if e−c/ε is a positive kernel, then a related
corrected divergence (obtained by using LεC in place of PεC ) is positive. Note that it is
possible to define other renormalization schemes using regularized optimal transport,
as proposed, for instance, by Amari et al. [2018].
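For concreteness, here is a small self-contained sketch of this corrected divergence between two discrete point clouds. It is ours, not the book's reference implementation: it approximates $\mathrm{P}^\varepsilon_{\mathrm{C}} \approx \langle \mathrm{P}, \mathrm{C}\rangle$ after a fixed number of plain Sinkhorn iterations; for small ε one should instead use the log-domain iterations of §4.4 to avoid underflow in $e^{-\mathrm{C}/\varepsilon}$.

```python
import numpy as np

def sinkhorn_cost(a, X, b, Y, eps, p=2, iters=2000):
    # entropic OT cost <P, C> with C_ij = |x_i - y_j|^p, plain Sinkhorn scaling iterations
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1) ** p
    K = np.exp(-C / eps)
    v = np.ones(len(b))
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]
    return np.sum(P * C)

def sinkhorn_divergence(a, X, b, Y, eps, p=2):
    # debiased divergence 2 W_{p,eps}(alpha,beta)^p - W_{p,eps}(alpha,alpha)^p - W_{p,eps}(beta,beta)^p
    return (2 * sinkhorn_cost(a, X, b, Y, eps, p)
            - sinkhorn_cost(a, X, a, X, eps, p)
            - sinkhorn_cost(b, Y, b, Y, eps, p))
```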

Figure 8.9: Decay of $\mathbb{E}(\log_{10}(\tilde{\mathrm{W}}_{p,\varepsilon}(\hat\alpha_n, \hat\alpha'_n)))$, for p = 3/2 and various ε, as a function of $\log_{10}(n)$, in dimensions d = 2 and d = 5, where α is the same as in Figure 8.7.

The following proposition, whose proof can be found in [Ramdas et al., 2017], shows
that this regularized divergence interpolates between the Wasserstein distance and the
energy distance defined in Example 8.12.

Proposition 8.3. One has


$$\tilde{\mathrm{W}}_{p,\varepsilon}(\alpha, \beta)^p \xrightarrow{\;\varepsilon\to 0\;} 2\,\mathrm{W}_p(\alpha, \beta)^p \quad\text{and}\quad \tilde{\mathrm{W}}_{p,\varepsilon}(\alpha, \beta)^p \xrightarrow{\;\varepsilon\to+\infty\;} \|\alpha - \beta\|^2_{\mathrm{ED}(\mathcal{X}, d)},$$

where k·kED(X ,d) is defined in (8.19).

Figure 8.9 shows numerically the impact of ε on the sample complexity rates. It is
proved in Genevay et al. [2019], in the case of c(x, y) = kx − yk2 on X = Rd , that these
rates interpolate between the ones of OT and MMD.
9 Variational Wasserstein Problems

In data analysis, common divergences between probability measures (e.g. Euclidean,


total variation, Hellinger, Kullback–Leibler) are often used to measure a fitting error or
a loss in parameter estimation problems. Up to this chapter, we have made the case that
the optimal transport geometry has a unique ability, not shared with other information
divergences, to leverage physical ideas (mass displacement) and geometry (a ground
cost between observations or bins) to compare measures. These two facts combined
make it thus very tempting to use the Wasserstein distance as a loss function. This
idea was recently explored for various applied problems. However, the main technical
challenge associated with that idea lies in approximating and differentiating efficiently
the Wasserstein distance. We start this chapter with a few motivating examples and
show how the different numerical schemes presented in the first chapters of this book
can be used to solve variational Wasserstein problems.
In image processing, the Wasserstein distance can be used as a loss to synthesize
textures [Tartavel et al., 2016], to account for the discrepancy between statistics of
synthesized and input examples. It is also used for image segmentation to account
for statistical homogeneity of image regions [Swoboda and Schnörr, 2013, Rabin and
Papadakis, 2015, Peyré et al., 2012, Ni et al., 2009, Schmitzer and Schnörr, 2013b,
Li et al., 2018b]. The Wasserstein distance is also a very natural fidelity term for in-
verse problems when the measurements are probability measures, for instance, image
restoration [Lellmann et al., 2014], tomographic inversion [Abraham et al., 2017], den-
sity regularization [Burger et al., 2012], particle image velocimetry [Saumier et al.,
2015], sparse recovery and compressed sensing [Indyk and Price, 2011], and seismic
inversion [Métivier et al., 2016]. Distances between measures (mostly kernel-based as


shown in §8.2.2) are routinely used for shape matching (represented as measures over
a lifted space, often called currents) in computational anatomy [Vaillant and Glaunès,
2005], but OT distances offer an interesting alternative [Feydy et al., 2017]. To re-
duce the dimensionality of a dataset of histograms, Lee and Seung have shown that the
nonnegative matrix factorization problem can be cast using the Kullback–Leibler diver-
gence to quantify a reconstruction loss [Lee and Seung, 1999]. When prior information
is available on the geometry of the bins of those histograms, the Wasserstein distance
can be used instead, with markedly different results [Sandler and Lindenbaum, 2011,
Zen et al., 2014, Rolet et al., 2016].
All of these problems have in common that they require access to the gradients of
Wasserstein distances, or approximations thereof. We start this section by presenting
methods to approximate such gradients, then follow with three important applications
that can be cast as variational Wasserstein problems.

9.1 Differentiating the Wasserstein Loss

In statistics, text processing or imaging, one must usually compare a probability dis-
tribution β arising from measurements to a model, namely a parameterized family of
distributions {αθ , θ ∈ Θ}, where Θ is a subset of a Euclidean space. Such a comparison
is done through a “loss” or a “fidelity” term, which is the Wasserstein distance in this
section. In the simplest scenario, the computation of a suitable parameter θ is obtained
by minimizing directly
$$\min_{\theta\in\Theta}\; \mathcal{E}(\theta) \stackrel{\text{def.}}{=} \mathcal{L}_c(\alpha_\theta, \beta). \tag{9.1}$$

Of course, one can consider more complicated problems: for instance, the barycenter
problem described in §9.2 consists in a sum of such terms. However, most of these more
advanced problems can be usually solved by adapting tools defined for the basic case
above, either using the chain rule to compute explicitly derivatives or using automatic
differentiation as advocated in §9.1.3.

Convexity. The Wasserstein distance between two histograms or two densities is con-
vex with respect to its two inputs, as shown by (2.20) and (2.24), respectively. Therefore,
when the parameter θ is itself a histogram, namely Θ = Σn and αθ = θ, or more gen-
erally when θ describes K weights in the simplex, $\Theta = \Sigma_K$, and $\alpha_\theta = \sum_{i=1}^K \theta_i \alpha_i$ is a
convex combination of known atoms α1 , . . . , αK in ΣN , Problem (9.1) remains convex
(the first case corresponds to the barycenter problem, the second to one iteration of
the dictionary learning problem with a Wasserstein loss [Rolet et al., 2016]). However,
for more general parameterizations θ 7→ αθ , Problem (9.1) is in general not convex.

Simple cases. For those simple cases where the Wasserstein distance has a closed
form, such as univariate (see §2.30) or elliptically contoured (see §2.31) distributions,
simple workarounds exist. They consist mostly in casting the Wasserstein distance as
a simpler distance between suitable representations of these distributions (Euclidean
on quantile functions for univariate measures, Bures metric for covariance matrices
for elliptically contoured distributions of the same family) and solving Problem (9.1)
directly on such representations.
In most cases, however, one has to resort to a careful discretization of αθ to com-
pute a local minimizer for Problem (9.1). Two approaches can be envisioned: Eulerian
or Lagrangian. Figure 9.1 illustrates the difference between these two fundamental dis-
cretization schemes. At the risk of oversimplifying this argument, one may say that
a Eulerian discretization is the most suitable when measures are supported on a low-
dimensional space (as when dealing with shapes or color spaces), or for intrinsically
discrete problems (such as those arising from string or text analysis). When applied
to fitting problems where observations can take continuous values in high-dimensional
spaces, a Lagrangian perspective is usually the only suitable choice.

Figure 9.1: Increasingly fine discretization of a continuous distribution having a density (violet, left) using a Lagrangian representation $\frac{1}{n}\sum_i \delta_{x_i}$ (blue, top) and an Eulerian representation $\sum_i a_i \delta_{x_i}$ with $x_i$ representing cells on a grid of increasing size (red, bottom). The Eulerian perspective starts from a pixelated image down to one with such fine resolution that it almost matches the original density. Weights $a_i$ are directly proportional to each pixel-cell's intensity.

9.1.1 Eulerian Discretization


A first way to discretize the problem is to suppose that both distributions $\beta = \sum_{j=1}^m b_j \delta_{y_j}$ and $\alpha_\theta = \sum_{i=1}^n a(\theta)_i \delta_{x_i}$ are discrete distributions defined on fixed loca-
tions (xi )i and (yj )j . Such locations might stand for cells dividing the entire space of
observations in a grid, or a finite subset of points of interest in a continuous space (such

as a family of vector embeddings for all words in a given dictionary [Kusner et al., 2015,
Rolet et al., 2016]). The parameterized measure αθ is in that case entirely represented
through the weight vector a : θ 7→ a(θ) ∈ Σn , which, in practice, might be very sparse
if the grid is large. This setting corresponds to the so-called class of Eulerian discretiza-
tion methods. In its original form, the objective of Problem (9.1) is not differentiable.
In order to obtain a smooth minimization problem, we use the entropic regularized OT
and approximate (9.1) using
min EE (θ) = LεC (a(θ), b)
def. def.
where Ci,j = c(xi , yj ).
θ∈Θ
We recall that Proposition 4.6 shows that the entropic loss function is differentiable
and convex with respect to the input histograms, with gradient.
Proposition 9.1 (Derivative with respect to histograms). For ε > 0, (a, b) 7→ LεC (a, b)
is convex and differentiable. Its gradient reads
$$\nabla \mathrm{L}_{\mathrm{C}}^\varepsilon(\mathbf{a}, \mathbf{b}) = (\mathrm{f}, \mathrm{g}), \tag{9.2}$$
where $(\mathrm{f}, \mathrm{g})$ is the unique solution to (4.30), centered such that $\sum_i \mathrm{f}_i = \sum_j \mathrm{g}_j = 0$. For
ε = 0, this formula defines the elements of the sub-differential of LεC , and the function
is differentiable if they are unique.
The zero mean condition on (f, g) is important when using gradient descent to
guarantee conservation of mass. Using the chain rule, one thus obtains that EE is
smooth and that its gradient is
$$\nabla\mathcal{E}_{\mathcal{E}}(\theta) = [\partial \mathbf{a}(\theta)]^\top(\mathrm{f}), \tag{9.3}$$
where ∂a(θ) ∈ Rn×dim(Θ) is the Jacobian (differential) of the map a(θ), and where
f ∈ Rn is the dual potential vector associated to the dual entropic OT (4.30) between
a(θ) and b for the cost matrix C (which is fixed in a Eulerian setting, and in particular
independent of θ). This result can be used to minimize locally EE through gradient
descent.
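As an illustration of (9.2)-(9.3), the following sketch (ours; sinkhorn_dual and eulerian_gradient are hypothetical helper names) computes the dual potential f by running Sinkhorn in the log domain and uses it, centered to zero mean so that mass is conserved, as the gradient of $\mathbf{a} \mapsto \mathrm{L}^\varepsilon_{\mathrm{C}}(\mathbf{a}, \mathbf{b})$; chaining it with the Jacobian of $\theta \mapsto \mathbf{a}(\theta)$ is then problem specific. The histograms are assumed to have positive entries.

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_dual(a, b, C, eps, iters=500):
    # log-domain Sinkhorn: f = eps*log(u), g = eps*log(v) with u = a/(Kv), v = b/(K^T u),
    # K = exp(-C/eps); a and b must be strictly positive histograms
    f, g = np.zeros(len(a)), np.zeros(len(b))
    for _ in range(iters):
        f = eps * np.log(a) - eps * logsumexp((g[None, :] - C) / eps, axis=1)
        g = eps * np.log(b) - eps * logsumexp((f[:, None] - C) / eps, axis=0)
    return f, g

def eulerian_gradient(a, b, C, eps):
    # gradient of a -> L^eps_C(a, b) is the dual potential f, cf. (9.2); on the
    # simplex additive constants are irrelevant, so we remove the mean, cf. (9.3)
    f, _ = sinkhorn_dual(a, b, C, eps)
    return f - f.mean()
```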

9.1.2 Lagrangian Discretization


A different approach consists in using instead fixed (typically uniform) weights and
approximating an input measure α as an empirical measure $\alpha_\theta = \frac{1}{n}\sum_i \delta_{x(\theta)_i}$ for a

point-cloud parameterization map x : θ 7→ x(θ) = (x(θ)i )ni=1 ∈ X n , where we assume


here that X is Euclidean. Problem (9.1) is thus approximated as
$$\min_\theta\; \mathcal{E}_{\mathcal{L}}(\theta) \stackrel{\text{def.}}{=} \mathrm{L}^\varepsilon_{\mathrm{C}(\mathbf{x}(\theta))}(\mathbb{1}_n/n, \mathbf{b}) \quad\text{where}\quad \mathrm{C}(\mathbf{x})_{i,j} \stackrel{\text{def.}}{=} c(x(\theta)_i, y_j). \tag{9.4}$$
Note that here the cost matrix C(x(θ)) now depends on θ since the support of αθ
changes with θ. The following proposition shows that the entropic OT loss is a smooth
function of the cost matrix and gives the expression of its gradient.

Proposition 9.2 (Derivative with respect to the cost). For fixed input histograms (a, b),
for ε > 0, the mapping $\mathrm{C} \mapsto \mathcal{R}(\mathrm{C}) \stackrel{\text{def.}}{=} \mathrm{L}_{\mathrm{C}}^\varepsilon(\mathbf{a}, \mathbf{b})$ is concave and smooth, and
$$\nabla\mathcal{R}(\mathrm{C}) = \mathrm{P}, \tag{9.5}$$

where P is the unique optimal solution of (4.2). For ε = 0, this formula defines the set
of upper gradients.

Assuming $(\mathcal{X}, \mathcal{Y})$ are convex subsets of $\mathbb{R}^d$, for discrete measures (α, β) of the form (2.3), one obtains using the chain rule that $\mathbf{x} = (x_i)_{i=1}^n \in \mathcal{X}^n \mapsto \mathcal{F}(\mathbf{x}) \stackrel{\text{def.}}{=} \mathrm{L}_{\mathrm{C}(\mathbf{x})}(\mathbb{1}_n/n, \mathbf{b})$ is smooth and that
$$\nabla\mathcal{F}(\mathbf{x}) = \left( \sum_{j=1}^m \mathrm{P}_{i,j}\, \nabla_1 c(x_i, y_j) \right)_{i=1}^n \in \mathcal{X}^n, \tag{9.6}$$

where ∇1 c is the gradient with respect to the first variable. For instance, for X = Y =
Rd , for c(s, t) = ks − tk2 on X = Y = Rd , one has
$$\nabla\mathcal{F}(\mathbf{x}) = \left( 2\Big( a_i x_i - \sum_{j=1}^m \mathrm{P}_{i,j}\, y_j \Big) \right)_{i=1}^n, \tag{9.7}$$

where ai = 1/n here. Note that, up to a constant, this gradient is Id − T , where T is


the barycentric projection defined in (4.19). Using the chain rule, one thus obtains that
the Lagrangian discretized problem (9.4) is smooth and its gradient is

$$\nabla\mathcal{E}_{\mathcal{L}}(\theta) = [\partial \mathbf{x}(\theta)]^\top(\nabla\mathcal{F}(\mathbf{x}(\theta))), \tag{9.8}$$

where $\partial\mathbf{x}(\theta) \in \mathbb{R}^{(nd)\times\dim(\Theta)}$ is the Jacobian of the map $\mathbf{x}(\theta)$ and where $\nabla\mathcal{F}$ is imple-
mented as in (9.6) or (9.7) using for P the optimal coupling matrix between αθ and
β. One can thus implement a gradient descent to compute a local minimizer of EL , as
used, for instance, in [Cuturi and Doucet, 2014].
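A sketch of one such Lagrangian gradient step for the quadratic cost is given below. It is ours: it implements (9.7) with the coupling P returned by a plain Sinkhorn solver (the helper sinkhorn_coupling is hypothetical; any OT solver producing P would do), and finishes with one explicit gradient descent step on the point cloud.

```python
import numpy as np

def sinkhorn_coupling(a, b, C, eps, iters=2000):
    # plain Sinkhorn iterations returning the (approximately) optimal coupling P
    K = np.exp(-C / eps)
    v = np.ones(len(b))
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def lagrangian_gradient(X, Y, b, eps):
    # gradient (9.7) of x -> L_{C(x)}(1/n, b) for the squared cost c(x, y) = |x - y|^2
    n = len(X)
    a = np.full(n, 1 / n)
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    P = sinkhorn_coupling(a, b, C, eps)
    return 2 * (a[:, None] * X - P @ Y)

# one explicit gradient descent step with unit step size on the point cloud X
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(30, 2)), rng.normal(size=(40, 2)) + 1.0
b = np.full(40, 1 / 40)
X_new = X - 1.0 * lagrangian_gradient(X, Y, b, eps=0.1)
```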

9.1.3 Automatic Differentiation


The difficulty when applying formulas (9.3) and (9.8) is that one needs to compute
the exact optimal solutions f or P for these formulas to be valid, which can only be
achieved with acceptable precision using a very large number of Sinkhorn iterates. In
challenging situations in which the size and the quantity of histograms to be compared
are large, the computational budget to compute a single Wasserstein distance is usually
limited, therefore allowing only for a few Sinkhorn iterations. In that case, and rather
than approximating the gradient (4.30) using the value obtained at a given iterate,
it is usually better to differentiate directly the output of Sinkhorn’s algorithm, using
reverse mode automatic differentiation. This corresponds to using the “algorithmic”

Sinkhorn divergences as introduced in (4.48), rather than the quantity LεC in (4.2)
which incorporates the entropy of the regularized optimal transport, and differentiating
it directly as a composition of simple maps using the inputs, either the histogram in the
Eulerian case or the cost matrix in the Lagrangian cases. Using definitions introduced
in §4.5, this is equivalent to differentiating
$$\mathrm{D}_{\mathrm{C}}^{(L)}(\mathbf{a}(\theta), \mathbf{b}) \quad\text{or}\quad \mathrm{D}_{\mathrm{C}(\mathbf{x}(\theta))}^{(L)}(\mathbf{a}, \mathbf{b})$$

with respect to θ, in, respectively, the Eulerian and the Lagrangian cases for L large
enough.
The cost for computing the gradient of functionals involving Sinkhorn divergences
is the same as that of computation of the functional itself; see, for instance, [Bon-
neel et al., 2016, Genevay et al., 2018] for some applications of this approach. We also
refer to [Adams and Zemel, 2011] for an early work on differentiating Sinkhorn iter-
ations with respect to the cost matrix (as done in the Lagrangian framework), with
applications to learning rankings. Further details on automatic differentiation can be
found in [Griewank and Walther, 2008, Rall, 1981, Neidinger, 2010], in particular on the
“reverse mode,” which is the fastest way to compute gradients. In terms of implementa-
tion, all recent deep-learning Python frameworks feature state-of-the-art reverse-mode
differentiation and support for GPU/TPU computations [Al-Rfou et al., 2016, Abadi
et al., 2016, Pytorch, 2017]; they should be adopted for any large-scale application
of Sinkhorn losses. We strongly encourage the use of such automatic differentiation
techniques, since they have the same complexity as computing (9.3) and (9.8), these
formulas being mostly useful to obtain a theoretical understanding of what automatic
differentation is computing. The only downside is that reverse mode automatic dif-
ferentiation is memory intensive (the memory grows proportionally with the number
of iterations). There exist, however, subsampling strategies that mitigate this prob-
lem [Griewank, 1992].
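As a hedged sketch of this reverse-mode strategy (ours, using PyTorch as one possible framework), the code below differentiates the output of a fixed number of log-domain Sinkhorn iterations with respect to the support points of α, as in the Lagrangian case; the quantity $\mathrm{D}^{(L)}$ is here taken to be $\langle \mathrm{P}, \mathrm{C}\rangle$ after L iterations.

```python
import torch

def sinkhorn_loss(x, y, a, b, eps=0.1, n_iters=100):
    # differentiable Sinkhorn loss <P, C> after a fixed number of log-domain iterations
    C = torch.cdist(x, y) ** 2
    f = torch.zeros_like(a)
    g = torch.zeros_like(b)
    for _ in range(n_iters):
        f = eps * torch.log(a) - eps * torch.logsumexp((g[None, :] - C) / eps, dim=1)
        g = eps * torch.log(b) - eps * torch.logsumexp((f[:, None] - C) / eps, dim=0)
    P = torch.exp((f[:, None] + g[None, :] - C) / eps)
    return (P * C).sum()

x = torch.randn(30, 2, requires_grad=True)
y = torch.randn(40, 2) + 1.0
a = torch.full((30,), 1 / 30)
b = torch.full((40,), 1 / 40)
loss = sinkhorn_loss(x, y, a, b)
loss.backward()          # reverse-mode AD through all Sinkhorn iterations
print(x.grad.shape)      # gradient with respect to the point locations
```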

9.2 Wasserstein Barycenters, Clustering and Dictionary Learning

A basic problem in unsupervised learning is to compute the “mean” or “barycenter” of


several data points. A classical way to define such a weighted mean of points (xs )Ss=1 ∈
X S living in a metric space (X , d) (where d is a distance or more generally a divergence)
is by solving a variational problem
$$\min_{x\in\mathcal{X}} \sum_{s=1}^S \lambda_s\, d(x, x_s)^p \tag{9.9}$$

for a given family of weights (λs )s ∈ ΣS , where p is often set to p = 2. When X = Rd and
$d(x, y) = \|x - y\|_2$, this leads to the usual definition of the linear average $x = \sum_s \lambda_s x_s$

for p = 2 and the more evolved median point when p = 1. One can retrieve various
notions of means (e.g. harmonic or geometric means over X = R+ ) using this formalism.
This process is often referred to as the “Fréchet” or “Karcher” mean (see Karcher
[2014] for a historical account). For a generic distance d, Problem (9.9) is usually a
difficult nonconvex optimization problem. Fortunately, in the case of optimal transport
distances, the problem can be formulated as a convex program for which existence can
be proved and efficient numerical schemes exist.

Fréchet means over the Wasserstein space. Given input histograms $\{\mathbf{b}_s\}_{s=1}^S$, where $\mathbf{b}_s \in \Sigma_{n_s}$, and weights $\lambda \in \Sigma_S$, a Wasserstein barycenter is computed by minimizing
$$\min_{\mathbf{a}\in\Sigma_n} \sum_{s=1}^S \lambda_s\, \mathrm{L}_{\mathrm{C}_s}(\mathbf{a}, \mathbf{b}_s), \tag{9.10}$$

where the cost matrices Cs ∈ Rn×ns need to be specified. A typical setup is “Eulerian,”
so that all the barycenters are defined on the same grid, ns = n, Cs = C = Dp is set
to be a distance matrix, to solve
$$\min_{\mathbf{a}\in\Sigma_n} \sum_{s=1}^S \lambda_s\, \mathrm{W}_p^p(\mathbf{a}, \mathbf{b}_s).$$

The barycenter problem (9.10) was introduced in a more general form involving
arbitrary measures in Agueh and Carlier [2011] following earlier ideas of Carlier and
Ekeland [2010]. That presentation is deferred to Remark 9.1. The barycenter problem
for histograms (9.10) is in fact a linear program, since one can look for the S couplings
(Ps )s between each input and the barycenter itself, which by construction must be
constrained to share the same row marginal,
$$\min_{\mathbf{a}\in\Sigma_n,\; (\mathrm{P}_s \in \mathbb{R}^{n\times n_s})_s} \left\{ \sum_{s=1}^S \lambda_s\, \langle \mathrm{P}_s, \mathrm{C}_s\rangle \;:\; \forall\, s,\; \mathrm{P}_s \mathbb{1}_{n_s} = \mathbf{a},\; \mathrm{P}_s^\top \mathbb{1}_n = \mathbf{b}_s \right\}.$$
Although this problem is an LP, its scale forbids the use of generic solvers for medium-
scale problems. One can resort to using first order methods such as subgradient descent
on the dual [Carlier et al., 2015].

Remark 9.1 (Barycenter of arbitrary measures). Given a set of input measures (β_s)_s
defined on some space X, the barycenter problem becomes

    min_{α ∈ M_+^1(X)}  ∑_{s=1}^S λ_s L_c(α, β_s).        (9.11)

In the case where X = Rd and c(x, y) = kx − yk2 , Agueh and Carlier [2011] show
that if one of the input measures has a density, then this barycenter is unique.

Problem (9.11) can be viewed as a generalization of the problem of computing
barycenters of points (x_s)_{s=1}^S ∈ X^S to arbitrary measures. Indeed, if β_s = δ_{x_s} is
a single Dirac mass, then a solution to (9.11) is δ_{x^⋆}, where x^⋆ is a Fréchet mean
solving (9.9). Note that for c(x, y) = ‖x − y‖², the mean of the barycenter α^⋆ is
necessarily the barycenter of the means, i.e.

    ∫_X x dα^⋆(x) = ∑_s λ_s ∫_X x dβ_s(x),

and the support of α^⋆ is located in the convex hull of the supports of the (β_s)_s. The
consistency of the approximation of the infinite-dimensional optimization (9.11)
when approximating the input distribution using discrete ones (and thus solv-
ing (9.10) in place) is studied in Carlier et al. [2015]. Let us also note that it is
possible to recast (9.11) as a multimarginal OT problem; see Remark 10.2.

Remark 9.2 (k-means as a Wasserstein variational problem). When the family of


input measures (βs )s is limited to but one measure β, this measure is supported on
a discrete finite subset of X = Rd , and the cost is the squared Euclidean distance,
then one can show that the barycenter problem

    min_{α ∈ M_k^1(X)}  L_c(α, β),        (9.12)

where α is constrained to be a discrete measure with a finite support of size up
to k, is equivalent to the usual k-means problem applied to β. Indeed, one can easily
show that the centroids output by the k-means problem correspond to the support
of the solution α and that its weights correspond to the fraction of points in β
assigned to each centroid. One can show that approximating Lc using entropic
regularization results in smoothed out assignments that appear in soft-clustering
variants of k-means, such as mixtures of Gaussians [Dessein et al., 2017].

Remark 9.3 (Distribution of distributions and consistency). It is possible to gener-


alize (9.11) to a possibly infinite collection of measures. This problem is described
by considering a probability distribution M over the space M1+ (X ) of probability
distributions, i.e. M ∈ M1+ (M1+ (X )). A barycenter is then a solution of
    min_{α ∈ M_+^1(X)}  E_M(L_c(α, β)) = ∫_{M_+^1(X)} L_c(α, β) dM(β),        (9.13)

where β is a random measure distributed according to M . Drawing uniformly at


random a finite number S of input measures (βs )Ss=1 according to M , one can then
define β̂S as being a solution of (9.11) for uniform weights λs = 1/S (note that

here β̂_S is itself a random measure). Problem (9.11) corresponds to the special case
of a “discrete” measure M = ∑_s λ_s δ_{β_s}. The convergence (in expectation or with
high probability) of Lc (β̂S , α) to zero (where α is the unique solution to (9.13))
corresponds to the consistency of the barycenters, and is proved in [Bigot and
Klein, 2012a, Le Gouic and Loubes, 2016, Bigot and Klein, 2012b]. This can be
interpreted as a law of large numbers over the Wasserstein space. The extension of
this result to a central limit theorem is an important problem; see [Panaretos and
Zemel, 2016] and [Agueh and Carlier, 2017] for recent formulations of that problem
and solutions in particular cases (1-D distributions and Gaussian measures).

Remark 9.4 (Fixed-point map). When dealing with the Euclidean space X = Rd
with ground cost c(x, y) = kx − yk2 , it is possible to study the barycenter problem
using transportation maps. Indeed, if α has a density, according to Remark 2.24,
one can define optimal transportation maps T_s between α and α_s, in particular
such that T_{s,♯} α = α_s. The average map

    T^{(α)} := ∑_{s=1}^S λ_s T_s

(the notation above makes explicit the dependence of this map on α) is itself an
optimal map between α and T^{(α)}_♯ α (a positive combination of optimal maps is
equal, by Brenier's theorem, Remark 2.24, to a sum of gradients of convex functions,
hence to the gradient of a sum of convex functions, and therefore optimal
by Brenier's theorem again). As shown in [Agueh and Carlier, 2011], the first-order
optimality conditions of the barycenter problem (9.13) actually read T^{(α^⋆)} = I_{R^d}
(the identity map) at the optimal measure α^⋆ (the barycenter), and it is shown
in [Álvarez-Esteban et al., 2016] that the barycenter α^⋆ is the unique solution (under
regularity conditions clarified in [Zemel and Panaretos, 2018, Theo. 2]) to the fixed-point
equation

    G(α) = α   where   G(α) := T^{(α)}_♯ α.        (9.14)
Under mild conditions on the input measures, Álvarez-Esteban et al. [2016]
and Zemel and Panaretos [2018] have shown that α 7→ G(α) strictly decreases
the objective function of (9.13) if α is not the barycenter and that the fixed-point
def.
iterations α(`+1) = G(α(`) ) converge to the barycenter α? . This fixed point al-
gorithm can be used in cases where the optimal transportation maps are known
in closed form (e.g. for Gaussians). Adapting this algorithm for empirical mea-
sures of the same size results in computing optimal assignments in place of Monge
maps. For more general discrete measures of arbitrary size the scheme can also be

adapted [Cuturi and Doucet, 2014] using barycentric projections (4.19).

Special cases. In general, solving (9.10) or (9.11) is not straightforward, but there
exist some special cases for which solutions are explicit or simple.

Remark 9.5 (Barycenter of Gaussians). It is shown in [Agueh and Carlier, 2011]
that the barycenter of Gaussian distributions α_s = N(m_s, Σ_s), for the squared
Euclidean cost c(x, y) = ‖x − y‖², is itself a Gaussian N(m^⋆, Σ^⋆). Making use
of (2.41), one sees that the barycenter mean is the mean of the inputs,

    m^⋆ = ∑_s λ_s m_s,

while the covariance minimizes

    min_Σ  ∑_s λ_s B(Σ, Σ_s)²,

where B is the Bures metric (2.42). As studied in [Agueh and Carlier, 2011], the
first-order optimality condition of this convex problem shows that Σ^⋆ is the unique
positive definite fixed point of the map

    Σ^⋆ = Ψ(Σ^⋆)   where   Ψ(Σ) := ∑_s λ_s (Σ^{1/2} Σ_s Σ^{1/2})^{1/2},

where Σ^{1/2} denotes the square root of a positive semidefinite matrix. This result was
known from [Knott and Smith, 1994, Rüschendorf and Uckelmann, 2002] and is
proved in [Agueh and Carlier, 2011]. While Ψ is not strictly contracting, iterating
this fixed-point map, i.e. defining Σ^{(ℓ+1)} := Ψ(Σ^{(ℓ)}), converges in practice to the
solution Σ^⋆. This method has been applied to texture synthesis in [Xia et al., 2014].
Álvarez-Esteban et al. [2016] have also proposed to use an alternative map,

    Ψ̄(Σ) := Σ^{−1/2} ( ∑_s λ_s (Σ^{1/2} Σ_s Σ^{1/2})^{1/2} )² Σ^{−1/2},

for which the iterations Σ^{(ℓ+1)} := Ψ̄(Σ^{(ℓ)}) converge. This is because the fixed-point
map G defined in (9.14) preserves Gaussian distributions, and in fact

    G(N(m, Σ)) = N(m^⋆, Ψ̄(Σ)).

Figure 9.2 shows two examples of computations of barycenters between four 2-D
Gaussians.
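As an illustration, a minimal NumPy sketch of the covariance fixed-point iteration Σ ← Ψ(Σ) described above might read as follows (the function names are ours, matrix square roots are computed by eigendecomposition, and this is only a toy transcription of the formula, not an optimized solver).

import numpy as np

def psd_sqrt(S):
    # Square root of a symmetric positive semidefinite matrix (via eigendecomposition).
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.maximum(w, 0.0))) @ V.T

def gaussian_barycenter_cov(Sigmas, lam, n_iter=100):
    # Fixed-point iteration Sigma <- Psi(Sigma) for the barycenter covariance.
    Sigma = sum(l * S for l, S in zip(lam, Sigmas))   # simple initialization
    for _ in range(n_iter):
        R = psd_sqrt(Sigma)
        Sigma = sum(l * psd_sqrt(R @ S @ R) for l, S in zip(lam, Sigmas))
    return Sigma

The barycenter mean is obtained separately as the weighted average of the input means.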

Figure 9.2: Barycenters between four Gaussian distributions in 2-D. Each Gaussian is displayed using
an ellipse aligned with the principal axes of the covariance, and with elongations proportional to the
corresponding eigenvalues.

Remark 9.6 (1-D cases). For 1-D distributions, the W_p barycenter can be com-
puted almost in closed form using the fact that the transport is the monotone
rearrangement, as detailed in Remark 2.30. The simplest case is for empirical mea-
sures with n points, i.e. β_s = (1/n) ∑_{i=1}^n δ_{y_{s,i}}, where the points are assumed to be
sorted, y_{s,1} ≤ y_{s,2} ≤ …. Using (2.33), the barycenter α_λ is also an empirical mea-
sure on n points,

    α_λ = (1/n) ∑_{i=1}^n δ_{x_{λ,i}}   where   x_{λ,i} = A_λ((y_{s,i})_s),

where A_λ is the barycentric map

    A_λ((y_s)_s) := argmin_{x ∈ R}  ∑_{s=1}^S λ_s |x − y_s|^p.

For instance, for p = 2, one has x_{λ,i} = ∑_{s=1}^S λ_s y_{s,i}. In the general case, one needs
to use the cumulative functions as defined in (2.34), and using (2.36), one has

    ∀ r ∈ [0, 1],   C_{α_λ}^{−1}(r) = A_λ((C_{β_s}^{−1}(r))_{s=1}^S),

which can be used, for instance, to compute barycenters between discrete measures
supported on less than n points in O(n log(n)) operations, using a simple sorting
procedure.
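As an illustration, for p = 2 and uniform empirical measures given as arrays of n samples each, the construction above amounts to sorting and averaging; a minimal NumPy sketch (names are ours) could be:

import numpy as np

def barycenter_1d(samples, lam):
    # samples: list of S arrays of equal length n (uniform empirical measures);
    # lam: S barycentric weights summing to 1. For p = 2, the barycenter support
    # is the weighted average of the sorted samples (monotone rearrangement).
    sorted_samples = np.stack([np.sort(np.asarray(x)) for x in samples])  # (S, n)
    return np.sum(np.asarray(lam)[:, None] * sorted_samples, axis=0)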

Remark 9.7 (Simple cases). Denoting by T_{r,u} : x ↦ rx + u a scaling and transla-
tion, and assuming that α_s = T_{r_s,u_s,♯} α_0 is obtained by scaling and translating an
initial template measure, then a barycenter α_λ is also obtained using a scaling and
translation,

    α_λ = T_{r^⋆,u^⋆,♯} α_0   where   r^⋆ = ( ∑_s λ_s / r_s )^{−1},   u^⋆ = ∑_s λ_s u_s.

Remark 9.8 (Case S = 2). In the case where X = Rd and c(x, y) = kx − yk2 (this
can be extended more generally to geodesic spaces), the barycenter between S =
2 measures (α0 , α1 ) is the McCann interpolant as already introduced in (7.6).
Denoting T] α0 = α1 the Monge map, one has that the barycenter αλ reads αλ =
(λ1 Id + λ2 T )] α0 . Formula (7.9) explains how to perform the computation in the
discrete case.

Entropic approximation of barycenters. One can use entropic smoothing and approx-
imate the solution of (9.10) using

    min_{a ∈ Σ_n}  ∑_{s=1}^S λ_s L_{C_s}^ε(a, b_s)        (9.15)

for some ε > 0. This is a smooth convex minimization problem, which can be tackled
using gradient descent [Cuturi and Doucet, 2014, Gramfort et al., 2015]. An alternative
is to use descent methods (typically quasi-Newton) on the semi-dual [Cuturi and Peyré,
2016], which is useful to integrate additional regularizations on the barycenter, to im-
pose, for instance, some smoothness w.r.t. a given norm. A simpler yet very effective
approach, as remarked by Benamou et al. [2015], is to rewrite (9.15) as a (weighted) KL
projection problem

    min_{(P_s)_s}  { ∑_s λ_s ε KL(P_s | K_s)  :  ∀ s,  P_s^⊤ 1_n = b_s,   P_1 1_{n_1} = · · · = P_S 1_{n_S} },        (9.16)

where we denoted K_s := e^{−C_s/ε}. Here, the barycenter a is implicitly encoded in the
row marginals of all the couplings P_s ∈ R^{n×n_s} as a := P_1 1_{n_1} = · · · = P_S 1_{n_S}. As detailed
by Benamou et al. [2015], one can generalize Sinkhorn to this problem, which also
corresponds to iterative projections. This can also be seen as a special case of the
generalized Sinkhorn detailed in §4.6. The optimal couplings (P_s)_s solving (9.16) are
computed in scaling form as

    P_s = diag(u_s) K_s diag(v_s),        (9.17)



and the scalings are sequentially updated as

    ∀ s ∈ ⟦1, S⟧,   v_s^{(ℓ+1)} := b_s / (K_s^⊤ u_s^{(ℓ)}),        (9.18)
    ∀ s ∈ ⟦1, S⟧,   u_s^{(ℓ+1)} := a^{(ℓ+1)} / (K_s v_s^{(ℓ+1)}),        (9.19)
    where   a^{(ℓ+1)} := ∏_s (K_s v_s^{(ℓ+1)})^{λ_s}.        (9.20)
An alternative way to derive these iterations is to perform alternate minimization on
the variables of a dual problem, which is detailed in the following proposition.
Proposition 9.1. The optimal (us , vs ) appearing in (9.17) can be written as (us , vs ) =
(efs /ε , egs /ε ), where (fs , gs )s are the solutions of the following program (whose value
matches the one of (9.15)):
    max_{(f_s, g_s)_s}  { ∑_s λ_s ( ⟨g_s, b_s⟩ − ε ⟨K_s e^{g_s/ε}, e^{f_s/ε}⟩ )  :  ∑_s λ_s f_s = 0 }.        (9.21)

Proof. Introducing Lagrange multipliers in (9.16) leads to

    min_{(P_s)_s, a}  max_{(f_s, g_s)_s}  ∑_s λ_s ( ε KL(P_s|K_s) + ⟨a − P_s 1_{n_s}, f_s⟩ + ⟨b_s − P_s^⊤ 1_n, g_s⟩ ).

Strong duality holds, so that one can exchange the min and the max, to obtain

    max_{(f_s, g_s)_s}  ∑_s λ_s ( ⟨g_s, b_s⟩ + min_{P_s} ε KL(P_s|K_s) − ⟨P_s, f_s ⊕ g_s⟩ ) + min_a ⟨ ∑_s λ_s f_s, a ⟩.

The explicit minimization on a gives the constraint ∑_s λ_s f_s = 0 together with

    max_{(f_s, g_s)_s}  ∑_s λ_s ( ⟨g_s, b_s⟩ − ε KL^∗( (f_s ⊕ g_s)/ε | K_s ) ),
where KL^∗(·|K_s) is the Legendre transform (4.54) of the function KL(·|K_s). This
Legendre transform reads

    KL^∗(U|K) = ∑_{i,j} K_{i,j} (e^{U_{i,j}} − 1),        (9.22)

which shows the desired formula. To show (9.22), since this function is separable, one
needs to compute

    ∀ (u, k) ∈ R²_+,   KL^∗(u|k) := max_r  ur − (r log(r/k) − r + k),

whose optimality condition reads u = log(r/k), i.e. r = k e^u, hence the result.

Minimizing (9.21) with respect to each g_s, while keeping all the other variables
fixed, yields the closed-form update (9.18). Minimizing (9.21) with respect to all the
(f_s)_s requires solving for a using (9.20) and leads to the expression (9.19).
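As an illustration, a direct NumPy transcription of iterations (9.18)–(9.20) might read as follows (a toy sketch with names of our choosing; a practical implementation would perform these updates in the log domain for small ε, as discussed in §4.4).

import numpy as np

def entropic_barycenter(Cs, bs, lam, eps, n_iter=500):
    # Cs: list of (n, n_s) cost matrices; bs: list of histograms b_s; lam: weights.
    Ks = [np.exp(-C / eps) for C in Cs]
    us = [np.ones(C.shape[0]) for C in Cs]
    for _ in range(n_iter):
        vs = [b / (K.T @ u) for K, u, b in zip(Ks, us, bs)]                 # update (9.18)
        Kv = [K @ v for K, v in zip(Ks, vs)]
        a = np.prod(np.stack([kv ** l for kv, l in zip(Kv, lam)]), axis=0)  # geometric mean (9.20)
        us = [a / kv for kv in Kv]                                          # update (9.19)
    return a   # approximate barycenter histogram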
Figures 9.3 and 9.4 show applications to 2-D and 3-D shapes interpolation. Fig-
ure 9.5 shows a computation of barycenters on a surface, where the ground cost is the
square of the geodesic distance. For this figure, the computations are performed us-
ing the geodesic in heat approximation detailed in Remark 4.19. We refer to [Solomon
et al., 2015] for more details and other applications to computer graphics and imaging
sciences.

Figure 9.3: Barycenters between four input 2-D shapes using entropic regularization (9.15). To display
a binary shape, the displayed images show a thresholded density. The weights (λ_s)_s are bilinear with
respect to the four corners of the square.

The efficient computation of Wasserstein barycenters remains at this time an active


research topic [Staib et al., 2017a, Dvurechenskii et al., 2018]. Beyond their methodolog-
ical interest, Wasserstein barycenters have found many applications outside the field
of shape analysis. They have been used for image processing [Rabin et al., 2011], in
particular color modification [Solomon et al., 2015] (see Figure 9.6); Bayesian computa-
tions [Srivastava et al., 2015a,b] to summarize measures; and nonlinear dimensionality
reduction, to express an input measure as a Wasserstein barycenter of other known
measures [Bonneel et al., 2016]. All of these problems result in involved nonconvex
objective functions which can be accurately optimized using automatic differentiation
(see Remark 9.1.3). Problems closely related to the computation of barycenters include
the computation of principal component analyses over the Wasserstein space (see, for

Figure 9.4: Barycenters between four input 3-D shapes using entropic regularization (9.15). The
weights (λs )s are bilinear with respect to the four corners of the square. Shapes are represented as
measures that are uniform within the boundaries of the shape and null outside.

instance, [Seguy and Cuturi, 2015, Bigot et al., 2017b]) and the statistical estimation
of template models [Boissard et al., 2015]. The ability to compute barycenters enables
more advanced clustering methods such as the k-means on the space of probability
measures [del Barrio et al., 2016, Ho et al., 2017].

Figure 9.5: Barycenters interpolation between two input measures on surfaces, computed using the
geodesic in heat fast kernel approximation (see Remark 4.19). Extracted from [Solomon et al., 2015].

Remark 9.9 (Wasserstein propagation). As studied in Solomon et al. [2014b], it is


possible to generalize the barycenter problem (9.10), where one looks for distribu-
tions (bu )u∈U at some given set U of nodes in a graph G given a set of fixed input
distributions (b_v)_{v∈V} on the complementary set V of the nodes. The unknowns are
determined by minimizing the overall transportation distance between all pairs of

Figure 9.6: Interpolation between the two 3-D color empirical histograms of two input images (here
only the 2-D chromatic projection is visualized for simplicity). The modified histogram is then applied
to the input images using barycentric projection as detailed in Remark 4.11. Extracted from [Solomon
et al., 2015].

nodes (r, s) ∈ G forming edges in the graph


    min_{(b_u ∈ Σ_{n_u})_{u∈U}}  ∑_{(r,s)∈G} L_{C_{r,s}}(b_r, b_s),        (9.23)

where the cost matrices Cr,s ∈ Rnr ×ns need to be specified by the user. The
barycenter problem (9.10) is a special case of this problem where the considered
graph G is “star shaped,” where U is a single vertex connected to all the other
vertices V (the weight λs associated to bs can be absorbed in the cost matrix).
Introducing explicitly a coupling Pr,s ∈ U(br , bs ) for each edge (r, s) ∈ G, and
using entropy regularization, one can rewrite this problem similarly as in (9.16),
and one extends Sinkhorn iterations (9.18) to this problem (this can also be de-
rived by recasting this problem in the form of the generalized Sinkhorn algorithm
detailed in §4.6). This discrete variational problem (9.23) on a graph can be gen-
eralized to define a Dirichlet energy when replacing the graph by a continuous do-
main [Solomon et al., 2013]. This in turn leads to the definition of measure-valued
harmonic functions which finds application in image and surface processing. We
refer also to Lavenant [2017] for a theoretical analysis and to Vogt and Lellmann
[2018] for extensions to nonquadratic (total-variation) functionals and applications
to imaging.

9.3 Gradient Flows

Given a smooth function a 7→ F (a), one can use the standard gradient descent

    a^{(ℓ+1)} := a^{(ℓ)} − τ ∇F(a^{(ℓ)}),        (9.24)

where τ is a small enough step size. This corresponds to a so-called “explicit” minimiza-
tion scheme and only applies for smooth functions F . For nonsmooth functions, one
can use instead an “implicit” scheme, which is also called the proximal-point algorithm
(see, for instance, Bauschke and Combettes [2011])
    a^{(ℓ+1)} := Prox^{‖·‖}_{τF}(a^{(ℓ)}) := argmin_a  (1/2) ‖a − a^{(ℓ)}‖² + τ F(a).        (9.25)
Note that this corresponds to the Euclidean proximal operator, already encountered
in (7.13). The update (9.24) can be understood as iterating the explicit operator Id −
τ ∇F , while (9.25) makes use of the implicit operator (Id + τ ∇F )−1 . For convex F ,
iterations (9.25) always converge, for any value of τ > 0.
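As an illustration of the implicit scheme for a nonsmooth function, consider F(a) = ‖a‖_1, whose Euclidean proximal map (9.25) has a closed form (soft thresholding); a minimal NumPy sketch (names are ours):

import numpy as np

def prox_l1(a, tau):
    # Proximal map (9.25) of F(a) = ||a||_1: soft thresholding at level tau.
    return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

a = np.array([1.5, -0.2, 3.0])
for _ in range(10):
    a = prox_l1(a, tau=0.1)   # implicit (proximal-point) iterations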
If the function F is defined on the simplex of histograms Σn , then it makes sense to
use an optimal transport metric in place of the `2 norm k·k in (9.25), in order to solve

    a^{(ℓ+1)} := argmin_a  W_p(a, a^{(ℓ)})^p + τ F(a).        (9.26)

Remark 9.10 (Wasserstein gradient flows). Equation (9.26) can be generalized to


arbitrary measures by defining the iteration

    α^{(ℓ+1)} := argmin_α  W_p(α, α^{(ℓ)})^p + τ F(α)        (9.27)

for some function F defined on M1+ (X ). This implicit time stepping is a useful tool
to construct continuous flows, by formally taking the limit τ → 0 and introducing
the time t = τ `, so that α(`) is intended to approximate a continuous flow t ∈
R+ 7→ αt . For the special case p = 2 and X = Rd , a formal calculus shows that αt
is expected to solve a PDE of the form
    ∂α_t/∂t = div(α_t ∇(F′(α_t))),        (9.28)

where F′(α) denotes the derivative of the function F in the sense that it is a
continuous function F′(α) ∈ C(X) such that

    F(α + εξ) = F(α) + ε ∫_X F′(α) dξ(x) + o(ε).

A typical example is when using F = −H, where H is the entropy relative to the
Lebesgue measure L_{R^d} on X = R^d,

    H(α) = − ∫_{R^d} ρ_α(x) (log(ρ_α(x)) − 1) dx        (9.29)

(setting H(α) = −∞ when α does not have a density), then (9.28) shows that the
gradient flow of this neg-entropy is the linear heat diffusion
    ∂α_t/∂t = Δα_t,        (9.30)
where Δ is the spatial Laplacian. The heat diffusion can therefore be interpreted
either as the “classical” Euclidean flow (somehow performing “vertical” movements
with respect to mass amplitudes) of the Dirichlet energy ∫_{R^d} ‖∇ρ_α(x)‖² dx or, al-
ternatively, as the optimal transport flow of the neg-entropy (somehow a “horizontal”
movement with respect to mass positions). Interest in Wasserstein gradient flows
was sparked by the seminal paper of Jordan, Kinderlehrer and Otto [Jordan et al.,
1998], and these evolutions are often called “JKO flows” following their work. As
shown in detail in the monograph by Ambrosio et al. [2006], JKO flows are a
special case of gradient flows in metric spaces. We also refer to the recent survey
paper [Santambrogio, 2017]. JKO flows can be used to study in particular non-
linear evolution equations such as the porous medium equation [Otto, 2001], total
variation flows [Carlier and Poon, 2019], quantum drifts [Gianazza et al., 2009],
or heat evolutions on manifolds [Erbar, 2010]. Their flexible formalism allows for
constraints on the solution, such as the congestion constraint (an upper bound on
the density at any point) that Maury et al. used to model crowd motion [Maury
et al., 2010] (see also the review paper [Santambrogio, 2018]).

Remark 9.11 (Gradient flows in metric spaces). The implicit stepping (9.27) is a
special case of a more general formalism to define gradient flows over metric spaces
(X , d), where d is a distance, as detailed in [Ambrosio et al., 2006]. For some func-
tion F(x) defined for x ∈ X, the implicit discrete minimization step is then defined
as

    x^{(ℓ+1)} ∈ argmin_{x∈X}  d(x^{(ℓ)}, x)² + τ F(x).        (9.31)
The JKO step (9.27) corresponds to the use of the Wasserstein distance on the
space of probability distributions. In some cases, one can show that (9.31) admits
a continuous flow limit x_t as τ → 0 and ℓτ = t. In the case that X also has a
Euclidean structure, an explicit stepping is defined by linearizing F,

    x^{(ℓ+1)} := argmin_{x∈X}  d(x^{(ℓ)}, x)² + τ ⟨∇F(x^{(ℓ)}), x⟩.        (9.32)

In sharp contrast to the implicit formula (9.31), it is usually straightforward to
compute but can be unstable. The implicit step is always stable, is also defined for
nonsmooth F, but is usually not accessible in closed form. Figure 9.7 illustrates
this concept on the function F(x) = ‖x‖² on X = R² for the distances
d(x, y) = ‖x − y‖_p = (|x_1 − y_1|^p + |x_2 − y_2|^p)^{1/p} for several values of p. The explicit scheme (9.32)
is unstable for p = 1 and p = +∞, and for p = 1 it gives axis-aligned steps
(coordinatewise descent). In contrast, the implicit scheme (9.31) is stable. Note in
particular how, for p = 1, when the two coordinates are equal, the following step
operates in the diagonal direction.


Figure 9.7: Comparison of explicit and implicit gradient flow to minimize the function f (x) = kxk2
on X = R2 for the distances d(x, y) = kx − ykp for several values of p.

Remark 9.12 (Lagrangian discretization using particle systems). The finite-
dimensional problem in (9.26) can be interpreted as the Eulerian discretization
of a flow over the space of measures (9.27). An alternative way to discretize
the problem, using the so-called Lagrangian method based on particle systems, is
to parameterize instead the solution as a (discrete) empirical measure moving
with time, where the locations of that measure (and not its weights) become
the variables of interest. In practice, one can consider a dynamic point cloud
of particles α_t = (1/n) ∑_{i=1}^n δ_{x_i(t)} indexed with time. The initial problem (9.26) is
then replaced by a set of n coupled ODEs prescribing the dynamics of the points
X(t) = (x_i(t))_i ∈ X^n. If the energy F is finite for discrete measures, then one
can simply define F(X) = F((1/n) ∑_{i=1}^n δ_{x_i}). Typical examples are linear functions
F(α) = ∫_X V(x) dα(x) and quadratic interactions F(α) = ∫_{X²} W(x, y) dα(x) dα(y),
in which case one can use, respectively,

    F(X) = (1/n) ∑_i V(x_i)   and   F(X) = (1/n²) ∑_{i,j} W(x_i, x_j).

For functions such as generalized entropy, which are only finite for measures having
densities, one should apply a density estimator to convert the point cloud into a
density, which allows us to also define a function F(X) consistent with F as n → +∞.
A typical example is the entropy F(α) = H(α) defined in (9.29), for which a
consistent estimator (up to a constant term) can be obtained by summing the
logarithms of the distances to nearest neighbors,

    F(X) = (1/n) ∑_i log(d_X(x_i))   where   d_X(x) := min_{x′ ∈ X, x′ ≠ x} ‖x − x′‖;        (9.33)

see Beirlant et al. [1997] for a review of nonparametric entropy estimators. For
small enough step sizes τ , assuming X = Rd , the Wasserstein distance W 2
matches the Euclidean distance on the points, i.e. if |t − t0 | is small enough,
W2 (αt , αt0 ) = kX(t) − X(t0 )k. The gradient flow is thus equivalent to the Euclidean
flow on positions X 0 (t) = −∇F(X(t)), which is discretized for times tk = τ k simi-
larly to (9.24) using explicit Euler steps

    X^{(ℓ+1)} := X^{(ℓ)} − τ ∇F(X^{(ℓ)}).

Figure 9.8 shows an example of such a discretized explicit evolution for a linear plus
entropy functional, resulting in a discretized version of a Fokker–Planck equation.
Note that for this particular case of linear Fokker–Planck equation, it is possible
also to resort to stochastic PDEs methods, and it can be approximated numerically
by evolving a single random particle with a Gaussian drift. The convergence of
these schemes (so-called Langevin Monte Carlo) to the stationary distribution can
in turn be quantified in terms of Wasserstein distance; see, for instance, [Dalalyan
and Karagulyan, 2017]. If the function F is not smooth, one should discretize
similarly to (9.25) using implicit Euler steps, i.e. consider

    X^{(ℓ+1)} := Prox^{‖·‖}_{τF}(X^{(ℓ)}) := argmin_{Z ∈ X^n}  (1/2) ‖Z − X^{(ℓ)}‖² + τ F(Z).

In the simplest case of a linear function F(α) = ∫_X V(x) dα(x), the flow operates
independently over each particle x_i(t) and corresponds to a usual Euclidean flow
for the function V, x′_i(t) = −∇V(x_i(t)) (and is an advection PDE of the density
along the integral curves of the flow).
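As an illustration, a minimal NumPy sketch of such a Lagrangian scheme for F(α) = ∫ V dα − H(α) (the setting of Figure 9.8) might read as follows; all names are ours, constant normalization factors are absorbed into the step size τ, and the nearest-neighbor index in (9.33) is treated as locally constant when differentiating.

import numpy as np

def flow_step(X, tau, grad_V):
    # One explicit Euler step of the particle flow for the (per-particle) energy
    #   E(X) = sum_i V(x_i) - sum_i log d_X(x_i),
    # with d_X the nearest-neighbor distance of (9.33).
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    nn = np.argmin(D, axis=1)             # nearest neighbor of each point
    g = grad_V(X).astype(float)           # gradient of the drift part
    for i in range(n):
        j = nn[i]
        diff = (X[i] - X[j]) / D[i, j] ** 2
        g[i] -= diff                      # d/dx_i of -log ||x_i - x_j||
        g[j] += diff                      # d/dx_j of the same term
    return X - tau * g

# Toy Fokker-Planck-like evolution with V(x) = ||x||^2 (cf. Figure 9.8).
X = 3.0 * np.random.randn(200, 2)
for _ in range(100):
    X = flow_step(X, tau=0.01, grad_V=lambda Y: 2.0 * Y)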

Remark 9.13 (Geodesic convexity). An important concept related to gradient flows


is the convexity of the functional F with respect to the Wasserstein-2 geometry, i.e.
the convexity of F along Wasserstein geodesics (i.e. displacement interpolations as
shown in Remark 7.1). The Wasserstein gradient flow (with a continuous time) for

Figure 9.8: Example of gradient flow evolutions using a Lagrangian discretization (snapshots at
t = 0, 0.2, 0.4, 0.6, 0.8), for the function F(α) = ∫ V dα − H(α), for V(x) = ‖x‖². The entropy is
discretized using (9.33). The limiting stationary distribution is a Gaussian.

such a function exists, is unique, and is the limit of the discrete stepping (9.27)
as τ → 0. It converges to a fixed stationary distribution as t → +∞. The entropy
is a typical example of a geodesically convex function, and so are linear functions
of the form F(α) = ∫_X V(x) dα(x) and quadratic interaction functions F(α) =
∫_{X×X} W(x, y) dα(x) dα(y) for convex functions V : X → R, W : X × X → R. Note
that while linear functions are convex in the classical sense, quadratic interaction
functions might fail to be. A typical example is W(x, y) = ‖x − y‖², which is a
negative semidefinite kernel (see Definition 8.3) and thus corresponds to F(α)
being a concave function in the usual sense (while it is geodesically convex). An
important result of McCann [1997] is that generalized “entropy” functions of the
form F(α) = ∫_{R^d} φ(ρ_α(x)) dx on X = R^d are geodesically convex if φ is convex,
with φ(0) = 0, φ(t)/t → +∞ as t → +∞, and such that s ↦ s^d φ(s^{−d}) is convex
and decaying.

There is important literature on the numerical resolution of the resulting discretized


flow, and we give only a few representative publications. For 1-D problems, very precise
solvers have been developed because OT is a quadratic functional in the inverse cumu-
lative function (see Remark 2.30): Kinderlehrer and Walkington [1999], Blanchet et al.
[2008], Agueh and Bowles [2013], Matthes and Osberger [2014], Blanchet and Carlier
[2015]. In higher dimensions, it can be tackled using finite elements and finite volume
schemes: Carrillo et al. [2015], Burger et al. [2010]. Alternative solvers are obtained
using Lagrangian schemes (i.e. particle systems): Carrillo and Moll [2009], Benamou
et al. [2016a], Westdickenberg and Wilkening [2010]. Another direction is to look for
discrete flows (typically on discrete grids or graphs) which maintain some properties of
their continuous counterparts; see Mielke [2013], Erbar and Maas [2014], Chow et al.
[2012], Maas [2011].
An approximate approach to solve the Eulerian discretized problem (9.26) relying
on entropic regularization was initially proposed in Peyré [2015], refined in Chizat et al.

[2018b] and theoretically analyzed in Carlier et al. [2017]. With an entropic regular-
ization, Problem (9.26) has the form (4.49) when setting G = ιa(`) and replacing F
by τ F . One can thus use the iterations (4.51) to approximate a(`+1) as proposed ini-
tially in Peyré [2015]. The convergence of this scheme as ε → 0 is proved in Carlier
et al. [2017]. Figure 9.9 shows an example of evolution computed with this method.
An interesting application of gradient flows to machine learning is to learn the un-
derlying function F that best models some dynamical model of density. This learning
can be achieved by solving a smooth nonconvex optimization using entropic regularized
transport and automatic differentiation (see Remark 9.1.3); see Hashimoto et al. [2016].
Analyzing the convergence of gradient flows discretized in both time and space is
difficult in general. Due to the polyhedral nature of the linear program defining the
distance, using too-small step sizes leads to a “locking” phenomenon (the distribution is
stuck and does not evolve, so that the step size should be not too small, as discussed
in [Maury and Preux, 2017]). We refer to [Matthes and Osberger, 2014, 2017] for a
convergence analysis of a discretization method for gradient flows in one dimension.

Figure 9.9: Examples of gradient flow evolutions (snapshots at t = 0, 5, 10, 20), with drift V and
congestion terms (from Peyré [2015]), so that F(α) = ∫_X V(x) dα(x) + ι_{≤κ}(ρ_α).

It is also possible to compute gradient flows for unbalanced optimal transport dis-
tances as detailed in §10.2. This results in evolutions allowing mass creation or de-
struction, which is crucial to model many physical, biological or chemical phenomena.
An example of unbalanced gradient flow is the celebrated Hele-Shaw model for cell
growth [Perthame et al., 2014], which is studied theoretically in [Gallouët and Mon-
saingeon, 2017, Di Marino and Chizat, 2017]. Such an unbalanced gradient flow also

can be approximated using the generalized Sinkhorn algorithm [Chizat et al., 2018b].

9.4 Minimum Kantorovich Estimators

Given some discrete samples (xi )ni=1 ⊂ X from some unknown distribution, the goal is
to fit a parametric model θ 7→ αθ ∈ M(X ) to the observed empirical input measure β
    min_{θ ∈ Θ}  L(α_θ, β)   where   β = (1/n) ∑_i δ_{x_i},        (9.34)

where L is some “loss” function between a discrete and a “continuous” (arbitrary)


distribution (see Figure 9.10).
In the case where α_θ has a density ρ_θ := ρ_{α_θ} with respect to the Lebesgue measure
(or any other fixed reference measure), the maximum likelihood estimator (MLE) is
obtained by solving

    min_θ  L_MLE(α_θ, β) := − ∑_i log(ρ_θ(x_i)).

This corresponds to using an empirical counterpart of a Kullback–Leibler loss since,
assuming the x_i are i.i.d. samples of some β̄, then

    L_MLE(α, β) → KL(α|β̄)   as   n → +∞.


Figure 9.10: Schematic display of the density fitting problem 9.34.

This MLE approach is known to lead to optimal estimation procedures in many


cases (see, for instance, Owen [2001]). However, it fails to work when estimating singular
distributions, typically when the αθ does not have a density (so that LMLE (αθ , β) =
+∞) or when (xi )i are samples from some singular β̄ (so that the αθ should share the
same support as β for KL(αθ |β̄) to be finite, but this support is usually unknown).
Another issue is that in several cases of practical interest, the density ρθ is inaccessible
(or too hard to compute).
A typical setup where both problems (singular and unknown densities) occur is for
so-called generative models, where the parametric measure is written as a push-forward
of a fixed reference measure ζ ∈ M(Z)

αθ = hθ,] ζ where hθ : Z → X ,

where the push-forward operator is introduced in Definition 2.1. The space Z is usually
low-dimensional, so that the support of αθ is localized along a low-dimensional “mani-
fold” and the resulting measure is highly singular (it does not have a density with respect
to Lebesgue measure). Furthermore, computing this density is usually intractable, while
generating i.i.d. samples from αθ is achieved by computing xi = hθ (zi ), where (zi )i are
i.i.d. samples from ζ.
In order to cope with such a difficult scenario, one has to use weak metrics in place
of the MLE functional LMLE , which needs to be written in dual form as
    L(α, β) := max_{(f,g) ∈ C(X)²}  { ∫_X f(x) dα(x) + ∫_X g(x) dβ(x)  :  (f, g) ∈ R }.        (9.35)

Dual norms shown in §8.2 correspond to imposing

R = {(f, −f ) : f ∈ B} ,

while optimal transport (2.24) sets R = R(c) as defined in (2.25).


For a fixed θ, evaluating the energy to be minimized in (9.34) using such a loss
function corresponds to solving a semidiscrete optimal transport, which is the focus
of Chapter 5. Minimizing the energy with respect to θ is much more involved and is
typically highly nonconvex.
Denoting fθ a solution to (9.35) when evaluating E(θ) = L(αθ , β), a subgradient is
obtained using the formula
    ∇E(θ) = ∫_X [∂h_θ(x)]^⊤ ∇f_θ(x) dα_θ(x),        (9.36)

where ∂hθ (x) ∈ Rdim(Θ)×d is the differential (with respect to θ) of θ ∈ Rdim(Θ) 7→ hθ (x),
while ∇fθ (x) is the gradient (with respect to x) of fθ . This formula is hard to use
numerically, first because it requires computing a continuous function f_θ, which
is a solution to a semidiscrete problem. As shown in §8.5, for the OT loss, this can be
achieved using stochastic optimization, but this is hardly applicable in high dimension.
Another option is to impose a parametric form for this potential, for instance expansion
in an RKHS (Genevay et al. [2016]) or a deep-network approximation ([Arjovsky et al.,
2017]). This, however, leads to important approximation errors that are not yet analyzed
theoretically. A last issue is that it is unstable numerically because it requires the
computation of the gradient ∇fθ of the dual potential fθ .
For the OT loss, an alternative gradient formula is obtained when one rather com-
putes a primal optimal coupling for the following equivalent problem:
    L_c(α_θ, β) = min_{γ ∈ M(Z×X)}  { ∫_{Z×X} c(h_θ(z), x) dγ(z, x)  :  γ ∈ U(ζ, β) }.        (9.37)

Note that in the semidiscrete case considered here, the objective to be minimized can

be actually decomposed as

    min_{(γ_i)_{i=1}^n}  ∑_{i=1}^n ∫_Z c(h_θ(z), x_i) dγ_i(z)   where   ∑_{i=1}^n γ_i = ζ,   ∫_Z dγ_i(z) = 1/n,        (9.38)

where each γ_i ∈ M_+(Z). Once an optimal (γ_{θ,i})_i solving (9.38) is obtained, the gradient
of E(θ) is computed as

    ∇E(θ) = ∑_{i=1}^n ∫_Z [∂h_θ(z)]^⊤ ∇_1 c(h_θ(z), x_i) dγ_{θ,i}(z),

where ∇1 c(x, y) ∈ Rd is the gradient of x 7→ c(x, y). Note that as opposed to (9.36),
this formula does not involve computing the gradient of the potentials being solutions
of the dual OT problem.
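As an illustration, in the 1-D case with equal numbers of samples and c(x, y) = |x − y|², the optimal coupling in (9.38) is the monotone (sorted) assignment (Remark 2.30), and the gradient formula above becomes a simple sum. A minimal NumPy sketch for an affine generator h_θ(z) = θ_1 z + θ_2 (a toy example with names of our choosing, the assignment being treated as locally constant):

import numpy as np

def mke_grad_1d(theta, z, x):
    # Gradient of E(theta) = W_2^2(h_theta# zeta, beta) for h_theta(z) = theta[0]*z + theta[1],
    # with zeta and beta given by equal-size 1-D samples z and x; the optimal coupling
    # is the monotone (sorted) assignment.
    hz = theta[0] * z + theta[1]
    i = np.argsort(hz)            # sort generated samples
    j = np.argsort(x)             # sort data samples
    diff = hz[i] - x[j]           # paired by the monotone coupling
    # [dh/dtheta](z) = (z, 1) and grad_1 c(x, y) = 2 (x - y)
    return np.array([np.mean(2.0 * diff * z[i]), np.mean(2.0 * diff)])

rng = np.random.default_rng(0)
z = rng.standard_normal(500)                  # reference samples from zeta
x = 2.0 * rng.standard_normal(500) + 3.0      # observed samples
theta = np.array([1.0, 0.0])
for _ in range(200):
    theta -= 0.1 * mke_grad_1d(theta, z, x)   # gradient descent on theta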
The class of estimators obtained using L = Lc , often called “minimum Kantorovich
estimators,” was initially introduced in [Bassetti et al., 2006]; see also [Canas and
Rosasco, 2012]. It has been used in the context of generative models by [Montavon et al.,
2016] to train restricted Boltzmann machines and in [Bernton et al., 2017] in conjunction
with approximate Bayesian computations. Approximations of these computations using
deep networks are used to train deep generative models for both GANs [Arjovsky et al.,
2017] and VAEs [Bousquet et al., 2017]; see also [Genevay et al., 2018, 2017, Salimans
et al., 2018]. Note that the use of Sinkhorn divergences for parametric model fitting
is used routinely for shape matching and registration, see [Gold et al., 1998, Chui and
Rangarajan, 2000, Myronenko and Song, 2010, Feydy et al., 2017].

Remark 9.14 (Metric learning and transfer learning). Let us insist on the fact that,
for applications in machine learning, the success of OT-related methods very much
depends on the choice of an adapted cost c(x, y) which captures the geometry of the
data. While it is possible to embed many kinds of data in Euclidean spaces (see, for
instance, [Mikolov et al., 2013] for words embedding), in many cases, some sort of
adaptation or optimization of the metric is needed. Metric learning for supervised
tasks is a classical problem (see, for instance, [Kulis, 2012, Weinberger and Saul,
2009]) and it has been extended to the learning of the ground metric c(x, y) when
some OT distance is used in a learning pipeline [Cuturi and Avis, 2014] (see also Zen
et al. 2014, Wang and Guibas 2012, Huang et al. 2016). Let us also mention the
related inverse problem of learning the cost matrix from the observations of an
optimal coupling P, which can be regularized using a low-rank prior [Dupuy et al.,
2016]. Related problems are transfer learning [Pan and Yang, 2010] and domain
adaptation [Glorot et al., 2011], where one wants to transfer some trained machine
learning pipeline to adapt it to some new dataset. This problem can be modeled

and solved using OT techniques; see [Courty et al., 2017b,a].


10
Extensions of Optimal Transport

This chapter details several variational problems that are related to (and share the same
structure of) the Kantorovich formulation of optimal transport. The goal is to extend
optimal transport to more general settings: several input histograms and measures,
unnormalized ones, more general classes of measures, and optimal transport between
measures that focuses on local regularities (points nearby in the source measure should
be mapped onto points nearby in the target measure) rather than a total transport
cost, including cases where these two measures live in different metric spaces.

10.1 Multimarginal Problems

Instead of coupling two input histograms using the Kantorovich formulation (2.11),
one can couple S histograms (as )Ss=1 , where as ∈ Σns , by solving the following multi-
marginal problem:
    min_{P ∈ U((a_s)_s)}  ⟨C, P⟩ = ∑_{i_1,…,i_S} C_{i_1,…,i_S} P_{i_1,…,i_S},        (10.1)

where the set of valid couplings is

    U((a_s)_s) := { P ∈ R_+^{n_1×···×n_S}  :  ∀ s, ∀ i_s,   ∑_{ℓ≠s} ∑_{i_ℓ=1}^{n_ℓ} P_{i_1,…,i_S} = a_{s,i_s} }.

The entropic regularization scheme (4.2) naturally extends to this setting

    min_{P ∈ U((a_s)_s)}  ⟨P, C⟩ − ε H(P),


and one can then apply Sinkhorn's algorithm to compute the optimal P in scaling form,
where each entry indexed by a multi-index vector i = (i_1, …, i_S) reads

    P_i = K_i ∏_{s=1}^S u_{s,i_s}   where   K := e^{−C/ε},

where the u_s ∈ R_+^{n_s} are (unknown) scaling vectors, which are iteratively updated, by cycling
repeatedly through s = 1, …, S,

    u_{s,i_s} ← a_{s,i_s} / ( ∑_{ℓ≠s} ∑_{i_ℓ=1}^{n_ℓ} K_i ∏_{r≠s} u_{r,i_r} ).        (10.2)
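As an illustration, a direct (dense) NumPy transcription of update (10.2) might read as follows; it stores the full tensor K and performs O(∏_s n_s) work per update, so it is only meant for very small problems (names are ours).

import numpy as np

def multimarginal_sinkhorn(C, a_list, eps, n_iter=200):
    # C: cost tensor of shape (n_1, ..., n_S); a_list: the S marginals.
    K = np.exp(-C / eps)
    S = len(a_list)
    us = [np.ones(n) for n in C.shape]
    for _ in range(n_iter):
        for s in range(S):
            T = K
            for r in range(S):
                if r != s:
                    shape = [1] * S
                    shape[r] = -1
                    T = T * us[r].reshape(shape)   # multiply in the scalings u_r, r != s
            denom = T.sum(axis=tuple(r for r in range(S) if r != s))
            us[s] = a_list[s] / denom              # update (10.2)
    # optimal coupling P_i = K_i * prod_s u_{s, i_s}
    P = K
    for s in range(S):
        shape = [1] * S
        shape[s] = -1
        P = P * us[s].reshape(shape)
    return P

The factorization of the cost mentioned in Remark 10.3 is what makes this update affordable in practice; the dense sketch above does not exploit it.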
Remark 10.1 (General measures). The discrete multimarginal problem (10.1) is
generalized to measures (αs )s on spaces (X1 , . . . , XS ) by computing a coupling
measure

    min_{π ∈ U((α_s)_s)}  ∫_{X_1×···×X_S} c(x_1, …, x_S) dπ(x_1, …, x_S),        (10.3)

where the set of couplings is

    U((α_s)_s) := { π ∈ M_+^1(X_1 × ··· × X_S)  :  ∀ s = 1, …, S,  P_{s,♯} π = α_s },

where Ps : X1 × . . . × XS → Xs is the projection on the sth component,


Ps (x1 , . . . , xS ) = xs ; see, for instance, [Gangbo and Swiech, 1998]. We refer to [Pass,
2015, 2012] for a review of the main properties of the multimarginal OT problem.
A typical application of multimarginal OT is to compute approximations of so-
lutions to quantum chemistry problems, and in particular, in density functional
theory [Cotar et al., 2013, Gori-Giorgi et al., 2009, Buttazzo et al., 2012]. This
problem is obtained when considering the singular Coulomb interaction cost
    c(x_1, …, x_S) = ∑_{i≠j} 1 / ‖x_i − x_j‖.

Remark 10.2 (Multimarginal formulation of the barycenter). It is possible to recast


the linear program optimization (9.11) as an optimization over a single coupling
over X^{S+1}, where the last marginal is the barycenter and the other ones are the
input measures (α_s)_{s=1}^S:

    min_{π̄ ∈ M_+^1(X^{S+1})}  ∫_{X^{S+1}} ∑_{s=1}^S λ_s c(x, x_s) dπ̄(x_1, …, x_S, x)        (10.4)

subject to ∀ s = 1, …, S, P_{s,♯} π̄ = α_s.
This stems from the “gluing lemma,” which states that given couplings (πs )Ss=1

where π_s ∈ U(α_s, α), one can construct a higher-dimensional coupling π̄ ∈
M_+^1(X^{S+1}) with marginals π_s, i.e. such that Q_{s,♯} π̄ = π_s, where Q_s(x_1, …, x_S, x) :=
(x_s, x) ∈ X². By explicitly minimizing in (10.4) with respect to the last marginal
(associated to x ∈ X), one obtains that solutions α of the barycenter problem (9.11)
can be computed as α = A_{λ,♯} π, where A_λ is the “barycentric map” defined as

    A_λ : (x_1, …, x_S) ∈ X^S ↦ argmin_{x ∈ X}  ∑_s λ_s c(x, x_s)

(assuming this map is single-valued), where π is any solution of the multimarginal


problem (10.3) with cost
    c(x_1, …, x_S) = ∑_ℓ λ_ℓ c(x_ℓ, A_λ(x_1, …, x_S)).        (10.5)

For instance, for c(x, y) = kx − yk2 , one has, removing the constant squared terms,
    c(x_1, …, x_S) = − ∑_{r≤s} λ_r λ_s ⟨x_r, x_s⟩,

which is a problem studied in Gangbo and Swiech [1998]. We refer to Agueh and
Carlier [2011] for more details. This formula shows that if all the input measures
are discrete, β_s = ∑_{i_s=1}^{n_s} a_{s,i_s} δ_{x_{s,i_s}}, then the barycenter α is also discrete and is
obtained using the formula

    α = ∑_{(i_1,…,i_S)} P_{(i_1,…,i_S)} δ_{A_λ(x_{i_1},…,x_{i_S})},

where P is an optimal solution of (10.1) with cost matrix C_{i_1,…,i_S} = c(x_{i_1}, …, x_{i_S})
as defined in (10.5). Since P is a nonnegative tensor of ∏_s n_s dimensions obtained
as the solution of a linear program with ∑_s n_s − S + 1 equality constraints, an
optimal solution P with up to ∑_s n_s − S + 1 nonzero values can be obtained. A
barycenter α with a support of up to ∑_s n_s − S + 1 points can therefore be obtained.
This result and other considerations in the discrete case can be found in Anderes
et al. [2016].

Remark 10.3 (Relaxation of Euler equations). A convex relaxation of Euler equa-


tions of incompressible fluid dynamics has been proposed by Brenier (1990,
1993, 1999, 2008) and [Ambrosio and Figalli, 2009]. Similarly to the setting ex-
posed in §7.6, it corresponds to the problem of finding a probability distribution
π̄ ∈ M_+^1(X̄) over the set X̄ of all paths γ : [0, 1] → X, which describes the
movement of particles in the fluid. This is a relaxed version of the initial partial
differential equation model because, as in the Kantorovich formulation of OT, mass
can be split. The evolution with time does not necessarily define a diffeomorphism
of the underlying space X. The dynamic of the fluid is obtained by minimizing as
in (7.17) the energy ∫_0^1 ‖γ′(t)‖² dt of each path. The difference with OT over the
space of paths is the additional incompressibility of the fluid. This incompressibility
is taken care of by imposing that the density of particles should be uniform at
any time t ∈ [0, 1] (and not just imposed at initial and final times t ∈ {0, 1} as in
classical OT). Assuming X is compact and denoting ρ_X the uniform distribution
on X, this reads P̄_{t,♯} π̄ = ρ_X, where P̄_t : γ ∈ X̄ ↦ γ(t) ∈ X. One can discretize
this problem by replacing a continuous path (γ(t))_{t∈[0,1]} by a sequence of S points
(x_{i_1}, x_{i_2}, …, x_{i_S}) on a grid (x_k)_{k=1}^n ⊂ X; π̄ is then represented by an S-way cou-
pling P ∈ R^{n^S} in U((a_s)_s), where the marginals are uniform, a_s = n^{−1} 1_n. The cost of
the corresponding multimarginal problem is then

    C_{i_1,…,i_S} = ∑_{s=1}^{S−1} ‖x_{i_s} − x_{i_{s+1}}‖² + R ‖x_{σ(i_1)} − x_{i_S}‖².        (10.6)

Here R is a large enough penalization constant, which is here to enforce the move-
ment of particles between initial and final times, which is prescribed by a per-
mutation σ : ⟦n⟧ → ⟦n⟧. This resulting multimarginal problem is implemented
efficiently in conjunction with Sinkhorn iterations (10.2) using the special struc-
ture of the cost, as detailed in [Benamou et al., 2015]. Indeed, in place of the O(n^S)
cost required to compute the denominator appearing in (10.2), one can decompose
it as a succession of S matrix–vector multiplications, hence with a low cost of Sn².
Note that other solvers have been proposed, for instance, using the semidiscrete
framework shown in §5.2; see [de Goes et al., 2015, Gallouët and Mérigot, 2017].

10.2 Unbalanced Optimal Transport

A major bottleneck of optimal transport in its usual form is that it requires the two
input measures (α, β) to have the same total mass. While many workarounds have
been proposed (including renormalizing the input measure, or using dual norms such
as detailed in § 8.2), it is only recently that satisfying unifying theories have been
developed. We only sketch here a simple but important particular case.
Following Liero et al. [2018], to account for arbitrary positive histograms (a, b) ∈
R_+^n × R_+^m, the initial Kantorovich formulation (2.11) is “relaxed” by only penalizing
marginal deviations using some divergence D_φ, defined in (8.3). This equivalently cor-

responds to minimizing an OT distance between approximate measures

    L_C^τ(a, b) = min_{ã, b̃}  L_C(ã, b̃) + τ_1 D_φ(ã|a) + τ_2 D_φ(b̃|b)        (10.7)
              = min_{P ∈ R_+^{n×m}}  ⟨C, P⟩ + τ_1 D_φ(P 1_m | a) + τ_2 D_φ(P^⊤ 1_n | b),        (10.8)

where (τ_1, τ_2) controls how much mass variations are penalized as opposed to trans-
portation of the mass. In the limit τ_1 = τ_2 → +∞, assuming ∑_i a_i = ∑_j b_j (the
“balanced” case), one recovers the original optimal transport formulation with hard
marginal constraint (2.11).
This formalism recovers many different previous works, for instance introducing
for D_φ an ℓ² norm [Benamou, 2003] or an ℓ¹ norm as in partial transport [Figalli,
2010, Caffarelli and McCann, 2010]. A case of particular importance is when using
for D_φ = KL the Kullback–Leibler divergence, as detailed in Remark 10.5. For this cost,
in the limit τ = τ_1 = τ_2 → 0, one obtains the so-called squared Hellinger distance (see
also Example 8.3),

    L_C^τ(a, b) → h²(a, b) = ∑_i (√a_i − √b_i)²   as   τ → 0.

Sinkhorn's iterations (4.15) can be adapted to this problem by making use of the
generalized algorithm detailed in §4.6. The solution has the form (4.12) and the scalings
are updated as

    u ← ( a / (K v) )^{τ_1/(τ_1+ε)}   and   v ← ( b / (K^⊤ u) )^{τ_2/(τ_2+ε)}.        (10.9)
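As an illustration, a minimal NumPy sketch of these scaling updates might read as follows (names are ours; a log-domain implementation would be preferred for small ε).

import numpy as np

def unbalanced_sinkhorn(C, a, b, eps, tau1, tau2, n_iter=500):
    # Scaling iterations (10.9) for unbalanced OT with KL marginal penalties.
    K = np.exp(-C / eps)
    u, v = np.ones(len(a)), np.ones(len(b))
    for _ in range(n_iter):
        u = (a / (K @ v)) ** (tau1 / (tau1 + eps))
        v = (b / (K.T @ u)) ** (tau2 / (tau2 + eps))
    return u[:, None] * K * v[None, :]    # coupling P = diag(u) K diag(v)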

Remark 10.4 (Generic measure). For (α, β) two arbitrary measures, the unbal-
anced version (also called “log-entropic”) of (2.15) reads
    L_c^τ(α, β) := min_{π ∈ M_+(X×Y)}  ∫_{X×Y} c(x, y) dπ(x, y) + τ D_φ(P_{1,♯}π | α) + τ D_φ(P_{2,♯}π | β),

where divergences Dϕ between measures are defined in (8.1). In the special case
c(x, y) = kx − yk2 , Dϕ = KL, Lτc (α, β)1/2 is the Gaussian–Hellinger distance [Liero
et al., 2018], and it is shown to be a distance on M1+ (Rd ).

Remark 10.5 (Wasserstein–Fisher–Rao). For the particular choice of cost

c(x, y) = − log cos(min(d(x, y)/κ, π/2)),



where κ is some cutoff distance, and using D_φ = KL, then

    WFR(α, β) := L_c^τ(α, β)^{1/2}

is the so-called Wasserstein–Fisher–Rao or Hellinger–Kantorovich distance. In the


special case X = Rd , this static (Kantorovich-like) formulation matches its dy-
namical counterparts (7.15), as proved independently by Liero et al. [2018], Chizat
et al. [2018a]. This dynamical formulation is detailed in §7.4.

The barycenter problem (9.11) can be generalized to handle an unbalanced setting


by replacing Lc with Lτc . Figure 10.1 shows the resulting interpolation, providing a
good illustration of the usefulness of the relaxation parameter τ . The input measures
are mixtures of two Gaussians with unequal mass. Classical OT requires the leftmost
bump to be split in two and gives a nonregular interpolation. In sharp contrast, un-
balanced OT allows the mass to vary during interpolation, so that the bumps are not
split and local modes of the distributions are smoothly matched. Using finite values
for τ (recall that OT is equivalent to τ = ∞) is thus important to prevent irregular
interpolations that arise because of mass splitting, which happens because of a “hard”
mass conservation constraint. The resulting optimization problem can be tackled nu-
merically using entropic regularization and the generalized Sinkhorn algorithm detailed
in §4.6.
In practice, unbalanced OT techniques seem to outperform classical OT for ap-
plications (such as in imaging or machine learning) where the input data is noisy or
not perfectly known. They are also crucial when the signal strength of a measure, as
measured by its total mass, must be accounted for, or when normalization is not mean-
ingful. This was the original motivation of Frogner et al. [2015], whose goal was to
compare sets of word labels used to describe images. Unbalanced OT and the corre-
sponding Sinkhorn iterations have also been used for applications to the dynamics of
cells in [Schiebinger et al., 2017].

Remark 10.6 (Connection with dual norms). A particularly simple setup to account
for mass variation is to use dual norms, as detailed in §8.2. By choosing a compact
set B ⊂ C(X ) one obtains a norm defined on the whole space M(X ) (in particular,
the measures do not need to be positive). A particular instance of this setting is the
flat norm (8.11), which is recovered as a special instance of unbalanced transport,
when using Dϕ (α|α0 ) = kα − α0 kTV to be the total variation norm (8.9); see, for in-
stance, [Hanin, 1992, Lellmann et al., 2014]. We also refer to [Schmitzer and Wirth,
2017] for a general framework to define Wasserstein-1 unbalanced transport.

Figure 10.1: Influence of the relaxation parameter τ on unbalanced barycenters (classical OT, τ = +∞,
vs. unbalanced OT, τ = 1). Top to bottom: the evolution of the barycenter between two input measures.

10.3 Problems with Extra Constraints on the Couplings

Many other OT-like problems have been proposed in the literature. They typically
correspond to adding extra constraints C on the set of feasible couplings appearing in
the original OT problem (2.15)
    min_{π ∈ U(α,β)}  { ∫_{X×Y} c(x, y) dπ(x, y)  :  π ∈ C }.        (10.10)

Let us give two representative examples. The optimal transport with capacity con-
straint [Korman and McCann, 2015] corresponds to imposing that the density ρπ (for
instance, with respect to the Lebesgue measure) is upper bounded
C = {π : ρπ ≤ κ} (10.11)
for some κ > 0. This constraint rules out singular couplings localized on Monge maps.
The martingale transport problem (see, for instance, Galichon et al. [2014], Dolinsky
and Soner [2014], Tan and Touzi [2013], Beiglböck et al. [2013]), which finds many
applications in finance, imposes the so-called martingale constraint on the conditional
mean of the coupling, when X = Y = Rd :
    C = { π  :  ∀ x ∈ R^d,   ∫_{R^d} y ( dπ(x, y) / (dα(x) dβ(y)) ) dβ(y) = x }.        (10.12)

This constraint imposes that the barycentric projection map (4.20) of any admissible
coupling must be equal to the identity. For arbitrary (α, β), this set C is typically empty,
but necessary and sufficient conditions exist (α and β should be in “convex order”) to
ensure C ≠ ∅ so that (α, β) satisfy a martingale constraint. This constraint can be
difficult to enforce numerically when discretizing an existing problem. It also forbids
the solution to concentrate on a single Monge map, and can lead to couplings con-
centrated on the union of several graphs (a “multivalued” Monge map), or even more
complicated support sets. Using an entropic penalization as in (4.9), one can solve ap-
proximately (10.10) using the Dykstra algorithm as explained in Benamou et al. [2015],
which is a generalization of Sinkhorn’s algorithm shown in §4.2. This requires comput-
ing the projection onto C for the KL divergence, which is straightforward for (10.11)
but cannot be done in closed form (10.12) and thus necessitates subiterations; see [Guo
and Obloj, 2017] for more details.

10.4 Sliced Wasserstein Distance and Barycenters

One can define a distance between two measures (α, β) defined on Rd by aggregating
1-D Wasserstein distances between their projections onto all directions of the sphere.
This defines

    SW(α, β)² := ∫_{S^d} W_2(P_{θ,♯} α, P_{θ,♯} β)² dθ,        (10.13)

where S^d = {θ ∈ R^d : ‖θ‖ = 1} is the d-dimensional sphere, and P_θ : x ∈ R^d → R is the
projection. This approach is detailed in [Bonneel et al., 2015], following ideas from Marc
Bernot. It is related to the problem of Radon inversion over measure spaces [Abraham
et al., 2017].

Lagrangian discretization and stochastic gradient descent. The advantage of this


functional is that 1-D Wasserstein distances are simple to compute, as detailed in §2.6.
In the specific case where m = n and

    α = (1/n) ∑_{i=1}^n δ_{x_i}   and   β = (1/n) ∑_{i=1}^n δ_{y_i},        (10.14)

this is achieved by simply sorting points,

    SW(α, β)² = ∫_{S^d} ( ∑_{i=1}^n |⟨x_{σ_θ(i)} − y_{κ_θ(i)}, θ⟩|² ) dθ,

where σ_θ, κ_θ ∈ Perm(n) are the permutations ordering in increasing order, respectively,
(⟨x_i, θ⟩)_i and (⟨y_i, θ⟩)_i.
Fixing the vector y, the function E_β(x) := SW(α, β)² is smooth, and one can use
this function to define a mapping by gradient descent,

    x ← x − τ ∇E_β(x)   where   ∇E_β(x)_i = 2 ∫_{S^d} ⟨x_i − y_{κ_θ∘σ_θ^{−1}(i)}, θ⟩ θ dθ,        (10.15)

using a small enough step size τ > 0. To make the method tractable, one can use a
stochastic gradient descent (SGD), replacing this integral with a discrete sum against
randomly drawn directions θ ∈ Sd (see §5.4 for more details on SGD). The flow (10.15)
can be understood as (a Lagrangian implementation of) a Wasserstein gradient flow (in
the sense of §9.3) of the function α 7→ SW(α, β)2 . Numerically, one finds that this flow
has no local minimizer and that it thus converges to α = β. The usefulness of the
Lagrangian solver is that, at convergence, it defines a matching (similar to a Monge
map) between the two distributions. This method has been used successfully for color
transfer and texture synthesis in [Rabin et al., 2011] and is related to the alternate
minimization approach detailed in [Pitié et al., 2007].
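As an illustration, a minimal NumPy sketch of a Monte Carlo estimate of SW_2² and of the gradient (10.15) for uniform empirical measures (10.14) might read as follows (names are ours; the integral over the sphere is replaced by an average over random directions, and the 1/n normalization of the empirical measures is included).

import numpy as np

def sliced_w2_and_grad(x, y, n_dirs=50, rng=None):
    # Monte Carlo estimate of SW_2(alpha, beta)^2 and of its gradient in the
    # positions x, for two uniform empirical measures of the same size n.
    rng = np.random.default_rng() if rng is None else rng
    n, d = x.shape
    val, grad = 0.0, np.zeros_like(x)
    for _ in range(n_dirs):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)
        px, py = x @ theta, y @ theta
        sx, sy = np.argsort(px), np.argsort(py)    # monotone 1-D coupling
        diff = px[sx] - py[sy]
        val += np.mean(diff ** 2)
        grad[sx] += (2.0 / n) * diff[:, None] * theta[None, :]
    return val / n_dirs, grad / n_dirs

# One step of the Lagrangian flow (10.15): x <- x - tau * grad.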
It is simple to extend this Lagrangian scheme to compute approximate “sliced”
barycenters of measures, by mimicking the Fréchet definition of Wasserstein barycen-
ters (9.11) and minimizing

    min_{α ∈ M_+^1(X)}  ∑_{s=1}^S λ_s SW(α, β_s)²,        (10.16)

given a set (β_s)_{s=1}^S of fixed input measures. Using a Lagrangian discretization of the
form (10.14) for both α and the (β_s)_s, one can perform the nonconvex minimization
over the positions x = (x_i)_i,

    min_x  E(x) := ∑_s λ_s E_{β_s}(x),   and   ∇E(x) = ∑_s λ_s ∇E_{β_s}(x),        (10.17)

by gradient descent using formula (10.15) to compute ∇Eβs (x) (coupled with a random
sampling of the direction θ).

Eulerian discretization and Radon transform. A related way to compute an approx-


imated sliced barycenter, without resorting to an iterative minimization scheme, is to
use the fact that (10.13) computes a distance between the Radon transforms R(α) and
R(β) where
    R(α) := (P_{θ,♯} α)_{θ ∈ S^d}.

A crucial point is that the Radon transform is invertible and that its inverse can be
computed using a filtered backprojection formula. Given a collection of measures ρ =
(ρ_θ)_{θ∈S^d}, one defines the filtered backprojection operator as

    R^+(ρ) = C_d Δ^{(d−1)/2} B(ρ),        (10.18)

where ξ = B(ρ) ∈ M(R^d) is the measure defined through the relation

    ∀ g ∈ C(R^d),   ∫_{R^d} g(x) dξ(x) = ∫_{S^d} ∫_{R^{d−1}} ∫_R g(rθ + U_θ z) dρ_θ(r) dz dθ,        (10.19)

where U_θ is any orthogonal basis of θ^⊥, and where C_d ∈ R is a normalizing constant
which depends on the dimension. Here Δ^{(d−1)/2} is a fractional Laplacian, which is the
high-pass filter defined over the Fourier domain as Δ̂^{(d−1)/2}(ω) = ‖ω‖^{d−1}. The definition
of the backprojection (10.19) adds up the contributions of all the measures (ρ_θ)_θ by
extending each one as being constant in the directions orthogonal to θ. One then has
the left-inverse relation R^+ ∘ R = I_{M(R^d)}, so that R^+ is a valid reconstruction formula.

Figure 10.2: Example of sliced barycenter computation using the Radon transform (as defined
in (10.20)), for t = 0, 1/4, 1/2, 3/4, 1. Top: barycenters α_t for S = 2 input measures and weights
(λ_1, λ_2) = (1 − t, t). Bottom: their Radon transforms R(α_t) (the horizontal axis being the orientation angle θ).

In order to compute barycenters of input densities, it makes sense to replace for-


mula (9.11) by its equivalent using Radon transform, and thus consider independently
for each θ the 1-D barycenter problem
    ρ_θ^⋆ ∈ argmin_{ρ_θ ∈ M_+^1(R)}  ∑_{s=1}^S λ_s W_2(ρ_θ, P_{θ,♯} β_s)².        (10.20)

Each 1-D barycenter problem is easily computed using the monotone rearrangement as
detailed in Remark 9.6. The Radon approximation α_R := R^+(ρ^⋆) of a sliced barycen-
ter solving (9.11) is then obtained by applying the inverse Radon transform R^+. Note that in
general, α_R is not a solution to (9.11) because the Radon transform is not surjective,
so that ρ^⋆, which is obtained as a barycenter of the Radon transforms R(β_s), does not
necessarily belong to the range of R. But numerically it seems to be almost the case
in practice [Bonneel et al., 2015]. Numerically, this Radon transform formulation is very
effective for input measures and barycenters discretized on a fixed grid (e.g. a uniform
grid for images), and R as well as R^+ are computed approximately on this grid using
fast algorithms (see, for instance, [Averbuch et al., 2001]). Figure 10.2 illustrates this
141414 Nicolas Bonneel
Nicolas
Nicolas et et
Bonneel
Bonneel al.et
al.al.

10.5. Transporting Vectors and Matrices 169

computation of barycenters (and highlights the way the Radon transforms are interpo-
lated), while Figure 10.3 shows a comparison of the Radon barycenters (10.20) and the
ones obtained by Lagrangian discretization (10.17).
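The per-angle step (10.20) is straightforward to sketch numerically when the inputs are uniform point clouds of equal size, since the 1-D barycenter reduces to averaging quantile functions (Remark 9.6); the reconstruction R+ is not included below and would require a filtered backprojection or fast slant stack routine such as the one of [Averbuch et al., 2001]. The function name and discretization are illustrative.

```python
# A sketch of the per-angle 1-D barycenter step (10.20), assuming the
# inputs are uniform point clouds of equal size.
import numpy as np

def radon_barycenter_slices(betas, lambdas, thetas):
    """For each direction theta, return the quantiles of the 1-D barycenter
    of the projected measures P_{theta#} beta_s (monotone rearrangement)."""
    slices = []
    for theta in thetas:
        # sorted projections = discrete quantile functions of P_{theta#} beta_s
        quantiles = [np.sort(b @ theta) for b in betas]
        # the 1-D W2 barycenter is the weighted average of quantile functions
        slices.append(sum(lam * q for lam, q in zip(lambdas, quantiles)))
    return slices
```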

Figure 10.3: Comparison of barycenters computed using Radon transform (10.20) (Eulerian dis-
cretization), Lagrangian discretization (10.17), and Wasserstein OT (computed using Sinkhorn itera-
tions (9.18)).

Sliced Wasserstein kernels. Besides its computational simplicity, another advantage of


the sliced Wasserstein distance is that it is isometric to a Euclidean distance (it is thus
a “Hilbertian” metric), as detailed in Remark 2.30, and in particular formula (2.36).
As highlighted in §8.3, this should be contrasted with the Wasserstein distance W2 on
R^d, which is not Hilbertian in dimension d ≥ 2. It is thus possible to use this sliced
distance to equip the space of distributions M1+(R^d) with a reproducing kernel Hilbert
space structure (as detailed in §8.3). One can, for instance, use the exponential and
energy distance kernels
$$k(\alpha, \beta) = e^{-\frac{\operatorname{SW}(\alpha,\beta)^p}{2\sigma^p}} \quad \text{and} \quad k(\alpha, \beta) = -\operatorname{SW}(\alpha, \beta)^p,$$
for 1 ≤ p ≤ 2 for the exponential kernels and 0 < p < 2 for the energy distance kernels.
This means that for any collection (αi)i of input measures, the matrix (k(αi, αj))i,j is
symmetric positive semidefinite. It is possible to use these kernels to perform a variety of
machine learning tasks using the “kernel trick,” for instance, in regression, classification
(SVM and logistic), clustering (K-means) and dimensionality reduction (PCA) [Hofmann et al., 2008]. We refer to Kolouri et al. [2016] for details and applications.
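As a possible numerical sketch of such a kernel, assume each measure is a uniform point cloud of equal size and approximate SW² by a Monte Carlo average over random directions (so positive semidefiniteness then only holds approximately); the helper names below are hypothetical.

```python
# A sketch of a sliced-Wasserstein Gaussian kernel matrix (the p = 2 case of
# the exponential kernel above), for uniform point clouds of equal size.
import numpy as np

def sw2(x, y, n_dir=100):
    """Monte Carlo approximation of SW(alpha, beta)^2 over random directions."""
    d = x.shape[1]
    val = 0.0
    for _ in range(n_dir):
        theta = np.random.randn(d)
        theta /= np.linalg.norm(theta)
        px, py = np.sort(x @ theta), np.sort(y @ theta)   # 1-D quantile matching
        val += np.mean((px - py) ** 2)
    return val / n_dir

def sw_gaussian_kernel(measures, sigma=1.0):
    """k(alpha_i, alpha_j) = exp(-SW(alpha_i, alpha_j)^2 / (2 sigma^2))."""
    n = len(measures)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = np.exp(-sw2(measures[i], measures[j]) / (2 * sigma**2))
    return K
```

The resulting kernel matrix can then be fed to standard kernel methods (for instance, solvers accepting precomputed kernels) to perform the regression, classification, clustering or PCA tasks mentioned above.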

10.5 Transporting Vectors and Matrices

Real-valued measures α ∈ M(X ) are easily generalized to vector-valued measures α ∈
M(X ; V), where V is some vector space. For notational simplicity, we assume V is
Euclidean and equipped with some inner product ⟨·, ·⟩ (typically V = R^d and the inner
product is the canonical one). Thanks to this inner product, vector-valued measures
are identified with the dual of continuous functions g : X → V, i.e. for any such g, one
defines its integration against the measure as
$$\int_{\mathcal{X}} g(x)\, \mathrm{d}\alpha(x) \in \mathbb{R}, \qquad (10.21)$$
which is a linear operation on g and α. A discrete measure has the form α = Σi ai δxi,
where (xi, ai) ∈ X × V, and the integration formula (10.21) simply reads
$$\int_{\mathcal{X}} g(x)\, \mathrm{d}\alpha(x) = \sum_{i} \langle a_i, g(x_i) \rangle \in \mathbb{R}.$$

Equivalently, if V = R^d, then such an α can be viewed as a collection (αs)_{s=1}^d of d
“classical” real-valued measures (its coordinates), writing
$$\int_{\mathcal{X}} g(x)\, \mathrm{d}\alpha(x) = \sum_{s=1}^{d} \int_{\mathcal{X}} g_s(x)\, \mathrm{d}\alpha_s(x),$$
where g(x) = (gs(x))_{s=1}^d are the coordinates of g in the canonical basis.
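As a small numerical illustration of this pairing for a discrete vector-valued measure (the arrays and test function below are arbitrary):

```python
# Evaluating sum_i <a_i, g(x_i)> for alpha = sum_i a_i delta_{x_i}.
import numpy as np

x = np.random.rand(5, 2)     # positions x_i in X = R^2
a = np.random.randn(5, 3)    # vector weights a_i in V = R^3

def g(x):                    # an arbitrary test function g : X -> V
    return np.stack([x[:, 0], x[:, 1], x[:, 0] * x[:, 1]], axis=1)

pairing = np.sum(a * g(x))   # sum_i <a_i, g(x_i)>, a real number
```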

Dual norms. It is nontrivial, and in fact in general impossible, to extend OT distances


to such a general setting. Even coping with real-valued measures taking both positive
and negative values is difficult. The only simple option is to consider dual norms, as
defined in §8.2. Indeed, formula (6.3) readily extends to M(X ; V) by considering B to
be a subset of C(X ; V). So in particular, W1 , the flat norm and MMD norms can be
computed for vector-valued measures.

OT over cone-valued measures. It is possible to define more advanced OT distances
when α is restricted to lie in a subset M(X ; 𝒱) ⊂ M(X ; V). The set 𝒱 ⊂ V should be a
positively 1-homogeneous convex cone of V,
$$\mathcal{V} \overset{\text{def.}}{=} \big\{ \lambda u \; : \; \lambda \in \mathbb{R}^+, \; u \in \mathcal{V}_0 \big\},$$
where 𝒱0 is a compact convex set. A typical example is the set of positive measures,
where 𝒱 = R^d_+. Dynamical convex formulations of OT over such a cone have been proposed; see [Zinsl and Matthes, 2015]. This has been applied to model the distribution
of chemical components. Another important example is the set of positive symmetric
matrices 𝒱 = S^d_+ ⊂ R^{d×d}. It is of course possible to use dual norms over this space, by

treating matrices as vectors; see, for instance, [Ning and Georgiou, 2014]. Dynamical
convex formulations for OT over such a cone have been provided [Chen et al., 2016b,
Jiang et al., 2012]. Some static (Kantorovich-like) formulations also have been pro-
posed [Ning et al., 2015, Peyré et al., 2017], but a mathematically sound theoretical

framework is still missing. In particular, it is unclear if these static approaches define


distances for vector-valued measures and if they relate to some dynamical formulation.
Figure 10.4 is an example of tensor interpolation obtained using the method detailed
in [Peyré et al., 2017], which proposes a generalization of Sinkhorn algorithms using
quantum relative entropy (10.22) to deal with tensor fields.

OT over positive matrices. A related but quite different setting is to replace discrete
measures, i.e. histograms a ∈ Σn , by positive matrices with unit trace A ∈ Sn+ such
that tr(A) = 1. The rationale is that the eigenvalues λ(A) ∈ Σn of A play the role of
a histogram, but one also has to take care of the rotations of the eigenvectors, so that
this problem is more complicated.
One can extend several divergences introduced in §8.1 to this setting. For instance,
the Bures metric (2.42) is a generalization of the Hellinger distance (defined in Re-
mark 8.3), since they are equal on positive diagonal matrices. One can also extend
the Kullback–Leibler divergence (4.6) (see also Remark 8.1), which is generalized to
positive matrices as
$$\operatorname{KL}(A|B) \overset{\text{def.}}{=} \operatorname{tr}\big( A \log(A) - A \log(B) - A + B \big), \qquad (10.22)$$
where log(·) is the matrix logarithm. This matrix KL is jointly convex in both of its arguments.
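A minimal sketch of evaluating (10.22) for symmetric positive definite matrices, using an eigendecomposition for the matrix logarithm (function names hypothetical):

```python
# Quantum/matrix KL divergence (10.22) between positive definite matrices.
import numpy as np

def logm_spd(A):
    """Matrix logarithm of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.log(w)) @ V.T     # V diag(log w) V^T

def matrix_kl(A, B):
    """KL(A|B) = tr(A log A - A log B - A + B)."""
    return np.trace(A @ logm_spd(A) - A @ logm_spd(B) - A + B)
```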
It is possible to solve convex dynamic formulations to define OT distances between
such matrices [Carlen and Maas, 2014, Chen et al., 2016b, 2017]. There also exists
an equivalent of Sinkhorn’s algorithm, which is due to Gurvits [2004] and has been
extensively studied in [Georgiou and Pavon, 2015]; see also the review paper [Idel,
2016]. Its convergence is known only in some cases, but empirically it appears to always converge.


Figure 10.4: Interpolations between two input fields of positive semidefinite matrices (displayed at
times t ∈ {0, 1} using ellipses) on some domain (here, a 2-D planar square and a surface mesh), using
the method detailed in Peyré et al. [2017]. Unlike linear interpolation schemes, this OT-like method
transports the “mass” of the tensors (size of the ellipses) as well as their anisotropy and orientation.

10.6 Gromov–Wasserstein Distances

For some applications such as shape matching, an important weakness of optimal transport distances lies in the fact that they are not invariant under important families of
transformations, such as rescalings, translations or rotations. Although some nonconvex variants
of OT handling such global transformations have been proposed [Cohen and Guibas,
1999, Pele and Taskar, 2013] and recently applied to problems such as cross-lingual
word embedding alignment [Grave et al., 2019, Alvarez-Melis et al., 2019], these methods require first specifying a subset of invariances, possibly between
different metric spaces, to be relevant. We describe in this section a more general and
very natural extension of OT that can deal with measures defined on different spaces
without requiring the definition of a family of invariances.

10.6.1 Hausdorff Distance


The Hausdorff distance between two sets A, B ⊂ Z for some metric dZ is
$$\mathcal{H}_{\mathcal{Z}}(A, B) \overset{\text{def.}}{=} \max\Big( \sup_{a \in A} \inf_{b \in B} d_{\mathcal{Z}}(a, b), \; \sup_{b \in B} \inf_{a \in A} d_{\mathcal{Z}}(a, b) \Big);$$
see Figure 10.5.

see Figure 10.5. This defines a distance between compact sets K(Z) of Z, and if Z is
compact, then (K(Z), HZ ) is itself compact; see [Burago et al., 2001].
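For finite point sets in R^d, the definition can be evaluated directly from the pairwise distance matrix, as in the following sketch:

```python
# Direct computation of the Hausdorff distance between two finite point sets.
import numpy as np

def hausdorff(A, B):
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # d(a_i, b_j)
    return max(D.min(axis=1).max(),   # sup_a inf_b d(a, b)
               D.min(axis=0).max())   # sup_b inf_a d(a, b)
```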


Figure 10.5: Computation of the Hausdorff distance in R2 .

Following Mémoli [2011], one remarks that this distance between sets (A, B) can
be defined similarly to the Wasserstein distance between measures (which should be
somehow understood as “weighted” sets). One replaces the measure couplings (2.14)
by set couplings
$$\mathcal{R}(A, B) \overset{\text{def.}}{=} \Big\{ R \subset A \times B \; : \; \forall\, a \in A, \, \exists\, b \in B, \, (a, b) \in R \; \text{ and } \; \forall\, b \in B, \, \exists\, a \in A, \, (a, b) \in R \Big\}.$$

With respect to the Kantorovich problem (2.15), one should replace integration (since one
does not have access to measures) by maximization, and one has
$$\mathcal{H}_{\mathcal{Z}}(A, B) = \inf_{R \in \mathcal{R}(A, B)} \; \sup_{(a, b) \in R} d_{\mathcal{Z}}(a, b). \qquad (10.23)$$

Note that the support of a measure coupling π ∈ U(α, β) is a set coupling between
the supports, i.e. Supp(π) ∈ R(Supp(α), Supp(β)). The Hausdorff distance is thus
connected to the ∞-Wasserstein distance (see Remark 2.20), and one has HZ(A, B) ≤
W∞(α, β) for any measures (α, β) whose supports are (A, B).

10.6.2 Gromov–Hausdorff distance


The Gromov–Hausdorff (GH) distance [Gromov, 2001] (see also [Edwards, 1975]) is a
way to measure the distance between two metric spaces (X , dX ), (Y, dY ) by quantifying
how far they are from being isometric to each other; see Figure 10.6. It is defined as the
minimum Hausdorff distance between every possible isometric embedding of the two
spaces in a third one,
$$\operatorname{GH}(d_{\mathcal{X}}, d_{\mathcal{Y}}) \overset{\text{def.}}{=} \inf_{\mathcal{Z}, f, g} \Big\{ \mathcal{H}_{\mathcal{Z}}\big(f(\mathcal{X}), g(\mathcal{Y})\big) \; : \; f : \mathcal{X} \xrightarrow{\text{isom}} \mathcal{Z}, \;\; g : \mathcal{Y} \xrightarrow{\text{isom}} \mathcal{Z} \Big\}.$$

Here, the constraint is that f must be an isometric embedding, meaning that
dZ(f(x), f(x′)) = dX(x, x′) for any (x, x′) ∈ X^2 (and similarly for g). One can show that
GH defines a distance between compact metric spaces up to isometries, so that in particular GH(dX, dY) = 0 if and only if there exists an isometry h : X → Y, i.e. h is
bijective and dY(h(x), h(x′)) = dX(x, x′) for any (x, x′) ∈ X^2.

Figure 10.6: The GH approach to compare two metric spaces.

Similarly to (10.23), and as explained in [Mémoli, 2011], it is possible to rewrite
the GH distance equivalently using couplings as follows:
$$\operatorname{GH}(d_{\mathcal{X}}, d_{\mathcal{Y}}) = \frac{1}{2} \inf_{R \in \mathcal{R}(\mathcal{X}, \mathcal{Y})} \; \sup_{((x, y), (x', y')) \in R^2} |d_{\mathcal{X}}(x, x') - d_{\mathcal{Y}}(y, y')|.$$

For discrete spaces X = (xi)_{i=1}^n and Y = (yj)_{j=1}^m, represented using distance matrices D =
(dX(xi, xi′))_{i,i′} ∈ R^{n×n} and D′ = (dY(yj, yj′))_{j,j′} ∈ R^{m×m}, one can rewrite this optimization
using binary matrices R ∈ {0, 1}^{n×m} indicating the support of the set couplings R as
follows:
$$\operatorname{GH}(D, D') = \frac{1}{2} \inf_{R\mathbb{1} > 0, \; R^\top \mathbb{1} > 0} \; \max_{(i, i', j, j')} R_{i,j}\, R_{i',j'}\, |D_{i,i'} - D'_{j,j'}|. \qquad (10.24)$$
The initial motivation of the GH distance is to define and study limits of metric spaces,
as illustrated in Figure 10.7, and we refer to [Burago et al., 2001] for details. There is an
explicit description of the geodesics for the GH distance [Chowdhury and Mémoli, 2016],
which is very similar to the one in Gromov–Wasserstein spaces, detailed in Remark 10.8.

Figure 10.7: GH limit of sequences of metric spaces.

The underlying optimization problem (10.24) is highly nonconvex, and computing


the global minimum is intractable. It has been approached numerically using approx-
imation schemes and has found applications in vision and graphics for shape match-
ing [Mémoli and Sapiro, 2005, Bronstein et al., 2006].
It is often desirable to “smooth” the definition of the Hausdorff distance by replac-
ing the maximization by an integration. This in turn necessitates the introduction of
measures, and it is one of the motivations for the definition of the GW distance in the
next section.

10.6.3 Gromov–Wasserstein Distance

Optimal transport needs a ground cost C to compare histograms (a, b) and thus cannot
be used if the bins of those histograms are not defined on the same underlying space,
or if one cannot preregister these spaces to define a ground cost between any pair of
bins in the first and second histograms, respectively. To address this limitation, one
can instead only assume a weaker assumption, namely that two matrices D ∈ Rn×n
and D0 ∈ Rm×m quantify similarity relationships between the points on which the
histograms are defined. A typical scenario is when these matrices are (power of) distance
matrices. The GW problem reads

GW((a, D), (b, D0 ))2 =


def.
min ED,D0 (P) (10.25)
P∈U(a,b)
10.6. Gromov–Wasserstein Distances 175

|Di,i0 − D0j,j 0 |2 Pi,j Pi0 ,j 0 ,


def.
X
where ED,D0 (P) =
i,j,i0 ,j 0

see Figure 10.8. This problem is similar to the GH problem (10.24) when replacing
maximization by a sum and set couplings by measure couplings. This is a nonconvex
problem, which can be recast as a quadratic assignment problem [Loiola et al., 2007]
and is in full generality NP-hard to solve for arbitrary inputs. It is in fact equivalent
to a graph matching problem [Lyzinski et al., 2016] for a particular cost.
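For intuition, here is a sketch of evaluating the GW energy E_{D,D′}(P): the naive sum costs O(n²m²), while expanding the square (for symmetric D, D′) yields an O(n²m + nm²) formula; this kind of factorization is also what practical entropic solvers exploit. Function names are illustrative.

```python
# Two equivalent evaluations of the GW energy of (10.25).
import numpy as np

def gw_energy_naive(D, Dp, P):
    # M[i, j, k, l] = |D_{i,k} - D'_{j,l}|^2, then contract against P twice
    M = (D[:, None, :, None] - Dp[None, :, None, :]) ** 2
    return np.einsum('ijkl,ij,kl->', M, P, P)

def gw_energy(D, Dp, P):
    # expansion of the square, valid for symmetric D, D'
    a, b = P.sum(axis=1), P.sum(axis=0)      # marginals of the coupling
    return a @ (D**2) @ a + b @ (Dp**2) @ b - 2 * np.sum(P * (D @ P @ Dp))
```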


Figure 10.8: The GW approach to comparing two metric measure spaces.

One can show that GW satisfies the triangle inequality, and in fact it defines a
distance between metric spaces equipped with a probability distribution, here assumed
to be discrete in definition (10.25), up to isometries preserving the measures. This distance was introduced and studied in detail by Mémoli [2011]. An in-depth mathematical
exposition (in particular, of its geodesic structure and gradient flows) is given in [Sturm,
2012]. See also [Schmitzer and Schnörr, 2013a] for applications in computer vision. This
distance is also tightly connected with the GH distance [Gromov, 2001] between metric
spaces, which has been used for shape matching [Mémoli, 2007, Bronstein et al., 2010].

Remark 10.7 (Gromov–Wasserstein distance). The general setting corresponds to
computing couplings between metric measure spaces (X , dX , αX ) and (Y, dY , αY ),
where (dX , dY ) are distances, while αX and αY are measures on their respective
spaces. One defines
$$\operatorname{GW}((\alpha_{\mathcal{X}}, d_{\mathcal{X}}), (\alpha_{\mathcal{Y}}, d_{\mathcal{Y}}))^2 \overset{\text{def.}}{=} \min_{\pi \in \mathcal{U}(\alpha_{\mathcal{X}}, \alpha_{\mathcal{Y}})} \int_{\mathcal{X}^2 \times \mathcal{Y}^2} |d_{\mathcal{X}}(x, x') - d_{\mathcal{Y}}(y, y')|^2 \, \mathrm{d}\pi(x, y)\, \mathrm{d}\pi(x', y'). \qquad (10.26)$$
GW defines a distance between metric measure spaces up to isometries, where
one says that (X , αX , dX ) and (Y, αY , dY ) are isometric if there exists a bijection
ϕ : X → Y such that ϕ♯αX = αY and dY(ϕ(x), ϕ(x′)) = dX(x, x′).

Remark 10.8 (Gromov–Wasserstein geodesics). The space of metric spaces (up to
isometries) endowed with this GW distance (10.26) has a geodesic structure. Sturm
[2012] shows that the geodesic between (X0 , dX0 , α0 ) and (X1 , dX1 , α1 ) can be chosen
to be t ∈ [0, 1] 7→ (X0 × X1 , dt , π⋆ ), where π⋆ is a solution of (10.26) and, for all
((x0 , x1 ), (x′0 , x′1 )) ∈ (X0 × X1 )^2,
$$d_t\big((x_0, x_1), (x_0', x_1')\big) \overset{\text{def.}}{=} (1 - t)\, d_{\mathcal{X}_0}(x_0, x_0') + t\, d_{\mathcal{X}_1}(x_1, x_1').$$
This formula allows one to define and analyze gradient flows which minimize func-
tionals involving metric spaces; see Sturm [2012]. It is, however, difficult to handle
numerically, because it involves computations over the product space X0 × X1 . A
heuristic approach is used in [Peyré et al., 2016] to define geodesics and barycenters
of metric measure spaces while imposing the cardinality of the involved spaces and
making use of the entropic smoothing (10.27) detailed below.

10.6.4 Entropic Regularization


To approximate the computation of GW, and to help convergence of minimization
schemes to better minima, one can consider the entropic regularized variant
$$\min_{P \in U(\mathrm{a}, \mathrm{b})} \; \mathcal{E}_{D, D'}(P) - \varepsilon H(P). \qquad (10.27)$$

As proposed initially in [Gold and Rangarajan, 1996, Rangarajan et al., 1999], and later
revisited in [Solomon et al., 2016a] for applications in graphics, one can use Sinkhorn's
algorithm iteratively to progressively compute a stationary point of (10.27). Indeed,
successive linearizations of the objective function lead one to consider the succession of
updates
$$P^{(\ell+1)} \overset{\text{def.}}{=} \operatorname*{argmin}_{P \in U(\mathrm{a}, \mathrm{b})} \; \langle P, C^{(\ell)} \rangle - \varepsilon H(P) \quad \text{where} \quad C^{(\ell)} \overset{\text{def.}}{=} \nabla \mathcal{E}_{D, D'}(P^{(\ell)}) = -D\, P^{(\ell)} D', \qquad (10.28)$$
which can be interpreted as a mirror-descent scheme [Solomon et al., 2016a]. Each


update can thus be solved using Sinkhorn iterations (4.15) with cost C(`) . Figure 10.9
displays the evolution of the algorithm. Figure 10.10 illustrates the use of this entropic
GW to compute soft maps between domains.
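A minimal sketch of these iterations combines the linearized cost C^(ℓ) = −D P^(ℓ) D′ of (10.28) with a basic Sinkhorn solver (4.15); terms of the linearized cost that depend only on the fixed marginals are dropped, since they do not change the solution of the Sinkhorn subproblem, and practical details (log-domain stabilization, stopping criteria) are omitted. The function names are illustrative.

```python
# Entropic GW via successive Sinkhorn subproblems (a simplified sketch).
import numpy as np

def sinkhorn(C, a, b, eps, n_iter=500):
    # Entropic OT for cost C; shifting C by a constant leaves the solution unchanged.
    K = np.exp(-(C - C.min()) / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def entropic_gw(D, Dp, a, b, eps, n_outer=20):
    P = np.outer(a, b)                 # initialization P^(0) = a (x) b
    for _ in range(n_outer):
        C = -D @ P @ Dp                # linearized cost C^(l) of (10.28)
        P = sinkhorn(C, a, b, eps)     # one Sinkhorn (mirror-descent) step
    return P
```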

Figure 10.9: Iterations ℓ = 1, 2, 3, 4 of the entropic GW algorithm (10.28) between two shapes (xi)i and (yj)j in
R^2, initialized with P(0) = a ⊗ b. The distance matrices are Di,i′ = ‖xi − xi′‖ and D′j,j′ = ‖yj − yj′‖.
Top row: coupling P(ℓ) displayed as a 2-D image. Bottom row: matching induced by P(ℓ) (each point
xi is connected to the three yj with the three largest values among {P(ℓ)i,j}j). The shapes have the same
size, but for display purposes, the inner shape (xi)i has been reduced.

Figure 10.10: Example of fuzzy correspondences computed by solving GW problem (10.27) with
Sinkhorn iterations (10.28). Extracted from [Solomon et al., 2016a].
Acknowledgements

We would like to thank the many colleagues, collaborators and students who have
helped us at various stages when preparing this survey. Some of their inputs have
shaped this work, and we would like to thank in particular Jean-David Benamou,
Yann Brenier, Guillaume Carlier, Vincent Duval and the entire MOKAPLAN team at
Inria; Francis Bach, Espen Bernton, Mathieu Blondel, Nicolas Courty, Rémi Flamary,
Alexandre Gramfort, Young-Heon Kim, Daniel Matthes, Philippe Rigollet, Filippo San-
tambrogio, Justin Solomon, Jonathan Weed; as well as the feedback by our current and
former students on these subjects, in particular Gwendoline de Bie, Lénaïc Chizat,
Aude Genevay, Hicham Janati, Théo Lacombe, Boris Muzellec, François-Pierre Paty,
Vivien Seguy.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro,
Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: large-scale
machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467,
2016.
Isabelle Abraham, Romain Abraham, Maïtine Bergounioux, and Guillaume Carlier. Tomo-
graphic reconstruction from a few views: a multi-marginal optimal transport approach. Ap-
plied Mathematics & Optimization, 75(1):55–73, 2017.
Ryan Prescott Adams and Richard S Zemel. Ranking via sinkhorn propagation. arXiv preprint
arXiv:1106.1925, 2011.
Martial Agueh and Malcolm Bowles. One-dimensional numerical algorithms for gradient flows
in the p-Wasserstein spaces. Acta Applicandae Mathematicae, 125(1):121–134, 2013.
Martial Agueh and Guillaume Carlier. Barycenters in the Wasserstein space. SIAM Journal
on Mathematical Analysis, 43(2):904–924, 2011.
Martial Agueh and Guillaume Carlier. Vers un théorème de la limite centrale dans l’espace de
Wasserstein? Comptes Rendus Mathematique, 355(7):812–818, 2017.
Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermüller, Dzmitry Bahdanau,
and Nicolas Ballas et al. Theano: A python framework for fast computation of mathematical
expressions. CoRR, abs/1605.02688, 2016.
Syed Mumtaz Ali and Samuel D Silvey. A general class of coefficients of divergence of one
distribution from another. Journal of the Royal Statistical Society. Series B (Methodological),
28(1):131–142, 1966.
Zeyuan Allen-Zhu, Yuanzhi Li, Rafael Oliveira, and Avi Wigderson. Much faster algorithms for
matrix scaling. arXiv preprint arXiv:1704.02315, 2017.
Jason Altschuler, Jonathan Weed, and Philippe Rigollet. Near-linear time approximation al-
gorithms for optimal transport via Sinkhorn iteration. arXiv preprint arXiv:1705.09634,
2017.


Pedro C Álvarez-Esteban, E del Barrio, JA Cuesta-Albertos, and C Matrán. A fixed-point


approach to barycenters in Wasserstein space. Journal of Mathematical Analysis and Appli-
cations, 441(2):744–762, 2016.
David Alvarez-Melis, Stefanie Jegelka, and Tommi S Jaakkola. Towards optimal transport with
global invariances. 2019.
Shun-ichi Amari, Ryo Karakida, and Masafumi Oizumi. Information geometry connecting
Wasserstein distance and Kullback-Leibler divergence via the entropy-relaxed transporta-
tion problem. Information Geometry, (1):13–37, 2018.
L. Ambrosio, N. Gigli, and G. Savaré. Gradient Flows in Metric Spaces and in the Space of
Probability Measures. Springer, 2006.
Luigi Ambrosio and Alessio Figalli. Geodesics in the space of measure-preserving maps and
plans. Archive for Rational Mechanics and Analysis, 194(2):421–462, 2009.
Ethan Anderes, Steffen Borgwardt, and Jacob Miller. Discrete Wasserstein barycenters: optimal
transport for discrete data. Mathematical Methods of Operations Research, 84(2):389–409,
2016.
Alexandr Andoni, Piotr Indyk, and Robert Krauthgamer. Earth mover distance over high-
dimensional spaces. In Proceedings of the nineteenth annual ACM-SIAM Symposium on
Discrete Algorithms, pages 343–352. Society for Industrial and Applied Mathematics, 2008.
Alexandr Andoni, Assaf Naor, and Ofer Neiman. Snowflake universality of Wasserstein spaces.
Annales scientifiques de l’École normale supérieure, 51:657–700, 2018.
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial net-
works. Proceedings of the 34th International Conference on Machine Learning, 70:214–223,
2017.
Franz Aurenhammer. Power diagrams: properties, algorithms and applications. SIAM Journal
on Computing, 16(1):78–96, 1987.
Franz Aurenhammer, Friedrich Hoffmann, and Boris Aronov. Minkowski-type theorems and
least-squares clustering. Algorithmica, 20(1):61–76, 1998.
Amir Averbuch, Ronald Coifman, David Donoho, Moshe Israeli, and Johan Walden. Fast slant
stack: a notion of radon transform for data in a cartesian grid which is rapidly computible,
algebraically exact, geometrically faithful and invertible. Tech. Rep., Stanford University,
2001.
Francis Bach. Self-concordant analysis for logistic regression. Electronic Journal of Statistics,
4:384–414, 2010.
Francis R Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity
for logistic regression. Journal of Machine Learning Research, 15(1):595–627, 2014.
Michael Bacharach. Estimating nonnegative matrices from marginal data. International Eco-
nomic Review, 6(3):294–310, 1965.
Alexander Barvinok. A Course in Convexity. Graduate Studies in Mathematics. American
Mathematical Society, 2002. ISBN 9780821829684.
Federico Bassetti, Antonella Bodini, and Eugenio Regazzini. On minimum kantorovich distance
estimators. Statistics & Probability Letters, 76(12):1298–1302, 2006.

Heinz H Bauschke and Patrick L Combettes. Convex analysis and monotone operator theory
in Hilbert spaces. Springer-Verlag, New York, 2011.
Heinz H Bauschke and Adrian S Lewis. Dykstra’s algorithm with Bregman projections: a
convergence proof. Optimization, 48(4):409–427, 2000.
Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods
for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
Martin Beckmann. A continuous model of transportation. Econometrica, 20:643–660, 1952.
Mathias Beiglböck, Pierre Henry-Labordère, and Friedrich Penkner. Model-independent bounds
for option prices: a mass transport approach. Finance and Stochastics, 17(3):477–501, 2013.
Jan Beirlant, Edward J Dudewicz, Laszlo Gyorfi, and Edward C Van der Meulen. Nonparamet-
ric entropy estimation: an overview. International Journal of Mathematical and Statistical
Sciences, 6(1):17–39, 1997.
Jean-David Benamou. Numerical resolution of an “unbalanced” mass transport problem.
ESAIM: Mathematical Modelling and Numerical Analysis, 37(05):851–868, 2003.
Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to the
Monge-Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–393, 2000.
Jean-David Benamou and Guillaume Carlier. Augmented lagrangian methods for transport
optimization, mean field games and degenerate elliptic equations. Journal of Optimization
Theory and Applications, 167(1):1–26, 2015.
Jean-David Benamou, Brittany D Froese, and Adam M Oberman. Numerical solution of the op-
timal transportation problem using the Monge–Ampere equation. Journal of Computational
Physics, 260:107–126, 2014.
Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. It-
erative Bregman projections for regularized transportation problems. SIAM Journal on Sci-
entific Computing, 37(2):A1111–A1138, 2015.
Jean-David Benamou, Guillaume Carlier, Quentin Mérigot, and Edouard Oudet. Discretization
of functionals involving the Monge–Ampère operator. Numerische Mathematik, 134(3):611–
636, 2016a.
Jean-David Benamou, Francis Collino, and Jean-Marie Mirebeau. Monotone and consistent
discretization of the Monge-Ampere operator. Mathematics of Computation, 85(302):2743–
2775, 2016b.
Christian Berg, Jens Peter Reus Christensen, and Paul Ressel. Harmonic Analysis on Semi-
groups. Number 100 in Graduate Texts in Mathematics. Springer Verlag, 1984.
Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability
and Statistics. Kluwer Academic Publishers, 2003.
Espen Bernton. Langevin Monte Carlo and JKO splitting. In Sébastien Bubeck, Vianney
Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference On Learning The-
ory, volume 75 of Proceedings of Machine Learning Research, pages 1777–1798. PMLR, 2018.
Espen Bernton, Pierre E Jacob, Mathieu Gerber, and Christian P Robert. Inference in gener-
ative models using the Wasserstein distance. arXiv preprint arXiv:1701.05146, 2017.

Dimitri P Bertsekas. A new algorithm for the assignment problem. Mathematical Programming,
21(1):152–171, 1981.
Dimitri P Bertsekas. Auction algorithms for network flow problems: a tutorial introduction.
Computational Optimization and Applications, 1(1):7–66, 1992.
Dimitri P Bertsekas. Network Optimization: Continuous and Discrete Models. Athena Scientific,
1998.
Dimitri P Bertsekas and Jonathan Eckstein. Dual coordinate step methods for linear network
flow problems. Mathematical Programming, 42(1):203–243, 1988.
Dimitris Bertsimas and John N Tsitsiklis. Introduction to Linear Optimization. Athena Scien-
tific, 1997.
Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. On the bures-wasserstein distance between
positive definite matrices. Expositiones Mathematicae, to appear, 2018.
Jérémie Bigot and Thierry Klein. Consistent estimation of a population barycenter in the
Wasserstein space. arXiv Preprint arXiv:1212.2562, 2012a.
Jérémie Bigot and Thierry Klein. Characterization of barycenters in the Wasserstein space by
averaging optimal transport maps. arXiv preprint arXiv:1212.2562, 2012b.
Jérémie Bigot, Elsa Cazelles, and Nicolas Papadakis. Central limit theorems for sinkhorn
divergence between probability distributions on finite spaces and statistical applications.
arXiv preprint arXiv:1711.08947, 2017a.
Jérémie Bigot, Raúl Gouet, Thierry Klein, and Alfredo López. Geodesic PCA in the Wasserstein
space by convex pca. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 53
(1):1–26, 2017b.
Garrett Birkhoff. Tres observaciones sobre el algebra lineal. Universidad Nacional de Tucumán
Revista Series A, 5:147–151, 1946.
Garrett Birkhoff. Extensions of jentzsch’s theorem. Transactions of the American Mathematical
Society, 85(1):219–227, 1957.
Adrien Blanchet and Guillaume Carlier. Optimal transport and Cournot-Nash equilibria. Math-
ematics of Operations Research, 41(1):125–145, 2015.
Adrien Blanchet, Vincent Calvez, and José A Carrillo. Convergence of the mass-transport
steepest descent scheme for the subcritical Patlak-Keller-Segel model. SIAM Journal on
Numerical Analysis, 46(2):691–721, 2008.
Emmanuel Boissard. Simple bounds for the convergence of empirical and occupation measures
in 1-Wasserstein distance. Electronic Journal of Probability, 16:2296–2333, 2011.
Emmanuel Boissard, Thibaut Le Gouic, and Jean-Michel Loubes. Distribution’s template esti-
mate with Wasserstein metrics. Bernoulli, 21(2):740–759, 2015.
François Bolley, Arnaud Guillin, and Cédric Villani. Quantitative concentration inequalities
for empirical measures on non-compact spaces. Probability Theory and Related Fields, 137
(3):541–593, 2007.

Nicolas Bonneel, Michiel Van De Panne, Sylvain Paris, and Wolfgang Heidrich. Displacement
interpolation using lagrangian mass transport. ACM Transactions on Graphics, 30(6):158,
2011.
Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. Sliced and Radon Wasser-
stein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45,
2015.
Nicolas Bonneel, Gabriel Peyré, and Marco Cuturi. Wasserstein barycentric coordinates: his-
togram regression using optimal transport. ACM Transactions on Graphics, 35(4):71:1–71:10,
2016.
CW Borchardt and CGJ Jacobi. De investigando ordine systematis aequationum differentialium
vulgarium cujuscunque. Journal für die reine und angewandte Mathematik, 64:297–320, 1865.
Ingwer Borg and Patrick JF Groenen. Modern Multidimensional Scaling: Theory and Applica-
tions. Springer Science & Business Media, 2005.
Mario Botsch, Leif Kobbelt, Mark Pauly, Pierre Alliez, and Bruno Lévy. Polygon mesh process-
ing. Taylor & Francis, 2010.
Olivier Bousquet, Sylvain Gelly, Ilya Tolstikhin, Carl-Johann Simon-Gabriel, and Bernhard
Schoelkopf. From optimal transport to generative modeling: the VEGAN cookbook. arXiv
preprint arXiv:1705.07642, 2017.
Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed
optimization and statistical learning via the alternating direction method of multipliers.
Foundations and Trends in Machine Learning, 3(1):1–122, January 2011.
Lev M Bregman. The relaxation method of finding the common point of convex sets and
its application to the solution of problems in convex programming. USSR Computational
Mathematics and Mathematical Physics, 7(3):200–217, 1967.
Yann Brenier. Décomposition polaire et réarrangement monotone des champs de vecteurs. C.
R. Acad. Sci. Paris Sér. I Math., 305(19):805–808, 1987.
Yann Brenier. The least action principle and the related concept of generalized flows for in-
compressible perfect fluids. Journal of the AMS, 2:225–255, 1990.
Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions.
Communications on Pure and Applied Mathematics, 44(4):375–417, 1991.
Yann Brenier. The dual least action problem for an ideal, incompressible fluid. Archive for
Rational Mechanics and Analysis, 122(4):323–351, 1993.
Yann Brenier. Minimal geodesics on groups of volume-preserving maps and generalized solutions
of the Euler equations. Communications on Pure and Applied Mathematics, 52(4):411–452,
1999.
Yann Brenier. Generalized solutions and hydrostatic approximation of the Euler equations.
Physica D. Nonlinear Phenomena, 237(14-17):1982–1988, 2008.
Alexander M Bronstein, Michael M Bronstein, and Ron Kimmel. Generalized multidimensional
scaling: a framework for isometry-invariant partial surface matching. Proceedings of the
National Academy of Sciences, 103(5):1168–1172, 2006.

Alexander M Bronstein, Michael M Bronstein, Ron Kimmel, Mona Mahmoudi, and Guillermo
Sapiro. A Gromov-Hausdorff framework with diffusion geometry for topologically-robust non-
rigid shape matching. International Journal on Computer Vision, 89(2-3):266–286, 2010.
Richard A Brualdi. Combinatorial Matrix Classes, volume 108. Cambridge University Press,
2006.
Dmitri Burago, Yuri Burago, and Sergei Ivanov. A Course in Metric Geometry, volume 33.
American Mathematical Society Providence, RI, 2001.
Donald Bures. An extension of Kakutani’s theorem on infinite product measures to the tensor
product of semifinite w∗ -algebras. Transactions of the American Mathematical Society, 135:
199–212, 1969.
Martin Burger, José Antonio Carrillo de la Plata, and Marie-Therese Wolfram. A mixed finite
element method for nonlinear diffusion equations. Kinetic and Related Models, 3(1):59–83,
2010.
Martin Burger, Marzena Franek, and Carola-Bibiane Schönlieb. Regularised regression and
density estimation based on optimal transport. Applied Mathematics Research Express, 2:
209–253, 2012.
Giuseppe Buttazzo, Luigi De Pascale, and Paola Gori-Giorgi. Optimal-transport formulation
of electronic density-functional theory. Physical Review A, 85(6):062502, 2012.
Luis Caffarelli. The Monge-Ampere equation and optimal transportation, an elementary review.
Lecture Notes in Mathematics, Springer-Verlag, pages 1–10, 2003.
Luis Caffarelli, Mikhail Feldman, and Robert McCann. Constructing optimal maps for Monge’s
transport problem as a limit of strictly convex costs. Journal of the American Mathematical
Society, 15(1):1–26, 2002.
Luis A Caffarelli and Robert J McCann. Free boundaries in optimal transport and Monge-
Ampère obstacle problems. Annals of Mathematics, 171(2):673–730, 2010.
Luis A Caffarelli, Sergey A Kochengin, and Vladimir I Oliker. Problem of reflector design with
given far-field scattering data. In Monge Ampère equation: applications to geometry and
optimization, volume 226, page 13, 1999.
Guillermo Canas and Lorenzo Rosasco. Learning probability measures with respect to optimal
transport metrics. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors,
Advances in Neural Information Processing Systems 25, pages 2492–2500. 2012.
Eric A Carlen and Jan Maas. An analog of the 2-Wasserstein metric in non-commutative prob-
ability under which the fermionic Fokker–Planck equation is gradient flow for the entropy.
Communications in Mathematical Physics, 331(3):887–926, 2014.
Guillaume Carlier and Ivar Ekeland. Matching for teams. Economic Theory, 42(2):397–418,
2010.
Guillaume Carlier and Clarice Poon. On the total variation Wasserstein gradient flow and the
TV-JKO scheme. to appear in ESAIM: COCV, 2019.
Guillaume Carlier, Chloé Jimenez, and Filippo Santambrogio. Optimal transportation with
traffic congestion and Wardrop equilibria. SIAM Journal on Control and Optimization, 47
(3):1330–1350, 2008.

Guillaume Carlier, Alfred Galichon, and Filippo Santambrogio. From knothe’s transport to Bre-
nier’s map and a continuation method for optimal transport. SIAM Journal on Mathematical
Analysis, 41(6):2554–2576, 2010.
Guillaume Carlier, Adam Oberman, and Edouard Oudet. Numerical methods for matching
for teams and Wasserstein barycenters. ESAIM: Mathematical Modelling and Numerical
Analysis, 49(6):1621–1642, 2015.
Guillaume Carlier, Victor Chernozhukov, and Alfred Galichon. Vector quantile regression be-
yond correct specification. arXiv preprint arXiv:1610.06833, 2016.
Guillaume Carlier, Vincent Duval, Gabriel Peyré, and Bernhard Schmitzer. Convergence of
entropic schemes for optimal transport and gradient flows. SIAM Journal on Mathematical
Analysis, 49(2):1385–1418, 2017.
José A Carrillo and J Salvador Moll. Numerical simulation of diffusive and aggregation phe-
nomena in nonlinear continuity equations by evolving diffeomorphisms. SIAM Journal on
Scientific Computing, 31(6):4305–4329, 2009.
José A Carrillo, Alina Chertock, and Yanghong Huang. A finite-volume method for nonlin-
ear nonlocal equations with a gradient flow structure. Communications in Computational
Physics, 17:233–258, 1 2015.
Yair Censor and Simeon Reich. The Dykstra algorithm with Bregman projections. Communi-
cations in Applied Analysis, 2:407–419, 1998.
Yair Censor and Stavros Andrea Zenios. Proximal minimization algorithm with d-functions.
Journal of Optimization Theory and Applications, 73(3):451–464, 1992.
Thierry Champion, Luigi De Pascale, and Petri Juutinen. The ∞-wasserstein distance: local
solutions and existence of optimal transport maps. SIAM Journal on Mathematical Analysis,
40(1):1–20, 2008.
Timothy M Chan. Optimal output-sensitive convex hull algorithms in two and three dimensions.
Discrete & Computational Geometry, 16(4):361–368, 1996.
Yongxin Chen, Tryphon T Georgiou, and Michele Pavon. On the relation between optimal
transport and Schrödinger bridges: A stochastic control viewpoint. Journal of Optimization
Theory and Applications, 169(2):671–691, 2016a.
Yongxin Chen, Tryphon T Georgiou, and Allen Tannenbaum. Matrix optimal mass transport:
a quantum mechanical approach. arXiv preprint arXiv:1610.03041, 2016b.
Yongxin Chen, Wilfrid Gangbo, Tryphon T Georgiou, and Allen Tannenbaum. On the matrix
Monge-Kantorovich problem. arXiv preprint arXiv:1701.02826, 2017.
Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Unbalanced
optimal transport: geometry and Kantorovich formulation. Journal of Functional Analysis,
274(11):3090–3123, 2018a.
Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Scaling al-
gorithms for unbalanced transport problems. Mathematics of Computation, 87:2563–2609,
2018b.
Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. An interpo-
lating distance between optimal transport and Fisher–Rao metrics. Foundations of Compu-
tational Mathematics, 18(1):1–44, 2018c.
Shui-Nee Chow, Wen Huang, Yao Li, and Haomin Zhou. Fokker-Planck equations for a free en-
ergy functional or Markov process on a graph. Archive for Rational Mechanics and Analysis,
203(3):969–1008, 2012.
Shui-Nee Chow, Wuchen Li, and Haomin Zhou. A discrete Schrodinger equation via optimal
transport on graphs. arXiv preprint arXiv:1705.07583, 2017a.
Shui-Nee Chow, Wuchen Li, and Haomin Zhou. Entropy dissipation of Fokker-Planck equations
on graphs. arXiv preprint arXiv:1701.04841, 2017b.
Samir Chowdhury and Facundo Mémoli. Constructing geodesics on the space of compact metric
spaces. arXiv preprint arXiv:1603.02385, 2016.
Haili Chui and Anand Rangarajan. A new algorithm for non-rigid point matching. In Computer
Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, volume 2, pages
44–51. IEEE, 2000.
Imre Csiszár. Information-type measures of difference of probability distributions and indirect
observations. Studia Scientiarum Mathematicarum Hungarica, 2:299–318, 1967.
Michael B Cohen, Aleksander Madry, Dimitris Tsipras, and Adrian Vladu. Matrix scaling and
balancing via box constrained Newton’s method and interior point methods. arXiv preprint
arXiv:1704.02310, 2017.
Scott Cohen and Leonidas Guibas. The earth mover’s distance under transformation sets. In
Proceedings of the Seventh IEEE International Conference on Computer vision, volume 2,
pages 1076–1083. IEEE, 1999.
Patrick L Combettes and Jean-Christophe Pesquet. A Douglas-Rachford splitting approach to
nonsmooth convex variational signal recovery. IEEE Journal of Selected Topics in Signal
Processing, 1(4):564 –574, 2007.
Roberto Cominetti and Jaime San Martín. Asymptotic analysis of the exponential penalty
trajectory in linear programming. Mathematical Programming, 67(1-3):169–187, 1994.
Laurent Condat. Fast projection onto the simplex and the `1 ball. Math. Programming, Ser.
A, pages 1–11, 2015.
Sueli IR Costa, Sandra A Santos, and João E Strapasson. Fisher information distance: a
geometrical reading. Discrete Applied Mathematics, 197:59–69, 2015.
Codina Cotar, Gero Friesecke, and Claudia Klüppelberg. Density functional theory and optimal
transportation with Coulomb cost. Communications on Pure and Applied Mathematics, 66
(4):548–599, 2013.
Nicolas Courty, Rémi Flamary, Devis Tuia, and Thomas Corpetti. Optimal transport for data
fusion in remote sensing. In 2016 IEEE International Geoscience and Remote Sensing Sym-
posium, pages 3571–3574. IEEE, 2016.

Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribu-
tion optimal transportation for domain adaptation. In I. Guyon, U. V. Luxburg, S. Bengio,
H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Infor-
mation Processing Systems 30, pages 3730–3739. 2017a.
Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for
domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39
(9):1853–1865, 2017b.
Keenan Crane, Clarisse Weischedel, and Max Wardetzky. Geodesics in heat: a new approach to
computing distance based on heat flow. ACM Transaction on Graphics, 32(5):152:1–152:11,
October 2013.
Juan Antonio Cuesta and Carlos Matran. Notes on the wasserstein metric in hilbert spaces.
The Annals of Probability, 17(3):1264–1276, 07 1989.
Marco Cuturi. Positivity and transportation. arXiv preprint 1209.2655, 2012.
Marco Cuturi. Sinkhorn distances: lightspeed computation of optimal transport. In Advances
in Neural Information Processing Systems 26, pages 2292–2300, 2013.
Marco Cuturi and David Avis. Ground metric learning. Journal of Machine Learning Research,
15:533–564, 2014.
Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In Proceedings
of ICML, volume 32, pages 685–693, 2014.
Marco Cuturi and Kenji Fukumizu. Kernels on structured objects through nested histograms.
In P. B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information
Processing Systems 19, pages 329–336. MIT Press, 2007.
Marco Cuturi and Gabriel Peyré. A smoothed dual approach for variational Wasserstein prob-
lems. SIAM Journal on Imaging Sciences, 9(1):320–343, 2016.
Marco Cuturi and Gabriel Peyré. Semidual regularized optimal transport. SIAM Review, 60
(4):941–965, 2018.
Arnak Dalalyan. Further and stronger analogy between sampling and optimization: Langevin
monte carlo and gradient descent. In Proceedings of the 2017 Conference on Learning Theory,
volume 65 of Proceedings of Machine Learning Research, pages 678–689. PMLR, 2017.
Arnak S Dalalyan and Avetik G Karagulyan. User-friendly guarantees for the Langevin Monte
Carlo with inaccurate gradient. arXiv preprint arXiv:1710.00095, 2017.
George B. Dantzig. Programming of interdependent activities: II mathematical model. Econo-
metrica, 17(3/4):200–211, 1949.
George B Dantzig. Application of the simplex method to a transportation problem. Activity
Analysis of Production and Allocation, 13:359–373, 1951.
George B. Dantzig. Reminiscences about the origins of linear programming, pages 78–86.
Springer, 1983.
George B. Dantzig. Linear programming. In J. K. Lenstra, A. H. G. Rinnooy Kan, and A. Schri-
jver, editors, History of mathematical programming: a collection of personal reminiscences,
pages 257–282. Elsevier Science Publishers, 1991.

Jon Dattorro. Convex Optimization & Euclidean Distance Geometry. Meboo Publishing, 2017.
Fernando De Goes, Katherine Breeden, Victor Ostromoukhov, and Mathieu Desbrun. Blue
noise through optimal transport. ACM Transactions on Graphics, 31(6):171, 2012.
Fernando de Goes, Corentin Wallez, Jin Huang, Dmitry Pavlov, and Mathieu Desbrun. Power
particles: an incompressible fluid solver based on power diagrams. ACM Transaction Graph-
ics, 34(4):50:1–50:11, July 2015.
Eustasio del Barrio, JA Cuesta-Albertos, C Matrán, and A Mayo-Íscar. Robust clustering tools
based on optimal transportation. arXiv preprint arXiv:1607.01179, 2016.
Julie Delon. Midway image equalization. Journal of Mathematical Imaging and Vision, 21(2):
119–134, 2004.
Julie Delon, Julien Salomon, and Andrei Sobolevski. Fast transport optimization for Monge
costs on the circle. SIAM Journal on Applied Mathematics, 70(7):2239–2258, 2010.
Julie Delon, Julien Salomon, and Andrei Sobolevski. Local matching indicators for transport
problems with concave costs. SIAM Journal on Discrete Mathematics, 26(2):801–827, 2012.
Edwards Deming and Frederick F Stephan. On a least squares adjustment of a sampled fre-
quency table when the expected marginal totals are known. Annals of Mathematical Statistics,
11(4):427–444, 1940.
Steffen Dereich, Michael Scheutzow, and Reik Schottstedt. Constructive quantization: Ap-
proximation by empirical measures. In Annales de l’Institut Henri Poincaré, Probabilités et
Statistiques, volume 49, pages 1183–1203, 2013.
Rachid Deriche. Recursively implementating the Gaussian and its derivatives. PhD thesis,
INRIA, 1993.
Arnaud Dessein, Nicolas Papadakis, and Charles-Alban Deledalle. Parameter estimation in
finite mixture models by regularized optimal transport: a unified framework for hard and
soft clustering. arXiv preprint arXiv:1711.04366, 2017.
Arnaud Dessein, Nicolas Papadakis, and Jean-Luc Rouas. Regularized optimal transport and
the rot mover’s distance. Journal of Machine Learning Research, 19(15):1–53, 2018.
Simone Di Marino and Lenaic Chizat. A tumor growth model of Hele-Shaw type as a gradient
flow. Arxiv, 2017.
Khanh Do Ba, Huy L Nguyen, Huy N Nguyen, and Ronitt Rubinfeld. Sublinear time algorithms
for earth mover’s distance. Theory of Computing Systems, 48(2):428–442, 2011.
Jean Dolbeault, Bruno Nazaret, and Giuseppe Savaré. A new class of transport distances
between measures. Calculus of Variations and Partial Differential Equations, 34(2):193–231,
2009.
Yan Dolinsky and H Mete Soner. Martingale optimal transport and robust hedging in continuous
time. Probability Theory and Related Fields, 160(1-2):391–427, 2014.
Richard M. Dudley. The speed of mean Glivenko-Cantelli convergence. Annals of Mathematical
Statistics, 40(1):40–50, 1969.
Arnaud Dupuy, Alfred Galichon, and Yifei Sun. Estimating matching affinity matrix under
low-rank constraints. Arxiv:1612.09585, 2016.

Pavel Dvurechenskii, Darina Dvinskikh, Alexander Gasnikov, Cesar Uribe, and Angelia Nedich.
Decentralize and randomize: Faster algorithm for wasserstein barycenters. In S. Bengio,
H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances
in Neural Information Processing Systems 31, pages 10783–10793. 2018.
Pavel Dvurechensky, Alexander Gasnikov, and Alexey Kroshnin. Computational optimal trans-
port: Complexity by accelerated gradient descent is better than by sinkhorn’s algorithm. In
Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference
on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1367–
1376. PMLR, 2018.
Richard L Dykstra. An algorithm for restricted least squares regression. Journal American
Statistical Association, 78(384):839–842, 1983.
Richard L Dykstra. An iterative procedure for obtaining I-projections onto the intersection of
convex sets. Annals of Probability, 13(3):975–984, 1985.
Jonathan Eckstein and Dimitri P Bertsekas. On the Douglas-Rachford splitting method and
the proximal point algorithm for maximal monotone operators. Mathematical Programming,
55:293–318, 1992.
David A Edwards. The structure of superspace. In Studies in topology, pages 121–133. Elsevier,
1975.
Tarek A El Moselhy and Youssef M Marzouk. Bayesian inference with optimal maps. Journal
of Computational Physics, 231(23):7815–7850, 2012.
Dominik Maria Endres and Johannes E Schindelin. A new metric for probability distributions.
IEEE Transactions on Information theory, 49(7):1858–1860, 2003.
Matthias Erbar. The heat equation on manifolds as a gradient flow in the Wasserstein space.
Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 46(1):1–23, 2010.
Matthias Erbar and Jan Maas. Gradient flow structures for discrete porous medium equations.
Discrete and Continuous Dynamical Systems, 34(4):1355–1374, 2014.
Sven Erlander. Optimal Spatial Interaction and the Gravity Model, volume 173. Springer-Verlag,
1980.
Sven Erlander and Neil F Stewart. The Gravity Model in Transportation Analysis: Theory and
Extensions. 1990.
Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization
using the wasserstein metric: Performance guarantees and tractable reformulations. Mathe-
matical Programming, 171(1-2):115–166, 2018.
Montacer Essid and Justin Solomon. Quadratically-regularized optimal transport on graphs.
arXiv preprint arXiv:1704.08200, 2017.
Lawrence C. Evans and Wilfrid Gangbo. Differential Equations Methods for the Monge-
Kantorovich Mass Transfer Problem, volume 653. American Mathematical Society, 1999.
Mikhail Feldman and Robert McCann. Monge’s transport problem on a Riemannian manifold.
Transaction AMS, 354(4):1667–1697, 2002.
Jean Feydy, Benjamin Charlier, François-Xavier Vialard, and Gabriel Peyré. Optimal transport
for diffeomorphic registration. In Proceedings of MICCAI’17, pages 291–299. Springer, 2017.
Jean Feydy, Thibault Séjourné, François-Xavier Vialard, Shun-Ichi Amari, Alain Trouvé, and
Gabriel Peyré. Interpolating between optimal transport and MMD using Sinkhorn divergences.
In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics,
2019.
Alessio Figalli. The optimal partial transport problem. Archive for Rational Mechanics and
Analysis, 195(2):533–560, 2010.
Rémi Flamary, Cédric Févotte, Nicolas Courty, and Valentin Emiya. Optimal spectral trans-
portation with application to music transcription. In Advances in Neural Information Pro-
cessing Systems, pages 703–711, 2016.
Lester Randolph Ford and Delbert Ray Fulkerson. Flows in Networks. Princeton University
Press, 1962.
Peter J Forrester and Mario Kieburg. Relating the Bures measure to the Cauchy two-matrix
model. Communications in Mathematical Physics, 342(1):151–187, 2016.
Nicolas Fournier and Arnaud Guillin. On the rate of convergence in Wasserstein distance of
the empirical measure. Probability Theory and Related Fields, 162(3-4):707–738, 2015.
Joel Franklin and Jens Lorenz. On the scaling of multidimensional matrices. Linear Algebra
and its Applications, 114:717–735, 1989.
Uriel Frisch, Sabino Matarrese, Roya Mohayaee, and Andrei Sobolevski. A reconstruction of
the initial conditions of the universe by optimal mass transportation. Nature, 417(6886):
260–262, 2002.
Brittany D Froese and Adam M Oberman. Convergent finite difference solvers for viscosity
solutions of the elliptic monge–ampère equation in dimensions two and higher. SIAM Journal
on Numerical Analysis, 49(4):1692–1714, 2011.
Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A Poggio.
Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems,
pages 2053–2061, 2015.
Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational
problems via finite element approximation. Computers & Mathematics with Applications, 2
(1):17–40, 1976.
Alfred Galichon. Optimal Transport Methods in Economics. Princeton University Press, 2016.
Alfred Galichon and Bernard Salanié. Matching with trade-offs: revealed preferences over com-
peting characteristics. Technical report, Preprint SSRN-1487307, 2009.
Alfred Galichon, Pierre Henry-Labordère, and Nizar Touzi. A stochastic control approach to
no-arbitrage bounds given marginals, with an application to lookback options. Annals of
Applied Probability, 24(1):312–336, 2014.
Thomas O Gallouët and Quentin Mérigot. A lagrangian scheme à la brenier for the incompress-
ible euler equations. Foundations of Computational Mathematics, 18:1–31, 2017.
Thomas O Gallouët and Leonard Monsaingeon. A JKO splitting scheme for Kantorovich–
Fisher–Rao gradient flows. SIAM Journal on Mathematical Analysis, 49(2):1100–1130, 2017.
Wilfrid Gangbo and Robert J McCann. The geometry of optimal transportation. Acta Mathe-
matica, 177(2):113–161, 1996.

Wilfrid Gangbo and Andrzej Swiech. Optimal maps for the multidimensional Monge-
Kantorovich problem. Communications on Pure and Applied Mathematics, 51(1):23–45, 1998.
Rui Gao, Liyan Xie, Yao Xie, and Huan Xu. Robust hypothesis testing using Wasserstein
uncertainty sets. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and
R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7913–7923.
2018.
Matthias Gelbrich. On a formula for the L2 Wasserstein metric between measures on Euclidean
and Hilbert spaces. Mathematische Nachrichten, 147(1):185–203, 1990.
Aude Genevay, Marco Cuturi, Gabriel Peyré, and Francis Bach. Stochastic optimization for
large-scale optimal transport. In Advances in Neural Information Processing Systems, pages
3440–3448, 2016.
Aude Genevay, Gabriel Peyré, and Marco Cuturi. GAN and VAE from an optimal transport
point of view. arXiv preprint arXiv:1706.01807, 2017.
Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorn
divergences. In Proceedings of the 21st International Conference on Artificial Intelligence
and Statistics, pages 1608–1617, 2018.
Aude Genevay, Lénaic Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample complexity
of Sinkhorn divergences. In Proceedings of the 22nd International Conference on Artificial
Intelligence and Statistics, 2019.
Ivan Gentil, Christian Léonard, and Luigia Ripani. About the analogy between optimal trans-
port and minimal entropy. arXiv preprint arXiv:1510.08230, 2015.
Alan George and Joseph WH Liu. The evolution of the minimum degree ordering algorithm.
SIAM Review, 31(1):1–19, 1989.
Tryphon T Georgiou and Michele Pavon. Positive contraction mappings for classical and quan-
tum Schrödinger systems. Journal of Mathematical Physics, 56(3):033301, 2015.
Pascal Getreuer. A survey of Gaussian convolution algorithms. Image Processing On Line,
2013:286–310, 2013.
Ugo Gianazza, Giuseppe Savaré, and Giuseppe Toscani. The Wasserstein gradient flow of the
Fisher information and the quantum drift-diffusion equation. Archive for Rational Mechanics
and Analysis, 194(1):133–220, 2009.
Alison L Gibbs and Francis Edward Su. On choosing and bounding probability metrics. Inter-
national Statistical Review, 70(3):419–435, 2002.
Joan Glaunes, Alain Trouvé, and Laurent Younes. Diffeomorphic matching of distributions:
a new approach for unlabelled point-sets and sub-manifolds matching. In Proceedings of
the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition,
volume 2, 2004.
Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sen-
timent classification: A deep learning approach. In Proceedings of the 28th International
Conference on Machine Learning, pages 513–520, 2011.
Roland Glowinski and A. Marroco. Sur l’approximation, par éléments finis d’ordre un, et
la résolution, par pénalisation-dualité d’une classe de problèmes de Dirichlet non linéaires.
ESAIM: Mathematical Modelling and Numerical Analysis, 9(R2):41–76, 1975.
Steven Gold and Anand Rangarajan. A graduated assignment algorithm for graph matching.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4):377–388, April 1996.
Steven Gold, Anand Rangarajan, Chien-Ping Lu, Suguna Pappu, and Eric Mjolsness. New
algorithms for 2d and 3d point matching: pose estimation and correspondence. Pattern
Recognition, 31(8):1019–1031, 1998.
Eusebio Gómez, Miguel A Gómez-Villegas, and J Miguel Marín. A survey on continuous ellip-
tical vector distributions. Rev. Mat. Complut, 16:345–361, 2003.
Paola Gori-Giorgi, Michael Seidl, and Giovanni Vignale. Density-functional theory for strongly
interacting electrons. Physical Review Letters, 103(16):166402, 2009.
Alexandre Gramfort, Gabriel Peyré, and Marco Cuturi. Fast optimal transport averaging of
neuroimaging data. In Information Processing in Medical Imaging - 24th International Con-
ference, IPMI 2015, pages 261–272, 2015.
Kristen Grauman and Trevor Darrell. The pyramid match kernel: discriminative classification
with sets of image features. In Tenth IEEE International Conference on Computer Vision,
volume 2, pages 1458–1465. IEEE, 2005.
Edouard Grave, Armand Joulin, and Quentin Berthet. Unsupervised alignment of embeddings
with Wasserstein Procrustes. In Proceedings of the 22nd International Conference on Artificial
Intelligence and Statistics, 2019.
Arthur Gretton, Karsten M Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex J Smola.
A kernel method for the two-sample-problem. In Advances in Neural Information Processing
Systems, pages 513–520, 2007.
Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander
Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773,
2012.
Andreas Griewank. Achieving logarithmic growth of temporal and spatial complexity in reverse
automatic differentiation. Optimization Methods and Software, 1(1):35–54, 1992.
Andreas Griewank and Andrea Walther. Evaluating Derivatives: Principles and Techniques of
Algorithmic Differentiation. SIAM, 2008.
Mikhail Gromov. Metric Structures for Riemannian and Non-Riemannian Spaces. Progress in
Mathematics. Birkhäuser, 2001.
Gaoyue Guo and Jan Obloj. Computational methods for martingale optimal transport prob-
lems. arXiv preprint arXiv:1710.07911, 2017.
Leonid Gurvits. Classical complexity and quantum entanglement. Journal of Computer and
System Sciences, 69(3):448–484, 2004.
Cristian E Gutiérrez. The Monge-Ampere Equation. Springer, 2016.
Jorge Gutierrez, Julien Rabin, Bruno Galerne, and Thomas Hurtut. Optimal patch assignment
for statistically constrained texture synthesis. In International Conference on Scale Space
and Variational Methods in Computer Vision, pages 172–183. Springer, 2017.
A Hadjidimos. Successive overrelaxation (SOR) and related methods. Journal of Computational
and Applied Mathematics, 123(1):177–199, 2000.
Steven Haker, Lei Zhu, Allen Tannenbaum, and Sigurd Angenent. Optimal mass transport for
registration and warping. International Journal of Computer Vision, 60(3):225–240, 2004.
Leonid G Hanin. Kantorovich-Rubinstein norm and its application in the theory of Lipschitz
spaces. Proceedings of the American Mathematical Society, 115(2):345–352, 1992.
Tatsunori Hashimoto, David Gifford, and Tommi Jaakkola. Learning population-level diffusions
with generative RNNs. In International Conference on Machine Learning, pages 2417–2426,
2016.
David Hilbert. Über die gerade Linie als kürzeste Verbindung zweier Punkte. Mathematische
Annalen, 46(1):91–96, 1895.
Frank L Hitchcock. The distribution of a product from several sources to numerous localities.
Studies in Applied Mathematics, 20(1-4):224–230, 1941.
Nhat Ho, XuanLong Nguyen, Mikhail Yurochkin, Hung Hai Bui, Viet Huynh, and Dinh Phung.
Multilevel clustering via Wasserstein means. In International Conference on Machine Learn-
ing, pages 1501–1509, 2017.
Thomas Hofmann, Bernhard Schölkopf, and Alexander J Smola. Kernel methods in machine
learning. Annals of Statistics, 36(3):1171–1220, 2008.
David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant. Applied Logistic Regression,
volume 398. John Wiley & Sons, 2013.
Gao Huang, Chuan Guo, Matt J Kusner, Yu Sun, Fei Sha, and Kilian Q Weinberger. Supervised
word mover’s distance. In Advances in Neural Information Processing Systems, pages 4862–
4870, 2016.
Martin Idel. A review of matrix scaling and Sinkhorn’s normal form for matrices and positive
maps. arXiv preprint arXiv:1609.06349, 2016.
Piotr Indyk and Eric Price. K-median clustering, model-based compressive sensing, and sparse
recovery for earth mover distance. In Proceedings of the forty-third annual ACM Symposium
on Theory of Computing, pages 627–636. ACM, 2011.
Piotr Indyk and Nitin Thaper. Fast image retrieval via embeddings. In 3rd International
Workshop on Statistical and Computational Theories of Vision, 2003.
Xianhua Jiang, Lipeng Ning, and Tryphon T Georgiou. Distances and Riemannian metrics for
multivariate spectral densities. IEEE Transactions on Automatic Control, 57(7):1723–1735,
2012.
William B Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert
space. In Conference in Modern Analysis and Probability (New Haven, Conn., 1982), vol-
ume 26 of Contemporary Mathematics, pages 189–206. American Mathematical Society, 1984.
Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker-
Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.
Leonid Kantorovich. On the transfer of masses (in Russian). Doklady Akademii Nauk, 37(2):
227–229, 1942.
L.V. Kantorovich and G.S. Rubinstein. On a space of totally additive functions. Vestn Leningrad
Universitet, 13:52–59, 1958.
Hermann Karcher. Riemannian center of mass and so called Karcher mean. arXiv preprint
arXiv:1407.2087, 2014.
Johan Karlsson and Axel Ringh. Generalized Sinkhorn iterations for regularizing inverse prob-
lems using optimal mass transport. arXiv preprint arXiv:1612.02273, 2016.
Sanggyun Kim, Rui Ma, Diego Mesa, and Todd P Coleman. Efficient Bayesian inference meth-
ods via convex optimization and optimal transport. In IEEE International Symposium on
Information Theory, pages 2259–2263. IEEE, 2013.
David Kinderlehrer and Noel J Walkington. Approximation of parabolic equations using the
Wasserstein metric. ESAIM: Mathematical Modelling and Numerical Analysis, 33(04):837–
852, 1999.
Jun Kitagawa, Quentin Mérigot, and Boris Thibert. A Newton algorithm for semi-discrete
optimal transport. arXiv preprint arXiv:1603.05579, 2016.
Philip A Knight. The Sinkhorn–Knopp algorithm: convergence and applications. SIAM Journal
on Matrix Analysis and Applications, 30(1):261–275, 2008.
Philip A Knight and Daniel Ruiz. A fast algorithm for matrix balancing. IMA Journal of
Numerical Analysis, 33(3):1029–1047, 2013.
Philip A Knight, Daniel Ruiz, and Bora Uçar. A symmetry preserving algorithm for matrix
scaling. SIAM Journal on Matrix Analysis and Applications, 35(3):931–955, 2014.
Martin Knott and Cyril S Smith. On the optimal mapping of distributions. Journal of Opti-
mization Theory and Applications, 43(1):39–49, 1984.
Martin Knott and Cyril S Smith. On a generalization of cyclic monotonicity and distances
among random vectors. Linear Algebra and Its Applications, 199:363–371, 1994.
Soheil Kolouri, Yang Zou, and Gustavo K Rohde. Sliced Wasserstein kernels for probabil-
ity distributions. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 5258–5267, 2016.
Soheil Kolouri, Se Rim Park, Matthew Thorpe, Dejan Slepcev, and Gustavo K Rohde. Optimal
mass transport: signal processing and machine-learning applications. IEEE Signal Processing
Magazine, 34(4):43–59, 2017.
Stanislav Kondratyev, Léonard Monsaingeon, and Dmitry Vorotnikov. A new optimal transport
distance on the space of finite Radon measures. Advances in Differential Equations, 21
(11/12):1117–1164, 2016.
Tjalling C Koopmans. Optimum utilization of the transportation system. Econometrica: Jour-
nal of the Econometric Society, pages 136–146, 1949.
Jonathan Korman and Robert McCann. Optimal transportation with capacity constraints.
Transactions of the American Mathematical Society, 367(3):1501–1521, 2015.
Bernhard Korte and Jens Vygen. Combinatorial Optimization. Springer, 2012.
JJ Kosowsky and Alan L Yuille. The invisible hand algorithm: Solving the assignment problem
with statistical physics. Neural networks, 7(3):477–490, 1994.
J. Kruithof. Telefoonverkeersrekening. De Ingenieur, 52:E15–E25, 1937.
Harold W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics
Quarterly, 2:83–97, 1955.
Brian Kulis. Metric learning: a survey. Foundations and Trends in Machine Learning, 5(4):
287–364, 2012.
Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to
document distances. In International Conference on Machine Learning, pages 957–966, 2015.
Theo Lacombe, Marco Cuturi, and Steve Oudot. Large scale computation of means and clusters
for persistence diagrams using optimal transport. Advances in Neural Information Processing
Systems 31, pages 9792–9802, 2018.
Rongjie Lai and Hongkai Zhao. Multiscale nonrigid point cloud registration using rotation-
invariant sliced-Wasserstein distance via Laplace–Beltrami eigenmap. SIAM Journal on Imag-
ing Sciences, 10(2):449–483, 2017.
Hugo Lavenant. Harmonic mappings valued in the Wasserstein space. Preprint cvgmt 3649,
2017.
Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid
matching for recognizing natural scene categories. In IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, volume 2, pages 2169–2178. IEEE, 2006.
Thibaut Le Gouic and Jean-Michel Loubes. Existence and consistency of Wasserstein barycen-
ters. Probability Theory and Related Fields, 168:901–917, 2016.
Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix
factorization. Nature, 401(6755):788–791, 1999.
Jaeho Lee and Maxim Raginsky. Minimax statistical learning with Wasserstein distances. In
S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors,
Advances in Neural Information Processing Systems 31, pages 2692–2701. 2018.
William Leeb and Ronald Coifman. Hölder–Lipschitz norms and their duals on spaces with
semigroups, with applications to earth mover’s distance. Journal of Fourier Analysis and
Applications, 22(4):910–953, 2016.
Jan Lellmann, Dirk A Lorenz, Carola Schönlieb, and Tuomo Valkonen. Imaging with
Kantorovich–Rubinstein discrepancy. SIAM Journal on Imaging Sciences, 7(4):2833–2859,
2014.
Bas Lemmens and Roger Nussbaum. Nonlinear Perron-Frobenius Theory, volume 189. Cam-
bridge University Press, 2012.
Christian Léonard. From the Schrödinger problem to the Monge–Kantorovich problem. Journal
of Functional Analysis, 262(4):1879–1920, 2012.
Christian Léonard. A survey of the Schrödinger problem and some of its connections with
optimal transport. Discrete Continuous Dynamical Systems Series A, 34(4):1533–1574, 2014.
Bruno Lévy. A numerical algorithm for L2 semi-discrete optimal transport in 3D. ESAIM:
Mathematical Modelling and Numerical Analysis, 49(6):1693–1715, 2015.
Bruno Lévy and Erica L Schwindt. Notions of optimal transport theory and how to implement
them on a computer. Computers & Graphics, 72:135–148, 2018.
Peihua Li, Qilong Wang, and Lei Zhang. A novel earth mover’s distance methodology for
image matching with Gaussian mixture models. In Proceedings of the IEEE International
Conference on Computer Vision, pages 1689–1696, 2013.
Wuchen Li, Ernest K. Ryu, Stanley Osher, Wotao Yin, and Wilfrid Gangbo. A parallel method
for Earth Mover’s distance. Journal of Scientific Computing, 75(1):182–197, 2018a.
Yupeng Li, Wuchen Li, and Guo Cao. Image segmentation via L1 Monge–Kantorovich problem.
CAM report 17-73, 2018b.
Matthias Liero, Alexander Mielke, and Giuseppe Savaré. Optimal transport in competition
with reaction: the Hellinger–Kantorovich distance and geodesic curves. SIAM Journal on
Mathematical Analysis, 48(4):2869–2911, 2016.
Matthias Liero, Alexander Mielke, and Giuseppe Savaré. Optimal entropy-transport problems
and a new Hellinger–Kantorovich distance between positive measures. Inventiones Mathemat-
icae, 211(3):969–1117, 2018.
Haibin Ling and Kazunori Okada. Diffusion distance for histogram comparison. In IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages
246–253. IEEE, 2006.
Haibin Ling and Kazunori Okada. An efficient earth mover’s distance algorithm for robust
histogram comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence,
29(5):840–853, 2007.
Nathan Linial, Alex Samorodnitsky, and Avi Wigderson. A deterministic strongly polynomial
algorithm for matrix scaling and approximate permanents. In Proceedings of the Thirtieth
Annual ACM Symposium on Theory of Computing, pages 644–652. ACM, 1998.
Pierre-Louis Lions and Bertrand Mercier. Splitting algorithms for the sum of two nonlinear
operators. SIAM Journal on Numerical Analysis, 16:964–979, 1979.
Don O Loftsgaarden and Charles P Quesenberry. A nonparametric estimate of a multivariate
density function. Annals of Mathematical Statistics, 36(3):1049–1051, 1965.
Eliane Maria Loiola, Nair Maria Maia de Abreu, Paulo Oswaldo Boaventura-Netto, Peter Hahn,
and Tania Querido. A survey for the quadratic assignment problem. European Journal of
Operational Research, 176(2):657–690, 2007.
Vince Lyzinski, Donniell E Fishkind, Marcelo Fiori, Joshua T Vogelstein, Carey E Priebe, and
Guillermo Sapiro. Graph matching: relax at your own risk. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 38(1):60–73, 2016.
Jan Maas. Gradient flows of the entropy for finite Markov chains. Journal of Functional
Analysis, 261(8):2250–2292, 2011.
Jan Maas, Martin Rumpf, Carola Schönlieb, and Stefan Simon. A generalized model for optimal
transport of images including dissipation and density modulation. ESAIM: Mathematical
Modelling and Numerical Analysis, 49(6):1745–1769, 2015.
Jan Maas, Martin Rumpf, and Stefan Simon. Generalized optimal transport with singular
sources. arXiv preprint arXiv:1607.01186, 2016.
Yasushi Makihara and Yasushi Yagi. Earth mover’s morphing: Topology-free shape morphing
using cluster-based EMD flows. In Asian Conference on Computer Vision, pages 202–215.
Springer, 2010.
Luigi Malagò, Luigi Montrucchio, and Giovanni Pistone. Wasserstein Riemannian geometry of
positive-definite matrices. arXiv preprint arXiv:1801.09269, 2018.
Anton Mallasto and Aasa Feragen. Learning from uncertain curves: The 2-Wasserstein metric
for Gaussian processes. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish-
wanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30,
pages 5660–5670. 2017.
Stephane Mallat. A Wavelet Tour of Signal Processing: the Sparse Way. Academic press, 2008.
Benjamin Mathon, François Cayre, Patrick Bas, and Benoit Macq. Optimal transport for secure
spread-spectrum watermarking of still images. IEEE Transactions on Image Processing, 23
(4):1694–1705, 2014.
Daniel Matthes and Horst Osberger. Convergence of a variational Lagrangian scheme for a
nonlinear drift diffusion equation. ESAIM: Mathematical Modelling and Numerical Analysis,
48(3):697–726, 2014.
Daniel Matthes and Horst Osberger. A convergent Lagrangian discretization for a nonlinear
fourth-order equation. Foundations of Computational Mathematics, 17(1):73–126, 2017.
Bertrand Maury and Anthony Preux. Pressureless Euler equations with maximal density con-
straint: a time-splitting scheme. Topological Optimization and Optimal Transport: In the
Applied Sciences, 17:333, 2017.
Bertrand Maury, Aude Roudneff-Chupin, and Filippo Santambrogio. A macroscopic crowd
motion model of gradient flow type. Mathematical Models and Methods in Applied Sciences,
20(10):1787–1821, 2010.
Robert J McCann. A convexity principle for interacting gases. Advances in Mathematics, 128
(1):153–179, 1997.
Facundo Mémoli. On the use of Gromov–Hausdorff distances for shape comparison. In Sympo-
sium on Point Based Graphics, pages 81–90. 2007.
Facundo Mémoli. Gromov–Wasserstein distances and the metric approach to object matching.
Foundations of Computational Mathematics, 11(4):417–487, 2011.
Facundo Mémoli and Guillermo Sapiro. A theoretical and computational framework for isometry
invariant recognition of point cloud data. Foundations of Computational Mathematics, 5(3):
313–347, 2005.
Quentin Mérigot. A multiscale approach to optimal transport. Computer Graphics Forum, 30
(5):1583–1592, 2011.
Ludovic Métivier, Romain Brossier, Quentin Merigot, Edouard Oudet, and Jean Virieux. An
optimal transport approach for seismic tomography: Application to 3D full waveform inver-
sion. Inverse Problems, 32(11):115008, 2016.
Jocelyn Meyron, Quentin Mérigot, and Boris Thibert. Light in power: a general and parameter-
free algorithm for caustic design. In SIGGRAPH Asia 2018 Technical Papers, page 224. ACM,
2018.
Alexander Mielke. Geodesic convexity of the relative entropy in reversible Markov chains.
Calculus of Variations and Partial Differential Equations, 48(1-2):1–31, 2013.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word
representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
Jean-Marie Mirebeau. Discretization of the 3D Monge-Ampere operator, between wide stencils
and power diagrams. ESAIM: Mathematical Modelling and Numerical Analysis, 49(5):1511–
1523, 2015.
Gaspard Monge. Mémoire sur la théorie des déblais et des remblais. Histoire de l’Académie
Royale des Sciences, pages 666–704, 1781.
Grégoire Montavon, Klaus-Robert Müller, and Marco Cuturi. Wasserstein training of restricted
Boltzmann machines. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett,
editors, Advances in Neural Information Processing Systems 29, pages 3718–3726. 2016.
Kevin Moon and Alfred Hero. Multivariate f-divergence estimation with confidence. In Ad-
vances in Neural Information Processing Systems, pages 2420–2428, 2014.
Oleg Museyko, Michael Stiglmayr, Kathrin Klamroth, and Günter Leugering. On the applica-
tion of the Monge–Kantorovich problem to image registration. SIAM Journal on Imaging
Sciences, 2(4):1068–1097, 2009.
Boris Muzellec and Marco Cuturi. Generalizing point embeddings using the Wasserstein space
of elliptical distributions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-
Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31,
pages 10258–10269. 2018.
Boris Muzellec, Richard Nock, Giorgio Patrini, and Frank Nielsen. Tsallis regularized optimal
transport and ecological inference. In AAAI, pages 2387–2393, 2017.
Andriy Myronenko and Xubo Song. Point set registration: coherent point drift. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 32(12):2262–2275, 2010.
Assaf Naor and Gideon Schechtman. Planar earthmover is not in L1. SIAM Journal on Com-
puting, 37(3):804–826, 2007.
Richard D Neidinger. Introduction to automatic differentiation and Matlab object-oriented
programming. SIAM Review, 52(3):545–563, 2010.
Arkadi Nemirovski and Uriel Rothblum. On complexity of matrix scaling. Linear Algebra and
its Applications, 302:435–460, 1999.
Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex pro-
gramming, volume 13. SIAM, 1994.
Kangyu Ni, Xavier Bresson, Tony Chan, and Selim Esedoglu. Local histogram based segmenta-
tion using the Wasserstein distance. International Journal of Computer Vision, 84(1):97–111,
2009.
Lipeng Ning and Tryphon T Georgiou. Metrics for matrix-valued measures via test functions.
In 53rd IEEE Conference on Decision and Control, pages 2642–2647. IEEE, 2014.
Lipeng Ning, Tryphon T Georgiou, and Allen Tannenbaum. On matrix-valued Monge–
Kantorovich optimal mass transport. IEEE Transactions on Automatic Control, 60(2):373–
382, 2015.
Jorge Nocedal and Stephen J Wright. Numerical Optimization. Springer-Verlag, 1999.
Adam M Oberman and Yuanlong Ruan. An efficient linear programming method for optimal
transportation. arXiv preprint arXiv:1509.03668, 2015.
Vladimir Oliker and Laird D Prussner. On the numerical solution of the equation
(∂²z/∂x²)(∂²z/∂y²) − (∂²z/∂x∂y)² = f and its discretizations, I. Numerische Mathematik,
54(3):271–293, 1989.
Aude Oliva and Antonio Torralba. Modeling the shape of the scene: a holistic representation
of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.
Dean S Oliver. Minimization for conditional simulation: Relationship to optimal transport.
Journal of Computational Physics, 265:1–15, 2014.
James B. Orlin. A polynomial time primal network simplex algorithm for minimum cost flows.
Mathematical Programming, 78(2):109–129, 1997.
Ferdinand Österreicher and Igor Vajda. A new class of metric divergences on probability spaces
and its applicability in statistics. Annals of the Institute of Statistical Mathematics, 55(3):
639–653, 2003.
Felix Otto. The geometry of dissipative evolution equations: the porous medium equation.
Communications in Partial Differential Equations, 26(1-2):101–174, 2001.
Art B Owen. Empirical Likelihood. Wiley Online Library, 2001.
Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on
Knowledge and Data Engineering, 22(10):1345–1359, 2010.
Victor M Panaretos and Yoav Zemel. Amplitude and phase variation of point processes. Annals
of Statistics, 44(2):771–812, 2016.
Nicolas Papadakis, Gabriel Peyré, and Edouard Oudet. Optimal transport with proximal split-
ting. SIAM Journal on Imaging Sciences, 7(1):212–238, 2014.
Brendan Pass. On the local structure of optimal measures in the multi-marginal optimal trans-
portation problem. Calculus of Variations and Partial Differential Equations, 43(3-4):529–
536, 2012.
Brendan Pass. Multi-marginal optimal transport: theory and applications. ESAIM: Mathemat-
ical Modelling and Numerical Analysis, 49(6):1771–1790, 2015.
Ofir Pele and Ben Taskar. The tangent earth mover’s distance. In Geometric Science of
Information, pages 397–404. Springer, 2013.
Ofir Pele and Michael Werman. A linear time histogram metric for improved SIFT matching.
Computer Vision–ECCV 2008, pages 495–508, 2008.
Ofir Pele and Michael Werman. Fast and robust earth mover’s distances. In IEEE 12th Inter-
national Conference on Computer Vision, pages 460–467, 2009.
Benoît Perthame, Fernando Quirós, and Juan Luis Vázquez. The Hele-Shaw asymptotics for
mechanical models of tumor growth. Archive for Rational Mechanics and Analysis, 212(1):
93–127, 2014.
Gabriel Peyré. Entropic approximation of Wasserstein gradient flows. SIAM Journal on Imaging
Sciences, 8(4):2323–2351, 2015.
Gabriel Peyré, Jalal Fadili, and Julien Rabin. Wasserstein active contours. In 19th IEEE
International Conference on Image Processing, pages 2541–2544. IEEE, 2012.
Gabriel Peyré, Marco Cuturi, and Justin Solomon. Gromov-Wasserstein averaging of kernel
and distance matrices. In International Conference on Machine Learning, pages 2664–2672,
2016.
Gabriel Peyré, Lénaic Chizat, François-Xavier Vialard, and Justin Solomon. Quantum entropic
regularization of matrix-valued optimal transport. To appear in European Journal of Applied
Mathematics, 2017.
Rémi Peyre. Comparison between W2 distance and H−1 norm, and localisation of Wasserstein
distance. arXiv preprint arXiv:1104.4631, 2011.
Benedetto Piccoli and Francesco Rossi. Generalized Wasserstein distance and its application
to transport equations with source. Archive for Rational Mechanics and Analysis, 211(1):
335–358, 2014.
François Pitié, Anil C Kokaram, and Rozenn Dahyot. Automated colour grading using colour
distribution transfer. Computer Vision and Image Understanding, 107(1):123–137, 2007.
PyTorch. PyTorch library. http://pytorch.org/, 2017.
Julien Rabin and Nicolas Papadakis. Convex color image segmentation with optimal transport
distances. In Proceedings of SSVM’15, pages 256–269, 2015.
Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its
application to texture mixing. In International Conference on Scale Space and Variational
Methods in Computer Vision, pages 435–446. Springer, 2011.
Svetlozar T Rachev and Ludger Rüschendorf. Mass Transportation Problems: Volume I: Theory.
Springer Science & Business Media, 1998a.
Svetlozar T Rachev and Ludger Rüschendorf. Mass Transportation Problems: Volume II: Ap-
plications. Springer Science & Business Media, 1998b.
Louis B Rall. Automatic Differentiation: Techniques and Applications. Springer, 1981.
Aaditya Ramdas, Nicolás García Trillos, and Marco Cuturi. On Wasserstein two-sample testing
and related families of nonparametric tests. Entropy, 19(2):47, 2017.
Anand Rangarajan, Alan L Yuille, Steven Gold, and Eric Mjolsness. Convergence properties
of the softassign quadratic assignment algorithm. Neural Computation, 11(6):1455–1474,
August 1999.
Sebastian Reich. A nonparametric ensemble transform method for Bayesian inference. SIAM
Journal on Scientific Computing, 35(4):A2013–A2024, 2013.
R Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal
on Control and Optimization, 14(5):877–898, 1976.
Antoine Rolet, Marco Cuturi, and Gabriel Peyré. Fast dictionary learning with a smoothed
Wasserstein loss. In Proceedings of the 19th International Conference on Artificial Intelligence
and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 630–638, 2016.
Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover’s distance as a metric
for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.
Ludger Rüschendorf. Convergence of the iterative proportional fitting procedure. Annals of
Statistics, 23(4):1160–1174, 1995.
Ludger Rüschendorf and Svetlozar T Rachev. A characterization of random variables with
minimum L2-distance. Journal of Multivariate Analysis, 32(1):48–54, 1990.
Ludger Rüschendorf and Wolfgang Thomsen. Closedness of sum spaces and the generalized
Schrödinger problem. Theory of Probability and its Applications, 42(3):483–494, 1998.
Ludger Rüschendorf and Ludger Uckelmann. On the n-coupling problem. Journal of Multi-
variate Analysis, 81(2):242–258, 2002.
Ernest K. Ryu, Yongxin Chen, Wuchen Li, and Stanley Osher. Vector and matrix optimal mass
transport: theory, algorithm, and applications. SIAM Journal on Scientific Computing, 40
(5):A3675–A3698, 2017a.
Ernest K. Ryu, Wuchen Li, Penghang Yin, and Stanley Osher. Unbalanced and partial l1
Monge-Kantorovich problem: a scalable parallel first-order method. Journal of Scientific
Computing, 75(3):1596–1613, 2017b.
Tim Salimans, Han Zhang, Alec Radford, and Dimitris Metaxas. Improving GANs using optimal
transport. In International Conference on Learning Representations, 2018.
Hans Samelson et al. On the Perron-Frobenius theorem. Michigan Mathematical Journal, 4(1):
57–59, 1957.
Roman Sandler and Michael Lindenbaum. Nonnegative matrix factorization with earth mover’s
distance metric for image analysis. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 33(8):1590–1602, 2011.
Filippo Santambrogio. Optimal Transport for Applied Mathematicians. Birkhäuser, 2015.
Filippo Santambrogio. Euclidean, metric, and Wasserstein gradient flows: an overview. Bul-
letin of Mathematical Sciences, 7(1):87–154, 2017.
Filippo Santambrogio. Crowd motion and population dynamics under density constraints. GMT
preprint 3728, 2018.
Louis-Philippe Saumier, Boualem Khouider, and Martial Agueh. Optimal transport for particle
image velocimetry. Communications in Mathematical Sciences, 13(1):269–296, 2015.
Geoffrey Schiebinger, Jian Shu, Marcin Tabaka, Brian Cleary, Vidya Subramanian, Aryeh
Solomon, Siyan Liu, Stacie Lin, Peter Berube, Lia Lee, et al. Reconstruction of develop-
mental landscapes by optimal-transport analysis of single-cell gene expression sheds light on
cellular reprogramming. bioRxiv, page 191056, 2017.
Bernhard Schmitzer. A sparse multiscale algorithm for dense optimal transport. Journal of
Mathematical Imaging and Vision, 56(2):238–259, 2016a.
Bernhard Schmitzer. Stabilized sparse scaling algorithms for entropy regularized transport
problems. arXiv preprint arXiv:1610.06519, 2016b.
Bernhard Schmitzer and Christoph Schnörr. Modelling convex shape priors and matching based
on the Gromov-Wasserstein distance. Journal of Mathematical Imaging and Vision, 46(1):
143–159, 2013a.
Bernhard Schmitzer and Christoph Schnörr. Object segmentation by shape matching with
Wasserstein modes. In International Workshop on Energy Minimization Methods in Com-
puter Vision and Pattern Recognition, pages 123–136. Springer, 2013b.
Bernhard Schmitzer and Benedikt Wirth. A framework for Wasserstein-1-type metrics. arXiv
preprint arXiv:1701.01945, 2017.
Isaac J Schoenberg. Metric spaces and positive definite functions. Transactions of the American
Mathematical Society, 44:522–536, 1938.
Bernhard Schölkopf and Alexander J Smola. Learning with Kernels: Support Vector Machines,
Regularization, Optimization, and Beyond. MIT Press, 2002.
Erwin Schrödinger. Über die Umkehrung der Naturgesetze. Sitzungsberichte Preuss. Akad.
Wiss. Berlin. Phys. Math., 144:144–153, 1931.
Vivien Seguy and Marco Cuturi. Principal geodesic analysis for probability measures under the
optimal transport metric. In Advances in Neural Information Processing Systems 28, pages
3294–3302. 2015.
Vivien Seguy, Bharath Bhushan Damodaran, Rémi Flamary, Nicolas Courty, Antoine Rolet, and
Mathieu Blondel. Large-scale optimal transport and mapping estimation. In Proceedings of
ICLR 2018, 2018.
Soroosh Shafieezadeh Abadeh, Peyman Mohajerin Esfahani, and Daniel Kuhn. Distributionally
robust logistic regression. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and
R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1576–1584.
2015.
Soroosh Shafieezadeh Abadeh, Viet Anh Nguyen, Daniel Kuhn, and Peyman Mohajerin Esfahani.
Wasserstein distributionally robust Kalman filtering. In S. Bengio, H. Wallach, H. Larochelle,
K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information
Processing Systems 31, pages 8483–8492. 2018.
Sameer Shirdhonkar and David W Jacobs. Approximate earth mover’s distance in linear time.
In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
Bernard W Silverman. Density Estimation for Statistics and Data Analysis, volume 26. CRC
press, 1986.
Richard Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic
matrices. Annals of Mathematical Statistics, 35:876–879, 1964.
Marcos Slomp, Michihiro Mikamo, Bisser Raytchev, Toru Tamaki, and Kazufumi Kaneda. GPU-
based softassign for maximizing image utilization in photomosaics. International Journal of
Networking and Computing, 1(2):211–229, 2011.
Justin Solomon, Leonidas Guibas, and Adrian Butscher. Dirichlet energy for analysis and
synthesis of soft maps. In Computer Graphics Forum, volume 32, pages 197–206. Wiley
Online Library, 2013.
Justin Solomon, Raif Rustamov, Leonidas Guibas, and Adrian Butscher. Earth mover's dis-
tances on discrete surfaces. ACM Transactions on Graphics, 33(4), 2014a.
Justin Solomon, Raif Rustamov, Leonidas Guibas, and Adrian Butscher. Wasserstein propa-
gation for semi-supervised learning. In Proceedings of the 31st International Conference on
Machine Learning, pages 306–314, 2014b.
Justin Solomon, Fernando De Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy
Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein distances: efficient optimal
transportation on geometric domains. ACM Transactions on Graphics, 34(4):66:1–66:11,
2015.
Justin Solomon, Gabriel Peyré, Vladimir G Kim, and Suvrit Sra. Entropic metric alignment
for correspondence problems. ACM Transactions on Graphics, 35(4):72:1–72:13, 2016a.
Justin Solomon, Raif Rustamov, Leonidas Guibas, and Adrian Butscher. Continuous-flow graph
transportation distances. arXiv preprint arXiv:1603.06927, 2016b.
Max Sommerfeld and Axel Munk. Inference for empirical Wasserstein distances on finite spaces.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):219–238,
2018.
Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert RG
Lanckriet. On integral probability metrics, ϕ-divergences and binary classification. arXiv
preprint arXiv:0901.2698, 2009.
Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert RG
Lanckriet. On the empirical estimation of integral probability metrics. Electronic Journal of
Statistics, 6:1550–1599, 2012.
Sanvesh Srivastava, Volkan Cevher, Quoc Dinh, and David Dunson. WASP: Scalable Bayes via
barycenters of subset posteriors. In Guy Lebanon and S. V. N. Vishwanathan, editors, Pro-
ceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics,
volume 38 of Proceedings of Machine Learning Research, pages 912–920, San Diego, Califor-
nia, USA, 2015a. PMLR. URL http://proceedings.mlr.press/v38/srivastava15.html.
Sanvesh Srivastava, Volkan Cevher, Quoc Dinh, and David Dunson. WASP: scalable bayes
via barycenters of subset posteriors. In Artificial Intelligence and Statistics, pages 912–920,
2015b.
Matthew Staib, Sebastian Claici, Justin M Solomon, and Stefanie Jegelka. Parallel streaming
wasserstein barycenters. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,
S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Sys-
tems 30, pages 2647–2658. 2017a.
Matthew Staib, Sebastian Claici, Justin M Solomon, and Stefanie Jegelka. Parallel streaming
wasserstein barycenters. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,
S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Sys-
tems 30, pages 2647–2658. 2017b.
Leen Stougie. A polynomial bound on the diameter of the transportation polytope. Technical
report, TU/e, Technische Universiteit Eindhoven, Department of Mathematics and Comput-
ing Science, 2002.
Karl-Theodor Sturm. The space of spaces: curvature bounds and gradient flows on the space
of metric measure spaces. Preprint 1208.0434, arXiv, 2012.
Zhengyu Su, Yalin Wang, Rui Shi, Wei Zeng, Jian Sun, Feng Luo, and Xianfeng Gu. Optimal
mass transport for shape matching and comparison. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 37(11):2246–2259, 2015.
Vladimir N Sudakov. Geometric Problems in the Theory of Infinite-dimensional Probability
Distributions. Number 141. American Mathematical Society, 1979.
Mahito Sugiyama, Hiroyuki Nakahara, and Koji Tsuda. Tensor balancing on statistical mani-
fold. arXiv preprint arXiv:1702.08142, 2017.
Mohamed M Sulman, JF Williams, and Robert D Russell. An efficient approach for the numer-
ical solution of the Monge–Ampère equation. Applied Numerical Mathematics, 61(3):298–307,
2011.
Paul Swoboda and Christoph Schnörr. Convex variational image restoration with histogram
priors. SIAM Journal on Imaging Sciences, 6(3):1719–1735, 2013.
Gábor J Székely and Maria L Rizzo. Testing for equal distributions in high dimension. InterStat,
5(16.10), 2004.
Asuka Takatsu. Wasserstein geometry of Gaussian measures. Osaka Journal of Mathematics,
48(4):1005–1026, 2011.
Xiaolu Tan and Nizar Touzi. Optimal transportation under controlled stochastic dynamics.
Annals of Probability, 41(5):3201–3240, 2013.
Robert E. Tarjan. Dynamic trees as search trees via Euler tours, applied to the network simplex
algorithm. Mathematical Programming, 78(2):169–177, 1997.
Guillaume Tartavel, Gabriel Peyré, and Yann Gousseau. Wasserstein loss for image synthesis
and restoration. SIAM Journal on Imaging Sciences, 9(4):1726–1755, 2016.
Matthew Thorpe, Serim Park, Soheil Kolouri, Gustavo K Rohde, and Dejan Slepčev. A trans-
portation Lp distance for signal analysis. Journal of Mathematical Imaging and Vision, 59
(2):187–210, 2017.
AN Tolstoı. Metody nakhozhdeniya naimen'shego summovogo kilometrazha pri planirovanii
perevozok v prostranstve [Russian; methods of finding the minimal total kilometrage in cargo
transportation planning in space]. TransPress of the National Commissariat of Transporta-
tion, pages 23–55, 1930.
AN Tolstoı. Metody ustraneniya neratsional'nykh perevozok pri planirovanii [Russian; methods
of removing irrational transportation in planning]. Sotsialisticheskiı Transport, 9:28–51, 1939.
Alain Trouvé and Laurent Younes. Metamorphoses through Lie group action. Foundations of
Computational Mathematics, 5(2):173–198, 2005.
Neil S Trudinger and Xu-Jia Wang. On the Monge mass transfer problem. Calculus of Variations
and Partial Differential Equations, 13(1):19–31, 2001.
Marc Vaillant and Joan Glaunès. Surface matching via currents. In Information Processing in
Medical Imaging, pages 1–5. Springer, 2005.
Sathamangalam R Srinivasa Varadhan. On the behavior of the fundamental solution of the
heat equation with variable coefficients. Communications on Pure and Applied Mathematics,
20(2):431–455, 1967.
Cédric Villani. Topics in Optimal Transportation. Graduate Studies in Mathematics Series.
American Mathematical Society, 2003. ISBN 9780821833124.
Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Verlag, 2009.
Thomas Vogt and Jan Lellmann. Measure-valued variational models with applications to
diffusion-weighted imaging. Journal of Mathematical Imaging and Vision, 60(9):1482–1502,
2018.
Fan Wang and Leonidas J Guibas. Supervised earth mover’s distance learning and its computer
vision applications. ECCV2012, pages 442–455, 2012.
Wei Wang, John A. Ozolek, Dejan Slepcev, Ann B. Lee, Cheng Chen, and Gustavo K. Rohde. An
optimal transportation approach for nuclear structure-based pathology. IEEE Transactions
on Medical Imaging, 30(3):621–631, 2011.
Wei Wang, Dejan Slepčev, Saurav Basu, John A Ozolek, and Gustavo K Rohde. A linear
optimal transportation framework for quantifying and visualizing variations in sets of images.
International Journal of Computer Vision, 101(2):254–269, 2013.
Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of
empirical measures in Wasserstein distance. arXiv preprint arXiv:1707.00087, 2017.
Kilian Q Weinberger and Lawrence K Saul. Distance metric learning for large margin nearest
neighbor classification. Journal of Machine Learning Research, 10:207–244, 2009.
Michael Westdickenberg and Jon Wilkening. Variational particle schemes for the porous medium
equation and for the system of isentropic Euler equations. ESAIM: Mathematical Modelling
and Numerical Analysis, 44(1):133–166, 2010.
Alan Geoffrey Wilson. The use of entropy maximizing models, in the theory of trip distribution,
mode split and route split. Journal of Transport Economics and Policy, pages 108–126, 1969.
Gui-Song Xia, Sira Ferradans, Gabriel Peyré, and Jean-François Aujol. Synthesizing and
mixing stationary Gaussian texture models. SIAM Journal on Imaging Sciences, 7(1):476–
508, 2014.
G Udny Yule. On the methods of measuring association between two attributes. Journal of the
Royal Statistical Society, 75(6):579–652, 1912.
Yoav Zemel and Victor M Panaretos. Fréchet means and Procrustes analysis in Wasserstein
space. To appear in Bernoulli, 2018.
Gloria Zen, Elisa Ricci, and Nicu Sebe. Simultaneous ground metric learning and matrix
factorization with earth mover’s distance. Proceedings of ICPR’14, pages 3690–3695, 2014.
Lei Zhu, Yan Yang, Steven Haker, and Allen Tannenbaum. An image morphing technique
based on optimal mass preserving mapping. IEEE Transactions on Image Processing, 16(6):
1481–1495, 2007.
Jonathan Zinsl and Daniel Matthes. Transport distances and geodesic convexity for systems of
degenerate diffusion equations. Calculus of Variations and Partial Differential Equations, 54
(4):3397–3438, 2015.
