Departamento de Matemática
Pontifícia Universidade Católica do Rio de Janeiro
Rio de Janeiro, RJ 22451-900, Brazil
Rene Cabrera [email protected]
Department of Mathematics
University of Texas at Austin
Austin, TX 78712, USA
Abstract
We study quantitatively the overparametrization limit of the original Wasserstein-GAN algorithm.
Effectively, we show that the algorithm is a stochastic discretization of a system of continuity equa-
tions for the parameter distributions of the generator and discriminator. We show that parameter
clipping to satisfy the Lipschitz condition in the algorithm induces a discontinuous vector field in
the mean field dynamics, which gives rise to blow-up in finite time of the mean field dynamics. We
look into a specific toy example that shows that all solutions to the mean field equations converge
in the long time limit to time periodic solutions, which helps explain the failure to converge.
Keywords: GAN, Aggregation Equation, blow-up
1 Introduction
Generative algorithms are at the forefront of the machine learning revolution we are currently
experiencing. Some of the most famous types are diffusion models Sohl-Dickstein et al. (2015),
generative language models Radford et al. (2018) and Generative Adversarial Networks (GAN)
Goodfellow et al. (2014). GAN was one of the first algorithms to successfully produce synthetically
realistic images and audio and is the topic of this article.
A guiding assumption for GAN is that the support of the data distribution P∗ can be well approxi-
mated by a lower dimensional object. That is to say, although P∗ ∈ P(R^K), we expect that the
inherent correlations in data, like values of neighboring pixels in an image, drastically reduce the
dimensionality of the problem. In broad terms, we expect that, in some non-specified sense, the
effective dimension of the support of P∗ is less than or equal to a latent dimension L ≪ K. The
GAN algorithm tries to find an easy-to-evaluate continuous function G : R^L → R^K,
which we call the generator. The objective is to make G(Z) approximately distributed like
P∗, where Z is distributed like the standard Gaussian N(0, 1) ∈ P(R^L). To get an idea of orders
of magnitude, Karras et al. (2017) create realistic-looking high-resolution images of faces with
K = 1024 × 1024 × 3 = 3145728 and L = 512.
As the word adversarial in its name suggests, the algorithm pits two Neural Networks against
each other, the generator network G and the discriminator network D. The discriminator network
tries to distinguish the synthetic samples G(Z) from the real samples X ∼ P∗. For this purpose,
the optimization over the discriminator network D is the dual formulation of a metric between
the associated synthetic data distribution G#N and the real data distribution P∗ . The original
algorithm Goodfellow et al. (2014) used Jensen-Shannon divergence. The version we analyze in
detail here is the Wasserstein-GAN (WGAN) Arjovsky et al. (2017) which uses the 1-Wasserstein
distance instead. The behavior of GAN is known to be directly tied to the choice of the metric, see
Section 3 for more details.
The architecture of the Neural Networks (NN) which parametrize the generator and discrimina-
tor also plays a large role in the success of the algorithms. The paradigm for architectures at the time
of the first prototypes of GANs was to use Convolutional Neural Networks (CNNs) which exploit
the natural spatial correlations of pixels, see for example AlexNet introduced in Krizhevsky et al.
(2017). Currently, the paradigm has changed with the advent of attention networks which are more
parallelizable and outperform CNNs in most benchmarks, see Vaswani et al. (2017). In this paper,
we forego the interesting question of the role of NN architecture to understand in more detail the
induced dynamics, see Section 2.1 for more details.
To understand the dynamics, we will follow the success of understanding the overparametrized
limit in the supervised learning problem for shallow one hidden layer NN architectures Mei et al.
(2018); Chizat and Bach (2018); Rotskoff and Vanden-Eijnden (2022), see also Fernández-Real and Figalli
(2022); Wojtowytsch and E (2020) for reviews of these results. In a nutshell, to the first order these
articles relate Stochastic Gradient Descent (SGD) parameter training to a stochastic discretization
of an associated aggregation equation Bertozzi et al. (2011); Carrillo et al. (2011), and to a second
order to an aggregation diffusion equation Carrillo et al. (2006). In probabilistic terms, this is
akin to the law of large numbers Sirignano and Spiliopoulos (2020a) and the central limit theorem
Sirignano and Spiliopoulos (2020b).
Our contribution, which is novel even in the supervised learning case, is to quantify this type of
analysis. First, we show a quantitative result for the stability of the limiting aggregation equation
in the 2-Wasserstein metric, see Theorem 5. The difficulty of the stability in our case is not the
regularity of the activation function Chizat and Bach (2018), but instead the growth of the Lipschitz
constant with respect to the size of the parameters themselves. Next, we show a quantitative
convergence of the empirical process to the solution of the mean field PDE; to our knowledge, this
is the first result of its kind in terms of a strong metric like the 2-Wasserstein metric, see Theorem 6 and
Corollary 8.
Moreover, the WGAN algorithm clips the discriminator parameters after every training itera-
tion. In follow-up work, Gulrajani et al. (2017) observed numerically that this creates undesirable
behavior. In terms of the mean field PDE (7), the clipping of parameters induces an associated
discontinuous vector field. This explains from a mathematical viewpoint the pathology mentioned
before. In a nutshell, the parameter distribution can blow up in finite time, and after that time the
discriminator network loses its universal approximation capabilities, see Section 2.4.
Failure to converge is a known problem of GAN. For instance, Karras et al. (2017) introduces a
progressive approach to training on higher and higher resolution pictures, effectively providing a hot start
of the algorithm at every step. By looking at an enlightening simplified example, we can explicitly
understand the long time behavior of the algorithm. In this example, any initialization eventually
settles to a time periodic orbit, which implies that the generator oscillates forever, see Section 3.
presents the conclusions and discusses some future directions for research. Appendix A recalls the
well-posedness and approximation of differential inclusions.
where d1 is the 1-Wasserstein distance. Although this problem seems rather straightforward, the
Wasserstein distance is notorious for being difficult to calculate in high dimensions, and we do not
have direct access to P∗ ; hence, in practice a proxy of said distance is chosen. More specifically, we
approximate the dual problem
\[
d_1(G_\Theta \# \mathcal{N}, P_*) \;=\; \sup_{D \in \mathrm{Lip}_1} \int_{\mathbb{R}^L} D(G_\Theta(z)) \, d\mathcal{N}(z) \;-\; \int_{\mathbb{R}^K} D(x) \, dP_*(x), \tag{1}
\]
The parametric function DΩ will also be considered as a Neural Network and the parameters
Ω are restricted to a compact convex set. The precise definition of GΘ and DΩ as Neural Networks
with a single hidden layer is given below, letting σ : R → R denote the activation function. Since
the parameters Ω are restricted to a compact set, if σ is bounded in C^1 the family {D_Ω} is uniformly
Lipschitz.
Remark 1 The original GAN Goodfellow et al. (2014) utilizes the Jensen-Shannon divergence,
which in terms of Legendre-Fenchel dual can be written as
\[
\mathrm{JS}(G_\Theta \# \mathcal{N}, P_*) \;=\; \sup_{D \in C_b(\mathbb{R}^K)} \int_{\mathbb{R}^L} \log D(G_\Theta(z)) \, d\mathcal{N}(z) \;+\; \int_{\mathbb{R}^K} \log\big(1 - D(x)\big) \, dP_*(x).
\]
\[
G_\Theta(z) \;=\; \left( \frac{1}{N} \sum_{i=1}^{N} \alpha_i^{1}\, \sigma\big(\beta_i^{1} \cdot z + \gamma_i^{1}\big), \; \ldots \;, \; \frac{1}{N} \sum_{i=1}^{N} \alpha_i^{K}\, \sigma\big(\beta_i^{K} \cdot z + \gamma_i^{K}\big) \right),
\]
where the array Θ = (θ_1, · · · , θ_N) ∈ ((R × R^L × R)^K)^N is given by θ_i = (α_i^j, β_i^j, γ_i^j)_{1≤j≤K}, and D_Ω is
defined by
\[
D_\Omega(x) \;=\; \frac{1}{M} \sum_{i=1}^{M} a_i\, \sigma(b_i \cdot x + c_i),
\]
and
\[
\sigma(x; \omega) \;=\; a\, \sigma(b \cdot x + c) \qquad \text{with} \qquad \omega = (a, b, c) \in \mathbb{R} \times \mathbb{R}^K \times \mathbb{R}.
\]
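For concreteness, a minimal NumPy sketch of these two single-hidden-layer networks is given below; the tanh activation and the array layout are assumptions made only for this illustration.

```python
import numpy as np

def sigma(t):
    # Bounded, smooth activation; tanh is one admissible choice.
    return np.tanh(t)

def generator(z, alpha, beta, gamma):
    """G_Theta(z): the j-th output coordinate is the average over N neurons of
    alpha[j, i] * sigma(beta[j, i] . z + gamma[j, i])."""
    # alpha: (K, N), beta: (K, N, L), gamma: (K, N), z: (L,)
    pre = np.einsum('knl,l->kn', beta, z) + gamma   # shape (K, N)
    return np.mean(alpha * sigma(pre), axis=1)      # shape (K,)

def discriminator(x, a, b, c):
    """D_Omega(x): the average over M neurons of a[i] * sigma(b[i] . x + c[i])."""
    # a: (M,), b: (M, K), c: (M,), x: (K,)
    return np.mean(a * sigma(b @ x + c))

# Tiny usage example with arbitrary sizes.
L, K, N, M = 4, 3, 100, 100
rng = np.random.default_rng(0)
z = rng.standard_normal(L)
theta = (rng.standard_normal((K, N)), rng.standard_normal((K, N, L)), rng.standard_normal((K, N)))
omega = (rng.uniform(-1, 1, M), rng.uniform(-1, 1, (M, K)), rng.uniform(-1, 1, M))
print(discriminator(generator(z, *theta), *omega))
```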
Remark 2 The mean field analysis of two hidden layers NN is also possible, see for instance
Sirignano and Spiliopoulos (2022).
where
\[
\Omega^{1,1} \in (\mathbb{R} \times \mathbb{R}^K \times \mathbb{R})^M \qquad \text{and} \qquad \Theta^{1} \in ((\mathbb{R} \times \mathbb{R}^L \times \mathbb{R})^K)^N,
\]
and the initial distributions
\[
\nu_{\mathrm{in}} \in \mathcal{P}\big(\mathbb{R} \times \mathbb{R}^K \times \mathbb{R}\big) \qquad \text{and} \qquad \mu_{\mathrm{in}} \in \mathcal{P}\big((\mathbb{R} \times \mathbb{R}^L \times \mathbb{R})^K\big)
\]
are fixed independent of N and M. Of course, correlations in parameter initialization and N- and
M-dependent initial conditions can be introduced if desired.
Iteratively in n until convergence, and iteratively for l = 2, ..., n_c with n_c a user defined param-
eter, we define
\[
\Omega^{n,l} = \mathrm{clip}\Big( \Omega^{n,l-1} + h\, \nabla_\Omega \big( D_{\Omega^{n,l-1}}(G_{\Theta^n}(z_l^n)) - D_{\Omega^{n,l-1}}(x_l^n) \big) \Big), \qquad \Omega^{n+1,1} = \Omega^{n,n_c},
\]
and
\[
\Theta^{n+1} = \Theta^{n} - h\, \nabla_\Theta D_{\Omega^{n+1,1}}\big(G_{\Theta^n}(z_{n_c+1}^n)\big),
\]
where the function clip stands for the projection onto [−1, 1] × [−1, 1]^K × [−1, 1], and h > 0 is the
learning rate which is a user chosen parameter. The families {x_l^n}_{n∈N, l∈{1,...,n_c}} and {z_l^n}_{n∈N, l∈{1,...,n_c+1}}
are independent R^K- and R^L-valued random variables distributed by P∗ and N, respectively.
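A minimal sketch of one outer iteration of these updates in Python is shown below; gradients are taken by finite differences purely to keep the sketch short, and gen, disc, sample_real, and sample_latent are hypothetical helpers (evaluating G_Θ and D_Ω on flat parameter vectors and drawing samples from P∗ and N) that are not part of the paper.

```python
import numpy as np

def num_grad(f, params, eps=1e-5):
    """Central finite-difference gradient of a scalar function f of a flat
    parameter vector; slow but keeps the illustration simple."""
    g = np.zeros_like(params)
    for i in range(params.size):
        e = np.zeros_like(params)
        e[i] = eps
        g[i] = (f(params + e) - f(params - e)) / (2 * eps)
    return g

def wgan_iteration(theta, omega, gen, disc, sample_real, sample_latent, h=1e-2, n_c=5):
    """One outer WGAN iteration: n_c clipped ascent steps on the discriminator
    parameters, followed by one descent step on the generator parameters."""
    for _ in range(n_c):
        z, x = sample_latent(), sample_real()
        critic_obj = lambda w: disc(gen(z, theta), w) - disc(x, w)
        omega = np.clip(omega + h * num_grad(critic_obj, omega), -1.0, 1.0)  # clip = projection
    z = sample_latent()
    theta = theta - h * num_grad(lambda t: disc(gen(z, t), omega), theta)
    return theta, omega
```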
Remark 3 The clipping of the parameters is made to ensure that the discriminator network is
uniformly bounded in the Lipschitz norm, in order to approximate Kantorovich’s duality formulation (1).
With this in mind, we should notice that the clipping of all parameters is slightly indiscriminate. For
instance, the dependence of the discriminator function on the parameter a is bounded
by our assumption on the activation function σ, so a would not need to be clipped.
Remark 4 We should note that other versions of SGD like Adam or RMSProp (see Kingma and Ba
(2014) and Tieleman (2012)) are preferred by users as they are considered to outperform SGD. They
introduce adaptive time stepping and momentum in an effort to avoid metastability of plateaus, and
falling into shallow local minima. These tweaks of SGD add another layer of complexity which we
will not analyze in this paper.
and
\[
D_\nu(x) \;=\; \int_{\mathbb{R} \times \mathbb{R}^K \times \mathbb{R}} \sigma(x; \omega)\, d\nu(\omega), \tag{3}
\]
where µ^i, for i = 1, ..., K, denotes the i-th marginal of µ. We should note that due to the exchange-
ability of the parameters, there is no loss of information from considering the pair (Θ^n, Ω^n) versus
the pair (µ_N^n, ν_M^n). In fact, using the previous notations we have
Hence, to understand the behavior of the algorithm in the overparameterization limit, we will
center our attention on the evolution of the empirical measures. More specifically, we consider the
curves µ ∈ C([0, ∞); P((R × R^L × R)^K)) and ν ∈ C([0, ∞); P(R × R^K × R)) to be, respectively,
the linear interpolation of µ_N^n and ν_M^n at the time values t = n(h/N).
The choice of the scale ∆t = h/N is arbitrary, and could also be expressed in terms of M . The
relationship between N , M and nc gives rise to different mean field limits
\[
n_c\, \frac{N}{M} \;\longrightarrow\; \gamma_c \in \{ +\infty, \; \sim 1, \; 0 \}, \tag{4}
\]
and we will obtain different behavior in terms of limiting dynamics. In this paper, we address the
intermediate limit γc ∼ 1, but we should notice that in practice it is also interesting to study when
γc = ∞, which assumes that the discriminator has been trained to convergence, see Section 3 for
an illustrative example. For notational simplicity, we write the proof for N = M and nc = 1, but
our methods are valid for any finite value of γc ∼ 1.
Explicitly, for any t ∈ [0, ∞), we find n ∈ N and s ∈ [0, 1) such that
\[
(1 - s)\, t_n + s\, t_{n+1} = t.
\]
The evolution of the limit can be characterized by the gradient descent of E on µ and gradient
ascent on ν, the latter restricted to P([−1, 1]×[−1, 1]K ×[−1, 1]). In terms of equations we consider
\[
\begin{cases}
\partial_t \mu - \nabla_\theta \cdot \Big( \mu\, \nabla_\theta \dfrac{\delta E}{\delta \mu}[\mu, \nu] \Big) = 0, \\[6pt]
\partial_t \nu + \gamma_c\, \nabla_\omega \cdot \Big( \nu\, \mathrm{Proj}_{\pi_Q} \nabla_\omega \dfrac{\delta E}{\delta \nu}[\mu, \nu] \Big) = 0, \\[6pt]
\mu(0) = \mu_{\mathrm{in}}, \quad \nu(0) = \nu_{\mathrm{in}},
\end{cases} \tag{7}
\]
where we define Q = [−1, 1] × [−1, 1]K × [−1, 1] and the first variations are
\[
\frac{\delta E}{\delta \mu}[\mu, \nu](\theta) = \int_{\mathbb{R}^L} \int_{Q} \nabla_1 \sigma(G_\mu(z); \omega) \cdot \big( \sigma(z; \theta^1), \ldots, \sigma(z; \theta^K) \big)\, d\nu(\omega)\, d\mathcal{N}(z),
\]
\[
\frac{\delta E}{\delta \nu}[\mu, \nu](\omega) = \int_{\mathbb{R}^L} \sigma(G_\mu(z); \omega)\, d\mathcal{N}(z) - \int_{\mathbb{R}^K} \sigma(x; \omega)\, dP_*(x).
\]
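These first variations are exactly what mini-batch gradients estimate in the algorithm. As a small illustration, a Monte Carlo approximation of δE/δν[µ, ν](ω) could be computed as follows; the stand-in generator, the tanh activation, and the batch sizes are assumptions made only for this sketch.

```python
import numpy as np

def sigma_feat(x, omega):
    # sigma(x; omega) = a * sigma(b . x + c) for omega = (a, b, c), with tanh as activation.
    a, b, c = omega
    return a * np.tanh(np.dot(b, x) + c)

def dE_dnu(omega, G_mu, z_batch, x_batch):
    """Monte Carlo estimate of (delta E / delta nu)[mu, nu](omega): the mean of
    sigma(G_mu(z); omega) over latent samples minus the mean of sigma(x; omega) over data."""
    gen_term = np.mean([sigma_feat(G_mu(z), omega) for z in z_batch])
    data_term = np.mean([sigma_feat(x, omega) for x in x_batch])
    return gen_term - data_term

# Usage with a placeholder generator and synthetic data.
rng = np.random.default_rng(0)
L, K = 4, 3
G_mu = lambda z: np.tanh(z[:K])                 # stand-in for the mean field generator
z_batch = rng.standard_normal((256, L))
x_batch = rng.standard_normal((256, K))
omega = (0.5, rng.uniform(-1, 1, K), 0.1)
print(dE_dnu(omega, G_mu, z_batch, x_batch))
```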
We should notice that the projection is trivial away from the boundary, or if the vector field
at the boundary points into the domain. Effectively, the projection does not allow mass to exit
the domain. We do note that this can easily make mass collapse onto the boundary and flatten the
support of the distribution ν into fewer dimensions, see Section 3 for a further discussion.
In the context of ODEs, the projection onto convex sets was considered by Henry (1973), which
we recall and expand on in Appendix A. For the Hilbert space setting, we mention the more general
sweeping processes introduced by Moreau (1977). Recently, projections of solutions to
the continuity equation onto semi-convex subsets have been considered as models of pedestrian
dynamics with density constraints, see for instance Di Marino et al. (2016); Santambrogio (2018);
De Philippis et al. (2016).
there exists a unique absolutely continuous weak solution to the mean field system (7).
Moreover, we have the following stability estimate: For any T ∈ [0, ∞), there exists C > 1 such
that
\[
\sup_{t \in [0,T]} d_2\big((\mu_1(t), \nu_1(t)), (\mu_2(t), \nu_2(t))\big) \;\le\; C\, d_4^2\big((\mu_{1,\mathrm{in}}, \nu_{1,\mathrm{in}}), (\mu_{2,\mathrm{in}}, \nu_{2,\mathrm{in}})\big), \tag{10}
\]
The proof of Theorem 5 is given in Section 4, see Proposition 10 for a precise dependence of the
constants. Our main result is the following estimate on the continuous time approximation of
parameter dynamics.
Theorem 6 Let (µN (t), νN (t)) be the empirical measures associated to the continuous time inter-
polation of the parameter values, assumed to be initialized by independent samples from (µin , νin )
given by (5). Consider (µ̂N (t), ν̂N (t)) the unique solution to the PDE (7) with random initial con-
ditions (µN (0), νN (0)). If µin has bounded double exponential moments on α, that is to say for
some δ > 0
\[
\mathbb{E}_{\mu_{\mathrm{in}}}\Big[ e^{\, e^{\delta |\alpha|^2}} \Big] < \infty, \tag{11}
\]
then for any fixed time horizon T ∈ [0, ∞) there exists C > 0 such that
\[
\sup_{t \in [0,T]} \mathbb{E}\, d_2^2\big((\mu_N(t), \nu_N(t)), (\hat\mu_N(t), \hat\nu_N(t))\big) \;\le\; \frac{C}{N}. \tag{12}
\]
Remark 7 The need for (11) stems from the linear dependence of the Lipschitz constant of the
mean field vector field with respect to the size of the parameters, see Lemma 13.
The proof of Theorem 6 is presented in Section 5. Using the convergence Theorem 6 and the
stability of the mean field equation, Theorem 5, we can obtain a convergence rate estimate which suffers from the
curse of dimensionality.
Corollary 8 Under the hypotheses of Theorem 5 and Theorem 6, for any fixed T > 0, there exists
C > 0 such that
\[
\max_{t \in [0,T]} \mathbb{E}\, d_2^2\big((\mu(t), \nu(t)), (\mu_N(t), \nu_N(t))\big) \;\le\; \frac{C}{N^{\frac{2}{K(L+2)}}}, \tag{13}
\]
where (µ, ν) is the unique solution of (7) and (µN , νN ) is the curve of interpolated empirical mea-
sures associated to the parameter training (5).
Remark 9 We should note that the difference between the results of Theorem 6 and Corollary 8 is
that the estimate (12) does not suffer from the curse of dimensionality, while the stronger estimate
(13) does. The latter dependence on dimension is typical and sharp for the approximation of the
Wasserstein distance with sampled empirical measures, see Dudley (1978); Fournier and Guillin
(2015); Bolley et al. (2007). This steep dependence on dimension suggests that studying the long time
behavior of the mean field dynamics for smooth initial data (µ(t), ν(t)) is not necessarily applicable
in practice. Instead, the focus should be to show that, with high probability, the discrete mean field
trajectories (µ̂_N(t), ν̂_N(t)) converge to a desirable saddle point of the dynamics. See Section 3 for
an explicit example of long time behavior.
Proof [Proof of Corollary 8] We consider the auxiliary pair of random measure-valued paths
(µ̂N , ν̂N ) which are a solution to (7) with stochastic initial conditions (µN (0), νN (0)), that is
\[
\hat\mu_N(0) = \mu_N(0) = \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_{i,\mathrm{in}}} \qquad \text{and} \qquad \hat\nu_N(0) = \nu_N(0) = \frac{1}{N} \sum_{i=1}^{N} \delta_{\omega_{i,\mathrm{in}}},
\]
where θi,in and ωi,in are N independent samples from µin and νin , respectively.
By the large deviation estimate in Fournier and Guillin (2015), for q large enough we have
\[
\mathbb{E}\big[ d_4^4\big((\hat\mu_N(0), \hat\nu_N(0)), (\mu_{\mathrm{in}}, \nu_{\mathrm{in}})\big) \big] \;\le\; C M_q^{4/q} \left( \frac{1}{N^{\frac{4}{K(L+2)}}} + \frac{1}{N^{\frac{q-4}{q}}} \right),
\]
where Mq denotes the q-th moment of µin ⊗ νin . By Theorem 5, taking q large enough, and using
that µin ⊗ νin has finite moments of all orders we have
\[
\mathbb{E}\big[ d_2^2\big((\hat\mu_N(t), \hat\nu_N(t)), (\mu(t), \nu(t))\big) \big] \;\le\; C\, \mathbb{E}\big[ d_4^2\big((\hat\mu_N(0), \hat\nu_N(0)), (\mu(0), \nu(0))\big) \big] \;\le\; \frac{C}{N^{\frac{2}{K(L+2)}}}.
\]
By the triangle inequality,
\[
d_2\big((\mu(t), \nu(t)), (\mu_N(t), \nu_N(t))\big) \le d_2\big((\mu_N(t), \nu_N(t)), (\hat\mu_N(t), \hat\nu_N(t))\big) + d_2\big((\hat\mu_N(t), \hat\nu_N(t)), (\mu(t), \nu(t))\big),
\]
problem. Still, training the generator to produce useful outputs is not an easy task: it requires a lot of
computation time and, more often than not, it fails to converge. For instance, to produce realistic-
looking images, Karras et al. (2017) used 32 days of GPU compute time, and the networks are
trained on progressively higher and higher resolution images to help with convergence.
\[
P_* = \tfrac{1}{2}\, \delta_{-1} + \tfrac{1}{2}\, \delta_{1} \in \mathcal{P}(\mathbb{R}).
\]
We consider the simplest network that can approximate this measure perfectly. We consider the
generator, depending on a single parameter g ∈ R, to be given by
\[
G(z, g) = \begin{cases} -1 & z < g, \\ \;\;\, 1 & z > g. \end{cases}
\]
Although this generator architecture seems far from the assumptions of Section 2.1, this type of discontinuity
arises naturally as a limit when the parameters go to infinity. Namely, if we take b, c → ∞ in such
a way that c/b → g ∈ R, then
\[
\sigma(bx + c) \;\to\; \begin{cases} 0 & x < g, \\ 1 & x > g, \end{cases}
\]
where σ is the sigmoid. The generator G can then be recovered as a linear combination of two such
limits. The generated distribution is given by
\[
G_g \# P = \Phi(g)\, \delta_{-1} + (1 - \Phi(g))\, \delta_{1},
\]
where Φ(g) = P({z < g}) is the cumulative distribution function of the prior distribution P ∈ P(R),
which we can choose. We make the choice of the cumulative distribution
\[
\Phi(g) = \frac{1}{1 + e^{-g}} \qquad \text{for } g \in \mathbb{R}
\]
to simplify the calculations. Under this choice for g = 0, we have that Gg #P = P∗ , hence the
network can approximate the target measure perfectly.
Moreover, we can explicitly compute the 1-Wasserstein distance,
\[
d_1(G_g \# P, P_*) = \left| \tfrac{1}{2} - \Phi(g) \right|,
\]
[Figure: the distance d_1(G_g # P, P_*) plotted as a function of g.]
We can clearly see that this function has a unique minimum at g_* = 0, and also that it is
concave in g away from g = g_*. The concavity of the functional makes the problem more challenging
from the theoretical perspective, and it explains the oscillatory behavior of the algorithm close
to the minimizer g_*.
For the discriminator, we consider a ReLU activation given by
\[
D(x; \omega) = (\omega x)_+
\]
with ω ∈ [−1, 1]. We note that taking a single parameter, instead of a distribution, for the discrim-
inator is supported by the mean field dynamics (7), in the sense that under a bad initialization of
parameters, the distribution of discriminator parameters can blow up in finite time to ν = δ_{ω(t)}.
We consider the joint dependence function
\[
\begin{aligned}
\Psi(\omega, g) &= \int_{\mathbb{R}} D_\omega(G_g(z)) \, dP(z) - \int_{\mathbb{R}} D_\omega(x) \, dP_*(x) \\
&= \Phi(g)\,(-\omega)_+ + (1 - \Phi(g))\,(\omega)_+ - \tfrac{1}{2}(-\omega)_+ - \tfrac{1}{2}(\omega)_+ \\
&= \Big( \tfrac{1}{2} - \Phi(g) \Big)\, \omega.
\end{aligned}
\]
[Figure: surface plot of Ψ(ω, g) as a function of g and ω.]
Ignoring, for now, the projection onto ω ∈ [−1, 1], we have the dynamics
\[
\begin{aligned}
\dot g(t) &= -\nabla_g \Psi[g, \omega] = \frac{e^{-g}}{(1 + e^{-g})^2}\, \omega, \\
\dot \omega(t) &= \gamma_c\, \nabla_\omega \Psi[g, \omega] = \frac{\gamma_c}{2}\, \frac{e^{-g} - 1}{1 + e^{-g}},
\end{aligned}
\]
where γ_c is the critic's speed up (4). These dynamics can be integrated exactly, to obtain that
\[
E_{\gamma_c}(\omega(t), g(t)) = 2\cosh(g(t)) + \frac{|\omega(t)|^2}{\gamma_c} = 2\cosh(g_{\mathrm{in}}) + \frac{|\omega_{\mathrm{in}}|^2}{\gamma_c} = E_{\gamma_c}(\omega_{\mathrm{in}}, g_{\mathrm{in}}).
\]
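Indeed, a direct check of this conservation: differentiating E_{γ_c} along the flow and using that 2 sinh(g) e^{−g}/(1 + e^{−g})^2 = tanh(g/2) and (1 − e^{−g})/(1 + e^{−g}) = tanh(g/2),
\[
\frac{d}{dt} E_{\gamma_c}(\omega(t), g(t)) = 2\sinh(g)\,\dot g + \frac{2\omega}{\gamma_c}\,\dot\omega = \omega \tanh\!\Big(\frac{g}{2}\Big) - \omega \tanh\!\Big(\frac{g}{2}\Big) = 0.
\]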
[Figure: level sets of E_{γ_c} (for the values E_{γ_c} = 2.1, 2.5, 3, 4, 5, 10) in the (g, ω) plane, for γ_c = 1 and γ_c = 10, together with the constraint |ω| = 1.]
In the figure above, we plot the level sets of E_{γ_c} as well as the restriction |ω| ≤ 1. We notice
that, given the value of γ_c, there exists a unique level set
\[
E_*(\gamma_c) = 2 + \frac{1}{\gamma_c}
\]
such that the level set {E_{γ_c} = E_*} is tangent to the restriction |ω| ≤ 1.
Now, we consider the dynamics with the restriction |ω| ≤ 1. We notice that for any initial
conditions (ωin , gin ) satisfying Eγc (ωin , gin ) ≤ E∗ (γc ) the trajectory of parameters is unaffected by
the restriction |ω| ≤ 1 and it is time periodic. On the other hand, if we consider initial conditions
(ωin , gin ) satisfying Eγc (ωin , gin ) > E∗ (γc ) and |ωin | ≤ 1, the trajectory will follow the unconstrained
dynamics until it hits the boundary of the restriction ω(t) ∈ ∂Q = {|ω| = 1}. It then slides along the
boundary {|ω| = 1} until it reaches the point (ω(t_*), g(t_*)) = (±1, 0) on the tangential
level set {E_{γ_c} = E_*}, and it starts following this trajectory, becoming time periodic. Hence, there exists
t_* = t_*(E_{γ_c}(ω_in, g_in)) large enough such that the trajectory satisfies (ω(t), g(t)) ∈ {E_{γ_c} = E_*} for
t > t_*. Therefore, we can conclude that
\[
|g(t)| \le \cosh^{-1}\!\left( 1 + \frac{1}{2\gamma_c} \right) \qquad \forall\, t > t_*.
\]
Looking back at the figure, we can see that for γ_c = 1 the limiting trajectory is {E_1 = 3}, and
that the generator parameter oscillates in the range |g(t)| ≤ 0.96 for t > t_*. For γ_c = 10,
we obtain that the limiting trajectory is {E_{10} = 2.1} and the limiting oscillations are smaller,
|g(t)| ≤ 0.31 for t > t_*.
We do notice that regardless of the parameter γc and the initial configuration, the limiting
trajectory is always periodic in time. In fact, we expect that every trajectory of the mean field
dynamics settles into a periodic solution.
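A short numerical sketch of the constrained toy dynamics illustrates this behavior; forward Euler with a coordinate clip of ω onto [−1, 1] plays the role of the projection, and the step size, horizon, and initial data are arbitrary choices made for the illustration.

```python
import numpy as np

def phi(g):
    # CDF of the prior, Phi(g) = 1 / (1 + exp(-g)).
    return 1.0 / (1.0 + np.exp(-g))

def toy_trajectory(g0, w0, gamma_c=1.0, dt=1e-3, T=200.0):
    """Projected forward Euler for the toy (g, omega) dynamics with |omega| <= 1."""
    n = int(T / dt)
    g, w = g0, w0
    traj = np.empty((n, 2))
    for k in range(n):
        dg = phi(g) * (1.0 - phi(g)) * w                        # Phi'(g) * omega
        dw = 0.5 * gamma_c * (np.exp(-g) - 1.0) / (1.0 + np.exp(-g))
        g = g + dt * dg
        w = np.clip(w + dt * dw, -1.0, 1.0)                     # projection onto [-1, 1]
        traj[k] = (g, w)
    return traj

def energy(g, w, gamma_c):
    return 2.0 * np.cosh(g) + w ** 2 / gamma_c

gamma_c = 1.0
traj = toy_trajectory(g0=2.0, w0=0.5, gamma_c=gamma_c)
E = energy(traj[:, 0], traj[:, 1], gamma_c)
# Away from the constraint the energy is (approximately) conserved; after the
# trajectory has slid along |omega| = 1, it settles near E_* = 2 + 1/gamma_c.
print(E[0], E[-1], 2.0 + 1.0 / gamma_c)
```

Varying gamma_c in this sketch approximately reproduces the oscillation ranges for g discussed above.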
Hence, as long as the underlying velocity field inducing the motion is continuous, we can consider
the notion of weak solution for the continuity equation given by (Ambrosio et al., 2005, Chapter
8).
With this in mind, we first notice the Lipschitz continuity properties of the vector fields that
induce the motion (7). More specifically, we denote by
\[
V^{\Theta}_{(\mu,\nu)}(\theta) = -\nabla_\theta \frac{\delta E}{\delta \mu}[\mu, \nu](\theta) = \mathbb{E}_z\big[ v^{\Theta}_{(\mu,\nu)}(\theta, z) \big] \tag{14}
\]
and
\[
V^{\Omega}_{(\mu,\nu)}(\omega) = \nabla_\omega \frac{\delta E}{\delta \nu}[\mu, \nu](\omega) = \mathbb{E}_z \mathbb{E}_x\big[ v^{\Omega}_{(\mu,\nu)}(\omega, z, x) \big], \tag{15}
\]
where we define the vector fields
\[
v^{\Theta}_{(\mu,\nu)}(\theta, z) = -\nabla_\theta \int_{[-1,1]^{1+K+1}} \nabla_1 \sigma(G_\mu(z); \omega) \cdot \big( \sigma(z; \theta^1), \ldots, \sigma(z; \theta^K) \big)\, d\nu(\omega) \tag{16}
\]
and
\[
v^{\Omega}_{(\mu,\nu)}(\omega, z, x) = \nabla_\omega \big[ \sigma(G_\mu(z); \omega) - \sigma(x; \omega) \big]. \tag{17}
\]
In Lemma 13 below, we show that V^Θ_{(µ,ν)}(θ) and V^Ω_{(µ,ν)}(ω) are Lipschitz continuous with respect to
the arguments θ and ω, as well as with respect to the measure arguments (µ, ν). Notice that V^Ω and v^Ω
do not depend on ν, only on µ.
By (Ambrosio et al., 2005, Theorem 8.2.1), any continuous solution to the continuity equation
(7) is supported over solutions of the associated characteristic field. Using the classical theory
Henry (1973) for projected ODE flows, we can show that the characteristic equations
\[
\begin{cases}
\dfrac{d}{dt}(\theta, \omega) = \big( V^{\Theta}_{(\mu,\nu)}(\theta),\; \mathrm{Proj}_{\pi_Q(\omega)} V^{\Omega}_{(\mu,\nu)}(\omega) \big), \\[4pt]
(\theta, \omega)(0) = (\theta_{\mathrm{in}}, \omega_{\mathrm{in}}),
\end{cases} \tag{18}
\]
have a unique solution. More specifically, an absolutely continuous curve (µ, ν) ∈ AC([0, ∞); P((R^{L+2})^K) ×
P(Q)) is a weak solution to (7) if it is given as the image of the initial distributions (µ_in, ν_in) through
the unique projected ODE flow. That is to say,
\[
(\mu, \nu)(t) = \Phi^{t}_{(\mu, \nu)} \# (\mu_{\mathrm{in}}, \nu_{\mathrm{in}}), \tag{19}
\]
\[
d_2\big((\mu_1(t), \nu_1(t)), (\mu_2(t), \nu_2(t))\big) \;\le\; A(t)\, e^{B(t)}\, d_4^2\big((\mu_{1,\mathrm{in}}, \nu_{1,\mathrm{in}}), (\mu_{2,\mathrm{in}}, \nu_{2,\mathrm{in}})\big), \tag{21}
\]
where
\[
A(t) = e^{C(t^2 + t\Lambda)} \left( \int e^{Ct|\alpha|}\, d\mu_{1,\mathrm{in}} + \int e^{Ct|\alpha|}\, d\mu_{2,\mathrm{in}} \right)^{1/2},
\]
and
\[
B(t) = C\, t\, A(t)\, (t + \Lambda),
\]
with
\[
\Lambda = 1 + \left( \int |\alpha|^2\, d\mu_{1,\mathrm{in}} \right)^{1/2} + \left( \int |\alpha|^2\, d\mu_{2,\mathrm{in}} \right)^{1/2}.
\]
Remark 11 The double exponential growth in the estimate is related to the dependence of the Lipschitz
constant of the vector field on the size of the parameters themselves, see Lemma 13 for
the specific estimates.
For discrete initial conditions, existence to (7) follows from applying the results in Appendix A.
Using stability, we can then approximate the initial condition by taking discrete approximations of
it.
Proposition 12 (Existence) For any initial condition (µ_in, ν_in) ∈ P((R^{L+2})^K) × P(Q) satisfying
that there exists δ > 0 such that
\[
\int e^{\delta |\alpha|^2}\, d\mu_{\mathrm{in}} < \infty,
\]
there exists (µ, ν) ∈ AC([0, ∞); P((R^{L+2})^K) × P(Q)), a weak solution to (7), which satisfies the
mild formulation (19).
of the initial conditions µ_in, ν_in, where w_i and v_i are weights which add up to 1. The main properties
we need from this discretization are that
\[
\lim_{L \to \infty} d_4\big((\mu_{\mathrm{in}}^{L}, \nu_{\mathrm{in}}^{L}), (\mu_{\mathrm{in}}, \nu_{\mathrm{in}})\big) = 0 \qquad \text{and} \qquad \int e^{\delta |\alpha|^2}\, d\mu_{\mathrm{in}}^{L} \le \int e^{\delta |\alpha|^2}\, d\mu_{\mathrm{in}}.
\]
Such a discretization can be given by the following procedure. For simplicity we consider R =
2^{k(L+2)K}; we divide the box [− log R, log R]^{(L+2)K} into equal sized boxes {B_i}_{i=1}^{L}. We assign θ^i to
be the point with the smallest norm in the box B_i, and the weights are given by w_i = µ_in(B_i).
We add any leftover mass on ([− log R, log R]^{(L+2)K})^c to the delta at the origin. We do the same
to produce ν_in^L.
By Appendix A, for any L ∈ N there exists a unique solution to the projected ODE associated
to the mean field equations with initial conditions given by (µ_in^L, ν_in^L). Hence, we can
construct a global weak solution to the PDE, (µ^L(t), ν^L(t)). By the stability result, we know that
for any finite time horizon T > 0, {(µ^L, ν^L)}_L forms a Cauchy sequence in AC([0, T]; P((R^{L+2})^K) ×
P(Q)). Hence, there exists (µ, ν) ∈ AC([0, ∞); P((R^{L+2})^K) × P(Q)) such that for any fixed time
horizon T
\[
\lim_{L \to \infty} \sup_{t \in [0, T]} d_2^2\big((\mu(t), \nu(t)), (\mu^L(t), \nu^L(t))\big) = 0.
\]
By Lemma 14, µL satisfies the growth condition (26), and so does µ. By Lemma 15, we have that
the associated projected ODE flows also converge
\[
\lim_{L \to \infty} \sup_{t \in [0,T]} \big| \Phi^{t}_{(\mu^L, \nu^L)}(\theta, \omega) - \Phi^{t}_{(\mu, \nu)}(\tilde\theta, \tilde\omega) \big|^2 \;\le\; e^{C(\Lambda + |\alpha| + |\tilde\alpha|)}\, |(\theta, \omega) - (\tilde\theta, \tilde\omega)|^2. \tag{22}
\]
Using that
\[
(\mu^L(t), \nu^L(t)) = \Phi^{t}_{(\mu^L, \nu^L)} \# (\mu_{\mathrm{in}}^{L}, \nu_{\mathrm{in}}^{L}),
\]
we can pass to the limit in this identity, which in turn implies that (µ, ν) is a weak solution to (7) satisfying (19).
For the next lemma we use the notation θ = (θ1 , . . . , θK ) with θi = (αi , βi , γi ) ∈ R × RL × R,
and α = (α1 , . . . , αK ) ∈ RK .
Lemma 13 There exists C ∈ R depending on ‖σ‖_{C^1} such that the vector fields (14), (15), (16)
and (17) satisfy the bounds
\[
\big\| V^{\Omega}_{(\mu,\nu)} \big\|_\infty \le C \Big( 1 + \int |\alpha|\, d\mu \Big), \qquad \big\| v^{\Omega}_{(\mu,\nu)}(\cdot, z, x) \big\|_\infty \le C \Big( 1 + |x| + \int |\alpha|\, d\mu \Big), \tag{23}
\]
and
\[
\big\| (V^{\Theta}_j)_r \big\|_\infty \le
\begin{cases}
C & \text{for } r = 1, \\
C |\alpha_j| & \text{for } r \neq 1,
\end{cases}
\qquad
\big\| (v^{\Theta}_j)_r \big\|_\infty \le
\begin{cases}
C & \text{for } r = 1, \\
C |\alpha_j| (1 + |z|) & \text{for } r \neq 1.
\end{cases} \tag{24}
\]
Moreover, we have the following Lipschitz estimate. There exists C ∈ R depending on ‖σ‖_{C^2},
such that
\[
\big| V^{\Theta}_{(\mu_1,\nu_1)}(\theta) - V^{\Theta}_{(\mu_2,\nu_2)}(\tilde\theta) \big| \le C \big( |\alpha| + |\tilde\alpha| + A(\mu_1, \mu_2) \big) \Big( d_2\big((\mu_1, \nu_1), (\mu_2, \nu_2)\big) + |\theta - \tilde\theta| \Big),
\]
\[
\big| V^{\Omega}_{(\mu_1,\nu_1)}(\omega_1) - V^{\Omega}_{(\mu_2,\nu_2)}(\omega_2) \big| \le C A(\mu_1, \mu_2) \big( |\omega_1 - \omega_2| + d_2(\mu_1, \mu_2) \big),
\]
and
\[
\big| v^{\Theta}_{(\mu_1,\nu_1)}(\theta, z) - v^{\Theta}_{(\mu_2,\nu_2)}(\tilde\theta, z) \big| \le C \big( |\alpha| + |\tilde\alpha| + A(\mu_1, \mu_2) + |z| \big) \Big( d_2\big((\mu_1, \nu_1), (\mu_2, \nu_2)\big) + |\theta - \tilde\theta| \Big),
\]
\[
\big| v^{\Omega}_{(\mu_1,\nu_1)}(\omega_1, z, x) - v^{\Omega}_{(\mu_2,\nu_2)}(\omega_2, z, x) \big| \le C \big( A(\mu_1, \mu_2) + |x| + |z| \big) \big( |\omega_1 - \omega_2| + d_2(\mu_1, \mu_2) \big),
\]
where
\[
A(\mu_1, \mu_2) = 1 + \left( \int |\alpha|^2\, d\mu_1 \right)^{1/2} + \left( \int |\alpha|^2\, d\mu_2 \right)^{1/2}.
\]
Proof [Proof of Lemma 13] Throughout the proof, we use the notation θ = (θ_1, ..., θ_K) with
θ_i = (α_i, β_i, γ_i) ∈ R × R^L × R, α = (α_1, ..., α_K) ∈ R^K, and ω = (a, b, c) ∈ Q. We begin by
explicitly writing out the vector fields
\[
v^{\Omega}_{(\mu,\nu)}(\omega, z, x) =
\begin{pmatrix}
\sigma(b \cdot G_\mu(z) + c) - \sigma(b \cdot x + c) \\
a\, G_\mu(z)\, \sigma'(b \cdot G_\mu(z) + c) - a\, x\, \sigma'(b \cdot x + c) \\
a\, \sigma'(b \cdot G_\mu(z) + c) - a\, \sigma'(b \cdot x + c)
\end{pmatrix},
\]
and v^{Θ}_{(µ,ν)}(θ, z) = ( v^{Θ}_{(µ,ν);1}(θ_1, z), · · · , v^{Θ}_{(µ,ν);K}(θ_K, z) ), with, for 1 ≤ j ≤ K,
\[
v^{\Theta}_{(\mu,\nu);j}(\theta_j, z) = - \int_{[-1,1]^{1+K+1}}
\begin{pmatrix}
a\, b_j\, \sigma(\beta_j \cdot z + \gamma_j)\, \sigma'(b \cdot G_\mu(z) + c) \\
a\, b_j\, \alpha_j\, z\, \sigma'(\beta_j \cdot z + \gamma_j)\, \sigma'(b \cdot G_\mu(z) + c) \\
a\, b_j\, \alpha_j\, \sigma'(\beta_j \cdot z + \gamma_j)\, \sigma'(b \cdot G_\mu(z) + c)
\end{pmatrix} d\nu(\omega).
\]
Using (25), and that |a|, |b|, |c| ≤ 1, we readily obtain (23) and (24). Applying the mean value
theorem,
\[
\nabla_\omega \sigma(x_1; \omega_1) - \nabla_\omega \sigma(x_2; \omega_2) =
\begin{pmatrix}
\sigma'(\xi_0)\,[\,b_1 \cdot x_1 - b_2 \cdot x_2 + c_1 - c_2\,] \\
a_1 x_1 \sigma''(\xi_1)\,[\,(b_1 \cdot x_1 + c_1) - (b_2 \cdot x_2 + c_2)\,] + \big( x_1 (a_1 - a_2) + (x_1 - x_2) a_2 \big) \sigma'(b_2 \cdot x_2 + c_2) \\
a_1 \sigma''(\xi_1)\,[\,(b_1 \cdot x_1 + c_1) - (b_2 \cdot x_2 + c_2)\,] + (a_1 - a_2)\, \sigma'(b_2 \cdot x_2 + c_2)
\end{pmatrix},
\]
where ξ_0, ξ_1 are points in between b_1 · x_1 + c_1 and b_2 · x_2 + c_2. To obtain the estimate for v^Ω, we consider
the difference above in two instances: first taking x_1 = x_2 = x, and then taking x_1 = G_{µ_1}(z) and x_2 = G_{µ_2}(z). Using
the triangle inequality, and ‖σ‖_{C^2} < ∞, we can conclude
\[
\big| v^{\Omega}_{(\mu_1,\nu_1)}(\omega_1, z, x) - v^{\Omega}_{(\mu_2,\nu_2)}(\omega_2, z, x) \big| \le C \big( 1 + |x| + |G_{\mu_1}(z)| + |G_{\mu_2}(z)| \big) \big( |\omega_1 - \omega_2| + |G_{\mu_1}(z) - G_{\mu_2}(z)| \big).
\]
To estimate G_{µ_1}(z) − G_{µ_2}(z), we consider π a coupling between µ_1 and µ_2, and notice that the
difference is given by
\[
|G_{\mu_1}(z) - G_{\mu_2}(z)| = \Big| \int \sigma(z; \theta) - \sigma(z; \tilde\theta)\, d\pi(\theta, \tilde\theta) \Big| \le \int |\sigma(z; \theta) - \sigma(z; \tilde\theta)|\, d\pi.
\]
Estimating,
\[
|\sigma(z; \theta) - \sigma(z; \tilde\theta)| \le C \big( 1 + (|\alpha| + |\tilde\alpha|)(1 + |z|) \big)\, |\theta - \tilde\theta|.
\]
Taking π to be the optimal coupling with respect to the d_2 distance, and using (25), we conclude
\[
\big| v^{\Omega}_{(\mu_1,\nu_1)}(\omega_1, z, x) - v^{\Omega}_{(\mu_2,\nu_2)}(\omega_2, z, x) \big|^2 \le C \Big( 1 + |x|^2 + |z|^2 + \sum_{i=1,2} \int |\alpha|^2\, d\mu_i \Big) \Big( |\omega_1 - \omega_2|^2 + d_2^2(\mu_1, \mu_2) \Big).
\]
For v^Θ, we apply the same argument as above to obtain a bound that also depends on the size of
|α|.
Lemma 14 Let (µ, ν) ∈ AC([0, T]; P((R^{L+2})^K) × P(Q)) be a weak solution to (7). Then
\[
\int |\alpha|^2\, d\mu_t \le C \Big( \int |\alpha|^2\, d\mu_{\mathrm{in}} + t^2 \Big). \tag{26}
\]
Proof By the bound ‖(V^Θ_j)_1‖_∞ ≤ C, we conclude that |α(t, θ_in)| ≤ |α_in| + Ct, which implies the
desired bound.
A key step in the proof of existence and uniqueness, Proposition 12 and Proposition 10, is the
stability of the projected ODE flow.
Lemma 15 We consider (µ_1, ν_1), (µ_2, ν_2) ∈ AC([0, T]; P((R^{L+2})^K) × P(Q)) that satisfy the growth
condition (26). The associated flow maps (20) satisfy the bounds
\[
\big| \Phi^{t}_{(\mu_1,\nu_1)}(\theta_1, \omega_1) - \Phi^{t}_{(\mu_2,\nu_2)}(\theta_2, \omega_2) \big|^2 \le e^{C(\Lambda + |\alpha_1| + |\alpha_2|)t}\, e^{Ct^2}\, |(\theta_1, \omega_1) - (\theta_2, \omega_2)|^2
+ C\, e^{C(\Lambda + |\alpha_1| + |\alpha_2|)t}\, e^{Ct^2} \int_0^t C(r)\, d_2^2\big((\mu_1, \nu_1)(r), (\mu_2, \nu_2)(r)\big)\, dr,
\]
where
\[
\Lambda = 1 + \left( \int |\alpha_1|^2\, d\mu_{1,\mathrm{in}} \right)^{1/2} + \left( \int |\alpha_2|^2\, d\mu_{2,\mathrm{in}} \right)^{1/2};
\qquad
C(r) = e^{-C|\alpha_1|r}\, e^{-C|\alpha_2|r}\, e^{-C\Lambda r}\, e^{-Cr^2/2}\, \big( \Lambda + r + |\alpha_1| + |\alpha_2| \big).
\]
CΘ (θ1 (t), θ2 (t)) ≤ C (CΘ (θ1,in , θ2,in ) + t) and CΩ (t) ≤ C (CΩ (µ1,in , µ2,in ) + t) .
\[
\begin{aligned}
\frac{1}{2} \frac{d}{dt} |\omega_1(t) - \omega_2(t)|^2
&= \Big\langle \omega_1(t) - \omega_2(t),\; \mathrm{Proj}_{\pi_Q} V^{\Omega}_{(\mu_{t,1}, \nu_{t,1})}(\omega_1(t)) - \mathrm{Proj}_{\pi_Q} V^{\Omega}_{(\mu_{t,2}, \nu_{t,2})}(\omega_2(t)) \Big\rangle \\
&\le \Big\langle \omega_1(t) - \omega_2(t),\; V^{\Omega}_{(\mu_{t,1}, \nu_{t,1})}(\omega_1(t)) - V^{\Omega}_{(\mu_{t,2}, \nu_{t,2})}(\omega_2(t)) \Big\rangle \\
&\le C_\Omega(t)\, \big( |\omega_1(t) - \omega_2(t)| + d_2(\mu_{1,t}, \mu_{2,t}) \big)^2,
\end{aligned}
\]
Proof [Proof of Proposition 10] Let d(t) = d_2^2((µ_{t,1}, ν_{t,1}), (µ_{t,2}, ν_{t,2})),
and notice that for any coupling Π_* between µ_{1,in} ⊗ ν_{1,in} and µ_{2,in} ⊗ ν_{2,in}
\[
d(t) \le \int |(\theta_1(t), \omega_1(t)) - (\theta_2(t), \omega_2(t))|^2\, d\Pi_*\big((\theta_{1,\mathrm{in}}, \omega_{1,\mathrm{in}}), (\theta_{2,\mathrm{in}}, \omega_{2,\mathrm{in}})\big),
\]
since the push-forward of Π∗ along the ODE flow at time t is a coupling between µ1,t ⊗ ν1,t and
µ2,t ⊗ ν2,t . Using Lemma 15, we obtain that
\[
d(t) \le \underbrace{\int e^{C(\Lambda + |\alpha_1| + |\alpha_2|)t}\, e^{Ct^2}\, |(\theta_1, \omega_1)(0) - (\theta_2, \omega_2)(0)|^2\, d\Pi_*}_{I}
+ \underbrace{C \int_0^t d(r) \int e^{C(\Lambda + |\alpha_1| + |\alpha_2|)(t - r)}\, e^{C(t^2 - r^2)}\, \big( \Lambda + r + |\alpha_1| + |\alpha_2| \big)\, d\Pi_*\, dr}_{II}.
\]
For I we apply the Cauchy–Schwarz and Cauchy’s inequality, and take Π_* as the optimal coupling
with respect to the 4-Wasserstein distance to get the bound
\[
\begin{aligned}
I &\le e^{C(\Lambda t + t^2)} \left( \int e^{C|\alpha|t}\, d\mu_{1,\mathrm{in}} + \int e^{C|\alpha|t}\, d\mu_{2,\mathrm{in}} \right)^{1/2} \left( \int |(\theta_1, \omega_1)(0) - (\theta_2, \omega_2)(0)|^4\, d\Pi_* \right)^{1/2} \\
&\le e^{C(\Lambda t + t^2)} \left( \int e^{C|\alpha|t}\, d\mu_{1,\mathrm{in}} + \int e^{C|\alpha|t}\, d\mu_{2,\mathrm{in}} \right)^{1/2} d_4^2\big((\mu_{\mathrm{in},1}, \nu_{\mathrm{in},1}), (\mu_{\mathrm{in},2}, \nu_{\mathrm{in},2})\big).
\end{aligned}
\]
\[
d_2^2\big((\mu_{t,1}, \nu_{t,1}), (\mu_{t,2}, \nu_{t,2})\big) \le A(t)\, e^{t B(t)}\, d_4^2\big((\mu_{\mathrm{in},1}, \nu_{\mathrm{in},1}), (\mu_{\mathrm{in},2}, \nu_{\mathrm{in},2})\big),
\]
where at each step we sample x^n ∼ P_* and z^n ∼ N independently, µ_N^n denotes the empirical
measure associated to θ_1^n, ..., θ_N^n, and ν_N^n the empirical measure associated to ω_1^n, ..., ω_N^n. The
parameters (θ_i^0, ω_i^0) are assumed to be initialized by independent sampling from µ_in ⊗ ν_in. The
linear interpolation of the parameters to a continuous time variable t > 0 with time step ∆t = h/N
will be denoted by (θ_i, ω_i), where we let θ_i(t_n) = θ_i^n and ω_i(t_n) = ω_i^n, with t_n = n∆t = nh/N. We
let µ and ν be the empirical measures associated to θ_1, ..., θ_N and ω_1, ..., ω_N. We suppress the
dependence on N of the measures for notational simplicity.
We consider the mean field ODE system defined by the expectation of the vector fields over z
and x,
\[
\begin{cases}
\dfrac{d}{dt} \hat\theta_i = V^{\Theta}_{(\hat\mu, \hat\nu)}(\hat\theta_i), \\[4pt]
\dfrac{d}{dt} \hat\omega_i = \mathrm{Proj}_{\pi_Q(\hat\omega_i)} V^{\Omega}_{(\hat\mu, \hat\nu)}(\hat\omega_i),
\end{cases}
\]
where µ̂ and ν̂ are the empirical measures associated to θ̂_1, ..., θ̂_N and ω̂_1, ..., ω̂_N, respectively, and
the initial conditions are coupled to the parameter training by θ̂_i(0) = θ_i^0 and ω̂_i(0) = ω_i^0. More
precisely, the probability measures µ̂ and ν̂ are the solutions of the PDE (7) with random initial
conditions chosen as (µ̂(0), ν̂(0)) = (µ_N(0), ν_N(0)).
To simplify the arguments, we first consider the distance between the mean field ODE system and
the discrete projected forward Euler algorithm
\[
\hat\theta_i^{n+1} = \hat\theta_i^{n} + \Delta t\, V^{\Theta}_{(\hat\mu^n, \hat\nu^n)}(\hat\theta_i^{n}), \qquad
\hat\omega_i^{n+1} = \mathrm{Proj}_Q\big( \hat\omega_i^{n} + \Delta t\, V^{\Omega}_{(\hat\mu^n, \hat\nu^n)}(\hat\omega_i^{n}) \big),
\]
where we let T > 0 be a fixed time horizon and consider ∆t = h/N , where h > 0 is the user defined
learning rate. To estimate the difference between the continuum and the discrete approximation,
we can use a similar argument to Theorem 17, taking into consideration the bound on the Lipschitz
constant of the vector fields given by Lemma 13. We can obtain the bound
" N
#
1 X ∆t h C|α| i
E |θ̂i − θ̂| ≤ ∆tC 1 + Eµin eCe
2
.
N
i=1
The argument is simpler than the argument below, so we skip it to avoid burdensome repetition.
We define
\[
e_i^n = |\hat\theta_i^n - \theta_i^n|^2 + |\hat\omega_i^n - \omega_i^n|^2 \qquad \text{and} \qquad e^n = \frac{1}{N} \sum_{i=1}^{N} e_i^n,
\]
and notice the inequality
\[
d_2^2\big((\mu^n, \nu^n), (\hat\mu^n, \hat\nu^n)\big) \le e^n.
\]
Taking one step of either algorithm,
\[
e_i^{n+1} = \big| \hat\theta_i^n + \Delta t\, V^{\Theta}_{(\hat\mu^n, \hat\nu^n)}(\hat\theta_i^n) - \big( \theta_i^n + \Delta t\, v^{\Theta}_{(\mu^n, \nu^n)}(\theta_i^n) \big) \big|^2
+ \big| \mathrm{Proj}_Q\big( \hat\omega_i^n + \Delta t\, V^{\Omega}_{(\hat\mu^n, \hat\nu^n)}(\hat\omega_i^n) \big) - \mathrm{Proj}_Q\big( \omega_i^n + \Delta t\, v^{\Omega}_{(\mu^n, \nu^n)}(\omega_i^n) \big) \big|^2.
\]
Using that the projection is contractive, expanding the square and bounding we obtain
\[
e_i^{n+1} \le e_i^n + \Delta t\, (A_i^n + B_i^n) + (\Delta t)^2\, C_i^n,
\]
where
and
\[
C_i^n = 2 \Big( \big| V^{\Theta}_{(\hat\mu^n, \hat\nu^n)}(\hat\theta_i^n) \big|^2 + \big| v^{\Theta}_{(\mu^n, \nu^n)}(\theta_i^n) \big|^2 + \big| V^{\Omega}_{(\hat\mu^n, \hat\nu^n)}(\hat\omega_i^n) \big|^2 + \big| v^{\Omega}_{(\mu^n, \nu^n)}(\omega_i^n) \big|^2 \Big).
\]
where
\[
K_i = C \left( 1 + \Big( \frac{1}{N} \sum_{j} |\alpha_{j,\mathrm{in}}|^2 \Big)^{1/2} + |\alpha_{i,\mathrm{in}}| \right).
\]
Next, we will take the conditional expectation with respect to the variables {αj,in }. To this end,
we notice the bound
" n # 1/2
n 2
X X
E Bir {αj,in } ≤ E Bir {αj,in }
r=0 r=0
n n n
!1/2
X X X
= E[|Bir |2 ||{αj,in }] +2 E[Bir1 Bir2 |{αj,in }]
r=0 r1 =0 r2 =r1 +1
n
!1/2
X
≤ Ki2 E[eri r
+ e |{αj,in }]
r=0
n
!
X
≤ Ki 1+ E[eri + er |{αj,in }] ,
r=0
and that
\[
\mathbb{E}\big[ B_i^{r_1} B_i^{r_2} \,\big|\, \{\alpha_{j,\mathrm{in}}\} \big] = 0,
\]
which follows by using the law of iterated expectation with the sigma algebra F^{r_2} generated by
{(θ_i^n, ω_i^n)}_{i=1}^{N}, {x^r}_{r=0}^{r_2−1} and {z^r}_{r=0}^{r_2−1}. Namely,
where
\[
K = \frac{1}{N} \sum_{i=1}^{N} K_i\, e^{T K_i}.
\]
Using the discrete Gronwall inequality one last time we have the estimate
\[
\mathbb{E}\big[ e^{n+1} \,\big|\, \{\alpha_{j,\mathrm{in}}\} \big] \le \Delta t\, e^{T K}\, \frac{1}{N} \sum_{i=1}^{N} K_i^2\, e^{T K_i}. \tag{27}
\]
The desired bound (12) follows from using the bound (11) to show that the right hand side above
is finite.
with λ being a user chosen penalization parameter. The evolution of the mean field limit can be
formally characterized as the gradient descent of E on µ and gradient ascent on ν. In terms of
equations we consider
\[
\begin{cases}
\partial_t \mu - \nabla_\theta \cdot \Big( \mu\, \nabla_\theta \dfrac{\delta E}{\delta \mu}[\mu, \nu] \Big) = 0, \\[6pt]
\partial_t \nu + \gamma_c\, \nabla_\omega \cdot \Big( \nu\, \nabla_\omega \dfrac{\delta E}{\delta \nu}[\mu, \nu] \Big) = 0, \\[6pt]
\mu(0) = \mu_{\mathrm{in}}, \quad \nu(0) = \nu_{\mathrm{in}}.
\end{cases}
\]
Understanding the difference in the dynamics for these improved algorithms is an interesting open
problem.
For the long time behavior of the dynamics (7), we refer to Section 3 for intuition where we
show in a toy example of ODEs that for any initial conditions the dynamics stabilize to a limiting
periodic orbit. Generalizing this to absolutely continuous initial data is quite complicated; we
mention the recent work on the Euler equations by Hassainia et al. (2023), where the authors construct
vortex patches that replicate the motion of leapfrogging vortex points. Moreover, for the general
system, we expect that the dynamics will always converge to some limiting periodic orbit. Showing
this rigorously is a challenging PDE problem.
In terms of the curse of dimensionality exhibited in Corollary 8, an alternative would be to
quantify the convergence of the algorithm in a Reproducing Kernel Hilbert Space (RKHS). In PDE
terms, this would mean showing well-posedness of the PDE in a negative Sobolev space like H^{−s}
with s > d/2.
Acknowledgments
We would like to acknowledge Justin Sirignano, Yao Yao and Federico Camara Halac for useful
conversations at the beginning of this project. MGD would like to thank the Isaac Newton Institute
for Mathematical Sciences, Cambridge, for support and hospitality during the program Frontiers
in Kinetic Theory where part of the work on this paper was undertaken. This work was supported
by EPSRC grant no EP/R014604/1. The research of MGD was partially supported by NSF-DMS-
2205937 and NSF-DMS RTG 1840314. The research of RC was partially supported by NSF-DMS
RTG 1840314.
Appendix A.
Following the ideas of Henry (1973), in this section we prove the existence, uniqueness and stability
for a class of ODEs with discontinuous forcing given by a projection. We also show quantitative
convergence of the projected forward Euler algorithm, for which we could not find a good reference.
Before we present the main result, we introduce some notation that we need. For any closed
convex subset Q ⊂ R^d and x ∈ R^d there exists a unique Proj_Q x ∈ Q such that
which is a closed convex cone. The map Proj_{π_Q(x)} : R^d → R^d denotes the projection onto
π_Q(x) ⊂ R^d. We notice that for a smooth vector field V : Q → R^d, the mapping x ∈ R^d ↦
Proj_{π_Q(x)}(V(x)) is discontinuous at points x such that V(x) ∉ π_Q(x).
Moreover, we can approximate these solutions by a projected forward Euler algorithm.
Theorem 17 Let x_{∆t} : [0, ∞) → Q be the linear interpolation at times n∆t of {x_{∆t}^n} defined by
the projected Euler algorithm
\[
\begin{cases}
x_{\Delta t}^{n+1} = \mathrm{Proj}_Q\big( x_{\Delta t}^{n} + \Delta t\, V(x_{\Delta t}^{n}) \big), \\[2pt]
x_{\Delta t}^{0} = x_{\mathrm{in}}.
\end{cases} \tag{30}
\]
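As an illustration of (30), here is a minimal sketch for a box constraint Q = [−1, 1]^d, where the projection onto Q reduces to a coordinate-wise clip; the particular vector field is an arbitrary choice made for the example.

```python
import numpy as np

def proj_Q(x, lo=-1.0, hi=1.0):
    # Projection onto the box Q = [lo, hi]^d (closest point of Q).
    return np.clip(x, lo, hi)

def projected_euler(V, x_in, dt, n_steps):
    """Projected forward Euler: x^{n+1} = Proj_Q(x^n + dt * V(x^n))."""
    x = np.array(x_in, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        x = proj_Q(x + dt * V(x))
        path.append(x.copy())
    return np.array(path)

# Example: a rotation field with a drift that pushes trajectories against the boundary.
V = lambda x: np.array([-x[1], x[0]]) + 0.5
path = projected_euler(V, x_in=[0.2, 0.0], dt=1e-2, n_steps=2000)
print(path[-1])  # the iterates remain in Q at every step by construction
```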
\[
\big\langle x_1 - x_2,\; \mathrm{Proj}_{\pi_Q(x_1)} V(x_1) - \mathrm{Proj}_{\pi_Q(x_2)} V(x_2) \big\rangle \le \big\langle x_1 - x_2,\; V(x_1) - V(x_2) \big\rangle \le \|\nabla V\|_\infty\, \|x_1 - x_2\|^2,
\]
where we have used Lemma 18 for the first inequality, and the Lipschitz property for the second
inequality. Gronwall’s inequality applied to ‖x_1 − x_2‖² gives:
which shows the uniqueness and stability of solutions with respect to the initial condition.
Equivalence with a relaxed problem. Using NQ (x), we now introduce a relaxed problem
which we prove is equivalent to the ODE (29). For each x ∈ Q we define the compact convex set
V(x) ⊂ R^d by
\[
V(x) = \{ V(x) - n_x \;|\; n_x \in N_Q(x), \; \|n_x\|^2 \le V(x) \cdot n_x \}.
\]
The relaxed problem is finding an absolutely continuous curve x : [0, T] → Q such that
\[
\begin{cases}
\dot x(t) \in V(x(t)), \\[2pt]
x(0) = x_{\mathrm{in}},
\end{cases} \tag{31}
\]
for almost every t ∈ [0, T ]. To show the equivalence between (29) and (31), we need the following
Lemma.
Lemma 19 For all x ∈ Q we have V(x) ∈ V(x), Proj_{π_Q(x)} V(x) ∈ V(x) and
Proof [Proof of Lemma 19] Taking n_x = 0 in the definition of V(x) gives that V(x) ∈ V(x). Writing
n_x = V(x) − Proj_{π_Q(x)} V(x), we recall from Lemma 18 that n_x ∈ N_Q(x) and
so ‖n_x‖² = ⟨n_x, V(x)⟩ and we conclude Proj_{π_Q(x)} V(x) ∈ V(x). Now note that if V(x) − n_x ∈ π_Q(x)
with n_x ∈ N_Q(x), then ⟨V(x) − n_x, n_x⟩ ≤ 0, with equality only if V(x) − n_x = Proj_{π_Q(x)} V(x), as we
have noted above. So if V(x) − n_x ∈ π_Q(x) ∩ V(x) then V(x) − n_x = Proj_{π_Q(x)} V(x).
which follows directly from the properties of the projection. For each n ≥ 0 we consider the discrete
velocity
\[
u_{\Delta t}^{n} = \frac{x_{\Delta t}^{n+1} - x_{\Delta t}^{n}}{\Delta t},
\]
which we re-write as
\[
u_{\Delta t}^{n} \in V(x_{\Delta t}^{n+1}) + \|\nabla V\|_\infty \|V\|_\infty\, \Delta t\, B_1.
\]
Noting that x_{∆t} is uniformly Lipschitz with constant less than ‖V‖_∞, we get, up to a subsequence, that there
exists a Lipschitz function X : [0, ∞) → Q such that x_{∆t} → X uniformly on compact subintervals,
by Arzelà–Ascoli. We conclude using Mazur’s Lemma that the derivative of X belongs almost
everywhere to the upper limit of the convex hull of the values of ẋ_{∆t}(t),
which implies that X is a solution to the relaxed problem, and therefore a solution to the original
(29).
Quantitative Estimate. We differentiate the distance between X and x_{∆t} to obtain
\[
\begin{aligned}
\frac{1}{2} \frac{d}{dt} |X - x_{\Delta t}|^2 &= \langle X - x_{\Delta t},\, \dot X - \dot x_{\Delta t} \rangle \\
&\le \langle X - x_{\Delta t},\, V(X) - V(x_{\Delta t}) \rangle + \Delta t\, \|V\|_\infty\, |\dot X - \dot x_{\Delta t}| + \Delta t\, \|V\|_\infty \|\nabla V\|_\infty\, |X - x_{\Delta t}| \\
&\le (1 + \|\nabla V\|_\infty)\, |X - x_{\Delta t}|^2 + 2 \Delta t\, \|V\|_\infty^2 + (\Delta t)^2\, \|V\|_\infty^2 \|\nabla V\|_\infty^2,
\end{aligned}
\]
where we have used estimate (32) and the contraction property. Using Gronwall’s inequality
and that |X − x_{∆t}|²(0) = 0, we obtain
\[
|X - x_{\Delta t}|^2 \le e^{2(1 + \|\nabla V\|_\infty)t} \big( 2 \Delta t\, \|V\|_\infty^2 + (\Delta t)^2\, \|V\|_\infty^2 \|\nabla V\|_\infty^2 \big).
\]
References
Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the
space of probability measures. Springer Science & Business Media, 2005.
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks.
In International conference on machine learning, pages 214–223. PMLR, 2017.
Andrea L Bertozzi, Thomas Laurent, and Jesús Rosado. L^p theory for the multidimensional ag-
gregation equation. Communications on Pure and Applied Mathematics, 64(1):45–83, 2011.
François Bolley, Arnaud Guillin, and Cédric Villani. Quantitative concentration inequalities for
empirical measures on non-compact spaces. Probability Theory and Related Fields, 137:541–593,
2007.
José A Carrillo, Robert J McCann, and Cédric Villani. Contractions in the 2-Wasserstein length
space and thermalization of granular media. Archive for Rational Mechanics and Analysis, 179:
217–263, 2006.
Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-
parameterized models using optimal transport. Advances in neural information processing sys-
tems, 31, 2018.
George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control,
Signals and Systems, 2(4):303–314, 1989.
Guido De Philippis, Alpár Richárd Mészáros, Filippo Santambrogio, and Bozhidar Velichkov. Bv
estimates in optimal transportation and applications. Archive for Rational Mechanics and Anal-
ysis, 219:829–860, 2016.
Simone Di Marino, Bertrand Maury, and Filippo Santambrogio. Measure sweeping processes.
Journal of Convex Analysis, 23(2):567–601, 2016.
Richard M Dudley. Central limit theorems for empirical measures. The Annals of Probability, pages
899–929, 1978.
Xavier Fernández-Real and Alessio Figalli. The continuous formulation of shallow neural networks
as Wasserstein-type gradient flows. In Analysis at Large: Dedicated to the Life and Work of Jean
Bourgain, pages 29–57. Springer, 2022.
Nicolas Fournier and Arnaud Guillin. On the rate of convergence in Wasserstein distance of the
empirical measure. Probability Theory and Related Fields, 162(3-4):707, August 2015. URL
https://hal.science/hal-00915365.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,
Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information
processing systems, 27, 2014.
Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville.
Improved training of Wasserstein GANs. Advances in neural information processing systems, 30,
2017.
Zineb Hassainia, Taoufik Hmidi, and Nader Masmoudi. Rigorous derivation of the leapfrogging
motion for planar Euler equations. arXiv preprint arXiv:2311.15765, 2023.
Claude Henry. An existence theorem for a class of differential equations with multivalued right-hand
side. Journal of Mathematical Analysis and Applications, 41(1):179–186, 1973.
Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for
improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convo-
lutional neural networks. Communications of the ACM, 60(6):84–90, 2017.
Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-
layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671,
2018.
Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled Generative Adversarial
Networks. In International Conference on Learning Representations, 2016.
Jean Jacques Moreau. Evolution problem associated with a moving convex set in a Hilbert space.
Journal of Differential Equations, 26(3):347–374, 1977.
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language
understanding by generative pre-training. 2018.
Grant Rotskoff and Eric Vanden-Eijnden. Trainability and accuracy of artificial neural networks:
An interacting particle system approach. Communications on Pure and Applied Mathematics,
75(9):1889–1935, 2022.
Filippo Santambrogio. Crowd motion and evolution PDEs under density constraints. ESAIM:
Proceedings and Surveys, 64:137–157, 2018.
Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks: A law of
large numbers. SIAM Journal on Applied Mathematics, 80(2):725–752, 2020a.
Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks: A central
limit theorem. Stochastic Processes and their Applications, 130(3):1820–1852, 2020b.
Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of deep neural networks.
Mathematics of Operations Research, 47(1):120–152, 2022.
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised
learning using nonequilibrium thermodynamics. In International conference on machine learning,
pages 2256–2265. PMLR, 2015.
Akash Srivastava, Lazar Valkov, Chris Russell, Michael U Gutmann, and Charles Sutton. VEEGAN:
Reducing mode collapse in GANs using implicit variational learning. Advances in neural
information processing systems, 30, 2017.
Hoang Thanh-Tung and Truyen Tran. Catastrophic forgetting and mode collapse in GANs. In 2020
International Joint Conference on Neural Networks (IJCNN), pages 1–10. IEEE, 2020.
Tijmen Tieleman. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent
magnitude. COURSERA: Neural networks for machine learning, 4(2):26, 2012.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information
processing systems, 30, 2017.
Stephan Wojtowytsch and Weinan E. Can shallow neural networks beat the curse of dimensionality?
a mean field training perspective. IEEE Transactions on Artificial Intelligence, 1(2):121–129,
2020. doi: 10.1109/TAI.2021.3051357.