Darbon - Overcoming The Curse of Dimensionality For Some Hamilton-Jacobi Pde Via NN
Darbon - Overcoming The Curse of Dimensionality For Some Hamilton-Jacobi Pde Via NN
Darbon - Overcoming The Curse of Dimensionality For Some Hamilton-Jacobi Pde Via NN
RESEARCH
* Correspondence:
[email protected] Abstract
Division of Applied Mathematics,
Brown University, Providence,
We propose new and original mathematical connections between Hamilton–Jacobi
USA (HJ) partial differential equations (PDEs) with initial data and neural network
Research supported by NSF DMS architectures. Specifically, we prove that some classes of neural networks correspond to
1820821. Authors’ names are
given in last/family name
representation formulas of HJ PDE solutions whose Hamiltonians and initial data are
alphabetical order obtained from the parameters of the neural networks. These results do not rely on
universal approximation properties of neural networks; rather, our results show that
some classes of neural network architectures naturally encode the physics contained in
some HJ PDEs. Our results naturally yield efficient neural network-based methods for
evaluating solutions of some HJ PDEs in high dimension without using grids or
numerical approximations. We also present some numerical results for solving some
inverse problems involving HJ PDEs using our proposed architectures.
1 Introduction
The Hamilton–Jacobi (HJ) equations are an important class of partial differential equation
(PDE) models that arise in many scientific disciplines, e.g., physics [6,25,26,33,101], imag-
ing science [38–40], game theory [13,24,49,82], and optimal control [9,46,55,56,110].
Exact or approximate solutions to these equations then give practical insight about the
models in consideration. We consider here HJ PDEs specified by a Hamiltonian function
H : Rn → R and convex initial data J : Rn → R
⎧
⎨ ∂S (x, t) + H(∇x S(x, t)) = 0 in Rn × (0, +∞),
∂t (1)
⎩
S(x, 0) = J (x) in Rn ,
where ∂S
∂t (x, t) and ∇x S(x, t) = ∂S
∂x1 (x, t), . . . , ∂S
∂xn (x, t) denote the partial derivative with
respect to t and the gradient vector with respect to x of the function (x, t) → S(x, t), and
the Hamiltonian H only depends on the gradient ∇x S(x, t).
0123456789().,–: volV
20 Page 2 of 50 J. Darbon et al. Res Math Sci (2020)7:20
Our main motivation is to compute the viscosity solution of certain HJ PDEs of the
form of (1) in high dimension for a given x ∈ Rn and t > 0 [9–11,34] by leveraging
new efficient hardware technologies and silicon-based electric circuits dedicated to neu-
ral networks. As noted by LeCun in [102], the use of neural networks has been greatly
influenced by available hardware. In addition, there have been many initiatives to cre-
ate new hardware for neural networks that yield extremely efficient (in terms of speed,
latency, throughput or energy) implementations: For instance, [50–52] propose efficient
neural network implementations using field-programmable gate array, [8] optimizes neu-
ral network implementations for Intel’s architecture, and [96] provides efficient hardware
implementation of certain building blocks widely used in neural networks. It is also worth
mentioning that Google created specific hardware, called “Tensor Processor Unit” [87] to
implement their neural networks in data centers. Note that Xilinx announced a new set of
hardware (Versal AI core) for implementing neural networks while Intel enhances their
processors with specific hardware instructions for neural networks. LeCun also suggests
in [102, Section 3] possible new trends for hardware dedicated to neural networks. Finally,
we refer the reader to [30] (see also [69]) that describes the evolution of silicon-based elec-
trical circuits for machine learning.
In this paper, we propose classes of neural network architectures that exactly represent
viscosity solutions of certain HJ PDEs of the form of (1). Our results pave the way to lever-
age efficient dedicated hardware implementation of neural networks to evaluate viscosity
solutions of certain HJ PDEs for initial data which takes a particular form.
Related work The viscosity solution to the HJ PDE (1) rarely admits a closed-form expres-
sion, and in general it must be computed with numerical algorithms or other methods
tailored for the Hamiltonian H, initial data J , and dimension n.
The dimensionality, in particular, matters significantly because in many applications
involving HJ PDE models, the dimension n is extremely large. In imaging problems, for
example, the vector x typically corresponds to a noisy image whose entries are its pixel
values, and the associated Hamilton–Jacobi equations describe the solution to an image
denoising convex optimization problem [38,39]. Denoising a 1080 x 1920 standard full
HD image on a smartphone, for example, corresponds to solving a HJ PDE in dimension
n = 1080 × 1920 = 2,073,600.
Unfortunately, standard grid-based numerical algorithms for PDEs are impractical when
n > 4. Such algorithms employ grids to discretize the spatial and time domain, and the
number of grid points required to evaluate accurately solutions of PDEs grows exponen-
tially with the dimension n. It is therefore essentially impossible in practice to numerically
solve PDEs in high dimension using grid-based algorithms, even with sophisticated high-
order accuracy methods for HJ PDEs such as ENO [121], WENO [84], and DG [75]. This
problem is known as the curse of dimensionality [17].
Overcoming the curse of dimensionality in general remains an open problem, but for
HJ PDEs several methods have been proposed to solve it. These include, but are not
limited to, max-plus algebra methods [2,3,45,54,60,110–113], dynamic programming
and reinforcement learning [4,19], tensor decomposition techniques [44,73,142], sparse
grids [20,59,90], model order reduction [5,97], polynomial approximation [88,89], multi-
level Picard method [79–81,146], optimization methods [38–40,151] and neural networks
[7,42,64,76,77,83,100,120,131,134,136,138]. Among these methods, neural networks
have become increasingly popular tools to solve PDEs [7,14–16,18,29,31,41–43,53,58,
J. Darbon et al. Res Math Sci (2020)7:20 Page 3 of 50 20
62–65,74,76–78,85,92,93,98–100,104,109,114,115,118,120,123,131,134–136,138–140,
144,145,148–150] and inverse problems involving PDEs [107,108,116,117,122,126–
130,143,149,152,153]. Their popularity is due to universal approximation theorems that
state that neural networks can approximate broad classes of (high-dimensional, non-
linear) functions on compact sets [35,71,72,124]. These properties, in particular, have
been recently leveraged to approximate solutions to high-dimensional nonlinear HJ PDEs
[64,138] and for the development of physics-informed neural networks that aim to solve
supervised learning problems while respecting any given laws of physics described by a
set of nonlinear PDEs [128].
In this paper, we propose some neural network architectures that exactly represent
viscosity solutions to HJ PDEs of the form of (1), where the Hamiltonians and initial
data are obtained from the parameters of the neural network architectures. Recall our
results require the initial data J to be convex and the Hamiltonian H to only depend on
the gradient ∇x S(x, t) [see Eq. (1)]. In other words, we show that some neural networks
correspond to exact representation formulas of HJ PDE solutions. To our knowledge, this
is the first result that shows that certain neural networks can exactly represent solutions
of certain HJ PDEs.
Note that an alternative method to numerically evaluate solutions of HJ PDEs of the
form of (1) with convex initial data has been proposed in [40]. This method relies on
the Hopf formula and is only based on optimization. Therefore, this method is grid and
approximation-free and works well in high dimension. Contrary to [40], our proposed
approach does not rely on any (possibly non-convex) optimization techniques.
Contributions of this paper In this paper, we prove that some classes of shallow neural
networks are, under certain conditions, viscosity solutions to Hamilton–Jacobi equations
for initial data which takes a particular form. The main result of this paper is Theorem 3.1.
We show in this theorem that the neural network architecture illustrated in Fig. 1 rep-
resents, under certain conditions, the viscosity solution to a set of first-order HJ PDEs of
the form of (1), where the Hamiltonians and the convex initial data are obtained from the
parameters of the neural network. As a corollary of this result for the one-dimensional
case, we propose a second neural network architecture (illustrated in Fig. 4) that repre-
sents the spatial gradient of the viscosity solution of the HJ PDE above in 1D and show
in Proposition 3.1 that under appropriate conditions, this neural network corresponds to
entropy solutions of some conservation laws in 1D.
Let us emphasize that the proposed architecture in Fig. 1 for representing solutions to
HJ PDEs allows us to numerically evaluate their solutions in high dimension without using
grids.
We also stress that our results do not rely on universal approximation properties of
neural networks. Instead, our results show that the physics contained in HJ PDEs satisfying
the conditions of Theorem 3.1 can naturally be encoded by the neural network architecture
depicted in Fig. 1. Our results further suggest interpretations of this neural network
architecture in terms of solutions to PDEs.
We also test the proposed neural network architecture (depicted in Fig. 1) on some
inverse problems. To do so, we consider the following problem. Given training data sam-
pled from the solution S of a first-order HJ PDE (1) with unknown convex initial function
J and Hamiltonian H, we aim to recover the unknown initial function. After the training
process using the Adam optimizer, the trained neural network with input time variable
20 Page 4 of 50 J. Darbon et al. Res Math Sci (2020)7:20
2 Background
In this section, we introduce mathematical concepts that will be used in this paper. We
review the standard structure of shallow neural networks from a mathematical point of
view in Sect. 2.1 and present some fundamental definitions and results in convex analysis
in Sect. 2.2. For the notation, we use Rn to denote the n-dimensional Euclidean space. The
Euclidean scalar product and Euclidean norm on Rn are denoted by ·, · and · 2 . The set
of matrices with m rows and n columns with real entries is denoted by Mm,n (R).
Rn × Rn × R (x, w i , bi ) → w i , x + bi .
J. Darbon et al. Res Math Sci (2020)7:20 Page 5 of 50 20
These m affine functions can be succinctly written in vector form as W x + b, where the
matrix W ∈ Mm,n (R) has for rows the weights w i and the vector b ∈ Rm has for entries
the biases bi . The output layer comprises a nonlinear function σ : Rm → R that takes for
input the vector W x + b of affine functions and gives the number
Rn × Rn × R (x, w i , bi ) → σ (W x + b) .
The nonlinear function σ is called the activation function of the output layer.
i=1 ⊂ R ×
In Sect. 4, we will consider the following problem: Given data points {(xi , yi )}N n
R, infer the relationship between the input xi ’s and the output yi ’s. To infer this relation,
we assume that the output takes the form (or can be approximated by) yi = σ (W xi + b)
for some known activation function σ , unknown matrix of weights W ∈ Mm,n (R), and
unknown vector of bias b. A standard approach to solve such a problem is to estimate the
weights w i and biases bi so as to minimize the mean square error
1
N
{(w̄ i , b̄i )}m
i=1 ∈ arg min (σ (W xi + b) − yi )2 . (2)
{(w i ,bi )}m N
i=1 ⊂R ×R
n
i=1
In the field of machine learning, solving this minimization problem is called the learning
or training process. The data {(xi , yi )}N
i=1 used in the training process is called training
data. Finding a global minimizer is generally difficult due to the complexity of the mini-
mization problem and that the objective function is not convex with respect to the weights
and biases. State-of-the-art algorithms for solving these problems are stochastic gradient
descent-based methods with momentum acceleration, such as the Adam optimizer for
neural networks [94]. This algorithm will be used in our numerical experiments.
Definition 1 (Convex sets, relative interiors, and convex hulls) A set C ⊂ Rn is called
convex if for any λ ∈ [0, 1] and any x, y ∈ C, the element λx + (1 − λ)y is in C. The relative
interior of a convex set C ⊂ Rn , denoted by ri C, consists of the points in the interior of
the unique smallest affine set containing C. The convex hull of a set C, denoted by conv C,
consists of all the convex combinations of the elements of C. An important example of a
convex hull is the unit simplex in Rn , which we denote by
n
Λn := (α1 , . . . , αn ) ∈ [0, 1]n : αi = 1 . (3)
i=1
A function f is called proper if its domain is non-empty and f (x) > −∞ for every x ∈ Rn .
20 Page 6 of 50 J. Darbon et al. Res Math Sci (2020)7:20
The subdifferential ∂f (x) is a closed convex set whenever it is non-empty, and any vector
p ∈ ∂f (x) is called a subgradient of f at x. If f is a proper convex function, then ∂f (x) = ∅
whenever x ∈ ri (dom f ), and ∂f (x) = ∅ whenever x ∈ / dom J [133, Thm. 23.4]. If a convex
function f is differentiable at x0 ∈ R , then its gradient ∇x f (x0 ) is the unique subgradient
n
with equality attained if and only if p ∈ ∂f (x), if and only if x ∈ ∂f ∗ (p) [68, Cor. X.1.4.4].
Table 1 Notation used in this paper. Here, we use C to denote a set in Rn , f to denote a
function from Rn to R ∪ {+∞} and x to denote a vector in Rn
Notation Meaning Definition
·, · Euclidean scalar product in Rn x, y:= ni=1 xi yi
√
· 2 Euclidean norm in Rn x 2 := x, x
ri C Relative interior of C The interior of C with respect to
the minimal hyperplane contain-
ing C in Rn
conv C Convex hull of C The set containing all convex
combinations of the elements of
C
n
Λn Unit simplex in Rn (α1 , . . . , αn ) ∈ [0, 1]n : i=1 αi = 1
dom f Domain of f {x ∈ Rn : f (x) < +∞}
Γ0 (Rn ) A useful and standard class of convex functions The set containing all proper, con-
vex, lower semicontinuous func-
tions from Rn to R ∪ {+∞}
co f Convex envelope of f The largest convex function such
that co f (x) f (x) for every x ∈
Rn
co f Convex and lower semicontinuous envelope of f The largest convex and lower
semicontinuous function such
that co f (x) f (x) for every x ∈
Rn
∂f (x) Subdifferential of f at x {p ∈ Rn : f (y) f (x) + p, y −
x ∀y ∈ Rn }
f∗ Fenchel–Legendre transform of f f ∗ (p):= supx∈Rn {p, x − f (x)}
Fig. 1 Illustration of the structure of the neural network (8) that can represent the viscosity solution to
first-order Hamilton–Jacobi equations for initial data which takes a particular form
3.1 Setup
In this section, we consider the function f : Rn × [0, +∞) → R given by the neural
network in Fig. 1. Mathematically, the function f can be expressed using the following
formula
Our goal is to show that the function f in (8) is the unique uniformly continuous
viscosity solution to a suitable Hamilton–Jacobi equation. In what follows, we denote
f (x, t; {(pi , θi , γi )}m
i=1 ) by f (x, t) when there is no ambiguity in the parameters.
20 Page 8 of 50 J. Darbon et al. Res Math Sci (2020)7:20
there holds i =j αi θi > θj .
Note that (A3) is not a strong assumption. Indeed, if there exist j ∈ {1, . . . , m} and
(α1 , . . . , αm ) ∈ Rm satisfying Eq. (9) and i=j αi θi θj , then
pj , x − tθj − γj αi (pi , x − tθi − γi ) max{ pi , x − tθi − γi }.
i =j
i =j
As a result, the jth neuron in the network can be removed without changing the value
of f (x, t) for any x ∈ Rn and t 0. Removing all such neurons in the network, we can
therefore assume (A3) holds.
Our aim is to identify the HJ equations whose viscosity solutions correspond to the
neural network f defined by Eq. (8). Here, x and t play the role of the spatial and time
variables, and f (·, 0) corresponds to the initial data. To simplify the notation, we define
the function J : Rn → R as
and the set Ix as the collection of maximizers in Eq. (10) at x, that is,
Note that the initial data J given by (10) is a convex and polyhedral function, and it satisfies
several properties that we describe in the following lemma.
(i) The Fenchel–Legendre transform of J is given by the convex and lower semicontinuous
function
⎧ m
⎪
⎪
⎪
⎨ min αi γ i if p ∈ conv ({pi }m
i=1 ),
∗ (α
1 ,...,αm )∈Λm
J (p) = m i=1 (12)
⎪ i=1 αi pi =p
⎪
⎪
⎩
+∞ otherwise.
(ii) Let p ∈ dom J ∗ and x ∈ ∂J ∗ (p). Then, (α1 , . . . , αm ) ∈ Rm is a minimizer in Eq. (12)
if and only if it satisfies the constraints
(a) (α1 , . . . , αm ) ∈ Λm ,
m
(b) i=1 αi pi = p,
(c) αi = 0 for any i ∈ / Ix .
Having defined the initial condition J , the next step is to define a Hamiltonian H. To do
so, first denote by A(p) the set of minimizers in Eq. (12) evaluated at p ∈ dom J ∗ , i.e.,
m
A(p):= arg min αi γ i . (13)
(α ,...αm )∈Λm
1m i=1
i=1 αi pi =p
Note that the set A(p) is non-empty for every p ∈ dom J ∗ by Lemma 3.1(i). Now, we
define the Hamiltonian function H : Rn → R ∪ {+∞} by
⎧ m
⎪
⎪
⎨ inf αi θi if p ∈ dom J ∗ ,
H(p):= α∈ A(p) (14)
⎪
⎪
i=1
⎩
+∞ otherwise.
The function H defined in (14) is a polyhedral function whose properties are stated in the
following lemma.
(i) For every p ∈ dom J ∗ , the set A(p) is compact and Eq. (14) has at least one minimizer.
(ii) The restriction of H to dom J ∗ is a bounded and continuous function.
(iii) There holds H(pi ) = θi for each i ∈ {1, . . . , m}.
(A1)-(A3), and let f be the neural network defined by Eq. (8) with these parameters. Let J
and H be the functions defined in Eqs. (10) and (14), respectively, and let H̃ : Rn → R be
a continuous function. Then the following two statements hold.
(i) The neural network f is the unique uniformly continuous viscosity solution to the
first-order Hamilton–Jacobi equation
⎧
⎨ ∂f (x, t) + H(∇x f (x, t)) = 0, in Rn × (0, +∞),
∂t (15)
⎩
f (x, 0) = J (x), in Rn .
if and only if H̃(pi ) = H(pi ) for each i ∈ {1, . . . , m} and H̃(p) H(p) for every
p ∈ dom J ∗ .
Remark 1 This theorem identifies the set of HJ equations with initial data J whose solution
is given by the neural network f . To each such HJ equation, there corresponds a continuous
Hamiltonian H̃ satisfying H̃(pi ) = H(pi ) for every i = {1, . . . , m} and H̃(p) H(p) for
every p ∈ dom J ∗ . The smallest possible Hamiltonian satisfying these constraints is the
function H defined in (14), and its corresponding HJ equation is given by (15).
Example 1 In this example, we consider the HJ PDE with initial data J true (x) = x 1 and
p 2
the Hamiltonian H true (p) =− 2
2
for all x, p ∈ R . The viscosity solution to this HJ PDE
n
is given by
nt
S(x, t) = x 1 + = max {pi , x − tθi − γi } for every x ∈ Rn and t 0,
2 i∈{1,...,m}
Theorem 3.1 stipulates that S solves the HJ PDE (16) if and only if H̃(pi ) = − n2 for every
i ∈ {1, . . . , m} and H̃(p) ≥ − n2 for every p ∈ [−1, 1]n \ {pi }m
i=1 . The Hamiltonian H
true is
Example 2 In this example, we consider the case when J true (x) = x ∞ and H true (p) =
p 2
− 2 2 for every x, p ∈ Rn . Denote by ei the ith standard unit vector in Rn . Let m = 2n,
n
{pi }m
i=1 = {±e i }i=1 , θi = − 2 , and γi = 0
n for every i ∈ {1, . . . , m}. The viscosity solution S
is given by
nt
S(x, t) = x ∞ + = max {pi , x − tθi − γi } for every x ∈ Rn and t 0.
2 i∈{1,...,m}
Hence, S can be represented using the proposed neural network with parameters
{(pi , − n2 , 0)}m
i=1 . Similarly, as in the first example, we compute J and H and obtain the
following results
where Bn denotes the unit ball with respect to the l 1 norm in Rn , i.e., Bn = conv {±ei :
i ∈ {1, . . . , n}}. By Theorem 3.1, S is a viscosity solution to the HJ PDE (16) if and only
if H̃(pi ) = − n2 for every i ∈ {1, . . . , m} and H̃(p) ≥ − n2 for every p ∈ Bn \{pi }m
i=1 . The
Hamiltonian H true is one candidate satisfying these constraints.
{(pi , θi , γi )}2n n n
i=1 = {(e i , 1, 0)}i=1 ∪ {(−e i , 1, 0)}i=1 ,
(p2n+1 , θ2n+1 , γ2n+1 ) = (0, 0, 0),
1
{(pi , θi , γi )}2n+5
i=2n+2 = √ (αe 1 + βe 2 , 2, 0) : α, β ∈ {±1} ,
2
where ei is the ith standard unit vector in Rn and 0 denotes the zero vector in Rn . The
functions J and H defined by (10) and (14) coincide with the underlying true initial data
J true and Hamiltonian H true . Therefore, by Theorem 3.1, the proposed neural network
represents the viscosity solution to the HJ PDE. In other words, given the true parameters
{(pi , θi , γi )}m
i=1 , the proposed neural network solves this HJ PDE without the curse of
dimensionality. We illustrate the solution with dimension n = 16 in Fig. 2, which shows
several slices of the solution evaluated at x = (x1 , x2 , 0, . . . , 0) ∈ R16 and t = 0, 1, 2, 3 in
figures 2(A), 2(B), 2(C), 2(D), respectively. In each figure, the x and y axes correspond to
the first two components x1 and x2 in x, while the color represents the function value
S(x, t).
Remark 2 Let > 0 and consider the neural network f : Rn × [0, +∞) → R defined by
m
f (x, t):= log e(pi ,x−tθi −γi )/ (17)
i=1
20 Page 12 of 50 J. Darbon et al. Res Math Sci (2020)7:20
Fig. 2 Solution S : R16 × [0, +∞) → R to the HJ PDE in Example 3 is solved using the proposed neural
network. Several slices of the solution S evaluated at x = (x1 , x2 , 0, . . . , 0) and t = 0, 1, 2, 3 are shown in figures
2(A), 2(B), 2(C), 2(D), respectively. In each figure, the x and y axes correspond to the first two components x1
and x2 in the variable x, while the color represents the function value S(x, t)
and illustrated in Fig. 3. This neural network substitutes the non-smooth maximum acti-
vation function in the neural network f defined by Eq. (8) (and depicted in Fig. 1) with
2
a smooth log-exponential activation function. When the parameter θi = − 12 pi 2 , then
the neural network f is the unique, jointly convex and smooth solution to the following
viscous HJ PDE
⎧
⎪ ∂f (x, t) 1 2
⎪
⎪ − ∇x f (x, t)2 = Δx f (x, t) in Rn × (0, +∞),
⎨ ∂t 2 2
m (18)
⎪
⎪
⎩f (x, 0) = log
⎪ e(pi ,x−γi )/ in Rn .
i=1
This result relies on the Cole–Hopf transformation ([47], Sect. 4.4.1); see Appendix C for
the proof. While this neural network architecture represents, under certain conditions,
the solution to the viscous HJ PDE (18), we note that the particular form of the convex
initial data in the HJ PDE
(18), which effectively corresponds to a soft Legendre transform
in that lim →0 log m (p ,x −γ ) / = maxi∈{1,...,m} { pi , x − γi }, severely restricts
i=1 e
i i
>0
the practicality of this result.
J. Darbon et al. Res Math Sci (2020)7:20 Page 13 of 50 20
Fig. 3 Illustration of the structure of the neural network (17) that represents the solution to a subclass of
second-order HJ equations when θi = − 12 pi 22 for i ∈ {1, . . . , m}
⎧
⎨ ∂u (x, t) + ∇x H(u(x, t)) = 0 in R × (0, +∞),
∂t (19)
⎩
u(x, 0) = u0 (x):=∇J (x) in R,
where the flux function corresponds to the Hamiltonian H in the HJ equation. Here,
we assume that the initial data J is convex and globally Lipschitz continuous, and the
symbols ∇ and ∇x in this section correspond to derivatives in the sense of distribution if
the classical derivatives do not exist.
In this section, we show that the conservation law derived from the HJ equation (1) can
be represented by a neural network architecture. Specifically, the corresponding entropy
solution u(x, t) ≡ ∇x f (x, t) to the one-dimensional conservation law (19) can be repre-
sented using a neural network architecture with an argmax based activation function, i.e.,
The structure of this network is shown in Fig. 4. When more than one maximizer exist in
the optimization problem above, one can choose any maximizer j and define the value to
be pj . We now prove that the function ∇x f given by the neural network (20) is indeed the
entropy solution to the one-dimensional conservation law (19) with flux function H and
initial data ∇J , where H and J are defined by Eqs. (14) and (10), respectively.
Proposition 3.1 Consider the one-dimensional case, i.e., n = 1. Suppose the parameters
{(pi , θi , γi )}m
i=1 ⊂ R × R × R satisfy assumptions (A1)–(A3), and let u:=∇x f be the neural
network defined in Eq. (20) with these parameters. Let J and H be the functions defined
in Eqs. (10) and (14), respectively, and let H̃ : R → R be a locally Lipschitz continuous
function. Then, the following two statements hold.
20 Page 14 of 50 J. Darbon et al. Res Math Sci (2020)7:20
Fig. 4 Illustration of the structure of the neural network (20) that can represent the entropy solution to
one-dimensional conservation laws
(i) The neural network u is the entropy solution to the conservation law
⎧
⎨ ∂u (x, t) + ∇x H(u(x, t)) = 0 in R × (0, +∞),
∂t (21)
⎩
u(x, 0) = ∇J (x) in R.
(ii) The neural network u is the entropy solution to the conservation law
⎧
⎨ ∂u (x, t) + ∇x H̃(u(x, t)) = 0 in R × (0, +∞),
∂t (22)
⎩
u(x, 0) = ∇J (x) in R,
if and only if there exists a constant C ∈ R such that H̃(pi ) = H(pi ) + C for every
i ∈ {1, . . . , m} and H̃(p) H(p) + C for any p ∈ conv {pi }m
i=1 .
Example 4 Here, we give one example related to Example 1. Consider J true (x) = |x| and
2
H true (p) = − p2 for every x, p ∈ R. The entropy solution u to the corresponding one
dimensional conservation law is given by
⎧
⎨1 if x > 0,
u(x, t) =
⎩−1 if x < 0.
This solution u can be represented using the neural network in Fig. 4 with m = 2, p1 = 1,
p2 = −1, θ1 = θ2 = − 12 and γ1 = γ2 = 0. To be specific, we have
The initial data J and Hamiltonian H defined in Eqs. (10) and (14) are given by
By Proposition 3.1, u solves the one-dimensional conservation law (22) if and only if there
exists some constant C ∈ R such that H̃(±1) = − 12 + C and H̃(p) − 12 + C for every
p ∈ (−1, 1). Note that H true is one candidate satisfying these constraints.
4 Numerical experiments
4.1 First-order Hamilton–Jacobi equations
In this subsection, we present several numerical experiments to test the effectiveness of
the Adam optimizer using our proposed architecture (depicted in Fig. 1) for solving some
inverse problems. We focus on the following inverse problem: We are given data samples
from a function S : Rn × [0, +∞) → R that is the viscosity solution to an HJ equation (1)
with unknown convex initial data J and Hamiltonian H, which only depends on ∇x S(x, t).
Our aim is to recover the convex initial data J . We propose to learn the neural network
using machine learning techniques to recover the convex initial data J . We shall see that
this approach also provides partial information on the Hamiltonian H.
Specifically, given data samples {(xj , tj , S(xj , tj ))}N
j=1 , where {(x j , tj )}j=1 ⊂ R × [0, +∞),
N n
we train the neural network f with structure in Fig. 1 using the mean square loss function
defined by
1
N
2
l({(pi , θi , γi )}m
i=1 ) = |f (xj , tj ; {(pi , θi , γi )}m
i=1 ) − S(x j , tj )| .
N
j=1
L:= arg max{ pi , x − tθi − γi },
x∈Rn , t≥0 i∈{1,...,m}
and we finally use each effective parameter (pl , θl ) for l ∈ L to approximate the point
(pl , H(pl )) on the graph of the Hamiltonian. In practice, we approximate the set L using a
large number of points (x, t) sampled in the domain Rn × [0, +∞).
20 Page 16 of 50 J. Darbon et al. Res Math Sci (2020)7:20
By Theorem 3.1, this function S is a viscosity solution to the HJ equations whose Hamil-
tonian and initial function are the piecewise affine functions defined in Eqs. (14) and (10),
respectively. In other words, S solves the HJ equation with initial data J satisfying
where A(p) is the set of maximizers of the corresponding maximization problem in Eq.
(25). Specifically, if we construct a neural network f as shown in Fig. 1 with the underlying
parameters {(ptrue i , θi
true , γ true )}m , then the function given by the neural network is exactly
i i=1
the same as the function S. In other words, {(ptrue i , θi
true , γ true )}m is a global minimizer
i i=1
for the training problem (23) with the global minimal loss value equal to zero.
Now, we train the neural network f with training data {(xj , tj , S(xj , tj ))}N j=1 , where the
points {(xj , tj )}N j=1 are randomly sampled in R n × [0, +∞) with respect to the standard
normal distribution for each j ∈ {1, . . . , N }. (We take the absolute value for t to make sure
it is nonnegative.) Here and after, the number of training data points is N = 20,000. We
run 60,000 descent steps using the Adam optimizer to train the neural network f . The
parameters for the Adam optimizer are chosen to be β1 = 0.5, β2 = 0.9, the learning rate
is 10−4 and the batch size is 500.
To measure the performance of the training process, we compute the relative mean
square errors of the sorted parameters in the trained neural network, denoted by
{(pi , θi , γi )}m
i=1 , and the sorted underlying true parameters {(pi , θi
true true , γ true )}m . To be
i i=1
specific, the errors are computed as follows
J. Darbon et al. Res Math Sci (2020)7:20 Page 17 of 50 20
Table 2 Relative mean square errors of the parameters in the neural network f with 2
neurons in different cases and different dimensions averaged over 100 repeated
experiments
# Case Case 1 Case 2 Case 3 Case 4
Averaged Relative Errors of {pi } 2D 4.10E−03 2.10E−03 3.84E−03 2.82E−03
4D 1.41E−09 1.20E−09 1.38E−09 1.29E−09
8D 1.14E−09 1.03E−09 1.09E−09 1.20E−09
16D 1.14E−09 6.68E−03 1.23E−09 7.74E−03
32D 1.49E−09 3.73E−01 1.46E−03 4.00E−01
Averaged Relative Errors of {θi } 2D 4.82E−02 7.31E−02 1.17E−01 1.79E−01
4D 3.47E−10 2.82E−10 1.15E−09 1.15E−09
8D 1.47E−10 1.08E−10 2.10E−10 2.25E−10
16D 5.44E−11 1.69E−03 4.75E−11 4.12E−03
32D 3.61E−11 3.27E−01 6.42E−03 2.39E−01
Averaged Relative Errors of {γi } 2D 1.35E−02 1.01E−01 1.33E−02 9.24E−02
4D 3.71E−10 1.24E−09 3.67E−10 1.10E−09
8D 2.91E−10 1.74E−10 2.82E−10 2.01E−10
16D 2.80E−10 2.08E−04 3.10E−10 3.20E−04
32D 3.56E−10 1.88E−02 1.56E−01 3.62E−02
m
i=1 pi − ptrue
i
2
2
relative mean square error of {pi } = m ,
i=1 ptrue
i
2
2
m
i=1 |θi − θi |2
true
relative mean square error of {θi } = m ,
i=1 |θi |2
true
m
i=1 |γi − γi
true |2
relative mean square error of {γi } = m .
i=1 |γi
true |2
For the cases when the denominator m i=1 |γi
true |2 is zero, such as Case 1 and Case 3, we
1 m
i=1 |γi − γi
measure the absolute mean square error m true |2 instead.
We test Cases 1–4 on the neural networks with 2 and 4 neurons, i.e., we set m = 2, 4
and repeat the experiments 100 times. We then compute the relative mean square errors
in each experiment and take the average. The averaged relative mean square errors are
shown in Tables 2 and 3, respectively. From the error tables, we observe that the training
process performs pretty well and gives errors below 10−8 in some cases when m = 2.
However, for the case when m = 4, we do not obtain the global minimizers and the error
is above 10−3 . Therefore, there is no guarantee for the performance of the Adam optimizer
in this training problem and it may be related to the complexity of the solution S to the
underlying HJ equation.
The solution to each of the two corresponding HJ equations can be represented using the
Hopf formula [70] and reads
20 Page 18 of 50 J. Darbon et al. Res Math Sci (2020)7:20
Table 3 Relative mean square errors of the parameters in the neural network f with 4
neurons in different cases and different dimensions averaged over 100 repeated
experiments
# Case Case 1 Case 2 Case 3 Case 4
Averaged Relative Errors of {pi } 2D 3.12E−01 2.21E−01 2.85E−01 2.14E−01
4D 7.82E−02 6.12E−02 7.92E−02 4.30E−02
8D 2.62E−02 4.31E−03 4.02E−02 7.82E−03
16D 2.88E−02 3.64E−02 4.35E−02 1.73E−02
32D 1.42E−02 3.72E−01 1.42E−01 5.04E−01
Averaged Relative Errors of {θi } 2D 2.59E−01 3.68E−01 4.82E−01 1.34E+00
4D 6.07E−02 8.37E−02 9.47E−02 1.23E−01
8D 1.04E−02 8.48E−03 1.41E−02 1.31E−02
16D 2.66E−03 2.53E−02 7.80E−03 1.90E−02
32D 8.09E−04 4.41E−01 1.81E−02 3.66E−01
Averaged Relative Errors of {γi } 2D 1.01E−02 3.19E−01 1.51E−02 2.65E−01
4D 6.72E−03 1.79E−02 1.03E−02 1.30E−02
8D 3.22E−03 2.34E−03 3.93E−03 2.65E−03
16D 9.48E−03 3.70E−03 1.92E−02 1.94E−03
32D 1.33E−02 5.35E−02 4.73E−01 1.17E−01
We train the neural network f using the same procedure as in the previous subsection
and obtain the function J̃ (see Eq. (24)) and the parameters {(pl , θl )}l∈L associated with
the effective neurons. We compute the relative mean square error of J̃ and {(pl , θl )}l∈L as
follows:
N test
j=1 |J̃ (xtest
i ) − J (x i )|
test 2
relative error of J̃ := N test ,
j=1 |J (x i )|
test 2
l∈L |θl − H(pl )|
2
relative error of {(pl , θl )}l := ,
l∈L |H(pl )|
2
where {xtest
i } are randomly sampled with respect to the standard normal distribution in
Rn and there are in total N test = 2,000 testing data points. We repeat the experiments 100
times. The corresponding averaged errors in the two examples are listed in Tables 4 and
5, respectively.
In the first example, we have H(p) = − 12 p 22 and J (x) = x 1 . According to Theorem
3.1, the solution S can be represented without error by the neural network in Fig. 1 with
parameters
n
(p, θ, γ ) ∈ Rn × R × R : p(i) ∈ {±1}, for i ∈ {1, . . . , n}, θ = , γ = 0 , (26)
2
where p(i) denotes the ith entry of the vector p. In other words, the global minimal loss
value in the training problem is theoretically guaranteed to be zero. From the numerical
errors in Table 4, we observe that in low dimension such as 1D and 2D, the errors of the
initial function are small. However, in most cases, the errors of the parameters are pretty
large. In the case of n dimension, the viscosity solution can be represented using the 2n
parameters in Eq. (26). However, the number of effective neurons are larger than 2n in all
J. Darbon et al. Res Math Sci (2020)7:20 Page 19 of 50 20
Table 4 Relative mean square errors of J̃ and {(pl , θl )} for the inverse problems of the
first-order HJ equations in different dimensions with J = · 1 and H = − 12 · 22 , averaged
over 100 repeated experiments
# Neurons 64 128 256 512 1024
Averaged Relative Errors of J̃ 1D 2.29E−07 2.20E−07 2.12E−07 2.14E−07 1.82E−07
2D 1.49E−06 1.27E−06 1.16E−06 1.01E−06 9.25E−07
4D 6.27E−04 1.81E−04 5.93E−05 1.69E−06 3.44E−07
8D 1.27E−02 1.10E−02 1.03E−02 9.92E−03 9.73E−03
16D 5.69E−02 5.83E−02 5.96E−02 5.99E−02 6.01E−02
Averaged Relative Errors of {(pl , θl )} 1D 2.58E−01 1.29E−01 7.05E−02 3.56E−02 1.72E−02
2D 4.77E−02 3.28E−02 2.03E−02 1.03E−02 6.53E−03
4D 9.36E−03 4.09E−03 1.58E−03 5.31E−04 1.73E−04
8D 3.75E−02 3.39E−02 3.25E−02 2.78E−02 2.60E−02
16D 5.30E−01 5.40E−01 5.43E−01 5.43E−01 5.42E−01
Averaged Number of Effective Neurons 1D 4.45 4.37 4.18 3.92 3.55
2D 8.84 8.59 7.87 7.1 6.3
4D 20.04 20.62 19.52 18.3 17.06
8D 36.97 43.91 47.84 49.19 50.03
16D 48.2 59.53 64.85 65.79 64.84
Table 5 Relative mean square errors of J̃ and {(pl , θl )} for the inverse problems of the
first-order HJ equations in different dimensions with J = · 1 and H = · 22 /2, averaged
over 100 repeated experiments
# Neurons 64 128 256 512 1024
Averaged Relative Errors of J̃ 1D 5.23E−08 2.45E−08 1.96E−08 1.77E−08 1.77E−08
2D 1.75E−05 1.67E−05 1.77E−05 1.85E−05 1.91E−05
4D 5.82E−04 4.94E−04 5.28E−04 5.76E−04 6.16E−04
8D 1.54E−02 1.40E−02 1.35E−02 1.33E−02 1.32E−02
16D 4.19E−02 4.33E−02 4.43E−02 4.46E−02 4.49E−02
Averaged Relative Errors of {(pl , θl )} 1D 3.25E−02 1.93E−02 1.24E−02 5.62E−03 2.92E−03
2D 8.30E−03 7.08E−03 5.78E−03 4.25E−03 3.47E−03
4D 2.41E−02 2.41E−02 2.51E−02 2.65E−02 2.82E−02
8D 7.33E−02 7.32E−02 7.25E−02 7.15E−02 7.08E−02
16D 3.85E−01 3.90E−01 3.92E−01 3.92E−01 3.91E−01
Averaged Number of Effective Neurons 1D 20.26 26.94 32.26 36.02 38.61
2D 32.74 48.05 65.7 84.87 99.83
4D 46.69 72.3 103.71 147.41 198.27
8D 55.55 82.04 95.46 90.82 82.5
16D 61.51 99.63 119.95 118.89 109.1
cases, which also implies that the Adam optimizer does not find the global minimizers in
this example.
In the second example, the solution S cannot be represented using our proposed neural
network without error. Hence, the results describe the approximation of the solution S
by the neural network. From Table 5, we observe that the errors become larger when
the dimension increases. For this example, the number of effective neurons should be m
where m is the number of neurons used in the architecture. Table 5 shows that the average
number of effective neurons is below this optimal number. Therefore, this implies that
the Adam optimizer does not find the global minimizers in this example either.
20 Page 20 of 50 J. Darbon et al. Res Math Sci (2020)7:20
In conclusion, these numerical experiments suggest that recovering initial data from
data samples using our proposed neural network architecture with the Adam optimizer
is unsatisfactory for solving these inverse problems. In particular, Adam optimizer is not
always able to find a global minimizer when the solution can be represented without error
using our network architecture.
1. H(p) = − 12 p2 and J (x) = |x| for p, x ∈ R. The initial condition u0 is then given by
⎧
⎨1, x > 0,
u0 (x) =
⎩−1, x < 0.
2. H(p) = 12 p2 and J (x) = |x| for p, x ∈ R. Hence, the initial function u0 is the same as
in example 1.
In the first example, the entropy solution u only takes values in the finite set {±1}, and it
can be represented by the neural network ∇x f without error by Prop. 3.1. However, in the
second example, the solution u takes values in the infinite set [−1, 1]; hence, the neural
network ∇x f is only an approximation of the corresponding solution u.
To show the representability of the neural network, in each example, we choose the
parameters {pi }mi=1 to be the uniform grid points in [−1, 1], i.e.,
2(i − 1)
pi = −1 + for i ∈ {1, . . . , m}.
m−1
We set θi = H(pi ) and γi = J ∗ (pi ) for each i ∈ {1, . . . , m}, where J ∗ is the Fenchel–
Legendre transform of the antiderivative of the initial function u0 . Hence, in these two
examples, γi equals for each i. Figures 5 and 6 show the neural network ∇x f and the true
entropy solution u in these two examples at time t = 1. As expected, the error in Fig. 5
for example 1 is negligible. For example 2, we consider neural networks with 32 and 128
neurons whose graphs are plotted in Figs. 6a and 6b, respectively. We observe in these
figures that the error of the neural networks with the specific parameters decreases as the
number of neurons increases. In conclusion, the neural network ∇x f with the architecture
in Fig. 4 can represent the solution to the one-dimensional conservation laws given in
Eq. (19) pretty well. In fact, because of the discontinuity of the activation function, the
proposed neural network ∇x f has advantages in representing the discontinuity in solution
such as shocks, but it requires more neurons when approximating non-constant smooth
parts of the solution.
5 Conclusion
Summary of the proposed work In this paper, we have established novel mathematical
connections between some classes of HJ PDEs with convex initial data and neural net-
J. Darbon et al. Res Math Sci (2020)7:20 Page 21 of 50 20
Fig. 5 Plot of the function represented by the neural network ∇x f at time t = 1 with 64 neurons whose
parameters are defined using H and J ∗ in example 1. The function given by the neural network is plotted in
orange and the true solution is plotted in blue
Fig. 6 Plot of the function represented by the neural network ∇x f at time t = 1 with 32 and 128 neurons
whose parameters are defined using H and J ∗ in example 2. The function given by the neural network is
plotted in orange and the true solution is plotted in blue. The neural network with 32 neurons is shown on
the left, while the neural network with 128 neurons is shown on the right
work architectures. Our main results give conditions under which for initial data which
takes a particular form. These results do not rely on universal approximation properties
of neural networks; rather, our results show that some neural networks correspond to
representation formulas of solutions to HJ PDEs whose Hamiltonians and convex initial
data are obtained from the parameters of the neural network. This means that some neural
network architectures naturally encode the physics contained in some HJ PDEs satisfying
the conditions in Theorem 3.1.
The first neural network architecture that we have proposed is depicted in Fig. 1. We
have shown in Theorem 3.1 that under certain conditions on the parameters, this neural
network architecture represents the viscosity solution of the HJ PDEs (16) for initial data
which takes a particular form. The corresponding Hamiltonian and convex initial data
can be recovered from the parameters of this neural network. As a corollary of this result
for the one-dimensional case, we have proposed a second neural network architecture
(depicted in Fig. 4) that represents the spatial gradient of the viscosity solution of the
20 Page 22 of 50 J. Darbon et al. Res Math Sci (2020)7:20
HJ PDEs (1) (in one dimension), and we have shown in Prop. 3.1 that under appropriate
conditions on the parameters, this neural network corresponds to entropy solutions of
the conservation laws (22).
Let us emphasize that the neural network architecture depicted in Fig. 1 that represents
solutions to the HJ PDEs (16) allows us to numerically evaluate these solutions in high
dimension without using grids or numerical approximations. Our work also paves the way
to leverage efficient technologies and hardware developed for neural networks to compute
efficiently solutions to certain HJ PDEs.
We have also tested the performance of the state-of-the-art Adam optimizer using our
proposed neural network architecture (depicted in Fig. 1) on some inverse problems. Our
numerical experiments in Sect. 4 show that these problems cannot generally be solved
with the Adam optimizer with high accuracy. These numerical results suggest further
developments of efficient neural network training algorithms for solving inverse problems
with our proposed neural network architectures.
Perspectives on other neural network architectures and HJ PDEs We now present exten-
sions of the proposed architectures that are viable candidates for representing solutions
of HJ PDEs.
First consider the following multi-time HJ PDE [12,27,39,105,119,125,132,141] which
reads
⎧
⎪ ∂S
⎪
⎪ (x, t1 , . . . , tN ) + Hj (∇x S(x, t1 , . . . , tN ))
⎪
⎨ ∂tj
(27)
⎪
⎪ = 0 for each j ∈ {1, . . . , N } in Rn × (0, +∞)N ,
⎪
⎪
⎩S(x, 0, . . . , 0) = J (x) in Rn .
∗ ⎧ ⎫
N ⎨
N ⎬
S(x, t1 , . . . , tN ) = ti Hi + J ∗ (x) = sup p, x − tj Hj (p) − J ∗ (p) ,
i=1 p∈Rn ⎩ j=1
⎭
(28)
Hopf formula (28) suggests that the neural network architecture depicted in Fig. 7 is a
good candidate for representing the solution to (27) under appropriate conditions on the
parameters of the network.
As mentioned in [105], the multi-time HJ equation (27) may not have viscosity solutions.
However, under suitable assumptions [12,27,39,119], the generalized Hopf formula (28)
is a viscosity solution of the multi-time HJ equation. We intend to clarify the connec-
tions between the generalized Hopf formula, multi-time HJ PDEs, viscosity solutions, and
general solutions in a future work.
J. Darbon et al. Res Math Sci (2020)7:20 Page 23 of 50 20
Fig. 7 Illustration of the structure of the neural network (29) that can represent solutions to some first-order
multi-time HJ equations
In [38,39], it is shown that when the Hamiltonian H and the initial data J are both
convex, and under appropriate assumptions, the solution S to the following HJ PDE
⎧
⎨ ∂S (x, t) + H(∇x S(x, t)) = 0 in Rn × (0, +∞),
∂t
⎩
S(x, 0) = J (x) in Rn ,
is represented by the Hopf [70] and Lax–Oleinik formulas [47, Sect. 10.3.4]. These for-
mulas read
Let p(x, t) be the maximizer in the Hopf formula and u(x, t) be the minimizer in the
Lax–Oleinik formula. Then, they satisfy the following relation [38,39]
Figure 8 depicts an architecture of a neural network that implements the formula above
for the minimizer u(x, t). In other words, we consider the ResNet-type neural network
defined by
Note that this proposed neural network suggests an interpretation of some ResNet archi-
tecture (for details on the ResNet architecture, see [66]) in terms of HJ PDEs. The activa-
tion functions of the proposed ResNet architecture are a composition of an argmax-based
function and t∇H, where H is the Hamiltonian in the corresponding HJ equation. More-
over, when the time variable is fixed, the input x and the output u are in the same space Rn ;
hence, one can chain the ResNet structure in Fig. 8 to obtain a deep neural network archi-
tecture by specifying a sequence of time variables t1 , t2 , . . . , tN . The deep neural network
is given by
Fig. 8 Illustration of the structure of the ResNet-type neural network (30) that can represent the minimizer u
in the Lax–Oleinik formula. Note that the activation function is defined using the gradient of the Hamiltonian
H, i.e., ∇H
Fig. 9 Illustration of the structure of the ResNet-type deep neural network (31) that can represent the
minimizers in the generalized Lax–Oleinik formula for the multi-time HJ PDEs. Note that the activation
function in the k th layer is defined using the gradient of one Hamiltonian Hk , i.e., ∇Hk . This figure only depicts
two layers
where u0 = x and pkjk is the output of the argmax based activation function in the k th layer.
For the case when N = 2, an illustration of this deep ResNet architecture with two layers
is shown in Fig. 9. In fact, this deep ResNet architecture can be formulated as follows
N
uN = x − tk ∇H(pkjk ).
k=1
This formulation suggests that this architecture should also provide the minimizers of
the generalized Lax–Oleinik formula for the multi-time HJ PDEs [39]. These ideas and
perspectives will be presented in detail in a forthcoming paper.
Applications of these neural architectures that can represent viscosity solutions of cer-
tain HJ PDEs to certain optimal control problems will be presented elsewhere.
Conflict of interest
The authors declare that they have no conflict of interest.
∗
conv {pi }mi=1 [133, Thms. 10.2 and 20.5], and moreover, the subdifferential ∂J (p) is
non-empty by [133, Thm. 23.10].
Proof of (ii): First, suppose the vector (α1 , . . . , αm ) ∈ Rm satisfies the constraints (a)–
(c). Since x ∈ ∂J ∗ (p), there holds J ∗ (p) = p, x − J (x) [68, Cor. X.1.4.4], and using the
definition of the set Ix (11) and constraints (a)–(c) we deduce that
J ∗ (p) = p, x − J (x) = p, x − αi J (x)
i∈Ix
= p, x − αi (pi , x − γi )
i∈Ix
m
= p− αi p i , x + αi γ i = αi γ i .
i∈Ix i∈Ix i=1
Therefore, the vector (δ1k , . . . , δmk ) is a minimizer in Eq. (12) at the point pk , and J ∗ (pk ) =
γk follows.
m
−∞ < min θi αi θi max θi < +∞
i={1,...,m} i={1,...,m}
i=1
from which we conclude that H is a bounded function on dom J ∗ . Since the target function
in the minimization problem (14) is continuous, existence of a minimizer follows by
compactness of A(p).
Proof of (ii): We have already shown in the proof of (i) that the restriction of H to
dom J ∗ is bounded, and so it remains to prove its continuity. For any p ∈ dom J ∗ , we
20 Page 26 of 50 J. Darbon et al. Res Math Sci (2020)7:20
m
have that (α1 , . . . , αm ) ∈ A(p) if and only if (α1 , . . . , αm ) ∈ Λm , i=1 αi pi = p, and
m ∗
i=1 αi γi = J (p). As a result, we have
m
m
m
H(p) = min αi θi : (α1 , . . . , αm ) ∈ Λm , αi pi = p, αi γi = J ∗ (p) . (32)
i=1 i=1 i=1
for any p ∈ Rn and r ∈ R. Using the same argument as in the proof of Lemma 3.1(i), we
conclude that h is a convex lower semicontinuous function, and in fact continuous over
its domain dom h = conv {(pi , γi )}m i=1 . Comparing Eq. (32) and the definition of h in (33),
we deduce that H(p) = h(p, J (p)) for any p ∈ dom J ∗ . Continuity of H in dom J ∗ then
∗
m
H(pk ) δik θi = θk . (34)
i=1
On the other hand, let (α1 , . . . , αm ) ∈ A(pk ) be a vector different from (δk1 , . . . , δkm ).
m ∗
Then, (α1 , . . . , αm ) ∈ Λm satisfies m i=1 αi pi = p, i=1 αi γi = J (p), and αk < 1. Define
(β1 , . . . , βm ) ∈ Λm by
⎧ αj
⎨ if j = k,
βj := 1 − αk
⎩
0 if j = k.
In other words, Eq. (9) holds at index k, which, by assumption (A3), implies that
i =k βi θi > θk . As a result, we have
m
m
αi θi = αk θk + (1 − αk ) βi θi > αk θk + (1 − αk )θk = θk = δik θi .
i=1 i =k i=1
Taken together with Eq. (34), we conclude that (δ1k , . . . , δmk ) is the unique minimizer in
(14), and hence, we obtain H(pk ) = θk .
J. Darbon et al. Res Math Sci (2020)7:20 Page 27 of 50 20
(A1)-(A3). Let J and H be the functions defined in Eqs. (10) and (14), respectively. Let
H̃ : Rn → R be a continuous function satisfying H̃(pi ) = H(pi ) for each i ∈ {1, . . . , m} and
H̃(p) H(p) for all p ∈ dom J ∗ . Then, the neural network f defined in Eq. (8) satisfies
Proof Let x ∈ Rn and t 0. Since H̃(p) H(p) for every p ∈ dom J ∗ , we get
Let (α1 , . . . , αm ) be a minimizer in (14). By Eqs. (12), (13), and (14), we have
m
m
m
p= αi p i , H(p) = αi θi , and J ∗ (p) = αi γ i . (37)
i=1 i=1 i=1
m
p, x − t H̃(p) − J ∗ (p) αi (pi , x − tθi − γi )
i=1
where the second inequality follows from the constraint (α1 , . . . , αm ) ∈ Λm . Since p ∈
dom J ∗ is arbitrary, we obtain
where the inequality holds since pi ∈ dom J ∗ for every i ∈ {1, . . . , m}. The conclusion
then follows from Eqs. (38) and (39).
20 Page 28 of 50 J. Darbon et al. Res Math Sci (2020)7:20
(A1)-(A3). For every k ∈ {1, . . . , m}, there exist x ∈ Rn and t > 0 such that f (·, t) is
differentiable at x and ∇x f (x, t) = pk .
Proof Since f is the supremum of a finite number of affine functions by definition (8),
it is finite-valued and convex for t 0. As a result, ∇x f (x, t) = pk is equivalent to
∂(f (·, t))(x) = {pk }, and so it suffices to prove that ∂(f (·, t))(x) = {pk } for some x ∈ Rn
and t > 0. To simplify the notation, we use ∂x f (x, t) to denote the subdifferential of f (·, t)
at x.
By [67, Thm. VI.4.4.2], the subdifferential of f (·, t) at x is the convex hull of the pi ’s
whose indices i’s are maximizers in (8), that is,
First, consider the case when there exists x ∈ Rn such that pk , x − γk > pi , x − γi for
every i = k. In that case, by continuity, there exists small t > 0 such that pk , x−tθk −γk >
pi , x − tθi − γi for every i = k and so (40) holds.
Now, consider the case when there does not exist x ∈ Rn such that pk , x − γk >
maxi=k {pi , x − γi }. In other words, we assume
Let x0 ∈ ∂J ∗ (pk ). Denote by Ix0 the set of maximizers in Eq. (41) at the point x0 , i.e.,
Hence, the point pk is in the domain of the polytopal convex function co h. Then, [133,
Thm. 23.10] implies ∂(co h)(pk ) = ∅. Let v 0 ∈ ∂(co h)(pk ) and x = x0 + tv 0 . It remains
to choose suitable positive t such that (40) holds. Letting x = x0 + tv 0 in (40) yields
Now, we consider two situations, the first when i ∈ / Ix0 ∪ {k} and the second when i ∈ Ix0 .
It suffices to prove (40) hold in each case for small enough positive t.
If i ∈
/ Ix0 ∪ {k}, then i is not a maximizer in Eq. (41) at the point x0 . By (45), pk is a convex
combination of the set {pi : i ∈ Ix0 }. In other words, there exists (c1 , . . . , cm ) ∈ Λm such
that m j=1 cj pj = pk and cj = 0 whenever j ∈ / Ix0 . Taken together with assumption (A2)
and Eqs. (10), (41), (43), we have
⎛ ⎞
J (x0 ) pk , x0 − γk = pk , x0 − g(pk ) = cj pj , x0 − g ⎝ cj pj ⎠
j∈Ix0 j∈Ix0
cj (pj , x0 − g(pj )) = cj J (x0 ) = J (x0 ).
j∈Ix0 j∈Ix0
Thus, the inequalities become equalities in the equation above. As a result, we have
where the inequality holds because i ∈ / Ix0 ∪ {k} by assumption. This inequality implies
that the constant pk , x0 − γk − (pi , x0 − γi ) is positive, and taken together with (46),
we conclude that the inequality in (40) holds for i ∈ / Ix0 ∪ {k} when t is small enough.
If i ∈ Ix0 , then both i and k are maximizers in Eq. (10) at x0 , and hence, we have
Together with Eq. (46) and the definition of h in Eq. (44), we obtain
To prove the result, it suffices to show co h(pk ) > θk . As pk ∈ co h (as shown before in
Eq. (45)), then according to [68, Prop. X.1.5.4] we have
co h(pk ) = αj h(pj ) = αj θj , (51)
j∈Ix0 j∈Ix0
20 Page 30 of 50 J. Darbon et al. Res Math Sci (2020)7:20
for some (α1 , . . . , αm ) ∈ Λm satisfying pk = m j=1 αj pj and αj = 0 whenever j ∈
/ Ix0 . Then,
by Lemma 3.1(ii) (α1 , . . . , αm ) is a minimizer in Eq. (42), that is,
m
γk = J ∗ (pk ) = αj γ j = αi γ i = αi γ i .
j=1 j∈Ix0 i =k
Hence, Eq. (9) holds for the index k. By assumption (A3), we have θk < j=k αj θj . Taken
together with the fact that αj = 0 whenever j ∈
/ Ix0 and Eq. (51), we find
θk < αj θj = αj θj = co h(pk ). (52)
j =k j∈Ix0
Hence, the right-hand side of Eq. (50) is strictly positive, and we conclude that pk , x −
tθk − γk > pi , x − tθi − γi for t > 0 if i ∈ Ix0 .
Therefore, in this case, when t > 0 is small enough and x is chosen as above, we have
pk , x − tθk − γk > pi , x − tθi − γi for every i = k, and the proof is complete.
m
co F (p, E − ) = inf ci γi , (54)
(c1 ,...,cm )∈C(p,E − )
i=1
Proof First, we compute the convex hull of epi F , which we denote by co (epi F ). Let
(p, E − , r) ∈ co (epi F ), where p ∈ Rn and E − , r ∈ R. Then there exist k ∈ N, (β1 , . . . , βk ) ∈
Λk and (q i , Ei− , ri ) ∈ epi F for each i ∈ {1, . . . , k} such that (p, E − , r) = ki=1 βi (q i , Ei− , ri ).
By definition of F in Eq. (53), (q i , Ei− , ri ) ∈ epi F holds if and only if q i ∈ dom J ∗ , Ei− +
H(q i ) 0 and ri J ∗ (q i ). In conclusion, we have
⎧
⎪
⎪(β1 , . . . , βk ) ∈ Λk ,
⎪
⎪
⎪
⎪ −
k −
⎪
⎨(p, E , r) = i=1 βi (q i , Ei , ri ),
⎪
q 1 , . . . , q k ∈ dom J ∗ , (55)
⎪
⎪
⎪
⎪ −
⎪Ei + H(q i ) 0 for each i ∈ {1, . . . , k},
⎪
⎪
⎪
⎩r J ∗ (q ) for each i ∈ {1, . . . , k}.
i i
J. Darbon et al. Res Math Sci (2020)7:20 Page 31 of 50 20
For each i, since we have q i ∈ dom J ∗ , by Lemma 3.2(i) the minimization problem in (14)
evaluated at q i has at least one minimizer. Let (αi1 , . . . , αim ) be such a minimizer. Using
Eqs. (12), (14), and (αi1 , . . . , αim ) ∈ Λm , we have
m
αij (1, pj , θj , γj ) = (1, q i , H(q i ), J ∗ (q i )). (56)
j=1
Define the real number cj := ki=1 βi αij for any j ∈ {1, . . . , m}. Combining Eqs. (55) and
(56), we get that cj 0 for any j and
m
m
k
cj (1, pj , θj , γj ) = βi αij (1, pj , θj , γj )
j=1 j=1 i=1
⎛ ⎞
k m
k
= βi ⎝ αij (1, pj , θj , γj )⎠ = βi (1, q i , H(q i ), J ∗ (q i )).
i=1 j=1 i=1
m
k
cj (1, pj ) = βi (1, q i ) = (1, p);
j=1 i=1
m
k
k
cj θj = βi H(q i ) − βi Ei− = −E − ;
j=1 i=1 i=1
m
k
k
cj γj = βi J ∗ (q i ) βi ri = r.
j=1 i=1 i=1
m
m
m
p= cj pj , E − − cj θj , r cj γj . .
j=1 j=1 j=1
(57)
∂f
(x, t) = min −H(p) : p is a maximizer in Eq. (59) .
∂t
Since x and t satisfy ∂x f (x, t) = {pk }, [67, Thm. VI.4.4.2] implies that the only maximizer
in Eq. (59) is pk . As a result, there holds
∂f
(x, t) = −H(pk ). (60)
∂t
In other words, the subdifferential ∂f (x, t) contains only one element, and therefore, f is
differentiable at (x, t) and its gradient equals (pk , −H(pk )) [133, Thm. 21.5]. Using (16)
and (60), we obtain
∂f
0= (x, t) + H̃(∇x f (x, t)) = −H(pk ) + H̃(pk ).
∂t
As k ∈ {1, . . . , m} is arbitrary, we find that H(pk ) = H̃(pk ) for every k ∈ {1, . . . , m}.
Next, we prove by contradiction that H̃(p) H(p) for every p ∈ dom J ∗ . It is enough
to prove the property only for every p ∈ ri dom J ∗ by continuity of both H̃ and H (where
J. Darbon et al. Res Math Sci (2020)7:20 Page 33 of 50 20
continuity of H is proved in Lemma 3.2(ii)). Assume H̃(p) < H(p) for some p ∈ ri dom J ∗ .
Define two functions F and F̃ from Rn × R to R ∪ {+∞} by
J ∗ (q) if E − + H(q) 0, J ∗ (q) if E − + H̃(q) 0,
F (q, E − ):= and F̃ (q, E − ):=
+∞ otherwise. +∞ otherwise.
(61)
m
co F (q, E − ) = inf ci γi , where C is defined by
(c1 ,...,cm )∈C(q,E − )
i=1
(62)
m
m
− −
C(q, E ):= (c1 , . . . , cm ) ∈ Λm : ci pi = q, ci θi −E .
i=1 i=1
Let E1− ∈ −H(p), −H̃(p) . Now, we want to prove that co F (p, E1− ) J ∗ (p); this
inequality will lead to a contradiction with the definition of H.
Using statement (i) of this theorem and the supposition that f is the unique viscosity
solution to the HJ equation (16), we have that
f = F ∗ = F̃ ∗ , which implies f ∗ = co F = co F̃ .
%
f ∗ p, −H̃(p) F̃ p, −H̃(p) = J ∗ (p) and {p} × −∞, −H̃(p) ⊆ dom F̃ ⊆ dom f ∗ .
(63)
Recall that p ∈ ri dom J ∗ and E1− < −H̃(p), so that (p, E1− ) ∈ ri dom f ∗ . As a result, we
get
p, αE1− + (1 − α)(−H̃(p)) ∈ ri dom f ∗ for all α ∈ (0, 1). (64)
Note that co F (p, ·) is monotone non-decreasing. Indeed, if E2− is a real number such that
E2− > E1− , by the definition of the set C in Eq. (62) there holds C(p, E2− ) ⊆ C(p, E1− ), which
20 Page 34 of 50 J. Darbon et al. Res Math Sci (2020)7:20
implies co F (p, E2− ) co F (p, E1− ). Recalling that E1− < −H̃(p), monotonicity of co F (p, ·)
and Eq. (65) imply
f ∗ p, −H̃ (p) lim co F p, αE1− + (1 − α)E1− = co F (p, E1− ). (66)
α→0
0<α<1
As a result, the set C(p, E1− ) is non-empty. Since it is also compact, there exists a min-
imizer in Eq. (62) evaluated at the point (p, E1− ). Let (c1 , . . . , cm ) be such a minimizer. By
Eqs. (62) and (67) and the assumption that E1− ∈ −H(p), −H̃(p) , there holds
(c_1, \dots, c_m) \in \Lambda_m, \qquad
\sum_{i=1}^{m} c_i p_i = p, \qquad
\sum_{i=1}^{m} c_i \gamma_i = \operatorname{co} F(p, E_1^-) \leq J^*(p), \qquad
\sum_{i=1}^{m} c_i \theta_i \leq -E_1^- < H(p). \quad (68)
Comparing the first three statements in Eq. (68) and the formula of J ∗ in Eq. (12), we
deduce that (c1 , . . . , cm ) is a minimizer in Eq. (12), i.e., (c1 , . . . , cm ) ∈ A(p). By definition
of H in Eq. (14), we have
H(p) = \inf_{\alpha \in A(p)} \sum_{i=1}^{m} \alpha_i \theta_i \leq \sum_{i=1}^{m} c_i \theta_i,

which contradicts the last inequality in Eq. (68). Therefore, we conclude that H̃(p) ≥ H(p) for any p ∈ ri dom J ∗ and the proof is finished.
C Connections between the neural network (17) and the viscous HJ PDE (18)
Let f_ε be the neural network defined by Eq. (17) with parameters {(p_i, θ_i, γ_i)}_{i=1}^m and ε > 0, which is illustrated in Fig. 3. We will show in this appendix that when the parameter θ_i = −\frac{1}{2}\|p_i\|_2^2 for i ∈ {1, . . . , m}, then the neural network f_ε corresponds to the unique, jointly convex smooth solution to the viscous HJ PDE (18). This result will follow immediately from the following lemma.
Lemma C.1 Let ε > 0 and let {(p_i, γ_i)}_{i=1}^m ⊂ R^n × R. Then the function w_ε defined by

w_\varepsilon(x, t) := \sum_{i=1}^{m} e^{\left(\langle p_i, x\rangle + \frac{t}{2}\|p_i\|_2^2 - \gamma_i\right)/\varepsilon} \quad (69)

is the unique, jointly log-convex and smooth solution to the Cauchy problem

\begin{cases} \dfrac{\partial w_\varepsilon}{\partial t}(x, t) = \dfrac{\varepsilon}{2} \Delta_x w_\varepsilon(x, t) & \text{in } \mathbb{R}^n \times (0, +\infty), \\[4pt] w_\varepsilon(x, 0) = \displaystyle\sum_{i=1}^{m} e^{(\langle p_i, x\rangle - \gamma_i)/\varepsilon} & \text{in } \mathbb{R}^n. \end{cases} \quad (70)
Proof A short calculation shows that the function w_ε defined in Eq. (69) solves the Cauchy problem (70), and uniqueness holds by strict positivity of the initial data (see [147, Chap. VIII, Thm. 2.2] and note that the uniqueness result can easily be generalized to n > 1). Now, let λ ∈ [0, 1] and (x_1, t_1) and (x_2, t_2) be such that x = λx_1 + (1 − λ)x_2 and t = λt_1 + (1 − λ)t_2. Then, Hölder's inequality (see, e.g., [57, Thm. 6.2]) implies
\sum_{i=1}^{m} e^{\left(\langle p_i, x\rangle + \frac{t}{2}\|p_i\|_2^2 - \gamma_i\right)/\varepsilon}
= \sum_{i=1}^{m} e^{\lambda\left(\langle p_i, x_1\rangle + \frac{t_1}{2}\|p_i\|_2^2 - \gamma_i\right)/\varepsilon}\, e^{(1-\lambda)\left(\langle p_i, x_2\rangle + \frac{t_2}{2}\|p_i\|_2^2 - \gamma_i\right)/\varepsilon}
\leq \left( \sum_{i=1}^{m} e^{\left(\langle p_i, x_1\rangle + \frac{t_1}{2}\|p_i\|_2^2 - \gamma_i\right)/\varepsilon} \right)^{\lambda} \left( \sum_{i=1}^{m} e^{\left(\langle p_i, x_2\rangle + \frac{t_2}{2}\|p_i\|_2^2 - \gamma_i\right)/\varepsilon} \right)^{1-\lambda},

and we find w_ε(x, t) ≤ (w_ε(x_1, t_1))^λ (w_ε(x_2, t_2))^{1−λ}, which implies that w_ε is jointly log-convex in (x, t).
Thanks to Lemma C.1 and the Cole–Hopf transformation f_ε(x, t) = ε log(w_ε(x, t)) (see, e.g., [47, Sect. 4.4.1]), a short calculation immediately implies that the neural network f_ε solves the viscous HJ PDE (18), and it is also its unique solution because w_ε is the unique solution to the Cauchy problem (70). Joint convexity in (x, t) follows from log-convexity of (x, t) → w_ε(x, t) for every ε > 0.
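The Cole–Hopf computation above can be checked numerically. The sketch below evaluates the network f_ε(x, t) = ε log Σ_i exp((⟨p_i, x⟩ + (t/2)‖p_i‖²_2 − γ_i)/ε) and verifies by finite differences that it satisfies ∂f/∂t = (ε/2)Δ_x f + ½‖∇_x f‖²_2, which is the viscous HJ equation obtained from the heat equation (70) under f_ε = ε log w_ε (our reading of Eq. (18)); the parameters p_i, γ_i are random illustrative values, not data from the paper.

```python
import numpy as np

# Hedged sketch: check by finite differences that the LogSumExp network
#   f_eps(x, t) = eps * log( sum_i exp((<p_i, x> + t*||p_i||^2/2 - gamma_i)/eps) )
# satisfies df/dt = (eps/2) * Laplacian_x f + (1/2) * ||grad_x f||^2, i.e. the
# viscous HJ equation obtained from the heat equation (70) via f = eps*log(w).
rng = np.random.default_rng(0)
n, m, eps = 2, 5, 0.1
P = rng.normal(size=(m, n))      # rows are the p_i
gamma = rng.normal(size=m)       # the gamma_i

def f(x, t):
    z = (P @ x + 0.5 * t * np.sum(P**2, axis=1) - gamma) / eps
    zmax = z.max()               # stabilized log-sum-exp
    return eps * (zmax + np.log(np.exp(z - zmax).sum()))

def pde_residual(x, t, h=1e-4):
    ft = (f(x, t + h) - f(x, t - h)) / (2 * h)
    grad = np.array([(f(x + h * e, t) - f(x - h * e, t)) / (2 * h) for e in np.eye(n)])
    lap = sum((f(x + h * e, t) - 2 * f(x, t) + f(x - h * e, t)) / h**2 for e in np.eye(n))
    return ft - 0.5 * eps * lap - 0.5 * grad @ grad

x0, t0 = rng.normal(size=n), 0.7
print(pde_residual(x0, t0))      # approximately zero, up to finite-difference error
```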
Proof Let I_x denote the set of maximizers in Eq. (11) at x. Since p ∈ ∂J(x), p ≠ p_i for every i ∈ {1, . . . , m}, and ∂J(x) = co {p_i : i ∈ I_x} by [67, Thm. VI.4.4.2], there exist j, l ∈ I_x such that p_j < p < p_l. Moreover, there exists k with j ≤ k < k + 1 ≤ l such that p_j ≤ p_k < p < p_{k+1} ≤ p_l. We will show that k, k + 1 ∈ I_x. We only prove k ∈ I_x; the case for k + 1 is similar.
If p_j = p_k, then k = j ∈ I_x and the conclusion follows directly. Hence suppose p_j < p_k < p_l. Then, there exists α ∈ (0, 1) such that p_k = αp_j + (1 − α)p_l. Using that j, l ∈ I_x, assumption (A2), and Jensen's inequality, we deduce that k ∈ I_x. A similar argument shows that k + 1 ∈ I_x, which completes the proof.
where
\beta_k := \frac{p_{k+1} - u_0}{p_{k+1} - p_k} \quad \text{and} \quad \beta_{k+1} := \frac{u_0 - p_k}{p_{k+1} - p_k}. \quad (73)
Proof Define the vector β = (β_1, . . . , β_m) by

\beta_k := \frac{p_{k+1} - u_0}{p_{k+1} - p_k}, \qquad \beta_{k+1} := \frac{u_0 - p_k}{p_{k+1} - p_k},

and β_i = 0 for every i ∈ {1, . . . , m} \ {k, k + 1}. We will prove that β is a minimizer in Eq. (14) evaluated at u_0, that is,

\beta \in \operatorname*{arg\,min}_{\alpha \in A(u_0)} \sum_{i=1}^{m} \alpha_i \theta_i, \quad \text{where} \quad
A(u_0) := \operatorname*{arg\,min}_{\substack{(\alpha_1, \dots, \alpha_m) \in \Lambda_m \\ \sum_{i=1}^{m} \alpha_i p_i = u_0}} \sum_{i=1}^{m} \alpha_i \gamma_i.
First, we show that β ∈ A(u0 ). By definition of β and Lemma 3.1(ii) with p = u0 , the
statement holds provided k, k + 1 ∈ Ix , where the set Ix contains the maximizers in Eq.
(10) evaluated at x ∈ ∂J ∗ (u0 ). But if x ∈ ∂J ∗ (u0 ), we have u0 ∈ ∂J (x), and Lemma D.1
implies k, k + 1 ∈ Ix . Hence, β ∈ A(u0 ).
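For a quick sanity check of the weights in Eq. (73), the following sketch (with illustrative slopes p_i, not values from the paper) verifies that β has nonnegative entries summing to one and satisfies Σ_i β_i p_i = u_0; the remaining requirement Σ_i β_i γ_i = J*(u_0), which makes β an element of A(u_0), is the part supplied by Lemma 3.1(ii) above.

```python
import numpy as np

# Hedged sketch of the coefficients beta in Eq. (73): for u0 strictly between two
# consecutive slopes p_k < u0 < p_{k+1}, the two barycentric weights reproduce u0
# and sum to one, so beta (zero elsewhere) lies in the simplex and satisfies the
# linear constraint sum_i beta_i p_i = u0 appearing in the definition of A(u0).
p = np.array([-2.0, -0.5, 1.0, 3.0])       # illustrative increasing slopes p_i
u0 = 0.25                                  # a point with p_k < u0 < p_{k+1}
k = np.searchsorted(p, u0) - 1             # index k such that p_k < u0 < p_{k+1}

beta = np.zeros_like(p)
beta[k] = (p[k + 1] - u0) / (p[k + 1] - p[k])
beta[k + 1] = (u0 - p[k]) / (p[k + 1] - p[k])

assert np.isclose(beta.sum(), 1.0) and np.all(beta >= 0)
assert np.isclose(beta @ p, u0)
print(k, beta)
```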
Now, suppose that β is not a minimizer in Eq. (14) evaluated at u0 . By Lemma 3.2(i), there
exists a minimizer in Eq. (14) evaluated at the point u0 , which we denote by (α1 , . . . , αm ).
Then there holds
\sum_{i=1}^{m} \alpha_i = \sum_{i=1}^{m} \beta_i = 1, \qquad
\sum_{i=1}^{m} \alpha_i p_i = \sum_{i=1}^{m} \beta_i p_i = u_0, \qquad
\sum_{i=1}^{m} \alpha_i \gamma_i = \sum_{i=1}^{m} \beta_i \gamma_i = J^*(u_0), \qquad
\sum_{i=1}^{m} \alpha_i \theta_i < \sum_{i=1}^{m} \beta_i \theta_i. \quad (74)
Since α_i ≥ 0 for every i and β_i = 0 for every i ∈ {1, . . . , m} \ {k, k + 1}, we have α_k + α_{k+1} ≤ 1 = β_k + β_{k+1}. As α ≠ β, then one or both of the inequalities α_k < β_k and α_{k+1} < β_{k+1} hold. This leaves three possible cases, and we now show that each case leads to a contradiction.
Note that from the first two equations in (74) and the assumption that α_k < β_k and α_{k+1} < β_{k+1}, there exist i_1 < k and i_2 > k + 1 such that α_{i_1} ≠ 0 and α_{i_2} ≠ 0, and hence, the numbers q_k and q_{k+1} are well-defined. By definition, we have q_k < p_k < p_{k+1} < q_{k+1}. Therefore, there exist b_k, b_{k+1} ∈ (0, 1) such that

c_i^k := \begin{cases} \dfrac{b_k \alpha_i}{\sum_{\omega < k} \alpha_\omega}, & i < k, \\[6pt] \dfrac{(1 - b_k)\alpha_i}{\sum_{\omega > k+1} \alpha_\omega}, & i > k + 1, \\[6pt] 0, & \text{otherwise}, \end{cases}
\qquad \text{and} \qquad
c_i^{k+1} := \begin{cases} \dfrac{b_{k+1} \alpha_i}{\sum_{\omega < k} \alpha_\omega}, & i < k, \\[6pt] \dfrac{(1 - b_{k+1})\alpha_i}{\sum_{\omega > k+1} \alpha_\omega}, & i > k + 1, \\[6pt] 0, & \text{otherwise}. \end{cases}
\quad (79)
These coefficients satisfy c_i^k, c_i^{k+1} ∈ [0, 1] for any i and \sum_{i=1}^{m} c_i^k = \sum_{i=1}^{m} c_i^{k+1} = 1. In other words, we have

(c_1^k, \dots, c_m^k) \in \Lambda_m \ \text{with } c_k^k = 0 \qquad \text{and} \qquad (c_1^{k+1}, \dots, c_m^{k+1}) \in \Lambda_m \ \text{with } c_{k+1}^{k+1} = 0. \quad (80)

Hence, the first equality in Eq. (9) holds for the coefficients (c_1^k, . . . , c_m^k) with the index k and also for the coefficients (c_1^{k+1}, . . . , c_m^{k+1}) with the index k + 1. We show next that these coefficients satisfy the second and third equalities in (9) and draw a contradiction with assumption (A3).
Using Eqs. (76), (77), and (79) to write the formulas for p_k and p_{k+1} via the coefficients c_i^k and c_i^{k+1}, we find

p_k = b_k \frac{\sum_{i<k} \alpha_i p_i}{\sum_{i<k} \alpha_i} + (1 - b_k) \frac{\sum_{i>k+1} \alpha_i p_i}{\sum_{i>k+1} \alpha_i} = \sum_{i \neq k, k+1} c_i^k p_i = \sum_{i \neq k} c_i^k p_i,
\qquad
p_{k+1} = b_{k+1} \frac{\sum_{i<k} \alpha_i p_i}{\sum_{i<k} \alpha_i} + (1 - b_{k+1}) \frac{\sum_{i>k+1} \alpha_i p_i}{\sum_{i>k+1} \alpha_i} = \sum_{i \neq k, k+1} c_i^{k+1} p_i = \sum_{i \neq k+1} c_i^{k+1} p_i, \quad (81)

where the last equalities in the two formulas above hold because c_{k+1}^k = 0 and c_k^{k+1} = 0 by definition. Hence, the second equality in Eq. (9) also holds for both the index k and k + 1.
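The redistribution step in Eqs. (79)–(81) can be illustrated numerically. In the sketch below we take q_k, q_{k+1} to be the α-weighted averages of the p_i with i < k and i > k + 1, and b_k = (q_{k+1} − p_k)/(q_{k+1} − q_k), as suggested by the computation above; Eqs. (76)–(78) themselves are not reproduced in this excerpt, so these formulas are assumptions, and the numbers are illustrative only.

```python
import numpy as np

# Hedged sketch of the redistribution step in Eqs. (79)-(81): build the
# coefficients c^k of Eq. (79) and check that they lie in the simplex with
# c^k_k = c^k_{k+1} = 0 (Eq. (80)) and reproduce p_k as in Eq. (81).
p = np.array([-3.0, -1.0, 0.0, 1.0, 4.0])      # illustrative increasing p_i
alpha = np.array([0.3, 0.2, 0.1, 0.1, 0.3])    # illustrative simplex vector
k = 2                                          # indices k and k+1 (0-based here)

q_k = (alpha[:k] @ p[:k]) / alpha[:k].sum()               # average over i < k
q_k1 = (alpha[k + 2:] @ p[k + 2:]) / alpha[k + 2:].sum()  # average over i > k+1
b_k = (q_k1 - p[k]) / (q_k1 - q_k)                        # assumed form of b_k

c_k = np.zeros_like(p)
c_k[:k] = b_k * alpha[:k] / alpha[:k].sum()
c_k[k + 2:] = (1.0 - b_k) * alpha[k + 2:] / alpha[k + 2:].sum()

assert np.isclose(c_k.sum(), 1.0) and np.all(c_k >= 0)    # Eq. (80) for index k
assert np.isclose(c_k @ p, p[k])                          # Eq. (81) for index k
print(q_k, q_k1, b_k, c_k)
```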
From the third equality in Eq. (75), assumption (A2), Eq. (81), and Jensen's inequality, we have

\sum_{i \neq k, k+1} \alpha_i \gamma_i = (\beta_k - \alpha_k)\gamma_k + (\beta_{k+1} - \alpha_{k+1})\gamma_{k+1}
\leq (\beta_k - \alpha_k) \left( \sum_{i \neq k, k+1} c_i^k g(p_i) \right) + (\beta_{k+1} - \alpha_{k+1}) \left( \sum_{i \neq k, k+1} c_i^{k+1} g(p_i) \right)
= \sum_{i \neq k, k+1} \big( (\beta_k - \alpha_k)c_i^k + (\beta_{k+1} - \alpha_{k+1})c_i^{k+1} \big) g(p_i)
= \sum_{i \neq k, k+1} \big( (\beta_k - \alpha_k)c_i^k + (\beta_{k+1} - \alpha_{k+1})c_i^{k+1} \big) \gamma_i. \quad (82)

We now compute and simplify the coefficients (β_k − α_k)c_i^k + (β_{k+1} − α_{k+1})c_i^{k+1} in the formula above. First, consider the case when i < k. Eqs. (78) and (79) imply

(\beta_k - \alpha_k)c_i^k + (\beta_{k+1} - \alpha_{k+1})c_i^{k+1}
= \frac{\alpha_i}{\sum_{\omega < k} \alpha_\omega} \left( (\beta_k - \alpha_k) \frac{q_{k+1} - p_k}{q_{k+1} - q_k} + (\beta_{k+1} - \alpha_{k+1}) \frac{q_{k+1} - p_{k+1}}{q_{k+1} - q_k} \right)
= \frac{\alpha_i}{\sum_{\omega < k} \alpha_\omega} \cdot \frac{1}{q_{k+1} - q_k} \big( (\beta_k - \alpha_k + \beta_{k+1} - \alpha_{k+1}) q_{k+1} - (\beta_k - \alpha_k)p_k - (\beta_{k+1} - \alpha_{k+1})p_{k+1} \big).

Applying the first two equalities in Eq. (75) and Eq. (76) to the last formula above, we obtain

(\beta_k - \alpha_k)c_i^k + (\beta_{k+1} - \alpha_{k+1})c_i^{k+1} = \alpha_i.
The same result for the case when i > k + 1 also holds and the proof is similar. Therefore, we have

(\beta_k - \alpha_k)c_i^k + (\beta_{k+1} - \alpha_{k+1})c_i^{k+1} = \alpha_i \quad \text{for every } i \in \{1, \dots, m\} \setminus \{k, k+1\}. \quad (83)

Hence, the left side and the right side of Eq. (82) are the same, and the inequality in Eq. (82) becomes an equality. In other words, we have

\gamma_k = g(p_k) = \sum_{i \neq k, k+1} c_i^k g(p_i) = \sum_{i \neq k, k+1} c_i^k \gamma_i = \sum_{i \neq k} c_i^k \gamma_i,
\qquad
\gamma_{k+1} = g(p_{k+1}) = \sum_{i \neq k, k+1} c_i^{k+1} g(p_i) = \sum_{i \neq k, k+1} c_i^{k+1} \gamma_i = \sum_{i \neq k+1} c_i^{k+1} \gamma_i, \quad (84)

where the last equalities in the two formulas above hold because c_{k+1}^k = 0 and c_k^{k+1} = 0 by definition. Hence, the third equality in (9) also holds for both indices k and k + 1.
In summary, Eqs. (80), (81), and (84) imply that Eq. (9) holds for the index k with coefficients (c_1^k, . . . , c_m^k) and also for the index k + 1 with coefficients (c_1^{k+1}, . . . , c_m^{k+1}). Hence, by assumption (A3), we find

\sum_{i \neq k} c_i^k \theta_i > \theta_k \qquad \text{and} \qquad \sum_{i \neq k+1} c_i^{k+1} \theta_i > \theta_{k+1}.
Using the inequalities above with Eq. (83) and the fact that c_{k+1}^k = 0 and c_k^{k+1} = 0, we find

(\beta_k - \alpha_k)\theta_k + (\beta_{k+1} - \alpha_{k+1})\theta_{k+1} < (\beta_k - \alpha_k) \sum_{i \neq k} c_i^k \theta_i + (\beta_{k+1} - \alpha_{k+1}) \sum_{i \neq k+1} c_i^{k+1} \theta_i
= \sum_{i \neq k, k+1} \big( (\beta_k - \alpha_k)c_i^k + (\beta_{k+1} - \alpha_{k+1})c_i^{k+1} \big) \theta_i = \sum_{i \neq k, k+1} \alpha_i \theta_i,
\frac{\theta_l - \theta_k}{p_l - p_k} \leq \frac{\theta_l - \theta_j}{p_l - p_j}. \quad (86)
Proof Note that Eq. (86) holds trivially when j = k, so we only need to consider the case when j < k < l. On the one hand, Eq. (85) implies

\gamma_l - \gamma_k \leq x(p_l - p_k) - t(\theta_l - \theta_k), \qquad \gamma_l - \gamma_j = x(p_l - p_j) - t(\theta_l - \theta_j). \quad (87)
On the other hand, for each i ∈ {j, j + 1, . . . , l − 1} let qi ∈ (pi , pi+1 ) and xi ∈ ∂J ∗ (qi ). Such
xi exists because qi ∈ int dom J ∗ , so that the subdifferential ∂J ∗ (qi ) is non-empty. Then,
qi ∈ ∂J (xi ) and Lemma D.1 imply
\gamma_l - \gamma_k = \sum_{i=k}^{l-1} (\gamma_{i+1} - \gamma_i) = \sum_{i=k}^{l-1} x_i (p_{i+1} - p_i),
\qquad
\gamma_l - \gamma_j = \sum_{i=j}^{l-1} (\gamma_{i+1} - \gamma_i) = \sum_{i=j}^{l-1} x_i (p_{i+1} - p_i).
Combining the two equalities above with Eq. (87), we conclude that
x(p_l - p_k) - t(\theta_l - \theta_k) \geq \sum_{i=k}^{l-1} x_i (p_{i+1} - p_i),
\qquad
x(p_l - p_j) - t(\theta_l - \theta_j) = \sum_{i=j}^{l-1} x_i (p_{i+1} - p_i).
Now, divide the inequality above by t(p_l − p_k) > 0 (because by assumption t > 0 and l > k, which implies that p_l > p_k), divide the equality above by t(p_l − p_j) > 0 (because l > j, which implies that t(p_l − p_j) ≠ 0), and rearrange the terms to obtain

\frac{\theta_l - \theta_k}{p_l - p_k} \leq \frac{x}{t} - \frac{1}{t} \cdot \frac{\sum_{i=k}^{l-1} x_i (p_{i+1} - p_i)}{p_l - p_k},
\qquad
\frac{\theta_l - \theta_j}{p_l - p_j} = \frac{x}{t} - \frac{1}{t} \cdot \frac{\sum_{i=j}^{l-1} x_i (p_{i+1} - p_i)}{p_l - p_j}. \quad (88)
Recall that q_j < q_{j+1} < · · · < q_{l−1} and x_i ∈ ∂J ∗ (q_i) for any j ≤ i < l. Since the function J ∗ is convex, the subdifferential operator ∂J ∗ is a monotone non-decreasing operator [67, Def. IV.4.1.3, and Prop. VI.6.1.1], which yields x_j ≤ x_{j+1} ≤ · · · ≤ x_{l−1}. Using that p_1 < p_2 < · · · < p_m and j < k < l, we obtain

\frac{\sum_{i=k}^{l-1} x_i (p_{i+1} - p_i)}{p_l - p_k} \geq \frac{\sum_{i=k}^{l-1} x_k (p_{i+1} - p_i)}{p_l - p_k} = x_k = \frac{\sum_{i=j}^{k-1} x_k (p_{i+1} - p_i)}{p_k - p_j} \geq \frac{\sum_{i=j}^{k-1} x_i (p_{i+1} - p_i)}{p_k - p_j}. \quad (89)
To proceed, we now use the fact that if four real numbers a, c ∈ R and b, d > 0 satisfy c/d ≤ a/b, then c/d ≤ (a + c)/(b + d) ≤ a/b. Combining this fact with Eqs. (88) and (89), we find

\frac{\theta_l - \theta_k}{p_l - p_k} \leq \frac{\theta_l - \theta_j}{p_l - p_j}.
its Lipschitz property [57, Thm. 4.16]. Therefore, we can invoke [36, Prop. 2.1] to conclude that u is the entropy solution to the conservation law (21) provided it satisfies the two following conditions. Let x̄(t) be any smooth line of discontinuity of u. Fix t > 0 and define u_- and u_+ as the one-sided limits

u_- := \lim_{x \to \bar{x}(t)^-} u(x, t), \qquad u_+ := \lim_{x \to \bar{x}(t)^+} u(x, t). \quad (90)

The first condition is the jump condition

\frac{d\bar{x}}{dt} = \frac{H(u_+) - H(u_-)}{u_+ - u_-}. \quad (91)

First, we prove the first condition, i.e., Eq. (91). According to the definition of u in Eq. (20), the range of u is the compact set {p_1, . . . , p_m}. As a result, u_- and u_+ are in the range of u, i.e., there exist indices j and l such that

u_- = p_j \qquad \text{and} \qquad u_+ = p_l. \quad (93)
Let (x̄(s), s) be a point on the curve x̄ which is not one of the endpoints. Since u is piecewise constant, there exists a neighborhood N of (x̄(s), s) such that for any (x_-, t), (x_+, t) ∈ N satisfying x_- < x̄(t) < x_+, we have u(x_-, t) = u_- = p_j and u(x_+, t) = u_+ = p_l. In other words, if x_-, x_+, and t are chosen as above, then Eqs. (95) and (96) hold by the definition of u in Eq. (20).
By a continuity argument, Eqs. (95) and (96) also hold for the end points of x̄. In conclusion, Eq. (97) holds for any (x̄(t), t) on the curve x̄, and Eq. (93) and Lemma 3.2(iii) imply that its slope equals

\frac{d\bar{x}}{dt} = \frac{\theta_l - \theta_j}{p_l - p_j} = \frac{H(u_+) - H(u_-)}{u_+ - u_-}.
\frac{\theta_l - H(u_0)}{p_l - u_0} \leq \frac{\theta_l - \theta_j}{p_l - p_j}. \quad (98)
Without loss of generality, we may assume that p_1 < p_2 < · · · < p_m. Then, the fact p_j = u_- < u_+ = p_l implies j < l. We consider the following two cases.
First, if there exists some k such that u_0 = p_k, then H(u_0) = θ_k by Lemma 3.2(iii). Since u_- < u_0 < u_+, we have j < k < l, and Eq. (97) holds. Using these equations, we can write the left-hand side of Eq. (98) as

\frac{\theta_l - H(u_0)}{p_l - u_0} = \frac{\theta_l - \theta_k}{p_l - p_k}.
Since j ≤ k < l and Eq. (97) hold, the assumptions of Lemma D.3 are satisfied. This allows us to conclude that Eq. (98) holds.
If k + 1 ≠ l, then using Eq. (97), the inequalities j ≤ k < k + 1 < l, and Lemma D.3, we obtain

\frac{\beta_k(\theta_l - \theta_k)}{\beta_k(p_l - p_k)} = \frac{\theta_l - \theta_k}{p_l - p_k} \leq \frac{\theta_l - \theta_j}{p_l - p_j} \qquad \text{and} \qquad \frac{\beta_{k+1}(\theta_l - \theta_{k+1})}{\beta_{k+1}(p_l - p_{k+1})} = \frac{\theta_l - \theta_{k+1}}{p_l - p_{k+1}} \leq \frac{\theta_l - \theta_j}{p_l - p_j}.
Note that if a_i ∈ R and b_i ∈ (0, +∞) for i ∈ {1, 2, 3} satisfy a_1/b_1 ≤ a_3/b_3 and a_2/b_2 ≤ a_3/b_3, then (a_1 + a_2)/(b_1 + b_2) ≤ a_3/b_3. Then, since β_k(p_l − p_k), β_{k+1}(p_l − p_{k+1}), and p_l − p_j are positive, we have

\frac{\beta_k(\theta_l - \theta_k) + \beta_{k+1}(\theta_l - \theta_{k+1})}{\beta_k(p_l - p_k) + \beta_{k+1}(p_l - p_{k+1})} \leq \frac{\theta_l - \theta_j}{p_l - p_j}.

Hence, Eq. (98) follows directly from the inequality above and Eq. (99).
Therefore, the two conditions, including Eqs. (91) and (92), are satisfied and we apply
[36, Prop 2.1] to conclude that the function u is the entropy solution to the conservation
law (21).
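The two conditions verified in this proof can be illustrated numerically. In the sketch below, H is taken as the piecewise-linear interpolation of illustrative points (p_i, θ_i) (so that H(p_i) = θ_i as in Lemma 3.2(iii)), the speed of a hypothetical shock joining p_j and p_l is computed from the jump condition (91), and the chord-type inequality (98) is checked for intermediate states u_0; the data are made up for illustration only.

```python
import numpy as np

# Hedged numerical illustration of the jump condition (91), with slope
# (theta_l - theta_j)/(p_l - p_j), and of the chord-type entropy inequality (98)
# for intermediate states u0. H is the piecewise-linear interpolation of the
# illustrative points (p_i, theta_i) below; these are examples, not paper data.
p = np.array([-1.0, 0.0, 2.0])
theta = np.array([1.0, 1.5, 0.5])          # theta_i = H(p_i); decreasing slopes,
                                           # so the graph of H lies above its chords

def H(u):                                  # piecewise-linear interpolation
    return np.interp(u, p, theta)

def shock_speed(j, l):                     # Rankine-Hugoniot speed, Eq. (91)
    return (theta[l] - theta[j]) / (p[l] - p[j])

j, l = 0, 2
s = shock_speed(j, l)
for u0 in np.linspace(p[j], p[l], 9)[1:-1]:
    lhs = (theta[l] - H(u0)) / (p[l] - u0)  # left-hand side of Eq. (98)
    assert lhs <= s + 1e-12                 # entropy inequality (98) holds
print("shock speed:", s)
```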
Proof of (ii) (sufficiency): Without loss of generality, assume p_1 < p_2 < · · · < p_m. Let C ∈ R. Suppose H̃ satisfies H̃(p_i) = H(p_i) + C for each i ∈ {1, . . . , m} and H̃(p) ≥ H(p) + C for any p ∈ [p_1, p_m]. We want to prove that u is the entropy solution to the conservation law (22).
As in the proof of (i), we apply [36, Prop. 2.1] and verify that the two conditions hold through Eqs. (91) and (92). Let x̄(t) be any smooth line of discontinuity of u, define u_- and u_+ by Eq. (90) (and recall that u_- = p_j and u_+ = p_l), and let u_0 ∈ (u_-, u_+). We proved in the proof of (i) that x̄(t) is a straight line, and so it suffices to prove Eq. (100).
We start by proving the equality in Eq. (100). By assumption, Eq. (101) holds. Combining Eq. (101) with Eq. (91) (which we proved in the proof of (i)), we obtain the equality in Eq. (100).
Proof of (ii) (necessity): Suppose that u is the entropy solution to the conservation law (22). We prove that there exists C ∈ R such that H̃(p_i) = H(p_i) + C for any i and H̃(p) ≥ H(p) + C for any p ∈ [p_1, p_m].
By Lemma B.2, for each i ∈ {1, . . . , m} there exist x ∈ R and t > 0 such that Eq. (102) holds.
Moreover, the proof of Lemma B.2 implies there exists T > 0 such that for any 0 < t < T, there exists x ∈ R such that Eq. (102) holds. As a result, there exists t > 0 such that for each i ∈ {1, . . . , m}, there exists x_i ∈ R satisfying Eq. (102) at the point (x_i, t), which implies u(x_i, t) = p_i. Note that p_i ≠ p_j implies that x_i ≠ x_j. (Indeed, if x_i = x_j, then p_i = ∇_x f(x_i, t) = ∇_x f(x_j, t) = p_j, which gives a contradiction since p_i ≠ p_j by assumption (A1).) As mentioned before, the function u(·, t) ≡ ∇_x f(·, t) is monotone non-decreasing and p_i is increasing with respect to i, and therefore x_1 < x_2 < · · · < x_m. Since u is piecewise constant, for each k ∈ {1, . . . , m − 1} there exists a curve of discontinuity of u with u = p_k on the left-hand side of the curve and u = p_{k+1} on the right-hand side of the curve. Let x̄(s) be such a curve and let u_- and u_+ be the corresponding numbers defined in Eq. (90). The argument above proves that we have u_- = p_k and u_+ = p_{k+1}.
Since u is the piecewise constant entropy solution, we invoke [36, Prop. 2.1] to conclude that the two aforementioned conditions hold for the curve x̄(s), i.e., (100) holds with u_- = p_k and u_+ = p_{k+1}. From the equality in (100) and Eq. (91) proved in (i), we deduce that H̃(p_{k+1}) − H̃(p_k) = H(p_{k+1}) − H(p_k). Since k is an arbitrary index, this equality holds for any k ∈ {1, . . . , m − 1}. Therefore, there exists C ∈ R such that Eq. (103) holds, i.e., H̃(p_i) = H(p_i) + C for every i ∈ {1, . . . , m}.
It remains to prove H̃(u_0) ≥ H(u_0) + C for all u_0 ∈ [p_k, p_{k+1}]. If this inequality holds, then the statement follows because k is an arbitrary index. We already proved that H̃(u_0) ≥ H(u_0) + C for u_0 = p_k with k ∈ {1, . . . , m}. Therefore, we need to prove that H̃(u_0) ≥ H(u_0) + C for all u_0 ∈ (p_k, p_{k+1}). Let u_0 ∈ (p_k, p_{k+1}). By Eq. (103) and the inequality in (100), we obtain Eq. (105). Comparing Eqs. (104) and (105), we obtain H̃(u_0) ≥ H(u_0) + C. Since k is arbitrary, we conclude that H̃(u_0) ≥ H(u_0) + C holds for all u_0 ∈ [p_1, p_m] and the proof is complete.
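The characterization just proved can be tested numerically for candidate Hamiltonians H̃. The sketch below uses the inequality direction restored in this proof, namely H̃(p_i) = H(p_i) + C at the slopes p_i together with H̃(p) ≥ H(p) + C on [p_1, p_m], and checks it on a grid; H is again a piecewise-linear interpolation of illustrative points, and the examples are not taken from the paper.

```python
import numpy as np

# Hedged sketch of the characterization proved above (with the inequality as
# restored in this proof): a flux H_tilde admits u as entropy solution iff, for
# some constant C, H_tilde(p_i) = H(p_i) + C at the slopes p_i and
# H_tilde(p) >= H(p) + C on [p_1, p_m].
p = np.array([-1.0, 0.0, 2.0])
theta = np.array([1.0, 1.5, 0.5])

def H(u):                                  # piecewise-linear interpolation of (p_i, theta_i)
    return np.interp(u, p, theta)

def satisfies_characterization(H_tilde, grid_size=1001, tol=1e-9):
    C = H_tilde(p[0]) - H(p[0])                        # candidate constant C
    if not np.allclose(H_tilde(p), H(p) + C, atol=tol):
        return False                                   # equality at the nodes fails
    grid = np.linspace(p[0], p[-1], grid_size)
    return bool(np.all(H_tilde(grid) >= H(grid) + C - tol))

def bump(u):                               # nonnegative and zero at every p_i
    return 0.05 * np.abs((u - p[0]) * (u - p[1]) * (u - p[2]))

print(satisfies_characterization(lambda u: H(u) + 3.0))       # True: vertical shift of H
print(satisfies_characterization(lambda u: H(u) + bump(u)))   # True: >= H between the nodes
print(satisfies_characterization(lambda u: H(u) - bump(u)))   # False: dips below H between the nodes
```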
Received: 27 October 2019 Accepted: 22 June 2020
References
1. Aaibid, M., Sayah, A.: A direct proof of the equivalence between the entropy solutions of conservation laws and
viscosity solutions of Hamilton–Jacobi equations in one-space variable. JIPAM J. Inequal. Pure Appl. Math. 7(2), 11
(2006)
2. Akian, M., Bapat, R., Gaubert, S.: Max-plus algebra. Handbook of Linear Algebra 39, (2006)
3. Akian, M., Gaubert, S., Lakhoua, A.: The max-plus finite element method for solving deterministic optimal control
problems: basic properties and convergence analysis. SIAM J. Control Optim. 47(2), 817–848 (2008)
4. Alla, A., Falcone, M., Saluzzi, L.: An efficient DP algorithm on a tree-structure for finite horizon optimal control problems.
SIAM J. Sci. Comput. 41(4), A2384–A2406 (2019)
5. Alla, A., Falcone, M., Volkwein, S.: Error analysis for POD approximations of infinite horizon problems via the dynamic
programming approach. SIAM J. Control Optim. 55(5), 3091–3115 (2017)
6. Arnol’d, V.I.: Mathematical methods of classical mechanics. Graduate Texts in Mathematics, vol. 60. Springer, New York
(1989). Translated from the 1974 Russian original by K. Vogtmann and A. Weinstein, Corrected reprint of the second
(1989) edition
7. Bachouch, A., Huré, C., Langrené, N., Pham, H.: Deep neural networks algorithms for stochastic control problems on
finite horizon: numerical applications. arXiv preprint arXiv:1812.05916 (2018)
8. Banerjee, K., Georganas, E., Kalamkar, D., Ziv, B., Segal, E., Anderson, C., Heinecke, A.: Optimizing deep learning RNN
topologies on intel architecture. Supercomput. Front. Innov. 6(3), 64–85 (2019)
9. Bardi, M., Capuzzo-Dolcetta, I.: Optimal control and viscosity solutions of Hamilton–Jacobi–Bellman equations. Syst.
Control Found. Appl. Birkhäuser Boston, Inc., Boston, MA (1997). https://doi.org/10.1007/978-0-8176-4755-1. With
appendices by Maurizio Falcone and Pierpaolo Soravia
10. Bardi, M., Evans, L.: On Hopf’s formulas for solutions of Hamilton–Jacobi equations. Nonlinear Anal. Theory, Methods
Appl. 8(11), 1373–1381 (1984). https://doi.org/10.1016/0362-546X(84)90020-8
11. Barles, G.: Solutions de viscosité des équations de Hamilton–Jacobi. Mathématiques et Applications. Springer, Berlin
(1994)
12. Barles, G., Tourin, A.: Commutation properties of semigroups for first-order Hamilton–Jacobi equations and application
to multi-time equations. Indiana Univ. Math. J. 50(4), 1523–1544 (2001)
13. Barron, E., Evans, L., Jensen, R.: Viscosity solutions of Isaacs’ equations and differential games with Lipschitz controls.
J. Differ. Equ. 53(2), 213–233 (1984). https://doi.org/10.1016/0022-0396(84)90040-8
14. Beck, C., Becker, S., Cheridito, P., Jentzen, A., Neufeld, A.: Deep splitting method for parabolic PDEs. (2019). arXiv
preprint arXiv:1907.03452
15. Beck, C., Becker, S., Grohs, P., Jaafari, N., Jentzen, A.: Solving stochastic differential equations and Kolmogorov equations
by means of deep learning. (2018). arXiv preprint arXiv:1806.00421
16. Beck, C., E, W., Jentzen, A.: Machine learning approximation algorithms for high-dimensional fully nonlinear partial
differential equations and second-order backward stochastic differential equations. J. Nonlinear Sci. 29(4), 1563–1619
(2019)
17. Bellman, R.E.: Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton (1961)
18. Berg, J., Nyström, K.: A unified deep artificial neural network approach to partial differential equations in complex
geometries. Neurocomputing 317, 28–41 (2018). https://doi.org/10.1016/j.neucom.2018.06.056
19. Bertsekas, D.P.: Reinforcement Learning and Optimal Control. Athena Scientific, Belmont (2019)
20. Bokanowski, O., Garcke, J., Griebel, M., Klompmaker, I.: An adaptive sparse grid semi-Lagrangian scheme for first order
Hamilton–Jacobi Bellman equations. J. Sci. Comput. 55(3), 575–605 (2013)
21. Bonnans, J.F., Shapiro, A.: Perturbation Analysis of Optimization Problems. Springer Series in Operations Research.
Springer, New York (2000). https://doi.org/10.1007/978-1-4612-1394-9
22. Brenier, Y., Osher, S.: Approximate Riemann solvers and numerical flux functions. SIAM J. Numer. Anal. 23(2), 259–273
(1986)
23. Brenier, Y., Osher, S.: The discrete one-sided Lipschitz condition for convex scalar conservation laws. SIAM J. Numer.
Anal. 25(1), 8–23 (1988). https://doi.org/10.1137/0725002
24. Buckdahn, R., Cardaliaguet, P., Quincampoix, M.: Some recent aspects of differential game theory. Dyn. Games Appl.
1(1), 74–114 (2011). https://doi.org/10.1007/s13235-010-0005-0
25. Carathéodory, C.: Calculus of variations and partial differential equations of the first order. Part I: Partial differential
equations of the first order. Translated by Robert B. Dean and Julius J. Brandstatter. Holden-Day, Inc., San Francisco-
London-Amsterdam (1965)
26. Carathéodory, C.: Calculus of variations and partial differential equations of the first order. Part II: Calculus of variations.
Translated from the German by Robert B. Dean, Julius J. Brandstatter, translating editor. Holden-Day, Inc., San Francisco-
London-Amsterdam (1967)
27. Cardin, F., Viterbo, C.: Commuting Hamiltonians and Hamilton–Jacobi multi-time equations. Duke Math. J. 144(2),
235–284 (2008). https://doi.org/10.1215/00127094-2008-036
28. Caselles, V.: Scalar conservation laws and Hamilton–Jacobi equations in one-space variable. Nonlinear Anal. Theory
Methods Appl. 18(5), 461–469 (1992). https://doi.org/10.1016/0362-546X(92)90013-5
29. Chan-Wai-Nam, Q., Mikael, J., Warin, X.: Machine learning for semi linear PDEs. J. Sci. Comput. 79(3), 1667–1712 (2019)
30. Chen, T., van Gelder, J., van de Ven, B., Amitonov, S.V., de Wilde, B., Euler, H.C.R., Broersma, H., Bobbert, P.A., Zwanenburg,
F.A., van der Wiel, W.G.: Classification with a disordered dopant-atom network in silicon. Nature 577(7790), 341–345
(2020)
31. Cheng, T., Lewis, F.L.: Fixed-final time constrained optimal control of nonlinear systems using neural network HJB
approach. In: Proceedings of the 45th IEEE Conference on Decision and Control, pp. 3016–3021 (2006). https://doi.
org/10.1109/CDC.2006.377523
32. Corrias, L., Falcone, M., Natalini, R.: Numerical schemes for conservation laws via Hamilton–Jacobi equations. Math.
Comput. 64(210), 555–580, S13–S18 (1995). https://doi.org/10.2307/2153439
33. Courant, R., Hilbert, D.: Methods of mathematical physics. Vol. II. Wiley Classics Library. Wiley: New York (1989). Partial
differential equations, Reprint of the 1962 original, A Wiley-Interscience Publication
34. Crandall, M.G., Ishii, H., Lions, P.L.: User’s guide to viscosity solutions of second order partial differential equations. Bull.
Am. Math. Soc. 27(1), 1–67 (1992). https://doi.org/10.1090/S0273-0979-1992-00266-5
35. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4), 303–314
(1989). https://doi.org/10.1007/BF02551274
36. Dafermos, C.M.: Polygonal approximations of solutions of the initial value problem for a conservation law. J. Math.
Anal. Appl. 38(1), 33–41 (1972). https://doi.org/10.1016/0022-247X(72)90114-X
37. Dafermos, C.M.: Hyperbolic conservation laws in continuum physics, Grundlehren der Mathematischen Wissenschaften,
vol. 325, 4th Edn. Springer, Berlin (2016). https://doi.org/10.1007/978-3-662-49451-6
38. Darbon, J.: On convex finite-dimensional variational methods in imaging sciences and Hamilton–Jacobi equations.
SIAM J. Imaging Sci. 8(4), 2268–2293 (2015). https://doi.org/10.1137/130944163
39. Darbon, J., Meng, T.: On decomposition models in imaging sciences and multi-time Hamilton-Jacobi partial differential
equations. (2019). arXiv preprint arXiv:1906.09502
40. Darbon, J., Osher, S.: Algorithms for overcoming the curse of dimensionality for certain Hamilton–Jacobi equations
arising in control theory and elsewhere. Res. Math. Sci. 3(1), 19 (2016). https://doi.org/10.1186/s40687-016-0068-7
41. Dissanayake, M.W.M.G., Phan-Thien, N.: Neural-network-based approximations for solving partial differential equa-
tions. Commun. Numer. Methods Eng. 10(3), 195–201 (1994). https://doi.org/10.1002/cnm.1640100303
42. Djeridane, B., Lygeros, J.: Neural approximation of PDE solutions: An application to reachability computations. In:
Proceedings of the 45th IEEE Conference on Decision and Control, pp. 3034–3039 (2006). https://doi.org/10.1109/
CDC.2006.377184
43. Dockhorn, T.: A discussion on solving partial differential equations using neural networks. (2019). arXiv preprint
arXiv:1904.07200
44. Dolgov, S., Kalise, D., Kunisch, K.: A tensor decomposition approach for high-dimensional Hamilton-Jacobi-Bellman
equations. (2019). arXiv preprint arXiv:1908.01533
45. Dower, P.M., McEneaney, W.M., Zhang, H.: Max-plus fundamental solution semigroups for optimal control problems.
In: 2015 Proceedings of the Conference on Control and its Applications, pp. 368–375. SIAM (2015)
46. Elliott, R.J.: Viscosity solutions and optimal control, Pitman research notes in mathematics series, vol. 165. Longman
Scientific & Technical, Harlow; Wiley, New York (1987)
47. Evans, L.C.: Partial differential equations, Graduate Studies in Mathematics, vol. 19, second edn. American Mathematical
Society, Providence, RI (2010). https://doi.org/10.1090/gsm/019
48. Evans, L.C., Gariepy, R.F.: Measure Theory and Fine Properties of Functions. Textbooks in Mathematics, revised edn.
CRC Press, Boca Raton (2015)
49. Evans, L.C., Souganidis, P.E.: Differential games and representation formulas for solutions of Hamilton–Jacobi–Isaacs
equations. Indiana Univ. Math. J. 33(5), 773–797 (1984)
50. Farabet, C., LeCun, Y., Kavukcuoglu, K., Culurciello, E., Martini, B., Akselrod, P., Talay, S.: Large-scale fpga-based convolu-
tional networks. In: Bekkerman, R., Bilenko, M., Langford, J. (eds.) Scaling up Machine Learning: Parallel and Distributed
Approaches. Cambridge University Press, Cambridge (2011)
51. Farabet, C., Poulet, C., Han, J., LeCun, Y.: CNP: An FPGA-based processor for convolutional networks. In: International
Conference on Field Programmable Logic and Applications. IEEE, Prague (2009)
52. Farabet, C., Poulet, C., LeCun, Y.: An FPGA-based stream processor for embedded real-time vision with convolutional
networks. In: 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, pp. 878–885.
IEEE Computer Society, Los Alamitos, CA, USA (2009). https://doi.org/10.1109/ICCVW.2009.5457611
53. Farimani, A.B., Gomes, J., Pande, V.S.: Deep Learning the Physics of Transport Phenomena. arXiv e-prints (2017)
54. Fleming, W., McEneaney, W.: A max-plus-based algorithm for a Hamilton–Jacobi–Bellman equation of nonlinear
filtering. SIAM J. Control Optim. 38(3), 683–710 (2000). https://doi.org/10.1137/S0363012998332433
55. Fleming, W.H., Rishel, R.W.: Deterministic and stochastic optimal control. Bull. Am. Math. Soc. 82, 869–870 (1976)
56. Fleming, W.H., Soner, H.M.: Controlled Markov Processes and Viscosity Solutions, vol. 25. Springer, New York (2006)
57. Folland, G.B.: Real Analysis: Modern Techniques and Their Applications. Wiley, Hoboken (2013)
58. Fujii, M., Takahashi, A., Takahashi, M.: Asymptotic expansion as prior knowledge in deep learning method for high
dimensional BSDEs. Asia-Pacific Financ. Mark. 26(3), 391–408 (2019). https://doi.org/10.1007/s10690-019-09271-7
59. Garcke, J., Kröner, A.: Suboptimal feedback control of PDEs by solving HJB equations on adaptive sparse grids. J. Sci.
Comput. 70(1), 1–28 (2017)
60. Gaubert, S., McEneaney, W., Qu, Z.: Curse of dimensionality reduction in max-plus based approximation methods:
Theoretical estimates and improved pruning algorithms. In: 2011 50th IEEE Conference on Decision and Control and
European Control Conference, pp. 1054–1061. IEEE (2011)
61. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, New York (2016)
62. Grohs, P., Jentzen, A., Salimova, D.: Deep neural network approximations for Monte Carlo algorithms. (2019). arXiv
preprint arXiv:1908.10828
63. Grüne, L.: Overcoming the curse of dimensionality for approximating lyapunov functions with deep neural networks
under a small-gain condition. (2020). arXiv preprint arXiv:2001.08423
64. Han, J., Jentzen, A., E, W.: Solving high-dimensional partial differential equations using deep learning. Proc. Natl. Acad.
Sci. 115(34), 8505–8510 (2018). https://doi.org/10.1073/pnas.1718942115
65. Han, J., Zhang, L., E, W.: Solving many-electron Schrödinger equation using deep neural networks. J. Comput. Phys.
108929 (2019)
66. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90
67. Hiriart-Urruty, J.B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms I: Fundamentals, vol. 305. Springer,
New York (1993)
68. Hiriart-Urruty, J.B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms II: Advanced Theory and Bundle
Methods, vol. 306. Springer, New York (1993)
69. Hirjibehedin, C.: Evolution of circuits for machine learning. Nature 577, 320–321 (2020). https://doi.org/10.1038/
d41586-020-00002-x
70. Hopf, E.: Generalized solutions of non-linear equations of first order. J. Math. Mech. 14, 951–973 (1965)
71. Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Netw. 4(2), 251–257 (1991). https://
doi.org/10.1016/0893-6080(91)90009-T
72. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw.
2(5), 359–366 (1989). https://doi.org/10.1016/0893-6080(89)90020-8
73. Horowitz, M.B., Damle, A., Burdick, J.W.: Linear Hamilton Jacobi Bellman equations in high dimensions. In: 53rd IEEE
Conference on Decision and Control, pp. 5880–5887. IEEE (2014)
74. Hsieh, J.T., Zhao, S., Eismann, S., Mirabella, L., Ermon, S.: Learning neural PDE solvers with convergence guarantees. In:
International Conference on Learning Representations (2019)
75. Hu, C., Shu, C.: A discontinuous Galerkin finite element method for Hamilton–Jacobi equations. SIAM J. Sci. Comput.
21(2), 666–690 (1999). https://doi.org/10.1137/S1064827598337282
76. Huré, C., Pham, H., Bachouch, A., Langrené, N.: Deep neural networks algorithms for stochastic control problems on
finite horizon, part I: convergence analysis. (2018). arXiv preprint arXiv:1812.04300
77. Huré, C., Pham, H., Warin, X.: Some machine learning schemes for high-dimensional nonlinear PDEs. (2019). arXiv
preprint arXiv:1902.01599
78. Hutzenthaler, M., Jentzen, A., Kruse, T., Nguyen, T.A.: A proof that rectified deep neural networks overcome the curse
of dimensionality in the numerical approximation of semilinear heat equations. SN Partial Differ. Equ. Appl. 1(10),
(2020)
79. Hutzenthaler, M., Jentzen, A., Kruse, T., Nguyen, T.A., von Wurstemberger, P.: Overcoming the curse of dimensionality
in the numerical approximation of semilinear parabolic partial differential equations (2018)
80. Hutzenthaler, M., Jentzen, A., von Wurstemberger, P.: Overcoming the curse of dimensionality in the approximative
pricing of financial derivatives with default risks (2019)
81. Hutzenthaler, M., Kruse, T.: Multilevel picard approximations of high-dimensional semilinear parabolic differential
equations with gradient-dependent nonlinearities. SIAM J. Numer. Anal. 58(2), 929–961 (2020). https://doi.org/10.
1137/17M1157015
82. Ishii, H.: Representation of solutions of Hamilton–Jacobi equations. Nonlinear Anal. Theory, Methods Appl. 12(2),
121–146 (1988). https://doi.org/10.1016/0362-546X(88)90030-2
83. Jiang, F., Chou, G., Chen, M., Tomlin, C.J.: Using neural networks to compute approximate and guaranteed feasible
Hamilton–Jacobi–Bellman PDE solutions. (2016). arXiv preprint arXiv:1611.03158
84. Jiang, G., Peng, D.: Weighted ENO schemes for Hamilton–Jacobi equations. SIAM J. Sci. Comput. 21(6), 2126–2143
(2000). https://doi.org/10.1137/S106482759732455X
85. Jianyu, L., Siwei, L., Yingjian, Q., Yaping, H.: Numerical solution of elliptic partial differential equation using radial basis
function neural networks. Neural Netw. 16(5–6), 729–734 (2003)
86. Jin, S., Xin, Z.: Numerical passage from systems of conservation laws to Hamilton–Jacobi equations, and relaxation
schemes. SIAM J. Numer. Anal. 35(6), 2385–2404 (1998). https://doi.org/10.1137/S0036142996314366
87. Jouppi, N.P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al.:
In-datacenter performance analysis of a tensor processing unit. In: Proceedings of the 44th Annual International
Symposium on Computer Architecture, ISCA ’17, pp. 1–12. Association for Computing Machinery, New York, NY, USA
(2017). https://doi.org/10.1145/3079856.3080246
88. Kalise, D., Kundu, S., Kunisch, K.: Robust feedback control of nonlinear PDEs by numerical approximation of high-
dimensional Hamilton–Jacobi–Isaacs equations. (2019). arXiv preprint arXiv:1905.06276
89. Kalise, D., Kunisch, K.: Polynomial approximation of high-dimensional Hamilton–Jacobi–Bellman equations and appli-
cations to feedback control of semilinear parabolic PDEs. SIAM J. Sci. Comput. 40(2), A629–A652 (2018)
90. Kang, W., Wilcox, L.C.: Mitigating the curse of dimensionality: sparse grid characteristics method for optimal feedback
control and HJB equations. Comput. Optim. Appl. 68(2), 289–315 (2017)
91. Karlsen, K., Risebro, H.: A note on front tracking and the equivalence between viscosity solutions of Hamilton–
Jacobi equations and entropy solutions of scalar conservation laws. Nonlinear Anal. (2002). https://doi.org/10.1016/
S0362-546X(01)00753-2
92. Khoo, Y., Lu, J., Ying, L.: Solving parametric PDE problems with artificial neural networks. (2017). arXiv preprint
arXiv:1707.03351
93. Khoo, Y., Lu, J., Ying, L.: Solving for high-dimensional committor functions using artificial neural networks. Res. Math.
Sci. 6(1), 1 (2019)
94. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of the 3rd International Conference
on Learning Representations (ICLR 2015) (2015)
95. Kružkov, S.N.: Generalized solutions of nonlinear first order equations with several independent variables II. Math.
USSR-Sbornik 1(1), 93–116 (1967). https://doi.org/10.1070/sm1967v001n01abeh001969
96. Kundu, A., Srinivasan, S., Qin, E.C., Kalamkar, D., Mellempudi, N.K., Das, D., Banerjee, K., Kaul, B., Dubey, P.: K-tanh:
Hardware efficient activations for deep learning (2019)
97. Kunisch, K., Volkwein, S., Xie, L.: HJB-POD-based feedback design for the optimal control of evolution problems. SIAM
J. Appl. Dyn. Syst. 3(4), 701–722 (2004)
98. Lagaris, I.E., Likas, A., Fotiadis, D.I.: Artificial neural networks for solving ordinary and partial differential equations. IEEE
Trans. Neural Netw. 9(5), 987–1000 (1998). https://doi.org/10.1109/72.712178
99. Lagaris, I.E., Likas, A.C., Papageorgiou, D.G.: Neural-network methods for boundary value problems with irregular
boundaries. IEEE Trans. Neural Netw. 11(5), 1041–1049 (2000). https://doi.org/10.1109/72.870037
100. Lambrianides, P., Gong, Q., Venturi, D.: A new scalable algorithm for computational optimal control under uncertainty.
(2019). arXiv preprint arXiv:1909.07960
101. Landau, L., Lifshitz, E.: Course of Theoretical Physics. Vol. 1: Mechanics. Oxford (1978)
102. LeCun, Y.: 1.1 deep learning hardware: Past, present, and future. In: 2019 IEEE International Solid-State Circuits
Conference—(ISSCC), pp. 12–19 (2019). https://doi.org/10.1109/ISSCC.2019.8662396
103. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
104. Lee, H., Kang, I.S.: Neural algorithm for solving differential equations. J. Comput. Phys. 91(1), 110–131 (1990)
105. Lions, P.L., Rochet, J.C.: Hopf formula and multitime Hamilton–Jacobi equations. Proc. Am. Math. Soc. 96(1), 79–84
(1986)
106. Lions, P.L., Souganidis, P.E.: Convergence of MUSCL and filtered schemes for scalar conservation laws and Hamilton–
Jacobi equations. Numerische Mathematik 69(4), 441–470 (1995). https://doi.org/10.1007/s002110050102
107. Long, Z., Lu, Y., Dong, B.: PDE-net 2.0: Learning PDEs from data with a numeric-symbolic hybrid deep network. J.
Comput. Phys. 399, 108925 (2019). https://doi.org/10.1016/j.jcp.2019.108925
108. Long, Z., Lu, Y., Ma, X., Dong, B.: PDE-net: Learning PDEs from data. (2017). arXiv preprint arXiv:1710.09668
109. Lye, K.O., Mishra, S., Ray, D.: Deep learning observables in computational fluid dynamics. (2019). arXiv preprint
arXiv:1903.03040
110. McEneaney, W.: Max-Plus Methods for Nonlinear Control and Estimation. Springer, New York (2006)
111. McEneaney, W.: A curse-of-dimensionality-free numerical method for solution of certain HJB PDEs. SIAM J. Control
Optim. 46(4), 1239–1276 (2007). https://doi.org/10.1137/040610830
112. McEneaney, W.M., Deshpande, A., Gaubert, S.: Curse-of-complexity attenuation in the curse-of-dimensionality-free
method for HJB PDEs. In: 2008 American Control Conference, pp. 4684–4690. IEEE (2008)
113. McEneaney, W.M., Kluberg, L.J.: Convergence rate for a curse-of-dimensionality-free method for a class of HJB PDEs.
SIAM J. Control Optim. 48(5), 3052–3079 (2009)
114. McFall, K.S., Mahan, J.R.: Artificial neural network method for solution of boundary value problems with exact
satisfaction of arbitrary boundary conditions. IEEE Trans. Neural Netw. 20(8), 1221–1233 (2009). https://doi.org/10.
1109/TNN.2009.2020735
115. Meade, A., Fernandez, A.: The numerical solution of linear ordinary differential equations by feedforward neural
networks. Math. Comput. Modell. 19(12), 1–25 (1994). https://doi.org/10.1016/0895-7177(94)90095-7
116. Meng, X., Karniadakis, G.E.: A composite neural network that learns from multi-fidelity data: Application to function
approximation and inverse PDE problems. (2019). arXiv preprint arXiv:1903.00104
117. Meng, X., Li, Z., Zhang, D., Karniadakis, G.E.: PPINN: Parareal physics-informed neural network for time-dependent
PDEs. (2019). arXiv preprint arXiv:1909.10145
118. van Milligen, B.P., Tribaldos, V., Jiménez, J.A.: Neural network differential equation and plasma equilibrium solver.
Phys. Rev. Lett. 75, 3594–3597 (1995). https://doi.org/10.1103/PhysRevLett.75.3594
119. Motta, M., Rampazzo, F.: Nonsmooth multi-time Hamilton–Jacobi systems. Indiana Univ. Math. J. 55(5), 1573–1614
(2006)
120. Niarchos, K.N., Lygeros, J.: A neural approximation to continuous time reachability computations. In: Proceedings of
the 45th IEEE Conference on Decision and Control, pp. 6313–6318 (2006). https://doi.org/10.1109/CDC.2006.377358
121. Osher, S., Shu, C.: High-order essentially nonoscillatory schemes for Hamilton–Jacobi equations. SIAM J. Numer. Anal.
28(4), 907–922 (1991). https://doi.org/10.1137/0728049
122. Pang, G., Lu, L., Karniadakis, G.E.: fPINNs: Fractional physics-informed neural networks. SIAM J. Sci. Comput. 41(4),
A2603–A2626 (2019)
123. Pham, H., Pham, H., Warin, X.: Neural networks-based backward scheme for fully nonlinear PDEs. (2019). arXiv preprint
arXiv:1908.00412
124. Pinkus, A.: Approximation theory of the MLP model in neural networks. In: Acta numerica, 1999, Acta Numer., vol. 8,
pp. 143–195. Cambridge University Press, Cambridge (1999)
125. Plaskacz, S., Quincampoix, M.: Oleinik–Lax formulas and multitime Hamilton–Jacobi systems. Nonlinear Anal. Theory,
Methods Appl. 51(6), 957–967 (2002). https://doi.org/10.1016/S0362-546X(01)00871-9
126. Raissi, M.: Deep hidden physics models: Deep learning of nonlinear partial differential equations. J. Mach. Learn. Res.
19(1), 932–955 (2018)
127. Raissi, M.: Forward-backward stochastic neural networks: Deep learning of high-dimensional partial differential
equations. (2018). arXiv preprint arXiv:1804.07010
128. Raissi, M., Perdikaris, P., Karniadakis, G.: Physics-informed neural networks: a deep learning framework for solving
forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707
(2019). https://doi.org/10.1016/j.jcp.2018.10.045
129. Raissi, M., Perdikaris, P., Karniadakis, G.E.: Physics informed deep learning (part i): Data-driven solutions of nonlinear
partial differential equations. (2017). arXiv preprint arXiv:1711.10561
130. Raissi, M., Perdikaris, P., Karniadakis, G.E.: Physics informed deep learning (part ii): Data-driven discovery of nonlinear
partial differential equations. (2017). arXiv preprint arXiv:1711.10566
131. Reisinger, C., Zhang, Y.: Rectified deep neural networks overcome the curse of dimensionality for nonsmooth value
functions in zero-sum games of nonlinear stiff systems. (2019). arXiv preprint arXiv:1903.06652
132. Rochet, J.: The taxation principle and multi-time Hamilton–Jacobi equations. J. Math. Econ. 14(2), 113–128 (1985).
https://doi.org/10.1016/0304-4068(85)90015-1
133. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
134. Royo, V.R., Tomlin, C.: Recursive regression with neural networks: Approximating the HJI PDE solution. (2016). arXiv
preprint arXiv:1611.02739
135. Rudd, K., Muro, G.D., Ferrari, S.: A constrained backpropagation approach for the adaptive solution of partial differential
equations. IEEE Trans. Neural Netw. Learn. Syst. 25(3), 571–584 (2014). https://doi.org/10.1109/TNNLS.2013.2277601
136. Ruthotto, L., Osher, S., Li, W., Nurbekyan, L., Fung, S.W.: A machine learning framework for solving high-dimensional
mean field game and mean field control problems. (2019). arXiv preprint arXiv:1912.01825
137. Schmidhuber, J.: Deep learning in neural networks: an overview. Neural Netw. 61, 85–117 (2015). https://doi.org/10.
1016/j.neunet.2014.09.003
138. Sirignano, J., Spiliopoulos, K.: DGM: A deep learning algorithm for solving partial differential equations. J. Comput.
Phys. 375, 1339–1364 (2018). https://doi.org/10.1016/j.jcp.2018.08.029
139. Tang, W., Shan, T., Dang, X., Li, M., Yang, F., Xu, S., Wu, J.: Study on a Poisson’s equation solver based on deep learning
technique. In: 2017 IEEE Electrical Design of Advanced Packaging and Systems Symposium (EDAPS), pp. 1–3 (2017).
https://doi.org/10.1109/EDAPS.2017.8277017
140. Tassa, Y., Erez, T.: Least squares solutions of the HJB equation with neural network value-function approximators. IEEE
Trans. Neural Netw. 18(4), 1031–1041 (2007). https://doi.org/10.1109/TNN.2007.899249
141. Tho, N.: Hopf-Lax-Oleinik type formula for multi-time Hamilton–Jacobi equations. Acta Math. Vietnamica 30, 275–287
(2005)
142. Todorov, E.: Efficient computation of optimal actions. Proc. Natl. Acad. Sci. 106(28), 11478–11483 (2009)
143. Uchiyama, T., Sonehara, N.: Solving inverse problems in nonlinear PDEs by recurrent neural networks. In: IEEE
International Conference on Neural Networks, pp. 99–102. IEEE (1993)
144. E, W., Yu, B.: The deep Ritz method: a deep learning-based numerical algorithm for solving variational problems.
Commun. Math. Stat. 6(1), 1–12 (2018)
145. E, W., Han, J., Jentzen, A.: Deep learning-based numerical methods for high-dimensional parabolic partial differential
equations and backward stochastic differential equations. Commun. Math. Stat. 5(4), 349–380 (2017). https://doi.org/
10.1007/s40304-017-0117-6
146. E, W., Hutzenthaler, M., Jentzen, A., Kruse, T.: Multilevel picard iterations for solving smooth semilinear parabolic heat
equations (2016)
147. Widder, D.V.: The Heat Equation, vol. 67. Academic Press, New York (1976)
148. Yadav, N., Yadav, A., Kumar, M.: An introduction to neural network methods for differential equations. SpringerBriefs
in Applied Sciences and Technology. Springer, Dordrecht (2015). https://doi.org/10.1007/978-94-017-9816-7
149. Yang, L., Zhang, D., Karniadakis, G.E.: Physics-informed generative adversarial networks for stochastic differential
equations. (2018). arXiv preprint arXiv:1811.02033
150. Yang, Y., Perdikaris, P.: Adversarial uncertainty quantification in physics-informed neural networks. J. Comput. Phys.
394, 136–152 (2019)
151. Yegorov, I., Dower, P.M.: Perspectives on characteristics based curse-of-dimensionality-free numerical approaches
for solving Hamilton–Jacobi equations. Appl. Math. Optim. 1–49 (2017)
152. Zhang, D., Guo, L., Karniadakis, G.E.: Learning in modal space: solving time-dependent stochastic PDEs using physics-
informed neural networks. (2019). arXiv preprint arXiv:1905.01205
153. Zhang, D., Lu, L., Guo, L., Karniadakis, G.E.: Quantifying total uncertainty in physics-informed neural networks for
solving forward and inverse stochastic problems. J. Comput. Phys. 397, 108850 (2019)
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.