Darbon - Overcoming The Curse of Dimensionality For Some Hamilton-Jacobi Pde Via NN
Darbon - Overcoming The Curse of Dimensionality For Some Hamilton-Jacobi Pde Via NN
Darbon - Overcoming The Curse of Dimensionality For Some Hamilton-Jacobi Pde Via NN
RESEARCH
* Correspondence:
[email protected] Abstract
Division of Applied Mathematics,
Brown University, Providence,
We propose new and original mathematical connections between Hamilton–Jacobi
USA (HJ) partial differential equations (PDEs) with initial data and neural network
Research supported by NSF DMS architectures. Specifically, we prove that some classes of neural networks correspond to
1820821. Authors’ names are
given in last/family name
representation formulas of HJ PDE solutions whose Hamiltonians and initial data are
alphabetical order obtained from the parameters of the neural networks. These results do not rely on
universal approximation properties of neural networks; rather, our results show that
some classes of neural network architectures naturally encode the physics contained in
some HJ PDEs. Our results naturally yield efficient neural network-based methods for
evaluating solutions of some HJ PDEs in high dimension without using grids or
numerical approximations. We also present some numerical results for solving some
inverse problems involving HJ PDEs using our proposed architectures.
1 Introduction
The Hamilton–Jacobi (HJ) equations are an important class of partial differential equation
(PDE) models that arise in many scientific disciplines, e.g., physics [6,25,26,33,101], imag-
ing science [38–40], game theory [13,24,49,82], and optimal control [9,46,55,56,110].
Exact or approximate solutions to these equations then give practical insight about the
models in consideration. We consider here HJ PDEs specified by a Hamiltonian function
H : Rn → R and convex initial data J : Rn → R
⎧
⎨ ∂S (x, t) + H(∇x S(x, t)) = 0 in Rn × (0, +∞),
∂t (1)
⎩
S(x, 0) = J (x) in Rn ,
where ∂S
∂t (x, t) and ∇x S(x, t) = ∂S
∂x1 (x, t), . . . , ∂S
∂xn (x, t) denote the partial derivative with
respect to t and the gradient vector with respect to x of the function (x, t) → S(x, t), and
the Hamiltonian H only depends on the gradient ∇x S(x, t).
0123456789().,–: volV
20 Page 2 of 50 J. Darbon et al. Res Math Sci (2020)7:20
Our main motivation is to compute the viscosity solution of certain HJ PDEs of the
form of (1) in high dimension for a given x ∈ Rn and t > 0 [9–11,34] by leveraging
new efficient hardware technologies and silicon-based electric circuits dedicated to neu-
ral networks. As noted by LeCun in [102], the use of neural networks has been greatly
influenced by available hardware. In addition, there have been many initiatives to cre-
ate new hardware for neural networks that yield extremely efficient (in terms of speed,
latency, throughput or energy) implementations: For instance, [50–52] propose efficient
neural network implementations using field-programmable gate array, [8] optimizes neu-
ral network implementations for Intel’s architecture, and [96] provides efficient hardware
implementation of certain building blocks widely used in neural networks. It is also worth
mentioning that Google created specific hardware, called “Tensor Processor Unit” [87] to
implement their neural networks in data centers. Note that Xilinx announced a new set of
hardware (Versal AI core) for implementing neural networks while Intel enhances their
processors with specific hardware instructions for neural networks. LeCun also suggests
in [102, Section 3] possible new trends for hardware dedicated to neural networks. Finally,
we refer the reader to [30] (see also [69]) that describes the evolution of silicon-based elec-
trical circuits for machine learning.
In this paper, we propose classes of neural network architectures that exactly represent
viscosity solutions of certain HJ PDEs of the form of (1). Our results pave the way to lever-
age efficient dedicated hardware implementation of neural networks to evaluate viscosity
solutions of certain HJ PDEs for initial data which takes a particular form.
Related work The viscosity solution to the HJ PDE (1) rarely admits a closed-form expres-
sion, and in general it must be computed with numerical algorithms or other methods
tailored for the Hamiltonian H, initial data J , and dimension n.
The dimensionality, in particular, matters significantly because in many applications
involving HJ PDE models, the dimension n is extremely large. In imaging problems, for
example, the vector x typically corresponds to a noisy image whose entries are its pixel
values, and the associated Hamilton–Jacobi equations describe the solution to an image
denoising convex optimization problem [38,39]. Denoising a 1080 x 1920 standard full
HD image on a smartphone, for example, corresponds to solving a HJ PDE in dimension
n = 1080 × 1920 = 2,073,600.
Unfortunately, standard grid-based numerical algorithms for PDEs are impractical when
n > 4. Such algorithms employ grids to discretize the spatial and time domain, and the
number of grid points required to evaluate accurately solutions of PDEs grows exponen-
tially with the dimension n. It is therefore essentially impossible in practice to numerically
solve PDEs in high dimension using grid-based algorithms, even with sophisticated high-
order accuracy methods for HJ PDEs such as ENO [121], WENO [84], and DG [75]. This
problem is known as the curse of dimensionality [17].
Overcoming the curse of dimensionality in general remains an open problem, but for
HJ PDEs several methods have been proposed to solve it. These include, but are not
limited to, max-plus algebra methods [2,3,45,54,60,110–113], dynamic programming
and reinforcement learning [4,19], tensor decomposition techniques [44,73,142], sparse
grids [20,59,90], model order reduction [5,97], polynomial approximation [88,89], multi-
level Picard method [79–81,146], optimization methods [38–40,151] and neural networks
[7,42,64,76,77,83,100,120,131,134,136,138]. Among these methods, neural networks
have become increasingly popular tools to solve PDEs [7,14–16,18,29,31,41–43,53,58,
J. Darbon et al. Res Math Sci (2020)7:20 Page 3 of 50 20
62–65,74,76–78,85,92,93,98–100,104,109,114,115,118,120,123,131,134–136,138–140,
144,145,148–150] and inverse problems involving PDEs [107,108,116,117,122,126–
130,143,149,152,153]. Their popularity is due to universal approximation theorems that
state that neural networks can approximate broad classes of (high-dimensional, non-
linear) functions on compact sets [35,71,72,124]. These properties, in particular, have
been recently leveraged to approximate solutions to high-dimensional nonlinear HJ PDEs
[64,138] and for the development of physics-informed neural networks that aim to solve
supervised learning problems while respecting any given laws of physics described by a
set of nonlinear PDEs [128].
In this paper, we propose some neural network architectures that exactly represent
viscosity solutions to HJ PDEs of the form of (1), where the Hamiltonians and initial
data are obtained from the parameters of the neural network architectures. Recall our
results require the initial data J to be convex and the Hamiltonian H to only depend on
the gradient ∇x S(x, t) [see Eq. (1)]. In other words, we show that some neural networks
correspond to exact representation formulas of HJ PDE solutions. To our knowledge, this
is the first result that shows that certain neural networks can exactly represent solutions
of certain HJ PDEs.
Note that an alternative method to numerically evaluate solutions of HJ PDEs of the
form of (1) with convex initial data has been proposed in [40]. This method relies on
the Hopf formula and is only based on optimization. Therefore, this method is grid and
approximation-free and works well in high dimension. Contrary to [40], our proposed
approach does not rely on any (possibly non-convex) optimization techniques.
Contributions of this paper In this paper, we prove that some classes of shallow neural
networks are, under certain conditions, viscosity solutions to Hamilton–Jacobi equations
for initial data which takes a particular form. The main result of this paper is Theorem 3.1.
We show in this theorem that the neural network architecture illustrated in Fig. 1 rep-
resents, under certain conditions, the viscosity solution to a set of first-order HJ PDEs of
the form of (1), where the Hamiltonians and the convex initial data are obtained from the
parameters of the neural network. As a corollary of this result for the one-dimensional
case, we propose a second neural network architecture (illustrated in Fig. 4) that repre-
sents the spatial gradient of the viscosity solution of the HJ PDE above in 1D and show
in Proposition 3.1 that under appropriate conditions, this neural network corresponds to
entropy solutions of some conservation laws in 1D.
Let us emphasize that the proposed architecture in Fig. 1 for representing solutions to
HJ PDEs allows us to numerically evaluate their solutions in high dimension without using
grids.
We also stress that our results do not rely on universal approximation properties of
neural networks. Instead, our results show that the physics contained in HJ PDEs satisfying
the conditions of Theorem 3.1 can naturally be encoded by the neural network architecture
depicted in Fig. 1. Our results further suggest interpretations of this neural network
architecture in terms of solutions to PDEs.
We also test the proposed neural network architecture (depicted in Fig. 1) on some
inverse problems. To do so, we consider the following problem. Given training data sam-
pled from the solution S of a first-order HJ PDE (1) with unknown convex initial function
J and Hamiltonian H, we aim to recover the unknown initial function. After the training
process using the Adam optimizer, the trained neural network with input time variable
20 Page 4 of 50 J. Darbon et al. Res Math Sci (2020)7:20
2 Background
In this section, we introduce mathematical concepts that will be used in this paper. We
review the standard structure of shallow neural networks from a mathematical point of
view in Sect. 2.1 and present some fundamental definitions and results in convex analysis
in Sect. 2.2. For the notation, we use Rn to denote the n-dimensional Euclidean space. The
Euclidean scalar product and Euclidean norm on Rn are denoted by ·, · and · 2 . The set
of matrices with m rows and n columns with real entries is denoted by Mm,n (R).
Rn × Rn × R (x, w i , bi ) → w i , x + bi .
J. Darbon et al. Res Math Sci (2020)7:20 Page 5 of 50 20
These m affine functions can be succinctly written in vector form as W x + b, where the
matrix W ∈ Mm,n (R) has for rows the weights w i and the vector b ∈ Rm has for entries
the biases bi . The output layer comprises a nonlinear function σ : Rm → R that takes for
input the vector W x + b of affine functions and gives the number
Rn × Rn × R (x, w i , bi ) → σ (W x + b) .
The nonlinear function σ is called the activation function of the output layer.
i=1 ⊂ R ×
In Sect. 4, we will consider the following problem: Given data points {(xi , yi )}N n
R, infer the relationship between the input xi ’s and the output yi ’s. To infer this relation,
we assume that the output takes the form (or can be approximated by) yi = σ (W xi + b)
for some known activation function σ , unknown matrix of weights W ∈ Mm,n (R), and
unknown vector of bias b. A standard approach to solve such a problem is to estimate the
weights w i and biases bi so as to minimize the mean square error
1
N
{(w̄ i , b̄i )}m
i=1 ∈ arg min (σ (W xi + b) − yi )2 . (2)
{(w i ,bi )}m N
i=1 ⊂R ×R
n
i=1
In the field of machine learning, solving this minimization problem is called the learning
or training process. The data {(xi , yi )}N
i=1 used in the training process is called training
data. Finding a global minimizer is generally difficult due to the complexity of the mini-
mization problem and that the objective function is not convex with respect to the weights
and biases. State-of-the-art algorithms for solving these problems are stochastic gradient
descent-based methods with momentum acceleration, such as the Adam optimizer for
neural networks [94]. This algorithm will be used in our numerical experiments.
Definition 1 (Convex sets, relative interiors, and convex hulls) A set C ⊂ Rn is called
convex if for any λ ∈ [0, 1] and any x, y ∈ C, the element λx + (1 − λ)y is in C. The relative
interior of a convex set C ⊂ Rn , denoted by ri C, consists of the points in the interior of
the unique smallest affine set containing C. The convex hull of a set C, denoted by conv C,
consists of all the convex combinations of the elements of C. An important example of a
convex hull is the unit simplex in Rn , which we denote by
n
Λn := (α1 , . . . , αn ) ∈ [0, 1]n : αi = 1 . (3)
i=1
A function f is called proper if its domain is non-empty and f (x) > −∞ for every x ∈ Rn .
20 Page 6 of 50 J. Darbon et al. Res Math Sci (2020)7:20
The subdifferential ∂f (x) is a closed convex set whenever it is non-empty, and any vector
p ∈ ∂f (x) is called a subgradient of f at x. If f is a proper convex function, then ∂f (x) = ∅
whenever x ∈ ri (dom f ), and ∂f (x) = ∅ whenever x ∈ / dom J [133, Thm. 23.4]. If a convex
function f is differentiable at x0 ∈ R , then its gradient ∇x f (x0 ) is the unique subgradient
n
with equality attained if and only if p ∈ ∂f (x), if and only if x ∈ ∂f ∗ (p) [68, Cor. X.1.4.4].
Table 1 Notation used in this paper. Here, we use C to denote a set in Rn , f to denote a
function from Rn to R ∪ {+∞} and x to denote a vector in Rn
Notation Meaning Definition
·, · Euclidean scalar product in Rn x, y:= ni=1 xi yi
√
· 2 Euclidean norm in Rn x 2 := x, x
ri C Relative interior of C The interior of C with respect to
the minimal hyperplane contain-
ing C in Rn
conv C Convex hull of C The set containing all convex
combinations of the elements of
C
n
Λn Unit simplex in Rn (α1 , . . . , αn ) ∈ [0, 1]n : i=1 αi = 1
dom f Domain of f {x ∈ Rn : f (x) < +∞}
Γ0 (Rn ) A useful and standard class of convex functions The set containing all proper, con-
vex, lower semicontinuous func-
tions from Rn to R ∪ {+∞}
co f Convex envelope of f The largest convex function such
that co f (x) f (x) for every x ∈
Rn
co f Convex and lower semicontinuous envelope of f The largest convex and lower
semicontinuous function such
that co f (x) f (x) for every x ∈
Rn
∂f (x) Subdifferential of f at x {p ∈ Rn : f (y) f (x) + p, y −
x ∀y ∈ Rn }
f∗ Fenchel–Legendre transform of f f ∗ (p):= supx∈Rn {p, x − f (x)}
Fig. 1 Illustration of the structure of the neural network (8) that can represent the viscosity solution to
first-order Hamilton–Jacobi equations for initial data which takes a particular form
3.1 Setup
In this section, we consider the function f : Rn × [0, +∞) → R given by the neural
network in Fig. 1. Mathematically, the function f can be expressed using the following
formula
Our goal is to show that the function f in (8) is the unique uniformly continuous
viscosity solution to a suitable Hamilton–Jacobi equation. In what follows, we denote
f (x, t; {(pi , θi , γi )}m
i=1 ) by f (x, t) when there is no ambiguity in the parameters.
20 Page 8 of 50 J. Darbon et al. Res Math Sci (2020)7:20
there holds i =j αi θi > θj .
Note that (A3) is not a strong assumption. Indeed, if there exist j ∈ {1, . . . , m} and
(α1 , . . . , αm ) ∈ Rm satisfying Eq. (9) and i=j αi θi θj , then
pj , x − tθj − γj αi (pi , x − tθi − γi ) max{ pi , x − tθi − γi }.
i =j
i =j
As a result, the jth neuron in the network can be removed without changing the value
of f (x, t) for any x ∈ Rn and t 0. Removing all such neurons in the network, we can
therefore assume (A3) holds.
Our aim is to identify the HJ equations whose viscosity solutions correspond to the
neural network f defined by Eq. (8). Here, x and t play the role of the spatial and time
variables, and f (·, 0) corresponds to the initial data. To simplify the notation, we define
the function J : Rn → R as
and the set Ix as the collection of maximizers in Eq. (10) at x, that is,
Note that the initial data J given by (10) is a convex and polyhedral function, and it satisfies
several properties that we describe in the following lemma.
(i) The Fenchel–Legendre transform of J is given by the convex and lower semicontinuous
function
⎧ m
⎪
⎪
⎪
⎨ min αi γ i if p ∈ conv ({pi }m
i=1 ),
∗ (α
1 ,...,αm )∈Λm
J (p) = m i=1 (12)
⎪ i=1 αi pi =p
⎪
⎪
⎩
+∞ otherwise.
(ii) Let p ∈ dom J ∗ and x ∈ ∂J ∗ (p). Then, (α1 , . . . , αm ) ∈ Rm is a minimizer in Eq. (12)
if and only if it satisfies the constraints
(a) (α1 , . . . , αm ) ∈ Λm ,
m
(b) i=1 αi pi = p,
(c) αi = 0 for any i ∈ / Ix .
Having defined the initial condition J , the next step is to define a Hamiltonian H. To do
so, first denote by A(p) the set of minimizers in Eq. (12) evaluated at p ∈ dom J ∗ , i.e.,
m
A(p):= arg min αi γ i . (13)
(α ,...αm )∈Λm
1m i=1
i=1 αi pi =p
Note that the set A(p) is non-empty for every p ∈ dom J ∗ by Lemma 3.1(i). Now, we
define the Hamiltonian function H : Rn → R ∪ {+∞} by
⎧ m
⎪
⎪
⎨ inf αi θi if p ∈ dom J ∗ ,
H(p):= α∈ A(p) (14)
⎪
⎪
i=1
⎩
+∞ otherwise.
The function H defined in (14) is a polyhedral function whose properties are stated in the
following lemma.
(i) For every p ∈ dom J ∗ , the set A(p) is compact and Eq. (14) has at least one minimizer.
(ii) The restriction of H to dom J ∗ is a bounded and continuous function.
(iii) There holds H(pi ) = θi for each i ∈ {1, . . . , m}.
(A1)-(A3), and let f be the neural network defined by Eq. (8) with these parameters. Let J
and H be the functions defined in Eqs. (10) and (14), respectively, and let H̃ : Rn → R be
a continuous function. Then the following two statements hold.
(i) The neural network f is the unique uniformly continuous viscosity solution to the
first-order Hamilton–Jacobi equation
⎧
⎨ ∂f (x, t) + H(∇x f (x, t)) = 0, in Rn × (0, +∞),
∂t (15)
⎩
f (x, 0) = J (x), in Rn .
if and only if H̃(pi ) = H(pi ) for each i ∈ {1, . . . , m} and H̃(p) H(p) for every
p ∈ dom J ∗ .
Remark 1 This theorem identifies the set of HJ equations with initial data J whose solution
is given by the neural network f . To each such HJ equation, there corresponds a continuous
Hamiltonian H̃ satisfying H̃(pi ) = H(pi ) for every i = {1, . . . , m} and H̃(p) H(p) for
every p ∈ dom J ∗ . The smallest possible Hamiltonian satisfying these constraints is the
function H defined in (14), and its corresponding HJ equation is given by (15).
Example 1 In this example, we consider the HJ PDE with initial data J true (x) = x 1 and
p 2
the Hamiltonian H true (p) =− 2
2
for all x, p ∈ R . The viscosity solution to this HJ PDE
n
is given by
nt
S(x, t) = x 1 + = max {pi , x − tθi − γi } for every x ∈ Rn and t 0,
2 i∈{1,...,m}
Theorem 3.1 stipulates that S solves the HJ PDE (16) if and only if H̃(pi ) = − n2 for every
i ∈ {1, . . . , m} and H̃(p) ≥ − n2 for every p ∈ [−1, 1]n \ {pi }m
i=1 . The Hamiltonian H
true is
Example 2 In this example, we consider the case when J true (x) = x ∞ and H true (p) =
p 2
− 2 2 for every x, p ∈ Rn . Denote by ei the ith standard unit vector in Rn . Let m = 2n,
n
{pi }m
i=1 = {±e i }i=1 , θi = − 2 , and γi = 0
n for every i ∈ {1, . . . , m}. The viscosity solution S
is given by
nt
S(x, t) = x ∞ + = max {pi , x − tθi − γi } for every x ∈ Rn and t 0.
2 i∈{1,...,m}
Hence, S can be represented using the proposed neural network with parameters
{(pi , − n2 , 0)}m
i=1 . Similarly, as in the first example, we compute J and H and obtain the
following results
where Bn denotes the unit ball with respect to the l 1 norm in Rn , i.e., Bn = conv {±ei :
i ∈ {1, . . . , n}}. By Theorem 3.1, S is a viscosity solution to the HJ PDE (16) if and only
if H̃(pi ) = − n2 for every i ∈ {1, . . . , m} and H̃(p) ≥ − n2 for every p ∈ Bn \{pi }m
i=1 . The
Hamiltonian H true is one candidate satisfying these constraints.
{(pi , θi , γi )}2n n n
i=1 = {(e i , 1, 0)}i=1 ∪ {(−e i , 1, 0)}i=1 ,
(p2n+1 , θ2n+1 , γ2n+1 ) = (0, 0, 0),
1
{(pi , θi , γi )}2n+5
i=2n+2 = √ (αe 1 + βe 2 , 2, 0) : α, β ∈ {±1} ,
2
where ei is the ith standard unit vector in Rn and 0 denotes the zero vector in Rn . The
functions J and H defined by (10) and (14) coincide with the underlying true initial data
J true and Hamiltonian H true . Therefore, by Theorem 3.1, the proposed neural network
represents the viscosity solution to the HJ PDE. In other words, given the true parameters
{(pi , θi , γi )}m
i=1 , the proposed neural network solves this HJ PDE without the curse of
dimensionality. We illustrate the solution with dimension n = 16 in Fig. 2, which shows
several slices of the solution evaluated at x = (x1 , x2 , 0, . . . , 0) ∈ R16 and t = 0, 1, 2, 3 in
figures 2(A), 2(B), 2(C), 2(D), respectively. In each figure, the x and y axes correspond to
the first two components x1 and x2 in x, while the color represents the function value
S(x, t).
Remark 2 Let > 0 and consider the neural network f : Rn × [0, +∞) → R defined by
m
f (x, t):= log e(pi ,x−tθi −γi )/ (17)
i=1
20 Page 12 of 50 J. Darbon et al. Res Math Sci (2020)7:20
Fig. 2 Solution S : R16 × [0, +∞) → R to the HJ PDE in Example 3 is solved using the proposed neural
network. Several slices of the solution S evaluated at x = (x1 , x2 , 0, . . . , 0) and t = 0, 1, 2, 3 are shown in figures
2(A), 2(B), 2(C), 2(D), respectively. In each figure, the x and y axes correspond to the first two components x1
and x2 in the variable x, while the color represents the function value S(x, t)
and illustrated in Fig. 3. This neural network substitutes the non-smooth maximum acti-
vation function in the neural network f defined by Eq. (8) (and depicted in Fig. 1) with
2
a smooth log-exponential activation function. When the parameter θi = − 12 pi 2 , then
the neural network f is the unique, jointly convex and smooth solution to the following
viscous HJ PDE
⎧
⎪ ∂f (x, t) 1 2
⎪
⎪ − ∇x f (x, t)2 = Δx f (x, t) in Rn × (0, +∞),
⎨ ∂t 2 2
m (18)
⎪
⎪
⎩f (x, 0) = log
⎪ e(pi ,x−γi )/ in Rn .
i=1
This result relies on the Cole–Hopf transformation ([47], Sect. 4.4.1); see Appendix C for
the proof. While this neural network architecture represents, under certain conditions,
the solution to the viscous HJ PDE (18), we note that the particular form of the convex
initial data in the HJ PDE
(18), which effectively corresponds to a soft Legendre transform
in that lim →0 log m (p ,x −γ ) / = maxi∈{1,...,m} { pi , x − γi }, severely restricts
i=1 e
i i
>0
the practicality of this result.
J. Darbon et al. Res Math Sci (2020)7:20 Page 13 of 50 20
Fig. 3 Illustration of the structure of the neural network (17) that represents the solution to a subclass of
second-order HJ equations when θi = − 12 pi 22 for i ∈ {1, . . . , m}
⎧
⎨ ∂u (x, t) + ∇x H(u(x, t)) = 0 in R × (0, +∞),
∂t (19)
⎩
u(x, 0) = u0 (x):=∇J (x) in R,
where the flux function corresponds to the Hamiltonian H in the HJ equation. Here,
we assume that the initial data J is convex and globally Lipschitz continuous, and the
symbols ∇ and ∇x in this section correspond to derivatives in the sense of distribution if
the classical derivatives do not exist.
In this section, we show that the conservation law derived from the HJ equation (1) can
be represented by a neural network architecture. Specifically, the corresponding entropy
solution u(x, t) ≡ ∇x f (x, t) to the one-dimensional conservation law (19) can be repre-
sented using a neural network architecture with an argmax based activation function, i.e.,
The structure of this network is shown in Fig. 4. When more than one maximizer exist in
the optimization problem above, one can choose any maximizer j and define the value to
be pj . We now prove that the function ∇x f given by the neural network (20) is indeed the
entropy solution to the one-dimensional conservation law (19) with flux function H and
initial data ∇J , where H and J are defined by Eqs. (14) and (10), respectively.
Proposition 3.1 Consider the one-dimensional case, i.e., n = 1. Suppose the parameters
{(pi , θi , γi )}m
i=1 ⊂ R × R × R satisfy assumptions (A1)–(A3), and let u:=∇x f be the neural
network defined in Eq. (20) with these parameters. Let J and H be the functions defined
in Eqs. (10) and (14), respectively, and let H̃ : R → R be a locally Lipschitz continuous
function. Then, the following two statements hold.
20 Page 14 of 50 J. Darbon et al. Res Math Sci (2020)7:20
Fig. 4 Illustration of the structure of the neural network (20) that can represent the entropy solution to
one-dimensional conservation laws
(i) The neural network u is the entropy solution to the conservation law
⎧
⎨ ∂u (x, t) + ∇x H(u(x, t)) = 0 in R × (0, +∞),
∂t (21)
⎩
u(x, 0) = ∇J (x) in R.
(ii) The neural network u is the entropy solution to the conservation law
⎧
⎨ ∂u (x, t) + ∇x H̃(u(x, t)) = 0 in R × (0, +∞),
∂t (22)
⎩
u(x, 0) = ∇J (x) in R,
if and only if there exists a constant C ∈ R such that H̃(pi ) = H(pi ) + C for every
i ∈ {1, . . . , m} and H̃(p) H(p) + C for any p ∈ conv {pi }m
i=1 .
Example 4 Here, we give one example related to Example 1. Consider J true (x) = |x| and
2
H true (p) = − p2 for every x, p ∈ R. The entropy solution u to the corresponding one
dimensional conservation law is given by
⎧
⎨1 if x > 0,
u(x, t) =
⎩−1 if x < 0.
This solution u can be represented using the neural network in Fig. 4 with m = 2, p1 = 1,
p2 = −1, θ1 = θ2 = − 12 and γ1 = γ2 = 0. To be specific, we have
The initial data J and Hamiltonian H defined in Eqs. (10) and (14) are given by
By Proposition 3.1, u solves the one-dimensional conservation law (22) if and only if there
exists some constant C ∈ R such that H̃(±1) = − 12 + C and H̃(p) − 12 + C for every
p ∈ (−1, 1). Note that H true is one candidate satisfying these constraints.
4 Numerical experiments
4.1 First-order Hamilton–Jacobi equations
In this subsection, we present several numerical experiments to test the effectiveness of
the Adam optimizer using our proposed architecture (depicted in Fig. 1) for solving some
inverse problems. We focus on the following inverse problem: We are given data samples
from a function S : Rn × [0, +∞) → R that is the viscosity solution to an HJ equation (1)
with unknown convex initial data J and Hamiltonian H, which only depends on ∇x S(x, t).
Our aim is to recover the convex initial data J . We propose to learn the neural network
using machine learning techniques to recover the convex initial data J . We shall see that
this approach also provides partial information on the Hamiltonian H.
Specifically, given data samples {(xj , tj , S(xj , tj ))}N
j=1 , where {(x j , tj )}j=1 ⊂ R × [0, +∞),
N n
we train the neural network f with structure in Fig. 1 using the mean square loss function
defined by
1
N
2
l({(pi , θi , γi )}m
i=1 ) = |f (xj , tj ; {(pi , θi , γi )}m
i=1 ) − S(x j , tj )| .
N
j=1
L:= arg max{ pi , x − tθi − γi },
x∈Rn , t≥0 i∈{1,...,m}
and we finally use each effective parameter (pl , θl ) for l ∈ L to approximate the point
(pl , H(pl )) on the graph of the Hamiltonian. In practice, we approximate the set L using a
large number of points (x, t) sampled in the domain Rn × [0, +∞).
20 Page 16 of 50 J. Darbon et al. Res Math Sci (2020)7:20
By Theorem 3.1, this function S is a viscosity solution to the HJ equations whose Hamil-
tonian and initial function are the piecewise affine functions defined in Eqs. (14) and (10),
respectively. In other words, S solves the HJ equation with initial data J satisfying
where A(p) is the set of maximizers of the corresponding maximization problem in Eq.
(25). Specifically, if we construct a neural network f as shown in Fig. 1 with the underlying
parameters {(ptrue i , θi
true , γ true )}m , then the function given by the neural network is exactly
i i=1
the same as the function S. In other words, {(ptrue i , θi
true , γ true )}m is a global minimizer
i i=1
for the training problem (23) with the global minimal loss value equal to zero.
Now, we train the neural network f with training data {(xj , tj , S(xj , tj ))}N j=1 , where the
points {(xj , tj )}N j=1 are randomly sampled in R n × [0, +∞) with respect to the standard
normal distribution for each j ∈ {1, . . . , N }. (We take the absolute value for t to make sure
it is nonnegative.) Here and after, the number of training data points is N = 20,000. We
run 60,000 descent steps using the Adam optimizer to train the neural network f . The
parameters for the Adam optimizer are chosen to be β1 = 0.5, β2 = 0.9, the learning rate
is 10−4 and the batch size is 500.
To measure the performance of the training process, we compute the relative mean
square errors of the sorted parameters in the trained neural network, denoted by
{(pi , θi , γi )}m
i=1 , and the sorted underlying true parameters {(pi , θi
true true , γ true )}m . To be
i i=1
specific, the errors are computed as follows
J. Darbon et al. Res Math Sci (2020)7:20 Page 17 of 50 20
Table 2 Relative mean square errors of the parameters in the neural network f with 2
neurons in different cases and different dimensions averaged over 100 repeated
experiments
# Case Case 1 Case 2 Case 3 Case 4
Averaged Relative Errors of {pi } 2D 4.10E−03 2.10E−03 3.84E−03 2.82E−03
4D 1.41E−09 1.20E−09 1.38E−09 1.29E−09
8D 1.14E−09 1.03E−09 1.09E−09 1.20E−09
16D 1.14E−09 6.68E−03 1.23E−09 7.74E−03
32D 1.49E−09 3.73E−01 1.46E−03 4.00E−01
Averaged Relative Errors of {θi } 2D 4.82E−02 7.31E−02 1.17E−01 1.79E−01
4D 3.47E−10 2.82E−10 1.15E−09 1.15E−09
8D 1.47E−10 1.08E−10 2.10E−10 2.25E−10
16D 5.44E−11 1.69E−03 4.75E−11 4.12E−03
32D 3.61E−11 3.27E−01 6.42E−03 2.39E−01
Averaged Relative Errors of {γi } 2D 1.35E−02 1.01E−01 1.33E−02 9.24E−02
4D 3.71E−10 1.24E−09 3.67E−10 1.10E−09
8D 2.91E−10 1.74E−10 2.82E−10 2.01E−10
16D 2.80E−10 2.08E−04 3.10E−10 3.20E−04
32D 3.56E−10 1.88E−02 1.56E−01 3.62E−02
m
i=1 pi − ptrue
i
2
2
relative mean square error of {pi } = m ,
i=1 ptrue
i
2
2
m
i=1 |θi − θi |2
true
relative mean square error of {θi } = m ,
i=1 |θi |2
true
m
i=1 |γi − γi
true |2
relative mean square error of {γi } = m .
i=1 |γi
true |2
For the cases when the denominator m i=1 |γi
true |2 is zero, such as Case 1 and Case 3, we
1 m
i=1 |γi − γi
measure the absolute mean square error m true |2 instead.
We test Cases 1–4 on the neural networks with 2 and 4 neurons, i.e., we set m = 2, 4
and repeat the experiments 100 times. We then compute the relative mean square errors
in each experiment and take the average. The averaged relative mean square errors are
shown in Tables 2 and 3, respectively. From the error tables, we observe that the training
process performs pretty well and gives errors below 10−8 in some cases when m = 2.
However, for the case when m = 4, we do not obtain the global minimizers and the error
is above 10−3 . Therefore, there is no guarantee for the performance of the Adam optimizer
in this training problem and it may be related to the complexity of the solution S to the
underlying HJ equation.
The solution to each of the two corresponding HJ equations can be represented using the
Hopf formula [70] and reads
20 Page 18 of 50 J. Darbon et al. Res Math Sci (2020)7:20
Table 3 Relative mean square errors of the parameters in the neural network f with 4
neurons in different cases and different dimensions averaged over 100 repeated
experiments
# Case Case 1 Case 2 Case 3 Case 4
Averaged Relative Errors of {pi } 2D 3.12E−01 2.21E−01 2.85E−01 2.14E−01
4D 7.82E−02 6.12E−02 7.92E−02 4.30E−02
8D 2.62E−02 4.31E−03 4.02E−02 7.82E−03
16D 2.88E−02 3.64E−02 4.35E−02 1.73E−02
32D 1.42E−02 3.72E−01 1.42E−01 5.04E−01
Averaged Relative Errors of {θi } 2D 2.59E−01 3.68E−01 4.82E−01 1.34E+00
4D 6.07E−02 8.37E−02 9.47E−02 1.23E−01
8D 1.04E−02 8.48E−03 1.41E−02 1.31E−02
16D 2.66E−03 2.53E−02 7.80E−03 1.90E−02
32D 8.09E−04 4.41E−01 1.81E−02 3.66E−01
Averaged Relative Errors of {γi } 2D 1.01E−02 3.19E−01 1.51E−02 2.65E−01
4D 6.72E−03 1.79E−02 1.03E−02 1.30E−02
8D 3.22E−03 2.34E−03 3.93E−03 2.65E−03
16D 9.48E−03 3.70E−03 1.92E−02 1.94E−03
32D 1.33E−02 5.35E−02 4.73E−01 1.17E−01
We train the neural network f using the same procedure as in the previous subsection
and obtain the function J̃ (see Eq. (24)) and the parameters {(pl , θl )}l∈L associated with
the effective neurons. We compute the relative mean square error of J̃ and {(pl , θl )}l∈L as
follows:
N test
j=1 |J̃ (xtest
i ) − J (x i )|
test 2
relative error of J̃ := N test ,
j=1 |J (x i )|
test 2
l∈L |θl − H(pl )|
2
relative error of {(pl , θl )}l := ,
l∈L |H(pl )|
2
where {xtest
i } are randomly sampled with respect to the standard normal distribution in
Rn and there are in total N test = 2,000 testing data points. We repeat the experiments 100
times. The corresponding averaged errors in the two examples are listed in Tables 4 and
5, respectively.
In the first example, we have H(p) = − 12 p 22 and J (x) = x 1 . According to Theorem
3.1, the solution S can be represented without error by the neural network in Fig. 1 with
parameters
n
(p, θ, γ ) ∈ Rn × R × R : p(i) ∈ {±1}, for i ∈ {1, . . . , n}, θ = , γ = 0 , (26)
2
where p(i) denotes the ith entry of the vector p. In other words, the global minimal loss
value in the training problem is theoretically guaranteed to be zero. From the numerical
errors in Table 4, we observe that in low dimension such as 1D and 2D, the errors of the
initial function are small. However, in most cases, the errors of the parameters are pretty
large. In the case of n dimension, the viscosity solution can be represented using the 2n
parameters in Eq. (26). However, the number of effective neurons are larger than 2n in all
J. Darbon et al. Res Math Sci (2020)7:20 Page 19 of 50 20
Table 4 Relative mean square errors of J̃ and {(pl , θl )} for the inverse problems of the
first-order HJ equations in different dimensions with J = · 1 and H = − 12 · 22 , averaged
over 100 repeated experiments
# Neurons 64 128 256 512 1024
Averaged Relative Errors of J̃ 1D 2.29E−07 2.20E−07 2.12E−07 2.14E−07 1.82E−07
2D 1.49E−06 1.27E−06 1.16E−06 1.01E−06 9.25E−07
4D 6.27E−04 1.81E−04 5.93E−05 1.69E−06 3.44E−07
8D 1.27E−02 1.10E−02 1.03E−02 9.92E−03 9.73E−03
16D 5.69E−02 5.83E−02 5.96E−02 5.99E−02 6.01E−02
Averaged Relative Errors of {(pl , θl )} 1D 2.58E−01 1.29E−01 7.05E−02 3.56E−02 1.72E−02
2D 4.77E−02 3.28E−02 2.03E−02 1.03E−02 6.53E−03
4D 9.36E−03 4.09E−03 1.58E−03 5.31E−04 1.73E−04
8D 3.75E−02 3.39E−02 3.25E−02 2.78E−02 2.60E−02
16D 5.30E−01 5.40E−01 5.43E−01 5.43E−01 5.42E−01
Averaged Number of Effective Neurons 1D 4.45 4.37 4.18 3.92 3.55
2D 8.84 8.59 7.87 7.1 6.3
4D 20.04 20.62 19.52 18.3 17.06
8D 36.97 43.91 47.84 49.19 50.03
16D 48.2 59.53 64.85 65.79 64.84
Table 5 Relative mean square errors of J̃ and {(pl , θl )} for the inverse problems of the
first-order HJ equations in different dimensions with J = · 1 and H = · 22 /2, averaged
over 100 repeated experiments
# Neurons 64 128 256 512 1024
Averaged Relative Errors of J̃ 1D 5.23E−08 2.45E−08 1.96E−08 1.77E−08 1.77E−08
2D 1.75E−05 1.67E−05 1.77E−05 1.85E−05 1.91E−05
4D 5.82E−04 4.94E−04 5.28E−04 5.76E−04 6.16E−04
8D 1.54E−02 1.40E−02 1.35E−02 1.33E−02 1.32E−02
16D 4.19E−02 4.33E−02 4.43E−02 4.46E−02 4.49E−02
Averaged Relative Errors of {(pl , θl )} 1D 3.25E−02 1.93E−02 1.24E−02 5.62E−03 2.92E−03
2D 8.30E−03 7.08E−03 5.78E−03 4.25E−03 3.47E−03
4D 2.41E−02 2.41E−02 2.51E−02 2.65E−02 2.82E−02
8D 7.33E−02 7.32E−02 7.25E−02 7.15E−02 7.08E−02
16D 3.85E−01 3.90E−01 3.92E−01 3.92E−01 3.91E−01
Averaged Number of Effective Neurons 1D 20.26 26.94 32.26 36.02 38.61
2D 32.74 48.05 65.7 84.87 99.83
4D 46.69 72.3 103.71 147.41 198.27
8D 55.55 82.04 95.46 90.82 82.5
16D 61.51 99.63 119.95 118.89 109.1
cases, which also implies that the Adam optimizer does not find the global minimizers in
this example.
In the second example, the solution S cannot be represented using our proposed neural
network without error. Hence, the results describe the approximation of the solution S
by the neural network. From Table 5, we observe that the errors become larger when
the dimension increases. For this example, the number of effective neurons should be m
where m is the number of neurons used in the architecture. Table 5 shows that the average
number of effective neurons is below this optimal number. Therefore, this implies that
the Adam optimizer does not find the global minimizers in this example either.
20 Page 20 of 50 J. Darbon et al. Res Math Sci (2020)7:20
In conclusion, these numerical experiments suggest that recovering initial data from
data samples using our proposed neural network architecture with the Adam optimizer
is unsatisfactory for solving these inverse problems. In particular, Adam optimizer is not
always able to find a global minimizer when the solution can be represented without error
using our network architecture.
1. H(p) = − 12 p2 and J (x) = |x| for p, x ∈ R. The initial condition u0 is then given by
⎧
⎨1, x > 0,
u0 (x) =
⎩−1, x < 0.
2. H(p) = 12 p2 and J (x) = |x| for p, x ∈ R. Hence, the initial function u0 is the same as
in example 1.
In the first example, the entropy solution u only takes values in the finite set {±1}, and it
can be represented by the neural network ∇x f without error by Prop. 3.1. However, in the
second example, the solution u takes values in the infinite set [−1, 1]; hence, the neural
network ∇x f is only an approximation of the corresponding solution u.
To show the representability of the neural network, in each example, we choose the
parameters {pi }mi=1 to be the uniform grid points in [−1, 1], i.e.,
2(i − 1)
pi = −1 + for i ∈ {1, . . . , m}.
m−1
We set θi = H(pi ) and γi = J ∗ (pi ) for each i ∈ {1, . . . , m}, where J ∗ is the Fenchel–
Legendre transform of the antiderivative of the initial function u0 . Hence, in these two
examples, γi equals for each i. Figures 5 and 6 show the neural network ∇x f and the true
entropy solution u in these two examples at time t = 1. As expected, the error in Fig. 5
for example 1 is negligible. For example 2, we consider neural networks with 32 and 128
neurons whose graphs are plotted in Figs. 6a and 6b, respectively. We observe in these
figures that the error of the neural networks with the specific parameters decreases as the
number of neurons increases. In conclusion, the neural network ∇x f with the architecture
in Fig. 4 can represent the solution to the one-dimensional conservation laws given in
Eq. (19) pretty well. In fact, because of the discontinuity of the activation function, the
proposed neural network ∇x f has advantages in representing the discontinuity in solution
such as shocks, but it requires more neurons when approximating non-constant smooth
parts of the solution.
5 Conclusion
Summary of the proposed work In this paper, we have established novel mathematical
connections between some classes of HJ PDEs with convex initial data and neural net-
J. Darbon et al. Res Math Sci (2020)7:20 Page 21 of 50 20
Fig. 5 Plot of the function represented by the neural network ∇x f at time t = 1 with 64 neurons whose
parameters are defined using H and J ∗ in example 1. The function given by the neural network is plotted in
orange and the true solution is plotted in blue
Fig. 6 Plot of the function represented by the neural network ∇x f at time t = 1 with 32 and 128 neurons
whose parameters are defined using H and J ∗ in example 2. The function given by the neural network is
plotted in orange and the true solution is plotted in blue. The neural network with 32 neurons is shown on
the left, while the neural network with 128 neurons is shown on the right
work architectures. Our main results give conditions under which for initial data which
takes a particular form. These results do not rely on universal approximation properties
of neural networks; rather, our results show that some neural networks correspond to
representation formulas of solutions to HJ PDEs whose Hamiltonians and convex initial
data are obtained from the parameters of the neural network. This means that some neural
network architectures naturally encode the physics contained in some HJ PDEs satisfying
the conditions in Theorem 3.1.
The first neural network architecture that we have proposed is depicted in Fig. 1. We
have shown in Theorem 3.1 that under certain conditions on the parameters, this neural
network architecture represents the viscosity solution of the HJ PDEs (16) for initial data
which takes a particular form. The corresponding Hamiltonian and convex initial data
can be recovered from the parameters of this neural network. As a corollary of this result
for the one-dimensional case, we have proposed a second neural network architecture
(depicted in Fig. 4) that represents the spatial gradient of the viscosity solution of the
20 Page 22 of 50 J. Darbon et al. Res Math Sci (2020)7:20
HJ PDEs (1) (in one dimension), and we have shown in Prop. 3.1 that under appropriate
conditions on the parameters, this neural network corresponds to entropy solutions of
the conservation laws (22).
Let us emphasize that the neural network architecture depicted in Fig. 1 that represents
solutions to the HJ PDEs (16) allows us to numerically evaluate these solutions in high
dimension without using grids or numerical approximations. Our work also paves the way
to leverage efficient technologies and hardware developed for neural networks to compute
efficiently solutions to certain HJ PDEs.
We have also tested the performance of the state-of-the-art Adam optimizer using our
proposed neural network architecture (depicted in Fig. 1) on some inverse problems. Our
numerical experiments in Sect. 4 show that these problems cannot generally be solved
with the Adam optimizer with high accuracy. These numerical results suggest further
developments of efficient neural network training algorithms for solving inverse problems
with our proposed neural network architectures.
Perspectives on other neural network architectures and HJ PDEs We now present exten-
sions of the proposed architectures that are viable candidates for representing solutions
of HJ PDEs.
First consider the following multi-time HJ PDE [12,27,39,105,119,125,132,141] which
reads
⎧
⎪ ∂S
⎪
⎪ (x, t1 , . . . , tN ) + Hj (∇x S(x, t1 , . . . , tN ))
⎪
⎨ ∂tj
(27)
⎪
⎪ = 0 for each j ∈ {1, . . . , N } in Rn × (0, +∞)N ,
⎪
⎪
⎩S(x, 0, . . . , 0) = J (x) in Rn .
∗ ⎧ ⎫
N ⎨
N ⎬
S(x, t1 , . . . , tN ) = ti Hi + J ∗ (x) = sup p, x − tj Hj (p) − J ∗ (p) ,
i=1 p∈Rn ⎩ j=1
⎭
(28)
Hopf formula (28) suggests that the neural network architecture depicted in Fig. 7 is a
good candidate for representing the solution to (27) under appropriate conditions on the
parameters of the network.
As mentioned in [105], the multi-time HJ equation (27) may not have viscosity solutions.
However, under suitable assumptions [12,27,39,119], the generalized Hopf formula (28)
is a viscosity solution of the multi-time HJ equation. We intend to clarify the connec-
tions between the generalized Hopf formula, multi-time HJ PDEs, viscosity solutions, and
general solutions in a future work.
J. Darbon et al. Res Math Sci (2020)7:20 Page 23 of 50 20
Fig. 7 Illustration of the structure of the neural network (29) that can represent solutions to some first-order
multi-time HJ equations
In [38,39], it is shown that when the Hamiltonian H and the initial data J are both
convex, and under appropriate assumptions, the solution S to the following HJ PDE
⎧
⎨ ∂S (x, t) + H(∇x S(x, t)) = 0 in Rn × (0, +∞),
∂t
⎩
S(x, 0) = J (x) in Rn ,
is represented by the Hopf [70] and Lax–Oleinik formulas [47, Sect. 10.3.4]. These for-
mulas read
Let p(x, t) be the maximizer in the Hopf formula and u(x, t) be the minimizer in the
Lax–Oleinik formula. Then, they satisfy the following relation [38,39]
Figure 8 depicts an architecture of a neural network that implements the formula above
for the minimizer u(x, t). In other words, we consider the ResNet-type neural network
defined by
Note that this proposed neural network suggests an interpretation of some ResNet archi-
tecture (for details on the ResNet architecture, see [66]) in terms of HJ PDEs. The activa-
tion functions of the proposed ResNet architecture are a composition of an argmax-based
function and t∇H, where H is the Hamiltonian in the corresponding HJ equation. More-
over, when the time variable is fixed, the input x and the output u are in the same space Rn ;
hence, one can chain the ResNet structure in Fig. 8 to obtain a deep neural network archi-
tecture by specifying a sequence of time variables t1 , t2 , . . . , tN . The deep neural network
is given by
Fig. 8 Illustration of the structure of the ResNet-type neural network (30) that can represent the minimizer u
in the Lax–Oleinik formula. Note that the activation function is defined using the gradient of the Hamiltonian
H, i.e., ∇H
Fig. 9 Illustration of the structure of the ResNet-type deep neural network (31) that can represent the
minimizers in the generalized Lax–Oleinik formula for the multi-time HJ PDEs. Note that the activation
function in the k th layer is defined using the gradient of one Hamiltonian Hk , i.e., ∇Hk . This figure only depicts
two layers
where u0 = x and pkjk is the output of the argmax based activation function in the k th layer.
For the case when N = 2, an illustration of this deep ResNet architecture with two layers
is shown in Fig. 9. In fact, this deep ResNet architecture can be formulated as follows
N
uN = x − tk ∇H(pkjk ).
k=1
This formulation suggests that this architecture should also provide the minimizers of
the generalized Lax–Oleinik formula for the multi-time HJ PDEs [39]. These ideas and
perspectives will be presented in detail in a forthcoming paper.
Applications of these neural architectures that can represent viscosity solutions of cer-
tain HJ PDEs to certain optimal control problems will be presented elsewhere.
Conflict of interest
The authors declare that they have no conflict of interest.
∗
conv {pi }mi=1 [133, Thms. 10.2 and 20.5], and moreover, the subdifferential ∂J (p) is
non-empty by [133, Thm. 23.10].
Proof of (ii): First, suppose the vector (α1 , . . . , αm ) ∈ Rm satisfies the constraints (a)–
(c). Since x ∈ ∂J ∗ (p), there holds J ∗ (p) = p, x − J (x) [68, Cor. X.1.4.4], and using the
definition of the set Ix (11) and constraints (a)–(c) we deduce that
J ∗ (p) = p, x − J (x) = p, x − αi J (x)
i∈Ix
= p, x − αi (pi , x − γi )
i∈Ix
m
= p− αi p i , x + αi γ i = αi γ i .
i∈Ix i∈Ix i=1
Therefore, the vector (δ1k , . . . , δmk ) is a minimizer in Eq. (12) at the point pk , and J ∗ (pk ) =
γk follows.
m
−∞ < min θi αi θi max θi < +∞
i={1,...,m} i={1,...,m}
i=1
from which we conclude that H is a bounded function on dom J ∗ . Since the target function
in the minimization problem (14) is continuous, existence of a minimizer follows by
compactness of A(p).
Proof of (ii): We have already shown in the proof of (i) that the restriction of H to
dom J ∗ is bounded, and so it remains to prove its continuity. For any p ∈ dom J ∗ , we
20 Page 26 of 50 J. Darbon et al. Res Math Sci (2020)7:20
m
have that (α1 , . . . , αm ) ∈ A(p) if and only if (α1 , . . . , αm ) ∈ Λm , i=1 αi pi = p, and
m ∗
i=1 αi γi = J (p). As a result, we have
m
m
m
H(p) = min αi θi : (α1 , . . . , αm ) ∈ Λm , αi pi = p, αi γi = J ∗ (p) . (32)
i=1 i=1 i=1
for any p ∈ Rn and r ∈ R. Using the same argument as in the proof of Lemma 3.1(i), we
conclude that h is a convex lower semicontinuous function, and in fact continuous over
its domain dom h = conv {(pi , γi )}m i=1 . Comparing Eq. (32) and the definition of h in (33),
we deduce that H(p) = h(p, J (p)) for any p ∈ dom J ∗ . Continuity of H in dom J ∗ then
∗
m
H(pk ) δik θi = θk . (34)
i=1
On the other hand, let (α1 , . . . , αm ) ∈ A(pk ) be a vector different from (δk1 , . . . , δkm ).
m ∗
Then, (α1 , . . . , αm ) ∈ Λm satisfies m i=1 αi pi = p, i=1 αi γi = J (p), and αk < 1. Define
(β1 , . . . , βm ) ∈ Λm by
⎧ αj
⎨ if j = k,
βj := 1 − αk
⎩
0 if j = k.
In other words, Eq. (9) holds at index k, which, by assumption (A3), implies that
i =k βi θi > θk . As a result, we have
m
m
αi θi = αk θk + (1 − αk ) βi θi > αk θk + (1 − αk )θk = θk = δik θi .
i=1 i =k i=1
Taken together with Eq. (34), we conclude that (δ1k , . . . , δmk ) is the unique minimizer in
(14), and hence, we obtain H(pk ) = θk .
J. Darbon et al. Res Math Sci (2020)7:20 Page 27 of 50 20
(A1)-(A3). Let J and H be the functions defined in Eqs. (10) and (14), respectively. Let
H̃ : Rn → R be a continuous function satisfying H̃(pi ) = H(pi ) for each i ∈ {1, . . . , m} and
H̃(p) H(p) for all p ∈ dom J ∗ . Then, the neural network f defined in Eq. (8) satisfies
Proof Let x ∈ Rn and t 0. Since H̃(p) H(p) for every p ∈ dom J ∗ , we get
Let (α1 , . . . , αm ) be a minimizer in (14). By Eqs. (12), (13), and (14), we have
m
m
m
p= αi p i , H(p) = αi θi , and J ∗ (p) = αi γ i . (37)
i=1 i=1 i=1
m
p, x − t H̃(p) − J ∗ (p) αi (pi , x − tθi − γi )
i=1
where the second inequality follows from the constraint (α1 , . . . , αm ) ∈ Λm . Since p ∈
dom J ∗ is arbitrary, we obtain
where the inequality holds since pi ∈ dom J ∗ for every i ∈ {1, . . . , m}. The conclusion
then follows from Eqs. (38) and (39).
20 Page 28 of 50 J. Darbon et al. Res Math Sci (2020)7:20
(A1)-(A3). For every k ∈ {1, . . . , m}, there exist x ∈ Rn and t > 0 such that f (·, t) is
differentiable at x and ∇x f (x, t) = pk .
Proof Since f is the supremum of a finite number of affine functions by definition (8),
it is finite-valued and convex for t 0. As a result, ∇x f (x, t) = pk is equivalent to
∂(f (·, t))(x) = {pk }, and so it suffices to prove that ∂(f (·, t))(x) = {pk } for some x ∈ Rn
and t > 0. To simplify the notation, we use ∂x f (x, t) to denote the subdifferential of f (·, t)
at x.
By [67, Thm. VI.4.4.2], the subdifferential of f (·, t) at x is the convex hull of the pi ’s
whose indices i’s are maximizers in (8), that is,
First, consider the case when there exists x ∈ Rn such that pk , x − γk > pi , x − γi for
every i = k. In that case, by continuity, there exists small t > 0 such that pk , x−tθk −γk >
pi , x − tθi − γi for every i = k and so (40) holds.
Now, consider the case when there does not exist x ∈ Rn such that pk , x − γk >
maxi=k {pi , x − γi }. In other words, we assume
Let x0 ∈ ∂J ∗ (pk ). Denote by Ix0 the set of maximizers in Eq. (41) at the point x0 , i.e.,
Hence, the point pk is in the domain of the polytopal convex function co h. Then, [133,
Thm. 23.10] implies ∂(co h)(pk ) = ∅. Let v 0 ∈ ∂(co h)(pk ) and x = x0 + tv 0 . It remains
to choose suitable positive t such that (40) holds. Letting x = x0 + tv 0 in (40) yields
Now, we consider two situations, the first when i ∈ / Ix0 ∪ {k} and the second when i ∈ Ix0 .
It suffices to prove (40) hold in each case for small enough positive t.
If i ∈
/ Ix0 ∪ {k}, then i is not a maximizer in Eq. (41) at the point x0 . By (45), pk is a convex
combination of the set {pi : i ∈ Ix0 }. In other words, there exists (c1 , . . . , cm ) ∈ Λm such
that m j=1 cj pj = pk and cj = 0 whenever j ∈ / Ix0 . Taken together with assumption (A2)
and Eqs. (10), (41), (43), we have
⎛ ⎞
J (x0 ) pk , x0 − γk = pk , x0 − g(pk ) = cj pj , x0 − g ⎝ cj pj ⎠
j∈Ix0 j∈Ix0
cj (pj , x0 − g(pj )) = cj J (x0 ) = J (x0 ).
j∈Ix0 j∈Ix0
Thus, the inequalities become equalities in the equation above. As a result, we have
where the inequality holds because i ∈ / Ix0 ∪ {k} by assumption. This inequality implies
that the constant pk , x0 − γk − (pi , x0 − γi ) is positive, and taken together with (46),
we conclude that the inequality in (40) holds for i ∈ / Ix0 ∪ {k} when t is small enough.
If i ∈ Ix0 , then both i and k are maximizers in Eq. (10) at x0 , and hence, we have
Together with Eq. (46) and the definition of h in Eq. (44), we obtain
To prove the result, it suffices to show co h(pk ) > θk . As pk ∈ co h (as shown before in
Eq. (45)), then according to [68, Prop. X.1.5.4] we have
co h(pk ) = αj h(pj ) = αj θj , (51)
j∈Ix0 j∈Ix0
20 Page 30 of 50 J. Darbon et al. Res Math Sci (2020)7:20
for some (α1 , . . . , αm ) ∈ Λm satisfying pk = m j=1 αj pj and αj = 0 whenever j ∈
/ Ix0 . Then,
by Lemma 3.1(ii) (α1 , . . . , αm ) is a minimizer in Eq. (42), that is,
m
γk = J ∗ (pk ) = αj γ j = αi γ i = αi γ i .
j=1 j∈Ix0 i =k
Hence, Eq. (9) holds for the index k. By assumption (A3), we have θk < j=k αj θj . Taken
together with the fact that αj = 0 whenever j ∈
/ Ix0 and Eq. (51), we find
θk < αj θj = αj θj = co h(pk ). (52)
j =k j∈Ix0
Hence, the right-hand side of Eq. (50) is strictly positive, and we conclude that pk , x −
tθk − γk > pi , x − tθi − γi for t > 0 if i ∈ Ix0 .
Therefore, in this case, when t > 0 is small enough and x is chosen as above, we have
pk , x − tθk − γk > pi , x − tθi − γi for every i = k, and the proof is complete.
m
co F (p, E − ) = inf ci γi , (54)
(c1 ,...,cm )∈C(p,E − )
i=1
Proof First, we compute the convex hull of epi F , which we denote by co (epi F ). Let
(p, E − , r) ∈ co (epi F ), where p ∈ Rn and E − , r ∈ R. Then there exist k ∈ N, (β1 , . . . , βk ) ∈
Λk and (q i , Ei− , ri ) ∈ epi F for each i ∈ {1, . . . , k} such that (p, E − , r) = ki=1 βi (q i , Ei− , ri ).
By definition of F in Eq. (53), (q i , Ei− , ri ) ∈ epi F holds if and only if q i ∈ dom J ∗ , Ei− +
H(q i ) 0 and ri J ∗ (q i ). In conclusion, we have
⎧
⎪
⎪(β1 , . . . , βk ) ∈ Λk ,
⎪
⎪
⎪
⎪ −
k −
⎪
⎨(p, E , r) = i=1 βi (q i , Ei , ri ),
⎪
q 1 , . . . , q k ∈ dom J ∗ , (55)
⎪
⎪
⎪
⎪ −
⎪Ei + H(q i ) 0 for each i ∈ {1, . . . , k},
⎪
⎪
⎪
⎩r J ∗ (q ) for each i ∈ {1, . . . , k}.
i i
J. Darbon et al. Res Math Sci (2020)7:20 Page 31 of 50 20
For each i, since we have q i ∈ dom J ∗ , by Lemma 3.2(i) the minimization problem in (14)
evaluated at q i has at least one minimizer. Let (αi1 , . . . , αim ) be such a minimizer. Using
Eqs. (12), (14), and (αi1 , . . . , αim ) ∈ Λm , we have
m
αij (1, pj , θj , γj ) = (1, q i , H(q i ), J ∗ (q i )). (56)
j=1
Define the real number cj := ki=1 βi αij for any j ∈ {1, . . . , m}. Combining Eqs. (55) and
(56), we get that cj 0 for any j and
m
m
k
cj (1, pj , θj , γj ) = βi αij (1, pj , θj , γj )
j=1 j=1 i=1
⎛ ⎞
k m
k
= βi ⎝ αij (1, pj , θj , γj )⎠ = βi (1, q i , H(q i ), J ∗ (q i )).
i=1 j=1 i=1
m
k
cj (1, pj ) = βi (1, q i ) = (1, p);
j=1 i=1
m
k
k
cj θj = βi H(q i ) − βi Ei− = −E − ;
j=1 i=1 i=1
m
k
k
cj γj = βi J ∗ (q i ) βi ri = r.
j=1 i=1 i=1
m
m
m
p= cj pj , E − − cj θj , r cj γj . .
j=1 j=1 j=1
(57)
∂f
(x, t) = min −H(p) : p is a maximizer in Eq. (59) .
∂t
Since x and t satisfy ∂x f (x, t) = {pk }, [67, Thm. VI.4.4.2] implies that the only maximizer
in Eq. (59) is pk . As a result, there holds
∂f
(x, t) = −H(pk ). (60)
∂t
In other words, the subdifferential ∂f (x, t) contains only one element, and therefore, f is
differentiable at (x, t) and its gradient equals (pk , −H(pk )) [133, Thm. 21.5]. Using (16)
and (60), we obtain
∂f
0= (x, t) + H̃(∇x f (x, t)) = −H(pk ) + H̃(pk ).
∂t
As k ∈ {1, . . . , m} is arbitrary, we find that H(pk ) = H̃(pk ) for every k ∈ {1, . . . , m}.
Next, we prove by contradiction that H̃(p) H(p) for every p ∈ dom J ∗ . It is enough
to prove the property only for every p ∈ ri dom J ∗ by continuity of both H̃ and H (where
J. Darbon et al. Res Math Sci (2020)7:20 Page 33 of 50 20
continuity of H is proved in Lemma 3.2(ii)). Assume H̃(p) < H(p) for some p ∈ ri dom J ∗ .
Define two functions F and F̃ from Rn × R to R ∪ {+∞} by
J ∗ (q) if E − + H(q) 0, J ∗ (q) if E − + H̃(q) 0,
F (q, E − ):= and F̃ (q, E − ):=
+∞ otherwise. +∞ otherwise.
(61)
m
co F (q, E − ) = inf ci γi , where C is defined by
(c1 ,...,cm )∈C(q,E − )
i=1
(62)
m
m
− −
C(q, E ):= (c1 , . . . , cm ) ∈ Λm : ci pi = q, ci θi −E .
i=1 i=1
Let E1− ∈ −H(p), −H̃(p) . Now, we want to prove that co F (p, E1− ) J ∗ (p); this
inequality will lead to a contradiction with the definition of H.
Using statement (i) of this theorem and the supposition that f is the unique viscosity
solution to the HJ equation (16), we have that
f = F ∗ = F̃ ∗ , which implies f ∗ = co F = co F̃ .
%
f ∗ p, −H̃(p) F̃ p, −H̃(p) = J ∗ (p) and {p} × −∞, −H̃(p) ⊆ dom F̃ ⊆ dom f ∗ .
(63)
Recall that p ∈ ri dom J ∗ and E1− < −H̃(p), so that (p, E1− ) ∈ ri dom f ∗ . As a result, we
get
p, αE1− + (1 − α)(−H̃(p)) ∈ ri dom f ∗ for all α ∈ (0, 1). (64)
Note that co F (p, ·) is monotone non-decreasing. Indeed, if E2− is a real number such that
E2− > E1− , by the definition of the set C in Eq. (62) there holds C(p, E2− ) ⊆ C(p, E1− ), which
20 Page 34 of 50 J. Darbon et al. Res Math Sci (2020)7:20
implies co F (p, E2− ) co F (p, E1− ). Recalling that E1− < −H̃(p), monotonicity of co F (p, ·)
and Eq. (65) imply
f ∗ p, −H̃ (p) lim co F p, αE1− + (1 − α)E1− = co F (p, E1− ). (66)
α→0
0<α<1
As a result, the set C(p, E1− ) is non-empty. Since it is also compact, there exists a min-
imizer in Eq. (62) evaluated at the point (p, E1− ). Let (c1 , . . . , cm ) be such a minimizer. By
Eqs. (62) and (67) and the assumption that E1− ∈ −H(p), −H̃(p) , there holds
(c_1, \dots, c_m) \in \Lambda_m, \qquad
\sum_{i=1}^{m} c_i p_i = p, \qquad
\sum_{i=1}^{m} c_i \gamma_i = \operatorname{co} F(p, E_1^-) \leq J^*(p), \qquad
\sum_{i=1}^{m} c_i \theta_i \leq -E_1^- < H(p). \quad (68)
Comparing the first three statements in Eq. (68) and the formula of J ∗ in Eq. (12), we
deduce that (c1 , . . . , cm ) is a minimizer in Eq. (12), i.e., (c1 , . . . , cm ) ∈ A(p). By definition
of H in Eq. (14), we have
H(p) = \inf_{\alpha \in A(p)} \sum_{i=1}^{m} \alpha_i \theta_i \leq \sum_{i=1}^{m} c_i \theta_i,

which contradicts the last inequality in Eq. (68). Therefore, we conclude that H̃(p) ≥ H(p) for any p ∈ ri dom J ∗ and the proof is finished.
C Connections between the neural network (17) and the viscous HJ PDE (18)
Let f_ε be the neural network defined by Eq. (17) with parameters {(p_i, θ_i, γ_i)}_{i=1}^m and ε > 0, which is illustrated in Fig. 3. We will show in this appendix that when the parameter θ_i = −\frac{1}{2}\|p_i\|_2^2 for i ∈ {1, . . . , m}, then the neural network f_ε corresponds to the unique, jointly convex smooth solution to the viscous HJ PDE (18). This result will follow immediately from the following lemma.
Lemma C.1 Let ε > 0 and let {(p_i, γ_i)}_{i=1}^m ⊂ R^n × R. Then the function w_ε defined by

w_\varepsilon(x, t) := \sum_{i=1}^{m} e^{\left(\langle p_i, x\rangle + \frac{t}{2}\|p_i\|_2^2 - \gamma_i\right)/\varepsilon} \quad (69)

is the unique, jointly log-convex and smooth solution to the Cauchy problem

\begin{cases} \dfrac{\partial w_\varepsilon}{\partial t}(x, t) = \dfrac{\varepsilon}{2} \Delta_x w_\varepsilon(x, t) & \text{in } \mathbb{R}^n \times (0, +\infty), \\[4pt] w_\varepsilon(x, 0) = \displaystyle\sum_{i=1}^{m} e^{(\langle p_i, x\rangle - \gamma_i)/\varepsilon} & \text{in } \mathbb{R}^n. \end{cases} \quad (70)
Proof A short calculation shows that the function w_ε defined in Eq. (69) solves the Cauchy problem (70), and uniqueness holds by strict positivity of the initial data (see [147, Chap. VIII, Thm. 2.2] and note that the uniqueness result can easily be generalized to n > 1). Now, let λ ∈ [0, 1] and (x_1, t_1) and (x_2, t_2) be such that x = λx_1 + (1 − λ)x_2 and t = λt_1 + (1 − λ)t_2. Then, Hölder's inequality (see, e.g., [57, Thm. 6.2]) implies
\sum_{i=1}^{m} e^{\left(\langle p_i, x\rangle + \frac{t}{2}\|p_i\|_2^2 - \gamma_i\right)/\varepsilon}
= \sum_{i=1}^{m} e^{\lambda\left(\langle p_i, x_1\rangle + \frac{t_1}{2}\|p_i\|_2^2 - \gamma_i\right)/\varepsilon}\, e^{(1-\lambda)\left(\langle p_i, x_2\rangle + \frac{t_2}{2}\|p_i\|_2^2 - \gamma_i\right)/\varepsilon}
\leq \left( \sum_{i=1}^{m} e^{\left(\langle p_i, x_1\rangle + \frac{t_1}{2}\|p_i\|_2^2 - \gamma_i\right)/\varepsilon} \right)^{\lambda} \left( \sum_{i=1}^{m} e^{\left(\langle p_i, x_2\rangle + \frac{t_2}{2}\|p_i\|_2^2 - \gamma_i\right)/\varepsilon} \right)^{1-\lambda},

and we find w_ε(x, t) ≤ (w_ε(x_1, t_1))^λ (w_ε(x_2, t_2))^{1−λ}, which implies that w_ε is jointly log-convex in (x, t).
Thanks to Lemma C.1 and the Cole–Hopf transformation f_ε(x, t) = ε log(w_ε(x, t)) (see, e.g., [47, Sect. 4.4.1]), a short calculation immediately implies that the neural network f_ε solves the viscous HJ PDE (18), and it is also its unique solution because w_ε is the unique solution to the Cauchy problem (70). Joint convexity in (x, t) follows from log-convexity of (x, t) → w_ε(x, t) for every ε > 0.
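The Cole–Hopf computation above can be checked numerically. The sketch below evaluates the network f_ε(x, t) = ε log Σ_i exp((⟨p_i, x⟩ + (t/2)‖p_i‖²_2 − γ_i)/ε) and verifies by finite differences that it satisfies ∂f/∂t = (ε/2)Δ_x f + ½‖∇_x f‖²_2, which is the viscous HJ equation obtained from the heat equation (70) under f_ε = ε log w_ε (our reading of Eq. (18)); the parameters p_i, γ_i are random illustrative values, not data from the paper.

```python
import numpy as np

# Hedged sketch: check by finite differences that the LogSumExp network
#   f_eps(x, t) = eps * log( sum_i exp((<p_i, x> + t*||p_i||^2/2 - gamma_i)/eps) )
# satisfies df/dt = (eps/2) * Laplacian_x f + (1/2) * ||grad_x f||^2, i.e. the
# viscous HJ equation obtained from the heat equation (70) via f = eps*log(w).
rng = np.random.default_rng(0)
n, m, eps = 2, 5, 0.1
P = rng.normal(size=(m, n))      # rows are the p_i
gamma = rng.normal(size=m)       # the gamma_i

def f(x, t):
    z = (P @ x + 0.5 * t * np.sum(P**2, axis=1) - gamma) / eps
    zmax = z.max()               # stabilized log-sum-exp
    return eps * (zmax + np.log(np.exp(z - zmax).sum()))

def pde_residual(x, t, h=1e-4):
    ft = (f(x, t + h) - f(x, t - h)) / (2 * h)
    grad = np.array([(f(x + h * e, t) - f(x - h * e, t)) / (2 * h) for e in np.eye(n)])
    lap = sum((f(x + h * e, t) - 2 * f(x, t) + f(x - h * e, t)) / h**2 for e in np.eye(n))
    return ft - 0.5 * eps * lap - 0.5 * grad @ grad

x0, t0 = rng.normal(size=n), 0.7
print(pde_residual(x0, t0))      # approximately zero, up to finite-difference error
```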
Proof Let I_x denote the set of maximizers in Eq. (11) at x. Since p ∈ ∂J(x), p ≠ p_i for every i ∈ {1, . . . , m}, and ∂J(x) = co {p_i : i ∈ I_x} by [67, Thm. VI.4.4.2], there exist j, l ∈ I_x such that p_j < p < p_l. Moreover, there exists k with j ≤ k < k + 1 ≤ l such that p_j ≤ p_k < p < p_{k+1} ≤ p_l. We will show that k, k + 1 ∈ I_x. We only prove k ∈ I_x; the case for k + 1 is similar.
If p_j = p_k, then k = j ∈ I_x and the conclusion follows directly. Hence suppose p_j < p_k < p_l. Then, there exists α ∈ (0, 1) such that p_k = αp_j + (1 − α)p_l. Using that j, l ∈ I_x, assumption (A2), and Jensen's inequality, we deduce that k ∈ I_x. A similar argument shows that k + 1 ∈ I_x, which completes the proof.
where
\beta_k := \frac{p_{k+1} - u_0}{p_{k+1} - p_k} \quad \text{and} \quad \beta_{k+1} := \frac{u_0 - p_k}{p_{k+1} - p_k}. \quad (73)
Proof Define the vector β = (β_1, . . . , β_m) by

\beta_k := \frac{p_{k+1} - u_0}{p_{k+1} - p_k}, \qquad \beta_{k+1} := \frac{u_0 - p_k}{p_{k+1} - p_k},

and β_i = 0 for every i ∈ {1, . . . , m} \ {k, k + 1}. We will prove that β is a minimizer in Eq. (14) evaluated at u_0, that is,

\beta \in \operatorname*{arg\,min}_{\alpha \in A(u_0)} \sum_{i=1}^{m} \alpha_i \theta_i, \quad \text{where} \quad
A(u_0) := \operatorname*{arg\,min}_{\substack{(\alpha_1, \dots, \alpha_m) \in \Lambda_m \\ \sum_{i=1}^{m} \alpha_i p_i = u_0}} \sum_{i=1}^{m} \alpha_i \gamma_i.
First, we show that β ∈ A(u0 ). By definition of β and Lemma 3.1(ii) with p = u0 , the
statement holds provided k, k + 1 ∈ Ix , where the set Ix contains the maximizers in Eq.
(10) evaluated at x ∈ ∂J ∗ (u0 ). But if x ∈ ∂J ∗ (u0 ), we have u0 ∈ ∂J (x), and Lemma D.1
implies k, k + 1 ∈ Ix . Hence, β ∈ A(u0 ).
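For a quick sanity check of the weights in Eq. (73), the following sketch (with illustrative slopes p_i, not values from the paper) verifies that β has nonnegative entries summing to one and satisfies Σ_i β_i p_i = u_0; the remaining requirement Σ_i β_i γ_i = J*(u_0), which makes β an element of A(u_0), is the part supplied by Lemma 3.1(ii) above.

```python
import numpy as np

# Hedged sketch of the coefficients beta in Eq. (73): for u0 strictly between two
# consecutive slopes p_k < u0 < p_{k+1}, the two barycentric weights reproduce u0
# and sum to one, so beta (zero elsewhere) lies in the simplex and satisfies the
# linear constraint sum_i beta_i p_i = u0 appearing in the definition of A(u0).
p = np.array([-2.0, -0.5, 1.0, 3.0])       # illustrative increasing slopes p_i
u0 = 0.25                                  # a point with p_k < u0 < p_{k+1}
k = np.searchsorted(p, u0) - 1             # index k such that p_k < u0 < p_{k+1}

beta = np.zeros_like(p)
beta[k] = (p[k + 1] - u0) / (p[k + 1] - p[k])
beta[k + 1] = (u0 - p[k]) / (p[k + 1] - p[k])

assert np.isclose(beta.sum(), 1.0) and np.all(beta >= 0)
assert np.isclose(beta @ p, u0)
print(k, beta)
```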
Now, suppose that β is not a minimizer in Eq. (14) evaluated at u0 . By Lemma 3.2(i), there
exists a minimizer in Eq. (14) evaluated at the point u0 , which we denote by (α1 , . . . , αm ).
Then there holds
\sum_{i=1}^{m} \alpha_i = \sum_{i=1}^{m} \beta_i = 1, \qquad
\sum_{i=1}^{m} \alpha_i p_i = \sum_{i=1}^{m} \beta_i p_i = u_0, \qquad
\sum_{i=1}^{m} \alpha_i \gamma_i = \sum_{i=1}^{m} \beta_i \gamma_i = J^*(u_0), \qquad
\sum_{i=1}^{m} \alpha_i \theta_i < \sum_{i=1}^{m} \beta_i \theta_i. \quad (74)
Since α_i ≥ 0 for every i and β_i = 0 for every i ∈ {1, . . . , m} \ {k, k + 1}, we have α_k + α_{k+1} ≤ 1 = β_k + β_{k+1}. As α ≠ β, then one or both of the inequalities α_k < β_k and α_{k+1} < β_{k+1} hold. This leaves three possible cases, and we now show that each case leads to a contradiction.
Note that from the first two equations in (74) and the assumption that α_k < β_k and α_{k+1} < β_{k+1}, there exist i_1 < k and i_2 > k + 1 such that α_{i_1} ≠ 0 and α_{i_2} ≠ 0, and hence, the numbers q_k and q_{k+1} are well-defined. By definition, we have q_k < p_k < p_{k+1} < q_{k+1}. Therefore, there exist b_k, b_{k+1} ∈ (0, 1) such that

c_i^k := \begin{cases} \dfrac{b_k \alpha_i}{\sum_{\omega < k} \alpha_\omega}, & i < k, \\[6pt] \dfrac{(1 - b_k)\alpha_i}{\sum_{\omega > k+1} \alpha_\omega}, & i > k + 1, \\[6pt] 0, & \text{otherwise}, \end{cases}
\qquad \text{and} \qquad
c_i^{k+1} := \begin{cases} \dfrac{b_{k+1} \alpha_i}{\sum_{\omega < k} \alpha_\omega}, & i < k, \\[6pt] \dfrac{(1 - b_{k+1})\alpha_i}{\sum_{\omega > k+1} \alpha_\omega}, & i > k + 1, \\[6pt] 0, & \text{otherwise}. \end{cases}
\quad (79)
These coefficients satisfy c_i^k, c_i^{k+1} ∈ [0, 1] for any i and \sum_{i=1}^{m} c_i^k = \sum_{i=1}^{m} c_i^{k+1} = 1. In other words, we have

(c_1^k, \dots, c_m^k) \in \Lambda_m \ \text{with } c_k^k = 0 \qquad \text{and} \qquad (c_1^{k+1}, \dots, c_m^{k+1}) \in \Lambda_m \ \text{with } c_{k+1}^{k+1} = 0. \quad (80)

Hence, the first equality in Eq. (9) holds for the coefficients (c_1^k, . . . , c_m^k) with the index k and also for the coefficients (c_1^{k+1}, . . . , c_m^{k+1}) with the index k + 1. We show next that these coefficients satisfy the second and third equalities in (9) and draw a contradiction with assumption (A3).
Using Eqs. (76), (77), and (79) to write the formulas for p_k and p_{k+1} via the coefficients c_i^k and c_i^{k+1}, we find

p_k = b_k \frac{\sum_{i<k} \alpha_i p_i}{\sum_{i<k} \alpha_i} + (1 - b_k) \frac{\sum_{i>k+1} \alpha_i p_i}{\sum_{i>k+1} \alpha_i} = \sum_{i \neq k, k+1} c_i^k p_i = \sum_{i \neq k} c_i^k p_i,
\qquad
p_{k+1} = b_{k+1} \frac{\sum_{i<k} \alpha_i p_i}{\sum_{i<k} \alpha_i} + (1 - b_{k+1}) \frac{\sum_{i>k+1} \alpha_i p_i}{\sum_{i>k+1} \alpha_i} = \sum_{i \neq k, k+1} c_i^{k+1} p_i = \sum_{i \neq k+1} c_i^{k+1} p_i, \quad (81)

where the last equalities in the two formulas above hold because c_{k+1}^k = 0 and c_k^{k+1} = 0 by definition. Hence, the second equality in Eq. (9) also holds for both the index k and k + 1.
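The redistribution step in Eqs. (79)–(81) can be illustrated numerically. In the sketch below we take q_k, q_{k+1} to be the α-weighted averages of the p_i with i < k and i > k + 1, and b_k = (q_{k+1} − p_k)/(q_{k+1} − q_k), as suggested by the computation above; Eqs. (76)–(78) themselves are not reproduced in this excerpt, so these formulas are assumptions, and the numbers are illustrative only.

```python
import numpy as np

# Hedged sketch of the redistribution step in Eqs. (79)-(81): build the
# coefficients c^k of Eq. (79) and check that they lie in the simplex with
# c^k_k = c^k_{k+1} = 0 (Eq. (80)) and reproduce p_k as in Eq. (81).
p = np.array([-3.0, -1.0, 0.0, 1.0, 4.0])      # illustrative increasing p_i
alpha = np.array([0.3, 0.2, 0.1, 0.1, 0.3])    # illustrative simplex vector
k = 2                                          # indices k and k+1 (0-based here)

q_k = (alpha[:k] @ p[:k]) / alpha[:k].sum()               # average over i < k
q_k1 = (alpha[k + 2:] @ p[k + 2:]) / alpha[k + 2:].sum()  # average over i > k+1
b_k = (q_k1 - p[k]) / (q_k1 - q_k)                        # assumed form of b_k

c_k = np.zeros_like(p)
c_k[:k] = b_k * alpha[:k] / alpha[:k].sum()
c_k[k + 2:] = (1.0 - b_k) * alpha[k + 2:] / alpha[k + 2:].sum()

assert np.isclose(c_k.sum(), 1.0) and np.all(c_k >= 0)    # Eq. (80) for index k
assert np.isclose(c_k @ p, p[k])                          # Eq. (81) for index k
print(q_k, q_k1, b_k, c_k)
```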
From the third equality in Eq. (75), assumption (A2), Eq. (81), and Jensen's inequality, we have

\sum_{i \neq k, k+1} \alpha_i \gamma_i = (\beta_k - \alpha_k)\gamma_k + (\beta_{k+1} - \alpha_{k+1})\gamma_{k+1}
\leq (\beta_k - \alpha_k) \left( \sum_{i \neq k, k+1} c_i^k g(p_i) \right) + (\beta_{k+1} - \alpha_{k+1}) \left( \sum_{i \neq k, k+1} c_i^{k+1} g(p_i) \right)
= \sum_{i \neq k, k+1} \big( (\beta_k - \alpha_k)c_i^k + (\beta_{k+1} - \alpha_{k+1})c_i^{k+1} \big) g(p_i)
= \sum_{i \neq k, k+1} \big( (\beta_k - \alpha_k)c_i^k + (\beta_{k+1} - \alpha_{k+1})c_i^{k+1} \big) \gamma_i. \quad (82)

We now compute and simplify the coefficients (β_k − α_k)c_i^k + (β_{k+1} − α_{k+1})c_i^{k+1} in the formula above. First, consider the case when i < k. Eqs. (78) and (79) imply

(\beta_k - \alpha_k)c_i^k + (\beta_{k+1} - \alpha_{k+1})c_i^{k+1}
= \frac{\alpha_i}{\sum_{\omega < k} \alpha_\omega} \left( (\beta_k - \alpha_k) \frac{q_{k+1} - p_k}{q_{k+1} - q_k} + (\beta_{k+1} - \alpha_{k+1}) \frac{q_{k+1} - p_{k+1}}{q_{k+1} - q_k} \right)
= \frac{\alpha_i}{\sum_{\omega < k} \alpha_\omega} \cdot \frac{1}{q_{k+1} - q_k} \big( (\beta_k - \alpha_k + \beta_{k+1} - \alpha_{k+1}) q_{k+1} - (\beta_k - \alpha_k)p_k - (\beta_{k+1} - \alpha_{k+1})p_{k+1} \big).

Applying the first two equalities in Eq. (75) and Eq. (76) to the last formula above, we obtain

(\beta_k - \alpha_k)c_i^k + (\beta_{k+1} - \alpha_{k+1})c_i^{k+1} = \alpha_i.
The same result for the case when i > k + 1 also holds and the proof is similar. Therefore, we have

(\beta_k - \alpha_k)c_i^k + (\beta_{k+1} - \alpha_{k+1})c_i^{k+1} = \alpha_i \quad \text{for every } i \in \{1, \dots, m\} \setminus \{k, k+1\}. \quad (83)

Hence, the left side and the right side of Eq. (82) are the same, and the inequality in Eq. (82) becomes an equality. In other words, we have

\gamma_k = g(p_k) = \sum_{i \neq k, k+1} c_i^k g(p_i) = \sum_{i \neq k, k+1} c_i^k \gamma_i = \sum_{i \neq k} c_i^k \gamma_i,
\qquad
\gamma_{k+1} = g(p_{k+1}) = \sum_{i \neq k, k+1} c_i^{k+1} g(p_i) = \sum_{i \neq k, k+1} c_i^{k+1} \gamma_i = \sum_{i \neq k+1} c_i^{k+1} \gamma_i, \quad (84)

where the last equalities in the two formulas above hold because c_{k+1}^k = 0 and c_k^{k+1} = 0 by definition. Hence, the third equality in (9) also holds for both indices k and k + 1.
In summary, Eqs. (80), (81), and (84) imply that Eq. (9) holds for the index k with coefficients (c_1^k, . . . , c_m^k) and also for the index k + 1 with coefficients (c_1^{k+1}, . . . , c_m^{k+1}). Hence, by assumption (A3), we find

\sum_{i \neq k} c_i^k \theta_i > \theta_k \qquad \text{and} \qquad \sum_{i \neq k+1} c_i^{k+1} \theta_i > \theta_{k+1}.
Using the inequalities above with Eq. (83) and the fact that c_{k+1}^k = 0 and c_k^{k+1} = 0, we find

(\beta_k - \alpha_k)\theta_k + (\beta_{k+1} - \alpha_{k+1})\theta_{k+1} < (\beta_k - \alpha_k) \sum_{i \neq k} c_i^k \theta_i + (\beta_{k+1} - \alpha_{k+1}) \sum_{i \neq k+1} c_i^{k+1} \theta_i
= \sum_{i \neq k, k+1} \big( (\beta_k - \alpha_k)c_i^k + (\beta_{k+1} - \alpha_{k+1})c_i^{k+1} \big) \theta_i = \sum_{i \neq k, k+1} \alpha_i \theta_i,
\frac{\theta_l - \theta_k}{p_l - p_k} \leq \frac{\theta_l - \theta_j}{p_l - p_j}. \quad (86)
Proof Note that Eq. (86) holds trivially when j = k, so we only need to consider the case when j < k < l. On the one hand, Eq. (85) implies

\gamma_l - \gamma_k \leq x(p_l - p_k) - t(\theta_l - \theta_k), \qquad \gamma_l - \gamma_j = x(p_l - p_j) - t(\theta_l - \theta_j). \quad (87)
On the other hand, for each i ∈ {j, j + 1, . . . , l − 1} let qi ∈ (pi , pi+1 ) and xi ∈ ∂J ∗ (qi ). Such
xi exists because qi ∈ int dom J ∗ , so that the subdifferential ∂J ∗ (qi ) is non-empty. Then,
qi ∈ ∂J (xi ) and Lemma D.1 imply
\gamma_l - \gamma_k = \sum_{i=k}^{l-1} (\gamma_{i+1} - \gamma_i) = \sum_{i=k}^{l-1} x_i (p_{i+1} - p_i),
\qquad
\gamma_l - \gamma_j = \sum_{i=j}^{l-1} (\gamma_{i+1} - \gamma_i) = \sum_{i=j}^{l-1} x_i (p_{i+1} - p_i).
Combining the two equalities above with Eq. (87), we conclude that
x(p_l - p_k) - t(\theta_l - \theta_k) \geq \sum_{i=k}^{l-1} x_i (p_{i+1} - p_i),
\qquad
x(p_l - p_j) - t(\theta_l - \theta_j) = \sum_{i=j}^{l-1} x_i (p_{i+1} - p_i).
Now, divide the inequality above by t(p_l − p_k) > 0 (because by assumption t > 0 and l > k, which implies that p_l > p_k), divide the equality above by t(p_l − p_j) > 0 (because l > j, which implies that t(p_l − p_j) ≠ 0), and rearrange the terms to obtain

\frac{\theta_l - \theta_k}{p_l - p_k} \leq \frac{x}{t} - \frac{1}{t} \cdot \frac{\sum_{i=k}^{l-1} x_i (p_{i+1} - p_i)}{p_l - p_k},
\qquad
\frac{\theta_l - \theta_j}{p_l - p_j} = \frac{x}{t} - \frac{1}{t} \cdot \frac{\sum_{i=j}^{l-1} x_i (p_{i+1} - p_i)}{p_l - p_j}. \quad (88)
Recall that q_j < q_{j+1} < · · · < q_{l−1} and x_i ∈ ∂J ∗ (q_i) for any j ≤ i < l. Since the function J ∗ is convex, the subdifferential operator ∂J ∗ is a monotone non-decreasing operator [67, Def. IV.4.1.3, and Prop. VI.6.1.1], which yields x_j ≤ x_{j+1} ≤ · · · ≤ x_{l−1}. Using that p_1 < p_2 < · · · < p_m and j < k < l, we obtain

\frac{\sum_{i=k}^{l-1} x_i (p_{i+1} - p_i)}{p_l - p_k} \geq \frac{\sum_{i=k}^{l-1} x_k (p_{i+1} - p_i)}{p_l - p_k} = x_k = \frac{\sum_{i=j}^{k-1} x_k (p_{i+1} - p_i)}{p_k - p_j} \geq \frac{\sum_{i=j}^{k-1} x_i (p_{i+1} - p_i)}{p_k - p_j}. \quad (89)
To proceed, we now use the fact that if four real numbers a, c ∈ R and b, d > 0 satisfy c/d ≤ a/b, then c/d ≤ (a + c)/(b + d) ≤ a/b. Combining this fact with Eqs. (88) and (89), we find

\frac{\theta_l - \theta_k}{p_l - p_k} \leq \frac{\theta_l - \theta_j}{p_l - p_j}.
its Lipschitz property [57, Thm. 4.16]. Therefore, we can invoke [36, Prop. 2.1] to conclude that u is the entropy solution to the conservation law (21) provided it satisfies the two following conditions. Let x̄(t) be any smooth line of discontinuity of u. Fix t > 0 and define u_- and u_+ as the one-sided limits

u_- := \lim_{x \to \bar{x}(t)^-} u(x, t), \qquad u_+ := \lim_{x \to \bar{x}(t)^+} u(x, t). \quad (90)

The first condition is the jump condition

\frac{d\bar{x}}{dt} = \frac{H(u_+) - H(u_-)}{u_+ - u_-}. \quad (91)

First, we prove the first condition, i.e., Eq. (91). According to the definition of u in Eq. (20), the range of u is the compact set {p_1, . . . , p_m}. As a result, u_- and u_+ are in the range of u, i.e., there exist indices j and l such that

u_- = p_j \qquad \text{and} \qquad u_+ = p_l. \quad (93)
Let (x̄(s), s) be a point on the curve x̄ which is not one of the endpoints. Since u is piecewise constant, there exists a neighborhood N of (x̄(s), s) such that for any (x_-, t), (x_+, t) ∈ N satisfying x_- < x̄(t) < x_+, we have u(x_-, t) = u_- = p_j and u(x_+, t) = u_+ = p_l. In other words, if x_-, x_+, and t are chosen as above, then Eqs. (95) and (96) hold by the definition of u in Eq. (20).
By a continuity argument, Eqs. (95) and (96) also hold for the end points of x̄. In conclusion, Eq. (97) holds for any (x̄(t), t) on the curve x̄, and Eq. (93) and Lemma 3.2(iii) imply that its slope equals

\frac{d\bar{x}}{dt} = \frac{\theta_l - \theta_j}{p_l - p_j} = \frac{H(u_+) - H(u_-)}{u_+ - u_-}.
\frac{\theta_l - H(u_0)}{p_l - u_0} \leq \frac{\theta_l - \theta_j}{p_l - p_j}. \quad (98)
Without loss of generality, we may assume that p_1 < p_2 < · · · < p_m. Then, the fact p_j = u_- < u_+ = p_l implies j < l. We consider the following two cases.
First, if there exists some k such that u_0 = p_k, then H(u_0) = θ_k by Lemma 3.2(iii). Since u_- < u_0 < u_+, we have j < k < l, and Eq. (97) holds. Using these equations, we can write the left-hand side of Eq. (98) as

\frac{\theta_l - H(u_0)}{p_l - u_0} = \frac{\theta_l - \theta_k}{p_l - p_k}.
Since j ≤ k < l and Eq. (97) hold, the assumptions of Lemma D.3 are satisfied. This allows us to conclude that Eq. (98) holds.
If k + 1 ≠ l, then using Eq. (97), the inequalities j ≤ k < k + 1 < l, and Lemma D.3, we obtain

\frac{\beta_k(\theta_l - \theta_k)}{\beta_k(p_l - p_k)} = \frac{\theta_l - \theta_k}{p_l - p_k} \leq \frac{\theta_l - \theta_j}{p_l - p_j} \qquad \text{and} \qquad \frac{\beta_{k+1}(\theta_l - \theta_{k+1})}{\beta_{k+1}(p_l - p_{k+1})} = \frac{\theta_l - \theta_{k+1}}{p_l - p_{k+1}} \leq \frac{\theta_l - \theta_j}{p_l - p_j}.
Note that if a_i ∈ R and b_i ∈ (0, +∞) for i ∈ {1, 2, 3} satisfy a_1/b_1 ≤ a_3/b_3 and a_2/b_2 ≤ a_3/b_3, then (a_1 + a_2)/(b_1 + b_2) ≤ a_3/b_3. Then, since β_k(p_l − p_k), β_{k+1}(p_l − p_{k+1}), and p_l − p_j are positive, we have

\frac{\beta_k(\theta_l - \theta_k) + \beta_{k+1}(\theta_l - \theta_{k+1})}{\beta_k(p_l - p_k) + \beta_{k+1}(p_l - p_{k+1})} \leq \frac{\theta_l - \theta_j}{p_l - p_j}.

Hence, Eq. (98) follows directly from the inequality above and Eq. (99).
Therefore, the two conditions, including Eqs. (91) and (92), are satisfied and we apply
[36, Prop 2.1] to conclude that the function u is the entropy solution to the conservation
law (21).
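The two conditions verified in this proof can be illustrated numerically. In the sketch below, H is taken as the piecewise-linear interpolation of illustrative points (p_i, θ_i) (so that H(p_i) = θ_i as in Lemma 3.2(iii)), the speed of a hypothetical shock joining p_j and p_l is computed from the jump condition (91), and the chord-type inequality (98) is checked for intermediate states u_0; the data are made up for illustration only.

```python
import numpy as np

# Hedged numerical illustration of the jump condition (91), with slope
# (theta_l - theta_j)/(p_l - p_j), and of the chord-type entropy inequality (98)
# for intermediate states u0. H is the piecewise-linear interpolation of the
# illustrative points (p_i, theta_i) below; these are examples, not paper data.
p = np.array([-1.0, 0.0, 2.0])
theta = np.array([1.0, 1.5, 0.5])          # theta_i = H(p_i); decreasing slopes,
                                           # so the graph of H lies above its chords

def H(u):                                  # piecewise-linear interpolation
    return np.interp(u, p, theta)

def shock_speed(j, l):                     # Rankine-Hugoniot speed, Eq. (91)
    return (theta[l] - theta[j]) / (p[l] - p[j])

j, l = 0, 2
s = shock_speed(j, l)
for u0 in np.linspace(p[j], p[l], 9)[1:-1]:
    lhs = (theta[l] - H(u0)) / (p[l] - u0)  # left-hand side of Eq. (98)
    assert lhs <= s + 1e-12                 # entropy inequality (98) holds
print("shock speed:", s)
```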
Proof of (ii) (sufficiency): Without loss of generality, assume p_1 < p_2 < · · · < p_m. Let C ∈ R. Suppose H̃ satisfies H̃(p_i) = H(p_i) + C for each i ∈ {1, . . . , m} and H̃(p) ≥ H(p) + C for any p ∈ [p_1, p_m]. We want to prove that u is the entropy solution to the conservation law (22).
As in the proof of (i), we apply [36, Prop. 2.1] and verify that the two conditions hold through Eqs. (91) and (92). Let x̄(t) be any smooth line of discontinuity of u, define u_- and u_+ by Eq. (90) (and recall that u_- = p_j and u_+ = p_l), and let u_0 ∈ (u_-, u_+). We proved in the proof of (i) that x̄(t) is a straight line, and so it suffices to prove Eq. (100).
We start by proving the equality in Eq. (100). By assumption, Eq. (101) holds. Combining Eq. (101) with Eq. (91) (which we proved in the proof of (i)), we obtain the equality in Eq. (100).
Proof of (ii) (necessity): Suppose that u is the entropy solution to the conservation law (22). We prove that there exists C ∈ R such that H̃(p_i) = H(p_i) + C for any i and H̃(p) ≥ H(p) + C for any p ∈ [p_1, p_m].
By Lemma B.2, for each i ∈ {1, . . . , m} there exist x ∈ R and t > 0 such that Eq. (102) holds.
Moreover, the proof of Lemma B.2 implies there exists T > 0 such that for any 0 < t < T, there exists x ∈ R such that Eq. (102) holds. As a result, there exists t > 0 such that for each i ∈ {1, . . . , m}, there exists x_i ∈ R satisfying Eq. (102) at the point (x_i, t), which implies u(x_i, t) = p_i. Note that p_i ≠ p_j implies that x_i ≠ x_j. (Indeed, if x_i = x_j, then p_i = ∇_x f(x_i, t) = ∇_x f(x_j, t) = p_j, which gives a contradiction since p_i ≠ p_j by assumption (A1).) As mentioned before, the function u(·, t) ≡ ∇_x f(·, t) is monotone non-decreasing and p_i is increasing with respect to i, and therefore x_1 < x_2 < · · · < x_m. Since u is piecewise constant, for each k ∈ {1, . . . , m − 1} there exists a curve of discontinuity of u with u = p_k on the left-hand side of the curve and u = p_{k+1} on the right-hand side of the curve. Let x̄(s) be such a curve and let u_- and u_+ be the corresponding numbers defined in Eq. (90). The argument above proves that we have u_- = p_k and u_+ = p_{k+1}.
Since u is the piecewise constant entropy solution, we invoke [36, Prop. 2.1] to conclude that the two aforementioned conditions hold for the curve x̄(s), i.e., (100) holds with u_- = p_k and u_+ = p_{k+1}. From the equality in (100) and Eq. (91) proved in (i), we deduce that H̃(p_{k+1}) − H̃(p_k) = H(p_{k+1}) − H(p_k). Since k is an arbitrary index, this equality holds for any k ∈ {1, . . . , m − 1}. Therefore, there exists C ∈ R such that Eq. (103) holds, i.e., H̃(p_i) = H(p_i) + C for every i ∈ {1, . . . , m}.
It remains to prove H̃(u_0) ≥ H(u_0) + C for all u_0 ∈ [p_k, p_{k+1}]. If this inequality holds, then the statement follows because k is an arbitrary index. We already proved that H̃(u_0) ≥ H(u_0) + C for u_0 = p_k with k ∈ {1, . . . , m}. Therefore, we need to prove that H̃(u_0) ≥ H(u_0) + C for all u_0 ∈ (p_k, p_{k+1}). Let u_0 ∈ (p_k, p_{k+1}). By Eq. (103) and the inequality in (100), we obtain Eq. (105). Comparing Eqs. (104) and (105), we obtain H̃(u_0) ≥ H(u_0) + C. Since k is arbitrary, we conclude that H̃(u_0) ≥ H(u_0) + C holds for all u_0 ∈ [p_1, p_m] and the proof is complete.
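The characterization just proved can be tested numerically for candidate Hamiltonians H̃. The sketch below uses the inequality direction restored in this proof, namely H̃(p_i) = H(p_i) + C at the slopes p_i together with H̃(p) ≥ H(p) + C on [p_1, p_m], and checks it on a grid; H is again a piecewise-linear interpolation of illustrative points, and the examples are not taken from the paper.

```python
import numpy as np

# Hedged sketch of the characterization proved above (with the inequality as
# restored in this proof): a flux H_tilde admits u as entropy solution iff, for
# some constant C, H_tilde(p_i) = H(p_i) + C at the slopes p_i and
# H_tilde(p) >= H(p) + C on [p_1, p_m].
p = np.array([-1.0, 0.0, 2.0])
theta = np.array([1.0, 1.5, 0.5])

def H(u):                                  # piecewise-linear interpolation of (p_i, theta_i)
    return np.interp(u, p, theta)

def satisfies_characterization(H_tilde, grid_size=1001, tol=1e-9):
    C = H_tilde(p[0]) - H(p[0])                        # candidate constant C
    if not np.allclose(H_tilde(p), H(p) + C, atol=tol):
        return False                                   # equality at the nodes fails
    grid = np.linspace(p[0], p[-1], grid_size)
    return bool(np.all(H_tilde(grid) >= H(grid) + C - tol))

def bump(u):                               # nonnegative and zero at every p_i
    return 0.05 * np.abs((u - p[0]) * (u - p[1]) * (u - p[2]))

print(satisfies_characterization(lambda u: H(u) + 3.0))       # True: vertical shift of H
print(satisfies_characterization(lambda u: H(u) + bump(u)))   # True: >= H between the nodes
print(satisfies_characterization(lambda u: H(u) - bump(u)))   # False: dips below H between the nodes
```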
Received: 27 October 2019 Accepted: 22 June 2020
References
1. Aaibid, M., Sayah, A.: A direct proof of the equivalence between the entropy solutions of conservation laws and
viscosity solutions of Hamilton–Jacobi equations in one-space variable. JIPAM J. Inequal. Pure Appl. Math. 7(2), 11
(2006)
2. Akian, M., Bapat, R., Gaubert, S.: Max-plus algebra. Handbook of Linear Algebra 39, (2006)
3. Akian, M., Gaubert, S., Lakhoua, A.: The max-plus finite element method for solving deterministic optimal control
problems: basic properties and convergence analysis. SIAM J. Control Optim. 47(2), 817–848 (2008)
4. Alla, A., Falcone, M., Saluzzi, L.: An efficient DP algorithm on a tree-structure for finite horizon optimal control problems.
SIAM J. Sci. Comput. 41(4), A2384–A2406 (2019)
5. Alla, A., Falcone, M., Volkwein, S.: Error analysis for POD approximations of infinite horizon problems via the dynamic
programming approach. SIAM J. Control Optim. 55(5), 3091–3115 (2017)
6. Arnol’d, V.I.: Mathematical methods of classical mechanics. Graduate Texts in Mathematics, vol. 60. Springer, New York
(1989). Translated from the 1974 Russian original by K. Vogtmann and A. Weinstein, Corrected reprint of the second
(1989) edition
7. Bachouch, A., Huré, C., Langrené, N., Pham, H.: Deep neural networks algorithms for stochastic control problems on
finite horizon: numerical applications. arXiv preprint arXiv:1812.05916 (2018)
8. Banerjee, K., Georganas, E., Kalamkar, D., Ziv, B., Segal, E., Anderson, C., Heinecke, A.: Optimizing deep learning RNN
topologies on intel architecture. Supercomput. Front. Innov. 6(3), 64–85 (2019)
9. Bardi, M., Capuzzo-Dolcetta, I.: Optimal control and viscosity solutions of Hamilton–Jacobi–Bellman equations. Syst.
Control Found. Appl. Birkhäuser Boston, Inc., Boston, MA (1997). https://doi.org/10.1007/978-0-8176-4755-1. With
appendices by Maurizio Falcone and Pierpaolo Soravia
10. Bardi, M., Evans, L.: On Hopf’s formulas for solutions of Hamilton–Jacobi equations. Nonlinear Anal. Theory, Methods
Appl. 8(11), 1373–1381 (1984). https://doi.org/10.1016/0362-546X(84)90020-8
11. Barles, G.: Solutions de viscosité des équations de Hamilton–Jacobi. Mathématiques et Applications. Springer, Berlin
(1994)
12. Barles, G., Tourin, A.: Commutation properties of semigroups for first-order Hamilton–Jacobi equations and application
to multi-time equations. Indiana Univ. Math. J. 50(4), 1523–1544 (2001)
13. Barron, E., Evans, L., Jensen, R.: Viscosity solutions of Isaacs’ equations and differential games with Lipschitz controls.
J. Differ. Equ. 53(2), 213–233 (1984). https://doi.org/10.1016/0022-0396(84)90040-8
14. Beck, C., Becker, S., Cheridito, P., Jentzen, A., Neufeld, A.: Deep splitting method for parabolic PDEs. (2019). arXiv
preprint arXiv:1907.03452
15. Beck, C., Becker, S., Grohs, P., Jaafari, N., Jentzen, A.: Solving stochastic differential equations and Kolmogorov equations
by means of deep learning. (2018). arXiv preprint arXiv:1806.00421
16. Beck, C., E, W., Jentzen, A.: Machine learning approximation algorithms for high-dimensional fully nonlinear partial
differential equations and second-order backward stochastic differential equations. J. Nonlinear Sci. 29(4), 1563–1619
(2019)
17. Bellman, R.E.: Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton (1961)
18. Berg, J., Nyström, K.: A unified deep artificial neural network approach to partial differential equations in complex
geometries. Neurocomputing 317, 28–41 (2018). https://doi.org/10.1016/j.neucom.2018.06.056
19. Bertsekas, D.P.: Reinforcement Learning and Optimal Control. Athena Scientific, Belmont (2019)
20. Bokanowski, O., Garcke, J., Griebel, M., Klompmaker, I.: An adaptive sparse grid semi-Lagrangian scheme for first order
Hamilton–Jacobi Bellman equations. J. Sci. Comput. 55(3), 575–605 (2013)
21. Bonnans, J.F., Shapiro, A.: Perturbation Analysis of Optimization Problems. Springer Series in Operations Research.
Springer, New York (2000). https://doi.org/10.1007/978-1-4612-1394-9
22. Brenier, Y., Osher, S.: Approximate Riemann solvers and numerical flux functions. SIAM J. Numer. Anal. 23(2), 259–273
(1986)
23. Brenier, Y., Osher, S.: The discrete one-sided Lipschitz condition for convex scalar conservation laws. SIAM J. Numer.
Anal. 25(1), 8–23 (1988). https://doi.org/10.1137/0725002
24. Buckdahn, R., Cardaliaguet, P., Quincampoix, M.: Some recent aspects of differential game theory. Dyn. Games Appl.
1(1), 74–114 (2011). https://doi.org/10.1007/s13235-010-0005-0
25. Carathéodory, C.: Calculus of variations and partial differential equations of the first order. Part I: Partial differential
equations of the first order. Translated by Robert B. Dean and Julius J. Brandstatter. Holden-Day, Inc., San Francisco-
London-Amsterdam (1965)
26. Carathéodory, C.: Calculus of variations and partial differential equations of the first order. Part II: Calculus of variations.
Translated from the German by Robert B. Dean, Julius J. Brandstatter, translating editor. Holden-Day, Inc., San Francisco-
London-Amsterdam (1967)
27. Cardin, F., Viterbo, C.: Commuting Hamiltonians and Hamilton–Jacobi multi-time equations. Duke Math. J. 144(2),
235–284 (2008). https://doi.org/10.1215/00127094-2008-036
28. Caselles, V.: Scalar conservation laws and Hamilton–Jacobi equations in one-space variable. Nonlinear Anal. Theory
Methods Appl. 18(5), 461–469 (1992). https://doi.org/10.1016/0362-546X(92)90013-5
29. Chan-Wai-Nam, Q., Mikael, J., Warin, X.: Machine learning for semi linear PDEs. J. Sci. Comput. 79(3), 1667–1712 (2019)
30. Chen, T., van Gelder, J., van de Ven, B., Amitonov, S.V., de Wilde, B., Euler, H.C.R., Broersma, H., Bobbert, P.A., Zwanenburg,
F.A., van der Wiel, W.G.: Classification with a disordered dopant-atom network in silicon. Nature 577(7790), 341–345
(2020)
31. Cheng, T., Lewis, F.L.: Fixed-final time constrained optimal control of nonlinear systems using neural network HJB
approach. In: Proceedings of the 45th IEEE Conference on Decision and Control, pp. 3016–3021 (2006). https://doi.
org/10.1109/CDC.2006.377523
32. Corrias, L., Falcone, M., Natalini, R.: Numerical schemes for conservation laws via Hamilton–Jacobi equations. Math.
Comput. 64(210), 555–580, S13–S18 (1995). https://doi.org/10.2307/2153439
33. Courant, R., Hilbert, D.: Methods of mathematical physics. Vol. II. Wiley Classics Library. Wiley: New York (1989). Partial
differential equations, Reprint of the 1962 original, A Wiley-Interscience Publication
34. Crandall, M.G., Ishii, H., Lions, P.L.: User’s guide to viscosity solutions of second order partial differential equations. Bull.
Am. Math. Soc. 27(1), 1–67 (1992). https://doi.org/10.1090/S0273-0979-1992-00266-5
35. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4), 303–314
(1989). https://doi.org/10.1007/BF02551274
36. Dafermos, C.M.: Polygonal approximations of solutions of the initial value problem for a conservation law. J. Math.
Anal. Appl. 38(1), 33–41 (1972). https://doi.org/10.1016/0022-247X(72)90114-X
37. Dafermos, C.M.: Hyperbolic conservation laws in continuum physics, Grundlehren der Mathematischen Wissenschaften,
vol. 325, 4th Edn. Springer, Berlin (2016). https://doi.org/10.1007/978-3-662-49451-6
38. Darbon, J.: On convex finite-dimensional variational methods in imaging sciences and Hamilton–Jacobi equations.
SIAM J. Imaging Sci. 8(4), 2268–2293 (2015). https://doi.org/10.1137/130944163
39. Darbon, J., Meng, T.: On decomposition models in imaging sciences and multi-time Hamilton-Jacobi partial differential
equations. (2019). arXiv preprint arXiv:1906.09502
40. Darbon, J., Osher, S.: Algorithms for overcoming the curse of dimensionality for certain Hamilton–Jacobi equations
arising in control theory and elsewhere. Res. Math. Sci. 3(1), 19 (2016). https://doi.org/10.1186/s40687-016-0068-7
41. Dissanayake, M.W.M.G., Phan-Thien, N.: Neural-network-based approximations for solving partial differential equa-
tions. Commun. Numer. Methods Eng. 10(3), 195–201 (1994). https://doi.org/10.1002/cnm.1640100303
42. Djeridane, B., Lygeros, J.: Neural approximation of PDE solutions: An application to reachability computations. In:
Proceedings of the 45th IEEE Conference on Decision and Control, pp. 3034–3039 (2006). https://doi.org/10.1109/
CDC.2006.377184
43. Dockhorn, T.: A discussion on solving partial differential equations using neural networks. (2019). arXiv preprint
arXiv:1904.07200
44. Dolgov, S., Kalise, D., Kunisch, K.: A tensor decomposition approach for high-dimensional Hamilton-Jacobi-Bellman
equations. (2019). arXiv preprint arXiv:1908.01533
45. Dower, P.M., McEneaney, W.M., Zhang, H.: Max-plus fundamental solution semigroups for optimal control problems.
In: 2015 Proceedings of the Conference on Control and its Applications, pp. 368–375. SIAM (2015)
46. Elliott, R.J.: Viscosity solutions and optimal control, Pitman research notes in mathematics series, vol. 165. Longman
Scientific & Technical, Harlow; Wiley, New York (1987)
47. Evans, L.C.: Partial differential equations, Graduate Studies in Mathematics, vol. 19, second edn. American Mathematical
Society, Providence, RI (2010). https://doi.org/10.1090/gsm/019
48. Evans, L.C., Gariepy, R.F.: Measure Theory and Fine Properties of Functions. Textbooks in Mathematics, revised edn.
CRC Press, Boca Raton (2015)
49. Evans, L.C., Souganidis, P.E.: Differential games and representation formulas for solutions of Hamilton–Jacobi–Isaacs
equations. Indiana Univ. Math. J. 33(5), 773–797 (1984)
50. Farabet, C., LeCun, Y., Kavukcuoglu, K., Culurciello, E., Martini, B., Akselrod, P., Talay, S.: Large-scale fpga-based convolu-
tional networks. In: Bekkerman, R., Bilenko, M., Langford, J. (eds.) Scaling up Machine Learning: Parallel and Distributed
Approaches. Cambridge University Press, Cambridge (2011)
51. Farabet, C., Poulet, C., Han, J., LeCun, Y.: CNP: An FPGA-based processor for convolutional networks. In: International
Conference on Field Programmable Logic and Applications. IEEE, Prague (2009)
52. Farabet, C., Poulet, C., LeCun, Y.: An FPGA-based stream processor for embedded real-time vision with convolutional
networks. In: 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, pp. 878–885.
IEEE Computer Society, Los Alamitos, CA, USA (2009). https://doi.org/10.1109/ICCVW.2009.5457611
53. Farimani, A.B., Gomes, J., Pande, V.S.: Deep Learning the Physics of Transport Phenomena. arXiv e-prints (2017)
54. Fleming, W., McEneaney, W.: A max-plus-based algorithm for a Hamilton–Jacobi–Bellman equation of nonlinear
filtering. SIAM J. Control Optim. 38(3), 683–710 (2000). https://doi.org/10.1137/S0363012998332433
55. Fleming, W.H., Rishel, R.W.: Deterministic and stochastic optimal control. Bull. Am. Math. Soc. 82, 869–870 (1976)
56. Fleming, W.H., Soner, H.M.: Controlled Markov Processes and Viscosity Solutions, vol. 25. Springer, New York (2006)
57. Folland, G.B.: Real Analysis: Modern Techniques and Their Applications. Wiley, Hoboken (2013)
58. Fujii, M., Takahashi, A., Takahashi, M.: Asymptotic expansion as prior knowledge in deep learning method for high
dimensional BSDEs. Asia-Pacific Financ. Mark. 26(3), 391–408 (2019). https://doi.org/10.1007/s10690-019-09271-7
59. Garcke, J., Kröner, A.: Suboptimal feedback control of PDEs by solving HJB equations on adaptive sparse grids. J. Sci.
Comput. 70(1), 1–28 (2017)
60. Gaubert, S., McEneaney, W., Qu, Z.: Curse of dimensionality reduction in max-plus based approximation methods:
Theoretical estimates and improved pruning algorithms. In: 2011 50th IEEE Conference on Decision and Control and
European Control Conference, pp. 1054–1061. IEEE (2011)
61. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, New York (2016)
62. Grohs, P., Jentzen, A., Salimova, D.: Deep neural network approximations for Monte Carlo algorithms. (2019). arXiv
preprint arXiv:1908.10828
63. Grüne, L.: Overcoming the curse of dimensionality for approximating lyapunov functions with deep neural networks
under a small-gain condition. (2020). arXiv preprint arXiv:2001.08423
64. Han, J., Jentzen, A., E, W.: Solving high-dimensional partial differential equations using deep learning. Proc. Natl. Acad.
Sci. 115(34), 8505–8510 (2018). https://doi.org/10.1073/pnas.1718942115
65. Han, J., Zhang, L., E, W.: Solving many-electron Schrödinger equation using deep neural networks. J. Comput. Phys.
108929 (2019)
66. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90
67. Hiriart-Urruty, J.B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms I: Fundamentals, vol. 305. Springer,
New York (1993)
68. Hiriart-Urruty, J.B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms II: Advanced Theory and Bundle
Methods, vol. 306. Springer, New York (1993)
69. Hirjibehedin, C.: Evolution of circuits for machine learning. Nature 577, 320–321 (2020). https://doi.org/10.1038/
d41586-020-00002-x
70. Hopf, E.: Generalized solutions of non-linear equations of first order. J. Math. Mech. 14, 951–973 (1965)
71. Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Netw. 4(2), 251–257 (1991). https://
doi.org/10.1016/0893-6080(91)90009-T
72. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw.
2(5), 359–366 (1989). https://doi.org/10.1016/0893-6080(89)90020-8
73. Horowitz, M.B., Damle, A., Burdick, J.W.: Linear Hamilton Jacobi Bellman equations in high dimensions. In: 53rd IEEE
Conference on Decision and Control, pp. 5880–5887. IEEE (2014)
74. Hsieh, J.T., Zhao, S., Eismann, S., Mirabella, L., Ermon, S.: Learning neural PDE solvers with convergence guarantees. In:
International Conference on Learning Representations (2019)
75. Hu, C., Shu, C.: A discontinuous Galerkin finite element method for Hamilton–Jacobi equations. SIAM J. Sci. Comput.
21(2), 666–690 (1999). https://doi.org/10.1137/S1064827598337282
76. Huré, C., Pham, H., Bachouch, A., Langrené, N.: Deep neural networks algorithms for stochastic control problems on
finite horizon, part I: convergence analysis. (2018). arXiv preprint arXiv:1812.04300
77. Huré, C., Pham, H., Warin, X.: Some machine learning schemes for high-dimensional nonlinear PDEs. (2019). arXiv
preprint arXiv:1902.01599
78. Hutzenthaler, M., Jentzen, A., Kruse, T., Nguyen, T.A.: A proof that rectified deep neural networks overcome the curse
of dimensionality in the numerical approximation of semilinear heat equations. SN Partial Differ. Equ. Appl. 1(10),
(2020)
79. Hutzenthaler, M., Jentzen, A., Kruse, T., Nguyen, T.A., von Wurstemberger, P.: Overcoming the curse of dimensionality
in the numerical approximation of semilinear parabolic partial differential equations (2018)
80. Hutzenthaler, M., Jentzen, A., von Wurstemberger, P.: Overcoming the curse of dimensionality in the approximative
pricing of financial derivatives with default risks (2019)
81. Hutzenthaler, M., Kruse, T.: Multilevel picard approximations of high-dimensional semilinear parabolic differential
equations with gradient-dependent nonlinearities. SIAM J. Numer. Anal. 58(2), 929–961 (2020). https://doi.org/10.
1137/17M1157015
82. Ishii, H.: Representation of solutions of Hamilton–Jacobi equations. Nonlinear Anal. Theory, Methods Appl. 12(2),
121–146 (1988). https://doi.org/10.1016/0362-546X(88)90030-2
83. Jiang, F., Chou, G., Chen, M., Tomlin, C.J.: Using neural networks to compute approximate and guaranteed feasible
Hamilton–Jacobi–Bellman PDE solutions. (2016). arXiv preprint arXiv:1611.03158
84. Jiang, G., Peng, D.: Weighted ENO schemes for Hamilton–Jacobi equations. SIAM J. Sci. Comput. 21(6), 2126–2143
(2000). https://doi.org/10.1137/S106482759732455X
85. Jianyu, L., Siwei, L., Yingjian, Q., Yaping, H.: Numerical solution of elliptic partial differential equation using radial basis
function neural networks. Neural Netw. 16(5–6), 729–734 (2003)
86. Jin, S., Xin, Z.: Numerical passage from systems of conservation laws to Hamilton–Jacobi equations, and relaxation
schemes. SIAM J. Numer. Anal. 35(6), 2385–2404 (1998). https://doi.org/10.1137/S0036142996314366
87. Jouppi, N.P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al.:
In-datacenter performance analysis of a tensor processing unit. In: Proceedings of the 44th Annual International
Symposium on Computer Architecture, ISCA ’17, pp. 1–12. Association for Computing Machinery, New York, NY, USA
(2017). https://doi.org/10.1145/3079856.3080246
88. Kalise, D., Kundu, S., Kunisch, K.: Robust feedback control of nonlinear PDEs by numerical approximation of high-
dimensional Hamilton–Jacobi–Isaacs equations. (2019). arXiv preprint arXiv:1905.06276
89. Kalise, D., Kunisch, K.: Polynomial approximation of high-dimensional Hamilton–Jacobi–Bellman equations and appli-
cations to feedback control of semilinear parabolic PDEs. SIAM J. Sci. Comput. 40(2), A629–A652 (2018)
90. Kang, W., Wilcox, L.C.: Mitigating the curse of dimensionality: sparse grid characteristics method for optimal feedback
control and HJB equations. Comput. Optim. Appl. 68(2), 289–315 (2017)
91. Karlsen, K., Risebro, H.: A note on front tracking and the equivalence between viscosity solutions of Hamilton–
Jacobi equations and entropy solutions of scalar conservation laws. Nonlinear Anal. (2002). https://doi.org/10.1016/
S0362-546X(01)00753-2
92. Khoo, Y., Lu, J., Ying, L.: Solving parametric PDE problems with artificial neural networks. (2017). arXiv preprint
arXiv:1707.03351
93. Khoo, Y., Lu, J., Ying, L.: Solving for high-dimensional committor functions using artificial neural networks. Res. Math.
Sci. 6(1), 1 (2019)
94. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of the 3rd International Conference
on Learning Representations (ICLR 2015) (2015)
95. Kružkov, S.N.: Generalized solutions of nonlinear first order equations with several independent variables II. Math.
USSR-Sbornik 1(1), 93–116 (1967). https://doi.org/10.1070/sm1967v001n01abeh001969
96. Kundu, A., Srinivasan, S., Qin, E.C., Kalamkar, D., Mellempudi, N.K., Das, D., Banerjee, K., Kaul, B., Dubey, P.: K-tanh:
Hardware efficient activations for deep learning (2019)
97. Kunisch, K., Volkwein, S., Xie, L.: HJB-POD-based feedback design for the optimal control of evolution problems. SIAM
J. Appl. Dyn. Syst. 3(4), 701–722 (2004)
98. Lagaris, I.E., Likas, A., Fotiadis, D.I.: Artificial neural networks for solving ordinary and partial differential equations. IEEE
Trans. Neural Netw. 9(5), 987–1000 (1998). https://doi.org/10.1109/72.712178
99. Lagaris, I.E., Likas, A.C., Papageorgiou, D.G.: Neural-network methods for boundary value problems with irregular
boundaries. IEEE Trans. Neural Netw. 11(5), 1041–1049 (2000). https://doi.org/10.1109/72.870037
100. Lambrianides, P., Gong, Q., Venturi, D.: A new scalable algorithm for computational optimal control under uncertainty.
(2019). arXiv preprint arXiv:1909.07960
101. Landau, L., Lifshitz, E.: Course of Theoretical Physics. Vol. 1: Mechanics. Oxford (1978)
102. LeCun, Y.: 1.1 deep learning hardware: Past, present, and future. In: 2019 IEEE International Solid-State Circuits
Conference—(ISSCC), pp. 12–19 (2019). https://doi.org/10.1109/ISSCC.2019.8662396
103. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
104. Lee, H., Kang, I.S.: Neural algorithm for solving differential equations. J. Comput. Phys. 91(1), 110–131 (1990)
105. Lions, P.L., Rochet, J.C.: Hopf formula and multitime Hamilton–Jacobi equations. Proc. Am. Math. Soc. 96(1), 79–84
(1986)
106. Lions, P.L., Souganidis, P.E.: Convergence of MUSCL and filtered schemes for scalar conservation laws and Hamilton–
Jacobi equations. Numerische Mathematik 69(4), 441–470 (1995). https://doi.org/10.1007/s002110050102
107. Long, Z., Lu, Y., Dong, B.: PDE-net 2.0: Learning PDEs from data with a numeric-symbolic hybrid deep network. J.
Comput. Phys. 399, 108925 (2019). https://doi.org/10.1016/j.jcp.2019.108925
108. Long, Z., Lu, Y., Ma, X., Dong, B.: PDE-net: Learning PDEs from data. (2017). arXiv preprint arXiv:1710.09668
109. Lye, K.O., Mishra, S., Ray, D.: Deep learning observables in computational fluid dynamics. (2019). arXiv preprint
arXiv:1903.03040
110. McEneaney, W.: Max-Plus Methods for Nonlinear Control and Estimation. Springer, New York (2006)
111. McEneaney, W.: A curse-of-dimensionality-free numerical method for solution of certain HJB PDEs. SIAM J. Control
Optim. 46(4), 1239–1276 (2007). https://doi.org/10.1137/040610830
112. McEneaney, W.M., Deshpande, A., Gaubert, S.: Curse-of-complexity attenuation in the curse-of-dimensionality-free
method for HJB PDEs. In: 2008 American Control Conference, pp. 4684–4690. IEEE (2008)
113. McEneaney, W.M., Kluberg, L.J.: Convergence rate for a curse-of-dimensionality-free method for a class of HJB PDEs.
SIAM J. Control Optim. 48(5), 3052–3079 (2009)
114. McFall, K.S., Mahan, J.R.: Artificial neural network method for solution of boundary value problems with exact
satisfaction of arbitrary boundary conditions. IEEE Trans. Neural Netw. 20(8), 1221–1233 (2009). https://doi.org/10.
1109/TNN.2009.2020735
115. Meade, A., Fernandez, A.: The numerical solution of linear ordinary differential equations by feedforward neural
networks. Math. Comput. Modell. 19(12), 1–25 (1994). https://doi.org/10.1016/0895-7177(94)90095-7
116. Meng, X., Karniadakis, G.E.: A composite neural network that learns from multi-fidelity data: Application to function
approximation and inverse PDE problems. (2019). arXiv preprint arXiv:1903.00104
117. Meng, X., Li, Z., Zhang, D., Karniadakis, G.E.: PPINN: Parareal physics-informed neural network for time-dependent
PDEs. (2019). arXiv preprint arXiv:1909.10145
118. van Milligen, B.P., Tribaldos, V., Jiménez, J.A.: Neural network differential equation and plasma equilibrium solver.
Phys. Rev. Lett. 75, 3594–3597 (1995). https://doi.org/10.1103/PhysRevLett.75.3594
119. Motta, M., Rampazzo, F.: Nonsmooth multi-time Hamilton–Jacobi systems. Indiana Univ. Math. J. 55(5), 1573–1614
(2006)
120. Niarchos, K.N., Lygeros, J.: A neural approximation to continuous time reachability computations. In: Proceedings of
the 45th IEEE Conference on Decision and Control, pp. 6313–6318 (2006). https://doi.org/10.1109/CDC.2006.377358
121. Osher, S., Shu, C.: High-order essentially nonoscillatory schemes for Hamilton–Jacobi equations. SIAM J. Numer. Anal.
28(4), 907–922 (1991). https://doi.org/10.1137/0728049
122. Pang, G., Lu, L., Karniadakis, G.E.: fPINNs: Fractional physics-informed neural networks. SIAM J. Sci. Comput. 41(4),
A2603–A2626 (2019)
123. Pham, H., Pham, H., Warin, X.: Neural networks-based backward scheme for fully nonlinear PDEs. (2019). arXiv preprint
arXiv:1908.00412
124. Pinkus, A.: Approximation theory of the MLP model in neural networks. In: Acta numerica, 1999, Acta Numer., vol. 8,
pp. 143–195. Cambridge University Press, Cambridge (1999)
125. Plaskacz, S., Quincampoix, M.: Oleinik–Lax formulas and multitime Hamilton–Jacobi systems. Nonlinear Anal. Theory,
Methods Appl. 51(6), 957–967 (2002). https://doi.org/10.1016/S0362-546X(01)00871-9
126. Raissi, M.: Deep hidden physics models: Deep learning of nonlinear partial differential equations. J. Mach. Learn. Res.
19(1), 932–955 (2018)
127. Raissi, M.: Forward-backward stochastic neural networks: Deep learning of high-dimensional partial differential
equations. (2018). arXiv preprint arXiv:1804.07010
128. Raissi, M., Perdikaris, P., Karniadakis, G.: Physics-informed neural networks: a deep learning framework for solving
forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707
(2019). https://doi.org/10.1016/j.jcp.2018.10.045
129. Raissi, M., Perdikaris, P., Karniadakis, G.E.: Physics informed deep learning (part i): Data-driven solutions of nonlinear
partial differential equations. (2017). arXiv preprint arXiv:1711.10561
130. Raissi, M., Perdikaris, P., Karniadakis, G.E.: Physics informed deep learning (part ii): Data-driven discovery of nonlinear
partial differential equations. (2017). arXiv preprint arXiv:1711.10566
131. Reisinger, C., Zhang, Y.: Rectified deep neural networks overcome the curse of dimensionality for nonsmooth value
functions in zero-sum games of nonlinear stiff systems. (2019). arXiv preprint arXiv:1903.06652
132. Rochet, J.: The taxation principle and multi-time Hamilton–Jacobi equations. J. Math. Econ. 14(2), 113–128 (1985).
https://doi.org/10.1016/0304-4068(85)90015-1
133. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
134. Royo, V.R., Tomlin, C.: Recursive regression with neural networks: Approximating the HJI PDE solution. (2016). arXiv
preprint arXiv:1611.02739
135. Rudd, K., Muro, G.D., Ferrari, S.: A constrained backpropagation approach for the adaptive solution of partial differential
equations. IEEE Trans. Neural Netw. Learn. Syst. 25(3), 571–584 (2014). https://doi.org/10.1109/TNNLS.2013.2277601
136. Ruthotto, L., Osher, S., Li, W., Nurbekyan, L., Fung, S.W.: A machine learning framework for solving high-dimensional
mean field game and mean field control problems. (2019). arXiv preprint arXiv:1912.01825
137. Schmidhuber, J.: Deep learning in neural networks: an overview. Neural Netw. 61, 85–117 (2015). https://doi.org/10.
1016/j.neunet.2014.09.003
138. Sirignano, J., Spiliopoulos, K.: DGM: A deep learning algorithm for solving partial differential equations. J. Comput.
Phys. 375, 1339–1364 (2018). https://doi.org/10.1016/j.jcp.2018.08.029
139. Tang, W., Shan, T., Dang, X., Li, M., Yang, F., Xu, S., Wu, J.: Study on a Poisson’s equation solver based on deep learning
technique. In: 2017 IEEE Electrical Design of Advanced Packaging and Systems Symposium (EDAPS), pp. 1–3 (2017).
https://doi.org/10.1109/EDAPS.2017.8277017
140. Tassa, Y., Erez, T.: Least squares solutions of the HJB equation with neural network value-function approximators. IEEE
Trans. Neural Netw. 18(4), 1031–1041 (2007). https://doi.org/10.1109/TNN.2007.899249
141. Tho, N.: Hopf-Lax-Oleinik type formula for multi-time Hamilton–Jacobi equations. Acta Math. Vietnamica 30, 275–287
(2005)
142. Todorov, E.: Efficient computation of optimal actions. Proc. Natl. Acad. Sci. 106(28), 11478–11483 (2009)
143. Uchiyama, T., Sonehara, N.: Solving inverse problems in nonlinear PDEs by recurrent neural networks. In: IEEE
International Conference on Neural Networks, pp. 99–102. IEEE (1993)
144. E, W., Yu, B.: The deep Ritz method: a deep learning-based numerical algorithm for solving variational problems.
Commun. Math. Stat. 6(1), 1–12 (2018)
145. E, W., Han, J., Jentzen, A.: Deep learning-based numerical methods for high-dimensional parabolic partial differential
equations and backward stochastic differential equations. Commun. Math. Stat. 5(4), 349–380 (2017). https://doi.org/
10.1007/s40304-017-0117-6
146. E, W., Hutzenthaler, M., Jentzen, A., Kruse, T.: Multilevel picard iterations for solving smooth semilinear parabolic heat
equations (2016)
147. Widder, D.V.: The Heat Equation, vol. 67. Academic Press, New York (1976)
148. Yadav, N., Yadav, A., Kumar, M.: An introduction to neural network methods for differential equations. SpringerBriefs
in Applied Sciences and Technology. Springer, Dordrecht (2015). https://doi.org/10.1007/978-94-017-9816-7
149. Yang, L., Zhang, D., Karniadakis, G.E.: Physics-informed generative adversarial networks for stochastic differential
equations. (2018). arXiv preprint arXiv:1811.02033
150. Yang, Y., Perdikaris, P.: Adversarial uncertainty quantification in physics-informed neural networks. J. Comput. Phys.
394, 136–152 (2019)
151. Yegorov, I., Dower, P.M.: Perspectives on characteristics based curse-of-dimensionality-free numerical approaches
for solving Hamilton–Jacobi equations. Appl. Math. Optim. 1–49 (2017)
152. Zhang, D., Guo, L., Karniadakis, G.E.: Learning in modal space: solving time-dependent stochastic PDEs using physics-
informed neural networks. (2019). arXiv preprint arXiv:1905.01205
153. Zhang, D., Lu, L., Guo, L., Karniadakis, G.E.: Quantifying total uncertainty in physics-informed neural networks for
solving forward and inverse stochastic problems. J. Comput. Phys. 397, 108850 (2019)
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.