Theory and Formulas For Backpropagation in Hilbert Spaces
Daniel Crespin
Facultad de Ciencias
Universidad Central de Venezuela
Abstract
This paper provides a detailed proof of the backpropagation algorithm for single input
data, as stated in section 17, and for multiple input data, as given in section 18. Our
viewpoint is that backpropagation consists essentially of the calculation of the gradient
of the quadratic error of a multilayer differentiable neural network having an architecture
of Hilbert spaces. Along the way a general theory for such networks is outlined. The
gradient is expressed, as expected, in terms of the error vectors and the transpose partial
derivatives of the layers. Compare with [3], and note that all the present results apply
without change to the case of Euclidean spaces (finite dimensional Hilbert spaces), hence
to Cartesian spaces $\mathbb{R}^n$ as well. Numerical analysis provides the well known
gradient descent method, a procedure much used to find or approximate the minimum of real
valued functions. Beyond the calculation of a gradient, backpropagation is the name given
to gradient descent when applied to the particularities of neural networks. The topic has
a very long history, as revealed in [6]. Although categories are not formally used, there is a
section of Figures containing twelve diagrams that, in the fashion of objects and morphisms,
illustrate neural networks, their values on inputs (forward propagation), their derivatives,
transpose derivatives, backpropagated errors and lifted errors; these liftings are, up to a
numerical factor of 2, the components of the sought gradient of the quadratic error.
§1 Neural networks
Neural networks defined
1. An $n$ layer differentiable neural network is a sequence $f = (f_k)_{k=1}^n$ of $C^1$ maps, $f_k : U_k \times W_k \to U_{k+1}$.
See Figure 1. With terms as explained below in 5, this definition states that the $k$-th
output domain $U_{k+1}$ is also the input domain of the layer $f_{k+1}$ next in the sequence.
4. Units and layers will not be given a detailed separate treatment in the present paper.
We discuss multilayer networks as compositions of networks. This approach is extremely
flexible and should prove useful in theory and practice for multilayer networks having
sophisticated hidden layers. Compare with [3] where units and layers are functions with
specific structures.
5. Terminology:
1.- The $k$-th layer of $f = (f_k)_{k=1}^n$ is, of course, $f_k$;
2.- the input layer is $f_1$;
3.- the output layer is $f_n$;
4.- the hidden layers or deep layers are the $f_k$ with $2 \le k \le n - 1$;
5.- the $k$-th domain is $U_k \times W_k$;
6.- the $k$-th codomain is $U_{k+1}$;
7.- the initial input domain is $U_1$;
8.- the $k$-th input domain is $U_k$;
9.- the $k$-th weight domain is $W_k$;
10.- the $k$-th output codomain is $U_{k+1}$;
11.- and the final output domain is $U_{n+1}$.
All these maps and domains appear as objects and arrows in Figure 1.
6. The multinput domain is $U = U_1 \times \cdots \times U_n$ and the multiweight domain is $W = W_1 \times \cdots \times W_n$.
7. By definition $f$ is unilayer if $n = 1$, bilayer if $n = 2$ and trilayer if $n = 3$. The neural
network is multilayer if $n \ge 2$. Figures 2, 3 and 4 display diagrams for unilayer, bilayer
and trilayer networks.
8. If the architecture is linear, $U_k = E_k$ and $W_k = G_k$ for all $k$, and if the layers $f_k$ are linear
transformations, then $f$ is a linear network. The derivative networks to be defined in 14
are linear.
9. Let $f = (f_k)_{k=1}^n$ and $w \in W$ be given. The forward pass of an initial input $x_1 \in U_1$,
displayed as the lower row in Figure 5, is the multinput sequence with first term $x_1$ and
remaining terms specified by the recursion formula $x_k = f_{k-1}(x_{k-1}, w_{k-1})$, $k = 1, \ldots, n$,
where for notational purposes we let $f_0(x_0, w_0) = x_1$, so that the forward pass is
$$(x_k)_{k=1}^n = (f_{k-1}(x_{k-1}, w_{k-1}))_{k=1}^n \in U = U_1 \times \cdots \times U_n$$
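In the finite dimensional case the recursion of 9 can be sketched in a few lines of code. This is an illustrative sketch only: the representation of layers as Python callables and the concrete example layers in the usage note are assumptions, not constructions from the paper.

```python
# Minimal sketch of the forward pass of 9: each layer is a callable f_k(x_k, w_k)
# and the next term is x_{k+1} = f_k(x_k, w_k).
def forward_pass(layers, weights, x1):
    """Return the forward pass (x_1, ..., x_n) and the final output x_{n+1}."""
    xs = [x1]
    for f, w in zip(layers, weights):
        xs.append(f(xs[-1], w))          # x_{k+1} = f_k(x_k, w_k)
    return xs[:-1], xs[-1]
```

For instance, with `layers = [lambda x, w: w * x + 1.0, lambda x, w: w * x]` and `weights = [2.0, 3.0]`, the initial input `x1 = 1.0` yields the forward pass `[1.0, 3.0]` and the final output `9.0`.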
10. Given $f$, $w$ and $x_1$, by definition the final output is $x_{n+1} = f_n(x_n, w_n) \in U_{n+1}$.
11. A multinput $x \in U$ is $x_1$-conditioned under $f$ and $w$ if it is the forward pass of the initial
input term $x_1 \in U_1$, that is, if the recursive formula $x_k = f_{k-1}(x_{k-1}, w_{k-1})$ is satisfied.
12. In an alternative but equivalent and sometimes more convenient notation the initial input
will be $a_1 \in U_1$, which under the forward pass gives the $a_1$-conditioned sequence $(a_k)_{k=1}^n = (f_{k-1}(a_{k-1}, w_{k-1}))_{k=1}^n$.
§3 Derivatives
Textbooks discussing derivatives in normed spaces are [5] Ch. VIII, §9 and [7] Ch. XIII, §7.
A good introduction to derivatives in $\mathbb{R}^n$ spaces can be found in [8].
14. The derivative of $f$ at $(x, w) \in U \times W$ is the multilayer linear network $Df(x, w)$, shown
in Figure 9, having input domains $U_k = E_k$, weight domains $W_k = G_k$, output domains
$U_{k+1} = E_{k+1}$ and layers equal to the derivatives $Df_k(x_k, w_k) : E_k \times G_k \to E_{k+1}$ of the
layers of $f$ calculated at the pairs $(x_k, w_k)$.
16. Above in 15, by construction the forward pass $(\Delta x_k)_{k=1}^n$ is $\Delta x_1$-conditioned; or $(\Delta a_k)_{k=1}^n$
is $\Delta a_1$-conditioned. The original network $f$ need not be $x_1$-conditioned, that is, the
relations $x_k = f_{k-1}(x_{k-1}, w_{k-1})$ are not necessarily imposed. Nevertheless, and particularly
with neural compositions, $x_1$-conditioning is crucial.
§4 Transpose derivatives, gradients and squared norm
Reference [1] explains transposes of linear transformations between complex Hilbert spaces
but the reader can easily adapt that discussion to the real Hilbert spaces here needed.
17. Consider an inner product space $E$ and let $a \in E$. The linear form of $a$ is the function
$\varphi_a : E \to \mathbb{R}$ defined as $\varphi_a(\Delta x) = \langle a, \Delta x \rangle$. A linear form $\varphi : E \to \mathbb{R}$ is representable if
there exists an $a \in E$ such that $\varphi = \varphi_a$, in which case $a$ is unique.
18. Let $[a] = \{\lambda a \mid \lambda \in \mathbb{R}\}$ be the line spanned by $a$. The parametrized line of $a$ is the linear
map $\ell_a : \mathbb{R} \to E$ defined as $\ell_a(t) = ta$. Observe that $\ell_a(1) = a$.
19. In an inner product space the linear form and the parametrized line are transposes of each
other
$$\varphi_a^* = \ell_a \qquad \ell_a^* = \varphi_a \tag{1}$$
20. The Riesz representation theorem states that if $E$ is a Hilbert space then all linear forms
are representable. Assuming that $E$, $F$ are Hilbert spaces it follows that any continuous
linear transformation $T : E \to F$ has a unique well defined transpose transformation
$T^* : F \to E$ characterized by the condition
$$\langle T(x), y \rangle = \langle x, T^*(y) \rangle \quad \text{for all } x \in E \text{ and all } y \in F.$$
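In the Euclidean case the transpose of 20 is represented by the transposed matrix, and the defining identity can be checked numerically. The dimensions and random data below are assumptions chosen for illustration.

```python
# Check <T(x), y> = <x, T*(y)> when T is a matrix and T* is its transpose.
import numpy as np

rng = np.random.default_rng(0)
T = rng.standard_normal((4, 3))      # T : R^3 -> R^4
x = rng.standard_normal(3)
y = rng.standard_normal(4)
lhs = (T @ x) @ y                    # <T(x), y>, inner product in R^4
rhs = x @ (T.T @ y)                  # <x, T*(y)>, inner product in R^3
```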
21. Consider Hilbert spaces $E$, $F$ with open subsets $U \subseteq E$, $V \subseteq F$ and let $f : U \to V$
be a differentiable map with derivative at $x \in U$ equal to the linear transformation
$Df(x) : E \to F$. Then as a particular case of 20, $f$ has a transpose derivative at $x$ defined
as $D^* f(x) = (Df(x))^* : F \to E$.
22. Let $g : V \to \mathbb{R}$ have derivative at $y = f(x) \in V$ equal to the linear form $Dg(y) : F \to \mathbb{R}$.
The gradient of $g$ at $y \in V$ is defined as a vector $\nabla g(y) \in F$ such that for all $\Delta y \in F$
$$Dg(y) \cdot \Delta y = \langle \nabla g(y), \Delta y \rangle$$
23. Riesz theorem implies that the gradient vector $\nabla g(y)$ exists and is given by
$$\nabla g(y) = D^* g(y) \cdot 1 \tag{2}$$
24. Furthermore the chain rule and properties of transposition imply that the gradient pulls
back by the transpose derivative
$$\nabla(g \circ f)(x) = D^* f(x)(\nabla g(y))$$
25. The squared norm function $Sq : E_{n+1} \to \mathbb{R}$ is defined as $Sq(x_{n+1}) = \langle x_{n+1}, x_{n+1} \rangle = \|x_{n+1}\|^2$. The derivative of $Sq$ at any $a_{n+1} \in E_{n+1}$ is twice the linear form of $a_{n+1}$ and the
transpose derivative is twice the parametrized line of $a_{n+1}$
$$DSq(a_{n+1}) = 2\varphi_{a_{n+1}} \qquad D^* Sq(a_{n+1}) = 2\ell_{a_{n+1}} \tag{3}$$
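Formula (3) can be verified by a finite difference in the Euclidean case; the vectors and the step size below are assumptions chosen for illustration.

```python
# Check DSq(a) . dx = 2<a, dx>: the derivative of the squared norm at a is
# twice the linear form of a.
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(5)
dx = rng.standard_normal(5)
eps = 1e-7
sq = lambda v: float(v @ v)                       # Sq(v) = <v, v>
directional = (sq(a + eps * dx) - sq(a)) / eps    # approximates DSq(a) . dx
linear_form = 2.0 * float(a @ dx)                 # 2 phi_a(dx)
```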
§5 Partial derivatives
26. Given $(x, w) \in U \times W$ the layer $f_k$ has an $E_k$-partial derivative and a $G_k$-partial derivative
at $(x_k, w_k)$.
27. Basic properties of partials imply that the derivative of the $k$-th layer is the direct sum of
the partials
$$Df_k(x_k, w_k) = D_{E_k} f_k(x_k, w_k) \oplus D_{G_k} f_k(x_k, w_k)$$
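The direct sum decomposition of 27 can be illustrated numerically for a scalar layer. The layer $f(x, w) = w \sin x$ and the evaluation point are assumptions made so the partials are explicit.

```python
# Check Df(x, w)(dx, dw) = D_E f dx + D_G f dw for f(x, w) = w*sin(x),
# whose partials are D_E f = w*cos(x) and D_G f = sin(x).
import math

f = lambda x, w: w * math.sin(x)
x0, w0, dx, dw, eps = 0.7, 1.3, 0.2, -0.4, 1e-7
full = (f(x0 + eps * dx, w0 + eps * dw) - f(x0, w0)) / eps   # Df(x0, w0)(dx, dw)
partials = w0 * math.cos(x0) * dx + math.sin(x0) * dw        # direct sum of the partials
```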
28. The diagram of partials at $(x, w)$ of the neural network $f = (f_k)_{k=1}^n$ by definition has
nodes for the vector spaces $E_k$ and $G_k$ and arrows labeled by the partials of the layers,
$D_E f_k = D_{E_k} f_k(x_k, w_k)$ and $D_G f_k = D_{G_k} f_k(x_k, w_k)$, all arranged as shown in Figure 10.
Compare with the diagram of transpose partials in Figure 11. These diagrams of partials
are not themselves neural networks.
29. Assuming that the normed spaces are all Hilbert, it follows from 20 that any linear trans-
formation has a transpose. In particular there is for each layer of $f = (f_k)_{k=1}^n$ and at
each $(x_k, w_k) \in U_k \times W_k$ a transpose derivative, a transpose $E_k$-partial derivative and a
transpose $G_k$-partial derivative.
30. The transpose of a direct sum is the product of the transposes. Applying this property
to 27 it follows that the transpose derivative of a layer $f_k$ is equal to the product of its
transpose partials
$$D^* f_k(x_k, w_k) = (D_{E_k}^* f_k(x_k, w_k), D_{G_k}^* f_k(x_k, w_k))$$
31. The diagram of transpose partials of $f = (f_k)_{k=1}^n$ at $(x, w)$ is defined as having nodes equal
to the Hilbert spaces $E_k$ and $G_k$, arrows labeled by the transpose partials of the layers,
$D_E^* f_k = D_{E_k}^* f_k(x_k, w_k)$ and $D_G^* f_k = D_{G_k}^* f_k(x_k, w_k)$, with these objects and morphisms
displayed in the manner of Figure 11. The diagram in this figure, when $x_1$-conditioned,
is the foundation of a thorough understanding of backpropagation.
§7 Conditioning the derivatives
33. By definition the neural composition of a bilayer network $f = (f_1, f_2)$ is the function that
transforms $(x_1, w_1, w_2)$ into the final output of the forward pass of $f$ with multiweight
$w = (w_1, w_2)$ and initial input $x_1$.
34. The neural composition $f_2 \,\hat\circ\, f_1$ is a unilayer neural network with input, weight and output
domains respectively equal to $U_1$, $W_1 \times W_2$ and $U_3$.
35. The natural projections of the bilayer network $f$ are the maps $\pi^{U_1 \times W_1}(x_1, w_1, w_2) = (x_1, w_1)$ and $\pi^{W_2}(x_1, w_1, w_2) = w_2$ with domains and codomains as shown
$$\pi^{U_1 \times W_1} : U_1 \times W_1 \times W_2 \to U_1 \times W_1 \qquad \pi^{W_2} : U_1 \times W_1 \times W_2 \to W_2 \tag{4}$$
36. The neural composition of a bilayer network considered as a function from its domain to
its codomain can be written as a composition of layers, natural projections and products, as
displayed in Figure 7 and expressed by the formula
$$f_2 \,\hat\circ\, f_1 = f_2 \circ ((f_1 \circ \pi^{U_1 \times W_1}), \pi^{W_2}) \tag{5}$$
37. The neural composition of a multilayer neural network $f = (f_k)_{k=1}^n$ is defined as the
function $\hat f = f_n \,\hat\circ \cdots \hat\circ\, f_1$ that sends $(x_1, w_1, \ldots, w_n)$ to the final output of
the forward pass of $f$ with multiweight $w = (w_1, \ldots, w_n)$ and initial input $x_1$.
38. The neural composition $\hat f = f_n \,\hat\circ \cdots \hat\circ\, f_1$ is a unilayer neural network, see Figure 6, having
input, weight and output domains respectively equal to $U_1$, $W = W_1 \times \cdots \times W_n$ and $U_{n+1}$.
39. The natural projections of the multilayer network $f$ are the maps
$$\pi^{U_1 \times W_1 \times \cdots \times W_{n-1}}(x_1, w_1, \ldots, w_{n-1}, w_n) = (x_1, w_1, \ldots, w_{n-1}) \qquad \pi^{W_n}(x_1, w_1, \ldots, w_{n-1}, w_n) = w_n \tag{6}$$
having the following domains and codomains
$$\pi^{U_1 \times W_1 \times \cdots \times W_{n-1}} : U_1 \times W_1 \times \cdots \times W_n \to U_1 \times W_1 \times \cdots \times W_{n-1} \qquad \pi^{W_n} : U_1 \times W_1 \times \cdots \times W_n \to W_n \tag{7}$$
40. It follows that the neural composition of a neural network having $n$ layers can be expressed
as the neural composition of two unilayer networks, namely of the last layer and the neural
composition of the first $n - 1$ layers
$$\hat f = f_n \,\hat\circ \cdots \hat\circ\, f_1 = f_n \,\hat\circ\, (f_{n-1} \,\hat\circ \cdots \hat\circ\, f_1)$$
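The factorization in 40 translates directly into a recursive definition of the neural composition; the representation of layers as callables and the example layers in the usage note are illustrative assumptions.

```python
# Recursive sketch of 40: fhat = f_n ô (f_{n-1} ô ... ô f_1), where each layer is a
# callable f_k(x_k, w_k) and fhat takes (x1, w1, ..., wn).
def neural_compose(layers):
    if len(layers) == 1:
        return layers[0]                          # a unilayer network is its own composition
    head = neural_compose(layers[:-1])            # f_{n-1} ô ... ô f_1
    last = layers[-1]                             # f_n
    # Feed the head's final output, together with the last weight, into f_n.
    return lambda x1, *ws: last(head(x1, *ws[:-1]), ws[-1])
```

For example, `neural_compose([lambda x, w: x + w, lambda x, w: x * w])(1.0, 2.0, 3.0)` computes `(1.0 + 2.0) * 3.0 = 9.0`.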
41. The bilayer network derivative, which is the case $n = 2$ of 14 and of Figure 9, has linear
architecture with layers $Df_1(x_1, w_1) : E_1 \times G_1 \to E_2$ and $Df_2(x_2, w_2) : E_2 \times G_2 \to E_3$.
Since the domains are normed spaces, $U_k = E_k$, $W_k = G_k$, to apply equation (5) the
following linear projections are needed
$$\pi^{E_1 \times G_1} : E_1 \times G_1 \times G_2 \to E_1 \times G_1 \qquad \pi^{G_2} : E_1 \times G_1 \times G_2 \to G_2$$
42. These linear projections are the derivatives, for all $x = (x_1, w_1, w_2) \in U_1 \times W_1 \times W_2$, of
the natural projections given in equation (4), that is
$$D\pi^{U_1 \times W_1}(x_1, w_1, w_2) = \pi^{E_1 \times G_1} \qquad D\pi^{W_2}(x_1, w_1, w_2) = \pi^{G_2} \tag{8}$$
43. Then the derivative of the neural composition of a bilayer network is equal to the neural
composition of the $x_1$-conditioned derivative
$$D(f_2 \,\hat\circ\, f_1) = D(f_2 \circ ((f_1 \circ \pi^{U_1 \times W_1}), \pi^{W_2})) = Df_2 \circ ((Df_1 \circ \pi^{E_1 \times G_1}), \pi^{G_2}) = Df_2 \,\hat\circ\, Df_1 \tag{9}$$
The first equality is true by equation (5) applied to the bilayer network $(f_1, f_2)$; the
second by the chain rule for derivatives together with the product rule for derivatives and
(8); and the third again by (5), but this time applied to the $x_1$-conditioned linear network
$Df = (Df_1, Df_2)$. An alternative proof of (9) is described in the caption of Figure 8.
§11 Derivatives of multilayer compositions
44. Define the linear projections $\pi^{E_1 \times G_1 \times \cdots \times G_{n-1}}(x_1, w) = (x_1, w_1, \ldots, w_{n-1})$ and $\pi^{G_n}(x_1, w) = w_n$, hence
$$\pi^{E_1 \times G_1 \times \cdots \times G_{n-1}} : E_1 \times G \to E_1 \times G_1 \times \cdots \times G_{n-1} \qquad \pi^{G_n} : E_1 \times G \to G_n$$
45. The above linear projections are the derivatives, for all $x = (x_1, w) \in U_1 \times W$, of the
natural projections given in equation (6).
46. By induction on $n$ (the case $n = 2$ being (9)) the neural chain rule follows: the deriva-
tive of a multilayer neural composition is equal to the composition of the $x_1$-conditioned
derivatives of the layers
$$D(f_n \,\hat\circ \cdots \hat\circ\, f_1)(x_1, w) = Df_n(x_n, w_n) \,\hat\circ \cdots \hat\circ\, Df_1(x_1, w_1)$$
47. A basic property of derivatives having a product domain is that they are the direct sum
of the partials taken with respect to the factors.
For this direct sum decomposition the $G$-partial as well as the $G_k$-partials are all calculated
at $(x_1, w) \in U_1 \times W$. No forward pass is required to specify $(x_1, w)$ and no $x_1$-conditioning
on the partials is involved.
49. In general the transpose of a direct sum is the product of the transposes. Therefore the
transpose $G$-partial derivative of the neural composition at $(x_1, w) \in U_1 \times W$ is equal to
the product of transpose $G_k$-partials of $\hat f$
$$D_G^* \hat f(x_1, w) = (D_{G_1}^* \hat f(x_1, w), \ldots, D_{G_n}^* \hat f(x_1, w)) \tag{10}$$
§13 Partials of multilayer compositions
Two statements in this section require proof, to be performed by invoking the neural chain
rule, basic properties of partials and induction on $n$. The relevant diagram is Figure 10.
50. The $E_1$-partial of the neural composition at $(x_1, w) \in U_1 \times W$ of a multilayer network is
equal to the composition of the $x_1$-conditioned $E_k$-partials of the layers
$$D_{E_1}(f_n \,\hat\circ \cdots \hat\circ\, f_1)(x_1, w) = D_{E_n} f_n(x_n, w_n) \circ \cdots \circ D_{E_1} f_1(x_1, w_1)$$
51. By definition an $E_i$-partial $D_{E_i} f_i(x_i, w_i)$ is downstream from the $G_k$-partial $D_{G_k} f_k(x_k, w_k)$
if $k < i$. See the direction of the arrows in Figure 10.
52. The $G_k$-partial of $\hat f$ calculated at $(x_1, w) \in U_1 \times W$ is equal to the $x_1$-conditioned compo-
sition of the $G_k$-partial of the $k$-th layer and its downstream $E_i$-partials
$$D_{G_k}(f_n \,\hat\circ \cdots \hat\circ\, f_1)(x_1, w) = D_{E_n} f_n(x_n, w_n) \circ \cdots \circ D_{E_{k+1}} f_{k+1}(x_{k+1}, w_{k+1}) \circ D_{G_k} f_k(x_k, w_k) \tag{11}$$
§15 Backpropagating and lifting
57. Let $f = (f_k)_{k=1}^n$, $\hat f$ and $w$ be as in previous sections. Consider any element $\Delta a_{n+1} \in E_{n+1}$,
to be called the $(n+1)$-th backpropagated error or final error. Other errors in this section
are obtained by applying transpose derivatives to this final error. In later sections we
will focus on the output error of the network, $\Delta a_{n+1} = \hat f(a_1, w) - b_{n+1}$.
58. By definition the $k$-th backpropagated error $\Delta a_k \in E_k$ is the image of the $(k+1)$-th
backpropagated error by the conditioned transpose $E_k$-partial of the $k$-th layer, $k = n, n-1, \ldots, 2$
$$\Delta a_k = D_{E_k}^* f_k(a_k, w_k) \cdot \Delta a_{k+1} \tag{13}$$
This recursive descent formula allows one to define $\Delta a_1 \in E_1$ as well; however, only the backpropa-
gated errors $\Delta a_k \in E_k$ with $2 \le k \le n+1$ will be needed.
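For layers $f_k(x, W_k) = W_k x$ the conditioned transpose $E_k$-partial is the transposed matrix $W_k^T$, and the recursive descent of 58 can be sketched directly. The linear layers are an assumption made so that the transpose partials are explicit.

```python
# Recursive descent of 58 for linear layers: Delta a_k = W_k^T Delta a_{k+1},
# k = n, n-1, ..., 2, starting from the final error Delta a_{n+1}.
import numpy as np

def backpropagated_errors(Ws, da_final):
    """Return the list [Delta a_2, ..., Delta a_{n+1}]."""
    errors = [da_final]                      # Delta a_{n+1}
    for W in reversed(Ws[1:]):               # W_n, ..., W_2
        errors.append(W.T @ errors[-1])      # Delta a_k = D*_{E_k} f_k . Delta a_{k+1}
    return errors[::-1]
```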
59. The $k$-th lifted error is by definition the image of the $(k+1)$-th backpropagated error by
the conditioned transpose $G_k$-partial of the $k$-th layer, $k = n, n-1, \ldots, 2, 1$
$$\Delta w_k = D_{G_k}^* f_k(a_k, w_k) \cdot \Delta a_{k+1} \tag{14}$$
60. The transpose $G$-partial of the neural composition calculated at $(a_1, w) \in U_1 \times W$ and evaluated on
any given output error $\Delta a_{n+1} \in E_{n+1}$ is equal to the $n$-tuple of lifted errors
$$D_G^* \hat f(a_1, w) \cdot \Delta a_{n+1} = (D_{G_1}^* \hat f(a_1, w), \ldots, D_{G_n}^* \hat f(a_1, w)) \cdot \Delta a_{n+1} = (D_{G_1}^* f_1(a_1, w_1) \cdot \Delta a_2, \ldots, D_{G_n}^* f_n(a_n, w_n) \cdot \Delta a_{n+1}) = (\Delta w_k)_{k=1}^n \tag{15}$$
Proof: The first equality is true by (10); the second by the chain rule for transpose
partials (12) (with $a_1$-conditioning) and the recursive descent definition of $\Delta a_k$ given in
(13); and the third by (14) (again with $a_1$-conditioning).
62. From the manner partial derivatives are defined it follows that the derivative of the output
error function at $w \in W$ is equal to the $G$-partial of $\hat f$ at $(a_1, w)$.
63. Then the transpose derivative of the output error function $e$ at $w \in W$ is equal to the
transpose $G$-partial of $\hat f$ at $(a_1, w)$
$$D^* e(w) = D_G^* \hat f(a_1, w) \tag{18}$$
64. The quadratic error function $Q : W \to \mathbb{R}$ is by definition the squared length of the error
$$Q(w) = \|e(w)\|^2 = Sq(e(w)) \tag{19}$$
65. The first term below, which is the transpose derivative at the multiweight $w \in W = W_1 \times \cdots \times W_n$ of the quadratic error function, is equal to the last, which is twice the transpose
$G$-partial derivative of the neural composition calculated at $(a_1, w)$ and composed with
the parametrized line of the final error.
Proof: The first equality is true by the definition of quadratic error in (19); the second
by the definition of transpose derivative; the third by the chain rule applied to $Sq \circ e$; the
fourth because of (3) and (17); the fifth because the transpose of a composition is the
composition of the transposes in reverse order; and the last by (1) and by the definition
of $e(w)$ given in equation (16).
§17 Backpropagation for a single input data
The gradient formula given below in equation (21), and reflected in Figures 11 and 12,
requires data and calculations as now recapitulated
66. From
1.- a multilayer network $f = (f_k)_{k=1}^n$
2.- a multiweight $w = (w_1, \ldots, w_n) \in W$
3.- an initial input $a_1 \in U_1$, to be also symbolized $a_1 = f_0(a_0, w_0)$
4.- a desired output $b_{n+1} \in E_{n+1}$
the following is obtained
1.- the forward pass $(a_k)_{k=1}^n = (f_{k-1}(a_{k-1}, w_{k-1}))_{k=1}^n$
2.- the final output $a_{n+1} = f_n(a_n, w_n) \in E_{n+1}$
3.- the output error $\Delta a_{n+1} = a_{n+1} - b_{n+1} \in E_{n+1}$
4.- the output error function $e : W \to E_{n+1}$, $e(w) = \hat f(a_1, w) - b_{n+1}$
5.- the transpose $E_k$-partials $D_{E_k}^* f_k(a_k, w_k)$
6.- the transpose $G_k$-partials $D_{G_k}^* f_k(a_k, w_k)$
7.- the backpropagated errors $\Delta a_k = D_{E_k}^* f_k(a_k, w_k) \cdot \Delta a_{k+1} \in E_k$, $k = n, n-1, \ldots, 2$
8.- the lifted errors $\Delta w_k = D_{G_k}^* f_k(a_k, w_k) \cdot \Delta a_{k+1} \in G_k$, $k = n, n-1, \ldots, 2, 1$
67. If there is a single initial input the substance of backpropagation for neural networks
in Hilbert spaces is the following statement:
The gradient of the quadratic error function is equal to twice the $n$-tuple of the
lifted errors
$$\nabla Q(w) = 2(\Delta w_k)_{k=1}^n \tag{21}$$
68. For a justification and proof of (21) check the hypotheses and follow all the steps indicated
in 66. Alternatively take $a = \Delta a_{n+1} = f_n(a_n, w_n) - b_{n+1} = a_{n+1} - b_{n+1}$ in equation (2),
and then apply equations (12), (15) and (20).
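As a concrete check of (21), the following sketch computes the gradient of the quadratic error for linear layers $f_k(x, W_k) = W_k x$ on Euclidean spaces, where $D_{E_k}^* f_k = W_k^T$ and the transpose $G_k$-partial applied to $\Delta a_{k+1}$ is the outer product $\Delta a_{k+1}\, a_k^T$ under the Frobenius inner product. The linear layers and dimensions are assumptions, not the general setting of the paper.

```python
# Formula (21) for linear layers: grad Q(w) = twice the n-tuple of lifted errors.
import numpy as np

def backprop_gradient(Ws, a1, b):
    """Forward pass, then backpropagated and lifted errors; return grad Q(w)."""
    a = [a1]
    for W in Ws:                                   # forward pass: a_{k+1} = W_k a_k
        a.append(W @ a[-1])
    da = a[-1] - b                                 # output error Delta a_{n+1}
    grads = [None] * len(Ws)
    for k in reversed(range(len(Ws))):
        grads[k] = 2.0 * np.outer(da, a[k])        # twice the lifted error, 2*Delta w_k
        da = Ws[k].T @ da                          # backpropagated error Delta a_k
    return grads
```

The components `grads[k]` agree with finite differences of $Q(w) = \|\hat f(a_1, w) - b_{n+1}\|^2$ taken entry by entry in each weight matrix.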
69. Under a variety of formats and scenarios this gradient must be calculated whenever trying
to minimize the quadratic error function of a differentiable multilayer neural network by
stepwise modification of the weights applying the method of gradient descent. "Backprop-
agation" is the name given to the calculation of $\nabla Q(w)$, or to certain particular steps
like obtaining the backpropagated errors $\Delta a_k$. It may also sometimes refer to the totality
of the awesome forest surrounding deep learning of neural networks by gradient descent,
where concepts like "learning rate", "thresholds", "distances to decision hypersurfaces",
"training epochs", "overfitting", "cutoff values" and many others thrive.
§18 Backpropagation for multiple input data
70. Formula (21) has an extension to several initial inputs arranged in a finite set $A = \{a_1^j \mid j = 1, \ldots, m\} \subseteq U_1$, where each $a_1^j \in U_1$ has a corresponding desired output $b_{n+1}^j = d(a_1^j) \in E_{n+1}$ specified by means of a usually empirical function $d : A \to E_{n+1}$.
71. Given:
1.- the multilayer neural network $f = (f_k)_{k=1}^n$
2.- the multiweight $w \in W = W_1 \times \cdots \times W_n$
3.- the finite set $A \subseteq U_1$
4.- and the desired output function $d : A \to E_{n+1}$, $d(a_1^j) = b_{n+1}^j$
there is for each $j$:
1.- the forward pass $(a_k^j)_{k=1}^n = (f_{k-1}(a_{k-1}^j, w_{k-1}))_{k=1}^n$
2.- the final output $a_{n+1}^j = a_{n+1}^j(a_1^j, w) = f_n(a_n^j, w_n)$
3.- the output error $\Delta a_{n+1}^j = a_{n+1}^j - b_{n+1}^j \in E_{n+1}$
4.- the output error function $e^j : W \to E_{n+1}$, $e^j(w) = a_{n+1}^j(w) - b_{n+1}^j$
5.- the transpose $E_k$-partials $D_{E_k}^* f_k(a_k^j, w_k)$
6.- the transpose $G_k$-partials $D_{G_k}^* f_k(a_k^j, w_k)$
7.- the backpropagated errors $\Delta a_k^j = D_{E_k}^* f_k(a_k^j, w_k) \cdot \Delta a_{k+1}^j \in E_k$, $k = n, n-1, \ldots, 2$
8.- the lifted errors $\Delta w_k^j = D_{G_k}^* f_k(a_k^j, w_k) \cdot \Delta a_{k+1}^j$
73. The gradient of the total quadratic error function is equal to the sum of the $j$-gradients
$$\nabla Q_T(w) = \sum_{j=1}^m \nabla Q_j(w)$$
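A sketch of 73 with linear layers $f_k(x, W_k) = W_k x$ (an assumption, as before): the total gradient is accumulated input by input from the single-input gradients of (21).

```python
# Total gradient as the sum of the j-gradients, for layers f_k(x, W_k) = W_k x.
import numpy as np

def grad_Q(Ws, a1, b):
    """Single-input gradient: twice the lifted errors, as in formula (21)."""
    a = [a1]
    for W in Ws:
        a.append(W @ a[-1])                        # forward pass
    da = a[-1] - b                                 # output error
    grads = [None] * len(Ws)
    for k in reversed(range(len(Ws))):
        grads[k] = 2.0 * np.outer(da, a[k])        # 2 * Delta w_k
        da = Ws[k].T @ da                          # Delta a_k
    return grads

def grad_Q_total(Ws, inputs, desired):
    """Sum over j of the gradients of the j-th quadratic error."""
    totals = [np.zeros_like(W) for W in Ws]
    for a1, b in zip(inputs, desired):
        for t, g in zip(totals, grad_Q(Ws, a1, b)):
            t += g
    return totals
```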
74. If there are multiple initial inputs then backpropagation for neural networks in
Hilbert spaces relies on the following result:
The gradient of the total quadratic error function is equal to twice the n-tuple of
the lifted total errors
75. Summing up, backpropagation is required for the deep learning of (deep teaching to) neural
networks. In a first instance, learning means making the quadratic error smaller. And
backpropagation is the attempt to reduce the error by small stepwise changes of the
weights in directions opposite to the gradient.
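The stepwise reduction just described can be sketched for the simplest case, a unilayer network $f(x, w) = wx$ with a single input, where the gradient is $dQ/dw = 2(wa_1 - b)a_1$; the data and learning rate below are assumptions.

```python
# Gradient descent on Q(w) = (w*a1 - b)^2; the iterates approach w = b/a1.
a1, b, eta = 1.5, 3.0, 0.1
w = 0.0
for _ in range(100):
    grad = 2.0 * (w * a1 - b) * a1      # the backpropagation output for this network
    w -= eta * grad                     # step opposite to the gradient
```

With these values each step contracts the distance to the minimizer $w = b/a_1 = 2$ by the factor $|1 - 2\eta a_1^2| = 0.55$, so the loop converges.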
Some conventions
§19 Figures
[Figure 1 diagram: an $n$ layer network with weight domains $W_1, \ldots, W_n$, input/output domains $U_1, \ldots, U_{n+1}$ and layer maps $f_1, \ldots, f_n$, each $f_k : U_k \times W_k \to U_{k+1}$.]
[Figure 2 diagram: a unilayer neural network with layer $f_1 : U_1 \times W_1 \to U_2$.]
[Figure 3 diagram: layers $f_1 : U_1 \times W_1 \to U_2$ and $f_2 : U_2 \times W_2 \to U_3$.]
Figure 3: A bilayer neural network consists of two layers which are the functions
shown in the diagram. The functions f1 and f2 are not composable as usually
understood but they do have a neural composition in the sense of the definition
given in 33 or as expressed in 36. See also Figure 7.
[Figure 4 diagram: layers $f_1, f_2, f_3$ with domains $U_1, \ldots, U_4$ and $W_1, W_2, W_3$.]
Figure 4: A trilayer neural network has the three layers shown above. This is the
result of setting n = 3 in the diagram of Figure 1. The architecture is specified by
seven open subsets of respective normed spaces. Customarily f1 is the input layer,
f2 is the hidden layer and f3 is the output layer.
[Figure 5 diagram: upper row $w_1, \ldots, w_n$; middle row the pairs $(a_k, w_k)$; lower row $a_1, \ldots, a_{n+1}$.]
Figure 5: The nodes of this diagram are elements of the various domains of a
multilayer network. The upper row has the components $w_k$ of a given multiweight
$w$. At the far left, the lower row has the initial input $a_1$, the upper row has the
first weight component $w_1$ and the middle row has the pair $(a_1, w_1)$. The terms
$a_k = f_{k-1}(a_{k-1}, w_{k-1})$ are calculated by iteration and constitute the forward pass
defined in 9. The final output $a_{n+1}$ is at the extreme right. All the entries in
the middle row are pairs $(a_k, w_k)$ obtained by pairing corresponding elements in the
lower and upper rows. The arrows join the pairs to their images under the layer
maps. Compare with Figure 1.
[Figure 6 diagram: the neural composition $\hat f : U_1 \times W_1 \times \cdots \times W_n \to U_{n+1}$ as a unilayer network.]
[Figure 7 diagram: commutative diagram on $U_1 \times W_1 \times W_2$ with projections $\pi^{U_1 \times W_1}$ and $\pi^{W_2}$, the maps $[f_1]$, $[[f_1]]$ and $f_2$, and the single arrow $f_2 \,\hat\circ\, f_1 = f_2 \circ [[f_1]]$.]
Figure 7: The commutative diagram at left simplifies to become the single arrow
at right. This is an arrow theoretic or categorical version of formula (5) for the
neural composition of a bilayer network. Here $f_2 \,\hat\circ\, f_1 : U_1 \times W_1 \times W_2 \to U_3$ is expressed
using objects and arrows that represent $f_1$, $f_2$, various projections, their compositions
and products. The projections reduce the number of variables as required by the
layers. Consider $[f_1] = f_1 \circ \pi^{U_1 \times W_1}$. The neural composition $f_2 \,\hat\circ\, f_1$ (long red arrow
at right) is equal to the (ordinary) composition of the product map $[[f_1]] = ([f_1], \pi^{W_2})$
(red arrow at top left) and $f_2$ (red arrow at bottom left) as implied by the equalities
$f_2 \circ [[f_1]] = f_2 \circ ([f_1], \pi^{W_2}) = f_2 \circ ((f_1 \circ \pi^{U_1 \times W_1}), \pi^{W_2}) = f_2 \,\hat\circ\, f_1$.
[Figure 8 diagram: the derivative of the diagram of Figure 7, on $E_1 \times G_1 \times G_2$, with $Df_2 \,\hat\circ\, Df_1 = Df_2 \circ D[[f_1]]$.]
Figure 8: This is the derivative of the diagram of Figure 7. For a diagram theo-
retic demonstration of the bilayer neural chain rule given in equation (9) modify
Figure 7 as follows: 1.- take the derivatives of all the maps, noting that the nat-
ural projections have derivatives equal to the linear projections; 2.- substitute the
open sets by their normed spaces; 3.- and invoke the chain rule with $x_2 = f_1(x_1, w_1)$
to maintain commutativity. The result is this $x_1$-conditioned diagram which, being
commutative, proves that $D(f_2 \,\hat\circ\, f_1) = Df_2 \,\hat\circ\, Df_1$.
[Figure 9 diagram: the derivative network with layers $Df_k : E_k \times G_k \to E_{k+1}$ and weight domains $G_1, \ldots, G_n$.]
Figure 9: For any $(x, w) \in U \times W$ the derivative network was defined in 14 as the
$n$-tuple of derivatives of the layers $Df_k(x_k, w_k) : E_k \times G_k \to E_{k+1}$, each calculated at
the appropriate $(x_k, w_k) \in U_k \times W_k$. These constitute a linear multilayer network,
here displayed with $Df_k$ standing for $Df_k(x_k, w_k)$. Compare with Figure 1.
[Figure 10 diagram: nodes $E_1, \ldots, E_{n+1}$ and $G_1, \ldots, G_n$, horizontal arrows $D_E f_k$ and slanted arrows $D_G f_k$, all pointing left to right.]
Figure 10: The diagram of partials of the neural network $f = (f_k)_{k=1}^n$ at
$(x, w)$ has objects $E_k$ and $G_k$ with arrows labeled by the partials, which are
linear transformations here denoted as $D_E f_k = D_{E_k} f_k(x_k, w_k) : E_k \to E_{k+1}$
and $D_G f_k = D_{G_k} f_k(x_k, w_k) : G_k \to E_{k+1}$. By inspection it is obvious that
the diagram of partials is not a neural network. The relation of being down-
stream was defined in 51. Heuristically the arrows, either horizontal or slanted,
all point from left to right and tell the "direction of the stream". Down-
stream of $D_G f_1 = D_{G_1} f_1(x_1, w_1)$ are the $E_k$-partials to the right of it, that is,
$D_E f_2, D_E f_3, \ldots, D_E f_n$. Downstream of $D_G f_2$ are $D_E f_3, \ldots, D_E f_n$, and so on. At
the far right, downstream of $D_G f_{n-1}$ there is only $D_E f_n$ and downstream of $D_G f_n$
there are no $E_k$-partials. According to 50 the composition of the $x_1$-conditioned
linear maps in the lower row is equal to the $E_1$-partial of the neural composition,
$D_{E_1}(f_n \,\hat\circ \cdots \hat\circ\, f_1)(x_1, w) = D_{E_n} f_n(x_n, w_n) \circ \cdots \circ D_{E_1} f_1(x_1, w_1)$. From (11) the $G_k$-partial of the neural composition is equal to the composition of the $x_1$-conditioned $G_k$-partial of the $k$-th layer with the downstream $E_j$-partials, $D_{G_k}(f_n \,\hat\circ \cdots \hat\circ\, f_1)(x_1, w) = D_{E_n} f_n(x_n, w_n) \circ \cdots \circ D_{E_{k+1}} f_{k+1}(x_{k+1}, w_{k+1}) \circ D_{G_k} f_k(x_k, w_k)$.
[Figure 11 diagram: nodes $E_1, \ldots, E_{n+1}$ and $G_1, \ldots, G_n$ with arrows $D_E^* f_k$ and $D_G^* f_k$, all pointing from right to left.]
Figure 11: This is the diagram of transpose partials of $f = (f_k)_{k=1}^n$ at $(x, w)$,
clearly not a neural network. Compare with Figure 10. The reader will benefit
by becoming familiar with this diagram and the next one. At the nodes it has the
Hilbert spaces $E_k$ and $G_k$ and the arrows are labeled with $D_E^* f_k = D_{E_k}^* f_k(x_k, w_k)$ and
$D_G^* f_k = D_{G_k}^* f_k(x_k, w_k)$. Being upstream was defined in 54. Since transposition
reverses the original arrows they now point from right to left. Upstream of the
transpose partial $D_G^* f_n$ no transpose $E_k$-partials exist. Upstream of $D_G^* f_{n-1}$ only
$D_E^* f_n$ is found. At the far left, upstream of $D_G^* f_2$ the partials $D_E^* f_3, \ldots, D_E^* f_n$ are
located and upstream of $D_G^* f_1$ appear $D_E^* f_2, D_E^* f_3, \ldots, D_E^* f_n$.
[Figure 12 diagram: lower row the backpropagated errors $\Delta a_1, \ldots, \Delta a_{n+1}$ joined by the arrows $D_E^* f_k$; upper row the lifted errors $\Delta w_1, \ldots, \Delta w_n$ reached by the arrows $D_G^* f_k$.]
Figure 12: Starting at the far right with the output error $\Delta a_{n+1} = a_{n+1} - b_{n+1} \in E_{n+1}$, the lower row consists of backpropagated errors $\Delta a_k = D_E^* f_k \cdot \Delta a_{k+1}$ where
$D_E^* f_k = D_{E_k}^* f_k(a_k, w_k)$. The lifted errors, appearing in the upper row, are by def-
inition the images of these backpropagated errors by the transpose $G_k$-derivatives,
$\Delta w_k = D_G^* f_k \cdot \Delta a_{k+1}$ with $D_G^* f_k = D_{G_k}^* f_k(a_k, w_k)$. The reader should acquire famil-
iarity with this diagram and the previous one. The gradient of the quadratic error $Q$
of the multilayer network $f = (f_k)_{k=1}^n$ with multiweight $w$, initial input $a_1$ and desired
output $b_{n+1}$ is equal to the $n$-tuple of twice the lifted errors, $\nabla Q(w) = (2\Delta w_k)_{k=1}^n$;
see equation (21). Also, compare with Figure 11.
References
[1] Berberian, S. K. Introduction to Hilbert Space. Oxford University Press, 1961.
[6] Schmidhuber, J. Deep Learning in Neural Networks: An Overview. Technical Report, 2014.
[7] Lang, S. Real and Functional Analysis. Springer, 3rd ed., 1993.
Daniel Crespin
Oteyeva
March 31, 2023