
Theory and Formulas for Backpropagation in Hilbert Spaces

Daniel Crespin
Facultad de Ciencias
Universidad Central de Venezuela

Abstract
This paper provides a detailed proof of the backpropagation algorithm for single input
data, as stated in section 17, and for multiple input data, as given in section 18. Our
viewpoint is that backpropagation consists essentially of the calculation of the gradient
of the quadratic error of a multilayer differentiable neural network having an architecture
of Hilbert spaces. Along the way a general theory for such networks is outlined. The
gradient is expressed, as expected, in terms of the error vectors and the transpose partial
derivatives of the layers. Compare with [3], and note that all the present results apply
without change to the case of Euclidean spaces (finite dimensional Hilbert spaces), hence
to Cartesian spaces Rn as well. Numerical analysis provides the well known gradient
descent method, a procedure much used to find or approach the minimum of real valued
functions. Beyond the calculation of a gradient, backpropagation is the name given
to gradient descent when applied to the particularities of neural networks. The topic has
a very long history, as revealed in [6]. Although categories are not formally used, there is a
section of Figures containing twelve diagrams that, in the fashion of objects and morphisms,
illustrate neural networks, their values on inputs (forward propagation), their derivatives,
transpose derivatives, backpropagated errors and lifted errors, these liftings being, up to a
numerical factor of 2, the components of the sought gradient of the quadratic error.

The notation used in this article is commented on under Some conventions, before the Figures section.

§1 Neural networks
Neural networks defined
1. An n layer differentiable neural network is a sequence f = (fk )nk=1 of C1 maps fk : Uk × Wk → Uk+1 ,

f ∈ C1 (U1 × W1 , U2 ) × · · · × C1 (Un × Wn , Un+1 )

See Figure 1. With terms as explained below in 5, this definition states that the k-th
output codomain Uk+1 is also the input domain of the layer fk+1 next in the sequence.
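As an informal aid, and only under illustrative assumptions (a concrete tanh layer form and the helper name make_layer, neither of which appears in the paper), the definition can be mirrored in a few lines of Python: a network is nothing more than a finite list of maps, each taking an input and a weight to an element of the next input domain.

```python
# An informal sketch (not from the paper): an n layer differentiable network
# is a finite sequence of maps f_k taking (input x_k, weight w_k) to an
# element of the next input domain.  The layer form tanh(A @ x + b) and the
# helper name `make_layer` are assumptions made only for this illustration.
import numpy as np

def make_layer():
    """Return one layer f_k as a callable of an input x_k and a weight w_k = (A, b)."""
    def f_k(x_k, w_k):
        A, b = w_k                    # w_k lives in the weight domain W_k
        return np.tanh(A @ x_k + b)   # the value lands in the output codomain U_{k+1}
    return f_k

# A trilayer network f = (f_1, f_2, f_3) is then simply a list of such maps.
f = [make_layer() for _ in range(3)]
```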

2. The architecture of f is the collection A of the n pairs of normed spaces Ek , Gk , k = 1, . . . , n, and the additional normed space En+1 , together with the open sets Uk ⊆ Ek and Wk ⊆ Gk , for a total of 2n + 1 normed spaces and 2n + 1 open sets. It is then said that the network f defined in 1 has architecture A.
3. For the record, the function space of the architecture A is the set N (A) of all the neural
networks f having A as architecture
N (A) = C 1 (U1 × W1 , U2 ) × · · · × C 1 (Un × Wn , Un+1 )

4. Units and layers will not be given a detailed separate treatment in the present paper.
We discuss multilayer networks as compositions of networks. This approach is extremely
flexible and should prove useful in theory and practice for multilayer networks having
sophisticated hidden layers. Compare with [3] where units and layers are functions with
specific structures.
5. Terminology:
1.- The k-th layer of f = (fk )nk=1 is, of course, fk ;
2.- the input layer is f1 ;
3.- the output layer is fn ;
4.- the hidden layers or deep layers are fk with 2 ≤ k ≤ n − 1.
5.- The k-th domain is Uk × Wk ;
6.- the k-th codomain is Uk+1 .
7.- The initial input domain is U1 ;
8.- the k-th input domain is Uk ;
9.- the k-th weight domain is Wk ;
10.- the k-th output codomain is Uk+1 ;
11.- and the final output domain is Un+1 .
All these maps and domains appear as objects and arrows in Figure 1.
6. The multinput domain is U = U1 × · · · × Un and the multiweight domain is W = W1 ×
· · · × Wn .
7. By definition f is unilayer if n = 1, bilayer if n = 2 and trilayer if n = 3. The neural
network is multilayer if n ≥ 2. Figures 2, 3 and 4 display diagrams for unilayer, bilayer
and trilayer networks.
8. If the architecture is linear, that is, Uk = Ek and Wk = Gk for all k, and if the layers fk are linear
transformations, then f is a linear network. The derivative networks to be defined in 14
are linear.

§2 Forward pass and conditioning

9. Let f = (fk )nk=1 and w ∈ W be given. The forward pass of an initial input x1 ∈ U1 ,
displayed as the lower row in Figure 5, is the multinput sequence with first term x1 and
remaining terms specified by the recursion formula xk = fk−1 (xk−1 , wk−1 ), k = 1, . . . , n,
where for notational purposes we let f0 (x0 , w0 ) = x1 so that the forward pass is
(xk )nk=1 = (fk−1 (xk−1 , wk−1 ))nk=1 ∈ U = U1 × · · · × Un

10. Given f , w and x1 by definition the final output is xn+1 = fn (xn , wn ) ∈ Un+1 .

11. A multinput x ∈ U is x1 -conditioned under f and w if it is the forward pass of the initial
input term x1 ∈ U1 , that is, if the recursive formula xk = fk−1 (xk−1 , wk−1 ) is satisfied.

12. In an alternative but equivalent and sometimes more convenient notation the initial input
will be a1 ∈ U1 which under the forward pass gives the a1 -conditioned sequence

(ak )nk=1 = (fk−1 (ak−1 , wk−1 ))nk=1 ∈ U = U1 × · · · × Un
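The forward pass of 9 and the final output of 10 translate directly into code. The sketch below is offered only as an illustration: the function name forward_pass, the tanh layers and the sizes 4 → 3 → 2 are assumptions, not constructions of the paper.

```python
# Hedged sketch of the forward pass: from an initial input x1 and a multiweight
# w = (w_1, ..., w_n), iterate x_k = f_{k-1}(x_{k-1}, w_{k-1}) and return the
# x1-conditioned multinput (x_1, ..., x_n) together with the final output x_{n+1}.
import numpy as np

def forward_pass(layers, w, x1):
    """Return ([x_1, ..., x_n], x_{n+1}) for the given layers, multiweight and initial input."""
    xs = [x1]
    for f_k, w_k in zip(layers, w):
        xs.append(f_k(xs[-1], w_k))
    return xs[:-1], xs[-1]

# Example with two tanh layers of assumed sizes 4 -> 3 -> 2.
layer = lambda x, wk: np.tanh(wk[0] @ x + wk[1])
rng = np.random.default_rng(0)
w = [(rng.normal(size=(3, 4)), rng.normal(size=3)),
     (rng.normal(size=(2, 3)), rng.normal(size=2))]
multinput, final_output = forward_pass([layer, layer], w, rng.normal(size=4))
```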

§3 Derivatives
Textbooks discussing derivatives in normed spaces are [5] Ch. VIII, §9 and [7] Ch. XIII, §7.
A good introduction to derivatives in Rn spaces can be found in [8].

Derivatives of a multilayer network


13. Consider a multilayer C 1 neural network f = (fk )nk=1 , a multinput x ∈ U and a multiweight
w ∈ W from which pairs (xk , wk ) ∈ Uk × Wk can be formed.

14. The derivative of f at (x, w) ∈ U × W is the multilayer linear network Df (x, w), shown
in Figure 9, having input domains Uk = Ek , weight domains Wk = Gk , output domains
Uk+1 = Ek+1 and layers equal to the derivatives Dfk (xk , wk ) : Ek × Gk → Ek+1 of the
layers of f calculated at the pairs (xk , wk )

Df (x, w) = (Dfk (xk , wk ))nk=1

15. The notions of §2 can be applied to derivative networks. Given:

1.- the derivative network calculated at (x, w) ∈ U × W , Df (x, w) = (Dfk (xk , wk ))nk=1 , which is a linear network;
2.- a multiweight ∆w = (∆w1 , . . . , ∆wn ) ∈ G1 × · · · × Gn = G for Df (x, w);
3.- and an initial input ∆x1 ∈ E1 for Df (x, w);
the forward pass by the neural network Df (x, w) with multiweight ∆w and initial input
∆x1 ∈ E1 is the multinput (∆xk )nk=1 = (Dfk−1 (xk−1 , wk−1 ) · (∆xk−1 , ∆wk−1 ))nk=1 ∈ E = E1 × · · · × En ,
where we let Df0 (x0 , w0 ) · (∆x0 , ∆w0 ) = ∆x1 . This can also be written with ∆a1 ∈ E1 and
(∆ak )nk=1 = (Dfk−1 (xk−1 , wk−1 ) · (∆ak−1 , ∆wk−1 ))nk=1 ∈ E.

16. Above in 15, by construction, the forward pass (∆xk )nk=1 is ∆x1 -conditioned; or (∆ak )nk=1
is ∆a1 -conditioned. The original network f need not be x1 -conditioned, that is, the
relations xk = fk−1 (xk−1 , wk−1 ) are not necessarily imposed. Nevertheless, and particularly
with neural compositions, x1 -conditioning is crucial.

§4 Transpose derivatives, gradients and squared norm
Reference [1] explains transposes of linear transformations between complex Hilbert spaces,
but the reader can easily adapt that discussion to the real Hilbert spaces needed here.
17. Consider an inner product space E and let a ∈ E. The linear form of a is the function
φa : E → R defined as φa (∆x) = ⟨a, ∆x⟩. A linear form φ : E → R is representable if
there exists an a ∈ E such that φ = φa , in which case a is unique.
18. Let [a] = {λa | λ ∈ R} be the line spanned by a. The parametrized line of a is the linear
map ℓa : R → E defined as ℓa (t) = ta. Observe that ℓa (1) = a.
19. In an inner product space the linear form and the parametrized line are transposes of each other

φ∗a = ℓa ,    ℓ∗a = φa    (1)

20. The Riesz representation theorem states that if E is a Hilbert space then all linear forms
are representable. Assuming that E, F are Hilbert spaces it follows that any continuous
linear transformation T : E → F has a unique well defined transpose transformation
T ∗ : F → E characterized by the condition
⟨T (x), y⟩ = ⟨x, T ∗ (y)⟩ for all x ∈ E and all y ∈ F .
21. Consider Hilbert spaces E, F with open subsets U ⊆ E, V ⊆ F and let f : U → V
be a differentiable map with derivative at x ∈ U equal to the linear transformation
Df (x) : E → F . Then as a particular case of 20 f has a transpose derivative at x defined
as D∗ f (x) = (Df (x))∗ : F → E.
22. Let g : V → R have derivative at y = f (x) ∈ V equal to the linear form Dg(y) : F → R.
The gradient of g at y ∈ V is defined as a vector ∇g(y) ∈ F such that for all ∆y ∈ F
Dg(y) · ∆y = ⟨∇g(y), ∆y⟩

23. The Riesz theorem implies that the gradient vector ∇g(y) exists and is given by
∇g(y) = D∗ g(y) · 1    (2)

24. Furthermore the chain rule and properties of transposition imply that the gradient pulls
back by the transpose derivative
∇(g ◦ f )(x) = D∗ f (x)(∇g(y))

25. The squared norm function Sq : En+1 → R is defined as Sq(xn+1 ) = ⟨xn+1 , xn+1 ⟩ =
∥xn+1 ∥2 . The derivative of Sq at any an+1 ∈ En+1 is twice the linear form of an+1 and the
transpose derivative is twice the parametrized line of an+1
DSq(an+1 ) = 2φan+1 ,    D∗ Sq(an+1 ) = 2ℓan+1    (3)
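Equation (3) can be checked numerically in a Euclidean space. The vectors below are arbitrary choices made only for the check, which is offered as an illustration rather than as part of the paper.

```python
# Check of (3): at a point a the derivative of Sq(x) = <x, x> acts on dx as
# 2<a, dx>, and the transpose derivative sends the number 1 to 2a (twice the
# parametrized line of a at 1), which by (2) is the gradient of Sq at a.
import numpy as np

a = np.array([1.0, -2.0, 0.5])
dx = np.array([0.3, 0.1, -0.4])
eps = 1e-6

sq = lambda x: float(x @ x)
central_difference = (sq(a + eps * dx) - sq(a - eps * dx)) / (2 * eps)
print(central_difference, 2 * (a @ dx))   # DSq(a) . dx  versus  2<a, dx>: they agree
print(2 * 1.0 * a)                        # D*Sq(a) . 1 = 2a
```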

§5 Partial derivatives

26. Given (x, w) ∈ U ×W the layer fk has an Ek -partial derivative and a Gk -partial derivative
at (xk , wk )

DEk fk (xk , wk ) : Ek → Ek+1 ,    DGk fk (xk , wk ) : Gk → Ek+1

27. Basic properties of partials imply that the derivative of the k-th layer is the direct sum of
the partials
Dfk (xk , wk ) = DEk fk (xk , wk ) ⊕ DGk fk (xk , wk )
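A quick numeric check of this direct sum decomposition can be made with a concrete layer. The layer form tanh(A @ x + b), the sizes and the random seed are assumptions made only for the example.

```python
# Check of 27 for one illustrative layer: the full derivative applied to a joint
# increment (dx, dA, db) equals the E-partial applied to dx plus the G-partial
# applied to (dA, db).
import numpy as np

rng = np.random.default_rng(1)
A, b = rng.normal(size=(3, 4)), rng.normal(size=3)
x, dx = rng.normal(size=4), rng.normal(size=4)
dA, db = rng.normal(size=(3, 4)), rng.normal(size=3)

f = lambda x_, A_, b_: np.tanh(A_ @ x_ + b_)
s = 1.0 - f(x, A, b) ** 2                 # derivative of tanh at A x + b, elementwise

DE_on_dx = s * (A @ dx)                   # E-partial applied to dx
DG_on_dw = s * (dA @ x + db)              # G-partial applied to (dA, db)

eps = 1e-6
full = (f(x + eps * dx, A + eps * dA, b + eps * db)
        - f(x - eps * dx, A - eps * dA, b - eps * db)) / (2 * eps)
print(np.allclose(full, DE_on_dx + DG_on_dw, atol=1e-6))   # True
```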

28. The diagram of partials at (x, w) of the neural network f = (fk )nk=1 by definition has
nodes for the vector spaces Ek and Gk and arrows labeled by the partials of the layers,
DE fk = DEk fk (xk , wk ) and DG fk = DGk fk (xk , wk ), all arranged as shown in Figure 10.
Compare with the diagram of transpose partials in Figure 11. These diagrams of partials
are not themselves neural networks.

§6 Transpose partial derivatives

29. Assuming that the normed spaces are all Hilbert, it follows from 20 that any continuous linear
transformation has a transpose. In particular there is for each layer of f = (fk )nk=1 and at
each (xk , wk ) ∈ Uk × Wk a transpose derivative, a transpose Ek -partial derivative and a
transpose Gk -partial derivative

D∗ fk (xk , wk ) = (Dfk (xk , wk ))∗ : Ek+1 → Ek × Gk

D∗Ek fk (xk , wk ) = (DEk fk (xk , wk ))∗ : Ek+1 → Ek

D∗Gk fk (xk , wk ) = (DGk fk (xk , wk ))∗ : Ek+1 → Gk

30. The transpose of a direct sum is the product of the transposes. Applying this property
to 27 it follows that the transpose derivative of a layer fk is equal to the product of its
transpose partials

D∗ fk (xk , wk ) = (D∗Ek fk (xk , wk ), D∗Gk fk (xk , wk )) : Ek+1 → Ek × Gk

31. The diagram of transpose partials of f = (fk )nk=1 at (x, w) is defined as having nodes equal
to the Hilbert spaces Ek and Gk , arrows labeled by the transpose partials of the layers,
D∗E fk = D∗Ek fk (xk , wk ) and D∗G fk = D∗Gk fk (xk , wk ), with these objects and morphisms
displayed in the manner of Figure 11. The diagram in this figure, when x1 -conditioned,
is the foundation of a thorough understanding of backpropagation.

§7 Conditioning the derivatives

32. Consider the following sequences of linear transformations:

1.- the derivative network Df (x, w) = (Dfk (xk , wk ))nk=1 ,
2.- the E-partials DE f = (DEk fk (xk , wk ))nk=1 ,
3.- the G-partials DG f = (DGk fk (xk , wk ))nk=1 ,
4.- the transpose derivative network D∗ f (x, w) = (D∗ fk (xk , wk ))nk=1 ,
5.- the transpose E-partials D∗E f = (D∗Ek fk (xk , wk ))nk=1 and
6.- the transpose G-partials D∗G f = (D∗Gk fk (xk , wk ))nk=1 .
These sequences are x1 -conditioned by f and w, or conditioned for short, if the multinput
x = (xk )nk=1 is conditioned by the initial input x1 in the sense of 11, that is if xk =
fk−1 (xk−1 , wk−1 ) for k = 1, . . . , n. And a1 -conditioned means that ak = fk−1 (ak−1 , wk−1 ).

§8 Neural compositions of bilayers

33. By definition the neural composition of a bilayer network f = (f1 , f2 ) is the function that
transforms (x1 , w1 , w2 ) into the final output of the forward pass of f with multiweight
w = (w1 , w2 ) and initial input x1

(f2 b◦ f1 )(x1 , w1 , w2 ) = f2 (f1 (x1 , w1 ), w2 )

34. The neural composition f2 b◦ f1 is a unilayer neural network with input, weight and output
domains respectively equal to U1 , W1 × W2 and U3 .

35. The natural projections of the bilayer network f are the maps π U1 ×W1 (x1 , w1 , w2 ) =
(x1 , w1 ) and π W2 (x1 , w1 , w2 ) = w2 with domains and codomains as shown

π U1 ×W1 : U1 × W1 × W2 → U1 × W1
π W2 : U1 × W1 × W2 → W2        (4)

36. The neural composition of a bilayer network, considered as a function from its domain to
its codomain, can be written as a composition of layers, natural projections and products, as
displayed in Figure 7 and expressed by the formula

f2 b◦ f1 = f2 ◦ ((f1 ◦ π U1 ×W1 ), π W2 ) (5)

§9 Neural composition of multilayers

37. The neural composition of a multilayer neural network f = (fk )nk=1 is defined as the
function fb = fn b◦ · · · b◦ f1 that sends (x1 , w1 , . . . , wn ) to the final output of the forward
pass of f with multiweight w = (w1 , . . . , wn ) and initial input x1

fb(x1 , w) = (fn b◦ · · · b◦ f1 )(x1 , w1 , . . . , wn ) = xn+1 = fn (xn , wn )

38. The neural composition fb = fn b◦ · · · b◦ f1 is a unilayer neural network, see Figure 6, having
input, weight and output domains respectively equal to U1 , W = W1 × · · · × Wn and Un+1 .
39. The natural projections of the multilayer network f are the maps
π U1 ×W1 ×···×Wn−1 (x1 , w1 , . . . , wn−1 , wn ) = (x1 , w1 , . . . , wn−1 )
(6)
π Wn (x1 , w1 , . . . , wn−1 , wn ) = wn
having the following domains and codomains
π U1 ×W1 ×···×Wn−1 : U1 × W1 × · · · × Wn−1 × Wn → U1 × W1 × · · · × Wn−1
(7)
π Wn : U1 × W1 × · · · × Wn−1 × Wn → Wn

40. It follows that the neural composition of a neural network having n layers can be expressed
as the neural composition of two unilayer networks, namely of the last layer and the neural
composition of the first n − 1 layers

fb = fn b◦ · · · b◦ f1 = fn b◦ (fn−1 b◦ · · · b◦ f1 )
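In code, the neural composition of 37 and 38 is simply a closure that hides the intermediate forward pass. The name neural_composition and the layer representation are illustrative assumptions, not constructions of the paper.

```python
# Sketch of the neural composition fb as a single unilayer map with input
# domain U_1 and weight domain the whole product W_1 x ... x W_n.
def neural_composition(layers):
    """Return fb with fb(x1, w) = f_n(... f_2(f_1(x1, w_1), w_2) ..., w_n)."""
    def fb(x1, w):
        x = x1
        for f_k, w_k in zip(layers, w):   # the forward pass happens inside fb
            x = f_k(x, w_k)
        return x                          # the final output x_{n+1}
    return fb
```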

§10 Derivatives of bilayer compositions

41. The bilayer network derivative, which is the case n = 2 of 14 and of Figure 9, has linear
architecture with layers Df1 (x1 , w1 ) : E1 × G1 → E2 and Df2 (x2 , w2 ) : E2 × G2 → E3 .
Since the domains are normed spaces, Uk = Ek , Wk = Gk , to apply equation (5) the
following linear projections are needed
π E1 ×G1 : E1 × G1 × G2 → E1 × G1
π G2 : E1 × G1 × G2 → G2

42. These linear projections are the derivatives, for all x = (x1 , w1 , w2 ) ∈ U1 × W1 × W2 , of
the natural projections given in equation (4), that is
Dπ U1 ×W1 (x1 , w1 , w2 ) = π E1 ×G1
(8)
Dπ W2 (x1 , w1 , w2 ) = π G2

43. Then the derivative of the neural composition of a bilayer network is equal to the neural
composition of the x1 -conditioned derivative
D(f2 b◦ f1 ) = D(f2 ◦ ((f1 ◦ π U1 ×W1 ), π W2 ))
= Df2 ◦ ((Df1 ◦ π E1 ×G1 ), π G2 ) (9)
= Df2 b◦ Df1

The first equality is true by equation (5) applied to the bilayer network (f1 , f2 ); the
second by the chain rule for derivatives together with the product rule for derivatives and
(8); and the third again by (5) but this time applied to the x1 -conditioned linear network
Df = (Df1 , Df2 ). An alternative proof of (9) is described in the caption of Figure 8.

§11 Derivatives of multilayer compositions

44. Define the linear projections π E1 ×G1 ×···×Gn−1 (x1 , w) = (x1 , w1 , . . . , wn−1 ) and π Gn (x1 , w) =
wn hence
π E1 ×G1 ×···×Gn−1 : E1 × G → E1 × G1 × · · · × Gn−1
π Gn : E1 × G → Gn

45. The above linear projections are the derivatives, for all (x1 , w) ∈ U1 × W , of the
natural projections given in equation (6)

Dπ U1 ×W1 ×···×Wn−1 (x1 , w) = π E1 ×G1 ×···×Gn−1

Dπ Wn (x1 , w) = π Gn

46. By induction on n (the case n = 2 being (9)) the neural chain rule follows: the derivative
of a multilayer neural composition is equal to the neural composition of the x1 -conditioned
derivatives of the layers

Dfb(x1 , w) = D(fn b◦ fn−1 b◦ · · · b◦ f1 )(x1 , w(n−1) , wn )
= D(fn b◦ (fn−1 b◦ · · · b◦ f1 ))(x1 , w(n−1) , wn )
= Dfn (xn , wn ) b◦ (Dfn−1 (xn−1 , wn−1 ) b◦ · · · b◦ Df1 (x1 , w1 ))
= Dfn (xn , wn ) b◦ Dfn−1 (xn−1 , wn−1 ) b◦ · · · b◦ Df1 (x1 , w1 )

where w(n−1) = (w1 , . . . , wn−1 ).

§12 Partials as direct sums and products

47. A basic property of derivatives having a product domain is that they are the direct sum
of the partials taken with respect to the factors.

48. The G-partial at (x1 , w) ∈ U1 × W = U1 × W1 × · · · × Wn of the neural composition of a multilayer network is equal to the direct sum of its Gk -partials

DG fb(x1 , w) = DG1 fb(x1 , w) ⊕ · · · ⊕ DGn fb(x1 , w)

For this direct sum decomposition the G-partial as well as the Gk -partials are all calculated
at (x1 , w) ∈ U1 ×W . No forward pass is required to specify (x1 , w) and no x1 -conditioning
on the partials is involved.

49. In general the transpose of a direct sum is the product of the transposes. Therefore the
transpose G-partial derivative of the neural composition at (x1 , w) ∈ U1 × W is equal to
the product of transpose Gk -partials of fb
D∗G fb(x1 , w) = (D∗G1 fb(x1 , w), . . . , D∗Gn fb(x1 , w))    (10)

Again, all the partials are calculated at (x1 , w).

§13 Partials of multilayer compositions
The two statements in this section require proof, carried out by invoking the neural chain
rule, basic properties of partials and induction on n. The relevant diagram is Figure 10.
50. The E1 -partial of the neural composition at (x1 , w) ∈ U1 × W of a multilayer network is
equal to the composition of the x1 -conditioned Ek -partials of the layers

DE1 fb(x1 , w) = DEn fn (xn , wn ) ◦ · · · ◦ DE1 f1 (x1 , w1 )

51. By definition an Ei -partial DEi fi (xi , wi ) is downstream from the Gk -partial DGk fk (xk , wk )
if k < i. See the direction of the arrows in Figure 10.
52. The Gk -partial of fb calculated at (x1 , w) ∈ U1 × W is equal to the x1 -conditioned compo-
sition of the Gk -partial of the k-th layer and its downstream Ei -partials

DGk fb(x1 , w) = DEn fn (xn , wn ) ◦ DEn−1 fn−1 (xn−1 , wn−1 ) ◦ · · · ◦ DEk+1 fk+1 (xk+1 , wk+1 ) ◦ DGk fk (xk , wk )    (11)

§14 Transpose partials of multilayer compositions


Properties of transpose partials are a consequence of applying transposition to the corresponding results on the partials of section 13. The diagram to look at is Figure 11.
53. The transpose E1 -partial of fb is the x1 -conditioned composition, in the appropriate reverse order, of the transpose Ek -partials of the layers

D∗E1 fb(x1 , w) = D∗E1 f1 (x1 , w1 ) ◦ · · · ◦ D∗En fn (xn , wn )

54. For a transpose Gk -partial D∗Gk fk (xk , wk ) a transpose Ei -partial D∗Ei fi (xi , wi ) is upstream
if k < i. Now the arrows have been reversed relative to the downstream definition given
in 51.
55. The version of the chain rule for the transpose partial derivatives of a neural composition
to be given in equation (12) is central for a general formulation of backpropagation
in Hilbert spaces, and presumably also in Euclidean and Cartesian spaces, since finite
dimensionality and matrices seem to provide little conceptual simplification. The whole
article has been elaborated around this equation, including the diagram of Figure 11 and
equations (21) and (22).
56. The transpose Gk -partial of the neural composition at (x1 , w) ∈ U1 × W is equal to the
x1 -conditioned composition, in the appropriate reverse order, of the transpose Gk -partial
of the k-th layer and its upstream transpose Ei -partials

D∗Gk fb(x1 , w) = D∗Gk fk (xk , wk ) ◦ D∗Ek+1 fk+1 (xk+1 , wk+1 ) ◦ · · · ◦ D∗En fn (xn , wn )    (12)
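The reversed order in (12) reflects the general rule that transposition turns a composition of continuous linear maps into the composition of the transposes in the opposite order. Here is a small numeric illustration with matrices, which represent such maps in the finite dimensional case; the sizes are arbitrary choices for the example.

```python
# Order reversal under transposition: (T3 o T2 o T1)* = T1* o T2* o T3*.
import numpy as np

rng = np.random.default_rng(2)
T3, T2, T1 = rng.normal(size=(2, 3)), rng.normal(size=(3, 4)), rng.normal(size=(4, 5))

lhs = (T3 @ T2 @ T1).T          # transpose of the composition
rhs = T1.T @ T2.T @ T3.T        # composition of the transposes, reversed order
print(np.allclose(lhs, rhs))    # True
```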

§15 Backpropagating and lifting

57. Let f = (fk )nk=1 , fb and w be as in previous sections. Consider any element ∆an+1 ∈ En+1 ,
to be called the (n + 1)-th backpropagated error or final error. The other errors in this section
are obtained by applying transpose derivatives to this final error. In later sections we
will focus on the output error of the network, ∆an+1 = fb(a1 , w) − bn+1 .

58. By definition the k-th backpropagated error ∆ak ∈ Ek is the image of the (k + 1)-th
backpropagated error under the conditioned transpose Ek -partial of the k-th layer, k = n, n − 1, . . . , 2

∆ak = D∗Ek fk (ak , wk ) · ∆ak+1    (13)

This recursive descent formula also allows one to define ∆a1 ∈ E1 ; however, only the backpropagated errors ∆ak ∈ Ek with 2 ≤ k ≤ n + 1 will be needed.

59. The k-th lifted error is by definition the image of the (k + 1)-th backpropagated error under
the conditioned transpose Gk -partial of the k-th layer, k = n, n − 1, . . . , 2, 1

∆wk = D∗Gk fk (ak , wk ) · ∆ak+1    (14)

60. The transpose G-partial of the neural composition calculated at (a1 , w) ∈ U1 × W and evaluated on
any given output error ∆an+1 ∈ En+1 is equal to the n-tuple of lifted errors

D∗G fb(a1 , w) · ∆an+1 = (D∗G1 fb(a1 , w), . . . , D∗Gn fb(a1 , w)) · ∆an+1
= (D∗G1 f1 (a1 , w1 ) · ∆a2 , . . . , D∗Gn fn (an , wn ) · ∆an+1 )    (15)
= (∆wk )nk=1

Proof: The first equality is true by (10); the second by the chain rule for transpose
partials (12) (with a1 -conditioning) and the recursive descent definition of ∆ak given in
(13); and the third by (14) (again with a1 -conditioning).
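The recursions (13) and (14) can be sketched generically in code, assuming the a1-conditioned transpose partials are already available as callables. The names DE_star, DG_star and backpropagate are illustrative, and the lists are 0-based, so position k of a list holds the data of layer k + 1 of the text.

```python
# Sketch of (13) and (14): starting from a final error, descend through the
# layers producing the backpropagated errors and, at each step, the lifted error.
def backpropagate(DE_star, DG_star, delta_a_final):
    """Return the lifted errors of (14) and the backpropagated errors of (13)."""
    n = len(DE_star)
    lifted, backprop = [None] * n, [None] * n
    delta_a = delta_a_final                # the final error
    for i in reversed(range(n)):           # layer k = i + 1, from n down to 1
        lifted[i] = DG_star[i](delta_a)    # lifted error: transpose G-partial applied to the error
        backprop[i] = DE_star[i](delta_a)  # backpropagated error: transpose E-partial applied to it
        delta_a = backprop[i]              # descend one layer
    return lifted, backprop
```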

§16 Gradient of the quadratic error

61. Let f = (fk )nk=1 be a multilayer network, fb = fn b◦ · · · b◦ f1 its neural composition, a1 ∈ U1
an initial input, bn+1 ∈ En+1 a desired output and an+1 = fn (an , wn ) = fb(a1 , w) the final
output. Define the output error e : W = W1 × · · · × Wn → En+1 as the function

e(w) = fb(a1 , w) − bn+1 = an+1 − bn+1 (16)

62. From the manner in which partial derivatives are defined it follows that the derivative of the output
error function at w ∈ W is equal to the G-partial of fb at (a1 , w)

De(w) = DG fb(a1 , w) (17)

63. Then the transpose derivative of the output error function e at w ∈ W is equal to the
transpose G-partial of fb at (a1 , w)

D∗ e(w) = D∗G fb(a1 , w)    (18)

64. The quadratic error function Q : W → R is by definition the squared length of the error

Q(w) = ∥fb(a1 , w) − bn+1 ∥2 = ∥e(w)∥2 = (Sq ◦ e)(w)    (19)

65. The first term below, which is the transpose derivative at the multiweight w ∈ W =
W1 × · · · × Wn of the quadratic error function, is equal to the last, which is twice the transpose
G-partial derivative of the neural composition calculated at (a1 , w) and composed with
the parametrized line of the final error

D∗ Q(w) = D∗ (Sq ◦ e)(w)
= (D(Sq ◦ e)(w))∗
= (DSq(e(w)) ◦ De(w))∗
= (2φe(w) ◦ DG fb(a1 , w))∗    (20)
= 2 D∗G fb(a1 , w) ◦ φ∗e(w)
= 2 D∗G fb(a1 , w) ◦ ℓfb(a1 ,w)−bn+1

Proof: The first equality is true by the definition of quadratic error in (19); the second
by the definition of transpose derivative; the third by the chain rule applied to Sq ◦ e; the
fourth because of (3) and (17); the fifth because the transpose of a composition is the
composition of the transposes in reverse order; and the last by (1) and by the definition
of e(w) given in equation (16).

§17 Backpropagation for a single input data
The gradient formula given below in equation (21), and reflected in Figures 11 and 12,
requires data and calculations as now recapitulated

66. From
1.- a multilayer network f = (fk )nk=1
2.- a multiweight w = (w1 , . . . , wn ) ∈ W
3.- an initial input a1 ∈ U1 , to be also symbolized a1 = f0 (a0 , w0 )
4.- a desired output bn+1 ∈ En+1
the following is obtained
1.- the forward pass (ak )nk=1 = (fk−1 (ak−1 , wk−1 ))nk=1
2.- the final output an+1 = fn (an , wn ) ∈ En+1
3.- the output error ∆an+1 = an+1 − bn+1 ∈ En+1
4.- the output error function e : W → En+1 , e(w) = fb(a1 , w) − bn+1
5.- the transpose Ek -partials D∗Ek fk (ak , wk )
6.- the transpose Gk -partials D∗Gk fk (ak , wk )
7.- the backpropagated errors ∆ak = D∗Ek fk (ak , wk ) · ∆ak+1 ∈ Ek , k = n, (n − 1), . . . , 2
8.- the lifted errors ∆wk = D∗Gk fk (ak , wk ) · ∆ak+1 ∈ Gk , k = n, (n − 1), . . . , 2, 1

67. If there is a single initial input the substance of backpropagation for neural networks
in Hilbert spaces is the following statement:

The gradient of the quadratic error function is equal to twice the n-tuple of the
lifted errors

∇Q(w) = (2∆wk )nk=1 (21)

68. For a justification and proof of (21) check the hypotheses and follow all the steps indicated
in 66. Alternatively take a = ∆an+1 = fn (an , wn ) − bn+1 = an+1 − bn+1 in equation (2),
and then apply equations (12), (15) and (20).
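For concreteness, here is an end-to-end sketch of the recipe in 66 and of formula (21) for layers of the assumed form fk(x, (A, b)) = tanh(A x + b); the sizes, the seed and all names are illustrative, and the last lines check one gradient component against a central difference.

```python
# Single-input backpropagation, equation (21): the gradient of the quadratic
# error is twice the n-tuple of lifted errors.  For the assumed tanh layers the
# transpose E-partial applied to v is A.T @ (s * v) and the transpose G-partial
# applied to v is (outer(s * v, a_k), s * v), with s = 1 - a_{k+1}**2.
import numpy as np

def forward(ws, a1):
    """Forward pass: return the list [a_1, ..., a_{n+1}]."""
    a = [a1]
    for A, b in ws:
        a.append(np.tanh(A @ a[-1] + b))
    return a

def grad_Q(ws, a1, target):
    """Return twice the lifted errors, one pair (dA, db) per layer, as in (21)."""
    a = forward(ws, a1)
    delta = a[-1] - target                           # output error
    grads = [None] * len(ws)
    for k in reversed(range(len(ws))):
        A, _ = ws[k]
        sv = (1.0 - a[k + 1] ** 2) * delta           # tanh' times the incoming error
        grads[k] = (2 * np.outer(sv, a[k]), 2 * sv)  # 2 * lifted error of this layer
        delta = A.T @ sv                             # backpropagated error
    return grads

rng = np.random.default_rng(3)
ws = [(rng.normal(size=(3, 4)), rng.normal(size=3)),
      (rng.normal(size=(2, 3)), rng.normal(size=2))]
a1, target = rng.normal(size=4), rng.normal(size=2)
g = grad_Q(ws, a1, target)

# Central-difference check of one gradient component, d Q / d A_1[0, 0].
Q = lambda ws_: float(np.sum((forward(ws_, a1)[-1] - target) ** 2))
eps = 1e-6
ws_p = [(ws[0][0].copy(), ws[0][1]), ws[1]]
ws_m = [(ws[0][0].copy(), ws[0][1]), ws[1]]
ws_p[0][0][0, 0] += eps
ws_m[0][0][0, 0] -= eps
print((Q(ws_p) - Q(ws_m)) / (2 * eps), g[0][0][0, 0])   # the two values agree
```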

69. Under a variety of formats and scenarios this gradient must be calculated whenever trying
to minimize the quadratic error function of a differentiable multilayer neural network by
stepwise modification of the weights applying the method of gradient descent. “Backpropagation”
is the name given to the calculation of ∇Q(w), or to certain particular steps
like obtaining the backpropagated errors ∆ak . Or it may sometimes refer to the totality
of the awesome forest surrounding deep learning of neural networks by gradient descent,
where concepts like “learning rate”, “thresholds”, “distances to decision hypersurfaces”,
“training epochs”, “overfitting”, “cutoff values” and many others thrive.

§18 Backpropagation for multiple input data

70. Formula (21) has an extension to several initial inputs arranged in a finite set A =
{aj1 | j = 1, . . . , m} ⊆ U1 , where each aj1 ∈ U1 has a corresponding desired output bjn+1 =
d(aj1 ) ∈ En+1 specified by means of a usually empirical function d : A → En+1 .

71. Given:
1.- the multilayer neural network f = (fk )nk=1
2.- the multiweight w ∈ W = W1 × · · · × Wn
3.- the finite set A ⊆ U1
4.- and the desired output function d : A → En+1 , d(aj1 ) = bjn+1
there is for each j:
1.- the forward pass (ajk )nk=1 = (fk−1 (ajk−1 , wk−1 ))nk=1
2.- the final output ajn+1 = ajn+1 (aj1 , w) = fn (ajn , wn )
3.- the output error ∆ajn+1 = ajn+1 − bjn+1 ∈ En+1
4.- the output error function ej : W → En+1 , ej (w) = ajn+1 (w) − bjn+1
5.- the transpose Ek -partials D∗Ek fk (ajk , wk )
6.- the transpose Gk -partials D∗Gk fk (ajk , wk )
7.- the backpropagated errors ∆ajk = D∗Ek fk (ajk , wk ) · ∆ajk+1 ∈ Ek , k = n, (n − 1), . . . , 2
8.- the lifted errors ∆wjk = D∗Gk fk (ajk , wk ) · ∆ajk+1

72. Define then:

1.- The total quadratic error function QT : W → R, QT (w) = ∑mj=1 Qj (w), where Qj (w) = ∥ej (w)∥2
2.- The backpropagated total errors ∆T ak = ∑mj=1 ∆ajk
3.- The lifted total errors ∆T wk = ∑mj=1 ∆wjk

73. The gradient of the total quadratic error function is equal to the sum of the j-gradients

∇QT (w) = ∑mj=1 ∇Qj (w)

74. If there are multiple initial inputs then backpropagation for neural networks in
Hilbert spaces relies on the following result:

The gradient of the total quadratic error function is equal to twice the n-tuple of
the lifted total errors

∇QT (w) = (2∆T wk )nk=1 (22)
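A short sketch of (22): under the conventions of 71 and 72, the gradient of the total quadratic error is obtained by summing the single-input gradients componentwise. The helper grad_single is assumed to be any function returning the per-layer gradient pairs for one input, for instance the grad_Q sketched after 68; all names are illustrative.

```python
# Total gradient of (22) as the componentwise sum over j of the per-example
# gradients, that is, twice the lifted total errors.
def grad_Q_total(grad_single, ws, inputs, targets):
    """Sum grad_single(ws, a1_j, b_j) over all data pairs (a1_j, b_j)."""
    total = None
    for a1_j, b_j in zip(inputs, targets):
        g_j = grad_single(ws, a1_j, b_j)
        if total is None:
            total = [[p.copy() for p in layer_g] for layer_g in g_j]
        else:
            for layer_total, layer_g in zip(total, g_j):
                for p_total, p in zip(layer_total, layer_g):
                    p_total += p            # accumulate componentwise
    return total
```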

75. Summing up, backpropagation is required for the deep learning of (deep teaching to) neural
networks. In the first instance, learning means making the quadratic error smaller. And
backpropagation is the attempt to reduce the error by small stepwise changes of the
weights in directions opposite to the gradient.

Some conventions

First, beware that full notational consistency is difficult, perhaps impossible.


Sets have generic elements, say x ∈ X. Functions f : X → Y transform elements x ∈ X
into y = f (x) ∈ Y . But often it will be convenient to indicate that, instead of a “free”
x, a specific fixed element is of interest; we will then change to a letter at the beginning
of the alphabet and write a ∈ X, b = f (a) ∈ Y , and so on.
All the vector spaces, usually denoted E, F , G, En , etc., are over the real number system.
They are normed and whenever required they will also be Hilbert spaces.
All the linear transformations between normed spaces, T : E → F , in particular the
derivatives, are assumed to be bounded (continuous) in the usual sense of Functional
Analysis.
The value of a linear T on x ∈ E is T (x) but may also be denoted T · x. The variables
on which derivative linear transformations and their transposes act will be denoted ∆x
or ∆w, resulting in expressions like Df (a1 , w) · ∆x and D∗ fk (ak , wk ) · ∆ak+1 .

§19 Figures

[Diagram: the layers f1 , . . . , fn , each mapping Uk × Wk to Uk+1 .]

Figure 1: A multilayer differentiable neural network is a sequence f = (fk )nk=1 of C1 functions called layers, each having domain equal to a product Uk × Wk (of open
subsets of normed spaces) and codomain equal to the input domain Uk+1 of the map
fk+1 next in the sequence. The products Uk × Wk are displayed here with the factor
Wk atop the factor Uk and the symbol × placed in between. Note that the codomain
Uk+1 is not a subset of Uk+1 × Wk+1 . This diagram is a general layout for neural
networks and can be extended to domains that are sets and to layers that are arbitrary
functions. Beyond sets it is possible to work with categories having products. The
diagrams for unilayer, bilayer and trilayer neural networks are particular cases of the
present diagram with n = 1, 2, 3, shown in Figures 2, 3 and 4. Another special case is
Figure 10 that shows the diagram of a derivative network. The number n of layers
varies from one to an arbitrarily large n but in practice trilayer seems sufficient for
many applications. To have a well defined backpropagation procedure the normed
spaces have to be Hilbert spaces, Euclidean if finite dimensional. In actual practice
the spaces are the Cartesian spaces Rn with elements equal to row and column matrices,
linear transformations given by rectangular matrices, derivatives specified by means
of Jacobian matrices and transpose derivatives by transpose Jacobians; there are
then semilinear differentiable perceptron units which when fitted together provide
perceptron layers. See references [2], [3] and [4].

[Diagram: a single layer f1 mapping U1 × W1 to U2 .]

Figure 2: A unilayer differentiable neural network with input domain U1 , weight domain W1 and output domain U2 is a map f1 of class C1 with domain the product
of open sets U1 × W1 and codomain U2 . The product U1 × W1 is displayed with
the factor W1 atop U1 . The domains are open subsets of normed spaces: U1 ⊆ E1 ,
W1 ⊆ G1 and U2 ⊆ E2 . Reversing the order of the product replaces U1 × W1 with
W1 × U1 and then the roles of input and weight are interchanged. Examples of
unilayer neural networks are the differentiable semilinear perceptron units and the
semilinear perceptron layers often used in relation with deep learning. Derivatives
of differentiable maps f : U → V (with U ⊆ E and V ⊆ F ) are maps (x, ∆x) →
Df (x) · ∆x ∈ F and therefore can be considered as unilayer neural networks with
initial input domain U1 = U , weight domain W1 = E and output domain U2 = F .
The neural compositions of multilayer perceptron networks fn b◦ · · · b◦ f1 (see 37, 38
and 40) are also examples of unilayer networks.

[Diagram: layers f1 : U1 × W1 → U2 and f2 : U2 × W2 → U3 .]

Figure 3: A bilayer neural network consists of two layers which are the functions
shown in the diagram. The functions f1 and f2 are not composable as usually
understood but they do have a neural composition in the sense of the definition
given in 33 or as expressed in 36. See also Figure 7.

[Diagram: layers f1 , f2 , f3 mapping Uk × Wk to Uk+1 for k = 1, 2, 3.]

Figure 4: A trilayer neural network has the three layers shown above. This is the
result of setting n = 3 in the diagram of Figure 1. The architecture is specified by
seven open subsets of respective normed spaces. Customarily f1 is the input layer,
f2 is the hidden layer and f3 is the output layer.

[Diagram: the pairs (ak , wk ) sent by the layers fk to ak+1 , from the initial input a1 to the final output an+1 , with the weights wk in the upper row.]

Figure 5: The nodes of this diagram are elements of the various domains of a
multilayer network. The upper row has the components wk of a given multiweight
w. At the far left: the lower row has the initial input a1 , the upper row has the
first weight component w1 and the middle row has the pair (a1 , w1 ). The terms
ak = fk−1 (ak−1 , wk−1 ) are calculated by iteration and constitute the forward pass
defined in 9. The final output an+1 is at the extreme right. All the entries in
the middle row are pairs (ak , wk ) obtained by pairing corresponding elements in the
lower and upper rows. The arrows join the pairs to their images under the layer
maps. Compare with Figure 1.

[Diagram: the single layer fb mapping U1 × (W1 × · · · × Wn ) to Un+1 .]

Figure 6: The diagram is the neural composition fb = fn b◦ · · · b◦ f1 : U1 × W1 × · · · × Wn → Un+1 of the multilayer neural network f = (fk )nk=1 previously shown in
Figure 1. The original network f and fb share a common input domain U1 and final
output domain Un+1 . But one is multilayer and the other is unilayer. The weight
spaces Wk of the individual layers have been assembled together in the multiweight
product W = W1 ×· · ·×Wn which is the weight space of the neural composition. The
input domains Uk , 2 ≤ k ≤ n, do not appear explicitly in this diagram and remain
hidden from view but they are necessary to perform the forward pass stipulated in
9, this pass being involved in the definition of fb given in 37. All the intermediate
outputs xk = fk−1 (xk−1 , wk−1 ) ∈ Uk are determined by x1 , f and w. Therefore the
notion of x1 -conditioning defined in 11 is always present in neural compositions.

[Commutative diagram: the maps f1 , f2 , the projections π U1 ×W1 and π W2 , and the composite f2 b◦ f1 = f2 ◦ [[f1 ]] : U1 × W1 × W2 → U3 .]

Figure 7: The commutative diagram at left simplifies to become the single arrow
at right. This is an arrow theoretic or categorical version of formula (5) for the
neural composition of a bilayer network. Here f2 b◦ f1 : U1 ×W1 ×W2 → U3 is expressed
using objects and arrows that represent f1 , f2 , various projections, their compositions
and products. The projections reduce the number of variables as required by the
layers. Consider [f1 ] = f1 ◦ π U1 ×W1 . The neural composition f2 b◦ f1 (long red arrow
at right) is equal to the (ordinary) composition of the product map [[f1 ]] = ([f1 ], π W2 )
(red arrow at top left) and f2 (red arrow at bottom left) as implied by the equalities
f2 ◦ [[f1 ]] = f2 ◦ ([f1 ], π W2 ) = f2 ◦ ((f1 ◦ π U1 ×W1 ), π W2 ) = f2 b◦ f1 .

[Commutative diagram: the derivatives Df1 , Df2 , the linear projections π E1 ×G1 and π G2 , and the composite Df2 b◦ Df1 = Df2 ◦ D[[f1 ]] : E1 × G1 × G2 → E3 .]

Figure 8: This is the derivative of the diagram of Figure 7. For a diagram theoretic demonstration of the bilayer neural chain rule given in equation (9) modify Figure 7 as follows: 1.- take the derivatives of all the maps, noting that the natural projections have derivatives equal to the linear projections; 2.- substitute the
open sets by their normed spaces; 3.- and invoke the chain rule with x2 = f1 (x1 , w1 )
to maintain commutativity. The result is this x1 -conditioned diagram which, being
commutative, proves that D(f2 b◦ f1 ) = Df2 b◦ Df1 .

[Diagram: the derivative layers Dfk mapping Ek × Gk to Ek+1 .]

Figure 9: For any (x, w) ∈ U ×W the derivative network was defined in 14 as the
n-tuple of derivatives of the layers Dfk (xk , wk ) : Ek × Gk → Ek+1 each calculated at
the appropriate (xk , wk ) ∈ Uk ×Wk . These constitute a linear multilayer network
here displayed with Dfk standing for Dfk (xk , wk ). Compare with Figure 1.

[Diagram: the partials DE fk : Ek → Ek+1 and DG fk : Gk → Ek+1 , with all arrows pointing from left to right.]

Figure 10: The diagram of partials of the neural network f = (fk )nk=1 at
(x, w) has objects Ek and Gk with arrows labeled by the partials, which are
linear transformations here denoted as DE fk = DEk fk (xk , wk ) : Ek → Ek+1
and DG fk = DGk fk (xk , wk ) : Gk → Ek+1 . By inspection it is obvious that
the diagram of partials is not a neural network. The relation of being down-
stream was defined in 51. Heuristically the arrows, either horizontal or slanted,
all point from left to right and tell the “direction of the stream”. Downstream of DG f1 = DG1 f1 (x1 , w1 ) are the Ek -partials to the right of it, that is,
DE f2 , DE f3 , . . . , DE fn . Downstream of DG f2 are DE f3 , . . . , DE fn , and so on. At
the far right, downstream of DG fn−1 there is only DE fn and downstream of DG fn
there are no Ek -partials. According to 50 the composition of the x1 -conditioned
linear maps in the lower row is equal to the E1 -partial of the neural composition,
DE1 (fn b◦ · · · b◦ f1 )(x1 , w) = DEn fn (xn , wn ) ◦ · · · ◦ DE1 f1 (x1 , w1 ). From (11) the Gk -
partial of the neural composition is equal to the composition of the x1 -conditioned Gk -
partial of the k-th layer with the downstream Ej -partials, DGk (fn b◦ · · · b◦ f1 )(x1 , w) =
DEn fn (xn , wn ) ◦ · · · ◦ DEk+1 fk+1 (xk+1 , wk+1 ) ◦ DGk fk (xk , wk ).

[Diagram: the transpose partials D∗E fk : Ek+1 → Ek and D∗G fk : Ek+1 → Gk , with all arrows pointing from right to left.]

Figure 11: This is the diagram of transpose partials of f = (fk )nk=1 at (x, w), clearly not a neural network. Compare with Figure 10. The reader will benefit by becoming familiar with this diagram and the next one. At the nodes it has the Hilbert spaces Ek and Gk and the arrows are labeled with D∗E fk = D∗Ek fk (xk , wk ) and D∗G fk = D∗Gk fk (xk , wk ). Being upstream was defined in 54. Since transposition reverses the original arrows they now point from right to left. Upstream of the transpose partial D∗G fn no transpose Ek -partials exist. Upstream of D∗G fn−1 only D∗E fn is found. At the far left, upstream of D∗G f2 the partials D∗E f3 , . . . , D∗E fn are located, and upstream of D∗G f1 appear D∗E f2 , D∗E f3 , . . . , D∗E fn .

[Diagram: the backpropagated errors ∆an+1 , ∆an , . . . , ∆a1 in the lower row, joined by the arrows D∗E fk pointing from right to left, and the lifted errors ∆w1 , . . . , ∆wn in the upper row, reached by the arrows D∗G fk .]

Figure 12: Starting at the far right with the output error ∆an+1 = an+1 − bn+1 ∈ En+1 , the lower row consists of the backpropagated errors ∆ak = D∗E fk · ∆ak+1 , where D∗E fk = D∗Ek fk (ak , wk ). The lifted errors, appearing in the upper row, are by definition the images of these backpropagated errors by the transpose Gk -partials, ∆wk = D∗G fk · ∆ak+1 with D∗G fk = D∗Gk fk (ak , wk ). The reader should acquire familiarity with this diagram and the previous one. The gradient of the quadratic error Q of the multilayer network f = (fk )nk=1 with multiweight w, initial input a1 and desired output bn+1 is equal to the n-tuple of twice the lifted errors, ∇Q(w) = (2∆wk )nk=1 ; see equation (21). Also, compare with Figure 11.

References
[1] Berberian, S. K. Introduction to Hilbert Space. Oxford University Press, 1961.

[2] Crespin, D. Neural Network Formalism. 1995.

[3] Crespin, D. Generalized Backpropagation. 1995.

[4] Crespin, D. A Primer on Perceptrons. 2005.

[5] Dieudonné, J. Foundations of Modern Analysis. Academic Press, 1969.

[6] Schmidhuber, J. Deep Learning in Neural Networks: An Overview. Technical Report, 2014.

[7] Lang, S. Real and Functional Analysis. 3rd ed. Springer, 1993.

[8] Spivak, M. Calculus on Manifolds. Addison-Wesley, 1965.

Daniel Crespin
Oteyeva
March 31, 2023

