Abstract
In this journal, Cheng has proposed a backpropagation (BP) procedure called BPFCC for deep fully connected cascaded (FCC) neural network learning in comparison with a neuron-by-neuron (NBN) algorithm of Wilamowski and Yu. Both BPFCC and NBN are designed to implement the Levenberg-Marquardt method, which requires an efficient evaluation of the Gauss-Newton (approximate Hessian) matrix \(\nabla \textbf{r}^\textsf{T} \nabla \textbf{r}\), the cross product of the Jacobian matrix \(\nabla \textbf{r}\) of the residual vector \(\textbf{r}\) in the nonlinear least squares sense. Here, the dominant cost is to form \(\nabla \textbf{r}^\textsf{T} \nabla \textbf{r}\) by rank updates on each data pattern. Notably, NBN is better than BPFCC for multiple \(q~\!(>\!1)\)-output FCC-learning when q rows (per pattern) of the Jacobian matrix \(\nabla \textbf{r}\) are evaluated; however, the dominant cost (for rank updates) is common to both BPFCC and NBN. The purpose of this paper is to present a new, more efficient stage-wise BP procedure (for q-output FCC-learning) that reduces the dominant cost with no rows of \(\nabla \textbf{r}\) explicitly evaluated, just as standard BP evaluates the gradient vector \(\nabla \textbf{r}^\textsf{T} \textbf{r}\) with no explicit evaluation of any rows of the Jacobian matrix \(\nabla \textbf{r}\).
1 Introduction
Following the notations in [3], we consider the nonlinear least squares problem (e.g., see [8, 37]) for optimizing a multiple q-output FCC network having the M-length weight vector \(\textbf{w}\). Given N data pairs of n-input vector and q-target vector \(\{ \left( \textbf{x}(p), \textbf{t}(p) \right) \}^N_{p=1} \in \mathbb {R}^n \times \mathbb {R}^q\), the objective function E (cf. [3, Eq(24), p.301] and [45, Eq(1), p.1794]) is given below with \(m \! \equiv \! q N\) as the sum of m squared residuals, each denoted by either \(r_k\), \(k \!=\! 1,...,m\), or \(r_{j}(p)\) between the model’s jth output \(F_j(\textbf{x}(p);\textbf{w})\) and its target value \(t_{j}(p)\), \(j \!=\! 1,...,q\) and \(p \!=\! 1,...,N\); hence, the relation \(k \!=\! j \!+\! (p-1)q\) for \(r_k \!=\! F_j(\textbf{x}(p); \textbf{w}) - t_{j}(p) = r_j(p)\):
where \(E_p(\textbf{w}) \! \equiv \! \frac{1}{2} \sum ^q_{j=1} \! \left\{ r_{j}(p) \right\} ^2 \!=\! \frac{1}{2} \textbf{r}^\textsf{T}_p \textbf{r}_p\) on each datum p (with \(\textbf{r}_p\), the residual vector of length q), and \(\textbf{r}\) is the m-length (\(m \! \equiv \! qN\)) residual vector (over all N data) that is a function of \(\textbf{w}\). Then, \(\nabla E\), the gradient vector of \(E(\textbf{w})\) above, and Hessian matrix \(\nabla ^2 E\) can be expressed as below with \(\textbf{J}\! \equiv \! \nabla \textbf{r}\! = \! \frac{\partial \textbf{r}}{\partial \textbf{w}}\), the \(m \times M\) Jacobian matrix of \(\textbf{r}\):
where \(\nabla r_{k} \!=\! \nabla r_{j}(p)\) with \(k \!=\! j \!+\! (p-1)q\), and \(\textbf{U}^\textsf{T}_p ~\!(\equiv \! \nabla \textbf{r}_p)\), a (transposed) matrix of q rows of \(\textbf{J}\), each row \(j~\!(=\!1,...,q)\) denoted by \(\nabla r_{j}(p)^\textsf{T}\), on each training data pattern p. The cross-product \(\textbf{J}^\textsf{T} \textbf{J}\), known as the Gauss-Newton (approximate) Hessian matrix [8, 37], can be formed by rank-q updates Footnote 1 per datum p (or by q rank-1 updates), as below:
where \(\textbf{U}_p~\!(\equiv \! \nabla \textbf{r}_p^\textsf{T})\) is of size \(M \times q\). Algorithm BPMLP below is a widely-employed scheme (e.g., see [1, 19, 24, 25, 40, 46]) for calculating \(\textbf{J}^\textsf{T} \textbf{J}\) by backpropagation (BP) for a multilayer perceptron (MLP) that is assumed to have q linear outputs at the terminal layer:
Let \(\textbf{e}_j\) denote the jth column of \(\textbf{I}\), the \(q \times q\) identity matrix; then, Algorithm BPMLP performs the backward process [see step (b), Lines 2 to 6] for \(\textbf{U}_p\) in Eq. (2) by back-propagating \(\textbf{e}_j\), one column of \(\textbf{I}\) after another, for \(\nabla r_j(p)\! =\! \textbf{u}^p_j \! =\!\textbf{U}_p \textbf{e}_j\), \(j \!=\! 1,...,q\) (Line 5). This process amounts to back-propagating the numeric unit entry “1.0,” the jth entry of \(\textbf{e}_j\) (Line 4), rather than the residual \(r_j\) from each output node j, \(j \!=\! 1,...,q\); this is what we call the (output) node-wise BP procedure. After that, Algorithm BPMLP forms \(\textbf{U}_p \textbf{U}_p^\textsf{T}\) by rank-q updates [step (c) in Line 8] and computes \(\nabla E_p\) [step (d) in Line 9]. The cost per pattern p (e.g., see [28, 30]) of steps (a) to (d) can be approximated by
In this journal, Cheng [3, Sec. 3.3] has proposed Algorithm BPFCC for q linear-output FCC networks; it is essentially identical to the above Algorithm BPMLP (see Appendix A.1), back-propagating \(\alpha _{L \rightarrow L} \! \equiv \! 1\) from each linear output node, q times in total (to be detailed in Sect. 2.3). Wilamowski and Yu [45] have identified computational overheads over such q repetitions (see lines 3 to 6 in BPMLP above), and thus have developed the forward-only neuron-by-neuron (NBN) algorithm [47] (see Appendix B), similar in spirit to forward-mode automatic differentiation [18], for q-output FCC networks [45, p.1798] in nonlinear regression. Yet, the resulting gain is not significant because both NBN and BPFCC are designed to reduce the cost of step (b) with no effect on the dominant cost of step (c) in Eq. (4). On the other hand, we present a new BP strategy that reduces the cost of both steps (b) and (c), forming \(\textbf{U}_p \textbf{U}_p^\textsf{T}\) directly without evaluating \(\textbf{U}_p\) explicitly. In particular, our BP exploits the special structure of the q linear-output FCC network of Cheng’s type, computing the Gauss-Newton matrix \(\textbf{J}^\textsf{T} \textbf{J}\) in a convenient block-arrow form, which allows us to exploit sparsity in matrix factorization.
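To make steps (a)–(d) of Algorithm BPMLP concrete before turning to the FCC case, the following sketch carries out the same node-wise computation for a small one-hidden-layer MLP with tanh hidden units and q linear outputs; the function name, weight layout, and network shape are illustrative only and do not reproduce the BPMLP listing.

```python
import numpy as np

def bpmlp_pattern(x, t, W1, W2):
    """Node-wise BP for one pattern: steps (a)-(d) for a one-hidden-layer MLP.
    W1: (H, n+1) hidden weights (bias in column 0); W2: (q, H+1) linear output weights."""
    # (a) forward pass
    x1 = np.concatenate(([1.0], x))        # bias entry plus the n inputs
    y1 = np.tanh(W1 @ x1)                  # hidden-node outputs
    y1b = np.concatenate(([1.0], y1))      # bias entry plus hidden outputs
    r = W2 @ y1b - t                       # residual vector of length q

    q, M = W2.shape[0], W1.size + W2.size
    U = np.zeros((M, q))                   # U_p, one column per output node
    # (b) backward pass repeated q times: back-propagate e_j (i.e., a "1.0") from node j
    for j in range(q):
        dW2 = np.zeros_like(W2)
        dW2[j] = y1b                       # d r_j / d W2
        dz1 = W2[j, 1:] * (1.0 - y1**2)    # delta at the hidden net inputs
        dW1 = np.outer(dz1, x1)            # d r_j / d W1
        U[:, j] = np.concatenate([dW1.ravel(), dW2.ravel()])
    JJ_p = U @ U.T                         # (c) rank-q update: the dominant cost
    grad_p = U @ r                         # (d) per-pattern gradient U_p r_p
    return U, JJ_p, grad_p
```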
It should be noted here that Wilamowski and Yu’s FCC network model [45] and Cheng’s FCC model both have only one node at each hidden layer, but differ in the arrangement of q output nodes; those two FCC models are compared in Fig. 1 (for \(q \!=\! 3\)), where Cheng’s model has all q linear output nodes at the terminal layer (see [3, Fig. 2, p.294] for \(q \!=\! 2\)), whereas Wilamowski and Yu’s model (see [45, Fig. 5, p.1796] for \(q \!=\! 3\)) produces q (nonlinear) outputs from the last q layers Footnote 2. Therefore, when those two models in Fig. 1 are optimized with E in Eq. (1), different sparsity patterns of \(\textbf{U}_p\) result; see [45, Eq. (24), p.1797] and [3, Eq. (22), p.300], and see also Eq. (27) later.
The rest of the paper is organized as follows. Sect. 2 scrutinizes the FCC network of Cheng’s type and his proposed algorithm BPFCC using the same notations as defined in [3]. Then, Sect. 3 describes our new structure-exploiting BP procedure, and Sect. 4 presents numerical evidence; finally, our conclusion follows in Sect. 5.
2 Fully Connected Cascaded (FCC) Neural Network Learning of Cheng’s Type
We investigate carefully Cheng’s backpropagation (BP) procedure proposed for FCC network learning, following the same notation as defined in [3].
2.1 The FCC Network Structure of Cheng’s Type
Consider a single-output L-layer FCC (see [3, Fig. 1, p.294]) with M weights in total (including biases), receiving a (row) vector of n external inputs \(\textbf{x}\!=\! [x_1, ..., x_n]\) (see [3, Eq. (1), p.295]). More specifically, there is only one hidden node at each layer (or stage) l, \(l>0\), except the input layer at stage 0 (\(l \!=\! 0\)); then, each connection weight is denoted by \(\lambda _{l,k}\) from node k, \(k \!=\! 0,..., n\!+\!l\!-\!1\), to the node at stage l, \(l \!=\! 1,..., L\); here, the node at stage l, \(l>0\), is numbered “\(n \!+\! l\).” The given set of M weights (see [3, Eqs. (2) & (7), p.296]) is grouped with respect to each node at stage l, \(l>0\), each group denoted by \(\textbf{w}_l\), and they are collectively denoted by a row vector (in [3, Eq. (3), p.296]) of length M as
Each weight-group vector \(\textbf{w}_l\) is of length \(n\!+\!l\) including \(\lambda _{l,0}\), a bias to the node at stage l, in the first entry, as shown below (see [3, Eq. (4), p.296]):
Then, M, the total number of weights including biases, is given below (see [3, Eq. (2),p.296])
In Cheng’s notation [3, Eq. (5), p.296], the net input Footnote 3 to the lth layer neuron is denoted by \(f^l(\textbf{x}; \textbf{w})\); then, it is given below by the forward pass at each stage l, \(l \!=\! 1,...,L\):
where \(\phi (x) \!=\! \textsf{tanh}(x)\), and then the output of the lth layer neuron is produced by \(\phi \big ( {{f^l(\textbf{x}; \textbf{w})}} \big )\) for \(l \!=\! 1,..., L - 1\). At the terminal layer L, the linear output is produced as \(f^L(\textbf{x}; \textbf{w})\).
Next, consider the case of multiple q outputs (\(q>1\)) of Cheng’s FCC network depicted in Fig. 1a. Since each terminal output node j, \(j \!=\! 1,...,q\), has \(n \!+\! L ~\!(\equiv \!C)\) incoming connections, M, the total number of weights, increases by \((q \!-\! 1) C\), leading Eq. (7) to [3, Eq(7), p.296] below including \(q C \! = \! q ( n \!+\! L)\), the total number of linear terminal weights:
where \(h \! \equiv \! M \!-\! qC\), the total number of nonlinear “hidden” weights. Then, the weight vector \(\textbf{w}\) in Eq. (5) is augmented as below [3, Eq(8), p.296]
where \(\textbf{w}^{(j)}_L\), \(j \!=\! 1,...,q\), is a vector of length \(C ~\!(= \! n \!+\! L)\) [3, Eq. (9), p.296], given by
As defined just before Eq. (1), let \(F_j\) be the jth output, \(j \!=\! 1,...,q\), at terminal layer L; then, the FCC model produces q linear outputs \(F_j\) (see [3, Eq. (10), p.296]) below with Eq. (11):
where \(f^r(\textbf{x}; \textbf{w})\), the net input to the hidden node at layer r, is obtained by Eq. (8). Appendix C.1 shows how the preceding formulas apply to a five-stage (\(L \!=\! 5\)) two-input (\(n \!=\! 2\)) three-output (\(q \!=\! 3\)) FCC network depicted in Fig. 2.
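For concreteness, the following sketch runs the forward pass of Eqs. (8) and (12) for a q-output FCC network of Cheng’s type; the argument names and the weight layout (one vector \(\textbf{w}_l\) per hidden stage, then the \(q \times C\) terminal weight matrix) are illustrative.

```python
import numpy as np

def fcc_forward(x, w_hidden, Theta_L):
    """Forward pass of a q-output FCC network of Cheng's type (a sketch).
    x        : the n external inputs.
    w_hidden : list of L-1 vectors w_l, each of length n+l (bias lambda_{l,0} first).
    Theta_L  : (q, n+L) matrix of linear terminal weights, rows w_L^(1),...,w_L^(q).
    Returns the q linear outputs F_j and the node-output vector y of length C = n+L."""
    y = np.concatenate(([1.0], x))        # node 0 (unit constant) and the n inputs
    for w_l in w_hidden:                  # stages l = 1,...,L-1
        f_l = w_l @ y                     # net input f^l, Eq. (8)
        y = np.append(y, np.tanh(f_l))    # hidden output phi(f^l) is fed forward
    F = Theta_L @ y                       # q linear terminal outputs, Eq. (12)
    return F, y
```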
2.2 Standard (Stage-Wise) BP [41] or Generalized Delta Rule for FCC Network Learning
Cheng, according to his literature summary [3, p.295], seems to be unaware of the pioneering work of Rumelhart et al. [41], where the generalized delta rule (see [41, Eq. (14), p.326]) is applied to a two-stage (\(L \!=\! 2\)) FCC model on the well-known XOR problem (see Fig. 3), and described for a general feedforward network. We now show how the stage-wise BP, or the generalized delta rule, works for multiple q-output (\(q\!>\!1\)) FCC networks of Cheng’s type (see Fig. 1a). To this end, let us define the quantity “delta” as the node-input sensitivity to \(E_p\) (on data pattern p) in Eq. (1) with respect to the net input to the node at each stage l:
Similarly, let us introduce another quantity, called node-output sensitivity, below, also known as adjoint variable (or co-state, multiplier, etc) in optimal control (e.g., see [2, 10, 13, 22, 27]):
By forward pass, the output of each hidden node at stage l, \(0<l<L\), is given by \(\phi \left( f^l(\textbf{x}; \textbf{w}) \right) \), where its net input \(f^l(\textbf{x}; \textbf{w})\) is subject to the forward-pass equation (8), and then q terminal linear outputs \(F_j(\cdot )\), \(j \!=\! 1,...,q\), are obtained by Eq. (12). Then, by backward pass, \(\delta _l\), the node-input sensitivity of (hidden) node at stage l, \(0<l<L\), can be determined as below
where \(r_j \! = \! \xi ^{(j)}_L \!=\! \delta ^{(j)}_L\) (terminal node sensitivity); i.e., the residual in Eq. (1) evaluated at linear-output node j, \(j \!=\! 1,...,q\), at end stage L on each data pattern. Then, the gradient entry of \(\nabla E_p\) with respect to each “hidden” weight \(\lambda _{l,i}\) for \(l<L\) is given by
For the gradient of linear “terminal” weight \(\lambda ^{(j)}_{L,i}\), we employ \(\delta ^{(j)}_L \! = \! r_j\) in Eq. (16) as below:
Appendix C.2 demonstrates the above stage-wise BP on the FCC model depicted in Fig. 2. Next, we describe in detail Cheng’s BP algorithm (called BPFCC), which is characterized as node-wise BP, showing how it works differently from the above stage-wise BP.
2.3 The BPFCC Routine of Cheng’s BP Algorithm
For discussion purposes, the BPFCC routine of Cheng’s BP procedure (see [3, Algorithm 2, p.300]) is displayed below.
The BPFCC routine is designed for the single-output (\(q \!=\! 1\)) FCC network. Hence, for the multiple q-output (\(q>1\)) case, BPFCC is repeated q times, once with respect to each output node (hence, the node-wise BP as mentioned earlier); see Note added between lines 2 and 3 in BPFCC shown above. In the nonlinear least squares problem with \(E(\textbf{w})\) in Eq. (1) involving N training data, Cheng’s BP algorithm calls \(\textbf{G} \!=\! \textsf{BPFCC}(\textbf{x}_p, \textbf{w}, n, L)\) on each data pattern p for \(p \!=\! 1,...,N\). By forward pass (Line 2), the FCC network produces the linear output \(F_j(\textbf{x}; \textbf{w})\), \(j \!=\! 1,...,q\), at terminal layer L by Eq. (12). Then, during the backward process (see Lines 3\(\sim \)9), BPFCC explicitly evaluates the quantity \(\alpha _{l \rightarrow L}\), called the derivative amplification coefficient (from layer l to terminal layer L), by the following backward-pass recurrence relation [3, Eq. (21), p.299] for \(0<l < L\), starting with the boundary condition, \(\alpha _{L \rightarrow L} \!=\! 1\), at terminal layer L:
which must be repeated q times, once for each output node j, \(j \!=\! 1,...,q\). According to [3, Sec. 3.3], for each repeat of Eq. (18) on j, weight \(\lambda _{L,i}\) must be regarded as \(\lambda ^{(j)}_{L,i}\) in \(\textbf{w}^{(j)}_L\), a vector of \(C ~\!(= \! n \!+\! L)\) linear terminal weights defined in Eq. (11):
Eq. (18) is highlighted as the core equation [3, p.299], although it should be implemented as
Then, the quantities below are evaluated for each repeat on output node j, \(j \!=\! 1,...,q\):
and stored into \(\textbf{G}\). Note in Lines 6 and 7 in Algorithm BPFCC that \(f^L(\textbf{x};\textbf{w})\) therein must be regarded as \(F_j (\textbf{x};\textbf{w})\) in accordance with Eq. (19) for each repeat on output node j, \(j \!=\! 1,...,q\). Since Eq. (21) yields \(\nabla r_j(p)\) at each repeat on j, BPFCC returns \(\textbf{G}\!=\! \textbf{U}_p^\textsf{T}\) in Eq. (2); i.e., the q rows of the residual Jacobian matrix \(\textbf{J}\! \equiv \! \nabla \textbf{r}\) of size \(qN \times M\) in Eq. (2). This reveals that Algorithm BPFCC is equivalent to step (b) of Algorithm BPMLP shown in Sect. 1. Therefore, \(\alpha _{l \rightarrow L}\), which Cheng [3, p.297] termed the “derivative amplification coefficient,” is just the partial derivative of residual entry \(r_j(p)\) evaluated at output node j on datum p with respect to the net input to the node at stage l at each repeat on j, \(j \!=\! 1,...,q\):
which is recursively evaluated by Eqs. (18) or (20). After \(\textbf{U}_p^\textsf{T} \!=\! \textbf{G}\!=\! \textsf{BPFCC}(\textbf{x}_p, \textbf{w}, n, L)\) is obtained, the gradient \(\nabla E_p(\textbf{w})\) per pattern p is computed below as step (d) in Eq. (4)
Appendix C.3 shows how Algorithm BPFCC works on the FCC shown in Fig. 2.
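To complement Appendix C.3, the sketch below mimics the node-wise computation just described for a general q-output FCC network of Cheng’s type: for each output node j it back-propagates \(\alpha _{L \rightarrow L} \!=\! 1\) via the recurrence of Eq. (20) and assembles the row \(\nabla r_j(p)^\textsf{T}\) of Eq. (21); the function name and the weight ordering are illustrative, not Cheng’s BPFCC listing.

```python
import numpy as np

def bpfcc_like_pattern(x, t, w_hidden, Theta_L):
    """Node-wise BP in the spirit of BPFCC: q backward repeats per pattern.
    w_hidden[l-1] is w_l (length n+l, bias first); Theta_L is q x (n+L)."""
    n, L = len(x), len(w_hidden) + 1
    q, C = Theta_L.shape
    # forward pass, Eqs. (8) and (12)
    y = np.concatenate(([1.0], x))
    f = []
    for w_l in w_hidden:
        f.append(w_l @ y)
        y = np.append(y, np.tanh(f[-1]))
    r = Theta_L @ y - t                         # residual vector r_p

    rows = []                                   # G = U_p^T, one row per output node j
    for j in range(q):                          # q repetitions of the backward pass
        alpha = np.zeros(L)                     # alpha_{l->L} for hidden stages l
        grad_terminal = np.zeros((q, C))
        grad_terminal[j] = y                    # d r_j / d w_L^(j) = y (linear output)
        grad_hidden = []
        for l in range(L - 1, 0, -1):           # recurrence, starting from alpha_{L->L} = 1
            back = Theta_L[j, n + l] + sum(
                alpha[s] * w_hidden[s - 1][n + l] for s in range(l + 1, L))
            alpha[l] = (1.0 - np.tanh(f[l - 1]) ** 2) * back
            grad_hidden.insert(0, alpha[l] * y[:n + l])   # d r_j / d w_l
        rows.append(np.concatenate(grad_hidden + [grad_terminal.ravel()]))
    G = np.vstack(rows)                         # q x M matrix of Jacobian rows
    grad_p = G.T @ r                            # grad E_p = U_p r_p, Eq. (23)
    return G, grad_p
```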
2.4 Complexity Analysis of BPFCC for \(\textbf{U}_p\) and \(\nabla E_p\)
Cheng’s BP employs \(\textbf{G} \!=\! \textbf{U}_p^\textsf{T} \!=\! \textsf{BPFCC}(\textbf{x}(p), \textbf{w}, n, L)\) for the q rows of \(\textbf{J}\! \equiv \! \nabla \textbf{r}\) in Eq. (2) by accumulating the q rows \(\nabla r_j(p)^\textsf{T}\) one at a time (see [3, Sec. 3.3]), node-wise with respect to each output node j, \(j \!=\! 1,..., q\). This (output-)node-wise BP procedure is popular in MLP-learning (e.g., see [1, 19, 24, 25, 40, 46]), as summarized in Algorithm BPMLP in Sect. 1, for evaluating rows of \(\textbf{J}\), to be followed by the rank-updating of Eq. (3) for \(\textbf{J}^\textsf{T} \textbf{J}\).
Consider the cost of evaluating the gradient vector \(\nabla E_p\) of length M, where M is the number of weights defined in Eqs. (7) and (9); such a node-wise procedure as BPFCC is not as efficient as stage-wise BP in the multiple q-output (\(q>1\)) case. This is because, while BPFCC obtains the gradient vector \(\nabla E_p\) by Eq. (23) in O(qM), stage-wise BP evaluates \(\nabla E_p\) with Eqs. (16) and (17) in O(M) by computing node sensitivities \(\delta _l\) in Eq. (13) at each stage l with no row \(\nabla r_j(p)^\textsf{T}\) of \(\textbf{J}\) required explicitly (unlike BPFCC); this is the essence of stage-wise BP by the generalized delta rule (15). More specifically, in multiple-output FCC network learning, the q repetitions of backward passes in BPFCC require \(O({{{q(C \!+\! h)}}})\), where \(q(C + h) > qC + h \!=\! M\) by Eq. (9). This fact can be observed in the comparison between the stage-wise progress in Fig. 7 and the node-wise progress in Fig. 8. To see this better, let us set \(\alpha _{5 \rightarrow 5} \!=\! r_1\) in Eq. (45), \(\alpha _{5 \rightarrow 5} \!=\! r_2\) in Eq. (46), and \(\alpha _{5 \rightarrow 5} \!=\! r_3\) in Eq. (47); then, accumulating quantities of \(\alpha \) over Eqs. (45), (46), and (47) yields Eq. (44) for node-sensitivities \(\delta _l\). This \(\alpha \)-accumulation for \(\delta _l\) is inefficient because six hidden weights (\(\lambda _{4,3}\), \(\lambda _{4,4}\), \(\lambda _{4,5}\), \(\lambda _{3,3}\), \(\lambda _{3,4}\), \(\lambda _{2,3}\)) are repeatedly used in Eqs. (45), (46), and (47).
Algorithm BPFCC always starts with \(\alpha _{5 \rightarrow 5} \!=\! 1\) at any output node j, \(j \!=\! 1,...,q\), on each datum p, just as in Line 4 of Algorithm BPMLP; this is simply because BPFCC evaluates explicitly \(\nabla r_j(p)\) for \(j \!=\! 1,...,q\); i.e., the q rows of the residual Jacobian matrix \(\textbf{J}\) as \(\textbf{G}\!=\! \textbf{U}_p^\textsf{T}\) with Eq. (21). Wilamowski and Yu [45] have recognized the computational overheads (see Appendix A.1) behind the q repetitions (\(q > 1\)) of such a node-wise BP procedure for \(\textbf{U}_p\), and then developed the neuron-by-neuron (NBN) method for evaluating \(\textbf{U}_p\) on each datum p; see \(\textbf{U}_p\) in [45, Eq. (24), p.1797]. When \(q \!=\! 1\) (single-output case), both NBN and BPFCC work at essentially the same cost; see the complexity analysis of NBN in [45, p.1798]. As explained with Eq. (4) in Sect. 1, NBN (as well as BPFCC) attempts to reduce the cost of step (b) for \(\textbf{U}_p\) of size \(M \times q\) with no impact on the dominant cost of step (c) for rank-updating by Eq. (3) for \(\textbf{J}^\textsf{T} \textbf{J}\) of size \(M \times M\) in Eq. (2). This indicates that NBN is not significantly better than BPFCC when \(q > 1\).
3 A New Structure-Exploiting BP for Evaluating \(\textbf{J}\) and Its Cross Product \(\textbf{J}^\textsf{T} \textbf{J}\)
All the preceding BPMLP, BPFCC and NBN procedures evaluate rows of \(\textbf{J}\) explicitly and perform rank updates by Eq. (3) for \(\textbf{J}^\textsf{T} \textbf{J}\). On the other hand, the so-called second-order stage-wise BP [30, 31, 33] can improve the dominant cost in Eq. (4)(c) by avoiding the explicit evaluation of any rows \(\nabla r_k^\textsf{T}\) of \(\textbf{J}\) for MLP-learning, just as the gradient vector \(\nabla E\) is computed with node sensitivities in Eq. (13) by Eq. (16) without any row \(\nabla r_k\) explicitly for \(\textbf{J}^\textsf{T} \textbf{r}\). That is, no rows of \(\textbf{J}\) are explicitly required to form \(\textbf{J}^\textsf{T} \textbf{J}\) in MLP-learning; consequently, stage-wise BP can evaluate the exact Hessian \(\nabla ^2 E\) in Eq. (2) faster than the node-wise BP that evaluates merely \(\textbf{J}^\textsf{T} \textbf{J}\) with rank-updating by Eq. (3) in MLP-learning [33].
In this section, we develop a new BP procedure specifically geared to FCC of Cheng’s type having multiple q linear outputs (see Fig. 2) for evaluating \(\textbf{J}^\textsf{T} \textbf{J}\) efficiently by accumulating node sensitivities in a stage-wise fashion (with no rows of \(\textbf{J}\) explicitly evaluated).
3.1 The Structure of Block Angular and Block Arrowhead Matrices
To show the special structure of \(\textbf{J}^\textsf{T} \textbf{J}\), we first note that \(\textbf{w}\), the entire weight vector of length M in Eq. (10), consists of the following linear and nonlinear weights:
1. q linear weight vectors at terminal stage L, \(\textbf{w}^{(j)}_L\), \(j \!=\! 1,...,q\), each of length \(C \! \equiv \! n \!+\! L\), as defined in Eq. (11); and
2. \((L \!-\! 1)\) nonlinear “hidden” weight vectors \(\textbf{w}_l\), \(l \!=\! 1,...,L \!-\! 1\), defined in Eq. (6).
We then arrange those group-weight vectors in the reverse order of \(\textbf{w}_l\) in Eq. (10) (i.e., [3, Eq(8), p.296]) below for \(l \!=\! L, L \!-\! 1,..., 1\) with \(h \! \equiv \! M \!-\! qC\) [see Eq. (9)]:
Here, \(\textbf{w}_h\), the h-vector of hidden weights, is stage-wisely partitioned, comprising \(L \!-\! 1\) weight vectors \(\textbf{w}_l\) in Eq. (6) of length \(n \!+\! l\) at each stage \(l \!=\! 1,..., L \!-\! 1\) due to Eq. (24):
For the weight arrangements in Eq. (24), let us show below the sparsity structure of \(\nabla r_j(p)^\textsf{T}\), a row of \(\textbf{J}~\!(\equiv \! \nabla \textbf{r})\) corresponding to the residual entry \(r_j(p)\) evaluated at terminal node j, \(j \!=\! 1,...,q\), on data pattern p, \(p \!=\! 1,...,N\), for Eqs. (1) and (2):
where \(\textbf{0}\) is the zero vector of length \(C~\!(\equiv \! n \!+\! L)\). In general, the M-vector \(\nabla r_j(p)\) consists of \((q \!-\! 1)C\) zero entries (see [3, Eq. (22), p.300]) and only \(C \!+\! h\) non-zero entries, denoted by the C-vector \(\textbf{a}_j(p) \! \equiv \! \frac{\partial r_j(p)}{\partial \textbf{w}^{(j)}_L}\) and the h-vector \(\textbf{b}_j(p) \! \equiv \! \frac{\partial r_j(p)}{\partial \textbf{w}_h}\). Since \(\nabla r_j(p)\) is the jth column of the \(M \times q\) matrix \(\textbf{U}_p\) in Eq. (2), \(\textbf{J}^\textsf{T} \textbf{J}\!=\! \sum _p \textbf{U}_p \textbf{U}_p^\textsf{T}\), the cross-product matrix resulting from Eq. (3), becomes such a nice block-arrow matrix as \(\textbf{U}\textbf{U}^\textsf{T}\) (with the notation p of datum p omitted) below right due to the \(M \times q\) block-angular matrix \(\textbf{U}\) (see also Appendix A.1) below left with \(h \! \equiv \! M \!-\! qC\), the number of hidden weights [see Eq. (9)]:
for which we pay attention to the following three types of block matrices in forming \(\textbf{U}\textbf{U}^\textsf{T}\):
Those three block-matrix types (a), (b), and (c) in \(\textbf{U}\textbf{U}^\textsf{T}\) are colored in green, yellow, and pink, respectively, in Eq. (27). Such angular and arrowhead matrices in Eq. (27) frequently arise in scientific computing, and the arrowhead should point in the south-east direction to exploit the sparsity (e.g., see [16, 21, 29, 42]; pp.348-351 [5]). Unfortunately, most currently available neural-network software packages are not designed to exploit the sparsity inevitably arising in multiple-output neural-network learning; Fig. 4, for instance, compares three sparsity patterns of \(16 \times 16\) \(\textbf{J}^\textsf{T} \textbf{J}\) obtainable from four-output 1-2-4 MLP-learning (16 weights). For the desired sparsity in Fig. 4a, only dense blocks need to be formed to save memory, and its block-arrow form allows us to avoid fill-ins when (modified) Cholesky factorization is performed on each block in \(\textbf{J}^\textsf{T} \textbf{J}\) or when QR is applied to blocks in \(\textbf{J}\) (see [29] and references therein); e.g., for the Levenberg-Marquardt algorithm [19, 25, 46] with trust-region globalization (e.g., see [6,7,8, 26] for NL2SOL type). In Fig. 4b and c, the resulting complicated sparsity patterns are hard to exploit; evidently, the corresponding optimization procedures simply ignore the sparsity.
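For reference, the block-angular matrix \(\textbf{U}\) and its block-arrow cross product described in Eqs. (27) and (28) have the schematic form (per pattern, with the subscript p omitted; blank positions denote zero blocks)

\[
\textbf{U} =
\begin{bmatrix}
\textbf{a}_1 & & \\
 & \ddots & \\
 & & \textbf{a}_q \\
\textbf{b}_1 & \cdots & \textbf{b}_q
\end{bmatrix},
\qquad
\textbf{U}\textbf{U}^\textsf{T} =
\begin{bmatrix}
\textbf{a}_1 \textbf{a}_1^\textsf{T} & & & \textbf{a}_1 \textbf{b}_1^\textsf{T} \\
 & \ddots & & \vdots \\
 & & \textbf{a}_q \textbf{a}_q^\textsf{T} & \textbf{a}_q \textbf{b}_q^\textsf{T} \\
\textbf{b}_1 \textbf{a}_1^\textsf{T} & \cdots & \textbf{b}_q \textbf{a}_q^\textsf{T} & \sum _{j=1}^{q} \textbf{b}_j \textbf{b}_j^\textsf{T}
\end{bmatrix},
\]

with the arrowhead pointing to the south-east, as recommended above.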
Furthermore, the minimization of \(E(\textbf{w})\) in Eq. (1) is called the separable nonlinear least squares problem [15, 34, 39], when the model (of our choice) mixes linear and nonlinear parameters as shown in Eq. (24). Then, a distinguishing feature of the block matrices shown in Eq. (27) is that all q on-diagonal blocks \(\textbf{a}_j\), \(j \!=\! 1,...,q\), in \(\textbf{U}\) are identical to \(\textbf{y}\), the C-vector of node outputs, defined for Eq. (42); hence, \(\textbf{a}_j \!=\! \textbf{y}\) for any j on each datum owing to linear terminal outputs. In consequence, the first q on-diagonal blocks \(\textbf{a}_j \textbf{a}_j^\textsf{T}\), \(j \!=\! 1,...,q\), in \(\textbf{U}\textbf{U}^\textsf{T}\) [see type (a) in Eq. (28)] are identical to the rank-one matrix \(\textbf{y}\textbf{y}^\textsf{T}\); hence, \(\textbf{a}_j \textbf{a}_j^\textsf{T} \!=\! \textbf{y}\textbf{y}^\textsf{T}\) for any j. Since those q blocks are identical, we can save memory by forming only one on-diagonal block, which is obtainable just after the forward pass evaluating node-output vector \(\textbf{y}\); this is an important feature to be exploited (e.g., see [29]) that commonly arises in multiple-linear-output neural-network learning.
3.2 A BP Procedure for Extracting the Information of \(\textbf{U}\)
We describe how to use BP for extracting the information of block-angular (residual Jacobian) matrix \(\textbf{U}\) in Eq. (27) without forming \(\textbf{U}\) explicitly.
First, by forward pass, we calculate hidden-node outputs \(\phi ( {f^l(\cdot )} )\) by Eq. (8) at each stage l, \(0<l<L\), and then evaluate q terminal linear outputs \(F_j(\cdot )\), \(j \!=\! 1,...,q\) by Eq. (12) at end stage L. Eq. (42), for instance, shows that the q-length vector of linear terminal outputs is given by \(\varvec{\varTheta }_{\!{{L}}} \textbf{y}\), where \(\varvec{\varTheta }_{\!{{L}}}\) is the \(q \times C~\!(\equiv \! n \!+\! L)\) matrix of terminal linear weights [see Eqs. (39) and (40)], and \(\textbf{y}\) is the C-length vector that concatenates all preceding hidden-node outputs including the n inputs and the unit-constant output of node 0 at stage 0 (i.e., nominal input layer). In general, let us define \(\textbf{y}_l\) to be the concatenated vector of the first \((n \!+\! l \!+\! 1)\) node outputs up to stage l, \(l<L\); then, by definition, \(\textbf{y}_{L-1}\) is the above C-vector \(\textbf{y}\) (hence, \(\textbf{y}\! \equiv \! \textbf{y}_{L-1}\)), and \(\textbf{y}_l\) is the vector of the first \((n \!+\! l \!+\! 1)\) entries of \(\textbf{y}\); that is, \(\textbf{y}_l^\textsf{T} \! \equiv \! \left[ 1, x_1, ..., \phi \!\left( \!{{{f^{l}\!(\textbf{x};\!\textbf{w})}}}\!\right) \right] \). Below, we show their concatenated structure of \(\textbf{y}_l\), \(l \!=\! 0,..., L \!-\! 1\), on any data pattern with \(\textbf{y}\! \equiv \! \textbf{y}_{L-1}\) of length \(C ~\!(\equiv \! n \!+\! L)\):
where the first element is the unit-constant output of node 0 associated with bias \(\lambda _{l,0}\) in hidden-weight vector \(\textbf{w}_l\), \(l \!=\! 1,...,L-1\). As mentioned at the end of Sect. 3.1 for type (a) of Eq. (28), \(\textbf{a}_j \textbf{a}_j^\textsf{T} \!=\! \textbf{y}\textbf{y}^\textsf{T}\) (for any j); hence, we only need one block \(\textbf{y}\textbf{y}^\textsf{T}\), which is easy to obtain just after the forward pass for \(\textbf{y}\) above.
Then, by backward pass, we evaluate \(\textbf{b}_j \! \equiv \! \frac{\partial r_j}{\partial \textbf{w}_h}\) on each datum by back-propagating, stage by stage, the q-length vector \(\varvec{\beta }_l\) of (hidden) node-input sensitivities [compare \(\delta _l\) in Eq. (13)], defined below as the sensitivity of the residual vector \(\textbf{r}_p\) of length q in Eq. (1) on any datum p with respect to \(f^l(\textbf{x}; \textbf{w})\), the net input to hidden node l, \(0<l<L\), defined in Eq. (8):
which is recursively computed by the backward formula below analogous to Eq. (15), starting from \(\varvec{\beta }_{L-1}\) down to \(\varvec{\beta }_1\) for \(l \!=\! L \!-\! 1, L \!-\! 2, ..., 1\), with \(\phi _l ' (.) \! \equiv \! \phi ' ({{{f^l(\textbf{x};\textbf{w})}}})\) and \(e \equiv n + L - 1\) for the boundary value, \(\varvec{\beta }_{L-1}\), below right:
According to \(\textbf{w}_h\) in Eq. (25), \(\textbf{b}_j \! \equiv \! \frac{\partial r_j}{\partial \textbf{w}_h}\) on any datum is also stage-wisely partitioned, consisting of \(L \!-\! 1\) vectors \(\textbf{v}^{(j)}_l \! \equiv \! \frac{\partial r_j}{\partial \textbf{w}_l}\), \(l \!=\! 1,2,...,L \!-\! 1\); that is, \(\textbf{v}^{(j)}_l\) is the \((n \!+\! l)\)-length gradient vector of the jth residual \(r_j\) with respect to \(\textbf{w}_l\). Thus, we may evaluate \(\textbf{b}_j\) in a stage-wise fashion by computing \(\textbf{v}^{(j)}_l\) at each stage l, as shown below, using \(\textbf{y}_l\) in Eq. (29) and \(\beta ^{(j)}_l\), \(j \!=\! 1,...,q\), the jth component of the q-vector \(\varvec{\beta }_l\) defined in Eq. (30):
3.3 A Structure-Exploiting BP Strategy for Forming \(\textbf{U}\textbf{U}^\textsf{T}\)
Next for \(\textbf{U}\textbf{U}^\textsf{T}\) in Eq. (27), three block-matrix types (a) to (c) in Eq. (28) are partitioned as below, according to the partition of the weight-vector \(\textbf{w}\) shown in Eqs. (24) and (25):
More specifically, (b) and (c) are partitioned stage-wise, conforming to the stage-wise partition of each \(\textbf{b}_j\), \(j \!=\! 1,...,q\), into \(\textbf{v}^{(j)}_l\) in Eq. (32) at each stage l, \(l \!=\! 1,...,(L \!-\! 1)\); hence, by definition of \(\textbf{v}^{(j)}_l\), each partitioned block in (b) and (c) is respectively given below using \(\textbf{y}_l\), the node-output vector of length \(n \!+\! l \!+\! 1\) at stage l shown in Eq. (29), and the scalar \(\beta ^{(j)}_l\), \(j \!=\! 1,...,q\), the jth component of the q-length vector \(\varvec{\beta }_l\) in Eq. (30):
Notice here that we need to evaluate blocks of type (b) only for \(s \le t\) by symmetry. Furthermore, by definition of \(\textbf{y}_l\) in Eq. (29), where \(\textbf{y}\! \equiv \! \textbf{y}_{L-1}\) and \(C \!=\!n \!+\! L\), the outer product, \(\textbf{y}_{l-1} \textbf{y}^\textsf{T}\), in (c) is just a submatrix of the \(C \times C\) rank-one matrix \(\textbf{y}\textbf{y}^\textsf{T}\) in type (a) of Eq. (33). So is \(\textbf{y}_{s-1} \textbf{y}_{t-1}^\textsf{T}\) in (b). See below for their nested structure.
Therefore, we evaluate only one outer product, \(\textbf{y}\textbf{y}^\textsf{T}\), of size \(C \times C\), in type (a) of Eq. (33) when we obtain \(\textbf{y}~\!(=\!\textbf{y}_{L-1})\) in Eq. (29) after the forward pass with Eqs. (8) and (12); then, there is no need to compute the other outer products for (b) and (c) because any of them is just a submatrix of \(\textbf{y}\textbf{y}^\textsf{T}\), as shown in Eq. (35). That is, we form the blocks of types (b) and (c) in Eqs. (28) and (33) by multiplying a submatrix of \(\textbf{y}\textbf{y}^\textsf{T}\) [in Eq. (35)] by a scalar without evaluating any \(\textbf{v}^{(j)}_l\) (or \(\textbf{b}_j\)) explicitly, as indicated in Eq. (34), for which we obtain each scalar quantity for (b) and (c) by the backward-pass equation (31) of \(\varvec{\beta }_l\). The procedure is summarized in Algorithm BPJJ below for computing only dense blocks of types (b) and (c) as well as one dense block \(\textbf{y}\textbf{y}^\textsf{T}\) of type (a):
Appendix C.4 illustrates how Algorithm BPJJ works on the FCC network in Fig. 2, leading to the block-arrow matrix shown in Fig. 5a, juxtaposed with another block-arrow matrix in Fig. 5b associated with a 21-input 10-output five-stage FCC network for the cardiotocography benchmark problem (see [3, Table 2, p.309]).
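For reference, the sketch below carries out the per-pattern computation in the spirit of Algorithm BPJJ: one forward pass for \(\textbf{y}\) and the single block \(\textbf{y}\textbf{y}^\textsf{T}\), one stage-wise backward pass for \(\textbf{B}\), the gradient via \(\varvec{\delta }\!=\! \textbf{B}\textbf{r}_p\), and the remaining dense blocks of \(\textbf{U}_p \textbf{U}_p^\textsf{T}\) read off as scalar multiples of submatrices of \(\textbf{y}\textbf{y}^\textsf{T}\); the names and block bookkeeping are illustrative, not the original listing.

```python
import numpy as np

def bpjj_pattern(x, t, w_hidden, Theta_L):
    """Stage-wise, structure-exploiting BP per pattern (a sketch): forms the dense
    blocks of U_p U_p^T and grad E_p without ever forming a column of U_p.
    w_hidden[l-1] is w_l (length n+l, bias first); Theta_L is q x (n+L)."""
    n, L = len(x), len(w_hidden) + 1
    q, C = Theta_L.shape

    # forward pass, Eqs. (8) and (12); y = y_{L-1} of length C
    y = np.concatenate(([1.0], x))
    f = []
    for w_l in w_hidden:
        f.append(w_l @ y)
        y = np.append(y, np.tanh(f[-1]))
    r = Theta_L @ y - t                            # residual vector r_p
    yyT = np.outer(y, y)                           # the single C x C block y y^T

    # backward pass for B = [beta_1, ..., beta_{L-1}], Eq. (31)
    B = np.zeros((L - 1, q))                       # row l-1 holds beta_l
    for l in range(L - 1, 0, -1):
        back = Theta_L[:, n + l].copy()            # weights to the q linear outputs
        for s in range(l + 1, L):                  # plus weights to later hidden stages
            back += w_hidden[s - 1][n + l] * B[s - 1]
        B[l - 1] = (1.0 - np.tanh(f[l - 1]) ** 2) * back

    # gradient via node sensitivities delta_l = beta_l . r_p
    delta = B @ r
    grad_Theta = np.outer(r, y)                    # terminal-weight gradient
    grad_hidden = [delta[l - 1] * y[:n + l] for l in range(1, L)]

    # dense blocks of U_p U_p^T: every outer product is a submatrix of y y^T
    diag_block = yyT                               # a_j a_j^T = y y^T for every j
    BBT = B @ B.T                                  # scalars for the hidden-hidden blocks
    hidden_hidden = [[BBT[s - 1, t - 1] * yyT[:n + s, :n + t]
                      for t in range(s, L)] for s in range(1, L)]      # stages s <= t
    hidden_terminal = [[B[l - 1, j] * yyT[:n + l, :]                   # beta_l^(j) y_{l-1} y^T
                        for j in range(q)] for l in range(1, L)]
    return grad_hidden, grad_Theta, diag_block, hidden_hidden, hidden_terminal
```

Note that only the scalars in \(\textbf{B}\) and \(\textbf{B}\textbf{B}^\textsf{T}\) multiply submatrices of the single outer product \(\textbf{y}\textbf{y}^\textsf{T}\), which is the source of the cost reduction discussed in Sect. 3.4.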
3.4 Algorithmic Complexity on an L-Stage FCC Network with n Inputs and q Outputs
To implement the Levenberg-Marquardt method [3, p.303] by Algorithm BPFCC, Cheng has developed the FCCNET program [3, Sec. 5]. We now show the difference in algorithmic complexity between Algorithm BPJJ (in Sect. 3.3) and Cheng’s FCCNET with node-wise Algorithm BPFCC (in Sect. 2.3) on multiple q-output FCC network learning. Table 1 summarizes our complexity analysis on each data pattern p, \(p \!=\! 1,...,N\), in three items: (1) cost for evaluating node outputs \(\textbf{y}\) and gradients \(\nabla E_p\); (2) cost for \(\textbf{U}_p \textbf{U}_p^\textsf{T}\) by Algorithm BPJJ; and (3) cost for \(\textbf{U}_p\) by BPFCC and \(\textbf{U}_p \textbf{U}_p^\textsf{T}\) by Eq. (3). Because FCCNET employs node-wise BPFCC, it follows that the whole time complexity analysis of FCCNET is similar to that of Algorithm BPMLP shown in Eq. (4), where M is the total number of weights; for the L-stage FCC model having n inputs, M is defined in Eqs. (7) and (9) such that \(O(M) \! \approx \! O(L^2 \!+\! nL)\). Therefore, Cheng wrote (see [3, p.300]) that BPFCC works in \(O(L^2 + nL)\) per pattern p for the single-output FCC network, and then claimed that the time complexity for a component of the gradient is O(1), where “gradient” means \(\nabla r_p\), the gradient of \(r_p\), rather than \(\nabla E_p\) in Eq. (1). It should be noted here that his complexity analysis is due to \(O(M) \! \approx \! O(L^2 + nL)\) in Eqs. (7) and (9), and it should not be confused with the conventional wisdom available in the discrete L-stage optimal control literature, where it is widely accepted that the so-called second-order stage-wise Newton methods [11, 14, 35, 38] compute a Newton step in O(L), just as the first-order Kelley–Bryson gradient methods [2, 10, 13, 22] find a steepest-descent step in O(L) stage-wisely. Hereafter, we follow Cheng’s cost analysis using both L and M for analyzing FCC networks of his type shown in Fig. 1a as long as no confusion arises.
Table 1 (2) shows the time complexity of our structure-exploiting Algorithm BPJJ. Just after the forward pass (Line 1 of BPJJ) per pattern p, the BPJJ procedure first obtains only one \(C \times C\) block \(\textbf{y}\textbf{y}^\textsf{T}\) (in Line 2), block of type (a) in Eq. (33), at the cost of \(O(C^2)\), where \(C \! \equiv \! n \!+\! L\), and then starts the backward process, evaluating \(\varvec{\beta }_{l}\) (in Line 3) by Eq. (31), while exploiting the special structure shown in Eqs. (35) and (53) for the other blocks of types (b) and (c). After \(\textbf{B}\) is obtained (in Line 5) by Eq. (31) in \(O(qL^2)\) as mentioned above, the procedure evaluates \(\textbf{B}\textbf{B}^\textsf{T}\), which requires \(q (L \!-\! 1)^2\) operations, and then forms the blocks of types (b) and (c) by Eq. (34) [see Line 6]; the cost for type (b) is \(O(h (M-h))\), where \(h \!=\! M \!-\! qC\), the number of hidden weights, in Eq. (9), and the cost for type (c) is \(O(h^2)\); hence, both (b) and (c) cost O(hM) per pattern p. In addition, the procedure evaluates \(\varvec{\delta }\!=\! \textbf{B}\textbf{r}_p\), which costs O(qL), and then computes \(\nabla E_p\) (see Line 4) at the cost of O(M) by Eq. (17) at end stage L and by Eq. (16) for \(l<L\) [e.g., see Appendix C.4]. On the other hand, as explained in Eq. (4) and in Sect. 2.4, Table 1 (3) shows that BPFCC obtains \(\textbf{G}\!=\! \textbf{U}_p^\textsf{T}\) of size \(q \times M\) by using Eq. (21) with \(\alpha _{l \rightarrow L}\) on each data pattern p (see also line 6 in Algorithm BPFCC in Sect. 2.3); this costs O(qM). After that, Cheng’s FCCNET computes \(\nabla E_p\) by Eq. (23) at the cost O(qM), and forms \(\textbf{U}_p \textbf{U}_p^\textsf{T}\) by rank updates on \(\textbf{J}^\textsf{T} \textbf{J}\) by Eq. (3), which is the dominant cost \(O(q M^2)\). Overall, Table 1 indicates that BPJJ is superior to FCCNET with BPFCC.
Table 1 also shows that BPJJ obtains \(\textbf{B}\) [of \(\varvec{\beta }_{l}\) in Eq. (30)] by Eq. (31) in \(O(qL^2)\) on each data pattern [see Eq. (55)], while node-wise BPFCC evaluates \(\alpha ^{(j)}_{l \rightarrow L}\) at each repetition j, \(j \!=\! 1,..., q\), by Eq. (20) at the same cost; this is due to the special FCC structure of Cheng’s type (see Appendix A.1). Here, two node sensitivities are related by
This suggests that BPFCC may be so modified as to return \(\textbf{G}\!=\! \textbf{B}\), node sensitivities [e.g., see Eq. (54)], rather than \(\textbf{G}\!=\! \textbf{U}^\textsf{T}\), the rows of \(\textbf{J}\), in order to follow Eq. (34); this simplification is probably the easiest way to improve BPFCC.
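In components, with \(\alpha ^{(j)}_{l \rightarrow L}\) as in Eq. (22) and \(\varvec{\beta }_l\) as in Eq. (30), the relation between the two sensitivities amounts to

\[
\beta ^{(j)}_l \;=\; \frac{\partial r_j(p)}{\partial f^l(\textbf{x}; \textbf{w})} \;=\; \alpha ^{(j)}_{l \rightarrow L}, \qquad l = 1,...,L-1, \quad j = 1,...,q.
\]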
4 Numerical Results
In [3, Table 3, p.309], the multiple q-output (\(q>1\)) benchmark problems are investigated by the FCCNET program with BPFCC for various FCC models. Among them, we use the cardiotocography benchmark problem (2,126 data) involving 21 attributes (\(n \!=\! 21\)) and ten classes (\(q \!=\! 10\)) for our numerical test (to be shown in Sect. 4.1 below). After that, we discuss the numerical results available in [3, Table 4, p.309] and [45, Table II, p.1799].
4.1 Numerical Evidence Supporting our Cost Analysis in Table 1
We show the numerical evidence supporting our complexity analysis shown in Table 1 with a five-stage (\(L \!=\! 5\)) 21-input 10-output FCC model having 354 weights (\(M \!=\! 354\)); see Fig. 5b for the sparsity structure of the Gauss-Newton Hessian matrix for optimizing that FCC model. In [3, Table 3], the set of 2,126 (cardiotocography) data is split into training and testing subsets, but we use all 2,126 data in batch mode to measure the execution (CPU) time for evaluating the Gauss-Newton Hessian matrix explicitly on a Windows-10 PC with an i7-6500U CPU (2.6 GHz) and 8 GB of RAM.
Table 2 compares the execution (CPU) time for forming the Gauss-Newton Hessian matrix \(\textbf{J}^\textsf{T} \textbf{J}\) between our BPJJ and Cheng’s FCCNET with BPFCC. The approximate flops ratio (with hidden constants ignored) gives only a rough estimate of the speed difference, suggesting that BPJJ can work about 18 times faster than FCCNET; in practice, it worked roughly six times faster in execution time (averaged over ten trials).
4.2 Comparison Between BPFCC and NBN
Cheng has employed BPFCC for a certain implementation of the Levenberg-Marquardt method [3, p.303], called FCCNET, and then highlighted in [3, Sec. 5.5] the comparison between FCCNET and the NBN algorithm [45, 46] with respect to the total training time in single-output (\(q \!=\! 1\)) problems (although NBN is developed for \(q > 1\)); in [3, Table 4, p.309], FCCNET worked much faster than NBN. However, the total training-time results therein are misleading because the total training time heavily depends on the implementation Footnote 4 of the Levenberg-Marquardt method. Furthermore, both procedures must perform the same rank-updating by Eq. (3), which is much more time-consuming than forming \(\textbf{U}\), as shown in Table 1(c); see also steps (b) and (c) in Eq. (4) for MLP-learning. In other words, the execution time for one Levenberg-Marquardt step should be virtually equivalent between FCCNET and NBN for any single-output (\(q \!=\! 1\)) applications, as clearly described in [45, p.1798] with the complexity analysis of NBN.
Since NBN is originally designed for multiple q-output problems, NBN can work faster than Cheng’s BPFCC for forming \(\textbf{U}^\textsf{T}\), the q rows of \(\textbf{J}\). Wilamowski and Yu compared Algorithms BPMLP and NBN in time complexity for 8-56-56 \((q \!=\! 56)\) MLP-learning; see [45, Table II, p.1799], where BPMLP is called “Hagan-Menhaj computation” due to [19]. Their cost analysis shows that NBN can form \(\textbf{U}\) roughly 17 times faster than BPMLP; needless to say, NBN can also work faster than BPFCC in forming \(\textbf{U}\). The point to note here is that both NBN and FCCNET (with BPFCC) must employ the same most time-consuming rank-updating by Eq. (3) [see also step (c) in Eq. (4)]; therefore, the overall time difference between NBN and FCCNET would be small per iteration. By contrast, our BPJJ is designed to reduce the dominant cost by avoiding the rank updates of Eq. (3) altogether, with no explicit evaluation of \(\textbf{U}\).
4.3 Discussion
If the residual Jacobian matrix \(\textbf{J}\) of size \(m \times M\) in Eq. (2) can be stored explicitly, then QR factorization can apply to \(\textbf{J}\) directly. Alternatively, one can avoid storing \(\textbf{J}\) owing to memory concerns by forming \(\textbf{J}^\textsf{T} \textbf{J}\!=\! \sum _p \textbf{U}_p \textbf{U}_p^\textsf{T}\) by Eq. (3); then, the (modified) Cholesky factorization can apply to \(\textbf{J}^\textsf{T} \textbf{J}\). In both cases, the regularization parameter \(\mu \) may be introduced for implementing the Levenberg-Marquardt method. In any event, the latter Cholesky approach works about twice as fast as the former QR approach (e.g., see [5]). In nonlinear least squares learning, the speed per epoch and memory saving are primary concerns; therefore, the (modified) Cholesky approach is much more popular [1, 19, 24, 25, 40, 46].
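As a minimal illustration of the Cholesky route, the sketch below computes one damped (Levenberg-Marquardt) trial step from an already-formed \(\textbf{J}^\textsf{T} \textbf{J}\) and gradient; a dense factorization is shown for brevity, whereas Sect. 3.1 advocates factoring the block-arrow form block by block.

```python
import numpy as np

def lm_trial_step(JTJ, grad, mu):
    """One Levenberg-Marquardt trial step (a sketch): solve
    (J^T J + mu*I) s = -grad by Cholesky factorization; mu > 0 damps the step."""
    A = JTJ + mu * np.eye(JTJ.shape[0])            # damped Gauss-Newton matrix (SPD)
    Lc = np.linalg.cholesky(A)                     # A = Lc Lc^T
    s = np.linalg.solve(Lc.T, np.linalg.solve(Lc, -grad))
    return s
```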
Algorithm BPJJ in Sect. 3.3 efficiently evaluates \(\textbf{J}^\textsf{T} \textbf{J}\!=\! \sum _p \textbf{U}_p \textbf{U}_p^\textsf{T}\) in a block-arrow matrix form (e.g., see Fig. 5) without evaluating \(\textbf{U}_p\) explicitly, for which only dense blocks need to be formed to save memory. As stated in Sect. 3.1, once the dense blocks of the desired block-arrow matrix are obtained, it is easy to exploit the matrix sparsity (e.g., see Fig. 5) for an efficient implementation of any subsequent optimization procedures; e.g., see block-arrow least squares [29] with (modified) block Cholesky or QR factorization for trust-region Levenberg-Marquardt methods [6,7,8, 26] (see also [15, 21, 34, 39] for separable nonlinear regression by variable projection methods). This is an important additional benefit beyond the result of Table 1, one that commonly arises in multi-output applications such as those in [3, 9, 20, 45]; the application in [23] is of future interest to us for constructing an FCC network with a set of cascaded hidden nodes representing a chain of joints of a robotic manipulator.
5 Conclusion
We have developed a new stage-wise BP procedure (Algorithm BPJJ in Sect. 3.3) that exploits the special structure of multi-output fully-connected cascade (FCC) networks (see Fig. 2) for nonlinear least squares learning. Our stage-wise BPJJ is designed to reduce the dominant cost of rank-update operations by Eq. (3) without evaluating explicitly any rows of the residual Jacobian matrix on each data pattern, working faster than Cheng’s node-wise BPFCC [3, p.300] (as well as NBN [45] and widely-employed BPMLP in Sect. 1).
We hope that the various BP aspects (e.g., stage-wise versus node-wise BP) discussed in this paper will prove useful for further developments of efficient BP procedures for (deep) neural network learning.
Notes
For \(q>1\), the rank-q update is the Level-3 BLAS operation [5], working faster than q rank-1 updates, although the total counts of arithmetic operations are the same between them.
Precisely speaking, \(f^l(\textbf{x}; \textbf{w})\), the net input to the node at layer l, in Eq. (8) should be denoted by \(f^l(\textbf{x}; \textbf{w}_1,...,\textbf{w}_l)\) since the weight vectors \(\textbf{w}_i\), \(i>l\) (at subsequent stages i after l), have no influence on the net input to the node at stage l due to the feedforward structure of FCC network learning.
In a general MLP, let \(m_l\) be the number of hidden nodes at stage l; then, \(\textbf{x}_l\) becomes an \(m_l\)-vector; accordingly, \(\frac{\partial \textbf{x}_{l+1}}{\partial \textbf{x}_l}\), the intermediate quantity at stage l, becomes a matrix of size \(m_l \! \times \! m_{l+1}\); e.g., see [31].
References
Beale MH, Hagan MT, Demuth HB (2014) “calcjejj.m,” a script file in the Matlab Neural Network Toolbox, The MathWorks, Inc., Version 8.2
Bryson AE (1961) A gradient method for optimizing multi-stage allocation processes. In: Proceedings of Harvard University symposium on digital computers and their applications, pp 125–135
Cheng Y (2017) Backpropagation for fully connected cascade networks. Neural Process Lett 46:293–311
Conn AR, Gould NIM, Toint PL (2000) Trust-Region Methods, SIAM
Demmel JW (1997) Applied numerical linear algebra. SIAM
Dennis JE, Gay DM, Welsch RE (1981) An adaptive nonlinear least-squares algorithm. ACM Trans Math Softw 7(3):348–368
Dennis JE, Gay DM, Welsch RE (1981) Algorithm 573; NL2SOL: an adaptive nonlinear least-squares algorithm. ACM Trans Math Softw 7(3):369–383
Dennis JE, Schnabel RB (1983) Numerical methods for unconstrained optimization and nonlinear equations. Prentice Hall, New Jersey, Chapter 10, pp 218–238
Deshpande G, Wang P, Rangaprakash D, Wilamowski B (2015) Fully connected cascade artificial neural network architecture for attention deficit hyperactivity disorder classification from functional magnetic resonance imaging data. IEEE Trans Cybern 45(12):2668–2679
Dreyfus SE (1962) The numerical solution of variational problems. J Math Anal Appl 5(1):30–45
Dreyfus SE (1966) The numerical solution of non-linear optimal control problems. In: Greenspan D(ed) Numerical solutions of nonlinear differential equations: proceedings of an advanced symposium, John Wiley & Sons Inc, pp 97–113
Dreyfus SE (1973) The computational solution of optimal control problems with time lag. IEEE Trans Autom Control 18(4):383–385
Dreyfus SE (1990) Artificial neural networks, back propagation, and the Kelley–Bryson gradient procedure. J Guidance Control Dyn 13(5):926–928
Dunn J, Bertsekas DP (1989) Efficient dynamic programming implementations of Newton’s method for unconstrained optimal control problems. J Optim Theory Appl 63(1):23–38
Golub GH, Pereyra V (1973) The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate. SIAM J Numer Anal 10:413–432
Golub GH, Ortega JM (1992) Scientific computing and differential Equations. Academic Press, pp 283–291
Goodfellow I, Bengio Y, Courville A (2016) Deep Learning. MIT Press, Cambridge
Griewank A (2000) Evaluating derivatives. SIAM
Hagan MT, Menhaj MB (1994) Training feedforward networks with the Marquardt algorithm. IEEE Trans Neural Netw 5(6):989–993
Hussain S, Mokhtar M, Howe JM (2015) Sensor failure detection, identification, and accommodation using fully connected cascade neural network. IEEE Trans Ind Electron 62(3):1683–1692
Kaufman L, Sylvester G (1992) Separable nonlinear least squares with multiple right-hand sides. SIAM J Mat Anal Appl 13(1):68–89
Kelley HJ (1960) Gradient theory of optimal flight paths. Am Rocket Soc J 30(10):941–954
Kubota N, Arakawa T, Fukuda T (1998) Trajectory planning and learning of a redundant manipulator with structured intelligence. J Braz Comput Soc 4(3):14–26
LeCun Y, Bottou L, Orr GB, Muller K-R (2012) Efficient Back Prop. In: Montavon et al (eds) Neural networks: tricks of the trade. 2nd edn, Springer-Verlag, Berlin Heidelberg, p 36
Masters T (1995) Advanced algorithms for neural networks: a C++ source book. John Wiley & Sons, New York, pp 68–69
Mizutani E (1999) Powell’s dogleg trust-region steps with the quasi-Newton augmented Hessian for neural nonlinear least-squares learning. In: Proceedings of the IEEE international conference on neural networks, vol 2, pp 1239-1244, Washington, DC
Mizutani E, Dreyfus S, Nishio K (2000) On derivation of MLP backpropagation from the Kelley-Bryson optimal-control gradient formula and its application. In: Proceedings of the IEEE international conference on neural networks, vol 2, Como, Italy, pp 167–172
Mizutani E, Dreyfus S (2001) On complexity analysis of supervised MLP-learning for algorithmic comparisons. In: Proceedings of the INNS-IEEE international joint conference on neural networks, vol 1, pp 347–352, Wasington DC, USA
Mizutani E, Demmel JW (2003) On structure-exploiting trust-region regularized nonlinear least squares algorithms for neural network learning. Neural Netw 16(5–6):745–753
Mizutani E (2005) On computing the Gauss-Newton Hessian matrix for neural-network learning. In: Proceedings of the 12th international conference on neural information processing (ICONIP 2005), pp 43–48, Taipei, TAIWAN, Oct 30 to Nov 2
Mizutani E, Dreyfus SE, Demmel JW (2005) Second-order backpropagation algorithms for a stagewise-partitioned separable Hessian matrix. In: Proceedings of the INNS-IEEE international joint conference on neural networks, vol 2, pp 1027–1032, Montreal Quebec, CANADA, July 31 to August 4
Mizutani E, Fan, JYC (2007) On exploiting symmetry for multilayer perceptron learning. In: Proceedings of the IEEE international joint conference on neural networks (IJCNN 2007), pp 2857–2862, Orlando FL, USA, August
Mizutani E, Dreyfus SE (2008) Second-order stagewise backpropagation for Hessian-matrix analyses and investigation of negative curvature. Neural Netw 21:193–203
Mizutani E, Demmel JW (2011) On improving trust-region variable projection algorithms for separable nonlinear least squares learning. In: Proceedings of the 2011 IEEE international joint conference on neural networks, pp 397–404, San Jose, California, USA, July 31 to August 5
Mizutani E (2020) A proof on equivalence of stagewise Newton and Dreyfus’s successive approximation procedures. IEEE Trans Autom Control 65(6):2716–2723
Nabney I (2002) NETLAB: algorithms for Pattern Recognition. Springer (see additional information at http://www.ncrg.aston.ac.uk/netlab/)
Nocedal J, Wright SJ (2006) Numerical optimization, 2nd edn. Springer Verlag, Berlin, p 10
DE Pantoja JFA (1988) Differential dynamic programming and Newton’s method. Int J Control 47(5):1539–1553
Pereyra V, Scherer G, Wong F (2006) Variable projections neural network training. Math Comput Simul 73:231–243
Rojas R (1996) Neural networks–a systematic introduction, vol 7. Springer-Verlag, Berlin
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning internal representations by error propagation. In: Rumelhart DE, McClelland JL (eds) Parallel distributed processing, vol 1. MIT Press, Cambridge, MA, pp 318–362
Saad Y (2003) Iterative methods for sparse linear systems. SIAM, pp 80–81
Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117
Wilamowski BM (2009) Neural network architectures and learning algorithms. IEEE Ind Electron Mag 3(4):55–63
Wilamowski BM, Yu H (2010) Neural network learning without backpropagation. IEEE Trans Neural Netw 21(11):1793–1803
Wilamowski BM, Yu H (2010) Improved computation for Levenberg–Marquardt training. IEEE Trans Neural Netw 21(6):930–937
Wilamowski BM, Yu H, Cotton N (2011) NBN Algorithm. In: Wilamowski BM, Irwin JD (eds) Intelligent systems, the industrial electronics handbook, Chapter 13. CRC Press, Boca Raton
Acknowledgements
Eiji Mizutani would like to thank Stuart Dreyfus (UC Berkeley) for numerous fruitful discussions on optimal-control gradient procedures and his work in [10,11,12,13]. A special thanks goes to Jing-Yun Carey Fan and Mochamad Nizar Palefi Ma’ady for their assistance. The work is partially supported by the Ministry of Science and Technology, Taiwan (grant: 108-2221-E-011-098-MY3).
Appendices
Appendix A The Residual Jacobian Matrix for MLP-Learning and Its Evaluation
Using MLP-learning, we first show the structure of the block-angular matrix \(\textbf{U}\) in Eq. (27) for the Gauss-Newton Hessian matrix \(\textbf{J}^\textsf{T} \textbf{J}\!=\! \sum _p \textbf{U}_p \textbf{U}_p^\textsf{T}\) in Eq. (2). We then show that Algorithm BPFCC (in Sect. 2.3) becomes identical to step (b) of Algorithm BPMLP (see Sect. 1) in MLP-learning.
Appendix A.1 Block-Angular Matrix \(\textbf{U}\) in Eq. (27)
Suppose that all the 24 jump (or skip-layer) connections are removed from the five-stage (\(L \!=\! 5\)) three-output (\(q \!=\! 3\)) FCC network (with 39 weights) depicted in Fig. 2; then, it reduces down to a skinny five-stage 2-1-1-1-1-3 MLP shown in Fig. 6 involving only 15 weights (\(M \!=\! 15\)). Those \(15~\!(=\!M)\) weights, collectively denoted by a vector \(\textbf{w}\) below, are clearly shown in Fig. 6:
In Fig. 6, four hidden-node outputs are denoted by \(\phi (x_l)\) at intermediate stages l, \(l \!=\! 1,...,4~\!(=\! L \!-\! 1)\), with \(x_l\), the net input to the node at stage l, and three linear outputs are denoted by \(y_k\), \(k \!=\! 1,...,3~\!(=\! q)\), at terminal stage \(5~\!(=\! L)\), where the residual \(r_k \!=\! y_k \!-\! t_k\), \(k \!=\! 1,...,3~\!(=\! q)\), is evaluated with given teacher signal \(t_k\). Let \(\phi _l ' (\cdot ) \! \equiv \! \phi '(x_l)\); then, the \(15~\!(=\! M) \times 3~\!(=\! q)\) matrix \(\textbf{U}_p\) per pattern p in Eq. (27) obtainable by any BP procedure on the skinny MLP in Fig. 6 may be expressed as below with the symbols shown in Fig. 6:
Appendix A.2 BPMLP and BPFCC Algorithms for MLP-Learning
In MLP-learning, Algorithm BPFCC works in the same way as step (b) of Algorithm BPMLP, starting with \(\alpha _{L \rightarrow L} \!=\! 1\). For \(q \!=\! 3\) in Fig. 6, BPFCC repeats the backward pass three times. The progress of those three repeats, shown below with \(\alpha ^{{{(k)}}}_{l \rightarrow 5}\) of the kth repeat, agrees with simplifying Eqs. (45), (46), and (47):
yielding \(\textbf{U}_p\) below [see \(\textbf{U}\) in Eq. (27), where \(q \!=\! 3\); \(C \!=\! 2\); and \(h \!=\! 9\)], and BPFCC returns \(\textbf{G}\!=\! \textbf{U}_p^\textsf{T}\).
On the other hand, the stage-wise BP employs the recursive formula (31) simplified (for an MLP in Fig. 6) as \(\varvec{\beta }_l \!=\!\big (\frac{\partial x_{l+1}}{\partial x_l}\big ) \varvec{\beta }_{l+1}\), where \(\frac{\partial x_{l+1}}{\partial x_l} \!=\! \phi _l'(\cdot ) u_l\), \(l \!=\! 1,...,3~\!(=\!q)\), starting with \(\varvec{\beta }_4 \!=\! \phi _4'(\cdot ) [p_1, p_2, p_3]^\textsf{T}\), and ending up with a \(4 \times 3\) matrix \(\textbf{B}\), which gives the same \(\alpha \)-quantities due to the relation (36). Here, stage-wise BP evaluates each intermediate quantity Footnote 5, \(\partial x_{l+1}/\partial x_l \!=\! \phi _l'(\cdot ) u_l\), only once at each stage (hence, memory efficient as well) rather than evaluating it repeatedly, q times, for each column of \(\textbf{U}_p\) corresponding to each output node (note how such intermediate quantities repeat in \(\textbf{U}_p\) above): this is what Wilamowski and Yu [45] have identified as the computational overhead incurred by step (b) of BPMLP, and they proposed the forward-only NBN method to avoid such computational redundancy (see [45, 47] and Appendix B).
We leave this section with several remarks on (extended) stage-wise BP. The above computational redundancy can be avoided by standard (stage-wise) BP: For standard MLP-learning, one may simplify the second-order stage-wise BP [31, 33] (that evaluates the exact Hessian matrix) by using the recursive formula of \(\textbf{Z}^{s}\) in [31, Eq. (13), p.1029] with the second-derivative term omitted; then, stage-wise BP can compute \(\textbf{U}_p \textbf{U}_p^\textsf{T}\) directly with no \(\textbf{U}_p\) required explicitly so as to avoid Eq. (3) for rank-updating operations (see [30]), just as it evaluates the gradient vector \(\nabla E_p\) using node sensitivities \(\delta \) at each stage [e.g., see Eq. (16)] rather than by Eq. (23) as \(\nabla E_p \!=\! \textbf{U}_p \textbf{r}_p\). Hence, stage-wise BP requires no \(\textbf{U}_p\) explicitly. For general FCC-learning, stage-wise BP (for both the gradient vector and exact Hessian matrix) can be derived from a gradient procedure attacking a general history-dependent optimal-control problem [12].
Appendix B Neuron-by-Neuron (NBN) Algorithm of Wilamowski and Yu
For comparison purposes, we briefly describe the essence of a forward-only version of the neuron-by-neuron (NBN) algorithm, developed by Wilamowski and Yu [45, 47].
Given a multiple q-output network, assume that the output of neuron j at any layer (except at the nominal input layer) is given by \(y_j \!=\! \phi (x_j)\), where \(\phi (.)\) is a differentiable nonlinear activation function and \(x_j\) is the net input to neuron j. Then, the fundamental idea of NBN is to evaluate the quantity \(\delta _{k,j} \! \equiv \!\frac{\partial y_k}{\partial x_j}\), called the signal gain between neurons j and k for \(k \ge j\); by definition, \(\delta _{k,k} \! = \! \phi '(x_k)\), called the slope of the activation function \(\phi \) of neuron k [45, p.1796]. In three-output (\(q \!=\! 3\)) five-layered 2-1-1-1-1-3 MLP-learning (see Fig. 6), for instance, \(\textbf{U}_p\) (per pattern p) of size \(15 \times 3\) is explicitly displayed in Appendix A.1; notably, it is not hard to identify common terms, which can be stored as \(\delta _{k,j}\) so as to avoid their later re-evaluation. The key is to avoid repeating backward passes q times (see lines 3 to 6 in Algorithm BPMLP shown in Sect. 1). To this end, at each neuron j, the forward-only NBN algorithm evaluates the net input \(x_j\), output \(y_j \!=\! \phi (x_j)\), and slope \(\delta _{j,j} \!=\! \phi '(x_j)\), and then computes \(\delta _{k,j}\), \(k \!> \!j\), by the following two-step recursive formulas (see [45, Eqs. (25),(26)] and [47, Eqs. (13.24),(13.25)]) during the forward-pass process:
\[
\frac{\partial x_k}{\partial x_j} \;=\; \sum_{i=j}^{k-1} \theta_{i,k}\, \delta_{i,j}, \qquad \delta_{k,j} \;=\; \delta_{k,k}\, \frac{\partial x_k}{\partial x_j},
\]
where \(\theta _{i,k}\) denotes the weight between neurons i and k.
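Purely as an illustration of the forward-only idea (not the authors' code), the sketch below computes the signal gains \(\delta _{k,j}\) of a small cascade of four tanh neurons with hypothetical weights during a single forward pass, using the chain-rule recursion displayed above.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_neu = 2, 4                     # two external inputs, four cascaded tanh neurons
x_ext = np.array([0.5, -1.0])

# Hypothetical weights: theta[(i, k)] from neuron i to neuron k (i < k); w_in[k] holds
# the bias and external-input weights of neuron k.
theta = {(i, k): rng.standard_normal() for k in range(n_neu) for i in range(k)}
w_in = rng.standard_normal((n_neu, 1 + n_in))

x = np.zeros(n_neu)                    # net inputs
y = np.zeros(n_neu)                    # outputs y_k = tanh(x_k)
delta = np.zeros((n_neu, n_neu))       # delta[k, j] = d y_k / d x_j (signal gain), k >= j

for k in range(n_neu):                 # a single forward pass, neuron by neuron
    x[k] = w_in[k] @ np.concatenate(([1.0], x_ext)) + sum(theta[i, k] * y[i] for i in range(k))
    y[k] = np.tanh(x[k])
    delta[k, k] = 1.0 - y[k] ** 2      # slope phi'(x_k)
    for j in range(k):                 # gains to every earlier neuron j, via the chain rule
        delta[k, j] = delta[k, k] * sum(theta[i, k] * delta[i, j] for i in range(j, k))
```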
Appendix C Illustrations of FCC-Learning Processes in Fig. 2
We describe the learning processes of a multiple-output (\(q \!>\! 1\)) FCC model of Cheng’s type, using a five-stage (\(L \!=\! 5\)) two-input (\(n \!=\! 2\)) three-output (\(q \!=\! 3\)) FCC network depicted in Fig. 2.
Appendix C.1 FCC Structure and Forward Propagation
By Eq. (9), there are 39 weights in total (\(M \!=\! 39\)) in Fig. 2, including \(21 ~\![=\!qC \!=\! q(n \!+\! L)]\) linear weights directly connected to the three linear output nodes at terminal layer 5, so that Eq. (10) gives \(\textbf{w}\) of length 39:
where 18 hidden (nonlinear) weights \(\textbf{w}_l\), \(l \!=\! 1,...,4~\!(\!=\!L \!-\! 1)\), are defined in Eq. (6), and the other 21 linear terminal weights split into three sets, \(\textbf{w}^{(j)}_5\), \(j \!=\! 1,..., q~\!(=\! 3)\), and each \(\textbf{w}^{(j)}_5\) of length \(C \! \equiv \! n \!+\! L \!=\! 7\) is given by Eq. (11) as below (according to [3, Sec. 2.2]):
In Fig. 2, those three sets of 21 linear terminal weights in Eq. (39), \(\lambda ^{(1)}_{5,i}, \lambda ^{(2)}_{5,i},\) and \(\lambda ^{(3)}_{5,i}\) (\(i \!=\! 0,1,...,6\)), are denoted by \(\lambda _{5,i}\), \(\theta _{5,i}\), and \(\rho _{5,i}\), respectively, for display purposes; that is,
Since \(L \!=\! 5\), let \(\varvec{\varTheta }_5\) denote the terminal weight matrix of size \(q \times (n \!+\! L) \!=\! 3 \times 7\); then,
By forward pass, the hidden-node outputs are produced by Eq. (8) for \(l \!=\! 1,...,4 ~\!(\!=\! L \!-\! 1)\). At terminal stage \(5~\!(=\! L)\), the final three linear outputs \(F_j(\textbf{x};\textbf{w})\), \(j \!=\! 1,...,3 ~\!(=\!q)\), are generated by Eq. (12) as shown next with \(\varvec{\varTheta }_5\) in Eq. (41):
where \( \varvec{\varTheta }_{\!{{L}}}~\!(=\!\varvec{\varTheta }_5)\) is a \(q \times C ~\!(= \! 3 \times 7)\) matrix of 21 linear weights, and \(\textbf{y}\) is a vector of \(C~\!(=\!7)\) node outputs comprising the unit-constant output of node 0, the \(n ~\!(=\!2)\) inputs, and the \(L\!-\! 1~\!(=\!4)\) hidden-node outputs, each \(f^r(\textbf{x}; \textbf{w})\), \(0\!<\!r\!<\!5~\!(=\!L)\), being evaluated by Eq. (8), as explained above.
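As a rough illustration (not the paper's implementation), the following sketch builds the same 2-input, 3-output, five-stage FCC structure with hypothetical tanh hidden nodes and random weights, and performs the forward pass just described; the names w_hidden and Theta5 are ours.

```python
import numpy as np

rng = np.random.default_rng(2)
n, L, q = 2, 5, 3
C = n + L                                    # 7 = unit-constant node + 2 inputs + 4 hidden outputs

x_in = np.array([0.2, -0.4])
# Hypothetical weights: hidden node l (l = 1,...,4) sees the bias, the n inputs, and the
# l-1 previous hidden-node outputs, i.e. 3, 4, 5, 6 weights (h = 18 in total);
# Theta5 is the 3 x 7 terminal weight matrix (qC = 21 linear weights).
w_hidden = [rng.standard_normal(1 + n + (l - 1)) for l in range(1, L)]
Theta5 = rng.standard_normal((q, C))

y = [1.0, x_in[0], x_in[1]]                  # node-output list: node 0 (unit constant) and the inputs
for l in range(1, L):                        # forward pass through the four hidden stages
    f_l = w_hidden[l - 1] @ np.array(y)      # net input f^l(x; w) of hidden node l
    y.append(np.tanh(f_l))                   # hidden-node output phi(f^l)
y = np.array(y)                              # the node-output vector y of length C = 7
F = Theta5 @ y                               # the three linear outputs F_j(x; w)
```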
Appendix C.2 Stage-Wise Backpropagation (BP) for the Gradient Vector
We now demonstrate stage-wise BP on the three-output (\(q \!=\! 3\)) five-stage (\(L \!=\! 5\)) FCC network in Fig. 2: using the weights \(\lambda _{5,i}\), \(\theta _{5,i}\), and \(\rho _{5,i}\) in Eq. (40), the backward-pass equation (15) of \(\delta _l\) is given by
where \(r_j \! = \! \xi ^{(j)}_L \!=\! \delta ^{(j)}_L\), \(j \!=\! 1,2,3\), is the residual evaluated at terminal node j on each data pattern. Figure 7 illustrates the backward flow of node sensitivities down to each stage, starting from the terminal stage 5; we show below how node-output sensitivities \(\delta _l\) in Eq. (13) are back-propagated in a stage-wise manner by Eq. (43) with \(\phi _l ' (.) \! \equiv \! \phi ' ({{{f^l(\textbf{x};\textbf{w})}}})\):
Then, the stage-wise BP evaluates the gradient \(\nabla E_p\) by Eqs. (16) and (17).
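Continuing the forward-pass sketch in Appendix C.1 above (same hypothetical network; the targets below are also hypothetical), the snippet sweeps the node sensitivities \(\delta _l\) backward through the cascade in the spirit of Eq. (43) and then assembles the gradient pieces in the spirit of Eqs. (16) and (17).

```python
# Continuing the forward-pass sketch in Appendix C.1 (hypothetical weights and targets):
# stage-wise BP for the gradient of E_p on one data pattern.
t = np.array([0.0, 1.0, -1.0])               # hypothetical targets t_j(p)
r = F - t                                    # residual vector r_p of length q = 3

# Node sensitivities delta_l, swept backward over the hidden stages l = 4,...,1: each hidden
# node l collects contributions from the three terminal nodes and from all later hidden nodes.
delta = np.zeros(L)                          # delta[l] for hidden node l (index 0 unused)
for l in range(L - 1, 0, -1):
    back = Theta5[:, 2 + l] @ r              # through the terminal weights attached to node l
    back += sum(w_hidden[k - 1][2 + l] * delta[k] for k in range(l + 1, L))
    delta[l] = (1.0 - y[2 + l] ** 2) * back  # times the slope phi'(f^l) of tanh

# Gradient pieces: one q x C block for the terminal weights and one block per hidden stage.
grad_Theta5 = np.outer(r, y)                                          # d E_p / d Theta5
grad_hidden = [delta[l] * y[: 1 + n + (l - 1)] for l in range(1, L)]  # d E_p / d w_l
```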
Appendix C.3 Demonstration of Algorithm BPFCC Shown in Sect. 2.3
We now apply Algorithm BPFCC to the FCC model displayed in Fig. 2. In comparison with the stage-wise BP process in Fig. 7, we show in Fig. 8 how \(\alpha _{l \rightarrow L}\) in Eq. (22) is back-propagated by BPFCC in a node-wise manner. Since \(q \!=\! 3\), the backward pass must be repeated three times. The first backward pass begins with output node 1 at stage \(5~(=\!L)\) [see Fig. 8 (left-most column)] with \(\phi _l ' (\cdot ) \! \equiv \! \phi '({{{f^l(\textbf{x};\textbf{w})}}})\), as shown below:
Then, \(\nabla r_1(p)\) is obtained by Eq. (21) and stored into \(\textbf{G}\). The second pass starts at output node 2 [Fig. 8 (middle column)] with \(\lambda _{5,i} \!=\! \lambda ^{(2)}_{5,i} \!=\! \theta _{5,i}, i \!=\! 0,...,6 ~\!(=\! n \!+\! L \!-\! 1)\) by Eq. (19):
Then, \(\nabla r_2(p)\) is evaluated by Eq. (21) and stored into \(\textbf{G}\). Finally, the third backward pass begins with output node 3 [see Fig. 8 (right-most column)] with \(\phi _l ' (\cdot ) \! \equiv \! \phi '({{{f^l(\textbf{x};\textbf{w})}}})\) and \(\lambda _{5,i} \!=\! \lambda ^{(3)}_{5,i} \!=\! \rho _{5,i}, i \!=\! 0,...,6 ~\!(=\! n \!+\! L \!-\! 1)\) by Eq. (19):
Then, \(\nabla r_3(p)\) is evaluated by Eq. (21) and stored into \(\textbf{G}\). After the three backward-pass repetitions, BPFCC returns \(\textbf{G}\!=\! \textbf{U}_p^\textsf{T}\), where \(\textbf{U}_p\) is of size \(M \times q\) with \(M \!=\! 39\) and \(q \!=\! 3\). After that, the gradient vector \(\nabla E_p\) is computed by Eq. (23) as \(\nabla E_p(\textbf{w}) = \textbf{U}_p \textbf{r}_p\).
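For comparison, the following sketch mimics BPFCC's node-wise repetition on the same hypothetical network: for each output j it back-propagates the \(\alpha \)-quantities separately and stores the resulting Jacobian row \(\nabla r_j(p)\) as a column of G; the weight ordering below (terminal blocks first, then hidden stages) is one plausible choice of ours, not necessarily that of Eq. (24).

```python
# A sketch of BPFCC's node-wise backward passes on the same hypothetical network:
# one pass per output j, each producing a column grad r_j(p) stored into G.
M = q * C + sum(len(w) for w in w_hidden)    # 39 weights in total
G = np.zeros((M, q))

for j in range(q):                           # the backward pass is repeated q = 3 times
    alpha = np.zeros(L)                      # alpha_l = d F_j / d f^l, back-propagated node-wise
    for l in range(L - 1, 0, -1):
        back = Theta5[j, 2 + l] + sum(w_hidden[k - 1][2 + l] * alpha[k] for k in range(l + 1, L))
        alpha[l] = (1.0 - y[2 + l] ** 2) * back
    col = []                                 # assemble grad r_j(p) of length M = 39
    for jj in range(q):                      # terminal weights of output jj: y if jj == j, else 0
        col.extend(y if jj == j else np.zeros(C))
    for l in range(1, L):                    # hidden weights of stage l: alpha_l times node-l inputs
        col.extend(alpha[l] * y[: 1 + n + (l - 1)])
    G[:, j] = col

grad_Ep = G @ r                              # grad E_p = U_p r_p, with G playing the role of U_p
```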
Appendix C.4 Application of Algorithm BPJJ in Sect. 3.3
In the FCC model shown in Fig. 2, we have \(n \!=\! 2\) (inputs); \(q \!=\! 3\) (outputs); \(M \!=\! 39\) (weights); \(C \!=\! n \!+\! L \!=\! 7\); and \(h \!=\! M \!-\! qC \!=\! 18\). Using this model, we explain the behavior of Algorithm BPJJ in Sect. 3.3.
First, by forward pass, we evaluate the node outputs by Eq. (8) for the hidden-node outputs at each stage l, \(l<L ~\!(=\! 5)\), and then by Eq. (42) for the terminal linear outputs. As a result, the node-output vector \(\textbf{y}~\!(=\!\textbf{y}_{L-1})\) in Eq. (29) becomes the length-\(7~\!(=\!C)\) vector shown in Eq. (42), which is re-displayed below with the simplified notation \(\phi _k \! \equiv \! \phi \left( f^k(\textbf{x}; \textbf{w}) \right) \) for the hidden-node output at stage k, \(k \!=\! 1,...,4 ~\!(=\!L \!-\! 1)\):
Next, by backward pass, we extract \(\textbf{U}\) and then construct \(\textbf{U}\textbf{U}^\textsf{T}\) in the desired block-arrow matrix form on each data pattern. To this end, we use Eq. (24) [rather than Eq. (38)] as the definition of the weight vector \(\textbf{w}\) of length \(M \!=\! qC \!+\! h\) in Eq. (9), shown below:
where each \(\textbf{w}^{(j)}_5\) (\(j \!=\! 1,2,3\)) in Eq. (39) is of length \(C \!=\! n \!+\! L \!=\! 7\), as shown in Eq. (42), and the remaining four hidden weight vectors, \(\textbf{w}_1,..., \textbf{w}_{4}\), are defined in Eq. (6). Then, on any data pattern p, three (\(q \!=\! 3\)) rows of \(\textbf{J}\) in Eqs. (2) and (3) are given by Eq. (26) as \(\textbf{U}^\textsf{T}\) below (with subscript p omitted) in node-wise partitions:
which can be stage-wisely partitioned as below with \(\textbf{v}_{l}^{(j)}\) in Eq. (32) and \(\textbf{y}\) in Eq. (29):
where the node-output vector \(\textbf{y}\) of length \(7~\!(=\!C)\) is shown in Eqs. (42) and (48). The above \(\textbf{U}\) in Eq. (50) leads to the desired block-arrow matrix \(\textbf{U}\textbf{U}^\textsf{T}\) of size \(39 \times 39\) below with the comprehensive structure of stage-wisely partitioned blocks:
where the size of each partitioned block is clearly displayed in Fig. 5a. It should be noted here that any (approximate) Hessian matrix can be of the desired block-arrow matrix form, depending only on the structure of a given FCC network; just for comparison, Fig. 5b shows another block-arrow matrix associated with a 21-input 10-output five-stage FCC network that may arise in attacking the cardiotocography benchmark problem (see [3, Table 2, p.309]) available in the UCI machine learning repository.
For the specific structure of \(\textbf{U}\textbf{U}^\textsf{T}\) shown in Eq. (51), we can confirm those three types of partitioned blocks explained in Eq. (33):
where, for the type-(b) partitioned blocks above, we only need to evaluate \(\frac{(L \!-\! 1)L}{2}\) blocks in the triangular half of the \(h \times h\) submatrix by exploiting symmetry. The key to efficiency is the property of the stage-wisely partitioned blocks in Eq. (34). For example, when we evaluate a type-(b) block \(\sum ^3_{k=1} \textbf{v}_1^{(k)}\textbf{v}_4^{(k)\mathsf T} \!=\! \varvec{\beta }_1^\textsf{T} \varvec{\beta }_4 \textbf{y}_0 \textbf{y}_3^\textsf{T}\) and a type-(c) block \(\textbf{v}_3^{(j)}\textbf{y}^\textsf{T} \!=\! \beta _3^{(j)} \textbf{y}_2 \textbf{y}^\textsf{T}\) for any j, we can exploit the special property that any outer product of the form \(\textbf{y}_{j} \textbf{y}_{k}^\textsf{T}\) appearing in such a block is just a submatrix of the \(C \times C\) rank-one matrix \(\textbf{y}\textbf{y}^\textsf{T}\) in type (a) of Eq. (33); that is, by Eq. (35), we immediately obtain \(\textbf{y}_0 \textbf{y}_3^\textsf{T}\) and \(\textbf{y}_2 \textbf{y}^\textsf{T}\) below from \(\textbf{y}\textbf{y}^\textsf{T}\), where \(\textbf{y}~\!(=\!\textbf{y}_{4})\) of length \(C \! \equiv \! n \!+\! L \!=\! 7\) is given in Eq. (48):
In this way, owing to Eq. (35), we can skip the outer-product evaluation of \(\textbf{y}_j \textbf{y}_k^\textsf{T}\) for any \(0 \le j \le k < L~\!(=\!5)\) in the blocks of types (b) and (c). Furthermore, the scalar quantities in Eq. (34) can be evaluated once the \((L \!-\! 1) \times q\) matrix \(\textbf{B}\) below is obtained, which consists of the q-length vectors \(\varvec{\beta }_l\) in Eq. (30), \(l \!=\! 1,..., L \!-\! 1\), with \(\varvec{\beta }_l\) placed at row l counted from the bottom:
Here, the (i, j)-entry of \(\textbf{B}\) is \(\beta ^{(j)}_{L-i}\), the scalar quantity for Eq. (34)(c), and the (i, j)-entry of \(\textbf{B}\textbf{B}^\textsf{T}\) of size \((L\!-\!1)\!\times \! (L\!-\!1)\) is given by \(\varvec{\beta }_{L-i}^\textsf{T} \varvec{\beta }_{L-j} \!=\! \sum ^q_{k=1} \beta ^{(k)}_{L-i} \beta ^{(k)}_{L-j}\), the scalar quantity for Eq. (34)(b). The matrix \(\textbf{B}\) in Eq. (54) is easily obtained by Eq. (31), i.e., by evaluating \(\varvec{\beta }_{l}\) in a stage-wise manner at each stage l, \(l \!=\! L \!-\!1,...,1\), as shown below, together with the evaluation of the scalar \(\delta _l\) in Eq. (13) for the gradient \(\nabla E\) using \(\textbf{r}_p \! \equiv \! \left[ r_1, r_2, r_3 \right] ^\textsf{T}\), the residual vector of length \(q~\!(=\!3)\) on any data pattern p [see Eq. (2)]:
When \(\textbf{B}\) is formed explicitly, as shown in Eq. (54), \(\textbf{B}\textbf{r}_p\) yields \(\varvec{\delta }\! \equiv \![ \delta _4, \delta _3, \delta _2, \delta _1]^\textsf{T}\), which should be compared with Eq. (44). Then, \(\nabla E_p\), the gradient of \(E_p\) per pattern p, is simply obtained by Eq. (17) at end stage L, and by Eq. (16) at each (hidden) stage l for \(l<L\).
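The following sketch (continuing the hypothetical example above) assembles two representative blocks of \(\textbf{U}\textbf{U}^\textsf{T}\) in the BPJJ fashion, from the matrix \(\textbf{B}\), the scalars in \(\textbf{B}\textbf{B}^\textsf{T}\), and submatrices of the single outer product \(\textbf{y}\textbf{y}^\textsf{T}\), and verifies them against the explicit cross product formed from G; the block indexing follows our own weight ordering, not necessarily the paper's.

```python
# A sketch of the BPJJ idea: form B once by the stage-wise recursion, then assemble blocks of
# U U^T from scalars in B (or B B^T) and submatrices of the single outer product y y^T -- no
# column of G (i.e., no row of the Jacobian) is formed. The explicit G G^T from the previous
# sketch is used here only as a check.
Bmat = np.zeros((L - 1, q))                  # row i holds beta_{L-i}, i.e. beta_4 on top
beta = np.zeros((L, q))
for l in range(L - 1, 0, -1):                # beta_l = d F / d f^l (a q-vector), stage-wise
    back = Theta5[:, 2 + l] + sum(w_hidden[k - 1][2 + l] * beta[k] for k in range(l + 1, L))
    beta[l] = (1.0 - y[2 + l] ** 2) * back
    Bmat[L - 1 - l] = beta[l]

yyT = np.outer(y, y)                         # the C x C rank-one matrix, evaluated only once
BBt = Bmat @ Bmat.T                          # scalars for the hidden-by-hidden (type-b) blocks

def hidden_rows(l):                          # row range of hidden stage l in the ordering used above
    start = q * C + sum(k + 2 for k in range(1, l))
    return slice(start, start + l + 2)

UUt = G @ G.T                                # explicit cross product, for comparison only
blk_b = BBt[L - 1 - 1, L - 1 - 4] * yyT[:3, :6]          # type (b): stages 1 and 4
assert np.allclose(blk_b, UUt[hidden_rows(1), hidden_rows(4)])
jj = 2                                                   # type (c): stage 3 against output 3
blk_c = beta[3, jj] * yyT[:5, :]
assert np.allclose(blk_c, UUt[hidden_rows(3), jj * C:(jj + 1) * C])
```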
Appendix D A Role of Linear Hidden Nodes as Jump (Skip-Layer) Connections
In this section, we show that a linear hidden node, if it exists, plays the role of a jump connection. Return to the FCC network illustrated in Fig. 3b, originally described in [41] for solving the XOR problem. Such an FCC network can be extended to a B-bit parity problem (i.e., the B-dimensional XOR problem); e.g., see [44, Fig. 5, p.58] for \(B \!=\! 8\). Now, consider a general B-bit parity problem with B-H-1 MLP-learning (having H hidden nodes). When the hyperbolic tangent activation function, \(\phi (x) \!=\! \textsf{tanh}(x)\), is used at the hidden nodes, one might observe that some hidden nodes are driven to saturation (the active state of a neuron), whereas others remain quite inactive in that their net inputs stay close to the inflection point of \(\textsf{tanh}(x)\); in other words, their activations are produced by the (almost) linear part of \(\textsf{tanh}(x)\). Then, in B-H-1 MLP-learning with \(H\!=\!\frac{B+1}{2}\) for B odd (and \(B\!-\!1\) even), it can be proved [32] that there always exists a solution (of integer-valued weights) when one hidden-node activation function is replaced by the linear identity function \(\phi (x) \!=\! x\) and the other hidden-node activations are replaced by a step function that produces a binary value, ON (\(+1\)) or OFF (\(-1\)). Such an integer-valued solution can be found easily, and it leads to a desired (solution) set of weights of the B-H-1 MLP having H tanh hidden nodes, one of which can be replaced by the linear hidden node because the output of that hidden node is generated by using only the (almost) linear part of tanh on any datum; e.g., see [32, Eq. (10)] for a solution of the B-H-1 MLP having only 11 hidden nodes when \(B \!=\! 21\) (21-bit parity involving \(2^{21}\) data). In general MLP-learning, however, there is no point in using linear hidden-node functions, because a linear transformation of another linear transformation is still linear. This suggests that such a linear hidden node can be eliminated by merging it into a terminal node, leading to a jump connection. Fig. 9a shows a linear-output 2-2-1 MLP with one nonlinear (e.g., tanh or ReLU) hidden node and one linear hidden node for solving XOR; it can be transformed into a linear-output 2-1-1 FCC network via the node-merging process [32], as illustrated in Fig. 9c, which is identical to Fig. 3b. We now justify this node-condensing transformation as the Lemma below for the B-bit parity problem.
Lemma: Given the B-bit parity data, the input-to-output mappings of a B-H-1 MLP having one “linear” hidden-node function (with the other non-linear hidden nodes) can be realized by its corresponding B-(\(H\!-\!1\))-1 MLP having direct input-to-terminal jump (or skip-layer) connections.
Proof: We consider the case \(H\!=\!2\), since the argument that follows generalizes to arbitrary H. Fig. 9a and c display the two MLPs of our concern: (a) the 2-2-1 MLP producing node outputs \(z_i^s\) at node i in layer s, with nine weights denoted by \(w_{i,j}^{s,t}\) between node i in layer s and node j in layer t; and (c) the 2-1-1 MLP producing node outputs \(y_i^s\), with seven weights denoted by \(\theta _{i,j}^{s,t}\). At the terminal layer, the 2-2-1 MLP (a) produces \( z_1^2\!=\! \phi (w^{1,2}_{0,1} \! + \! z^1_1 w^{1,2}_{1,1}\! + \! z^1_2 w^{1,2}_{2,1} ) \), whereas the 2-1-1 MLP (c) generates \( y^2_1\!=\!\phi (\theta ^{1,2}_{0,1}\!+\!y^1_1\theta ^{1,2}_{1,1} \!+\! x_1 \theta ^{0,2}_{1,1} \!+\! x_2 \theta ^{0,2}_{2,1})\). Our goal is to attain \(z^2_1\!=\!y^2_1\). To this end, we set \(\theta ^{0,1}_{i,1} \!= \!w^{0,1}_{i,1}\), \(i \!=\! 0,1,2\), so as to obtain \(z^1_1 \! = \! y^1_1\), and also set \( \theta ^{1,2}_{1,1} \! = \! w^{1,2}_{1,1}\); these four weights, denoted by dotted arrows in MLP (c), are the same as in MLP (a). This setting allows us to match up the remaining weights (for \(z^2_1\!=\!y^2_1\)) by simply comparing the terms of the net input to \(z^2_1\) and to \(y^2_1\) between the two MLPs; that is, choose \( \theta ^{0,2}_{1,1} \! = \! w^{0,1}_{1,2} w^{1,2}_{2,1} \) and \( \theta ^{0,2}_{2,1} \! = \! w^{0,1}_{2,2} w^{1,2}_{2,1} \) with bias \( \theta ^{1,2}_{0,1} \! = \! w^{1,2}_{0,1} \! + \! w^{0,1}_{0,2} w^{1,2}_{2,1} \). In consequence, we attain \(z^2_1\!=\!y^2_1\) on any external input pair \((x_1,x_2)\). Graphically, the foregoing weight-matching procedure corresponds to merging the “linear” hidden node into the terminal node, as illustrated in (b). The obvious generalization applies to any \(H (\ge 3)\); details are omitted. \(\square \)
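The weight-matching rules of the proof are easy to verify numerically; the sketch below (hypothetical random weights, tanh as \(\phi \)) checks that the 2-1-1 network with jump connections reproduces \(z^2_1\) of the 2-2-1 MLP whose second hidden node is linear.

```python
import numpy as np

rng = np.random.default_rng(3)
phi = np.tanh                                 # activation of the nonlinear hidden node and the terminal node

# Hypothetical weights of the 2-2-1 MLP in Fig. 9a; hidden node 2 is the "linear" one.
w01 = rng.standard_normal(3)                  # [w^{0,1}_{0,1}, w^{0,1}_{1,1}, w^{0,1}_{2,1}] into hidden node 1
w02 = rng.standard_normal(3)                  # [w^{0,1}_{0,2}, w^{0,1}_{1,2}, w^{0,1}_{2,2}] into hidden node 2
w12 = rng.standard_normal(3)                  # [w^{1,2}_{0,1}, w^{1,2}_{1,1}, w^{1,2}_{2,1}] into the terminal node

def mlp_221(x1, x2):
    z11 = phi(w01 @ np.array([1.0, x1, x2]))  # nonlinear hidden node
    z12 = w02 @ np.array([1.0, x1, x2])       # "linear" hidden node
    return phi(w12 @ np.array([1.0, z11, z12]))   # z^2_1

# Weight matching of the proof: merge the linear hidden node into the terminal node.
theta01 = w01.copy()                          # theta^{0,1}_{i,1} = w^{0,1}_{i,1}, i = 0, 1, 2
theta12_11 = w12[1]                           # theta^{1,2}_{1,1} = w^{1,2}_{1,1}
theta02_11 = w02[1] * w12[2]                  # theta^{0,2}_{1,1} = w^{0,1}_{1,2} w^{1,2}_{2,1}
theta02_21 = w02[2] * w12[2]                  # theta^{0,2}_{2,1} = w^{0,1}_{2,2} w^{1,2}_{2,1}
theta12_01 = w12[0] + w02[0] * w12[2]         # theta^{1,2}_{0,1} = w^{1,2}_{0,1} + w^{0,1}_{0,2} w^{1,2}_{2,1}

def fcc_211(x1, x2):                          # the 2-1-1 network with jump connections (Fig. 9c)
    y11 = phi(theta01 @ np.array([1.0, x1, x2]))
    return phi(theta12_01 + y11 * theta12_11 + x1 * theta02_11 + x2 * theta02_21)   # y^2_1

for x1, x2 in [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]:
    assert np.isclose(mlp_221(x1, x2), fcc_211(x1, x2))
```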
In recent deep neural network learning [43], the so-called rectified linear unit (ReLU), \(\phi (x)\!=\!\text {max}\{0,x\}\), enjoys great popularity (e.g., see [17, Chapter 6]). Consider deep learning with hidden nodes activated by ReLU, which is linear over its positive half. When no saturation occurs at a hidden node on any data pattern (i.e., its net input stays non-negative), such a node can be regarded as linear. In this way or another, a learning situation involving linear hidden nodes can arise, leading to a network having (skip-layer) jump connections; see the Example next.
Example: Given the four XOR data pairs of two inputs \((x_{1,d}, x_{2,d})\) and target \(t_d\), \(d \!=\! 1,...,4\) (the standard XOR truth table),
consider a 2-2-1 MLP having the ReLU node function at the hidden and terminal layers, as depicted in Fig. 10a, where each connection weight is numerically specified therein. Then, the second hidden node of the 2-2-1 MLP can be replaced by the linear identity function. By the Lemma above, it can be transformed into a 2-1-1 FCC, as shown in (b). By weight-sharing, (b) can be further simplified to a 1-1-1 FCC with a jump connection, illustrated in (c), that receives the input “\(x_1 \!+\! x_2\).” (Remark: the FCC having one hidden node can solve 3-bit parity (8 data) as well as XOR (4 data); see [32] for more details.)
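Since the numerical weights of Fig. 10 are not reproduced here, the sketch below uses one well-known hypothetical assignment (0/1 input encoding assumed) that is consistent with the description: the first hidden node is a genuine ReLU, the second stays on ReLU's linear part and hence acts as the identity, and merging plus weight-sharing yields a 1-1-1 FCC driven by \(x_1 \!+\! x_2\).

```python
import numpy as np

relu = lambda v: np.maximum(0.0, v)
X = [(0, 0), (0, 1), (1, 0), (1, 1)]        # the four XOR inputs (0/1 encoding assumed here)
T = [0, 1, 1, 0]                             # XOR targets

# (a) A 2-2-1 ReLU MLP with hypothetical weights: on these data the second hidden node never
#     leaves the linear part of ReLU, so it acts as the identity on x1 + x2.
def mlp_221(x1, x2):
    h1 = relu(x1 + x2 - 1.0)                 # nonlinear (ReLU) hidden node
    h2 = x1 + x2                             # "linear" hidden node (ReLU with non-negative net input)
    return relu(h2 - 2.0 * h1)               # ReLU terminal node

# (b)-(c) After merging the linear hidden node (Lemma) and weight-sharing: a 1-1-1 FCC with a
#     jump connection that receives the single input s = x1 + x2.
def fcc_111(s):
    return relu(s - 2.0 * relu(s - 1.0))

for (x1, x2), t in zip(X, T):
    assert mlp_221(x1, x2) == fcc_111(x1 + x2) == t
```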