1 Introduction

Following the notation in [3], we consider the nonlinear least squares problem (e.g., see [8, 37]) for optimizing a multiple q-output FCC network having the M-length weight vector \(\textbf{w}\). Given N data pairs of an n-input vector and a q-target vector \(\{ \left( \textbf{x}(p), \textbf{t}(p) \right) \}^N_{p=1} \in \mathbb {R}^n \times \mathbb {R}^q\), the objective function E (cf. [3, Eq. (24), p.301] and [45, Eq. (1), p.1794]) is given below, with \(m \! \equiv \! q N\), as the sum of the m squared residuals, each denoted by either \(r_k\), \(k \!=\! 1,...,m\), or \(r_{j}(p)\), between the model’s jth output \(F_j(\textbf{x}(p);\textbf{w})\) and its target value \(t_{j}(p)\), \(j \!=\! 1,...,q\) and \(p \!=\! 1,...,N\); hence, the relation \(k \!=\! j \!+\! (p-1)q\) for \(r_k \!=\! F_j(\textbf{x}(p); \textbf{w}) - t_{j}(p) = r_j(p)\):

$$\begin{aligned} E(\textbf{w}) \!=\! \sum ^N_{p=1} E_p(\textbf{w}) \!=\! \frac{1}{2} \sum ^N_{p=1} \textbf{r}_p^\textsf{T} \textbf{r}_p \!=\! \frac{1}{2} \sum ^N_{p=1} \sum ^q_{j=1} \! \left\{ r_{j}(p) \right\} ^2 \!=\! \frac{1}{2} \sum ^{m=qN}_{k=1} \! \left\{ r_k \right\} ^2 \!=\! \frac{1}{2} \textbf{r}^\textsf{T} \textbf{r}\end{aligned}$$
(1)

where \(E_p(\textbf{w}) \! \equiv \! \frac{1}{2} \sum ^q_{j=1} \! \left\{ r_{j}(p) \right\} ^2 \!=\! \frac{1}{2} \textbf{r}^\textsf{T}_p \textbf{r}_p\) on each datum p (with \(\textbf{r}_p\), the residual vector of length q), and \(\textbf{r}\) is the m-length (\(m \! \equiv \! qN\)) residual vector (over all N data), which is a function of \(\textbf{w}\). Then, \(\nabla E\), the gradient vector of \(E(\textbf{w})\) above, and the Hessian matrix \(\nabla ^2 E\) can be expressed as below with \(\textbf{J}\! \equiv \! \nabla \textbf{r}\! = \! \frac{\partial \textbf{r}}{\partial \textbf{w}}\), the \(m \times M\) Jacobian matrix of \(\textbf{r}\):

$$\begin{aligned} \left\{ \begin{array}{l} {\displaystyle \underbrace{ \nabla E }_{M \times 1} \!=\! \sum ^N_{p=1} \nabla E_p \!=\! \sum ^N_{p=1} \underbrace{ \textbf{U}_p }_{M \times q} \textbf{r}_p \!=\! \sum ^N_{p=1} \sum ^q_{j=1} r_{j}(p) \underbrace{ \nabla r_{j}(p) }_{M \times 1} \!=\! \sum ^m_{k=1} r_{k} \nabla r_{k} \!=\! \textbf{J}^\textsf{T} \textbf{r}, } \\ {\displaystyle \underbrace{ \nabla ^2 E }_{M \times M} = \textbf{J}^\textsf{T} \textbf{J}+ \sum _{k=1}^m r_k \underbrace{ \nabla ^2 \! r_k }_{M \times M} } \\ \end{array} \right. \end{aligned}$$
(2)

where \(\nabla r_{k} \!=\! \nabla r_{j}(p)\) with \(k \!=\! j \!+\! (p-1)q\), and \(\textbf{U}^\textsf{T}_p ~\!(\equiv \! \nabla \textbf{r}_p)\) is the \(q \times M\) matrix made of q rows of \(\textbf{J}\), each row \(j~\!(=\!1,...,q)\) denoted by \(\nabla r_{j}(p)^\textsf{T}\), on each training data pattern p. The cross-product \(\textbf{J}^\textsf{T} \textbf{J}\), known as the Gauss-Newton (approximate) Hessian matrix [8, 37], can be formed by rank-q updates per datum p (or by q rank-1 updates), as below:

$$\begin{aligned} \underbrace{ \textbf{J}^\textsf{T} \textbf{J}}_{M \times M} \! = \! \sum ^N_{p=1} \nabla \textbf{r}_p^\textsf{T} \underbrace{ \nabla \textbf{r}_p }_{q \times M} \!=\! \sum ^N_{p=1} \! \underbrace{ \textbf{U}_p \textbf{U}_p^\textsf{T} }_{{\textrm{rank}}-q\,{\textrm{update}}} \! \!=\! \sum ^N_{p=1} \sum ^q_{j=1} \underbrace{ \nabla r_{j}(p) \nabla r_{j}(p)^\textsf{T} }_{{\mathrm{rank-1 \,\,update}}} \!=\! \sum ^m_{k=1} \nabla r_{k} \nabla r_{k}^\textsf{T} \end{aligned}$$
(3)
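As a minimal numeric sketch of the rank-q accumulation in Eq. (3) (toy sizes, with each \(\textbf{U}_p\) filled with random numbers solely to check the identity; this illustration is ours, not part of [3] or [45]):

```python
import numpy as np

# Toy check of Eq. (3): J^T J accumulated by one rank-q update per datum p,
# assuming each U_p (M x q) is already available from some BP routine.
rng = np.random.default_rng(0)
M, q, N = 12, 3, 5
U_list = [rng.standard_normal((M, q)) for _ in range(N)]   # U_p = (grad r_p)^T

JtJ = np.zeros((M, M))
for U_p in U_list:                  # rank-q update per datum p
    JtJ += U_p @ U_p.T

# Equivalently, stack the q rows of each U_p^T into J (of size qN x M).
J = np.vstack([U_p.T for U_p in U_list])
assert np.allclose(JtJ, J.T @ J)
```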

where \(\textbf{U}_p~\!(\equiv \! \nabla \textbf{r}_p^\textsf{T})\) is of size \(M \times q\). Algorithm BPMLP below is a widely-employed scheme (e.g., see [1, 19, 24, 25, 40, 46]) for calculating \(\textbf{J}^\textsf{T} \textbf{J}\) by backpropagation (BP) for a multilayer perceptron (MLP) that is assumed to have q linear outputs at the terminal layer:

Algorithm BPMLP

Let \(\textbf{e}_j\) denote the jth column of \(\textbf{I}\), the \(q \times q\) identity matrix; then, Algorithm BPMLP performs the backward process [see step (b), Lines 2 to 6] for \(\textbf{U}_p\) in Eq. (2) by back-propagating \(\textbf{e}_j\), one column of \(\textbf{I}\) after another, to obtain \(\nabla r_j(p)\! =\! \textbf{u}^p_j \! =\!\textbf{U}_p \textbf{e}_j\), \(j \!=\! 1,...,q\) (Line 5). This process amounts to back-propagating the numeric unit entry “1.0,” the jth entry of \(\textbf{e}_j\) (Line 4), rather than the residual \(r_j\), from each output node j, \(j \!=\! 1,...,q\); this is what we call the (output) node-wise BP procedure. After that, Algorithm BPMLP forms \(\textbf{U}_p \textbf{U}_p^\textsf{T}\) by rank-q updates [step (c) in Line 8] and computes \(\nabla E_p\) [step (d) in Line 9]. The cost per pattern p (e.g., see [28, 30]) of steps (a) to (d) can be approximated by

$$\begin{aligned} \left\{ \! \begin{array}{l} \mathbf{(a)}~{O(M)}, \text { cost for one forward pass (to evaluate node outputs at each layer); } \\ \mathbf{(b)}~{O(qM)}, \text { cost for }\textbf{U}_p\text { by repeating the backward pass }q\text { times; } \\ \mathbf{(c)}~{O(q M^2)},\text { cost for forming the }M \times M~\textbf{U}_p \textbf{U}_p^\textsf{T}\text { by rank-}q\text { updates for Eq. }(3); \\ \mathbf{(d)}~{O(q M)},\text { cost for the }M \times 1\text { gradient vector by }\nabla E_p \!=\!\textbf{U}_p \textbf{r}_p\text { as in Eq. }(2). \\ \end{array} \right. \end{aligned}$$
(4)
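To give a rough sense of scale for Eq. (4) (using, purely as an illustration of the relative orders, the \(M \!=\! 354\), \(q \!=\! 10\) FCC model of Sect. 4.1):

$$\begin{aligned} \text {(c):}~~ q M^2 = 10 \times 354^2 \approx 1.3 \times 10^6 \quad \text {versus} \quad \text {(b), (d):}~~ q M = 10 \times 354 = 3540 \quad \text {operations per pattern,} \end{aligned}$$

so step (c) dominates steps (b) and (d) by a factor of about M.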
Fig. 1

Comparison of two fully connected cascaded (FCC) network models: (a) Cheng’s type [3] and (b) Wilamowski and Yu’s type, which corresponds to [45, Fig. 5, p.1796], when both models have four layers (\(L \!=\! 4\)) and three outputs (\(q \!=\! 3\)). Node numbering differs between [3] and [45]; in (a), the three outputs at terminal stage 4 are numbered 1 through 3, for our convenience

In this journal, Cheng [3, Sec. 3.3] has proposed Algorithm BPFCC for q linear-output FCC networks; it is essentially identical to the above Algorithm BPMLP (see Appendix A.1), back-propagating \(\alpha _{L \rightarrow L} \! \equiv \! 1\) from each of the q linear output nodes in turn (to be detailed in Sect. 2.3). Wilamowski and Yu [45] have identified the computational overhead of such q repetitions (see Lines 3 to 6 in BPMLP above), and thus have developed the forward-only neuron-by-neuron (NBN) algorithm [47] (see Appendix B), similar in spirit to forward-mode automatic differentiation [18], for q-output FCC networks [45, p.1798] in nonlinear regression. Yet, the resulting gain is limited because both NBN and BPFCC reduce only the cost of step (b), with no effect on the dominant cost of step (c) in Eq. (4). In contrast, we present a new BP strategy that reduces the cost of both steps (b) and (c) by forming \(\textbf{U}_p \textbf{U}_p^\textsf{T}\) directly, without evaluating \(\textbf{U}_p\) explicitly. In particular, our BP exploits the special structure of the q linear-output FCC network of Cheng’s type, computing the Gauss-Newton matrix \(\textbf{J}^\textsf{T} \textbf{J}\) in a convenient block-arrow form, which allows us to exploit sparsity in matrix factorization.

It should be noted here that Wilamowski and Yu’s FCC network model [45] and Cheng’s FCC model both have only one node at each hidden layer, but differ in the arrangement of the q output nodes; those two FCC models are compared in Fig. 1 (for \(q \!=\! 3\)), where Cheng’s model has all q linear output nodes at the terminal layer (see [3, Fig. 2, p.294] for \(q \!=\! 2\)), whereas Wilamowski and Yu’s model (see [45, Fig. 5, p.1796] for \(q \!=\! 3\)) produces q (nonlinear) outputs from the last q layers. Therefore, when those two models in Fig. 1 are optimized with E in Eq. (1), different sparsity patterns of \(\textbf{U}_p\) result; see [45, Eq. (24), p.1797] and [3, Eq. (22), p.300], and see also Eq. (27) later.

The rest of the paper is organized as follows. Sect. 2 scrutinizes the FCC network of Cheng’s type and his proposed algorithm BPFCC using the same notations as defined in [3]. Then, Sect. 3 describes our new structure-exploiting BP procedure, and Sect. 4 presents numerical evidence; finally, our conclusion follows in Sect. 5.

2 Fully Connected Cascaded (FCC) Neural Network Learning of Cheng’s Type

We carefully examine Cheng’s backpropagation (BP) procedure proposed for FCC network learning, following the same notation as defined in [3].

2.1 The FCC Network Structure of Cheng’s Type

Consider a single-output L-layer FCC network (see [3, Fig. 1, p.294]) with M weights in total (including biases), receiving a (row) vector of n external inputs \(\textbf{x}\!=\! [x_1, ..., x_n]\) (see [3, Eq. (1), p.295]). More specifically, there is only one hidden node at each layer (or stage) l, \(l>0\), except the input layer at stage 0 (\(l \!=\! 0\)); then, each connection weight is denoted by \(\lambda _{l,k}\) from node k, \(k \!=\! 0,..., n\!+\!l\!-\!1\), to the node at stage l, \(l \!=\! 1,..., L\); here, the node at stage l, \(l>0\), is numbered “\(n \!+\! l\).” The given set of M weights (see [3, Eqs. (2) & (7), p.296]) is grouped with respect to each node at stage l, \(l>0\), each group denoted by \(\textbf{w}_l\), and they are collectively denoted by a row vector (in [3, Eq. (3), p.296]) of length M as

$$\begin{aligned} \textbf{w}\equiv \left[ \textbf{w}_1, \textbf{w}_2,..., \textbf{w}_l,..., \textbf{w}_{L-1}, \textbf{w}_L \right] . \end{aligned}$$
(5)

Each weight-group vector \(\textbf{w}_l\) is of length \(n\!+\!l\) including \(\lambda _{l,0}\), a bias to the node at stage l, in the first entry, as shown below (see [3, Eq. (4), p.296]):

$$\begin{aligned} \textbf{w}_l {\mathop {=}\limits ^\textrm{def}} \Big [ \lambda _{l,0}, \lambda _{l,1},..., \lambda _{l,n}, \lambda _{l,n+1},..., \lambda _{l,n+l-1} \Big ]. \end{aligned}$$
(6)

Then, M, the total number of weights including biases, is given below (see [3, Eq. (2),p.296])

$$\begin{aligned} M = \sum ^L_{l=1} (n+l) = nL + \frac{L(L+1)}{2}. \end{aligned}$$
(7)

In Cheng’s notation [3, Eq. (5), p.296], the net input to the lth layer neuron is denoted by \(f^l(\textbf{x}; \textbf{w})\); then, it is given below by the forward pass at each stage l, \(l \!=\! 1,...,L\):

$$\begin{aligned} f^l(\textbf{x}; \textbf{w}) = \lambda _{l,0} + \sum ^n_{i=1} x_i \lambda _{l,i} + \sum ^{l-1}_{r=1} \phi \Big ( f^r(\textbf{x}; \textbf{w}) \Big ) \lambda _{l,n+r}, \end{aligned}$$
(8)

where \(\phi (x) \!=\! \textsf{tanh}(x)\), and the output of the lth layer neuron is produced by \(\phi \big ( {{f^l(\textbf{x}; \textbf{w})}} \big )\) for \(l \!=\! 1,..., L - 1\). At the terminal layer L, the linear output is produced as \(f^L(\textbf{x}; \textbf{w})\).
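For concreteness, a minimal sketch of the forward pass of Eq. (8) is given below; the list-of-vectors data layout and the function name are our own (hypothetical) choices, with w[l-1] holding \(\textbf{w}_l\) of Eq. (6):

```python
import numpy as np

def fcc_forward_single(x, w):
    """Sketch of Eq. (8) for a single-output L-stage FCC network.

    x : (n,) external inputs.
    w : list of L weight vectors; w[l-1] = [lambda_{l,0}, ..., lambda_{l,n+l-1}],
        so len(w[l-1]) == n + l as in Eq. (6).
    Returns the net inputs f^1, ..., f^L; the linear terminal output is f^L.
    """
    y = np.concatenate(([1.0], x))            # node 0 (unit bias) and the n inputs
    f = []
    for l, w_l in enumerate(w, start=1):      # stages l = 1, ..., L
        f.append(float(w_l @ y))              # net input f^l of Eq. (8)
        if l < len(w):                        # hidden stages feed phi(f^l) forward
            y = np.append(y, np.tanh(f[-1]))
    return f
```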

Next, consider the case of multiple q outputs (\(q>1\)) of Cheng’s FCC network depicted in Fig. 1a. Since each terminal output node j, \(j \!=\! 1,...,q\), has \(n \!+\! L ~\!(\equiv \!C)\) incoming connections, M, the total number of weights, increases by \((q \!-\! 1) C\), so that Eq. (7) becomes [3, Eq. (7), p.296] below, where \(q C \! = \! q ( n \!+\! L)\) is the total number of linear terminal weights:

$$\begin{aligned} M \!=\! nL + \frac{L(L \!+\! 1)}{2} + (q \!-\! 1) (n \!+\! L) \!=\! \underbrace{ (L \!-\! 1) \left( n + \frac{L}{2}\right) }_{{{h\,\,\mathrm{hidden\, weights}}}} + \! \underbrace{ q (n \!+\! L) }_{{\mathrm{linear\, weights}}} \! \! = h + q C, \end{aligned}$$
(9)

where \(h \! \equiv \! M \!-\! qC\), the total number of nonlinear “hidden” weights. Then, the weight vector \(\textbf{w}\) in Eq. (5) is augmented as below [3, Eq(8), p.296]

$$\begin{aligned} \textbf{w}\equiv \left[ \textbf{w}_1, \textbf{w}_2,..., \textbf{w}_l,..., \textbf{w}^{(1)}_L,..., \textbf{w}^{(j)}_L,..., \textbf{w}^{(q)}_L \right] , \end{aligned}$$
(10)

where \(\textbf{w}^{(j)}_L\), \(j \!=\! 1,...,q\), is a vector of length \(C ~\!(= \! n \!+\! L)\) [3, Eq. (9), p.296], given by

$$\begin{aligned} \textbf{w}^{(j)}_L {\mathop {=}\limits ^\textrm{def}} \Big [ \lambda ^{(j)}_{L,0}, \lambda ^{(j)}_{L,1},..., \lambda ^{(j)}_{L,n}, \lambda ^{(j)}_{L,n+1},..., \lambda ^{(j)}_{L,n+L-1} \Big ]. \end{aligned}$$
(11)

As defined just before Eq. (1), let \(F_j\) be the jth output, \(j \!=\! 1,...,q\), at terminal layer L; then, the FCC model produces q linear outputs \(F_j\) (see [3, Eq. (10), p.296]) below with Eq. (11):

$$\begin{aligned} F_j(\textbf{x}; \textbf{w}) = \lambda ^{(j)}_{L,0} + \sum ^n_{i=1} x_i \lambda ^{(j)}_{L,i} + \sum ^{L-1}_{r=1} \phi \Big ( f^r(\textbf{x}; \textbf{w}) \Big ) \lambda ^{(j)}_{L,n+r}, \end{aligned}$$
(12)

where \(f^r(\textbf{x}; \textbf{w})\), the net input to the hidden node at layer r, is obtained by Eq. (8). Appendix C.1 shows how the preceding formulas apply to a five-stage (\(L \!=\! 5\)) two-input (\(n \!=\! 2\)) three-output (\(q \!=\! 3\)) FCC network depicted in Fig. 2.
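As a quick check of Eqs. (7) and (9) on that network (\(n \!=\! 2\), \(L \!=\! 5\), \(q \!=\! 3\)):

$$\begin{aligned} M = nL + \frac{L(L\!+\!1)}{2} + (q\!-\!1)(n\!+\!L) = 10 + 15 + 14 = 39, \quad h = (L\!-\!1)\Big (n + \frac{L}{2}\Big ) = 18, \quad qC = 3 \times 7 = 21, \end{aligned}$$

so that \(M = h + qC = 18 + 21 = 39\), consistent with the \(21~\!(=\! qC)\) linear terminal connection weights noted in the caption of Fig. 2.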

Fig. 2

A three-output (\(q \!=\! 3\)) five-stage (\(L \!=\! 5\)) fully connected cascaded (FCC) network of Cheng’s type having two (\(n \!=\! 2\)) inputs (see Fig.1), where the three linear outputs at terminal stage 5 are numbered 1 through 3; each output j (\(j \!=\! 1,2,3\)) has \(7~\! (=\! C \! \equiv \! n \!+\! L)\) connections, incoming arcs from nodes i, \(i \!=\! 0,1,...,6\), at previous stages; hence, \(21~\!(=\! q C)\) linear terminal connection weights in total. Those 21 linear weights are denoted separately by: (1) dotted arcs \(\lambda _{5,i}\), \(i \!=\! 0,...,6\), connecting to output 1; (2) solid arcs \(\theta _{5,i}\), \(i \!=\! 0,...,6\), to output 2; and (3) dashed arcs \(\rho _{5,i}\), \(i \!=\! 0,...,6\), to output 3 (see Appendix C)

2.2 Standard (Stage-Wise) BP [41] or Generalized Delta Rule for FCC Network Learning

Fig. 3

Two neural networks [41] for solving the XOR problem: (a) a standard 2-2-1 multilayer perceptron (MLP), and (b) a 2-1-1 FCC network having two jump connections. Both models are trained by the generalized delta rule for feedforward networks [41, p.326]. A derivation of (b) from (a) is shown in Appendix D

Cheng, judging from his literature summary [3, p.295], appears to be unaware of the pioneering work of Rumelhart et al. [41], where the generalized delta rule (see [41, Eq. (14), p.326]) is applied to a two-stage (\(L \!=\! 2\)) FCC model on the well-known XOR problem (see Fig. 3) and is described for a general feedforward network. We now show how the stage-wise BP, or the generalized delta rule, works for multiple q-output (\(q\!>\!1\)) FCC networks of Cheng’s type (see Fig. 1a). To this end, let us define the quantity “delta” as the node-input sensitivity of \(E_p\) (on data pattern p) in Eq. (1) with respect to the net input to the node at each stage l:

$$\begin{aligned} \delta _l {\mathop {=}\limits ^\textrm{def}} \frac{\partial E_p}{\partial f^{l}(\textbf{x}(p);\textbf{w})}. \end{aligned}$$
(13)

Similarly, let us introduce another quantity, called the node-output sensitivity, below, also known as the adjoint variable (or co-state, multiplier, etc.) in optimal control (e.g., see [2, 10, 13, 22, 27]):

$$\begin{aligned} \xi _l {\mathop {=}\limits ^\textrm{def}} \frac{\partial E_p}{\partial \phi \big (f^l(\textbf{x}(p); \textbf{w}) \big )}. \end{aligned}$$
(14)

By forward pass, the output of each hidden node at stage l, \(0<l<L\), is given by \(\phi \left( f^l(\textbf{x}; \textbf{w}) \right) \), where its net input \(f^l(\textbf{x}; \textbf{w})\) is subject to the forward-pass equation (8), and then q terminal linear outputs \(F_j(\cdot )\), \(j \!=\! 1,...,q\), are obtained by Eq. (12). Then, by backward pass, \(\delta _l\), the node-input sensitivity of (hidden) node at stage l, \(0<l<L\), can be determined as below

$$\begin{aligned} \delta _l = \phi ' \Big ( f^l(\textbf{x}; \textbf{w}) \Big ) \xi _l = \phi ' \Big ( f^l(\textbf{x}; \textbf{w}) \Big ) \! \left\{ \sum ^{q}_{j=1} \lambda ^{(j)}_{L,n+l} \cdot r_j + \sum ^{L-1}_{s=l+1} \lambda _{s,n+l} \cdot \delta _{s} \right\} \end{aligned}$$
(15)

where \(r_j \! = \! \xi ^{(j)}_L \!=\! \delta ^{(j)}_L\) (terminal node sensitivity); i.e., the residual in Eq. (1) evaluated at linear-output node j, \(j \!=\! 1,...,q\), at end stage L on each data pattern. Then, the gradient entry of \(\nabla E_p\) with respect to each “hidden” weight \(\lambda _{l,i}\) for \(l<L\) is given by

$$\begin{aligned} \frac{\partial E_p}{\partial \lambda _{l,i}} \!=\! \left\{ \! \begin{array}{rl} \delta _{l}, &{} \text {if}\,i \!=\! 0, \\ x_i \cdot \delta _{l}, &{} \text {if}\,i \!=\! 1,...,n, \\ \phi \big ( {{f^{i-n}(\textbf{x}(p); \textbf{w})}} \big ) \cdot \delta _{l}, &{} \text {if}\,i \!=\! n \!+\! 1,..., n \!+ \! l \!-\! 1. \\ \end{array} \right. \end{aligned}$$
(16)

For the gradient of each linear “terminal” weight \(\lambda ^{(j)}_{L,i}\), we substitute \(\delta ^{(j)}_L \! = \! r_j\) (and \(l \!=\! L\)) into Eq. (16), as below:

$$\begin{aligned} \frac{\partial E_p}{\partial \lambda ^{(j)}_{L,i}} \!=\! \left\{ \! \begin{array}{rl} r_j, &{} \text {if}\,i \!=\! 0, \\ x_i \cdot r_j, &{} \text {if}\,i \!=\! 1,...,n, \\ \phi \big ( {{f^{i-n}(\textbf{x}(p); \textbf{w})}} \big ) \cdot r_j, &{} \text {if}\,i \!=\! n \!+\! 1,..., n \!+ \! L \!-\! 1. \\ \end{array} \right. \end{aligned}$$
(17)
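To make the stage-wise recursion concrete, the sketch below implements Eqs. (8), (12), and (13)-(17) on one data pattern; the function and variable names (and the list/matrix data layout) are ours, introduced only for illustration:

```python
import numpy as np

def stagewise_bp_grad(x, t, W_hidden, W_out):
    """Sketch of the stage-wise BP of Eqs. (13)-(17) on one data pattern.

    x, t     : (n,) inputs and (q,) targets.
    W_hidden : list of L-1 vectors, W_hidden[l-1] = w_l of Eq. (6), length n+l.
    W_out    : (q, n+L) matrix whose row j is w_L^(j) of Eq. (11).
    Returns grad E_p as (grad w.r.t. W_hidden, grad w.r.t. W_out).
    """
    n, L = len(x), len(W_hidden) + 1
    # Forward pass: Eq. (8) for the hidden stages, Eq. (12) at the terminal stage.
    y = np.concatenate(([1.0], x))                 # running node-output vector
    f = []                                         # net inputs f^1, ..., f^{L-1}
    for w_l in W_hidden:
        f.append(float(w_l @ y))
        y = np.append(y, np.tanh(f[-1]))           # y ends up as Eq. (29), length C
    r = W_out @ y - t                              # residuals r_j, Eq. (1)

    # Backward pass: node-input sensitivities delta_l by Eq. (15).
    delta = np.zeros(L)                            # delta[l] for stages 1..L-1
    for l in range(L - 1, 0, -1):
        xi = float(W_out[:, n + l] @ r)            # contribution of the q outputs
        for s in range(l + 1, L):                  # later hidden stages s > l
            xi += W_hidden[s - 1][n + l] * delta[s]
        delta[l] = (1.0 - np.tanh(f[l - 1]) ** 2) * xi

    # Gradients: Eq. (16) for hidden weights, Eq. (17) for terminal weights.
    g_hidden = [delta[l] * y[: n + l] for l in range(1, L)]
    g_out = np.outer(r, y)                         # row j is r_j * y^T
    return g_hidden, g_out
```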

Appendix C.2 demonstrates the above stage-wise BP on the FCC model depicted in Fig. 2. Next, we describe in detail Cheng’s BP algorithm (called BPFCC), which is characterized as node-wise BP, showing how it works differently from the above stage-wise BP.

2.3 The BPFCC Routine of Cheng’s BP Algorithm

For the purpose of our discussion, the BPFCC routine of Cheng’s BP procedure (see [3, Algorithm 2, p.300]) is displayed below.

Algorithm BPFCC

BPFCC is designed for the single-output (\(q \!=\! 1\)) FCC network. Hence, for the multiple q-output (\(q>1\)) case, BPFCC is repeated q times, once with respect to each output node (hence, the node-wise BP mentioned earlier); see the Note added between Lines 2 and 3 in BPFCC shown above. In the nonlinear least squares problem with \(E(\textbf{w})\) in Eq. (1) involving N training data, Cheng’s BP algorithm calls \(\textbf{G} \!=\! \textsf{BPFCC}(\textbf{x}_p, \textbf{w}, n, L)\) on each data pattern p for \(p \!=\! 1,...,N\). By forward pass (Line 2), the FCC network produces the linear outputs \(F_j(\textbf{x}; \textbf{w})\), \(j \!=\! 1,...,q\), at terminal layer L by Eq. (12). Then, during the backward process (see Lines 3\(\sim \)9), BPFCC explicitly evaluates the quantity \(\alpha _{l \rightarrow L}\), called the derivative amplification coefficient (from layer l to terminal layer L), by the following backward-pass recurrence relation [3, Eq. (21), p.299] for \(0<l < L\), starting with the boundary condition \(\alpha _{L \rightarrow L} \!=\! 1\) at terminal layer L:

$$\begin{aligned} \alpha _{l \rightarrow L} = \! \sum ^L_{r=l+1} \! \lambda _{r,n+l} \cdot \alpha _{r \rightarrow L} \cdot \phi ' \Big (\! f^l(\textbf{x}; \textbf{w}) \! \Big ) \end{aligned}$$
(18)

which must be repeated q times, once per output node j, \(j \!=\! 1,...,q\). According to [3, Sec. 3.3], for each repetition of Eq. (18) on j, weight \(\lambda _{L,i}\) must be regarded as \(\lambda ^{(j)}_{L,i}\) in \(\textbf{w}^{(j)}_L\), the vector of \(C ~\!(= \! n \!+\! L)\) linear terminal weights defined in Eq. (11):

$$\begin{aligned} \lambda _{L,i} \leftarrow \lambda ^{(j)}_{L,i}, ~~ i \!=\! 0,..., n+L-1, ~~~~\text {for}\,j \!=\! 1,..., q. \end{aligned}$$
(19)

Eq. (18) is highlighted as the core equation [3, p.299], although it should be implemented as

$$\begin{aligned} \alpha _{l \rightarrow L} = \phi ' \Big (\! f^l(\textbf{x}; \textbf{w}) \! \Big ) \! \sum ^L_{r=l+1} \! \lambda _{r,n+l} \cdot \alpha _{r \rightarrow L} \end{aligned}$$
(20)

Then, the quantities below are evaluated for each repeat on output node j, \(j \!=\! 1,...,q\):

$$\begin{aligned} \frac{\partial F_j(\textbf{x}; \textbf{w})}{\partial \lambda _{l,i}} \!=\! \frac{\partial f^l(\textbf{x}; \textbf{w})}{\partial \lambda _{l,i}} ~\! \alpha _{l \rightarrow L} \!=\! \left\{ \! \begin{array}{rl} \alpha _{l \rightarrow L}, &{} \hspace{-0.1mm} \text {if}\,i \!=\! 0, \\ x_i \! \cdot \! \alpha _{l \rightarrow L} &{} \hspace{-0.1mm} \text {if}\,i \!=\! 1,...,n, \\ \phi \big ( {{f^{i-n}(\textbf{x}; \textbf{w})}} \big ) \! \cdot \! \alpha _{l \rightarrow L} &{} \hspace{-0.1mm} \text {if}\,i \!=\! n \!+\! 1,..., n \!+\! l \!-\! 1 \\ \end{array} \right. \end{aligned}$$
(21)

and stored in \(\textbf{G}\). Note in Lines 6 and 7 of Algorithm BPFCC that \(f^L(\textbf{x};\textbf{w})\) therein must be regarded as \(F_j (\textbf{x};\textbf{w})\) in accordance with Eq. (19) for each repetition on output node j, \(j \!=\! 1,...,q\). Since Eq. (21) yields \(\nabla r_j(p)\) at each repetition on j, BPFCC returns \(\textbf{G}\!=\! \textbf{U}_p^\textsf{T}\) in Eq. (2); i.e., the q rows of the residual Jacobian matrix \(\textbf{J}\! \equiv \! \nabla \textbf{r}\) of size \(qN \times M\) in Eq. (2). This reveals that Algorithm BPFCC is equivalent to step (b) of Algorithm BPMLP shown in Sect. 1. Therefore, \(\alpha _{l \rightarrow L}\), what Cheng [3, p.297] termed the “derivative amplification coefficient,” is just the partial derivative of the residual entry \(r_j(p)\), evaluated at output node j on datum p, with respect to the net input to the node at stage l at each repetition on j, \(j \!=\! 1,...,q\):

$$\begin{aligned} \alpha _{l \rightarrow L} {\mathop {=}\limits ^\textrm{def}} \frac{\partial F_j(\textbf{x}(p); \textbf{w})}{\partial f^l(\textbf{x}(p); \textbf{w})} = \frac{\partial r_j(p)}{\partial f^l(\textbf{x}(p); \textbf{w})} \end{aligned}$$
(22)

which is recursively evaluated by Eq. (18) or (20). After \(\textbf{U}_p^\textsf{T} \!=\! \textbf{G}\!=\! \textsf{BPFCC}(\textbf{x}_p, \textbf{w}, n, L)\) is obtained, the gradient \(\nabla E_p(\textbf{w})\) per pattern p is computed below as step (d) in Eq. (4)

$$\begin{aligned} \underbrace{ \nabla E_p(\textbf{w}) }_{M \times 1} = \underbrace{ \textbf{U}_p }_{M \times q} \textbf{r}_p ~ \Longleftrightarrow ~ \nabla E_p(\textbf{w}) = \sum ^q_{j=1} r_j(p) \nabla r_j(p). \end{aligned}$$
(23)
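For comparison with the stage-wise sketch above, the following is our reconstruction (for illustration only, not Cheng’s FCCNET code) of the node-wise computation of Eqs. (18)-(23): one backward pass per output j with the boundary value \(\alpha _{L \rightarrow L} \!=\! 1\), yielding the blocks of \(\nabla r_j(p)\) and then \(\nabla E_p\):

```python
import numpy as np

def bpfcc_like_jacobian_blocks(x, t, W_hidden, W_out):
    """Node-wise sketch of Eqs. (18)-(23) on one data pattern (illustration only).

    x, t     : (n,) inputs and (q,) targets.
    W_hidden : list of L-1 vectors, W_hidden[l-1] = w_l of Eq. (6), length n+l.
    W_out    : (q, n+L) matrix whose row j is w_L^(j) of Eq. (11).
    Returns the per-output gradient blocks grad r_j (Eq. (21)) and grad E_p (Eq. (23)).
    """
    n, L, q = len(x), len(W_hidden) + 1, W_out.shape[0]
    # Forward pass, Eqs. (8) and (12).
    y = np.concatenate(([1.0], x))
    f = []
    for w_l in W_hidden:
        f.append(float(w_l @ y))
        y = np.append(y, np.tanh(f[-1]))
    r = W_out @ y - t                                   # residuals r_j(p)

    grad_r = []                                         # grad r_j, stored block-wise
    for j in range(q):                                  # one backward pass per output j
        alpha = np.zeros(L + 1)
        alpha[L] = 1.0                                  # boundary value alpha_{L->L} = 1
        for l in range(L - 1, 0, -1):                   # Eq. (20), with Eq. (19) in effect
            acc = W_out[j, n + l] * alpha[L]
            for s in range(l + 1, L):
                acc += W_hidden[s - 1][n + l] * alpha[s]
            alpha[l] = (1.0 - np.tanh(f[l - 1]) ** 2) * acc
        blocks = {"w_L^(%d)" % (j + 1): y.copy()}       # d r_j / d w_L^(j), Eq. (21)
        for l in range(1, L):
            blocks["w_%d" % l] = alpha[l] * y[: n + l]  # d r_j / d w_l, Eq. (21)
        grad_r.append(blocks)

    # Eq. (23): grad E_p = sum_j r_j * grad r_j, accumulated block by block.
    grad_E = {k: np.zeros_like(v) for b in grad_r for k, v in b.items()}
    for j, blocks in enumerate(grad_r):
        for k, v in blocks.items():
            grad_E[k] += r[j] * v
    return grad_r, grad_E
```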

Appendix C.3 shows how Algorithm BPFCC works on the FCC shown in Fig. 2.

2.4 Complexity Analysis of BPFCC for \(\textbf{U}_p\) and \(\nabla E_p\)

Cheng’s BP employs \(\textbf{G} \!=\! \textbf{U}_p^\textsf{T} \!=\! \textsf{BPFCC}(\textbf{x}(p), \textbf{w}, n, L)\) for the q rows of \(\textbf{J}\! \equiv \! \nabla \textbf{r}\) in Eq. (2) by accumulating each row \(\nabla r_j(p)^\textsf{T}\) q times (see [3, Sec. 3.3]) node-wise with respect to each output node j, \(j \!=\! 1,..., q\). This (output-)node-wise BP procedure is popular in MLP-learning (e.g., see [1, 19, 24, 25, 40, 46]), as summarized in Algorithm BPMLP in Sect. 1, for evaluating rows of \(\textbf{J}\), to be followed by the rank-updating of Eq. (3) for \(\textbf{J}^\textsf{T} \textbf{J}\).

Consider the cost of evaluating the gradient vector \(\nabla E_p\) of length M, where M is the number of weights defined in Eqs. (7) and (9); such a node-wise procedure as BPFCC is not as efficient as stage-wise BP in the multiple q-output (\(q>1\)) case. This is because, while BPFCC obtains the gradient vector \(\nabla E_p\) by Eq. (23) in O(qM), stage-wise BP evaluates \(\nabla E_p\) with Eqs. (16) and (17) in O(M) by computing the node sensitivities \(\delta _l\) in Eq. (13) at each stage l, with no row \(\nabla r_j(p)^\textsf{T}\) of \(\textbf{J}\) required explicitly (unlike BPFCC); this is the essence of stage-wise BP by the generalized delta rule (15). More specifically, in multiple-output FCC network learning, the q repetitions of backward passes in BPFCC require \(O({{{q(C \!+\! h)}}})\), where \(q(C + h) \!=\! qC + qh > qC + h \!=\! M\) by Eq. (9), since \(q > 1\). This fact can be observed in the comparison between the stage-wise progress in Fig. 7 and the node-wise progress in Fig. 8. To see this more concretely, let us set \(\alpha _{5 \rightarrow 5} \!=\! r_1\) in Eq. (45), \(\alpha _{5 \rightarrow 5} \!=\! r_2\) in Eq. (46), and \(\alpha _{5 \rightarrow 5} \!=\! r_3\) in Eq. (47); then, accumulating the quantities \(\alpha \) over Eqs. (45), (46), and (47) yields Eq. (44) for the node sensitivities \(\delta _l\). This \(\alpha \)-accumulation for \(\delta _l\) is inefficient because six hidden weights (\(\lambda _{4,3}\), \(\lambda _{4,4}\), \(\lambda _{4,5}\), \(\lambda _{3,3}\), \(\lambda _{3,4}\), \(\lambda _{2,3}\)) are repeatedly used in Eqs. (45), (46), and (47).

Algorithm BPFCC always starts with \(\alpha _{5 \rightarrow 5} \!=\! 1\) at any output node j, \(j \!=\! 1,...,q\), on each datum p, just as in Line 4 of Algorithm BPMLP; this is simply because BPFCC explicitly evaluates \(\nabla r_j(p)\) for \(j \!=\! 1,...,q\), i.e., the q rows of the residual Jacobian matrix \(\textbf{J}\), as \(\textbf{G}\!=\! \textbf{U}_p^\textsf{T}\) with Eq. (21). Wilamowski and Yu [45] have recognized the computational overhead (see Appendix A.1) behind the q repetitions (\(q > 1\)) of such a node-wise BP procedure for \(\textbf{U}_p\), and have therefore developed the neuron-by-neuron (NBN) method for evaluating \(\textbf{U}_p\) on each datum p; see \(\textbf{U}_p\) in [45, Eq. (24), p.1797]. When \(q \!=\! 1\) (single-output case), both NBN and BPFCC work at essentially the same cost; see the complexity analysis of NBN in [45, p.1798]. As explained with Eq. (4) in Sect. 1, NBN (as well as BPFCC) attempts to reduce the cost of step (b) for \(\textbf{U}_p\) of size \(M \times q\), with no impact on the dominant cost of step (c), the rank-updating by Eq. (3) for \(\textbf{J}^\textsf{T} \textbf{J}\) of size \(M \times M\) in Eq. (2). This indicates that NBN is not significantly better than BPFCC when \(q > 1\).

3 A New Structure-Exploiting BP for Evaluating \(\textbf{J}\) and Its Cross Product \(\textbf{J}^\textsf{T} \textbf{J}\)

All the preceding BPMLP, BPFCC, and NBN procedures evaluate rows of \(\textbf{J}\) explicitly and perform rank updates by Eq. (3) for \(\textbf{J}^\textsf{T} \textbf{J}\). On the other hand, the so-called second-order stage-wise BP [30, 31, 33] can reduce the dominant cost of step (c) in Eq. (4) by avoiding the explicit evaluation of any row \(\nabla r_k^\textsf{T}\) of \(\textbf{J}\) in MLP-learning, just as the gradient vector \(\nabla E\) is computed with the node sensitivities in Eq. (13) by Eq. (16), without any row \(\nabla r_k^\textsf{T}\) formed explicitly for \(\textbf{J}^\textsf{T} \textbf{r}\). That is, no rows of \(\textbf{J}\) are explicitly required to form \(\textbf{J}^\textsf{T} \textbf{J}\) in MLP-learning; consequently, stage-wise BP can evaluate the exact Hessian \(\nabla ^2 E\) in Eq. (2) faster than the node-wise BP that evaluates merely \(\textbf{J}^\textsf{T} \textbf{J}\) with the rank-updating of Eq. (3) in MLP-learning [33].

In this section, we develop a new BP procedure, specifically geared to FCC networks of Cheng’s type having multiple q linear outputs (see Fig. 2), for evaluating \(\textbf{J}^\textsf{T} \textbf{J}\) efficiently by accumulating node sensitivities in a stage-wise fashion (with no rows of \(\textbf{J}\) explicitly evaluated).

3.1 The Structure of Block Angular and Block Arrowhead Matrices

To show the special structure of \(\textbf{J}^\textsf{T} \textbf{J}\), we first note that \(\textbf{w}\), the entire weight vector of length M in Eq. (10), consists of the following linear and nonlinear weights:

  1. q linear weight vectors at terminal stage L, \(\textbf{w}^{(j)}_L\), \(j \!=\! 1,...,q\), each of length \(C \! \equiv \! n \!+\! L\), as defined in Eq. (11); and

  2. \((L \!-\! 1)\) nonlinear “hidden” weight vectors \(\textbf{w}_l\), \(l \!=\! 1,...,L \!-\! 1\), defined in Eq. (6).

We then arrange those group-weight vectors in the reverse order of \(\textbf{w}_l\) in Eq. (10) (i.e., [3, Eq(8), p.296]) below for \(l \!=\! L, L \!-\! 1,..., 1\) with \(h \! \equiv \! M \!-\! qC\) [see Eq. (9)]:

$$\begin{aligned} \textbf{w}{\mathop {=}\limits ^\textrm{def}} \Big [ \underbrace{ ~ \!~ \textbf{w}^{(1)}_L, \!~ ~ \textbf{w}^{(2)}_L, \!~ ~... \!~ ~,\textbf{w}^{(j)}_L, \!~ ~... \!~ ~,\textbf{w}^{(q)}_L \!~ ~ }_{{qC\,\,{\mathrm{linear\, terminal \,weights}}}}, \underbrace{ \textbf{w}_{L-1}, \textbf{w}_{L-2},... \textbf{w}_2, \textbf{w}_1 }_{{h\,\,{\mathrm{nonlinear \,hidden \,weights}} \,\,\textbf{w}_h}} \Big ]. \end{aligned}$$
(24)

Here, \(\textbf{w}_h\), the h-vector of hidden weights, is stage-wisely partitioned, comprising \(L \!-\! 1\) weight vectors \(\textbf{w}_l\) in Eq. (6) of length \(n \!+\! l\) at each stage \(l \!=\! 1,..., L \!-\! 1\) due to Eq. (24):

$$\begin{aligned} \underbrace{ \textbf{w}_h }_{{\textrm{length}}\,h} \equiv \big [ \underbrace{ ~ ~ ~ \textbf{w}_{L-1} ~ ~ ~ }_{n + L-1}, \underbrace{ ~ ~ \textbf{w}_{L-2} ~ ~ }_{n + L-2}, ~ \ldots ~, \underbrace{ ~ ~ ~ ~ \textbf{w}_{l} ~ ~ ~ ~ }_{n + l}, ~ \ldots ~ \underbrace{ ~ ~ \textbf{w}_2 ~ ~ }_{n + 2}, \underbrace{ \textbf{w}_1 }_{n + 1} \big ] \end{aligned}$$
(25)

For the weight arrangements in Eq. (24), let us show below the sparsity structure of \(\nabla r_j(p)^\textsf{T}\), a row of \(\textbf{J}~\!(\equiv \! \nabla \textbf{r})\) corresponding to the residual entry \(r_j(p)\) evaluated at terminal node j, \(j \!=\! 1,...,q\), on data pattern p, \(p \!=\! 1,...,N\), for Eqs. (1) and (2):

$$\begin{aligned} \underbrace{ \nabla r_j(p)^\textsf{T} }_{1 \times M} {\mathop {=}\limits ^\textrm{def}} \Big [ \underbrace{ ~ ~ \textbf{0}^\textsf{T}, ~ ~ \! ~ \textbf{0}^\textsf{T}, ~ ~ \! ~... ~ ~ \! ~,~\textbf{a}_j(p)^\textsf{T}, ~ ~ \! ~... ~ ~ \! ~,\textbf{0}^\textsf{T} ~ ~ }_{{{\textrm{only}}\,C\,\,{\mathrm{non-zero \,entries \,in}}\,\textbf{a}_j(p)}}, \underbrace{ \hspace{12.0mm} \textbf{b}_j(p)^\textsf{T} \hspace{12.0mm} }_{{h\,\,\mathrm{dense \,entries}}} \Big ], \end{aligned}$$
(26)

where \(\textbf{0}\) is the zero vector of length \(C~\!(\equiv \! n \!+\! L)\). In general, the M-vector \(\nabla r_j(p)\) consists of \((q \!-\! 1)C\) zero entries (see [3, Eq. (22), p.300]) and only \(C \!+\! h\) non-zero entries, denoted by the C-vector \(\textbf{a}_j(p) \! \equiv \! \frac{\partial r_j(p)}{\partial \textbf{w}^{(j)}_L}\) and the h-vector \(\textbf{b}_j(p) \! \equiv \! \frac{\partial r_j(p)}{\partial \textbf{w}_h}\). Since \(\nabla r_j(p)\) is the jth column of the \(M \times q\) matrix \(\textbf{U}_p\) in Eq. (2), the cross-product matrix \(\textbf{J}^\textsf{T} \textbf{J}\!=\! \sum _p \textbf{U}_p \textbf{U}_p^\textsf{T}\) resulting from Eq. (3) becomes the block-arrow matrix \(\textbf{U}\textbf{U}^\textsf{T}\) shown below on the right (with the datum index p omitted), owing to the \(M \times q\) block-angular matrix \(\textbf{U}\) (see also Appendix A.1) shown below on the left, where \(h \! \equiv \! M \!-\! qC\) is the number of hidden weights [see Eq. (9)]:

$$\begin{aligned} \underbrace{ \textbf{U}}_{M \times q} = \left[ \! \begin{array}{cccc} \textbf{a}_1 &{} &{} &{} \\ &{} \textbf{a}_2 &{} &{} \\ &{} &{} \ddots &{} \\ &{} &{} &{} \textbf{a}_q \\ \textbf{b}_1 &{} \textbf{b}_2 &{} \cdots &{} \textbf{b}_q \\ \end{array} \! \right] \! , ~~~~~ \underbrace{ \textbf{U}\textbf{U}^\textsf{T} }_{M \times M} = \left[ \! \begin{array}{ccccc} \textbf{a}_1 \textbf{a}_1^\textsf{T} &{} &{} &{} &{} \textbf{a}_1 \textbf{b}_1^\textsf{T} \\ &{} \textbf{a}_2 \textbf{a}_2^\textsf{T} &{} &{} &{} \textbf{a}_2 \textbf{b}_2^\textsf{T} \\ &{} &{} \ddots &{} &{} \vdots \\ &{} &{} &{} \textbf{a}_q \textbf{a}_q^\textsf{T} &{} \textbf{a}_q \textbf{b}_q^\textsf{T} \\ \textbf{b}_1 \textbf{a}_1^\textsf{T} &{} \textbf{b}_2 \textbf{a}_2^\textsf{T} &{} \cdots &{} \textbf{b}_q \textbf{a}_q^\textsf{T} &{} \sum _{j=1}^q \textbf{b}_j \textbf{b}_j^\textsf{T} \\ \end{array} \! \right] \end{aligned}$$
(27)

for which we pay attention to the following three types of block matrices in forming \(\textbf{U}\textbf{U}^\textsf{T}\):

$$\begin{aligned} \left\{ \begin{array}{l} \text {(a)} \textbf{a}_j \textbf{a}_j^\textsf{T}\hbox { of size }C \times C, j \!=\! 1,...,q,\hbox { the first }q\hbox { on-diagonal blocks; } \\ \text {(b)}\, {\displaystyle \underbrace{ \widetilde{\textbf{b}}\widetilde{\textbf{b}}^\textsf{T} }_{h \times h} \! \equiv \! \sum _{j=1}^q \textbf{b}_j \textbf{b}_j^\textsf{T} },\hbox { the last on-diagonal block at the lower-right corner; and } \\ \text {(c)} \textbf{b}_j \textbf{a}_j^\textsf{T}\hbox { of size }h \times C, j \!=\! 1,...,q,\hbox { the remaining }q\hbox { off-diagonal blocks. } \\ \end{array} \right. \end{aligned}$$
(28)

The three block-matrix types (a), (b), and (c) appear in \(\textbf{U}\textbf{U}^\textsf{T}\) of Eq. (27) as the first q on-diagonal blocks, the last on-diagonal block at the lower-right corner, and the off-diagonal border blocks, respectively. Such angular and arrowhead matrices as in Eq. (27) frequently arise in scientific computing, and the arrowhead should point in the south-east direction to exploit the posed sparsity (e.g., see [16, 21, 29, 42]; pp.348-351 [5]). Unfortunately, most currently available neural-network software packages are not designed to exploit the sparsity inevitably arising in multiple-output neural-network learning; Fig. 4, for instance, compares three sparsity patterns of the \(16 \times 16\) \(\textbf{J}^\textsf{T} \textbf{J}\) obtainable from four-output 1-2-4 MLP-learning (16 weights). For the desired sparsity in Fig. 4a, only dense blocks need to be formed to save memory, and the block-arrow form allows us to avoid fill-ins when (modified) Cholesky factorization is performed on each block in \(\textbf{J}^\textsf{T} \textbf{J}\) or when QR is applied to blocks in \(\textbf{J}\) (see [29] and references therein); e.g., for the Levenberg-Marquardt algorithm [19, 25, 46] with trust-region globalization (e.g., see [6,7,8, 26] for NL2SOL type). In Fig. 4b and c, the complicated sparsity patterns are hard to exploit; obviously, their optimization procedures simply ignore the sparsity.
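As an illustration of why the south-east-pointing arrowhead is convenient, the sketch below solves a block-arrow system of the form in Eq. (27) with Levenberg-Marquardt damping \(\mu \) by eliminating the q diagonal blocks first, so that no fill-in appears outside the existing blocks; it uses generic dense solves in place of the (modified) block Cholesky or QR of [29], and all names are ours:

```python
import numpy as np

def solve_block_arrow(D_blocks, E_blocks, H, g_top, g_bot, mu=1e-3):
    """Solve (A + mu*I) p = -g for a block-arrow matrix (a sketch, not [29]):

        A = [[blockdiag(D_1, ..., D_q), E^T],
             [E,                        H  ]],   E = [E_1  ...  E_q],

    with D_j of size C x C, E_j of size h x C, and H of size h x h, as in
    Eq. (27) (D_j ~ a_j a_j^T, E_j ~ b_j a_j^T, H ~ sum_j b_j b_j^T).
    g_top[j] is the C-vector gradient block for w_L^(j); g_bot is the h-vector
    gradient block for the hidden weights.  Eliminating the q diagonal blocks
    first leaves only the h x h Schur complement to factorize (no fill-in).
    """
    h = H.shape[0]
    S = H + mu * np.eye(h)                         # Schur complement accumulator
    rhs = -g_bot.copy()
    pieces = []
    for Dj, Ej, gj in zip(D_blocks, E_blocks, g_top):
        Dj_mu = Dj + mu * np.eye(Dj.shape[0])
        DinvEt = np.linalg.solve(Dj_mu, Ej.T)      # C x h
        Dinvg = np.linalg.solve(Dj_mu, -gj)        # C-vector
        S -= Ej @ DinvEt
        rhs -= Ej @ Dinvg
        pieces.append((DinvEt, Dinvg))
    p_bot = np.linalg.solve(S, rhs)                # step for the hidden weights
    p_top = [Dinvg - DinvEt @ p_bot for (DinvEt, Dinvg) in pieces]
    return p_top, p_bot                            # steps for w_L^(1..q) and w_h
```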

Fig. 4

Three sparsity patterns of the \(16 \times 16\) (Gauss–Newton approximate) Hessian matrix \(\textbf{J}^\textsf{T} \textbf{J}\) in Eq. (2) obtained by three different software implementations on single-input four-output 1-2-4 MLP-learning (16 weights): (a) our desired block-arrow matrix [see also Eq.  (27)(right)] with its arrow-head pointing downward to the right; (b) a matrix with undesired sparsity generated by the script file calcjejj.m in [1]; (c) a matrix with undesired sparsity obtained by the (Matlab-based software) NETLAB with mlphess.m [36]

Furthermore, the minimization of \(E(\textbf{w})\) in Eq. (1) is called the separable nonlinear least squares problem [15, 34, 39] when the model (of our choice) consists of a mixture of linear and nonlinear parameters, as shown in Eq. (24). A distinguishing feature of the block matrices shown in Eq. (27) is that all q on-diagonal blocks \(\textbf{a}_j\), \(j \!=\! 1,...,q\), in \(\textbf{U}\) are identical to \(\textbf{y}\), the C-vector of node outputs defined for Eq. (42); hence, \(\textbf{a}_j \!=\! \textbf{y}\) for any j on each datum, owing to the linear terminal outputs. In consequence, the first q on-diagonal blocks \(\textbf{a}_j \textbf{a}_j^\textsf{T}\), \(j \!=\! 1,...,q\), in \(\textbf{U}\textbf{U}^\textsf{T}\) [see type (a) in Eq. (28)] are identical to the rank-one matrix \(\textbf{y}\textbf{y}^\textsf{T}\); hence, \(\textbf{a}_j \textbf{a}_j^\textsf{T} \!=\! \textbf{y}\textbf{y}^\textsf{T}\) for any j. Since those q blocks are identical, we can save memory by forming only one on-diagonal block, which is obtainable just after the forward pass evaluating the node-output vector \(\textbf{y}\); this is an important feature to be exploited (e.g., see [29]) that commonly arises in multiple-linear-output neural-network learning.

3.2 A BP Procedure for Extracting the Information of \(\textbf{U}\)

We describe how to use BP for extracting the information of block-angular (residual Jacobian) matrix \(\textbf{U}\) in Eq. (27) without forming \(\textbf{U}\) explicitly.

First, by forward pass, we calculate hidden-node outputs \(\phi ( {f^l(\cdot )} )\) by Eq. (8) at each stage l, \(0<l<L\), and then evaluate q terminal linear outputs \(F_j(\cdot )\), \(j \!=\! 1,...,q\) by Eq. (12) at end stage L. Eq. (42), for instance, shows that the q-length vector of linear terminal outputs is given by \(\varvec{\varTheta }_{\!{{L}}} \textbf{y}\), where \(\varvec{\varTheta }_{\!{{L}}}\) is the \(q \times C~\!(\equiv \! n \!+\! L)\) matrix of terminal linear weights [see Eqs. (39) and (40)], and \(\textbf{y}\) is the C-length vector that concatenates all preceding hidden-node outputs including the n inputs and the unit-constant output of node 0 at stage 0 (i.e., nominal input layer). In general, let us define \(\textbf{y}_l\) to be the concatenated vector of the first \((n \!+\! l \!+\! 1)\) node outputs up to stage l, \(l<L\); then, by definition, \(\textbf{y}_{L-1}\) is the above C-vector \(\textbf{y}\) (hence, \(\textbf{y}\! \equiv \! \textbf{y}_{L-1}\)), and \(\textbf{y}_l\) is the vector of the first \((n \!+\! l \!+\! 1)\) entries of \(\textbf{y}\); that is, \(\textbf{y}_l^\textsf{T} \! \equiv \! \left[ 1, x_1, ..., \phi \!\left( \!{{{f^{l}\!(\textbf{x};\!\textbf{w})}}}\!\right) \right] \). Below, we show their concatenated structure of \(\textbf{y}_l\), \(l \!=\! 0,..., L \!-\! 1\), on any data pattern with \(\textbf{y}\! \equiv \! \textbf{y}_{L-1}\) of length \(C ~\!(\equiv \! n \!+\! L)\):

$$\begin{aligned} \textbf{y}^\textsf{T} \!\! = \! \textbf{y}_{L-1}^\textsf{T} \! \equiv \! \Big [ \! \underbrace{ \underset{{\hspace{14.0mm}\ddots }}{ \underbrace{ \underbrace{ ~\! 1, x_1,..., x_n }_{ {\textbf{y}_0^\textsf{T}} }, \phi \!\left( \!{{{f^1\!(\textbf{x};\!\textbf{w})}}}\right) }_{ {\textbf{y}_1^\textsf{T}} }, }..., \phi \!\left( \!{{{f^{l-1}\!(\textbf{x};\!\textbf{w})}}}\!\right) \!, \phi \!\left( \!{{{f^{l}\!(\textbf{x};\!\textbf{w})}}}\!\right) }_{ {\textbf{y}_{l}^\textsf{T}} },..., \phi \!\left( \!{{{f^{L-1}\!(\textbf{x};\!\textbf{w})}}}\!\right) \! \Big ] \end{aligned}$$
(29)

where the first element is the unit-constant output of node 0 associated with bias \(\lambda _{l,0}\) in hidden-weight vector \(\textbf{w}_l\), \(l \!=\! 1,...,L-1\). As mentioned at the end of Sect. 3.1 for type (a) of Eq. (28), \(\textbf{a}_j \textbf{a}_j^\textsf{T} \!=\! \textbf{y}\textbf{y}^\textsf{T}\) (for any j); hence, we only need one block \(\textbf{y}\textbf{y}^\textsf{T}\), which is easy to obtain just after the forward pass for \(\textbf{y}\) above.

Then, by backward pass, we evaluate \(\textbf{b}_j \! \equiv \! \frac{\partial r_j}{\partial \textbf{w}_h}\) on each datum by stage-wise back-propagation of the q-length vector \(\varvec{\beta }_l\) defined below: the (hidden) node-input sensitivities [compare \(\delta _l\) in Eq. (13)] of the residual vector \(\textbf{r}_p\) of length q in Eq. (1), on any datum p, with respect to \(f^l(\textbf{x}; \textbf{w})\), the net input to the hidden node at stage l, \(0<l<L\), defined in Eq. (8):

$$\begin{aligned} \underbrace{ \varvec{\beta }_l }_{q \times 1} {\mathop {=}\limits ^\textrm{def}} \frac{\partial \textbf{r}_p}{\partial f^l(\textbf{x}(p); \textbf{w})}, \end{aligned}$$
(30)

which is recursively computed by the backward formula below analogous to Eq. (15), starting from \(\varvec{\beta }_{L-1}\) down to \(\varvec{\beta }_1\) for \(l \!=\! L \!-\! 1, L \!-\! 2, ..., 1\), with \(\phi _l ' (.) \! \equiv \! \phi ' ({{{f^l(\textbf{x};\textbf{w})}}})\) and \(e \equiv n + L - 1\) for the boundary value, \(\varvec{\beta }_{L-1}\), below right:

$$\begin{aligned} \underbrace{ \varvec{\beta }_l }_{q \times 1} = \phi _l ' (. ) \! \left\{ \! \left[ \! \begin{array}{c} \lambda ^{(1)}_{L,n+l} \\ \vdots \\ \lambda ^{(j)}_{L,n+l} \\ \vdots \\ \lambda ^{(q)}_{L,n+l} \\ \end{array} \! \right] \! + \! \sum ^{L-1}_{s=l+1} \lambda _{s,n+l} \cdot \varvec{\beta }_s \! \right\} ~~ \text {with} ~~ \varvec{\beta }_{L-1} \!=\! \phi _{L-1} ' (. ) \left[ \! \begin{array}{c} \lambda ^{(1)}_{L,e} \\ \vdots \\ \lambda ^{(j)}_{L,e} \\ \vdots \\ \lambda ^{(q)}_{L,e} \\ \end{array} \! \right] \end{aligned}$$
(31)

According to \(\textbf{w}_h\) in Eq. (25), \(\textbf{b}_j \! \equiv \! \frac{\partial r_j}{\partial \textbf{w}_h}\) on any datum is also stage-wisely partitioned, consisting of \(L \!-\! 1\) vectors \(\textbf{v}^{(j)}_l \! \equiv \! \frac{\partial r_j}{\partial \textbf{w}_l}\), \(l \!=\! 1,2,...,L \!-\! 1\); that is, \(\textbf{v}^{(j)}_l\) is the \((n \!+\! l)\)-length gradient vector of the jth residual \(r_j\) with respect to \(\textbf{w}_l\). Thus, we may evaluate \(\textbf{b}_j\) in a stage-wise fashion by computing \(\textbf{v}^{(j)}_l\) at each stage l, as shown below, using \(\textbf{y}_l\) in Eq. (29) and \(\beta ^{(j)}_l\), \(j \!=\! 1,...,q\), the jth component of the q-vector \(\varvec{\beta }_l\) defined in Eq. (30):

$$\begin{aligned} \underbrace{ \textbf{v}^{(j)}_l }_{(n + l) \times 1} \! \equiv \frac{\partial r_j}{\partial \textbf{w}_l} = \beta ^{(j)}_l \textbf{y}_{l-1}, ~~~ j \!=\! 1,...,q; ~~~ l \!=\! 1,2,...,L \!-\! 1. \end{aligned}$$
(32)
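A minimal sketch of the backward recursion (31), with the same (hypothetical) data layout as in the earlier sketches, is given below:

```python
import numpy as np

def beta_sensitivities(f, W_hidden, W_out, n):
    """Backward recursion (31): the q-vector beta_l of Eq. (30) for each hidden
    stage l = L-1, ..., 1, given the net inputs f^1, ..., f^{L-1} from the
    forward pass; W_hidden[l-1] = w_l of Eq. (6), W_out of shape (q, n+L).
    Returns B of shape (q, L-1), whose column l-1 holds beta_l.
    """
    L, q = len(W_hidden) + 1, W_out.shape[0]
    B = np.zeros((q, L - 1))
    for l in range(L - 1, 0, -1):                  # backward over hidden stages
        acc = W_out[:, n + l].copy()               # terminal-weight column of Eq. (31)
        for s in range(l + 1, L):                  # contributions of later stages
            acc += W_hidden[s - 1][n + l] * B[:, s - 1]
        B[:, l - 1] = (1.0 - np.tanh(f[l - 1]) ** 2) * acc
    return B

# Eq. (32), only if the per-stage gradients were needed explicitly:
# v_l^(j) = B[j, l-1] * y[:n+l]; Sect. 3.3 shows how to avoid forming them.
```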

3.3 A Structure-Exploiting BP Strategy for Forming \(\textbf{U}\textbf{U}^\textsf{T}\)

Next for \(\textbf{U}\textbf{U}^\textsf{T}\) in Eq. (27), three block-matrix types (a) to (c) in Eq. (28) are partitioned as below, according to the partition of the weight-vector \(\textbf{w}\) shown in Eqs. (24) and (25):

$$\begin{aligned} \left\{ \begin{array}{l} \text {(a)}\,q\hbox { (green) blocks }\textbf{a}_j \textbf{a}_j^\textsf{T}, j \!=\! 1,...,q,\hbox { are identical to }\textbf{y}\textbf{y}^\textsf{T}\hbox { of size }C \times C. \\ \text {(b)}\, \widetilde{\textbf{b}}\widetilde{\textbf{b}}^\textsf{T}\hbox { is partitioned in }(L \!-\! 1) \times ( L \!-\! 1)\hbox { grid of size }h \times h; \hbox { hence, }\\ \text { there are }(L \!-\! 1)^2\hbox { partitioned (yellow) blocks in the }h \times h \hbox { submatrix }\widetilde{\textbf{b}}\widetilde{\textbf{b}}^\textsf{T}.\\ \text {(c)}\,q\hbox { blocks }\textbf{b}_j \textbf{a}_j^\textsf{T} \!=\! \textbf{b}_j \textbf{y}^\textsf{T}, j \!=\! 1,...,q,\hbox { are arranged in }(L \!-\! 1) \times q\hbox { grid of }\\ \text { size }\, h \times qC,\hbox { and the }h\hbox { rows of each } h \times C\hbox { block }\textbf{b}_j \textbf{y}^\textsf{T}\hbox { is divided by }(L\!-\!1); \\ \text { hence, totally }q(L \!-\! 1)\hbox { partitioned (pink) blocks in the } h \times qC\hbox { submatrix. } \\ \end{array} \right. \end{aligned}$$
(33)

More specifically, (b) and (c) are stage-wisely partitioned, conforming to the stage-wise partition of each \(\textbf{b}_j\), \(j \!=\! 1,...,q\), into \(\textbf{v}^{(j)}_l\) in Eq. (32) at each stage l, \(l \!=\! 1,...,(L \!-\! 1)\); hence, by definition of \(\textbf{v}^{(j)}_l\), each partitioned block in (b) and (c) is respectively given below using \(\textbf{y}_l\), the node-output vector of length \(n \!+\! l \!+\! 1\) at stage l shown in Eq. (29), and \(\beta ^{(j)}_l\), \(j \!=\! 1,...,q\), the jth component of the q-length vector \(\varvec{\beta }_l\) in Eq. (30):

$$\begin{aligned} \left\{ \begin{array}{l} {\displaystyle \text {(b)}\!~\sum ^q_{j=1} \! \textbf{v}_s^{(j)} \textbf{v}_{t}^{(j) \textsf{T}} = \sum ^q_{j=1} \! \beta ^{(j)}_s \beta ^{(j)}_t \!\! \underbrace{ \textbf{y}_{s-1} \textbf{y}_{t-1}^\textsf{T} }_{(n+s) \times (n+t)} = \underbrace{ \varvec{\beta }^\textsf{T}_s \varvec{\beta }_t }_{{\textrm{scalar}}} ~ \textbf{y}_{s-1} \textbf{y}_{t-1}^\textsf{T} }, ~~~ s \le t; \\ \text {(c)}~\textbf{v}^{(j)}_l \textbf{y}^\textsf{T} = \beta ^{(j)}_l \!\!\! \underbrace{ \textbf{y}_{l-1} \textbf{y}^\textsf{T} }_{(n+l) \times (n+L)}\!\!. \\ \end{array} \right. \end{aligned}$$
(34)

Notice here that we need to evaluate blocks of type (b) only for \(s \le t\) by symmetry. Furthermore, by definition of \(\textbf{y}_l\) in Eq. (29), where \(\textbf{y}\! \equiv \! \textbf{y}_{L-1}\) and \(C \!=\!n \!+\! L\), the outer product, \(\textbf{y}_{l-1} \textbf{y}^\textsf{T}\), in (c) is just a submatrix of the \(C \times C\) rank-one matrix \(\textbf{y}\textbf{y}^\textsf{T}\) in type (a) of Eq. (33). So is \(\textbf{y}_{s-1} \textbf{y}_{t-1}^\textsf{T}\) in (b). See below for their nested structure.

$$\begin{aligned} \left[ \! \begin{array}{c} \\ ~\! \underbrace{ \textbf{y}\textbf{y}^\textsf{T} }_{ {{{ \begin{array}{c} (n \!+\! L) \times (n \!+\! L) \\ = C \times C ~~~~ \\ \end{array} }}} } ~\! \\ \\ \\ \end{array} \! \right] = \left[ \! \begin{array}{c} \\ ~~ \underbrace{ \textbf{y}_{l-1} \textbf{y}^\textsf{T} }_{(n+l) \times (n+L)} ~~ \\ \\ \hline ~~~ \\ \end{array} \! \right] = \left[ \! \begin{array}{c|c} &{} \\ \underbrace{ \textbf{y}_{s-1} \textbf{y}_{t-1}^\textsf{T} }_{(n+s) \times (n+t)} &{} ~~~~~ \\ \hline &{} \\ &{} \\ \end{array} \! \right] \!. \end{aligned}$$
(35)
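The nesting in Eq. (35) is easy to confirm numerically (toy values; n, s, t, l chosen arbitrarily within their ranges):

```python
import numpy as np

# Every outer product needed for the blocks of types (b) and (c) in Eq. (34)
# is a leading submatrix of the single C x C outer product y y^T.
y = np.arange(1.0, 8.0)                  # e.g. C = n + L = 7, as in Fig. 2
yyT = np.outer(y, y)
n, s, t, l = 2, 2, 4, 3
assert np.allclose(yyT[: n + s, : n + t], np.outer(y[: n + s], y[: n + t]))
assert np.allclose(yyT[: n + l, :], np.outer(y[: n + l], y))
```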

Therefore, we evaluate only one outer product, \(\textbf{y}\textbf{y}^\textsf{T}\), of size \(C \times C\), in type (a) of Eq. (33) when we obtain \(\textbf{y}~\!(=\!\textbf{y}_{L-1})\) in Eq. (29) after the forward pass with Eqs. (8) and (12); then, there is no need to compute the other outer products for (b) and (c) because any of them is just a submatrix of \(\textbf{y}\textbf{y}^\textsf{T}\), as shown in Eq. (35). That is, we form the blocks of types (b) and (c) in Eqs. (28) and (33) by multiplying a submatrix of \(\textbf{y}\textbf{y}^\textsf{T}\) [in Eq. (35)] by a scalar without evaluating any \(\textbf{v}^{(j)}_l\) (or \(\textbf{b}_j\)) explicitly, as indicated in Eq. (34), for which we obtain each scalar quantity for (b) and (c) by the backward-pass equation (31) of \(\varvec{\beta }_l\). The procedure is summarized in Algorithm BPJJ below for computing only dense blocks of types (b) and (c) as well as one dense block \(\textbf{y}\textbf{y}^\textsf{T}\) of type (a):

Algorithm BPJJ
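The sketch below is our reconstruction of the procedure from Eqs. (29)-(35) on one data pattern (hypothetical names and data layout, as in the earlier sketches), returning only the dense blocks of \(\textbf{U}_p \textbf{U}_p^\textsf{T}\) together with \(\nabla E_p\):

```python
import numpy as np

def bpjj_blocks(x, t, W_hidden, W_out):
    """A minimal sketch (our reconstruction from Eqs. (29)-(35), not the
    authors' code) of the structure-exploiting procedure of Sect. 3.3 on one
    data pattern.  It returns the dense blocks of U_p U_p^T in Eq. (27)/(28):
      yyT    : the C x C block y y^T shared by all q type-(a) blocks;
      type_b : dict {(s, t): block}, s <= t, the stage-wise pieces of the
               h x h corner block, each beta_s^T beta_t * y_{s-1} y_{t-1}^T;
      type_c : dict {(l, j): block}, each beta_l^(j) * y_{l-1} y^T;
    together with grad E_p split into its terminal and hidden parts.
    """
    n, L, q = len(x), len(W_hidden) + 1, W_out.shape[0]

    # Forward pass, Eqs. (8) and (12); y is the node-output vector of Eq. (29).
    y = np.concatenate(([1.0], x))
    f = []
    for w_l in W_hidden:
        f.append(float(w_l @ y))
        y = np.append(y, np.tanh(f[-1]))
    r = W_out @ y - t                                   # residuals r_j

    yyT = np.outer(y, y)                                # type (a): one C x C block

    # Backward pass, Eq. (31): B[:, l-1] = beta_l.
    B = np.zeros((q, L - 1))
    for l in range(L - 1, 0, -1):
        acc = W_out[:, n + l].copy()
        for s in range(l + 1, L):
            acc += W_hidden[s - 1][n + l] * B[:, s - 1]
        B[:, l - 1] = (1.0 - np.tanh(f[l - 1]) ** 2) * acc

    S = B.T @ B                                         # scalars beta_s^T beta_t

    # Gradient, Eqs. (16)/(17): delta_l = sum_j r_j beta_l^(j) = (B^T r)_l.
    delta = B.T @ r
    grad_out = np.outer(r, y)                           # terminal blocks, Eq. (17)
    grad_hidden = [delta[l - 1] * y[: n + l] for l in range(1, L)]

    # Types (b) and (c) of Eq. (34): a scalar times a submatrix of y y^T.
    type_b = {(s, t): S[s - 1, t - 1] * yyT[: n + s, : n + t]
              for s in range(1, L) for t in range(s, L)}
    type_c = {(l, j): B[j, l - 1] * yyT[: n + l, :]
              for l in range(1, L) for j in range(q)}
    return yyT, type_b, type_c, (grad_out, grad_hidden)
```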

Appendix C.4 illustrates how Algorithm BPJJ works on the FCC network in Fig. 2, leading to the block-arrow matrix shown in Fig. 5a, juxtaposed in Fig. 5b with another block-arrow matrix associated with a 21-input 10-output five-stage FCC network for the cardiotocography benchmark problem (see [3, Table 2, p.309]).

Fig. 5

The sparsity pattern of each block arrow matrix \(\textbf{U}\textbf{U}^\textsf{T}\) obtainable from (a) two-input three-output five-stage FCC shown in Fig. 2 with \(C \!=\! n \!+\! L \!=\! 2 \!+\! 5 \!=\! 7\) in Eq.  (9), and (b) 21-input 10-output five-stage FCC with \(C \!=\! n \!+\! L \!=\! 21 \!+\! 5 \!=\! 26\). In both (a) and (b), the x-axis denotes the column size of each block. We need only dense blocks for block matrix factorization [29]

3.4 Algorithmic Complexity on an L-Stage FCC Network with n Inputs and q Outputs

To implement the Levenberg-Marquardt method [3, p.303] with Algorithm BPFCC, Cheng has developed the FCCNET program [3, Sec. 5]. We now show the difference in algorithmic complexity between Algorithm BPJJ (in Sect. 3.3) and Cheng’s FCCNET with node-wise Algorithm BPFCC (in Sect. 2.3) on multiple q-output FCC network learning. Table 1 summarizes our complexity analysis on each data pattern p, \(p \!=\! 1,...,N\), in three items: (1) cost for evaluating node outputs \(\textbf{y}\) and gradients \(\nabla E_p\); (2) cost for \(\textbf{U}_p \textbf{U}_p^\textsf{T}\) by Algorithm BPJJ; and (3) cost for \(\textbf{U}_p\) by BPFCC and \(\textbf{U}_p \textbf{U}_p^\textsf{T}\) by Eq. (3). Because FCCNET employs node-wise BPFCC, it follows that the whole time complexity analysis of FCCNET is similar to that of Algorithm BPMLP shown in Eq. (4), where M is the total number of weights; for the L-stage FCC model having n inputs, M is defined in Eqs. (7) and (9) such that \(O(M) \! \approx \! O(L^2 \!+\! nL)\). Therefore, Cheng wrote (see [3, p.300]) that BPFCC works in \(O(L^2 + nL)\) per pattern p for the single-output FCC network, and then claimed that the time complexity for a component of the gradient is O(1), where what he means by “gradient” is \(\nabla r_p\), the gradient of \(r_p\), rather than \(\nabla E_p\) in Eq. (1). It should be noted here that his complexity analysis is due to \(O(M) \! \approx \! O(L^2 + nL)\) in Eqs. (7) and (9), and it should not be confused with the conventional wisdom available in the discrete L-stage optimal control literature, where it is widely accepted that the so-called second-order stage-wise Newton methods [11, 14, 35, 38] compute a Newton step in O(L), just as the first-order Kelley–Bryson gradient methods [2, 10, 13, 22] find a steepest-descent step in O(L) stage-wisely. Hereafter, we follow Cheng’s cost analysis using both L and M for analyzing FCC networks of his type shown in Fig. 1a, as long as no confusion arises.

Table 1 (2) shows the time complexity of our structure-exploiting Algorithm BPJJ. Just after the forward pass (Line 1 of BPJJ) per pattern p, the BPJJ procedure first obtains only one \(C \times C\) block \(\textbf{y}\textbf{y}^\textsf{T}\) (in Line 2), the block of type (a) in Eq. (33), at the cost of \(O(C^2)\), where \(C \! \equiv \! n \!+\! L\), and then starts the backward process, evaluating \(\varvec{\beta }_{l}\) (in Line 3) by Eq. (31), while exploiting the special structure shown in Eqs. (35) and (53) for the other blocks of types (b) and (c). After \(\textbf{B}\) is obtained (in Line 5) by Eq. (31) in \(O(qL^2)\) as mentioned above, the procedure evaluates \(\textbf{B}\textbf{B}^\textsf{T}\), which requires \(q (L \!-\! 1)^2\) operations, and then forms the blocks of types (b) and (c) by Eq. (34) [see Line 6]; the cost for type (b) is \(O(h^2)\), where \(h \!=\! M \!-\! qC\), the number of hidden weights, in Eq. (9), and the cost for type (c) is \(O(h (M \!-\! h))\); hence, (b) and (c) together cost O(hM) per pattern p. In addition, the procedure evaluates \(\varvec{\delta }\!=\! \textbf{B}\textbf{r}_p\), which costs O(qL), and then computes \(\nabla E_p\) (see Line 4) at the cost of O(M) by Eq. (17) at end stage L and by Eq. (16) for \(l<L\) [e.g., see Appendix C.4]. On the other hand, as explained in Eq. (4) and in Sect. 2.4, Table 1 (3) shows that BPFCC obtains \(\textbf{G}\!=\! \textbf{U}_p^\textsf{T}\) of size \(q \times M\) by using Eq. (21) with \(\alpha _{l \rightarrow L}\) on each data pattern p (see also Line 6 in Algorithm BPFCC in Sect. 2.3); this costs O(qM). After that, Cheng’s FCCNET computes \(\nabla E_p\) by Eq. (23) at the cost O(qM), and forms \(\textbf{U}_p \textbf{U}_p^\textsf{T}\) by rank updates on \(\textbf{J}^\textsf{T} \textbf{J}\) by Eq. (3), which is the dominant cost \(O(q M^2)\). Overall, Table 1 indicates that BPJJ is superior to FCCNET with BPFCC.

Table 1 Comparison in algorithmic complexity per data pattern between Algorithms BPFCC and BPJJ for evaluating \(\textbf{U}_p \textbf{U}_p^\textsf{T}\) and \(\nabla E_p\) in Eq. (2), where M, the number of weights, is defined in Eq. (9) as \(M \!=\! h \!+\! qC\) with \(h \! \equiv \! (L \!-\! 1)(n \!+\! L/2)\) (hidden nonlinear weights) and \(C \! \equiv \! n \!+\! L\) (terminal linear ones)

Table 1 also shows that BPJJ obtains \(\textbf{B}\) [of \(\varvec{\beta }_{l}\) in Eq. (30)] by Eq. (31) in \(O(qL^2)\) on each data pattern [see Eq. (55)], while node-wise BPFCC evaluates \(\alpha ^{(j)}_{l \rightarrow L}\) at each repetition j, \(j \!=\! 1,..., q\), by Eq. (20) at the same cost; this is due to the special FCC structure of Cheng’s type (see Appendix A.1). Here, the two node sensitivities are related by

$$\begin{aligned} \alpha ^{(j)}_{l \rightarrow L} = \beta ^{(j)}_l, ~~~~~ \text {for each}~~ j \!=\! 1,..., q ~ ~ \text {and} ~ ~ l \!=\! 1,..., L \!-\! 1. \end{aligned}$$
(36)

This suggests that BPFCC could be modified to return \(\textbf{G}\!=\! \textbf{B}\), the node sensitivities [e.g., see Eq. (54)], rather than \(\textbf{G}\!=\! \textbf{U}^\textsf{T}\), the rows of \(\textbf{J}\), in order to follow Eq. (34); this simplification is probably the easiest way to improve BPFCC.

4 Numerical Results

In [3, Table 3, p.309], the multiple q-output (\(q>1\)) benchmark problems are investigated by the FCCNET program with BPFCC for various FCC models. Among them, we use the cardiotocography benchmark problem (2,126 data) involving 21 attributes (\(n \!=\! 21\)) and ten classes (\(q \!=\! 10\)) for our numerical test (to be shown in Sect. 4.1 below). After that, we discuss the numerical results available in [3, Table 4, p.309] and [45, Table II, p.1799].

4.1 Numerical Evidence Supporting our Cost Analysis in Table 1

We show the numerical evidence supporting our complexity analysis shown in Table 1 with a five-stage (\(L \!=\! 5\)) 21-input 10-output FCC model having 354 weights (\(M \!=\! 354\)); see Fig. 5b for the sparsity structure of the Gauss-Newton Hessian matrix for optimizing that FCC model. In [3, Table 3], the set of 2,126 (cardiotocography) data is split into two for training and testing, but we just use all 2,126 data in the batch mode to measure the execution (CPU) time for evaluating the Gauss-Newton Hessian matrix explicitly on a Windows-10 PC with i7-6500U CPU (2.6 GHz) and 8 GB (RAM).

Table 2 Speed comparison in the execution (CPU) time for evaluating the Gauss–Newton Hessian matrix of size \(354 \times 354\) (see Fig. 5b) in the cardiotocography benchmark problem (2126 data) between our BPJJ and Cheng’s FCCNET based on BPFCC, for which \(L \!=\! 5\) (stages); \(n \!=\! 21\) (inputs); \(C \!=\! n \!+\! L \!=\! 26\) (terminal weights); and \(M \!=\! h \!+\! qC \!=\! 354\) (total weights)

Table 2 compares the execution (CPU) time for forming the Gauss-Newton Hessian matrix \(\textbf{J}^\textsf{T} \textbf{J}\) between our BPJJ and Cheng’s FCCNET with BPFCC. The approximate flops ratio (with hidden constants totally ignored) gives just a rough estimate of the speed difference, suggesting that BPJJ can work about 18 times faster than FCCNET; in practice, it worked roughly six times faster in execution time (averaged over ten trials).

4.2 Comparison Between BPFCC and NBN

Cheng has employed BPFCC for a certain implementation of the Levenberg-Marquardt method [3, p.303], called FCCNET, and then highlighted in [3, Sec. 5.5] the comparison between FCCNET and the NBN algorithm [45, 46] with respect to the total training time in single-output (\(q \!=\! 1\)) problems (although NBN is developed for \(q > 1\)); in [3, Table 4, p.309], FCCNET worked much faster than NBN. However, the total training-time results therein are misleading because the total training time heavily depends on the implementation of the Levenberg-Marquardt method. Furthermore, both procedures must perform the same rank-updating by Eq. (3), which is much more time-consuming than forming \(\textbf{U}\), as shown in Table 1(c); see also steps (b) and (c) in Eq. (4) for MLP-learning. In other words, the execution time for one Levenberg-Marquardt step should be virtually equivalent between FCCNET and NBN for any single-output (\(q \!=\! 1\)) application, as clearly described in [45, p.1798] with the complexity analysis of NBN.

Since NBN is originally designed for multiple q-output problems, NBN can work faster than Cheng’s BPFCC for forming \(\textbf{U}^\textsf{T}\), the q rows of \(\textbf{J}\). Wilamowski and Yu compared Algorithms BPMLP and NBN in time complexity for 8-56-56 \((q \!=\! 56)\) MLP-learning; see [45, Table II, p.1799], where BPMLP is called “Hagan-Menhaj computation” due to [19]. Their cost analysis shows that NBN can form \(\textbf{U}\) roughly 17 times faster than BPMLP; needless to say, NBN can also work faster than BPFCC in forming \(\textbf{U}\). The point to note here is that both NBN and FCCNET (with BPFCC) must perform the same, most time-consuming, rank-updating by Eq. (3) [see also step (c) in Eq. (4)]; therefore, the overall time difference between NBN and FCCNET would be small per iteration. By contrast, our BPJJ is designed to reduce the dominant cost by avoiding the rank-updating of Eq. (3) altogether, without evaluating \(\textbf{U}\) explicitly.

4.3 Discussion

If the residual Jacobian matrix \(\textbf{J}\) of size \(m \times M\) in Eq. (2) can be stored explicitly, then QR factorization can be applied to \(\textbf{J}\) directly. Alternatively, one can avoid storing \(\textbf{J}\), due to memory concerns, by forming \(\textbf{J}^\textsf{T} \textbf{J}\!=\! \sum _p \textbf{U}_p \textbf{U}_p^\textsf{T}\) by Eq. (3); then, (modified) Cholesky factorization can be applied to \(\textbf{J}^\textsf{T} \textbf{J}\). In both cases, the regularization parameter \(\mu \) may be introduced for implementing the Levenberg-Marquardt method. In any event, the latter Cholesky approach works about twice as fast as the former QR approach (e.g., see [5]). In nonlinear least squares learning, the speed per epoch and memory saving are primary concerns; therefore, the (modified) Cholesky approach is much more popular [1, 19, 24, 25, 40, 46].
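Both options can be illustrated on a toy dense problem (our sketch with arbitrary sizes; production code would instead exploit the block structure described above):

```python
import numpy as np

# One damped Levenberg-Marquardt step computed two ways (toy dense example).
rng = np.random.default_rng(1)
m, M, mu = 40, 10, 1e-2
J = rng.standard_normal((m, M))
r = rng.standard_normal(m)

# (i) QR/least-squares on the augmented system [J; sqrt(mu) I] p ~ -[r; 0].
A = np.vstack([J, np.sqrt(mu) * np.eye(M)])
b = -np.concatenate([r, np.zeros(M)])
p_qr = np.linalg.lstsq(A, b, rcond=None)[0]

# (ii) Cholesky on the normal equations (J^T J + mu I) p = -J^T r.
C_low = np.linalg.cholesky(J.T @ J + mu * np.eye(M))
p_chol = np.linalg.solve(C_low.T, np.linalg.solve(C_low, -J.T @ r))

assert np.allclose(p_qr, p_chol)
```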

Algorithm BPJJ in Sect. 3.3 efficiently evaluates \(\textbf{J}^\textsf{T} \textbf{J}\!=\! \sum _p \textbf{U}_p \textbf{U}_p^\textsf{T}\) in a block-arrow matrix form (e.g., see Fig. 5) without evaluating \(\textbf{U}_p\) explicitly, for which only the dense blocks need to be formed, saving memory. As stated in Sect. 3.1, once the dense blocks of the desired block-arrow matrix are obtained, it is easy to exploit the matrix sparsity (e.g., see Fig. 5) for an efficient implementation of any subsequent optimization procedure; e.g., see block-arrow least squares [29] with (modified) block Cholesky or QR factorization for trust-region Levenberg-Marquardt methods [6,7,8, 26] (see also [15, 21, 34, 39] for separable nonlinear regression by variable projection methods). This is an important additional benefit beyond the result of Table 1, and it arises commonly in the multi-output applications in [3, 9, 20, 45]; the application in [23] is of future interest to us for constructing an FCC network with a set of cascaded hidden nodes representing a chain of joints of a robotic manipulator.

5 Conclusion

We have developed a new stage-wise BP procedure (Algorithm BPJJ in Sect. 3.3) that exploits the special structure of multi-output fully connected cascaded (FCC) networks (see Fig. 2) for nonlinear least squares learning. Our stage-wise BPJJ is designed to reduce the dominant cost of the rank-update operations in Eq. (3) without explicitly evaluating any rows of the residual Jacobian matrix on each data pattern, and thus works faster than Cheng’s node-wise BPFCC [3, p.300] (as well as NBN [45] and the widely-employed BPMLP in Sect. 1).

We hope that the various BP aspects (e.g., stage-wise versus node-wise BP) discussed in this paper will prove useful for the further development of efficient BP procedures for (deep) neural network learning.