Recurrent Neural Networks (RNN) have been developed for a better understanding and analysis of open dynamical systems. Still, the question often arises whether RNN are able to map every open dynamical system, which would be desirable for a broad spectrum of applications. In this article we give a proof of the universal approximation ability of RNN in state space model form and even extend it to Error Correction and Normalized Recurrent Neural Networks.
Keywords: Dynamical systems; system identification; recurrent neural networks; universal approximation.
Alternative descriptions of open dynamical systems are given in Corollary 2.
[Figure: Error Correction Neural Network (ECNN) unfolded over time, with weight matrices A, B, C, D, fixed connectors −I, bias θ, states s_t and z_t, external inputs x_t, targets y_t^d and outputs y_t.]
the training itself is unstable due to the concatenated matrices A, B, and C. As the training changes the weights in all of these matrices, different effects or tendencies, even opposing ones, can influence them and may superpose. This implies that no clear learning direction or change of weights results from a certain backpropagated error.1

Normalized Recurrent Neural Networks (NRNN) (Eqs. (5)) avoid the stability and learning problems resulting from the concatenation of the three matrices A, B, and C because, besides fixed identity matrices I of appropriate dimensions and the bias θ, they incorporate only a transition matrix A:

s_t = f(A s_{t−1} + [0 0 I]^T x_t − θ)
y_t = [I 0 0] s_t          (5)

Hence, by using NRNN the modeling is solely focused on the transition matrix A. The matrices between input and hidden as well as hidden and output layer are fixed and therefore not changed during the training process. Consequently, matrix A does not only code the autonomous and the externally driven parts of the dynamics, but also the (pre-)processing of the external inputs x_t and the computation of the network outputs y_t. This implies that all free parameters, as they are combined in one matrix, are now treated the same way by BPTT. Figure 4 illustrates the corresponding architecture.

At first view it seems that in the network architecture (Fig. 4) the external input x_t is directly connected to the corresponding output y_t. This is not the case, because we enlarge the dimension of the internal state s_t such that the input x_t has no direct influence on the output y_t. Assuming that we have a number of I external inputs, Q computational hidden neurons and N outputs, the dimension of the internal state would be J := dim(s) = I + Q + N. With the matrix [I 0 0] we connect only the first N neurons of the internal state s_t to the output layer y_t. As this connector is not trained, it can be seen as a fixed identity matrix of appropriate size. Consequently, the NRNN is forced to generate its N outputs at the first N components of the state vector s_t. The last state neurons are used for the processing of the external inputs x_t. The connector [0 0 I]^T between the externals x_t and the internal state s_t is an appropriately sized fixed identity matrix. More precisely, the connector is designed such that the input x_t is fed only into the last I components of the state vector. To support the internal processing and to increase the network's computational power, we add a number of Q hidden neurons between the first N and the last I state neurons. This composition ensures that the input and output processing of the network is separated, but implies that NRNN can only be designed as large Neural Networks.1 Note that, in contrast to RNN (Sec. 2.1) and ECNN (Sec. 2.2), the output of the NRNN can be bounded by the activation function, e.g., by using the hyperbolic tangent we have y_t ∈ (−1, 1). Still, this is not a real constraint as we can simply scale the data appropriately before applying it to the network.

Our experiments indicate that, in comparison to RNN, NRNN often achieve better results22 and generally show a more stable training process, especially when the dimension of the internal state is very large.

Fig. 4. Normalized Recurrent Neural Network.
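To make the structure of Eqs. (5) concrete, the following minimal NumPy sketch performs one NRNN state update with the fixed identity connectors and the state partition [outputs | hidden | inputs] described above. The concrete dimensions N, Q, I and the random matrices are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Illustrative dimensions (assumptions): N outputs, Q hidden neurons, I inputs.
N, Q, I = 2, 5, 3
J = N + Q + I                            # dim(s) = I + Q + N (Sec. 2.3)

rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(J, J))   # the only trained matrix
theta = rng.normal(scale=0.1, size=J)    # bias

# Fixed, untrained identity connectors.
in_connector = np.zeros((J, I)); in_connector[N + Q:, :] = np.eye(I)   # [0 0 I]^T
out_connector = np.zeros((N, J)); out_connector[:, :N] = np.eye(N)     # [I 0 0]

def nrnn_step(s_prev, x_t, f=np.tanh):
    """One NRNN transition: s_t = f(A s_{t-1} + [0 0 I]^T x_t - theta), y_t = [I 0 0] s_t."""
    s_t = f(A @ s_prev + in_connector @ x_t - theta)
    y_t = out_connector @ s_t
    return s_t, y_t

s = np.zeros(J)
for x in rng.normal(size=(4, I)):        # a short toy input sequence
    s, y = nrnn_step(s, x)
print("output y_t:", y)                  # lies in (-1, 1) because of tanh
```

Only A and θ would be adapted during training; the two connectors stay fixed, which is exactly what restricts the modeling to the transition matrix.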
3. Universal Approximation by Feedforward Neural Networks

Our proof for RNN in state space model form (Sec. 4) is based on the work of Hornik, Stinchcombe and White.11 In the following we therefore recall their definitions and main results:
Definition 1. Let A^I with I ∈ N be the set of all affine mappings A(x) = w · x − θ from R^I to R with w, x ∈ R^I and θ ∈ R. '·' denotes the scalar product.

Transferred to Neural Networks, x corresponds to the input, w to the network weights and θ to the bias.

Definition 2. For any (Borel-)measurable function f(·): R → R and I ∈ N be Σ^I(f) the class of functions NN: R^I → R of the form

NN(x) = Σ_{j=1}^J v_j f(A_j(x)),

where J ∈ N, v_j ∈ R and A_j ∈ A^I.
Remark 1. The function class Σ^I(f) can also be written in matrix form

NN(x) = v f(Wx − θ)

where x ∈ R^I, v, θ ∈ R^J, and W ∈ R^{J×I}. In this context the computation of the function f(·): R^J → R^J be defined component-wise, i.e.,

f(Wx − θ) := (f(W_1 · x − θ_1), . . . , f(W_j · x − θ_j), . . . , f(W_J · x − θ_J))^T

where W_j (j = 1, . . . , J) denotes the jth row of the matrix W.

Definition 3. A function f is called a sigmoid function, if f is monotonically increasing and bounded, i.e., f(a) ∈ [α, β] with lim_{a→−∞} f(a) = α and lim_{a→∞} f(a) = β, where α, β ∈ R and α < β.
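As an illustration of Remark 1 (and of the sigmoid requirement of Def. 3), the following sketch evaluates one member of the class Σ^I(f) in matrix form, with tanh as the sigmoid. The dimensions and the randomly drawn weights are assumptions chosen only for the example.

```python
import numpy as np

def sigma_I_member(x, v, W, theta, f=np.tanh):
    """Evaluate NN(x) = v . f(W x - theta), the matrix form of Remark 1.

    x: input in R^I; W: hidden weights (J x I); theta, v: in R^J;
    f: a sigmoid in the sense of Def. 3 (tanh is monotone and bounded).
    """
    return v @ f(W @ x - theta)

# Toy instance (assumed sizes): I = 3 inputs, J = 10 hidden neurons.
rng = np.random.default_rng(1)
I, J = 3, 10
W = rng.normal(size=(J, I))
theta = rng.normal(size=J)
v = rng.normal(size=J)

x = rng.normal(size=I)
print(sigma_I_member(x, v, W, theta))    # a single real-valued output, NN: R^I -> R
```

Theorem 1 below asserts that, by choosing J, W, θ and v appropriately, functions of exactly this form can approximate any target in C^I or M^I.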
Definition 4. Let C^I and M^I be the sets of all continuous and respectively all Borel-measurable functions from R^I to R. Further denote B^I the Borel-σ-algebra of R^I and (R^I, B^I) the I-dimensional Borel-measurable space.

M^I contains all functions relevant for applications. C^I is a subset of it. Consequently, for every Borel-measurable function f the class Σ^I(f) belongs to the set M^I and for every continuous f to its subset C^I.

Definition 7. Given a probability measure µ on (R^I, B^I), the metric ρ_µ: M^I × M^I → R^+ be defined as follows

ρ_µ(f, g) = inf{ε > 0 : µ{x : |f(x) − g(x)| > ε} < ε}.

Theorem 1 (Universal Approximation Theorem for Feedforward Neural Networks). For any sigmoid activation function f, any dimension I and any probability measure µ on (R^I, B^I), Σ^I(f) is uniformly dense on a compact domain in C^I and ρ_µ-dense in M^I.

This theorem states that a three-layered Feedforward Neural Network, i.e., a Feedforward Neural Network with one hidden layer, is able to approximate any continuous function uniformly on a compact domain and any measurable function in the ρ_µ-metric with an arbitrary accuracy. The proposition is independent of the applied sigmoid activation function f (Def. 3), the dimension of the input space I, and the underlying probability measure µ. Consequently, three-layered Feedforward Neural Networks are universal approximators.
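The ρ_µ-metric of Def. 7 can be made tangible with a small Monte Carlo estimate: sample x from µ and search for the smallest ε such that the empirical probability of |f(x) − g(x)| > ε falls below ε. The sketch below, with an assumed Gaussian µ and two toy functions, is only meant to illustrate the definition, not to reproduce any computation from the paper.

```python
import numpy as np

def rho_mu_estimate(f, g, sample):
    """Monte Carlo estimate of rho_mu(f, g) = inf{eps > 0 : mu{|f - g| > eps} < eps}."""
    diffs = np.sort(np.abs(f(sample) - g(sample)))
    n = diffs.size
    # For a threshold eps = diffs[k], the empirical exceedance probability
    # mu{|f - g| > eps} is roughly (n - k - 1) / n; return the smallest such eps
    # for which that probability drops below eps.
    exceed = (n - np.arange(1, n + 1)) / n
    ok = exceed < diffs
    return diffs[np.argmax(ok)] if ok.any() else np.inf

# Toy example (assumptions): mu = standard normal on R, f = tanh, g = identity.
rng = np.random.default_rng(2)
x = rng.normal(size=100_000)
print(rho_mu_estimate(np.tanh, lambda t: t, x))
```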
Theorem 1 is only valid for Feedforward Neural Networks with I input- and a single output-neuron. Accordingly, only functions from R^I to R can be approximated. However, with a simple extension it can be shown that the theorem holds for Networks with multiple outputs (Cor. 1). For this, the set of all continuous functions from R^I to R^N, I, N ∈ N, be denoted by C^{I,N} and the one of (Borel-)measurable functions from R^I to R^N by M^{I,N} respectively. The function class Σ^I(f) gets extended to Σ^{I,N}(f) by (re-)defining the weights v_j as vectors in R^N. Consequently, three-layered multi-output Feedforward Networks are universal approximators for vector-valued functions.

4. Universal Approximation by RNN

The universal approximation theorem for Feedforward Neural Networks (Theo. 1) proves that any (Borel-)measurable function can be approximated by a three-layered Feedforward Neural Network. We now show that RNN in state space model form (Eq. (2)) are also universal approximators and able to approximate every open dynamical system (Eq. (1)) with an arbitrary accuracy.

Definition 8. For any (Borel-)measurable function f(·): R^J → R^J and I, N, T ∈ N be RNN^{I,N}(f) the class of functions

s_{t+1} = f(A s_t + B x_t − θ)
y_t = C s_t.

Thereby be x_t ∈ R^I, s_t ∈ R^J, and y_t ∈ R^N, with t = 1, . . . , T. Further be the matrices A ∈ R^{J×J}, B ∈ R^{J×I}, and C ∈ R^{N×J} and the bias θ ∈ R^J. In the following, analogue to Remark 1, the calculation of the function f be defined component-wise, i.e.,

s_{t+1,j} = f(A_j s_t + B_j x_t − θ_j),

where A_j and B_j (j = 1, . . . , J) denote the jth row of the matrices A and B respectively.

It is obvious that the class RNN^{I,N}(f) is equivalent to the RNN in state space model form (Eq. (2)). Analogue to its description in Sec. 2.1 as well as Definition 2, I stands for the number of input-neurons, J for the number of hidden-neurons and N for the number of output-neurons.
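A minimal sketch of an element of RNN^{I,N}(f) from Definition 8, unrolled for T steps; the concrete sizes and the random weights are assumptions chosen for illustration.

```python
import numpy as np

def rnn_rollout(A, B, C, theta, xs, s0=None, f=np.tanh):
    """Unroll s_{t+1} = f(A s_t + B x_t - theta), y_t = C s_t for t = 1..T (Def. 8)."""
    J = A.shape[0]
    s = np.zeros(J) if s0 is None else s0
    outputs = []
    for x in xs:                      # xs has shape (T, I)
        outputs.append(C @ s)         # y_t = C s_t
        s = f(A @ s + B @ x - theta)  # s_{t+1}
    return np.array(outputs)

# Toy dimensions (assumed): I = 2 inputs, J = 6 hidden neurons, N = 1 output.
rng = np.random.default_rng(3)
I, J, N, T = 2, 6, 1, 10
A = rng.normal(scale=0.3, size=(J, J))
B = rng.normal(scale=0.3, size=(J, I))
C = rng.normal(scale=0.3, size=(N, J))
theta = rng.normal(scale=0.1, size=J)

ys = rnn_rollout(A, B, C, theta, rng.normal(size=(T, I)))
print(ys.shape)   # (T, N): one N-dimensional output per time step
```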
Theorem 2 (Universal Approximation Theorem for RNN). Let g(·): R^J × R^I → R^J be measurable and h(·): R^J → R^N be continuous, the external inputs x_t ∈ R^I, the inner states s_t ∈ R^J, and the outputs y_t ∈ R^N (t = 1, . . . , T). Then, any open dynamical system of the form

s_{t+1} = g(s_t, x_t)
y_t = h(s_t)          (6)

can be approximated by an element of the function class RNN^{I,N}(f) (Def. 8) with an arbitrary accuracy, where f is a continuous sigmoidal activation function (Def. 3).

Proof. The proof is given in two steps. Thereby the equations of the open dynamical system (6) are traced back to the representation by a three-layered Feedforward Neural Network.

In the first step, we conclude that the state space equation of the open dynamical system, s_{t+1} = g(s_t, x_t), can be approximated by a Neural Network of the form s̄_{t+1} = f(A s̄_t + B x_t − θ) for all t = 1, . . . , T.

Let now be ε > 0 and K ⊂ R^J × R^I a compact set, which contains (s_t, x_t) and (s̄_t, x_t) for all t = 1, . . . , T. From the universal approximation theorem for Feedforward Networks (Theo. 1) and the subsequent corollary (Cor. 1) we know that for any measurable function g(s_t, x_t): R^J × R^I → R^J and for an arbitrary δ > 0, a function

NN(s_t, x_t) = V f(W s_t + B̄ x_t − θ̄),

with weight matrices V ∈ R^{J×J̄}, W ∈ R^{J̄×J}, B̄ ∈ R^{J̄×I}, a bias θ̄ ∈ R^{J̄}, and a component-wise applied continuous sigmoid activation function f (Rem. 1) exists, such that for all t = 1, . . . , T

sup_{(s_t, x_t) ∈ K} ‖g(s_t, x_t) − NN(s_t, x_t)‖_∞ < δ.          (7)

As f is continuous and T finite, there exists a δ > 0, such that according to the ε-δ-criterion we get out of Eq. (7) that for the dynamics

s̄_{t+1} = V f(W s̄_t + B̄ x_t − θ̄)
the following condition holds

‖s_t − s̄_t‖_∞ < ε  ∀ t = 1, . . . , T.          (8)
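As a purely numerical illustration of this first step (not part of the original proof), the sketch below takes a dynamics g that is itself of feedforward form, perturbs its output weights slightly to obtain an approximating network NN, and tracks the deviation between the two state trajectories over a finite horizon T, which is the quantity that condition (8) bounds. All concrete numbers are assumptions for a toy, weakly coupled example.

```python
import numpy as np

rng = np.random.default_rng(4)
J, I, Jbar, T = 4, 2, 8, 20
V     = rng.normal(scale=0.2, size=(J, Jbar))
W     = rng.normal(scale=0.2, size=(Jbar, J))
Bbar  = rng.normal(scale=0.2, size=(Jbar, I))
thbar = rng.normal(scale=0.1, size=Jbar)

def g(s, x):
    # "True" dynamics, here itself of feedforward form so that a delta-close
    # approximant is easy to construct for the demonstration.
    return V @ np.tanh(W @ s + Bbar @ x - thbar)

delta = 1e-3                                   # size of the weight perturbation
V_nn = V + delta * rng.standard_normal(V.shape)
def nn(s, x):                                  # the approximating network NN(s, x)
    return V_nn @ np.tanh(W @ s + Bbar @ x - thbar)

xs = rng.normal(size=(T, I))
s = s_bar = np.zeros(J)
gaps = []
for x in xs:
    s, s_bar = g(s, x), nn(s_bar, x)           # iterate both dynamics in parallel
    gaps.append(np.max(np.abs(s - s_bar)))     # sup-norm deviation at time t

print("delta:", delta, " max trajectory gap:", max(gaps))
```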
Using again Theorem 1, Equation (11) can be approximated by
5. Universal Approximation by ECNN

Based on the universal approximation theorem for RNN (Sec. 4) we show that the result holds for ECNN approximating open dynamical systems with error correction (Eq. (3)).

Definition 10. For any (Borel-)measurable function f(·): R^J → R^J and I, N, T ∈ N be ECNN^{I,N}(f) the class of functions

s_{t+1} = f(A s_t + B x_t + D(y_t − y_t^d) − θ)
y_t = C s_t.

Thereby be x_t ∈ R^I, s_t ∈ R^J, and y_t, y_t^d ∈ R^N, with t = 1, . . . , T. Further be the matrices A ∈ R^{J×J}, B ∈ R^{J×I}, C ∈ R^{N×J}, and D ∈ R^{J×N} and the bias θ ∈ R^J. Again the calculation of the function f be defined component-wise (Def. 8).

The class ECNN^{I,N}(f) only differs from the ECNN in state space model form (Eq. (4)) in the missing hyperbolic tangent, i.e., the nonlinear transformation of the error-correction term (y_{t−1} − y_{t−1}^d), which has been added to the ECNN for numerical reasons (Sec. 2.2). It has been left out in Definition 10 because it is not required for the theoretical analysis of system approximation; it has simply proved to produce better and especially more stable results in our practical applications.10 Analogue to its description in Sec. 2.2 as well as Definition 8, I stands for the number of input-neurons, J for the number of hidden-neurons and N for the number of output-neurons. x_t denotes the external inputs, s_t the inner states, and y_t the outputs of the Neural Network. The matrices A, B, C, and D correspond to the weight-matrices between hidden- and hidden-, input- and hidden-, hidden- and output-, and error correction- and hidden-neurons respectively. f is an arbitrary activation function.
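A minimal sketch of one transition of the class ECNN^{I,N}(f) from Definition 10, where the deviation between the previous model output and the observed target y_t^d enters as an additional input; sizes and weights are illustrative assumptions.

```python
import numpy as np

def ecnn_step(s, x, y_target, A, B, C, D, theta, f=np.tanh):
    """One step of Def. 10: the error (y_t - y_t^d) feeds back into the next state."""
    y = C @ s                                        # y_t = C s_t
    s_next = f(A @ s + B @ x + D @ (y - y_target) - theta)
    return s_next, y

# Toy dimensions (assumed): I = 2 inputs, J = 5 states, N = 1 output.
rng = np.random.default_rng(5)
I, J, N = 2, 5, 1
A = rng.normal(scale=0.3, size=(J, J)); B = rng.normal(scale=0.3, size=(J, I))
C = rng.normal(scale=0.3, size=(N, J)); D = rng.normal(scale=0.3, size=(J, N))
theta = rng.normal(scale=0.1, size=J)

s = np.zeros(J)
for x, yd in zip(rng.normal(size=(8, I)), rng.normal(size=(8, N))):
    s, y = ecnn_step(s, x, yd, A, B, C, D, theta)
print("last output:", y)
```

When the model is accurate, y_t − y_t^d vanishes and the update reduces to the ordinary RNN transition of Definition 8.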
Proof. Analogue to Theorem 2. The error correction term simply serves as an additional input and therefore does not substantially change the derivations.

6. Universal Approximation by NRNN

We finally extend the universal approximation ability of standard RNN to their normalized version. As we pointed out in Sec. 2.3, NRNN play an important role for the construction of large Neural Networks.1 Besides that, due to the stability increase, i.e., the reduction to one single transition matrix, they often achieve better results in terms of forecast precision than standard RNN.22

In the following we show that RNN are a subset of NRNN by proving that every RNN can be represented by an NRNN. This already implies that the universal approximation theorem for RNN (Theo. 2) also holds for NRNN.

Definition 11. For any (Borel-)measurable function f(·): R^J → R^J and I, N, T ∈ N be NRNN^{I,N}(f) the class of functions

s_t = f(A s_{t−1} + [0 0 I_I]^T x_t − θ)
y_t = [I_N 0 0] s_t          (14)

Thereby be x_t ∈ R^I, s_t ∈ R^J, and y_t ∈ R^N, with t = 1, . . . , T. Further be matrix A ∈ R^{J×J}, and the bias θ ∈ R^J. For the fixed identity matrices we have [0 0 I_I]^T ∈ R^{J×I} and [I_N 0 0] ∈ R^{N×J}, where the index i of the identity matrix I_i denotes its dimension. As before the calculation of the function f be defined component-wise (Def. 8).
The class NRNN^{I,N}(f) is equivalent to NRNN (Eq. (5)). Analogue to its description in Sec. 2.3 as well as Definition 8, I stands for the number of input-neurons, J for the number of hidden-neurons and N for the number of output-neurons. x_t denotes the external inputs, s_t the inner states and y_t the outputs of the Neural Network. Matrix A corresponds to the weight-matrix between hidden- and hidden-neurons. f is an arbitrary activation function.

Lemma 1. The class RNN_3^{I,N}(f) (Def. 9) is a subset of the class NRNN^{I,N}(f) in the sense that every element of RNN_3^{I,N}(f) can be represented by an element of NRNN^{I,N}(f) with an arbitrary accuracy, given the assumption that the input x and the output y of the Networks can be scaled such that for an arbitrary ε > 0 we have

(i) ‖x_t − f(x_t)‖_∞ < ε  ∀ t = 1, . . . , T
(ii) ‖y_t − f(y_t)‖_∞ < ε  ∀ t = 1, . . . , T.

Note that the assumption in Lemma 1 is not a major restriction, as for most of the sigmoidal activation functions, like the hyperbolic tangent, the condition is implicitly fulfilled for small values of x and y because those functions are approximately linear around the origin. Besides, in NRNN the output is anyway bounded by the activation function (Sec. 2.3). This generally also applies to the input, as in most applications one does not distinguish between input and output variables but simply considers observable data.1

Proof. Let I and N be arbitrary but fixed. We prove the theorem by showing that by reconstructing the equations of class NRNN^{I,N}(f) we can represent any element of RNN_3^{I,N}(f) with arbitrary matrices Ã, B̃, and C̃.

We start by defining the inner state vector s_t ∈ R^J in NRNN^{I,N}(f) as follows

s_t := (s^y, s^h, s^x)_t^T          (15)

where s^x ∈ R^I represents the network's input-, s^h ∈ R^Q the hidden-, and s^y ∈ R^N the output-part (respectively neurons) of s_t and I + Q + N = J. We further set the single transition matrix of NRNN^{I,N}(f)

A := [0 C̃ 0; 0 Ã B̃; 0 0 0]   and the bias   θ := (0, θ̃, 0)^T

with Ã ∈ R^{Q×Q}, B̃ ∈ R^{Q×I}, C̃ ∈ R^{N×Q} and θ̃ ∈ R^Q. In doing so Eq. (14) can be transformed into

(s^y, s^h, s^x)_t^T = f([0 C̃ 0; 0 Ã B̃; 0 0 0] (s^y, s^h, s^x)_{t−1}^T + (0, 0, I_I)^T x_t − (0, θ̃, 0)^T)
y_t = [I_N 0 0] (s^y, s^h, s^x)_t^T          (16)

From the output equation we have y_t = s^y_t. Writing Eq. (16) component-wise we derive

s^y_t = f(C̃ s^h_{t−1})
s^h_t = f(Ã s^h_{t−1} + B̃ s^x_{t−1} − θ̃)          (17)
s^x_t = f(x_t).

According to our assumptions (i) and (ii) and y_t = f(C̃ s^h_{t−1}), we have for ε > 0

‖f(y_t) − f(C̃ s^h_{t−1})‖_∞ < ε          (18)
and ‖s^x_t − x_t‖_∞ < ε.          (19)

Out of Eq. (18) in conjunction with assumption (i) and f being a continuous sigmoid function then follows

‖y_t − C̃ s^h_{t−1}‖_∞ < ε.          (20)

Finally, for ε → 0 Eqs. (19) and (20) imply that Eqs. (17) are an element of RNN_3^{I,N}(f). Consequently, as the matrices Ã, B̃, and C̃ are arbitrary, we can represent any element of RNN_3^{I,N}(f) by one of NRNN^{I,N}(f).
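The block construction in the proof can be checked mechanically: the sketch below builds the NRNN transition matrix A and bias θ from arbitrary Ã, B̃, C̃, θ̃ as above and verifies that one NRNN step of Eq. (14) reproduces the component-wise recursion (17) exactly. Dimensions and weights are again illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
N, Q, I = 2, 4, 3
J = N + Q + I
A_t = rng.normal(size=(Q, Q)); B_t = rng.normal(size=(Q, I))   # A-tilde, B-tilde
C_t = rng.normal(size=(N, Q)); th_t = rng.normal(size=Q)       # C-tilde, theta-tilde
f = np.tanh

# Block transition matrix and bias as in the proof of Lemma 1.
A = np.zeros((J, J))
A[:N, N:N + Q] = C_t                    # row block (0  C~  0)
A[N:N + Q, N:N + Q] = A_t               # row block (0  A~  B~)
A[N:N + Q, N + Q:] = B_t
theta = np.concatenate([np.zeros(N), th_t, np.zeros(I)])
in_conn = np.zeros((J, I)); in_conn[N + Q:, :] = np.eye(I)      # [0 0 I_I]^T

# One NRNN step, Eq. (14), from a random previous state and input.
s_prev = rng.normal(size=J)
x_t = rng.normal(size=I)
s_t = f(A @ s_prev + in_conn @ x_t - theta)

# The same quantities computed directly from the recursion (17).
sy_prev, sh_prev, sx_prev = s_prev[:N], s_prev[N:N + Q], s_prev[N + Q:]
sy = f(C_t @ sh_prev)
sh = f(A_t @ sh_prev + B_t @ sx_prev - th_t)
sx = f(x_t)

assert np.allclose(s_t, np.concatenate([sy, sh, sx]))
print("block NRNN step matches Eq. (17)")
```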
Theorem 4 (Universal Approximation Theorem for NRNN). Let g(·): R^J × R^I → R^J be measurable and h(·): R^J → R^N be continuous, the external inputs x_t ∈ R^I, the inner states s_t ∈ R^J, and the outputs y_t ∈ R^N (t = 1, . . . , T). Then, any open dynamical system of the form

s_{t+1} = g(s_t, x_t)
y_{t+1} = h(s_t)

can be approximated by an element of the function class NRNN^{I,N}(f) (Def. 11) with an arbitrary accuracy, where f is a continuous sigmoidal activation function (Def. 3).
Proof. Follows directly out of Lemma 1 and Corollary 2.

7. Conclusion

In this article we gave a proof for the universal approximation ability of RNN in state space model form. After a short introduction into open dynamical systems, RNN in state space model form as well as ECNN and NRNN, we recalled the universal approximation theorem for Feedforward Neural Networks. Based on this result we proved that RNN in state space model form are able to approximate any open dynamical system with an arbitrary accuracy.

References

nal Processing: From Systems to Brain (MIT Press, Cambridge, MA, 2006).
4. J. F. Kolen and S. C. Kremer, A Field Guide to Dynamical Recurrent Networks, IEEE Press (2001).
5. L. R. Medsker and L. C. Jain, Recurrent neural networks: Design and application, Vol. 1, Comp. Intelligence (CRC Press International, 1999).
6. A. Soofi and L. Cao, Modeling and Forecasting Financial Data, Techniques of Nonlinear Dynamics, Kluwer Academic Publishers (2002).
7. H. G. Zimmermann and R. Neuneier, Neural network architectures for the modeling of dynamical systems, in J. F. Kolen and St. Kremer (eds.), A Field Guide to Dynamical Recurrent Networks, IEEE Press (2001), pp. 311–350.
8. D. E. Rumelhart, G. E. Hinton and R. J. Williams,
19. E. Sontag, Neural nets as systems models and controllers, in Proceedings of the Seventh Yale Workshop on Adaptive and Learning Systems, Yale University (1992), pp. 73–79.
20. D. P. Mandic and J. A. Chambers, Recurrent neural networks for prediction: Learning algorithms, architectures and stability, in S. Haykin (ed.), Adaptive and Learning Systems for Signal Processing, Communications and Control, John Wiley & Sons, Chichester (2001).
21. R. Neuneier and H. G. Zimmermann, How to train neural networks, in G. B. Orr and K. R. Mueller (eds.), Neural Networks: Tricks of the Trade, Springer Verlag, Berlin (1998), pp. 373–423.
22. A. M. Schaefer, S. Udluft and H. G. Zimmermann, Learning long term dependencies with recurrent neural networks, in Proceedings of the International Conference on Artificial Neural Networks (ICANN-06), Athens (2006).
23. H. G. Zimmermann, L. Bertolini, R. Grothmann, A. M. Schaefer and Ch. Tietz, A technical trading indicator based on dynamical consistent neural networks, in Proceedings of the International Conference on Artificial Neural Networks (ICANN-06), Athens (2006).