Neural Networks for Time Series Prediction

15-486/782: Artificial Neural Networks
Fall 2006
(based on earlier slides by Dave Touretzky and Kornel Laskowski)
What is a Time Series?
A sequence of vectors (or scalars) which depend on time t. In this
lecture we will deal exclusively with scalars:
   { x(t_0), x(t_1), ..., x(t_{i-1}), x(t_i), x(t_{i+1}), ... }

It is the output of some process P that we are interested in:

   P --> x(t)
Examples of Time Series
Dow-Jones Industrial Average
sunspot activity
electricity demand for a city
number of births in a community
air temperature in a building
These phenomena may be discrete or continuous.
Discrete Phenomena
Dow-Jones Industrial Average closing value each day
sunspot activity each day
Sometimes data have to be aggregated to get meaningful values.
Example:
births per minute might not be as useful as births per month
Continuous Phenomena
t is real-valued, and x(t) is a continuous signal.
To get a series {x[t]}, we must sample the signal at discrete points. In uniform sampling, if our sampling period is \Delta t, then

   {x[t]} = { x(0), x(\Delta t), x(2\Delta t), x(3\Delta t), ... }    (1)

To ensure that x(t) can be recovered from x[t], \Delta t must be chosen according to the Nyquist sampling theorem.
Nyquist Sampling Theorem
If f_max is the highest frequency component of x(t), then we must sample at a rate at least twice as high:

   1 / \Delta t = f_sampling > 2 f_max    (2)

Why? Otherwise we will see aliasing of frequencies in the range [f_sampling / 2, f_max].
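As a quick numerical illustration (mine, not from the slides), the sketch below samples a 7 Hz sine at 10 Hz, which is below the Nyquist rate, and shows that the resulting samples are indistinguishable from those of a 3 Hz alias:

    import numpy as np

    f_signal = 7.0       # Hz: highest frequency component of x(t)
    f_sampling = 10.0    # Hz: deliberately below the Nyquist rate 2 * f_signal
    dt = 1.0 / f_sampling

    n = np.arange(20)                                  # sample indices
    x_sampled = np.sin(2 * np.pi * f_signal * n * dt)  # undersampled 7 Hz sine
    x_alias = np.sin(2 * np.pi * 3.0 * n * dt)         # a 3 Hz sine sampled the same way

    # The 7 Hz signal aliases to |7 - 10| = 3 Hz; the two sample sequences
    # coincide up to a sign flip.
    print(np.allclose(x_sampled, -x_alias))            # True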
Studying Time Series
In addition to describing either discrete or continuous phenomena,
time series can also be deterministic vs stochastic, governed by linear
vs nonlinear dynamics, etc.
Time series are the focus of several overlapping disciplines:
Information Theory deals with describing stochastic time series.
Dynamical Systems Theory deals with describing and manipulating
mostly non-linear deterministic time series.
Digital Signal Processing deals with describing and manipulating
mostly linear time series, both deterministic and stochastic.
We will use concepts from all three.
Possible Types of Processing
predict future values of x[t]

classify a series into one of a few classes:
price will go up / price will go down / no change (e.g. to decide whether to sell now)

describe a series using a few parameter values of some model

transform one time series into another:
oil prices --> interest rates
The Problem of Predicting the Future
Extending backward from time t, we have the time series {x[t], x[t-1], ...}. From this, we now want to estimate x at some future time:

   \hat{x}[t + s] = f( x[t], x[t - 1], ... )

s is called the horizon of prediction. We will come back to this; in the meantime, let's predict just one time sample into the future, s = 1.

This is a function approximation problem. Here's how we'll solve it:

1. Assume a generative model.
2. For every point x[t_i] in the past, train the generative model with what preceded t_i as the Inputs and what followed t_i as the Desired output.
3. Now run the model to predict \hat{x}[t + s] from {x[t], ...}.
Embedding
Time is constantly moving forward. Temporal data is hard to deal
with...
If we set up a shift register of delays, we can retain successive
values of our time series. Then we can treat each past value as an
additional spatial dimension in the input space to our predictor.
This implicit transformation of a one-dimensional time vector into an infinite-dimensional spatial vector is called embedding.

The input space to our predictor must be finite. At each instant t, we truncate the history to only the previous d samples. d is called the embedding dimension.
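As an illustration (my own sketch, not part of the slides), here is one way to build the embedded input/target pairs for an embedding dimension d; the function name and the toy sine series are made up:

    import numpy as np

    def embed(x, d):
        """Turn a scalar series x into (inputs, targets) pairs, where each input
        row holds the previous d samples (x[t-1], ..., x[t-d]) and the target is x[t]."""
        X, y = [], []
        for t in range(d, len(x)):
            X.append(x[t - d:t][::-1])   # (x[t-1], x[t-2], ..., x[t-d])
            y.append(x[t])               # value to be predicted
        return np.array(X), np.array(y)

    x = np.sin(0.3 * np.arange(200))     # toy series
    X, y = embed(x, d=5)
    print(X.shape, y.shape)              # (195, 5) (195,)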
Using the Past to Predict the Future
[Diagram: a tapped delay line of delay elements holds x(t), x(t-1), x(t-2), ..., x(t-T); these feed a function f, which outputs the prediction x(t+1).]
Linear Systems
It's possible that P, the process whose output we are trying to predict, is governed by linear dynamics.

The study of linear systems is the domain of Digital Signal Processing (DSP). DSP is concerned with linear, translation-invariant (LTI) operations on data streams. These operations are implemented by filters. The analysis and design of filters effectively forms the core of this field.

Filters operate on an input sequence u[t], producing an output sequence x[t]. They are typically described in terms of their frequency response, i.e. low-pass, high-pass, band-stop, etc.

There are two basic filter architectures, known as the FIR filter and the IIR filter.
Finite Impulse Response (FIR) Filters
Characterized by q + 1 coefficients:

   x[t] = \sum_{i=0}^{q} \beta_i u[t - i]    (3)

FIR filters implement the convolution of the input signal with a given coefficient vector {\beta_i}.

They are known as Finite Impulse Response because, when the input u[t] is the impulse function, the output x is only as long as q + 1, which must be finite.
[Figure: three panels showing the impulse input, the FIR filter coefficients, and the resulting finite-duration response.]
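A minimal sketch (mine) of Eq. 3 using NumPy's convolution; the coefficient vector beta is an arbitrary illustrative choice:

    import numpy as np

    beta = np.array([0.5, 0.3, 0.15, 0.05])   # q + 1 = 4 coefficients (made-up values)
    u = np.zeros(30)
    u[0] = 1.0                                # impulse input

    # x[t] = sum_{i=0..q} beta[i] * u[t - i]: convolution truncated to the input length
    x = np.convolve(u, beta)[:len(u)]

    print(np.nonzero(x)[0])                   # [0 1 2 3]: only q + 1 samples are non-zero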
Infinite Impulse Response (IIR) Filters

Characterized by p coefficients:

   x[t] = \sum_{i=1}^{p} \alpha_i x[t - i] + u[t]    (4)

In IIR filters, the input u[t] contributes directly to x[t] at time t, but, crucially, x[t] is otherwise a weighted sum of its own past samples.

These filters are known as Infinite Impulse Response because, even though both the impulse function and the vector {\alpha_i} are finite in duration, the response only asymptotically decays to zero. Once one of the x[t]'s is non-zero, it will make non-zero contributions to future values of x[t] ad infinitum.
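And a matching sketch (again mine, with made-up coefficients) of the IIR recursion in Eq. 4; the impulse response decays but never becomes exactly zero:

    import numpy as np

    alpha = np.array([0.6, 0.25])   # p = 2 coefficients (made-up, chosen to be stable)
    u = np.zeros(30)
    u[0] = 1.0                      # impulse input

    x = np.zeros_like(u)
    for t in range(len(u)):
        # x[t] = sum_{i=1..p} alpha[i] * x[t - i] + u[t]
        x[t] = u[t] + sum(alpha[i] * x[t - 1 - i]
                          for i in range(len(alpha)) if t - 1 - i >= 0)

    print(x[:6])                    # [1.0, 0.6, 0.61, 0.516, ...] decays asymptotically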
FIR and IIR Differences

In DSP notation:

[Diagram: the FIR filter forms x[t] from the delayed inputs u[t], u[t-1], u[t-2], ..., u[t-q] weighted by \beta_0, ..., \beta_q; the IIR filter forms x[t] from u[t] plus its own delayed outputs x[t-1], x[t-2], ..., x[t-p] weighted by \alpha_1, ..., \alpha_p.]
DSP Process Models
We're interested in modeling a particular process, for the purpose of predicting future inputs.

Digital Signal Processing (DSP) theory offers three classes of possible linear process models:

Autoregressive (AR[p]) models
Moving Average (MA[q]) models
Autoregressive Moving Average (ARMA[p, q]) models
Autoregressive (AR[p]) Models
An AR[p] model assumes that at its heart is an IIR filter applied to some (unknown) internal signal, \epsilon[t]. p is the order of that filter.

   x[t] = \sum_{i=1}^{p} \alpha_i x[t - i] + \epsilon[t]    (5)

This is simple, but adequately describes many complex phenomena (e.g. speech production over short intervals).

If on average \epsilon[t] is small relative to x[t], then we can estimate x[t] using

   \hat{x}[t] = x[t] - \epsilon[t]    (6)
             = \sum_{i=1}^{p} w_i x[t - i]    (7)

This is an FIR filter! The w_i's are estimates of the \alpha_i's.
Estimating AR[p] Parameters
Batch version:

   \hat{x}[t] \approx x[t]    (8)
   \hat{x}[t] = \sum_{i=1}^{p} w_i x[t - i]    (9)

   \begin{bmatrix} x[p+1] \\ x[p+2] \\ \vdots \end{bmatrix}
   =
   \begin{bmatrix}
     x[1] & x[2] & \cdots & x[p] \\
     x[2] & x[3] & \cdots & x[p+1] \\
     \vdots & \vdots & & \vdots
   \end{bmatrix}
   \underbrace{\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_p \end{bmatrix}}_{w}    (10)

Can use linear regression. Or LMS.

Application: speech recognition. Assume that over small windows of time, speech is governed by a static AR[p] model. To learn w is to characterize the vocal tract during that window. This is called Linear Predictive Coding (LPC).
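As a concrete illustration (not from the slides), this sketch builds the regression problem of Eqs. 9-10 with rows ordered as (x[t-1], ..., x[t-p]) and solves it by least squares; the toy AR[2] data and its coefficients are invented:

    import numpy as np

    def fit_ar(x, p):
        """Least-squares estimate of w in x[t] ~= sum_{i=1..p} w_i * x[t-i]."""
        X = np.array([x[t - p:t][::-1] for t in range(p, len(x))])  # rows: (x[t-1], ..., x[t-p])
        y = x[p:]
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        return w

    # Toy data drawn from a known AR[2] process with a small noise term epsilon[t]
    rng = np.random.default_rng(0)
    alpha = np.array([0.6, 0.25])
    x = np.zeros(2000)
    for t in range(2, len(x)):
        x[t] = alpha @ x[t - 2:t][::-1] + 0.01 * rng.standard_normal()

    print(fit_ar(x, p=2))   # close to [0.6, 0.25]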
Estimating AR[p] Parameters
Incremental version (same equations):

   \hat{x}[t] \approx x[t]
   \hat{x}[t] = \sum_{i=1}^{p} w_i x[t - i]

For each sample, modify each w_i by a small \Delta w_i to reduce the sample squared error (x[t] - \hat{x}[t])^2. One iteration of LMS.

Application: noise cancellation. Predict the next sample \hat{x}[t] and generate -\hat{x}[t] at the next time step t. Used in noise-cancelling headsets for office, car, aircraft, etc.
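A hedged sketch (mine) of the incremental version: one LMS pass over a toy AR[2] series, with an invented learning rate eta:

    import numpy as np

    def lms_ar(x, p, eta=0.01):
        """One pass of LMS: after each sample, nudge each w_i to reduce
        the sample squared error (x[t] - x_hat[t])**2."""
        w = np.zeros(p)
        for t in range(p, len(x)):
            past = x[t - p:t][::-1]    # (x[t-1], ..., x[t-p])
            error = x[t] - w @ past    # prediction error
            w += eta * error * past    # gradient step
        return w

    rng = np.random.default_rng(1)
    alpha = np.array([0.6, 0.25])
    x = np.zeros(2000)
    for t in range(2, len(x)):
        x[t] = alpha @ x[t - 2:t][::-1] + rng.standard_normal()

    print(lms_ar(x, p=2))   # drifts toward roughly [0.6, 0.25]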
Moving Average (MA[q]) Models
An MA[q] model assumes that at its heart is an FIR filter applied to some (unknown) internal signal, \epsilon[t]. q + 1 is the order of that filter.

   x[t] = \sum_{i=0}^{q} \beta_i \epsilon[t - i]    (11)

Sadly, we cannot assume that \epsilon[t] is negligible; if it were, x[t] would have to be negligible too. If our goal were to describe a noisy signal x[t] with specific frequency characteristics, we could set \epsilon[t] to white noise and the {w_i} would just subtract the frequency components that we do not want.

Seldom used alone in practice. By using Eq. 11 to estimate x[t], we are not making explicit use of past values of x[t].
Autoregressive Moving Average (ARMA[p, q]) Models

A combination of the AR[p] and MA[q] models:

   x[t] = \sum_{i=1}^{p} \alpha_i x[t - i] + \sum_{i=1}^{q} \beta_i \epsilon[t - i] + \epsilon[t]    (12)

To estimate future values of x[t], assume that \epsilon[t] at time t is small relative to x[t]. We can obtain estimates of past values of \epsilon[t] at time t - i from past true values of x[t] and past predicted values \hat{x}[t]:

   \hat{\epsilon}[t - i] = x[t - i] - \hat{x}[t - i]    (13)

The estimate for x[t] is then

   \hat{x}[t] = \sum_{i=1}^{p} \alpha_i x[t - i] + \sum_{i=1}^{q} \beta_i \hat{\epsilon}[t - i]    (14)
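A sketch (my own, assuming the \alpha and \beta coefficients are already known) of one-step ARMA prediction, estimating past innovations as prediction errors per Eqs. 13-14:

    import numpy as np

    def arma_one_step(x, alpha, beta):
        """One-step-ahead ARMA prediction over a whole series."""
        p, q = len(alpha), len(beta)
        eps_hat = np.zeros(len(x))
        x_hat = np.zeros(len(x))
        for t in range(max(p, q), len(x)):
            ar_part = alpha @ x[t - p:t][::-1]        # sum_i alpha_i * x[t-i]
            ma_part = beta @ eps_hat[t - q:t][::-1]   # sum_i beta_i * eps_hat[t-i]
            x_hat[t] = ar_part + ma_part              # Eq. 14
            eps_hat[t] = x[t] - x_hat[t]              # Eq. 13
        return x_hat

    x = np.random.default_rng(0).standard_normal(100).cumsum()   # arbitrary series
    print(arma_one_step(x, alpha=np.array([0.9]), beta=np.array([0.3]))[:5])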
Linear DSP Models as Linear NNs
DSP Filter   DSP Model   NN Connections
FIR          MA[q]       feedforward
IIR          AR[p]       recurrent

An AR[p] model is equivalent to:

[Diagram: a linear network whose inputs are \epsilon(t) and the delayed outputs x(t-1), ..., x(t-p), combined through weights \alpha_1, ..., \alpha_p to produce x(t).]

Train using backprop as in Eq. 11.
Nonlinear AR[p] Models
Once we've moved to NNs, there's nothing to stop us from replacing the \Sigma's with a nonlinear activation function like tanh(\Sigma).

Non-linear models are more powerful, but need more training data, and are less well behaved (overfitting, local minima, etc.).

TDNNs can be viewed as NAR[p] models.

An example of a nonlinear ARMA neural net appears on the next slide.
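A tiny sketch (mine, with untrained random weights) of the functional form of a NAR[p] predictor, i.e. a one-hidden-layer tanh network over the delay-line inputs; training would proceed with backprop as usual:

    import numpy as np

    rng = np.random.default_rng(0)
    p, hidden = 5, 8                        # embedding dimension and hidden units (arbitrary)
    W1 = rng.normal(scale=0.3, size=(hidden, p))
    b1 = np.zeros(hidden)
    w2 = rng.normal(scale=0.3, size=hidden)

    def nar_predict(past):
        """past = (x[t-1], ..., x[t-p]); returns the nonlinear prediction x_hat[t]."""
        return w2 @ np.tanh(W1 @ past + b1)

    print(nar_predict(np.ones(p)))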
Nonlinear ARMA[p, q] Models
[Diagram: a NARMA network of nonlinear units f, trained with backprop. Its inputs are the past samples x[t-1], x[t-2], x[t-3] and the past innovation estimates \epsilon[t-1], \epsilon[t-2], \epsilon[t-3], obtained by subtracting past predictions from the true values; its output is the prediction of x[t].]
Jordan Nets
A Jordan net can be viewed as a variant of a NARMA model.

[Diagram: a Jordan network with plan and state input units, a hidden layer, and an output layer; the output feeds back into the state units.]

This network has no memory; it remembers only the output from the previous timestep.
The Case for Alternative Memory Models
Uniform sampling is simple but has limitations.

[Diagram: a tapped delay line feeding x(t), x(t-1), x(t-2), ..., x(t-T) into f to produce x(t+1).]

Can only look back T equispaced time steps. To look far into the past, T must be large.

Large T --> complicated model: many parameters, slow to train.
A Change of Notation
   x_i[t] = x[t - i + 1]    (15)

This is just a reformulation. x_i[t] is a memory term, allowing us to elide the tapped delay line from our diagrams:

[Diagram: the delayed samples x(t), x(t-1), x(t-2), ..., x(t-T) are relabelled as memory terms x_1[t], x_2[t], x_3[t], ..., x_{T+1}[t], which feed f to produce x(t+1).]
Propose Non-uniform Sampling
   x_i[t] = x[t - d_i] ,   d_i \in \mathbb{N}    (16)

d_i is an integer delay; for example, for four inputs, d could be {1, 2, 4, 8}. This is a generalization. If d were {1, 2, 3, 4}, we would be back to uniform sampling.
Convolutional Memory Terms
Mozer has suggested treating each memory term as a convolution of x[t] with a kernel function:

   x_i[t] = \sum_{\tau=1}^{t} c_i[t - \tau] x[\tau]    (17)

Delay lines, non-uniformly and uniformly sampled, can be expressed using this notation, with the kernel function defined by:

   c_i[t] = 1 if t = d_i, and 0 otherwise    (18)

[Figure: the kernel c_i[t] is a single unit spike at t = d_i.]
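A small sketch (my own) of Eq. 17 with the delta kernel of Eq. 18, so that the memory term simply reproduces x delayed by d_i:

    import numpy as np

    def memory_term(x, kernel):
        """x_i[t] = sum_tau kernel(t - tau) * x[tau]  (Eq. 17), for every t."""
        T = len(x)
        return np.array([sum(kernel(t - tau) * x[tau] for tau in range(T))
                         for t in range(T)])

    d_i = 3
    delta_kernel = lambda t: 1.0 if t == d_i else 0.0    # Eq. 18

    x = np.arange(10, dtype=float)
    print(memory_term(x, delta_kernel))   # [0 0 0 0 1 2 3 4 5 6]: x delayed by d_i = 3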
Exponential Trace Memory
The idea: remember past values as an exponentially decaying weighted average of the input:

   c_i[t] = (1 - \mu_i) \mu_i^t ,   \mu \in (-1, +1)    (19)

\mu_i is the decay rate (a discount factor), e.g. 0.99. Each x_i uses a different decay rate.

No outputs are forgotten; they just fade away.

[Figure: the kernel c_i[t] decays exponentially with t.]
Exponential Trace Memory, cont'd

A nice feature: if all the \mu_i are equal to a common value \mu, we don't have to do the convolution at each time step. Compute incrementally:

   x_i[t] = (1 - \mu) x[t] + \mu x_i[t - 1]    (20)

Example: a Jordan net with memory:

[Diagram: a Jordan network whose state units keep an exponentially decaying trace of past outputs.]
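A quick check (mine) that the incremental update of Eq. 20 reproduces the explicit convolution with the kernel of Eq. 19; mu and the random series are arbitrary:

    import numpy as np

    mu = 0.9
    x = np.random.default_rng(0).standard_normal(50)

    # Incremental update (Eq. 20)
    trace = 0.0
    for xt in x:
        trace = (1 - mu) * xt + mu * trace

    # Explicit convolution with the kernel c[k] = (1 - mu) * mu**k (Eq. 19)
    t = len(x) - 1
    explicit = sum((1 - mu) * mu**k * x[t - k] for k in range(t + 1))

    print(np.isclose(trace, explicit))   # True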
Special Case: Binary Sequences
Let x[t] \in {0, 1}, with \mu = 0.5.

The memory \bar{x}[t] is a bit string, treated as a floating point fraction:

   x = {1}               -->  \bar{x}[t] = .1
   x = {1, 0}            -->  \bar{x}[t] = .01
   x = {1, 0, 0}         -->  \bar{x}[t] = .001
   x = {1, 0, 0, 1}      -->  \bar{x}[t] = .1001
   x = {1, 0, 0, 1, 1}   -->  \bar{x}[t] = .11001

The earliest bit becomes the least significant bit of \bar{x}[t].
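A tiny check (mine) that with \mu = 0.5 the trace really encodes the input bits as a binary fraction, earliest bit least significant:

    bits = [1, 0, 0, 1, 1]
    mu = 0.5

    memory = 0.0
    for b in bits:
        memory = (1 - mu) * b + mu * memory   # Eq. 20 with mu = 0.5

    print(memory)                     # 0.78125, i.e. binary .11001
    print(int('11001', 2) / 2**5)     # same value, read from the reversed bit string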
Memory Depth and Resolution
Depth is how far back memory goes.
Resolution is the degree to which information about individual sequence elements is preserved.

At fixed model order, we have a tradeoff:

Tapped delay line: low depth, high resolution.
Exponential trace: high depth, low resolution.
Gamma Memory (de Vries & Principe)

   c_i[t] = \binom{t}{d_i} (1 - \mu_i)^{d_i + 1} \mu_i^{t - d_i}   if t >= d_i, and 0 otherwise    (21)

d_i is an integer; \mu_i \in [0, 1]. E.g. for d_i = 4 and \mu = 0.21:

[Figure: the gamma kernel c_i[t] rises to a peak near t = d_i and then decays.]

If d_i = 0, this is exponential trace memory.
As \mu_i --> 0, this becomes the tapped delay line.
Can trade depth for resolution by adjusting d_i and \mu_i.
Gamma functions form a basis for a family of kernel functions.
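A short sketch (mine) of the kernel in Eq. 21, also verifying the two limiting cases noted above:

    import numpy as np
    from math import comb

    def gamma_kernel(t, d, mu):
        """c_i[t] from Eq. 21: C(t, d) * (1 - mu)**(d + 1) * mu**(t - d) for t >= d, else 0."""
        return comb(t, d) * (1 - mu) ** (d + 1) * mu ** (t - d) if t >= d else 0.0

    print([round(gamma_kernel(t, d=4, mu=0.21), 3) for t in range(13)])  # peaks near t = d

    # d = 0 recovers the exponential trace kernel (Eq. 19) ...
    print(gamma_kernel(3, d=0, mu=0.9), (1 - 0.9) * 0.9 ** 3)
    # ... and mu -> 0 recovers the tapped delay line kernel (Eq. 18)
    print([gamma_kernel(t, d=4, mu=1e-9) for t in range(7)])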
Memory Content
We don't have to store the raw x[t]. We can store any transformation we like; for example, the internal state of the NN.

Example: Elman net

[Diagram: an Elman network with plan and context input units, a hidden layer, and an output layer; the hidden layer feeds back into the context units.]

Think of this as a 1-tap delay line storing f(x[t]), the hidden layer.
Horizon of Prediction
So far we have covered many neural net architectures which could be used for predicting the next sample in a time series. What if we need a longer forecast, i.e. not x[t + 1] but x[t + s], with the horizon of prediction s > 1?

Three options (the third is sketched below):

Train on {x[t], x[t - 1], x[t - 2], ...} to predict x[t + s].
Train to predict all x[t + i], 1 <= i <= s (good for small s).
Train to predict x[t + 1] only, but iterate to get x[t + s] for any s.
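A hedged sketch (my own) of the third option: iterating a generic one-step predictor by feeding its own outputs back into the delay line; the linear predictor and its weights are made up:

    import numpy as np

    def iterate_forecast(predict_one, history, s):
        """Iterate a one-step predictor out to horizon s.
        predict_one maps the d most recent samples (newest first) to x_hat[t+1]."""
        d = len(history)
        window = list(history)
        for _ in range(s):
            window.insert(0, predict_one(np.array(window[:d])))
        return window[0]                       # x_hat[t + s]

    w = np.array([0.6, 0.25, 0.1])             # made-up linear one-step predictor
    predict_one = lambda past: w @ past        # past = (x[t], x[t-1], x[t-2])
    print(iterate_forecast(predict_one, history=[1.0, 0.5, 0.2], s=3))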
Predicting Sunspot Activity
Fessant, Bengio and Collobert.

Sunspots affect ionospheric propagation of radio waves. Telecom companies want to predict sunspot activity six months in advance.

Sunspots follow an 11-year cycle, varying from 9 to 14 years. Monthly data goes back to 1849.

The authors focus on predicting IR5, a smoothed index of monthly solar activity.
Fessant et al: the IR5 Sunspots Series
   IR5[t] = (1/5) ( R[t - 3] + R[t - 2] + R[t - 1] + R[t] + R[t + 1] )

where R[t] is the mean sunspot number for month t and IR5[t] is the desired index.
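A one-line sketch (mine) of this smoothing, assuming R is an array of monthly mean sunspot numbers (here random placeholders):

    import numpy as np

    R = np.random.default_rng(0).uniform(0, 200, size=60)   # placeholder monthly values

    # IR5[t] = (R[t-3] + R[t-2] + R[t-1] + R[t] + R[t+1]) / 5, defined for 3 <= t <= len(R) - 2
    IR5 = np.array([(R[t - 3] + R[t - 2] + R[t - 1] + R[t] + R[t + 1]) / 5
                    for t in range(3, len(R) - 1)])
    print(IR5.shape)   # (56,)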
Fessant et al: Simple Feedforward NN (1087 weights)

Input:  {x[t - 40], ..., x[t - 1]}
Output: {\hat{x}[t], ..., \hat{x}[t + 5]}
Fessant et al: Modular Feedforward NN (552 weights)

Input:  {x[t - 40], ..., x[t - 1]}
[Diagram: an intermediate stage produces {\hat{x}[t], ..., \hat{x}[t + 5]}, from which the final output \hat{x}[t + 5] is formed.]
Fessant et al: Elman NN (786 weights)

Input:  {x[t - 40], ..., x[t - 1]}
Output: {\hat{x}[t], ..., \hat{x}[t + 5]}
Fessant et al: Results

Train on the first 1428 samples; test on the last 238 samples.

                            CNET heuristic   Simple Net   Modular Net   Elman Net
Average Relative Variance       0.1130          0.0884       0.0748       0.0737
# Strong Errors                     12              12            4           4