Forecast Time Series-Notes
Contents
1 Introduction to forecasting
1.1 Introduction
1.2 Some case studies
1.3 Time series data
1.4 Some simple forecasting methods
1.5 Lab Session 1
3 Exponential smoothing
3.1 The state space perspective
3.2 Simple exponential smoothing
3.3 Trend methods
3.4 Seasonal methods
3.5 Lab Session 3
3.6 Taxonomy of exponential smoothing methods
3.7 Innovations state space models
3.8 ETS in R
3.9 Forecasting with ETS models
3.10 Lab Session 4
1.1 Introduction
Brief bio
1. Guru: I wrote the book, done it for decades, now I do the conference
circuit.
2. Expert: It has been my full time job for more than a decade.
3. Skilled: I have been doing it for years.
4. Comfortable: I understand it and have done it.
5. Learner: I am still learning.
6. Beginner: I have heard of it and would like to learn more.
7. Unknown: What is forecasting? Is that what the weather people do?
Key reference: Hyndman, R.J., & Athanasopoulos, G. Forecasting: principles and practice. OTexts.
install.packages("fpp", dependencies=TRUE)
# Detailed help
help(forecast)
# Worked examples
example("forecast.ar")
# Similar names
apropos("forecast")
# Help on package
help(package="fpp")
Approximate outline
Assumptions
[Figure: quarterly Australian GDP (ausgdp)]
Class: "ts"
> ausgdp
Qtr1 Qtr2 Qtr3 Qtr4
1971 4612 4651
1972 4645 4615 4645 4722
1973 4780 4830 4887 4933
1974 4921 4875 4867 4905
1975 4938 4934 4942 4979
1976 5028 5079 5112 5127
1977 5130 5101 5072 5069
1978 5100 5166 5244 5312
1979 5349 5370 5388 5396
1980 5388 5403 5442 5482
> plot(ausgdp)
[Figure: time plot of ausgdp, rising from about 4500 to 7000]
> elecsales
Time Series:
Start = 1989
End = 2008
Frequency = 1
[1] 2354.34 2379.71 2318.52 2468.99 2386.09 2569.47
[7] 2575.72 2762.72 2844.50 3000.70 3108.10 3357.50
[13] 3075.70 3180.60 3221.60 3176.20 3430.60 3527.48
[19] 3637.89 3655.00
Average method
Forecast of all future values is equal to the mean of the historical data {y_1, ..., y_T}.
Forecasts: $\hat{y}_{T+h|T} = \bar{y} = (y_1 + \dots + y_T)/T$
Naïve method (for time series only)
Forecasts equal to the last observed value.
Forecasts: $\hat{y}_{T+h|T} = y_T$
Consequence of the efficient market hypothesis.
Seasonal naïve method
Forecasts equal to the last value from the same season.
Forecasts: $\hat{y}_{T+h|T} = y_{T+h-km}$, where m = seasonal period and $k = \lfloor (h-1)/m \rfloor + 1$.
[Figure: forecasts for quarterly beer production using the benchmark methods]
Drift method
Forecasts equal to the last value plus the average change in the historical data (equivalent to extrapolating the line drawn between the first and last observations).
Forecasts: $\hat{y}_{T+h|T} = y_T + \frac{h}{T-1}(y_T - y_1)$
[Figure: drift forecasts (x-axis: Day)]
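In R, the four benchmarks are available through the forecast package (loaded with fpp); a minimal sketch on the quarterly beer data (the training window is an illustrative assumption):
library(fpp)
beer2 <- window(ausbeer, start=1992, end=2005.99)
plot(meanf(beer2, h=11))                         # average method
lines(naive(beer2, h=11)$mean, col=2)            # naive method
lines(snaive(beer2, h=11)$mean, col=3)           # seasonal naive method
lines(rwf(beer2, drift=TRUE, h=11)$mean, col=4)  # drift method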
Before doing any exercises in R, load the fpp package using library(fpp).
1. Use the Dow Jones index (data set dowjones) to do the following:
(a) Produce a time plot of the series.
(b) Produce forecasts using the drift method and plot them.
(c) Show that the graphed forecasts are identical to extending the
line drawn between the first and last observations.
(d) Try some of the other benchmark functions to forecast the same
data set. Which do you think is best? Why?
2. For each of the following series, make a graph of the data with forecasts using the most appropriate of the four benchmark methods: mean, naïve, seasonal naïve or drift.
In each case, do you think the forecasts are reasonable? If not, how
could they be improved?
2
The forecaster's toolbox
[Figure: weekly economy-class passengers, Melbourne–Sydney (thousands)]
plot(melsyd[,"Economy.Class"])
> plot(a10)
[Figure: plot(a10) — monthly antidiabetic drug sales ($ million)]
Seasonal plots
[Figure: seasonal plot of a10 — antidiabetic drug sales ($ million) against month (Jan–Dec), one line per year, 1991–2008]
> monthplot(a10)
[Figure: monthplot(a10) — $ million by month, Jan–Dec]
Data for each season are collected together in a time plot as separate time series.
This enables the underlying seasonal pattern to be seen clearly, and changes in seasonality over time to be visualized.
In R: monthplot
[Figure: seasonal plot and month plot of quarterly beer production (megalitres), Q1–Q4, one line per year, 1992–2008]
Cyclic pattern exists when data exhibit rises and falls that are not of fixed
period (duration usually of at least 2 years).
[Figure: Australian electricity production (GWh), 8000–14000]
[Figure: US Treasury bill contracts (price, 85–91), days 0–100]
2.3 Autocorrelation
> lag.plot(beer,lags=9,do.lines=FALSE)
[Figure: lag plots of quarterly beer production, lags 1–9]
[Figure: ACF of quarterly beer production, lags 1–17]
r_4 is higher than for the other lags. This is due to the seasonal pattern in the data: the peaks tend to be 4 quarters apart and the troughs tend to be 2 quarters apart.
r_2 is more negative than for the other lags because troughs tend to be 2 quarters behind peaks.
Together, the autocorrelations at lags 1, 2, ..., make up the autocorrelation function (ACF).
The plot is known as a correlogram.
If there is seasonality, the ACF at the seasonal lag (e.g., 12 for monthly
data) will be large and positive.
For seasonal monthly data, a large ACF value will be seen at lag 12
and possibly also at lags 24, 36, . . .
For seasonal quarterly data, a large ACF value will be seen at lag 4
and possibly also at lags 8, 12, . . .
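A quick check of this in R (a sketch using the quarterly beer data from fpp):
library(fpp)
Acf(window(ausbeer, start=1992))  # expect large positive spikes at lags 4, 8, 12, ...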
[Figure: a trended, seasonal monthly series (8000–10000) and its ACF, lags 0–40]
Which is which?
[Figure: four time series (thousands) and four ACF plots labelled A–D — match each series with its ACF]
1. {e_t} are uncorrelated. If they aren't, then there is information left in the residuals that should be used in computing forecasts.
2. {e_t} have mean zero. If they don't, then the forecasts are biased.
[Figure: Dow Jones index (daily, about 3600–3800)]
Naïve forecast:
$$\hat{y}_{t|t-1} = y_{t-1}$$
$$e_t = y_t - y_{t-1}$$
[Figure: Dow Jones index and the daily change in the Dow Jones index]
[Figure: histogram of residuals — approximately normal?]
[Figure: ACF of residuals, lags 1–22]
fc <- rwf(dj)
res <- residuals(fc)
plot(res)
hist(res,breaks="FD")
Acf(res,main="")
White noise
[Figure: a simulated white noise series, 50 observations]
White noise data is uncorrelated across time with zero mean and constant
variance.
(Technically, we require independence as well.)
Think of white noise as completely uninteresting with no predictable
patterns.
[Figure: ACF of the white noise series, lags 1–15 — all sample autocorrelations r_1, ..., r_10 are close to zero, within the significance bounds]
[Figure: a time series and the ACF of its residuals, lags 0–40]
Portmanteau tests
Consider a whole set of r_k values, and develop a test to see whether the set is significantly different from a zero set.
Box-Pierce test
$$Q = T \sum_{k=1}^{h} r_k^2$$
where h is the maximum lag being considered and T is the number of observations.
My preferences: h = 10 for non-seasonal data, h = 2m for seasonal data.
If each r_k is close to zero, Q will be small.
If some r_k values are large (positive or negative), Q will be large.
Ljung-Box test
$$Q^* = T(T+2) \sum_{k=1}^{h} (T-k)^{-1} r_k^2$$
where h is the maximum lag being considered and T is the number of observations.
My preferences: h = 10 for non-seasonal data, h = 2m for seasonal data.
Better performance, especially in small samples.
If the data are WN, Q* has a χ² distribution with (h − K) degrees of freedom, where K = the number of parameters in the model.
When applied to raw data, set K = 0.
For the Dow-Jones example,
res <- residuals(naive(dj))
# Box-Pierce and Ljung-Box tests (fitdf = K = 0 for raw forecast errors)
Box.test(res, lag=10, fitdf=0)
Box.test(res, lag=10, fitdf=0, type="Ljung-Box")
Exercise
Let $y_t$ denote the t-th observation and $\hat{y}_{t|t-1}$ denote its forecast based on all previous data, where t = 1, ..., T. Then the following measures are useful.
$$\text{MAE} = T^{-1} \sum_{t=1}^{T} |y_t - \hat{y}_{t|t-1}|$$
$$\text{MSE} = T^{-1} \sum_{t=1}^{T} (y_t - \hat{y}_{t|t-1})^2 \qquad \text{RMSE} = \sqrt{T^{-1} \sum_{t=1}^{T} (y_t - \hat{y}_{t|t-1})^2}$$
$$\text{MAPE} = 100\, T^{-1} \sum_{t=1}^{T} |y_t - \hat{y}_{t|t-1}| / |y_t|$$
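These measures can be computed by hand from the one-step forecast errors, or read off accuracy(); a minimal sketch (the naïve model of the dj series is an illustrative assumption):
library(fpp)
fit <- naive(dj)
accuracy(fit)           # reports ME, RMSE, MAE, MPE, MAPE, ...
e <- residuals(fit)     # one-step forecast errors
c(MAE  = mean(abs(e), na.rm=TRUE),
  RMSE = sqrt(mean(e^2, na.rm=TRUE)),
  MAPE = 100*mean(abs(e/dj), na.rm=TRUE))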
[Figure: mean, naïve and drift forecasts for the Dow Jones index (x-axis: Day)]
The test set must not be used for any aspect of model development
or calculation of forecasts.
Forecast accuracy is based only on the test set.
beer3 <- window(ausbeer,start=1992,end=2005.99)
beer4 <- window(ausbeer,start=2006)
# fit1 and fit2 are two forecasts of beer3; e.g. (an illustrative assumption):
fit1 <- meanf(beer3,h=10)
fit2 <- snaive(beer3,h=10)
accuracy(fit1,beer4)
accuracy(fit2,beer4)
Beware of over-fitting
A model which fits the data well does not necessarily forecast well.
A perfect fit can always be obtained by using a model with enough parameters. (Compare $R^2$.)
Over-fitting a model to data is as bad as failing to identify the systematic pattern in the data.
Problems can be overcome by measuring true out-of-sample forecast
accuracy. That is, total data divided into training set and test set.
Training set used to estimate parameters. Forecasts are made for test
set.
Accuracy measures computed for errors in test set only.
(a) Use either the naive or seasonal naive forecasting method and
apply it to the full data set.
(b) Compute the residuals and plot their ACF. Do the residuals
appear to be white noise? What did your forecasting method
miss?
(c) Do a Ljung-Box test on the residuals. What do the results
mean?
Observed data: y_1, ..., y_T.
Unobserved state: x_1, ..., x_T.
Forecast: $\hat{y}_{T+h|T} = \text{E}(y_{T+h} \mid x_T)$.
The forecast variance is $\text{Var}(y_{T+h} \mid x_T)$.
A prediction interval or interval forecast is a range of values of y_{T+h} with high probability.
Component form
$$\ell_1 = \alpha y_1 + (1-\alpha)\ell_0$$
$$\ell_2 = \alpha y_2 + (1-\alpha)\ell_1 = \alpha y_2 + \alpha(1-\alpha) y_1 + (1-\alpha)^2 \ell_0$$
$$\ell_3 = \alpha y_3 + (1-\alpha)\ell_2 = \sum_{j=0}^{2} \alpha(1-\alpha)^j y_{3-j} + (1-\alpha)^3 \ell_0$$
$$\vdots$$
$$\ell_t = \sum_{j=0}^{t-1} \alpha(1-\alpha)^j y_{t-j} + (1-\alpha)^t \ell_0$$
Forecast equation
$$\hat{y}_{t+h|t} = \sum_{j=1}^{t} \alpha(1-\alpha)^{t-j} y_j + (1-\alpha)^t \ell_0, \qquad (0 \le \alpha \le 1)$$
Optimisation
The smoothing parameter α and the initial level ℓ_0 are chosen by minimising the sum of squared one-step forecast errors, $\sum_{t=1}^{T}(y_t - \hat{y}_{t|t-1})^2$.
Multi-step forecasts
$\hat{y}_{T+h|T} = \hat{y}_{T+1|T}$, h = 2, 3, ...
A flat forecast function.
Remember, a forecast is an estimated mean of a future value.
So with no trend, no seasonality, and no other patterns, the forecasts
are constant.
SES in R
library(fpp)
fit <- ses(oil, h=3)
plot(fit)
summary(fit)
Holt's linear trend method:
$$\hat{y}_{t+h|t} = \ell_t + h b_t$$
$$\ell_t = \alpha y_t + (1-\alpha)(\ell_{t-1} + b_{t-1})$$
$$b_t = \beta^*(\ell_t - \ell_{t-1}) + (1-\beta^*) b_{t-1}$$
$$e_t = y_t - (\ell_{t-1} + b_{t-1}) = y_t - \hat{y}_{t|t-1}$$
Need to estimate α, β*, ℓ_0, b_0.
Holt's method in R
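The two fits compared below could be created as follows (the series and training window are illustrative assumptions; fit1 uses simple exponential smoothing, fit2 uses Holt's linear trend):
library(fpp)
air <- window(ausair, start=1990, end=2004)
fit1 <- ses(air, h=5)    # one smoothing parameter (alpha)
fit2 <- holt(air, h=5)   # adds a trend parameter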
accuracy(fit1)
accuracy(fit2)
Holt's method will almost always have better in-sample RMSE because it is optimized over one additional parameter.
It may not be better on other measures.
You need to compare out-of-sample RMSE (using a test set) for the comparison to be useful.
But we don't have enough data.
A better method for comparison will be in the next session!
A better method for comparison will be in the next session!
Trend methods in R
Exponential trend: $\hat{y}_{t+h|t} = \ell_t b_t^h$; with damping, $\hat{y}_{t+h|t} = \ell_t b_t^{(\phi + \phi^2 + \cdots + \phi^h)}$, where
$$\ell_t = \alpha y_t + (1-\alpha)(\ell_{t-1} b_{t-1})$$
$$b_t = \beta^* (\ell_t / \ell_{t-1}) + (1-\beta^*) b_{t-1}$$
[Figure: livestock, sheep in Asia (millions, 300–450), with forecasts from Holt's, exponential, additive damped and multiplicative damped trend methods]
Seasonal methods in R
# e.g. additive and multiplicative Holt-Winters fits (the series is an illustrative assumption):
aus1 <- hw(austourists, seasonal="additive")
aus2 <- hw(austourists, seasonal="multiplicative")
plot(aus1)
plot(aus2)
summary(aus1)
summary(aus2)
Often the single most accurate forecasting method for seasonal data: Holt-Winters with a damped trend and multiplicative seasonality.
1. For this exercise, use the price of a dozen eggs in the United States from 1900–1993 (data set eggs). Experiment with the various options in the holt() function to see how much the forecasts change with a damped or exponential trend. Also try changing the parameter values for α and β to see how they affect the forecasts. Try to develop an intuition of what each parameter and argument is doing to the forecasts.
[Hint: use h=100 when calling holt() so you can clearly see the differences between the various options when plotting the forecasts.]
Which model gives the best RMSE?
Do the residuals from the best model look like white noise?
(a) Make a time plot of your data and describe the main features of
the series.
(b) Forecast the next two years using Holt-Winters multiplicative
method.
(c) Why is multiplicative seasonality necessary here?
(d) Experiment with making the trend exponential and/or damped.
(e) Compare the RMSE of the one-step forecasts from the various
methods. Which do you prefer?
(f) Check that the residuals from the best model look like white
noise.
                             Seasonal Component
Trend Component              N (None)    A (Additive)    M (Multiplicative)
N (None)                     N,N         N,A             N,M
A (Additive)                 A,N         A,A             A,M
Ad (Additive damped)         Ad,N        Ad,A            Ad,M
M (Multiplicative)           M,N         M,A             M,M
Md (Multiplicative damped)   Md,N        Md,A            Md,M
There are 15 separate exponential smoothing methods.
The corresponding innovations state space models:
generate the same point forecasts, but can also generate forecast intervals;
are stochastic (i.e., random) data generating processes that can generate an entire forecast distribution;
allow for proper model selection.
ETS models
Two models for each method: one with additive and one with multiplicative errors, i.e., 30 models in total.
ETS(Error,Trend,Seasonal):
Error= {A, M}
Trend = {N, A, Ad , M, Md }
Seasonal = {N, A, M}.
ETS(A,N,N)
Observation equation: $y_t = \ell_{t-1} + \varepsilon_t$
State equation: $\ell_t = \ell_{t-1} + \alpha \varepsilon_t$
$$e_t = y_t - \hat{y}_{t|t-1} = \varepsilon_t$$
Assume $\varepsilon_t \sim \text{NID}(0, \sigma^2)$.
Called "innovations" or "single source of error" models because all equations share the same error process, ε_t.
ETS(M,N,N)
Specify relative errors: $\varepsilon_t = \dfrac{y_t - \hat{y}_{t|t-1}}{\hat{y}_{t|t-1}} \sim \text{NID}(0, \sigma^2)$
Substituting $\hat{y}_{t|t-1} = \ell_{t-1}$ gives:
$$y_t = \ell_{t-1}(1 + \varepsilon_t), \qquad \ell_t = \ell_{t-1}(1 + \alpha\varepsilon_t)$$
$$e_t = y_t - \hat{y}_{t|t-1} = \ell_{t-1}\varepsilon_t$$
Models with additive and multiplicative errors with the same parameters generate the same point forecasts but different forecast intervals.
ETS(A,A,N)
$$y_t = \ell_{t-1} + b_{t-1} + \varepsilon_t$$
$$\ell_t = \ell_{t-1} + b_{t-1} + \alpha\varepsilon_t$$
$$b_t = b_{t-1} + \beta\varepsilon_t$$
ETS(M,A,N)
$$y_t = (\ell_{t-1} + b_{t-1})(1 + \varepsilon_t)$$
$$\ell_t = (\ell_{t-1} + b_{t-1})(1 + \alpha\varepsilon_t)$$
$$b_t = b_{t-1} + \beta(\ell_{t-1} + b_{t-1})\varepsilon_t$$
ETS(A,A,A)
Forecast equation: $\hat{y}_{t+h|t} = \ell_t + h b_t + s_{t-m+h_m^+}$, where $h_m^+ = \lfloor (h-1) \bmod m \rfloor + 1$
Observation equation: $y_t = \ell_{t-1} + b_{t-1} + s_{t-m} + \varepsilon_t$
State equations:
$$\ell_t = \ell_{t-1} + b_{t-1} + \alpha\varepsilon_t$$
$$b_t = b_{t-1} + \beta\varepsilon_t$$
$$s_t = s_{t-m} + \gamma\varepsilon_t$$
Let $x_t = (\ell_t, b_t, s_t, s_{t-1}, \dots, s_{t-m+1})$ and $\varepsilon_t \stackrel{iid}{\sim} N(0, \sigma^2)$.
$$y_t = \underbrace{h(x_{t-1})}_{\mu_t} + \underbrace{k(x_{t-1})\varepsilon_t}_{e_t}$$
$$x_t = f(x_{t-1}) + g(x_{t-1})\varepsilon_t$$
Additive errors: $k(x) = 1$, so $y_t = \mu_t + \varepsilon_t$.
Multiplicative errors: $k(x_{t-1}) = \mu_t$, so $y_t = \mu_t(1 + \varepsilon_t)$, where $\varepsilon_t = (y_t - \mu_t)/\mu_t$ is a relative error.
ETS models with additive errors:
                             Seasonal Component
Trend Component              N (None)    A (Additive)    M (Multiplicative)
N (None)                     A,N,N       A,N,A           A,N,M
A (Additive)                 A,A,N       A,A,A           A,A,M
Ad (Additive damped)         A,Ad,N      A,Ad,A          A,Ad,M
M (Multiplicative)           A,M,N       A,M,A           A,M,M
Md (Multiplicative damped)   A,Md,N      A,Md,A          A,Md,M
ETS models with multiplicative errors:
                             Seasonal Component
Trend Component              N (None)    A (Additive)    M (Multiplicative)
N (None)                     M,N,N       M,N,A           M,N,M
A (Additive)                 M,A,N       M,A,A           M,A,M
Ad (Additive damped)         M,Ad,N      M,Ad,A          M,Ad,M
M (Multiplicative)           M,M,N       M,M,A           M,M,M
Md (Multiplicative damped)   M,Md,N      M,Md,A          M,Md,M
Estimation
$$L^*(\theta, x_0) = n \log\left( \sum_{t=1}^{n} \varepsilon_t^2 / k^2(x_{t-1}) \right) + 2 \sum_{t=1}^{n} \log |k(x_{t-1})|$$
$$= -2 \log(\text{Likelihood}) + \text{constant}$$
The parameters θ = (α, β, γ, φ) and initial states x_0 are estimated by minimising L*.
Parameter restrictions
Usual region
Traditional restrictions in the methods: 0 < α, β*, γ*, φ < 1 — the equations are interpreted as weighted averages.
In the models we set β = αβ* and γ = (1 − α)γ*; therefore 0 < α < 1, 0 < β < α and 0 < γ < 1 − α.
0.8 < φ < 0.98 to prevent numerical difficulties.
Admissible region
Conditions to prevent observations in the distant past having a continuing effect on current forecasts.
Usually (but not always) less restrictive than the usual region.
For example, for ETS(A,N,N): the usual region is 0 < α < 1; the admissible region is 0 < α < 2.
Model selection
$$\text{AIC} = -2\log(\text{Likelihood}) + 2p$$
$$\text{AIC}_c = \text{AIC} + \frac{2(p+1)(p+2)}{n - p}$$
Schwarz Bayesian IC: $\text{BIC} = \text{AIC} + p(\log n - 2)$
Automatic forecasting
3.8 ETS in R
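The two fitted objects printed below could be produced as follows (the quarterly series is an assumption, inferred from the four seasonal states in the output):
library(fpp)
fit  <- ets(ausbeer)                             # model selected automatically
fit2 <- ets(ausbeer, model="AAA", damped=FALSE)  # force ETS(A,A,A)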
> fit
ETS(M,Md,M)
Smoothing parameters:
alpha = 0.1776
beta = 0.0454
gamma = 0.1947
phi = 0.9549
Initial states:
l = 263.8531
b = 0.9997
s = 1.1856 0.9109 0.8612 1.0423
sigma: 0.0356
> fit2
ETS(A,A,A)
Smoothing parameters:
alpha = 0.2079
beta = 0.0304
gamma = 0.2483
Initial states:
l = 255.6559
b = 0.5687
s = 52.3841 -27.1061 -37.6758 12.3978
sigma: 15.9053
ets() function
Automatically chooses a model by default using the AIC, AICc or
BIC.
Can handle any combination of trend, seasonality and damping
Produces forecast intervals for every model
Ensures the parameters are admissible (equivalent to invertible)
Produces an object of class "ets".
ets objects
Methods: coef(), plot(), summary(), residuals(), fitted(), simulate() and forecast()
"plot()" function shows time plots of the original time series along
with the extracted components (level, growth and seasonal).
plot(fit)
[Figure: decomposition by ETS(M,Md,M) method — time plots of the observed series and the extracted level, slope and season components]
Goodness-of-fit
> accuracy(fit)
ME RMSE MAE MPE MAPE MASE
0.17847 15.48781 11.77800 0.07204 2.81921 0.20705
> accuracy(fit2)
ME RMSE MAE MPE MAPE MASE
-0.11711 15.90526 12.18930 -0.03765 2.91255 0.21428
Forecast intervals
> plot(forecast(fit,level=c(50,80,95)))
[Figure: forecasts from ETS(M,Md,M) with 50%, 80% and 95% intervals]
> plot(forecast(fit,fan=TRUE))
[Figure: fan chart of forecasts from ETS(M,Md,M)]
> accuracy(test)
ME RMSE MAE MPE MAPE MASE
-3.35419 58.02763 43.85545 -0.07624 1.18483 0.52452
y
The time series to be forecast.
model
A three-letter code using the ETS classification and notation: N for none, A for additive, M for multiplicative, or Z for automatic selection. Default ZZZ: all components are selected using the information criterion.
damped
If damped=TRUE, then a damped trend will be used (either A_d or M_d).
If damped=FALSE, then a non-damped trend will be used.
If damped=NULL (the default), then either a damped or a non-damped trend will be selected according to the information criterion chosen.
alpha, beta, gamma, phi
The values of the smoothing parameters can be specified using these arguments. If they are set to NULL (the default value for each of them), the parameters are estimated.
additive.only
Only models with additive components will be considered if additive.only=TRUE. Otherwise all models will be considered.
lambda
Box-Cox transformation parameter. It will be ignored if lambda=NULL (the default value). Otherwise, the time series will be transformed before the model is estimated. When lambda is not NULL, additive.only is set to TRUE.
lower, upper
Bounds for the parameter estimates of α, β, γ and φ.
1. Use ets() to find the best ETS model for the price of eggs (data set
eggs). How does this model compare to the one you found in the
previous lab session?
$$Y_t = f(S_t, T_t, E_t)$$
where Y_t = data at period t, S_t = seasonal component at period t, T_t = trend component at period t, and E_t = remainder (or irregular or error) component at period t.
Additive decomposition: $Y_t = S_t + T_t + E_t$.
[Figure: new orders index, 60–100]
[Figure: electrical equipment manufacturing (Euro area), new orders index — STL decomposition into data, seasonal, trend and remainder panels]
[Figure: seasonal sub-series of the decomposition, Jan–Dec]
Seasonally adjusted series: $Y_t - S_t = T_t + E_t$
[Figure: seasonally adjusted new orders index]
[Figure: STL decomposition panels — data, seasonal, trend, remainder]
[Figure: naïve forecasts of the seasonally adjusted data]
How to do this in R
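A sketch using stl() and the forecast package (the elecequip series matches the figures above; the t.window and s.window settings are illustrative assumptions):
library(fpp)
fit <- stl(elecequip, t.window=15, s.window="periodic", robust=TRUE)
plot(fit)                               # data, seasonal, trend, remainder panels
eeadj <- seasadj(fit)                   # seasonally adjusted data
plot(naive(eeadj), xlab="New orders index")
fcast <- forecast(fit, method="naive")  # reseasonalized forecasts
plot(fcast)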
(a) Plot the time series of sales of product A. Can you identify sea-
sonal fluctuations and/or a trend?
(b) Use an STL decomposition to calculate the trend-cycle and seasonal indices. (Experiment with having fixed or changing seasonality.)
(c) Do the results support the graphical interpretation from part
(a)?
(d) Compute and plot the seasonally adjusted data.
(e) Use a random walk to produce forecasts of the seasonally adjusted data.
(f) Reseasonalize the results to give forecasts on the original scale.
[Hint: you can use the stlf function with method="naive".]
(g) Why do the forecasts look too low?
5
Time series cross-validation
5.1 Cross-validation
Traditional evaluation
[Diagram: the series is split along time into training data followed by test data]
Standard cross-validation
Cross-sectional data
Minimizing the AIC is asymptotically equivalent to minimizing the MSE via leave-one-out cross-validation (Stone, 1977).
Time series cross-validation
Minimizing the AIC is asymptotically equivalent to minimizing the MSE via one-step cross-validation (Akaike, 1969, 1973).
[Figure: antidiabetic drug sales and log antidiabetic drug sales]
1. Linear model with trend and seasonal dummies applied to log data.
2. ARIMA model applied to log data
3. ETS model applied to original data
Set k = 48 as the minimum training set size.
Forecast 12 steps ahead based on data up to time k + i − 1, for i = 1, 2, ..., 156.
Compare MAE values for each forecast horizon.
k <- 48
n <- length(a10)
mae1 <- mae2 <- mae3 <- matrix(NA,n-k-12,12)
for(i in 1:(n-k-12))
{
xshort <- window(a10,end=1995+(5+i)/12)
xnext <- window(a10,start=1995+(6+i)/12,end=1996+(5+i)/12)
fit1 <- tslm(xshort ~ trend + season, lambda=0)
fcast1 <- forecast(fit1,h=12)
fit2 <- auto.arima(xshort,D=1, lambda=0)
fcast2 <- forecast(fit2,h=12)
fit3 <- ets(xshort)
fcast3 <- forecast(fit3,h=12)
mae1[i,] <- abs(fcast1$mean-xnext)
mae2[i,] <- abs(fcast2$mean-xnext)
mae3[i,] <- abs(fcast3$mean-xnext)
}
plot(1:12,colMeans(mae1),type="l",col=2,xlab="horizon",ylab="MAE",
ylim=c(0.58,1.0))
lines(1:12,colMeans(mae2),type="l",col=3)
lines(1:12,colMeans(mae3),type="l",col=4)
legend("topleft",legend=c("LM","ARIMA","ETS"),col=2:4,lty=1)
[Figure: cross-validated MAE (90–140) against forecast horizon 1–12 for the LM, ARIMA and ETS models]
[Figure: a second comparison of MAE (90–115) against forecast horizon for LM, ARIMA and ETS]
1. For this exercise, use the monthly Australian short-term overseas visitors data, May 1985–April 2005. (Data set: visitors in the expsmooth package.)
(a) Use ets to find the best model for these data and record the
training set RMSE. You should find that the best model is
ETS(M,A,M).
(b) We will now check how much larger the one-step RMSE is on out-of-sample data using time series cross-validation. The following code will compute the result, beginning with four years of data in the training set.
k <- 48 # minimum size for training set
n <- length(visitors) # Total number of observations
e <- visitors*NA # Vector to record one-step forecast errors
for(i in 48:(n-1))
{
train <- ts(visitors[1:i],freq=12)
fit <- ets(train, "MAM", damped=FALSE)
fc <- forecast(fit,h=1)$mean
e[i] <- visitors[i+1]-fc
}
sqrt(mean(e^2,na.rm=TRUE))
Check that you understand what the code is doing. Ask if you don't.
(c) What would happen in the above loop if I had set
train <- visitors[1:i]?
(d) Plot e. What do you notice about the error variances? Why
does this occur?
(e) How does this problem bias the comparison of the RMSE values from (1a) and (1b)? (Hint: think about the effect of the missing values in e.)
(f) In practice, we will not know that the best model on the whole
data set is ETS(M,A,M) until we observe all the data. So a more
realistic analysis would be to allow ets to select a different
model each time through the loop. Calculate the RMSE using
this approach. (Warning: it will take a while as there are a lot
of models to fit.)
(g) How does the RMSE computed in (1f) compare to that computed in (1b)? Does the re-selection of a model at each step make much difference?
6.1 Transformations
If the data show different variation at different levels of the series, then a
transformation can be useful.
Denote original observations as y1 , . . . , yn and transformed observations as
w1 , . . . , wn .
Square root: $w_t = \sqrt{y_t}$
Cube root: $w_t = \sqrt[3]{y_t}$        (increasing strength)
Logarithm: $w_t = \log(y_t)$
Logarithms, in particular, are useful because they are more interpretable:
changes in a log value are relative (percent) changes on the original
scale.
[Figure: square root, cube root, log and inverse transformations of Australian electricity production]
Box-Cox transformations
$$w_t = \begin{cases} \log(y_t), & \lambda = 0; \\ (y_t^{\lambda} - 1)/\lambda, & \lambda \neq 0. \end{cases}$$
Back-transformation:
$$y_t = \begin{cases} \exp(w_t), & \lambda = 0; \\ (\lambda W_t + 1)^{1/\lambda}, & \lambda \neq 0. \end{cases}$$
plot(BoxCox(elec,lambda=1/3))
fit <- snaive(elec, lambda=1/3)
plot(fit)
plot(fit, include=120)
BoxCox.lambda(elec)
This attempts to balance the seasonal fluctuations and random variation across the series.
Always check the results.
A low value of λ can give extremely large forecast intervals.
6.2 Stationarity
Definition If {yt } is a stationary time series, then for all s, the distribution
of (yt , . . . , yt+s ) does not depend on t.
A stationary series is:
roughly horizontal
constant variance
no patterns predictable in the long-term
Stationary?
[Figures: Dow Jones index and daily changes in the index; annual strikes in the US; sales of new one-family houses, USA; price of a dozen eggs in 1993 dollars; number of pigs slaughtered in Victoria; annual lynx trappings (number trapped); quarterly beer production (megalitres) — which of these are stationary?]
[Figure: Dow Jones index and its ACF]
[Figure: daily changes in the Dow Jones index and their ACF]
Differencing
Second-order differencing
Occasionally the differenced data will not appear stationary and it may be necessary to difference the data a second time:
$$y''_t = y'_t - y'_{t-1} = y_t - 2y_{t-1} + y_{t-2}.$$
[Figure: antidiabetic drug sales ($ million) and monthly log sales]
6.4 Seasonal differencing
A seasonal difference is the difference between an observation and the corresponding observation from the previous year:
$$y'_t = y_t - y_{t-m}$$
where m = number of seasons.
[Figure: monthly log sales and the annual change in monthly log sales]
The seasonally differenced series is closer to being stationary. Remaining non-stationarity can be removed with a further first difference.
If $y'_t = y_t - y_{t-12}$ denotes the seasonally differenced series, then the twice-differenced series is $y^*_t = y'_t - y'_{t-1} = (y_t - y_{t-12}) - (y_{t-1} - y_{t-13})$.
Interpretation of differencing
first differences are the change between one observation and the
next;
seasonal differences are the change from one year to the next.
But taking lag 3 differences for yearly data, for example, results in a
model which cannot be sensibly interpreted.
Dickey-Fuller test
> adf.test(dj)
data: dj
Dickey-Fuller = -1.9872, Lag order = 6, p-value = 0.5816
alternative hypothesis: stationary
ndiffs(x)    # number of first differences required
nsdiffs(x)   # number of seasonal differences required
Automated differencing
ns <- nsdiffs(x)
if(ns > 0) {
  xstar <- diff(x, lag=frequency(x), differences=ns)
} else {
  xstar <- x
}
nd <- ndiffs(xstar)
if(nd > 0) {
  xstar <- diff(xstar, differences=nd)
}
$$B(B y_t) = B^2 y_t = y_{t-2}.$$
For monthly data, if we wish to shift attention to the same month last year, then $B^{12}$ is used, and the notation is $B^{12} y_t = y_{t-12}$.
The backward shift operator is convenient for describing the process of differencing. A first difference can be written as $(1 - B) y_t$, and a d-th order difference as
$$(1 - B)^d y_t.$$
A first difference followed by a seasonal difference is
$$(1 - B)(1 - B^m) y_t.$$
[Figure: two example time series, indices 0–100]
AR(1) model
$$y_t = c + \phi_1 y_{t-1} + e_t$$
When φ_1 = 0, y_t is equivalent to white noise.
When φ_1 = 1 and c = 0, y_t is equivalent to a random walk.
When φ_1 = 1 and c ≠ 0, y_t is equivalent to a random walk with drift.
When φ_1 < 0, y_t tends to oscillate between positive and negative values.
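A quick simulation sketch of two of these cases:
set.seed(1)
y1 <- arima.sim(list(ar = 0.8), n = 100)   # phi1 = 0.8: smooth, mean-reverting
y2 <- arima.sim(list(ar = -0.8), n = 100)  # phi1 = -0.8: oscillating
par(mfrow = c(1, 2)); plot(y1); plot(y2)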
Stationarity conditions
For an AR(1) model, stationarity requires −1 < φ_1 < 1; more generally, the complex roots of the AR polynomial must lie outside the unit circle.
[Figure: two simulated autoregressive series]
Invertibility
An MA model is invertible if it can be rewritten as a convergent AR(∞) process; for an MA(1) model this requires −1 < θ_1 < 1.
ARIMA(p, d, q) model
ARMA model:
$$y_t = c + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \theta_1 e_{t-1} + \dots + \theta_q e_{t-q} + e_t$$
or
$$(1 - \phi_1 B - \dots - \phi_p B^p) y_t = c + (1 + \theta_1 B + \dots + \theta_q B^q) e_t$$
ARIMA(1,1,1) model:
$$\underbrace{(1 - \phi_1 B)}_{\text{AR(1)}} \underbrace{(1 - B)}_{\text{First difference}} y_t = c + \underbrace{(1 + \theta_1 B)}_{\text{MA(1)}} e_t$$
Written out: $y_t = c + (1 + \phi_1) y_{t-1} - \phi_1 y_{t-2} + \theta_1 e_{t-1} + e_t$
[Figure: US consumption — quarterly percentage change]
plot(forecast(fit,h=10),include=80)
Partial autocorrelations
Partial autocorrelations measure the relationship between y_t and y_{t−k} after removing the effects of lags 1, 2, ..., k − 1.
Example: US consumption
[Figure: ACF and PACF of quarterly percentage change in US consumption, lags 1–21]
[Figure: another series with its ACF and PACF, lags 5–15]
Information criteria
AIC, AICc and BIC can be used to select p and q, but not d, since differencing changes the data on which the likelihood is computed.
tsdisplay(internet)                    # series with its ACF and PACF
adf.test(internet)                     # ADF unit root test (tseries)
kpss.test(internet)                    # KPSS stationarity test (tseries)
kpss.test(diff(internet))
tsdisplay(diff(internet))
fit <- Arima(internet,order=c(3,1,0))
fit2 <- auto.arima(internet)
Acf(residuals(fit))
Box.test(residuals(fit), fitdf=3, lag=10, type="Ljung")
tsdiag(fit)
forecast(fit)
plot(forecast(fit))
Modelling procedure
6. Check the residuals from the chosen model by plotting their ACF and applying a portmanteau test. Do the residuals look like white noise? If not, return to model selection and try a modified model.
7. Once the residuals look like white noise, calculate forecasts.
[Figure: seasonally adjusted electrical equipment orders (eeadj), 80–110]
[Figure: the differenced series with its ACF and PACF, lags 0–35]
> tsdisplay(diff(eeadj))
4. The PACF is suggestive of an AR(3), so an initial candidate model is ARIMA(3,1,0). There are no other obvious candidates.
5. Fit ARIMA(3,1,0) model along with variations: ARIMA(4,1,0),
ARIMA(2,1,0), ARIMA(3,1,1), etc. ARIMA(3,1,1) has smallest AICc
value.
Coefficients:
ar1 ar2 ar3 ma1
0.0519 0.1191 0.3730 -0.4542
s.e. 0.1840 0.0888 0.0679 0.1993
7.6 Forecasting
Point forecasts
h = 1: Replace t with T + 1 in the model equation; replace the future error e_{T+1} by 0 and e_T by the observed residual ê_T.
h = 2: Replace t with T + 2; replace e_{T+2} by 0 and y_{T+1} by its forecast ŷ_{T+1|T}. Continue similarly for h = 3, 4, ....
Forecast intervals
95% forecast interval: $\hat{y}_{T+h|T} \pm 1.96 \sqrt{v_{T+h|T}}$, where $v_{T+h|T}$ is the estimated forecast variance.
$v_{T+1|T} = \hat{\sigma}^2$ for all ARIMA models, regardless of parameters and orders.
Multi-step forecast intervals for ARIMA(0,0,q):
$$y_t = e_t + \sum_{i=1}^{q} \theta_i e_{t-i},$$
$$v_{T+h|T} = \hat{\sigma}^2 \left[ 1 + \sum_{i=1}^{h-1} \theta_i^2 \right], \qquad h = 2, 3, \dots$$
Forecast intervals increase in size with the forecast horizon.
Forecast intervals can be difficult to calculate by hand.
Calculations assume the residuals are uncorrelated and normally distributed.
Forecast intervals tend to be too narrow because:
the uncertainty in the parameter estimates has not been accounted for;
the ARIMA model assumes historical patterns will not change during the forecast period;
the ARIMA model assumes uncorrelated future errors.
A seasonal ARIMA model is written ARIMA(p,d,q)(P,D,Q)_m, where (p,d,q) is the non-seasonal part of the model, (P,D,Q) is the seasonal part, and m is the number of observations per year.
All the factors can be multiplied out and the general model written as follows; for example, an ARIMA(1,1,1)(1,1,1)_4 model (quarterly data) is
$$(1 - \phi_1 B)(1 - \Phi_1 B^4)(1 - B)(1 - B^4) y_t = (1 + \theta_1 B)(1 + \Theta_1 B^4) e_t.$$
> plot(euretail)
[Figure: euretail — quarterly retail index, Euro area (90–102)]
> tsdisplay(diff(euretail,4))
[Figure: seasonally differenced euretail with its ACF and PACF]
> tsdisplay(diff(diff(euretail,4)))
[Figure: double-differenced euretail with its ACF and PACF]
[Figures: residual diagnostics (series, ACF and PACF) for the candidate seasonal ARIMA models]
Coefficients:
ar1 ma1 sma1
0.8828 -0.5208 -0.9704
s.e. 0.1424 0.1755 0.6792
Coefficients:
ma1 ma2 ma3 sma1
0.2625 0.3697 0.4194 -0.6615
s.e. 0.1239 0.1260 0.1296 0.1555
[Figure: monthly H02 sales (million scripts) and log H02 sales]
[Figure: seasonally differenced H02 scripts]
[Figure: ACF and PACF of the seasonally differenced series, lags 0–35]
Choose D = 1 and d = 0.
Spikes in the PACF at lags 12 and 24 suggest a seasonal AR(2) term.
Spikes in the PACF suggest a possible non-seasonal AR(3) term.
Initial candidate model: ARIMA(3,0,0)(2,1,0)_12.
Model AICc
ARIMA(3,0,0)(2,1,0)12 475.12
ARIMA(3,0,1)(2,1,0)12 476.31
ARIMA(3,0,2)(2,1,0)12 474.88
ARIMA(3,0,1)(1,1,0)12 463.40
ARIMA(3,0,1)(0,1,1)12 483.67
ARIMA(3,0,1)(0,1,2)12 485.48
ARIMA(3,0,1)(1,1,1)12 484.25
> fit <- Arima(h02, order=c(3,0,1), seasonal=c(0,1,2), lambda=0)
ARIMA(3,0,1)(0,1,2)[12]
Box Cox transformation: lambda= 0
Coefficients:
ar1 ar2 ar3 ma1 sma1 sma2
-0.1603 0.5481 0.5678 0.3827 -0.5222 -0.1768
s.e. 0.1636 0.0878 0.0942 0.1895 0.0861 0.0872
[Figure: residuals from the fitted model with their ACF and PACF, lags 0–35]
tsdisplay(residuals(fit))
Box.test(residuals(fit), lag=36, fitdf=6, type="Ljung")
auto.arima(h02,lambda=0)
Model RMSE
ARIMA(3,0,0)(2,1,0)12 0.0661
ARIMA(3,0,1)(2,1,0)12 0.0646
ARIMA(3,0,2)(2,1,0)12 0.0645
ARIMA(3,0,1)(1,1,0)12 0.0679
ARIMA(3,0,1)(0,1,1)12 0.0644
ARIMA(3,0,1)(0,1,2)12 0.0622
ARIMA(3,0,1)(1,1,1)12 0.0630
ARIMA(4,0,3)(0,1,1)12 0.0648
ARIMA(3,0,3)(0,1,1)12 0.0640
ARIMA(4,0,2)(0,1,1)12 0.0648
ARIMA(3,0,2)(0,1,1)12 0.0644
ARIMA(2,1,3)(0,1,1)12 0.0634
ARIMA(2,1,4)(0,1,1)12 0.0632
ARIMA(2,1,5)(0,1,1)12 0.0640
getrmse <- function(x,h,...)
{
train.end <- time(x)[length(x)-h]
test.start <- time(x)[length(x)-h+1]
train <- window(x,end=train.end)
test <- window(x,start=test.start)
fit <- Arima(train,...)
fc <- forecast(fit,h=h)
return(accuracy(fc,test)[2,"RMSE"])
}
getrmse(h02,h=24,order=c(3,0,0),seasonal=c(2,1,0),lambda=0)
getrmse(h02,h=24,order=c(3,0,1),seasonal=c(2,1,0),lambda=0)
getrmse(h02,h=24,order=c(3,0,2),seasonal=c(2,1,0),lambda=0)
getrmse(h02,h=24,order=c(3,0,1),seasonal=c(1,1,0),lambda=0)
getrmse(h02,h=24,order=c(3,0,1),seasonal=c(0,1,1),lambda=0)
getrmse(h02,h=24,order=c(3,0,1),seasonal=c(0,1,2),lambda=0)
getrmse(h02,h=24,order=c(3,0,1),seasonal=c(1,1,1),lambda=0)
getrmse(h02,h=24,order=c(4,0,3),seasonal=c(0,1,1),lambda=0)
getrmse(h02,h=24,order=c(3,0,3),seasonal=c(0,1,1),lambda=0)
getrmse(h02,h=24,order=c(4,0,2),seasonal=c(0,1,1),lambda=0)
getrmse(h02,h=24,order=c(3,0,2),seasonal=c(0,1,1),lambda=0)
getrmse(h02,h=24,order=c(2,1,3),seasonal=c(0,1,1),lambda=0)
getrmse(h02,h=24,order=c(2,1,4),seasonal=c(0,1,1),lambda=0)
getrmse(h02,h=24,order=c(2,1,5),seasonal=c(0,1,1),lambda=0)
Models with the lowest AICc values tend to give slightly better results than the other models.
AICc comparisons must be made between models with the same orders of differencing. RMSE comparisons on a test set can involve any models.
No model passes all the residual tests.
Use the best model available, even if it does not pass all tests.
In this case, the ARIMA(3,0,1)(0,1,2)_12 has the lowest RMSE value and the best AICc value among models with fewer than six parameters.
[Figure: Forecasts from ARIMA(3,0,1)(0,1,2)[12] — H02 sales (million scripts), 0.4–1.6]
Equivalences
Some ETS models have exact ARIMA equivalents; for example, ETS(A,N,N) is equivalent to ARIMA(0,1,1) and ETS(A,A,N) to ARIMA(0,2,2).
2. For the time series you selected from the retail data set in Lab Session 6, develop an appropriate seasonal ARIMA model, and compare the forecasts with those you obtained earlier.
Obtain up-to-date data from January 2008 onwards from the ABS website (www.abs.gov.au) (Cat. 8501.0, Table 11), and compare your forecasts with the actual numbers. How good were the forecasts from the various models?
9
State space models
yt depends on xt .
A different error process affects xt |xt1 and yt |xt .
Trigonometric models
$$y_t = \ell_t + \sum_{j=1}^{J} s_{j,t} + \varepsilon_t$$
$$\ell_t = \ell_{t-1} + b_{t-1} + \xi_t$$
$$b_t = b_{t-1} + \zeta_t$$
$$s_{j,t} = \cos\lambda_j \, s_{j,t-1} + \sin\lambda_j \, s^*_{j,t-1} + \omega_{j,t}$$
$$s^*_{j,t} = -\sin\lambda_j \, s_{j,t-1} + \cos\lambda_j \, s^*_{j,t-1} + \omega^*_{j,t}$$
$$\lambda_j = 2\pi j / m$$
ε_t, ξ_t, ζ_t, ω_{j,t}, ω*_{j,t} are independent Gaussian white noise processes.
ω_{j,t} and ω*_{j,t} have the same variance σ²_{ω,j}.
Equivalent to the BSM when σ²_{ω,j} = σ²_ω and J = m/2.
Choose J < m/2 for fewer degrees of freedom.
Structural models in R
StructTS(oil, type="level")
StructTS(ausair, type="trend")
StructTS(austourists, type="BSM")
[Figure: BSM components for austourists — level, slope and seasonal]
Observation equation: $y_t = f' x_t + \varepsilon_t$
State equation: $x_t = G x_{t-1} + w_t$
$$x_t = \begin{bmatrix} \ell_t \\ b_t \\ s_{1,t} \\ s_{2,t} \\ s_{3,t} \\ \vdots \\ s_{m-1,t} \end{bmatrix} \qquad G = \begin{bmatrix} 1 & 1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & -1 & -1 & \cdots & -1 & -1 \\ 0 & 0 & 1 & 0 & \cdots & 0 & 0 \\ 0 & 0 & 0 & 1 & \ddots & \vdots & \vdots \\ \vdots & \vdots & \vdots & & \ddots & 0 & 0 \\ 0 & 0 & 0 & \cdots & 0 & 1 & 0 \end{bmatrix}$$
Notation: $x_{t|s} = \text{E}[x_t \mid y_1, \dots, y_s]$ and $P_{t|s} = \text{Var}[x_t \mid y_1, \dots, y_s]$.
Iterate for t = 1, ..., T.
Assume we know $x_{1|0}$ and $P_{1|0}$.
These are just conditional expectations, so the filter gives minimum MSE estimates.
Example: local level model
$$y_t = \ell_t + \varepsilon_t, \qquad \varepsilon_t \sim \text{NID}(0, \sigma^2)$$
$$\ell_t = \ell_{t-1} + u_t, \qquad u_t \sim \text{NID}(0, \sigma_u^2)$$
Forecast: $\hat{y}_{t|t-1} = \ell_{t-1|t-1}$, with variance $v_{t|t-1} = p_{t|t-1} + \sigma^2$.
Update: $\ell_{t|t} = \ell_{t-1|t-1} + p_{t|t-1} v_{t|t-1}^{-1} (y_t - \hat{y}_{t|t-1})$
Predict: $p_{t+1|t} = p_{t|t-1}(1 - v_{t|t-1}^{-1} p_{t|t-1}) + \sigma_u^2$
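These recursions are easy to code directly; a minimal sketch for the local level model (the function name, and the assumption that sigma2 and sigma2_u are known, are illustrative):
kalman_ll <- function(y, sigma2, sigma2_u, l0 = y[1], p0 = 1e4) {
  n <- length(y); lhat <- numeric(n); l <- l0; p <- p0
  for (t in 1:n) {
    v <- p + sigma2                  # forecast variance v_{t|t-1}
    l <- l + (p / v) * (y[t] - l)    # update: filtered level l_{t|t}
    p <- p * (1 - p / v) + sigma2_u  # predict: p_{t+1|t}
    lhat[t] <- l
  }
  lhat                               # filtered level estimates
}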
Multi-step forecasting
Likelihood calculation
AR(2) model
$$y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + e_t, \qquad e_t \sim \text{NID}(0, \sigma^2)$$
Let $x_t = \begin{bmatrix} y_t \\ y_{t-1} \end{bmatrix}$ and $w_t = \begin{bmatrix} e_t \\ 0 \end{bmatrix}$. Then
$$y_t = [1 \;\; 0]\, x_t, \qquad x_t = \begin{bmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{bmatrix} x_{t-1} + w_t.$$
An alternative formulation:
$$y_t = [1 \;\; 0]\, x_t, \qquad x_t = \begin{bmatrix} \phi_1 & 1 \\ \phi_2 & 0 \end{bmatrix} x_{t-1} + w_t.$$
AR(p) model
$$y_t = \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + e_t, \qquad e_t \sim \text{NID}(0, \sigma^2)$$
Let $x_t = (y_t, y_{t-1}, \dots, y_{t-p+1})'$ and $w_t = (e_t, 0, \dots, 0)'$. Then
$$y_t = [1 \;\; 0 \;\; \cdots \;\; 0]\, x_t, \qquad x_t = \begin{bmatrix} \phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_p \\ 1 & 0 & \cdots & 0 & 0 \\ \vdots & \ddots & \ddots & & \vdots \\ 0 & \cdots & 0 & 1 & 0 \end{bmatrix} x_{t-1} + w_t.$$
ARMA(1,1) model
$$y_t = \phi y_{t-1} + \theta e_{t-1} + e_t, \qquad e_t \sim \text{NID}(0, \sigma^2)$$
Let $x_t = \begin{bmatrix} y_t \\ \theta e_t \end{bmatrix}$ and $w_t = \begin{bmatrix} e_t \\ \theta e_t \end{bmatrix}$. Then
$$y_t = [1 \;\; 0]\, x_t, \qquad x_t = \begin{bmatrix} \phi & 1 \\ 0 & 0 \end{bmatrix} x_{t-1} + w_t.$$
ARMA(p,q) model
$$y_t = \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \theta_1 e_{t-1} + \dots + \theta_q e_{t-q} + e_t$$
Let $r = \max(p, q+1)$, with $\phi_i = 0$ for $i > p$ and $\theta_j = 0$ for $j > q$. Then
$$y_t = [1 \;\; 0 \;\; \cdots \;\; 0]\, x_t, \qquad x_t = \begin{bmatrix} \phi_1 & 1 & 0 & \cdots & 0 \\ \phi_2 & 0 & 1 & \ddots & \vdots \\ \vdots & \vdots & & \ddots & 1 \\ \phi_r & 0 & \cdots & 0 & 0 \end{bmatrix} x_{t-1} + \begin{bmatrix} 1 \\ \theta_1 \\ \vdots \\ \theta_{r-1} \end{bmatrix} e_t.$$
Smoothed state estimates come from the backward recursion $x_{t|T} = x_{t|t} + A_t (x_{t+1|T} - x_{t+1|t})$, where $A_t = P_{t|t} G' P_{t+1|t}^{-1}$.
fit <- StructTS(austourists, type = "BSM")
sm <- tsSmooth(fit)
plot(austourists)
lines(sm[,1], col="blue")           # smoothed level
lines(fitted(fit)[,1], col="red")   # filtered level
legend("topleft", col=c("blue","red"), lty=1,
       legend=c("Smoothed level","Filtered level"))
plot(austourists)
# Seasonally adjusted data
aus.sa <- austourists - sm[,3]
lines(aus.sa, col="blue")
[Figure: austourists with filtered and smoothed level estimates]
[Figure: austourists with the seasonally adjusted series overlaid]
x <- austourists
miss <- sample(1:length(x), 5)
x[miss] <- NA
fit <- StructTS(x, type = "BSM")
sm <- tsSmooth(fit)
estim <- sm[,1] + sm[,3]
plot(x, ylim=range(austourists))
points(time(x)[miss], estim[miss], col="red", pch=1)
points(time(x)[miss], austourists[miss], col="black", pch=1)
legend("topleft", pch=1, col=c(2,1), legend=c("Estimate","Actual"))
[Figure: austourists with five artificially missing values — smoothed estimates vs actual values]
$$y_t = f_t' x_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma_t^2)$$
$$x_t = G_t x_{t-1} + w_t, \qquad w_t \sim N(0, W_t)$$
Kalman recursions as before, but with time-varying matrices.
Regression with a fixed coefficient:
$$y_t = \ell_t + \beta z_t + \varepsilon_t, \qquad \ell_t = \ell_{t-1} + \xi_t$$
$$f_t' = [1 \;\; z_t], \quad x_t = \begin{bmatrix} \ell_t \\ \beta \end{bmatrix}, \quad G = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad W_t = \begin{bmatrix} \sigma_\xi^2 & 0 \\ 0 & 0 \end{bmatrix}$$
Assumes z_t is fixed and known (as in regression).
The estimate of β is given by x_{T|T}.
Equivalent to simple linear regression with a time-varying intercept.
Easy to extend to multiple regression with additional terms.
Regression with a time-varying coefficient:
$$y_t = \ell_t + \beta_t z_t + \varepsilon_t, \qquad \ell_t = \ell_{t-1} + \xi_t, \qquad \beta_t = \beta_{t-1} + \zeta_t$$
$$f_t' = [1 \;\; z_t], \quad x_t = \begin{bmatrix} \ell_t \\ \beta_t \end{bmatrix}, \quad G = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad W_t = \begin{bmatrix} \sigma_\xi^2 & 0 \\ 0 & \sigma_\zeta^2 \end{bmatrix}$$
Allows for a linear regression with parameters that change slowly over time.
Parameters follow independent random walks.
Estimates of the parameters are given by $x_{t|t}$ or $x_{t|T}$.
1. Use StructTS to forecast each of the time series you used in Lab
session 4. How do your results compare to those obtained earlier in
terms of their point forecasts and prediction intervals?
Check the residuals of each fitted model to ensure they look like
white noise.
2. In this exercise, you will write your own code for updating regression coefficients using a Kalman filter. We will model quarterly growth rates in US personal consumption expenditure (y) against quarterly growth rates in US real personal disposable income (z). So
the model is $y_t = a + b z_t + \varepsilon_t$. The corresponding state space model is
$$y_t = a_t + b_t z_t + \varepsilon_t, \qquad a_t = a_{t-1}, \qquad b_t = b_{t-1},$$
that is,
$$y_t = f_t' x_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma_t^2)$$
$$x_t = G_t x_{t-1} + w_t, \qquad w_t \sim N(0, W_t)$$
where
$$f_t' = [1 \;\; z_t], \quad x_t = \begin{bmatrix} a_t \\ b_t \end{bmatrix}, \quad G_t = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad W_t = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}.$$
Regression with ARIMA errors:
$$y_t = \beta_0 + \beta_1 x_{1,t} + \dots + \beta_k x_{k,t} + n_t,$$
$$(1 - \phi_1 B)(1 - B) n_t = (1 + \theta_1 B) e_t.$$
Estimation
Differencing all the variables turns this into a regression with ARMA errors:
$$y'_t = \beta_1 x'_{1,t} + \dots + \beta_k x'_{k,t} + n'_t,$$
$$(1 - \phi_1 B) n'_t = (1 + \theta_1 B) e_t.$$
In general,
$$y_t = \beta_0 + \beta_1 x_{1,t} + \dots + \beta_k x_{k,t} + n_t, \quad \text{where } \phi(B)(1 - B)^d n_t = \theta(B) e_t.$$
Model selection
Selecting predictors
[Figure: quarterly changes in US consumption and personal income, and a scatterplot of consumption against income]
arima.errors(fit)
[Figure: ACF and PACF of the ARIMA errors, lags 5–20]
Coefficients:
ar1 ma1 ma2 intercept usconsumption[,2]
0.6516 -0.5440 0.2187 0.5750 0.2420
s.e. 0.1468 0.1576 0.0790 0.0951 0.0513
10.3 Forecasting
Deterministic trend
$$y_t = \beta_0 + \beta_1 t + n_t$$
where n_t is an ARMA process.
Stochastic trend
$$y_t = \beta_0 + \beta_1 t + n_t$$
where n_t is an ARIMA process with d ≥ 1.
Difference both sides until n_t is stationary:
$$y'_t = \beta_1 + n'_t$$
[Figure: international visitors to Australia (millions)]
Deterministic trend
> auto.arima(austa,d=0,xreg=1:length(austa))
ARIMA(2,0,0) with non-zero mean
Coefficients:
ar1 ar2 intercept 1:length(austa)
1.0371 -0.3379 0.4173 0.1715
s.e. 0.1675 0.1797 0.1866 0.0102
$$y_t = 0.4173 + 0.1715 t + n_t$$
$$n_t = 1.0371 n_{t-1} - 0.3379 n_{t-2} + e_t$$
$$e_t \sim \text{NID}(0, 0.02486)$$
Stochastic trend
> auto.arima(austa,d=1)
ARIMA(0,1,0) with drift
Coefficients:
drift
0.1538
s.e. 0.0323
$$y_t - y_{t-1} = 0.1538 + e_t$$
$$y_t = y_0 + 0.1538 t + n_t$$
$$n_t = n_{t-1} + e_t$$
$$e_t \sim \text{NID}(0, 0.03132)$$
$$s_k(t) = \sin\left(\frac{2\pi k t}{m}\right), \qquad c_k(t) = \cos\left(\frac{2\pi k t}{m}\right)$$
$$y_t = a + \sum_{k=1}^{K} \left[ \alpha_k s_k(t) + \beta_k c_k(t) \right] + n_t$$
n_t is a non-seasonal ARIMA process.
Every periodic function can be approximated by sums of sin and cos terms for large enough K.
Choose K by minimizing the AICc.
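A sketch of this using fourier() from the forecast package (the monthly USAccDeaths series and K = 5 are illustrative assumptions; in practice choose K by minimising the AICc):
library(forecast)
K <- 5
fit <- auto.arima(USAccDeaths, xreg=fourier(USAccDeaths, K=K), seasonal=FALSE)
fc <- forecast(fit, xreg=fourier(USAccDeaths, K=K, h=24))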
US Accidental Deaths
plot(fc)
[Figure: Forecasts from ARIMA(0,1,1) — US accidental deaths, 7000–12000]
The model includes present and past values of the predictor: x_t, x_{t-1}, x_{t-2}, ....
$$y_t = a + \gamma_0 x_t + \gamma_1 x_{t-1} + \dots + \gamma_k x_{t-k} + n_t$$
where n_t is an ARIMA process.
Rewrite the model as
$$y_t = a + (\gamma_0 + \gamma_1 B + \gamma_2 B^2 + \dots + \gamma_k B^k) x_t + n_t = a + \gamma(B) x_t + n_t.$$
[Figure: TV advertising expenditure (TV Adverts, 6–10) over time]
Coefficients:
ar1 ar2 ar3 intercept AdLag0 AdLag1
1.4117 -0.9317 0.3591 2.0393 1.2564 0.1625
s.e. 0.1698 0.2545 0.1592 0.9931 0.0667 0.0591
$$\phi(B) n_t = \theta(B) e_t \qquad\text{or}\qquad n_t = \frac{\theta(B)}{\phi(B)} e_t = \psi(B) e_t.$$
$$y_t = a + \gamma(B) x_t + \psi(B) e_t$$
ARMA models are rational approximations to general transfer functions of e_t.
We can also replace γ(B) by a rational approximation.
There is no R package for forecasting using a general transfer function approach.
1. For the time series you selected from the retail data set in previous
lab sessions:
(a) Use the data to calculate the average cost of a night's accommodation in Victoria each month.
(b) Plot this cost time series against CPI.
(c) Produce time series plots of both variables and explain why
logarithms of both variables need to be taken before fitting any
models.
(d) Fit an appropriate regression model with ARIMA errors.
(e) Forecast the average price per room for the next twelve months using your fitted model. (Hint: You will need to produce forecasts of the CPI figures first.)
11
Hierarchical forecasting
Total
A B C
AA AB AC BA BB BC CA CB CC
Examples
Manufacturing product hierarchies
Net labour turnover
Pharmaceutical sales
Tourism demand by region and purpose
A hierarchical time series is a collection of several time series that are
linked together in a hierarchical structure.
Example: Pharmaceutical products are organized in a hierarchy under the
Anatomical Therapeutic Chemical (ATC) Classification System.
A grouped time series is a collection of time series that are aggregated in
a number of non-hierarchical ways.
Example: Australian tourism demand is grouped by region and purpose of
travel.
Hierarchical data
Total
A B C
$$Y_t = \begin{bmatrix} Y_t \\ Y_{A,t} \\ Y_{B,t} \\ Y_{C,t} \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{S} \underbrace{\begin{bmatrix} Y_{A,t} \\ Y_{B,t} \\ Y_{C,t} \end{bmatrix}}_{B_t} = S B_t$$
Total
A B C
AX AY AZ BX BY BZ CX CY CZ
$$Y_t = \begin{bmatrix} Y_t \\ Y_{A,t} \\ Y_{B,t} \\ Y_{C,t} \\ Y_{AX,t} \\ \vdots \\ Y_{CZ,t} \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\ & & & & I_9 & & & & \end{bmatrix}}_{S} \underbrace{\begin{bmatrix} Y_{AX,t} \\ Y_{AY,t} \\ \vdots \\ Y_{CZ,t} \end{bmatrix}}_{B_t} = S B_t$$
where I_9 is the 9 × 9 identity matrix.
Grouped data
        X     Y     Total
A       AX    AY    A
B       BX    BY    B
Total   X     Y     Total
$$Y_t = \begin{bmatrix} Y_t \\ Y_{A,t} \\ Y_{B,t} \\ Y_{X,t} \\ Y_{Y,t} \\ Y_{AX,t} \\ Y_{AY,t} \\ Y_{BX,t} \\ Y_{BY,t} \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}}_{S} \underbrace{\begin{bmatrix} Y_{AX,t} \\ Y_{AY,t} \\ Y_{BX,t} \\ Y_{BY,t} \end{bmatrix}}_{B_t} = S B_t$$
Forecasting notation
Let $\hat{Y}_n(h)$ be the vector of initial (base) h-step forecasts made at time n. All reconciliation methods can be written as
$$\tilde{Y}_n(h) = S P \hat{Y}_n(h)$$
for some matrix P; the revised forecasts $\tilde{Y}_n(h)$ then add up correctly across the hierarchy.
Bottom-up forecasts
$$\tilde{Y}_n(h) = S P \hat{Y}_n(h)$$
Bottom-up forecasts are obtained using P = [0 | I], where 0 is a null matrix and I is an identity matrix.
P extracts only the bottom-level forecasts from $\hat{Y}_n(h)$, and S adds them up to give the bottom-up forecasts.
Top-down forecasts
$$\tilde{Y}_n(h) = S P \hat{Y}_n(h)$$
Top-down forecasts are obtained using P = [p | 0], where p = [p_1, p_2, ..., p_{m_K}]' is a vector of proportions that sum to one.
P distributes the forecasts of the aggregate to the lowest-level series.
Different methods of top-down forecasting lead to different proportionality vectors p.
Let $\Sigma_h = \text{Var}[Y_n(h) \mid Y_1, \dots, Y_n]$ denote the variance of the base forecast errors. Then
$$\text{Var}[\tilde{Y}_n(h) \mid Y_1, \dots, Y_n] = S P \Sigma_h P' S'.$$
Approximate $\Sigma_1$ by cI.
Or assume $\varepsilon_h \approx S \varepsilon_{B,h}$, where $\varepsilon_{B,h}$ is the forecast error at the bottom level; then $\Sigma_h \approx S \Omega_h S'$, where $\Omega_h = \text{Var}(\varepsilon_{B,h})$.
If a Moore-Penrose generalized inverse is used, then
$$(S' \Sigma_h^{-} S)^{-1} S' \Sigma_h^{-} = (S'S)^{-1} S'.$$
Features
Challenges
[Figure: quarterly domestic tourism, 2000–2010, for NSW, VIC, QLD and "Other" categories at several levels of the hierarchy]
Forecast evaluation
ANZSCO occupation hierarchy:
8 major groups
  43 sub-major groups
    97 minor groups
      359 unit groups
        1023 occupations
Example: statistician sits within major group 2, Professionals.
[Figure: Australian employment ('000) at hierarchy levels 0–4; level 1 consists of the eight major groups — 1. Managers, 2. Professionals, ..., 6. Sales workers, 7. Machinery operators and drivers, 8. Labourers]
[Figure: base forecasts vs reconciled forecasts of Australian employment at levels 0–4]
RMSE h=1 h=2 h=3 h=4 h=5 h=6 h=7 h=8 Average
Top level
Bottom-up 74.71 102.02 121.70 131.17 147.08 157.12 169.60 178.93 135.29
OLS 52.20 77.77 101.50 119.03 138.27 150.75 160.04 166.38 120.74
WLS 61.77 86.32 107.26 119.33 137.01 146.88 156.71 162.38 122.21
Level 1
Bottom-up 21.59 27.33 30.81 32.94 35.45 37.10 39.00 40.51 33.09
OLS 21.89 28.55 32.74 35.58 38.82 41.24 43.34 45.49 35.96
WLS 20.58 26.19 29.71 31.84 34.36 35.89 37.53 38.86 31.87
Level 2
Bottom-up 8.78 10.72 11.79 12.42 13.13 13.61 14.14 14.65 12.40
OLS 9.02 11.19 12.34 13.04 13.92 14.56 15.17 15.77 13.13
WLS 8.58 10.48 11.54 12.15 12.88 13.36 13.87 14.36 12.15
Level 3
Bottom-up 5.44 6.57 7.17 7.53 7.94 8.27 8.60 8.89 7.55
OLS 5.55 6.78 7.42 7.81 8.29 8.68 9.04 9.37 7.87
WLS 5.35 6.46 7.06 7.42 7.84 8.17 8.48 8.76 7.44
Bottom Level
Bottom-up 2.35 2.79 3.02 3.15 3.29 3.42 3.54 3.65 3.15
OLS 2.40 2.86 3.10 3.24 3.41 3.55 3.68 3.80 3.25
WLS 2.34 2.77 2.99 3.12 3.27 3.40 3.52 3.63 3.13
Example using R
library(hts)
Total
A B
AX AY AZ BX BY
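A sketch of building and forecasting this hierarchy (the bottom-level data here are simulated placeholders):
library(hts)
bts <- ts(matrix(rnorm(60*5, mean=10), ncol=5),
          start=2000, frequency=4)           # columns AX, AY, AZ, BX, BY
y <- hts(bts, nodes=list(2, c(3, 2)))        # A has 3 children, B has 2
fc <- forecast(y, h=8, method="comb", fmethod="ets")
plot(fc)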
forecast.gts function
Usage
forecast(object, h,
method = c("comb", "bu", "mo", "tdgsf", "tdgsa", "tdfp"),
fmethod = c("ets", "rw", "arima"),
weights = c("sd", "none", "nseries"),
positive = FALSE,
parallel = FALSE, num.cores = 2, ...)
Arguments
object Hierarchical time series object of class gts.
h Forecast horizon
method Method for distributing forecasts within the hierarchy.
fmethod Forecasting method to use
positive If TRUE, forecasts are forced to be strictly positive
weights Weights used for "optimal combination" method.
When weights = "sd", it takes account of the standard
deviation of forecasts.
parallel If TRUE, allow parallel processing
num.cores If parallel = TRUE, specify how many cores are going
to be used
1. We will reconcile the forecasts for the infant deaths data. The following code can be used. Check that you understand what each step is doing. You will probably need to read the help files for some functions.
library(hts)
plot(infantgts)
smatrix(infantgts)
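# The object allf (base forecasts of every aggregated series) is not created
# above; a sketch of one way to do it (the 10-step horizon, auto.arima, and
# start year are illustrative assumptions):
allts <- aggts(infantgts)
allf <- matrix(NA, nrow=10, ncol=ncol(allts))
for(i in 1:ncol(allts))
  allf[,i] <- forecast(auto.arima(allts[,i]), h=10)$mean
allf <- ts(allf, start=2004)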
# combine the forecasts with the group matrix to get a gts object
y.f <- combinef(allf, groups = infantgts$groups)
VAR example
> ar(usconsumption,order=3)
$ar
, , 1 consumption income
consumption 0.222 0.0424
income 0.475 -0.2390
, , 2 consumption income
consumption 0.2001 -0.0977
income 0.0288 -0.1097
, , 3 consumption income
consumption 0.235 -0.0238
income 0.406 -0.0923
$var.pred
consumption income
consumption 0.393 0.193
income 0.193 0.735
> library(vars)
> var <- VAR(usconsumption, p=3)   # assumed call; the output below matches this model
> summary(var)
VAR Estimation Results:
=========================
Endogenous variables: consumption, income
Deterministic variables: const
Sample size: 161
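Forecasts from the fitted VAR can then be obtained with predict() from the vars package (a sketch; the horizon and confidence level are illustrative):
fcst <- predict(var, n.ahead = 8, ci = 0.95)
plot(fcst)   # forecasts for both consumption and income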
[Figure: quarterly changes in US consumption and income]
13
Neural network models
[Diagram: two feed-forward networks — one with inputs connected directly to the output, and one with a hidden layer between the inputs and the output]
Inputs to each hidden neuron are combined linearly and then passed through a nonlinear function such as the sigmoid
$$s(z) = \frac{1}{1 + e^{-z}}.$$
This tends to reduce the effect of extreme input values, thus making the network somewhat robust to outliers.
Weights take random values to begin with, which are then updated
using the observed data.
There is an element of randomness in the predictions. So the network is usually trained several times using different random starting points, and the results are averaged.
The number of hidden layers, and the number of nodes in each hidden layer, must be specified in advance.
NNAR models
NNAR(p,k): a feed-forward network with the p lagged values y_{t-1}, ..., y_{t-p} as inputs and k nodes in a single hidden layer. For seasonal data, NNAR(p,P,k)_m also uses lagged values from previous seasons as inputs.
NNAR models in R
The nnetar() function fits an NNAR model, selecting p and k automatically when they are not specified.
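A minimal sketch (using the annual sunspot numbers shipped with base R as an illustrative series):
library(forecast)
fit <- nnetar(sunspot.year)   # p and k chosen automatically
plot(forecast(fit, h=20))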
Sunspots
[Figure: US finished motor gasoline products (thousands of barrels per day, 1992–2004); number of calls to a large American bank, 7am–9pm, in 5-minute intervals (3 March–12 May)]
TBATS model (Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components)
y_t = observation at time t
$$y_t^{(\lambda)} = \begin{cases} (y_t^{\lambda} - 1)/\lambda & \text{if } \lambda \neq 0; \\ \log y_t & \text{if } \lambda = 0. \end{cases}$$
$$y_t^{(\lambda)} = \ell_{t-1} + \phi b_{t-1} + \sum_{i=1}^{M} s_{t-m_i}^{(i)} + d_t$$
$$\ell_t = \ell_{t-1} + \phi b_{t-1} + \alpha d_t$$
$$b_t = (1 - \phi) b + \phi b_{t-1} + \beta d_t$$
$$d_t = \sum_{i=1}^{p} \varphi_i d_{t-i} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j} + \varepsilon_t$$
$$s_t^{(i)} = \sum_{j=1}^{k_i} s_{j,t}^{(i)}$$
$$s_{j,t}^{(i)} = s_{j,t-1}^{(i)} \cos \lambda_j^{(i)} + s_{j,t-1}^{*(i)} \sin \lambda_j^{(i)} + \gamma_1^{(i)} d_t$$
$$s_{j,t}^{*(i)} = -s_{j,t-1}^{(i)} \sin \lambda_j^{(i)} + s_{j,t-1}^{*(i)} \cos \lambda_j^{(i)} + \gamma_2^{(i)} d_t$$
Examples
[Figure: forecasts from a TBATS model of weekly US gasoline data (thousands of barrels per day, 7000–9000)]
[Figure: Forecasts from TBATS(1, {3,1}, 0.987, {<169,5>, <845,3>}) — number of call arrivals per 5-minute interval, over days]
4. Over this course, you have developed several models for the retail data. The last exercise is to use cross-validation to objectively compare the models you have developed. Compute cross-validated MAE values for each of the time series models you have considered. It will take some time to run, so perhaps leave it running overnight and check the results the next morning.