2016 Book TimeSeriesEconometrics
Klaus Neusser
Time Series
Econometrics
Springer Texts in Business and Economics
Klaus Neusser
Bern, Switzerland
Over the past decades, time series analysis has experienced a rapid growth of
applications in economics, especially in macroeconomics and finance. Today these
tools have become indispensable to any empirically working economist. Whereas in
the beginning the transfer of knowledge essentially flowed from the natural sciences,
especially statistics and engineering, to economics, over the years theoretical and
applied techniques specifically designed for the nature of economic time series
and models have been developed. In particular, the estimation and identification of
structural vector autoregressive models, the analysis of integrated and cointegrated
time series, and models of volatility have been extremely fruitful and far-reaching
areas of research. With the award of the Nobel Prizes to Clive W. J. Granger and
Robert F. Engle III in 2003 and to Thomas J. Sargent and Christopher A. Sims in
2011, the field has reached a certain degree of maturity. Thus, the idea suggests
itself to assemble the vast amount of material scattered over many papers into a
comprehensive textbook.
The book is self-contained and addresses economics students who already have
some prerequisite knowledge in econometrics. It is thus suited for advanced
bachelor, master’s, or beginning PhD students but also for applied researchers. The
book aims to put them in a position to follow the rapidly growing
research literature and to implement these techniques on their own. Although the
book strives to be rigorous in terms of concepts, definitions, and statements
of theorems, not all proofs are carried out. This is especially true for the more
technical and lengthy proofs, for which the reader is referred to the pertinent
literature.
The book covers approximately a two-semester course in time series analysis
and is divided into two parts. The first part treats univariate time series, in particular
autoregressive moving-average processes. Most of the topics are standard and can
form the basis for a one-semester introductory time series course. This part also
contains a chapter on integrated processes and on models of volatility. The latter
topics could be included in a more advanced course. The second part is devoted to
multivariate time series analysis and in particular to vector autoregressive processes.
It can be taught independently of the first part. The identification, modeling, and
estimation of these processes form the core of the second part. A special chapter
treats the estimation, testing, and interpretation of cointegrated systems. The book
also contains a chapter with an introduction to state space models and the Kalman
filter. Whereas the book is almost exclusively concerned with linear systems, the
last chapter gives a perspective on some more recent developments in the context
of nonlinear models. I have included exercises and worked out examples to deepen
the teaching and learning content. Finally, I have produced five appendices which
summarize important topics such as complex numbers, linear difference equations,
and stochastic convergence.
As time series analysis has become a rapidly growing field with active
research in many directions, it goes without saying that not all topics received the
attention they deserved and that there are areas not covered at all. This is especially
true for the recent advances made in nonlinear time series analysis and in the
application of Bayesian techniques. These two topics alone would justify an extra
book.
The data manipulations and computations have been performed using the
software packages EVIEWS and MATLAB.1 Of course, there are other excellent
packages available. The data for the examples and additional information can
be downloaded from my home page www.neusser.ch. To maximize the learning
success, readers are advised to replicate the examples and to perform similar exercises
with alternative data. Interesting macroeconomic time series can, for example, be
downloaded from the following home pages:
Germany: www.bundesbank.de
Switzerland: www.snb.ch
United Kingdom: www.statistics.gov.uk
United States: research.stlouisfed.org/fred2
The book grew out of lectures which I had the occasion to give over the years
in Bern and at other universities. It is thus a pleasure to thank the many students,
in particular Philip Letsch, who had to work through the manuscript and who
called my attention to obscurities and typos. I also want to thank my colleagues
and teaching assistants Andreas Bachmann, Gregor Bäurle, Fabrice Collard, Sarah
Fischer, Stephan Leist, Senada Nukic, Kurt Schmidheiny, Reto Tanner, and Martin
Wagner for reading the manuscript or parts of it and for making many valuable
criticisms and comments. Special thanks go to my former colleague and coauthor
Robert Kunst who meticulously read and commented on the manuscript. It goes
without saying that all remaining errors and shortcomings are my own responsibility.
1 EVIEWS is a product of IHS Global Inc. MATLAB is a matrix-oriented software developed by
MathWorks which is ideally suited for econometric and time series applications.
Contents
1 Introduction  3
  1.1 Some Examples  4
  1.2 Formal Definitions  7
  1.3 Stationarity  11
  1.4 Construction of Stochastic Processes  15
    1.4.1 White Noise  15
    1.4.2 Construction of Stochastic Processes: Some Examples  16
    1.4.3 Moving-Average Process of Order One  17
    1.4.4 Random Walk  19
    1.4.5 Changing Mean  20
  1.5 Properties of the Autocovariance Function  20
    1.5.1 Autocovariance Function of MA(1) Processes  21
  1.6 Exercises  22
2 ARMA Models  25
  2.1 The Lag Operator  26
  2.2 Some Important Special Cases  27
    2.2.1 Moving-Average Process of Order q  27
    2.2.2 First Order Autoregressive Process  29
  2.3 Causality and Invertibility  32
  2.4 Computation of Autocovariance Function  38
    2.4.1 First Procedure  39
    2.4.2 Second Procedure  41
    2.4.3 Third Procedure  43
  2.5 Exercises  44
3 Forecasting Stationary Processes  45
  3.1 Linear Least-Squares Forecasts  45
    3.1.1 Forecasting with an AR(p) Process  48
    3.1.2 Forecasting with MA(q) Processes  50
    3.1.3 Forecasting from the Infinite Past  53
  3.2 The Wold Decomposition Theorem  54
  3.3 Exponential Smoothing  58
  3.4 Exercises  60
  3.5 Partial Autocorrelation  61
    3.5.1 Definition  62
    3.5.2 Interpretation of ACF and PACF  64
  3.6 Exercises  65
4 Estimation of Mean and ACF  67
  4.1 Estimation of the Mean  67
  4.2 Estimation of ACF  73
  4.3 Estimation of PACF  78
  4.4 Estimation of the Long-Run Variance  79
    4.4.1 An Example  83
  4.5 Exercises  85
5 Estimation of ARMA Models  87
  5.1 The Yule-Walker Estimator  87
  5.2 OLS Estimation of an AR(p) Model  91
  5.3 Estimation of an ARMA(p,q) Model  94
  5.4 Estimation of the Orders p and q  99
  5.5 Modeling a Stochastic Process  102
  5.6 Modeling Real GDP of Switzerland  103
6 Spectral Analysis and Linear Filters  109
  6.1 Spectral Density  110
  6.2 Spectral Decomposition of a Time Series  113
  6.3 The Periodogram and the Estimation of Spectral Densities  117
    6.3.1 Non-Parametric Estimation  117
    6.3.2 Parametric Estimation  121
  6.4 Linear Time-Invariant Filters  122
  6.5 Some Important Filters  127
    6.5.1 Construction of Low- and High-Pass Filters  127
    6.5.2 The Hodrick-Prescott Filter  128
    6.5.3 Seasonal Filters  130
    6.5.4 Using Filtered Data  131
  6.6 Exercises  132
7 Integrated Processes  133
  7.1 Definition, Properties and Interpretation  133
    7.1.1 Long-Run Forecast  135
    7.1.2 Variance of Forecast Error  136
    7.1.3 Impulse Response Function  137
    7.1.4 The Beveridge-Nelson Decomposition  138
  7.2 Properties of the OLS Estimator in the Case of Integrated Variables  141
  7.3 Unit-Root Tests  145
    7.3.1 Dickey-Fuller Test  147
    7.3.2 Phillips-Perron Test  149
9 Introduction  197
10 Definitions and Stationarity  201
11 Estimation of Covariance Function  207
  11.1 Estimators and Asymptotic Distributions  207
  11.2 Testing Cross-Correlations of Time Series  209
  11.3 Some Examples for Independence Tests  211
12 VARMA Processes  215
  12.1 The VAR(1) Process  216
  12.2 Representation in Companion Form  218
  12.3 Causal Representation  218
  12.4 Computation of Covariance Function  221
13 Estimation of VAR Models  225
  13.1 Introduction  225
  13.2 The Least-Squares Estimator  226
  13.3 Proofs of Asymptotic Normality  231
  13.4 The Yule-Walker Estimator  238
D BN-Decomposition  383
Bibliography  391
Index  403
List of Figures
Fig. 15.2 Impulse response functions for advertisement and sales  276
Fig. 15.3 Impulse response functions of IS-LM model  280
Fig. 15.4 Impulse response functions of the Blanchard-Quah model  289
Fig. 16.1 Impulse responses of present discounted value model  302
Fig. 16.2 Stochastic simulation of present discounted value model  303
Fig. 17.1 State space model  326
Fig. 17.2 Spectral density of cyclical component  334
Fig. 17.3 Estimates of quarterly GDP growth rates  349
Fig. 17.4 Components of the basic structural model (BSM) for real GDP of Switzerland. (a) Logged Swiss GDP (demeaned). (b) Local linear trend (LLT). (c) Business cycle component. (d) Seasonal component  350
Fig. 18.1 Break date UK  357
Fig. A.1 Representation of a complex number  370
List of Tables
List of Definitions
1.3 Model  10
1.4 Autocovariance Function  13
1.5 Stationarity  13
1.6 Strict Stationarity  14
1.7 Strict Stationarity  14
1.8 Gaussian Process  15
1.9 White Noise  15
1 Introduction and Basic Theoretical Concepts
Time series analysis is an integral part of every empirical investigation which aims
at describing and modeling the evolution over time of a variable or a set of variables
in a statistically coherent way. Time series analysis in economics is thus very
much intertwined with macroeconomics and finance, which are concerned with the
construction of dynamic models. In principle, one can approach the subject from
two complementary perspectives. The first one focuses on descriptive statistics.
It characterizes the empirical properties and regularities using basic statistical
concepts like mean, variance, and covariance. These properties can be directly
measured and estimated from the data using standard statistical tools. Thus, they
summarize the external (observable) or outside characteristics of the time series. The
second perspective tries to capture the internal data generating mechanism. This
mechanism is usually unknown in economics as the models developed in economic
theory are mostly of a qualitative nature and are usually not specific enough to
single out a particular mechanism.1 Thus, one has to consider some larger class
of models. By far most widely used is the class of autoregressive moving-average
(ARMA) models which rely on linear stochastic difference equations with constant
coefficients. Of course, one wants to know how the two perspectives are related
which leads to the important problem of identifying a model from the data.
The observed regularities summarized in the form of descriptive statistics or as a
specific model are, of course, of principal interest to economics. They can be used
to test particular theories or to uncover new features. One of the main assumptions
underlying time series analysis is that the regularities observed in the sample period
are not specific to that period, but can be extrapolated into the future. This leads to
the issue of forecasting, which is another major application of time series analysis.
1 One prominent exception is the random-walk hypothesis of real private consumption first derived
and analyzed by Hall (1978). This hypothesis states that the current level of private consumption
should just depend on private consumption one period ago and on no other variable, in particular
not on disposable income. The random-walk property of asset prices is another much discussed
hypothesis. See Campbell et al. (1997) for a general exposition and Samuelson (1965) for a first
rigorous derivation from market efficiency.
Although its roots lie in the natural sciences and in engineering, time series
analysis, since the early contributions by Frisch (1933) and Slutzky (1937), has
become an indispensable tool in empirical economics. Early applications mostly
consisted in making the knowledge and methods acquired there available to
economics. However, with the progression of econometrics as a separate scientific
field, more and more techniques that are specific to the characteristics of economic
data have been developed. I just want to mention the analysis of univariate and
multivariate integrated and cointegrated time series (see Chaps. 7 and 16),
the identification of vector autoregressive (VAR) models (see Chap. 15), and the
analysis of volatility of financial market data in Chap. 8. Each of these topics
alone would justify the treatment of time series analysis in economics as a separate
subfield.
Before going into more formal analysis, it is useful to examine some prototypical
economic time series by plotting them against time. This simple graphical inspection
already reveals some of the issues encountered in this book. One of the most popular
time series is the real gross domestic product. Figure 1.1 plots the data for the
U.S. from the first quarter of 1947 to the last quarter of 2011 on a logarithmic scale.
Several observations are in order. First, the data at hand cover just a part of the time series.
There are data available before 1947 and there will be data available after 2011. As
there is neither a natural starting point nor an end point, we think of a time series as extending back
into the infinite past and into the infinite future. Second, the observations are treated
as the realizations of a random mechanism. This implies that we observe only one
realization. If we could turn back time and let history run again, we would obtain
a second realization. This is, of course, impossible, at least in the macroeconomics
context. Thus, typically, we are faced with just one realization on which to base our
analysis. However, sound statistical analysis needs many realizations. This implies
that we have to make some assumption on the constancy of the random mechanism
over time. This leads to the concept of stationarity which will be introduced more
rigorously in the next section. Third, even a cursory look at the plot reveals that
the mean of real GDP is not constant, but is upward trending. As we will see, this
feature is typical of many economic time series.2 The investigation into the nature
of the trend and the statistical consequences thereof has been the subject of intense
research over the last couple of decades. Fourth, a simple way to overcome this
problem is to take first differences. As the data have been logged, this amounts to
taking growth rates.3 The corresponding plot is given in Fig. 1.2 which no longer
shows a trend.
2 See footnote 1 for some theories predicting non-stationary behavior.
Fig. 1.1 Real gross domestic product (GDP) of the U.S. (chained 2005 dollars; seasonally adjusted annual rate)
Fig. 1.2 Quarterly growth rate of U.S. real gross domestic product (GDP) (chained 2005 dollars)
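The difference-of-logs step can be sketched as follows. This is a minimal illustration with hypothetical GDP levels in NumPy, not the EVIEWS/MATLAB code behind the book's figures:

```python
import numpy as np

# Hypothetical quarterly real GDP levels (illustrative, not the actual U.S. data).
gdp = np.array([100.0, 101.2, 102.1, 101.5, 103.0, 104.4])

log_growth = np.diff(np.log(gdp))        # first differences of the logged series
exact_growth = gdp[1:] / gdp[:-1] - 1.0  # exact period-to-period growth rates

# Since ln(1 + eps) is approximately eps for small eps, the two nearly coincide.
print(np.round(log_growth, 4))
print(np.round(exact_growth, 4))
```

For quarterly growth rates of a few percent, the two series agree to roughly four decimal places, which is why log differences are routinely read as growth rates.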
Another feature often encountered in economic time series is seasonality. This
issue arises, for example in the case of real GDP, because of a particular regularity
within the year: the first quarter has the lowest values, the second and fourth
quarters the highest, and the third quarter lies in between. These movements are
due to climatic and holiday-related variations within the year and are viewed to be
of minor economic importance. Moreover, these seasonal variations, because of
their size, hide the more important business cycle movements. It is therefore
customary to work with time series which have been adjusted for seasonality
beforehand. Figure 1.3 shows the unadjusted and the seasonally adjusted real gross
domestic product for Switzerland. The adjustment has been achieved by taking a
moving average. This makes the time series much smoother and evens out the
seasonal movements.
3 This is obtained by using the approximation ln(1 + ε) ≈ ε for small ε, where ε equals the growth
rate of GDP.
Fig. 1.3 Comparison of unadjusted and seasonally adjusted Swiss real gross domestic product (GDP)
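A simple version of such a moving-average adjustment can be sketched as follows. The weights (1/8, 1/4, 1/4, 1/4, 1/8) are a textbook choice for quarterly data and are not necessarily the ones used for the official Swiss series:

```python
import numpy as np

# Hypothetical quarterly series: linear trend plus a fixed seasonal pattern.
t = np.arange(40, dtype=float)
seasonal = np.tile([-1.0, 0.5, -0.2, 0.7], 10)   # repeats every four quarters
y = 0.05 * t + seasonal

# Centered moving average over five quarters; the two end quarters receive
# half weight so that each quarter of the year gets a total weight of 1/4.
weights = np.array([1.0, 2.0, 2.0, 2.0, 1.0]) / 8.0
adjusted = np.convolve(y, weights, mode="valid")  # loses two obs at each end

# The filter averages out any pattern that repeats every four quarters,
# leaving the smooth trend component.
```

Because the weights are symmetric and sum to one, the filter passes a linear trend through unchanged while annihilating the zero-mean seasonal pattern exactly.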
Other typical economic time series are interest rates plotted in Fig. 1.4. Over the
period considered these two variables also seem to trend. However, the nature of
this trend must be different because of the theoretically binding zero lower bound.
Although the relative level of the two series changes over time (at the beginning of
the sample, short-term rates are higher than long-term ones), they move more or
less together. This comovement holds in particular over the medium and long term.
Other prominent time series are stock market indices. In Fig. 1.5 the Swiss
Market Index (SMI) is plotted as an example. The first panel displays the raw
data on a logarithmic scale. One can clearly discern the different crises: the internet
bubble in 2001 and the most recent financial market crisis in 2008. More interesting
than the index itself is the return on the index plotted in the second panel. Whereas
the mean seems to stay relatively constant over time, the volatility does not: in
periods of crisis volatility is much higher. This clustering of volatility is a typical
feature of financial market data and will be analyzed in detail in Chap. 8.
Fig. 1.4 Short- and long-term Swiss interest rates (three-month LIBOR and 10-year government bond)
Finally, Fig. 1.6 plots the unemployment rate for Switzerland. This is another
widely discussed time series. However, the Swiss data have a particular feature
in that the behavior of the series changes over time. Whereas unemployment was
practically nonexistent in Switzerland up to the end of the 1990s, several policy
changes (introduction of unemployment insurance, liberalization of immigration
laws) led to drastic shifts. Although such dramatic structural breaks are rare, one
always has to be aware of such a possibility. Reasons for breaks include policy
changes as well as structural changes in the economy at large.4
1.2 Formal Definitions

The previous section attempted to give an intuitive approach to the subject. The
analysis to follow necessitates, however, more precise definitions and concepts.
At the heart of the exposition stands the concept of a stochastic process. For this
purpose we view the observation at some time t as the realization of a random
variable Xt. In time series analysis we are, however, in general not interested in a
particular point in time, but rather in a whole sequence. This leads to the following
definition.
Definition 1.1. A stochastic process {Xt} is a family of random variables indexed
by t ∈ T and defined on some given probability space.
4 Burren and Neusser (2013) investigate, for example, how systematic sectoral shifts affect the
volatility of real GDP growth.
Fig. 1.5 Swiss Market Index (SMI). (a) Index. (b) Daily return
Thereby T denotes an ordered index set which is typically identified with time.
In the literature one can encounter the following index sets:
Fig. 1.6 Unemployment rate of Switzerland
Remark 1.1. Given that T is identified with time and thus has a direction, a
characteristic of time series analysis is the distinction between past, present, and
future.
For technical reasons which will become clear later, we will work with T = Z,
the set of integers. This choice is consistent with the use of time indices in
economics as there is, usually, neither a natural starting point nor a foreseeable endpoint.
Although models in continuous time are well established in the theoretical finance
literature, we will disregard them because observations are always of a discrete
nature and because models in continuous time would require substantially more
advanced mathematics.
Remark 1.2. The random variables {Xt} take values in a so-called state space. In
the first part of this treatise, we take as the state space the space of real numbers R
and thus consider only univariate time series. In Part II we extend the state space to
Rn and study multivariate time series. Theoretically, it is possible to consider other
state spaces (for example, {0, 1}, the integers, or the complex numbers), but this will
not be pursued here.
Definition 1.2. The function t → xt which assigns to each point in time t the
realization xt of the random variable Xt is called a realization or a trajectory of
the stochastic process. We denote such a realization by {xt}.
5 In probability theory, ergodicity is an important concept which asks under which conditions
the time average of a property is equal to the corresponding ensemble average, i.e. the average
over the entire state space. In particular, ergodicity ensures that arithmetic averages over time
converge to their theoretical counterparts. In Chap. 4 we allude to this principle in the estimation
of the mean and the autocovariance function of a time series.
where {Xt} is the process from the example just above. In this case St is
the proceeds after t rounds of coin tossing. More generally, {Xt} could be
any sequence of identically and independently distributed random variables.
Figure 1.7 shows a realization of {Xt} for t = 1, 2, ..., 100 and the corresponding
random walk {St}. For more on random walks see Sect. 1.4.4 and, in particular,
Chap. 7.
• The simple branching process is defined through the recursion

  Xt+1 = Zt,1 + Zt,2 + · · · + Zt,Xt with starting value X0 = x0.

  In this example Xt represents the size of a population where each member lives
  just one period and reproduces itself with some probability. Zt,j thereby denotes
  the number of offspring of the j-th member of the population in period t.
  In the simplest case {Zt,j} is nonnegative integer valued and identically and
  independently distributed. A realization with X0 = 100 and with probabilities
  of one third each that a member has no, one, or two offspring is shown as an
  example in Fig. 1.8.
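Both processes are straightforward to simulate. The sketch below (NumPy, with an arbitrary seed; sample sizes chosen to mirror the figures) generates a coin-tossing random walk and a branching process with X0 = 100 and offspring probabilities of one third each:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random walk: S_t = X_1 + ... + X_t with i.i.d. coin tosses X_t in {-1, +1}.
x = rng.choice([-1, 1], size=100)
s = np.cumsum(x)

# Branching process: X_{t+1} = Z_{t,1} + ... + Z_{t,X_t}, starting at X_0 = 100,
# where each member has 0, 1, or 2 offspring with probability one third each.
pop = [100]
for _ in range(100):
    offspring = rng.choice([0, 1, 2], size=pop[-1])  # empty array if extinct
    pop.append(int(offspring.sum()))
```

Plotting `s` and `pop` against time reproduces the qualitative behavior of Figs. 1.7 and 1.8; note that once the population hits zero it stays there, since an extinct population produces no offspring.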
1.3 Stationarity
Fig. 1.7 Realization of {Xt} and of the corresponding random walk {St}
Fig. 1.8 Realization of a branching process with X0 = 100
the derivation of forecasts (Chap. 3), in the estimation of ARMA models, the most
important class of models (Chap. 5), and in the Wold representation (Sect. 3.2 in
Chap. 3). It is therefore of utmost importance to get a thorough understanding of the
meaning and properties of the covariance function.
Remark 1.3. The prefix auto emphasizes that the covariance is computed with respect to the same variable taken at different points in time. Alternatively, one may use the term covariance function for short.
Remark 1.4. Processes with these properties are often called weakly stationary,
wide-sense stationary, covariance stationary, or second order stationary. As we will
not deal with other forms of stationarity, we just speak of stationary processes, for
short.
Remark 1.5. For t = s, we have γ_X(t, s) = γ_X(t, t) = V(X_t), which is nothing but the unconditional variance of X_t. Thus, if {X_t} is stationary, γ_X(t, t) = V(X_t) = constant. Moreover, stationarity implies that

  γ_X(t, s) = γ_X(t − s, 0).

Thus the covariance γ_X(t, s) does not depend on the points in time t and s, but only on the number of periods t and s are apart from each other, i.e. on t − s. For stationary processes it is therefore possible to view the autocovariance function as a function of just one argument. We denote the autocovariance function in this case by γ_X(h), h ∈ ℤ. Because the covariance is symmetric in t and s, i.e. γ_X(t, s) = γ_X(s, t), we have

  γ_X(h) = γ_X(−h)  for all h ∈ ℤ.

It is thus sufficient to look at the autocovariance function for positive integers only, i.e. for h = 0, 1, 2, …. In this case we refer to h as the order of the autocovariance. For h = 0, we get the unconditional variance of X_t, i.e. γ_X(0) = V(X_t).
The autocorrelation function of a stationary process is defined as

  ρ_X(h) = γ_X(h)/γ_X(0) = corr(X_{t+h}, X_t)  for all integers h,

where h is referred to as the order. Note that this definition is equivalent to the ordinary correlation coefficient ρ(h) = cov(X_t, X_{t−h}) / (√V(X_t) √V(X_{t−h})), because stationarity implies that V(X_t) = V(X_{t−h}), so that √V(X_t) √V(X_{t−h}) = V(X_t) = γ_X(0).
Most of the time it is sufficient to concentrate on the first two moments. However,
there are situations where it is necessary to look at the whole distribution. This leads
to the concept of strict stationarity.
Definition 1.6 (Strict Stationarity). A stochastic process is called strictly stationary if the joint distributions of (X_{t_1}, …, X_{t_n}) and (X_{t_1+h}, …, X_{t_n+h}) are the same for all h ∈ ℤ and all (t_1, …, t_n) ∈ T^n, n = 1, 2, ….

Definition 1.7 (Strict Stationarity). A stochastic process is called strictly stationary if for all integers h and n ≥ 1, (X_1, …, X_n) and (X_{1+h}, …, X_{n+h}) have the same distribution.
Remark 1.8. If {X_t} is strictly stationary, then X_t has the same distribution for all t (n = 1). For n = 2, X_{t+h} and X_t have a joint distribution which is independent of t. This implies that the covariance, if it exists, depends only on h. Thus, every strictly stationary process with V(X_t) < ∞ is also stationary.⁶
The converse is, however, not true, as shown by the following example. Consider an independent sequence {X_t} with

  X_t ~ exponential with mean 1 (i.e. f(x) = e^{−x}),  t odd;
  X_t ~ N(1, 1),  t even.

Then:
• E(X_t) = 1;
• γ_X(0) = 1 and γ_X(h) = 0 for h ≠ 0.

Thus {X_t} is stationary, but not strictly stationary, because the distribution changes depending on whether t is even or odd.
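A small Monte Carlo experiment makes this example concrete. The sketch below (sample size and seed are illustrative choices of ours) checks that the first two moments match those of a stationary process even though the marginal distribution alternates:

```python
import numpy as np

# Monte Carlo check of the counterexample: an independent sequence with
# X_t ~ Exp(1) for odd t and X_t ~ N(1, 1) for even t has mean 1,
# variance 1, and zero autocovariances (weakly stationary), yet the
# marginal distribution depends on t (not strictly stationary).
rng = np.random.default_rng(2)
T = 200_000
odd = np.arange(T) % 2 == 1
X = np.where(odd,
             rng.exponential(1.0, size=T),  # odd t: exponential, mean 1, var 1
             rng.normal(1.0, 1.0, size=T))  # even t: N(1, 1)

acov1 = np.mean((X[1:] - 1) * (X[:-1] - 1))  # first-order autocovariance
print(round(X.mean(), 2), round(X.var(), 2), round(acov1, 2))
```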
6 An example of a process which is strictly stationary, but not stationary, is given by the IGARCH process (see Sect. 8.1.4). This process is strictly stationary with infinite variance.
At this point we will not delve into the relation between stationarity, strict stationarity, and Gaussian processes; some of these issues will be discussed further in Chap. 8.

1.4 Construction of Stochastic Processes
One important notion in time series analysis is to build up more complicated processes from simple ones. The simplest building block is a process with zero autocorrelation, called a white noise process, which is introduced below. Taking moving averages of this process or using it in a recursion gives rise to more sophisticated processes with more elaborate autocovariance functions. Slutzky (1937) first introduced the idea that moving averages of simple processes can generate time series whose motion resembles business cycle fluctuations.
Definition 1.9 (White Noise). A stationary process {Z_t} is called a white noise process, denoted by Z_t ~ WN(0, σ²), if {Z_t} satisfies:
• E(Z_t) = 0;
• γ_Z(h) = σ² for h = 0 and γ_Z(h) = 0 for h ≠ 0.

If {Z_t} is not only temporally uncorrelated, but also independently and identically distributed, we write Z_t ~ IID(0, σ²). If in addition Z_t is normally distributed, we write Z_t ~ IIN(0, σ²). An IID(0, σ²) process is always a white noise process. The converse is, however, not true, as will be shown in Chap. 8.
Fig. 1.9 Processes constructed from a given white noise process. (a) White noise. (b) Moving average with θ = 0.9. (c) Autoregressive with φ = 0.9. (d) Random walk
The white noise process can be used as a building block to construct more complex processes with a more involved autocorrelation structure. The simplest procedure is to take moving averages over consecutive periods.⁸ This leads to the moving-average processes. The moving-average process of order one, the MA(1) process, is defined as

  X_t = Z_t + θ Z_{t−1}  with  Z_t ~ WN(0, σ²).
8 This procedure is an example of a filter. Section 6.4 provides a general introduction to filters.
Clearly, E(X_t) = E(Z_t) + θ E(Z_{t−1}) = 0. The mean is therefore constant and equal to zero. The autocovariance function can be computed as follows:

  γ_X(t + h, t) = cov(X_{t+h}, X_t)
              = cov(Z_{t+h} + θZ_{t+h−1}, Z_t + θZ_{t−1})
              = E(Z_{t+h} Z_t) + θ E(Z_{t+h} Z_{t−1}) + θ E(Z_{t+h−1} Z_t) + θ² E(Z_{t+h−1} Z_{t−1}).

Recalling that {Z_t} is white noise, so that E(Z_t²) = σ² and E(Z_t Z_{t+h}) = 0 for h ≠ 0, we therefore get the following autocovariance function of {X_t}:

  γ_X(h) = (1 + θ²)σ²,  h = 0;
  γ_X(h) = θσ²,         h = ±1;                                  (1.1)
  γ_X(h) = 0,           otherwise.
Thus {X_t} is stationary irrespective of the value of θ. The autocorrelation function is:

  ρ_X(h) = 1,            h = 0;
  ρ_X(h) = θ/(1 + θ²),   h = ±1;
  ρ_X(h) = 0,            otherwise.
Note that the newly created process now exhibits a dependence on its past, as X_t is correlated with X_{t−1}. This correlation is, however, bounded: 0 ≤ |ρ_X(1)| ≤ 1/2. As the correlation between X_t and X_s is zero when t and s are more than one period apart, we call a moving-average process a process with finite memory or a process with finite-range dependence.
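Equation (1.1) can be checked by simulation. In the sketch below, θ = 0.5, σ = 1, the seed, and the sample size are illustrative choices:

```python
import numpy as np

# Empirical check of Eq. (1.1) for the MA(1) process X_t = Z_t + θZ_{t-1}:
# γ_X(0) = (1+θ²)σ², γ_X(±1) = θσ², γ_X(h) = 0 otherwise.
rng = np.random.default_rng(3)
theta, T = 0.5, 500_000
Z = rng.normal(0.0, 1.0, size=T + 1)   # Z_t ~ IIN(0, 1)
X = Z[1:] + theta * Z[:-1]             # MA(1) process

def acov(x, h):
    """Sample autocovariance of order h (the mean of X is zero by construction)."""
    return float(np.mean(x[h:] * x[: len(x) - h])) if h > 0 else float(np.mean(x * x))

print(acov(X, 0), acov(X, 1), acov(X, 2))  # ≈ 1.25, 0.5, 0
```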
Remark 1.10. To motivate the name moving average, we can define the MA(1) process more generally as X_t = θ_0 Z_t + θ_1 Z_{t−1} with Z_t ~ WN(0, σ²). Setting Z̃_t = θ_0 Z_t, this can be rewritten as

  X_t = Z̃_t + θ̃ Z̃_{t−1}  with  Z̃_t ~ WN(0, σ̃²),

where θ̃ = θ_1/θ_0 and σ̃² = θ_0² σ². Both processes generate the same first two moments and are therefore observationally indistinguishable from each other. Thus, we can set θ_0 = 1 without loss of generality.
Let Z_t ~ WN(0, σ²) be a white noise process. Then the new process {X_t} defined as

  X_t = Z_1 + Z_2 + … + Z_t = Σ_{j=1}^{t} Z_j,  t > 0,                  (1.2)

is called a random walk. Note that, in contrast to {Z_t}, {X_t} is only defined for t > 0. The random walk may alternatively be defined through the recursion

  X_t = X_{t−1} + Z_t,  t > 0 and X_0 = 0.

Adding a constant δ to the recursion yields a random walk with drift:

  X_t = δ + X_{t−1} + Z_t.
Proposition 1.1. The random walk {X_t} as defined in Eq. (1.2) is nonstationary.

Proof. The variance of X_{t+1} − X_1 equals V(X_{t+1} − X_1) = V(Σ_{j=2}^{t+1} Z_j) = Σ_{j=2}^{t+1} V(Z_j) = tσ². Assume for the moment that {X_t} is stationary. Then the triangle inequality implies for t > 0:

  0 < √(tσ²) = std(X_{t+1} − X_1) ≤ std(X_{t+1}) + std(X_1) = 2 std(X_1),

where "std" denotes the standard deviation. As the left-hand side of the inequality converges to infinity for t going to infinity, the right-hand side must also go to infinity. This means that the variance of X_1 must be infinite. This, however, contradicts the assumption of stationarity. Thus {X_t} cannot be stationary. □
The random walk represents by far the most widely used nonstationary process in economics. It has proven to be an important ingredient in many economic time series. Typical nonstationary time series which are, or are driven by, random walks are stock market prices, exchange rates, or the gross domestic product.
where t_c is some specific point in time. {X_t} is clearly not stationary because the mean is not constant. In econometrics we refer to such a situation as a structural change, which can be accommodated by introducing a so-called dummy variable. Models with more sophisticated forms of structural change will be discussed in Chap. 18.
1.5 Properties of the Autocovariance Function

Theorem 1.1. The autocovariance function of a stationary stochastic process {X_t} is characterized by the following properties:

(i) γ_X(0) ≥ 0;
(ii) 0 ≤ |γ_X(h)| ≤ γ_X(0);
(iii) γ_X(h) = γ_X(−h);
(iv) Σ_{i,j=1}^{n} a_i γ_X(t_i − t_j) a_j ≥ 0 for all n and all vectors (a_1, …, a_n)′ and (t_1, …, t_n)′. This property is called non-negative definiteness.
Proof. The first property is obvious as the variance is always nonnegative. The second property follows from the Cauchy–Bunyakovskii–Schwarz inequality (see Theorem C.1) applied to X_t and X_{t+h}, which yields 0 ≤ |γ_X(h)| ≤ γ_X(0). The third property follows immediately from the definition of the covariance. Define a = (a_1, …, a_n)′ and X = (X_{t_1}, …, X_{t_n})′; then the last property follows from the fact that the variance is always nonnegative: 0 ≤ V(a′X) = a′V(X)a = Σ_{i,j=1}^{n} a_i γ_X(t_i − t_j) a_j. □
Similar properties hold for the correlation function ρ_X, except that ρ_X(0) = 1.
Theorem 1.2. The autocorrelation function of a stationary stochastic process {X_t} is characterized by the following properties:

(i) ρ_X(0) = 1;
(ii) 0 ≤ |ρ_X(h)| ≤ 1;
(iii) ρ_X(h) = ρ_X(−h);
(iv) Σ_{i,j=1}^{n} a_i ρ_X(t_i − t_j) a_j ≥ 0 for all n and all vectors (a_1, …, a_n)′ and (t_1, …, t_n)′.

Proof. The proof follows immediately from the properties of the autocovariance function. □
It can be shown that for any given function with the above properties there exists
a stationary process (Gaussian process) which has this function as its autocovariance
function, respectively autocorrelation function.
The problem consists of determining the parameters of the MA(1) model, θ and σ², from the values of the autocovariance function. For this purpose we equate γ_0 = (1 + θ²)σ² and γ_1 = θσ² (see Eq. (1.1)). This leads to an equation system in the two unknowns θ and σ². The system can be simplified by dividing the second equation by the first to obtain γ_1/γ_0 = θ/(1 + θ²). Because γ_1/γ_0 = ρ(1) = ρ_1, this yields the quadratic equation

  ρ_1 θ² − θ + ρ_1 = 0.
The two solutions of this equation are

  θ_{1,2} = (1 ± √(1 − 4ρ_1²)) / (2ρ_1).

The solutions are real if and only if the discriminant 1 − 4ρ_1² is nonnegative. This is the case if and only if ρ_1² ≤ 1/4, respectively |ρ_1| ≤ 1/2. Note that one root is the inverse of the other. The identification problem thus takes the following form:

|ρ_1| < 1/2: there exist two observationally equivalent MA(1) processes corresponding to the two solutions θ_1 and θ_2.
ρ_1 = ±1/2: there exists exactly one MA(1) process with θ = ±1.
|ρ_1| > 1/2: there exists no MA(1) process with this autocovariance function.
The relation between the first-order autocorrelation coefficient, ρ_1 = ρ(1), and the parameter θ of the MA(1) process is represented in Fig. 1.10. As can be seen, there exist for each ρ(1) with |ρ(1)| < 1/2 two solutions. The two solutions are inverses of each other. Hence one solution is smaller than one in absolute value whereas the other is bigger than one. In Sect. 2.3 we will argue in favor of the solution smaller than one. For ρ(1) = ±1/2 there exists exactly one solution, namely θ = ±1. For |ρ(1)| > 1/2 there is no solution. For |ρ_1| > 1/2, ρ(h) actually does not represent a genuine autocorrelation function, as the fourth condition in Theorem 1.1, respectively Theorem 1.2, is violated. For ρ_1 > 1/2, set a = (1, −1, 1, −1, …, 1, −1)′ to get:

  Σ_{i,j=1}^{n} a_i ρ(i − j) a_j = n − 2(n − 1)ρ_1 < 0,  if n > 2ρ_1/(2ρ_1 − 1).

For ρ_1 < −1/2 one sets a = (1, 1, …, 1)′. Hence the fourth property is violated.
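The identification argument translates directly into a small routine. The sketch below (the function name and the test values are ours) recovers both observationally equivalent parameter pairs from γ(0) and γ(1), or reports that none exists:

```python
import math

# Sketch of the identification step: recover (θ, σ²) of an MA(1) from
# γ(0) = (1+θ²)σ² and γ(1) = θσ². With ρ₁ = γ(1)/γ(0), the candidate
# solutions θ = (1 ± sqrt(1 - 4ρ₁²)) / (2ρ₁) are inverses of each other.
def ma1_from_acov(g0, g1):
    rho1 = g1 / g0
    if abs(rho1) > 0.5:
        return None                  # no MA(1) has this autocovariance function
    if rho1 == 0.0:
        return [(0.0, g0)]           # white noise case: θ = 0, σ² = γ(0)
    disc = math.sqrt(1.0 - 4.0 * rho1 ** 2)
    thetas = [(1.0 - disc) / (2.0 * rho1), (1.0 + disc) / (2.0 * rho1)]
    return [(th, g1 / th) for th in thetas]  # each θ with its implied σ²

# γ(0) = 1.25, γ(1) = 0.5 arises from θ = 0.5, σ² = 1 -- and equally from θ = 2, σ² = 0.25
print(ma1_from_acov(1.25, 0.5))
```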
1.6 Exercises
Fig. 1.10 Relation between the autocorrelation coefficient of order one, ρ(1), and the parameter θ of a MA(1) process (the curve ρ(1) = θ/(1 + θ²))
(i) Determine the autocovariance and the autocorrelation function of {X_t} for θ = 0.9.
(ii) Determine the variance of the mean (X_1 + X_2 + X_3 + X_4)/4.
(iii) How do the previous results change if θ = −0.9?

Determine the parameters θ and σ², if they exist, of the first-order moving-average process X_t = Z_t + θZ_{t−1} with Z_t ~ WN(0, σ²) such that the autocovariance function above is the autocovariance function corresponding to {X_t}.
where {Z_t} is identically and independently distributed as Z_t ~ N(0, 1). Show that {X_t} ~ WN(0, 1), but not IID(0, 1).

(i) X_t = Z_t + θZ_{t−1}
(ii) X_t = Z_t Z_{t−1}
(iii) X_t = a + Z_0
(iv) X_t = Z_0 sin(at)

In all cases we assume that {Z_t} is identically and independently distributed with Z_t ~ N(0, σ²); θ and a are arbitrary parameters.
2 Autoregressive Moving-Average Models
A basic idea in time series analysis is to construct more complex processes from
simple ones. In the previous chapter we showed how the averaging of a white
noise process leads to a process with first order autocorrelation. In this chapter we
generalize this idea and consider processes which are solutions of linear stochastic
difference equations. These so-called ARMA processes constitute the most widely
used class of models for stationary processes.
Definition 2.1 (ARMA Models). A stochastic process {X_t} with t ∈ ℤ is called an autoregressive moving-average process (ARMA process) of order (p, q), denoted by ARMA(p, q) process, if the process is stationary and satisfies a linear stochastic difference equation of the form

  X_t − φ_1 X_{t−1} − … − φ_p X_{t−p} = Z_t + θ_1 Z_{t−1} + … + θ_q Z_{t−q}     (2.1)

with Z_t ~ WN(0, σ²). An ARMA(p, q) process with a nonzero mean is obtained by adding a constant c to the right-hand side:

  X_t − φ_1 X_{t−1} − … − φ_p X_{t−p} = c + Z_t + θ_1 Z_{t−1} + … + θ_q Z_{t−q}.
In time series analysis it is customary to rewrite the above difference equation more compactly in terms of the lag operator L. This is not only a compact notation, but will open the way to analyzing the inner structure of ARMA processes. The lag or back-shift operator L moves the time index one period back:

  L{X_t} = {X_{t−1}}.

For ease of notation we write LX_t = X_{t−1}. The lag operator is a linear operator with the following calculation rules:

(i) L applied to the process {X_t = c}, where c is an arbitrary constant, gives Lc = c.
(ii) L can be applied iteratively: L L … L (n times) X_t = Lⁿ X_t = X_{t−n}.
(iii) The inverse of the lag operator is the lead or forward operator, which shifts the time index one period into the future.¹ We write L⁻¹: L⁻¹ X_t = X_{t+1}.
(iv) Lᵐ Lⁿ X_t = L^{m+n} X_t = X_{t−m−n}.
(v) L⁰ = 1.
(vi) For any real numbers a and b, any integers m and n, and arbitrary stochastic processes {X_t} and {Y_t} we have Lᵐ(aX_t + bY_t) = aX_{t−m} + bY_{t−m} and (aLᵐ + bLⁿ)X_t = aX_{t−m} + bX_{t−n}.
1 One technical advantage of using the double-infinite index set ℤ is that the lag operators form a group.
Polynomials in the lag operator can be manipulated as if L were a complex number; the usual calculation rules apply. Let, for example, A(L) = 1 − 0.5L and B(L) = 1 + 4L²; then C(L) = A(L)B(L) = 1 − 0.5L + 4L² − 2L³.
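Since lag polynomials multiply like ordinary polynomials, the product A(L)B(L) can be checked by convolving the coefficient sequences; the sketch below uses NumPy for the example above.

```python
import numpy as np

# Lag polynomials multiply like ordinary polynomials, so their coefficient
# sequences convolve. Check of A(L)B(L) = (1 - 0.5L)(1 + 4L²)
# = 1 - 0.5L + 4L² - 2L³ from the example above.
A = np.array([1.0, -0.5])       # ascending coefficients of 1 - 0.5L
B = np.array([1.0, 0.0, 4.0])   # ascending coefficients of 1 + 4L²
C = np.convolve(A, B)           # ascending coefficients of C(L) = A(L)B(L)
print(C)                        # [ 1.  -0.5  4.  -2. ]
```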
Applied to the stochastic difference equation, we define the autoregressive and the moving-average polynomial as follows:

  Φ(L) = 1 − φ_1 L − … − φ_p L^p,
  Θ(L) = 1 + θ_1 L + … + θ_q L^q.

The stochastic difference equation defining the ARMA process can then be written compactly as

  Φ(L)X_t = Θ(L)Z_t.

Thus, the use of lag polynomials provides a compact notation for ARMA processes. Moreover, and most importantly, Φ(z) and Θ(z), viewed as polynomials of the complex number z, also reveal much of the inherent structural properties of the process, as will become clear in Sect. 2.3.
2.2 Some Important Special Cases

Before we deal with the general theory of ARMA processes, we will analyze some important special cases first:

q = 0: autoregressive process of order p, AR(p) process;
p = 0: moving-average process of order q, MA(q) process.
The MA(q) process is given by X_t = Z_t + θ_1 Z_{t−1} + … + θ_q Z_{t−q}. Its mean is zero because Z_t ~ WN(0, σ²). As can easily be verified using the properties of {Z_t}, the autocovariance function of the MA(q) process is:
Fig. 2.1 Realization and estimated ACF of a MA(1) process: X_t = Z_t − 0.8Z_{t−1} with Z_t ~ IID N(0, 1)
  γ_X(h) = σ² Σ_{i=0}^{q−|h|} θ_i θ_{i+|h|}  for |h| ≤ q, and γ_X(h) = 0 for |h| > q,

with θ_0 = 1. The autocorrelation function is accordingly

  ρ_X(h) = corr(X_{t+h}, X_t) = (Σ_{i=0}^{q−|h|} θ_i θ_{i+|h|}) / (Σ_{i=0}^{q} θ_i²)  for |h| ≤ q, and ρ_X(h) = 0 for |h| > q.
The AR(p) process requires a more thorough analysis, as will already become clear from the AR(1) process. This process is defined by the following stochastic difference equation:

  X_t = φX_{t−1} + Z_t  with  Z_t ~ WN(0, σ²).                     (2.2)

The above stochastic difference equation has in general several solutions. Given a sequence {Z_t} and an arbitrary distribution for X_0, it determines all random variables X_t, t ∈ ℤ \ {0}, by applying the above recursion. The solutions are, however, not necessarily stationary. But, according to Definition 2.1, only stationary processes qualify as ARMA processes. As we will demonstrate, depending on the value of φ, there may exist no stationary solution or exactly one.
Consider first the case |φ| < 1. Inserting into the difference equation several times leads to:

  X_t = φX_{t−1} + Z_t = φ²X_{t−2} + φZ_{t−1} + Z_t
      = …
      = Z_t + φZ_{t−1} + φ²Z_{t−2} + … + φᵏZ_{t−k} + φ^{k+1} X_{t−k−1}.
This shows that Σ_{j=0}^{k} φʲZ_{t−j} converges in the mean square sense, and thus also in probability, to X_t for k → ∞ (see Theorem C.8 in Appendix C). This suggests taking

  X_t = Z_t + φZ_{t−1} + φ²Z_{t−2} + … = Σ_{j=0}^{∞} φʲZ_{t−j}          (2.3)

as the solution to the stochastic difference equation. As Σ_{j=0}^{∞} |φʲ| = 1/(1 − |φ|) < ∞, this solution is well-defined according to Theorem 6.4 and has the following properties:

  E(X_t) = Σ_{j=0}^{∞} φʲ E(Z_{t−j}) = 0,
  γ_X(h) = cov(X_{t+h}, X_t) = lim_{k→∞} E[(Σ_{j=0}^{k} φʲZ_{t+h−j})(Σ_{j=0}^{k} φʲZ_{t−j})]
         = σ²φ^{|h|} Σ_{j=0}^{∞} φ^{2j} = σ²φ^{|h|} / (1 − φ²),  h ∈ ℤ,

and consequently

  ρ_X(h) = φ^{|h|}.
Thus the solution X_t = Σ_{j=0}^{∞} φʲZ_{t−j} is stationary and fulfills the difference equation, as can easily be verified. It is also the only stationary solution which is compatible with the difference equation. Assume that there is a second solution {X̃_t} with these properties. Inserting into the difference equation yields again

  V(X̃_t − Σ_{j=0}^{k} φʲZ_{t−j}) = φ^{2k+2} V(X̃_{t−k−1}).

This variance converges to zero for k going to infinity because |φ| < 1 and because {X̃_t} is stationary. The two processes {X̃_t} and {X_t} with X_t = Σ_{j=0}^{∞} φʲZ_{t−j} are therefore identical in the mean square sense and thus with probability one.
Finally, note that the recursion (2.2) will only generate a stationary process if it is initialized with X_0 having the stationary distribution, i.e. if E(X_0) = 0 and V(X_0) = σ²/(1 − φ²). If the recursion is initiated with an arbitrary variance of X_0, 0 < σ_0² < ∞, Eq. (2.2) implies the following difference equation for the variance of X_t, σ_t²:

  σ_t² = φ²σ_{t−1}² + σ²,

where σ̄² = σ²/(1 − φ²) denotes the variance of the stationary distribution. If σ_0² ≠ σ̄², σ_t² is not constant, implying that the process {X_t} is not stationary. However, as |φ| < 1, the variance of X_t, σ_t², will converge to the variance of the stationary distribution.² Figure 2.2 shows a realization of such a process and its estimated autocorrelation function.
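The variance recursion σ_t² = φ²σ_{t−1}² + σ² is easy to trace numerically. In the sketch below, φ = 0.8, σ² = 1, and the initial variance of 10 are illustrative choices:

```python
# Variance recursion of an AR(1) started away from the stationary
# distribution: σ_t² = φ²σ_{t-1}² + σ² converges to σ²/(1-φ²) when |φ| < 1.
phi, sigma2 = 0.8, 1.0
v_stat = sigma2 / (1.0 - phi**2)   # stationary variance, here 1/0.36
v = 10.0                           # arbitrary initial variance of X_0
path = [v]
for _ in range(50):
    v = phi**2 * v + sigma2        # Var(X_t) = φ² Var(X_{t-1}) + σ²
    path.append(v)

print(path[1], path[-1], v_stat)
```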
In the case |φ| > 1 the solution (2.3) does not converge. It is, however, possible to iterate the difference equation forward in time to obtain:

2 Phillips and Sul (2007) provide an application and an in-depth discussion of the hypothesis of economic growth convergence.
Fig. 2.2 Realization and estimated ACF of an AR(1) process: X_t = 0.8X_{t−1} + Z_t with Z_t ~ IIN(0, 1)
  X_t = φ⁻¹X_{t+1} − φ⁻¹Z_{t+1}
      = φ^{−k−1}X_{t+k+1} − φ⁻¹Z_{t+1} − φ⁻²Z_{t+2} − … − φ^{−k−1}Z_{t+k+1}.

Letting k go to infinity suggests

  X_t = −Σ_{j=1}^{∞} φ⁻ʲZ_{t+j}

as the solution. Going through similar arguments as before, it is possible to show that this is indeed the only stationary solution. This solution is, however, viewed as inadequate because X_t depends on future shocks Z_{t+j}, j = 1, 2, …. Note, however, that there exists an AR(1) process with |φ| < 1 which is observationally equivalent, in the sense that it generates the same autocorrelation function, but with a new shock or forcing variable {Z̃_t} (see next section).
In the case |φ| = 1 there exists no stationary solution (see Sect. 1.4.4) and
therefore, according to our definition, no ARMA process. Processes with this
property are called random walks, unit root processes or integrated processes. They
play an important role in economics and are treated separately in Chap. 7.
2.3 Causality and Invertibility

If we interpret {X_t} as the state variable and {Z_t} as an impulse or shock, we can ask whether it is possible to represent today's state X_t as the outcome of current and past shocks Z_t, Z_{t−1}, Z_{t−2}, …. In this case we can view X_t as being caused by past shocks and call this a causal representation. Thus, shocks to current Z_t will not only influence current X_t, but will propagate to affect future X_t's as well. This notion of causality rests on the assumption that the past can cause the future but that the future cannot cause the past. See Sect. 15.1 for an elaboration of the concept of causality and its generalization to the multivariate context.
In the case that {X_t} is a moving-average process of order q, X_t is given as a weighted sum of current and past shocks Z_t, Z_{t−1}, …, Z_{t−q}. Thus, the moving-average representation is already the causal representation. In the case of an AR(1) process, we have seen that this is not always feasible. For |φ| < 1, the solution (2.3) represents X_t as a weighted sum of current and past shocks and is thus the corresponding causal representation. For |φ| > 1, no such representation is possible. The following Definition 2.2 makes the notion of a causal representation precise and Theorem 2.1 gives a general condition for its existence.
Definition 2.2 (Causality). An ARMA(p, q) process {X_t} with Φ(L)X_t = Θ(L)Z_t is called causal with respect to {Z_t} if there exists a sequence {ψ_j} with the property Σ_{j=0}^{∞} |ψ_j| < ∞ such that

  X_t = Z_t + ψ_1 Z_{t−1} + ψ_2 Z_{t−2} + … = Σ_{j=0}^{∞} ψ_j Z_{t−j} = Ψ(L)Z_t  with ψ_0 = 1,

where Ψ(L) = 1 + ψ_1 L + ψ_2 L² + … = Σ_{j=0}^{∞} ψ_j Lʲ. The above equation is referred to as the causal representation of {X_t} with respect to {Z_t}.
The coefficients {ψ_j} are of great importance because they determine how an impulse or a shock in period t propagates to affect current and future X_{t+j}, j = 0, 1, 2, …. In particular, consider an impulse e_{t₀} at time t₀, i.e. a time series which is equal to zero except at time t₀, where it takes on the value e_{t₀}. Then {ψ_{t−t₀} e_{t₀}} traces out the time history of this impulse. For this reason, the coefficients ψ_j with j = t − t₀, t = t₀, t₀+1, t₀+2, …, are called the impulse response function. If e_{t₀} = 1, it is called a unit impulse. Alternatively, e_{t₀} is sometimes taken to be equal to σ, the standard deviation of Z_t. It is customary to plot ψ_j as a function of j, j = 0, 1, 2, ….
Note that the notion of causality is not an attribute of {X_t}, but is defined relative to another process {Z_t}. It is therefore possible that a stationary process is causal with respect to one process, but not with respect to another process. In order to make this point more concrete, consider again the AR(1) process defined by the equation X_t = φX_{t−1} + Z_t with |φ| > 1. As we have seen, the only stationary solution is given by X_t = −Σ_{j=1}^{∞} φ⁻ʲZ_{t+j}, which is clearly not causal with respect to {Z_t}. Consider, however, the new process Z̃_t = X_t − φ⁻¹X_{t−1}. One can verify that {Z̃_t} is white noise and that {X_t} is causal with respect to {Z̃_t}. This remark shows that there is no loss of generality involved if we concentrate on causal ARMA processes.
Theorem 2.1. Let {X_t} be an ARMA(p, q) process with Φ(L)X_t = Θ(L)Z_t and assume that the polynomials Φ(z) and Θ(z) have no common root. {X_t} is causal with respect to {Z_t} if and only if Φ(z) ≠ 0 for |z| ≤ 1, i.e. all roots of the equation Φ(z) = 0 are outside the unit circle. The coefficients {ψ_j} are then uniquely defined by the identity

  Ψ(z) = Σ_{j=0}^{∞} ψ_j zʲ = Θ(z)/Φ(z).
Proof. Given that Φ(z) is a finite-order polynomial with Φ(z) ≠ 0 for |z| ≤ 1, there exists ε > 0 such that Φ(z) ≠ 0 for |z| ≤ 1 + ε. This implies that 1/Φ(z) is an analytic function on the circle with radius 1 + ε and therefore possesses a power series expansion:

  1/Φ(z) = Σ_{j=0}^{∞} ξ_j zʲ = Ξ(z),  for |z| < 1 + ε.

This implies that ξ_j (1 + ε/2)ʲ goes to zero as j goes to infinity. Thus there exists a positive and finite constant C such that |ξ_j| < C(1 + ε/2)⁻ʲ for all j.³ Therefore

  X_t = Ξ(L)Φ(L)X_t = Ξ(L)Θ(L)Z_t.

3 The reader is invited to verify this.
Theorem 6.4 implies that the right-hand side is well-defined. Thus Ψ(L) = Ξ(L)Θ(L) is the sought polynomial. Its coefficients are determined by the relation Ψ(z) = Θ(z)/Φ(z). Assume now that there exists a causal representation X_t = Σ_{j=0}^{∞} ψ_j Z_{t−j} with Σ_{j=0}^{∞} |ψ_j| < ∞. Therefore …
Remark 2.1. If the AR and the MA polynomial have common roots, there are two possibilities:

• No common root lies on the unit circle. In this situation there exists a unique stationary solution, which can be obtained by canceling the common factors of the polynomials.
• If at least one common root lies on the unit circle, then more than one stationary solution may exist (see the last example below).
Some Examples

After canceling the common factor, one obtains the polynomials Φ̃(L) = 1 − 0.8L and Θ̃(L) = 1. Because the root of Φ̃(z) equals 5/4 > 1, there exists a unique stationary and causal solution with respect to {Z_t}.
Φ(L) = 1 + L and Θ(L) = 1 + L: Φ(z) and Θ(z) have the common root −1, which lies on the unit circle. As before, one might cancel both polynomials by 1 + L to obtain the trivial stationary and causal solution {X_t} = {Z_t}. This is, however, not the only solution. Additional solutions are given by {Y_t} = {Z_t + A(−1)ᵗ}, where A is an arbitrary random variable with mean zero and finite variance σ_A² which is independent from both {X_t} and {Z_t}. The process {Y_t} has a mean of zero and an autocovariance function γ_Y(h) equal to

  γ_Y(h) = σ² + σ_A²,  h = 0;
  γ_Y(h) = (−1)ʰ σ_A²,  h = ±1, ±2, ….

Thus this new process is stationary and fulfills the difference equation.
Remark 2.2. If the AR and the MA polynomial in the stochastic difference equation Φ(L)X_t = Θ(L)Z_t have no common root, but Φ(z) = 0 for some z on the unit circle, there exists no stationary solution. In this sense the stochastic difference equation no longer defines an ARMA model. Models with this property are said to have a unit root and are treated in Chap. 7. If Φ(z) has no root on the unit circle, there exists a unique stationary solution.
The coefficients ψ_j can be determined by the method of undetermined coefficients from the identity Ψ(z)Φ(z) = Θ(z):

  (ψ_0 + ψ_1 z + ψ_2 z² + …)(1 − φ_1 z − φ_2 z² − … − φ_p z^p) = 1 + θ_1 z + θ_2 z² + … + θ_q z^q.

Matching the coefficients of the powers of z yields:

  z⁰: ψ_0 = 1;
  z¹: ψ_1 = θ_1 + ψ_0 φ_1 = θ_1 + φ_1;
  z²: ψ_2 = θ_2 + φ_2 ψ_0 + φ_1 ψ_1 = θ_2 + φ_2 + φ_1 θ_1 + φ_1²;
  …
4 In the case of multiple roots one has to modify the formula according to Eq. (B.2).
  ∂X_{t+j}/∂Z_t = ψ_j → 0  for j → ∞.⁵

As can be seen from Eq. (2.5), the coefficients {ψ_j} even converge to zero exponentially fast, because each term c_i z_i⁻ʲ, i = 1, …, p, goes to zero exponentially fast as the roots z_i are greater than one in absolute value. Viewing {ψ_j} as a function of j, one gets the so-called impulse response function, which is usually displayed graphically.
The effect of a permanent shock in period t on X_{t+j} is defined as the cumulative effect of a transitory shock. Thus, the effect of a permanent shock on X_{t+j} is given by Σ_{i=0}^{j} ψ_i. Because Σ_{i=0}^{j} ψ_i ≤ Σ_{i=0}^{j} |ψ_i| ≤ Σ_{i=0}^{∞} |ψ_i| < ∞, the cumulative effect remains finite.
In time series analysis we view the observations as realizations of {X_t} and treat the realizations of {Z_t} as unobserved. It is therefore of interest to know whether it is possible to recover the unobserved shocks from the observations of {X_t}. This idea leads to the concept of invertibility.
Definition 2.3 (Invertibility). An ARMA(p, q) process {X_t} satisfying Φ(L)X_t = Θ(L)Z_t is called invertible with respect to {Z_t} if and only if there exists a sequence {π_j} with the property Σ_{j=0}^{∞} |π_j| < ∞ such that

  Z_t = Σ_{j=0}^{∞} π_j X_{t−j}.

Note that, like causality, invertibility is not an attribute of {X_t}, but is defined only relative to another process {Z_t}. In the literature, one often refers to invertibility as the strict miniphase property.⁶
Theorem 2.2. Let {X_t} be an ARMA(p, q) process with Φ(L)X_t = Θ(L)Z_t such that the polynomials Φ(z) and Θ(z) have no common roots. Then {X_t} is invertible with respect to {Z_t} if and only if Θ(z) ≠ 0 for |z| ≤ 1. The coefficients {π_j} are then uniquely determined through the relation:
5 The use of the partial derivative sign actually represents an abuse of notation. It is inspired by an alternative definition of the impulse responses: ψ_j = ∂P̃_t X_{t+j}/∂x_t, where P̃_t denotes the optimal (in the mean squared error sense) linear predictor of X_{t+j} given a realization back to the infinitely remote past {x_t, x_{t−1}, x_{t−2}, …} (see Sect. 3.1.3). Thus, ψ_j represents the sensitivity of the forecast of X_{t+j} with respect to the observation x_t. The equivalence of the alternative definitions in the linear and especially the nonlinear context is discussed in Potter (2000).

6 Without the qualification strict, the miniphase property allows for roots of Θ(z) on the unit circle. The terminology is, however, not uniform in the literature.
  Π(z) = Σ_{j=0}^{∞} π_j zʲ = Φ(z)/Θ(z).

Proof. The proof follows from Theorem 2.1 with X_t and Z_t interchanged. □
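The π-weights can be computed by power-series division, i.e. by solving Π(z)Θ(z) = Φ(z) coefficient by coefficient. The sketch below (the helper name and the MA(1) test case with θ = 0.5 are ours) implements this recursion:

```python
import numpy as np

# Power-series division sketch: the π-weights solve Π(z)Θ(z) = Φ(z), so
# π_j = φ̃_j - Σ_{k≥1} θ_k π_{j-k}, where φ̃_j and θ_k are the ascending
# coefficients of Φ and Θ (zero beyond their degrees). For the MA(1)
# Φ(z) = 1, Θ(z) = 1 + θz, this gives π_j = (-θ)^j.
def pi_weights(phi_coeffs, theta_coeffs, n):
    pi = np.zeros(n)
    for j in range(n):
        acc = phi_coeffs[j] if j < len(phi_coeffs) else 0.0
        for k in range(1, min(j, len(theta_coeffs) - 1) + 1):
            acc -= theta_coeffs[k] * pi[j - k]
        pi[j] = acc
    return pi

print(pi_weights([1.0], [1.0, 0.5], 5))   # powers of -0.5
```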
The discussion in Sect. 1.3 showed that there are in general two MA(1) processes compatible with the same autocorrelation function ρ(h) given by ρ(0) = 1, ρ(1) = ρ_1 with |ρ_1| ≤ 1/2, and ρ(h) = 0 for h ≥ 2. However, only one of these solutions is invertible, because the two solutions for θ are inverses of each other. As it is important to be able to recover Z_t from current and past X_t, one prefers the invertible solution. Section 3.2 further elucidates this issue.
where the coefficients {ψ_j} and {π_j} are determined for |z| ≤ 1 by Ψ(z) = Θ(z)/Φ(z) and Π(z) = Φ(z)/Θ(z), respectively. In this case {X_t} is causal and invertible with respect to {Z_t}.
Remark 2.4. If {X_t} is an ARMA process with Φ(L)X_t = Θ(L)Z_t such that Φ(z) ≠ 0 for |z| = 1, then there exist polynomials Φ̃(z) and Θ̃(z) and a white noise process {Z̃_t} such that {X_t} fulfills the stochastic difference equation Φ̃(L)X_t = Θ̃(L)Z̃_t and is causal with respect to {Z̃_t}. If in addition Θ(z) ≠ 0 for |z| = 1, then Θ̃(L) can be chosen such that {X_t} is also invertible with respect to {Z̃_t} (see the discussion of the AR(1) process after the definition of causality and Brockwell and Davis (1991, p. 88)). Thus, without loss of generality, we can restrict the analysis to causal and invertible ARMA processes.
2.4 Computation of the Autocovariance Function

Whereas the autocovariance function summarizes the external and directly observable properties of a time series, the coefficients of the ARMA process give information about its internal structure. Although there exists for each ARMA model …
Starting from the causal representation of {X_t}, it is easy to calculate its autocovariance function, given that {Z_t} is white noise. The exact formula is proved in Theorem 6.4:

  γ(h) = σ² Σ_{j=0}^{∞} ψ_j ψ_{j+|h|},

where

  Ψ(z) = Σ_{j=0}^{∞} ψ_j zʲ = Θ(z)/Φ(z)  for |z| ≤ 1.
The first step consists in determining the coefficients ψ_j by the method of undetermined coefficients. This leads to the following system of equations:

  ψ_j − Σ_{0<k≤j} φ_k ψ_{j−k} = θ_j,  0 ≤ j < max{p, q + 1};
  ψ_j − Σ_{0<k≤p} φ_k ψ_{j−k} = 0,   j ≥ max{p, q + 1}.

The first equations deliver:

  ψ_0 = θ_0 = 1,
  ψ_1 = θ_1 + ψ_0 φ_1 = θ_1 + φ_1,
  ψ_2 = θ_2 + φ_2 ψ_0 + φ_1 ψ_1 = θ_2 + φ_2 + φ_1 θ_1 + φ_1²,
  …
Alternatively, one may view the second part of the equation system as a linear homogeneous difference equation with constant coefficients (see Sect. 2.3). Its solution is given by Eq. (2.5). The first part of the equation system delivers the necessary initial conditions to determine the coefficients c_1, c_2, …, c_p. Finally, one can insert the ψ's into the above formula for the autocovariance function.
A Numerical Example

Consider the ARMA(2,1) process with Φ(L) = 1 − 1.3L + 0.4L² and Θ(L) = 1 + 0.4L. Writing out the defining equation for Ψ(z), Ψ(z)Φ(z) = Θ(z), gives:

  (1 + ψ_1 z + ψ_2 z² + ψ_3 z³ + …)
  − 1.3z − 1.3ψ_1 z² − 1.3ψ_2 z³ − …
  + 0.4z² + 0.4ψ_1 z³ + …
  = 1 + 0.4z.
Equating the coefficients of the powers of z leads to the following equation system:
z0 W 0 D 1;
z W 1 1:3 D 0:4;
z2 W 2 1:3 1 C 0:4 D 0;
3
z W 3 1:3 2 C 0:4 1 D 0;
:::
j 1:3 j 1 C 0:4 j 2 D 0; for j 2:
The last equation represents a linear difference equation of order two. Its solution is
given by
j j
j D c1 z1 C c2 z2 ; j maxfp; q C 1g p D 0;
whereby z1 and z2 are the two distinct roots of the characteristic polynomial
ˆ.z/ D 1 1:3z C 0:4z2 D 0 (see Eq. (2.5)) and where the coefficients p
c1 and
1:3˙ 1:69 40:4
c2 are determined from the initial conditions. The two roots are 20:4 D
5=4 D 1:25 and 2. The general solution to the homogeneous equation therefore is
j j
j D c1 0:8 C c2 0:5 . The constants c1 and c2 are determined by the equations:
Solving this equation system in the two unknowns c1 and c2 gives: c1 D 4 and
c2 D 3. Thus the solution to the difference equation is given by:
j D 4.0:8/j 3.0:5/j :
Inserting this solution for ψ_j into the above formula for γ(h), one obtains, after using the formula for the geometric sum:

$$\gamma(h) = \sigma^2 \sum_{j=0}^{\infty} \big[4(0.8)^j - 3(0.5)^j\big]\big[4(0.8)^{j+h} - 3(0.5)^{j+h}\big]$$
$$= \sigma^2 \sum_{j=0}^{\infty} \big(16 \times 0.8^{2j+h} - 12 \times 0.5^j\, 0.8^{j+h} - 12 \times 0.8^j\, 0.5^{j+h} + 9 \times 0.5^{2j+h}\big)$$
$$= \sigma^2 \left[\frac{220}{9}(0.8)^h - 8(0.5)^h\right], \qquad h \ge 0.$$
Fig. 2.3 Autocorrelation function of the ARMA(2,1) process (1 − 1.3L + 0.4L²)X_t = (1 + 0.4L)Z_t (correlation coefficient plotted against the order h = 0, ..., 20)
The second part of the equation system consists again of a linear homogeneous difference equation in γ(h), whereas the first part can be used to determine the initial conditions. Note that the initial conditions depend on ψ_1, ..., ψ_q, which have to be determined beforehand. The general solution of the difference equation is:

$$\gamma(h) = c_1 z_1^{-h} + \ldots + c_p z_p^{-h}. \tag{2.6}$$
A Numerical Example

We consider the same example as before. The second part of the above equation system delivers a difference equation for γ(h): γ(h) = φ_1 γ(h−1) + φ_2 γ(h−2) = 1.3γ(h−1) − 0.4γ(h−2), h ≥ 2. The general solution of this difference equation is (see Appendix B):

$$\gamma(h) = c_1 (0.8)^h + c_2 (0.5)^h, \qquad h \ge 0,$$
7 In case of multiple roots the formula has to be adapted accordingly; see Eq. (B.2) in the Appendix.
where 0.8 and 0.5 are the inverses of the roots computed from the same polynomial Φ(z) = 1 − 1.3z + 0.4z² = 0. The first part of the system delivers the initial conditions which determine the constants c_1 and c_2. Inserting the general solution into this equation system and bearing in mind that γ(h) = γ(−h) leads to:

$$c_1 + c_2 = \gamma(0) = \frac{148}{9}\sigma^2,$$
$$0.8\,c_1 + 0.5\,c_2 = \gamma(1) = \frac{140}{9}\sigma^2,$$

where the numbers on the right-hand side are taken from the first procedure. Solving this equation system in the unknowns c_1 and c_2, one finally gets c_1 = (220/9)σ² and c_2 = −8σ².
Whereas the first two procedures produce an analytical solution which relies on the solution of a linear difference equation, the third procedure is more suited for numerical computation on a computer. It rests on the same equation system as the second procedure. The first step determines the values γ(0), γ(1), ..., γ(p) from the first part of the equation system. The following γ(h), h > p, are then computed recursively using the second part of the equation system.
A Numerical Example

Using again the same example as before, the first part of the equation system delivers γ(2), γ(1), and γ(0) from the equation system:

$$\gamma(0) - 1.3\gamma(1) + 0.4\gamma(2) = \sigma^2(1 + 0.4 \times 1.7) = 1.68\sigma^2,$$
$$\gamma(1) - 1.3\gamma(0) + 0.4\gamma(1) = 0.4\sigma^2,$$
$$\gamma(2) - 1.3\gamma(1) + 0.4\gamma(0) = 0.$$

Bearing in mind that γ(h) = γ(−h), this system has three equations in the three unknowns γ(0), γ(1), and γ(2). The solution is γ(0) = (148/9)σ², γ(1) = (140/9)σ², γ(2) = (614/45)σ². This corresponds, of course, to the same numerical values as before. The subsequent values for γ(h), h > 2, are then determined recursively from the difference equation γ(h) = 1.3γ(h−1) − 0.4γ(h−2).
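The recursive procedure is easily mirrored numerically. The sketch below is not from the book: it solves the three initial equations of the ARMA(2,1) example with a small Gauss-Jordan elimination and then iterates the difference equation γ(h) = 1.3γ(h−1) − 0.4γ(h−2).

```python
# Illustrative sketch of the third (recursive) procedure for the example
# (1 - 1.3L + 0.4L^2) X_t = (1 + 0.4L) Z_t with sigma^2 = 1.

def solve3(A, b):
    """Gauss-Jordan elimination with partial pivoting for a 3x3 system."""
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(3):
        p = max(range(c, 3), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(3):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][3] / M[i][i] for i in range(3)]

def arma21_acvf(n, sigma2=1.0):
    # First part of the equation system (h = 0, 1, 2), using psi_0 = 1,
    # psi_1 = 1.7 from the first procedure:
    A = [[1.0, -1.3, 0.4],
         [-1.3, 1.4, 0.0],
         [0.4, -1.3, 1.0]]
    b = [sigma2 * (1 + 0.4 * 1.7), sigma2 * 0.4, 0.0]
    g = solve3(A, b)                     # gamma(0), gamma(1), gamma(2)
    # Second part: recursion for h > 2.
    for h in range(3, n + 1):
        g.append(1.3 * g[h - 1] - 0.4 * g[h - 2])
    return g
```

The first three entries of arma21_acvf(10) reproduce 148/9, 140/9, and 614/45; the later entries decay geometrically, consistent with the closed form (220/9)0.8^h − 8(0.5)^h.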
2.5 Exercises

Exercise 2.5.2. Check whether the following stochastic difference equations possess a stationary solution. If yes, is the solution causal and/or invertible with respect to Z_t ~ WN(0, σ²)?

(i) X_t = Z_t + 2Z_{t−1}
(ii) X_t = 1.3X_{t−1} + Z_t
(iii) X_t = 1.3X_{t−1} − 0.4X_{t−2} + Z_t
(iv) X_t = 1.3X_{t−1} − 0.4X_{t−2} + Z_t − 0.3Z_{t−1}
(v) X_t = 0.2X_{t−1} + 0.8X_{t−2} + Z_t
(vi) X_t = 0.2X_{t−1} + 0.8X_{t−2} + Z_t − 1.5Z_{t−1} + 0.5Z_{t−2}

Thereby Z_t ~ WN(0, σ²).
We restrict our discussion to linear forecast functions, also called linear predictors, P_T X_{T+h}. Given observations from period 1 up to period T, these predictors take the form:

$$P_T X_{T+h} = a_0 + a_1 X_T + \ldots + a_T X_1 = a_0 + \sum_{i=1}^{T} a_i X_{T+1-i}.$$

The coefficients are chosen to minimize the mean squared forecast error S = E(X_{T+h} − P_T X_{T+h})². Up to a factor of −2, the first-order conditions are:

$$\frac{\partial S}{\partial a_0} = \mathrm{E}\Big(X_{T+h} - a_0 - \sum_{i=1}^{T} a_i X_{T+1-i}\Big) = 0, \tag{3.1}$$

$$\frac{\partial S}{\partial a_j} = \mathrm{E}\Big[\Big(X_{T+h} - a_0 - \sum_{i=1}^{T} a_i X_{T+1-i}\Big) X_{T+1-j}\Big] = 0, \qquad j = 1, \ldots, T. \tag{3.2}$$

The first equation can be rewritten as a_0 = μ(1 − Σ_{i=1}^T a_i) so that the forecasting function becomes:

$$P_T X_{T+h} = \mu + \sum_{i=1}^{T} a_i (X_{T+1-i} - \mu).$$
1 Elliott and Timmermann (2008) provide a general overview of forecasting procedures and their evaluations.
3.1 Linear Least-Squares Forecasts 47
The unconditional mean of the forecast error, E(X_{T+h} − P_T X_{T+h}), is therefore equal to zero. This means that there is no bias, neither upward nor downward, in the forecasts: the forecasts correspond on average to the "true" value.

Inserting into the second normal equation the expression for P_T X_{T+h} from above, we get:

$$\mathrm{E}\big[(X_{T+h} - P_T X_{T+h})\, X_{T+1-j}\big] = 0, \qquad j = 1, 2, \ldots, T.$$

The forecast error is therefore uncorrelated with the available information represented by past observations. Thus, the forecast errors X_{T+h} − P_T X_{T+h} are orthogonal to X_T, X_{T−1}, ..., X_1. Geometrically speaking, the best linear forecast is obtained by finding the point in the linear subspace spanned by {X_T, X_{T−1}, ..., X_1} which is closest to X_{T+h}. This point is found by projecting X_{T+h} onto this linear subspace.2
The normal equations (3.1) and (3.2) can be rewritten in matrix notation as follows:

$$a_0 = \mu\Big(1 - \sum_{i=1}^{T} a_i\Big), \tag{3.3}$$

$$\begin{pmatrix} \gamma(0) & \gamma(1) & \ldots & \gamma(T-1) \\ \gamma(1) & \gamma(0) & \ldots & \gamma(T-2) \\ \vdots & \vdots & \ddots & \vdots \\ \gamma(T-1) & \gamma(T-2) & \ldots & \gamma(0) \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_T \end{pmatrix} = \begin{pmatrix} \gamma(h) \\ \gamma(h+1) \\ \vdots \\ \gamma(h+T-1) \end{pmatrix}. \tag{3.4}$$

Dividing the second equation by γ(0), one obtains an equation in terms of autocorrelations instead of autocovariances:

$$R_T \alpha_T = \rho_T(h), \tag{3.7}$$
2 Note the similarity of the forecast errors with the least-squares residuals of a linear regression.
48 3 Forecasting Stationary Processes
$$\alpha_T = \begin{pmatrix} a_1 \\ \vdots \\ a_T \end{pmatrix} = \Gamma_T^{-1} \gamma_T(h) = R_T^{-1} \rho_T(h),$$

because Γ_T α_T = γ_T(h). Bracketing out γ(0), one can write the mean squared forecast error as:

$$v_T(h) = \gamma(0)\big(1 - \alpha_T' \rho_T(h)\big). \tag{3.8}$$
3 See Brockwell and Davis (1991, p. 167).
For the AR(1) process, ρ(k) = φ^{|k|} and the equation system (3.7) reads:

$$\begin{pmatrix} 1 & \phi & \phi^2 & \ldots & \phi^{T-1} \\ \phi & 1 & \phi & \ldots & \phi^{T-2} \\ \phi^2 & \phi & 1 & \ldots & \phi^{T-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \phi^{T-1} & \phi^{T-2} & \phi^{T-3} & \ldots & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_T \end{pmatrix} = \begin{pmatrix} \phi^h \\ \phi^{h+1} \\ \phi^{h+2} \\ \vdots \\ \phi^{h+T-1} \end{pmatrix}.$$

Its solution is α_T = (φ^h, 0, ..., 0)', so that

$$P_T X_{T+h} = \phi^h X_T.$$

The forecast therefore depends only on the last observation, with the corresponding coefficient a_1 = φ^h being independent of T. All previous observations can be disregarded; they cannot improve the forecast further. To put it otherwise, all the useful information about X_{T+h} contained in the entire realization {X_T, X_{T−1}, ..., X_1} is contained in X_T.

The variance of the prediction error is given by

$$v_T(h) = \frac{\big(1 - \phi^{2h}\big)\,\sigma^2}{1 - \phi^2}.$$
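A direct numerical check of this claim can be sketched as follows (the parameter values φ = 0.8, T = 6, h = 2 are assumed for illustration and are not from the text): the candidate solution a = (φ^h, 0, ..., 0)' satisfies every normal equation because ρ(k) = φ^{|k|}, and plugging it into Eq. (3.8) reproduces the variance formula.

```python
# Illustrative check (assumed values): verify a = (phi^h, 0, ..., 0)' against
# the normal equations R_T a = rho_T(h) for an AR(1) process.

phi, T, h = 0.8, 6, 2
sigma2 = 1.0

rho = lambda k: phi ** abs(k)
a = [phi ** h] + [0.0] * (T - 1)

# j-th normal equation: sum_i rho(i - j) a_i = rho(h + j - 1), j = 1, ..., T
for j in range(1, T + 1):
    lhs = sum(rho(i - j) * a[i - 1] for i in range(1, T + 1))
    assert abs(lhs - rho(h + j - 1)) < 1e-12

# Eq. (3.8): v_T(h) = gamma(0) (1 - a' rho_T(h)) with gamma(0) = sigma2/(1 - phi^2)
gamma0 = sigma2 / (1 - phi ** 2)
v = gamma0 * (1 - sum(a[i] * rho(h + i) for i in range(T)))
```

The check goes through for any T because the j-th equation reduces to φ^{j−1} φ^h = φ^{h+j−1}.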
For a causal AR(p) process, the autocorrelations satisfy the Yule-Walker equations

$$\rho(j) = \phi_1 \rho(j-1) + \phi_2 \rho(j-2) + \ldots + \phi_p \rho(j-p),$$

so that for h = 1 the system (3.7) is solved by a_1 = φ_1, ..., a_p = φ_p and a_j = 0 for j > p. The one-step ahead forecast of an AR(p) process therefore depends only on the last p observations.
The above predictor can also be obtained in a different way. View for this purpose P_T as an operator with the following meaning: take the linear least-squares forecast with respect to the information {X_T, ..., X_1}. Apply this operator to the defining stochastic difference equation of the AR(p) process forwarded one period:

$$P_T X_{T+1} = P_T(\phi_1 X_T) + P_T(\phi_2 X_{T-1}) + \ldots + P_T\big(\phi_p X_{T+1-p}\big) + P_T(Z_{T+1})$$
$$= \phi_1 X_T + \phi_2 X_{T-1} + \ldots + \phi_p X_{T+1-p},$$

because X_T, ..., X_{T+1−p} are known in period T and P_T(Z_{T+1}) = 0.
The forecasting problem becomes more complicated in the case of MA(q) processes. In order to get a better understanding, we analyze the case of an MA(1) process, X_t = Z_t + θZ_{t−1}. Despite the fact that the equation system has a simple structure, the forecasting function will in general depend on all past observations X_{T−j}, 0 ≤ j < T. We illustrate this point by a numerical example which will allow us to get a deeper understanding.
Suppose that we know the parameters of the MA(1) process to be θ = −0.9 and σ² = 1. We start the forecasting exercise in period T = 0 and assume that, at this point in time, we have no observation at hand. The best forecast is therefore just the unconditional mean, which in this example is zero. Thus, P_0 X_1 = 0. The variance of the forecast error then is V(X_1 − P_0 X_1) = v_0(1) = σ² + θ²σ² = 1.81. This result is summarized in the first row of Table 3.1. In period 1, the realization of X_1 is observed. This information can be used, and the forecasting function becomes P_1 X_2 = a_1 X_1. The coefficient a_1 is found by solving the equation system (3.10) for T = 1. This gives a_1 = θ/(1 + θ²) = −0.4972. The corresponding variance of the forecast error according to Eq. (3.8) is V(X_2 − P_1 X_2) = v_1(1) = γ(0)(1 − α_1' ρ_1(1)) = 1.81(1 − 0.4972 × 0.4972) = 1.3625. This value is lower than for the previous forecast because additional information, the observation of the realization of X_1, is taken into account. Row 2 in Table 3.1 summarizes these results.

In period 2, not only X_1 but also X_2 is observed, which allows us to base our forecast on both observations: P_2 X_3 = a_1 X_2 + a_2 X_1. The coefficients can be found by solving the equation system (3.10) for T = 2. This amounts to solving the simultaneous equation system

$$\begin{pmatrix} 1 & \frac{\theta}{1+\theta^2} \\ \frac{\theta}{1+\theta^2} & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \end{pmatrix} = \begin{pmatrix} \frac{\theta}{1+\theta^2} \\ 0 \end{pmatrix}.$$
Table 3.1 Forecast function for an MA(1) process with θ = −0.9 and σ² = 1

  Time     Forecasting function α_T = (a_1, a_2, ..., a_T)'       Forecast error variance
  T = 0:                                                          v_0(1) = 1.8100
  T = 1:   α_1 = (-0.4972)'                                       v_1(1) = 1.3625
  T = 2:   α_2 = (-0.6606, -0.3285)'                              v_2(1) = 1.2155
  T = 3:   α_3 = (-0.7404, -0.4891, -0.2432)'                     v_3(1) = 1.1436
  T = 4:   α_4 = (-0.7870, -0.5827, -0.3849, -0.1914)'            v_4(1) = 1.1017
  ...      ...                                                    ...
  T = ∞:   α_∞ = (-0.9000, -0.8100, -0.7290, ...)'                v_∞(1) = 1
These results are summarized in row 3 of Table 3.1. We can, of course, continue in this way and derive successively the forecast functions for T = 3, 4, 5, ....
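The whole table can be reproduced by solving R_T α_T = ρ_T(1) for successive T. The following sketch is an illustration, not the book's code; it uses a small Gauss-Jordan solver instead of a library routine.

```python
# Illustrative sketch reproducing Table 3.1: MA(1) with theta = -0.9,
# sigma^2 = 1, so rho(1) = theta/(1 + theta^2) and rho(h) = 0 for h > 1.

def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ma1_forecast(T, theta=-0.9, sigma2=1.0):
    """Coefficients alpha_T and one-step error variance v_T(1)."""
    gamma0 = sigma2 * (1 + theta ** 2)
    rho1 = theta / (1 + theta ** 2)
    rho = lambda k: 1.0 if k == 0 else (rho1 if abs(k) == 1 else 0.0)
    R = [[rho(i - j) for j in range(T)] for i in range(T)]
    rhs = [rho(1 + i) for i in range(T)]
    alpha = solve(R, rhs)
    v = gamma0 * (1 - sum(a * r for a, r in zip(alpha, rhs)))   # Eq. (3.8)
    return alpha, v
```

For instance, ma1_forecast(2) returns the coefficients (−0.6606, −0.3285) and v_2(1) = 1.2155 up to rounding.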
From this exercise we can make several observations. As T increases, it becomes better and better possible to recover the "true" value of the unobserved Z_T from the observations X_T, X_{T−1}, ..., X_1. As the process is invertible, in the limit it is possible to recover the value of Z_T exactly (almost surely). The only uncertainty remaining is with respect to Z_{T+1}, which has a mean of zero and a variance of σ² = 1. Indeed, invertibility implies

$$Z_t = X_t - \theta X_{t-1} + \theta^2 X_{t-2} - \ldots,$$

so that in the limit the one-step forecast becomes

$$\widetilde{P}_T X_{T+1} = \theta Z_T = -0.9 X_T - 0.81 X_{T-1} - 0.729 X_{T-2} - \ldots$$
Consider now the case of a causal and invertible ARMA(1,1) process {X_t}:

$$X_t = \phi X_{t-1} + Z_t + \theta Z_{t-1},$$

where |φ| < 1, |θ| < 1 and Z_t ~ WN(0, σ²). Because {X_t} is causal and invertible with respect to {Z_t},

$$X_{T+1} = Z_{T+1} + (\phi + \theta)\sum_{j=0}^{\infty} \phi^j Z_{T-j},$$

$$Z_{T+1} = X_{T+1} - (\phi + \theta)\sum_{j=0}^{\infty} (-\theta)^j X_{T-j}.$$

Applying the forecast operator P̃_T to the second equation and noting that P̃_T Z_{T+1} = 0, one obtains the following one-step ahead predictor:

$$\widetilde{P}_T X_{T+1} = (\phi + \theta)\sum_{j=0}^{\infty} (-\theta)^j X_{T-j}.$$

This implies that the one-step ahead prediction error is equal to X_{T+1} − P̃_T X_{T+1} = Z_{T+1} and that the mean squared forecasting error of the one-step ahead predictor given the infinite past is equal to E Z²_{T+1} = σ².
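The infinite-past predictor can be checked against the finite-sample predictor of Sect. 3.1. In the sketch below, the parameter values φ = 0.5 and θ = 0.4 are assumed for illustration; the coefficients obtained from solving Γ_T a = γ_T(1) for moderate T already agree closely with (φ + θ)(−θ)^{j−1}.

```python
# Illustrative check (assumed values): finite-sample predictor coefficients
# for an ARMA(1,1) versus the infinite-past weights (phi + theta)(-theta)^(j-1).

def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

phi, theta, sigma2, T = 0.5, 0.4, 1.0, 30

# psi-weights of the causal representation: psi_0 = 1, psi_j = (phi+theta)phi^(j-1)
psi = [1.0] + [(phi + theta) * phi ** (j - 1) for j in range(1, 400)]

def gamma(h):
    """gamma(h) by a truncated sum over the psi-weights."""
    return sigma2 * sum(psi[j] * psi[j + h] for j in range(len(psi) - h))

G = [[gamma(abs(i - j)) for j in range(T)] for i in range(T)]
a = solve(G, [gamma(1 + i) for i in range(T)])           # finite-sample weights
w = [(phi + theta) * (-theta) ** (j - 1) for j in range(1, T + 1)]
```

The agreement is to many decimal places because the omitted tail of the infinite-past predictor is of order θ^T.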
4 More about harmonic processes can be found in Sect. 6.2.
3.2 The Wold Decomposition Theorem
Consider the harmonic process X_t = A cos(ωt) + B sin(ωt). Thereby, A and B denote two uncorrelated random variables with mean zero and finite variance. One can check that X_t satisfies the deterministic difference equation

$$X_t = (2\cos\omega)\,X_{t-1} - X_{t-2}.$$

Thus, X_t can be forecasted exactly from its past; in this example the last two observations are sufficient. We are now in a position to state the Wold Decomposition Theorem.
Theorem 3.1 (Wold Decomposition). Every stationary stochastic process {X_t} with mean zero and finite positive variance can be represented as

$$X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j} + V_t = \Psi(L)Z_t + V_t, \tag{3.11}$$

where

(i) Z_t = X_t − P̃_{t−1} X_t;
(ii) Z_t ~ WN(0, σ²) with σ² = E(X_{t+1} − P̃_t X_{t+1})² > 0;
(iii) ψ_0 = 1 and Σ_{j=0}^∞ ψ_j² < ∞;
(iv) {V_t} is deterministic;
(v) E(Z_t V_s) = 0 for all t, s ∈ ℤ.

Proof. The proof, although insightful, requires some knowledge of Hilbert spaces which is beyond the scope of this book. A rigorous proof can be found in Brockwell and Davis (1991, Section 5.7).
It is nevertheless instructive to give an intuition of the proof. Following the MA(1) example from the previous section, we start in period 0 and assume that no information is available. Thus, the best forecast P_0 X_1 is zero so that trivially

$$X_1 = X_1 - P_0 X_1 = Z_1,$$
$$X_2 = (X_2 - P_1 X_2) + P_1 X_2 = Z_2 + a_1^{(1)} X_1 = Z_2 + a_1^{(1)} Z_1,$$
$$X_3 = (X_3 - P_2 X_3) + P_2 X_3 = Z_3 + a_1^{(2)} X_2 + a_2^{(2)} X_1$$
$$\qquad = Z_3 + a_1^{(2)} Z_2 + \big(a_1^{(2)} a_1^{(1)} + a_2^{(2)}\big) Z_1,$$
$$X_4 = (X_4 - P_3 X_4) + P_3 X_4 = Z_4 + a_1^{(3)} X_3 + a_2^{(3)} X_2 + a_3^{(3)} X_1$$
$$\qquad = Z_4 + a_1^{(3)} Z_3 + \big(a_1^{(3)} a_1^{(2)} + a_2^{(3)}\big) Z_2 + \big(a_1^{(3)} a_1^{(2)} a_1^{(1)} + a_1^{(3)} a_2^{(2)} + a_2^{(3)} a_1^{(1)} + a_3^{(3)}\big) Z_1,$$
$$\ldots$$

where a_j^{(t−1)}, j = 1, ..., t−1, denote the coefficients of the forecast function for X_t based on X_{t−1}, ..., X_1. This shows how X_t unfolds into the sum of forecast errors. The stationarity of {X_t} ensures that the coefficients of Z_j converge, as t goes to infinity, to ψ_j, which are independent of t. □
5 The Wold Decomposition corresponds to the decomposition of the spectral distribution function F into the sum of F_Z and F_V (see Sect. 6.2). Thereby the spectral distribution function F_Z has spectral density $f_Z(\lambda) = \frac{\sigma^2}{2\pi}\left|\Psi(e^{-\imath\lambda})\right|^2$.

6 The series ψ_j = 1/j, for example, is square summable, but not absolutely summable.
to recover the ψ_j's from the causal representation. This amounts to saying that Ψ(L) is a rational polynomial, which means that

$$\Psi(L) = \frac{\Theta(L)}{\Phi(L)} = \frac{1 + \theta_1 L + \theta_2 L^2 + \ldots + \theta_q L^q}{1 - \phi_1 L - \phi_2 L^2 - \ldots - \phi_p L^p}.$$
Remark 3.1. In the case of ARMA processes, the purely deterministic part {V_t} can be disregarded so that the process is represented only by a weighted sum of current and past innovations. Processes with this property are called purely non-deterministic, linearly regular, or regular for short. Moreover, it can be shown that every regular process {X_t} can be approximated arbitrarily well by an ARMA process {X_t^{(ARMA)}}, meaning that

$$\sup_{t \in \mathbb{Z}} \mathrm{E}\big(X_t - X_t^{(ARMA)}\big)^2$$

can be made arbitrarily small. The proof of these results can be found in Hannan and Deistler (1988, Chapter 1).
Remark 3.2. The process {Z_t} is white noise, but not necessarily Gaussian. In particular, {Z_t} need not be independently and identically distributed (IID). Thus, E(Z_{t+1} | X_t, X_{t−1}, ...) need not be equal to zero although P̃_t Z_{t+1} = 0. The reason is that P̃_t Z_{t+1} is only the best linear forecast function, whereas E(Z_{t+1} | X_t, X_{t−1}, ...) is the best forecast function among all linear and non-linear functions. Examples of processes which are white noise, but not IID, are the GARCH processes discussed in Chap. 8.
Remark 3.3. The innovations {Z_t} may not correspond to the "true" shocks of the underlying economic system. In this case, the shocks to the economic system cannot be recovered from the Wold Decomposition; they are then not fundamental with respect to {X_t}. Suppose, as a simple example, that {X_t} is generated by a noninvertible MA(1) process:

$$X_t = U_t + \theta U_{t-1}, \qquad U_t \sim \mathrm{WN}(0, \sigma^2), \quad |\theta| > 1.$$

This generates an impulse response function with respect to the true shocks of the system equal to (1, θ, 0, ...). The above mechanism can, however, not be the Wold Decomposition because the noninvertibility implies that U_t cannot be recovered from the observation of {X_t}. As shown in the introduction, there is an observationally equivalent MA(1) process, i.e. a process which generates the same ACF. Based on the computation in Sect. 1.5, this MA(1) process is

$$X_t = Z_t + \tilde{\theta} Z_{t-1}, \qquad Z_t \sim \mathrm{WN}(0, \tilde{\sigma}^2),$$

with θ̃ = 1/θ and σ̃² = ((1 + θ²)/(1 + θ̃²))σ² = θ²σ². This is already the Wold Decomposition. The impulse response function for this process is (1, 1/θ, 0, ...), which is different from the original system. As |θ̃| = |θ⁻¹| < 1, the innovations {Z_t} can be recovered from the observations as Z_t = Σ_{j=0}^∞ (−θ̃)^j X_{t−j}, but they do not correspond to the shocks of the system {U_t}. Hansen and Sargent (1991), Quah (1990), and Lippi and Reichlin (1993), among others, provide a deeper discussion and present additional interesting economic examples.
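The observational equivalence is easy to verify numerically. In the sketch below, θ = 2 is an assumed illustration value; the two parameterizations produce identical autocovariances even though their impulse responses differ.

```python
# Illustrative sketch (theta = 2 assumed): the noninvertible MA(1)
# X_t = U_t + theta*U_{t-1}, U_t ~ WN(0, sigma2), and its invertible version
# with theta_tilde = 1/theta, sigma2_tilde = theta^2 * sigma2 share the ACF.

theta, sigma2 = 2.0, 1.0
theta_t, sigma2_t = 1 / theta, theta ** 2 * sigma2

def ma1_acvf(th, s2, h):
    """Autocovariance function of an MA(1) process."""
    if h == 0:
        return s2 * (1 + th ** 2)
    return s2 * th if abs(h) == 1 else 0.0

for h in range(4):
    assert abs(ma1_acvf(theta, sigma2, h) - ma1_acvf(theta_t, sigma2_t, h)) < 1e-12
```

The impulse responses (1, 2, 0, ...) and (1, 0.5, 0, ...) nevertheless differ, which is the point of the remark.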
Consider a process of the form

$$X_t = f(t; \beta) + \varepsilon_t,$$

where f is a deterministic function of time with parameter vector β. The simplest case is a constant mean,

$$X_t = \beta + \varepsilon_t.$$

Estimating β by the arithmetic average of the observations leads to the forecast P̂_T X_{T+1} = X̄_T, where the hat "^" indicates that the model parameter β has been replaced by its estimate. The one-period ahead forecast function can then be rewritten as follows:

$$\widehat{P}_T X_{T+1} = \frac{T-1}{T}\,\widehat{P}_{T-1} X_T + \frac{1}{T} X_T$$
$$= \widehat{P}_{T-1} X_T + \frac{1}{T}\big(X_T - \widehat{P}_{T-1} X_T\big).$$
3.3 Exponential Smoothing 59
The first equation represents the forecast for T+1 as a linear combination of the forecast for T and of the last additional piece of information, i.e. the last observation. The weight given to the last observation is equal to 1/T because we assumed that the mean remains constant and because the contribution of one observation to the mean is 1/T. The second equation represents the forecast for T+1 as the forecast for T plus a correction term which is proportional to the last forecast error. One advantage of this second representation is that the computation of the new forecast, i.e. the forecast for T+1, depends only on the forecast for T and the additional observation. In this way the storage requirements are minimized.
In many applications, the mean does not remain constant but is a slowly moving function of time. In this case it is no longer meaningful to give each observation the same weight. Instead, it seems plausible to weight the more recent observations higher than the older ones. A simple idea is to let the weights decline exponentially, which leads to the following forecast function:

$$P_T X_{T+1} = \frac{1 - \omega}{1 - \omega^T} \sum_{t=0}^{T-1} \omega^t X_{T-t} \qquad \text{with } |\omega| < 1.$$

ω thereby acts like a discount factor which controls the rate at which agents forget information; 1 − ω is often called the smoothing parameter. The value of ω should depend on the speed at which the mean changes. In case the mean changes only slowly, ω should be large so that all observations are almost equally weighted; in case the mean changes rapidly, ω should be small so that only the most recent observations are taken into account. The normalizing constant (1 − ω)/(1 − ω^T) ensures that the weights sum up to one. For large T the term ω^T can be disregarded, so that one obtains the following forecasting function based on simple exponential smoothing:

$$P_T X_{T+1} = (1 - \omega)\big(X_T + \omega X_{T-1} + \omega^2 X_{T-2} + \ldots\big)$$
$$= (1 - \omega) X_T + \omega P_{T-1} X_T$$
$$= P_{T-1} X_T + (1 - \omega)(X_T - P_{T-1} X_T).$$
In practice, the recursion is started from some initial value S_0:

$$P_0 X_1 = S_0,$$
$$P_1 X_2 = \omega P_0 X_1 + (1 - \omega) X_1,$$
$$P_2 X_3 = \omega P_1 X_2 + (1 - \omega) X_2,$$
$$\ldots$$
$$P_T X_{T+1} = \omega P_{T-1} X_T + (1 - \omega) X_T.$$

The effect of the starting value declines exponentially with time. In practice, we can take S_0 = X_1 or S_0 = X̄_T. The discount factor ω is usually set a priori to a number between 0.7 and 0.95. It is, however, possible to determine ω optimally by choosing the value which minimizes the mean squared one-period forecast error:

$$\sum_{t=1}^{T} (X_t - P_{t-1} X_t)^2 \;\longrightarrow\; \min_{|\omega| < 1}.$$
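The recursion and the choice of ω can be sketched as follows. This is illustrative code, not from the book; the grid search over ω stands in for a proper numerical minimization.

```python
# Illustrative sketch of simple exponential smoothing: the recursion
# P_T X_{T+1} = omega * P_{T-1} X_T + (1 - omega) * X_T started from s0,
# and a grid search for the omega minimizing the one-step squared errors.

def exp_smooth(x, omega, s0):
    """Sequence of one-step forecasts P_0 X_1, P_1 X_2, ..., P_T X_{T+1}."""
    f = [s0]
    for obs in x:
        f.append(omega * f[-1] + (1 - omega) * obs)
    return f

def best_omega(x, grid):
    """Grid value of omega minimizing sum_t (X_t - P_{t-1} X_t)^2, s0 = x[0]."""
    def sse(om):
        f = exp_smooth(x, om, x[0])
        return sum((obs - fc) ** 2 for obs, fc in zip(x, f))
    return min(grid, key=sse)
```

For data whose mean drifts quickly, the criterion tends to select a small ω, putting most weight on recent observations, which matches the discussion above.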
From a theoretical perspective one can ask for which class of models exponential smoothing represents the optimal procedure. Muth (1960) showed that this class of models is given by

$$\Delta X_t = X_t - X_{t-1} = Z_t - \omega Z_{t-1}.$$

Note that the process generated by the above equation is no longer stationary. This is to be expected, as exponential smoothing assumes a non-constant mean. Despite the fact that this class seems rather restrictive at first, practice has shown that it delivers reasonable forecasts, especially in situations where it is costly to specify a particular model.7 Additional results and more general exponential smoothing methods can be found in Abraham and Ledolter (1983) and Mertens and Rässler (2005).
3.4 Exercises
Exercise 3.4.1. Compute the linear least-squares predictor P_T X_{T+h}, T > 2, and the mean squared error v_T(h), h = 1, 2, 3, if {X_t} is given by the AR(2) process

7 This happens, for example, when many, perhaps thousands, of time series have to be forecasted in a real-time situation.
3.5 Partial Autocorrelation 61
Exercise 3.4.2. Compute the linear least-squares predictor P_T X_{T+1} and the mean squared error v_T(1), T = 0, 1, 2, 3, if {X_t} is given by the MA(1) process

Exercise 3.4.3. Suppose that you observe {X_t} for the two periods t = 1 and t = 3, but not for t = 2.

$$X_t = A\cos(\omega t) + B\sin(\omega t)$$

with A and B being two uncorrelated random variables with mean zero and finite variance. Show that {X_t} satisfies the deterministic difference equation:

$$X_t = (2\cos\omega)\,X_{t-1} - X_{t-2}.$$
3.5.1 Definition

The above intuition suggests two equivalent definitions of the partial autocorrelation function (PACF).

Definition 3.2 (Partial Autocorrelation Function I). The partial autocorrelation function (PACF) α(h), h = 0, 1, 2, ..., of a stationary process is defined as follows:

$$\alpha(0) = 1,$$
$$\alpha(h) = a_h, \qquad h = 1, 2, \ldots,$$

where a_h denotes the last element of the vector α_h = Γ_h⁻¹ γ_h(1) = R_h⁻¹ ρ_h(1) (see Sect. 3.1 and Eq. (3.7)).
Definition 3.3 (Partial Autocorrelation Function II). The partial autocorrelation function (PACF) α(h), h = 0, 1, 2, ..., of a stationary process is defined as follows:

$$\alpha(0) = 1,$$
$$\alpha(1) = \mathrm{corr}(X_2, X_1) = \rho(1),$$
$$\alpha(h) = \mathrm{corr}\big[X_{h+1} - P(X_{h+1} \mid 1, X_2, \ldots, X_h),\; X_1 - P(X_1 \mid 1, X_2, \ldots, X_h)\big],$$

where P(X_{h+1} | 1, X_2, ..., X_h) and P(X_1 | 1, X_2, ..., X_h) denote the best linear forecasts, in the sense of minimal mean squared forecast error, of X_{h+1}, respectively X_1, given {1, X_2, ..., X_h}.

Remark 3.4. If {X_t} has a mean of zero, then the constant in the projection operator can be omitted.
The first definition implies that the partial autocorrelations are determined from the coefficients of the forecasting function, which are themselves functions of the autocorrelations. They can be computed recursively:
$$\alpha(0) = 1,$$
$$\alpha(1) = a_{11} = \rho(1),$$
$$\alpha(2) = a_{22} = \frac{\rho(2) - \rho(1)^2}{1 - \rho(1)^2},$$
$$\ldots$$
$$\alpha(h) = a_{hh} = \frac{\rho(h) - \sum_{j=1}^{h-1} a_{h-1,j}\,\rho(h-j)}{1 - \sum_{j=1}^{h-1} a_{h-1,j}\,\rho(j)},$$
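This recursion can be implemented directly. The update for the remaining coefficients, a_{h,j} = a_{h−1,j} − a_{hh} a_{h−1,h−j}, is the standard Durbin-Levinson step, which the display above takes for granted; the code is an illustrative sketch, not from the book.

```python
# Illustrative sketch: PACF via the Durbin-Levinson recursion. The argument
# rho is a function returning the autocorrelation rho(k).

def pacf(rho, hmax):
    """Return [alpha(0), alpha(1), ..., alpha(hmax)], hmax >= 1."""
    alpha = [1.0, rho(1)]
    a_prev = [rho(1)]                       # a_{1,1}
    for h in range(2, hmax + 1):
        num = rho(h) - sum(a_prev[j] * rho(h - 1 - j) for j in range(h - 1))
        den = 1 - sum(a_prev[j] * rho(j + 1) for j in range(h - 1))
        a_hh = num / den
        # Durbin-Levinson update of the remaining coefficients:
        a_prev = [a_prev[j] - a_hh * a_prev[h - 2 - j] for j in range(h - 1)] + [a_hh]
        alpha.append(a_hh)
    return alpha
```

Applied to ρ(k) = φ^|k| of an AR(1), the sketch returns α(h) = 0 for h ≥ 2, in line with the discussion below; for an MA(1) it reproduces the closed-form PACF stated later in this section.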
Autoregressive Processes

The idea of the PACF can be well illustrated in the case of an AR(1) process X_t = φX_{t−1} + Z_t. As shown in Chap. 2, X_t and X_{t−2} are correlated with each other despite the fact that there is no direct relationship between the two. The correlation is obtained "indirectly" because X_t is correlated with X_{t−1}, which is itself correlated with X_{t−2}. Because both correlations are equal to φ, the correlation between X_t and X_{t−2} is equal to ρ(2) = φ². The ACF therefore accounts for all correlations, including the indirect ones. The partial autocorrelation, on the other hand, only accounts for the direct relationships. In the case of the AR(1) process, there is only an indirect relation between X_t and X_{t−h} for h ≥ 2; thus the PACF is zero there.

Based on the results in Sect. 3.1 for the AR(1) process, Definition 3.2 of the PACF implies:

$$\alpha_1 = \phi \;\Rightarrow\; \alpha(1) = \rho(1) = \phi,$$
$$\alpha_2 = (\phi, 0)' \;\Rightarrow\; \alpha(2) = 0,$$
$$\alpha_3 = (\phi, 0, 0)' \;\Rightarrow\; \alpha(3) = 0.$$

The partial autocorrelation function of an AR(1) process is therefore equal to zero for h ≥ 2. This logic can easily be generalized: the PACF of a causal AR(p) process is equal to zero for h > p, i.e. α(h) = 0 for h > p. This property characterizes an AR(p) process, as shown in the next section.
Moving-Average Processes

Consider now the case of an invertible MA process. For this process we have:

$$Z_t = \sum_{j=0}^{\infty} \pi_j X_{t-j} \;\Rightarrow\; X_t = -\sum_{j=1}^{\infty} \pi_j X_{t-j} + Z_t.$$

For the MA(1) process X_t = Z_t + θZ_{t−1} one obtains:

$$\alpha_1 = \frac{\theta}{1+\theta^2} \;\Rightarrow\; \alpha(1) = \rho(1) = \frac{\theta}{1+\theta^2},$$

$$\alpha_2 = \left(\frac{\theta(1+\theta^2)}{1+\theta^2+\theta^4},\; \frac{-\theta^2}{1+\theta^2+\theta^4}\right)' \;\Rightarrow\; \alpha(2) = \frac{-\theta^2}{1+\theta^2+\theta^4}.$$

In general, the PACF of an MA(1) process is

$$\alpha(h) = \frac{-(-\theta)^h}{1+\theta^2+\ldots+\theta^{2h}} = \frac{-(-\theta)^h\,(1-\theta^2)}{1-\theta^{2(h+1)}}.$$
The ACF and the PACF are two important tools for determining the nature of the underlying mechanism of a stochastic process. In particular, they can be used to determine the orders of the underlying AR, respectively MA, processes. The analysis of the ACF and PACF to identify appropriate models is known as the Box-Jenkins methodology (Box and Jenkins 1976). Table 3.2 summarizes the properties of both tools for the case of a causal AR and an invertible MA process.

If {X_t} is a causal and invertible ARMA(p,q) process, we have the following properties. As shown in Sect. 2.4, the ACF is characterized for h > max{p, q+1} by the homogeneous difference equation ρ(h) = φ_1 ρ(h−1) + ... + φ_p ρ(h−p). Causality implies that the roots of the characteristic equation are all inside the unit circle. The autocorrelation coefficients therefore decline exponentially to zero. Whether this convergence is monotonic or oscillating depends on the signs of the roots. The PACF starts to decline to zero for h > p. Thereby the coefficients of the PACF exhibit the same behavior as the autocorrelation coefficients of Θ⁻¹(L)X_t.
3.6 Exercises
Exercise 3.6.1. Assign the ACF and the PACF from Fig. 3.1 to the following processes:

$$X_t = Z_t,$$
$$X_t = 0.9X_{t-1} + Z_t,$$
$$X_t = Z_t + 0.8Z_{t-1},$$
$$X_t = 0.9X_{t-1} + Z_t + 0.8Z_{t-1},$$

with Z_t ~ WN(0, σ²).
Fig. 3.1 Autocorrelation and partial autocorrelation functions (each panel shows the ACF and the PACF with 95-percent confidence bands against the order 0, ..., 20). (a) Process 1. (b) Process 2. (c) Process 3. (d) Process 4
4 Estimation of the Mean and the Autocorrelation Function

In the previous chapters we have seen in which way the mean μ and, more importantly, the autocovariance function γ(h), h = 0, ±1, ±2, ..., of a stationary stochastic process {X_t} characterize its dynamic properties, at least if we restrict ourselves to the first two moments. In particular, we have investigated how the autocovariance function is related to the coefficients of the corresponding ARMA process. Thus the estimation of the ACF is not only interesting in its own right, but also for the specification and identification of appropriate ARMA models. It is therefore of utmost importance to have reliable (consistent) estimators for these entities. Moreover, we want to test specific features of a given time series, which means that we have to develop a corresponding testing theory. As the small sample distributions are hard to obtain, we rely for this purpose on asymptotic theory.1
In this section we will assume that the process is stationary and observed for the time periods t = 1, 2, ..., T. We will refer to T as the sample size. As mentioned previously, standard sampling theory is not appropriate in the time series context because the X_t's are not independent draws from some underlying distribution, but are systematically related to each other. The natural estimator of the mean is the arithmetic average

$$\overline{X}_T = \frac{1}{T}\,(X_1 + X_2 + \ldots + X_T).$$

1 Recently, bootstrap methods have also been introduced in the time series context.
Its variance satisfies

$$0 \le T\,\mathrm{V}\big(\overline{X}_T\big) = \frac{1}{T}\sum_{i,j=1}^{T} \mathrm{cov}(X_i, X_j) = \sum_{|h|<T}\Big(1 - \frac{|h|}{T}\Big)\gamma(h)$$
$$\le \sum_{|h|<T} |\gamma(h)| = 2\sum_{h=1}^{T-1} |\gamma(h)| + \gamma(0).$$

The assumption γ(h) → 0 for h → ∞ implies that for any given ε > 0 we can find T_0 such that |γ(h)| < ε/2 for h ≥ T_0. If T > T_0 and T > 2T_0 γ(0)/ε, then

$$0 \le \frac{1}{T}\sum_{h=1}^{T-1} |\gamma(h)| = \frac{1}{T}\sum_{h=1}^{T_0-1} |\gamma(h)| + \frac{1}{T}\sum_{h=T_0}^{T-1} |\gamma(h)| < \frac{T_0\,\gamma(0)}{T} + \frac{\varepsilon}{2} < \varepsilon,$$

so that V(X̄_T) converges to zero as T → ∞.
This theorem establishes that the arithmetic average is not only an unbiased estimator of the mean, but also a consistent one. In particular, the arithmetic average converges in the mean-square sense, and therefore also in probability, to the true mean (see Appendix C). This result can be interpreted as a reflection of the concept of ergodicity (see Sect. 1.2). The assumptions are relatively mild and are fulfilled for ARMA processes, because for these processes γ(h) converges exponentially fast to zero (see Sect. 2.4.2, in particular Eq. (2.6)). Under slightly more restrictive assumptions it is even possible to show that the arithmetic mean is asymptotically normally distributed.
Theorem 4.2 (Asymptotic Distribution of Sample Mean). For any stationary process {X_t} given by

$$X_t = \mu + \sum_{j=-\infty}^{\infty} \psi_j Z_{t-j}, \qquad Z_t \sim \mathrm{IID}(0, \sigma^2),$$

such that Σ_{j=−∞}^∞ |ψ_j| < ∞ and Σ_{j=−∞}^∞ ψ_j ≠ 0, the arithmetic average X̄_T is asymptotically normal:

$$\sqrt{T}\big(\overline{X}_T - \mu\big) \xrightarrow{d} N\Big(0, \sum_{h=-\infty}^{\infty}\gamma(h)\Big) = N\Big(0, \sigma^2\big(\sum_{j=-\infty}^{\infty}\psi_j\big)^2\Big) = N\big(0, \Psi(1)^2\sigma^2\big).$$
Proof. The standard proof invokes the Basic Approximation Theorem C.14 and the Central Limit Theorem for m-dependent processes C.13. To this end we define the 2m-dependent approximate process

$$X_t^{(m)} = \mu + \sum_{j=-m}^{m} \psi_j Z_{t-j}.$$

For {X_t^{(m)}}, we have V_m = Σ_h γ^{(m)}(h) = σ²(Σ_{j=−m}^m ψ_j)². This last assertion can be verified by noting that

$$V = \sum_{h=-\infty}^{\infty} \gamma(h) = \sigma^2 \sum_{h=-\infty}^{\infty}\sum_{j=-\infty}^{\infty} \psi_j \psi_{j+h} = \sigma^2\Big(\sum_{j=-\infty}^{\infty} \psi_j\Big)^2.$$

Note that the assumption Σ_{j=−∞}^∞ |ψ_j| < ∞ guarantees the convergence of the infinite sums. Applying this result to the special case ψ_j = 0 for |j| > m, we obtain V_m.
The arithmetic average of the approximating process is

$$\overline{X}_T^{(m)} = \frac{1}{T}\sum_{t=1}^{T} X_t^{(m)}.$$

The CLT for m-dependent processes C.13 then implies that for T → ∞

$$\sqrt{T}\big(\overline{X}_T^{(m)} - \mu\big) \xrightarrow{d} X^{(m)} = N(0, V_m).$$

As m → ∞, σ²(Σ_{j=−m}^m ψ_j)² converges to σ²(Σ_{j=−∞}^∞ ψ_j)², and thus

$$X^{(m)} \xrightarrow{d} X = N(0, V) = N\Big(0, \sigma^2\big(\sum_{j=-\infty}^{\infty}\psi_j\big)^2\Big).$$

This assertion can be established by noting that the characteristic function of X^{(m)} approaches the characteristic function of X, so that by Theorem C.11 X^{(m)} converges in distribution to X.
Finally, we show that the approximation error becomes negligible as T goes to infinity:

$$\sqrt{T}\big(\overline{X}_T - \mu\big) - \sqrt{T}\big(\overline{X}_T^{(m)} - \mu\big) = T^{-1/2}\sum_{t=1}^{T}\big(X_t - X_t^{(m)}\big) = T^{-1/2}\sum_{t=1}^{T} e_t^{(m)},$$

where the error e_t^{(m)} is

$$e_t^{(m)} = \sum_{|j|>m} \psi_j Z_{t-j}.$$

Clearly, {e_t^{(m)}} is a stationary process with autocovariance function γ_e such that Σ_{h=−∞}^∞ γ_e(h) = σ²(Σ_{|j|>m} ψ_j)² < ∞. We can therefore invoke Theorem 4.1 to show that

$$\mathrm{V}\Big(\sqrt{T}\big(\overline{X}_T - \mu\big) - \sqrt{T}\big(\overline{X}_T^{(m)} - \mu\big)\Big) = T\,\mathrm{V}\Big(\frac{1}{T}\sum_{t=1}^{T} e_t^{(m)}\Big)$$
4.1 Estimation of the Mean 71
converges to σ²(Σ_{|j|>m} ψ_j)² as T → ∞. This term converges to zero as m → ∞. The approximation error √T(X̄_T − μ) − √T(X̄_T^{(m)} − μ) therefore converges in mean square to zero and thus, using Chebyschev's inequality (see Theorem C.3 or C.7), also in probability. We have therefore established the third condition of Theorem C.14 as well. Thus, we can conclude that √T(X̄_T − μ) converges in distribution to X. □
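The limiting variance Σ_h γ(h) = σ²Ψ(1)² of the theorem can be illustrated with a short computation for an MA(1) process (the value θ = 0.5 is assumed for illustration): here Ψ(1) = 1 + θ, and indeed γ(0) + 2γ(1) = σ²(1 + θ)².

```python
# Numeric illustration (theta = 0.5 assumed): for an MA(1) process,
# sum_h gamma(h) = gamma(0) + 2*gamma(1) equals sigma^2 * Psi(1)^2.

theta, sigma2 = 0.5, 1.0
gamma0 = sigma2 * (1 + theta ** 2)     # gamma(0)
gamma1 = sigma2 * theta                # gamma(1); gamma(h) = 0 for |h| > 1
lrv = gamma0 + 2 * gamma1
assert abs(lrv - sigma2 * (1 + theta) ** 2) < 1e-12
```

The asymptotic variance of √T X̄_T is thus 2.25 rather than γ(0) = 1.25, which already hints at the role of the long-run variance introduced below.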
For a causal process X_t = μ + Σ_{j=0}^∞ ψ_j Z_{t−j} with the properties Z_t ~ IID(0, σ²) and Σ_{j=0}^∞ j²ψ_j² < ∞, the arithmetic average X̄_T is asymptotically normal:

$$\sqrt{T}\big(\overline{X}_T - \mu\big) \xrightarrow{d} N\Big(0, \sum_{h=-\infty}^{\infty}\gamma(h)\Big) = N\Big(0, \sigma^2\big(\sum_{j=0}^{\infty}\psi_j\big)^2\Big) = N\big(0, \Psi(1)^2\sigma^2\big).$$
The proof rests on the Beveridge-Nelson decomposition2 Ψ(L) = Ψ(1) + (L − 1)Ψ̃(L) with ψ̃_j = Σ_{i=j+1}^∞ ψ_i, which allows us to write

$$\overline{X}_T - \mu = \frac{1}{T}\sum_{t=1}^{T}\Psi(L)Z_t = \frac{1}{T}\sum_{t=1}^{T}\big(\Psi(1) + (L-1)\widetilde{\Psi}(L)\big)Z_t$$
$$= \Psi(1)\,\frac{1}{T}\sum_{t=1}^{T} Z_t + \frac{1}{T}\widetilde{\Psi}(L)(Z_0 - Z_T),$$

so that

$$\sqrt{T}\big(\overline{X}_T - \mu\big) = \Psi(1)\sqrt{T}\,\frac{\sum_{t=1}^{T} Z_t}{T} + \frac{1}{\sqrt{T}}\widetilde{\Psi}(L)Z_0 - \frac{1}{\sqrt{T}}\widetilde{\Psi}(L)Z_T.$$
2 The Beveridge-Nelson decomposition is an indispensable tool for the understanding of integrated and cointegrated processes analyzed in Chaps. 7 and 16.
The assumption Z_t ~ IID(0, σ²) allows us to invoke the Central Limit Theorem C.12 of Appendix C for the first term. Thus, √T Σ_{t=1}^T Z_t / T is asymptotically normal with mean zero and variance σ². Theorem D.1 also implies |Ψ(1)| < ∞. Therefore, the term Ψ(1)√T Σ_{t=1}^T Z_t / T is asymptotically normal with mean zero and variance σ²Ψ(1)². The variances of the second and third terms are equal to (σ²/T) Σ_{j=0}^∞ ψ̃_j². The summability condition then implies, according to Theorem D.1, that Σ_{j=0}^∞ ψ̃_j² converges. Thus, the variances of the last two terms converge to zero, implying that these terms converge to zero in probability (see Theorem C.7) and thus also in distribution. We can then invoke Theorem C.10 to establish the theorem. Finally, the equality of Σ_{h=−∞}^∞ γ(h) and Ψ(1)²σ² can be obtained from direct computation or by application of Theorem 6.4. □
Remark 4.1. Theorem 4.2 holds for any causal ARMA process because the ψ_j's converge exponentially fast to zero (see the discussion following Eq. (2.5)).

Remark 4.2. If {X_t} is a Gaussian process, then for any given fixed T, X̄_T is distributed as

$$\sqrt{T}\big(\overline{X}_T - \mu\big) \sim N\Big(0, \sum_{|h|<T}\big(1 - \frac{|h|}{T}\big)\gamma(h)\Big).$$
\[
J = \sum_{h=-\infty}^{\infty}\gamma(h) = \gamma(0)\Bigl(1 + 2\sum_{h=1}^{\infty}\rho(h)\Bigr). \tag{4.1}
\]
Note that the long-run variance equals $2\pi$ times the spectral density $f(\lambda)$ evaluated at $\lambda = 0$ (see the Definition 6.1 of the spectral density in Sect. 6.1).
As the long-run variance takes into account the serial properties of the time series, it is also called the heteroskedasticity and autocorrelation consistent variance (HAC variance). If $\{X_t\}$ has some nontrivial autocorrelation (i.e. $\rho(h) \neq 0$ for some $h \neq 0$), the long-run variance $J$ is different from $\gamma(0)$. This implies among other things that the construction of the t-statistic for testing the simple hypothesis $H_0: \mu = \mu_0$ should be based on $J$ rather than on $\gamma(0)$.
In case that $\{X_t\}$ is a causal ARMA process with $\Phi(L)X_t = \Theta(L)Z_t$, $Z_t \sim \mathrm{WN}(0,\sigma^2)$, the long-run variance is given by

\[
J = \sigma^2\,\frac{\Theta(1)^2}{\Phi(1)^2} = \Psi(1)^2\sigma^2.
\]
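As a quick numerical illustration, the formula $J = \sigma^2\Theta(1)^2/\Phi(1)^2$ can be evaluated directly; the ARMA coefficients below are made up for the example, not taken from the text:

```python
# Long-run variance of a causal ARMA process: J = sigma^2 * (Theta(1)/Phi(1))^2.
def long_run_variance(phi, theta, sigma2):
    """phi: AR coefficients [phi_1, ..., phi_p]; theta: MA coefficients [theta_1, ..., theta_q]."""
    Phi1 = 1.0 - sum(phi)      # Phi(1) = 1 - phi_1 - ... - phi_p
    Theta1 = 1.0 + sum(theta)  # Theta(1) = 1 + theta_1 + ... + theta_q
    return sigma2 * (Theta1 / Phi1) ** 2

# Hypothetical ARMA(1,1): Phi(z) = 1 - 0.5z, Theta(z) = 1 + 0.4z, sigma^2 = 2
J = long_run_variance([0.5], [0.4], 2.0)  # 2 * (1.4 / 0.5)^2 = 15.68
```

For a white noise process ($p = q = 0$) the expression collapses to $\sigma^2$, as it should.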
4.2 Estimation of ACF 73
With some slight, asymptotically unimportant modifications, we can use the standard estimators for the autocovariances, $\gamma(h)$, and the autocorrelations, $\rho(h)$, of a stationary stochastic process:

\[
\hat{\gamma}(h) = \frac{1}{T}\sum_{t=1}^{T-h}\bigl(X_t - \bar{X}_T\bigr)\bigl(X_{t+h} - \bar{X}_T\bigr), \tag{4.2}
\]
\[
\hat{\rho}(h) = \frac{\hat{\gamma}(h)}{\hat{\gamma}(0)}. \tag{4.3}
\]
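A direct transcription of the estimators (4.2) and (4.3) — a sketch in plain Python, with function names of our choosing:

```python
def acovf(x, h):
    """gamma_hat(h) of Eq. (4.2): the sum runs to T - h but is normalized by T,
    and both factors are centered at the full-sample mean."""
    T = len(x)
    mean = sum(x) / T
    return sum((x[t] - mean) * (x[t + h] - mean) for t in range(T - h)) / T

def acf(x, h):
    """rho_hat(h) of Eq. (4.3)."""
    return acovf(x, h) / acovf(x, 0)
```

By construction $\hat{\rho}(0) = 1$ for any sample.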
These estimators are biased because the sums are normalized (divided) by $T$ rather than $T-h$. The normalization with $T-h$ delivers an unbiased estimate only if $\bar{X}_T$ is replaced by $\mu$ which, however, is typically unknown in practice. The second modification concerns the use of the complete sample for the estimation of $\mu$.³ The main advantage of using the above estimators is that the implied estimator for the covariance matrix, $\hat{\Gamma}_T$, of $(X_1,\ldots,X_T)'$, respectively the autocorrelation matrix, $\hat{R}_T$,

\[
\hat{\Gamma}_T = \begin{pmatrix}
\hat{\gamma}(0) & \hat{\gamma}(1) & \ldots & \hat{\gamma}(T-1) \\
\hat{\gamma}(1) & \hat{\gamma}(0) & \ldots & \hat{\gamma}(T-2) \\
\vdots & \vdots & \ddots & \vdots \\
\hat{\gamma}(T-1) & \hat{\gamma}(T-2) & \ldots & \hat{\gamma}(0)
\end{pmatrix},
\qquad
\hat{R}_T = \frac{\hat{\Gamma}_T}{\hat{\gamma}(0)},
\]
always delivers, independently of the realized observations, non-negative definite and, for $\hat{\gamma}(0) > 0$, non-singular matrices. The resulting estimated autocovariance function will then satisfy the characterization given in Theorem 1.1, in particular property (iv).
³ The standard statistical formulas would suggest to estimate the mean appearing in the first multiplicand from $X_1,\ldots,X_{T-h}$, and the mean appearing in the second multiplicand from $X_{h+1},\ldots,X_T$.
According to Box and Jenkins (1976, p. 33), one can expect reasonable estimates for $\gamma(h)$ and $\rho(h)$ if the sample size is larger than 50 and if the order of the autocorrelation coefficient is smaller than $T/4$.
The theorem below establishes that these estimators lead under rather mild
conditions to consistent and asymptotically normally distributed estimators.
Theorem 4.4 (Asymptotic Distribution of Autocorrelations). Let $\{X_t\}$ be the stationary process

\[
X_t = \mu + \sum_{j=-\infty}^{\infty}\psi_j Z_{t-j}
\]

with $Z_t \sim \mathrm{IID}(0,\sigma^2)$, $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$ and $\sum_{j=-\infty}^{\infty}|j|\,\psi_j^2 < \infty$. Then we have for $h = 1, 2, \ldots$

\[
\sqrt{T}\left(\begin{pmatrix}\hat{\rho}(1)\\ \vdots \\ \hat{\rho}(h)\end{pmatrix} - \begin{pmatrix}\rho(1)\\ \vdots \\ \rho(h)\end{pmatrix}\right) \xrightarrow{d} N(0, W)
\]

where the elements of $W = (w_{ij})_{i,j\in\{1,\ldots,h\}}$ are given by Bartlett's formula

\[
w_{ij} = \sum_{k=1}^{\infty}\bigl[\rho(k+i) + \rho(k-i) - 2\rho(i)\rho(k)\bigr]\bigl[\rho(k+j) + \rho(k-j) - 2\rho(j)\rho(k)\bigr].
\]
Brockwell and Davis (1991) offer a second version of the above theorem where $\sum_{j=-\infty}^{\infty}|j|\,\psi_j^2 < \infty$ is replaced by the assumption of finite fourth moments, i.e. by assuming $\mathrm{E}Z_t^4 < \infty$. As we rely mainly on ARMA processes, we do not pursue this distinction further because this class of processes automatically fulfills the above assumptions as soon as $\{Z_t\}$ is identically and independently distributed (IID). A proof which relies on the Beveridge-Nelson polynomial decomposition (see Theorem D.1 in Appendix D) can be gathered from Phillips and Solo (1992).
The most important application of Theorem 4.4 is related to the case of a white noise process. For this process $\rho(h)$ is equal to zero for $|h| > 0$. Theorem 4.4 then implies that

\[
w_{ij} = \begin{cases} 1, & \text{for } i = j; \\ 0, & \text{otherwise.} \end{cases}
\]
Fig. 4.1 Estimated autocorrelation function of a WN(0,1) process with 95 % confidence interval for sample size $T = 100$
This test statistic is also asymptotically $\chi^2$ distributed with the same degrees of freedom $N$. This statistic accounts for the fact that the estimates for high orders $h$ are based on a smaller number of observations and are thus less precise and more noisy. The two test statistics are used in the usual way. The null hypothesis that all correlation coefficients are jointly equal to zero is rejected if $Q$, respectively $Q'$, is larger than the critical value corresponding to the $\chi^2_N$ distribution. The number of summands $N$ is usually taken to be rather large, for a sample size of 150 in the range between 15 and 20. The two tests are also referred to as Portmanteau tests.
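This excerpt does not reproduce Eq. (4.4), so the sketch below uses the standard textbook forms of the two statistics: the Box-Pierce statistic $Q = T\sum_{h=1}^{N}\hat{\rho}(h)^2$ and the Ljung-Box refinement, which reweights the summands:

```python
def box_pierce(rho, T):
    """Q = T * sum_{h=1}^{N} rho_hat(h)^2 for rho = [rho_hat(1), ..., rho_hat(N)]."""
    return T * sum(r * r for r in rho)

def ljung_box(rho, T):
    """Q' = T(T+2) * sum_h rho_hat(h)^2 / (T-h): higher orders h, which are
    estimated from fewer observations, receive a larger weight."""
    return T * (T + 2) * sum(r * r / (T - h) for h, r in enumerate(rho, start=1))
```

Both statistics are then compared with the critical value of the $\chi^2_N$ distribution.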
For $i = j > q$, Bartlett's formula yields $w_{ii} = 1 + 2\sum_{k=1}^{q}\rho(k)^2$ because $\rho(h) = 0$ for $h > q$. The 95 % confidence interval for the MA(1) process $X_t = Z_t - 0.8Z_{t-1}$ is therefore given for a sample size of $T = 200$ by $\pm 1.96\,T^{-1/2}\bigl[1 + 2\rho(1)^2\bigr]^{1/2} = \pm 0.1684$.
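The half-width $\pm 0.1684$ can be checked numerically, using the standard MA(1) autocorrelation $\rho(1) = \theta/(1+\theta^2)$:

```python
# MA(1): X_t = Z_t - 0.8 Z_{t-1}, i.e. theta = -0.8, sample size T = 200.
theta, T = -0.8, 200
rho1 = theta / (1 + theta ** 2)                      # rho(1) = -0.4878...
half_width = 1.96 * T ** -0.5 * (1 + 2 * rho1 ** 2) ** 0.5
# half_width reproduces the 0.1684 reported in the text (up to rounding)
```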
Figure 4.2 shows the estimated autocorrelation function of the above MA(1) process together with 95 % confidence intervals based on a white noise process and an MA(1) process with $\theta = -0.8$.

Fig. 4.2 Estimated autocorrelation function of an MA(1) process with $\theta = -0.8$ and corresponding 95 % confidence interval for $T = 200$

As the first order autocorrelation coefficient is clearly outside the confidence interval whereas all other autocorrelation coefficients are inside it, the figure demonstrates that the observations are evidently the realization of an MA(1) process.
The formulas for $w_{ij}$ with $i \neq j$ are not shown. In any case, these formulas are of relatively little importance because the partial autocorrelations are better suited for the identification of AR processes (see Sects. 3.5 and 4.3).
Figure 4.3 shows an estimated autocorrelation function of an AR(1) process. The autocorrelation coefficients decline exponentially, which is characteristic for an AR(1) process.⁴

Fig. 4.3 Estimated autocorrelation function of an AR(1) process with $\phi = 0.8$ and corresponding 95 % confidence interval for $T = 100$

Furthermore, the coefficients are outside the confidence interval for white noise processes up to order 8.
According to its definition (see Definition 3.2), the partial autocorrelation of order $h$, $\alpha(h)$, is equal to $a_h$, the last element of the vector $\alpha_h = \Gamma_h^{-1}\gamma_h(1) = R_h^{-1}\rho_h(1)$. Thus, $\alpha_h$ and consequently $a_h$ can be estimated by $\hat{\alpha}_h = \hat{\Gamma}_h^{-1}\hat{\gamma}_h(1) = \hat{R}_h^{-1}\hat{\rho}_h(1)$. As $\gamma(h)$ can be consistently estimated and is asymptotically normally distributed (see Sect. 4.2), the continuous mapping theorem (see Appendix C) ensures that the above estimator for $\alpha(h)$ is also consistent and asymptotically normal. In particular we have for an AR(p) process (Brockwell and Davis 1991)

\[
\sqrt{T}\,\hat{\alpha}(h) \xrightarrow{d} N(0,1) \quad \text{for } T \to \infty \text{ and } h > p.
\]
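The estimator $\hat{\alpha}(h)$ can be sketched by solving the system $\hat{R}_h a = \hat{\rho}_h(1)$ for each order $h$ and keeping the last element of the solution (a sketch with the ACF values assumed given; the function name is ours):

```python
import numpy as np

def pacf_from_acf(rho):
    """alpha_hat(h) for h = 1..len(rho), where rho = [rho_hat(1), ..., rho_hat(H)].
    alpha(h) is the last element of R_h^{-1} rho_h(1)."""
    r = np.concatenate(([1.0], np.asarray(rho, dtype=float)))
    alphas = []
    for h in range(1, len(rho) + 1):
        R = np.array([[r[abs(i - j)] for j in range(h)] for i in range(h)])  # Toeplitz R_h
        a = np.linalg.solve(R, r[1:h + 1])
        alphas.append(a[-1])
    return alphas

# For an AR(1) with phi = 0.8, rho(h) = 0.8^h, so the PACF is 0.8 at lag 1 and 0 beyond
p = pacf_from_acf([0.8, 0.64, 0.512])
```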
Fig. 4.4 Estimated PACF for an AR(1) process with $\phi = 0.8$ and corresponding 95 % confidence interval for $T = 200$
⁴ As a reminder: the theoretical autocorrelation coefficients are $\rho(h) = \phi^{|h|}$.
4.4 Estimation of the Long-Run Variance 79
Fig. 4.5 Estimated PACF for an MA(1) process with $\theta = -0.8$ and corresponding 95 % confidence interval for $T = 200$
Figure 4.5 shows the estimated PACF for an MA(1) process with $\theta = -0.8$. In conformity with the theory, the partial autocorrelation coefficients converge to zero. They do so in an oscillating manner, as the formula in Sect. 3.5 implies for this value of $\theta$.
\[
J = \sum_{h=-\infty}^{\infty}\gamma(h) = \gamma(0) + 2\sum_{h=1}^{\infty}\gamma(h) = \gamma(0)\Bigl(1 + 2\sum_{h=1}^{\infty}\rho(h)\Bigr). \tag{4.5}
\]
This can, in principle, be done in two different ways. The first one consists in the estimation of an ARMA model which is then used to derive the implied covariances as explained in Sect. 2.4. These covariances are then inserted into Eq. (4.5). The second method is a nonparametric one and is the subject of the rest of this section. It has the advantage that it is not necessary to identify and estimate an appropriate ARMA model, a step which can be cumbersome in practice. Additional and more
⁵ For example, when testing the null hypothesis $H_0: \mu = \mu_0$ in the case of serially correlated observations (see Sect. 4.1), or for the Phillips-Perron unit-root test explained in Sect. 7.3.2.
⁶ See Theorem 4.2 and the comments following it.
advanced material on this topic can be found in Andrews (1991), Andrews and
Monahan (1992), or among others in Haan and Levin (1997).7
If the sample size is $T$, only the covariances $\gamma(0),\ldots,\gamma(T-1)$ can, in principle, be estimated. Thus, a first naive estimator of $J$ is given by $\hat{J}_T$ defined as

\[
\hat{J}_T = \sum_{h=-T+1}^{T-1}\hat{\gamma}(h) = \hat{\gamma}(0) + 2\sum_{h=1}^{T-1}\hat{\gamma}(h) = \hat{\gamma}(0)\Bigl(1 + 2\sum_{h=1}^{T-1}\hat{\rho}(h)\Bigr).
\]

This estimator, however, performs poorly because the autocovariances of high order are estimated from only a few observations. One therefore weights the estimated autocovariances:

\[
\hat{J}_T(\ell_T) = \sum_{h=-T+1}^{T-1} k\Bigl(\frac{h}{\ell_T}\Bigr)\hat{\gamma}(h),
\]

where $k$ is a weighting or kernel function.⁸ The kernel functions are required to have the following properties: $k(0) = 1$, $|k(x)| \leq 1$, $k$ is continuous at zero, and $k$ is symmetric, $k(-x) = k(x)$.
The basic idea of the kernel function is to give relatively little weight to the higher order autocovariances and relatively more weight to the smaller order ones. As $k(0)$ equals one, the variance $\hat{\gamma}(0)$ receives weight one by construction. The continuity assumption implies that also the covariances of smaller order, i.e. for $h$ small, receive a weight close to one. Table 4.1 lists some of the most popular kernel functions used in practice.
Figure 4.6 shows a plot of these functions. The first three functions are nonzero only for $|x| < 1$. This implies that only the orders $h$ for which $|h| \leq \ell_T$ are taken into account. $\ell_T$ is called the lag truncation parameter or the bandwidth. The quadratic spectral kernel function is an example of a kernel function which takes all covariances into account. Note that some weights are negative in this case, as shown in Fig. 4.6.⁹

⁷ Note the connection between the long-run variance and the spectral density at frequency zero: $J = 2\pi f(0)$ where $f$ is the spectral density function (see Sect. 6.3).
⁸ Kernel functions are also relevant for spectral estimators. See in particular Sect. 6.3.

Fig. 4.6 Kernel functions: boxcar (truncated), Bartlett, Daniell, Tukey-Hanning, and quadratic spectral
The estimator for the long-run variance is subject to the correction term $T/(T-r)$. This factor depends on the number of parameters estimated in a first step and is only relevant when the sample size is relatively small. In the case of the estimation of the mean, $r$ would be equal to one and the correction term is negligible. If, on the other hand, $X_t$, $t = 1,\ldots,T$, are the residuals from a multivariate regression, $r$ designates the number of regressors. In many applications the correction term is omitted.
The lag truncation parameter or bandwidth, $\ell_T$, depends on the number of observations. It is intuitive that the number of autocovariances accounted for in
⁹ Phillips (2004) has proposed a nonparametric regression-based method which does not require a kernel function.
the computation of the long-run variance should increase with the sample size, i.e. we should have $\ell_T \to \infty$ for $T \to \infty$.¹⁰ The relevant issue is at which rate the lag truncation parameter should go to infinity. The literature has made several suggestions.¹¹ In the following we concentrate on the Bartlett and the quadratic spectral kernels because these functions always deliver a positive long-run variance in small samples. Andrews (1991) proposes the following formulas to determine the optimal bandwidth:
\[
\text{Bartlett:}\quad \ell_T = 1.1447\,\bigl[\alpha_{\mathrm{Bartlett}}\, T\bigr]^{1/3}
\]
\[
\text{Quadratic Spectral:}\quad \ell_T = 1.3221\,\bigl[\alpha_{\mathrm{QuadraticSpectral}}\, T\bigr]^{1/5}
\]

where $[\,\cdot\,]$ rounds to the nearest integer. The two coefficients $\alpha_{\mathrm{Bartlett}}$ and $\alpha_{\mathrm{QuadraticSpectral}}$ are data dependent constants which have to be determined in a first step from the data (see Andrews (1991, 832-839), Andrews and Monahan (1992, 958), and Haan and Levin (1997)). If the underlying process is approximated by an AR(1) model, we get:

\[
\alpha_{\mathrm{Bartlett}} = \frac{4\hat{\rho}^2}{(1-\hat{\rho})^2(1+\hat{\rho})^2}, \qquad
\alpha_{\mathrm{QuadraticSpectral}} = \frac{4\hat{\rho}^2}{(1-\hat{\rho})^4},
\]

where $\hat{\rho}$ is the first order empirical autocorrelation coefficient.
In order to avoid the cumbersome determination of the $\alpha$'s, Newey and West (1994) suggest the following rules of thumb:

\[
\text{Bartlett:}\quad \ell_T = \beta_{\mathrm{Bartlett}}\Bigl(\frac{T}{100}\Bigr)^{2/9}
\]
\[
\text{Quadratic Spectral:}\quad \ell_T = \beta_{\mathrm{QuadraticSpectral}}\Bigl(\frac{T}{100}\Bigr)^{2/25}.
\]

It has been shown that values of 4 for $\beta_{\mathrm{Bartlett}}$ as well as for $\beta_{\mathrm{QuadraticSpectral}}$ lead to acceptable results. A comparison of these formulas with the ones provided by Andrews shows that the latter imply larger values for $\ell_T$ when the sample size gets
¹⁰ This is true even when the underlying process is known to be an MA(q) process. Even in this case it is advantageous to include also the autocovariances for $h > q$. The reason is twofold. First, only when $\ell_T \to \infty$ for $T \to \infty$ do we get a consistent estimator, i.e. $\hat{J}_T \to J_T$, respectively $J$. Second, the restriction to $\hat{\gamma}(h)$, $|h| \leq q$, does not necessarily lead to a positive value for the estimated long-run variance $\hat{J}_T$, even when the Bartlett kernel is used. See Ogaki (1992) for details.
¹¹ See Haan and Levin (1997) for an overview.
larger. Both approaches lead to consistent estimates, i.e. $\hat{J}_T(\ell_T) - J_T \xrightarrow{p} 0$ for $T \to \infty$.
In practice, a combination of both parametric and nonparametric methods has proved to deliver the best results. This combined method consists of five steps:
(i) The first step is called prewhitening and consists in the estimation of a simple ARMA model for the process $\{X_t\}$ to remove the most obvious serial correlations. The idea, which goes back to Press and Tukey (1956) (see also Priestley (1981)), is to get a process for the residuals $\hat{Z}_t$ which is close to a white noise process. Usually, an AR(1) model is sufficient.¹²
(ii) Choose a kernel function and, if the method of Andrews has been chosen, the corresponding data dependent constants, i.e. $\alpha_{\mathrm{Bartlett}}$ or $\alpha_{\mathrm{QuadraticSpectral}}$ for the Bartlett, respectively the quadratic spectral kernel function.
(iii) Compute the lag truncation parameter for the residuals using the above formulas.
(iv) Estimate the long-run variance for the residuals $\hat{Z}_t$.
(v) Compute the long-run variance for the original time series $\{X_t\}$.
If in the first step an AR(1) model, $X_t = \phi X_{t-1} + Z_t$, was used, the last step is given by:

\[
\hat{J}_T^X(\ell_T) = \frac{\hat{J}_T^Z(\ell_T)}{(1-\hat{\phi})^2},
\]

where $\hat{J}_T^Z(\ell_T)$ and $\hat{J}_T^X(\ell_T)$ denote the estimated long-run variances of $\{\hat{Z}_t\}$ and $\{X_t\}$, respectively. In the general case of an arbitrary ARMA model, $\Phi(L)X_t = \Theta(L)Z_t$, we get:

\[
\hat{J}_T^X(\ell_T) = \frac{\Theta(1)^2}{\Phi(1)^2}\,\hat{J}_T^Z(\ell_T).
\]
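The five steps with an AR(1) prewhitening step and a Bartlett kernel can be sketched as follows (function names are ours, and the Bartlett weights $k(h/\ell) = 1 - h/\ell$ for $|h| < \ell$ are assumed; this is a sketch, not the book's code):

```python
import numpy as np

def bartlett_lrv(z, ell):
    """J_hat(ell) = sum_{|h| < ell} (1 - |h|/ell) * gamma_hat(h), with gamma_hat
    normalized by T as in Eq. (4.2)."""
    z = np.asarray(z, dtype=float) - np.mean(z)
    T = len(z)
    J = z @ z / T
    for h in range(1, ell):
        J += 2 * (1 - h / ell) * (z[:-h] @ z[h:]) / T
    return J

def prewhitened_lrv(x, ell):
    """Steps (i)-(v): fit an AR(1), estimate the long-run variance of the
    residuals, then recolor with 1/(1 - phi_hat)^2."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    phi = (x[:-1] @ x[1:]) / (x[:-1] @ x[:-1])    # (i) prewhitening AR(1)
    z = x[1:] - phi * x[:-1]                      # residuals Z_hat_t
    return bartlett_lrv(z, ell) / (1 - phi) ** 2  # (iv)-(v) recoloring
```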
4.4.1 An Example

Suppose we want to test whether the yearly growth rate of Switzerland's real GDP in the last 25 years was higher than 1 %. For this purpose we compute the percentage change against the corresponding quarter of the last year over the period 1982:1 to 2006:1 (97 observations in total), i.e. we compute $X_t = (1 - L^4)\log(\mathrm{GDP}_t)$. The arithmetic average of these growth rates is 1.4960 with a variance of 3.0608.
¹² If in this step an AR(1) model is used and a first order correlation $\hat{\phi}$ larger in absolute terms than 0.97 is obtained, Andrews and Monahan (1992, 457) suggest to replace $\hat{\phi}$ by 0.97, respectively $-0.97$. Instead of using an arbitrary fixed value, it turns out that a data driven value is superior. Sul et al. (2005) suggest to replace 0.97 by $1 - 1/\sqrt{T}$ and $-0.97$ by $-1 + 1/\sqrt{T}$.
Fig. 4.7 Estimated autocorrelation function for Switzerland’s real GDP growth (percentage
change against corresponding last year’s quarter)
We test the null hypothesis that the growth rate is smaller than one against the alternative that it is greater than one. The corresponding value of the t-statistic is $(1.4960 - 1)/\sqrt{3.0608/97} = 2.7922$. Taking a 5 % significance level, the critical value for this one-sided test is 1.661. Thus the null hypothesis is clearly rejected.
The above computation is, however, not valid because the serial correlation of the time series was not taken into account. Indeed the estimated autocorrelation function shown in Fig. 4.7 clearly shows that the growth rate is subject to high and statistically significant autocorrelations.
Taking the Bartlett function as the kernel function, the rule of thumb formula for the lag truncation parameter suggests $\ell_T = 4$. The weights in the computation of the long-run variance are therefore

\[
k(h/\ell_T) = \begin{cases}
1, & h = 0; \\
3/4, & h = \pm 1; \\
2/4, & h = \pm 2; \\
1/4, & h = \pm 3; \\
0, & |h| \geq 4.
\end{cases}
\]
The corresponding estimate for the long-run variance is therefore given by:

\[
\hat{J}_T = 3.0608\Bigl(1 + 2\cdot\frac{3}{4}\cdot 0.8287 + 2\cdot\frac{2}{4}\cdot 0.6019 + 2\cdot\frac{1}{4}\cdot 0.3727\Bigr) = 9.2783.
\]

Using the long-run variance instead of the simple variance leads to a quite different value of the t-statistic: $(1.4960 - 1)/\sqrt{9.2783/97} = 1.6037$. The null hypothesis is thus not rejected at the 5 % significance level when the serial correlation of the process is taken into account.
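The arithmetic of the example can be replayed directly, using only the numbers quoted in the text:

```python
# Inputs quoted in the text: gamma_hat(0) = 3.0608, rho_hat(1..3), T = 97, ell_T = 4.
gamma0, rho, T, ell = 3.0608, [0.8287, 0.6019, 0.3727], 97, 4
J = gamma0 * (1 + 2 * sum((1 - h / ell) * rho[h - 1] for h in range(1, ell)))
t_naive = (1.4960 - 1) / (gamma0 / T) ** 0.5   # 2.7922: rejects H0
t_hac = (1.4960 - 1) / (J / T) ** 0.5          # 1.6037: does not reject H0
```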
4.5 Exercises
Exercise 4.5.1. You regress 100 realizations of a stationary stochastic process $\{X_t\}$ against a constant $c$. The least-squares estimate of $c$ equals $\hat{c} = 0.4$ with an estimated standard deviation of $\hat{\sigma}_c = 0.15$. In addition, you have estimated the autocorrelation function up to order $h = 5$ and obtained the following values:

\[
\hat{\rho}(1) = 0.43,\quad \hat{\rho}(2) = 0.13,\quad \hat{\rho}(3) = 0.12,\quad \hat{\rho}(4) = 0.18,\quad \hat{\rho}(5) = 0.23.
\]

We assume that the stochastic process has mean zero and is governed by a causal purely autoregressive model of order $p$:

\[
\Phi(L)X_t = Z_t \quad \text{with } Z_t \sim \mathrm{WN}(0,\sigma^2).
\]
\[
\begin{pmatrix}
\gamma(0) & \gamma(1) & \ldots & \gamma(p-1) \\
\gamma(1) & \gamma(0) & \ldots & \gamma(p-2) \\
\vdots & \vdots & \ddots & \vdots \\
\gamma(p-1) & \gamma(p-2) & \ldots & \gamma(0)
\end{pmatrix}
\begin{pmatrix}\phi_1 \\ \phi_2 \\ \vdots \\ \phi_p\end{pmatrix}
=
\begin{pmatrix}\gamma(1) \\ \gamma(2) \\ \vdots \\ \gamma(p)\end{pmatrix},
\]

respectively

\[
\gamma(0) - \Phi'\gamma_p(1) = \sigma^2, \qquad \Gamma_p\Phi = \gamma_p(1).
\]
Note the recursiveness of the equation system: the estimate $\hat{\Phi}$ is obtained without knowledge of $\hat{\sigma}^2$ as the estimator $\hat{R}_p^{-1}\hat{\rho}_p(1)$ involves only autocorrelations. The estimates $\hat{\Gamma}_p$, $\hat{R}_p$, $\hat{\gamma}_p(1)$, $\hat{\rho}_p(1)$, and $\hat{\gamma}(0)$ are obtained in the usual way as explained in Chap. 4.¹
¹ Note that the application of the estimator introduced in Sect. 4.2 guarantees that $\hat{\Gamma}_p$ is always invertible.
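The Yule-Walker estimator for an AR(p) model can be sketched as follows (a helper of our own, built on the Sect. 4.2 estimators):

```python
import numpy as np

def yule_walker(x, p):
    """Phi_hat = R_hat_p^{-1} rho_hat_p(1) and
    sigma2_hat = gamma_hat(0) * (1 - Phi_hat' rho_hat_p(1))."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    T = len(x)
    gamma = np.array([x[:T - h] @ x[h:] / T for h in range(p + 1)])  # Eq. (4.2)
    rho = gamma / gamma[0]
    R = np.array([[rho[abs(i - j)] for j in range(p)] for i in range(p)])
    phi = np.linalg.solve(R, rho[1:p + 1])
    sigma2 = gamma[0] * (1 - phi @ rho[1:p + 1])
    return phi, sigma2
```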
5.1 The Yule-Walker Estimator 89
The construction of the Yule-Walker estimator implies that the first $p$ values of the autocovariance, respectively the autocorrelation function, implied by the estimated model exactly correspond to their estimated counterparts. It can be shown that this moment estimator always delivers coefficients $\hat{\Phi}$ which imply that $\{X_t\}$ is causal with respect to $\{Z_t\}$. In addition, the following Theorem establishes that the estimated coefficients are asymptotically normal.
Theorem 5.1 (Asymptotic Normality of Yule-Walker Estimator). Let $\{X_t\}$ be an AR(p) process which is causal with respect to $\{Z_t\}$ whereby $\{Z_t\} \sim \mathrm{IID}(0,\sigma^2)$. Then the Yule-Walker estimator is consistent and $\hat{\Phi}$ is asymptotically normal with distribution given by:

\[
\sqrt{T}\bigl(\hat{\Phi} - \Phi\bigr) \xrightarrow{d} N\bigl(0, \sigma^2\Gamma_p^{-1}\bigr).
\]
In the case of an AR(1) model, the Yule-Walker estimator reduces to

\[
\hat{\Phi} = \hat{\phi} = \hat{\rho}(1) = \frac{\hat{\gamma}(1)}{\hat{\gamma}(0)},
\]

with asymptotic variance $\sigma^2\Gamma_1^{-1} = 1 - \phi^2$. This shows that the assumption of causality, i.e. $|\phi| < 1$, is crucial. Otherwise no strictly positive value for the variance would exist. For the case $\phi = 1$, which corresponds to the random walk, the asymptotic distribution of $T(\hat{\phi} - 1)$ becomes degenerate as the variance is equal to zero. This case is, however, of prime importance in economics and is treated in detail in Chap. 7.
In practice the order of the model is usually unknown. However, one can expect
when estimating an AR(m) model whereby the true order p is strictly smaller than m
90 5 Estimation of ARMA Models
that the estimated coefficients $\hat{\phi}_{p+1},\ldots,\hat{\phi}_m$ should be close to zero. This is indeed the case as shown in Brockwell and Davis (1991, 241). In particular, under the assumptions of Theorem 5.1 it holds that

\[
\sqrt{T}\,\hat{\phi}_m \xrightarrow{d} N(0,1) \quad \text{for } m > p. \tag{5.1}
\]

This result justifies the following strategy to identify the order of an AR-model. Estimate in a first step a highly parameterized model (overfitted model), i.e. a model with a large value of $m$, and test via a t-test whether $\phi_m$ is zero. If the hypothesis cannot be rejected, reduce the order of the model from $m$ to $m-1$ and repeat the same procedure now with respect to $\phi_{m-1}$. This is continued until the hypothesis is rejected.
If the order of the initial model is too low (underfitted model) so that the true order is higher than $m$, one incurs an "omitted variable bias". The corresponding estimates are no longer consistent. In Sect. 5.4, we take a closer look at the problem of determining the order of a model.
\[
\hat{\gamma}(0) = \hat{\sigma}^2\bigl(1 + \hat{\theta}^2\bigr), \qquad \hat{\gamma}(1) = \hat{\sigma}^2\hat{\theta}.
\]

As shown in Sect. 1.5.1, this system of equations has for the case $|\hat{\rho}(1)| = |\hat{\gamma}(1)/\hat{\gamma}(0)| < 1/2$ two solutions; for $|\hat{\rho}(1)| = |\hat{\gamma}(1)/\hat{\gamma}(0)| = 1/2$ one solution; and for $|\hat{\rho}(1)| = |\hat{\gamma}(1)/\hat{\gamma}(0)| > 1/2$ no real solution. In the case of several solutions, we usually take the invertible one which leads to $|\hat{\theta}| < 1$. Invertibility is, however, a restriction which is hard to implement in the case of higher order MA processes. Moreover, it can be shown that the Yule-Walker estimator is no longer consistent in general (see Brockwell and Davis (1991, 246) for details). For these reasons, it is not advisable to use the Yule-Walker estimator in the case of MA processes, especially when there exist consistent and efficient alternatives.
5.2 OLS Estimation of an AR(p) Model 91
\[
X_t = \phi_1 X_{t-1} + \ldots + \phi_p X_{t-p} + Z_t, \qquad Z_t \sim \mathrm{WN}(0,\sigma^2). \tag{5.2}
\]

Note that the first $p$ observations are lost and that the effective sample size is thus reduced to $T - p$. The least-squares estimator (OLS estimator) is obtained as the minimizer of the sum of squares $S(\Phi)$:

\[
S(\Phi) = \sum_{t=p+1}^{T}\bigl(X_t - \phi_1 X_{t-1} - \ldots - \phi_p X_{t-p}\bigr)^2 \longrightarrow \min_{\Phi}. \tag{5.3}
\]
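A sketch of the OLS estimator of Eq. (5.3): build the matrix of lagged regressors, drop the first $p$ observations, and regress (the function name is ours; note that the residual variance here divides by the effective sample size $T - p$, whereas $s_T^2$ in the text divides by $T$):

```python
import numpy as np

def ols_ar(x, p):
    """Regress X_t on (X_{t-1}, ..., X_{t-p}) for t = p+1, ..., T; the first p
    observations are lost. Returns phi_hat and the residual variance."""
    x = np.asarray(x, dtype=float)
    y = x[p:]
    X = np.column_stack([x[p - j:len(x) - j] for j in range(1, p + 1)])
    phi, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ phi
    return phi, resid @ resid / len(y)

# A noise-free AR(1) with phi = 0.5 is recovered exactly
phi, s2 = ols_ar([1.0, 0.5, 0.25, 0.125, 0.0625], 1)
```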
Though Eq. (5.2) resembles very much an ordinary regression model, there are some important differences. First, the standard orthogonality assumption between regressors and error is violated: the regressors $X_{t-j}$, $j = 1,\ldots,p$, are correlated with the error terms $Z_{t-j}$, $j = 1, 2, \ldots$. Second, there is a dependency on the starting values $X_p,\ldots,X_1$. The assumption of causality, however, ensures that these features do not play a role asymptotically. It can be shown that $(X'X)/T$ converges in probability to $\Gamma_p$ and $(X'Y)/T$ to $\gamma_p(1)$. In addition, under quite general conditions,
\[
\operatorname{plim} s_T^2 = \sigma^2,
\]

where $s_T^2 = \hat{Z}'\hat{Z}/T$ and $\hat{Z}_t$ are the OLS residuals.
Proof. See Chap. 13 and in particular Sect. 13.3 for a proof in the multivariate case. Additional details may be gathered from Brockwell and Davis (1991, chapter 8). □
Remark 5.1. In practice $\sigma^2\Gamma_p^{-1}$ is approximated by $s_T^2(X'X/T)^{-1}$. Thus, for large $T$, $\hat{\Phi}$ can be viewed as being normally distributed as $N(\Phi, s_T^2(X'X)^{-1})$. This result allows the application of the usual t- and F-tests.
Remark 5.2. The OLS estimator does in general not deliver coefficients $\hat{\Phi}$ for which $\{X_t\}$ is causal with respect to $\{Z_t\}$. In particular, in the case of an AR(1) model, it can happen that, in contrast to the Yule-Walker estimator, $|\hat{\phi}|$ is larger than one despite the fact that the true parameter is smaller than one in absolute value. Nevertheless, the least-squares estimator is to be preferred in practice because it delivers small-sample biases of the coefficients which are smaller than those of the Yule-Walker estimator, especially for roots of $\Phi(z)$ close to the unit circle (Tjøstheim and Paulsen 1983; Shaman and Stine 1988; Reinsel 1993).
The proofs of Theorems 5.1 and 5.2 are rather involved and will therefore not be pursued here. A proof for the more general multivariate case will be given in Chap. 13. It is, however, instructive to look at a simple case, namely the AR(1) model with $|\phi| < 1$, $Z_t \sim \mathrm{IIN}(0,\sigma^2)$ and $X_0 = 0$. Denoting by $\hat{\phi}_T$ the OLS estimator of $\phi$, we have:

\[
\sqrt{T}\bigl(\hat{\phi}_T - \phi\bigr) = \frac{\frac{1}{\sqrt{T}}\sum_{t=1}^{T} X_{t-1}Z_t}{\frac{1}{T}\sum_{t=1}^{T} X_{t-1}^2}. \tag{5.4}
\]

The variance of the numerator in Eq. (5.4) is

\[
\mathrm{E}\Bigl(\frac{1}{\sqrt{T}}\sum_{t=1}^{T} X_{t-1}Z_t\Bigr)^2 = \frac{\sigma^2}{T}\sum_{t=1}^{T}\mathrm{E}X_{t-1}^2.
\]
Moreover, $\sum_{t=1}^{T} X_t^2 = \sum_{t=1}^{T} X_{t-1}^2 - (X_0^2 - X_T^2) = \phi^2\sum_{t=1}^{T} X_{t-1}^2 + \sum_{t=1}^{T} Z_t^2 + 2\phi\sum_{t=1}^{T} X_{t-1}Z_t$ so that

\[
\sum_{t=1}^{T} X_{t-1}^2 = \frac{X_0^2 - X_T^2}{1-\phi^2} + \frac{1}{1-\phi^2}\sum_{t=1}^{T} Z_t^2 + \frac{2\phi}{1-\phi^2}\sum_{t=1}^{T} X_{t-1}Z_t.
\]

Taking expectations,

\[
\frac{\sigma^2}{T}\sum_{t=1}^{T}\mathrm{E}X_{t-1}^2
= \frac{\sigma^2}{1-\phi^2}\,\frac{\mathrm{E}X_0^2 - \mathrm{E}X_T^2}{T}
+ \frac{\sigma^2}{1-\phi^2}\,\frac{\sum_{t=1}^{T}\mathrm{E}Z_t^2}{T}
+ \frac{2\phi\sigma^2}{1-\phi^2}\,\frac{\mathrm{E}\sum_{t=1}^{T} X_{t-1}Z_t}{T}
= -\frac{\sigma^4\bigl(1-\phi^{2T}\bigr)}{T(1-\phi^2)^2} + \frac{\sigma^4}{1-\phi^2}.
\]
The numerator in Eq. (5.4) therefore converges to a normal random variable with mean zero and variance $\sigma^4/(1-\phi^2)$.
The denominator in Eq. (5.4) can be rewritten as

\[
\frac{1}{T}\sum_{t=1}^{T} X_{t-1}^2 = \frac{X_0^2 - X_T^2}{(1-\phi^2)T} + \frac{1}{(1-\phi^2)T}\sum_{t=1}^{T} Z_t^2 + \frac{2\phi}{(1-\phi^2)T}\sum_{t=1}^{T} X_{t-1}Z_t.
\]
The expected value and the variance of $X_T^2/T$ converge to zero. Chebyschev's inequality (see Theorem C.3 in Appendix C) then implies that the first term converges also in probability to zero. $X_0$ is equal to zero by assumption. The second term has a constant mean equal to $\sigma^2/(1-\phi^2)$ and a variance which converges to zero. Theorem C.8 in Appendix C then implies that the second term converges in probability to $\sigma^2/(1-\phi^2)$. The third term has a mean of zero and a variance which converges to zero. Thus the third term converges to zero in probability. This implies:

\[
\frac{1}{T}\sum_{t=1}^{T} X_{t-1}^2 \xrightarrow{p} \frac{\sigma^2}{1-\phi^2}.
\]
Putting the results for the numerator and the denominator together and applying Theorem C.10 and the continuous mapping theorem for the convergence in distribution, one finally obtains:

\[
\sqrt{T}\bigl(\hat{\phi}_T - \phi\bigr) \xrightarrow{d} N\bigl(0, 1 - \phi^2\bigr). \tag{5.5}
\]
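A small Monte Carlo check of (5.5) — a sketch with arbitrary seed and sizes: for $\phi = 0.5$ the statistics $\sqrt{T}(\hat{\phi}_T - \phi)$ should have mean near zero and variance near $1 - \phi^2 = 0.75$.

```python
import numpy as np

rng = np.random.default_rng(0)
phi, T, reps = 0.5, 500, 2000
stats = []
for _ in range(reps):
    z = rng.standard_normal(T + 1)
    x = np.zeros(T + 1)                 # X_0 = 0 as in the text
    for t in range(1, T + 1):
        x[t] = phi * x[t - 1] + z[t]
    phi_hat = (x[:-1] @ x[1:]) / (x[:-1] @ x[:-1])   # OLS estimator of phi
    stats.append(np.sqrt(T) * (phi_hat - phi))
stats = np.asarray(stats)
# np.mean(stats) is close to 0 and np.var(stats) close to 1 - phi^2 = 0.75
```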
We assume that the process $\{X_t\}$ is a causal and invertible ARMA(p,q) process following the difference equation

\[
X_t - \phi_1 X_{t-1} - \ldots - \phi_p X_{t-p} = Z_t + \theta_1 Z_{t-1} + \ldots + \theta_q Z_{t-q}
\]

with $Z_t \sim \mathrm{IID}(0,\sigma^2)$.² We also assume that $\Phi(z)$ and $\Theta(z)$ have no roots in common. We then stack the parameters of the model into a vector $\beta$ and a scalar $\sigma^2$:

\[
\beta = \bigl(\phi_1,\ldots,\phi_p,\theta_1,\ldots,\theta_q\bigr)' \quad \text{and} \quad \sigma^2.
\]

Given the assumptions above, the admissible parameter space for $\beta$, $C$, is described by the following set:

\[
C = \bigl\{\beta \in \mathbb{R}^{p+q} : \Phi(z)\Theta(z) \neq 0 \text{ for } |z| \leq 1 \text{ and } \Phi(z), \Theta(z) \text{ have no common roots}\bigr\}.
\]
² If the process does not have a mean of zero, we can demean the data in a preliminary step.
³ In Sect. 2.4 we showed how the autocovariance function $\gamma$, and as a consequence $\Gamma_T$, respectively $G_T$, can be inferred from a given ARMA model, i.e. from a given $\beta$.
\[
\ln L_T(\beta|x_T) = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln\bigl(T^{-1}x_T' G_T(\beta)^{-1}x_T\bigr) - \frac{1}{2}\ln\det G_T(\beta).
\]

This function is then maximized with respect to $\beta \in C$. This is, however, equivalent to minimizing the function

\[
\ell_T(\beta|x_T) = \ln\bigl(T^{-1}x_T' G_T(\beta)^{-1}x_T\bigr) + T^{-1}\ln\det G_T(\beta) \longrightarrow \min_{\beta\in C}.
\]
\[
\exp\Bigl(-\frac{1}{2\sigma^2}\sum_{t=1}^{T}\frac{(X_t - P_{t-1}X_t)^2}{r_{t-1}}\Bigr).
\]

⁴ One such algorithm is the innovation algorithm. See Brockwell and Davis (1991, section 5) for details.
5.3 Estimation of an ARMA(p,q) Model 97
\[
\hat{\sigma}^2_{ML} = \frac{1}{T}S(\hat{\beta}_{ML}),
\]

where

\[
S(\hat{\beta}_{ML}) = \sum_{t=1}^{T}\frac{(X_t - P_{t-1}X_t)^2}{r_{t-1}}
\]

and where $\hat{\beta}_{ML}$ denotes the value of $\beta$ which minimizes the function

\[
\ell_T(\beta|x_T) = \ln\Bigl(\frac{1}{T}S(\beta)\Bigr) + \frac{1}{T}\sum_{t=1}^{T}\ln r_{t-1}
\]

with

\[
S(\beta) = \sum_{t=1}^{T}\frac{(X_t - P_{t-1}X_t)^2}{r_{t-1}}.
\]

The least-squares estimator $\hat{\beta}_{LS}$ minimizes $S(\beta)$ directly; the corresponding variance estimate is

\[
\hat{\sigma}^2_{LS} = \frac{S(\hat{\beta}_{LS})}{T - p - q}.
\]

The term $\frac{1}{T}\sum_{t=1}^{T}\ln r_{t-1}$ disappears asymptotically because, given the restriction $\beta \in C$, the mean-squared forecast error $v_T$ converges to $\sigma^2$ and thus $r_T$ goes to one as $T$ goes to infinity. This implies that for $T$ going to infinity the maximization of the likelihood function becomes equivalent to the minimization of the least-squares criterion. Thus the maximum-likelihood estimator and the least-squares estimator share the same asymptotic normal distribution.
Note also that in the case of autoregressive models $r_t$ is constant and equal to one. In this case, the least-squares criterion $S(\beta)$ reduces to the criterion (5.3) discussed in the previous Sect. 5.2.
where $\{u_t\}$ and $\{v_t\}$ denote autoregressive processes defined as $\Phi(L)u_t = w_t$ and $\Theta(L)v_t = w_t$ with $w_t \sim \mathrm{WN}(0,1)$.
It can be shown that both estimators are asymptotically efficient.⁵ Note that the asymptotic covariance matrix $V(\beta)$ is independent of $\sigma^2$.
The use of the Gaussian likelihood function makes sense even when the process is not Gaussian. First, the Gaussian likelihood can still be interpreted as a measure of fit of the ARMA model to the data. Second, the asymptotic distribution is still Gaussian even when the process is not Gaussian as long as $Z_t \sim \mathrm{IID}(0,\sigma^2)$. The Gaussian likelihood is then called the quasi Gaussian likelihood. The use of the Gaussian likelihood under this circumstance is, however, in general no longer efficient.
⁵ See Brockwell and Davis (1991) and Fan and Yao (2003) for details.
5.4 Estimation of the Orders p and q 99
In particular, we have:

\[
\mathrm{AR}(1):\quad \hat{\phi} \sim N\bigl(\phi, (1-\phi^2)/T\bigr),
\]
\[
\mathrm{AR}(2):\quad \begin{pmatrix}\hat{\phi}_1\\ \hat{\phi}_2\end{pmatrix} \sim N\left(\begin{pmatrix}\phi_1\\ \phi_2\end{pmatrix}, \frac{1}{T}\begin{pmatrix}1-\phi_2^2 & -\phi_1(1+\phi_2)\\ -\phi_1(1+\phi_2) & 1-\phi_2^2\end{pmatrix}\right).
\]

Similarly, one can compute the asymptotic distribution for an MA(q) process. In particular, we have:

\[
\mathrm{MA}(1):\quad \hat{\theta} \sim N\bigl(\theta, (1-\theta^2)/T\bigr),
\]
\[
\mathrm{MA}(2):\quad \begin{pmatrix}\hat{\theta}_1\\ \hat{\theta}_2\end{pmatrix} \sim N\left(\begin{pmatrix}\theta_1\\ \theta_2\end{pmatrix}, \frac{1}{T}\begin{pmatrix}1-\theta_2^2 & \theta_1(1-\theta_2)\\ \theta_1(1-\theta_2) & 1-\theta_2^2\end{pmatrix}\right).
\]
For an ARMA(1,1) process,

\[
V(\phi,\theta) = \begin{pmatrix}(1-\phi^2)^{-1} & (1+\phi\theta)^{-1}\\ (1+\phi\theta)^{-1} & (1-\theta^2)^{-1}\end{pmatrix}^{-1}.
\]

Therefore we have:

\[
\begin{pmatrix}\hat{\phi}\\ \hat{\theta}\end{pmatrix} \sim N\left(\begin{pmatrix}\phi\\ \theta\end{pmatrix}, \frac{1+\phi\theta}{T(\phi+\theta)^2}\begin{pmatrix}(1-\phi^2)(1+\phi\theta) & -(1-\phi^2)(1-\theta^2)\\ -(1-\phi^2)(1-\theta^2) & (1-\theta^2)(1+\phi\theta)\end{pmatrix}\right).
\]
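This covariance matrix can be evaluated for given parameter values (the $\phi$, $\theta$, and $T$ below are illustrative, not from the text); the asymptotic standard errors are the square roots of its diagonal:

```python
def arma11_acov(phi, theta, T):
    """Asymptotic covariance of (phi_hat, theta_hat)' for an ARMA(1,1)."""
    c = (1 + phi * theta) / (T * (phi + theta) ** 2)
    v_pp = (1 - phi ** 2) * (1 + phi * theta)
    v_tt = (1 - theta ** 2) * (1 + phi * theta)
    v_pt = -(1 - phi ** 2) * (1 - theta ** 2)
    return [[c * v_pp, c * v_pt], [c * v_pt, c * v_tt]]

V = arma11_acov(0.5, 0.3, 100)
```

Note that the matrix blows up as $\theta \to -\phi$, the parameter-redundancy case excluded by the no-common-roots assumption.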
Up to now we have always assumed that the true orders $p$ and $q$ of the ARMA model are known. This is, however, seldom the case in practice. As economic theory does usually not provide an indication, it is all too often the case that the orders of the ARMA model must be identified from the data. In such a situation one can make two types of errors: $p$ and $q$ are too large, in which case we speak of overfitting; $p$ and $q$ are too low, in which case we speak of underfitting.
In the case of overfitting, the maximum likelihood estimator is no longer consistent for the true parameters, but still consistent for the coefficients of the causal representation $\psi_j$, $j = 0, 1, 2, \ldots$, where $\Psi(z) = \Theta(z)/\Phi(z)$. This can be illustrated graphically.
For these reasons the identification of the orders is an important step. One method, which goes back to Box and Jenkins (1976), consists in the analysis of the autocorrelation function (ACF) and the partial autocorrelation function (PACF) (see Sect. 3.5). Although this method requires some experience, especially when the process is not a purely AR or MA process, the analysis of the ACF and PACF remains an important first step in every practical investigation of a time series.
An alternative procedure relies on automatic order selection. The objective is to minimize a so-called information criterion over different values of $p$ and $q$. These criteria are based on the following consideration. Given a fixed number of observations, the successive increase of the orders $p$ and $q$ increases the fit of the model so that the variance of the residuals $\hat{\sigma}^2_{p,q}$ steadily decreases. In order to compensate for this tendency to overfitting, a penalty is introduced. This penalty term depends on the number of free parameters and on the number of observations at hand.⁶ The most important information criteria have the following additive form:

\[
\ln\hat{\sigma}^2_{p,q} + \frac{C(T)}{T}\,(\#\text{ free parameters}) = \ln\hat{\sigma}^2_{p,q} + (p+q)\frac{C(T)}{T} \longrightarrow \min_{p,q},
\]

where $\ln\hat{\sigma}^2_{p,q}$ measures the goodness of fit of the ARMA(p,q) model and $(p+q)\frac{C(T)}{T}$ denotes the penalty term. Thereby $C(T)$ represents a nondecreasing function of $T$ which governs the trade-off between goodness of fit and complexity (dimension) of the model. Thus, the information criteria choose higher order models for larger sample sizes $T$. If the model includes a constant term or other exogenous variables, the criterion must be adjusted accordingly. However, this will introduce, for a given sample size, just a constant term in the objective function and will therefore not influence the choice of $p$ and $q$.
The most common criteria are the Akaike information criterion (AIC), the Schwarz or Bayesian information criterion (BIC), and the Hannan-Quinn information criterion (HQ criterion):

\[
\mathrm{AIC}(p,q) = \ln\hat{\sigma}^2_{p,q} + (p+q)\frac{2}{T}
\]
\[
\mathrm{BIC}(p,q) = \ln\hat{\sigma}^2_{p,q} + (p+q)\frac{\ln T}{T}
\]
\[
\mathrm{HQC}(p,q) = \ln\hat{\sigma}^2_{p,q} + (p+q)\frac{2\ln(\ln T)}{T}
\]

Because AIC < HQC < BIC for a given sample size $T \geq 16$, Akaike's criterion delivers the largest models, i.e. the highest order $p + q$; the Bayesian criterion is more restrictive and delivers therefore the smallest models, i.e. the lowest $p + q$.
⁶ See Brockwell and Davis (1991) for details and a deeper appreciation.

Although Akaike's criterion is not consistent with respect to p and q and has a
tendency to deliver overfitted models, it is still widely used in practice. This feature
is sometimes desired as overfitting is seen as less damaging than underfitting.7 Only
the BIC and HQC lead to consistent estimates of the orders p and q.
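The additive form of the three criteria can be sketched in a few lines of code (a hypothetical illustration; the toy values of T and $\hat\sigma^2_{p,q}$ are ours, not taken from the book):

```python
import math

def aic(sigma2_hat, p, q, T):
    # Akaike: C(T) = 2
    return math.log(sigma2_hat) + (p + q) * 2 / T

def bic(sigma2_hat, p, q, T):
    # Schwarz/Bayes: C(T) = ln T
    return math.log(sigma2_hat) + (p + q) * math.log(T) / T

def hqc(sigma2_hat, p, q, T):
    # Hannan-Quinn: C(T) = 2 ln(ln T)
    return math.log(sigma2_hat) + (p + q) * 2 * math.log(math.log(T)) / T

# for T >= 16 the penalties are ordered 2 < 2 ln(ln T) < ln T, hence AIC < HQC < BIC
T, s2 = 100, 0.5
assert aic(s2, 2, 1, T) < hqc(s2, 2, 1, T) < bic(s2, 2, 1, T)
```

Because the fit term $\ln\hat\sigma^2_{p,q}$ is identical across criteria, only the penalty ordering drives the different model choices.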
⁷ See, for example, the section on unit root tests (Sect. 7.3).
⁸ An exact definition will be provided in Chap. 7. In this chapter we will analyze the consequences
of non-stationarity and discuss tests for specific forms of non-stationarity.
Having achieved stationarity, one has to find the appropriate orders p and q of the
ARMA model. Thereby one can rely either on the analysis of the ACF and the PACF,
or on the information criteria outlined in the previous Sect. 5.4.
After having identified a particular model or a set of models, one has to inspect their
adequacy. There are several dimensions along which the model(s) can be checked.
(i) Are the residuals white noise? This can be checked by inspecting the ACF
of the residuals and by applying the Ljung-Box test (4.4). If they are not, the
model has failed to capture all the dynamics inherent in the data.
(ii) Are the parameters plausible?
(iii) Are the parameters constant over time? Are there structural breaks? This can
be checked by looking at the residuals or by comparing parameter estimates across
subsamples. More systematic approaches are discussed in Perron (2006). These
involve the rolling estimation of parameters, allowing the break point
to vary over the sample. Thereby different types of structural breaks can be
distinguished. A more in-depth analysis of structural breaks is presented in
Sect. 18.1.
(iv) Does the model deliver sensible forecasts? It is particularly useful to investi-
gate the out-of-sample forecasting performance. If one has several candidate
models, one can perform a horse-race among them.
In case the model turns out to be unsatisfactory, one has to go back to steps 1
and 2.
5.6 Modeling Real GDP of Switzerland

This section illustrates the concepts and ideas just presented by working out a
specific example. We take the seasonally unadjusted Swiss real GDP as an example.
The data are plotted in Fig. 1.3. To take the seasonality into account we transform
the logged time series by taking first seasonal differences, i.e. $X_t = (1 - L^4)\ln \mathrm{GDP}_t$.
Thus, the variable corresponds to the growth rate with respect to the same quarter of the
previous year. The data are plotted in Fig. 5.2. A cursory inspection of the plot
reveals that this transformation eliminated the trend as well as the seasonality.
First we analyze the ACF and the PACF. They are plotted together with
corresponding confidence intervals in Fig. 5.3. The slowly monotonically declining
ACF suggests an AR process. As only the first two orders of the PACF are
significantly different from zero, it seems that an AR(2) model is appropriate. The
least-squares estimates of this model are:
[Fig. 5.2: Growth rate of Swiss real GDP in percent, 1985–2010]
The numbers in parentheses are the estimated standard errors of the corresponding
parameters above. The roots of the AR polynomial are 1.484 and 2.174. They
are clearly outside the unit circle so that there exists a stationary and causal
representation.
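The least-squares estimation of an AR(2) model is an ordinary regression of $X_t$ on its first two lags, and the causality check amounts to verifying that the roots of the AR polynomial lie outside the unit circle. A minimal sketch on simulated data (the coefficients 0.9 and −0.2 are illustrative, not the Swiss GDP estimates):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000
z = rng.standard_normal(T)
x = np.zeros(T)
# simulate a causal AR(2): X_t = 0.9 X_{t-1} - 0.2 X_{t-2} + Z_t
for t in range(2, T):
    x[t] = 0.9 * x[t - 1] - 0.2 * x[t - 2] + z[t]

# least squares: regress X_t on X_{t-1} and X_{t-2}
Y = x[2:]
X = np.column_stack([x[1:-1], x[:-2]])
phi, *_ = np.linalg.lstsq(X, Y, rcond=None)

# roots of the AR polynomial 1 - phi1*z - phi2*z^2 must lie outside the unit circle
roots = np.roots([-phi[1], -phi[0], 1.0])
assert all(abs(r) > 1 for r in roots)
```

With the true coefficients above, the exact roots are 2.0 and 2.5, so the estimated roots should land clearly outside the unit circle.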
Next, we investigate the information criteria AIC and BIC to identify the orders
of the ARMA(p,q) model. We examine all models with 0 ≤ p, q ≤ 4. The AIC and
the BIC values are reported in Tables 5.1 and 5.2. Both criteria reach a minimum
at (p, q) = (1, 3) (bold numbers), so that both criteria prefer an ARMA(1,3) model.
The parameters of this model are as follows:
The estimated standard errors of the estimated parameters are again reported in
parentheses below. The AR(2) model is not considerably worse than the ARMA(1,3)
model; according to the BIC criterion it is even the second-best model.
Fig. 5.3 Autocorrelation function (ACF) and partial autocorrelation function (PACF) of the real
GDP growth rates of Switzerland with 95% confidence intervals
The inverted roots of the AR- and the MA-polynomial are plotted together with
their corresponding 95% confidence regions in Fig. 5.4.⁹ As the confidence regions
are all inside the unit circle, the ARMA(1,3) model also has a stationary and causal
representation. Moreover, the estimated process is also invertible. In addition, the
roots of the AR- and the MA-polynomial are distinct.
⁹ The confidence regions are determined by the delta method (see Appendix E).

[Fig. 5.4: Inverted roots of the AR- and MA-polynomials of the ARMA(1,3) model with 95%
confidence regions]
The autocorrelation functions of the residuals from the AR(2) and the ARMA(1,3) model are
plotted in Fig. 5.5. They show no sign of significant autocorrelation, so that both
residual series are practically white noise. We can examine this hypothesis formally
with the Ljung-Box test (see Sect. 4.2, Eq. (4.4)). Taking N = 20, the values of the
test statistics are $Q'_{AR(2)} = 33.80$ and $Q'_{ARMA(1,3)} = 21.70$, respectively. The 5%
critical value of the $\chi^2_{20}$ distribution is 31.41. Thus the null hypothesis
$\rho(1) = \ldots = \rho(20) = 0$ is rejected for the AR(2) model, but not for the
ARMA(1,3) model. This implies that the AR(2) model does not capture the full
dynamics of the data.
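The Ljung-Box statistic itself is straightforward to implement (our own minimal version; N denotes the number of autocorrelations included, and the simulated AR(1) below is just an example of a clearly autocorrelated residual series):

```python
import numpy as np

def ljung_box(resid, N):
    # Q' = T (T+2) * sum_{h=1}^{N} rho_hat(h)^2 / (T - h)
    T = len(resid)
    e = resid - resid.mean()
    denom = np.sum(e ** 2)
    rho = np.array([np.sum(e[h:] * e[:-h]) / denom for h in range(1, N + 1)])
    return T * (T + 2) * np.sum(rho ** 2 / (T - np.arange(1, N + 1)))

rng = np.random.default_rng(1)
z = rng.standard_normal(500)          # white noise
x = np.empty(500)                     # AR(1) with phi = 0.8: not white noise
x[0] = z[0]
for t in range(1, 500):
    x[t] = 0.8 * x[t - 1] + z[t]

# under H0 (white noise) Q' is approximately chi^2 with N degrees of freedom;
# the 5% critical value for N = 20 is 31.41
assert ljung_box(x, 20) > 31.41       # strongly autocorrelated: H0 rejected
assert ljung_box(z, 20) < ljung_box(x, 20)
```
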
Although the AR(2) and the ARMA(1,3) model seem to be quite different at
first glance, they deliver similar impulse response functions, as can be gathered from
Fig. 5.6. In both models, the impact of the initial shock first builds up to values
higher than 1.1 in quarters one and two, respectively. Then the effect monotonically
declines to zero. After 10 to 12 quarters the effect of the shock has practically
dissipated.
As a final exercise, we use both models to forecast real GDP growth over the next
nine quarters, i.e. for the period fourth quarter 2003 to fourth quarter 2005. As can
Fig. 5.5 Autocorrelation function (ACF) of the residuals from the AR(2) and the ARMA(1,3)
model
Fig. 5.6 Impulse responses ψ_j of the AR(2) and the ARMA(1,3) model
[Fig. 5.7: Forecasts of quarterly real GDP growth (percent) from the AR(2) and the ARMA(1,3)
model together with the observations, 2001Q1–2006Q1]
be seen from Fig. 5.7, both models predict that the Swiss economy should move out
of recession in the coming quarters. However, the ARMA(1,3) model indicates that
the recovery takes place more quickly, with growth overshooting its long-run
mean of 1.3% in about a year. The forecast of the AR(2) model predicts a steadier
approach to the long-run mean.
6 Spectral Analysis and Linear Filters

¹ The use of spectral methods in the natural sciences can be traced back many centuries. The modern
statistical approach builds on the work of N. Wiener, G. U. Yule, J. W. Tukey, and many others.
See the interesting survey by Robinson (1982).
From a mathematical point of view, the equivalence between time and frequency
domain analysis rests on the theory of Fourier series. An adequate treatment
of this theory is beyond the scope of this book. The interested reader may
consult Brockwell and Davis (1991, Chapter 4). An introduction to the underlying
mathematical theory can be found in standard textbooks like Rudin (1987).
is called the spectral density function or spectral density of $\{X_t\}$. Thereby $\imath$ denotes
the imaginary unit (see Appendix A).
The sine is an odd function whereas the cosine and the autocovariance function
are even functions.² This implies that the spectral density can be rewritten as:

$$f(\lambda) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty}\gamma(h)\bigl(\cos(\lambda h) - \imath\sin(\lambda h)\bigr)
= \frac{1}{2\pi}\sum_{h=-\infty}^{\infty}\gamma(h)\cos(\lambda h) + 0
= \frac{\gamma(0)}{2\pi} + \frac{1}{\pi}\sum_{h=1}^{\infty}\gamma(h)\cos(\lambda h). \tag{6.2}$$
² A function f is called even if f(−x) = f(x); the function is called odd if f(−x) = −f(x). Thus,
we have sin(−λ) = −sin(λ) and cos(−λ) = cos(λ).
Because f is periodic with period 2π, it is sufficient to consider the spectral density only on the
interval (−π, π]. As the cosine is an even function, so is f. Thus, we restrict the analysis of the
spectral density f(λ) further to the domain λ ∈ [0, π].
In practice, we often use the period or oscillation length instead of the radian frequency λ.
They are related by the formula:

$$\text{period length} = \frac{2\pi}{\lambda}. \tag{6.3}$$

If, for example, the data are quarterly observations, a value of 0.3 for λ corresponds
to a period length of approximately 21 quarters.
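Equation (6.3) is trivial to check numerically (a minimal sketch):

```python
import math

def period_length(lam):
    # period length = 2*pi / lambda   (Eq. 6.3)
    return 2 * math.pi / lam

# quarterly data: lambda = 0.3 corresponds to a cycle of about 21 quarters
assert round(period_length(0.3)) == 21
# lambda = pi/2 corresponds to a cycle of 4 quarters, i.e. one year
assert abs(period_length(math.pi / 2) - 4) < 1e-9
```
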
• Because $f(0) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty}\gamma(h)$, the long-run variance J of $\{X_t\}$ (see Sect. 4.4)
equals 2πf(0), i.e. 2πf(0) = J.
• f is an even function so that f(λ) = f(−λ).
• f(λ) ≥ 0 for all λ ∈ (−π, π]. The proof of this proposition can be found
in Brockwell and Davis (1996, Chapter 4). This property corresponds to the
non-negative definiteness of the autocovariance function (see property 4 in
Theorem 1.1 of Sect. 1.3).
• The single autocovariances are the Fourier coefficients of the spectral density f:

$$\gamma(h) = \int_{-\pi}^{\pi} e^{\imath\lambda h} f(\lambda)\,d\lambda = \int_{-\pi}^{\pi} \cos(\lambda h)\,f(\lambda)\,d\lambda.$$

For h = 0, we therefore get $\gamma(0) = \int_{-\pi}^{\pi} f(\lambda)\,d\lambda$.
The last property allows us to compute the autocovariances from a given spectral
density. It shows how time and frequency domain analysis are related to each other
and how a property in one domain is reflected as a property in the other.
These properties of a non-negative definite function can be used to characterize
the spectral density of a stationary process $\{X_t\}$ with autocovariance function γ.
Theorem 6.1 (Properties of a Spectral Density). A function f defined on (−π, π]
is the spectral density of a stationary process if and only if the following properties
hold:
• f(λ) = f(−λ);
• f(λ) ≥ 0;
• $\int_{-\pi}^{\pi} f(\lambda)\,d\lambda < \infty$.
Some Examples
MA(1): Consider the function γ with γ(0) = 1, γ(±1) = ρ, and γ(h) = 0 for |h| ≥ 2.
Inserting it into equation (6.1) yields

$$f(\lambda) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty}\gamma(h)e^{-\imath\lambda h}
= \frac{\rho e^{\imath\lambda} + 1 + \rho e^{-\imath\lambda}}{2\pi}
= \frac{1 + 2\rho\cos\lambda}{2\pi}.$$

Thus, f(λ) ≥ 0 if and only if |ρ| ≤ 1/2. According to Corollary 6.1 above, γ
is the autocovariance function of a stationary stochastic process if and only if
|ρ| ≤ 1/2. This condition corresponds exactly to the condition derived in the time
domain (see Sect. 1.3). The spectral density for ρ = 0.4, or equivalently θ = 0.5,
respectively for ρ = −0.4, or equivalently θ = −0.5, and σ² = 1 is plotted in
Fig. 6.1a. As the process is rather smooth when the first-order autocorrelation is
positive, the spectral density is large in the neighborhood of zero and small in the
neighborhood of π. For a negative autocorrelation the picture is just reversed.
AR(1): The spectral density of an AR(1) process $X_t = \phi X_{t-1} + Z_t$ with $Z_t \sim \mathrm{WN}(0, \sigma^2)$ is:

$$f(\lambda) = \frac{\gamma(0)}{2\pi}\left(1 + \sum_{h=1}^{\infty}\phi^h\bigl(e^{\imath\lambda h} + e^{-\imath\lambda h}\bigr)\right)
= \frac{\sigma^2}{2\pi(1-\phi^2)}\left(1 + \frac{\phi e^{\imath\lambda}}{1-\phi e^{\imath\lambda}} + \frac{\phi e^{-\imath\lambda}}{1-\phi e^{-\imath\lambda}}\right)
= \frac{\sigma^2}{2\pi}\,\frac{1}{1 - 2\phi\cos\lambda + \phi^2}$$

The spectral densities for φ = 0.6 and φ = −0.6 with σ² = 1 are plotted
in Fig. 6.1b. As the process with φ = 0.6 exhibits a relatively large positive
autocorrelation so that it is rather smooth, the spectral density takes large values
at low frequencies. In contrast, the process with φ = −0.6 is rather volatile
due to the negative first-order autocorrelation. Thus, high frequencies are more
important than low frequencies, as reflected in the corresponding figure.
Note that, as φ approaches one, the spectral density at frequency zero tends to
infinity. This can be interpreted in the following way.
As the process gets closer to a random walk, more and more weight is given to
long-run fluctuations (cycles with very low frequency or very high periodicity)
(Granger 1966).
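The closed form just derived can be verified numerically: integrating f over (−π, π] must recover γ(0) = σ²/(1 − φ²). A sketch using a simple trapezoidal rule (our own illustration):

```python
import numpy as np

def f_ar1(lam, phi, sigma2=1.0):
    # f(lambda) = sigma^2 / (2*pi*(1 - 2*phi*cos(lambda) + phi^2))
    return sigma2 / (2 * np.pi * (1 - 2 * phi * np.cos(lam) + phi ** 2))

lam = np.linspace(0.0, np.pi, 2001)
f_pos, f_neg = f_ar1(lam, 0.6), f_ar1(lam, -0.6)

# phi = 0.6: mass concentrated at low frequencies; phi = -0.6: at high frequencies
assert f_pos[0] > f_pos[-1] and f_neg[-1] > f_neg[0]

# integrating f over (-pi, pi] recovers gamma(0) = sigma^2 / (1 - phi^2);
# f is even in lambda, so integrate over [0, pi] and double
dl = lam[1] - lam[0]
gamma0 = 2 * np.sum(0.5 * (f_pos[1:] + f_pos[:-1])) * dl
assert abs(gamma0 - 1 / (1 - 0.6 ** 2)) < 1e-3
```
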
Consider the simple harmonic process fXt g which just consists of a cosine and a sine
wave:
Fig. 6.1 Examples of spectral densities with $Z_t \sim \mathrm{WN}(0, 1)$. (a) MA(1) process. (b) AR(1)
process
where

$$F(\lambda) = \begin{cases} 0, & \lambda < -\omega;\\ 1/2, & -\omega \le \lambda < \omega;\\ 1, & \lambda \ge \omega. \end{cases} \tag{6.5}$$
The integral with respect to the discrete distribution function is a so-called Riemann-
Stieltjes integral.³ F is a step function with jumps at −ω and ω and step size 1/2,
so that the above integral equals $\frac{1}{2}e^{\imath h\omega} + \frac{1}{2}e^{-\imath h\omega} = \cos(h\omega)$.
These considerations lead to a representation, called the spectral representation,
of the autocovariance function as the Fourier transform of a distribution function over
[−π, π].
³ The Riemann-Stieltjes integral is a generalization of the Riemann integral. Let f and g be two
bounded functions defined on the interval [a, b]; then the Riemann-Stieltjes integral $\int_a^b f(x)\,dg(x)$ is
defined as $\lim_{n\to\infty}\sum_{i=1}^{n} f(\xi_i)\bigl[g(x_i) - g(x_{i-1})\bigr]$ where $a = x_1 < x_2 < \cdots < x_{n-1} < x_n = b$. For
g(x) = x we obtain the standard Riemann integral. If g is a step function with a countable number
of steps $x_i$ of height $h_i$ then $\int_a^b f(x)\,dg(x) = \sum_i f(x_i)h_i$.
Remark 6.2. If the spectral distribution function F has a density f such that $F(\lambda) = \int_{-\pi}^{\lambda} f(\omega)\,d\omega$,
then f is called the spectral density and the time series is said to have a
continuous spectrum.
$$X_t = \sum_{j=1}^{k}\bigl(A_j\cos(\omega_j t) + B_j\sin(\omega_j t)\bigr), \qquad 0 < \omega_1 < \cdots < \omega_k < \pi \tag{6.7}$$

with

$$F_j(\lambda) = \begin{cases} 0, & \lambda < -\omega_j;\\ 1/2, & -\omega_j \le \lambda < \omega_j;\\ 1, & \lambda \ge \omega_j. \end{cases}$$
This generalization points to the following properties:
⁴ A mathematically precise statement is given in Brockwell and Davis (1991, Chapter 4), where also
the notion of stochastic integration is explained.
In general, we have:

$$dF(\lambda) = \begin{cases} F(\lambda) - F(\lambda-),⁵ & \text{discrete spectrum;}\\ f(\lambda)\,d\lambda, & \text{continuous spectrum.} \end{cases}$$
A simple estimator of the spectral density, $\hat f_T(\lambda)$, can be obtained by replacing in
the defining equation (6.1) the theoretical autocovariances γ by their estimates $\hat\gamma$.
However, instead of a simple sum, we consider a weighted sum:

$$\hat f_T(\lambda) = \frac{1}{2\pi}\sum_{|h|\le\ell_T} k\!\left(\frac{h}{\ell_T}\right)\hat\gamma(h)\,e^{-\imath\lambda h}. \tag{6.9}$$
The weighting function k, also known as the lag window, is assumed to have
exactly the same properties as the kernel function introduced in Sect. 4.4. This
correspondence is not accidental: indeed, the long-run variance defined in Eq. (4.1)
is just 2π times the spectral density evaluated at λ = 0. Thus, one might choose
a weighting, kernel, or lag window from Table 4.1, like the Bartlett window, and
use it to estimate the spectral density. The lag truncation parameter $\ell_T$ is chosen in
such a way that $\ell_T \to \infty$ as $T \to \infty$. The rate of divergence should, however, be
smaller than T so that $\ell_T/T$ approaches zero as T goes to infinity. As an estimator of
the autocovariances one uses the estimator given in Eq. (4.2) of Sect. 4.2.
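A sketch of the indirect estimator (6.9) with the Bartlett lag window (our own minimal implementation, using the biased autocovariance estimator and checking it on white noise, whose true spectral density is σ²/(2π)):

```python
import numpy as np

def acovf_hat(x, max_lag):
    # gamma_hat(h) = (1/T) * sum_t (x_t - xbar)(x_{t+h} - xbar)
    T = len(x)
    xc = x - x.mean()
    return np.array([np.sum(xc[h:] * xc[:T - h]) / T for h in range(max_lag + 1)])

def indirect_estimate(x, lam, ell):
    # f_hat(lambda) = (1/2pi) sum_{|h| <= ell} k(h/ell) gamma_hat(h) e^{-i h lambda}
    g = acovf_hat(x, ell)
    h = np.arange(1, ell + 1)
    w = 1.0 - h / ell                       # Bartlett lag window k(x) = 1 - |x|
    return (g[0] + 2.0 * np.sum(w * g[1:] * np.cos(h * lam))) / (2 * np.pi)

rng = np.random.default_rng(3)
x = rng.standard_normal(5000)               # white noise: f(lambda) = 1/(2*pi)
ell = int(round(5000 ** (1 / 3)))           # ell_T grows, but ell_T / T -> 0
est = indirect_estimate(x, 1.0, ell)
assert abs(est - 1 / (2 * np.pi)) < 0.05
```
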
⁵ Thereby F(λ−) denotes the left-sided limit, i.e. $F(\lambda-) = \lim_{\omega\uparrow\lambda} F(\omega)$.

The above estimator is called an indirect spectral estimator because it requires
the estimation of the autocovariances in a first step. The periodogram provides an
alternative, direct spectral estimator. For this purpose, we represent the observations
as linear combinations of sinusoids of specific frequencies. These so-called Fourier
frequencies are defined as $\omega_k = \frac{2\pi k}{T}$, $k = -\lfloor\frac{T-1}{2}\rfloor, \ldots, \lfloor\frac{T}{2}\rfloor$, where $\lfloor x\rfloor$ denotes
the largest integer smaller than or equal to x. With this notation, the observations $x_t$,
$t = 1, \ldots, T$, can be represented as a sum of sinusoids:

$$x_t = \sum_{k=-\lfloor (T-1)/2\rfloor}^{\lfloor T/2\rfloor} a_k e^{\imath\omega_k t} = \sum_{k=-\lfloor (T-1)/2\rfloor}^{\lfloor T/2\rfloor} a_k\bigl(\cos(\omega_k t) + \imath\sin(\omega_k t)\bigr).$$
For each Fourier frequency $\omega_k$, the periodogram $I_T(\omega_k)$ equals $T|a_k|^2$. This implies
that

$$\sum_{t=1}^{T}|x_t|^2 = T\sum_{k=-\lfloor (T-1)/2\rfloor}^{\lfloor T/2\rfloor}|a_k|^2 = \sum_{k=-\lfloor (T-1)/2\rfloor}^{\lfloor T/2\rfloor} I_T(\omega_k).$$
Thus the periodogram represents, disregarding the proportionality factor 2π, the
sample analogue of the spectral density and therefore carries the same information.
Unfortunately, it turns out that the periodogram is not a consistent estimator of
the spectral density. In particular, the covariance between $I_T(\lambda_1)$ and $I_T(\lambda_2)$, $\lambda_1 \ne \lambda_2$,
goes to zero as T goes to infinity. The periodogram thus has a tendency to become
very jagged for large T, leading to the detection of spurious sinusoids. A way out of
this problem is to average the periodogram over neighboring frequencies, thereby
reducing its variance. This makes sense because the spectral density is relatively constant
within a small frequency band. The averaging (smoothing) of the periodogram over
neighboring frequencies leads to the class of discrete spectral average estimators,
which turn out to be consistent:
$$\hat f_T(\lambda) = \frac{1}{2\pi}\sum_{|h|\le\ell_T} K_T(h)\,I_T\!\left(\tilde\omega_{T,\lambda} + \frac{2\pi h}{T}\right) \tag{6.10}$$
where $\tilde\omega_{T,\lambda}$ denotes the multiple of $\frac{2\pi}{T}$ which is closest to λ. $\ell_T$ is the bandwidth
of the estimator, i.e. the number of ordinates over which the average is taken. $\ell_T$
satisfies the same properties as in the case of the indirect spectral estimator (6.9):
$\ell_T \to \infty$ and $\ell_T/T \to 0$ for $T \to \infty$. Thus, as T goes to infinity, on the one
hand the average is taken over more and more values, but on the other hand the
frequency band over which the average is taken gets smaller and smaller.
The spectral weighting function or spectral window $K_T$ is a positive even function
satisfying $\sum_{|h|\le\ell_T} K_T(h) = 1$ and $\sum_{|h|\le\ell_T} K_T^2(h) \to 0$ for $T \to \infty$. It can be
shown that under these conditions the discrete spectral average estimator is mean-
square consistent. Moreover, the estimator in Eq. (6.9) can be approximated by a
corresponding discrete spectral average estimator by defining the spectral window as

$$K_T(\omega) = \frac{1}{2\pi}\sum_{|h|\le\ell_T} k\!\left(\frac{h}{\ell_T}\right) e^{-\imath h\omega}$$

or vice versa

$$k\!\left(\frac{h}{\ell_T}\right) = \int_{-\pi}^{\pi} K_T(\omega)\,e^{\imath h\omega}\,d\omega.$$
Thus, the lag and the spectral window are related via the Fourier transform. For
details and the asymptotic distribution, the interested reader is referred to Brockwell
and Davis (1991, Chapter 10). Although the indirect and the direct estimator give
approximately the same result when the kernels used are related as in the equation
above, the direct estimator (6.10) is usually preferred in practice because it is,
especially for long time series, computationally more efficient, in particular in
connection with the fast Fourier transform (FFT).⁶
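A minimal sketch of the discrete spectral average estimator (our own implementation; the periodogram is scaled so that its mean matches the spectral density, and the equal-weight smoother anticipates the Daniell window discussed next):

```python
import numpy as np

def periodogram(x):
    # I_T at the Fourier frequencies omega_k = 2*pi*k/T, scaled so that its
    # average approximates the spectral density
    T = len(x)
    return np.abs(np.fft.fft(x - x.mean())) ** 2 / (2 * np.pi * T)

def smooth(I, ell):
    # discrete spectral average with equal weights K_T(h) = 1/(2*ell+1);
    # the periodogram is periodic, so pad circularly before averaging
    w = np.full(2 * ell + 1, 1.0 / (2 * ell + 1))
    padded = np.concatenate([I[-ell:], I, I[:ell]])
    return np.convolve(padded, w, mode="valid")

rng = np.random.default_rng(7)
x = rng.standard_normal(2048)        # white noise: true spectrum f = 1/(2*pi)
I = periodogram(x)
f_hat = smooth(I, ell=45)            # ell roughly sqrt(T)

# smoothing removes the jaggedness of the raw periodogram ...
assert f_hat.std() < I.std()
# ... while keeping the level close to the true spectral density
assert abs(f_hat.mean() - 1 / (2 * np.pi)) < 0.02
```
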
A simple spectral weighting function, known as the Daniell spectral window, is
given by $K_T(h) = (2\ell_T + 1)^{-1}$ for $|h| \le \ell_T$ and 0 otherwise, where $\ell_T = \sqrt{T}$.
It averages over $2\ell_T + 1$ values within a frequency band of approximate width $\frac{4\pi}{\sqrt{T}}$.
This function corresponds to the Daniell kernel function or Daniell lag window
$k(x) = \sin(\pi x)/(\pi x)$ for $|x| \le 1$ and zero otherwise (see Sect. 4.4). In practice, the
sample size is fixed and the researcher is faced with a trade-off between variance and
bias. On the one hand, a weighting function which averages over a wide frequency
band produces a smooth spectral density, but probably has a large bias because the
estimate of f(λ) depends on frequencies which are rather far away from λ. On the
other hand, a weighting function which averages only over a small frequency band
produces a small bias, but probably a large variance. It is thus advisable in practice
⁶ The FFT is seen as one of the most important numerical algorithms ever, as it allows a rapid
computation of the Fourier transform and its inverse. The FFT is widely used in digital signal
processing.

Fig. 6.2 Raw periodogram of a white noise time series ($X_t \sim \mathrm{WN}(0, 1)$, T = 200)
to work with alternative weighting functions and to choose the one which delivers a
satisfying balance between bias and variance.
The following two examples demonstrate the large variance of the periodogram.
The first example consists of 200 observations from a simulated white noise
time series with variance equal to one. Whereas the true spectrum is constant and equal to
one, the raw periodogram, i.e. the periodogram without smoothing, plotted in
Fig. 6.2 is quite erratic. However, it is obvious that by taking averages of adjacent
frequencies the periodogram becomes smoother and more in line with the theoretical
spectrum. The second example consists of 200 observations of a simulated AR(2)
process. Figure 6.3 demonstrates again the jaggedness of the raw periodogram.
However, these erratic movements are distributed around the true spectrum. Thus,
by smoothing one can hope to get closer to the true spectrum and even detect the
dominant cycle with radian frequency equal to one. It is also clear that by smoothing over
too large a range, in the extreme over all frequencies, no cycle could be detected.
Figure 6.4 illustrates these considerations with real-life data by estimating the
spectral density of quarterly growth rates of real investment in construction for the
Swiss economy using alternative weighting functions. To obtain a better graphical
resolution we have plotted the estimates on a logarithmic scale. All three estimates
show a peak (local maximum) at the frequency λ = π/2. This corresponds to a wave
with a period of one year. The estimator with a comparably wide frequency band
(dotted line) smooths the minimum at λ = 1 away. The estimator with a comparably
small frequency band (dashed line), on the contrary, reveals additional waves with
frequencies λ = 0.75 and λ = 0.3 which correspond to periods of approximately two,
respectively five, years. Whether these waves are just artifacts of the weighting
function or whether there really exist cycles of that periodicity remains open.
Fig. 6.3 Raw periodogram of an AR(2) process ($X_t = 0.9X_{t-1} - 0.7X_{t-2} + Z_t$ with $Z_t \sim \mathrm{WN}(0, 1)$, T = 200)
Fig. 6.4 Non-parametric direct estimates of a spectral density with alternative weighting functions
$$f_X(\lambda) = \frac{\sigma^2}{2\pi}\,\frac{\bigl|\Theta(e^{-\imath\lambda})\bigr|^2}{\bigl|\Phi(e^{-\imath\lambda})\bigr|^2}, \qquad -\pi \le \lambda \le \pi. \tag{6.11}$$

Proof. $\{X_t\}$ is generated by applying the linear filter Ψ(L) with transfer function
$\Psi(e^{-\imath\lambda}) = \frac{\Theta(e^{-\imath\lambda})}{\Phi(e^{-\imath\lambda})}$ to $\{Z_t\}$ (see Sect. 6.4). Formula (6.11) is then an immediate
consequence of Theorem 6.5 because the spectral density of $\{Z_t\}$ is equal to $\frac{\sigma^2}{2\pi}$. ⊔⊓
Remark 6.4. As the spectral density of an ARMA process fXt g is given by a quotient
of trigonometric functions, the process is said to have a rational spectral density.
For an AR(2) process, $X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + Z_t$, this yields

$$f_X(\lambda) = \frac{\sigma^2}{2\pi\bigl(1+\phi_1^2+\phi_2^2+2\phi_2+2\phi_1(\phi_2-1)\cos\lambda-4\phi_2\cos^2\lambda\bigr)}.$$

For an ARMA(1,1) process, $X_t = \phi X_{t-1} + Z_t + \theta Z_{t-1}$, one obtains

$$f_X(\lambda) = \frac{\sigma^2\,(1+\theta^2+2\theta\cos\lambda)}{2\pi\,(1+\phi^2-2\phi\cos\lambda)}.$$
Fig. 6.5 Comparison of nonparametric and parametric (AR(4) model) estimates of the spectral
density of the growth rate of investment in the construction sector
$$Y_t = \Psi(L)X_t = \sum_{j=-\infty}^{\infty}\psi_j X_{t-j} \qquad\text{with}\qquad \sum_{j=-\infty}^{\infty}|\psi_j| < \infty.$$
Remark 6.5. Time-invariance in this context means that the lagged process $\{Y_{t-s}\}$
is obtained for all s ∈ ℤ from $\{X_{t-s}\}$ by applying the same filter Ψ.

Remark 6.6. MA processes, causal AR processes, and causal ARMA processes can
be viewed as filtered white noise processes.
with $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$ is also a mean-zero stationary process with autocovariance
function $\gamma_Y$. Thereby the two autocovariance functions are related as follows:

$$\gamma_Y(h) = \sum_{j=-\infty}^{\infty}\sum_{k=-\infty}^{\infty}\psi_j\psi_k\,\gamma_X(h+k-j), \qquad h = 0, \pm 1, \pm 2, \ldots$$
Proof. We first show the existence of the output process $\{Y_t\}$. To this end, consider
the sequence of random variables $\{Y_t^{(m)}\}_{m=1,2,\ldots}$ defined as

$$Y_t^{(m)} = \sum_{j=-m}^{m}\psi_j X_{t-j}.$$
To show that the limit for $m \to \infty$ exists in the mean square sense, it is, according
to Theorem C.6, enough to verify the Cauchy criterion

$$\mathbb{E}\bigl|Y_t^{(m)} - Y_t^{(n)}\bigr|^2 \to 0, \qquad \text{for } m, n \to \infty.$$

Taking without loss of generality m > n, Minkowski's inequality (see Theorem C.2,
or the triangle inequality) leads to

$$\left(\mathbb{E}\Bigl|\sum_{j=-m}^{m}\psi_j X_{t-j} - \sum_{j=-n}^{n}\psi_j X_{t-j}\Bigr|^2\right)^{1/2}
\le \left(\mathbb{E}\Bigl|\sum_{j=n+1}^{m}\psi_j X_{t-j}\Bigr|^2\right)^{1/2} + \left(\mathbb{E}\Bigl|\sum_{j=-m}^{-n-1}\psi_j X_{t-j}\Bigr|^2\right)^{1/2}
\le \gamma_X(0)^{1/2}\left(\sum_{j=n+1}^{m}|\psi_j| + \sum_{j=-m}^{-n-1}|\psi_j|\right).$$
As $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$ by assumption, the last term converges to zero. Thus, the limit
of $\{Y_t^{(m)}\}$ for $m \to \infty$, denoted by $S_t$, exists in the mean square sense with $\mathbb{E}S_t^2 < \infty$.
It remains to show that $S_t$ and $\sum_{j=-\infty}^{\infty}\psi_j X_{t-j}$ are actually equal with probability
one. This is established by noting that

$$\mathbb{E}\,|S_t - \Psi(L)X_t|^2 = \mathbb{E}\,\liminf_{m\to\infty}\Bigl|S_t - \sum_{j=-m}^{m}\psi_j X_{t-j}\Bigr|^2
\le \liminf_{m\to\infty}\mathbb{E}\,\Bigl|S_t - \sum_{j=-m}^{m}\psi_j X_{t-j}\Bigr|^2 = 0.$$

The autocovariance function of $\{Y_t\}$ then follows as

$$\gamma_Y(h) = \mathbb{E}\,Y_t Y_{t-h} = \sum_{j=-\infty}^{\infty}\sum_{k=-\infty}^{\infty}\psi_j\psi_k\,\gamma_X(h+k-j).$$

Thus, $\mathbb{E}Y_t$ and $\mathbb{E}Y_tY_{t-h}$ are finite and independent of t. $\{Y_t\}$ is therefore stationary. ⊔⊓
Corollary 6.2. If $X_t \sim \mathrm{WN}(0, \sigma^2)$ and $Y_t = \sum_{j=0}^{\infty}\psi_j X_{t-j}$ with $\sum_{j=0}^{\infty}|\psi_j| < \infty$,
then the above expression for $\gamma_Y(h)$ simplifies to

$$\gamma_Y(h) = \sigma^2\sum_{j=0}^{\infty}\psi_j\psi_{j+|h|}.$$
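For a filter with finitely many coefficients, the corollary can be checked by direct computation (the coefficients below are illustrative; the filtered process is then just an MA(2) in the white noise $X_t$):

```python
import numpy as np

sigma2 = 2.0
psi = np.array([1.0, 0.5, 0.25])   # a short causal filter (illustrative values)

def gamma_Y(h):
    # gamma_Y(h) = sigma^2 * sum_{j >= 0} psi_j * psi_{j + |h|}
    h = abs(h)
    return sigma2 * float(np.dot(psi[:len(psi) - h], psi[h:])) if h < len(psi) else 0.0

assert np.isclose(gamma_Y(0), sigma2 * (1 + 0.25 + 0.0625))
assert np.isclose(gamma_Y(1), sigma2 * (0.5 + 0.125))
assert np.isclose(gamma_Y(-1), gamma_Y(1))   # autocovariances are even in h
assert gamma_Y(3) == 0.0                     # an MA(2): zero beyond lag 2
```
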
Remark 6.1. In the proof of the existence of $\{Y_t\}$, the assumption of the stationarity
of $\{X_t\}$ can be weakened by assuming only $\sup_t \mathbb{E}X_t^2 < \infty$.
Theorem 6.5. Under the conditions of Theorem 6.4, the spectral densities of $\{X_t\}$
and $\{Y_t\}$ are related as

$$f_Y(\lambda) = \bigl|\Psi(e^{-\imath\lambda})\bigr|^2 f_X(\lambda) = \Psi(e^{\imath\lambda})\,\Psi(e^{-\imath\lambda})\,f_X(\lambda)$$

where $\Psi(e^{-\imath\lambda}) = \sum_{j=-\infty}^{\infty}\psi_j e^{-\imath\lambda j}$. $\Psi(e^{-\imath\lambda})$ is called the transfer function of the
filter.
To understand the effect of the filter Ψ, consider the simple harmonic process
$X_t = 2\cos(\lambda t) = e^{\imath\lambda t} + e^{-\imath\lambda t}$. Passing $\{X_t\}$ through the filter Ψ leads to a
transformed time series $\{Y_t\}$ defined as

$$Y_t = 2\bigl|\Psi(e^{-\imath\lambda})\bigr|\cos\!\left(\lambda\Bigl(t - \frac{\theta(\lambda)}{\lambda}\Bigr)\right),$$

where $\theta(\lambda) = -\arg\Psi(e^{-\imath\lambda})$. The filter therefore amplifies some frequencies by the
factor $g(\lambda) = \bigl|\Psi(e^{-\imath\lambda})\bigr|$ and delays $X_t$ by $\frac{\theta(\lambda)}{\lambda}$ periods. Thus, we have a change in
amplitude given by the amplitude gain function g(λ) and a phase shift given by the
phase gain function θ(λ). If the gain function is bigger than one, the corresponding
frequency is amplified. On the other hand, if the value is smaller than one, the
corresponding frequency is dampened.
Examples of Filters
• First differences: Ψ(L) = Δ = 1 − L.
The transfer function of this filter is $1 - e^{-\imath\lambda}$ and the squared gain function is
$2(1 - \cos\lambda)$. These functions take the value zero for λ = 0. Thus, the filter eliminates
the trend, which can be considered as a wave with an infinite period length.
• Change with respect to the same quarter last year, assuming that the data are
quarterly observations: Ψ(L) = 1 − L⁴.
The transfer function and the squared gain function are $1 - e^{-4\imath\lambda}$ and $2(1 - \cos(4\lambda))$,
respectively. Thus, the filter eliminates all frequencies which are multiples of π/2,
including the zero frequency. In particular, it eliminates the trend and waves with a
periodicity of four quarters.
• A famous example of a filter which led to wrong conclusions is the Kuznets
filter (see Sargent 1987, 273–276). Assuming yearly data, this filter is obtained
from two transformations carried out in a row. The first transformation, which
should eliminate cyclical movements, takes centered five-year moving averages.
The second one takes centered non-overlapping first differences. Thus, the filter
can be written as:

$$\Psi(L) = \underbrace{\tfrac{1}{5}\bigl(L^{-2} + L^{-1} + 1 + L + L^{2}\bigr)}_{\text{first transformation}}\;\underbrace{\bigl(L^{-5} - L^{5}\bigr)}_{\text{second transformation}}.$$
[Fig. 6.6: Transfer function of the Kuznets filter]
Figure 6.6 gives a plot of the transfer function of the Kuznets filter. Thereby it
can be seen that all frequencies are dampened, except those around λ = 0.2886.
The value λ = 0.2886 corresponds to a wave with a periodicity of approximately
2π/0.2886 = 21.77 ≈ 22 years. Thus, as first claimed by Howrey (1968), even a
filtered white noise time series would exhibit a 22-year cycle. This demonstrates
that cycles of this length, related by Kuznets (1930) to demographic processes
and infrastructure investment swings, may just be an artefact produced by the
filter and are therefore not supported by the data.
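The gain functions of the filters above are easy to evaluate numerically; in particular, the spurious 22-year Kuznets cycle can be reproduced directly from the product of the two transformations' transfer-function moduli (a sketch; the helper `gain` is ours):

```python
import numpy as np

def gain(coeffs, lags, lam):
    # |Psi(e^{-i*lambda})| for a (possibly two-sided) filter Psi(L) = sum_j psi_j L^{lags_j}
    return np.abs(sum(c * np.exp(-1j * l * lam) for c, l in zip(coeffs, lags)))

lam = np.linspace(1e-4, np.pi, 4000)

# first differences 1 - L and seasonal differences 1 - L^4
g_diff = gain([1.0, -1.0], [0, 1], lam)
g_seas = gain([1.0, -1.0], [0, 4], lam)
assert np.allclose(g_diff ** 2, 2 * (1 - np.cos(lam)))          # squared gain 2(1 - cos)
assert gain([1, -1], [0, 4], np.array([np.pi / 2]))[0] < 1e-12  # kills the 4-quarter wave

# Kuznets filter: five-year moving average times L^{-5} - L^5
g_kuznets = (gain([0.2] * 5, [-2, -1, 0, 1, 2], lam)
             * gain([1.0, -1.0], [-5, 5], lam))
peak = lam[np.argmax(g_kuznets)]
assert abs(peak - 0.2886) < 0.03   # spurious cycle of 2*pi/0.2886 ~ 22 years
assert g_kuznets.max() > 1.5       # the peak is amplified, not dampened
```
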
The Hodrick-Prescott filter (HP filter) has gained great popularity in the macroeco-
nomic literature, particularly in the context of real business cycle theory. This
high-pass filter is designed to eliminate the trend and cycles of high periodicity and
to emphasize movements at business cycle frequencies (see Hodrick and Prescott
1980; King and Rebelo 1993; Brandner and Neusser 1992).
One way to introduce the HP filter is to examine the problem of decompos-
ing a time series $\{X_t\}$ additively into a growth component $\{G_t\}$ and a cyclical
component $\{C_t\}$:

$$X_t = G_t + C_t.$$
This decomposition is, without further information, not unique. Following the
suggestion of Whittaker (1923), the growth component should be approximated by a
smooth curve. Based on this recommendation, Hodrick and Prescott suggest solving
the following restricted least-squares problem given a sample $\{X_t\}_{t=1,\ldots,T}$:

$$\sum_{t=1}^{T}(X_t - G_t)^2 + \lambda\sum_{t=2}^{T-1}\bigl[(G_{t+1} - G_t) - (G_t - G_{t-1})\bigr]^2 \;\longrightarrow\; \min_{\{G_t\}}.$$
The above objective function has two terms. The first one measures the fit of $\{G_t\}$ to
the data. The closer $\{G_t\}$ is to $\{X_t\}$, the smaller this term becomes. In the limit when
$G_t = X_t$ for all t, the term is minimized and equal to zero. The second term measures
the smoothness of the growth component by looking at the discrete analogue of the
second derivative. This term is minimized if the changes of the growth component
from one period to the next are constant. This, however, implies that $G_t$ is a linear
[Fig. 6.7: Transfer functions of the HP filter (λ = 1600), the ideal high-pass filter, and truncated
high-pass filters with q = 8 and q = 32]
function. Thus the above objective function represents a trade-off between fitting
the data and the smoothness of the approximating function. This trade-off is governed
by the meta-parameter λ which must be fixed a priori.
The value of λ depends on the critical frequency and on the periodicity of the
data (see Uhlig and Ravn 2002, for the latter). Following the proposal by Hodrick
and Prescott (1980), the following values for λ are common in the literature:

$$\lambda = \begin{cases} 6.25, & \text{yearly observations;}\\ 1600, & \text{quarterly observations;}\\ 14400, & \text{monthly observations.} \end{cases}$$
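The first-order conditions of the minimization problem above are linear in the $G_t$: stacking them gives $(I + \lambda D'D)G = X$, where D is the $(T-2)\times T$ second-difference matrix. A dense-matrix sketch (our own illustration; production code would exploit the band structure):

```python
import numpy as np

def hp_filter(x, lam=1600.0):
    # growth component solves (I + lam * D'D) G = X,
    # where D is the (T-2) x T second-difference matrix
    T = len(x)
    D = np.zeros((T - 2, T))
    for i in range(T - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    G = np.linalg.solve(np.eye(T) + lam * (D.T @ D), np.asarray(x, dtype=float))
    return G, x - G                      # growth and cyclical component

# a linear trend is already perfectly smooth, so its cyclical component is zero
t = np.arange(120, dtype=float)
G, C = hp_filter(2.0 + 0.5 * t, lam=1600.0)
assert np.max(np.abs(C)) < 1e-6
# the decomposition is additive: X_t = G_t + C_t
assert np.allclose(G + C, 2.0 + 0.5 * t)
```
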
It can be shown that these choices for λ practically eliminate waves of periodicity
longer than eight years. The cyclical or business cycle component is therefore
composed of waves with periodicity of less than eight years. Thus, the choice of λ
implicitly defines the business cycle. Figure 6.7 compares the transfer function of
the HP filter to the ideal high-pass filter and two approximate high-pass filters.⁷
As an example, Fig. 6.8 displays the HP-filtered US logged GDP together with
the original series in the upper panel and the implied business cycle component in
the lower panel.
⁷ As all filters do, the HP filter systematically distorts the properties of the time series. Harvey and
Jaeger (1993) show how the blind application of the HP filter can lead to the detection of spurious
cyclical behavior.
Fig. 6.8 HP-filtered US logged GDP (upper panel) and cyclical component (lower panel)
$$\Psi(L) = (1 + L + L^2 + L^3)/4$$

⁸ An exception to this view is provided by Miron (1996).
Fig. 6.9 Transfer function of the growth rate of investment in the construction sector with and
without seasonal adjustment
$$\Psi(L) = \tfrac{1}{8}L^{-2} + \tfrac{1}{4}L^{-1} + \tfrac{1}{4} + \tfrac{1}{4}L + \tfrac{1}{8}L^{2}.$$
In practice, the so-called X-11 filter or its enhanced versions, the X-12 and X-13
filters developed by the United States Census Bureau, are often applied. This filter is
a two-sided filter which makes, in contrast to the two examples above, use of all sample
observations. As this filter not only adjusts for seasonality but also corrects for
outliers, a blind mechanical use is not recommended. Gómez and Maravall (1996)
developed an alternative method known under the name TRAMO-SEATS. More
details on the implementation of both methods can be found in Eurostat (2009).
Figure 6.9 shows the effect of seasonal adjustment using TRAMO-SEATS by
looking at the corresponding transfer functions of the growth rate of construction
investment. One can clearly discern how the yearly and the half-yearly waves, corresponding to the frequencies π/2 and π, are dampened. On the other hand, the seasonal filter weakly amplifies a cycle of frequency 0.72, corresponding to a periodicity of about two years.
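That the symmetric filter Ψ(L) = (1/8)L⁻² + (1/4)L⁻¹ + 1/4 + (1/4)L + (1/8)L² given above removes exactly the seasonal frequencies can be verified directly from its weights. The numpy sketch below evaluates its transfer function and confirms that it vanishes at λ = π/2 and λ = π while leaving the zero frequency unchanged.

```python
import numpy as np

# Weights psi_j of the symmetric seasonal filter, lags j = -2, ..., 2.
lags = np.arange(-2, 3)
weights = np.array([1/8, 1/4, 1/4, 1/4, 1/8])

def transfer(lam):
    """Transfer function Psi(e^{-i*lam}) = sum_j psi_j * e^{-i*j*lam}."""
    return np.sum(weights * np.exp(-1j * lags * lam))

print(abs(transfer(0.0)))        # 1.0: the trend (frequency zero) passes unchanged
print(abs(transfer(np.pi / 2)))  # ~0: the yearly seasonal frequency is eliminated
print(abs(transfer(np.pi)))      # ~0: the half-yearly seasonal frequency is eliminated
```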
Whether or not to use filtered, especially seasonally adjusted, data is still an ongoing
debate. Although the use of unadjusted data together with a correctly specified
model is clearly the best choice, there is a nonnegligible uncertainty in modeling
economic time series. Thus, in practice one faces several trade-offs which must be taken into account and which may depend on the particular context (Sims 1974, 1993; Hansen and Sargent 1993). On the one hand, the use of adjusted data may disregard important information on the dynamics of the time series and introduce biases. On the other hand, the use of unadjusted data runs the risk of misspecification, especially because the usual measures of fit may put too much emphasis on fitting the seasonal frequencies, thereby neglecting other frequencies.
6.6 Exercises

Exercise 6.6.1.
(i) Show that the process defined in Eq. (6.4) has an autocovariance function equal to γ(h) = σ² cos(ωh).
(ii) Show that the process defined in Eq. (6.7) has autocovariance function

γ(h) = Σ_{j=1}^{k} σ_j² cos(ω_j h).

Exercise 6.6.2. Compute the transfer and the gain function for the following filters:
(i) Ψ(L) = 1 − L
(ii) Ψ(L) = 1 − L⁴
7 Integrated Processes

Up to now we have considered stationary processes of the form

X_t = μ + Ψ(L)Z_t,

where {Z_t} ∼ WN(0, σ²) and Σ_{j=0}^∞ ψ_j² < ∞. Typically, we model X_t as an ARMA process so that Ψ(L) = Θ(L)/Φ(L). This representation implies:
• EX_t = μ,
• lim_{h→∞} P_t X_{t+h} = μ.
The above property is often referred to as mean reverting because the process moves
around a constant mean. Deviations from this mean are only temporary or transitory.
Thus, the best long-run forecast is just the mean of the process.
This property is often violated by economic time series which typically show a tendency to grow. Classic examples are time series for GDP (see Fig. 1.3) or some stock market index (see Fig. 1.5). This trending property is not compatible with stationarity as the mean is no longer constant. In order to cope with this characteristic of economic time series, two very different alternatives have been proposed. The first one consists in letting the mean be a function of time, μ(t). The most popular specification for μ(t) is a linear function, i.e. μ(t) = α + δt. In this case we get:

X_t = α + δt + Ψ(L)Z_t,

where α + δt is the linear trend. The process {X_t} is then referred to as a trend-stationary process. In practice one also encounters quadratic polynomials of t or piecewise linear functions, for example μ(t) = α₁ + δ₁t for t ≤ t₀ and μ(t) = α₂ + δ₂t for t > t₀. In the following, we restrict ourselves to linear trend functions.
The second alternative assumes that the time series becomes stationary after differencing. The number of times one has to difference the process to achieve stationarity is called the order of integration. If differencing d times is necessary, the process is called integrated of order d, denoted by X_t ∼ I(d). If the resulting time series, Δ^d X_t = (1 − L)^d X_t, is an ARMA(p,q) process, the original process is called an ARIMA(p,d,q) process. Usually it is sufficient to difference the time series only once, i.e. d = 1. For expositional purposes we will stick to this case.
The formal definition of an I(1) process is given as follows.

Definition 7.1. The stochastic process {X_t} is called integrated of order one or difference-stationary, denoted as X_t ∼ I(1), if and only if ΔX_t = X_t − X_{t−1} can be represented as

ΔX_t = (1 − L)X_t = δ + Ψ(L)Z_t,   Ψ(1) ≠ 0,

with {Z_t} ∼ WN(0, σ²) and Σ_{j=0}^∞ j|ψ_j| < ∞.

The prototypical example is the random walk with drift δ:¹

X_t = δ + X_{t−1} + Z_t,   Z_t ∼ WN(0, σ²).

¹ Strictly speaking this does not conform to the definitions used in this book because our definition of ARMA processes assumes stationarity.
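A minimal numpy illustration of the definition (all parameter values are arbitrary choices): a random walk with drift is nonstationary in levels, but its first difference is white noise around δ.

```python
import numpy as np

rng = np.random.default_rng(42)
delta, sigma, T = 0.2, 1.0, 5000

Z = rng.normal(0.0, sigma, T)
X = np.cumsum(delta + Z)      # X_t = delta + X_{t-1} + Z_t with X_0 = 0

dX = np.diff(X)               # (1 - L)X_t = delta + Z_t
print(dX.mean())              # close to delta = 0.2
print(dX.std())               # close to sigma = 1.0
```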
7.1 Definition, Properties and Interpretation
The optimal forecast in the least-squares sense given the infinite past of a trend-stationary process is given by

P̃_t X_{t+h} = α + δ(t + h) + ψ_h Z_t + ψ_{h+1} Z_{t−1} + …

Thus we have

lim_{h→∞} E[P̃_t X_{t+h} − α − δ(t + h)]² = σ² lim_{h→∞} Σ_{j=0}^∞ ψ_{h+j}² = 0

because Σ_{j=0}^∞ ψ_j² < ∞. Thus the long-run forecast is given by the linear trend. Even if X_t deviates temporarily from the trend line, it is assumed to return to it. A trend-stationary process therefore behaves in the long-run like μ(t) = α + δt.
The forecast of the differenced series is

P̃_t ΔX_{t+h} = δ + ψ_h Z_t + ψ_{h+1} Z_{t−1} + ψ_{h+2} Z_{t−2} + …

so that

P̃_t X_{t+h} = P̃_t ΔX_{t+h} + P̃_t ΔX_{t+h−1} + … + P̃_t ΔX_{t+1} + X_t
  = δ + ψ_h Z_t + ψ_{h+1} Z_{t−1} + ψ_{h+2} Z_{t−2} + …
  + δ + ψ_{h−1} Z_t + ψ_h Z_{t−1} + ψ_{h+1} Z_{t−2} + …
  + δ + ψ_{h−2} Z_t + ψ_{h−1} Z_{t−1} + ψ_h Z_{t−2} + …
  + … + X_t
  = X_t + δh
  + (ψ_h + ψ_{h−1} + … + ψ_1) Z_t
  + (ψ_{h+1} + ψ_h + … + ψ_2) Z_{t−1}
  + …
This shows that also for the integrated process the long-run forecast depends on
a linear trend with slope ı. However, the intercept is no longer a fixed number,
but given by Xt which is stochastic. With each new realization of Xt the intercept
changes so that the trend line moves in parallel up and down. This issue can be well
illustrated by the following two examples.
Example 1. Let {X_t} be a random walk with drift δ. Then the best forecast of X_{t+h}, P_t X_{t+h}, is

P_t X_{t+h} = δh + X_t.

The forecast thus increases at rate δ starting from the initial value X_t; δ is therefore the slope of a linear trend. The intercept of this trend is stochastic and equal to X_t. Thus the trend line moves in parallel up or down depending on the realization of X_t.

Example 2. Let {X_t} be an ARIMA(0,1,1) process with ΔX_t = δ + Z_t + θZ_{t−1}. Then

P_t X_{t+h} = δh + X_t + θZ_t.

As before the intercept changes in a stochastic way, but in contrast to the previous example it is now given by X_t + θZ_t. If we consider the forecast given the infinite past, the invertibility of the process implies that Z_t can be expressed as a weighted sum of current and past realizations of ΔX_t (see Sects. 2.3 and 3.1).
For the trend-stationary process the forecast error is

X_{t+h} − P̃_t X_{t+h} = Z_{t+h} + ψ_1 Z_{t+h−1} + … + ψ_{h−1} Z_{t+1},

whereas for the integrated process it is

X_{t+h} − P̃_t X_{t+h} = Z_{t+h} + (1 + ψ_1) Z_{t+h−1} + … + (1 + ψ_1 + ψ_2 + … + ψ_{h−1}) Z_{t+1}.
The variance of this expression increases with the length of the forecast horizon h and is no longer bounded: it increases linearly in h to infinity.² The precision of the forecast therefore not only decreases with the forecasting horizon h, as in the case of the trend-stationary model, but converges to zero. In the example above of the ARIMA(0,1,1) process the forecast error variance is

E[(X_{t+h} − P_t X_{t+h})²] = (1 + (h − 1)(1 + θ)²) σ².

For the trend-stationary process the impulse response function satisfies

∂P̃_t X_{t+h} / ∂Z_t = ψ_h → 0   for h → ∞.

The effect of a shock thus declines with time and dies out. Shocks have therefore only transitory or temporary effects. In the case of an ARMA process the effect even declines exponentially (see the considerations in Sect. 2.3).³
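The forecast error variance formula for the ARIMA(0,1,1) example can be checked by simulation. The following numpy sketch (parameter values are arbitrary) compares the empirical variance of the h-step forecast error with (1 + (h − 1)(1 + θ)²)σ².

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, delta = 0.5, 1.0, 0.1
t0, h, nrep = 200, 10, 20000

# Simulate nrep paths of an ARIMA(0,1,1): dX_t = delta + Z_t + theta * Z_{t-1}.
Z = rng.normal(0.0, sigma, size=(nrep, t0 + h + 1))
dX = delta + Z[:, 1:] + theta * Z[:, :-1]
X = np.cumsum(dX, axis=1)      # column i holds X at "time" i, with Z_t = Z[:, i + 1]

# Optimal forecast made at time t0: P_t X_{t+h} = X_t + delta*h + theta*Z_t.
forecast = X[:, t0 - 1] + delta * h + theta * Z[:, t0]
errors = X[:, t0 - 1 + h] - forecast

emp_var = errors.var()
theo_var = (1 + (h - 1) * (1 + theta) ** 2) * sigma ** 2
print(emp_var, theo_var)       # the two numbers should be close
```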
In the case of integrated processes the impulse response function for X_t implies:

∂P̃_t X_{t+h} / ∂Z_t = 1 + ψ_1 + ψ_2 + … + ψ_h.

For h going to infinity, this expression converges to Σ_{j=0}^∞ ψ_j = Ψ(1) ≠ 0. This implies that a shock experienced in period t will have a long-run or permanent effect. This long-run effect is called persistence. If {X_t} is an ARMA process then the persistence is given by the expression

Ψ(1) = Θ(1)/Φ(1).

² Proof: By assumption {ψ_j} is absolutely summable so that Ψ(1) converges. Moreover, as Ψ(1) ≠ 0, there exist ε > 0 and an integer m such that |Σ_{j=0}^h ψ_j| > ε for all h > m. The squares are therefore bounded from below by ε² > 0 so that their infinite sum diverges to infinity.
³ The use of the partial derivative is just for convenience. It does not mean that X_{t+h} is differentiated in the literal sense.
⁴ Neusser (2000) shows how a Beveridge-Nelson decomposition can also be derived for higher order integrated processes.
X_t = X_0 + Σ_{j=1}^t {δ + [Ψ(1) + (L − 1)Ψ̃(L)] Z_j}
    = X_0 + δt + Ψ(1) Σ_{j=1}^t Z_j + (L − 1)Ψ̃(L) Σ_{j=1}^t Z_j
    = X_0 + δt + Ψ(1) Σ_{j=1}^t Z_j + Ψ̃(L)Z_0 − Ψ̃(L)Z_t,

where X_0 + δt is the linear trend, Ψ(1) Σ_{j=1}^t Z_j the random walk, and Ψ̃(L)Z_0 − Ψ̃(L)Z_t the stationary component.
Proof. The only substantial issue is to show that Ψ̃(L)Z_0 − Ψ̃(L)Z_t defines a stationary process. According to Theorem 6.4 it is sufficient to show that the coefficients of Ψ̃(L) are absolutely summable. We have that:

Σ_{j=0}^∞ |ψ̃_j| = Σ_{j=0}^∞ |Σ_{i=j+1}^∞ ψ_i| ≤ Σ_{j=0}^∞ Σ_{i=j+1}^∞ |ψ_i| = Σ_{j=1}^∞ j|ψ_j| < ∞,

where the first inequality is a consequence of the triangle inequality and the second inequality follows from Definition 7.1 of an integrated process. □
Cochrane (1988) or Christiano and Eichenbaum (1990)). For a critical view from an
econometric standpoint see Hauser et al. (1999). A more sophisticated multivariate
approach to identify supply and demand shocks and to disentangle their relative
importance is provided in Sect. 15.5.
In business cycle analysis it is often useful to decompose {X_t} into a sum of a trend component μ_t and a cyclical component ε_t:

X_t = μ_t + ε_t.

In the spirit of the Beveridge-Nelson decomposition, the change in the trend component is obtained as

Δμ_t = Ψ(1) (Φ(L)/Θ(L)) ΔX_t.
Examples

Let {ΔX_t} be a MA(q) process with ΔX_t = δ + Z_t + … + θ_q Z_{t−q}; then the persistence is given simply by the sum of the MA coefficients: Ψ(1) = 1 + θ_1 + … + θ_q. Depending on the value of these coefficients, the persistence can be smaller or greater than one.

If {ΔX_t} is an AR(1) process with ΔX_t = δ + φΔX_{t−1} + Z_t and |φ| < 1, then we get ΔX_t = δ/(1 − φ) + Σ_{j=0}^∞ φ^j Z_{t−j}. The persistence is then given as Ψ(1) = Σ_{j=0}^∞ φ^j = 1/(1 − φ). For positive values of φ, the persistence is greater than one. Thus, a shock of one is amplified to have an effect larger than one in the long-run.

If {ΔX_t} is assumed to be an ARMA(1,1) process with ΔX_t = δ + φΔX_{t−1} + Z_t + θZ_{t−1} and |φ| < 1, then ΔX_t = δ/(1 − φ) + Z_t + (φ + θ) Σ_{j=0}^∞ φ^j Z_{t−j−1}. The persistence is therefore given by Ψ(1) = 1 + (φ + θ) Σ_{j=0}^∞ φ^j = (1 + θ)/(1 − φ).
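Since Ψ(1) = Θ(1)/Φ(1), these examples reduce to sums of coefficients. The small helper below (a hypothetical convenience function, numpy only) reproduces all three cases.

```python
import numpy as np

def persistence(ar, ma):
    """Long-run effect Psi(1) = Theta(1)/Phi(1) of a unit shock, where
    ar = [phi_1, ..., phi_p] and ma = [theta_1, ..., theta_q] are the
    ARMA coefficients of the differenced series dX_t."""
    return (1.0 + np.sum(ma)) / (1.0 - np.sum(ar))

print(persistence([], [0.3, 0.2]))   # MA(2):      1 + 0.3 + 0.2      = 1.5
print(persistence([0.5], []))        # AR(1):      1 / (1 - 0.5)      = 2.0
print(persistence([0.5], [0.2]))     # ARMA(1,1):  (1 + 0.2)/(1-0.5)  = 2.4
```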
7.2 Properties of the OLS Estimator in the Case of Integrated Variables
The computation of the persistence for the model estimated for Swiss GDP in Sect. 5.6 is more complicated because a fourth order difference, 1 − L⁴, has been used instead of a first order one. As 1 − L⁴ = (1 − L)(1 + L + L² + L³), it is possible to extend the above computations to this case. For this purpose we compute the persistence for (1 + L + L² + L³)Δ ln BIP_t in the usual way. The long-run effect on ln BIP_t is therefore given by Ψ(1)/4 because (1 + L + L² + L³)Δ ln BIP_t is nothing but four times the moving average of the last four values. For the AR(2) model we get a persistence of 1.42 whereas for the ARMA(1,3) model the persistence is 1.34. Both values are definitely above one so that the permanent effect of a one-percent shock to Swiss GDP is amplified to be larger than one in the long-run. Campbell and Mankiw (1987) and Cochrane (1988) report similar values for the US.
Consider the AR(1) process

X_t = φX_{t−1} + Z_t,   t = 1, 2, …
For |φ| < 1, the OLS estimator of φ, φ̂_T, converges in distribution to a normal random variable (see Chap. 5 and in particular Sect. 5.2):

√T (φ̂_T − φ) → N(0, 1 − φ²).

⁵ We will treat more general cases in Sect. 7.5 and Chap. 16.
Fig. 7.1 Distribution of the OLS estimator of φ for T = 100 and 10,000 replications
each value of φ.⁶ The figure shows that the distribution of φ̂_T becomes more and more concentrated as the true value of φ gets closer and closer to one. Moreover, the distribution also gets more and more skewed to the left. This implies that the OLS estimator is downward biased and that this bias gets relatively more and more pronounced in small samples as φ approaches one.
The asymptotic distribution would be degenerate for φ = 1 because the variance approaches zero as φ goes to one. Thus the asymptotic distribution becomes useless for statistical inference under this circumstance. In order to obtain a non-degenerate distribution the estimator must be scaled by T instead of √T. It can be shown that T(φ̂_T − 1) converges in distribution. This result was first established by Dickey and Fuller (1976) and Dickey and Fuller (1981). However, the asymptotic distribution is no longer normal. It was first tabulated in Fuller (1976). The scaling with T instead of √T means that the OLS estimator converges, if the true value of φ equals one, at a higher rate to φ = 1. This property is known as superconsistency.
⁶ The densities were estimated using an adaptive kernel density estimator with Epanechnikov window (see Silverman (1986)).
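A small Monte-Carlo experiment in the spirit of Fig. 7.1 (numpy only; seed, T, and the number of replications are arbitrary) shows both the increasing concentration of φ̂_T and its downward bias as φ approaches one.

```python
import numpy as np

rng = np.random.default_rng(1)
T, nrep = 100, 5000

def ols_phi(phi):
    """OLS estimates of phi from nrep simulated AR(1) paths (X_0 = 0)."""
    Z = rng.normal(size=(nrep, T + 1))
    X = np.zeros((nrep, T + 1))
    for t in range(1, T + 1):
        X[:, t] = phi * X[:, t - 1] + Z[:, t]
    return (X[:, :-1] * X[:, 1:]).sum(axis=1) / (X[:, :-1] ** 2).sum(axis=1)

results = {phi: ols_phi(phi) for phi in (0.5, 0.9, 1.0)}
for phi, est in results.items():
    # Mean below the true value (downward bias), spread shrinking toward phi = 1.
    print(phi, round(est.mean(), 3), round(est.std(), 3))
```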
In order to understand this result better, in particular in the light of the derivation in the Appendix of Sect. 5.2, we take a closer look at the asymptotic distribution of T(φ̂_T − φ):

T(φ̂_T − φ) = [ (1/(σ²T)) Σ_{t=1}^T X_{t−1} Z_t ] / [ (1/(σ²T²)) Σ_{t=1}^T X_{t−1}² ].

Consider the denominator. Under the null hypothesis φ = 1 with X_0 = 0,

E[Σ_{t=1}^T X_{t−1}²] = Σ_{t=1}^T σ²(t − 1) = σ² T(T − 1)/2,

because X_{t−1} ∼ N(0, σ²(t − 1)). To obtain a nondegenerate random variable one must therefore scale by T². Thus, intuitively, T(φ̂_T − φ) will no longer converge to a degenerate distribution.
Using similar arguments it can be shown that the t-statistic

t_T = (φ̂_T − 1)/σ̂_φ̂ = (φ̂_T − 1) / √( s_T² / Σ_{t=1}^T X_{t−1}² )

with s_T² = (T − 1)⁻¹ Σ_{t=2}^T (X_t − φ̂_T X_{t−1})² is not asymptotically normal. Its distribution was also first tabulated by Fuller (1976). Figure 7.2 compares its density with the
standard normal distribution in a Monte-Carlo experiment using again a sample of T = 100 and 10,000 replications.

Fig. 7.2 Distribution of the t-statistic for T = 100 and 10,000 replications, together with the standard normal distribution

It is obvious that the distribution of the t-statistic is shifted to the left. This implies that the critical values are larger in absolute value than in the standard normal case. In addition, one may observe a slight skewness.
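The leftward shift can be reproduced with a few lines of numpy (an illustrative Monte-Carlo, not the book's exact experiment): under the null hypothesis φ = 1, the simulated 5 % quantile of the t-statistic lies near the Dickey-Fuller critical value of about −1.95 rather than the −1.65 of the standard normal.

```python
import numpy as np

rng = np.random.default_rng(2)
T, nrep = 100, 5000

Z = rng.normal(size=(nrep, T + 1))
X = np.cumsum(Z, axis=1)                        # nrep random walks (phi = 1)
x0, x1 = X[:, :-1], X[:, 1:]

phi_hat = (x0 * x1).sum(axis=1) / (x0 ** 2).sum(axis=1)
resid = x1 - phi_hat[:, None] * x0
s2 = (resid ** 2).sum(axis=1) / (T - 1)
se = np.sqrt(s2 / (x0 ** 2).sum(axis=1))
t_stat = (phi_hat - 1.0) / se

print(t_stat.mean())              # negative: the distribution is shifted left
print(np.quantile(t_stat, 0.05))  # around -1.95, far below the normal -1.65
```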
Finally, we also want to investigate the autocovariance function of a random walk. Using similar arguments as in Sect. 1.3 we get:

γ(h) = E(X_T X_{T−h})
     = E[(Z_T + Z_{T−1} + … + Z_1)(Z_{T−h} + Z_{T−h−1} + … + Z_1)]
     = E[Z_{T−h}² + Z_{T−h−1}² + … + Z_1²] = (T − h)σ².
[Figure: theoretical ACF of a random walk, estimated ACF of a random walk, and estimated ACF of an AR(1) process with φ = 0.9 generated with the same innovations as the random walk]

The figure also shows the estimated ACF for an AR(1) process with φ = 0.9, computed using the same realizations of the white noise process as in the construction of the random walk.
Despite the large differences between the ACF of an AR(1) process and that of a random walk, the ACF is only of limited use for discriminating between a (stationary) ARMA process and a random walk.
The above calculation also shows that ρ(1) < 1, so that the expected value of the OLS estimator is downward biased in finite samples: Eφ̂_T < 1.
The previous Sects. 7.1 and 7.2 have shown that, depending on the nature of the
non-stationarity (trend versus difference stationarity), the stochastic process has
quite different algebraic (forecast, forecast error variance, persistence) and statistical
(asymptotic distribution of OLS-estimator) properties. It is therefore important to be
able to discriminate between these two different types of processes. This also pertains
to standard regression models for which the presence of integrated variables can lead
to non-normal asymptotic distributions.
The ability to differentiate between trend- and difference-stationary processes is
not only important from a statistical point of view, but can be given an economic
interpretation. In macroeconomic theory, monetary and demand disturbances are
alleged to have only temporary effects whereas supply disturbances, in particular
technology shocks, are supposed to have permanent effects. To put it in the language
of time series analysis: monetary and demand shocks have a persistence of zero
whereas supply shocks have nonzero (positive) persistence. Nelson and Plosser
(1982) were the first to investigate the trend properties of economic time series
from this angle. In their influential study they reached the conclusion that, with the
important exception of the unemployment rate, most economic time series in the
US are better characterized as being difference stationary. Although this conclusion
came under severe scrutiny (see Cochrane (1988) and Campbell and Perron (1991)),
this issue resurfaces in many economic debates. The latest discussion relates to the
nature and effect of technology shocks (see Galí (1999) or Christiano et al. (2003)).
The following exposition focuses on the Dickey-Fuller test (DF-test) and the Phillips-Perron test (PP-test). Although other test procedures and variants thereof have been developed in the meantime, these two remain the most widely applied in practice. Tests of this type are also called unit-root tests.
Both the DF- and the PP-test rely on a regression of X_t on X_{t−1} which may include further deterministic regressors like a constant or a linear time trend. We call this regression the Dickey-Fuller regression:

X_t = {deterministic variables} + φX_{t−1} + Z_t.   (7.1)

Alternatively, and numerically equivalent, one may run the Dickey-Fuller regression in difference form:

ΔX_t = {deterministic variables} + βX_{t−1} + Z_t

with β = φ − 1. For both tests, the null hypothesis is that the process is integrated of order one, i.e. difference-stationary or, equivalently, has a unit root. Thus we have

H₀: φ = 1 or, equivalently, β = 0,

tested against the one-sided alternative φ < 1, respectively β < 0; the unit-root test is thus a one-sided test. The advantage of the second formulation of the Dickey-Fuller regression is that the corresponding t-statistic can be readily read off from the standard output of many computer packages, which makes additional computations unnecessary.
7.3 Unit-Root Tests
The Dickey-Fuller test comes in two forms. The first one, sometimes called the ρ-test, takes T(φ̂ − 1) as the test statistic. As shown previously, this statistic is no longer asymptotically normally distributed. However, it was first tabulated by Fuller and can be found in textbooks like Fuller (1976) or Hamilton (1994b). The second and much more common form relies on the usual t-statistic for the hypothesis φ = 1. This test statistic is also not asymptotically normally distributed. It was first tabulated by Fuller (1976) and can be found, for example, in Hamilton (1994b). Later, MacKinnon (1991) presented much more detailed tables from which the critical values can be approximated for any sample size T by using interpolation formulas (see also Banerjee et al. (1993)).⁷
The application of the Dickey-Fuller test as well as the Phillips-Perron test is complicated by the fact that the asymptotic distribution of the test statistic (ρ- or t-test) depends on the specification of the deterministic components and on the true data generating process. This implies that, depending on whether the Dickey-Fuller regression includes, for example, a constant and/or a time trend and on the nature of the true data generating process, one has to use different tables and thus different critical values. In the following we will focus on the most common cases listed in Table 7.1.
In case 1 the Dickey-Fuller regression includes no deterministic component.
Thus, a rejection of the null hypothesis implies that fXt g has to be a mean zero
stationary process. This specification is, therefore, only warranted if one can make
sure that the data have indeed mean zero. As this is rarely the case, except, for
example, when the data are the residuals from a previous regression,8 case 1 is
Table 7.1 The four most important cases for the unit-root test

Data generating process (null hypothesis) | Dickey-Fuller regression (estimated regression) | ρ-test: T(φ̂ − 1) | t-test
X_t = X_{t−1} + Z_t | X_t = φX_{t−1} + Z_t | Case 1 | Case 1
X_t = X_{t−1} + Z_t | X_t = α + φX_{t−1} + Z_t | Case 2 | Case 2
X_t = α + X_{t−1} + Z_t, α ≠ 0 | X_t = α + φX_{t−1} + Z_t | | N(0,1)
X_t = α + X_{t−1} + Z_t | X_t = α + δt + φX_{t−1} + Z_t | Case 4 | Case 4

⁷ These interpolation formulas are now implemented in many software packages, like EVIEWS, to compute the appropriate critical values.
⁸ This fact may pose a problem by itself.
very uncommon in practice. Thus, if the data do not display a trend, which can be checked by a simple time plot, the Dickey-Fuller regression should include a constant as in case 2. A rejection of the null hypothesis then implies that {X_t} is a stationary process with mean μ = α/(1 − φ). If the data display a time trend, the Dickey-Fuller regression should also include a linear time trend as in case 4. A rejection of the null hypothesis then implies that the process is trend-stationary. In the case that the Dickey-Fuller regression contains no time trend and there is no time trend under the alternative hypothesis, asymptotic normality holds. This case is only of theoretical interest as it should a priori be clear whether the data are trending or not. If one is not confident about the trending nature of the time series, the procedure outlined in Sect. 7.3.3 should be followed.
In the cases 2 and 4 it is of interest to investigate the joint hypotheses H₀: α = 0 and φ = 1, and H₀: δ = 0 and φ = 1, respectively. Again the corresponding F-statistic is no longer F-distributed, but has been tabulated (see Hamilton (1994b, Table B.7)). The trade-off between t- and F-test is discussed in Sect. 7.3.3.
Most economic time series display a significant amount of autocorrelation. To take this feature into account it is necessary to include lagged differences ΔX_{t−1}, …, ΔX_{t−p+1} as additional regressors. The so modified Dickey-Fuller regression then becomes:

X_t = {deterministic variables} + φX_{t−1} + γ₁ΔX_{t−1} + … + γ_{p−1}ΔX_{t−p+1} + Z_t.
This modified test is called the augmented Dickey-Fuller test (ADF-test). This autoregressive correction does not change the asymptotic distribution of the test statistics, so the same tables can be used as before. For the coefficients of the autoregressive terms asymptotic normality holds. This implies that the standard testing procedures (t-test, F-test) can be applied in the usual way. This remains true if the error term contains moving-average components, provided they are approximated by sufficiently many autoregressive correction terms (see Said and Dickey (1984)).
For the ADF-test the order p of the model should be chosen such that the residuals are close to being white noise. This can be checked, for example, by looking at the ACF of the residuals or by carrying out a Ljung-Box test (see Sect. 4.2). In case of doubt, it is better to choose a higher order. A consistent procedure to find the right order is to use Akaike's information criterion (AIC). An alternative strategy advocated by Ng and Perron (1995) is an iterative testing procedure which makes use of the asymptotic normality of the autoregressive correction terms. Starting from a maximal order p − 1 = p_max, the method amounts to testing whether the coefficient corresponding to the highest order is significantly different from zero. If the null hypothesis that the coefficient is zero is not rejected, the order of the model is reduced by one and the test is repeated. This is done as long as the null hypothesis is not rejected. Once the null hypothesis is rejected, one sticks with the model and performs the ADF-test. The successive tests are standard t-tests. It is advisable to use a rather high significance level, for example a 10 % level. The simulation results of Ng and Perron (1995) show that this procedure leads to a smaller bias compared to using the AIC criterion and that the reduction in power remains negligible.
If {Z_t} were white noise, so that J = γ(0) and Ĵ_T = γ̂(0), one would obtain the ordinary Dickey-Fuller test statistic. Similar formulas can be derived for the cases 2 and 4. As already mentioned, these modifications do not alter the asymptotic distributions, so the same critical values as for the ADF-test can be used.
The main advantage of the Phillips-Perron test is that the non-parametric correction allows for very general {Z_t} processes. The PP-test is particularly appropriate if {Z_t} has some MA components which can only be poorly approximated by low-order autoregressive terms. Another advantage is that the exact modeling of the process can be avoided. Monte-Carlo studies have shown that the PP-test has more power than the DF-test, i.e. the PP-test rejects the null hypothesis more often when it is false, but that, on the other hand, it also has a higher size distortion, i.e. it rejects the null hypothesis too often when it is true.
⁹ The exact assumptions can be found in Phillips (1987) and Phillips and Perron (1988).
X_t has a long-run trend: In this case the Dickey-Fuller regression should include a constant and a linear time trend:¹⁰

X_t = α + δt + φX_{t−1} + Z_t.

Test the joint hypothesis

H₀: φ = 1 and δ = 0

by a corresponding F-test. Note that the F-statistic, like the t-statistic, is not distributed according to the F-distribution. If the test does not reject the null, we conclude that {X_t} is a unit root process with drift or, equivalently, a difference-stationary (integrated) process. If the F-test rejects the null hypothesis, there are three possible situations:
(i) The possibility φ < 1 and δ = 0 contradicts the primary observation that {X_t} has a trend and can therefore be eliminated.
(ii) The possibility φ = 1 and δ ≠ 0 can also be excluded because this would imply that {X_t} has a quadratic trend, which is unrealistic.
(iii) The possibility φ < 1 and δ ≠ 0 represents the only valid alternative. It implies that {X_t} is stationary around a linear trend, i.e. that {X_t} is trend-stationary.
Similar conclusions can be reached if, instead of the F-test, a t-test is used to test the null hypothesis H₀: φ = 1 against the alternative H₁: φ < 1. Thereby a non-rejection of H₀ is interpreted as implying that δ = 0. If, however, the null hypothesis H₀ is rejected, this implies that δ ≠ 0, because {X_t} exhibits a long-run trend.

¹⁰ In the case of the ADF-test additional regressors, ΔX_{t−j}, j > 0, might be necessary.
The F-test is more powerful than the t-test. The t-test, however, is a one-sided
test, which has the advantage that it actually corresponds to the primary objective
of the test. In Monte-Carlo simulations the t-test has proven to be marginally
superior to the F-test.
X_t has no long-run trend: In this case δ = 0 and the Dickey-Fuller regression should be run without a trend:¹¹

X_t = α + φX_{t−1} + Z_t.

Test the joint hypothesis

H₀: φ = 1 and α = 0

by the corresponding F-test. If the F-test rejects the null hypothesis, there are again three possible situations:
(i) The case φ < 1 and α = 0 can be eliminated because it implies that {X_t} would have a mean of zero, which is unrealistic for most economic time series.
(ii) The case φ = 1 and α ≠ 0 can equally be eliminated because it implies that {X_t} has a long-run trend, which contradicts our primary assumption.
(iii) The case φ < 1 and α ≠ 0 is the only realistic alternative. It implies that the time series is stationary around a constant mean given by α/(1 − φ).
As before one can use, instead of an F-test, a t-test of the null hypothesis H₀: φ = 1 against the alternative hypothesis H₁: φ < 1. If the null hypothesis is not rejected, we interpret this to imply that α = 0. If, however, the null hypothesis H₀ is rejected, we conclude that α ≠ 0. Similarly, Monte-Carlo simulations have shown that the t-test is superior to the F-test.
The trend behavior of X_t is uncertain: This situation poses the following problem. Should the data exhibit a trend, but the Dickey-Fuller regression contains no trend, then the test is biased in favor of the null hypothesis. If the data have no trend, but the Dickey-Fuller regression contains a trend, the power of the test is reduced. In such a situation one can adopt a two-stage strategy. Estimate the Dickey-Fuller regression with a linear trend:

X_t = α + δt + φX_{t−1} + Z_t.

Use the t-test to test the null hypothesis H₀: φ = 1 against the alternative hypothesis H₁: φ < 1. If H₀ is not rejected, we conclude that the process has a unit root with or without drift. The presence of a drift can then be investigated by a simple regression of ΔX_t against a constant, followed by a simple t-test of the null hypothesis that the constant is zero against the alternative hypothesis that the constant is nonzero. As ΔX_t is stationary, the usual critical values can be used.¹²
If the t-test rejects the null hypothesis H₀, we conclude that there is no unit root. The trend behavior can then be investigated by a simple t-test of the hypothesis H₀: δ = 0. In this test the usual critical values can be used as {X_t} is already viewed as being stationary.

¹¹ In the case of the ADF-test additional regressors, ΔX_{t−j}, j > 0, might be necessary.
As our first example, we examine the logged real GDP for Switzerland, ln.BIPt /,
where we have adjusted the series for seasonality by taking a moving-average.
The corresponding data are plotted in Fig. 1.3. As is evident from this plot, this
variable exhibits a clear trend so that the Dickey-Fuller regression should include
a constant and a linear time trend. Moreover, f ln.BIPt /g is typically highly
autocorrelated which makes an autoregressive correction necessary. One way to
make this correction is by augmenting the Dickey-Fuller regression by lagged
f ln.BIPt /g as additional regressors. Thereby the number of lags is determined
by AIC. The corresponding result is reported in the first column of Table 7.2. It shows that AIC chooses only one autoregressive correction term. The value of the t-test statistic is −3.110, which is just above the 5 % critical value. Thus, the null hypothesis is not rejected. If the autoregressive correction is chosen according to the method proposed by Ng and Perron, five autoregressive lags have to be included. With this specification, the value of the t-test statistic is clearly above the critical value, implying that the null hypothesis of the presence of a unit root cannot be rejected (see the second column in Table 7.2).¹³ The results of the ADF-tests are confirmed by the PP-test (column 3 in Table 7.2) with quadratic spectral kernel function and bandwidth 20.3 chosen according to Andrews' formula (see Sect. 4.4).
The second example examines the three-month LIBOR, {R3M_t}. The series is plotted in Fig. 1.4. Whether this series has a linear trend or not is not easy to decide. On the one hand, the series clearly has a negative trend over the sample period considered. On the other hand, a negative time trend does not make sense from an economic point of view because interest rates are bounded from below by zero. Because of this uncertainty, it is advisable to include both a constant and a trend in the Dickey-Fuller regression to be on the safe side. Column 5 in Table 7.2 reports the corresponding results. The value of the t-statistic of the PP-test with Bartlett kernel function and a bandwidth of 5, according to the Newey-West rule of thumb, is −2.142 and thus higher than the corresponding 5 % critical value of −3.435.
¹² Possibly one must correct the corresponding standard error by taking the autocorrelation of the residuals into account. This can be done by using the long-run variance. In the literature this correction is known as the Newey-West correction.
¹³ The critical value changes slightly because the inclusion of additional autoregressive terms changes the sample size.
7.4 Generalizations of Unit-Root Tests
Thus, we cannot reject the null hypothesis of the presence of a unit root. We therefore conclude that the process {R3M_t} is integrated of order one, respectively difference-stationary. Based on this conclusion, the issue of the trend can now be decided by running a simple regression of ΔR3M_t against a constant. This leads to the following result:

ΔR3M_t = −0.0315 + ê_t,
         (0.0281)

where the number in parentheses is the standard error.
As we have seen, the unit-root test depends heavily on the correct specification of the deterministic part. Most of the time this amounts to deciding whether a linear trend is present in the data or not. In the previous section we presented a rule for how
to proceed in case of uncertainty about the trend.

Fig. 7.4 Three types of structural breaks at T_B. (a) Level shift. (b) Change in slope. (c) Level shift and change in slope

Sometimes, however, the data
exhibit a structural break in their deterministic component. If this structural break is
ignored, the unit-root test is biased in favor of the null hypothesis (i. e. in favor of a
unit root) as demonstrated by Perron (1989). Unfortunately, the distribution of the
test statistic under the null hypothesis, in our case the t-statistic, depends on the exact
nature of the structural break and on its date of occurrence in the data. Following
Perron (1989) we concentrate on three exemplary cases: a level shift, a change in the
slope (a change in the growth rate), and a combination of both possibilities. Figure 7.4
shows the three possibilities assuming that a break occurred in period $T_B$. In each case,
an AR(1) process with coefficient 0.8 was superimposed on the deterministic part.
The unit-root test with the possibility of a structural break in period $T_B$ is
carried out using the Dickey-Fuller test, where the date of the structural break is
assumed to be known. This assumption, although restrictive, is justifiable in many
applications. The first oil price shock in 1973 or the German reunification in 1989
are examples of structural breaks which can be dated exactly. Other examples would
include changes in the way the data are constructed. These changes are usually
documented by the data collecting agencies. Table 7.3 summarizes the three variants
of Dickey-Fuller regression allowing for structural breaks.14
Model A allows only for a level shift. Under the null hypothesis the series
undergoes a one-time shift at time $T_B$. This level shift is maintained under the null
hypothesis, which posits a random walk. Under the alternative, the process is viewed
as being trend-stationary, whereby the trend line shifts in parallel by $\alpha_B - \alpha$ at time
$T_B$. Model B considers a change in the mean growth rate from $\alpha$ to $\alpha_B$ at time $T_B$.
Under the alternative, the slope of the time trend changes from $\delta$ to $\delta_B$. Model C allows
for both types of break to occur at the same time.
The unit-root test with possible structural break for a time series $X_t$,
$t = 0, 1, \ldots, T$, is implemented in two stages as follows. In the first stage, we
regress $X_t$ on the corresponding deterministic component using OLS. The residuals
$\tilde X_0, \tilde X_1, \ldots, \tilde X_T$ from this regression are then used to carry out a Dickey-Fuller test:
$$\tilde X_t = \rho\, \tilde X_{t-1} + Z_t, \qquad t = 1, \ldots, T.$$
The distribution of the corresponding t-statistic under the null hypothesis depends
not only on the type of the structural break, but also on the relative date of the break
in the sample. Let this relative date be parameterized by $\lambda = T_B/T$. The asymptotic
distribution of the t-statistic has been tabulated by Perron (1989). This table can be
used to determine the critical values for the test. These critical values are smaller
than those from the normal Dickey-Fuller table. Using a 5 % significance level, the
critical values range between $-3.80$ and $-3.68$ for model A, between $-3.96$ and
$-3.65$ for model B, and between $-4.24$ and $-3.75$ for model C, depending on the
value of $\lambda$. These values also show that the dependence on $\lambda$ is only weak.
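To make the two-stage procedure concrete, the following sketch (our own illustration, not from the text) detrends a simulated series with a level-shift dummy (Model A) and computes the Dickey-Fuller t-statistic on the residuals; the function name and all parameter values are hypothetical:

```python
import numpy as np

def perron_level_shift_t_stat(x, t_break):
    """First stage: regress x on constant, trend, and a level-shift dummy;
    second stage: Dickey-Fuller t-statistic on the residuals (Model A)."""
    T = len(x)
    trend = np.arange(T)
    dummy = (trend >= t_break).astype(float)         # level shift at t_break
    X = np.column_stack([np.ones(T), trend, dummy])
    beta, *_ = np.linalg.lstsq(X, x, rcond=None)
    resid = x - X @ beta
    # Dickey-Fuller regression on the residuals: resid_t = rho * resid_{t-1} + Z_t
    y, ylag = resid[1:], resid[:-1]
    rho = (ylag @ y) / (ylag @ ylag)
    z = y - rho * ylag
    se = np.sqrt((z @ z) / (len(y) - 1) / (ylag @ ylag))
    return (rho - 1.0) / se                           # t-statistic for H0: rho = 1

# trend-stationary example series with a level shift at t = 100
rng = np.random.default_rng(0)
T, TB = 200, 100
x = 0.5 + 0.02 * np.arange(T) + 1.5 * (np.arange(T) >= TB)
x = x + rng.standard_normal(T)   # stationary noise around the broken trend
print(perron_level_shift_t_stat(x, TB))   # strongly negative: reject the unit root
```

The t-statistic must of course be compared with the Perron (1989) critical values, not with the standard Dickey-Fuller table.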
In the practical application of the test one has to control for the autocorrelation in
the data. This can be done by using the Augmented Dickey-Fuller (ADF) test. This
14. See Eq. (7.1) and Table 7.1 for comparison.
The time of the structural break $T_B$, respectively $\lambda = T_B/T$, is estimated in such a way that
$\{X_t\}$ comes as close as possible to a trend-stationary process. Under the alternative
hypothesis $\{X_t\}$ is viewed as a trend-stationary process with unknown break point.
The goal of the estimation strategy is to choose $T_B$, respectively $\lambda$, in such a way
that the trend-stationary alternative receives the highest weight. Zivot and Andrews
(1992) propose to estimate $\lambda$ by minimizing the value of the t-statistic $t_{\hat\rho}(\lambda)$ under
the hypothesis $\rho = 1$:
$$t_{\hat\rho}(\hat\lambda) = \inf_{\lambda \in \Lambda} t_{\hat\rho}(\lambda),$$
where $\Lambda$ is a closed subinterval of $(0,1)$.15 The distribution of the test statistic under
the null hypothesis for the three cases is tabulated in Zivot and Andrews (1992). This
table then allows one to determine the appropriate critical values for the test. In practice,
one has to take the autocorrelation of the time series into account by one of the
methods discussed previously.
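The estimation of the break date can be sketched as follows (our own illustration for Model A; in practice one would additionally correct for autocorrelation as just discussed, and use the Zivot-Andrews critical values):

```python
import numpy as np

def df_t_stat(resid):
    """Dickey-Fuller t-statistic for H0: rho = 1 computed from residuals."""
    y, ylag = resid[1:], resid[:-1]
    rho = (ylag @ y) / (ylag @ ylag)
    z = y - rho * ylag
    se = np.sqrt((z @ z) / (len(y) - 1) / (ylag @ ylag))
    return (rho - 1.0) / se

def zivot_andrews_break(x, trim=0.15):
    """Estimate the break date by minimizing the DF t-statistic over
    candidate breaks in Lambda = [trim, 1 - trim] (Model A: level shift)."""
    T = len(x)
    trend = np.arange(T)
    best_t, best_tb = np.inf, None
    for tb in range(int(trim * T), int((1 - trim) * T)):
        X = np.column_stack([np.ones(T), trend, (trend >= tb).astype(float)])
        beta, *_ = np.linalg.lstsq(X, x, rcond=None)
        t = df_t_stat(x - X @ beta)
        if t < best_t:
            best_t, best_tb = t, tb
    return best_tb, best_t

# simulated level shift of size 3 at t = 150 in an otherwise stationary series
rng = np.random.default_rng(1)
T, TB = 300, 150
x = 1.0 + 3.0 * (np.arange(T) >= TB) + rng.standard_normal(T)
tb_hat, t_min = zivot_andrews_break(x)
print(tb_hat, t_min)   # estimated break date near 150
```

Minimizing the t-statistic picks the break date under which the detrended series looks most stationary, which is exactly the weighting of the trend-stationary alternative described above.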
This testing strategy can be adapted to determine the time of a structural
break in the linear trend irrespective of whether the process is trend-stationary or
integrated of order one. The distributions of the corresponding test statistics have
been tabulated by Vogelsang (1997).16
15. Taking the infimum over $\Lambda$ instead of over $(0,1)$ is for theoretical reasons only. In practice, the
choice of $\Lambda$ plays no important role. For example, one may take $\Lambda = [0.01, 0.99]$.
16. See also the survey by Perron (2006).
The unit-root tests we discussed so far tested the null hypothesis that the process
is integrated of order one against the alternative hypothesis that the process
is integrated of order zero (i.e. is stationary). However, one may be interested
in reversing the null and the alternative hypothesis and testing the hypothesis of
stationarity against the alternative that the process is integrated of order one. Such
a test has been proposed by Kwiatkowski et al. (1992) and is called the KPSS test. This
test rests on the idea that according to the Beveridge-Nelson decomposition (see
Sect. 7.1.4) each integrated process of order one can be seen as the sum of a linear
time trend, a random walk, and a stationary process:
$$X_t = \alpha + \delta t + d \sum_{j=1}^{t} Z_j + U_t,$$
where $\{U_t\}$ denotes a stationary process. If $d = 0$ the process is trend-stationary,
otherwise it is integrated of order one.17 Thus, one can state the null and
the alternative hypothesis as follows:
$$H_0: d = 0 \qquad \text{against} \qquad H_1: d \neq 0.$$
Denote by $\{S_t\}$ the process of partial sums obtained from the residuals $\{e_t\}$ of
a regression of $X_t$ on a constant and a linear time trend, i.e. $S_t = \sum_{j=1}^{t} e_j$.18
Under the null hypothesis $d = 0$, $\{S_t\}$ is integrated of order one, whereas
under the alternative $\{S_t\}$ is integrated of order two. Based on this consideration
Kwiatkowski et al. propose the following test statistic for a time series consisting of
$T$ observations:
$$W_T = \frac{\sum_{t=1}^{T} S_t^2}{T^2\, \hat J_T}, \tag{7.3}$$
where $\hat J_T$ is an estimate of the long-run variance of $\{U_t\}$ (see Sect. 4.4). As $\{S_t\}$ is an
integrated process under the null hypothesis, the variance of $S_t$ grows linearly in $t$
(see Sect. 1.4.4 or 7.2), so that the sum of the squared $S_t$ diverges at rate $T^2$. Thus, the test
statistic remains bounded and can be shown to converge. Note that the test statistic
does not depend on further nuisance parameters. Under the alternative hypothesis,
however, $\{S_t\}$ is integrated of order two, so that $W_T$ diverges. Thus, the null hypothesis is rejected for
large values of $W_T$. The corresponding asymptotic critical values of the test statistic
are reported in Table 7.4.
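A minimal implementation of the statistic (our own sketch; the Bartlett-kernel long-run variance estimator and the bandwidth rule of thumb are one common choice) illustrates the behavior under both hypotheses:

```python
import numpy as np

def kpss_statistic(x, bandwidth=None):
    """KPSS test statistic W_T (trend version): partial sums of the residuals
    of a regression on constant and trend, scaled by the long-run variance."""
    T = len(x)
    Z = np.column_stack([np.ones(T), np.arange(T)])
    beta, *_ = np.linalg.lstsq(Z, x, rcond=None)
    e = x - Z @ beta
    S = np.cumsum(e)                        # partial sums S_t
    if bandwidth is None:                   # Newey-West style rule of thumb
        bandwidth = int(4 * (T / 100) ** 0.25)
    # long-run variance of the residuals with Bartlett kernel weights
    J = e @ e / T
    for h in range(1, bandwidth + 1):
        w = 1.0 - h / (bandwidth + 1.0)
        J += 2.0 * w * (e[h:] @ e[:-h]) / T
    return (S @ S) / (T ** 2 * J)

rng = np.random.default_rng(2)
T = 500
stationary = 0.3 * np.arange(T) + rng.standard_normal(T)   # trend-stationary
integrated = np.cumsum(rng.standard_normal(T))             # random walk
print(kpss_statistic(stationary))   # small: do not reject H0: d = 0
print(kpss_statistic(integrated))   # large: reject stationarity
```

The computed values would be compared with the asymptotic critical values of Table 7.4.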
17. If the data exhibit no trend, one can set $\delta$ equal to zero.
18. This auxiliary regression may include additional exogenous variables.
Consider two independent random walks
$$X_t = X_{t-1} + U_t, \qquad U_t \sim \text{IID}(0, \sigma_U^2),$$
$$Y_t = Y_{t-1} + V_t, \qquad V_t \sim \text{IID}(0, \sigma_V^2),$$
where the processes $\{U_t\}$ and $\{V_t\}$ are uncorrelated with each other at all leads and
lags, and consider the regression
$$Y_t = \alpha + \beta X_t + \varepsilon_t.$$
As $\{X_t\}$ and $\{Y_t\}$ are two random walks which are uncorrelated with each other by
construction, one would expect that the OLS-estimate $\hat\beta$ of the coefficient of $X_t$
should tend to zero as the sample size $T$ goes to infinity. The same is expected for
the coefficient of determination $R^2$. This is, however, not true, as has already been
remarked by Yule (1926) and, more recently, by Granger and Newbold (1974). The
above regression will have a tendency to “discover” a relationship between $Y_t$ and $X_t$
despite the fact that there is none. This phenomenon is called spurious correlation
or spurious regression. Similarly, unreliable results would be obtained by using
a simple t-test for the null hypothesis $\beta = 0$ against the alternative hypothesis
$\beta \neq 0$. The reason for these treacherous findings is that the model is incorrect
under the null as well as under the alternative hypothesis. Under the null hypothesis
$\{\varepsilon_t\}$ is an integrated process, which violates the standard assumptions for OLS. The
alternative hypothesis is not true by construction. Thus, OLS-estimates should be
interpreted with great caution in this situation.
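A small Monte Carlo experiment (our own sketch; sample size and number of replications are arbitrary) reproduces the phenomenon: regressing one random walk on an independent one rejects $\beta = 0$ far too often.

```python
import numpy as np

def spurious_t_stats(T=200, n_sim=500, seed=0):
    """Regress one random walk on another, independent one, and collect the
    t-statistics for H0: beta = 0. Under valid inference, |t| > 1.96 should
    occur in about 5% of the replications; here it occurs far more often."""
    rng = np.random.default_rng(seed)
    tstats = np.empty(n_sim)
    for i in range(n_sim):
        x = np.cumsum(rng.standard_normal(T))
        y = np.cumsum(rng.standard_normal(T))
        X = np.column_stack([np.ones(T), x])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        e = y - X @ beta
        s2 = (e @ e) / (T - 2)
        cov = s2 * np.linalg.inv(X.T @ X)
        tstats[i] = beta[1] / np.sqrt(cov[1, 1])
    return tstats

t = spurious_t_stats()
print(np.mean(np.abs(t) > 1.96))   # rejection rate far above the nominal 5%
```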
The same issue arises if the random walks contain drift terms:
$$X_t = \delta_X + X_{t-1} + U_t, \qquad U_t \sim \text{IID}(0, \sigma_U^2),$$
$$Y_t = \delta_Y + Y_{t-1} + V_t, \qquad V_t \sim \text{IID}(0, \sigma_V^2),$$
where $\{U_t\}$ and $\{V_t\}$ are again independent from each other at all leads and
lags. The regression is the same as above:
$$Y_t = \alpha + \beta X_t + \varepsilon_t.$$
The spurious regression problem cannot be circumvented by first testing for a unit
root in Yt and Xt and then running the regression in first differences in case of no
rejection of the null hypothesis. The reason is that a regression in the levels of
$Y_t$ and $X_t$ may be sensible even when both variables are integrated. This is the case
when the two variables are cointegrated. The concept of cointegration goes back to
Engle and Granger (1987) and triggered a veritable research boom. We will give a more
general definition in Chap. 16 when we deal with multivariate time series. Here we
stick to the case of two variables and present the following definition.
Definition 7.2 (Cointegration, Bivariate). Two stochastic processes fXt g and fYt g
are called cointegrated if the following two conditions are fulfilled:
Fig. 7.5 Distribution of the OLS-estimate $\hat\beta$ and of the t-statistic $t_{\hat\beta}$ for two independent random walks and
for two independent stationary AR(1) processes. (a) Distribution of $\hat\beta$. (b) Distribution of $t_{\hat\beta}$
(i) $\{X_t\}$ and $\{Y_t\}$ are both integrated processes of order one, i.e. $X_t \sim I(1)$ and
$Y_t \sim I(1)$;
(ii) there exists a constant $\beta \neq 0$ such that $\{Y_t - \beta X_t\}$ is a stationary process, i.e.
$\{Y_t - \beta X_t\} \sim I(0)$.
The issue of whether two integrated processes are cointegrated can be decided on
the basis of a unit root test. Two cases can be distinguished. In the first one, $\beta$ is
assumed to be known. Thus, one can immediately apply the augmented Dickey-
Fuller (ADF) or the Phillips-Perron (PP) test to the process $\{Y_t - \beta X_t\}$. Thereby the
same issue regarding the specification of the deterministic part arises. The critical
values can be retrieved from the usual tables (for example from MacKinnon 1991).
In the second case, $\beta$ is not known and must be estimated from the data. This can be
done by running, as a first step, a simple (cointegrating) regression of $Y_t$ on $X_t$ including
a constant and/or a time trend.19 Thereby the specification of the deterministic part
follows the same rules as before. The unit root test is then applied, in the second
step, to the residuals from this regression. As the residuals have been obtained
from a preceding regression, we are faced with the so-called “generated regressor
problem”.20 This implies that the usual Dickey-Fuller tables can no longer be used;
instead the tables provided by Phillips and Ouliaris (1990) become the relevant ones.
As before, the corresponding asymptotic distribution depends on the specification
of the deterministic part in the cointegrating regression. If this regression includes
a constant, the residuals necessarily have a mean of zero, so that the Dickey-Fuller
regression should include no constant (case 1 in Table 7.1):
$$e_t = \rho\, e_{t-1} + \eta_t,$$
where $e_t$ and $\eta_t$ denote the residuals of the cointegrating regression and of the
Dickey-Fuller regression, respectively. In most applications it is necessary to correct
for autocorrelation, which can be done by including additional lagged differences
$\Delta\hat\varepsilon_{t-1}, \ldots, \Delta\hat\varepsilon_{t-p+1}$ as in the ADF test or by adjusting the t-statistic as in the PP
test. The test where $\beta$ is estimated from a regression is called the regression test for
cointegration. Note that if the two series are cointegrated, then the OLS estimate of
$\beta$ is (super-)consistent.
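The two steps of the regression test can be sketched as follows (our own illustration on simulated data; in an application the t-statistic would be compared with the Phillips-Ouliaris critical values, and an autocorrelation correction would typically be added):

```python
import numpy as np

def engle_granger_residual_t(y, x):
    """Two-step regression test for cointegration: (1) OLS of y on a constant
    and x; (2) Dickey-Fuller t-statistic on the residuals (no constant,
    case 1 in Table 7.1)."""
    T = len(y)
    X = np.column_stack([np.ones(T), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    elag, ecur = e[:-1], e[1:]
    rho = (elag @ ecur) / (elag @ elag)
    u = ecur - rho * elag
    se = np.sqrt((u @ u) / (T - 2) / (elag @ elag))
    return b[1], (rho - 1.0) / se

rng = np.random.default_rng(3)
T = 500
x = np.cumsum(rng.standard_normal(T))    # common I(1) trend
y = 2.0 * x + rng.standard_normal(T)     # cointegrated with beta = 2
beta_hat, t_resid = engle_granger_residual_t(y, x)
print(beta_hat, t_resid)   # beta_hat near 2 (superconsistency), t strongly negative
```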
In principle it is possible to generalize this single-equation approach to more
than two variables. This encounters, however, some conceptual problems. First,
there is the possibility of more than one linearly independent cointegrating relationship,
which cannot be detected by a single regression. Second, the dependent
variable in the regression may not be part of the cointegrating relation, which might
involve only the other variables. In such a situation the cointegrating regression is
again subject to the spurious regression problem. These issues turned the interest
of the profession towards multivariate approaches. Chapter 16 presents alternative
procedures and discusses the testing, estimation, and interpretation of cointegrating
relationships in detail.
19. Thereby, in contrast to ordinary OLS regressions, it is irrelevant which variable is treated as the
left-hand and which as the right-hand variable.
20. This problem was first analyzed by Nicholls and Pagan (1984) and Pagan (1984) in a stationary
context.
The residuals from this regression, denoted by $e_t$, are displayed in Fig. 7.6b. The
ADF unit root test applied to these residuals leads to a value of $-3.617$ for the t-statistic,
where an autoregressive correction with 13 lags was necessary according to the AIC
criterion. The corresponding value of the t-statistic resulting from the PP unit root
test using a Bartlett window with bandwidth 7 is $-4.294$. Taking a significance
level of 5 %, the critical value according to Phillips and Ouliaris (1990, Table IIb) is
$-3.365$.21 Thus, the ADF as well as the PP test reject the null hypothesis of a unit
root in the residuals. This implies that inflation and the short-term interest rate are
cointegrated.
The previous sections demonstrated that integrated variables have to be
handled with care. In this section we therefore examine some rules of
thumb which should serve as a guideline in practical empirical work. These rules
are summarized in Table 7.5. This section follows very closely the paper by
Stock and Watson (1988b) (see also Campbell and Perron 1991).22 Consider the
linear regression model
$$Y_t = \beta_0 + \beta_1 X_{1,t} + \ldots + \beta_K X_{K,t} + \varepsilon_t, \tag{7.5}$$
and suppose first that the following two assumptions hold:
(1) The disturbance term $\varepsilon_t$ is white noise and is uncorrelated with all regressors.
This is, for example, the case if the regressors are deterministic or exogenous.
(2) All regressors are either deterministic or stationary processes.
If Eq. (7.5) represents the true data-generating process, $\{Y_t\}$ must be a stationary
process. Under the above assumptions, the OLS-estimator is consistent and the
OLS-estimates are asymptotically normally distributed, so that the corresponding
t- and F-statistics will be approximately distributed as t- and F-distributions.
21. For comparison, the corresponding critical value according to MacKinnon (1991) is $-2.872$.
22. For a thorough analysis the interested reader is referred to Sims et al. (1990).
Fig. 7.6 Cointegration of inflation and three-month LIBOR. (a) Inflation and three-month
LIBOR. (b) Residuals from cointegrating regression
Consider now the case that assumption 2 is violated and that some or all
regressors are integrated, but that instead one of the two following assumptions
holds:
Under assumptions 1 and 2.a or 2.b the OLS-estimator remains consistent. Also the
corresponding t- and F-statistics remain valid so that the appropriate critical values
can be retrieved from the t-, respectively F-distribution. If neither assumption 2.a
nor 2.b holds, but instead the following assumption does:
(2.c) The relevant coefficients are coefficients of integrated variables and the
regression cannot be rewritten in a way that they become coefficients of
stationary variables.
If assumption 1 remains valid, but assumption 2.c holds instead of 2.a and 2.b, the
OLS-estimator is still consistent. However, the standard asymptotic theory for the t-
and the F-statistic fails, so that they become useless for standard statistical inference.
If we simply regress one variable on another in levels, the error term $\varepsilon_t$ is likely
not to follow a white noise process. In addition, it may even be correlated with some
regressors. Suppose that we replace assumption 1 by:
(1.a) The integrated dependent variable is cointegrated with at least one integrated
regressor such that the error term is stationary, but may remain autocorrelated
or correlated with the regressors.
Under assumptions 1.a and 2.a, respectively 2.b, the regressors are stationary but
correlated with the disturbance term; in this case the OLS-estimator becomes
inconsistent. This situation is known as the classic omitted-variable bias,
simultaneous-equation bias, or errors-in-variables bias. Under assumptions 1.a and
2.c, however, the OLS-estimator is consistent for the coefficients of interest, but the standard
asymptotic theory fails. Finally, if both the dependent variable and the regressors are
integrated without being cointegrated, then the disturbance term is integrated and
the OLS-estimator becomes inconsistent. This is the spurious regression problem
treated in Sect. 7.5.1.
$$\Delta R3M_t = c + (\phi_1+\phi_2+\phi_3-1)\,R3M_{t-1} - (\phi_2+\phi_3)\,\Delta R3M_{t-1} - \phi_3\,\Delta R3M_{t-2},$$
$$\Delta R3M_t = c + \delta_1\big(INFL_{t-1} - \hat\beta\, R3M_{t-1}\big) + \delta_2\,\Delta INFL_{t-2} + \varepsilon_t.$$
The prices of financial market securities are often shaken by large and time-varying
shocks. The amplitudes of these price movements are not constant. There are periods
of high volatility and periods of low volatility. Within these periods volatility
seems to be positively autocorrelated: high amplitudes are likely to be followed
by high amplitudes and low amplitudes by low amplitudes. This observation, which
is particularly relevant for high-frequency data such as daily stock
market returns, implies that the conditional variance of the one-period forecast error
is no longer constant (homoskedastic), but time-varying (heteroskedastic). This
insight motivated Engle (1982) and Bollerslev (1986) to model the time-varying
variance thereby triggering a huge and still growing literature.1 The importance of
volatility models stems from the fact that the price of an option crucially depends
on the variance of the underlying security price. Thus with the surge of derivative
markets in the last decades the application of such models has seen a tremendous
rise. Another use of volatility models is to assess the risk of an investment. In the
computation of the so-called value at risk (VaR), these models have become an
indispensable tool. In the banking industry, due to the regulations of the Basel
accords, such assessments are particularly relevant for the computation of the
required equity capital backing up assets of different risk categories.
The following exposition focuses on the class of autoregressive conditional
heteroskedasticity (ARCH) models and their generalization, the generalized
autoregressive conditional heteroskedasticity (GARCH) models. These
1. Robert F. Engle III was awarded the Nobel prize in 2003 for his work on time-varying volatility.
His Nobel lecture (Engle 2004) is a nice and readable introduction to this literature.
models form the basis for even more generalized models (see Bollerslev et al. (1994)
or Gouriéroux (1997)). Campbell et al. (1997) provide a broader economically
motivated approach to the econometric analysis of financial market data.
$$P_t X_{t+1} = c + \phi X_t.$$
Thus the conditional as well as the unconditional variance of the forecast error are
constant. In addition, the conditional variance is smaller than the unconditional one,
so that the conditional forecast is more precise
because it uses more information. Similar arguments can be made for ARMA
models in general.
2. Instead of assuming $Z_t \sim \text{WN}(0, \sigma^2)$, we make for convenience the stronger assumption that
$Z_t \sim \text{IID}(0, \sigma^2)$.
The volatility of financial market prices exhibits a systematic behavior, so that the
conditional forecast error variance is no longer constant. This observation led Engle
(1982) to consider the following simple model of heteroskedasticity (non-constant
variance).
Definition 8.1 (ARCH(1) Model). A stochastic process $\{Z_t\}$, $t \in \mathbb{Z}$, is called an
autoregressive conditional heteroskedastic process of order one, ARCH(1) process,
if it is the solution of the following stochastic difference equation:
$$Z_t = \nu_t \sqrt{\alpha_0 + \alpha_1 Z_{t-1}^2} \qquad \text{with } \alpha_0 > 0 \text{ and } 0 < \alpha_1 < 1, \tag{8.1}$$
where $\nu_t \sim \text{IID } N(0,1)$ and where $\nu_t$ and $Z_{t-1}$ are independent of each other for
all $t \in \mathbb{Z}$.
We will discuss the implications of this simple model below and consider
generalizations in the next sections. First we prove the following theorem.
Theorem 8.1. Under the conditions stated in the definition of the ARCH(1) process,
the difference equation (8.1) possesses a unique and strictly stationary solution with
$EZ_t^2 < \infty$. This solution is given by
$$Z_t = \nu_t \sqrt{\alpha_0 \Big(1 + \sum_{j=1}^{\infty} \alpha_1^j\, \nu_{t-1}^2 \nu_{t-2}^2 \cdots \nu_{t-j}^2\Big)}. \tag{8.2}$$
$$\cdots = \alpha_0\nu_t^2 + \alpha_0\alpha_1\,\nu_t^2\nu_{t-1}^2 + \cdots + \alpha_0\alpha_1^k\,\nu_t^2\nu_{t-1}^2\cdots\nu_{t-k}^2 + \cdots$$
Consider therefore
$$Y_t' = \alpha_0\nu_t^2 + \alpha_0\sum_{j=1}^{\infty}\alpha_1^j\,\nu_t^2\nu_{t-1}^2\cdots\nu_{t-j}^2.$$
The right-hand side of the above expression contains only nonnegative terms.
Moreover, making use of the IID N(0,1) assumption on $\{\nu_t\}$,
$$EY_t' = E(\alpha_0\nu_t^2) + \alpha_0\, E\Big(\sum_{j=1}^{\infty}\alpha_1^j\,\nu_t^2\nu_{t-1}^2\cdots\nu_{t-j}^2\Big) = \alpha_0\sum_{j=0}^{\infty}\alpha_1^j = \frac{\alpha_0}{1-\alpha_1}.$$
Thus, $0 \le Y_t' < \infty$ a.s. Therefore, $\{Y_t'\}$ is strictly stationary and satisfies the
difference equation (8.3). This implies that $Z_t = \sqrt{Y_t'}$ is also strictly stationary
and satisfies the difference equation (8.1).
To prove uniqueness, we follow Giraitis et al. (2000). For any fixed $t$, it follows
from the definitions of $Y_t$ and $Y_t'$ that for any $k \ge 1$
$$|Y_t - Y_t'| \le \alpha_1^{k+1}\,\nu_t^2\nu_{t-1}^2\cdots\nu_{t-k}^2\,|Y_{t-k-1}| + \alpha_0\sum_{j=k+1}^{\infty}\alpha_1^j\,\nu_t^2\nu_{t-1}^2\cdots\nu_{t-j}^2.$$
Remark 8.1. Note that the normality assumption is not necessary for the proof. The
assumption $\nu_t \sim \text{IID}(0,1)$ would be sufficient. Indeed, in practice it has proven
useful to adopt distributions with fatter tails than the normal, like the t-distribution
(see the discussion in Sect. 8.1.3).
Given the assumptions made above, $\{Z_t\}$ has the following properties. In particular, for $h \ge 1$,
$$EZ_tZ_{t-h} = E\nu_t\nu_{t-h}\;E\Big(\sqrt{\alpha_0+\alpha_1Z_{t-1}^2}\,\sqrt{\alpha_0+\alpha_1Z_{t-h-1}^2}\Big) = 0.$$
This follows from the independence assumption between $\nu_t$ and $Z_{t-1}$ and from
the stationarity of $\{Z_t\}$. Because $\alpha_0 > 0$ and $0 < \alpha_1 < 1$, the variance is always
strictly positive and finite.
(iv) As $\nu_t$ is normally distributed, its skewness, $E\nu_t^3$, equals zero. The independence
assumption between $\nu_t$ and $Z_{t-1}^2$ then implies that the skewness of $Z_t$ is also
zero, i.e.
$$EZ_t^3 = 0.$$
$Z_t$ therefore has a symmetric distribution.
Properties (i), (ii), and (iii) show that $\{Z_t\}$ is a white noise process. According
to Theorem 8.1 it is not only stationary but even strictly stationary. Thus $\{Z_t\}$ is
uncorrelated with $Z_{t-1}, Z_{t-2}, \ldots$, but not independent of its past! In particular we
have:
$$E(Z_t \mid Z_{t-1}, Z_{t-2}, \ldots) = E\nu_t\,\sqrt{\alpha_0+\alpha_1Z_{t-1}^2} = 0,$$
$$V(Z_t \mid Z_{t-1}, Z_{t-2}, \ldots) = E\big(Z_t^2 \mid Z_{t-1}, Z_{t-2}, \ldots\big) = E\nu_t^2\,\big(\alpha_0+\alpha_1Z_{t-1}^2\big) = \alpha_0 + \alpha_1Z_{t-1}^2.$$
The conditional variance of $Z_t$ therefore depends on $Z_{t-1}$. Note that this dependence
is positive because $\alpha_1 > 0$.
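These properties are easy to see in a small simulation (our own sketch; the parameter values are illustrative): the simulated ARCH(1) series is serially uncorrelated, while its squares are clearly autocorrelated.

```python
import numpy as np

def simulate_arch1(alpha0, alpha1, T, seed=0):
    """Simulate Z_t = nu_t * sqrt(alpha0 + alpha1 * Z_{t-1}^2), nu_t ~ IID N(0,1)."""
    rng = np.random.default_rng(seed)
    nu = rng.standard_normal(T)
    z = np.zeros(T)
    for t in range(1, T):
        z[t] = nu[t] * np.sqrt(alpha0 + alpha1 * z[t - 1] ** 2)
    return z

def acf1(x):
    """First-order sample autocorrelation."""
    x = x - x.mean()
    return (x[1:] @ x[:-1]) / (x @ x)

z = simulate_arch1(alpha0=1.0, alpha1=0.5, T=100_000)
print(acf1(z))        # close to zero: {Z_t} is white noise
print(acf1(z ** 2))   # clearly positive: Z_t is not independent of its past
```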
$$(1-3\alpha_1^2)\,EZ_t^4 = \frac{3\alpha_0^2(1+\alpha_1)}{1-\alpha_1} \quad\Longrightarrow\quad EZ_t^4 = \frac{1}{1-3\alpha_1^2}\,\frac{3\alpha_0^2(1+\alpha_1)}{1-\alpha_1}.$$
$EZ_t^4$ is therefore positive and finite if and only if $3\alpha_1^2 < 1$, i.e. if $0 < \alpha_1 <
1/\sqrt{3} \approx 0.5774$. For high correlation of the conditional variance, i.e. $\alpha_1 >
1/\sqrt{3}$, the fourth moment and therefore also all higher even moments no longer
exist. The kurtosis is
$$\frac{EZ_t^4}{[EZ_t^2]^2} = 3\,\frac{1-\alpha_1^2}{1-3\alpha_1^2} > 3,$$
if $EZ_t^4$ exists. The heavy-tail property manifests itself in a kurtosis greater than 3,
which is the kurtosis of the normal distribution. The distribution of $Z_t$ is therefore
leptokurtic: it has a sharper peak and fatter tails than the normal distribution.
Finally, we want to examine the autocorrelation function of Zt2 . This will lead to
a test for ARCH effects, i.e. for time varying volatility (see Sect. 8.2 below).
3. The case $\alpha_1 = 1$ is treated in Sect. 8.1.4.
4. As $\nu_t \sim N(0,1)$, its even moments $m_{2k} = E\nu_t^{2k}$, $k = 1, 2, \ldots$, are given by $m_{2k} = \prod_{j=1}^{k}(2j-1)$.
Thus we get $m_4 = 3$, $m_6 = 15$, etc. As the normal distribution is symmetric, all odd moments are
equal to zero.
Theorem 8.2. Assuming that $EZ_t^4$ exists, $Y_t = \frac{Z_t^2}{\alpha_0}$ has the same autocorrelation
function as the AR(1) process $W_t = \alpha_1 W_{t-1} + U_t$ with $U_t \sim \text{WN}(0,1)$. In addition,
under the assumption $0 < \alpha_1 < 1$, the process $\{W_t\}$ is also causal with respect
to $\{U_t\}$.
$$\begin{aligned}
\gamma_Y(h) &= EY_tY_{t-h} - EY_t\,EY_{t-h} = EY_tY_{t-h} - \frac{1}{(1-\alpha_1)^2}\\
&= E\big(\nu_t^2(1+\alpha_1Y_{t-1})\,Y_{t-h}\big) - \frac{1}{(1-\alpha_1)^2}\\
&= EY_{t-h} + \alpha_1\,EY_{t-1}Y_{t-h} - \frac{1}{(1-\alpha_1)^2}\\
&= \frac{1}{1-\alpha_1} + \alpha_1\,\gamma_Y(h-1) + \frac{\alpha_1}{(1-\alpha_1)^2} - \frac{1}{(1-\alpha_1)^2}\\
&= \alpha_1\,\gamma_Y(h-1) + \frac{1-\alpha_1+\alpha_1-1}{(1-\alpha_1)^2} = \alpha_1\,\gamma_Y(h-1).
\end{aligned}$$
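The geometric decay implied by Theorem 8.2 can be checked by simulation. The following sketch is our own, with an illustrative parameter choice $\alpha_1 = 0.3$ (small enough that the required higher moments exist):

```python
import numpy as np

def sample_acf(x, h):
    """Sample autocorrelation at lag h."""
    x = x - x.mean()
    return (x[h:] @ x[:-h]) / (x @ x)

# Simulate an ARCH(1) process and estimate the ACF of its squares
rng = np.random.default_rng(4)
alpha0, alpha1, T = 1.0, 0.3, 200_000
nu = rng.standard_normal(T)
z = np.zeros(T)
for t in range(1, T):
    z[t] = nu[t] * np.sqrt(alpha0 + alpha1 * z[t - 1] ** 2)

acfs = [sample_acf(z ** 2, h) for h in (1, 2, 3)]
print([round(a, 3) for a in acfs])   # decays roughly like alpha1**h
```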
The simple ARCH(1) model can be and has been generalized in several directions.
A straightforward generalization, proposed by Engle (1982), consists in allowing
further lags to enter the ARCH equation (8.1). This leads to the ARCH(p) model:
Fig. 8.1 Simulation of two ARCH(1) processes ($\alpha_1 = 0.9$ and $\alpha_1 = 0.5$)
$$\text{ARCH}(p): \quad Z_t = \sigma_t\nu_t \quad\text{with}\quad \sigma_t^2 = \alpha_0 + \sum_{j=1}^{p}\alpha_j Z_{t-j}^2, \tag{8.4}$$
$$\text{GARCH}(p,q): \quad Z_t = \sigma_t\nu_t \quad\text{with}\quad \sigma_t^2 = \alpha_0 + \sum_{j=1}^{p}\alpha_j Z_{t-j}^2 + \sum_{j=1}^{q}\beta_j\sigma_{t-j}^2. \tag{8.5}$$
If $\sum_{j=1}^{p}\alpha_j + \sum_{j=1}^{q}\beta_j < 1$, the unconditional variance is
$$\sigma_Z^2 = V(Z_t) = \frac{\alpha_0}{1 - \sum_{j=1}^{p}\alpha_j - \sum_{j=1}^{q}\beta_j}.$$
5. A detailed exposition of the GARCH(1,1) model is given in Sect. 8.1.4.
This condition is sufficient, but not necessary.6 Furthermore, $\{Z_t\}$ is a white noise
process with the heavy-tail property if it is strictly stationary with a finite fourth
moment.
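As a quick numerical check of the variance formula (our own sketch; the parameter values are illustrative), one can simulate a GARCH(1,1) process and compare the sample variance with $\alpha_0/(1-\alpha_1-\beta)$:

```python
import numpy as np

def simulate_garch11(alpha0, alpha1, beta, T, seed=0):
    """Simulate Z_t = nu_t * sigma_t with
    sigma_t^2 = alpha0 + alpha1 * Z_{t-1}^2 + beta * sigma_{t-1}^2."""
    rng = np.random.default_rng(seed)
    nu = rng.standard_normal(T)
    z = np.zeros(T)
    s2 = alpha0 / (1.0 - alpha1 - beta)   # start from the unconditional variance
    for t in range(T):
        z[t] = nu[t] * np.sqrt(s2)
        s2 = alpha0 + alpha1 * z[t] ** 2 + beta * s2   # update for next period
    return z

alpha0, alpha1, beta = 0.1, 0.1, 0.8
z = simulate_garch11(alpha0, alpha1, beta, T=200_000)
print(z.var())                            # sample variance
print(alpha0 / (1.0 - alpha1 - beta))     # theoretical unconditional variance
```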
In addition, $\{Z_t^2\}$ is a causal and invertible ARMA$(\max\{p,q\}, q)$ process satisfying
the following difference equation:
$$Z_t^2 = \alpha_0 + \sum_{j=1}^{p}\alpha_j Z_{t-j}^2 + \sum_{j=1}^{q}\beta_j\sigma_{t-j}^2 + e_t
= \alpha_0 + \sum_{j=1}^{\max\{p,q\}}(\alpha_j+\beta_j)\,Z_{t-j}^2 + e_t - \sum_{j=1}^{q}\beta_j e_{t-j},$$
where
$$e_t = Z_t^2 - \sigma_t^2 = (\nu_t^2-1)\Big(\alpha_0 + \sum_{j=1}^{p}\alpha_j Z_{t-j}^2 + \sum_{j=1}^{q}\beta_j\sigma_{t-j}^2\Big).$$
Note, however, that there is a circularity here because the noise process $\{e_t\}$ is defined
in terms of $Z_t^2$ and is therefore not an exogenous process driving $Z_t^2$. Thus, one has
to be cautious in interpreting $\{Z_t^2\}$ as an ARMA process.
Further generalizations of the GARCH(p,q) model can be obtained by allowing
deviations from the normal distribution for $\nu_t$. In particular, distributions such as
the t-distribution, which put more weight on extreme values, have become popular.
This seems warranted as prices on financial markets exhibit large and sudden
fluctuations.7
6. Zadrozny (2005) derives a necessary and sufficient condition for the existence of the fourth
moment.
7. A thorough treatment of the probabilistic properties of GARCH processes can be found in Nelson
(1990), Bougerol and Picard (1992a), Giraitis et al. (2000), Klüppelberg et al. (2004, theorem 2.1),
and Lindner (2009).
where $g$ is a function of the volatility $\sigma_t^2$ and where $M_t$ consists of a vector of
regressors, including lagged values of $X_t$. If $M_t = (1, X_{t-1})'$ we get the AR(1)-
ARCH-M model. The most commonly used specification for $g$ is a linear function:
$g(\sigma_t^2) = \delta_0 + \delta_1\sigma_t^2$. In the asset-pricing literature, higher volatility would require a
higher return to compensate the investor for the additional risk. Thus, if $X_t$ denotes
the return on some asset, we expect $\delta_1$ to be positive. Note that any time variation
in $\sigma_t^2$ translates into serial correlation of $\{X_t\}$ (see Hong 1991, for details). Of
course, one could easily generalize the model to allow for more sophisticated mean
and variance equations.
where, by convention, the empty product $\prod_{i=1}^{j}\lambda_{t-i}$ equals one for $j = 0$. The solution
is unique given the sequence $\{\nu_t\}$, and it is (weakly) stationary with variance
$EZ_t^2 = \frac{\alpha_0}{1-\alpha_1-\beta} < \infty$ if $\alpha_1 + \beta < 1$.
Proof. The proof proceeds similarly to that of Theorem 8.1. For this purpose, we define
$Y_t = \sigma_t^2$ and $\lambda_t = \alpha_1\nu_t^2 + \beta$, and rewrite the GARCH(1,1) model as
$$Y_t = \alpha_0 + \alpha_0\lambda_{t-1} + \cdots + \alpha_0\lambda_{t-1}\cdots\lambda_{t-k} + \lambda_{t-1}\cdots\lambda_{t-k}\lambda_{t-k-1}\,Y_{t-k-1}
= \alpha_0\sum_{j=0}^{k}\prod_{i=1}^{j}\lambda_{t-i} + \prod_{i=1}^{k+1}\lambda_{t-i}\;Y_{t-k-1}.$$
Consider therefore
$$Y_t' = \alpha_0\sum_{j=0}^{\infty}\prod_{i=1}^{j}\lambda_{t-i}. \tag{8.8}$$
The right-hand side of this expression converges almost surely, as can be seen from
the following argument. Given that $\lambda_t$ is IID and given the assumption $E\log(\alpha_1\nu_t^2 + \beta) < 0$, the strong law of large numbers (Theorem C.5) implies that
$$\limsup_{j\to\infty}\Big(\frac{1}{j}\sum_{i=1}^{j}\log\lambda_{t-i}\Big) < 0 \quad\text{a.s.,}$$
or equivalently,
$$\limsup_{j\to\infty}\,\log\Big(\prod_{i=1}^{j}\lambda_{t-i}\Big)^{1/j} < 0 \quad\text{a.s.}$$
Thus,
$$\limsup_{j\to\infty}\Big(\prod_{i=1}^{j}\lambda_{t-i}\Big)^{1/j} < 1 \quad\text{a.s.}$$
The application of the root test then shows that the infinite series (8.8) converges
almost surely. Thus, $\{Y_t'\}$ is well-defined. It is easy to see that $\{Y_t'\}$ is strictly
stationary and satisfies the difference equation. Moreover, if $\alpha_1+\beta < 1$, we get
$$EY_t' = \alpha_0\sum_{j=0}^{\infty}E\prod_{i=1}^{j}\lambda_{t-i} = \alpha_0\sum_{j=0}^{\infty}(\alpha_1+\beta)^j = \frac{\alpha_0}{1-\alpha_1-\beta}.$$
$$|Y_t - Y_t'| = \lambda_{t-1}\,|Y_{t-1} - Y_{t-1}'| = \Big(\prod_{i=1}^{k}\lambda_{t-i}\Big)|Y_{t-k} - Y_{t-k}'|
\le \Big(\prod_{i=1}^{k}\lambda_{t-i}\Big)|Y_{t-k}| + \Big(\prod_{i=1}^{k}\lambda_{t-i}\Big)|Y_{t-k}'|.$$
The assumption $E\log\lambda_t = E\log(\alpha_1\nu_t^2+\beta) < 0$ together with the strong law of
large numbers (Theorem C.5) imply
$$\prod_{i=1}^{k}\lambda_{t-i} = \Big(\exp\Big(\frac{1}{k}\sum_{i=1}^{k}\log\lambda_{t-i}\Big)\Big)^{k} \longrightarrow 0 \quad\text{a.s.}$$
As both solutions are strictly stationary, the distributions of $|Y_{t-k}|$ and $|Y_{t-k}'|$
do not depend on $t$, so that both $\big(\prod_{i=1}^{k}\lambda_{t-i}\big)|Y_{t-k}|$ and $\big(\prod_{i=1}^{k}\lambda_{t-i}\big)|Y_{t-k}'|$
converge in probability to zero. Thus, $Y_t = Y_t'$ a.s. once the sequence $\{\lambda_t\}$, respectively
$\{\nu_t\}$, is given. Because $Z_t = \nu_t\sqrt{Y_t}$ this completes the proof. $\square$
Thus, the condition $\alpha_1+\beta < 1$ is sufficient, but not necessary, to ensure the
existence of a strictly stationary solution. Even when $\alpha_1+\beta = 1$, a strictly
stationary solution exists, albeit one with infinite variance. This case is known as the
IGARCH model and is discussed below. In the case $\alpha_1+\beta < 1$, the Borel-Cantelli
lemma can be used, as in Theorem 8.1, to establish the uniqueness of the solution.
Further details can be found in the references listed in footnote 7.
Assume that $\alpha_1+\beta < 1$; then a unique strictly stationary process $\{Z_t\}$ with finite
variance which satisfies the above difference equation exists. In particular, $Z_t \sim
\text{WN}(0, \sigma_Z^2)$ such that
$$V(Z_t) = \frac{\alpha_0}{1-\alpha_1-\beta}.$$
Fig. 8.2 Parameter region for which a strictly stationary solution to the GARCH(1,1) process
exists, assuming $\nu_t \sim \text{IID } N(0,1)$. The boundaries $\sqrt{3}\,\alpha_1/(1-\beta) = 1$, $\alpha_1+\beta = 1$, and
$E\log(\alpha_1\nu_t^2+\beta) = 0$ divide the $(\alpha_1, \beta)$ plane into the regions I-IV
The assumption $1-\alpha_1-\beta > 0$ guarantees that the variance exists. The third moment
of $Z_t$ is zero due to the assumption of a symmetric distribution for $\nu_t$. The condition
for the existence of the fourth moment is $\sqrt{3}\,\frac{\alpha_1}{1-\beta} < 1$.8 The kurtosis is then
$$\frac{EZ_t^4}{[EZ_t^2]^2} = 3\,\frac{1-(\alpha_1+\beta)^2}{1-\beta^2-2\alpha_1\beta-3\alpha_1^2} > 3,$$
if $EZ_t^4$ exists.9 Therefore the GARCH(1,1) model also possesses the heavy-tail
property: the distribution of $Z_t$ is more peaked in the center and has fatter tails than the normal distribution.
Figure 8.2 shows how the different assumptions and conditions divide up the
parameter space. In region I, all conditions are fulfilled: the process has a strictly
stationary solution with finite variance and finite kurtosis. In region II, the kurtosis
no longer exists, but the variance does, as $\alpha_1+\beta < 1$ still holds. In region III, the
process has infinite variance, but a strictly stationary solution still exists. In region
IV, no such solution exists.
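The sign of $E\log(\alpha_1\nu_t^2+\beta)$ can be checked numerically by Monte Carlo integration (our own sketch; the parameter points are illustrative, and with $\beta = 0$ the process reduces to an ARCH(1)):

```python
import numpy as np

def e_log_lambda(alpha1, beta, n=2_000_000, seed=5):
    """Monte Carlo estimate of E log(alpha1 * nu^2 + beta) with nu ~ N(0,1).
    A strictly stationary GARCH(1,1) solution exists iff this value is < 0."""
    rng = np.random.default_rng(seed)
    nu2 = rng.standard_normal(n) ** 2
    return float(np.mean(np.log(alpha1 * nu2 + beta)))

print(e_log_lambda(0.1, 0.8))   # alpha1 + beta < 1: negative (finite variance)
print(e_log_lambda(2.0, 0.0))   # variance infinite, yet negative: still strictly stationary
print(e_log_lambda(5.0, 0.0))   # positive: no strictly stationary solution
```

For $\beta = 0$ the exact value is $\log\alpha_1 + E\log\nu_t^2 \approx \log\alpha_1 - 1.27$, which the Monte Carlo estimates reproduce closely.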
Viewing the equation for $\sigma_t^2$ as a stochastic difference equation, its solution is
given by
$$\sigma_t^2 = \frac{\alpha_0}{1-\beta} + \alpha_1\sum_{j=0}^{\infty}\beta^j Z_{t-1-j}^2. \tag{8.9}$$
8. A necessary and sufficient condition is $(\alpha_1+\beta)^2 + 2\alpha_1^2 < 1$ (see Zadrozny (2005)).
9. The condition for the existence of the fourth moment implies $3\alpha_1^2 < (1-\beta)^2$, so that the
denominator satisfies $1-\beta^2-2\alpha_1\beta-3\alpha_1^2 > 1-\beta^2-2\alpha_1\beta-(1-\beta)^2 = 2\beta(1-\alpha_1-\beta) > 0$.
This expression is well-defined because $0 < \beta < 1$, so that the infinite sum
converges. The conditional variance given the infinite past is therefore equal to
$$V(Z_t \mid Z_{t-1}, Z_{t-2}, \ldots) = E\big(Z_t^2 \mid Z_{t-1}, Z_{t-2}, \ldots\big) = \frac{\alpha_0}{1-\beta} + \alpha_1\sum_{j=0}^{\infty}\beta^j Z_{t-1-j}^2.$$
Thus, the conditional variance depends on the entire history of the time series and not just on $Z_{t-1}$ as in the case of the ARCH(1) model. As all coefficients are assumed to be positive, the clustering of volatility is more persistent than for the ARCH(1) model.
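The equivalence between the recursive variance equation and the infinite-sum representation (8.9) can be checked numerically. The following sketch (arbitrary illustrative parameter values; truncation at 500 lags) compares the two:

```python
import numpy as np

# Numerical check that the recursion sigma2_t = a0 + a1*Z_{t-1}^2 + b*sigma2_{t-1}
# agrees with the (truncated) infinite-sum representation (8.9).
rng = np.random.default_rng(1)
a0, a1, b = 0.05, 0.1, 0.85
T = 2000
nu = rng.standard_normal(T)
z = np.zeros(T)
sigma2 = np.full(T, a0 / (1 - a1 - b))     # initialize at the unconditional variance
for t in range(1, T):
    sigma2[t] = a0 + a1 * z[t - 1] ** 2 + b * sigma2[t - 1]
    z[t] = np.sqrt(sigma2[t]) * nu[t]

# Representation (8.9), truncated at J lags: a0/(1-b) + a1 * sum_j b^j Z_{t-1-j}^2
t_eval, J = T - 1, 500
approx = a0 / (1 - b) + a1 * sum(b**j * z[t_eval - 1 - j] ** 2 for j in range(J))
```

Because $0 < \beta < 1$, both the truncation error and the influence of the starting value decay geometrically, so the two numbers coincide to machine precision.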
Defining a new time series $\{e_t\}$ by $e_t = Z_t^2 - \sigma_t^2 = (\nu_t^2-1)(\alpha_0 + \alpha_1 Z_{t-1}^2 + \beta\sigma_{t-1}^2)$, one can verify that $Z_t^2$ obeys the stochastic difference equation
\[
Z_t^2 = \alpha_0 + (\alpha_1+\beta)\,Z_{t-1}^2 + e_t - \beta e_{t-1}. \tag{8.10}
\]
As $\{e_t\}$ is white noise, $Z_t^2$ thus follows an ARMA(1,1) process with AR coefficient $\varphi = \alpha_1+\beta$ and MA coefficient $-\beta$. Its autocorrelation function is
\[
\rho_{Z^2}(1) = \frac{(1-\beta^2-\alpha_1\beta)\,\alpha_1}{1-\beta^2-2\alpha_1\beta} = \frac{(1-\beta\varphi)(\varphi-\beta)}{1+\beta^2-2\varphi\beta},
\qquad
\rho_{Z^2}(h) = (\alpha_1+\beta)\,\rho_{Z^2}(h-1) = \varphi\,\rho_{Z^2}(h-1), \quad h = 2, 3, \ldots \tag{8.11}
\]
In the borderline case $\alpha_1 + \beta = 1$, known as the integrated GARCH or IGARCH model, equation (8.10) reduces to $Z_t^2 = \alpha_0 + Z_{t-1}^2 + e_t - \beta e_{t-1}$ with $e_t = Z_t^2 - \sigma_t^2 = (\nu_t^2-1)(\alpha_0 + (1-\beta)Z_{t-1}^2 + \beta\sigma_{t-1}^2)$. As $\{e_t\}$ is white noise, the squared innovations $Z_t^2$ behave like a random walk with an MA(1) error term. Although the variance of $Z_t$ becomes infinite, the difference equation still allows for a strictly stationary solution provided that $\mathbb{E}\log(\alpha_1\nu_t^2 + \beta) < 0$ (see Theorem 8.3
and the citations in footnote 7 for further details).^10 It has been shown by Lumsdaine (1996) and Lee and Hansen (1994) that standard inference can still be applied even though $\alpha_1+\beta = 1$. The model may easily be generalized to higher lag orders.
Forecasting
On many occasions it is necessary to obtain forecasts of the conditional variance $\sigma_t^2$. An example is given in Sect. 8.4 where the value at risk (VaR) of a portfolio several periods ahead must be evaluated. Denote by $P_t\sigma_{t+h}^2$ the $h$-period-ahead forecast based on information available in period $t$. We assume that predictions are based on the infinite past. Then the one-period-ahead forecast based on $Z_t$ and $\sigma_t^2$, respectively $\nu_t$ and $\sigma_t^2$, is:
\[
P_t\sigma_{t+1}^2 = \alpha_0 + \alpha_1 Z_t^2 + \beta\sigma_t^2 = \alpha_0 + (\alpha_1\nu_t^2 + \beta)\,\sigma_t^2. \tag{8.12}
\]
Iterating on the variance equation yields the forecasts for longer horizons:
\[
P_t\sigma_{t+h}^2 = \alpha_0\sum_{j=0}^{h-2}(\alpha_1+\beta)^j + (\alpha_1+\beta)^{h-1}\,P_t\sigma_{t+1}^2, \qquad h = 2, 3, \ldots \tag{8.13}
\]
Assuming $\alpha_1+\beta < 1$, the second term in the above expression vanishes as $h$ goes to infinity. Thus, the contribution of the current conditional variance vanishes when we look further and further into the future. The forecast of the conditional variance then approaches the unconditional one: $\lim_{h\to\infty} P_t\sigma_{t+h}^2 = \frac{\alpha_0}{1-\alpha_1-\beta}$. If $\alpha_1+\beta = 1$, as in the IGARCH model, the contribution of the current conditional variance is constant, but diminishes to zero relative to the first term. Finally, if $\alpha_1+\beta > 1$, the two terms are of the same order and we have a particularly persistent situation.
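The forecast recursion can be computed in a few lines. The sketch below (hypothetical parameter values and states) iterates the equivalent one-step form $P_t\sigma_{t+h}^2 = \alpha_0 + (\alpha_1+\beta)\,P_t\sigma_{t+h-1}^2$ and confirms convergence to the unconditional variance when $\alpha_1+\beta < 1$:

```python
import numpy as np

def garch11_variance_forecasts(a0, a1, b, z_t, sigma2_t, H):
    """h-step-ahead variance forecasts P_t sigma^2_{t+h}, h = 1..H.
    Step one is (8.12); each further step applies
    P_t sigma^2_{t+h} = a0 + (a1 + b) * P_t sigma^2_{t+h-1}."""
    phi = a1 + b
    f = [a0 + a1 * z_t**2 + b * sigma2_t]   # one-step forecast (8.12)
    for _ in range(1, H):
        f.append(a0 + phi * f[-1])
    return np.array(f)

a0, a1, b = 0.1, 0.1, 0.8                   # illustrative values with a1 + b < 1
fc = garch11_variance_forecasts(a0, a1, b, z_t=2.0, sigma2_t=1.5, H=200)
uncond = a0 / (1 - a1 - b)                  # unconditional variance, here 1.0
```

Here the one-step forecast is $0.1 + 0.1\cdot 4 + 0.8\cdot 1.5 = 1.7$, and the forecasts decay geometrically toward the unconditional variance.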
In practice, the parameters of the model are unknown and must therefore be replaced by estimates. The method can easily be adapted to higher-order models. Instead of using the recursive approach outlined above, it is possible to use simulation methods by drawing repeatedly from the empirical distribution of the $\hat\nu_t = \hat Z_t/\hat\sigma_t$. This has the advantage of capturing deviations from the underlying distributional assumptions (see Sect. 8.4 for a comparison of both methods). Such methods must be applied if nonlinear models for the conditional variance, like the TARCH model, are employed.
10. As the variance becomes infinite, the IGARCH process is an example of a stochastic process which is strictly stationary, but not stationary.
8.2 Tests for Heteroskedasticity
The first test is based on the autocorrelation function of squared residuals from a preliminary regression. This preliminary regression, or mean regression, produces a series $\hat Z_t$ which should be approximately white noise if the equation is well specified. Then we can look at the ACF of the squared residuals $\{\hat Z_t^2\}$ and apply the Ljung-Box test (see Eq. (4.4)). Thus the test can be broken down into three steps.
(i) Estimate an ARMA model for $\{X_t\}$ and retrieve the residuals $\hat Z_t$ from this model. Compute $\hat Z_t^2$. These data can be used to estimate $\sigma^2$ as
\[
\hat\sigma^2 = \frac{1}{T}\sum_{t=1}^{T} \hat Z_t^2.
\]
Note that the ARMA model should be specified such that the residuals are approximately white noise.
(ii) Estimate the ACF for the squared residuals in the usual way:
\[
\hat\rho_{Z^2}(h) = \frac{\sum_{t=h+1}^{T}\left(\hat Z_t^2 - \hat\sigma^2\right)\left(\hat Z_{t-h}^2 - \hat\sigma^2\right)}{\sum_{t=1}^{T}\left(\hat Z_t^2 - \hat\sigma^2\right)^2}.
\]
(iii) It is now possible to use one of the methods laid out in Chap. 4 to test the null hypothesis that $\{Z_t^2\}$ is white noise. It can be shown that under the null hypothesis $\sqrt{T}\,\hat\rho_{Z^2}(h) \xrightarrow{d} N(0, 1)$. One can therefore construct confidence intervals for the ACF in the usual way. Alternatively, one may use the Ljung-Box test statistic (see Eq. (4.4)) to test the hypothesis that all correlation coefficients up to order $N$ are simultaneously equal to zero:
\[
Q' = T(T+2)\sum_{h=1}^{N}\frac{\hat\rho_{Z^2}^2(h)}{T-h}.
\]
Under the null hypothesis this statistic is distributed as $\chi^2_N$. To carry out the test, $N$ should be chosen rather high, for example equal to $T/4$.
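The three steps can be sketched as follows (a minimal implementation; the ARCH(1) parameters used to generate the illustrative series are our own choices):

```python
import numpy as np
from scipy.stats import chi2

def ljung_box_squared(z, N):
    """Steps (i)-(iii): ACF of the squared residuals and the Ljung-Box statistic Q'."""
    T = len(z)
    d = z**2 - np.mean(z**2)                 # squared residuals minus sigma^2-hat
    denom = np.sum(d**2)
    lags = np.arange(1, N + 1)
    rho = np.array([np.sum(d[h:] * d[:T - h]) / denom for h in lags])
    Q = T * (T + 2) * np.sum(rho**2 / (T - lags))
    return Q, chi2.sf(Q, df=N)               # statistic and p-value from chi2(N)

rng = np.random.default_rng(0)
z_iid = rng.standard_normal(1000)            # homoskedastic residuals: no ARCH effects
z_arch = np.zeros(1000)
for t in range(1, 1000):                     # ARCH(1): strong volatility clustering
    z_arch[t] = np.sqrt(0.5 + 0.5 * z_arch[t - 1]**2) * rng.standard_normal()

Q_iid, p_iid = ljung_box_squared(z_iid, N=20)
Q_arch, p_arch = ljung_box_squared(z_arch, N=20)
```

For the homoskedastic series the statistic stays near its $\chi^2_{20}$ mean, while for the ARCH series it is far out in the tail.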
A second test is based on the regression of the squared residuals on their own lags:
\[
\hat Z_t^2 = \alpha_0 + \alpha_1 \hat Z_{t-1}^2 + \alpha_2 \hat Z_{t-2}^2 + \ldots + \alpha_p \hat Z_{t-p}^2 + \varepsilon_t,
\]
where $\varepsilon_t$ denotes the error term. Then the null hypothesis $H_0: \alpha_1 = \alpha_2 = \ldots = \alpha_p = 0$ is tested against the alternative hypothesis $H_1: \alpha_j \ne 0$ for at least one $j$. As a test statistic one can use the coefficient of determination times $T$, i.e. $TR^2$. This test statistic is asymptotically distributed as $\chi^2$ with $p$ degrees of freedom. Alternatively, one may use the conventional F-test.
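A minimal sketch of this $TR^2$ test (the simulated series and their parameters are our own illustrations):

```python
import numpy as np
from scipy.stats import chi2

def arch_lm_test(z, p):
    """Regress z_t^2 on a constant and p lags of z^2; statistic is T*R^2 ~ chi2(p)."""
    z2 = z**2
    y = z2[p:]
    X = np.column_stack([np.ones(len(y))] +
                        [z2[p - j:len(z2) - j] for j in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    r2 = 1.0 - resid.var() / y.var()
    stat = len(y) * r2
    return stat, chi2.sf(stat, df=p)

rng = np.random.default_rng(0)
z_hom = rng.standard_normal(2000)               # homoskedastic: no ARCH effects
z_arch = np.zeros(2000)
for t in range(1, 2000):                        # ARCH(1) with alpha1 = 0.4
    z_arch[t] = np.sqrt(0.6 + 0.4 * z_arch[t - 1]**2) * rng.standard_normal()

stat_hom, p_hom = arch_lm_test(z_hom, p=5)
stat_arch, p_arch = arch_lm_test(z_arch, p=5)
```

The regression picks up the autocorrelation in the squared ARCH series, so the null is clearly rejected there but not for the homoskedastic series.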
8.3 Estimation of GARCH(p,q) Models
The literature has proposed several approaches to estimate models of volatility (see Fan and Yao (2003, 156–162)). The most popular one, however, rests on the method of maximum likelihood. We will describe this method using the GARCH(p,q) model. Related and more detailed accounts can be found in Weiss (1986), Bollerslev et al. (1994) and Hall and Yao (2003).
In particular we consider the following model:
\begin{align*}
\text{mean equation:} \quad & X_t = c + \phi_1 X_{t-1} + \ldots + \phi_r X_{t-r} + Z_t, \\
\text{variance equation:} \quad & Z_t = \sigma_t\nu_t \quad\text{with}\quad \sigma_t^2 = \alpha_0 + \sum_{j=1}^{p}\alpha_j Z_{t-j}^2 + \sum_{j=1}^{q}\beta_j\sigma_{t-j}^2,
\end{align*}
where $\nu_t \sim \mathrm{IID}\ N(0, 1)$.
The mean equation represents a simple AR(r) process for which we assume that it is causal with respect to $\{Z_t\}$, i.e. that all roots of $\Phi(z)$ are outside the unit circle. The method demonstrated here can easily be generalized to ARMA processes or even ARMA processes with additional exogenous variables (so-called ARMAX processes) as noted by Weiss (1986). The method also incorporates the ARCH-in-mean model (see equation (8.6)) which allows for an effect of the conditional variance $\sigma_t$ on $X_t$.
In addition, we assume that the coefficients of the variance equation are all positive, that $\sum_{j=1}^{p}\alpha_j + \sum_{j=1}^{q}\beta_j < 1$, and that $\mathbb{E}Z_t^4 < \infty$.^11
As $\nu_t$ is identically and independently standard normally distributed, the distribution of $X_t$ conditional on $\mathcal{X}_{t-1} = \{X_{t-1}, X_{t-2}, \ldots\}$ is normal with mean $c + \phi_1 X_{t-1} + \ldots + \phi_r X_{t-r}$ and variance $\sigma_t^2$. The conditional density, $f(X_t \mid \mathcal{X}_{t-1})$, therefore is:^12
\[
f(X_t \mid \mathcal{X}_{t-1}) = \frac{1}{\sqrt{2\pi\sigma_t^2}}\exp\left(-\frac{Z_t^2}{2\sigma_t^2}\right)
\]
where $s$ is an integer greater than $p$. The necessity of not factorizing the first $s-1$ observations relates to the fact that $\sigma_t^2$ can only be evaluated for $t > p$ in the ARCH(p) model. For the ARCH(p) model $s$ can be set to $p+1$. In the case of a GARCH model, $\sigma_t^2$ is given by a weighted infinite sum of $Z_{t-1}^2, Z_{t-2}^2, \ldots$ (see the expression (8.9) for $\sigma_t^2$ in the GARCH(1,1) model). For finite samples, this infinite sum must be approximated by a finite sum of $s$ summands such that the number of summands $s$ increases with the sample size (see Hall and Yao (2003)).
We then merge all parameters of the model as follows: $\phi = (c, \phi_1, \ldots, \phi_r)'$, $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_p)'$ and $\beta = (\beta_1, \ldots, \beta_q)'$. For a given realization $x = (x_1, x_2, \ldots, x_T)$ the likelihood function conditional on $x$, $L(\phi, \alpha, \beta \mid x)$, is defined as
\[
L(\phi, \alpha, \beta \mid x) = f(x_1, x_2, \ldots, x_{s-1}) \prod_{t=s}^{T} f(x_t \mid \mathcal{X}_{t-1})
\]
where in $\mathcal{X}_{t-1}$ the random variables are replaced by their realizations. The likelihood function can be seen as the probability of observing the data at hand given the values of the parameters. The method of maximum likelihood then consists in choosing the parameters $(\phi, \alpha, \beta)$ such that the likelihood function is maximized. Thus we choose the parameters so that the probability of observing the data is maximized. In this way
11. The existence of the fourth moment is necessary for the asymptotic normality of the maximum-likelihood estimator, but not for its consistency. It is possible to relax this assumption somewhat (see Hall and Yao (2003)).
12. If $\nu_t$ is assumed to follow a distribution other than the normal, one may use this distribution instead.
we obtain the maximum likelihood estimator. Taking the first $s-1$ realizations as given deterministic starting values, we then get the conditional likelihood function.
In practice we do not maximize the likelihood function but its logarithm, where we take $f(x_1, \ldots, x_{s-1})$ as a fixed constant which can be neglected in the optimization:
\[
\log L(\phi, \alpha, \beta \mid x) = \sum_{t=s}^{T} \log f(x_t \mid \mathcal{X}_{t-1})
= -\frac{T}{2}\log 2\pi - \frac{1}{2}\sum_{t=s}^{T}\log\sigma_t^2 - \frac{1}{2}\sum_{t=s}^{T}\frac{z_t^2}{\sigma_t^2}
\]
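A sketch of the resulting estimation procedure for the GARCH(1,1) case (initializing $\sigma_s^2$ at the sample variance and using a Nelder-Mead optimizer are our implementation choices, not prescribed by the text):

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(params, z):
    """Conditional negative log-likelihood of a GARCH(1,1) for demeaned data z."""
    a0, a1, b = params
    if a0 <= 0 or a1 < 0 or b < 0 or a1 + b >= 1:
        return np.inf                        # penalize infeasible parameters
    T = len(z)
    sigma2 = np.empty(T)
    sigma2[0] = z.var()                      # starting value: sample variance
    for t in range(1, T):
        sigma2[t] = a0 + a1 * z[t - 1]**2 + b * sigma2[t - 1]
    return 0.5 * np.sum(np.log(2 * np.pi * sigma2) + z**2 / sigma2)

# simulate data with known parameters, then maximize the likelihood
rng = np.random.default_rng(42)
a0_true, a1_true, b_true = 0.1, 0.1, 0.8
T = 3000
z = np.zeros(T)
s2 = 1.0
for t in range(1, T):
    s2 = a0_true + a1_true * z[t - 1]**2 + b_true * s2
    z[t] = np.sqrt(s2) * rng.standard_normal()

res = minimize(neg_loglik, x0=(0.05, 0.05, 0.5), args=(z,),
               method="Nelder-Mead",
               options={"maxiter": 5000, "xatol": 1e-6, "fatol": 1e-8})
a0_hat, a1_hat, b_hat = res.x
```

With a few thousand observations the estimates land close to the true parameters, though in small samples the likelihood surface can be flat in the $\beta$ direction.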
13. Jensen and Rahbek (2004) showed that, at least for the GARCH(1,1) case, the stationarity condition is not necessary.
A final remark concerns the choice of the parameters $r$, $p$ and $q$. Similarly to the ordinary ARMA models, one can use information criteria, such as the Akaike or the Bayes criterion, to determine the order of the model (see Sect. 5.4).
The maximization of the likelihood function requires the use of numerical opti-
mization routines. Depending on the routine actually used and on the starting value,
different results may be obtained if the likelihood function is not well-behaved. It
is therefore of interest to have alternative estimation methods at hand. The method
of moments is such an alternative. It is similar to the Yule-Walker estimator (see
Sect. 5.1) applied to the autocorrelation function of fZt2 g. This method not only leads
to an analytic solution, but can also be easily implemented. Following Kristensen
and Linton (2006), we will illustrate the method for the GARCH(1,1) model.
Equation (8.11) applied to $\rho_{Z^2}(1)$ and $\rho_{Z^2}(2)$ constitutes a nonlinear equation system in the unknown parameters $\beta$ and $\alpha_1$. This system can be reparameterized to yield an equation system in $\varphi = \alpha_1+\beta$ and $\beta$ which can be reduced to a single quadratic equation in $\beta$:
(i) Estimate the correlations $\rho_{Z^2}(1)$ and $\rho_{Z^2}(2)$ as well as $\sigma^2$ based on the formulas in Sect. 8.2.
(ii) An estimate for $\varphi = \alpha_1+\beta$ is then given by
\[
\hat\varphi = \widehat{(\alpha_1+\beta)} = \frac{\hat\rho_{Z^2}(2)}{\hat\rho_{Z^2}(1)}.
\]
(iii) Given $\hat\varphi$ and $\hat\rho_{Z^2}(1)$, an estimate $\hat\beta$ is obtained from equation (8.11) for $\rho_{Z^2}(1)$, which reduces to the quadratic equation
\[
\left(\hat\rho_{Z^2}(1)-\hat\varphi\right)\beta^2 + \left(1+\hat\varphi^2-2\hat\rho_{Z^2}(1)\hat\varphi\right)\beta + \left(\hat\rho_{Z^2}(1)-\hat\varphi\right) = 0,
\]
whose roots come in pairs $(\beta, 1/\beta)$; take the root with absolute value smaller than one.
(iv) The estimate for $\alpha_1$ is $\hat\alpha_1 = \hat\varphi - \hat\beta$. Because $\alpha_0 = \sigma^2(1-(\alpha_1+\beta))$, the estimate for $\alpha_0$ is equal to $\hat\alpha_0 = \hat\sigma^2(1-\hat\varphi)$.
Kristensen and Linton (2006) show that, given the existence of the fourth moment of $Z_t$, this method of moments leads to consistent and asymptotically normally distributed estimates. These estimates may then serve as starting values for the maximization of the likelihood function to improve efficiency.
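Steps (i)-(iv) can be sketched directly (a minimal implementation; the simulated data and the quadratic-root selection follow the logic above, with our own helper names):

```python
import numpy as np

def mom_garch11(z):
    """Method-of-moments estimator for a GARCH(1,1): steps (i)-(iv)."""
    d = z**2 - np.mean(z**2)
    denom = np.sum(d**2)
    rho1 = np.sum(d[1:] * d[:-1]) / denom          # (i) rho_{Z^2}(1)
    rho2 = np.sum(d[2:] * d[:-2]) / denom          #     rho_{Z^2}(2)
    sigma2 = np.mean(z**2)
    phi = rho2 / rho1                               # (ii) phi = alpha1 + beta
    # (iii) quadratic in beta; roots come in pairs (beta, 1/beta)
    roots = np.roots([rho1 - phi, 1 + phi**2 - 2 * rho1 * phi, rho1 - phi])
    beta = roots[np.abs(roots) < 1][0].real
    alpha1 = phi - beta                             # (iv)
    alpha0 = sigma2 * (1 - phi)
    return alpha0, alpha1, beta

# simulate a GARCH(1,1) with known parameters and recover them
rng = np.random.default_rng(7)
a0, a1, b = 0.1, 0.15, 0.7
T = 100_000
z = np.zeros(T)
s2 = a0 / (1 - a1 - b)
for t in range(1, T):
    s2 = a0 + a1 * z[t - 1]**2 + b * s2
    z[t] = np.sqrt(s2) * rng.standard_normal()

a0_hat, a1_hat, b_hat = mom_garch11(z)
```

Unlike the likelihood approach, this delivers a closed-form solution, at the price of lower efficiency.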
8.4 Example: Swiss Market Index (SMI)
In this section, we will illustrate the methods discussed previously by analyzing the volatility of the Swiss Market Index (SMI). The SMI is the most important stock market index for Swiss blue chip companies. It is constructed solely from stock market prices; dividends are not accounted for. The data are the daily values of the index between the 3rd of January 1989 and the 13th of February 2004. Figure 1.5 shows a plot of the data. Instead of analyzing the level of the SMI, we will investigate the daily return computed as the logged difference. This time series is denoted by $X_t$ and plotted in Fig. 8.3. One can clearly discern phases of high (observations around $t = 2500$ and $t = 3500$) and low ($t = 1000$ and $t = 2000$) volatility. This represents a first sign of heteroskedasticity and positively correlated volatility.
[Figure 8.3 here: time series plot of the daily SMI returns in percent.]
Fig. 8.3 Daily return of the SMI (Swiss Market Index) computed as $\Delta\log(\mathrm{SMI}_t)$ between January 3rd 1989 and February 13th 2004
[Figure 8.4 here: normal-quantile plot, probability against daily return of the SMI.]
Fig. 8.4 Normal-quantile plot of the daily returns of the SMI (Swiss Market Index)
[Figure 8.5 here: histogram of the daily returns with fitted normal density.]
Fig. 8.5 Histogram of the daily returns of the SMI (Swiss Market Index) and the density of a fitted normal distribution (red line)
[Figure 8.6 here: ACF of $\Delta\log(\mathrm{SMI}_t)$ (upper panel) and of $(\Delta\log(\mathrm{SMI}_t))^2$ (lower panel), orders 0 to 50.]
Fig. 8.6 ACF of the daily returns and the squared daily returns of the SMI
likelihood, where $p$ is varied between 1 and 3 and $q$ between 0 and 3. The values of the AIC, respectively BIC, criterion corresponding to the variance equation are listed in Tables 8.1 and 8.2.
The results reported in these tables show that the AIC criterion favors a GARCH(3,3) model, corresponding to the bold number in Table 8.1, whereas the BIC criterion opts for a GARCH(1,1) model, corresponding to the bold number in Table 8.2. It also turns out that for high-dimensional models, in particular those for which $q > 0$, the maximization algorithm has problems finding an optimum. Furthermore, the roots of the implicit AR and MA polynomials corresponding to the variance equation of the GARCH(3,3) model are very similar. These two arguments lead us to prefer the GARCH(1,1) over the GARCH(3,3) model. The model was estimated to have the following mean equation:
\[
X_t = \underset{(0.0174)}{0.0755} + Z_t + \underset{(0.0184)}{0.0484}\,Z_{t-1}
\]
with standard errors in parentheses.
14. The corresponding Wald test clearly rejects the null hypothesis $\alpha_1 + \beta = 1$ at a significance level of 1 %.
Value at Risk
where $\tilde X_{t+h}$ is the return of the portfolio over an investment horizon of $h$ periods. This return is approximately equal to the sum of the daily returns: $\tilde X_{t+h} = \sum_{j=1}^{h} X_{t+j}$.
The one-period forecast error is given by $X_{t+1} - \tilde P_t X_{t+1}$, which is equal to $Z_{t+1} = \sigma_{t+1}\nu_{t+1}$. Thus the VaR for the next day is
\[
\mathrm{VaR}^{\alpha}_{t,t+1} = \inf\left\{x : \mathbb{P}\left(\nu_{t+1} \le \frac{x - \tilde P_t X_{t+1}}{\sigma_{t+1}}\right) \ge \alpha\right\}.
\]
This entity can be computed by replacing the forecast given the infinite past, $\tilde P_t X_{t+1}$, by a forecast given the finite sample information $X_{t-k}$, $k = 0, 1, 2, \ldots, t-1$, denoted $P_t X_{t+1}$, and by substituting $\sigma_{t+1}$ by the corresponding forecast from the variance equation, $\hat\sigma_{t,t+1}$. Thus we get:
\[
\widehat{\mathrm{VaR}}^{\alpha}_{t,t+1} = \inf\left\{x : \mathbb{P}\left(\nu_{t+1} \le \frac{x - P_t X_{t+1}}{\hat\sigma_{t,t+1}}\right) \ge \alpha\right\}.
\]
The computation of $\widehat{\mathrm{VaR}}^{\alpha}_{t,t+1}$ requires determining the $\alpha$-quantile of the distribution of $\nu_t$. This can be done in two ways. The first one uses the assumption about the distribution of $\nu_t$ explicitly. In the simplest case, $\nu_t$ is distributed as a standard
normal so that the appropriate quantile can be easily retrieved. The 1 % quantile of the standard normal distribution is $-2.33$. The second approach is a non-parametric one and uses the empirical distribution function of $\hat\nu_t = \hat Z_t/\hat\sigma_t$ to determine the required quantile. This approach has the advantage that deviations from the standard normal distribution are accounted for. In our case, the 1 % quantile is $-2.56$ and thus considerably lower than the $-2.33$ obtained from the normal distribution. Thus the VaR is underestimated by using the assumption of the normal distribution.
The corresponding computations for the SMI based on the estimated ARMA(0,1)-GARCH(1,1) model are reported in Table 8.3. A value of 5.71 for the 31st of December 2001 means that one can be 99 % sure that the return of an investment in the stocks of the SMI will not be lower than $-5.71$ %. The values for the non-parametric approach are typically higher. The comparison of the VaR for different dates clearly shows how the risk evolves over time.
Due to the nonlinear character of the model, the VaR for more than one day can only be obtained by simulating the one-period returns over the corresponding horizon. Starting from a given date, 10'000 realizations of the returns over the next 10 days have been simulated, whereby the corresponding values for $\nu_t$ are either drawn from a standard normal distribution (parametric case) or from the empirical distribution function of $\hat\nu_t$ (non-parametric case). The results from this exercise are reported in Table 8.4. Obviously, the risk is much higher for a 10-day than for a one-day investment. Alternatively, one may use the forecasting equation (8.12) and the corresponding recursion formula (8.13).
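The simulation approach can be sketched as follows (a zero-mean return equation is assumed for simplicity, in place of the estimated ARMA(0,1) mean equation; the parameter values and current states are purely illustrative):

```python
import numpy as np

def simulate_var(a0, a1, b, z_t, sigma2_t, nu_pool,
                 horizon=10, n_sim=10_000, alpha=0.01, seed=0):
    """Simulate h-day cumulative returns from a GARCH(1,1) with zero mean and
    return their alpha-quantile (the simulated VaR). Innovations are drawn
    from nu_pool: normal draws (parametric) or standardized residuals
    (non-parametric bootstrap)."""
    rng = np.random.default_rng(seed)
    draws = rng.choice(nu_pool, size=(n_sim, horizon))
    cum = np.zeros(n_sim)
    for i in range(n_sim):
        s2, z_prev = sigma2_t, z_t
        for h in range(horizon):
            s2 = a0 + a1 * z_prev**2 + b * s2    # variance recursion
            z_prev = np.sqrt(s2) * draws[i, h]   # simulated daily return
            cum[i] += z_prev
    return np.quantile(cum, alpha)

a0, a1, b = 0.1, 0.1, 0.8                         # illustrative GARCH parameters
# parametric case: pool of standard normal innovations
nu_normal = np.random.default_rng(1).standard_normal(5000)
var10 = simulate_var(a0, a1, b, z_t=1.0, sigma2_t=1.2, nu_pool=nu_normal)
var1 = simulate_var(a0, a1, b, z_t=1.0, sigma2_t=1.2, nu_pool=nu_normal, horizon=1)
```

Passing the empirical $\hat\nu_t$ instead of `nu_normal` gives the non-parametric variant; the 10-day quantile is considerably more negative than the 1-day one, mirroring Table 8.4.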
Part II
Multivariate Time Series Analysis
9 Introduction
1. See Epstein (1987) for an historical overview.
The critique had several facets. First, it was argued that the bottom-up strategy of
building a system from single equations is not compatible with general equilibrium
theory which stresses the interdependence of economic activities. This insight was further reinforced by the advent of the theory of rational expectations. This theory
postulated that expectations should be formed on the basis of all available infor-
mation and not just by mechanically extrapolating from the past. This implies that
developments in every part of the economy, in particular in the realm of economic
policy making, should in principle be taken into account and shape the expectation
formation. As expectations are omnipresent in almost every economic decision, all aspects of economic activity (consumption, capital accumulation, investment, etc.) are inherently linked. Thus the strategy of using zero restrictions—which
meant that certain variables were omitted from a particular equation—to identify the
parameters in a simultaneous equation system was considered to be flawed. Second,
the theory of rational expectations implied that the typical behavioral equations
underlying these models are not invariant to changes in policies because economic
agents would take into account systematic changes in the economic environment
in their decision making. This so-called Lucas-critique (Lucas 1976) undermined
the basis for the existence of large simultaneous equation models. Third, simple
univariate ARMA models proved to be as good in forecasting as the sophisticated
large simultaneous models. Thus it was argued that the effort or at least part of the
effort devoted to these models was wasted.
In 1980, Sims (1980b) proposed an alternative modeling strategy. This strategy concentrates the modeling activity on only a few core variables, but places no restrictions whatsoever on the dynamic interrelations among them. Thus every
variable is considered to be endogenous and, in principle, dependent on all other
variables of the model. In the linear context, the class of vector autoregressive
(VAR) models has proven to be most convenient to capture this modeling strategy.
They are easy to implement and to analyze. In contrast to the simultaneous equation
approach, however, it is no longer possible to perform comparative static exercises
and to analyze the effect of one variable on another one because every variable
is endogenous a priori. Instead, one tries to identify and quantify the effect of
shocks over time. These shocks are usually given some economic content, like
demand or supply disturbances. However, these shocks are not directly observed,
but are disguised behind the residuals from the VAR. Thus, the VAR approach also
faces a fundamental identification problem. Since the seminal contribution by Sims,
the literature has proposed several alternative identification schemes which will be
discussed in Chap. 15 under the header of structural vector autoregressive (SVAR)
models. The effects of these shocks are then further analyzed by computing impulse
responses and forecast error variance decompositions.2
The reliance on shocks can be seen as a substitute for the lack of experiments in macroeconomics. The approach can be interpreted as a statistical analogue to the identification of specific episodes where some unforeseen event (shock)
2. Watson (1994) and Kilian (2013) provide a general introduction to this topic.
10 Definitions and Stationarity

Similarly to the univariate case, we start our exposition with the concept of stationarity, which is also crucial in the multivariate setting. Before doing so, let us define a multivariate stochastic process.
Definition 10.1. A multivariate stochastic process, $\{X_t\}$, is a family of random variables indexed by $t$, $t \in \mathbb{Z}$, which take values in $\mathbb{R}^n$, $n \ge 1$; $n$ is called the dimension of the process.
Setting n D 1, the above definition includes as a special case univariate stochastic
processes. This implies that the statements for multivariate processes carry over
analogously to the univariate case. We view $X_t$ as a column vector:
\[
X_t = \begin{pmatrix} X_{1t} \\ \vdots \\ X_{nt} \end{pmatrix}.
\]
Each element $\{X_{it}\}$ thereby represents a particular variable which may be treated as a univariate process. As in the example of Sect. 15.4.5, $\{X_t\}$ may represent the multivariate process consisting of the growth rate of GDP $Y_t$, the unemployment rate $U_t$, the inflation rate $P_t$, the wage inflation rate $W_t$, and the growth rate of money $M_t$. Thus, $X_t = (Y_t, U_t, P_t, W_t, M_t)'$.
As in the univariate case, we characterize the joint distribution of the elements $X_{it}$ and $X_{js}$ by the first two moments (if they exist), i.e. by the mean and the variance, respectively covariance:
\[
\mu_{it} = \mathbb{E}X_{it}, \quad i = 1, \ldots, n;
\qquad
\gamma_{ij}(t, s) = \mathbb{E}(X_{it}-\mu_{it})(X_{js}-\mu_{js}), \quad i, j = 1, \ldots, n;\ t, s \in \mathbb{Z}. \tag{10.1}
\]
In matrix notation:
\[
\mu_t = \begin{pmatrix} \mu_{1t} \\ \vdots \\ \mu_{nt} \end{pmatrix} = \mathbb{E}X_t = \begin{pmatrix} \mathbb{E}X_{1t} \\ \vdots \\ \mathbb{E}X_{nt} \end{pmatrix},
\qquad
\Gamma(t, s) = \begin{pmatrix} \gamma_{11}(t, s) & \cdots & \gamma_{1n}(t, s) \\ \vdots & \ddots & \vdots \\ \gamma_{n1}(t, s) & \cdots & \gamma_{nn}(t, s) \end{pmatrix} = \mathbb{E}(X_t-\mu_t)(X_s-\mu_s)'.
\]
Thus, we apply the expectations operator element-wise to vectors and matrices. The matrix-valued function $\Gamma(t, s)$ is called the covariance function of $\{X_t\}$.
In analogy to the univariate case, we define stationarity as the invariance of the first two moments to time shifts:
Definition 10.2 (Stationarity). A multivariate stochastic process $\{X_t\}$ is stationary if and only if for all integers $r$, $s$ and $t$ we have $\mu_t = \mu$, a constant, and $\Gamma(t, s) = \Gamma(t+r, s+r)$.
In the literature these properties are often called weak stationarity, covariance stationarity, or stationarity of second order. If $\{X_t\}$ is stationary, the covariance function only depends on the number of periods between $t$ and $s$ (i.e. on $t-s$) and not on $t$ or $s$ themselves. This implies that by setting $r = -s$ and $h = t-s$ the covariance function simplifies to
\[
\Gamma(h) = \Gamma(h, 0) \quad\text{with}\quad \Gamma(h) = \Gamma(-h)'.
\]
Note that $\Gamma(h)$ is in general not symmetric for $h \ne 0$ because $\gamma_{ij}(h) \ne \gamma_{ji}(h)$ for $h \ne 0$.
Based on the covariance function of a stationary process, we can define the correlation function $R(h) = (\rho_{ij}(h))_{i,j}$ with
\[
\rho_{ij}(h) = \frac{\gamma_{ij}(h)}{\sqrt{\gamma_{ii}(0)\,\gamma_{jj}(0)}}.
\]
In the case $i \ne j$ we refer to the cross-correlations between the two variables $\{X_{it}\}$ and $\{X_{jt}\}$. The correlation function can be written in matrix notation as
\[
R(h) = V^{-1/2}\,\Gamma(h)\,V^{-1/2}
\]
where $V$ represents the diagonal matrix with diagonal elements equal to $\gamma_{ii}(0)$. Clearly, $\rho_{ii}(0) = 1$. As for the covariance matrix, we have that in general $\rho_{ij}(h) \ne \rho_{ji}(h)$ for $h \ne 0$. It is possible that $\rho_{ij}(h) > \rho_{ij}(0)$. We can summarize the properties of the covariance function in the following theorem.^1
Theorem 10.1. The covariance function $\Gamma$ of a stationary process $\{X_t\}$ has the following properties:
(i) For all $h \in \mathbb{Z}$, $\Gamma(h) = \Gamma(-h)'$;
(ii) for all $h \in \mathbb{Z}$, $|\gamma_{ij}(h)| \le \sqrt{\gamma_{ii}(0)\,\gamma_{jj}(0)}$;
(iii) for each $i = 1, \ldots, n$, $\gamma_{ii}(h)$ is a univariate autocovariance function;
(iv) $\sum_{r,k=1}^{m} a_r'\,\Gamma(r-k)\,a_k \ge 0$ for all $m \in \mathbb{N}$ and all $a_1, \ldots, a_m \in \mathbb{R}^n$. This property is called non-negative definiteness (see Property 4 in Theorem 1.1 of Sect. 1.3 for the univariate case).
Proof. Property (i) follows immediately from the definition. Property (ii) follows from the fact that the correlation coefficient is always smaller than or equal to one in absolute value. $\gamma_{ii}(h)$ is the autocovariance function of $\{X_{it}\}$, which delivers property (iii). Property (iv) follows from $\mathbb{E}\left(\sum_{k=1}^{m} a_k'(X_{t-k}-\mu)\right)^2 \ge 0$. $\square$
If not only the first two moments but the distribution as a whole is invariant to time shifts, we arrive at the concept of strict stationarity.
Definition 10.3 (Strict Stationarity). A multivariate process $\{X_t\}$ is called strictly stationary if and only if, for all $n \in \mathbb{N}$, $t_1, \ldots, t_n$, $h \in \mathbb{Z}$, the joint distributions of $(X_{t_1}, \ldots, X_{t_n})$ and $(X_{t_1+h}, \ldots, X_{t_n+h})$ are the same.
An Example
Consider the bivariate process defined by
\begin{align*}
X_{1t} &= Z_t \\
X_{2t} &= Z_t + 0.75\,Z_{t-2}
\end{align*}
with $Z_t \sim \mathrm{WN}(0, 1)$.
1. We leave it to the reader to derive an analogous theorem for the correlation function.
The covariance function of this process is
\[
\Gamma(h) = \begin{cases}
\begin{pmatrix} 1 & 1 \\ 1 & 1.5625 \end{pmatrix}, & h = 0; \\[6pt]
\begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}, & h = 1; \\[6pt]
\begin{pmatrix} 0 & 0 \\ 0.75 & 0.75 \end{pmatrix}, & h = 2.
\end{cases}
\]
The covariance function is zero for $h > 2$. The values for $h < 0$ are obtained from property (i) in Theorem 10.1. The correlation function is:
\[
R(h) = \begin{cases}
\begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}, & h = 0; \\[6pt]
\begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}, & h = 1; \\[6pt]
\begin{pmatrix} 0 & 0 \\ 0.60 & 0.48 \end{pmatrix}, & h = 2.
\end{cases}
\]
The correlation function is zero for $h > 2$. The values for $h < 0$ are obtained from property (i) in Theorem 10.1.
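The covariance function of this example is easy to verify by Monte Carlo. A short sketch (one million simulated observations, arbitrary seed):

```python
import numpy as np

# Monte Carlo check of the covariance function of the example:
# X_{1t} = Z_t,  X_{2t} = Z_t + 0.75 Z_{t-2},  Z_t ~ WN(0, 1)
rng = np.random.default_rng(0)
T = 1_000_000
Z = rng.standard_normal(T + 2)
X = np.column_stack([Z[2:], Z[2:] + 0.75 * Z[:-2]])     # columns (X_1t, X_2t)

def gamma_hat(X, h):
    """Sample covariance matrix Gamma(h) = E X_t X_{t-h}' (means are zero here)."""
    return X[h:].T @ X[:len(X) - h] / (len(X) - h)

G0, G1, G2 = gamma_hat(X, 0), gamma_hat(X, 1), gamma_hat(X, 2)
```

The sample matrices reproduce the theoretical values $\Gamma(0)$, $\Gamma(1) = 0$, and $\Gamma(2)$ above, including the asymmetry of $\Gamma(2)$.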
One idea in time series analysis is to construct more complicated processes from simple ones, for example by taking moving averages. The simplest process is the white noise process, which is uncorrelated with its own past. In the multivariate context the white noise process is defined as follows.
Definition 10.4. A stochastic process $\{Z_t\}$ is called a (multivariate) white noise process with mean zero and covariance matrix $\Sigma > 0$, denoted by $Z_t \sim \mathrm{WN}(0, \Sigma)$, if $\{Z_t\}$ is stationary with
\[
\mathbb{E}Z_t = 0, \qquad \Gamma(h) = \begin{cases}\Sigma, & h = 0;\\ 0, & h \ne 0.\end{cases}
\]
If $\{Z_t\}$ is not only white noise but independently and identically distributed, we write $Z_t \sim \mathrm{IID}(0, \Sigma)$.
Remark 10.1. Even if each component $\{Z_{it}\}$ is univariate white noise, this does not imply that $\{Z_t\} = \{(Z_{1t}, \ldots, Z_{nt})'\}$ is multivariate white noise. Take, for example, the process $Z_t = (u_t, u_{t-1})'$ where $u_t \sim \mathrm{WN}(0, \sigma_u^2)$. Then
\[
\Gamma(1) = \begin{pmatrix} 0 & 0 \\ \sigma_u^2 & 0 \end{pmatrix} \ne 0.
\]
A process $\{X_t\}$ of the form
\[
X_t = \sum_{j=-\infty}^{\infty} \Psi_j Z_{t-j}
\]
where $Z_t \sim \mathrm{IID}(0, \Sigma)$ and where the sequence $\{\Psi_j\}$ of $n \times n$ matrices is absolutely summable, i.e. $\sum_{j=-\infty}^{\infty}\|\Psi_j\| < \infty$, is called a linear process. If $\Psi_j = 0$ for all $j < 0$, the linear process is also called an $\mathrm{MA}(\infty)$ process.
Theorem 10.2. A linear process is stationary with a mean of zero and with covariance function
\[
\Gamma(h) = \sum_{j=-\infty}^{\infty} \Psi_{j+h}\,\Sigma\,\Psi_j' = \sum_{j=-\infty}^{\infty} \Psi_j\,\Sigma\,\Psi_{j-h}', \qquad h = 0, \pm 1, \pm 2, \ldots
\]
Remark 10.2. The same conclusion is reached if $\{Z_t\}$ is only a white noise process and not an IID process.
Here $\|\cdot\|$ denotes the Frobenius norm of an $n \times n$ matrix $A = (a_{ij})$:^2
\[
\|A\|^2 = \sum_{i,j} a_{ij}^2 = \mathrm{tr}(A'A) = \sum_{i=1}^{n} \lambda_i
\]
where $\mathrm{tr}(A'A)$ denotes the trace of $A'A$, i.e. the sum of the diagonal elements of $A'A$, and where the $\lambda_i$ are the $n$ eigenvalues of $A'A$.
2. For details see Meyer (2000, 279ff).
Absolute summability with respect to the matrix norm implies absolute summability of each entry, because $|[\Psi_j]_{kl}| \le \|\Psi_j\|$:
\[
\sum_{j=0}^{\infty} |[\Psi_j]_{kl}| \le \sum_{j=0}^{\infty} \|\Psi_j\| < \infty
\quad\text{and}\quad
\sum_{j=0}^{\infty}\sum_{k=1}^{n}\sum_{l=1}^{n} |[\Psi_j]_{kl}| = \sum_{k=1}^{n}\sum_{l=1}^{n}\sum_{j=0}^{\infty} |[\Psi_j]_{kl}| < \infty.
\]
We characterize the stationary process $\{X_t\}$ by its mean and its (matrix) covariance function. In the Gaussian case, this already characterizes the whole distribution. The estimation of these entities becomes crucial in the empirical analysis. As it turns out, the results from the univariate case carry over analogously to the multivariate case. If the process is observed over the periods $t = 1, 2, \ldots, T$, then a natural estimator for the mean is the arithmetic mean or sample average:
\[
\hat\mu = \overline{X}_T = \frac{1}{T}(X_1 + \ldots + X_T) = \begin{pmatrix} \overline{X}_1 \\ \vdots \\ \overline{X}_n \end{pmatrix}.
\]
Thus, the sample average converges in mean square, and therefore also in probability, to the true mean. Thereby the second condition is more restrictive than the first one. Both conditions are, in particular, fulfilled for all VARMA processes (see Chap. 12). As in the univariate case analyzed in Sect. 4.1, it can be shown under some mild additional assumptions that $\overline{X}_T$ is also asymptotically normally distributed.
Theorem 11.2. For any stationary process $\{X_t\}$ with
\[
X_t = \mu + \sum_{j=-\infty}^{\infty} \Psi_j Z_{t-j}
\]
where $Z_t \sim \mathrm{IID}(0, \Sigma)$ and $\sum_{j=-\infty}^{\infty}\|\Psi_j\| < \infty$, the arithmetic average $\overline{X}_T$ is asymptotically normal:
\[
\sqrt{T}\left(\overline{X}_T - \mu\right) \xrightarrow{d} N\!\left(0, \sum_{h=-\infty}^{\infty}\Gamma(h)\right)
= N\!\left(0, \left(\sum_{j=-\infty}^{\infty}\Psi_j\right)\Sigma\left(\sum_{j=-\infty}^{\infty}\Psi_j'\right)\right)
= N\!\left(0, \Psi(1)\,\Sigma\,\Psi(1)'\right).
\]
Proof. The proof is a straightforward extension to the multivariate case of the one given for Theorem 4.2 of Sect. 4.1. $\square$
The estimator of the covariance function can then be applied to derive an estimator for the correlation function:
\[
\widehat R(h) = \widehat V^{-1/2}\,\widehat\Gamma(h)\,\widehat V^{-1/2}
\]
where $\widehat V^{1/2} = \mathrm{diag}\left(\sqrt{\hat\gamma_{11}(0)}, \ldots, \sqrt{\hat\gamma_{nn}(0)}\right)$. Under the conditions given in Theorem 11.2, the estimator of the covariance matrix of order $h$, $\widehat\Gamma(h)$, converges to the true covariance matrix $\Gamma(h)$. Moreover, $\sqrt{T}\left(\widehat\Gamma(h)-\Gamma(h)\right)$ is asymptotically normally distributed. In particular, we can state the following theorem:
Theorem 11.3. Let $\{X_t\}$ be a stationary process with
\[
X_t = \mu + \sum_{j=-\infty}^{\infty} \Psi_j Z_{t-j}
\]
where $Z_t \sim \mathrm{IID}(0, \Sigma)$, $\sum_{j=-\infty}^{\infty}\|\Psi_j\| < \infty$, and $\sum_{j=-\infty}^{\infty}\Psi_j \ne 0$. Then, for each fixed $h$, $\widehat\Gamma(h)$ converges in probability as $T \to \infty$ to $\Gamma(h)$:
\[
\widehat\Gamma(h) \xrightarrow{p} \Gamma(h).
\]
Proof. A proof can be given along the lines of Proposition 13.1. $\square$
As for the univariate case, we can define the long-run covariance matrix $J$ as
\[
J = \sum_{h=-\infty}^{\infty} \Gamma(h). \tag{11.1}
\]
It can be estimated by a kernel estimator of the form $\widehat J_T = \sum_{h} k(h/\ell_T)\,\widehat\Gamma(h)$, where $k(x)$ is a kernel function, $\ell_T$ is the lag truncation parameter, and $\widehat\Gamma(h)$ is the corresponding estimate of the covariance matrix at lag $h$. For the choice of the kernel function and the lag truncation parameter the same principles apply as in the univariate case (see Sect. 4.4 and Den Haan and Levin (1997)).
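A minimal sketch of such a kernel estimator, using the Bartlett kernel $k(x) = 1 - |x|$ for $|x| \le 1$ (the kernel choice and the truncation lag are our assumptions, not prescribed by the text):

```python
import numpy as np

def long_run_cov(X, lag_trunc):
    """Kernel estimator of J = sum_h Gamma(h) with a Bartlett kernel.
    X is a (T, n) array; lag_trunc is the truncation lag."""
    T, n = X.shape
    Xc = X - X.mean(axis=0)

    def gamma(h):
        return Xc[h:].T @ Xc[:T - h] / T          # Gamma-hat(h)

    J = gamma(0)
    for h in range(1, lag_trunc + 1):
        w = 1 - h / (lag_trunc + 1)               # Bartlett weight
        G = gamma(h)
        J += w * (G + G.T)                        # add Gamma(h) and Gamma(-h) = Gamma(h)'
    return J

# for white noise, J should be close to Gamma(0) = identity
rng = np.random.default_rng(0)
X = rng.standard_normal((20_000, 2))
J = long_run_cov(X, lag_trunc=10)
```

The Bartlett weights guarantee a positive semi-definite estimate, which is why this kernel is a common default.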
Theorem 11.4. Let $\{X_{1t}\}$ and $\{X_{2t}\}$ be two stationary processes
\begin{align*}
X_{1t} &= \sum_{j=-\infty}^{\infty} \alpha_j Z_{1,t-j} \quad\text{with } Z_{1t} \sim \mathrm{IID}(0, \sigma_1^2), \\
X_{2t} &= \sum_{j=-\infty}^{\infty} \beta_j Z_{2,t-j} \quad\text{with } Z_{2t} \sim \mathrm{IID}(0, \sigma_2^2),
\end{align*}
where $\{Z_{1t}\}$ and $\{Z_{2t}\}$ are independent from each other at all leads and lags and where $\sum_j |\alpha_j| < \infty$ and $\sum_j |\beta_j| < \infty$. Under these conditions the asymptotic distribution of the estimator of the cross-correlation function $\rho_{12}(h)$ between $\{X_{1t}\}$ and $\{X_{2t}\}$ is
\[
\sqrt{T}\,\hat\rho_{12}(h) \xrightarrow{d} N\!\left(0, \sum_{j=-\infty}^{\infty}\rho_{11}(j)\rho_{22}(j)\right), \qquad h \ge 0. \tag{11.2}
\]
For all $h$ and $k$ with $h \ne k$, $(\sqrt{T}\hat\rho_{12}(h), \sqrt{T}\hat\rho_{12}(k))'$ converges in distribution to a bivariate normal distribution with mean zero and variances and covariances given by $\sum_{j=-\infty}^{\infty}\rho_{11}(j)\rho_{22}(j)$ and $\sum_{j=-\infty}^{\infty}\rho_{11}(j)\rho_{22}(j+k-h)$, respectively.
This result can be used to construct a test of independence, respectively uncorrelatedness, between two time series. The above theorem, however, shows that the asymptotic distribution of $\sqrt{T}\hat\rho_{12}(h)$ depends on $\rho_{11}(h)$ and $\rho_{22}(h)$ and is therefore unknown. Thus, the test cannot be based on the cross-correlations alone.^1
This problem can, however, be overcome by the following two-step procedure suggested by Haugh (1976).
First step: Estimate for each time series separately a univariate invertible ARMA model and compute the resulting residuals $\hat Z_{it}$ as $\hat Z_{it} = \sum_{j=0}^{\infty}\hat\pi_j^{(i)} X_{i,t-j}$, $i = 1, 2$. If the ARMA models correspond to the true ones, these residuals should be approximately white noise. This first step is called pre-whitening.
Second step: Under the null hypothesis the two time series $\{X_{1t}\}$ and $\{X_{2t}\}$ are uncorrelated with each other. This implies that the residuals $\{Z_{1t}\}$ and $\{Z_{2t}\}$ should also be uncorrelated with each other. The variance of the cross-correlations between $\{Z_{1t}\}$ and $\{Z_{2t}\}$ is therefore asymptotically equal to $1/T$ under the null hypothesis. Thus, one can apply the result of Theorem 11.4 to construct confidence intervals based on formula (11.2). A 95 % confidence interval is therefore given by $\pm 1.96\,T^{-1/2}$. The theorem may also be used to construct a test of whether the two series are uncorrelated.
If one is not interested in modeling the two time series explicitly, the simplest way is to estimate a high-order AR model for each series in the first step. Thereby, the order should be chosen high enough to obtain white noise residuals. Instead
1. The theorem may also be used to conduct a causality test between two time series (see Sect. 15.1).
of looking at each cross-correlation separately, one may also test the joint null hypothesis that all cross-correlations are simultaneously equal to zero. Such a test can be based on $T$ times the sum of the squared cross-correlation coefficients. This statistic is distributed as a $\chi^2$ with $L$ degrees of freedom, where $L$ is the number of summands (see the Haugh-Pierce statistic (15.1) in Sect. 15.1).
Consider two AR(1) processes $\{X_{1t}\}$ and $\{X_{2t}\}$ governed by the stochastic difference equations $X_{it} = 0.8 X_{i,t-1} + Z_{it}$, $i = 1, 2$. The two white noise processes $\{Z_{1t}\}$ and $\{Z_{2t}\}$ are independent from each other, so that $\{X_{1t}\}$ and $\{X_{2t}\}$ are independent from each other too. We simulate realizations of these two processes over 400 periods. The estimated cross-correlation function of the so-generated processes is plotted in the upper panel of Fig. 11.1. There one can see that many values lie outside the 95 % confidence interval given by $\pm 1.96\,T^{-1/2} = \pm 0.098$, despite the fact that by construction both series are independent of each other. The reason is that this confidence interval is not correct: it does not take the autocorrelation of each series into account. The application of Theorem 11.4 leads to the much larger 95 % confidence
interval of
±(1.96/√T) √( ∑_{j=-∞}^{∞} ρ_11(j) ρ_22(j) ) = ±(1.96/20) √( ∑_{j=-∞}^{∞} 0.8^{|j|} 0.8^{|j|} )
= ±(1.96/20) √( 1 + 2·0.64/(1 − 0.64) ) = ±0.209,
which is more than twice as large. This confidence interval encompasses most of the cross-correlations computed from the original series.
If one instead follows the testing procedure outlined above, fits an AR(10) model to each process, and then estimates the cross-correlation function of the corresponding residual series (filtered or pre-whitened time series), the plot in the lower panel of Fig. 11.1 is obtained.² This figure shows no significant cross-correlation anymore, so that one cannot reject the null hypothesis that both time series are independent from each other.

²The order of the AR processes is set arbitrarily to 10, which is more than enough to obtain white noise residuals.
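The pre-whitening procedure just described is easy to sketch in code. The following is a minimal illustration, not the book's own implementation: the AR order 10 and the sample size 400 are taken from the example above, while the helper functions and the seed are mine; the portmanteau statistic is the T-times-sum-of-squares version mentioned in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
T, phi, p = 400, 0.8, 10

def simulate_ar1(phi, T, rng, burn=200):
    """Simulate an AR(1) process X_t = phi X_{t-1} + Z_t."""
    z = rng.standard_normal(T + burn)
    x = np.zeros(T + burn)
    for t in range(1, T + burn):
        x[t] = phi * x[t - 1] + z[t]
    return x[burn:]

def ar_residuals(x, p):
    """OLS fit of an AR(p); returns the residual (pre-whitened) series."""
    X = np.column_stack([x[p - j - 1:len(x) - j - 1] for j in range(p)])
    y = x[p:]
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return y - X @ beta

def cross_corr(u, v, h):
    """Sample cross-correlation of u_{t+h} with v_t."""
    u = u - u.mean()
    v = v - v.mean()
    if h >= 0:
        c = np.sum(u[h:] * v[:len(v) - h]) / len(u)
    else:
        c = np.sum(u[:len(u) + h] * v[-h:]) / len(u)
    return c / (u.std() * v.std())

x1 = simulate_ar1(phi, T, rng)
x2 = simulate_ar1(phi, T, rng)       # independent of x1 by construction
z1, z2 = ar_residuals(x1, p), ar_residuals(x2, p)

L = 20
r = np.array([cross_corr(z1, z2, h) for h in range(-L, L + 1)])
Q = len(z1) * np.sum(r ** 2)         # compare with chi-square, 2L+1 dof
print("max |cross-correlation| of residuals:", np.abs(r).max())
print("portmanteau statistic:", Q)
```

Since the residuals are approximately white noise, the ±1.96/√T band is now the appropriate benchmark for the computed cross-correlations.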
212 11 Estimation of Covariance Function
[Figure 11.1 about here: estimated cross-correlations at orders −20 to 20; upper panel: original series; lower panel: filtered series]

Fig. 11.1 Cross-correlations between two independent AR(1) processes with φ = 0.8
³The quarterly data are taken from Berndt (1991). They cover the period from the first quarter of 1956 to the fourth quarter of 1975. In order to achieve stationarity, we work with first differences.
⁴The order of the AR processes is set arbitrarily to 10, which is more than enough to obtain white noise residuals.
[Figure 11.2 about here: cross-correlations of the original series at orders −20 to 20]

Fig. 11.2 Cross-correlations between aggregate nominal private consumption expenditures and aggregate nominal advertisement expenditures
Thus, one can reject the null hypothesis of independence between the two series. However,
most of the interdependence seems to come from the correlation within the same
quarter. This is confirmed by a more detailed investigation in Berndt (1991) where
no significant lead and/or lag relations are found.
5
We use data for Switzerland as published by the State Secretariat for Economic Affairs SECO.
[Figure 11.3 about here: cross-correlations of the original series (upper panel) and the filtered series (lower panel)]

Fig. 11.3 Cross-correlations between real growth of GDP and the consumer sentiment index
The estimated cross-correlations are plotted in the upper panel of Fig. 11.3. It shows several correlations outside the conventional confidence interval. The use of this confidence interval is, however, misleading as the distribution of the raw cross-correlations depends on the autocorrelations of each series. Thus, instead we filter both time series by an AR(8) model and investigate the cross-correlations of the residuals.⁶ The order of the AR model was chosen deliberately high to account for all autocorrelations. The cross-correlations of the filtered data are displayed in the lower panel of Fig. 11.3. As it turns out, the only cross-correlation which is significantly different from zero is the one for h = 1. Thus the Consumer Sentiment Index is leading the growth rate of GDP. In other words, an unexpectedly higher consumer sentiment is reflected in a positive change in the GDP growth rate of the next quarter.⁷
⁶With quarterly data it is wise to set the order to a multiple of four to account for possible seasonal movements. As it turns out, p = 8 is more than enough to obtain white noise residuals.
⁷In the interpretation of the cross-correlations, be aware of the ordering of the variables because ρ_12(1) = ρ_21(−1) ≠ ρ_21(1).
12 Stationary Time Series Models: Vector Autoregressive Moving-Average Processes (VARMA Processes)
The most important class of models is obtained by requiring {X_t} to be the solution of a linear stochastic difference equation with constant coefficients. In analogy to the univariate case, this leads to the theory of vector autoregressive moving-average processes (VARMA processes or just ARMA processes).

Definition 12.1 (VARMA process). A multivariate stochastic process {X_t} is a vector autoregressive moving-average process of order (p, q), denoted as VARMA(p,q) process, if it is stationary and fulfills the stochastic difference equation

X_t − Φ_1 X_{t-1} − … − Φ_p X_{t-p} = Z_t + Θ_1 Z_{t-1} + … + Θ_q Z_{t-q},    (12.1)

or compactly Φ(L)X_t = Θ(L)Z_t, where Z_t ∼ WN(0, Σ).
We start our discussion by analyzing the properties of the VAR(1) process, which is defined as the solution of the stochastic difference equation

X_t = Φ X_{t-1} + Z_t    with Z_t ∼ WN(0, Σ).

We assume that all eigenvalues of Φ are strictly smaller than one in absolute value. As the eigenvalues correspond to the inverses of the roots of the matrix polynomial det(Φ(z)) = det(I_n − Φz), this assumption implies that all roots must lie outside the unit circle.
For the sake of exposition, we will further assume that Φ is diagonalizable, i.e. there exists an invertible matrix P such that J = P^{-1}ΦP is a diagonal matrix with the eigenvalues of Φ on the diagonal.¹
Consider now the stochastic process

X_t = Z_t + ΦZ_{t-1} + Φ²Z_{t-2} + … = ∑_{j=0}^{∞} Φ^j Z_{t-j}.

We will show that this process is stationary and fulfills the first-order difference equation above. For {X_t} to be well-defined, we must show that ∑_{j=0}^{∞} ‖Φ^j‖ < ∞. Using the properties of the matrix norm we get:

∑_{j=0}^{∞} ‖Φ^j‖ = ∑_{j=0}^{∞} ‖P J^j P^{-1}‖ ≤ ∑_{j=0}^{∞} ‖P‖ ‖J^j‖ ‖P^{-1}‖
                 ≤ ‖P‖ ‖P^{-1}‖ ∑_{j=0}^{∞} √( ∑_{i=1}^{n} |λ_i|^{2j} )
                 ≤ ‖P‖ ‖P^{-1}‖ √n ∑_{j=0}^{∞} |λ_max|^j < ∞,
¹The following exposition remains valid even if Φ is not diagonalizable. In this case one has to rely on the Jordan form, which complicates the computations (Meyer 2000).
12.1 The VAR(1) Process 217
so that {X_t} is well-defined. It indeed fulfills the stochastic difference equation:

X_t = ∑_{j=0}^{∞} Φ^j Z_{t-j} = Z_t + Φ ∑_{j=0}^{∞} Φ^j Z_{t-1-j} = ΦX_{t-1} + Z_t.

To establish uniqueness, let {Y_t} be any stationary solution. Iterating the difference equation backward yields

Y_t = Z_t + ΦZ_{t-1} + Φ²Y_{t-2}
    = …
    = Z_t + ΦZ_{t-1} + Φ²Z_{t-2} + … + Φ^k Z_{t-k} + Φ^{k+1} Y_{t-k-1}.
The variance of the last term can be bounded as

‖Φ^{k+1} Γ(0) (Φ^{k+1})'‖ ≤ ‖P‖² ‖P^{-1}‖² ‖Γ(0)‖ ( ∑_{i=1}^{n} |λ_i|^{2(k+1)} ).
As all eigenvalues of Φ are strictly smaller than one in absolute value, the right hand side of the above expression converges to zero as k goes to infinity. This implies that Y_t and X_t = ∑_{j=0}^{∞} Φ^j Z_{t-j} are equal in the mean square sense and thus also in probability.
Based on Theorem 10.2, the mean and the covariance function of the VAR(1) process are:

EX_t = ∑_{j=0}^{∞} Φ^j EZ_{t-j} = 0,
Γ(h) = ∑_{j=0}^{∞} Φ^{j+h} Σ (Φ^j)' = Φ^h ∑_{j=0}^{∞} Φ^j Σ (Φ^j)' = Φ^h Γ(0).
Analogously to the univariate case, it can be shown that there still exists a unique stationary solution if all eigenvalues are strictly greater than one in absolute value. This solution is, however, no longer causal with respect to {Z_t}. If some of the eigenvalues of Φ lie on the unit circle, there exists no stationary solution.
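These properties are easy to verify numerically for a concrete VAR(1). A minimal sketch; the Φ and Σ below are illustrative values of mine, not taken from the text:

```python
import numpy as np

Phi = np.array([[0.5, 0.2],
                [0.1, 0.4]])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])

# all eigenvalues strictly inside the unit circle -> causal stationary solution
assert np.all(np.abs(np.linalg.eigvals(Phi)) < 1)

# Gamma(0) = sum_{j>=0} Phi^j Sigma (Phi^j)'  (truncated series)
G0 = np.zeros((2, 2))
Pj = np.eye(2)
for _ in range(500):
    G0 += Pj @ Sigma @ Pj.T
    Pj = Pj @ Phi

# Gamma(0) solves the fixed point Gamma(0) = Phi Gamma(0) Phi' + Sigma,
# and Gamma(h) = Phi^h Gamma(0)
G1 = Phi @ G0
print(G0)
```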
Y_t = Φ Y_{t-1} + U_t

where U_t = (Z_t', 0, …, 0)' with U_t ∼ WN(0, ( Σ 0; 0 0 )). This representation is also known as the companion form or state space representation (see also Chap. 17).
In this representation the last n(p−1) equations are simply identities so that there is no error term attached. The latter name stems from the fact that Y_t encompasses all the information necessary to describe the state of the system. The matrix Φ is called the companion matrix of the VAR(p) process.²
The main advantage of the companion form is that by studying the properties
of the VAR(1) model, one implicitly encompasses VAR models of higher order and
also univariate AR(p) models which can be considered as special cases. The relation
between the eigenvalues of the companion matrix and the roots of the matrix polynomial Φ(z) is given by the formula (Gohberg et al. 1982):

det(I_np − Φz) = det(I_n − Φ_1z − … − Φ_pz^p).    (12.2)

In the case of the AR(p) process, the eigenvalues of Φ are just the inverses of the roots of the polynomial Φ(z). Further elaboration of state space models is given in Chap. 17.
As will become clear in Chap. 15 and particularly in Sect. 15.2, the issue of the
existence of a causal representation is even more important than in the univariate
²The representation of a VAR(p) process in companion form is not uniquely defined. Permutations of the elements in Y_t will lead to changes in the companion matrix.
12.3 Causal Representation 219
case. Before stating the main theorem let us generalize the definition of a causal
representation from the univariate case (see Definition 2.2 in Sect. 2.3) to the
multivariate one.
Definition 12.2. A VARMA(p,q) process {X_t} with Φ(L)X_t = Θ(L)Z_t is called causal with respect to {Z_t} if and only if there exists a sequence of absolutely summable matrices {Ψ_j}, j = 0, 1, 2, …, i.e. ∑_{j=0}^{∞} ‖Ψ_j‖ < ∞, such that

X_t = ∑_{j=0}^{∞} Ψ_j Z_{t-j}.
Theorem 12.1. Let {X_t} be a VARMA(p,q) process with Φ(L)X_t = Θ(L)Z_t and assume that

det Φ(z) ≠ 0 for all z ∈ ℂ with |z| ≤ 1.

Then the stochastic difference equation Φ(L)X_t = Θ(L)Z_t has exactly one stationary solution with causal representation

X_t = ∑_{j=0}^{∞} Ψ_j Z_{t-j},

whereby the sequence of matrices {Ψ_j} is absolutely summable and the matrices are uniquely determined by the identity

Φ(z)Ψ(z) = Θ(z).
As in the univariate case, the coefficient matrices which make up the causal representation can be found by the method of undetermined coefficients, i.e. by equating coefficients in Φ(z)Ψ(z) = Θ(z). In the case of the VAR(1) process, the {Ψ_j} have to obey the following recursion:

z^0:  Ψ_0 = I_n
z^1:  Ψ_1 = ΦΨ_0 = Φ
z^2:  Ψ_2 = ΦΨ_1 = Φ²
…
z^j:  Ψ_j = ΦΨ_{j-1} = Φ^j.

For a VAR(2) process, for example, the recursion reads:

z^0:  Ψ_0 = I_n
z^1:  −Φ_1 + Ψ_1 = 0  ⟹  Ψ_1 = Φ_1
z^2:  −Φ_2 − Φ_1Ψ_1 + Ψ_2 = 0  ⟹  Ψ_2 = Φ_2 + Φ_1²
z^3:  −Φ_1Ψ_2 − Φ_2Ψ_1 + Ψ_3 = 0  ⟹  Ψ_3 = Φ_1³ + Φ_1Φ_2 + Φ_2Φ_1
…
Remark 12.1. Consider a VAR(1) process with Φ = ( 0 φ; 0 0 ) with φ ≠ 0. Then the matrices in the causal representation are Ψ_j = Φ^j = 0 for j > 1. This means that {X_t} has an alternative representation as a VMA(1) process because X_t = Z_t + ΦZ_{t-1}. This simple example demonstrates that the representation of
{X_t} as a VARMA process is not unique. It is therefore impossible to always distinguish between VAR and VMA processes of higher orders without imposing additional assumptions. These additional assumptions are much more complex in the multivariate case and are known as identifying assumptions. Thus, a general treatment of this identification problem is outside the scope of this book. See Hannan and Deistler (1988) for a general treatment of this issue. For this reason we will concentrate exclusively on VAR processes where these identification issues do not arise.
Example
Consider the VAR(2) process

X_t = ( 0.8 −0.5; 0.1 −0.5 ) X_{t-1} + ( −0.3 −0.3; −0.2 0.3 ) X_{t-2} + Z_t

with Z_t ∼ WN( ( 0; 0 ), ( 1.0 0.4; 0.4 2.0 ) ).
In a first step, we check whether the VAR model admits a causal representation with respect to {Z_t}. For this purpose we have to compute the roots of the equation det(I_2 − Φ_1z − Φ_2z²) = 0:

det ( 1 − 0.8z + 0.3z²      0.5z + 0.3z²
      −0.1z + 0.2z²     1 + 0.5z − 0.3z² )
= 1 − 0.3z − 0.35z² + 0.32z³ − 0.15z⁴ = 0.
12.4 Computation of Covariance Function 221
The four roots are −1.1973, 0.8828 ± 1.6669i, and 1.5650. As they are all outside the unit circle, there exists a causal representation which can be found from the equation Φ(z)Ψ(z) = I_2 by the method of undetermined coefficients. Multiplying the equation system out, we get:

I_2 − Φ_1z − Φ_2z²
  + Ψ_1z − Φ_1Ψ_1z² − Φ_2Ψ_1z³
  + Ψ_2z² − Φ_1Ψ_2z³ − Φ_2Ψ_2z⁴
  + … = I_2.
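These computations can be checked numerically. A sketch: the eigenvalues of the companion matrix are the inverses of the roots above, and the Ψ_j follow the recursion Ψ_0 = I, Ψ_1 = Φ_1, Ψ_j = Φ_1Ψ_{j-1} + Φ_2Ψ_{j-2} obtained from equating coefficients.

```python
import numpy as np

Phi1 = np.array([[0.8, -0.5],
                 [0.1, -0.5]])
Phi2 = np.array([[-0.3, -0.3],
                 [-0.2,  0.3]])

# companion matrix: its eigenvalues are the inverses of the roots of
# det(I - Phi1 z - Phi2 z^2)
C = np.block([[Phi1, Phi2],
              [np.eye(2), np.zeros((2, 2))]])
eig = np.abs(np.linalg.eigvals(C))
assert np.all(eig < 1)               # all roots outside the unit circle

# Psi recursion from Phi(z) Psi(z) = I
Psi = [np.eye(2), Phi1]
for j in range(2, 10):
    Psi.append(Phi1 @ Psi[j - 1] + Phi2 @ Psi[j - 2])

print("largest |eigenvalue|:", eig.max())   # approx 1/1.1973
print(Psi[2])                               # equals Phi1^2 + Phi2
```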
For the VAR(1) process, taking expectations of X_tX_t' and X_tX'_{t-h} yields the two equations Γ(0) = ΦΓ(−1) + Σ and Γ(h) = ΦΓ(h−1), h > 0. Given Φ and Σ, we can compute Γ(0): for h = 1, the second equation implies Γ(1) = ΦΓ(0). Inserting this expression into the first equation and using the fact that Γ(−1) = Γ(1)', we get an equation in Γ(0):

Γ(0) = ΦΓ(0)Φ' + Σ.

Applying the vec operator and using the rule vec(ABC) = (C'⊗A) vec B, this equation can be solved as

vec Γ(0) = (I_{n²} − Φ⊗Φ)^{-1} vec Σ,

where ⊗ and "vec" denote the Kronecker product and the vec operator, respectively.³ The assumption that {X_t} is causal with respect to {Z_t} guarantees that the eigenvalues of Φ⊗Φ are strictly smaller than one in absolute value, implying that I_{n²} − Φ⊗Φ is invertible.⁴ Knowing Γ(0) and Φ, the autocovariances Γ(h), h > 0, can then be computed recursively from the second equation as Γ(h) = ΦΓ(h−1) = Φ^hΓ(0).
If the process is a causal VAR(p) process the covariance function can be found in
two ways. The first one rewrites the process in companion form as a VAR(1) process
and applies the procedure just outlined. The second way relies on the Yule-Walker
equation. This equation is obtained by multiplying the stochastic difference equation from the right by X_t' and then successively by X'_{t-h}, h > 0, and taking expectations:

Γ(0) = Φ_1Γ(−1) + … + Φ_pΓ(−p) + Σ
     = Φ_1Γ(1)' + … + Φ_pΓ(p)' + Σ,
Γ(h) = Φ_1Γ(h−1) + … + Φ_pΓ(h−p).    (12.5)
³The vec operator stacks the columns of an n×m matrix into a column vector of dimension nm. The properties of ⊗ and vec can be found, e.g., in Magnus and Neudecker (1988).
⁴If the eigenvalues of Φ are λ_i, i = 1, …, n, then the eigenvalues of Φ⊗Φ are λ_iλ_j, i, j = 1, …, n (see Magnus and Neudecker (1988)).
Example
We illustrate the computation of the covariance function using the same example as in Sect. 12.3. First, we transform the model into companion form:

     ( X_{1,t}   )   ( 0.8  −0.5  −0.3  −0.3 ) ( X_{1,t-1} )   ( Z_{1,t} )
Y_t =( X_{2,t}   ) = ( 0.1  −0.5  −0.2   0.3 ) ( X_{2,t-1} ) + ( Z_{2,t} )
     ( X_{1,t-1} )   (  1     0     0     0  ) ( X_{1,t-2} )   (    0    )
     ( X_{2,t-1} )   (  0     1     0     0  ) ( X_{2,t-2} )   (    0    )
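A sketch of this computation in code: solve vec Γ_Y(0) = (I − Φ⊗Φ)^{-1} vec Σ_U for the companion form and read off Γ_X(0) from the top-left block. This is my own illustration of the method described above.

```python
import numpy as np

Phi1 = np.array([[0.8, -0.5],
                 [0.1, -0.5]])
Phi2 = np.array([[-0.3, -0.3],
                 [-0.2,  0.3]])
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])

# companion form: Y_t = Phi Y_{t-1} + U_t with Var(U_t) = diag(Sigma, 0)
Phi = np.block([[Phi1, Phi2],
                [np.eye(2), np.zeros((2, 2))]])
SigmaU = np.zeros((4, 4))
SigmaU[:2, :2] = Sigma

n = 4
vecG0 = np.linalg.solve(np.eye(n * n) - np.kron(Phi, Phi),
                        SigmaU.flatten(order="F"))   # column-wise vec
G0 = vecG0.reshape((n, n), order="F")

GX0 = G0[:2, :2]        # Gamma_X(0): top-left block of Gamma_Y(0)
print(GX0)
```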
Definition 12.1 defined the VARMA process {X_t} as a solution to the corresponding multivariate stochastic difference equation (12.1). However, as pointed out by Zellner and Palm (1974), there is an equivalent representation in the form of n univariate ARMA processes, one for each X_it. Formally, these representations, also called autoregressive final form or transfer function form (Box and Jenkins 1976), can be written as
[det Φ(L)] X_it = [Φ*(L)Θ(L)]_i Z_t

where the index i indicates the i-th row of Φ*(L)Θ(L) and Φ*(L) denotes the adjugate matrix of Φ(L).⁵ Thus each variable in X_t may be investigated separately as
a univariate ARMA process. Thereby the autoregressive part will be the same for
each variable. Note, however, that the moving-average processes will be correlated
across variables.
The disadvantage of this approach is that it involves rather long AR and MA lags
as will become clear from the following example.⁶ Take a simple two-dimensional VAR of order one, i.e. X_t = ΦX_{t-1} + Z_t with Z_t ∼ WN(0, Σ). Then the implied univariate processes will be ARMA(2,1) processes. After some straightforward manipulations we obtain:
(1 − (φ_11+φ_22)L + (φ_11φ_22 − φ_12φ_21)L²) X_1t = Z_1t − φ_22Z_{1,t-1} + φ_12Z_{2,t-1},
(1 − (φ_11+φ_22)L + (φ_11φ_22 − φ_12φ_21)L²) X_2t = φ_21Z_{1,t-1} + Z_2t − φ_11Z_{2,t-1}.
It can be shown by the means given in Sects. 1.4.3 and 1.5.1 that the right hand sides
are observationally equivalent to MA(1) processes.
⁵The elements of the adjugate matrix A* of some matrix A are given by [A*]_ij = (−1)^{i+j}M_ij, where M_ij is the minor (minor determinant) obtained by deleting the i-th column and the j-th row of A (Meyer 2000, p. 477).
⁶The degrees of the AR and the MA polynomial can be as large as np and (n−1)p + q, respectively.
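For a bivariate VAR(1), the common AR polynomial det(I_2 − ΦL) reduces to 1 − (tr Φ)L + (det Φ)L², which can be checked numerically; the Φ below is an illustrative value of mine:

```python
import numpy as np

Phi = np.array([[0.5, 0.2],
                [0.1, 0.4]])

a1 = np.trace(Phi)          # coefficient on L
a2 = np.linalg.det(Phi)     # coefficient on L^2

# compare with det(I - Phi z) on a grid of points
for z in (0.3, -0.7, 1.5):
    assert np.isclose(np.linalg.det(np.eye(2) - Phi * z),
                      1 - a1 * z + a2 * z ** 2)
print(a1, a2)
```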
13 Estimation of Vector Autoregressive Models

13.1 Introduction
We consider the estimation of the parameters of a causal VAR(p) model

Φ(L)X_t = Z_t,  i.e.  X_t − Φ_1X_{t-1} − … − Φ_pX_{t-p} = Z_t  with Z_t ∼ WN(0, Σ).

Assumption 13.1. The VAR(p) process {X_t} is causal with respect to {Z_t}, i.e. X_t = ∑_{j=0}^{∞} Ψ_jZ_{t-j} with ∑_{j=0}^{∞} ‖Ψ_j‖ < ∞.
Assumption 13.2. The residual process {Z_t} is not only white noise, but also independently and identically distributed:

Z_t ∼ IID(0, Σ).

Assumption 13.3. All fourth moments of Z_t exist. In particular, there exists a finite constant c > 0 such that

E|Z_it Z_jt Z_kt Z_lt| ≤ c  for all i, j, k, l = 1, 2, …, n, and for all t.
We can view this equation as a regression equation of X_it on all lagged variables X_{1,t-1}, …, X_{n,t-1}, …, X_{1,t-p}, …, X_{n,t-p} with error term Z_it. Note that the regressors are the same for each equation. The np regressors have coefficient vector (φ_{i1}^{(1)}, …, φ_{in}^{(1)}, …, φ_{i1}^{(p)}, …, φ_{in}^{(p)})'. Thus, the complete VAR(p) model has n²p coefficients in total to be estimated. In addition, there are n(n+1)/2 independent elements of the covariance matrix Σ that have to be estimated too.
It is clear that the n different equations are linked through the regressors and the error terms which in general have non-zero covariances σ_ij = EZ_itZ_jt. Hence, it
seems warranted to take a systems approach and to estimate all equations of the
VAR jointly. Below, we will see that an equation-by-equation approach is, however,
still appropriate.
Suppose that we have T + p observations with t = −p+1, …, 0, 1, …, T. Then we can write the regressor matrix for each equation compactly as a T × np matrix X:

X = ( X_{1,0}    …  X_{n,0}    …  X_{1,−p+1}  …  X_{n,−p+1} )
    ( X_{1,1}    …  X_{n,1}    …  X_{1,−p+2}  …  X_{n,−p+2} )
    (    ⋮             ⋮               ⋮              ⋮      )
    ( X_{1,T−1}  …  X_{n,T−1}  …  X_{1,T−p}   …  X_{n,T−p}  )

Collecting the dependent variables into the n × T matrix Y = (X_1, X_2, …, X_T) and defining Z = (Z_1, Z_2, …, Z_T), the T equations of the VAR(p) model read Y = ΦX' + (Z_1, Z_2, …, Z_T),
13.2 The Least-Squares Estimator 227
or more compactly

Y = ΦX' + Z.
There are two ways to bring this equation system into the usual multivariate regression framework. One can either arrange the data according to observations or according to equations. Ordering by observations yields

vec Y = (X ⊗ I_n) vec Φ + vec Z

with vec Y = (X_11, X_21, …, X_n1, X_12, X_22, …, X_n2, …, X_1T, X_2T, …, X_nT)'. If the data are arranged equation by equation, the dependent variable is vec Y' = (X_11, X_12, …, X_1T, X_21, X_22, …, X_2T, …, X_n1, X_n2, …, X_nT)'. As both representations obviously contain the same information, there exists an nT×nT permutation
or commutation matrix K_{nT} such that vec Y' = K_{nT} vec Y. Using the computation rules for the Kronecker product, the vec operator, and the permutation matrix (see Magnus and Neudecker 1988), we get for the ordering in terms of equations

vec Y' = K_{nT} vec Y = K_{nT} vec(ΦX') + K_{nT} vec Z
       = K_{nT} (X ⊗ I_n) vec Φ + K_{nT} vec Z
       = (I_n ⊗ X) K_{n²p} vec Φ + K_{nT} vec Z
       = (I_n ⊗ X) vec Φ' + vec Z'    (13.2)

where K_{n²p} is the corresponding n²p × n²p permutation matrix relating vec Φ and vec Φ'.
The error terms of the different equations are correlated because, in general, the covariances σ_ij = EZ_itZ_jt are nonzero. In the case of an arrangement by observation, the covariance matrix of the error term vec Z is

V(vec Z) = I_T ⊗ Σ.

In the second case, the arrangement by equation, the covariance matrix of the error term vec Z' is

V(vec Z') = Σ ⊗ I_T.
Given that the covariance matrix is not a multiple of the identity matrix, efficient estimation requires the use of generalized least squares (GLS). The GLS estimator minimizes the weighted sum of squared errors S(vec Φ) = (vec Z)'(I_T ⊗ Σ)^{-1}(vec Z), which leads to

(vec Φ̂)_GLS = (vec Φ̂)_OLS.

As the covariance matrix Σ cancels, the GLS and the OLS estimator deliver numerically exactly the same solution. The reason for this result is that the regressors are the same in each equation. If this does not hold, for example when some coefficients are set a priori to zero, efficient estimation would require the use of GLS.
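The numerical identity of GLS and OLS with identical regressors can be demonstrated on simulated data. The following sketch uses a generic multivariate regression Y = XB + E with correlated errors; all names and values are mine:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, k = 200, 2, 3
X = rng.standard_normal((T, k))      # same regressors in each equation
B = rng.standard_normal((k, n))
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])
E = rng.multivariate_normal(np.zeros(n), Sigma, size=T)
Y = X @ B + E

# OLS equation by equation
B_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# GLS on the stacked system vec(Y) = (I_n (x) X) vec(B) + vec(E),
# with Var(vec E) = Sigma (x) I_T (arrangement by equation)
XX = np.kron(np.eye(n), X)
Oinv = np.kron(np.linalg.inv(Sigma), np.eye(T))
y = Y.flatten(order="F")
b_gls = np.linalg.solve(XX.T @ Oinv @ XX, XX.T @ Oinv @ y)

print(np.max(np.abs(b_gls - B_ols.flatten(order="F"))))  # numerically zero
```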
The least-squares estimator can also be rewritten without the use of the vec operator:

Φ̂ = YX(X'X)^{-1}.
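A sketch of this estimator at work on simulated data; the data-generating Φ, the sample size, and the seed are illustrative choices of mine:

```python
import numpy as np

rng = np.random.default_rng(2)
n, T, burn = 2, 500, 100
Phi = np.array([[0.5, 0.2],
                [0.1, 0.4]])

# simulate a zero-mean VAR(1)
x = np.zeros((n, T + burn + 1))
for t in range(1, T + burn + 1):
    x[:, t] = Phi @ x[:, t - 1] + rng.standard_normal(n)
x = x[:, burn:]

Y = x[:, 1:]             # n x T matrix (X_1, ..., X_T)
X = x[:, :-1].T          # T x n regressor matrix of lagged values

Phi_hat = Y @ X @ np.linalg.inv(X.T @ X)

# residual covariance with degrees-of-freedom adjustment (13.5), here p = 1
Z = Y - Phi_hat @ X.T
Sigma_tilde = (Z @ Z.T) / (T - n)
print(Phi_hat)
```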
Under the assumptions stated in Sect. 13.1, these estimators are consistent and asymptotically normal.

Theorem 13.1 (Asymptotic Distribution of OLS Estimator). Under the assumptions stated in Sect. 13.1, it holds that

plim Φ̂ = Φ
¹Alternatively, one could start from scratch and investigate the minimization problem S(vec Φ') = (vec Z')'(Σ^{-1} ⊗ I_T)(vec Z') → min_Φ.
and that

by observation:  √T (vec Φ̂ − vec Φ) →d N(0, Γ_p^{-1} ⊗ Σ),

respectively,

by equation:  √T (vec Φ̂' − vec Φ') →d N(0, Σ ⊗ Γ_p^{-1}).
In order to make use of this result in practice, we have to replace the matrices Σ and Γ_p by some estimate. A natural consistent estimate of Γ_p is given according to Proposition 13.1 by

Γ̂_p = X'X / T.

In analogy to the multivariate regression model, a natural estimator for Σ can be obtained from the least-squares residuals Ẑ:

Σ̂ = (1/T) ∑_{t=1}^{T} Ẑ_t Ẑ_t' = ẐẐ'/T = (Y − Φ̂X')(Y − Φ̂X')'/T.

An alternative, but asymptotically equivalent, estimator Σ̃ is obtained by adjusting Σ̂ for the degrees of freedom:

Σ̃ = T/(T − np) Σ̂.    (13.5)

If the VAR contains a constant, as is normally the case in practice, the degrees of freedom correction should be T − np − 1.
13.3 Proofs of Asymptotic Normality 231
Small-sample inference with respect to the parameters Φ can therefore be carried out using the approximate distribution

vec Φ̂ ≈ N( vec Φ, Σ̂ ⊗ (X'X)^{-1} ).    (13.6)

This implies that hypothesis testing can be carried out using the conventional t- and F-statistics. From a system perspective, the appropriate degrees of freedom for the t-ratio would be nT − n²p − n, taking a constant in each equation into account. However, as the system can be estimated on an equation-by-equation basis, it seems reasonable to use T − np − 1 instead. This corresponds to a multivariate regression setting with T observations and np + 1 regressors, including a constant.
However, as in the univariate case, the Gauss-Markov theorem does not apply because the lagged regressors are correlated with past error terms. This results in biased estimates in small samples. The amount of the bias can be assessed and corrected by either analytical or bootstrap methods. For an overview, a comparison of the different corrections proposed in the literature, and further references, see Engsted and Pedersen (2014).
Lemma 13.1. Given the assumptions made in Sect. 13.1, the process {vec(Z_{t-j} Z'_{t-i})}, i, j ∈ ℤ and i ≠ j, is white noise. □
Proposition 13.1. Under the assumptions stated in Sect. 13.1,

X'X/T →p Γ_p = ( Γ(0)      Γ(1)      …  Γ(p−1)
                 Γ(1)'     Γ(0)      …  Γ(p−2)
                 ⋮         ⋮         ⋱  ⋮
                 Γ(p−1)'   Γ(p−2)'   …  Γ(0)   ),

where the blocks of X'X/T are the sample covariances

Γ̂(h) = (1/T) ∑_{t=0}^{T-1} X_t X'_{t-h},  h = 0, 1, …, p−1.
Proof. Inserting the causal representation X_t = ∑_{j=0}^{∞} Ψ_j Z_{t-j} yields

Γ̂(h) = (1/T) ∑_{t=0}^{T-1} X_t X'_{t-h}
      = (1/T) ∑_{t=0}^{T-1} ∑_{j=0}^{∞} ∑_{i=0}^{∞} Ψ_j Z_{t-j} Z'_{t-h-i} Ψ_i'
      = ∑_{j=0}^{∞} ∑_{i=0}^{∞} Ψ_j ( (1/T) ∑_{t=0}^{T-1} Z_{t-j} Z'_{t-h-i} ) Ψ_i'
      = ∑_{j=0}^{∞} ∑_{i=h}^{∞} Ψ_j ( (1/T) ∑_{t=0}^{T-1} Z_{t-j} Z'_{t-i} ) Ψ'_{i-h}.

By Lemma 13.1 and the law of large numbers,

(1/T) ∑_{t=0}^{T-1} Z_{t-j} Z'_{t-i} →p 0,  i ≠ j,
so that, for every fixed m,

G_m(h) = ∑_{j=0}^{m} ∑_{i=h, i≠j}^{m+h} Ψ_j ( (1/T) ∑_{t=0}^{T-1} Z_{t-j} Z'_{t-i} ) Ψ'_{i-h} →p 0.

Collecting the remaining terms with i = j, we can write

Γ̂(h) = G_∞(h) + ∑_{j=h}^{∞} Ψ_j ( (1/T) ∑_{t=0}^{T-1} Z_t Z_t' ) Ψ'_{j-h} + remainder
where the remainder only depends on initial conditions² and is therefore negligible as T → ∞. As

(1/T) ∑_{t=0}^{T-1} Z_t Z_t' →p Σ,

we finally get

Γ̂(h) →p ∑_{j=h}^{∞} Ψ_j Σ Ψ'_{j-h} = Γ(h).
Proposition 13.2. Under the assumptions stated in Sect. 13.1,

(1/√T) ∑_{t=1}^{T} vec( Z_t X'_{t-1}, Z_t X'_{t-2}, …, Z_t X'_{t-p} )
= (1/√T) vec(ZX) = (1/√T) (X' ⊗ I_n) vec Z →d N(0, Γ_p ⊗ Σ).
Proof. The idea of the proof is to approximate {X_t} by some simpler process {X_t^(m)} which allows the application of the CLT for dependent processes (Theorem C.13). This leads to an asymptotic distribution which, by virtue of the Basic Approximation Theorem C.14, converges to the asymptotic distribution of the original process. Define X_t^(m) as the truncated process from the causal representation of X_t:

X_t^(m) = Z_t + Ψ_1 Z_{t-1} + … + Ψ_m Z_{t-m},  m = p, p+1, p+2, …

Using this approximation, we can then define the process {Y_t^(m)} as

Y_t^(m) = vec( Z_t X_{t-1}^(m)', Z_t X_{t-2}^(m)', …, Z_t X_{t-p}^(m)' )
        = ( X_{t-1}^(m); X_{t-2}^(m); …; X_{t-p}^(m) ) ⊗ Z_t.
²See the proof of Theorem 11.2.2 in Brockwell and Davis (1991) for details.
Due to the independence of {Z_t}, this process is a mean zero white noise process, but is clearly not independent. It is easy to see that the process is actually (m+p)-dependent with variance V_m given by

V_m = E Y_t^(m) Y_t^(m)'
    = E[ ( (X_{t-1}^(m); …; X_{t-p}^(m)) ⊗ Z_t ) ( (X_{t-1}^(m); …; X_{t-p}^(m)) ⊗ Z_t )' ]
    = E[ (X_{t-1}^(m); …; X_{t-p}^(m)) (X_{t-1}^(m); …; X_{t-p}^(m))' ] ⊗ E Z_t Z_t'
    = ( Γ^(m)(0)     Γ^(m)(1)     …  Γ^(m)(p−1)
        Γ^(m)(1)'    Γ^(m)(0)     …  Γ^(m)(p−2)
        ⋮            ⋮            ⋱  ⋮
        Γ^(m)(p−1)'  Γ^(m)(p−2)'  …  Γ^(m)(0)   ) ⊗ Σ
    = Γ_p^(m) ⊗ Σ,
where Γ_p^(m) is composed of

Γ^(m)(h) = E X_{t-1}^(m) X_{t-1-h}^(m)'
         = E[ (Z_{t-1} + Ψ_1Z_{t-2} + … + Ψ_mZ_{t-1-m}) (Z_{t-1-h} + Ψ_1Z_{t-2-h} + … + Ψ_mZ_{t-1-m-h})' ]
         = ∑_{j=h}^{m} Ψ_j Σ Ψ'_{j-h},  h = 0, 1, …, p−1.
Thus, we can invoke the CLT for (m+p)-dependent processes (see Theorem C.13) to establish that

√T ( (1/T) ∑_{t=1}^{T} Y_t^(m) ) →d N(0, V_m).
The variance of the approximation error Y_t − Y_t^(m) is

E[ ( ( ∑_{j=m+1}^{∞} Ψ_j Z_{t-1-j}; …; ∑_{j=m+1}^{∞} Ψ_j Z_{t-p-j} ) ⊗ Z_t )
   ( ( ∑_{j=m+1}^{∞} Ψ_j Z_{t-1-j}; …; ∑_{j=m+1}^{∞} Ψ_j Z_{t-p-j} ) ⊗ Z_t )' ]
= ( ∑_{j=m+1}^{∞} Ψ_j Σ Ψ_j'   …   …
    ⋮                          ⋱   ⋮
    …                          …   ∑_{j=m+1}^{∞} Ψ_j Σ Ψ_j' ) ⊗ Σ.
The absolute summability of {Ψ_j} then implies that these infinite sums converge to zero as m → ∞. As X_t^(m) →m.s. X_t for m → ∞, we can apply the Basic Approximation Theorem C.14 to reach the required conclusion:

(1/√T) ∑_{t=1}^{T} vec( Z_t X'_{t-1}, Z_t X'_{t-2}, …, Z_t X'_{t-p} ) →d N(0, Γ_p ⊗ Σ).  □
Proof. We prove the Theorem for the arrangement by observation; the proof for the arrangement by equation is completely analogous. Inserting the regression formula (13.1) into the least-squares formula (13.3) leads to:

vec Φ̂ = vec Φ + ( ((X'X)^{-1}X') ⊗ I_n ) vec Z.    (13.7)
Bringing vec Φ to the left-hand side and taking the probability limit, we get, using Slutzky's Lemma C.10 for the product of probability limits,

plim (vec Φ̂ − vec Φ) = plim vec( ZX (X'X)^{-1} )
= vec( plim (ZX/T) · (plim (X'X/T))^{-1} ) = 0.

The last equality follows from the observation that Proposition 13.1 implies plim X'X/T = Γ_p nonsingular and that Proposition 13.2 implies plim ZX/T = 0. Thus,
we have established that the Least-Squares estimator is consistent.
Equation (13.7) further implies:

√T (vec Φ̂ − vec Φ) = √T ( ((X'X)^{-1}X') ⊗ I_n ) vec Z
                    = ( (X'X/T)^{-1} ⊗ I_n ) (1/√T) (X' ⊗ I_n) vec Z.

As plim X'X/T = Γ_p nonsingular, the above expression converges in distribution, according to Theorem C.10 and Proposition 13.2, to a normally distributed random variable with mean zero and covariance matrix

( Γ_p^{-1} ⊗ I_n )( Γ_p ⊗ Σ )( Γ_p^{-1} ⊗ I_n ) = Γ_p^{-1} ⊗ Σ.  □
Proof. Writing Y − Φ̂X' = Z + (Φ − Φ̂)X', the results above imply

ZX(Φ − Φ̂)' / √T →p 0

and

(Φ − Φ̂) (X'X/T) √T (Φ − Φ̂)' →p 0.

Hence,

√T ( (Y − Φ̂X')(Y − Φ̂X')'/T − ZZ'/T ) →p 0,

so that Σ̂ − ZZ'/T →p 0. Since ZZ'/T →p Σ, the estimator Σ̂ is consistent.  □
13.4 The Yule-Walker Estimator

An alternative to least squares is the Yule-Walker estimator. For the VAR(1) process it is based on the moment equations

Γ(0) = ΦΓ(−1) + Σ
Γ(1) = ΦΓ(0)

or

Γ(0) = ΦΓ(0)Φ' + Σ
Γ(1) = ΦΓ(0).

Replacing the theoretical moments by their empirical counterparts, we get the Yule-Walker estimator for Φ and Σ:

Φ̂ = Γ̂(1) Γ̂(0)^{-1},
Σ̂ = Γ̂(0) − Φ̂ Γ̂(0) Φ̂'.
In the general case of a VAR(p) model, the Yule-Walker estimator is given as the solution of the equation system

Γ̂(h) = ∑_{j=1}^{p} Φ̂_j Γ̂(h−j),  h = 1, …, p,
Σ̂ = Γ̂(0) − Φ̂_1 Γ̂(−1) − … − Φ̂_p Γ̂(−p).
As the least-squares and the Yule-Walker estimators differ only in the treatment of the starting values, they are asymptotically equivalent. In fact, they yield very similar estimates even in finite samples (see e.g. Reinsel (1993)). However, as in the univariate case, the Yule-Walker estimator always delivers, in contrast to the least-squares estimator, coefficient estimates with the property det(I_n − Φ̂_1z − … − Φ̂_pz^p) ≠ 0 for all z ∈ ℂ with |z| ≤ 1. Thus, the Yule-Walker estimator guarantees that the estimated VAR possesses a causal representation. This, however, comes at the price that the Yule-Walker estimator has a larger small-sample bias than the least-squares estimator, especially when the roots of Φ(z) get close to the unit circle (Tjøstheim and Paulsen 1983; Shaman and Stine 1988; Reinsel 1993). Thus, it is generally preferable to use the least-squares estimator in practice.
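A sketch of the Yule-Walker estimator for a simulated VAR(1), including the causality property just mentioned; the data-generating Φ and the seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T, burn = 2, 500, 100
Phi = np.array([[0.6, 0.2],
                [-0.1, 0.5]])

x = np.zeros((n, T + burn))
for t in range(1, T + burn):
    x[:, t] = Phi @ x[:, t - 1] + rng.standard_normal(n)
x = x[:, burn:]
x = x - x.mean(axis=1, keepdims=True)

def gamma_hat(x, h):
    """Sample covariance Gamma_hat(h) with the 1/T normalization."""
    T = x.shape[1]
    return (x[:, h:] @ x[:, :T - h].T) / T

G0, G1 = gamma_hat(x, 0), gamma_hat(x, 1)
Phi_yw = G1 @ np.linalg.inv(G0)
Sigma_yw = G0 - Phi_yw @ G0 @ Phi_yw.T

print(Phi_yw)
print(np.abs(np.linalg.eigvals(Phi_yw)))   # inside the unit circle
```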
14 Forecasting with VAR Models
The discussion of forecasting with VAR models proceeds in two steps. First, we assume that the parameters of the model are known. Although this assumption is unrealistic, it will nevertheless allow us to introduce and analyze important concepts and ideas. In a second step, we then investigate how the results established in the first step have to be amended if the parameters are estimated. The analysis will focus on stationary and causal VAR(1) processes. Processes of higher order can be accommodated by rewriting them in companion form. Thus we have:

X_t = ΦX_{t-1} + Z_t    with Z_t ∼ WN(0, Σ).

The forecast P_T X_{T+h} of X_{T+h} given X_T, X_{T-1}, …, X_1 is chosen to minimize the mean squared error E tr[(X_{T+h} − P_T X_{T+h})(X_{T+h} − P_T X_{T+h})']. Thereby "tr" denotes the trace operator which takes the sum of the diagonal elements of a matrix. As we rely on linear forecasting functions, P_T X_{T+h} can be expressed as¹

P_T X_{T+h} = A_1 X_T + A_2 X_{T-1} + … + A_T X_1.    (14.1)
These equations state that the forecast error X_{T+h} − P_T X_{T+h} must be uncorrelated with the available information X_s, s = 1, 2, …, T. The normal equations can be written as

(A_1, A_2, …, A_T) ( Γ(0)      Γ(1)      …  Γ(T−1)
                     Γ(1)'     Γ(0)      …  Γ(T−2)
                     ⋮         ⋮         ⋱  ⋮
                     Γ(T−1)'   Γ(T−2)'   …  Γ(0)   )
= ( Γ(h)  Γ(h+1)  …  Γ(T+h−1) ).

Using the assumption that {X_t} is a VAR(1), Γ(h) can be expressed as Γ(h) = Φ^hΓ(0) (see Eq. (12.3)), so that the normal equations become

(A_1, A_2, …, A_T) ( Γ(0)           ΦΓ(0)          …  Φ^{T−1}Γ(0)
                     Γ(0)Φ'         Γ(0)           …  Φ^{T−2}Γ(0)
                     ⋮              ⋮              ⋱  ⋮
                     Γ(0)Φ'^{T−1}   Γ(0)Φ'^{T−2}   …  Γ(0)        )
= ( Φ^hΓ(0)  Φ^{h+1}Γ(0)  …  Φ^{T+h−1}Γ(0) ).
¹If the mean is non-zero, a constant A_0 must be added to the forecast function.
14.1 Forecasting with Known Parameters 243
This system is solved by A_1 = Φ^h and A_2 = … = A_T = 0, so that

P_T X_{T+h} = Φ^h X_T.    (14.2)

The forecast error X_{T+h} − P_T X_{T+h} has expectation zero. Thus, the linear least-squares predictor delivers unbiased forecasts. As the forecast error equals ∑_{j=0}^{h−1} Φ^j Z_{T+h−j}, its covariance matrix is ∑_{j=0}^{h−1} Φ^j Σ (Φ^j)'.

In order to analyze the case of a causal VAR(p) process with T > p, we transform the model into the companion form. For h = 1, we can apply the result above to get:
P_T Y_{T+1} = ΦY_T, i.e.

( P_T X_{T+1} )   ( Φ_1  Φ_2  …  Φ_{p−1}  Φ_p ) ( X_T       )
( X_T         )   ( I_n   0   …    0       0  ) ( X_{T−1}   )
( X_{T−1}     ) = (  0   I_n  …    0       0  ) ( X_{T−2}   )
(     ⋮       )   (  ⋮    ⋮   ⋱    ⋮       ⋮  ) (     ⋮     )
( X_{T−p+2}   )   (  0    0   …   I_n      0  ) ( X_{T−p+1} )
The forecast error is X_{T+1} − P_T X_{T+1} = Z_{T+1}, which has mean zero and covariance matrix Σ. In general we have that P_T Y_{T+h} = Φ^h Y_T, so that P_T X_{T+h} is equal to

P_T X_{T+h} = Φ_1^{(h)} X_T + Φ_2^{(h)} X_{T−1} + … + Φ_p^{(h)} X_{T−p+1}

where Φ_i^{(h)}, i = 1, …, p, denote the blocks in the first row of Φ^h. Alternatively, the forecast for h > 1 can be computed recursively. For h = 2 this leads to:

P_T X_{T+2} = P_T(Φ_1 X_{T+1}) + P_T(Φ_2 X_T) + … + P_T(Φ_p X_{T+2−p}) + P_T(Z_{T+2})
= Φ_1 (Φ_1 X_T + Φ_2 X_{T−1} + … + Φ_p X_{T+1−p}) + Φ_2 X_T + … + Φ_p X_{T+2−p}
= (Φ_1² + Φ_2) X_T + (Φ_1Φ_2 + Φ_3) X_{T−1} + … + (Φ_1Φ_{p−1} + Φ_p) X_{T+2−p} + Φ_1Φ_p X_{T+1−p}.
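The recursive scheme can be sketched via the companion form, where P_T Y_{T+h} = Φ^h Y_T; the terminal observations X_T and X_{T-1} below are hypothetical values of mine:

```python
import numpy as np

Phi1 = np.array([[0.8, -0.5],
                 [0.1, -0.5]])
Phi2 = np.array([[-0.3, -0.3],
                 [-0.2,  0.3]])
Phi = np.block([[Phi1, Phi2],
                [np.eye(2), np.zeros((2, 2))]])

Y_T = np.array([1.0, -0.5, 0.2, 0.8])   # (X_T; X_{T-1}), hypothetical

forecasts = []
y = Y_T
for h in range(1, 4):
    y = Phi @ y                          # P_T Y_{T+h} = Phi^h Y_T
    forecasts.append(y[:2])              # first block is P_T X_{T+h}

# h = 2 agrees with (Phi1^2 + Phi2) X_T + Phi1 Phi2 X_{T-1}
direct = (Phi1 @ Phi1 + Phi2) @ Y_T[:2] + (Phi1 @ Phi2) @ Y_T[2:]
print(forecasts[1], direct)
```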
Example
Consider again the VAR(2) model of Sect. 12.3. The forecast function in this case is then:

P_T X_{T+1} = Φ_1 X_T + Φ_2 X_{T−1}
            = ( 0.8 −0.5; 0.1 −0.5 ) X_T + ( −0.3 −0.3; −0.2 0.3 ) X_{T−1},
P_T X_{T+2} = (Φ_1² + Φ_2) X_T + Φ_1Φ_2 X_{T−1}
            = ( 0.29 −0.45; −0.17 0.50 ) X_T + ( −0.14 −0.39; 0.07 −0.18 ) X_{T−1},
P_T X_{T+3} = (Φ_1³ + Φ_1Φ_2 + Φ_2Φ_1) X_T + (Φ_1²Φ_2 + Φ_2²) X_{T−1}
            = ( 0.047 −0.310; −0.016 −0.345 ) X_T + ( 0.003 −0.222; −0.049 0.201 ) X_{T−1}.
Based on the results computed in Sect. 12.3, we can calculate the corresponding mean squared errors (MSE):

MSE(1) = Σ = ( 1.0 0.4; 0.4 2.0 ),
MSE(2) = Σ + Ψ_1ΣΨ_1' = ( 1.82 0.80; 0.80 2.47 ),
MSE(3) = Σ + Ψ_1ΣΨ_1' + Ψ_2ΣΨ_2' = ( 2.2047 0.3893; 0.3893 2.9309 ).
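A short numerical check of these MSE matrices, using Ψ_1 = Φ_1 and Ψ_2 = Φ_1² + Φ_2 from the recursion in Sect. 12.3:

```python
import numpy as np

Phi1 = np.array([[0.8, -0.5],
                 [0.1, -0.5]])
Phi2 = np.array([[-0.3, -0.3],
                 [-0.2,  0.3]])
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])

Psi = [np.eye(2), Phi1, Phi1 @ Phi1 + Phi2]

def mse(h):
    """MSE(h) = sum_{j=0}^{h-1} Psi_j Sigma Psi_j'."""
    return sum(Psi[j] @ Sigma @ Psi[j].T for j in range(h))

print(mse(2))   # [[1.82, 0.80], [0.80, 2.47]]
print(mse(3))
```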
At this stage we note that Wold's theorem or Wold's Decomposition carries over to the multivariate case (see Sect. 3.2 for the univariate case). This theorem asserts that there exists for each purely non-deterministic stationary process² a decomposition, respectively representation, of the form

X_t = ∑_{j=0}^{∞} Ψ_j Z_{t-j},

where Ψ_0 = I_n, Z_t ∼ WN(0, Σ) with Σ > 0, and ∑_{j=0}^{∞} ‖Ψ_j‖² < ∞. The innovations {Z_t} have the property Z_t = X_t − P̃_{t-1}X_t and consequently Z_t = P̃_tZ_t. Thereby P̃_t denotes the linear least-squares predictor based on the infinite past {X_t, X_{t-1}, …}. The interpretation of the multivariate case is analogous to the univariate one.
In practice the parameters of the VAR model are usually unknown and therefore have to be estimated. In the previous section we have demonstrated that the forecasts can be computed recursively. Replacing the unknown parameters by their estimates, the forecast function becomes

P̂_T X_{T+h} = Φ̂_1 P̂_T X_{T+h−1} + … + Φ̂_p P̂_T X_{T+h−p},

where a hat indicates the use of estimates and P̂_T X_{T+j} = X_{T+j} for j ≤ 0. The forecast error can then be decomposed into two components:

X_{T+h} − P̂_T X_{T+h} = (X_{T+h} − P_T X_{T+h}) + (P_T X_{T+h} − P̂_T X_{T+h})
= ∑_{j=0}^{h−1} Ψ_j Z_{T+h−j} + (P_T X_{T+h} − P̂_T X_{T+h}).    (14.6)
Dufour (1985) has shown that, under the assumption of symmetrically distributed Z_t's (i.e. if Z_t and −Z_t have the same distribution), the expectation of the forecast error is zero even when the parameters are replaced by their least-squares estimates.
²A stationary stochastic process is called deterministic if it can be perfectly forecasted from its infinite past. It is called purely non-deterministic if there is no deterministic component (see Sect. 3.2).
This result holds despite the fact that these estimates are biased in small samples.
Moreover, the results do not assume that the model is correctly specified in terms
of the order p. Thus, under quite general conditions the forecast with estimated coefficients remains unbiased so that \mathrm{E}(X_{T+h} - \hat{P}_T X_{T+h}) = 0.
If the estimation is based on a different sample than the one used for forecasting, the two terms in the above expression are uncorrelated so that its mean squared error is given by the sum of the two mean squared errors:

\widehat{\mathrm{MSE}}(h) = \sum_{j=0}^{h-1} \Psi_j \Sigma \Psi_j' + \mathrm{E}\left[(P_T X_{T+h} - \hat{P}_T X_{T+h})(P_T X_{T+h} - \hat{P}_T X_{T+h})'\right]. \quad (14.7)
The last term can be evaluated by using the asymptotic distribution of the coefficients as an approximation. The corresponding formula turns out to be cumbersome. The technical details can be found in Lütkepohl (2006) and Reinsel (1993). The formula can, however, be simplified considerably if we consider a forecast horizon of only one period. We deduce the formula for a VAR of order one, i.e. taking X_t = \Phi X_{t-1} + Z_t with Z_t \sim \mathrm{WN}(0, \Sigma). In this case

P_T X_{T+1} - \hat{P}_T X_{T+1} = (\Phi - \hat{\Phi}) X_T = \mathrm{vec}\left((\Phi - \hat{\Phi}) X_T\right) = (X_T' \otimes I_n)\,\mathrm{vec}(\Phi - \hat{\Phi}).

Denoting by \Gamma = \mathrm{E}(X_t X_t') the covariance matrix of X_t, the asymptotic covariance matrix of \mathrm{vec}(\hat{\Phi}) is (\Gamma^{-1} \otimes \Sigma)/T, so that

\mathrm{E}\left[(P_T X_{T+1} - \hat{P}_T X_{T+1})(P_T X_{T+1} - \hat{P}_T X_{T+1})'\right]
\approx \frac{1}{T}\,\mathrm{E}\left[(X_T' \otimes I_n)(\Gamma^{-1} \otimes \Sigma)(X_T \otimes I_n)\right] = \frac{1}{T}\,\mathrm{E}(X_T' \Gamma^{-1} X_T) \otimes \Sigma
= \frac{1}{T}\,\mathrm{E}\left(\mathrm{tr}(X_T' \Gamma^{-1} X_T)\right) \otimes \Sigma = \frac{1}{T}\,\mathrm{tr}\left(\Gamma^{-1} \mathrm{E}(X_T X_T')\right) \otimes \Sigma
= \frac{1}{T}\left(\mathrm{tr}(I_n) \otimes \Sigma\right) = \frac{n}{T}\,\Sigma.

Thereby, we have used the asymptotic normality of the least-squares estimator (see Theorem 13.1) and the assumption that forecasting and estimation use different realizations of the stochastic process. Thus, for h = 1 and p = 1, we get

\widehat{\mathrm{MSE}}(1) = \Sigma + \frac{n}{T}\,\Sigma = \frac{T+n}{T}\,\Sigma.
14.3 Modeling of VAR Models 247
Higher order models can be treated similarly using the companion form of the VAR(p). In this case:

\widehat{\mathrm{MSE}}(1) = \Sigma + \frac{np}{T}\,\Sigma = \frac{T+np}{T}\,\Sigma. \quad (14.8)

This is only an approximation as we applied asymptotic results to small sample entities. The expression shows that the effect of the substitution of the coefficients by their least-squares estimates vanishes as the sample becomes large. However, in small samples the factor \frac{T+np}{T} can be sizeable. In the example treated in Sect. 14.4, the covariance matrix \Sigma, taking the use of a constant into account and assuming 8 lags, has to be inflated by \frac{T+np+1}{T} = \frac{196+32+1}{196} = 1.168. Note also that the precision of the forecast, given \Sigma, diminishes with the number of parameters.
The previous section treated the estimation of VAR models under the assumption that the order of the VAR, p, is known. In most cases, this assumption is unrealistic as the order p is unknown and must be retrieved from the data. We can proceed analogously as in the univariate case (see Sect. 5.1) and iteratively test the hypothesis that the coefficients corresponding to the highest lag are simultaneously equal to zero, i.e. \Phi_p = 0. Starting from a maximal order p_{\max}, we test the null hypothesis that \Phi_{p_{\max}} = 0 in the corresponding VAR(p_{\max}) model. If the hypothesis is not rejected, we reduce the order by one to p_{\max} - 1 and test anew the null hypothesis \Phi_{p_{\max}-1} = 0 using the smaller VAR(p_{\max} - 1) model. One continues in this way until the null hypothesis is rejected. This gives, then, the appropriate order of the VAR. The different tests can be carried out either as Wald-tests (F-tests) or as likelihood-ratio tests (\chi^2-tests) with n^2 degrees of freedom.
An alternative procedure to determine the order of the VAR relies on information criteria. As in the univariate case, the most popular ones are the Akaike (AIC), the Schwarz or Bayesian (BIC) and the Hannan-Quinn criterion (HQC). The corresponding formulas are:

\mathrm{AIC}(p): \ \ln \det \tilde{\Sigma}_p + \frac{2pn^2}{T};

\mathrm{BIC}(p): \ \ln \det \tilde{\Sigma}_p + \frac{pn^2 \ln T}{T};

\mathrm{HQC}(p): \ \ln \det \tilde{\Sigma}_p + \frac{2pn^2 \ln(\ln T)}{T};

where \tilde{\Sigma}_p denotes the degrees of freedom adjusted estimate of the covariance matrix \Sigma for a model of order p (see Eq. (13.5)) and pn^2 is the number of estimated coefficients. The estimated order is then given as the minimizer of one of these
criteria. In practice Akaike's criterion is the most popular one although it has a tendency to deliver orders which are too high. The BIC and the HQ-criterion on the other hand deliver the correct order on average, but can lead to models which suffer from the omitted variable bias when the estimated order is too low. Examples are
discussed in Sects. 14.4 and 15.4.5.
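The three criteria are easily computed from least-squares residuals. A sketch in Python/numpy, assuming a data matrix X of dimension T × n; the function name is made up for illustration, and the exact degrees-of-freedom divisor in Eq. (13.5) may differ from the one used here:

```python
import numpy as np

def var_order_criteria(X, p_max):
    """AIC/BIC/HQC for VAR(p), p = 1..p_max, fitted by OLS with a constant.

    All orders are estimated on the same effective sample (T_all - p_max
    observations) so the criteria are comparable across p.
    """
    T_all, n = X.shape
    T = T_all - p_max
    crit = {}
    for p in range(1, p_max + 1):
        Y = X[p_max:]                                  # regressand: X_t
        Z = np.hstack([np.ones((T, 1))] +
                      [X[p_max - j:T_all - j] for j in range(1, p + 1)])
        B, *_ = np.linalg.lstsq(Z, Y, rcond=None)
        U = Y - Z @ B                                  # residuals
        Sigma = U.T @ U / (T - n * p - 1)              # d.o.f.-adjusted estimate
        ld = np.log(np.linalg.det(Sigma))
        crit[p] = {'AIC': ld + 2 * p * n**2 / T,
                   'BIC': ld + p * n**2 * np.log(T) / T,
                   'HQC': ld + 2 * p * n**2 * np.log(np.log(T)) / T}
    return crit

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 2))        # illustrative data
crit = var_order_criteria(X, p_max=4)
p_hat = min(crit, key=lambda p: crit[p]['BIC'])
```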
Following Lütkepohl (2006), Akaike's information criterion can be rationalized as follows. Take as a measure of fit the determinant of the one period approximate mean squared error \widehat{\mathrm{MSE}}(1) from Eq. (14.8) and take as an estimate of \Sigma the degrees of freedom corrected version in Eq. (13.5). The resulting criterion is called, following Akaike (1969), the final prediction error (FPE):

\mathrm{FPE}(p) = \det\left(\frac{T+np}{T} \cdot \frac{T}{T-np}\,\tilde{\Sigma}_p\right) = \left(\frac{T+np}{T-np}\right)^n \det \tilde{\Sigma}_p. \quad (14.9)

Taking logs and using the approximations \frac{T+np}{T-np} \approx 1 + \frac{2np}{T} and \log\left(1 + \frac{2np}{T}\right) \approx \frac{2np}{T}, we arrive at

\ln \mathrm{FPE}(p) \approx \ln \det \tilde{\Sigma}_p + \frac{2pn^2}{T} = \mathrm{AIC}(p).
In this section, we illustrate how to build and use VAR models for forecasting
key macroeconomic variables. For this purpose, we consider the following four
variables: GDP per capita (fYt g), price level in terms of the consumer price index
(CPI) (fPt g), real money stock M1 (fMt g), and the three month treasury bill rate
(fRt g). All variables are for the U.S. and are, with the exception of the interest rate,
in logged differences.3 The components of X_t are, with the exception of the interest rate, stationary.4 Thus, we aim at modeling X_t = (\Delta \log Y_t, \Delta \log P_t, \Delta \log M_t, R_t)'.
The sample runs from the first quarter 1959 to the first quarter 2012. We estimate
our models, however, only up to the fourth quarter 2008 and reserve the last thirteen
quarters, i.e. the period from the first quarter 2009 to first quarter of 2012, for an out-
of-sample evaluation of the forecast performance. This forecast assessment has the advantage of accounting explicitly for the sampling variability of the estimated parameters.
The first step in the modeling process is the determination of the lag-length.
Allowing for a maximum of twelve lags, the different information criteria produce
the values reported in Table 14.1. Unfortunately, the three criteria deliver different
3
Thus, \Delta \log P_t equals the inflation rate.
4
Although the unit root tests indicate that R_t is integrated of order one, we do not difference this variable. This specification will not affect the consistency of the estimates nor the choice of the lag-length (Sims et al. 1990), but has the advantage that each component of X_t is expressed in percentage points which facilitates the interpretation.
14.4 Example: VAR Model 249
orders: AIC suggests 8 lags, HQ 5 lags, and BIC 2 lags. In such a situation it is
wise to keep all three models and to perform additional diagnostic tests.5 One such
test is to run a horse-race between the three models in terms of their forecasting
performance.
We evaluate the forecasts according to two criteria: the root mean squared error (RMSE) and the mean absolute error (MAE)6:

\mathrm{RMSE}: \ \sqrt{\frac{1}{h} \sum_{t=T+1}^{T+h} (\hat{X}_{it} - X_{it})^2} \quad (14.10)

\mathrm{MAE}: \ \frac{1}{h} \sum_{t=T+1}^{T+h} |\hat{X}_{it} - X_{it}| \quad (14.11)

where \hat{X}_{it} and X_{it} denote the forecast and the actual value of variable i in period t.
Forecasts are computed for a horizon h starting in period T. We can gain further insights by decomposing the mean squared error additively into three components:

\frac{1}{h} \sum_{t=T+1}^{T+h} (\hat{X}_{it} - X_{it})^2 = \left(\frac{1}{h} \sum_{t=T+1}^{T+h} \hat{X}_{it} - \bar{X}_i\right)^2 + \left(\sigma_{\hat{X}_i} - \sigma_{X_i}\right)^2 + 2(1-\rho)\,\sigma_{\hat{X}_i}\sigma_{X_i}.
5
Such tests would include an analysis of the autocorrelation properties of the residuals and tests of
structural breaks.
6
Alternatively one could use the mean-absolute-percentage-error (MAPE). However, as all vari-
ables are already in percentages, the MAE is to be preferred.
The first component measures how far the mean of the forecasts \frac{1}{h}\sum_{t=T+1}^{T+h} \hat{X}_{it} is away from the actual mean of the data \bar{X}_i. It therefore measures the bias of the forecasts. The second one compares the standard deviation of the forecasts \sigma_{\hat{X}_i} to that of the data \sigma_{X_i}. Finally, the last component is a measure of the unsystematic forecast errors where \rho denotes the correlation between the forecast and the data. Ideally, each of the three components should be close to zero: there should be no bias, the variation of the forecasts should correspond to that of the data, and the forecasts and the data should be highly positively correlated. In order to avoid scaling problems, all three components are usually expressed as a proportion of \frac{1}{h}\sum_{t=T+1}^{T+h} (\hat{X}_{it} - X_{it})^2:
\text{bias proportion:} \quad \frac{\left(\frac{1}{h}\sum_{t=T+1}^{T+h} \hat{X}_{it} - \bar{X}_i\right)^2}{\frac{1}{h}\sum_{t=T+1}^{T+h} (\hat{X}_{it} - X_{it})^2} \quad (14.12)

\text{variance proportion:} \quad \frac{\left(\sigma_{\hat{X}_i} - \sigma_{X_i}\right)^2}{\frac{1}{h}\sum_{t=T+1}^{T+h} (\hat{X}_{it} - X_{it})^2} \quad (14.13)

\text{covariance proportion:} \quad \frac{2(1-\rho)\,\sigma_{\hat{X}_i}\sigma_{X_i}}{\frac{1}{h}\sum_{t=T+1}^{T+h} (\hat{X}_{it} - X_{it})^2} \quad (14.14)
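The criteria (14.10)–(14.14) translate directly into code. A minimal sketch in Python/numpy with illustrative numbers; by construction the three proportions sum to one:

```python
import numpy as np

def forecast_evaluation(forecast, actual):
    """RMSE, MAE and the bias/variance/covariance proportions (14.10)-(14.14)."""
    f, y = np.asarray(forecast, float), np.asarray(actual, float)
    err = f - y
    mse = np.mean(err**2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(err))
    sf, sy = f.std(), y.std()              # 1/h moments, as in the decomposition
    rho = np.corrcoef(f, y)[0, 1]
    bias_p = (f.mean() - y.mean())**2 / mse
    var_p = (sf - sy)**2 / mse
    cov_p = 2 * (1 - rho) * sf * sy / mse
    return rmse, mae, bias_p, var_p, cov_p

# Illustrative forecasts and realizations over h = 4 periods:
rmse, mae, bias_p, var_p, cov_p = forecast_evaluation([1.1, 1.9, 3.2, 4.1],
                                                      [1.0, 2.0, 3.0, 4.0])
```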
Fig. 14.1 Forecast comparison of alternative models (actual data and VAR(2), VAR(5), VAR(8) forecasts). (a) \log Y_t. (b) \log P_t. (c) \log M_t. (d) R_t
Value-at-Risk in Sect. 8.4. Figure 14.2 plots the forecasts of the VAR(8) model together with a 80 % confidence interval computed from the bootstrap approach. It shows that, with the exception of the logged price level, the actual realizations fall outside the confidence interval despite the fact that the intervals are already relatively wide. This documents the uniqueness of the financial crisis, which gives a hard time to any forecasting model.
Instead of computing a confidence interval, one may estimate the probability distribution of possible future outcomes. This provides a complete description of the uncertainty related to the prediction problem (Christoffersen 1998; Diebold et al. 1998; Tay and Wallis 2000; Corradi and Swanson 2006). Finally, one should be aware that the innovation uncertainty is not the only source of uncertainty. As the parameters of the model are themselves estimated, there is also coefficient uncertainty. In addition, we have to face the possibility that the model is misspecified.
The forecasting performance of the VAR models may seem disappointing at first. However, this was only a first attempt and further investigations are usually necessary. These may include the search for structural breaks (see Bai et al. 1998; Perron 2006). This topic is treated in Sect. 18.1. Another reason for the poor
Fig. 14.2 Forecast of VAR(8) model and 80 % confidence intervals (red dotted lines). (a) \log Y_t. (b) \log P_t. (c) \log M_t. (d) R_t
As a first technique for the understanding of VAR processes, we analyze the concept of causality which was introduced by Granger (1969). The concept is also known as Wiener-Granger causality because Granger's idea goes back to the work of Wiener (1956). Take a multivariate time series \{X_t\} and consider the forecast of X_{1,T+h}, h \geq 1, given X_T, X_{T-1}, \ldots where \{X_t\} has not only X_{1t} as a component, but also another variable or group of variables X_{2t}. X_t may contain even further variables than X_{1t} and X_{2t}. The mean squared forecast error is denoted by \mathrm{MSE}_1(h). Consider now an alternative forecast of X_{1,T+h} given \tilde{X}_T, \tilde{X}_{T-1}, \ldots where \{\tilde{X}_t\} is obtained from \{X_t\} by eliminating the component X_{2t}. The mean squared error of this forecast is denoted by \widetilde{\mathrm{MSE}}_1(h). According to Granger, we can say that the second variable X_{2t} causes or is causal for X_{1t} if and only if

\mathrm{MSE}_1(h) < \widetilde{\mathrm{MSE}}_1(h) \quad \text{for some } h \geq 1.

This means that the information contained in \{X_{2t}\} and its past improves the forecast of \{X_{1t}\} in the sense of the mean squared forecast error. Thus the concept of Wiener-Granger causality only makes sense for purely non-deterministic processes and rests on two principles1:
• The future cannot cause the past. Only the past can have a causal influence on the future.2
• A specific cause contains information not available otherwise.
If one restricts oneself to linear least-squares forecasts, the above definition can be easily operationalized in the context of VAR models with only two variables (see also Sims 1972). Consider first a VAR(1) model. Then according to the explanations in Chap. 14 the one-period forecast is:

P_T X_{T+1} = \begin{pmatrix} P_T X_{1,T+1} \\ P_T X_{2,T+1} \end{pmatrix} = \Phi X_T = \begin{pmatrix} \phi_{11} & \phi_{12} \\ \phi_{21} & \phi_{22} \end{pmatrix} \begin{pmatrix} X_{1,T} \\ X_{2,T} \end{pmatrix}

and therefore P_T X_{1,T+1} = \phi_{11} X_{1,T} + \phi_{12} X_{2,T}.
1
Compare this to the concept of a causal representation developed in Sects. 12.3 and 2.3.
2
Sometimes also the concept of contemporaneous causality is considered. This concept is, however, controversial and has not gained much traction in practice; it will therefore not be pursued here.
15.1 Wiener-Granger Causality 257
If \phi_{12} = 0 then the second variable does not contribute to the one-period forecast of the first variable and can therefore be omitted: \mathrm{MSE}_1(1) = \widetilde{\mathrm{MSE}}_1(1). Note that

\Phi^h = \begin{pmatrix} \phi_{11}^h & 0 \\ \star & \phi_{22}^h \end{pmatrix},

where \star is a placeholder for an arbitrary number. Thus the second variable is not only irrelevant for the one-period forecast, but for any forecast horizon h \geq 1. Thus, the second variable is not causal for the first variable in the sense of Wiener-Granger causality.
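The triangularity argument is easy to verify numerically. A small sketch with an illustrative lower triangular coefficient matrix:

```python
import numpy as np

# If phi_12 = 0, Phi is lower triangular, and so is every power Phi^h:
# the second variable never enters the forecast of the first at any horizon.
Phi = np.array([[0.5, 0.0],
                [0.3, 0.4]])
for h in range(1, 6):
    Ph = np.linalg.matrix_power(Phi, h)
    assert Ph[0, 1] == 0.0                 # X_2 is irrelevant for X_1
    assert np.isclose(Ph[0, 0], 0.5**h)    # top-left entry is phi_11^h
```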
These arguments can be easily extended to VAR(p) models. According to Eq. (14.4) the forecast of the first variable depends on the lagged values of both variables through the coefficients \phi_{ij}^{(k)}, where \phi_{ij}^{(k)} denotes the (i,j)-th element, i, j = 1, 2, of the matrix \Phi_k, k = 1, \ldots, p. In order for the second variable to have no influence on the forecast of the first one, we must have \phi_{12}^{(1)} = \phi_{12}^{(2)} = \ldots = \phi_{12}^{(p)} = 0. This implies that all matrices \Phi_k, k = 1, \ldots, p, must be lower triangular, i.e. they must be of the form \begin{pmatrix} \star & 0 \\ \star & \star \end{pmatrix}. As the multiplication and addition of lower triangular matrices is again a lower triangular matrix, the second variable is irrelevant in forecasting the first one at any forecast horizon. This can be seen by computing the corresponding forecast function recursively as in Chap. 14.
Based on this insight it is straightforward to test the null hypothesis that the second variable does not cause the first one within the VAR(p) context:

H_0: \ \phi_{12}^{(1)} = \phi_{12}^{(2)} = \ldots = \phi_{12}^{(p)} = 0.
The alternative hypothesis is that the null hypothesis is violated. As the method
of least-squares estimation leads under quite general conditions to asymptotically
normal distributed coefficient estimates, it is straightforward to test the above
hypothesis by a Wald-test (F-test). In the context of a VAR(1) model a simple t-
test is also possible.
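Such a test can be sketched with plain least squares: compare the unrestricted regression of the first variable on p lags of both series with the restricted regression on its own lags only. A hedged sketch in Python/numpy (the data-generating process below is hypothetical; in practice one would use a packaged routine):

```python
import numpy as np

def granger_wald(x1, x2, p):
    """F-test of H0: x2 does not Granger-cause x1 in a bivariate VAR(p)."""
    T = len(x1) - p
    lags = lambda x: np.column_stack([x[p - j:len(x) - j]
                                      for j in range(1, p + 1)])
    y = x1[p:]
    Z_u = np.column_stack([np.ones(T), lags(x1), lags(x2)])  # unrestricted
    Z_r = np.column_stack([np.ones(T), lags(x1)])            # restricted
    ssr = lambda Z: np.sum((y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0])**2)
    ssr_u, ssr_r = ssr(Z_u), ssr(Z_r)
    return ((ssr_r - ssr_u) / p) / (ssr_u / (T - 2 * p - 1))

# Simulated example in which x2 causes x1 with one lag:
rng = np.random.default_rng(1)
T = 500
x2 = rng.standard_normal(T)
x1 = np.zeros(T)
for t in range(1, T):
    x1[t] = 0.3 * x1[t - 1] + 0.8 * x2[t - 1] + rng.standard_normal()
F = granger_wald(x1, x2, p=1)   # a large F rejects non-causality
```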
If more than two variables are involved the concept of Wiener-Granger causality
is no longer so easy to implement. Consider for expositional purposes a VAR(1)
model in three variables with coefficient matrix:
258 15 Interpretation of VAR Models
\Phi = \begin{pmatrix} \phi_{11} & \phi_{12} & 0 \\ \phi_{21} & \phi_{22} & \phi_{23} \\ \phi_{31} & \phi_{32} & \phi_{33} \end{pmatrix}.
Thus, the third variable X_{3t} is irrelevant for the one-period forecast of the first variable. However, as the third variable has an influence on the second variable, \phi_{23} \neq 0, and because the second variable has an influence on the first variable, \phi_{12} \neq 0, the third variable will provide indirectly useful information for the forecast of the first variable for forecasting horizons h \geq 2. Consequently, the concept of causality cannot immediately be extended from two to more than two variables.
It is, however, possible to merge variables one and two, or variables two and three, into groups and discuss the hypothesis that the third variable does not cause the first two variables, seen as a group; likewise that the second and third variable, seen as a group, do not cause the first variable. The corresponding null hypotheses then are H_0: \phi_{13} = \phi_{23} = 0, respectively H_0: \phi_{12} = \phi_{13} = 0. Under these null hypotheses we get again lower block-triangular matrices:

\Phi = \begin{pmatrix} \phi_{11} & \phi_{12} & 0 \\ \phi_{21} & \phi_{22} & 0 \\ \phi_{31} & \phi_{32} & \phi_{33} \end{pmatrix} \quad \text{or} \quad \Phi = \begin{pmatrix} \phi_{11} & 0 & 0 \\ \phi_{21} & \phi_{22} & \phi_{23} \\ \phi_{31} & \phi_{32} & \phi_{33} \end{pmatrix}.
We can get further insights into the concept of causality by considering a bivariate VAR, \Phi(L) X_t = Z_t, with causal representation X_t = \Psi(L) Z_t. Partitioning the matrices according to the two variables \{X_{1t}\} and \{X_{2t}\}, Theorem 12.1 of Sect. 12.3 implies that

\begin{pmatrix} \Phi_{11}(z) & \Phi_{12}(z) \\ \Phi_{21}(z) & \Phi_{22}(z) \end{pmatrix} \begin{pmatrix} \Psi_{11}(z) & \Psi_{12}(z) \\ \Psi_{21}(z) & \Psi_{22}(z) \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}
where the polynomials \Phi_{12}(z) and \Phi_{21}(z) have no constant terms. The hypothesis that the second variable does not cause the first one is equivalent in this framework to the hypothesis that \Phi_{12}(z) = 0. Multiplying out the above expression leads to the condition

\Phi_{11}(z)\,\Psi_{12}(z) = 0.

Because \Phi_{11}(z) involves a constant term, the above equation implies that \Psi_{12}(z) = 0. Thus the causal representation is lower triangular. This means that the first variable is composed of the first shock, \{Z_{1t}\}, only whereas the second variable involves both shocks \{Z_{1t}\} and \{Z_{2t}\}. The univariate causal representation of \{X_{1t}\} is therefore the same as the bivariate one.3 Finally, note the similarity to the issue of the identification of shocks discussed in subsequent sections.
In the case of two variables we can also examine the cross-correlations to test for causality. This non-parametric test has the advantage that one does not have to rely on an explicit VAR model. This advantage becomes particularly relevant if a VMA model must be approximated by a high order AR model. Consider the cross-correlations \rho_{12}(h) = \mathrm{corr}(X_{1t}, X_{2,t-h}). If \rho_{12}(h) \neq 0 for h > 0, we can say that the past values of the second variable are useful for forecasting the first variable such that the second variable causes the first one in the sense of Wiener and Granger. Another terminology is that the second variable is a leading indicator for the first one. If in addition \rho_{12}(h) \neq 0 for h < 0, so that the past values of the first variable help to forecast the second one, we have causality in both directions.
As the distribution of the cross-correlations of two independent variables depends on the autocorrelation of each variable (see Theorem 11.4), Haugh (1976) and Pierce and Haugh (1977) propose a test based on the filtered time series. Analogously to the test for independence (see Sect. 11.2), we proceed in two steps:
Step 1: Estimate in the first step a univariate AR(p) model for each of the two time series \{X_{1t}\} and \{X_{2t}\}. Thereby choose p such that the corresponding residuals \{\hat{Z}_{1t}\} and \{\hat{Z}_{2t}\} are white noise. Note that although \{\hat{Z}_{1t}\} and \{\hat{Z}_{2t}\} are each not autocorrelated, the cross-correlations \rho_{Z_1,Z_2}(h) may still be non-zero for arbitrary orders h.
3
As we are working with causal VARs, the above arguments also hold with respect to the Wold Decomposition.
Step 2: As \{\hat{Z}_{1t}\} and \{\hat{Z}_{2t}\} are the forecast errors based on forecasts which rely only on the own past, the concept of causality carries over from the original variables to the residuals. The null hypothesis that the second variable does not cause the first variable in the sense of Wiener and Granger can then be checked by the Haugh-Pierce statistic:

\text{Haugh-Pierce statistic:} \quad T \sum_{h=1}^{L} \hat{\rho}^2_{Z_1,Z_2}(h) \sim \chi^2_L. \quad (15.1)
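The two steps can be sketched with plain least squares in numpy. Under H0 the statistic is approximately \chi^2_L (the 5 % critical value for L = 10 is about 18.3); the simulated series below are purely illustrative:

```python
import numpy as np

def ar_residuals(x, p):
    """Least-squares residuals of a univariate AR(p) with constant."""
    T = len(x) - p
    Z = np.column_stack([np.ones(T)] +
                        [x[p - j:len(x) - j] for j in range(1, p + 1)])
    y = x[p:]
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return y - Z @ b

def haugh_pierce(x1, x2, p, L):
    """T * sum_{h=1}^L rho_hat^2_{Z1,Z2}(h); approx. chi^2_L under H0."""
    z1, z2 = ar_residuals(x1, p), ar_residuals(x2, p)
    z1 = (z1 - z1.mean()) / z1.std()
    z2 = (z2 - z2.mean()) / z2.std()
    T = len(z1)
    # rho_hat(h) correlates Z1_t with Z2_{t-h}: the past of variable 2.
    return T * sum(np.mean(z1[h:] * z2[:T - h])**2 for h in range(1, L + 1))

rng = np.random.default_rng(0)
e = rng.standard_normal((502, 2))
x2 = e[1:, 1]
x1 = 0.8 * e[:-1, 1] + e[1:, 0]       # x2 causes x1 with one lag
stat = haugh_pierce(x1, x2, p=4, L=10)   # large => reject non-causality
```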
The discussion in the previous section showed that the relation between VAR models
and economic models is ambiguous. In order to better understand the quintessence
of the problem, we first analyze a simple macroeconomic example. Let fyt g and
fmt g denote the output and the money supply of an economy4 and suppose that
the relation between the two variables is represented by the following simultaneous
equation system:
4
If one is working with actual data, the variables are usually expressed in log-differences to achieve
stationarity.
15.2 Structural and Reduced Form 261
Note that the two structural shocks are assumed to be contemporaneously uncorrelated, which is reflected in the assumption that \Omega is a diagonal matrix. This assumption is uncontroversial in the literature: otherwise, there would remain some unexplained relationship between the shocks.
as the statistical analog of an experiment in the natural sciences. The experiment
corresponds in this case to a shift of the AD-curve due to, for example, a temporary
non-anticipated change in government expenditures or money supply. The goal of
the analysis is then to trace the reaction of the economy, in our case represented
by the two variables fyt g and fmt g, to these isolated and autonomous changes
in aggregate demand and money supply. The structural equations imply that the
reaction is not restricted to contemporaneous effects, but is spread out over time.
We thus represent this reaction by the impulse response function.
We can write the system more compactly in matrix notation:

\begin{pmatrix} 1 & -a_1 \\ -a_2 & 1 \end{pmatrix} \begin{pmatrix} y_t \\ m_t \end{pmatrix} = \begin{pmatrix} \gamma_{11} & \gamma_{12} \\ \gamma_{21} & \gamma_{22} \end{pmatrix} \begin{pmatrix} y_{t-1} \\ m_{t-1} \end{pmatrix} + \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} v_{yt} \\ v_{mt} \end{pmatrix}

or

X_{1t} = y_t = \frac{\gamma_{11} + a_1 \gamma_{21}}{1 - a_1 a_2}\, y_{t-1} + \frac{\gamma_{12} + a_1 \gamma_{22}}{1 - a_1 a_2}\, m_{t-1} + \frac{v_{yt}}{1 - a_1 a_2} + \frac{a_1 v_{mt}}{1 - a_1 a_2} = \phi_{11} y_{t-1} + \phi_{12} m_{t-1} + Z_{1t}

X_{2t} = m_t = \frac{\gamma_{21} + a_2 \gamma_{11}}{1 - a_1 a_2}\, y_{t-1} + \frac{\gamma_{22} + a_2 \gamma_{12}}{1 - a_1 a_2}\, m_{t-1} + \frac{a_2 v_{yt}}{1 - a_1 a_2} + \frac{v_{mt}}{1 - a_1 a_2} = \phi_{21} y_{t-1} + \phi_{22} m_{t-1} + Z_{2t}.
Thus, the reduced form has the structure of a VAR(1) model with error term \{Z_t\} = \{(Z_{1t}, Z_{2t})'\}. The reduced form can also be expressed in matrix notation as:

X_t = A^{-1} \Gamma X_{t-1} + A^{-1} B V_t = \Phi X_{t-1} + Z_t

where

Z_t \sim \mathrm{WN}(0, \Sigma) \quad \text{with} \quad \Sigma = A^{-1} B \Omega B' A'^{-1}.
Whereas the structural form represents the inner economic relations between the
variables (economic model), the reduced form given by the VAR model summarizes
their outer directly observable characteristics. As there is no unambiguous relation
between the reduced and structural form, it is impossible to infer the inner
economic relationships from the observations alone. This is known in statistics
as the identification problem. Typically, a whole family of structural models is
compatible with a particular reduced form. The models of the family are thus
observationally equivalent to each other as they imply the same distribution for fXt g.
The identification problem can be overcome if one is willing to make additional
a priori assumptions. The nature and the type of these assumption and their
interpretation is subject of the rest of this chapter.
In our example, the parameters characterizing the structural form are (a_1, a_2, \gamma_{11}, \gamma_{12}, \gamma_{21}, \gamma_{22}, \omega_y^2, \omega_m^2) and those of the reduced form are (\phi_{11}, \phi_{12}, \phi_{21}, \phi_{22}, \sigma_1^2, \sigma_{12}, \sigma_2^2). As there are eight parameters in the structural form, but only seven parameters in the reduced form, there is no one-to-one relation between structural and reduced form. The VAR(1) model delivers estimates for the seven reduced form parameters, but there is no way to infer from these estimates the parameters of the structural form. Thus, there is a fundamental identification problem.
The simple counting of the number of parameters in each form tells us that we need at least one additional restriction on the parameters of the structural form. The simplest restriction is a zero restriction. Suppose that a_2 equals zero, i.e. that the central bank does not react immediately to current output. This seems reasonable because national accounting figures are usually released with some delay. With this assumption, we can infer the structural parameters from the reduced ones:

\gamma_{21} = \phi_{21}, \qquad \gamma_{22} = \phi_{22},
v_{mt} = Z_{2t} \ \Rightarrow\ \omega_m^2 = \sigma_2^2 \ \Rightarrow\ a_1 = \sigma_{12}/\sigma_2^2 \ \Rightarrow\ \omega_y^2 = \sigma_1^2 - \sigma_{12}^2/\sigma_2^2,
\gamma_{11} = \phi_{11} - (\sigma_{12}/\sigma_2^2)\,\phi_{21}, \qquad \gamma_{12} = \phi_{12} - (\sigma_{12}/\sigma_2^2)\,\phi_{22}.
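This recursion can be verified numerically: build the reduced form from hypothetical structural parameters (with a2 = 0) and recover them from (\Phi, \Sigma). All numbers below are illustrative:

```python
import numpy as np

# Hypothetical structural parameters; a2 = 0 (no immediate reaction of
# the money supply to current output).
a1 = 0.5
Gamma = np.array([[0.6, 0.2],
                  [0.1, 0.7]])
omega_y2, omega_m2 = 1.0, 0.25

A = np.array([[1.0, -a1],
              [0.0,  1.0]])
Ainv = np.linalg.inv(A)
Phi = Ainv @ Gamma                                      # reduced-form Phi
Sigma = Ainv @ np.diag([omega_y2, omega_m2]) @ Ainv.T   # reduced-form Sigma

# Recover the structural parameters from the reduced form:
a1_hat = Sigma[0, 1] / Sigma[1, 1]
omega_m2_hat = Sigma[1, 1]
omega_y2_hat = Sigma[0, 0] - Sigma[0, 1]**2 / Sigma[1, 1]
Gamma_hat = np.array([[Phi[0, 0] - a1_hat * Phi[1, 0],
                       Phi[0, 1] - a1_hat * Phi[1, 1]],
                      [Phi[1, 0], Phi[1, 1]]])
```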
Remark 15.1. Note that, because Z_t = A^{-1} B V_t, the reduced form disturbances Z_t are a linear combination of the structural disturbances, in our case the demand disturbance v_{yt} and the money supply disturbance v_{mt}. In each period t the endogenous variables output y_t and money supply m_t are therefore hit simultaneously by both shocks. It is thus not possible without further assumptions to assign the movements in Z_t and consequently in X_t to corresponding changes in the fundamental structural shocks v_{yt} and v_{mt}.
Remark 15.2. As Cooley and LeRoy (1985) already pointed out, the statement "money supply is not causal in the sense of Wiener and Granger for real economic activity", which in our example is equivalent to \phi_{12} = 0, is not equivalent to the statement "money supply does not influence real economic activity" because \phi_{12} can be zero without a_1 being zero. Thus, the notion of causality is not very meaningful in inferring the inner (structural) relationships between variables.
We now present the general identification problem in the context of VARs.5 The starting point of the analysis consists of a linear model, derived ideally from economic theory, in its structural form:

A X_t = \Gamma_1 X_{t-1} + \ldots + \Gamma_p X_{t-p} + B V_t, \qquad V_t \sim \mathrm{WN}(0, \Omega), \quad (15.2)

where \Omega is a diagonal matrix and the diagonal elements of A and B are normalized to one. The assumption that the structural disturbances are uncorrelated with each other is not viewed as controversial as otherwise there would be unexplained relationships between them. In the literature one encounters an alternative completely equivalent normalization which leaves the coefficients in B unrestricted but assumes the covariance matrix of V_t, \Omega, to be equal to the identity matrix I_n.
The reduced form is obtained by solving the equation system with respect to X_t. Assuming that A is nonsingular, the premultiplication of Eq. (15.2) by A^{-1} leads to the reduced form which corresponds to a VAR(p) model:
5
A thorough treatment of the identification problem in econometrics can be found in Rothenberg
(1971), and for the VAR context in Rubio-Ramírez et al. (2010).
X_t = A^{-1} \Gamma_1 X_{t-1} + \ldots + A^{-1} \Gamma_p X_{t-p} + A^{-1} B V_t = \Phi_1 X_{t-1} + \ldots + \Phi_p X_{t-p} + Z_t. \quad (15.3)

The relation between the structural disturbances V_t and the reduced form disturbances Z_t is in the form of a simultaneous equation system:

A Z_t = B V_t. \quad (15.4)

While the structural disturbances are not directly observed, the reduced form disturbances are given as the residuals of the VAR and can thus be considered as given. The relation between the coefficients of the lagged variables is simply

\Gamma_j = A \Phi_j, \qquad j = 1, 2, \ldots, p.

Consequently, once A and B have been identified, not only the coefficients of the lagged variables in the structural form are identified, but also the impulse response functions (see Sect. 15.4.1). We can therefore concentrate our analysis of the identification problem on Eq. (15.4).
With these preliminaries, it is now possible to state the identification problem more precisely. Equation (15.4) shows that the structural form is completely determined by the parameters (A, B, \Omega). Taking the normalization of A and B into account, these parameters can be viewed as points in \mathbb{R}^{n(2n-1)}. These parameters determine the distribution of Z_t = A^{-1} B V_t which is completely characterized by the covariance matrix of Z_t, \Sigma, as the mean is equal to zero.6 Thus, the parameters of the reduced form, i.e. the independent elements of \Sigma taking the symmetry into account, are points in \mathbb{R}^{n(n+1)/2}. The relation between structural and reduced form can therefore be described by a function g: \mathbb{R}^{n(2n-1)} \to \mathbb{R}^{n(n+1)/2} with

g(A, B, \Omega) = A^{-1} B \Omega B' A'^{-1} = \Sigma.

Ideally, one would want to find the inverse of this function and retrieve, in this way, the structural parameters (A, B, \Omega) from \Sigma. This is, however, in general not possible because the dimension of the domain space of g, n(2n-1), is strictly greater, for n \geq 2, than the dimension of its range space, n(n+1)/2. This discrepancy between the dimensions of the domain and the range space of g is known as the identification problem. To put it in another way, there are only n(n+1)/2 (nonlinear) equations for n(2n-1) unknowns.7
6
As usual, we concentrate on the first two moments only.
7
Note also that our discussion of the identification problem focuses on local identification, i.e. the invertibility of g in an open neighborhood of \Sigma. See Rothenberg (1971) and Rubio-Ramírez et al. (2010) for details on the distinction between local and global identification.
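The counting argument is easy to spell out for small n. A quick check: with unit diagonals, A and B each contribute n(n−1) free parameters and \Omega contributes n, against n(n+1)/2 free elements of the symmetric \Sigma:

```python
# Parameters of the structural form (A, B, Omega) versus the reduced form Sigma.
for n in range(2, 9):
    structural = n * (n - 1) + n * (n - 1) + n     # = n(2n - 1)
    reduced = n * (n + 1) // 2                     # free elements of Sigma
    assert structural == n * (2 * n - 1)
    assert structural - reduced == 3 * n * (n - 1) // 2
    assert structural > reduced                    # under-identification
```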
where R, Q and D are arbitrary invertible matrices such that R respects the normalization of A and B, Q is an orthogonal matrix, and D is a diagonal matrix. It can be verified that such transformations leave \Sigma, and hence the reduced form, unchanged. The dimensions of the sets of admissible matrices R, Q, and D are n^2 - 2n, n(n-1)/2, and n, respectively. Summing up gives 3n(n-1)/2 = (n^2 - 2n) + n(n-1)/2 + n degrees of freedom as before8.
The empirical economics literature has proposed several alternative identification schemes:
8
See Neusser (2016) for further implications of viewing the identification problem from an
invariance perspective.
Pagan 2011; Kilian and Murphy 2012; Rubio-Ramírez et al. 2010; Arias et al.
2014; Baumeister and Hamilton 2015). This approach is complementary to
the two previous identification schemes and will be discussed in Sect. 15.6.
(v) Identification through heteroskedasticity (Rigobon 2003)
(vi) Restrictions derived from a dynamic stochastic general equilibrium (DSGE)
model. These restrictions often come as nonlinear cross-equation restrictions
and are viewed as the hallmark of rational expectations models (Hansen and
Sargent 1980). Typically, the identification issue is overcome by imposing a priori restrictions via a Bayesian approach (Del Negro and Schorfheide (2004) among many others).
(vii) Identification using information on global versus idiosyncratic shocks in the
context of multi-country or multi-region VAR models (Canova and Ciccarelli
2008; Dees et al. 2007)
(viii) Instead of identifying all parameters, researchers may be interested in iden-
tifying only one equation or a subset of equations. This case is known as
partial identification. The schemes presented above can be extended in a
straightforward manner to the partial identification case.
These schemes are not mutually exclusive, but can be combined with each other.
In the following we will only cover the identification through short- and long-
run restrictions, because these are by far the most popular ones. The economic
importance of these restrictions for the analysis of monetary policy has been
emphasized by Christiano et al. (1999).
with unknowns (B)_{12}, (B)_{21}, \omega_1^2, and \omega_2^2. Note that the assumption of \Sigma being positive definite implies that (B)_{12}(B)_{21} \neq 1. Thus, we can solve out \omega_1^2 and \omega_2^2 and reduce the three equations to only one:

\left((B)_{12} - b_{21}^{-1}\right)\left((B)_{21} - b_{12}^{-1}\right) = r_{12}^{-2} - 1, \quad (15.7)

where b_{21} and b_{12} denote the least-squares regression coefficients of Z_{2t} on Z_{1t}, respectively of Z_{1t} on Z_{2t}, i.e. b_{21} = \sigma_{12}/\sigma_1^2 and b_{12} = \sigma_{12}/\sigma_2^2. r_{12} is the correlation
9
The exposition is inspired by Leamer (1981).
Fig. 15.1 The hyperbola of Eq. (15.7) in the ((B)_{12}, (B)_{21}) plane with center C = (b_{21}^{-1}, b_{12}^{-1}), asymptotes (B)_{12} = b_{21}^{-1} and (B)_{21} = b_{12}^{-1}, and intercepts (b_{12}, 0) and (0, b_{21})
coefficient between Z_{2t} and Z_{1t}, i.e. r_{12} = \sigma_{12}/(\sigma_1 \sigma_2). Note that imposing a zero restriction by setting, for example, (B)_{12} equal to zero implies that (B)_{21} equals b_{21}; and vice versa, setting (B)_{21} = 0 implies (B)_{12} = b_{12}. As a final remark, the right hand side of Eq. (15.7) is always positive as the inverse of the squared correlation coefficient is bigger than one. This implies that both factors on the left hand side must be of the same sign.
Equation (15.7) delineates all possible combinations of (B)_{12} and (B)_{21} which are compatible with a given covariance matrix \Sigma. Its graph represents a rectangular hyperbola in the parameter space ((B)_{12}, (B)_{21}) with center C = (b_{21}^{-1}, b_{12}^{-1}) and asymptotes (B)_{12} = b_{21}^{-1} and (B)_{21} = b_{12}^{-1}; it is plotted in Fig. 15.1.10 The hyperbola consists of two disconnected branches with a pole at the center C = (b_{21}^{-1}, b_{12}^{-1}). At this point, the relation between the two parameters changes sign. The figure also indicates the two possible zero restrictions (B)_{12} = 0 and (B)_{21} = 0, called short-run restrictions. These two restrictions lie on the same branch, and the path connecting them falls completely within one quadrant. Thus, along this path the signs of the parameters remain unchanged.
Suppose that instead of fixing a particular parameter, we only want to restrict its sign. Assuming that (B)_{12} \geq 0 implies that (B)_{21} must lie in one of the two disconnected intervals (-\infty, b_{21}] or (b_{12}^{-1}, +\infty).11 Although not very explicit, some economic consequences of this topological particularity are discussed in Fry and Pagan (2011). Alternatively, assuming (B)_{12} \leq 0 implies (B)_{21} \in [b_{21}, b_{12}^{-1}). Thus, (B)_{21} is unambiguously positive. Sign restrictions for (B)_{21} can be discussed in a similar manner. Section 15.6 discusses sign restrictions more explicitly.
10
Moon et al. (2013; section 2) provided an alternative geometric representation.
11
That these two intervals are disconnected follows from the fact that † is positive definite.
15 Interpretation of VAR Models
Σ = B Ω B′.
12 The case of overidentification is, for example, treated in Bernanke (1986).
13 A version of the instrumental variable (IV) approach is discussed in Sect. 15.5.2.
15.3 Identification via Short-Run Restrictions
where ∗ is just a placeholder. The matrices B and Ω are uniquely determined by the Cholesky decomposition of the matrix Σ. The Cholesky decomposition factorizes a positive-definite matrix Σ uniquely into the product BΩB′ where B is a lower triangular matrix with ones on the diagonal and Ω a diagonal matrix with strictly positive diagonal entries (Meyer 2000). As Z_t = BV_t, Sims’ identification gives rise to the following interpretation. v_{1t} is the only structural shock which has an effect on X_{1t} in period t. All other shocks have no contemporaneous effect. Moreover, Z_{1t} = v_{1t} so that the residual from the first equation is just equal to the first structural shock and σ_1² = ω_1². The second variable X_{2t} is contemporaneously affected only by v_{1t} and v_{2t}, and not by the remaining shocks v_{3t}, ..., v_{nt}. In particular, Z_{2t} = b_{21}v_{1t} + v_{2t} so that b_{21} can be retrieved from the equation σ_{21} = b_{21}ω_1². This identifies the second structural shock v_{2t} and ω_2². Due to the triangular nature of B, the system is recursive and all structural shocks and parameters can be identified successively. The application of the Cholesky decomposition as an identification scheme rests crucially on the ordering of the variables (X_{1t}, X_{2t}, ..., X_{nt})′ in the system.
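As a numerical sketch (my own illustration, not from the text), this recursive identification can be computed with standard linear algebra routines; the covariance matrix below is the estimate Σ̂ from the advertisement–sales example of Sect. 15.4.5:

```python
import numpy as np

# Estimated covariance matrix of the VAR residuals (Sect. 15.4.5)
Sigma = np.array([[0.038, 0.011],
                  [0.011, 0.010]])

C = np.linalg.cholesky(Sigma)    # Sigma = C C' with positive diagonal
d = np.diag(C)
B = C / d                        # unit lower triangular B (columns rescaled)
Omega = np.diag(d**2)            # diagonal covariance of the structural shocks
# B @ Omega @ B.T reproduces Sigma; B[1, 0] is the contemporaneous impact
# of the first structural shock on the second variable
```

Note that B[1, 0] = 0.011/0.038 ≈ 0.29, which is exactly the value b_{21} reported in the example.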
Sims’ scheme, although easy to implement, becomes less plausible as the number of variables in the system increases. For this reason the more general scheme with A ≠ I_n and B not necessarily lower triangular is more popular. However, even for medium-sized systems such as n = 5, 30 restrictions are necessary, which stresses the imagination even of brilliant economists, as the estimation of Blanchard’s model in Sect. 15.4.5 shows.
Focusing on the identification of the matrices A and B also brings an advantage in terms of estimation. As shown in Chap. 13, the OLS estimator of the VAR coefficient matrices Φ_1, ..., Φ_p equals the GLS estimator independently of Σ. Thus, the estimation of the structural parameters can be broken down into two steps. In the first step, the coefficient matrices Φ_1, ..., Φ_p are estimated using OLS. The residuals are then used to estimate Σ, which leads to an estimate of the covariance matrix (see Eq. (13.6)). In the second step, the coefficients of A, B, and Ω are then estimated given the estimate Σ̂ of Σ by solving the nonlinear equation system (15.5), taking the specific identification scheme into account. Thereby Σ is replaced by its estimate Σ̂. As √T (vech(Σ̂) − vech(Σ)) converges in distribution to a normal distribution with mean zero, Â, B̂, and Ω̂ are also asymptotically normal because they are obtained by a one-to-one mapping from Σ̂.14 The Continuous Mapping Theorem further implies that Â, B̂, and Ω̂ converge to their true means and that their asymptotic covariance matrix can be obtained by an application of the delta method (see Theorem E.1 in Appendix E). Further details can be found in Bernanke (1986), Blanchard and Watson (1986), Giannini (1991), Hamilton (1994b), and Sims (1986).
14 The vech operator transforms a symmetric n × n matrix Σ into a n(n+1)/2 vector by stacking the columns of Σ such that each element is listed only once.
X_t = Z_t + Ψ_1 Z_{t−1} + Ψ_2 Z_{t−2} + ...
    = A^{−1}BV_t + Ψ_1 A^{−1}BV_{t−1} + Ψ_2 A^{−1}BV_{t−2} + ...          (15.8)

The effect of the j-th structural disturbance on the i-th variable after h periods, denoted by ∂X_{i,t+h}/∂v_{jt}, is thus given by the (i, j)-th element of the matrix Ψ_h A^{−1}B:

∂X_{i,t+h}/∂v_{jt} = [Ψ_h A^{−1}B]_{i,j}.
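The Ψ_h in this formula follow the recursion Ψ_0 = I_n, Ψ_h = Φ_1Ψ_{h−1} + ... + Φ_pΨ_{h−p}, which can be sketched as follows (a hypothetical helper of my own, where Phi is the list of estimated Φ_j and AinvB stands for A^{−1}B):

```python
import numpy as np

def structural_irf(Phi, AinvB, H):
    """Structural impulse responses [Psi_h @ AinvB] for h = 0, ..., H.

    Phi   -- list of VAR coefficient matrices [Phi_1, ..., Phi_p]
    AinvB -- impact matrix A^{-1} B of the structural shocks
    """
    n = AinvB.shape[0]
    Psi = [np.eye(n)]
    for h in range(1, H + 1):
        # Psi_h = sum_{j=1}^{min(h,p)} Phi_j @ Psi_{h-j}
        Psi_h = sum(Phi[j - 1] @ Psi[h - j]
                    for j in range(1, min(h, len(Phi)) + 1))
        Psi.append(Psi_h)
    return [P @ AinvB for P in Psi]
```

For a VAR(1) the recursion collapses to Ψ_h = Φ_1^h, which provides a quick sanity check.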
Another instrument for the interpretation of VAR models is the forecast error
variance decomposition (“FEVD”) or variance decomposition for short which
decomposes the total forecast error variance of a variable into the variances of
the structural shocks. It is again based on the causal representation of the VAR(p)
model. According to Eq. (14.3) in Chap. 14 the variance of the forecast error or
mean squared error (MSE) is given by:
15 Sometimes the causal representation is called the MA(∞) representation.
15.4 Interpretation of VAR Models

MSE(h) = ∑_{j=0}^{h−1} Ψ_j Σ Ψ_j′ = ∑_{j=0}^{h−1} Ψ_j A^{−1}B Ω B′A′^{−1} Ψ_j′.
Our interest lies exclusively in the variances m_{ii}^{(h)}, i = 1, ..., n, so that we represent the uninteresting covariance terms by the placeholder ∗. These variances can be seen as a linear combination of the ω_j²'s because the covariance matrix of the structural disturbances Ω = diag(ω_1², ..., ω_n²) is a diagonal matrix:
or in matrix form

m_{ii}^{(h)} = e_i′ (∑_{j=0}^{h−1} Ψ_j Σ Ψ_j′) e_i = e_i′ (∑_{j=0}^{h−1} Ψ_j A^{−1}B Ω B′A′^{−1} Ψ_j′) e_i
where the vector e_i has entries equal to zero, except for the i-th entry which is equal to one. Given the positive definiteness of Σ, the weights d_{ij}^{(h)}, i, j = 1, ..., n and h = 1, 2, ..., are strictly positive. They can be computed as

d_{ij}^{(h)} = ∑_{k=0}^{h−1} [Ψ_k A^{−1}B]²_{ij}.
In order to arrive at the percentage value of the contribution of the j-th disturbance to the MSE of the i-th variable at forecast horizon h, denoted by f_{ij}^{(h)}, we divide each summand in the above expression by the total sum:

f_{ij}^{(h)} = d_{ij}^{(h)} ω_j² / m_{ii}^{(h)},   i, j = 1, ..., n;  h = 0, 1, 2, ...
   = e_i′ (∑_{k=0}^{h−1} Ψ_k A^{−1}B Ω^{1/2} e_j e_j′ Ω^{1/2} B′A′^{−1} Ψ_k′) e_i / e_i′ (∑_{k=0}^{h−1} Ψ_k Σ Ψ_k′) e_i.
Usually, these numbers are multiplied by 100 to give percentages and are either
displayed graphically as a plot against h or in table form (see the example in
Sect. 15.4.5). The forecast error variance share f_{ij}^{(h)} thus shows which percentage of the
forecast variance of variable i at horizon h can be attributed to the j-th structural
shock and thus measures the contribution of each of these shocks to the overall
fluctuations of the variables in question.
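The computation above can be condensed into a few lines (a hypothetical helper of my own; Theta collects the structural impulse responses Ψ_k A^{−1}B and omega2 the shock variances ω_j²):

```python
import numpy as np

def fevd(Theta, omega2):
    """Forecast error variance shares f_ij at horizon h = len(Theta).

    Theta  -- list of structural impulse responses Psi_k @ A^{-1}B, k = 0..h-1
    omega2 -- vector of structural shock variances (omega_1^2, ..., omega_n^2)
    """
    # d_ij^(h) * omega_j^2: squared responses, column j scaled by omega_j^2
    contrib = sum(T**2 for T in Theta) * np.asarray(omega2)
    mse = contrib.sum(axis=1, keepdims=True)   # m_ii^(h)
    return contrib / mse                       # each row sums to one
```

By construction each row of the returned matrix sums to one, i.e. the shares across shocks exhaust the forecast error variance of each variable.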
The FEVD can be used as an alternative identification scheme, sometimes called
max share identification. Assume for ease of exposition that A = I_n. The VAR disturbances and the structural shocks are then simply related as Z_t = BV_t (compare Eq. (15.4)). Moreover, take Ω = I_n, but leave B unrestricted. This corresponds to a different, but equivalent, normalization which economizes on the notation. Then the j-th structural disturbance can be identified by assuming that it maximizes the forecast error variance share with respect to variable i. Noting that, given Σ, B can be written as B = RQ with R being the unique Cholesky factor of Σ and Q being an orthogonal matrix, this optimization problem can be cast as

max_{q_j} e_i′ (∑_{k=0}^{h−1} Ψ_k R q_j q_j′ R′ Ψ_k′) e_i   s.t. q_j′q_j = 1

where q_j is the j-th column of Q, i.e. q_j = Qe_j. The constraint q_j′q_j = 1 normalizes the vector to have length 1. From Z_t = BV_t it then follows that the corresponding structural disturbance is equal to V_{jt} = q_j′R^{−1}Z_t. Because e_j′Q′R^{−1}ΣR′^{−1}Qe_k = 0 for j ≠ k, this shock is orthogonal to the other structural disturbances. For practical applications it is advisable, for reasons of numerical stability, to transform the optimization problem into an equivalent eigenvalue problem (see Faust 1998, appendix).
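The equivalent eigenvalue problem maximizes q′Sq with S = ∑_k R′Ψ_k′ e_i e_i′ Ψ_k R over unit vectors q, so the optimum is the eigenvector of S associated with the largest eigenvalue. A sketch under these assumptions (hypothetical helper name):

```python
import numpy as np

def max_share_q(Psi, Sigma, i, h):
    """Column q maximizing the forecast error variance share of variable i."""
    R = np.linalg.cholesky(Sigma)                    # Sigma = R R'
    rows = np.vstack([(P @ R)[i] for P in Psi[:h]])  # e_i' Psi_k R for each k
    S = rows.T @ rows                                # sum_k R'Psi_k' e_i e_i' Psi_k R
    eigval, eigvec = np.linalg.eigh(S)               # eigenvalues in ascending order
    return eigvec[:, -1]                             # eigenvector of largest eigenvalue
```

Because numpy.linalg.eigh returns orthonormal eigenvectors, the unit-length constraint q′q = 1 is satisfied automatically.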
The impulse response functions and the variance decomposition are the most
important tools for the analysis and interpretation of VAR models. It is, therefore,
of importance not only to estimate these entities, but also to provide corresponding
confidence intervals to underpin the interpretation from a statistical perspective. In
the literature two approaches have been established: an analytic and a bootstrap
approach. The analytic approach relies on the fact that the coefficient matrices
‰h , h D 1; 2; : : :, are continuously differentiable functions of the estimated VAR
16 Recall that the vec operator stacks the columns of an n × m matrix to get one nm × 1 vector.
17 The bootstrap is a resampling method. Efron and Tibshirani (1993) provide a general introduction to the bootstrap.
18 The draws can also be done blockwise. This has the advantage that possibly remaining temporal dependences are taken into account.
Second step: Given the fixed starting values X_{−p+1}, ..., X_0, the estimated coefficient matrices Φ̂_1, ..., Φ̂_p, and the new disturbances drawn in step one, a new realization of the time series {X_t} is generated.
Third step: Estimate the VAR model, given the newly generated realizations for {X_t}, to obtain new estimates for the coefficient matrices.
Fourth step: Generate a new set of impulse response functions given the new estimates, taking the identification scheme as fixed.
Steps one to four are repeated several times to generate a whole family of
impulse response functions which form the basis for the computation of the confi-
dence bands. In many applications, these confidence bands are constructed in a naive
fashion by connecting the confidence intervals for individual impulse responses at
different horizons. This, however, ignores the fact that the impulse responses at
different horizons are correlated which implies that the true coverage probability of
the confidence band is different from the presumed one. Thus, the joint probability
distribution of the impulse responses should serve as the basis of the computation of
the confidence bands. Recently, several alternatives have been proposed which take
this feature into account. Lütkepohl et al. (2013) provides a comparison of several
methods.
The number of repetitions should be at least 500. The method can be refined
somewhat if the bias towards zero of the estimates of the ˆ’s is taken into account.
This bias can again be determined through simulation methods (Kilian 1998).
A critical appraisal of the bootstrap can be found in Sims (1999) where additional
improvements are discussed. The bootstrap of the variance decomposition works in a similar way.
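Steps one to four can be sketched for a bivariate VAR(1) as follows (a simplified illustration of my own, not the book's code; the function name and the omission of deterministic terms are assumptions):

```python
import numpy as np

def bootstrap_var1(X, Phi_hat, n_boot=500, seed=0):
    """Residual bootstrap of a VAR(1) without constant (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    Z = X[1:] - X[:-1] @ Phi_hat.T            # residuals of the fitted VAR(1)
    draws = []
    for _ in range(n_boot):
        # Step 1: draw residuals with replacement
        Zs = Z[rng.integers(0, len(Z), size=len(Z))]
        # Step 2: simulate a new realization from the fixed starting value
        Xs = np.empty_like(X)
        Xs[0] = X[0]
        for t in range(1, len(X)):
            Xs[t] = Phi_hat @ Xs[t - 1] + Zs[t - 1]
        # Step 3: re-estimate the coefficient matrix by OLS
        Phi_b = np.linalg.lstsq(Xs[:-1], Xs[1:], rcond=None)[0].T
        # Step 4: impulse responses from Phi_b would be computed here
        draws.append(Phi_b)
    return np.array(draws)
```

The family of re-estimated coefficient matrices (and the impulse responses derived from them) then serves as the basis for the confidence bands.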
X_t = (  0.145 )   (  0.451   0.642 )           ( −0.189    0.009 )
      (  0.762 ) + ( −0.068   1.245 ) X_{t−1} + ( −0.176   −0.125 ) X_{t−2} + Z_t,

where the estimated standard deviations of the coefficients are: constants 0.634 and 0.333; first lag 0.174, 0.302, 0.091, 0.159; second lag 0.180, 0.333, 0.095, 0.175. The estimate Σ̂ of the covariance matrix Σ is

Σ̂ = ( 0.038  0.011 )
    ( 0.011  0.010 ).
The estimated VAR(2) model is taken to be the reduced form model. The structural model contains two structural shocks: a shock to advertisement expenditures, V_{At}, and a shock to sales, V_{St}. The disturbance vector of the structural shocks is thus {V_t} = {(V_{At}, V_{St})′}. It is related to Z_t via relation (15.4), i.e. AZ_t = BV_t. To identify the model we thus need 3 restrictions.19 We will first assume that A = I_2, which gives two restrictions. A plausible further assumption is that shocks to sales have no contemporaneous effect on advertisement expenditures. This zero restriction seems justified because advertisement campaigns have to be planned in advance; they cannot be produced and carried out immediately. This argument then delivers the third restriction as it implies that B is lower triangular. This lower triangular matrix can be obtained from the Cholesky decomposition of Σ̂:
B̂ = ( 1      0 )        Ω̂ = ( 0.038  0     )
    ( 0.288  1 )  and       ( 0      0.007 ).
The identifying assumptions then imply the impulse response functions plotted
in Fig. 15.2.
The upper left figure shows the response of advertisement expenditures to a sudden transitory increase in advertisement expenditures by 1 % (i.e. to a 1-% increase of V_{At}). This shock is positively propagated to future years, but is statistically zero after four years. After four years the effect even turns negative, but remains statistically insignificant. The same shock produces an increase of sales by 0.3 % in the current and the next year, as shown in the lower left figure. The effect then deteriorates and even becomes negative after three years. The right-hand figures display the reaction
19 Formula (15.6) for n = 2 gives 3 restrictions.
Fig. 15.2 Impulse response functions for advertisement expenditures and sales with 95-%
confidence intervals computed using the bootstrap procedure
In this example we replicate the study of Blanchard (1989) which investigates the
US business cycle within a traditional IS-LM model with Phillips curve.20 The
starting point of his analysis is the VAR(p) model:
X_t = Φ_1 X_{t−1} + ... + Φ_p X_{t−p} + C D_t + Z_t

The VAR has attached to it a disturbance term Z_t = (Z_{yt}, Z_{ut}, Z_{pt}, Z_{wt}, Z_{mt})′. Finally,
fDt g denotes the deterministic variables of the model such as a constant, time trend
or dummy variables. In the following, we assume that all variables are stationary.
The business cycle is seen as the result of five structural shocks which impinge
on the economy:
We will use the IS-LM model to rationalize the restrictions so that we will be able
to identify the structural form from the estimated VAR model. The disturbance of
the structural and the reduced form models are related by the simultaneous equation
system:
AZt D BVt
where V_t = (V_{dt}, V_{st}, V_{pt}, V_{wt}, V_{mt})′ and where A and B are 5 × 5 matrices with ones on the diagonal. Blanchard (1989) proposes the following specification:
20 The results do not match exactly those of Blanchard (1989), but are qualitatively similar.
The first equation is interpreted as an aggregate demand (AD) equation where the
disturbance term related to GDP growth, Zyt , depends on the demand shock Vdt and
the supply shock Vst . The second equation is related to Okun’s law (OL) which
relates the unemployment disturbance Zut to the demand disturbance and the supply
shock. Thereby an increase in GDP growth reduces unemployment in the same
period by a_{21} whereas a supply shock increases it. The third and the fourth equations represent a price setting (PS) and wage setting (WS) system where wages and prices interact
simultaneously. Finally, the fifth equation is supposed to determine the money shock (MR). No distinction is made between money supply and money demand
shocks. A detailed interpretation of these equations is found in the original article
by Blanchard (1989).
Given that the dimension of the system is five (i.e. n = 5), formula (15.6) instructs us that we need 3 · (5 · 4)/2 = 30 restrictions. Counting the number of zero restrictions implemented above, we see that we only have 28 zeros. Thus we lack two additional restrictions. We can reach the same conclusion by counting the number of coefficients and the number of equations. The coefficients are a_{21}, a_{31}, a_{34}, a_{42}, a_{43}, a_{51}, a_{52}, a_{53}, a_{54}, b_{12}, b_{32}, b_{42}, and the diagonal elements of Ω, the covariance matrix of V_t. We therefore have to determine 17 unknown coefficients from (5 · 6)/2 = 15 equations. Thus we find again that we are short of two restrictions. Blanchard discusses several possibilities among which the restrictions b_{12} = 1.0 and a_{34} = 0.1 seem most plausible.
The sample period runs from the second quarter of 1959 to the second quarter of 2004, encompassing 181 observations. Following Blanchard, we include a
constant in combination with a linear time trend in the model. Whereas BIC suggests
a model of order one, AIC favors a model of order two. As a model of order one
seems rather restrictive, we stick to the VAR(2) model whose estimated coefficients
are reported below21 :
21 To save space, the estimated standard errors of the coefficients are not reported.
Φ̂_1 = ( 0.07  1.31  0.01  0.12  0.02 )
      ( 0.02  1.30  0.03  0.00  0.00 )
      ( 0.07  1.47  0.56  0.07  0.03 )
      ( 0.07  0.50  0.44  0.07  0.06 )
      ( 0.10  1.27  0.07  0.04  0.49 )

Φ̂_2 = ( 0.05  1.79  0.41  0.13  0.05 )
      ( 0.02  0.35  0.00  0.01  0.00 )
      ( 0.04  1.38  0.28  0.05  0.00 )
      ( 0.07  0.85  0.19  0.10  0.04 )
      ( 0.02  0.77  0.07  0.11  0.17 )

Ĉ = ( 2.18  0.0101 )
    ( 0.29  0.0001 )
    ( 0.92  0.0015 )
    ( 4.06  0.0035 )
    ( 0.98  0.0025 )

Σ̂ = ( 9.94  0.46  0.34  0.79  0.29  )
    ( 0.46  0.06  0.02  0.05  0.06  )
    ( 0.34  0.02  1.06  0.76  0.13  )
    ( 0.79  0.05  0.76  5.58  0.76  )
    ( 0.29  0.06  0.13  0.76  11.07 ).
The first column of Ĉ relates to the constants, whereas the second column gives the coefficients of the time trend. From these estimates and given the identifying restrictions established above, the equation Σ̂ = A^{−1}B Ω B′A′^{−1} uniquely determines the matrices A, B, and Ω:
Â = ( 1      0     0     0     0 )
    ( 0.050  1     0     0     0 )
    ( 0.038  0     1     0.1   0 )
    ( 0      1.77  0.24  1     0 )
    ( 0.033  1.10  0.01  0.13  1 )

B̂ = ( 1  1     0  0  0 )
    ( 0  1     0  0  0 )
    ( 0  1.01  1  0  0 )
    ( 0  1.55  0  1  0 )
    ( 0  0     0  0  1 )

Ω̂ = diag(9.838, 0.037, 0.899, 5.162, 10.849).
In order to give a better interpretation of the results, we have plotted the impulse response functions and their 95-% confidence bands in Fig. 15.3. The results show that a positive demand shock has a positive and statistically significant effect on GDP growth only in the first three quarters; after that the effect becomes even slightly negative and vanishes after sixteen quarters. The positive demand shock reduces unemployment significantly for almost fifteen quarters. The maximal effect is achieved after three to four quarters. Although the initial effect is negative, the positive demand shock also drives inflation up which then pushes up wage growth.
The supply shock also has a positive effect on GDP growth, but it takes more than
four quarters before the effect reaches its peak. In the short-run the positive supply
Fig. 15.3 Impulse response functions for the IS-LM model with Phillips curve with 95-%
confidence intervals computed using the bootstrap procedure (compare with Blanchard (1989))
shock even reduces GDP growth. In contrast to the demand shock, the positive
supply shock increases unemployment in the short-run. The effect will only reduce
unemployment in the medium- to long-run. The effect on price and wage inflation
is negative.
Finally, we compute the forecast error variance decomposition according to
Sect. 15.4.2. The results are reported in Table 15.1. In the short-run, the identifying
restrictions play an important role as reflected by the plain zeros. The demand shock
accounts for almost all the variance of GDP growth in the short-run. The value of 99.62 % at a forecast horizon of one quarter diminishes as h increases to 40 quarters, but still remains very high at 86.13 %. The supply shock on the
Table 15.1 Forecast error variance decomposition (FEVD) in terms of demand, supply, price, wage, and money shocks (percentages)

Horizon   Demand   Supply   Price   Wage    Money
Growth rate of real GDP
1         99.62     0.38     0       0       0
2         98.13     0.94     0.02    0.87    0.04
4         93.85     1.59     2.13    1.86    0.57
8         88.27     4.83     3.36    2.43    0.61
40        86.13     6.11     4.29    2.58    0.89
Unemployment rate
1         42.22    57.78     0       0       0
2         52.03    47.57     0.04    0.01    0.00
4         64.74    33.17     1.80    0.13    0.16
8         66.05    21.32    10.01    1.99    0.63
40        39.09    16.81    31.92   10.73    0.89
Inflation rate
1          0.86     4.18    89.80    5.15    0
2          0.63    13.12    77.24    8.56    0.45
4          0.72    16.79    68.15   13.36    0.97
8          1.79    19.34    60.69   16.07    2.11
40         2.83    20.48    55.84   17.12    3.74
Growth rate of wages
1          1.18     0.10     0.97   97.75    0
2          1.40     0.10     4.30   93.50    0.69
4          2.18     2.75     9.78   84.49    0.80
8          3.80     6.74    13.40   74.72    1.33
40         5.11     8.44    14.19   70.14    2.13
Growth rate of money stock
1          0.10     0.43     0.00    0.84   98.63
2          1.45     0.44     0.02    1.02   97.06
4          4.22     1.09     0.04    1.90   92.75
8          8.31     1.55     0.81    2.65   86.68
40         8.47     2.64     5.77    4.55   78.57
contrary does not explain much of the variation in GDP growth. Even for a horizon
of 40 quarters, it contributes only 6.11 %. The supply shock is, however, important
for the variation in the unemployment rate, especially in the short-run. It explains
more than 50 % whereas demand shocks account for only 42.22 %. Its contribution
diminishes with the increase of the forecast horizon giving room for price and
wage shocks. The variance of the inflation rate is explained in the short-run almost
exclusively by price shocks. However, as the forecast horizon increases, supply and wage shocks become relatively more important. The money growth rate does not
interact much with the other variables. Its variation is almost exclusively explained
by money shocks.
where V_t = (v_{dt}, v_{st})′ ~ WN(0, Ω) with Ω = diag(ω_d², ω_s²). Thereby {v_{dt}} and {v_{st}} denote demand and supply shocks, respectively. The causal representation of {X_t} implies that the effect of a demand shock in period t on GDP growth in period t + h is given by:
15.5 Identification via Long-Run Restrictions
∂ΔY_{t+h}/∂v_{dt} = [Ψ_h B]_{11}

where [Ψ_h B]_{11} denotes the upper left-hand element of the matrix Ψ_h B. Y_{t+h} can be written as Y_{t+h} = ΔY_{t+h} + ΔY_{t+h−1} + ... + ΔY_{t+1} + Y_t so that the effect of the demand shock on the level of logged GDP is given by:

∂Y_{t+h}/∂v_{dt} = ∑_{j=0}^{h} [Ψ_j B]_{11} = [∑_{j=0}^{h} Ψ_j B]_{11}.

The identifying long-run restriction is that this effect vanishes as h → ∞:

lim_{h→∞} ∂Y_{t+h}/∂v_{dt} = ∑_{j=0}^{∞} [Ψ_j B]_{11} = 0.
where ∗ is a placeholder. This restriction is sufficient to infer b_{21} from the relation [Ψ(1)]_{11} b_{11} + b_{21}[Ψ(1)]_{12} = 0 and the normalization b_{11} = 1. The second part of the equation follows from the identity Φ(z)Ψ(z) = I_2 which gives Ψ(1) = Φ(1)^{−1} for z = 1. The long-run effect of the supply shock on Y_t is left unrestricted and is therefore in general nonzero. Note that the implied value of b_{21} depends on Φ(1), and thus on Φ_1, ..., Φ_p. The results are therefore, in contrast to short-run restrictions, much more sensitive to the correct specification of the VAR.
The relation Z_t = BV_t implies that

Σ = B ( ω_d²  0   ) B′
      ( 0     ω_s² )

or more explicitly

Σ = ( σ_1²   σ_{12} ) = ( 1      b_{12} ) ( ω_d²  0    ) ( 1      b_{21} )
    ( σ_{12} σ_2²   )   ( b_{21} 1      ) ( 0     ω_s² ) ( b_{12} 1      ).
Taking b_{21} as already given from above, this equation system has three equations in the three unknowns b_{12}, ω_d², ω_s², which is a necessary condition for a solution to exist.22
Using the last two equations, we can express ω_d² and ω_s² as functions of b_{12}:

ω_d² = (σ_{12} − b_{12}σ_2²) / (b_{21}(1 − b_{12}b_{21})),    ω_s² = (σ_2² − b_{21}σ_{12}) / (1 − b_{12}b_{21}).

These expressions are only valid if b_{21} ≠ 0 and b_{12}b_{21} ≠ 1. The case b_{21} = 0 is not interesting with regard to content: it would simplify the original equation system considerably and would result in b_{12} = σ_{12}/σ_2², ω_d² = (σ_1²σ_2² − σ_{12}²)/σ_2² > 0, and ω_s² = σ_2² > 0. The case b_{12}b_{21} = 1 contradicts the assumption that Σ is a positive-definite matrix and can therefore be disregarded.23

Inserting the solutions for ω_d² and ω_s² into the first equation, we obtain a quadratic equation in b_{12}:

(b_{21}σ_2² − b_{21}²σ_{12}) b_{12}² + (b_{21}²σ_1² − σ_2²) b_{12} + σ_{12} − b_{21}σ_1² = 0.
22 This equation system is similar to the one analyzed in Sect. 15.2.3.
23 If b_{12}b_{21} = 1, det Σ = σ_1²σ_2² − σ_{12}² = 0. This implies that Z_{1t} and Z_{2t} are perfectly correlated, i.e. ρ²_{Z_{1t},Z_{2t}} = σ_{12}²/(σ_1²σ_2²) = 1.
The positivity of the discriminant implies that the quadratic equation has two real solutions b_{12}^{(1)} and b_{12}^{(2)}:
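Given Σ and b_{21}, the two candidate values of b_{12} follow directly from the quadratic equation above; a small sketch (hypothetical helper name, coefficients copied term by term from the quadratic):

```python
import numpy as np

def solve_b12(s11, s12, s22, b21):
    """Roots of the quadratic in b12 implied by Sigma = B Omega B'.

    s11, s12, s22 -- elements of Sigma; b21 -- value inferred from the
    long-run restriction.
    """
    a = b21 * s22 - b21**2 * s12          # coefficient of b12^2
    b = b21**2 * s11 - s22                # coefficient of b12
    c = s12 - b21 * s11                   # constant term
    disc = b**2 - 4 * a * c
    return ((-b - np.sqrt(disc)) / (2 * a),
            (-b + np.sqrt(disc)) / (2 * a))
```

One of the two roots then has to be discarded on the grounds of a negative implied variance or of economic plausibility, as in the example below.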
The general case of long-run restrictions has a structure similar to the case of short-
run restrictions. Take as a starting point the structural VAR (15.2) from Sect. 15.2.2:
where {X_t} is stationary and causal with respect to {V_t}. As before, the matrix A is normalized to have ones on its diagonal and is assumed to be invertible, Ω is a diagonal matrix with Ω = diag(ω_1², ..., ω_n²), and B is a matrix with ones on the diagonal. The matrix polynomial A(L) is defined as A(L) = A − Γ_1 L − ... − Γ_p L^p.
The reduced form is given by
Φ(L)X_t = Z_t,   Z_t ~ WN(0, Σ)
The long-run variance J of {X_t} (see Eq. (11.1) in Chap. 11) can be derived from the reduced as well as from the structural form, which gives the following expressions:

J = Φ(1)^{−1} Σ Φ(1)^{−1}′
  = Φ(1)^{−1} A^{−1}B Ω B′A′^{−1} Φ(1)^{−1}′
  = A(1)^{−1} B Ω B′ A(1)^{−1}′
where X_t = Ψ(L)Z_t denotes the causal representation of {X_t}. The long-run variance J can be estimated by adapting the methods in Sect. 4.4 to the multivariate case. Thus, the above equation system has a structure similar to the system (15.5) which underlies the case of short-run restrictions. As before, we get n(n+1)/2 equations with 2n² − n unknowns. The nonlinear equation system is therefore underdetermined for n ≥ 2. Therefore, 3n(n−1)/2 additional equations or restrictions are necessary to achieve identification. Hence, conceptually we are in a similar situation as in the case of short-run restrictions.24
In practice, it is customary to achieve identification through zero restrictions where some elements of Ψ(1)A^{−1}B, respectively Φ(1)^{−1}A^{−1}B, are set a priori to zero. Setting the ij-th element [Ψ(1)A^{−1}B]_{ij} = [Φ(1)^{−1}A^{−1}B]_{ij} equal to zero amounts to setting the cumulative effect of the j-th structural disturbance V_{jt} on the i-th variable equal to zero. If the i-th variable enters X_t in first differences, as was the case for Y_t in the previous example, this zero restriction restrains the long-run effect on the level of that variable.
An interesting simplification arises if one assumes that A = I_n and that Ψ(1)B = Φ(1)^{−1}B is a lower triangular matrix. In this case, B and Ω can be estimated from the Cholesky decomposition of the estimated long-run variance Ĵ. Let Ĵ = L̂D̂L̂′ be the Cholesky decomposition where L̂ is a lower triangular matrix with ones on the diagonal and D̂ is a diagonal matrix with strictly positive diagonal entries. As Ĵ = Φ̂(1)^{−1} B̂ Ω̂ B̂′ Φ̂(1)^{−1}′, the matrix of structural coefficients can then be estimated as B̂ = Φ̂(1) L̂ Û^{−1}. The multiplication by the inverse of the diagonal matrix Û = diag(Φ̂(1) L̂) is necessary to guarantee that the normalization of B̂ (diagonal elements equal to one) is respected. Ω̂ is then estimated as Ω̂ = Û D̂ Û.
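Under these assumptions the estimator can be sketched directly (a hypothetical helper of my own, not the book's code; Phi1 stands for Φ̂(1) = I − Φ̂_1 − ... − Φ̂_p and J for the estimated long-run variance Ĵ):

```python
import numpy as np

def longrun_identify(Phi1, J):
    """B and Omega under A = I_n and a lower triangular Psi(1)B."""
    C = np.linalg.cholesky(J)              # J = C C'
    d = np.diag(C)
    L, D = C / d, np.diag(d**2)            # J = L D L', unit lower triangular L
    U = np.diag(np.diag(Phi1 @ L))
    B = Phi1 @ L @ np.linalg.inv(U)        # normalized to unit diagonal
    Omega = U @ D @ U
    return B, Omega
```

By construction B Ω B′ equals Φ(1) J Φ(1)′, so the estimate is consistent with the long-run variance expression derived above.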
Instead of using a method of moments approach, it is possible to use an
instrumental variable (IV) approach. For this purpose we rewrite the reduced form
of Xt in the Dickey-Fuller form (see Eqs. (7.1) and (16.4)):
24 See Rubio-Ramírez et al. (2010) for a unified treatment of both types of restrictions.
Consider for simplicity the case that AΦ(1) is a lower triangular matrix. This implies that the structural shocks V_{2t}, V_{3t}, ..., V_{nt} have no long-run impact on the first variable X_{1t}. It is, therefore, possible to estimate the coefficients A_{12}, A_{13}, ..., A_{1n} by instrumental variables taking X_{2,t−1}, X_{3,t−1}, ..., X_{n,t−1} as instruments. For n = 2 the Dickey-Fuller form of the equation system (15.9) is:

( 1       A_{12} ) ( X̃_{1t} )   ( [AΦ(1)]_{11}       0        ) ( X̃_{1,t−1} )   ( V_{1t} )
( A_{21}  1      ) ( X̃_{2t} ) = ( [AΦ(1)]_{21}  [AΦ(1)]_{22} ) ( X̃_{2,t−1} ) + ( V_{2t} ),

respectively
Thereby X̃_{1t} and X̃_{2t} denote the OLS residuals from a regression of X_{1t}, respectively X_{2t}, on (X_{1,t−1}, X_{2,t−1}, ..., X_{1,t−p+1}, X_{2,t−p+1}). X̃_{2,t−1} is a valid instrument for X̃_{2t} because this variable does not appear in the first equation. Thus,
A12 can be consistently estimated by the IV-approach. For the estimation of A21 ,
we can use the residuals from the first equation as instruments because V1t and
V2t are assumed to be uncorrelated. From this example, it is easy to see how this
recursive method can be generalized to more than two variables. Note also that the
IV-approach can also be used in the context of short-run restrictions.
The issue whether a technology shock leads to a reduction of hours worked in the short-run led to a vivid discussion on the usefulness of long-run restrictions for
structural models (Galí 1999; Christiano et al. 2003, 2006; Chari et al. 2008). From
an econometric point of view, it turned out, on the one hand, that the estimation of Φ(1) is critical for the method of moments approach. The IV-approach, on the
other hand, depends on the strength or weakness of the instrument used (Pagan and
Robertson 1998; Gospodinov 2010).
It is, of course, possible to combine both short- and long-run restrictions simultaneously. An interesting application of both techniques was presented by Galí (1992). In doing so, one must make sure that both types of restrictions are consistent with each other and be aware that counting the number of restrictions gives only a necessary condition.
coefficients of Φ̂_2 are significant at the 10 % level, we prefer to use the VAR(2) model which results in the following estimates25:

Φ̂_1 = ( 0.070  3.376 )       Φ̂_2 = ( 0.029  3.697 )
      ( 0.026  1.284 ),             ( 0.022  0.320 ),

Σ̂ = ( 7.074  0.382 )
    ( 0.382  0.053 ).
Assuming that Z_t = BV_t and following the argument in Sect. 15.5.1 that the demand shock has no long-run impact on the level of real GDP, we can retrieve an estimate for b_{21}:

b̂_{21} = −[Ψ̂(1)]_{11}/[Ψ̂(1)]_{12} = 0.112.
The solutions of the quadratic equation for b_{12} are 8.894 and 43.285. As the first solution results in a negative variance ω_s², we can disregard this solution and stick to the second one. The second solution also makes sense economically, because a positive supply shock leads to positive effects on GDP. Setting b_{12} = 43.285 gives the following estimates for the covariance matrix Ω of the structural shocks:

Ω̂ = ( ω̂_d²  0    ) = ( 4.023  0      )
    ( 0     ω̂_s² )   ( 0      0.0016 ).
The big difference in the variance of both shocks clearly shows the greater
importance of demand shocks for business cycle movements.
Figure 15.4 shows the impulse response functions of the VAR(2) identified by
the long-run restriction. Each figure displays the dynamic effect of a demand and a
supply shock on real GDP and the unemployment rate, respectively, where the size
of the initial shock corresponds to one standard deviation. The result conforms well
with standard economic reasoning. A positive demand shock increases real GDP
25 The results for the constants are suppressed to save space.
Fig. 15.4 Impulse response functions of the Blanchard-Quah model (Blanchard and Quah 1989)
with 95-% confidence intervals computed using the bootstrap procedure
and lowers the unemployment rate in the short-run. The effect is even amplified
for some quarters before it declines monotonically. After 30 quarters the effect of
the demand shock has practically vanished so that its long-run effect becomes zero as
imposed by the restriction. The supply shock has a similar short-run effect on real
GDP, but initially increases the unemployment rate. Only when the effect on GDP
becomes stronger after some quarters will the unemployment rate start to decline.
In the long-run, the supply shock has a positive effect on real GDP but no effect
on unemployment. Interestingly, only the short-run effects of the demand shock are
statistically significant at the 95-% level.
15.6 Sign Restrictions

In recent years the use of sign restrictions has attracted a lot of attention. Pioneering
contributions have been provided by Faust (1998), Canova and De Nicoló (2002),
and Uhlig (2005). Since then the literature has abounded with applications in
many contexts. Sign restrictions try to identify the impact of the structural shocks by
requiring that the signs of the impulse response coefficients follow a given
pattern. The motivation behind this development is that economists are often more
confident about the sign of an economic relationship than about its exact magnitude.
290 15 Interpretation of VAR Models
This seems to be true also for zero restrictions, whether they are short- or long-run
restrictions. This insight already led Samuelson (1947) to advocate a calculus of
qualitative relations in economics. Unfortunately, this approach has been forgotten
in the progression of economic analysis.26 With the rise in popularity of sign
restrictions his ideas may see a revival.
To make the notion of sign restrictions precise, we introduce a language based
on the following notation and definitions. Sign restrictions will be specified as sign
pattern matrices. These matrices make use of the sign function of a real number $x$,
$\operatorname{sgn}(x)$, defined as
$$\operatorname{sgn}(x) = \begin{cases} +1, & \text{if } x > 0, \\ -1, & \text{if } x < 0, \\ 0, & \text{if } x = 0. \end{cases}$$
Definition 15.1. A sign pattern matrix, or pattern for short, is a matrix whose
elements are solely from the set $\{+1, -1, 0\}$. Given a sign pattern matrix $S$, the sign
pattern class of $S$ is defined by
$$\mathcal{S}(S) = \{M \in M(n) \mid \operatorname{sgn}(M) = \operatorname{sgn}(S)\},$$
where $M(n)$ is the set of all $n \times n$ matrices and where $\operatorname{sgn}$ is applied elementwise
to $M$. Clearly, $\mathcal{S}(S) = \mathcal{S}(\operatorname{sgn}(S))$.
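The membership test behind this definition is straightforward to state in code. A minimal sketch in Python/NumPy (the function names are our own, not the book's):

```python
import numpy as np

def sgn(M):
    # Elementwise sign function: +1, -1, or 0, as defined above.
    return np.sign(np.asarray(M, dtype=float)).astype(int)

def in_pattern_class(M, S):
    # M lies in the sign pattern class of S iff sgn(M) equals sgn(S)
    # elementwise; S(S) = S(sgn(S)) follows because sgn is idempotent.
    return np.array_equal(sgn(M), sgn(S))
```

For example, `in_pattern_class([[2, -3], [0, 1]], [[1, -1], [0, 1]])` holds, while flipping one sign of the matrix breaks membership.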
Remark 15.1. Often the set $\{+, -, 0\}$ is used instead of $\{+1, -1, 0\}$ to denote the
sign patterns.

Remark 15.2. In some instances we do not want to restrict all signs but only a subset
of them. In this case, the elements of the sign pattern matrix $S$ may come from the larger
set $\{-1, 0, +1, \#\}$ where $\#$ stands for an unspecified sign. $S$ is then called a generalized
sign pattern matrix or a generalized pattern. The addition and the multiplication
of (generalized) sign pattern matrices are defined in a natural way.
In order not to overload the discussion, we set $A = I_n$ and focus on the case where
$Z_t = BV_t$.27 Moreover, we use a different, but completely equivalent, normalization.
In particular, we relax, on the one hand, the assumption that $B$ has only ones on its
diagonal, but assume, on the other hand, that $\Omega = I_n$. Note that $\Sigma = \mathbb{E}Z_t Z_t' = BB'$ is
a strictly positive definite matrix. Assuming that a causal representation of $\{X_t\}$ in
terms of $\{Z_t\}$ and thus also in terms of $\{V_t\}$ exists, we can represent $X_t$ as
26 It is interesting to note that Samuelson's ideas have fallen on fruitful grounds in areas like
computer science or combinatorics (see Brualdi and Shader 1995; Hall and Li 2014).
27 The case with general $A$ matrices can be treated analogously.
$$X_t = BV_t + \Psi_1 BV_{t-1} + \Psi_2 BV_{t-2} + \cdots = \sum_{h=0}^{\infty} \Psi_h BV_{t-h} = \Psi(L)BV_t.$$
Denote by $\mathcal{B}(\Sigma)$ the set of invertible matrices $B$ such that $\Sigma = BB'$, i.e. $\mathcal{B}(\Sigma) = \{B \in GL(n) \mid BB' = \Sigma\}$.28 Thus, $\mathcal{B}(\Sigma)$ is the set of all feasible structural
factorizations (models).
Sign restrictions on the impulse responses $[\Psi_h B]_{i,j}$ can then be defined in terms
of a sequence of (generalized) sign pattern matrices $\{S_h\}$, $h = 0, 1, 2, \ldots$

Definition 15.2 (Sign Restrictions). A causal VAR allows a sequence of (generalized) sign pattern matrices $\{S_h\}$ if and only if there exists $B \in \mathcal{B}(\Sigma)$ such that
$$\Psi_h B \in \mathcal{S}(S_h), \qquad h = 0, 1, 2, \ldots$$
Remark 15.4. With this notation we can also represent (short-run) zero restrictions
if the sign patterns are restricted to 0 and #.
A natural question to ask is how restrictive a prescribed sign pattern is. This
amounts to the question of whether a given VAR can be compatible with any sign
pattern. As is already clear from the discussion of the two-dimensional case in
Sect. 15.2.3, this is not the case. As the set of feasible parameters can be represented
by a rectangular hyperbola, there will always be one quadrant with no intersection
with the branches of the hyperbola. In the example plotted in Fig. 15.1, this is
quadrant III. Thus, configurations with $(B)_{21} < 0$ and $(B)_{12} < 0$ are not feasible
given $(\Sigma)_{12} > 0$. This argument can easily be extended to models of higher
dimension. Thus, there always exist sign patterns which are incompatible with a
given $\Sigma$.
As pointed out in Sect. 15.3 there always exists a unique lower triangular matrix
$R$, called the Cholesky factor of $\Sigma$, such that $\Sigma = RR'$. Thus, $\mathcal{B}(\Sigma) \neq \emptyset$ because
$R \in \mathcal{B}(\Sigma)$.
28 $GL(n)$ is known as the general linear group. It is the set of all invertible $n \times n$ matrices.
This lemma establishes that there is a one-to-one function $\varphi_\Sigma$ from the group
of orthogonal matrices $O(n)$ onto the set of feasible structural factorizations $\mathcal{B}(\Sigma)$.
From the proof we see that $\varphi_\Sigma(Q) = RQ$ and $\varphi_\Sigma^{-1}(B) = R^{-1}B$. Moreover, for
any two matrices $B_1$ and $B_2$ in $\mathcal{B}(\Sigma)$ with $B_1 = RQ_1$ and $B_2 = RQ_2$, there exists
an orthogonal matrix $Q$ equal to $Q_2'Q_1$ such that $B_1 = B_2 Q$. As $\varphi_\Sigma$ and $\varphi_\Sigma^{-1}$ are
clearly continuous, $\varphi_\Sigma$ is a homeomorphism. See Neusser (2016) for more details
and further implications.
To make the presentation more transparent, we focus on sign restrictions only and
disregard zero restrictions. Arias et al. (2014) show how sign and zero restrictions
can be treated simultaneously. Thus, the entries of $\{S_h\}$ are elements of $\{-1, +1, \#\}$
only. Assume that a VAR allows sign patterns $\{S_h\}$, $h = 0, 1, \ldots, h_{\max}$. Then,
according to Definition 15.2, there exists $B \in \mathcal{B}(\Sigma)$ such that $\Psi_h B \in \mathcal{S}(S_h)$ for all
$h = 0, 1, \ldots, h_{\max}$. As the (strict) inequality restrictions delineate an open subset of
$\mathcal{B}(\Sigma)$, there exist other nearby matrices which also fulfill the sign restrictions.
Sign restrictions therefore do not identify one impulse response sequence, but a
whole set. Thus, the impulse responses are called set identified.
This set is usually difficult to characterize algebraically so that one has to rely on
computer simulations. Conditional on the estimated VAR, thus conditional on $\{\widehat{\Psi}_j\}$
and $\widehat{\Sigma}$, Lemma 15.1 suggests a simple and straightforward algorithm (see Rubio-Ramírez et al. 2010; Arias et al. 2014 for further details):

Step 1: Draw at random an element $Q$ from the uniform distribution on the set of
orthogonal matrices $O(n)$.
Step 2: Convert $Q$ into a random element of $\mathcal{B}(\widehat{\Sigma})$ by applying $\varphi_{\widehat{\Sigma}}$ to $Q$. As $\varphi_{\widehat{\Sigma}}$ is
a homeomorphism, this introduces a uniform distribution on $\mathcal{B}(\widehat{\Sigma})$.
Step 3: Compute the impulse responses with respect to $\varphi_{\widehat{\Sigma}}(Q)$, i.e. compute
$\widehat{\Psi}_h \varphi_{\widehat{\Sigma}}(Q)$.
Step 4: Keep those models with impulse response functions which satisfy the
prescribed sign restrictions $\widehat{\Psi}_h \varphi_{\widehat{\Sigma}}(Q) \in \mathcal{S}(S_h)$, $h = 0, 1, \ldots, h_{\max}$.
Step 5: Repeat steps 1–4 until a satisfactory number of feasible structural models
with impulse responses obeying the sign restrictions has been obtained.
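Steps 1–5 can be sketched compactly in code. A minimal Python/NumPy implementation, assuming the reduced-form MA coefficients $\widehat{\Psi}_h$ and the covariance $\widehat{\Sigma}$ are given (function and variable names are our own, not from the book):

```python
import numpy as np

def draw_orthogonal(n, rng):
    # Step 1: QR decomposition of a Gaussian matrix gives a draw from
    # the uniform distribution on O(n) once R's diagonal is made positive.
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R))

def sign_restricted_draws(Psi, Sigma, patterns, n_keep=100, max_tries=10_000, seed=0):
    """Psi: array (h_max+1, n, n) of reduced-form MA coefficients.
    patterns: same shape with entries +1, -1, or 0 (0 = unrestricted,
    playing the role of '#' in the text). Returns accepted B matrices."""
    rng = np.random.default_rng(seed)
    R = np.linalg.cholesky(Sigma)        # Cholesky factor with R R' = Sigma
    n = Sigma.shape[0]
    mask = patterns != 0
    accepted = []
    for _ in range(max_tries):
        B = R @ draw_orthogonal(n, rng)  # Step 2: candidate element of B(Sigma)
        irf = Psi @ B                    # Step 3: impulse responses Psi_h B
        if np.all(np.sign(irf[mask]) == patterns[mask]):  # Step 4: check signs
            accepted.append(B)
            if len(accepted) >= n_keep:  # Step 5: stop when enough draws kept
                break
    return accepted
```

For instance, with $\widehat{\Sigma} = I_2$ and a pattern requiring positive diagonal impact responses, every accepted draw is an orthogonal matrix with positive diagonal and satisfies $BB' = \widehat{\Sigma}$.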
The implementation of this algorithm requires a way to generate random draws
$Q$ from the uniform distribution on $O(n)$.29 This is not a straightforward task
because the elements of $Q$ are interdependent as they must ensure the orthonormality
of the columns of $Q$. Edelman and Rao (2005) propose the following efficient
29 It can be shown that this probability measure is the unique measure $\mu$ on $O(n)$ which satisfies
the normalization $\mu(O(n)) = 1$ and the (left-)invariance property $\mu(Q\mathcal{Q}) = \mu(\mathcal{Q})$ for every
$Q \in O(n)$ and every measurable subset $\mathcal{Q} \subseteq O(n)$. In economics, this probability measure is often wrongly referred
to as the Haar measure. The Haar measure is not normalized and is, thus, unique only up to a
proportionality factor.
two-step procedure. First, draw $n \times n$ matrices $X$ such that $X \sim N(0, I_n \otimes I_n)$, i.e. each
element of $X$ is drawn independently from a standard normal distribution. Second,
perform the QR decomposition which factorizes each matrix $X$ into the product of
an orthogonal matrix $Q$ and an upper triangular matrix $R$ normalized to have positive
diagonal entries.30
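In Python the same two-step procedure reads as follows; a sketch mirroring the MATLAB commands given in the footnote:

```python
import numpy as np

def haar_orthogonal(n, rng):
    # Step 1: an n x n matrix with i.i.d. standard normal entries.
    X = rng.standard_normal((n, n))
    # Step 2: QR decomposition; flipping column signs so that R has a
    # positive diagonal makes Q a draw from the uniform distribution on O(n).
    Q, R = np.linalg.qr(X)
    return Q * np.sign(np.diag(R))
```

The sign normalization matters: without it, the distribution induced by the raw QR output is not uniform on $O(n)$.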
As the impulse responses are only set identified, the way to report the results and
how to conduct inference becomes a matter of discussion. Several methods have
been proposed in the literature:
(i) One straightforward possibility consists in reporting, for each horizon h, the
median of the impulse responses. Although simple to compute, this method
presents some disadvantages. The median responses will not correspond to any
of the structural models. Moreover, the orthogonality of the structural shocks
will be lost. Fry and Pagan (2011) propose the median-target method as an
ad hoc remedy to this shortcoming. They advocate searching for the admissible
structural model whose impulse responses come closest to the median ones.
(ii) Another possibility is to search for the admissible structural model which
maximizes the share of the forecast error variance at some horizon of a given
variable after a particular shock.31 An early application of this method can be
found in Faust (1998). This method remains, however, uninformative about the
relative explanatory power of alternative admissible structural models.
(iii) The penalty function approach by Mountford and Uhlig (2009) does not
accept or reject particular impulse responses depending on whether they are in
accordance with the sign restrictions (see step 4 in the above algorithm).
Instead, it associates with each possible impulse response function and every
sign restriction a value which rewards a "correct" sign and penalizes a "wrong"
sign. Mountford and Uhlig (2009) propose the following ad hoc penalty
function: $f(x) = 100x$ if $\operatorname{sgn}(x)$ is wrong and $f(x) = x$ if $\operatorname{sgn}(x)$ is correct. The
impulse response function which minimizes the total (standardized) penalty is
then reported.
(iv) The exposition becomes more coherent if viewed from a Bayesian perspective.
From this perspective, the uniform distribution on $O(n)$, respectively on $\mathcal{B}(\widehat{\Sigma})$,
is interpreted as a diffuse or uninformative prior distribution.32 The admissible
structural models which have been retained in step 5 of the algorithm are then
seen as draws from the corresponding posterior distribution. The most likely
model is then given by the model which corresponds to the mode of the posterior
distribution. This model is associated with an impulse response function which
30 Given a value of n, the corresponding MATLAB commands are [Q,R]=qr(randn(n));
Q = Q*diag(sign(diag(R))); (see Edelman and Rao 2005).
31 The minimization of the forecast error variance share has also been applied as an identification
device outside the realm of sign restrictions. See Sect. 15.4.2.
32 Whether this distribution is always the "natural" choice in economics has recently been disputed
by Baumeister and Hamilton (2015).
can then be reported. This method also allows the construction of $100(1-\alpha)$-%
highest posterior density credible sets (see Inoue and Kilian 2013 for details).
As shown by Moon and Schorfheide (2012), these sets cannot, even in a large
sample context, be interpreted as approximate frequentist confidence intervals.
Recently, however, Moon et al. (2013) proposed a frequentist approach to the
construction of error bands for sign-identified impulse responses.
16 Cointegration
As already mentioned in Chap. 7, many raw economic time series are nonstationary
and become stationary only after some transformation. The most common of these
transformations is the formation of differences, perhaps after having taken logs. In
most cases first differences are sufficient to achieve stationarity. The stationarized
series can then be analyzed in the context of VAR models as explained in the
previous chapters. However, many economic theories are formalized in terms of
the original series so that we may want to use the VAR methodology to infer also
the behavior of the untransformed series. Yet, by taking first differences we loose
probably important information on the levels. Thus, it seems worthwhile to develop
an approach which allows us to take the information on the levels into account and at
the same time take care of the nonstationary character of the variables. The concept
of cointegration tries to achieve this double requirement.
In the following we will focus our analysis on variables which are integrated
of order one, i.e. on time series which become stationary after having taken first
differences. However, as we have already mentioned in Sect. 7.5.1, a regression
between integrated variables may lead to spurious correlations which make statistical
inference and the interpretation of the estimated coefficients a delicate issue (see
Sect. 7.5.3). A way out of this dilemma is presented by the theory of cointegrated
processes. Loosely speaking, a multivariate process is cointegrated if there exists
a linear combination of the processes which is stationary although each process
taken individually may be integrated. In many cases, this linear combination can be
directly related to economic theory, which has made the analysis of cointegrated
processes an important research topic. In the bivariate case, already dealt
with in Sect. 7.5.2, the cointegrating relation can be immediately read off from the
cointegrating regression and the cointegration test boils down to a unit root test for
the residuals of the cointegrating regression. However, if more than two variables
are involved, the single equation residual based test is, as explained in Sect. 7.5.2,
no longer satisfactory. Thus, a genuine multivariate approach is desirable.
The concept of cointegration goes back to the work of Engle and Granger
(1987) which is itself based on the precursor study of Davidson et al. (1978). In
the meantime the literature has grown tremendously. Good introductions can be
found in Banerjee et al. (1993), Watson (1994) or Lütkepohl (2006). For the more
statistically inclined reader Johansen (1995) is a good reference.
16.1 A Theoretical Example

Before we present the general theory of cointegration within the VAR context, it is
instructive to introduce the concept in the well-known class of present discounted
value models. These models relate some variable $X_t$ to the present discounted value of
another variable $Y_t$:
$$X_t = \gamma(1-\beta)\sum_{j=0}^{\infty} \beta^j P_t Y_{t+j} + u_t, \qquad 0 < \beta < 1,$$
where $P_t Y_{t+j}$ denotes the forecast of $Y_{t+j}$ given information up to time $t$, $\{u_t\}$ is a
stationary process, and $\{Y_t\}$ is integrated of order one. Suppose that $\{\Delta Y_t\}$ follows
the stationary AR(1) process $\Delta Y_t = \mu(1-\phi) + \phi\,\Delta Y_{t-1} + v_t$ with $|\phi| < 1$.
This specification of the $\{Y_t\}$ process implies that $P_t \Delta Y_{t+h} = \mu(1-\phi^h) + \phi^h\,\Delta Y_t$.
Because $P_t Y_{t+h} = P_t \Delta Y_{t+h} + \cdots + P_t \Delta Y_{t+1} + Y_t$, $h = 0, 1, 2, \ldots$, the present
discounted value model can be manipulated to give:

1 A more recent interesting application of this model is given by the work of Beaudry and Portier
(2006).
\begin{align*}
X_t &= \gamma(1-\beta)\left[Y_t + \beta P_t Y_{t+1} + \beta^2 P_t Y_{t+2} + \cdots\right] + u_t \\
&= \gamma(1-\beta)\big[\,Y_t \\
&\qquad + \beta Y_t + \beta P_t \Delta Y_{t+1} \\
&\qquad + \beta^2 Y_t + \beta^2 P_t \Delta Y_{t+1} + \beta^2 P_t \Delta Y_{t+2} \\
&\qquad + \beta^3 Y_t + \beta^3 P_t \Delta Y_{t+1} + \beta^3 P_t \Delta Y_{t+2} + \beta^3 P_t \Delta Y_{t+3} \\
&\qquad + \cdots\big] + u_t \\
&= \gamma(1-\beta)\left[\frac{1}{1-\beta}Y_t + \frac{\beta}{1-\beta}P_t \Delta Y_{t+1} + \frac{\beta^2}{1-\beta}P_t \Delta Y_{t+2} + \cdots\right] + u_t.
\end{align*}
This expression shows that the integratedness of $\{Y_t\}$ is transferred to $\{X_t\}$. Bringing
$\gamma Y_t$ to the left, we get the following expression:
$$S_t = X_t - \gamma Y_t = \gamma\sum_{j=1}^{\infty} \beta^j P_t \Delta Y_{t+j} + u_t.$$
Inserting $P_t \Delta Y_{t+j} = \mu(1-\phi^j) + \phi^j\,\Delta Y_t$ and summing the resulting geometric series yields
$$S_t = \frac{\gamma\mu(1-\phi)\beta}{(1-\beta)(1-\phi\beta)} + \frac{\gamma\phi\beta}{1-\phi\beta}\,\Delta Y_t + u_t.$$
The remarkable feature of this relation is that $\{S_t\}$ is a stationary process because
both $\{\Delta Y_t\}$ and $\{u_t\}$ are stationary, despite the fact that $\{Y_t\}$ and $\{X_t\}$ are both
integrated processes of order one. The mean of $S_t$ is
$$\mathbb{E}S_t = \frac{\gamma\mu\beta}{1-\beta}.$$
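The algebra leading to $S_t$ and its mean can be checked numerically. A small sketch in Python (the parameter values $\gamma = 1$, $\mu = 1$, $\beta = 0.9$, $\phi = 0.8$ are illustrative, matching those used for the figures later in the chapter):

```python
gamma, mu, beta, phi = 1.0, 1.0, 0.9, 0.8
dY = 0.3  # an arbitrary current value of Delta Y_t

# Truncated series: gamma * sum_{j>=1} beta^j P_t Delta Y_{t+j}
# with P_t Delta Y_{t+j} = mu (1 - phi**j) + phi**j * dY.
series = gamma * sum(beta**j * (mu * (1 - phi**j) + phi**j * dY)
                     for j in range(1, 2000))

def closed_form(dy):
    # Closed form for S_t - u_t derived in the text.
    return (gamma * mu * (1 - phi) * beta / ((1 - beta) * (1 - phi * beta))
            + gamma * phi * beta / (1 - phi * beta) * dy)

# Evaluating the closed form at E[Delta Y_t] = mu recovers the mean of S_t.
mean_S = closed_form(mu)
```

Both routes agree, and `closed_form(mu)` equals $\gamma\mu\beta/(1-\beta) = 9$ for these parameter values.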
From the relation between $S_t$ and $\Delta Y_t$ and the AR(1) representation of $\{\Delta Y_t\}$ we
can deduce a VAR representation of the joint process $\{(S_t, \Delta Y_t)'\}$:
$$\begin{pmatrix} S_t \\ \Delta Y_t \end{pmatrix} = \mu(1-\phi)\begin{pmatrix} \frac{\gamma\beta}{(1-\beta)(1-\phi\beta)} + \frac{\gamma\phi\beta}{1-\phi\beta} \\ 1 \end{pmatrix} + \begin{pmatrix} 0 & \frac{\gamma\phi^2\beta}{1-\phi\beta} \\ 0 & \phi \end{pmatrix}\begin{pmatrix} S_{t-1} \\ \Delta Y_{t-1} \end{pmatrix} + \begin{pmatrix} u_t + \frac{\gamma\phi\beta}{1-\phi\beta}v_t \\ v_t \end{pmatrix}.$$
Further algebraic transformations lead to a VAR representation of order two for the
level variables $\{(X_t, Y_t)'\}$:
\begin{align*}
\begin{pmatrix} X_t \\ Y_t \end{pmatrix} &= c + \Phi_1\begin{pmatrix} X_{t-1} \\ Y_{t-1} \end{pmatrix} + \Phi_2\begin{pmatrix} X_{t-2} \\ Y_{t-2} \end{pmatrix} + Z_t \\
&= \mu(1-\phi)\begin{pmatrix} \frac{\gamma\beta}{(1-\beta)(1-\phi\beta)} + \gamma \\ 1 \end{pmatrix} + \begin{pmatrix} 0 & \frac{\gamma\phi^2\beta}{1-\phi\beta} + \gamma(1+\phi) \\ 0 & 1+\phi \end{pmatrix}\begin{pmatrix} X_{t-1} \\ Y_{t-1} \end{pmatrix} \\
&\quad + \begin{pmatrix} 0 & -\frac{\gamma\phi}{1-\phi\beta} \\ 0 & -\phi \end{pmatrix}\begin{pmatrix} X_{t-2} \\ Y_{t-2} \end{pmatrix} + \begin{pmatrix} 1 & \frac{\gamma}{1-\phi\beta} \\ 0 & 1 \end{pmatrix}\begin{pmatrix} u_t \\ v_t \end{pmatrix}.
\end{align*}
The determinant of the corresponding autoregressive polynomial is $\det\Phi(z) = (1-z)(1-\phi z)$, so that
the roots are $z_1 = 1/\phi$ and $z_2 = 1$. Thus, only the root $z_1$ lies outside the unit circle
whereas the root $z_2$ lies on the unit circle. The existence of a unit root precludes
the existence of a stationary solution. Note that we have just one unit root, although
each of the two processes taken by themselves is integrated of order one.
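The root structure can be verified numerically via the companion matrix of the VAR(2). A sketch using the illustrative parameter values $\gamma = 1$, $\beta = 0.9$, $\phi = 0.8$:

```python
import numpy as np

gamma, beta, phi = 1.0, 0.9, 0.8
b = gamma * phi**2 * beta / (1 - phi * beta)
Phi1 = np.array([[0.0, b + gamma * (1 + phi)],
                 [0.0, 1 + phi]])
Phi2 = np.array([[0.0, -(b + gamma * phi)],   # = -gamma*phi/(1-phi*beta)
                 [0.0, -phi]])

# Companion matrix of the VAR(2); its nonzero eigenvalues are the
# inverses of the roots of det Phi(z), i.e. 1/z_1 = phi and 1/z_2 = 1.
C = np.block([[Phi1, Phi2],
              [np.eye(2), np.zeros((2, 2))]])
eig = np.sort(np.abs(np.linalg.eigvals(C)))
```

The eigenvalue of modulus one confirms the single unit root, while the stable eigenvalue $\phi$ corresponds to the root $1/\phi$ outside the unit circle.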
The above VAR representation can be further transformed to yield a representation
of the process in first differences $\{(\Delta X_t, \Delta Y_t)'\}$:
\begin{align*}
\begin{pmatrix} \Delta X_t \\ \Delta Y_t \end{pmatrix} &= \mu(1-\phi)\begin{pmatrix} \frac{\gamma\beta}{(1-\beta)(1-\phi\beta)} + \gamma \\ 1 \end{pmatrix} + \begin{pmatrix} -1 & \gamma \\ 0 & 0 \end{pmatrix}\begin{pmatrix} X_{t-1} \\ Y_{t-1} \end{pmatrix} \\
&\quad + \begin{pmatrix} 0 & \frac{\gamma\phi}{1-\phi\beta} \\ 0 & \phi \end{pmatrix}\begin{pmatrix} \Delta X_{t-1} \\ \Delta Y_{t-1} \end{pmatrix} + \begin{pmatrix} 1 & \frac{\gamma}{1-\phi\beta} \\ 0 & 1 \end{pmatrix}\begin{pmatrix} u_t \\ v_t \end{pmatrix}.
\end{align*}
In this representation the matrix
$$\Pi = -\Phi(1) = \begin{pmatrix} -1 & \gamma \\ 0 & 0 \end{pmatrix}$$
is of special importance. This matrix is singular and of rank one. This is not
an implication which is special to this specification of the present discounted
value model, but arises generally as shown in Campbell (1987) and Campbell and
Shiller (1987). In the VECM representation all variables except $(X_{t-1}, Y_{t-1})'$ are
stationary by construction. This implies that $\Pi(X_{t-1}, Y_{t-1})'$ must be stationary
too, despite the fact that $\{(X_t, Y_t)'\}$ is not stationary as shown above. Multiplying
$\Pi(X_{t-1}, Y_{t-1})'$ out, one obtains two linear combinations which define stationary
processes. However, as $\Pi$ has only rank one, there is just one linearly independent
combination. The first one is $X_{t-1} - \gamma Y_{t-1}$ and equals $S_{t-1}$, which was already
shown to be stationary. The second one is degenerate because it yields zero. This
phenomenon is called cointegration.
Because $\Pi$ has rank one, it can be written as the product of two vectors $\alpha$ and $\beta$:
$$\Pi = \alpha\beta' = \begin{pmatrix} -1 \\ 0 \end{pmatrix}\begin{pmatrix} 1 & -\gamma \end{pmatrix}.$$
The autoregressive polynomial $\Phi(z)$ can be factorized as
$$\Phi(z) = M(z)V(z) = \begin{pmatrix} 1 & 0 \\ 0 & 1-z \end{pmatrix}\begin{pmatrix} 1 & -\frac{\gamma(1+\phi-\phi\beta)}{1-\phi\beta}z + \frac{\gamma\phi}{1-\phi\beta}z^2 \\ 0 & 1-\phi z \end{pmatrix}.$$
Define
$$\widetilde{M}(z) = \begin{pmatrix} 1-z & 0 \\ 0 & 1 \end{pmatrix}.$$
Multiplying $\Phi(z)$ with $\widetilde{M}(z)$ from the left, we find:
$$\widetilde{M}(z)\Phi(z) = \widetilde{M}(z)M(z)V(z) = (1-z)I_2\,V(z) = (1-z)V(z).$$
The application of this result to the VAR representation of $\{(X_t, Y_t)'\}$ leads to a causal
representation of $\{(\Delta X_t, \Delta Y_t)'\}$:
$$\Phi(L)\begin{pmatrix} X_t \\ Y_t \end{pmatrix} = M(L)V(L)\begin{pmatrix} X_t \\ Y_t \end{pmatrix} = c + Z_t.$$
Premultiplying by $\widetilde{M}(L)$ and using $\widetilde{M}(L)M(L) = (1-L)I_2$ gives
$$V(L)\begin{pmatrix} \Delta X_t \\ \Delta Y_t \end{pmatrix} = \widetilde{M}(L)c + \widetilde{M}(L)Z_t = \begin{pmatrix} 0 \\ \mu(1-\phi) \end{pmatrix} + \begin{pmatrix} 1-L & 0 \\ 0 & 1 \end{pmatrix}Z_t,$$
so that
$$\begin{pmatrix} \Delta X_t \\ \Delta Y_t \end{pmatrix} = \mu\begin{pmatrix} \gamma \\ 1 \end{pmatrix} + V^{-1}(L)\begin{pmatrix} 1-L & 0 \\ 0 & 1 \end{pmatrix}Z_t = \mu\begin{pmatrix} \gamma \\ 1 \end{pmatrix} + \Psi(L)Z_t.$$
In this exposition, we abstain from the explicit computation of $V^{-1}(L)$ and $\Psi(L)$.
However, the following relation holds:
$$V(1) = \begin{pmatrix} 1 & -\gamma \\ 0 & 1-\phi \end{pmatrix} \Longrightarrow V^{-1}(1) = \begin{pmatrix} 1 & \frac{\gamma}{1-\phi} \\ 0 & \frac{1}{1-\phi} \end{pmatrix},$$
implying that
$$\Psi(1) = V^{-1}(1)\begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix} = \frac{1}{1-\phi}\begin{pmatrix} 0 & \gamma \\ 0 & 1 \end{pmatrix}.$$
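Since $\Psi(1)$ collects the long-run effects of the shocks on the levels, it can be approximated by iterating the level impulse responses of the VAR(2). A numerical sketch with the illustrative values $\gamma = 1$, $\beta = 0.9$, $\phi = 0.8$:

```python
import numpy as np

gamma, beta, phi = 1.0, 0.9, 0.8
b = gamma * phi**2 * beta / (1 - phi * beta)
Phi1 = np.array([[0.0, b + gamma * (1 + phi)], [0.0, 1 + phi]])
Phi2 = np.array([[0.0, -(b + gamma * phi)], [0.0, -phi]])

# Level impulse responses: Theta_0 = I, Theta_h = Phi1 Theta_{h-1} + Phi2 Theta_{h-2};
# for an I(1) system Theta_h converges to the long-run impact matrix Psi(1).
Theta_prev, Theta = np.zeros((2, 2)), np.eye(2)
for _ in range(500):
    Theta_prev, Theta = Theta, Phi1 @ Theta + Phi2 @ Theta_prev

# Long-run matrix from the text: the transitory shock (first column) dies
# out, the permanent one raises X by gamma/(1-phi) and Y by 1/(1-phi).
Psi1 = (1.0 / (1.0 - phi)) * np.array([[0.0, gamma], [0.0, 1.0]])
```

The iterated responses settle on $\Psi(1)$, confirming that the first shock has only transitory effects while the second one shifts both levels permanently.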
Like in the univariate case (see Theorem 7.1 in Sect. 7.1.4), we can also construct
the Beveridge-Nelson decomposition in the multivariate case. For this purpose, we
decompose $\Psi(L)$ as follows:
$$\Psi(L) = \Psi(1) + (L-1)\widetilde{\Psi}(L)$$
with $\widetilde{\Psi}_j = \sum_{i=j+1}^{\infty} \Psi_i$. This result can be used to derive the multivariate Beveridge-Nelson decomposition (see Theorem 16.1 in Sect. 16.2.3):
\begin{align*}
\begin{pmatrix} X_t \\ Y_t \end{pmatrix} &= \begin{pmatrix} X_0 \\ Y_0 \end{pmatrix} + \mu\begin{pmatrix} \gamma \\ 1 \end{pmatrix}t + \Psi(1)\sum_{j=1}^{t} Z_j + \text{stationary process} \\
&= \begin{pmatrix} X_0 \\ Y_0 \end{pmatrix} + \mu\begin{pmatrix} \gamma \\ 1 \end{pmatrix}t + \frac{1}{1-\phi}\begin{pmatrix} 0 & \gamma \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & \frac{\gamma}{1-\phi\beta} \\ 0 & 1 \end{pmatrix}\sum_{j=1}^{t}\begin{pmatrix} u_j \\ v_j \end{pmatrix} + \text{stationary process} \\
&= \begin{pmatrix} X_0 \\ Y_0 \end{pmatrix} + \mu\begin{pmatrix} \gamma \\ 1 \end{pmatrix}t + \frac{1}{1-\phi}\begin{pmatrix} 0 & \gamma \\ 0 & 1 \end{pmatrix}\sum_{j=1}^{t}\begin{pmatrix} u_j \\ v_j \end{pmatrix} + \text{stationary process}.
\end{align*}
[Fig. 16.1 Impulse response functions of the present discounted value model after a unit shock to
$Y_t$ ($\gamma = 1$, $\beta = 0.9$, $\phi = 0.8$); the panels trace variables $X$ and $Y$ over horizons $h = 0, \ldots, 30$.]
so that the spread will return steadily to zero. The corresponding impulse responses
of both variables are displayed in Fig. 16.1.

Figure 16.2 displays the trajectories of both variables after a stochastic simulation
where both shocks $\{u_t\}$ and $\{v_t\}$ are drawn from a standard normal distribution.
One can clearly discern the non-stationary character of both series. However, as
is typical for cointegrated series, they move more or less in parallel to each
other. This parallel movement is ensured by the error correction mechanism. The
difference between both series, which is equal to the spread under this parameter
constellation, is mean reverting around zero.
16.2 Definition and Representation

16.2.1 Definition
We now want to make the concepts introduced earlier more precise, give a
general definition of cointegrated processes, and derive the different representations
we have seen in the previous section. Given an arbitrary regular (purely non-deterministic) stationary process $\{U_t\}_{t\in\mathbb{Z}}$ of dimension $n$, $n \ge 1$, with mean zero
and some distribution for the starting random variable $X_0$, we can define recursively
a process $\{X_t\}$, $t = 0, 1, 2, \ldots$ as follows:
$$X_t = \mu + X_{t-1} + U_t, \qquad t = 1, 2, \ldots$$
[Fig. 16.2 Stochastic simulation of the present discounted value model under standard normally
distributed shocks ($\gamma = 1$, $\beta = 0.9$, $\phi = 0.8$); the trajectories of $X_t$ and $Y_t$ over 100 periods move roughly in parallel.]
$$U_t = \Psi(L)Z_t = Z_t + \Psi_1 Z_{t-1} + \Psi_2 Z_{t-2} + \cdots$$
such that $Z_t \sim \mathrm{WN}(0, \Sigma)$, $\sum_{j=0}^{\infty} j\|\Psi_j\| < \infty$, and $\Psi(1) = \sum_{j=0}^{\infty} \Psi_j \neq 0$.
Definition 16.2. A stochastic process $\{X_t\}$ is integrated of order $d$, $I(d)$, $d = 0, 1, 2, \ldots$, if and only if $\Delta^d(X_t - \mathbb{E}X_t)$ is integrated of order zero.
In the following we concentrate on I(1) processes. The definition of an I(1)
process implies that $\{X_t\}$ equals $X_t = X_0 + \mu t + \sum_{j=1}^{t} U_j$ and is thus non-stationary
even if $\mu = 0$. The condition $\Psi(1) \neq 0$ corresponds to the one in the
univariate case (compare Definition 7.1 in Sect. 7.1). On the one hand, it precludes
the case that a trend-stationary process is classified as an integrated process. On the
other hand, it implies that $\{X_t\}$ is in fact non-stationary. Indeed, if the condition is
violated so that $\Psi(1) = 0$, we could express $\Psi(L)$ as $(1-L)\widetilde{\Psi}(L)$. Thus we could
cancel $1-L$ on both sides of Eq. (16.2) to obtain a stationary representation of
$\{X_t\}$, given some initial distribution for $X_0$. This would then contradict our primal
assumption that $\{X_t\}$ is non-stationary. The condition $\sum_{j=0}^{\infty} j\|\Psi_j\| < \infty$ is stronger
than $\sum_{j=0}^{\infty} \|\Psi_j\|^2 < \infty$, which follows from Wold's Theorem. It guarantees the
existence of the Beveridge-Nelson decomposition (see Theorem 16.1 below).2 In
particular, the condition is fulfilled if $\{U_t\}$ is a causal ARMA process, which is the
prototypical case.
Like in the univariate case, we can decompose an I(1) process additively into
several components.
Theorem 16.1 (Beveridge-Nelson Decomposition). If $\{X_t\}$ is an integrated process
of order one, it can be decomposed as
$$X_t = X_0 + \mu t + \Psi(1)\sum_{j=1}^{t} Z_j + V_t,$$
where $V_t = \widetilde{\Psi}(L)Z_0 - \widetilde{\Psi}(L)Z_t$ with $\widetilde{\Psi}_j = \sum_{i=j+1}^{\infty} \Psi_i$, $j = 0, 1, 2, \ldots$, and $\{V_t\}$
stationary.
Proof. Following the proof of the univariate case (see Sect. 7.1.4):
$$\Psi(L) = \Psi(1) + (L-1)\widetilde{\Psi}(L)$$
with $\widetilde{\Psi}_j = \sum_{i=j+1}^{\infty} \Psi_i$. Thus,
\begin{align*}
X_t &= X_0 + \mu t + \sum_{j=1}^{t} U_j = X_0 + \mu t + \sum_{j=1}^{t} \Psi(L)Z_j \\
&= X_0 + \mu t + \sum_{j=1}^{t} \left[\Psi(1) + (L-1)\widetilde{\Psi}(L)\right]Z_j \\
&= X_0 + \mu t + \Psi(1)\sum_{j=1}^{t} Z_j + \sum_{j=1}^{t}(L-1)\widetilde{\Psi}(L)Z_j \\
&= X_0 + \mu t + \Psi(1)\sum_{j=1}^{t} Z_j + \widetilde{\Psi}(L)Z_0 - \widetilde{\Psi}(L)Z_t.
\end{align*}
The only point left is to show that $\widetilde{\Psi}(L)Z_0 - \widetilde{\Psi}(L)Z_t$ is stationary. Based
on Theorem 10.2, it is sufficient to show that the coefficient matrices $\widetilde{\Psi}_j$ are
absolutely summable.
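The coefficient identities behind this decomposition, $\Psi_0 = \Psi(1) - \widetilde{\Psi}_0$ and $\Psi_j = \widetilde{\Psi}_{j-1} - \widetilde{\Psi}_j$ for $j \ge 1$, can be checked directly on a finite MA. A small sketch with arbitrary illustrative coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)

# A finite multivariate MA: Psi_0 = I plus a few random coefficients.
Psis = [np.eye(2)] + [rng.standard_normal((2, 2)) / 2**j for j in range(1, 6)]
Psi1 = sum(Psis)  # Psi(1) = sum of all coefficient matrices
# tilde Psi_j = sum_{i > j} Psi_i (zero beyond the last coefficient).
Psi_tilde = [sum(Psis[j + 1:], np.zeros((2, 2))) for j in range(len(Psis))]
```

Matching coefficients of $L^j$ on both sides of $\Psi(L) = \Psi(1) + (L-1)\widetilde{\Psi}(L)$ reproduces exactly these identities.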
2 This condition could be relaxed and replaced by the condition $\sum_{j=0}^{\infty} j^2\|\Psi_j\|^2 < \infty$. In addition,
this condition is an important assumption for the application of the law of large numbers and for
the derivation of the asymptotic distribution (Phillips and Solo 1992).
The process $\{X_t\}$ can therefore be viewed as the sum of a linear trend
$X_0 + \mu t$ with stochastic intercept, a multivariate random walk $\Psi(1)\sum_{j=1}^{t} Z_j$,
and a stationary process $\{V_t\}$. Based on this representation, we can then define the
notion of cointegration (Engle and Granger 1987).
Definition 16.3 (Cointegration). A multivariate stochastic process $\{X_t\}$ is called
cointegrated if $\{X_t\}$ is integrated of order one and if there exists a vector $\beta \in \mathbb{R}^n$,
$\beta \neq 0$, such that $\{\beta'X_t\}$ is integrated of order zero, given a corresponding
distribution for the initial random variable $X_0$. $\beta$ is called the cointegrating or
cointegration vector. The cointegrating rank is the maximal number, $r$, of linearly
independent cointegrating vectors $\beta_1, \ldots, \beta_r$. These vectors span a linear space
called the cointegration space.
The Beveridge-Nelson decomposition implies that $\beta$ is a cointegrating vector
if and only if $\beta'\Psi(1) = 0$. In this case the random walk component $\sum_{j=1}^{t} Z_j$ is
annihilated and only the deterministic and the stationary component remain.3 For
some issues it is of interest whether the cointegration vector $\beta$ also eliminates the
trend component. This would be the case if $\beta'\mu = 0$. See Sect. 16.3 for details.
The cointegration vectors are determined only up to basis transformations.
If $\beta_1, \ldots, \beta_r$ is a basis for the cointegration space, then $(\beta_1, \ldots, \beta_r)R$ is also a
basis for the cointegration space for any nonsingular $r \times r$ matrix $R$ because
$((\beta_1, \ldots, \beta_r)R)'\Psi(1) = 0$.
$$X_t = c + \Phi_1 X_{t-1} + \cdots + \Phi_p X_{t-p} + Z_t, \qquad Z_t \sim \mathrm{WN}(0, \Sigma) \tag{16.3}$$
3 The distribution of $X_0$ is thereby chosen such that $\beta'X_0 = \beta'\widetilde{\Psi}(L)Z_0$.
(i) All roots of the polynomial $\det\Phi(z)$ are outside the unit circle or equal to one,
i.e.
$$\det\Phi(z) = 0 \Longrightarrow |z| > 1 \text{ or } z = 1;$$
Assumption (i) makes sure that $\{X_t\}$ is an integrated process with order of integration
$d \ge 1$. Moreover, it precludes roots on the unit circle other than one. The
case of seasonal unit roots is treated in Hylleberg et al. (1990) and Johansen and
Schaumburg (1998).4 Assumption (ii) implies that there exist at least $n - r$ unit
roots and two $n \times r$ matrices $\alpha$ and $\beta$ with full column rank $r$ such that
$$\Pi = \alpha\beta'.$$
$$\Phi(z) = U(z)M(z)V(z)$$
where the roots of the matrix polynomials $U(z)$ and $V(z)$ are all outside the unit
circle and where $M(z)$ equals
$$M(z) = \begin{pmatrix} (1-z)I_{n-r} & 0 \\ 0 & I_r \end{pmatrix}.$$
4 The seasonal unit roots are the roots of $z^s - 1 = 0$ where $s$ denotes the number of seasons. These
roots can be expressed as $\cos(2\pi k/s) + \imath\sin(2\pi k/s)$, $k = 0, 1, \ldots, s-1$.
5 For details see Johansen (1995), Neusser (2000) and Bauer and Wagner (2003).
These assumptions will allow us to derive from the VAR(p) model several representations where each of them brings with it a particular interpretation. Replacing $\Pi$
by $\alpha\beta'$ in Eq. (16.4), we obtain the vector error correction representation or vector
error correction model (VECM):
$$\Delta X_t = c + \alpha\beta'X_{t-1} + \Gamma_1\Delta X_{t-1} + \cdots + \Gamma_{p-1}\Delta X_{t-p+1} + Z_t.$$
Premultiplying by $(\alpha'\alpha)^{-1}\alpha'$, which is possible because $\alpha$ has full column rank $r$ so that $\alpha'\alpha$ is a non-singular $r \times r$ matrix, we can solve for $\beta'X_{t-1}$. As the right-hand
side of the resulting equation represents a stationary process, the left-hand side must also be
stationary. This means that the $r$-dimensional process $\{\beta'X_{t-1}\}$ is stationary despite
the fact that $\{X_t\}$ is integrated and potentially has unit roots with multiplicity $n - r$.
The term error correction was coined by Davidson et al. (1978). They interpret
the mean of $\beta'X_t$, $\mu = \mathbb{E}\beta'X_t$, as the long-run equilibrium or steady state around
which the system fluctuates. The deviation from equilibrium (error) is therefore
given by $\beta'X_{t-1} - \mu$. The coefficients of the loading matrix $\alpha$ should then guarantee
that deviations from the equilibrium are corrected over time by appropriate changes
(corrections) in $X_t$.
An Illustration
To illustrate the concept of the error correction model, we consider the following
simple system with $\alpha = (\alpha_1, \alpha_2)'$, $\alpha_1 \neq \alpha_2$, and $\beta = (1, -1)'$. For simplicity,
we assume that the long-run equilibrium is zero. Ignoring higher order lags, we
consider the system:
\begin{align*}
\Delta X_{1t} &= \alpha_1(X_{1,t-1} - X_{2,t-1}) + Z_{1t} \\
\Delta X_{2t} &= \alpha_2(X_{1,t-1} - X_{2,t-1}) + Z_{2t},
\end{align*}
so that $\Pi = \alpha\beta'$ has rank one. Squaring $\Pi$ yields
$$\Pi^2 = \begin{pmatrix} \alpha_1^2 - \alpha_1\alpha_2 & -\alpha_1^2 + \alpha_1\alpha_2 \\ -\alpha_2^2 + \alpha_1\alpha_2 & \alpha_2^2 - \alpha_1\alpha_2 \end{pmatrix} = (\alpha_1 - \alpha_2)\Pi.$$
Thus, the rank of $\Pi^2$ is also one because $\alpha_1 \neq \alpha_2$. Hence, assumption (iii) is also
fulfilled.
We can gain an additional insight into the system by subtracting the second
equation from the first one to obtain:
$$X_{1t} - X_{2t} = (1 + \alpha_1 - \alpha_2)(X_{1,t-1} - X_{2,t-1}) + Z_{1t} - Z_{2t}.$$
The process $\beta'X_t = X_{1t} - X_{2t}$ is stationary and causal with respect to $Z_{1t} - Z_{2t}$ if and
only if $|1 + \alpha_1 - \alpha_2| < 1$, or equivalently if and only if $-2 < \alpha_1 - \alpha_2 < 0$. Note
the importance of the assumption that $\alpha_1 \neq \alpha_2$. It prevents $X_{1t} - X_{2t}$ from becoming a
random walk and thus a non-stationary (integrated) process. A sufficient condition
is that $-1 < \alpha_1 < 0$ and $0 < \alpha_2 < 1$, which implies that a positive (negative) error,
i.e. $X_{1,t-1} - X_{2,t-1} > 0\ (<0)$, is corrected by a negative (positive) change in $X_{1t}$ and
a positive (negative) change in $X_{2t}$. Although the shocks $Z_{1t}$ and $Z_{2t}$ push $X_{1t} - X_{2t}$
time and again away from its long-run equilibrium, the error correction mechanism
ensures that the variables are adjusted in such a way that the system moves back to
its long-run equilibrium.
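This mechanism is easy to see in a simulation. A minimal sketch with the illustrative choice $\alpha = (-0.2, 0.3)'$, which satisfies the sufficient condition above:

```python
import numpy as np

rng = np.random.default_rng(0)
a1, a2, T = -0.2, 0.3, 5000
x1, x2 = np.zeros(T), np.zeros(T)
for t in range(1, T):
    err = x1[t - 1] - x2[t - 1]                     # deviation from equilibrium
    x1[t] = x1[t - 1] + a1 * err + rng.standard_normal()
    x2[t] = x2[t - 1] + a2 * err + rng.standard_normal()

# The spread follows a stationary AR(1) with coefficient 1 + a1 - a2 = 0.5,
# while x1 and x2 individually wander like random walks.
spread = x1 - x2
```

The spread stays in a narrow band around zero while the levels drift over an ever wider range, reflecting the single common trend.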
Solving for the first differences, we obtain the causal representation
$$\Delta X_t = V^{-1}(1)\widetilde{M}(1)U^{-1}(1)c + V^{-1}(L)\widetilde{M}(L)U^{-1}(L)Z_t = \mu + \Psi(L)Z_t,$$
where $\widetilde{M}(z) = \begin{pmatrix} I_{n-r} & 0 \\ 0 & (1-z)I_r \end{pmatrix}$ satisfies $\widetilde{M}(z)M(z) = (1-z)I_n$.
$U(1)$ and $V(1)$ are non-singular so that $\alpha$ and $\beta$ have full column rank $r$. Based on
this derivation we can formulate the following lemma.

Lemma 16.1. The columns of the so-defined matrix $\beta$ are the cointegration vectors
for the process $\{X_t\}$. The corresponding matrix of loading coefficients is $\alpha$, which
fulfills $\Psi(1)\alpha = 0$.
Proof. We must show that $\beta'\Psi(1) = 0$, which is the defining property of cointegration vectors. Denoting by $(V^{(ij)}(1))_{i,j=1,2}$ the appropriately partitioned matrix
$V(1)^{-1}$, and using that $\beta'$ equals the last $r$ rows $(V_{21}(1)\ V_{22}(1))$ of $V(1)$, we obtain:
\begin{align*}
\beta'\Psi(1) &= \begin{pmatrix} V_{21}(1) & V_{22}(1) \end{pmatrix}\begin{pmatrix} V^{(11)}(1) & V^{(12)}(1) \\ V^{(21)}(1) & V^{(22)}(1) \end{pmatrix}\begin{pmatrix} I_{n-r} & 0 \\ 0 & 0 \end{pmatrix}U^{-1}(1) \\
&= \begin{pmatrix} V_{21}(1) & V_{22}(1) \end{pmatrix}\begin{pmatrix} V^{(11)}(1) & 0 \\ V^{(21)}(1) & 0 \end{pmatrix}U^{-1}(1) \\
&= \begin{pmatrix} V_{21}(1)V^{(11)}(1) + V_{22}(1)V^{(21)}(1) & 0 \end{pmatrix}U^{-1}(1) = 0,
\end{align*}
where the last equality is a consequence of the property of the inverse matrix.
With the same arguments, we can show that $\Psi(1)\alpha = 0$. $\square$
Applying the Beveridge-Nelson decomposition (Theorem 16.1) to this representation yields
$$X_t = X_0 + V^{-1}(1)\widetilde{M}(1)U^{-1}(1)c\,t + V^{-1}(1)\widetilde{M}(1)U^{-1}(1)\sum_{j=1}^{t} Z_j + V_t. \tag{16.7}$$
The matrix $\Psi(1)$ in the Beveridge-Nelson decomposition is singular. This implies that the
multivariate random walk $\Psi(1)\sum_{j=1}^{t} Z_j$ does not consist of $n$ independent univariate
random walks. Instead, only $n - r$ independent random walks make up the stochastic
trend so that $\{X_t\}$ is driven by $n - r$ stochastic trends. In order to emphasize this fact,
we derive from the Beveridge-Nelson decomposition the so-called common trend
representation (Stock and Watson 1988a).
As $\Psi(1)$ has rank $n - r$, there exists an $n \times r$ matrix $\gamma$ such that $\Psi(1)\gamma = 0$. Denote
by $\gamma_\perp$ the $n \times (n-r)$ matrix whose columns are orthogonal to $\gamma$, i.e. $\gamma'\gamma_\perp = 0$. The
Beveridge-Nelson decomposition can then be rewritten as:
\begin{align*}
X_t &= X_0 + \Psi(1)\begin{pmatrix} \gamma_\perp & \gamma \end{pmatrix}\begin{pmatrix} \gamma_\perp & \gamma \end{pmatrix}^{-1}c\,t + \Psi(1)\begin{pmatrix} \gamma_\perp & \gamma \end{pmatrix}\begin{pmatrix} \gamma_\perp & \gamma \end{pmatrix}^{-1}\sum_{j=1}^{t} Z_j + V_t \\
&= X_0 + \Psi(1)\begin{pmatrix} \gamma_\perp & 0 \end{pmatrix}\tilde{c}\,t + \Psi(1)\begin{pmatrix} \gamma_\perp & 0 \end{pmatrix}\sum_{j=1}^{t} \widetilde{Z}_j + V_t,
\end{align*}
where $\tilde{c} = \begin{pmatrix} \gamma_\perp & \gamma \end{pmatrix}^{-1}c$ and $\widetilde{Z}_j = \begin{pmatrix} \gamma_\perp & \gamma \end{pmatrix}^{-1}Z_j$. Therefore, only the first
$n - r$ elements of the vector $\tilde{c}$ are relevant for the deterministic linear trend. The
remaining elements are multiplied by zero and are thus irrelevant. Similarly, for
the multivariate random walk only the first $n - r$ elements of the process $\{\widetilde{Z}_t\}$ are
responsible for the stochastic trend. The remaining elements of $\widetilde{Z}_t$ are multiplied
by zero and thus have no permanent, but only a transitory, influence. The above
representation decomposes the shocks orthogonally into permanent and transitory
ones (Gonzalo and Ng 2001). The previous lemma shows that for $\gamma$ one can choose
the matrix of loading coefficients $\alpha$.
Summarizing the first $n - r$ elements of $\tilde{c}$ and $\widetilde{Z}_t$ as $\tilde{c}_1$ and $\widetilde{Z}_{1t}$, respectively, we
arrive at the common trend representation:
$$X_t = X_0 + B\tilde{c}_1 t + B\sum_{j=1}^{t} \widetilde{Z}_{1j} + V_t$$
with $B = \Psi(1)\gamma_\perp$.
This again demonstrates that the trend, the linear as well as the stochastic one,
stems exclusively from the nonstationary variables $\{Y_t\}$ (compare with
Eq. (16.1)).
Finally, we want to present a triangular representation which is well suited to
deal with the nonparametric estimation approach advocated by Phillips (1991) and
Phillips and Hansen (1990) (see Sect. 16.4). In this representation we normalize
the cointegration vector such that $\beta = (I_r, -b')'$. In addition, we partition the vector $X_t$
into $X_{1t}$ and $X_{2t}$ such that $X_{1t}$ contains the first $r$ and $X_{2t}$ the last $n - r$ elements.
$X_t = (X_{1t}', X_{2t}')'$ can then be expressed as:
$$X_{1t} = b'X_{2t} + U_{1t}, \qquad \Delta X_{2t} = U_{2t},$$
with $\{U_{1t}\}$ and $\{U_{2t}\}$ stationary.
16.3 Johansen's Cointegration Test

In Sect. 7.5.2 we have already discussed a regression based test for cointegration
between two variables. It was based on a unit root test of the residuals from a bivariate
regression of one variable against the other. In this regression, it turned out to
be irrelevant which of the two variables was chosen as the regressor and which
one as the regressand. This method can, in principle, be extended to more than
two variables. However, with more than two variables, the choice of the regressand
becomes more crucial as not all variables may be part of the cointegrating relation.
Moreover, more than one independent cointegrating relation may exist. For these
reasons, it is advantageous to use a method which treats all variables symmetrically.
The cointegration test developed by Johansen fulfills this criterion because it is
based on a VAR which does not single out a particular variable. This test has
received wide recognition and is most often used in practice. The test serves two
purposes. First, we want to determine the number r of cointegrating relationships.
Second, we want to test properties of the cointegration vector ˇ and the loading
matrix ˛.
The exposition of the Johansen test follows closely the work of Johansen where
the derivations and additional details can be found (Johansen 1988, 1991, 1995). We
start with a VAR(p) model with constant c in VEC form (see Eq. (16.4)) which, in a simplified version without the constant and the lagged differences, reads:

ΔX_t = Π X_{t−1} + Z_t

The hypothesis of at most r cointegrating relations can then be stated as

H(r): rank(Π) ≤ r,    r = 0, 1, …, n.

Hypothesis H(r), thus, implies that there exist at most r linearly independent cointegrating vectors. The sequence of hypotheses is nested in the sense that H(r) implies H(r + 1):

H(0) ⊆ H(1) ⊆ … ⊆ H(n).

The hypothesis H(0) means that rank(Π) = 0. In this case, Π = 0 and there are no cointegration vectors. {X_t} is thus driven by n independent random walks and
⁶ If the VAR model (16.9) contains further deterministic components besides the constant, these components have to be accounted for in these regressions.
⁷ This two-stage least-squares procedure is also known as partial regression and is part of the Frisch-Waugh-Lovell Theorem (Davidson and MacKinnon 1993, pp. 19–24).
16.3 Johansen’s Cointegration Test 313
the VAR can be transformed into a VAR model for {ΔX_t} which in our simplified version just means that ΔX_t = Z_t ~ IIDN(0, Σ). The hypothesis H(n) places no
restriction on … and includes in this way the case that the level of fXt g is already
stationary. Of particular interest are the hypotheses between these two extreme ones
where non-degenerate cointegrating vectors are possible. In the following, we not
only want to test for the number of linearly independent cointegrating vectors, r,
but we also want to test hypotheses about the structure of the cointegrating vectors
summarized in ˇ.
Johansen's test is conceived as a likelihood-ratio test. This means that we must determine the likelihood function for a sample X_1, X_2, …, X_T where T denotes the sample size. For this purpose, we assume that {Z_t} ~ IIDN(0, Σ) so that the logged likelihood function of the parameters α, β, and Σ conditional on the starting values is given by:

ℓ(α, β, Σ) = −(Tn/2) ln(2π) + (T/2) ln det(Σ⁻¹)
             − (1/2) Σ_{t=1}^{T} (ΔX_t − αβ′X_{t−1})′ Σ⁻¹ (ΔX_t − αβ′X_{t−1})
For given β, maximization with respect to α yields

α̂ = α̂(β) = S_{01}β(β′S_{11}β)⁻¹

where the moment matrices S_{00}, S_{11}, S_{01} and S_{10} are defined as:

S_{00} = (1/T) Σ_{t=1}^{T} (ΔX_t)(ΔX_t)′
S_{11} = (1/T) Σ_{t=1}^{T} X_{t−1}X′_{t−1}
S_{01} = (1/T) Σ_{t=1}^{T} (ΔX_t)X′_{t−1}
S_{10} = S′_{01}.

Concentrating out α delivers the covariance matrix estimate

Σ̂ = Σ̂(β) = S_{00} − S_{01}β(β′S_{11}β)⁻¹β′S_{10}.
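As a concrete illustration, the moment matrices and the concentrated estimates α̂(β) and Σ̂(β) can be computed in a few lines. The following sketch uses simulated data; the candidate matrix `beta` and all variable names are illustrative and not taken from the text.

```python
import numpy as np

# Moment matrices of Johansen's procedure from a simulated bivariate sample.
# In the simplified VEC form of the text, dX_t is related to X_{t-1}.
rng = np.random.default_rng(0)
T, n = 200, 2
X = np.cumsum(rng.standard_normal((T + 1, n)), axis=0)  # two random walks
dX = np.diff(X, axis=0)          # Delta X_t,  t = 1, ..., T
X1 = X[:-1]                      # X_{t-1},    t = 1, ..., T

S00 = dX.T @ dX / T
S11 = X1.T @ X1 / T
S01 = dX.T @ X1 / T
S10 = S01.T

# For a given candidate cointegration matrix beta (n x r), the concentrated
# estimates follow the closed forms of the text:
beta = np.array([[1.0], [-1.0]])
M = np.linalg.inv(beta.T @ S11 @ beta)
alpha_hat = S01 @ beta @ M
Sigma_hat = S00 - S01 @ beta @ M @ beta.T @ S10
```

Σ̂(β) is by construction a symmetric, positive semidefinite residual covariance matrix.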
ℓ(β) = ℓ(α̂(β), β, Σ̂(β)) = −(Tn/2) ln(2π) − (T/2) ln det Σ̂(β) − Tn/2
     = −(Tn/2) ln(2π) − Tn/2 − (T/2) ln det( S_{00} − S_{01}β(β′S_{11}β)⁻¹β′S_{10} ).   (16.10)
The expression −Tn/2 in the above equation is derived as follows:

−(1/2) Σ_{t=1}^{T} (ΔX_t − α̂β′X_{t−1})′ Σ̂⁻¹ (ΔX_t − α̂β′X_{t−1})
  = −(1/2) tr( ( Σ_{t=1}^{T} (ΔX_t − α̂β′X_{t−1})(ΔX_t − α̂β′X_{t−1})′ ) Σ̂⁻¹ )
  = −(1/2) tr( (T S_{00} − T α̂β′S_{10} − T S_{01}βα̂′ + T α̂β′S_{11}βα̂′) Σ̂⁻¹ )
  = −(T/2) tr( (S_{00} − α̂β′S_{10}) Σ̂⁻¹ ) = −(T/2) tr(I_n) = −Tn/2,

since S_{00} − α̂β′S_{10} = Σ̂. The third equality uses α̂β′S_{11}βα̂′ = S_{01}βα̂′, which follows from the definition of α̂.
The log-likelihood is therefore maximized when ln det( S_{00} − S_{01}β(β′S_{11}β)⁻¹β′S_{10} ) is minimized over β.⁸ The minimum is obtained by solving the following generalized eigenvalue problem (Johansen 1995):

det( λS_{11} − S_{10}S_{00}⁻¹S_{01} ) = 0.

Its n solutions can be ordered as 1 ≥ λ̂_1 ≥ λ̂_2 ≥ … ≥ λ̂_n ≥ 0.
⁸ Thereby we make use of the following equality for partitioned matrices:

det [ A_{11}  A_{12} ] = det A_{11} · det(A_{22} − A_{21}A_{11}⁻¹A_{12}) = det A_{22} · det(A_{11} − A_{12}A_{22}⁻¹A_{21})
    [ A_{21}  A_{22} ]

where A_{11} and A_{22} are invertible matrices (see, for example, Meyer 2000, p. 475).
The generalized eigenvalue problem above therefore just determines the singular values of S_{00}^{−1/2}S_{01}S_{11}^{−1/2} = S_{00}^{−1/2}Π̂S_{11}^{1/2}.⁹
⁹ An appraisal of the singular values of a matrix can be found in Strang (1988) or Meyer (2000).
For a given number r of cointegrating relations, the optimized log-likelihood is

ℓ(β̂_r) = −(Tn/2) ln(2π) − Tn/2 − (T/2) ln det S_{00} − (T/2) Σ_{i=1}^{r} ln(1 − λ̂_i).
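A numerical sketch of this computation (simulated data with one built-in cointegrating relation; all names illustrative): the eigenvalues of S₁₁⁻¹S₁₀S₀₀⁻¹S₀₁ solve the determinant equation above, and plugging the r largest into the formula gives the optimized log-likelihood.

```python
import numpy as np

# Simulate three series with one true cointegrating relation: X3 ~ X1 + noise.
rng = np.random.default_rng(1)
T, n = 500, 3
w = np.cumsum(rng.standard_normal((T + 1, 2)), axis=0)
X = np.column_stack([w[:, 0], w[:, 1], w[:, 0] + rng.standard_normal(T + 1)])
dX, X1 = np.diff(X, axis=0), X[:-1]
S00, S11 = dX.T @ dX / T, X1.T @ X1 / T
S01 = dX.T @ X1 / T

# Equivalent standard eigenvalue problem: the eigenvalues of
# S11^{-1} S10 S00^{-1} S01 solve det(lambda*S11 - S10 S00^{-1} S01) = 0.
M = np.linalg.solve(S11, S01.T) @ np.linalg.solve(S00, S01)
lam = np.sort(np.linalg.eigvals(M).real)[::-1]   # 1 >= lam_1 >= ... >= lam_n >= 0

r = 1
loglik = (-T * n / 2 * np.log(2 * np.pi) - T * n / 2
          - T / 2 * np.log(np.linalg.det(S00))
          - T / 2 * np.sum(np.log(1 - lam[:r])))
```

The eigenvalues are squared sample canonical correlations and therefore lie between zero and one.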
The expression for the optimized likelihood function can now be used to construct the Johansen likelihood-ratio test. There are two versions of the test depending on the alternative hypothesis: the trace test of H(r) against H(n) with statistic

−T Σ_{i=r+1}^{n} ln(1 − λ̂_i),

and the max test of H(r) against H(r + 1) with statistic

−T ln(1 − λ̂_{r+1}).
In practice it is useful to adopt a sequential test strategy based on the trace test. Given
some significance level, we test in a first step the null hypothesis H.0/ against H.n/.
If, on the one hand, the test does not reject the null hypothesis, we conclude that
r D 0 and that there is no cointegrating relation. If, on the other hand, the test rejects
the null hypothesis, we conclude that there is at least one cointegrating relation. We
then test in a second step the null hypothesis H.1/ against H.n/. If the test does not
reject the null hypothesis, we conclude that there exists one cointegrating relation,
i.e. that r = 1. If the test rejects the null hypothesis, we examine the next hypothesis H(2), and so on. In this way we obtain a test sequence. If in this sequence the null hypothesis H(r) is not rejected, but H(r − 1) was, we conclude that there exist r linearly independent cointegrating relations as explained in the diagram below.
                  rejection                       rejection
H(0) against H(n) ────────▶ H(1) against H(n) ────────▶ H(2) against H(n) …
       │ no rejection              │ no rejection              │ no rejection
       ▼                           ▼                           ▼
     r = 0                       r = 1                       r = 2
If in this sequence we do not reject H.r/ for some r, it is useful to perform the max
test H.r/ against H.r C 1/ as a robustness check. The asymptotic distributions of the
test statistics are, like in the Dickey-Fuller unit root test, nonstandard and depend on
the specification of the deterministic components.
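The sequential strategy can be sketched in a short loop. The eigenvalues and the 5% critical values below are purely illustrative numbers, not values from any published table.

```python
import numpy as np

T, n = 200, 3
lam = np.array([0.25, 0.12, 0.01])      # hypothetical ordered eigenvalues
crit = {0: 29.80, 1: 15.49, 2: 3.84}    # hypothetical 5% critical values

r = n                                    # default: no H(r0) survives
for r0 in range(n):
    # trace statistic for H(r0) against H(n)
    trace_stat = -T * np.sum(np.log(1 - lam[r0:]))
    if trace_stat < crit[r0]:            # H(r0) not rejected: stop
        r = r0
        break
```

With these illustrative numbers the loop rejects H(0) and H(1) but not H(2), so it settles on r = 2.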
X_t = μ_0 + μ_1 t + Y_t                               (16.11)
ΔY_t = ΠY_{t−1} + Z_t = αβ′Y_{t−1} + Z_t              (16.12)

For ease of exposition, we have omitted the autoregressive corrections. Eliminating Y_t using Y_t = X_t − μ_0 − μ_1 t and ΔY_t = ΔX_t − μ_1 leads to

ΔX_t = c_0 + c_1(t − 1) + αβ′X_{t−1} + Z_t            (16.13)

with c_0 = μ_1 − αβ′μ_0 and c_1 = −αβ′μ_1.
Equation (16.13) is just the vector error correction model (16.4) augmented by the linear trend term c_1(t − 1). If the term c_1 were left unrestricted, arguments similar to those in Sect. 16.2.3 would show that X_t exhibits a deterministic quadratic trend with coefficient vector Ψ(1)c_1. This, however, contradicts the specification in Eq. (16.11). If, instead, we recognize that c_1 in Eq. (16.13) is actually restricted to lie in the span of α, i.e. that c_1 = αγ_1 with γ_1 = −β′μ_1, no quadratic trend emerges in the levels because Ψ(1)α = 0 by Granger's representation Theorem 16.1. Alternatively, one may view the time trend as showing up in the error correction term, that is, as being part of the cointegrating relation, as in Eq. (16.14):

ΔX_t = c_0 + α(β′, γ_1)X̃_{t−1} + Z_t                  (16.14)

where X̃_t = (X′_t, t)′.
Similarly, one may consider the case that X_t has a constant mean μ_0, i.e. that μ_1 = 0 in Eq. (16.11). This leads to the same error correction specification (16.13), but without the term c_1(t − 1). Leaving the constant c_0 unrestricted would generate a linear trend Ψ(1)c_0 t as shown in Sect. 16.2.3. In order to reconcile this with the assumption of a constant mean, we must recognize that c_0 = αγ_0 with γ_0 = −β′μ_0.
As mentioned previously, the cointegrating vectors are not unique, only the cointegrating space is. This often makes the cointegrating vectors difficult to interpret economically, despite some basis transformation. It is therefore of interest to see whether the space spanned by the cointegrating vectors summarized in the columns of β̂ can be viewed as a subspace of the space spanned by some hypothetical vectors H = (h_1, …, h_s), r ≤ s < n. If this hypothesis is true, the cointegrating vectors should be linear combinations of the columns of H so that the null hypothesis can be formulated as

H_0: β = Hφ                                           (16.15)

for some s × r matrix φ. Testing this null hypothesis amounts to solving an analogous generalized eigenvalue problem:

det( λH′S_{11}H − H′S_{10}S_{00}⁻¹S_{01}H ) = 0.
¹⁰ It is instructive to compare these cases to those of the unit root test (see Sect. 7.3.1).
¹¹ The tables by MacKinnon et al. (1999) allow for the possibility of exogenous integrated variables.
16.4 Estimation and Testing of Cointegrating Relationships 319
Denoting the eigenvalues of this restricted problem by λ̃_j, the corresponding likelihood-ratio test statistic is

T Σ_{j=1}^{r} ln( (1 − λ̃_j)/(1 − λ̂_j) ).
H_0: Kφ = β                                           (16.16)

for some s × r matrix φ. As in the previous case, this hypothesis can also be tested by the corresponding likelihood-ratio test statistic which is asymptotically distributed as χ² with s(n − r) degrees of freedom. Similarly, it is possible to test hypotheses on α and joint hypotheses on α and β (see Johansen 1995; Kunst and Neusser 1990; Lütkepohl 2006).
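A sketch of the restriction test β = Hφ: the restricted eigenvalue problem replaces S₁₁, S₁₀, S₀₁ by H′S₁₁H, H′S₁₀, S₀₁H, and the LR statistic compares restricted and unrestricted eigenvalues. The data and the matrix `H` below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
T, n = 400, 3
# true cointegrating vector (1, -1, 0), which lies in the span of H below
w = np.cumsum(rng.standard_normal((T + 1, 2)), axis=0)
X = np.column_stack([w[:, 0], w[:, 0] + rng.standard_normal(T + 1), w[:, 1]])
dX, X1 = np.diff(X, axis=0), X[:-1]
S00, S11, S01 = dX.T @ dX / T, X1.T @ X1 / T, dX.T @ X1 / T
S10 = S01.T

def ordered_eigs(S11_, S10_, S00_, S01_):
    # eigenvalues of S11^{-1} S10 S00^{-1} S01, sorted descending
    M = np.linalg.solve(S11_, S10_) @ np.linalg.solve(S00_, S01_)
    return np.sort(np.linalg.eigvals(M).real)[::-1]

lam_hat = ordered_eigs(S11, S10, S00, S01)            # unrestricted

H = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])   # s = 2 hypothetical columns
lam_til = ordered_eigs(H.T @ S11 @ H, H.T @ S10, S00, S01 @ H)

r, s = 1, H.shape[1]
LR = T * np.sum(np.log((1 - lam_til[:r]) / (1 - lam_hat[:r])))
dof = r * (n - s)   # chi-square degrees of freedom
```

Since the restricted maximum of the likelihood cannot exceed the unrestricted one, the statistic is nonnegative.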
¹² The choice of the variables used for normalization turns out to be important in practice. See the application in Sect. 16.5.
As the endogeneity shows up in the long-run correlation between the variables, the proposed modification makes use of the long-run variance J of u_t = (u′_{1t}, u′_{2t})′. According to Sect. 11.1 this entity is defined as:

J = [ J_{11}  J_{12} ] = Σ_{h=−∞}^{∞} Γ(h) = Λ + Λ′ − Σ
    [ J_{21}  J_{22} ]

where

Λ = Σ_{h=0}^{∞} Γ(h) = [ Λ_{11}  Λ_{12} ]
                       [ Λ_{21}  Λ_{22} ],

Σ = E(u_t u′_t) = [ Σ_{11}  Σ_{12} ]
                  [ Σ_{21}  Σ_{22} ].
The endogeneity correction replaces X_{1t} by

X_{1t}^{(+)} = X_{1t} − Ĵ_{12}Ĵ_{22}⁻¹û_{2t},

and the serial-correlation correction relies on the bias term

Λ̂_{21}^{(+)} = Λ̂_{21} − Ĵ_{12}Ĵ_{22}⁻¹Λ̂_{22},

together with the moment matrix

( Σ_{t=1}^{T} (X′_{2t}, D′_t)′(X′_{2t}, D′_t) )⁻¹.
tD1
Linear hypotheses on the coefficients then take the form

H_0: R vec(b) = q.

The modified Wald test statistic is

W = (R vec(b̂) − q)′ [ R ( Ĵ_{11.2} ⊗ ( Σ_{t=1}^{T} (X′_{2t}, D′_t)′(X′_{2t}, D′_t) )⁻¹ ) R′ ]⁻¹ (R vec(b̂) − q)

where Ĵ_{11.2} = Ĵ_{11} − Ĵ_{12}Ĵ_{22}⁻¹Ĵ_{21}. It can be shown that the so-defined modified Wald test statistic is asymptotically distributed as χ² with g degrees of freedom (see Phillips and Hansen 1990; Hansen 1992).
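The mechanics of the modified Wald statistic can be sketched as follows. All inputs (the coefficient matrix, the Ĵ blocks, and the regressor moments) are illustrative placeholders, not estimates from real data.

```python
import numpy as np

rng = np.random.default_rng(3)
T, k = 100, 2                        # k regressors incl. deterministics
Z = rng.standard_normal((T, k))      # stacked rows (X2t', Dt')'
Mzz_inv = np.linalg.inv(Z.T @ Z)

b_hat = np.array([[1.02, 0.15]])     # 1 x k coefficient matrix (r = 1)
J11 = np.array([[2.0]])
J12 = np.array([[0.5, 0.3]])
J22 = np.array([[1.5, 0.2], [0.2, 1.0]])
J112 = J11 - J12 @ np.linalg.inv(J22) @ J12.T     # J_{11.2}

R = np.array([[1.0, 0.0]])           # H0: first coefficient equals 1
q = np.array([1.0])
V = np.kron(J112, Mzz_inv)           # asymptotic covariance of vec(b_hat)
diff = R @ b_hat.reshape(-1) - q
W = diff @ np.linalg.inv(R @ V @ R.T) @ diff      # g = 1 restriction
```

Under H₀, W would be compared against a χ² critical value with g = 1 degree of freedom.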
16.5 An Example
This example reproduces the study by Neusser (1991) with updated data for the United States over the period first quarter 1950 to fourth quarter 2005. The starting point is a VAR model which consists of four variables: real gross domestic product (Y), real private consumption (C), real gross investment (I), and the ex-post real interest rate (R). All variables, except the real interest rate, are in logs. First, we identify a VAR model for these variables where the order is determined by Akaike's (AIC), Schwarz's (BIC), or Hannan-Quinn's (HQ) information criteria. The AIC suggests seven lags whereas the other criteria propose a VAR of order two. As the VAR(7) involves many statistically insignificant coefficients, we prefer the more parsimonious VAR(2) model, which produces the following estimates:
      ( Y_t )   (  0.185 (0.047) )   (  0.951 (0.086)   0.254 (0.091)   0.088 (0.033)   0.042 (0.032) )
X_t = ( C_t ) = (  0.069 (0.043) ) + (  0.157 (0.079)   0.746 (0.084)   0.065 (0.031)  −0.013 (0.030) ) X_{t−1}
      ( I_t )   (  0.041 (0.117) )   (  0.283 (0.216)   0.250 (0.229)   1.304 (0.084)   0.026 (0.081) )
      ( R_t )   ( −0.329 (0.097) )   (  0.324 (0.178)  −0.536 (0.189)  −0.024 (0.069)   0.551 (0.067) )

        ( −0.132 (0.085)  −0.085 (0.093)  −0.089 (0.033)  −0.016 (0.031) )
      + ( −0.213 (0.078)   0.305 (0.085)  −0.066 (0.031)   0.112 (0.029) ) X_{t−2} + Z_t
        ( −0.517 (0.214)   0.040 (0.233)  −0.364 (0.084)   0.098 (0.079) )
        ( −0.042 (0.176)   0.296 (0.192)   0.005 (0.069)   0.163 (0.065) )
where the estimated standard errors of the corresponding coefficients are reported in parentheses. The estimated covariance matrix Σ̂ is

           ( 0.722  0.428  1.140  0.002 )
Σ̂ = 10⁻⁴ · ( 0.428  0.610  1.026  0.092 )
           ( 1.140  1.026  4.473  0.328 )
           ( 0.002  0.092  0.328  3.098 ).
The sequence of hypotheses starts with H(0) which states that there exists no cointegrating relation. The alternative hypothesis is always H(n) which says that there are n cointegrating relations. According to Table 16.2 the value of the trace test statistic is 111.772 which is clearly larger than the 5% critical value of 47.856. Thus, the null hypothesis H(0) is rejected and we consider next the hypothesis H(1). This hypothesis is again clearly rejected so that we move on to the hypothesis H(2), which is rejected as well. Because H(3) is not rejected, we conclude that there exist three cointegrating relations. To check this result, we test the hypothesis H(2) against H(3) using the max test. As this test also rejects H(2), we can be pretty confident that there are three cointegrating relations given as:
     (   1.000     0.000     0.000 )
β̂ =  (   0.000     1.000     0.000 )
     (   0.000     0.000     1.000 )
     ( 258.948   277.869   337.481 ).
This matrix is actually the outcome from the EVIEWS econometrics software
package. It should be noted that EVIEWS, like other packages, chooses the
normalization mechanically. This can become a problem if the variable on which
the cointegration vectors are normalized is not part of the cointegrating relation.
In this form, the cointegrating vectors are economically difficult to interpret. We therefore ask whether they are compatible with the following hypotheses:

      (  1.0 )        (  1.0 )        ( 0.0 )
β_C = ( −1.0 ) ,  β_I = (  0.0 ) ,  β_R = ( 0.0 ) .
      (  0.0 )        ( −1.0 )        ( 0.0 )
      (  0.0 )        (  0.0 )        ( 1.0 )
These hypotheses state that the log-difference (ratio) between consumption and GDP, the log-difference (ratio) between investment and GDP, and the real interest rate are stationary. They can be rationalized in the context of the neoclassical growth model (see King et al. 1991; Neusser 1991). Each of them can be brought into the form of Eq. (16.16) where β is replaced by its estimate β̂. The corresponding test statistic for each of the three cointegrating relations is distributed as χ² with one degree of freedom,¹³ which gives a critical value of 3.84 at the 5% significance level. The corresponding values of the test statistic are 12.69, 15.05, and 0.45, respectively. This implies that we must reject the first two hypotheses β_C and β_I. However, the conjecture that the real interest rate is stationary cannot be rejected. Finally, we can investigate the joint hypothesis β_0 = (β_C, β_I, β_R) which can be represented in the form (16.15). In this case the value of the test statistic is 41.20 which is clearly above the critical value of 7.81 inferred from the χ²₃ distribution.¹⁴ Thus, we must reject this joint hypothesis.
As a matter of comparison, we perform a similar investigation using the fully modified approach of Phillips and Hansen (1990). For this purpose we restrict the analysis to Y_t, C_t, and I_t because the real interest rate cannot be classified unambiguously as being stationary, respectively integrated of order one. The long-run variance J and its one-sided counterpart Λ are estimated using the quadratic spectral kernel with VAR(1) prewhitening as advocated by Andrews and Monahan (1992) (see Sect. 4.4). Assuming two cointegrating relations and taking Y_t and C_t as the left-hand-side variables in the cointegrating regression (Eq. (16.8a)), the following results are obtained:
( Y_t )   ( 0.234 (0.166) )       ( 6.282 (0.867) )   ( 0.006 (0.002) )
(     ) = (               ) I_t + (               ) + (               ) t + û_{1t}
( C_t )   ( 0.215 (0.171) )       ( 5.899 (0.892) )   ( 0.007 (0.002) )
¹³ The degrees of freedom are computed according to the formula: s(n − r) = 1 · (4 − 3) = 1.
¹⁴ The degrees of freedom are computed according to the formula: r(n − s) = 3 · (4 − 3) = 3.
where the estimated standard deviations are reported in parentheses. The specification allows for a constant and a deterministic trend as well as a drift in the equation for I_t (Eq. (16.8b), not shown).
Given these results we can test a number of hypotheses to get a better understanding of the cointegrating relations. First we test the hypothesis of no cointegration of Y_t, respectively C_t, with I_t. Thus, we test H_0: b(1) = b(2) = 0. The value of the corresponding Wald test statistic is equal to 2.386 which is considerably less than the 5% critical value of 5.992. Therefore we cannot reject the null hypothesis of no cointegration. Another interesting hypothesis is H_0: b(1) = b(2) which would mean that Y_t and C_t are cointegrated with cointegration vector (1, −1). As the corresponding Wald statistic is equal to 0.315, this hypothesis cannot be rejected at the 5% critical value of 3.842. This suggests a long-run relation between Y_t and C_t.
Repeating the analysis with C_t and I_t as the left-hand-side variables leads to the following results:

( C_t )   ( 0.834 (0.075) )       (  0.767 (0.561) )   (  0.002 (0.001) )
(     ) = (               ) Y_t + (                ) + (                ) t + û_{1t}
( I_t )   ( 2.192 (0.680) )       ( −11.27 (5.102) )   ( −0.008 (0.006) )
¹ VARIMA stands for vector autoregressive integrated moving-average models.
Thereby Xt denotes an m-dimensional vector which describes the state of the system
in period t. The evolution of the state is represented as a vector autoregressive model
² We will focus on linear dynamic models only. With the availability of fast and cheap computing facilities, non-linear approaches have gained some popularity. See Durbin and Koopman (2011) for an exposition.
³ For ease of exposition, we will present the time-invariant case first and analyze the case of time-varying coefficients later.
17.1 The State Space Model 327
of order one with coefficient matrix F and disturbances V_{t+1}.⁴ As we assume that the state X_t is unobservable or at least partly unobservable, we need a second equation which relates the state to the observations. In particular, we assume that there is a linear time-invariant relation, given by A and G, of the n-dimensional vector of observations Y_t to the state X_t. This relation may be contaminated by measurement errors W_t. The system is initialized in period t = 1.
We make the following simplifying assumptions for the state space model represented by Eqs. (17.1) and (17.2).
Remark 17.1. In a more general context, we can make both covariance matrices Q and R time-varying and allow for contemporaneous correlations between V_t and W_t (see the example in Sect. 17.4.1).
Remark 17.2. As both the state and the observation equation may include identities,
the covariance matrices need not be positive definite. They can be non-negative
definite.
Remark 17.4. The specification of the state equation and the normality assumption imply that the sequence {X_1, V_1, V_2, …} is independent so that the conditional distribution of X_{t+1} given X_t, X_{t−1}, …, X_1 equals the conditional distribution of X_{t+1} given X_t. Thus, the process {X_t} satisfies the Markov property. As the dimension of the state vector X_t is arbitrary, it can be expanded in such a way as to encompass every component X_{t−1} for any t (see, for example, the state space representation of a VAR(p) model with p > 1). However, there remains the problem of the smallest dimension of the state vector (see Sect. 17.3.2).
Remark 17.5. The state space representation is not unique. Defining, for example, a new state vector X̃_t by multiplying X_t by an invertible matrix P, i.e. X̃_t = PX_t, all properties of the system remain unchanged. Naturally, we must redefine all the system matrices accordingly: F̃ = PFP⁻¹, Q̃ = PQP′, G̃ = GP⁻¹.
⁴ In control theory the state equation (17.1) is amended by an additional term HU_t which represents the effect of control variables U_t. These exogenous controls are used to regulate the system.
328 17 Kalman Filter
Iterating the state equation forward from the initial state X_1 yields:

X_t = F^{t−1}X_1 + Σ_{j=1}^{t−1} F^{j−1}V_{t+1−j},                   t = 1, 2, …
Y_t = A + GF^{t−1}X_1 + G Σ_{j=1}^{t−1} F^{j−1}V_{t+1−j} + W_t,      t = 1, 2, …

The state equation is called stable or causal if all eigenvalues of F are inside the unit circle, which is equivalent to the requirement that all roots of det(I_m − Fz) = 0 are outside the unit circle (see Sect. 12.3). In this case the state equation has a unique stationary solution:

X_t = Σ_{j=0}^{∞} F^{j}V_{t−j}.                                      (17.3)
17.1.1 Examples
The following examples should illustrate the versatility of the state space model
and demonstrate how many economically relevant models can be represented in this
form.
VAR(p) Process
Suppose that {Y_t} follows an n-dimensional VAR(p) process given by Φ(L)Y_t = Z_t, respectively by Y_t = Φ_1Y_{t−1} + … + Φ_pY_{t−p} + Z_t, with Z_t ~ WN(0, Σ). Then the companion form of the VAR(p) process (see Sect. 12.2) just represents the state equation (17.1):

          ( Y_{t+1}   )   ( Φ_1  Φ_2  …  Φ_{p−1}  Φ_p ) ( Y_t       )   ( Z_{t+1} )
          ( Y_t       )   ( I_n  0    …  0        0   ) ( Y_{t−1}   )   ( 0       )
X_{t+1} = ( Y_{t−1}   ) = ( 0    I_n  …  0        0   ) ( Y_{t−2}   ) + ( 0       )
          (  ⋮        )   (  ⋮    ⋮   ⋱   ⋮        ⋮  ) (  ⋮        )   (  ⋮      )
          ( Y_{t−p+2} )   ( 0    0    …  I_n      0   ) ( Y_{t−p+1} )   ( 0       )

        = FX_t + V_{t+1},

with V_{t+1} = (Z′_{t+1}, 0, 0, …, 0)′ and Q = ( Σ 0 ; 0 0 ). The observation equation is just an identity because all components of X_t are observable:

Y_t = (I_n, 0, …, 0)X_t.
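The companion matrix is easy to build and to check for stability. The following sketch uses illustrative VAR(2) coefficients; `stable` verifies that all eigenvalues of F lie inside the unit circle.

```python
import numpy as np

# Illustrative VAR(2) coefficient matrices (n = 2, p = 2).
Phi1 = np.array([[0.5, 0.1], [0.0, 0.4]])
Phi2 = np.array([[0.2, 0.0], [0.1, 0.1]])
n, p = 2, 2

# Companion matrix: coefficient blocks in the first block row,
# identity blocks below shift the state vector down by one lag.
F = np.zeros((n * p, n * p))
F[:n, :n], F[:n, n:] = Phi1, Phi2
F[n:, :n] = np.eye(n)

stable = np.max(np.abs(np.linalg.eigvals(F))) < 1
```

The same construction works for any p by stacking p block columns.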
ARMA(1,1) Process
The representation of ARMA processes as a state space model is more involved when moving-average terms are present. Let {Y_t} be an ARMA(1,1) process defined by the stochastic difference equation Y_t = φY_{t−1} + Z_t + θZ_{t−1} with Z_t ~ WN(0, σ²) and θ ≠ 0.
Define {X_t} as the AR(1) process given by the stochastic difference equation X_t − φX_{t−1} = Z_t and X_t = (X_t, X_{t−1})′ as the state vector. Then we can write the observation equation as:

Y_t = (1, θ)X_t.

If |φ| < 1, the state equation defines a causal process {X_t} so that the unique stationary solution is given by Eq. (17.3). This implies a stationary solution for {Y_t} too. It is easy to verify that this solution equals the unique solution of the ARMA stochastic difference equation.
The state space representation of an ARMA model is not unique. An alternative representation in the case of a causal system is given by a one-dimensional state with X_{t+1} = φX_t + V_{t+1} and observation Y_t = X_t + W_t. Note that in this representation the dimension of the state vector is reduced from two to one. Moreover, the two disturbances V_{t+1} = (φ + θ)Z_t and W_t = Z_t are perfectly correlated.
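The equivalence of the two-dimensional representation and the ARMA recursion can be checked numerically. The sketch below builds X_t from its AR(1) recursion, forms Y_t = (1, θ)X_t, and compares it to the direct ARMA recursion; parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
phi, theta, T = 0.7, 0.4, 300
Z = rng.standard_normal(T)

# AR(1) component of the state: X_t = phi*X_{t-1} + Z_t
X = np.zeros(T)
for t in range(1, T):
    X[t] = phi * X[t - 1] + Z[t]

# Observation equation: Y_t = (1, theta) (X_t, X_{t-1})'
Y_state = X[1:] + theta * X[:-1]          # t = 1, ..., T-1

# Direct ARMA(1,1) recursion with the same initialization
Y_arma = np.zeros(T)
Y_arma[1] = Y_state[0]
for t in range(2, T):
    Y_arma[t] = phi * Y_arma[t - 1] + Z[t] + theta * Z[t - 1]
```

Both constructions produce the same path, confirming that Y_t = φY_{t−1} + Z_t + θZ_{t−1} holds for the state space output.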
ARMA(p,q) Process
It is straightforward to extend the above representation to ARMA(p,q) models.⁵ Let {Y_t} be defined by the stochastic difference equation Φ(L)Y_t = Θ(L)Z_t and set r = max{p, q + 1} with φ_j = 0 for j > p and θ_j = 0 for j > q. The observation equation then reads

Y_t = (1, θ_1, …, θ_{r−1})X_t

where the state vector equals X_t = (X_t, …, X_{t−r+2}, X_{t−r+1})′ and where {X_t} follows the AR(p) process Φ(L)X_t = Z_t. The AR(p) process can be transformed into companion form to arrive at the state equation:

          ( φ_1  φ_2  …  φ_{r−1}  φ_r )       ( Z_{t+1} )
          ( 1    0    …  0        0   )       ( 0       )
X_{t+1} = ( 0    1    …  0        0   ) X_t + ( 0       )
          ( ⋮    ⋮    ⋱   ⋮        ⋮  )       ( ⋮       )
          ( 0    0    …  1        0   )       ( 0       ).
Missing Observations
The state space approach is best suited to deal with missing observations. However,
in this situation the coefficient matrices are no longer constant, but time-varying.
Consider the following simple example of an AR(1) process for which we have
⁵ See also Exercise 17.2.
only observations for the periods t = 1, …, 100 and t = 102, …, 200, but not for period t = 101 which is missing. This situation can be represented in state space form as follows:

X_{t+1} = φX_t + Z_{t+1}
Y_t = G_tX_t + W_t

with

G_t = { 1,      t = 1, …, 100, 102, …, 200
      { 0,      t = 101

R_t = { 0,      t = 1, …, 100, 102, …, 200
      { c > 0,  t = 101.
This means that Wt D 0 and that Yt D Xt for all t except for t D 101. For the
missing observation, we have G101 D Y101 D 0. The variance for this observation is
set to R101 D c > 0.
The same idea can be used to obtain quarterly data when only yearly data are
available. This problem typically arises in statistical offices which have to produce,
for example, quarterly GDP data from yearly observations incorporating quarterly
information from indicator variables (see Sect. 17.4.1). More detailed analysis for
the case of missing data can be found in Harvey and Pierce (1984) and Brockwell
and Davis (1991; Chapter 12.3).
Time-Varying Coefficients
Consider the regression model with time-varying parameter vector β_t:

Y_t = x′_tβ_t + W_t                                   (17.5)

Hildreth-Houck:   β_t = β̄ + v_t
Harvey-Phillips:  β_t − β̄ = F(β_{t−1} − β̄) + v_t
Cooley-Prescott:  β_t = β^p_t + v_{1t}
                  β^p_t = β^p_{t−1} + v_{2t}
where v_t, v_{1t}, and v_{2t} are white noise error terms. In the first specification, originally proposed by Hildreth and Houck (1968), the parameter vector is in each period just a random draw from a distribution with mean β̄ and variance given by the variance of v_t. Departures from the mean are seen as being only of a transitory nature. In the specification by Harvey and Phillips (1982), assuming that all eigenvalues of F are strictly smaller than one in absolute value, the parameter vector follows a mean-reverting VAR of order one. In this case, the departures from
the mean can have a longer duration depending on the eigenvalues of F. The last
specification due to Cooley and Prescott (1973, 1976) views the parameter vector as
being subject to transitory and permanent shifts. Whereas shifts in v1t have only a
transitory effect on ˇt , movements in v2t result in permanent effects.
In the Cooley-Prescott specification, for example, the state is given by X_t = (β′_t, β^{p′}_t)′ and the state equation can be written as:

X_{t+1} = ( β_{t+1}   ) = ( 0  I ) ( β_t   ) + ( v_{1,t+1} + v_{2,t+1} )
          ( β^p_{t+1} )   ( 0  I ) ( β^p_t )   ( v_{2,t+1}             )
                          └─ F ──┘
The second equation models the drift as a random walk. The two disturbances {ε_t} and {ξ_t} are assumed to be uncorrelated with each other and with {W_t}. Defining the state vector X^{(T)}_t as X^{(T)}_t = (T_t, δ_t)′, the state and the observation equations become:

X^{(T)}_{t+1} = ( T_{t+1} ) = ( 1  1 ) ( T_t ) + ( ε_{t+1} ) = F^{(T)}X^{(T)}_t + V^{(T)}_{t+1}
                ( δ_{t+1} )   ( 0  1 ) ( δ_t )   ( ξ_{t+1} )

Y_t = (1, 0)X^{(T)}_t + W_t
with W_t ~ WN(0, σ²_W). This representation is called the local linear trend (LLT) model and implies that {Y_t} follows an ARIMA(0,2,2) process (see Exercise 17.5.1). In the special case of a constant drift equal to δ, σ²_ξ = 0 and we have that ΔY_t = δ + ε_t + W_t − W_{t−1}. {ΔY_t} therefore follows an MA(1) process with ρ(1) = −σ²_W/(σ²_ε + 2σ²_W) = −(2 + q)⁻¹ where q = σ²_ε/σ²_W is called the signal-to-noise ratio. Note that the first-order autocorrelation is necessarily negative. Thus, this model is not suited for time series with positive first-order autocorrelation in their first differences.
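The MA(1) implication is easy to verify by simulation. The sketch below uses the constant-drift special case with q = 1, so the theoretical first-order autocorrelation of ΔY_t is −1/3; all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
T, delta = 200_000, 0.1
sig_eps, sig_w = 1.0, 1.0          # q = sig_eps^2 / sig_w^2 = 1 -> rho(1) = -1/3
eps = sig_eps * rng.standard_normal(T)
W = sig_w * rng.standard_normal(T)

trend = np.cumsum(delta + eps)     # T_t = T_{t-1} + delta + eps_t
Y = trend + W                      # observation with measurement noise
dY = np.diff(Y)                    # dY_t = delta + eps_t + W_t - W_{t-1}

d = dY - dY.mean()
rho1 = np.sum(d[1:] * d[:-1]) / np.sum(d * d)   # sample autocorrelation at lag 1
```

With this sample size the estimate lands very close to the theoretical value −1/(2 + q).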
The seasonal component is characterized by two conditions: S_t = S_{t−d} and Σ_{j=0}^{d−1} S_{t−j} = 0, where d denotes the frequency of the data.⁶ Given starting values S_1, S_0, S_{−1}, …, S_{−d+3}, the subsequent values can be computed recursively as:

S_{t+1} = −S_t − S_{t−1} − … − S_{t−d+2} + η_{t+1}
where a noise η_t ~ WN(0, σ²_η) is taken into account.⁷ The state vector related to the seasonal component, X^{(S)}_t, is defined as X^{(S)}_t = (S_t, S_{t−1}, …, S_{t−d+2})′ which gives the state equation

              ( −1  −1  …  −1  −1 )             ( η_{t+1} )
              (  1   0  …   0   0 )             ( 0       )
X^{(S)}_{t+1} = (  0   1  …   0   0 ) X^{(S)}_t + ( ⋮       ) = F^{(S)}X^{(S)}_t + V^{(S)}_{t+1}
              (  ⋮   ⋮  ⋱   ⋮   ⋮ )             ( 0       )
              (  0   0  …   1   0 )             ( 0       )
with Q = diag(σ²_ε, σ²_ξ, σ²_η, 0, …, 0). The observation equation then is:

Y_t = (1, 0, 1, 0, …, 0)X_t + W_t

with R = σ²_W.
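The deterministic part of the seasonal recursion can be checked in a few lines: without noise the companion matrix F^(S) reproduces an exactly periodic pattern. The sketch below uses quarterly data (d = 4) and illustrative starting values.

```python
import numpy as np

d = 4
F = np.zeros((d - 1, d - 1))
F[0, :] = -1.0                      # first row: (-1, -1, ..., -1)
F[1:, :-1] = np.eye(d - 2)          # shift the remaining components down

x = np.array([1.0, -2.0, 0.5])      # state (S_1, S_0, S_{-1}), illustrative
S = [x[2], x[1], x[0]]              # seasonal values in time order
for _ in range(8):
    x = F @ x                        # noiseless recursion
    S.append(x[0])
S = np.array(S)
```

Without the noise η, the generated sequence satisfies S_t = S_{t−4} and any four consecutive values sum to zero.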
Finally, we can add a cyclical component {C_t} which is modeled as a harmonic process (see Sect. 6.2) with frequency λ_C, respectively periodicity 2π/λ_C:

C_t = A cos(λ_C t) + B sin(λ_C t)
⁶ Four in the case of quarterly and twelve in the case of monthly observations.
⁷ Alternative seasonal models can be found in Harvey (1989) and Hylleberg (1986).
Fig. 17.2 Spectral density of the cyclical component for different values of λ_C and ρ (λ_C = π/4 with ρ = 0.7, 0.8, 0.85; λ_C = π/12 with ρ = 0.85)
Following Harvey (1989, p. 39), we let the parameters A and B evolve over time by introducing the recursion

( C_{t+1}  )     (  cos λ_C  sin λ_C ) ( C_t  )   ( V^{(C)}_{1,t+1} )
(          ) = ρ (                   ) (      ) + (                 )
( C*_{t+1} )     ( −sin λ_C  cos λ_C ) ( C*_t )   ( V^{(C)}_{2,t+1} )

where C_0 = A and C*_0 = B and where {C*_t} is an auxiliary process. The dampening factor ρ allows for additional flexibility in the specification. The processes {V^{(C)}_{1,t}} and {V^{(C)}_{2,t}} are two mutually uncorrelated white noise processes. It is instructive to examine the spectral density (see Sect. 6.1) of the cyclical component in Fig. 17.2.
It can be shown (see Exercise 17.5.2) that {C_t} follows an ARMA(2,1) process. The cyclical component can be incorporated into the state space model above by augmenting the state vector X_{t+1} by the cyclical components C_{t+1} and C*_{t+1} and the error term V_{t+1} by V^{(C)}_{1,t+1} and V^{(C)}_{2,t+1}. The observation equation has to be amended accordingly. Section 17.4.2 presents an empirical application of this approach.
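The rotation structure of the recursion is easy to verify numerically: with ρ = 1 and no noise it reproduces the harmonic C_t = A cos(λ_C t) + B sin(λ_C t) exactly, with the auxiliary process started at C*_0 = B. Parameter values are illustrative.

```python
import numpy as np

lam, A, B, rho = np.pi / 4, 1.5, -0.7, 1.0
R = rho * np.array([[np.cos(lam), np.sin(lam)],
                    [-np.sin(lam), np.cos(lam)]])

x = np.array([A, B])                # (C_0, C*_0)
C = [x[0]]
for _ in range(12):
    x = R @ x                       # noiseless rotation recursion
    C.append(x[0])

t = np.arange(13)
C_exact = A * np.cos(lam * t) + B * np.sin(lam * t)
```

With ρ < 1 the same recursion produces a damped, stochastic cycle once the noise terms are added.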
and Q = diag(Σ, 0, 0). This scheme can easily be generalized to the case p > q + 1 or to allow for autocorrelated idiosyncratic components, assuming for example that they follow autoregressive processes.
The dimension of these models can be considerably reduced by an appropriate re-parametrization or by collapsing the state space adequately (Bräuning and Koopman 2014). Such a reduction can considerably increase the efficiency of the estimation.
⁸ Prototypical models can be found in King et al. (1988) or Woodford (2003). Canova (2007) and Dejong and Dave (2007) present a good introduction to the analysis of DSGE models.
income which is not consumed) at the market rate of interest. These savings can be used as a means to finance investment projects which increase the economy-wide capital stock. The increased capital stock then allows for increased production in the future. The production process itself is subject to random shocks called technology shocks.
The solution of this optimization problem is a nonlinear dynamic system
which determines the capital stock and consumption in every period. Its local
behavior can be investigated by linearizing the system around its steady state.
This equation can then be interpreted as the state equation of the system. The
parameters of this equation F and Q are related, typically in a nonlinear way, to
the parameters describing the utility and the production function as well as the
process of technology shocks. Thus, the state equation summarizes the behavior
of the theoretical model.
The parameters of the state equation can then be estimated by relating the state
vector, given by the capital stock and the state of the technology, via the observation
equation to some observable variables, like real GDP, consumption, investment, or
the interest rate. This then completes the state space representation of the model
which can be analyzed and estimated using the tools presented in Sect. 17.3.9
As we have seen, the state space model provides a very flexible framework for a wide array of applications. We therefore want to develop a set of tools to handle this kind of model in terms of interpretation and estimation. In this section we will analyze the problem of inferring the unobserved state from the data given the parameters of the model. In Sect. 17.3 we will then investigate the estimation of the parameters by maximum likelihood.
In many cases the state of the system is not or only partially observable. It is therefore of interest to infer the state vector X_t from the data Y_1, Y_2, …, Y_T. We can distinguish three types of problems depending on the information used: forecasting (information up to period t − 1), filtering (information up to period t), and smoothing (the full sample up to period T).
For ease of exposition, we will assume that the disturbances V_t and W_t are normally distributed. The recursive nature of the state equation implies that X_t = F^{t−1}X_1 + Σ_{j=0}^{t−2} F^{j}V_{t−j}. Therefore, X_t is also normally distributed for all t if X_1 is normally distributed. From the observation equation we can infer also
⁹ See Sargent (2004) or Fernandez-Villaverde et al. (2007) for a systematic treatment of state space models in the context of macroeconomic models. In this literature the use of Bayesian methods is widespread (see An and Schorfheide 2007; Dejong and Dave 2007).
17.2 Filtering and Smoothing 337
where the covariance matrices Σ_X, Σ_{YX}, Σ_{XY} and Σ_Y can be retrieved from the model given the parameters.
For the understanding of the rest of this section, the following theorem is essential (see standard textbooks, like Amemiya 1994; Greene 2008).

Theorem 17.1. Let Z be an n-dimensional normally distributed random variable with Z ~ N(μ, Σ). Consider the partitioned vector Z = (Z′_1, Z′_2)′ where Z_1 and Z_2 are of dimensions n_1 × 1 and n_2 × 1, n = n_1 + n_2, respectively. The corresponding partitioning of the covariance matrix Σ is

Σ = [ Σ_{11}  Σ_{12} ]
    [ Σ_{21}  Σ_{22} ]

where Σ_{11} = VZ_1, Σ_{22} = VZ_2, and Σ_{12} = Σ′_{21} = cov(Z_1, Z_2) = E(Z_1 − EZ_1)(Z_2 − EZ_2)′. Then the partitioned vectors Z_1 and Z_2 are normally distributed. Moreover, the conditional distribution of Z_1 given Z_2 is also normal with mean μ_1 + Σ_{12}Σ_{22}⁻¹(Z_2 − μ_2) and variance Σ_{11} − Σ_{12}Σ_{22}⁻¹Σ_{21}.

This formula can be directly applied to figure out the mean and the variance of the state vector given the observations. Thus, setting Z_1 = (X′_1, …, X′_t)′ and Z_2 = (Y′_1, …, Y′_{t−1})′, we get the predicted values; setting Z_1 = (X′_1, …, X′_t)′ and Z_2 = (Y′_1, …, Y′_t)′, we get the filtered values; setting Z_1 = (X′_1, …, X′_t)′ and Z_2 = (Y′_1, …, Y′_T)′, we get the smoothed values.
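The conditioning formulas of Theorem 17.1 are a two-line computation. The numbers below are illustrative scalars (n₁ = n₂ = 1), so the result can be verified by hand.

```python
import numpy as np

# Partitioned Gaussian: Z = (Z1, Z2)' with means and covariance blocks.
mu1, mu2 = np.array([0.0]), np.array([1.0])
S11, S12 = np.array([[2.0]]), np.array([[0.8]])
S22 = np.array([[1.0]])

z2 = np.array([1.5])   # observed value of Z2
cond_mean = mu1 + S12 @ np.linalg.inv(S22) @ (z2 - mu2)   # mu1 + S12 S22^{-1}(z2 - mu2)
cond_var = S11 - S12 @ np.linalg.inv(S22) @ S12.T         # S11 - S12 S22^{-1} S21
```

Here cond_mean = 0 + 0.8·(1.5 − 1.0) = 0.4 and cond_var = 2 − 0.8² = 1.36.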
¹⁰ Sargent (1989) provides an interesting application showing the implications of measurement errors in macroeconomic models.
For simplicity, we assume $|\phi| < 1$. Suppose that we only have observations $Y_1$ and $Y_2$ at our disposal. The joint distribution of $(X_1, X_2, Y_1, Y_2)'$ is normal. The covariances can be computed by applying the methods discussed in Chap. 2:
$$\begin{pmatrix} X_1 \\ X_2 \\ Y_1 \\ Y_2 \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \frac{\sigma_v^2}{1-\phi^2} \begin{pmatrix} 1 & \phi & 1 & \phi \\ \phi & 1 & \phi & 1 \\ 1 & \phi & 1 + \frac{\sigma_w^2(1-\phi^2)}{\sigma_v^2} & \phi \\ \phi & 1 & \phi & 1 + \frac{\sigma_w^2(1-\phi^2)}{\sigma_v^2} \end{pmatrix} \right)$$
The smoothed values are obtained by applying the formula from Theorem 17.1:
$$E\left( \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \Big| Y_1, Y_2 \right) = \frac{1}{\left(1 + \frac{\sigma_w^2(1-\phi^2)}{\sigma_v^2}\right)^2 - \phi^2} \begin{pmatrix} 1 - \phi^2 + \frac{\sigma_w^2(1-\phi^2)}{\sigma_v^2} & \phi\,\frac{\sigma_w^2(1-\phi^2)}{\sigma_v^2} \\ \phi\,\frac{\sigma_w^2(1-\phi^2)}{\sigma_v^2} & 1 - \phi^2 + \frac{\sigma_w^2(1-\phi^2)}{\sigma_v^2} \end{pmatrix} \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}$$
Note that for the last observation, $Y_2$ in our case, the filtered and the smoothed values are the same. For $X_1$ the filtered value is
$$E(X_1 \mid Y_1) = \frac{1}{1 + \frac{\sigma_w^2(1-\phi^2)}{\sigma_v^2}}\, Y_1.$$
An intuition for this result can be obtained by considering some special cases. For $\phi = 0$, the observations are not correlated over time. The filtered value for $X_1$ therefore corresponds to the smoothed one. This value lies between zero, the unconditional mean of $X_1$, and $Y_1$, with the variance ratio $\sigma_w^2/\sigma_v^2$ delivering the weights: the smaller the variance of the measurement error, the closer the filtered value is to $Y_1$. This conclusion also holds in general. If the variance of the measurement error is relatively large, the observations do not deliver much information, so that the filtered and the smoothed values are close to the unconditional mean.
For large systems the method suggested by Theorem 17.1 may run into numerical problems due to the inversion of the covariance matrix of $Y$, $\Sigma_{22}$. This matrix can become rather large as it is of dimension $nT \times nT$. Fortunately, there exist recursive solutions to this problem, known as the Kalman filter and the Kalman smoother.
Step 1: Forecasting Step The state equation and the assumption about the disturbance term $V_{t+1}$ imply:
$$X_{t+1|t} = F X_{t|t} \qquad (17.6)$$
$$P_{t+1|t} = F P_{t|t} F' + Q \qquad (17.7)$$
together with the forecast of the next observation, $Y_{t+1|t} = G X_{t+1|t}$.
Step 2: Updating Step In this step the additional information coming from the additional observation $Y_{t+1}$ is processed to update the conditional distribution of the state vector. The joint conditional distribution of $(X_{t+1}', Y_{t+1}')'$ given $Y_1, \ldots, Y_t$ is
$$\begin{pmatrix} X_{t+1} \\ Y_{t+1} \end{pmatrix} \Big| Y_1, \ldots, Y_t \sim N\left( \begin{pmatrix} X_{t+1|t} \\ Y_{t+1|t} \end{pmatrix}, \begin{pmatrix} P_{t+1|t} & P_{t+1|t} G' \\ G P_{t+1|t} & G P_{t+1|t} G' + R \end{pmatrix} \right)$$
As all elements of the distribution are available from the forecasting step, we can again apply Theorem 17.1 to get the distribution of the filtered state vector at time $t+1$:
$$X_{t+1|t+1} = X_{t+1|t} + P_{t+1|t} G' (G P_{t+1|t} G' + R)^{-1} (Y_{t+1} - Y_{t+1|t}) \qquad (17.8)$$
$$P_{t+1|t+1} = P_{t+1|t} - P_{t+1|t} G' (G P_{t+1|t} G' + R)^{-1} G P_{t+1|t} \qquad (17.9)$$
where we replace $X_{t+1|t}$, $P_{t+1|t}$, and $Y_{t+1|t}$ by $F X_{t|t}$, $F P_{t|t} F' + Q$, and $G F X_{t|t}$, respectively, which have been obtained from the forecasting step.
Starting from given values for $X_{0|0}$ and $P_{0|0}$, we can therefore iteratively compute $X_{t|t}$ and $P_{t|t}$ for all $t = 1, 2, \ldots, T$. Only the information from the last period is necessary at each step. Inserting Eq. (17.8) into Eq. (17.6) we obtain as a forecasting equation:
$$X_{t+1|t} = F X_{t|t-1} + K_t (Y_t - Y_{t|t-1})$$
where
$$K_t = F P_{t|t-1} G' (G P_{t|t-1} G' + R)^{-1}$$
is known as the (Kalman) gain matrix. It prescribes how the innovation $Y_t - Y_{t|t-1} = Y_t - G X_{t|t-1}$ leads to an update of the predicted state.
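One forecast–update iteration, Eqs. (17.6)–(17.9), can be sketched as follows (Python/NumPy; the function name `kalman_step` is our own, and $F, G, Q, R$ denote the system matrices of the state space model as in the text):

```python
import numpy as np

def kalman_step(x, P, y, F, G, Q, R):
    """One iteration: forecast with the state equation, then update with y."""
    # Forecasting step (Eqs. 17.6/17.7): X_{t+1|t} = F X_{t|t}, P_{t+1|t} = F P F' + Q
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    y_pred = G @ x_pred
    # Updating step (Eqs. 17.8/17.9)
    S = G @ P_pred @ G.T + R                 # variance of the innovation
    K = P_pred @ G.T @ np.linalg.inv(S)      # (non-lagged) gain
    x_filt = x_pred + K @ (y - y_pred)
    P_filt = P_pred - K @ G @ P_pred
    return x_filt, P_filt, x_pred, P_pred
```

For the scalar AR(1)-plus-measurement-error example of this chapter with $\phi = 0.5$ and $\sigma_v^2 = \sigma_w^2 = 1$, starting from the unconditional moments reproduces the filtered values derived analytically above.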
$$X_{0|0} = E(X_0) = 0$$
$$P_{0|0} = V(X_0)$$
If the state process is stationary, $P_{0|0}$ solves
$$P_{0|0} = F P_{0|0} F' + Q.$$
According to Eq. (12.4), the solution of the above matrix equation is:
$$\operatorname{vec}(P_{0|0}) = [I_{n^2} - F \otimes F]^{-1} \operatorname{vec}(Q).$$
If the process is not stationary, we can set $X_{0|0}$ to zero and $P_{0|0}$ to infinity. In practice, a very large number is sufficient.
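The stationary initialization can be sketched directly from the vec formula (a small Python helper; the function name is ours):

```python
import numpy as np

def stationary_P0(F, Q):
    """Solve P = F P F' + Q via vec(P) = [I - F kron F]^{-1} vec(Q)."""
    n = F.shape[0]
    # flatten(order="F") stacks columns, matching the vec operator
    vecP = np.linalg.solve(np.eye(n * n) - np.kron(F, F), Q.flatten(order="F"))
    return vecP.reshape(n, n, order="F")
```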
The Kalman filter determines the distribution of the state at time $t$ given the information available up to this time. In many instances, however, we want to make an optimal forecast of the state given all the information available, i.e. the whole sample. Thus, we want to determine $X_{t|T}$ and $P_{t|T}$. The Kalman filter delivers the smoothed distribution for $t = T$, i.e. $X_{T|T}$ and $P_{T|T}$. The idea of the Kalman smoother is to work backwards from this terminal distribution.
The above mean is conditional only on the information available up to time $t$ and on the information at time $t+1$. The Markov property implies that this mean also incorporates the information from the observations $Y_{t+1}, \ldots, Y_T$. Applying the law of iterated expectations (see, e.g., Amemiya 1994, p. 78), we can derive $X_{t|T}$:
$$X_{t|T} = X_{t|t} + P_{t|t} F' P_{t+1|t}^{-1} (X_{t+1|T} - X_{t+1|t}). \qquad (17.10)$$
The algorithm can now be implemented as follows. In the first step compute $X_{T-1|T}$ according to Eq. (17.10) as
$$X_{T-1|T} = X_{T-1|T-1} + P_{T-1|T-1} F' P_{T|T-1}^{-1} (X_{T|T} - X_{T|T-1}).$$
All entities on the right-hand side can readily be computed by applying the Kalman filter. Having found $X_{T-1|T}$, we can again use Eq. (17.10) for $t = T-2$ to evaluate $X_{T-2|T}$:
$$X_{T-2|T} = X_{T-2|T-2} + P_{T-2|T-2} F' P_{T-1|T-2}^{-1} (X_{T-1|T} - X_{T-1|T-2}).$$
Continuing backwards in this manner delivers all smoothed means from quantities already computed by the Kalman filter. The smoothed covariance matrix $P_{t|T}$ is given as (see Hamilton 1994b, Section 13.6):
$$P_{t|T} = P_{t|t} + P_{t|t} F' P_{t+1|t}^{-1} (P_{t+1|T} - P_{t+1|t}) P_{t+1|t}^{-1} F P_{t|t}.$$
Thus, we can also compute the smoothed variance with the aid of the values already determined by the Kalman filter.
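One backward step of this recursion, Eq. (17.10) together with the covariance formula, might be coded as follows (function name ours; all inputs are quantities delivered by a previous forward pass of the Kalman filter):

```python
import numpy as np

def smoother_step(x_filt, P_filt, x_pred_next, P_pred_next,
                  x_smooth_next, P_smooth_next, F):
    """One backward recursion of the Kalman smoother."""
    # J = P_{t|t} F' P_{t+1|t}^{-1}
    J = P_filt @ F.T @ np.linalg.inv(P_pred_next)
    x_smooth = x_filt + J @ (x_smooth_next - x_pred_next)
    P_smooth = P_filt + J @ (P_smooth_next - P_pred_next) @ J.T
    return x_smooth, P_smooth
```

Note that when the next-period smoothed and predicted moments coincide (as at $t = T$), smoothing leaves the filtered values unchanged, in line with the remark above that filtered and smoothed values agree for the last observation.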
Then we compute the forecasting step as the first step of the filter (see Eq. (17.6)):
$$X_{1|0} = \phi X_{0|0} = 0$$
$$P_{1|0} = \phi^2 \frac{\sigma_v^2}{1-\phi^2} + \sigma_v^2 = \frac{\sigma_v^2}{1-\phi^2}$$
$$Y_{1|0} = 0.$$
$P_{1|0}$ was computed by the recursive formula from the previous section, but is, of course, equal to the unconditional variance. For the updating step, we get from Eqs. (17.8) and (17.9):
$$X_{1|1} = \frac{\sigma_v^2}{1-\phi^2} \left( \frac{\sigma_v^2}{1-\phi^2} + \sigma_w^2 \right)^{-1} Y_1 = \frac{1}{1 + \frac{\sigma_w^2(1-\phi^2)}{\sigma_v^2}}\, Y_1$$
$$P_{1|1} = \frac{\sigma_v^2}{1-\phi^2} - \left( \frac{\sigma_v^2}{1-\phi^2} \right)^2 \left( \frac{\sigma_v^2}{1-\phi^2} + \sigma_w^2 \right)^{-1} = \frac{\sigma_v^2}{1-\phi^2} \left( 1 - \frac{1}{1 + \frac{\sigma_w^2(1-\phi^2)}{\sigma_v^2}} \right)$$
These two results are then used to calculate the next iteration of the algorithm. This will give the filtered values for $t = 2$, which correspond to the smoothed values because we have just two observations. The forecasting step is:
$$X_{2|1} = \frac{\phi}{1 + \frac{\sigma_w^2(1-\phi^2)}{\sigma_v^2}}\, Y_1$$
$$P_{2|1} = \phi^2 \frac{\sigma_v^2}{1-\phi^2} \left( 1 - \frac{1}{1 + \frac{\sigma_w^2(1-\phi^2)}{\sigma_v^2}} \right) + \sigma_v^2$$
Next we perform the updating step to calculate X2j2 and P2j2 . It is easy to verify that
this leads to the same results as in the first part of this example.
An interesting special case is obtained when we assume that $\phi = 1$ so that the state variable is a simple random walk. In this case the unconditional variance of $X_t$, and consequently also of $Y_t$, is no longer finite. As mentioned previously, we can initialize the Kalman filter by $X_{0|0} = 0$ and $P_{0|0} = \infty$. This implies:
$$X_{1|0} = 0, \qquad P_{1|0} = P_{0|0} + \sigma_v^2 = \infty.$$
Inserting this result in the updating Eqs. (17.8) and (17.9), we arrive at:
$$X_{1|1} = Y_1$$
$$P_{1|1} = \sigma_w^2.$$
This shows that the filtered variance is finite for $t = 1$ although $P_{1|0}$ was infinite.
17.3 Estimation of State Space Models

Up to now we have assumed that the parameters of the system are known and that only the state is unknown. In most economic applications, however, the parameters are also unknown and therefore have to be estimated from the data. One big advantage of state space models is that they provide an integrated approach to forecasting, smoothing, and estimation. In particular, the Kalman filter turns out to be an efficient and quick way to compute the likelihood function. Thus, it seems natural to estimate the parameters of state space models by the method of
maximum likelihood. Kim and Nelson (1999) and Durbin and Koopman (2011)
provide excellent and extensive reviews of the estimation of state space models
using the Kalman filter.
More recently, due to advances in computational methods, in particular with
respect to sparse matrix programming, other approaches can be implemented. For
example, by giving the states a matrix representation Chan and Jeliazkov (2009)
derive a viable and efficient method for the estimation of state space models.
The joint unconditional density of the observations $(Y_1', \ldots, Y_T')'$ can be factorized into the product of conditional densities as follows:
$$f(Y_1, \ldots, Y_T) = f(Y_T \mid Y_1, \ldots, Y_{T-1})\, f(Y_{T-1} \mid Y_1, \ldots, Y_{T-2}) \cdots f(Y_2 \mid Y_1)\, f(Y_1).$$
The conditional densities are given by
$$f(Y_t \mid Y_1, \ldots, Y_{t-1}) = (2\pi)^{-n/2} (\det \Omega_t)^{-1/2} \exp\left( -\frac{1}{2} (Y_t - Y_{t|t-1})' \Omega_t^{-1} (Y_t - Y_{t|t-1}) \right)$$
where $\Omega_t = G P_{t|t-1} G' + R$. The Gaussian likelihood function $L$ is therefore equal to:
$$L = (2\pi)^{-Tn/2} \left( \prod_{t=1}^{T} \det \Omega_t \right)^{-1/2} \exp\left[ -\frac{1}{2} \sum_{t=1}^{T} (Y_t - Y_{t|t-1})' \Omega_t^{-1} (Y_t - Y_{t|t-1}) \right].$$
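As a sketch, this prediction-error decomposition can be evaluated in a single forward pass of the filter (Python/NumPy; the function name `kalman_loglik` is our own, and `Y` is a $T \times n$ array of observations):

```python
import numpy as np

def kalman_loglik(Y, F, G, Q, R, x0, P0):
    """Gaussian log-likelihood via the prediction-error decomposition."""
    x, P = x0, P0
    n = Y.shape[1]
    ll = 0.0
    for y in Y:
        x_pred = F @ x
        P_pred = F @ P @ F.T + Q
        v = y - G @ x_pred                       # innovation Y_t - Y_{t|t-1}
        S = G @ P_pred @ G.T + R                 # Omega_t
        ll += -0.5 * (n * np.log(2 * np.pi) + np.log(np.linalg.det(S))
                      + v @ np.linalg.solve(S, v))
        K = P_pred @ G.T @ np.linalg.solve(S, np.eye(n))
        x = x_pred + K @ v                       # updating step
        P = P_pred - K @ G @ P_pred
    return ll
```

Maximizing this function over the system parameters then yields the MLE discussed next.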
Note that all the quantities necessary to evaluate the likelihood function are provided by the Kalman filter. Thus, the evaluation of the likelihood function is a byproduct of the Kalman filter. The maximum likelihood estimator (MLE) is then given by the maximizer of the likelihood function, or more conveniently the log-likelihood function. Usually, there is no analytic solution available, so that one must resort to numerical methods. An estimate of the asymptotic covariance matrix can be obtained by evaluating the Hessian matrix at the optimum. Under the usual regularity conditions, the MLE is consistent and asymptotically normal.
¹¹ The analogue to the EM algorithm in the Bayesian context is given by the Gibbs sampler. In contrast to the EM algorithm, in the first step we do not compute the expected value of the states, but draw a state vector from the distribution of state vectors given the parameters. In the second step, we do not maximize the likelihood function, but draw a parameter from the distribution of parameters given the state vector drawn previously. Going back and forth between these two steps, we get a Markov chain in the parameters and the states whose stationary distribution is exactly the distribution of parameters and states given the data. A detailed description of Bayesian methods and the Gibbs sampler can be found in Geweke (2005). Kim and Nelson (1999) discuss this method in the context of state space models.
17.3.2 Identification
As emphasized in Remark 17.5 of Sect. 17.1, state space representations are not unique. See, for example, the two alternative representations of the ARMA(1,1) model in Sect. 17.1. This non-uniqueness of state space models poses an identification problem because different specifications may give rise to observationally equivalent models.¹² This problem is especially serious if all states are unobservable. In practice, the identification problem gives rise to difficulties in the numerical maximization of the likelihood function. For example, one may obtain large differences for small variations in the starting values, or one may encounter difficulties in the inversion of the matrix of second derivatives.
The identification of state space models can be checked by transforming them
into VARMA models and by investigating the issue in this reparameterized setting
(Hannan and Deistler 1988). Exercise 17.5.6 invites the reader to apply this method
to the AR(1) model with measurement errors. System identification is a special field
in systems theory and will not be pursued further here. A systematic treatment can
be found in the textbook by Ljung (1999).
17.4 Examples
The official data for quarterly GDP are released in Switzerland by the State
Secretariat for Economic Affairs (SECO). They estimate these data taking the yearly
values provided by the Federal Statistical Office (FSO) as given. This division of
tasks is not uncommon in many countries. One of the most popular methods for the disaggregation of yearly data into quarterly ones was proposed by Chow and Lin (1971).¹³ It is a regression-based method which can take additional information in the form of indicator variables (i.e. variables which are measured at the higher frequency and correlated at the lower frequency with the variable of interest) into account. This procedure is, however, rather rigid. The state space framework is much more flexible and ideally suited to deal with missing observations. Applications of this framework to the problem of disaggregation were provided by Bernanke et al. (1997) and Cuche and Hess (2000), among others. We will illustrate this approach
below.
The starting point of the analysis are the yearly growth rates of GDP and indicator variables which are recorded at the quarterly frequency and which are correlated with GDP growth. In our application, we will consider the growth of industrial production (IP) and the index of consumer sentiment (C) as indicators. Both variables
¹² Remember that, in our context, two representations are equivalent if they generate the same mean and covariance function for $\{Y_t\}$.
¹³ Similarly, one may envisage the disaggregation of yearly data into monthly ones or other forms of disaggregation.
are available on a quarterly basis from 1990 onward. For simplicity, we assume that the annualized quarterly growth rate of GDP, $\{Q_t\}$, follows an AR(1) process with mean $\mu$:
$$Q_t - \mu = \phi (Q_{t-1} - \mu) + w_t, \qquad w_t \sim \mathrm{WN}(0, \sigma_w^2).$$
The indicators are related to quarterly GDP growth by
$$IP_t = \alpha_{IP} + \beta_{IP} (Q_t - \mu) + v_{IP,t}, \qquad C_t = \alpha_C + \beta_C (Q_t - \mu) + v_{C,t},$$
where the residuals $v_{IP,t}$ and $v_{C,t}$ are uncorrelated. Finally, we define the relation between quarterly and yearly GDP growth as:
$$J_t = \frac{1}{4} Q_t + \frac{1}{4} Q_{t-1} + \frac{1}{4} Q_{t-2} + \frac{1}{4} Q_{t-3}, \qquad t = 4, 8, 12, \ldots$$
We can now bring these equations into state space form. Thereby the observation equation is given by
$$Y_t = A_t + G_t X_t + W_t$$
with state vector
$$X_t = \begin{pmatrix} Q_t - \mu \\ Q_{t-1} - \mu \\ Q_{t-2} - \mu \\ Q_{t-3} - \mu \end{pmatrix}$$
and time-varying system matrices
$$A_t = \begin{cases} (\mu, \alpha_{IP}, \alpha_C)', & t = 4, 8, 12, \ldots; \\ (0, \alpha_{IP}, \alpha_C)', & t \neq 4, 8, 12, \ldots \end{cases}$$
$$G_t = \begin{cases} \begin{pmatrix} \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \\ \beta_{IP} & 0 & 0 & 0 \\ \beta_C & 0 & 0 & 0 \end{pmatrix}, & t = 4, 8, 12, \ldots; \\[2ex] \begin{pmatrix} 0 & 0 & 0 & 0 \\ \beta_{IP} & 0 & 0 & 0 \\ \beta_C & 0 & 0 & 0 \end{pmatrix}, & t \neq 4, 8, 12, \ldots \end{cases}$$
$$R_t = \begin{cases} \operatorname{diag}(0, \sigma_{IP}^2, \sigma_C^2), & t = 4, 8, 12, \ldots; \\ \operatorname{diag}(1, \sigma_{IP}^2, \sigma_C^2), & t \neq 4, 8, 12, \ldots \end{cases}$$
where the state equation is characterized by the matrices
$$F = \begin{pmatrix} \phi & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}, \qquad Q = \begin{pmatrix} \sigma_w^2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}.$$
[Figure: yearly GDP growth together with filtered and smoothed quarterly GDP and the quarterly estimates of GDP published by SECO (in percent), 1991Q1–2006Q4.]
¹⁴ The irregular component, which is not shown, has only a very small variance.
Fig. 17.4 Components of the basic structural model (BSM) for real GDP of Switzerland. (a)
Logged Swiss GDP (demeaned). (b) Local linear trend (LLT). (c) Business cycle component. (d)
Seasonal component
17.5 Exercises
Exercise 17.5.1. Consider the basic structural time series model for $\{Y_t\}$:
$$Y_t = T_t + W_t, \qquad W_t \sim \mathrm{WN}(0, \sigma_W^2)$$
$$T_t = \delta_{t-1} + T_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \mathrm{WN}(0, \sigma_\varepsilon^2)$$
$$\delta_t = \delta_{t-1} + \xi_t, \qquad \xi_t \sim \mathrm{WN}(0, \sigma_\xi^2)$$
where the error terms $W_t$, $\varepsilon_t$, and $\xi_t$ are all uncorrelated with each other at all leads and lags.
Exercise 17.5.2. If the cyclical component of the basic structural model for $\{Y_t\}$ is:
$$\begin{pmatrix} C_t \\ C_t^* \end{pmatrix} = \rho \begin{pmatrix} \cos \lambda_C & \sin \lambda_C \\ -\sin \lambda_C & \cos \lambda_C \end{pmatrix} \begin{pmatrix} C_{t-1} \\ C_{t-1}^* \end{pmatrix} + \begin{pmatrix} V_{1,t}^{(C)} \\ V_{2,t}^{(C)} \end{pmatrix}$$
where $\{V_{1,t}^{(C)}\}$ and $\{V_{2,t}^{(C)}\}$ are mutually uncorrelated white-noise processes.
(i) Show that $\{C_t\}$ follows an ARMA(2,1) process with ACF given by $\rho(h) = \rho^h \cos \lambda_C h$.
Exercise 17.5.4. Show that $X_t$ and $Y_t$ have a unique stationary and causal solution if all eigenvalues of $F$ are strictly smaller than one in absolute value. Use the results from Sect. 12.3.
Exercise 17.5.5. Find the Kalman filter equations for the following system:
$$X_t = X_{t-1} + w_t$$
$$Y_t = X_t + v_t$$
Exercise 17.5.6. Consider the state space model of an AR(1) process with measurement error analyzed in Sect. 17.2:
(i) Show that $\{Y_t\}$ is an ARMA(1,1) process given by $Y_t - \phi Y_{t-1} = Z_t + \theta Z_{t-1}$ with $Z_t \sim \mathrm{WN}(0, \sigma_Z^2)$.
(ii) Show that the parameters of the state space model, $\phi, \sigma_v^2, \sigma_w^2$, and those of the ARMA(1,1) model are related by the equations
$$\theta \sigma_Z^2 = -\phi \sigma_w^2$$
$$\frac{\theta}{1 + \theta^2} = \frac{-\phi \sigma_w^2}{\sigma_v^2 + (1 + \phi^2)\sigma_w^2}$$
18.1 Structural Breaks

There is an extensive literature dealing with the detection and dating of structural breaks in the context of time series. This literature is comprehensively summarized in Perron (2006), among others. A compact account can also be found in Aue and Horváth (2011), where additional testing procedures, like the CUSUM test, are presented. In this short exposition we follow Bai et al. (1998) and focus on Chow-type test procedures. For the technical details the interested reader is referred to these papers.
18.1.1 Methodology
Consider, for ease of exposition, a VAR(1) process which allows for a structural break at some known date $t_b$:
$$X_t = d_t(t_b) \left[ c^{(1)} + \Phi^{(1)} X_{t-1} \right] + (1 - d_t(t_b)) \left[ c^{(2)} + \Phi^{(2)} X_{t-1} \right] + Z_t \qquad (18.1)$$
where
$$d_t(t_b) = \begin{cases} 1, & t \le t_b; \\ 0, & t > t_b. \end{cases}$$
Thus, before time $t_b$ the coefficients of the VAR process are given by $c^{(1)}$ and $\Phi^{(1)}$, whereas after $t_b$ they are given by $c^{(2)}$ and $\Phi^{(2)}$. The error process $\{Z_t\}$ is assumed to be $\mathrm{IID}(0, \Sigma)$ with $\Sigma$ positive definite.¹ Suppose further that the roots of $\Phi^{(1)}(z)$ as well as those of $\Phi^{(2)}(z)$ are outside the unit circle. The process is therefore stationary and admits a causal representation with respect to $\{Z_t\}$ before and after date $t_b$.
The assumption of a structural break at some known date $t_b$ can then be investigated by testing the hypothesis
$$H_0: c^{(1)} = c^{(2)} \text{ and } \Phi^{(1)} = \Phi^{(2)} \qquad \text{against} \qquad H_1: c^{(1)} \neq c^{(2)} \text{ or } \Phi^{(1)} \neq \Phi^{(2)}.$$
The standard way to test such a hypothesis is via the F-statistic. Given a sample ranging from period 0 to period $T$, the strategy is to partition all variables and matrices along the break date $t_b$. Following the notation and the spirit of Sect. 13.2, define $Y = \operatorname{vec}(Y^{(1)}, Y^{(2)})$ where $Y^{(1)} = (X_1, X_2, \ldots, X_{t_b})$ and $Y^{(2)} = (X_{t_b+1}, X_{t_b+2}, \ldots, X_T)$, $Z = (Z_1, Z_2, \ldots, Z_T)$, and
$$X^{(1)} = \begin{pmatrix} 1 & X_{1,0} & \ldots & X_{n,0} \\ 1 & X_{1,1} & \ldots & X_{n,1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{1,t_b-1} & \ldots & X_{n,t_b-1} \end{pmatrix}, \qquad X^{(2)} = \begin{pmatrix} 1 & X_{1,t_b} & \ldots & X_{n,t_b} \\ 1 & X_{1,t_b+1} & \ldots & X_{n,t_b+1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{1,T-1} & \ldots & X_{n,T-1} \end{pmatrix}.$$
¹ Generalization to higher-order VAR models is straightforward. For changes in the covariance matrix $\Sigma$ see Bai (2000). For the technical details the reader is referred to the relevant literature.
This amounts to estimating the model separately over the two sample periods. Note that, as in Sect. 13.2, the GLS estimator is numerically identical to the OLS estimator because the same regressors are used for each equation. The corresponding Wald test can be implemented by defining $R = (I_{n^2+n}, -I_{n^2+n})$ and computing the F-statistic
$$F(t_b) = (R\hat{\beta})' \left[ R \left( X' (I_T \otimes \hat{\Sigma}_{t_b}^{-1}) X \right)^{-1} R' \right]^{-1} (R\hat{\beta}) \qquad (18.2)$$
where $\hat{\beta} = \operatorname{vec}(\hat{c}^{(1)}, \hat{\Phi}^{(1)}, \hat{c}^{(2)}, \hat{\Phi}^{(2)})$ and where $\hat{\Sigma}_{t_b}$ is computed from the least-squares residuals $\hat{Z}_t$ as $\hat{\Sigma}_{t_b} = \frac{1}{T} \sum_{t=1}^{T} \hat{Z}_t \hat{Z}_t'$ given break date $t_b$. Under the standard assumptions made in Sect. 13.2, the test statistic $F(t_b)$ converges for $T \to \infty$ to a chi-square distribution with $n^2 + n$ degrees of freedom.² This test is known in the literature as the Chow test.
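A sketch of the statistic for a VAR(1) with known break date is given below (Python/NumPy; the function name and interface are our own, and the Wald form below, which exploits the independence of the two subsample OLS estimators and the equality of GLS and OLS noted above, is algebraically equivalent to Eq. (18.2) under the stated assumptions):

```python
import numpy as np

def chow_F(X, tb):
    """Chow/Wald statistic for a break in (c, Phi) of a VAR(1) at known date tb.

    X: (T+1) x n array holding X_0, ..., X_T; regime 1 covers t = 1, ..., tb.
    """
    T, n = X.shape[0] - 1, X.shape[1]
    Z = np.column_stack([np.ones(T), X[:-1]])    # regressors (1, X_{t-1}')

    def ols(Y, W):
        B, *_ = np.linalg.lstsq(W, Y, rcond=None)
        return B, Y - W @ B

    B1, U1 = ols(X[1:tb + 1], Z[:tb])            # subsample before the break
    B2, U2 = ols(X[tb + 1:], Z[tb:])             # subsample after the break
    U = np.vstack([U1, U2])
    Sigma = U.T @ U / T                          # Sigma_hat given break date tb
    # Wald statistic for H0: beta1 = beta2 (independent subsample estimators)
    d = (B1 - B2).flatten(order="F")
    V = (np.kron(Sigma, np.linalg.inv(Z[:tb].T @ Z[:tb]))
         + np.kron(Sigma, np.linalg.inv(Z[tb:].T @ Z[tb:])))
    return d @ np.linalg.solve(V, d)
```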
The previous analysis assumed that the potential break date $t_b$ is known. This assumption often turns out to be unrealistic in practice. The question then arises how to determine a potential break date. Quandt (1960) proposed a simple procedure: compute the Chow test for all possible break dates and take as a candidate break date the date at which the F-statistic reaches its maximal value. Despite its simplicity, Quandt's procedure could not be implemented coherently because it was not clear which distribution to use for the construction of the critical values. This problem remained open for more than thirty years until the contribution of Andrews (1993).³
Denote by $\lfloor x \rfloor$ the value of $x$ rounded to the nearest integer towards minus infinity; then the maximum Wald statistic and the logarithm of the Andrews and Ploberger (1994) exponential Wald statistic can be written as follows:
$$\sup F = \max_{\lfloor \pi T \rfloor \le t_b \le \lfloor (1-\pi) T \rfloor} F(t_b)$$
$$\exp F = \log \left[ \frac{1}{T} \sum_{t_b = \lfloor \pi T \rfloor}^{\lfloor (1-\pi) T \rfloor} \exp\left( \tfrac{1}{2} F(t_b) \right) \right]$$
² As the asymptotic theory requires that $t_b/T$ does not go to zero, one has to assume that both the number of periods before and after the break go to infinity.
³ A textbook version of the test can be found in Stock and Watson (2011).
where $\pi$ denotes the percentage of the sample which is trimmed. Usually, $\pi$ takes the value 0.15 or 0.10. Critical values for low degrees of freedom are tabulated in Andrews (1993, 2003) and Stock and Watson (2011). It is possible to construct an asymptotic confidence interval for the break date. The corresponding formulas can be found in Bai et al. (1998, pp. 401–402).
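The Quandt procedure can be sketched for a univariate AR(1) with a possible break in intercept and slope (a simplified, self-contained Python illustration; the classical F form based on restricted and unrestricted sums of squared residuals and the function name are our own choices, not the multivariate statistic of Eq. (18.2)):

```python
import numpy as np

def sup_F(x, trim=0.15):
    """Quandt/Andrews sup-F: Chow statistics maximized over trimmed break dates."""
    y, z = x[1:], np.column_stack([np.ones(len(x) - 1), x[:-1]])
    T, k = len(y), 2

    def ssr(yy, zz):
        b, *_ = np.linalg.lstsq(zz, yy, rcond=None)
        e = yy - zz @ b
        return e @ e

    ssr_full = ssr(y, z)                         # restricted model: no break
    lo, hi = int(np.floor(trim * T)), int(np.floor((1 - trim) * T))
    best_tb, best_F = lo, -np.inf
    for tb in range(lo, hi + 1):
        ssr_split = ssr(y[:tb], z[:tb]) + ssr(y[tb:], z[tb:])
        F = ((ssr_full - ssr_split) / k) / (ssr_split / (T - 2 * k))
        if F > best_F:
            best_tb, best_F = tb, F
    return best_tb, best_F
```

Because the candidate break date is chosen by maximization, the resulting statistic must be compared with the Andrews (1993, 2003) critical values rather than the standard chi-square ones.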
18.1.2 An Example
The use of the structural break test is demonstrated using historical data for the
United Kingdom. The data consist of logged per capita real GDP, logged per
capita real government expenditures, logged per capita real government revenues,
the inflation based on the consumer price index, and a long-term interest rate
over a sample period from 1830 to 2003. The basis for the analysis consists of
a five variable VAR(2) model including a constant term and a linear trend. Three
alternative structural break modes are investigated: break in the intercept, break
in the intercept and the time trend, and break in all coefficients, including the
VAR coefficients. The corresponding F-statistics are plotted in Fig. 18.1 against
all possible break dates, allowing for a trimming value of 10 %. The horizontal lines show, for all three alternative break modes, the corresponding critical values for the supF test at the 5 % significance level. These critical values have been obtained from Monte Carlo simulations as in Andrews (1993, 2003) and are given as 18.87, 28.09, and 97.39.⁴
Figure 18.1 shows that for all three modes a significant structural break occurs.
The corresponding values of the supF statistics are 78.06, 104.75, and 285.22. If
only the deterministic parts are allowed to change, the break date is located in 1913.
If all coefficients are allowed to change, the break is dated in 1968. However, all three F-statistics show a steep increase in 1913. Thus, if only one break is allowed, 1913 seems to be the most likely one.⁵ The breaks are quite precisely dated: the corresponding standard errors are estimated to be two years for the break in the intercept only and one year for the other two break modes.
⁴ Assuming a trimming value of 0.10, Andrews (2003, Table I) reports critical values of 18.86 for $p = 5$, which corresponds to changes in the intercept only, and 27.27 for $p = 10$, which corresponds to changes in intercept and time trend.
⁵ See Perron (2006) for a discussion of multiple breaks.
Fig. 18.1 Analysis of breaks dates with the sup F test statistic for historical UK time series
18.2 Time-Varying Parameters

Consider the VAR(1) process with time-varying coefficients
$$X_t = \Phi_{t-1} X_{t-1} + Z_t. \qquad (18.3)$$
This model can be easily generalized to higher-order VARs (see below) or, alternatively, one may think of Eq. (18.3) as a higher-order VAR in companion form. The autoregressive coefficient matrix is assumed to be stochastic. Thus, $\Phi_t$ is a random $n \times n$ matrix. Models of this type have been widely discussed in the probabilistic literature because they arise in many diverse contexts. In economics,
Eq. (18.3) can be interpreted as the probabilistic version describing the value of a
perpetuity, i.e. the present discounted value of a permanent commitment to pay a
certain sum each period. Thereby Zt denotes the random periodic payments and ˆt
the random cumulative discount factors. The model also plays an important role in
the characterization of the properties of volatility models as we have seen in Sect. 8.1
(see in particular the proofs of Theorems 8.1 and 8.3). In this presentation, the above
model is interpreted as a locally valid VAR process.
A natural question to ask is under which conditions Eq. (18.3) admits a stationary
solution. An answer to this question can be found by iterating the equation
backwards in time:
$$\begin{aligned} X_t &= \Phi_{t-1} X_{t-1} + Z_t = \Phi_{t-1} (\Phi_{t-2} X_{t-2} + Z_{t-1}) + Z_t \\ &= Z_t + \Phi_{t-1} Z_{t-1} + \Phi_{t-1} \Phi_{t-2} X_{t-2} \\ &= Z_t + \Phi_{t-1} Z_{t-1} + \Phi_{t-1} \Phi_{t-2} Z_{t-2} + \Phi_{t-1} \Phi_{t-2} \Phi_{t-3} X_{t-3} \\ &\;\;\vdots \\ &= \sum_{j=0}^{k} \left( \prod_{i=1}^{j} \Phi_{t-i} \right) Z_{t-j} + \left( \prod_{i=1}^{k+1} \Phi_{t-i} \right) X_{t-k-1}, \qquad k = 0, 1, 2, \ldots \end{aligned}$$
where it is understood that $\prod_{i=1}^{0} = I_n$. This suggests as a solution candidate
$$X_t = \lim_{k \to \infty} \sum_{j=0}^{k} \left( \prod_{i=1}^{j} \Phi_{t-i} \right) Z_{t-j} = Z_t + \Phi_{t-1} Z_{t-1} + \Phi_{t-1} \Phi_{t-2} Z_{t-2} + \Phi_{t-1} \Phi_{t-2} \Phi_{t-3} Z_{t-3} + \cdots \qquad (18.4)$$
Based on results obtained by Brandt (1986) and extended by Bougerol and Picard
(1992b), we can cite the following theorem.
Theorem 18.1 (Solution TVC-VAR(1)). Let $\{(\Phi_t, Z_t)\}$ be a strictly stationary ergodic process such that
(i) $E(\log^+ \|\Phi_t\|) < \infty$ and $E(\log^+ \|Z_t\|) < \infty$, where $x^+$ denotes $\max\{x, 0\}$;
(ii) the top Lyapunov exponent $\gamma$ defined as
$$\gamma = \inf_{n \in \mathbb{N}} \frac{1}{n+1} E \log \|\Phi_0 \Phi_{-1} \cdots \Phi_{-n}\|$$
is strictly negative.
Then $X_t$ as defined in Eq. (18.4) converges a.s. and $\{X_t\}$ is the unique strictly stationary solution of Eq. (18.3).
Remark 18.1. The Lyapunov exponent measures the rate of separation of nearby trajectories in a dynamic system. The top Lyapunov exponent gives the largest of these rates. It is used to characterize the stability of a dynamic system (see Colonius and Kliemann (2014)).
Remark 18.2. Although Theorem 18.1 states only sufficient conditions, these assumptions can hardly be relaxed.
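The top Lyapunov exponent of condition (ii) can be approximated by simulating the matrix products and renormalizing at each step (a Monte Carlo sketch in Python; `draw_phi` is a user-supplied sampler for $\Phi_t$ and the function name is ours):

```python
import numpy as np

def top_lyapunov(draw_phi, n, steps=5000, seed=0):
    """Estimate gamma = lim (1/k) E log ||Phi_{t-1} ... Phi_{t-k}|| by simulation."""
    rng = np.random.default_rng(seed)
    x = np.eye(n)
    acc = 0.0
    for _ in range(steps):
        x = draw_phi(rng) @ x
        nrm = np.linalg.norm(x, 2)
        acc += np.log(nrm)   # log of the incremental growth factor
        x /= nrm             # renormalize to avoid over-/underflow
    return acc / steps
```

In the deterministic case $\Phi_t = \Phi$ with spectral radius less than one, the estimate reduces to the familiar stability condition of the constant-coefficient VAR.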
In contrast to the impulse response functions studied so far, they are clearly random and time-dependent because the effect of $Z_t$ depends on future coefficients. In particular, the effect of $Z_t$ on $X_{t+h}$, $h \ge 1$, is not the same as the effect of $Z_{t-h}$ on $X_t$. Nevertheless it is possible to construct meaningful impulse response functions by Monte Carlo simulations. One may then report the mean of the impulse responses or some quantiles for different time periods.⁶ Alternatively, one may ignore the randomness and time-dependency and define "local" impulse responses as $\Phi_t^h$, $h = 0, 1, 2, \ldots$. Note, however, that the impulse responses so defined still vary with time. Irrespective of how the impulse responses are constructed, they can be interpreted in the same way as in the case of constant coefficients. In particular, we may use some of the
identification schemes discussed in Chap. 15 and compute the impulse responses
with respect to structural shocks. Similar arguments apply to the forecast error
variance decomposition (FEVD).
The model is closed by fixing the law of motion for ˆt . As already mentioned
in Sect. 17.1.1 there are several possibilities. In this presentation we adopt the
following flexible autoregressive specification:
Conditional on initial values for the coefficients and their covariances, the state
space model can be estimated by maximum likelihood by applying the Kalman
filter (see Sect. 17.3 and Kim and Nelson (1999)). One possibility to initialize the
Kalman filter is to estimate the model for some initial sample period assuming fixed
coefficients and extract from these estimates the corresponding starting values.
⁶ Potter (2000) discusses the principal problems of defining impulse responses in a nonlinear context.
⁷ They allow for a correlation between $V_t$ and $Z_t$.
As it turns out, allowing time-variation only in the coefficients of the VAR model
overstates the role attributed to structural changes. We therefore generalize the
model to allow for time-varying volatility. More specifically, we also allow † in
Eq. (18.3) to vary with time. The modeling of the time-variation in † is, however,
not a straightforward task because we must ensure that in each period †t is a
symmetric positive definite matrix. One approach is to specify a process especially
designed for modeling the dynamics of covariance matrices. This so-called Wishart
autoregressive process was first introduced to economics by Gouriéroux et al. (2009)
and successfully applied by Burren and Neusser (2013). It leads to a nonlinear state
space system which can be estimated with the particle filter, a generalization of the
Kalman filter.
Another, more popular, approach was initiated by Cogley and Sargent (2005) and Primiceri (2005). It is based on the Cholesky factorization of the time-varying covariance matrix $\Sigma_t$. Using the same notation as in Sect. 15.3, $\Sigma_t$ is decomposed as
$$\Sigma_t = B_t \Lambda_t B_t' \qquad (18.8)$$
where $B_t$ is a time-varying lower triangular matrix with ones on the diagonal and $\Lambda_t$ a time-varying diagonal matrix with strictly positive diagonal elements.⁸ The logged diagonal elements of $\Lambda_t$ are then assumed to evolve as independent univariate random walks. This specification can be written in matrix terms as
$$\Lambda_t = \Lambda_{t-1} \exp(D_t) \qquad (18.9)$$
$$B_t = B_{t-1} \exp(C_t) \qquad (18.10)$$
where $C_t$ is a strictly lower triangular matrix, i.e. a lower triangular matrix with zeros on the diagonal. The non-zero entries of $C_t$, denoted by $[C_t]_{i>j}$, are assumed to follow a multivariate white noise process with diagonal covariance matrix $\Sigma_B$, i.e. $[C_t]_{i>j} \sim \mathrm{WN}(0, \Sigma_B)$. It can be shown that the matrix exponential of a strictly lower triangular matrix is a lower triangular matrix with ones on the diagonal. As the set of lower triangular matrices with ones on the diagonal forms a group, called the unipotent group and denoted by $\mathrm{SLT}_n$, the above specification is well-defined. Moreover, this formulation is a very natural one, as the set of strictly lower triangular matrices
⁸ It is possible to consider other short-run type identification schemes (see Sect. 15.3) than the Cholesky factorization.
⁹ The matrix exponential of a matrix $A$ is defined as $\exp(A) = \sum_{i=0}^{\infty} \frac{1}{i!} A^i$ where $A$ is any square matrix. Its inverse $\log(A)$ is defined only for $\|A - I\| < 1$ and is given by $\log(A) = \sum_{i=1}^{\infty} \frac{(-1)^{i-1}}{i} (A - I)^i$.
is the tangent space of $\mathrm{SLT}_n$ at the identity (see Baker 2002 for details). Thus, Eq. (18.10) can be interpreted as a log-linearized version of $B_t$. The technique proposed for the evolution of $B_t$ in Eq. (18.10) departs from Primiceri (2005), who models each element of the inverse of $B_t$ and therefore misses a coherent system-theoretic approach. See Neusser (2016) for details.
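Because a strictly lower triangular $n \times n$ matrix is nilpotent ($C^n = 0$), the exponential series of footnote 9 terminates after finitely many terms and can be computed exactly (a small Python sketch; the function name is ours):

```python
import numpy as np

def expm_strictly_lower(C):
    """exp(C) for strictly lower triangular C: C^n = 0, so the series is finite."""
    n = C.shape[0]
    term, out = np.eye(n), np.eye(n)
    for i in range(1, n):
        term = term @ C / i        # accumulates C^i / i!
        out = out + term
    return out                     # unit lower triangular: an element of SLT_n
```

The result always has ones on the diagonal, and $\exp(C)\exp(-C) = I$, consistent with the group structure of $\mathrm{SLT}_n$ described in the text.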
Although this TVC-VAR model with time-varying volatility can in principle also be estimated by maximum likelihood, this technique can hardly be implemented successfully in practice. The main reason is that the likelihood function of such a model, even when the dimension and the order of the VAR are low, is a very high-dimensional nonlinear object with probably many local maxima. Moreover, as the variances governing the time variation are small, at least for some of the coefficients, the likelihood function is flat in some regions of the parameter space. These features make maximization of the likelihood function a very difficult, if not impossible, task in practice. For these reasons, Bayesian techniques have been used almost exclusively. There is, however, also a conceptual issue involved. As the Bayesian approach does not strictly distinguish between fixed "true" parameters and random samples, it is better suited to handle TVC-VAR models, which treat the parameters as random. In this monograph, we will not tackle the Bayesian approach but refer to the relevant literature. See for example Primiceri (2005), Del Negro and Primiceri (2015), Cogley and Sargent (2005), Canova (2007), and Koop and Korobilis (2009), among others.
$$X_t = c_t + \Phi_{t-1}^{(1)} X_{t-1} + \ldots + \Phi_{t-1}^{(p)} X_{t-p} + Z_t. \qquad (18.11)$$
This equation can be written compactly as $X_t = \Phi_{t-1} \widetilde{X}_{t-1} + Z_t$ where $\widetilde{X}_{t-1} = (1, X_{t-1}', \ldots, X_{t-p}')'$, $\Phi_{t-1} = (c_t, \Phi_{t-1}^{(1)}, \ldots, \Phi_{t-1}^{(p)})$, and $\beta_t = \operatorname{vec} \Phi_t$. Assuming for $\beta_t$ the same autoregressive form as in Eq. (18.5), the state space representation (18.6) and (18.7) also applies to the TVC-VAR(p) model with $X_{t-1}$ replaced by $\widetilde{X}_{t-1}$. Note that the dimension of the state equation can become very high because $\beta_t$ is an $(n + n^2 p)$-vector.
Taking date 0 as the initial date, the prior distribution of the autoregressive parameters is supposed to be normal:
$$\beta_0 = \operatorname{vec} \Phi_0 \sim N(\bar{\beta}, P_{0|0})$$
where $\bar{\beta} = \operatorname{vec}(0, I_n, 0, \ldots, 0)$. This implies that the mean for all coefficients, including the constant term, is assumed to be zero, except for the own lag coefficients of order one, $[\Phi_0^{(1)}]_{ii}$, $i = 1, \ldots, n$, which are assumed to be one. The covariance matrix $P_{0|0}$ is taken to be diagonal so that there is no correlation across coefficients. Thus, the prior specification amounts to assuming that each variable follows a random walk with no interaction with other variables.
The strength of this belief is governed by a number of so-called hyperparameters which regulate the diagonal elements of $P_{0|0}$. The first one, $\gamma^2$, controls the confidence placed on the assumption that $[\Phi_0^{(1)}]_{ii} = 1$:
$$[\Phi_0^{(1)}]_{ii} \sim N(1, \gamma^2), \qquad i = 1, 2, \ldots, n.$$
A small (large) value of $\gamma^2$ thus means more (less) confidence. As the lag order increases, more confidence is placed on the assumption $[\Phi_0^{(h)}]_{ii} = 0$:
$$[\Phi_0^{(h)}]_{ii} \sim N\left(0, \frac{\gamma^2}{h}\right), \qquad h = 2, \ldots, p \text{ and } i = 1, \ldots, n.$$
Instead of the harmonic decline other schemes have been proposed. For $h = 1, \ldots, p$ the off-diagonal elements of $\Phi_0^{(h)}$ are assumed to have prior distribution
$$[\Phi_0^{(h)}]_{ij} \sim N\left(0, \frac{w^2 \gamma^2}{h} \frac{\hat{\sigma}_i^2}{\hat{\sigma}_j^2}\right), \qquad i, j = 1, \ldots, n,\; i \neq j,\; h = 1, 2, \ldots, p.$$
Thereby $\hat{\sigma}_i^2/\hat{\sigma}_j^2$ represents a correction factor which accounts for the magnitudes of $X_{it}$ relative to $X_{jt}$. Specifically, $\hat{\sigma}_i^2$ is the residual variance of a univariate AR(1) model. The hyperparameter $w^2$ is assumed to be strictly smaller than one. This represents the belief that $X_{j,t-h}$ is less likely to be important as an explanation for $X_{i,t}$, $i \neq j$, than the own lag $X_{i,t-h}$. Finally, the strength of the belief that the constant terms are zero is governed by a third hyperparameter, $g$.
This completes the specification of the prior belief on $\beta_0$. Combining all elements
we can write $P_{0|0}$ as a block diagonal matrix with diagonal blocks:

$$P_{0|0} = \begin{pmatrix} P_{0|0}^{(c)} & 0 \\ 0 & P_{0|0}^{(\Phi)} \end{pmatrix}$$

where $P_{0|0}^{(c)} = g\, \mathrm{diag}(\hat\sigma_1^2, \dots, \hat\sigma_n^2)$ and $P_{0|0}^{(\Phi)} = \mathrm{diag}(\operatorname{vec}(G \otimes \Upsilon))$. Thereby, $G$ and
$\Upsilon$ are defined as

$$G = (\gamma^2, \gamma^2/2, \dots, \gamma^2/p), \qquad [\Upsilon]_{ij} = \begin{cases} 1, & i = j; \\ w^2 \left(\hat\sigma_i^2 / \hat\sigma_j^2\right), & i \neq j. \end{cases}$$
According to Doan et al. (1984), the preferred values for the three hyperparameters
are $g = 700$, $\gamma^2 = 0.07$, and $w^2 = 0.01$.
Thus, for a bivariate TVC-VAR(2) model the mean vector is given by
$\bar\beta = (0, 0, 1, 0, 0, 1, 0, 0, 0, 0)'$ with diagonal covariance matrix $P_{0|0}$:

$$P_{0|0} = \begin{pmatrix} P_{0|0}^{(c)} & 0 & 0 \\ 0 & P_{0|0}^{(1)} & 0 \\ 0 & 0 & P_{0|0}^{(2)} \end{pmatrix}$$

with

$$P_{0|0}^{(c)} = g \begin{pmatrix} \hat\sigma_1^2 & 0 \\ 0 & \hat\sigma_2^2 \end{pmatrix}, \qquad
P_{0|0}^{(1)} = \gamma^2 \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & w^2 \hat\sigma_2^2/\hat\sigma_1^2 & 0 & 0 \\ 0 & 0 & w^2 \hat\sigma_1^2/\hat\sigma_2^2 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

and

$$P_{0|0}^{(2)} = \frac{\gamma^2}{2} \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & w^2 \hat\sigma_2^2/\hat\sigma_1^2 & 0 & 0 \\ 0 & 0 & w^2 \hat\sigma_1^2/\hat\sigma_2^2 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}.$$
364 18 Generalizations of Linear Models
Next we specify the parameters of the state transition equation (18.5). Following
Doan et al. (1984), $F = \lambda_F I_{n+pn^2}$ with $\lambda_F = 0.999$ and $Q = \lambda_Q P_{0|0}$ with $\lambda_Q =
10^{-7}$. The proportionality factor does, however, not apply to the constant terms. For
these terms, the corresponding diagonal elements of $Q$, $[Q]_{ii}$, $i = 1, \dots, n$, are set
to $\lambda_Q [P_{0|0}]_{i(n+1), i(n+1)}$, $i = 1, \dots, n$. The reason for this correction is that the prior
put on the constants is rather loose, as expressed by the high value of $g$. The final
component is a specification for $\Sigma$, the variance of $Z_t$. This matrix is believed to be
diagonal with $\Sigma = \lambda_\Sigma\, \mathrm{diag}(\hat\sigma_1^2, \dots, \hat\sigma_n^2)$ and $\lambda_\Sigma = 0.9$.
With these ingredients the state space model is completely specified. Given
observations $X_1, \dots, X_t$, the Kalman filter produces a sequence of coefficient estimates $\beta_{t+1|t}$,
$t = 1, 2, \dots$, and one-period-ahead forecasts $X_{t+1|t} = \Phi_{t+1|t} \mathbf{X}_t$, where $\operatorname{vec} \Phi_{t+1|t} = \beta_{t+1|t}$.
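As an illustration, the prior just described can be assembled in a few lines. The following Python sketch (the function name and the residual variances $\hat\sigma_i^2$ are ours, purely for illustration) constructs $\bar\beta$ and the diagonal matrix $P_{0|0}$ for a TVC-VAR(p), with the Doan et al. (1984) hyperparameter values as defaults.

```python
import numpy as np

def tvc_var_prior(sigma2, p, g=700.0, gamma2=0.07, w2=0.01):
    """Doan-Litterman-Sims prior for a TVC-VAR(p): returns (beta_bar, P00).

    sigma2 : residual variances of univariate AR(1) models, one per variable.
    The prior mean sets each own first lag to one (random walk); P00 is
    diagonal, with blocks for the constants and for the lags h = 1, ..., p.
    """
    n = len(sigma2)
    # Prior mean: vec(0, I_n, 0, ..., 0) -- columns: constant, lag 1, ..., lag p.
    Phi_bar = np.hstack([np.zeros((n, 1)), np.eye(n)] + [np.zeros((n, n))] * (p - 1))
    beta_bar = Phi_bar.flatten(order="F")          # vec stacks columns

    diag = list(g * np.asarray(sigma2))            # loose prior on the constants
    for h in range(1, p + 1):                      # lag blocks, column by column
        for j in range(n):                         # column j: lag-h coeffs on X_j
            for i in range(n):
                v = gamma2 / h if i == j else w2 * gamma2 * sigma2[i] / (h * sigma2[j])
                diag.append(v)
    return beta_bar, np.diag(diag)

# Bivariate TVC-VAR(2) with hypothetical residual variances 1 and 2:
beta_bar, P00 = tvc_var_prior([1.0, 2.0], p=2)
```

For $n = 2$ and $p = 2$ this reproduces the mean vector $(0,0,1,0,0,1,0,0,0,0)'$ and the block structure of $P_{0|0}$ given above.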
The regime switching model is similar to the time-varying model discussed in the
previous section. The difference is that the time-varying parameters are governed
by a hidden Markov chain with a finite state space $S = \{1, 2, \dots, k\}$. Usually,
the number of states $k$ is small and is in practice equal to two or at most three.
The states usually have an economic connotation. For example, if $k$ equals two,
state 1 might correspond to a boom phase whereas state 2 might correspond to a recession. Such
models have a long tradition in economics and have been used extensively.
Seminal references include Goldfeld and Quandt (1973, 1976), Hamilton (1994b),
Kim and Nelson (1999), Krolzig (1997), and Maddala (1986). Frühwirt-Schnatter
(2006) presents a detailed statistical analysis of regime switching models.
The starting point of our presentation of the regime switching model is again the
TVC-VAR(1) as given in Eq. (18.3). We associate to each state $j \in S$ a coefficient
matrix $\Phi^{(j)}$. Thus, in the regime switching model the coefficients $\Phi_t$ can only assume
a finite number of values $\Phi^{(1)}, \dots, \Phi^{(k)}$, depending on the state of the Markov chain.
The actual value assigned to $\Phi_t$ is governed by a Markov chain defined through a
fixed but unknown transition probability matrix $P$ where

$$[P]_{ij} = \mathrm{P}\left(\Phi_t = \Phi^{(j)} \mid \Phi_{t-1} = \Phi^{(i)}\right), \quad i, j = 1, \dots, k. \tag{18.13}$$

Thus, $[P]_{ij}$ is the probability that $\Phi_t$ assumes the value $\Phi^{(j)}$ given that it assumed the value $\Phi^{(i)}$ in the
previous period. The probability that $\Phi_{t+h}$ is in state $j$ given that $\Phi_t$
was in state $i$ is therefore $[P^h]_{ij}$. The definition of the transition matrix in Eq. (18.13)
implies that $P$ is a stochastic matrix, i.e. that $[P]_{ij} \geq 0$ and $\sum_{j=1}^{k} [P]_{ij} = 1$.
Moreover, we assume that the chain is regular, meaning that it is ergodic (irreducible) and aperiodic.¹⁰
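The role of the matrix powers $P^h$ is easy to illustrate numerically. The sketch below uses a hypothetical two-state transition matrix; for a regular chain, $P^h$ converges as $h \to \infty$ to a matrix with identical rows given by the stationary distribution of the chain.

```python
import numpy as np

# Hypothetical 2-state transition matrix: [P]_ij = P(s_t = j | s_{t-1} = i),
# e.g. state 1 = boom, state 2 = recession; each row sums to one.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# h-step transition probabilities are given by the matrix power P^h.
P5 = np.linalg.matrix_power(P, 5)

# For a regular chain, P^h converges to a matrix with identical rows --
# the stationary distribution pi solving pi' P = pi'.
Pinf = np.linalg.matrix_power(P, 200)
pi = Pinf[0]          # for this P: pi = (0.75, 0.25)
```

Every $P^h$ is again a stochastic matrix, so its rows also sum to one.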
18.3 Regime Switching Models 365
where, in analogy to the Kalman filter, the expressions $\mathrm{P}(s_t = j \mid X_{t-1}; \theta)$,
$j = 1, \dots, k$, are called the predicted transition probabilities. The conditional
marginal density of $x_t$ then becomes

$$f(x_t \mid X_{t-1}; \theta) = \sum_{j=1}^{k} f(x_t \mid s_t = j, X_{t-1}; \theta)\, \mathrm{P}(s_t = j \mid X_{t-1}; \theta).$$

In the case of $Z_t \sim \mathrm{IID}\, \mathcal{N}(0, \Sigma)$, the above density is a finite mixture of Gaussian
distributions (see Frühwirt-Schnatter 2006 for details). The (conditional) log
likelihood function, finally, is therefore given by

$$\ell(\theta) = \sum_{t=1}^{T} \log f(x_t \mid X_{t-1}; \theta).$$
¹⁰ A chain is called ergodic or irreducible if for every pair of states $i$ and $j$ there is a strictly positive
probability that the chain moves from state $i$ to state $j$ in finitely many steps. A chain is called
aperiodic if it can return to any state $i$ at irregular times. See, among others, Norris (1998) and
Berman and Plemmons (1994) for an introduction to Markov chains and their terminology.
¹¹ The presentation of the maximum likelihood approach follows closely the exposition in
Hamilton (1994b, chapter 22) where more details can be found.
In order to evaluate the likelihood function, note that the joint density of $(x_t, s_t = j)$
may also be factored as

$$f(x_t, s_t = j \mid X_{t-1}; \theta) = f(x_t \mid s_t = j, X_{t-1}; \theta)\, \mathrm{P}(s_t = j \mid X_{t-1}; \theta).$$

Combining these expressions one obtains an expression for the filtered transition
probabilities $\mathrm{P}(s_t = j \mid X_t; \theta)$:

$$\mathrm{P}(s_t = j \mid X_t; \theta) = \frac{f(x_t \mid s_t = j, X_{t-1}; \theta)\, \mathrm{P}(s_t = j \mid X_{t-1}; \theta)}{f(x_t \mid X_{t-1}; \theta)}
= \frac{f(x_t \mid s_t = j, X_{t-1}; \theta)\, \mathrm{P}(s_t = j \mid X_{t-1}; \theta)}{\sum_{j=1}^{k} f(x_t \mid s_t = j, X_{t-1}; \theta)\, \mathrm{P}(s_t = j \mid X_{t-1}; \theta)} \tag{18.14}$$
Given initial probabilities $\mathrm{P}(s_1 = j \mid X_0; \theta)$, $j = 1, \dots, k$, and a fixed value for
$\theta$, Eqs. (18.14) and (18.15) can be iterated forward to produce a sequence of
predicted transition probabilities $(\mathrm{P}(s_t = 1 \mid X_{t-1}; \theta), \dots, \mathrm{P}(s_t = k \mid X_{t-1}; \theta))'$,
$t = 1, 2, \dots, T$, which can be used to evaluate the Gaussian likelihood function.
Numerical procedures must then be used for the maximization of the likelihood
function. This task is not without challenge because the likelihood function of
Gaussian mixture models typically has singularities and many local maxima.
Kiefer (1978) showed that there exists a bounded local maximum which yields
a consistent and asymptotically normal estimate of $\theta$ for which standard errors
can be constructed in the usual way. In practice, problems encountered during the
maximization can be alleviated by experimenting with alternative starting values.
Thereby the initial probabilities $(\mathrm{P}(s_1 = 1 \mid X_0; \theta), \dots, \mathrm{P}(s_1 = k \mid X_0; \theta))'$ could either
be treated as additional parameters, as in Goldfeld and Quandt (1973), or set to the
uniform distribution. For technical details and alternative estimation strategies, like
the EM algorithm, see Hamilton (1994b, chapter 22) and in particular Frühwirt-
Schnatter (2006).
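The forward recursion can be sketched as follows. This is a deliberately simple scalar Gaussian version (state-dependent means and standard deviations, all inputs hypothetical), not a full implementation of the VAR case.

```python
import numpy as np

def hamilton_filter(x, means, sigmas, P, xi0):
    """Scalar Gaussian regime-switching filter (sketch).

    x      : observations, shape (T,)
    means, sigmas : state-dependent mean and standard deviation, shape (k,)
    P      : transition matrix, [P]_ij = P(s_t = j | s_{t-1} = i)
    xi0    : initial probabilities P(s_1 = j | X_0), shape (k,)
    Returns filtered probabilities (T, k) and the log likelihood.
    """
    T, k = len(x), len(xi0)
    xi_pred = np.asarray(xi0, dtype=float)    # predicted P(s_t = j | X_{t-1})
    filtered = np.zeros((T, k))
    loglik = 0.0
    for t in range(T):
        # Gaussian densities f(x_t | s_t = j, X_{t-1})
        dens = np.exp(-0.5 * ((x[t] - means) / sigmas) ** 2) / (np.sqrt(2 * np.pi) * sigmas)
        joint = dens * xi_pred                 # numerator of Eq. (18.14)
        f_t = joint.sum()                      # mixture density f(x_t | X_{t-1})
        loglik += np.log(f_t)
        filtered[t] = joint / f_t              # filtered P(s_t = j | X_t)
        xi_pred = filtered[t] @ P              # predict: P(s_{t+1} = j | X_t)
    return filtered, loglik
```

The accumulated `loglik` is exactly the conditional log likelihood $\ell(\theta)$ above, so the function can be handed to a numerical optimizer.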
By reversing the above recursion it is possible to compute smoothed transition
probabilities $\mathrm{P}(s_t = j \mid X_T; \theta)$ (see Kim 1994):

$$\mathrm{P}(s_t = j \mid X_T; \theta) = \mathrm{P}(s_t = j \mid X_t; \theta) \sum_{i=1}^{k} [P]_{ji}\, \frac{\mathrm{P}(s_{t+1} = i \mid X_T; \theta)}{\mathrm{P}(s_{t+1} = i \mid X_t; \theta)}.$$

The iteration is initialized with $\mathrm{P}(s_T = j \mid X_T; \theta)$, which has been computed in the
forward recursion.
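The backward recursion can be sketched accordingly; the convention $[P]_{ij} = \mathrm{P}(s_t = j \mid s_{t-1} = i)$ from Eq. (18.13) is assumed, and the filtered probabilities are taken as given from the forward pass.

```python
import numpy as np

def kim_smoother(filtered, P):
    """Backward recursion for smoothed probabilities P(s_t = j | X_T) (sketch).

    filtered : (T, k) filtered probabilities P(s_t = j | X_t)
    P        : transition matrix, [P]_ij = P(s_t = j | s_{t-1} = i)
    """
    T, k = filtered.shape
    smoothed = np.zeros_like(filtered)
    smoothed[-1] = filtered[-1]               # initialization from the forward pass
    for t in range(T - 2, -1, -1):
        pred = filtered[t] @ P                # P(s_{t+1} = i | X_t)
        ratio = smoothed[t + 1] / pred        # P(s_{t+1} = i | X_T) / P(s_{t+1} = i | X_t)
        smoothed[t] = filtered[t] * (P @ ratio)   # times sum_i [P]_ji * ratio_i
    return smoothed
```

Each row of the result is again a probability distribution over the $k$ states.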
The basic model can be, and has been, generalized in several dimensions. The most
obvious one is the inclusion of additional lags beyond the first one. The second
one concerns the possibility of a regime switching covariance matrix $\Sigma$. These
modifications can be accommodated using the methods outlined above. Thirdly,
one may envision time-varying transition probabilities to account for duration
dependence. In business cycle analysis, for example, the probability of moving out
of a recession may depend on how long the economy has been in the recession
regime. This idea can be implemented by modeling the transition probabilities via a
logit specification:

$$[P]_{ij} = \frac{\exp(z_t' \alpha_i)}{1 + \exp(z_t' \alpha_i)}, \quad i \neq j.$$
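A minimal numerical sketch of such a logit specification (hypothetical coefficients $\alpha_i$, with $z_t = (1, d_t)'$ where $d_t$ is the number of periods already spent in the current regime):

```python
import numpy as np

def logistic(u):
    return np.exp(u) / (1.0 + np.exp(u))

# Hypothetical duration dependence: z_t = (1, d_t)', alpha_i regime-specific.
alpha = np.array([-2.0, 0.3])
# Exit probability [P]_ij, i != j, evaluated at durations d = 1, 4, 8.
probs = [logistic(np.array([1.0, float(d)]) @ alpha) for d in (1, 4, 8)]
```

With a positive duration coefficient, the probability of leaving the current regime rises the longer the economy has been in it.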
These two operations will turn $\mathbb{C}$ into a field where $(0,0)$ and $(1,0)$ play the roles of
0 and 1.¹ The real numbers $\mathbb{R}$ are embedded into $\mathbb{C}$ because we identify any $a \in \mathbb{R}$
with $(a, 0) \in \mathbb{C}$.

The number $\imath = (0,1)$ is of special interest. It solves the equation $x^2 + 1 = 0$,
i.e. $\imath^2 = -1$, the other solution being $-\imath = (0,-1)$. Thus any complex number
$(a, b)$ may be written as $(a, b) = a + \imath b$ where $a, b$ are arbitrary real numbers.²
¹ Subtraction and division can be defined accordingly.
² A more detailed introduction to complex numbers can be found in Rudin (1976) or any other
mathematics textbook.
$$z = a + \imath b \qquad \text{(Cartesian coordinates)}$$
$$\phantom{z} = r e^{\imath\theta} = r(\cos\theta + \imath \sin\theta) \qquad \text{(polar coordinates).}$$
[Fig. A.1: Representation of a complex number $z = a + \imath b$ in the complex plane: real part $a$, imaginary part $b$, modulus $r$, angle $\theta$, the unit circle $a^2 + b^2 = 1$, and the conjugate $\bar z = a - \imath b$.]
A Complex Numbers 371
$$\cos\theta = \frac{e^{\imath\theta} + e^{-\imath\theta}}{2} = \frac{a}{r}, \qquad
\sin\theta = \frac{e^{\imath\theta} - e^{-\imath\theta}}{2\imath} = \frac{b}{r}.$$
Further implications are de Moivre's formula and Pythagoras' theorem (see
Fig. A.1):

$$\text{de Moivre's formula:} \qquad \left(r e^{\imath\theta}\right)^n = r^n e^{\imath n\theta} = r^n \left(\cos n\theta + \imath \sin n\theta\right)$$

$$\text{Pythagoras' theorem:} \qquad 1 = e^{\imath\theta} e^{-\imath\theta} = (\cos\theta + \imath\sin\theta)(\cos\theta - \imath\sin\theta) = \cos^2\theta + \sin^2\theta.$$
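These identities are easy to verify with Python's built-in complex arithmetic:

```python
import cmath
import math

z = 1.0 + 1.0j                      # a = b = 1
r, theta = abs(z), cmath.phase(z)   # modulus r = sqrt(2), angle theta = pi/4

# Polar form and Euler's formula: z = r e^{i theta} = r(cos theta + i sin theta)
assert cmath.isclose(z, r * cmath.exp(1j * theta))
assert cmath.isclose(z, r * (math.cos(theta) + 1j * math.sin(theta)))

# de Moivre: (r e^{i theta})^n = r^n e^{i n theta}
n = 5
assert cmath.isclose(z ** n, r ** n * cmath.exp(1j * n * theta))

# Pythagoras: e^{i theta} e^{-i theta} = cos^2 theta + sin^2 theta = 1
assert math.isclose(math.cos(theta) ** 2 + math.sin(theta) ** 2, 1.0)
```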
3
The notation with “ j zj ” instead of “j zj ” was chosen to conform to the notation of AR-models.
B Linear Difference Equations

Linear difference equations play an important role in time series analysis. We therefore
summarize the most important results.¹ Consider the following linear difference
equation of order $p$ with constant coefficients. This equation is defined by the
recursion:

$$X_t = \phi_1 X_{t-1} + \dots + \phi_p X_{t-p}, \quad \phi_p \neq 0, \; t \in \mathbb{Z}.$$
$$c_1 X_t^{(1)} + \dots + c_m X_t^{(m)} = 0, \quad \text{for } t = 0, 1, \dots, p-1,$$
$$X_t = \phi_1 X_{t-1} + \dots + \phi_p X_{t-p}, \quad t = p, p+1, \dots$$
¹ For more detailed presentations see Agarwal (2000), Elaydi (2005), or Neusser (2009).
$$1 - \phi_1 z - \dots - \phi_p z^p = 0.$$

This equation is called the characteristic equation.² Thus $z$ must be a root of the
polynomial $\Phi(z) = 1 - \phi_1 z - \dots - \phi_p z^p$. From the fundamental theorem of algebra
we know that there are exactly $p$ roots in the field of complex numbers. Denote these
roots by $z_1, \dots, z_p$.
Suppose that these roots are all distinct. In this case
$\{\{z_1^{-t}\}, \dots, \{z_p^{-t}\}\}$ constitutes a set of $p$ linearly independent solutions. To show
this it is sufficient to verify that the determinant of the matrix

$$W = \begin{pmatrix}
1 & 1 & \dots & 1 \\
z_1^{-1} & z_2^{-1} & \dots & z_p^{-1} \\
z_1^{-2} & z_2^{-2} & \dots & z_p^{-2} \\
\vdots & \vdots & & \vdots \\
z_1^{-p+1} & z_2^{-p+1} & \dots & z_p^{-p+1}
\end{pmatrix}$$

is different from zero, which holds because $W$ is a Vandermonde-type matrix and the $z_j^{-1}$ are distinct. The general solution is then given by

$$X_t = c_1 z_1^{-t} + \dots + c_p z_p^{-t} \tag{B.1}$$

where the constants $c_1, \dots, c_p$ are determined from the starting values (initial
conditions).
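The general solution can be checked numerically. The sketch below uses the hypothetical difference equation $X_t = 0.7 X_{t-1} - 0.1 X_{t-2}$, whose characteristic polynomial $1 - 0.7z + 0.1z^2$ has the distinct roots $z_1 = 2$ and $z_2 = 5$:

```python
import numpy as np

# X_t = 0.7 X_{t-1} - 0.1 X_{t-2}; roots of Phi(z) = 1 - 0.7 z + 0.1 z^2.
phi1, phi2 = 0.7, -0.1
z1, z2 = sorted(np.roots([0.1, -0.7, 1.0]).real)   # z1 = 2, z2 = 5

# Fit the constants in X_t = c1 z1^{-t} + c2 z2^{-t} to the starting
# values X_0 = 1 and X_1 = 0.4 ...
A = np.array([[1.0, 1.0], [1.0 / z1, 1.0 / z2]])
c = np.linalg.solve(A, np.array([1.0, 0.4]))

def X(t):
    return c[0] * z1 ** (-t) + c[1] * z2 ** (-t)

# ... and verify that the solution satisfies the recursion for all t.
for t in range(2, 20):
    assert np.isclose(X(t), phi1 * X(t - 1) + phi2 * X(t - 2))
```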
In the case where some roots of the characteristic polynomial coincide, the
general solution becomes more involved. Let $z_1, \dots, z_r$, $r < p$, be the roots which

² Sometimes one finds $z^p - \phi_1 z^{p-1} - \dots - \phi_p = 0$ as the characteristic equation. The roots
of the two characteristic equations are then reciprocal to each other.
are different from each other and denote their corresponding multiplicities by
$m_1, \dots, m_r$. It holds that $\sum_{j=1}^{r} m_j = p$. The general solution is then given by

$$X_t = \sum_{j=1}^{r} \left(c_{j0} + c_{j1} t + \dots + c_{j,m_j - 1} t^{m_j - 1}\right) z_j^{-t} \tag{B.2}$$

where the constants $c_{ji}$ are again determined from the starting values (initial
conditions).
C Stochastic Convergence
This appendix presents the relevant concepts and theorems from probability theory.
The reader interested in more details should consult corresponding textbooks, for
example Billingsley (1986), Brockwell and Davis (1991), Hogg and Craig (1995),
or Kallenberg (2002) among many others.
In the following, all real random variables or random vectors $X$ are defined with
respect to some probability space $(\Omega, \mathcal{A}, \mathrm{P})$. Thereby, $\Omega$ denotes an arbitrary space
with $\sigma$-field $\mathcal{A}$ and probability measure $\mathrm{P}$. A random variable, respectively random
vector, $X$ is then defined as a measurable function from $\Omega$ to $\mathbb{R}$, respectively $\mathbb{R}^n$. The
probability space plays no role as it is introduced just for the sake of mathematical
rigor. The interest rather focuses on the distributions induced by $\mathrm{P} \circ X^{-1}$.
We will make use of the following important inequalities.
Theorem C.1 (Cauchy-Bunyakovskii-Schwarz Inequality). For any two random
variables $X$ and $Y$,

$$|\mathrm{E}(XY)| \leq \sqrt{\mathrm{E}X^2}\, \sqrt{\mathrm{E}Y^2}.$$

The equality holds if and only if $X = \dfrac{\mathrm{E}(XY)}{\mathrm{E}(Y^2)}\, Y$.
Theorem C.2 (Minkowski's Inequality). Let $X$ and $Y$ be two random variables with
$\mathrm{E}|X|^2 < \infty$ and $\mathrm{E}|Y|^2 < \infty$; then

$$\left(\mathrm{E}|X + Y|^2\right)^{1/2} \leq \left(\mathrm{E}|X|^2\right)^{1/2} + \left(\mathrm{E}|Y|^2\right)^{1/2}.$$
Theorem C.3 (Chebyschev's Inequality). If $\mathrm{E}|X|^r < \infty$ for some $r \geq 0$, then for
any $\varepsilon > 0$

$$\mathrm{P}(|X| \geq \varepsilon) \leq \frac{\mathrm{E}|X|^r}{\varepsilon^r}.$$

This fact is denoted by $X_t \xrightarrow{a.s.} X$ or $\lim X_t = X$ a.s.

This fact is denoted by $X_t \xrightarrow{p} X$ or $\operatorname{plim} X_t = X$.
Remark C.1. If $X$ and $\{X_t\}$ are real-valued random vectors, we replace the absolute
value in the definition above by the Euclidean norm $\|\cdot\|$. This is, however, equivalent
to saying that every component $X_{it}$ converges in probability to $X_i$, the $i$-th component
of $X$.
We denote this fact by $X_t \xrightarrow{r} X$. If $r = 1$, we say that the sequence converges
absolutely; and if $r = 2$, we say that the sequence converges in mean square, which
is denoted by $X_t \xrightarrow{m.s.} X$.
Remark C.2. In the case r D 2, the corresponding definition for random vectors is
where $C$ denotes the set of points at which $F_X(x)$ is continuous. We denote this fact
by $X_t \xrightarrow{d} X$.
Note that, in contrast to the previously mentioned modes of convergence,
convergence in distribution does not require that all random vectors are defined on
the same probability space. The convergence in distribution states that, for large
enough t, the distribution of Xt can be approximated by the distribution of X.
The following theorem relates the four convergence concepts.

Theorem C.7.
(i) If $X_t \xrightarrow{a.s.} X$ then $X_t \xrightarrow{p} X$.
(ii) If $X_t \xrightarrow{p} X$ then there exists a subsequence $\{X_{t_n}\}$ such that $X_{t_n} \xrightarrow{a.s.} X$.
(iii) If $X_t \xrightarrow{r} X$ then $X_t \xrightarrow{p} X$ by Chebyschev's inequality (Theorem C.3).
(iv) If $X_t \xrightarrow{p} X$ then $X_t \xrightarrow{d} X$.
(v) If $X$ is a fixed constant, then $X_t \xrightarrow{d} X$ implies $X_t \xrightarrow{p} X$. Thus, the two concepts
are equivalent under this assumption.
(i) $X_t + Y_t \xrightarrow{d} X + c$,
(ii) $Y_t' X_t \xrightarrow{d} c' X$,
(iii) $X_t / Y_t \xrightarrow{d} X / c$ if $c$ is a nonzero scalar.
have the same characteristic function, they have the same distribution. Moreover,
convergence in distribution is equivalent to convergence of the corresponding
characteristic functions.
Theorem C.11 (Convergence of Characteristic Functions, Lévy). Let $\{X_t\}$ be a
sequence of real random variables with corresponding characteristic functions $\varphi_{X_t}$;
then

$$X_t \xrightarrow{d} X \quad \text{if and only if} \quad \lim_{t \to \infty} \varphi_{X_t}(\lambda) = \varphi_X(\lambda) \quad \text{for all } \lambda \in \mathbb{R}^n.$$
$$\sigma_t^{-1}(X_t - \mu_t) \xrightarrow{d} X \sim \mathcal{N}(0, 1).$$

Note that the definition requires neither $\mu_t = \mathrm{E}X_t$ nor $\sigma_t^2 = \mathrm{V}(X_t)$.
Asymptotic normality is obtained if the $X_t$'s are identically and independently
distributed with constant mean and variance. In this case the Central Limit Theorem
(CLT) holds.
Theorem C.12 (Central Limit Theorem). Let $\{X_t\}$ be a sequence of identically
and independently distributed random variables with constant mean $\mu$ and constant
variance $\sigma^2$; then

$$\sqrt{T}\, \frac{\bar X_T - \mu}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1),$$

where $\bar X_T = T^{-1} \sum_{t=1}^{T} X_t$ is the arithmetic average.
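A small Monte Carlo experiment illustrates the theorem; the uniform distribution is just one convenient choice of an iid sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Standardized sample means of iid uniform draws (mu = 0.5, sigma^2 = 1/12)
# are approximately N(0, 1) for large T.
T, reps = 500, 2000
mu, sigma = 0.5, np.sqrt(1.0 / 12.0)
draws = rng.uniform(size=(reps, T))
stats = np.sqrt(T) * (draws.mean(axis=1) - mu) / sigma
# The empirical mean and variance of `stats` are close to 0 and 1.
```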
It is possible to relax the assumption of identically distributed variables in various
ways so that there exists a variety of CLTs in the literature. For our purpose it is
especially important to relax the independence assumption. A natural way to do this
is via the notion of m-dependence.

Definition C.7 (m-Dependence). A strictly stationary random process $\{X_t\}$ is called
m-dependent for some nonnegative integer $m$ if and only if the two sets of random
variables $\{X_\tau : \tau \leq t\}$ and $\{X_\tau : \tau \geq t + m + 1\}$ are independent.

Note that for such processes $\gamma(j) = 0$ for $|j| > m$. This type of dependence allows
us to prove the following generalized Central Limit Theorem (see Brockwell and Davis
1991).
Theorem C.13 (CLT for m-Dependent Processes). Let $\{X_t\}$ be a strictly stationary
mean-zero m-dependent process with autocovariance function $\gamma(h)$ such that $V_m =
\sum_{h=-m}^{m} \gamma(h) \neq 0$; then

$$\sqrt{T}\, \bar X_T \xrightarrow{d} \mathcal{N}(0, V_m).$$
(i) $X_t^{(m)} \xrightarrow{d} X^{(m)}$ as $t \to \infty$ for each $m = 1, 2, \dots$;
(ii) $X^{(m)} \xrightarrow{d} X$ as $m \to \infty$; and
(iii) $\lim_{m \to \infty} \limsup_{t \to \infty} \mathrm{P}\left[\left|X_t^{(m)} - X_t\right| > \varepsilon\right] = 0$ for every $\varepsilon > 0$.

Then

$$X_t \xrightarrow{d} X \quad \text{as } t \to \infty.$$
D Beveridge-Nelson Decomposition

$$\sum_{j=1}^{\infty} j^2 \|\Psi_j\|^2 < \infty \quad \text{implies} \quad \sum_{j=0}^{\infty} \big\|\widetilde\Psi_j\big\|^2 < \infty \;\text{ and }\; \|\Psi(1)\| < \infty.$$
Proof. The first part of the Theorem is obtained by the algebraic manipulations
below:

$$\begin{aligned}
\Psi(L) - \Psi(1) &= \left(I_n + \Psi_1 L + \Psi_2 L^2 + \dots\right) - \left(I_n + \Psi_1 + \Psi_2 + \dots\right) \\
&= \Psi_1 (L - I_n) + \Psi_2 (L^2 - I_n) + \Psi_3 (L^3 - I_n) + \dots \\
&= (L - I_n)\Psi_1 + (L - I_n)\Psi_2 (L + I_n) + (L - I_n)\Psi_3 (L^2 + L + I_n) + \dots \\
&= -(I_n - L)\Big(\underbrace{(\Psi_1 + \Psi_2 + \Psi_3 + \dots)}_{\widetilde\Psi_0}
+ \underbrace{(\Psi_2 + \Psi_3 + \dots)}_{\widetilde\Psi_1}\, L
+ \underbrace{(\Psi_3 + \dots)}_{\widetilde\Psi_2}\, L^2 + \dots\Big)
\end{aligned}$$
Taking any $\delta \in (1/2, 1)$, the second part of the Theorem follows from

$$\begin{aligned}
\sum_{j=0}^{\infty} \big\|\widetilde\Psi_j\big\|^2
&= \sum_{j=0}^{\infty} \Big\| \sum_{i=j+1}^{\infty} \Psi_i \Big\|^2
\leq \sum_{j=0}^{\infty} \Big( \sum_{i=j+1}^{\infty} \|\Psi_i\| \Big)^2
= \sum_{j=0}^{\infty} \Big( \sum_{i=j+1}^{\infty} i^{\delta} \|\Psi_i\|\, i^{-\delta} \Big)^2 \\
&\leq \sum_{j=0}^{\infty} \Big( \sum_{i=j+1}^{\infty} i^{2\delta} \|\Psi_i\|^2 \Big) \Big( \sum_{i=j+1}^{\infty} i^{-2\delta} \Big) \\
&\leq (2\delta - 1)^{-1} \sum_{i=1}^{\infty} \Big( \sum_{j=0}^{i-1} j^{1-2\delta} \Big) i^{2\delta} \|\Psi_i\|^2 \\
&\leq \big[(2\delta - 1)(2 - 2\delta)\big]^{-1} \sum_{j=1}^{\infty} j^{2\delta} \|\Psi_j\|^2\, j^{2-2\delta}
= \big[(2\delta - 1)(2 - 2\delta)\big]^{-1} \sum_{j=1}^{\infty} j^2 \|\Psi_j\|^2 < \infty.
\end{aligned}$$

The first inequality follows from the triangle inequality for the norm. The second
inequality is Hölder's inequality (see, for example, Naylor and Sell 1982, p. 548)
with $p = q = 2$. The third inequality follows from part (i) of the Lemma below after
exchanging the order of summation, and the fourth from part (ii).
The last inequality, finally, follows from the assumption.
Lemma. Let $b > 1$ and $0 < c < 1$. Then, for all integers $j, i \geq 1$:
(i) $\sum_{k=j+1}^{\infty} k^{-b} \leq (b-1)^{-1} j^{1-b}$;
(ii) $\sum_{k=1}^{i} k^{c-1} \leq c^{-1} i^{c}$.

Proof. Let $k$ be a number greater than $j$; then $k^{-b} \leq j^{-b}$ and

$$k^{-b} \leq \int_{k-1}^{k} x^{-b}\, dx = (b-1)^{-1} (k-1)^{1-b} - (b-1)^{-1} k^{1-b}.$$

This implies that $\sum_{k=j+1}^{\infty} k^{-b} \leq (b-1)^{-1} j^{1-b}$, which proves part (i) by changing the
summation index back from $k$ to $j$. Similarly, $k^{c-1} \leq (k-1)^{c-1}$ and

$$k^{c-1} \leq \int_{k-1}^{k} x^{c-1}\, dx = c^{-1} k^{c} - c^{-1} (k-1)^{c}.$$

Therefore $\sum_{k=1}^{i} k^{c-1} \leq c^{-1} i^{c}$, which proves part (ii) by changing the summation
index back from $k$ to $j$. $\square$
Remark D.1. An alternative common assumption is $\sum_{j=1}^{\infty} j \|\Psi_j\| < \infty$. It is,
however, easy to see that this assumption is more restrictive as it implies the one
assumed in the Theorem, but not vice versa. See Phillips and Solo (1992) for more
details.
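The decomposition $\Psi(L) = \Psi(1) - (1 - L)\widetilde\Psi(L)$ can be verified numerically in the scalar case. The sketch below uses the hypothetical sequence $\psi_j = \phi^j$, truncated at a large $J$:

```python
import numpy as np

# Scalar illustration with psi_j = phi^j: Psi(1) = 1/(1 - phi) and the
# tail sums are psi~_j = sum_{i > j} phi^i = phi^{j+1}/(1 - phi).
phi, J = 0.6, 200
psi = phi ** np.arange(J)                  # psi_0, ..., psi_{J-1}
psi1 = psi.sum()                           # ~ 1/(1 - phi) = 2.5
tails = psi1 - np.cumsum(psi)              # truncated tail sums psi~_j

# Evaluate both sides of the identity at some |z| < 1.
z = 0.9
lhs = np.polyval(psi[::-1], z)             # Psi(z) = sum_j psi_j z^j
rhs = psi1 - (1.0 - z) * np.polyval(tails[::-1], z)
```

`np.polyval` expects coefficients in order of decreasing degree, hence the reversal; the two sides agree up to floating-point precision.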
E The Delta Method

It is often the case that it is possible to obtain an estimate $\hat\beta_T$ of some parameter
$\beta$, but that one is really interested in a function $f$ of $\beta$. The Continuous Mapping
Theorem then suggests estimating $f(\beta)$ by $f(\hat\beta_T)$. But then the question arises how
the distribution of $\hat\beta_T$ is related to the distribution of $f(\hat\beta_T)$.

Expanding the function into a first-order Taylor approximation allows us to derive
the following theorem.
Theorem E.1. Let $\{\hat\beta_T\}$ be a $K$-dimensional sequence of random variables with the
property $\sqrt{T}(\hat\beta_T - \beta) \xrightarrow{d} \mathcal{N}(0, \Sigma)$; then

$$\sqrt{T}\left(f(\hat\beta_T) - f(\beta)\right) \xrightarrow{d} \mathcal{N}\left(0,\; \nabla f(\beta)\, \Sigma\, \nabla f(\beta)'\right),$$
Remark E.2. The $J \times K$ Jacobian matrix of first-order partial derivatives is defined as

$$\nabla f(\beta) = \partial f(\beta)/\partial \beta' = \begin{pmatrix}
\frac{\partial f_1(\beta)}{\partial \beta_1} & \dots & \frac{\partial f_1(\beta)}{\partial \beta_K} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_J(\beta)}{\partial \beta_1} & \dots & \frac{\partial f_J(\beta)}{\partial \beta_K}
\end{pmatrix}.$$
Remark E.3. In most applications $\beta$ is not known, so that one evaluates the Jacobian
matrix at $\hat\beta_T$.
Example: Univariate
Example: Multivariate
In the process of computing the impulse response function of a VAR(1) model with
$\Phi = \begin{pmatrix} \phi_{11} & \phi_{12} \\ \phi_{21} & \phi_{22} \end{pmatrix}$ one has to calculate $\Psi_2 = \Phi^2$. If we stack all coefficients of $\Phi$ into
a vector $\beta = \operatorname{vec}(\Phi) = (\phi_{11}, \phi_{21}, \phi_{12}, \phi_{22})'$ then we get:

$$f(\beta) = \operatorname{vec} \Psi_2 = \operatorname{vec} \Phi^2 = \begin{pmatrix} \psi_{11}^{(2)} \\ \psi_{21}^{(2)} \\ \psi_{12}^{(2)} \\ \psi_{22}^{(2)} \end{pmatrix}
= \begin{pmatrix} \phi_{11}^2 + \phi_{12}\phi_{21} \\ \phi_{11}\phi_{21} + \phi_{21}\phi_{22} \\ \phi_{11}\phi_{12} + \phi_{12}\phi_{22} \\ \phi_{12}\phi_{21} + \phi_{22}^2 \end{pmatrix},$$

where $\Psi_2 = \left[\psi_{ij}^{(2)}\right]$. The Jacobian matrix then becomes:

$$\nabla f(\beta) = \begin{pmatrix}
2\phi_{11} & \phi_{12} & \phi_{21} & 0 \\
\phi_{21} & \phi_{11} + \phi_{22} & 0 & \phi_{21} \\
\phi_{12} & 0 & \phi_{11} + \phi_{22} & \phi_{12} \\
0 & \phi_{12} & \phi_{21} & 2\phi_{22}
\end{pmatrix}.$$
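The Jacobian can be double-checked numerically. For $f(\beta) = \operatorname{vec}(\Phi^2)$, matrix calculus gives $\nabla f(\beta) = \Phi' \otimes I_n + I_n \otimes \Phi$; the sketch below (with hypothetical coefficient values) compares this with a central-difference approximation:

```python
import numpy as np

# Hypothetical VAR(1) coefficient matrix.
Phi = np.array([[0.3, 0.6],
                [0.2, 1.1]])

def f(b):
    M = np.reshape(b, (2, 2), order="F")   # undo vec (column-major)
    return (M @ M).flatten(order="F")      # vec(Phi^2)

beta = Phi.flatten(order="F")

# Analytic Jacobian: d vec(Phi^2) / d(vec Phi)' = Phi' kron I + I kron Phi.
J_analytic = np.kron(Phi.T, np.eye(2)) + np.kron(np.eye(2), Phi)

# Central-difference numerical Jacobian, column by column.
h = 1e-6
J_numeric = np.column_stack([(f(beta + h * e) - f(beta - h * e)) / (2 * h)
                             for e in np.eye(4)])
```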
In Section 15.4.4 we obtained the following estimate for a VAR(1) model for
$\{X_t\} = \{(\ln(A_t), \ln(S_t))'\}$:

$$X_t = \hat c + \hat\Phi X_{t-1} + \hat Z_t
= \begin{pmatrix} 0.141 \\ 0.499 \end{pmatrix}
+ \begin{pmatrix} 0.316 & 0.640 \\ 0.202 & 1.117 \end{pmatrix} X_{t-1} + \hat Z_t.$$

We can then approximate the variance of $f(\hat\beta) = \operatorname{vec}(\hat\Phi^2)$ by

$$\hat{\mathrm{V}}\big(f(\operatorname{vec}\hat\Phi)\big) = \hat{\mathrm{V}}\big(\operatorname{vec}\hat\Phi^2\big)
= \nabla f(\operatorname{vec}\Phi)\big|_{\Phi = \hat\Phi}\; \hat{\mathrm{V}}\big(\operatorname{vec}\hat\Phi\big)\; \nabla f(\operatorname{vec}\Phi)'\big|_{\Phi = \hat\Phi}.$$
This leads to:

$$\hat{\mathrm{V}}\big(f(\operatorname{vec}\hat\Phi)\big) = \begin{pmatrix}
0.0245 & 0.0121 & 0.0245 & 0.0119 \\
0.0121 & 0.0145 & 0.0122 & 0.0144 \\
0.0245 & 0.0122 & 0.0382 & 0.0181 \\
0.0119 & 0.0144 & 0.0181 & 0.0213
\end{pmatrix}.$$
Bibliography
Abraham B, Ledolter J (1983) Statistical methods for forecasting. Wiley, New York
Adelman I, Adelman FL (1959) The dynamic properties of the Klein-Goldberger model. Econo-
metrica 27:596–625
Agarwal RP (2000) Difference equations and inequalities, 2nd edn. Marcel Dekker, New York
Akaike H (1969) Fitting autoregressive models for prediction. Ann Inst Stat Math 21:243–247
Amemiya T (1994) Introduction to statistics and econometrics. Harvard University Press,
Cambridge
An S, Schorfheide F (2007) Bayesian analysis of DSGE models. Econ Rev 26:113–172
Anderson BDO, Moore JB (1979) Optimal filtering. Electrical Engineering Series. Prentice-Hall,
Englewood Cliffs
Andrews DWK (1991) Heteroskedasticity and autocorrelation consistent covariance matrix esti-
mation. Econometrica 59:817–858
Andrews DWK (1993) Tests for parameter instability and structural change with unknown change
point. Econometrica 61:821–856
Andrews DWK (2003) Tests for parameter instability and structural change with unknown change
point: A corrigendum. Econometrica 71:395–397
Andrews DWK, Monahan JC (1992) An improved heteroskedasticity and autocorrelation consis-
tent covariance matrix estimator. Econometrica 60:953–966
Andrews DWK, Ploberger W (1994) Optimal tests when a nuisance parameter is present only
under the alternative. Econometrica 62:1383–1414
Arias JE, Rubio-Ramírez J, Waggoner DF (2014) Inference based on SVARs identified with sign
and zero restrictions: Theory and applications. FRB Atlanta Working Paper 2014–1, Federal
Reserve Bank of Atlanta
Ashley R, Granger CWJ, Schmalensee R (1980) Advertising and aggregate consumption: An
analysis of causality. Econometrica 48(5):1149–1168
Aue A, Horváth L (2011) Structural breaks in time series. J Time Ser Anal 34:1–16
Bai J (2000) Vector autoregressive models with structural changes in regression coefficients and in
variance-covariance matrices. Ann Econ Finance 1:303–339
Bai J, Lumsdaine RL, Stock JH (1998) Testing for and dating common breaks in multivariate time
series. Rev Econ Stud 65:395–432
Baker A (2002) Matrix groups – an introduction to Lie Group theory. Springer, London
Banbura M, Giannone D, Reichlin L (2010) Large Bayesian vector autoregressions. J Appl Econ
25:71–92
Banerjee A, Dolado J, Galbraith JW, Hendry DF (1993) Co-integration, error-correction, and the
econometric analysis of non-stationary data. Oxford University Press, Oxford
Barsky R, Sims E (2011) News shocks and business cycles. J Monet Econ 58:273–289
Bauer D, Wagner M (2003) A canonical form for unit root processes in the state space framework.
Diskussionsschrift 03-12, Volkswirtschaftliches Institut, Universität Bern
Baumeister C, Hamilton JD (2015) Sign restrictions, structural vector autoregressions, and useful
prior information. Econometrica 83:1963–1999
Beaudry P, Portier F (2006) Stock prices, news and economic fluctuations. Am Econ Rev
96(4):1293–1307
Berman A, Plemmons RJ (1994) Nonnegative matrices in the mathematical sciences. No. 9 in
Classics in Applied Mathematics, Society of Industrial and Applied Mathematics, Philadelphia
Bernanke BS (1986) Alternative explanations of money-income correlation. In: Brunner K,
Meltzer A (eds) Real business cycles, real exchange rates, and actual policies, no. 25
in Carnegie-Rochester Conference Series on Public Policy. North-Holland, Amsterdam,
pp 49–100
Bernanke BS, Gertler M, Watson MW (1997) Systematic monetary policy and the effects of oil
price shocks. Brook Pap Econ Act 1997(1):91–142
Berndt ER (1991) The practice of econometrics. Addison Wesley, Reading
Bhansali RJ (1999) Parameter estimation and model selection for multistep prediction of a time
series: A review. In: Gosh S (ed) Asymptotics, Nonparametrics, and Time Series. Marcel
Dekker, New York, pp 201–225
Billingsley P (1986) Probability and measure, 2nd edn. Wiley, New York
Blanchard OJ (1989) A traditional interpretation of macroeconomic fluctuations. Am Econ Rev
79:1146–1164
Blanchard OJ, Quah D (1989) The dynamic effects of aggregate demand and supply disturbances.
Am Econ Rev 79:655–673
Blanchard OJ, Watson MW (1986) Are business cycles all alike? In: Gordon R (ed) The American
business cycle: continuity and change. University of Chicago Press, Chicago, pp 123–179
Bollerslev T (1986) Generalized autoregressive conditional heteroskedasticity. J Econ 31:307–327
Bollerslev T (1988) On the correlation structure for the generalized autoregressive conditional
heteroskedastic process. J Time Ser Anal 9:121–131
Bollerslev T, Engle RF, Nelson DB (1994) ARCH models. In: Engle RF, McFadden DL (eds)
Handbook of econometrics, vol IV. Elsevier Science B.V., Amsterdam, pp 2959–3038
Bougerol P, Picard N (1992a) Stationarity of GARCH processes and some nonnegative time series.
J Econ 52:115–127
Bougerol P, Picard N (1992b) Strict stationarity of generalized autoregressive processes. Ann
Probab 20:1714–1730
Box GEP, Jenkins GM (1976) Time series analysis: forecasting and control, revised edn. Holden-
Day, San Francisco
Brandner P, Neusser K (1992) Business cycles in open economies: Stylized facts for Austria and
Germany. Weltwirtschaftliches Arch 128:67–87
Brandt A (1986) The stochastic equation ynC1 D an yn C bn with stationary coefficients. Adv Appl
Probab 18:211–220
Bräuning F, Koopman SJ (2014) Forecasting macroeconomic variables using collapsed dynamic
factor analysis. Int J Forecast 30:572–584
Breitung J, Eickmeier S (2006) Dynamic factor models. In: Hübler O, Frohn J (eds) Modern
econometric analysis, chap 3. Springer, Berlin, pp 25–40
Brockwell PJ, Davis RA (1991) Time series: theory and methods, 2nd edn. Springer, New York
Brockwell PJ, Davis RA (1996) Introduction to time series and forecasting. Springer, New York
Brualdi RA, Shader BL (1995) Matrices of sign-solvable linear systems. No. 116 in Cambridge
tracts in mathematics. Cambridge University Press, Cambridge
Burren D, Neusser K (2013) The role of sectoral shifts in the decline of real GDP volatility.
Macroecon Dyn 17:477–500
Campbell JY (1987) Does saving anticipate declining labor income? An alternative test of the
permanent income hypothesis. Econometrica 55:1249–1273
Campbell JY, Mankiw NG (1987) Are output fluctuations transitory? Q J Econ 102:857–880
Campbell JY, Perron P (1991) Pitfalls and opportunities: What macroeconomists should know
about unit roots. In: Blanchard OJ, Fischer S (eds) Macroeconomics annual 1991, vol 6. MIT
Press, Cambridge, pp 141–201
Campbell JY, Shiller RJ (1987) Cointegration and tests of present value models. J Polit Econ
95:1062–1088
Campbell JY, Lo AW, MacKinlay AC (1997) The econometrics of financial markets. Princeton
University Press, Princeton
Canova F (2007) Methods for applied macroeconomic research. Princeton University Press,
Princeton
Canova F, Ciccarelli M (2008) Estimating multi-country VAR models. Working Paper 603,
European Central Bank
Canova F, De Nicoló G (2002) Monetary disturbances matter for business fluctuations in the G–7.
J Monet Econ 49:1131–1159
Cavaliere G, Rahbek A, Taylor AMR (2012) Bootstrap determination of the co-integration rank in
vector autoregressive models. Econometrica 80:1721–1740
Chan JCC, Jeliazkov I (2009) Efficient simulation and integrated likelihood estimation in state
space models. Int J Math Model Numer Optim 1:101–120
Chari VV, Kehoe PJ, McGrattan ER (2008) Are structural VARs with long-run restrictions useful
in developing business cycle theory? J Monet Econ 55:1337–1352
Chow GC, Lin A (1971) Best linear unbiased interpolation, distribution, and extrapolation of time
series by related series. Rev Econ Stat 53:372–375
Christiano LJ, Eichenbaum M (1990) Unit roots in real GNP: Do we know, and do we care? Carn-Roch
Conf Ser Public Pol 32:7–62
Christiano LJ, Eichenbaum M, Evans CL (1999) Monetary policy shocks: what have we learned
and to what end? Handbook of macroeconomics, vol 1A, chap 2. North-Holland, Amsterdam,
pp 65–148
Christiano LJ, Eichenbaum M, Vigfusson RJ (2003) What happens after a technology shock?
Working Paper No. 9819, NBER
Christiano LJ, Eichenbaum M, Vigfusson RJ (2006) Assessing structural VARs. International
Finance Discussion Papers No. 866, Board of Governors of the Federal Reserve System
Christoffersen PF (1998) Evaluating interval forecasts. Int Econ Rev 39:841–862
Clements MP, Hendry DF (1996) Intercept corrections and structural change. J Appl Econ
11:475–494
Clements MP, Hendry DF (2006) Forecasting with breaks. In: Handbook of economic forecasting,
vol 1, Elsevier, Amsterdam, pp 605–657
Cochrane JH (1988) How big is the random walk in GNP? J Polit Econ 96(5):893–920
Cogley T, Sargent TJ (2001) Evolving post-world war II U.S. inflation dynamics. In: Bernanke BS,
Rogoff K (eds) NBER macroeconomics annual, vol 16. MIT Press, Cambridge, pp 331–373
Cogley T, Sargent TJ (2005) Drift and volatilities: Monetary policies and outcomes in the post
WWII U.S. Rev Econ Dyn 8:262–302
Colonius F, Kliemann W (2014) Dynamical systems and linear algebra. Graduate studies in
mathematics, vol 158. American Mathematical Society, Providence
Cooley TF, LeRoy SF (1985) Atheoretical macroeconometrics - a critique. J Monet Econ
16:283–308
Cooley TF, Prescott EC (1973) Varying parameter regression. A theory and some applications.
Ann Econ Soc Meas 2:463–474
Cooley TF, Prescott EC (1976) Estimation in the presence of stochastic parameter variation.
Econometrica 44:167–184
Corradi V, Swanson NR (2006) Predictive density evaluation. In: Handbook of economic forecast-
ing, vol 1, Elsevier, Amsterdam, pp 197–284
Cuche NA, Hess MA (2000) Estimating monthly GDP in a general Kalman filter framework:
Evidence from Switzerland. Econ Financ Model 7:153–194
Davidson JEH, Hendry DF, Srba F, Yeo S (1978) Econometric modelling of the aggregate time-
series relationship between consumers’ expenditure and income in the United Kingdom. Econ
J 88:661–692
Filardo AJ (1994) Business-cycle phases and their transitional dynamics. J Bus Econ Stat
12:299–308
Filardo AJ, Gordon SF (1998) Business cycle duration. J Econ 85:99–123
Francis N, Owyang MT, Roush JE, DiCecio R (2014) A flexible finite-horizon alternative to long-
run restrictions with an application to technology shocks. Rev Econ Stat 96:638–647
Friedman M, Schwartz AJ (1963) A monetary history of the United States, 1867–1960. Princeton
University Press, Princeton
Frisch R (1933) Propagation problems and impulse problems in dynamic economics. In: Economic
essays in honour of Gustav Cassel. Frank Cass, London, pp 171–205
Frühwirt-Schnatter S (2006) Finite mixture and Markov switching models. Springer Science +
Business Media LLC, New York
Fry RA, Pagan AR (2011) Sign restrictions in structural vector autoregressions: A critical review.
J Econ Lit 49:938–960
Fuller WA (1976) Introduction to statistical time series. Wiley, New York
Galí J (1992) How well does the IS-LM model fit postwar U.S. data? Q J Econ 107(2):709–738
Galí J (1999) Technology, employment, and the business cycle: Do technology shocks explain
aggregate fluctuations? Am Econ Rev 89:249–271
Geweke JF (1984) Inference and causality in economic time series models. In: Griliches Z,
Intriligator MD (eds) Handbook of econometrics, vol II. Elsevier, Amsterdam, pp 1101–1144
Geweke JF (2005) Contemporary bayesian econometrics and statistics. Wiley series in probability
and statistics. Wiley, New York
Ghysels E, Osborn DR (2001) The econometric analysis of seasonal time series. Cambridge
University Press, Cambridge
Giannini C (1991) Topics in structural VAR econometrics. Quaderni di Ricerca 21, Università degli
Studi di Ancona, Dipartimento di Economia
Giraitis L, Kokoszka P, Leipus R (2000) Stationary ARCH models: Dependence structure and
central limit theorem. Econ Theory 16:3–22
Glosten LR, Jagannathan R, Runkle DE (1993) On the relation between expected value and the
volatility of the nominal excess returns on stocks. J Finance 48:1779–1801
Gohberg I, Lancaster P, Rodman L (1982) Matrix polynomials. Academic Press, New York
Goldfeld SM, Quandt RE (1973) A Markov model for switching regressions. J Econ 1:3–15
Goldfeld SM, Quandt RE (1976) Studies in nonlinear estimation. Ballinger Publishing, Cambridge
Gómez V, Maravall A (1996) Programs TRAMO and SEATS. Instructions for the user (with some
updates). Working Paper 9628, Servicio de Estudios, Banco de España
Gonzalo J, Ng S (2001) A systematic framework for analyzing the dynamic effects of permanent
and transitory shocks. J Econ Dyn Control 25:1527–1546
Gospodinov N (2010) Inference in nearly nonstationary SVAR models with long-run identifying
restrictions. J Bus Econ Stat 28:1–12
Gouriéroux C (1997) ARCH models and financial applications. Springer, New York
Gouriéroux C, Jasiak J, Sufana R (2009) The Wishart autoregressive process of multivariate
stochastic volatility. J Econ 150:167–181
Granger CWJ (1964) Spectral analysis of economic time series. Princeton University Press,
Princeton
Granger CWJ (1966) The typical spectral shape of an economic variable. Econometrica
34:150–161
Granger CWJ (1969) Investigating causal relations by econometric models and cross-spectral
methods. Econometrica 37(3):424–438
Granger CWJ, Newbold P (1974) Spurious regressions in econometrics. J Econ 2:111–120
Granger CWJ, Teräsvirta T (1993) Modelling nonlinear economic relationships. Oxford University
Press, Oxford
Greene WH (2008) Econometric analysis, 6th edn. Prentice Hall, New Jersey
Haan WJ, Levin AT (1997) A practitioner’s guide to robust covariance matrix estimation.
In: Maddala GS, Rao CR (eds) Handbook of statistics: robust inference, vol 15. Elsevier,
New York, pp 299–342
Hall FJ, Li Z (2014) Sign pattern matrices. In: Hogben L (ed) Handbook of linear algebra, 2nd edn,
chap 42. Chapman & Hall/CRC, Boca Raton, pp 1–32
Hall P, Yao Q (2003) Inference in ARCH and GARCH models with heavy-tailed errors.
Econometrica 71(1):285–317
Hall RE (1978) Stochastic implications of the life cycle-permanent income hypothesis: Theory and
evidence. J Polit Econ 86:971–987
Hamilton JD (1994a) State-Space models. In: Engle RF, McFadden DL (eds) Handbook of
econometrics, vol 4, chap 50. Elsevier, Amsterdam, pp 3039–3080
Hamilton JD (1994b) Time series analysis. Princeton University Press, Princeton
Hamilton JD (1996) Specification testing in Markov-switching time-series models. J Econ
70:127–157
Hannan EJ, Deistler M (1988) The statistical theory of linear systems. Wiley, New York
Hansen BE (1992) Efficient estimation and testing of cointegrating vectors in the presence of
deterministic trends. J Econ 53:321–335
Hansen LP, Sargent TJ (1980) Formulating and estimating dynamic linear rational expectations
models. J Econ Dyn Control 2:7–46
Hansen LP, Sargent TJ (1991) Two difficulties in interpreting vector autoregressions. In: Hansen
LP, Sargent TJ (eds) Rational expectations econometrics, underground classics in economics.
Westview, Boulder, Colorado, pp 77–119
Hansen LP, Sargent TJ (1993) Seasonality and approximation errors in rational expectations
models. J Econ 55:21–55
Harvey AC (1989) Forecasting, structural time series models and the Kalman filter. Cambridge
University Press, Cambridge
Harvey AC, Jaeger A (1993) Detrending, stylized facts and the business cycle. J Appl Econ
8:231–247
Harvey AC, Phillips GD (1982) Estimation of regression models with time varying parameters.
In: Deistler M, Fürst E, Schödiauer G (eds) Games, economic dynamics and time series analysis.
Physica-Verlag, Wien-Würzburg, pp 306–321
Harvey AC, Pierce RG (1984) Estimating missing observations in economic time series. J Am Stat
Assoc 79:125–131
Haugh LD (1976) Checking the independence of two covariance stationary time series: A univari-
ate residual cross-correlation approach. J Am Stat Assoc 71:378–385
Hauser MA, Pötscher BM, Reschenhofer E (1999) Measuring persistence in aggregate output:
ARMA models, fractionally integrated ARMA models and nonparametric procedures. Empir
Econ 24:243–269
Hildreth C, Houck JP (1968) Some estimators for a linear model with random coefficients. J Am
Stat Assoc 63:584–595
Hodrick RJ, Prescott EC (1980) Post-war U.S. business cycles: An empirical investigation.
Discussion Paper 451, Carnegie-Mellon University, Pittsburgh
Hogg RV, Craig AT (1995) Introduction to mathematical statistics, 5th edn. Prentice-Hall, Upper
Saddle River
Hong EP (1991) The autocorrelation structure for the GARCH-M process. Econ Lett 37:129–132
Howrey EP (1968) A spectrum analysis of the long swing hypothesis. Int Econ Rev 9:228–252
Hylleberg S (1986) Seasonality in regression. Academic Press, Orlando, FL
Hylleberg S, Engle RF, Granger CWJ, Yoo S (1990) Seasonal integration and cointegration. J Econ
44:215–238
Inoue A, Kilian L (2013) Inference on impulse response functions in structural VAR models. J Econ
177:1–13
Jensen ST, Rahbek A (2004) Asymptotic normality for non-stationary, explosive GARCH. Econ
Theory 20(6):1203–1226
Johansen S (1988) Statistical analysis of cointegration vectors. J Econ Dyn Control 12:231–254
Johansen S (1991) Estimation and hypothesis testing of cointegration vectors in Gaussian vector
autoregressive models. Econometrica 59:1551–1580
Naylor AW, Sell GR (1982) Linear operator theory in engineering and science, applied mathematical sciences, vol 40. Springer, New York
Negro Md, Primiceri GE (2015) Time varying structural vector autoregressions and monetary policy: A corrigendum. Rev Econ Stud, forthcoming
Negro Md, Schorfheide F (2004) Priors from general equilibrium models for VARs. Int Econ Rev
45:643–673
Nelson CR, Plosser CI (1982) Trends and random walks in macro-economic time series: Some
evidence and implications. J Monet Econ 10:139–162
Nelson DB (1990) Stationarity and persistence in the GARCH(1,1) model. Econ Theory 6:318–334
Nelson DB (1991) Conditional heteroskedasticity in asset returns: A new approach. Econometrica
59:347–370
Neusser K (1991) Testing the long-run implications of the neoclassical growth model. J Monet
Econ 27:3–37
Neusser K (2000) An algebraic interpretation of cointegration. Econ Lett 67:273–281
Neusser K (2009) Difference equations for economists. http://www.neusser.ch/downloads/
DifferenceEquations.pdf
Neusser K (2016) A topological view on the identification of structural vector autoregressions.
Econ Lett, forthcoming
Neusser K, Kugler M (1998) Manufacturing growth and financial development: Evidence from OECD countries. Rev Econ Stat 80:638–646
Newey WK, West KD (1994) Automatic lag selection in covariance matrix estimation. Rev Econ
Stud 61:631–653
Ng S, Perron P (1995) Unit root tests in ARMA models with data dependent methods for the
selection of the truncation lag. J Am Stat Assoc 90:268–281
Nicholls DF, Pagan AR (1984) Estimating predictions, prediction errors and their standard
deviations using constructed variables. J Econ 24:293–310
Norris JR (1998) Markov chains. Cambridge series in statistical and probabilistic mathematics.
Cambridge University Press, Cambridge
Ogaki M (1992) An introduction to the generalized method of moments. Working Paper No. 314,
University of Rochester
Orcutt GH, Winokur HS Jr (1969) First order autoregression inference, estimation, and prediction.
Econometrica 37:1–14
Osterwald-Lenum M (1992) A note with quantiles of the asymptotic distribution of the maximum
likelihood cointegration rank test statistics. Oxford Bull Econ Stat 54:461–471
Pagan AR (1984) Econometric issues in the analysis of regressions with generated regressors. Int
Econ Rev 25:183–209
Pagan AR, Robertson JC (1998) Structural models of the liquidity effect. Rev Econ Stat
80:202–217
Perron P (1989) The great crash, the oil price shock, and the unit root hypothesis. Econometrica
57:1361–1401
Perron P (2006) Dealing with structural breaks. In: Hassani H, Mills TC, Patterson K (eds) Pal-
grave handbook in econometrics, vol 1. Econometric theory. Palgrave Macmillan, Hampshire,
pp 278–352
Phillips PC (1987) Time series regression with a unit root. Econometrica 55:277–301
Phillips PCB (1986) Understanding spurious regressions in econometrics. J Econ 33:311–340
Phillips PCB (1991) Optimal inference in cointegrating systems. Econometrica 59:283–306
Phillips PCB (2004) HAC estimation by automated regression. Cowles Foundation Discussion
Paper 1470, Yale University
Phillips PCB, Hansen BE (1990) Statistical inference in instrumental variables regression with
I(1) processes. Rev Econ Stud 57:99–125
Phillips PCB, Ouliaris S (1990) Asymptotic properties of residual based tests of cointegration.
Econometrica 58(1):165–193
Phillips PCB, Perron P (1988) Testing for a unit root in time series regression. Biometrika
75:335–346
Phillips PCB, Solo V (1992) Asymptotics for linear processes. Ann Stat 20:971–1001
Phillips PCB, Sul D (2007) Some empirics on economic growth under heterogeneous technology.
J Macroecon 29:455–469
Pierce DA, Haugh LD (1977) Causality in temporal systems - characterization and survey. J Econ
5:265–293
Potter SM (2000) Nonlinear impulse response functions. J Econ Dyn Control 24:1425–1446
Press H, Tukey JW (1956) Power spectral methods of analysis and their application to problems in
airplane dynamics. In: Flight Test Manual, NATO Advisory Group for Aeronautical Research
and Development, pp 1–41
Priestley MB (1981) Spectral analysis and time series, vol 1&2. Academic Press, London
Primiceri GE (2005) Time varying structural vector autoregressions and monetary policy. Rev Econ
Stud 72:821–852
Quah D (1990) Permanent and transitory movements in labor income: An explanation for ‘excess smoothness’ in consumption. J Polit Econ 98:449–475
Quah D, Sargent TJ (1993) A dynamic index model for large cross sections. In: Stock JH, Watson MW (eds) Business cycles, indicators, and forecasting, chap 7. University of Chicago Press, Chicago
Quandt RE (1960) Tests of the hypothesis that a linear regression system obeys two separate
regimes. J Am Stat Assoc 55:324–330
Reichlin L (2003) Factor models in large cross sections of time series. In: Dewatripont M, Hansen LP, Turnovsky SJ (eds) Advances in econometrics, theory and applications, econometric society monographs, vol III. Cambridge University Press, Cambridge, pp 47–86
Reinsel GC (1993) Elements of multivariate time series analysis. Springer Series in statistics.
Springer, New York
Rigobon R (2003) Identification through heteroskedasticity. Rev Econ Stat 85:777–792
Robinson EA (1982) A historical perspective of spectrum estimation. Proc IEEE 70:885–907
Rosenblatt M (2000) Gaussian and Non-Gaussian linear time series and random fields. Springer,
New York
Rothenberg TJ (1971) Identification of parametric models. Econometrica 39:577–591
Rubio-Ramírez JF, Waggoner DF, Zha T (2010) Structural vector autoregressions: Theory of
identification and algorithms for inference. Rev Econ Stud 77:665–696
Rudin W (1976) Principles of mathematical analysis, 3rd edn. McGraw-Hill, New York
Rudin W (1987) Real and complex analysis, 3rd edn. McGraw-Hill, Boston
Runkle D (1987) Vector autoregressions and reality. J Bus Econ Stat 5:437–442
Said SE, Dickey DA (1984) Testing for unit roots in autoregressive-moving average models of
unknown order. Biometrika 71:599–607
Samuelson PA (1947) Foundations of economic analysis. Harvard University Press, Cambridge
Samuelson PA (1965) Proof that properly anticipated prices fluctuate randomly. Ind Manage Rev
6:41–49
Sargent TJ (1987) Macroeconomic theory, 2nd edn. Academic Press, Orlando, FL
Sargent TJ (1989) Two models of measurements and the investment accelerator. J Polit Econ
97:251–287
Sargent TJ (2004) Recursive macroeconomic theory, 2nd edn. MIT Press, Cambridge
Sargent TJ, Sims CA (1977) Business cycle modelling without pretending to have too much a
priori economic theory. In: Sims CA (ed) New Methods in business cycle research. Federal
Reserve Bank of Minneapolis, Minneapolis, pp 45–109
Schorfheide F (2005) VAR forecasting under misspecification. J Econ 128:99–136
Serfling RJ (1980) Approximation theorems of mathematical statistics. Wiley, New York
Shaman P, Stine RA (1988) The bias of autoregressive coefficient estimators. J Am Stat Assoc
83:842–848
Silverman BW (1986) Density estimation. Chapman and Hall, London
Sims CA (1972) Money, income, and causality. Am Econ Rev 62:540–552
Sims CA (1974) Seasonality in regression. J Am Stat Assoc 69:618–626
Sims CA (1980a) Comparison of interwar and postwar business cycles: Monetarism reconsidered.
Am Econ Rev 70(2):250–257
Sims CA (1980b) Macroeconomics and reality. Econometrica 48:1–45
Sims CA (1986) Are forecasting models usable for policy analysis? Federal Reserve Bank of
Minneapolis Q Rev 10(1):2–16
Sims CA (1993) Rational expectations modeling with seasonally adjusted data. J Econ 55:9–19
Sims CA (1999) Error bands for impulse responses. Econometrica 67:1113–1155
Sims CA, Stock JH, Watson MW (1990) Inference in linear time series with some unit roots.
Econometrica 58:113–144
Slutzky E (1937) The summation of random causes as the source of cyclic processes. Econometrica
5:105–146
Stock JH (1994) Unit roots, structural breaks and trends. In: Engle RF, McFadden DL (eds)
Handbook of econometrics, vol IV. Elsevier Science B.V., Amsterdam, pp 2739–2841
Stock JH, Watson MW (1988a) Testing for common trends. J Am Stat Assoc 83:1097–1107
Stock JH, Watson MW (1988b) Variable trends in economic time series. J Econ Perspect
2(3):147–174
Stock JH, Watson MW (2011) Introduction to econometrics, 3rd edn. Addison Wesley Longman
Strang G (1988) Linear algebra and its applications, 3rd edn. Harcourt Brace Jovanovich, San
Diego
Sul D, Phillips PCB, Choi CY (2005) Prewhitening bias in HAC estimation. Oxford Bull Econ Stat
67:517–546
Tay AS, Wallis KF (2000) Density forecasting: A survey. J Forecast 19:235–254
Tinbergen J (1939) Statistical testing of business cycle theories. League of Nations, Geneva
Tjøstheim D, Paulsen J (1983) Bias of some commonly-used time series estimates. Biometrika
70:389–399; corrigendum (1984), 71:656
Tobin J (1970) Money and income: Post hoc ergo propter hoc? Q J Econ 84:310–317
Uhlig H (2004) Do technology shocks lead to a fall in total hours worked? J Eur Econ Assoc
2:361–371
Uhlig H (2005) What are the effects of monetary policy on output? Results from an agnostic
identification procedure. J Monet Econ 52:381–419
Uhlig H, Ravn M (2002) On adjusting the HP-filter for the frequency of observations. Rev Econ
Stat 84:371–376
Vogelsang TJ (1997) Wald-type tests for detecting breaks in the trend function of a dynamic time
series. Econ Theory 13:818–849
Watson MW (1994) Vector autoregressions and cointegration. In: Engle RF, McFadden DL (eds) Handbook of econometrics, vol 4, chap 47. North-Holland, Amsterdam, pp 2843–2915
Weiss AA (1986) Asymptotic theory for ARCH models: Estimation and testing. Econ Theory
2:107–131
White H (1980) A heteroskedasticity consistent covariance matrix estimator and a direct test for
heteroskedasticity. Econometrica 48:817–838
Whittaker ET (1923) On a new method of graduation. Proc Edinburgh Math Soc 41:63–75
Wiener N (1956) The theory of prediction. In: Beckenbach EF (ed) Modern mathematics for
engineers. McGraw-Hill, New York, Series 1
Woodford M (2003) Interest and prices: foundations of a theory of monetary policy. Princeton
University Press, Princeton
Wu CFJ (1983) On the convergence of the EM algorithm. Ann Stat 11:95–103
Yoo BS (1987) Multi-cointegrated time series and a generalized error correction model. PhD thesis, University of California, San Diego
Yule GU (1926) Why do we sometimes get nonsense correlations between time series? A study in
sampling and the nature of time series. J R Stat Soc 89:1–64
Yule GU (1927) On a method of investigating periodicities in disturbed series, with special reference to Wolfer’s sunspot numbers. Philos Trans R Soc A 226:267–298
Zadrozny PA (2005) Necessary and sufficient restrictions for existence of a unique fourth moment of a univariate GARCH(p,q) process. CESifo Working Paper No. 1505
Index

A
ACF, see also Autocorrelation function
ADF-test, 148
AIC, see Information criterion, 101, 247
AR process, 29
  autocorrelation function, 29
  autocovariance function, 29
  stationary solution, 29
ARIMA process, 102, 134
ARMA model
  estimation, 87
  identification, 87
ARMA process, see also Autoregressive moving-average process
  autocovariance function, 38
  causality, 32
  causality condition, 33
  estimation, 95
  invertibility, 37
  invertibility condition, 37
  maximum likelihood estimation, 95
  state space representation, 330
Autocorrelation function, 14
  confidence interval
    MA(q) process, 76
    AR(1) process, 77
  estimation, 73
    asymptotic distribution, 74
    Bartlett’s formula, 74
    confidence interval, 75
  interpretation, 64
  order, 14
  properties, 21
  random walk, 144
  univariate, 14
Autocorrelation function, partial, 62
  AR process, 63
  estimation, 78
  interpretation, 64
  MA process, 52, 64
Autocovariance function, 13
  ARMA process, 38
  estimation, 73
  linear process, 124
  MA(1) process, 21
  multivariate, 202
  order, 13
  properties, 20
  random walk, 144
  univariate, 13
Autoregressive conditional heteroskedasticity models, see Volatility
Autoregressive final form, 223
Autoregressive moving-average process, 25
  mean, 25

B
Back-shift operator, see also Lag operator
Bandwidth, 80
Bartlett’s formula, 74
Basic structural model, 332, 349
  cyclical component, 333
  local linear trend model, 333
  seasonal component, 333
Bayesian VAR, 253
Beveridge-Nelson decomposition, 138, 383
Bias proportion, 250
Bias, small sample, 92, 231
  correction, 92, 231
BIC, see Information criterion, 101, 247
Borel-Cantelli lemma, 377
Box-Pierce statistic, 75
BSM, see Basic structural model

C
Canonical correlation coefficients, 315
Cauchy-Bunyakovskii-Schwarz inequality, 377
Causal representation, 32
Causality, see also Wiener-Granger causality, 328
  Wiener-Granger causality, 255
Central Limit Theorem
  m-dependence, 381
Characteristic function, 380
Chebyshev’s inequality, 377
Chow test, 355
Cointegration, 159
  Beveridge-Nelson decomposition, 304, 309
  bivariate, 159
  common trend representation, 310
  definition, 305
  fully-modified OLS, 319, 323
  Granger’s representation theorem, 309
  normalization, 323
  order of integration, 303
  shocks, permanent and transitory, 311
  Smith-McMillan factorization, 306
  test
    Johansen test, 312
    regression test, 161
  triangular representation, 311
  VAR model, 305
    assumptions, 305
  VECM, 307
  vector error correction, 307
  Wald test, 321
Companion form, 218
Convergence
  almost sure convergence, 378
  convergence in r-th mean, 378
  convergence in distribution, 379
  convergence in probability, 378
Correlation function, 202
  estimator, 208
  multivariate, 202
Covariance function, 202
  estimator, 208
  properties, 203
Covariance proportion, 250
Covariance, long-run, 209
Cross-correlation, 203
  distribution, asymptotic, 209
Cyclical component, 128, 333

D
Dickey-Fuller distribution, 142
Durbin-Levinson algorithm, 48, 63
Dynamic factor model, 335
Dynamic multiplier, see Shocks, transitory

E
EM algorithm, 345
Ergodicity, 10, 69
Estimation
  ARMA model, 95
  order, 99
Estimator
  maximum likelihood estimator, 95, 96
  method of moments
    GARCH(1,1) model, 187
  moment estimator, 88
  OLS estimator, 91
  process, integrated, 141
  Yule-Walker estimator, 88
Example
  AD-curve and Money Supply, 260
  advertisement and sales, 274
  ARMA processes, 34
  cointegration
    fully-modified OLS, 323
    Johansen approach, 321
  consumption expenditure and advertisement, 212
  demand and supply shocks, 287
  estimation of long-run variance, 83
  estimation of quarterly GDP, 346
  GDP and consumer sentiment index, 213
  growth model, neoclassical, 323
  inflation and short-term interest rate, 162
  IS-LM model with Phillips curve, 277
  modeling real GDP of Switzerland, 103
  present discounted value model, 296
  structural breaks, 356
  Swiss Market Index, 188
  term structure of interest rate, 164
  unit root test, 152
Expectation, adaptive, 59
Exponential smoothing, 58

F
Factor model, dynamic, see Dynamic factor model
FEVD, see also Forecast error variance decomposition