2018 Book TimeSeriesEconometrics PDF
John D. Levendis
Time Series
Econometrics
Learning Through Replication
Springer Texts in Business and Economics
Springer Texts in Business and Economics (STBE) delivers high-quality
instructional content for undergraduates and graduates in all areas of Business/
Management Science and Economics. The series is comprised of self-contained
books with a broad and comprehensive coverage that are suitable for class as well
as for individual self-study. All texts are authored by established experts in their
fields and offer a solid methodological background, often accompanied by problems
and exercises.
John D. Levendis
Department of Economics
Loyola University New Orleans
New Orleans, LA, USA
This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To Catherine and Jack
Preface
What makes this book unique? It follows a simple ethos: it is easier to learn by
doing. Or, “Econometrics is better taught by example than abstraction” (Angrist
and Pischke 2017, p. 2).
The aim of this book is to explain how to use the basic, yet essential, tools of
time-series econometrics. The approach is to be as simple as possible so that real
learning can take place. We won’t try to be encyclopedic, nor will we break new
methodological ground. The goal is to develop a practical understanding of the basic
tools you will need to get started in this exciting field.
We progress methodically, building as much as possible from concrete examples
rather than from abstract first principles. First we learn by doing. Then, with
a bit of experience under our belts, we’ll begin developing a deeper theoretical
understanding of the real processes. After all, when students learn calculus, they
learn the rules of calculus first and practice taking derivatives. Only after they’ve
gained familiarity with calculus do students learn the Real Analysis theory behind
the formulas. In my opinion, students should take applied econometrics before
econometric theory. Otherwise, the importance of the theory falls on deaf ears, as
the students have no context with which to understand the theorems.
Other books seem to begin at the end, with theory, and then they throw in some
examples. We, on the other hand, begin where you are likely to be and lead you
forward, building slowly from examples to theory.
In the first section, we begin with simple univariate models on well-behaved
(stationary) data. We devote a lot of attention on the properties of autoregressive
and moving-average models. We then investigate deterministic and stochastic
seasonality. Then we explore the practice of unit root testing and the influence of
structural breaks. The first section ends with models of non-stationary variance. In
the second section, we extend the simpler concepts to the more complex cases of
multi-equation multi-variate VAR and VECM models.
By the time you finish working through this book, you will not only have studied
some of the major techniques of time series, you will actually have worked through
many simulations. In fact, if you work along with the text, you will have replicated
some of the most influential papers in the field. You won’t just know about some
results, you’ll have derived them yourself.
No textbook can cover everything. In this text we will not deal with fractional
integration, seasonal cointegration, or anything in the frequency domain. Opting for
a less-is-more approach, we must leave these and other more complicated topics to
other textbooks.
Nobody works alone. Many people helped me complete this project. They
deserve thanks.
Several prominent econometricians dug up—or tried to dig up—data from their
classic papers. Thanks, specifically, to Richard T. Baillie, David Dickey, Jordi Gali,
Charles Nelson, Dan Thornton, and Jean-Michel Zakoian.
Justin Callais provided tireless research assistance, verifying Stata code for the
entire text. Donald Lacombe, Wei Sun, Peter Wrubleski, and Jennifer Moreale
reviewed early drafts of various chapters and offered valuable suggestions. Mehmet
F. Dicle found some coding and data errors and offered useful advice. Matt Lutey
helped with some of the replications.
The text was inflicted upon a group of innocent undergraduate students at Loyola
University. These bright men and women patiently pointed out mistakes and typos,
as well as passages that required clarification. For that, I am grateful and wish to
thank Justin Callais, Rebecca Driever, Patrick Driscoll, William Herrick, Christian
Mays, Nate Straight, David Thomas, Tom Whelan, and Peter Wrobleski.
This project could not have been completed without the financial support of
Loyola University, the Marquette Fellowship Grant committee, and especially Fr.
Kevin Wildes.
Thanks to Lorraine Klimowich from Springer for believing in the project and
encouraging me to finish it.
Finally, and most importantly, I’d like to thank my family. My son Jack: you
are my reason for being; I hope to make you proud. My wife Catherine: you are a
constant source of support and encouragement. You are amazing. I love you.
Contents

1 Introduction
  1.1 What Makes Time-Series Econometrics Unique?
  1.2 Notation
  1.3 Statistical Review
  1.4 Specifying Time in Stata
  1.5 Installing New Stata Commands
  1.6 Exercises
2 ARMA(p,q) Processes
  2.1 Introduction
    2.1.1 Stationarity
    2.1.2 A Purely Random Process
  2.2 AR(1) Models
    2.2.1 Estimating an AR(1) Model
    2.2.2 Impulse Responses
    2.2.3 Forecasting
  2.3 AR(p) Models
    2.3.1 Estimating an AR(p) Model
    2.3.2 Impulse Responses
    2.3.3 Forecasting
  2.4 MA(1) Models
    2.4.1 Estimation
    2.4.2 Impulse Responses
    2.4.3 Forecasting
  2.5 MA(q) Models
    2.5.1 Estimation
    2.5.2 Impulse Responses
  2.6 Non-zero ARMA Processes
    2.6.1 Non-zero AR Processes
    2.6.2 Non-zero MA Processes
    2.6.3 Dealing with Non-zero Means
  2.7 ARMA(p,q) Models
    2.7.1 Estimation
  2.8 Conclusion
Bibliography
Index
The original version of this book was revised: the data sets and a blurb on the cover have been updated. The correction to this book is available at https://doi.org/10.1007/978-3-319-98282-3_14
1 Introduction
• Suppose you own a business. How might you use the previous 10 years’ worth
of monthly sales data to predict next month’s sales?
• You wish to test whether the “Permanent income hypothesis” holds. Can you
see whether consumption spending is a relatively constant fraction of national
income?
• You are a financial researcher. You wish to determine whether gold prices lead
stock prices, or vice versa. Is there a relationship between these variables? If so,
can you use it to make money?
Consider the differences in the two panels of Fig. 1.1. Panel (a) shows cross-
sectional data and panel (b) shows time-series data. Standard econometrics—the
econometrics of cross sections—relies on the fact that observations are independent.
If we take a sample of people and ask whether they are unemployed today, we
will get a mixture of answers. And even though we might be in a particularly
bad spell in the economy, one person’s unemployment status is not likely to affect
another person’s. It’s not as though person A can’t get a job just because person
B is unemployed. But if we are focused on the unemployment rate, year after
year, then this year’s performance is likely influenced by last year’s economy. The
observations in time series are almost never independent. Usually, one observation
is correlated with the previous observation.
Fig. 1.1 Two different types of data. (a) Cross-sectional data. (b) Time-series data
This isn’t just a trivial difference. Perhaps a simpler example will illustrate why
this distinction is so important. Suppose you want to know whether a coin is biased.
You should flip it and record its value. Then, flip it again and record its value. Do
this over and over, say, one hundred times, and you will get a good idea of whether
it is biased. An unbiased coin should give us roughly fifty heads and fifty tails, plus
or minus random error. Under the null hypothesis of a fair coin, the probability of
a head in the first flip is 1/2. Likewise, the probability of heads in the second flip is
also 1/2, regardless of how the first flip turned out.
But things are different when the observations are not independent. Suppose you
flip the coin once and record the outcome. Then, you immediately look back down,
observe that it is still heads, and you record a second observation of heads. You can
do this one hundred times. But you don’t have one hundred useful observations! No,
you only have one good observation. Even though you recorded one hundred heads,
you only had one coin flip.
Things in time series are never quite as bad as this example. But time series is far
more like a situation where you flip the coin and record the observation, sometimes
after two flips, and sometimes after four flips. There will be a lot of inertia in your
observations, which invalidates the simple formulas. You’ll need new ones.
In fact, in one sense, this dependency makes some things easier in time series. In
time series, we watch something unfold slowly over time. If the economy changes
slowly, then we can use the past as a useful guide to the future. Want to know what
next month’s unemployment rate will be? It will very likely be close to this month’s
rate. And it will be changing by roughly the same amount as it changed in the recent
past.
1.2 Notation
Yt = Zt − Zt−1
= (Xt − Xt−1 ) − (Xt−1 − Xt−2 )
= Xt − 2Xt−1 + Xt−2 .
That is, the first difference of a first difference is called a “second difference.”
For notational simplicity, we denote the second difference as D². Thus, the second
difference of Xt is D²Xt. The third difference of Xt is D³Xt. The k-th difference is
DᵏXt. As with L, raising D to a power denotes the number of times that differencing
is to occur.
D²Xt = (1 − L)²Xt
= (Xt − Xt−1) − (Xt−1 − Xt−2)
= Xt − 2Xt−1 + Xt−2.
Notice that a lagged variable shifts the column of data down by one row.
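To make the lag and difference operators concrete, here is a short Python sketch (the series is made up purely for illustration); it verifies that differencing twice reproduces Xt − 2Xt−1 + Xt−2:

```python
import pandas as pd

# A small made-up series to illustrate the lag (L) and difference (D) operators.
X = pd.Series([3.0, 5.0, 4.0, 8.0, 7.0])

lag1 = X.shift(1)        # L X: shifts the column of data down by one row
d1 = X.diff(1)           # D X   = X_t - X_{t-1}
d2 = X.diff(1).diff(1)   # D^2 X = the first difference of the first difference

# D^2 X_t should equal X_t - 2 X_{t-1} + X_{t-2}
check = X - 2 * X.shift(1) + X.shift(2)
print(d2.equals(check))  # the two constructions coincide
```

The first two entries of the differenced series are missing, just as the first row of a lagged variable is missing in Stata.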
1.3 Statistical Review

In this section, we dust off some cobwebs and refresh the basic rules of probability.
Given a random variable X, its probability distribution function f(x) is a function
that assigns a probability to each possible outcome of X. For example, suppose
you are flipping a coin; the random variable X is whether the coin shows heads or
tails, and to these two outcomes, we assign a probability of Pr(X=heads) = 1/2, and
Pr(X=tails) = 1/2.
Continuous variables are those that can take on any value between two numbers.
Between 1 and 100, there is an infinite continuum of numbers. Discrete numbers
are more like the natural numbers. They take on distinct values. Things that are not
normally thought of as numeric can also be coded as discrete numbers. A common
example is pregnancy, a variable that is not intrinsically numeric. Pregnancy status
might be coded as a zero/one variable, one if the person is pregnant and zero
otherwise.
Some discrete random variables in economics are: a person’s unemployment
status, whether two countries are in a monetary union, the number of members in
the OPEC, and whether a country was once a colony of the UK.1 Some discrete
random variables in finance include: whether a company is publicly traded or not,
the number of times it has offered dividends, or the number of members on the
Board.
Continuous financial random variables include: the percent returns of a stock,
the amount of dividends, and the interest rate on bonds. In economics: GDP, the
unemployment rate, and the money supply are all continuous variables.
If the list of all possible outcomes of X has discrete outcomes, then we can define
the mean (aka average or expectation) of X as:
E(X) = Σᵢ xᵢ Pr(X = xᵢ).
1 Acemoglu et al. (2000) argue that a colonizing country might negatively affect a colony’s legal
and cultural institutions. To the extent that those institutions are still around today, the colonial
history from dozens if not hundreds of years ago could have a lingering effect.
The population mean of X will be denoted μX , and the sample mean, X̄. We will
switch between the notations E(X) and μX as convenient.
The population variance of a random variable is the average squared deviation of
each outcome from its mean:
Var(X) = σx²
= E(X²) − E(X)E(X)
= (1/N) Σᵢ (xᵢ − E(X))²   (for N equally likely discrete outcomes)
= ∫ (x − E(X))² f(x) dx   (for continuous X).
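As a quick numeric illustration (the distribution below is invented for the example), the variance formula and the identity Var(X) = E(X²) − [E(X)]² can be checked directly:

```python
# Numeric check of the discrete mean and variance formulas, using a
# made-up distribution: X takes values 1, 2, 3 with probabilities 0.2, 0.5, 0.3.
xs = [1.0, 2.0, 3.0]
ps = [0.2, 0.5, 0.3]

mean = sum(x * p for x, p in zip(xs, ps))             # E(X) = sum x_i Pr(X = x_i)
ex2 = sum(x**2 * p for x, p in zip(xs, ps))           # E(X^2)
var = sum((x - mean)**2 * p for x, p in zip(xs, ps))  # E[(X - E(X))^2]

print(round(mean, 10))                                # 2.1
print(round(var, 10) == round(ex2 - mean**2, 10))     # Var(X) = E(X^2) - [E(X)]^2
```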
If X and Y are random variables, and a and b are constants, then some simple
properties of the statistics listed above are:
E(a) = a
E(aX) = aE(X)
Stdev(aX) = |a| Stdev(X)
Var(a) = 0
Var(aX) = a²Var(X)
Var(X) = Cov(X, X)
Cov(X, Y) = Cov(Y, X)
Cov(aX, bY) = abCov(X, Y)
Corr(aX, bY) = Corr(X, Y) = Corr(Y, X)   (for a, b > 0).
Adding a constant to a random variable changes its mean, but not its variance:
E(a + X) = a + E(X)
Var(a + X) = Var(X).
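These linear-transformation rules are easy to verify on simulated data; the sketch below draws from an arbitrary Normal distribution (the constants are chosen only for illustration), so the checks are numerical rather than algebraic:

```python
import numpy as np

# Approximate check of the linear-transformation rules on simulated data.
rng = np.random.default_rng(0)
X = rng.normal(5.0, 2.0, size=200_000)
a = 3.0

# E(a + X) = a + E(X); Var(a + X) = Var(X)
print(abs((a + X).mean() - (a + X.mean())) < 1e-9)
print(abs(np.var(a + X) - np.var(X)) < 1e-6)

# Var(aX) = a^2 Var(X); Stdev(aX) = |a| Stdev(X)
print(abs(np.var(a * X) - a**2 * np.var(X)) < 1e-6)
print(abs(np.std(a * X) - abs(a) * np.std(X)) < 1e-9)
```

All four checks print True: adding a constant leaves the variance alone, while scaling multiplies the variance by a² and the standard deviation by |a|.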
If two random variables are added together, then it can be shown that

E(X + Y) = E(X) + E(Y)
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y).

Now suppose the errors et in a model are distributed

et ∼ N(0, σ²)

so that E(et) = 0 and Var(et) = σ² for all t. In this case, we say that et is
distributed “IID Normal,” or “independently and identically distributed” from a
Normal distribution. If this is the case, then we should be able to show that

Var(et) = E(et²) = σ², Cov(et, et−k) = 0, and Corr(et, et−k) = 0 for all k ≠ 0.
In other words, the variance formula simplifies; the variable does not covary with
itself across any lag; and the variable is not correlated with itself across any lag.
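A small simulation (the seed, σ, and sample size are arbitrary choices) illustrates all three properties of iid Normal errors:

```python
import numpy as np

# Simulate iid N(0, sigma^2) errors and check: mean ~ 0, variance ~ sigma^2,
# and (up to sampling error) zero correlation with own lags.
rng = np.random.default_rng(42)
sigma = 2.0
e = rng.normal(0.0, sigma, size=100_000)

print(abs(e.mean()) < 0.05)              # E(e_t) = 0
print(abs(e.var() - sigma**2) < 0.1)     # Var(e_t) = sigma^2

for k in (1, 2, 5):
    r = np.corrcoef(e[:-k], e[k:])[0, 1] # Corr(e_t, e_{t-k})
    print(abs(r) < 0.02)                 # ~ 0 at every lag k
```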
1.4 Specifying Time in Stata

For time series, the order of observations is important. In fact, it is the defining
feature of time series. Order matters. First one thing happens, then another, and
another still. This happens over time. You can’t rearrange the order without changing
the problem completely. If sales are trending down, you can’t just rearrange the
observations so that the line trends up. This is different from cross sections where
the order of observations is irrelevant. You might want to know what the correlation
is between heights and weight; whether you ask Adam before Bobby won’t change
their heights or weights.
Given that time is the defining feature of time series, Stata needs to have a time
variable. It needs to know which observation came first and which came second.
Suppose your observations were inputted with the first observation in, say, row one,
and the second observation in row two, and so forth. Then it might be pretty obvious
to you that the data are already in the proper order. But Stata doesn’t know that. It
needs a variable (a column in its spreadsheet) that defines time. In our case, we
could just create a new variable called, say, time that is equal to the row number.
But just because we named the variable “time” doesn’t mean that Stata
understands what it is. To Stata, the variable time is just another variable. We need
to tell it that time establishes the proper order of the data. We do this by using the
tsset command:

. generate time = _n
. tsset time
Sometimes we might import a dataset that already has a variable indicating time.
Stata needs to be told which variable is the time variable. To check whether a time
variable has already been declared, you can type

. tsset
and Stata will tell you which variable has been tsset, if any. If no variable has
already been tsset, then you must do it yourself.
If you are certain that there are no gaps in your data (no missing observations),
then you could simply sort your data by the relevant variable, and then
generate a new time variable using the two commands above. This simple
procedure will get you through most of the examples in this book.
If there are gaps, however, then you should be a bit more specific about your time
variable. Unfortunately, this is where things get tricky. There are myriad different
ways to describe the date (ex: Jan 2, 2003; 2nd January 2003; 1/2/2003; 2-1-03;
and so on). There are almost as many different ways to tsset your data in Stata.
Alas, we must either show you specifics as they arise, or ask you to consult Stata’s
extensive documentation.
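The same format ambiguity exists in any environment, not just Stata. As a neutral illustration in Python's pandas (shown only to make the point; it is not part of the book's Stata workflow), the same calendar date written three different ways parses to a single timestamp once each format is stated explicitly:

```python
import pandas as pd

# The same calendar date written three different ways; each needs its
# format spelled out explicitly to parse unambiguously.
d1 = pd.to_datetime("Jan 2, 2003", format="%b %d, %Y")
d2 = pd.to_datetime("2/1/2003", format="%d/%m/%Y")   # day-first style
d3 = pd.to_datetime("2003-01-02", format="%Y-%m-%d")

print(d1 == d2 == d3)   # all three are January 2, 2003
```

Without the explicit format strings, "2/1/2003" is ambiguous: it could mean January 2 or February 1. Declaring a time variable forces that choice to be made once, up front.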
1.5 Installing New Stata Commands

Stata comes off the shelf with an impressive array of time-series commands. But
it is a fully programmable language, so many researchers have written their own
commands. Many are downloadable directly from Stata.
1.6 Exercises 9
In this book, we’ll make heavy use of three user-written commands to download
data in Stata-readable format: fetchyahooquotes (Dicle and Levendis 2011);
freduse (Drukker 2006); and wbopendata (Azevedo 2011).
The first of these, fetchyahooquotes, downloads publicly available
financial data from Yahoo! Finance. Macroeconomic data for the US can be
downloaded from FRED, the Federal Reserve Bank of St. Louis’ economic
database, using the freduse command. Finally, wbopendata downloads
worldwide macroeconomic data from the World Bank’s online database. If these
commands are not already installed on your computer, you can download and install
them by typing the following:

. ssc install fetchyahooquotes
. ssc install freduse
. ssc install wbopendata
1.6 Exercises
IBM’s share price. (The first difference of the logs is equal to the percentage
change.) What is the average daily rate of return for IBM during this period?
On which date did IBM have its highest percentage returns? On which date did
it have its lowest percentage returns?
5. Download the daily adjusted closing price of MSFT stock from 2000–2012
using fetchyahooquotes. Take the natural logarithm of this price. Then,
using Stata’s D notation, generate a new variable containing the percentage
returns of Microsoft’s share price. (The first difference of the logs is equal to
the percentage change.) What is the average daily rate of return for Microsoft
during this period? On which date did Microsoft have its highest percentage
returns? On which date did it have its lowest percentage returns?
6. Suppose midterm grades were distributed normally, with a mean of 70 and a
standard deviation of 10. Suppose further that the professor multiplies each
exam by 1.10 as a curve. Calculate the new mean, standard deviation, and
variance of the curved midterm grades.
7. Suppose X is distributed normally with a mean of 5 and a standard deviation
of 2. What is the expected value of 10X? What is the expected value of 20X?
What are the variance and standard deviations of 5X and of 10X?
8. Suppose that two exams (the midterm and the final) usually have averages of
70 and 80, respectively. They have standard deviations of 10 and 7, and their
correlation is 0.80. What is their covariance? Suppose that the exams were not
weighted equally. Rather, in calculating the course grade, the midterm carries a
weight of 40% and the final has a weight of 60%. What is the expected grade for
the course? What is the variance and standard deviation for the course grade?
9. Suppose that an exam has an average grade of 75 and a standard deviation of
10. Suppose that the professor decided to curve the exams by adding five points
to everyone’s score. What are the mean, standard deviation and variance of the
curved exam?
10. Suppose that in country A, the price of a widget has a mean of $100 and a
variance of $25. Country B has a fixed exchange rate with A, so that it takes
two B-dollars to equal one A-dollar. What is the expected price of a widget in
B-dollars? What is its variance in B-dollars? What would the expected price
and variance equal if the exchange rate were three-to-one?
2 ARMA(p,q) Processes
2.1 Introduction
1 The list of influential economists who have worked in some capacity at the Cowles Commission
is astounding. All in all, thirteen Nobel prize winning economists have worked at the Commission
including Maurice Allais, Kenneth Arrow, Gerard Debreu, Ragnar Frisch, Trygve Haavelmo,
Leonid Hurwicz, Lawrence Klein, Tjalling Koopmans, Harry Markowitz, Franco Modigliani,
Edmund Phelps, Joseph Stiglitz, and James Tobin. Not all were involved in the econometric side
of the Cowles Commission’s work.
2 For a brief discussion of the Cowles approach, see Fair (1992). Epstein (2014) provides much
more historical detail. Diebold (1998) provides some historical context, as well as a discussion of
the more current macroeconomic forecasting models that have replaced the Cowles approach.
First, the models stopped working well. To patch them up, the economists began
adding ad-hoc terms to the equations.3
Second, Lucas (1976) levied a powerful theoretical critique. He argued that the
estimated parameters for each of the equations weren’t structural. For example,
they might have estimated that the marginal propensity to consume, in a linear
consumption function, was, say 0.80. That is, on average people consume 80%
of their income. Lucas argued that this might be the optimal consumption amount
because of a particular set of tax or monetary policies. Change the tax structure, and
people will change their behavior. The models, then, are not useful at all for policy
analysis, only for forecasting within an unchanging policy regime.
Third, a series of papers revealed that large-scale econometric models
were outperformed by far simpler models. These simple models—called ARIMA
models—are the subject of the present chapter. Naylor et al. (1972) found that
ARIMA outperformed Wharton’s more complex model by 50% in forecasting GNP,
unemployment, inflation and investment. Cooper (1972) compared even simpler
AR models with the forecasting ability of seven leading large-scale models. For
almost all of the thirty-one variables he examined, the simpler models were superior.
Nelson (1972) critically examined the performance of a large-scale model jointly
developed by the Federal Reserve Bank, MIT, and the University of Pennsylvania. By
1972 the FRB-MIT-Penn model used 518 parameters to investigate 270 economic
variables (Ando et al. 1972). In fact, Nelson showed that this complex model was
outperformed by the simplest of time-series models. Embarrassingly simple models.
One variable regressed on itself usually produced better forecasts than the massive
FRB model.
An ARIMA model is made up of two components: an Autoregressive (AR)
model and a Moving Average (MA) model. Both rely on previous data to help
predict future outcomes. AR and MA models are the building blocks of all our
future work in this text. They are foundational, so we’ll proceed slowly.
In Chap. 10 we will discuss VAR models. These models generalize univariate
autoregressive models to include systems of equations. They have come to be the
replacement for the Cowles approach. But first, we turn to two subsets of ARIMA
models: autoregressive (AR) models and moving average (MA) models.
2.1.1 Stationarity
In order to use AR and MA models the data have to be “well behaved.” Formally,
the data need to be “stationary.” We will hold off on rigorously defining and testing
for stationarity until later chapters. For now, let us make the following loose simplifying
assumptions. Suppose you have a time series on a variable, X, that is indexed by a
3 This is reminiscent of adding epicycles to models of the geocentric universe. The basic model
wasn’t fitting the data right, so they kept adding tweaks on top of tweaks to the model, until the
model was no longer elegant.
Figure 2.1 illustrates a time-series that is mean stationary (it reverts back to its
average value) but is not variance stationary (its variance fluctuates over time with
periods of high volatility and low volatility).
Finally, X is “covariance stationary” if the covariance of X with its own lagged
values depends only upon the length of the lag, but not on the specific time period
nor on the direction of the lag. Symbolically, for a lag length of k,
Cov (Xt , Xt+k ) = Cov (Xt+1 , Xt+k+1 ) = Cov (Xt−1 , Xt+k−1 ) . (2.4)
Fig. 2.2 X is mean stationary but neither variance nor covariance stationary
2.1.2 A Purely Random Process

Suppose you are the manager at a casino, and one of your jobs is to track and predict
the flow of cash into and from the casino. How much cash will you have on hand on
Tuesday of next week? Suppose you have daily data extending back for the previous
1000 days.
Let Xt denote the net flow of cash into the casino on day t. Can we predict
tomorrow’s cash flow (Xt+1 ), given what happened today (Xt ), yesterday (Xt−1 ),
and before?
Consider a model of the following form
Xt = et
where the errors are Normally distributed with mean of zero and variance of one,
et ∼ iidN (0, 1)
in all time periods. In other words, X is just pure random error. This is not a very
useful, or even accurate, model of a casino’s cash flows, but it is a useful starting
point pedagogically. Each day’s cash flow is completely independent of the previous
days’ flow, and moreover, the amount of money coming into the casino is offset, on
average, by cash outflow. In other words, the average cash flow is zero. That is,
E (Xt ) = E (et ) = 0
Exercise

1. Using the definitions in Eqs. (2.3) and (2.4), show whether the purely random
process Xt = et is mean stationary, variance stationary, and covariance stationary.

2.2 AR(1) Models

Consider the AR(1) process:

Xt = βXt−1 + et. (2.5)
We’ll look more closely at this simple random process. It is the workhorse of time-
series econometrics and we will make extensive use of its properties throughout this
text.
Here, the current realization of X depends in part on X’s value last period plus some
random error. If we were to estimate this model, we’d regress X on itself (lagged
one period). This is why the model is called an “autoregressive model with lag one”
or “AR(1)” for short. An autoregression is a regression of a variable on itself.
2.2.1 Estimating an AR(1) Model

One of the appeals of AR models is that they are quite easy to estimate. An AR(1)
model consists of X regressed on its first lag. As expressed in Eq. (2.5) there is no
constant in the model, so we can estimate it using the standard regress command
with the nocons option. Let’s try this on a simple dataset, ARexamples.dta:

. use ARexamples.dta, clear
. regress X L.X, nocons
16 2 ARMA(p,q) Processes
The nocons option tells Stata not to include a constant term in the regression.
Our estimated model is
Xt = 0.524Xt−1 + et .
The data were constructed specifically for this chapter, and came from an AR(1)
process where the true value of β = 0.50. Our estimate of 0.524 is fairly close to
this true value.
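For readers working outside Stata, the same regression can be sketched in Python. Here the data are simulated from an AR(1) with a true β = 0.5 (the book's ARexamples.dta is not reproduced, so the estimate will differ slightly), and β is estimated by no-constant OLS of X on its own first lag:

```python
import numpy as np

# Simulate an AR(1) with beta = 0.5 and estimate beta by no-constant OLS,
# i.e. regress X on its own first lag.
rng = np.random.default_rng(7)
beta, T = 0.5, 3000
X = np.zeros(T)
for t in range(1, T):
    X[t] = beta * X[t - 1] + rng.normal()

y, x = X[1:], X[:-1]            # X_t and its first lag
beta_hat = (x @ y) / (x @ x)    # OLS slope with no constant
print(round(beta_hat, 2))       # close to the true value of 0.5
```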
Another way to estimate this model is to use Stata’s arima command:

. arima X, ar(1) nocons nolog
As before, the nocons option tells Stata not to include a constant. The nolog
option de-clutters the output but does not affect the estimates in any way.
There are very small differences in the two estimates. The arima command uses
an iterative procedure to maximize the likelihood function. This iterative procedure
sometimes converges on an estimate that is slightly different from the one using the
regress command. Another difference is that it uses one more observation than
does regress.
Why does Stata use a more complicated procedure than OLS? Because OLS yields
biased estimates when the regressor is a lagged dependent variable. This bias goes
away in large samples.
Xt = βXt−1 + et (2.6)

where |β| < 1 and et ∼ iidN(0, σ²). (We will see shortly that this restriction
on β implies that E(X) = X̄ = 0.) The variable Xt−1 on the right hand side is
a lagged dependent variable. Running ordinary least squares (OLS) of X on its lag
produces a biased (but consistent) estimate of β. To see this, recall from introductory
econometrics that the OLS estimate of β is
β̂OLS = Cov(Xt, Xt−1) / Var(Xt−1) = Σ(Xt Xt−1) / Σ(Xt−1²).
Thus we can see that the OLS estimate β̂OLS is equal to the true value of β plus
some bias.4 Fortunately this bias shrinks in larger samples (that is, the estimate is
said to be “consistent.”)
If the errors are autocorrelated then the problem is worse. OLS estimates are
biased and inconsistent. That is, the problem of bias doesn’t go away even in
infinitely large samples. We illustrate this problem with some simulated data, and
graph the sampling distribution of the OLS estimator. Figures 2.3 and 2.4 show
the performance of OLS. Figure 2.3 shows that OLS estimates on LDVs are
biased in small samples, but that this bias diminishes as the sample size increases.
Figure 2.4 shows that OLS’s bias does not diminish in the case where the errors are
autocorrelated.
Below is the core part of the Stata code used to generate Figs. 2.3 and 2.4.
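A Python analogue of the core of that simulation (the true β = 0.5 and the small and large sample sizes follow the text; the number of replications and the seed are arbitrary choices) shows the small-sample bias shrinking as T grows:

```python
import numpy as np

# Monte Carlo: OLS on a lagged dependent variable is biased in small
# samples, but the bias shrinks as T grows (the Fig. 2.3 experiment).
rng = np.random.default_rng(1)
beta, reps = 0.5, 2000

def ols_ar1(T):
    """Simulate an AR(1) of length T and return the no-constant OLS estimate."""
    X = np.zeros(T)
    for t in range(1, T):
        X[t] = beta * X[t - 1] + rng.normal()
    y, x = X[1:], X[:-1]
    return (x @ y) / (x @ x)

for T in (20, 100):
    est = np.mean([ols_ar1(T) for _ in range(reps)])
    print(T, round(est, 3))   # the mean estimate creeps up toward 0.5 as T grows
```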
4 I am indebted to Keele and Kelly (2005) who showed the algebra behind OLS’s bias when used
with lagged dependent variables.
Summary stats for the OLS estimates of β are reported below for sample sizes of
20, 40, 60, 80, and 100:
The table above and Fig. 2.3 show that as the sample size increases, the OLS
estimates get closer and closer to the true value of 0.50.
What about the case where the errors are autocorrelated? In this case, OLS
estimates do not converge to 0.50 (see below and Fig. 2.4)
Given the AR(1) process
Xt = 0.75Xt−1 + et ,
let us trace out the effect of a one-unit change in et . First, suppose that X has been constantly zero for each period leading up to the current period t.
And now suppose that e receives a one-time shock of one unit in period t; that is,
et = 1 in period t only.
Xt = 0.75Xt−1 + et = 0.75(0) + 1 = 1.
In the following periods, with no further shocks, Xt+1 = 0.75(1) = 0.75, Xt+2 = 0.75(0.75) = 0.5625, Xt+3 ≈ 0.4219, and so on. Thus, we can see that a one-time one-unit shock to X has a lingering, but exponentially decaying, effect on X (Fig. 2.5).
So much for the theoretical IRF. What about an estimated—i.e. empirical—IRF?
Stata can use the estimated model to calculate the IRF. Before we proceed, though,
it would be beneficial to note that we were quite arbitrary in postulating a shock of
one unit. The shock could have been any size we wished to consider. We could have
considered a shock of, say, two or three units. A more common option would be
to trace out the effect of a one-standard deviation (of X) shock. In fact, this is the
default in Stata.
Using Stata’s irf post-estimation command, we can automatically graph the
IRF of an estimated ARMA model. After estimating a model, you must first create
a file to store the IRF’s estimates, and then ask for those estimates to be displayed
graphically. We can do this by typing:
[Fig. 2.5: Theoretical IRF of the AR(1) process; the response of X decays from 1 toward 0 over ten periods]
[Stata irf graph for the estimated AR(1) model: impulse-response function with 95% CI; graphs by irfname, impulse variable, and response variable]
Now let’s get a little more general. Rather than assume β = 0.75, or 0.5236071,
or some other particular number, let’s keep β unspecified. To keep things simple we
will assume that Eq. (2.5) has et ∼ N(0, σ²), and for stationarity, we'll assume that
−1 < β < 1.
What is the average value of this process? Does the answer to this question
depend upon the time period (i.e. is it mean-stationary)? And how do previous
shocks affect current realizations?
Rewriting Eq. (2.5) for period t = 1, we have
X1 = βX0 + e1 . (2.7)
X2 = βX1 + e2 (2.8)
X3 = βX2 + e3 (2.9)
and so forth.
Substituting X1 from (2.7) into X2 (Eq. 2.8) we have
X2 = βX1 + e2
= β(βX0 + e1 ) + e2
= β²X0 + βe1 + e2 .
Similarly,
X3 = βX2 + e3
= β(β²X0 + βe1 + e2 ) + e3
= β³X0 + β²e1 + βe2 + e3 ,
and
X4 = β⁴X0 + β³e1 + β²e2 + βe3 + e4 .
Notice that the effect of the most recent shock (e4 ) enters undiminished. The effect of the previous shock (e3 ) is diminished since it is multiplied by the fraction β. The shock two periods previous is diminished by a factor of β², and so forth. Since β is a number between −1 and 1, β^t can become quite small quite quickly.
A general pattern begins to emerge. In general, an AR(1) process can be expressed as
Xt = βXt−1 + et = β^t X0 + Σ_{i=1}^t β^(t−i) ei ,
so the response of X to a one-unit shock k periods earlier is
IRF(k) = β^k .
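As a quick check of this closed form, the IRF can be computed in Python both by iterating the recursion and from β^k directly (a sketch, not from the text):

```python
def ar1_irf(beta, horizon):
    """IRF of an AR(1): response of X to a one-unit shock in period 0."""
    x, irf = 0.0, []
    for k in range(horizon + 1):
        x = beta * x + (1.0 if k == 0 else 0.0)  # shock hits only in period 0
        irf.append(x)
    return irf

irf = ar1_irf(0.75, 5)
print(irf)  # [1.0, 0.75, 0.5625, ...]
assert all(abs(irf[k] - 0.75 ** k) < 1e-12 for k in range(6))  # matches IRF(k) = beta^k
```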
24 2 ARMA(p,q) Processes
2.2.3 Forecasting
Xt = 0.75Xt−1 + et . (2.10)
We do not usually know what the error terms are—we have to estimate them
via the residuals—but let's pretend that we know them so we can better understand
how the variable Xt evolves. For example, let’s suppose that X0 = 100 and that
the first four error terms, drawn from a N(0, 100) distribution, happen to be et =
[20, −30, 10, 15]. Then the next four values of Xt are:
X1 = 0.75X0 + e1 = 0.75(100) + 20 = 95
X2 = 0.75X1 + e2 = 0.75(95) − 30 = 41.25
X3 = 0.75X2 + e3 = 0.75(41.25) + 10 = 40.9375
X4 = 0.75X3 + e4 = 0.75(40.9375) + 15 = 45.7031
Forecasting out beyond where we have data, we can only comment on the expected
value of X5 , conditional on all the previous data:
E (X5 | X4 , X3 , . . . ) = E (0.75X4 + e5 | X4 , X3 , . . . )
= E (0.75 (45.7031) + e5 | X4 , X3 , . . . )
= E (34.27725 + e5 | X4 , X3 , . . . )
= 34.27725 + E (e5 | X4 , X3 , . . . ) = 34.27725, since the expected value of the future shock is zero.
Forecasting two periods out,
E (X6 | X4 , X3 . . .) = E (βX5 + e6 )
= βE (X5 ) + E (e6 )
= 0.75 (34.27725) + 0
= 25.7079.
More generally, each additional period out multiplies the forecast by another factor of β: E(Xt+1 | Xt , . . .) = βXt , E(Xt+2 | Xt , . . .) = β²Xt , and E(Xt+3 | Xt , . . .) = β³Xt .
Even more generally, given data up to period t, we can expect that the value of
Xt+a , i.e. a periods ahead, will be
E (Xt+a | Xt , Xt−1 , . . . ) = β a Xt .
Since |β| < 1, this means that the one period ahead forecast is a fraction of today's X value. The two-periods-ahead forecast is smaller by a further factor of β; the three-periods-ahead forecast is smaller yet. In the limit, X is expected eventually to converge to its mean which, in this case, is equal to zero.
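These hand calculations, and the forecasts that follow, can be replicated in Python using the same numbers as the text:

```python
beta = 0.75
x = 100.0                    # X_0
shocks = [20, -30, 10, 15]   # e_1, ..., e_4, as in the text
path = []
for e in shocks:
    x = beta * x + e         # X_t = 0.75 X_{t-1} + e_t
    path.append(x)
print(path)                  # [95.0, 41.25, 40.9375, 45.703125]

# Beyond the data, E(X_{t+a} | X_t, ...) = beta**a * X_t
forecasts = [beta ** a * path[-1] for a in (1, 2, 3)]
print(forecasts)             # roughly [34.28, 25.71, 19.28]
```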
The idea of an autoregressive model can be extended to include lags reaching farther back than one period. In general, a process is said to be AR(p) if
Xt = β1 Xt−1 + β2 Xt−2 + · · · + βp Xt−p + et .
As before, we will assume that the process is stationary so that it has a constant
mean, variance, and autocovariance.5
Usually, economic theory is silent on the number of lags to include. The matter
is usually an econometric one: AR models with more lags can accommodate richer
dynamics. Further, adding lags makes the residuals closer to white noise, a feature
which aids in hypothesis testing. Occasionally economic theory implies a model
with a specific number of lags. Paul Samuelson’s multiplier-accelerator model is
an example of AR(2) process from economic theory. Beginning with the GDP
accounting identity for a closed economy with no governmental expenditure,
Yt = Ct + It ,
suppose that consumption depends on the previous period's income, and that investment responds to the change in consumption:
Ct = β0 + β1 Yt−1
It = β2 (Ct − Ct−1 ) + et .
Substituting these into the identity, and noting that Ct − Ct−1 = β1 (Yt−1 − Yt−2 ), yields
Yt = β0 + β1 (1 + β2 )Yt−1 − β1 β2 Yt−2 + et
or,
Yt = α0 + α1 Yt−1 + α2 Yt−2 + et
with the α's properly defined. Samuelson's model accommodates different kinds of dynamics (dampening, oscillating, and so on), depending on the estimated parameters.
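A short Python sketch shows how the structural parameters map into the AR(2) coefficients and how the implied dynamics can oscillate while dampening. The values of b1 (the marginal propensity to consume) and b2 (the accelerator) below are hypothetical, chosen only to produce damped oscillations:

```python
b1, b2 = 0.5, 1.0        # hypothetical MPC and accelerator, not estimates from the text
a1 = b1 * (1 + b2)       # coefficient on Y_{t-1}
a2 = -b1 * b2            # coefficient on Y_{t-2}

# Impulse response of the implied AR(2): Y_t = a1 Y_{t-1} + a2 Y_{t-2} + e_t
irf = [1.0, a1]
for _ in range(10):
    irf.append(a1 * irf[-1] + a2 * irf[-2])
print(irf)  # [1.0, 1.0, 0.5, 0.0, -0.25, -0.25, -0.125, ...]: oscillating and dampening
```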
5 For AR(p) models, the requirements for stationarity are a little more stringent than they are for AR(1) processes. Necessary conditions include that the βs each be less than one in magnitude, that they not sum to anything greater than plus or minus one, and that they not be more than one unit apart. We will explore the stationarity restrictions at greater length in Chap. 4.
2.3 AR(p) Models 27
Notice that Stata requires us to tell it which lags to include. The option ar(1/3)
or lags(1/3) tells it that we want the first through third lags. If we had typed
ar(3) or lags(3), it would have estimated an AR(3) model where the first two
lags were set to zero:
Exercises
Let's practice estimating AR(p) models using the dataset ARexamples.dta. The dataset consists of three variables (X, Y, and Z) and a time variable.
Xt = β1 Xt−1 + et .
Yt = β1 Yt−1 + β2 Yt−2 + et .
Verify that the coefficients are approximately: β̂1 ≈ 0.70 and β̂2 ≈ 0.20.
5. Using the ARexamples.dta dataset, graph the last 100 observations of Z over
time. Using all of the observations, estimate the AR(3) model,
Verify that the coefficients are approximately: β̂1 ≈ 0.60, β̂2 ≈ 0.20, and β̂3 ≈ 0.10.
6. Using the ARexamples.dta dataset, estimate the AR(3) model,
Zt = β1 Zt−1 + β3 Zt−3 + et .
Notice that this is a restricted model, where the coefficient on the second lag is
set to zero. Verify that the estimated coefficients are approximately: β̂1 ≈ 0.70, and β̂3 ≈ 0.20.
The IRFs of AR(p) processes are only slightly more complicated than those for an
AR(1) but the calculation procedure is essentially the same.
Suppose that X follows an AR(3) process,
To calculate the IRF of this particular AR(3) model, let us assume as before, that
Xt and et were equal to zero for every period up until and including period zero.
Now, in period t = 1, X1 receives a shock of one unit via e1 (that is, e1 = 1). Let us
trace out the effect of this one-period shock on X1 and subsequent periods:
Just as in the case of the AR(1) process, the effect of the shock lingers on, but the
effects decay.
We didn't estimate this model; we posited it. So we can't use irf after arima to
automatically draw the impulse response functions. But we can get Stata to calculate
the IRF’s values by typing:
The last line above calculated the response, from the first period after the one-
unit shock through to the last observation. (Stata denotes the last observation with a
capital “L.”) The results of this procedure are:
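The same iterative procedure works for any AR(p) model and is easy to code; in the Python sketch below the AR(3) coefficients are hypothetical, for illustration only:

```python
def ar_irf(betas, horizon):
    """IRF of an AR(p): X is zero before period 0; e_0 = 1 is a one-time, one-unit shock."""
    x = []
    for t in range(horizon + 1):
        shock = 1.0 if t == 0 else 0.0
        # lag i+1 carries coefficient betas[i]; skip lags that reach before period 0
        lags = sum(b * x[t - 1 - i] for i, b in enumerate(betas) if t - 1 - i >= 0)
        x.append(lags + shock)
    return x

print(ar_irf([0.6, 0.2, 0.1], 8))  # hypothetical coefficients: lingering but decaying response
```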
Exercises
1. Calculate by hand the IRFs out to five periods for the following AR models:
(a) Xt = 0.5Xt−1 + et
(b) Xt = −0.5Xt−1 + et
(c) Xt = 0.5Xt−1 − 0.10Xt−2 + et
(d) Xt = 0.10 + 0.5Xt−1 − 0.20Xt−2 + et
(e) Xt = Xt−1 + et
Explain how the dynamics change as the coefficients change, paying special
attention to negative coefficients. Given the IRFs you calculated, do these all
seem stationary? Why or why not?
2.3.3 Forecasting
Given that we have estimated an AR(p) model, how can we use it to forecast future
values of Xt ? In the same way that we did from an AR(1) model: iteratively. Let us
work out a simple example by hand. Stata can calculate more extensive examples
quite quickly—and we will see how to do this—but first it will be instructive to do
it manually.
Suppose we estimated the following AR(3) model:
Xt = 0.75Xt−1 + 0.50Xt−2 + 0.10Xt−3 + et . (2.11)
Suppose further that X1 = 5, X2 = −10 and X3 = 15. Given these values, what
is the expected value of X4 ? That is, what is: E(X4 | X3 , X2 , X1 , . . .)? And rather
than specifying all of those conditionals in the expectation, let’s use the following
notation: let E3 (.) denote the expectation conditional on all information up to and
including period 3.
E(X4 | X3 , X2 , X1 , . . .) = E3 (X4 )
= E3 (0.75X3 + 0.50X2 + 0.10X1 + e4 )
= 0.75X3 + 0.50X2 + 0.10X1 + E3 (e4 )
= 0.75(15) + 0.50(−10) + 0.10(5) + E3 (e4 )
= 11.25 − 5 + 0.5 + 0
= 6.75.
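A quick Python check of this hand calculation, iterating the forecasts as far out as we like:

```python
# Estimated model from the text: X_t = 0.75 X_{t-1} + 0.50 X_{t-2} + 0.10 X_{t-3} + e_t
b1, b2, b3 = 0.75, 0.50, 0.10
x = {1: 5.0, 2: -10.0, 3: 15.0}   # the given data

# Iterated forecasts: future shocks have expectation zero, and earlier
# forecasts stand in for not-yet-observed values of X.
for t in range(4, 8):
    x[t] = b1 * x[t - 1] + b2 * x[t - 2] + b3 * x[t - 3]
print(x[4], x[5])   # 6.75 11.5625
```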
Given this expected value E3 (X4 ) = 6.75, we can use it to help us make forecasts farther out, at X5 and beyond. We proceed in the same fashion as before. The expected value two periods out, at X5 , is
E3 (X5 ) = E3 (0.75X4 + 0.50X3 + 0.10X2 + e5 )
= 0.75(6.75) + 0.50(15) + 0.10(−10) + 0
= 11.5625.
Stata can automate these calculations for us. If you’d like to forecast four periods
out, add four blank observations to the end of your dataset. After estimating the
model, use Stata’s predict command to calculate the forecasts. For example, using
the ARexamples.dta dataset, let’s estimate an AR(3) model on the variable Z,
and then forecast out to four periods.
We begin by loading the data and estimating the model.
Finally, we use the predict command to have Stata calculate the forecasts:
Exercises
1. Consider the model described in Eq. (2.11). In the text, we forecasted out to
periods four and five. Now, forecast out from period six through period ten. Graph
these first ten observations on Xt . Does Xt appear to be mean-stationary?
2. Estimate an AR(3) model of the variable Z found in ARexamples.dta. Verify
by hand Stata’s calculations for the four-periods out forecast of 0.526421 that was
reported in our last example.
ARMA models are composed of two parts, the second of which is called a Moving
Average (or “MA”) model. AR models had autocorrelated X’s because current X
depended directly upon lagged values of X. MA models, on the other hand have
autocorrelated X’s because the errors are, themselves, autocorrelated.
The simplest type of MA model is:
Xt = et (2.12a)
et = ut + βut−1 (2.12b)
ut ∼ iidN(0, σu² ) (2.12c)
so that, substituting,
Xt = ut + βut−1 . (2.13)
2.4 MA(1) Models 33
It will be useful to differentiate the errors (et ) from the random shocks (ut ).
The error terms (et ) are autocorrelated. The shocks (ut ) are presumed to be white
noise. That is, each ut is drawn from the same Normal distribution, independently
of all the other draws of u in other time periods; thus, we say that the ut ’s are
independent and identically distributed from a Normal distribution.
Such a model is called an MA(1) model because the shock shows up in Eq. (2.13)
with a lag of one. The important thing to note is that the action in this model lies in
the fact that the errors have a direct effect on X beyond the immediate term. They
have some inertia to them.
Notice that E(ut ut−1 ) is equivalent to E(ut−1 ut−2 ) because of stationarity. Also, recall that ut ∼ iidN(0, σu² ), so that E(u²t ) = σu² . Since the ut are all independent of each other, it will always be the case that E(ut uj ) = 0 for all t ≠ j .
Since the errors (et ) are autocorrelated, X is also autocorrelated. What
is the nature of this autocorrelation? At what lags is X autocorrelated? In other
words, what is the autocorrelation function (ACF) of this MA(1) process?
2.4.1 Estimation
How does one estimate an MA(1) model in Stata? MA(1) models look like:
Xt = ut + βut−1 .
That is, X is a function, not of past lags of itself, but of past lags of unknown error
terms. Thus, we cannot create a lagged-X variable to regress upon.
To estimate an MA(1) model in Stata, we can use the now-familiar arima
command, with the ma(1) option:
Include nocons only in those cases where the AR or MA process has a mean of
zero. If you graph the data and find that it doesn’t hover around zero, then leave out
the nocons option.
Xt = ut + 0.75ut−1
Let us presume that X and e have been equal to zero for every period, up until what
we will call period t = 1, at which point X1 receives a one-time shock equal to one
unit, via u1 . In other words, u1 = 1, u2 = 0, u3 = 0 and so forth. Let us trace out
the effects of this shock:
X1 = u1 + 0.75(u0 ) = 1 + 0.75(0) = 1
X2 = u2 + 0.75(u1 ) = 0 + 0.75(1) = 0.75
X3 = u3 + 0.75(u2 ) = 0 + 0.75(0) = 0
X4 = u4 + 0.75(u3 ) = 0 + 0.75(0) = 0
and so forth.
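The same trace in Python:

```python
beta = 0.75
u = [0.0] * 10
u[1] = 1.0   # one-time unit shock in period 1
x = [u[t] + beta * u[t - 1] if t > 0 else u[t] for t in range(10)]
print(x[:5])  # [0.0, 1.0, 0.75, 0.0, 0.0]: the effect vanishes after one lag
```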
Thus we see that the IRF of an MA(1) process is quite short-lived. In fact, we
will see shortly that the IRF of an MA(q) process is only non-zero for q periods.
The practical implication of this is that a one-time shock to an MA process does not
have lasting effects (unlike with an AR process). This has significant implications
for economic policy. For example, if GDP follows an AR process, then the one-
time shock of, say, the Arab oil embargo of 1973, will still influence the economy
35 years later in 2018. On the other hand, if memories are short, as in an MA
process, then the economy recovers quickly, and we no longer suffer the effects
of that economic shock. Once repealed, bad financial regulations, for example, will
have a temporary—but only temporary—effect on financial markets if such markets
are MA processes.
2.4.3 Forecasting
The iterative process for forecasting from MA(1) models is complicated by the fact
that we are not able to directly use previous lagged X’s in helping us predict future
X’s.
Let us work concretely with a simple MA(1) model:
Xt = ut + βut−1 .
And let us suppose that we have 100 observations of data on X, extending back from
t = −99 through t = 0. Now we find ourselves at t = 0 and we wish to forecast
next period’s value, X1 . First, we estimate the parameter β, and let’s suppose that
β̂ = 0.50. Given the data and our estimated model, we can calculate the residuals
from t = −99 through t = 0. These will be our best guess as to the actual errors
(residuals approximate errors), and using these, we can forecast X1 . In other words,
the procedure is:
1. Estimate the MA(1) model.
2. Calculate the fitted values of the model.
3. Calculate the residuals (r) between the data and the fitted values.
4. Use the final residual as our best guess of the final shock, so that the one-step-ahead forecast is β̂ times that residual.
Stata has filled in the missing observation with the predicted value:
In the last time period (t = 3000) the value of X3000 is -0.59472139, and the
predicted value of X3000 is -0.1318733, so the residual is -0.4628481. We can use
the residual as our best guess for the error, and calculate the expectation of X3001
conditional on the previous period’s residual:
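The residual-based logic can be sketched in Python. To keep the sketch short, the data are simulated and β is taken as known rather than estimated (an assumption made for brevity; in practice β̂ would come from the estimation step):

```python
import random

random.seed(7)
beta = 0.50                    # treated as known here (assumption), normally estimated
u = [random.gauss(0, 1) for _ in range(200)]
x = [u[t] + beta * u[t - 1] for t in range(1, 200)]   # MA(1) data

# Recover the shocks recursively from the data: u_t = X_t - beta * u_{t-1},
# starting from a guess of zero; the influence of the guess dies out quickly.
r = 0.0
for xt in x:
    r = xt - beta * r          # residual, our best guess of the current shock
forecast = beta * r            # E(X_{T+1} | data) = beta * u_T
print(forecast)
```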
Moving Average models can be functions of lags deeper than 1. The general form
of the Moving Average model with lags of one through q, an MA(q) model, is:
Xt = ut + β1 ut−1 + β2 ut−2 + · · · + βq ut−q = Σ_{i=0}^q βi ut−i , (2.14)
where β0 = 1.
2.5.1 Estimation
It is easy to see that the MA(1) process we were working with in the previous section
is a special case of the general MA(q) process, where β2 through βq are equal to
zero.
We can use Stata’s arima command to estimate MA(q) models. The general
format is:
Example
Using MAexamples.dta let’s calculate an MA(3) model on the variable Y.
Since the coefficient on ut−3 is not significant at the 0.05 significance level, a case
could be made for dropping that lag and estimating an MA(2) model instead.
Exercises
Use MAexamples.dta to answer the following questions.
1. A moment ago we estimated an MA(3) model on Y and found that the third
lag was statistically insignificant at the 0.05 level. Drop that lag and estimate an
MA(2) model instead. Write out the estimated equation. You should be able to verify that β̂1 ≈ 0.69 and β̂2 ≈ 0.20.
2. Estimate an MA(3) model on Z. Write out the estimated equation. Are all the lags
statistically significant? You should be able to verify that β̂1 ≈ 0.60, β̂2 ≈ 0.20, and β̂3 ≈ 0.05.
Calculating the IRF for an MA(q) process is quite straightforward. Suppose that X
follows an MA(q) process such as:
2.6 Non-zero ARMA Processes 39
Xt = et + β1 et−1 + β2 et−2 + · · · + βq et−q = Σ_{i=0}^q βi et−i ,
with β0 = 1.
Suppose, as before, that all the e's (and therefore all the Xs) are equal to zero, up
until what we will call period k. In period k, ek = 1, a one-time one-unit shock,
after which the e’s return to being zero (i.e., ek+1 = ek+2 = ek+3 = . . . = 0). Let
us trace out the effects of this one-time shock:
Xk = 1
Xk+1 = β1
Xk+2 = β2
···
Xk+q = βq ,
after which the series is once again at its equilibrium level of zero and the effects of
the one-time shock are completely eradicated from the economy.
Absent any seasonality, the βs are usually smaller at further lags; for example, it
would be odd for an event two periods ago to have a larger effect, on average, than
events only one period ago.
By now we have, hopefully, become familiar with zero-mean AR(p) processes. You
might have been wondering, though, why do we pay so much attention to a process
with zero mean? Isn’t that assumption very restrictive? How many things in life
have an average value of zero, anyway?!
While many processes have a zero mean, many more do not. GDP and GNP don't vary around zero. Nor do the unemployment rate, the discount rate, or the Federal Funds rate. It turns out that the zero-mean assumption makes understanding the
crucial concepts behind time-series modeling much clearer. It also turns out that
the zero-mean assumption isn’t all that critical, and it is really easy to drop that
assumption altogether.
Consider an AR(1) process with a constant, β0 :
Xt = β0 + β1 Xt−1 + et .
Taking expectations of both sides, and using the fact that stationarity implies E(Xt ) = E(Xt−1 ),
E (Xt ) = β0 + β1 E (Xt )
E (Xt ) − β1 E (Xt ) = β0
E (Xt ) (1 − β1 ) = β0
E (Xt ) = β0 / (1 − β1 ).
More generally, for an AR(p) process with a constant, the mean is
E (Xt ) = β0 / (1 − β1 − β2 − · · · − βp ).
Notice that β0 does not show up in Eq. (2.15). Thus, adding a constant (β0 ) changes
the mean but it does not affect the variance.
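A simulation illustrates the formula; with hypothetical values β0 = 10 and β1 = 0.50, the sample mean should settle near 10/(1 − 0.50) = 20 (a Python sketch, not the book's Stata code):

```python
import random

random.seed(42)
beta0, beta1 = 10.0, 0.50
x, xs = 0.0, []
for t in range(60000):
    x = beta0 + beta1 * x + random.gauss(0, 1)
    if t >= 1000:              # drop a burn-in so early observations don't contaminate the mean
        xs.append(x)
print(sum(xs) / len(xs))       # close to beta0 / (1 - beta1) = 20
```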
Consider now an MA(1) process with a constant:
Xt = α + ut + βut−1 ,
with ut ∼ N(0, σu² ). The constant α allows the mean of the process to be non-zero. What are the features of this type of MA(1) model? What is the mean of such a process?
E (Xt ) = E (α + ut + βut−1 )
= α + E (ut ) + βE (ut−1 )
= α + 0 + β (0)
= α.
The rather straightforward result is that the mean of an MA(1) process is equal to the intercept. This generalizes to any MA(q) process:
E (Xt ) = E(α + ut + β1 ut−1 + β2 ut−2 + · · · + βq ut−q )
= α + E (ut ) + β1 E (ut−1 ) + β2 E (ut−2 ) + · · · + βq E(ut−q )
= α + 0 + β1 (0) + β2 (0) + · · · + βq (0)
= α.
What about the variance of such a process? Beginning with the definition of variance,
Var (Xt ) = E[(Xt − E (Xt ))² ]
= E[(ut + β1 ut−1 + β2 ut−2 + · · · + βq ut−q )² ]
= E(u²t ) + β1² E(u²t−1 ) + · · · + βq² E(u²t−q )
= σu² (1 + β1² + β2² + · · · + βq² ).
We moved from the second to the third line because, since the ut are white noise at all t, there is no covariance between ut and ut−j , so the expectations of the cross-product terms are zero; the βs come out of the expectations because α and β are not random variables.
Notice that the variance does not depend on the added constant (α). That is,
adding a constant affects the mean of an MA process, but does not affect its variance.
If we are presented with an AR process that doesn’t have a mean of zero, how do
we accommodate it? We could directly estimate a model with an intercept.
Alternatively, we could de-mean the data: estimate the average and subtract this average from each of the observations. Then we can estimate an AR process in the de-meaned variables without an intercept. Let's see exactly why this is the case.
Suppose we have a random variable, Xt , which does not have a mean of zero,
but a mean of, say, X̄. The fact that there is no time subscript on X̄ indicates that
the mean is constant; it does not depend on the time period t. That is, Xt is a mean-
stationary process, with a non-zero mean.
If we subtract the mean (X̄) from Xt , we get the de-meaned variable
X̃t = Xt − X̄. (2.16)
Subtracting a constant shifts our variable (changes its mean) but does not affect the
dynamics nor the variance of the process.
This has a deeper implication. We've been talking all along about a zero-mean process Xt . We can now see that such an Xt can be thought of as the deviations of a variable from its mean. That is, we've been modeling the departures from the average value all along.
It is easy to show that de-meaning the variables changes the model from an AR(1)
with a constant to our more familiar zero-mean AR(1) process. Beginning with a
non-zero AR(1) process,
Xt = β0 + β1 Xt−1 + et (2.18)
and substituting Xt = X̃t + β0 /(1 − β1 ) on both sides, we have
X̃t + β0 /(1 − β1 ) = β0 + β1 [X̃t−1 + β0 /(1 − β1 )] + et
X̃t = β0 − β0 /(1 − β1 ) + β1 β0 /(1 − β1 ) + β1 X̃t−1 + et
X̃t = [β0 (1 − β1 ) − β0 + β1 β0 ] / (1 − β1 ) + β1 X̃t−1 + et
X̃t = 0 + β1 X̃t−1 + et
X̃t = β1 X̃t−1 + et .
De-meaning the variables transforms the non-zero AR(1) process (i.e. one with
a constant) to a zero-mean AR(1) process (i.e. one without a constant).
The moral is the following: whenever you are looking at a zero-mean AR(p) process, just remember that the Xs can be thought of as deviations of some variable from its mean.
Example
We can illustrate these two “solutions” using some simulated data in Stata. First,
let’s generate our non-zero data:
7 We do this because the earlier data have not yet converged to their long-run level. By keeping
only the later observations, we ensure that the earlier data do not contaminate our analysis. It is
probably overkill to drop so many of our initial observations, but we’re playing with lots of fake
data anyway. . .
[Fig. 2.7: The last 100 observations of simulated non-zero AR(1) data, X = 10 + 0.50*L.X + e; the series hovers around 20]
The second approach is to estimate the sample mean (X̄), subtract this mean from
the data (X̃t = Xt − X̄) so that they are centered over zero and then estimate the
AR model without a constant:
Notice that the estimated coefficients are virtually identical in the two
approaches. Which approach should you use? The first approach: directly estimate
the constant. By manually de-meaning, Stata doesn’t know that you’ve subtracted
an estimate. It cannot adjust its standard errors to reflect this additional bit of
uncertainty.
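The equivalence can be demonstrated numerically. The Python sketch below (an illustration, not the book's Stata code) estimates an AR(1) by OLS twice, once with a constant and once on de-meaned data without one, and finds virtually identical slope estimates:

```python
import random

def ols(y, cols):
    """OLS via the normal equations, for one or two regressor columns."""
    n, k = len(y), len(cols)
    xtx = [[sum(cols[i][t] * cols[j][t] for t in range(n)) for j in range(k)] for i in range(k)]
    xty = [sum(cols[i][t] * y[t] for t in range(n)) for i in range(k)]
    if k == 1:
        return [xty[0] / xtx[0][0]]
    det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
    return [(xty[0] * xtx[1][1] - xtx[0][1] * xty[1]) / det,
            (xtx[0][0] * xty[1] - xtx[1][0] * xty[0]) / det]

random.seed(1)
x = [20.0]
for _ in range(5000):
    x.append(10 + 0.5 * x[-1] + random.gauss(0, 1))   # non-zero AR(1), mean 20
y, lag = x[1:], x[:-1]

const, b_with = ols(y, [[1.0] * len(y), lag])          # approach 1: include a constant
m = sum(x) / len(x)
b_demeaned = ols([v - m for v in y], [[v - m for v in lag]])[0]   # approach 2: de-mean first
print(b_with, b_demeaned)   # virtually identical slope estimates
```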
2.7 ARMA(p,q) Models
An ARMA(p,q) model combines the two components:
Xt = β1 Xt−1 + · · · + βp Xt−p + ut + γ1 ut−1 + · · · + γq ut−q .
It has p lags of X and q lags of shocks. We did have a slight change of notation.
Before, when we were discussing simple AR and MA models separately, all of our
coefficients were βs. Now that we’re estimating models that mix the two, it’ll be
easier for us to use βi for the i’th lagged AR coefficient, and γj for the j-th lagged
MA coefficients.
2.7.1 Estimation
or
If for some reason we wished to leave out some lags, then we proceed as before:
we list only the lags we want. For example, the command:
estimates:
2.8 Conclusion
We have learned about AR and MA processes, the two basic components of ARMA models. We have learned what they are, how to estimate them, how they describe different reactions to shocks, and how to use them for forecasting. What we haven't figured out yet, however, is how to tell whether to estimate one type of model or another. Given a dataset, should we model it as an AR process, an MA process, or a combination of the two? To answer this question, we need to delve
process, or a combination of the two? To answer this question, we need to delve
deeper into some additional characteristics of AR and MA processes. AR and MA
processes imply different patterns of correlation between a variable and its own
previous values. Once we understand the types of autocorrelation patterns associated
with each type of process, we are in a better position to tell what type of model we
should estimate. We turn to this in the next chapter.
3 Model Selection in ARMA(p,q) Processes
In practice, the form of the underlying process that generated the data is unknown.
Should we estimate an AR(p) model, an MA(q) model, or an ARMA(p,q) model?
Moreover, what lag lengths of p and q should we choose? We simply do not have
good a priori reason to suspect that the data generating process is of one type or
another, or a combination of the two. How is a researcher to proceed? Which sort of
model should we estimate?
It is often impossible to tell visually whether a time series is an AR or an MA
process. Consider Fig. 3.1 which shows four time series: an AR(1) process, an
MA(1), and two ARMA(p,q) processes. Which one is which? It is impossible to
tell visually. We need something a bit more formal, something that relies on the
differing statistical processes associated with AR and MA models.
The classic Box and Jenkins (1976) procedure is to check whether a time series
mimics the properties of various theoretical models before estimation is actually
carried out. These properties involve comparing the estimated autocorrelation
functions (ACFs) and partial autocorrelation functions (PACFs) from the data, with
the theoretical ACFs and PACFs implied by the various model types. A more recent
approach is to use various “information criteria” to aid in model selection. We will
discuss each of these in turn. We begin with deriving the theoretical ACFs and
PACFs for AR(p) and MA(q) processes. Once we know the tell-tale signs of these
processes, then we can check whether our data correspond to one or both of these
processes. Then we estimate the model. The Box-Jenkins procedure is concluded by
verifying that the estimated residuals are white noise. This implies that there is no
leftover structure to the data that we have neglected to model. If the residuals are not
white noise, then Box and Jenkins recommend modifying the model, re-estimating,
and re-examining the residuals. It is a complicated process. But the central part
in their procedure compares the autocorrelation structure from the data with the
autocorrelation implied theoretically by various processes. We turn to this now.
[Fig. 3.1: Four simulated time series over 30 periods: an MA(1) with beta = 0.50, an AR(1) with beta = 0.50, and two ARMA(p,q) processes, one of them an ARMA(1,1) with beta1 = 0.25 and gamma1 = 0.25]
ACFs and PACFs each come in two flavors: theoretical and empirical. The former
is implied by a model; the latter is a characteristic of the data. We can compare (a)
the empirical ACFs and PACFs that we estimate directly from data without using a
model, with (b) the theoretical ACFs and PACFs that are associated with a particular
model. Then, we only need to see how they match up. That is, we can be fairly
certain that the data were generated from a particular type of process (model) if the
empirical ACF matches up with that of a particular model’s theoretical ACFs.
We’ll proceed as follows. First, we’ll derive the theoretical ACFs and PACFs for
AR processes and then for MA processes. Then we’ll see how to estimate the ACFs
and PACFs directly from the data. And then we’ll see how we can match the two.
Suppose that X follows the AR(1) process
Xt = βXt−1 + et . (3.1)
3.1 ACFs and PACFs 49
An ACF is a description of how Xt is correlated with its first lag, its second lag,
through to its k-th lag. To find the theoretical ACF for an AR(1) process, let’s
derive the values of Corr(Xt , Xt−1 ), Corr(Xt , Xt−2 ), . . . Corr(Xt , Xt−k ), under
the assumption that Eq. (3.1) is true.
We will make use of the following:
E (et ) = 0 (3.2)
Var (et ) = σ² (3.3)
Since the AR(1) process is stationary, then Stdev (Xt ) = Stdev (Xt−1 ), so Eq. (3.6)
simplifies to
Corr(Xt , Xt−1 ) = E(Xt Xt−1 ) / E(Xt² ). (3.10)
Now, let’s look further at the numerator in (3.10). Take Xt = βXt−1 + et and
multiply both sides by Xt−1 :
Corr (Xt , Xt−1 ) = βE (Xt−1 Xt−1 ) / E(Xt² ). (3.11)
By stationarity, E (Xt−1 Xt−1 ) = E (Xt Xt ) = E(Xt² ), so (3.11) simplifies to
Corr (Xt , Xt−1 ) = βE(Xt² ) / E(Xt² ) = β. (3.12)
Corr (Xt , Xt−2 ) = E (Xt Xt−2 ) / Var (Xt ). (3.13)
Let us now focus on the numerator. Since Xt = βXt−1 + et , then multiplying both
sides by Xt−2 gives
By stationarity, E(Xt−1 Xt−2 ) = E(Xt Xt−1 ), which we know from our work computing the lag-1 autocorrelation several lines previously is equal to βE(Xt² ). Substituting, we get
E (Xt Xt−2 ) = β² E(Xt² ) = β² Var (Xt ) . (3.14)
Notice that Xt is correlated with Xt−2 even though it is not explicitly a function
of Xt−2 , i.e. even though Xt−2 does not appear in the definition of an AR(1) process:
Xt = βXt−1 + et .
Continuing in this fashion, we find that, in general,
Corr(Xt , Xt−k ) = β^k .
Thus, Corr(Xt , Xt−1 ) = β, Corr(Xt , Xt−2 ) = β², Corr(Xt , Xt−3 ) = β³, and so forth.
Thus, even a simple AR process with one lag can induce an outcome where each
observation of X will be correlated with long lags of itself.
Notice that the ACF of an AR(1) process decays exponentially. If β is a positive
number then it will decay toward zero.1 If β is a negative number, then it will still
Fig. 3.2 Theoretical ACF of AR(1): Xt = β1 Xt−1 + et . (a) β1 = 0.50. (b) β1 = −0.50
converge toward zero, but it will oscillate between negative and positive numbers.
(Raising a negative number to an even power makes it positive.) The ACF for a
positive β1 has the characteristic shape shown in Fig. 3.2a. The ACF for a negative
β1 oscillates, such as in Fig. 3.2b.
Regardless of whether the ACF oscillates or not, it is still the case that today’s
value of Xt is correlated with values from its past. That is, even though Xt is not
directly determined by Xt−2 or Xt−3 (they are not terms in Xt = βXt−1 + et ), Xt
is correlated with its past, but old values have increasingly faint impact.
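A simulation confirms the pattern: the empirical ACF of AR(1) data with β = 0.50 tracks β^k (a Python sketch):

```python
import random

def acf(series, k):
    """Empirical autocorrelation at lag k for a zero-mean series."""
    num = sum(series[t] * series[t - k] for t in range(k, len(series)))
    den = sum(v * v for v in series)
    return num / den

random.seed(3)
beta, x, xs = 0.50, 0.0, []
for _ in range(60000):
    x = beta * x + random.gauss(0, 1)
    xs.append(x)
for k in (1, 2, 3):
    print(k, acf(xs, k), beta ** k)   # empirical vs theoretical: close at every lag
```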
Consider now the more general AR(p) process:
Xt = β1 Xt−1 + β2 Xt−2 + · · · + βp Xt−p + et . (3.15)
What is its autocorrelation function? That is, what are Corr(Xt , Xt−1 ), and Corr(Xt , Xt−2 ), through Corr(Xt , Xt−k )?
Beginning with the definition of autocorrelation at arbitrary lag k, we can use the stationarity of the standard deviation, and the IID assumption for the et 's, to arrive at:
Corr(Xt , Xt−k ) = E (Xt Xt−k ) / E(Xt² ). (3.16)
Thus, our task is to derive expressions for each of these autocorrelations at each
lag k.
We will attack this problem piece by piece, focusing on the numerator and
denominator of (3.16) separately. We will begin with the numerator, i.e. with the
autocovariances.
We can solve this problem using a system of equations called Yule-Walker
equations. To find the first such equation, multiply both sides of (3.15) by Xt and
take the expectation:
E (Xt Xt ) = E[(β1 Xt−1 + β2 Xt−2 + · · · + βp Xt−p + et )Xt ]
= E(β1 Xt−1 Xt + β2 Xt−2 Xt + · · · + βp Xt−p Xt + et Xt )
= β1 E (Xt−1 Xt ) + β2 E (Xt−2 Xt ) + · · · + βp E(Xt−p Xt ) + E (et Xt ) . (3.17)
The only term in (3.17) that looks a little new is the last one, E (et Xt ); it is the
only term that is not an autocovariance. Let’s look at that term a bit more closely.
Multiplying (3.15) by et and taking expectations:
E (Xt et ) = E(β1 Xt−1 et + β2 Xt−2 et + · · · + βp Xt−p et + et et )
= β1 E (Xt−1 et ) + β2 E (Xt−2 et ) + · · · + βp E(Xt−p et ) + E (et et ) .
Since the current shock et is independent of the earlier Xs, each of the first p terms is zero, leaving E (Xt et ) = E(e²t ) = σ² .
It is time to take stock. For notational simplicity, let’s denote the variance and
each of the autocovariances with φs:
E (Xt Xt ) = φ0
E (Xt Xt−1 ) = φ1
E (Xt Xt−2 ) = φ2
...
E (Xt Xt−k ) = φk
...
Using this notation, we can re-write Eqs. (3.18) through (3.21) as:
φ0 = β1 φ1 + β2 φ2 + · · · + βp φp + σ²
φ1 = β1 φ0 + β2 φ1 + · · · + βp φp−1
φ2 = β1 φ1 + β2 φ0 + · · · + βp φp−2
...
φp = β1 φp−1 + β2 φp−2 + · · · + βp φ0 (3.22)
...
For an AR(2) process, for example, these equations reduce to
φ0 = β1 φ1 + β2 φ2 + σ²
φ1 = β1 φ0 + β2 φ1 (3.23)
φ2 = β1 φ1 + β2 φ0 . (3.24)
The last two lines establish a recursive pattern: φs = β1 φs−1 + β2 φs−2 . With these
autocovariances, we are prepared to derive the autocorrelations.
Equation (3.23) simplifies to
φ1 = [β1 / (1 − β2 )] φ0 , (3.25)
so that
Corr (Xt , Xt−1 ) = Cov (Xt , Xt−1 ) / Var (Xt ) = φ1 /φ0 = [β1 /(1 − β2 )]φ0 /φ0 = β1 /(1 − β2 ). (3.26)
Substituting (3.25) into (3.24) gives
φ2 = β1 [β1 /(1 − β2 )]φ0 + β2 φ0 , (3.27)
More generally, dividing the recursion φk = β1 φk−1 + β2 φk−2 through by φ0 gives the autocorrelations:
φk /φ0 = β1 φk−1 /φ0 + β2 φk−2 /φ0 . (3.29)
For example, at k = 3,
φ3 /φ0 = β1 φ2 /φ0 + β2 φ1 /φ0
or
Corr (Xt , Xt−3 ) = β1 Corr (Xt , Xt−2 ) + β2 Corr (Xt , Xt−1 ) . (3.30)
For the more general case of an AR(p) process, we can employ a similar strategy, solving the Yule-Walker equations recursively.
Example
Using Eqs. (3.26)–(3.30), let’s solve for the Theoretical ACF implied by the
following AR(2) process:
Xt = 0.50Xt−1 + 0.20Xt−2 + et .
Corr (Xt , Xt−1 ) = β1 / (1 − β2 ) = 0.50 / (1 − 0.20) = 0.625.
Corr (Xt , Xt−2 ) = β1² / (1 − β2 ) + β2 = 0.50² / (1 − 0.20) + 0.20 = 0.5125,
and so forth. Today’s value of X is increasingly less correlated with values farther
back in time.
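As a cross-check on these hand calculations, the recursion ρk = β1 ρk−1 + β2 ρk−2 with ρ1 = β1/(1 − β2) is easy to compute. The short Python sketch below (Python rather than the book's Stata, purely for illustration; the function name is our own) generates the theoretical ACF of an AR(2) process:

```python
def ar2_acf(b1, b2, nlags):
    """Theoretical ACF of X_t = b1*X_{t-1} + b2*X_{t-2} + e_t.

    rho_0 = 1, rho_1 = b1/(1 - b2), and thereafter the Yule-Walker
    recursion rho_k = b1*rho_{k-1} + b2*rho_{k-2}.
    """
    rho = [1.0, b1 / (1.0 - b2)]
    for _ in range(2, nlags + 1):
        rho.append(b1 * rho[-1] + b2 * rho[-2])
    return rho

acf = ar2_acf(0.50, 0.20, 5)
print(round(acf[1], 4))  # 0.625
print(round(acf[2], 4))  # 0.5125
```

The same function answers the exercises below by changing the signs of b1 and b2.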
Exercises
1. Using Eqs. (3.26)–(3.30), calculate the first three lags of the theoretical ACFs
implied by the following AR(2) processes:
(a) Xt = 0.50Xt−1 − 0.20Xt−2 + et
(b) Xt = −0.50Xt−1 − 0.20Xt−2 + et
Now consider an MA(1) process:

Xt = ut + βut−1, (3.31)

with ut ∼ iidN(0, σu²). That is, the u error terms are white noise, independent of each other, so that E(ut us) = 0 whenever t ≠ s. Therefore, E(Xt) = E(ut) + βE(ut−1) = 0.
What is X's ACF at lag 1? In symbols, we need to figure out the value of:

Corr(Xt, Xt−1) = Cov(Xt, Xt−1)/Var(Xt). (3.34)

In order to answer this question, we need to know the variance of Xt and the covariance of Xt and Xt−1. Let us take a brief detour to answer these intermediate questions.
We begin by calculating the variance of Xt:

Var(Xt) = E(Xt²) = E[(ut + βut−1)²] = E(ut²) + 2βE(ut ut−1) + β²E(ut−1²)
= σu² + 0 + β²σu²
= (1 + β²)σu². (3.35)

Next, the covariance of Xt and Xt−1:

Cov(Xt, Xt−1) = E[(ut + βut−1)(ut−1 + βut−2)]
= E(ut ut−1) + βE(ut ut−2) + βE(ut−1²) + β²E(ut−1 ut−2)
= 0 + 0 + βσu² + 0
= βσu². (3.36)
Having calculated (3.35) and (3.36), we can substitute these into (3.34) to find the autocorrelation at lag 1:

Corr(Xt, Xt−1) = βσu²/[(1 + β²)σu²] = β/(1 + β²). (3.37)
In other words, if X follows an MA(1) process, then Xt and Xt−1 will be correlated,
but Xt will not be correlated with Xt−2 , nor with longer lags of X.
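To make this concrete, the lag-1 result β/(1 + β²) and the cutoff at deeper lags can be written as a tiny Python function (a sketch for checking hand calculations, not part of the book's Stata workflow):

```python
def ma1_acf(beta, k):
    """Theoretical ACF of the MA(1) process X_t = u_t + beta*u_{t-1}.

    Corr(X_t, X_{t-1}) = beta/(1 + beta**2); zero at all deeper lags.
    """
    if k == 0:
        return 1.0
    if k == 1:
        return beta / (1.0 + beta ** 2)
    return 0.0

print(ma1_acf(0.5, 1))  # 0.4
print(ma1_acf(0.5, 2))  # 0.0
```

Note that with β = 0.5 the single spike is 0.4, matching the ACF graphed for the MA(1) process in Fig. 3.9.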
An MA(q) process has an autocorrelation

Corr(Xt, Xt−k) = Cov(Xt, Xt−k)/Var(Xt) (3.38)

at each lag length k. To derive this sequence of correlations, let's take apart Eq. (3.38) piece by piece.
We begin with the denominator, deriving an expression for the variance of Xt .
To do this, let's start with the definition of an MA(q) process:

Xt = ut + β1 ut−1 + β2 ut−2 + · · · + βq ut−q, (3.39)

where

ut ∼ iidN(0, σu²). (3.40)

The variance of Xt is then

Var(Xt) = E(Xt²) = (1 + β1² + β2² + · · · + βq²)σu². (3.41)

This will be our term in the denominator. What about the numerator?
Beginning with the definition of covariance, and using the fact that E(Xt) = 0,

Cov(Xt, Xt−k) = E(Xt Xt−k)
= E[(ut + β1 ut−1 + · · · + βq ut−q)(ut−k + β1 ut−k−1 + · · · + βq ut−k−q)]. (3.43)

This equation includes many products of us. Since each ui is independent of each uj whenever their subscripts are different (i ≠ j), then E(ui uj) = 0 and, mercifully, the equation above simplifies dramatically.
At k = 1, Eq. (3.43) reduces to:
E(Xt Xt−1) = E(β1 ut−1² + β2 β1 ut−2² + β3 β2 ut−3² + · · · + βq βq−1 ut−q²)
= β1 E(ut−1²) + β2 β1 E(ut−2²) + β3 β2 E(ut−3²) + · · · + βq βq−1 E(ut−q²)
= (β1 + β2 β1 + β3 β2 + · · · + βq βq−1)σu². (3.44)
Notice that the sequence of βs begins later and later. Eventually, once k exceeds q, there are no longer any non-zero correlations. At k = q, Eq. (3.43) reduces to

E(Xt Xt−q) = βq E(ut−q²) = βq σu²,

and for k > q, every cross-product has zero expectation, so E(Xt Xt−k) = 0.
We’ve calculated all of the autocovariances at each lag k = 1, 2,. . . We are now,
finally, in a position to show the autocorrelations that comprise the ACF.
The autocorrelation at lag k = 1 is found by plugging (3.44) and (3.41) into Eq. (3.38); the σu² terms cancel, leaving

Corr(Xt, Xt−1) = (β1 + β2 β1 + β3 β2 + · · · + βq βq−1) / (1 + β1² + β2² + · · · + βq²). (3.49)
Similarly, at lag k = 2,

Corr(Xt, Xt−2) = (β2 + β3 β1 + β4 β2 + · · · + βq βq−2) / (1 + β1² + β2² + · · · + βq²). (3.50)
Corr(Xt, Xt−3) = (β3 + β4 β1 + β5 β2 + β6 β3 + · · · + βq βq−3) / (1 + β1² + β2² + · · · + βq²), (3.51)
Corr(Xt, Xt−q) = βq / (1 + β1² + β2² + · · · + βq²), (3.52)
Corr(Xt, Xt−k) = 0 / [σu²(1 + β1² + β2² + · · · + βq²)] = 0. (3.53)
The ACF of an MA(q) process is given by the values of Eqs. (3.49)–(3.52) and
zeros thereafter.
This might seem a bit too abstract. It is time for an example.
Example
Suppose that somehow we knew that an MA(3) process was equal to

Xt = ut + 0.40 ut−1 + 0.20 ut−2 + 0.10 ut−3.
Armed with the above formulas for the ACF of an MA(q) process, we can calculate
the theoretical autocorrelations at lags k = 0, 1, 2, 3, and k > 3:
AC(k = 1) = (β1 + β2 β1 + β3 β2 + · · · + βq βq−1) / (1 + β1² + β2² + · · · + βq²)
= (β1 + β2 β1 + β3 β2) / (1 + β1² + β2² + β3²) (3.55)
= [0.40 + (0.20)(0.40) + (0.10)(0.20)] / [1 + (0.40)² + (0.20)² + (0.10)²]
= 0.4132
AC(k = 2) = (β2 + β3 β1 + β4 β2 + · · · + βq βq−2) / (1 + β1² + β2² + · · · + βq²)
= (β2 + β3 β1) / (1 + β1² + β2² + β3²) (3.56)
= [0.20 + (0.10)(0.40)] / [1 + (0.40)² + (0.20)² + (0.10)²]
= 0.1983
AC(k = 3) = (β3 + β4 β1 + β5 β2 + β6 β3 + · · · + βq βq−3) / (1 + β1² + β2² + · · · + βq²)
= β3 / (1 + β1² + β2² + β3²) (3.57)
= 0.10 / [1 + (0.40)² + (0.20)² + (0.10)²]
= 0.0826
AC (k > 3) = 0. (3.58)
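Equations (3.49)–(3.53) can be collected into a single function. The Python sketch below (illustrative only; `maq_acf` is our own helper, with the β's passed as a list) reproduces the MA(3) calculations above:

```python
def maq_acf(betas, k):
    """Theoretical ACF of X_t = u_t + betas[0]*u_{t-1} + ... + betas[q-1]*u_{t-q}.

    Implements (3.49)-(3.53): numerator beta_k + beta_{k+1}*beta_1 + ...,
    denominator 1 + beta_1**2 + ... + beta_q**2; zero for k > q.
    """
    b = [1.0] + list(betas)  # b[0] = 1 is the coefficient on u_t
    q = len(betas)
    if k == 0:
        return 1.0
    if k > q:
        return 0.0
    num = sum(b[j] * b[j + k] for j in range(q - k + 1))
    den = sum(bj ** 2 for bj in b)
    return num / den

for k in range(1, 5):
    print(round(maq_acf([0.40, 0.20, 0.10], k), 4))  # 0.4132, 0.1983, 0.0826, 0.0
```

The same call with different β lists answers the exercises that follow.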
Exercises
1. Use the formulas for the ACFs of an MA(3) process derived above (i.e. (3.55)
through (3.58)) to calculate the first five values of the ACF of the following
processes:
(a) Xt = ut + 0.50ut−1 − 0.10ut−2 + 0.05ut−3 .
(b) Xt = ut − 0.50ut−1 + 0.20ut−2 + 0.10ut−3 .
Theoretical Partial ACFs are more difficult to derive, so we will only outline their
general properties. Theoretical PACFs are similar to ACFs, except they remove the
effects of other lags. That is, the PACF at lag 2 filters out the effect of autocorrelation
from lag 1. Likewise, the partial autocorrelation at lag 3 filters out the effect of
autocorrelation at lags 2 and 1.
A useful rule of thumb is that Theoretical PACFs are the mirrored opposites of
ACFs. While the ACF of an AR(p) process dies down exponentially, the PACF has
spikes at lags 1 through p, and then zeros at lags greater than p. The ACF of an
MA(q) process has non-zero spikes up to lag q and zero afterward, while the PACF
dampens toward zero, and often with a bit of oscillation.
Fig. 3.4 Theoretical (a) ACF and (b) PACF of AR(1): Xt = 0.50Xt−1 + et
We have covered much ground thus far, so it will be useful to summarize what we
have concluded about Theoretical ACFs and PACFs of the various processes.
Theoretical ACFs and PACFs will show the following features:

AR(p): the ACF dampens toward zero; the PACF has non-zero spikes at lags 1 through p, and zeros thereafter.
MA(q): the ACF has non-zero spikes at lags 1 through q, and zeros thereafter; the PACF dampens toward zero, often with oscillation.
ARMA(p,q): both the ACF and the PACF dampen toward zero.
Figures 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10, 3.11, 3.12, 3.13, 3.14, and 3.15 graph the
Theoretical ACFs and PACFs of several different AR(p), MA(q), and ARMA(p,q)
processes.
Theoretical ACFs and PACFs were implied by particular models. Empirical ACFs
and PACFs, on the other hand, are the sample correlations estimated from data. As
such, they are quite easy to estimate. We’ll review the Stata syntax for estimating
simple correlations, and we’ll explore in greater depth what was meant by a Partial
ACF.
3.2 Empirical ACFs and PACFs 65
Fig. 3.5 Theoretical (a) ACF and (b) PACF of AR(1): Xt = −0.50Xt−1 + et
Fig. 3.6 Theoretical (a) ACF and (b) PACF of AR(2): Xt = 0.50Xt−1 + 0.20Xt−2 + et
Fig. 3.7 Theoretical (a) ACF and (b) PACF of AR(2): Xt = 0.50Xt−1 − 0.20Xt−2 + et
Fig. 3.8 Theoretical (a) ACF and (b) PACF of AR(2): Xt = −0.50Xt−1 − 0.20Xt−2 + et
Fig. 3.9 Theoretical (a) ACF and (b) PACF of MA(1): Xt = ut + 0.50ut−1
Fig. 3.10 Theoretical (a) ACF and (b) PACF of MA(1): Xt = ut − 0.50ut−1
Fig. 3.11 Theoretical (a) ACF and (b) PACF of MA(2): Xt = ut + 0.50ut−1 + 0.20ut−2
Fig. 3.12 Theoretical (a) ACF and (b) PACF of MA(2): Xt = ut + 0.50ut−1 − 0.20ut−2
Fig. 3.13 Theoretical (a) ACF and (b) PACF of MA(2): Xt = ut − 0.50ut−1 − 0.20ut−2
Fig. 3.14 Theoretical (a) ACF and (b) PACF of ARMA(1,1): Xt = 0.40Xt−1 + ut + 0.40ut−1
Fig. 3.15 Theoretical (a) ACF and (b) PACF of ARMA(1,1): Xt = −0.40Xt−1 + ut − 0.40ut−1
Empirical ACFs are not the result of a model. They are a description of data. They
can be calculated much like any other correlation. To calculate an Empirical ACF in
Stata, create a new variable that is the lag of X—let us call it LagX. Treat this new
variable like any other variable Y and calculate the correlation between X and Y .
That is:

generate LagX = L.X
correlate X LagX

In fact, Stata is quite smart. There is no need to create the new variable. Rather, we may estimate the correlation between X and its lag more directly by:

correlate X L.X

which only calculates the autocorrelation at a lag of one. To calculate deeper lags, we may list them directly, as in correlate X L.X L2.X L3.X.
Alternatively,

corrgram X

provides the empirical ACF (and PACF), as well as a text-based picture of the two. A nicer graph of the ACF is produced via the ac command:

ac X
The empirical partial autocorrelation function shows the correlation between sets
of ordered pairs (Xt , Xt+k ), while removing the effect of the intervening Xs.
Regression analysis is perfectly suited for this type of procedure. After all, when
one estimates Y = β0 + β1 X + β2 Z, the coefficient β1 is interpreted as the effect,
or relationship, between X and Y , holding the effect of Z constant.
Let’s denote the partial autocorrelation coefficient between Xt and Xt+k as φkk
(following the notation in Pankratz 1991 and Pankratz 1983).
Suppose we are given data on X. Then the PACF between Xt and Xt−1 (or the "PACF at lag 1") is found by estimating, via linear regression,

Xt = φ11 Xt−1 + et,

where the estimate of φ11 is the partial autocorrelation at lag 1. The PACF between Xt and Xt−2 (i.e. the PACF at lag 2) is found by estimating

Xt = φ21 Xt−1 + φ22 Xt−2 + et,

where the estimate of φ22 is the partial autocorrelation at lag 2; and so on for deeper lags.
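The regression definition translates directly into code. The following Python sketch (pure standard library; the simulated AR(1) series with coefficient 0.7 is a hypothetical data-generating process, chosen just to illustrate) computes the PACF at lag k as the coefficient on the deepest lag in a zero-mean OLS regression:

```python
import random

def ols(y, X):
    # Solve the normal equations (X'X) b = X'y by Gaussian elimination.
    # No intercept, matching the zero-mean regressions in the text.
    k = len(X[0])
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    b = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    for i in range(k):  # forward elimination with partial pivoting
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, k):
            m = A[r][i] / A[i][i]
            for c in range(i, k):
                A[r][c] -= m * A[i][c]
            b[r] -= m * b[i]
    coef = [0.0] * k
    for i in range(k - 1, -1, -1):
        coef[i] = (b[i] - sum(A[i][j] * coef[j] for j in range(i + 1, k))) / A[i][i]
    return coef

def pacf(x, k):
    # PACF at lag k: the coefficient on the deepest lag in a regression
    # of x_t on x_{t-1}, ..., x_{t-k}.
    y = x[k:]
    X = [[x[t - j] for j in range(1, k + 1)] for t in range(k, len(x))]
    return ols(y, X)[-1]

# Simulate a hypothetical AR(1): X_t = 0.7 X_{t-1} + e_t
random.seed(42)
x = [0.0]
for _ in range(500):
    x.append(0.7 * x[-1] + random.gauss(0, 1))

print(round(pacf(x, 1), 2), round(pacf(x, 2), 2))  # phi11 near 0.7, phi22 near 0
```

As expected for AR(1) data, the lag-1 partial autocorrelation is large while the lag-2 partial autocorrelation is statistically near zero.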
Example
We will show how to calculate PACFs “by hand” using a sequence of regressions.
Then, we will estimate the PACF more quickly using Stata’s built-in pac and
corrgram commands, showing that the approaches—the long way and the quick
way—are equivalent. Having shown this, we will thereafter rely on the easier short
way in our subsequent calculations.
First, we will do this for a dataset which we know comes from an AR process
(it was constructed to be so), and then we will repeat the process for data from an
MA process. Then we will compare the ACFs and PACFs from the AR and MA
processes. AR and MA processes have different ACFs and PACFs, so in practice,
estimating these ACFs and PACFs will let us know what type of model we should
estimate.
Fig. 3.16 Empirical (a) ACF and (b) PACF of data from an AR(1) process
Fig. 3.17 Empirical (a) ACF and (b) PACF of data from an AR(2) process
Likewise, we could use Stata’s built-in command pac to draw a nicer graph of
the PACF, along with confidence bands in a shaded area:
Exercises
1. Using ARexamples.dta calculate the Empirical PACFs (out to five lags) for
variable Z using the regression approach. Do the same using corrgram. Verify
that your answers are the same regardless of which approach you use.
2. Using MAexamples.dta calculate the Empirical PACFs (out to five lags) for
variables X, Y and Z using the regression approach and using corrgram. Verify
that your answers are the same regardless of which approach you use.
Fig. 3.18 Empirical (a) ACF and (b) PACF of MA(1) process
Fig. 3.19 Empirical (a) ACF and (b) PACF of MA(2) process
Each type of process has its signature: its Theoretical ACF and PACF. Each dataset
has its own correlation structure: its Empirical ACF and PACF. We can figure out
which type of process to use to model the data by comparing the correlations in the
data with the correlations implied by the different models. The process is simple:
calculate the Empirical ACF and PACF from the data, and see whether it looks like
the type of pattern predicted by a specific type of model. Do the Empirical ACF
and PACF look similar to the Theoretical ACF/PACF from, say, an AR(2) process?
Then, estimate an AR(2) model using the data.
3.3 Putting It All Together 73
Fig. 3.20 Empirical (a) ACF and (b) PACF of example data
Example
Suppose you were given data that produced the Empirical ACF and PACF shown in
Fig. 3.20. What type of process might have generated this data?
In Fig. 3.20a, the ACF has two significant spikes. In Fig. 3.20b, the PACF has a
sequence of significant spikes, with dampening and oscillation. This looks similar
to what might be expected from an MA(2) process, as we see in Fig. 3.11.
Thus, we might estimate an MA(2) model, giving us output such as:
Thus, we can conclude that the data are reasonably described by:
Xt = et + 0.6954989et−1 + 0.1299435et−2 .
Example
Suppose you are given the dataset rGDPgr.dta, which contains data on seasonally
adjusted real GDP growth rates, quarterly, from 1947 Q2 through 2017 Q2.
Alternatively, you can download it from the Federal Reserve’s website, and tsset
the data:
How should we model the real GDP growth rate? As an AR(p) process? An
MA(q) or ARMA(p,q)? And of what order p or q? The standard approach is to
calculate the Empirical ACF/PACF exhibited by the data, and compare them to the
characteristic (i.e. Theoretical) ACF/PACF implied by the various models.
So, our first step is to calculate the Empirical ACF and PACF:
Fig. 3.21 Empirical (a) ACF and (b) PACF of the real GDP growth rate
The Empirical ACF shows two statistically significant spikes (at lags 1 and 2).
The PACF has one significant spike at lag 1, after which the partial autocorrelations
are not statistically different from zero (with one exception). The PACF at a lag of
12 is statistically significant. The data are quarterly, so this lag of 12 corresponds
to an occurrence 36 months, or 3 years, previous. There does not seem to be
any economically compelling reason why events 36 months previous should be
important when events at 24 and 12 months previous are insignificant. It seems
as though this is a false-positive partial autocorrelation.
Given that the patterns in Fig. 3.21 look similar to those in Fig. 3.4, we conclude
that the growth rate of real GDP is reasonably modeled as an AR(1) process.
Estimating the AR(1) model for rGDPgr,
Exercises
1. Load the dataset ARMAexercises.dta. Using the ac and pac commands
in Stata, calculate the Empirical ACF and PACF for each of the variables in the
dataset. What type of process seems to have generated each variable?
2. For each of the variables in the question above, estimate the AR, MA, or
ARMA model that seems to have generated the variable. Write out the estimated
3.4 Information Criteria 77
equation of each model. Pay special attention to whether the estimated constant
is statistically significant. If it is not, drop the constant from the model and re-
estimate.
3. Download data from FRED on the seasonally adjusted growth rate of nominal
GDP. (This is series: A191RP1Q027SBEA.) Use data from 1947 Q2 through
2017 Q2. Calculate its Empirical ACF and PACF. What type of ARMA(p,q)
process seems to best fit the ACF/PACF? Explain your reasoning. Estimate the
ARMA process and report your results. If necessary, modify your model based
on this output. Use your final estimated model to forecast out to five additional
periods. Graph your results. Compare your forecasted GDP numbers with the
real ones from FRED. Do you think your estimates are fairly good or not?
AIC = −2 ln(L) + 2k, (3.59)

where ln(L) is the maximized (natural) log-likelihood of the model and k is the number of parameters estimated.
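In code, both criteria are one-liners. In the Python sketch below, the BIC is taken as the standard textbook definition −2 ln(L) + k ln(N), where N is the number of observations (the log-likelihood values used in the demonstration are hypothetical):

```python
import math

def aic(loglik, k):
    # Akaike information criterion: -2*lnL + 2*k, as in Eq. (3.59)
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    # Bayesian information criterion: penalty k*ln(n) grows with the sample
    return -2.0 * loglik + k * math.log(n)

# hypothetical fitted model: log-likelihood -500 with 2 parameters, n = 200
print(aic(-500.0, 2))                 # 1004.0
print(round(bic(-500.0, 2, 200), 3))  # 1010.597
```

Lower values of either criterion indicate a preferred model; because ln(N) > 2 once N > 7, the BIC punishes extra parameters more harshly than the AIC.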
Example
What type of model best fits X (from the dataset ARexamples.dta)? We will
consider AR and MA models up to three lags. To do this, calculate each of the
models and compare AICs and BICs. The model with the lowest AIC and BIC is
the preferred model.
First, estimate the AR models with three, two, and one lags, and compare their
information criteria:
We can see that the AR model with the smallest information criteria is the last
model, the AR(1). How do these compare to the MA models?
Exercises
1. Load the dataset MAexamples.dta. Rather than using ACF and PACFs to
determine which model to estimate, let’s use Information Criteria (ICs) instead.
For each of the three variables in the dataset estimate AR and MA models up to
lag 3, calculate their corresponding AICs and BICs. Which type of model best
fits each variable according to each Information Criterion? Do your results differ
between the two ICs?
2. Load the dataset ARMAexercises.dta. For each of the variables in the
dataset, calculate ICs for AR(1/4) down to AR(1) models, and MA(1/4) down
to MA(1) models. (The data were artificially generated from either an AR or MA
process; they did not come from an ARMA process.) For each variable, which
model is “preferred” by the AIC? By the BIC? Do your results differ between
them? Do your results differ from what you deduced using ACFs and PACFs?
4 Stationarity and Invertibility
Most time-series methods are only valid if the underlying time-series is stationary.
The more stationary something is, the more predictable it is. More specifically, a
time-series is stationary if its mean, variance, and autocovariance do not rely on the
particular time period.1
The mean of a cross-sectional variable X is E(X) = μ. When X is a time-
series it is subscripted by the time period in which it is observed, with period t
as the usual notation for an arbitrary time period. The mean of a time-series Xt is
E(Xt ) = μt ; the subscript denotes that the mean could depend upon the particular
time. For example, if Xt is growing then its mean (or expected value) will also
be growing. Tomorrow’s Xt+1 is expected to be greater than today’s Xt . Likewise,
the variance of Xt , denoted V ar(Xt ) or σt2 , might depend upon the particular time
period. For example, volatility might be increasing over time. More likely, volatility
tomorrow might depend upon today’s volatility.
Specifically, we say that Xt is "mean stationary" if

E(Xt) = μ, (4.1)

and "variance stationary" if

Var(Xt) = σ². (4.2)
1 Stationarity of mean, variance, and covariance is called "weak stationarity." If all moments,
including higher-order moments like skewness and kurtosis, are also constant, then we say the
time series has "strong-form stationarity," "strict stationarity," or "strong stationarity." For the
purposes of this book, "stationarity" will refer to "weak stationarity."
Finally, Xt is "covariance stationary" if

Cov(Xt, Xt+k) = γk. (4.3)

That is, the covariance between Xt and Xt+k does not depend upon which particular
period t is; the time variable could be shifted forward or backward by any number of
periods and the same covariance relationship would hold. What matters is the distance
between the two observations.
For example, the covariance between X1 and X4 is the same as the covariance
between X5 and X8 , or between X11 and X14 . In symbols,
Cov (X1 , X4 ) = Cov (X5 , X8 ) = Cov (X11 , X14 ) = Cov (Xt , Xt+3 ) .
When a process satisfies all of the above conditions, we say that X is "stationary."2
At a first pass, testing for mean and variance stationarity seems fairly straightforward.
We could test to see whether the series is increasing or decreasing. We
could compare the mean or the variance between the first half and the second half
of the series. Such methods are crude, however. (More formal and powerful tests—
essential tests in the econometricians’ toolbox—are the subject of Chap. 7.)
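The crude half-sample comparison described above is easy to sketch in Python (illustrative only; the simulated series and the drift of 0.01 per period are hypothetical):

```python
import random
import statistics

def crude_stationarity_check(x):
    # Compare mean and variance across the two halves of the sample.
    # This is the crude check described in the text; formal tests
    # (the subject of Chap. 7) are preferred in practice.
    half = len(x) // 2
    a, b = x[:half], x[half:]
    return {
        "mean_first": statistics.mean(a), "mean_second": statistics.mean(b),
        "var_first": statistics.variance(a), "var_second": statistics.variance(b),
    }

random.seed(1)
white_noise = [random.gauss(0, 1) for _ in range(1000)]
trending = [0.01 * t + random.gauss(0, 1) for t in range(1000)]

res = crude_stationarity_check(trending)
print(round(res["mean_second"] - res["mean_first"], 1))  # roughly 5: the mean drifts
```

For the white-noise series the two half-sample means are nearly identical; for the trending series the second-half mean is far larger, flagging non-stationarity in the mean.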
In the previous chapter we presumed stationarity. In this chapter, we derive
the conditions under which a process is stationary, and also show some further
implications of this stationarity. In Chap. 5 we will weaken this assumption and
begin exploring processes which are not stationary.
2 In this chapter, we will be exploring primarily stationarity in the means of processes. This is
often called "stability" and is a subset of stationarity. Since we do not explore non-stationary
variance until Chap. 9, though, we will treat "stability" and "stationarity" as synonyms and use
them interchangeably.
4.3 Restrictions on AR coefficients Which Ensure Stationarity 83
Not all AR processes are stationary. Some grow without limit. Some have variances
which change over time. In this section we explore the restrictions on the parameters
(the β’s) of AR processes that render them stationary.
Consider an AR(1) process:

Xt = βXt−1 + et. (4.4)
It is easy to see that it will grow without bound if β > 1; it will decrease without
bound if β < −1. The process will only settle down and have a constant expected
value if |β| < 1.
This might be intuitively true, but we’d like to develop a method for examining
higher order AR processes.
First, rewrite Eq. (4.4) in terms of the lag operator L:

Xt = βLXt + et
Xt − βLXt = et
(1 − βL)Xt = et.

The lag polynomial is Φ(L) = 1 − βL.
Stationarity is ensured if and only if the roots of the lag polynomial are greater
than one in absolute value.
Replacing the L’s with z’s, we apply a little algebra and solve for the roots of the
polynomial, i.e. solve for the values of z that set the polynomial equal to zero:
1 − zβ = 0
1 = zβ
z∗ = 1/β.
Thus, our lag polynomial has one root, and it is equal to 1/β.
This root is greater than one in absolute value precisely when |β| < 1.
To summarize, the AR(1) process is stationary if the roots of its lag polynomial
are greater than one (in absolute value); and this is assured if β is less than one in
absolute value.
For an AR(2) process, stationarity is ensured if and only if the roots of the second
order lag polynomial (L) lie outside the complex unit circle. We say the “complex
unit circle” now because we have a second degree polynomial. These polynomials
might have imaginary roots. Plotting the root on the complex plane, it must have a
length greater than one; it must lie outside a circle of radius = 1.
Suppose we estimated an AR(2) model,
Xt = β1 Xt−1 + β2 Xt−2 + et .
Xt − β1 LXt − β2 L2 Xt = et .
This AR(2) process will be stationary if the roots of the second-order lag polynomial
Φ(L) = 1 − β1 L − β2 L²

are greater than one. Replacing L's with z's, we set the polynomial equal to zero (to find its roots) and solve:
1 − β1 z − β2 z² = 0.

To obtain the inverse characteristic polynomial, replace z with 1/z and multiply through by z²:

z² − zβ1 − β2 = 0. (4.5)
Working with the inverse characteristic polynomial will be a bit easier in this case.
Many software programs such as Stata report their results in terms of the inverse
characteristic polynomial. In this case, the AR process is stationary if the roots of
the inverse polynomial lie inside the unit circle. This has caused a great deal of
confusion with students. (In Sect. 4.3.4 we will derive the inverse characteristic
polynomial, and explore the relationship between the characteristic and inverse
characteristic polynomials.)
To find these roots, use the quadratic formula. We're used to seeing the quadratic formula in terms of Y's and X's, as in Y = aX² + bX + c, in which case the roots (X*) are given by:
X* = [−b ± √(b² − 4ac)] / (2a).
So, to find the roots of Eq. (4.5) use the quadratic formula, replacing a with 1, b
with −β1 , and c with −β2 :
z* = [β1 ± √(β1² + 4β2)] / 2.
Since we've presumably estimated the model, we simply plug in values for β̂1 and β̂2 to find the value of the roots. If these roots of the inverse characteristic polynomial are less than one in absolute value, then the process is stable.
What values of β1 and β2 ensure that these roots of the inverse characteristic
polynomial are less than one?
|[β1 ± √(β1² + 4β2)] / 2| < 1. (4.6)
We have a second-degree polynomial, so we will have up to two roots, z1∗ and z2∗ .
We must consider a couple of different cases: (1) the term inside the square root
of (4.6) is positive, in which case we are dealing with nice real numbers, or (2) the
term inside the square root is negative, which means that we have imaginary roots.
Let’s begin with the simpler case where the roots are real numbers.
To find the first root,

z1* = [β1 + √(β1² + 4β2)] / 2 < 1
β1 + √(β1² + 4β2) < 2
√(β1² + 4β2) < 2 − β1
β1² + 4β2 < 4 − 4β1 + β1²
β2 + β1 < 1.

An analogous argument applied to the second root, z2* = [β1 − √(β1² + 4β2)] / 2 > −1, yields β2 − β1 < 1.
If instead the term inside the square root is negative, the roots are complex. Complex numbers are usually expressed in the form z* = r ± ci, where r is the real part and c is the imaginary part. The length, or "modulus," of a complex number is equal to √(r² + c²), which for stationarity must be less than one. Here, r = β1/2 and c = √(−β1² − 4β2)/2, so we require

β1²/4 + (−β1² − 4β2)/4 < 1
−β2 < 1
β2 > −1. (4.9)
In summary, there are three conditions on the β's of an AR(2) process that imply
stability:

β2 + β1 < 1 (4.11)
β2 − β1 < 1 (4.12)
|β2| < 1. (4.13)
In words: (1) the coefficients cannot add up to a number greater than one, so that
each successive X doesn’t become greater and greater; (2) the coefficients cannot
be too far apart; and (3) the coefficient on the deepest lag cannot be too big. If any
of these conditions are violated, then the process is not stationary.
We can get a better understanding of the constraints by examining Fig. 4.1, a
graph of the so-called Stralkowski Triangle (Stralkowski and Wu 1968). If we
rewrite each of the constraints with β2 as our “y” variable, and β1 as the “x” variable,
then we see that constraints (4.11)–(4.13) define a triangle. Any set of β’s that fall
inside this triangle will result in a stable AR(2) process.
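The triangle conditions and the root condition are two views of the same requirement, and it is reassuring to check that they agree. The Python sketch below (cmath is from the standard library; the β pairs are arbitrary test values) compares the two:

```python
import cmath

def ar2_stationary_triangle(b1, b2):
    # Stralkowski triangle: b1 + b2 < 1, b2 - b1 < 1, |b2| < 1
    return (b1 + b2 < 1) and (b2 - b1 < 1) and (abs(b2) < 1)

def ar2_stationary_roots(b1, b2):
    # roots of the inverse characteristic eq z^2 - b1*z - b2 = 0
    disc = cmath.sqrt(b1 ** 2 + 4 * b2)
    z1 = (b1 + disc) / 2
    z2 = (b1 - disc) / 2
    return abs(z1) < 1 and abs(z2) < 1  # inverse roots inside the unit circle

for b1, b2 in [(0.70, 0.10), (0.80, 0.30), (-0.80, 0.30), (0.50, -0.20)]:
    print(ar2_stationary_triangle(b1, b2) == ar2_stationary_roots(b1, b2))  # True
```

Using cmath.sqrt means the same code handles both real and complex roots, so the oscillatory (complex-root) region of the triangle is covered automatically.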
If the characteristic equation (or its inverse) has complex roots, this implies that
the AR(2) process will have oscillations, fluctuating up and down. These complex
roots will arise if the term in the square root of Eq. (4.6) is negative:

β1² + 4β2 < 0. (4.14)
[Fig. 4.1: the Stralkowski Triangle in the (β1, β2) plane]
While Eq. (4.14) is not particularly illuminating in this form, it has a nice geometric
interpretation within the Stralkowski triangle. Combinations of β's that fall below
the upside-down parabola will result in oscillating patterns in the time series. If the
β’s are under the parabola, but still within the triangle, then we will have a stable
oscillatory pattern. If the β’s are under the parabola, but outside the triangle, then
we will have an explosive oscillatory pattern.
Examples
Which of the following AR processes are stationary, and why?
1. Xt = 1.10Xt−1 + et
2. Yt = 0.70Yt−1 + 0.10Yt−2 + et
3. Zt = 0.80Zt−1 + 0.30Zt−2 + et
4. Wt = −0.80Wt−1 + 0.30Wt−2 + et
[Fig. 4.2: time-series plots of the processes X, Y, Z, and W]

Process (1) is not stationary: it is an AR(1) process with |β| = 1.10 > 1.
Process (2) is stationary: all three conditions are met (0.70 + 0.10 = 0.80 < 1;
0.10 − 0.70 = −0.60 < 1; |0.10| < 1).
Process (3) is not stationary as the coefficients add to more than one (0.30 +
0.80 = 1.10 > 1).
Process (4) is not stationary. While |β2| = 0.30 < 1, and β2 + β1 = 0.30 − 0.80 =
−0.50 < 1, the remaining condition is not met: β2 − β1 = 0.30 − (−0.80) =
1.10 > 1.
Figure 4.2 graphs each of the four examples above. You can verify visually which
series seem stationary.
Exercises
1. Which of the following processes are stationary? Why? Express your answer in
terms of the Stralkowski Triangle inequalities.
(a) Xt = 1.05Xt−1 + et
(b) Xt = 0.60Xt−1 + 0.10Xt−2 + et
(c) Xt = 0.50Xt−1 + 0.30Xt−2 + et
(d) Xt = 0.80Xt−1 − 0.10Xt−2 + et
Example
Let’s use Stata to estimate an AR(2) model and check whether it is stationary.
After estimation, the command estat aroots calculates the roots of the inverse
characteristic function and graphs them as well (see Fig. 4.3).
The two roots are complex. They have lengths that are quite close to, but
less than, one. Having lengths of 0.999493, they are not technically “unit roots”,
but they are too close for comfort. “Near unit roots” pose their own problems.
Visually inspecting Fig. 4.3a, the roots seem to be on, not inside, the unit circle.
The practical take-away is that the estimated model may not be stationary. A more
formal hypothesis test will be required to test whether the root is statistically close
to the unit circle. (Such unit root tests are the subject of Chap. 7.)
Example
Working with the same dataset, let’s estimate an AR(2) model on the variable Y.
Fig. 4.3 Inverse roots of two estimated AR(2) models. (a) Variable X. (b) Variable Y
Stata estimates the two inverse roots to be 0.701802 and 0.025906, and graphs
them as in Fig. 4.3b. The estimated AR(2) model on Y seems to be stable and,
therefore, stationary.
Exercises
1. For each of the remaining variables (Z, W, A, B, C and D) in the
stationarity_ex.dta dataset, answer the following:
(a) Estimate a zero-mean AR(2) model using Stata’s arima command.
(b) Check for stationarity using Stata’s estat aroots post-estimation com-
mand.
(c) Check for stationarity by using the quadratic formula to compute the roots of
the characteristic equation.
(d) Check for stationarity by using the quadratic formula to compute the roots of
the inverse characteristic equation.
(e) Which variables are not stationary?
(f) Do any of the variables seem to have roots on the unit circle (i.e. do they have
“unit roots?”)
Stata sometimes has problems estimating non-stationary ARIMA models. If it
cannot find an estimate, this is one indication that the estimated model is not
stationary. Still, if you want to force Stata into providing an estimate, you can try
using the diffuse option at the end of the arima command. The diffuse
estimates are suspect, though, so use this option sparingly. This will be necessary,
however, for some of the exercises above.
For a general AR(p) process, Xt = β1 Xt−1 + β2 Xt−2 + · · · + βp Xt−p + et. Collecting the X's to the left-hand side and using the lag operator L,

Xt(1 − β1 L − β2 L² − · · · − βp L^p) = et
Φ(L)Xt = et.
Substituting z's into the lag polynomial gives the characteristic polynomial:

1 − β1 z − β2 z² − · · · − βp z^p = 0.

If the roots of this characteristic polynomial (i.e. the values of z such that the polynomial is equal to zero) are greater than one in absolute value, then the AR process is stationary.
Alternatively, we could calculate the roots of the inverse characteristic polynomial,

Z^p − β1 Z^(p−1) − β2 Z^(p−2) − · · · − βp = 0,

and verify whether they are inside the complex unit circle.
While there is no "quadratic formula" for an arbitrary p-th order polynomial,
computers can still estimate the roots of such equations. Stata does this easily.
Thus, to check for stationarity, we simply need to verify that the roots as provided
by Stata are inside the unit circle.
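For p > 2 there is no closed-form root formula, but the roots can be found numerically. The sketch below uses a simple Durand–Kerner iteration in pure Python (a toy root-finder for illustration only; in practice Stata's estat aroots or a numerical library does this job):

```python
def poly_roots(coeffs):
    # Durand-Kerner iteration for the roots of
    # coeffs[0]*x^n + coeffs[1]*x^(n-1) + ... + coeffs[n]  (coeffs[0] != 0)
    n = len(coeffs) - 1
    c = [x / coeffs[0] for x in coeffs]          # make the polynomial monic
    roots = [(0.4 + 0.9j) ** i for i in range(n)]  # standard starting guesses
    for _ in range(200):
        new = []
        for i, r in enumerate(roots):
            p = sum(c[j] * r ** (n - j) for j in range(n + 1))
            d = 1.0
            for j, s in enumerate(roots):
                if j != i:
                    d *= (r - s)
            new.append(r - p / d)
        roots = new
    return roots

def ar_stationary(betas):
    # inverse characteristic polynomial: Z^p - b1*Z^(p-1) - ... - bp;
    # stationary iff all inverse roots lie inside the unit circle
    coeffs = [1.0] + [-b for b in betas]
    return all(abs(z) < 1 for z in poly_roots(coeffs))

print(ar_stationary([0.5, 0.1]))  # True: a stationary AR(2)
print(ar_stationary([1.1, 0.2]))  # False: explosive
```

The helper checks the inverse-root condition directly, which is the same convention Stata reports.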
A linear difference equation is stable if the roots of its characteristic equation are
greater than one in absolute value. Including the possibility of imaginary roots, the
restriction is that the roots of the characteristic equation must have a “modulus
greater than one” (i.e. they must lie outside the unit circle).
Some textbooks and econometric packages (such as Stata) express this stationar-
ity as having roots less than one rather than greater than one. What gives? They are
referring to roots of related, but different, equations. One is referring to the roots of
the characteristic equation. The other is referring to the roots of the inverse equation.
Still others talk about “inverse roots.” What is the relationship between these?
For an AR(p) process,

Xt = β1 Xt−1 + β2 Xt−2 + · · · + βp Xt−p + et,

the characteristic equation is found by writing down the lag polynomial, substituting z's for L's, and setting it equal to zero (since we'll want to find its roots):
1 − β1 z − β2 z² − · · · − βp z^p = 0. (4.15)
The inverse characteristic equation is found by substituting z = 1/Z into (4.15) and clearing denominators:

1 − β1(1/Z) − β2(1/Z²) − · · · − βp(1/Z^p) = 0
Z^p [1 − β1(1/Z) − β2(1/Z²) − · · · − βp(1/Z^p)] = 0
Z^p − β1 Z^(p−1) − β2 Z^(p−2) − · · · − βp = 0.
Since z = 1/Z, the roots of the characteristic equation (z) are reciprocals (i.e.
inverses) of the roots of the inverse characteristic equation (Z). The roots of the
inverse equation happen to be inverses of the roots of the characteristic equation.
Thus, the terms “inverse roots”, or “the roots of the inverse equation” are synonyms.
Stata reports the inverse roots of the characteristic equation, so the stationarity
condition is that these roots must lie inside the unit circle.
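This reciprocal relationship is easy to verify numerically. The book’s examples use Stata, but the same check can be sketched in Python with numpy, using a hypothetical AR(2) with β1 = 0.5 and β2 = 0.1:

```python
import numpy as np

# Hypothetical AR(2): X_t = 0.5 X_{t-1} + 0.1 X_{t-2} + e_t.
# Characteristic polynomial: 1 - 0.5 z - 0.1 z^2 = 0.
# np.roots expects coefficients from the highest power down.
char_roots = np.roots([-0.1, -0.5, 1.0])

# Inverse characteristic polynomial: Z^2 - 0.5 Z - 0.1 = 0.
inv_roots = np.roots([1.0, -0.5, -0.1])

# Stationarity: characteristic roots outside the unit circle,
# equivalently, inverse roots inside it.
print(np.all(np.abs(char_roots) > 1))    # True
print(np.all(np.abs(inv_roots) < 1))     # True

# The two sets of roots are reciprocals of each other.
print(np.allclose(np.sort(1 / char_roots), np.sort(inv_roots)))   # True
```

The same logic underlies Stata’s inverse-root plots: a check that every plotted point lies strictly inside the unit circle.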
Exercises
1. For each of the following AR(2) processes,
(a) Xt = 0.50Xt−1 + 0.10Xt−2 + et
(b) Xt = −0.50Xt−1 + 0.20Xt−2 + et
(c) Xt = 1.10Xt−1 + 0.20Xt−2 + et
Write down the characteristic equation and use the quadratic formula to find its
roots. Write down the inverse characteristic equation. Use the quadratic formula
to find its roots. Show that the two roots are reciprocals of each other.
2. For the following general AR(2) process,
Xt = β1 Xt−1 + β2 Xt−2 + et
Write down the characteristic equation; plug in the appropriate β’s into the
quadratic formula to describe its roots. Write down the inverse characteristic
equation; plug in the appropriate β’s into the quadratic formula to describe its
roots. Show that the two roots are reciprocals of each other. (Hint: Reciprocals
multiply to one.)
What restrictions on the β’s and γ ’s ensure that the estimated model is stable?
After collecting terms and factoring, we can express Eq. (4.17) in terms of two
lag polynomials:
Xt(1 − β1L − β2L^2 − · · · − βpL^p) = ut(1 + γ1L + γ2L^2 + · · · + γqL^q)
Φ(L)Xt = Θ(L)ut.
The same restrictions apply here, as well. If the roots of the characteristic equation
are outside the unit circle, the estimated model is stationary. Likewise, the model is
stationary if the roots of the inverse characteristic equation are inside the unit circle.
Xt = βXt−1 + et. (4.18)

or as

Xt = βXt−1 + et
Xt = βLXt + et
Xt − βLXt = et
Xt(1 − βL) = et (4.23)

Xt = (1 / (1 − βL)) et. (4.24)
We can only move from line (4.23) to (4.24) if βL is not equal to one; otherwise we
would be dividing by zero.
Continuing, recall the infinite (geometric) sum formula: 1/(1 − α) = 1 + α + α^2 + . . . if |α| < 1. In this context, and presuming |βL| < 1 holds, we can substitute βL for α and re-express the AR(1) process as:
Xt = (1 + βL + β^2L^2 + β^3L^3 + . . .)et
= et + βLet + β^2L^2et + β^3L^3et + . . .
= et + βet−1 + β^2et−2 + β^3et−3 + . . .
We could only make the infinite sum substitution as long as the terms in the infinite
sum are appropriately bounded, which is ensured by |βL| < 1.
We have shown that an AR(1) can be expressed as an MA(∞) as long as it doesn’t
grow without bound: |βL| < 1.
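This equivalence can be checked by simulation. Below is a Python sketch (an illustrative stand-in for the book’s Stata examples) that simulates an AR(1) with β = 0.5 and rebuilds it from a truncated MA(∞) representation:

```python
import numpy as np

rng = np.random.default_rng(42)
beta, T = 0.5, 200
e = rng.normal(size=T)

# Simulate the AR(1): X_t = beta * X_{t-1} + e_t, starting at zero.
x_ar = np.zeros(T)
for t in range(1, T):
    x_ar[t] = beta * x_ar[t - 1] + e[t]

# Rebuild the series from the MA(inf) form, truncated at 50 terms:
# X_t = e_t + beta e_{t-1} + beta^2 e_{t-2} + ...
x_ma = np.zeros(T)
for t in range(T):
    for j in range(min(t + 1, 50)):
        x_ma[t] += beta**j * e[t - j]

# Once the truncation error (of order beta^50) dies out, the two
# representations agree.
print(np.allclose(x_ar[60:], x_ma[60:], atol=1e-10))   # True
```

Because |β| < 1, the weights β^j shrink fast enough that fifty MA terms already reproduce the AR(1) to machine precision.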
Xt = β1LXt + β2L^2Xt + · · · + βpL^pXt + et
Φ(L)Xt = et

where

Φ(L) = 1 − β1L − β2L^2 − · · · − βpL^p. (4.25)

If Φ(L) is not equal to zero, then we can divide both sides of Φ(L)Xt = et by Φ(L):

Xt = et / Φ(L).
Xt = ut + γ ut−1 (4.26)
ut = Xt − γ ut−1 . (4.27)
and so forth.
Substituting (4.30) into (4.29) into (4.28) into (4.27),

ut = Xt − γXt−1 + γ^2Xt−2 − γ^3Xt−3 + · · · = ∑(i=0 to ∞) (−γ)^i Xt−i.

Equivalently,

Xt = ut − ∑(i=1 to ∞) (−γ)^i Xt−i. (4.31)
Xt = ut + γ1ut−1 + γ2ut−2 + · · · + γqut−q
= ut(1 + γ1L + γ2L^2 + · · · + γqL^q)
= Θ(L)ut.
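The inversion of an MA(1) into an AR(∞) can likewise be verified numerically. This Python sketch (hypothetical γ = 0.6) recovers the shocks ut from the observed Xt using the recursion ut = Xt − γut−1:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, T = 0.6, 300
u = rng.normal(size=T)

# MA(1): X_t = u_t + gamma * u_{t-1}, taking u_{-1} = 0.
x = u.copy()
x[1:] += gamma * u[:-1]

# AR(inf) form: X_t = u_t - sum_{i>=1} (-gamma)^i X_{t-i}, which is
# the same as recovering the shocks via u_t = X_t - gamma * u_{t-1}.
u_hat = np.zeros(T)
u_hat[0] = x[0]
for t in range(1, T):
    u_hat[t] = x[t] - gamma * u_hat[t - 1]

# Because |gamma| < 1, the recursion recovers the true shocks.
print(np.allclose(u_hat, u))   # True
```

Had |γ| ≥ 1, the recursion would amplify rather than dampen past values, which is exactly why invertibility requires the MA root to lie outside the unit circle.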
4.5 What Are Unit Roots, and Why Are They Bad?
As was hinted in the previous section, “unit roots” refer to the roots of the lag
polynomial. In the AR(1) process, if there was a unit root, then L∗ = 1/β = 1,
so β = 1, which means that the AR process is actually a random walk.
Roots that lie on the unit circle are right at the threshold that marks the transition from stationarity to non-stationarity.
The problem with unit-root processes—that is, with processes that contain random walks—is that they can look stationary in small samples, but treating them as stationary leads to very misleading results. Moreover, regressing one non-stationary process on another leads to many “false positives,” where two variables seem related when they are not.
when they are not. This important finding is due to Granger and Newbold (1974),
whose paper we replicate in Sect. 5.7.
Unit roots represent a specific type of non-stationarity. We will explore unit root
processes (such as a “random walk”) in the next chapter. We will learn how to test
for these processes in Chap. 7.
5 Non-stationarity and ARIMA(p,d,q) Processes
Up until now we have been looking at time series whose means did not exhibit
long-run growth. It is time to drop this assumption. After all, many economic and
financial time series do not have a constant mean. Examples include: the US GDP
per capita, the US CPI, the Dow Jones Industrial Index, and the share price of
Google (Fig. 5.1).
Non-stationary ARIMA models include the “random walk” and the “random
walk with drift.” Simple univariate models such as these have proven to be very
powerful forecasting tools. Nelson (1972) showed this with his comparison of
ARIMA vs Cowles-type models. Meese and Rogoff (1983a,b) found that simple
random walk models perform at least as well as structural univariate models and
even vector autoregressions for forecasting exchange rates.1
1 The exchange rate market is very efficient and therefore notoriously hard to forecast.

5.1 Differencing
Fig. 5.1 Two non-stationary economic time series. (a) Nominal GDP per capita. (b) Consumer price index
We will show an example, using Stata, of how differencing a series can render
it stationary. We would like to do the following: Using fetchyahooquotes
download the daily Dow Jones Industrial Index from the beginning of 2000 through
the end of 2010. (Alternatively, use the ARIMA_DJI.dta dataset.) Calculate the
first difference of the DJIA. Graph the original series, as well as the differenced
series. Using this visual information, what does the order of integration of the DJIA
seem to be?
Fig. 5.2 Two non-stationary financial time series. (a) Dow Jones Industrial Avg. (b) Google share prices
In the original (not-differenced) series, the DJIA has some rather long swings
(see Fig. 5.2a). The first-differenced DJIA series seems to have a constant mean,
most likely a mean of zero (see Fig. 5.3). The variance might not be stationary,
though, as there are certain patches of high volatility interspersed by periods of low
volatility.
Exercises
1. For each of the items listed below, you should be able to do the following:
Download the data, and calculate the first and second differences. Graph the
original series and the two differenced series. Visually identify its possible order
of integration.
(a) The nominal US GDP per capita. The command to download the data into
Stata is:
wbopendata, country(usa) indicator(ny.gdp.pcap.cd)
year(1960:2015) long clear.
(b) US CPI. The command to download the data into Stata is:
wbopendata, country(usa) indicator(fp.cpi.totl)
year(1960:2015) long clear.
Fig. 5.3 The first difference of the Dow Jones industrial average
5.2 The Random Walk

The random walk process is one of the simpler examples of a non-stationary process.
The random walk is:
Xt = Xt−1 + et , (5.1)
which is the usual AR(1) but with the coefficient on Xt−1 equal to one. Whenever that coefficient is greater than one (or less than negative one), the series explodes without bound. When the coefficient is equal to one, we will see that the variance of the process depends upon the time period, rendering the process non-stationary.
Before we show how to make the random walk stationary, let us first see why the
random walk itself is not stationary. To do so, though, we will have to re-express
this process in a slightly different way.
Applying the back-shift or lag operator onto both sides of Eq. (5.1) and substitut-
ing the result back into (5.1) yields:
Xt = Xt−1 + et
= (Xt−2 + et−1) + et.

Repeating this substitution all the way back to the initial value X0 yields:

Xt = X0 + e1 + e2 + . . . + et−1 + et. (5.2)
Written in this way, it will be easier for us to see what the mean and variance of this
process is.
At period 0, taking the expectation of both sides of Eq. (5.2):
E(Xt | X0 ) = E (X0 + e1 + e2 + . . . + et | X0 )
= X0 + E (e1 | X0 ) + E (e2 | X0 ) + . . . + E (et | X0 )
= X0 .
The random walk model is very unpredictable, so our best guess during period 0 of
what X will be in period t is just X’s value right now at period zero. The predicted
value of a random walk tomorrow is equal to its value today.
Taking the variance of both sides of Eq. (5.2) gives:

Var(Xt | X0) = Var(X0 + e1 + e2 + . . . + et).

Since each of the error terms is drawn independently of the others, there is no covariance between them. And since X0 was drawn before any of the e’s in (5.2), there is no covariance between X0 and the e’s. (Moreover, but incidentally, it is a seed term, and is usually thought of as a constant.) Therefore, we can push the variance calculation through the additive terms:

Var(Xt | X0) = Var(e1) + Var(e2) + . . . + Var(et) = tσe^2.
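A quick Monte Carlo check of this result, sketched in Python (the book’s simulations are in Stata), simulates many random walks and compares the sample variance of Xt with tσe^2:

```python
import numpy as np

rng = np.random.default_rng(1)
T, reps, sigma = 100, 20000, 2.0

# Many independent random walks: X_t = X_{t-1} + e_t, X_0 = 0.
e = rng.normal(scale=sigma, size=(reps, T))
X = e.cumsum(axis=1)

# Theory says Var(X_t | X_0) = t * sigma_e^2.
print(round(X[:, 9].var(), 1))    # near 10 * 4 = 40
print(round(X[:, 99].var(), 1))   # near 100 * 4 = 400
```

The cross-sectional variance grows linearly in t, exactly as the derivation predicts.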
We can difference the random walk process once and the resulting differenced series is stationary. To see this, subtract Xt−1 from both sides of Eq. (5.1) to yield:

Zt = Xt − Xt−1 = et,

where we call the new differenced series Zt. The differenced series is now the purely random process, in fact the first model we looked at, and it is stationary.
The “random walk with drift” adds a constant term to the random walk:

Xt = β0 + Xt−1 + et. (5.3)
This process can be expressed in slightly different terms, which we will find useful.
Given an initial value of X0 , which we arbitrarily set to zero, then
X0 = 0
X1 = β0 + X0 + e1 = β0 + e1
X2 = β0 + X1 + e2 = β0 + (β0 + e1 ) + e2 = 2β0 + e1 + e2
X3 = β0 + X2 + e3 = β0 + (2β0 + e1 + e2 ) + e3 = 3β0 + e1 + e2 + e3
Xt = tβ0 + ∑(i=1 to t) ei. (5.4)
5.3.1 The Mean and Variance of the Random Walk with Drift
In this section, we see why a random walk with drift is neither mean-stationary nor
variance-stationary.
Taking the mathematical expectation of Eq. (5.4), we see that at any point in time t, the mean of X is

E(Xt) = E(tβ0 + ∑(i=1 to t) ei) = tβ0. (5.5)

Likewise, the variance is

Var(Xt) = Var(tβ0 + ∑(i=1 to t) ei) = Var(∑(i=1 to t) ei) = tσe^2. (5.6)
Yt = β0 + et . (5.9)
Xt = β0 + β1 t + et , (5.10)
where t denotes the time elapsed and the βs are parameters; the only random
component in the model is et , the IID errors.
E(Xt ) = E (β0 + β1 t + et ) = β0 + β1 t.
The variance is

Var(Xt) = Var(β0 + β1t + et) = Var(et) = σe^2.

Thus, a deterministic trend process has a non-stationary mean (it grows linearly with time) and a stationary variance (equal to σe^2).
Since this first-differenced series does not depend upon time, its mean and variance also do not depend upon time. Notice, though, that the first-differenced model now has an MA unit root in the error
terms. Never take first differences to remove a deterministic trend. Rather, regress
X on time, and then work with the residuals. These residuals now represent X that
has been linearly detrended.
There are many similarities between (a) random walks with drift and (b) deter-
ministic trend processes. They are both non-stationary, but the source of the
non-stationarity is different. It is worthwhile to look at these models side by side.
The “random walk with drift” is

Xt = β0 + Xt−1 + et = tβ0 + ∑(i=1 to t) ei
E(Xt) = tβ0
Var(Xt) = tσe^2.

The deterministic trend process is

Xt = β0 + β1t + et
E(Xt) = β0 + β1t
Var(Xt) = σe^2.
Both models have means which increase linearly over time. This makes it very
difficult to visually identify which process generated the data. The variance of
the random walk with drift, however, grows over time, while the variance of the
deterministic trend model does not.
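The side-by-side comparison can be illustrated by simulation. In this Python sketch (hypothetical parameters β0 = β1 = 0.5, standing in for the book’s Stata code), both processes drift upward, but only the random walk’s cross-sectional variance grows with t:

```python
import numpy as np

rng = np.random.default_rng(2)
T, reps, b0, b1 = 200, 5000, 0.5, 0.5

# Random walk with drift: X_t = t*b0 + accumulated shocks.
rw_drift = (b0 + rng.normal(size=(reps, T))).cumsum(axis=1)
# Deterministic trend: X_t = b0 + b1*t + e_t.
trend = b0 + b1 * np.arange(1, T + 1) + rng.normal(size=(reps, T))

# Both means grow linearly with t, so the level plots look alike,
# but only the random walk's variance grows with t (Var = t*sigma^2).
print(round(rw_drift[:, 9].var()), round(rw_drift[:, -1].var()))   # near 10, 200
print(round(trend[:, 9].var(), 1), round(trend[:, -1].var(), 1))   # both near 1.0
```

This widening spread is what the formal tests later in the book exploit to tell the two processes apart.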
Toward the end of the next section, we will show some formal means of identifying which type of process generated a particular dataset.

5.6 Differencing and Detrending Appropriately
We have seen that we can take the first difference of an integrated process to make
it stationary. Such a process is said to be “difference stationary.”
A different type of process is called “trend stationary.” Such a process has an
increasing mean, so it is non-stationary, in a sense. But it can be made stationary by
“detrending,” and so it is called “trend stationary.” Confusingly, both differencing
and detrending remove a trend, but they refer to two different things. When
econometricians say they are “detrending” the data, they usually mean that there
is a deterministic trend. That is, the variable “time” shows up in the data generating
process. Its effect can be removed by including time in a regression, and extracting
the residuals.
So, why is it worthwhile understanding this difference? What would happen if
we detrend a difference-stationary process, or difference a trend-stationary process?
We will answer these two questions in this subsection. We will do so by simulating
some data and seeing what happens.
First, let us generate two variables: one which is trend stationary (it has time as a
right-hand-side variable) and another which is difference stationary (a random walk
with drift).
We can easily deal with first-differenced data by using the D. difference operator
in Stata. We can detrend the data by regressing each variable on time. Let’s call the
resulting detrended variables “dtx” and “dty.”
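The two transformations can be sketched in Python as well (the book does this in Stata; the series below are hypothetical stand-ins for x and y):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 100
t = np.arange(1, T + 1)

# Hypothetical stand-ins: a trend-stationary series and a random walk
# with drift, sharing the same slope/drift of 0.5.
y_trend = 1.0 + 0.5 * t + rng.normal(size=T)
y_rw = np.cumsum(0.5 + rng.normal(size=T))

def detrend(y, t):
    """Regress y on a constant and time; return the residuals."""
    Z = np.column_stack([np.ones_like(t, dtype=float), t])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return y - Z @ beta

def difference(y):
    """First difference, the analogue of Stata's D. operator."""
    return np.diff(y)

# Both transformations remove the upward drift; which one is
# *appropriate* depends on the data-generating process.
print(abs(detrend(y_trend, t).mean()) < 1e-8)    # OLS residuals average zero
print(abs(difference(y_rw).mean() - 0.5) < 0.5)  # differences center on the drift
```

Either operation flattens either series, which is precisely why a visual check cannot settle which transformation is the right one.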
These two series look very similar at the outset. Visually, they are nearly
indistinguishable in their levels (Figs. 5.4, 5.5) and in first differences (Figs. 5.6,
5.7). They also look similar when detrended (Figs. 5.8, and 5.9).
The detrended and differenced series at first blush look similar.
By changing observations in the third line of the Stata code above, we can
gauge the impact that sample size has on these distortions. Below, we show what the
standard deviations would have been for a sample size of one million observations.
Exercises
1. Redo the exercise above, slowly increasing the sample size from 100, to 1000,
10,000 and 100,000. Summarize your results. Especially important, what are the
standard deviations? What conclusions do you draw?
Suppose we difference a series that is already stationary, say the white noise process

Xt = et.

White noise processes are already stationary. Lagging by one period and subtracting,

Xt − Xt−1 = et − et−1
X̃t = et − et−1, (5.11)

where X̃t denotes the differenced series. Since the errors are independent, its variance is

Var(X̃t) = Var(et) + Var(et−1) = 2σe^2.

The variance has doubled. The process was already stationary; we didn’t make it more stationary. What we did was make things worse: we added noise. Econometricians strive to find the signal through the noise, but here we’ve added more noise!
First-differencing also introduces negative autocorrelation into the data. The ACF
of a white noise process is zero at every lag. But now, after over-differencing, the
ACF of X̃t at lag=1 is:
Corr(X̃t, X̃t−1) = Cov(X̃t, X̃t−1) / √(Var(X̃t)Var(X̃t−1))
= Cov(X̃t, X̃t−1) / Var(X̃t)
= Cov(et − et−1, et−1 − et−2) / (2σe^2)
= −σe^2 / (2σe^2)
= −1/2.
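Both distortions are easy to confirm by simulation. This Python sketch (a stand-in for the book’s Stata exercises) differences a large white noise sample and checks that the variance doubles and that a lag-1 autocorrelation of −1/2 appears:

```python
import numpy as np

rng = np.random.default_rng(4)
e = rng.normal(size=200_000)   # white noise with variance one

d = np.diff(e)                 # needlessly first-differenced

# The variance doubles ...
print(round(d.var(), 2))       # near 2.0
# ... and a spurious lag-1 autocorrelation of -1/2 appears.
r1 = np.corrcoef(d[:-1], d[1:])[0, 1]
print(round(r1, 2))            # near -0.5
```

Over-differencing thus manufactures exactly the kind of autocorrelation structure the econometrician is trying to model.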
Of course, if the process really were white noise, then a graph of the data would
tend to look stationary; you wouldn’t be tempted to difference in the first place.
A more realistic example is a trend-stationary process, where the increasing values
might tempt you to automatically first-difference.
What would happen if you inappropriately first-differenced a trend-stationary
process?
Xt = α + βt + et .
Fig. 5.10 (a) ACF and (b) PACF of a detrended random walk with drift
First-differencing this trend-stationary process gives

Xt − Xt−1 = β + et − et−1.

We have made this into an MA(1) process. Notice, too, that the coefficient on et−1 is equal to one in absolute value, so that there is a unit root in the MA terms. As such, it is non-invertible.2
Exercises
1. Use Stata to generate 1000 observations of a variable X, equal to random noise
from a normal distribution with zero mean and a standard deviation of two.
(a) Summarize X and verify that the mean is approximately zero and the variance
is four (since it is the square of the standard deviation).
(b) Use corrgram X, lags(5) to calculate the ACF and PACF of X out to
five lags. Verify that this is white noise with no autocorrelation structure.
(c) Summarize D1.X, the first difference of the variable. What is the mean? What
is its variance? Did it increase or decrease? In what way? Is this what you
expected?
(d) Use corrgram D1.X, lags(5) to calculate the ACF and PACF of the
first difference of X out to five lags. Describe its autocorrelation structure. Is
it positive or negative?
2 Plosser and Schwert (1977) explore the implications of such overdifferencing and suggest a way to estimate such models.
5.7 Replicating Granger and Newbold (1974)
In 1974, Granger and Newbold published one of the most influential papers in
econometrics. They showed via simulation that if two completely unrelated series
are regressed on each other, but these series each have unit roots, then all of the
standard methods will tend to show that the two series are related. They will have
statistically significant coefficients between them (i.e. statistically significant βs),
they will have low p-values, and they will have high R 2 s. Even if the two series are
independent random walks, having nothing to do with each other, in finite samples,
they will look similar.
This is easiest to understand if the two series, Y and X, are random walks
with drift. Since they are both drifting, then regressing Y on X will find a linear
relationship between them, simply because they are both drifting. If they both
happen to be trending in the same direction, the coefficient will be positive; if they
are trending in opposite directions, the coefficient will be negative. But the point is
that there will be a statistically significant coefficient.
Granger and Newbold showed that this would be the case even if there were
no drift in the variables. Two random walks without drift will wander aimlessly,
but wander they will. And so it will look as though they have some sort of
trend. Regressing one on the other, then, will indicate a statistically significant
relationship between them. This phenomenon of finding relationships between
integrated variables where there are none, is called “spurious regression.”3
Phillips (1986) provides the theory explaining Granger and Newbold’s findings.
Phillips shows that this problem of spurious regression worsens as the sample size
increases.
In this subsection, we will generate two simple random walks, regress one on the
other, and show that Stata finds a spurious relationship between them. We follow
this up with a more thorough simulation that mimics much of the original Granger
and Newbold paper.
3 The book by Vigen (2015) presents dozens of humorous instances of spurious correlations
using real data. For example, there is a correlation of 66% between (a) films in which Nicolas
Cage has appeared, and (b) the number of people who drowned by falling into a pool. The
correlation between (a) per capita cheese consumption, and (b) the number of people who died
by becoming tangled in their bedsheets, is also quite large with a correlation of 95%. Vigen’s
website (TylerVigen.com) also provides many such spurious correlations.
We can best understand what happened by looking at the graphs of the variables.
The first graph above shows the two random walks over time. Just by random
chance, there seems to be a negative correlation between them. This is further shown
when we create a scatter plot of X1 vs X2, and overlay the regression line over the
scatter (Fig. 5.11).
Fig. 5.11 The two random walks X1 and X2 plotted over time (left), and X1 plotted against X2 with the fitted regression line (right)
Stata finds a negative and statistically significant relationship between these two
series. That is, even though the two series were created independently of each other,
Stata estimates a relationship that is statistically significant even at the 1% level.
Of course, this was just for one pair of series. What if we had drawn a different
set of numbers? Readers are encouraged to enter the code above once again into
Stata, but removing the set seed command. With that command removed, you
will draw a different set of random numbers each time you run the do file. You will
find, however, that you reject the null of no relationship far too frequently, far more
than 5% of the time.
In what follows, we show how to repeat this process thousands of times, each
time storing the output, so that we can summarize the results. That is, we show how
to run a simulation study of the “spurious regression” phenomenon, similar to what
Granger and Newbold had done.
First we define a program. This program describes what would happen with one
run of a simulation.
So that your random numbers look like my numbers, set the “seed” of Stata’s
random number generator to the same number I used:
Next, using the simulate command, we run 200 iterations of the simulation:
We have just generated two random walks, each of length 50. We regressed one
on the other, calculated the R 2 , the p-value and the Durbin-Watson statistic. We
took note of these numbers, and repeated the whole process another 199 times. We
then summarized all of those results in the summary table above.
What we see is that the average R^2 is equal to 0.21. This is quite high, considering that there is no relationship between these variables. Furthermore, one hundred and thirty (or 65%) of the 200 p-values are less than 0.05. That
is, 65% of the time, we would believe that the two independent unit-root processes
are statistically correlated. This is far more than we would normally expect, which
shows us that Granger and Newbold were correct: regressing two unit roots on each
other leads one to believe falsely that they are related.
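For readers not following along in Stata, the same experiment can be sketched in Python; the critical value 2.01 approximates the 5% two-sided t cutoff with 48 degrees of freedom:

```python
import numpy as np

rng = np.random.default_rng(5)
T, reps = 50, 200
reject, r2s = 0, []

for _ in range(reps):
    # Two independent, driftless random walks of length 50.
    x = np.cumsum(rng.normal(size=T))
    y = np.cumsum(rng.normal(size=T))

    # OLS of y on a constant and x, with the usual t-test on the slope.
    Z = np.column_stack([np.ones(T), x])
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ b
    s2 = resid @ resid / (T - 2)
    se_slope = np.sqrt(s2 * np.linalg.inv(Z.T @ Z)[1, 1])
    r2s.append(1 - resid.var() / y.var())
    if abs(b[1] / se_slope) > 2.01:   # ~5% two-sided critical value, 48 df
        reject += 1

# The series are unrelated, yet "significant" slopes show up far more
# than 5% of the time, with a sizable average R-squared.
print(reject / reps, np.mean(r2s))
```

The exact rejection rate varies with the seed, but it lands far above the nominal 5%, reproducing the Granger-Newbold spurious-regression result.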
5.8 Conclusion
In these first few chapters, we have explored simple univariate models. We did
this at length for two reasons. First, understanding these fundamental models helps
us understand more complicated models. Second, these simple models are quite
powerful. In fact, in forecasting competitions, these simple models hold their own
against far more complicated ones.
For decades, Spyros Makridakis has run a series of competitions to assess the accuracy of various forecasting methods. In an early paper (Makridakis and Hibon 1979), he and Michèle Hibon use 111 data series to compare various simple univariate methods (including
exponential smoothing, AR, MA, ARMA and ARIMA models). There is no clear
“best method.” The answer depends upon the definition of “best.” Still, Makridakis
and Hibon find that the simpler methods do remarkably well. They conjecture that
the simpler methods are robust to structural breaks. The random walk model, for
example, uses the most recent observation as its prediction. In the simplest of
methods, data from earlier periods—data that may have come from an economy
that was structurally different than it is today—do not enter into the forecast
calculations. This competition was followed by what is now known as the M-
competition (Makridakis et al. 1982). This competition increased the number of data
series to 1001 and included data sampled at different frequencies (yearly, quarterly,
and monthly). Makridakis also outsourced the forecasting to individual researchers
who were free to propose their own methods. No one used multivariate methods
such as VARs. It was a competition among univariate methods. They found that
simpler models hold their own, that the choice of “best model” still depends on the
definition of “best,” and that models that average other models do better than their
constituent pieces.
The M2-competition focused on real-time updating of forecasts, but the conclu-
sions were largely the same (Makridakis et al. 1993). The M3-competition increased
the number of data series to 3003. This competition saw the inclusion of artificial
neural networks. The earlier conclusions still hold, however: complex methods do
not always outperform the simpler ones (Makridakis and Hibon 2000). The M4-
competition is being organized at the time of writing. It promises to extend the
number of data series to 100,000 and incorporate more machine learning (neural
network) algorithms.
6 Seasonal ARMA(p,q) Processes
Many financial and economic time series exhibit a regular cyclicality, periodicity,
or “seasonality.” For example, agricultural output follows seasonal variation, flower
sales are higher in February, retail sales are higher in December, and beer sales in
college towns are lower during the summers.
Of course, when we say “seasonality” here, we simply mean any sort of
periodicity (Fig. 6.1). A weekly recurring pattern is seasonal, but at a weekly
frequency.
Seasonality can have different lengths. Retail sales vary seasonally with the
holiday shopping season, but they also have “seasonality” at a weekly frequency:
weekend sales are higher than weekday sales. Moreover, the two seasonal effects
may be of different types (stochastic vs deterministic). If you had quarterly data on
airline travel, what type of seasonal pattern might you expect to see? What if you
had monthly data?
ACFs and PACFs will prove themselves especially useful in detecting sea-
sonality, as fourth-quarter GDP in one year should tend to be correlated with
fourth-quarter GDP in another year.
6.1 Different Types of Seasonality

(1) The seasonal differences can vary by the same amount, or by the same percent,
each year. Such deterministic seasonality is best captured with the use of
seasonal dummy variables. If the dependent variable is in levels, then the
dummies capture level shifts; if the dependent variable is logged, then they capture proportional (percentage) shifts.
Which type of seasonality should we use? Solutions depend on the source of the
problem. We need to properly examine the data.
If the seasonality is deterministic, then we should use dummy variables. If the
seasonality varies stochastically, then a seasonal unit root process captures the
evolving dynamics quite nicely, and seasonal differencing should be used.
Xt = et where et ∼ iidN(0, σ 2 ),
where the data are now quarterly. To this we can add: 5 in the first quarter of every
year, 10 in the second quarter, -3 in the third quarter, and 2 in the fourth. Let the
dummy variables D1 , D2 , D3 , and D4 denote first through fourth quarters of every
year. This is modeled as:

Xt = 5 + 5D2 − 8D3 − 3D4 + et.

In the first quarter, E(X) = 5. In the second, E(X) = 5 + 5 = 10. In the third, E(X) = 5 − 8 = −3, and in the fourth quarter E(X) = 5 − 3 = 2.
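A simulation makes the point concrete. This Python sketch (a stand-in for the book’s Stata code) generates a dummy-variable model with an intercept of 5 and quarterly shifts, then verifies the four quarterly means of 5, 10, −3, and 2:

```python
import numpy as np

rng = np.random.default_rng(6)
n_years = 5000                      # many years, so sample means are sharp
q = np.tile([1, 2, 3, 4], n_years)  # quarter labels
e = rng.normal(size=4 * n_years)

# X_t = 5 + 5*D2 - 8*D3 - 3*D4 + e_t: deterministic quarterly shifts.
x = 5 + 5 * (q == 2) - 8 * (q == 3) - 3 * (q == 4) + e

# Quarterly means should land on 5, 10, -3, and 2.
print([round(x[q == k].mean()) for k in [1, 2, 3, 4]])   # [5, 10, -3, 2]
```

Because the seasonal pattern is deterministic, every year’s quarterly means are the same, which is exactly what seasonal dummies are built to capture.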
Suppose you are examining retail sales with monthly data. Holiday sales in
December are usually the strongest of the year. To see how strong this year’s
sales are, we should compare this December with last December. It is not terribly
important or illuminating to say that retail sales this December were higher than in
November. Of course they are! But are they bigger than what we’d expect December
sales to be? Arguably, we should compare December sales with December sales,
November with November, and so forth: Xt vs Xt−12 . If we had a great Christmas
season, then Xt > Xt−12 , or
Xt − Xt−12 > 0
Xt(1 − L^12) > 0.
and so forth.
Let’s apply seasonal differencing to a small dataset showing the quarterly US
unemployment rate for all persons, aged 15–64, not seasonally adjusted.
The value for the differenced variable, temp, in the second quarter of 1962 is the
difference between the unemployment rate in the second quarters of 1962 and 1961:
5.5 − 7.1 = −1.6. Likewise, temp in 1962q1 is unemp(1962q1) − unemp(1961q1)
= 6.5 − 8 = −1.5. A graph of the original and seasonally differenced series is shown
in Fig. 6.3.
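Seasonal differencing is just subtraction at the seasonal lag. This Python sketch reproduces the two differences quoted above; the values for 1961q3 and 1961q4 are hypothetical padding, included only so the array has a full year of lags:

```python
import numpy as np

# Quarterly unemployment rates, 1961q1-1962q2. The 1961q1/q2 and
# 1962q1/q2 values (8.0, 7.1, 6.5, 5.5) come from the text; the two
# middle values are hypothetical padding for 1961q3 and 1961q4.
unemp = np.array([8.0, 7.1, 6.9, 6.0, 6.5, 5.5])

# Seasonal (lag-4) difference: X_t - X_{t-4}.
sdiff = unemp[4:] - unemp[:-4]
print(sdiff)   # [-1.5 -1.6]
```

Stata’s S4. operator performs the same lag-4 subtraction in one step.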
The appropriate lag for differencing depends upon the frequency of the data and the type of seasonality. If we have quarterly data with quarterly seasonality, then seasonally difference a variable X by subtracting its value from four periods previous:

Xt − Xt−4 = (1 − L^4)Xt.

For monthly seasonal data, subtract its value from twelve periods previous:

Xt − Xt−12 = (1 − L^12)Xt.
[Figure: seasonally differenced monthly series: ln(RetailSales) − ln(L12.RetailSales); HousingStarts − L12.HousingStarts; UnempRate − L12.UnempRate; LaborForce − L12.LaborForce]
Xt = β4Xt−4 + et
Xt(1 − β4L^4) = et.

Xt = ut + γ4ut−4
Xt = ut(1 + γ4L^4).
December sales are not always independent of November sales. There is some
inertia to human behavior. If November sales are unusually brisk, then we might
expect this to carry over into December. For this reason, a purely additive seasonal
model would be inadequate. Box and Jenkins (1976) propose a multiplicative model
of seasonality.
Consider an ARIMA(1,0,0) model such as

Xt = β1Xt−1 + et
Xt(1 − β1L) = et
Φ(L)Xt = et.

To add multiplicative seasonality, multiply the AR polynomial by a seasonal AR polynomial φ(L^4):

Φ(L)φ(L^4)Xt = et
Xt(1 − β1L)(1 − β4L^4) = et.
Modeled in this way, two parameters (β1 and β4 ) allow for lags at three different
lengths (1, 4, and 5).
Multiplicative seasonality allows us to capture a lot of complexity with few
parameters: parsimony. Such a model is often denoted as ARIMA(1, 0, 0) ×
(1, 0, 0)4 . The × symbol indicates that the seasonality is multiplicative. The second
set of parentheses indicates that we have seasonal ARIMA terms. The subscript
denotes the duration of seasonality. Our seasonality repeats every four observations
(we have quarterly seasonality). The terms inside the parentheses have a similar
interpretation to the terms in the first set. We have included one AR term at one
seasonal lag; we do not need to take seasonal differences to induce stationarity (i.e.
the number of seasonal differences is zero), and we have not included any seasonal
MA terms in our model.
An ARIMA(1, 0, 1) × (2, 0, 0)4 model would be:

Φ(L)φ(L^4)Xt = Θ(L)ut,

which expands to

Xt(1 − β1L)(1 − β4L^4 − β8L^8) = ut(1 + γ1L)
Xt(1 − β1L − β4L^4 + β1β4L^5 − β8L^8 + β1β8L^9) = ut(1 + γ1L).
Example
What would an ARIMA(1, 0, 0) × (2, 0, 0)12 model look like? We have two AR polynomials. The non-seasonal polynomial,

Φ(L) = 1 − β1L,

has one AR lag. The seasonal polynomial,

φ(L^12) = 1 − β12L^12 − β24L^24,

has two AR lags at a seasonal length of twelve: the powers on L in the seasonal polynomial are all multiples of twelve. Since this is multiplicative seasonality, we multiply the two lag polynomials:

Φ(L)φ(L^12)Xt = et. (6.3)

Notice that the lag-1 term interacts with the two direct seasonal lags (12 and 24). The explicit form of this ARIMA model can be found by expanding the polynomials and applying the lag operators:
Xt = β1 Xt−1 + β12 Xt−12 − β1 β12 Xt−13 + β24 Xt−24 − β1 β24 Xt−25 + et . (6.4)
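Equation (6.4) can be simulated directly. This Python sketch (hypothetical coefficients, with the seasonal terms deliberately dominant) confirms that the autocorrelation at the seasonal lag of twelve dwarfs that at a non-seasonal lag:

```python
import numpy as np

rng = np.random.default_rng(7)
b1, b12, b24 = 0.10, 0.40, 0.40   # hypothetical, stationary values
T = 5000
e = rng.normal(size=T)
x = np.zeros(T)

# Eq. (6.4): X_t = b1 X_{t-1} + b12 X_{t-12} - b1*b12 X_{t-13}
#                + b24 X_{t-24} - b1*b24 X_{t-25} + e_t
for t in range(25, T):
    x[t] = (b1 * x[t-1] + b12 * x[t-12] - b1 * b12 * x[t-13]
            + b24 * x[t-24] - b1 * b24 * x[t-25] + e[t])

def acf(series, k):
    """Sample autocorrelation at lag k."""
    return np.corrcoef(series[:-k], series[k:])[0, 1]

# The seasonal lag dominates: correlation at lag 12 is large,
# while a non-seasonal lag such as 6 shows almost nothing.
print(acf(x, 12), acf(x, 6))
```

This is the pattern the corrgram-style diagnostics in Sect. 6.2 are designed to reveal.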
Exercises
1. Generate 1000 observations from the model in Eq. (6.4), with β1 = 0.10, β12 = 0.40, and β24 = 0.40. Graph the last 200 observations. Does the data appear seasonal? Examine the autocorrelation structure of the data. Can you detect seasonality?
2. What are the seasonal and non-seasonal AR and MA polynomials implied by
the following models? Multiply out these polynomials and write out the explicit
ARIMA model.
(a) ARIMA(1, 0, 1) × (1, 0, 0)12
(b) ARIMA(2, 0, 0) × (0, 0, 1)4
(c) ARIMA(0, 0, 1) × (0, 0, 2)12
(d) ARIMA(0, 0, 2) × (1, 0, 1)4
3. For each of the models listed above, what is the characteristic equation?
4. For each of the models listed above, what is the inverse characteristic equation?
6.1.5 MA Seasonality
This season’s retail sales (Xt ) might depend upon last year’s sales (Xt−12 ) directly
via an AR term, or they might instead be related via the error terms. That is, we
might have seasonality by way of a moving average term. These can enter additively
or multiplicatively.
Additive MA Seasonality
An example of additive MA seasonality is
Xt = ut + γ12ut−12,
where we simply add an additional error term at seasonal lengths. This can be
estimated in Stata much like any other ARIMA model:
Multiplicative MA Seasonality
We can also have multiplicative MA seasonality. As before, we might have a non-
seasonal MA(q) polynomial, θ(L), which we multiply by a seasonal MA polynomial,
Θ(Ls), to produce

Xt = θ(L) Θ(Ls) ut .
Example
What would an ARIMA(0, 0, 1)×(0, 0, 2)12 model look like? We have no AR terms
at all. We have one non-seasonal MA term, and two MA lags at a seasonal period of
twelve.
Xt = ut (1 − β1 L)(1 − β12 L12 − β24 L24)
= ut (1 − β1 L − β12 L12 + β1 β12 L13 − β24 L24 + β1 β24 L25) .
Or, explicitly,

Xt = ut − β1 ut−1 − β12 ut−12 + β1 β12 ut−13 − β24 ut−24 + β1 β24 ut−25 .

More generally, an ARIMA(p, d, q) × (P, D, Q)s model can be written in lag-polynomial form as

φ(L) Φ(Ls) (1 − L)d (1 − Ls)D Xt = θ(L) Θ(Ls) ut ,

where φ and θ are the non-seasonal AR and MA polynomials, and Φ and Θ are their seasonal counterparts.
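The bookkeeping in these polynomial expansions is mechanical, so it can be double-checked numerically, for example with numpy (the coefficient values here are arbitrary placeholders, not values from the text):

```python
import numpy as np
from numpy.polynomial import polynomial as P

b1, b12, b24 = 0.3, 0.5, 0.2      # arbitrary illustrative values

# Represent each lag polynomial by its coefficients on L^0, L^1, ...
nonseasonal = np.array([1.0, -b1])                 # (1 - b1*L)
seasonal = np.zeros(25)
seasonal[[0, 12, 24]] = [1.0, -b12, -b24]          # (1 - b12*L^12 - b24*L^24)

product = P.polymul(nonseasonal, seasonal)

# The product has nonzero coefficients only at lags 0, 1, 12, 13, 24, 25,
# with cross terms +b1*b12 at lag 13 and +b1*b24 at lag 25.
print(np.nonzero(product)[0].tolist())
```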
6.2 Identification
Given some dataset, what type of model should we estimate? We turn to familiar
tools: ACFs and PACFs.
If a model only has seasonal terms, then the ACFs and PACFs will behave
identically to non-seasonal ACFs/PACFs, only at seasonal frequencies, with all other
terms equalling zero. For example, the ACF of the non-seasonal AR(1) process,

Xt = 0.50Xt−1 + et (6.5)

equals 0.50 at lag one and decays exponentially thereafter. Analogously, the
seasonal AR process

Xt = 0.50Xt−4 + et (6.6)

has an ACF that equals 0.50 at the seasonal lag of four and decays exponentially
at multiples of the seasonal length, with zeros at all non-seasonal lags.

[Fig. 6.3: (a) autocorrelations and (b) partial autocorrelations of X from Eq. (6.5)]

[Fig. 6.4: (a) autocorrelations and (b) partial autocorrelations of X from Eq. (6.6)]
The PACF of (6.5) is equal to 0.50 at lag one, and zeros at all other lag lengths.
Analogously, the PACF of (6.6) equals 0.50 at one seasonal length (i.e. at lag of
four) and zeros otherwise. Figure 6.4a and b graph the ACF and PACF of Eq. (6.6)
respectively.
The analogy extends to additively seasonal MA processes. The ACF will show a
spike at the seasonal length, and the PACF will oscillate, declining exponentially at
multiples of the seasonal length. Figure 6.5a and b graph the ACF and PACF of

Xt = ut + 0.50ut−4 .
As before, Stata's estat aroots command provides a graph of the unit circle and
plots the roots of the inverse characteristic polynomials. If the inverse roots lie
inside the unit circle, then the process is stationary and invertible.
What would happen if one of the seasonal roots is on the unit circle? Then we
have seasonal unit roots, and we would have to estimate the model in differences.
That is, take seasonal differences until the model is stationary, and then estimate. In
practice, it is quite rare to take more than two seasonal differences. Usually one will
suffice.
Osborn et al. (1999) examine measures of industrial production for each of three
European countries and find little evidence of unit roots. Rather, dummy variable
models capture seasonality best.1
Ashworth and Thomas (1999) examine employment in the British tourism
industry—a notoriously seasonal variable—and find that its pattern of seasonality
was best explained as consisting of two different periods of deterministic (dummy
variable) seasonality. This repeats a finding that has become common in the
literature: unit roots can often be confused with structural breaks.
Franses (1991) also warns against automatically taking seasonal differences. It
is difficult to distinguish between deterministic and stochastic seasonality. If the
seasonality is deterministic, seasonal differencing results in misspecification and
poor forecasting ability.
Beaulieu and Miron (1990) examine the cross-country data and find that deter-
ministic seasonality explains a large fraction of the variation in real GDP, industrial
production and retail sales. Beaulieu and Miron (1993) explore the aggregate
US data and find only mixed support for seasonal unit roots. They warn against
mechanically or automatically taking seasonal differences in an attempt to preclude
possible unit root problems. Doing so runs the risk of misspecification. Ultimately,
there should be strong economic reasons before estimating unit-root seasonal
models.
The evidence for seasonal unit roots has been weak. That doesn’t mean they
don’t exist: seasonal unit root tests have serious deficiencies. For these reasons,
many researchers today opt to simply include seasonal dummy variables (even on
de-seasonalized data) and proceed with their analyses.
Most macroeconomic data are available in de-seasonalized form. That is, you can
download the raw data, or data that have been filtered so that the seasonal component
has already been removed. Given this availability, many researchers take the easy
route, use de-seasonalized data, and ignore seasonality in their analyses. This might
be useful, but there are no free lunches. Everything comes at a cost.
No pre-canned automatic routine will be appropriate in all situations.
Some seasonality may be additive. Some may be multiplicative. Some might be
stochastic or deterministic. One size does not fit all. Relying on pre-canned
de-seasonalized data requires a lot of trust that the procedures being used by the
statistical agency are appropriate. They may be complicated, and they may test for
and adjust for various contingencies, but they are never perfect.
Because of this, using de-seasonalized data is bound to introduce some errors.
1 The conclusions in Osborn et al. (1999) are tempered a bit by their finding that seasonal unit root
models have good out-of-sample forecasting properties. This might say more about the low power
of seasonal unit root tests than about the existence of seasonal unit roots.
Suppose that the true data generating process were the stationary AR(1)

Xt = 0.90Xt−1 + et ,

so that there was no seasonality. If the data were run through, say, the Census
Bureau’s X-11 filter, then OLS would estimate something closer to

Xt = 0.99Xt−1 + et .
This was the case for AR models of higher order: the sum of their coefficients was
closer to one. This was also the case for data generated from different types of
seasonal processes. Because of this upward bias, the standard unit root tests (such
as the Dickey-Fuller and Phillips-Perron tests, which we will discuss in Chap. 7) are
less able to reject the null hypothesis of a unit root. Using pre-filtered data reduces
the power of the standard unit root tests. That is, the tests would indicate that there
is a unit root even though there is none.
6.6 Conclusion
Little research is conducted these days using only univariate ARIMA models,
but they are quite important. The concepts surrounding ARIMA modeling are
foundational to time series; AR and MA processes are the component pieces
of many more complicated models, and the problems of integration and non-
stationarity must be dealt with in any time-series setting. Mastering these ingredients
ensures that the more complicated material will be more digestible.
Pankratz (1991) and Pankratz (1983) are two classic and gentle introductions to
the theory and practice of ARIMA modeling. They are readable, insightful, and are
highly recommended. Of course it is hard to beat the time-series textbook by Box
and Jenkins (1976), especially regarding ARIMA modeling. Box and Jenkins are
the two people most responsible for the popularity of ARIMA and seasonal ARIMA
modeling. Their textbook should be on your bookshelf.
Hibon and Makridakis (1997) offer a skeptical voice regarding the mechanistically
applied Box-Jenkins model-selection process, especially the uncritical use of
first-differencing to remove trends and seasonality: mistakenly taking first
differences results in decreased forecast accuracy. More formal testing is required
before taking seasonal first differences.
The “frequency domain” (decomposing time series into the sum of sine and
cosine waves of different frequencies) is a natural home for the study of periodic
or seasonal time series. Unfortunately, this material is far beyond the scope of this
book. However, for the interested and adventurous, the relevant chapters in Chatfield
(2016) contain the gentlest introduction I have found to time series in the frequency
domain. The book by Bloomfield (2004) is a popular and more in-depth introduction
to frequency domain econometrics.
Franses (1996), Franses and Paap (2004), Ghysels and Osborn (2001) provide
book-length treatments of seasonal time-series models.
7 Unit Root Tests
7.1 Introduction
A process might be non-stationary without having a unit root. The two concepts are
related, but they are not identical and it is common to confuse the two. We can have
non-stationarity without it being due to a unit root. We could have a seasonal model.
Or, we could have a deterministic trend. (We can even have non-stationarity because
the variance is changing over time.)
As we saw briefly in Chap. 5, the deterministic trend model and the random walk
with drift share many features. They have a similar mean process that grows linearly
over time, but they differ in the source of their non-stationarity. One has a stochastic
trend. The other has a deterministic trend.
This is not mere semantics, and the source of the difference is of interest to
more than just nerdy academics. Knowing the source of the non-stationarity has
real-world policy implications.
For example, if the gross domestic product of the US is the result of a
deterministic trend model, then any shocks to the system will dampen over time
and become irrelevant. A random walk process, on the other hand, never gets over
a shock. The shock lingers on, in full force, forever. Thus, the effects of a decades-old
bad economic policy, even if the policy was short-lived, would still be felt today.
If GDP is a random walk with drift, we are doomed to suffer the full consequences
of yesterday’s mistakes; if it comes from a deterministic trend, then we will soon
outgrow those mistakes.
If a company’s stock price is the result of a stochastic growth process, then
temporary shocks—mismanagement by the CEO, an unexpected energy crisis, a
small recession—will affect the stock price indefinitely into the future. If the stock
price is determined by a deterministic trend model, then the stock will rebound;
investors may profitably take a “this too will pass” attitude and invest counter-
cyclically.
Any hypothesis test involves comparing the fit of the data with the results that would
be expected if the null hypothesis were true, and with those expected under an implicit
alternative hypothesis.
Which hypothesis is the null is up to the researcher, but it is not an insignificant
choice.
In what follows, we will discuss several unit root tests. These can be grouped as
tests that have a null of a unit root, and those whose null lacks a unit root.
We will work through some examples of each, paying extra attention to the
Dickey-Fuller and Augmented Dickey-Fuller tests.
After this, we will turn to a famous application of the DF unit root test on US
macroeconomic data (Nelson and Plosser 1982). We will end by contrasting the
results from all our various unit root tests.
If we need to analyze a dataset, we need to know whether it is stationary
and what the source of non-stationarity is. Is it non-stationary because it is a
random walk with drift? Or is it non-stationary because it has a deterministic
trend?
In Chap. 5 we examined several stationary processes, including (1) the AR
processes (with and without a mean of zero), as well as some non-stationary processes:
(2) the deterministic trend process (which is trend stationary), (3) the random
walk process, and (4) the random walk with drift.
Let’s combine all of the models listed above into one overarching model, so that
we can make sense of how various statistical tests relate to each other. We write the
following overarching model:
Xt = β0 + β1 Xt−1 + β2 t + et . (7.1)
Notice that it nests the various models listed above. Table 7.1 lists all of the
parameter restrictions required to yield each particular model.

Table 7.1 The overarching model and its parameter restrictions
  Zero-mean AR(1):        β0 = 0, |β1| < 1, β2 = 0
  AR(1) with constant:    β0 ≠ 0, |β1| < 1, β2 = 0
  Deterministic trend:    |β1| < 1, β2 ≠ 0
  Random walk:            β0 = 0, β1 = 1, β2 = 0
  Random walk with drift: β0 ≠ 0, β1 = 1, β2 = 0
How do we test whether a particular dataset came from an AR process? We need
to specify the alternative hypothesis, and Table 7.1 makes clear that there are many
possible alternative hypotheses.
With unit root testing, we’re comparing one model with another. But there are
so many different models to compare to. Unit root testing gets confusing because
there are so many different alternatives. It’s like that old joke: “How’s your wife?”
“Compared to what?”
The tests below differ—among other ways—in the specified alternative hypotheses.
A test of an AR(1) versus an alternative of a deterministic trend (DT) model
tests that β2 = 0 against the alternative that β2 ≠ 0. A test of an RW model versus
an alternative of an RWWD model tests whether β0 = 0 against the alternative
hypothesis that β0 ≠ 0. Testing between RW and DT models involves joint tests of
several parameters.
A random walk process and a zero-mean AR(1) process are both nested by the
overarching model (7.1):
Xt = β0 + β1 Xt−1 + β2 t + et
when β0 = 0 (to make the process zero-mean) and β2 = 0. We assume
here that the error terms are IID(0, σ2), which is to say they are serially uncorrelated.
If β1 = 1 then the model is a random walk. (It has a unit root.) Alternatively, if
|β1| < 1 then the model is a stationary AR(1).
A simple test of the random walk versus the zero-mean AR(1) model would seem
to be the following. Estimate a regression of the form

Xt = β1 Xt−1 + et ,

or, subtracting Xt−1 from both sides,

ΔXt = (β1 − 1) Xt−1 + et = γ Xt−1 + et , (7.3)

and test whether γ = 0 (i.e. that β1 = 1). Equation (7.3) emphasizes the fact that
if the model were a random walk, then first-differencing would render the model
stationary.
Unfortunately, under the null of a unit root, the sampling distribution of β̂1
does not follow a t-distribution, or any other standard distribution, in either finite
samples or asymptotically. The reason stems from the fact that Xt−1 on the
right-hand side of Eq. (7.3) is not stationary. This means that the test statistics do
not converge along the usual lines of the central limit theorem.
Fortunately, Stata has done the hard work for us. We simply have to remember to
ask it to do so.
Example
In this example, we will load up two artificial datasets, one where we know the
data come from a RW process, and another where the data come from a zero-mean
AR(1). We will compare the results from the two datasets.
First, we load the random walk dataset and graph it:

[Figure: the simulated random walk process, 100 observations]
Second, the null hypothesis that we are testing is that the data came from a simple
random walk:
H0 : Xt = Xt−1 + et ,
HA : Xt = βXt−1 + et ,
with |β| < 1. Thus, there is neither a constant nor a trend in either the null or the
alternative hypothesis. Therefore, we can implement a simple DF test by estimating
(7.3) and comparing the test statistic with the appropriate critical values.
[Fig. 7.2: data simulated from a stationary AR(1) process, 100 observations]
The regress option on dfuller tells Stata to report the coefficient estimates
along with the Dickey-Fuller results.
When we usually undertake a hypothesis test, the most common null hypothesis
is that an estimated coefficient is equal to zero. But looking back at the overarching
model, the relevant coefficient value is β1 = 1. Recall, though, that we transform
the model so that we are looking at first differences. This also transforms the
relevant null hypothesis from β1 = 1 to γ = β1 − 1 = 0, so that we can run our usual
hypothesis tests. Long story short, just check whether the dfuller coefficient
values are equal to zero.
Resuming, Stata estimates the following model:

ΔXt = (−0.039)Xt−1 + et .
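The regression underlying this output is easy to replicate by hand. The following Python sketch runs the same kind of no-constant DF regression on a freshly simulated random walk (numpy only, so this mimics, but is not, Stata's dfuller; the −1.95 figure is the standard asymptotic 5% DF critical value for the no-constant case, not a value from the Stata output):

```python
import numpy as np

# Hand-rolled no-constant Dickey-Fuller regression: ΔX_t = γ X_{t-1} + e_t.
rng = np.random.default_rng(7)
X = np.cumsum(rng.standard_normal(100))   # a simulated random walk

dX, Xlag = np.diff(X), X[:-1]
gamma = (Xlag @ dX) / (Xlag @ Xlag)       # OLS slope
resid = dX - gamma * Xlag
se = np.sqrt(resid @ resid / (len(dX) - 1) / (Xlag @ Xlag))
t_stat = gamma / se

# The statistic must be compared with Dickey-Fuller critical values,
# not the usual Student-t ones; the asymptotic 5% value is about -1.95.
print(round(t_stat, 3), "reject" if t_stat < -1.95 else "fail to reject")
```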
Example
Let’s contrast the performance of the Dickey-Fuller test when the data generating
process does not have a unit root. Load up the dataset ARexamples.dta, keeping
only the first 100 observations so that it has the same number of observations as the
previous example. Graphing the variable indicates that the process is likely not a
random walk (Fig. 7.2).
Indeed, the data on X were simulated from an AR(1) process, with a coefficient
equal to 0.50. Nevertheless, we conduct a formal Dickey-Fuller test:
The test statistic of −3.973 is far greater in magnitude than any of the critical
values, indicating a strong rejection of the null hypothesis of a unit root. This is
encouraging, since the data were not from a random walk, but from a stationary AR
process.
Exercises
1. Unit root tests have notoriously low power, especially if the AR coefficient is
close to one. In this exercise, you are asked to explore this for yourself. Generate
100 observations from a stationary zero-mean AR(1) process with β = 0.95.
Draw the errors independently from an N(0,1) distribution. Conduct a dfuller
test with the noconstant option. Are you able to reject the null of a unit root?
Repeat the exercise another nine times for a total of ten experiments. In how
many instances are you able to reject the null hypothesis?
2. Repeat the exercise above but with 1000 observations and with β = 0.99. In how
many of the ten instances are you able to reject the null hypothesis? (We should
be rejecting 95% of the time, but we reject far less frequently than that.)
3. Repeat the previous exercise (N = 1000 obs and β = 0.99) but write a loop to
conduct the exercise 1000 times. What proportion of the times are you able to
reject the null hypothesis? Do you reject approximately 95% of the time? (Hint:
you can’t possibly count by hand the number of times the reported test statistic
is greater than the critical value. Set up a counter to do this.)
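These exercises can be automated along the following lines in Python (a numpy sketch rather than a Stata loop; 200 replications instead of 1000 to keep it quick, and the asymptotic 5% critical value −1.95 in place of dfuller's finite-sample values):

```python
import numpy as np

def df_tstat(x):
    """t-statistic from the no-constant DF regression of Δx on x_{t-1}."""
    dx, lag = np.diff(x), x[:-1]
    g = (lag @ dx) / (lag @ lag)
    r = dx - g * lag
    se = np.sqrt(r @ r / (len(dx) - 1) / (lag @ lag))
    return g / se

def simulate_ar1(beta, n, rng, burn=100):
    e = rng.standard_normal(n + burn)
    x = np.zeros(n + burn)
    for t in range(1, n + burn):
        x[t] = beta * x[t - 1] + e[t]
    return x[burn:]

rng = np.random.default_rng(0)
reps, reject = 200, 0
for _ in range(reps):
    if df_tstat(simulate_ar1(0.95, 100, rng)) < -1.95:  # 5% DF critical value
        reject += 1

# With beta = 0.95 and only 100 observations, the rejection rate falls far
# short of the ideal; this is the low power the exercises ask you to see.
print(reject / reps)
```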
Many processes do not have a mean of zero, so the zero-mean assumption of the
previous subsection will often be inappropriate. As we saw before in Chap. 2, a
stationary AR(1) process with a constant,
Xt = β0 + β1 Xt−1 + et (7.4)
has a mean of

μ = β0 / (1 − β1) .
How do we test for a unit root? That is, how do we test whether our data
came from a RW or a non-zero stationary AR(1) process? Begin again with our
overarching model (7.1) but with no deterministic trend (β2 = 0),
Xt = β0 + β1 Xt−1 + β2 t + et
= β0 + β1 Xt−1 + et .
Subtracting Xt−1 from both sides gives the estimating equation

ΔXt = β0 + γ Xt−1 + et ,

with γ = β1 − 1. Under the null hypothesis, the data generating process is a random
walk (RW). As such, β1 = 1 and γ = 0. Under the alternative hypothesis, the data
generating process is a stationary AR(1) with some potentially non-zero mean. As
such, |β1| < 1 and γ < 0.
The fact that the alternative hypothesis, the stationary AR(1) process, has a non-zero
mean does not change the procedure; it affects the critical values only. Thus, to
perform the Dickey-Fuller test where the alternative is a stationary process with a
non-zero mean, we simply add an intercept and proceed as before.
Example
We will do this version of the Dickey-Fuller test with two datasets. The first dataset
will use data simulated from a random walk process, and the second will use data
simulated from a non-zero AR(1) process.
The default setting for Stata’s dfuller command is to include a constant.
Previously, we had to specify a noconstant option. Here, the constant is called for
because we are testing against an AR(1) process with a non-zero mean (i.e. one with
a constant).
The test statistic (−2.723) and p-value (0.0701) indicate that we cannot reject
the null at the 1% or 5% levels.
Example
Let’s now load up a dataset from a stationary AR(1) process with non-zero mean,
and compare the results. The data were generated according to Xt = 10 − 0.50Xt−1 + et .
In this case, we reject the null of a unit root at all the usual significance levels.
We can see this in two ways. First, the estimated p-value is zero. Second, the
test statistic (−5.926) is far more negative than any of the Dickey-Fuller critical
values.
What if our data are trending over time? How do we test between the RW with
drift (a “stochastic trend”), and the “deterministic trend” AR model? Visually,
they look the same: they both trend upward linearly. And they are both non-stationary.
We are not testing the stationarity assumption here, then. What we are
testing is the source of the non-stationarity. Is it due to a stochastic or a deterministic
factor?
Both models are nested by the overarching model,
Xt = β0 + β1 Xt−1 + β2 t + et . (7.7)
The deterministic trend model has |β1| < 1 and β2 ≠ 0, while the random walk
with drift has β0 ≠ 0, β1 = 1, and β2 = 0.
Then a straightforward test of the unit root hypothesis—i.e. the hypothesis that
the model is a random walk—is to run a regression of (7.7) and conduct an F-test of
β1 = 1 and β2 = 0. The Dickey-Fuller approach is to re-write (7.7) by subtracting
Xt−1 from both sides, and estimating

ΔXt = β0 + γ Xt−1 + β2 t + et .
[Figure: the trending variable Y over 100 periods]
The output here is a bit ambiguous. Given the test statistic (−3.240) and p-value
(0.0768), we can reject the null hypothesis of a unit root at the 10%, but not at the
1% and 5% levels.
Example
Let’s see how the test performs when the true DGP does not have a unit root but
rather, has a deterministic trend (see Fig. 7.4).
[Fig. 7.4: data simulated from a deterministic trend process, 100 observations]
The test statistic (−10.078) and p-value (0.0000) indicate that we can safely
reject the null hypothesis of a unit root. This is fortunate, as the data were generated
from a deterministic trend process, not a unit root with drift.
The Dickey-Fuller test assumes that the error term is not serially correlated. This
may not be a reasonable assumption. In fact, most of the time it is not: there is
often quite a lot of autocorrelation in the residuals. To account for this
autocorrelation, Said and Dickey (1984) introduced the Augmented Dickey-Fuller
test. The ADF test adds k lagged-difference terms onto the standard DF estimation
equations.
Why did we just tack on some lagged difference terms to the standard DF
equation? Dickey and Fuller (1979) presumed that the data generating process is
AR(1)—i.e. that it has one lag (ex: Xt = β0 + β1 Xt−1 + et )—and the question
is whether it is stationary (|β1| < 1) or non-stationary (β1 = 1). Said and Dickey
extended this to arbitrary AR(p) processes.
Let’s see how this works for a simpler AR(2) process vs a RW without drift.
Suppose that the data generating process is AR(2),

Xt = β1 Xt−1 + β2 Xt−2 + et .

As with the standard Dickey-Fuller procedure, subtract Xt−1 from both sides and
rearrange to get

ΔXt = (β1 + β2 − 1) Xt−1 − β2 ΔXt−1 + et .

We then test whether

γ = (β1 + β2 − 1) = 0,

or equivalently whether

β1 + β2 = 1.

More generally, for an AR(p) process we estimate

ΔXt = γ Xt−1 + Σ_{i=1}^{p−1} ci ΔXt−i + et ,

or, adding a constant and a deterministic trend,

ΔXt = β0 + γ Xt−1 + β2 t + Σ_{i=1}^{k} ci ΔXt−i + et , (7.10)

and we test whether γ = β1 + β2 + · · · + βp − 1 = 0.
Adding k − 1 lagged difference terms allows us to test for unit roots in AR(k)
processes. If the data generating process includes MA terms, then all is not lost.
Recall that an invertible MA process can be expressed as an AR(∞) process. We
could never estimate an infinite number of lagged differences, but if we estimate
enough of them, we can adequately account for any number of MA terms.
In practice, however, we are never sure what the order of the ARMA(p,q)
process is. And there are no definitive rules for how many lags to include in our
estimating regressions. Some researchers add terms until the residuals exhibit no
autocorrelation. Others begin with many lags and slowly remove insignificant terms,
in a sort of general-to-specific methodology.
We will explore several of the more commonly used lag-selection methods in
Sect. 7.3.6. Ignoring this complication, though, the process for performing an ADF
test in Stata is no different from performing the standard DF test. In fact, the
command is the same. You must simply add a certain number of lagged differences
via lags(k) as an option to dfuller. For example, an ADF for an AR(2) vs a
random walk is dfuller X, nocons lags(1).
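The ADF regression itself is ordinary OLS, so its mechanics can be sketched in a few lines of Python (numpy only; this mimics, but is not, Stata's dfuller, and it omits the constant and trend):

```python
import numpy as np

def adf_tstat(x, k):
    """t-statistic on γ in ΔX_t = γ X_{t-1} + Σ c_i ΔX_{t-i} + e_t."""
    dx = np.diff(x)
    y = dx[k:]
    cols = [x[k:-1]] + [dx[k - i:-i] for i in range(1, k + 1)]
    Z = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    s2 = (resid @ resid) / (len(y) - Z.shape[1])
    se = np.sqrt(s2 * np.linalg.inv(Z.T @ Z)[0, 0])
    return beta[0] / se

rng = np.random.default_rng(1)
X = np.cumsum(rng.standard_normal(200))   # random walk: should not reject
print(round(adf_tstat(X, 1), 3))
```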
Rather than using OLS to detrend the data, Elliott et al. (1992, 1996) propose a GLS
detrending procedure. On the detrended data, X̂t , they then estimate

ΔX̂t = γ X̂t−1 + Σ_{i=1}^{p} ci ΔX̂t−i + et .
The mathematical details of generalized least squares are beyond the scope of this
book; however, it is easy to implement the DF-GLS procedure in Stata using the
dfgls command.
The maxlag(#) option allows you to set the maximal number of lags to
consider. (We will discuss lag selection in Sect. 7.3.6.) A trend is included by
default; you can exclude the deterministic trend term by using the notrend option.
Finally, ers uses critical values as calculated by Elliott, Rothenberg and Stock
(ERS), the original authors of the DF-GLS procedure. This option is seldom used, as
the Cheung and Lai (1995a,b) critical values are considered superior. ERS calculated
their critical values only for the case where the number of lags is zero. Cheung and
Lai (1995a,b) find that finite-sample critical values depend on the number of lags,
and re-estimated the critical values for various lag lengths.
The dfgls command reports the optimal lag choices of several different
selection procedures. Specifically, it computes the optimal lag based upon: Ng
and Perron’s (1995) sequential t-test, Ng and Perron’s (2001) modified Akaike
Information Criterion (the MIC), and Schwarz’s (1978) Information Criterion (the
SIC). We will discuss lag selection and work out an example in the next section.
Up to this point, we have left unspecified how many lags to include in the Dickey-
Fuller tests. There are several different approaches to answering this problem, but all
of them have as their centerpiece the idea that, once a model is properly specified,
the residuals are white noise. Thus, a quick and easy answer to the question of lag
length is simply this: choose as many lags as is required to leave the residuals as
uncorrelated white noise.
There is, as usual, a trade-off to consider. If we have too few lags, then our
residuals will be autocorrelated; the autocorrelation will throw off our hypothesis
testing and bias our results. If we have too many lags, then we will have white noise
residuals, but we will be estimating more coefficients (the ones on the extraneous
lags) than we need to. This, in turn, means that our tests will have lower power; we
will use up valuable degrees of freedom to estimate these extraneous coefficients,
when they could be used to give us more precise estimates of the truly meaningful
ones. (In general, econometricians believe the latter is less problematic. When in
doubt, include the lag.)
Should you start with a few lags and keep adding more as needed? Or should you
start with many lags, and whittle them down as allowable? And if so, how many lags
should you begin with before whittling? Can we use information criteria to directly
choose the optimal lag length? Different econometricians have proposed different
rules. In this subsection, we will review some of the more common ones.
Ng and Perron (1995), and Campbell and Perron (1991) suggest a sequence
of t-tests, starting with a large number of lags, say kmax , and testing down. If
the coefficient on the longest lagged term is insignificant (Stata uses a p-value
greater than 0.10), then drop that term and re-estimate the smaller model; repeat
as necessary.
Ng and Perron (1995) compared their sequential t-test method with the Akaike
and Schwarz Information Criteria and found their sequential t-test approach
preferable: it suffered less from size distortions but had comparable power.
Of course, the Ng and Perron (1995) procedure leaves unspecified the value
of kmax from which to test down. Schwert (1989, 2002) provided one answer,
suggesting that kmax be calculated as

kmax = int[12 (T /100)^{1/4}],

where T denotes the number of periods in your dataset, and int denotes the integer
portion of the calculated number. In its dfgls command, Stata implements a slight
variation of this formula:

kmax = int{12 [(T + 1)/100]^{1/4}}.
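Both versions of the rule are one-liners. A quick Python check (the sample size T = 838 assumes the monthly UNRATE sample from January 1948 through October 2017 used in the example below):

```python
# Schwert's rule for the maximal lag, in the variant Stata's dfgls uses:
# kmax = int(12 * ((T + 1) / 100) ** (1 / 4)).
def kmax(T):
    return int(12 * ((T + 1) / 100) ** 0.25)

# Monthly UNRATE, Jan 1948 - Oct 2017, is 838 observations.
print(kmax(838))   # prints 20, matching the dfgls output discussed below
```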
Example
This all gets quite dizzying. Let’s turn to an example to solidify the material.
First, we download and tsset some data: the seasonally adjusted civilian
unemployment rate for the US (UNRATE).
If you’re trying to work along, it would be best if our datasets were identical,
beginning in January 1948 and ending October 2017.
The output shows that Schwert’s criterion suggests a maximum lag of 20. From
there, we could test down, using Ng and Perron’s sequential t-test procedure. If so,
we would have arrived at 19 lags. Quite a lot of lags. If instead we opt to use an
information criterion, we would have chosen a different number of lags to use in
our DF-GLS regressions. The Schwarz criterion chooses a lag length of five. This
criterion tends to favor fewer lags. Ng and Perron’s modified AIC, what Stata calls
the MAIC, chooses 12 lags.
Our ultimate goal is not choosing the number of lags, but to conduct a unit root
test. The lag selection is simply a preliminary.
Had we used 19 lags, the DF-GLS test statistic would be −3.247. This is greater
in magnitude than the critical values at the 5% and 10% levels, so we reject the null
of a unit root at those levels; we cannot reject it at the 1% level.
If we had used the MAIC as our guiding principle for lag selection, then we
would conduct a DF-GLS test with 12 lags. The test statistic from this would be
−2.796. Given this test statistic, we would reject the null hypothesis of a unit root
when testing at the 10% level. We cannot reject a unit root at the 5% or 1% level.
Finally, if we had opted for the SIC, we would have estimated a DF-GLS model
with five lags. This would have resulted in a test statistic of −3.813, which is greater
in absolute value than all of the critical values. We would have rejected the null
hypothesis of a unit root.
As you can see, the conclusions of these unit root tests depend upon the number
of lags which you have estimated. Ideally, the various lag-selection criteria would
have recommended the same number of lags. Unfortunately, this is rarely the case.
Exercises
1. Apply DF-GLS to the Nelson and Plosser (1982) dataset. Allow for a trend. Use
Ng and Perron’s modified AIC for lag selection. Use 5% for hypothesis testing.
Which of the variables are trend stationary? Which seem to be random walks
with drift?
The Phillips-Perron (1988) test is an alternative to the ADF test. Rather than
compensating for serial correlation in the error terms by adding lagged differences,
Phillips and Perron correct the standard errors for heteroskedasticity and
autocorrelation (HAC). That is, whereas the ADF test changes the regression equation,
Phillips-Perron changes the test statistics. This is done in much the same way that
Stata calculates Newey-West HAC standard errors via its newey command
(Newey and West 1986). The specifics of this correction would lead us too far afield;
suffice it to say that Stata estimates
Xt = β0 + ρXt−1 + β2 t + et (7.11)
and computes the HAC-corrected standard errors quite readily using the pperron command.
The options noconstant and trend have the usual interpretation: regress
shows the coefficient estimates of Eq. (7.11), and lags(#) indicates the number
of lags used to calculate the Newey-West standard errors.
The pperron test produces two test statistics: Z(ρ) and Z(t). Phillips and Perron
find that Z(ρ) has higher power than Z(t) or ADF tests when the error process has
AR or positive MA components. The Phillips-Perron test is not suited to situations
where the error has large, or even moderately sized, negative MA terms.
Example
As with the other DF-type tests, the null hypothesis is that the data have a unit
root (i.e. ρ = 1) and the alternative is that ρ < 1. Examining the test statistics, there
is some evidence that Y has a unit root. The evidence is weak, though. As a further
step, we can take the first difference and verify that there is no unit root in ΔYt .
Here, we reject the null of a unit root. That is, we conclude that there is no unit
root in the first-differenced variable.
The null hypothesis in most unit root tests (certainly all the ones we have mentioned
thus far) is that the process contains a unit root. Unfortunately, unit root tests have
notoriously low power (i.e. they do not reject the null of a unit root often enough).
Because of this, it is useful to run a complementary test, one that has stationarity
as the null rather than the alternative. The KPSS test is such a test and provides a
useful double check. The test was developed by Kwiatkowski et al. (1992), which
is, admittedly, a mouthful; everyone shortens this to “KPSS.” The test is easy to
execute in Stata, and researchers are encouraged to use it.1 If it isn’t already installed
on your computer, install it by:
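The kpss command is a user-written module available from SSC:

```stata
ssc install kpss
```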
The KPSS test decomposes a time series variable into the sum of a deterministic
trend, a random walk component, and a stationary error:
yt = βt + rt + et
rt = rt−1 + ut .
The initial term in the random walk sequence, r0 , plays the role of the intercept.
The error terms on the random walk component (ut ) are presumed IID(0,σ 2 ).
If ut has zero variance, its value is always zero. Thus, r1 = r2 = r3 = · · · = r0 .
In other words, rt is no longer a random walk, and yt is a simple trend stationary
model:
yt = βt + r0 + et .
The test is simply a Lagrange multiplier (LM) test that the random walk component
has zero variance.
To implement the KPSS test, first estimate the model and calculate the residuals
êt . Calculate the running sum of the residuals:

St = ê0 + ê1 + · · · + êt .
1 Sephton (2017) provides updated critical values for the KPSS test for use with small samples.
Next, estimate the variance of the residuals,

σ̂ 2 = (ê02 + ê12 + · · · + êT2 )/T ,

and form the test statistic

LM = (S02 + S12 + · · · + ST2 )/(T 2 σ̂ 2 ),

which is simply the ratio of two different estimates of the residual variance. In
actuality, the denominator is a slightly different estimate of the “long-run” variance
of et , calculated using the residuals êt , and weighted using a particular weighting
method (the Bartlett kernel). The details of this are beyond the scope of this book.
In Stata, the whole process is quite easy:
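A minimal sketch of the command (the variable name and maximal lag are illustrative):

```stata
kpss Y, maxlag(8) notrend
```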
The maxlag(k) option reports the test statistic for each lag length up to a maximal
number of lags. notrend specifies that the null hypothesis is level stationarity,
rather than trend stationarity. qs and auto allow Stata to use different methods for
calculating autocovariances and maximal lags.
KPSS ran their stationarity tests on Nelson and Plosser’s data and found that
there was less evidence for unit roots than was originally believed. Their results
can be replicated by running the following two commands for each variable in the
dataset:
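A sketch of the pair of commands for one variable (shown for lrgnp), first with the trend-stationary null and then with the level-stationary null:

```stata
kpss lrgnp, maxlag(8)
kpss lrgnp, maxlag(8) notrend
```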
Exercises
1. Kwiatkowski et al. (1992) applied their method to the Nelson and Plosser (1982)
dataset and came to some different conclusions. In this exercise, you will now
replicate KPSS’s study using Stata on NelsonPlosserData.dta.
(a) Calculate the KPSS test statistic for a model with no trend (notrend) and a
maximal lag of eight periods (maxlag(8)). Which variables seem to have
unit roots? (You can double-check your work by looking at Table 5a in the
original KPSS paper.)
(b) Calculate the KPSS test statistic for a model which allows for a trend (i.e.
don’t include the notrend option) and a maximal lag of eight periods
(maxlag(8)). Which variables seem to have unit roots? (You can double-
check your work by looking at Table 5b in the original KPSS paper.)
2. Redo both parts of the above exercise, but with the following modifications: let
Stata pick the optimal lag length (using the auto option), and have Stata use
a quadratic kernel to estimate the long-run variance of the series (using the qs
option). Which variables seem to have unit roots?
In one of the most widely cited papers in modern macroeconomics, Nelson and
Plosser (1982) examined several common macro datasets (data on GDP, per capita
GDP, CPI, etc. . . ) and tested them for unit roots. This question is important for two
reasons, one statistical and one economic. Knowing whether the variables have a
unit root tells us how we should model them statistically. Far more importantly, if
economic variables such as GDP follow a unit root, then this tells us something
quite meaningful about the economy. If GDP follows a unit root, then any shock to
the economy will have a long-term impact. The shock’s effects will be felt until a
countervailing shock pushes the economy back onto its old path. Alternatively, if
GDP does not follow a unit root, then the economy is resilient and self-healing.
When GDP is affected by a shock, the effects of that shock are temporary: the
economy adjusts so that it resumes its previous growth path.
Nelson and Plosser considered equations of the form:
Xt = μ + ρXt−1 + γ t + ut . (7.12)
Using Dickey-Fuller tests, they found that almost all macroeconomic time series
contain unit roots, or, more correctly, they found that they could not reject the null
hypothesis of a unit root.
In this section, we will replicate the major tables in Nelson and Plosser’s study.
First, we download the data.
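Assuming the file NelsonPlosserData.dta (the dataset named in the exercises) is available in the working directory, a sketch:

```stata
use NelsonPlosserData.dta, clear
```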
The variables are presented in their raw form, and then once again in their
logarithms. We will only use the logged versions, except for bond prices, which are
not logged. The logged variables are denoted with an “l” prefix.
To get the exact same numbers as Nelson and Plosser, we will need to define a
new time variable for each variable. Each variable will begin at period 0, regardless
of which year was the earliest date in the time series. It is easy to do so using a loop:
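A sketch of such a loop, assuming the dataset's calendar-year variable is named year and looping over an illustrative subset of the variables:

```stata
foreach v of varlist bnd lrgnp lgnp lcpi lsp500 {
    quietly summarize year if !missing(`v')
    generate t_`v' = year - r(min)   // each series now starts at period 0
}
```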
If we had not created new time variables and had simply used the calendar year,
the substantive results would not have changed. However, we are aiming to replicate
their study, so we follow their preference.
Nelson and Plosser then examine the autocorrelation structure of their
data (in levels, in differences, and as deviations from trend) in order to compare it with what they
would expect if the data had come from unit root processes. Such tests are not quite
formal, and have low power.
They first examine their data in levels (Table 7.2), and find they are highly
autocorrelated, with the autocorrelation weakening slowly as the lag increases. This
is indicative of a random walk.
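Sample autocorrelations such as those in Tables 7.2 and 7.3 can be produced with corrgram; a sketch for one variable, assuming the data have been tsset:

```stata
corrgram lrgnp, lags(6)      // levels (Table 7.2)
corrgram D.lrgnp, lags(6)    // first differences (Table 7.3)
```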
In Table 7.3, they then take the first differences of their data. About half of the
variables in Table 7.3 (the differences) have large first-order AC components only.
This is indicative of an MA process. Only the most contrived trend stationary process
(one with serially uncorrelated errors) would give rise to an AC structure with
large coefficients on the first-order terms only, and these terms would be negative.
(We showed this in an earlier section.) This argues against trend stationarity.

Table 7.3 Sample autocorrelations of the first difference of the natural log of annual data

Variable    r1      r2      r3      r4      r5      r6
bnd         0.18    0.31    0.15    0.04    0.06    0.05
lrgnp       0.34    0.04   −0.18   −0.23   −0.19    0.01
lgnp        0.44    0.08   −0.12   −0.24   −0.07    0.15
lpcrgnp     0.33    0.04   −0.17   −0.21   −0.18    0.02
lip         0.03   −0.11   −0.00   −0.11   −0.28    0.05
lemp        0.32   −0.05   −0.08   −0.17   −0.20    0.01
lun         0.09   −0.29    0.03   −0.03   −0.19    0.01
lprgnp      0.43    0.20    0.07   −0.06    0.03    0.02
lcpi        0.58    0.16    0.02   −0.00    0.05    0.03
lwg         0.46    0.10   −0.03   −0.09   −0.09    0.08
lrwg        0.19   −0.03   −0.07   −0.11   −0.18   −0.15
lm          0.62    0.30    0.13   −0.01   −0.07   −0.04
lvel        0.11   −0.04   −0.16   −0.15   −0.11    0.11
lsp500      0.22   −0.13   −0.08   −0.18   −0.23    0.02

Note: Reprinted from Nelson, Charles R. and Charles R. Plosser (1982), Trends and random walks
in macroeconomic time series: Some evidence and implications, Journal of Monetary Economics,
10(2): 139–162, with permission from Elsevier
The other half of the variables in Table 7.3 have more persistent autocorrelation.
“The conclusion we are pointed toward is that if these series do belong to the
TS class, then the deviations from trend must be sufficiently autocorrelated to
make it difficult to distinguish them from the DS class on the basis of sample
autocorrelations” (Nelson and Plosser 1982, p. 149).
In Table 7.4, they show the autocorrelations of the deviations from a fitted
trend. There, the autocorrelations for all but the unemployment series start high
and decrease exponentially. NP refer to Nelson and Kang (1981), who showed that
this is the autocorrelation structure of the residuals that would be generated when
fitting a random walk process to a trend.
Again, these comparative procedures provide a convenient starting point, but they
lack the formality of a statistical test. To this end, Nelson and Plosser employ the
unit root tests of Dickey and Fuller.
Upon entering the above commands, you should get the following output:
Nelson and Plosser’s main results table (Table 7.5) requires a different lag length
for each variable. Below, you can see the commands for replicating the first variable
from the table. The rest are left as an exercise; simply replace the variable name and
lag length in the provided code. Table 7.5 shows the results from Stata’s replication
of Nelson and Plosser’s final results table.
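The pair of commands for the first variable takes roughly this form (the lag length is illustrative, and the name t_lrgnp for the re-centered time variable is illustrative):

```stata
dfuller lrgnp, trend lags(1) regress
regress lrgnp L.lrgnp LD.lrgnp t_lrgnp
```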
We are a bit redundant in using both the dfuller and reg commands. Still,
it is instructive to see how dfuller is really just a special type of regression. For
example, the output from dfuller is:
Recall from Eqs. (7.5) and (7.6) that we estimated equations of this form, where
dfuller reports the coefficient on the lagged level as ρ1 − 1 rather than as ρ1 . The test
statistic τ (ρ1 ) is the same between dfuller and Table 7.5, as would be expected,
since one merely subtracts a constant from the other.
Nelson and Plosser may have discovered something quite meaningful. Looking
at Table 7.5, we see that most of the variables seem to show evidence of a unit root.
This is not readily apparent when looking at the table. We need to keep in mind
that the relevant critical values are not 1.96. We need to use Dickey and Fuller’s
critical values, which are approximately equal to −3.45. Almost all of the variables
have test statistics greater than −3.45 (that is, less negative than the critical value).
Therefore, we cannot reject the null hypothesis of ρ = 1, so we cannot reject the
null hypothesis of a unit root.
More plainly, we accept that these variables have a unit root.2 This, in turn, means
that the economy might carry the effects of negative (and positive) shocks with it
forever. The economy might not be self-healing and might not always resume its
earlier growth path.
Exercises
1. Using the code provided in the text, replicate Table 7.5 containing Nelson and
Plosser’s main results.
2. Are the major stock market indexes unit root processes? Redo the Nel-
son and Plosser exercise, but with daily data on the NASDAQ index,
the CAC-40 index, and the DAX for the period starting in 2013 and
ending in 2017. (Use the Index.dta dataset or download the data
using fetchyahooquotes ^IXIC ^FCHI ^GDAXI, freq(d)
start(01012013) end(01012018).) What do you find?
Testing for seasonal unit roots follows the same lines as non-seasonal unit roots.
If the roots of the characteristic polynomial are on the unit circle, then we have
seasonal unit roots. But what if they are quite close to the unit circle? We need to
perform a hypothesis test and verify whether we are statistically close or far away
from the circle. That is, we need to employ a seasonal unit root test.
The most popular test for seasonal unit roots is the so-called HEGY test, named
after the authors of the paper: Hylleberg, Engle, Granger, and Yoo (1990). The
HEGY test is a modification of a Dickey-Fuller test. It is implemented in Stata using
the hegy4 command; however, the command is limited to quarterly data. Beaulieu
and Miron (1993) developed the theory extending the HEGY procedure to monthly
data, but there does not yet seem to be a Stata implementation of this.
Ghysels and Perron (1993) suggest carefully examining the data for the existence
of different types of seasonality. They also suggest including at least as many lagged
differences in the Dickey-Fuller and Phillips-Perron tests as the length of the seasonality.
2 Please keep in mind that failure to reject does not mean that we “accept.” Still, sometimes it is
useful to think in these simpler terms.
If there seems to be quarterly seasonality, use at least four lagged differences in the
unit root test. Doing so decreases the size of the bias and increases the power of the
unit root tests.
Ultimately, Ghysels et al. (1994) stress the difficulties in testing for stochastic
seasonality. The AR unit roots and the MA terms often interfere with each other,
resulting in low power and low size for the HEGY-type tests. Including deterministic
seasonal terms in the regression models improves the power of the HEGY tests,
however, “the picture drawn from our investigation is not very encouraging”
(p. 436); there are too many problems with the size and power of the HEGY test.
In this chapter, we have explored some of the more popular unit root tests.
There are two reasons why it is important to know whether your data have unit
roots. First, from an econometric point of view, it is important to know the source
of non-stationarity because we need to know how to correct for it. If the DGP is
trend stationary (a deterministic trend), then we can detrend the model to render it
stationary (and extract the business cycle components). If it is a stochastic trend
(an RWWD), then the model is difference stationary. That is, we can take first
(or second) differences and then proceed with our familiar ARMA(p,q) modeling.
Applying the wrong procedures and differencing a trend stationary process, i.e.
over-differencing, introduces an MA unit root. That is, if you wrongly believe there
is a unit root and take first differences to remove it, then you will inadvertently be
introducing a unit root.
Second, from an economic point of view, it affects how we view the economy.
A deterministic trend model shows the economy as self-healing, and reverting back
to its trend line. A stochastic trend shows the economy as non-healing. It never
recovers from a shock. It reverts back to its usual rate of growth, but from a lower
level.
Despite its ubiquity, not everyone is convinced of the importance of unit root
testing. Cochrane (1991) argues that the low power of unit root tests is inescapable.
Christiano and Eichenbaum (1990) ponder unit roots in GNP and ask, “Do we know,
and do we care?” They answer, “No, and maybe not.” The evidence is too sensitive
to various assumptions to draw any definitive conclusions. Further, they argue that
evidence of a unit root does not answer more important questions like the prevalence
of permanent technological vs temporary demand shocks.
The literature on unit root testing is vast and constantly increasing. Campbell
and Perron (1991) provide a lengthy, if somewhat dated, review of the major issues
involved in unit root testing and offer some very useful rules of thumb. The book
by Maddala and Kim (1998) provides a more thorough yet readable discussion of
univariate unit root testing and cointegration (which extends the unit root concept to
the multivariate setting).
Unit root testing has been extended to panel datasets. This has proven quite
useful, as most macroeconomic data is available for multiple countries. Previously,
researchers calculated a sequence of unit root tests, one for each country’s GDP,
for example. But this presumes the cross-sections are independent. The profes-
sion demanded to be able to combine these into one panel model and test all
countries simultaneously. Naturally, the quantity supplied of panel-unit root tests
has responded to the increased demand. Scores of papers have been written on
this. Im et al. (2003) is an early and influential paper that developed a panel test
where the cross-sections are independent. Pesaran (2007) developed an extension, a
modification of the standard ADF tests, to account for cross-sectional dependence.
Stata implements some of the more popular panel unit root tests. These include the
tests by Levin et al. (2002) and Im et al. (2003), which have unit roots as their null
hypotheses (like DF-type tests), and Hadri (2000), which has stationarity as its null
hypothesis (like KPSS).
The concept of unit roots has been challenged by the concept of “structural
breaks,” a concept which we will explore in Chap. 8.
8 Structural Breaks
In 1976, Robert Lucas offered one of the strongest criticisms of the Cowles
Commission large-scale econometric modeling approach. Lucas critiqued Cowles’
presumption that many economic phenomena are structural. They are not. They
depend on the institutional and regulatory framework. For example, economists
vigorously debated what was the “true” value of the marginal propensity to consume
(MPC). A large MPC implies a large fiscal policy multiplier. A small MPC implies
that fiscal policy will be ineffective. Lucas argued that the observed MPC is
contingent on the economic and regulatory environment at the time. People consume
more or less in response to the economic situation. They’ll consume more if
times are good, or if tax laws are favorable to consumption. The MPC is not a
structural parameter. It is not a universal constant on par with Planck’s constant
or the gravitational constant. Essentially, Lucas argued that changes in the laws
and regulations affect human behavior, and this will be revealed through the data.
Change the rules of the game and you change the outcome. Change the economic
landscape, and you change the ups and downs of the time series. Change the
regulatory structure, and your econometric regressions should exhibit differences
before and after the change. Lucas, in effect, argued that econometrics should be
concerned with “structural breaks,” the topic of this chapter.
“Structural breaks” is econometric jargon for “the world changed.” At some point
in the real world, there was a change in either the legal, institutional, or geopolitical
rules of the game that resulted in a different process generating the data. If the
practicing econometrician attempted to fit the whole dataset to one model, rather
than two, he would be committing a serious misspecification error.
Fig. 8.1 Four panels of a simulated series X, each illustrating a different type of structural break
The other three panels in Fig. 8.1 also show some structural breaks. Panel (A)
shows spliced data with the same trend, but with different intercepts, (B) shows two
trend stationary processes with different trends spliced together, and (C) shows two
trend stationary processes with different intercepts and trends spliced together.
The practicing econometrician needs to be aware that these are possibilities. In
fact, they are very likely. A dutiful researcher must know the subject matter well
enough to anticipate whether regulatory, legal, political or other changes might have
fundamentally altered the data-generating process and resulted in a structural break.
How was Perron able to completely reverse Nelson and Plosser’s conclusions?
For the remainder of this section, we will work through a well-known paper by
Perron which showed the importance of testing for structural breaks when testing
for unit roots. It is important, not just because it pointed out the fragility of standard
unit root tests, but also because it provided a method for dealing with structural
breaks econometrically. (Perron’s method is appropriate when the break happened
only once, and the date of the single break is known. More recent research has
relaxed these two assumptions.)
In his 1989 paper, Perron postulated that there was a structural break in the economy
in 1929, at least for the more common macroeconomic data. In other words, the
world was different before and after 1929. Nelson and Plosser (1982), however,
lumped all the data, pre- and post-1929, into the same group when performing their
Dickey-Fuller tests. Perron investigates whether a structural break at 1929 could
account for the time-series properties of the data. He concluded that, in contrast to
Nelson and Plosser, the data did not have a unit root. Rather, a structural change in
1929 was confused with a unit root. That is, the effects of the 1929 shock had not
dissipated, and so it looked like a unit root.
Perron begins with a casual analysis of the most major economic event of the
twentieth century: the Great Depression. Perron notices that the Great Depression
resulted in a drop in the values of most macro aggregates (a change in mean value).
This observation will guide his choice of estimated unit root models.
Perron points out that, if it is known that only a subset of the parameters have
changed after the structural break, then it does not make sense to estimate two
separate regressions; doing so would require estimating the unchanged parameters
twice, each time on smaller sub-samples. Rather, Perron suggests estimating all of
the parameters via one larger nested regression, where properly generated dummy
or time variables allow for changes in the parameters, but only those parameters that
are known to change. Why estimate the constant twice, for example, each time with
half the observations, when you could estimate it once with double the observations?
Let’s suppose that we are looking at some data similar to the top right panel
in Fig. 8.1 where there is a shift in the level of the series, whereas the slope has
not changed. The mean changes so it seems non-stationary, but are its components
stationary? The data could have come from one of two hypothesized models, a null
and an alternative:
H0 : yt = β0 + yt−1 + μDP + et
HA : yt = β0 + β1 t + μDL + et .
The null hypothesis is a unit root process. The alternative is a trend stationary
process. Both models allow for some kind of parameter change (i.e. a structural
change). Borrowing the terminology from Enders’ (2014) textbook, we call “DP ”
a pulse dummy variable; we construct it so that it has a value of zero, except in the
one period directly following a shock. We call “DL ” a level dummy variable; it has
a value equal to zero up to and including the shock and a value of one thereafter.
Let’s see how these two equations act, step by step. “By hand,” as it were. We’ll
do so in two stages, first without random errors, and then again with the errors. Since
we’re doing this by hand, let’s choose some nice easy numbers: β0 = 1, β1 = 1,
μ = 10, and let the initial value of y0 = 1.
At first, rather than generating error terms, let’s treat these series as though they
were deterministic. (In other words, ignore the error term for now by setting it equal
to zero.) Suppose the structural break occurs in period 50, and the series runs for
100 periods.
If the null equation were true, then the equation reduces to:
H0 : yt = 1 + yt−1 + 10DP
y0 ≡ 1
y1 = 1 + y0 + 10DP = 1 + 1 + 0 = 2
y2 = 1 + y1 + 10DP = 1 + 2 + 0 = 3
y3 = 1 + y2 + 10DP = 1 + 3 + 0 = 4
y49 = 1 + y48 + 10DP = 1 + 49 + 0 = 50
y50 = 1 + y49 + 10DP = 1 + 50 + 0 = 51
HA : yt = 1 + t + 10DL
y0 ≡ 1
y1 = 1 + t + 10DL = 1 + 1 + 0 = 2
y2 = 1 + 2 + 0 = 3
y3 = 1 + 3 + 0 = 4
y49 = 1 + 49 + 0 = 50
y50 = 1 + 50 + 0 = 51
y51 = 1 + 51 + 10 = 62
y52 = 1 + 52 + 10 = 63
A graph of the series under the null and alternative is given in Fig. 8.2.
Notice that when these are deterministic functions, the null and the alternative
are equivalent. The processes differ by how they deal with shocks, those error terms
which we had set equal to zero. In unit root processes (such as in the null) the effects
of even small errors linger long into the future.
Fig. 8.2 Ynull and Yalt plotted against t; with the errors set to zero, the two series coincide
Let’s now add some random errors and see how this changes things. We will
simulate one column (variable) of random errors, and use these same errors to
simulate the two models, the null and the alternative.
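A sketch of the simulation (the seed and variable names are illustrative; as above, β0 = 1, β1 = 1, μ = 10, y0 = 1, and the break occurs at period 50):

```stata
clear
set obs 100
set seed 1234
generate t = _n
tsset t
generate e  = rnormal()
generate DP = (t == 51)          // pulse: the one period directly after the break
generate DL = (t >  50)          // level shift: one from period 51 onward
generate Ynull = 2 + e in 1      // y1 = b0 + y0 + e1 = 1 + 1 + e1
replace  Ynull = 1 + Ynull[_n-1] + 10*DP + e in 2/100
generate Yalt  = 1 + t + 10*DL + e
```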
Graphs of the processes under the null and alternative are given in Figs. 8.3
and 8.4.
Enders (2014) distills Perron’s method into a few easy steps. Supposing that there
seems to be a shift, but no change in the slope, the null and alternative hypotheses
that we are testing are:
H0 : yt = β0 + yt−1 + βDP DP + t
HA : yt = β0 + β1 t + βDL DL + t .
Fig. 8.3 The simulated series under the null hypothesis (Ynull)

Fig. 8.4 The simulated series under the alternative hypothesis (Yalt)
1. Detrend the data. You can do this by estimating the model under the appropriate
alternative hypothesis, and then generating the residuals. Let’s denote these
detrended data ȳ.
2. Does ȳ (the detrended data) follow a unit root process? Estimate

ȳt = α ȳt−1 + γ1 ∆ȳt−1 + · · · + γk ∆ȳt−k + εt . (8.2)
Perron (1989) provides the appropriate critical values for testing α = 1 here, too.
If the test statistic is large enough, we can reject the null hypothesis of a unit root.
Finally, we should mention (following Enders 2014) that detrending does not
have to occur as a separate step. In fact, all three steps can be collapsed into one big
regression. For the more general case where we can have both a change of level and
a change of slope, the model to be tested is:
yt = β0 + βDP DP + βDL DL + β3 t + β4 NewSlope + αyt−1 + γ1 ∆yt−1 + · · · + γk ∆yt−k + εt , (8.3)
where NewSlope is a slope-change variable, constructed by interacting the time
trend with the post-break dummy so that the trend’s slope is allowed to differ after
the break.
The unit root hypothesis (α = 1) can be tested using Perron’s critical values.
Perron uses Nelson and Plosser’s own data to show that their results are suspect.
So, let’s reload their data, keeping only the variables we’ll need.1 With the exception
of bond yields, we will look at the variables in their logarithms (hence the prefix “l”).
Next, we need to create several special types of dummy and time variables.
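A sketch of these two steps, assuming a break date of 1929 (the variable list and names are illustrative):

```stata
use NelsonPlosserData.dta, clear
keep year bnd lrgnp lgnp lpcrgnp lip lemp lun lprgnp lcpi lwg lrwg lm lvel lsp500
tsset year
generate t  = year - 1859            // a time trend (origin illustrative)
generate DL = (year >  1929)         // level-shift dummy: one after the break
generate DP = (year == 1930)         // pulse dummy: the period after the break
```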
Perron points out that (almost) all of Nelson and Plosser’s data could fit into one
of several classes of stationary (but with structural break) models. There could be a
change in intercept, a change in slope, or both. Perron fits wages and the S&P 500
into two such models, as can be seen in Fig. 8.5.
In replicating this study, be aware that Perron follows Stata’s (and our) notation,
where k=lags; Nelson and Plosser have k=lags+1.
The code below allows us to reconstruct Table 8.1, which shows the
sample autocorrelations of the deterministically detrended data. Perron detrends
most of the variables by fitting a regression with a constant and a trend, and then
extracting the residuals. The exceptions are the real wage and
S&P variables, which he detrends by fitting a regression with a slope and intercept
before extracting the residuals.
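A sketch of the detrending step for one variable (the names are illustrative):

```stata
quietly regress lrgnp t
predict lrgnp_dt, residuals          // the deterministically detrended series
corrgram lrgnp_dt, lags(6)
```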
1 Perron also examines a real GNP variable from Campbell and Mankiw (1987b) and Campbell
and Mankiw (1987a). We leave this variable out for the sake of brevity.
Fig. 8.5 Log of wages and log of the S&P 500 index, each plotted with fitted values
Table 8.1 shows the autocorrelations of each variable after deterministic detrend-
ing. The autocorrelations decay fairly quickly, implying that the variables are trend
stationary.
The heart of Perron’s paper is his test for structural change (Table 8.2). This is
calculated for real GNP using the code below.
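A sketch of the calculation (the lag length is illustrative; DL, DP, and t are the break and trend variables defined earlier):

```stata
local var  lrgnp
local lags 8
regress `var' DL DP t L.`var' L(1/`lags')D.`var'
display "alpha      = " _b[L.`var']
display "t(alpha=1) = " (_b[L.`var'] - 1)/_se[L.`var']
```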
The third line drives all of the results. The lines that follow simply pull out the
appropriate statistics so they are not lost in a large regression table. To estimate the
other variables, simply replace lnrgnp and lags in the first two lines and re-run.
Real wages and the S&P 500 are believed to have come from a different type
of break. For these two series, Perron hypothesizes that the slope and intercept have both
changed. Thus, he adds a different dummy variable into the mix.
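For these two series the regression also includes a slope-change variable; a sketch, where the construction of NewSlope shown here is one possibility and the lag length is illustrative:

```stata
generate NewSlope = DL*t             // trend interacted with the post-break dummy
regress lsp500 DL DP t NewSlope L.lsp500 LD.lsp500
```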
To see whether t (α) indicates that α is statistically significantly different from 1 (i.e. whether there is a unit
root), we need special critical values. In fact, part of Perron’s purpose, similar to
Dickey and Fuller’s, is to provide new critical values. Perron’s critical values depend
upon, among other things, the position of the break relative to the sample. When
there is no break, the critical values mimic those of Dickey-Fuller; otherwise, they
are a bit larger. They are largest when the break occurs in the middle of the series.
When conducting such tests, it is strongly recommended that the researcher consults
the tables of critical values in Perron (1989).
Let’s examine the results in Table 8.2. Recall that we estimated a model which
controls for possible level shifts in all of the variables, and for changes in the
slopes in the case of common stocks and real wages. All of these structural change
parameters are significant. After filtering out these effects, we can test for a unit
root by examining the estimated value of α, the coefficient on yt−1 . If α = 1,
then we have a unit root. Looking at the t-statistics on α, we see that, with the
exception of consumer prices, velocity, and the interest rate, none of the variables
have a unit root. Perron concludes that macroeconomic variables are not generally
characterized by unit root processes, but rather by structural breaks. This completely
reverses Nelson and Plosser’s (1982) conclusion that US macro variables are unit
root processes.
Perron’s paper is important beyond what it implied for the specific macroeco-
nomic data he examined. He proved that when undertaking ADF tests, the failure to
allow for a structural break introduces a bias; this bias reduces the ability to reject
a false unit root. There were a lot of double-negatives in that sentence. Here is the
intuition. Suppose there was a structural break and no unit root, but that you forgot
to account for that possibility in your ADF test. Then the structural break (a shock
whose effects linger indefinitely) will be confused with a unit root process (a process
of shocks whose effects linger indefinitely). In other words, it is more likely to look
as though there is a unit root when, in fact, there is none. To abuse the terminology,
we would be falsely led to “accept” the hypothesis of a unit root.
Perron (1989) showed how failing to account for the possibility of a structural
break biases the standard unit root tests. Throughout the entire exercise above,
however, we presumed we knew the date of the possible break. We cannot always
be so certain. We now turn to the question of finding the date of the break when it is
not known.
Exercises
1. Modify the code above to replicate the rest of Table 8.2.
2. Load the DJIA.dta dataset. This dataset reports the Dow Jones Industrial Average
from the beginning of 1995 to the end of 2015. (Alternatively, download the
data using fetchyahooquotes.) Take the natural log of the DJIA. Conduct
a Dickey-Fuller test on ln(DJIA). Does there seem to be a unit root in the
(log of the) DJIA? Graph the data. Visually, does there seem to be a structural
break? If so, of what type (slope change, intercept change, or both)? Estimate
the appropriate model, and conduct a unit root test following Perron’s (1989)
procedure, as laid out in this chapter. Does there seem to be a unit root in the
DJIA once a structural break has been accounted for?
Perron’s paper showed what to do if we know the date of the (one) possible structural
break. This is often not the case. Rather, we might want to know which, if any,
legal or institutional changes changed the behavior of the economy. In this case, we
can employ a technique developed by Zivot and Andrews (1992). Their technique,
like Perron’s, can only identify the existence of a single structural break; different
techniques are required if there may be more than one break.
A brief history might be useful. Dickey and Fuller (1979) showed how to test for a
unit root in the presence of AR(1) errors. Said and Dickey (1984) generalized this to
account for AR(p) errors in their Augmented Dickey-Fuller procedure by including
additional lagged (differenced) terms. Perron investigated how the ADF results
might change if a break point occurred at a known point in time. He did this by
adding a dummy variable in the ADF procedure. Finally, Zivot and Andrews asked
how to test for a break point at an unknown time. Their approach was to estimate
many Perron-style equations, one for each year. Each regression includes an optimal
number of lags (chosen via Schwarz’ (1978) Bayesian information criterion or a
sequence of t-tests). Finally, they pick the one year that gives the most weight to the
alternative hypothesis.
Zivot and Andrews’ null hypothesis is of a unit root process—possibly with
drift—and no structural break. Their basic idea is to estimate a sequence of Perron’s
trend stationary models, each with a different break point. Which break point
should be selected? The one which gives “the most weight to the trend-stationary
alternative” (Zivot and Andrews 1992, p. 254).
8.3 Zivot and Andrews’ Test of a Break at an Unknown Date 185
y_t = β_0 + β_DP DP + β_DL DL + β_3 t + β_4 NewSlope + α y_{t−1} + Σ_{i=1}^{k} γ_i Δy_{t−i} + ε_t.
This is Perron’s equation. Testing that α = 1 tests the unit root hypothesis. Testing
βDL = 0 tests for a level-shift structural break at a given year. Testing β4 = 0 tests
for a change-of-slope structural break at a given year. Zivot and Andrews suggest
we estimate a sequence of these equations, one for each possible break year, and
pick the most likely year under the alternative.
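The mechanics of that sequence of regressions can be sketched outside of Stata as well. Below is a rough Python illustration of the search, under our own simplifying assumptions: the function name za_break_search is hypothetical, the regression includes only a constant, a level-shift dummy, a trend, and one lag of y (no augmentation lags), and no attempt is made at Zivot and Andrews' proper critical values.

```python
import numpy as np

def za_break_search(y, trim=0.15):
    """Illustrative Zivot-Andrews-style search: estimate a Perron-type
    regression (constant, level-shift dummy, trend, lagged y) for every
    candidate break date, and keep the date whose t-statistic on alpha
    (the coefficient on y[t-1], tested against 1) is smallest -- the date
    giving the most weight to the trend-stationary alternative."""
    T = len(y)
    trend = np.arange(T, dtype=float)
    best_t, best_tb = np.inf, None
    for tb in range(int(trim * T), int((1 - trim) * T)):   # trim the ends
        DL = (np.arange(T) > tb).astype(float)             # level-shift dummy
        X = np.column_stack([np.ones(T - 1), DL[1:], trend[1:], y[:-1]])
        b, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
        resid = y[1:] - X @ b
        s2 = resid @ resid / (X.shape[0] - X.shape[1])
        se_alpha = np.sqrt(s2 * np.linalg.inv(X.T @ X)[3, 3])
        t_alpha = (b[3] - 1.0) / se_alpha                  # H0: alpha = 1
        if t_alpha < best_t:
            best_t, best_tb = t_alpha, tb
    return best_t, best_tb

# Trend-stationary series with a level shift at t = 60
rng = np.random.default_rng(0)
T = 120
y = 0.02 * np.arange(T) + 2.0 * (np.arange(T) > 60) + rng.normal(0, 0.3, T)
t_min, tb_hat = za_break_search(y)
```

The minimizing date is then compared against Zivot and Andrews' tabulated critical values, which this sketch omits.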
In this subsection, we will walk through implementing Zivot and Andrews’ (1992)
procedure, replicating the results from their paper. Again, recall that this is simply a
Perron (1989) exercise, but with an extra step to choose the appropriate break-date.
First, we download the familiar Nelson and Plosser data.
Let’s suppose, for just a second, that we knew the date of the break year and the
number of lags. For RGDP, we presume this break-date is 1929 and the number of
lags is 8.
If we didn’t have to worry about using the proper critical values, we could simply
test for a level break by typing test DL. We can isolate particular estimates, their
test statistics, and the equation’s root mean squared error (RMSE) by:
186 8 Structural Breaks
You will notice that, rather than having eight lags, we have eight differences. This
is actually equivalent to the following formulation:
Now, how did we know to include eight lags for 1929? Zivot and Andrews tested
down from a max-lag of eight. If the test statistic on the maximal lag is insignificant
at the 10% level (less than 1.60 in absolute value), then we drop it and estimate the
model with one fewer lag. We repeat the process until we no longer drop insignificant lags.
The idea is to repeat the above process for every candidate year. This requires
creating an outer loop, one for every candidate break year. Be sure to drop the old
break year dummy variable (1929), and create a new one for the new candidate year.
Keep track of the t-stat on α for each year; we will keep the regression for the year
with the smallest test statistic on α.
Finally, we need to repeat this procedure for every variable, so let’s create a final
outer loop, looping over all of the variables. A complete implementation of this
procedure on the Nelson and Plosser dataset is:
For the first three variables, the output from the code above is:
What does the output above tell us? We see that the most likely breakpoint is
1929, the date that Perron assumed for his structural break. The structural break
variables (DL) and (t) are significant. More importantly, α does not seem to be
equal to one. That is, unit roots do not describe Nelson and Plosser’s data. Or, more
accurately, there is less evidence for unit roots than Nelson and Plosser reported.
Kit Baum (2015) wrote a Stata command to implement Zivot and Andrews’ (1992)
technique. The command reports the period after the break rather than the year of
the break. Also, I have been unable to fully replicate Nelson and Plosser’s results
using Kit Baum’s zandrews program. Still, to install it, type the following in the
command line:
Familiar questions keep arising: what types of breaks, and how many lags?
As in Perron’s method, a break can occur in the intercept, the slope, or both.
Accordingly, you must specify either break(intercept), break(trend),
or break(both).
How many lags should be included? As before, there are several different
methods one might employ. You could find the number of lags that minimize your
favorite information criterion (AIC or BIC). Alternatively, you could try a testing-
down (sequential t-test) approach. The lagmethod() option allows you to choose
one of the three options: AIC, BIC or TTest. Alternatively, you can specify the
maximum number of lags to consider, by using these two options in combination:
lagmethod(input) maxlags(#).
The following code uses zandrews to replicate most of Zivot and Andrews’
findings:
The final two variables (common stock prices a la the S&P 500, and real wages)
were modeled with a break in both the intercept and the trend term. All other
variables were modeled with a break in the intercept only.
The output of the zandrews command is summarized in Table 8.3. We were
able to exactly replicate all but one of Zivot and Andrews’ results.
Exercises
1. Repeat Perron’s exercise on the Nelson and Plosser dataset, but use Stata’s
zandrews command and the AIC option to identify the existence and most
likely location of a structural break. Ignore data before 1909. For each variable,
which year is identified as the most likely for a structural break? Which of these
are statistically significant? Are your results different from those identified by
Perron? (Recall that real wages and the S&P 500 are thought to have a new
intercept and slope after the break. All other variables have only a new intercept.)
Are your results different from those in Zivot and Andrews’ paper?
2. Repeat the exercise above, but use Stata’s zandrews command and the
sequential t-test option (at the 0.10 level) to identify the existence and most
likely date of a structural break. Are your results different from those identified
by Perron? Are your results different from those in the previous exercise? Are
your results different from those in Zivot and Andrews’ paper?
Structural breaks refer to a qualitative difference in the data before and after an
event. For example, the economy might perform one way before a regulation is
put into effect, and differently afterward. Or investors might tolerate risk before the
Great Depression, and not tolerate it afterward.
The literature on structural breaks and unit root testing is its own cottage industry.
An exhaustive review would be impossible.
Perron kicked off the research by providing a way to test for a structural break at
a known date, specifically, the Great Crash and the oil price shocks. Sometimes the
date of a structural break is fairly certain. Germany’s economy, for example, would
certainly act differently pre- and post-unification (Lütkepohl and Wolters 2003).
The second wave of researchers relaxed the assumption that the date of the
possible break is known. Rather, the possible number of breaks is known, and the
algorithms check for the most likely dates of these changes. This wave is often
called the “endogenous break” stream. This does not imply that the breaks occurred
because of some endogenous economic process; rather, the endogeneity refers to
the calculation method. That is, the date is not given by the researcher exogenously.
It is estimated by the statistical algorithm. Zivot and Andrews (1992), a paper we
examined at length in this chapter, fits this mold. So, too, do the influential papers by
Christiano (1992), Banerjee et al. (1992), Lanne et al. (2003), Perron and Vogelsang
(1992), Perron (1997), and Vogelsang and Perron (1998).
Perron and Vogelsang (1992) propose a test for unknown structural break points.
Their procedure should sound familiar if you have worked through this book. First,
transform the series by removing its deterministic component. Then, calculate an
Augmented Dickey-Fuller type regression with a possible structural break on the
transformed series; similar in spirit to that of Zivot and Andrews, calculate t-
statistics for a possible break at each date. The minimal t-statistic is then used to
test the null hypothesis of a unit root. Perron (1997) expands upon this.
Zivot and Andrews’ asymptotic results rely on the assumption that structural
breaks and unit roots are strict substitutes. That is, that a structural break occurs in
a trend stationary process, but not in the null of a unit root process. Vogelsang and
Perron (1998)—and Lee and Strazicich (2003) for two possible breaks—rectify this
perceived deficiency by allowing breaks under both the null (unit root) and the
alternative (trend-stationary) hypotheses.
8.4 Further Readings 195
9.1 Introduction
To this point, we have considered non-stationary means, but strictly speaking, non-stationarity
could apply to any of the moments of a random variable: the mean,
variance, skewness, kurtosis, and so on. Finance especially is concerned with the
non-stationarity of variance.¹ Most students will recall the issue of heteroskedasticity
from their introductory econometrics classes. Heteroskedasticity is a particular case
of non-stationarity in the variance of a variable. (Skedasticity is a synonym for
variance.) The traditional picture of heteroskedasticity is a scatterplot which spreads
outward, growing proportionately as the value of X increases. Graphically, it looks
like a funnel with the large part toward the right. Less common is heteroskedasticity
with the variance decreasing in X: the funnel points in the other direction (see
Fig. 9.1). But these are special cases, more common in cross-sections. They are of
limited use to the practicing financial econometrician.
Consider the stock market. Sometimes the markets are very volatile, and
sometimes they are not. The volatility (variance) of stock-market returns determines
the riskiness of your investments. It is a truism in finance that risk and reward go
together: they are positively correlated. When people take bigger risks with their
investments (because the volatility of an asset's price is high), they demand to
be compensated with higher reward. To make wise investments, it is crucial to
understand and account for risk properly.
Before the introduction of ARCH and GARCH models, the most common—
practically the only—method for incorporating volatility was to compute a rolling
variance or rolling standard deviation. This, of course, brought up the practical
question of choosing the best length of the window. Should you choose a rolling
¹ Strictly speaking, it is the unconditional variance which can imply non-stationarity; conditional
heteroskedasticity does not imply non-stationarity. The distinction between conditional and
unconditional variance will be one of the focuses of this chapter.
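The rolling-window approach is easy to implement, which was much of its appeal; the window length, however, is an arbitrary choice. The numpy sketch below (our own illustration, with simulated returns and a hypothetical rolling_std helper) shows how different windows give different pictures of the same volatility episode:

```python
import numpy as np

def rolling_std(x, w):
    """Sample standard deviation over a trailing window of length w."""
    return np.array([x[i - w:i].std(ddof=1) for i in range(w, len(x) + 1)])

# Simulated daily returns with a high-volatility episode in the middle
rng = np.random.default_rng(42)
r = np.concatenate([rng.normal(0, 1, 200),
                    rng.normal(0, 3, 100),     # volatile stretch
                    rng.normal(0, 1, 200)])

vol_20 = rolling_std(r, 20)   # short window: noisy but quick to adapt
vol_60 = rolling_std(r, 60)   # long window: smooth but slow to adapt
```

Neither window is "correct"; the ARCH and GARCH models introduced below replace this arbitrary choice with estimated parameters.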
X_t = β_0 + β_1 X_{t−1} + e_t.
[Figure: daily returns of NBG]
σ_t² = β_0 + β_1 σ²_{t−1} + e_t. (9.1)
Y_t ∼ N(0.05, σ_t²).

Notice that Y_t is stationary in its mean, but not in its variance, σ_t², which is changing
over time. (That's why it has the t subscript.)
The most common model used to capture varying variances is the Generalized
Autoregressive Conditional Heteroskedasticity model (GARCH) developed by
Bollerslev (1986) as a generalization of Engle’s (1982) ARCH model. Since then,
there have been countless generalizations of the GARCH model, comprising an
alphabet soup of acronyms.
It has become standard for texts to jump directly to the general GARCH
model. We will take a more incremental approach, building up slowly from a
200 9 ARCH, GARCH and Time-Varying Variance
specific ARCH model to the more general GARCH model. After we have established
exactly what is going on inside general GARCH models, we will finish with a
discussion of the many variants of GARCH. There are many, many variants of
GARCH, so we will discuss a few, and only briefly. But first, if ARCH stands for
Autoregressive Conditional Heteroskedasticity, what do we mean by “conditional
heteroskedasticity?”
Y_t = Y_{t−1} + e_t (9.2)

with the variance of e_t constant at σ². We can substitute recursively so that Eq. (9.2)
can be rewritten as
Y_t = Y_0 + e_1 + e_2 + … + e_t = Y_0 + Σ_{i=1}^{t} e_i.
Written this way, we can see that Yt is equal to the sum of a sequence of errors.
The unconditional variance of Y_t is
V(Y_t) = V(Y_0) + V(Σ_{i=1}^{t} e_i) = 0 + Σ_{i=1}^{t} V(e_i) = tσ².
Notice also how the unconditional variance changes: it increases unboundedly over
time.
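The result V(Y_t) = tσ² is easy to verify with a quick Monte Carlo check (a Python sketch; the simulation design is our own, not the book's Stata code):

```python
import numpy as np

# Monte Carlo check that V(Y_t) = t * sigma^2 for a driftless random walk
rng = np.random.default_rng(7)
sigma = 2.0
T, reps = 50, 20000
e = rng.normal(0, sigma, size=(reps, T))
Y = e.cumsum(axis=1)              # Y_t = e_1 + ... + e_t (taking Y_0 = 0)
var_at_t = Y.var(axis=0)          # variance across simulations at each date
t_idx = np.arange(1, T + 1)       # compare var_at_t with t_idx * sigma**2
```

The cross-simulation variance at each date tracks tσ² closely, growing linearly without bound.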
But what about the variance of Y_t, conditional on its previous value Y_{t−1}? That is,
when we want to make a forecast of Y_t, we usually have some historical data
to rely upon; we at least have Y's previous value. The variance of Y_t, conditional on
Y_{t−1}, is

V(Y_t | Y_{t−1}) = V(Y_{t−1} + e_t | Y_{t−1}) = 0 + V(e_t) = σ²,

a constant, even though the unconditional variance grows with t.
9.3.1 ARCH(1)
ARCH and GARCH models of all stripes generally consist of two equations: (1) a
mean equation describing the evolution of the main variable of interest, Y, and (2) a
variance equation describing the evolution of Y ’s variance.
As promised, we will start out simple. Yt will not even follow an AR(1) process,
but will consist of a constant mean plus some error:
Y_t = β_0 + ε_t, (9.3)
ε_t = (α_0 + α_1 ε²_{t−1})^{1/2} u_t, (9.4)

where u_t ∼ N(0, 1), α_0 > 0, and 0 < α_1 < 1. Equations (9.3) and (9.4) together define
our ARCH(1) model.
This error term (Eq. (9.4)) looks a bit unusual, but it isn’t really. First, if you
square both sides, you see that it is an expression for the mean equation’s variance:
ε_t² = (α_0 + α_1 ε²_{t−1}) u_t²
and u2t is a slightly different error term. Second, it is simply a multiplicative error
term. This scales the variance up or down proportionately. While this all looks
unnecessarily complicated, it simplifies the computation, so it is quite useful.
Unconditional Moments
The unconditional mean of Y_t is

E(Y_t) = E(β_0 + ε_t) = β_0.

The unconditional variance of Y_t equals the unconditional expectation of ε_t², which
we find by taking expectations of the squared error:

E(ε_t²) = E[(α_0 + α_1 ε²_{t−1}) u_t²] = α_0 E(u_t²) + α_1 E(ε²_{t−1} u_t²).

Since u_t ∼ N(0, 1) and is independent of ε_{t−1}, E(u_t²) = 1, so

E(ε_t²) = α_0 + α_1 E(ε²_{t−1}). (9.6)

If E(ε_t²) is stationary,² then E(ε_t²) = E(ε²_{t−1}) = E(ε²), so Eq. (9.6) simplifies to

E(ε²) = α_0 / (1 − α_1). (9.7)

Since V(Y_t) = V(ε_t) = E(ε_t²), the unconditional variance of Y_t is

V(Y_t) = α_0 / (1 − α_1). (9.8)
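The unconditional variance in Eq. (9.7) is easy to check by simulation. The Python sketch below is our own illustration (the chapter's actual estimation is done in Stata); the parameter values α_0 = 0.4 and α_1 = 0.5 match the simulated example discussed later in this section:

```python
import numpy as np

def simulate_arch1(a0, a1, T, seed=0):
    """Simulate e_t = (a0 + a1*e_{t-1}^2)^(1/2) * u_t with u_t ~ N(0,1)."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=T)
    e = np.zeros(T)
    for t in range(1, T):
        e[t] = np.sqrt(a0 + a1 * e[t - 1] ** 2) * u[t]
    return e

a0, a1 = 0.4, 0.5
e = simulate_arch1(a0, a1, T=200000)
implied = a0 / (1 - a1)       # Eq. (9.7): unconditional variance = 0.8
sample_var = e.var()          # should be close to 0.8 in a long sample
```

Despite the time-varying conditional variance, the sample variance of the whole series settles near the constant α_0/(1 − α_1).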
Conditional Moments
We have several random variables (yt , t , and ut ) and we could condition on each
of these or on their lags. So clearly, we could calculate an infinite number of
conditional moments. Not all of them will be of interest. We need to be judicious.
First, an easy one. The mean of Y_t conditional on its previous value is:

E(Y_t | ε_{t−1}) = E(β_0 + ε_t | ε_{t−1}) = β_0 + E(ε_t | ε_{t−1}) = β_0.

The conditional variance is the conditional expectation of ε_t²:

E(ε_t² | ε_{t−1}) = E[(α_0 + α_1 ε²_{t−1}) u_t² | ε_{t−1}]
= α_0 E(u_t² | ε_{t−1}) + α_1 ε²_{t−1} E(u_t² | ε_{t−1})
= α_0 + α_1 ε²_{t−1}. (9.9)
Notice from Eq. (9.9) that the conditional variance of Y_t depends upon time via
ε_{t−1}. And since ε_t² follows an AR(1)-type process, the conditional variance of Y exhibits
time-varying volatility. But this is only true of the conditional variance. As we saw
in Eq. (9.8), this was not a feature of the unconditional variance.
Y_t = β_0 + ε_t (9.10)
ε_t = (α_0 + α_1 ε²_{t−1})^{1/2} u_t (9.11)
where the second equality follows when the mean of X is zero. It is a standard
exercise to show that the kurtosis of the normal distribution is three. Thus, we aim
to show that the kurtosis of t ∼ ARCH (1) is greater than 3. Since ut is standard
normal, then (9.13) implies that

K(u_t) = E(u_t⁴) / [E(u_t²)]² = 3. (9.14)
So, if we can prove that the expectations term in the numerator is greater than
the denominator, we will have proven our case. To do this, we can rely on a
mathematical theorem called Jensen's inequality. This theorem should be familiar
to economics and finance students, as it is the basis for risk aversion, which in
turn is the basis of the theories of insurance and portfolio management. In words,
the theorem states that if a function is convex, then the average of the function
is greater than the function evaluated at the average; for concave functions (such
as a risk-averse investor's utility function), the inequality is reversed. In mathematics,
if f(x) is convex, then E(f(x)) > f(E(x)). Here, f(x) = x², which is a convex function,
and x = α_0 + α_1 ε²_{t−1}. Therefore,

K(ε_t) = 3 E(x²) / [E(x)]² = 3 E(f(x)) / f(E(x)) > 3.
Thus we have shown that ARCH(1) processes have thicker tails than normally
distributed processes.
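The excess kurtosis shows up clearly in simulation. The Python sketch below is our own illustration (the parameter value α_1 = 0.3 is chosen so that the relevant moments exist; the closed-form population kurtosis in the comment is the standard ARCH(1) result, not derived in this section):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 300000
u = rng.normal(size=T)
a0, a1 = 0.4, 0.3
e = np.zeros(T)
for t in range(1, T):
    e[t] = np.sqrt(a0 + a1 * e[t - 1] ** 2) * u[t]

def kurtosis(x):
    x = x - x.mean()
    return (x ** 4).mean() / ((x ** 2).mean() ** 2)

k_arch = kurtosis(e)   # population value is 3(1 - a1^2)/(1 - 3*a1^2), about 3.74
k_norm = kurtosis(u)   # approximately 3 for the N(0,1) draws
```

The simulated ARCH(1) errors are noticeably leptokurtic relative to the normal draws that generate them.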
[Figures: the simulated ARCH(1) series; a histogram of the residuals; a QQ-plot against the inverse normal]
Alternatively, we can approach the problem visually using a histogram (Fig. 9.4)
or a QQ-plot (Fig. 9.5). The Stata commands for the graphs are:
The output of Stata’s ARCH command is divided into two parts, corresponding to
the two main component equations of any GARCH model: the mean and variance
equations. We defined β0 to be 10; it was estimated to be 10.0186. In the variance
equation, α0 and α1 were defined to be 0.4 and 0.5, respectively; Stata estimated
them to be 0.42 and 0.51.
After estimation, we can use Stata's predict command to calculate the residuals (the
estimates of the error, and the means by which the squared errors are estimated for
the variance equation).
Likewise, after an ARCH or GARCH estimation, you can predict the variance.
The command is simply:
This is the estimated conditional variance, a graph of which is given in the first
1000 observations of Fig. 9.6.
Fig. 9.6 The estimated variance from our simulated ARCH(1) data
Beyond the first out-of-sample observation, there is no realized data for the ARCH(1)
process to pull from. Instead, it uses its own predicted values recursively to generate
more predicted values. That is, it predicts the variance in, say, period 1003 from its
predicted value (not the realized value) in period 1002. Almost immediately, the
predicted variance stabilizes to the estimated unconditional variance, a constant
equal to α̂_0/(1 − α̂_1) = 0.423/(1 − 0.515) = 0.87. The final 100 observations
in Fig. 9.6 show this convergence to the unconditional variance.
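The convergence is just a few lines of arithmetic. The sketch below uses the estimates quoted above; the starting value is arbitrary:

```python
# Recursive variance forecasts from an estimated ARCH(1): with no realized
# data beyond the sample, h_{t+1} = a0 + a1 * h_t, which converges to the
# unconditional variance a0 / (1 - a1).
a0_hat, a1_hat = 0.423, 0.515
h = 5.0                            # an arbitrary starting value
path = []
for _ in range(100):
    h = a0_hat + a1_hat * h
    path.append(h)

long_run = a0_hat / (1 - a1_hat)   # = 0.423/0.485, about 0.87
```

Because a1_hat is well below one, the forecast path collapses geometrically onto the long-run constant.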
9.3.2 AR(1)-ARCH(1)
In the previous subsection, our mean equation was not particularly interesting; it
was simply a constant plus some error. All of the dynamics were in the variance
equation. In this subsection, we add some dynamics to the mean equation.
We do this to make an important point: what makes the model an ARCH or
GARCH model is the variance equation. The mean equation can be almost anything.
Here, it will be an AR(1) process. (As this chapter progresses, the mean equation
will usually be quite uninteresting. Sometimes, we won’t even show it.)
9.3 ARCH Models 209
Y_t = β_0 + β_1 Y_{t−1} + ε_t. (9.15)

As before,

ε_t = (α_0 + α_1 ε²_{t−1})^{1/2} u_t, (9.16)

with u_t ∼ N(0, 1), and the αs and βs defined such that each autoregressive process
is stationary.
Unconditional Moments
First, we derive the unconditional moments. Y_t evolves as an AR(1) process.
Presuming that β_0 and β_1 are such that Y_t is stationary, then

E(Y_t) = β_0 / (1 − β_1).

Given that this is simply an AR(1) process (with an unusual but still zero-mean error
term), this equation should have been expected.
To derive the unconditional variance, let’s begin with the mean equation (the
AR(1) process) and substitute it recursively into itself a couple of times:
Y_t = β_0 + β_1 Y_{t−1} + ε_t
Y_t = β_0 + β_1(β_0 + β_1 Y_{t−2} + ε_{t−1}) + ε_t
Y_t = β_0 + β_1(β_0 + β_1(β_0 + β_1 Y_{t−3} + ε_{t−2}) + ε_{t−1}) + ε_t
Y_t = Σ_{i=0}^{2} β_0 β_1^i + Σ_{i=0}^{2} ε_{t−i} β_1^i + β_1³ Y_{t−3}.

Repeating the substitution indefinitely,

Y_t = β_0/(1 − β_1) + Σ_{i=0}^{∞} ε_{t−i} β_1^i. (9.17)
The unconditional variance can be found by using Eqs. (9.17) and (9.7):

V(Y_t) = V(β_0/(1 − β_1) + Σ_{i=0}^{∞} ε_{t−i} β_1^i)
= V(Σ_{i=0}^{∞} ε_{t−i} β_1^i)
= Σ_{i=0}^{∞} β_1^{2i} V(ε_{t−i})
= Σ_{i=0}^{∞} β_1^{2i} E(ε²_{t−i})
= (α_0/(1 − α_1)) Σ_{i=0}^{∞} β_1^{2i}
= (α_0/(1 − α_1)) · 1/(1 − β_1²).
Notice that adding a lagged Yt−1 term in the mean equation changed the uncondi-
tional variance of Yt . Will it change the conditional variance as well?
Conditional Moments
Below, we calculate the mean and variance of Y_t, conditional on the set of all
previous information, Ω_{t−1}. This means that we know the values of Y_{t−1}, Y_{t−2}, …,
of ε_{t−1}, ε_{t−2}, …, and of u_{t−1}, u_{t−2}, …
E(Y_t | Ω_{t−1}) = E[β_0 + β_1 Y_{t−1} + (α_0 + α_1 ε²_{t−1})^{1/2} u_t | Ω_{t−1}]
= β_0 + β_1 Y_{t−1} + E[(α_0 + α_1 ε²_{t−1})^{1/2} u_t | Ω_{t−1}]
= β_0 + β_1 Y_{t−1} + (α_0 + α_1 ε²_{t−1})^{1/2} E[u_t | Ω_{t−1}]
= β_0 + β_1 Y_{t−1} + (α_0 + α_1 ε²_{t−1})^{1/2} [0]
= β_0 + β_1 Y_{t−1}

and

V(Y_t | Ω_{t−1}) = V[β_0 + β_1 Y_{t−1} + (α_0 + α_1 ε²_{t−1})^{1/2} u_t | Ω_{t−1}]
= V(β_0 + β_1 Y_{t−1} | Ω_{t−1}) + (α_0 + α_1 ε²_{t−1}) V(u_t | Ω_{t−1})
= 0 + (α_0 + α_1 ε²_{t−1})(1)
= α_0 + α_1 ε²_{t−1}.
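The unconditional results for this AR(1)-ARCH(1) model can be verified by simulation. The Python sketch below is our own illustration; the parameter values match the simulated example estimated with Stata just below:

```python
import numpy as np

def simulate_ar1_arch1(b0, b1, a0, a1, T, seed=5):
    """Y_t = b0 + b1*Y_{t-1} + e_t with ARCH(1) errors."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=T)
    e = np.zeros(T)
    y = np.zeros(T)
    y[0] = b0 / (1 - b1)                      # start at the unconditional mean
    for t in range(1, T):
        e[t] = np.sqrt(a0 + a1 * e[t - 1] ** 2) * u[t]
        y[t] = b0 + b1 * y[t - 1] + e[t]
    return y

b0, b1, a0, a1 = 10.0, 0.10, 0.40, 0.50
y = simulate_ar1_arch1(b0, b1, a0, a1, T=300000)
m_sample = y[1000:].mean()                    # theory: b0/(1-b1) = 11.11
v_sample = y[1000:].var()
v_theory = (a0 / (1 - a1)) / (1 - b1 ** 2)    # about 0.808, as derived above
```

Both the unconditional mean β_0/(1 − β_1) and the unconditional variance formula derived above are matched closely in a long sample.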
[Figure: the simulated AR(1)-ARCH(1) series]
Stata does a good job estimating the parameters of this model, too. The mean
equation’s parameters, β0 and β1 , were defined to be 10 and 0.10. Stata estimates
them to be 9.95 and 0.103. The variance terms α0 and α1 were, as before, 0.40 and
0.50; Stata estimates them to be 0.437 and 0.465.
9.3.3 ARCH(2)
Y_t = β_0 + ε_t, (9.18)
ε_t = (α_0 + α_1 ε²_{t−1} + α_2 ε²_{t−2})^{1/2} u_t. (9.19)
Conditional Moments
Let's look more closely at the variance equation, Eq. (9.19). The expectation of ε_t, conditional on
its entire past history (which we denote as Ω_{t−1}) is:
E(ε_t | Ω_{t−1}) = E[(α_0 + α_1 ε²_{t−1} + α_2 ε²_{t−2})^{1/2} u_t | Ω_{t−1}]
= (α_0 + α_1 ε²_{t−1} + α_2 ε²_{t−2})^{1/2} E[u_t | Ω_{t−1}]
= (α_0 + α_1 ε²_{t−1} + α_2 ε²_{t−2})^{1/2} [0]
= 0.

The conditional variance, likewise, is

V(ε_t | Ω_{t−1}) = E(ε_t² | Ω_{t−1})
= (α_0 + α_1 ε²_{t−1} + α_2 ε²_{t−2}) E[u_t² | Ω_{t−1}]
= α_0 + α_1 ε²_{t−1} + α_2 ε²_{t−2}
= V(Y_t | Ω_{t−1}).
³ For σ_t² to be positive, as it must be since it is equal to the variance of Y_t, it is sufficient that all
the αs are positive.
The second-to-last equality follows from the fact that ut ∼ N (0, 1). The last
equality follows from the fact that V (Yt ) = V (t ). We can easily see that the
conditional variance depends upon its own past values.
Unconditional Moments
The unconditional expected value of Y_t is

E(Y_t) = E(β_0 + ε_t) = β_0.

The unconditional variance satisfies V(Y) = α_0 + α_1 V(Y) + α_2 V(Y), so that

V(Y) = α_0 / (1 − α_1 − α_2).
The data are graphed in Fig. 9.8. The volatility clustering is visually evident.
The unconditional kurtosis is calculated to be:
Stata estimates the parameters quite accurately. The estimation command is:
The estimates are very close to their true values. The constant in the mean
function was equal to 10 and was estimated at 10.01. The parameters of the variance
equation were 0.20, 0.30, and 0.40, and were estimated at 0.21, 0.29, and 0.43.
9.3.4 ARCH(q)
Y_t = β_0 + ε_t,
ε_t = (α_0 + α_1 ε²_{t−1} + … + α_q ε²_{t−q})^{1/2} u_t,

with u_t ∼ N(0, 1), α_0 > 0,

0 < α_i < 1,

and

Σ_{i=1}^{q} α_i < 1.
Conditional Moments
The conditional variance for the ARCH(q) process is

V(Y_t | Ω_{t−1}) = α_0 + α_1 ε²_{t−1} + α_2 ε²_{t−2} + … + α_q ε²_{t−q}.
Unconditional Moments
Using the now-familiar methods from earlier in this chapter, the unconditional
variance for an ARCH(q) process is
V(Y) = α_0 / (1 − α_1 − α_2 − … − α_q).
(1) The Ljung-Box test examines whether the squared residuals are autocorrelated;
if they are not, we may conclude that all of the ARCH effects have been removed.
We picked four lags just for the sake of illustration. Some researchers estimate an
autocorrelation function (ACF) to help pick the Ljung-Box lag length.
(2) The LM or ACF test estimates the autocorrelation function of e², the squared
residuals. This can be done by regressing e² on an ample number of lags of itself:

. reg e2 L.e2 L2.e2 L3.e2 L4.e2

or, more compactly,

. reg e2 L(1/4).e2
Then, test whether the coefficients are jointly significant. If so, then there is evidence
of ARCH effects. A graphical alternative is to generate the ACF function using:
Note the p-values. Even at one lag, there is evidence of autocorrelation in the
squared residuals, implying autocorrelation in the variance of GM stock returns.
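The mechanics of the LM statistic can be sketched in a few lines. The Python below is our own illustration with simulated data (arch_lm_stat is a hypothetical helper; Stata's archlm computes the statistic and its p-value for you):

```python
import numpy as np

def arch_lm_stat(e, p):
    """Sketch of the ARCH-LM test: regress e_t^2 on p of its own lags;
    T*R^2 is asymptotically chi-square(p) under the null of no ARCH."""
    e2 = e ** 2
    Y = e2[p:]
    X = np.column_stack([np.ones(len(Y))] +
                        [e2[p - j:len(e2) - j] for j in range(1, p + 1)])
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ b
    r2 = 1 - resid.var() / Y.var()
    return len(Y) * r2

rng = np.random.default_rng(11)
lm_iid = arch_lm_stat(rng.normal(size=5000), p=4)   # white noise: no ARCH

u = rng.normal(size=5000)                           # simulated ARCH(1) errors
e = np.zeros(5000)
for t in range(1, 5000):
    e[t] = np.sqrt(0.4 + 0.3 * e[t - 1] ** 2) * u[t]
lm_arch = arch_lm_stat(e, p=4)
```

For the white-noise series the statistic stays near its chi-square(4) mean of about 4, while for the ARCH series it is far beyond any conventional critical value.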
In the former case, an LM test with one lag would be sufficient to detect the
ARCH effects. In the latter, the ARCH effects would not be apparent until the LM
test reached lags=4.
It should be pointed out that, in general, the LM tests will show decreasing p-
values as the number of lags increases. This is because the sequence of tests just
adds variables to a joint significance test. That is, the “lags(p)=1” line shows results
of a test that one lag is significant. The second line shows the result of a test that lags
one or two are significant. Thus, the higher-lagged tests nest the previous tests.4
4 We are speaking loosely, here. The p-value would be guaranteed to drop by adding variables if
we were adding variables to a test from the same regression. In this LM output, however, we are
actually estimating and testing ten increasingly larger models and jointly testing the significance
of each model’s parameters, so “nesting” is not literally correct in the strict statistical sense of the
term. Still, the point stands, that adding lags to an LM test will almost always result in a p-value
less than 0.05; so be judicious when adding lags.
Table 9.1 summarizes the Stata output from above. In Stata, lower information
criteria indicate a better fitting model. Therefore, the AIC and BIC indicate that
nine lags would be preferred. A discrepancy between the two ICs is not uncommon.
Generally, BICs choose smaller models than AICs. To the extent that parsimony is
a virtue, the BICs are preferred. Others argue that the problem of omitted variable
bias is more important than the usual gains in efficiency, so that AICs are preferred.
The argument is an open one.
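The two criteria differ only in their penalty terms, which is easy to see directly. The sketch below uses Stata's definitions of AIC and BIC; the log-likelihood values are made up purely for illustration:

```python
import numpy as np

def aic_bic(loglik, n_params, n_obs):
    """Stata's definitions: AIC = -2*ll + 2*k and BIC = -2*ll + k*ln(N)."""
    return -2 * loglik + 2 * n_params, -2 * loglik + n_params * np.log(n_obs)

# Hypothetical log-likelihoods: the bigger model fits a little better,
# but adds seven parameters. Since ln(N) > 2 for any N > 7 or so, the
# BIC penalizes those extra parameters more heavily than the AIC does.
aic_small, bic_small = aic_bic(-1200.0, 3, 2500)
aic_big, bic_big = aic_bic(-1190.0, 10, 2500)
```

Here the AIC prefers the larger model while the BIC prefers the smaller one: exactly the kind of discrepancy described in the text.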
In this example, we will download some stock price data, test whether ARCH effects
are present, and estimate the appropriate ARCH model.
We will show several ways to accomplish the same task. In subsequent examples,
we will proceed more quickly.
For this example, let’s download the daily stock prices of Toyota Motor Company
(stock ticker “TM”) for 2000–2010.
The variable TM contains the daily percentage returns of Toyota stock. A graph of the
data is given in Fig. 9.9.
If the coefficients on the lagged squared residuals are statistically significant, then
this is evidence of ARCH effects. A quick and easy way to do this is:
Notice that the test statistic that we calculated “by hand” is identical to that which
Stata calculated via the archlm command. The p-value on this test is far below
0.05, so we reject the null hypothesis of “no ARCH.”
The Ljung-Box test indicates that there is significant autocorrelation (the p-value
is zero):
Finally, was the model that we estimated stationary? All of the ARCH coeffi-
cients are less than one. Do they add up to less than one?
The coefficients add to 0.78. Is this sufficiently far from one, statistically
speaking? To test this, we run a formal hypothesis test:
The hypothesis test verifies that 0.78 is sufficiently far from one. Thus, our
estimated ARCH model does not predict a variance that is growing without bound.
Download stock price data for Ford Motor Company (stock ticker “F”) for the
1990s. Test whether it has ARCH effects. Estimate an AR(0)-ARCH(5) model on
its daily percentage returns. Is it stationary?
First, we download the data. A graph of the daily returns is given in Fig. 9.10.
Both tests indicate strongly that there are ARCH effects present. We should
estimate an ARCH model, but of what length?
We calculate AICs for models with lags of 1 through 20 (output not shown).
The AIC indicates that a lag length of 15 fits best. Given this, we estimate the
ARCH(15) model:
Yes, the coefficients indicate a stationary model. The sum of the αs is less than
one (it is 0.53), and is statistically different from one (we reject the null since p <
0.05).
As you can see, it is not uncommon to have to estimate very large ARCH models.
We now turn to a related class of models, the so-called GARCH models, which are
able to mimic many of the properties of large ARCH models without having to
estimate quite as many parameters.
ARCH models can capture many of the features of financial data, but doing so
can require many lags in the variance equation. Bollerslev (1986) introduced a
solution to this problem via a generalization of the ARCH model. This new model,
called a GARCH(p,q) model, stands for “generalized autoregressive conditional
heteroskedasticity,” or “generalized ARCH.” A GARCH model can mimic an
infinite order ARCH model in the same way that an invertible MA process is
equivalent to an infinite order AR process.
In the rest of this section, we will explore the definition and estimation of a simple
GARCH(1,1) model before turning to the more general GARCH(p,q) model.
9.4.1 GARCH(1,1)
Before we jump to the GARCH(1,1) model, let's rewrite the variance equation of
our ARCH(1) model, usually written as

ε_t = (α_0 + α_1 ε²_{t−1})^{1/2} u_t, (9.22)

as a two-equation model:

ε_t = (σ_t²)^{1/2} u_t = σ_t u_t (9.23)
σ_t² = α_0 + α_1 ε²_{t−1}. (9.24)
As we saw before, the conditional variance was equal to the term in parentheses
in (9.22), hence our choice of notation: σ_t² = α_0 + α_1 ε²_{t−1}.
The GARCH(1,1) model amounts to a small change in Eq. (9.24), adding the
lagged variance:
σ_t² = α_0 + α_1 ε²_{t−1} + γ σ²_{t−1}. (9.25)
Substituting Eq. (9.25) into itself repeatedly,

σ_t² = α_0 + α_1 ε²_{t−1} + γ(α_0 + α_1 ε²_{t−2} + γ σ²_{t−2})
= α_0 + α_0 γ + α_1 ε²_{t−1} + α_1 γ ε²_{t−2} + γ² σ²_{t−2}
= α_0 + α_0 γ + α_1 ε²_{t−1} + α_1 γ ε²_{t−2} + γ²(α_0 + α_1 ε²_{t−3} + γ σ²_{t−3})
= α_0 + α_0 γ + α_0 γ² + α_1 ε²_{t−1} + α_1 γ ε²_{t−2} + α_1 γ² ε²_{t−3} + γ³ σ²_{t−3}
= α_0 Σ_{i=0}^{2} γ^i + α_1 Σ_{i=0}^{2} γ^i ε²_{t−i−1} + γ³ σ²_{t−3}.
Thus, simply adding one term in (9.24) turns a finite order process into an infinite
one. This allows us to capture a very complex process without needing to estimate
tons of parameters; one special parameter pulls a lot of weight.
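This equivalence can be confirmed numerically. The sketch below (our own illustration, with arbitrary parameter values) simulates a GARCH(1,1) variance path and reconstructs one conditional variance from the geometric-lag ARCH representation:

```python
import numpy as np

a0, a1, g = 0.2, 0.4, 0.5
rng = np.random.default_rng(9)
T = 2000
u = rng.normal(size=T)
e = np.zeros(T)
sig2 = np.full(T, a0 / (1 - a1 - g))   # initialize at the long-run variance
for t in range(1, T):
    sig2[t] = a0 + a1 * e[t - 1] ** 2 + g * sig2[t - 1]
    e[t] = np.sqrt(sig2[t]) * u[t]

# ARCH(infinity) representation implied by the recursion above:
# sig2_t = a0/(1 - g) + a1 * sum_{i>=0} g^i * e_{t-1-i}^2
arch_inf = a0 / (1 - g) + a1 * sum(g ** i * e[T - 2 - i] ** 2
                                   for i in range(200))
```

The truncated infinite sum reproduces the recursively computed conditional variance to floating-point accuracy: one γ parameter really does the work of infinitely many ARCH lags.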
To summarize, the variance equations of the GARCH(1,1) model are:
ε_t = σ_t u_t (9.26)
σ_t² = α_0 + α_1 ε²_{t−1} + γ σ²_{t−1}. (9.27)
9.4 GARCH Models 231
The data are graphed in Fig. 9.11. We can estimate the model in Stata by
The true mean equation consisted of only β_0 = 10 plus error; it was estimated at
9.992. The constant in the variance equation, α_0, was set at 0.20 and was estimated
as 0.198; α_1 = 0.40 was estimated as 0.412; and γ = 0.60 was estimated as 0.586.
9.4.2 GARCH(p,q)
ε_t = σ_t u_t (9.28)
σ_t² = α_0 + α_1 ε²_{t−1} + α_2 ε²_{t−2} + … + α_p ε²_{t−p} (9.29)
    + γ_1 σ²_{t−1} + γ_2 σ²_{t−2} + … + γ_q σ²_{t−q}.
In practice, it is rare for stock returns to require more than two lags in the ARCH
and GARCH components (i.e. p ≤ 2 and q ≤ 2). Exceptions to this are often
due to model misspecification. For example, if day-of-the-week effects are ignored,
the standard model selection techniques will falsely prefer larger GARCH models
(Bollerslev, Chou and Kroner 1992).
Y_t = 0.10 + ε_t
ε_t = σ_t u_t
u_t ∼ N(0, 1).
Of course, in practice, we do not have the luxury of knowing what the true model
is. We will continue as though we are agnostic of the true nature of the data.
It is always best to begin by graphing the data. (See Fig. 9.12.) At least visually,
there seems to be volatility clustering. We can formally test and confirm the presence
of volatility clustering using the Ljung-Box:
Given that there is volatility clustering, which type of model best fits our data?
An ARCH(1)? A GARCH(2,1)? We estimate several such models and choose the
one with the lowest AIC or BIC. The AICs and BICs are summarized in Table 9.2.
Both information criteria choose the correct model, indicating that GARCH(2,1) fits
best.
Now we can estimate the model.
The model is stationary. We conclude that there were GARCH effects in the data
and that our model has properly captured them.
Thus, we reject the null hypothesis that the true parameters add up to one or
greater: the estimated model is stationary.
9.5.1 GARCH-t
Up until this point, our models have presumed that the errors were drawn from a
normal distribution. Bollerslev (1987) developed a version of his GARCH model
where the errors come from a t-distribution. This was because previous research5
had indicated that various financial time series—such as foreign exchange rates
and major stock market indices—exhibited more leptokurtosis than the standard
GARCH models were able to mimic.
It is quite easy to force Stata to draw its errors from a t rather than a normal
distribution, by adding “distribution(t #)” as an option. For example, to
estimate a GARCH(1,1)-t model, where the errors come from a t distribution with
five degrees of freedom, the command would be
Alternatively, you can leave the degrees of freedom unspecified and have Stata
estimate it:
Y_t = β_0 + ε_t
ε_t = σ_t u_t
σ_t² = α_0 + α_1 ε²_{t−1} + γ σ²_{t−1}
u_t ∼ t(ν),
5 Bollerslev points to Milhøj (1987), Hsieh (1988) and McCurdy and Morgan (1985).
9.5 Variations on GARCH 241
First we load the dataset and calculate the continuously compounded rate of
return:
We should note that Bollerslev did not justify his choice of ten lags in the Ljung-
Box tests, and we are simply following his choice. GARCH effects seem to be
evident, and the large kurtosis implies that errors from a t-distribution would yield
a better fit than a normal distribution.
We then estimate a GARCH(1,1) model with t-errors, calculate the residuals and
predict the variance:
Did GARCH(1,1)-t fit the data well? To answer this, we calculate the standard-
ized residuals. We then subject the standardized residuals to the same tests (Q, Q²,
and kurtosis) to see whether we have properly accounted for time-varying volatility
and thick-tailed returns.
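The Q and Q² checks can be sketched in a few lines of Python (the book runs them in Stata). The `ljung_box` function below implements the textbook Q statistic; `eps` and `sigma2` are simulated stand-ins for the fitted residuals and conditional variances, not the Bollerslev data.

```python
import numpy as np

def ljung_box(x, lags=10):
    """Ljung-Box Q statistic for autocorrelation in x up to `lags`."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    xc = x - x.mean()
    denom = np.sum(xc ** 2)
    q = 0.0
    for k in range(1, lags + 1):
        rho_k = np.sum(xc[k:] * xc[:-k]) / denom   # lag-k autocorrelation
        q += rho_k ** 2 / (T - k)
    return T * (T + 2) * q

# Stand-ins for a fitted model's residuals and conditional variances:
rng = np.random.default_rng(1)
sigma2 = 1.0 + 0.5 * rng.random(500)
eps = np.sqrt(sigma2) * rng.standard_normal(500)

z = eps / np.sqrt(sigma2)            # standardized residuals
Q2 = ljung_box(z ** 2, lags=10)      # compare to a chi-squared(10) critical value
```

If the model has captured the time-varying volatility, Q² should be small relative to the chi-squared(10) critical value.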
There are no free lunches in economics. This truism also holds in finance: higher
rewards require higher risks. No one would undertake additional, unnecessary
risk unless compensated with the prospect of additional returns. When
purchasing stocks, for example, risk is a direct function of the conditional variance
of the underlying stock. A regression attempting to estimate the returns on
a stock should, therefore, include a term related to the conditional variance (σ²) in its
mean equation. In GARCH-M (aka GARCH-in-mean) models, the variance is an
explicit term in the mean equation:
y_t = βX + λσ_t + ε_t .    (9.30)
The variance equation can be any of the variance equations that we have seen.
The idea of including the variance in-mean is due to Engle et al. (1987) in
the context of ARCH-M models, French et al. (1987) for GARCH-M models, and
Bollerslev et al. (1988) for multivariate GARCH-M models.
To estimate the following GARCH(1,1)-in-mean model:
y_t = β_0 + β_1 X + λσ_t + ε_t
ε_t = σ_t u_t
σ²_t = α_0 + α_1 ε²_{t−1} + γ σ²_{t−1} ,
The archm option specifies that the mean equation includes the conditional
variance.
More general GARCH-M models simply include more lags of the variance:
y_t = βX + Σ_{l=1}^{L} λ_l σ_{t−l} + ε_t .    (9.31)
To estimate a model with two lags of the conditional variance in the mean
equation, such as:
y_t = β_0 + β_1 X + λ_1 σ_t + λ_2 σ_{t−1} + ε_t
ε_t = σ_t u_t
σ²_t = α_0 + α_1 ε²_{t−1} + γ σ²_{t−1} ,
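As a sketch of what the in-mean terms do, the Python fragment below simulates a GARCH(1,1) variance path and then builds the two-lag in-mean equation above. The book estimates these models in Stata; every parameter value here is hypothetical.

```python
import numpy as np

# GARCH(1,1) variance path (hypothetical parameters).
rng = np.random.default_rng(2)
beta0, lam1, lam2 = 0.01, 0.08, 0.03
alpha0, alpha1, gamma = 0.05, 0.10, 0.85

T = 500
sigma2 = np.empty(T)
eps = np.empty(T)
sigma2[0] = alpha0 / (1 - alpha1 - gamma)
eps[0] = np.sqrt(sigma2[0]) * rng.standard_normal()
for t in range(1, T):
    sigma2[t] = alpha0 + alpha1 * eps[t - 1] ** 2 + gamma * sigma2[t - 1]
    eps[t] = np.sqrt(sigma2[t]) * rng.standard_normal()

# Mean equation with the conditional sd and its lag as regressors:
sigma = np.sqrt(sigma2)
y = np.empty(T)
y[0] = beta0 + lam1 * sigma[0] + eps[0]              # no lagged sd at t = 0
y[1:] = beta0 + lam1 * sigma[1:] + lam2 * sigma[:-1] + eps[1:]
```

Periods with high conditional variance carry a higher expected return, which is the risk-return trade-off the GARCH-M specification encodes.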
Calculate the excess rates of return for the three stock market indices:
6 The standard references include Sharpe (1964), Lintner (1965), and Merton (1973, 1980).
We estimate the coefficient of relative risk aversion to be between 1.6 and 2.9,
depending on which equity market is in our sample.
This relative consistency in the estimated in-mean risk-aversion parameter is
a little uncommon. French et al. (1987) estimate a GARCH-M model on NYSE
and S&P returns over several different time periods. Their estimates of λ vary
considerably (between 0.6 and 7.8).7 Baillie and DeGennaro (1990) re-estimate the
7 In response, Chou et al. (1992) developed a variation of GARCH-M called TVP-GARCH-M that
allows λ to vary over time, essentially modeling λt as a random walk.
In this subsection, we discuss three variations on the standard GARCH model which
are designed to capture an asymmetric response to new information. In finance,
the arrival of new information is usually considered to be an unexpected event
and is therefore a component of the error term. Many researchers, and investors,
for that matter, have noticed that volatility can rise quite rapidly and unexpectedly,
but it does not dampen quite as quickly as it rises. That is, there is an asymmetric
volatility response to the error term. The models that we discuss attempt to capture
this phenomenon in slightly different ways.
GJR-GARCH
Our standard GARCH(1,1) variance equation was Eq. (9.24), which we repeat here
for convenience:
σ²_t = α_0 + α_1 ε²_{t−1} + γ_1 σ²_{t−1} .    (9.32)
Glosten et al. (1993) altered this equation by decomposing the effect of ε²_{t−1} into
the sum of two different effects via a dummy-variable interaction:

σ²_t = α_0 + α_1 ε²_{t−1} + α_2 D_{t−1} ε²_{t−1} + γ_1 σ²_{t−1}    (9.33)

D_{t−1} = 1, if ε_{t−1} ≥ 0; 0, otherwise.    (9.34)
In this way, when the error is positive, the dummy variable D_{t−1} is equal to one,
and

σ²_t = α_0 + (α_1 + α_2) ε²_{t−1} + γ_1 σ²_{t−1} ;

when the error is negative, D_{t−1} is zero, and

σ²_t = α_0 + α_1 ε²_{t−1} + γ_1 σ²_{t−1} .
D_{t−1} = 1, if ε_{t−1} ≥ 0; 0, otherwise
D_{t−2} = 1, if ε_{t−2} ≥ 0; 0, otherwise
The syntax is sufficiently flexible to allow the dummy variable to affect all
of the ARCH terms, or only some of them. It would be difficult to find a theoretical
justification as to why the asymmetry would affect only some time periods and not
others, so on a priori grounds, the set of lags in the tarch() option should be the
same as those in the arch() option. Occasionally, a researcher chooses to interact
only the first lag and leave the other lagged ε² terms as symmetric. This is usually
done if the researcher is trying to economize on degrees of freedom.
Thus, to estimate a GJR-GARCH(2,1) model where only the first lag is asym-
metric:
σ²_t = α_0 + α_1 ε²_{t−1} + α_2 D_{t−1} ε²_{t−1} + α_3 ε²_{t−2} + γ_1 σ²_{t−1}

D_t = 1, if ε_t ≥ 0; 0, otherwise
In this case, α1 and α3 are the coefficients from arch(1/2), α2 is the coefficient
from tarch(1), and γ1 is the coefficient from garch(1).
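A quick numerical sketch shows what the dummy interaction buys us. With the book's convention that D = 1 when the lagged error is non-negative, two shocks of the same magnitude but opposite sign imply different next-period variances (the coefficients below are hypothetical).

```python
import numpy as np

# GJR-GARCH(2,1) variance equation with the dummy on the first lag only.
# Hypothetical coefficient values, for illustration.
alpha0, alpha1, alpha2, alpha3, gamma1 = 0.10, 0.05, 0.10, 0.03, 0.80

def gjr_variance(eps1, eps2, sig2_1):
    """sigma^2_t given eps_{t-1}, eps_{t-2}, and sigma^2_{t-1}."""
    D1 = 1.0 if eps1 >= 0 else 0.0       # the book's convention: D = 1 if eps >= 0
    return (alpha0 + alpha1 * eps1 ** 2 + alpha2 * D1 * eps1 ** 2
            + alpha3 * eps2 ** 2 + gamma1 * sig2_1)

v_pos = gjr_variance(+2.0, 0.5, 1.0)     # positive shock: dummy switched on
v_neg = gjr_variance(-2.0, 0.5, 1.0)     # same magnitude, dummy switched off
gap = v_pos - v_neg                      # equals alpha2 * eps^2
```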
E-GARCH
Another form of asymmetric GARCH model is Nelson’s (1991) Exponential
GARCH, or “E-GARCH” model. E-GARCH(1,1) makes several changes to the
standard variance Eq. (9.24), which we repeat here for convenience:
σ²_t = α_0 + α_1 ε²_{t−1} + γ_1 σ²_{t−1} .    (9.35)
where z_{t−1} = ε_{t−1}/σ_{t−1} so that z_{t−1} ∼ N(0, 1). Since z_{t−1} is a standard normal
variable, its moments are well known.8 For example, it is known that E|z_{t−1}| =
√(2/π). Therefore, after substitution, (9.36) becomes

g(z_{t−1}) = α_{11} z_{t−1} + α_{12} ( |z_{t−1}| − √(2/π) )    (9.37)
the output of which is presented in several parts. First, the parameters of the mean
equation are shown. This is followed by the parameters from the variance equations.
The parameters from Eq. (9.37) are shown as L1.earch and L1.earch_a.
Specifically, L1.earch = α11 and L1.earch_a = α12 .
8 When X ∼ N(μ, σ), |X| has a “folded normal distribution.” When μ = 0, the distribution of
|X| is commonly known as the “half normal distribution.”
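The asymmetry in Eq. (9.37) is easy to verify numerically. In the hypothetical sketch below, the sign term α₁₁z and the magnitude term α₁₂(|z| − √(2/π)) together imply that positive and negative standardized shocks of equal size move the log variance by different amounts.

```python
import numpy as np

alpha11, alpha12 = -0.10, 0.20   # hypothetical coefficients

def g(z):
    """News-impact function g(z) from Eq. (9.37)."""
    return alpha11 * z + alpha12 * (np.abs(z) - np.sqrt(2.0 / np.pi))

z = 1.5
asymmetry = g(z) - g(-z)         # the |z| terms cancel, leaving 2 * alpha11 * z
```

With a negative α₁₁, a negative shock raises the log variance by more than a positive shock of the same size, which is the leverage effect E-GARCH is built to capture.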
E-GARCH models with more lags can be handled easily. The variance equation
of a general E-GARCH(p,q) model is:

ln σ²_t = α_0 + Σ_{i=1}^{p} [ α_{i1} z_{t−i} + α_{i2} ( |z_{t−i}| − √(2/π) ) ] + Σ_{j=1}^{q} γ_j ln σ²_{t−j}    (9.39)
and is estimated with
That is, to add additional lags of the logged variance, just add terms to the
egarch option. To add additional lags of z, the standardized residual, just add
terms to the earch option.
Stata reports the coefficients of these in blocks. For each lag there are two earch
components: the coefficient on lagged z (L.earch), and the coefficient on the
absolute value of lagged z (L.earch_a).
This will make more sense with an example. Below, we generated data from the
following model:
Y_t = β_0 + ε_t
ε_t = σ_t u_t
u_t ∼ N(0, 1)
ln σ²_t = α_0 + Σ_{i=1}^{2} [ α_{i1} z_{t−i} + α_{i2} ( |z_{t−i}| − √(2/π) ) ] + Σ_{j=1}^{3} γ_j ln σ²_{t−j}
= α_0 + α_{11} z_{t−1} + α_{12} ( |z_{t−1}| − √(2/π) )
+ α_{21} z_{t−2} + α_{22} ( |z_{t−2}| − √(2/π) )
+ γ_1 ln σ²_{t−1} + γ_2 ln σ²_{t−2} + γ_3 ln σ²_{t−3} .
After generating the data (not shown), we estimate the model using:
The following should help you map the coefficients and the Stata output:
Now that we have estimated the models, we can report the results:
Which model fits best? We get conflicting results—a fact that is all too common
in applied work. Model selection is often done by comparing information criteria.
In this particular case, the AIC shows a slight preference for E-GARCH, while the
BIC shows a slight preference for the standard GARCH model without asymmetry.
A glance at the correlation table of the predicted variances shows that the differences
between the models are fairly modest. The predicted variances are highly correlated,
with the lowest correlation at almost 90%. The threshold term in the GJR-GARCH
model is insignificant. Given this and the fact that neither information criterion
indicates this as a preferred model, we can safely discard this model.
It is quite common to find that the sum of the estimated ARCH and GARCH
parameters is quite close to one. That is, it would seem that the estimated model
has a unit root or is integrated.
If this is the case, there is no finite unconditional variance. While this presents
some theoretical problems, it solves some others.
An I-GARCH(p,q) model restricts the standard GARCH(p,q) model by forcing
a unit root:

Σ_{i=1}^{p} α_i + Σ_{j=1}^{q} γ_j = 1.
Integrated GARCH models have the feature that their unconditional variance is
not mean-reverting. The predicted variance from traditional GARCH models gets
closer and closer to the long-run variance as the forecast horizon increases. That
is, the one-step ahead estimate is a bit closer to the long-run variance, σ 2 . The
two-step ahead forecast is even closer. In the limit, the forecasted variance from a
traditional GARCH model is simply the unconditional variance. This is not the case
for an I-GARCH model. I-GARCH models share a similarity with other integrated
processes, specifically that the effect of a shock does not dampen over time. Rather,
the effects of the shock linger indefinitely. A shock that increases the variance of
a process will result in an increased variance indefinitely, or at least until a possible
sequence of negative shocks draws it back down. The point, however, is that there
is no guarantee that this will happen.
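The contrast between mean-reverting and integrated variance forecasts can be sketched by iterating the h-step forecast E[σ²_{t+h}] = α₀ + (α₁ + γ)E[σ²_{t+h−1}]. The parameter values below are hypothetical; the only thing that matters is whether α₁ + γ is below one or equal to one.

```python
import numpy as np

def variance_forecast(alpha0, alpha1, gamma, s2_now, horizon):
    """Iterate the h-step-ahead conditional-variance forecast."""
    path = [s2_now]
    for _ in range(horizon):
        path.append(alpha0 + (alpha1 + gamma) * path[-1])
    return np.array(path)

s2_now = 5.0                                   # variance right after a large shock
stationary = variance_forecast(0.10, 0.15, 0.80, s2_now, horizon=200)  # sum = 0.95
igarch = variance_forecast(0.10, 0.20, 0.80, s2_now, horizon=200)      # sum = 1.00

long_run = 0.10 / (1 - 0.15 - 0.80)            # unconditional variance = 2.0
```

The stationary forecast glides back to the long-run variance of 2.0; the I-GARCH forecast never reverts, growing by α₀ every period.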
One of the fathers of Chaos Theory, Mandelbrot (1963), examined several
financial time series and found that financial returns had thick tails, far thicker than
a normal distribution would indicate. But so much of portfolio theory was based on
the presumption that returns were normal. Further, Mandelbrot could not replicate
his results in different time periods. The beta for a stock, for example, would have
one value during one period and a far different value in another. He hypothesized
that stock returns came from a distribution that did not have a finite variance. Thus,
econometricians were trying to pin down values that did not exist. Mandelbrot
found that processes drawn from the Stable Paretian family of distributions—which,
incidentally, do not have finite moments—seemed to mimic what he saw in the data.
They were leptokurtic and had heavy tails.9 Engle and Bollerslev (1986) developed
9 See also Gleick (1987) and Peters (1996) for accessible discussions on the relationships between
Mandelbrot’s Paretian hypothesis, Chaos theory, and finance.
the I-GARCH model, which also seemed to share some of the same features that
Mandelbrot saw in the data. (It was an attempt to mimic this leptokurtosis that also
led Bollerslev (1987) to develop the GARCH-t model.)
Ghose and Kroner (1995) explored Mandelbrot’s hypothesis and the performance
of the I-GARCH model. The two models have many of the same features: they
have infinite variances, fat tails, are leptokurtic, and aggregate similarly. However,
they are not identical processes. Ghose and Kroner (1995) found that stable-Paretian
processes do not exhibit the volatility clustering that is so apparent in financial data.
On these grounds, they rejected Mandelbrot’s stable-Paretian hypothesis in favor of
I-GARCH processes.
Integrated GARCH processes are interesting for several reasons. First, they fit
economic data very well. Second, they are a restricted model, so they are more
efficient. If you know, somehow, that the variance process is integrated, then
restricting the coefficients to add to one means that we have one fewer parameter to
estimate.
We must provide a number as an identifier for each constraint. We will have only
one constraint, so we’ll call it constraint 1. We specify the constraint:
Notice how the ARCH and GARCH coefficients add to one (0.1841262 +
0.8158738 = 1), as we demanded in the constraint. Further, the ARCH and GARCH
coefficients are statistically significant, implying that volatility in returns has a
strong autoregressive component.
9.6 Exercises
estimate the appropriate ARCH model and report your results. Defend your
choice of model using some of the appropriate post-estimation specification
tests.
6. Download the daily stock prices for the 3M Co. (stock ticker “MMM”) from
the beginning of January 2000 to the end of December 2010. Calculate the
percentage returns using the log-difference approach. Calculate the AICs and
BICs for an AR(0)-ARCH(20) and an AR(0)-ARCH(10) model of these daily
returns. Which model is preferred by the AIC? Which model is preferred by the
BIC?
7. We will compare the performance of a large ARCH model with a small
GARCH model. Download the Ford dataset that we used earlier in this chapter,
arch-F.dta, and re-estimate the ARCH(10) model. Estimate a GARCH(1,1)
model. Using the AIC, which model is preferred? Using BIC? Estimate the
conditional variances from each model. What is the correlation between them?
8. What Stata command would you use to estimate the following model?
ln σ²_t = α_0 + Σ_{i=1}^{2} [ α_{i1} z_{t−i} + α_{i2} ( |z_{t−i}| − √(2/π) ) ] + Σ_{j=1}^{3} γ_j ln σ²_{t−j}
9. What Stata command would you use to estimate the following model?
ln σ²_t = α_0 + Σ_{i=1}^{3} [ α_{i1} u_{t−i} + α_{i2} ( |u_{t−i}| − √(2/π) ) ] + Σ_{j=1}^{4} γ_j ln σ²_{t−j}
10. What Stata command would you use to estimate the following model?
Y_t = β_0 + β_1 Y_{t−1} + β_2 X_t + β_3 ε_t + β_4 ε_{t−1}
ε_t = σ_t u_t
σ²_t = α_0 + α_1 ε²_{t−1} + γ_1 σ²_{t−1}
11. Write down the equations that describe the following models:
(a) GARCH(1,2)
(b) AR(2)-ARCH(5)
(c) AR(3)-GARCH(2,1)
(d) ARMA(2,3)-GARCH(1,1)
(e) ARMA(4,3)-EGARCH(2,1)
(f) ARMA(2,0)-GJR-GARCH(1,1)
12. Continue replicating Bollerslev’s (1987) paper, this time on S&P returns.
Download the BollerslevSP.dta dataset. Generate the dependent variable
using the following Stata commands.
10.1 Introduction
The VAR is most closely associated with Christopher Sims’ (1980b) article,
“Macroeconomics and Reality.” As with everything, there is nothing new under the
sun. Sims’ article relied on a long history among economists of estimating vector
1 It was found not to be the case. There are many hidden assumptions in VARs; the researcher
cannot stand outside the research process, even in the case of VARs.
working at the University of Minnesota and the Minnesota Federal Reserve. (They
also shared the 2011 Nobel Prize in Economics.) At a 1975 conference hosted by
the Minnesota Fed, they presented a paper on the VAR entitled “Business cycle
modeling without pretending to have too much a priori economic theory,” the title
hinting at an imminent attack on the Cowles approach. The paper was published in
the conference proceedings as Sargent et al. (1977).
Christopher Sims’ (1980b) brashly titled paper “Macroeconomics and Reality”
was “widely regarded as the manifesto to the VAR approach” (Qin 2011, p. 162–3).
It is one of the most widely cited papers in all of economics.2 In that paper, Sims laid
out his vision for the VAR as a fully coherent substitute for, and improvement upon,
the Cowles Commission approach. The paper was not received without controversy
(see, for example, Cooley and LeRoy 1985 and Leamer 1985) but its eventual
dominance was nearly absolute.
But if the major innovation in Sims (1980b) was that it did not require making
identifying restrictions, the subsequent history of the VAR has witnessed a retreat
from this. No sooner was the atheoretical VAR in use than theoretical identifying
restrictions began to be imposed. Responding to the criticism by Cooley and LeRoy
(1985), the age of the VAR was quickly followed by the age of the structural
VAR. This was not a reversion to the errors of the Cowles Commission—their
identifying restrictions were considered ad-hoc, “incredible,” and without solid
theoretical backing. Rather, it was the imposition of theory-based restrictions on
formerly unrestricted VARs. This was required for VARs to be useful, not just for
describing data, but for prescribing policy. What was needed was a structural model—
one with current values of the policy variables showing up on both sides of the equal
sign—and errors that are not cross-correlated. What was needed was a structural
VAR.
There have been scores of different identifying assumptions proposed for SVARs.
These primarily include short-run restrictions (Christiano et al. 1999) and long-
run restrictions (Blanchard and Quah 1989). Though not discussed in this book,
Uhlig (2005) introduced sign restrictions as a less restrictive sort of identifying
restriction.3
Other researchers have acknowledged that when a researcher imposes
constraints—setting parameters to zero, or requiring cross-equation restrictions—he
imposes his beliefs on a parameter. The Bayesian VAR is an attempt to formalize
and loosen this requirement. Rather than setting a parameter to zero, the Bayesian
econometrician sets a soft constraint: a prior on the parameter which is centered
over zero. Thus, it is allowed to vary from zero, but only if the data require it.
Examples of this approach include Doan et al. (1984) and Litterman (1985).4
Whiteman (1998), and Geweke and Whiteman (2006). Most of these researchers are connected to
the University of Minnesota or the University of Iowa.
266 10 Vector Autoregressions I: Basics
There have been many technical extensions and modifications to VARs. We’ll
begin learning about VARs by examining the simplest possible VAR, one with two
variables and one lag.
Suppose that X’s value depends on its past, as well as past values of Y . Suppose the
same could be said of Y . This is the essence of a vector autoregression.
The simplest vector autoregression has two variables and one lag:
It is simple to estimate the above model. In fact, nothing fancier than ordinary
least squares is required. So why do we make such a big deal out of it? Because
we can do a lot of other things once we have estimated these parameters. We can
graphically describe complex interactions, we can make statements about causality,
we might even understand something about how the structure of the real economy
works. But we’re getting ahead of ourselves. Let’s start simple: let’s estimate a
simple VAR model.
Suppose we want to know how two variables—the growth rates of the US’s
real GNP and money supply—are dynamically correlated. The data are graphed
in Fig. 10.1. We will estimate the VAR between these variables using OLS and then
again using Stata’s var command and compare results.
Notice that the coefficients are identical in the two approaches. However, there
is an important difference between the two. They differ in their standard errors.
This is because the OLS approach presumes that the errors are not correlated
across equations. When that isn’t the case, and the shocks to the equations
are correlated, then we need to adjust the standard errors to account for that.
Fortunately, Stata’s var and varbasic commands take care of that complication
automatically. They do so by using seemingly unrelated regression (SUR). We leave
it as an exercise for you to show that estimates using SUR are the same as those
from var.
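The claim that OLS recovers the VAR's coefficients can be checked directly. The sketch below simulates a hypothetical two-variable VAR(1) in Python (not the GNP/M1 data) and shows that equation-by-equation OLS gives exactly the same point estimates as the one-shot multivariate least-squares fit, which is why var and OLS agree on the coefficients.

```python
import numpy as np

# Simulate a hypothetical two-variable VAR(1) with no constant.
rng = np.random.default_rng(3)
B_true = np.array([[0.5, 0.1],
                   [0.2, 0.4]])
T = 2000
Y = np.zeros((T, 2))
for t in range(1, T):
    Y[t] = B_true @ Y[t - 1] + 0.1 * rng.standard_normal(2)

X = Y[:-1]      # regressors: both variables, lagged once
Z = Y[1:]       # left-hand sides

# Equation-by-equation OLS...
b_eq1 = np.linalg.lstsq(X, Z[:, 0], rcond=None)[0]
b_eq2 = np.linalg.lstsq(X, Z[:, 1], rcond=None)[0]

# ...equals the multivariate least-squares solution in one shot.
B_hat = np.linalg.lstsq(X, Z, rcond=None)[0]
```

The coefficients match because every equation has the same right-hand-side variables; only the standard errors change when the cross-equation error correlation is taken into account.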
What does the output above tell us? It is hard to tell; there’s a lot going on. It looks
like increases in the growth rate of real GNP have some inertia from one quarter
to the next (lagged X in the first equation is positive and statistically significant).
10.2 A Simple VAR(1) and How to Estimate it 269
[Fig. 10.2: impulse-response functions from varbasic, graphed by irfname, impulse
variable, and response variable (e.g. panels “varbasic, M1gr, GNPgr” and
“varbasic, M1gr, M1gr”), over steps 0 to 8, with 95% confidence intervals around
each impulse-response function (irf).]
It does not appear that changes in the growth rate of the money supply (Y ) are
correlated with the next period’s GNP growth. (Y is statistically insignificant in the
first equation, with a p-value of 0.492.)
Suppose that there is one solitary shock to ε_{1,t=1}, and all other εs are zero.
How does this one shock affect the whole system? It affects X_{t=1} immediately via
Eq. (10.1). Then the shock propagates: X_{t=1} affects X_{t=2} via (10.1) and Y_{t=2} via
Eq. (10.2). The process doesn’t stop there, and things get even more complicated.
The newly affected variables X_{t=2} and Y_{t=2} now affect X_{t=3} and Y_{t=3} within and
across equations. Specifically, X_{t=2} affects X_{t=3} via Eq. (10.1) and Y_{t=3}
via (10.2); Y_{t=2} also affects X_{t=3} and Y_{t=3} via Eqs. (10.1) and (10.2). This is
dizzying.
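The bookkeeping above is mechanical, which is exactly why we let the computer do it. The sketch below iterates a one-unit shock through a hypothetical two-variable VAR(1) coefficient matrix; the resulting paths are impulse responses.

```python
import numpy as np

B = np.array([[0.5, 0.1],        # hypothetical VAR(1) coefficient matrix
              [0.3, 0.4]])

horizon = 8
irf = np.zeros((horizon + 1, 2))
irf[0] = [1.0, 0.0]              # one-unit shock to the first equation only
for h in range(1, horizon + 1):
    irf[h] = B @ irf[h - 1]      # the shock echoes within and across equations
```

Because the eigenvalues of this B lie inside the unit circle, the responses die out as the horizon grows.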
It would be much easier if we could see visually how changes in each of the
variables affect the other variables over time. We will get into the calculation of
these impulse response functions (IRFs) later in the chapter. Indeed, when it comes
to IRFs, the devil is truly in the details. But we’re not ready for details yet. For now,
let’s see what VARs can do.
The Stata command to graph the IRF in Fig. 10.2 is
. irf graph irf
The headings at the top of each panel list the impulse variable first, followed by
the response. Thus, the top left panel shows how a one standard deviation increase
in GNP’s growth rate affects itself over time. We see that the shock’s effects dampen
out, and become indistinguishable from zero by around period 4. At the bottom left,
we see that changes to the growth rate of the money supply do not seem to affect the
growth rate of GNP. At the bottom right, we see that a one unit increase in the M1
growth rate dampens out over time until it reaches its usual rate by around period 5.
The top right panel shows how a one standard deviation shock to GNPgr affects
the money supply. It doesn’t affect it by much, and the effect is statistically
insignificant. You should always report the confidence intervals around your IRFs.
In practice, these confidence intervals can be quite large. If you ignore this fact, you
risk presenting statistically insignificant results as though they were meaningful.5
Exercises
1. Re-estimate the two-variable VAR(1) model on the growth rate of the money
supply and the growth rate of RGNP, and show that the results from (a) Stata’s
var command and (b) seemingly unrelated regression (sureg) are equivalent.
Specifically, estimate the following two commands:
. sureg (GNPgr L.GNPgr L.M1gr) (M1gr L.GNPgr L.M1gr)
. var GNPgr M1gr, lags(1)
It is seldom the case that economic theory can tell us how many lags a model should
include. We are faced with a very practical question: how many lags should we
include in our model? Should it have one lag: VAR(1)? Two lags: VAR(2)? Eight
lags: VAR(8)?
One approach is to estimate many VAR models with different numbers of lags
and compare their fit. That is, estimate and compare a VAR(1), a VAR(2), etc.
through VAR(p). You have to decide how large p is. You also need to decide
on what selection criterion or measure of fit you’ll use. The standard set of lag-
length selection criteria includes Akaike’s information criterion (AIC), Schwarz’s
Bayesian information criterion (SBIC), and the Hannan and Quinn information
criterion (HQIC). These are the same metrics that we used in choosing lags for
ARMA(p,q) models, generalized to the multivariate case. The details of the formulas
for these information criteria are not important here. They all adjust the log-
likelihood function of the estimated VAR, penalizing it by some function of model
complexity. The three criteria differ in how, and by how much, they penalize for the
number of parameters. (This makes them quite similar to the adjusted R², which
penalizes the measure of fit (R²) by a function of the number of parameters being
estimated.)
Stata automates much of this task, so we don’t need to manually estimate VAR(1)
through VAR(p), nor do we have to manually calculate each VAR’s information
criteria. Stata’s varsoc command automates all of this for us.
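To see what varsoc is doing under the hood, here is a simplified Python sketch on simulated data. For each candidate lag length it fits the VAR by OLS and computes one common form of the AIC, ln|Σ̂| + 2k/T. Stata's reported values are likelihood-based and will differ by constants, so treat this only as an illustration of the mechanics.

```python
import numpy as np

# Simulated data from a hypothetical two-variable VAR(1).
rng = np.random.default_rng(4)
B = np.array([[0.5, 0.1], [0.2, 0.4]])
T = 400
Y = np.zeros((T, 2))
for t in range(1, T):
    Y[t] = B @ Y[t - 1] + rng.standard_normal(2)

def var_aic(Y, p):
    """Fit a VAR(p) by OLS and return ln|Sigma_hat| + 2k/T."""
    T_eff = len(Y) - p
    # Stack lags 1..p of both variables as regressors.
    X = np.hstack([Y[p - i - 1:len(Y) - i - 1] for i in range(p)])
    Z = Y[p:]
    Bhat = np.linalg.lstsq(X, Z, rcond=None)[0]
    resid = Z - X @ Bhat
    Sigma = resid.T @ resid / T_eff
    k = Bhat.size                     # number of estimated parameters
    return np.log(np.linalg.det(Sigma)) + 2 * k / T_eff

aics = {p: var_aic(Y, p) for p in (1, 2, 3)}   # smaller is better
```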
5 Runkle (1987) emphasizes the importance of including and properly calculating confidence
intervals when reporting IRFs and FEVDs.
10.3 How Many Lags to Include? 271
The asterisks indicate which model is preferred by each selection criterion. Both
the AIC and HQIC indicate that a two-lag VAR is preferred; the SBIC prefers a
VAR with one lag.
How many lags should we include in our comparisons? i.e. How high should we
set maxlag(p)? We should pick p to be large enough that there are no gains from
increasing it further. Brandt and Williams (2007) suggest setting p no larger than 5
for yearly data, 8 for quarterly, or 15 for monthly data.
Also, consider adding lags if the selection criteria choose the model with the
maxlag. In our example above, if the information criteria had indicated that eight
lags were best, then maybe nine would have been even better. In this case, we would
have added a couple more lags, and re-estimated the information criteria.
What if the different ICs suggest different lag lengths? This is an unfortunate but
extremely common problem, and there is no universally accepted answer. Some
researchers go with the lag length preferred by the majority of the selection criteria.
Others choose not to choose, reporting results from all VARs that are
considered “best” by each selection criterion.
Braun and Mittnik (1993) investigate the effect of various types of misspecifica-
tions on VARs. These misspecifications include ignoring MA errors and selecting
the wrong lag lengths. Since a finite MA process can be approximated by a
sufficiently long AR(p) process, they argue for erring on the side of too many lags.
Improper lag selection seriously affects the variance decompositions. A far greater
problem is neglecting to include an important variable.
Lütkepohl (1985) studies two-variable and three-variable VARs and finds that the
SBIC and HQIC perform best. However, Gonzalo and Pitarakis (2002) find that the
AIC is, by far, the best metric in large dimensional models. A four-variable VAR
has many more parameters than a three-variable VAR, so if your VAR has many
equations, then the AIC seems to be the best tool for selecting lag lengths. There
is no “last word” on this topic, and research is still underway.6 Still, few research
papers have been rejected for using the AIC for lag selection.
The simplest vector autoregression has two variables and one lag:
We could have added a constant, but we’re trying to keep things as simple as
possible.
How might this be expressed in vector/matrix notation? Define the vector of
variables

Y_t = [X_t, Z_t]′ ,
6 As examples, Hatemi-J (2003) proposes a lag-selection criterion that simply averages the SIC and
HQIC; the theory is that this is a straightforward approach that is good enough for general use.
Other approaches are tailored to more specific uses. Schorfheide (2005), for example, proposes a
final prediction error approach to lag selection when the goal is multi-step forecasting.
There is no requirement that the number of lags in a VAR needs to be constant across
equations or variables. Ignoring irrelevant parameters means that the remaining parameters can
be estimated more efficiently. Precisely estimated coefficients produce better IRFs and better
forecasts. Hsiao (1979, 1981) developed a fully asymmetric VAR model to specifically address
Sims’ money/income causality question. Keating (2000) explores this concept within a class of
VARs where each variable takes on different lags, but the lag structure is the same across equations.
The AIC is known to select the correct symmetric lag lengths better than the other commonly used
alternatives. Keating develops an alternative to the AIC for asymmetric lag-length selection. In a
Monte Carlo simulation, Ozcicek and McMillin (1999) examine the small-sample performance
of the standard IC and Keating’s versions of these. They find that the KAIC more frequently
identified the correct number of asymmetric lags than did the other information criteria, and had
good forecasting properties. Ozcicek and McMillin (1999) conclude that the AIC and KAIC should
be used over SIC when forecasting; their results are reversed when the IRFs are the focus of the
study.
Ivanov et al. (2005) review much of the literature and conduct extensive Monte Carlo tests of
lag-order’s effect on IRFs. Their findings are sensitive to the observation frequency, with monthly
data preferring AIC and quarterly data preferring SBIC and HQIC.
The most obvious conclusion that we can draw from all this is that the field has not yet reached
a conclusion. But if we were to offer advice, it would be the following: if you wish to forecast, use
AIC or KAIC. If you wish to construct IRFs, then BIC or SICs are preferred. IRFs will fit the data
very well when they are fit with lots of lags.
10.4 Expressing VARs in Matrix Form 273
or more concisely as
Y_t = βY_{t−1} + ε_t .    (10.3)
Expressed this way, we can see where vector autoregressions get their name.
Ignoring the bold font, this is an ordinary AR process. The only difference is
that the variable being examined is now a collection of other variables, i.e. it is a
vector.
What about more complicated models? Ones with more variables, or more lags?
A two-variable two-lag VAR such as:
X_t = β_{1,1} X_{t−1} + β_{1,2} Z_{t−1} + β_{1,3} X_{t−2} + β_{1,4} Z_{t−2} + ε_{1,t}    (10.4)
Z_t = β_{2,1} X_{t−1} + β_{2,2} Z_{t−1} + β_{2,3} X_{t−2} + β_{2,4} Z_{t−2} + ε_{2,t}    (10.5)
or more concisely as
or more concisely as
Notice that the matrix expression of a two-variable two-lag VAR (10.7) is the
same as that of a three-variable two-lag VAR (10.8). For this reason, we can often
ignore the number of variables in a VAR, and just think of it as a two-variable model.
Many of the same old issues arise, even in this new context. We still want to
know whether the estimated coefficient yields a stationary path for Yt , or whether it
is explosive. We’d still like to be able to plot out the impulse response functions. We
can also learn some new things. We can see how one variable affects other variables.
Let

Y_t = [X_t, Z_t, X_{t−1}, Z_{t−1}]′
and

e_t = [ε_{1,t}, ε_{2,t}, 0, 0]′ .

Y_t = βY_{t−1} + e_t    (10.9)

is equivalent to

⎡ X_t     ⎤   ⎡ β_{1,1} β_{1,2} β_{1,3} β_{1,4} ⎤ ⎡ X_{t−1} ⎤   ⎡ ε_{1,t} ⎤
⎢ Z_t     ⎥ = ⎢ β_{2,1} β_{2,2} β_{2,3} β_{2,4} ⎥ ⎢ Z_{t−1} ⎥ + ⎢ ε_{2,t} ⎥    (10.10)
⎢ X_{t−1} ⎥   ⎢ 1       0       0       0       ⎥ ⎢ X_{t−2} ⎥   ⎢ 0       ⎥
⎣ Z_{t−1} ⎦   ⎣ 0       1       0       0       ⎦ ⎣ Z_{t−2} ⎦   ⎣ 0       ⎦
In general, the companion matrix for an n-variable VAR(p) is the np×np matrix:7

    ⎡ β_1 β_2 … β_{p−1} β_p ⎤
    ⎢ I   0   …  0      0   ⎥
β = ⎢ 0   I   …  0      0   ⎥ .    (10.11)
    ⎢ ⋮   ⋮   ⋱  ⋮      ⋮   ⎥
    ⎣ 0   0   …  I      0   ⎦
With the appropriately defined companion matrix, any VAR(p) can be writ-
ten as a VAR(1). Thus we will often restrict our attention to simple VAR(1)s,
knowing that what applies there also applies to more complicated VAR(p)s as
well. The companion matrix will also prove useful in examining the stability of
the VAR.
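The companion-form trick is easy to verify numerically. The sketch below builds the companion matrix of a hypothetical two-variable VAR(2) in Python and checks that one step of the stacked VAR(1) reproduces the original two-lag recursion.

```python
import numpy as np

n, p = 2, 2
B1 = np.array([[0.5, 0.1], [0.2, 0.4]])   # hypothetical first-lag coefficients
B2 = np.array([[0.1, 0.0], [0.0, 0.1]])   # hypothetical second-lag coefficients

# Companion matrix as in Eq. (10.11).
companion = np.zeros((n * p, n * p))
companion[:n, :n] = B1
companion[:n, n:] = B2
companion[n:, :n] = np.eye(n)             # identity block shifts Y_{t-1} down

y_lag1 = np.array([1.0, 2.0])             # Y_{t-1}
y_lag2 = np.array([0.5, -1.0])            # Y_{t-2}

direct = B1 @ y_lag1 + B2 @ y_lag2        # the VAR(2), computed directly
stacked = companion @ np.concatenate([y_lag1, y_lag2])
```

The first two entries of `stacked` equal the direct VAR(2) computation, and the last two simply carry Y_{t−1} forward.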
10.5 Stability
10.5.1 Method 1
Recall the univariate AR(1) process

Y_t = βY_{t−1} + ε_t ,

and Y_t is stable if

|β| < 1.

If β is greater than one, then Y_t grows without limit; if it is less than one, shocks
die out and Y_t decays back toward zero; and if it is equal to one, it is a
non-stationary random walk process.
How does this generalize to the vector-valued case?
Let’s begin with a VAR in companion form, ignoring the error term:
Yt = βYt−1 . (10.12)
The system reaches a steady state if, as Y feeds into itself through β, it does not
grow without bound. Y_t is a vector. Any matrix maps a vector into another vector;
alternatively, thinking of the matrix as a change of basis, it stretches space so that
one vector becomes another. Here, β maps Y_{t−1} into Y_t.
8 We use “stability” and “stationarity” interchangeably. They are not the same thing. However,
stability implies stationarity if the error process is stationary. Stability applies to the coefficients
affecting the mean; stationarity is a broader concept that also demands that the autocovariances
and the error variances do not change over time. Given that we do not address GARCH errors in
this chapter, stability is enough to ensure stationarity.
Every square matrix (like the companion matrix β) has at least one eigenvalue
and associated eigenvector (if we allow for complex numbers). What are these
eigen-things? Without getting too deeply into matrix algebra, a matrix β has an
eigenvector v and associated eigenvalue λ if

λv = βv,    (10.13)

i.e. if there is a vector v that keeps its direction and only changes its magnitude
when transformed by the matrix β. Thus, if the original vector was, say, v = [1, 5]′,
then it might become [2, 10]′. In other words, the relationships between the
components stay proportional. As we iterate, constantly feeding Y_t through β, the
vector cannot get bigger and bigger if the system is to be stable. That is, [1, 5]′
cannot become [2, 10]′ and then [4, 20]′ and so forth. Rather, the vector needs to
shrink. In matrix-speak, the eigenvalues must be less than one in magnitude, so that
each iteration is a fraction of the previous one. Since these eigenvalues might be
complex numbers, we say that they must have a length less than one when mapped
on the complex plane, or that they must lie inside the unit circle, or, put yet another
way, that their modulus must be less than one.
Consider the two-variable VAR(1):
Xt = 0.50Xt−1 + 0.60Yt−1 + εx,t
Yt = −0.50Xt−1 + 0.50Yt−1 + εy,t.
The eigenvalues of its coefficient matrix are the complex pair 0.500 ± 0.547i. In our example, the modulus of each eigenvalue is
√(0.500² + 0.547²) ≈ 0.742 < 1,
so our estimated VAR is stable. When graphed on the unit circle, such as in Fig. 10.3, we can see that their length (or “modulus”) is less than one.
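The book performs these stability checks in Stata. As a rough cross-check, the same eigenvalue arithmetic can be sketched in Python; the helper `eig2` below is our own, not a library routine:

```python
import cmath

def eig2(b11, b12, b21, b22):
    """Eigenvalues of a 2x2 matrix: roots of lambda^2 - tr*lambda + det = 0."""
    tr = b11 + b22
    det = b11 * b22 - b12 * b21
    disc = cmath.sqrt(tr * tr - 4 * det)
    return (tr + disc) / 2, (tr - disc) / 2

# The coefficient matrix from the example in the text
lam1, lam2 = eig2(0.50, 0.60, -0.50, 0.50)   # 0.500 +/- 0.547i
moduli = [abs(lam1), abs(lam2)]              # both approximately 0.742
stable = all(m < 1 for m in moduli)          # inside the unit circle
```

Both moduli come out to about 0.742, matching the figure.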
9 Equivalently, the length of a complex number is equal to the square root of the product of the number and its complex conjugate: √((r + ci)(r − ci)) = √(r² + rci − rci − c²i²) = √(r² + c²).
Fig. 10.3 The eigenvalues 0.500 ± 0.547i graphed on the complex plane. Both points lie inside the unit circle and are labeled with their moduli: since 0.500² + 0.547² = 0.742², each modulus is 0.742.
10.5.2 Method 2
Let's begin again with the univariate AR(1) process,
Yt = βYt−1 + εt,
and express this equation using the lag operator L. (Recall, LYt = Yt−1.)
Yt = βLYt + εt
Yt − βLYt = εt
(1 − βL)Yt = εt
From this, we can construct what is called the “characteristic equation” by replacing
L with some variable – let's call it z – and setting the equation equal to zero:
1 − βz = 0 (10.16)
Now we solve for the roots of the characteristic equation, which we denote z∗. In
this method, stability requires that |z∗| > 1, so
|z∗| = |1/β| > 1
|β| < 1.
Since z∗ and β are reciprocals, the requirement that |z∗| > 1 is equivalent to |β| < 1.
What if we had an AR(2) process? Then
Yt = β1Yt−1 + β2Yt−2 + εt
Yt − β1Yt−1 − β2Yt−2 = εt
Yt − β1LYt − β2L²Yt = εt
(1 − β1L − β2L²)Yt = εt,
so the characteristic equation is
1 − β1z − β2z² = 0.
Since this is a second-degree polynomial in z, we can find the roots by using the
quadratic formula:
z∗ = [β1 ± √(β1² + 4β2)] / (−2β2).
If these roots are greater than one in absolute value, then the equation is stable. In
the case that the roots are complex, the magnitude (a.k.a. modulus, length, size) of
each root must be greater than one when measured on the complex plane.
For an AR(p) process, the characteristic equation generalizes to
1 − β1z − β2z² − · · · − βpz^p = 0,
and stability requires that all p of its roots lie outside the unit circle.
We’re now ready to generalize from the univariate case to the multivariate or
vector-valued case. After all, this is a chapter on vector autoregressions.
Consider our simple two-variable VAR(1) model,
Yt = βYt−1 + εt.
As before, we apply the lag operator and move the lagged terms of Y to the left-hand
side:
Yt = βYt−1 + εt
Yt − βYt−1 = εt
Yt − βLYt = εt
(I − βL)Yt = εt,
as we did before in Method 1. Is the estimated VAR(1) stable? Let's solve for the
roots of the characteristic polynomial:
0 = det(I − β̂z)
0 = det( [1 0; 0 1] − [0.50 0.60; −0.50 0.50] z )
0 = det( [1 0; 0 1] − [0.50z 0.60z; −0.50z 0.50z] )
0 = det( [1 − 0.50z  −0.60z; 0.50z  1 − 0.50z] ),
which simplifies to
0.55z² − z + 1 = 0.
The quadratic formula gives the complex roots
z∗ = 0.90909 . . . ± 0.9959i.
These two roots each have a length equal to (0.90909² + 0.9959²)^0.5 = 1.3484. Since
their lengths are greater than one, Method 2 agrees with Method 1 that the estimated
VAR is stable.
The connection between the two methods is that
1/1.3484 ≈ 0.742;
that is, the eigenvalues in Method 1 are the inverses of the roots from Method 2.
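A quick numerical sketch (again in Python rather than Stata) confirms both the roots and the inverse relationship:

```python
import cmath

# Characteristic equation from Method 2: 0.55 z^2 - z + 1 = 0
a, b, c = 0.55, -1.0, 1.0
disc = cmath.sqrt(b * b - 4 * a * c)
z1 = (-b + disc) / (2 * a)   # 0.90909... + 0.9959i
z2 = (-b - disc) / (2 * a)

mod = abs(z1)                # about 1.3484: outside the unit circle, so stable
inv = 1 / mod                # about 0.742: the eigenvalue modulus from Method 1
```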
Exercises
1. Suppose you are estimating a two-variable VAR(1) model. For each estimated
coefficient matrix given below, determine whether the model is stable. Also,
assuming starting values of Y0 = X0 = 1, use your favorite software to calculate
the next 10 values of Y and X. Graph these to verify whether the estimated model
is stable.
(a) β̂ = [0.1 −0.2; −0.3 0.4]
(b) β̂ = [0.2 0.4; 0.4 0.5]
(c) β̂ = [2 4; 4 5]
2. Suppose you are estimating a three-variable VAR(1) model. For each estimated
coefficient matrix given below, determine whether the model is stable. Also,
assuming starting values of Y0 = X0 = Z0 = 1, use your favorite software
to calculate the next 10 values of X, Y, and Z. Graph these to verify whether the
estimated model is stable.
(a) β̂ = [−0.2 0.4 0.35; 0.3 −0.2 0.15; 0.2 −0.3 0.4]
(b) β̂ = [−0.1 0.3 0.4; 0.9 −0.2 0.1; 0.2 0.3 0.8]
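The exercises say “use your favorite software”; one possible Python sketch of the requested iteration (the function name is our own) is:

```python
def simulate_var1(beta, y0, steps):
    """Iterate Y_t = beta * Y_{t-1}, suppressing the error term."""
    path = [list(y0)]
    for _ in range(steps):
        prev = path[-1]
        path.append([sum(b * p for b, p in zip(row, prev)) for row in beta])
    return path

# Exercise 1(a): eigenvalues inside the unit circle, so the path dies out
path_a = simulate_var1([[0.1, -0.2], [-0.3, 0.4]], [1.0, 1.0], 10)
# Exercise 1(c): an explosive matrix, so the path blows up
path_c = simulate_var1([[2.0, 4.0], [4.0, 5.0]], [1.0, 1.0], 10)
```

Graphing `path_a` shows convergence toward zero, while `path_c` diverges.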
Fortunately, Stata has a built-in command to test for stability. After estimating the
VAR, issue the following command:
. varstable
Stata estimates the eigenvalues from Method 1, reports their moduli, and even
reports whether the estimated VAR is stable. Stata can also graph the eigenvalues
on the complex unit circle (such as in Fig. 10.3) by typing:
. varstable, graph
The previous examples have focused on VARs with no constants. That was just for
the sake of simplicity. But it also meant that the value of the series was centered
around zero. There is no reason to be quite so limiting in real life. It is easy to
change the mean of the series to be nonzero, simply by adding a constant.
Consider the following VAR(1):
Xt = 100 + 0.20Xt−1 − 0.40Yt−1 + εx,t
Yt = 120 − 0.30Xt−1 − 0.10Yt−1 + εy,t.
To what values of X and Y does this process converge? We can solve this by hand
by taking the unconditional expectation of both equations and solving for E(X) and
E(Y). Equivalently, drop the random errors, set Xt = Xt−1 = X∗ and Yt = Yt−1 =
Y∗, and solve for X∗ and Y∗:
X∗ = 100 + 0.20X∗ − 0.40Y∗
Y∗ = 120 − 0.30X∗ − 0.10Y∗.
Grouping terms:
0.80X∗ = 100 − 0.40Y∗
1.10Y∗ = 120 − 0.30X∗,
or
X∗ = (1/0.80)(100 − 0.40Y∗) = 125 − 0.50Y∗ (10.20)
Y∗ = (1/1.10)(120 − 0.30X∗). (10.21)
Substituting (10.21) into (10.20) yields
X∗ = 125 − 0.50 [ (1/1.10)(120 − 0.30X∗) ],
so
X∗ ≈ 81.579.
Substituting this back into (10.21) gives
Y∗ ≈ 86.842.
The mean of the process is a complicated function of the constant and the other
coefficients. If we express this problem in matrix algebra form, then it doesn’t look
so complicated after all. Consider the general VAR(p) process:
Yt = β0 + β1Yt−1 + εt, (10.22)
where β1 is the companion matrix and β0 is a vector of constants. To solve for the
long-run mean of the process, set Yt = Yt−1 = Y∗, set εt = 0, and solve for Y∗, the
steady state of the system:
Y∗ = β0 + β1Y∗
Y∗ − β1Y∗ = β0
[I − β1] Y∗ = β0
Y∗ = [I − β1]⁻¹ β0.
For Y to be stable, the matrix [I − β1 ] must be invertible. This does not depend on
β0 . Adding constants affects the mean of the process, but it does not affect whether
the process is stable.
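For the numerical example above, the steady state Y∗ = [I − β1]⁻¹β0 can be checked with a small sketch (a 2×2 solve via Cramer's rule, with the entries read off Eqs. (10.20)–(10.21); the helper is our own):

```python
def solve2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - b[0] * a[1][0]) / det
    return x0, x1

# [I - beta1] Y* = beta0
i_minus_beta = [[0.80, 0.40], [0.30, 1.10]]
beta0 = [100.0, 120.0]
x_star, y_star = solve2(i_minus_beta, beta0)   # about (81.579, 86.842)
```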
10.7 Expressing a VAR as a VMA Process
Consider again the VAR(1) process
Yt = βYt−1 + εt.
Substituting recursively for lagged values of Y,
Yt = β(βYt−2 + εt−1) + εt
= β²Yt−2 + βεt−1 + εt
= · · ·
= εt + βεt−1 + β²εt−2 + · · · = Σ∞j=0 βʲεt−j,
where β⁰ = I.
In the MA representation, Yt is equal to a weighted average of all its previous
shocks. More recent shocks propagate through β and impact Yt more forcefully;
more distant shocks have fainter effects, having cycled through β several times.
The weights (βʲ) can be thought of as the values of the impulse response to a
shock (εt−j).
10.8 Impulse Response Functions
As we said earlier in this chapter, “When it comes to IRFs, the devil is truly in the
details.” We're now ready for some details. Let's calculate the first several values of
an IRF by hand.
Suppose we estimated a simple two-variable VAR(1):
Xt = 0.40Xt−1 + 0.10Yt−1 + εx,t
Yt = 0.20Xt−1 − 0.10Yt−1 + εy,t.
Suppose that X0 and Y0 were equal to, say, zero. Let's see what happens to Xt
and Yt if there is a one-time one-unit shock to εx,1, keeping all other εs equal to
zero.
In period 1:
X̂1 = 0.40(0) + 0.10(0) + 1 = 1
Ŷ1 = 0.20(0) − 0.10(0) + 0 = 0.
In period 2:
X̂2 = 0.40(1) + 0.10(0) + 0 = 0.40
Ŷ2 = 0.20(1) − 0.10(0) + 0 = 0.20.
In period 3:
X̂3 = 0.40(0.40) + 0.10(0.20) + 0 = 0.18
Ŷ3 = 0.20(0.40) − 0.10(0.20) + 0 = 0.06,
and in period 4:
X̂4 = 0.40(0.18) + 0.10(0.06) + 0 = 0.078
Ŷ4 = 0.20(0.18) − 0.10(0.06) + 0 = 0.03.
Fig. 10.4 The IRFs of the estimated VAR: the responses of Xt and Yt, out to ten periods, to one-unit impulses in X and in Y.
Let's return to the univariate AR(1) case for a second, with β = 0.50, for example:
Yt = 0.50Yt−1 + εt = Σ∞j=0 0.50ʲ εt−j.
Now suppose that Y = 0 and ε = 0 all along the infinite past up to period 0. Then,
in period 0, ε0 receives a one-time shock equal to one, and then reverts back to zero.
What is the impact of this on Yt? That is, what is the IRF?
Ŷ0 = 1
Ŷ1 = 0.5
Ŷ2 = 0.25
Ŷ3 = 0.125,
and so on.
Thus, at least for the univariate case, the slope coefficient of the AR(1) process
provides the exponentially decreasing weights of the MA representation, and is also
equal to the IRF. We will see shortly that this generalizes to the vector case.
A stationary VAR(1) process,
Yt = βYt−1 + εt,
can be inverted and expressed as the VMA(∞) process
Yt = Σ∞j=0 βʲ εt−j. (10.25)
Equation (10.25) shows that the values of Yt are a weighted average of all its
previous shocks. Last period’s shock propagates through β and impacts Yt more
forcefully; a shock two periods ago cycles through β twice. Thus, a shock two
periods ago has a proportionally (β²) smaller effect. For this reason, the weights
(βʲ) can be thought of as the values of the impulse response to a shock to εt−j. The
sequence of βʲs are the IRFs. We will now verify this with an example.
The IRF of a one-unit shock to εx as felt by Xt was 1, 0.40, 0.18, and 0.078. As felt
by Yt, it was 0, 0.20, 0.06, and 0.03.
Now consider the left column entries in the following powers of β̂:
β̂ = [0.40 0.10; 0.20 −0.10]
β̂² = [0.40 0.10; 0.20 −0.10] [0.40 0.10; 0.20 −0.10] = [0.18 0.03; 0.06 0.03]
β̂³ = [0.078 0.015; 0.03 0.003].
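A short Python sketch (our own helper, outside of Stata) reproduces these matrix powers and, with them, the IRF values:

```python
def matmul2(a, b):
    """Multiply two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

beta = [[0.40, 0.10], [0.20, -0.10]]
powers = [[[1.0, 0.0], [0.0, 1.0]]]          # beta^0 = I
for _ in range(3):
    powers.append(matmul2(powers[-1], beta))

# Left columns of beta^j: the IRF of a one-unit shock to x
irf_on_x = [round(p[0][0], 6) for p in powers]   # 1, 0.40, 0.18, 0.078
irf_on_y = [round(p[1][0], 6) for p in powers]   # 0, 0.20, 0.06, 0.03
```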
A similar exercise with a different estimated matrix, β̂ = [0.50 0.20; 0.30 0.15],
shows that a one-unit shock to X will result in the following response in
X: 1, 0.5, 0.31, 0.194, and 0.12145. The same shock will result in the following
response in Y: 0, 0.3, 0.195, 0.12225, and 0.076537. Likewise, a one-unit shock to Y
will result in the following response in X: 0, 0.2, 0.13, 0.0815, and 0.051025. The
same shock to Y has the following response in Y: 1, 0.15, 0.0825, 0.051375, and
0.0321565.
Setting the initial values of X, Y, and Z equal to zero, the IRF from a one-unit shock
to εx,t at period t = 1 is calculated as:
X̂1 = 0.25X0 + 0.20Y0 + 0.15Z0 + 1 = 0.25 (0) + 0.20 (0) + 0.15 (0) + 1 = 1
Ŷ1 = 0.15X0 + 0.30Y0 + 0.10Z0 + 0 = 0.15 (0) + 0.30 (0) + 0.10 (0) + 0 = 0
Ẑ1 = 0.20X0 + 0.25Y0 + 0.35Z0 + 0 = 0.20 (0) + 0.25 (0) + 0.35 (0) + 0 = 0.
At t = 2,
X̂2 = 0.25X1 + 0.20Y1 + 0.15Z1 + 0 = 0.25 (1) + 0.20 (0) + 0.15 (0) + 0 = 0.25
Ŷ2 = 0.15X1 + 0.30Y1 + 0.10Z1 + 0 = 0.15 (1) + 0.30 (0) + 0.10 (0) + 0 = 0.15
Ẑ2 = 0.20X1 + 0.25Y1 + 0.35Z1 + 0 = 0.20 (1) + 0.25 (0) + 0.35 (0) + 0 = 0.20.
At t = 3,
X̂3 = 0.25X2 + 0.20Y2 + 0.15Z2 + 0 = 0.25 (0.25) + 0.20 (0.15) + 0.15 (0.20) + 0
= 0.1225
Ŷ3 = 0.15X2 + 0.30Y2 + 0.10Z2 + 0 = 0.15 (0.25) + 0.30 (0.15) + 0.10 (0.20) + 0
= 0.1025
Ẑ3 = 0.20X2 + 0.25Y2 + 0.35Z2 + 0 = 0.20 (0.25) + 0.25 (0.15) + 0.35 (0.20) + 0
= 0.1575.
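The same calculation can be sketched programmatically; the coefficient matrix below is copied from the equations above:

```python
beta = [[0.25, 0.20, 0.15],
        [0.15, 0.30, 0.10],
        [0.20, 0.25, 0.35]]

def iterate(y):
    """One step of Y_t = beta * Y_{t-1}, with no further shocks."""
    return [sum(b * v for b, v in zip(row, y)) for row in beta]

y1 = [1.0, 0.0, 0.0]     # period 1: the unit shock to x
y2 = iterate(y1)         # [0.25, 0.15, 0.20]
y3 = iterate(y2)         # [0.1225, 0.1025, 0.1575]
```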
Let's set X and Y equal to zero for the first two periods (t = 0 and t = 1). Setting
εx,t = 1 in period t = 2 only, the IRF is calculated as follows. In period t = 2,
X̂2 = 0.50 (0) + 0.20 (0) + 0.10 (0) + 0.10 (0) + 1 = 1
Ŷ2 = 0.30 (0) + 0.15 (0) + 0.20 (0) − 0.10 (0) + 0 = 0.
In period t = 3,
X̂3 = 0.50 (1) + 0.20 (0) + 0.10 (0) + 0.10 (0) + 0 = 0.50
Ŷ3 = 0.30 (1) + 0.15 (0) + 0.20 (0) − 0.10 (0) + 0 = 0.30.
In period t = 4,
X̂4 = 0.50 (0.50) + 0.20 (0.30) + 0.10 (1) + 0.10 (0) + 0 = 0.41
Ŷ4 = 0.30 (0.50) + 0.15 (0.30) + 0.20 (1) − 0.10 (0) + 0 = 0.395,
and
X̂5 = 0.50 (0.41) + 0.20 (0.395) + 0.10 (0.50) + 0.10 (0.30) + 0 = 0.364
Ŷ5 = 0.30 (0.41) + 0.15 (0.395) + 0.20 (0.50) − 0.10 (0.30) + 0 = 0.25225.
Now let's do this using matrices in Stata. The only wrinkle in this case is that the
coefficient matrix is not square, so we can't square or cube it. However, we can work
with the companion matrix,
β = [0.50 0.20 0.10 0.10; 0.30 0.15 0.20 −0.10; 1 0 0 0; 0 1 0 0].
Thus, in Stata, we can compute successive powers of this companion matrix.
The IRF of X from a shock to itself is given in the top left entry of each matrix
power. The response of Y is given in the second entry of the leftmost column.
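Outside of Stata, the same companion-matrix iteration can be sketched as follows (helper function and variable names are our own):

```python
companion = [[0.50, 0.20, 0.10,  0.10],
             [0.30, 0.15, 0.20, -0.10],
             [1.0,  0.0,  0.0,   0.0],
             [0.0,  1.0,  0.0,   0.0]]

def step(state):
    """One step of the companion system; state = (X_t, Y_t, X_{t-1}, Y_{t-1})."""
    return [sum(c * v for c, v in zip(row, state)) for row in companion]

state = [1.0, 0.0, 0.0, 0.0]          # the unit shock to X
x_irf, y_irf = [state[0]], [state[1]]
for _ in range(3):
    state = step(state)
    x_irf.append(state[0])            # 1, 0.50, 0.41, 0.364
    y_irf.append(state[1])            # 0, 0.30, 0.395, 0.25225
```

These match the by-hand calculations above.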
10.9 Forecasting
Suppose we had data on X and Y covering 100 periods, and we wished to forecast
X and Y for the subsequent 10 periods. Suppose we estimated a 1-lag VAR of X and
Y. Does this mean that we can only predict out to one period? No! And for the same
reason that a VAR(1) still gives us an IRF out to many periods. The trick is to iterate
on the function. To forecast for period 101, we must rely on the VAR model that was
estimated for periods 1–100. The VAR’s coefficients would tell us how to map from
period 100’s values to period 101. But to forecast further out, we must rely on our
own forecasts. That is, to forecast period 102, we will need to rely on our forecast of
period 101. To forecast period 103, we will need to rely on our forecasts of periods
101 and 102, and so forth.
And just in case you were wondering: what if we had a 4-lag VAR model? Could
we estimate out four periods at a time? Nope. We would still need to do this one
period at a time.
Let’s see how this works for a two-variable VAR(2) process.
Suppose we had the following data,
Now suppose that we wanted to forecast X and Y for the next five periods.
We have data on X100 , X99 , Y100 , and Y99 , so we’re on solid footing forecasting
one period ahead. To be clear, all of the expectations below are conditional on data
up to period 100. We forecast as follows:
What about forecasting two periods ahead, to period 102? We only have data for
X and Y up to period 100. We can plug these values in. But what about the values
for period 101? We can plug in the expected value, the forecast from the previous
step.
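The iterate-on-your-own-forecasts logic can be sketched for a hypothetical VAR(1); the coefficients and "period-100" values below are made up for illustration, and a VAR(2) works the same way via its companion form:

```python
def forecast_var1(beta, y_last, horizon):
    """Multi-step forecasts from a VAR(1): each step feeds the previous
    forecast back in, with the error set to its expected value of zero."""
    forecasts = []
    y = list(y_last)
    for _ in range(horizon):
        y = [sum(b * v for b, v in zip(row, y)) for row in beta]
        forecasts.append(y)
    return forecasts

beta_hat = [[0.5, 0.2], [0.1, 0.4]]          # hypothetical estimates
f = forecast_var1(beta_hat, [2.0, 1.0], 5)   # hypothetical period-100 data
```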
The fcast command creates several new variables, each with a given prefix. You can
choose whichever prefix you want; I opted to use “E,” since we're calculating
expected values. The option nose might seem fishy, but it just instructs Stata not to
calculate any standard errors (no SEs). Finally, the option step(5) tells Stata that
we want to calculate the forecast five periods out. This is often called the “forecast
window” or “forecast horizon.”
The results of the fcast command are:
and so forth.
This method works, but it is clunky. It creates tons of new columns of data. (More
elegant coding goes a long way here.)
10.10 Granger Causality
One of the more exciting things the time-series econometrician can explore is
whether one variable “causes” another one. Strictly speaking, all we can explore
is correlation, and, as every student of statistics knows, correlation is not the
same thing as causation. Still, if changes in X tend to predate changes in Y,
then—at least, observationally speaking—X can be thought to cause Y. After all,
if X actually does cause Y, then we should see changes in X predate changes in Y.
But X predating, or even predicting, Y is observationally indistinguishable from a
correlation between Y and lagged X. Time-series econometricians are playing the
role of a smoke detector: where there is smoke, there isn't always fire, but fire sure
is a lot more likely.
Granger causality (1969) is a necessary condition for existential causality, but it
is not a sufficient condition. It could be the case that Z causes X, and, after a much
longer lag, also causes Y. We would see a correlation between X and Y; we would
say that X Granger-causes Y, even though X didn’t really cause Y, Z did.
Econometricians don’t want to be accused of mistaking correlation for causation,
so they speak of “Granger causality” rather than strict “causality.”
For all the associated jargon, and the knighthood and Nobel Prize granted to Sir
Clive Granger, testing for Granger causality is really straightforward. A variable X
is said to Granger-cause Y if accounting for earlier values of X helps us predict Y
better than we could without it.
The concept of Granger causality is usually introduced to students only in
the context of vector autoregressions, but the definition can be applied to simple
autoregressions with exogenous variables. For example, in the following AR(2)
model:
Xt = β1,1 Xt−1 + β1,2 Yt−1 + β1,3 Xt−2 + β1,4 Yt−2 + εx,t (10.30)
Yt = β2,1 Xt−1 + β2,2 Yt−1 + β2,3 Xt−2 + β2,4 Yt−2 + εy,t (10.31)
we would say that X Granger-causes Y if β2,1 and β2,3 are jointly significantly
different from zero.10
Notice that testing whether X Granger-causes Y requires testing only the Y
equation; really, all we’re asking is whether lagged values of X are a statistically
significant predictor of Y , given that we already control for lagged values of Y .
Tests of Granger causality are sensitive to omitted variables. An omitted third
variable might make it seem as though X causes Y. In reality, perhaps Z causes X
more quickly than it causes Y. In this case, failing to take Z into account will make it
seem as though X causes Y, inducing a “false positive” result in the tests. Omitted
variables might also lead to “false negatives” (Lütkepohl 1982). For these reasons,
researchers should make sure they have their economic theories straight, including
all relevant variables in their analyses.
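Mechanically, the test is a joint F-test on the lagged-X coefficients in the Y equation. Here is a self-contained sketch with simulated data and made-up coefficients, using a hand-rolled OLS (in Stata you would simply run vargranger after var):

```python
import random

def ols_ssr(y, xcols):
    """OLS sum of squared residuals, via the normal equations."""
    k = len(xcols)
    xtx = [[sum(a * b for a, b in zip(xcols[i], xcols[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(a * b for a, b in zip(col, y)) for col in xcols]
    # Gaussian elimination with partial pivoting
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(xtx[r][i]))
        xtx[i], xtx[p] = xtx[p], xtx[i]
        xty[i], xty[p] = xty[p], xty[i]
        for r in range(i + 1, k):
            fac = xtx[r][i] / xtx[i][i]
            xtx[r] = [a - fac * b for a, b in zip(xtx[r], xtx[i])]
            xty[r] -= fac * xty[i]
    beta = [0.0] * k
    for i in reversed(range(k)):
        beta[i] = (xty[i] - sum(xtx[i][j] * beta[j]
                                for j in range(i + 1, k))) / xtx[i][i]
    fitted = [sum(beta[j] * xcols[j][t] for j in range(k))
              for t in range(len(y))]
    return sum((a - b) ** 2 for a, b in zip(y, fitted))

# Simulate a DGP in which X Granger-causes Y (coefficients are made up)
rng = random.Random(42)
n = 300
x, y = [0.0], [0.0]
for _ in range(n - 1):
    x.append(0.5 * x[-1] + rng.gauss(0, 1))
    y.append(0.5 * y[-1] + 0.8 * x[-2] + rng.gauss(0, 1))  # x[-2] is last period's X

# Restricted model: Y on its own lag; unrestricted model adds lagged X
dep, ylag, xlag = y[1:], y[:-1], x[:-1]
const = [1.0] * len(dep)
ssr_r = ols_ssr(dep, [const, ylag])
ssr_u = ols_ssr(dep, [const, ylag, xlag])
F = ((ssr_r - ssr_u) / 1) / (ssr_u / (len(dep) - 3))  # large F => X Granger-causes Y
```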
Example
Let's run a Granger causality test to see whether unemployment and inflation
Granger-cause each other. (We should note at the outset that this simple example
is only intended to illustrate the technique. We won't do any of the necessary pre-
estimation or post-estimation tests; we aren't verifying whether the variables are
integrated, cointegrated, etc.)
First we download and label the data:
After estimating the VAR, we employ Stata’s built-in Granger causality com-
mand:
10 Granger (1980) provides an interesting discussion on the philosophical nature and various
definitions of causality. In that paper, he also generalizes his own definition of causality to include
non-linear models, providing a broader operational definition of causality.
The output above indicates that unemployment does not Granger-cause inflation
at the 0.05 level (0.081 > 0.05), whereas inflation does Granger-cause unemploy-
ment (0.022 < 0.05).
Alternatively, we could estimate the joint significance by hand, replicating the
statistics above:
Chris Sims’ first application of Granger’s (1969) causality paper was his 1972
“Money, Income and Causality” paper. There, he tested whether changes in
the money supply cause fluctuations in GNP, or vice versa. Understanding the
relationship between these two variables was highly relevant to one of the central
economic debates of the time: Monetarism vs Keynesianism. The debate centered
over whether monetary policy could be effective in smoothing out the business
cycle, or whether business cycle fluctuations affected the money supply.
Sims' approach to Granger-causality differs from11—but is mathematically equiv-
alent12 to—the current practice as established by Granger (1969). Thus, rather
than replicating the paper closely using Sims' unique method, we will replicate the
main conclusions of Sims' paper using the standard Granger-causality tests.
Using quarterly data from 1947–1969, Sims tests whether changes in the money
supply cause changes in GNP. Both variables are measured in logarithms.
We begin by downloading the data, creating the time variables, and generating
logged versions of the monetary base (our estimate of the money supply) and of
GNP:
11 Sims argued that if X causes Y (and not vice versa), then this should be evident in zero-
coefficients in future values of X whenever regressing Y on past, present and future values of
X. Granger’s approach was that if X causes Y, then there should be non-zero coefficients when
regressing Y on past X and past Y. Ultimately, a string of research proved that the two approaches
are identical.
12 A sequence of papers by Hosoya (1977), Kohn (1981), Chamberlain (1982), and Florens and
Mouchart (1982) established the general conditions under which the two approaches are equivalent.
Next, we estimate the VAR. Following Sims, we include eight lags, a constant, a
linear time trend, and seasonal dummies as exogenous variables. Including exoge-
nous variables does not pose any problems. Even though the data are seasonally
adjusted at their source, Sims still includes quarterly dummy variables to capture
any remaining seasonality. (There was none.)
Our results confirm those in Sims (1972). That is, we conclude that the money
supply (monetary base) affects GNP, and GNP does not affect the money supply.13
One of the weaknesses in Sims’ (1972) article was that he did not consider
the effect of the interest rate. In his 1980a paper, Sims discovered that ignoring
an important third variable can change the Granger causality results. Sims added
an interest rate variable and discovered that the money supply no longer predicted
national income.
This was hardly the last word on the matter. Sims' original results are sensitive
to the choice of lag length and to the inclusion of a deterministic time trend.14
different lag lengths for each variable and each equation. He estimated such a fully asymmetric-
lag VAR model to explore Sims' money/income causality results. He found bi-directional Granger
causality for the US and Canada. Thornton and Batten (1985) replicated Sims’ paper on “Money,
Income and Causality,” using different lag lengths chosen by several different selection procedures.
Different models give different results, so you should not choose a lag length arbitrarily. Thornton
and Batten suggest relying on a lag selection method such as Akaike’s FPE. You should never
choose one simply because it gives you the results you were hoping for.
Exercises
1. Redo the replication of Sims (1972), using data from 1970 through 2016. Do
your conclusions change? If so, how?
Granger causality cannot detect indirect causality. For example, what if X causes Z
and then Z causes Y (but X does not cause Y directly)? Then, logically speaking,
changing X affects Y, and X causes Y. But the standard Granger causality test would
not be able to detect this.15 This is because a test for Granger causality occurs
only on the coefficients of one equation. If we were looking for the variables that
Granger-cause Y, then we would be restricted to testing the coefficients on X and
Z only in the equation defining Y. That is, Granger causality tests are, ultimately,
always single equation tests, whereas indirect causality operates through multiple
equations.
Since we cannot use Granger causality tests to identify indirect causality, what
are we to do if we want to know whether changes in X affect Y indirectly? This is
the strength of IRFs, and especially OIRFs, as these show how changes in X—and
only changes in X—affect all the other variables in the VAR system. This includes
all of the indirect effects (as they propagate through powers of the companion
matrix).
First, we generate the data.
In their review of the money/income causality literature, Stock and Watson (1989) report
that adding a deterministic time trend strengthens money’s estimated effect on output. Further,
the sample data can affect the results (e.g., Eichenbaum and Singleton 1986). (Structural breaks
can often be confused with unit roots, leading to inappropriate detrending in the money/income
regressions.) Stock and Watson’s (1989) main econometric finding is that the initial method of
detrending is responsible for the diverging results; detrending can cause the test statistics to have
non-standard distributions. Their main economic finding is that shocks to the money growth rate
that are greater than those predicted by the trend do have an effect on output.
Hall (1978) is also notable, reminding researchers that permanent rather than transitory income
matters, so simple regressions of consumption on past income conflate two different effects. Dickey
et al. (1991) revisited the money/income question, with interest rates added, in the context of
cointegration analysis.
15 This presumes that the lag-structure is correctly specified in the VAR.
Our simulated data comprise shocks to X, Y, and Z, drawn so that they are
uncorrelated with each other. Then we generated the values of X, Y and Z with
carefully selected coefficients such that: (1) X does not depend on Z or Y, (2) X, and
only X, affects Z, and (3) only Z affects Y.
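A sketch of such a data-generating process (the specific coefficients below are our own choices, but they respect the causal chain just described):

```python
import random

rng = random.Random(7)
n = 500
x, z, y = [0.0], [0.0], [0.0]
for t in range(1, n):
    x.append(0.5 * x[t - 1] + rng.gauss(0, 1))                   # X depends only on itself
    z.append(0.5 * z[t - 1] + 0.8 * x[t - 1] + rng.gauss(0, 1))  # only X feeds Z
    y.append(0.5 * y[t - 1] + 0.8 * z[t - 1] + rng.gauss(0, 1))  # only Z feeds Y

def corr(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    ca, cb = [v - ma for v in a], [v - mb for v in b]
    num = sum(p * q for p, q in zip(ca, cb))
    return num / (sum(v * v for v in ca) * sum(v * v for v in cb)) ** 0.5

r_x_to_z = corr(x[:-1], z[1:])   # X leads Z by one period: strong
r_x_to_y = corr(x[:-2], y[2:])   # X leads Y by two periods, via Z: also strong
```

The two-period-ahead correlation between X and Y is exactly the indirect channel that a single-equation Granger test misses but an OIRF picks up.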
Next, we estimate the VAR and run a Granger causality test:
Fig. 10.5 Orthogonalized IRFs from the simulated three-variable VAR, one panel per impulse–response pair of X, Y, and Z (95% confidence intervals shown).
The Granger causality table shows that: (1) nothing Granger-causes X, (2) X
Granger-causes Z, and (3) Z Granger-causes Y. But, logically speaking, isn’t it the
case that an independent change in X will result in a change in Y? Yes, indirectly
through X’s effect on Z, but ultimately X did logically cause Y. This can be thought
of as a failure of Granger causality tests, because when we tested, in the third panel
above, whether X caused Y, it was found not to (p = 0.5176 > 0.05). So X logically
causes Y, but X didn’t Granger-cause Y.
Does the indirect effect of X on Y show up in the OIRFs? Thankfully, it does,
as revealed in Fig. 10.5, where the top center panel indicates that a shock to X is
followed by an ultimate response in Y.
10.11 VAR Example: GNP and Unemployment
We'll try to pull all of this information together by working out a full-scale example
of a VAR analysis. We'll follow these steps: (1) download and format the data;
(2) test for stationarity; (3) select a lag length; (4) estimate the VAR; (5) verify
stability; (6) check the residuals for autocorrelation; and then examine Granger
causality, IRFs, and FEVDs.
Fig. 10.6 The US unemployment rate and real GNP (top panels), and the real GNP growth rate (%) and D.Unemployment (bottom panels).
We will estimate a VAR on the US’s unemployment rate and the GNP growth
rate. Of course, this example is not meant to be definitive, only illustrative of the
technique.
First, we download and format the data:
Second, we use a KPSS test to ensure that the data are stationary around their
levels.
Above, we see that the growth rate of GNP is stationary. On the other hand, the
unemployment rate is not stationary; thus, we took the first difference and found
that the change in the unemployment rate is stationary. Thus, we recast our VAR
to look at the relationship between the growth rate of GNP and the change in the
unemployment rate.
As our third step, we determine the number of lags by looking at the various
information criteria.
The Akaike Information Criterion and the Hannan Quinn Information Criterion
indicate that two lags are preferred; the Schwarz Bayesian Information Criterion
disagrees slightly, preferring a single lag. Such a disagreement between the three
statistics is not uncommon. We follow the AIC and estimate our VAR with two lags.
Fig. 10.7 The eigenvalues of the estimated VAR(2)'s companion matrix graphed on the complex unit circle; points labeled with their moduli (0.602, 0.602, 0.277, and 0.277).
As our fourth and fifth steps, we estimate the VAR(2), and calculate the
eigenvalues of the companion matrix to make sure our VAR is stable.
All of the roots have a length (modulus) less than one. Thus, we are inside the
unit circle (see Fig. 10.7) and our estimated VAR is stable.
As a final post-estimation check, we verify that our residuals do not exhibit any
left-over autocorrelation:
Is there any Granger causality? The output below reveals that both variables
Granger-cause each other.
Finally, we examine the graphs of the IRFs and FEVDs (see Figs. 10.8 and 10.9)
to better understand the dynamics between the two variables. The IRF indicates
that a shock to D.Unemp dampens gently to zero by the third or fourth quarter. The
same shock to D.Unemp decreases the GNP growth rate for a quarter, before GNPgr
returns to its baseline level. Shocks to the GNP growth rate have a much more muted
effect. They decrease the unemployment growth rate only slightly; a positive shock
to the GNP growth rate also tends to dampen out and reaches its baseline within one
to three periods.
The first column of Fig. 10.9 shows how much of the forecast error variance
in D.Unemp is due to D.Unemp, and to GNPgr. The top left panel shows that
initially 100% of the variance in D.Unemp is due to earlier shocks to D.Unemp. This
decreases slightly to approximately 80% by periods 3 or 4. The bottom left panel
shows that, initially, very little of the variation in D.Unemp is due to shocks in the
growth rate of GNP; the impact of GNPgr on D.Unemp increases to approximately
20% by periods 3 or 4.
Fig. 10.8 IRFs of the estimated VAR(2), one panel per impulse–response pair of D.Unemp and GNPgr (95% confidence intervals shown).
Fig. 10.9 FEVDs of the estimated VAR(2): the fraction of the forecast-error variance (MSE) of each variable due to each impulse (95% confidence intervals shown).
10.12 Exercises
with X0 = 0 and Y0 = 0.
(a) Express the VAR in matrix form using the companion matrix.
(b) Calculate the first five values of the IRF of a one-unit shock to X1 on Xt and
Yt .
(c) Calculate the first five values of the IRF of a one-unit shock to Y1 on Xt and
Yt .
2. Suppose you estimated the following two-variable VAR(2) model:
11 Vector Autoregressions II: Extensions
In the previous chapter, we covered the basics of reduced-form VARs on stationary
data. In this chapter, we continue learning about VARs, but we extend the discussion
to structural VARs (SVARs) and VARs with integrated variables. In the process, we
will go through some additional examples and an in-depth replication of an SVAR
paper by Blanchard and Quah (1989).
Many students begin estimating SVARs without even realizing it: by estimating
“orthogonalized IRFs.” Thus, we begin there.
11.1 Orthogonalized IRFs
We were able to calculate IRFs in Sect. 10.8 because we presumed there was a one-
time shock to one variable that did not simultaneously affect the other endogenous
variable. In practice, however, random shocks affect many variables simultaneously.
They are not always independent of each other; rather, they are often contempora-
neously correlated. This is why a VAR is estimated using SUR (which allows for
contemporaneously correlated errors across equations), rather than using separate
OLS regressions (which presume independence or orthogonality).
In general, IRFs indicate how a shock to one variable affects the other variables.
But this is not particularly useful for policy purposes if shocks are correlated across
variables. We’d like to identify an exogenous change in a policy variable, and track
its effect on the other variables. That is, for policy purposes, we can’t have those
shocks correlated. So how do we un-correlate those shocks? How can we make
them orthogonal to each other?
After we estimate a VAR, we must impose a certain type of assumption or
constraint to draw the corresponding orthogonalized IRF. There are several such
assumptions that allow us to draw OIRFs. The most common such constraint is to
impose sequential orthogonality via something called a “Cholesky decomposition.”
Then we estimate the VAR and calculate the impulse response functions. We'll
ask Stata to calculate two sets of IRFs: (a) the simple IRFs, similar to the ones we
calculated by hand, and (b) orthogonalized IRFs, where we take into account the
correlation between εx,t and εy,t.
Fig. 11.1 Simple IRFs vs. orthogonalized IRFs for each impulse–response pair of X and Y.
The IRFs are drawn in Fig. 11.1, where you can see that IRFs can vary
significantly when we take into account the fact that the errors hitting both equations
might be correlated. IRFs are usually not the same as OIRFs.
Orthogonalized IRFs allow the shocks to one equation to be correlated with those in the
others. Thus, they (a) are arguably more realistic, and (b) better describe the causal
effects. Unfortunately, there are no free lunches. The process of orthogonalization
depends upon something that we've been able to ignore thus far: the order in which
the variables are listed in the VAR. That is, the OIRFs from one ordering of the
variables need not equal the OIRFs from another.
In Fig. 11.2 we show the OIRFs from two different orderings of the same estimated
VAR.
Fig. 11.2 OIRFs from two different Cholesky orderings (order1: Y listed first; order2: X listed first), for each impulse–response pair of X and Y.
The first irf command specified a causal ordering giving Y primacy (order
Y X). The second irf reversed the order, giving X primacy over Y (order X Y).
As you can see in Fig. 11.2, the estimated OIRFs can be quite different. So,
what are we to do? If you want to report the OIRF, then you will need to justify
the ordering. That is, you will have to determine—hopefully with the aid of solid
economic theory—which variable is more likely to be causal and independent, and
list it first. Then list the second most exogenous variable, then the third, and so forth.
Sometimes, as in our example where the shocks were highly correlated across equations, the ordering matters a great deal. Other times, its effect is negligible. But there is only one
way to tell, and that requires estimating the OIRFs from many different orderings
and comparing them. If there are only two variables, then this is easy. But if there
are, say, four variables, then there are 4×3×2×1 = 24 orderings. With K variables,
there are K! orderings. One popular approach is to estimate your preferred ordering,
where you list the variables in decreasing order of alleged exogeneity, and then
re-estimate with the reversed order. The idea is that if the OIRFs from these two
extremes are in agreement, then the other orderings can be ignored.
We saw above that order matters when drawing OIRFs, but we still don’t
know why. To better understand this, we’ll need to take a detour into Cholesky
decompositions.
Exercises
1. Re-estimate the OIRFs from the VAR(1) described in Eqs. (11.1) and (11.2), but
change the variance/covariance matrix to

Σ = ⎡ 1     0.15 ⎤
    ⎣ 0.15  1    ⎦

so that the shocks are not as highly correlated across equations. Use this to show that the Cholesky ordering has a negligible impact when the shocks are relatively uncorrelated. (Hint: Reuse the Stata code provided in Sect. 11.1, changing the definition of Σ.)
A = LL′.

Likewise,

    ⎡ 1     0.30  0.20 ⎤   ⎡ 1    0          0         ⎤ ⎡ 1  0.3        0.2       ⎤
Σ = ⎢ 0.30  1     0.10 ⎥ = ⎢ 0.3  0.9539392  0         ⎥ ⎢ 0  0.9539392  0.0419314 ⎥ = LL′.
    ⎣ 0.20  0.10  1    ⎦   ⎣ 0.2  0.0419314  0.9788982 ⎦ ⎣ 0  0          0.9788982 ⎦
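The factorization above is easy to verify by hand. The following sketch (plain Python rather than the Stata used in the text) implements the textbook Cholesky algorithm and reproduces the factor shown:

```python
import math

def cholesky(A):
    """Lower Cholesky factor L of a symmetric positive-definite matrix A,
    so that A = L L' (the decomposition used throughout this chapter)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)   # diagonal entry
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]  # below-diagonal entry
    return L

# The 3x3 variance/covariance matrix from the text:
Sigma = [[1.00, 0.30, 0.20],
         [0.30, 1.00, 0.10],
         [0.20, 0.10, 1.00]]

L = cholesky(Sigma)
# L reproduces the factor shown above:
#   [1, 0, 0], [0.3, 0.9539392, 0], [0.2, 0.0419314, 0.9788982]
```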
Next, we will draw correlated errors, transform them, and then verify that the transformed variables have the correlation structure that we wanted.
First, let’s define the original variance/covariance matrix, and get its Cholesky
factors (as well as their inverses):
Now, pre-multiply the error matrix by the inverse of the lower Cholesky factor to
generate the transformed errors:
The original two variables eps1 and eps2 were correlated (not orthogonal) to
each other (Fig. 11.3). The two transformed variables, e1 and e2, are orthogonal to
each other (Fig. 11.4).
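To mimic this exercise outside Stata, the sketch below (plain Python; the covariance values are made up for illustration) draws correlated errors, pre-multiplies them by the inverse of the lower Cholesky factor, and confirms that the transformed errors are essentially uncorrelated:

```python
import math, random

random.seed(12345)
n = 10_000

# An illustrative 2x2 variance/covariance matrix with highly correlated shocks
s11, s21, s22 = 1.0, 0.75, 1.0
# Closed-form 2x2 lower Cholesky factor of [[s11, s21], [s21, s22]]
L11 = math.sqrt(s11)
L21 = s21 / L11
L22 = math.sqrt(s22 - L21 ** 2)

# Draw correlated errors: (eps1, eps2)' = L (z1, z2)' with z iid N(0,1)
z = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
eps = [(L11 * z1, L21 * z1 + L22 * z2) for z1, z2 in z]

# Orthogonalize: (e1, e2)' = L^{-1} (eps1, eps2)',
# using the inverse of a 2x2 lower-triangular matrix directly
e = [(a / L11, (b - (L21 / L11) * a) / L22) for a, b in eps]

def corr(pairs):
    x = [p[0] for p in pairs]; y = [p[1] for p in pairs]
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in pairs) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

print(corr(eps))  # strongly positive, roughly 0.75
print(corr(e))    # essentially zero
```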
Fig. 11.3 Scatterplot of the correlated errors, eps1 against eps2

Fig. 11.4 Scatterplot of the orthogonalized errors, e1 against e2
Yt = βYt−1 + εt,

which has the moving-average representation

Yt = ∑_{j=0}^{∞} β^j εt−j,   (11.5)

where β⁰ is equal to the identity matrix. We saw earlier how the sequence of β^j's were the IRFs of the VAR. How do we get the OIRFs?
The challenge is to rearrange the expression in (11.5) without altering the
equation: we wish to change the way we see the data; we don’t want to change
the data. Thus, we can’t just go about multiplying something on the right without
multiplying something on the left. What we can do, however, is judiciously
multiply by one; in matrices we can multiply by the identity matrix I. We can
rewrite (11.5) as

Yt = ∑_{j=0}^{∞} β^j LL⁻¹ εt−j = β⁰LL⁻¹εt + β¹LL⁻¹εt−1 + β²LL⁻¹εt−2 + . . .   (11.6)
where L and L⁻¹ are the lower Cholesky factor and its inverse. Thus, Yt can be seen as a weighted average of orthogonal shocks. The weights, the OIRFs, are given by the sequence of β^j L's, and the orthogonal errors are given by the sequence of L⁻¹εt−j's.
Multiplying the error by L⁻¹ orthogonalizes the error. Multiplying β by L transforms the coefficients in an offsetting (inverse) yet complementary direction. Thus, the β^j L's trace out the IRFs from the orthogonal shocks L⁻¹εt−j. This is why OIRFs are different from IRFs: the orthogonalized errors get fed through transformed βs.
There are other ways to multiply by one. And there are unfortunately also other
ways to orthogonalize. The Cholesky decomposition is merely the most popular
(popularized by Chris Sims himself).
To solidify this concept, we will estimate a VAR from artificial data. Then we
will have Stata estimate the IRFs and OIRFs. Then we’ll perform the Cholesky
multiplication, as in Eq. (11.6), by hand, and compare the results. We should be able
to replicate Stata’s answers. Then, we will redo the orthogonalization, but we’ll
change the order, and we’ll show how this changes the estimated OIRFs.
We begin by generating the data:
Then we estimate the VAR and extract the estimated coefficient matrix (β̂) and variance/covariance matrix (Σ̂).
We can replicate the IRF by reporting the powers of the estimated coefficient
matrix:
. matrix betahat0 = I(2)
symmetric betahat0[2,2]
c1 c2
r1 1
r2 0 1
betahat1[2,2]
c1 c2
r1 .39704043 .10414341
r2 .20556151 .10486756
betahat2[2,2]
c1 c2
r1 .17904898 .05227041
r2 .10317297 .03240508
betahat3[2,2]
c1 c2
r1 .08183447 .02412824
r2 .04762508 .01414303
betahat4[2,2]
c1 c2
r1 .03745143 .01105279
r2 .02181634 .00644298
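As a cross-check on the output above (a plain-Python sketch; the numbers are copied from the Stata listing), squaring betahat1 reproduces betahat2 up to rounding:

```python
def matmul(A, B):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# Estimated VAR coefficient matrix betahat1, from the Stata output above
B1 = [[0.39704043, 0.10414341],
      [0.20556151, 0.10486756]]

B2 = matmul(B1, B1)   # should match betahat2
B3 = matmul(B1, B2)   # should match betahat3
print(B2)  # approx [[0.17904898, 0.05227041], [0.10317297, 0.03240508]]
```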
The OIRFs are the estimated coefficients multiplied by the lower Cholesky factor:
. matrix phi_0 = L
phi_0[2,2]
X Y
X 1.0061183 0
Y .75687962 .65650145
phi_1[2,2]
X Y
r1 .47829365 .0683703
r2 .28619131 .0688457
phi_2[2,2]
X Y
r1 .21970686 .0343156
r2 .12833095 .02127398
phi_3[2,2]
X Y
r1 .10059733 .01584023
r2 .05862103 .00928492
phi_4[2,2]
X Y
r1 .0460462 .00725617
r2 .02682638 .00422983
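Again we can check by hand (plain Python; the numbers are copied from the listings above): multiplying betahat1 by phi_0, the lower Cholesky factor, reproduces phi_1:

```python
def matmul(A, B):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# From the Stata output above: betahat1 and phi_0 (the lower Cholesky factor)
B1 = [[0.39704043, 0.10414341],
      [0.20556151, 0.10486756]]
L = [[1.0061183, 0.0],
     [0.75687962, 0.65650145]]

phi1 = matmul(B1, L)
print(phi1)  # approx [[0.47829365, 0.0683703], [0.28619131, 0.0688457]]
```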
Thus far, we have shown how to find the OIRFs automatically, and “by hand,”
using matrix multiplication. Now, we will show, yet again, that the order matters for
OIRFs.
Notice that we reversed the order in the VAR, and the new OIRFs reflect this
change. Now, why does order matter?
We’ve shown graphically and numerically that order matters for OIRFs, but we
haven’t been very explicit about why order matters. We’re now in a position to
explain this. Ultimately, the reason order matters is because when we take the Cholesky decomposition of Σ̂, the error's estimated variance/covariance matrix, we create two triangular matrices, L and L′. We multiply the coefficients in the
companion matrix by L, which is a lower triangular matrix. We also multiply the
errors by the inverse of L, but this is also a lower triangular matrix. So there are
systematic zeros in our equations that limit the effect in the earlier, higher up,
equations; the fewer zeros at the bottom of the lower triangular matrices allow for
more interactions with these later equations. But this is all quite vague. Let’s work
out the math of a simple example.
where, as usual, the β̂s are estimated parameters. And suppose that the estimated variance/covariance matrix of the error terms is

Σ̂ = ⎡ σ̂x²   σ̂x,y ⎤
     ⎣ σ̂y,x  σ̂y²  ⎦ .

Let's express the Cholesky decomposition of Σ̂ as

Σ̂ = L̂L̂′ = ⎡ L̂xx  0   ⎤ ⎡ L̂xx  L̂yx ⎤
            ⎣ L̂yx  L̂yy ⎦ ⎣ 0    L̂yy ⎦ .
Yt = βYt−1 + εt
   = β⁰εt + β¹εt−1 + β²εt−2 + . . .
   = β⁰LL⁻¹εt + β¹LL⁻¹εt−1 + β²LL⁻¹εt−2 + . . .
The OIRFs are the sequence of β̂^j L̂'s. Recalling that β̂⁰ = I, the OIRFs are

β̂⁰L̂ = ⎡ 1  0 ⎤ ⎡ L̂xx  0   ⎤ = ⎡ L̂xx  0   ⎤ ,   (11.7)
       ⎣ 0  1 ⎦ ⎣ L̂yx  L̂yy ⎦   ⎣ L̂yx  L̂yy ⎦

β̂¹L̂ = ⎡ β̂xx  β̂xy ⎤ ⎡ L̂xx  0   ⎤ = ⎡ β̂xxL̂xx + β̂xyL̂yx   β̂xyL̂yy ⎤ ,
       ⎣ β̂yx  β̂yy ⎦ ⎣ L̂yx  L̂yy ⎦   ⎣ β̂yxL̂xx + β̂yyL̂yx   β̂yyL̂yy ⎦
and so forth. Please note the zero entry in the top right of Eq. (11.7); it is important.
The entries in these matrices get complicated rather quickly. To simplify, let's denote the entries in β̂^j L̂ as

β̂^j L̂ = ⎡ b^j_xx  b^j_xy ⎤ .
         ⎣ b^j_yx  b^j_yy ⎦
Multiplying out the matrices, and defining new coefficients (a's and d's) to get rid of some clutter, the two equations above can be written out directly. Notice that Xt depends upon ex,t, but Yt depends upon both ey,t and ex,t. If we had ordered Y-then-X in the Cholesky ordering (and redefined the a's and d's appropriately), we would have had something different: Yt would depend upon ey,t, while Xt would depend upon both ex,t and ey,t.
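The role of the triangular zeros is easy to see numerically. In this sketch (plain Python; the covariance matrix is invented for illustration), the same Σ is factored under both orderings, and the on-impact responses differ because the forced zero sits in a different position:

```python
import math

def chol2(s11, s21, s22):
    """Closed-form lower Cholesky factor of the 2x2 matrix [[s11, s21], [s21, s22]]."""
    L11 = math.sqrt(s11)
    L21 = s21 / L11
    L22 = math.sqrt(s22 - L21 ** 2)
    return [[L11, 0.0], [L21, L22]]

# Illustrative covariance of the reduced-form shocks: Var(ex)=1, Var(ey)=4, Cov=0.75
# Ordering X-then-Y: factor Sigma as-is; the X shock hits Y on impact, not vice versa
L_xy = chol2(1.0, 0.75, 4.0)

# Ordering Y-then-X: permute Sigma, factor, and permute back to (X, Y) coordinates
L_raw = chol2(4.0, 0.75, 1.0)          # Cholesky of the permuted Sigma
L_yx = [[L_raw[1][1], L_raw[1][0]],    # impact matrix expressed in (X, Y) order
        [0.0,         L_raw[0][0]]]

print(L_xy)  # zero in the top-right: Y's shock cannot move X on impact
print(L_yx)  # zero in the bottom-left: X's shock cannot move Y on impact
```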
11.2 Forecast Error Variance Decompositions

Impulse response functions tell us how a shock to one variable propagates through the system and affects the other variables. What they can't tell us, however, is how important each shock is. How much of the variation in X is due to shocks in Y? What we need is something like an R², but one that can be split up among the different variables, decomposing each shock's effect on a variable. It would also be useful to know how X affects Y across different lags. The forecast error variance decomposition (FEVD) satisfies all of these criteria.
Suppose we estimate a VAR on X and Y, after which we draw some forecasts 1,
2, or k periods out. Those forecasts are never perfect, so there will be some forecast
error. The “forecast error” is simply the residual: the difference between what we
expected Y to be after a certain number of periods, given the results of our VAR. The
“variance decomposition” splits up the variance of this residual into its component
causes, and expresses the result as a percent.
Thus, the FEVD tells us: What percent of the variation in X from its forecasted value is due to shocks directly affecting X? And what percent is due to shocks from Y? Since these are shares of the total, they sum to one (i.e., to 100%).
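For a small VAR, the decomposition can be computed directly from the orthogonalized responses Θs = β^s L. The sketch below (plain Python; the coefficient and covariance values are invented, not the chapter's) builds the h-step FEVD shares and confirms they sum to one:

```python
import math

# Illustrative VAR(1) coefficients and shock covariance (not from the text)
beta = [[0.5, 0.1],
        [0.2, 0.3]]
s11, s21, s22 = 1.0, 0.5, 1.0
L11 = math.sqrt(s11); L21 = s21 / L11; L22 = math.sqrt(s22 - L21 ** 2)
L = [[L11, 0.0], [L21, L22]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def fevd(h):
    """Share of variable i's h-step forecast error variance due to shock j."""
    theta, power = [], [[1.0, 0.0], [0.0, 1.0]]   # beta^0 = I
    for _ in range(h):
        theta.append(matmul(power, L))            # Theta_s = beta^s L
        power = matmul(power, beta)
    shares = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        total = sum(t[i][j] ** 2 for t in theta for j in range(2))
        for j in range(2):
            shares[i][j] = sum(t[i][j] ** 2 for t in theta) / total
    return shares

print(fevd(1))  # first variable: 100% own shock at h=1 (Cholesky ordering)
print(fevd(8))
```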
Let’s see how this works. Using the same data from Sect. 10.2:
which produces the following table of FEVDs (leaving out their confidence
intervals):
Column (1) of the first table shows the proportion of forecast error variance of
GNPgr attributable to its own shocks. As it is listed first in the Cholesky ordering,
it does not initially depend upon shocks from the other variable. Thus, 100% of the
FE variance at lag one is attributable to its own shocks; column (3) correspondingly
shows that 0% of GNPgr’s FEV comes from M1gr. Eight periods out, the story is
not much changed. Nearly 98.9% of our inability to accurately forecast GNPgr is
attributable to shocks to GNPgr itself; 1% is attributable to shocks in the money
growth rate.
Columns (2) and (4) decompose the FEV of M1gr into the effects of shocks
on GNPgr and M1gr. Column (4) shows that 98.6% of the lag=1 FEV of the
money growth rate is attributable to own shocks. Or, loosely speaking, 98.6% of our
inability to forecast M1gr one period out is attributable to changes in M1gr itself.
Correspondingly, 1.33% comes from shocks to the growth rate of GNP. The money growth rate was the second variable in our Cholesky ordering, so it is allowed to suffer immediately from shocks to the first variable.
Eight periods out, the story is only slightly different: 91.6% of M1gr's FEV comes from own shocks, and 8.3% comes from shocks to GNPgr.
FEVDs rely on the Cholesky order of the VAR. If we reverse the Cholesky
ordering, then we get a slightly different FEVD:
The columns in the two tables are presented in a different order. More importantly,
the values in the tables are slightly different. The more exogenous variable in this
Cholesky ordering is the money growth rate, so at lag=1, GNPgr does not affect our
ability to properly forecast M1gr; 100% of the FEV of M1gr is from own shocks.
The two tables happen to look a bit similar in this example because the estimated
shocks between the two variables were not highly correlated. Thus, the Cholesky
transformation does not have much to un-correlate. Often in practice, the estimated
FEVDs from different orderings vary more than this.
11.3 Structural VARs

What is a structural VAR? And how is it different from the "reduced form" VARs that we've discussed so far? In short, a structural VAR includes the contemporaneous terms, Xt and Yt, as explanatory variables in each equation. Reduced form VARs include only lagged values as explanatory variables. Why the difference? And why is it a big deal?
Consider the simple model of supply and demand. The supply price of a widget
is a function of, say, the prices of inputs and the quantity supplied. The demand price
of a widget is a function of, say, the prices of competing products and the quantity
demanded. Prices and quantities are endogenous variables. They determine each
other. In a SVAR, X and Y also determine each other. They are both endogenous
variables; they both show up as the dependent variables in their respective equations.
You can’t just regress one on the other; they must be considered as a system.
Or, as we do with supply and demand models, we must derive a reduced form
representation of the system and try to back out the parameter values of the structural
model.
For structural VARs, we need X as a function of exogenous things, things we can change via policy. This is why reduced form VARs are best suited for data description and ill suited for policy analysis. We need structural models to figure out what happens if we change the structure of the economy.
VARs can be used to describe data or to estimate a structural model. The former is
easier than the latter, so we began there, with reduced form VARs. Now it is time to
turn to “structural form” VARs.
The VARs that we have looked at thus far have been "reduced form," such as

Xt = β₁Xt−1 + β₂Yt−1 + εx,t   (11.8)
Yt = β₃Xt−1 + β₄Yt−1 + εy,t   (11.9)

or, in matrix form,

Yt = βYt−1 + εt   (11.10)
with

Var(εt) = Σ = ⎡ σx²   σx,y ⎤
              ⎣ σy,x  σy²  ⎦ .
Each variable affects the other with a lag; effects are not simultaneous but shocks can be correlated; identification is not a problem.

A structural VAR (SVAR), on the other hand, is:

Xt = α₁Yt + α₂Xt−1 + α₃Yt−1 + ex,t   (11.11)
Yt = α₄Xt + α₅Xt−1 + α₆Yt−1 + ey,t   (11.12)

This can be written compactly as

AYt = αYt−1 + et,   (11.13)

because the contemporaneous terms (the ones with a t subscript) can be moved over to the left-hand side.
What’s the important difference between Eqs. (11.10) and (11.13)? What’s the big
deal about the matrix A that it requires its own section? The big deal is that the
model, as expressed, is unidentified. That is, we can’t estimate the parameters
of (11.13) using the estimates of (11.10).
What does it mean to be "unidentified?" First, notice that we can transform an SVAR such as (11.13) into a reduced form VAR such as (11.10):

AYt = αYt−1 + et   (11.14)
Yt = A⁻¹αYt−1 + A⁻¹et   (11.15)

where β = A⁻¹α and εt = A⁻¹et. We know how to estimate the reduced form
VAR’s parameters in (11.15). But we can’t use these estimates to figure out the
SVAR parameters in (11.14). It is the same problem we might find ourselves in if
we need to figure out a and b if we know that, say, a/b = 10. There’s an infinite
combination of as and bs that divide to 10. Likewise, there’s an infinite combination
of A−1 and α that multiply to β. In other words, if somehow we know the parameters
of (11.14), then we can figure out (11.15), but if we know (11.15), we can’t figure
out (11.14).
In practical terms, this means that there are multiple structural form models that
are compatible with an estimated reduced form model. That is, multiple economic
theories are compatible with the data. Our reduced form regressions would not
cast any light on which theory—say, New Keynesian vs Real Business Cycles—
is correct.
To make this “unidentification problem” a bit more concrete, let’s see why
Eqs. (11.11) and (11.12) are unidentified. Notice that X and Y are truly endogenous
in (11.11) and (11.12). Thus, we cannot estimate each equation separately; they
are jointly determined. They are part of a system, so they need to be estimated as
part of a system. We can, however, estimate the reduced form equations. Could we
then back out the coefficients of (11.11) and (11.12) with our estimates of (11.8)
and (11.9)? No. We’d have four reduced form estimates (the β̂s) and six unknowns
(the αs).
To see why, suppose we estimated (11.8) and (11.9) and found
We have four known values (the β̂s) from the reduced form equations. Unfortu-
nately, there are six unknowns (the αs) in the structural form equations.
One set of αs that satisfies these equations would be:
but so does
as well as
Once we know two of the αs, we can figure out the rest.
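The point is easy to verify numerically. In the sketch below (plain Python; all parameter values are invented for illustration), two different sets of structural parameters imply exactly the same reduced-form βs, so the reduced form alone cannot identify the structure:

```python
def reduced_form(a1, a4, lag_coefs):
    """Map structural parameters to reduced-form betas: beta = A^{-1} alpha,
    where A = [[1, -a1], [-a4, 1]] collects the contemporaneous terms."""
    det = 1.0 - a1 * a4
    Ainv = [[1.0 / det, a1 / det],
            [a4 / det, 1.0 / det]]
    return [[sum(Ainv[i][k] * lag_coefs[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# Structural model 1: X does not react to Y contemporaneously (a1 = 0)
beta_a = reduced_form(0.0, 0.5, [[0.4, 0.1], [0.0, 0.25]])

# Structural model 2: a different contemporaneous structure, with lag
# coefficients chosen (alpha = A beta) so that it implies the SAME reduced form
a1, a4 = 0.3, 0.1
A = [[1.0, -a1], [-a4, 1.0]]
lags_b = [[sum(A[i][k] * beta_a[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
beta_b = reduced_form(a1, a4, lags_b)

print(beta_a)
print(beta_b)  # identical: the data cannot distinguish the two structural models
```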
Estimating a structural VAR amounts to pegging down some of the coefficients—
the so-called “identifying restrictions”—so that we can back out the rest of them
from the reduced form estimates. The statistical software takes care of the initial
estimation and the “backing out.” The hard part for the econometrician is coming
up with a defensible set of identifying restrictions. This is where the action is with SVARs; this is where macro-econometricians argue. How do we come up with such restrictions?
In what follows, we will switch notation to match Stata's. This will help us map our equations to Stata's commands.
A reduced form VAR in companion form is
Yt = βYt−1 + εt   (11.16)

with

Var(εt) = Σ = ⎡ σx²   σx,y ⎤
              ⎣ σy,x  σy²  ⎦ .
A0 Yt = AYt−1 + et (11.18)
with
et = B ut
and
Var(et) = Σe = ⎡ 1  0 ⎤ .
               ⎣ 0  1 ⎦
In the reduced form, the two equations are related by the correlation across
equations via . In the SVAR, the two equations are related explicitly via A but
the structural shocks (es) are uncorrelated.
A0 Yt = AYt−1 + But
A0 Yt − AYt−1 = But
(A0 − A (L)) Yt = But
(A0 − A (L))−1 (A0 − A (L)) Yt = (A0 − A (L))−1 But
Yt = (A0 − A (L))−1 But
Yt = Cut
where C is defined to be equal to (A0 − A (L))−1 . This C matrix describes the long-
run responses to the structural shocks.
To be able to estimate the parameters of the SVAR, we need to impose some
identification restrictions, that is, we need to specify some values in the entries of
A, B and C, or otherwise adequately constrain them.
Incidentally, if A0 = I and B = L (the lower Cholesky factor), then our SVAR
becomes the reduced form VAR.
Our motivation for SVARs was twofold: (1) to see what would happen if we
exogenously changed a particular variable via policy, and (2) to trace out the
effects of such policy shocks via IRFs. You might be asking: isn’t this what we
accomplished via reduced form VARs and OIRFs via Cholesky? Actually, yes. The
Cholesky factorization imposes a recursive structure on the unrestricted VAR. This
makes it a type of SVAR, but SVARs are more general. Still, it might prove useful
to see more formally how Cholesky is a type of SVAR. We’ll do this using Stata, so
it will prove useful to switch to Stata’s notation.
Using Stata's notation, an SVAR can be written as:

A(I − A1L − A2L² − · · · − ApL^p)Yt = Aεt = Bet.   (11.19)
To replicate a Cholesky ordering with three variables, A is restricted to be lower triangular with ones on the diagonal, and B is restricted to be diagonal:

    ⎡ 1    0    0 ⎤        ⎡ b11  0    0   ⎤
A = ⎢ a21  1    0 ⎥ ,  B = ⎢ 0    b22  0   ⎥ .
    ⎣ a31  a32  1 ⎦        ⎣ 0    0    b33 ⎦
Example
Next, we will show with an example how OIRFs derived via the SVAR approach
and the Cholesky approach are equivalent. To do so, we need to generate some data.
First we set up 10,000 empty observations:
Then we specify a variance/covariance matrix (Σ), and draw random data from the multivariate normal distribution N(0, Σ):
Next, we calculate a simple reduced form VAR and instruct Stata to calculate
the OIRFs with the ordering X Y.
Next, we calculate the OIRF via an SVAR with the proper restrictions on the A
and B matrices.
As you can see, the results are the same, whether we calculated a reduced-form
VAR and then imposed the Cholesky transformation, or instead specified the
appropriately restricted SVAR.
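The algebra behind this equivalence is short: with A restricted to be unit lower triangular and B diagonal, the implied impact matrix A⁻¹B is exactly the lower Cholesky factor of Σ. A plain-Python sketch with a made-up Σ:

```python
import math

# Illustrative reduced-form error covariance (not the chapter's simulated data)
s11, s21, s22 = 1.0, 0.5, 1.0

# Lower Cholesky factor of Sigma, in closed form for the 2x2 case
L11 = math.sqrt(s11)
L21 = s21 / L11
L22 = math.sqrt(s22 - L21 ** 2)

# Cholesky-equivalent SVAR restrictions: A unit lower triangular, B diagonal
a21 = -L21 / L11                   # off-diagonal element of A
A = [[1.0, 0.0], [a21, 1.0]]
B = [[L11, 0.0], [0.0, L22]]

# Impact matrix A^{-1} B (the inverse of a unit lower-triangular 2x2 is direct)
Ainv = [[1.0, 0.0], [-a21, 1.0]]
impact = [[sum(Ainv[i][k] * B[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]

print(impact)  # equals the Cholesky factor [[L11, 0], [L21, L22]]
```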
Olivier Blanchard and Danny Quah (1989) introduced the long-run restriction to the
estimation of SVARs. David Schenck (2016) wrote a detailed tutorial on the Stata
blog estimating Blanchard and Quah’s SVAR in Stata. We follow Schenck closely,
explaining a few more steps along the way.
Blanchard and Quah estimated a two-variable eight-lag SVAR between the GNP
growth rate (YGR) and the unemployment rate (UNEMP):
Yt = Cut
or
⎡ YGRt    ⎤   ⎡ C1,1  C1,2 ⎤ ⎡ uy,t ⎤
⎣ UNEMPt ⎦ = ⎣ C2,1  C2,2 ⎦ ⎣ uu,t ⎦ .
Blanchard and Quah argue that an unemployment shock has zero long-run impact on GNP growth. Therefore, setting C1,2 = 0 is a proper identifying restriction on C, so that:

C = ⎡ C1,1  0    ⎤ .
    ⎣ C2,1  C2,2 ⎦
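For a VAR(1), the mechanics of the long-run restriction can be sketched in a few lines (plain Python; the coefficient and covariance values are invented, and the text's actual estimation is done with Stata's svar command). We construct a structural impact matrix S with SS′ = Σ whose long-run response matrix (I − β)⁻¹S has a zero in its upper-right corner:

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inv2(A):
    d = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]

def chol2(M):
    L11 = math.sqrt(M[0][0])
    L21 = M[0][1] / L11
    return [[L11, 0.0], [L21, math.sqrt(M[1][1] - L21 ** 2)]]

# Invented VAR(1) coefficients and reduced-form error covariance
beta = [[0.5, 0.1], [0.2, 0.3]]
Sigma = [[1.0, 0.4], [0.4, 1.0]]

# F = (I - beta)^{-1}: cumulative (long-run) effect of a one-time shock
F = inv2([[1 - beta[0][0], -beta[0][1]], [-beta[1][0], 1 - beta[1][1]]])

# Blanchard-Quah trick: Cholesky-factor the LONG-RUN covariance F Sigma F',
# then map back. C = chol(F Sigma F') is lower triangular, so C[0][1] = 0.
Ft = [[F[0][0], F[1][0]], [F[0][1], F[1][1]]]
C = chol2(matmul(matmul(F, Sigma), Ft))
S = matmul(inv2(F), C)   # structural impact matrix

# Check the two identifying properties:
SSt = matmul(S, [[S[0][0], S[1][0]], [S[0][1], S[1][1]]])
longrun = matmul(F, S)
print(SSt)      # reproduces Sigma
print(longrun)  # upper-right element is zero: shock 2 has no long-run effect
```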
The data were quarterly, but we want annualized growth rates expressed as a
percent, so we create a new variable:
In their paper, Blanchard and Quah estimate their SVAR on data from the second
quarter of 1952 through the fourth quarter of 1987. In his replication, Schenck
includes the first quarter of 1952. You can switch between the two sets of estimates
by commenting out the appropriate part below.
Blanchard and Quah take some ad-hoc steps to ensure their data are stationary.
They detrend the unemployment rate by regressing on a deterministic trend and
extracting the residuals.
Also, they perceive a break-point in the GNP growth data at 1974, so they de-
mean that variable by regressing on pre- and post-1974 dummy variables.
The output of the SVAR is not really of interest, but we report it below to show that
the C matrix was properly constrained.
[Figure: structural IRFs from the Blanchard-Quah SVAR, plotted over 40 steps. The panels labeled "Blanchard_Quah, unemp, Ygr" and "Blanchard_Quah, unemp, unemp" show the responses of Ygr and unemp to a structural unemployment shock; the remaining panels show the responses to a structural GNP-growth shock. Graphs by irfname, impulse variable, and response variable.]
Can we establish Granger causality if our variables are integrated? After all, if you
look back at the examples thus far, you’ll notice that the variables have all been
differenced so that they are I(0). But what if we don’t want to estimate a VAR in
differenced variables? What if we want to estimate the VAR in levels, and these
levels are I(d)? Toda and Yamamoto (1995) showed that there is a simple way to
estimate the VAR in levels, so that the usual asymptotic formulas for calculating
standard errors and p-values still apply.1 In fact, their method applies, regardless
of whether the variables have unit roots, are stationary around a deterministic trend
(i.e. that t is an explicit variable in the regression), integrated of order d, or even
cointegrated of order d.
The procedure is as follows:
1. Determine the number of lags k using one of the standard information criteria
(such as with Stata’s varsoc command). Let’s suppose that k = 4.
2. Determine dmax, the largest order of integration of the variables. For example, if Y is I(1) and X is I(2), then dmax = 2. (Up until now, we might have differenced Y once, and differenced X twice, and then estimated the VAR in differences: var d1.y d2.x, lags(1/k).)

1 Dolado and Lütkepohl (1996) derived many of the same results for the simpler case that the variables are I(1). The paper by Toda and Yamamoto (1995) is more general, showing the case for variables integrated up to an arbitrary order d.
3. Estimate the var in levels with k + dmax lags. For example, var y x,
lags(1/6).
4. In all further testing (e.g., stability tests, Granger causality tests), ignore the additional dmax = 2 lags, as their true coefficients are zero. Restrict your tests to the first k lags.
Adding the extra dmax lags to the VAR ensures that the asymptotic formulas for test
statistics are correct. Granger causality tests will then have to be done “by hand,”
using the test command, rather than Stata’s vargranger command, as the latter
automatically uses the extra lags.
As a caveat to the discussion above, this method is useful for establishing
Granger causality when variables are integrated (or even cointegrated). If the
variables are cointegrated, however, we should not stop there. Rather, we should
proceed with estimating and interpreting a Vector Error Correction Mechanism
(VECM). VECMs are the topic of the next chapter.
11.5 Conclusion
My definition [of causality] was pragmatic and any applied researcher with two or more
time series could apply it, so I got plenty of citations. Of course, many ridiculous papers
appeared (Granger 2004, p. 425).
Try not to write ridiculous papers. You should have a solid theory in mind that
would relate one variable to another. Don’t form a theory to fit the data. Further,
Granger causality is sensitive to many of the choices that you’ll have to make while
writing the paper. Your decisions to keep or discard variables might be justified
by their p-values, but these decisions might affect the validity of your conclusions
further in the process. Model mis-specification is a sure way to find spurious—or
even “ridiculous”—Granger-causal relations.
We’ve barely scratched the surface of VARs, but we’ll forgo looking further into
VARs and SVARs in favor of turning to VECMs. These are a related type of time-
series model which are designed to look simultaneously at long-run and short-run
relationships between variables.
For those who wish to explore VAR models in further detail, we recommend the
book by Enders (2014). It is a modern classic, providing a more technical but still
gentle and practitioner-oriented introduction into the application of VARs. The two
chapters on VARs in Rachev et al. (2007) add some mathematical complexity.
Tsay (2013) adds moving average errors to the coverage of VARs, exploring
vector-MA (VMA) and VARMA models in some depth. Shumway and Stoffer
(2006) extend the topics in this book to cover state-space and frequency-domain
models.
The next step up in difficulty is the book by Kilian and Lütkepohl (2017).
Adventurous students wishing to dive fully down the rabbit hole of VARs can do no
better than to explore Lütkepohl (2005) and Amisano and Giannini (2012). These
offer in-depth examinations of the issues and techniques involved in estimating
structural VARs. Both of these books are thorough and require a technical mastery
of matrix calculus and asymptotics.
12 Cointegration and VECMs

12.1 Introduction
The VARs that we looked at in the last chapter were very well suited for describing
the short-run relationship between variables, especially if they are stationary. Most
economic variables are not stationary, however. This required us to transform the
variables, taking first differences, so that they are stationary. In this chapter, we
show how to model the long-run relationship between variables in their levels, even
if they are integrated. This is possible if two or more variables are “cointegrated.”
If two variables are cointegrated, then, rather than taking the first difference of each variable, we can, loosely speaking, model the difference between the two variables.
When we take first-differences, we lose a lot of important information. We might
know how a variable is changing; but we don’t know its actual value (its level). We
are not modeling the variables we are really interested in, but their rates of change.
Similarly, when we model the difference between two variables, we are estimating
the statistical properties of a new variable, not the original two variables we are
interested in. While this is better than nothing, Engle and Granger (1987) showed
how we can do much better than that.
12.2 Cointegration
Fig. 12.1 Two cointegrated variables, X and Y (top panel), and their difference, Y − X, shifted down by ten units so that it fluctuates around zero (bottom panel)
randomly call out to her dog and it will move closer to its owner. Sometimes the dog
will randomly bark for its owner, drawing her closer to it. If you see the drunk, the
dog will not be too far away. Likewise, if you see the dog, its drunk owner should
be nearby. The barking and calling—and the staggering toward each other—is the
error correction mechanism, whereby when the two diverge, they begin to converge
again. We might not know where they’ll go as a pair, but we can be fairly certain
they’ll be close to each other. That is the essence of cointegration.
Suppose Y and X are two integrated but otherwise unrelated variables. Granger
and Newbold (1974) showed us that if we were to regress Y on X, we are likely to
find a statistically significant correlation between these variables, even though none
actually exists.
Now, consider the two variables in Fig. 12.1. They are both integrated. More
importantly, there seems to be a relationship between these two variables. They
never stray too far from each other. In the second panel of Fig. 12.1, we graph the
difference between the two variables. (We also shifted them down by ten units so
the new series fluctuates around zero.) Even though X and Y are non-stationary, the
difference between X and Y is stationary. Granger and Newbold may have made
us skittish about regressing the levels of X and Y on each other, but with this new
differenced variable, we can apply all of the techniques we learned in the VAR
chapter. In fact, we will see that if the variables are cointegrated, we can estimate
Fig. 12.2 Two integrated variables, Z and X, whose gap widens over time (top panel), and their nonstationary difference, Z − X (bottom panel)
the levels and differenced relationship simultaneously. But we’re getting ahead of
ourselves.
Let's turn to a slightly more subtle example. Consider a new set of variables, X and Z, in Fig. 12.2. They are both integrated. But are they cointegrated? The gap between them is increasing, that is, their difference is not stationary, so at first
between them is increasing, that is, their difference is not stationary, so at first
glance they do not seem to be cointegrated. But there does seem to be a relationship
between X and Z. Ignoring their different slopes, when X dips, so does Z. When X
spikes, so does Z. The problem is with their slopes. What would make the difference
between X and Z stationary is if we could either tilt X up (as in Fig. 12.3), or tilt
Z down (as in Fig. 12.4). This insight allows us to formally define cointegration: two or more nonstationary variables are cointegrated if a linear combination of them is stationary.1
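A quick simulation makes the definition concrete (plain Python; the data-generating process is made up): X is a random walk and Y equals 2X plus stationary noise, so both series wander without bound while the linear combination Y − 2X stays put:

```python
import random

random.seed(42)
T = 2_000

X, Y = [], []
x = 0.0
for _ in range(T):
    x += random.gauss(0, 1)               # X is a random walk: X_t = X_{t-1} + u_t
    X.append(x)
    Y.append(2 * x + random.gauss(0, 1))  # Y is tied to X by a stationary error

spread = [y - 2 * xx for y, xx in zip(Y, X)]
print(max(abs(v) for v in X))       # large: X wanders far from zero
print(max(abs(v) for v in spread))  # small: Y - 2X never strays far
```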
Many economic and financial theories provide conclusions in terms that relate to
cointegration. They seldom speak about the speed of convergence to an equilibrium,
but they do make equilibrium predictions about the levels of variables, and that
the relationship between them should be steady (i.e. cointegrated). For example,
1 More precisely, two or more variables which are integrated of order I(b) are cointegrated if a linear combination of them is integrated of a lower order than b.
Fig. 12.3 Tilting X up: X′ = 3X and Z move together (top panel), and their difference, X′ − Z, is stationary (bottom panel)
Fig. 12.4 Tilting Z down: Z′ = Z/3 and X move together (top panel), and their difference, X − Z′, is stationary (bottom panel)
Fig. 12.5 Two arguably cointegrated variables, ln(GDP) and ln(personal consumption)
Fig. 12.6 Two arguably cointegrated variables, Aaa and Baa bond yields
Xt = Xt−1 + ut (12.2)
2 Testing the theory of purchasing power parity is a classic use of cointegration analysis. Notable examples include Juselius et al. (1992), Corbae and Ouliaris (1988), Taylor (1988) and Kim (1990). Pedroni (2001) provides a cointegration test of PPP for panel data.
12.3 Error Correction Mechanism
Fig. 12.7 Two arguably cointegrated variables, 10-year and 5-year treasuries
Yt = βXt + εt.   (12.3)
the level predicted by βXt−1 (so that εt−1 > 0), then the change in Yt must go down by an amount equal to αεt−1. If, instead, Y is below its expected level, then Yt will tend to increase by αεt−1.

All of the terms in Eqs. (12.4) or (12.5) are stationary. Xt is I(1), so ΔXt is stationary. Likewise, Yt is I(1), so ΔYt is stationary. Finally, et is a stationary IID shock, so Yt − βXt = et is stationary. Since all the terms are stationary, estimation should not be a problem.
The error correction model (ECM) seems useful. But where does it come from?
Below we show that ECM models can be derived from the very VAR models we
studied in Chap. 10.
Consider the dynamic equation:
The coefficients are starting to look a bit complicated. To make things easier to read,
define
α = 1 − λ1
θ = λ2 + α11 .
How can we know if two or more variables are cointegrated? Let's return to the definition. Two or more variables are cointegrated if (1) they are integrated, and (2) a linear combination of them is integrated of lower order. In the case of I(1) variables, like most economic variables, this means that (1) the variables X and Y are both I(1), but (2) a linear combination of them, such as

Y − β₀ − β₁X,

is not integrated. This implies that to test whether X and Y are cointegrated, we can
follow a two-step procedure, attributed to Engle and Granger (1987):
There are several complications in the second-stage unit root test. First, in
the second-stage DF/ADF test, you should not include a constant if one was
already included in the regression of Y on X. (If you typed reg Y X, then Stata
automatically included a constant and you should not include another one in the
second-stage Dickey-Fuller test.) Second, unit root tests have low power, especially
when the true data-generating processes have near unit roots. Since unit roots are
used to test for cointegration, this low power also affects cointegration tests (Elliott
1998). Third, this second stage must rely on the residuals-based critical values provided by Engle and Yoo (1987), MacKinnon (1991) or MacKinnon (2010), rather than the usual DF/ADF critical values. The critical values are adjusted to reflect the fact that the second step relies on estimated residuals from the first step: in step 2 we are dealing with estimates rather than raw data, so there is added uncertainty to account for.
The Engle-Granger two-step test uses an Augmented Dickey-Fuller (ADF) test as its
second step, but it cannot use the ADF test’s usual test statistics or p-values. Why?
The standard ADF test presumes that it is applied to raw data. The EG test, however,
relies on an ADF test on estimated residuals. These estimated residuals necessarily
contain estimation error that the usual ADF test statistics fail to take into account.
What to do? Use different critical values from which to calculate p-values.
Engle and Yoo (1987) calculated critical values for the Dickey-Fuller3 and
ADF(4) tests of stationary residuals. These critical values depend upon the number
of variables (they consider up to five variables) and the sample size (they consider
samples of sizes 50, 100, and 200).
The latest and most accurate critical values were calculated by MacKinnon
(2010). These are updated estimates of those in MacKinnon (1991).4 MacKinnon
3 i.e., an ADF test with zero lags.
4 MacKinnon (2010) repeated his Monte Carlo simulations from MacKinnon (1991), using many
more replications. This allowed him to provide a more accurate third-degree response surface,
rather than his earlier second-degree surface.
12.5 Engle and Granger’s Residual-Based Tests of Cointegration
presents his results as a “response surface.” This is jargon for something fairly
simple. The critical values were calculated for various sample sizes and number
of variables. But what about intermediate sample sizes? Essentially, MacKinnon
provides a formula which interpolates between these values so that we can estimate
the appropriate critical value for any sample size.
The MacKinnon critical value for a test at level-p with sample size T can be
calculated from:
C(p, T ) = β∞ + β1/T + β2/T² + β3/T³. (12.8)
The terms β∞ , β1 , β2 , and β3 are not estimated coefficients from any of our regressions,
but are, rather, parameters given by MacKinnon in his tables. There are three
versions of his tables: one each for the no-trend (only a constant for drift), linear-
trend, and quadratic-trend cases. These tables are provided in the Appendix to this
textbook.
Example
Suppose we ran an Engle-Granger two-step test, that we estimated a (possible)
cointegrating equation, and that we wanted to verify using an ADF test that the
resulting residuals were stationary. How can we calculate the MacKinnon (2010)
critical values? We need several pieces of information: What kind of ADF test did
we run? What is the sample size? How many variables were in the cointegrating
equation? And what level test are we interested in? Suppose that we had estimated a
cointegrating equation with N=2 variables, that the ADF test had a constant but no
trend, that our sample size was T=100, and that we wanted to test at the 5% level.
Then it is simply a matter of looking at MacKinnon’s tables for the correct βs, and
plugging them into Eq. (12.8) to get
C(p = 0.05, T = 100) = −3.33613 + (−6.1101)/100 + (−6.823)/100² + 0/100³
= −3.33613 − 0.061101 − 0.0006823 + 0
= −3.3979133.
Example
Suppose we needed the MacKinnon critical values for an estimated first-step
regression with N=3 variables, that the ADF test had a constant but no trend, that
our sample size was T=50, and that we wanted to test at the 1% level. Using the βs
from MacKinnon’s tables, the correct critical value is
C = −4.29374 + (−14.4354)/50 + (−33.195)/50² + 47.433/50³ = −4.5953465.
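Equation (12.8) is simple enough to put into code. The sketch below is a hypothetical Python helper (not from the text); the β values are exactly the ones used in the two worked examples above, taken from MacKinnon's no-trend table.

```python
def mackinnon_cv(b_inf, b1, b2, b3, T):
    """MacKinnon (2010) response surface, Eq. (12.8):
    C(p, T) = b_inf + b1/T + b2/T^2 + b3/T^3."""
    return b_inf + b1 / T + b2 / T**2 + b3 / T**3

# N=2 variables, constant but no trend, 5% level, T=100:
cv1 = mackinnon_cv(-3.33613, -6.1101, -6.823, 0, T=100)

# N=3 variables, constant but no trend, 1% level, T=50:
cv2 = mackinnon_cv(-4.29374, -14.4354, -33.195, 47.433, T=50)

print(round(cv1, 7), round(cv2, 7))  # -3.3979133 -4.5953465
```

The same helper, evaluated at T = 49 with the first set of βs, reproduces the value (about −3.464) used in the Engle-Granger example later in the chapter.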
Example
As our final example, suppose we needed the MacKinnon critical values for an
estimated first-step regression with four variables, that the ADF test had a constant
and a trend, that our sample size was 150, and that we wanted to test at the 10%
level. Using the βs from MacKinnon’s tables, the correct critical value is
While this is straightforward, you would need to back out the estimate of β
by hand. (It is equal to the estimated coefficient on L.X divided by the estimated
coefficient on L.Y. Can you see why?) Alternatively, Engle and Granger (1987)
suggest a two-step procedure. Once we have established that the variables are
cointegrated, then
In Stata:
5 Since it is user-written and not an official Stata command, you must install it. You can do this by
typing ssc install egranger.
The first regression provides the estimate of the long-run or cointegrating equation
(the term in parentheses in Eq. (12.9)). The second equation estimates the short-run
coefficients (δ and γ ) and the speed-of-adjustment factor (α).
This approach is easy to implement, but it does have some deficiencies. First of
all, if the variables are cointegrated, then we cannot say that X causes Y or Y causes
X. Which variable should we have put on the left-hand side? By putting Y on the
left, we are assuming that any deviation from the long-run relationship would show
up directly in changes in Y rather than in X. Further, this approach only works well
when we have two variables. But what if we have more than two variables? With
three variables, X, Y and Z, there can be up to two cointegrating relationships. If
there are four variables, there can be as many as three cointegrating relationships.
Johansen’s approach is well-suited for these cases, and we turn to it in the next
section.
Engle-Granger Example
In this section, we will generate some simple cointegrated data, apply the Engle-
Granger method to verify that they are indeed cointegrated, and estimate an ECM
for these variables.
First, we generate the data. This is the same data used to create Fig. 12.1.
Let’s suppose that we were confronted by the data on X and Y, and that we didn’t
know they were cointegrated. Let’s follow the Engle-Granger procedure to test for
cointegration, and then estimate an ECM.
First we test whether the variables are I(1). We begin with Augmented Dickey-
Fuller (ADF) tests on X and Y.
The p-values for both X and Y in their levels are above 0.10, so we cannot reject
the null hypothesis that these are two random walks with drift. What if we take the
first differences?
The zero p-values from the ADF tests lead us to easily reject the null hypothesis
of a unit root for the first differences of X and of Y. Since X and Y seem to have
unit roots, but ΔX and ΔY do not, we conclude that X and Y are both I(1).
So, X and Y are integrated. Are they cointegrated? The next step is to see if there
is a linear combination of X and Y that is stationary. If so, then they are cointegrated.
So, let’s regress Y on X, and get the residuals:
The Stata output above indicates that Ŷt = 10.35 + 0.98Xt describes the long-
run relationship between X and Y; the residuals are the deviations from this long-run
relationship. This is close to what we know to be the true relationship: Y = 10 +
X + e; we know this to be true because that’s how we generated the data.
The final step is to verify that this particular linear combination of X and Y is
stationary. Recall that the residuals are equal to Yt − β̂0 − β̂1 Xt , so they are our
linear combination of X and Y. Are the residuals stationary?
We can get our test statistics from
or
These will give us the correct test statistics but the wrong critical values. Remember,
we are now working with estimates rather than data, so the usual critical values no
longer apply. We need to use the MacKinnon critical values.
To repeat, we can get our test statistic from:
For the hypothesis that the residuals are stationary, the test statistic is −6.227
or −6.23 after rounding. We did not include a constant in these regressions, since
we already included a constant in the first step.
What are our critical values? We can calculate these a couple of different ways.
First, we can look at the appropriate table in MacKinnon and plug the corresponding
values into Eq. (12.8):
C(p = 0.05, T = 49) = −3.33613 + (−6.1101)/49 + (−6.823)/49² + 0/49³
= −3.4636677.
Since the test statistic (−6.227) is greater in absolute value than the critical value
(−3.464), we reject the null hypothesis of a unit root and conclude that the residuals
are stationary.
Finally, now that we know that X and Y are cointegrated, we can estimate an
ECM model describing their short-run behavior. (Their long-run behavior was given
by the regression in the previous step.)
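The Stata output for these steps is not reproduced here, but the whole exercise can be sketched in Python. The following is a rough, numpy-only stand-in for the Stata session (variable names and the seed are arbitrary); the step-2 Dickey-Fuller regression is computed by hand, without a constant, as discussed above.

```python
import numpy as np

rng = np.random.default_rng(12345)
T = 100

# Generate cointegrated data as in the text:
# X is a random walk with drift, and Y = 10 + X + e.
X = np.cumsum(1 + rng.normal(size=T))
Y = 10 + X + rng.normal(size=T)

# Step 1: regress Y on X (with a constant) and keep the residuals.
Z = np.column_stack([np.ones(T), X])
b = np.linalg.lstsq(Z, Y, rcond=None)[0]   # b[0] near 10, b[1] near 1
u = Y - Z @ b

# Step 2: Dickey-Fuller regression on the residuals, without a constant
# (one was already included in step 1): du_t = gamma * u_{t-1} + error.
du, ulag = np.diff(u), u[:-1]
gamma = (ulag @ du) / (ulag @ ulag)
resid = du - gamma * ulag
se = np.sqrt((resid @ resid) / (len(du) - 1) / (ulag @ ulag))
t_stat = gamma / se

# Compare t_stat with the MacKinnon critical value (about -3.46 here).
# A t-statistic far below it rejects the unit root: cointegration.
print(b, t_stat)
```

Because the simulated residuals are close to white noise, the hand-computed Dickey-Fuller t-statistic comes out strongly negative, well beyond the MacKinnon critical value.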
They are integrated, but are they cointegrated? Estimate the long-run linear
relationship between X and Y, and extract the residuals:
Use a Dickey-Fuller test with MacKinnon critical values to verify that the
residuals are I(0):
Since the residuals are found to be I(0), then X and Y are cointegrated, and we
estimate an ECM model of X and Y:
The Engle-Granger test tables show the critical values for various numbers of
variables N . Don’t let this fool you into thinking that you are testing for multi-
cointegration. The Engle-Granger test still relies on a single first-step regression.
The researcher arbitrarily chooses one of the N variables as the dependent variable,
call it X1t for example, and regresses it on all other variables. With four variables,
the first Engle-Granger step in Stata is
. reg X1 X2 X3 X4
But we are still only testing whether there is one cointegrating vector: one
set of coefficients that renders the residuals from this regression stationary. The
Johansen test can actually test whether there are different (linearly independent)
sets of coefficients (combinations) of the Xs which yield stationary residuals.
6 There are many features which recommend Johansen’s (1988) approach. For example, Gonzalo
(1994) shows that Johansen’s method outperforms four rival methods—asymptotically and in small
samples—at estimating cointegrating vectors. This is the case, even when the errors are not normal
or when the correct number of lags is unknown.
12.6 Multi-Equation Models and VECMs 361
Next, we add and subtract β1 Yt−2 and Yt−2 from the right-hand side:
Re-arranging terms:
ΔYt = (β2 + β1 − I)Yt−2 + (β1 − I)(Yt−1 − Yt−2 ) + et
ΔYt = ΠYt−2 + BΔYt−1 + et , (12.18)
where
Π = β2 + β1 − I
B = β1 − I.
In long-run equilibrium, the differenced terms and the shock are zero, leaving
ΠYt−2 = 0. (12.19)
Notice that the original VAR model had two lags; the corresponding VECM
model had only one lag (in the ΔY term). This is a general relationship between
VARs and their corresponding VECMs. The VECM always has one fewer lag than
the VAR. When estimating a VECM in Stata, you specify the number of lags of the
VAR rather than the number of lags in the VECM; Stata is smart enough to know to
subtract one.
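This algebra is easy to check numerically. The sketch below (made-up coefficient matrices, nothing from the text) builds one observation of a bivariate VAR(2) and confirms that the VECM form, with Π = β2 + β1 − I and B = β1 − I as above, delivers the same ΔYt:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary VAR(2) coefficient matrices (illustrative values only).
b1 = np.array([[0.5, 0.1], [0.2, 0.4]])
b2 = np.array([[0.2, 0.0], [0.1, 0.3]])
I = np.eye(2)

# Two lagged observations and a shock.
y_lag2, y_lag1, e = rng.normal(size=2), rng.normal(size=2), rng.normal(size=2)

# VAR(2): y_t = b1 y_{t-1} + b2 y_{t-2} + e_t
y_t = b1 @ y_lag1 + b2 @ y_lag2 + e

# VECM form: dy_t = Pi y_{t-2} + B dy_{t-1} + e_t
Pi, B = b2 + b1 - I, b1 - I
dy_vecm = Pi @ y_lag2 + B @ (y_lag1 - y_lag2) + e

print(np.allclose(y_t - y_lag1, dy_vecm))  # True
```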
According to the Engle and Granger (1987) representation theorem, if the variables
in Yt are cointegrated, the VAR can be rewritten as a VECM of the form:
with
Π = αβ′ (12.22)
or
⎡ΔYt ⎤   ⎡δ1 ⎤   ⎡Γ11 Γ12 ⎤ ⎡ΔYt−1 ⎤   ⎡α1 ⎤                             ⎡e1t ⎤
⎣ΔXt ⎦ = ⎣δ2 ⎦ + ⎣Γ21 Γ22 ⎦ ⎣ΔXt−1 ⎦ + ⎣α2 ⎦ (Yt−1 − β0 − β1 Xt−1 ) + ⎣e2t ⎦ .
(12.24)
and the adjustment parameter matrix α determines how deviations from the long-run
relationship between Y and X get transferred to ΔYt and ΔXt .
10 = 1 × 10 = 2 × 5 = 3 × (10/3) = 3.1415 × (10/3.1415) = . . .
and so forth. That is, we cannot uniquely identify α̂ and β̂, because we can always
multiply and divide them by any constant c, and get a new set of numbers that
multiply to Π̂ :
10 = α̂ × β̂ = α̂c × (1/c)β̂ .
This is true for matrices and vectors, too:
Π̂ = α̂cc−1 β̂′ .
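A quick numerical check of this scaling argument, with made-up α and β vectors:

```python
import numpy as np

alpha = np.array([[0.5], [-0.25]])   # 2x1 adjustment vector (illustrative)
beta = np.array([[1.0], [-2.0]])     # 2x1 cointegrating vector (illustrative)
Pi = alpha @ beta.T                  # the 2x2 matrix they imply

# Rescaling alpha by any nonzero c, and beta by 1/c, leaves Pi unchanged:
c = 3.1415
Pi_rescaled = (alpha * c) @ (beta / c).T

print(np.allclose(Pi, Pi_rescaled))  # True
```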
Johansen’s Normalization
So, if α̂ and β̂ are not identified, what are we to do? Johansen (1988) proposed
a straightforward normalization, and this is the default in Stata. When we think
Yt−1 = β0 + β1 Xt−1 + et−1
Yt−1 − β0 − β1 Xt−1 = et−1
or
All of the above cointegrating vectors “work.” Return to the variables X and
Z from Figs. 12.2, 12.3 and 12.4. Recall that two different linear operations could
establish cointegration between X and Z: multiplying X by three, or dividing Z by
three. Suppose you multiplied X by three to establish cointegration between X’ and
Z. Now that X’ and Z are parallel, they would stay parallel if we were to multiply
X’ and Z by the same constant (c). We would be tilting the pair up or down, but they
would be tilting in parallel. And while all of the above cointegrating vectors “work”
equally well mathematically, they are not equally intuitive.
Johansen’s normalization essentially insists that we write the cointegrating
vectors like we instinctively want to: with a one in front of all of our Y variables.
With the components of β̂ pinned down like this, and with Π̂ known, then α̂ is
identified and can be backed out by: α̂ = Π̂ (β̂′)−1 .
The Engle-Granger residuals-based tests of cointegration are intuitive, and they are
well-suited to testing for one cointegrating equation between two variables. But they are
not particularly suited to finding more than one cointegrating equation, as might be
the case if we are considering systems with more than two variables. In such a case,
we need a better approach, such as the one pioneered by Søren Johansen.
Johansen developed his approach to estimating the rank of Π̂ in a series of papers.
In Johansen (1988), he developed his eigenvalue tests for the case where there are
no constants or seasonal dummy variables in the long-run cointegrating equations.
Ultimately, including these dummy terms is important for empirics and affects the
distribution of the relevant test statistics. Johansen (1991) showed how to include
these important terms. Johansen (1995b) expanded these tests to include the case
where the variables are I(2) rather than I(1).7 Johansen (1994) summarizes these
results in slightly less technical language.
The cointegrating equations (i.e. the long-run relationships) between the vari-
ables in Eq. (12.21) are all contained in ΠYt−k . How many cointegrating equations
are there? In matrices, this is equivalent to asking: what is r, the rank of the matrix
Π?
If there are n variables in the system, then there could be anywhere from zero
to n − 1 linearly independent cointegrating equations. If there are none, then these
variables aren’t cointegrated, and we should just estimate a VAR. The VAR will be
in levels if the variables in Yt are I(0); the VAR will be in first-differences if the
variables are all I(1).
If you have n variables that are I(1), then you can have up to n − 1 linearly
independent cointegrating vectors between them. Why not n cointegrating vectors?
You can’t have n linearly independent cointegrating vectors between n variables. If
we had two variables, Xt and Yt , then the residuals from
. reg X Y
might be stationary. And if that is the case, then so will be the residuals from
. reg Y X
That is, if et = Yt − a − bXt is stationary, then you could just rewrite this equation
as −et /b = Xt + a/b − Yt /b, which will also be stationary. (Can you see why? You
are asked to prove this as an exercise.) To know one equation is to know the other.
They are just linear recombinations of the other.
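A one-line numeric check of the rearrangement, with made-up values of a, b, X, and Y:

```python
# If e = Y - a - b*X, then dividing through by -b gives
# -e/b = X + a/b - Y/b: the "two" cointegrating equations are the same line.
a, b = 2.0, 3.0          # illustrative coefficients
Y, X = 10.0, 1.5         # illustrative observations
e = Y - a - b * X
print(X + a / b - Y / b, -e / b)  # the two sides agree
```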
Moreover, if each shock in the system is a separate unit-root process, then the
variables cannot be cointegrated. In a two-variable system, we would need one, say,
X, to be a unit root, and Y to depend on X. If each is driven by its own unit-root
shock, then they will drift independently of each other. They would not be cointegrated.
The interesting case for a chapter on VECMs is when the number of cointegrating
equations, r, is at least one and less than n.
Johansen provides two different tests for the rank of Π: the trace test and the
maximum eigenvalue test.
They aren’t just different test statistics for the same hypothesis. They are different
procedures that test different hypotheses. And, in practice, they often lead the
researcher to different conclusions. This is unfortunate, but a fact of econometric
life. Ultimately, you should choose the specification that yields economically
reasonable results.
The mathematics behind each of these tests can be rather complicated and beyond
the scope of this introductory book. Instead, we’ll outline the two test procedures
below, and illustrate with an example.
7 We do not consider the I(2) case in this book. A workable but incomplete solution is to difference
the I(2) variables once to render them I(1) and then follow the procedures as outlined below.
Any statistical test rests on some assumptions. Johansen’s tests are no exception.
They both build upon variations of a specific form of VECM model:
Recall that Π = αβ′ . If we allow a constant and trend along with the adjustment
parameter, then
Thus, Johansen allows for drift (via the constant terms μ and γ ) and deterministic
trend (via ρt and τ t). If the differenced variables follow a linear trend, then the
un-differenced variables will follow a quadratic trend.
Each of Johansen’s two tests has five variations; the variations rely on different
restrictions on these trend and drift terms: case 1 allows no deterministic terms at
all; case 2 restricts the constant to lie in the cointegrating equations; case 3 allows
an unrestricted constant; case 4 adds a trend restricted to the cointegrating
equations; and case 5 allows an unrestricted trend.
Which case should you use? You’ll have to look at the data and verify whether
the various assumptions (zero mean, etc. . . ) seem reasonable. That said, cases 1 and
5 are not really used in practice. Case 1 is rather extreme, in that it would require
that all variables have a mean of zero, but how often do economies exhibit zero or
negative growth? Or zero inflation and deflation? Case 5 is also extreme in the sense
that the levels follow a quadratic trend. But even exponential growth of, say, your
bank account at a fixed interest rate, is linear (in logarithms).8
The website for the software program EViews recommends that: “As a rough
guide, use case 2 if none of the series appear to have a trend. For trending series,
use case 3 if you believe all trends are stochastic;9 if you believe some of the series
are trend stationary, use case 4.”
For macroeconomic variables (GDP and its components, the price level, etc.)
and financial asset prices (stock prices, bond prices, etc.), Zivot and Wang (2007)
recommend using case 3, as the assumption of deterministic growth in these
variables is untenable (GDP doesn’t have to grow at a specific deterministic
amount).
Thus far we know that there are two different Johansen cointegration tests, and
we have established that there are five (but in economics and finance, really three)
different test statistics for these tests that we might consider. Presuming we know
which case we are dealing with, how can we actually carry out either one of the
Johansen tests? We turn to this right now.
Both tests rely on the eigenvalues of the Π matrix. Why? Recall that Π =
αβ′ includes β, the matrix of cointegrating coefficients. If the n variables are
cointegrated, then there can be up to n − 1 linearly independent cointegrating
relationships between them. The number of cointegrating relationships is equal to
the number of nonzero eigenvalues of Π. Likewise, if there is cointegration, then Π is not of full rank;
it will have a rank of r < n. A square matrix that is not of full rank has a
determinant of zero. Further, the determinant of a matrix is equal to the product of its
eigenvalues. Thus, if there is at least one eigenvalue that is zero, the determinant is
zero. Likewise, if we add eigenvalues and one of them is zero, then the sum wouldn’t
increase. The two tests essentially ask: at what point are we adding or multiplying
zero eigenvalues? This will reveal the rank of Π and thereby will also reveal r, the
number of cointegrating relationships.
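A small sketch makes the eigenvalue logic concrete. With a made-up Π = αβ′ built from a single cointegrating vector, one eigenvalue is (numerically) zero, the determinant (the product of the eigenvalues) is zero, and the rank is one:

```python
import numpy as np

# A 2x2 Pi = alpha beta' built from one cointegrating vector: rank 1.
alpha = np.array([[0.4], [-0.1]])    # illustrative adjustment vector
beta = np.array([[1.0], [-2.0]])     # illustrative cointegrating vector
Pi = alpha @ beta.T

eigvals = np.linalg.eigvals(Pi)
print(sorted(abs(eigvals)))          # one eigenvalue is essentially zero
print(np.linalg.det(Pi))             # ~0: the product of the eigenvalues
print(np.linalg.matrix_rank(Pi))     # r = 1 cointegrating relationship
```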
8 The online help for the Eviews econometric software also warns against using cases 1 and
5 (http://www.eviews.com/help/helpintro.html#page/content/coint-Johansen_Cointegration_Test.
html). Likewise, Zivot and Wang (2007) warn against using case 1. Sjö (2008, p. 18) calls case
4 “the model of last resort” (since including a time trend in the vectors might induce stationarity) and
case 5 “quite unrealistic and should not be considered in applied work.” Thus, we are left with
cases 2 and 3 as reasonable choices.
9 i.e. the trend is due to drift from a random walk.
LR(r, r + 1) = −T ln (1 − λr+1 ) .
Notice that we start with the largest remaining eigenvalue; this tests the hypothesis
that the rank is r versus the alternative that it is r + 1.
To summarize, we slowly increase our hypothesized rank r bit by bit until we
can no longer reject the null hypothesis.
10 Dwyer (2014, p. 6) explains that the trace statistic does not refer to the trace of Π̂ but refers
instead to the “trace of a matrix based on functions of Brownian motion.” It also shares a similarity
with the trace of a matrix in that both involve the sum of terms (here, the sum of the eigenvalues);
more specifically, we sum terms −ln(1 − λ) ≈ λ when λ ≈ 0.
The test statistic is calculated and compared with the appropriate critical value. If
we reject the null that r = 0, then it must be greater than zero. But this doesn’t tell
us what r is, just that it is greater than we thought. Now we update and repeat the
process. The updated null is that r = 1 vs the alternative that r > 1. Again, we
calculate the test statistic and compare it with the critical value. Rejecting the null
means that r must be greater still. We repeat the process until we can no longer reject
the null hypothesis. Strictly speaking, we won’t have “accepted the null” (we never
“accept the null,” only “fail to reject”), but we will use it as our working assumption
and calculate the r cointegrating vectors.
The test statistic for the null hypothesis that rank = r, vs the alternative that the
rank > r is:
LR(r, n) = −T Σ_{i=r+1}^{n} ln(1 − λi ).
As with the max eigenvalue test, we slowly increase our hypothesized rank r bit
by bit until we can no longer reject the null hypothesis.
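Once the eigenvalues are estimated, both statistics are mechanical to compute. A minimal Python sketch (the sample size and eigenvalues are made up for illustration; in practice they come from Johansen's eigenvalue problem):

```python
import math

T = 100                       # sample size (illustrative)
lams = [0.40, 0.25, 0.02]     # estimated eigenvalues, largest first (made up)

def trace_stat(r):
    """Trace statistic: LR(r, n) = -T * sum_{i=r+1}^{n} ln(1 - lambda_i)."""
    return -T * sum(math.log(1 - lam) for lam in lams[r:])

def max_eig_stat(r):
    """Max eigenvalue statistic: LR(r, r+1) = -T * ln(1 - lambda_{r+1})."""
    return -T * math.log(1 - lams[r])

# Test upward from r = 0 until the statistic no longer rejects.
for r in range(len(lams)):
    print(r, round(trace_stat(r), 2), round(max_eig_stat(r), 2))
```

Note that at r = n − 1 the two statistics coincide: only one eigenvalue remains in the sum.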
Johansen in Stata
Fortunately, Stata’s vecrank command automates much of the tedium in testing
for the cointegration rank. If we had five variables (X1 through X5) that we wanted
to test for cointegration, we could type
. vecrank X1 X2 X3 X4 X5
which would show the trace statistic for the default case 3 at the 5% level.
Adding the notrace option suppresses the trace statistic. Thus, typing:
. vecrank X1 X2 X3 X4 X5, max notrace
shows Johansen’s maximum eigenvalue statistic, but not the trace statistic.
The safer bet is to ask for both statistics and compare. This is done by including
the max option and excluding the notrace option.11
Johansen Example
We now turn to an example with simulated data. First, we simulate 1000 errors:
11 It is unclear to me why Stata opted not to have trace and max options.
[Figure: time-series plot of the simulated variables X, Z, Y, and V]
Both the trace statistic and the max eigenvalue statistics indicate (correctly) that
there are two cointegrating equations (X and Y, and X and Z). We can see this by
comparing the test statistics with their corresponding critical values.
The trace statistic for r = 0 is 823, whereas the critical value is 47.21. Since
the test statistic is bigger than the critical value, we reject the null hypothesis and
“accept” the alternative (that r > 0). Then we update to a new null hypothesis (that r = 1
vs the alternative that r > 1). The trace statistic is 394, which is greater than the
critical value of 29.68, so we reject the null that r = 1. What about r = 2? The
trace statistic is 7.26, which is smaller than the critical value (15.41). Thus, we
cannot reject the null and we conclude that r = 2. Stata even indicates this for us
with an *.
What about the maximum eigenvalue test? Beginning with a null hypothesis that
r = 0 vs the alternative that r = 1, the test statistic is 429.6. Since the test statistic is
greater than the critical value of 27.07, we reject the null of r = 0. Now we update.
The test statistic for r = 1 is 386, which is greater than the critical value of 20.97,
so we reject r = 1. What about r = 2? Here, the max eigenvalue statistic is 6.02,
which is smaller than 14.07. Thus, we cannot reject the null hypothesis that r = 2.
Both tests indicate that there are two cointegrating equations. (It is not always
the case that the two tests agree.)
The next step is to estimate these long-run cointegrating equations, as well as the
short-run adjustment terms.
Yt = [Zt , Yt , Xt ]′
and
        ⎡ΔZt ⎤   ⎡Zt − Zt−1 ⎤
ΔYt = ⎢ΔYt ⎥ = ⎢Yt − Yt−1 ⎥ .
        ⎣ΔXt ⎦   ⎣Xt − Xt−1 ⎦
μ̂ = [−8.3150036, −12.248583]′ .
Economically speaking, the most important part of the output is the last Stata
table: the cointegrating equations. The first cointegrating equation expresses Z as a
function of X:
Z + 0Y − 2.999701X − 8.3150036 = 0,
or
Z = 3X + 8.32. (12.31)
0Z + Y − 1.99998X − 12.248583 = 0,
or
Y = 2X + 12.25. (12.32)
The slopes on these lines are almost identical to the true cointegrating Eqs. (12.27)
and (12.28).12
Incidentally, if we had ordered our variables differently, such as with
. vec X Y Z
12 Cointegration merely requires that a linear combination of the variables is stationary. In practical
terms, this means that the two variables can be tilted up or down until their difference is stationary.
Two parallel lines are stationary, regardless of the constant difference between them. Or, what we
care about is the slopes that establish stationarity; econometrically, we are less concerned with the
constant. Economically, the constant term seldom has practical significance.
Since VECMs have an underlying VAR, we can estimate impulse response functions
and OIRFs after estimating a VEC. The same applies for forecasting, and for cal-
culating forecast error decompositions. In fact, the same Stata command calculates
the IRFs, OIRFs, and FEVDs after estimating a VECM, so there’s no need to repeat
ourselves. The accuracy of the IRFs, OIRFs, and FEVDs often depends critically on
the lag-length chosen in estimating the underlying VAR.
How many lags should be included in the initial VAR? In other words, what is
k in Eqs. (12.20) or (12.21)? Economic theory usually doesn’t have much to say
about such questions. But it is an important question for the econometrician, as the
estimated number of cointegrating relationships is found to depend upon k.
Since a VECM comes from a VAR, it stands to reason that there is a connection
between lag-order selection for a VAR and lag-order selection for a VECM. In short,
we can use the same information criteria that we used in VARs to select the lag-
lengths in VECMs. VECMs should have one fewer lag than the levels-VAR. (One
fewer because that extra lag is captured via differencing in the VECM.)
Lütkepohl and Saikkonen (1999) recommend using some form of information
criteria for lag-order selection, as they balance the size and power tradeoffs associ-
ated with having too few or too many lags. The command in Stata for estimating k
in a VECM is the same command as for VARs: varsoc. This command calculates
the common information criteria (the Akaike information criterion being the most
common) to guide lag selection.
Unfortunately, the various information criteria often disagree as to the optimal
lag length. Researchers are quite relieved when various information criteria choose
the same lag length. If they disagree, however, which information criteria should
you look at? This is an open question, and there are trade-offs with any such choice.
As a general rule of thumb, the AIC and FPE (final prediction error) are better
suited if the aim of the VAR model is forecasting. More parsimonious models tend
to have better predictive power. If the aim, however, is in proper estimation of the
true number of lags in the data generating process, then an argument can be made
that the Schwarz IC (SBIC) or the Hannan-Quinn Information Criterion (HQIC) should
be used (Lütkepohl 2005).
It is common for practitioners to rely on the following sequential approach: use
an information criterion to determine the lag-length, and then use Johansen (1991) to
estimate the cointegrating rank. Sequential approaches such as this, though intuitive,
12.9 Cointegration Implies Granger Causality 377
tend to accumulate problems. Properly estimating the lag-length (k) affects the
ability to estimate the rank (r). So, making a small mistake early on in the lag-length
step can lead to bigger errors in the rank estimation step. Gonzalo and Pitarakis
(1998) compared the various information criteria in a Monte Carlo experiment to
test their ability to properly choose r. They find that the BIC is better able than AIC
or HQIC to identify r correctly. They find the AIC to be particularly weak. Thus,
we face a trade-off. The AIC chooses k quite well, but when it doesn’t, it has large
consequences for r.
The AIC and BIC tend to prefer few lags. Too few lags, and the errors might
be autocorrelated. The test statistics in use rely on uncorrelated errors. Thus, in
practice, people add lags until the errors are white noise. Having too few leads to
a model that is mis-specified. Anything produced from this mis-specified model,
then, is suspect, including the IRFs and variance decompositions. Adding too many
lags spreads out the available observations over too many parameters, leading to
inefficient estimates of the coefficients. These noisy estimates result in poor IRFs
and variance decompositions (Braun and Mittnik 1993), as well as poor forecasting
properties from the estimated VARs and VECMs (Lütkepohl 2005). This is an
important problem with any finite sample, but especially in small samples.
Lag-length for VARs and VECMs continues to be an active area of research.
Researchers continue to investigate the small-sample properties of the various
selection methods. Others consider methods where different equations take different
lags. Still others consider whether the lag-lengths have gaps in them, that is, whether
to include, for example, lags 1, 3 and 4 but exclude lag 2. The general goal is to
avoid estimating more parameters than necessary. Otherwise, we will waste valuable
degrees of freedom. This, in turn, will result in noisier parameter estimates,
which is the root cause of the poor forecasting ability of longer-lagged models.
In practice, most practitioners opt for an eclectic approach. Many decide to
emulate a democracy, and they choose the lag-length that is preferred by the most
information criteria. Others will estimate models with different lag-lengths and
show that their results are robust to the different lags. Ultimately, few papers are
rejected because of lag-length. Still, it is best to have a procedure and stick to
it, otherwise you might be tempted to hunt for a particular outcome. This would
invalidate your results, and, more importantly, would be unethical.
Given the close connection between VECMs and VARs, you may be wondering
whether VECMs have a connection with Granger causality. Indeed, they do. If two
variables X and Y are cointegrated, then there must exist Granger causality in at
least one direction. That is, X must Granger-cause Y, or Y must Granger-cause X,
or both (Granger 1988). That is, VECMs imply Granger causality. The converse
does not always hold, however: Granger causality does not imply that there
exists some linear combination of variables that is stationary.
A VECM can be used as the basis for a Granger causality test. However, this
is not recommended. Instead, estimate a VAR model in levels using the Toda
and Yamamoto (1995) procedure. (For a refresher, refer to Sect. 11.4 on “VARs
with Integrated Variables” where this procedure is laid out.) If the variables are
integrated—regardless of whether the variables are also cointegrated—use the Toda-
Yamamoto procedure to test for Granger causality. Then proceed with tests of
cointegration. Recall that cointegration implies Granger causality, so if you did not
find causality in the first stage, this would provide some evidence that you do not
have cointegration, regardless of what any particular cointegration test might say.
The problem of “pre-testing” arises when testing for cointegration first, then
estimating a VECM, and then testing for causality from the VECM. In such a
case, the Granger causality test and its test statistics are contingent on the estimate
of the previous cointegration test. The usual test statistics for Granger causality
do not reflect this pre-testing.13 Clarke and Mirza (2006) find that pre-testing for
cointegration results in a bias toward finding Granger causality where none exists.
To repeat, if you are interested in Granger causality, estimate a VAR augmented
with additional lags, as suggested by Toda and Yamamoto (1995), and test for
Granger causality. After this, proceed to test for cointegration and estimate a VECM
using Johansen's method.
13 I am indebted to David Giles and his popular "Econometrics Beat" blog for bringing this and
the Toda-Yamamoto procedure to my attention. The blog piece can be found at: http://davegiles.
blogspot.com/2011/10/var-or-vecm-when-testing-for-granger.html. Readers are encouraged to
read the cited references in that blog entry, especially the work by Clarke and Mirza (2006).
12.10 Conclusion
12.11 Exercises
1. Consider the following data-generating process:

   Yt = 10 + Xt + et
   Xt = 1 + Xt−1 + εt
   et ∼ iid N(0, 1)
   εt ∼ iid N(0, 1),
as in Fig. 12.1. Generate 100 observations of this data. Graph these two
variables, and verify visually that they seem cointegrated. Estimate the long-run
relationship between them and verify that the cointegrating vector is [1, −1]′.
Now generate Y∗t = 2Yt and X∗t = 2Xt. Graph these two new variables. Verify
graphically that X∗t and Y∗t are cointegrated. Perform Engle-Granger two-step
tests to verify formally that Xt and Yt are cointegrated, and that X∗t and Y∗t are
cointegrated. If X∗t and Y∗t are cointegrated, then we have shown that [2, −2]′ is
also a valid cointegrating vector.
2. Suppose that et is stationary. Show that b·et is stationary, where b is a constant.
(Hint: recall that the definition of stationarity requires that E(et ) and V (et ) not
be functions of t.)
3. Calculate the MacKinnon (2010) critical values for an estimated first-step
regression with the following characteristics:
(a) 5 variables; with a sample size of 200 observations; that the ADF test had a
constant but no trend; and that we wanted to test at the 1% level.
(b) 7 variables; with a sample size of 100 observations; that the ADF test had a
constant and a linear trend; and that we wanted to test at the 5% level.
(c) 9 variables; with a sample size of 50 observations; that the ADF test had a
constant and a trend; and that we wanted to test at the 10% level.
4. Suppose you used Stata to estimate a VEC model on X and Y. Write out the
estimated equations in matrix notation, using Eq. (12.24) as a guide. Do any of
the estimated coefficients look out of line? Explain.
5. Suppose you used Stata to estimate a VEC model on X and Z. Write out the
estimated equations in matrix notation, using Eq. (12.24) as a guide. Do any of
the estimated coefficients look out of line? Explain.
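The data-generating process in the first exercise is easy to simulate. The sketch below is in Python rather than Stata (the book's own language), with hypothetical function names; it generates the two series and runs the first Engle-Granger step (an OLS regression of Y on X) to recover the long-run relationship:

```python
import random

def simulate_cointegrated(n=100, seed=1):
    """X is a random walk with drift; Y = 10 + X + e is cointegrated with X."""
    rng = random.Random(seed)
    x, xs, ys = 0.0, [], []
    for _ in range(n):
        x = 1.0 + x + rng.gauss(0.0, 1.0)          # X_t = 1 + X_{t-1} + eps_t
        xs.append(x)
        ys.append(10.0 + x + rng.gauss(0.0, 1.0))  # Y_t = 10 + X_t + e_t
    return xs, ys

def ols(x, y):
    """Slope and intercept of y regressed on x (first Engle-Granger step)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    beta = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
            / sum((xi - mx) ** 2 for xi in x))
    return beta, my - beta * mx

xs, ys = simulate_cointegrated()
beta, alpha = ols(xs, ys)
# beta should be close to 1 and alpha close to 10, i.e. a
# cointegrating vector near [1, -1].
```

The second Engle-Granger step, not shown, would apply an ADF-type test to the residuals ys − alpha − beta·xs.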
13 Conclusion
In this text, we have explored some of the more common time-series econometric
techniques. The approach has centered around developing a practical knowledge of
the field, learning by replicating basic examples and seminal research. But there is
a lot of bad research out there, and you would be best not to replicate the worst
practices of the field. A few words of perspective and guidance might be useful.
First, all data are historical. A regression may only reveal a pattern in a dataset.
That dataset belongs to a particular time period. Perhaps our regression used the
latest data, say, quarterly data from 2018 back to 1980. This is a healthy-sized dataset for a
macro-econometrician. But when we perform our tests of statistical significance,
we make several mistakes.
We claim to make statements about the world, as though the world and its
data do not change. We might have tested the Quantity Theory of Money and found
that it held. But there is no statistical basis for claiming that the Quantity Theory
will, therefore, always apply. We are inclined to believe it. And I do, in fact, believe
that the Quantity Theory has a lot of predictive power. But it requires faith to take a
result based on historical data and extend it to the infinite future. The past may not
be prologue.
We do have methods to test whether the “world has changed.” We tested for this
when we studied “structural breaks.” But we cannot predict, outside of our sample
dataset, whether or when a structural break will occur in the future. We don’t have
crystal balls. The extent to which we can make predictions about the future is limited
by the extent to which the world doesn’t change, underneath our feet, unexpectedly.
We economists often portray ourselves as somehow completely objective in
our pursuits. That we let the data speak for themselves. That we are armed with
the scientific method—or at least with our regressions—and can uncover enduring
truths about the world. A bit of modesty is in order.
We write as though it is the data, and not us, doing the talking. This is nonsense.
Econometricians are human participants in this process. We do have an effect.
Although we might let significance tests decide which of ten variables should be
included in a regression, we did, after all, choose the initial ten. Different models
with different variables give different results. Reasonable economists often differ.
In fact, when it comes to econometric results, it feels as though we rarely agree.
When your final draft is finally published, you will be making a claim about
the world. That your analysis of the data reveals something about how the world
was, is, or will be. Regardless, you will be presenting your evidence in order to
make a claim. All of your decisions and statistical tests along the way will shape the
credibility of your argument. Weak arguments use inaccurate or non-representative
data, have small samples, perform no statistical tests of underlying assumptions,
perform the improper tests of significance, perform the tests in the improper order,
and confuse statistical with practical significance. Such papers are not believable.
Unfortunately, they do get published.
Why is p-hacking so prevalent? The simple answer is: to get published. A deeper
answer may be that econometrics is a form of rhetoric. Of argumentation. Dressing
up a theory in mathematical clothes, and presenting statistically significant results,
is how we make our cases.
It is impossible for econometricians to be logical positivists. They cannot hold
an agnostic position, only taking a stance once all the data have been analyzed.
Rather, they have their beliefs and ideologies. They might construct a theory. Then
they might test it. But someone has to believe in the theory in the first place for the
tests to be worth the effort (Coase 1982). Which means that economists are not as
unbiased as they claim or believe themselves to be:
These studies, both quantitative and qualitative, perform a function similar to that of
advertising and other promotional activities in the normal products market. . . These studies
demonstrate the power of the theory, and the definiteness of the quantitative studies enables
them to make their point in a particularly persuasive form. What we are dealing with is a
competitive process in which purveyors of the various theories attempt to sell their wares
(Coase 1982, p. 17).
This also means that econometricians don't test theories objectively as much as they
try to illustrate the theories they already have. This is unfortunate, but it is not an
uncommon practice. Certainly, repeated failed attempts to illustrate a theory will
lead economists to modify the theory; but they rarely abandon it, and then only if an
alternative theory can replace it (Coase 1982).
Like it or not, economists engage in argumentation and rhetoric as much as they
engage in science.
“Hardly anyone takes data analyses seriously. Or perhaps more accurately, hardly
anyone takes anyone else’s data analyses seriously” (Leamer 1983, p. 37).
Why?
“If you torture the data long enough, it will confess [to anything]” answers the
famous quip attributed to Ronald Coase. Unfortunately, some scholars view the
research process as a hunt for low p-values and many asterisks.
As Andrew Gelman puts it, econometrics is a so-called “garden of forking paths”
that invalidates most hypothesis tests (ex: Gelman 2016, 2017; Gelman and Loken
2013, 2014). The econometrician is forced, from the outset, to make a series of
choices. As Leamer put it:
The econometric art as it is practiced at the computer terminal involves fitting many, perhaps
thousands, of statistical models. One or several that the researcher finds pleasing are selected
for reporting purposes. This searching for a model is often well intentioned, but there can
be no doubt that such a specification search invalidates the traditional theories of inference.
The concepts of unbiasedness, consistency, efficiency, maximum-likelihood estimation, in
fact, all the concepts of traditional theory, utterly lose their meaning by the time an applied
researcher pulls from the bramble of computer output the one thorn of a model he likes best,
the one he chooses to portray as a rose (Leamer 1983, p. 36).
Statistical tests rely on the laws of probability. But “research” of the sort Leamer
describes is analogous to flipping a coin, even a fair and unbiased coin, repeatedly
until it lands on Heads, and then claiming that 100% of the flips that you report are
Heads!
A dataset can be analyzed in so many different ways. . . that very little information is
provided by the statement that a study came up with a p < 0.05 result. The short version is
that it’s easy to find a p < 0.05 comparison even if nothing is going on, if you look hard
enough. . . This problem is sometimes called ‘p-hacking’ or ‘researcher degrees of freedom.’
(Gelman and Loken 2013, p. 1)
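The arithmetic behind that warning is easy to verify by simulation. The sketch below (in Python; a stylized illustration, not taken from the book) runs twenty z-tests on pure noise and keeps only the smallest p-value. Even though every null hypothesis is true, a "significant" result turns up roughly 1 − 0.95²⁰ ≈ 64% of the time:

```python
import math
import random

def p_value(sample):
    """Two-sided z-test p-value for H0: mean = 0, known unit variance."""
    n = len(sample)
    z = sum(sample) / math.sqrt(n)
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

def min_p_over_paths(n_tests=20, n_obs=30, rng=None):
    """Run n_tests independent tests on pure noise; keep the smallest p."""
    rng = rng or random.Random()
    return min(p_value([rng.gauss(0.0, 1.0) for _ in range(n_obs)])
               for _ in range(n_tests))

rng = random.Random(42)
n_sims = 2000
hits = sum(min_p_over_paths(rng=rng) < 0.05 for _ in range(n_sims))
rate = hits / n_sims
# Every null is true, yet roughly 64% of the simulated "studies"
# (about 1 - 0.95**20 of them) report at least one p < 0.05.
```

Reporting only the winning regression out of twenty is the coin-flipping "research" Leamer describes.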
In fact, one can engage in “p-hacking” without “fishing.” That is, p-values
should be taken with a grain of salt even if you stuck with your first regression
(and didn’t go on a fishing expedition). You’re still fishing if you caught a fish on
your first cast of the line (Gelman and Loken 2013).
Modifying a hypothesis after looking at (the results of) the data is a reverse form
of p-hacking. Changing one’s hypothesis to fit the data invalidates the hypothesis
test (Gelman and Loken 2013).
Even if there is no p-hacking, p-values are often misused. Deirdre McCloskey
and Steven Ziliak examined the articles in the American Economic Review, one
of the most prestigious journals in economics, and found that approximately 70%
of the articles in the 1980s focused on statistical significance at the expense of
economic/practical significance (McCloskey and Ziliak 1996). By the 1990s, that
unfortunate statistic increased to 80% (Ziliak and McCloskey 2004).1 The “cult
of statistical significance” has caused widespread damage in economics, the social
sciences and even medicine (Ziliak and McCloskey 2008).
By 2016 the misuse of p-values became so widespread that the American
Statistical Association felt obligated to put out a statement on p-values. The ASA
reminds us that p-values are not useful if they come from cherry-picked regressions,
and that statistical significance is not the same as relevance (Wasserstein and Lazar
2016).
Some, including the 72 co-authors of Benjamin et al. (2017), have suggested that
the p < 0.05 standard be replaced with a more stringent p < 0.005. The revised
threshold for significance would make it harder to engage in p-hacking.
Too often, we mistake statistical significance for practical or economic signifi-
cance. Not all statistically significant results are important. These are not synonyms.
Statistical significance means, loosely speaking, that some kind of effect was
detectable. That doesn’t mean that it is important. For importance, you need to look
at the magnitude of the coefficients, and you need to use some human/economic
judgment. A large coefficient at the 10% level might be more important than a small
one that is significant at the 0.0001% level. How big is big enough? It depends
on your question. It depends on context. Statistics can supply neither context nor
judgment. Those are some of the things that you, as a practicing econometrician,
must bring to the table.
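The point is easy to see with a back-of-the-envelope calculation. The sketch below (Python, purely illustrative) computes the z statistic for the same tiny effect at two sample sizes; the effect's magnitude never changes, only its "significance":

```python
import math

def z_stat(effect, sd, n):
    """z statistic for a sample mean of size `effect` against a null of zero."""
    return effect / (sd / math.sqrt(n))

# The same 0.01 effect (with unit standard deviation) is nowhere near
# "significant" with 100 observations, but overwhelming with 4 million.
small_n = z_stat(0.01, 1.0, 100)        # z = 0.1
large_n = z_stat(0.01, 1.0, 4_000_000)  # z = 20.0
```

Whether a coefficient of 0.01 matters is an economic question; no sample size can answer it.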
Statistical significance is not the same thing as 'practical significance' or
'oomph':
[A] variable has oomph when its coefficient is large, its variance high, and its character
exogenous, all decided by quantitative standard in the scientific conversation. A small
coefficient on an endogenous variable that does not move around can be statistically
significant, but it is not worth remembering. (McCloskey 1992, p. 360)
Statistical significance really just focuses on sample size. With enough observa-
tions any coefficient becomes statistically significant (McCloskey 1985, p. 202).
Huff (2010, p. 138), in his sarcastically titled classic, How to Lie with Statistics,
remarked that “Many a statistic is false on its face. It gets by only because the magic
of numbers brings about a suspension of common sense.” You are implored to keep
your wits about you. Does the number jibe with your common sense? With your
trained professional intuition? If not, you should be skeptical.
So, what is to be done?
You will need to convince your readers that you have not cherry-picked a
regression. Always begin by graphing your data. As anyone who has worked
through Anscombe's quartet can testify, a graph can often reveal a pattern in data
that standard techniques would miss (ex: Anscombe 1973). The problem is that if
you stare long enough at anything, you'll begin seeing patterns when none exist.
1 Neither I nor McCloskey and Ziliak have run the relevant hypothesis tests, but such large numbers
have large practical implications: the profession has neglected to consider whether an effect is
worth worrying over. For an interesting response to Ziliak and McCloskey on the usefulness of
p-values, see Elliott and Granger (2004).
Don’t practice uncritical cook-book econometrics. Begin with a good theory for
why two variables might be related. Don’t work in the other direction, letting your
coefficients determine what theory you’re pitching. Be able to explain why two
variables might be related.
If you will report regression coefficients, you should show whether and "how an
inference changes as variables are added to or deleted from the equation" (Leamer
1983, p. 38). It is now standard practice to report sets of results with slightly
different sets of variables. Most journals today demand at least this rudimentary
level of robustness.
Post your data and your code. Let people play with your data and models so
that they can see you aren’t pulling any fast ones. Let them look up your sleeve, as
it were. Papers that do not invite, or even encourage, replication should be treated
with suspicion.
Follow the advice of Coase and McCloskey and never forget to answer the
most important question: so what?! Pay attention to the size of your coefficients.
A statistically significant result doesn’t mean much more than that you are able to
detect some effect. It has nothing to say about whether an effect is worth worrying
over.
I recommend the practicing econometrician practice a bit of humility. Your
results are never unimpeachable, your analysis is never perfect, and you will never
have the final word.
Correction to: Time Series Econometrics
Correction to:
J. D. Levendis, Time Series Econometrics,
Springer Texts in Business and Economics,
https://doi.org/10.1007/978-3-319-98282-3
The book was inadvertently published without the data sets and without a blurb on
the cover; both have now been corrected:
Blurb: Professor Simon Lee, from Columbia University, Co-Editor of Econometric
Theory and Associate Editor of Econometrics Journal
“How to best start learning time series econometrics? Learning by doing. This is
the ethos of this book. What makes this book useful is that it provides numerous
worked out examples along with basic concepts. It is a fresh, no-nonsense, practical
approach that students will love when they start learning time series econometrics. I
recommend this book strongly as a study guide for students who look for hands-on
learning experience.”
Table A.1 Engle and Yoo critical values for the co-integration test
Number of var’s Sample size Significance level
N T 1% 5% 10%
1a 50 2.62 1.95 1.61
100 2.60 1.95 1.61
250 2.58 1.95 1.62
500 2.58 1.95 1.62
∞ 2.58 1.95 1.62
1b 50 3.58 2.93 2.60
100 3.51 2.89 2.58
250 3.46 2.88 2.57
500 3.44 2.87 2.57
∞ 3.43 2.86 2.57
2 50 4.32 3.67 3.28
100 4.07 3.37 3.03
200 4.00 3.37 3.02
3 50 4.84 4.11 3.73
100 4.45 3.93 3.59
200 4.35 3.78 3.47
4 50 4.94 4.35 4.02
100 4.75 4.22 3.89
200 4.70 4.18 3.89
5 50 5.41 4.76 4.42
100 5.18 4.58 4.26
200 5.02 4.48 4.18
a Critical values of τ̂.
b Critical values of τ̂μ. Both cited from Fuller (1976, p. 373), used with permission from Wiley.
Reprinted from Engle, Robert F. and Byung Sam Yoo (1987), Forecasting and testing in co-
integrated systems, Journal of Econometrics 35(1): 143–159; used with permission from Elsevier.
Table A.2 Engle and Yoo critical values for a higher order system
Number of var's Sample size Significance level
N T 1% 5% 10%
2 50 4.12 3.29 2.90
100 3.73 3.17 2.91
200 3.78 3.25 2.98
3 50 4.45 3.75 3.36
100 4.22 3.62 3.32
200 4.34 3.78 3.51
4 50 4.61 3.98 3.67
100 4.61 4.02 3.71
200 4.72 4.13 3.83
5 50 4.80 4.15 3.85
100 4.98 4.36 4.06
200 4.97 4.43 4.14
Reprinted from Engle, Robert F. and Byung Sam Yoo (1987),
Forecasting and testing in co-integrated systems, Journal of
Econometrics 35(1): 143–159; used with permission from
Elsevier.
Table A.3 MacKinnon critical values for the no trend case (τnc and τc )
N Variant Level (%) Obs. β∞ (s.e.) β1 β2 β3
1 τnc 1 15,000 −2.56574 (0.000110) −2.2358 −3.627
1 τnc 5 15,000 −1.94100 (0.000740) −0.2686 −3.365 31.223
1 τnc 10 15,000 −1.61682 (0.000590) 0.2656 −2.714 25.364
1 τc 1 15,000 −3.43035 (0.000127) −6.5393 −16.786 −79.433
1 τc 5 15,000 −2.86154 (0.000068) −2.8903 −4.234 −40.040
1 τc 10 15,000 −2.56677 (0.000043) −1.5384 −2.809
2 τc 1 15,000 −3.89644 (0.000102) −10.9519 −22.527
2 τc 5 15,000 −3.33613 (0.000056) −6.1101 −6.823
2 τc 10 15,000 −3.04445 (0.000044) −4.2412 −2.720
3 τc 1 15,000 −4.29374 (0.000123) −14.4354 −33.195 47.433
3 τc 5 15,000 −3.74066 (0.000067) −8.5631 −10.852 27.982
3 τc 10 15,000 −3.45218 (0.000043) −6.2143 −3.718
4 τc 1 15,000 −4.64332 (0.000101) −18.1031 −37.972
4 τc 5 15,000 −4.09600 (0.000055) −11.2349 −11.175
4 τc 10 15,000 −3.81020 (0.000043) −8.3931 −4.137
5 τc 1 15,000 −4.95756 (0.000101) −21.8883 −45.142
5 τc 5 15,000 −4.41519 (0.000055) −14.0406 −12.575
5 τc 10 15,000 −4.41315 (0.000043) −10.7417 −3.784
6 τc 1 15,000 −5.24568 (0.000124) −25.6688 −57.737 88.639
6 τc 5 15,000 −4.70693 (0.000068) −16.9178 −17.492 60.007
6 τc 10 15,000 −4.42501 (0.000054) −13.1875 −5.104 27.877
A Tables of Critical Values 391
Table A.4 MacKinnon critical values for the linear trend case
N Level (%) Obs. β∞ (s.e.) β1 β2 β3
1 1 15,000 −3.95877 (0.000122) −9.0531 −28.428 −134.155
1 5 15,000 −3.41049 (0.000066) −4.3904 −9.036 −45.374
1 10 15,000 −3.12705 (0.000051) −2.5856 −3.925 −22.380
2 1 15,000 −4.32762 (0.000099) −15.4387 −35.679
2 5 15,000 −3.78057 (0.000054) −9.5106 −12.074
2 10 15,000 −3.49631 (0.000053) −7.0815 −7.538 21.892
3 1 15,000 −4.66305 (0.000126) −18.7688 −49.793 104.244
3 5 15,000 −4.11890 (0.000066) −11.8922 −19.031 77.332
3 10 15,000 −3.83511 (0.000053) −9.0723 −8.504 35.403
4 1 15,000 −4.96940 (0.000125) −22.4594 −52.599 51.314
4 5 15,000 −4.42871 (0.000067) −14.5876 −18.228 39.647
4 10 15,000 −4.14633 (0.000054) −11.2500 −9.873 54.109
5 1 15,000 −5.25276 (0.000123) −26.2183 −59.631 50.646
5 5 15,000 −4.71537 (0.000068) −17.3569 −22.660 91.359
5 10 15,000 −4.43422 (0.000054) −13.6078 −10.238 76.781
6 1 15,000 −5.51727 (0.000125) −29.9760 −75.222 202.253
6 5 15,000 −4.98228 (0.000066) −20.3050 −25.224 132.030
6 10 15,000 −4.70233 (0.000053) −16.1253 −9.836 94.272
Table A.5 MacKinnon critical values for the quadratic trend case
N Level (%) Obs. β∞ (s.e.) β1 β2 β3
1 1 15,000 −4.37113 (0.000123) −11.5882 −35.819 −334.047
1 5 15,000 −3.83239 (0.000065) −5.9057 −12.490 −118.284
1 10 15,000 −3.55326 (0.000051) −3.6596 −5.293 −63.559
2 1 15,000 −4.69276 (0.000124) −20.2284 −64.919 88.884
2 5 15,000 −4.15387 (0.000067) −13.3114 −28.402 72.741
2 10 15,000 −3.87346 (0.000052) −10.4637 −17.408 66.313
3 1 15,000 −4.99071 (0.000125) −23.5873 −76.924 184.782
3 5 15,000 −4.45311 (0.000068) −15.7732 −32.316 122.705
3 10 15,000 −4.17280 (0.000053) −12.4909 −17.912 83.285
4 1 15,000 −5.26780 (0.000125) −27.2836 −78.971 137.871
4 5 15,000 −4.73244 (0.000069) −18.4833 −31.875 111.817
4 10 15,000 −4.45268 (0.000053) −14.7199 −17.969 101.920
5 1 15,000 −5.52826 (0.000125) −30.9051 −92.490 248.096
5 5 15,000 −4.99491 (0.000068) −21.2360 −37.685 194.208
5 10 15,000 −4.71587 (0.000054) −17.0820 −18.631 136.672
6 1 15,000 −5.77379 (0.000126) −34.7010 −105.937 393.991
6 5 15,000 −5.24217 (0.000067) −24.2177 −39.153 232.528
6 10 15,000 −4.96397 (0.000054) −19.6064 −18.858 174.919
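The entries in Tables A.3, A.4, and A.5 are MacKinnon's response-surface coefficients: an approximate critical value for a sample of size T is obtained as CV(T) = β∞ + β1/T + β2/T² + β3/T³, as in Exercise 3 of Chap. 12. A minimal sketch in Python (the book itself works in Stata) evaluates this for one row of Table A.3:

```python
def mackinnon_cv(beta_inf, b1, b2=0.0, b3=0.0, T=100):
    """Response-surface approximation to a critical value:
    CV(T) = beta_inf + b1/T + b2/T**2 + b3/T**3."""
    return beta_inf + b1 / T + b2 / T**2 + b3 / T**3

# One variable, constant but no trend (tau_c), 5% level, T = 100,
# using the corresponding row of Table A.3.
cv = mackinnon_cv(-2.86154, -2.8903, -4.234, -40.040, T=100)
# cv is roughly -2.891 for T = 100
```

Rows with blank β2 or β3 entries simply drop those terms (the defaults of zero above).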
Acemoglu, D., Johnson, S., & Robinson, J. A. (2000). The colonial origins of comparative
development: An empirical investigation (Technical report), National Bureau of Economic
Research.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on
Automatic Control, 19(6), 716–723.
Amisano, G., & Giannini, C. (2012). Topics in structural VAR econometrics. Berlin: Springer
Science and Business Media.
Ando, A., Modigliani, F., & Rasche, R. (1972). Appendix to part 1: Equations and definitions of
variables for the FRB-MIT-Penn econometric model, November 1969. In Econometric models
of cyclical behavior (Vols. 1 and 2, pp. 543–598). New York: National Bureau of Economic
Research.
Angrist, J. D., & Pischke, J.-S. (2017). Undergraduate econometrics instruction: Through our
classes, darkly (Technical report), National Bureau of Economic Research.
Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17–21.
Ashworth, J., & Thomas, B. (1999). Patterns of seasonality in employment in tourism in the UK.
Applied Economics Letters, 6(11), 735–739.
Azevedo, J. P. (2011). WBOPENDATA: Stata module to access world bank databases. Statistical
Software Components S457234, Boston College Department of Economics.
Bai, J., & Perron, P. (1998). Estimating and testing linear models with multiple structural changes.
Econometrica, 66, 47–78.
Bai, J., & Perron, P. (2003). Computation and analysis of multiple structural change models.
Journal of Applied Econometrics, 18(1), 1–22.
Bai, J., Lumsdaine, R. L., & Stock, J. H. (1998). Testing for and dating common breaks in
multivariate time series. The Review of Economic Studies, 65(3), 395–432.
Baillie, R. T., & DeGennaro, R. P. (1990). Stock returns and volatility. Journal of Financial and
Quantitative Analysis, 25(2), 203–214.
Banerjee, A., Dolado, J. J., Galbraith, J. W., & Hendry, D. (1993). Co-integration, error correction,
and the econometric analysis of non-stationary data. Oxford: Oxford University press.
Banerjee, A., Lumsdaine, R. L., & Stock, J. H. (1992). Recursive and sequential tests of the unit-
root and trend-break hypotheses: Theory and international evidence. Journal of Business and
Economic Statistics, 10(3), 271–287.
Baum, C. (2015). Zandrews: Stata module to calculate Zivot-Andrews unit root test in presence of
structural break. https://EconPapers.repec.org/RePEc:boc:bocode:s437301
Beaulieu, J. J., & Miron, J. A. (1990). A cross country comparison of seasonal cycles and business
cycles (Technical report), National Bureau of Economic Research.
Beaulieu, J. J., & Miron, J. A. (1993). Seasonal unit roots in aggregate US data. Journal of
Econometrics, 55(1–2), 305–328.
Benjamin, D., Berger, J., Johannesson, M., Nosek, B., Wagenmakers, E., Berk, R., et al. (2017).
Redefine statistical significance (Technical report), The Field Experiments Website.
Blanchard, O. J., & Quah, D. (1989). The dynamic effects of aggregate demand and supply
disturbances. American Economic Review, 79(4), 655–673.
Bloomfield, P. (2004). Fourier analysis of time series: An introduction. New York: Wiley.
Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econo-
metrics, 31(3), 307–327.
Bollerslev, T. (1987). A conditionally heteroskedastic time series model for speculative prices and
rates of return. The Review of Economics and Statistics, 69, 542–547.
Bollerslev, T., Engle, R. F., & Wooldridge, J. M. (1988). A capital asset pricing model with time-
varying covariances. Journal of Political Economy, 96(1), 116–131.
Bollerslev, T., Chou, R. Y., & Kroner, K. F. (1992). ARCH modeling in finance: A review of the
theory and empirical evidence. Journal of Econometrics, 52(1–2), 5–59.
Box, G. E., & Jenkins, G. M. (1976). Time series analysis: Forecasting and control (revised ed.).
Oakland: Holden-Day.
Brandt, P. T., & Williams, J. T. (2007). Multiple time series models. Quantitative Applications in
the Social Sciences (Vol. 148). Thousand Oaks, CA: Sage.
Braun, P. A., & Mittnik, S. (1993). Misspecifications in vector autoregressions and their effects on
impulse responses and variance decompositions. Journal of Econometrics, 59(3), 319–341.
Brooks, C. (2014). Introductory econometrics for finance. Cambridge: Cambridge University
Press.
Byrne, J. P., & Perman, R. (2007). Unit roots and structural breaks: A survey of the literature. In
B. B. Rao, (Ed.), Cointegration for the applied economist (2nd ed., pp. 129–142). New York:
Palgrave Macmillan.
Campbell, J. Y., & Mankiw, N. G. (1987a). Are output fluctuations transitory? The Quarterly
Journal of Economics, 102(4), 857–880.
Campbell, J. Y., & Mankiw, N. G. (1987b). Permanent and transitory components in macroeco-
nomic fluctuations. The American Economic Review, Papers and Proceedings, 77(2), 111–117.
Campbell, J. Y., & Perron, P. (1991). Pitfalls and opportunities: What macroeconomists should
know about unit roots. NBER Macroeconomics Annual, 6, 141–201.
Campos, J., Ericsson, N. R., & Hendry, D. F. (1996). Cointegration tests in the presence of
structural breaks. Journal of Econometrics, 70(1), 187–220.
Chamberlain, G. (1982). The general equivalence of Granger and Sims causality. Econometrica:
Journal of the Econometric Society, 50, 569–581.
Chang, S. Y., & Perron, P. (2017). Fractional unit root tests allowing for a structural change in trend
under both the null and alternative hypotheses. Econometrics, 5(1), 5.
Chatfield, C. (2016). The analysis of time series: An introduction. New York: CRC Press.
Cheung, Y.-W., & Lai, K. S. (1995a). Lag order and critical values of the augmented Dickey–Fuller
test. Journal of Business and Economic Statistics, 13(3), 277–280.
Cheung, Y.-W., & Lai, K. S. (1995b). Practitioner's corner: Lag order and critical values of a
modified Dickey-Fuller test. Oxford Bulletin of Economics and Statistics, 57(3), 411–419.
Chou, R., Engle, R. F., & Kane, A. (1992). Measuring risk aversion from excess returns on a stock
index. Journal of Econometrics, 52(1–2), 201–224.
Christiano, L. J. (1992). Searching for a break in GNP. Journal of Business and Economic
Statistics, 10(3), 237–250.
Christiano, L. J., & Eichenbaum, M. (1990). Unit roots in real GNP: Do we know, and do we
care? In Carnegie-Rochester conference series on public policy (Vol. 32, pp. 7–61). Amsterdam:
Elsevier.
Christiano, L. J., Eichenbaum, M., & Evans, C. L. (1999). Monetary policy shocks: What have we
learned and to what end? Handbook of Macroeconomics, 1, 65–148.
Clarke, J. A., & Mirza, S. (2006). A comparison of some common methods for detecting Granger
noncausality. Journal of Statistical Computation and Simulation, 76(3), 207–231.
Clemente, J., Gadea, M. D., Montañés, A., & Reyes, M. (2017). Structural breaks, inflation and
interest rates: Evidence from the G7 countries. Econometrics, 5(1), 11.
Clements, M. P., & Hendry, D. F. (1997). An empirical study of seasonal unit roots in forecasting.
International Journal of Forecasting, 13(3), 341–355.
Bibliography 397
Coase, R. H. (1982). How should economists choose? In The G. Warren Nutter lectures in political
economy (pp. 5–21). Washington, DC: The American Enterprise Institute.
Cochrane, J. H. (1991). A critique of the application of unit root tests. Journal of Economic
Dynamics and Control, 15(2), 275–284.
Cooley, T. F., & LeRoy, S. F. (1985). Atheoretical macroeconometrics: A critique. Journal of
Monetary Economics, 16(3), 283–308.
Cooper, R. L. (1972). The predictive performance of quarterly econometric models of the United
States. In Econometric models of cyclical behavior (Vols. 1 and 2, pp. 813–947). New York:
National Bureau of Economic Research.
Corbae, D., & Ouliaris, S. (1988). Cointegration and tests of purchasing power parity. The Review
of Economics and Statistics, 70, 508–511.
DeJong, D. N., Ingram, B. F., & Whiteman, C. H. (2000). A Bayesian approach to dynamic
macroeconomics. Journal of Econometrics, 98(2), 203–223.
Dickey, D. A., & Fuller, W. A. (1979). Distribution of the estimators for autoregressive time series
with a unit root. Journal of the American Statistical Association, 74(366a), 427–431.
Dickey, D. A., Jansen, D. W., & Thornton, D. L. (1991). A primer on cointegration with an
application to money and income (Technical report), Federal Reserve Bank of St. Louis.
Dicle, M. F., & Levendis, J. (2011). Importing financial data. Stata Journal, 11(4), 620–626.
Diebold, F. X. (1998). The past, present, and future of macroeconomic forecasting. Journal of
Economic Perspectives, 12(2), 175–192.
Doan, T., Litterman, R., & Sims, C. (1984). Forecasting and conditional projection using realistic
prior distributions. Econometric Reviews, 3(1), 1–100.
Dolado, J. J., & Lütkepohl, H. (1996). Making Wald tests work for cointegrated VAR systems.
Econometric Reviews, 15(4), 369–386.
Drukker, D. M. (2006). Importing federal reserve economic data. Stata Journal, 6(3), 384–386.
Durlauf, S. N., & Phillips, P. C. (1988). Trends versus random walks in time series analysis.
Econometrica: Journal of the Econometric Society, 56, 1333–1354.
Dwyer, G. (2014). The Johansen tests for cointegration. http://www.jerrydwyer.com/pdf/Clemson/
Cointegration.pdf.
Eichenbaum, M., & Singleton, K. J. (1986). Do equilibrium real business cycle theories explain
postwar us business cycles? NBER Macroeconomics Annual, 1, 91–135.
Elliott, G. (1998). On the robustness of cointegration methods when regressors almost have unit
roots. Econometrica, 66(1), 149–158.
Elliott, G., & Granger, C. W. (2004). Evaluating significance: Comments on “size matters”. The
Journal of Socio-Economics, 33(5), 547–550.
Elliott, G., Rothenberg, T. J., & Stock, J. H. (1992). Efficient tests for an autoregressive unit root.
Working Paper 130, National Bureau of Economic Research. http://www.nber.org/papers/t0130.
Elliott, G., Rothenberg, T. J., & Stock, J. H. (1996). Efficient tests of the unit root hypothesis.
Econometrica, 64(4), 813–836.
Enders, W. (2014). Applied econometric time series (3rd edn.). New York: Wiley.
Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance
of United Kingdom inflation. Econometrica: Journal of the Econometric Society, 50, 987–1007.
Engle, R. F., & Bollerslev, T. (1986). Modelling the persistence of conditional variances.
Econometric Reviews, 5(1), 1–50.
Engle, R. F., & Granger, C. W. (1987). Co-integration and error correction: Representation,
estimation, and testing. Econometrica: Journal of the Econometric Society, 55, 251–276.
Engle, R. F., & Granger, C. W. (1991). Long-run economic relationships: Readings in cointegra-
tion. Oxford: Oxford University Press.
Engle, R. F., & Yoo, B. S. (1987). Forecasting and testing in co-integrated systems. Journal of
Econometrics, 35(1), 143–159.
Engle, R. F., Granger, C. W. J., Hylleberg, S., & Lee, H. S. (1993). Seasonal cointegration: The
Japanese consumption function. Journal of Econometrics, 55(1–2), 275–298.
398 Bibliography
Engle, R. F., Lilien, D. M., & Robins, R. P. (1987). Estimating time varying risk premia in the
term structure: The ARCH-M model. Econometrica: Journal of the Econometric Society, 55,
391–407.
Epstein, R. J. (2014). A history of econometrics. Amsterdam: Elsevier.
Fair, R. C. (1992). The Cowles Commission approach, real business cycle theories, and New-
Keynesian economics. In The business cycle: Theories and evidence (pp. 133–157). Berlin:
Springer.
Florens, J.-P., & Mouchart, M. (1982). A note on noncausality. Econometrica: Journal of the
Econometric Society, 50, 583–591.
Franses, P. H. (1991). Seasonality, non-stationarity and the forecasting of monthly time series.
International Journal of Forecasting, 7(2), 199–208.
Franses, P. H. (1996). Periodicity and stochastic trends in economic time series. Oxford: Oxford
University Press.
Franses, P. H., & Paap, R. (2004). Periodic time series models. Oxford: Oxford University Press.
French, K. R., Schwert, G. W., & Stambaugh, R. F. (1987). Expected stock returns and volatility.
Journal of Financial Economics, 19(1), 3–29.
Friedman, M. (1969). The optimum quantity of money and other essays. Chicago: Aldine Publisher.
Friedman, M., & Schwartz, A. J. (1963). A monetary history of the United States, 1867–1960.
Princeton: Princeton University Press.
Fuller, W. A. (1976). Introduction to statistical time series. New York: Wiley.
Gelman, A. (2016). The problems with p-values are not just with p-values. The American Statis-
tician, supplemental material to the ASA statement on p-values and statistical significance.
https://doi.org/10.1080/00031305.2016.1154108.
Gelman, A. (2017). The failure of null hypothesis significance testing when studying incremental
changes, and what to do about it. Personality and Social Psychology Bulletin, 44(1), 16–23.
Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a
problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis
was posited ahead of time (Technical report), Department of Statistics, Columbia University.
Gelman, A., & Loken, E. (2014). The statistical crisis in science: Data-dependent analysis—a
“garden of forking paths”—explains why many statistically significant comparisons don’t hold
up. American Scientist, 102(6), 460.
Geweke, J., & Whiteman, C. (2006). Bayesian forecasting. Handbook of Economic Forecasting,
1, 3–80.
Ghose, D., & Kroner, K. F. (1995). The relationship between GARCH and symmetric stable
processes: Finding the source of fat tails in financial data. Journal of Empirical Finance,
2(3), 225–251.
Ghysels, E. (1990). Unit-root tests and the statistical pitfalls of seasonal adjustment: The case of
US postwar real gross national product. Journal of Business and Economic Statistics, 8(2), 145–
152.
Ghysels, E., & Osborn, D. R. (2001). The econometric analysis of seasonal time series.
Cambridge: Cambridge University Press.
Ghysels, E., & Perron, P. (1993). The effect of seasonal adjustment filters on tests for a unit root.
Journal of Econometrics, 55(1–2), 57–98.
Ghysels, E., Lee, H. S., & Noh, J. (1994). Testing for unit roots in seasonal time series: Some
theoretical extensions and a Monte Carlo investigation. Journal of Econometrics, 62(2), 415–
442.
Gleick, J. (1987). Chaos: The making of a new science. New York: Viking Press.
Glosten, L. R., Jagannathan, R., & Runkle, D. E. (1993). On the relation between the expected
value and the volatility of the nominal excess return on stocks. The Journal of Finance,
48(5), 1779–1801.
Glynn, J., Perera, N., & Verma, R. (2007). Unit root tests and structural breaks: A survey with
applications. Revista de Métodos Cuantitativos para la Economía y la Empresa, 3, 63–79.
González, A., Teräsvirta, T., & Dijk, D. V. (2005). Panel smooth transition regression models
(Technical report), SSE/EFI Working Paper Series in Economics and Finance.
Lütkepohl, H., & Wolters, J. (2003). Transmission of German monetary policy in the pre-euro
period. Macroeconomic Dynamics, 7(5), 711–733.
MacKinnon, J. G. (1991). Critical values for cointegration tests, Chapter 13. In R. F. Engle & C. W.
J. Granger (Eds.), Long-run economic relationships: Readings in cointegration. Oxford: Oxford
University Press.
MacKinnon, J. G. (2010). Critical values for cointegration tests (Technical report), Queen’s
Economics Department Working Paper.
Maddala, G. S., & Kim, I.-M. (1998). Unit roots, cointegration, and structural change. Cambridge:
Cambridge University Press.
Makridakis, S., & Hibon, M. (2000). The M3-Competition: Results, conclusions and implications.
International Journal of Forecasting, 16(4), 451–476.
Makridakis, S., Andersen, A., Carbone, R., Fildes, R., Hibon, M., Lewandowski, R., et al. (1982).
The accuracy of extrapolation (time series) methods: Results of a forecasting competition.
Journal of Forecasting, 1(2), 111–153.
Makridakis, S., Chatfield, C., Hibon, M., Lawrence, M., Mills, T., Ord, K., et al. (1993). The
M2-Competition: A real-time judgmentally based forecasting study. International Journal of
Forecasting, 9(1), 5–22.
Makridakis, S., Hibon, M., & Moser, C. (1979). Accuracy of forecasting: An empirical investiga-
tion. Journal of the Royal Statistical Society. Series A (General), 142, 97–145.
Mandelbrot, B. (1963). New methods in statistical economics. Journal of Political Economy,
71(5), 421–440.
McCloskey, D. N. (1985). The loss function has been mislaid: The rhetoric of significance tests.
The American Economic Review, 75(2), 201–205.
McCloskey, D. N. (1992). The bankruptcy of statistical significance. Eastern Economic Journal,
18(3), 359–361.
McCloskey, D. N., & Ziliak, S. T. (1996). The standard error of regressions. Journal of Economic
Literature, 34(1), 97–114.
McCurdy, T. H., & Morgan, I. G. (1985). Testing the martingale hypothesis in the Deutschmark/US
dollar futures and spot markets (Technical report 639), Queen’s Economics Department
Working Paper.
Meese, R. A., & Rogoff, K. (1983a). Empirical exchange rate models of the seventies: Do they fit
out of sample? Journal of International Economics, 14(1–2), 3–24.
Meese, R. A., & Rogoff, K. (1983b). The out-of-sample failure of empirical exchange rate models:
Sampling error or misspecification? In Exchange rates and international macroeconomics
(pp. 67–112). Chicago: University of Chicago Press.
Merton, R. C. (1973). An intertemporal capital asset pricing model. Econometrica: Journal of the
Econometric Society, 41, 867–887.
Merton, R. C. (1980). On estimating the expected return on the market: An exploratory investiga-
tion. Journal of Financial Economics, 8(4), 323–361.
Milhøj, A. (1987). A conditional variance model for daily deviations of an exchange rate. Journal
of Business and Economic Statistics, 5(1), 99–103.
Murray, M. P. (1994). A drunk and her dog: An illustration of cointegration and error correction.
The American Statistician, 48(1), 37–39.
Narayan, P. K., & Popp, S. (2010). A new unit root test with two structural breaks in level and
slope at unknown time. Journal of Applied Statistics, 37(9), 1425–1438.
Naylor, T. H., Seaks, T. G., & Wichern, D. W. (1972). Box-Jenkins methods: An alternative to
econometric models. International Statistical Review/Revue Internationale de Statistique, 40,
123–137.
Nelson, C. R. (1972). The prediction performance of the FRB-MIT-Penn model of the US
economy. American Economic Review, 62(5), 902–917.
Nelson, C. R., & Kang, H. (1981). Spurious periodicity in inappropriately detrended time series.
Econometrica: Journal of the Econometric Society, 49, 741–751.
Nelson, C. R., & Plosser, C. R. (1982). Trends and random walks in macroeconomic time series:
Some evidence and implications. Journal of Monetary Economics, 10(2), 139–162.
Nelson, D. B. (1991). Conditional heteroskedasticity in asset returns: A new approach. Economet-
rica: Journal of the Econometric Society, 59, 347–370.
Newey, W. K., & West, K. D. (1986). A simple, positive semi-definite, heteroskedasticity and
autocorrelation consistent covariance matrix. Working Paper 55, National Bureau of Economic
Research. http://www.nber.org/papers/t0055.
Ng, S., & Perron, P. (1995). Unit root tests in ARMA models with data-dependent methods for the
selection of the truncation lag. Journal of the American Statistical Association, 90(429), 268–
281.
Ng, S., & Perron, P. (2001). Lag length selection and the construction of unit root tests with good
size and power. Econometrica, 69(6), 1519–1554.
Osborn, D. R. (1990). A survey of seasonality in UK macroeconomic variables. International
Journal of Forecasting, 6(3), 327–336.
Osborn, D. R., Heravi, S., & Birchenhall, C. R. (1999). Seasonal unit roots and forecasts of two-
digit European industrial production. International Journal of Forecasting, 15(1), 27–47.
Otrok, C., & Whiteman, C. H. (1998). Bayesian leading indicators: Measuring and predicting
economic conditions in Iowa. International Economic Review, 39, 997–1014.
Ozcicek, O., & McMillin, D. W. (1999). Lag length selection in vector autoregressive models:
Symmetric and asymmetric lags. Applied Economics, 31(4), 517–524.
Pankratz, A. (1983). Forecasting with univariate Box-Jenkins models: Concepts and cases. New
York: Wiley.
Pankratz, A. (1991). Forecasting with dynamic regression models. New York: Wiley.
Pedroni, P. (2001). Purchasing power parity tests in cointegrated panels. The Review of Economics
and Statistics, 83(4), 727–731.
Perron, P. (1989). The great crash, the oil price shock, and the unit root hypothesis. Econometrica:
Journal of the Econometric Society, 57(6), 1361–1401.
Perron, P. (1997). Further evidence on breaking trend functions in macroeconomic variables.
Journal of Econometrics, 80(2), 355–385.
Perron, P., & Vogelsang, T. J. (1992). Nonstationarity and level shifts with an application to
purchasing power parity. Journal of Business and Economic Statistics, 10(3), 301–320.
Pesaran, M. H. (2007). A simple panel unit root test in the presence of cross-section dependence.
Journal of Applied Econometrics, 22(2), 265–312.
Peters, E. E. (1996). Chaos and order in the capital markets (2nd edn.). New York: Wiley.
Phillips, P. C. (1986). Understanding spurious regressions in econometrics. Journal of Economet-
rics, 33(3), 311–340.
Phillips, P. C., & Perron, P. (1988). Testing for a unit root in time series regression. Biometrika,
75(2), 335–346.
Plosser, C. I., & Schwert, G. W. (1977). Estimation of a non-invertible moving average process:
The case of overdifferencing. Journal of Econometrics, 6(2), 199–224.
Qin, D. (2011). Rise of VAR modelling approach. Journal of Economic Surveys, 25(1), 156–174.
Quintos, C. E., & Phillips, P. C. (1993). Parameter constancy in cointegrating regressions.
Empirical Economics, 18(4), 675–706.
Rachev, S. T., Mittnik, S., Fabozzi, F. J., Focardi, S. M., et al. (2007). Financial econometrics:
From basics to advanced modeling techniques (Vol. 150). New York: Wiley.
Rao, B. B. (2007). Cointegration for the applied economist (2nd edn.). New York: Palgrave
Macmillan.
Runkle, D. E. (1987). Vector autoregressions and reality. Journal of Business and Economic
Statistics, 5(4), 437–442.
Said, S. E., & Dickey, D. A. (1984). Testing for unit roots in autoregressive-moving average models
of unknown order. Biometrika, 71(3), 599–607.
Saikkonen, P., & Lütkepohl, H. (2000). Testing for the cointegrating rank of a VAR process with
structural shifts. Journal of Business and Economic Statistics, 18(4), 451–464.
Sargent, T. J. (1976). The observational equivalence of natural and unnatural rate theories of
macroeconomics. Journal of Political Economy, 84(3), 631–640.
Sargent, T. J., Sims, C. A., et al. (1977). Business cycle modeling without pretending to have too
much a priori economic theory. New Methods in Business Cycle Research, 1, 145–168.
Schaffer, M. E. (2010). EGRANGER: Engle-Granger (EG) and augmented Engle-Granger (AEG)
cointegration tests and 2-step ECM estimation. http://ideas.repec.org/c/boc/bocode/s457210.
html.
Schenck, D. (2016). Long-run restrictions in a structural vector autoregression. https://blog.stata.
com/2016/10/27/long-run-restrictions-in-a-structural-vector-autoregression/.
Schorfheide, F. (2005). VAR forecasting under misspecification. Journal of Econometrics,
128(1), 99–136.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2),
461–464.
Schwert, G. W. (1989). Tests for unit roots: A Monte Carlo investigation. Journal of Business and
Economic Statistics, 7(2), 147–159.
Schwert, G. W. (2002). Tests for unit roots: A Monte Carlo investigation. Journal of Business and
Economic Statistics, 20(1), 5–17.
Sephton, P. (2017). Finite sample critical values of the generalized KPSS stationarity test.
Computational Economics, 50(1), 161–172.
Sharpe, W. F. (1964). Capital asset prices: A theory of market equilibrium under conditions of risk.
The Journal of Finance, 19(3), 425–442.
Shumway, R. H., & Stoffer, D. S. (2006). Time series analysis and its applications: With R
examples. New York: Springer Science and Business Media.
Sims, C. A. (1972). Money, income, and causality. The American Economic Review, 62(4), 540–
552.
Sims, C. A. (1980a). Comparison of interwar and postwar business cycles: Monetarism reconsid-
ered. The American Economic Review, 70(2), 250–257.
Sims, C. A. (1980b). Macroeconomics and reality. Econometrica: Journal of the Econometric
Society, 48, 1–48.
Sjö, B. (2008). Testing for unit roots and cointegration. https://www.iei.liu.se/nek/ekonometrisk-
teori-7-5-hp-730a07/labbar/1.233753/dfdistab7b.pdf.
Stock, J. H., & Watson, M. W. (1989). Interpreting the evidence on money-income causality.
Journal of Econometrics, 40(1), 161–181.
Stralkowski, C., & Wu, S. (1968). Charts for the interpretation of low order autoregressive moving
average models (Technical report 164), University of Wisconsin, Department of Statistics.
Taylor, M. P. (1988). An empirical examination of long-run purchasing power parity using
cointegration techniques. Applied Economics, 20(10), 1369–1381.
Thornton, D. L., & Batten, D. S. (1985). Lag-length selection and tests of Granger causality
between money and income. Journal of Money, Credit and Banking, 17(2), 164–178.
Toda, H. Y., & Yamamoto, T. (1995). Statistical inference in vector autoregressions with possibly
integrated processes. Journal of Econometrics, 66(1), 225–250.
Tsay, R. S. (2013). Multivariate time series analysis: With R and financial applications. New York:
Wiley.
Uhlig, H. (2005). What are the effects of monetary policy on output? Results from an agnostic
identification procedure. Journal of Monetary Economics, 52(2), 381–419.
Vigen, T. (2015). Spurious correlations. New York: Hachette Books.
Vogelsang, T. J., & Perron, P. (1998). Additional tests for a unit root allowing for a break in the
trend function at an unknown time. International Economic Review, 39, 1073–1100.
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA’s statement on p-values: Context, process, and
purpose. The American Statistician, 70(2), 129–133.
Wu, S. (2010). Lag length selection in DF-GLS unit root tests. Communications in Statistics-
Simulation and Computation, 39(8), 1590–1604.
Zakoian, J.-M. (1994). Threshold heteroskedastic models. Journal of Economic Dynamics and
Control, 18(5), 931–955.
Ziliak, S. T., & McCloskey, D. N. (2004). Size matters: The standard error of regressions in the
American Economic Review. The Journal of Socio-Economics, 33(5), 527–546.
Ziliak, S. T., & McCloskey, D. N. (2008). The cult of statistical significance: How the standard
error costs us jobs, justice, and lives. Ann Arbor: University of Michigan Press.
Zivot, E., & Andrews, D. W. K. (1992). Further evidence on the Great Crash, the oil-price shock,
and the unit-root hypothesis. Journal of Business and Economic Statistics, 10(3), 251–270.
Zivot, E., & Wang, J. (2007). Modeling financial time series with S-PLUS (Vol. 191). New York:
Springer Science and Business Media.
Index
N
Newey-West estimator, 156

O
Observationally equivalent, 264
OIRF, 300, 302, 311–316, 318–320, 323–325, 333–335, 376
Order of integration, 102, 103, 339
Order of variables, 314
Orthogonalized impulse responses, see OIRF

P
PACF, 47–77, 116, 123, 133, 134
Partial autocorrelation function, see PACF
Phillips-Perron test, 137, 140, 156–157, 168
Portmanteau test, see Ljung-Box
p-value, 117, 121, 141, 144, 147–150, 153, 218, 219, 224, 269, 339, 340, 352, 356, 357, 384–386

R
Random walk, 99, 101, 102, 104–110, 112, 116–119, 121, 124, 139–152, 158, 161–163, 166, 200, 249, 276, 343, 348, 356, 369, 370
Random walk with drift, 101, 102, 106–110, 116, 139–141, 148–150, 370
Reduced form, 264, 329–333, 335, 336, 361, 362
Residual autocorrelation
  Box-Ljung test, 217, 219, 224
  LM test, 217, 219

S
Schwarz criterion, 184
Seasonal ARIMA process, 132, 135
Seasonal ARMA process, 123–138
Seasonal differencing, 124, 126–128, 136
Seasonal unit root test, 136, 168
Shift dummy variable, 123
Shocks, 20–24, 28, 29, 33, 34, 39, 46, 137, 139, 160, 168, 169, 172–175, 183, 239, 252, 255, 257, 268–270, 284–288, 291, 300–302, 307, 309–316, 319, 326–330, 332, 333, 337, 350, 365
Short-run parameters of VECM, 340
Skewness, 81, 197
Smooth transition, 195
Spurious correlation, 117, 119
Spurious regression, 82, 117, 119
Stability condition, 87
Stable AR process, 87
Stable VAR, 282
Standardized residuals, 217, 243, 246, 253, 261
Stata commands
  ac, 69, 74, 76, 218
  arch, 252
  arima, 16–17, 21, 27, 29, 31, 33–35, 37–38, 44–46, 77
  corrgram, 70, 71, 116, 163–165, 166, 179
  dfgls, 152–155
  dfuller, 143, 146, 147, 149, 150, 152, 166–168
  drawnorm, 301, 312, 317, 319, 334
  egranger, 354, 358, 360
  estat archlm, 218, 222, 227
  esttab, 249, 256