

NESUG 2008

Posters

Time Series Analysis Using SAS Part I The Augmented Dickey-Fuller (ADF) Test
By Ismail E. Mohamed
ABSTRACT
The purpose of this series of articles is to discuss SAS programming techniques specifically designed to simulate the steps involved in time series data analysis. The first part of this series will cover the Augmented Dickey-Fuller (ADF) test of time series (stationarity test). The second part will cover cointegration and error correction. The SAS techniques presented in both parts can be used with the more complex SAS routines such as PROC ARIMA, which require a high level of research and analysis expertise (Bails & Peppers, 1982).

INTRODUCTION
Time series analysis has applications in many areas, including the study of the relationships between wages and house prices, profits and dividends, and consumption and GDP. Many analysts erroneously use the framework of ordinary least squares (OLS) linear regression to predict change over time or to extrapolate from present conditions to future conditions. Extreme caution is needed when interpreting the results of regression models estimated using time series data. Statisticians and analysts working with time series data uncovered a serious problem with standard analysis techniques applied to time series: estimating the parameters of an OLS regression model can produce statistically significant results between time series that contain a trend and are otherwise random. This finding led to considerable work on determining what properties a time series must possess if econometric techniques are to be used. One basic conclusion was that any time series used in econometric applications must be stationary (Granger and Newbold, 1974). This paper discusses a simple SAS framework to assist SAS programmers in understanding and modeling time series data as a univariate series (Eq 1).
$$Y_t = \alpha + \beta X_t + \varepsilon_t \qquad (1)$$

BASICS AND TERMINOLOGY

Time series datasets are different from other ordinary datasets in that their observations are recorded sequentially over equal time increments (daily, weekly, monthly, quarterly, annually etc). A simple example of a time series dataset (RawData) is illustrated below.

YEAR   QTR          X           Y
1987     4   -0.05294    0.067891
1988     1   -0.14696    0.063533
1988     2   -0.12600    0.065794
1988     3   -0.14656    0.060760
1988     4   -0.06056    0.062053
1989     1   -0.02644    0.057527
1989     2   -0.05778    0.049068
1989     3    0.01924    0.061497
1989     4   -0.10823    0.060421
1990     1   -0.04056    0.050771
1990     2   -0.03390    0.036702
1990     3   -0.06903    0.016959
1990     4    0.07547    0.002585

Each of x and y is called a series, while the combination of the two variables YEAR and QTR represents the sequential, equal time increments. If the x and y series are both non-stationary random processes (integrated), then modeling the x, y relationship as a simple OLS relationship as in equation 1 is likely to generate only a spurious regression. Granger and Newbold (1974) introduced the notion of a spurious regression, which they argued produces statistically significant results between series that contain a trend and are otherwise random.

Time series stationarity is a statistical characteristic of a series' mean and variance over time. If both are constant over time, the series is said to be a stationary process (i.e., it is not a random walk and has no unit root); otherwise, the series is described as a non-stationary process (i.e., a random walk with a unit root). Differencing techniques are normally used to transform a time series from non-stationary to stationary by subtracting from each datum in the series its predecessor. As such, the set of observations that corresponds to the initial time period (t) when the measurement was taken describes the series level. Differencing a series produces other sets of observations, such as the first-differenced values, the second-differenced values, and so on:

x level                    $x_t$
x 1st-differenced value    $x_t - x_{t-1}$
x 2nd-differenced value    $(x_t - x_{t-1}) - (x_{t-1} - x_{t-2})$
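The differencing operations can be sketched in a short SAS DATA step. This is a minimal illustration, not part of the original analysis: the dataset name DiffDemo and the variables x_diff1, x_diff2, and x_2period are hypothetical, and the five x values are copied from the RawData example. In SAS, DIF2(x) returns the two-period change x(t) - x(t-2), whereas applying DIF1 twice returns the difference of the first differences, so the sketch computes both for comparison.

```sas
/* Sketch only: DiffDemo and its variable names are hypothetical. */
DATA DiffDemo;
  INPUT x @@;
  x_diff1   = DIF1(x);         /* x(t) - x(t-1)                       */
  x_diff2   = DIF1(DIF1(x));   /* difference of the first differences */
  x_2period = DIF2(x);         /* x(t) - x(t-2), two-period change    */
  DATALINES;
-0.05294 -0.14696 -0.12600 -0.14656 -0.06056
;
RUN;

PROC PRINT DATA = DiffDemo;
RUN;
```

The first observation has no predecessor, so x_diff1 is missing there; x_diff2 and x_2period are missing for the first two observations.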

If a series is stationary without any differencing, it is designated I(0), or integrated of order 0. A series that has stationary first differences is designated I(1), or integrated of order 1. Stationarity of a series is an important property because it influences the series' behavior. For example, the term shock is used frequently to indicate an unexpected change in the value of a variable (or error). For a stationary series, a shock will gradually die away; that is, the effect of a shock during time t will be smaller in time t+1, still smaller in time t+2, and so on.

The data used in this paper are assumed to represent time series data. Each series in equation 1, namely x and y, must be examined at level for stationarity before proceeding to investigate the relationship between the two variables (the OLS regression analysis). In this specification, because the data used in the paper are quarterly, stationarity testing is conducted at level with up to 5 lagged periods. The stationarity test uses the Augmented Dickey-Fuller (ADF) technique (Dickey and Fuller, 1981), a generalized autoregression model formulated in the following regression equation:
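$$\Delta x_{i,t} = \rho_i\, x_{i,t-1} + \sum_{k=1}^{5} \gamma_{i,k}\, \Delta x_{i,t-k} + \varepsilon_{i,t} \qquad (2)$$

where $\rho_i$ is the coefficient on the lagged level, $\gamma_{i,k}$ are the coefficients on the lagged differences, and $\varepsilon_{i,t}$ is the error term.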

The model hypotheses of interest are:

H0: The series is non-stationary
HA: The series is stationary

The ADF statistic is compared to critical values to draw conclusions about stationarity (see Dickey and Fuller, 1979, for the critical values).
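A short derivation shows why these hypotheses can be tested through the coefficient on the lagged level of the series. It is an illustrative sketch under the simplifying assumption of a first-order autoregression with no lagged differences; the symbols $\phi$ and $\rho$ are introduced here for illustration only.

```latex
\begin{aligned}
x_t           &= \phi\, x_{t-1} + \varepsilon_t \\
x_t - x_{t-1} &= (\phi - 1)\, x_{t-1} + \varepsilon_t \\
\Delta x_t    &= \rho\, x_{t-1} + \varepsilon_t, \qquad \rho = \phi - 1
\end{aligned}
```

A unit root ($\phi = 1$) corresponds to $\rho = 0$, so the null hypothesis of non-stationarity is $\rho = 0$ and the stationary alternative is $\rho < 0$. Under the null hypothesis the t-ratio on the lagged level does not follow the usual t distribution, which is why it is compared to Dickey-Fuller critical values.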
AN ANATOMY OF AN ADF EQUATION
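$$\Delta x_{i,t} = \rho_i\, x_{i,t-1} + \sum_{k=1}^{5} \gamma_{i,k}\, \Delta x_{i,t-k} + \varepsilon_{i,t}$$

$\Delta x_{i,t}$    This is the 1st-differenced value of x
$x_{i,t-1}$    This is the 1st-lagged value of x
$\sum_{k=1}^{5} \gamma_{i,k}\, \Delta x_{i,t-k}$    These are the 1st, 2nd, 3rd, 4th, and 5th lagged values of the 1st-differenced values of x
$\varepsilon_{i,t}$    This is the error term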

The above elements can be easily seen in the following chart.


SAS TECHNIQUES
As mentioned earlier, our sample data are quarterly, which dictates that five lagged differences be included in testing the stationarity of both series (x and y) for more explanatory power. The following SAS DATA step creates the first-lagged, the first-differenced, and the five lagged-differenced values of the x series. A similar step is needed to create the same variables from the y series.

The DATA step exploits the SAS LAG and DIF functions to create the set of lagged and differenced values of x. The LAGn function looks back n records in the dataset and allows you to obtain a previous value of a variable and store it in the current observation; n refers to the number of records back in the data and can be an integer from 1 to 99. Often the only thing you want to do with a previous value of a variable is compare it with the current value to compute the difference. The DIFn function works the same way as LAGn, but rather than simply assigning a value, it assigns the difference between the current value and a previous value of a variable. The statement At = DIFn(X) tells SAS that At should equal the current value of x minus the value x had n records back in time.

The LAG and DIF functions should not be executed conditionally, because doing so can cause unexpected results. If you have to use them with conditional processing of a dataset, first execute the functions and assign their results to a new variable, then use the new variable in the conditional processing. Both functions should be used only on the right-hand side of assignment statements.
   DATA TimeSeries;
      SET RawData;
      x_1st_LAG          = LAG1(x);
      x_1st_DIFF         = DIF1(x);
      x_1st_DIFF_1st_LAG = DIF1(LAG1(x));
      x_1st_DIFF_2nd_LAG = DIF1(LAG2(x));
      x_1st_DIFF_3rd_LAG = DIF1(LAG3(x));
      x_1st_DIFF_4th_LAG = DIF1(LAG4(x));
      x_1st_DIFF_5th_LAG = DIF1(LAG5(x));
   RUN;

SAS Output (partial): 1st-lagged, 1st-differenced, and the 1st through 5th lagged values of the 1st-differenced value of x

Column key: LAG = x_1st_LAG, DIFF = x_1st_DIFF, DL1 through DL5 = x_1st_DIFF_1st_LAG through x_1st_DIFF_5th_LAG.

YEAR  QTR         X       LAG      DIFF       DL1       DL2       DL3       DL4       DL5
1987    4  -0.05294         .         .         .         .         .         .         .
1988    1  -0.14696  -0.05294  -0.09402         .         .         .         .         .
1988    2  -0.12600  -0.14696   0.02096  -0.09402         .         .         .         .
1988    3  -0.14656  -0.12600  -0.02057   0.02096  -0.09402         .         .         .
1988    4  -0.06056  -0.14656   0.08600  -0.02057   0.02096  -0.09402         .         .
1989    1  -0.02644  -0.06056   0.03412   0.08600  -0.02057   0.02096  -0.09402         .
1989    2  -0.05778  -0.02644  -0.03134   0.03412   0.08600  -0.02057   0.02096  -0.09402
1989    3   0.01924  -0.05778   0.07702  -0.03134   0.03412   0.08600  -0.02057   0.02096
1989    4  -0.10823   0.01924  -0.12748   0.07702  -0.03134   0.03412   0.08600  -0.02057
1990    1  -0.04056  -0.10823   0.06767  -0.12748   0.07702  -0.03134   0.03412   0.08600
1990    2  -0.03390  -0.04056   0.00666   0.06767  -0.12748   0.07702  -0.03134   0.03412
1990    3  -0.06903  -0.03390  -0.03513   0.00666   0.06767  -0.12748   0.07702  -0.03134
1990    4   0.07547  -0.06903   0.14451  -0.03513   0.00666   0.06767  -0.12748   0.07702
1991    1   0.03567   0.07547  -0.03981   0.14451  -0.03513   0.00666   0.06767  -0.12748
1991    2   0.09819   0.03567   0.06252  -0.03981   0.14451  -0.03513   0.00666   0.06767
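The caution about executing LAG and DIF conditionally can be illustrated with a small sketch. It assumes the RawData dataset with the x and QTR variables shown earlier; the dataset name LagDemo and the variables lag_wrong, x_prev, and lag_right are hypothetical names chosen for this example.

```sas
/* Sketch only: LagDemo, lag_wrong, x_prev and lag_right are
   hypothetical names; RawData, x and QTR come from the paper. */
DATA LagDemo;
  SET RawData;

  /* Risky: LAG1 executes only when QTR = 4, so its queue holds the
     x value from the previous QTR = 4 record, not the previous row. */
  IF QTR = 4 THEN lag_wrong = LAG1(x);

  /* Safer: execute LAG1 on every record, then apply the condition. */
  x_prev = LAG1(x);
  IF QTR = 4 THEN lag_right = x_prev;
RUN;
```

Because each LAG call keeps its own queue that is updated only when the call executes, lag_wrong picks up a value from a year earlier, while lag_right holds the value from the immediately preceding quarter.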

Next, the SAS REG procedure, one of many regression procedures in the SAS System, is used to regress the lagged and differenced values of x generated by the above DATA step. The regression model is set up so that the first-differenced value of x is the dependent variable, and the independent variables are the value of x at the preceding time period (the lagged value of x) and the set of five lagged first-differenced values of the x series. This analysis provides a "best-fit" mathematical equation for the relationship exhibited in Eq (2).

SAS REG procedure for the unit root test at level, with a fixed lag length of 5 and a constant:

   PROC REG DATA = TimeSeries;
      MODEL x_1st_DIFF = x_1st_LAG
                         x_1st_DIFF_1st_LAG
                         x_1st_DIFF_2nd_LAG
                         x_1st_DIFF_3rd_LAG
                         x_1st_DIFF_4th_LAG
                         x_1st_DIFF_5th_LAG;
   RUN;
   QUIT;

DISCUSSION
The x_1st_LAG t-value generated by the above regression model corresponds to the Augmented Dickey-Fuller (ADF) test statistic. Compare this t-value to the critical values (see Dickey and Fuller, 1979, for the critical values) to test the hypotheses that the x series is:

H0: Non-stationary
HA: Stationary

In our example, the t-value (-1.83) is greater than the critical values (CVs) at the 1%, 5%, and 10% significance levels (-3.524233, -2.902358, and -2.588587, respectively). We therefore fail to reject the null hypothesis and conclude that the x series is a non-stationary process when tested at level.
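The comparison of the t-value with a critical value can also be scripted. The following is a minimal sketch, assuming the TimeSeries dataset created earlier; the dataset names AdfEstimates and AdfDecision and the variables cv_05 and decision are hypothetical. It uses the ODS OUTPUT statement to capture the ParameterEstimates table from PROC REG and hard-codes the 5% critical value quoted above.

```sas
/* Capture the parameter estimates table produced by PROC REG. */
ODS OUTPUT ParameterEstimates = AdfEstimates;
PROC REG DATA = TimeSeries;
  MODEL x_1st_DIFF = x_1st_LAG
                     x_1st_DIFF_1st_LAG x_1st_DIFF_2nd_LAG
                     x_1st_DIFF_3rd_LAG x_1st_DIFF_4th_LAG
                     x_1st_DIFF_5th_LAG;
RUN;
QUIT;

/* Keep the row for the lagged level and compare its t-value with
   the 5% critical value (an assumption copied from the text).    */
DATA AdfDecision;
  SET AdfEstimates;
  WHERE UPCASE(Variable) = 'X_1ST_LAG';
  LENGTH decision $ 40;
  cv_05 = -2.902358;
  IF tValue < cv_05 THEN decision = 'Reject H0: series is stationary';
  ELSE decision = 'Fail to reject H0: unit root';
  KEEP Variable tValue cv_05 decision;
RUN;

PROC PRINT DATA = AdfDecision NOOBS;
RUN;
```

With the t-value of -1.83 reported in this paper, the comparison takes the "fail to reject" branch.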
WHAT IS NEXT?

If we fail to reject the null hypothesis and conclude that x, and perhaps y, are non-stationary series, we would have to difference each series once, create a set of lagged and differenced variables as shown in the earlier SAS DATA step (this time from the differenced values of each series), and finally carry out the ADF test again, testing the stationarity of the series at its first-differenced value. Differencing a series normally transforms it from non-stationary to stationary. A series that becomes stationary after differencing is said to be integrated and is denoted I(d), where d is the order of integration. The order of integration is the number of unit roots contained in the series, or the number of differencing operations it takes to make the series stationary. For our purpose here, if our example series becomes stationary after being differenced once, it has one unit root and is an I(1) series.

Once both x and y are determined to be non-stationary at level, we will move on to examine the nature of their linear combination. Specifically, we will be interested in whether a stationary linear combination of the non-stationary x and y exists; if it does, the x and y series are said to be cointegrated. The linear combination between them is the cointegrating equation and may be interpreted as the long-run equilibrium relationship between the two variables. Fortunately, this test can also be accomplished using the Augmented Dickey-Fuller test, and it will be the subject of the second part of this series of articles.
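The test at first difference can be sketched by repeating the earlier DATA step and regression on the differenced series. This is an illustrative sketch that assumes the RawData dataset and keeps the same five-lag structure; the dataset name TimeSeriesD1 and all variable names beginning with dx are hypothetical.

```sas
/* Sketch only: TimeSeriesD1 and the dx variables are hypothetical. */
DATA TimeSeriesD1;
  SET RawData;
  dx           = DIF1(x);          /* first-differenced series      */
  dx_1st_LAG   = LAG1(dx);         /* lagged value of dx            */
  dx_1st_DIFF  = DIF1(dx);         /* differenced value of dx       */
  dx_DIFF_LAG1 = DIF1(LAG1(dx));   /* lagged differences of dx      */
  dx_DIFF_LAG2 = DIF1(LAG2(dx));
  dx_DIFF_LAG3 = DIF1(LAG3(dx));
  dx_DIFF_LAG4 = DIF1(LAG4(dx));
  dx_DIFF_LAG5 = DIF1(LAG5(dx));
RUN;

PROC REG DATA = TimeSeriesD1;
  MODEL dx_1st_DIFF = dx_1st_LAG
                      dx_DIFF_LAG1 dx_DIFF_LAG2 dx_DIFF_LAG3
                      dx_DIFF_LAG4 dx_DIFF_LAG5;
RUN;
QUIT;
```

The t-value on dx_1st_LAG would then be compared to the critical values in the same way as the test at level.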


SAS Output Regression Analysis (Unit Root Test) Level with 5 Lags

NULL HYPOTHESIS: 'x' has a unit root
LAG LENGTH: 5 (FIXED)
AUGMENTED DICKEY-FULLER TEST STATISTICS, TEST CRITICAL VALUES:
 1% LEVEL T-STATISTICS = -3.524233
 5% LEVEL T-STATISTICS = -2.902358
10% LEVEL T-STATISTICS = -2.588587
LEVEL WITH 5 LAGS

The REG Procedure
Model: MODEL1
Dependent Variable: x_1st_DIFF

Number of Observations Read                   78
Number of Observations Used                   72
Number of Observations with Missing Values     6

Analysis of Variance

                           Sum of        Mean
Source             DF     Squares      Square   F Value   Pr > F
Model               6     0.08731     0.01455     21.25   <.0001
Error              65     0.04451  0.00068479
Corrected Total    71     0.13182

Root MSE            0.02617   R-Square   0.6623
Dependent Mean      0.00172   Adj R-Sq   0.6312
Coeff Var        1518.81011

Parameter Estimates

                            Parameter   Standard
Variable              DF     Estimate      Error   t Value   Pr > |t|
Intercept              1      0.00916    0.00422      2.17     0.0338
x_1st_LAG              1     -0.16361    0.08960     -1.83     0.0724
x_1st_DIFF_1st_LAG     1     -0.43485    0.13151     -3.31     0.0015
x_1st_DIFF_2nd_LAG     1      0.11255    0.10735      1.05     0.2983
x_1st_DIFF_3rd_LAG     1      0.23609    0.10676      2.21     0.0305
x_1st_DIFF_4th_LAG     1     -0.42082    0.10964     -3.84     0.0003
x_1st_DIFF_5th_LAG     1     -0.12741    0.10698     -1.19     0.2380

The hypothesis that x has a unit root cannot be rejected

EVIEWS CODE AND OUTPUT FOR COMPARISON

EVIEWS, or SAS time series tools other than PROC REG, can be used to carry out the same test. The following EVIEWS code carries out the ADF test; its results are shown on the next page.

   Uroot(adf,const,lag=5,save=mout)

EVIEWS is an econometrics and time series analysis software package by Quantitative Micro Software. http://www.eviews.com/index.html


The hypothesis that x has a unit root cannot be rejected
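Within SAS itself, one way to cross-check the result is the STATIONARITY= option of the IDENTIFY statement in PROC ARIMA, which is part of SAS/ETS. The following is a minimal sketch, assuming SAS/ETS is licensed and that the RawData dataset with the x variable is available; it is offered as a comparison, not as the method used in this paper.

```sas
/* Sketch only: requests augmented Dickey-Fuller tests for x with
   5 augmenting lags. Requires SAS/ETS.                           */
PROC ARIMA DATA = RawData;
  IDENTIFY VAR = x STATIONARITY = (ADF = (5));
RUN;
QUIT;
```

PROC ARIMA reports the tests for zero-mean, single-mean, and trend specifications. The single-mean row is the one that includes a constant, so its Tau statistic would be expected to be comparable to the x_1st_LAG t-value from the PROC REG approach.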

REFERENCES
Bails, Dale G. and Larry C. Peppers (1982). Business Fluctuations: Forecasting Techniques and Applications. Englewood Cliffs, NJ: Prentice-Hall Inc.
Dickey, D. and W. Fuller (1979). Distribution of the Estimators for Autoregressive Time Series with a Unit Root. Journal of the American Statistical Association, 74, 427-431.
Fuller, W. (1996). Introduction to Statistical Time Series, Second Edition. New York: John Wiley.
Granger, C.W.J. and P. Newbold (1974). Spurious Regressions in Econometrics. Journal of Econometrics, 2, 111-120.
Hamilton, J.D. (1994). Time Series Analysis. Princeton University Press.
Phillips, P.C.B. (1987). Time Series Regression with a Unit Root. Econometrica, 55, 277-301.

ACKNOWLEDGEMENTS
My sincere thanks to everyone with whom I have had the pleasure of exchanging ideas on time series analysis in recent years. Special thanks to Theresa Diventi and Ian Keith, both with the Financial Institutions Regulation Division, Kee N. Cheung with the Housing Finance Analysis Division of the U.S. Department of Housing and Urban Development, and Ronald Hanson with L3 Communications, Enterprise IT Solutions (EITS), for their constructive suggestions, which added much to this paper.

TRADEMARKS
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. EVIEWS and all other EVIEWS product or service names are registered trademarks or trademarks of Quantitative Micro Software in the USA and other countries. ® indicates USA registration.

The author welcomes and encourages any questions, corrections, improvements, feedback, and remarks, both on- and off-topic, via email. Please contact the author:

Ismail E. Mohamed, Ph.D.
Software Engineer 5, L3 Communications, Enterprise IT Solutions (EITS)
U.S. Department of Housing & Urban Development
451 7th Street, SW, Room 8212, Washington, DC 20410
E-mail: [email protected]; [email protected]
Phone: 202-402-5884
