Newbold, P., Carlson, W.L. and Thorne, B. Statistics For Business and Economics


Finansiell Statistik

Regression

Types
● Simple linear regression (SLR)

● Multiple linear regression (MLR)


○ Same formula as SLR but with additional βi and xi terms

Regression components / variables


● R2 → Ranges from 0 → 1, how well the model explains the data
● P → Tells if the coefficient is significant: if P < 0.05 you reject H0 that the coefficient is zero
● Sum of Squares
○ SST → Sum of squares total
○ SSE → Sum of squares error
○ SSR → Sum of squares due to regression
● Point estimation
○ The process of finding a single approximate value of some parameter, such as the
mean
● Interval estimation
○ The process of finding an interval, e.g. a confidence interval, that is likely to contain
the parameter

- Residual analysis: Normal distribution assumption

The assumption is that all error terms are normally distributed

Minimize variance for portfolio/linear combination

Multiple linear regression


The model may be extended with further terms β2x2, β3x3, and so on.
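A minimal sketch, assuming hypothetical data, of how such a model could be fitted in Python with statsmodels; the variable names (x1, x2, y) and the simulated data frame are made up for illustration.

```python
# Sketch: fitting a multiple linear regression with statsmodels (hypothetical data).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 1.0 + 2.0 * df["x1"] - 0.5 * df["x2"] + rng.normal(size=100)

X = sm.add_constant(df[["x1", "x2"]])   # adds the intercept column (beta_0)
model = sm.OLS(df["y"], X).fit()
print(model.summary())                  # coefficients, t-values, P-values, R^2, F-test
```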

How to choose the best model

An important question to ask is whether the Xi, i = 1, …, k, are independent.

Y → continuous

Xi → discrete

What you need to check regarding the error term E in the model is:

1: normality
2: autocorrelation
3: stable (constant) variance

One of the key questions to ask is whether the variables X1, X2, etc. are dependent on each
other or not; if they are, the conclusions drawn from the model may not be valid.

Model building

Is it wise to remove x5 from the model?

We test:

H0: β5 = 0 vs H1: β5 ≠ 0

The P-value is 0.492, which is well above 0.05, so we do not reject H0 and x5 can be removed from the model.


F-test

Test for r > 1 variables simultaneously (see NCT, ch. 12.5)

e.g. we test H0: β5 = β6 = 0 vs H1: at least one of β5, β6 ≠ 0

Slides were not available on Athena during the lecture; the relevant images will be added afterwards.

If you want to use a reduced model, the difference in fit between the full model and the
reduced model should be as small as possible when you perform the F-test, as in the sketch below.
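A sketch of such a full-vs-reduced comparison with statsmodels; the data frame df and the column names x1 … x6 are hypothetical, and compare_f_test performs the F-test of H0: β5 = β6 = 0.

```python
# Sketch: F-test of a reduced model against the full model (hypothetical columns).
import statsmodels.formula.api as smf

full = smf.ols("y ~ x1 + x2 + x3 + x4 + x5 + x6", data=df).fit()
reduced = smf.ols("y ~ x1 + x2 + x3 + x4", data=df).fit()   # drops x5 and x6

f_value, p_value, df_diff = full.compare_f_test(reduced)
print(f_value, p_value, df_diff)   # reject H0 if p_value < alpha
```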

Multicollinearity

VIF (Variance Inflation Factor)


● If VIF >= 10 → Multicollinearity

Multicollinearity means that two or more explanatory variables are strongly correlated with each other.
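A sketch of the VIF check with statsmodels; the data frame df and the regressor names are assumed from the earlier example.

```python
# Sketch: computing the VIF of each regressor (constant column included in the design).
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df[["x1", "x2"]])
for i, name in enumerate(X.columns):
    if name == "const":
        continue                      # the constant's VIF is not of interest
    print(name, variance_inflation_factor(X.values, i))
# Rule of thumb from the notes: VIF >= 10 (some texts use 5) signals multicollinearity.
```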

Non-linear relationships

Sometimes the relationship is not linear, not even approximately

Residual analysis
Not testing the model but rather testing the errors the model produces

After you’ve created a good model you test the residuals for
● Normality
● Autocorrelation
● Constant variance (heteroscedasticity)

Autocorrelation and Independence


Autocorrelation refers to the degree of correlation of the same variable between two
successive time intervals. Independence is the opposite.

Durbin-Watson d test
Difficult
● d is between 0 and 4
● if there is strong, positive autocorr., d is close to zero
Runs test
● We have a series of n observations (or residuals)
● We mark "+" if the value is at or above the median and "-" if the value is below the
median
● Let R be the number of runs (streaks of the same sign, each of length >= 1); see the
sketch after this list
● Example: ++++----+++-+++----+--- R=8, n = 24
● If few (many) runs, this is a sign of positive (negative) autocorrelation of the first order
● Check in table for P-value
○ For all tests
■ P-value < α → Reject H0
■ P-value > α → Do not reject H0
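A sketch of both checks on a residual series; `resid` is assumed to come from a fitted model such as the OLS example earlier, and the runs count follows the bookkeeping described above.

```python
# Sketch: Durbin-Watson statistic and a simple runs count on residuals.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

resid = np.asarray(model.resid)          # residuals from an earlier fit (assumption)
print(durbin_watson(resid))              # ~2 = no autocorr.; near 0 positive, near 4 negative

signs = np.where(resid >= np.median(resid), 1, -1)   # "+" at/above the median, "-" below
R = 1 + int(np.sum(signs[1:] != signs[:-1]))         # number of runs
print("runs:", R, "n:", len(resid))                  # compare with the tabulated P-value
```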

Heteroscedasticity
In some situations the variance may depend on y or on the values of the independent
variables; this is called heteroscedasticity. This is a problem, since regression assumes
that all error variances are equal.
"How much variance you have in errors over time"

Homoscedasticity → Variances are constant


Heteroscedasticity → Variances are not constant

● If heteroscedasticity is present, we still get unbiased estimates with the OLS
(ordinary least squares) method, but the standard errors, p-values, and CI can be
incorrect.

Breusch-Pagan nR2 test


Regress ei² against the predicted y-values, i.e. replace y in the SLR with ei².
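A sketch of the test with statsmodels' het_breuschpagan, assuming `model` is the fitted OLS result from the earlier example; the function returns the LM (nR²) statistic and its P-value.

```python
# Sketch: Breusch-Pagan (LM / nR^2) test for heteroscedasticity.
from statsmodels.stats.diagnostic import het_breuschpagan

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(lm_stat, lm_pvalue)   # a small P-value is evidence of heteroscedasticity
```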

A remedy

Normal Assumption
Histograms and the Normal quantile plot

Jarque-Bera test
JB is approximately χ²-distributed with two d.f. if "H0: X is normally distributed" is true. Reject
H0 if JB is greater than the critical value from the χ² table (5.991 for α = 0.05).
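A sketch using scipy's jarque_bera on the residuals of the earlier (assumed) model fit.

```python
# Sketch: Jarque-Bera normality test on residuals.
from scipy import stats

res = stats.jarque_bera(model.resid)
print(res.statistic, res.pvalue)   # reject normality if the statistic > 5.991 (chi2, 2 d.f., alpha = 0.05)
```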

Logistic regression
● We want a model which gives us probabilities in [0,1].
● The logistic function: If Y is the event and X is an indep. variable then
p(x) = P(Y | X = x) = exp(β0 + β1x) / (1 + exp(β0 + β1x))
● If B1 > 0, the curve is increasing; if B1 < 0, the curve is decreasing
Odds and log-odds: the odds are p(x) / (1 − p(x)) = exp(β0 + β1x), so the log-odds (= β0 + β1x) are linear in x
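A small numerical sketch of the logistic function and the odds/log-odds relationship; the values of β0 and β1 are made up for illustration, not estimates from any dataset.

```python
# Sketch: logistic function, odds and log-odds for illustrative coefficients.
import numpy as np

B0, B1 = -1.0, 0.8                                     # illustrative values only
x = np.linspace(-3, 3, 7)
p = np.exp(B0 + B1 * x) / (1 + np.exp(B0 + B1 * x))    # p(x) is always in [0, 1]
odds = p / (1 - p)                                     # equals exp(B0 + B1*x)
log_odds = np.log(odds)                                # equals B0 + B1*x, linear in x
print(np.column_stack([x, p, odds, log_odds]))
```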

Time series analysis

Summary of calculations

We can use the odds to convert back into a probability: p = odds / (1 + odds).

Time series

VIF → ACF, PACF

ARIMA (p, d, q)

The plan for learning about time series:

AR (p)

MA (q)    Stationarity? Does volatility vary?

ARMA (p, q)

AR(I)MA (p, d, q)
Stationarity: (Augmented) Dickey-Fuller test

Ljung-Box test: H0: no autocorrelation (ρ1 = … = ρk = 0) vs H1: at least one ρ ≠ 0

! Box-Jenkins ! – A common reason for an F is not understanding this method

Stationarity
● A TS (Xt) is weakly stationary if
○ It does not increase or decrease in the long run, i.e. the expected value of Xt
is constant
○ The variance is constant over time (homoscedasticity)
○ The COV between two observations Xs and Xt only depends on the difference
t − s

Dickey Fuller test


● Test for stationarity → Sometimes called “Unit root test”
● Test statistic → DF = (â1 − 1) / Std.Err(â1)
● Reject H0 if DF is “negative enough”
● For critical values we need to use a special table

Augmented Dickey Fuller test


● If we have an AR(p) time series, we can test for stationarity with a version
of the DF test, the augmented Dickey-Fuller (ADF) test, as in the sketch below
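A sketch of the (augmented) test with statsmodels; `x` is assumed to be the time series as a 1-D array or pandas Series.

```python
# Sketch: augmented Dickey-Fuller test for stationarity.
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value, usedlag, nobs, crit_values, icbest = adfuller(x)
print(adf_stat, p_value, crit_values)
# Reject H0 (non-stationarity) if the statistic is more negative than the critical
# value, i.e. if the P-value is below alpha.
```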

Moving averages

Autoregressive models
● In TS, it’s rarely reasonable to assume independence of obs.

ARMA (p,q) and ARIMA (p,d,q)

ACF, PACF
Types of correlation dependence. The TS needs to be checked for stationarity before
doing ACF & PACF

● ACF = Autocorrelation function → Determines q


○ Describes the correlation between values of the TS that are "linked" in time, e.g.
day 1, day 2, day 3. Includes indirect correlations: day 1 affects day 2 and day 2
affects day 3, but day 1 also affects day 3 through day 2

● PACF = Partial autocorrelation function → Determines p
○ "A different type of measure of dependence." Measures the direct correlation
between e.g. day 1 and day 3 after removing the impact that passes through
intermediate points (day 2); see the sketch after this list

● Stationarity → Determines d
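As mentioned in the PACF item above, a sketch of how the two functions are usually inspected; `x` is assumed to be a stationary (possibly differenced) series.

```python
# Sketch: ACF and PACF plots used to suggest q and p.
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(2, 1)
plot_acf(x, lags=20, ax=axes[0])    # a cut-off after lag q suggests an MA(q) part
plot_pacf(x, lags=20, ax=axes[1])   # a cut-off after lag p suggests an AR(p) part
plt.show()
```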

Summary

Notations & Abbreviations

● R^2
○ Measures how good a model is. Goes from 0 - 1. Higher = Better
R-squared = SSR/SST
● Adjusted R^2
○ Same as R^2 but takes into account the # of predictors/variables in the
model. Unlike R^2 it penalizes the addition of unnecessary variables to the
model.
● SSE
○ Sum of Squared Errors is the residual sum of squares and represents the
unexplained variation in the dependent variable
● SSR
○ Sum of Squared regression is the sum of the squared deviations of the
predicted values of the dependent variable from the mean of the dependent
variable.
● SST
○ Sum of Squared Totals is the sum of the squared deviations of the actual
values of the dependent variable from the mean of the dependent variable.
● P-value
○ The P-value represents the probability of observing a test statistic as extreme
as, or more extreme than, the one observed, given that the null hypothesis is
true. If the P-value is small it indicates that the chance of the observed data
occurring by chance only is low.
● Multicollinearity
○ If variables in a statistical model are too similar to (correlated with) each other.
E.g. if you're trying to find what factors affect how much people exercise, you might
look at a lot of different variables: age, income, gender, etc., but some factors
might be related to each other. Older people might have higher income, so
those variables will be correlated.
● VIF
○ Variance Inflation Factor is a way of checking whether variables in a statistical model
are multicollinear. A VIF above roughly 5 (some texts use 10) is a problem.
● Autocorrelation
○ Refers to the degree of correlation between successive observations in a time
series or other type of sequence data. In a time series, autocorrelation occurs
when a particular value in the series is correlated with a previous or future
value in the same series. If autocorrelation is present in a time series, it can
affect the accuracy of forecasts and lead to biased estimates of regression
coefficients.
● Normality
○ Many operations we use assume that the data we have is already normally
distributed. If the data is not normally distributed we have to transform it with
e.g a log transformation.
● Stationarity
○ Many models we use assume that the data we have is stationary and you
need to test your data for stationarity.
For a “process” to be called stationary it has to satisfy three conditions
■ The mean of the TS remains constant over time
■ The variance of the series remains constant over time
■ The ACF of the series remains constant over time
● Homoscedasticity
○ Where the errors in a regression model are spread out evenly across all
values of the predictor variables. This means that the spread of the errors
doesn't change depending on the values of the predictor variables. If the
errors are not evenly spread (heteroscedasticity), the standard errors and the
resulting inference will be misleading.
● ACF
○ Autocorrelation function measures how correlated a data point is with its own
previous values at different lags (time intervals). For example:
■ Imagine you have a time series dataset of the daily temperature over
the course of a year. ACF would help you determine how much the
temperature on any given day is related to the temperature from
previous days at different lags (such as 1 day ago, 2 days ago, etc.).
● PACF
○ Partial Autocorrelation function measures the correlation between a data point
and its own previous values, after removing the effects of any intervening data
points. This can help identify the influence of specific time lags on the data.
When the “columns” in the PACF are above the error bands they are
significant
■ Imagine you have a time series dataset of the daily temperature over
the course of a year. PACF, would help you determine the relationship
between the temperature on any given day and its previous values,
after removing the influence of other days in between.
● AR(p)
○ Autoregressive model means that it's a regression on itself. Looking at the
PACF of the AR-model you get the order of the AR(p) which you then use in
the ARMA/ARIMA models.
● MA(q)
○ Moving average. Looking at the ACF of the MA-model you get the order of
the MA(q), which you then use in the ARMA/ARIMA models.
● ARMA
○ Autoregressive(AR), Moving Average(MA)
○ ARMA(1,1) is a basic model. The first number in brackets corresponds to the
AR part and the second corresponds to the MA part.
● ARIMA
○ Autoregressive(AR), Integrated(I), Moving Average(MA)
○ Instead of predicting the TS itself you’re predicting differences of the TS from
one timestamp to the previous timestamp.
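A sketch of fitting one of the models listed above with statsmodels; the order (1, 1, 1) is only an example, and `x` is assumed to be the observed time series.

```python
# Sketch: fitting an ARIMA(p, d, q) model and forecasting.
from statsmodels.tsa.arima.model import ARIMA

fit = ARIMA(x, order=(1, 1, 1)).fit()
print(fit.summary())          # coefficients, AIC, BIC
print(fit.forecast(steps=5))  # forecasts for the next five periods
```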

Tests
● F-test
○ The F-test is a statistical test that compares the variances of two or more
samples to determine whether they are significantly different, e.g. whether the
means of several samples are equal. In regression it is used to test whether
several coefficients are jointly zero (comparing a full and a reduced model).
● Durbin-watson d-test
○ The test is used to check for autocorrelation in the residuals of a regression
analysis. It calculates a test statistic (d-value) which goes from 0-4. A d-value
of 2 indicates no autocorrelation. Values from 0-2 indicate positive
autocorrelation and values from 2-4 indicate negative autocorrelation.
● Runs test
○ Tests whether the data is random or whether it has a systematic pattern. Convert the
data into a binary sequence with a split at the median and count the points
above and below. If the number of runs differs significantly from the
expected number of runs, it suggests that the data is non-random.
● Breusch-Pagan nR2 test
○ Tests for heteroscedasticity in a regression model. The test calculates a
statistic called the LM statistic, which follows a Chi-square distribution. If the
P-value of the test is below the significance level we reject H0 and conclude
that there is evidence of heteroscedasticity in the model.
● Jarque-Bera test
○ Tests if the data is normally distributed. The test calculates a statistic based
on skewness and kurtosis and compares it to a Chi-square distribution with 2
degrees of freedom. If the test statistic is higher than the critical value, we reject
H0 and conclude that the data is not normally distributed. In a normal distribution the
skewness is 0 and the kurtosis is 3.
● Dickey-Fuller test
○ Test for stationarity in the data. The test is based on the null hypothesis that
the data is non-stationary. If the test statistic is less than the critical value we
reject the null and conclude that the data is stationary.
● Augmented Dickey-Fuller test
○ The Augmented Dickey-Fuller test is used to check whether a time series
dataset is stationary or not when there may be more complex trends in the
data than the simple linear trend tested by the Dickey-Fuller test. The ADF
also has H0 as the data being non-stationary.
● Ljung-box test
○ Tests whether a TS dataset is random or not. H0 is no serial correlation (the data
is random). The test calculates a statistic based on the sum of squared
autocorrelation coefficients for a range of lag values. If the test statistic is
less than the critical value, we cannot reject the H0 that the data is random.
● ARCH
○ Autoregressive Conditional Heteroscedasticity.
■ An extension of an ARMA/ARIMA model, used when the residual variance
changes over time (volatility clustering), so that plain ARMA/ARIMA models are not adequate.
● GARCH
○ Generalized Autoregressive Conditional Heteroscedasticity
● VAR
○ Vector Autoregressive model
■ We have two TS which influence each other. A model where the series
are assumed to influence each other can be written as a system in
which you allow interaction between the variables.

Box-Jenkins method
● In reality, we will not know the underlying model of a time series. We still want to
make predictions and Box-Jenkins proposes a step-by-step way to do this.
○ 1. Identify a model
○ 2. Estimate the parameters
○ 3. Critical evaluation
○ 4. Usage (forecasting)
■ AIC → Used to penalize an excessive number of parameters → Choose the
model with the lowest AIC or BIC
● AIC = 2·(number of parameters) − 2·log-likelihood
■ BIC → Used to penalize an excessive number of parameters → Choose the
model with the lowest AIC or BIC
● BIC = ln(number of obs.)·(number of parameters) − 2·log-likelihood
(see the sketch after this list)
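A sketch of the two criteria written out as functions (k = number of parameters, n = number of observations, loglik = maximized log-likelihood).

```python
# Sketch: AIC and BIC as written above; pick the model with the lowest value.
import numpy as np

def aic(k, loglik):
    return 2 * k - 2 * loglik

def bic(k, n, loglik):
    return np.log(n) * k - 2 * loglik
```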

Stage 1 → Identify a model


Stationarity
Most variables are non-stationary and need to be differenced to remove trend.
● Comment on if the data is additive or multiplicative
○ Additive
■ The seasonal component is added on as an absolute value
○ Multiplicative
■ The seasonal component is added on as an amount proportional to
the value at that point
● First, graph the data and look at the correlogram and a formal test (ADF)
● If stationary from start → ARMA (p,q)
● If non-stationary → ARIMA (p,d,q)

Identification
● Check ACF and PACF

Stage 2 → Estimation
Objective → Find a stationary and parsimonious (few variables) model that fits the data
● Model selection criteria
○ Significance of the ARMA components
○ Compare AIC and BIC
○ Comment on Adj-R2

Stage 3 → Diagnostic and forecasting


● Are the parameters significantly different from 0? If not, can the model be simplified?
● Are the residuals uncorrelated? (Use the Ljung-Box test → checkresiduals; see the
sketch after this list.) Is there a pattern in the residual plot?
● Forecast!
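A sketch of the residual check with statsmodels' Ljung-Box test; `fit` is assumed to be the ARIMA result from the earlier sketch.

```python
# Sketch: Ljung-Box test on the residuals of the fitted time-series model.
from statsmodels.stats.diagnostic import acorr_ljungbox

print(acorr_ljungbox(fit.resid, lags=[10]))
# Small P-values mean the residuals are autocorrelated and the model should be revised.
```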
