Newbold, P., Carlson, W.L. and Thorne, B. Statistics For Business and Economics
Regression
Types
● Simple linear regression (SLR)
Y → continuous
Xi → discrete
What you need to check regarding the error term E in the model is:
1. Normality
2. Autocorrelation (the errors should not be autocorrelated)
3. The variance is stable (constant)
A key question is whether the explanatory variables X1, X2, etc. are dependent on (correlated with) each other; if they are, the conclusions from the model may not be valid. A minimal fitting sketch follows below.
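A minimal sketch of fitting a regression in Python with statsmodels, using simulated data (x and y are made-up names for illustration); the residuals from the fit are what the error checks above are applied to.

```python
# Minimal SLR sketch on simulated data (x, y are made-up names).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2 + 3 * x + rng.normal(size=100)   # simulated "true" model: B0=2, B1=3

X = sm.add_constant(x)                 # add the intercept column
model = sm.OLS(y, X).fit()
print(model.summary())                 # coefficients, t-tests, R^2, etc.
resid = model.resid                    # residuals, used for the checks above
```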
Model building
We test:
H0: B5 = 0 vs H1: B5 ≠ 0
Slides are not available on Athena during the lecture; relevant pictures will be added afterwards.
If you want to use a reduced model, the difference in fit between the full and the reduced model should be as small as possible when you perform the F-test (a non-significant F-statistic supports the reduced model); see the sketch below.
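A hedged sketch of the full-vs-reduced F-test using statsmodels' compare_f_test; the data and the names x1, x2, x3 are simulated purely for illustration.

```python
# Compare a full and a reduced OLS model with a partial F-test (simulated data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
df["y"] = 1 + 2 * df["x1"] + rng.normal(size=200)   # x2, x3 are irrelevant here

full = smf.ols("y ~ x1 + x2 + x3", data=df).fit()
reduced = smf.ols("y ~ x1", data=df).fit()

f_value, p_value, df_diff = full.compare_f_test(reduced)
print(f_value, p_value, df_diff)   # a large p-value -> the reduced model is adequate
```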
Multicollinearity
Multicollinearity means that the explanatory variables are strongly correlated with each other.
Non-linear relationships
Residual analysis
Here we are not testing the model itself but the errors (residuals) that the model produces.
After you’ve created a good model you test the residuals for
● Normality
● Autocorrelation
● Constant variance (i.e. no heteroscedasticity)
Durbin-Watson d test
Difficult
● d is between 0 and 4
● if there is strong, positive autocorr., d is close to zero
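A small sketch of computing the Durbin-Watson d on regression residuals; the data is simulated and durbin_watson comes from statsmodels.

```python
# Durbin-Watson statistic on OLS residuals (simulated data).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 2 + 3 * x + rng.normal(size=100)
resid = sm.OLS(y, sm.add_constant(x)).fit().resid

d = durbin_watson(resid)
print(d)   # d near 2: no autocorr.; near 0: positive; near 4: negative
```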
Runs test
● We have a series of n observations (or residuals)
● We mark “+” if the value is at or above the median and “-” if the value is below the median
● Let R be the number of runs (streaks of the same sign, each of length ≥ 1)
● Example: ++++----+++-+++----+--- R=8, n = 24
● If few (many) runs, this is a sign of positive (negative) autocorrelation of the first order
● Check the table for the P-value
○ For all tests:
■ P-value < α → Reject H0
■ P-value > α → Do not reject H0
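A hand-rolled sketch of the runs test described above, using the median split and the standard normal approximation for the number of runs (E[R] = 1 + 2·n1·n2/n); this is an illustration, not a library call.

```python
# Runs test sketch: split at the median, count runs, compare with the expected
# number of runs under randomness using a normal approximation.
import numpy as np
from scipy.stats import norm

def runs_test(values):
    values = np.asarray(values, dtype=float)
    signs = values >= np.median(values)          # "+" if at or above the median
    n1, n2 = signs.sum(), (~signs).sum()         # counts of "+" and "-"
    n = n1 + n2
    R = 1 + np.sum(signs[1:] != signs[:-1])      # number of runs (streaks)
    mean_R = 1 + 2 * n1 * n2 / n                 # expected runs under H0 (random)
    var_R = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n**2 * (n - 1))
    z = (R - mean_R) / np.sqrt(var_R)
    p_value = 2 * norm.sf(abs(z))                # two-sided P-value
    return R, z, p_value

print(runs_test(np.random.default_rng(4).normal(size=24)))
```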
Heteroscedasticity
In some situations, the variance of the errors may depend on y or on the values of the independent variables; this is called heteroscedasticity. It is a problem, since regression assumes that all error variances are equal.
“How much variance you have in errors over time”
A remedy
Normal Assumption
Histograms and the Normal quantile plot
Jarque-Bera test
JB is approximately χ²-distributed with two degrees of freedom if H0: “X is normally distributed” is true. Reject H0 if JB is greater than the critical value from the χ² table (5.991 for α = 0.05); see the sketch below.
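A short sketch of the Jarque-Bera test on simulated data; scipy returns the statistic (JB = n/6 · (S² + (K−3)²/4), with S the skewness and K the kurtosis) and the p-value directly.

```python
# Jarque-Bera normality test on simulated (normal) data.
import numpy as np
from scipy.stats import jarque_bera

x = np.random.default_rng(5).normal(size=500)
stat, p_value = jarque_bera(x)
print(stat, p_value)   # reject H0 (normality) if stat > 5.991, i.e. p-value < 0.05
```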
Logistic regression
● We want a model which gives us probabilities in [0,1].
● The logistic function: If Y is the event and X is an indep. variable then
p(x) = P(Y | X = x) = exp(B0 + B1x) / (1 + exp(B0 + B1x))
● If B1 > 0, the curve is increasing; if B1 < 0, the curve is decreasing
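A minimal logistic regression sketch with simulated data (x and y are made-up names); it also shows the odds interpretation, since exp(B1) is the change in odds per unit of x.

```python
# Logistic regression sketch on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
df = pd.DataFrame({"x": rng.normal(size=300)})
p = 1 / (1 + np.exp(-(-0.5 + 1.2 * df["x"])))   # simulated "true" B0 = -0.5, B1 = 1.2
df["y"] = rng.binomial(1, p)

res = smf.logit("y ~ x", data=df).fit()
print(res.params)            # estimated B0 and B1 (on the log-odds scale)
print(np.exp(res.params))    # odds ratios: exp(B1) = change in odds per unit of x
print(res.predict(pd.DataFrame({"x": [0.0, 1.0]})))   # predicted probabilities in [0, 1]
```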
Odds and log-odds
Summary of calculations
Time series
ARIMA (p, d, q)
AR (p)
ARMA (p, q)
AR(I)MA (p, d, q)
Stationarity: (Augmented) Dickey-Fuller test
Stationarity
● A TS (Xt) is weakly stationary if
○ It does not increase or decrease in the long run, i.e. the expected value of Xt is constant
○ The variance is constant over time (homoscedasticity)
○ The covariance between two observations Xs and Xt only depends on the difference t - s
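A sketch of checking stationarity with the (Augmented) Dickey-Fuller test from statsmodels; the series here is a simulated random walk, which is non-stationary by construction.

```python
# (Augmented) Dickey-Fuller test on a simulated random walk.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(7)
random_walk = np.cumsum(rng.normal(size=300))      # non-stationary by construction

adf_stat, p_value, usedlag, nobs, crit, icbest = adfuller(random_walk)
print(adf_stat, p_value, crit)
# H0: the series is non-stationary; a large p-value means we cannot reject H0.
# Differencing the series (np.diff) and re-testing typically gives a small p-value.
```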
Moving averages
Autoregressive models
● In TS, it’s rarely reasonable to assume independence of obs.
ACF, PACF
Types of correlation dependence. The TS needs to be checked for stationarity before computing the ACF and PACF; see the sketch below.
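A sketch of plotting the ACF and PACF for a simulated, stationary AR(1) series; the plotting helpers come from statsmodels.

```python
# ACF and PACF plots for a simulated AR(1) series.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

rng = np.random.default_rng(8)
x = np.zeros(300)
for t in range(1, 300):
    x[t] = 0.7 * x[t - 1] + rng.normal()   # AR(1) with coefficient 0.7

fig, axes = plt.subplots(1, 2, figsize=(10, 3))
plot_acf(x, lags=20, ax=axes[0])    # ACF tails off for an AR process
plot_pacf(x, lags=20, ax=axes[1])   # PACF cuts off after lag 1 -> suggests AR(1)
plt.show()
```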
Summary
● R^2
○ Measures the proportion of the variation in the dependent variable that the model explains. Ranges from 0 to 1; higher is better (see the worked example after the SST item below).
R-squared = SSR/SST
● Adjusted R^2
○ Same as R^2 but takes into account the # of predictors/variables in the
model. Unlike R^2 it penalizes the addition of unnecessary variables to the
model.
● SSE
○ Sum of Squared Errors is the residual sum of squares and represents the
unexplained variation in the dependent variable
● SSR
○ Sum of Squared regression is the sum of the squared deviations of the
predicted values of the dependent variable from the mean of the dependent
variable.
● SST
○ Sum of Squared Totals is the sum of the squared deviations of the actual
values of the dependent variable from the mean of the dependent variable.
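A worked example tying together SSE, SSR, SST, R² and adjusted R² on simulated data; it just verifies the identities above numerically.

```python
# Worked illustration of SST = SSR + SSE and R^2 = SSR/SST on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
x = rng.normal(size=100)
y = 1 + 2 * x + rng.normal(size=100)
res = sm.OLS(y, sm.add_constant(x)).fit()

y_hat = res.fittedvalues
sse = np.sum((y - y_hat) ** 2)          # unexplained variation
ssr = np.sum((y_hat - y.mean()) ** 2)   # explained variation
sst = np.sum((y - y.mean()) ** 2)       # total variation (= SSE + SSR)

print(sse + ssr, sst)                   # these match (up to rounding)
print(ssr / sst, res.rsquared)          # R^2 computed two ways
print(res.rsquared_adj)                 # adjusted R^2 penalizes extra predictors
```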
● P-value
○ The P-value represents the probability of observing a test statistic as extreme
as, or more extreme than, the one observed, given that the null hypothesis is
true. If the P-value is small it indicates that the chance of the observed data
occurring by chance only is low.
● Multicollinearity
○ If explanatory variables in a statistical model are too similar (highly correlated) to each other. E.g. if you're trying to find what factors affect how much people exercise, you might look at a lot of different variables (age, income, gender, etc.), but some factors might be related to each other: older people might have higher income, so those variables will be correlated.
● VIF
○ Variance Inflation Factor is a way of checking whether variables in a statistical model are multicollinear. A VIF > 5 is considered a problem; see the sketch below.
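A VIF sketch using statsmodels; x1 and x2 are simulated to be strongly correlated, so their VIFs come out inflated.

```python
# VIF sketch: x1 and x2 are deliberately correlated, so their VIFs are inflated.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(10)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.3, size=200)      # highly correlated with x1
x3 = rng.normal(size=200)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))   # VIF > 5 flags a problem
```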
● Autocorrelation
○ Refers to the degree of correlation between successive observations in a time
series or other type of sequence data. In a time series, autocorrelation occurs
when a particular value in the series is correlated with a previous or future
value in the same series. If autocorrelation is present in a time series, it can
affect the accuracy of forecasts and lead to biased estimates of regression
coefficients.
● Normality
○ Many operations we use assume that the data we have is already normally
distributed. If the data is not normally distributed we have to transform it with
e.g a log transformation.
● Stationarity
○ Many models we use assume that the data we have is stationary and you
need to test your data for stationarity.
For a “process” to be called stationary it has to satisfy three conditions
■ The mean of the TS remains constant over time
■ The variance of the series remains constant over time
■ The ACF of the series remains constant over time
● Homoscedasticity
○ Where the errors in a regression model are spread out evenly across all
values of the predictor variables. This means that the spread of the errors
doesn't change depending on the values of the predictor variables. If the errors are not evenly spread (heteroscedasticity), the usual standard errors and significance tests become unreliable.
● ACF
○ Autocorrelation function measures how correlated a data point is with its own previous values at different lags (time intervals). For example:
■ Imagine you have a time series dataset of the daily temperature over
the course of a year. ACF would help you determine how much the
temperature on any given day is related to the temperature from
previous days at different lags (such as 1 day ago, 2 days ago, etc.).
● PACF
○ Partial Autocorrelation function measures the correlation between a data point
and its own previous values, after removing the effects of any intervening data
points. This can help identify the influence of specific time lags on the data.
When the bars (spikes) in the PACF plot extend beyond the error bands, they are significant
■ Imagine you have a time series dataset of the daily temperature over
the course of a year. PACF, would help you determine the relationship
between the temperature on any given day and its previous values,
after removing the influence of other days in between.
● AR(p)
○ Autoregressive model means that it's a regression on itself. Looking at the
PACF of the AR-model you get the order of the AR(p) which you then use in
the ARMA/ARIMA models.
● MA(q)
○ Moving Average: the series is modeled using past error terms. Looking at the ACF of the MA-model you get the order of
the MA(q) which you then use in the ARMA/ARIMA models.
● ARMA
○ Autoregressive(AR), Moving Average(MA)
○ ARMA(1,1) is a basic model. The first number in brackets corresponds to the
AR part and the second corresponds to the MA part.
● ARIMA
○ Autoregressive(AR), Integrated(I), Moving Average(MA)
○ Instead of predicting the TS itself you’re predicting differences of the TS from
one timestamp to the previous timestamp.
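A sketch of fitting an ARIMA model with statsmodels on a simulated non-stationary series; the order (1, 1, 1) is just an example choice.

```python
# Fitting an ARIMA(1,1,1) to a simulated non-stationary series.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(11)
y = np.cumsum(0.5 + rng.normal(size=300))       # random walk with drift

res = ARIMA(y, order=(1, 1, 1)).fit()           # (p, d, q): AR order, differencing, MA order
print(res.summary())                            # coefficient estimates and significance
print(res.aic, res.bic)                         # used for model comparison
print(res.forecast(steps=5))                    # 5-step-ahead forecast
```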
Tests
● F-test
○ The F-test is a statistical test that compares the variances of two or more samples to determine whether they are significantly different. In regression, it is used to test whether a set of coefficients are jointly zero, e.g. when comparing a full model with a reduced model.
● Durbin-watson d-test
○ The test is used to check for autocorrelation in the residuals of a regression
analysis. It calculates a test statistic (d-value) which goes from 0-4. A d-value
of 2 indicates no autocorrelation. Values from 0-2 indicate positive
autocorrelation and values from 2-4 indicate negative autocorrelation.
● Runs test
○ Tests whether the data is random or has a systematic pattern. Convert the data into a binary sequence by splitting at the median and counting the points above and below. If the number of runs differs significantly from the expected number of runs, it suggests that the data is non-random.
● Breusch-Pagan nR2 test
○ Tests for heteroscedasticity in a regression model. The test calculates a statistic called the LM statistic (n·R² from an auxiliary regression), which follows a Chi-square distribution. If the P-value of the test is below the significance level, we reject H0 and conclude that there is evidence of heteroscedasticity in the model; see the sketch below.
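A Breusch-Pagan sketch using statsmodels; the error spread in the simulated data grows with x, so the test should flag heteroscedasticity.

```python
# Breusch-Pagan test: the error variance here grows with x (heteroscedasticity).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(12)
x = rng.uniform(1, 10, size=200)
y = 1 + 2 * x + rng.normal(scale=x, size=200)   # error spread increases with x
X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(res.resid, X)
print(lm_stat, lm_pvalue)   # small p-value -> reject H0, evidence of heteroscedasticity
```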
● Jarque-Bera test
○ Tests if the data is normally distributed. The test calculates a statistic based
on skewness and kurtosis and compares it to a Chi-square distribution with 2
degrees of freedom. If the test statistic is higher than the critical value, we reject H0 and conclude that the data is not normally distributed. In a normal distribution the skewness is 0 and the kurtosis is 3.
● Dickey-Fuller test
○ Test for stationarity in the data. The test is based on the null hypothesis that
the data is non-stationary. If the test statistic is less than the critical value we
reject the null and conclude that the data is stationary.
● Augmented Dickey-Fuller test
○ The Augmented Dickey-Fuller test is used to check whether a time series
dataset is stationary or not when there may be more complex trends in the
data than the simple linear trend tested by the Dickey-Fuller test. The ADF
also has H0 as the data being non-stationary.
● Ljung-box test
○ Tests whether a TS dataset is random or not. H0 as no serial correlation (is
random). The test calculates a statistic based on the sum of squared
autocorrelation coefficients for a range of lag values. If the test statistic is less than the critical value, we can't reject H0 that the data is random; see the sketch below.
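A sketch of running the Ljung-Box test on the residuals of a fitted ARIMA model (simulated data); a large p-value means no detectable serial correlation, which is what we want from residuals.

```python
# Ljung-Box test on ARIMA residuals (simulated data).
import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(13)
y = np.cumsum(rng.normal(size=300))
res = ARIMA(y, order=(1, 1, 0)).fit()

print(acorr_ljungbox(res.resid, lags=[10]))   # test statistic and p-value at lag 10
```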
● ARCH
○ Autoregressive Conditional Heteroscedasticity.
■ An extension of the ARMA/ARIMA framework, used when the variance of the errors changes over time (volatility clustering), so plain ARMA/ARIMA models don't fit the data well.
● GARCH
○ Generalized Autoregressive Conditional Heteroscedasticity
● VAR
○ Vector Autoregressive model
■ We have two TS which influence each other. A model where the series
are assumed to influence each other can be written as a system in
which you allow interaction between the variables.
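A VAR sketch with two simulated series that influence each other; the lag length is chosen by AIC and both series are forecast jointly.

```python
# VAR sketch: two simulated series that depend on each other's past values.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(14)
n = 300
x = np.zeros(n)
z = np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + 0.3 * z[t - 1] + rng.normal()   # x depends on past z
    z[t] = 0.2 * x[t - 1] + 0.4 * z[t - 1] + rng.normal()   # z depends on past x

df = pd.DataFrame({"x": x, "z": z})
res = VAR(df).fit(maxlags=2, ic="aic")                  # choose lag length by AIC
print(res.summary())
print(res.forecast(df.values[-res.k_ar:], steps=5))     # joint forecast of both series
```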
Box-Jenkins method
● In reality, we will not know the underlying model of a time series. We still want to
make predictions and Box-Jenkins proposes a step-by-step way to do this.
○ 1. Identify a model
○ 2. Estimate the parameters
○ 3. Critical evaluation
○ 4. Usage (forecasting)
■ AIC → penalizes an excessive number of parameters → choose the model with the lowest AIC or BIC
● AIC = 2*(number of parameters) - 2*(log likelihood)
■ BIC → penalizes an excessive number of parameters → choose the model with the lowest AIC or BIC
● BIC = ln(number of obs.)*(number of parameters) - 2*(log likelihood)
Identification
● Check ACF and PACF
Stage 2 → Estimation
Objective → Find a stationary and parsimonious (few variables) model that fits the data
● Model selection criteria
○ Significance of the ARMA components
○ Compare AIC and BIC
○ Comment on Adj-R2
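A closing sketch of the Box-Jenkins loop on simulated data: fit a few candidate ARIMA orders, compare AIC/BIC, then check the residuals of the chosen model with Ljung-Box. The candidate orders are arbitrary examples.

```python
# Box-Jenkins style comparison: fit candidate ARIMA orders, compare AIC/BIC,
# then check the residuals of the best model with the Ljung-Box test.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(15)
y = np.cumsum(rng.normal(size=300))          # simulated series

candidates = [(1, 1, 0), (0, 1, 1), (1, 1, 1)]
fits = {order: ARIMA(y, order=order).fit() for order in candidates}

for order, res in fits.items():
    print(order, res.aic, res.bic)           # lower AIC/BIC is preferred

best = min(fits.values(), key=lambda r: r.aic)
print(acorr_ljungbox(best.resid, lags=[10]))  # residuals should look like white noise
```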