A Project Report on

Methods to address Multicollinearity

INDIAN INSTITUTE OF TECHNOLOGY KANPUR
MTH689A: Linear and Non-Linear Models
(Department of Mathematics and Statistics)

Submitted by –
Abhishek Verma (211255)
Krishna Pratap Mall (211320)
Shiv Varun Maurya (211380)
Abhinav Shukla (211415)

Under Guidance of – Dr. Satya Prakash Singh
MULTICOLLINEARITY

One of the main assumptions underlying multiple regression models is independence of the regressor variables.

If there is no linear relationship between the regressors, they are said to be orthogonal. When the regressors are orthogonal, inferences can be made relatively easily.

Unfortunately, in most real-life applications of regression, the regressors are not always orthogonal. Sometimes the lack of orthogonality is not serious. However, in some situations the regressors are nearly perfectly linearly related, and in such cases the inferences based on the regression model can be misleading.

When there are near-linear dependencies among the regressors, the problem of multicollinearity is said to exist.
• There are four primary sources of multicollinearity:

1. The data collection method employed

2. Constraints on the model or in the population

3. Model specification

4. An over-defined model

MULTICOLLINEARITY

Effects of Multicollinearity:

1. Severe multicollinearity can increase the variance of the coefficient estimates, making them unstable and difficult to interpret.

2. It can produce regression coefficients with large standard errors, which can damage the prediction capability of the model.

3. There can be other problems, such as a significant variable becoming insignificant, or regression coefficients appearing with signs opposite to what is expected.
DATA DESCRIPTION

• The dataset we have selected shows 'Factors affecting Life Expectancy' for 193 countries over a span of 15 years, from 2000 to 2015.

(A sample from the dataset is shown in the adjoining figure.)

• Number of columns: 24
  Number of rows: 2928
  Number of independent columns: 23
  Number of dependent columns: 1 (Life Expectancy)

• The column 'Status' is categorical and the rest of the columns are numerical.
CORRELATION HEATMAP

In the adjoining correlation heatmap, we can see that the dark blue coloured boxes show high positive correlation between the respective columns.

Some of them are:

1. Percentage expenditure and GDP
2. Hepatitis B and Vitamin deficiency
3. Diphtheria and Polio, etc.
FITTING MLR MODEL

Firstly, we did feature engineering on the dataset so that it becomes ready for model deployment. Then we fit an MLR model for the chosen dataset, taking Life Expectancy as the dependent variable and the rest of the variables as regressors.
FITTED MLR MODEL

Observations from the MLR output:

• The coefficients of some regressors, like Polio, Infant deaths, Diphtheria and others, do not make sense with the life expectancy variable: they are positive in sign when they should be negative. This is due to the effect of multicollinearity in our data.

• Some of the coefficient estimates are too large, e.g. Under-five deaths, HIV/AIDS and Infant deaths.

• The standard errors of Infant deaths, Measles and Under-five deaths are quite high because these variables are highly correlated with some subset of the explanatory variables.
EVALUATION METRICS FOR THE MLR MODEL

• After doing the train-test split, we fitted the model and computed the adjusted R-square and RMSE on both the train and test sets to see how the MLR performs.

• As we can observe from the adjoining figure:

1. For the train set:
   Adj. R-square = 0.85
   RMSE = 3.68

2. For the test set:
   Adj. R-square = 0.85
   RMSE = 3.49
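A rough sketch of how these metrics can be computed in R is given below; the 80/20 split proportion and the object names (life, Life.Expectancy) are assumptions for illustration, carried over from the earlier sketch.

  set.seed(1)                                   # for a reproducible split
  idx    <- sample(nrow(life), 0.8 * nrow(life))
  train  <- life[idx, ]
  test   <- life[-idx, ]
  fit    <- lm(Life.Expectancy ~ ., data = train)
  adj_r2 <- summary(fit)$adj.r.squared          # adjusted R-square of the training fit
  pred   <- predict(fit, newdata = test)
  rmse   <- sqrt(mean((test$Life.Expectancy - pred)^2))   # test RMSE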
DETECTING MULTICOLLINEARITY

• There are several methods for detecting the presence of multicollinearity in the model. Some of them are:

1. Examination of the correlation matrix.
2. Calculating the Variance Inflation Factor (VIF).


1. CORRELATION HEATMAP

In the adjoining correlation heatmap, we can see that the dark blue coloured boxes show high positive correlation between the respective columns.

Some of them are:

1. Percentage expenditure and GDP
2. Hepatitis B and Vitamin deficiency
3. Diphtheria and Polio, etc.
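A minimal sketch of how such a heatmap can be produced in R is shown below, assuming the data are in a data frame named life (an illustrative name).

  num_cols <- life[sapply(life, is.numeric)]                 # keep only the numeric columns
  corr_mat <- cor(num_cols, use = "pairwise.complete.obs")   # pairwise correlation matrix
  heatmap(corr_mat, symm = TRUE)                             # base-R heatmap of the correlations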
2. VARIANCE INFLATION FACTOR (VIF)

• It is a statistical concept that indicates the increase in the variance of a regression coefficient as a result of collinearity.

• The VIF for the j-th regressor is defined as:

  VIF_j = 1 / (1 - R_j^2),   j = 1, 2, ..., p,

  where R_j^2 is the multiple R^2 obtained from regressing X_j on the other regressors.

• A VIF value of 5 or more is an indicator of multicollinearity.

• Large values of VIF indicate multicollinearity, leading to poor estimates of the associated regression coefficients.
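A sketch of the VIF computation in R is shown below, either through the car package or directly from the definition; the names life, Life.Expectancy and the choice of GDP as the example regressor are ours, for illustration only.

  library(car)
  vif(lm(Life.Expectancy ~ ., data = life))      # VIF for every regressor in the MLR fit

  # Equivalently, for a single regressor (GDP here), from the definition VIF_j = 1 / (1 - R_j^2):
  r2_gdp  <- summary(lm(GDP ~ . - Life.Expectancy, data = life))$r.squared
  vif_gdp <- 1 / (1 - r2_gdp)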
VARIANCE INFLATION FACTORS (VIFs)

We can see that the following variables have a VIF greater than 5:

1. Infant deaths
2. Hepatitis B
3. Under-five deaths
4. Diphtheria
5. GDP
6. Thinness 1 to 19 years
7. Thinness 5 to 9 years
8. Vitamin deficiency
9. Consumption index

(The VIF of each independent variable is shown in the adjoining table.)

DEALING WITH MULTICOLLINEARITY

• There are several methods for dealing with the problem of multicollinearity, such as:

1. Merging regressors which have the same properties.
2. Dropping regressors involved in multicollinearity (those with VIF > 5).
3. Ridge regression.
4. Principal Component Regression (PCR).

RIDGE REGRESSION

• The OLSE b = (X'X)^(-1) X'y is the best linear unbiased estimator of the regression coefficient β, in the sense that it has minimum variance in the class of linear and unbiased estimators. However, if the condition of unbiasedness can be relaxed, then it is possible to find a biased estimator of β, say β̂, that has a smaller variance than the unbiased OLSE b.

• The mean squared error (MSE) of β̂ is

  MSE(β̂) = E(β̂ - β)'(β̂ - β) = trace(Cov(β̂)) + [Bias(β̂)]'[Bias(β̂)].

  Thus MSE(β̂) can be made smaller than that of b by introducing a small bias in β̂. One of the approaches to do so is ridge regression.
• The ridge regression estimator is obtained by solving a modified version of the least squares normal equations. The normal equations are modified as:

  (X'X + kI) β̂_R(k) = X'y,  i.e.  β̂_R(k) = (X'X + kI)^(-1) X'y,

  where β̂_R(k) is the ridge regression estimator of β and k ≥ 0 is a characterizing scalar termed the biasing parameter.

• As k → 0, β̂_R(k) → b (the OLSE), and as k → ∞, β̂_R(k) → 0.
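A minimal numeric sketch of this estimator in R on simulated data is given below; the data, the induced collinearity and the value of k are all made up for illustration.

  set.seed(1)
  n <- 100; p <- 5
  X <- matrix(rnorm(n * p), n, p)
  X[, 2] <- X[, 1] + rnorm(n, sd = 0.05)                    # make columns 1 and 2 nearly collinear
  y <- X %*% c(2, 0, 1, -1, 0.5) + rnorm(n)
  k <- 0.5                                                  # biasing parameter
  beta_ols   <- solve(t(X) %*% X, t(X) %*% y)               # OLSE, unstable under collinearity
  beta_ridge <- solve(t(X) %*% X + k * diag(p), t(X) %*% y) # (X'X + kI)^(-1) X'y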

• Bias of the ridge regression estimator: The bias of β̂_R(k) is

  Bias(β̂_R(k)) = E(β̂_R(k)) - β = [(X'X + kI)^(-1) X'X - I] β = -k (X'X + kI)^(-1) β.

  Thus, the ridge regression estimator is a biased estimator of β.

• Covariance matrix: The covariance matrix of β̂_R(k) is defined as Cov(β̂_R(k)) = E[(β̂_R(k) - E(β̂_R(k))) (β̂_R(k) - E(β̂_R(k)))']. On solving, we get

  Cov(β̂_R(k)) = σ² (X'X + kI)^(-1) X'X (X'X + kI)^(-1).

• Mean squared error: The mean squared error of β̂_R(k) is

  MSE(β̂_R(k)) = σ² Σ_j λ_j / (λ_j + k)² + k² β' (X'X + kI)^(-2) β,

  where λ_1, λ_2, ..., λ_p are the eigenvalues of X'X.

• Thus as k increases, the bias in β̂_R(k) increases but its variance decreases. The trade-off between bias and variance therefore hinges upon the value of k, i.e. there exists a value of k such that MSE(β̂_R(k)) < Var(b).
Choice of k:

The ridge regression estimator depends upon the value of k. The value of k can be chosen on the basis of criteria like
 the stability of the estimators with respect to k,
 reasonable signs of the estimated coefficients,
 the magnitude of the residual sum of squares, etc.

RIDGE TRACE:
• The ridge trace is the graphical display of the ridge regression estimator β̂_R(k) versus k.
• If multicollinearity is present and severe, then the instability of the regression coefficients is reflected in the ridge trace.
• As k increases, some of the ridge estimates vary dramatically, and they stabilize at some value of k.
• The objective in the ridge trace is to inspect the trace (curve) and find a reasonably small value of k at which the ridge regression estimators are stable.
• The ridge regression estimator with such a choice of k will have a smaller MSE than the variance of the OLSE.
FITTING RIDGE REGRESSION MODEL

• We have used the glmnet function in R to fit our ridge model.
• We have found the optimum value of the tuning parameter (lambda) for the ridge regression model, which is 0.7651, using K-fold cross-validation (see the sketch below).
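A sketch of the fitting calls is given below; the object names (life, x_mat, y_vec) are assumed, and alpha = 0 is what selects pure ridge in glmnet.

  library(glmnet)
  x_mat  <- model.matrix(Life.Expectancy ~ ., data = life)[, -1]  # design matrix without the intercept column
  y_vec  <- life$Life.Expectancy
  cv_fit <- cv.glmnet(x_mat, y_vec, alpha = 0)    # K-fold (default 10-fold) cross-validation over a lambda grid
  cv_fit$lambda.min                               # CV-optimal lambda (0.7651 in our run)
  ridge_fit <- glmnet(x_mat, y_vec, alpha = 0)
  plot(ridge_fit, xvar = "lambda", label = TRUE)  # ridge trace: coefficient paths versus log(lambda)
  coef(ridge_fit, s = cv_fit$lambda.min)          # coefficients at the chosen lambda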

• Our fitted ridge model for the optimum lambda value is:

  ŷ = 63.11 + 1.31 x1 - 1.82 x2 - 0.09 x3 - 0.16 x4 + 1.51 x5 - 0.43 x6 - 2.45 x7 + 0.13 x8 - 6.23 x9 + 0.91 x10 + 0.04 x11 + 1.03 x12 - 22.9 x13 + 0.8 x14 + 2.44 x15 - 0.15 x16 - 0.54 x17 + 1.38 x18 + 1.26 x19 - 0.02 x20 + 1.58 x21

  (x1, ..., x21 denote the regressors in the order of the adjoining coefficient table, which lists the variable corresponding to each coefficient.)
• As we can observe from the adjoining table, some coefficients which were positive in the MLR model are now negative in sign, and some coefficients that were large in the MLR model are now small in the ridge model.

• But we can also observe that the Diphtheria and Polio variables still have a positive sign, indicating that ridge regression cannot fully resolve multicollinearity.
RIDGE TRACE:

• The adjoining figure is the ridge trace plot. On the X-axis we have the values of log(lambda) (the tuning parameter), and on the Y-axis we have the coefficients of the independent variables, depicted with different colours.

• Interpretation: From the plot, we can observe that for log(lambda) between -0.5 and 0 (that is, lambda roughly between 0.6 and 1), reasonably stable coefficients are achieved.

• (We do not want to choose lambda unnecessarily large, because that would introduce additional bias and increase the residual mean square.)
EVALUATION METRICS FOR THE RIDGE MODEL

• Train model:
  R-square = 0.84
  RMSE = 3.76

• Test model:
  R-square = 0.85
  RMSE = 3.52
PRINCIPAL COMPONENT REGRESSION (PCR)

• Principal component analysis, or PCA, is a dimensionality reduction method that is often used to reduce the dimensionality of large data sets by transforming a large set of variables into a smaller one that still contains most of the information in the large set.

• It is a linear transformation that chooses a new coordinate system for the data set such that the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component), the second greatest variance on the second axis, and so on.
Let X = (X1, X2, ..., Xp)' be a p-dimensional random vector with variance-covariance matrix Σ = ((σij)).

Total variability in X = trace(Σ) = σ11 + σ22 + ... + σpp.

PCA aims to replace X with Y(p×1) = (Y1, Y2, ..., Yp)' such that

i. the Yi's are uncorrelated, i.e. cov(Yi, Yj) = 0 ∀ i ≠ j;
ii. total variation in X = total variation in Y;
iii. total variation in (Y1, Y2, ..., Yk)' ≈ total variation in Y, for k << p.
Principal Components:

Principal components are the uncorrelated linear combinations Y1, Y2, ..., Yp (derived from X1, X2, ..., Xp) whose variances are in decreasing order, with Y1 explaining the maximum variability, Y2 the second highest variability, and so on.

i.e. the 1st PC, Y1, is the linear combination Y1 = a1'X that maximizes Var(a1'X) subject to a1'a1 = 1.
The 2nd PC, Y2, is the linear combination Y2 = a2'X that maximizes Var(a2'X) subject to a2'a2 = 1 and cov(a1'X, a2'X) = 0.
:
:
The ith PC, Yi, is the linear combination Yi = ai'X that maximizes Var(ai'X) subject to ai'ai = 1 and cov(ak'X, ai'X) = 0 ∀ k < i.
RESULT: Let Σ be the var-cov matrix associated with the random vector X(p×1).

Let the eigenvalue-eigenvector (orthonormal) pairs of Σ be (λ1, e1), ..., (λp, ep), where λ1 ≥ λ2 ≥ ... ≥ λp.

Then the ith PC is given by Yi = ei'X ∀ i = 1(1)p,

with Var(Yi) = λi ∀ i = 1(1)p

and cov(Yi, Yj) = 0 ∀ i ≠ j.
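This result can be checked numerically in R, for example with prcomp; the sketch below assumes that, after feature engineering, the regressors are the numeric columns of a data frame named life (our illustrative name).

  x_num <- scale(life[ , setdiff(names(life), "Life.Expectancy")])  # centred and scaled regressors
  pca   <- prcomp(x_num)          # PCs from the eigen-decomposition of the covariance matrix of x_num
  summary(pca)                    # proportion of variance explained by each PC (scree information)
  round(cov(pca$x), 3)            # covariance of the scores: diagonal, i.e. the PCs are uncorrelated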
These are the obtained PCs, and we can see that they are uncorrelated. The variance explained by the individual PCs can be seen in the scree plot.

• The first 7 components explain more than 93% of the total variance.

• We fit 5 MLR models, using the first 5 to 9 PCs respectively as the regressors (a sketch of one such fit follows).
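The sketch below reuses the pca object from the previous sketch and takes the first 7 PCs as an example; the choice of 7 and the object names are ours.

  m        <- 7
  scores   <- as.data.frame(pca$x[, 1:m])                    # first m principal component scores
  pcr_data <- cbind(Life.Expectancy = life$Life.Expectancy, scores)
  pcr_fit  <- lm(Life.Expectancy ~ ., data = pcr_data)       # MLR on the PC scores
  sqrt(mean(residuals(pcr_fit)^2))                           # in-sample RMSE for this choice of m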

EVALUATION METRICS FOR THE PCR MODEL

• With the first 5 PCs: RMSE = 5.072
• With the first 6 PCs: RMSE = 5.006
• With the first 7 PCs: RMSE = 5.000
• With the first 8 PCs: RMSE = 4.954
• With the first 9 PCs: RMSE = 4.970
Conclusion
