16 Linear Regression - Cp, AIC, BIC, and Adjusted R2 - PCA


Linear Regression

Cp, AIC, BIC, and Adjusted R2


Prof. Asim Tewari
IIT Bombay
ME 781: Engineering Data Mining and Applications
Cp
• For a fitted least squares model containing d predictors, the Cp estimate of test MSE is computed using the equation

C_p = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right)

• where \hat{\sigma}^2 is an estimate of the variance of the error associated with each response measurement.

AIC
• The AIC criterion is defined for a large class of models fit by maximum likelihood. In the case of the model

Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon

with Gaussian errors, maximum likelihood and least squares are the same thing.

BIC
• BIC is derived from a Bayesian point of view, but ends up looking similar to Cp (and AIC) as well. For the least squares model with d predictors, the BIC is, up to irrelevant constants, given by

\mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\, d\hat{\sigma}^2\right)

Adjusted R2

\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)}
Choosing the Optimal Model
For a fitted least squares model containing d predictors:

• Cp:  C_p = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right)

• Akaike information criterion (AIC):  \mathrm{AIC} = \frac{1}{n\hat{\sigma}^2}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right)

• Bayesian information criterion (BIC):  \mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\, d\hat{\sigma}^2\right)

• Adjusted R2:  \text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)}

where \hat{\sigma}^2 is an estimate of the variance of the error associated with each response measurement, and TSS is the total sum of squares.
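
A minimal Python sketch that evaluates all four criteria from RSS, TSS, the sample size n, the predictor count d, and an error-variance estimate (the helper name selection_criteria is an assumption for illustration; in practice \hat{\sigma}^2 is typically estimated from the residuals of the model containing all predictors):

```python
import numpy as np

def selection_criteria(rss, tss, n, d, sigma2_hat):
    """Cp, AIC, BIC, and adjusted R^2 for a least squares fit
    with d predictors, following the formulas above."""
    cp = (rss + 2 * d * sigma2_hat) / n
    aic = (rss + 2 * d * sigma2_hat) / (n * sigma2_hat)
    bic = (rss + np.log(n) * d * sigma2_hat) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, aic, bic, adj_r2
```

Cp, AIC, and BIC estimate test error, so smaller values are better; adjusted R2 works in the opposite direction, so larger values are better.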

Subset Selection
• We cannot try all 2^p models that contain subsets of p variables.
• Thus we need strategies for choosing the subset.
• Various statistics can be used to judge the quality of a model. These include
– Mallow's Cp, Akaike information criterion (AIC),
– Bayesian information criterion (BIC), and
– Adjusted R2

Strategies for choosing the subset
1. Best-Subset Selection:
Best subset regression finds, for each k ∈ {0, 1, 2, . . . , p}, the subset of size k that gives the smallest residual sum of squares (see the sketch below).
2. Forward- and Backward-Stepwise Selection:
Forward-stepwise selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit.
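
A minimal numpy sketch of the exhaustive search in item 1 (the function name best_subset is an assumption for illustration; the loop over all subsets of each size is exactly the 2^p enumeration that becomes infeasible for large p):

```python
import itertools
import numpy as np

def best_subset(X, y):
    """For each subset size k, keep the predictor subset whose
    least squares fit attains the smallest training RSS."""
    n, p = X.shape
    best = {}
    for k in range(1, p + 1):
        best_rss, best_cols = np.inf, None
        for cols in itertools.combinations(range(p), k):
            Xk = np.column_stack([np.ones(n), X[:, list(cols)]])  # intercept + subset
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            rss = float(np.sum((y - Xk @ beta) ** 2))
            if rss < best_rss:
                best_rss, best_cols = rss, cols
        best[k] = (best_cols, best_rss)
    return best
```

The winners of each size would then be compared using cross-validation, Cp, AIC, BIC, or adjusted R2, since training RSS always decreases as k grows.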

Subset Selection
• Best Subset Selection Algorithm
1. Let M_0 denote the null model, which contains no predictors.
2. For k = 1, 2, . . . , p: fit all models that contain exactly k predictors, and let M_k denote the one with the smallest RSS (equivalently, the largest R2).
3. Select the single best model from among M_0, . . . , M_p using cross-validated prediction error, Cp, AIC, BIC, or adjusted R2.
Subset Selection
• Best Subset Selection

Figure: For each possible model containing a subset of the ten predictors in the Credit data set, the RSS and R2 are displayed. The red frontier tracks the best model for a given number of predictors, according to RSS and R2. Though the data set contains only ten predictors, the x-axis ranges from 1 to 11, since one of the variables is categorical and takes on three values, leading to the creation of two dummy variables.
Strategies for choosing the subset
1. Forward selection
• We begin with the null model—a model that
contains an intercept but no predictors.
• We then fit p simple linear regressions and add to
the null model the variable that results in the
lowest RSS.
• We then add to that model the variable that
results in the lowest RSS for the new two-variable
model.
• This approach is continued until some stopping
rule is satisfied.
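
A numpy sketch of this greedy loop (the function name forward_selection and the stopping rule of a fixed number of steps are assumptions for illustration):

```python
import numpy as np

def forward_selection(X, y, n_steps):
    """Starting from the null model, repeatedly add the predictor
    whose inclusion gives the lowest training RSS."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    for _ in range(min(n_steps, p)):
        best_rss, best_j = np.inf, None
        for j in remaining:
            Xk = np.column_stack([np.ones(n), X[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            rss = float(np.sum((y - Xk @ beta) ** 2))
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```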
Subset Selection
• Stepwise Selection
– Forward Stepwise Selection

Strategies for choosing the subset
2. Backward selection
• We start with all variables in the model, and remove the variable with the largest p-value, that is, the variable that is the least statistically significant.
• The new (p − 1)-variable model is fit, and the variable with the largest p-value is removed.
• This procedure continues until a stopping rule is reached.
• For instance, we may stop when all remaining variables have a p-value below some threshold.
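
A sketch of this procedure using statsmodels (the function name backward_selection and the 0.05 default threshold are assumptions for illustration):

```python
import numpy as np
import statsmodels.api as sm

def backward_selection(X, y, threshold=0.05):
    """Repeatedly drop the least significant predictor until every
    remaining p-value is below the threshold."""
    cols = list(range(X.shape[1]))
    while cols:
        fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
        pvals = np.asarray(fit.pvalues)[1:]   # skip the intercept's p-value
        worst = int(np.argmax(pvals))
        if pvals[worst] <= threshold:         # every survivor is significant
            break
        cols.pop(worst)                       # remove the least significant
    return cols
```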
Subset Selection
• Stepwise Selection
– Backward Stepwise Selection

Strategies for choosing the subset
3. Mixed selection
• This is a combination of forward and backward selection.
• We start with no variables in the model, and as with forward
selection, we add the variable that provides the best fit.
• We continue to add variables one-by-one.
• The p-values for variables already in the model can become larger as new predictors are added.
• Hence, if at any point the p-value for one of the variables in the
model rises above a certain threshold, then we remove that
variable from the model.
• We continue to perform these forward and backward steps until all
variables in the model have a sufficiently low p-value, and all
variables outside the model would have a large p-value if added to
the model.
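
A sketch combining both kinds of steps (the function name mixed_selection and the two thresholds are assumptions for illustration; a production implementation would add guards against the add/drop cycle repeating indefinitely):

```python
import numpy as np
import statsmodels.api as sm

def mixed_selection(X, y, add_thresh=0.05, drop_thresh=0.10):
    """Forward steps chosen by RSS, each followed by backward drops
    of any variable whose p-value has risen above drop_thresh."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        # Forward step: the candidate whose inclusion gives the smallest RSS.
        rss = {j: sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().ssr
               for j in remaining}
        j_best = min(rss, key=rss.get)
        fit = sm.OLS(y, sm.add_constant(X[:, selected + [j_best]])).fit()
        if np.asarray(fit.pvalues)[-1] > add_thresh:
            break                              # no significant candidate: stop
        selected.append(j_best)
        remaining.remove(j_best)
        # Backward step: drop variables whose p-values grew too large.
        while selected:
            pvals = np.asarray(sm.OLS(y, sm.add_constant(X[:, selected]))
                               .fit().pvalues)[1:]
            worst = int(np.argmax(pvals))
            if pvals[worst] <= drop_thresh:
                break
            remaining.append(selected.pop(worst))
    return selected
```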

Subset Selection
• Stepwise Selection
– Hybrid Approach
• Variables are added to the model sequentially, in
analogy to forward selection.
• However, after adding each new variable, the method
may also remove any variables that no longer provide
an improvement in the model fit.
• Such an approach attempts to more closely mimic best
subset selection while retaining the computational
advantages of forward and backward stepwise
selection.

Linear Model Selection
and Regularization
• Shrinkage. This approach involves fitting a model
involving all p predictors. However, the estimated
coefficients are shrunken towards zero relative to
the least squares estimates. This shrinkage (also
known as regularization) has the effect of
reducing variance. Depending on what type of
shrinkage is performed, some of the coefficients
may be estimated to be exactly zero. Hence,
shrinkage methods can also perform variable
selection.

Shrinkage
• By retaining a subset of the predictors and
discarding the rest, subset selection produces
a model that is interpretable and has possibly
lower prediction error than the full model.
• However, because it is a discrete process (variables are either retained or discarded), it often exhibits high variance, and so may not reduce the prediction error of the full model.
• Shrinkage methods are more continuous, and
don’t suffer as much from high variability.

Shrinkage Methods
1. Ridge Regression
• Ridge regression shrinks the regression coefficients by imposing a penalty on their size. The ridge coefficients minimize a penalized residual sum of squares,

\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}

• Here λ ≥ 0 is a complexity parameter that controls the amount of shrinkage: the larger the value of λ, the greater the amount of shrinkage.
Ridge and Lasso Regression

[Figures comparing Ridge Regression (left) and Lasso Regression (right)]
Shrinkage Methods
• An equivalent way to write the ridge problem is

\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t

• There is a one-to-one correspondence between the parameters λ and t.
Ridge Regression
• Ridge regression is very similar to least squares, except that the ridge coefficients are estimated by minimizing a slightly different quantity:

\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2

• where λ ≥ 0 is a tuning parameter. The second term, \lambda \sum_j \beta_j^2, is called a shrinkage penalty; it is small when \beta_1, \ldots, \beta_p are close to zero.
Shrinkage Methods
2. The lasso
• Lasso is a shrinkage method like ridge, with subtle but important differences. The lasso estimate is defined by

\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t
Shrinkage Methods
• We can also write the lasso problem in the equivalent Lagrangian form

\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta} \left\{ \frac{1}{2} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}
Summary
• Ridge regression does a proportional shrinkage.
• Lasso translates each coefficient by a constant factor λ, truncating at zero; this is called "soft thresholding".
• Best-subset selection drops all variables with coefficients smaller than the Mth largest; this is a form of "hard thresholding".
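
In the special case of orthonormal predictors all three rules have closed forms; a small numpy illustration, with beta, lam, and M as assumed example values:

```python
import numpy as np

beta = np.array([-3.0, -1.2, -0.4, 0.3, 0.9, 2.5])  # example OLS coefficients
lam = 1.0   # shrinkage amount
M = 2       # subset size for hard thresholding

# Ridge: proportional shrinkage.
ridge = beta / (1.0 + lam)

# Lasso: soft thresholding, translate toward zero and truncate at zero.
soft = np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

# Best subset: hard thresholding, keep only the M largest in magnitude.
hard = np.where(np.abs(beta) >= np.sort(np.abs(beta))[-M], beta, 0.0)

print(ridge, soft, hard, sep="\n")
```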

Dimensionality reduction
• Feature selection: Feature selection approaches try to find a subset of the original variables (also called features or attributes)
– Filter strategy (e.g. information gain)
– Wrapper strategy (e.g. search guided by accuracy)
– Embedded strategy (features are added or removed while building the model, based on the prediction errors)

• Feature projection: Feature projection transforms the data from the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist. For multidimensional data, tensor representations can be used in dimensionality reduction through multilinear subspace learning.
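
A short scikit-learn sketch of the filter strategy, scoring features by mutual information (closely related to information gain) on the built-in iris data; keeping k=2 features is an assumption for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Score each feature independently of any downstream model, keep the top k.
selector = SelectKBest(mutual_info_classif, k=2).fit(X, y)
print(selector.scores_)                     # information score per feature
print(selector.get_support(indices=True))   # indices of the selected features
```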

Feature projection
• Principal component analysis (PCA)
• Non-negative matrix factorization (NMF)
• Kernel PCA
• Graph-based kernel PCA
• Linear discriminant analysis (LDA)
• Generalized discriminant analysis (GDA)
• Autoencoder

Principal Component Analysis

• With X the centered data matrix and W the matrix whose columns are the leading principal component directions (the eigenvectors of the covariance matrix of X), the projected data are

Z = XW

• Inverse PCA yields the reconstruction

\hat{X} = ZW^T

(adding back the mean that was subtracted during centering), which equals X exactly when all components are kept and approximates X otherwise.
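
A short scikit-learn sketch of projection and reconstruction (the toy data, which lie close to a 2-D plane inside 3-D space, are assumed for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 200 points in 3-D generated from 2 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 3)) + 0.05 * rng.normal(size=(200, 3))

pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)               # the projected data, Z = XW (after centering)
X_hat = pca.inverse_transform(Z)   # inverse PCA: X_hat = Z W^T + mean

print(pca.explained_variance_ratio_)   # variance captured by each component
print(np.mean((X - X_hat) ** 2))       # small reconstruction error
```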
