16 Linear Regression - Cp, AIC, BIC, and Adjusted R2 - PCA


Linear Regression

Cp, AIC, BIC, and Adjusted R2


Prof. Asim Tewari
IIT Bombay
ME 781: Engineering Data Mining and Applications
Cp
• For a fitted least squares model containing d predictors, the Cp estimate of test MSE is computed using the equation

C_p = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right)

• where \hat{\sigma}^2 is an estimate of the variance of the error associated with each response measurement.

AIC
• The AIC criterion is defined for a large class of models fit by maximum likelihood. In the case of the model

Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon

with Gaussian errors, maximum likelihood and least squares are the same thing.

BIC
• BIC is derived from a Bayesian point of view, but ends up looking similar to Cp (and AIC) as well. For the least squares model with d predictors, the BIC is, up to irrelevant constants, given by

\mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\, d\hat{\sigma}^2\right)

Adjusted R2

\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)}
Choosing the Optimal Model
For a fitted least squares model containing d predictors:

• Cp:  C_p = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right)

• Akaike information criterion (AIC):  \mathrm{AIC} = \frac{1}{n\hat{\sigma}^2}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right)

• Bayesian information criterion (BIC):  \mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\, d\hat{\sigma}^2\right)

• Adjusted R2:  \text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)}

where \hat{\sigma}^2 is an estimate of the variance of the error associated with each response measurement, and TSS is the total sum of squares.
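
A minimal Python sketch that evaluates all four criteria from RSS, TSS, the sample size n, the predictor count d, and an error-variance estimate (the helper name selection_criteria is an assumption for illustration; in practice \hat{\sigma}^2 is typically estimated from the residuals of the model containing all predictors):

```python
import numpy as np

def selection_criteria(rss, tss, n, d, sigma2_hat):
    """Cp, AIC, BIC, and adjusted R^2 for a least squares fit
    with d predictors, following the formulas above."""
    cp = (rss + 2 * d * sigma2_hat) / n
    aic = (rss + 2 * d * sigma2_hat) / (n * sigma2_hat)
    bic = (rss + np.log(n) * d * sigma2_hat) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, aic, bic, adj_r2
```

Cp, AIC, and BIC estimate test error, so smaller values are better; adjusted R2 works in the opposite direction, so larger values are better.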

Subset Selection
• We cannot try all 2^p models that contain subsets of p variables.
• Thus we need strategies for choosing the subset.
• Various statistics can be used to judge the quality of a model. These include
– Mallow's Cp, Akaike information criterion (AIC),
– Bayesian information criterion (BIC), and
– Adjusted R2

Strategies for choosing the subset
1. Best-Subset Selection:
Best subset regression finds, for each k ∈ {0, 1, 2, . . . , p}, the subset of size k that gives the smallest residual sum of squares (see the sketch below).
2. Forward- and Backward-Stepwise Selection:
Forward-stepwise selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit.
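
A minimal numpy sketch of the exhaustive search in item 1 (the function name best_subset is an assumption for illustration; the loop over all subsets of each size is exactly the 2^p enumeration that becomes infeasible for large p):

```python
import itertools
import numpy as np

def best_subset(X, y):
    """For each subset size k, keep the predictor subset whose
    least squares fit attains the smallest training RSS."""
    n, p = X.shape
    best = {}
    for k in range(1, p + 1):
        best_rss, best_cols = np.inf, None
        for cols in itertools.combinations(range(p), k):
            Xk = np.column_stack([np.ones(n), X[:, list(cols)]])  # intercept + subset
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            rss = float(np.sum((y - Xk @ beta) ** 2))
            if rss < best_rss:
                best_rss, best_cols = rss, cols
        best[k] = (best_cols, best_rss)
    return best
```

The winners of each size would then be compared using cross-validation, Cp, AIC, BIC, or adjusted R2, since training RSS always decreases as k grows.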

Subset Selection
• Best Subset Selection Algorithm
1. Let M_0 denote the null model, which contains no predictors.
2. For k = 1, 2, . . . , p: fit all models that contain exactly k predictors, and let M_k denote the one with the smallest RSS (equivalently, the largest R2).
3. Select the single best model from among M_0, . . . , M_p using cross-validated prediction error, Cp, AIC, BIC, or adjusted R2.
Subset Selection
• Best Subset Selection

Figure: For each possible model containing a subset of the ten predictors in the Credit data set, the RSS and R2 are displayed. The red frontier tracks the best model for a given number of predictors, according to RSS and R2. Though the data set contains only ten predictors, the x-axis ranges from 1 to 11, since one of the variables is categorical and takes on three values, leading to the creation of two dummy variables.
Strategies for choosing the subset
1. Forward selection
• We begin with the null model—a model that
contains an intercept but no predictors.
• We then fit p simple linear regressions and add to
the null model the variable that results in the
lowest RSS.
• We then add to that model the variable that
results in the lowest RSS for the new two-variable
model.
• This approach is continued until some stopping
rule is satisfied.
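
A numpy sketch of this greedy loop (the function name forward_selection and the stopping rule of a fixed number of steps are assumptions for illustration):

```python
import numpy as np

def forward_selection(X, y, n_steps):
    """Starting from the null model, repeatedly add the predictor
    whose inclusion gives the lowest training RSS."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    for _ in range(min(n_steps, p)):
        best_rss, best_j = np.inf, None
        for j in remaining:
            Xk = np.column_stack([np.ones(n), X[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            rss = float(np.sum((y - Xk @ beta) ** 2))
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```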
Subset Selection
• Stepwise Selection
– Forward Stepwise Selection

Strategies for choosing the subset
2. Backward selection
• We start with all variables in the model, and remove the variable with the largest p-value, that is, the variable that is the least statistically significant.
• The new (p − 1)-variable model is fit, and the variable with the largest p-value is removed.
• This procedure continues until a stopping rule is reached.
• For instance, we may stop when all remaining variables have a p-value below some threshold.
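
A sketch of this procedure using statsmodels (the function name backward_selection and the 0.05 default threshold are assumptions for illustration):

```python
import numpy as np
import statsmodels.api as sm

def backward_selection(X, y, threshold=0.05):
    """Repeatedly drop the least significant predictor until every
    remaining p-value is below the threshold."""
    cols = list(range(X.shape[1]))
    while cols:
        fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
        pvals = np.asarray(fit.pvalues)[1:]   # skip the intercept's p-value
        worst = int(np.argmax(pvals))
        if pvals[worst] <= threshold:         # every survivor is significant
            break
        cols.pop(worst)                       # remove the least significant
    return cols
```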
Subset Selection
• Stepwise Selection
– Backward Stepwise Selection

Strategies for choosing the subset
3. Mixed selection
• This is a combination of forward and backward selection.
• We start with no variables in the model, and as with forward
selection, we add the variable that provides the best fit.
• We continue to add variables one-by-one.
• The p-values for variables already in the model can become larger as new predictors are added.
• Hence, if at any point the p-value for one of the variables in the
model rises above a certain threshold, then we remove that
variable from the model.
• We continue to perform these forward and backward steps until all
variables in the model have a sufficiently low p-value, and all
variables outside the model would have a large p-value if added to
the model.
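
A sketch combining both kinds of steps (the function name mixed_selection and the two thresholds are assumptions for illustration; a production implementation would add guards against the add/drop cycle repeating indefinitely):

```python
import numpy as np
import statsmodels.api as sm

def mixed_selection(X, y, add_thresh=0.05, drop_thresh=0.10):
    """Forward steps chosen by RSS, each followed by backward drops
    of any variable whose p-value has risen above drop_thresh."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        # Forward step: the candidate whose inclusion gives the smallest RSS.
        rss = {j: sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().ssr
               for j in remaining}
        j_best = min(rss, key=rss.get)
        fit = sm.OLS(y, sm.add_constant(X[:, selected + [j_best]])).fit()
        if np.asarray(fit.pvalues)[-1] > add_thresh:
            break                              # no significant candidate: stop
        selected.append(j_best)
        remaining.remove(j_best)
        # Backward step: drop variables whose p-values grew too large.
        while selected:
            pvals = np.asarray(sm.OLS(y, sm.add_constant(X[:, selected]))
                               .fit().pvalues)[1:]
            worst = int(np.argmax(pvals))
            if pvals[worst] <= drop_thresh:
                break
            remaining.append(selected.pop(worst))
    return selected
```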

Subset Selection
• Stepwise Selection
– Hybrid Approach
• Variables are added to the model sequentially, in
analogy to forward selection.
• However, after adding each new variable, the method
may also remove any variables that no longer provide
an improvement in the model fit.
• Such an approach attempts to more closely mimic best
subset selection while retaining the computational
advantages of forward and backward stepwise
selection.

Linear Model Selection
and Regularization
• Shrinkage. This approach involves fitting a model
involving all p predictors. However, the estimated
coefficients are shrunken towards zero relative to
the least squares estimates. This shrinkage (also
known as regularization) has the effect of
reducing variance. Depending on what type of
shrinkage is performed, some of the coefficients
may be estimated to be exactly zero. Hence,
shrinkage methods can also perform variable
selection.

Shrinkage
• By retaining a subset of the predictors and
discarding the rest, subset selection produces
a model that is interpretable and has possibly
lower prediction error than the full model.
• However, because it is a discrete process (variables are either retained or discarded), it often exhibits high variance, and so may not reduce the prediction error of the full model.
• Shrinkage methods are more continuous, and
don’t suffer as much from high variability.

Shrinkage Methods
1. Ridge Regression
• Ridge regression shrinks the regression coefficients by imposing a penalty on their size. The ridge coefficients minimize a penalized residual sum of squares,

\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}

• Here λ ≥ 0 is a complexity parameter that controls the amount of shrinkage: the larger the value of λ, the greater the amount of shrinkage.
Ridge and Lasso Regression

[Figures comparing Ridge Regression (left) and Lasso Regression (right)]
Shrinkage Methods
• An equivalent way to write the ridge problem is

\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t

• There is a one-to-one correspondence between the parameters λ and t.
Ridge Regression
• Ridge regression is very similar to least squares, except that the ridge coefficients are estimated by minimizing a slightly different quantity:

\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2

• where λ ≥ 0 is a tuning parameter. The second term, \lambda \sum_j \beta_j^2, is called a shrinkage penalty; it is small when \beta_1, \ldots, \beta_p are close to zero.
Shrinkage Methods
2. The lasso
• Lasso is a shrinkage method like ridge, with subtle but important differences. The lasso estimate is defined by

\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t
Shrinkage Methods
• We can also write the lasso problem in the equivalent Lagrangian form

\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta} \left\{ \frac{1}{2} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}
Summary
• Ridge regression does a proportional shrinkage.
• Lasso translates each coefficient by a constant factor λ, truncating at zero; this is called "soft thresholding".
• Best-subset selection drops all variables with coefficients smaller than the Mth largest; this is a form of "hard thresholding".
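
In the special case of orthonormal predictors all three rules have closed forms; a small numpy illustration, with beta, lam, and M as assumed example values:

```python
import numpy as np

beta = np.array([-3.0, -1.2, -0.4, 0.3, 0.9, 2.5])  # example OLS coefficients
lam = 1.0   # shrinkage amount
M = 2       # subset size for hard thresholding

# Ridge: proportional shrinkage.
ridge = beta / (1.0 + lam)

# Lasso: soft thresholding, translate toward zero and truncate at zero.
soft = np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

# Best subset: hard thresholding, keep only the M largest in magnitude.
hard = np.where(np.abs(beta) >= np.sort(np.abs(beta))[-M], beta, 0.0)

print(ridge, soft, hard, sep="\n")
```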

Dimensionality reduction
• Feature selection: Feature selection approaches try to find a subset of the original variables (also called features or attributes)
– Filter strategy (e.g. information gain)
– Wrapper strategy (e.g. search guided by accuracy)
– Embedded strategy (features are added or removed while building the model, based on the prediction errors)

• Feature projection: Feature projection transforms the data from the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist. For multidimensional data, tensor representations can be used in dimensionality reduction through multilinear subspace learning.
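
A short scikit-learn sketch of the filter strategy, scoring features by mutual information (closely related to information gain) on the built-in iris data; keeping k=2 features is an assumption for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Score each feature independently of any downstream model, keep the top k.
selector = SelectKBest(mutual_info_classif, k=2).fit(X, y)
print(selector.scores_)                     # information score per feature
print(selector.get_support(indices=True))   # indices of the selected features
```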

Feature projection
• Principal component analysis (PCA)
• Non-negative matrix factorization (NMF)
• Kernel PCA
• Graph-based kernel PCA
• Linear discriminant analysis (LDA)
• Generalized discriminant analysis (GDA)
• Autoencoder

Principal Component Analysis

• With X the centered data matrix and W the matrix whose columns are the leading principal component directions (the eigenvectors of the covariance matrix of X), the projected data are

Z = XW

• Inverse PCA yields the reconstruction

\hat{X} = ZW^T

(adding back the mean that was subtracted during centering), which equals X exactly when all components are kept and approximates X otherwise.
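
A short scikit-learn sketch of projection and reconstruction (the toy data, which lie close to a 2-D plane inside 3-D space, are assumed for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 200 points in 3-D generated from 2 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 3)) + 0.05 * rng.normal(size=(200, 3))

pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)               # the projected data, Z = XW (after centering)
X_hat = pca.inverse_transform(Z)   # inverse PCA: X_hat = Z W^T + mean

print(pca.explained_variance_ratio_)   # variance captured by each component
print(np.mean((X - X_hat) ** 2))       # small reconstruction error
```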
