16 Linear Regression - CP, AIC, BIC, and Adjusted R2 - PCA
Asim Tewari, IIT Bombay ME 781: Engineering Data Mining and Applications
Cp
• For a fitted least squares model containing d predictors, the Cp estimate of the test MSE is computed as
C_p = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right),
where \hat{\sigma}^2 is an estimate of the variance of the error associated with each response measurement.
AIC
• The AIC criterion is defined for a large class of models fit by maximum likelihood. In the case of a linear model with Gaussian errors, maximum likelihood and least squares are the same thing, and AIC is given, up to irrelevant constants, by
\mathrm{AIC} = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right),
so for least squares models Cp and AIC are proportional to each other.
BIC
• BIC is derived from a Bayesian point of view, but ends up looking similar to Cp (and AIC) as well. For the least squares model with d predictors, the BIC is, up to irrelevant constants, given by
\mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\, d\, \hat{\sigma}^2\right).
• Since \log n > 2 for any n > 7, BIC places a heavier penalty on models with many variables than Cp does.
Adjusted R2
• For a least squares model with d variables, the adjusted R2 statistic is calculated as
\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)}.
• Unlike Cp, AIC, and BIC, for which a small value indicates low test error, a large value of adjusted R2 indicates a model with a small test error.
Choosing the Optimal Model
For a fitted least squares model containing d predictors:
• C_p = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right)
• \text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)}
where \hat{\sigma}^2 is an estimate of the variance of the error associated with each response measurement.
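As a minimal sketch, the criteria above can be computed directly from a fitted model's residuals; the function below is illustrative (NumPy only, with d and \hat{\sigma}^2 supplied by the caller):

```python
import numpy as np

def selection_criteria(y, y_hat, d, sigma2_hat):
    """Cp, AIC, BIC, and adjusted R^2 for a least squares fit with d
    predictors, using the formulas above (up to irrelevant constants)."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
    tss = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    cp = (rss + 2 * d * sigma2_hat) / n
    aic = cp                              # proportional to Cp for least squares
    bic = (rss + np.log(n) * d * sigma2_hat) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, aic, bic, adj_r2
```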
Subset Selection
• We cannot try all the 2^p models that contain subsets of the p variables.
• Thus we need strategies for choosing the subset.
• Various statistics can be used to judge the quality of a model. These include
– Mallows' Cp, Akaike information criterion (AIC),
– Bayesian information criterion (BIC), and
– Adjusted R2
Strategies for choosing the subset
1. Best-Subset Selection:
For each k ∈ {0, 1, 2, ..., p}, best subset regression finds the subset of size k that gives the smallest residual sum of squares; one of the statistics above is then used to choose among the resulting models.
2. Forward- and Backward-Stepwise Selection:
Forward-stepwise selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit; backward-stepwise selection starts with the full model and sequentially removes the least useful predictor.
Subset Selection
• Best Subset Selection Algorithm
1. Let M0 denote the null model, which contains no predictors.
2. For k = 1, 2, ..., p:
(a) Fit all (p choose k) models that contain exactly k predictors.
(b) Pick the best among them, i.e. the one with the smallest RSS (equivalently, the largest R2), and call it Mk.
3. Select the single best model from among M0, ..., Mp using cross-validated prediction error, Cp, AIC, BIC, or adjusted R2.
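A minimal NumPy sketch of this algorithm (function and variable names are illustrative); each subset of each size is scored by RSS, which is only feasible for small p:

```python
import numpy as np
from itertools import combinations

def best_subset(X, y):
    """For each size k, fit all C(p, k) least squares models and keep
    the one with the smallest RSS (step 2 of the algorithm above)."""
    n, p = X.shape
    best = {}
    for k in range(1, p + 1):
        best_rss, best_vars = np.inf, None
        for subset in combinations(range(p), k):
            Xk = np.column_stack([np.ones(n), X[:, list(subset)]])  # add intercept
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            rss = np.sum((y - Xk @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_vars = rss, subset
        best[k] = (best_vars, best_rss)
    return best  # compare across k with Cp, AIC, BIC, or adjusted R^2
```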
Subset Selection
• Best Subset Selection
For each possible model containing a subset of the ten predictors in the Credit data set, the RSS and R2 are displayed. The red frontier tracks the best model for a given number of predictors, according to RSS and R2. Though the data set contains only ten predictors, the x-axis ranges from 1 to 11, since one of the variables is categorical and takes on three values, leading to the creation of two dummy variables.
Strategies for choosing the subset
1. Forward selection
• We begin with the null model—a model that
contains an intercept but no predictors.
• We then fit p simple linear regressions and add to
the null model the variable that results in the
lowest RSS.
• We then add to that model the variable that
results in the lowest RSS for the new two-variable
model.
• This approach is continued until some stopping
rule is satisfied.
Subset Selection
• Stepwise Selection
– Forward Stepwise Selection Algorithm
1. Let M0 denote the null model, which contains no predictors.
2. For k = 0, 1, ..., p − 1:
(a) Consider all p − k models that augment the predictors in Mk with one additional predictor.
(b) Choose the best among these p − k models, i.e. the one with the smallest RSS, and call it Mk+1.
3. Select the single best model from among M0, ..., Mp using cross-validated prediction error, Cp, AIC, BIC, or adjusted R2.
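A minimal NumPy sketch of forward stepwise selection (names illustrative); at each step the predictor that most reduces the RSS is added:

```python
import numpy as np

def forward_stepwise(X, y):
    """Greedily add, at each step, the predictor that most reduces
    the RSS (steps 2(a)-(b) of the algorithm above)."""
    n, p = X.shape
    selected, remaining, path = [], list(range(p)), []
    for _ in range(p):
        rss_best, j_best = np.inf, None
        for j in remaining:
            cols = selected + [j]
            Xk = np.column_stack([np.ones(n), X[:, cols]])  # intercept + chosen columns
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            rss = np.sum((y - Xk @ beta) ** 2)
            if rss < rss_best:
                rss_best, j_best = rss, j
        selected.append(j_best)
        remaining.remove(j_best)
        path.append((list(selected), rss_best))
    return path  # M_1, ..., M_p; choose among them with Cp, AIC, BIC, etc.
```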
Strategies for choosing the subset
2. Backward selection
• We start with all variables in the model, and remove the variable with the largest p-value, that is, the variable that is the least statistically significant.
• The new (p − 1)-variable model is fit, and the variable with the largest p-value is again removed.
• This procedure continues until a stopping rule is reached. For instance, we may stop when all remaining variables have a p-value below some threshold.
Subset Selection
• Stepwise Selection
– Backward Stepwise Selection Algorithm
1. Let Mp denote the full model, which contains all p predictors.
2. For k = p, p − 1, ..., 1:
(a) Consider all k models that contain all but one of the predictors in Mk, for a total of k − 1 predictors.
(b) Choose the best among these k models, i.e. the one with the smallest RSS, and call it Mk−1.
3. Select the single best model from among M0, ..., Mp using cross-validated prediction error, Cp, AIC, BIC, or adjusted R2.
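A sketch of the p-value-based backward elimination described above, using statsmodels for the OLS p-values (assumes X is a pandas DataFrame; the 0.05 threshold is illustrative):

```python
import statsmodels.api as sm

def backward_eliminate(X, y, threshold=0.05):
    """Drop the least significant variable (largest p-value) one at a
    time until every remaining p-value is below the threshold."""
    cols = list(X.columns)
    while cols:
        fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
        pvals = fit.pvalues.drop("const")   # ignore the intercept
        worst = pvals.idxmax()
        if pvals[worst] < threshold:        # stopping rule from the slide
            break
        cols.remove(worst)                  # remove the least significant variable
    return cols
```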
Strategies for choosing the subset
3. Mixed selection
• This is a combination of forward and backward selection.
• We start with no variables in the model, and as with forward
selection, we add the variable that provides the best fit.
• We continue to add variables one-by-one.
• The p-values for variables already in the model can sometimes become larger as new predictors are added.
• Hence, if at any point the p-value for one of the variables in the model rises above a certain threshold, we remove that variable from the model.
• We continue to perform these forward and backward steps until all
variables in the model have a sufficiently low p-value, and all
variables outside the model would have a large p-value if added to
the model.
Subset Selection
• Stepwise Selection
– Hybrid Approach
• Variables are added to the model sequentially, in
analogy to forward selection.
• However, after adding each new variable, the method
may also remove any variables that no longer provide
an improvement in the model fit.
• Such an approach attempts to more closely mimic best
subset selection while retaining the computational
advantages of forward and backward stepwise
selection.
Linear Model Selection
and Regularization
• Shrinkage. This approach involves fitting a model
involving all p predictors. However, the estimated
coefficients are shrunken towards zero relative to
the least squares estimates. This shrinkage (also
known as regularization) has the effect of
reducing variance. Depending on what type of
shrinkage is performed, some of the coefficients
may be estimated to be exactly zero. Hence,
shrinkage methods can also perform variable
selection.
Shrinkage
• By retaining a subset of the predictors and
discarding the rest, subset selection produces
a model that is interpretable and has possibly
lower prediction error than the full model.
• However, because it is a discrete process (variables are either retained or discarded), it often exhibits high variance, and so often does not reduce the prediction error of the full model.
• Shrinkage methods are more continuous, and
don’t suffer as much from high variability.
Shrinkage Methods
1. Ridge Regression
• Ridge regression shrinks the regression coefficients by imposing a penalty on their size. The ridge coefficients minimize a penalized residual sum of squares,
\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\},
where \lambda \ge 0 is a complexity parameter that controls the amount of shrinkage.
Shrinkage Methods
• An equivalent way to write the ridge problem is
\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t,
which makes the size constraint on the parameters explicit.
Ridge Regression
• Ridge regression is very similar to least squares, except that the ridge regression coefficients are estimated by minimizing a slightly different quantity.
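With the predictors standardized and the response centered, the penalized criterion above has a closed-form solution, \hat{\beta}^{\mathrm{ridge}} = (X^T X + \lambda I)^{-1} X^T y. A minimal NumPy sketch (the standardization step and \lambda are illustrative):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution beta = (X'X + lam*I)^{-1} X'y.
    Assumes X is standardized and y centered, so no intercept is penalized."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Standardize predictors and center the response before fitting,
# since the ridge penalty is not invariant to scaling:
# Xs = (X - X.mean(0)) / X.std(0); yc = y - y.mean()
```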
Shrinkage Methods
2. The lasso
• The lasso is a shrinkage method like ridge, with subtle but important differences. The lasso estimate is defined by
\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t.
Shrinkage Methods
• We can also write the lasso problem in the equivalent Lagrangian form
\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta} \left\{ \frac{1}{2} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}.
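A minimal scikit-learn sketch of the lasso's variable-selection effect (the data and alpha are illustrative; sklearn's alpha plays the role of \lambda, up to its 1/(2n) scaling of the squared-error term):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)  # only two true predictors

Xs = StandardScaler().fit_transform(X)  # the lasso penalty is scale-sensitive
fit = Lasso(alpha=0.1).fit(Xs, y)
print(fit.coef_)                         # several coefficients are exactly zero
```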
Summary
• Ridge regression does a proportional shrinkage of the coefficients.
• The lasso translates each coefficient toward zero by a constant factor λ, truncating at zero. This is called "soft thresholding."
• Best-subset selection drops all variables with coefficients smaller than the Mth largest; this is a form of "hard thresholding."
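Minimal sketches of the two thresholding rules compared above (names illustrative):

```python
import numpy as np

def soft_threshold(beta, lam):
    """Lasso-style soft thresholding: shrink toward zero by lam, truncate at zero."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

def hard_threshold(beta, m):
    """Best-subset-style hard thresholding: keep only the m largest coefficients."""
    out = np.zeros_like(beta)
    keep = np.argsort(np.abs(beta))[-m:]
    out[keep] = beta[keep]
    return out

beta = np.array([3.0, -1.5, 0.4, -0.2, 2.0])
print(soft_threshold(beta, 1.0))   # approx. [2, -0.5, 0, 0, 1]
print(hard_threshold(beta, 2))     # [3, 0, 0, 0, 2]
```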
Dimensionality reduction
• Feature selection: Feature selection approaches try to find a subset of the original variables (also called features or attributes), for example via a
– Filter strategy (e.g. information gain; see the sketch below),
– Wrapper strategy (e.g. search guided by accuracy), or
– Embedded strategy (features are selected to be added or removed while building the model, based on the prediction errors).
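As a small illustration of the filter strategy, a mutual-information score (an information-gain-style measure) can rank features without fitting a predictive model; the data here are synthetic:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 0] + X[:, 3] + rng.normal(size=200)

# Score each feature independently of any downstream model.
scores = mutual_info_regression(X, y)
ranked = np.argsort(scores)[::-1]
print(ranked)   # features 0 and 3 should rank highest
```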
Feature projection
• Principal component analysis (PCA)
• Non-negative matrix factorization (NMF)
• Kernel PCA
• Graph-based kernel PCA
• Linear discriminant analysis (LDA)
• Generalized discriminant analysis (GDA)
• Autoencoder
Principal Component Analysis
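PCA projects the data onto the orthogonal directions of maximal variance. A minimal NumPy sketch, computing the components via the SVD of the centered data matrix (function and variable names are illustrative):

```python
import numpy as np

def pca(X, k):
    """PCA via SVD of the centered data matrix: rows of Vt are the
    principal directions; projecting onto the first k gives the scores."""
    Xc = X - X.mean(axis=0)                  # center each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                      # principal directions (loadings)
    scores = Xc @ components.T               # projected data
    explained = S**2 / np.sum(S**2)          # fraction of variance per component
    return scores, components, explained[:k]

# Example: project 3-D data onto its first two principal components.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3)) @ np.array([[2, 0, 0], [0, 1, 0], [0, 0, 0.1]])
scores, comps, var = pca(X, 2)
print(var)   # most of the variance lies in the first two components
```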