All Possible Selection (Self-Study) : Objectives

All Possible Selection (Self-Study)

Objectives

• Explain the REG procedure options for all possible model selection.
• Describe model selection options and interpret output to evaluate the fit
of several models.

Copyright © SAS Institute Inc. All rights reserved.

Model Selection

Data set contains eight interval variables as potential predictors.

Possible Option #1:
Use a form of Stepwise Selection by hand or with assistance from SAS.

Possible Option #2:
Explore all possible models and determine “best.”


A process for selecting models might be to start with all the interval variables in the
STAT1.ameshousing3 data set and invoke some form of stepwise selection discussed in previous
sections. This could be done by hand or with the assistance of SAS.

Copyright © 2017, SAS Institute Inc., Cary, North Carolina, USA. ALL RIGHTS RESERVED.
4-4 Chapter 4 Model Building and Effect Selection

An alternative option would be to explore all possible models that can be built from the predictor
variables provided and determine which is “best.” This method of all possible selection can be performed
using PROC REG.

Model Selection Options

The SELECTION= option in the MODEL statement of PROC REG supports these model selection techniques:

Stepwise selection methods
• STEPWISE, FORWARD, or BACKWARD using significance level

All-possible regressions ranked using
• RSQUARE, ADJRSQ, or CP

SELECTION=NONE is the default.


RSQUARE, ADJRSQ, CP Selection Options

Variables in       Total Number of
Full Model (k)     Subset Models (2^k)
0                  1
1                  2
2                  4
3                  8
4                  16
5                  32


In the STAT1.ameshousing3 data set, there are eight possible independent variables. Therefore, there
are 2^8 = 256 possible regression models (including the intercept-only model). There are eight possible
one-variable models, 28 possible two-variable models, 56 possible three-variable models, and so on.
You can choose to look at only the best n models (as measured by the model R-square for k=1, 2, 3, …, 7)
by using the BEST= option in the MODEL statement. The BEST= option only reduces the output. All
regressions are still calculated.
If there were 20 possible independent variables, there would be more than 1,000,000 models
(2^20 = 1,048,576).
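The counts quoted above can be checked with a short DATA step. This is only an illustrative sketch: the variable names (k, total, size, n_models) are made up for the example, and it relies on the standard COMB function to count the subsets of each size.

```sas
/* Count the candidate models for k = 8 predictors. */
data _null_;
   k = 8;
   total = 2**k;                  /* each predictor is either in or out of the model */
   put 'Total subset models: ' total;
   do size = 0 to k;
      n_models = comb(k, size);   /* number of models with exactly SIZE predictors */
      put size= n_models=;
   end;
run;
```

The log should report 256 models in total, with 8, 28, and 56 models for one, two, and three predictors. Changing k to 20 gives the count of more than one million mentioned above.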

Mallows’ Cp

• Mallows’ Cp is a simple indicator of effective variable selection within a model.
• Look for models with Cp ≤ p, where p equals the number of parameters in the model, including
the intercept.
Mallows recommends choosing the first (fewest variables) model where Cp approaches p.


Mallows’ Cp (1973) is estimated by

$$C_p = p + \frac{(MSE_p - MSE_{full})(n - p)}{MSE_{full}}$$

where
$MSE_p$ is the mean squared error for the model with p parameters.
$MSE_{full}$ is the mean squared error for the full model used to estimate the true residual variance.
n is the number of observations.
p is the number of parameters, including an intercept parameter, if estimated.
The choice of the best model based on Cp is debatable, as will be shown in the slide about Hocking’s
criterion. Many choose the model with the smallest Cp value. However, Mallows recommended choosing
a model whose Cp value approximates p. The most parsimonious model that meets that criterion
is generally considered to be a good choice, although subject-matter knowledge should also be a guide
in the selection from among competing models.
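As a rough illustration of the Cp formula, the following DATA step evaluates it for one subset model and for the full model. All of the numbers (n, mse_full, mse_p, and the parameter counts) are hypothetical values chosen for the example, not output from the course data.

```sas
/* Evaluate Cp = p + (MSE_p - MSE_full)(n - p) / MSE_full for made-up inputs. */
data _null_;
   n        = 100;    /* hypothetical number of observations          */
   mse_full = 4.0;    /* hypothetical MSE of the full model           */

   p     = 4;         /* subset model: 3 predictors plus intercept    */
   mse_p = 4.5;       /* hypothetical MSE of the subset model         */
   cp    = p + (mse_p - mse_full) * (n - p) / mse_full;
   put 'Subset model: ' p= cp=;

   p     = 9;         /* full model: 8 predictors plus intercept      */
   mse_p = mse_full;  /* the full model is its own reference          */
   cp    = p + (mse_p - mse_full) * (n - p) / mse_full;
   put 'Full model:   ' p= cp=;
run;
```

With these inputs the subset model has Cp = 16, well above p = 4, and the full model has Cp = 9. The second result holds in general: because the full model's MSE equals MSE_full, its Cp always equals its own p.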


Hocking’s Criterion versus Mallows’ Cp

Hocking (1976) suggests selecting a model based on the following:
• Cp ≤ p for prediction
• Cp ≤ 2p − pfull + 1 for parameter estimation

Hocking suggested the use of the Cp statistic, but with alternative criteria, depending on the purpose of the
analysis. His criterion for parameter estimation (Cp ≤ 2p − pfull + 1) is included in the REG procedure’s
criteria reference plots for best models.
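To see how the two criteria compare, the limits can be tabulated for each model size. The sketch below assumes pfull = 9 (eight predictors plus an intercept, as in the ameshousing3 demo); the data set name thresholds and its variable names are made up for the example.

```sas
/* Tabulate the Mallows and Hocking upper limits on Cp for each model size. */
data thresholds;
   p_full = 9;                            /* assumed: 8 predictors + intercept */
   do p = 2 to p_full;
      mallows_limit = p;                  /* Cp <= p              (prediction)           */
      hocking_limit = 2*p - p_full + 1;   /* Cp <= 2p - pfull + 1 (parameter estimation) */
      output;
   end;
run;

proc print data=thresholds noobs;
run;
```

For every model smaller than the full model, the Hocking limit is at or below the Mallows limit, so it is the stricter of the two. Only at p = pfull does it exceed p (10 versus 9 here).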


Demo: All Possible Model Selection

Example: Invoke PROC REG to produce a regression of SalePrice on all the other interval variables
in the STAT1.ameshousing3 data set.
Note: Currently, stepwise, forward, and backward are the only three selection methods that can be
chosen in the SAS Studio task. To perform model selection using a method other than these three,
either manually edit the generated code or write the code directly.
/* Macro variable that lists the eight interval predictors */
%let interval=Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area
              Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom;

/*st104d03.sas*/ /*Part A*/
ods graphics on;

/* Fit all possible subsets, ranked by R-square; also print adjusted R-square and Cp */
proc reg data=STAT1.ameshousing3 plots(only)=(rsquare adjrsq cp);
   ALLPOSS: model SalePrice=&interval
            / selection=rsquare adjrsq cp;
   title "All Possible Model Selection for SalePrice";
run;
quit;
Selected MODEL statement options:
SELECTION=   enables you to choose among the selection methods RSQUARE, ADJRSQ, and CP.
             The first listed method is the one that determines the sorting order in the output.
Selected SELECTION= option methods:
RSQUARE      tells PROC REG to use the model R-square to rank the models from best to worst
             for a given number of variables.
ADJRSQ       prints the adjusted R-square for each model.
CP           prints Mallows' Cp statistic for each model.
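Because the first listed method controls the sort order, listing ADJRSQ first ranks the models by adjusted R-square instead. The following variation is a sketch, not part of the course program: it assumes the &interval macro variable defined above, and the ADJ label, the BEST=10 value, and the title are arbitrary choices for illustration.

```sas
/* Rank all possible models by adjusted R-square and also print Cp. */
proc reg data=STAT1.ameshousing3;
   ADJ: model SalePrice=&interval
        / selection=adjrsq cp best=10;   /* ADJRSQ listed first, so it sets the ranking */
   title "Ten Best Models Ranked by Adjusted R-Square";
run;
quit;
```

Unlike SELECTION=RSQUARE, which ranks models within each model size, a ranking on adjusted R-square compares models of different sizes in a single list.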

Partial PROC REG Output

Number of Observations Read   300
Number of Observations Used   300

Model   Number                Adjusted
Index   in Model   R-Square   R-Square   C(p)       Variables in Model
 1      1          0.4755     0.4737     510.5367   Basement_Area
 2      1          0.4231     0.4212     591.1052   Gr_Liv_Area
 3      1          0.3787     0.3767     659.3108   Age_Sold
 4      1          0.3605     0.3584     687.3455   Total_Bathroom
 5      1          0.3351     0.3329     726.3533   Garage_Area
 6      1          0.1935     0.1908     944.1662   Deck_Porch_Area
 7      1          0.0642     0.0610     1143.019   Lot_Area
 8      1          0.0275     0.0243     1199.378   Bedroom_AbvGr
 9      2          0.6725     0.6703     209.5529   Gr_Liv_Area Age_Sold
10      2          0.6249     0.6224     282.7594   Gr_Liv_Area Basement_Area
11      2          0.6148     0.6122     298.3135   Basement_Area Age_Sold
12      2          0.6027     0.6000     316.9559   Basement_Area Garage_Area
13      2          0.5708     0.5679     365.9609   Gr_Liv_Area Garage_Area
14      2          0.5680     0.5651     370.2769   Basement_Area Total_Bathroom

There are many models to compare. It would be unwieldy to try to determine the best model by viewing
the output tables. Therefore, it is advisable to look at the ODS plots.

[Figure: Fit Criterion for SalePrice. R-Square plotted against Number of Parameters. Legend: Best Model Evaluated at Number of Parameters.]

The R-square plot compares all models based on their R-square values. As noted earlier, adding variables
to a model never decreases R-square, so the full model always has the highest R-square. Therefore, you
can use the R-square value only to compare models with equal numbers of parameters.


[Figure: Fit Criterion for SalePrice. Adjusted R-Square plotted against Number of Parameters. Legend: Best Model Evaluated at Number of Parameters.]

The adjusted R-square does not have the problem that the R-square has. You can compare models of
different sizes. In this case, it is difficult to see which model has the higher adjusted R-square, the starred
model for seven parameters or eight parameters.
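The adjusted R-square values in the output can be reproduced from R-square with the usual definition, adjusted R-square = 1 − (1 − R-square)(n − 1)/(n − p), where p counts the intercept. This formula is the standard one for a model with an intercept and is not stated in the course text. The sketch below applies it to three rows copied from the tables above; the data set and variable names are made up for the example.

```sas
/* Recompute adjusted R-square from R-square for three models in the output. */
data adj_check;
   n = 300;                              /* observations used in the demo        */
   input num_in_model rsq reported_adjrsq;
   p = num_in_model + 1;                 /* parameters = predictors + intercept  */
   adjrsq = 1 - (1 - rsq) * (n - 1) / (n - p);
   datalines;
1 0.4755 0.4737
2 0.6725 0.6703
8 0.8108 0.8056
;
run;

proc print data=adj_check noobs;
   format adjrsq 6.4;
run;
```

The computed values should agree with the reported ones up to rounding, since the R-square inputs are themselves rounded to four decimals. The penalty factor (n − 1)/(n − p) grows with p, which is why adjusted R-square can be compared across models of different sizes.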


[Figure: Fit Criterion for SalePrice. Mallows C(p) plotted against Number of Parameters. Legend: Mallows, Hocking, Best Model Evaluated at Number of Parameters.]

The line Cp = p is plotted to help you identify models that satisfy the criterion Cp ≤ p for prediction.
The lower line is plotted to help identify which models satisfy Hocking's criterion Cp ≤ 2p − pfull + 1 for
parameter estimation.
Use the graph and review the output to select a relatively short list of models that satisfy the criterion
appropriate for your objective. The best models are often difficult to see because the models with very
large Cp values stretch the vertical axis. Those high-Cp models are clearly not the best, so you can focus
on the models near the bottom of the range of Cp.
/*st104d03.sas*/ /*Part B*/
proc reg data=STAT1.ameshousing3 plots(only)=(cp);
   ALLPOSS: model SalePrice=&interval
            / selection=cp rsquare adjrsq best=20;
   title "Best Models Using All Possible Selection for SalePrice";
run;
quit;
Selected MODEL statement option:
BEST=n limits the output to only the best n models.


Model   Number                           Adjusted
Index   in Model   C(p)       R-Square   R-Square   Variables in Model
 1      8           9.0000    0.8108     0.8056     Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom
 2      7           9.4754    0.8091     0.8046     Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area Lot_Area Age_Sold Bedroom_AbvGr
 3      7          11.8765    0.8076     0.8030     Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area Age_Sold Bedroom_AbvGr Total_Bathroom
 4      6          12.4745    0.8059     0.8019     Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area Age_Sold Bedroom_AbvGr
 5      7          15.1956    0.8054     0.8008     Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area Lot_Area Age_Sold Total_Bathroom
 6      6          15.7530    0.8038     0.7997     Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area Age_Sold Total_Bathroom
 7      6          16.4459    0.8033     0.7993     Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area Lot_Area Age_Sold
 8      5          17.0005    0.8017     0.7983     Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area Age_Sold
 9      7          22.5339    0.8007     0.7959     Gr_Liv_Area Basement_Area Garage_Area Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom
10      6          23.7403    0.7986     0.7944     Gr_Liv_Area Basement_Area Garage_Area Lot_Area Age_Sold Bedroom_AbvGr
11      6          25.8313    0.7972     0.7931     Gr_Liv_Area Basement_Area Garage_Area Age_Sold Bedroom_AbvGr Total_Bathroom
12      5          27.1943    0.7950     0.7915     Gr_Liv_Area Basement_Area Garage_Area Age_Sold Bedroom_AbvGr
13      6          32.9173    0.7926     0.7884     Gr_Liv_Area Basement_Area Garage_Area Lot_Area Age_Sold Total_Bathroom
14      5          33.3028    0.7911     0.7875     Gr_Liv_Area Basement_Area Garage_Area Age_Sold Total_Bathroom
15      5          35.3618    0.7897     0.7861     Gr_Liv_Area Basement_Area Garage_Area Lot_Area Age_Sold
16      4          35.7387    0.7882     0.7853     Gr_Liv_Area Basement_Area Garage_Area Age_Sold
17      7          37.7677    0.7907     0.7857     Gr_Liv_Area Basement_Area Deck_Porch_Area Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom
18      6          39.3019    0.7885     0.7841     Gr_Liv_Area Basement_Area Deck_Porch_Area Lot_Area Age_Sold Bedroom_AbvGr
19      6          45.8708    0.7842     0.7798     Gr_Liv_Area Basement_Area Deck_Porch_Area Age_Sold Bedroom_AbvGr Total_Bathroom
20      5          47.7363    0.7817     0.7780     Gr_Liv_Area Basement_Area Deck_Porch_Area Age_Sold Bedroom_AbvGr


Investigate the plot of Mallows’ C(p).

[Figure: Fit Criterion for SalePrice. Mallows C(p), from about 10 to 50, plotted against Number of Parameters, from 5 to 9. Legend: Mallows, Hocking, Best Model Evaluated at Number of Parameters.]

In this example, the number of parameters in the full model, pfull, equals 9 (eight variables plus the
intercept).
The smallest model that falls under the Hocking line has p = 9, the full model. This model also has a Cp
value that is exactly equal to p, so it falls directly on the Mallows line. (The full model always has Cp = p,
because its mean squared error is MSEfull itself.) From this information, the full model appears to be
a potential model for both prediction and variable explanation. This result is likely to change
if additional continuous predictors are included in the analysis.
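The same conclusion can be checked numerically by applying both criteria to a few rows of the Part B output. In the sketch below, the Cp values are copied from models 1, 2, 3, 4, and 8 in the table; the data set name cp_check and the flag variables are made up for the example.

```sas
/* Flag which of the listed models satisfy the Mallows and Hocking criteria. */
data cp_check;
   p_full = 9;                                 /* 8 predictors + intercept        */
   input model_index num_in_model cp;
   p = num_in_model + 1;                       /* parameters including intercept  */
   meets_mallows = (cp <= p);                  /* 1 = yes, 0 = no                 */
   meets_hocking = (cp <= 2*p - p_full + 1);
   datalines;
1 8 9.0000
2 7 9.4754
3 7 11.8765
4 6 12.4745
8 5 17.0005
;
run;

proc print data=cp_check noobs;
run;
```

Only model 1, the full model, is flagged on both criteria. Model 2 comes closest among the others, but its Cp of 9.4754 exceeds both limits, which equal 8 when p = 8.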
If multiple models, sharing the same number of parameters, fall below these lines, there are several
options that can be used to make a decision. First, the analyst can appeal to a subject matter expert who
could potentially provide previous experiences that could “break the tie.” Secondly, other fit statistics
could be used as a comparison between the models. Perhaps one of the models has a higher adjusted
R-square value. Thirdly, the models in question could be compared using a hold-out data set, especially
when the focus is prediction.


Multiple Choice Question

Which value tends to increase (can never decrease) as you add predictor
variables to your regression model?
a. R square
b. Adjusted R square
c. Mallows’ Cp
d. Both a and b
e. F statistic
f. All of the above


Practice

1. Using All-Regression Techniques


Use the STAT1.BodyFat2 data set to identify a set of “best” models.
a. With the SELECTION=CP option, use an all-possible regression technique to identify a set of
candidate models that predict PctBodyFat2 as a function of the variables Age, Weight, Height,
Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist.
Hint: Select only the best 60 models based on Cp to compare.


Solutions
Solution to Practice
1. Using All-Regression Techniques
a. With the SELECTION=CP option, use an all-possible regression technique to identify a set
of candidate models that predict PctBodyFat2 as a function of the variables Age, Weight,
Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist.
Hint: Select only the best 60 models based on Cp to compare.
/*st104s03.sas*/ /*Part A*/
ods graphics / imagemap=on;

proc reg data=STAT1.BodyFat2 plots(only)=(cp);
   model PctBodyFat2=Age Weight Height
                     Neck Chest Abdomen Hip Thigh
                     Knee Ankle Biceps Forearm Wrist
         / selection=cp best=60;
   title "Using Mallows Cp for Model Selection";
run;
quit;


The plot indicates that the best model according to Mallows’ criterion is an eight-parameter
(seven variables plus an intercept) model. The best model according to Hocking’s criterion
has 10 parameters (including the intercept).
A partial table of the 60 models, their C(p) values, and the numbers of variables in the
models is displayed.

Note: Number in Model does not include the intercept in this table.
The best MALLOWS model is one of two eight-parameter models: number 1 (includes the
variables Age, Weight, Neck, Abdomen, Thigh, Forearm, and Wrist) or number 5 (includes
the variables Age, Weight, Neck, Abdomen, Biceps, Forearm, and Wrist).
The best HOCKING model is number 4. It includes Hip, along with all of the variables that appear
in the two best MALLOWS models listed above.


Solution to Quiz

Multiple Choice Question – Correct Answer

Which value tends to increase (can never decrease) as you add predictor
variables to your regression model?
a. R square
b. Adjusted R square
c. Mallows’ Cp
d. Both a and b
e. F statistic
f. All of the above

Correct answer: a. R square. Adding a predictor can never decrease R square, whereas the other statistics can move in either direction.
