All Possible Selection (Self-Study) : Objectives
Objectives
• Explain the REG procedure options for all possible model selection.
• Describe model selection options and interpret output to evaluate the fit
of several models.
Model Selection
A process for selecting models might be to start with all the interval variables in the
STAT1.ameshousing3 data set and invoke some form of stepwise selection discussed in previous
sections. This could be done by hand or with the assistance of SAS.
Copyright © 2017, SAS Institute Inc., Cary, North Carolina, USA. ALL RIGHTS RESERVED.
4-4 Chapter 4 Model Building and Effect Selection
An alternative is to explore all possible models that can be formed from the available predictor variables
and determine which is “best.” This method, all possible selection, can be performed using
PROC REG.
Number of predictor variables   Number of possible models
0                               1
1                               2
2                               4
3                               8
4                               16
5                               32
4.1 All Possible Selection (Self-Study) 4-5
In the STAT1.ameshousing3 data set, there are eight possible independent variables. Therefore, there
are 2^8 = 256 possible regression models. There are eight possible one-variable models, 28 possible
two-variable models, 56 possible three-variable models, and so on.
You can choose to look at only the best n models of each size (as measured by the model R-square
for k=1, 2, 3, …, 7 variables) by using the BEST= option in the MODEL statement. The BEST= option
only reduces the output. All regressions are still calculated.
If there were 20 possible independent variables, there would be more than 1,000,000 models.
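The counts quoted above come from counting the subsets of the candidate predictors. The arithmetic can be sketched as follows; the total of 256 includes the intercept-only model, which contains none of the eight variables.

```latex
\[
\binom{8}{1} = 8, \qquad
\binom{8}{2} = 28, \qquad
\binom{8}{3} = 56, \qquad
\sum_{j=0}^{8} \binom{8}{j} = 2^{8} = 256, \qquad
2^{20} = 1{,}048{,}576 .
\]
```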
Mallows’ Cp
Hocking suggested the use of the Cp statistic, but with alternative criteria, depending on the purpose of the
analysis. His suggestion of Cp ≤ 2p − pfull + 1 is included in the REG procedure’s calculations of criteria
reference plots for best models.
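The slide that defined the statistic is not reproduced in this text. As a reminder, the usual textbook definition is sketched below. The symbols are assumptions introduced here for illustration: SSE_p is the error sum of squares of a candidate model with p parameters (including the intercept), MSE_full is the mean squared error of the full model, and n is the number of observations.

```latex
\[
C_p \;=\; \frac{SSE_p}{MSE_{\mathrm{full}}} \;-\; n \;+\; 2p
\]
% For the full model, SSE_full = (n - p_full) * MSE_full, so
\[
C_{p_{\mathrm{full}}} \;=\; (n - p_{\mathrm{full}}) - n + 2\,p_{\mathrm{full}} \;=\; p_{\mathrm{full}} .
\]
```

Under this definition the full model always has Cp equal to its own number of parameters, so it always lies exactly on the line Cp = p.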
Example: Invoke PROC REG to produce a regression of SalePrice on all the other interval variables
in the STAT1.ameshousing3 data set.
Note: Currently, stepwise, forward, and backward are the only three selection methods that can be
chosen in the SAS Studio task. To perform model selection using a method other than these three,
either manually edit the generated code or write the code directly.
%let interval=Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area
Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom ;
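The program that uses this macro variable to fit all possible models is not reproduced in this text. A minimal sketch of such a step is given here, assuming the STAT1 library is assigned and that the R-square, adjusted R-square, and Cp plots are wanted. The statement label ALL_REG and the title are hypothetical, not taken from the course program.

```sas
/* Sketch only: not the course's own program.                             */
/* Assumes the STAT1 library is assigned and &interval is defined above.  */
ods graphics on;

proc reg data=STAT1.ameshousing3 plots(only)=(rsquare adjrsq cp);
   /* ALL_REG is a hypothetical label.                                    */
   /* SELECTION=RSQUARE fits every subset of the listed predictors;       */
   /* ADJRSQ and CP add those statistics to the listing for each model.   */
   ALL_REG: model SalePrice=&interval / selection=rsquare adjrsq cp;
   title "All Possible Regressions for SalePrice";
run;
quit;
```

Without a BEST= option, every subset appears in the listing, which is why the plots are easier to read than the tables.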
There are many models to compare. It would be unwieldy to try to determine the best model by viewing
the output tables. Therefore, it is advisable to look at the ODS plots.
[Plot: R-Square versus Number of Parameters for all possible models]
The R-square plot compares all models based on their R-square values. As noted earlier, adding variables
to a model never decreases R-square, so the full model always has the highest R-square. Therefore, you can
use the R-square value only to compare models with the same number of parameters.
[Plot: Adjusted R-Square versus Number of Parameters for all possible models]
The adjusted R-square does not share this limitation of R-square, so you can use it to compare models of
different sizes. In this case, it is difficult to see which starred model has the higher adjusted R-square:
the one with seven parameters or the one with eight.
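The reason adjusted R-square can be compared across model sizes is visible in its usual definition, sketched here for a model with an intercept. The symbols n (number of observations) and p (number of parameters, including the intercept) are assumptions introduced for illustration.

```latex
\[
R^2_{\mathrm{adj}} \;=\; 1 \;-\; \frac{n - 1}{n - p}\,\bigl(1 - R^2\bigr)
\]
```

Adding a variable raises R-square but also raises p, which inflates the factor (n − 1)/(n − p). The adjusted value therefore increases only when the gain in R-square outweighs that penalty.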
[Plot: Mallows C(p) versus Number of Parameters for all possible models]
The line Cp = p is plotted to help you identify models that satisfy the criterion Cp ≤ p for prediction.
The lower line is plotted to help identify which models satisfy Hocking's criterion Cp ≤ 2p − pfull + 1 for
parameter estimation.
Use the graph and review the output to select a relatively short list of models that satisfy the criterion
appropriate for your objective. The best model is often difficult to see because the models with very large
Cp values stretch the vertical axis. Those models are clearly not the best, so you can focus on
the models near the bottom of the range of Cp.
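/*st104d03.sas*/ /*Part B*/
/* Request only the Cp plot and list the 20 models with the lowest Cp. */
proc reg data=STAT1.ameshousing3 plots(only)=(cp);
   /* RSQUARE and ADJRSQ add those statistics to the listing. */
   ALLPOSS: model SalePrice=&interval / selection=cp rsquare adjrsq
                  best=20;
   title "Best Models Using All Possible Selection for SalePrice";
run;
quit;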
Selected MODEL statement option:
BEST=n limits the output to only the best n models.
Model  Number                         Adjusted
Index  in Model  C(p)      R-Square   R-Square   Variables in Model
1      8         9.0000    0.8108     0.8056     Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom
2      7         9.4754    0.8091     0.8046     Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area Lot_Area Age_Sold Bedroom_AbvGr
3      7         11.8765   0.8076     0.8030     Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area Age_Sold Bedroom_AbvGr Total_Bathroom
4      6         12.4745   0.8059     0.8019     Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area Age_Sold Bedroom_AbvGr
5      7         15.1956   0.8054     0.8008     Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area Lot_Area Age_Sold Total_Bathroom
6      6         15.7530   0.8038     0.7997     Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area Age_Sold Total_Bathroom
7      6         16.4459   0.8033     0.7993     Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area Lot_Area Age_Sold
8      5         17.0005   0.8017     0.7983     Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area Age_Sold
9      7         22.5339   0.8007     0.7959     Gr_Liv_Area Basement_Area Garage_Area Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom
10     6         23.7403   0.7986     0.7944     Gr_Liv_Area Basement_Area Garage_Area Lot_Area Age_Sold Bedroom_AbvGr
11     6         25.8313   0.7972     0.7931     Gr_Liv_Area Basement_Area Garage_Area Age_Sold Bedroom_AbvGr Total_Bathroom
12     5         27.1943   0.7950     0.7915     Gr_Liv_Area Basement_Area Garage_Area Age_Sold Bedroom_AbvGr
13     6         32.9173   0.7926     0.7884     Gr_Liv_Area Basement_Area Garage_Area Lot_Area Age_Sold Total_Bathroom
14     5         33.3028   0.7911     0.7875     Gr_Liv_Area Basement_Area Garage_Area Age_Sold Total_Bathroom
15     5         35.3618   0.7897     0.7861     Gr_Liv_Area Basement_Area Garage_Area Lot_Area Age_Sold
16     4         35.7387   0.7882     0.7853     Gr_Liv_Area Basement_Area Garage_Area Age_Sold
17     7         37.7677   0.7907     0.7857     Gr_Liv_Area Basement_Area Deck_Porch_Area Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom
18     6         39.3019   0.7885     0.7841     Gr_Liv_Area Basement_Area Deck_Porch_Area Lot_Area Age_Sold Bedroom_AbvGr
19     6         45.8708   0.7842     0.7798     Gr_Liv_Area Basement_Area Deck_Porch_Area Age_Sold Bedroom_AbvGr Total_Bathroom
20     5         47.7363   0.7817     0.7780     Gr_Liv_Area Basement_Area Deck_Porch_Area Age_Sold Bedroom_AbvGr
[Plot: Mallows C(p) versus Number of Parameters for the BEST=20 models]
In this example, the number of parameters in the full model, pfull, equals 9 (eight variables plus the
intercept).
The smallest model that falls under the Hocking line has p = 9, the full model. This model also has a Cp
value exactly equal to p, so it falls directly on the Mallows line. From this information, the full model
appears to be a candidate for both prediction and explanation. This result is likely to change
if additional continuous predictors are included in the analysis.
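The comparison against the two reference lines can also be checked numerically. The sketch below is an illustration, not part of the course program: it enters the lowest-Cp model of each size from the listing (model indexes 1, 2, 4, and 8) and flags whether each meets Cp ≤ p and Cp ≤ 2p − pfull + 1. The data set and variable names are hypothetical.

```sas
/* Sketch: check selected rows of the listing against both criteria.     */
/* Data set and variable names (cp_check, n_vars, ...) are hypothetical. */
data cp_check;
   input model_index n_vars cp;
   p_full        = 9;                     /* eight variables plus the intercept */
   p             = n_vars + 1;            /* parameters = variables + intercept */
   hocking_bound = 2*p - p_full + 1;      /* Hocking: Cp <= 2p - pfull + 1      */
   meets_mallows = (cp <= p);             /* 1 if Cp <= p, otherwise 0          */
   meets_hocking = (cp <= hocking_bound); /* 1 if within Hocking's bound        */
   datalines;
1 8 9.0000
2 7 9.4754
4 6 12.4745
8 5 17.0005
;
run;

proc print data=cp_check noobs;
run;
```

Of these four rows, only the full model (p = 9, Hocking bound 10) is flagged on both criteria, which agrees with the reading of the plot.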
If multiple models, sharing the same number of parameters, fall below these lines, there are several
options that can be used to make a decision. First, the analyst can appeal to a subject matter expert who
could potentially provide previous experiences that could “break the tie.” Secondly, other fit statistics
could be used as a comparison between the models. Perhaps one of the models has a higher adjusted
R-square value. Thirdly, the models in question could be compared using a hold-out data set, especially
when the focus is prediction.
Which value tends to increase (can never decrease) as you add predictor
variables to your regression model?
a. R square
b. Adjusted R square
c. Mallows’ Cp
d. Both a and b
e. F statistic
f. All of the above
Practice
4.2 Solutions 4-15
Solutions
Solution to Practice
1. Using All-Regression Techniques
a. With the SELECTION=CP option, use an all-possible regression technique to identify a set
of candidate models that predict PctBodyFat2 as a function of the variables Age, Weight,
Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist.
Hint: Select only the best 60 models based on Cp to compare.
/*st104s03.sas*/ /*Part A*/
ods graphics / imagemap=on;
The plot indicates that the best model according to Mallows’ criterion is an eight-parameter
(seven variables plus an intercept) model. The best model according to Hocking’s criterion
has 10 parameters (including the intercept).
A partial table of the 60 models, their C(p) values, and the numbers of variables in the
models is displayed.
Note: Number in Model does not include the intercept in this table.
The best MALLOWS model is either of the eight-parameter models: number 1 (includes the
variables Age, Weight, Neck, Abdomen, Thigh, Forearm, and Wrist) or number 5 (includes
the variables Age, Weight, Neck, Abdomen, Biceps, Forearm, and Wrist).
The best HOCKING model is number 4. It includes Hip, along with the variables in the best
MALLOWS models listed above.
Solution to Quiz
Which value tends to increase (can never decrease) as you add predictor
variables to your regression model?
a. R square
b. Adjusted R square
c. Mallows’ Cp
d. Both a and b
e. F statistic
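f. All of the above
Answer: a. R square. R-square can never decrease when a predictor is added, whereas adjusted
R-square, Mallows’ Cp, and the F statistic can each increase or decrease.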