
Nguyen Hoang Anh Tu
ITDSIU20090

LAB 5: MODEL SELECTION
Problem 1:

Import dataset
Entrée [18]: ps = read.table('C:/Users/Nguyen Cuc/Desktop/Sem 2/RA/Dataset/PatientSatisfaction
head(ps)

Y X1 X2 X3

48 50 51 2.3

57 36 46 2.3

66 40 48 2.2

70 41 44 1.8

89 28 43 1.8

36 49 54 2.9

Entrée [19]: attach(ps)

The following objects are masked from ps (pos = 3):

X1, X2, X3, Y

a. Indicate which subset of predictor variables you would recommend as best for predicting patient satisfaction according to each of the following criteria: (1) $R^2_{a,p}$, (2) $AIC_p$, (3) $C_p$, (4) $BIC_p$. Support your recommendations with appropriate graphs.
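For reference, these are the formulas used in the manual calculations below, with $n$ the number of observations, $p$ the number of parameters including the intercept, $SSE_p$ the error sum of squares of the subset model, $SSTO$ the total sum of squares, and $MSE$ taken from the full model, exactly as implemented in the code cells that follow:

$R^2_{a,p} = 1 - \dfrac{SSE_p/(n-p)}{SSTO/(n-1)}$

$AIC_p = n\ln(SSE_p) - n\ln(n) + 2p$

$C_p = \dfrac{SSE_p}{MSE(X_1, X_2, X_3)} - (n - 2p)$

$BIC_p = n\ln(SSE_p) - n\ln(n) + p\ln(n)$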

(1) $R^2_{a,p}$
Manual calculation

Entrée [20]: n = nrow(ps)
ssto = sum((ps$Y - mean(ps$Y))^2)

Entrée [21]: # X1
p = 2
# Apply model
model_1 = lm(Y ~ X1)
# SSE
sse_1 = sum(resid(model_1)^2)
# Adjusted R square
aR2_1 = 1 - (sse_1 / (n-p))/(ssto/(n-1))
print(aR2_1)
# Double check with function
# summary(model_1)

[1] 0.6103248

Entrée [22]: # X2
p = 2
# Apply model
model_2 = lm(Y ~ X2)
# SSE
sse_2 = sum(resid(model_2)^2)
# Adjusted R square
aR2_2 = 1 - (sse_2 / (n-p))/(ssto/(n-1))
print(aR2_2)
# Double check with function
# summary(model_2)

[1] 0.3490737

Entrée [23]: # X3
p = 2
# Apply model
model_3 = lm(Y ~ X3)
# SSE
sse_3 = sum(resid(model_3)^2)
# Adjusted R square
aR2_3 = 1 - (sse_3 / (n-p))/(ssto/(n-1))
print(aR2_3)
# Double check with function
# summary(model_3)

[1] 0.4022134

Entrée [24]: # X1 - X2
p = 3
# Apply model
model_12 = lm(Y ~ X1 + X2)
# SSE
sse_12 = sum(resid(model_12)^2)
# Adjusted R square
aR2_12 = 1 - (sse_12 / (n-p))/(ssto/(n-1))
print(aR2_12)
# Double check with function
# summary(model_12)

[1] 0.6389073

Entrée [25]: # X1 - X3
p = 3
# Apply model
model_13 = lm(Y ~ X1 + X3)
# SSE
sse_13 = sum(resid(model_13)^2)
# Adjusted R square
aR2_13 = 1 - (sse_13 / (n-p))/(ssto/(n-1))
print(aR2_13)
# Double check with function
# summary(model_13)

[1] 0.6610206

Entrée [26]: # X2 - X3
p = 3
# Apply model
model_23 = lm(Y ~ X2 + X3)
# SSE
sse_23 = sum(resid(model_23)^2)
# Adjusted R square
aR2_23 = 1 - (sse_23 / (n-p))/(ssto/(n-1))
print(aR2_23)
# Double check with function
# summary(model_23)

[1] 0.4437314

Entrée [28]: # X1 - X2 - X3
p = 4
# Apply model
model_123 = lm(Y ~ X1 + X2 + X3)
# SSE
sse_123 = sum(resid(model_123)^2)
# Adjusted R square
aR2_123 = 1 - (sse_123 / (n-p))/(ssto/(n-1))
print(aR2_123)
# Double check with function
# summary(model_123)

[1] 0.6594939

Using function

Entrée [31]: install.packages('leaps')

package 'leaps' successfully unpacked and MD5 sums checked

The downloaded binary packages are in

C:\Users\Nguyen Cuc\AppData\Local\Temp\Rtmpk5HDPK\downloaded_packages

Entrée [32]: library(leaps)
X = cbind(ps$X1, ps$X2, ps$X3)
psleaps_aR = leaps(X, ps$Y, method='adjr2')
psleaps_aR

Warning message:

"package 'leaps' was built under R version 3.6.3"

$which
1 2 3

1 TRUE FALSE FALSE

1 FALSE FALSE TRUE

1 FALSE TRUE FALSE

2 TRUE FALSE TRUE

2 TRUE TRUE FALSE

2 FALSE TRUE TRUE

3 TRUE TRUE TRUE

$label
'(Intercept)'
'1'
'2'
'3'

$size
2
2
2
3
3
3
4

$adjr2
0.610324803177749
0.402213399193454
0.349073707181763
0.661020632881959
0.638907288953016
0.443731414752623
0.659493928515078

Entrée [33]: bestmodel.aR = psleaps_aR$which[which(psleaps_aR$adjr2 == max(psleaps_aR$adjr2)),]


bestmodel.aR

1 TRUE
2 FALSE
3 TRUE

Entrée [43]: plot(psleaps_aR$size, psleaps_aR$adjr2, xlab='p', ylab='Adjusted R square')

According to the $R^2_{a,p}$ criterion, the recommended best subset of predictor variables is the one containing X1 and X3, which has the highest $R^2_{a,p}$ value.
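As a worked check of the formula, using n = 46, p = 3, SSE for the {X1, X3} model ≈ 4330.5 and SSTO ≈ 13369.3 (both values also appear in the step() trace further below):

$R^2_{a,p} = 1 - \dfrac{4330.5/43}{13369.3/45} \approx 1 - \dfrac{100.7}{297.1} \approx 0.661$

which matches the value printed above for the X1, X3 model.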

(2) 𝐴𝐼𝐶𝑝
Manual calculation

Entrée [37]: n = nrow(ps)
p = 2
AIC_1 = n*log(sum((resid(model_1))^2)) - n*log(n) + 2*p
sprintf('AIC for model 1 is %f', AIC_1)
AIC_2 = n*log(sum((resid(model_2))^2)) - n*log(n) + 2*p
sprintf('AIC for model 2 is %f', AIC_2)
AIC_3 = n*log(sum((resid(model_3))^2)) - n*log(n) + 2*p
sprintf('AIC for model 3 is %f', AIC_3)

'AIC for model 1 is 220.529391'

'AIC for model 2 is 244.131202'

'AIC for model 3 is 240.213723'

Entrée [38]: n = nrow(ps)
p = 3
AIC_12 = n*log(sum((resid(model_12))^2)) - n*log(n) + 2*p
sprintf('AIC for model 12 is %f', AIC_12)
AIC_13 = n*log(sum((resid(model_13))^2)) - n*log(n) + 2*p
sprintf('AIC for model 13 is %f', AIC_13)
AIC_23 = n*log(sum((resid(model_23))^2)) - n*log(n) + 2*p
sprintf('AIC for model 23 is %f', AIC_23)

'AIC for model 12 is 217.967647'

'AIC for model 13 is 215.060654'

'AIC for model 23 is 237.845006'

Entrée [39]: n = nrow(ps)
p = 4
AIC_123 = n*log(sum((resid(model_123))^2)) - n*log(n) + 2*p
sprintf('AIC for model 123 is %f', AIC_123)

'AIC for model 123 is 216.184962'

Using function

Entrée [40]: selinfo = regsubsets(Y ~ X1 + X2 + X3, data = ps)


info = summary(selinfo)
print(info)

Subset selection object

Call: regsubsets.formula(Y ~ X1 + X2 + X3, data = ps)

3 Variables (and intercept)

Forced in Forced out

X1 FALSE FALSE

X2 FALSE FALSE

X3 FALSE FALSE

1 subsets of each size up to 3

Selection Algorithm: exhaustive

X1 X2 X3

1 ( 1 ) "*" " " " "

2 ( 1 ) "*" " " "*"

3 ( 1 ) "*" "*" "*"

The last three rows show the best model among all models that have the same number of parameters p.

Entrée [41]: p = (2:4)
AIC = n*log(info$rss) - n*log(n) + 2*p
AIC

220.529390822719
215.060654177041
216.184962183753

Entrée [42]: plot(p, AIC, xlab = 'p', ylab = 'AIC')

According to the $AIC_p$ criterion, the recommended best subset of predictor variables is the one containing X1 and X3, which has the smallest $AIC_p$ value.
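As an optional cross-check (a small sketch, assuming the ps data frame from the cells above is still loaded), the built-in extractAIC() function computes n*log(RSS/n) + 2*p, which is the same quantity as the manual formula n*log(SSE) - n*log(n) + 2*p:

# Second element of each result is the AIC value
extractAIC(lm(Y ~ X1, data = ps))            # about 220.53
extractAIC(lm(Y ~ X1 + X3, data = ps))       # about 215.06
extractAIC(lm(Y ~ X1 + X2 + X3, data = ps))  # about 216.18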

(3) 𝐶𝑝
Manual calculation

Entrée [70]: MSE = sum((resid(model_123))^2) / (n - 4)


MSE

101.16287337699

Entrée [52]: p = 2
Cp_1 = (sum((resid(model_1))^2) / MSE) - (n - 2 * p)
sprintf('The Cp value for model 1 is %f', Cp_1)
Cp_2 = (sum((resid(model_2))^2) / MSE) - (n - 2 * p)
sprintf('The Cp value for model 2 is %f', Cp_2)
Cp_3 = (sum((resid(model_3))^2) / MSE) - (n - 2 * p)
sprintf('The Cp value for model 3 is %f', Cp_3)

'The Cp value for model 1 is 8.353606'

'The Cp value for model 2 is 42.112324'

'The Cp value for model 3 is 35.245643'

Entrée [53]: p = 3
Cp_12 = (sum((resid(model_12))^2) / MSE) - (n - 2 * p)
sprintf('The Cp value for model 12 is %f', Cp_12)
Cp_13 = (sum((resid(model_13))^2) / MSE) - (n - 2 * p)
sprintf('The Cp value for model 13 is %f', Cp_13)
Cp_23 = (sum((resid(model_23))^2) / MSE) - (n - 2 * p)
sprintf('The Cp value for model 23 is %f', Cp_23)

'The Cp value for model 12 is 5.599735'

'The Cp value for model 13 is 2.807204'

'The Cp value for model 23 is 30.247056'

Entrée [54]: p = 4
Cp_123 = (sum((resid(model_123))^2) / MSE) - (n - 2 * p)
sprintf('The Cp value for model 123 is %f', Cp_123)

'The Cp value for model 123 is 4.000000'

Using function

Entrée [44]: psleaps_Cp = leaps(X, ps$Y, method='Cp')


psleaps_Cp

$which
1 2 3

1 TRUE FALSE FALSE

1 FALSE FALSE TRUE

1 FALSE TRUE FALSE

2 TRUE FALSE TRUE

2 TRUE TRUE FALSE

2 FALSE TRUE TRUE

3 TRUE TRUE TRUE

$label
'(Intercept)'
'1'
'2'
'3'

$size
2
2
2
3
3
3
4

$Cp
8.35360628199045
35.2456429948055
42.1123236337672
2.80720376735253
5.59973485144707
30.2470562751665
4

Entrée [45]: bestmodel.Cp = psleaps_Cp$which[which(psleaps_Cp$Cp == min(psleaps_Cp$Cp)),]


bestmodel.Cp

1 TRUE
2 FALSE
3 TRUE

Entrée [46]: plot(psleaps_Cp$size, psleaps_Cp$Cp, xlab='p', ylab='Cp')

According to the $C_p$ criterion, the recommended best subset of predictor variables is the one containing X1 and X3: it has the smallest $C_p$ value, and that value is also close to p (2.8 ≈ 3).
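As a worked instance of the formula, for the X1, X3 model (SSE ≈ 4330.5, full-model MSE ≈ 101.16, n = 46, p = 3):

$C_p = \dfrac{4330.5}{101.16} - (46 - 2 \cdot 3) \approx 42.81 - 40 = 2.81$

which matches the value of 2.807 printed above.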

(4) $BIC_p$

Manual calculation

Entrée [56]: p = 2
BIC_1 = n*log(sum((resid(model_1))^2)) - n*log(n) + log(n)*p
sprintf('The BIC value for model 1 is %f', BIC_1)
BIC_2 = n*log(sum((resid(model_2))^2)) - n*log(n) + log(n)*p
sprintf('The BIC value for model 2 is %f', BIC_2)
BIC_3 = n*log(sum((resid(model_3))^2)) - n*log(n) + log(n)*p
sprintf('The BIC value for model 3 is %f', BIC_3)

'The BIC value for model 1 is 224.186674'

'The BIC value for model 2 is 247.788485'

'The BIC value for model 3 is 243.871006'

Entrée [57]: p = 3
BIC_12 = n*log(sum((resid(model_12))^2)) - n*log(n) + log(n)*p
sprintf('The BIC value for model 12 is %f', BIC_12)
BIC_13 = n*log(sum((resid(model_13))^2)) - n*log(n) + log(n)*p
sprintf('The BIC value for model 13 is %f', BIC_13)
BIC_23 = n*log(sum((resid(model_23))^2)) - n*log(n) + log(n)*p
sprintf('The BIC value for model 23 is %f', BIC_23)

'The BIC value for model 12 is 223.453571'

'The BIC value for model 13 is 220.546578'

'The BIC value for model 23 is 243.330931'

Entrée [63]: p = 4
BIC_123 = n*log(sum((resid(model_123))^2)) - n*log(n) + log(n)*p
sprintf('The BIC value for model 123 is %f', BIC_123)

'The BIC value for model 123 is 223.499528'

Entrée [65]: p = (2:4)
BIC = n*log(info$rss) - n*log(n) + log(n)*p
BIC

224.186673615698
220.546578366508
223.499527769709

Entrée [69]: info$bic

-36.7287874898513
-40.368882739041
-37.4159333358395
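The BIC values reported by summary(regsubsets(...)) look very different from the manual values above, but for these data they appear to differ only by a constant shift (numerically about n*log(SSTO/n) ≈ 260.9, i.e. they are measured relative to the total sum of squares), so both versions rank the subsets identically. A quick check, assuming n, ssto, BIC and info from the cells above:

# Shifting the manual BIC values should approximately reproduce info$bic
# (about -36.73, -40.37, -37.42 for the three best subsets)
BIC - n * log(ssto / n)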

Entrée [66]: plot(p, BIC, xlab = 'p', ylab = 'BIC')

According to the $BIC_p$ criterion, the recommended best subset of predictor variables is the one containing X1 and X3, which has the smallest $BIC_p$ value.

Plots for $R^2_{a,p}$, $C_p$, and $BIC_p$ from the summary of the regsubsets function.


Entrée [68]: # Set up a 2x2 grid so we can look at 4 plots at once


par(mfrow = c(2,2))
plot(info$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
plot(info$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")

# Plot a red dot to indicate the model with the largest adjusted R^2;
# which.max() identifies the location of the maximum point
adj_r2_max = which.max(info$adjr2)  # = 2 here (the X1, X3 model)

# points() works like plot(), except that it adds points
# to a plot that has already been created instead of creating a new plot
points(adj_r2_max, info$adjr2[adj_r2_max], col ="red", cex = 2, pch = 20)

# Do the same for C_p and BIC, this time looking for the models with the SMALLEST values
plot(info$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
cp_min = which.min(info$cp)  # = 2 here
points(cp_min, info$cp[cp_min], col = "red", cex = 2, pch = 20)

plot(info$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
bic_min = which.min(info$bic)  # = 2 here
points(bic_min, info$bic[bic_min], col = "red", cex = 2, pch = 20)

b. Do the four criteria in part (a) identify the same best subset? Does
this always happen?

All four criteria in part (a) identify the same best subset, {X1, X3}. However, this does not always happen, because the criteria penalize model size differently and can therefore favour different subsets.

c. Would forward stepwise regression have any advantages here as a


screening procedure over the all-possible-regressions procedure?

Entrée [71]: forwardproc = step(lm(Y ~ 1, ps), list(upper = ~ X1+X2+X3), direction='forward')


summary(forwardproc)

Start: AIC=262.92

Y ~ 1

Df Sum of Sq RSS AIC

+ X1 1 8275.4 5093.9 220.53

+ X3 1 5554.9 7814.4 240.21

+ X2 1 4860.3 8509.0 244.13

<none> 13369.3 262.92

Step: AIC=220.53

Y ~ X1

Df Sum of Sq RSS AIC

+ X3 1 763.42 4330.5 215.06

+ X2 1 480.92 4613.0 217.97

<none> 5093.9 220.53

Step: AIC=215.06

Y ~ X1 + X3

Df Sum of Sq RSS AIC

<none> 4330.5 215.06

+ X2 1 81.659 4248.8 216.19

Call:

lm(formula = Y ~ X1 + X3, data = ps)

Residuals:

Min 1Q Median 3Q Max

-19.4453 -7.3285 0.6733 8.5126 18.0534

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 145.9412 11.5251 12.663 4.21e-16 ***

X1 -1.2005 0.2041 -5.882 5.43e-07 ***

X3 -16.7421 6.0808 -2.753 0.00861 **

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.04 on 43 degrees of freedom

Multiple R-squared: 0.6761, Adjusted R-squared: 0.661

F-statistic: 44.88 on 2 and 43 DF, p-value: 2.98e-11

Forward stepwise regression has advantages here as a screening procedure over the all-possible-regressions procedure: it adds one predictor at a time, reporting the best candidate subset for each size p in order from the smallest p upward, so it does not need to fit and compare every possible subset.
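A rough way to see the screening saving (an illustrative sketch; with only three candidate predictors the difference is modest):

# All-possible-regressions fits every non-empty subset of the 3 predictors,
# while forward stepwise fits at most 3 + 2 + 1 candidate models and may stop
# earlier, as in the step() trace above.
all_possible = 2^3 - 1    # 7 subset models
forward_max  = 3 + 2 + 1  # at most 6 fits
c(all_possible = all_possible, forward_max = forward_max)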

Problem 2:

Import dataset
Entrée [1]: jp = read.table('C:/Users/Admin/Desktop/Sem 2/RA/Dataset/9.10.txt', col.names = c
head(jp)

Y X1 X2 X3 X4

88 86 110 100 87

80 62 97 99 100

96 110 107 103 103

76 101 117 93 95

80 100 101 95 88

73 78 85 95 84

Entrée [45]: attach(jp)

The following objects are masked from jps:

X1, X2, X3, X4, Y

The following objects are masked from jp (pos = 5):

X1, X2, X3, X4, Y

a. Prepare several plots of the test scores for each of the four newly developed aptitude tests. Are there any noteworthy features in these plots? Comment.

Entrée [30]: summary(jp)

Y X1 X2 X3

Min. : 58.0 Min. : 62.0 Min. : 73.0 Min. : 80.0

1st Qu.: 78.0 1st Qu.: 91.0 1st Qu.: 94.0 1st Qu.: 95.0

Median : 94.0 Median :104.0 Median :113.0 Median :100.0

Mean : 92.2 Mean :103.4 Mean :106.7 Mean :100.8

3rd Qu.:109.0 3rd Qu.:112.0 3rd Qu.:121.0 3rd Qu.:107.0

Max. :127.0 Max. :150.0 Max. :129.0 Max. :116.0

X4

Min. : 74.00

1st Qu.: 87.00

Median : 95.00

Mean : 94.68

3rd Qu.:103.00

Max. :110.00

Entrée [31]: par(mfrow = c(3,2))


hist(jp$X1)
hist(jp$X2)
hist(jp$X3)
hist(jp$X4)
hist(jp$Y)

b. Obtain the scatter plot matrix. Also obtain the correlation matrix of the X variables. What do the scatter plots suggest about the nature of the functional relationship between the response variable Y and each of the predictor variables? Are any serious multicollinearity problems evident? Explain.

Entrée [15]: pairs(jp)

Entrée [22]: cor(jp)

Y X1 X2 X3 X4

Y 1.0000000 0.5144107 0.4970057 0.8970645 0.8693865

X1 0.5144107 1.0000000 0.1022689 0.1807692 0.3266632

X2 0.4970057 0.1022689 1.0000000 0.5190448 0.3967101

X3 0.8970645 0.1807692 0.5190448 1.0000000 0.7820385

X4 0.8693865 0.3266632 0.3967101 0.7820385 1.0000000

There is a clear linear relationship between Y and X3 and between Y and X4. X1 and X2 also show a linear relationship with Y, but it is weaker and harder to see in the scatter plots.

X3 and X4 are strongly correlated with each other (r ≈ 0.78), which suggests a possible multicollinearity problem.
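One additional way to quantify the multicollinearity (not part of the original lab; this assumes the car package is available) is to compute variance inflation factors for the full first-order model:

# install.packages('car')   # if not already installed
library(car)
vif(lm(Y ~ X1 + X2 + X3 + X4, data = jp))
# A common rule of thumb treats VIF values above about 10 as serious multicollinearity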



c. Using only first-order terms for the predictor variables in the pool of potential X variables, find the four best subset regression models according to the $R^2_{a,p}$ criterion.
Entrée [28]: X = cbind(jp$X1, jp$X2, jp$X3, jp$X4)
jpleaps_aR = leaps(X, jp$Y, method='adjr2')
jpleaps_aR

$which
1 2 3 4

1 FALSE FALSE TRUE FALSE

1 FALSE FALSE FALSE TRUE

1 TRUE FALSE FALSE FALSE

1 FALSE TRUE FALSE FALSE

2 TRUE FALSE TRUE FALSE

2 FALSE FALSE TRUE TRUE

2 TRUE FALSE FALSE TRUE

2 FALSE TRUE TRUE FALSE

2 FALSE TRUE FALSE TRUE

2 TRUE TRUE FALSE FALSE

3 TRUE FALSE TRUE TRUE

3 TRUE TRUE TRUE FALSE

3 FALSE TRUE TRUE TRUE

3 TRUE TRUE FALSE TRUE

4 TRUE TRUE TRUE TRUE


$label
'(Intercept)'
'1'
'2'
'3'
'4'

$size
2
2
2
2
3
3
3
3
3
3
4
4
4
4
5

$adjr2
0.7962344370837
0.745216968986486
0.23264524812475
0.214276183981299
0.926904337908834
0.866098849056532
0.79847156364106
0.788443562503307
0.763591589343016
0.415485278382781
0.956048217657989
0.924677876991158
0.861679721457713
0.823266429947636
0.955470175923839

Top 1: 0.956 − X1, X3, X4
Top 2: 0.955 − X1, X2, X3, X4
Top 3: 0.927 − X1, X3
Top 4: 0.925 − X1, X2, X3
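The four best subsets listed above can also be pulled out programmatically from the leaps output; a small sketch assuming jpleaps_aR from the cell above (columns 1-4 correspond to X1-X4):

# Order all candidate subsets by adjusted R^2 and show the four best
ord = order(jpleaps_aR$adjr2, decreasing = TRUE)[1:4]
cbind(jpleaps_aR$which[ord, ], adjr2 = round(jpleaps_aR$adjr2[ord], 3))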
Entrée [54]: plot(jpleaps_aR$size, jpleaps_aR$adjr2, xlab='p', ylab='adjusted R square')

d. Using other criteria ($C_p$, AIC, or BIC), identify the four best models. Compare the result to part (c).

$C_p$

Entrée [55]: jpleaps_Cp = leaps(X, jp$Y, method='Cp')


jpleaps_Cp

$which
1 2 3 4

1 FALSE FALSE TRUE FALSE

1 FALSE FALSE FALSE TRUE

1 TRUE FALSE FALSE FALSE

1 FALSE TRUE FALSE FALSE

2 TRUE FALSE TRUE FALSE

2 FALSE FALSE TRUE TRUE

2 TRUE FALSE FALSE TRUE

2 FALSE TRUE TRUE FALSE

2 FALSE TRUE FALSE TRUE

2 TRUE TRUE FALSE FALSE

3 TRUE FALSE TRUE TRUE

3 TRUE TRUE TRUE FALSE

3 FALSE TRUE TRUE TRUE

3 TRUE TRUE FALSE TRUE

4 TRUE TRUE TRUE TRUE


$label
'(Intercept)'
'1'
'2'
'3'
'4'

$size
2
2
2
2
3
3
3
3
3
3
4
4
4
4
5

$Cp
84.2464959003476
110.597414426974
375.344689414102
384.832453717342
17.1129781976063
47.1539851520169
80.5653068899095
85.5196499534087
97.7977898488476
269.78002871929
3.72739895858592
18.5214649058647
48.2310201005948
66.3464997470431
5

Top 1: 3.727 − X1, X3, X4
Top 2: 17.113 − X1, X3
Top 3: 18.521 − X1, X2, X3
Top 4: 47.154 − X3, X4

$C_p$ is not used to rank the full model, since for the full model $C_p = p$ by construction (here 5).
Same result as in part (c).

AIC

Entrée [58]: selinfo = regsubsets(Y ~ X1 + X2 + X3 + X4, data = jp)


info = summary(selinfo)
print(info)

Subset selection object

Call: regsubsets.formula(Y ~ X1 + X2 + X3 + X4, data = jp)

4 Variables (and intercept)

Forced in Forced out

X1 FALSE FALSE

X2 FALSE FALSE

X3 FALSE FALSE

X4 FALSE FALSE

1 subsets of each size up to 4

Selection Algorithm: exhaustive

X1 X2 X3 X4

1 ( 1 ) " " " " "*" " "

2 ( 1 ) "*" " " "*" " "

3 ( 1 ) "*" " " "*" "*"

4 ( 1 ) "*" "*" "*" "*"

Entrée [69]: n = nrow(jp)
p = (2:5)
AIC = n*log(info$rss) - n*log(n) + 2*p
sort(AIC)

73.8473152171715
74.9542108999027
85.7272117508187
110.468533533551

Entrée [70]: plot(p, AIC, xlab = 'p', ylab = 'AIC', type = 'l')


aic_min = which.min(AIC)
points(p[aic_min], AIC[aic_min], col = "red", cex = 2, pch = 20)

Same result as in part (c).

BIC
Entrée [66]: p = (2:5)
BIC = n*log(info$rss) - n*log(n) + log(n)*p
BIC

112.906285183288
89.3838392254233
78.7228185166443
81.0485900242437

Entrée [68]: plot(p, BIC, xlab = 'p', ylab = 'BIC', type = 'l')


bic_min = which.min(BIC)
points(p[bic_min], BIC[bic_min], col = "red", cex = 2, pch = 20)

Same result as in part (c).

e. Using forward stepwise regression, find the best subset of predictor


variables to predict job proficiency.

Entrée [71]: forwardproc = step(lm(Y ~ 1, jp), list(upper = ~ X1 + X2 + X3 + X4), direction='forward')


summary(forwardproc)

Start: AIC=149.3

Y ~ 1

Df Sum of Sq RSS AIC

+ X3 1 7286.0 1768.0 110.47

+ X4 1 6843.3 2210.7 116.06

+ X1 1 2395.9 6658.1 143.62

+ X2 1 2236.5 6817.5 144.21

<none> 9054.0 149.30

Step: AIC=110.47

Y ~ X3

Df Sum of Sq RSS AIC

+ X1 1 1161.37 606.66 85.727

+ X4 1 656.71 1111.31 100.861

<none> 1768.02 110.469

+ X2 1 12.21 1755.81 112.295

Step: AIC=85.73

Y ~ X3 + X1

Df Sum of Sq RSS AIC

+ X4 1 258.460 348.20 73.847

<none> 606.66 85.727

+ X2 1 9.937 596.72 87.314

Step: AIC=73.85

Y ~ X3 + X1 + X4

Df Sum of Sq RSS AIC

<none> 348.20 73.847

+ X2 1 12.22 335.98 74.954

Call:

lm(formula = Y ~ X3 + X1 + X4, data = jp)

Residuals:

Min 1Q Median 3Q Max

-5.4579 -3.1563 -0.2057 1.8070 6.6083

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -124.20002 9.87406 -12.578 3.04e-11 ***

X3 1.35697 0.15183 8.937 1.33e-08 ***

X1 0.29633 0.04368 6.784 1.04e-06 ***

X4 0.51742 0.13105 3.948 0.000735 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.072 on 21 degrees of freedom

Multiple R-squared: 0.9615, Adjusted R-squared: 0.956

F-statistic: 175 on 3 and 21 DF, p-value: 5.16e-15

f. How does the best subset according to forward stepwise regression compare with the best subset according to the $R^2_{a,p}$ criterion?

Both forward stepwise regression and the $R^2_{a,p}$ criterion recommend the same best subset, which contains X1, X3, and X4.

Problem 3:
Entrée [72]: jps = read.table('C:/Users/Admin/Desktop/Sem 2/RA/Dataset/9.22.txt', col.names =
head(jps)

Y X1 X2 X3 X4

58 65 109 88 84

92 85 90 104 98

71 93 73 91 82

77 95 57 95 85

92 102 139 101 92

66 63 101 93 84

Entrée [73]: attach(jps)

The following objects are masked from jp (pos = 3):

X1, X2, X3, X4, Y

The following objects are masked from jps (pos = 4):

X1, X2, X3, X4, Y

The following objects are masked from jp (pos = 6):

X1, X2, X3, X4, Y

a. Obtain the correlation matrix of the X variables for the validation data
set and compare it with the model-building data set. Are the two
correlation matrices reasonably similar?

Entrée [78]: # The validation dataset


cor(jps)

Y X1 X2 X3 X4

Y 1.0000000 0.53707787 0.34477442 0.8880519 0.8879388

X1 0.5370779 1.00000000 0.01057088 0.1772891 0.3196395

X2 0.3447744 0.01057088 1.00000000 0.3437441 0.2207638

X3 0.8880519 0.17728907 0.34374413 1.0000000 0.8714466

X4 0.8879388 0.31963945 0.22076377 0.8714466 1.0000000

Entrée [79]: # The model-building dataset


cor(jp)

Y X1 X2 X3 X4

Y 1.0000000 0.5144107 0.4970057 0.8970645 0.8693865

X1 0.5144107 1.0000000 0.1022689 0.1807692 0.3266632

X2 0.4970057 0.1022689 1.0000000 0.5190448 0.3967101

X3 0.8970645 0.1807692 0.5190448 1.0000000 0.7820385

X4 0.8693865 0.3266632 0.3967101 0.7820385 1.0000000

The two correlation matrices are not really similar: the correlation between X3 and X4 increases (from about 0.78 to 0.87), whereas the correlations of X2 with X3 and X4 decrease.
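A compact way to see these shifts side by side (a small sketch, assuming both data frames are loaded as above):

# Element-wise difference between the validation (jps) and model-building (jp) correlation matrices
round(cor(jps) - cor(jp), 3)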

b. Using the best subset of variables identified in Problem 2 to fit the


validation data set. Compare the estimated regression coefficients and
their estimated standard deviations to those obtained in Problem 2.
Also, compare the error mean squares and coefficients of multiple
determination. Do the estimates for the validation data set appear to be
reasonably similar to those obtained for the model-building data set?

Entrée [101]: best_model_pro3 = lm(Y ~ X1 + X3 + X4, data = jps)


summary(best_model_pro3)

Call:

lm(formula = Y ~ X1 + X3 + X4, data = jps)

Residuals:

Min 1Q Median 3Q Max

-9.4619 -2.3836 0.6834 2.1123 7.2394

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -122.76705 11.84783 -10.362 1.04e-09 ***

X1 0.31238 0.04729 6.605 1.54e-06 ***

X3 1.40676 0.23262 6.048 5.31e-06 ***

X4 0.42838 0.19749 2.169 0.0417 *

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.284 on 21 degrees of freedom

Multiple R-squared: 0.9489, Adjusted R-squared: 0.9416

F-statistic: 130 on 3 and 21 DF, p-value: 1.017e-13

Best subset fitted to the validation data set:

$\hat{Y} = -122.8 + 0.3124 X_1 + 1.407 X_3 + 0.4284 X_4$

Entrée [102]: best_model_pro2 = lm(Y ~ X1 + X3 + X4, data = jp)


summary(best_model_pro2)

Call:

lm(formula = Y ~ X1 + X3 + X4, data = jp)

Residuals:

Min 1Q Median 3Q Max

-5.4579 -3.1563 -0.2057 1.8070 6.6083

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -124.20002 9.87406 -12.578 3.04e-11 ***

X1 0.29633 0.04368 6.784 1.04e-06 ***

X3 1.35697 0.15183 8.937 1.33e-08 ***

X4 0.51742 0.13105 3.948 0.000735 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.072 on 21 degrees of freedom

Multiple R-squared: 0.9615, Adjusted R-squared: 0.956

F-statistic: 175 on 3 and 21 DF, p-value: 5.16e-15

Best subset fitted to the model-building data set:

$\hat{Y} = -124.2 + 0.2963 X_1 + 1.357 X_3 + 0.5174 X_4$

The estimates for the validation data set do not appear to be reasonably similar to those obtained for the model-building data set: although the coefficients themselves are close, the estimated standard deviations for the X3 and X4 coefficients are noticeably larger in the validation fit, and its error mean square and coefficient of multiple determination are somewhat worse.
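For a side-by-side view of the two fits (a convenience sketch using the model objects fitted above):

# Coefficients from the model-building fit (jp) next to the validation fit (jps)
cbind(model_building = coef(best_model_pro2), validation = coef(best_model_pro3))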

c. Using the best regression model found in Problem 2 to predict the


mean of job proficiency of the validation data set. Find the mean
square error for the validation data set and compare to the mean
square error of the best model with model-building data set.

Entrée [87]: y_val_predict = -124.2 + 0.2963*jps$X1 + 1.357*jps$X2 + 0.5174*jps$X3


y_val_predict

88.5037
76.9251
49.5003
30.4505
146.903
79.6421
120.3845
90.0475
79.3433
103.0607
98.7746
103.234
90.8567
140.6543
87.1153
112.2843
110.1744
151.4436
104.0741
114.2877
124.7463
128.0374
124.1323
57.6483
112.0692

Entrée [97]: # The mean square error for the validation data set
MSE_val = sum((jps$Y - y_val_predict)^2)/(nrow(jps) - 4)
MSE_val

949.418773435238

Entrée [103]: # The mean square error of the best model with model-building data set
MSE_mb = sum((resid(best_model_pro2))^2)/(nrow(jp) - 4)
MSE_mb

16.5808098885237

The mean square error for the validation data set is very large compared with that of the best model fitted to the model-building data set.
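As an alternative to typing the coefficients by hand, the validation predictions can also be obtained directly from the fitted model object with predict(); a minimal sketch, using the same denominator convention as above:

# Predict the validation responses from the model-building fit and recompute the error mean square
y_val_hat = predict(best_model_pro2, newdata = jps)
sum((jps$Y - y_val_hat)^2) / (nrow(jps) - 4)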
