Lab5 - NguyenHoangAnhTu - Jupyter Notebook
Import dataset
Entrée [18]: ps = read.table('C:/Users/Nguyen Cuc/Desktop/Sem 2/RA/Dataset/PatientSatisfaction
head(ps)
Y X1 X2 X3
48 50 51 2.3
57 36 46 2.3
66 40 48 2.2
70 41 44 1.8
89 28 43 1.8
36 49 54 2.9
Entrée [19]: attach(ps)
(1) R²_a,p
Manual calculation
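The cells below apply the adjusted coefficient of multiple determination; with n observations and p parameters in the subset model, it is

```latex
R^2_{a,p} = 1 - \frac{SSE_p / (n - p)}{SSTO / (n - 1)}
```

A larger R²_a,p indicates a better subset; unlike R², it penalizes adding parameters.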
Entrée [20]: n = nrow(ps)
ssto = sum((ps$Y - mean(ps$Y))^2)
Entrée [21]: # X1
p = 2
# Apply model
model_1 = lm(Y ~ X1)
# SSE
sse_1 = sum(resid(model_1)^2)
# Adjusted R square
aR2_1 = 1 - (sse_1 / (n-p))/(ssto/(n-1))
print(aR2_1)
# Double check with function
# summary(model_1)
[1] 0.6103248
Entrée [22]: # X2
p = 2
# Apply model
model_2 = lm(Y ~ X2)
# SSE
sse_2 = sum(resid(model_2)^2)
# Adjusted R square
aR2_2 = 1 - (sse_2 / (n-p))/(ssto/(n-1))
print(aR2_2)
# Double check with function
# summary(model_2)
[1] 0.3490737
Entrée [23]: # X3
p = 2
# Apply model
model_3 = lm(Y ~ X3)
# SSE
sse_3 = sum(resid(model_3)^2)
# Adjusted R square
aR2_3 = 1 - (sse_3 / (n-p))/(ssto/(n-1))
print(aR2_3)
# Double check with function
# summary(model_3)
[1] 0.4022134
Entrée [24]: # X1 - X2
p = 3
# Apply model
model_12 = lm(Y ~ X1 + X2)
# SSE
sse_12 = sum(resid(model_12)^2)
# Adjusted R square
aR2_12 = 1 - (sse_12 / (n-p))/(ssto/(n-1))
print(aR2_12)
# Double check with function
# summary(model_12)
[1] 0.6389073
Entrée [25]: # X1 - X3
p = 3
# Apply model
model_13 = lm(Y ~ X1 + X3)
# SSE
sse_13 = sum(resid(model_13)^2)
# Adjusted R square
aR2_13 = 1 - (sse_13 / (n-p))/(ssto/(n-1))
print(aR2_13)
# Double check with function
# summary(model_13)
[1] 0.6610206
Entrée [26]: # X2 - X3
p = 3
# Apply model
model_23 = lm(Y ~ X2 + X3)
# SSE
sse_23 = sum(resid(model_23)^2)
# Adjusted R square
aR2_23 = 1 - (sse_23 / (n-p))/(ssto/(n-1))
print(aR2_23)
# Double check with function
# summary(model_23)
[1] 0.4437314
Entrée [28]: # X1 - X2 - X3
p = 4
# Apply model
model_123 = lm(Y ~ X1 + X2 + X3)
# SSE
sse_123 = sum(resid(model_123)^2)
# Adjusted R square
aR2_123 = 1 - (sse_123 / (n-p))/(ssto/(n-1))
print(aR2_123)
# Double check with function
# summary(model_123)
[1] 0.6594939
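As the commented-out `summary(...)` calls suggest, the manual formula can be checked against R's built-in adjusted R². A minimal self-contained sketch with made-up variable names (not the lab's data):

```r
# Cross-check the manual adjusted R^2 formula against summary()$adj.r.squared
# on a small synthetic data set (x, z, y are illustrative variables).
set.seed(1)
x <- rnorm(30); z <- rnorm(30)
y <- 1 + 2 * x - z + rnorm(30, sd = 0.5)
fit <- lm(y ~ x + z)

n <- length(y); p <- 3                       # intercept + 2 slopes
sse  <- sum(resid(fit)^2)
ssto <- sum((y - mean(y))^2)
aR2_manual <- 1 - (sse / (n - p)) / (ssto / (n - 1))

all.equal(aR2_manual, summary(fit)$adj.r.squared)
```

The two values agree exactly, since `summary.lm` computes adjusted R² by the same formula.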
Using function
Entrée [31]: install.packages('leaps')
Entrée [32]: library(leaps)
X = cbind(ps$X1, ps$X2, ps$X3)
psleaps_aR = leaps(X, ps$Y, method='adjr2')
psleaps_aR
leaps output (subset, size, adjusted R²):

subset      size  adjr2
X1          2     0.6103248
X3          2     0.4022134
X2          2     0.3490737
X1, X3      3     0.6610206
X1, X2      3     0.6389073
X2, X3      3     0.4437314
X1, X2, X3  4     0.6594939
According to the R²_a,p criterion, the recommended best subset of predictor variables is the one containing X1 and X3, which has the highest R²_a,p value.
(2) AIC_p
Manual calculation
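The cells below use the model-selection form of Akaike's criterion (constant terms dropped):

```latex
AIC_p = n \ln(SSE_p) - n \ln(n) + 2p
```

Smaller is better; the 2p term penalizes model size.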
Entrée [37]: n = nrow(ps)
p = 2
AIC_1 = n*log(sum((resid(model_1))^2)) - n*log(n) + 2*p
sprintf('AIC for model 1 is %f', AIC_1)
AIC_2 = n*log(sum((resid(model_2))^2)) - n*log(n) + 2*p
sprintf('AIC for model 2 is %f', AIC_2)
AIC_3 = n*log(sum((resid(model_3))^2)) - n*log(n) + 2*p
sprintf('AIC for model 3 is %f', AIC_3)
Entrée [38]: n = nrow(ps)
p = 3
AIC_12 = n*log(sum((resid(model_12))^2)) - n*log(n) + 2*p
sprintf('AIC for model 12 is %f', AIC_12)
AIC_13 = n*log(sum((resid(model_13))^2)) - n*log(n) + 2*p
sprintf('AIC for model 13 is %f', AIC_13)
AIC_23 = n*log(sum((resid(model_23))^2)) - n*log(n) + 2*p
sprintf('AIC for model 23 is %f', AIC_23)
Entrée [39]: n = nrow(ps)
p = 4
AIC_123 = n*log(sum((resid(model_123))^2)) - n*log(n) + 2*p
sprintf('AIC for model 123 is %f', AIC_123)
Using function
(subset-selection table not recovered in this export)
The last rows of the output show the best model among the models sharing the same number of parameters p.
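The `AIC` cell below uses `info$rss`, whose defining cell did not survive this export; `info` is presumably the summary of an all-subsets search such as `summary(regsubsets(Y ~ X1 + X2 + X3, data = ps))` from the `leaps` package. As a self-contained alternative that needs no extra package, here is a base-R brute-force sketch of best-subsets selection on synthetic data; all names in it are illustrative:

```r
# Brute-force best-subsets search: for each subset size k, keep the subset
# of predictors with the smallest residual sum of squares. Base R only.
set.seed(42)
n <- 60
X <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
dat <- X
dat$Y <- 2 * X$X1 + 3 * X$X3 + rnorm(n, sd = 0.3)   # truth uses X1 and X3 only

best_of_size <- function(k) {
  subsets <- combn(names(X), k, simplify = FALSE)
  rss <- vapply(subsets, function(s) {
    sum(resid(lm(reformulate(s, response = "Y"), data = dat))^2)
  }, numeric(1))
  paste(subsets[[which.min(rss)]], collapse = " + ")
}

best_subsets <- sapply(1:3, best_of_size)   # best subset for k = 1, 2, 3
best_subsets
```

With this strong simulated signal, the best 2-predictor subset is X1 + X3, mirroring the structure of the lab's result.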
Entrée [41]: p = (2:4)
AIC = n*log(info$rss) - n*log(n) + 2*p
AIC
220.529390822719
215.060654177041
216.184962183753
According to the AIC_p criterion, the recommended best subset of predictor variables is the one containing X1 and X3, which has the smallest AIC_p value.
(3) Cp
Manual calculation
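The cells below use Mallows' Cp, where MSE(X1, X2, X3) is the mean squared error of the full model:

```latex
C_p = \frac{SSE_p}{MSE(X_1, X_2, X_3)} - (n - 2p)
```

Subsets with a small Cp that is close to p show little bias.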
MSE of the full model:
101.16287337699
Entrée [52]: p = 2
Cp_1 = (sum((resid(model_1))^2) / MSE) - (n - 2 * p)
sprintf('The Cp value for model 1 is %f', Cp_1)
Cp_2 = (sum((resid(model_2))^2) / MSE) - (n - 2 * p)
sprintf('The Cp value for model 2 is %f', Cp_2)
Cp_3 = (sum((resid(model_3))^2) / MSE) - (n - 2 * p)
sprintf('The Cp value for model 3 is %f', Cp_3)
Entrée [53]: p = 3
Cp_12 = (sum((resid(model_12))^2) / MSE) - (n - 2 * p)
sprintf('The Cp value for model 12 is %f', Cp_12)
Cp_13 = (sum((resid(model_13))^2) / MSE) - (n - 2 * p)
sprintf('The Cp value for model 13 is %f', Cp_13)
Cp_23 = (sum((resid(model_23))^2) / MSE) - (n - 2 * p)
sprintf('The Cp value for model 23 is %f', Cp_23)
Entrée [54]: p = 4
Cp_123 = (sum((resid(model_123))^2) / MSE) - (n - 2 * p)
sprintf('The Cp value for model 123 is %f', Cp_123)
Using function
leaps output (subset, size, Cp):

subset      size  Cp
X1          2      8.3536
X3          2     35.2456
X2          2     42.1123
X1, X3      3      2.8072
X1, X2      3      5.5997
X2, X3      3     30.2471
X1, X2, X3  4      4.0000
According to the Cp criterion, the recommended best subset of predictor variables is the one containing X1 and X3: it has the smallest Cp value, and that value is close to the number of parameters p (2.81, near 3).
(4) BIC_p
Manual calculation
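The cells below use Schwarz's Bayesian criterion, which differs from AIC_p only in the penalty term (p ln n instead of 2p):

```latex
BIC_p = n \ln(SSE_p) - n \ln(n) + p \ln(n)
```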
Entrée [56]: p = 2
BIC_1 = n*log(sum((resid(model_1))^2)) - n*log(n) + log(n)*p
sprintf('The BIC value for model 1 is %f', BIC_1)
BIC_2 = n*log(sum((resid(model_2))^2)) - n*log(n) + log(n)*p
sprintf('The BIC value for model 2 is %f', BIC_2)
BIC_3 = n*log(sum((resid(model_3))^2)) - n*log(n) + log(n)*p
sprintf('The BIC value for model 3 is %f', BIC_3)
Entrée [57]: p = 3
BIC_12 = n*log(sum((resid(model_12))^2)) - n*log(n) + log(n)*p
sprintf('The BIC value for model 12 is %f', BIC_12)
BIC_13 = n*log(sum((resid(model_13))^2)) - n*log(n) + log(n)*p
sprintf('The BIC value for model 13 is %f', BIC_13)
BIC_23 = n*log(sum((resid(model_23))^2)) - n*log(n) + log(n)*p
sprintf('The BIC value for model 23 is %f', BIC_23)
Entrée [63]: p = 4
BIC_123 = n*log(sum((resid(model_123))^2)) - n*log(n) + log(n)*p
sprintf('The BIC value for model 123 is %f', BIC_123)
Entrée [65]: p = (2:4)
BIC = n*log(info$rss) - n*log(n) + log(n)*p
BIC
224.186673615698
220.546578366508
223.499527769709
Entrée [69]: info$bic
-36.7287874898513
-40.368882739041
-37.4159333358395
According to the BIC_p criterion, the recommended best subset of predictor variables is the one containing X1 and X3, which has the smallest BIC_p value.
b. Do the four criteria in part (a) identify the same best subset? Does
this always happen?
All four criteria in part (a) identify the same best subset. However, this does not always happen, because the criteria penalize model size differently.
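The Start/Step trace below is the output of a forward stepwise search; the call itself was lost in this export, but it was presumably something like `step(lm(Y ~ 1), scope = ~ X1 + X2 + X3, direction = 'forward')`. A self-contained sketch on synthetic data (all names illustrative, not the lab's data):

```r
# Forward stepwise selection by AIC with base R's step():
# start from the intercept-only model, allow terms up to X1 + X2 + X3.
set.seed(7)
n <- 50
d <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
d$Y <- 5 + 2 * d$X1 + 3 * d$X3 + rnorm(n, sd = 0.4)   # truth: X1 and X3

null_fit <- lm(Y ~ 1, data = d)
fwd <- step(null_fit,
            scope = ~ X1 + X2 + X3,   # upper scope for forward search
            direction = "forward",
            trace = 1)                # trace = 1 prints Start/Step lines
names(coef(fwd))
```

With `trace = 1`, `step()` prints the same kind of "Start: AIC=..." / "Step: AIC=..." log seen below.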
Start: AIC=262.92
Y ~ 1
Step: AIC=220.53
Y ~ X1
Step: AIC=215.06
Y ~ X1 + X3
Forward stepwise regression has an advantage here as a screening procedure over the all-possible-regressions procedure: it reports the best subset for each number of parameters p, in order from the smallest p to the largest.
Problem 2:
Import dataset
Entrée [1]: jp = read.table('C:/Users/Admin/Desktop/Sem 2/RA/Dataset/9.10.txt', col.names = c
head(jp)
Y X1 X2 X3 X4
88 86 110 100 87
80 62 97 99 100
76 101 117 93 95
80 100 101 95 88
73 78 85 95 84
Entrée [45]: attach(jp)
a. Prepare several plots of the test scores for each of the four newly developed aptitude tests. Are there any noteworthy features in these plots? Comment.
Entrée [30]: summary(jp)
1st Qu.: Y 78.0, X1 91.0, X2 94.0, X3 95.0
X4: Min. 74.00, Median 95.00, Mean 94.68, 3rd Qu. 103.00, Max. 110.00
b. Obtain the scatter plot matrix. Also obtain the correlation matrix of the X variables. What do the scatter plots suggest about the nature of the functional relationship between the response variable Y and each of the predictor variables? Are any serious multicollinearity problems evident? Explain.
Entrée [15]: pairs(jp)
Entrée [22]: cor(jp)
There is a clear linear relationship between Y and X3 and between Y and X4. Y also appears linearly related to X1 and X2, though less distinctly.
c. In the pool of potential X variables, find the four best subset regression models according to the R²_a,p criterion.
Entrée [28]: X = cbind(jp$X1, jp$X2, jp$X3, jp$X4)
jpleaps_aR = leaps(X, jp$Y, method='adjr2')
jpleaps_aR
leaps output, adjusted R² by subset size:

size 2: 0.7962, 0.7452, 0.2326, 0.2143
size 3: 0.9269, 0.8661, 0.7985, 0.7884, 0.7636, 0.4155
size 4: 0.9560, 0.9247, 0.8617, 0.8233
size 5: 0.9555
Top 1: 0.956 (X1, X3, X4)
Top 2: 0.955 (X1, X2, X3, X4)
Top 3: 0.927 (X1, X3)
Top 4: 0.925 (X1, X2, X3)
Entrée [54]: plot(jpleaps_aR$size, jpleaps_aR$adjr2, xlab='p', ylab='adjusted R square')
d. Use other criteria (Cp, AIC, or BIC) to identify the four best models. Compare the result to (c).
𝐶𝑝
leaps output, Cp by subset size:

size 2: 84.25, 110.60, 375.34, 384.83
size 3: 17.11, 47.15, 80.57, 85.52, 97.80, 269.78
size 4: 3.73, 18.52, 48.23, 66.35
size 5: 5.00
Top 1: 3.727 (X1, X3, X4)
Top 2: 17.113 (X1, X3)
Top 3: 18.521 (X1, X2, X3)
Top 4: 47.154 (X3, X4)
Cp is not informative for the full model, since there Cp = p by construction.
This is the same result as in question (c).
AIC
(subset-selection table not recovered in this export)
Entrée [69]: n = nrow(jp)
p = (2:5)
AIC = n*log(info$rss) - n*log(n) + 2*p
sort(AIC)
73.8473152171715
74.9542108999027
85.7272117508187
110.468533533551
BIC
Entrée [66]: p = (2:5)
BIC = n*log(info$rss) - n*log(n) + log(n)*p
BIC
112.906285183288
89.3838392254233
78.7228185166443
81.0485900242437
Start: AIC=149.3
Y ~ 1
Step: AIC=110.47
Y ~ X3
Step: AIC=85.73
Y ~ X3 + X1
Step: AIC=73.85
Y ~ X3 + X1 + X4
f. How does the best subset according to forward stepwise regression compare with the best subset according to the R²_a,p criterion?
Both forward stepwise regression and the R²_a,p criterion recommend the same best subset, which contains X1, X3, and X4.
Problem 3:
Entrée [72]: jps = read.table('C:/Users/Admin/Desktop/Sem 2/RA/Dataset/9.22.txt', col.names =
head(jps)
Y X1 X2 X3 X4
58 65 109 88 84
92 85 90 104 98
71 93 73 91 82
77 95 57 95 85
66 63 101 93 84
Entrée [73]: attach(jps)
a. Obtain the correlation matrix of the X variables for the validation data
set and compare it with the model-building data set. Are the two
correlation matrices reasonably similar?
The two correlation matrices are not really similar: some dependences among the X variables increase while others, such as that between X3 and X4, decrease.
Call:
Residuals:
Coefficients:
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
Residuals:
Coefficients:
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The estimates for the validation data set do not appear reasonably similar to those obtained for the model-building data set.
y_val_predict (predicted Y values for the validation set):
 88.5037  76.9251  49.5003  30.4505 146.9030
 79.6421 120.3845  90.0475  79.3433 103.0607
 98.7746 103.2340  90.8567 140.6543  87.1153
112.2843 110.1744 151.4436 104.0741 114.2877
124.7463 128.0374 124.1323  57.6483 112.0692
Entrée [97]: # The mean square error for the validation data set
MSE_val = sum((jps$Y - y_val_predict)^2)/(nrow(jps) - 4)
MSE_val
949.418773435238
Entrée [103]: # The mean square error of the best model with model-building data set
MSE_mb = sum((resid(best_model_pro2))^2)/(nrow(jp) - 4)
MSE_mb
16.5808098885237
The mean squared error for the validation data set is very large compared with that of the best model fitted to the model-building data set.
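The validation-set quantity computed above is essentially the mean squared prediction error (MSPR). A small sketch showing both divisor conventions (the notebook divides by n minus 4, while MSPR is often defined with the raw validation count); the function and data here are illustrative, not from the lab:

```r
# Mean squared prediction error on a hold-out set, with an optional
# adjustment for the number of estimated parameters.
mspr <- function(y, yhat, n_params = NULL) {
  denom <- if (is.null(n_params)) length(y) else length(y) - n_params
  sum((y - yhat)^2) / denom
}

y    <- c(10, 12, 9, 15)     # made-up hold-out responses
yhat <- c(11, 11, 10, 14)    # made-up predictions
mspr(y, yhat)                # squared errors sum to 4; divided by n* = 4
mspr(y, yhat, n_params = 2)  # same numerator, divided by n* - 2
```

If MSPR is much larger than the model-building MSE, as in this lab, the fitted model's predictive ability is weaker than its in-sample fit suggests.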