
Nguyen Hoang Anh Tu
ITDSIU20090

LAB 5: MODEL SELECTION
Problem 1:

Import dataset
Entrée [18]: ps = read.table('C:/Users/Nguyen Cuc/Desktop/Sem 2/RA/Dataset/PatientSatisfaction
head(ps)

Y X1 X2 X3

48 50 51 2.3

57 36 46 2.3

66 40 48 2.2

70 41 44 1.8

89 28 43 1.8

36 49 54 2.9

Entrée [19]: attach(ps)

The following objects are masked from ps (pos = 3):

X1, X2, X3, Y

a. Indicate which subset of predictor variables you would recommend as best for predicting patient satisfaction according to each of the following criteria: (1) $R^2_{a,p}$, (2) $AIC_p$, (3) $C_p$, (4) $BIC_p$. Support your recommendations with appropriate graphs.
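For reference, these are the formulas used in the manual calculations below, with $n$ the number of observations, $p$ the number of parameters including the intercept, $SSE_p$ the error sum of squares of the subset model, $SSTO$ the total sum of squares, and $MSE$ taken from the full model, exactly as implemented in the code cells that follow:

$R^2_{a,p} = 1 - \dfrac{SSE_p/(n-p)}{SSTO/(n-1)}$

$AIC_p = n\ln(SSE_p) - n\ln(n) + 2p$

$C_p = \dfrac{SSE_p}{MSE(X_1, X_2, X_3)} - (n - 2p)$

$BIC_p = n\ln(SSE_p) - n\ln(n) + p\ln(n)$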

(1) $R^2_{a,p}$
Manual calculation

Entrée [20]: n = nrow(ps)
ssto = sum((ps$Y - mean(ps$Y))^2)

Entrée [21]: # X1
p = 2
# Apply model
model_1 = lm(Y ~ X1)
# SSE
sse_1 = sum(resid(model_1)^2)
# Adjusted R square
aR2_1 = 1 - (sse_1 / (n-p))/(ssto/(n-1))
print(aR2_1)
# Double check with function
# summary(model_1)

[1] 0.6103248

Entrée [22]: # X2
p = 2
# Apply model
model_2 = lm(Y ~ X2)
# SSE
sse_2 = sum(resid(model_2)^2)
# Adjusted R square
aR2_2 = 1 - (sse_2 / (n-p))/(ssto/(n-1))
print(aR2_2)
# Double check with function
# summary(model_2)

[1] 0.3490737

Entrée [23]: # X3
p = 2
# Apply model
model_3 = lm(Y ~ X3)
# SSE
sse_3 = sum(resid(model_3)^2)
# Adjusted R square
aR2_3 = 1 - (sse_3 / (n-p))/(ssto/(n-1))
print(aR2_3)
# Double check with function
# summary(model_3)

[1] 0.4022134

Entrée [24]: # X1 - X2
p = 3
# Apply model
model_12 = lm(Y ~ X1 + X2)
# SSE
sse_12 = sum(resid(model_12)^2)
# Adjusted R square
aR2_12 = 1 - (sse_12 / (n-p))/(ssto/(n-1))
print(aR2_12)
# Double check with function
# summary(model_12)

[1] 0.6389073

Entrée [25]: # X1 - X3
p = 3
# Apply model
model_13 = lm(Y ~ X1 + X3)
# SSE
sse_13 = sum(resid(model_13)^2)
# Adjusted R square
aR2_13 = 1 - (sse_13 / (n-p))/(ssto/(n-1))
print(aR2_13)
# Double check with function
# summary(model_13)

[1] 0.6610206

Entrée [26]: # X2 - X3
p = 3
# Apply model
model_23 = lm(Y ~ X2 + X3)
# SSE
sse_23 = sum(resid(model_23)^2)
# Adjusted R square
aR2_23 = 1 - (sse_23 / (n-p))/(ssto/(n-1))
print(aR2_23)
# Double check with function
# summary(model_23)

[1] 0.4437314

Entrée [28]: # X1 - X2 - X3
p = 4
# Apply model
model_123 = lm(Y ~ X1 + X2 + X3)
# SSE
sse_123 = sum(resid(model_123)^2)
# Adjusted R square
aR2_123 = 1 - (sse_123 / (n-p))/(ssto/(n-1))
print(aR2_123)
# Double check with function
# summary(model_123)

[1] 0.6594939

Using function

Entrée [31]: install.packages('leaps')

package 'leaps' successfully unpacked and MD5 sums checked

The downloaded binary packages are in

C:\Users\Nguyen Cuc\AppData\Local\Temp\Rtmpk5HDPK\downloaded_packages

Entrée [32]: library(leaps)
X = cbind(ps$X1, ps$X2, ps$X3)
psleaps_aR = leaps(X, ps$Y, method='adjr2')
psleaps_aR

Warning message:

"package 'leaps' was built under R version 3.6.3"

$which
1 2 3

1 TRUE FALSE FALSE

1 FALSE FALSE TRUE

1 FALSE TRUE FALSE

2 TRUE FALSE TRUE

2 TRUE TRUE FALSE

2 FALSE TRUE TRUE

3 TRUE TRUE TRUE

$label
'(Intercept)'
'1'
'2'
'3'

$size
2
2
2
3
3
3
4

$adjr2
0.610324803177749
0.402213399193454
0.349073707181763
0.661020632881959
0.638907288953016
0.443731414752623
0.659493928515078

Entrée [33]: bestmodel.aR = psleaps_aR$which[which(psleaps_aR$adjr2 == max(psleaps_aR$adjr2)),]


bestmodel.aR

1 TRUE
2 FALSE
3 TRUE

Entrée [43]: plot(psleaps_aR$size, psleaps_aR$adjr2, xlab='p', ylab='Adjusted R square')

According to the $R^2_{a,p}$ criterion, the recommended best subset of predictor variables is the one containing X1 and X3, which has the highest $R^2_{a,p}$ value.
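As a worked check of the formula, using n = 46, p = 3, SSE for the {X1, X3} model ≈ 4330.5 and SSTO ≈ 13369.3 (both values also appear in the step() trace further below):

$R^2_{a,p} = 1 - \dfrac{4330.5/43}{13369.3/45} \approx 1 - \dfrac{100.7}{297.1} \approx 0.661$

which matches the value printed above for the X1, X3 model.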

(2) 𝐴𝐼𝐶𝑝
Manual calculation

Entrée [37]: n = nrow(ps)
p = 2
AIC_1 = n*log(sum((resid(model_1))^2)) - n*log(n) + 2*p
sprintf('AIC for model 1 is %f', AIC_1)
AIC_2 = n*log(sum((resid(model_2))^2)) - n*log(n) + 2*p
sprintf('AIC for model 2 is %f', AIC_2)
AIC_3 = n*log(sum((resid(model_3))^2)) - n*log(n) + 2*p
sprintf('AIC for model 3 is %f', AIC_3)

'AIC for model 1 is 220.529391'

'AIC for model 2 is 244.131202'

'AIC for model 3 is 240.213723'

Entrée [38]: n = nrow(ps)
p = 3
AIC_12 = n*log(sum((resid(model_12))^2)) - n*log(n) + 2*p
sprintf('AIC for model 12 is %f', AIC_12)
AIC_13 = n*log(sum((resid(model_13))^2)) - n*log(n) + 2*p
sprintf('AIC for model 13 is %f', AIC_13)
AIC_23 = n*log(sum((resid(model_23))^2)) - n*log(n) + 2*p
sprintf('AIC for model 23 is %f', AIC_23)

'AIC for model 12 is 217.967647'

'AIC for model 13 is 215.060654'

'AIC for model 23 is 237.845006'

Entrée [39]: n = nrow(ps)
p = 4
AIC_123 = n*log(sum((resid(model_123))^2)) - n*log(n) + 2*p
sprintf('AIC for model 123 is %f', AIC_123)

'AIC for model 123 is 216.184962'

Using function

Entrée [40]: selinfo = regsubsets(Y ~ X1 + X2 + X3, data = ps)


info = summary(selinfo)
print(info)

Subset selection object

Call: regsubsets.formula(Y ~ X1 + X2 + X3, data = ps)

3 Variables (and intercept)

Forced in Forced out

X1 FALSE FALSE

X2 FALSE FALSE

X3 FALSE FALSE

1 subsets of each size up to 3

Selection Algorithm: exhaustive

X1 X2 X3

1 ( 1 ) "*" " " " "

2 ( 1 ) "*" " " "*"

3 ( 1 ) "*" "*" "*"

The last three rows show the best model among all models that have the same number of parameters p.

Entrée [41]: p = (2:4)
AIC = n*log(info$rss) - n*log(n) + 2*p
AIC

220.529390822719
215.060654177041
216.184962183753

Entrée [42]: plot(p, AIC, xlab = 'p', ylab = 'AIC')

According to the $AIC_p$ criterion, the recommended best subset of predictor variables is the one containing X1 and X3, which has the smallest $AIC_p$ value.
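As an optional cross-check (a small sketch, assuming the ps data frame from the cells above is still loaded), the built-in extractAIC() function computes n*log(RSS/n) + 2*p, which is the same quantity as the manual formula n*log(SSE) - n*log(n) + 2*p:

# Second element of each result is the AIC value
extractAIC(lm(Y ~ X1, data = ps))            # about 220.53
extractAIC(lm(Y ~ X1 + X3, data = ps))       # about 215.06
extractAIC(lm(Y ~ X1 + X2 + X3, data = ps))  # about 216.18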

(3) 𝐶𝑝
Manual calculation

Entrée [70]: MSE = sum((resid(model_123))^2) / (n - 4)


MSE

101.16287337699

Entrée [52]: p = 2
Cp_1 = (sum((resid(model_1))^2) / MSE) - (n - 2 * p)
sprintf('The Cp value for model 1 is %f', Cp_1)
Cp_2 = (sum((resid(model_2))^2) / MSE) - (n - 2 * p)
sprintf('The Cp value for model 2 is %f', Cp_2)
Cp_3 = (sum((resid(model_3))^2) / MSE) - (n - 2 * p)
sprintf('The Cp value for model 3 is %f', Cp_3)

'The Cp value for model 1 is 8.353606'

'The Cp value for model 2 is 42.112324'

'The Cp value for model 3 is 35.245643'

Entrée [53]: p = 3
Cp_12 = (sum((resid(model_12))^2) / MSE) - (n - 2 * p)
sprintf('The Cp value for model 12 is %f', Cp_12)
Cp_13 = (sum((resid(model_13))^2) / MSE) - (n - 2 * p)
sprintf('The Cp value for model 13 is %f', Cp_13)
Cp_23 = (sum((resid(model_23))^2) / MSE) - (n - 2 * p)
sprintf('The Cp value for model 23 is %f', Cp_23)

'The Cp value for model 12 is 5.599735'

'The Cp value for model 13 is 2.807204'

'The Cp value for model 23 is 30.247056'

Entrée [54]: p = 4
Cp_123 = (sum((resid(model_123))^2) / MSE) - (n - 2 * p)
sprintf('The Cp value for model 123 is %f', Cp_123)

'The Cp value for model 123 is 4.000000'

Using function

Entrée [44]: psleaps_Cp = leaps(X, ps$Y, method='Cp')


psleaps_Cp

$which
1 2 3

1 TRUE FALSE FALSE

1 FALSE FALSE TRUE

1 FALSE TRUE FALSE

2 TRUE FALSE TRUE

2 TRUE TRUE FALSE

2 FALSE TRUE TRUE

3 TRUE TRUE TRUE

$label
'(Intercept)'
'1'
'2'
'3'

$size
2
2
2
3
3
3
4

$Cp
8.35360628199045
35.2456429948055
42.1123236337672
2.80720376735253
5.59973485144707
30.2470562751665
4

Entrée [45]: bestmodel.Cp = psleaps_Cp$which[which(psleaps_Cp$Cp == min(psleaps_Cp$Cp)),]


bestmodel.Cp

1 TRUE
2 FALSE
3 TRUE

Entrée [46]: plot(psleaps_Cp$size, psleaps_Cp$Cp, xlab='p', ylab='Cp')

According to the $C_p$ criterion, the recommended best subset of predictor variables is the one containing X1 and X3: it has the smallest $C_p$ value, and that value is also close to p (2.8 ≈ 3).
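As a worked instance of the formula, for the X1, X3 model (SSE ≈ 4330.5, full-model MSE ≈ 101.16, n = 46, p = 3):

$C_p = \dfrac{4330.5}{101.16} - (46 - 2 \cdot 3) \approx 42.81 - 40 = 2.81$

which matches the value of 2.807 printed above.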

(4) $BIC_p$

Manual calculation

Entrée [56]: p = 2
BIC_1 = n*log(sum((resid(model_1))^2)) - n*log(n) + log(n)*p
sprintf('The BIC value for model 1 is %f', BIC_1)
BIC_2 = n*log(sum((resid(model_2))^2)) - n*log(n) + log(n)*p
sprintf('The BIC value for model 2 is %f', BIC_2)
BIC_3 = n*log(sum((resid(model_3))^2)) - n*log(n) + log(n)*p
sprintf('The BIC value for model 3 is %f', BIC_3)

'The BIC value for model 1 is 224.186674'

'The BIC value for model 2 is 247.788485'

'The BIC value for model 3 is 243.871006'

Entrée [57]: p = 3
BIC_12 = n*log(sum((resid(model_12))^2)) - n*log(n) + log(n)*p
sprintf('The BIC value for model 12 is %f', BIC_12)
BIC_13 = n*log(sum((resid(model_13))^2)) - n*log(n) + log(n)*p
sprintf('The BIC value for model 13 is %f', BIC_13)
BIC_23 = n*log(sum((resid(model_23))^2)) - n*log(n) + log(n)*p
sprintf('The BIC value for model 23 is %f', BIC_23)

'The BIC value for model 12 is 223.453571'

'The BIC value for model 13 is 220.546578'

'The BIC value for model 23 is 243.330931'

Entrée [63]: p = 4
BIC_123 = n*log(sum((resid(model_123))^2)) - n*log(n) + log(n)*p
sprintf('The BIC value for model 123 is %f', BIC_123)

'The BIC value for model 123 is 223.499528'

Entrée [65]: p = (2:4)
BIC = n*log(info$rss) - n*log(n) + log(n)*p
BIC

224.186673615698
220.546578366508
223.499527769709

Entrée [69]: info$bic

-36.7287874898513
-40.368882739041
-37.4159333358395
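The BIC values reported by summary(regsubsets(...)) look very different from the manual values above, but for these data they appear to differ only by a constant shift (numerically about n*log(SSTO/n) ≈ 260.9, i.e. they are measured relative to the total sum of squares), so both versions rank the subsets identically. A quick check, assuming n, ssto, BIC and info from the cells above:

# Shifting the manual BIC values should approximately reproduce info$bic
# (about -36.73, -40.37, -37.42 for the three best subsets)
BIC - n * log(ssto / n)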

Entrée [66]: plot(p, BIC, xlab = 'p', ylab = 'BIC')

According to the $BIC_p$ criterion, the recommended best subset of predictor variables is the one containing X1 and X3, which has the smallest $BIC_p$ value.

Plots for $R^2_{a,p}$, $C_p$, and $BIC_p$ from the summary of the regsubsets function.


Entrée [68]: # Set up a 2x2 grid so we can look at 4 plots at once


par(mfrow = c(2,2))
plot(info$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
plot(info$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")

# Plot a red dot to indicate the model with the largest adjusted R^2;
# which.max() identifies the location of the maximum point
adj_r2_max = which.max(info$adjr2)  # = 2 here (the X1, X3 model)

# points() works like plot(), except that it adds points
# to a plot that has already been created instead of creating a new plot
points(adj_r2_max, info$adjr2[adj_r2_max], col ="red", cex = 2, pch = 20)

# Do the same for C_p and BIC, this time looking for the models with the SMALLEST values
plot(info$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
cp_min = which.min(info$cp)  # = 2 here
points(cp_min, info$cp[cp_min], col = "red", cex = 2, pch = 20)

plot(info$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
bic_min = which.min(info$bic)  # = 2 here
points(bic_min, info$bic[bic_min], col = "red", cex = 2, pch = 20)

b. Do the four criteria in part (a) identify the same best subset? Does
this always happen?

All four criteria in part (a) identify the same best subset, {X1, X3}. However, this does not always happen, because the criteria penalize model size differently and can therefore favour different subsets.

c. Would forward stepwise regression have any advantages here as a


screening procedure over the all-possible-regressions procedure?

Entrée [71]: forwardproc = step(lm(Y ~ 1, ps), list(upper = ~ X1+X2+X3), direction='forward')


summary(forwardproc)

Start: AIC=262.92

Y ~ 1

Df Sum of Sq RSS AIC

+ X1 1 8275.4 5093.9 220.53

+ X3 1 5554.9 7814.4 240.21

+ X2 1 4860.3 8509.0 244.13

<none> 13369.3 262.92

Step: AIC=220.53

Y ~ X1

Df Sum of Sq RSS AIC

+ X3 1 763.42 4330.5 215.06

+ X2 1 480.92 4613.0 217.97

<none> 5093.9 220.53

Step: AIC=215.06

Y ~ X1 + X3

Df Sum of Sq RSS AIC

<none> 4330.5 215.06

+ X2 1 81.659 4248.8 216.19

Call:

lm(formula = Y ~ X1 + X3, data = ps)

Residuals:

Min 1Q Median 3Q Max

-19.4453 -7.3285 0.6733 8.5126 18.0534

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 145.9412 11.5251 12.663 4.21e-16 ***

X1 -1.2005 0.2041 -5.882 5.43e-07 ***

X3 -16.7421 6.0808 -2.753 0.00861 **

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.04 on 43 degrees of freedom

Multiple R-squared: 0.6761, Adjusted R-squared: 0.661

F-statistic: 44.88 on 2 and 43 DF, p-value: 2.98e-11

Forward stepwise regression has advantages here as a screening procedure over the all-possible-regressions procedure: it adds one predictor at a time, reporting the best candidate subset for each size p in order from the smallest p upward, so it does not need to fit and compare every possible subset.
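A rough way to see the screening saving (an illustrative sketch; with only three candidate predictors the difference is modest):

# All-possible-regressions fits every non-empty subset of the 3 predictors,
# while forward stepwise fits at most 3 + 2 + 1 candidate models and may stop
# earlier, as in the step() trace above.
all_possible = 2^3 - 1    # 7 subset models
forward_max  = 3 + 2 + 1  # at most 6 fits
c(all_possible = all_possible, forward_max = forward_max)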

Problem 2:

Import dataset
Entrée [1]: jp = read.table('C:/Users/Admin/Desktop/Sem 2/RA/Dataset/9.10.txt', col.names = c
head(jp)

Y X1 X2 X3 X4

88 86 110 100 87

80 62 97 99 100

96 110 107 103 103

76 101 117 93 95

80 100 101 95 88

73 78 85 95 84

Entrée [45]: attach(jp)

The following objects are masked from jps:

X1, X2, X3, X4, Y

The following objects are masked from jp (pos = 5):

X1, X2, X3, X4, Y

a. Prepare several plots of the test scores for each of the four newly developed aptitude tests. Are there any noteworthy features in these plots? Comment.

Entrée [30]: summary(jp)

Y X1 X2 X3

Min. : 58.0 Min. : 62.0 Min. : 73.0 Min. : 80.0

1st Qu.: 78.0 1st Qu.: 91.0 1st Qu.: 94.0 1st Qu.: 95.0

Median : 94.0 Median :104.0 Median :113.0 Median :100.0

Mean : 92.2 Mean :103.4 Mean :106.7 Mean :100.8

3rd Qu.:109.0 3rd Qu.:112.0 3rd Qu.:121.0 3rd Qu.:107.0

Max. :127.0 Max. :150.0 Max. :129.0 Max. :116.0

X4

Min. : 74.00

1st Qu.: 87.00

Median : 95.00

Mean : 94.68

3rd Qu.:103.00

Max. :110.00

Entrée [31]: par(mfrow = c(3,2))


hist(jp$X1)
hist(jp$X2)
hist(jp$X3)
hist(jp$X4)
hist(jp$Y)

b. Obtain the scatter plot matrix. Also obtain the correlation matrix of the X variables. What do the scatter plots suggest about the nature of the functional relationship between the response variable Y and each of the predictor variables? Are any serious multicollinearity problems evident? Explain.

Entrée [15]: pairs(jp)

Entrée [22]: cor(jp)

Y X1 X2 X3 X4

Y 1.0000000 0.5144107 0.4970057 0.8970645 0.8693865

X1 0.5144107 1.0000000 0.1022689 0.1807692 0.3266632

X2 0.4970057 0.1022689 1.0000000 0.5190448 0.3967101

X3 0.8970645 0.1807692 0.5190448 1.0000000 0.7820385

X4 0.8693865 0.3266632 0.3967101 0.7820385 1.0000000

There is a clear linear relationship between Y and X3 and between Y and X4. X1 and X2 also show a linear relationship with Y, but it is weaker and harder to see in the scatter plots.

X3 and X4 are strongly correlated with each other (r ≈ 0.78), which suggests a possible multicollinearity problem.
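One additional way to quantify the multicollinearity (not part of the original lab; this assumes the car package is available) is to compute variance inflation factors for the full first-order model:

# install.packages('car')   # if not already installed
library(car)
vif(lm(Y ~ X1 + X2 + X3 + X4, data = jp))
# A common rule of thumb treats VIF values above about 10 as serious multicollinearity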



c. Using only first-order terms for the predictor variables in the pool of potential X variables, find the four best subset regression models according to the $R^2_{a,p}$ criterion.
Entrée [28]: X = cbind(jp$X1, jp$X2, jp$X3, jp$X4)
jpleaps_aR = leaps(X, jp$Y, method='adjr2')
jpleaps_aR

$which
1 2 3 4

1 FALSE FALSE TRUE FALSE

1 FALSE FALSE FALSE TRUE

1 TRUE FALSE FALSE FALSE

1 FALSE TRUE FALSE FALSE

2 TRUE FALSE TRUE FALSE

2 FALSE FALSE TRUE TRUE

2 TRUE FALSE FALSE TRUE

2 FALSE TRUE TRUE FALSE

2 FALSE TRUE FALSE TRUE

2 TRUE TRUE FALSE FALSE

3 TRUE FALSE TRUE TRUE

3 TRUE TRUE TRUE FALSE

3 FALSE TRUE TRUE TRUE

3 TRUE TRUE FALSE TRUE

4 TRUE TRUE TRUE TRUE


$label
'(Intercept)'
'1'
'2'
'3'
'4'

$size
2
2
2
2
3
3
3
3
3
3
4
4
4
4
5

$adjr2
0.7962344370837
0.745216968986486
0.23264524812475
0.214276183981299
0.926904337908834
0.866098849056532
0.79847156364106
0.788443562503307
0.763591589343016
0.415485278382781
0.956048217657989
0.924677876991158
0.861679721457713
0.823266429947636
0.955470175923839

Top 1: 0.956 − X1, X3, X4
Top 2: 0.955 − X1, X2, X3, X4
Top 3: 0.927 − X1, X3
Top 4: 0.925 − X1, X2, X3
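The four best subsets listed above can also be pulled out programmatically from the leaps output; a small sketch assuming jpleaps_aR from the cell above (columns 1-4 correspond to X1-X4):

# Order all candidate subsets by adjusted R^2 and show the four best
ord = order(jpleaps_aR$adjr2, decreasing = TRUE)[1:4]
cbind(jpleaps_aR$which[ord, ], adjr2 = round(jpleaps_aR$adjr2[ord], 3))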
Entrée [54]: plot(jpleaps_aR$size, jpleaps_aR$adjr2, xlab='p', ylab='adjusted R square')

d. Using other criteria ($C_p$, AIC, or BIC), identify the four best models. Compare the result to part (c).

$C_p$

Entrée [55]: jpleaps_Cp = leaps(X, jp$Y, method='Cp')


jpleaps_Cp

$which
1 2 3 4

1 FALSE FALSE TRUE FALSE

1 FALSE FALSE FALSE TRUE

1 TRUE FALSE FALSE FALSE

1 FALSE TRUE FALSE FALSE

2 TRUE FALSE TRUE FALSE

2 FALSE FALSE TRUE TRUE

2 TRUE FALSE FALSE TRUE

2 FALSE TRUE TRUE FALSE

2 FALSE TRUE FALSE TRUE

2 TRUE TRUE FALSE FALSE

3 TRUE FALSE TRUE TRUE

3 TRUE TRUE TRUE FALSE

3 FALSE TRUE TRUE TRUE

3 TRUE TRUE FALSE TRUE

4 TRUE TRUE TRUE TRUE


$label
'(Intercept)'
'1'
'2'
'3'
'4'

$size
2
2
2
2
3
3
3
3
3
3
4
4
4
4
5

$Cp
84.2464959003476
110.597414426974
375.344689414102
384.832453717342
17.1129781976063
47.1539851520169
80.5653068899095
85.5196499534087
97.7977898488476
269.78002871929
3.72739895858592
18.5214649058647
48.2310201005948
66.3464997470431
5

Top 1: 3.727 − X1, X3, X4
Top 2: 17.113 − X1, X3
Top 3: 18.521 − X1, X2, X3
Top 4: 47.154 − X3, X4

$C_p$ is not used to rank the full model, since for the full model $C_p = p$ by construction (here 5).
Same result as in part (c).

AIC

Entrée [58]: selinfo = regsubsets(Y ~ X1 + X2 + X3 + X4, data = jp)


info = summary(selinfo)
print(info)

Subset selection object

Call: regsubsets.formula(Y ~ X1 + X2 + X3 + X4, data = jp)

4 Variables (and intercept)

Forced in Forced out

X1 FALSE FALSE

X2 FALSE FALSE

X3 FALSE FALSE

X4 FALSE FALSE

1 subsets of each size up to 4

Selection Algorithm: exhaustive

X1 X2 X3 X4

1 ( 1 ) " " " " "*" " "

2 ( 1 ) "*" " " "*" " "

3 ( 1 ) "*" " " "*" "*"

4 ( 1 ) "*" "*" "*" "*"

Entrée [69]: n = nrow(jp)
p = (2:5)
AIC = n*log(info$rss) - n*log(n) + 2*p
sort(AIC)

73.8473152171715
74.9542108999027
85.7272117508187
110.468533533551

Entrée [70]: plot(p, AIC, xlab = 'p', ylab = 'AIC', type = 'l')


aic_min = which.min(AIC)
points(p[aic_min], AIC[aic_min], col = "red", cex = 2, pch = 20)

Same result as in part (c).

BIC
Entrée [66]: p = (2:5)
BIC = n*log(info$rss) - n*log(n) + log(n)*p
BIC

112.906285183288
89.3838392254233
78.7228185166443
81.0485900242437

Entrée [68]: plot(p, BIC, xlab = 'p', ylab = 'BIC', type = 'l')


bic_min = which.min(BIC)
points(p[bic_min], BIC[bic_min], col = "red", cex = 2, pch = 20)

Same result as in part (c).

e. Using forward stepwise regression, find the best subset of predictor


variables to predict job proficiency.

Entrée [71]: forwardproc = step(lm(Y ~ 1, jp), list(upper = ~ X1 + X2 + X3 + X4), direction='forward')


summary(forwardproc)

Start: AIC=149.3

Y ~ 1

Df Sum of Sq RSS AIC

+ X3 1 7286.0 1768.0 110.47

+ X4 1 6843.3 2210.7 116.06

+ X1 1 2395.9 6658.1 143.62

+ X2 1 2236.5 6817.5 144.21

<none> 9054.0 149.30

Step: AIC=110.47

Y ~ X3

Df Sum of Sq RSS AIC

+ X1 1 1161.37 606.66 85.727

+ X4 1 656.71 1111.31 100.861

<none> 1768.02 110.469

+ X2 1 12.21 1755.81 112.295

Step: AIC=85.73

Y ~ X3 + X1

Df Sum of Sq RSS AIC

+ X4 1 258.460 348.20 73.847

<none> 606.66 85.727

+ X2 1 9.937 596.72 87.314

Step: AIC=73.85

Y ~ X3 + X1 + X4

Df Sum of Sq RSS AIC

<none> 348.20 73.847

+ X2 1 12.22 335.98 74.954

Call:

lm(formula = Y ~ X3 + X1 + X4, data = jp)

Residuals:

Min 1Q Median 3Q Max

-5.4579 -3.1563 -0.2057 1.8070 6.6083

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -124.20002 9.87406 -12.578 3.04e-11 ***

X3 1.35697 0.15183 8.937 1.33e-08 ***

X1 0.29633 0.04368 6.784 1.04e-06 ***

X4 0.51742 0.13105 3.948 0.000735 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.072 on 21 degrees of freedom

Multiple R-squared: 0.9615, Adjusted R-squared: 0.956

F-statistic: 175 on 3 and 21 DF, p-value: 5.16e-15

f. How does the best subset according to forward stepwise regression compare with the best subset according to the $R^2_{a,p}$ criterion?

Both forward stepwise regression and the $R^2_{a,p}$ criterion recommend the same best subset, which contains X1, X3, and X4.

Problem 3:
Entrée [72]: jps = read.table('C:/Users/Admin/Desktop/Sem 2/RA/Dataset/9.22.txt', col.names =
head(jps)

Y X1 X2 X3 X4

58 65 109 88 84

92 85 90 104 98

71 93 73 91 82

77 95 57 95 85

92 102 139 101 92

66 63 101 93 84

Entrée [73]: attach(jps)

The following objects are masked from jp (pos = 3):

X1, X2, X3, X4, Y

The following objects are masked from jps (pos = 4):

X1, X2, X3, X4, Y

The following objects are masked from jp (pos = 6):

X1, X2, X3, X4, Y

a. Obtain the correlation matrix of the X variables for the validation data
set and compare it with the model-building data set. Are the two
correlation matrices reasonably similar?

Entrée [78]: # The validation dataset


cor(jps)

Y X1 X2 X3 X4

Y 1.0000000 0.53707787 0.34477442 0.8880519 0.8879388

X1 0.5370779 1.00000000 0.01057088 0.1772891 0.3196395

X2 0.3447744 0.01057088 1.00000000 0.3437441 0.2207638

X3 0.8880519 0.17728907 0.34374413 1.0000000 0.8714466

X4 0.8879388 0.31963945 0.22076377 0.8714466 1.0000000

Entrée [79]: # The model-building dataset


cor(jp)

Y X1 X2 X3 X4

Y 1.0000000 0.5144107 0.4970057 0.8970645 0.8693865

X1 0.5144107 1.0000000 0.1022689 0.1807692 0.3266632

X2 0.4970057 0.1022689 1.0000000 0.5190448 0.3967101

X3 0.8970645 0.1807692 0.5190448 1.0000000 0.7820385

X4 0.8693865 0.3266632 0.3967101 0.7820385 1.0000000

The two correlation matrices are not really similar: the correlation between X3 and X4 increases (from about 0.78 to 0.87), whereas the correlations of X2 with X3 and X4 decrease.
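A compact way to see these shifts side by side (a small sketch, assuming both data frames are loaded as above):

# Element-wise difference between the validation (jps) and model-building (jp) correlation matrices
round(cor(jps) - cor(jp), 3)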

b. Using the best subset of variables identified in Problem 2 to fit the


validation data set. Compare the estimated regression coefficients and
their estimated standard deviations to those obtained in Problem 2.
Also, compare the error mean squares and coefficients of multiple
determination. Do the estimates for the validation data set appear to be
reasonably similar to those obtained for the model-building data set?

Entrée [101]: best_model_pro3 = lm(Y ~ X1 + X3 + X4, data = jps)


summary(best_model_pro3)

Call:

lm(formula = Y ~ X1 + X3 + X4, data = jps)

Residuals:

Min 1Q Median 3Q Max

-9.4619 -2.3836 0.6834 2.1123 7.2394

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -122.76705 11.84783 -10.362 1.04e-09 ***

X1 0.31238 0.04729 6.605 1.54e-06 ***

X3 1.40676 0.23262 6.048 5.31e-06 ***

X4 0.42838 0.19749 2.169 0.0417 *

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.284 on 21 degrees of freedom

Multiple R-squared: 0.9489, Adjusted R-squared: 0.9416

F-statistic: 130 on 3 and 21 DF, p-value: 1.017e-13

Best subset fitted to the validation data set:

$\hat{Y} = -122.8 + 0.3124 X_1 + 1.407 X_3 + 0.4284 X_4$

Entrée [102]: best_model_pro2 = lm(Y ~ X1 + X3 + X4, data = jp)


summary(best_model_pro2)

Call:

lm(formula = Y ~ X1 + X3 + X4, data = jp)

Residuals:

Min 1Q Median 3Q Max

-5.4579 -3.1563 -0.2057 1.8070 6.6083

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -124.20002 9.87406 -12.578 3.04e-11 ***

X1 0.29633 0.04368 6.784 1.04e-06 ***

X3 1.35697 0.15183 8.937 1.33e-08 ***

X4 0.51742 0.13105 3.948 0.000735 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.072 on 21 degrees of freedom

Multiple R-squared: 0.9615, Adjusted R-squared: 0.956

F-statistic: 175 on 3 and 21 DF, p-value: 5.16e-15

Best subset fitted to the model-building data set:

$\hat{Y} = -124.2 + 0.2963 X_1 + 1.357 X_3 + 0.5174 X_4$

The estimates for the validation data set do not appear to be reasonably similar to those obtained for the model-building data set: although the coefficients themselves are close, the estimated standard deviations for the X3 and X4 coefficients are noticeably larger in the validation fit, and its error mean square and coefficient of multiple determination are somewhat worse.
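For a side-by-side view of the two fits (a convenience sketch using the model objects fitted above):

# Coefficients from the model-building fit (jp) next to the validation fit (jps)
cbind(model_building = coef(best_model_pro2), validation = coef(best_model_pro3))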

c. Using the best regression model found in Problem 2 to predict the


mean of job proficiency of the validation data set. Find the mean
square error for the validation data set and compare to the mean
square error of the best model with model-building data set.

Entrée [87]: y_val_predict = -124.2 + 0.2963*jps$X1 + 1.357*jps$X2 + 0.5174*jps$X3


y_val_predict

88.5037
76.9251
49.5003
30.4505
146.903
79.6421
120.3845
90.0475
79.3433
103.0607
98.7746
103.234
90.8567
140.6543
87.1153
112.2843
110.1744
151.4436
104.0741
114.2877
124.7463
128.0374
124.1323
57.6483
112.0692

Entrée [97]: # The mean square error for the validation data set
MSE_val = sum((jps$Y - y_val_predict)^2)/(nrow(jps) - 4)
MSE_val

949.418773435238

Entrée [103]: # The mean square error of the best model with model-building data set
MSE_mb = sum((resid(best_model_pro2))^2)/(nrow(jp) - 4)
MSE_mb

16.5808098885237

The mean square error for the validation data set is very large compared with that of the best model fitted to the model-building data set.
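As an alternative to typing the coefficients by hand, the validation predictions can also be obtained directly from the fitted model object with predict(); a minimal sketch, using the same denominator convention as above:

# Predict the validation responses from the model-building fit and recompute the error mean square
y_val_hat = predict(best_model_pro2, newdata = jps)
sum((jps$Y - y_val_hat)^2) / (nrow(jps) - 4)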
