ST211 From Initial To Final


From the data, we can see that there are only four numerical predictors: k3en, k3ma, k3sc, and IDACI_n. Scatter plots of these continuous variables against ks4score show a strong positive linear correlation between each of k3en, k3ma, k3sc and ks4score, but no clear relationship between IDACI_n and ks4score. To analyse the categorical predictors we created box plots, but most of these predictors have numerous levels with similar means in each box, so it is difficult to tell which of them are related to ks4score; we therefore kept all categorical predictors in the initial linear model. The box plots of ks4score by gender and attitude do show clear patterns: female students scored higher on average than male students, and as the attitude score decreases, the corresponding ks4score also tends to decrease.
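Although the analysis was run in R, the correlation check behind these scatter plots can be sketched in Python. The data below are toy values for illustration only; the real analysis used the full dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy illustration: k3en tracks ks4score closely, IDACI_n does not.
k3en     = [3, 4, 5, 6, 7, 8]
idaci_n  = [0.3, 0.1, 0.4, 0.2, 0.35, 0.15]
ks4score = [210, 260, 330, 370, 430, 480]

print(round(pearson_r(k3en, ks4score), 3))     # close to 1: strong linear relation
print(round(pearson_r(idaci_n, ks4score), 3))  # near 0: no clear relation
```

A value near 1 matches what the scatter plots show for the key-stage-3 scores, while a value near 0 matches the lack of pattern for IDACI_n.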
We put some key plots in Graph 1 to justify these decisions; they are also used later in the merging step.
Graph 1: scatter plots of ks4score against k3en, k3ma, k3sc and IDACI_n, and box plots of ks4score by attitude, hiquamum, FSMband and homework. [plots omitted]
Among the categorical predictors, however, there are two called fiveac and fiveem, which mean '5 or more GCSE grades A*-C', including and excluding maths and English respectively. Since the response ks4score is the age-16 total point score, fiveac and fiveem are partly determined by ks4score: if an observation is Yes for fiveac or fiveem, then its ks4score will tend to be higher. The box plots confirm this, as the mean for 'Yes' is much higher than for 'No' on both fiveac and fiveem. Because we cannot use an outcome of ks4score to estimate ks4score itself, we deleted fiveac and fiveem at the beginning.

After analysing the scatter plots and box plots, we fitted a linear regression model. Our initial model included all the variables given except fiveac and fiveem, which were excluded as explained above. To decide which variables to delete, we examined the p-values: since over half of the variables were significant at the 1% level, we set that level as our cut-off. We therefore deleted the variables for which every level had a p-value above 1% (SECshort, fsm, computer, tuition, parasp, absent, and IDACI_n), leaving 14 variables: k3en, k3ma, k3sc, gender, hiquamum, pupasp, homework, attitude, sen, truancy, exclude, FSMband, singlepar, and house. We then refitted the regression with these 14 variables and deleted sen based on its significance level.
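The selection rule above (keep a predictor as long as at least one of its levels is significant at the 1% level) can be sketched as follows. The p-values here are placeholders for illustration; in practice they come from the fitted regression output.

```python
def select_predictors(pvalues, alpha=0.01):
    """pvalues: dict mapping predictor name -> list of per-level p-values.

    A predictor is kept if any of its levels is significant at `alpha`,
    and dropped only if every level fails the cut-off.
    """
    kept, dropped = [], []
    for name, ps in pvalues.items():
        (kept if min(ps) < alpha else dropped).append(name)
    return kept, dropped

# Placeholder p-values for illustration only.
pvals = {
    "k3en":     [1e-16],
    "IDACI_n":  [0.34],
    "attitude": [0.0004, 0.05, 0.21],  # one significant level -> keep whole predictor
    "computer": [0.12, 0.45],
}
kept, dropped = select_predictors(pvals)
print(kept)     # ['k3en', 'attitude']
print(dropped)  # ['IDACI_n', 'computer']
```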

All predictors in this linear regression model were significant if predictors with multiple levels are treated as single variables. Next we needed to handle the missing values and merge levels of the categorical predictors. We chose to delete all observations with missing values, since we did not know what the missing values mean. For example, around a thousand observations are missing for the predictor truancy: some students may have been truant but chose not to tell us, so combining the missing values with another level could introduce bias, and we cannot merge them into the other levels. In practice, we inspected the data with summary, converted the 'NA' values in FSMband into an explicit 'missing' level, and found that hiquamum, homework, attitude, truancy, exclude, FSMband, and singlepar have missing values. We then created a new data frame without missing values. After this step, we refitted the regression model and deleted the predictor house based on its significance level.
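The row-deletion step (na.omit in R) can be sketched in Python on toy observations; the real analysis was done on the full data frame with many more columns.

```python
def drop_incomplete(rows):
    """Keep only observations with no missing value (None in any field)."""
    return [r for r in rows if all(v is not None for v in r.values())]

# Toy observations for illustration only.
rows = [
    {"ks4score": 340, "truancy": "No",  "FSMband": "<5pr"},
    {"ks4score": 280, "truancy": None,  "FSMband": "5pr-9pr"},   # missing truancy
    {"ks4score": 410, "truancy": "Yes", "FSMband": None},        # missing FSMband
]
complete = drop_incomplete(rows)
print(len(complete))  # 1 complete observation remains
```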

For the merging step, we used plots and the regression output to evaluate the levels and found:
1. k3en, k3ma, and k3sc show similar effects, each with a high significance level.
2. homework has 6 levels, some of which showed similar results.
3. attitude has 4 levels, some of which showed similar results.
4. hiquamum has 6 levels, some of which showed similar results.
5. FSMband has 6 levels, some of which showed similar results.

Analysing the plots in Graph 1 together with the regression output, it is clear that some predictors have different significance levels across their levels; for example, for attitude, very_low is significant below the 0.1% level while low is only significant at the 5% level. Combining this observation with our key reasons to merge, listed below, we arrived at the merged variables in Table 1.

1. For the predictors k3en, k3ma, and k3sc, following the analysis above, we merged them into one stronger predictor, sum_k3score (their sum).
2. For the predictor homework, students who do homework 4 or 5 evenings a week can be regarded as always doing homework, so these two levels were merged into one level, homework45, which has a strong positive association with ks4score. The other four levels became a new level called homework1n23.
3. For the predictor attitude, with low as the baseline, high and very_high clearly had positive coefficients while very_low had a negative one, so we created two new levels, attitudelow and attitudehigh.
4. For the predictor hiquamum, we combined No_qualification and Other_qualifications into one level named mumwithoutqual, while the remaining four categories were grouped as mumwithqual.
5. For the predictor FSMband, we regrouped the levels in the same way as for attitude, creating FSMbandlow and FSMbandhigh.
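The merges for the two recoded predictors with the most levels, plus the sum_k3score construction, can be sketched in Python; the level names follow the text, and a single toy observation stands in for the full data.

```python
# Recode maps mirror the merges described above.
HOMEWORK_MAP = {
    "none": "homework1n23", "1_evening": "homework1n23",
    "2_evenings": "homework1n23", "3_evenings": "homework1n23",
    "4_evenings": "homework45", "5_evenings": "homework45",
}
ATTITUDE_MAP = {
    "very_low": "attitudelow", "low": "attitudelow",
    "high": "attitudehigh", "very_high": "attitudehigh",
}

def merge_predictors(row):
    """Return a copy of one observation with the merged predictors added."""
    out = dict(row)
    out["sum_k3score"] = row["k3en"] + row["k3ma"] + row["k3sc"]
    out["homework_m"] = HOMEWORK_MAP[row["homework"]]
    out["attitude_m"] = ATTITUDE_MAP[row["attitude"]]
    return out

obs = {"k3en": 5, "k3ma": 6, "k3sc": 5, "homework": "4_evenings", "attitude": "low"}
merged = merge_predictors(obs)
print(merged["sum_k3score"], merged["homework_m"], merged["attitude_m"])
# 16 homework45 attitudelow
```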
Table 1: merged variables

  k3en, k3ma, k3sc                                  -> sum_k3score
  homework: none, 1_evening, 2_evenings, 3_evenings -> homework1n23
  homework: 4_evenings, 5_evenings                  -> homework45
  attitude: very_low, low                           -> attitudelow
  attitude: high, very_high                         -> attitudehigh
  hiquamum: Degree_or_equivalent, GCE_A_Level_or_equivalent,
            GCSE_grades_A-C_or_equiv, HE_below_degree_level -> mumwithqual
  hiquamum: No_qualification, Other_qualifications  -> mumwithoutqual
  FSMband: <5pr, 5pr-9pr, 9pr-13pr                  -> FSMbandlow
  FSMband: 13pr-21pr, 21pr-35pr, 35pr+              -> FSMbandhigh

After performing these procedures, we reran the linear model and observed that the significance of the predictor hiquamum had decreased, so we eliminated it from the model. As hiquamum was no longer a predictor, we added the observations that had only been removed for missing hiquamum back into the data and fitted the model again. This gave us the final model.

Here is our final model:

ks4score = -107.1771 + 31.2296 * sum_k3score - 14.9499 * genderMale
           + 24.8807 * pupaspYes + 10.7055 * homeworkin45 - 16.0126 * attitudelow
           - 23.7605 * truancyYes - 50.4809 * excludeYes + 11.4987 * FSMbandhigh
           - 18.6255 * singleparyes + e
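The fitted equation can be turned into a small prediction function; the coefficients below are copied from the final model, dummy variables take the value 0 or 1, and the example student is hypothetical.

```python
# Coefficients from the final model above.
COEF = {
    "(Intercept)": -107.1771,
    "sum_k3score":   31.2296,
    "genderMale":   -14.9499,
    "pupaspYes":     24.8807,
    "homeworkin45":  10.7055,
    "attitudelow":  -16.0126,
    "truancyYes":   -23.7605,
    "excludeYes":   -50.4809,
    "FSMbandhigh":   11.4987,
    "singleparyes": -18.6255,
}

def predict_ks4score(x):
    """Fitted value for one student; x maps predictor name -> value (dummies 0/1)."""
    return COEF["(Intercept)"] + sum(
        COEF[name] * x.get(name, 0) for name in COEF if name != "(Intercept)"
    )

# A female student with sum_k3score 30 and every dummy at its baseline level:
print(round(predict_ks4score({"sum_k3score": 30}), 4))  # 829.7109
```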
Table 2: Final Model

                Estimate  Std. Error  t value   Pr(>|t|)
(Intercept)    -107.1771      5.7726  -18.566   < 2e-16 ***
sum_k3score      31.2296      0.3214   97.177   < 2e-16 ***
genderMale      -14.9499      1.8234   -8.199  2.73e-16 ***
pupaspYes        24.8807      2.6684    9.324   < 2e-16 ***
homeworkin45     10.7055      1.9783    5.411  6.40e-08 ***
attitudelow     -16.0126      1.8675   -8.574   < 2e-16 ***
truancyYes      -23.7605      2.9103   -8.164  3.63e-16 ***
FSMbandhigh      11.4987      1.9155    6.003  2.01e-09 ***
excludeYes      -50.4809      3.7025  -13.634   < 2e-16 ***
singleparyes    -18.6255      2.2110   -8.424   < 2e-16 ***

The goodness-of-fit statistics for the final model are shown in Table 2 above. Its coefficients (sum_k3score, gender, pupasp, homework, attitude, truancy, exclude, FSMband, and singlepar) are all significant below the 0.1% level, which is very strong. The R2 and adjusted R2 are both about 0.598, i.e. around 0.6, which is satisfactory. The F-statistic and its p-value also show that the model is a good fit, and the residual standard error of 89.02 on 9847 degrees of freedom further supports the model.
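The fit statistics quoted above can be reproduced from residuals; the formulas below are standard, and the fitted values are toy numbers for illustration, not the real model output.

```python
from math import sqrt

def fit_stats(y, y_hat, n_params):
    """R^2, adjusted R^2 and residual standard error for a fitted model.

    n_params counts all estimated coefficients, including the intercept.
    """
    n = len(y)
    y_bar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    df = n - n_params  # residual degrees of freedom
    adj_r2 = 1 - (ss_res / df) / (ss_tot / (n - 1))
    rse = sqrt(ss_res / df)
    return r2, adj_r2, rse

# Toy response and fitted values for illustration only.
y     = [200.0, 300.0, 400.0, 500.0]
y_hat = [210.0, 290.0, 410.0, 490.0]
r2, adj_r2, rse = fit_stats(y, y_hat, n_params=2)
print(round(r2, 3), round(adj_r2, 3))  # 0.992 0.988
```

With the real data, the same formulas give the R2 of about 0.598 and the residual standard error of 89.02 on 9847 degrees of freedom reported above.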
