Unit 551: Overall Model Quality, R Squared and the F Test (Without Answers)


Assignment for Unit 551

How to assess a model with more than one independent variable?


Henk van der Kolk

08 November, 2023

Goals of the practice questions and this assignment


When a model has more than one independent variable, we want to assess not only the
effect of each single variable, but also the overall quality of the model. The same holds when
assessing the impact of a nominal variable on a ratio variable, since a linear model
includes a nominal variable as a set of ‘dummies’ (= several variables) that affect the
dependent (scale) variable. We want an overall check of the impact of that variable.
In this unit, you will learn both to describe (using R squared and the adjusted R
squared) and to test (using the F test) the overall quality of a model.

Download the data set ‘unit_551_practice_data.csv’ and import it into R.

The case.
Because of a new pandemic, you want to know whether four different ways of washing
your hands have an impact on the number of bacteria on your hands. I will not bother you
with how we measured the bacteria using sterile plates and counting bacteria colonies. Just
assume we were able to measure this.

Estimating a model
1. Using the data, create a graph displaying the data. In this case, a box-whisker plot
with four groups may be helpful. What do you notice?
2. Check the model by describing and testing the relationship between method and
bacteria. Check the model using summary(), like you learned in previous
assignments. Focus on R squared and on the F-test.

3. What does R squared refer to? What is the difference between R squared and the
adjusted R squared? Could the adjusted one also be HIGHER?

4. If R squared is very high, does this mean that the model is true?
5. Which part of the output says something about the extent to which this model
informs us about the population?
6. What is the outcome of the F-test in this case?
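
The steps in questions 1 to 6 can be sketched as follows. This is a minimal sketch, not the intended solution: it runs on simulated data, because the values in the practice file are not reproduced here. Only the column names `method` and `bacteria` come from the assignment; the group labels "A" to "D", the group means and the sample size are made-up assumptions.

```r
# Simulated stand-in for the practice data: 4 washing methods, 10 hands each.
# All numbers are invented for illustration only.
set.seed(551)
sim <- data.frame(
  method   = factor(rep(c("A", "B", "C", "D"), each = 10)),
  bacteria = rnorm(40, mean = rep(c(90, 100, 110, 70), each = 10), sd = 15)
)

# Question 1: box-whisker plot with one box per method
boxplot(bacteria ~ method, data = sim,
        xlab = "Washing method", ylab = "Bacteria colonies")

# Question 2: linear model with the nominal variable entered as dummies
model <- lm(bacteria ~ method, data = sim)
fit   <- summary(model)
fit                      # full output, as in previous assignments

# The parts of the summary that questions 3 to 6 focus on
fit$r.squared            # R squared
fit$adj.r.squared        # adjusted R squared
fit$fstatistic           # F value, numerator df, denominator df

# p-value of the overall F test, recomputed from the F statistic
f <- fit$fstatistic
p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# Adjusted R squared by hand: R squared corrected for the k predictors
n <- nrow(sim)
k <- length(coef(model)) - 1   # number of dummies (3 for 4 groups)
adj_by_hand <- 1 - (1 - fit$r.squared) * (n - 1) / (n - k - 1)
```

With the real data, replace `sim` by the imported data set and compare `adj_by_hand` with the adjusted R squared printed by `summary()`.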

Introducing ANOVA (again)


Another way to describe the output of a linear model is by using an ANOVA table (ANalysis
Of VAriance). We have another way of presenting the same model, because it sometimes
gives us a better understanding of models with different groups. But, to be explicit, the
underlying ideas are exactly the same as with a linear model. This will be shown in the
remainder of this assignment.
7. Check the book and the BMS R manual and find which commands are used to create
an ANOVA table. HINT: you can simply estimate the linear model like you did before
and ‘pipe’ that model into an anova.

8. Have a look at this ANOVA table. Do you see how the F-value is connected to the
other numbers? And do you notice it is the same as the number we found in the LM
procedure?

9. What are the sum of squares of the method and of the residuals? How can you
determine these sums?
10. What are the ‘degrees of freedom’ based on?
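
The link between the ANOVA table and the linear model in questions 7 to 10 can be explored with a sketch like the one below. It again uses simulated data (the labels "A" to "D" and all values are assumptions, only the column names `method` and `bacteria` come from the assignment), and it calls `anova()` directly instead of piping the model into it.

```r
# Simulated stand-in for the practice data (invented values)
set.seed(551)
sim <- data.frame(
  method   = factor(rep(c("A", "B", "C", "D"), each = 10)),
  bacteria = rnorm(40, mean = rep(c(90, 100, 110, 70), each = 10), sd = 15)
)
model <- lm(bacteria ~ method, data = sim)

# ANOVA table for the same model; with a pipe this would read
# model %>% anova() (magrittr/dplyr) or model |> anova() (R >= 4.1)
tab <- anova(model)
tab

# Sums of squares and degrees of freedom: row 1 = method, row 2 = residuals
ss <- tab[["Sum Sq"]]
df <- tab[["Df"]]

# Mean squares are sums of squares divided by their degrees of freedom
ms <- ss / df

# The F value is the ratio of the two mean squares
f_by_hand <- ms[1] / ms[2]

# It matches both the ANOVA table and the summary() output of lm
f_by_hand
tab[["F value"]][1]
summary(model)$fstatistic["value"]

# Degrees of freedom: number of groups minus 1, and observations minus groups
k <- nlevels(sim$method)
n <- nrow(sim)
c(method = k - 1, residuals = n - k)
```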

Using ANOVA
An ANOVA table (ANalysis Of VAriance) is sometimes used to FIRST check ‘whether there is
something in the model’ (does at least one of the groups differ) and to subsequently check what
these differences exactly are. The second step is called a post hoc contrast analysis. You basically
introduce dummies in the model and check which are different from the reference
category, although you can also check which of the groups are different from the overall
mean.
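
The two kinds of comparison described above can be sketched as follows, under stated assumptions. The data are simulated, the group labels "A" to "D" are hypothetical, and the choice of `relevel()` and sum-to-zero contrasts (`contr.sum`) is one possible way to set up these comparisons in base R, not necessarily the one used in the course.

```r
# Simulated stand-in for the practice data (invented values)
set.seed(551)
sim <- data.frame(
  method   = factor(rep(c("A", "B", "C", "D"), each = 10)),
  bacteria = rnorm(40, mean = rep(c(90, 100, 110, 70), each = 10), sd = 15)
)

# Reference-category view: the first level ("A") is the reference and each
# dummy coefficient is that group's mean minus the mean of "A"
m_ref <- lm(bacteria ~ method, data = sim)
coef(summary(m_ref))

# Choosing another reference category changes which differences are tested
sim$method_d <- relevel(sim$method, ref = "D")
m_ref_d <- lm(bacteria ~ method_d, data = sim)
coef(summary(m_ref_d))

# Deviation view: sum-to-zero contrasts compare groups with the mean of the
# group means (equal to the overall mean when all groups have the same size)
m_dev <- lm(bacteria ~ method, data = sim,
            contrasts = list(method = "contr.sum"))
coef(summary(m_dev))

# The overall fit does not depend on the coding of the dummies
r2 <- c(summary(m_ref)$r.squared,
        summary(m_ref_d)$r.squared,
        summary(m_dev)$r.squared)
r2
```

Only the coefficients change between the three models; R squared and the F test stay the same, which is why the overall check comes first.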

EXTRA: Dissecting R squared.


In the micro lectures you have learned about R squared and the way it was calculated. Let
us see whether we can reconstruct that calculation using the data used in this assignment.
This will deepen your understanding of R squared.

The basic underlying idea is that the overall variance (of y) is the model variance plus
the residual variance. Use the following code in R to check whether this is correct. After
calculating these three things using R we can find the (unadjusted) R squared.
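# the code below uses the pipe and verbs from dplyr and assumes the
# practice data have been imported as `data`
library(dplyr)

# overall variance is simply:
var(data$bacteria)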

# the model variance is the variance of the group means as compared to
# the overall mean, taking into account group sizes. The following code
# adds the group mean to each individual observation
data <- data %>%
  group_by(method) %>%
  mutate(group_mean = mean(bacteria)) %>%
  ungroup()

# we simply take the variance of the group means (taking group sizes
# into account)
var(data$group_mean)

# the residuals are the differences between the observed values and
# the group means (observed minus predicted)
# these too are added to the dataset
data <- data %>%
  mutate(residuals = bacteria - group_mean)

# this is the residual variance
var(data$residuals)

# 1 is the overall variance the same as the model variance plus the
# residual variance?
var(data$residuals) + var(data$group_mean)
var(data$bacteria)

# 2 calculating R squared
r_sq <- 1 - (var(data$residuals)/var(data$bacteria))
r_sq

# is that the same as the r_squared found in the model?
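
The last check can be sketched end to end as follows. This is a self-contained illustration on simulated data (all values and the labels "A" to "D" are assumptions), written in base R so that it runs without dplyr: `ave()` plays the role of `group_by()` plus `mutate()`.

```r
# Simulated stand-in for the practice data (invented values)
set.seed(551)
sim <- data.frame(
  method   = factor(rep(c("A", "B", "C", "D"), each = 10)),
  bacteria = rnorm(40, mean = rep(c(90, 100, 110, 70), each = 10), sd = 15)
)

# Group mean for every observation, and the residual around that mean
sim$group_mean <- ave(sim$bacteria, sim$method)
sim$residuals  <- sim$bacteria - sim$group_mean

total_var <- var(sim$bacteria)     # overall variance
model_var <- var(sim$group_mean)   # model variance
resid_var <- var(sim$residuals)    # residual variance

# 1. overall variance equals model variance plus residual variance
all.equal(total_var, model_var + resid_var)

# 2. R squared by hand and R squared reported by the linear model
r_sq       <- 1 - resid_var / total_var
r_sq_model <- summary(lm(bacteria ~ method, data = sim))$r.squared
all.equal(r_sq, r_sq_model)
```

`all.equal()` is used instead of `==` because the two numbers can differ by tiny floating-point rounding errors.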

<< END OF THE ASSIGNMENT>>
