Unit 551 Overall Model Quality R Squared and The F Test Without Answers
08 November, 2023
The case.
Because of a new pandemic, you want to know whether four different ways of washing
your hands affect the number of bacteria on your hands. I will not bother you
with how we measured the bacteria using sterile plates and counting bacterial colonies. Just
assume we were able to measure this.
Estimating a model
1. Using the data, create a graph displaying the data. In this case, a box-whisker plot
with four groups may be helpful. What do you notice?
**.
2. Check the model by describing and testing the relationship between method and
bacteria. Check the model using summary(), like you learned in previous
assignments. Focus on R squared and on the F-test.
3. What does R squared refer to? What is the difference between R squared and the
adjusted R squared? Could the adjusted one also be HIGHER?
**.
4. If R squared is very high, does this mean that the model is true?
**.
5. Which part of the output says something about the extent to which this model
informs us about the population?
**.
6. What is the outcome of the F-test in this case?
**.
8. Have a look at this ANOVA table. Do you see how the F-value is connected to the
other numbers? And do you notice it is the same as the number we found with the lm()
procedure?
**
9. What are the sum of squares of the method and of the residuals? How can you
determine these sums?
**.
10. What are the ‘degrees of freedom’ based on?
**.
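The steps in the questions above can be tried out on made-up data first. The sketch below is only an illustration: the data frame `sim`, the method names, and the group means are hypothetical and are not the course data. It draws the box-whisker plot, fits the model with lm(), pulls R squared and the F-test out of summary(), and rebuilds the F-value from the ANOVA table.

```r
# Hypothetical data: 4 washing methods, 10 plates each (NOT the course data)
set.seed(551)
sim <- data.frame(
  method   = factor(rep(c("water", "soap", "antibacterial_soap", "alcohol_spray"),
                        each = 10)),
  bacteria = round(rnorm(40, mean = rep(c(120, 105, 90, 40), each = 10), sd = 20))
)

# Box-whisker plot with one box per method
boxplot(bacteria ~ method, data = sim, ylab = "Bacteria colonies")

# Linear model with method as a categorical predictor
fit <- lm(bacteria ~ method, data = sim)
fit_summary <- summary(fit)

fit_summary$r.squared      # R squared
fit_summary$adj.r.squared  # adjusted R squared
fit_summary$fstatistic     # F-value with its two degrees of freedom

# ANOVA table for the same model
anova_table <- anova(fit)
anova_table

# F-value rebuilt from the table:
# mean square of method divided by mean square of the residuals
f_by_hand <- anova_table$`Mean Sq`[1] / anova_table$`Mean Sq`[2]
f_by_hand
```

With your own data, replace `sim` by your dataset and check that `f_by_hand` matches the F-value reported at the bottom of summary().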
Using ANOVA
An ANOVA table (ANalysis Of VAriance) is sometimes used to FIRST check ‘whether there is
something in the model’ (does at least one of the groups differ?) and to subsequently check what
these differences exactly are. This second step is called a post hoc contrast analysis. You basically
introduce dummies in the model and check which groups are different from the reference
category, although you can also check which of the groups are different from the overall
mean.
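The two-step idea can be sketched in R as follows. This is a minimal illustration on simulated data: `sim`, the group labels A to D, and the group means are made up. TukeyHSD() is used here as one common way to do pairwise post hoc comparisons, which is an assumption about what you might want, not necessarily the method used in this course.

```r
# Hypothetical data: 4 groups of 10 observations (NOT the course data)
set.seed(551)
sim <- data.frame(
  method   = factor(rep(c("A", "B", "C", "D"), each = 10)),
  bacteria = rnorm(40, mean = rep(c(120, 105, 90, 40), each = 10), sd = 20)
)

# Step 1: overall F-test, "is there something in the model?"
aov_fit <- aov(bacteria ~ method, data = sim)
summary(aov_fit)

# Step 2a: dummies against the reference category (the first level, "A").
# Each coefficient is the difference between a group mean and the mean of A.
lm_fit <- lm(bacteria ~ method, data = sim)
coef(lm_fit)
group_means <- tapply(sim$bacteria, sim$method, mean)
group_means - group_means["A"]

# Step 2b: all pairwise comparisons, corrected for multiple testing
TukeyHSD(aov_fit)

# Alternative coding: sum contrasts compare groups with the mean of the
# group means (equal to the overall mean here because group sizes are equal)
lm_sum <- lm(bacteria ~ method, data = sim,
             contrasts = list(method = "contr.sum"))
coef(lm_sum)
```

The reference category is simply the first level of the factor; relevel() changes it if another group is the more natural baseline.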
The basic underlying idea is that the overall variance (of y) is the model variance plus
the residual variance. Use the following code in R to check whether this is correct. After
calculating these three things using R we can find the (unadjusted) R squared.
library(dplyr)  # provides %>% and mutate()

# this code assumes data$group_mean already holds, for every observation,
# the mean of the group that observation belongs to

# overall variance is simply:
var(data$bacteria)
# we simply take the variance of the group means (taking group sizes into account)
var(data$group_mean)
# the residuals are the differences between the observed values and the group means
# these too are added to the dataset
data <- data %>%
  mutate(residuals = bacteria - group_mean)
# 1 is the overall variance the same as the model variance plus the residual variance?
var(data$residuals) + var(data$group_mean)
var(data$bacteria)
# 2 calculating R squared
r_sq <- 1 - (var(data$residuals) / var(data$bacteria))
r_sq
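The same three variances also lead to the F-value and the adjusted R squared. The sketch below uses simulated data again (`sim` and its values are hypothetical) and base R only. It relies on the fact that var() divides by n - 1, so multiplying a variance by n - 1 gives the corresponding sum of squares.

```r
# Hypothetical data: 4 groups of 10 observations (NOT the course data)
set.seed(551)
sim <- data.frame(
  method   = factor(rep(c("A", "B", "C", "D"), each = 10)),
  bacteria = rnorm(40, mean = rep(c(120, 105, 90, 40), each = 10), sd = 20)
)
sim$group_mean <- ave(sim$bacteria, sim$method)  # group mean per observation
sim$residuals  <- sim$bacteria - sim$group_mean

n <- nrow(sim)            # number of observations
k <- nlevels(sim$method)  # number of groups

# var() divides by n - 1, so multiplying back gives the sums of squares
ss_total <- var(sim$bacteria)   * (n - 1)
ss_model <- var(sim$group_mean) * (n - 1)
ss_resid <- var(sim$residuals)  * (n - 1)

r_sq     <- ss_model / ss_total
# F: model sum of squares per model df, over residual sum of squares per residual df
f_value  <- (ss_model / (k - 1)) / (ss_resid / (n - k))
# adjusted R squared penalises for the k - 1 estimated group differences
adj_r_sq <- 1 - (1 - r_sq) * (n - 1) / (n - k)

# compare with what lm() reports
fit_summary <- summary(lm(bacteria ~ method, data = sim))
c(by_hand = r_sq,     lm = fit_summary$r.squared)
c(by_hand = adj_r_sq, lm = fit_summary$adj.r.squared)
c(by_hand = f_value,  lm = unname(fit_summary$fstatistic["value"]))
```

Each pair of numbers should be identical, which shows that the F-value and both versions of R squared are just different ways of summarising the same split of the overall variance.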