Notes 11
What Types of Questions can We Answer by Multiple
Linear Regression?
The three treatment groups in the experiment were:
1 Early cooling
2 Late cooling
3 No cooling
At the end of the experiment, the researchers measured the size of the infarcted
(i.e., damaged) area (in grams) in each of the 32 rabbits. But, as you can
imagine, there is great variability in the size of hearts.
Therefore, in order to adjust for differences in heart sizes, the researchers also measured
the size of the region at risk for infarction (in grams) in each of the 32 rabbits.
The researchers’ primary research question was:
1 Does the mean size of the infarcted area differ among the three treatment groups –
no cooling, early cooling, and late cooling – when controlling for the size of the
region at risk for infarction?
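A multiple linear regression model that addresses this question is:
Yi = β0 + β1 xi1 + β2 xi2 + β3 xi3 + εi
where: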
1 Yi is the size of the infarcted area (in grams) of rabbit i
2 xi1 is the size of the region at risk (in grams) of rabbit i
3 xi2 = 1 if early cooling of rabbit i, 0 if not
4 xi3 = 1 if late cooling of rabbit i, 0 if not
and the independent error terms εi follow a normal distribution with mean 0 and equal
variance σ².
Categorical Variable
1 For a binary variable, e.g., gender, we can use an “indicator” and code it as
x = 1 for female and x = 0 for male.
2 So for a variable with two possible categories, we only need one “indicator” x.
Its value is 1 or 0.
3 How about a variable with three possible categories? For example, for degree of
pain (mild, moderate, severe) we need two indicators: x1 = 1 and x2 = 0 for the
mild condition; x1 = 0 and x2 = 1 for the moderate condition; and x1 = 0 and
x2 = 0 for the severe condition.
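The coding rule above can be sketched in a few lines of Python. This is only an illustration: the function name indicator_columns and its reference argument are hypothetical, and the four pain values are made up. The level passed as reference (here "severe") is the category coded as all zeros.

```python
def indicator_columns(values, reference):
    """Code a categorical variable as 0/1 indicators, one per non-reference level."""
    # Keep levels in order of first appearance, skipping the reference level.
    levels = []
    for v in values:
        if v != reference and v not in levels:
            levels.append(v)
    # Each observation gets one 0/1 entry per non-reference level.
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Made-up observations, for illustration only.
pain = ["mild", "moderate", "severe", "mild"]
levels, rows = indicator_columns(pain, reference="severe")
for value, row in zip(pain, rows):
    print(value, row)  # each row is [x1, x2]
```

A variable with k categories therefore needs k - 1 indicators, which is why the three-group rabbit experiment uses two of them.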
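The model can therefore be simplified for each of the three experimental groups:
No cooling (xi2 = 0, xi3 = 0): Yi = β0 + β1 xi1 + εi
Early cooling (xi2 = 1, xi3 = 0): Yi = β0 + β1 xi1 + β2 + εi
Late cooling (xi2 = 0, xi3 = 1): Yi = β0 + β1 xi1 + β3 + εi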
1 Thus, β2 represents the difference in mean size of the infarcted area – controlling for
the size of the region at risk – between “early cooling” and “no cooling” rabbits.
2 β3 represents the difference in mean size of the infarcted area – controlling for the
size of the region at risk – between “late cooling” and “no cooling” rabbits.
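To see why β2 and β3 are differences "controlling for the size of the region at risk", it may help to evaluate the mean response for each group at several values of xi1. The sketch below uses hypothetical coefficients chosen only for illustration (they are not estimates from the rabbit data), and the function name mean_infarct_size is an assumption.

```python
def mean_infarct_size(area_at_risk, group, beta):
    """Mean response beta0 + beta1*x1 + beta2*x2 + beta3*x3 for one treatment group."""
    b0, b1, b2, b3 = beta
    x2 = 1 if group == "early" else 0  # early-cooling indicator
    x3 = 1 if group == "late" else 0   # late-cooling indicator
    return b0 + b1 * area_at_risk + b2 * x2 + b3 * x3


# Hypothetical coefficients, for illustration only (NOT fitted values).
beta = (0.0, 0.5, -0.25, -0.125)

for x1 in (0.5, 1.0, 1.5):
    control = mean_infarct_size(x1, "none", beta)
    early = mean_infarct_size(x1, "early", beta)
    late = mean_infarct_size(x1, "late", beta)
    # The gaps do not depend on x1: early - control is beta2, late - control is beta3.
    print(x1, early - control, late - control)
```

Whatever value of the region at risk is chosen, the printed gaps stay the same, which is exactly what the interpretation of β2 and β3 says.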
Fitting the model to the rabbits’ data, the summary table in R is
[Figure: size of infarcted area (grams) by treatment group (Control, Early, Late)]
2 The plot also suggests that for this sample of 32 rabbits with a given size of area at
risk, 1.0 gram say, the average size of the infarcted area differs for the three
experimental groups.
As always, the researchers aren’t just interested in this sample. They want to be able to
answer their research question for the whole population of rabbits.
Recall that the research question is: Does the mean size of the infarcted area differ
among the three treatment groups – no cooling, early cooling, and late cooling – when
controlling for the size of the region at risk for infarction?
Categorical Predictors
Example: Is a baby’s birth weight related to the mother’s smoking during pregnancy? To
answer this question, a data set was collected on a random sample of n = 32 births:
Note that smoking is a qualitative predictor. It is a “binary variable” with only two values
(yes or no). The other predictor variable (Gest) is quantitative.
1 The scatter plot matrix suggests, not surprisingly, that there is a positive linear relationship
between length of gestation and birth weight.
2 It is hard to see if any kind of (marginal) relationship exists between birth weight and
smoking status, or between length of gestation and smoking status.
[Figure: scatter plot matrix of Weight, Gest, and Smoke]
We can also plot birth weight against length of gestation, using different colors for the
smoking status.
[Figure: birth weight versus length of gestation, colored by smoking status (Non-Smoker, Smoker)]
Recall that the question remains – after taking into account length of gestation, is there a
significant difference in the average birth weights of babies born to smoking and
non-smoking mothers?
A first-order model with one binary predictor and one quantitative predictor (a parallel
model) that helps us answer the question is:
Yi = β0 + β1 xi1 + β2 xi2 + εi
where
1 Yi is the birth weight of baby i (in grams)
2 xi1 is the length of gestation of baby i
3 xi2 is a binary variable coded as 1 if the baby’s mother smoked during pregnancy
and 0 if she did not
and the independent error terms εi follow a normal distribution with mean 0 and equal
variance σ².
1 Notice that in order to include a qualitative variable in a regression model, we have
to “code” the variable, that is, assign a unique number to each of the possible
categories.
3 In doing so, we use the tradition of assigning the value of 1 to those having the
characteristic of interest and 0 to those not having the characteristic.
2 And, the blue line represents the estimated linear relationship between length of gestation
and birth weight for non-smoking mothers, while the red line represents the estimated linear
relationship for smoking mothers.
[Figure: birth weight versus length of gestation, with fitted lines for Non-Smoker and Smoker]
Using lm() in R, we can obtain the estimated regression equation.
2 and the estimated regression equation for smoking mothers (smoking = 1) is:
That is, we obtain two different parallel estimated lines. The difference between the two lines,
−244.5 (grams), represents the difference in the average birth weights for a fixed gestation length
for smoking and non-smoking mothers in the sample.
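The way one fitted model yields two parallel lines can be illustrated with a small least-squares sketch. It is written in plain Python rather than R, and it computes the least-squares estimates from the normal equations (X'X) b = X'y. The helper names solve_linear_system and least_squares are hypothetical, and the data are made up and noise-free (generated from y = 100 + 50 x1 - 20 x2), so they are not the birth-weight data.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] so row operations act on both.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def least_squares(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Made-up, noise-free illustration (NOT the birth-weight data):
# the responses are generated exactly from y = 100 + 50*x1 - 20*x2.
x1 = [34, 36, 38, 40, 42, 35, 37, 39]
x2 = [0, 0, 0, 0, 1, 1, 1, 1]
y = [100 + 50 * a - 20 * b for a, b in zip(x1, x2)]

b0, b1, b2 = least_squares(list(zip(x1, x2)), y)

# Two parallel fitted lines: same slope b1, intercepts b0 and b0 + b2.
print(f"x2 = 0: y = {b0:.1f} + {b1:.1f} x1")
print(f"x2 = 1: y = {b0 + b2:.1f} + {b1:.1f} x1")
```

Setting x2 = 0 or x2 = 1 in the single fitted equation gives the two lines. They share the slope b1, and their vertical gap is b2 at every value of x1, which is the role the value −244.5 plays in the notes.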
1 Now, given that we generally use regression models to answer research questions, we need
to figure out how each of the parameters in our model enlightens us about our research
problem.
2 The fundamental principle is that you can determine the meaning of any regression
coefficient by seeing what effect changing the value of the predictor has on the mean
response.
3 The interpretation of the regression coefficients in a regression model with one (0, 1) binary
indicator variable and one quantitative predictor is:
1. β1 represents the change in the mean response for each additional unit increase in the
quantitative predictor x1 for both groups.
2. β2 represents how much higher (or lower) the mean response function of the second
group is than that of the first group for any value of x1 .
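Both interpretations follow from the mean response function of the model Yi = β0 + β1 xi1 + β2 xi2 + εi. A short derivation in the notation of these notes, writing E(Y | x1, x2) for the mean response:

```latex
\begin{align*}
\mathrm{E}(Y \mid x_1, x_2 = 0) &= \beta_0 + \beta_1 x_1 \\
\mathrm{E}(Y \mid x_1, x_2 = 1) &= (\beta_0 + \beta_2) + \beta_1 x_1 \\
\mathrm{E}(Y \mid x_1, x_2 = 1) - \mathrm{E}(Y \mid x_1, x_2 = 0) &= \beta_2 \\
\mathrm{E}(Y \mid x_1 + 1, x_2) - \mathrm{E}(Y \mid x_1, x_2) &= \beta_1
\end{align*}
```

The third line does not involve x1, so the gap between the groups is the same at every value of the quantitative predictor. The fourth line does not involve x2, so the slope is the same for both groups.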
1 Now, let’s use our model and analysis to answer the following research question: Is a baby’s
birth weight related to smoking during pregnancy, after taking into account length of
gestation? We answer it by testing the null hypothesis H0 : β2 = 0 vs H1 : β2 ≠ 0.
Interpretation: There is sufficient evidence (the p-value is very close to 0) to conclude that there is a
statistically significant difference between the mean birth weight of all babies of smoking mothers and
the mean birth weight of all babies of non-smoking mothers, after taking into account length of
gestation.
1 A 95% confidence interval for β2 tells us the magnitude of the difference.
Or, we can use the confint() function in R; the third row of the table is for β2:
Interpretation: We can be 95% confident that the mean birth weight of babies born to smoking mothers is
between 158.7 and 330.4 grams less than the mean birth weight of babies born to non-smoking mothers,
at any given length of gestation.
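As a rough check on where such an interval comes from, assume it is the usual t-based interval for a model with three coefficients fitted to n = 32 births, which is what confint() reports for an lm() fit. The standard error below is back-calculated from the rounded interval endpoints. It is an approximation, not a value read from the R output.

```latex
\begin{align*}
& b_2 \pm t_{0.975,\, n-3}\, \mathrm{se}(b_2), \qquad n - 3 = 32 - 3 = 29, \qquad t_{0.975,\, 29} \approx 2.045 \\
& \text{half-width} = \frac{330.4 - 158.7}{2} = 85.85, \qquad
\mathrm{se}(b_2) \approx \frac{85.85}{2.045} \approx 41.98
\end{align*}
```

The midpoint of the interval, −(158.7 + 330.4)/2 = −244.55, agrees up to rounding with the estimated difference of −244.5 grams between the two fitted lines.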
1 Similarly, we can answer: How is birth weight related to gestation, after taking into account a
mother’s smoking status, by testing the null hypothesis H0 : β1 = 0 vs H1 : β1 ≠ 0.
Interpretation: There is sufficient evidence (p-value is very close to 0) to conclude that there is a
statistically significant linear relationship between birth weight and gestation, after taking into
account a mother’s smoking status.
1 Similarly, a 95% confidence interval for β1 tells us the magnitude of the effect of gestation.
Or, we can use the confint() function in R; the second row of the table is for β1:
Interpretation: We can be 95% confident that the mean birth weight increases by
between 124.4 and 161.8 grams for each one-unit increase in the length of gestation, for
either smoking status of the mother.