Notes 11


Statistical Methods - 2

What Types of Questions can We Answer by Multiple
Linear Regression?

An Example: Heart attacks in rabbits.


1 When heart muscle is deprived of oxygen, the tissue dies, leading to a heart attack.

2 Some researchers hypothesized that cooling the heart would be effective in reducing the size of the heart attack even if the cooling takes place after the blood flow becomes restricted.

3 To investigate their hypothesis, the researchers conducted an experiment on 32 completely sedated rabbits that were subjected to a heart attack.
The researchers established three experimental groups:
1 Early cooling

2 Late cooling

3 No cooling
At the end of the experiment, the researchers measured the size of the infarcted
(i.e., damaged) area (in grams) in each of the 32 rabbits. But, as you can
imagine, there is great variability in the size of hearts.
Therefore, in order to adjust for differences in heart sizes, the researchers also measured
the size of the region at risk for infarction (in grams) in each of the 32 rabbits.
The researchers’ primary research question was:

1 Does the mean size of the infarcted area differ among the three treatment groups –
no cooling, early cooling, and late cooling – when controlling for the size of the
region at risk for infarction?

2 If we translate this question into a model, then it is:

Yi = β0 + β1 xi1 + β2 xi2 + β3 xi3 + εi

where:
1 Yi is the size of the infarcted area (in grams) of rabbit i
2 xi1 is the size of the region at risk (in grams) of rabbit i
3 xi2 = 1 if early cooling of rabbit i, 0 if not
4 xi3 = 1 if late cooling of rabbit i, 0 if not
and the independent error terms εi follow a normal distribution with mean 0 and equal
variance σ².
Categorical Variable

1 The predictors x2 and x3 are “indicator variables” that translate the categorical information on the experimental group to which a rabbit belongs into a usable form.

2 For “early cooling” rabbits x2 = 1 and x3 = 0

3 For “late cooling” rabbits x2 = 0 and x3 = 1

4 For “no cooling” rabbits x2 = 0 and x3 = 0
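The coding above can be sketched in R. This is a minimal illustration, assuming made-up group labels for six rabbits (the names `group`, `"none"`, `"early"`, and `"late"` are hypothetical, not taken from the study data). It builds the indicators by hand and compares them with the columns that `model.matrix()` creates from a factor whose baseline level is "no cooling".

```r
# Hypothetical group labels for six rabbits (NOT the study data).
group <- factor(c("none", "early", "late", "early", "none", "late"),
                levels = c("none", "early", "late"))  # "none" is the baseline

# model.matrix() expands the factor into an intercept column plus
# one 0/1 indicator column for each non-baseline level.
X <- model.matrix(~ group)
print(X)

# The same indicators built by hand, matching x2 and x3 in the notes.
x2 <- as.integer(group == "early")
x3 <- as.integer(group == "late")
print(cbind(x2, x3))
```

The baseline group is the one whose indicators are all 0, which is why no third indicator is needed for "no cooling".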


Simple Examples about Categorical Variable

1 For a binary variable, e.g., gender, we can use an “indicator” and code it as
x = 1 for female and x = 0 for male.

2 So for a variable with two possible categories, we only need one “indicator” x.
Its value is 1 or 0.

3 How about a variable with three possible categories? For example, for degree of
pain (mild, moderate, severe) we need two indicators: for the mild
condition, x1 = 1 and x2 = 0; for the moderate condition, x1 = 0 and x2 = 1;
for the severe condition, x1 = 0 and x2 = 0.
The model can therefore be simplified for each of the three experimental groups:

1 For “early cooling” rabbits

Yi = β0 + β1 xi1 + β2 + εi

2 For “late cooling” rabbits


Yi = β0 + β1 xi1 + β3 + εi

3 For “no cooling” rabbits


Yi = β0 + β1 xi1 + εi

1 Thus, β2 represents the difference in mean size of the infarcted area – controlling for
the size of the region at risk – between “early cooling” and “no cooling” rabbits.

2 β3 represents the difference in mean size of the infarcted area – controlling for the
size of the region at risk – between “late cooling” and “no cooling” rabbits.
Fitting the model to the rabbits’ data in R gives the estimated regression equation:

Infarcted = −0.135 + 0.613 Area − 0.244x2 − 0.066x3
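To see what the estimated coefficients imply, the fitted equation can be evaluated for each group at one size of area at risk. The sketch below copies the four estimates from the equation above; the function name `predict_infarct` and the choice of 1.0 gram are illustrative assumptions.

```r
# Estimated coefficients copied from the regression equation in the notes.
b0 <- -0.135; b1 <- 0.613; b2 <- -0.244; b3 <- -0.066

# Estimated mean infarct size (grams) for a given area at risk and group.
# predict_infarct is a hypothetical helper written for this illustration.
predict_infarct <- function(area, group = c("none", "early", "late")) {
  group <- match.arg(group)
  x2 <- as.numeric(group == "early")  # early-cooling indicator
  x3 <- as.numeric(group == "late")   # late-cooling indicator
  b0 + b1 * area + b2 * x2 + b3 * x3
}

area <- 1.0  # illustrative size of the area at risk, in grams
preds <- sapply(c("none", "early", "late"),
                function(g) predict_infarct(area, g))
print(round(preds, 3))
```

At 1.0 gram this gives about 0.478, 0.234, and 0.412 grams for the no-cooling, early-cooling, and late-cooling groups. The gaps from the no-cooling value are exactly the estimates of β2 and β3.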


A plot of the data adorned with the estimated regression equation looks like:

[Figure: Infarcted Area vs Area of Risk. Size of Infarcted Area (grams) plotted against Size of Area at Risk (grams) for the Control, Early, and Late groups.]


1 The plot suggests that, as we’d expect, as the size of the area at risk increases, the
size of the infarcted area also tends to increase.

2 The plot also suggests that for this sample of 32 rabbits with a given size of area at
risk, 1.0 gram say, the average size of the infarcted area differs for the three
experimental groups.

As always, the researchers aren’t just interested in this sample. They want to be able to
answer their research question for the whole population of rabbits.
Recall that the research question is: Does the mean size of the infarcted area differ
among the three treatment groups – no cooling, early cooling, and late cooling – when
controlling for the size of the region at risk for infarction?
Categorical Predictors

Example: Is a baby’s birth weight related to the mother’s smoking during pregnancy? To
answer this question, a data set was collected on a random sample of n = 32 births:

1 Potential predictor (x1): length of gestation, i.e., pregnancy, (Gest) in weeks

2 Potential predictor (x2): smoking status of mother (yes or no)

3 Response (Y ): birth weight (Weight) in grams of baby

Note that smoking is a qualitative predictor. It is a “binary variable” with only two values
(yes or no). The other predictor variable (Gest) is quantitative.
1 The scatter plot matrix suggests, not surprisingly, that there is a positive linear relationship
between length of gestation and birth weight.

2 It is hard to see if any kind of (marginal) relationship exists between birth weight and
smoking status, or between length of gestation and smoking status.

[Figure: Scatter plot matrix of Weight, Gest, and Smoke.]
We can also plot birth weight against length of gestation, using different colors for the
smoking status.

[Figure: Birth Weight vs Length of Gestation, with points distinguished by smoking status (Non-Smoker, Smoker).]
Recall that the question remains – after taking into account length of gestation, is there a
significant difference in the average birth weights of babies born to smoking and
non-smoking mothers?
A first-order model with one binary predictor and one quantitative predictor (a parallel
model) that helps us answer the question is:

Yi = β0 + β1 xi1 + β2 xi2 + εi

where

1 Yi is the birth weight of baby i

2 xi1 is length of gestation of baby i

3 xi2 is a binary variable coded as 1 if the baby’s mother smoked during pregnancy
and 0 if she did not

and the independent error terms εi follow a normal distribution with mean 0 and equal
variance σ².
1 Notice that in order to include a qualitative variable in a regression model, we have
to “code” the variable, that is, assign a unique number to each of the possible
categories.

2 A common coding scheme is to use what’s called a “zero-one-indicator variable.” Using such a variable here, we code the binary predictor smoking as:
1 xi2 = 1, if mother i smokes

2 xi2 = 0, if mother i does not smoke

3 In doing so, we use the tradition of assigning the value of 1 to those having the
characteristic of interest and 0 to those not having the characteristic.

4 Incidentally, other terms sometimes used instead of “zero-one-indicator variable” are “dummy variable” or “binary variable”.
1 The blue circles represent the data on non-smoking mothers (x2 = 0), while the red circles
represent the data on smoking mothers (x2 = 1).

2 And, the blue line represents the estimated linear relationship between length of gestation
and birth weight for non-smoking mothers, while the red line represents the estimated linear
relationship for smoking mothers.

[Figure: Birth Weight vs Length of Gestation, showing the data points and the estimated parallel lines for the Non-Smoker and Smoker groups.]
Using lm() in R, we can get the estimated regression equation:

Weight = −2389.6 + 143.10 Gest − 244.5 Smoke


1 Therefore, as illustrated in the plot above, the estimated regression equation for
non-smoking mothers (smoking = 0) is:

Weight = −2389.6 + 143.10 Gest

2 and the estimated regression equation for smoking mothers (smoking = 1) is:

Weight = −2634.1 + 143.10 Gest

That is, we obtain two different parallel estimated lines. The difference between the two lines,
−244.5 (grams), represents the difference in the average birth weights for a fixed gestation length
for smoking and non-smoking mothers in the sample.
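The arithmetic behind the two parallel lines can be checked with a short R sketch. It copies the three estimates from the fitted equation above; the helper name `fitted_weight` and the three gestation lengths are illustrative assumptions.

```r
# Estimated coefficients copied from the fitted equation in the notes.
b0 <- -2389.6; b_gest <- 143.10; b_smoke <- -244.5

# Estimated mean birth weight (grams) at a gestation length and smoking status.
# fitted_weight is a hypothetical helper written for this illustration.
fitted_weight <- function(gest, smoke) b0 + b_gest * gest + b_smoke * smoke

gest <- c(34, 38, 42)  # illustrative gestation lengths, in weeks
nonsmoker <- fitted_weight(gest, smoke = 0)
smoker    <- fitted_weight(gest, smoke = 1)

# Intercept of the smokers' line is b0 + b_smoke.
print(b0 + b_smoke)

# The vertical gap between the lines is the same at every gestation length.
print(smoker - nonsmoker)
```

The smokers' intercept comes out as -2634.1, and the gap is -244.5 grams at each gestation length, which is what "parallel" means here.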
1 Now, given that we generally use regression models to answer research questions, we need
to figure out how each of the parameters in our model enlightens us about our research
problem.

2 The fundamental principle is that you can determine the meaning of any regression
coefficient by seeing what effect changing the value of the predictor has on the mean
response.

3 The interpretation of the regression coefficients in a regression model with one (0, 1) binary
indicator variable and one quantitative predictor is:
1. β1 represents the change in the mean response for each additional unit increase in the
quantitative predictor x1 for both groups.

2. β2 represents how much higher (or lower) the mean response function of the second
group is than that of the first group for any value of x1 .
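Both interpretations follow directly from the mean response function of the model. A short derivation, written in LaTeX with the same symbols as the model above:

```latex
% Mean response under the model Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i
\begin{align*}
E(Y \mid x_1, x_2) &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 \\[4pt]
% One-unit increase in x_1 with the group (x_2) held fixed:
E(Y \mid x_1 + 1, x_2) - E(Y \mid x_1, x_2)
  &= \beta_1 (x_1 + 1) - \beta_1 x_1 = \beta_1 \\[4pt]
% Second group (x_2 = 1) versus first group (x_2 = 0) at the same x_1:
E(Y \mid x_1, x_2 = 1) - E(Y \mid x_1, x_2 = 0)
  &= (\beta_0 + \beta_1 x_1 + \beta_2) - (\beta_0 + \beta_1 x_1) = \beta_2
\end{align*}
```

Neither difference depends on the value of x1, which is why the two mean response lines are parallel.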
1 Now, let’s use our model and analysis to answer the following research question: Is a baby’s
birth weight related to smoking during pregnancy, after taking into account length of
gestation?

2 We can answer our research question by testing the null hypothesis H0 : β2 = 0 vs H1 : β2 ≠ 0.

We can again use lm() in R and get a summary table:

Interpretation: There is sufficient evidence (p-value is very close to 0) to conclude that there is a
statistically significant difference between the mean birth weight of all babies of smoking mothers
and the mean birth weight of all babies of non-smoking mothers, after taking into account length of
gestation.
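The workflow for this test can be sketched in R as follows. The data here are simulated stand-ins, not the birth-weight data set from the notes: the data frame name `births`, the random seed, and the coefficients used to generate the responses are all assumptions made so the example runs on its own, and its output will not match the results quoted in these notes.

```r
# Simulated stand-in data: NOT the birth-weight data set from the notes.
set.seed(1)
n <- 32
births <- data.frame(
  Gest  = sample(34:42, n, replace = TRUE),  # gestation length in weeks
  Smoke = rep(c(0, 1), each = n / 2)         # 1 = smoker, 0 = non-smoker
)
# Arbitrary generating values, used only to create fake responses.
births$Weight <- -2400 + 140 * births$Gest - 250 * births$Smoke +
  rnorm(n, sd = 100)

# First-order (parallel lines) model with one quantitative and one binary predictor.
fit <- lm(Weight ~ Gest + Smoke, data = births)

# Each row gives the estimate, standard error, t statistic, and two-sided
# p-value for H0: coefficient = 0. The "Smoke" row tests H0: beta2 = 0.
print(coef(summary(fit)))

# Residual degrees of freedom are n - p = 32 - 3 = 29.
print(df.residual(fit))
```

With the real data, the "Smoke" row of this table is where the p-value for H0 : β2 = 0 is read.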
1 A 95% confidence interval for β2 tells us the magnitude of the difference.

2 A 95% t-multiplier with n − p = 32 − 3 = 29 degrees of freedom is t(0.025,29) = 2.0452.


Therefore, a 95% confidence interval for β2 is:

−244.54 ± 2.0452 (41.98) = (−330.4, −158.7).
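The interval can be reproduced from the reported estimate and standard error. This is a minimal sketch using base R's `qt()` for the t-multiplier; the variable names are illustrative.

```r
# Values reported in the notes for the smoking coefficient.
estimate <- -244.54
std_err  <- 41.98
df       <- 32 - 3   # n - p degrees of freedom

# Upper 2.5% point of the t distribution with 29 degrees of freedom.
t_mult <- qt(0.975, df)

# Estimate plus or minus multiplier times standard error.
ci <- estimate + c(-1, 1) * t_mult * std_err
print(round(t_mult, 4))
print(round(ci, 1))
```

The multiplier rounds to 2.0452 and the interval to (-330.4, -158.7), matching the hand calculation.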

Or, we can use the confint() function in R; the third row of its output is for β2.

Interpretation: We can be 95% confident that the mean birth weight of babies of smoking mothers is
between 158.7 and 330.4 grams less than the mean birth weight of babies of non-smoking mothers,
for any fixed length of gestation.
1 Similarly, we can answer the question of how birth weight is related to gestation, after taking into
account a mother’s smoking status, by testing the null hypothesis H0 : β1 = 0 vs H1 : β1 ≠ 0.

2 Again, we can use lm() in R and get a summary table:

Interpretation: There is sufficient evidence (p-value is very close to 0) to conclude that there is a
statistically significant linear relationship between birth weight and gestation, after taking into
account a mother’s smoking status.
1 Similarly, a 95% confidence interval for β1 tells us the magnitude of the effect of gestation.

2 A 95% t-multiplier with n − p = 32 − 3 = 29 degrees of freedom is t(0.025,29) = 2.0452.


Therefore, a 95% confidence interval for β1 is:

143.1 ± 2.0452 (9.128) = (124.4, 161.8).

Or, we can use the confint() function in R; the second row of its output is for β1.

Interpretation: We can be 95% confident that the mean birth weight increases by
between 124.4 and 161.8 grams for each additional week of gestation, whatever
the smoking status of the mother.
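The statement that both p-values are very close to 0 can be checked from the estimates and standard errors quoted in the confidence interval calculations. The sketch below assumes the usual t statistic (estimate divided by standard error) with n − p = 29 degrees of freedom; the variable names are illustrative.

```r
# Estimates and standard errors reported in the notes.
est <- c(Gest = 143.1, Smoke = -244.54)
se  <- c(Gest = 9.128, Smoke = 41.98)
df  <- 32 - 3  # n - p degrees of freedom

# t statistic for H0: coefficient = 0, and its two-sided p-value.
t_stat  <- est / se
p_value <- 2 * pt(-abs(t_stat), df)

print(round(t_stat, 2))
print(signif(p_value, 2))
```

Both t statistics are large in absolute value relative to the multiplier 2.0452, so both null hypotheses are rejected at the 5% level, consistent with the two intervals excluding 0.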
