Homework - Week 7: Problem 3.31
Nathan Otten
April 7, 2019
Problem 3.31
A. The goal of this study is to predict the sales price of a home (Y) as a function of the finished square
feet (X). The data for this study was collected by a city tax assessor on 522 arms-length transactions in
the mid-western United States for the year 2002. For this study, we have taken a random sample of 200
homes from the 522 total observations. The variables in the data include the sales price of the home in
dollars, the finished square feet of the home, the total number of bedrooms and bathrooms, the size of the
garage, the style, and other factors.
B. The sales price is expected to increase as a function of the finished square feet. Our goal is to fit the model:
Yi = β0 + β1 Xi + εi
for i = 1, 2, ..., 200, where εi ∼ N(0, σ²) and β0, β1, and σ² are the parameters of interest.
A. The scatter plot indicates a clear positive relationship between price and square feet (r = 0.877004).
B. The scatter plot appears to show a widening range of price as the finished square feet increases, which
gives the plot a cone-shaped appearance. This may indicate that our assumption of homoscedasticity
is violated, that is, the variance of the εi is not constant. It appears that as X increases, the variance
also increases. This observation is reasonable since other factors, such as amenities, can also drive up
the price of the home. In other words, smaller homes with expensive amenities, such as a pool, can
have high prices. See the appendix for a more complete analysis of the residuals on this point.
Using ordinary least squares, the simple regression model is given by:
Ŷ = −89274.191 + 160.754X
and is shown in figure 2. In this model, finished square feet explains about 77% of the total variation in the
price of the home (R² = 0.7691).
1. A two-sided t-test is used to select between H0 : β1 = 0 and H1 : β1 ≠ 0 with a type I error rate of
α = 0.05. Since b1 = 160.754 is more than 25 SE above what is expected under H0, we reject H0 in favor
of H1. Thus, there is evidence of a linear association between price and square footage.
[Figure 2: Price against finished square feet with the fitted regression line]
2. For the values X = 1100 and X = 4900, we predict the price of the home to be $87,555.67 and
$698,422.4 respectively.
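The two results above can be checked by hand from numbers already reported in this write-up. The sketch below is illustrative only: the function names are hypothetical, and it uses the rounded coefficients, so the predictions differ from the reported values by a dollar or two. It relies on the identity t = r·sqrt(n − 2)/sqrt(1 − r²) for the slope test in simple linear regression, and it flags predictions that fall outside the observed range of finished square feet given in the appendix summary.

```python
import math

# Values reported in this write-up (coefficients are rounded).
B0, B1 = -89274.191, 160.754
R, N = 0.877004, 200
X_MIN, X_MAX = 1198, 4746  # observed SQ_FT range from the appendix summary


def slope_t_from_r(r, n):
    """t-statistic for H0: beta1 = 0 in simple linear regression, from r and n."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def predict_price(sq_ft):
    """Point prediction from the fitted line, using the rounded coefficients."""
    return B0 + B1 * sq_ft


t_stat = slope_t_from_r(R, N)  # how many standard errors b1 lies from 0
print(f"R^2 = {R * R:.4f}, t = {t_stat:.2f}")

for x in (1100, 4900):
    note = "" if X_MIN <= x <= X_MAX else " (outside observed X range)"
    print(f"X = {x}: predicted price = {predict_price(x):,.2f}{note}")
```

Both requested values of X lie outside the sampled range of 1,198 to 4,746 square feet, so these predictions are extrapolations and should be read with some caution.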
IV. Conclusion
There is an increasing linear relationship between finished square feet and price of the home, and this
relationship can be modeled by:
Ŷ = −89274.191 + 160.754X
The linear model can be improved upon given that homoscedasticity appears to be violated in the ordinary
least squares model.
Appendix
Both sales price and finished square feet appear to be slightly asymmetric, with right-skewed
distributions. There are outliers in the right tail of the distribution for both of these variables. There are 3-4
unusually high values between 2,500 and 4,000 sq. ft., which skew the distribution.
[Figure: distributions of Sales Price and Square Ft.]
##   Sales_Price         SQ_FT
##  Min.   :112000   Min.   :1198
##  1st Qu.:179400   1st Qu.:1668
##  Median :221475   Median :1980
##  Mean   :268975   Mean   :2229
##  3rd Qu.:328500   3rd Qu.:2717
##  Max.   :830000   Max.   :4746
B. Residual Analysis
[Figures: Residuals against X (Square Feet); a second panel plotted against Square Feet]
1. Linearity of the Regression Function
Overall, linearity appears to be reasonable for values of X between 1,500 and 3,000. Outside that range,
however, the linear model is suspect. All of the values below 1,500 are underestimated by the model, and the
values above 3,000 vary widely, which adds considerable noise.
The error variance of the model does not appear to be constant: as X increases, so does the error variance. It is
particularly clear that the residuals between 2,500 and 4,000 vary greatly from 0, while the residuals between
1,500 and 2,000 are clustered tightly around 0. This indicates that the assumption of homoscedasticity
is suspect in this model. More formally, we apply the Breusch-Pagan test to determine whether σ² is a function of
X. At the 5% significance level, we reject the null hypothesis that σ² is not a function of X in favor of
the alternative that σ² is a function of X (chi-square = 91.74862, p < 2.22e-16).
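To make the test above concrete, here is a minimal sketch of one common form of the Breusch-Pagan statistic for a single predictor: regress the squared residuals on X, then compare (SSR*/2) / (SSE/n)² to a chi-square distribution with 1 degree of freedom. This is not necessarily the routine that produced the value reported above. The data are synthetic and hypothetical (generated with an error standard deviation proportional to X), and the function names are illustrative.

```python
import math
import random


def ols(x, y):
    """Return (b0, b1) for a simple linear regression fitted by least squares."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def breusch_pagan(x, y):
    """Breusch-Pagan statistic and p-value (chi-square, 1 df) for one predictor."""
    n = len(x)
    b0, b1 = ols(x, y)
    e2 = [(yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)]
    sse = sum(e2)
    # Auxiliary regression of the squared residuals on X.
    g0, g1 = ols(x, e2)
    e2bar = sse / n
    ssr_star = sum((g0 + g1 * xi - e2bar) ** 2 for xi in x)
    stat = (ssr_star / 2.0) / (sse / n) ** 2
    # Upper tail of chi-square(1): P(Z^2 > stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Hypothetical data: error standard deviation grows in proportion to X.
random.seed(0)
x_demo = [random.uniform(1200, 4700) for _ in range(400)]
y_demo = [-90000 + 160 * xi + random.gauss(0, 25 * xi) for xi in x_demo]

stat, p_value = breusch_pagan(x_demo, y_demo)
print(f"BP statistic = {stat:.2f}, p-value = {p_value:.3g}")
```

A statistic above the chi-square(1) critical value of about 3.84 leads to rejecting constant error variance at the 5% level.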
[Figure: studentized residuals against Square Feet]
The scatter plot above shows the studentized residuals plotted against square feet. Again it is clear that
values above about 2,700 have a greater range and variance, because some of the residuals are 3 or 4 standard
deviations from the mean of 0. Dotted lines are drawn on the scatter plot, and values beyond these lines are
considered outliers. As X increases, we see more outliers in the residuals.
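For reference, studentized residuals in simple linear regression can be computed directly from the leverages h_ii = 1/n + (x_i − x̄)²/Sxx. The sketch below computes the internally studentized version, e_i / sqrt(MSE·(1 − h_ii)); whether the plot above uses this or the externally studentized (deleted) version is not stated, so treat that choice as an assumption. The eight data points are hypothetical and serve only to exercise the function.

```python
import math


def studentized_residuals(x, y):
    """Internally studentized residuals and leverages for simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in resid) / (n - 2)  # two estimated coefficients
    leverage = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
    stud = [e / math.sqrt(mse * (1.0 - h)) for e, h in zip(resid, leverage)]
    return stud, leverage


# Hypothetical (square feet, price) pairs, not taken from the study data.
x_demo = [1200, 1500, 1800, 2100, 2600, 3200, 3900, 4500]
y_demo = [115000, 150000, 205000, 240000, 310000, 470000, 480000, 700000]

stud, leverage = studentized_residuals(x_demo, y_demo)
for xi, ri, hi in zip(x_demo, stud, leverage):
    print(f"X = {xi}: studentized residual = {ri:+.2f}, leverage = {hi:.3f}")
```

Because each residual is scaled by its own estimated standard deviation, values far from 0 can be compared across the whole range of X.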
Independence of Residuals
[Figure: sequence plot of residuals against Index]
From our sequence plot of residuals, there appear to be no patterns in the residuals plotted against their
index. The residuals appear to be independent.
[Figure: Normal Probability Plot of Residuals; Sample Quantiles against Theoretical Quantiles]
Above is the Normal Probability Plot, which shows that normality is a reasonable assumption for most
of the residuals. However, the outliers mentioned above are an exception to this assumption. We see from
the plot that there are 4-5 outliers where the observed quantiles do not match the theoretical quantiles.
Aside from these extreme values in the tails, the normality assumption generally holds.
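The comparison behind a normal probability plot can be sketched numerically. One common approximation takes the expected value of the k-th smallest residual under normality to be sqrt(MSE)·z[(k − 0.375)/(n + 0.25)], where z[p] is the standard normal quantile; plotting routines may use slightly different plotting positions, so this is an assumption rather than the exact construction of the plot above. The residuals below are hypothetical.

```python
import math
from statistics import NormalDist


def expected_normal_residuals(residuals):
    """Approximate expected ordered residuals under normality."""
    n = len(residuals)
    root_mse = math.sqrt(sum(e * e for e in residuals) / (n - 2))
    z = NormalDist()  # standard normal
    return [root_mse * z.inv_cdf((k - 0.375) / (n + 0.25)) for k in range(1, n + 1)]


# Hypothetical residuals (they sum to zero, as least squares residuals do).
resid_demo = [-42000, -18000, -9000, -3000, 1000, 6000, 12000, 25000, 28000]

observed = sorted(resid_demo)
expected = expected_normal_residuals(resid_demo)
for obs, exp in zip(observed, expected):
    print(f"observed = {obs:>8}, expected under normality = {exp:>12.1f}")
```

Points where the observed value departs strongly from the expected value, typically in the tails, are the ones that pull away from the reference line in the plot.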
Problem 3.32
A. The goal of this study is to predict the prostate-specific antigen (PSA) level (Y) as a function of cancer
volume (X). The data was collected by a university medical urology group from 97 men who were about to
undergo radical prostatectomies. The variables in the data include the PSA level, cancer volume, weight,
age, and other factors.
B. The PSA is expected to increase as a function of the cancer volume. Our goal is to fit the model:
Yi = β0 + β1 Xi + εi
for i = 1, 2, ..., 97, where εi ∼ N(0, σ²) and β0, β1, and σ² are the parameters of interest.
II. Preliminary Analysis
A. The scatter plot indicates a positive relationship between PSA and cancer volume (r = 0.6241506).
B. The scatter plot appears to show a widening range of PSA as the cancer volume increases, which gives
the plot a cone-shaped appearance. This may indicate that our assumption of homoscedasticity is
violated, that is, the variance of the εi is not constant. It appears that as X increases, the variance
also increases. See the appendix for a more complete analysis of the residuals on this point.
Using ordinary least squares, the simple regression model is given by:
Ŷ = 1.125 + 3.230X
and is shown in the figure below. In this model, cancer volume explains about 39% of the total variation in
the PSA (R² = 0.3896).
1. A two-sided t-test is used to select between H0 : β1 = 0 and H1 : β1 ≠ 0 with a type I error rate of
α = 0.05. Since b1 = 3.2299 is more than 7 SE above what is expected under H0, we reject H0 in favor
of H1. Thus, there is evidence of a linear association between PSA and cancer volume.
[Figure: PSA against Cancer Volume with the fitted regression line]
IV. Conclusion
There is an increasing linear relationship between cancer volume and PSA, and this relationship can be
modeled by:
Ŷ = 1.125 + 3.230X
The linear model can be improved upon given that homoscedasticity appears to be violated in the ordinary
least squares model.
Appendix
Both PSA and cancer volume appear to be asymmetric, with right-skewed distributions. There
are outliers in the right tail of the distribution for both of these variables. There are multiple unusually high
values for both PSA and volume, which skew the distributions.
[Figure: distributions of PSA and Volume]
B. Residual Analysis
[Figures: Residuals against X (Cancer Volume); a second panel plotted against Cancer Volume]
1. Linearity of the Regression Function
Overall, linearity appears to be reasonable for smaller values of X. However, the linear model is suspect
outside of that range. The values above 5 vary widely, which adds considerable noise.
[Figure: studentized residuals against Cancer Volume]
The scatter plot above shows the studentized residuals plotted against cancer volume. Again it is clear
that values above about 5 have a greater range and variance, because some of the residuals are between 4 and 6
standard deviations from the mean of 0. Dotted lines are drawn on the scatter plot, and values beyond these lines
are considered outliers. As X increases, we see more outliers in the residuals. Three outliers in particular
are obvious in the plot.
Independence of Residuals
[Figure: Sequence Plot of Residuals; ei against Index]
From our sequence plot, there appears to be a pattern in the residuals: they are tightly
clustered around 0 for indices between 1 and 15, and after that they grow increasingly volatile. This
would seem to violate the assumption of independence in the residuals.
[Figure: Normal Probability Plot of Residuals; Sample Quantiles against Theoretical Quantiles]
Above is the Normal Probability Plot, which shows that normality is a reasonable assumption for most
of the residuals. However, the outliers mentioned above are an exception to this assumption. We see from
the plot that there are 4-5 outliers where the observed quantiles do not match the theoretical quantiles.
Aside from these extreme values in the tails, the normality assumption generally holds.