Chapter 11
Simple Linear Regression Analysis
Chapter Outline
11.1 The Simple Linear Regression Model
11.2 The Least Squares Estimates, and Point Estimation and Prediction
11.3 Model Assumptions and the Standard Error
11.4 Testing the Significance of the Slope and y Intercept
11.5 Confidence and Prediction Intervals
11.6 Simple Coefficients of Determination and Correlation
11.7 An F Test for the Model
*11.8 Residual Analysis
*11.9 Some Shortcut Formulas
*Optional section
Managers often make decisions by studying the relationships between variables, and process improvements can often be made by understanding how changes in one or more variables affect the process output. Regression analysis is a statistical technique in which we use observed data to relate a variable of interest, which is called the dependent (or response) variable, to one or more independent (or predictor) variables. The objective is to build a regression model, or prediction equation, that can be used to describe, predict, and control the dependent variable on the basis of the independent variables. For example, a company might wish to improve its marketing process. After collecting data concerning the demand for a product, the product's price, and the advertising expenditures made to promote the product, the company might use regression analysis to develop an equation to predict demand on the basis of price and advertising expenditure. Predictions of demand for various combinations of price and advertising expenditure can then be used to evaluate potential changes in the company's marketing strategies. As another example, a manufacturer might use regression analysis to describe the relationship between several input variables and an important output variable. Understanding the relationships between these variables would allow the manufacturer to identify control variables that can be used to improve the process performance.

In the next two chapters we give a thorough presentation of regression analysis. We begin in this chapter by presenting simple linear regression analysis. Using this technique is appropriate when we are relating a dependent variable to a single independent variable and when a straight-line model describes the relationship between these two variables. We explain many of the methods of this chapter in the context of two new cases:
The Fuel Consumption Case: A management consulting firm uses simple linear regression analysis to predict the weekly amount of fuel (in millions of cubic feet of natural gas) that will be required to heat the homes and businesses in a small city on the basis of the week's average hourly temperature. A natural gas company uses these predictions to improve its gas ordering process. One of the gas company's objectives is to reduce the fines imposed by its pipeline transmission system when the company places inaccurate natural gas orders.

The QHIC Case: The marketing department at Quality Home Improvement Center (QHIC) uses simple linear regression analysis to predict home upkeep expenditure on the basis of home value. Predictions of home upkeep expenditures are used to help determine which homes should be sent advertising brochures promoting QHIC's products and services.
11.1 The Simple Linear Regression Model

EXAMPLE 11.1 The Fuel Consumption Case
cities. For instance, the map on pages 448 and 449 illustrates the pipelines of, and the cities served by, the Columbia Gas System. To place an order (called a nomination) for an amount of natural gas to be transmitted to its city over a period of time (day, week, month), a natural gas company makes its best prediction of the city's natural gas needs for that period. The natural gas company then instructs its marketer(s) to deliver this amount of gas to its pipeline transmission system. If most of the natural gas companies being supplied by the transmission system can predict their cities' natural gas needs with reasonable accuracy, then the overnominations of some companies will tend to cancel the undernominations of other companies. As a result, the transmission system will probably have enough natural gas to efficiently meet the needs of the cities it supplies.

In order to encourage natural gas companies to make accurate transmission nominations and to help control costs, pipeline transmission systems charge, in addition to their usual fees, transmission fines. A natural gas company is charged a transmission fine if it substantially undernominates natural gas, which can lead to an excessive number of unplanned transmissions, or if it substantially overnominates natural gas, which can lead to excessive storage of unused gas. Typically, pipeline transmission systems allow a certain percentage nomination error before they impose a fine. For example, some systems do not impose a fine unless the actual amount of natural gas used by a city differs from the nomination by more than 10 percent. Beyond the allowed percentage nomination error, fines are charged on a sliding scale: the larger the nomination error, the larger the transmission fine. Furthermore, some transmission systems evaluate nomination errors and assess fines more often than others. For instance, some transmission systems do this as frequently as daily, while others do this weekly or monthly (this frequency depends on the number of storage fields to which the transmission system has access, the system's accounting practices, and other factors). In any case, each natural gas company needs a way to accurately predict its city's natural gas needs so it can make accurate transmission nominations.

Suppose we are analysts in a management consulting firm. The natural gas company serving a small city has hired the consulting firm to develop an accurate way to predict the amount of fuel (in millions of cubic feet, MMcf, of natural gas) that will be required to heat the city. Because the pipeline transmission system supplying the city evaluates nomination errors and assesses fines weekly, the natural gas company wants predictions of future weekly fuel consumptions.¹ Moreover, since the pipeline transmission system allows a 10 percent nomination error before assessing a fine, the natural gas company would like the actual and predicted weekly fuel consumptions to differ by no more than 10 percent. Our experience suggests that weekly fuel consumption substantially depends on the average hourly temperature (in degrees Fahrenheit) measured in the city during the week. Therefore, we will try to predict the dependent (response) variable weekly fuel consumption (y) on the basis of the independent (predictor) variable average hourly temperature (x) during the week. To this end, we observe values of y and x for eight weeks. The data are given in Table 11.1. In Figure 11.1 we give an Excel output of a scatter plot of y versus x. This plot shows
1 A tendency for the fuel consumption to decrease in a straight-line fashion as the temperatures increase.
2 A scattering of points around the straight line.
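Both characteristics are easy to see by plotting the data yourself. The following short sketch (ours, not part of the original text) reproduces the Figure 11.1 scatter plot of the Table 11.1 data in Python with matplotlib:

# Sketch (not from the text): scatter plot of weekly fuel consumption (y)
# versus average hourly temperature (x) for the Table 11.1 data.
import matplotlib.pyplot as plt

temp = [28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5]  # x, degrees F
fuel = [12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5]      # y, MMcf

plt.scatter(temp, fuel)
plt.xlabel("TEMP (average hourly temperature, deg F)")
plt.ylabel("FUELCONS (weekly fuel consumption, MMcf)")
plt.show()

The resulting plot shows both the straight-line decrease and the scatter of points around the line.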
A regression model describing the relationship between y and x must represent these two characteristics. We now develop such a model.²

¹ For whatever period of time a transmission system evaluates nomination errors and charges fines, a natural gas company is free to actually make nominations more frequently. Sometimes this is a good strategy, but we will not discuss it further.
² Generally, the larger the sample size is (that is, the more combinations of values of y and x that we have observed), the more accurately we can describe the relationship between y and x. Therefore, as the natural gas company observes values of y and x in future weeks, the new data should be added to the data in Table 11.1.

TABLE 11.1 The Fuel Consumption Data

Week   Average Hourly Temperature, x (°F)   Weekly Fuel Consumption, y (MMcf)
1      28.0                                 12.4
2      28.0                                 11.7
3      32.5                                 12.4
4      39.0                                 10.8
5      45.9                                  9.4
6      57.8                                  9.5
7      58.1                                  8.0
8      62.5                                  7.5

FIGURE 11.1 Excel Output of a Scatter Plot of y versus x (FUELCONS versus TEMP)

We begin by considering a specific average hourly temperature x. For example, consider the average hourly temperature 28°F, which was observed in week 1, or consider the average hourly temperature 45.9°F, which was observed in week 5 (there is nothing special about these two average hourly temperatures, but we will use them throughout this example to help explain the idea of a regression model). For the specific average hourly temperature x that we consider, there are, in theory, many weeks that could have this temperature. However, although these weeks
each have the same average hourly temperature, other factors that affect fuel consumption could vary from week to week. For example, these weeks might have different average hourly wind velocities, different thermostat settings, and so forth. Therefore, the weeks could have different fuel consumptions. It follows that there is a population of weekly fuel consumptions that could be observed when the average hourly temperature is x. Furthermore, this population has a mean, which we denote as μy|x (pronounced mu of y given x).

We can represent the straight-line tendency we observe in Figure 11.1 by assuming that μy|x is related to x by the equation
μy|x = β0 + β1x
This equation is the equation of a straight line with y-intercept β0 (pronounced beta zero) and slope β1 (pronounced beta one). To better understand the straight line and the meanings of β0 and β1, we must first realize that the values of β0 and β1 determine the precise value of the mean weekly fuel consumption μy|x that corresponds to a given value of the average hourly temperature x. We cannot know the true values of β0 and β1, and in the next section we learn how to estimate these values. However, for illustrative purposes, let us suppose that the true value of β0 is 15.77 and the true value of β1 is −.1281. It would then follow, for example, that the mean of the population of all weekly fuel consumptions that could be observed when the average hourly temperature is 28°F is
μy|28 = β0 + β1(28) = 15.77 − .1281(28) = 12.18 MMcf of natural gas
As another example, it would also follow that the mean of the population of all weekly fuel consumptions that could be observed when the average hourly temperature is 45.9°F is
μy|45.9 = β0 + β1(45.9) = 15.77 − .1281(45.9) = 9.89 MMcf of natural gas
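These two computations are easy to check. A two-line sketch of ours, using the text's illustrative parameter values:

# Sketch: evaluate the illustrative line of means at two temperatures.
def mean_fuel(x):
    return 15.77 - 0.1281 * x  # illustrative beta0 and beta1 from the text

print(round(mean_fuel(28.0), 2))   # 12.18 MMcf
print(round(mean_fuel(45.9), 2))   # 9.89 MMcf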
Note that, as the average hourly temperature increases from 28°F to 45.9°F, mean weekly fuel consumption decreases from 12.18 MMcf to 9.89 MMcf of natural gas. This makes sense because we would expect to use less fuel if the average hourly temperature increases. Of course, because we do not know the true values of β0 and β1, we cannot actually calculate these mean weekly fuel consumptions. However, when we learn in the next section how to estimate β0 and β1, we will then be able to estimate the mean weekly fuel consumptions. For now, when we say that the equation μy|x = β0 + β1x is the equation of a straight line, we mean that the different mean weekly fuel consumptions that correspond to different average hourly temperatures lie exactly on a straight line. For example, consider the eight mean weekly fuel consumptions that correspond to the eight average hourly temperatures in Table 11.1. In Figure 11.2(a) we depict these mean weekly fuel consumptions as triangles that lie exactly on the straight line defined by
the equation μy|x = β0 + β1x. Furthermore, in this figure we draw arrows pointing to the triangles that represent the previously discussed means μy|28 and μy|45.9. Sometimes we refer to the straight line defined by the equation μy|x = β0 + β1x as the line of means.

[Map (pages 448-449): the pipelines of, and the cities served by, the Columbia Gas System. Legend: Columbia Gas Transmission; Columbia Gulf Transmission; Cove Point LNG; Corporate Headquarters; Cove Point Terminal; Storage Fields; Distribution Service Territory; Independent Power Projects; Communities Served by Columbia Companies; Communities Served by Companies Supplied by Columbia.]

FIGURE 11.2 The Simple Linear Regression Model Relating Weekly Fuel Consumption (y) to Average Hourly Temperature (x): (a) the line of means μy|x = β0 + β1x, the observed fuel consumptions for the first week (y = 12.4, a positive error term) and the fifth week (y = 9.4, a negative error term); (b) the slope β1, the change in mean weekly fuel consumption associated with a one-degree increase in average hourly temperature; (c) the y-intercept β0.

In order to interpret the slope β1 of the line of means, consider two different weeks. Suppose that for the first week the average hourly temperature is c. The mean weekly fuel consumption for all such weeks is
β0 + β1(c)
For the second week, suppose that the average hourly temperature is (c + 1). The mean weekly fuel consumption for all such weeks is
β0 + β1(c + 1)
It is easy to see that the difference between these mean weekly fuel consumptions is β1. Thus, as illustrated in Figure 11.2(b), the slope β1 is the change in mean weekly fuel consumption that is associated with a one-degree increase in average hourly temperature. To interpret the meaning of
the y-intercept β0, consider a week having an average hourly temperature of 0°F. The mean weekly fuel consumption for all such weeks is
β0 + β1(0) = β0
Therefore, as illustrated in Figure 11.2(c), the y-intercept β0 is the mean weekly fuel consumption when the average hourly temperature is 0°F. However, because we have not observed any weeks with temperatures near 0, we have no data to tell us what the relationship between mean weekly fuel consumption and average hourly temperature looks like for temperatures near 0. Therefore, the interpretation of β0 is of dubious practical value. More will be said about this later.

Now recall that the observed weekly fuel consumptions are not exactly on a straight line. Rather, they are scattered around a straight line. To represent this phenomenon, we use the simple linear regression model
y = μy|x + ε = β0 + β1x + ε
This model says that the weekly fuel consumption y observed when the average hourly temperature is x differs from the mean weekly fuel consumption μy|x by an amount equal to ε (pronounced epsilon). Here ε is called an error term. The error term describes the effect on y of all factors other than the average hourly temperature. Such factors would include the average hourly wind velocity and the average hourly thermostat setting in the city. For example, Figure 11.2(a) shows that the error term for the first week is positive. Therefore, the observed fuel consumption y = 12.4 in the first week was above the corresponding mean weekly fuel consumption for all weeks when x = 28. As another example, Figure 11.2(a) also shows that the error term for the fifth week was negative. Therefore, the observed fuel consumption y = 9.4 in the fifth week was below the corresponding mean weekly fuel consumption for all weeks when x = 45.9. More generally, Figure 11.2(a) illustrates that the simple linear regression model says that the eight observed fuel consumptions (the dots in the figure) deviate from the eight mean fuel consumptions (the triangles in the figure) by amounts equal to the error terms (the line segments in the figure). Of course, since we do not know the true values of β0 and β1, the relative positions of the quantities pictured in the figure are only hypothetical.

With the fuel consumption example as background, we are ready to define the simple linear regression model relating the dependent variable y to the independent variable x. We suppose that we have gathered n observations; each observation consists of an observed value of x and its corresponding value of y. Then:
The simple linear (or straight line) regression model is
y = μy|x + ε = β0 + β1x + ε
Here
1 μy|x = β0 + β1x is the mean value of the dependent variable y when the value of the independent variable is x.
2 β0 is the y-intercept. β0 is the mean value of y when x equals 0.³
3 β1 is the slope. β1 is the change (amount of increase or decrease) in the mean value of y associated with a one-unit increase in x. If β1 is positive, the mean value of y increases as x increases. If β1 is negative, the mean value of y decreases as x increases.
ε is an error term that describes the effects on y of all factors other than the value of the independent variable x.
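To make the roles of the line of means and the error term concrete, the following sketch (ours; the parameter values are the text's illustrative ones, and the error standard deviation of 0.6 is an arbitrary assumption) simulates weekly fuel consumptions from a simple linear regression model:

# Sketch: simulate y = beta0 + beta1*x + epsilon at the eight observed
# temperatures, with normally distributed error terms (mean 0).
import random

random.seed(1)
beta0, beta1, sigma = 15.77, -0.1281, 0.6  # sigma is an assumed value

temps = [28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5]
for x in temps:
    mean_y = beta0 + beta1 * x       # the line of means, mu_{y|x}
    eps = random.gauss(0.0, sigma)   # the error term
    y = mean_y + eps                 # a simulated observed fuel consumption
    print(f"x = {x:4.1f}  mean = {mean_y:5.2f}  error = {eps:+.2f}  y = {y:5.2f}")

Each simulated y value scatters around its mean on the line, just as the dots in Figure 11.2(a) scatter around the triangles.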
This model is illustrated in Figure 11.3 (note that x0 in this figure denotes a specific value of the independent variable x).

FIGURE 11.3 The Simple Linear Regression Model (the line of means, the slope, the y-intercept, and an error term)

³ As implied by the discussion of Example 11.1, if we have not observed any values of x near 0, this interpretation is of dubious practical value.

The y-intercept β0 and the slope β1 are called regression parameters. Because we do not know the true values of these parameters, we must use the sample data to
estimate these values. We see how this is done in the next section. In later sections we show how to use these estimates to predict y. The fuel consumption data in Table 11.1 were observed sequentially over time (in eight consecutive weeks). When data are observed in time sequence, the data are called time series data. Many applications of regression utilize such data. Another frequently used type of data is called cross-sectional data. This kind of data is observed at a single point in time.
Quality Home Improvement Center (QHIC) operates five stores in a large metropolitan area. The marketing department at QHIC wishes to study the relationship between x, home value (in thousands of dollars), and y, yearly expenditure on home upkeep (in dollars). A random sample of 40 homeowners is taken, and the homeowners are asked to estimate their expenditures during the previous year on the types of home upkeep products and services offered by QHIC. Public records of the county auditor are used to obtain the previous year's assessed values of the homeowners' homes. The resulting x and y values are given in Table 11.2. Because the 40 observations are for the same year (for different homes), these data are cross-sectional.

The MINITAB output of a scatter plot of y versus x is given in Figure 11.4. We see that the observed values of y tend to increase in a straight-line (or slightly curved) fashion as x increases. Assuming that μy|x and x have a straight-line relationship, it is reasonable to relate y to x by using the simple linear regression model having a positive slope (β1 > 0):
y = β0 + β1x + ε
The slope β1 is the change (increase) in mean dollar yearly upkeep expenditure that is associated with each $1,000 increase in home value. In later examples the marketing department at QHIC will use predictions given by this simple linear regression model to help determine which homes should be sent advertising brochures promoting QHIC's products and services.

TABLE 11.2 The QHIC Upkeep Expenditure Data (the value, x, in thousands of dollars, and the yearly upkeep expenditure, y, in dollars, for each of the 40 sampled homes)

FIGURE 11.4 MINITAB Plot of Upkeep Expenditure versus Value of Home for the QHIC Data

We have interpreted the slope β1 of the simple linear regression model to be the change in the mean value of y associated with a one-unit increase in x. We sometimes refer to this change as the effect of the independent variable x on the dependent variable y. However, we cannot prove that
a change in an independent variable causes a change in the dependent variable. Rather, regression can be used only to establish that the two variables move together and that the independent variable contributes information for predicting the dependent variable. For instance, regression analysis might be used to establish that as liquor sales have increased over the years, college professors' salaries have also increased. However, this does not prove that increases in liquor sales cause increases in college professors' salaries. Rather, both variables are influenced by a third variable: long-run growth in the national economy.
CONCEPTS

11.2 In the simple linear regression model, what are y, μy|x, and ε?
11.3 In the simple linear regression model, define the meanings of the slope β1 and the y-intercept β0.
11.4 What is the difference between time series data and cross-sectional data?

METHODS AND APPLICATIONS

11.5 THE STARTING SALARY CASE StartSal
The chairman of the marketing department at a large state university undertakes a study to relate starting salary (y) after graduation for marketing majors to grade point average (GPA) in major courses. To do this, records of seven recent marketing graduates are randomly selected.
[Margin data table for Exercise 11.5: the GPA, x, and starting salary, y, for each of the seven marketing graduates, together with a MINITAB scatter plot of y versus x.]
Using the scatter plot (from MINITAB) of y versus x, explain why the simple linear regression model y = μy|x + ε = β0 + β1x + ε might appropriately relate y to x.

11.6 THE STARTING SALARY CASE StartSal
Consider the simple linear regression model describing the starting salary data of Exercise 11.5.
a Explain the meaning of μy|x=4.00 = β0 + β1(4.00).
b Explain the meaning of μy|x=2.50 = β0 + β1(2.50).
c Interpret the meaning of the slope parameter β1.
d Interpret the meaning of the y-intercept β0. Why does this interpretation fail to make practical sense?
e The error term ε describes the effects of many factors on starting salary y. What are these factors? Give two specific examples.

11.7 THE SERVICE TIME CASE SrvcTime
Accu-Copiers, Inc., sells and services the Accu-500 copying machine. As part of its standard service contract, the company agrees to perform routine service on this copier. To obtain information about the time it takes to perform routine service, Accu-Copiers has collected data for 11 service calls. The data are as follows:
[Data table for Exercise 11.7: the number of copiers serviced, x, and the number of minutes required, y, for each of the 11 service calls.]
Using the scatter plot (from Excel) of y versus x, discuss why the simple linear regression model might appropriately relate y to x.

11.8 THE SERVICE TIME CASE SrvcTime
Consider the simple linear regression model describing the service time data in Exercise 11.7.
a Explain the meaning of μy|x=4 = β0 + β1(4).
b Explain the meaning of μy|x=6 = β0 + β1(6).
c Interpret the meaning of the slope parameter β1.
d Interpret the meaning of the y-intercept β0. Does this interpretation make practical sense?
e The error term ε describes the effects of many factors on service time. What are these factors? Give two specific examples.

11.9 THE FRESH DETERGENT CASE Fresh
Enterprise Industries produces Fresh, a brand of liquid laundry detergent. In order to study the relationship between price and demand for the large bottle of Fresh, the company has gathered data concerning demand for Fresh over the last 30 sales periods (each sales period is four weeks). Here, for each sales period,
y = demand for the large bottle of Fresh (in hundreds of thousands of bottles) in the sales period
x1 = the price (in dollars) of Fresh as offered by Enterprise Industries in the sales period
x2 = the average industry price (in dollars) of competitors' similar detergents in the sales period
x4 = x2 − x1 = the price difference in the sales period
Note: We denote the price difference as x4 (rather than, for example, x3) to be consistent with other notation to be introduced in the Fresh detergent case in Chapter 12.
Fresh Detergent Demand Data

Sales                                     Sales
Period   x1     x2     x4      y          Period   x1     x2     x4      y
 1       3.85   3.80   −.05    7.38        16      3.80   4.10    .30    8.87
 2       3.75   4.00    .25    8.51        17      3.70   4.20    .50    9.26
 3       3.70   4.30    .60    9.52        18      3.80   4.30    .50    9.00
 4       3.70   3.70    0      7.50        19      3.70   4.10    .40    8.75
 5       3.60   3.85    .25    9.33        20      3.80   3.75   −.05    7.95
 6       3.60   3.80    .20    8.28        21      3.80   3.75   −.05    7.65
 7       3.60   3.75    .15    8.75        22      3.75   3.65   −.10    7.27
 8       3.80   3.85    .05    7.87        23      3.70   3.90    .20    8.00
 9       3.80   3.65   −.15    7.10        24      3.55   3.65    .10    8.50
10       3.85   4.00    .15    8.00        25      3.60   4.10    .50    8.75
11       3.90   4.10    .20    7.89        26      3.65   4.25    .60    9.21
12       3.90   4.00    .10    8.15        27      3.70   3.65   −.05    8.27
13       3.70   4.10    .40    9.10        28      3.75   3.75    0      7.67
14       3.75   4.20    .45    8.86        29      3.80   3.85    .05    7.93
15       3.75   4.10    .35    8.90        30      3.70   4.25    .55    9.26

(Here x4 = x2 − x1.)
Using the scatter plot (from MINITAB) of y versus x4 shown below, discuss why the simple linear regression model might appropriately relate y to x4.
[MINITAB scatter plot of Demand, y, versus PriceDif, x4.]
Direct Labor Cost Data   DirLab

Batch Size, x   Direct Labor Cost, y ($100s)
  5                71
 62               663
 35               381
 12               138
 83               861
 14               145
 46               493
 52               548
 23               251
100              1024
 41               435
 75               772
11.10 THE FRESH DETERGENT CASE Fresh
Consider the simple linear regression model relating demand, y, to the price difference, x4, and the Fresh demand data of Exercise 11.9.
a Explain the meaning of μy|x4=.10 = β0 + β1(.10).
b Explain the meaning of μy|x4=−.05 = β0 + β1(−.05).
c Explain the meaning of the slope parameter β1.
d Explain the meaning of the intercept β0. Does this explanation make practical sense?
e What factors are represented by the error term in this model? Give two specific examples.

11.11 THE DIRECT LABOR COST CASE DirLab
An accountant wishes to predict direct labor cost (y) on the basis of the batch size (x) of a product produced in a job shop. Data for 12 production runs are given in the table in the margin.
a Construct a scatter plot of y versus x.
b Discuss whether the scatter plot suggests that a simple linear regression model might appropriately relate y to x.

11.12 THE DIRECT LABOR COST CASE DirLab
Consider the simple linear regression model describing the direct labor cost data of Exercise 11.11.
a Explain the meaning of μy|x=60 = β0 + β1(60).
b Explain the meaning of μy|x=30 = β0 + β1(30).
c Explain the meaning of the slope parameter β1.
d Explain the meaning of the intercept β0. Does this explanation make practical sense?
e What factors are represented by the error term in this model? Give two specific examples of these factors.

11.13 THE REAL ESTATE SALES PRICE CASE RealEst
A real estate agency collects data concerning y = the sales price of a house (in thousands of dollars), and x = the home size (in hundreds of square feet). The data are given in the margin.
a Construct a scatter plot of y versus x.
b Discuss whether the scatter plot suggests that a simple linear regression model might appropriately relate y to x.

11.14 THE REAL ESTATE SALES PRICE CASE RealEst
Consider the simple linear regression model describing the sales price data of Exercise 11.13.
a Explain the meaning of μy|x=20 = β0 + β1(20).
b Explain the meaning of μy|x=18 = β0 + β1(18).
c Explain the meaning of the slope parameter β1.
d Explain the meaning of the intercept β0. Does this explanation make practical sense?
e What factors are represented by the error term in this model? Give two specific examples.
RealEst margin data (Exercises 11.13 and 11.14): the sales price, y (in thousands of dollars), for the 10 sampled houses is 180, 98.1, 173.1, 136.5, 141, 165.9, 193.5, 127.8, 163.5, 172.5; each sales price is paired with the corresponding home size, x (in hundreds of square feet).
Source: Reprinted with permission from The Real Estate Appraiser and Analyst Spring 1986 issue. Copyright 1986 by the Appraisal Institute, Chicago, Illinois.
11.2 The Least Squares Estimates, and Point Estimation and Prediction
The true values of the y-intercept (β0) and slope (β1) in the simple linear regression model are unknown. Therefore, it is necessary to use observed data to compute estimates of these regression parameters. To see how this is done, we begin with a simple example.
Consider the fuel consumption problem of Example 11.1. The scatter plot of y (fuel consumption) versus x (average hourly temperature) in Figure 11.1 suggests that the simple linear regression model appropriately relates y to x. We now wish to use the data in Table 11.1 to estimate the intercept β0 and the slope β1 of the line of means. To do this, it might be reasonable to estimate the line of means by fitting the best straight line to the plotted data in Figure 11.1. But how do we fit the best straight line? One approach would be to simply eyeball a line through the points. Then we could read the y-intercept and slope off the visually fitted line and use these values as the estimates of β0 and β1. For example, Figure 11.5 shows a line that has been visually fitted to the plot of the
fuel consumption data. We see that this line intersects the y axis at y = 15. Therefore, the y-intercept of the line is 15. In addition, the figure shows that the slope of the line is
(change in y) / (change in x) = (12.8 − 13.8) / (20 − 10) = −1/10 = −.1
Therefore, based on the visually fitted line, we estimate that β0 is 15 and that β1 is −.1.

FIGURE 11.5 A Line Visually Fitted to the Fuel Consumption Data

In order to evaluate how good our point estimates of β0 and β1 are, consider using the visually fitted line to predict weekly fuel consumption. Denoting such a prediction as ŷ (pronounced y hat), a prediction of weekly fuel consumption when average hourly temperature is x is
ŷ = 15 − .1x
For instance, when temperature is 28°F, predicted fuel consumption is
ŷ = 15 − .1(28) = 15 − 2.8 = 12.2
Here ŷ is simply the point on the visually fitted line corresponding to x = 28 (see Figure 11.6).

FIGURE 11.6 The Prediction ŷ = 12.2 Given by the Visually Fitted Line When x = 28

TABLE 11.3 Calculation of SSE for a Line Visually Fitted to the Fuel Consumption Data

y       x       ŷ = 15 − .1x    y − ŷ      (y − ŷ)²
12.4    28.0    12.2             .2         .04
11.7    28.0    12.2            −.5         .25
12.4    32.5    11.75            .65        .4225
10.8    39.0    11.1            −.3         .09
 9.4    45.9    10.41          −1.01       1.0201
 9.5    57.8     9.22            .28        .0784
 8.0    58.1     9.19          −1.19       1.4161
 7.5    62.5     8.75          −1.25       1.5625
                                SSE = Σ(y − ŷ)² = 4.8796

We can evaluate how well the visually determined line fits the points on the scatter plot by
comparing each observed value of y with the corresponding predicted value ŷ given by the fitted line. We do this by computing the deviation y − ŷ. For instance, looking at the first observation in Table 11.1 (page 447), we observed y = 12.4 and x = 28.0. Since the predicted fuel consumption when x equals 28 is ŷ = 12.2, the deviation y − ŷ equals 12.4 − 12.2 = .2. This deviation is illustrated in Figure 11.5. Table 11.3 gives the values of y, x, ŷ, and y − ŷ for each observation in Table 11.1. The deviations (or prediction errors) are the vertical distances between the observed y values and the predictions obtained using the fitted line; that is, they are the line segments depicted in Figure 11.5.

If the visually determined line fits the data well, the deviations (errors) will be small. To obtain an overall measure of the quality of the fit, we compute the sum of squared deviations or sum of squared errors, denoted SSE. Table 11.3 also gives the squared deviations and the SSE for our visually fitted line. We find that SSE = 4.8796.

Clearly, the line shown in Figure 11.5 is not the only line that could be fitted to the observed fuel consumption data. Different people would obtain somewhat different visually fitted lines. However, it can be shown that there is exactly one line that gives a value of SSE that is smaller than the value of SSE that would be given by any other line that could be fitted to the data. This line is called the least squares regression line or the least squares prediction equation. To show how to find the least squares line, we first write the general form of a straight-line prediction equation as
ŷ = b0 + b1x
Here b0 (pronounced b zero) is the y-intercept and b1 (pronounced b one) is the slope of the line. In addition, ŷ denotes the predicted value of the dependent variable when the value of the independent variable is x.

Now suppose we have collected n observations (x1, y1), (x2, y2), . . . , (xn, yn). If we consider a particular observation (xi, yi), the predicted value of yi is
ŷi = b0 + b1xi
Furthermore, the prediction error (also called the residual) for this observation is
ei = yi − ŷi = yi − (b0 + b1xi)
Then the least squares line is the line that minimizes the sum of the squared prediction errors (that is, the sum of squared residuals):
SSE = Σ (yi − (b0 + b1xi))²
To find this line, we find the values of the y-intercept b0 and the slope b1 that minimize SSE. These values of b0 and b1 are called the least squares point estimates of β0 and β1. Using calculus, it can be shown that these estimates are calculated as follows:
For the simple linear regression model:
1 The least squares point estimate of the slope β1 is
b1 = SSxy / SSxx
where
SSxy = Σ(xi − x̄)(yi − ȳ) = Σxiyi − (Σxi)(Σyi)/n
and
SSxx = Σ(xi − x̄)² = Σxi² − (Σxi)²/n
2 The least squares point estimate of the y-intercept β0 is
b0 = ȳ − b1x̄
where
ȳ = Σyi / n   and   x̄ = Σxi / n
Here n is the number of observations (an observation is an observed value of x and its corresponding value of y).
The following example illustrates how to calculate these point estimates and how to use these point estimates to estimate mean values and predict individual values of the dependent variable. Note that the quantities SSxy and SSxx used to calculate the least squares point estimates are also used throughout this chapter to perform other important calculations.
EXAMPLE 11.4 The Fuel Consumption Case

Part 1: The least squares point estimates. Consider the fuel consumption data in Table 11.1. To calculate the least squares point estimates of β0 and β1, we first compute the following sums (here n = 8):⁴

yi       xi        xi²           xi yi
12.4     28.0        784          347.2
11.7     28.0        784          327.6
12.4     32.5      1,056.25       403.0
10.8     39.0      1,521          421.2
 9.4     45.9      2,106.81       431.46
 9.5     57.8      3,340.84       549.1
 8.0     58.1      3,375.61       464.8
 7.5     62.5      3,906.25       468.75
Σyi = 81.7   Σxi = 351.8   Σxi² = 16,874.76   Σxi yi = 3,413.11

Using these summations, we calculate SSxy and SSxx as follows:
SSxy = Σxi yi − (Σxi)(Σyi)/n = 3,413.11 − (351.8)(81.7)/8 = −179.6475
SSxx = Σxi² − (Σxi)²/n = 16,874.76 − (351.8)²/8 = 1,404.355

⁴ In order to simplify notation, we will often drop the limits on summations in this and subsequent chapters. That is, instead of writing Σ with the limits i = 1 to n, we will simply write Σ.
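These sums and the resulting least squares estimates can be verified with a short script (our sketch, not part of the original example):

# Sketch: verify SSxy, SSxx, and the least squares point estimates
# b1 and b0 for the fuel consumption data.
x = [28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5]
y = [12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

SSxy = sum_xy - sum_x * sum_y / n   # -179.6475
SSxx = sum_x2 - sum_x ** 2 / n      # 1,404.355
b1 = SSxy / SSxx                    # about -0.1279
b0 = sum_y / n - b1 * (sum_x / n)   # about 15.84
print(SSxy, SSxx, b1, b0)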
TABLE 11.4 Calculation of SSE Obtained by Using the Least Squares Point Estimates

yi      xi      ŷi = 15.84 − .1279xi    yi − ŷi (residual)    (yi − ŷi)²
12.4    28.0    12.2588                   .1412                 .0199374
11.7    28.0    12.2588                  −.5588                 .3122574
12.4    32.5    11.68325                  .71675                .5137306
10.8    39.0    10.8519                  −.0519                 .0026936
 9.4    45.9     9.96939                 −.56939                .324205
 9.5    57.8     8.44738                 1.05262               1.1080089
 8.0    58.1     8.40901                 −.40901                .1672892
 7.5    62.5     7.84625                 −.34625                .1198891
                                         SSE = Σ(yi − ŷi)² = 2.5680112
It follows that the least squares point estimate of the slope β1 is
b1 = SSxy / SSxx = −179.6475 / 1,404.355 = −.1279
Furthermore, because
ȳ = Σyi / 8 = 81.7 / 8 = 10.2125   and   x̄ = Σxi / 8 = 351.8 / 8 = 43.98
the least squares point estimate of the y-intercept β0 is
b0 = ȳ − b1x̄ = 10.2125 − (−.1279)(43.98) = 15.84

Since b1 = −.1279, we estimate that mean weekly fuel consumption decreases (since b1 is negative) by .1279 MMcf of natural gas when average hourly temperature increases by 1 degree. Since b0 = 15.84, we estimate that mean weekly fuel consumption is 15.84 MMcf of natural gas when average hourly temperature is 0°F. However, we have not observed any weeks with temperatures near 0, so making this interpretation of b0 might be dangerous. We discuss this point more fully after this example.

Table 11.4 gives predictions of fuel consumption for each observed week obtained by using the least squares line (or prediction equation)
ŷ = b0 + b1x = 15.84 − .1279x
The table also gives each of the residuals and squared residuals and the sum of squared residuals (SSE = 2.5680112) obtained by using this prediction equation. Notice that the SSE here, which was obtained using the least squares point estimates, is smaller than the SSE of Table 11.3, which was obtained using the visually fitted line ŷ = 15 − .1x. In general, it can be shown that the SSE obtained by using the least squares point estimates is smaller than the value of SSE that would be obtained by using any other estimates of β0 and β1.

Figure 11.7(a) illustrates the eight observed fuel consumptions (the dots in the figure) and the eight predicted fuel consumptions (the squares in the figure) given by the least squares line. The distances between the observed and predicted fuel consumptions are the residuals. Therefore, when we say that the least squares point estimates minimize SSE, we are saying that these estimates position the least squares line so as to minimize the sum of the squared distances between the observed and predicted fuel consumptions. In this sense, the least squares line is the best straight line that can be fitted to the eight observed fuel consumptions. Figure 11.7(b) gives the MINITAB output of this best fit line. Note that this output gives the least squares estimates b0 = 15.8379 and b1 = −.127922. In general, we will rely on MINITAB, Excel, and MegaStat to compute the least squares estimates (and to perform many other regression calculations).

FIGURE 11.7 The Least Squares Line for the Fuel Consumption Data: (a) the observed and predicted fuel consumptions; (b) MINITAB output of the least squares line (Y = 15.8379 − 0.127922X, R-Squared = 0.899)

Part 2: Estimating a mean fuel consumption and predicting an individual fuel consumption. We define the experimental region to be the range of the previously observed values of the average hourly temperature x. Because we have observed average hourly temperatures between 28°F and 62.5°F (see Table 11.4), the experimental region consists of the range of average hourly temperatures from 28°F to 62.5°F. The simple linear regression model relates
weekly fuel consumption y to average hourly temperature x for values of x that are in the experimental region. For such values of x, the least squares line is the estimate of the line of means. This implies that the point on the least squares line that corresponds to the average hourly temperature x,
ŷ = b0 + b1x = 15.84 − .1279x
is the point estimate of the mean of all the weekly fuel consumptions that could be observed when the average hourly temperature is x:
μy|x = β0 + β1x
Note that ŷ is an intuitively logical point estimate of μy|x. This is because the expression b0 + b1x used to calculate ŷ has been obtained from the expression β0 + β1x for μy|x by replacing the unknown values of β0 and β1 by their least squares point estimates b0 and b1.

The quantity ŷ is also the point prediction of the individual value
y = β0 + β1x + ε
which is the amount of fuel consumed in a single week when average hourly temperature equals x. To understand why ŷ is the point prediction of y, note that y is the sum of the mean β0 + β1x and the error term ε. We have already seen that ŷ = b0 + b1x is the point estimate of β0 + β1x. We will now reason that we should predict the error term ε to be 0, which implies that ŷ is also the point prediction of y. To see why we should predict the error term to be 0, note that in the next section we discuss several assumptions concerning the simple linear regression model. One implication of these assumptions is that the error term has a 50 percent chance of being positive and a 50 percent chance of being negative. Therefore, it is reasonable to predict the error term to be 0 and to use ŷ as the point prediction of a single value of y when the average hourly temperature equals x.

Now suppose a weather forecasting service predicts that the average hourly temperature in the next week will be 40°F. Because 40°F is in the experimental region,
ŷ = 15.84 − .1279(40) = 10.72 MMcf of natural gas
is (1) the point estimate of the mean weekly fuel consumption when the average hourly temperature is 40°F and (2) the point prediction of an individual weekly fuel consumption when the average hourly temperature is 40°F. This says that (1) we estimate that the average of all possible weekly fuel consumptions that could potentially be observed when the average hourly temperature is 40°F equals 10.72 MMcf of natural gas, and (2) we predict that the fuel consumption in a single week when the average hourly temperature is 40°F will be 10.72 MMcf of natural gas.

FIGURE 11.8 The least squares line ŷ = 15.84 − .1279x, the point estimate of mean fuel consumption when x = 40, the true mean fuel consumption when x = 40 (on the true line of means μy|x = β0 + β1x), and an individual value of fuel consumption when x = 40

FIGURE 11.9 Extrapolating the least squares line to x = 10: the relationship between mean fuel consumption and x might become curved at low temperatures

Figure 11.8 illustrates (1) the point estimate of mean fuel consumption when x is 40°F (the square on the least squares line), (2) the true mean fuel consumption when x is 40°F (the
triangle on the true line of means), and (3) an individual value of fuel consumption when x is 40°F (the dot in the figure). Of course this figure is only hypothetical. However, it illustrates that the point estimate of the mean value of y (which is also the point prediction of the individual value of y) will (unless we are extremely fortunate) differ from both the true mean value of y and the individual value of y. Therefore, it is very likely that the point prediction ŷ = 10.72, which is the natural gas company's transmission nomination for next week, will differ from next week's actual fuel consumption, y. It follows that we might wish to predict the largest and smallest that y might reasonably be. We will see how to do this in Section 11.5.

To conclude this example, note that Figure 11.9 illustrates the potential danger of using the least squares line to predict outside the experimental region. In the figure, we extrapolate the least squares line far beyond the experimental region to obtain a prediction for a temperature of 10°F. As shown in Figure 11.1, for values of x in the experimental region the observed values of y tend to decrease in a straight-line fashion as the values of x increase. However, for temperatures lower than 28°F the relationship between y and x might become curved. If it does, extrapolating the straight-line prediction equation to obtain a prediction for x = 10 might badly underestimate mean weekly fuel consumption (see Figure 11.9).

The previous example illustrates that when we are using a least squares regression line, we should not estimate a mean value or predict an individual value unless the corresponding value of x is in the experimental region (the range of the previously observed values of x). Often the value x = 0 is not in the experimental region. For example, consider the fuel consumption problem. Figure 11.9 illustrates that the average hourly temperature 0°F is not in the experimental region. In such a situation, it would not be appropriate to interpret the y-intercept b0 as the estimate of the mean value of y when x equals 0. For example, in the fuel consumption problem it would not be appropriate to use b0 = 15.84 as the point estimate of the mean weekly fuel consumption when average hourly temperature is 0. Therefore, because it is not meaningful to interpret the y-intercept in many regression situations, we often omit such interpretations. We now present a general procedure for estimating a mean value and predicting an individual value:
Let b0 and b1 be the least squares point estimates of the y-intercept β0 and the slope β1 in the simple linear regression model, and suppose that x0, a specified value of the independent variable x, is inside the experimental region. Then
ŷ = b0 + b1x0
is the point estimate of the mean value of the dependent variable when the value of the independent variable is x0. In addition, ŷ is the point prediction of an individual value of the dependent variable when the value of the independent variable is x0. Here we predict the error term to be 0.
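A small sketch of this procedure in code (ours, not part of the text); the experimental-region check mirrors the warning in the fuel consumption example:

# Sketch: point estimate/prediction yhat = b0 + b1*x0, restricted to the
# experimental region (the range of previously observed x values).
def predict(x0, b0, b1, x_observed):
    if not (min(x_observed) <= x0 <= max(x_observed)):
        raise ValueError("x0 is outside the experimental region")
    return b0 + b1 * x0

temps = [28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5]
print(predict(40.0, 15.84, -0.1279, temps))  # 10.724, about 10.72 MMcf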
For example, consider the QHIC case. It can be verified that the least squares point estimates for the QHIC data are b0 = −348.3921 and b1 = 7.2583. Since x0 = 220 is in the experimental region,
ŷ = b0 + b1(220) = −348.3921 + 7.2583(220) = 1,248.43
is the point estimate of the mean yearly upkeep expenditure for all homes worth $220,000 and is the point prediction of a yearly upkeep expenditure for an individual home worth $220,000.
The marketing department at QHIC wishes to determine which homes should be sent advertising brochures promoting QHIC's products and services. The prediction equation ŷ = b0 + b1x implies that the home value x corresponding to a predicted upkeep expenditure of ŷ is
x = (ŷ − b0) / b1 = (ŷ − (−348.3921)) / 7.2583 = (ŷ + 348.3921) / 7.2583
Therefore, for example, if QHIC wishes to send an advertising brochure to any home that has a predicted upkeep expenditure of at least $500, then QHIC should send this brochure to any home that has a value of at least
x = (500 + 348.3921) / 7.2583 = 116.886 ($116,886)
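The brochure cutoff can be computed by inverting the prediction equation; a sketch of ours using the QHIC estimates:

# Sketch: home value x (in $1,000s) whose predicted upkeep expenditure
# equals a target yhat, using x = (yhat - b0) / b1.
b0, b1 = -348.3921, 7.2583
target = 500.0                 # predicted upkeep of at least $500
x_cutoff = (target - b0) / b1
print(round(x_cutoff, 3))      # 116.886, i.e., about $116,886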
[Outputs for Exercises 11.19-11.21, left to right: a MINITAB regression plot for the starting salary data, with fitted line Y = 14.8156 + 5.70657X and R-Sq = 0.977; an Excel line fit plot of Minutes versus Copiers for the service time data, with fitted line Y = 11.4641 + 24.6022X; and a MINITAB plot of Demand, y, versus PriceDif, x4, for the Fresh detergent data.]

11.19 THE STARTING SALARY CASE StartSal
Using the leftmost output,
a Identify and interpret the least squares point estimates b0 and b1. Does the interpretation of b0 make practical sense?
b Use the least squares line to obtain a point estimate of the mean starting salary for all marketing graduates having a grade point average of 3.25 and a point prediction of the starting salary for an individual marketing graduate having a grade point average of 3.25.
11.20 THE SERVICE TIME CASE SrvcTime
Using the middle output,
a Identify and interpret the least squares point estimates b0 and b1. Does the interpretation of b0 make practical sense?
b Use the least squares line to obtain a point estimate of the mean time to service four copiers and a point prediction of the time to service four copiers on a single call.

11.21 THE FRESH DETERGENT CASE Fresh
Using the rightmost output,
a Identify and interpret the least squares point estimates b0 and b1. Does the interpretation of b0 make practical sense?
b Use the least squares line to obtain a point estimate of the mean demand in all sales periods when the price difference is .10 and a point prediction of the actual demand in an individual sales period when the price difference is .10.
c If Enterprise Industries wishes to maintain a price difference that corresponds to a predicted demand of 850,000 bottles (that is, ŷ = 8.5), what should this price difference be?

11.22 THE DIRECT LABOR COST CASE DirLab
Consider the direct labor cost data given in Exercise 11.11 (page 456), and suppose that a simple linear regression model is appropriate.
a Verify that b0 = 18.4880 and b1 = 10.1463 by using the formulas illustrated in Example 11.4 (pages 459-460).
b Interpret the meanings of b0 and b1. Does the interpretation of b0 make practical sense?
c Write the least squares prediction equation.
d Use the least squares line to obtain a point estimate of the mean direct labor cost for all batches of size 60 and a point prediction of the direct labor cost for an individual batch of size 60.

11.23 THE REAL ESTATE SALES PRICE CASE RealEst
Consider the sales price data given in Exercise 11.13 (page 456), and suppose that a simple linear regression model is appropriate.
a Verify that b0 = 48.02 and b1 = 5.7003 by using the formulas illustrated in Example 11.4 (pages 459-460).
b Interpret the meanings of b0 and b1. Does the interpretation of b0 make practical sense?
c Write the least squares prediction equation.
d Use the least squares line to obtain a point estimate of the mean sales price of all houses having 2,000 square feet and a point prediction of the sales price of an individual house having 2,000 square feet.
11.3 Model Assumptions and the Standard Error

In order to perform statistical inference using the simple linear regression model, we need to make certain assumptions about the error term ε. At any given value of x, there is a population of error term values that could potentially occur. These error term values describe the different potential effects on y of all factors other than the value of x. Therefore, these error term values explain the variation in the y values that could be observed when the independent variable is x. Our statement of the simple linear regression model assumes that μy|x, the mean of the population of all y values that could be observed when the independent variable is x, is β0 + β1x. This model also implies that ε = y − (β0 + β1x), so this is equivalent to assuming that the mean of the corresponding population of potential error term values is 0. In total, we make four assumptions, called the regression assumptions, about the simple linear regression model.
These assumptions can be stated in terms of potential y values or, equivalently, in terms of potential error term values. Following tradition, we begin by stating these assumptions in terms of potential error term values:

1 At any given value of x, the population of potential error term values has a mean equal to 0.
2 Constant Variance Assumption: At any given value of x, the population of potential error term values has a variance that does not depend on the value of x. That is, the different populations of potential error term values corresponding to different values of x have equal variances. We denote the constant variance as σ².
3 Normality Assumption: At any given value of x, the population of potential error term values has a normal distribution.
4 Independence Assumption: Any one value of the error term ε is statistically independent of any other value of ε. That is, the value of the error term ε corresponding to an observed value of y is statistically independent of the value of the error term corresponding to any other observed value of y.
Taken together, the first three assumptions say that, at any given value of x, the population of potential error term values is normally distributed with mean zero and a variance σ² that does not depend on the value of x. Because the potential error term values cause the variation in the potential y values, these assumptions imply that the population of all y values that could be observed when the independent variable is x is normally distributed with mean β0 + β1x and a variance σ² that does not depend on x. These three assumptions are illustrated in Figure 11.10 in the context of the fuel consumption problem. Specifically, this figure depicts the populations of weekly fuel consumptions corresponding to two values of average hourly temperature: 32.5 and 45.9. Note that these populations are shown to be normally distributed with different means (each of which is on the line of means) and with the same variance (or spread).

FIGURE 11.10 The populations of weekly fuel consumptions when x = 32.5 and x = 45.9, the observed value of y (9.4) when x = 45.9, and the line of means μy|x = β0 + β1x

The independence assumption is most likely to be violated when time series data are being utilized in a regression study. Intuitively, this assumption says that there is no pattern of positive error terms being followed (in time) by other positive error terms, and there is no pattern of positive error terms being followed by negative error terms. That is, there is no pattern of higher-than-average y values being followed by other higher-than-average y values, and there is no pattern of higher-than-average y values being followed by lower-than-average y values.

It is important to point out that the regression assumptions very seldom, if ever, hold exactly in any practical regression problem. However, it has been found that regression results are not extremely sensitive to mild departures from these assumptions. In practice, only pronounced
departures from these assumptions require attention. In optional Section 11.8 we show how to check the regression assumptions. Prior to doing this, we will suppose that the assumptions are valid in our examples.

In Section 11.2 we stated that, when we predict an individual value of the dependent variable, we predict the error term to be 0. To see why we do this, note that the regression assumptions state that, at any given value of the independent variable, the population of all error term values that can potentially occur is normally distributed with a mean equal to 0. Since we also assume that successive error terms (observed over time) are statistically independent, each error term has a 50 percent chance of being positive and a 50 percent chance of being negative. Therefore, it is reasonable to predict any particular error term value to be 0.

The mean square error and the standard error
To present statistical inference formulas in later sections, we need to be able to compute point estimates of σ² and σ, the constant variance and standard deviation of the error term populations. The point estimate of σ² is called the mean square error and the point estimate of σ is called the standard error. In the following box, we show how to compute these estimates:
If the regression assumptions are satisfied and SSE is the sum of squared residuals:
The point estimate of σ² is the mean square error
s² = SSE / (n − 2)
The point estimate of σ is the standard error
s = √(SSE / (n − 2))
In order to understand these point estimates, recall that σ² is the variance of the population of y values (for a given value of x) around the mean value μy|x. Because ŷ is the point estimate of this mean, it seems natural to use
SSE = Σ(yi − ŷi)²
to help construct a point estimate of σ². We divide SSE by n − 2 because it can be proven that doing so makes the resulting s² an unbiased point estimate of σ². Here we call n − 2 the number of degrees of freedom associated with SSE.
For example, consider the fuel consumption case. Table 11.4 shows that SSE = 2.568 (rounded). Since n = 8, the point estimate of σ² is the mean square error
s² = SSE / (n − 2) = 2.568 / 6 = .428
This implies that the point estimate of σ is the standard error
s = √.428 = .6542
As another example, it can be verified that the standard error for the simple linear regression model describing the QHIC data is s = 146.8970.

To conclude this section, note that in optional Section 11.9 we present a shortcut formula for calculating SSE. The reader may study Section 11.9 now or at any later point.
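A short script (ours, not from the text) reproduces these computations for the fuel consumption data:

# Sketch: s^2 = SSE / (n - 2) and s = sqrt(s^2) for the fuel data,
# using the least squares prediction equation yhat = 15.84 - 0.1279x.
import math

x = [28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5]
y = [12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5]
b0, b1 = 15.84, -0.1279

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
n = len(y)
s_squared = sse / (n - 2)    # about .428
s = math.sqrt(s_squared)     # about .6542
print(sse, s_squared, s)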
11.28 THE FRESH DETERGENT CASE Fresh
Refer to the Fresh detergent data of Exercise 11.9 (page 455). Given the SSE for these data, calculate s² and s.
11.29 THE DIRECT LABOR COST CASE DirLab
Refer to the direct labor cost data of Exercise 11.11 (page 456). Given that SSE = 747, calculate s² and s.
11.30 THE REAL ESTATE SALES PRICE CASE RealEst
Refer to the sales price data of Exercise 11.13 (page 456). Given that SSE = 896.8, calculate s² and s.
11.31 Ten sales regions of equal sales potential for a company were randomly selected. The advertising expenditures (in units of $10,000) in these 10 sales regions were purposely set during July of last year at, respectively, 5, 6, 7, 8, 9, 10, 11, 12, 13, and 14. The sales volumes (in units of $10,000) were then recorded for the 10 sales regions and found to be, respectively, 89, 87, 98, 110, 103, 114, 116, 110, 126, and 130. Assuming that the simple linear regression model is appropriate, it can be shown that b0 = 66.2121, b1 = 4.4303, and SSE = 222.8242. SalesVol Calculate s² and s.
11.4 Testing the Significance of the Slope and y Intercept

A simple linear regression model is not likely to be useful unless there is a significant relationship between y and x. To judge whether such a relationship exists, we test the null hypothesis
H0: β1 = 0
which says that there is no change in the mean value of y associated with an increase in x, versus the alternative hypothesis
Ha: β1 ≠ 0
which says that there is a (positive or negative) change in the mean value of y associated with an increase in x. It would be reasonable to conclude that x is significantly related to y if we can be quite certain that we should reject H0 in favor of Ha.

In order to test these hypotheses, recall that we compute the least squares point estimate b1 of the true slope β1 by using a sample of n observed values of the dependent variable y. A different sample of n observed y values would yield a different least squares point estimate b1. For example, consider the fuel consumption problem, and recall that we have observed eight average hourly temperatures. Corresponding to each temperature there is a (theoretically) infinite population of fuel consumptions that could potentially be observed at that temperature [see Table 11.5(a)]. Sample 1 in Table 11.5(b) is the sample of eight fuel consumptions that we have actually observed from these populations (these are the same fuel consumptions originally given
in Table 11.1). Samples 2 and 3 in Table 11.5(b) are two other samples that we could have observed. In general, an infinite number of such samples could be observed. Because each sample yields its own unique values of b1, b0, s², and s [see Table 11.5(c)-(f)], there is an infinite population of potential values of each of these estimates.

TABLE 11.5 Three Samples in the Fuel Consumption Case
(a) The eight populations of fuel consumptions, one for each observed average hourly temperature (weeks 1-8).
(b) Three samples, each consisting of eight fuel consumptions. Sample 1 is the observed data of Table 11.1. Sample 2: y1 = 12.0, y2 = 11.8, y3 = 12.3, y4 = 11.5, y5 = 9.1, y6 = 9.2, y7 = 8.5, y8 = 7.2. Sample 3: y1 = 10.7, y2 = 10.2, y3 = 10.5, y4 = 9.8, y5 = 9.5, y6 = 8.9, y7 = 8.5, y8 = 8.0.
(c)-(f) The values of b1, b0, s², and s given by each of the three samples.

If the regression assumptions hold, then the population of all possible values of b1 is normally distributed with a mean of β1 and with a standard deviation of
σb1 = σ / √SSxx
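This sampling variation can be illustrated by simulation. The following sketch (ours; the "true" parameter values and σ are assumed purely for the demonstration) draws many samples at the eight fixed temperatures, refits least squares each time, and compares the spread of the b1 estimates with σ/√SSxx:

# Sketch: simulate the sampling distribution of the slope estimate b1.
import math
import random

random.seed(2)
x = [28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5]
beta0, beta1, sigma = 15.84, -0.1279, 0.65   # assumed "true" values
n = len(x)
SSxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n

b1_estimates = []
for _ in range(10000):
    y = [beta0 + beta1 * xi + random.gauss(0.0, sigma) for xi in x]
    SSxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    b1_estimates.append(SSxy / SSxx)

mean_b1 = sum(b1_estimates) / len(b1_estimates)
sd_b1 = math.sqrt(sum((b - mean_b1) ** 2 for b in b1_estimates) / len(b1_estimates))
print(mean_b1)                  # close to beta1 = -0.1279
print(sd_b1)                    # close to sigma / sqrt(SSxx)
print(sigma / math.sqrt(SSxx))  # about 0.0173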
The standard error s is the point estimate of σ, so it follows that a point estimate of σ_b1 is
$$s_{b_1} = \frac{s}{\sqrt{SS_{xx}}}$$
which is called the standard error of the estimate b1. Furthermore, if the regression assumptions hold, then the population of all values of
$$\frac{b_1 - \beta_1}{s_{b_1}}$$
has a t distribution with n - 2 degrees of freedom. It follows that, if the null hypothesis H0: β1 = 0 is true, then the population of all possible values of the test statistic
$$t = \frac{b_1}{s_{b_1}}$$
has a t distribution with n - 2 degrees of freedom. Therefore, we can test the significance of the regression relationship as follows:
Testing the Significance of the Regression Relationship: Testing the Significance of the Slope
Define the test statistic
$$t = \frac{b_1}{s_{b_1}} \quad \text{where} \quad s_{b_1} = \frac{s}{\sqrt{SS_{xx}}}$$
and suppose that the regression assumptions hold. Then we can reject H0: β1 = 0 in favor of a particular alternative hypothesis at significance level α (that is, by setting the probability of a Type I error equal to α) if and only if the appropriate rejection point condition holds, or, equivalently, the corresponding p-value is less than α.
Alternative Hypothesis    Rejection Point Condition    p-Value
Ha: β1 ≠ 0                |t| > t_α/2                  Twice the area under the t curve to the right of |t|
Ha: β1 > 0                t > t_α                      The area under the t curve to the right of t
Ha: β1 < 0                t < -t_α                     The area under the t curve to the left of t

Here t_α/2, t_α, and all p-values are based on n - 2 degrees of freedom.
We usually use the two-sided alternative Ha: β1 ≠ 0 for this test of significance. However, sometimes a one-sided alternative is appropriate. For example, in the fuel consumption problem we can say that if the slope β1 is not 0, then it must be negative. A negative β1 would say that mean fuel consumption decreases as temperature x increases. Because of this, it would be appropriate to decide that x is significantly related to y if we can reject H0: β1 = 0 in favor of the one-sided alternative Ha: β1 < 0. Although this test would be slightly more effective than the usual two-sided test, there is little practical difference between using the one-sided or two-sided alternative. Furthermore, computer packages (such as MINITAB and Excel) present results for testing a two-sided alternative hypothesis. For these reasons we will emphasize the two-sided test. It should also be noted that
1 If we can decide that the slope is significant at the .05 significance level, then we have concluded that x is significantly related to y by using a test that allows only a .05 probability of concluding that x is significantly related to y when it is not. This is usually regarded as strong evidence that the regression relationship is significant.
2 If we can decide that the slope is significant at the .01 significance level, this is usually regarded as very strong evidence that the regression relationship is significant.
3 The smaller the significance level α at which H0 can be rejected, the stronger is the evidence that the regression relationship is significant.
In the fuel consumption case, recall that s = .6542 [see Examples 11.4 (pages 459-460) and 11.6 (page 467)] and that SSxx = 1,404.355. It follows that
$$s_{b_1} = \frac{s}{\sqrt{SS_{xx}}} = \frac{.6542}{\sqrt{1{,}404.355}} = .01746$$
and therefore that t = b1/s_b1 = -.1279/.01746 = -7.33.
F I G U R E 11.11 MINITAB and Excel Output of a Simple Linear Regression Analysis of the Fuel Consumption Data
(a) The MINITAB output
Regression Analysis
The regression equation is
FUELCONS = 15.8 - 0.128 TEMP

Predictor    Coef          Stdev        T          p
Constant     15.8379a      0.8018c      19.75e     0.000g
TEMP         -0.12792b     0.01746d     -7.33f     0.000

S = 0.6542h    R-sq = 89.9%i

Analysis of Variance
SOURCE       DF    SS         MS        F         p
Regression   1     22.981j    22.981    53.69m    0.000n
Error        6     2.568k     0.428
Total        7     25.549l

Fit 10.721o    Stdev.Fit 0.241p    95.0% CI (10.130, 11.312)q    95.0% PI (9.014, 12.428)r
a b0   b b1   c s_b0   d s_b1   e t for testing H0: β0 = 0   f t for testing H0: β1 = 0   g p-values for t statistics   h s = standard error   i r²   j Explained variation   k SSE = Unexplained variation   l Total variation   m F(model) statistic   n p-value for F(model)   o ŷ   p s_ŷ   q 95% confidence interval when x = 40   r 95% prediction interval when x = 40
(b) The Excel output

ANOVA          df    SS             MS          F           Significance F
Regression     1     22.98081629j   22.98082    53.69488m   0.000330052n
Residual       6     2.567933713k   0.427989
Total          7     25.54875l

               Coefficients     Standard Error    t Stat       P-Valueg    Lower 95%       Upper 95%
Intercept      15.83785741a     0.801773385c      19.75353e    1.09E-06    13.87598718     17.79972765
Temp           -0.127921715b    0.01745733d       -7.32768f    0.00033     -0.170638294o   -0.08520514o
a b0   b b1   c s_b0   d s_b1   e t for testing H0: β0 = 0   f t for testing H0: β1 = 0   g p-values for t statistics   h s = standard error   i r²   j Explained variation   k SSE = Unexplained variation   l Total variation   m F(model) statistic   n p-value for F(model)   o 95% confidence interval for β1
To test the significance of the slope we compare |t| with t_α/2 based on n - 2 = 8 - 2 = 6 degrees of freedom. Because
|t| = 7.33 > t.025 = 2.447
we can reject H0: β1 = 0 in favor of Ha: β1 ≠ 0 at level of significance .05. The p-value for testing H0 versus Ha is twice the area to the right of |t| = 7.33 under the curve of the t distribution having n - 2 = 6 degrees of freedom. Since this p-value can be shown to be .00033, we can reject H0 in favor of Ha at level of significance .05, .01, or .001. We therefore have extremely strong evidence that x is significantly related to y and that the regression relationship is significant. Figure 11.11 presents the MINITAB and Excel outputs of a simple linear regression analysis of the fuel consumption data. Note that b0 = 15.84, b1 = -.1279, s = .6542, s_b1 = .01746, and t = -7.33 (each of which has been previously calculated) are given on these outputs. Also note that Excel gives the p-value of .00033, and MINITAB has rounded this p-value to .000 (which means less than .001). Other quantities on the MINITAB and Excel outputs will be discussed later.
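As an informal check on this test, the following short Python sketch (our own illustration, using scipy rather than MINITAB or Excel) reproduces the t statistic and its two-sided p-value from the quantities quoted above.

from scipy import stats

# Fuel consumption case: b1 = -.1279, s_b1 = .01746, n = 8.
b1, s_b1, n = -0.1279, 0.01746, 8
t = b1 / s_b1                                # about -7.33

# Two-sided p-value from the t distribution with n - 2 = 6 df.
p_value = 2 * stats.t.sf(abs(t), df=n - 2)   # about .00033
print(t, p_value)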
In addition to testing the significance of the slope, it is often useful to calculate a confidence interval for β1. We show how this is done in the following box:
If the regression assumptions hold, a 100(1 - α) percent confidence interval for the true slope β1 is [b1 ± t_α/2 s_b1]. Here t_α/2 is based on n - 2 degrees of freedom.
The MINITAB and Excel outputs in Figure 11.11 tell us that b1 = -.1279 and s_b1 = .01746. Thus, for instance, because t.025 based on n - 2 = 8 - 2 = 6 degrees of freedom equals 2.447, a 95 percent confidence interval for β1 is
$$[b_1 \pm t_{.025}\, s_{b_1}] = [-.1279 \pm 2.447(.01746)] = [-.1706,\ -.0852]$$
This interval says we are 95 percent confident that, if average hourly temperature increases by one degree, then mean weekly fuel consumption will decrease (because both the lower bound and the upper bound of the interval are negative) by at least .0852 MMcf of natural gas and by at most .1706 MMcf of natural gas. Also, because the 95 percent confidence interval for β1 does not contain 0, we can reject H0: β1 = 0 in favor of Ha: β1 ≠ 0 at level of significance .05. Note that the 95 percent confidence interval for β1 is given on the Excel output but not on the MINITAB output.
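A minimal Python sketch of this confidence interval calculation (again with scipy; the variable names are ours):

from scipy import stats

b1, s_b1, n = -0.1279, 0.01746, 8
t_025 = stats.t.ppf(0.975, df=n - 2)   # 2.447
lower = b1 - t_025 * s_b1              # about -.1706
upper = b1 + t_025 * s_b1              # about -.0852
print(t_025, lower, upper)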
Figure 11.12 presents the MegaStat output of a simple linear regression analysis of the QHIC data. Below we summarize some important quantities from the output (we discuss the other quantities later):
b0 = -348.3921   b1 = 7.2583   s = 146.897   s_b1 = .4156
t = b1/s_b1 = 7.2583/.4156 = 17.466   p-value for t < .001
Since the p-value for testing the significance of the slope is less than .001, we can reject H0: β1 = 0 in favor of Ha: β1 ≠ 0 at the .001 level of significance. It follows that we have extremely strong evidence that the regression relationship is significant. The MegaStat output also tells us that a 95 percent confidence interval for the true slope β1 is [6.4170, 8.0995]. This interval says we are 95 percent confident that mean yearly upkeep expenditure increases by between $6.42 and $8.10 for each additional $1,000 increase in home value.
Testing the significance of the y-intercept  We can also test the significance of the y-intercept β0. We do this by testing the null hypothesis H0: β0 = 0 versus the alternative hypothesis Ha: β0 ≠ 0. To carry out this test we use the test statistic
$$t = \frac{b_0}{s_{b_0}} \quad \text{where} \quad s_{b_0} = s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{SS_{xx}}}$$
Here the rejection point and p-value conditions for rejecting H0 are the same as those given previously for testing the significance of the slope, except that t is calculated as b0/s_b0. For example, if we consider the fuel consumption problem and the MINITAB output in Figure 11.11, we see that b0 = 15.8379, s_b0 = .8018, t = 19.75, and p-value = .000. Because t = 19.75 > t.025 = 2.447 and p-value < .05, we can reject H0: β0 = 0 in favor of Ha: β0 ≠ 0 at the .05 level of significance.
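The standard error of b0 can be reproduced from the fuel consumption summary quantities (s = .6542, n = 8, x-bar = 43.98, SSxx = 1,404.355, the last two taken from Example 11.4). A minimal Python sketch:

import numpy as np
from scipy import stats

s, n, xbar, SSxx = 0.6542, 8, 43.98, 1404.355

s_b0 = s * np.sqrt(1.0 / n + xbar**2 / SSxx)   # about .8018
b0 = 15.8379
t = b0 / s_b0                                  # about 19.75
p_value = 2 * stats.t.sf(abs(t), df=n - 2)     # about 1.1e-06, matching the Excel output
print(s_b0, t, p_value)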
F I G U R E 11.12 MegaStat Output of a Simple Linear Regression Analysis of the QHIC Data
Regression Analysis
r² = 0.889i   r = 0.943   Std. Error = 146.897h   n = 40   k = 1   Dep. Var. = Upkeep
ANOVA table
Source        SS                 df    MS                F          p-value
Regression    6,582,759.6972j    1     6,582,759.6972    305.06m    9.49E-20n
Residual      819,995.5427k      38    21,578.8301
Total         7,402,755.2399l    39

Regression output
variables     coefficients    std. error    t (df = 38)    p-valueg
Intercept     -348.3921a      76.1410c      -4.576e        4.95E-05
Value         7.2583b         0.4156d       17.466f        9.49E-20

Predicted values for: Upkeep
Value      Predicted       Leverages
220.00     1,248.42597     0.042s
a b0   b b1   c s_b0   d s_b1   e t for testing H0: β0 = 0   f t for testing H0: β1 = 0   g p-values for t statistics   h s = standard error   i r²   j Explained variation   k SSE = Unexplained variation   l Total variation   m F(model) statistic   n p-value for F(model)   o 95% confidence interval for β1   p ŷ   q 95% confidence interval when x = 220   r 95% prediction interval when x = 220   s distance value
In fact, since the p-value is less than .001, we can also reject H0 at the .001 level of significance. This provides extremely strong evidence that the y-intercept β0 does not equal 0 and that we should include β0 in the fuel consumption model. In general, if we fail to conclude that the intercept is significant at a level of significance of .05, it might be reasonable to drop the y-intercept from the model. However, remember that β0 equals the mean value of y when x equals 0. If, logically speaking, the mean value of y would not equal 0 when x equals 0 (for example, in the fuel consumption problem, mean fuel consumption would not equal 0 when the average hourly temperature is 0), it is common practice to include the y-intercept whether or not H0: β0 = 0 is rejected. In fact, experience suggests that it is definitely safest, when in doubt, to include the intercept β0.
11.33 Give an example of a practical application of the confidence interval for β1.
h Calculate the 99 percent confidence interval for β1.
i Identify s_b0 and the t statistic for testing the significance of the y intercept. Show how t has been calculated by using b0 and s_b0.
j Identify the p-value for testing H0: β0 = 0 versus Ha: β0 ≠ 0. Using the p-value, determine whether we can reject H0 by setting α equal to .10, .05, .01, and .001. What do you conclude?
k Using the appropriate data set, show how s_b0 and s_b1 have been calculated. Hint: Calculate SSxx.
11.34 THE STARTING SALARY CASE StartSal
The MINITAB output of a simple linear regression analysis of the data set for this case (see Exercise 11.5 on page 454) is given in Figure 11.13. Recall that a labeled MINITAB regression output is on page 471.
11.35 THE SERVICE TIME CASE SrvcTime
The MegaStat output of a simple linear regression analysis of the data set for this case (see Exercise 11.7 on page 454) is given in Figure 11.14. Recall that a labeled MegaStat regression output is on page 473.
F I G U R E 11.13
MINITAB Output of a Simple Linear Regression Analysis of the Starting Salary Data
The regression equation is
SALARY = 14.8 + 5.71 GPA

Predictor   Coef     StDev    T       P
Constant    14.816   1.235    12.00   0.000
GPA         5.7066   0.3953   14.44   0.000

S = 0.5363

Analysis of Variance
Source       DF
Regression   1
Error        5
Total        6

Fit 33.362   StDev Fit 0.213
F I G U R E 11.14
Regression Analysis
MegaStat Output of a Simple Linear Regression Analysis of the Service Time Data
r² = 0.990   r = 0.995   Std. Error = 4.615   n = 11   k = 1   Dep. Var. = Minutes

ANOVA table
Source        SS             df    MS             F         p-value
Regression    19,918.8438    1     19,918.8438    935.15    2.09E-10
Residual      191.7017       9     21.3002
Total         20,110.5455    10

Regression output
variables     coefficients    std. error    t (df = 9)    p-value
Intercept     11.4641         3.4390        3.334         .0087
Copiers       24.6022         0.8045        30.580        2.09E-10

Predicted values
Predicted:    36.066   60.669   85.271   109.873   134.475   159.077   183.680
Leverage:     0.348    0.202    0.116    0.091     0.127     0.224     0.381
F I G U R E 11.15  MINITAB Output of a Simple Linear Regression Analysis of the Fresh Detergent Demand Data
The regression equation is
Demand = 7.81 + 2.67 PriceDif

Predictor   Coef      StDev     T       P
Constant    7.81409   0.07988   97.82   0.000
PriceDif    2.6652    0.2585    10.31   0.000

S = 0.3166   R-Sq = 79.2%   R-Sq(adj) = 78.4%

Analysis of Variance
Source       DF   SS       MS       F        P
Regression   1    10.653   10.653   106.30   0.000
Error        28   2.806    0.100
Total        29   13.459

Fit      StDev Fit
8.0806   0.0648
8.4804   0.0586
F I G U R E 11.16
Excel and MegaStat Output of a Simple Linear Regression Analysis of the Direct Labor Cost Data
(a) The Excel output

ANOVA          df    SS             MS             F              Significance F
Regression     1     9945.418067    9945.418067    13720.46769    5.04436E-17
Residual       10    7.248599895    0.724859989
Total          11    9952.666667

               Coefficients     Standard Error    t Stat          P-value        Lower 95%       Upper 95%
Intercept      -1.787514479     0.473848084       -3.772336617    0.003647528    -2.843313988    -0.73171497
LaborCost      0.098486713      0.000840801       117.1344001     5.04436E-17    0.096613291     0.100360134

(b) The MegaStat output

Predicted    Leverage
627.263      0.104
11.36 THE FRESH DETERGENT CASE Fresh
The MINITAB output of a simple linear regression analysis of the data set for this case (see Exercise 11.9 on page 455) is given in Figure 11.15. Recall that a labeled MINITAB regression output is on page 471.
11.37 THE DIRECT LABOR COST CASE DirLab
The Excel and MegaStat output of a simple linear regression analysis of the data set for this case (see Exercise 11.11 on page 456) is given in Figure 11.16. Recall that labeled Excel and MegaStat regression outputs are on pages 471 and 473.
11.38 THE REAL ESTATE SALES PRICE CASE RealEst
The MINITAB output of a simple linear regression analysis of the data set for this case (see Exercise 11.13 on page 456) is given in Figure 11.17 on page 476. Recall that a labeled MINITAB regression output is on page 471.
11.39 Find and interpret a 95 percent confidence interval for the slope β1 of the simple linear regression model describing the sales volume data in Exercise 11.31 (page 468). SalesVol
F I G U R E 11.17
MINITAB Output of a Simple Linear Regression Analysis of the Sales Price Data
The regression equation is
SPrice = 48.0 + 5.70 HomeSize

Predictor   Coef     StDev    T      P
Constant    48.02    14.41    3.33   0.010
HomeSize    5.7003   0.7457   7.64   0.000

S = 10.59   R-Sq = 88.0%   R-Sq(adj) = 86.5%

Analysis of Variance
Source       DF   SS       MS       F       P
Regression   1    6550.7   6550.7   58.43   0.000
Error        8    896.8    112.1
Total        9    7447.5

Fit 162.03   StDev Fit 3.47
F I G U R E 11.18
Excel Output of a Simple Linear Regression Analysis of the Fast-Food Restaurant Rating Data
Regression Statistics
Multiple R            0.98728324
R Square              0.974728196
Adjusted R Square     0.968410245
Standard Error        0.183265974
Observations          6

ANOVA          df    SS             MS             F              Significance F
Regression     1     5.181684399    5.181684399    154.2791645    0.000241546
Residual       4     0.134345669    0.033586417
Total          5     5.316030068

               Coefficients     Standard Error    t Stat          P-Value        Lower 95%      Upper 95%
Intercept      -0.160197061     0.302868523       -0.528932686    0.624842601    -1.00109663    0.680702508
MEANTASTE      1.27312365       0.102498367       12.42091641     0.000241546    0.988541971    1.557705329
11.40 THE FAST-FOOD RESTAURANT RATING CASE FastFood
In the early 1990s researchers at The Ohio State University studied consumer ratings of six fast-food restaurants: Borden Burger, Hardee's, Burger King, McDonald's, Wendy's, and White Castle. Each of 406 randomly selected individuals gave each restaurant a rating of 1, 2, 3, 4, 5, or 6 on the basis of taste, and then ranked the restaurants from 1 through 6 on the basis of overall preference. In each case, 1 is the best rating and 6 the worst. The mean ratings given by the 406 individuals are given in the following table:
Restaurant        Mean Taste    Mean Preference
Borden Burger     3.5659        4.2552
Hardee's          3.329         4.0911
Burger King       2.4231        3.0052
McDonald's        2.0895        2.2429
Wendy's           1.9661        2.5351
White Castle      3.8061        4.7812
Figure 11.18 gives the Excel output of a simple linear regression analysis of these data. Here, mean preference is the dependent variable and mean taste is the independent variable. Recall that a labeled Excel regression output is given on page 471.
a Identify the least squares point estimates b0 and b1 of β0 and β1.
b Identify SSE, s², and s.
c Identify s_b1.
d Identify the t statistic for testing H0: β1 = 0 versus Ha: β1 ≠ 0.
e Identify the p-value for testing H0: β1 = 0 versus Ha: β1 ≠ 0. Would we reject H0 at α = .05? At α = .01? At α = .001?
f Identify and interpret a 95 percent confidence interval for β1.

11.5 Confidence and Prediction Intervals
Unless we are very lucky, ŷ will not exactly equal either the mean value of y when x equals x0 or a particular individual value of y when x equals x0. Therefore, we need to place bounds on how far ŷ might be from these values. We can do this by calculating a confidence interval for the mean value of y and a prediction interval for an individual value of y. Both of these intervals employ a quantity called the distance value. For simple linear regression this quantity is calculated as follows:
$$\text{Distance value} = \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{xx}}$$
This quantity is given its name because it is a measure of the distance between the value x0 of x and x-bar, the average of the previously observed values of x. Notice from the above formula that the farther x0 is from x-bar, which can be regarded as the center of the experimental region, the larger is the distance value. The significance of this fact will become apparent shortly. We now consider establishing a confidence interval for the mean value of y when x equals a particular value x0 (for later reference, we call this mean value μ_y|x0). Because each possible sample of n values of the dependent variable gives values of b0 and b1 that differ from the values given by other samples, different samples give different values of the point estimate ŷ = b0 + b1x0.
It can be shown that, if the regression assumptions hold, then the population of all possible values of ŷ is normally distributed with mean μ_y|x0 and standard deviation
$$\sigma_{\hat{y}} = \sigma\sqrt{\text{Distance value}}$$
The point estimate of σ_ŷ is
$$s_{\hat{y}} = s\sqrt{\text{Distance value}}$$
which is called the standard error of the estimate ŷ. Using this standard error, we form a confidence interval as follows:
If the regression assumptions hold, a 100(1 - α) percent confidence interval for the mean value of y when the value of the independent variable is x0 is
$$[\hat{y} \pm t_{\alpha/2}\, s\sqrt{\text{Distance value}}]$$
Here t_α/2 is based on n - 2 degrees of freedom.
In the fuel consumption problem, suppose we wish to compute a 95 percent confidence interval for the mean value of weekly fuel consumption when the average hourly temperature is x0 = 40°F. From Example 11.4 (pages 459 and 460), the point estimate of this mean is
$$\hat{y} = b_0 + b_1 x_0 = 15.84 + (-.1279)(40) = 10.72 \text{ MMcf of natural gas}$$
Furthermore, using the information in Example 11.4, we compute
$$\text{Distance value} = \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{xx}} = \frac{1}{8} + \frac{(40 - 43.98)^2}{1{,}404.355} = .1362$$
Since s = .6542 (see Example 11.6 on page 467) and since t_α/2 = t.025 based on n - 2 = 8 - 2 = 6 degrees of freedom equals 2.447, it follows that the desired 95 percent confidence interval is
$$[\hat{y} \pm t_{\alpha/2}\, s\sqrt{\text{Distance value}}] = [10.72 \pm 2.447(.6542)\sqrt{.1362}] = [10.72 \pm .59] = [10.13,\ 11.31]$$
This interval says we are 95 percent confident that the mean (or average) of all the weekly fuel consumptions that would be observed in all weeks having an average hourly temperature of 40°F is between 10.13 MMcf of natural gas and 11.31 MMcf of natural gas.
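A minimal Python sketch of the distance value and confidence interval computation just shown (values keyed in from this example; scipy supplies the t point):

import numpy as np
from scipy import stats

# Fuel consumption case at x0 = 40: y-hat = 10.72, s = .6542, n = 8,
# x-bar = 43.98, SSxx = 1,404.355 (values from Examples 11.4 and 11.6).
y_hat, s, n, xbar, SSxx, x0 = 10.72, 0.6542, 8, 43.98, 1404.355, 40.0

distance = 1.0 / n + (x0 - xbar) ** 2 / SSxx    # about .1362
t_025 = stats.t.ppf(0.975, df=n - 2)            # 2.447
half = t_025 * s * np.sqrt(distance)            # about .59
print(distance, (y_hat - half, y_hat + half))   # about (10.13, 11.31)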
We develop an interval for an individual value of y when x equals a particular value x0 by considering the prediction error y - ŷ. After observing each possible sample and calculating the point prediction based on that sample, we could observe any one of an infinite number of different individual values of y (because of different possible error terms). Therefore, there are an infinite number of different prediction errors that could be observed. If the regression assumptions hold, it can be shown that the population of all possible prediction errors is normally distributed with mean 0 and standard deviation
$$\sigma_{(y - \hat{y})} = \sigma\sqrt{1 + \text{Distance value}}$$
The point estimate of σ_(y - ŷ) is
$$s_{(y - \hat{y})} = s\sqrt{1 + \text{Distance value}}$$
which is called the standard error of the prediction error. Using this quantity we obtain a prediction interval as follows:
If the regression assumptions hold, a 100(1 - α) percent prediction interval for an individual value of y when the value of the independent variable is x0 is
$$[\hat{y} \pm t_{\alpha/2}\, s\sqrt{1 + \text{Distance value}}]$$
Here t_α/2 is based on n - 2 degrees of freedom.
$$[10.72 \pm 1.71] = [9.01,\ 12.43]$$
Here t_α/2 = t.025 is again based on n - 2 = 6 degrees of freedom. This interval says we are 95 percent confident that the individual fuel consumption in a future single week having an average hourly temperature of 40°F will be between 9.01 MMcf of natural gas and 12.43 MMcf of natural gas. Because the weather forecasting service has predicted that the average hourly temperature in the next week will be 40°F, we can use the prediction interval to evaluate how well our regression model is likely to predict next week's fuel consumption and to evaluate whether the natural gas company will be assessed a transmission fine. First, recall that the point prediction ŷ = 10.72 given by our model is the natural gas company's transmission nomination for next week. Also, note that the half-length of the 95 percent prediction interval given by our model is 1.71, which is (1.71/10.72)100% = 15.91% of the transmission nomination. It follows that we are 95 percent confident that the actual amount of natural gas that will be used by the city next week will differ from the natural gas company's transmission nomination by no more than 15.91 percent. That is, we are 95 percent confident that the natural gas company's percentage nomination error will be less than or equal to 15.91 percent. Although this does not imply that the natural gas company is likely to make a terribly inaccurate nomination, we are not confident that the company's percentage nomination error will be within the 10 percent allowance granted by the pipeline transmission system. Therefore, the natural gas company may be assessed a transmission fine. In Chapter 12 we use a multiple regression model to substantially reduce the natural gas company's percentage nomination errors. Below we repeat the bottom of the MINITAB output in Figure 11.11(a). This output gives the point estimate and prediction ŷ = 10.72, the 95 percent confidence interval for the mean value of y when x equals 40, and the 95 percent prediction interval for an individual value of y when x equals 40.
Fit 10.721   Stdev Fit 0.241   95.0% CI (10.130, 11.312)   95.0% PI (9.014, 12.428)
Although the MINITAB output does not directly give the distance value, it does give s_ŷ = s√(Distance value) under the heading "Stdev Fit." A little algebra shows that this implies that the distance value equals (s_ŷ/s)². Specifically, because s_ŷ = .241 and s = .6542 [see the MINITAB output in Figure 11.11(a) on page 471], it follows that the distance value equals (.241/.6542)² = .1357. This distance value is (within rounding) equal to the distance value that we hand calculated in Example 11.10. Figure 11.19 illustrates and compares the 95 percent confidence interval for the mean value of y when x equals 40 and the 95 percent prediction interval for an individual value of y when x equals 40. We see that both intervals are centered at ŷ = 10.72. However, the prediction interval is longer than the confidence interval. This is because the formula for the prediction interval has an extra 1 under the radical, which accounts for the added uncertainty introduced by our not knowing the value of the error term (which we nevertheless predict to be 0, while it probably will not equal 0). Figure 11.19 hypothetically supposes that the true values of β0 and β1 are 15.77 and -.1281, and also supposes that in a future week when x equals 40 the error term will equal 1.05 (no human being would actually know these facts). Assuming these values, the mean value of y when x equals 40 would be
$$\beta_0 + \beta_1(40) = 15.77 + (-.1281)(40) = 10.65$$
F I G U R E 11.19
A Comparison of a Condence Interval for the Mean Value of y When x Equals 40 and a Prediction Interval for an Individual Value of y When x Equals 40
[The figure shows a scale of values of fuel consumption on which both intervals are centered at ŷ = 10.72: the 95 percent confidence interval [10.13, 11.31] with half-length .59, and the 95 percent prediction interval [9.01, 12.43] with half-length 1.71. A hypothetical true mean value of y when x equals 40 (= 10.65) and a hypothetical individual value of y when x equals 40 (= 11.7) are also marked.]
F I G U R E 11.20
MINITAB Output of 95% Condence and Prediction Intervals for the Fuel Consumption Case
and in the future week the individual value of y when x equals 40 would be
$$\beta_0 + \beta_1(40) + \varepsilon = 10.65 + 1.05 = 11.7$$
As illustrated in the figure, the 95 percent confidence interval contains the mean 10.65. However, this interval is not long enough to contain the individual value 11.7; remember, it is not meant to contain this individual value of y. In contrast, the 95 percent prediction interval is long enough and does contain the individual value 11.7. Of course, the relative positions of the quantities shown in Figure 11.19 will vary in different situations. However, this figure emphasizes that we must be careful to include the extra 1 under the radical when computing a prediction interval for an individual value of y.
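The prediction interval differs from the confidence interval only by that extra 1 under the radical. A short Python sketch using the same fuel consumption quantities:

import numpy as np
from scipy import stats

y_hat, s, n, distance = 10.72, 0.6542, 8, 0.1362
t_025 = stats.t.ppf(0.975, df=n - 2)          # 2.447
half = t_025 * s * np.sqrt(1.0 + distance)    # about 1.71
print((y_hat - half, y_hat + half))           # about (9.01, 12.43)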
To conclude this example, note that Figure 11.20 illustrates the MINITAB output of the 95 percent confidence and prediction intervals corresponding to all values of x in the experimental region. Here x-bar = 43.98 can be regarded as the center of the experimental region. Notice that the farther x is from x-bar = 43.98, the larger is the distance value and, therefore, the longer are the 95 percent confidence and prediction intervals. These longer intervals are undesirable because they give us less information about mean and individual values of y.
In general, the prediction interval is useful if, as in the fuel consumption problem, it is important to predict an individual value of the dependent variable. A confidence interval is useful if it is important to estimate the mean value. Although it is not important to estimate a mean value in the fuel consumption problem, it is important to estimate a mean value in other situations. To understand this, recall that the mean value is the average of all the values of the dependent variable that could potentially be observed when the independent variable equals a particular value. Therefore, it might be important to estimate the mean value if we will observe and are affected by a very large number of values of the dependent variable when the independent variable equals a particular value. We illustrate this in the following example.
ŷ = 1,248.43 (that is, $1,248.43). This predicted value is given at the bottom of the MegaStat output in Figure 11.12, which we repeat here:
Predicted values for: Upkeep
Value      Predicted       Leverage
220.00     1,248.42597     0.042
In addition to giving ŷ = 1,248.43, the MegaStat output also tells us that the distance value, which is given under the heading "Leverage" on the output, equals .042. Therefore, since s equals 146.897 (see Figure 11.12 on page 473), it follows that a 95 percent prediction interval for the yearly upkeep expenditure of an individual home worth $220,000 is calculated as follows:
$$[\hat{y} \pm t_{.025}\, s\sqrt{1 + \text{Distance value}}] = [1{,}248.43 \pm t_{.025}(146.897)\sqrt{1.042}]$$
Here t.025 is based on n - 2 = 40 - 2 = 38 degrees of freedom. Although the t table in Table A.4 does not give t.025 for 38 degrees of freedom, MegaStat uses the appropriate t point and finds that the 95 percent prediction interval is [944.93, 1,551.92]. This interval is given on the MegaStat output. There are many homes worth roughly $220,000 in the metropolitan area; QHIC is more interested in the mean upkeep expenditure for all such homes than in the individual upkeep expenditure for one such home. The MegaStat output tells us that a 95 percent confidence interval for this mean upkeep expenditure is [1,187.79, 1,309.06]. This interval says that QHIC is 95 percent confident that the mean upkeep expenditure for all homes worth $220,000 is at least $1,187.79 and is no more than $1,309.06.
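A sketch of the MegaStat calculation in Python; scipy supplies the t point for 38 degrees of freedom that Table A.4 does not list. Small differences from MegaStat's [944.93, 1,551.92] come from rounding the leverage to .042.

import numpy as np
from scipy import stats

# QHIC case at x0 = 220: y-hat = 1,248.43, s = 146.897, n = 40, leverage = .042.
y_hat, s, n, distance = 1248.43, 146.897, 40, 0.042
t_025 = stats.t.ppf(0.975, df=n - 2)        # about 2.024
half = t_025 * s * np.sqrt(1.0 + distance)
print((y_hat - half, y_hat + half))         # about (944.9, 1552.0)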
11.45 THE SERVICE TIME CASE SrvcTime
[MegaStat output of predicted service times for 1 through 7 copiers: 36.066, 60.669, 85.271, 109.873, 134.475, 159.077, 183.680, together with 95 percent confidence and prediction intervals]
a Report a point estimate of and a 95 percent confidence interval for the mean time to service four copiers.
b Report a point prediction of and a 95 percent prediction interval for the time to service four copiers on a single call.
c If we examine the service time data, we see that there was at least one call on which AccuCopiers serviced each of 1, 2, 3, 4, 5, 6, and 7 copiers. The 95 percent confidence intervals for the mean service times on these calls might be used to schedule future service calls. To understand this, note that a person making service calls will (in, say, a year or more) make a very large number of service calls. Some of the person's individual service times will be below, and some will be above, the corresponding mean service times. However, since the very large number of individual service times will average out to the mean service times, it seems fair both to the efficiency of the company and to the person making service calls to schedule service calls by using estimates of the mean service times. Therefore, suppose we wish to schedule a call to service five copiers. Examining the MegaStat output, we see that a 95 percent confidence interval for the mean time to service five copiers is [130.8, 138.2]. Since the mean time might be 138.2 minutes, it would seem fair to allow 138 minutes to make the service call. Now suppose we wish to schedule a call to service four copiers. Determine how many minutes to allow for the service call.
11.46 THE FRESH DETERGENT CASE Fresh
The information at the bottom of the MINITAB output in Figure 11.15 (page 475) corresponds to future sales periods in which the price difference will be .10 and .25, respectively.
a Report a point estimate of and a 95 percent confidence interval for the mean demand for Fresh in all sales periods when the price difference is .10.
b Report a point prediction of and a 95 percent prediction interval for the actual demand for Fresh in an individual sales period when the price difference is .10.
c Locate the confidence interval and prediction interval you found in parts a and b on the following MINITAB output:
[MINITAB fitted-line plot of Demand versus PriceDif (PriceDif from -0.2 to 0.6), showing the fitted line with confidence and prediction bands]
d Find 99 percent confidence and prediction intervals for the mean and actual demands referred to in parts a and b. Hint: Solve for the distance value; see Example 11.11.
e Repeat parts a, b, c, and d for sales periods in which the price difference is .25.
11.47 THE DIRECT LABOR COST CASE DirLab
The information on the MegaStat output in Figure 11.16(b) (page 475) relates to a batch of size 60.
a Report a point estimate of and a 95 percent confidence interval for the mean direct labor cost of all batches of size 60.
b Report a point estimate of and a 95 percent prediction interval for the actual direct labor cost of an individual batch of size 60.
c Find 99 percent confidence and prediction intervals for the mean and actual direct labor costs referred to in parts a and b. Hint: Find the distance value on the MegaStat output.
11.48 THE REAL ESTATE SALES PRICE CASE RealEst
The information at the bottom of the MINITAB output in Figure 11.17 (page 476) relates to a house that will go on the market in the future and have 2,000 square feet.
a Report a point estimate of and a 95 percent confidence interval for the mean sales price of all houses having 2,000 square feet.
b Report a point estimate of and a 95 percent prediction interval for the sales price of an individual house having 2,000 square feet.
11.49 THE FAST-FOOD RESTAURANT RATING CASE FastFood
Figure 11.21 gives the MINITAB output of a simple linear regression analysis of the fast-food restaurant rating data in Exercise 11.40 (page 476). The information at the bottom of the output relates to a fast-food restaurant that has a mean taste rating of 1.9661. Find a point prediction of and a 95 percent prediction interval for the mean preference ranking (given by 406 randomly selected individuals) of the restaurant.
11.50 Ott (1987) presents a study of the amount of heat loss for a certain brand of thermal pane window. Three different windows were randomly assigned to each of three different outdoor temperatures. For each trial the indoor window temperature was controlled at 68°F and 50 percent relative humidity. The heat losses at the outdoor temperature 20°F were 86, 80, and 77. The heat losses at the outdoor temperature 40°F were 78, 84, and 75. The heat losses at the outdoor temperature 60°F were 33, 38, and 43. Use the simple linear regression model to find a point prediction of and a 95 percent prediction interval for the heat loss of an individual window when the outdoor temperature is HeatLoss
a 20°F   b 30°F   c 40°F   d 50°F   e 60°F
11.51 In an article in the Journal of Accounting Research, Benzion Barlev and Haim Levy consider relating accounting rates on stocks and market returns. Fifty-four companies were selected. For each company the authors recorded values of x, the mean yearly accounting rate for the period 1959 to 1974, and y, the mean yearly market return rate for the period 1959 to 1974. The data in
F I G U R E 11.21
MINITAB Output of a Simple Linear Regression Analysis of the Fast-Food Restaurant Rating Data
The regression equation is
MEANPREF = -0.160 + 1.27 MEANTASTE

Predictor   Coef      StDev    T       P
Constant    -0.1602   0.3029   -0.53   0.625
MEANTAST    1.2731    0.1025   12.42   0.000

S = 0.1833
R-Sq = 97.5%   R-Sq(adj) = 96.8%
Analysis of Variance
Source       DF   SS       MS       F        P
Regression   1    5.1817   5.1817   154.28   0.000
Error        4    0.1343   0.0336
Total        5    5.3160

Fit 2.3429   StDev Fit 0.1186
T A B L E 11.6  Accounting Rates and Market Return Rates for 54 Companies  AcctRet

Company                              Accounting Rate    Market Rate
McDonnell Douglas                    17.96              5.71
NCR                                  8.11               13.38
Honeywell                            12.46              13.43
TRW                                  14.70              10.00
Raytheon                             11.90              16.66
W. R. Grace                          9.67               9.40
Ford Motors                          13.35              .24
Textron                              16.11              4.37
Lockheed Aircraft                    6.78               3.11
Getty Oil                            9.41               6.63
Atlantic Richfield                   8.96               14.73
Radio Corporation of America         14.17              6.15
Westinghouse Electric                9.12               5.96
Johnson & Johnson                    14.23              6.30
Champion International               10.43              .68
R. J. Reynolds                       19.74              12.22
General Dynamics                     6.42               .90
Colgate-Palmolive                    12.16              2.35
Coca-Cola                            23.19              5.03
International Business Machines      19.20              6.13
Allied Chemical                      10.76              6.58
Uniroyal                             8.49               14.26
Greyhound                            17.70              2.60
Cities Service                       9.10               4.97
Philip Morris                        17.47              6.65
General Motors                       18.45              4.25
Philips Petroleum                    10.06              7.30

Company                              Accounting Rate
FMC                                  13.30
Caterpillar Tractor                  17.66
Georgia Pacific                      14.59
Minnesota Mining & Manufacturing     20.94
Standard Oil (Ohio)                  9.62
American Brands                      16.32
Aluminum Company of America          8.19
General Electric                     15.74
General Tire                         12.02
Borden                               11.44
American Home Products               32.58
Standard Oil (California)            11.89
International Paper                  10.06
National Steel                       9.60
Republic Steel                       7.41
Warner Lambert                       19.88
U.S. Steel                           6.97
Bethlehem Steel                      7.90
Armco Steel                          9.34
Texaco                               15.40
Shell Oil                            11.95
Standard Oil (Indiana)               9.56
Owens Illinois                       10.05
Gulf Oil                             12.11
Tenneco                              11.53
Inland Steel                         9.92
Kraft                                12.27
[Market Rate column for this second group of companies not recovered]
Source: Reprinted by permission from Benzion Barlev and Haim Levy, "On the Variability of Accounting Income Numbers," Journal of Accounting Research (Autumn 1979), pp. 305-315.
Table 11.6 were obtained. Here the accounting rate can be interpreted to represent input into investment and therefore is a logical predictor of market return. Use the simple linear regression model and a computer to do the following: AcctRet
a Find a point estimate of and a 95 percent confidence interval for the mean market return rate of all stocks having an accounting rate of 15.00.
b Find a point prediction of and a 95 percent prediction interval for the market return rate of an individual stock having an accounting rate of 15.00.
F I G U R E 11.22
The Reduction in the Prediction Errors Accomplished by Employing the Predictor Variable x
(a) Prediction errors for the fuel consumption problem when we do not use the information contributed by x [each y is predicted by y-bar = 10.21]
(b) Prediction errors for the fuel consumption problem when we use the information contributed by x by using the least squares line ŷ = 15.84 - .1279x
Figures 11.22(a) and (b) show the reduction in the prediction errors accomplished by employing the predictor variable x (and the least squares line). Using the predictor variable x decreases the prediction error in predicting yi from (yi - y-bar) to (yi - ŷi), or by an amount equal to (yi - y-bar) - (yi - ŷi) = (ŷi - y-bar). It can be shown that in general
$$\sum (y_i - \bar{y})^2 = \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2$$
The sum of squared prediction errors obtained when we do not employ the predictor variable x, Σ(yi - y-bar)², is called the total variation. Intuitively, this quantity measures the total amount of variation exhibited by the observed values of y. The sum of squared prediction errors obtained when we use the predictor variable x, Σ(yi - ŷi)², is called the unexplained variation (this is another name for SSE). Intuitively, this quantity measures the amount of variation in the values of y that is not explained by the predictor variable. The quantity Σ(ŷi - y-bar)² is called the explained variation. Using these definitions and the above equation involving these summations, we see that
Total variation = Unexplained variation + Explained variation
It follows that the explained variation is the reduction in the sum of squared prediction errors that has been accomplished by using the predictor variable x to predict y. It also follows that
Total variation = Explained variation + Unexplained variation
Intuitively, this equation implies that the explained variation represents the amount of the total variation in the observed values of y that is explained by the predictor variable x (and the simple linear regression model). We now define the simple coefficient of determination to be
$$r^2 = \frac{\text{Explained variation}}{\text{Total variation}}$$
That is, r² is the proportion of the total variation in the n observed values of y that is explained by the simple linear regression model. Neither the explained variation nor the total variation can be negative (both quantities are sums of squares). Therefore, r² is greater than or equal to 0. Because the explained variation must be less than or equal to the total variation, r² cannot be greater than 1. The nearer r² is to 1, the larger is the proportion of the total variation that is explained by the model, and the greater is the utility of the model in predicting y. If the value of r² is not reasonably close to 1, the independent variable in the model does not provide accurate predictions of y. In such a case, a different predictor variable must be found in order to accurately predict y. It is also possible that no regression model employing a single predictor variable will accurately predict y. In this case the model must be improved by including more than one independent variable. We see how to do this in Chapter 12. In the following box we summarize the results of this section:
For the simple linear regression model:
1 Total variation = Σ(yi - y-bar)²
2 Explained variation = Σ(ŷi - y-bar)²
3 Unexplained variation = Σ(yi - ŷi)²
4 Total variation = Explained variation + Unexplained variation
5 The simple coefficient of determination is
r² = Explained variation / Total variation
and r² is the proportion of the total variation in the n observed values of the dependent variable that is explained by the simple linear regression model.
This output tells us that the explained variation is 22.981 (see "SS Regression"), the unexplained variation is 2.568 (see "SS Error"), and the total variation is 25.549 (see "SS Total"). It follows that
$$r^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{22.981}{25.549} = .899$$
This value of r² says that the regression model explains 89.9 percent of the total variation in the eight observed fuel consumptions. Note that r² is given on the MINITAB output (see "R-Sq") and is expressed as a percentage.5 Also note that the quantities discussed here are given in the Excel output in Figure 11.11(b) on page 471.
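A two-line Python check of these quantities:

# Fuel consumption case: variations read off the MINITAB ANOVA table.
explained, total = 22.981, 25.549
unexplained = total - explained    # 2.568, which is SSE
r2 = explained / total             # about .899
print(unexplained, r2)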
ANOVA table
Source        SS                 df    MS                F         p-value
Regression    6,582,759.6972     1     6,582,759.6972    305.06    9.49E-20
Residual      819,995.5427       38    21,578.8301
Total         7,402,755.2399     39
This output gives the explained variation, the unexplained variation, and the total variation under the respective headings "SS Regression," "SS Residual," and "SS Total." The output also tells us that r² equals .889. Therefore, the simple linear regression model that employs home value as a predictor variable explains 88.9 percent of the total variation in the 40 observed home upkeep expenditures. Before continuing, note that in optional Section 11.9 we present some shortcut formulas for calculating the total, explained, and unexplained variations.
The simple correlation coefficient, r  People often claim that two variables are correlated. For example, a college admissions officer might feel that the academic performance of college students (measured by grade point average) is correlated with the students' scores on a standardized college entrance examination. This means that college students' grade point averages are related to their college entrance exam scores. One measure of the relationship between two variables y and x is the simple correlation coefficient. We define this quantity as follows:
The simple correlation coefficient between y and x is
r = +√r² if b1 is positive, and r = -√r² if b1 is negative
where b1 is the slope of the least squares line relating y to x. This correlation coefficient measures the strength of the linear relationship between y and x.
5 We explain the meaning of R-Sq (adj) in Chapter 12.
F I G U R E 11.23  Illustrations of Different Values of the Simple Correlation Coefficient
[Panels include: (c) Little correlation (r near 0): little linear relationship between y and x; (d) Negative correlation (negative r): y decreases as x increases in a straight-line fashion; (e) r = -1: a perfect linear relationship with a negative slope]
Because r² is always between 0 and 1, the correlation coefficient r is between -1 and 1. A value of r near 0 implies little linear relationship between y and x. A value of r close to 1 says that y and x have a strong tendency to move together in a straight-line fashion with a positive slope and, therefore, that y and x are highly related and positively correlated. A value of r close to -1 says that y and x have a strong tendency to move together in a straight-line fashion with a negative slope and, therefore, that y and x are highly related and negatively correlated. Figure 11.23 illustrates these relationships. Notice that when r = 1, y and x have a perfect linear relationship with a positive slope, whereas when r = -1, y and x have a perfect linear relationship with a negative slope.
In the fuel consumption problem we have previously found that b1 = -.1279 and r² = .899. It follows that the simple correlation coefficient between y (weekly fuel consumption) and x (average hourly temperature) is
$$r = -\sqrt{r^2} = -\sqrt{.899} = -.948$$
This simple correlation coefficient says that x and y have a strong tendency to move together in a linear fashion with a negative slope. We have seen this tendency in Figure 11.1 (page 447), which indicates that y and x are negatively correlated. If we have computed the least squares slope b1 and r², the method given in the previous box provides the easiest way to calculate r. The simple correlation coefficient can also be calculated using the formula
$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}}$$
Here SSxy and SSxx have been defined in Section 11.2 on page 459, and SSyy denotes the total variation, which has been defined in this section. Furthermore, this formula for r automatically gives r the correct (+ or -) sign. For instance, in the fuel consumption problem, SSxy = -179.6475,
SSxx = 1,404.355, and SSyy = 25.549 (see Examples 11.4 on page 459 and 11.13 on page 487). Therefore,
$$r = \frac{-179.6475}{\sqrt{(1{,}404.355)(25.549)}} = -.948$$
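A quick Python check of this formula (note that the sign of SSxy carries through automatically):

import numpy as np

# Fuel consumption case: SSxy = -179.6475, SSxx = 1,404.355, SSyy = 25.549.
SSxy, SSxx, SSyy = -179.6475, 1404.355, 25.549
r = SSxy / np.sqrt(SSxx * SSyy)    # about -.948
print(r)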
It is important to make a couple of points. First, the value of the simple correlation coefficient is not the slope of the least squares line. If we wish to find this slope, we should use the previously given formula for b1.6 Second, high correlation does not imply that a cause-and-effect relationship exists. When r indicates that y and x are highly correlated, this says that y and x have a strong tendency to move together in a straight-line fashion. The correlation does not mean that changes in x cause changes in y. Instead, some other variable (or variables) could be causing the apparent relationship between y and x. For example, suppose that college students' grade point averages and college entrance exam scores are highly positively correlated. This does not mean that earning a high score on a college entrance exam causes students to receive a high grade point average. Rather, other factors such as intellectual ability, study habits, and attitude probably determine both a student's score on a college entrance exam and a student's college grade point average. In general, while the simple correlation coefficient can show that variables tend to move together in a straight-line fashion, scientific theory must be used to establish cause-and-effect relationships.
Testing the significance of the population correlation coefficient  Thus far we have seen that the simple correlation coefficient measures the linear relationship between the observed values of x and the observed values of y that make up the sample. A similar coefficient of linear correlation can be defined for the population of all possible combinations of observed values of x and y. We call this coefficient the population correlation coefficient and denote it by the symbol ρ (pronounced rho). We use r as the point estimate of ρ. In addition, we can also carry out a hypothesis test. Here we test the null hypothesis H0: ρ = 0, which says there is no linear relationship between x and y, against the alternative Ha: ρ ≠ 0, which says there is a positive or negative linear relationship between x and y. This test can be done by using r to compute the test statistic
$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$
The test is based on the assumption that the population of all possible observed combinations of values of x and y has a bivariate normal probability distribution. See Wonnacott and Wonnacott (1981) for a discussion of this distribution. It can be shown that the preceding test statistic t and the p-value used to test H0: ρ = 0 versus Ha: ρ ≠ 0 are equal to, respectively, the test statistic t = b1/s_b1 and the p-value used to test H0: β1 = 0 versus Ha: β1 ≠ 0, where b1 is the slope in the simple linear regression model. Keep in mind, however, that although the mechanics involved in these hypothesis tests are the same, these tests are based on different assumptions (remember that the test for significance of the slope is based on the regression assumptions). If the bivariate normal distribution assumption for the test concerning ρ is badly violated, we can use a nonparametric approach to correlation. One such approach is Spearman's rank correlation coefficient. This approach is discussed in optional Section 15.5.
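A minimal Python sketch of this test for the fuel consumption case; up to rounding, the t statistic matches t = b1/s_b1 = -7.33:

import numpy as np
from scipy import stats

r, n = -0.948, 8                              # fuel consumption case
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)    # about -7.3
p_value = 2 * stats.t.sf(abs(t), df=n - 2)    # about .0003
print(t, p_value)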
6 Essentially, the difference between r and b1 is a change of scale. It can be shown that b1 and r are related by the equation b1 = (SSyy/SSxx)^(1/2) r.
11.63 In the Quality Management Journal (Fall 1994), B. F. Yavas and T. M. Burrows present the following table, which gives the percentage of 100 U.S. managers and the percentage of 96 Asian managers who agree with each of 13 randomly selected statements concerning quality. (Note: All managers are in the electronics manufacturing industry.) MgmtOp

Statement                                1    2    3    4    5    6    7    8    9    10   11   12   13
Percentage of U.S. Managers Agreeing     36   31   28   27   78   74   43   50   31   66   18   61   53
[Percentages for the Asian managers not recovered]
Source: B. F. Yavas and T. M. Burrows, "A Comparative Study of Attitudes of U.S. and Asian Managers toward Product Quality," Quality Management Journal, Fall 1994, p. 49 (Table 5). © 1994 American Society for Quality. Reprinted by permission.
If we calculate the simple correlation coefficient, r, for these data, we find that r = .570. What does this indicate about the similarity of U.S. and Asian managers' views about quality?
Suppose that the regression assumptions hold, and define the overall F statistic to be
$$F(\text{model}) = \frac{\text{Explained variation}}{(\text{Unexplained variation})/(n - 2)}$$
Also define the p-value related to F(model) to be the area under the curve of the F distribution (having 1 numerator and n - 2 denominator degrees of freedom) to the right of F(model); see Figure 11.24(b). Then we can reject H0: β1 = 0 in favor of Ha: β1 ≠ 0 at level of significance α if either of the following equivalent conditions holds:
1 F(model) > F_α
2 p-value < α
Here the point F_α is based on 1 numerator and n - 2 denominator degrees of freedom.
The first condition in the box says we should reject H0: β1 = 0 (and conclude that the relationship between x and y is significant) when F(model) is large. This is intuitive because a large overall F statistic would be obtained when the explained variation is large compared to the unexplained variation. This would occur if x is significantly related to y, which would imply that the slope β1 is not equal to 0. Figure 11.24(a) illustrates that we reject H0 when F(model) is greater
F I G U R E 11.24
(a) The rejection point F_α based on setting the probability of a Type I error equal to α
(b) If the p-value is smaller than α, then F(model) > F_α and we reject H0
than F_α. As can be seen in Figure 11.24(b), when F(model) is large, the related p-value is small. When the p-value is small enough [resulting from an F(model) statistic that is large enough], we reject H0. Figure 11.24(b) illustrates that the second condition in the box (p-value < α) is an equivalent way to carry out this test.
Consider the fuel consumption problem and the MINITAB output in Example 11.13 (page 487) of the simple linear regression model relating weekly fuel consumption y to average hourly temperature x. Looking at this output, we see that the explained variation is 22.981 and the unexplained variation is 2.568. It follows that
$$F(\text{model}) = \frac{\text{Explained variation}}{(\text{Unexplained variation})/(n - 2)} = \frac{22.981}{2.568/(8 - 2)} = \frac{22.981}{.428} = 53.69$$
Note that this overall F statistic is given on the MINITAB output (it is labeled as "F"). The p-value related to F(model) is the area to the right of 53.69 under the curve of the F distribution having 1 numerator and 6 denominator degrees of freedom. This p-value is also given on the MINITAB output (labeled "p"). Here MINITAB tells us that the p-value is .000 (which means less than .001). If we wish to test the significance of the regression relationship with level of significance α = .05, we use the rejection point F.05 based on 1 numerator and 6 denominator degrees of freedom. Using Table A.6 (page 818), we find that F.05 = 5.99. Since F(model) = 53.69 > F.05 = 5.99, we can reject H0: β1 = 0 in favor of Ha: β1 ≠ 0 at level of significance .05. Alternatively, since the p-value of .000 is smaller than .05, .01, and .001, we can reject H0 at level of significance .05, .01, or .001. Therefore, we have extremely strong evidence that H0: β1 = 0 should be rejected and that the regression relationship between x and y is significant. That is, we might say that we have extremely strong evidence that the simple linear model relating y to x is significant. As another example, the MegaStat output in Example 11.14 (page 487) tells us that for the QHIC simple linear regression model F(model) is 305.06 and the related p-value is less than .001. Here F(model) is labeled as "F". Because the p-value is less than .001, we have extremely strong evidence that the regression relationship is significant.
Testing the significance of the regression relationship between y and x by using the overall F statistic and its related p-value is equivalent to doing this test by using the t statistic and its related p-value. Specifically, it can be shown that (t)² = F(model) and that (t_α/2)² based on n - 2 degrees of freedom equals F_α based on 1 numerator and n - 2 denominator degrees of freedom. It follows that the rejection point conditions
|t| > t_α/2 and F(model) > F_α
are equivalent. Furthermore, the p-values related to t and F(model) can be shown to be equal. Because these tests are equivalent, it would be logical to ask why we have presented the F test. There are two reasons. First, most standard regression computer packages include the results of the F test as a part of the regression output. Second, the F test has a useful generalization in multiple regression analysis (where we employ more than one predictor variable). The F test in multiple regression is not equivalent to a t test. This is further explained in Chapter 12.
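A short Python sketch (our own, using scipy) that reproduces F(model), the F.05 rejection point, and the p-value, and checks the t² = F(model) relationship:

from scipy import stats

explained, unexplained, n = 22.981, 2.568, 8    # fuel consumption case
F = explained / (unexplained / (n - 2))         # about 53.69
F_05 = stats.f.ppf(0.95, dfn=1, dfd=n - 2)      # 5.99
p_value = stats.f.sf(F, dfn=1, dfd=n - 2)       # about .00033
print(F, F_05, p_value)
print((-7.33) ** 2)                             # about 53.7: t squared equals F(model)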
In Exercises 11.66 through 11.71, we refer to MINITAB, MegaStat, and Excel output of simple linear regression analyses of the data sets related to the six previously discussed case studies. Using the appropriate computer output,
a Calculate the F(model) statistic by using the explained variation, the unexplained variation, and other relevant quantities.
b Utilize the F(model) statistic and the appropriate rejection point to test H0: β1 = 0 versus Ha: β1 ≠ 0 by setting α equal to .05. What do you conclude about the relationship between y and x?
c Utilize the F(model) statistic and the appropriate rejection point to test H0: β1 = 0 versus Ha: β1 ≠ 0 by setting α equal to .01. What do you conclude about the relationship between y and x?
d Find the p-value related to F(model). Using the p-value, determine whether we can reject H0: β1 = 0 in favor of Ha: β1 ≠ 0 by setting α equal to .10, .05, .01, and .001. What do you conclude?
e Show that the F(model) statistic is the square of the t statistic for testing H0: β1 = 0 versus Ha: β1 ≠ 0. Also, show that the F.05 rejection point is the square of the t.025 rejection point.
11.66 THE STARTING SALARY CASE StartSal Use the MINITAB output in Figure 11.13 (page 474).
11.67 THE SERVICE TIME CASE SrvcTime Use the MegaStat output in Figure 11.14 (page 474).
11.68 THE FRESH DETERGENT CASE Fresh Use the MINITAB output in Figure 11.15 (page 475).
11.69 THE DIRECT LABOR COST CASE DirLab Use the Excel output in Figure 11.16 (page 475).
11.70 THE REAL ESTATE SALES PRICE CASE RealEst Use the MINITAB output in Figure 11.17 (page 476).
11.71 THE FAST-FOOD RESTAURANT RATING CASE FastFood Use the Excel output in Figure 11.18 (page 476) or the MINITAB output in Figure 11.21 (page 484).
11.8 Residual Analysis
For any particular observed value of y, the corresponding residual is
e = y - ŷ
where the predicted value of y is calculated using the least squares prediction equation ŷ = b0 + b1x.
The linear regression model y = β0 + β1x + ε implies that the error term ε is given by the equation ε = y - (β0 + β1x). Since ŷ in the previous box is clearly the point estimate of β0 + β1x, we see that the residual e = y - ŷ is the point estimate of the error term ε. If the regression assumptions are valid, then, for any given value of the independent variable, the population of potential error term values will be normally distributed with mean 0 and variance σ² (see the regression assumptions in Section 11.3 on page 466). Furthermore, the different error terms will be statistically independent. Because the residuals provide point estimates of the error terms, it follows that
If the regression assumptions hold, the residuals should look like they have been randomly and independently selected from normally distributed populations having mean 0 and variance σ².
In any real regression problem, the regression assumptions will not hold exactly. In fact, it is important to point out that mild departures from the regression assumptions do not seriously
hinder our ability to use a regression model to make statistical inferences. Therefore, we are looking for pronounced, rather than subtle, departures from the regression assumptions. Because of this, we will require only that the residuals approximately fit the description just given.
Residual plots  One useful way to analyze residuals is to plot them versus various criteria. The resulting plots are called residual plots. To construct a residual plot, we compute the residual for each observed y value. The calculated residuals are then plotted versus some criterion. To validate the regression assumptions, we make residual plots against (1) values of the independent variable x; (2) values of ŷ, the predicted value of the dependent variable; and (3) the time order in which the data have been observed (if the regression data are time series data). We next look at an example of constructing residual plots. Then we explain how to use these plots to check the regression assumptions.
The MegaStat output in Figure 11.25(a) presents the predicted home upkeep expenditures and residuals that are given by the simple linear regression model describing the QHIC data. Here each residual is computed as
e = y - ŷ = y - (b0 + b1x) = y - (-348.3921 + 7.2583x)
For instance, for the first observation (home), when y = 1,412.08 and x = 237.00 (see Table 11.2 on page 453), the residual is
e = 1,412.08 - (-348.3921 + 7.2583(237.00)) = 1,412.08 - 1,371.816 = 40.264
The MINITAB output in Figure 11.25(b) and (c) gives plots of the residuals for the QHIC simple linear regression model against values of x and ŷ. To understand how these plots are constructed, recall that for the first observation (home) y = 1,412.08, x = 237.00, ŷ = 1,371.816, and the residual is 40.264. It follows that the point plotted in Figure 11.25(b) corresponding to the first observation has a horizontal axis coordinate of the x value 237.00 and a vertical axis coordinate of the residual 40.264. It also follows that the point plotted in Figure 11.25(c) corresponding to the first observation has a horizontal axis coordinate of the ŷ value 1,371.816, and a vertical axis coordinate of the residual 40.264. Finally, note that the QHIC data are cross-sectional data, not time series data. Therefore, we cannot make a residual plot versus time.
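The residual computation (and a residual plot versus x) can be sketched in a few lines of Python. Only the first (x, y) pair below is from Table 11.2; the other two pairs are placeholders, since the full QHIC data set is not listed here.

import numpy as np
import matplotlib.pyplot as plt

b0, b1 = -348.3921, 7.2583              # QHIC least squares estimates

# First observation from Table 11.2; the remaining pairs are made-up placeholders.
x = np.array([237.00, 150.0, 200.0])
y = np.array([1412.08, 800.0, 1100.0])

residuals = y - (b0 + b1 * x)           # first residual: about 40.26

plt.scatter(x, residuals)               # residual plot versus x
plt.axhline(0.0, linestyle="--")
plt.xlabel("Home value (x)")
plt.ylabel("Residual")
plt.show()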
The constant variance assumption  To check the validity of the constant variance assumption, we examine plots of the residuals against values of x, ŷ, and time (if the regression data are time series data). When we look at these plots, the pattern of the residuals' fluctuation around 0 tells us about the validity of the constant variance assumption. A residual plot that fans out [as in Figure 11.26(a)] suggests that the error terms are becoming more spread out as the horizontal plot value increases and that the constant variance assumption is violated. Here we would say that an increasing error variance exists. A residual plot that funnels in [as in Figure 11.26(b)] suggests that the spread of the error terms is decreasing as the horizontal plot value increases and that again the constant variance assumption is violated. In this case we would say that a decreasing error variance exists. A residual plot with a horizontal band appearance [as in Figure 11.26(c)] suggests that the spread of the error terms around 0 is not changing much as the horizontal plot value increases. Such a plot tells us that the constant variance assumption (approximately) holds. As an example, consider the QHIC case and the residual plot in Figure 11.25(b). This plot appears to fan out as x increases, indicating that the spread of the error terms is increasing as x increases. That is, an increasing error variance exists. This is equivalent to saying that the variance of the population of potential yearly upkeep expenditures for houses worth x (thousand dollars) appears to increase as x increases. The reason is that the model y = β0 + β1x + ε says that the variation of y is the same as the variation of ε. For example, the variance of the population of potential yearly upkeep expenditures for houses worth $200,000 would be larger than the variance of the population of potential yearly upkeep expenditures for houses worth $100,000.
FIGURE 11.25 MegaStat and MINITAB Output of the Residuals and Residual Plots for the QHIC Simple Linear Regression Model
FIGURE 11.26 Residual Plot Patterns: (a) Fanning Out, (b) Funneling In, (c) Horizontal Band
Increasing variance makes some intuitive sense because people with more expensive homes generally have more discretionary income. These people can choose to spend either a substantial amount or a much smaller amount on home upkeep, thus causing a relatively large variation in upkeep expenditures.

Another residual plot showing the increasing error variance in the QHIC case is Figure 11.25(c). This plot tells us that the residuals appear to fan out as ŷ (predicted y) increases, which is logical because ŷ is an increasing function of x. Also, note that the original scatter plot of y versus x in Figure 11.4 (page 453) shows the increasing error variance; the y values appear to fan out as x increases. In fact, one might ask why we need to consider residual plots when we can simply look at scatter plots. One answer is that, because of possible differences in scaling between residual plots and scatter plots, one of these types of plots might be more informative in a particular situation. Therefore, we should always consider both types of plots. When the constant variance assumption is violated, we cannot use the formulas of this chapter to make statistical inferences. Later in this section we discuss how we can make statistical inferences when a nonconstant error variance exists.

The assumption of correct functional form
If the functional form of a regression model is incorrect, the residual plots constructed by using the model often display a pattern suggesting the form of a more appropriate model. For instance, if we use a simple linear regression model when the true relationship between y and x is curved, the residual plot will have a curved appearance. For example, the scatter plot of upkeep expenditure, y, versus home value, x, in Figure 11.4 (page 453) has either a straight-line or slightly curved appearance. We used a simple linear regression model to describe the relationship between y and x, but note that there is a dip, or slightly curved appearance, in the upper left portion of each residual plot in Figure 11.25. Therefore, both the scatter plot and residual plots indicate that there might be a slightly curved relationship between y and x. Later in this section we discuss one way to model curved relationships.

The normality assumption
If the normality assumption holds, a histogram and/or stem-and-leaf display of the residuals should look reasonably bell-shaped and reasonably symmetric about 0. Figure 11.27(a) gives the MINITAB output of a stem-and-leaf display of the residuals from the simple linear regression model describing the QHIC data. The stem-and-leaf display looks fairly bell-shaped and symmetric about 0. However, the tails of the display look somewhat long and heavy or thick, indicating a possible violation of the normality assumption.

Another way to check the normality assumption is to construct a normal plot of the residuals. To make a normal plot, we first arrange the residuals in order from smallest to largest. Letting the ordered residuals be denoted as e(1), e(2), . . . , e(n), we denote the ith residual in the ordered listing as e(i). We plot e(i) on the vertical axis against a point called z(i) on the horizontal axis. Here z(i) is defined to be the point on the horizontal axis under the standard normal curve so that the area under this curve to the left of z(i) is (3i - 1)/(3n + 1). For example, recall in the QHIC case that there are n = 40 residuals given in Figure 11.25(a). It follows that, when i = 1,

(3i - 1)/(3n + 1) = (3(1) - 1)/(3(40) + 1) = 2/121 = .0165
Therefore, z(1) is the normal point having an area of .0165 under the standard normal curve to its left. This implies that the area under the standard normal curve between z(1) and 0 is .5 - .0165 = .4835. Thus, as illustrated in Figure 11.27(b), z(1) equals -2.13. Because the smallest residual in Figure 11.25(a) is -289.044, the first point plotted is e(1) = -289.044 on the vertical scale versus z(1) = -2.13 on the horizontal scale. When i = 2, it can be verified that (3i - 1)/(3n + 1) equals .0413 and thus that z(2) = -1.74. Therefore, because the second-smallest residual in Figure 11.25(a) is -259.958, the second point plotted is e(2) = -259.958 on the vertical scale versus z(2) = -1.74 on the horizontal scale. This process is continued until the entire normal plot is constructed. The MINITAB output of this plot is given in Figure 11.27(c). It can be proven that, if the normality assumption holds, then the expected value of the ith ordered residual e(i) is proportional to z(i). Therefore, a plot of the e(i) values on the vertical scale versus the z(i) values on the horizontal scale should have a straight-line appearance. That is, if the normality assumption holds, then the normal plot should have a straight-line appearance.
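A hedged sketch of this construction in Python, using the plotting position (3i - 1)/(3n + 1) described above; here residuals stands for any array of residuals, such as the one computed in the earlier sketch.

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

def normal_plot_points(residuals):
    """Return (z(i), e(i)) pairs using the plotting position (3i - 1)/(3n + 1)."""
    e = np.sort(np.asarray(residuals))      # ordered residuals e(1) <= ... <= e(n)
    n = len(e)
    i = np.arange(1, n + 1)
    area = (3 * i - 1) / (3 * n + 1)        # area to the left of z(i)
    z = norm.ppf(area)                      # normal points; for n = 40, z[0] = -2.13
    return z, e

z, e = normal_plot_points(residuals)        # residuals from the earlier sketch
plt.scatter(z, e)                           # should be roughly a straight line
plt.xlabel("Normal score z(i)")             # if the normality assumption holds
plt.ylabel("Ordered residual e(i)")
plt.show()
```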
FIGURE 11.27 Stem-and-Leaf Display and a Normal Plot of the Residuals from the Simple Linear Regression Model Describing the QHIC Data: (a) stem-and-leaf display (n = 40); (b) calculation of z(1); (c) MINITAB output of the normal plot
A normal plot that does not look like a straight line (admittedly, a subjective decision) indicates that the normality assumption is violated. Since the normal plot in Figure 11.27(c) has some curvature (particularly in the upper right portion), there is a possible violation of the normality assumption. It is important to realize that violations of the constant variance and correct functional form assumptions can often cause a histogram and/or stem-and-leaf display of the residuals to look nonnormal and can cause the normal plot to have a curved appearance. Because of this, it is usually a good idea to use residual plots to check for nonconstant variance and incorrect functional form before making any final conclusions about the normality assumption. Later in this section we discuss a procedure that sometimes remedies simultaneous violations of the constant variance, correct functional form, and normality assumptions.

The independence assumption
The independence assumption is most likely to be violated when the regression data are time series data, that is, data that have been collected in a time sequence. For such data the time-ordered error terms can be autocorrelated. Intuitively, we say that error terms occurring over time have positive autocorrelation if a positive error term in time period i tends to produce, or be followed by, another positive error term in time period i + k (some later time period) and if a negative error term in time period i tends to produce, or be followed by, another negative error term in time period i + k. In other words, positive autocorrelation exists when positive error terms tend to be followed over time by positive error terms and when negative error terms tend to be followed over time by negative error terms. Positive autocorrelation in the error terms is depicted in Figure 11.28, which illustrates that positive autocorrelation can produce a cyclical error term pattern over time. The simple linear regression model implies that a positive error term produces a greater-than-average value of y and a negative error term produces a smaller-than-average value of y.
FIGURE 11.28 Positive Autocorrelation in the Error Terms: Cyclical Pattern
FIGURE 11.29 Negative Autocorrelation in the Error Terms: Alternating Pattern
It follows that positive autocorrelation in the error terms means that greater-than-average values of y tend to be followed by greater-than-average values of y, and smaller-than-average values of y tend to be followed by smaller-than-average values of y. An example of positive autocorrelation could hypothetically be provided by a simple linear regression model relating demand for a product to advertising expenditure. Here we assume that the data are time series data observed over a number of consecutive sales periods. One of the factors included in the error term of the simple linear regression model is competitors' advertising expenditure for their similar products. If, for the moment, we assume that competitors' advertising expenditure significantly affects the demand for the product, then a higher-than-average competitors' advertising expenditure probably causes demand for the product to be lower than average and hence probably causes a negative error term. On the other hand, a lower-than-average competitors' advertising expenditure probably causes the demand for the product to be higher than average and hence probably causes a positive error term. If, then, competitors tend to spend money on advertising in a cyclical fashion, spending large amounts for several consecutive sales periods (during an advertising campaign) and then spending lesser amounts for several consecutive sales periods, a negative error term in one sales period will tend to be followed by a negative error term in the next sales period, and a positive error term in one sales period will tend to be followed by a positive error term in the next sales period. In this case the error terms would display positive autocorrelation, and thus these error terms would not be statistically independent.

Intuitively, error terms occurring over time have negative autocorrelation if a positive error term in time period i tends to produce, or be followed by, a negative error term in time period i + k and if a negative error term in time period i tends to produce, or be followed by, a positive error term in time period i + k. In other words, negative autocorrelation exists when positive error terms tend to be followed over time by negative error terms and negative error terms tend to be followed over time by positive error terms. An example of negative autocorrelation in the error terms is depicted in Figure 11.29, which illustrates that negative autocorrelation in the error terms can produce an alternating pattern over time. It follows that negative autocorrelation in the error terms means that greater-than-average values of y tend to be followed by smaller-than-average values of y and smaller-than-average values of y tend to be followed by greater-than-average values of y. An example of negative autocorrelation might be provided by a retailer's weekly stock orders. Here a larger-than-average stock order one week might result in an oversupply and hence a smaller-than-average order the next week.

The independence assumption basically says that the time-ordered error terms display no positive or negative autocorrelation. This says that the error terms occur in a random pattern over time. Such a random pattern would imply that the error terms (and their corresponding y values) are statistically independent. Because the residuals are point estimates of the error terms, a residual plot versus time is used to check the independence assumption.
If a residual plot versus the data's time sequence has a cyclical appearance, the error terms are positively autocorrelated, and the independence assumption is violated. If a plot of the time-ordered residuals has an alternating pattern, the error terms are negatively autocorrelated, and again the independence assumption is violated.
However, if a plot of the time-ordered residuals displays a random pattern, the error terms have little or no autocorrelation. In such a case, it is reasonable to conclude that the independence assumption holds.
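To build intuition, the sketch below simulates time-ordered error terms with first-order autocorrelation, e_t = φ·e_{t-1} + a_t (a standard AR(1) recursion used here only for illustration; φ and the random seed are arbitrary choices, not from the text). A positive φ produces the cyclical pattern of Figure 11.28, and a negative φ produces the alternating pattern of Figure 11.29.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)              # arbitrary seed

def ar1_errors(phi, n=40):
    """Simulate error terms with first-order autocorrelation:
    e_t = phi * e_{t-1} + a_t, where the a_t are independent normals."""
    a = rng.normal(size=n)
    e = np.zeros(n)
    for t in range(1, n):
        e[t] = phi * e[t - 1] + a[t]
    return e

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(6, 5), sharex=True)
ax1.plot(ar1_errors(0.8), marker="o")       # cyclical pattern (positive autocorrelation)
ax1.set_title("phi = 0.8: cyclical pattern")
ax2.plot(ar1_errors(-0.8), marker="o")      # alternating pattern (negative autocorrelation)
ax2.set_title("phi = -0.8: alternating pattern")
ax2.set_xlabel("Time")
plt.tight_layout()
plt.show()
```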
EXAMPLE 11.19
Figure 11.30(a) presents data concerning weekly sales at Page's Bookstore (Sales), Page's weekly advertising expenditure (Adver), and the weekly advertising expenditure of Page's main competitor (Compadv). Here the sales values are expressed in thousands of dollars, and the advertising expenditure values are expressed in hundreds of dollars. Figure 11.30(a) also gives the residuals that are obtained when MegaStat is used to perform a simple linear regression analysis relating Page's sales to Page's advertising expenditure. These residuals are plotted versus time in Figure 11.30(b). We see that the residual plot has a cyclical pattern. This tells us that the error terms for the model are positively autocorrelated and the independence assumption is violated. Furthermore, there tend to be positive residuals when the competitor's advertising expenditure is lower (in weeks 1 through 8 and weeks 14, 15, and 16) and negative residuals when the competitor's advertising expenditure is higher (in weeks 9 through 13). Therefore, the competitor's advertising expenditure seems to be causing the positive autocorrelation.
FIGURE 11.30 (a) The Data and the MegaStat Output of the Residuals from a Simple Linear Regression Relating Page's Sales to Page's Advertising Expenditure (BookSales); (b) MegaStat Output of a Plot of the Residuals versus Time (gridlines at one and two standard errors above and below 0)

Observation   Adver   Compadv   Sales   Predicted   Residual
     1          18       10       22      18.7         3.3
     2          20       10       27      23.0         4.0
     3          20       15       23      23.0         0.0
     4          25       15       31      33.9        -2.9
     5          28       15       45      40.4         4.6
     6          29       20       47      42.6         4.4
     7          29       20       45      42.6         2.4
     8          28       25       42      40.4         1.6
     9          30       35       37      44.7        -7.7
    10          31       35       39      46.9        -7.9
    11          34       35       45      53.4        -8.4
    12          35       30       52      55.6        -3.6
    13          36       30       57      57.8        -0.8
    14          38       25       62      62.1        -0.1
    15          41       20       73      68.6         4.4
    16          45       20       84      77.3         6.7

Durbin-Watson d = 0.65
To conclude this example, note that the simple linear regression model relating Page's sales to Page's advertising expenditure has a standard error, s, of 5.038. The MegaStat residual plot in Figure 11.30(b) includes grid lines that are placed one and two standard errors above and below the residual mean of 0. All MegaStat residual plots use such grid lines to help better diagnose potential violations of the regression assumptions. When the independence assumption is violated, various remedies can be employed. One approach is to identify which independent variable left in the error term (for example, competitors' advertising expenditure) is causing the error terms to be autocorrelated. We can then remove this independent variable from the error term and insert it directly into the regression model, forming a multiple regression model. (Multiple regression models are discussed in Chapter 12.)

The Durbin-Watson test
One type of positive or negative autocorrelation is called first-order autocorrelation. It says that e_t, the error term in time period t, is related to e_{t-1}, the error term in time period t - 1. To check for first-order autocorrelation, we can use the Durbin-Watson statistic
d = Σ_{t=2}^{n} (e_t - e_{t-1})² / Σ_{t=1}^{n} e_t²
where e_1, e_2, . . . , e_n are the time-ordered residuals. Intuitively, small values of d lead us to conclude that there is positive autocorrelation. This is because, if d is small, the differences (e_t - e_{t-1}) are small. This indicates that the adjacent residuals e_t and e_{t-1} are of the same magnitude, which in turn says that the adjacent error terms e_t and e_{t-1} are positively correlated. Consider testing the null hypothesis H0 that the error terms are not autocorrelated versus the alternative hypothesis Ha that the error terms are positively autocorrelated. Durbin and Watson have shown that there are points (denoted dL,α and dU,α) such that, if α is the probability of a Type I error, then
1 If d < dL,α, we reject H0.
2 If d > dU,α, we do not reject H0.
3 If dL,α ≤ d ≤ dU,α, the test is inconclusive.
So that the Durbin-Watson test may be easily done, tables containing the points dL,α and dU,α have been constructed. These tables give the appropriate dL,α and dU,α points for various values of α; k, the number of independent variables used by the regression model; and n, the number of observations. Tables A.10, A.11, and A.12 (pages 827-829) give these points for α = .05, α = .025, and α = .01. A portion of Table A.10 is given in Table 11.7. Note that when we are considering a simple linear regression model, which uses one independent variable, we look up the points dL,α and dU,α under the heading k = 1. Other values of k are used when we study multiple regression models in Chapter 12. Using the residuals in Figure 11.30(a), the Durbin-Watson statistic for the simple linear regression model relating Page's sales to Page's advertising expenditure is calculated to be
d = Σ_{t=2}^{16} (e_t - e_{t-1})² / Σ_{t=1}^{16} e_t²
  = [(4.0 - 3.3)² + . . . + (6.7 - 4.4)²] / [(3.3)² + . . . + (6.7)²]
  = .65
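The same value can be verified directly; a minimal sketch, with the residual signs recovered from residual = Sales - Predicted in Figure 11.30(a):

```python
import numpy as np

# Time-ordered residuals from Figure 11.30(a) (residual = Sales - Predicted)
e = np.array([3.3, 4.0, 0.0, -2.9, 4.6, 4.4, 2.4, 1.6,
              -7.7, -7.9, -8.4, -3.6, -0.8, -0.1, 4.4, 6.7])

# Durbin-Watson statistic: sum of squared successive differences
# divided by the sum of squared residuals
d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(round(d, 2))    # 0.65, matching the MegaStat output
```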
A MegaStat output of the Durbin-Watson statistic is given at the bottom of Figure 11.30(a). To test for positive autocorrelation, we note that there are n = 16 observations and the regression model uses k = 1 independent variable.
TABLE 11.7 A Portion of Table A.10: Critical Values dL,.05 and dU,.05 for the Durbin-Watson Statistic

         k = 1            k = 2            k = 3            k = 4
 n    dL,.05  dU,.05   dL,.05  dU,.05   dL,.05  dU,.05   dL,.05  dU,.05
15    1.08    1.36     0.95    1.54     0.82    1.75     0.69    1.97
16    1.10    1.37     0.98    1.54     0.86    1.73     0.74    1.93
17    1.13    1.38     1.02    1.54     0.90    1.71     0.78    1.90
18    1.16    1.39     1.05    1.53     0.93    1.69     0.82    1.87
19    1.18    1.40     1.08    1.53     0.97    1.68     0.86    1.85
20    1.20    1.41     1.10    1.54     1.00    1.68     0.90    1.83
Therefore, if we set α = .05, Table 11.7 tells us that dL,.05 = 1.10 and dU,.05 = 1.37. Since d = .65 is less than dL,.05 = 1.10, we reject the null hypothesis of no autocorrelation. That is, we conclude (at an α of .05) that there is positive (first-order) autocorrelation. It can be shown that the Durbin-Watson statistic d is always between 0 and 4. Large values of d (and hence small values of 4 - d) lead us to conclude that there is negative autocorrelation because, if d is large, this indicates that the differences (e_t - e_{t-1}) are large. This says that the adjacent error terms e_t and e_{t-1} are negatively autocorrelated. Consider testing the null hypothesis H0 that the error terms are not autocorrelated versus the alternative hypothesis Ha that the error terms are negatively autocorrelated. Durbin and Watson have shown that, based on setting the probability of a Type I error equal to α, the points dL,α and dU,α are such that
1 If (4 - d) < dL,α, we reject H0.
2 If (4 - d) > dU,α, we do not reject H0.
3 If dL,α ≤ (4 - d) ≤ dU,α, the test is inconclusive.

As an example, for the Page's sales simple linear regression model, we see that

(4 - d) = (4 - .65) = 3.35 > dU,.05 = 1.37

Therefore, on the basis of setting α equal to .05, we do not reject the null hypothesis of no autocorrelation. That is, there is no evidence of negative (first-order) autocorrelation. We can also use the Durbin-Watson statistic to test for positive or negative autocorrelation. Specifically, consider testing the null hypothesis H0 that the error terms are not autocorrelated versus the alternative hypothesis Ha that the error terms are positively or negatively autocorrelated. Durbin and Watson have shown that, based on setting the probability of a Type I error equal to α,
1 If d < dL,α/2 or if (4 - d) < dL,α/2, we reject H0.
2 If d > dU,α/2 and if (4 - d) > dU,α/2, we do not reject H0.
3 If dL,α/2 ≤ d ≤ dU,α/2 and dL,α/2 ≤ (4 - d) ≤ dU,α/2, the test is inconclusive.
For example, consider testing for positive or negative autocorrelation in the Page's sales model. If we set α equal to .05, then α/2 = .025, and we need to find the points dL,.025 and dU,.025 when n = 16 and k = 1. Looking up these points in Table A.11 (page 828), we find that dL,.025 = .98 and dU,.025 = 1.24. Since d = .65 is less than dL,.025 = .98, we reject the null hypothesis of no autocorrelation. That is, we conclude (at an α of .05) that there is first-order autocorrelation. Although we have used the Page's sales model in these examples to demonstrate the Durbin-Watson tests for (1) positive autocorrelation, (2) negative autocorrelation, and (3) positive or negative autocorrelation, we must in practice choose one of these Durbin-Watson tests in a particular situation. Since positive autocorrelation is more common in real time series data than negative autocorrelation, the Durbin-Watson test for positive autocorrelation is used more often than the other two tests. Also, note that each Durbin-Watson test assumes that the population of all possible residuals at any time t has a normal distribution.
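The three decision procedures can be expressed compactly in code; a hedged sketch (the helper name dw_test is ours, not MegaStat's or MINITAB's):

```python
def dw_test(d, dL, dU, alternative="positive"):
    """Durbin-Watson test decision given d and the critical points dL, dU.

    For alternative="two-sided", pass the dL and dU points for alpha/2.
    """
    if alternative == "two-sided":
        if d < dL or (4 - d) < dL:
            return "reject H0"
        if d > dU and (4 - d) > dU:
            return "do not reject H0"
        return "inconclusive"
    stat = (4 - d) if alternative == "negative" else d
    if stat < dL:
        return "reject H0"
    if stat > dU:
        return "do not reject H0"
    return "inconclusive"

# Page's sales model: d = .65, n = 16, k = 1
print(dw_test(0.65, 1.10, 1.37))                            # reject H0 (positive)
print(dw_test(0.65, 1.10, 1.37, alternative="negative"))    # do not reject H0
print(dw_test(0.65, 0.98, 1.24, alternative="two-sided"))   # reject H0
```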
Transforming the dependent variable: A possible remedy for violations of the constant variance, correct functional form, and normality assumptions
In general, if a data or residual plot indicates that the error variance of a regression model increases as an independent variable or the predicted value of the dependent variable increases, then we can sometimes remedy the situation by transforming the dependent variable. One transformation that works well is to take each y value to a fractional power. As an example, we might use a transformation in which we take the square root (or one-half power) of each y value. Letting y* denote the value obtained when the transformation is applied to y, we would write the square root transformation as

y* = √y = y^.5
Another commonly used transformation is the quartic root transformation. Here we take each y value to the one-fourth power. That is,

y* = y^.25
If we consider a transformation that takes each y value to a fractional power (such as .5, .25, or the like), as the power approaches 0, the transformed value y* approaches the natural logarithm of y (commonly written ln y). In fact, we sometimes use the logarithmic transformation

y* = ln y
which takes the natural logarithm of each y value. In general, when we take a fractional power (including the natural logarithm) of the dependent variable, the transformation not only tends to equalize the error variance but also tends to straighten out certain types of nonlinear data plots. Specifically, if a data plot indicates that the dependent variable is increasing at an increasing rate (as in Figure 11.4 on page 453), then a fractional power transformation tends to straighten out the data plot. A fractional power transformation can also help to remedy a violation of the normality assumption. Because we cannot know which fractional power to use before we actually take the transformation, we recommend taking all of the square root, quartic root, and natural logarithm transformations and seeing which one best equalizes the error variance and (possibly) straightens out a nonlinear data plot.
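A sketch of this trial-and-error comparison, with x and y again standing in as placeholders for the home values and upkeep expenditures:

```python
import numpy as np
import matplotlib.pyplot as plt

# Candidate transformations of the dependent variable
transforms = {
    "square root: y**.5": lambda y: y ** 0.5,
    "quartic root: y**.25": lambda y: y ** 0.25,
    "natural log: ln(y)": np.log,
}

# x, y: placeholder arrays of home values and upkeep expenditures
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (name, f) in zip(axes, transforms.items()):
    ax.scatter(x, f(y))     # look for the straightest plot with the most
    ax.set_title(name)      # nearly constant spread about the trend
    ax.set_xlabel("Home value x")
plt.tight_layout()
plt.show()
```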
Consider the QHIC upkeep expenditures. In Figures 11.31, 11.32, and 11.33 we show the plots that result when we take the square root, quartic root, and natural logarithmic transformations of the upkeep expenditures and plot the transformed values versus the home values. The square root transformation seems to best equalize the error variance and straighten out the curved data plot in Figure 11.4. Note that the natural logarithm transformation seems to overtransform the data: the error variance tends to decrease as the home value increases, and the data plot seems to bend down.
FIGURE 11.31 MINITAB Plot of the Square Roots of the Upkeep Expenditures versus the Home Values
FIGURE 11.32 MINITAB Plot of the Quartic Roots of the Upkeep Expenditures versus the Home Values
FIGURE 11.33 MINITAB Plot of the Natural Logarithms of the Upkeep Expenditures versus the Home Values
FIGURE 11.34 MINITAB Output of a Regression Analysis of the Upkeep Expenditure Data by Using the Model y* = β0 + β1x + ε, where y* = y^.5

The regression equation is
SRUPKEEP = 7.20 + 0.127 VALUE

Predictor    Coef       StDev      T       P
Constant     7.201      1.205      5.98    0.000
VALUE        0.127047   0.006577   19.32   0.000

S = 2.325    R-Sq = 90.8%    R-Sq(adj) = 90.5%

Analysis of Variance
Source       DF    SS       MS       F        P
Regression    1    2016.8   2016.8   373.17   0.000
Error        38     205.4      5.4
Total        39    2222.2

Fit      StDev Fit    95.0% CI            95.0% PI
35.151   0.474        (34.191, 36.111)    (30.347, 39.955)
The plot of the quartic roots indicates that the quartic root transformation also seems to overtransform the data (but not by as much as the logarithmic transformation). In general, as the fractional power gets smaller, the transformation gets stronger. Different fractional powers are best in different situations. Since the plot in Figure 11.31 of the square roots of the upkeep expenditures versus the home values has a straight-line appearance, we consider the model

y* = β0 + β1x + ε,  where y* = y^.5
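The transformed-model fit itself can be reproduced with ordinary least squares on the square roots of the y values; a minimal sketch, again with placeholder data arrays:

```python
import numpy as np

# x: home values, y: upkeep expenditures (placeholder arrays as before)
y_star = y ** 0.5                            # transformed dependent variable

# Least squares fit of y* = b0 + b1*x
X = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(X, y_star, rcond=None)[0]
print(b0, b1)   # approximately 7.201 and 0.127047 for the QHIC data
```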
The MINITAB output of a regression analysis using this transformed model is given in Figure 11.34, and the MINITAB output of an analysis of the model's residuals is given in Figure 11.35. Note that the residual plot versus x for the transformed model in Figure 11.35(a) has a horizontal band appearance. It can also be verified that the transformed model's residual plot versus ŷ, which we do not give here, has a similar horizontal band appearance. Therefore, we conclude that the constant variance and the correct functional form assumptions approximately hold for the transformed model.
FIGURE 11.35 MINITAB Output of Residual Analysis for the Upkeep Expenditure Model y* = β0 + β1x + ε, where y* = y^.5: (a) residual plot versus x; (b) stem-and-leaf display of the residuals (n = 40); (c) normal plot of the residuals
Next, note that the stem-and-leaf display of the transformed model's residuals in Figure 11.35(b) looks reasonably bell-shaped and symmetric, and note that the normal plot of these residuals in Figure 11.35(c) looks straighter than the normal plot for the untransformed model (see Figure 11.27 on page 497). Therefore, we also conclude that the normality assumption approximately holds for the transformed model. Because the regression assumptions approximately hold for the transformed regression model, we can use this model to make statistical inferences.

Consider a home worth $220,000. Using the least squares point estimates on the MINITAB output in Figure 11.34, it follows that a point prediction of y* for such a home is

ŷ* = 7.201 + .127047(220) = 35.151

This point prediction is given at the bottom of the MINITAB output, as is the 95 percent prediction interval for y*, which is [30.347, 39.955]. It follows that a point prediction of the upkeep expenditure for a home worth $220,000 is (35.151)² = $1,235.59 and that a 95 percent prediction interval for this upkeep expenditure is [(30.347)², (39.955)²] = [$920.94, $1,596.40]. Suppose that QHIC wishes to send an advertising brochure to any home that has a predicted upkeep expenditure of at least $500. Solving the prediction equation ŷ* = b0 + b1x for x, and noting that a predicted upkeep expenditure of $500 corresponds to a ŷ* of √500 = 22.36068, it follows that QHIC should send the advertising brochure to any home that has a value of at least

x = (ŷ* - b0)/b1 = (22.36068 - 7.201)/.127047 = 119.3234 (or $119,323)
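These back-transformation steps are simple enough to check in a few lines of Python; the numbers below are the point estimates and interval endpoints taken from Figure 11.34:

```python
# Point estimates from the transformed-model output in Figure 11.34
b0, b1 = 7.201, 0.127047

# Point prediction for a home worth $220,000 (x is in $1,000s)
y_star_hat = b0 + b1 * 220          # 35.151 on the square-root scale
print(y_star_hat ** 2)              # about 1235.59 dollars on the original scale

# Squaring the prediction interval endpoints for y* gives an interval for y
lo, hi = 30.347, 39.955
print(lo ** 2, hi ** 2)             # about 920.94 and 1596.40

# Smallest home value with predicted upkeep of at least $500:
# solve b0 + b1*x = sqrt(500) for x
x_min = (500 ** 0.5 - b0) / b1
print(x_min)                        # about 119.32, i.e., $119,323
```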
Recall that because there are many homes of a particular value in the metropolitan area, QHIC is interested in estimating the mean upkeep expenditure corresponding to this value. Consider all homes worth, for example, $220,000. The MINITAB output in Figure 11.34 tells us that a point estimate of the mean of the square roots of the upkeep expenditures for all such homes is 35.151 and that a 95 percent confidence interval for this mean is [34.191, 36.111]. Unfortunately, because it can be shown that the mean of the square root is not the square root of the mean, we cannot transform the results for the mean of the square roots back into a result for the mean of the original upkeep expenditures. This is a major drawback to transforming the dependent variable and one reason why many statisticians avoid transforming the dependent variable unless the regression assumptions are badly violated.

In Chapter 12 we discuss other remedies for violations of the regression assumptions that do not have some of the drawbacks of transforming the dependent variable. Some of these remedies involve transforming the independent variable, a procedure introduced in Exercise 11.85 of this section. Furthermore, if we reconsider the residual analysis of the original, untransformed QHIC model in Figures 11.25 (page 495) and 11.27 (page 497), we might conclude that the regression assumptions are not badly violated for the untransformed model. Also, note that the point prediction, 95 percent prediction interval, and value of x obtained here using the transformed model are not very different from the results obtained in Examples 11.5 (page 463) and 11.12 (page 481) using the untransformed model. This implies that it might be reasonable to rely on the results obtained using the untransformed model, or to at least rely on the results for the mean upkeep expenditures obtained using the untransformed model.

In this section we have concentrated on analyzing the residuals for the QHIC simple linear regression model. If we analyze the residuals in Table 11.4 (page 460) for the fuel consumption simple linear regression model (recall that the fuel consumption data are time series data), we conclude that the regression assumptions approximately hold for this model.
11.75 THE FUEL CONSUMPTION CASE
Recall that Table 11.4 gives the residuals from the simple linear regression model relating weekly fuel consumption to average hourly temperature. Figure 11.36(a) gives the Excel output of a plot of these residuals versus average hourly temperature. Describe the appearance of this plot. Does the plot indicate any violations of the regression assumptions?

11.76 THE FRESH DETERGENT CASE  Fresh
Figure 11.36(b) gives the MINITAB output of residual diagnostics that are obtained when the simple linear regression model is fit to the Fresh detergent demand data in Exercise 11.9 (page 455). Interpret the diagnostics and determine if they indicate any violations of the regression assumptions. Note that the I chart of the residuals is a plot of the residuals versus time, with control limits that are used to show unusually large residuals (such control limits are discussed in Chapter 14, but we ignore them here).

11.77 THE SERVICE TIME CASE  SrvcTime
Recall that Figure 11.14 on page 474 gives the MegaStat output of a simple linear regression analysis of the service time data in Exercise 11.7. The MegaStat output of the residuals given by this model is given in Figure 11.37, and MegaStat output of residual plots versus x and ŷ is given in Figure 11.38(a) and (b). Do the plots indicate any violations of the regression assumptions?

11.78 THE SERVICE TIME CASE  SrvcTime
Figure 11.37 gives the MegaStat output of the residuals from the simple linear regression model describing the service time data in Exercise 11.7.