Chap 13


Simple Linear Regression

USING STATISTICS @ Sunflowers Apparel

13.1 TYPES OF REGRESSION MODELS

13.2 DETERMINING THE SIMPLE LINEAR REGRESSION EQUATION
    The Least-Squares Method
    Visual Explorations: Exploring Simple Linear Regression Coefficients
    Predictions in Regression Analysis: Interpolation Versus Extrapolation
    Computing the Y Intercept, $b_0$, and the Slope, $b_1$

13.3 MEASURES OF VARIATION
    Computing the Sum of Squares
    The Coefficient of Determination
    Standard Error of the Estimate

13.4 ASSUMPTIONS

13.5 RESIDUAL ANALYSIS
    Evaluating the Assumptions

13.6 MEASURING AUTOCORRELATION: THE DURBIN-WATSON STATISTIC
    Residual Plots to Detect Autocorrelation
    The Durbin-Watson Statistic

13.7 INFERENCES ABOUT THE SLOPE AND CORRELATION COEFFICIENT
    t Test for the Slope
    F Test for the Slope
    Confidence Interval Estimate of the Slope ($\beta_1$)
    t Test for the Correlation Coefficient

13.8 ESTIMATION OF MEAN VALUES AND PREDICTION OF INDIVIDUAL VALUES
    The Confidence Interval Estimate
    The Prediction Interval

13.9 PITFALLS IN REGRESSION AND ETHICAL ISSUES

EXCEL COMPANION TO CHAPTER 13
    E13.1 Performing Simple Linear Regression Analyses
    E13.2 Creating Scatter Plots and Adding a Prediction Line
    E13.3 Performing Residual Analyses
    E13.4 Computing the Durbin-Watson Statistic
    E13.5 Estimating the Mean of Y and Predicting Y Values
    E13.6 Example: Sunflowers Apparel Data

In this chapter, you learn:

- To use regression analysis to predict the value of a dependent variable based on an independent variable
- The meaning of the regression coefficients $b_0$ and $b_1$
- To evaluate the assumptions of regression analysis and know what to do if the assumptions are violated
- To make inferences about the slope and correlation coefficient
- To estimate mean values and predict individual values

Using Statistics @ Sunflowers Apparel


The sales for Sunflowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years as the chain has expanded the number of stores open. Until now, Sunflowers managers selected sites based on subjective factors, such as the availability of a good lease or the perception that a location seemed ideal for an apparel store. As the new director of planning, you need to develop a systematic approach that will lead to making better decisions during the site selection process. As a starting point, you believe that the size of the store significantly contributes to store sales, and you want to use this relationship in the decision-making process. How can you use statistics so that you can forecast the annual sales of a proposed store based on the size of that store?

In this chapter and the next two chapters, you learn how regression analysis enables you to develop a model to predict the values of a numerical variable, based on the value of other variables.

In regression analysis, the variable you wish to predict is called the dependent variable. The variables used to make the prediction are called independent variables. In addition to predicting values of the dependent variable, regression analysis also allows you to identify the type of mathematical relationship that exists between a dependent and an independent variable, to quantify the effect that changes in the independent variable have on the dependent variable, and to identify unusual observations. For example, as the director of planning, you may wish to predict sales for a Sunflowers store, based on the size of the store. Other examples include predicting the monthly rent of an apartment, based on its size, and predicting the monthly sales of a product in a supermarket, based on the amount of shelf space devoted to the product.

This chapter discusses simple linear regression, in which a single numerical independent variable, X, is used to predict the numerical dependent variable Y, such as using the size of a store to predict the annual sales of the store. Chapters 14 and 15 discuss multiple regression models, which use several independent variables to predict a numerical dependent variable, Y. For example, you could use the amount of advertising expenditures, price, and the amount of shelf space devoted to a product to predict its monthly sales.

13.1 TYPES OF REGRESSION MODELS

In Section 2.5, you used a scatter plot (also known as a scatter diagram) to examine the relationship between an X variable on the horizontal axis and a Y variable on the vertical axis. The nature of the relationship between two variables can take many forms, ranging from simple to extremely complicated mathematical functions. The simplest relationship consists of a straight-line, or linear, relationship. An example of this relationship is shown in Figure 13.1.
FIGURE 13.1 A positive straight-line relationship ($\Delta Y$ = "change in Y", $\Delta X$ = "change in X")

Equation (13.1) represents the straight-line (linear) model.
SIMPLE LINEAR REGRESSION MODEL

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \qquad (13.1)$$

where

$\beta_0$ = Y intercept for the population
$\beta_1$ = slope for the population
$\varepsilon_i$ = random error in Y for observation i
$Y_i$ = dependent variable (sometimes referred to as the response variable) for observation i
$X_i$ = independent variable (sometimes referred to as the explanatory variable) for observation i
The portion $Y_i = \beta_0 + \beta_1 X_i$ of the simple linear regression model expressed in Equation (13.1) is a straight line. The slope of the line, $\beta_1$, represents the expected change in Y per unit change in X. It represents the mean amount that Y changes (either positively or negatively) for a one-unit change in X. The Y intercept, $\beta_0$, represents the mean value of Y when X equals 0. The last component of the model, $\varepsilon_i$, represents the random error in Y for each observation, i. In other words, $\varepsilon_i$ is the vertical distance of the actual value of $Y_i$ above or below the predicted value of $Y_i$ on the line.

The selection of the proper mathematical model depends on the distribution of the X and Y values on the scatter plot. In Panel A of Figure 13.2 on page 514, the values of Y are generally increasing linearly as X increases. This panel is similar to Figure 13.3 on page 515, which illustrates the positive relationship between the square footage of the store and the annual sales at branches of the Sunflowers Apparel women's clothing store chain.

Panel B is an example of a negative linear relationship. As X increases, the values of Y are generally decreasing. An example of this type of relationship might be the price of a particular product and the amount of sales.

The data in Panel C show a positive curvilinear relationship between X and Y. The values of Y increase as X increases, but this increase tapers off beyond certain values of X. An example of a positive curvilinear relationship might be the age and maintenance cost of a machine. As a machine gets older, the maintenance cost may rise rapidly at first, but then level off beyond a certain number of years.

Panel D shows a U-shaped relationship between X and Y. As X increases, at first Y generally decreases; but as X continues to increase, Y not only stops decreasing but actually increases above its minimum value. An example of this type of relationship might be the number of errors per hour at a task and the number of hours worked. The number of errors per hour decreases as the individual becomes more proficient at the task, but then it increases beyond a certain point because of factors such as fatigue and boredom.

Panel E indicates an exponential relationship between X and Y. In this case, Y decreases very rapidly as X first increases, but then it decreases much less rapidly as X increases further. An example of an exponential relationship could be the resale value of an automobile and its age. In the first year, the resale value drops drastically from its original price; however, the resale value then decreases much less rapidly in subsequent years.

Finally, Panel F shows a set of data in which there is very little or no relationship between X and Y. High and low values of Y appear at each value of X.

FIGURE 13.2 Examples of types of relationships found in scatter plots (Panel A: positive linear relationship; Panel B: negative linear relationship; Panel C: positive curvilinear relationship; Panel D: U-shaped curvilinear relationship; Panel E: exponential relationship; Panel F: no relationship between X and Y)

In this section, a variety of different models that represent the relationship between two variables were briefly examined. Although scatter plots are useful in visually displaying the mathematical form of a relationship, more sophisticated statistical procedures are available to determine the most appropriate model for a set of variables. The rest of this chapter discusses the model used when there is a linear relationship between variables.

13.2 DETERMINING THE SIMPLE LINEAR REGRESSION EQUATION

In the Using Statistics scenario on page 512, the stated goal is to forecast annual sales for all new stores, based on store size. To examine the relationship between the store size in square feet and its annual sales, a sample of 14 stores was selected. Table 13.1 summarizes the results for these 14 stores, which are stored in the accompanying data file.

TABLE 13.1 Square Footage (in Thousands of Square Feet) and Annual Sales (in Millions of Dollars) for a Sample of 14 Branches of Sunflowers Apparel

         Square Feet    Annual Sales                   Square Feet    Annual Sales
Store    (Thousands)    (in Millions of Dollars)    Store    (Thousands)    (in Millions of Dollars)
  1          1.7             3.7                      8          1.1             2.7
  2          1.6             3.9                      9          3.2             5.5
  3          2.8             6.7                     10          1.5             2.9
  4          5.6             9.5                     11          5.2            10.7
  5          1.3             3.4                     12          4.6             7.6
  6          2.2             5.6                     13          5.8            11.8
  7          1.3             3.7                     14          3.0             4.1

Figure 13.3 displays the scatter plot for the data in Table 13.1. Observe the increasing relationship between square feet (X) and annual sales (Y). As the size of the store increases, annual sales increase approximately as a straight line. Thus, you can assume that a straight line provides a useful mathematical model of this relationship. Now you need to determine the specific straight line that is the best fit to these data.

FIGURE 13.3 Excel scatter plot for the Sunflowers Apparel data (chart title: Scatter Plot for Site Selection; horizontal axis: Square Feet (000)). See Section E2.12 to create this.
The Least-Squares Method

In the preceding section, a statistical model is hypothesized to represent the relationship between two variables, square footage and sales, in the entire population of Sunflowers Apparel stores. However, as shown in Table 13.1, the data are from only a random sample of stores. If certain assumptions are valid (see Section 13.4), you can use the sample Y intercept, $b_0$, and the sample slope, $b_1$, as estimates of the respective population parameters, $\beta_0$ and $\beta_1$. Equation (13.2) uses these estimates to form the simple linear regression equation. This straight line is often referred to as the prediction line.

SIMPLE LINEAR REGRESSION EQUATION: THE PREDICTION LINE

The predicted value of Y equals the Y intercept plus the slope times the value of X.

$$\hat{Y}_i = b_0 + b_1 X_i \qquad (13.2)$$

where

$\hat{Y}_i$ = predicted value of Y for observation i
$X_i$ = value of X for observation i
$b_0$ = sample Y intercept
$b_1$ = sample slope

Equation (13.2) requires the determination of two regression coefficients: $b_0$ (the sample Y intercept) and $b_1$ (the sample slope). The most common approach to finding $b_0$ and $b_1$ is the method of least squares. This method minimizes the sum of the squared differences between the actual values ($Y_i$) and the predicted values ($\hat{Y}_i$) using the simple linear regression equation [that is, the prediction line; see Equation (13.2)]. This sum of squared differences is equal to

$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

Because $\hat{Y}_i = b_0 + b_1 X_i$,

$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} [Y_i - (b_0 + b_1 X_i)]^2$$

Because this equation has two unknowns, $b_0$ and $b_1$, the sum of squared differences depends on the sample Y intercept, $b_0$, and the sample slope, $b_1$. The least-squares method determines the values of $b_0$ and $b_1$ that minimize the sum of squared differences. Any values for $b_0$ and $b_1$ other than those determined by the least-squares method result in a greater sum of squared differences between the actual values ($Y_i$) and the predicted values ($\hat{Y}_i$). In this book, Microsoft Excel is used to perform the computations involved in the least-squares method. For the data of Table 13.1, Figure 13.4 presents results from Microsoft Excel.
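The minimizing property of least squares can be checked numerically. The following Python sketch is an illustration only, not part of the Excel workflow used in this book: it evaluates the sum of squared differences for the Sunflowers Apparel data at the least-squares coefficients reported in this section (0.9645 and 1.6699) and at a few alternative coefficient pairs. The alternative pairs and the function name `sum_squared_differences` are arbitrary choices made for this example.

```python
# Sunflowers Apparel sample from Table 13.1
square_feet = [1.7, 1.6, 2.8, 5.6, 1.3, 2.2, 1.3, 1.1, 3.2, 1.5, 5.2, 4.6, 5.8, 3.0]
annual_sales = [3.7, 3.9, 6.7, 9.5, 3.4, 5.6, 3.7, 2.7, 5.5, 2.9, 10.7, 7.6, 11.8, 4.1]


def sum_squared_differences(b0, b1, x_values, y_values):
    """Sum of squared differences between actual Y and b0 + b1 * X."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(x_values, y_values))


# Least-squares coefficients as reported in the text (rounded to 4 decimals)
least_squares_value = sum_squared_differences(0.9645, 1.6699, square_feet, annual_sales)

# Arbitrary alternative lines chosen only for comparison (hypothetical values)
alternatives = [(1.0, 1.6), (0.9645, 1.8), (0.5, 1.6699), (0.0, 2.0)]
alternative_values = [
    sum_squared_differences(b0, b1, square_feet, annual_sales) for b0, b1 in alternatives
]

print(f"least-squares line: {least_squares_value:.4f}")
for (b0, b1), value in zip(alternatives, alternative_values):
    print(f"b0 = {b0}, b1 = {b1}: {value:.4f}")
```

Every alternative line should give a larger sum than the least-squares line, which is what the least-squares method guarantees.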

FIGURE 13.4 Microsoft Excel results for the Sunflowers Apparel data. See Section E13.1 to create this.
To understand how the results are computed, many of the computations involved are illustrated in Examples 13.3 and 13.4 on pages 520-521 and 526-527. In Figure 13.4, observe that $b_0 = 0.9645$ and $b_1 = 1.6699$. Thus, the prediction line [see Equation (13.2) on page 515] for these data is

$$\hat{Y}_i = 0.9645 + 1.6699 X_i$$

The slope, $b_1$, is +1.6699. This means that for each increase of 1 unit in X, the mean value of Y is estimated to increase by 1.6699 units. In other words, for each increase of 1.0 thousand square feet in the size of the store, the mean annual sales are estimated to increase by 1.6699 millions of dollars. Thus, the slope represents the portion of the annual sales that are estimated to vary according to the size of the store.

The Y intercept, $b_0$, is +0.9645. The Y intercept represents the mean value of Y when X equals 0. Because the square footage of the store cannot be 0, this Y intercept has no practical interpretation. Also, the Y intercept for this example is outside the range of the observed values of the X variable, and therefore interpretations of the value of $b_0$ should be made cautiously. Figure 13.5 displays the actual observations and the prediction line. To illustrate a situation in which there is a direct interpretation for the Y intercept, $b_0$, see Example 13.1.

FIGURE 13.5 Microsoft Excel scatter plot and prediction line for Sunflowers Apparel data (chart title: Scatter Diagram for Site Selection; fitted line: y = 1.6699x + 0.9645). See Section E13.2 to create this.

EXAMPLE 13.1 INTERPRETING THE Y INTERCEPT, $b_0$, AND THE SLOPE, $b_1$

A statistics professor wants to use the number of hours a student studies for a statistics final exam (X) to predict the final exam score (Y). A regression model was fit based on data collected for a class during the previous semester, with the following results:

$$\hat{Y}_i = 35.0 + 3X_i$$

What is the interpretation of the Y intercept, $b_0$, and the slope, $b_1$?

SOLUTION The Y intercept $b_0 = 35.0$ indicates that when the student does not study for the final exam, the mean final exam score is 35.0. The slope $b_1 = 3$ indicates that for each increase of one hour in studying time, the mean change in the final exam score is predicted to be +3.0. In other words, the final exam score is predicted to increase by 3 points for each one-hour increase in studying time.
VISUAL EXPLORATIONS Exploring Simple Linear Regression Coefficients

Use the Visual Explorations Simple Linear Regression procedure to produce a prediction line that is as close as possible to the prediction line defined by the least-squares solution. Open the add-in workbook and select Visual Explorations > Simple Linear Regression (Excel 97-2003) or Add-ins > Visual Explorations > Simple Linear Regression (Excel 2007). (See Section E1.6 to learn about using add-ins.)

When a scatter plot of the Sunflowers Apparel data of Table 13.1 on page 515 with an initial prediction line appears, click the spinner buttons to change the values for $b_1$, the slope of the prediction line, and $b_0$, the Y intercept of the prediction line.

Try to produce a prediction line that is as close as possible to the prediction line defined by the least-squares estimates, using the chart display and the Difference from Target SSE value as feedback (see page 525 for an explanation of SSE). Click Finish when you are done with this exploration.

At any time, click Reset to reset the $b_1$ and $b_0$ values, Help for more information, or Solution to reveal the prediction line defined by the least-squares method.

Using Your Own Regression Data

To use Visual Explorations to find a prediction line for your own data, open the add-in workbook and select Visual Explorations > Simple Linear Regression with your worksheet data (Excel 97-2003) or Add-ins > Visual Explorations > Simple Linear Regression with your worksheet data (Excel 2007). In the procedure's dialog box, enter your Y variable cell range as the Y Variable Cell Range and your X variable cell range as the X Variable Cell Range. Click First cells in both ranges contain a label, enter a title as the Title, and click OK. When the scatter plot with an initial prediction line appears, use the instructions in the first part of this section to try to produce the prediction line defined by the least-squares method.

[Dialog box: Data (Y Variable Cell Range; X Variable Cell Range; First cells in both ranges contain a label); Output options]

Return to the Using Statistics scenario concerning the Sunflowers Apparel stores. Example 13.2 illustrates how you use the prediction equation to predict the mean annual sales.

EXAMPLE 13.2 PREDICTING MEAN ANNUAL SALES, BASED ON SQUARE FOOTAGE

Use the prediction line to predict the mean annual sales for a store with 4,000 square feet.

SOLUTION You can determine the predicted value by substituting X = 4 (thousands of square feet) into the simple linear regression equation:

$$\hat{Y}_i = 0.9645 + 1.6699 X_i$$

$$\hat{Y}_i = 0.9645 + 1.6699(4) = 7.644 \text{ or } \$7{,}644{,}000$$

Thus, the predicted mean annual sales of a store with 4,000 square feet is $7,644,000.
Predictions in Regression Analysis: Interpolation Versus Extrapolation

When using a regression model for prediction purposes, you need to consider only the relevant range of the independent variable in making predictions. This relevant range includes all values from the smallest to the largest X used in developing the regression model. Hence, when predicting Y for a given value of X, you can interpolate within this relevant range of the X values, but you should not extrapolate beyond the range of X values. When you use the square footage to predict annual sales, the square footage (in thousands of square feet) varies from 1.1 to 5.8 (see Table 13.1 on page 515). Therefore, you should predict annual sales only for stores whose size is between 1.1 and 5.8 thousands of square feet. Any prediction of annual sales for stores outside this range assumes that the observed relationship between sales and store size for store sizes from 1.1 to 5.8 thousand square feet is the same as for stores outside this range. For example, you cannot extrapolate the linear relationship beyond 5,800 square feet in Example 13.2. It would be improper to use the prediction line to forecast the sales for a new store containing 8,000 square feet. It is quite possible that store size has a point of diminishing returns. If that is true, as square footage increases beyond 5,800 square feet, the effect on sales might become smaller and smaller.
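One way to make the relevant range explicit is to build it into the prediction step. The Python sketch below is an illustration under stated assumptions: the function name `predict_annual_sales` and the choice to raise an error outside the range are hypothetical design decisions for this example, while the coefficients (0.9645 and 1.6699) and the range (1.1 to 5.8 thousand square feet) come from this section.

```python
def predict_annual_sales(square_feet_thousands, b0=0.9645, b1=1.6699,
                         x_min=1.1, x_max=5.8):
    """Predict mean annual sales (millions of dollars) from store size.

    Refuses to extrapolate: sizes outside the range of X values used to
    fit the model (x_min to x_max, in thousands of square feet) raise an error.
    """
    if not x_min <= square_feet_thousands <= x_max:
        raise ValueError(
            f"{square_feet_thousands} is outside the relevant range "
            f"[{x_min}, {x_max}]; the prediction line should not be extrapolated"
        )
    return b0 + b1 * square_feet_thousands


# Interpolation: a 4,000-square-foot store, as in Example 13.2
print(round(predict_annual_sales(4.0), 3))

# Extrapolation: an 8,000-square-foot store is rejected
try:
    predict_annual_sales(8.0)
except ValueError as error:
    print(error)
```

Raising an error is a deliberately strict choice. A softer variant could return the prediction together with a warning, leaving the judgment to the analyst.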

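Computing the Y Intercept, $b_0$, and the Slope, $b_1$

For small data sets, you can use a hand calculator to compute the least-squares regression coefficients. Equations (13.3) and (13.4) give the values of $b_0$ and $b_1$, which minimize

$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} [Y_i - (b_0 + b_1 X_i)]^2$$

COMPUTATIONAL FORMULA FOR THE SLOPE, $b_1$

$$b_1 = \frac{SSXY}{SSX} \qquad (13.3)$$

where

$$SSXY = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - \frac{\left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n}$$

$$SSX = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}$$

COMPUTATIONAL FORMULA FOR THE Y INTERCEPT, $b_0$

$$b_0 = \bar{Y} - b_1 \bar{X} \qquad (13.4)$$

where

$$\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n} \qquad \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$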

EXAMPLE 13.3 COMPUTING THE Y INTERCEPT, $b_0$, AND THE SLOPE, $b_1$

Compute the Y intercept, $b_0$, and the slope, $b_1$, for the Sunflowers Apparel data.

SOLUTION Examining Equations (13.3) and (13.4), you see that five quantities must be calculated to determine $b_1$ and $b_0$. These are n, the sample size; $\sum_{i=1}^{n} X_i$, the sum of the X values; $\sum_{i=1}^{n} Y_i$, the sum of the Y values; $\sum_{i=1}^{n} X_i^2$, the sum of the squared X values; and $\sum_{i=1}^{n} X_i Y_i$, the sum of the product of X and Y. For the Sunflowers Apparel data, the number of square feet is used to predict the annual sales in a store. Table 13.2 presents the computations of the various sums needed for the site selection problem, plus $\sum_{i=1}^{n} Y_i^2$, the sum of the squared Y values that will be used to compute SST in Section 13.3.

TABLE 13.2 Computations for the Sunflowers Apparel Data

          Square      Annual
Store     Feet (X)    Sales (Y)      X^2        Y^2         XY
  1         1.7         3.7          2.89      13.69       6.29
  2         1.6         3.9          2.56      15.21       6.24
  3         2.8         6.7          7.84      44.89      18.76
  4         5.6         9.5         31.36      90.25      53.20
  5         1.3         3.4          1.69      11.56       4.42
  6         2.2         5.6          4.84      31.36      12.32
  7         1.3         3.7          1.69      13.69       4.81
  8         1.1         2.7          1.21       7.29       2.97
  9         3.2         5.5         10.24      30.25      17.60
 10         1.5         2.9          2.25       8.41       4.35
 11         5.2        10.7         27.04     114.49      55.64
 12         4.6         7.6         21.16      57.76      34.96
 13         5.8        11.8         33.64     139.24      68.44
 14         3.0         4.1          9.00      16.81      12.30
Totals     40.9        81.8        157.41     594.90     302.30
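Using Equations (13.3) and (13.4), you can compute the values of $b_0$ and $b_1$:

$$b_1 = \frac{SSXY}{SSX}$$

$$SSXY = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - \frac{\left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n}$$

$$SSXY = 302.3 - \frac{(40.9)(81.8)}{14} = 302.3 - 238.97285 = 63.32715$$

$$SSX = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}$$

$$SSX = 157.41 - \frac{(40.9)^2}{14} = 157.41 - 119.48642 = 37.92358$$

so that

$$b_1 = \frac{63.32715}{37.92358} = 1.6699$$

and

$$b_0 = \bar{Y} - b_1 \bar{X}$$

$$\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n} = \frac{81.8}{14} = 5.842857$$

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{40.9}{14} = 2.92143$$

$$b_0 = 5.842857 - (1.6699)(2.92143) = 0.9645$$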
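The hand computation in Example 13.3 can be mirrored step by step in code. The following Python sketch is a minimal illustration, assuming the data of Table 13.1. It applies the computational formulas of Equations (13.3) and (13.4) directly, and the function name `least_squares_coefficients` is a hypothetical name chosen for this example.

```python
# Sunflowers Apparel sample from Table 13.1
square_feet = [1.7, 1.6, 2.8, 5.6, 1.3, 2.2, 1.3, 1.1, 3.2, 1.5, 5.2, 4.6, 5.8, 3.0]
annual_sales = [3.7, 3.9, 6.7, 9.5, 3.4, 5.6, 3.7, 2.7, 5.5, 2.9, 10.7, 7.6, 11.8, 4.1]


def least_squares_coefficients(x_values, y_values):
    """Return (b0, b1) using the computational formulas (13.3) and (13.4)."""
    n = len(x_values)
    sum_x = sum(x_values)
    sum_y = sum(y_values)
    sum_x_squared = sum(x * x for x in x_values)
    sum_xy = sum(x * y for x, y in zip(x_values, y_values))

    ssxy = sum_xy - (sum_x * sum_y) / n      # numerator of Equation (13.3)
    ssx = sum_x_squared - (sum_x ** 2) / n   # denominator of Equation (13.3)

    b1 = ssxy / ssx                          # slope, Equation (13.3)
    b0 = sum_y / n - b1 * (sum_x / n)        # Y intercept, Equation (13.4)
    return b0, b1


b0, b1 = least_squares_coefficients(square_feet, annual_sales)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
```

Because the code carries full precision through every step, intermediate values can differ from the hand computation in the last displayed digit, but the rounded coefficients should agree with Example 13.3.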

Learning the Basics

13.1 Fitting a straight line to a set of data yields the following prediction line:

$$\hat{Y}_i = 2 + 5X_i$$

a. Interpret the meaning of the Y intercept, $b_0$.
b. Interpret the meaning of the slope, $b_1$.
c. Predict the mean value of Y for X = 3.

13.2 If the values of X in Problem 13.1 range from 2 to 25, should you use this model to predict the mean value of Y when X equals
a. 3?
b. -3?
c. 0?
d. 24?

13.3 Fitting a straight line to a set of data yields the following prediction line:

$$\hat{Y}_i = 16 - 0.5X_i$$

a. Interpret the meaning of the Y intercept, $b_0$.
b. Interpret the meaning of the slope, $b_1$.
c. Predict the mean value of Y for X = 6.

Applying the Concepts

13.4 The marketing manager of a large supermarket chain would like to use shelf space to predict the sales of pet food. A random sample of 12 equal-sized stores is selected, with the following results (stored in the accompanying data file):

          Shelf Space (X)    Weekly Sales (Y)
Store         (Feet)              ($)
  1              5                160
  2              5                220
  3              5                140
  4             10                190
  5             10                240
  6             10                260
  7             15                230
  8             15                270
  9             15                280
 10             20                260
 11             20                290
 12             20                310

a. Construct a scatter plot.
For these data, $b_0 = 145$ and $b_1 = 7.4$.
b. Interpret the meaning of the slope, $b_1$, in this problem.
c. Predict the mean weekly sales (in hundreds of dollars) of pet food for stores with 8 feet of shelf space for pet food.

13.5 Circulation is the lifeblood of the publishing business. The larger the sales of a magazine, the more it can charge advertisers. Recently, a circulation gap has appeared between the publishers' reports of magazines' newsstand sales and subsequent audits by the Audit Bureau of Circulations. The data in the accompanying data file represent the reported and audited newsstand sales (in thousands) in 2001 for the following 10 magazines:

Magazine        Reported (X)    Audited (Y)
YM                 621.0           299.6
CosmoGirl          359.7           207.7
Rosie              530.0           325.0
Playboy            492.1           336.3
Esquire             70.5            48.6
Teen People        567.0           400.3
More               125.5            91.2
Spin                50.6            39.1
Vogue              353.3           268.6
Elle               263.6           214.3

Source: Extracted from M. Rose, "In Fight for Ads, Publishers Overstate Their Sales," The Wall Street Journal, August 6, 2003, pp. A1, A10.

a. Construct a scatter plot.
For these data, $b_0 = 26.724$ and $b_1 = 0.5719$.
b. Interpret the meaning of the slope, $b_1$, in this problem.
c. Predict the mean audited newsstand sales for a magazine that reports newsstand sales of 400,000.

13.6 The owner of a moving company typically has his most experienced manager predict the total number of labor hours that will be required to complete an upcoming move. This approach has proved useful in the past, but he would like to be able to develop a more accurate method of predicting labor hours by using the number of cubic feet moved. In a preliminary effort to provide a more accurate method, he has collected data for 36 moves in which the origin and destination were within the borough of Manhattan in New York City and in which the travel time was an insignificant portion of the hours worked. The data are stored in the accompanying data file.
a. Construct a scatter plot.
b. Assuming a linear relationship, use the least-squares method to find the regression coefficients $b_0$ and $b_1$.
c. Interpret the meaning of the slope, $b_1$, in this problem.
d. Predict the mean labor hours for moving 500 cubic feet.

13.7 A large mail-order house believes that there is a linear relationship between the weight of the mail it receives and the number of orders to be filled. It would like to investigate the relationship in order to predict the number of orders, based on the weight of the mail. From an operational perspective, knowledge of the number of orders will help in the planning of the order-fulfillment process. A sample of 25 mail shipments is selected that range from 200 to 700 pounds. The results (stored in the accompanying data file) are as follows:

Weight of Mail    Orders         Weight of Mail    Orders
(Pounds)          (Thousands)    (Pounds)          (Thousands)
   216               6.1            432               13.6
   283               9.1            409               12.8
   237               7.2            553               16.5
   203               7.5            572               17.1
   259               6.9            506               15.0
   374              11.5            528               16.2
   342              10.3            501               15.8
   301               9.5            628               19.0
   365               9.2            677               19.4
   384              10.6            602               19.1
   404              12.5            630               18.0
   426              12.9            652               20.2
   482              14.5

a. Construct a scatter plot.
b. Assuming a linear relationship, use the least-squares method to find the regression coefficients $b_0$ and $b_1$.
c. Interpret the meaning of the slope, $b_1$, in this problem.
d. Predict the mean number of orders when the weight of the mail is 500 pounds.

13.8 The value of a sports franchise is directly related to the amount of revenue that a franchise can generate. The data in the accompanying data file represent the value in 2005 (in millions of dollars) and the annual revenue (in millions of dollars) for 30 baseball franchises. Suppose you want to develop a simple linear regression model to predict franchise value based on annual revenue generated.
a. Construct a scatter plot.
b. Use the least-squares method to find the regression coefficients $b_0$ and $b_1$.
c. Interpret the meaning of $b_0$ and $b_1$ in this problem.
d. Predict the mean value of a baseball franchise that generates $150 million of annual revenue.

13.9 An agent for a residential real estate company in a large city would like to be able to predict the monthly rental cost for apartments, based on the size of the apartment, as defined by square footage. A sample of 25 apartments (stored in the accompanying data file) in a particular residential neighborhood was selected, and the information gathered revealed the following:

             Monthly     Size                        Monthly     Size
             Rent        (Square                     Rent        (Square
Apartment    ($)         Feet)         Apartment     ($)         Feet)
    1          950         850            14         1,800       1,369
    2        1,600       1,450            15         1,400       [illegible]
    3        1,200       1,085            16         1,450       1,225
    4        1,500       [illegible]      17         1,100       1,245
    5          950         718            18         1,700       1,259
    6        1,700       1,485            19         1,200       1,150
    7        1,650       1,136            20         1,150         896
    8          935         726            21         1,600       1,361
    9          875         700            22         1,650       1,040
   10        1,150         956            23         1,200         755
   11        1,400       1,100            24           800       1,000
   12        1,650       1,285            25         1,750       1,200
   13        2,300       1,985

a. Construct a scatter plot.
b. Use the least-squares method to find the regression coefficients $b_0$ and $b_1$.
c. Interpret the meaning of $b_0$ and $b_1$ in this problem.
d. Predict the mean monthly rent for an apartment that has 1,000 square feet.
e. Why would it not be appropriate to use the model to predict the monthly rent for apartments that have 500 square feet?
f. Your friends Jim and Jennifer are considering signing a lease for an apartment in this residential neighborhood. They are trying to decide between two apartments, one with 1,000 square feet for a monthly rent of $1,275 and the other with 1,200 square feet for a monthly rent of $1,425. What would you recommend to them based on (a) through (d)?

13.10 The data in the accompanying data file provide measurements on the hardness and tensile strength for 35 specimens of die-cast aluminum. It is believed that hardness (measured in Rockwell E units) can be used to predict tensile strength (measured in thousands of pounds per square inch).
a. Construct a scatter plot.
b. Assuming a linear relationship, use the least-squares method to find the regression coefficients $b_0$ and $b_1$.
c. Interpret the meaning of the slope, $b_1$, in this problem.
d. Predict the mean tensile strength for die-cast aluminum that has a hardness of 30 Rockwell E units.

13.3 MEASURES OF VARIATION

When using the least-squares method to determine the regression coefficients for a set of data, you need to compute three important measures of variation. The first measure, the total sum of squares (SST), is a measure of variation of the $Y_i$ values around their mean, $\bar{Y}$. In a regression analysis, the total variation or total sum of squares is subdivided into explained variation and unexplained variation. The explained variation, or regression sum of squares (SSR), is due to the relationship between X and Y, and the unexplained variation, or error sum of squares (SSE), is due to factors other than the relationship between X and Y. Figure 13.6 shows these different measures of variation.

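FIGURE 13.6 Measures of variation (prediction line $\hat{Y}_i = b_0 + b_1 X_i$; total sum of squares $\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = SST$; regression sum of squares $\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = SSR$; error sum of squares $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = SSE$)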

Computing the Sum of Squares

The regression sum of squares (SSR) is based on the difference between $\hat{Y}_i$ (the predicted value of Y from the prediction line) and $\bar{Y}$ (the mean value of Y). The error sum of squares (SSE) represents the part of the variation in Y that is not explained by the regression. It is based on the difference between $Y_i$ and $\hat{Y}_i$. Equations (13.5), (13.6), (13.7), and (13.8) define these measures of variation.

MEASURES OF VARIATION IN REGRESSION

The total sum of squares is equal to the regression sum of squares plus the error sum of squares.

$$SST = SSR + SSE \qquad (13.5)$$

TOTAL SUM OF SQUARES (SST)

The total sum of squares (SST) is equal to the sum of the squared differences between each observed Y value and $\bar{Y}$, the mean value of Y.

$$SST = \text{Total sum of squares} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \qquad (13.6)$$

REGRESSION SUM OF SQUARES (SSR)

The regression sum of squares (SSR) is equal to the sum of the squared differences between the predicted value of Y and $\bar{Y}$, the mean value of Y.

$$SSR = \text{Explained variation or regression sum of squares} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 \qquad (13.7)$$

ERROR SUM OF SQUARES (SSE)

The error sum of squares (SSE) is equal to the sum of the squared differences between the observed value of Y and the predicted value of Y.

$$SSE = \text{Unexplained variation or error sum of squares} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \qquad (13.8)$$
Figure 13.7 shows the sum of squares area of the worksheet containing the Microsoft Excel results for the Sunflowers Apparel data. The total variation, SST, is equal to 116.9543. This amount is subdivided into the sum of squares explained by the regression (SSR), equal to 105.7476, and the sum of squares unexplained by the regression (SSE), equal to 11.2067. From Equation (13.5) on page 524:

$$SST = SSR + SSE$$

$$116.9543 = 105.7476 + 11.2067$$
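The three measures of variation can also be computed directly from the definitions in Equations (13.6), (13.7), and (13.8). The Python sketch below is an illustration, assuming the data of Table 13.1. It refits the prediction line with the deviation form of the least-squares formulas so that the block is self-contained, and the variable names are choices made for this example.

```python
# Sunflowers Apparel sample from Table 13.1
square_feet = [1.7, 1.6, 2.8, 5.6, 1.3, 2.2, 1.3, 1.1, 3.2, 1.5, 5.2, 4.6, 5.8, 3.0]
annual_sales = [3.7, 3.9, 6.7, 9.5, 3.4, 5.6, 3.7, 2.7, 5.5, 2.9, 10.7, 7.6, 11.8, 4.1]

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n

# Least-squares coefficients (deviation form of Equations 13.3 and 13.4)
ssxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, annual_sales))
ssx = sum((x - x_bar) ** 2 for x in square_feet)
b1 = ssxy / ssx
b0 = y_bar - b1 * x_bar

# Predicted value for each store from the prediction line
predicted = [b0 + b1 * x for x in square_feet]

# Equations (13.6), (13.7), and (13.8)
sst = sum((y - y_bar) ** 2 for y in annual_sales)
ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(annual_sales, predicted))

print(f"SST = {sst:.4f}")
print(f"SSR = {ssr:.4f}")
print(f"SSE = {sse:.4f}")
print(f"SSR + SSE = {ssr + sse:.4f}")
```

Up to floating-point rounding, SSR + SSE should equal SST, which is the identity stated in Equation (13.5).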

FIGURE 13.7 Microsoft Excel sum of squares for the Sunflowers Apparel data

ANOVA          df        SS          MS           F        Significance F
Regression      1     105.7476    105.7476    113.2335        0.0000
Residual       12      11.2067      0.9339
Total          13     116.9543

See Section E13.1 to create the worksheet that contains this area.

In a datasetthat hasa largenumberof significantdigits,the resultsof a regression


analy-
sisaresometimes displayedusinga numericalformatknownasscientificnotation.This typeof
format is usedto displayvery small or very largevalues.The numberafterthe letterE repre-
sentsthe numberof digits that the decimalpoint needsto be movedto the left (for a negative
number)or to the right (for a positivenumber).For example,the number3.7431E+02means
that the decimalpoint shouldbe movedtwo placesto the right, producingthe number374.31.
The number3.'7431E-02 meansthat the decimalpoint shouldbe movedtwo placesto the left,
producingthe number0.037431.When scientificnotationis used,fewersignificantdigits are
usuallydisplayedandthe numbersmay appearto be rounded.

The Coefficient of Determination


By themselves, SSR, SSE, and SST provide little information. However, the ratio of the regression sum of squares (SSR) to the total sum of squares (SST) measures the proportion of variation in Y that is explained by the independent variable X in the regression model. This ratio is called the coefficient of determination, $r^2$, and is defined in Equation (13.9).

COEFFICIENT OF DETERMINATION
The coefficient of determination is equal to the regression sum of squares (that is, explained variation) divided by the total sum of squares (that is, total variation).

$$r^2 = \frac{\text{Regression sum of squares}}{\text{Total sum of squares}} = \frac{SSR}{SST} \qquad (13.9)$$
The coefficient of determination measures the proportion of variation in Y that is explained by the independent variable X in the regression model. For the Sunflowers Apparel data, with SSR = 105.7476, SSE = 11.2067, and SST = 116.9543,

$$r^2 = \frac{105.7476}{116.9543} = 0.9042$$

Therefore, 90.42% of the variation in annual sales is explained by the variability in the size of the store, as measured by the square footage. This large $r^2$ indicates a strong positive linear relationship between the two variables because the use of a regression model has reduced the variability in predicting annual sales by 90.42%. Only 9.58% of the sample variability in annual sales is due to factors other than what is accounted for by the linear regression model that uses square footage.
Figure 13.8 presents the coefficient of determination portion of the Microsoft Excel results for the Sunflowers Apparel data.

FIGURE 13.8
Partial Microsoft Excel regression results for the Sunflowers Apparel data (rows shown: Multiple R, R Square, Adjusted R Square, and Standard Error, $S_{YX}$ = 0.9664)
See Section E13.1 to create the worksheet that contains this area.

EXAMPLE 13.4 COMPUTING THE COEFFICIENT OF DETERMINATION
Compute the coefficient of determination, $r^2$, for the Sunflowers Apparel data.
SOLUTION You can compute SST, SSR, and SSE, which are defined in Equations (13.6), (13.7), and (13.8) on pages 524-525, by using Equations (13.10), (13.11), and (13.12).

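COMPUTATIONAL FORMULA FOR SST

$$SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n} \qquad (13.10)$$

COMPUTATIONAL FORMULA FOR SSR

$$SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = b_0 \sum_{i=1}^{n} Y_i + b_1 \sum_{i=1}^{n} X_i Y_i - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n} \qquad (13.11)$$

COMPUTATIONAL FORMULA FOR SSE

$$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} Y_i^2 - b_0 \sum_{i=1}^{n} Y_i - b_1 \sum_{i=1}^{n} X_i Y_i \qquad (13.12)$$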

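Using the summary results from Table 13.2 on page 520,

$$SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n} = 594.9 - \frac{(81.8)^2}{14} = 594.9 - 477.94571 = 116.95429$$

$$SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = b_0 \sum_{i=1}^{n} Y_i + b_1 \sum_{i=1}^{n} X_i Y_i - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n} = (0.964478)(81.8) + (1.66986)(302.3) - \frac{(81.8)^2}{14} = 105.74726$$

$$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} Y_i^2 - b_0 \sum_{i=1}^{n} Y_i - b_1 \sum_{i=1}^{n} X_i Y_i = 594.9 - (0.964478)(81.8) - (1.66986)(302.3) = 11.2067$$

Therefore,

$$r^2 = \frac{105.74726}{116.95429} = 0.9042$$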
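The arithmetic in Example 13.4 can be retraced in code. The sketch below is written in Python (an assumption; the chapter itself uses Microsoft Excel) and applies the computational formulas (13.10) through (13.12) to the summary sums quoted in the example. The variable names are illustrative only.

```python
# Summary results for the Sunflowers Apparel data, as quoted in Example 13.4
n = 14
sum_y = 81.8     # sum of the Y values
sum_y2 = 594.9   # sum of the squared Y values
sum_xy = 302.3   # sum of the X*Y products
b0 = 0.964478    # Y intercept
b1 = 1.66986     # slope

correction = sum_y ** 2 / n                    # (sum of Y)^2 / n
sst = sum_y2 - correction                      # Equation (13.10)
ssr = b0 * sum_y + b1 * sum_xy - correction    # Equation (13.11)
sse = sum_y2 - b0 * sum_y - b1 * sum_xy        # Equation (13.12)
r2 = ssr / sst                                 # Equation (13.9)

print(f"SST = {sst:.4f}, SSR = {ssr:.4f}, SSE = {sse:.4f}")
print(f"r^2 = {r2:.4f}")
```

Because the coefficients above are rounded to six significant digits, SSE computed this way can differ from the Excel value of 11.2067 in the fourth decimal place.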

StandardError of the Estimate


Although the least-squares method results in the line that fits the data with the minimum amount of error, unless all the observed data points fall on a straight line, the prediction line is not a perfect predictor. Just as all data values cannot be expected to be exactly equal to their mean, neither can they be expected to fall exactly on the prediction line. An important statistic, called the standard error of the estimate, measures the variability of the actual Y values from the predicted Y values in the same way that the standard deviation in Chapter 3 measures the variability of each value around the sample mean. In other words, the standard error of the estimate is the standard deviation around the prediction line, whereas the standard deviation in Chapter 3 is the standard deviation around the sample mean.
Figure 13.5 on page 517 illustrates the variability around the prediction line for the Sunflowers Apparel data. Observe that although many of the actual values of Y fall near the prediction line, none of the values are exactly on the line.
The standard error of the estimate, represented by the symbol $S_{YX}$, is defined in Equation (13.13).

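STANDARD ERROR OF THE ESTIMATE

$$S_{YX} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}} \qquad (13.13)$$

where
$Y_i$ = actual value of Y for a given $X_i$
$\hat{Y}_i$ = predicted value of Y for a given $X_i$
SSE = error sum of squares

From Equation (13.8) and Figure 13.4 on page 516, SSE = 11.2067. Thus,

$$S_{YX} = \sqrt{\frac{11.2067}{14-2}} = 0.9664$$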

This standard error of the estimate, equal to 0.9664 millions of dollars (that is, $966,400), is labeled Standard Error in the Microsoft Excel results shown in Figure 13.8 on page 526. The standard error of the estimate represents a measure of the variation around the prediction line. It is measured in the same units as the dependent variable Y. The interpretation of the standard error of the estimate is similar to that of the standard deviation. Just as the standard deviation measures variability around the mean, the standard error of the estimate measures variability around the prediction line. For Sunflowers Apparel, the typical difference between actual annual sales at a store and the predicted annual sales using the regression equation is approximately $966,400.
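Equation (13.13) is a one-line calculation once SSE is known. The following Python sketch (Python being an assumption; the chapter uses Microsoft Excel) reproduces the Sunflowers Apparel value from SSE = 11.2067 and n = 14. The variable names are illustrative.

```python
import math

sse = 11.2067   # error sum of squares, from Equation (13.8)
n = 14          # number of stores in the sample

# Standard error of the estimate, Equation (13.13)
s_yx = math.sqrt(sse / (n - 2))

print(round(s_yx, 4))  # 0.9664
```

The divisor is n - 2 rather than n because two coefficients, the intercept and the slope, were estimated from the data.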

Learning the Basics

13.11 How do you interpret a coefficient of determination, $r^2$, equal to 0.80?

13.12 If SSR = 36 and SSE = 4, determine SST and then compute the coefficient of determination, $r^2$, and interpret its meaning.

13.13 If SSR = 66 and SST = 88, compute the coefficient of determination, $r^2$, and interpret its meaning.

13.14 If SSE = 10 and SSR = 30, compute the coefficient of determination, $r^2$, and interpret its meaning.

13.15 If SSR = 120, why is it impossible for SST to equal 110?

Applying the Concepts

13.16 In Problem 13.4 on page 522, the marketing manager used shelf space for pet food to predict weekly sales (stored in the accompanying data file). For that data, SSR = 20,535 and SST = 30,025.
a. Determine the coefficient of determination, $r^2$, and interpret its meaning.
b. Determine the standard error of the estimate.
c. How useful do you think this regression model is for predicting sales?

13.17 In Problem 13.5 on page 522, you used reported magazine newsstand sales to predict audited sales (stored in the accompanying data file). For that data, SSR = 130,301.41 and SST = 144,538.64.
a. Determine the coefficient of determination, $r^2$, and interpret its meaning.
b. Determine the standard error of the estimate.
c. How useful do you think this regression model is for predicting audited sales?

13.18 In Problem 13.6 on page 522, an owner of a moving company wanted to predict labor hours, based on the cubic feet moved (stored in the accompanying data file). Using the results of that problem,
a. determine the coefficient of determination, $r^2$, and interpret its meaning.
b. determine the standard error of the estimate.
c. How useful do you think this regression model is for predicting labor hours?

13.19 In Problem 13.7 on page 523, you used the weight of mail to predict the number of orders received (stored in the accompanying data file). Using the results of that problem,
a. determine the coefficient of determination, $r^2$, and interpret its meaning.
b. find the standard error of the estimate.
c. How useful do you think this regression model is for predicting the number of orders?

13.20 In Problem 13.8 on page 523, you used annual revenues to predict the value of a baseball franchise (stored in the accompanying data file). Using the results of that problem,
a. determine the coefficient of determination, $r^2$, and interpret its meaning.
b. determine the standard error of the estimate.
c. How useful do you think this regression model is for predicting the value of a baseball franchise?

13.21 In Problem 13.9 on page 523, an agent for a real estate company wanted to predict the monthly rent for apartments, based on the size of the apartment (stored in the accompanying data file). Using the results of that problem,
a. determine the coefficient of determination, $r^2$, and interpret its meaning.
b. determine the standard error of the estimate.
c. How useful do you think this regression model is for predicting the monthly rent?

13.22 In Problem 13.10 on page 523, you used hardness to predict the tensile strength of die-cast aluminum (stored in the accompanying data file). Using the results of that problem,
a. determine the coefficient of determination, $r^2$, and interpret its meaning.
b. find the standard error of the estimate.
c. How useful do you think this regression model is for predicting the tensile strength of die-cast aluminum?
13.4 ASSUMPTIONS
The discussion of hypothesis testing and the analysis of variance emphasized the importance of the assumptions to the validity of any conclusions reached. The assumptions necessary for regression are similar to those of the analysis of variance because both topics fall in the general category of linear models (reference 4).
The four assumptions of regression (known by the acronym LINE) are as follows:
- Linearity
- Independence of errors
- Normality of error
- Equal variance
The first assumption, linearity, states that the relationship between variables is linear. Relationships between variables that are not linear are discussed in Chapter 15.
The second assumption, independence of errors, requires that the errors ($\varepsilon_i$) are independent of one another. This assumption is particularly important when data are collected over a period of time. In such situations, the errors for a specific time period are sometimes correlated with those of the previous time period.
The third assumption, normality, requires that the errors ($\varepsilon_i$) are normally distributed at each value of X. Like the t test and the ANOVA F test, regression analysis is fairly robust against departures from the normality assumption. As long as the distribution of the errors at each level of X is not extremely different from a normal distribution, inferences about $\beta_0$ and $\beta_1$ are not seriously affected.
The fourth assumption, equal variance or homoscedasticity, requires that the variance of the errors ($\varepsilon_i$) be constant for all values of X. In other words, the variability of Y values is the same when X is a low value as when X is a high value. The equal variance assumption is important when making inferences about $\beta_0$ and $\beta_1$. If there are serious departures from this assumption, you can use either data transformations or weighted least-squares methods (see reference 4).

13.5 RESIDUAL ANALYSIS
In Section 13.1, regression analysis was introduced. In Sections 13.2 and 13.3, a regression model was developed using the least-squares approach for the Sunflowers Apparel data. Is this the correct model for these data? Are the assumptions introduced in Section 13.4 valid? In this section, a graphical approach called residual analysis is used to evaluate the assumptions and determine whether the regression model selected is an appropriate model.
The residual or estimated error value, $e_i$, is the difference between the observed ($Y_i$) and predicted ($\hat{Y}_i$) values of the dependent variable for a given value of $X_i$. Graphically, a residual appears on a scatter plot as the vertical distance between an observed value of Y and the prediction line. Equation (13.14) defines the residual.

RESIDUAL
The residual is equal to the difference between the observed value of Y and the predicted value of Y.

$$e_i = Y_i - \hat{Y}_i \qquad (13.14)$$
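To make Equation (13.14) concrete, the sketch below fits a least-squares line and then computes one residual per observation. It is written in Python (an assumption; the chapter uses Microsoft Excel), the five data points are hypothetical and are not the Sunflowers Apparel data, and the function names `fit_line` and `residuals` are illustrative.

```python
def fit_line(x, y):
    """Return the least-squares intercept b0 and slope b1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / ssx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def residuals(x, y, b0, b1):
    """Residual e_i = observed Y_i minus predicted Y_i, Equation (13.14)."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = fit_line(x, y)
e = residuals(x, y, b0, b1)
print([round(ei, 2) for ei in e])
```

For a least-squares line with an intercept, the residuals sum to zero (up to rounding), which is a useful sanity check before plotting them.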

Evaluating the Assumptions
Recall from Section 13.4 that the four assumptions of regression (known by the acronym LINE) are linearity, independence, normality, and equal variance.

Linearity To evaluate linearity, you plot the residuals on the vertical axis against the corresponding $X_i$ values of the independent variable on the horizontal axis. If the linear model is appropriate for the data, there is no apparent pattern in this plot. However, if the linear model is not appropriate, there is a relationship between the $X_i$ values and the residuals, $e_i$. You can see such a pattern in Figure 13.9. Panel A shows a situation in which, although there is an increasing trend in Y as X increases, the relationship seems curvilinear because the upward trend decreases for increasing values of X. This quadratic effect is highlighted in Panel B, where there is a clear relationship between $X_i$ and $e_i$. By plotting the residuals, the linear trend of X with Y has been removed, thereby exposing the lack of fit in the simple linear model. Thus, a quadratic model is a better fit and should be used in place of the simple linear model. (See Section 15.1 for further discussion of fitting quadratic models.)
To determine whether the simple linear regression model is appropriate, return to the evaluation of the Sunflowers Apparel data. Figure 13.10 provides the predicted and residual values of the response variable (annual sales) computed by Microsoft Excel.
FIGURE 13.9
Studying the appropriateness of the simple linear regression model (Panels A and B)
FIGURE 13.10
Microsoft Excel residual statistics for the Sunflowers Apparel data

Observation   Predicted Annual Sales   Residuals
 1             3.803239598             -0.103239598
 2             3.636253367              0.263746633
 3             5.640088147              1.059911853
 4            10.31570263              -0.815702635
 5             3.135294672              0.264705328
 6             4.638170757              0.961829243
 7             3.135294672              0.564705328
 8             2.801322208             -0.101322208
 9             6.308033074             -0.808033074
10             3.469267135             -0.569267135
11             9.647757708              1.052242292
12             8.645840318             -1.045840318
13            10.6496751                1.150324902
14             5.974060611             -1.874060611

See Section E13.3 to create the worksheet that contains this area.

To assess linearity, the residuals are plotted against the independent variable (store size, in thousands of square feet) in Figure 13.11. Although there is widespread scatter in the residual plot, there is no apparent pattern or relationship between the residuals and $X_i$. The residuals appear to be evenly spread above and below 0 for the differing values of X. You can conclude that the linear model is appropriate for the Sunflowers Apparel data.

FIGURE 13.11
Microsoft Excel plot of residuals against the square footage of a store for the Sunflowers Apparel data (plot title: Square Feet Residual Plot; horizontal axis: Square Feet)
See Section E2.12 to create this.

Independence You can evaluate the assumption of independence of the errors by plotting the residuals in the order or sequence in which the data were collected. Data collected over periods of time sometimes exhibit an autocorrelation effect among successive observations. In these instances, there is a relationship between consecutive residuals. If this relationship exists (which violates the assumption of independence), it is apparent in the plot of the residuals versus the time in which the data were collected. You can also test for autocorrelation by using the Durbin-Watson statistic, which is the subject of Section 13.6. Because the Sunflowers Apparel data were collected during the same time period, you do not need to evaluate the independence assumption.

Normality You can evaluate the assumption of normality in the errors by tallying the residuals into a frequency distribution and displaying the results in a histogram (covered in an earlier section). For the Sunflowers Apparel data, the residuals have been tallied into a frequency distribution in Table 13.3. (There are an insufficient number of values, however, to construct a histogram.) You can also evaluate the normality assumption by comparing the actual versus theoretical values of the residuals or by constructing a normal probability plot of the residuals (see Section 6.3). Figure 13.12 is a normal probability plot of the residuals for the Sunflowers Apparel data.

TABLE 13.3 Frequency Distribution of 14 Residual Values for the Sunflowers Apparel Data

Residuals                        Frequency
-2.25 but less than -1.75         1
-1.75 but less than -1.25         0
-1.25 but less than -0.75         3
-0.75 but less than -0.25         1
-0.25 but less than +0.25         2
+0.25 but less than +0.75         3
+0.75 but less than +1.25         4
Total                            14
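The tally in Table 13.3 can be reproduced from the residuals listed in Figure 13.10. The sketch below is in Python (an assumption; the chapter uses Microsoft Excel). The residuals are transcribed from Figure 13.10 and rounded to four decimal places, and the class boundaries are those of Table 13.3; the variable names are illustrative.

```python
import math

# Residuals from Figure 13.10, rounded to four decimal places
residuals = [-0.1032, 0.2637, 1.0599, -0.8157, 0.2647, 0.9618, 0.5647,
             -0.1013, -0.8080, -0.5693, 1.0522, -1.0458, 1.1503, -1.8741]

lower = -2.25    # lower boundary of the first class
width = 0.50     # class width
n_classes = 7

counts = [0] * n_classes
for e in residuals:
    # Index of the class "lower + k*width but less than lower + (k+1)*width"
    k = math.floor((e - lower) / width)
    counts[k] += 1

for k, count in enumerate(counts):
    lo = lower + k * width
    print(f"{lo:+.2f} but less than {lo + width:+.2f}: {count}")
```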

FIGURE 13.12
Microsoft Excel normal probability plot of the residuals for the Sunflowers Apparel data (plot title: Normal Probability Plot of the Residuals; horizontal axis: Z Value)
See Section E6.2 to create this.

It is difficult to evaluate the normality assumption for a sample of only 14 values, regardless of whether you use a histogram, stem-and-leaf display, box-and-whisker plot, or normal probability plot. You can see from Figure 13.12 that the data do not appear to depart substantially from a normal distribution. The robustness of regression analysis with modest departures from normality enables you to conclude that you should not be overly concerned about departures from this normality assumption in the Sunflowers Apparel data.
Equal Variance You can evaluate the assumption of equal variance from a plot of the residuals with $X_i$. For the Sunflowers Apparel data of Figure 13.11 on page 531, there do not appear to be major differences in the variability of the residuals for different $X_i$ values. Thus, you can conclude that there is no apparent violation in the assumption of equal variance at each level of X.
To examine a case in which the equal variance assumption is violated, observe Figure 13.13, which is a plot of the residuals with $X_i$ for a hypothetical set of data. In this plot, the variability of the residuals increases dramatically as X increases, demonstrating the lack of homogeneity in the variances of $Y_i$ at each level of X. For these data, the equal variance assumption is invalid.

FIGURE 13.13
(Residual plot for a hypothetical data set in which the equal variance assumption is violated)

Learning the Basics

13.23 The results below provide the X values, residuals, and a residual plot from a regression analysis:

[Residual plot]

Is there any evidence of a pattern in the residuals? Explain.

13.24 The results below show the X values, residuals, and a residual plot from a regression analysis:

[Residual plot]

Is there any evidence of a pattern in the residuals? Explain.

Applying the Concepts

13.25 In Problem 13.5 on page 522, you used reported magazine newsstand sales to predict audited sales. The data are stored in the accompanying data file. Perform a residual analysis for these data.
a. Determine the adequacy of the fit of the model.
b. Evaluate whether the assumptions of regression have been seriously violated.

13.26 In Problem 13.4 on page 522, the marketing manager used shelf space for pet food to predict weekly sales. The data are stored in the accompanying data file. Perform a residual analysis for these data.
a. Determine the adequacy of the fit of the model.
b. Evaluate whether the assumptions of regression have been seriously violated.

13.27 In Problem 13.7 on page 523, you used the weight of mail to predict the number of orders received. Perform a residual analysis for these data. The data are stored in the accompanying data file. Based on these results,
a. determine the adequacy of the fit of the model.
b. evaluate whether the assumptions of regression have been seriously violated.

13.28 In Problem 13.6 on page 522, the owner of a moving company wanted to predict labor hours based on the cubic feet moved. Perform a residual analysis for these data. The data are stored in the accompanying data file. Based on these results,
a. determine the adequacy of the fit of the model.
b. evaluate whether the assumptions of regression have been seriously violated.

13.29 In Problem 13.9 on page 523, an agent for a real estate company wanted to predict the monthly rent for apartments, based on the size of the apartments. Perform a residual analysis for these data. The data are stored in the accompanying data file. Based on these results,
a. determine the adequacy of the fit of the model.
b. evaluate whether the assumptions of regression have been seriously violated.

13.30 In Problem 13.8 on page 523, you used annual revenues to predict the value of a baseball franchise. The data are stored in the accompanying data file. Perform a residual analysis for these data. Based on these results,
a. determine the adequacy of the fit of the model.
b. evaluate whether the assumptions of regression have been seriously violated.

13.31 In Problem 13.10 on page 523, you used hardness to predict the tensile strength of die-cast aluminum. The data are stored in the accompanying data file. Perform a residual analysis for these data. Based on these results,
a. determine the adequacy of the fit of the model.
b. evaluate whether the assumptions of regression have been seriously violated.

13.6 MEASURING AUTOCORRELATION: THE DURBIN-WATSON STATISTIC
One of the basic assumptions of the regression model is the independence of the errors. This assumption is sometimes violated when data are collected over sequential time periods because a residual at any one time period may tend to be similar to residuals at adjacent time periods. This pattern in the residuals is called autocorrelation. When a set of data has substantial autocorrelation, the validity of a regression model can be in serious doubt.

Residual Plots to Detect Autocorrelation

As mentioned in Section 13.5, one way to detect autocorrelation is to plot the residuals in time order. If a positive autocorrelation effect is present, there will be clusters of residuals with the same sign, and you will readily detect an apparent pattern. If negative autocorrelation exists, residuals will tend to jump back and forth from positive to negative to positive, and so on. This type of pattern is very rarely seen in regression analysis. Thus, the focus of this section is on positive autocorrelation. To illustrate positive autocorrelation, consider the following example.
The manager of a package delivery store wants to predict weekly sales, based on the number of customers making purchases for a period of 15 weeks. In this situation, because data are collected over a period of 15 consecutive weeks at the same store, you need to determine whether autocorrelation is present. Table 13.4 presents the data (stored in the accompanying data file). Figure 13.14 illustrates Microsoft Excel results for these data.
TABLE 13.4 Customers and Sales for a Period of 15 Consecutive Weeks

Week   Customers   Sales (Thousands of Dollars)
 1     794          9.33
 2     799          8.26
 3     831          7.48
 4     855          9.08
 5     845          9.83
 6     844         10.09
 7     863         11.01
 8     875         11.49
 9     880         12.07
10     905         12.55
11     886         11.92
12     843         10.27
13     904         11.80
14     950         12.15
15     841          9.64

FIGURE 13.14
Microsoft Excel results for the package delivery store data of Table 13.4
See Section E13.1 to create this.

From Figure 13.14, observe that $r^2$ is 0.6514, indicating that 65.14% of the variation in sales is explained by variation in the number of customers. In addition, the Y intercept, $b_0$, is -16.0322, and the slope, $b_1$, is 0.0308. However, before using this model for prediction, you must undertake proper analyses of the residuals. Because the data have been collected over a consecutive period of 15 weeks, in addition to checking the linearity, normality, and equal-variance assumptions, you must investigate the independence-of-errors assumption. You can plot the residuals versus time to help you see whether a pattern exists. In Figure 13.15, you can see that the residuals tend to fluctuate up and down in a cyclical pattern. This cyclical pattern provides strong cause for concern about the autocorrelation of the residuals and, hence, a violation of the independence-of-errors assumption.

FIGURE 13.15
Microsoft Excel residual plot for the package delivery store data of Table 13.4 (plot title: Package Delivery Store Sales Analysis Residual Plot)
See Section E13.3 to create this.

The Durbin-Watson Statistic
The Durbin-Watson statistic is used to measure autocorrelation. This statistic measures the correlation between each residual and the residual for the time period immediately preceding the one of interest. Equation (13.15) defines the Durbin-Watson statistic.

DURBIN-WATSON STATISTIC

$$D = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2} \qquad (13.15)$$

where

$e_i$ = residual at the time period i

To better understand the Durbin-Watson statistic, D, you can examine Equation (13.15). The numerator, $\sum_{i=2}^{n} (e_i - e_{i-1})^2$, represents the squared difference between two successive residuals, summed from the second value to the nth value. The denominator, $\sum_{i=1}^{n} e_i^2$, represents the sum of the squared residuals. When successive residuals are positively autocorrelated, the value of D approaches 0. If the residuals are not correlated, the value of D will be close to 2. (If there is negative autocorrelation, D will be greater than 2 and could even approach its maximum value of 4.) For the package delivery store data, as shown in the Microsoft Excel results of Figure 13.16, the Durbin-Watson statistic, D, is 0.8830.
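Equation (13.15) translates directly into a short function. The sketch below is in Python (an assumption; the chapter computes D in Microsoft Excel), and the two residual series are hypothetical, chosen only to show how D behaves. The function name `durbin_watson` is illustrative.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic D, Equation (13.15).

    `residuals` must be listed in the time order in which
    the data were collected.
    """
    # Squared differences between successive residuals, i = 2, ..., n
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2
                    for i in range(1, len(residuals)))
    # Sum of the squared residuals, i = 1, ..., n
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals: neighbors tend to share a sign (positive autocorrelation)
clustered = [-2.0, -1.0, -1.0, 1.0, 1.0, 2.0]
# Hypothetical residuals: signs alternate (negative autocorrelation)
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

print(durbin_watson(clustered))    # well below 2
print(durbin_watson(alternating))  # well above 2
```

The clustered series gives a value well below 2 and the alternating series a value well above 2, matching the interpretation in the paragraph above.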

FIGURE 13.16
Microsoft Excel results of the Durbin-Watson statistic for the package delivery store data
See Section E13.4 to create this.

You need to determine when the autocorrelation is large enough to make the Durbin-Watson statistic, D, fall sufficiently below 2 to conclude that there is significant positive autocorrelation. After computing D, you compare it to the critical values of the Durbin-Watson statistic found in Table E.10, a portion of which is presented in Table 13.5. The critical values depend on $\alpha$, the significance level chosen, n, the sample size, and k, the number of independent variables in the model (in simple linear regression, k = 1).

TABLE 13.5 Finding Critical Values of the Durbin-Watson Statistic

$\alpha$ = .05

        k = 1          k = 2          k = 3          k = 4          k = 5
n      dL     dU      dL     dU      dL     dU      dL     dU      dL     dU
15    1.08   1.36     .95   1.54     .82    …       .69   1.97     …     2.21
16     …      …       .98   1.54     .86    …       .74   1.93     .62   2.15
17     …      …      1.02   1.54     .90    …       .78   1.90     .67   2.10
18     …      …      1.05   1.53     .93    …       .82   1.87     .71   2.06

In Table 13.5, two values are shown for each combination of $\alpha$ (level of significance), n (sample size), and k (number of independent variables in the model). The first value, $d_L$, represents the lower critical value. If D is below $d_L$, you conclude that there is evidence of positive autocorrelation among the residuals. If this occurs, the least-squares method used in this chapter is inappropriate, and you should use alternative methods (see reference 4). The second value, $d_U$, represents the upper critical value of D, above which you would conclude that there is no evidence of positive autocorrelation among the residuals. If D is between $d_L$ and $d_U$, you are unable to arrive at a definite conclusion.
For the package delivery store data, with one independent variable (k = 1) and 15 values (n = 15), $d_L$ = 1.08 and $d_U$ = 1.36. Because D = 0.8830 < 1.08, you conclude that there is positive autocorrelation among the residuals. The least-squares regression analysis of the data is inappropriate because of the presence of significant positive autocorrelation among the residuals. In other words, the independence-of-errors assumption is invalid. You need to use alternative approaches discussed in reference 4.
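The three-way decision rule just described can be written as a small function. This is a sketch in Python (an assumption; the chapter uses Microsoft Excel and a printed table of critical values). The function name and the wording of the returned labels are illustrative, and a value of D exactly equal to a critical value is treated here as inconclusive, a choice the text does not address.

```python
def durbin_watson_decision(d, d_lower, d_upper):
    """Apply the decision rule for positive autocorrelation.

    d        -- computed Durbin-Watson statistic D
    d_lower  -- lower critical value d_L from the table
    d_upper  -- upper critical value d_U from the table
    """
    if d < d_lower:
        return "evidence of positive autocorrelation"
    if d > d_upper:
        return "no evidence of positive autocorrelation"
    return "inconclusive"


# Package delivery store data: D = 0.8830, with d_L = 1.08 and d_U = 1.36
# (alpha = 0.05, n = 15, k = 1)
print(durbin_watson_decision(0.8830, 1.08, 1.36))
```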

Learning the Basics

13.32 The residuals for 10 consecutive time periods are as follows:

Time Period   Residual      Time Period   Residual
 1             …             6            +1
 2             …             7            +2
 3             …             8            +3
 4             …             9            +4
 5             …            10            +5

a. Plot the residuals over time. What conclusion can you reach about the pattern of the residuals over time?
b. Based on (a), what conclusion can you reach about the autocorrelation of the residuals?

13.33 The residuals for 15 consecutive time periods are as follows:

Time Period   Residual      Time Period   Residual
 1            +4             9            +6
 2            -6            10            -3
 3            -1            11            +1
 4            -5            12            +3
 5            +2            13             0
 6            +5            14            -4
 7            -2            15            -7
 8            +7

a. Plot the residuals over time. What conclusion can you reach about the pattern of the residuals over time?
b. Compute the Durbin-Watson statistic. At the 0.05 level of significance, is there evidence of positive autocorrelation among the residuals?
c. Based on (a) and (b), what conclusion can you reach about the autocorrelation of the residuals?

Applying the Concepts

13.34 In Problem 13.4 on page 522 concerning pet food sales, the marketing manager used shelf space for pet food to predict weekly sales.
a. Is it necessary to compute the Durbin-Watson statistic in this case? Explain.
b. Under what circumstances is it necessary to compute the Durbin-Watson statistic before proceeding with the least-squares method of regression analysis?

13.35 The owner of a single-family home in a suburban county in the northeastern United States would like to develop a model to predict electricity consumption in his all-electric house (lights, fans, heat, appliances, and so on), based on average atmospheric temperature (in degrees Fahrenheit). Monthly kilowatt usage and temperature data are available for a period of 24 consecutive months in the accompanying data file.
a. Assuming a linear relationship, use the least-squares method to find the regression coefficients $b_0$ and $b_1$.
b. Predict the mean kilowatt usage when the average atmospheric temperature is 50° Fahrenheit.
c. Plot the residuals versus the time period.
d. Compute the Durbin-Watson statistic. At the 0.05 level of significance, is there evidence of positive autocorrelation among the residuals?
e. Based on the results of (c) and (d), is there reason to question the validity of the model?
13.36 A mail-order catalog business that sells personal computer supplies, software, and hardware maintains a centralized warehouse for the distribution of products ordered. Management is currently examining the process of distribution from the warehouse and is interested in studying the factors that affect warehouse distribution costs. Currently, a small handling fee is added to the order, regardless of the amount of the order. Data have been collected over the past 24 months, indicating the warehouse distribution costs and the number of orders received. They are stored in the accompanying data file. The results are as follows:

Months   Distribution Cost (Thousands of Dollars)   Number of Orders
 1       52.95                                      4,015
 2       71.66                                      3,806
 3       85.58                                      5,309
 4       63.69                                      4,262
 5       72.81                                      4,296
 6       68.44                                      4,097
 7       52.46                                      3,213
 8       70.77                                      4,809
 9       82.03                                      5,237
10       74.39                                      4,732
11       70.84                                      4,413
12       54.08                                      2,921
13       62.98                                      3,977
14       72.30                                      4,428
15       58.99                                      3,964
16       79.38                                      4,592
17       94.44                                      5,582
18       59.74                                      3,450
19       90.50                                      5,079
20       93.24                                      5,735
21       69.33                                      4,269
22       53.71                                      3,708
23       89.81                                      5,387
24       66.80                                      4,161

a. Assuming a linear relationship, use the least-squares method to find the regression coefficients $b_0$ and $b_1$.
b. Predict the monthly warehouse distribution costs when the number of orders is 4,500.
c. Plot the residuals versus the time period.
d. Compute the Durbin-Watson statistic. At the 0.05 level of significance, is there evidence of positive autocorrelation among the residuals?
e. Based on the results of (c) and (d), is there reason to question the validity of the model?

13.37 A freshly brewed shot of espresso has three distinct components: the heart, body, and crema. The separation of these three components typically lasts only 10 to 20 seconds. To use the espresso shot in making a latte, cappuccino, or other drinks, the shot must be poured into the beverage during the separation of the heart, body, and crema. If the shot is used after the separation occurs, the drink becomes excessively bitter and acidic, ruining the final drink. Thus, a longer separation time allows the drink-maker more time to pour the shot and ensure that the beverage will meet expectations. An employee at a coffee shop hypothesized that the harder the espresso grounds were tamped down into the portafilter before brewing, the longer the separation time would be. An experiment using 24 observations was conducted to test this relationship. The independent variable Tamp measures the distance, in inches, between the espresso grounds and the top of the portafilter (that is, the harder the tamp, the larger the distance). The dependent variable Time is the number of seconds the heart, body, and crema are separated (that is, the amount of time after the shot is poured before it must be used for the customer's beverage). The data are stored in the accompanying data file:

Shot   Tamp   Time      Shot   Tamp   Time
 1     0.20   14        13     0.50    …
 2     0.50   14        14     0.50   13
 3     0.50   18        15     0.35   19
 4     0.20   16        16     0.35   19
 5     0.20   16        17     0.20   17
 6     0.50   13        18     0.20   18
 7     0.20   12        19     0.20   15
 8     0.35   15        20     0.20   16
 9     0.50    9        21     0.35   18
10     0.35   15        22     0.35   16
11     0.50   11        23     0.35   14
12     0.50   16        24     0.35   16

a. Determine the prediction line, using Time as the dependent variable and Tamp as the independent variable.
b. Predict the mean separation time for a Tamp distance of 0.50 inch.
c. Plot the residuals versus the time order of experimentation. Are there any noticeable patterns?
d. Compute the Durbin-Watson statistic. At the 0.05 level of significance, is there evidence of positive autocorrelation among the residuals?
e. Based on the results of (c) and (d), is there reason to question the validity of the model?

13.38 The owner of a chain of ice cream stores would like to study the effect of atmospheric temperature on sales during the summer season. A sample of 21 consecutive days is selected, with the results stored in the accompanying data file. (Hint: Determine which are the independent and dependent variables.)
a. Assuming a linear relationship, use the least-squares method to find the regression coefficients $b_0$ and $b_1$.
b. Predict the sales per store for a day in which the temperature is 83°F.
c. Plot the residuals versus the time period.
d. Compute the Durbin-Watson statistic. At the 0.05 level of significance, is there evidence of positive autocorrelation among the residuals?
e. Based on the results of (c) and (d), is there reason to question the validity of the model?
13.7 INFERENCES ABOUT THE SLOPE AND CORRELATION COEFFICIENT
In Sections 13.1 through 13.3, regression was used solely for descriptive purposes. You learned how the least-squares method determines the regression coefficients and how to predict Y for a given value of X. In addition, you learned how to compute and interpret the standard error of the estimate and the coefficient of determination.
When residual analysis, as discussed in Section 13.5, indicates that the assumptions of a least-squares regression model are not seriously violated and that the straight-line model is appropriate, you can make inferences about the linear relationship between the variables in the population.

t Test for the Slope


To determine the existence of a significant linear relationship between the X and Y variables, you test whether $\beta_1$ (the population slope) is equal to 0. The null and alternative hypotheses are as follows:

$H_0$: $\beta_1 = 0$ (There is no linear relationship.)
$H_1$: $\beta_1 \ne 0$ (There is a linear relationship.)

If you reject the null hypothesis, you conclude that there is evidence of a linear relationship.


Equation (13.16) defines the test statistic.

TESTING A HYPOTHESIS FOR A POPULATION SLOPE, $\beta_1$, USING THE t TEST

The t statistic equals the difference between the sample slope and the hypothesized value of the population slope divided by the standard error of the slope.

$$t = \frac{b_1 - \beta_1}{S_{b_1}} \qquad (13.16)$$

where

$$S_{b_1} = \frac{S_{YX}}{\sqrt{SSX}}, \qquad SSX = \sum_{i=1}^{n} (X_i - \bar{X})^2$$

The test statistic t follows a t distribution with n - 2 degrees of freedom.

Return to the Using Statistics scenario concerning Sunflowers Apparel. To test whether there is a significant linear relationship between the size of the store and the annual sales at the 0.05 level of significance, refer to the Microsoft Excel worksheet for the t test presented in Figure 13.17.

FIGURE 13.17 Microsoft Excel t test for the slope for the Sunflowers Apparel data

              Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept     0.9645        0.5262          1.8329   0.0917   -0.1820    2.1110
Square Feet   1.6699        0.1569          10.6411  0.0000   1.3280     2.0118

See Section E13.1 to create the worksheet that contains this area.

From Figure 13.17,

$$b_1 = +1.6699 \qquad n = 14 \qquad S_{b_1} = 0.1569$$

and

$$t = \frac{b_1 - \beta_1}{S_{b_1}} = \frac{1.6699 - 0}{0.1569} = 10.6411$$

Microsoft Excel labels this t statistic t Stat (see Figure 13.17). Using the 0.05 level of significance, the critical value of t with n - 2 = 12 degrees of freedom is 2.1788. Because t = 10.6411 > 2.1788, you reject $H_0$ (see Figure 13.18). Using the p-value, you reject $H_0$ because the p-value is approximately 0, which is less than $\alpha = 0.05$. Hence, you can conclude that there is a significant linear relationship between mean annual sales and the size of the store.
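
The arithmetic of this t test can be sketched in a few lines of Python. This is only an illustration, assuming the rounded values printed in Figure 13.17 and the critical value 2.1788 quoted in the text; the variable and function names are made up for the example. Because the inputs are rounded to four decimals, the result agrees with Excel's 10.6411 only to about two decimal places.

```python
# Values as printed in Figure 13.17 (rounded to four decimals).
b1 = 1.6699          # sample slope
s_b1 = 0.1569        # standard error of the slope
beta1_null = 0.0     # hypothesized population slope under H0
n = 14               # number of stores in the sample
t_critical = 2.1788  # two-tail critical value quoted in the text (alpha = 0.05, 12 df)


def slope_t_statistic(sample_slope, hypothesized_slope, standard_error):
    """Equation (13.16): (b1 - beta1) / S_b1."""
    return (sample_slope - hypothesized_slope) / standard_error


t_stat = slope_t_statistic(b1, beta1_null, s_b1)
reject_h0 = abs(t_stat) > t_critical  # two-tail decision rule

print(f"t = {t_stat:.4f} with {n - 2} degrees of freedom; reject H0: {reject_h0}")
```

The critical value is hard-coded from the text rather than computed, so the sketch needs nothing beyond the standard library.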

FIGURE 13.18 Testing a hypothesis about the population slope at the 0.05 level of significance, with 12 degrees of freedom (critical values at t = -2.1788 and t = +2.1788; regions of rejection lie outside the critical values, and the region of nonrejection lies between them)

F Test for the Slope


As an alternative to the t test, you can use an F test to determine whether the slope in simple linear regression is statistically significant. In Section 10.4, you used the F distribution to test the ratio of two variances. Equation (13.17) defines the F test for the slope as the ratio of the variance that is due to the regression (MSR) divided by the error variance ($MSE = S_{YX}^2$).

TESTING A HYPOTHESIS FOR A POPULATION SLOPE, $\beta_1$, USING THE F TEST

The F statistic is equal to the regression mean square (MSR) divided by the error mean square (MSE).

$$F = \frac{MSR}{MSE} \qquad (13.17)$$

where

$$MSR = \frac{SSR}{k}, \qquad MSE = \frac{SSE}{n - k - 1}$$

k = number of independent variables in the regression model

The test statistic F follows an F distribution with k and n - k - 1 degrees of freedom.

Using a level of significance $\alpha$, the decision rule is

Reject $H_0$ if $F > F_U$;
otherwise, do not reject $H_0$.

Table 13.6 organizes the complete set of results into an ANOVA table.

TABLE 13.6 ANOVA Table for Testing the Significance of a Regression Coefficient

Source      df         Sum of Squares  Mean Square (Variance)   F
Regression  k          SSR             MSR = SSR/k              F = MSR/MSE
Error       n - k - 1  SSE             MSE = SSE/(n - k - 1)
Total       n - 1      SST

The completed ANOVA table is also part of the Microsoft Excel results shown in Figure 13.19. Figure 13.19 shows that the computed F statistic is 113.2335 and the p-value is approximately 0.

FIGURE 13.19 Microsoft Excel F test for the Sunflowers Apparel data

ANOVA
            df  SS        MS        F         Significance F
Regression  1   105.7476  105.7476  113.2335  0.0000
Residual    12  11.2067   0.9339
Total       13  116.9543

See Section E13.1 to create the worksheet that contains this area.
Using a level of significance of 0.05, from Table E.5, the critical value of the F distribution, with 1 and 12 degrees of freedom, is 4.75 (see Figure 13.20). Because F = 113.2335 > 4.75 or because the p-value = 0.0000 < 0.05, you reject $H_0$ and conclude that the size of the store is significantly related to annual sales. Because the F test in Equation (13.17) on page 540 is equivalent to the t test on page 539, you reach the same conclusion.
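
As a rough check on Equation (13.17) and on the equivalence of the two tests, the following Python sketch recomputes F from the sums of squares printed in Figure 13.19 and compares it with the square of the t statistic from Figure 13.17. The variable names are illustrative, and the inputs are the rounded figures from the Excel output, so agreement is only to a few decimal places.

```python
# Sums of squares as printed in the Excel ANOVA output (Figure 13.19).
ssr = 105.7476    # regression sum of squares
sst = 116.9543    # total sum of squares
n = 14            # sample size
k = 1             # number of independent variables in simple linear regression

sse = sst - ssr               # error sum of squares
msr = ssr / k                 # regression mean square
mse = sse / (n - k - 1)       # error mean square (equals S_YX squared)
f_stat = msr / mse            # Equation (13.17)

f_critical = 4.75             # from the text: alpha = 0.05 with 1 and 12 df
t_stat = 10.6411              # t statistic for the slope from Figure 13.17

print(f"SSE = {sse:.4f}, MSE = {mse:.4f}, F = {f_stat:.4f}")
print(f"t squared = {t_stat ** 2:.4f}")
print(f"Reject H0: {f_stat > f_critical}")
```

With a single independent variable, F equals the square of t, which is why the two tests always lead to the same decision here.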

FIGURE 13.20 Regions of rejection and nonrejection when testing for significance of slope at the 0.05 level of significance, with 1 and 12 degrees of freedom (critical value F = 4.75; the region of nonrejection lies below the critical value and the region of rejection above it)

Confidence Interval Estimate of the Slope ($\beta_1$)


As an alternative to testing for the existence of a linear relationship between the variables, you can construct a confidence interval estimate of $\beta_1$ and determine whether the hypothesized value ($\beta_1 = 0$) is included in the interval. Equation (13.18) defines the confidence interval estimate of $\beta_1$.

CONFIDENCE INTERVAL ESTIMATE OF THE SLOPE, $\beta_1$

The confidence interval estimate for the slope can be constructed by taking the sample slope, $b_1$, and adding and subtracting the critical t value multiplied by the standard error of the slope.

$$b_1 \pm t_{n-2} S_{b_1} \qquad (13.18)$$

From the Microsoft Excel results of Figure 13.17 on page 540,

$$b_1 = 1.6699 \qquad n = 14 \qquad S_{b_1} = 0.1569$$

To construct a 95% confidence interval estimate, $\alpha/2 = 0.025$, and from Table E.3, $t_{12} = 2.1788$. Thus,

$$b_1 \pm t_{n-2} S_{b_1} = 1.6699 \pm (2.1788)(0.1569) = 1.6699 \pm 0.3419$$

$$1.3280 \le \beta_1 \le 2.0118$$

Therefore, you estimate with 95% confidence that the population slope is between 1.3280 and 2.0118. Because these values are above 0, you conclude that there is a significant linear relationship between annual sales and the size of the store. Had the interval included 0, you would have concluded that no significant relationship exists between the variables. The confidence interval indicates that for each increase of 1,000 square feet, mean annual sales are estimated to increase by at least $1,328,000 but no more than $2,011,800.
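
The interval in Equation (13.18) is simple enough to reproduce directly. The sketch below is an illustration only, using the rounded values from Figure 13.17 and the tabled critical value 2.1788; the names are hypothetical.

```python
b1 = 1.6699          # sample slope (Figure 13.17)
s_b1 = 0.1569        # standard error of the slope (Figure 13.17)
t_critical = 2.1788  # t value for 95% confidence with 12 df (Table E.3, as quoted in the text)

margin = t_critical * s_b1            # half-width of the interval
lower, upper = b1 - margin, b1 + margin
includes_zero = lower <= 0.0 <= upper # True would mean no significant linear relationship

print(f"{lower:.4f} <= beta_1 <= {upper:.4f}; interval includes 0: {includes_zero}")

# Sales are recorded in millions of dollars and store size in thousands of
# square feet, so the limits translate to dollars per extra 1,000 square feet.
print(f"${lower * 1_000_000:,.0f} to ${upper * 1_000_000:,.0f} per additional 1,000 square feet")
```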

t Test for the Correlation Coefficient


In Section 3.5 on page 130, the strength of the relationship between two numerical variables was measured, using the correlation coefficient, r. You can use the correlation coefficient to determine whether there is a statistically significant linear relationship between X and Y. To do so, you hypothesize that the population correlation coefficient, $\rho$, is 0. Thus, the null and alternative hypotheses are

$H_0$: $\rho = 0$ (no correlation)
$H_1$: $\rho \ne 0$ (correlation)

Equation (13.19) defines the test statistic for determining the existence of a significant correlation.

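TESTING FOR THE EXISTENCE OF CORRELATION

$$t = \frac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}} \qquad (13.19)$$

where

$$r = +\sqrt{r^2} \ \text{ if } b_1 > 0, \qquad r = -\sqrt{r^2} \ \text{ if } b_1 < 0$$

The test statistic t follows a t distribution with n - 2 degrees of freedom.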

In the Sunflowers Apparel problem, $r^2 = 0.9042$ and $b_1 = +1.6699$ (see Figure 13.4 on page 516). Because $b_1 > 0$, the correlation coefficient for annual sales and store size is the positive square root of $r^2$, that is, $r = +\sqrt{0.9042} = +0.9509$. Testing the null hypothesis that there is no correlation between these two variables results in the following observed t statistic:

$$t = \frac{r - 0}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.9509 - 0}{\sqrt{\dfrac{1 - (0.9509)^2}{14 - 2}}} = 10.6411$$

Using the 0.05 level of significance, because t = 10.6411 > 2.1788, you reject the null hypothesis. You conclude that there is evidence of an association between annual sales and store size. This t statistic is equivalent to the t statistic found when testing whether the population slope, $\beta_1$, is equal to zero (see Figure 13.17 on page 540).
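
A short Python sketch of Equation (13.19) follows. It assumes the rounded $r^2 = 0.9042$ and slope quoted in the text, takes the sign of r from the sign of the slope, and uses the critical value 2.1788; the names are illustrative. Because the inputs are rounded, the computed statistic matches the reported 10.6411 only to about two decimal places.

```python
import math

r_squared = 0.9042   # coefficient of determination quoted in the text
b1 = 1.6699          # sample slope; only its sign is needed here
n = 14               # sample size
rho_null = 0.0       # hypothesized population correlation under H0
t_critical = 2.1788  # two-tail critical value quoted in the text (alpha = 0.05, 12 df)

# r takes the sign of the slope: positive root if b1 > 0, negative root if b1 < 0.
r = math.copysign(math.sqrt(r_squared), b1)

# Equation (13.19)
t_stat = (r - rho_null) / math.sqrt((1 - r ** 2) / (n - 2))
reject_h0 = abs(t_stat) > t_critical

print(f"r = {r:+.4f}, t = {t_stat:.4f}, reject H0: {reject_h0}")
```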
When inferences concerning the population slope were discussed, confidence intervals and tests of hypothesis were used interchangeably. However, developing a confidence interval for the correlation coefficient is more complicated because the shape of the sampling distribution of the statistic r varies for different values of the population correlation coefficient. Methods for developing a confidence interval estimate for the correlation coefficient are presented in reference 4.

the Basics

13.39 You are testing the null hypothesis that there is no relationship between two variables, X and Y. From your sample of n = 10, you determine that r = 0.80.
a. What is the value of the t test statistic?
b. At the $\alpha = 0.05$ level of significance, what are the critical values?
c. Based on your answers to (a) and (b), what statistical decision should you make?

13.40 You are testing the null hypothesis that there is no relationship between two variables, X and Y. From your sample of n = 18, you determine that $b_1 = +4.5$ and $S_{b_1} = 1.5$.
a. What is the value of the t test statistic?
b. At the $\alpha = 0.05$ level of significance, what are the critical values?
c. Based on your answers to (a) and (b), what statistical decision should you make?
d. Construct a 95% confidence interval estimate of the population slope, $\beta_1$.

13.41 You are testing the null hypothesis that there is no relationship between two variables, X and Y. From your sample of n = 20, you determine that SSR = 60 and SSE = 40.
a. What is the value of the F test statistic?
b. At the $\alpha = 0.05$ level of significance, what is the critical value?
c. Based on your answers to (a) and (b), what statistical decision should you make?
d. Compute the correlation coefficient by first computing $r^2$ and assuming that $b_1$ is negative.
e. At the 0.05 level of significance, is there a significant correlation between X and Y?

Applying the Concepts

13.42 In Problem 13.4 on page 522, the marketing manager used shelf space for pet food to predict weekly sales. The data are stored in the data file. From the results of that problem, $b_1 = 7.4$ and $S_{b_1} = 1.59$.
a. At the 0.05 level of significance, is there evidence of a linear relationship between shelf space and sales?
b. Construct a 95% confidence interval estimate of the population slope, $\beta_1$.

13.43 In Problem 13.5 on page 522, you used reported magazine newsstand sales to predict audited sales. The data are stored in the data file. From the results of that problem, $b_1 = 0.5719$ and $S_{b_1} = 0.0668$.
a. At the 0.05 level of significance, is there evidence of a linear relationship between reported sales and audited sales?
b. Construct a 95% confidence interval estimate of the population slope, $\beta_1$.

13.44 In Problem 13.6 on pages 522-523, the owner of a moving company wanted to predict labor hours, based on the number of cubic feet moved. The data are stored in the data file. Using the results of that problem,
a. at the 0.05 level of significance, is there evidence of a linear relationship between the number of cubic feet moved and labor hours?
b. construct a 95% confidence interval estimate of the population slope, $\beta_1$.

13.45 In Problem 13.7 on page 523, you used the weight of mail to predict the number of orders received. The data are stored in the data file. Using the results of that problem,
a. at the 0.05 level of significance, is there evidence of a linear relationship between the weight of mail and the number of orders received?
b. construct a 95% confidence interval estimate of the population slope, $\beta_1$.

13.46 In Problem 13.8 on page 523, you used annual revenues to predict the value of a baseball franchise. The data are stored in the data file. Using the results of that problem,
a. at the 0.05 level of significance, is there evidence of a linear relationship between annual revenue and franchise value?
b. construct a 95% confidence interval estimate of the population slope, $\beta_1$.

13.47 In Problem 13.9 on page 523, an agent for a real estate company wanted to predict the monthly rent for apartments, based on the size of the apartment. The data are stored in the data file. Using the results of that problem,
a. at the 0.05 level of significance, is there evidence of a linear relationship between the size of the apartment and the monthly rent?
b. construct a 95% confidence interval estimate of the population slope, $\beta_1$.

13.48 In Problem 13.10 on page 523, you used hardness to predict the tensile strength of die-cast aluminum. The data are stored in the data file. Using the results of that problem,
a. at the 0.05 level of significance, is there evidence of a linear relationship between hardness and tensile strength?
b. construct a 95% confidence interval estimate of the population slope, $\beta_1$.

13.49 The volatility of a stock is often measured by its beta value. You can estimate the beta value of a stock by developing a simple linear regression model, using the percentage weekly change in the stock as the dependent variable and the percentage weekly change in a market index as the independent variable. The S&P 500 Index is a common index to use. For example, if you wanted to estimate the beta for IBM, you could use the following model, which is sometimes referred to as a market model:

(% weekly change in IBM) = $\beta_0 + \beta_1$ (% weekly change in S&P 500 index) + $\varepsilon$

The least-squares regression estimate of the slope $b_1$ is the estimate of the beta value for IBM. A stock with a beta value of 1.0 tends to move the same as the overall market. A stock with a beta value of 1.5 tends to move 50% more than the overall market, and a stock with a beta value of 0.6 tends

to move only 60% as much as the overall market. Stocks with negative beta values tend to move in a direction opposite that of the overall market. The following table gives some beta values for some widely held stocks:

Ticker Symbol  Beta
T              0.80
IBM            1.20
DIS            1.40
AA             2.26
LSI            3.61

Source: Extracted from finance.yahoo.com, May 31, 2006.

a. For each of the five companies, interpret the beta value.
b. How can investors use the beta value as a guide for investing?

13.50 Index funds are mutual funds that try to mimic the movement of leading indexes, such as the S&P 500 Index, the NASDAQ 100 Index, or the Russell 2000 Index. The beta values for these funds (as described in Problem 13.49) are therefore approximately 1.0. The estimated market models for these funds are approximately

(% weekly change in index fund) = 0.0 + 1.0 (% weekly change in the index)

Leveraged index funds are designed to magnify the movement of major indexes. An article in Mutual Funds (O'Shaughnessy, "Reach for Higher Returns," Mutual Funds, July 1999, pp. 44-49) described some of the risks and rewards associated with these funds and gave details of some of the most popular leveraged funds, including those in the following table:

Name (Ticker Symbol)            Fund Description
Small Cap (POSCX)               125% of Russell 2000 Index
"Inv" Nova (RYNVX)              150% of the S&P 500 Index
ProFund UltraOTC "Inv" (UOPIX)  Double (200%) the NASDAQ 100 Index

The estimated market models for these funds are approximately

(% weekly change in POSCX) = 0.0 + 1.25 (% weekly change in the Russell 2000 Index)
(% weekly change in RYNVX) = 0.0 + 1.50 (% weekly change in the S&P 500 Index)
(% weekly change in UOPIX fund) = 0.0 + 2.0 (% weekly change in the NASDAQ 100 Index)

If the Russell 2000 Index gains 10% over a period of time, the leveraged mutual fund POSCX gains approximately 12.5%. On the downside, if the same index loses 20%, POSCX loses approximately 25%.
a. Consider the leveraged mutual fund ProFund UltraOTC "Inv" (UOPIX), whose description is 200% of the performance of the S&P 500 Index. What is its approximate market model?
b. If the NASDAQ gains 30% in a year, what return do you expect UOPIX to have?
c. If the NASDAQ loses 35% in a year, what return do you expect UOPIX to have?
d. What type of investors should be attracted to leveraged funds? What type of investors should stay away from these funds?

13.51 The data in the data file represent the calories and fat (in grams) of 16-ounce iced coffee drinks at Dunkin' Donuts and Starbucks:

Product                                                                  Calories  Fat
Dunkin' Donuts Iced Mocha Swirl latte (whole milk)                       240       8.0
Starbucks Coffee Frappuccino blended coffee                              260       3.5
Dunkin' Donuts Coffee Coolatta (cream)                                   350       22.0
Starbucks Iced Coffee Mocha Espresso (whole milk and whipped cream)      350       20.0
Starbucks Mocha Frappuccino blended coffee (whipped cream)               420       16.0
Starbucks Chocolate Brownie Frappuccino blended coffee (whipped cream)   510       22.0
Starbucks Chocolate Frappuccino Blended Crème (whipped cream)            530       19.0

Source: Extracted from "Coffee as Candy at Dunkin' Donuts and Starbucks," Consumer Reports, June 2004, p. 9.

a. Compute and interpret the coefficient of correlation, r.
b. At the 0.05 level of significance, is there a significant linear relationship between the calories and fat?

13.52 There are several methods for calculating fuel economy. The following table (contained in the data file) indicates the mileage as calculated by owners and by current government standards:

Vehicle                   Owner  Government Standards
2005 Ford F-150           14.3   16.8
2005 Chevrolet Silverado  15.0   17.8
2002 Honda Accord LX      27.8   26.2
2002 Honda Civic          27.9   34.2
2004 Honda Civic Hybrid   48.8   47.6
2002 Ford Explorer        16.8   18.3
2005 Toyota Camry         23.7   28.5
2003 Toyota Corolla       32.8   33.1
2005 Toyota Prius         JI.J   56.0

a. Compute and interpret the coefficient of correlation, r.
b. At the 0.05 level of significance, is there a significant linear relationship between the mileage as calculated by owners and by current government standards?

13.53 College basketball is big business, with coaches' salaries, revenues, and expenses in millions of dollars. The data in the data file represent the coaches' salaries and revenues for college basketball at selected schools in a recent year (extracted from R. Adams, "Pay for Playoffs," The Wall Street Journal, March 11-12, 2006, pp. P1, P8).
a. Compute and interpret the coefficient of correlation, r.
b. At the 0.05 level of significance, is there a significant linear relationship between a coach's salary and revenue?

13.54 College football players trying out for the NFL are given the Wonderlic standardized intelligence test. The data in the data file represent the average Wonderlic scores of football players trying out for the NFL and the graduation rates for football players at selected schools (extracted from S. Walker, "The NFL's Smartest Team," The Wall Street Journal, September 30, 2005, pp. W1, W10).
a. Compute and interpret the coefficient of correlation, r.
b. At the 0.05 level of significance, is there a significant linear relationship between the average Wonderlic score of football players trying out for the NFL and the graduation rates for football players at selected schools?
c. What conclusions can you reach about the relationship between the average Wonderlic score of football players trying out for the NFL and the graduation rates for football players at selected schools?

13.8 ESTIMATION OF MEAN VALUES AND PREDICTION OF INDIVIDUAL VALUES

This section presents methods of making inferences about the mean of Y and predicting individual values of Y.

The Confidence Interval Estimate


In Example 13.2 on page 519, you used the prediction line to predict the value of Y for a given X. The mean annual sales for stores with 4,000 square feet was predicted to be 7.644 millions of dollars ($7,644,000). This estimate, however, is a point estimate of the population mean. In Chapter 8, you studied the concept of the confidence interval as an estimate of the population mean. In a similar fashion, Equation (13.20) defines the confidence interval estimate for the mean response for a given X.

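CONFIDENCE INTERVAL ESTIMATE FOR THE MEAN OF Y

$$\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{h_i}$$

$$\hat{Y}_i - t_{n-2} S_{YX} \sqrt{h_i} \le \mu_{Y|X=X_i} \le \hat{Y}_i + t_{n-2} S_{YX} \sqrt{h_i} \qquad (13.20)$$

where

$$h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SSX}$$

$\hat{Y}_i$ = predicted value of Y; $\hat{Y}_i = b_0 + b_1 X_i$
$S_{YX}$ = standard error of the estimate
n = sample size
$X_i$ = given value of X
$\mu_{Y|X=X_i}$ = mean value of Y when $X = X_i$

$$SSX = \sum_{i=1}^{n} (X_i - \bar{X})^2$$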

The width of the confidence interval in Equation (13.20) depends on several factors. For a given level of confidence, increased variation around the prediction line, as measured by the standard error of the estimate, results in a wider interval. However, as you would expect, increased sample size reduces the width of the interval. In addition, the width of the interval also varies at different values of X. When you predict Y for values of X close to $\bar{X}$, the interval is narrower than for predictions for X values more distant from $\bar{X}$.
In the Sunflowers Apparel example, suppose you want to construct a 95% confidence interval estimate of the mean annual sales for the entire population of stores that contain 4,000 square feet (X = 4). Using the simple linear regression equation,

$$\hat{Y}_i = 0.9645 + 1.6699 X_i = 0.9645 + 1.6699(4) = 7.6439 \text{ (millions of dollars)}$$

Also, given the following:

$$\bar{X} = 2.9214 \qquad S_{YX} = 0.9664 \qquad SSX = \sum_{i=1}^{n} (X_i - \bar{X})^2 = 37.9236$$

From Table E.3, $t_{12} = 2.1788$. Thus,

$$\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{h_i}$$

where

$$h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SSX}$$

so that

$$\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{SSX}} = 7.6439 \pm (2.1788)(0.9664) \sqrt{\frac{1}{14} + \frac{(4 - 2.9214)^2}{37.9236}} = 7.6439 \pm 0.6728$$

so

$$6.9711 \le \mu_{Y|X=4} \le 8.3167$$

Therefore, the 95% confidence interval estimate is that the mean annual sales are between $6,971,100 and $8,316,700 for the population of stores with 4,000 square feet.
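
The calculation in Equation (13.20) can be sketched in Python as follows. This is an illustration under the summary values quoted in the text ($\bar{X}$, $S_{YX}$, SSX, the tabled t value, and the predicted value 7.6439, which Excel computes from unrounded coefficients); the function and variable names are hypothetical. The last lines evaluate the half-width at $X = \bar{X}$ to show the narrowing described above.

```python
import math

# Summary values quoted in the text for the Sunflowers Apparel data.
n = 14               # sample size
x_bar = 2.9214       # mean store size (thousands of square feet)
s_yx = 0.9664        # standard error of the estimate
ssx = 37.9236        # sum of squared deviations of X about its mean
t_critical = 2.1788  # t value for 95% confidence with 12 df (Table E.3)


def mean_response_half_width(x):
    """Half-width of the interval in Equation (13.20) at a given X."""
    h = 1 / n + (x - x_bar) ** 2 / ssx
    return t_critical * s_yx * math.sqrt(h)


y_hat = 7.6439                       # predicted mean sales at X = 4, as given in the text
half_width = mean_response_half_width(4.0)
lower, upper = y_hat - half_width, y_hat + half_width
print(f"{lower:.4f} <= mean of Y at X = 4 <= {upper:.4f}")

# The interval is narrowest when X equals the sample mean of X.
half_width_at_mean = mean_response_half_width(x_bar)
print(f"half-width at X = 4: {half_width:.4f}; at X = X-bar: {half_width_at_mean:.4f}")
```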

The Prediction Interval


In addition to the need for a confidence interval estimate for the mean value, you often want to predict the response for an individual value. Although the form of the prediction interval is similar to that of the confidence interval estimate of Equation (13.20), the prediction interval is predicting an individual value, not estimating a parameter. Equation (13.21) defines the prediction interval for an individual response, Y, at a particular value, $X_i$, denoted by $Y_{X=X_i}$.

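PREDICTION INTERVAL FOR AN INDIVIDUAL RESPONSE, Y

$$\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{1 + h_i}$$

$$\hat{Y}_i - t_{n-2} S_{YX} \sqrt{1 + h_i} \le Y_{X=X_i} \le \hat{Y}_i + t_{n-2} S_{YX} \sqrt{1 + h_i} \qquad (13.21)$$

where $h_i$, $\hat{Y}_i$, $S_{YX}$, n, and $X_i$ are defined as in Equation (13.20) on page 546 and $Y_{X=X_i}$ is a future value of Y when $X = X_i$.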

To construct a 95% prediction interval of the annual sales for an individual store that contains 4,000 square feet (X = 4), you first compute $\hat{Y}_i$. Using the prediction line:

$$\hat{Y}_i = 0.9645 + 1.6699 X_i = 0.9645 + 1.6699(4) = 7.6439 \text{ (millions of dollars)}$$

Also, given the following:

$$\bar{X} = 2.9214 \qquad S_{YX} = 0.9664 \qquad SSX = \sum_{i=1}^{n} (X_i - \bar{X})^2 = 37.9236$$

From Table E.3, $t_{12} = 2.1788$. Thus,

$$\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{1 + h_i}$$

where

$$h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SSX}$$

so that

$$\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SSX}} = 7.6439 \pm (2.1788)(0.9664) \sqrt{1 + \frac{1}{14} + \frac{(4 - 2.9214)^2}{37.9236}} = 7.6439 \pm 2.2104$$

so

$$5.4335 \le Y_{X=4} \le 9.8543$$

Therefore, with 95% confidence, you predict that the annual sales for an individual store with 4,000 square feet is between $5,433,500 and $9,854,300.
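
For comparison, here is a minimal Python sketch of Equation (13.21) next to Equation (13.20), again assuming the summary values quoted in the text and using illustrative names. Because the inputs are rounded, the limits may differ from the printed ones in the fourth decimal place.

```python
import math

# Summary values quoted in the text for the Sunflowers Apparel data.
n = 14
x_bar = 2.9214
s_yx = 0.9664
ssx = 37.9236
t_critical = 2.1788   # t value for 95% confidence with 12 df (Table E.3)
y_hat = 7.6439        # predicted sales at X = 4 (millions of dollars), as given in the text
x = 4.0

h = 1 / n + (x - x_bar) ** 2 / ssx

# Equation (13.20): interval for the mean response uses sqrt(h).
mean_half_width = t_critical * s_yx * math.sqrt(h)

# Equation (13.21): interval for an individual response uses sqrt(1 + h).
pred_half_width = t_critical * s_yx * math.sqrt(1 + h)
pred_lower, pred_upper = y_hat - pred_half_width, y_hat + pred_half_width

print(f"prediction interval: {pred_lower:.4f} to {pred_upper:.4f}")
print(f"half-widths: mean = {mean_half_width:.4f}, individual = {pred_half_width:.4f}")
```

The extra 1 under the square root reflects the variation of an individual store around the mean, which is why the prediction interval is always the wider of the two.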

Figure 13.21 is a Microsoft Excel worksheet that illustrates the confidence interval estimate and the prediction interval for the Sunflowers Apparel problem. If you compare the results of the confidence interval estimate and the prediction interval, you see that the width of the prediction interval for an individual store is much wider than the confidence interval estimate for the mean. Remember that there is much more variation in predicting an individual value than in estimating a mean value.

FIGURE 13.21 Microsoft Excel confidence interval estimate and prediction interval for the Sunflowers Apparel data

See Section E13.5 to create this worksheet.

the Basics

13.55 Based on a sample of n = 20, the least-squares method was used to develop the following prediction line: $\hat{Y}_i = 5 + 3X_i$. In addition,

$$S_{YX} = 1.0 \qquad \bar{X} = 2 \qquad \sum_{i=1}^{n} (X_i - \bar{X})^2 = 20$$

a. Construct a 95% confidence interval estimate of the mean response for X = 2.
b. Construct a 95% prediction interval of an individual response for X = 2.

13.56 Based on a sample of n = 20, the least-squares method was used to develop the following prediction line: $\hat{Y}_i = 5 + 3X_i$. In addition,

$$S_{YX} = 1.0 \qquad \bar{X} = 2 \qquad \sum_{i=1}^{n} (X_i - \bar{X})^2 = 20$$

a. Construct a 95% confidence interval estimate of the mean response for X = 4.
b. Construct a 95% prediction interval of an individual response for X = 4.
c. Compare the results of (a) and (b) with those of Problem 13.55 (a) and (b). Which interval is wider? Why?

Applying the Concepts

13.57 In Problem 13.5 on page 522, you used reported sales to predict audited sales of magazines. The data are stored in the data file. For these data $S_{YX} = 42.186$ and $h_i = 0.108$ when X = 400.
a. Construct a 95% confidence interval estimate of the mean audited sales for magazines that report newsstand sales of 400,000.
b. Construct a 95% prediction interval of the audited sales for an individual magazine that reports newsstand sales of 400,000.
c. Explain the difference in the results in (a) and (b).

13.58 In Problem 13.4 on page 522, the marketing manager used shelf space for pet food to predict weekly sales. The data are stored in the data file. For these data $S_{YX} = 30.81$ and $h_i = 0.1373$ when X = 8.
a. Construct a 95% confidence interval estimate of the mean weekly sales for all stores that have 8 feet of shelf space for pet food.
b. Construct a 95% prediction interval of the weekly sales of an individual store that has 8 feet of shelf space for pet food.
c. Explain the difference in the results in (a) and (b).

13.59 In Problem 13.7 on page 523, you used the weight of mail to predict the number of orders received. The data are stored in the data file.
a. Construct a 95% confidence interval estimate of the mean number of orders received for all packages with a weight of 500 pounds.
b. Construct a 95% prediction interval of the number of orders received for an individual package with a weight of 500 pounds.
c. Explain the difference in the results in (a) and (b).

13.60 In Problem 13.6 on page 522, the owner of a moving company wanted to predict labor hours based on the number of cubic feet moved. The data are stored in the data file.
a. Construct a 95% confidence interval estimate of the mean labor hours for all moves of 500 cubic feet.
b. Construct a 95% prediction interval of the labor hours of an individual move that has 500 cubic feet.
c. Explain the difference in the results in (a) and (b).

13.61 In Problem 13.9 on page 523, an agent for a real estate company wanted to predict the monthly rent for apartments, based on the size of the apartment. The data are stored in the data file.
a. Construct a 95% confidence interval estimate of the mean monthly rental for all apartments that are 1,000 square feet in size.
b. Construct a 95% prediction interval of the monthly rental of an individual apartment that is 1,000 square feet in size.
c. Explain the difference in the results in (a) and (b).

13.62 In Problem 13.8 on page 523, you predicted the value of a baseball franchise, based on current revenue. The data are stored in the data file.
a. Construct a 95% confidence interval estimate of the mean value of all baseball franchises that generate $150 million of annual revenue.
b. Construct a 95% prediction interval of the value of an individual baseball franchise that generates $150 million of annual revenue.
c. Explain the difference in the results in (a) and (b).

13.63 In Problem 13.10 on page 523, you used hardness to predict the tensile strength of die-cast aluminum. The data are stored in the data file.
a. Construct a 95% confidence interval estimate of the mean tensile strength for all specimens with a hardness of 30 Rockwell E units.
b. Construct a 95% prediction interval of the tensile strength for an individual specimen that has a hardness of 30 Rockwell E units.
c. Explain the difference in the results in (a) and (b).

13.9 PITFALLS IN REGRESSION AND ETHICAL ISSUES
Some of the pitfalls involved in using regression analysis are as follows:

• Lacking an awareness of the assumptions of least-squares regression
• Not knowing how to evaluate the assumptions of least-squares regression
• Not knowing what the alternatives to least-squares regression are if a particular assumption is violated
• Using a regression model without knowledge of the subject matter
• Extrapolating outside the relevant range
• Concluding that a significant relationship identified in an observational study is due to a cause-and-effect relationship

The widespread availability of spreadsheet and statistical software has made regression analysis much more feasible. However, for many users, this enhanced availability of software has not been accompanied by an understanding of how to use regression analysis properly. Someone who is not familiar with either the assumptions of regression or how to evaluate the assumptions cannot be expected to know what the alternatives to least-squares regression are if a particular assumption is violated.
The data in Table 13.7 (stored in the data file) illustrate the importance of using scatter plots and residual analysis to go beyond the basic number crunching of computing the Y intercept, the slope, and $r^2$.

TABLE 13.7 Four Sets of Artificial Data

Data Set A     Data Set B     Data Set C     Data Set D
X    Y         X    Y         X    Y         X    Y
10   8.04      10   9.14      10   7.46      8    6.58
14   9.96      14   8.10      14   8.84      8    5.76
5    5.68      5    4.74      5    5.73      8    7.71
8    6.95      8    8.14      8    6.77      8    8.84
9    8.81      9    8.77      9    7.11      8    8.47
12   10.84     12   9.13      12   8.15      8    7.04
4    4.26      4    3.10      4    5.39      8    5.25
7    4.82      7    7.26      7    6.42      19   12.50
11   8.33      11   9.26      11   7.81      8    5.56
13   7.58      13   8.74      13   12.74     8    7.91
6    7.24      6    6.13      6    6.08      8    6.89

Source: Extracted from F. J. Anscombe, "Graphs in Statistical Analysis," American Statistician, Vol. 27 (1973), pp. 17-21.

Anscombe (reference 1) showed that all four data sets given in Table 13.7 have the following identical results:

$$\hat{Y}_i = 3.0 + 0.5 X_i$$

$$S_{YX} = 1.237 \qquad S_{b_1} = 0.118 \qquad r^2 = 0.667$$

$$SSR = \text{Explained variation} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = 27.51$$

$$SSE = \text{Unexplained variation} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = 13.76$$

$$SST = \text{Total variation} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = 41.27$$

Thus, with respect to these statistics associated with a simple linear regression analysis, the four data sets are identical. Were you to stop the analysis at this point, you would fail to observe the important differences among the four data sets. By examining the scatter plots for the four data sets in Figure 13.22 on page 552, and their residual plots in Figure 13.23 on page 552, you can clearly see that each of the four data sets has a different relationship between X and Y.
From the scatter plots of Figure 13.22 and the residual plots of Figure 13.23, you see how different the data sets are. The only data set that seems to follow an approximate straight line is data set A. The residual plot for data set A does not show any obvious patterns or outlying residuals. This is certainly not true for data sets B, C, and D. The scatter plot for data set B shows that a quadratic regression model (see Section 15.1) is more appropriate. This conclusion is reinforced by the residual plot for data set B. The scatter plot and the residual plot for data set C clearly show an outlying observation. If this is the case, you may want to remove the outlier and reestimate the regression model (see reference 4). Similarly, the scatter plot for data set D represents the situation in which the model is heavily dependent on the outcome of a single response ($X_8 = 19$ and $Y_8 = 12.50$). You would have to cautiously evaluate any regression model because its regression coefficients are heavily dependent on a single observation.
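
The claim that the four data sets share the same fitted line can be checked with a few lines of Python. The sketch below is illustrative: the helper name is made up, and the data are Anscombe's published values as listed in Table 13.7 (two Data Set D entries, 5.76 and 7.71, are poorly legible in the scan and are taken here from Anscombe's published data, which is an assumption about the table).

```python
def fit_line(x, y):
    """Least-squares intercept, slope, and r-squared for one data set."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / ssx
    b0 = y_bar - b1 * x_bar
    sst = sum((yi - y_bar) ** 2 for yi in y)                 # total variation
    ssr = sum((b0 + b1 * xi - y_bar) ** 2 for xi in x)       # explained variation
    return b0, b1, ssr / sst


x_abc = [10, 14, 5, 8, 9, 12, 4, 7, 11, 13, 6]  # X values shared by data sets A, B, and C
data_sets = {
    "A": (x_abc, [8.04, 9.96, 5.68, 6.95, 8.81, 10.84, 4.26, 4.82, 8.33, 7.58, 7.24]),
    "B": (x_abc, [9.14, 8.10, 4.74, 8.14, 8.77, 9.13, 3.10, 7.26, 9.26, 8.74, 6.13]),
    "C": (x_abc, [7.46, 8.84, 5.73, 6.77, 7.11, 8.15, 5.39, 6.42, 7.81, 12.74, 6.08]),
    "D": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: fit_line(x, y) for name, (x, y) in data_sets.items()}
for name, (b0, b1, r2) in results.items():
    print(f"Data set {name}: Y-hat = {b0:.2f} + {b1:.2f} X, r-squared = {r2:.3f}")
```

The summary numbers come out essentially the same for all four sets, which is exactly why the scatter plots and residual plots are needed to tell them apart.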
FIGURE 13.22 Scatter plots for four data sets

FIGURE 13.23 Residual plots for four data sets

In summary, scatter plots and residual plots are of vital importance to a complete regression analysis. The information they provide is so basic to a credible analysis that you should always include these graphical methods as part of a regression analysis. Thus, a strategy that you can use to help avoid the pitfalls of regression is as follows:

1. Start with a scatter plot to observe the possible relationship between X and Y.
2. Check the assumptions of regression before moving on to using the results of the model.
3. Plot the residuals versus the independent variable to determine whether the linear model is appropriate and to check the equal-variance assumption.
4. Use a histogram, stem-and-leaf display, box-and-whisker plot, or normal probability plot of the residuals to check the normality assumption.
5. If you collected the data over time, plot the residuals versus time and use the Durbin-Watson test to check the independence assumption.
6. If there are violations of the assumptions, use alternative methods to least-squares regression or alternative least-squares models.
7. If there are no violations of the assumptions, carry out tests for the significance of the regression coefficients and develop confidence and prediction intervals.
8. Avoid making predictions and forecasts outside the relevant range of the independent variable.
9. Keep in mind that the relationships identified in observational studies may or may not be due to cause-and-effect relationships. Remember that while causation implies correlation, correlation does not imply causation.
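
Steps 3, 5, and 8 of this strategy can be sketched in a few lines of Python. This is a hedged illustration rather than the text's Excel procedure: the data values and the function names (`fit_line`, `residuals`, `durbin_watson`, `in_relevant_range`) are hypothetical, and the Durbin-Watson statistic is computed as the sum of squared successive differences of the residuals divided by the sum of squared residuals.

```python
# Sketch of steps 3, 5, and 8: residuals, Durbin-Watson statistic,
# and a relevant-range check. The data are hypothetical and are
# assumed to be listed in time order.

def fit_line(x, y):
    """Least-squares Y intercept b0 and slope b1."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    ssxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = ssxy / ssx
    return y_bar - b1 * x_bar, b1

def residuals(x, y, b0, b1):
    """Observed Y minus predicted Y for each observation."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def durbin_watson(e):
    """Sum of (e_i - e_{i-1})^2 for i = 2..n, divided by the sum of e_i^2."""
    num = sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e)))
    den = sum(ei ** 2 for ei in e)
    return num / den

def in_relevant_range(x_new, x_observed):
    """True when x_new lies between the smallest and largest observed X."""
    return min(x_observed) <= x_new <= max(x_observed)

x = [1, 2, 3, 4, 5]   # hypothetical independent variable
y = [2, 4, 5, 4, 5]   # hypothetical dependent variable, in time order

b0, b1 = fit_line(x, y)
e = residuals(x, y, b0, b1)
d = durbin_watson(e)

print("residuals:", [round(ei, 2) for ei in e])
print("Durbin-Watson D:", round(d, 3))
print("X = 10 inside relevant range?", in_relevant_range(10, x))
```

A value of D near 2 suggests no autocorrelation, but the decision still rests on the tabled critical values, and a real analysis needs far more than the five observations used here.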

Perhaps you are familiar with the TV competition organized by model Tyra Banks to find "America's top model." You may be less familiar with another set of top models that are emerging from the business world.

In a BusinessWeek article from its January 23, 2006, edition (S. Baker, "Why Math Will Rock Your World: More Math Geeks Are Calling the Shots in Business. Is Your Industry Next?" BusinessWeek, pp. 54-62), Stephen Baker talks about how "quants" turned finance upside down and are moving on to other business fields. The name quants derives from the fact that "math geeks" develop models and forecasts by using "quantitative methods." These methods are built on the principles of regression analysis discussed in this chapter, although the actual models are much more complicated than the simple linear models discussed in this chapter.

Regression-based models have become the top models for many types of business analyses. Some examples include

Advertising and marketing: Managers use econometric models (in other words, regression models) to determine the effect of an advertisement on sales, based on a set of factors. Also, managers use data mining to predict patterns of behavior of what customers will buy in the future, based on historic information about the consumer.

Finance: Any time you read about a financial "model," you should understand that some type of regression model is being used. For example, a New York Times article on June 18, 2006, titled "An Old Formula That Points to New Worry" by Mark Hulbert (p. BU8) discusses a market timing model that predicts the return of stocks in the next three to five years, based on the dividend yield of the stock market and the interest rate of 90-day Treasury bills.

Food and beverage: Believe it or not, Enologix, a California consulting company, has developed a "formula" (a regression model) that predicts a wine's quality index, based on a set of chemical compounds found in the wine (see D. Darlington, "The Chemistry of a 90+ Wine," The New York Times Magazine, August 7, 2005, pp. 36-39).

Publishing: A study of the effect of price changes at Amazon.com and BN.com on sales (again, regression analysis) found that a 1% price change at BN.com pushed sales down 4%, but it pushed sales down only 0.5% at Amazon.com. (You can download the paper at http://gsbadg.uchicago.edu/vitae.htm.)

Transportation: Farecast.com uses data mining and predictive technologies to objectively predict airfare pricing (see D. Darlin, "Airfares Made Easy (Or Easier)," The New York Times, July 1, 2006, pp. C1, C6).

Real estate: Zillow.com uses information about the features contained in a home and its location to develop estimates about the market value of the home, using a "formula" built with a proprietary algorithm.

In the article, Baker stated that statistics and probability will become core skills for businesspeople and consumers. Those who are successful will know how to use statistics, whether they are building financial models or making marketing plans. He also strongly endorsed the need for everyone in business to have knowledge of Microsoft Excel to be able to produce statistical analysis and reports.

As you can see from the chapter roadmap in Figure 13.24, this chapter develops the simple linear regression model and discusses the assumptions and how to evaluate them. Once you are assured that the model is appropriate, you can predict values by using the prediction line and test for the significance of the slope.

FIGURE 13.24 Roadmap for simple linear regression
You have learned how the director of planning for a chain of clothing stores can use regression analysis to investigate the relationship between the size of a store and its annual sales. You have used this analysis to make better decisions when selecting new sites for stores as well as to forecast sales for existing stores. In Chapter 14, regression analysis is extended to situations in which more than one independent variable is used to predict the value of a dependent variable.

Key Equations

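Simple Linear Regression Model

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$   (13.1)

Simple Linear Regression Equation: The Prediction Line

$\hat{Y}_i = b_0 + b_1 X_i$   (13.2)

Computational Formula for the Slope, $b_1$

$b_1 = \dfrac{SSXY}{SSX}$   (13.3)

Computational Formula for the Y Intercept, $b_0$

$b_0 = \bar{Y} - b_1 \bar{X}$   (13.4)

Measures of Variation in Regression

$SST = SSR + SSE$   (13.5)

Total Sum of Squares (SST)

$SST = \text{Total sum of squares} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$   (13.6)

Regression Sum of Squares (SSR)

$SSR = \text{Explained variation or regression sum of squares} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$   (13.7)

Error Sum of Squares (SSE)

$SSE = \text{Unexplained variation or error sum of squares} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$   (13.8)

Coefficient of Determination

$r^2 = \dfrac{\text{Regression sum of squares}}{\text{Total sum of squares}} = \dfrac{SSR}{SST}$   (13.9)

Computational Formula for SST

$SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - \dfrac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n}$   (13.10)

Computational Formula for SSR

$SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = b_0 \sum_{i=1}^{n} Y_i + b_1 \sum_{i=1}^{n} X_i Y_i - \dfrac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n}$   (13.11)

Computational Formula for SSE

$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} Y_i^2 - b_0 \sum_{i=1}^{n} Y_i - b_1 \sum_{i=1}^{n} X_i Y_i$   (13.12)

Standard Error of the Estimate

$S_{YX} = \sqrt{\dfrac{SSE}{n-2}} = \sqrt{\dfrac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}}$   (13.13)

Residual

$e_i = Y_i - \hat{Y}_i$   (13.14)

Durbin-Watson Statistic

$D = \dfrac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$   (13.15)

Testing a Hypothesis for a Population Slope, $\beta_1$, Using the t Test

$t = \dfrac{b_1 - \beta_1}{S_{b_1}}$   (13.16)

Testing a Hypothesis for a Population Slope, $\beta_1$, Using the F Test

$F = \dfrac{MSR}{MSE}$   (13.17)

Confidence Interval Estimate of the Slope, $\beta_1$

$b_1 \pm t_{n-2} S_{b_1}$
$b_1 - t_{n-2} S_{b_1} \le \beta_1 \le b_1 + t_{n-2} S_{b_1}$   (13.18)

Testing for the Existence of Correlation

$t = \dfrac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n-2}}}$   (13.19)

Confidence Interval Estimate for the Mean of Y

$\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{h_i}$
$\hat{Y}_i - t_{n-2} S_{YX} \sqrt{h_i} \le \mu_{Y|X=X_i} \le \hat{Y}_i + t_{n-2} S_{YX} \sqrt{h_i}$   (13.20)

Prediction Interval for an Individual Response, Y

$\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{1 + h_i}$
$\hat{Y}_i - t_{n-2} S_{YX} \sqrt{1 + h_i} \le Y_{X=X_i} \le \hat{Y}_i + t_{n-2} S_{YX} \sqrt{1 + h_i}$   (13.21)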
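
As a check on how these equations fit together, the Python sketch below evaluates Equations (13.2) through (13.9), (13.13), (13.16), (13.20), and (13.21) for a small hypothetical data set. Several things here are assumptions that this list does not show: the function names `simple_regression` and `intervals`, the data, the standard error of the slope taken as S_YX divided by the square root of SSX, and h_i taken as 1/n plus the squared distance of X from its mean divided by SSX. The critical value t_{n-2} is supplied by the caller, for example from a t table.

```python
import math

def simple_regression(x, y):
    """Evaluate the key quantities of a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    ssxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = ssxy / ssx                                         # (13.3)
    b0 = y_bar - b1 * x_bar                                 # (13.4)
    y_hat = [b0 + b1 * xi for xi in x]                      # (13.2)
    sst = sum((yi - y_bar) ** 2 for yi in y)                # (13.6)
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # (13.7)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # (13.8)
    s_yx = math.sqrt(sse / (n - 2))                         # (13.13)
    s_b1 = s_yx / math.sqrt(ssx)  # assumed standard error of the slope
    return {
        "n": n, "x_bar": x_bar, "ssx": ssx, "b0": b0, "b1": b1,
        "sst": sst, "ssr": ssr, "sse": sse,
        "r2": ssr / sst,                                    # (13.9)
        "s_yx": s_yx,
        "t_slope": b1 / s_b1,  # (13.16) with hypothesized beta_1 = 0
    }

def intervals(fit, x_new, t_crit):
    """Confidence interval for the mean of Y (13.20) and prediction
    interval for an individual Y (13.21) at X = x_new."""
    y_hat = fit["b0"] + fit["b1"] * x_new
    # Assumed form of h_i; it is not defined in the list above.
    h = 1 / fit["n"] + (x_new - fit["x_bar"]) ** 2 / fit["ssx"]
    half_ci = t_crit * fit["s_yx"] * math.sqrt(h)
    half_pi = t_crit * fit["s_yx"] * math.sqrt(1 + h)
    return (y_hat - half_ci, y_hat + half_ci), (y_hat - half_pi, y_hat + half_pi)

x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]

fit = simple_regression(x, y)
# 3.1824 is the two-tail 0.05 critical value of t with n - 2 = 3 degrees of freedom.
ci, pi = intervals(fit, x_new=3, t_crit=3.1824)

print("b0, b1:", round(fit["b0"], 4), round(fit["b1"], 4))
print("SST, SSR, SSE:", round(fit["sst"], 4), round(fit["ssr"], 4), round(fit["sse"], 4))
print("r squared:", round(fit["r2"], 4))
print("t for slope:", round(fit["t_slope"], 4))
print("95% CI for mean of Y at X = 3:", ci)
print("95% PI for individual Y at X = 3:", pi)
```

The prediction interval is always wider than the confidence interval at the same X because of the extra 1 under the square root in Equation (13.21).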

Key Terms

assumptions of regression
autocorrelation
coefficient of determination
confidence interval estimate for the mean response
correlation coefficient
dependent variable
Durbin-Watson statistic
error sum of squares (SSE)
equal variance
explained variation
explanatory variable
homoscedasticity
independence of errors
independent variable
least-squares method
linear relationship
normality
prediction interval for an individual response, Y
prediction line
regression analysis
regression coefficient
regression sum of squares (SSR)
relevant range
residual
residual analysis
response variable
scatter diagram
scatter plot
simple linear regression
simple linear regression equation
slope
standard error of the estimate
total sum of squares (SST)
total variation
unexplained variation
Y intercept

Checking Your Understanding

13.64 What is the interpretation of the Y intercept and the slope in the simple linear regression equation?

13.65 What is the interpretation of the coefficient of determination?

13.66 When is the unexplained variation (that is, error sum of squares) equal to 0?

13.67 When is the explained variation (that is, regression sum of squares) equal to 0?

13.68 Why should you always carry out a residual analysis as part of a regression model?

13.69 What are the assumptions of regression analysis?

13.70 How do you evaluate the assumptions of regression analysis?

13.71 When and how do you use the Durbin-Watson statistic?

13.72 What is the difference between a confidence interval estimate of the mean response, $\mu_{Y|X=X_i}$, and a prediction interval of $Y_{X=X_i}$?

Applying the Concepts

13.73 Researchers from the Lubin School of Business at Pace University in New York City conducted a study on Internet-supported courses. In one part of the study, four numerical variables were collected on 108 students in an introductory management course that met once a week for an entire semester. One variable collected was hit consistency. To measure hit consistency, the researchers did the following: If a student did not visit the Internet site between classes, the student was given a 0 for that time period. If a student visited the Internet site one or more times between classes, the student was given a 1 for that time period. Because there were 13 time periods, a student's score on hit consistency could range from 0 to 13.
The other three variables included the student's course average, the student's cumulative grade point
average, and the total number of hits the student had on the Web site supporting the course. The following table shows the correlation coefficient for all pairs of variables. Note that correlations marked with an * are statistically significant, using α = 0.001:

Variable                                Correlation
Course Average, Cumulative GPA          0.72*
Course Average, Total Hits              0.08
Course Average, Hit Consistency         0.37*
Cumulative GPA, Total Hits              0.12
Cumulative GPA, Hit Consistency         0.32*
Total Hits, Hit Consistency             0.64*

Source: Extracted from D. Baugher, A. Varanelli, and E. Weisbord, "Student Hits in an Internet-Supported Course: How Can Instructors Use Them and What Do They Mean?" Decision Sciences Journal of Innovative Education, Fall 2003, 1(2), pp. 159-179.

a. What conclusions can you reach from this correlation analysis?
b. Are you surprised by the results, or are they consistent with your own observations and experiences?

13.74 Management of a soft-drink bottling company wants to develop a method for allocating delivery costs to customers. Although one cost clearly relates to travel time within a particular route, another variable cost reflects the time required to unload the cases of soft drink at the delivery point. A sample of 20 deliveries within a territory was selected. The delivery times and the numbers of cases delivered were recorded in the accompanying data file:

Customer   Number of Cases   Delivery Time (Minutes)
 1          52               32.1
 2          [illegible]      34.8
 3          73               36.2
 4          85               37.8
 5          95               37.8
 6         103               39.7
 7         116               38.5
 8         121               41.9
 9         143               44.2
10         157               47.1
11         161               43.0
12         184               49.4
13         202               57.2
14         218               56.8
15         243               60.6
16         254               61.2
17         267               58.2
18         275               63.1
19         287               65.6
20         298               67.3

Develop a regression model to predict delivery time, based on the number of cases delivered.
a. Use the least-squares method to compute the regression coefficients b0 and b1.
b. Interpret the meaning of b0 and b1 in this problem.
c. Predict the delivery time for 150 cases of soft drink.
d. Should you use the model to predict the delivery time for a customer who is receiving 500 cases of soft drink? Why or why not?
e. Determine the coefficient of determination, r², and explain its meaning in this problem.
f. Perform a residual analysis. Is there any evidence of a pattern in the residuals? Explain.
g. At the 0.05 level of significance, is there evidence of a linear relationship between delivery time and the number of cases delivered?
h. Construct a 95% confidence interval estimate of the mean delivery time for 150 cases of soft drink.
i. Construct a 95% prediction interval of the delivery time for a single delivery of 150 cases of soft drink.
j. Construct a 95% confidence interval estimate of the population slope.
k. Explain how the results in (a) through (j) can help allocate delivery costs to customers.

13.75 A brokerage house wants to predict the number of trade executions per day, using the number of incoming phone calls as a predictor variable. Data were collected over a period of 35 days and are stored in the accompanying data file.
a. Use the least-squares method to compute the regression coefficients b0 and b1.
b. Interpret the meaning of b0 and b1 in this problem.
c. Predict the number of trades executed for a day in which the number of incoming calls is 2,000.
d. Should you use the model to predict the number of trades executed for a day in which the number of incoming calls is 5,000? Why or why not?
e. Determine the coefficient of determination, r², and explain its meaning in this problem.
f. Plot the residuals against the number of incoming calls and also against the days. Is there any evidence of a pattern in the residuals with either of these variables? Explain.
g. Determine the Durbin-Watson statistic for these data.
h. Based on the results of (f) and (g), is there reason to question the validity of the model? Explain.
i. At the 0.05 level of significance, is there evidence of a linear relationship between the volume of trade executions and the number of incoming calls?
j. Construct a 95% confidence interval estimate of the mean number of trades executed for days in which the number of incoming calls is 2,000.
k. Construct a 95% prediction interval of the number of trades executed for a particular day in which the number of incoming calls is 2,000.
l. Construct a 95% confidence interval estimate of the population slope.
m. Based on the results of (a) through (l), do you think the brokerage house should focus on a strategy of increasing the total number of incoming calls or on a strategy that relies on trading by a small number of heavy traders? Explain.

13.76 You want to develop a model to predict the selling price of homes based on assessed value. A sample of 30
recently sold single-family houses in a small city is selected to study the relationship between selling price (in thousands of dollars) and assessed value (in thousands of dollars). The houses in the city had been reassessed at full value one year prior to the study. The results are in the accompanying data file.
(Hint: First, determine which are the independent and dependent variables.)
a. Construct a scatter plot and, assuming a linear relationship, use the least-squares method to compute the regression coefficients b0 and b1.
b. Interpret the meaning of the Y intercept, b0, and the slope, b1, in this problem.
c. Use the prediction line developed in (a) to predict the selling price for a house whose assessed value is $170,000.
d. Determine the coefficient of determination, r², and interpret its meaning in this problem.
e. Perform a residual analysis on your results and determine the adequacy of the fit of the model.
f. At the 0.05 level of significance, is there evidence of a linear relationship between selling price and assessed value?
g. Construct a 95% confidence interval estimate of the mean selling price for houses with an assessed value of $170,000.
h. Construct a 95% prediction interval of the selling price of an individual house with an assessed value of $170,000.
i. Construct a 95% confidence interval estimate of the population slope.

13.77 You want to develop a model to predict the assessed value of houses, based on heating area. A sample of 15 single-family houses is selected in a city. The assessed value (in thousands of dollars) and the heating area of the houses (in thousands of square feet) are recorded, with the following results, stored in the accompanying data file:

House   Assessed Value ($000)   Heating Area of Dwelling (Thousands of Square Feet)
 1      184.4                   2.00
 2      177.4                   1.71
 3      175.7                   1.45
 4      185.9                   1.76
 5      179.1                   1.93
 6      170.4                   1.20
 7      175.8                   1.55
 8      185.9                   1.93
 9      178.5                   1.59
10      179.2                   1.50
11      186.7                   1.90
12      179.3                   1.39
13      174.5                   1.54
14      183.8                   1.89
15      176.8                   1.59

(Hint: First, determine which are the independent and dependent variables.)
a. Construct a scatter plot and, assuming a linear relationship, use the least-squares method to compute the regression coefficients b0 and b1.
b. Interpret the meaning of the Y intercept, b0, and the slope, b1, in this problem.
c. Use the prediction line developed in (a) to predict the assessed value for a house whose heating area is 1,750 square feet.
d. Determine the coefficient of determination, r², and interpret its meaning in this problem.
e. Perform a residual analysis on your results and determine the adequacy of the fit of the model.
f. At the 0.05 level of significance, is there evidence of a linear relationship between assessed value and heating area?
g. Construct a 95% confidence interval estimate of the mean assessed value for houses with a heating area of 1,750 square feet.
h. Construct a 95% prediction interval of the assessed value of an individual house with a heating area of 1,750 square feet.
i. Construct a 95% confidence interval estimate of the population slope.

13.78 The director of graduate studies at a large college of business would like to predict the grade point average (GPA) of students in an MBA program based on the Graduate Management Admission Test (GMAT) score. A sample of 20 students who had completed 2 years in the program is selected. The results are stored in the accompanying data file:

Observation   GMAT Score   GPA
 1            688          3.72
 2            647          3.44
 3            652          3.21
 4            608          3.29
 5            680          3.91
 6            617          3.28
 7            557          3.02
 8            599          3.13
 9            616          3.45
10            594          3.33
11            567          3.07
12            542          2.86
13            551          2.91
14            573          2.79
15            536          3.00
16            639          3.55
17            619          [illegible]
18            694          3.60
19            718          3.88
20            759          3.76

(Hint: First, determine which are the independent and dependent variables.)
a. Construct a scatter plot and, assuming a linear relationship, use the least-squares method to compute the regression coefficients b0 and b1.
b. Interpret the meaning of the Y intercept, b0, and the slope, b1, in this problem.
c. Use the prediction line developed in (a) to predict the GPA for a student with a GMAT score of 600.
d. Determine the coefficient of determination, r², and interpret its meaning in this problem.
e. Perform a residual analysis on your results and determine the adequacy of the fit of the model.
f. At the 0.05 level of significance, is there evidence of a linear relationship between GMAT score and GPA?
g. Construct a 95% confidence interval estimate of the mean GPA of students with a GMAT score of 600.
h. Construct a 95% prediction interval of the GPA for a particular student with a GMAT score of 600.
i. Construct a 95% confidence interval estimate of the population slope.

13.79 The manager of the purchasing department of a large banking organization would like to develop a model to predict the amount of time it takes to process invoices. Data are collected from a sample of 30 days, and the number of invoices processed and completion time, in hours, are stored in the accompanying data file.
(Hint: First, determine which are the independent and dependent variables.)
a. Assuming a linear relationship, use the least-squares method to compute the regression coefficients b0 and b1.
b. Interpret the meaning of the Y intercept, b0, and the slope, b1, in this problem.
c. Use the prediction line developed in (a) to predict the amount of time it would take to process 150 invoices.
d. Determine the coefficient of determination, r², and interpret its meaning.
e. Plot the residuals against the number of invoices processed and also against time.
f. Based on the plots in (e), does the model seem appropriate?
g. Compute the Durbin-Watson statistic and, at the 0.05 level of significance, determine whether there is any autocorrelation in the residuals.
h. Based on the results of (e) through (g), what conclusions can you reach concerning the validity of the model?
i. At the 0.05 level of significance, is there evidence of a linear relationship between the amount of time and the number of invoices processed?
j. Construct a 95% confidence interval estimate of the mean amount of time it would take to process 150 invoices.
k. Construct a 95% prediction interval of the amount of time it would take to process 150 invoices on a particular day.

13.80 On January 28, 1986, the space shuttle Challenger exploded, and seven astronauts were killed. Prior to the launch, the predicted atmospheric temperature was for freezing weather at the launch site. Engineers for Morton Thiokol (the manufacturer of the rocket motor) prepared charts to make the case that the launch should not take place due to the cold weather. These arguments were rejected, and the launch tragically took place. Upon investigation after the tragedy, experts agreed that the disaster occurred because of leaky rubber O-rings that did not seal properly due to the cold temperature. Data indicating the atmospheric temperature at the time of 23 previous launches and the O-ring damage index are stored in the accompanying data file:

Flight Number   Temperature (°F)   O-Ring Damage Index
1               66                  0
2               70                  4
3               69                  0
5               68                  0
6               67                  0
7               72                  0
8               73                  0
9               70                  0
41-B            57                  4
41-C            63                  2
41-D            70                  4
41-G            78                  0
51-A            67                  0
51-B            75                  0
51-C            53                 11
51-D            67                  0
51-F            81                  0
51-G            70                  0
51-I            67                  0
51-J            79                  0
61-A            75                  4
61-B            76                  0
61-C            58                  4

Note: Data from flight 4 is omitted due to unknown O-ring condition.
Source: Extracted from Report of the Presidential Commission on the Space Shuttle Challenger Accident, Washington, DC, 1986, Vol. II (H1-H3) and Vol. IV (664), and Post Challenger Evaluation of Space Shuttle Risk Assessment and Management, Washington, DC, 1988, pp. 135-136.

a. Construct a scatter plot for the seven flights in which there was O-ring damage (O-ring damage index ≠ 0). What conclusions, if any, can you draw about the relationship between atmospheric temperature and O-ring damage?
b. Construct a scatter plot for all 23 flights.
c. Explain any differences in the interpretation of the relationship between atmospheric temperature and O-ring damage in (a) and (b).
d. Based on the scatter plot in (b), provide reasons why a prediction should not be made for an atmospheric temperature of 31°F, the temperature on the morning of the launch of the Challenger.
e. Although the assumption of a linear relationship may not be valid, fit a simple linear regression model to predict O-ring damage, based on atmospheric temperature.
f. Include the prediction line found in (e) on the scatter plot developed in (b).
g. Based on the results of (f), do you think a linear model is appropriate for these data? Explain.
h. Perform a residual analysis. What conclusions do you reach?

13.81 Crazy Dave, a well-known baseball analyst, would like to study various team statistics for the 2005 baseball season to determine which variables might be useful in predicting the number of wins achieved by teams during the season. He has decided to begin by using a team's earned run average (ERA), a measure of pitching performance, to predict the number of wins. The data for the 30 Major League Baseball teams are in the accompanying data file.
(Hint: First, determine which are the independent and dependent variables.)
a. Assuming a linear relationship, use the least-squares method to compute the regression coefficients b0 and b1.
b. Interpret the meaning of the Y intercept, b0, and the slope, b1, in this problem.
c. Use the prediction line developed in (a) to predict the number of wins for a team with an ERA of 4.50.
d. Compute the coefficient of determination, r², and interpret its meaning.
e. Perform a residual analysis on your results and determine the adequacy of the fit of the model.
f. At the 0.05 level of significance, is there evidence of a linear relationship between the number of wins and the ERA?
g. Construct a 95% confidence interval estimate of the mean number of wins expected for teams with an ERA of 4.50.
h. Construct a 95% prediction interval of the number of wins for an individual team that has an ERA of 4.50.
i. Construct a 95% confidence interval estimate of the slope.
j. The 30 teams constitute a population. In order to use statistical inference, as in (f) through (i), the data must be assumed to represent a random sample. What "population" would this sample be drawing conclusions about?
k. What other independent variables might you consider for inclusion in the model?

13.82 College football players trying out for the NFL are given the Wonderlic standardized intelligence test. The accompanying data file contains the average Wonderlic scores of football players trying out for the NFL and the graduation rates for football players at selected schools (extracted from S. Walker, "The NFL's Smartest Team," The Wall Street Journal, September 30, 2005, pp. W1, W10). You plan to develop a regression model to predict the Wonderlic scores for football players trying out for the NFL, based on the graduation rate of the school they attended.
a. Assuming a linear relationship, use the least-squares method to compute the regression coefficients b0 and b1.
b. Interpret the meaning of the Y intercept, b0, and the slope, b1, in this problem.
c. Use the prediction line developed in (a) to predict the Wonderlic score for football players trying out for the NFL from a school that has a graduation rate of 50%.
d. Compute the coefficient of determination, r², and interpret its meaning.
e. Perform a residual analysis on your results and determine the adequacy of the fit of the model.
f. At the 0.05 level of significance, is there evidence of a linear relationship between the Wonderlic score for a football player trying out for the NFL from a school and the school's graduation rate?
g. Construct a 95% confidence interval estimate of the mean Wonderlic score for football players trying out for the NFL from a school that has a graduation rate of 50%.
h. Construct a 95% prediction interval of the Wonderlic score for a football player trying out for the NFL from a school that has a graduation rate of 50%.
i. Construct a 95% confidence interval estimate of the slope.

13.83 College basketball is big business, with coaches' salaries, revenues, and expenses in millions of dollars. The accompanying data file contains the coaches' salaries and revenues for college basketball at selected schools in a recent year (extracted from R. Adams, "Pay for Playoffs," The Wall Street Journal, March 11-12, 2006, pp. P1, P8). You plan to develop a regression model to predict a coach's salary based on revenue.
a. Assuming a linear relationship, use the least-squares method to compute the regression coefficients b0 and b1.
b. Interpret the meaning of the Y intercept, b0, and the slope, b1, in this problem.
c. Use the prediction line developed in (a) to predict the coach's salary for a school that has revenue of $7 million.
d. Compute the coefficient of determination, r², and interpret its meaning.
e. Perform a residual analysis on your results and determine the adequacy of the fit of the model.
f. At the 0.05 level of significance, is there evidence of a linear relationship between the coach's salary for a school and revenue?
g. Construct a 95% confidence interval estimate of the mean salary of coaches at schools that have revenue of $7 million.
h. Construct a 95% prediction interval of the coach's salary for a school that has revenue of $7 million.
i. Construct a 95% confidence interval estimate of the slope.

13.84 During the fall harvest season in the United States, pumpkins are sold in large quantities at farm stands. Often, instead of weighing the pumpkins prior to sale, the farm stand operator will just place the pumpkin in the appropriate circular cutout on the counter. When asked why this was done, one farmer replied, "I can tell the weight of the pumpkin from its circumference." To determine whether this was really true, a sample of 23 pumpkins were measured
for circumference and weighed, with the following results, stored in the accompanying data file:

Circumference (cm)   Weight (Grams)
50                   1,200
[illegible]          2,000
54                   1,500
52                   1,700
37                     500
52                   1,000
53                   1,500
47                   1,400
51                   1,500
63                   2,500
33                     500
43                   1,000
57                   2,000
66                   2,500
82                   4,600
83                   4,600
70                   3,100
34                     600
51                   1,500
50                   1,500
49                   1,600
60                   2,300
59                   2,100

a. Assuming a linear relationship, use the least-squares method to compute the regression coefficients b0 and b1.
b. Interpret the meaning of the slope, b1, in this problem.
c. Predict the mean weight for a pumpkin that is 60 centimeters in circumference.
d. Do you think it is a good idea for the farmer to sell pumpkins by circumference instead of weight? Explain.
e. Determine the coefficient of determination, r², and interpret its meaning.
f. Perform a residual analysis for these data and determine the adequacy of the fit of the model.
g. At the 0.05 level of significance, is there evidence of a linear relationship between the circumference and the weight of a pumpkin?
h. Construct a 95% confidence interval estimate of the population slope, β1.
i. Construct a 95% confidence interval estimate of the population mean weight for pumpkins that have a circumference of 60 centimeters.
j. Construct a 95% prediction interval of the weight for an individual pumpkin that has a circumference of 60 centimeters.

13.85 Can demographic information be helpful in predicting sales of sporting goods stores? The data stored in the accompanying data file are the monthly sales totals from a random sample of 38 stores in a large chain of nationwide sporting goods stores. All stores in the franchise, and thus within the sample, are approximately the same size and carry the same merchandise. The county or, in some cases, counties in which the store draws the majority of its customers is referred to here as the customer base. For each of the 38 stores, demographic information about the customer base is provided. The data are real, but the name of the franchise is not used at the request of the company. The variables in the data set are
Sales: Latest one-month sales total (dollars)
Age: Median age of customer base (years)
HS: Percentage of customer base with a high school diploma
College: Percentage of customer base with a college diploma
Growth: Annual population growth rate of customer base over the past 10 years
Income: Median family income of customer base (dollars)
a. Construct a scatter plot, using sales as the dependent variable and median family income as the independent variable. Discuss the scatter diagram.
b. Assuming a linear relationship, use the least-squares method to compute the regression coefficients b0 and b1.
c. Interpret the meaning of the Y intercept, b0, and the slope, b1, in this problem.
d. Compute the coefficient of determination, r², and interpret its meaning.
e. Perform a residual analysis on your results and determine the adequacy of the fit of the model.
f. At the 0.05 level of significance, is there evidence of a linear relationship between the independent variable and the dependent variable?
g. Construct a 95% confidence interval estimate of the slope and interpret its meaning.

13.86 For the data of Problem 13.85, repeat (a) through (g), using median age as the independent variable.

13.87 For the data of Problem 13.85, repeat (a) through (g), using high school graduation rate as the independent variable.

13.88 For the data of Problem 13.85, repeat (a) through (g), using college graduation rate as the independent variable.

13.89 For the data of Problem 13.85, repeat (a) through (g), using population growth as the independent variable.

13.90 Zagat's publishes restaurant ratings for various locations in the United States. The accompanying data file contains the Zagat rating for food, decor, service, and the price per person for a sample of 50 restaurants located in an urban area (New York City) and 50 restaurants located in a suburb of New York City. Develop a regression model to predict the price per person, based on a variable that represents the sum of the ratings for food, decor, and service.
Source: Extracted from Zagat Survey 2002 New York City Restaurants and Zagat Survey 2001-2002, Long Island Restaurants.
a. Assuming a linear relationship, use the least-squares method to compute the regression coefficients b0 and b1.
b. Interpret the meaning of the Y intercept, b0, and the slope, b1, in this problem.
c. Use the prediction line developed in (a) to predict the price per person for a restaurant with a summated rating of 50.
d. Compute the coefficient of determination, r², and interpret its meaning.
e. Perform a residual analysis on your results and determine the adequacy of the fit of the model.
f. At the 0.05 level of significance, is there evidence of a linear relationship between the price per person and the summated rating?
g. Construct a 95% confidence interval estimate of the mean price per person for all restaurants with a summated rating of 50.
h. Construct a 95% prediction interval of the price per person for a restaurant with a summated rating of 50.
i. Construct a 95% confidence interval estimate of the slope.
j. How useful do you think the summated rating is as a predictor of price? Explain.

13.91 Refer to the discussion of beta values and market models in Problem 13.49 on pages 544-545. One hundred weeks of data, ending the week of May 22, 2006, for the S&P 500 and three individual stocks are included in the accompanying data file. Note that the weekly percentage change for both the S&P 500 and the individual stocks is measured as the percentage change from the previous week's closing value to the current week's closing value. The variables included are
Week: Current week
SP500: Weekly percentage change in the S&P 500 Index
WALMART: Weekly percentage change in stock price of Wal-Mart Stores, Inc.
TARGET: Weekly percentage change in stock price of the Target Corporation
SARALEE: Weekly percentage change in stock price of the Sara Lee Corporation
Source: Extracted from finance.yahoo.com, May 31, 2006.
a. Estimate the market model for Wal-Mart Stores, Inc. (Hint: Use the percentage change in the S&P 500 Index as the independent variable and the percentage change in Wal-Mart Stores, Inc.'s stock price as the dependent variable.)
b. Interpret the beta value for Wal-Mart Stores, Inc.
c. Repeat (a) and (b) for Target Corporation.
d. Repeat (a) and (b) for Sara Lee Corporation.
e. Write a brief summary of your findings.

13.92 The accompanying data file contains the stock prices of four companies, collected weekly for 53 consecutive weeks, ending May 22, 2006. The variables are
Week: Closing date for stock prices
MSFT: Stock price of Microsoft, Inc.
Ford: Stock price of Ford Motor Company
GM: Stock price of General Motors, Inc.
IAL: Stock price of International Aluminum, Inc.
Source: Extracted from finance.yahoo.com, May 31, 2006.
a. Calculate the correlation coefficient, r, for each pair of stocks. (There are six of them.)
b. Interpret the meaning of r for each pair.
c. Is it a good idea to have all the stocks in an individual's portfolio be strongly positively correlated with each other? Explain.

13.93 Is the daily performance of stocks and bonds correlated? The accompanying data file contains information concerning the closing value of the Dow Jones Industrial Average and the Vanguard Long-Term Bond Index Fund for 60 consecutive business days, ending May 30, 2006. The variables included are
Date: Current day
Bonds: Closing price of Vanguard Long-Term Bond Index Fund
Stocks: Closing price of the Dow Jones Industrial Average
Source: Extracted from finance.yahoo.com, May 31, 2006.
a. Compute and interpret the correlation coefficient, r, for the variables Stocks and Bonds.
b. At the 0.05 level of significance, is there a relationship between these two variables? Explain.

Report Writing Exercises

13.94 In Problems 13.85-13.89 on page 561, you developed regression models to predict monthly sales at a sporting goods store. Now, write a report based on the models you developed. Append to your report all appropriate charts and statistical information.

Managing the Springville Herald
To ensure that as many trial subscriptions as possible are converted to regular subscriptions, the Herald marketing department works closely with the distribution department to accomplish a smooth initial delivery process for the trial subscription customers. To assist in this effort, the marketing department needs to accurately forecast the number of new regular subscriptions for the coming months.
A team consisting of managers from the marketing and distribution departments was convened to develop a better method of forecasting new subscriptions. Previously, after examining new subscription data for the prior three months, a group of three managers would develop a subjective forecast of the number of new subscriptions. Lauren Hall, who was recently hired by the company to provide special skills in quantitative forecasting methods, suggested that the department look for factors that might help in predicting new subscriptions.
Members of the team found that the forecasts in the past year had been particularly inaccurate because in some months, much more time was spent on telemarketing than in other months. In particular, in the past month, only 1,055 hours were completed because callers were busy during the first week of the month attending training sessions on the personal but formal greeting style and a new standard presentation guide (see "Managing the Springville Herald" in Chapter 11). Lauren collected data (stored in the file) for the number of new subscriptions and hours spent on telemarketing for each month for the past two years.

EXERCISES
SH13.1 What criticism can you make concerning the method of forecasting that involved taking the new subscriptions data for the prior three months as the basis for future projections?
SH13.2 What factors other than number of telemarketing hours spent might be useful in predicting the number of new subscriptions? Explain.
SH13.3 a. Analyze the data and develop a regression model to predict the mean number of new subscriptions for a month, based on the number of hours spent on telemarketing for new subscriptions.
b. If you expect to spend 1,200 hours on telemarketing per month, estimate the mean number of new subscriptions for the month. Indicate the assumptions on which this prediction is based. Do you think these assumptions are valid? Explain.
c. What would be the danger of predicting the number of new subscriptions for a month in which 2,000 hours were spent on telemarketing?

Apply your knowledge of simple linear regression in this Web Case, which extends the Sunflowers Apparel Using Statistics scenario from this chapter.
Leasing agents from the Triangle Mall Management Corporation have suggested that Sunflowers consider several locations in some of Triangle's newly renovated lifestyle malls that cater to shoppers with higher-than-mean disposable income. Although the locations are smaller than the typical Sunflowers location, the leasing agents argue that higher-than-mean disposable income in the surrounding community is a better predictor of higher sales than store size. The leasing agents maintain that sample data from 14 Sunflowers stores prove that this is true.
Review the leasing agents' proposal and supporting documents that describe the data at the company's Web site, www.prenhall.com/Springville/Triangle_Sunflower.htm (or open this Web case file from the Student CD-ROM's Web Case folder), and then answer the following:
1. Should mean disposable income be used to predict sales based on the sample of 14 Sunflowers stores?
2. Should the management of Sunflowers accept the claims of Triangle's leasing agents? Why or why not?
3. Is it possible that the mean disposable income of the surrounding area is not an important factor in leasing new locations? Explain.
4. Are there any other factors not mentioned by the leasing agents that might be relevant to the store leasing decision?

EXCEL COMPANION TO CHAPTER 13

E13.1 PERFORMING SIMPLE LINEAR REGRESSION ANALYSES
You perform a simple linear regression analysis by either using the PHStat2 Simple Linear Regression procedure or by using the ToolPak Regression procedure.

Using PHStat2 Simple Linear Regression
Open to the worksheet that contains the data for the regression analysis. Select PHStat → Regression → Simple Linear Regression. In the procedure's dialog box (shown below), enter the cell range of the Y variable as the Y Variable Cell Range and the cell range of the X variable as the X Variable Cell Range. Click First cells in both ranges contain label and enter a value for the Confidence level for regression coefficients. Click the Regression Statistics Table and the ANOVA and Coefficients Table Regression Tool Output Options, enter a title as the Title, and click OK.

[PHStat2 Simple Linear Regression dialog box]

PHStat2 performs the regression analysis, using the ToolPak Regression procedure. Therefore, the worksheet produced does not dynamically change if you change your data. (Rerun the procedure to create revised results.) The three Output Options available in the PHStat2 dialog box enhance the ToolPak procedure and are explained in Sections E13.2, E13.4, and E13.5.

Using ToolPak Regression
Open to the worksheet that contains the data for the regression analysis. Select Tools → Data Analysis, select Regression from the Data Analysis list, and click OK. In the procedure's dialog box (shown below), enter the cell range of the Y variable data as the Input Y Range and enter the cell range of the X variable data as the Input X Range. Click Labels, click Confidence Level and enter a value in its box, and then click OK. Results appear on a new worksheet.

[ToolPak Regression dialog box]

E13.2 CREATING SCATTER PLOTS AND ADDING A PREDICTION LINE
You use Excel charting features to create a scatter plot and add a prediction line to that plot. If you select the Scatter Diagram output option of the PHStat2 Simple Linear Regression procedure (see Section E13.1), you can skip to the "Adding a Prediction Line" section that applies to the Excel version you use.
Creating a Scatter Plot
Use either the Section E2.12 instructions to create a scatter plot (see page 93) or use the Section E13.1 instructions in "Using PHStat2 Simple Linear Regression", but click Scatter Diagram before you click OK.

Adding a Prediction Line (97-2003)
Open to the chart sheet that contains your scatter plot and select Chart → Add Trendline. In the Add Trendline dialog box (see Figure E13.1), click the Type tab and then click Linear. Click the Options tab and select the Automatic option. Click Display equation on chart and Display R-squared value on chart and then click OK. If you have included a label as part of your data range, you will see that label displayed in place of Series1 in this dialog box.

FIGURE E13.1 Add Trendline dialog box (97-2003)

Adding a Prediction Line (2007)
Open to the chart sheet that contains your scatter plot and select Layout → Trendline and in the Trendline gallery, select More Trendline Options. In the Trendline Options panel of the Format Trendline dialog box (see Figure E13.2), select the Linear option, click Display equation on chart and Display R-squared value on chart, and click Close.

FIGURE E13.2 Format Trendline dialog box (2007)

Relocating an X Axis
If there are Y values on a residual plot or scatter plot that are less than zero, Microsoft Excel places the X axis at the point Y = 0, possibly obscuring some of the data points. To relocate the X axis to the bottom of the chart, open to the chart, right-click the Y axis, and select Format Axis from the shortcut menu.
If you use Excel 97-2003, select the Scale tab in the Format Axis dialog box (see Figure E13.3), and enter the value found in the Minimum box (-6 in Figure E13.3) as the Value (X) axis Crosses at value and click OK. (As you enter this value, the check box for this entry is cleared automatically.)

FIGURE E13.3 Format Axis dialog box (97-2003)
If you use Excel 2007, in the Axis Options panel of the Format Axis dialog box (see Figure E13.4), select the Axis value option, change its default value of 0.0 (shown in Figure E13.4) to a value less than the minimum Y value, and click Close.

FIGURE E13.4 Format Axis dialog box (2007)

E13.3 PERFORMING RESIDUAL ANALYSES
You modify the procedures of Section E13.1 to perform a residual analysis. If you use the PHStat2 Simple Linear Regression procedure, click all the Regression Tool output options (Regression Statistics Table, ANOVA and Coefficients Table, Residuals Table, and Residual Plot). If you use the ToolPak Regression procedure, click Residuals and Residual Plots before clicking OK. If you need to relocate an X axis to the bottom of a residual plot, review the "Relocating an X Axis" part of Section E13.2.

E13.4 COMPUTING THE DURBIN-WATSON STATISTIC
You compute the Durbin-Watson statistic by either using the PHStat2 Simple Linear Regression procedure or by using a several-step process that uses the Durbin-Watson.xls workbook.

Using PHStat2 Simple Linear Regression
Use the Section E13.1 instructions in "Using PHStat2 Simple Linear Regression," but click Durbin-Watson Statistic before you click OK. Choosing the Durbin-Watson Statistic causes PHStat2 to create a residuals table, even if you did not check the Residuals Table Regression Tool output option.
The Durbin-Watson Statistic output option creates a new Durbin-Watson worksheet similar to the one shown in Figure 13.16 on page 536. This worksheet references cells in the regression results worksheet that is also created by the procedure. If you delete the regression results worksheet, the DurbinWatson worksheet displays an error message.

Using Durbin-Watson.xls
Open to the DurbinWatson worksheet of the Durbin-Watson.xls workbook. This worksheet (see Figure 13.16 on page 536) uses the SUMXMY2(cell range 1, cell range 2) function in cell B3 to compute the sum of squared differences of the residuals, and the SUMSQ(residuals cell range) function in cell B4 to compute the sum of squared residuals for the Section 13.6 package delivery store example.
By setting cell range 1 to the cell range of the first residual through the second-to-last residual and cell range 2 to the cell range of the second residual through the last residual, you can get SUMXMY2 to compute the squared difference between two successive residuals, which is the numerator term of Equation (13.15). Because residuals appear in a regression results worksheet, cell references used in the SUMXMY2 function must refer to the regression results worksheet by name.
In the Durbin-Watson workbook, the SLR worksheet contains the simple linear regression analysis for the Section 13.6 package delivery example. The residuals appear in the cell range C25:C39. Therefore, cell range 1 is set to SLR!C25:C38, and cell range 2 is set to SLR!C26:C39. This makes the cell B3 formula =SUMXMY2(SLR!C26:C39, SLR!C25:C38). The cell B4 formula, which also must refer to the SLR worksheet, is =SUMSQ(SLR!C25:C39).
To adapt the Durbin-Watson workbook to other problems, first create a simple linear regression results worksheet that contains residual output and copy that worksheet to the Durbin-Watson workbook. Then open to the Durbin-Watson worksheet and edit the formulas in cells B3 and B4 so that they refer to the correct cell ranges on your regression worksheet. Finally, delete the no-longer-needed SLR worksheet.

E13.5 ESTIMATING THE MEAN OF Y AND PREDICTING Y VALUES
You compute a confidence interval estimate for the mean response and the prediction interval for an individual response either by selecting the PHStat2 Simple Linear Regression procedure or by making entries in the CIEandPIforSLR.xls workbook.
Using PHStat2 Simple Linear Regression
Use the Section E13.1 instructions in "Using PHStat2 Simple Linear Regression", but before you click OK, click Confidence and Prediction Interval for X = and enter an X value in its box (see below). Then enter a value for the Confidence level for interval estimates and click OK.

[PHStat2 Simple Linear Regression dialog box]

PHStat2 places the confidence interval estimate and prediction interval on a new worksheet similar to the one shown in Figure 13.21 on page 549. (PHStat2 also creates a DataCopy worksheet that is discussed in the next part of this section.)

Using CIEandPIforSLR.xls
Open to the CIEandPI worksheet of the CIEandPIforSLR.xls workbook. This worksheet (shown in Figure 13.21 on page 549) uses the function TINV(1 - confidence level, degrees of freedom) to determine the t value and compute the confidence interval estimate and prediction interval for the Section 13.8 Sunflowers Apparel example.
Cells B8, B11, B12, and B15 contain formulas that reference individual cells on a DataCopy worksheet. This worksheet, the first six rows of which are shown in Figure E13.5, contains a copy of the regression data in columns A and B and a formula in column C that squares the difference between each X and X̄. The worksheet also computes the sample size, the sample mean, the sum of the squared differences [SSX in Equation (13.20) on page 546], and the predicted Y value in cells F2, F3, F4, and F5.

FIGURE E13.5 DataCopy worksheet (first six rows)
Formulas shown in the figure: Sample Size =COUNT(B:B); Sample Mean =AVERAGE(A:A); Sum of Squared Difference =SUM(C:C); Predicted Y (YHat) =TREND(B2:B15, A2:A15, CIEandPI!B4)

The cell F5 formula uses the function TREND(Y variable cell range, X variable cell range, X value) to calculate the predicted Y value. Because the formula uses the X value that has been entered on the CIEandPI worksheet, the X value in the cell F5 formula is set to CIEandPI!B4. Because the DataCopy and CIEandPI worksheets reference each other, you should consider these worksheets a matched pair that should not be broken up.
To adapt these worksheets to other problems, first create a simple linear regression results worksheet. Then, transfer the standard error value, always found in the regression results worksheet cell B7, to cell B13 of the CIEandPI worksheet. Change, as is necessary, the X Value and the confidence level in cells B4 and B5 of the CIEandPI worksheet. Next, open to the DataCopy worksheet, and if your sample size is not 14, follow the instructions found in the worksheet. Enter the problem's X values in column A and Y values in column B. Finally, return to the CIEandPI worksheet to examine its updated results.

E13.6 EXAMPLE: SUNFLOWERS APPAREL DATA
This section shows you how to use PHStat2 or Basic Excel to perform a regression analysis for Sunflowers Apparel using the square footage and annual sales data stored in the workbook.

Using PHStat2
Open to the Data worksheet of the workbook. Select PHStat → Regression → Simple Linear Regression. In the procedure's dialog box (see Figure E13.6), enter C1:C15 as the Y Variable Cell Range and B1:B15 as the X Variable Cell Range. Click First cells in both ranges
contain label and enter a value for the Confidence level for regression coefficients. Click the Regression Statistics Table, ANOVA and Coefficients Table, Residuals Table, and Residual Plot Regression Tool Output Options. Enter Site Selection Analysis as the Title and click Scatter Diagram. Click Confidence and Prediction Interval for X = and enter 4 in its box. Enter 95 in the Confidence level for interval estimates box. Click OK to execute the procedure.

FIGURE E13.6 Completed Simple Linear Regression dialog box

To evaluate the assumption of linearity, you review the Residual Plot for X1 chart sheet. Note that there is no apparent pattern or relationship between the residuals and X variable.
To evaluate the normality assumption, create a normal probability plot. With your workbook open to the SLR worksheet, select PHStat → Probability & Prob. Distributions → Normal Probability Plot. In the procedure's dialog box (see Figure E13.7), enter C24:C38 as the Variable Cell Range and click First cell contains label. Enter Normal Probability Plot as the Title and click OK. In the Normal Plot chart sheet, observe that the data do not appear to depart substantially from a normal distribution.

FIGURE E13.7 Completed Normal Probability Plot dialog box

To evaluate the assumption of equal variances, review the Residual Plot for X1 chart sheet. Note that there do not appear to be major differences in the variability of the residuals.
You conclude that all assumptions are valid and that you can use this simple linear regression model for the Sunflowers Apparel data. You can now open to the SLR worksheet to view the details of the analysis or open to the Estimate worksheet to make inferences about the mean of Y and the prediction of individual values of Y.

Using Basic Excel
Open to the Data worksheet of the workbook. Select Tools → Data Analysis (97-2003) or Data → Data Analysis (2007). Select Regression from the Data Analysis list, and click OK. In the procedure's dialog box (see Figure E13.8), enter C1:C15 as the Input Y Range and enter B1:B15 as the Input X Range. Click Labels, click Confidence Level and enter 95 in its box, and click Residuals. Click OK to execute the procedure.

FIGURE E13.8 Completed Regression dialog box
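To see what the Regression procedure reports when you also request residuals, you can reproduce the main quantities by hand. The following is a rough Python sketch using the least-squares formulas. It is an illustration only, not the ToolPak's code. The function name simple_linear_regression, the dictionary keys, and the five data points are hypothetical.

```python
import math

def simple_linear_regression(x, y):
    """Least-squares fit of Y on X with the usual summary measures."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / ssx              # slope
    b0 = y_bar - b1 * x_bar     # Y intercept
    # Residual = observed Y minus predicted Y
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return {
        "b0": b0,
        "b1": b1,
        "r2": 1 - sse / sst,               # coefficient of determination
        "s_yx": math.sqrt(sse / (n - 2)),  # standard error of the estimate
        "residuals": residuals,
    }

# Made-up data, not the Sunflowers Apparel sample
result = simple_linear_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(result)
```

The residuals list is the input you would plot against X to evaluate linearity and equal variance.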

To evaluate the assumption of linearity, you plot the residuals against the square feet (independent) variable. To simplify creating this plot, open to the Data worksheet and copy the square feet cell range B1:B15 to cell E1. Then copy the cell range of the residuals, C24:C38 on the SLR worksheet, to cell F1 of the Data worksheet. With your workbook open to the Data worksheet, use the Section E13.2 instructions on pages 564-566 to create a scatter plot. (Use E1:F15 as the Data range (Excel 97-2003) or as the cell range of the X and Y variables (Excel 2007) when creating the scatter plot.) Review the scatter plot. Observe that there is no apparent pattern or relationship between the residuals and X variable. You conclude that the linearity assumption holds.
You now evaluate the normality assumption by creating a normal probability plot. Create a Plot worksheet, using the model worksheet in the workbook as your guide. In a new worksheet, enter Rank in cell A1 and then enter the series 1 through 14 in cells A2:A15. Enter Proportion in cell B1 and enter the formula =A2/15 in cell B2. Next, enter Z Value in cell C1 and the formula =NORMSINV(B2) in cell C2. Copy the residuals (including their column heading) to the cell range D1:D15. Select the formulas in cell range B2:C2 and copy them down through row 15. Open to the probability plot and observe that the data do not appear to depart substantially from a normal distribution.
To evaluate the assumption of equal variance, return to the scatter plot of the residuals and the X variable that you already developed. Observe that there do not appear to be major differences in the variability of the residuals.
You conclude that all assumptions are valid and that you can use this simple linear regression model for the Sunflowers Apparel data. You can now evaluate the details of the regression results worksheet. If you are interested in making inferences about the mean of Y and the prediction of individual values of Y, open the CIEandPIforSLR.xls workbook. (Usually, you would have to first make adjustments to the DataCopy worksheet, as discussed in Section E13.5, but this workbook already contains the entries for the Sunflowers Apparel analysis.) Open to the CIEandPI worksheet to make inferences about the mean of Y and the prediction of individual values of Y.
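The Rank, Proportion, and Z Value columns of the Plot worksheet described in this section can be sketched in Python as follows. This is a hedged illustration rather than the workbook's method. The function name normal_scores is hypothetical, and statistics.NormalDist().inv_cdf is used here as a stand-in for Excel's NORMSINV. Pairing the Z values with the residuals sorted in ascending order is the usual convention for a normal probability plot and is an assumption of this sketch.

```python
from statistics import NormalDist

def normal_scores(n):
    """Z values for ranks 1..n, using proportion = rank / (n + 1)."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf(rank / (n + 1)) for rank in range(1, n + 1)]

# With 14 observations, the proportions are 1/15, 2/15, ..., 14/15,
# matching the =A2/15 formula entered in cell B2.
z = normal_scores(14)
for rank, z_value in enumerate(z, start=1):
    print(rank, round(rank / 15, 4), round(z_value, 4))
```

Plotting the sorted residuals against these Z values gives the normal probability plot. Points that fall close to a straight line suggest that the residuals do not depart substantially from a normal distribution.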
