Regression
APPLIED STATISTICA
REGRESSION MODELS
[Figure: scatter plots of 𝑌 against 𝑋 illustrating linear relationships and curvilinear relationships]
TYPES OF RELATIONSHIPS
[Figure: scatter plots of 𝑌 against 𝑋 illustrating strong relationships, weak relationships, and no relationship]
INTRODUCTION TO REGRESSION ANALYSIS
▪ Regression analysis is used to:
▪ Predict the value of a dependent variable based on the value of at least
one independent variable
▪ Explain the impact of changes in an independent variable on the
dependent variable
▪ Dependent variable: the variable we wish to predict or explain
▪ Independent variable: the variable used to predict or explain the
dependent variable
SIMPLE LINEAR REGRESSION MODEL
▪ Only one independent variable, 𝑋
▪ Relationship between 𝑋 and 𝑌 is described by a linear
function
▪ Changes in 𝑌 are assumed to be related to changes in 𝑋
SIMPLE LINEAR REGRESSION MODEL
▪ For a linear relationship, we can use a model of the form
𝑦 = 𝛽0 + 𝛽1 𝑥 + 𝜀,
▪ where
▪ 𝑦 = the dependent variable
▪ 𝛽0 = the y-intercept
▪ 𝛽1 = the slope coefficient
▪ 𝑥 = the independent variable
▪ 𝜀 = the random error term
▪ 𝛽0 + 𝛽1 𝑥 = the linear component
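As a minimal sketch of what the model says, the Python snippet below generates observations as a linear component plus a random error. The parameter values, the normally distributed error, and the function name `simulate_y` are illustrative assumptions and do not come from the course material.

```python
import random

def simulate_y(x, beta0, beta1, sigma, rng):
    """Return one draw of y = beta0 + beta1*x + epsilon.

    Assumption for illustration: epsilon is normal with mean 0 and
    standard deviation sigma.
    """
    linear_component = beta0 + beta1 * x
    epsilon = rng.gauss(0.0, sigma)
    return linear_component + epsilon

rng = random.Random(0)      # fixed seed so repeated runs give the same draws
beta0, beta1 = 1.0, 2.0     # hypothetical population parameters
for x in [0.0, 1.0, 2.0, 3.0]:
    print(x, simulate_y(x, beta0, beta1, sigma=0.5, rng=rng))
```

Setting `sigma` to 0 removes the random error, so every simulated value falls exactly on the line 𝛽0 + 𝛽1𝑥.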
SIMPLE LINEAR REGRESSION MODEL
𝑌𝑖 = 𝛽0 + 𝛽1𝑋𝑖 + 𝜀𝑖
[Figure: the population regression line, with intercept 𝛽0 and slope 𝛽1. For a given 𝑋𝑖, the random error 𝜀𝑖 is the vertical distance between the observed value of 𝑌 and the predicted value of 𝑌 on the line.]
SIMPLE LINEAR REGRESSION EQUATION
(PREDICTION LINE)
The simple linear regression equation provides an
estimate of the population regression line
Ŷ𝑖 = 𝑏0 + 𝑏1𝑋𝑖
▪ where
▪ Ŷ𝑖 = the estimated (or predicted) 𝑌 value for observation 𝑖
▪ 𝑏0 = the estimate of the regression intercept
▪ 𝑏1 = the estimate of the regression slope
▪ 𝑋𝑖 = the value of 𝑋 for observation 𝑖
THE LEAST SQUARES METHOD
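▪ 𝑏0 and 𝑏1 are obtained by finding the values that minimize the sum of the squared
differences between 𝑌 and Ŷ:
$$ \min \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \min \sum_{i=1}^{n} \bigl(Y_i - (b_0 + b_1 X_i)\bigr)^2 $$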
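The least squares criterion can be sketched in a few lines of Python. The data below are hypothetical toy values made up for illustration (they are not the Sunflowers data), and `sum_squared_errors` is an illustrative name. The sketch checks that moving either coefficient away from the least squares values increases the sum of squared differences.

```python
def sum_squared_errors(b0, b1, xs, ys):
    """Sum of squared differences between each Y and its prediction b0 + b1*X."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data, made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

# For this toy data the least squares line is Y-hat = 2.2 + 0.6 X.
best = sum_squared_errors(2.2, 0.6, xs, ys)

# Nudging either coefficient away from those values raises the sum.
for b0, b1 in [(2.3, 0.6), (2.1, 0.6), (2.2, 0.7), (2.2, 0.5)]:
    assert sum_squared_errors(b0, b1, xs, ys) > best

print(round(best, 4))  # 2.4
```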
FINDING THE LEAST SQUARES EQUATION
▪ The business objective of the director of planning is to forecast annual sales for all
new stores, based on the number of profiled customers who live no more than 30
minutes from a Sunflowers store. To examine the relationship between the number
of profiled customers (in millions) who live within a fixed radius from a Sunflowers
store and its annual sales ($millions), data were collected from a sample of 14
stores. Determine the least squares equation for the given data using Excel.
Store    Profiled Customers (millions)    Annual Sales ($millions)
SIMPLE LINEAR REGRESSION EXAMPLE:
USING EXCEL DATA ANALYSIS FUNCTION
Scatter Plot of Profiled Customers and Annual Sales.
[Figure: scatter plot with annual sales on the vertical axis (0.00 to 14.00) and profiled customers on the horizontal axis (0.00 to 7.00)]
SIMPLE LINEAR REGRESSION EXAMPLE:
USING EXCEL DATA ANALYSIS FUNCTION
1. Choose Data
2. Choose Data Analysis
3. Choose Regression
SIMPLE LINEAR REGRESSION EXAMPLE:
USING EXCEL DATA ANALYSIS FUNCTION
▪ Enter Y range and X range and desired options
SIMPLE LINEAR REGRESSION EXAMPLE:
USING EXCEL DATA ANALYSIS FUNCTION
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.920798
R Square            0.847869
Adjusted R Square   0.835191
Standard Error      0.999298
Observations        14

ANOVA
             df   SS         MS         F          Significance F
Regression   1    66.7854    66.7854    66.87922   3E-06
Residual     12   11.98317   0.998597
Total        13   78.76857

              Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept     -1.20884      0.994874        -1.21507  0.247707  -3.37648   0.958806   -3.37648     0.958806
X Variable 1  2.074173      0.253629        8.177972  3E-06     1.521562   2.626784   1.521562     2.626784

▪ Observe that 𝑏0 = −1.2088 and 𝑏1 = 2.0742.
▪ Therefore, the prediction line for these data is
Ŷ𝑖 = −1.2088 + 2.0742𝑋𝑖
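As a rough consistency check, the summary statistics in the Excel output can be recomputed from the ANOVA and coefficient tables. The Python sketch below uses only numbers copied from the output. The formulas are the standard textbook relationships for simple linear regression, which the slides do not state explicitly, and the variable names are illustrative.

```python
import math

# Values copied from the Excel output.
n = 14
ss_regression = 66.7854
ss_residual = 11.98317
ss_total = 78.76857
b1 = 2.074173
se_b1 = 0.253629

df_residual = n - 2                       # one independent variable
ms_residual = ss_residual / df_residual

r_square = ss_regression / ss_total
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / df_residual
standard_error = math.sqrt(ms_residual)
f_stat = ss_regression / ms_residual      # regression df is 1, so MSR equals SSR
t_stat = b1 / se_b1                       # t statistic for the slope

print(round(r_square, 3))           # 0.848
print(round(adjusted_r_square, 3))  # 0.835
print(round(standard_error, 3))     # 0.999
print(round(f_stat, 3))             # 66.879
print(round(t_stat, 3))             # 8.178
```

The recomputed values agree with the reported ones up to rounding. With a single independent variable, the square of the slope's t statistic matches the F statistic.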
SIMPLE LINEAR REGRESSION EXAMPLE:
INTERPRETATION OF 𝑏0
Ŷ𝑖 = −1.2088 + 2.0742𝑋𝑖
▪ The 𝑌 intercept, 𝑏0, is −1.2088. The 𝑌 intercept represents the
predicted value of 𝑌 when 𝑋 equals 0. Because the number of
profiled customers of a store cannot be 0, this 𝑌 intercept has
little or no practical interpretation. Also, 𝑋 = 0 is outside the
range of the observed values of the 𝑋 variable, and therefore
interpretations of the value of 𝑏0 should be made cautiously.
SIMPLE LINEAR REGRESSION EXAMPLE:
INTERPRETING 𝑏1
Ŷ𝑖 = −1.2088 + 2.0742𝑋𝑖
▪ The slope, 𝑏1 , is +2.0742 . This means that for each increase of 1
unit in 𝑋, the predicted mean value of 𝑌 is estimated to increase
by 2.0742 units. In other words, for each increase of 1.0 million
profiled customers within 30 minutes of the store, the predicted
mean annual sales are estimated to increase by $2.0742 million.
Thus, the slope represents the portion of the annual sales that
are estimated to vary according to the number of profiled
customers.
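The slope interpretation can be illustrated with a short Python sketch. The coefficients are taken from the prediction line above. The function name `predict_annual_sales` and the two 𝑋 values are illustrative choices.

```python
def predict_annual_sales(profiled_customers_millions):
    """Predicted annual sales ($millions) from Y-hat = -1.2088 + 2.0742 X."""
    b0 = -1.2088
    b1 = 2.0742
    return b0 + b1 * profiled_customers_millions

# Predictions at two X values that are one unit (1.0 million customers) apart.
sales_at_2 = predict_annual_sales(2.0)
sales_at_3 = predict_annual_sales(3.0)

print(round(sales_at_2, 4))               # 2.9396
print(round(sales_at_3, 4))               # 5.0138
print(round(sales_at_3 - sales_at_2, 4))  # 2.0742
```

Whichever starting value of 𝑋 is used, a one-unit increase changes the prediction by exactly the slope.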
ACTIVITY
▪ Use the prediction line found in the previous example to predict the annual sales
for a store with 4 million profiled customers.
▪ A statistics professor wants to use the number of hours a student studies for a statistics final
exam (𝑋) to predict the final exam score (𝑌). A regression model is fit based on data
collected from a class during the previous semester, with the following results:
Ŷ𝑖 = 35.0 + 3𝑋𝑖
▪ What is the interpretation of the 𝑌 intercept, 𝑏0 , and the slope, 𝑏1 ?
COMPUTING THE 𝑌 INTERCEPT,
𝑏0 , AND THE SLOPE, 𝑏1
▪ For small data sets, you can use a hand calculator to compute the
least-squares regression coefficients.
▪ Computational formula for the slope, 𝑏1
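$$ b_1 = \frac{SSXY}{SSX} $$
where
$$ SSXY = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - \frac{\left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n} $$
$$ SSX = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n} $$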
COMPUTING THE 𝑌 INTERCEPT,
𝑏0 , AND THE SLOPE, 𝑏1
▪ Computational formula for the y-intercept, 𝑏0
$$ b_0 = \bar{Y} - b_1 \bar{X} $$
where
$$ \bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}, \qquad \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} $$
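The computational formulas for 𝑏1 and 𝑏0 translate directly into code. The Python sketch below assumes hypothetical toy data (not the Sunflowers sample), and the function name `least_squares_coefficients` is illustrative. It builds both coefficients from the sample size and four sums.

```python
def least_squares_coefficients(xs, ys):
    """Return (b0, b1) using the computational formulas for SSXY and SSX."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    ssxy = sum_xy - (sum_x * sum_y) / n   # SSXY
    ssx = sum_x2 - (sum_x ** 2) / n       # SSX

    b1 = ssxy / ssx                       # slope
    b0 = sum_y / n - b1 * (sum_x / n)     # intercept: Y-bar minus b1 times X-bar
    return b0, b1

# Hypothetical toy data, made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = least_squares_coefficients(xs, ys)
print(round(b0, 4), round(b1, 4))  # 2.2 0.6
```

For the toy data, the sums are ΣX = 15, ΣY = 20, ΣX² = 55 and ΣXY = 66, giving SSXY = 6 and SSX = 10.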
SIMPLE LINEAR REGRESSION EXAMPLE:
HAND CALCULATION
▪ The business objective of the director of planning is to forecast annual sales for all
new stores, based on the number of profiled customers who live no more than 30
minutes from a Sunflowers store. To examine the relationship between the number
of profiled customers (in millions) who live within a fixed radius from a Sunflowers
store and its annual sales ($millions), data were collected from a sample of 14
stores. Determine the least-squares regression coefficients of the data given below.
SIMPLE LINEAR REGRESSION EXAMPLE: HAND
CALCULATION
▪ Five quantities need to be computed to determine 𝑏1 and
𝑏0. These are 𝑛, the sample size; $\sum_{i=1}^{n} X_i$, the sum of the 𝑋
values; $\sum_{i=1}^{n} Y_i$, the sum of the 𝑌 values; $\sum_{i=1}^{n} X_i^2$, the sum of
the squared 𝑋 values; and $\sum_{i=1}^{n} X_i Y_i$, the sum of the products
of 𝑋 and 𝑌. The computations for these terms are shown in
the table below:
SIMPLE LINEAR REGRESSION EXAMPLE: HAND
CALCULATION
Store    Profiled Customers (𝑋)    Annual Sales (𝑌)    𝑋²    𝑋𝑌
INFERENCES ABOUT THE SLOPE
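▪ The standard error of the regression slope coefficient (𝑏1) is estimated by
$$ S_{b_1} = \frac{S_{YX}}{\sqrt{SSX}} = \frac{S_{YX}}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}} $$
▪ where 𝑆𝑌𝑋 is the standard error of the estimate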
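A minimal Python sketch of this computation follows. It assumes hypothetical toy data and uses the standard definition of the standard error of the estimate, S_YX = sqrt(SSE / (n − 2)), which is not written out on the slide. The function name `slope_standard_error` is illustrative.

```python
import math

def slope_standard_error(xs, ys):
    """Return S_b1 = S_YX / sqrt(SSX) for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    ssx = sum((x - x_bar) ** 2 for x in xs)
    ssxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ssxy / ssx
    b0 = y_bar - b1 * x_bar

    # Standard error of the estimate (assumed definition: sqrt(SSE / (n - 2))).
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_yx = math.sqrt(sse / (n - 2))

    return s_yx / math.sqrt(ssx)

# Hypothetical toy data, made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

print(round(slope_standard_error(xs, ys), 4))  # 0.2828
```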
INFERENCES ABOUT THE SLOPE:
T TEST
▪ t test for a population slope
▪ Is there a linear relationship between X and Y?
INFERENCES ABOUT THE SLOPE:
T TEST EXAMPLE
From Excel output:
              Coefficients  Standard Error  t Stat   P-value
Intercept     98.24833      58.03348        1.69296  0.12892
Square Feet   0.10977       0.03297         3.32938  0.01039

Here 𝑏1 = 0.10977 and 𝑆𝑏1 = 0.03297, so
$$ t_{STAT} = \frac{b_1 - \beta_1}{S_{b_1}} = \frac{0.10977 - 0}{0.03297} = 3.32938 $$
INFERENCES ABOUT THE SLOPE:
T TEST EXAMPLE
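H0: β1 = 0
H1: β1 ≠ 0
Test Statistic: tSTAT = 3.329
d.f. = 10 − 2 = 8
Critical values: ±2.3060 (α/2 = 0.025 in each tail)
Decision: Reject H0, because tSTAT = 3.329 > 2.3060
Conclusion: There is sufficient evidence of a linear relationship between X and Y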
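The test can be traced in a few lines of Python. The coefficient, standard error, and sample size come from the example. The critical value 2.3060 is an assumption taken from a standard t table (two tails, α = 0.05, 8 degrees of freedom) and is hardcoded rather than computed.

```python
# Values from the Excel output for the Square Feet coefficient.
b1 = 0.10977
se_b1 = 0.03297
beta1_null = 0.0   # hypothesized slope under H0
n = 10             # sample size implied by d.f. = 10 - 2

t_stat = (b1 - beta1_null) / se_b1
df = n - 2

# Assumption: two-tail critical value for alpha = 0.05 with 8 d.f., from a t table.
critical_value = 2.3060

reject_h0 = abs(t_stat) > critical_value
print(round(t_stat, 3), df, reject_h0)  # 3.329 8 True
```

The reported P-value of 0.01039 is below 0.05, which leads to the same decision.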