Linear Regression (Simple & Multiple)
Types of regression models:
• Simple: Linear or Non-Linear
• Multiple: Linear or Non-Linear
Statistical Concepts
Lines
[Figure: a line through points A and B, plotted as y against x, with intercept β0 and slope β1.]
Any two points (A and B), or an intercept and slope (β0 and β1), define a line on a two-dimensional surface.
Planes
[Figure: a plane through points A, B, and C, plotted as y against x1 and x2.]
Any three points (A, B, and C), or an intercept and coefficients of x1 and x2 (β0, β1, and β2), define a plane in three-dimensional space.
Assumptions of LR
1. Linearity – the relationship between each x and y is linear. To
check this, look at the plot(s) of the residuals versus the x
value(s). You do not want to see a clustering of positive
residuals or a clustering of negative residuals.
2. Normally Distributed Error Terms – the error terms follow
the normal distribution.
3. Independence of Error Terms – successive residuals are not
correlated. If they are correlated, this is known as autocorrelation.
4. Homoscedasticity – the variance of the error terms is constant
for each value of x. To check this, look at the plot(s) of the
residuals versus the x value(s). You do not want to see a
fanning effect in the residuals.
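The residual checks described above can also be approximated numerically. The following is a minimal Python sketch, not a substitute for inspecting the plots. It fits a straight line to the miles and dollars figures from the BN-Global example in this deck, then reports the correlation between successive residuals (a rough autocorrelation check) and the ratio of residual size in the upper half of x to the lower half (a rough fanning check). The helper names `fit_line`, `lag1_correlation`, `rms`, and `spread_ratio` are illustrative assumptions, and the observations are taken in the order listed, which is sorted by miles rather than by time.

```python
# Miles travelled and dollars spent, from the BN-Global example in this deck.
miles = [1211, 1345, 1422, 1687, 1849, 2026, 2133, 2253, 2400, 2468, 2699,
         2806, 3082, 3209, 3466, 3643, 3852, 4033, 4267, 4498, 4533, 4804,
         5090, 5233, 5439]
dollars = [1802, 2405, 2005, 2511, 2332, 2305, 3016, 3385, 3090, 3694, 3371,
           3998, 3555, 4692, 4244, 5298, 4801, 5147, 5738, 6420, 6059, 6426,
           6321, 7026, 6964]


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def lag1_correlation(e):
    """Correlation between each residual and the one listed just before it."""
    prev, curr = e[:-1], e[1:]
    m_prev, m_curr = sum(prev) / len(prev), sum(curr) / len(curr)
    cov = sum((p - m_prev) * (c - m_curr) for p, c in zip(prev, curr))
    var_prev = sum((p - m_prev) ** 2 for p in prev)
    var_curr = sum((c - m_curr) ** 2 for c in curr)
    return cov / (var_prev * var_curr) ** 0.5


def rms(values):
    """Root-mean-square size of a list of residuals."""
    return (sum(v * v for v in values) / len(values)) ** 0.5


def spread_ratio(x, e):
    """Residual size for the largest x values divided by that for the smallest."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    half = len(x) // 2
    lower = [e[i] for i in order[:half]]
    upper = [e[i] for i in order[-half:]]
    return rms(upper) / rms(lower)


b0, b1 = fit_line(miles, dollars)
resid = [y - (b0 + b1 * x) for x, y in zip(miles, dollars)]

print(f"lag-1 correlation of residuals: {lag1_correlation(resid):.3f}")
print(f"upper/lower residual spread ratio: {spread_ratio(miles, resid):.3f}")
```

A lag-1 correlation far from 0 hints at autocorrelation, and a spread ratio far from 1 hints at fanning. Neither number comes with a formal cutoff here, so both are only prompts to look more closely at the residual plots.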
Linear Regression and Correlation
(Simple)
• Explanatory and Response Variables are Numeric
• Relationship between the mean of the response variable and
the level of the explanatory variable assumed to be
approximately linear (straight line)
$$Y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma)$$
Example: Sales = β0 + β1 (Advertising)
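Least squares: choose b0 and b1 to minimize the sum of squared errors Z.
$$Z = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$$
$$\frac{\partial Z}{\partial b_0} = \sum_{i=1}^{n} 2(-1)(y_i - b_0 - b_1 x_i) = 0 \quad\Rightarrow\quad b_0 = \bar{y} - b_1 \bar{x}$$
$$\frac{\partial Z}{\partial b_1} = \sum_{i=1}^{n} 2(-x_i)(y_i - b_0 - b_1 x_i) = 0 \quad\Rightarrow\quad b_1 = \frac{\sum xy - (\sum x \sum y)/n}{\sum x^2 - (\sum x)^2/n}$$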
Example
BN-Global bank believes that there is a relationship between the
number of miles travelled (air-travel) by its credit card holders
and the amount spent. They collected the relevant data from a
random sample of card holders. They decided to use regression
analysis to test their belief.

Miles   Dollars
1211    1802
1345    2405
1422    2005
1687    2511
1849    2332
2026    2305
2133    3016
2253    3385
2400    3090
2468    3694
2699    3371
2806    3998
3082    3555
3209    4692
3466    4244
3643    5298
3852    4801
4033    5147
4267    5738
4498    6420
4533    6059
4804    6426
5090    6321
5233    7026
5439    6964
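To make the least-squares formulas for b1 and b0 concrete, here is a small Python sketch that applies them to the 25 miles and dollars pairs of this example. It uses only the standard library. The variable names are illustrative, and the code is one possible way to do the arithmetic rather than the procedure the bank used.

```python
# Miles travelled and dollars spent for the 25 card holders in the example.
miles = [1211, 1345, 1422, 1687, 1849, 2026, 2133, 2253, 2400, 2468, 2699,
         2806, 3082, 3209, 3466, 3643, 3852, 4033, 4267, 4498, 4533, 4804,
         5090, 5233, 5439]
dollars = [1802, 2405, 2005, 2511, 2332, 2305, 3016, 3385, 3090, 3694, 3371,
           3998, 3555, 4692, 4244, 5298, 4801, 5147, 5738, 6420, 6059, 6426,
           6321, 7026, 6964]

n = len(miles)
sum_x, sum_y = sum(miles), sum(dollars)
sum_xy = sum(x * y for x, y in zip(miles, dollars))
sum_x2 = sum(x * x for x in miles)

# Slope and intercept from the normal equations.
b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
b0 = sum_y / n - b1 * sum_x / n

# Sums of squares: total, residual (error), and regression.
y_bar = sum_y / n
fitted = [b0 + b1 * x for x in miles]
sst = sum((y - y_bar) ** 2 for y in dollars)
sse = sum((y - f) ** 2 for y, f in zip(dollars, fitted))
ssr = sst - sse

print(f"Dollars = {b0:.2f} + {b1:.4f} * Miles")
print(f"SSR = {ssr:.1f}  SSE = {sse:.1f}  SST = {sst:.1f}")
print(f"R squared = {ssr / sst:.4f}")
```

With these 25 pairs the fitted line comes out at roughly Dollars = 274.85 + 1.2553 × Miles, so each additional mile travelled is associated with about 1.26 more dollars spent.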
Analysis of Variance and an F Test
of the Regression Model
df SS MS F Significance F
Regression 1 64527736.8 64527737 637.47 2.85084E-18
Residual 23 2328161.201 101224.4
Total 24 66855898
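The MS and F columns of the table follow directly from the SS and df columns. The short Python sketch below redoes that arithmetic from the printed values. The variable names are illustrative, and the significance level is not recomputed because it needs the F distribution.

```python
# Values copied from the ANOVA table above.
ss_regression, df_regression = 64527736.8, 1
ss_residual, df_residual = 2328161.201, 23

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F is the ratio of the two mean squares.
f_stat = ms_regression / ms_residual

# R squared is the share of total variation explained by the regression.
ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total

print(round(ms_residual, 1))   # 101224.4
print(round(f_stat, 2))        # 637.47
print(round(r_squared, 4))     # 0.9652
```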
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$$
Y is the dependent (response) variable; X1, ..., Xk are the independent (explanatory) variables.
Multiple Linear Regression Equations
Too complicated by hand! Ouch!
Parameter Estimation Example
• You work in advertising for the Gleaner Comp. You want to find the
effect of ad size (sq. in.) & newspaper circulation (000) on the
number of ad responses (00).
You collected the following data:
Resp   Size   Circ
1      1      2
4      8      8
1      3      1
3      5      7
2      6      4
4      10     6
Parameter Estimation Computer Output
Parameter Estimates
Variable   DF   Parameter Estimate   Standard Error   T for H0: Param=0   Prob>|T|
INTERCEP   1    0.0640               0.2599           0.246               0.8214
ADSIZE     1    0.2049               0.0588           3.484               0.0399
CIRC       1    0.2805               0.0686           4.089               0.0264
The Parameter Estimate column gives β^0 (INTERCEP), β^1 (ADSIZE), and β^2 (CIRC).
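The parameter estimates and standard errors in this output can be reproduced from the six observations of the Gleaner example. The sketch below is a plain-Python illustration that solves the normal equations (X'X)b = X'y and takes standard errors from MSE times the diagonal of the inverse of X'X. The helper `invert` and all variable names are assumptions made for this illustration, and the statistical package that produced the output may work differently internally.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * c for a, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# (Resp, Size, Circ) rows from the example data.
rows = [(1, 1, 2), (4, 8, 8), (1, 3, 1), (3, 5, 7), (2, 6, 4), (4, 10, 6)]
y = [r[0] for r in rows]
X = [[1.0, r[1], r[2]] for r in rows]   # leading 1.0 is the intercept column
n, k = len(rows), 2                     # observations, independent variables
p = k + 1

# Normal equations: (X'X) b = X'y.
xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
xtx_inv = invert(xtx)
b = [sum(xtx_inv[i][j] * xty[j] for j in range(p)) for i in range(p)]

# Standard errors from the mean squared error and the diagonal of (X'X)^-1.
fitted = [sum(bj * xj for bj, xj in zip(b, row)) for row in X]
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
mse = sse / (n - (k + 1))
se = [(mse * xtx_inv[i][i]) ** 0.5 for i in range(p)]
t = [bi / si for bi, si in zip(b, se)]

for name, bi, si, ti in zip(["INTERCEP", "ADSIZE", "CIRC"], b, se, t):
    print(f"{name:<9} {bi:8.4f} {si:8.4f} {ti:8.3f}")
```

Each t value is simply the estimate divided by its standard error, which gives a quick consistency check on any row of the printed output.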
Interpretation of Estimated Coefficients
• Slope (β^k)
– Estimated Y Changes by β^k for Each 1 Unit Increase
in Xk, Holding All Other Variables Constant
• Slope (β^1)
– # Responses to Ad Is Expected to Increase by 0.2049
hundred (20.49 responses) for Each 1 Sq. In. Increase
in Ad Size, Holding Circulation Constant
• Slope (β^2)
– # Responses to Ad Is Expected to Increase by 0.2805
hundred (28.05 responses) for Each 1 Unit (1,000)
Increase in Circulation, Holding Ad Size Constant
Evaluating Multiple Regression
Model Steps
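Coefficient of determination:
$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
F statistic:
$$F = \frac{R^2\,(n - (k + 1))}{(1 - R^2)\,k}$$
Adjusted R squared:
$$\bar{R}^2 = 1 - \frac{SSE / (n - (k + 1))}{SST / (n - 1)} = 1 - \frac{MSE}{MST}$$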
Example
• A hotel industry analyst wants to estimate the gross
earnings generated by a chain of hotels over a 12
month period. The estimate will be based on different
variables in the hotel industry. The independent
variables considered are X1 = production costs (in
millions of US dollars) associated with the hotel
(building maintenance, food, salary…) and X2 = total
cost (in millions of US dollars) of all promotional
activities (advertisements, sponsorship of events…). A
third variable that the analyst wants to consider is the
qualitative variable of whether a hotel is all-inclusive
(X3 = 1 if all-inclusive and X3 = 0 otherwise). An
edited portion of the regression output is given
in Exhibit 1.
Exhibit 1: Edited Output
Multiple R 0.949
R Square 0.900
Adjusted R Square 0.881
Standard Error 6.396
Observations 20
df SS MS F
Regression 3 5888.32 1962.77 47.9728
Residual 16 654.63 40.91
Total 19 6542.95
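The summary figures in Exhibit 1 can be checked against the formulas for R squared, adjusted R squared, and F, using only the sums of squares and the sample size. The following Python sketch does that arithmetic. The variable names are illustrative, and k = 3 reflects the three independent variables X1, X2, and X3 described in the example.

```python
# Values copied from Exhibit 1.
ssr, sse, sst = 5888.32, 654.63, 6542.95
n, k = 20, 3  # observations, independent variables

# R squared and Multiple R.
r2 = ssr / sst
multiple_r = r2 ** 0.5

# Adjusted R squared penalises the extra independent variables.
adj_r2 = 1 - (sse / (n - (k + 1))) / (sst / (n - 1))

# F statistic written in terms of R squared.
f_stat = (r2 * (n - (k + 1))) / ((1 - r2) * k)

# Standard error of the estimate is the square root of MSE.
std_error = (sse / (n - (k + 1))) ** 0.5

print(f"{multiple_r:.3f}")  # 0.949
print(f"{r2:.3f}")          # 0.900
print(f"{adj_r2:.3f}")      # 0.881
print(f"{std_error:.3f}")   # 6.396
print(f"{f_stat:.2f}")      # 47.97
```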
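[Figure: four residual plots, each with residuals on the vertical axis centered at 0. Three panels plot residuals against x or y; one panel plots residuals against time.]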
Resolving Multi-Collinearity
• Drop a collinear variable from the regression
• Change the sampling plan to include elements outside the
multicollinearity range
• Transform the variables
Multiple Regression Example
The DTN consulting firm is trying to do a thorough analysis of bank charges for 40 companies and
their relation to sales. Average monthly bank charges (in dollars) are related to three background
variables: (1) yearly sales in millions of dollars, (2) the average daily number of disbursements,
and (3) the average daily number of deposits. The relevant data are given in Table 2.
Required
1. Obtain the estimated multiple regression equation for Charges from the data in Table 2.
2. Do the diagnostic checks, and make appropriate comments.
3. Interpret the estimated coefficients. Can any of the three variables be dropped from the model?
4. Find and interpret approximate 98% confidence intervals for the coefficients of Sales, Disburse, and
Deposits.
5. Comment on the adjusted R².
6. Suppose another company had the following values on the predictor variables: Sales = 435,
Disburse = 2,100, and Deposits = 600. What is a 95% prediction interval for its bank charges?
7. Suppose that you are given the following additional information: Companies 16, 20, 22, and 28
bank with the same, well-established bank. Create a fifth variable, Bank, to capture this
information, and analyse this input.
Table 2: Bank Data
Charges Sales Disburse Deposits Charges Sales Disburse Deposits
4550 146 840 723 14290 889 2070 604
2470 342 1270 59 2570 489 10 66
6980 403 210 692 5400 365 300 673
7920 311 360 503 4570 115 950 744
3520 362 110 361 5960 469 290 44
8990 597 1010 67 3150 200 120 16
2010 226 110 26 4300 194 320 285
5730 351 140 675 1020 455 110 17
4940 460 570 43 3810 133 760 708
3010 208 80 157 5020 301 110 153
6550 383 150 725 1430 75 170 193
6460 494 310 70 13790 853 990 408
3990 337 1220 35 2420 234 20 181
6750 520 630 42 13690 826 1140 369
12500 811 830 408 6070 192 1690 458
1180 277 100 66 1430 31 90 212
5900 406 40 359 3810 193 950 750
2190 273 80 139 15040 874 2110 590
2500 256 350 344 3390 245 40 142
1330 464 20 103 5780 279 400 368
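As a starting point for items 1, 5, and 6, the sketch below fits Charges on Sales, Disburse, and Deposits by solving the normal equations in plain Python. It reads Table 2 as two side-by-side blocks of 20 companies each, which is an assumption about the layout; the fit itself does not depend on row order. The helper `solve` and the variable names are illustrative. The code gives only the point prediction for the new company, because the interval in item 6 also needs a t quantile and the prediction standard error, which are not computed here.

```python
def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


# (Charges, Sales, Disburse, Deposits): left block of Table 2, then right block.
rows = [
    (4550, 146, 840, 723), (2470, 342, 1270, 59), (6980, 403, 210, 692),
    (7920, 311, 360, 503), (3520, 362, 110, 361), (8990, 597, 1010, 67),
    (2010, 226, 110, 26), (5730, 351, 140, 675), (4940, 460, 570, 43),
    (3010, 208, 80, 157), (6550, 383, 150, 725), (6460, 494, 310, 70),
    (3990, 337, 1220, 35), (6750, 520, 630, 42), (12500, 811, 830, 408),
    (1180, 277, 100, 66), (5900, 406, 40, 359), (2190, 273, 80, 139),
    (2500, 256, 350, 344), (1330, 464, 20, 103),
    (14290, 889, 2070, 604), (2570, 489, 10, 66), (5400, 365, 300, 673),
    (4570, 115, 950, 744), (5960, 469, 290, 44), (3150, 200, 120, 16),
    (4300, 194, 320, 285), (1020, 455, 110, 17), (3810, 133, 760, 708),
    (5020, 301, 110, 153), (1430, 75, 170, 193), (13790, 853, 990, 408),
    (2420, 234, 20, 181), (13690, 826, 1140, 369), (6070, 192, 1690, 458),
    (1430, 31, 90, 212), (3810, 193, 950, 750), (15040, 874, 2110, 590),
    (3390, 245, 40, 142), (5780, 279, 400, 368),
]

y = [r[0] for r in rows]
X = [[1.0, r[1], r[2], r[3]] for r in rows]  # leading 1.0 is the intercept column
n, k = len(rows), 3
p = k + 1

# Normal equations: (X'X) b = X'y.
xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
b = solve(xtx, xty)

# Goodness of fit.
fitted = [sum(bj * xj for bj, xj in zip(b, row)) for row in X]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
y_bar = sum(y) / n
sst = sum((yi - y_bar) ** 2 for yi in y)
sse = sum(e * e for e in residuals)
r2 = 1 - sse / sst
adj_r2 = 1 - (sse / (n - (k + 1))) / (sst / (n - 1))

# Point prediction for the company in item 6.
new_company = [1.0, 435, 2100, 600]
point_prediction = sum(bj * xj for bj, xj in zip(b, new_company))

for name, bj in zip(["Intercept", "Sales", "Disburse", "Deposits"], b):
    print(f"{name:<10} {bj:12.4f}")
print(f"R squared = {r2:.4f}, adjusted R squared = {adj_r2:.4f}")
print(f"Predicted charges = {point_prediction:.2f}")
```

The `residuals` list can be plotted against each predictor and against the fitted values for the diagnostic checks in item 2.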