
LINEAR REGRESSION

(Simple & Multiple)


Types of Regression Models

Regression models can be classified along two dimensions:
• Number of explanatory variables: one explanatory variable gives a simple regression model; two or more give a multiple regression model.
• Form of the relationship: each can be either linear or non-linear.
Statistical Concepts

[Figure: two panels, "Lines" (y against x, showing points A and B, slope β1, and intercept β0) and "Planes" (y against x1 and x2, showing points A, B, and C).]

Any two points (A and B), or an intercept and slope (β0 and β1), define a line on a two-dimensional surface. Any three points (A, B, and C), or an intercept and the coefficients of x1 and x2 (β0, β1, and β2), define a plane in three-dimensional space.
Assumptions of LR
1. Linearity – the relationship between each x and y is linear. To check this, look at the plot(s) of the residuals versus the X value(s). You don't want to see a clustering of positive residuals or a clustering of negative residuals.
2. Normally Distributed Error Terms – the error terms follow the normal distribution.
3. Independence of Error Terms – successive residuals are not correlated. If they are correlated, this is known as autocorrelation.
4. Homoscedasticity – the variance of the error terms is constant for each value of x. To check this, look at the plot(s) of the residuals versus the X value(s). You do not want to see a fanning effect in the residuals. (A residual-plot sketch follows this list.)
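As an illustration (not part of the original slides), here is a minimal Python sketch of the residual checks described above; the simulated data and coefficients are assumptions made for the example.

```python
# Minimal sketch (assumed example data) of the residual plots described above.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 100)  # simulated data satisfying the assumptions

# Fit a least-squares line and compute residuals
b1, b0 = np.polyfit(x, y, 1)               # polyfit returns [slope, intercept]
residuals = y - (b0 + b1 * x)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(x, residuals)              # look for clustering or fanning
axes[0].axhline(0, color="grey")
axes[0].set(xlabel="x", ylabel="residual", title="Residuals vs. x")
axes[1].hist(residuals, bins=15)           # rough check of normality
axes[1].set(xlabel="residual", title="Histogram of residuals")
plt.show()
```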
Linear Regression and Correlation (Simple)

• Explanatory and response variables are numeric
• The relationship between the mean of the response variable and the level of the explanatory variable is assumed to be approximately linear (straight line):

Y = β0 + β1x + ε,   ε ~ N(0, σ)

Example: Sales = β0 + β1(Advertising)

• β1 > 0 ⇒ Positive Association
• β1 < 0 ⇒ Negative Association
• β1 = 0 ⇒ No Association
Estimated Linear Regression Equation

• If we knew the values of β0 and β1, then for any relevant value of x we could plug it into the equation and find the mean value of y (for example, mean sales):

E(y) = β0 + β1x

• But we do not know the values of β0 and β1; we have to estimate them, using historical data.
• We estimate β0 by b0 and β1 by b1:

ŷ = b0 + b1x
Least Squares Estimation of β0, β1

• β0, β1 are unknown parameters
• b0 + b1x ≡ mean response when the explanatory variable takes on the value x
• b0 ≡ mean response when x = 0 (y-intercept)
• b1 ≡ change in mean response when x increases by 1 unit (slope)
• Goal: choose estimates that minimize the sum of squared errors (SSE) of the observed values about the straight line ŷ = b0 + b1x:

SSE = ∑(i=1 to n) (yi − ŷi)² = ∑(i=1 to n) (yi − (b0 + b1xi))²
Derivatives

Z = ∑(i=1 to n) (yi − b0 − b1xi)²

∂Z/∂b0 = ∑(i=1 to n) 2(−1)(yi − b0 − b1xi) = 0  ⇒  b0 = ȳ − b1x̄

∂Z/∂b1 = ∑(i=1 to n) 2(−xi)(yi − b0 − b1xi) = 0  ⇒  b1 = (∑xy − (∑x∑y)/n) / (∑x² − (∑x)²/n)
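A minimal Python sketch (added for illustration) applying these closed-form formulas; the small data set is an assumption made for the example.

```python
# Closed-form least-squares estimates from the formulas above.
# The example data are assumed for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)

# b1 = (Σxy − (Σx·Σy)/n) / (Σx² − (Σx)²/n)
b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
# b0 = ȳ − b1·x̄
b0 = sum_y / n - b1 * sum_x / n
print(b0, b1)  # intercept and slope of the fitted line ŷ = b0 + b1·x
```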
Miles–Dollars Example

BN-Global bank believes that there is a relationship between the number of miles travelled (air travel) by its credit card holders and the amount spent. They collected the relevant data from a random sample of card holders and decided to use regression analysis to test their belief.

Miles   Dollars
1211    1802
1345    2405
1422    2005
1687    2511
1849    2332
2026    2305
2133    3016
2253    3385
2400    3090
2468    3694
2699    3371
2806    3998
3082    3555
3209    4692
3466    4244
3643    5298
3852    4801
4033    5147
4267    5738
4498    6420
4533    6059
4804    6426
5090    6321
5233    7026
5439    6964
Analysis of Variance and an F Test of the Regression Model

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F Ratio
Regression            SSR              1                    MSR           MSR/MSE
Error                 SSE              n − 2                MSE
Total                 SST              n − 1                MST
Output (Excel)

Regression Statistics
Multiple R          0.982
R Square            0.965
Adjusted R Square   0.964
Standard Error      318.158
Observations        25

            df   SS            MS         F        Significance F
Regression  1    64527736.8    64527737   637.47   2.85084E-18
Residual    23   2328161.201   101224.4
Total       24   66855898

            Coefficients   SE       t Stat   P-value       Lower 95%   Upper 95%   Lower 98.0%   Upper 98.0%
Intercept   274.85         170.34   1.61     0.120         -77.52      627.22      -150.97       700.67
Miles       1.26           0.05     25.25    2.85084E-18   1.15        1.36        1.13          1.38
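The output above can be reproduced, at least approximately, with a short script; the sketch below uses the statsmodels library, which is an assumption, since the slides only show Excel output.

```python
# Hedged sketch: refitting the Miles–Dollars regression with statsmodels
# (library choice is an assumption; the slides use Excel).
import statsmodels.api as sm

miles = [1211, 1345, 1422, 1687, 1849, 2026, 2133, 2253, 2400, 2468,
         2699, 2806, 3082, 3209, 3466, 3643, 3852, 4033, 4267, 4498,
         4533, 4804, 5090, 5233, 5439]
dollars = [1802, 2405, 2005, 2511, 2332, 2305, 3016, 3385, 3090, 3694,
           3371, 3998, 3555, 4692, 4244, 5298, 4801, 5147, 5738, 6420,
           6059, 6426, 6321, 7026, 6964]

X = sm.add_constant(miles)        # adds the intercept column
fit = sm.OLS(dollars, X).fit()
print(fit.summary())              # should show b0 ≈ 274.85, b1 ≈ 1.26, R² ≈ 0.965
```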
Linear Multiple Regression Model

A linear relationship between one dependent variable and two or more independent variables:

Yi = β0 + β1X1i + β2X2i + … + βkXki + εi

where Yi is the dependent (response) variable; X1i, …, Xki are the independent (explanatory) variables; β0 is the population Y-intercept; β1, …, βk are the population slopes; and εi is the random error.
Multiple Linear Regression Equations

Too complicated to solve by hand! Ouch!
Parameter Estimation Example

You work in advertising for the Gleaner Comp. You want to find the effect of ad size (sq. in.) and newspaper circulation (000) on the number of ad responses (00). You collected the following data:

Resp   Size   Circ
1      1      2
4      8      8
1      3      1
3      5      7
2      6      4
4      10     6
Parameter Estimation Computer Output

Parameter Estimates

Variable   DF   Parameter Estimate   Standard Error   T for H0: Param = 0   Prob > |T|
INTERCEP   1    0.0640 (β̂0)          0.2599           0.246                 0.8214
ADSIZE     1    0.2049 (β̂1)          0.0588           3.656                 0.0399
CIRC       1    0.2805 (β̂2)          0.0686           4.089                 0.0264
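For comparison, a hedged sketch reproducing these estimates in Python; statsmodels is an assumed library choice, since the slides show output from another package.

```python
# Hedged sketch: refitting the ad-response example (6 observations).
import pandas as pd
import statsmodels.api as sm

data = pd.DataFrame({
    "Resp": [1, 4, 1, 3, 2, 4],    # ad responses (00)
    "Size": [1, 8, 3, 5, 6, 10],   # ad size (sq. in.)
    "Circ": [2, 8, 1, 7, 4, 6],    # circulation (000)
})
X = sm.add_constant(data[["Size", "Circ"]])
fit = sm.OLS(data["Resp"], X).fit()
print(fit.params)  # expected ≈ 0.0640 (intercept), 0.2049 (Size), 0.2805 (Circ)
```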
Interpretation of Estimated Coefficients

• Slope (β̂k)
  – The estimated Y changes by β̂k for each 1-unit increase in Xk, holding all other variables constant.
• Slope (β̂1)
  – The number of responses to the ad is expected to increase by 0.2049 (i.e., 20.49 responses) for each 1 sq. in. increase in ad size, holding circulation constant.
• Slope (β̂2)
  – The number of responses to the ad is expected to increase by 0.2805 (i.e., 28.05 responses) for each 1-unit (1,000) increase in circulation, holding ad size constant.
Evaluating Multiple Regression Model Steps

1. Examine Variation Measures
2. Do Residual Analysis
3. Test Parameter Significance
   – Overall Model
   – Individual Coefficients
4. Test for Multicollinearity
Testing Overall Significance

1. Shows if there is a linear relationship between all X variables together and Y
2. Uses the F test statistic
3. Hypotheses
   – H0: β1 = β2 = ... = βk = 0 (no linear relationship)
   – Ha: at least one coefficient is not 0 (at least one X variable affects Y)
Measures of Performance in Multiple Regression and the ANOVA Table

Source of Variation   Sum of Squares   Degrees of Freedom      Mean Square               F Ratio
Regression            SSR              k                       MSR = SSR / k             F = MSR / MSE
Error                 SSE              n − (k+1) = n − k − 1   MSE = SSE / (n − (k+1))
Total                 SST              n − 1                   MST = SST / (n − 1)

R² = SSR / SST = 1 − SSE / SST

F = [R² / k] / [(1 − R²) / (n − (k+1))]

Adjusted R² = 1 − [SSE / (n − (k+1))] / [SST / (n − 1)] = 1 − MSE / MST
Example

A hotel industry analyst wants to estimate the gross earnings generated by a chain of hotels over a 12-month period. The estimate will be based on different variables in the hotel industry. The independent variables considered are X1 = production costs (in millions of US dollars) associated with the hotel (building maintenance, food, salary…) and X2 = total cost (in millions of US dollars) of all promotional activities (advertisements, sponsorship of events…). A third variable that the analyst wants to consider is the qualitative variable of whether a hotel is all-inclusive (X3 = 1 if all-inclusive and X3 = 0 otherwise). An edited version of a portion of the output for the regression is given in Exhibit 1.
Exhibit 1: Edited Output

Multiple R          0.949
R Square            0.900
Adjusted R Square   0.881
Standard Error      6.396
Observations        20

            df   SS        MS        F
Regression  3    5888.32   1962.77   47.9728
Residual    16   654.63    40.91
Total       19   6542.95

            Coefficients   Standard Error   t Stat   P-value
Intercept   13.123         3.9161           3.3511   0.004058
X1          1.506          0.5752           2.6174   0.018668
X2          2.974          0.3848           7.7269   0.000001
X3          9.715          3.0590           3.1759   0.005867

            Lower 95%   Upper 95%   Lower 99.0%   Upper 99.0%
Intercept   4.821       21.425      1.685         24.561
X1          0.286       2.725       -0.174        3.186
X2          2.158       3.789       1.850         4.098
X3          3.230       16.200      0.781         18.650
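As a quick consistency check (added for illustration), the summary statistics in Exhibit 1 follow from the ANOVA formulas given earlier:

R² = SSR / SST = 5888.32 / 6542.95 ≈ 0.900
Adjusted R² = 1 − (654.63 / 16) / (6542.95 / 19) = 1 − 40.91 / 344.37 ≈ 0.881
F = MSR / MSE = 1962.77 / 40.91 ≈ 47.97
Standard Error = √MSE = √40.91 ≈ 6.396

all of which match the reported values.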
Questions
• Write down the estimated equation.
• Comment on the overall fit/adequacy of the model.
• Give a full interpretation of each of the coefficients of the independent variables.
• Comment on the independent variables at both the 95% and 99% confidence levels.
• Comment on the adjusted R².
• Given the following for the Sunshine hotel: X1 = US$4.5M, X2 = US$1.2M and X3 = 1, calculate the estimated gross earnings over a 12-month period (a worked calculation follows below).
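For the last question, a worked calculation (added for illustration) using the Exhibit 1 coefficients for the Sunshine hotel:

ŷ = 13.123 + 1.506(4.5) + 2.974(1.2) + 9.715(1)
  = 13.123 + 6.777 + 3.569 + 9.715
  ≈ 33.18

i.e., estimated gross earnings of roughly US$33.18 million over the 12-month period.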
Residual Analysis and Checking for Model Inadequacies

[Figure: four residual plots, with residuals on the vertical axis plotted against x, ŷ, or time.]

• Homoscedasticity: residuals appear completely random; no indication of model inadequacy.
• Heteroscedasticity: the variance of the residuals changes as x changes (a fanning pattern).
• Residuals exhibiting a linear trend with time (a sign of autocorrelation).
• A curved pattern in the residuals, resulting from an underlying nonlinear relationship.
Multi-collinearity

Collinearity, or multi-collinearity, occurs when one explanatory variable is (or nearly is) a linear combination of the other explanatory variables. That is, there is some degree of correlation between two explanatory variables. When there is collinearity, the explanatory variables may provide little additional information over and above the information provided by the other explanatory variables. In many cases, there will be some measure of collinearity; the question is: what degree of collinearity is acceptable? That is, at what point will collinearity have an unacceptable effect on the fitted regression equation and the attendant parameters?
Multi-Collinearity

It is important to investigate multi-collinearity to ensure the validity of the interpretation of the fitted regression model. Also, the presence of multi-collinearity might result in "faulty" parameter values. The following must also be noted:
• The fact that there is high correlation (among independent variables) neither precludes one from getting good fits nor from making predictions of new observations.
• Estimates of error variances, and therefore tests of model adequacy, are still reliable.
• In cases of serious collinearity, standard errors of individual regression coefficients are larger than in cases where, other things being equal, serious collinearity does not exist. With large standard errors, individual regression coefficients may not be meaningful, and the corresponding t-ratios will be small; this leads to difficulty in detecting the importance of a variable.
Causes of Collinearity
• The data-gathering procedure might be faulty, e.g., collecting data with related values on some variables.
• Some variables are naturally related to each other.
• Physical constraints on the data, e.g., in a mixture there might be a situation where, as the concentration of one component (X1) increases, the concentration of the other (X2) has to decrease to ensure the correct mixture is achieved.
• The inclusion of a higher power of an explanatory variable together with the variable itself (X and X²).
Effects of Multi-collinearity
• Variances of regression coefficients are inflated.
• Magnitudes of regression coefficients may differ from what is expected.
• Signs of regression coefficients may not be as expected.
• Adding or removing variables produces large changes in coefficients.
• Removing a data point may cause large changes in coefficient estimates or signs.
• In some cases, the F ratio may be significant while the t ratios are not.

Resolving Multi-Collinearity
• Drop a collinear variable from the regression.
• Change the sampling plan to include elements outside the multicollinearity range.
• Transform the variables.

(A diagnostic sketch for detecting collinearity follows.)
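The slides do not prescribe a specific test for multicollinearity, but one common diagnostic (added here as an illustration) is the variance inflation factor (VIF); the sketch below assumes the explanatory variables are held in a pandas DataFrame.

```python
# Hedged sketch: variance inflation factors as a multicollinearity check.
# A VIF above roughly 10 is often taken as a sign of serious collinearity
# (a common rule of thumb, not stated in these slides).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    """VIF for each explanatory variable in X."""
    Xc = sm.add_constant(X)  # include an intercept, as in the fitted model
    return pd.Series(
        [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
        index=X.columns,
    )
```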
Multiple Regression Example

The DTN consulting firm is trying to do a thorough analysis of the bank charges incurred by 40 companies and how those charges relate to company characteristics. Average monthly bank charges (in dollars) are related to three background variables: (1) yearly sales in millions of dollars, (2) the average daily number of disbursements, and (3) the average daily number of deposits. The relevant data are given in Table 2.

Required
1. Obtain the estimated multiple regression equation for Charges from the data in Table 2 (a fitting sketch follows Table 2).
2. Do the diagnostic checks, and make appropriate comments.
3. Interpret the estimated coefficients. Can any of the three variables be dropped from the model?
4. Find and interpret approximate 98% confidence intervals for the coefficients of Sales, Disburse, and Deposits.
5. Comment on the adjusted R².
6. Suppose another company had the following values on the predictor variables: Sales = 435, Disburse = 2,100, and Deposits = 600. What is a 95% prediction interval for its bank charges?
7. Suppose that you are given the following additional information: Companies 16, 20, 22, and 28 bank with the same, well-established bank. Create a fifth variable, Bank, to capture this information, and analyse this input.
Table 2: Bank Data
Charges Sales Disburse Deposits Charges Sales Disburse Deposits
4550 146 840 723 14290 889 2070 604
2470 342 1270 59 2570 489 10 66
6980 403 210 692 5400 365 300 673
7920 311 360 503 4570 115 950 744
3520 362 110 361 5960 469 290 44
8990 597 1010 67 3150 200 120 16
2010 226 110 26 4300 194 320 285
5730 351 140 675 1020 455 110 17
4940 460 570 43 3810 133 760 708
3010 208 80 157 5020 301 110 153
6550 383 150 725 1430 75 170 193
6460 494 310 70 13790 853 990 408
3990 337 1220 35 2420 234 20 181
6750 520 630 42 13690 826 1140 369
12500 811 830 408 6070 192 1690 458
1180 277 100 66 1430 31 90 212
5900 406 40 359 3810 193 950 750
2190 273 80 139 15040 874 2110 590
2500 256 350 344 3390 245 40 142
1330 464 20 103 5780 279 400 368
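A hedged sketch for Required items 1 and 4; it assumes the Table 2 data have been saved to a CSV file named bank.csv (the file name and layout are assumptions, not part of the exercise).

```python
# Hedged sketch: fitting Charges on Sales, Disburse, and Deposits.
# Assumes bank.csv has columns Charges, Sales, Disburse, Deposits.
import pandas as pd
import statsmodels.api as sm

bank = pd.read_csv("bank.csv")
X = sm.add_constant(bank[["Sales", "Disburse", "Deposits"]])
fit = sm.OLS(bank["Charges"], X).fit()
print(fit.summary())             # estimated equation, t-stats, R², F
print(fit.conf_int(alpha=0.02))  # approximate 98% confidence intervals (item 4)
```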
