Correlation and Regression


Chapter 11

Correlation and Regression
Introduction to Simple Linear Regression Analysis
[Motivation]
Many studies are concerned with the analysis of the relationship between two variables.
Some focus on studying the degree and the type and direction of association. Others go
beyond describing the relationship, and aim at predicting the value of one variable using
the value of the other.

Examples of variables whose relationships may be of interest: temperature, advertising costs, sales,
reliability of components, exam score, gross domestic product (GDP), number of hours of sleep,
carbon dioxide (CO2) emissions, you, and your crush.

Chapter 10 Correlation and regression Introduction to Simple Linear Regression Analysis


[Learning|Objectives]
By the end of this chapter, each student is expected:

To compute a correlation coefficient and to interpret it correctly;
To know the properties and limitations of correlation;
To find the equation of a regression line;
To clear some misconceptions on correlation and regression;
To model the world.

[introduction]
Last chapter, we studied how to analyze the relationship between categorical variables.

Now, we will look more closely at the analysis of the relationship between two continuous
variables. Specifically, we will discuss Correlation analysis and simple linear Regression
analysis.

Correlation analysis aims to gain insight into the strength of the linear relationship between
variables.

Regression analysis focuses on revealing the form of the linear relationship between variables.

[correlation|analysis]
Objective: To measure the strength and direction of a linear association between two
variables; To measure the covariation that is present between the two variables (i.e. how
the two variables change relative to each other)

Given: Bivariate Data = $\{(X_1,Y_1), (X_2,Y_2), \dots, (X_n,Y_n)\}$

Examples of bivariate data:

number of hours of sleep (Xi)    exam score (Yi)
3.4                              82
5.6                              89
2.7                              76

advertising costs, in thousand pesos (Xi)    sales, in million pesos (Yi)
210                                          5.32
532                                          3.46
973                                          9.12
761                                          3.76

temperature, in degree Celsius (Xi)    reliability of component (Yi)
45                                     0.78
32                                     0.54
56                                     0.94

[SCATTER|DIAGRAM]
Objective: To help you visualize the possible underlying linear relationship between two
variables.

Given: Bivariate Data = $\{(X_1,Y_1), (X_2,Y_2), \dots, (X_n,Y_n)\}$

Procedure: We plot the individual pairs of observations on a two-dimensional graph.

Example: The following data were obtained in a study of the relationship between the
number of hours of sleep of a student and score in an examination.
number of hours of sleep (Xi)    exam score (Yi)
2.75                             89.5
2.15                             86.3
4.41                             92.2
5.52                             96.5
3.21                             87.2
4.32                             87.7
2.31                             88.3
4.3                              90.3
3.71                             88.7

[SCATTER|DIAGRAM]
Using Microsoft Excel, with the nine pairs of hours of sleep (Xi) and exam score (Yi):
1. Highlight data.
2. Click Insert, then choose Scatter.

We can see from the scatter diagram that the points form an upward trend. By visual inspection,
we can say that the number of hours of sleep (X) and score in the exam (Y) are possibly linearly
related with each other.

[Linear|correlation|coefficient]
A summary measure that can be used to describe the degree and direction of the linear relationship between two
continuous variables is the linear correlation coefficient.

Definition: The linear correlation coefficient, denoted by $\rho$ (rho), is a measure of the
strength of the linear relationship existing between two variables, X and Y, that is
independent of their respective scales of measurement.

$$\rho = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E(XY) - E(X)E(Y)}{\sqrt{\mathrm{Var}(X)}\,\sqrt{\mathrm{Var}(Y)}}$$

[Linear|correlation|coefficient]
The linear correlation coefficient possesses the following interesting properties:

1. A linear correlation coefficient can only assume values between -1 and 1, inclusive of end points.
o $-1 \le \rho \le 1$.

2. The sign of $\rho$ describes the direction of the linear relationship between X and Y.
o A positive $\rho$ means that the line slopes upward to the right, and so as X increases, the value of Y also increases.
o A negative $\rho$ means that it slopes downward to the right, and so as X increases, the value of Y decreases.

3. If $\rho = 0$, then there is no linear correlation between X and Y.
o A value of $\rho = 0$, however, does not mean a lack of association. It is possible to obtain a zero correlation even if
the two variables are related, though their relationship is nonlinear, such as a quadratic relationship.

4. When $\rho$ is -1 or 1, there is a perfect linear relationship between X and Y.
o All the points (x,y) fall on a straight line.
o A $\rho$ close to -1 or 1 indicates a strong linear relationship.

5. A strong linear relationship does not necessarily imply that X causes Y or Y causes X.
o It is possible that a third variable may have caused the change in both X and Y, producing the observed
relationship.
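
Property 3 can be checked numerically. The sketch below uses made-up data (the X values and the helper name pearson_r are assumptions for illustration, not part of the chapter) in which Y is completely determined by X through a quadratic rule, yet the sample correlation comes out as zero. A second call on an exact upward line illustrates property 4.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical X values, symmetric about 0.
xs = [-3, -2, -1, 0, 1, 2, 3]

# Y = X^2: a perfect but nonlinear (quadratic) relationship.
r_quadratic = pearson_r(xs, [x ** 2 for x in xs])

# Y = 2X + 1: every point lies on one upward-sloping line.
r_linear = pearson_r(xs, [2 * x + 1 for x in xs])

print(r_quadratic, r_linear)
```

The zero in the quadratic case comes from the symmetry of the chosen X values: positive and negative cross-deviations cancel exactly.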

[ON|CORRELATION|&|CAUSALITY]
It is of interest to differentiate correlation from causality, as confusing the two is a
common mistake.

Correlation does not necessarily imply causation.

Criteria for Causality

1. Covariation: correlation
2. Temporal precedence: cause before effect
3. Nonspuriousness: no alternative explanations
4. *Specification of a mechanism

[Linear|correlation|coefficient]
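A point estimator of $\rho$ is the Pearson product moment correlation coefficient, which is
denoted by r.

$$r = \frac{n\sum_{i=1}^{n} X_i Y_i - \left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{\sqrt{\left[n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2\right]\left[n\sum_{i=1}^{n} Y_i^2 - \left(\sum_{i=1}^{n} Y_i\right)^2\right]}}$$

Its value is also between -1 and 1, inclusive.
Just like $\rho$, when r is -1 or 1, all the collected data points fall on a straight line.
Similarly, when r is 0, the points are scattered and give no evidence of a linear relationship.
Any other value of r suggests the degree to which the points tend to be linearly related.
An alternative form for r is

$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$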

[illustrations]

[Figure: four scatter diagrams. Positive linear correlation (r is near 1); no apparent linear
correlation (r is near 0); negative linear correlation (r is near -1); quadratic relation (r is near 0).]

[EXAMPLE]
Compute the Pearson product moment correlation coefficient and interpret.

number of hours of sleep (Xi)    exam score (Yi)
2.75                             89.5
2.15                             86.3
4.41                             92.2
5.52                             96.5
3.21                             87.2
4.32                             87.7
2.31                             88.3
4.3                              90.3
3.71                             88.7

$n = 9$, $\sum_{i=1}^{9} X_i = 32.68$, $\sum_{i=1}^{9} Y_i = 806.7$, $\sum_{i=1}^{9} X_i Y_i = 2951.068$,
$\sum_{i=1}^{9} X_i^2 = 128.6602$, $\sum_{i=1}^{9} Y_i^2 = 72384.83$

$$r = \frac{9(2951.068) - (32.68)(806.7)}{\sqrt{\left[9(128.6602) - 32.68^2\right]\left[9(72384.83) - 806.7^2\right]}} = 0.7845$$

The value of r = 0.7845 supports our earlier claim based on the scatter diagram that X and Y are
positively linearly correlated. Being positively correlated, as the number of hours of sleep
increases, the score in the examination also tends to increase.
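
The computation above can be reproduced with a few lines of Python. This is a minimal sketch that applies the raw-sum form of the Pearson formula to the nine sleep and exam pairs; the variable names are chosen here for illustration only.

```python
import math

# Sleep-exam data from the example: hours of sleep (X) and exam score (Y).
hours = [2.75, 2.15, 4.41, 5.52, 3.21, 4.32, 2.31, 4.30, 3.71]
scores = [89.5, 86.3, 92.2, 96.5, 87.2, 87.7, 88.3, 90.3, 88.7]

n = len(hours)
sum_x = sum(hours)
sum_y = sum(scores)
sum_xy = sum(x * y for x, y in zip(hours, scores))
sum_x2 = sum(x * x for x in hours)
sum_y2 = sum(y * y for y in scores)

# Raw-sum (computational) form of the Pearson correlation coefficient.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(round(sum_x, 2), round(sum_y, 1), round(sum_xy, 3))
print(round(r, 4))
```

Working from the five sums mirrors the hand computation, so each intermediate value can be compared against the figures listed in the example.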

[TEST|OF|HYPOTHESIS|for|rho]
Null hypothesis (Ho)    Alternative hypothesis (Ha)    Critical Region
$\rho = \rho_0$         $\rho < \rho_0$                $t < -t_{\alpha,\,n-2}$
                        $\rho > \rho_0$                $t > t_{\alpha,\,n-2}$
                        $\rho \ne \rho_0$              $|t| > t_{\alpha/2,\,n-2}$

Test Statistic:

$$t = \frac{(r - \rho_0)\sqrt{n-2}}{\sqrt{1 - r^2}}$$

Consider the Sleep-Exam example. Suppose that the linear correlation between X and Y in the
past is 0.75. We want to determine if the correlation has significantly increased compared to the
past. Use a 0.05 level of significance.

Ho: $\rho = 0.75$
Ha: $\rho > 0.75$

$\alpha = 0.05$

Decision Rule: Reject Ho if $t > t_{\alpha,\,n-2} = t_{0.05,\,9-2} = t_{0.05,\,7} = 1.895$

[TEST|OF|HYPOTHESIS|for|rho]
Test Statistic:

$$t = \frac{(r - \rho_0)\sqrt{n-2}}{\sqrt{1 - r^2}} = \frac{(0.7845 - 0.75)\sqrt{9-2}}{\sqrt{1 - 0.7845^2}} = 0.147193$$

Decision: Since 0.147193 < 1.895 , we do not reject Ho.

Conclusion: At 5% level of significance, we do not have sufficient evidence to say that the
correlation has significantly increased compared to the past.
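
The arithmetic of this test can be sketched in Python as follows. The critical value 1.895 is taken from the decision rule above rather than computed, and the variable names are illustrative assumptions.

```python
import math

r = 0.7845          # sample correlation from the Sleep-Exam example
rho_0 = 0.75        # hypothesized correlation under Ho
n = 9               # number of (X, Y) pairs
t_critical = 1.895  # t_{0.05, 7}, as quoted in the decision rule

# Test statistic used in this chapter: t = (r - rho_0) * sqrt(n - 2) / sqrt(1 - r^2)
t_stat = (r - rho_0) * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Right-tailed test: reject Ho only when t exceeds the critical value.
reject_h0 = t_stat > t_critical

print(round(t_stat, 6), reject_h0)
```

For the left-tailed and two-tailed alternatives, only the comparison in the last step changes, following the critical regions of the test.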

[APPLICATION]
Check this out! www.guessthecorrelation.com.

[SIMPLE|LINEAR|REGRESSION|ANALYSIS]
Objective: To evaluate the relative impact of a predictor on a particular outcome.

Given: Bivariate Data = $\{(X_1,Y_1), (X_2,Y_2), \dots, (X_n,Y_n)\}$

In this section, we deal with the case where one continuous variable is linearly regressed
with another continuous variable.

number of hours of sleep (Xi)    exam score (Yi)
3.4                              82
5.6                              89
2.7                              76

advertising costs, in thousand pesos (Xi)    sales, in million pesos (Yi)
210                                          5.32
532                                          3.46
973                                          9.12
761                                          3.76

temperature, in degree Celsius (Xi)    reliability of component (Yi)
45                                     0.78
32                                     0.54
56                                     0.94

[SIMPLE|LINEAR|REGRESSION|ANALYSIS]
The simple linear regression model is given by the equation:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

where $Y_i$ is the value of the response variable for the ith element;
$X_i$ is the value of the explanatory variable for the ith element;
$\beta_0$ is a regression coefficient that gives the y-intercept of the regression line;
$\beta_1$ is a regression coefficient that gives the slope of the line;
$\varepsilon_i$ is the random error term for the ith element,
where the $\varepsilon_i$'s are independent, normally distributed with mean 0 and
variance $\sigma^2$ (constant) for $i = 1, 2, \dots, n$;
n is the number of elements.
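
To see what the model says about how data arise, one can simulate from it. The sketch below is only an illustration: the function name simulate_responses and the parameter values (intercept 80, slope 2, standard deviation 1.5) are hypothetical and are not estimates from this chapter.

```python
import random

def simulate_responses(xs, beta0, beta1, sigma, seed=0):
    """Draw Y_i = beta0 + beta1 * X_i + e_i with independent e_i ~ Normal(0, sigma^2)."""
    rng = random.Random(seed)  # seeded so the draws are reproducible
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]

# Hypothetical settings, for illustration only.
xs = [2.0, 3.0, 4.0, 5.0]
ys = simulate_responses(xs, beta0=80.0, beta1=2.0, sigma=1.5)

# Each simulated error is the vertical gap between Y_i and the true line.
errors = [y - (80.0 + 2.0 * x) for x, y in zip(xs, ys)]
print(ys)
print(errors)
```

The same sigma is used for every X, which is the constant-variance assumption stated in the model.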

[SIMPLE|LINEAR|REGRESSION|ANALYSIS]
$$E(Y) = \beta_0 + \beta_1 X_i$$

This function is known as the regression equation, and it makes it easy to
interpret the parameters $\beta_0$ and $\beta_1$.

$\beta_0$ is the value of the mean of Y when X = 0, hence the name intercept.

$\beta_1$ gives the amount of change in the mean of Y (whether positive or negative, depending
on the sign) for every unit increase in the value of X, hence the name slope.

Compare $E(Y) = \beta_0 + \beta_1 X_i$ with the slope-intercept form of a line, $y = b + mx$.

[SIMPLE|LINEAR|REGRESSION|ANALYSIS]
The random error term, $\varepsilon_i$

A random error term may be thought of as a representation of the effect of other factors,
apart from X, that are not explicitly stated in the model but do affect the response variable
to some extent.

Now, even if a response variable can be predicted adequately by using only one
explanatory variable, there remains an inherent and inevitable variation present in the
response variable.

Lastly, the random error term accounts for the measurement errors in recording the value
of the response variable.

In short, we dump into the random error term the effects of all other factors apart from X
that explain the variation that we observe in the realized values of Y.

[SIMPLE|LINEAR|REGRESSION|ANALYSIS]
The random error $\varepsilon_i$ is the vertical gap between the ith observation and the
regression line (the blue line in the figure). $\varepsilon_i$ is a random variable, and we will
never know its realized value because $\beta_0$ and $\beta_1$ are unknown.

We require that the $\varepsilon_i$'s are independent random variables. For any fixed value of X,
these random variables are normally distributed. The mean of any $\varepsilon_i$ is 0 and its
variance is $\sigma^2$. That is, we do not allow the variation in the values of the
$\varepsilon_i$'s to differ for the different values of X.

[SIMPLE|LINEAR|REGRESSION|ANALYSIS]
Steps in doing Simple Linear Regression Analysis

1. Obtain the equation that best fits the data.
2. Evaluate the equation to determine the strength of the relationship for prediction and estimation.
3. Determine if the assumptions on the error terms are satisfied.
4. If the model fits the data adequately, use the equation for prediction and for describing the nature
of the relationship between the variables.

The process of obtaining the equation that best fits the data requires estimating the unknown
regression coefficients, $\beta_0$ and $\beta_1$.

There are several ways of deriving estimates for these regression coefficients but we will use the
method of least squares.

[METHOD|OF|LEAST|SQUARES]
In the method of least squares, the best-fitting line is selected as the one that minimizes the sum of
squares of the deviations of the observed value of Y from its expected value. Thus, the least
squares criterion considers the deviation:

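$$\varepsilon_i = Y_i - E(Y_i) = Y_i - (\beta_0 + \beta_1 X_i)$$

and requires that our estimates for $\beta_0$ and $\beta_1$ are those values for which the sum of the squares of
these deviations, $\sum \varepsilon_i^2$, is smallest. Based on this criterion, the following formulas are obtained:

$$b_1 = \frac{n\sum_{i=1}^{n} X_i Y_i - \left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

Thus, the estimated regression equation is given by $\hat{Y} = b_0 + b_1 X$.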
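
The least squares formulas translate directly into code. The following is a minimal sketch (the function name least_squares is an assumption for illustration), applied to the sleep and exam data used throughout the chapter.

```python
def least_squares(xs, ys):
    """Return (b0, b1) from the raw-sum least squares formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * (sum_x / n)  # y-bar minus b1 times x-bar
    return b0, b1

# Sleep-exam data: hours of sleep (X) and exam score (Y).
hours = [2.75, 2.15, 4.41, 5.52, 3.21, 4.32, 2.31, 4.30, 3.71]
scores = [89.5, 86.3, 92.2, 96.5, 87.2, 87.7, 88.3, 90.3, 88.7]

b0, b1 = least_squares(hours, scores)
print(round(b0, 2), round(b1, 2))
```

Because the code carries the unrounded slope into the intercept, its last decimal places may differ slightly from a hand computation that rounds b1 first.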
[EXAMPLE]
Find the estimated regression equation of the data on the number of hours of sleep (X) and score in
an examination (Y). Interpret the coefficients. Predict the score of the student if his hours of sleep is
5. Lastly, compute the coefficient of determination and interpret.

Recall: $n = 9$, $\sum_{i=1}^{9} X_i = 32.68$, $\sum_{i=1}^{9} Y_i = 806.7$, $\sum_{i=1}^{9} X_i Y_i = 2951.068$,
$\sum_{i=1}^{9} X_i^2 = 128.6602$, $\sum_{i=1}^{9} Y_i^2 = 72384.83$

$$b_1 = \frac{9(2951.068) - (32.68)(806.7)}{9(128.6602) - 32.68^2} = 2.1861$$

$$b_0 = \frac{806.7}{9} - 2.1861\left(\frac{32.68}{9}\right) = 81.6954$$

The estimated regression equation is:

$$\widehat{\text{score}} = 81.6954 + 2.1861\,(\text{hours of sleep})$$
9 9

[EXAMPLE]

The estimated regression equation is: $\widehat{\text{score}} = 81.6954 + 2.1861\,(\text{hours of sleep})$.

Interpretation:

For every unit increase in the student's number of hours of sleep, there is a 2.19 unit increase in the
mean score in the examination.

When the student has no sleep (that is, X = 0), the mean score in the examination is 81.70. Note that
X = 0 lies outside the observed hours of sleep (2.15 to 5.52), so this value should be read with caution.

The predicted score of the student having 5 hours of sleep is given by:

$\widehat{\text{score}} = 81.6954 + 2.1861(5) = 92.63$.

[graphical|representation]

[PREDICTING|THE|VALUE|OF|Y]
The estimated regression equation is appropriate only for the relevant range of X. This
includes only the values of X used in developing the regression model. Hence, when
predicting Y for a given value of X, one may interpolate only within the relevant range of the
X values. On the other hand, extrapolation to predict Y for values of X outside the relevant
range can result in a serious prediction error.
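
One way to enforce the relevant range in practice is to refuse predictions outside it. The sketch below uses the fitted coefficients from the sleep and exam example and takes the smallest and largest observed hours of sleep (2.15 and 5.52) as the relevant range; the function name predict_score and the choice to raise an error are illustrative assumptions.

```python
def predict_score(hours_of_sleep, b0=81.6954, b1=2.1861, x_min=2.15, x_max=5.52):
    """Predict the exam score, but only by interpolation within the observed X range."""
    if not (x_min <= hours_of_sleep <= x_max):
        raise ValueError(
            f"{hours_of_sleep} is outside the relevant range [{x_min}, {x_max}]; "
            "extrapolation may give a serious prediction error"
        )
    return b0 + b1 * hours_of_sleep

print(round(predict_score(5), 2))  # 5 hours is inside the range, so this is interpolation

try:
    predict_score(9)               # 9 hours is outside the range, so this is refused
except ValueError as err:
    print("refused:", err)
```

Raising an error is a deliberately strict choice; a softer variant could return the prediction together with a warning.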

[COEFFICIENT|of|determination]
The coefficient of determination, denoted by $R^2$, is defined as the proportion of the variability in the
observed values of the response variable that can be explained by the explanatory variable through
their linear relationship.

The Pearson correlation coefficient between two variables X and Y may be used in simple linear
regression analysis as a descriptive statistic to measure the strength of the linear relationship
between two variables.
However, a more meaningful descriptive statistic that may be used to assess the goodness-of-fit
of the linear regression model is obtained by squaring the Pearson correlation, r.
This value is expressed in terms of percentage so that we may interpret the value to be the
percentage of variability in the response variable that is explained by the explanatory variable
through the model.
Although the term "explained" may seem to imply causality, we clarify that the relationship between
the variables need not be causal.
$0 \le R^2 \le 1$.
If a model has perfect predictability, then $R^2 = 1$.
If a model has no predictive capability, then $R^2 = 0$.

[EXAMPLE]

Recall that the computed Pearson correlation is 0.7845.

Squaring it to obtain the coefficient of determination, $R^2 = 0.7845^2 = 0.6154$.

Interpretation: 61.54% of the variability in the examination score can be explained by the number of
hours of sleep of the student through the model.
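
The "proportion of variability explained" reading can be checked directly. The sketch below fits the least squares line to the sleep and exam data, computes the coefficient of determination as 1 - SSE/SST (the names SSE and SST for the error and total sums of squares are notation introduced here, not in the chapter), and compares it with the square of the Pearson correlation.

```python
import math

hours = [2.75, 2.15, 4.41, 5.52, 3.21, 4.32, 2.31, 4.30, 3.71]
scores = [89.5, 86.3, 92.2, 96.5, 87.2, 87.7, 88.3, 90.3, 88.7]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
sxx = sum((x - x_bar) ** 2 for x in hours)

# Least squares line, written in deviation form.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# SST: total variability of Y about its mean. SSE: variability left unexplained by the line.
sst = sum((y - y_bar) ** 2 for y in scores)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))
r_squared = 1 - sse / sst

# Pearson correlation, for comparison with the squared value.
r = sxy / math.sqrt(sxx * sst)

print(round(r_squared, 4), round(r ** 2, 4))
```

In simple linear regression the two printed values agree, which is why squaring r is a valid shortcut for the coefficient of determination.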

[EXAMPLE]

Using Microsoft Excel, we have:

[EXercise]
Suppose a researcher wishes to investigate the relationship between the achieved grade-point index (GPI) and the
starting salary of recent graduates majoring in business. A random sample of 30 recent graduates majoring in
Business is drawn, and the data pertaining to the GPI and starting salary (in thousands of dollars) are recorded for
each individual in the following table:

Individual No.   GPI (X)   Starting Salary (Y)      Individual No.   GPI (X)   Starting Salary (Y)
 1               2.7       17.0                     16               3.0       17.4
 2               3.1       17.7                     17               2.6       17.3
 3               3.0       18.6                     18               3.3       18.1
 4               3.3       20.5                     19               2.9       18.0
 5               3.1       19.1                     20               2.4       16.2
 6               2.4       16.4                     21               2.8       17.5
 7               2.9       19.3                     22               3.7       21.3
 8               2.1       14.5                     23               3.1       17.2
 9               2.6       15.7                     24               2.8       17.0
10               3.2       18.6                     25               3.5       19.6
11               3.0       19.5                     26               2.7       16.6
12               2.2       15.0                     27               2.6       15.0
13               2.8       18.0                     28               3.2       18.4
14               3.2       20.0                     29               2.9       17.3
15               2.9       19.0                     30               3.0       18.5

1. Construct a scatter diagram for the given dataset. What can you say about the relationship of GPI and
starting salary based on your visual inspection?
2. Compute and interpret the correlation coefficient.
3. Find the equation of the regression line. Interpret the significant coefficients (at 10% level of significance).
4. Find an estimate for the starting salary if the individual's GPI is 2.5.
5. Compute the coefficient of determination. What can you say about the model's goodness-of-fit?
