Correlation and Regression
Introduction to Simple Linear Regression Analysis
[Motivation]
Many studies are concerned with the analysis of the relationship between two variables.
Some focus on studying the degree and the type and direction of association. Others go
beyond describing the relationship, and aim at predicting the value of one variable using
the value of the other.
[Figure: examples of paired variables - advertising costs and sales; number of hours of sleep and exam score; temperature and reliability of components; carbon dioxide (CO2) emissions and gross domestic product (GDP); you and your crush]
Now, we will look more closely at the analysis of the relationship between two continuous variables. Specifically, we will discuss correlation analysis and simple linear regression analysis.
Correlation analysis aims to gain an insight on the strength of the linear relationship between variables, while regression analysis focuses on revealing the form of the linear relationship between variables.
Example: The following data were obtained in a study of the relationship between the
number of hours of sleep of a student and score in an examination.
[Table: number of hours of sleep (Xi) and exam score (Yi) for the n = 9 students in the study]

We can see from the scatter diagram that the points form an upward trend. By visual inspection, we can say that the number of hours of sleep (X) and score in the exam (Y) are possibly linearly related with each other.
1. A linear correlation coefficient can only assume values between -1 and 1, inclusive of end points.
o -1 ≤ r ≤ 1.
2. The sign of r describes the direction of the linear relationship between X and Y.
o A positive r means that the line slopes upward to the right, and so as X increases, the value of Y also increases.
o A negative r means that the line slopes downward to the right, and so as X increases, the value of Y decreases.
5. A strong linear relationship does not necessarily imply that X causes Y or Y causes X.
o It is possible that a third variable may have caused the change in both X and Y, producing the observed relationship.
1. Covariation (correlation)
3. Nonspuriousness (no alternative explanations)
4. *Specification of a mechanism
[Figure: scatter-plot patterns - Positive Linear Correlation (r is near 1); Negative Linear Correlation (r is near -1); No apparent Linear Correlation (r is near 0); Quadratic Relation (r is near 0)]
Consider the Sleep-Exam example. Suppose that the linear correlation between X and Y in the past is 0.75. We want to determine if the correlation has significantly increased compared to the past. Use a 0.05 level of significance.

Ho: ρ = 0.75
Ha: ρ > 0.75
α = 0.05

t = (r - ρ0)√(n - 2) / √(1 - r²) = (0.7845 - 0.75)√(9 - 2) / √(1 - 0.7845²) = 0.147193

Decision: Since t = 0.147193 is less than the critical value t0.05(7) = 1.895, we fail to reject Ho.

Conclusion: At 5% level of significance, we do not have sufficient evidence to say that the correlation has significantly increased compared to the past.
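As a quick check, the test statistic above can be computed numerically. The following is a minimal sketch of the computation (the function name is illustrative, not from the text):

```python
from math import sqrt

def corr_increase_t(r: float, rho0: float, n: int) -> float:
    """t statistic for testing Ho: rho = rho0 against Ha: rho > rho0,
    using t = (r - rho0) * sqrt(n - 2) / sqrt(1 - r**2)."""
    return (r - rho0) * sqrt(n - 2) / sqrt(1 - r ** 2)

# Sleep-Exam example: r = 0.7845, rho0 = 0.75, n = 9
t = corr_increase_t(r=0.7845, rho0=0.75, n=9)
print(round(t, 6))  # 0.147193, well below t(0.05, 7) = 1.895
```

Since the computed statistic falls below the one-tailed critical value, the code reproduces the fail-to-reject decision in the example.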
In this section, we deal with the case where one continuous variable is linearly regressed on another continuous variable.
[Tables: example data pairs - number of hours of sleep (Xi) and exam score (Yi); temperature, in degrees Celsius (Xi), and reliability of component (Yi); advertising costs, in thousand pesos (Xi), and sales, in million pesos (Yi)]
Yi = β0 + β1Xi + εi

where Yi is the value of the response variable for the ith element;
Xi is the value of the explanatory variable for the ith element;
β0 is a regression coefficient that gives the y-intercept of the regression line;
β1 is a regression coefficient that gives the slope of the line;
εi is the random error term for the ith element,
where the εi's are independent, normally distributed with mean 0 and variance σ² (constant) for i = 1, 2, …, n;
n is the number of elements.
β1 gives the amount of change in the mean of Y (whether positive or negative, depending on the sign) for every unit increase in the value of X, hence the name slope.

E(Y) = β0 + β1Xi

This is the familiar slope-intercept form of a line, y = b + mx, with the intercept and slope written as β0 and β1.
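To make the model concrete, data obeying Yi = β0 + β1Xi + εi can be simulated. This is a sketch only; the parameter values below are hypothetical, not taken from the text:

```python
import random

random.seed(1)  # reproducible illustration
beta0, beta1, sigma = 80.0, 2.0, 1.5  # hypothetical intercept, slope, error SD
x = [2.5, 3.0, 3.5, 4.0, 4.5, 5.0]   # hypothetical hours of sleep

# Yi = beta0 + beta1*Xi + eps_i, with eps_i ~ N(0, sigma^2), independent
y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]
```

Each simulated Y scatters randomly about the line E(Y) = β0 + β1X, which is exactly the role the random error term plays in the model.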
Now, even if a response variable can be predicted adequately by using only one
explanatory variable, there remains an inherent and inevitable variation present in the
response variable.
Lastly, the random error term accounts for the measurement errors in recording the value
of the response variable.
In short, we dump into the random error term the effects of all other factors apart from X that explain the variation we observe in the realized values of Y.
The process of obtaining the equation that best fits the data requires estimating the unknown regression coefficients, β0 and β1.
There are several ways of deriving estimates for these regression coefficients but we will use the
method of least squares.
The method of least squares considers the deviations

εi = Yi - E(Yi) = Yi - (β0 + β1Xi)

and requires that our estimates for β0 and β1 are those values for which the sum of the squares of these deviations, Σεi², is smallest. Based on this criterion, the following formulas are obtained:

b1 = [n ΣXiYi - (ΣXi)(ΣYi)] / [n ΣXi² - (ΣXi)²]

b0 = ȳ - b1x̄

Thus, the estimated regression equation is given by Ŷ = b0 + b1X.
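The least-squares formulas can be sketched directly in code. A minimal illustration (the function name is ours, not from the text):

```python
def least_squares(x, y):
    """Least-squares estimates (b0, b1) for simple linear regression,
    computed from the raw sums as in the formulas above."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
    b0 = sy / n - b1 * sx / n                        # intercept
    return b0, b1
```

For example, least_squares([1, 2, 3], [2, 4, 6]) recovers the exact line through those points: intercept 0 and slope 2.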
Chapter 10 Correlation and Regression: Introduction to Simple Linear Regression Analysis
[EXAMPLE]
Find the estimated regression equation of the data on the number of hours of sleep (X) and score in an examination (Y). Interpret the coefficients. Predict the score of the student if his hours of sleep is 5. Lastly, compute the coefficient of determination and interpret.
Recall: n = 9, ΣXi = 32.68, ΣYi = 806.7, ΣXiYi = 2951.068, ΣXi² = 128.6602, ΣYi² = 72384.83

b1 = [9(2951.068) - (32.68)(806.7)] / [9(128.6602) - (32.68)²] = 2.1861

b0 = ȳ - b1x̄ = (806.7/9) - 2.1861(32.68/9) = 81.70
Interpretation:
For every one-hour increase in the student's number of hours of sleep, there is a 2.19-unit increase in the mean score in the examination.
When the student has no sleep (that is, X = 0), the mean score in the examination is 81.70.
The predicted score of the student having 5 hours of sleep is given by:

Ŷ = 81.70 + 2.1861(5) = 92.63
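Plugging the example's summary totals into the least-squares formulas reproduces the fitted coefficients and the prediction at X = 5. A sketch (the totals are those given in the example):

```python
# Summary totals from the Sleep-Exam example (n = 9 students)
n = 9
sum_x, sum_y = 32.68, 806.7
sum_xy, sum_x2 = 2951.068, 128.6602

b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
b0 = sum_y / n - b1 * sum_x / n                                 # intercept
y_hat = b0 + b1 * 5                                             # predicted score at X = 5
print(round(b1, 4), round(b0, 2), round(y_hat, 2))
```

This reproduces b1 = 2.1861, b0 = 81.70, and a predicted score of about 92.63.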
[PREDICTING THE VALUE OF Y]
The estimated regression equation is appropriate only for the relevant range of X. This
includes only the values of X used in developing the regression model. Hence, when
predicting Y for a given value of X, one may interpolate only within the relevant range of the
X values. On the other hand, extrapolation to predict Y for values of X outside the relevant
range can result in a serious prediction error.
The Pearson correlation coefficient between two variables X and Y may be used in simple linear
regression analysis as a descriptive statistic to measure the strength of the linear relationship
between two variables.
However, a more meaningful descriptive statistic that may be used to assess the goodness-of-fit of the linear regression model is the coefficient of determination, R2, obtained by squaring the Pearson correlation, r.
This value is expressed in terms of percentage so that we may interpret the value to be the
percentage of variability in the response variable that is explained by the explanatory variable
through the model.
Although the term explained may seem to imply causality, we clarify that the relationship between
the variables need not be causal.
0 ≤ R2 ≤ 1.
If a model has perfect predictability, then R2 = 1.
If a model has no predictive capability, then R2 = 0.
Interpretation: 61.54% of the variability in the examination score can be explained by the number of hours of sleep of the student through the model.
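R2 can be recomputed from the example's summary totals using the corrected sums of squares; a minimal sketch:

```python
# Summary totals from the Sleep-Exam example (n = 9 students)
n = 9
sum_x, sum_y = 32.68, 806.7
sum_xy, sum_x2, sum_y2 = 2951.068, 128.6602, 72384.83

sxy = sum_xy - sum_x * sum_y / n  # corrected cross-product
sxx = sum_x2 - sum_x ** 2 / n     # corrected sum of squares for X
syy = sum_y2 - sum_y ** 2 / n     # corrected sum of squares for Y
r2 = sxy ** 2 / (sxx * syy)       # coefficient of determination
print(round(r2, 4))
```

Note that the square root of this value recovers the Pearson correlation r = 0.7845 used earlier in the chapter, so the two descriptive statistics are consistent.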