15 MAY - NR - Correlation and Regression
I. Correlation analysis
Considering observations of a single variable, studies are made of averages,
dispersion, skewness, kurtosis, etc. Often, however, two types of measurements are taken
from one individual or item. For example, when animals are brought to the
laboratory, their length and weight are both measured. Thus, a pair of observations is available on
all the units of the sample. Analysis of data with pairs of observations (leading to the study of two
variables) may be approached in two ways:
1. To learn whether there is any association between the two variables.
2. To establish the extent to which one variable is affected by the other.
The first one leads to the study of correlation [strength of association between variables]
and the second to the study of regression [nature of association between variables].
Since two variables are considered in this study, it is called bivariate analysis.
III. Types of Correlation
When two variables show a relationship between them, they are said to be correlated.
Correlation can be of two types: (a) positive and (b) negative.
If a variable changes in the same direction corresponding to a change in the other,
the variables are said to be directly correlated (positive). On the other hand, if a variable moves in
the opposite direction corresponding to a change in the other, they are said to be negatively
(inversely) correlated (examples: see figures given above).
Correlation between two variables can also be linear or non-linear. When the change in one
variable is in a constant ratio to the change in the other, the two variables are said to be
linearly related. If the change in one variable is not in a constant ratio to the change in the other,
then there exists a non-linear relationship between the variables (examples: see figures given
above).
r = Σ(x − x̄)(y − ȳ) / (n · σx · σy)

where x̄ = AM of the x variable, ȳ = AM of the y variable,
σx = SD of the x variable, σy = SD of the y variable, and n denotes the number of pairs of observations.
Σ(x − x̄)(y − ȳ) / n, denoted as Cov(x, y), is called the 'covariance' between x and y. It is also
known as the product moment.

Hence r = Cov(x, y) / (σx · σy)
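The definition above translates directly into code. The sketch below (an illustration, not from the notes) uses population standard deviations, matching the divisor n in the formula; the data are the father/son heights used in the worked regression example later in these notes:

```python
from statistics import fmean, pstdev

def pearson_r(x, y):
    """r = Cov(x, y) / (sigma_x * sigma_y), with Cov and SD both using divisor n."""
    n = len(x)
    x_bar, y_bar = fmean(x), fmean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
    return cov / (pstdev(x) * pstdev(y))

x = [65, 66, 67, 67, 68, 69, 71, 73]   # heights of fathers
y = [67, 68, 64, 68, 72, 70, 69, 70]   # heights of sons
print(round(pearson_r(x, y), 4))       # 0.4719
```

Because r is a ratio, the divisor (n or n − 1) cancels, so sample standard deviations would give the same value.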
If (x1, y1), (x2, y2), ……… (xn, yn) are the n pairs of observations, r can be calculated as
follows.

Calculate   Sxx = Σx² − (Σx)²/n
            Syy = Σy² − (Σy)²/n
            Sxy = Σxy − (Σx)(Σy)/n

Then, r = Sxy / √(Sxx · Syy)

Σx, Σx², Σy, Σy² and Σxy can be obtained from the given pairs of data;
n is the number of pairs of data.
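This computational form can be sketched the same way (illustrative code, again using the father/son height data from the later example, so the two formulas can be seen to agree):

```python
from math import sqrt

def pearson_r_sums(x, y):
    """r = Sxy / sqrt(Sxx * Syy), computed from raw sums only."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x) - sx * sx / n       # Sxx = sum(x^2) - (sum x)^2 / n
    syy = sum(v * v for v in y) - sy * sy / n       # Syy = sum(y^2) - (sum y)^2 / n
    sxy = sum(a * b for a, b in zip(x, y)) - sx * sy / n
    return sxy / sqrt(sxx * syy)

x = [65, 66, 67, 67, 68, 69, 71, 73]
y = [67, 68, 64, 68, 72, 70, 69, 70]
print(round(pearson_r_sums(x, y), 4))   # 0.4719
```

The advantage of this form is that only five running totals are needed, so r can be computed in a single pass over the data.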
V. Rank Correlation – Spearman’s correlation
Considering one characteristic, a group of individuals may be arranged in an order of merit.
The same individuals may follow a different order of merit when another characteristic is
considered. When the orders corresponding to these two characteristics are correlated for the
n pairs, we get what is called rank correlation.
Spearman rank correlation describes the monotonic relationship between two variables.
It is used for non-normally distributed data, and for ordinal data.
If (x1, y1), (x2, y2), …….. (xn, yn) are the n pairs of ranks, the rank correlation ρ (rho) is
calculated as

ρ = 1 − [6 Σd² / (n(n² − 1))]

where d = x − y (difference between the ranks)
If any observed values are repeated in either set of data, then a different formula has to be
used:

ρ = 1 − {6[Σd² + (1/12)(m1³ − m1) + (1/12)(m2³ − m2) + .........] / n(n² − 1)}

That is, some additional values are added to Σd²: one value of (1/12)(m³ − m) for each of the
repeated values, where m is the number of times the value is
repeated.
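Both formulas can be sketched together (illustrative code; the helper names `average_ranks` and `spearman_rho` are hypothetical, and tied values are given the average of the ranks they span, as the tie correction assumes):

```python
from collections import Counter

def average_ranks(values):
    """Rank the values (1-based), giving tied values the average of their positions."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1              # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """rho = 1 - 6[sum d^2 + sum (m^3 - m)/12] / (n(n^2 - 1))."""
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # Tie correction: (m^3 - m)/12 for every value repeated m times, in x and in y
    cf = sum((m ** 3 - m) / 12 for vals in (x, y) for m in Counter(vals).values())
    return 1 - 6 * (d2 + cf) / (n * (n * n - 1))

print(spearman_rho([1, 2, 3], [3, 2, 1]))       # -1.0: perfectly reversed order
print(spearman_rho([1, 2, 2, 4], [1, 2, 3, 4])) # 0.9, using the tie correction
```

With no ties the correction terms are zero and the simple formula is recovered.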
Calculate   Sxx = Σx² − (Σx)²/n ; here (Σx)² = (Σx)(Σx)
            Syy = Σy² − (Σy)²/n
            Sxy = Σxy − (Σx)(Σy)/n

Remember the Sxx formula. Changing 'x' to 'y' gives Syy; changing only the second 'x' to 'y' gives Sxy.

Then, r = Sxy / √(Sxx · Syy)

Σx, Σx², Σy, Σy² and Σxy can be obtained from a table prepared as follows.

x    y    x²    y²    xy

Number of pairs = n
Example:

x     x²     y     y²     xy
…     …      …     …      …
8     64     11    121    88
Total 28     154   44     372    238

n = 6

Sxx = 154 − (28)²/6 = 23.33
Syy = 372 − (44)²/6 = 49.33
Sxy = 238 − (28 × 44)/6 = 32.67
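The arithmetic of this example can be checked from the column totals alone (a quick verification sketch, not part of the original notes):

```python
from math import sqrt

# Column totals from the table above (n = 6 pairs)
n = 6
sum_x, sum_x2 = 28, 154
sum_y, sum_y2 = 44, 372
sum_xy = 238

sxx = sum_x2 - sum_x ** 2 / n       # 154 - 784/6  = 23.33
syy = sum_y2 - sum_y ** 2 / n       # 372 - 1936/6 = 49.33
sxy = sum_xy - sum_x * sum_y / n    # 238 - 1232/6 = 32.67

r = sxy / sqrt(sxx * syy)
print(round(sxx, 2), round(syy, 2), round(sxy, 2), round(r, 2))
```

This reproduces 23.33, 49.33 and 32.67, and completes the example with r ≈ 0.96, a strong positive correlation.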
b1 is called the regression coefficient, which measures the growth in y corresponding to unit
growth in x. a1 is known as the y-intercept, which is the value of y when x = 0.
Similarly, b2 is also called a regression coefficient. a2 is known as the x-intercept, which is
the value of x when y = 0.

b1 = r(σy/σx)  and  b2 = r(σx/σy),

and hence b1 · b2 = r².
Example
The following data pertains to the heights of fathers and sons. Obtain the regression
equations.
Height of father X: 65 66 67 67 68 69 71 73
Height of son Y: 67 68 64 68 72 70 69 70
Σx = 546,  Σy = 548,  n = 8
Σx² = 37314,  Σy² = 37578,  Σxy = 37422
x̄ = 68.25 and ȳ = 68.5
Sxx = 37314 − (546)²/8 = 49.5
Syy = 37578 − (548)²/8 = 40
Sxy = 37422 − (546 × 548)/8 = 21
Regression of y on x:
b1 = 21/49.5 = 0.4242
a1 = ȳ − 0.4242 × x̄ = 68.5 − 0.4242 × 68.25 = 68.5 − 28.95 = 39.55
The equation is y = 0.4242x + 39.55
If an association is found, the inference is that variation in X may cause variation in Y, or
variation in Y may cause variation in X, or variation in some other factor may affect both X and
Y.
3. The third common use of linear regression is estimating the value of one variable
corresponding to the selected value of the other variable.