15 MAY - NR - Correlation and Regression


Bivariate studies

I. Correlation analysis
When observations on a single variable are considered, studies are made of averages,
dispersion, skewness, kurtosis, etc. Often, however, two types of measurements are taken
from each individual or item. For example, when animals are brought to the laboratory,
their length and weight are measured. Thus, a pair of observations is available on each
unit of the sample. Analysis of data with pairs of observations (leading to the study of two
variables) may be approached in two ways.
1. To learn whether there is any association between the two variables.
2. To establish the extent to which one variable is affected by the other.

The first one leads to the study of correlation [strength of association between variables]
and the second to the study of regression [nature of association between variables].
Since two variables are considered in this study, it is called bivariate analysis.

II. Scatter Diagram


When all pairs of observations are plotted on graph paper, after graduating the two
axes, several dots appear on it. If any association exists, these dots tend to cluster closely
and to show a direction. This diagram of dots is called a scatter diagram.

Below are given different types of associations (correlation)

III. Types of Correlation
When two variables show a relationship between them, they are said to be correlated.
Correlation can be of two types: (a) positive and (b) negative.
If a variable changes in the same direction as the change in the other, the variables
are said to be directly correlated (+ve). On the other hand, if a variable moves in the
opposite direction to the change in the other, they are said to have –ve correlation
(or to be inversely correlated) (examples: see figures given above).
Correlation between two variables can be linear or non-linear. When the change in one
variable is in a constant ratio to the change in the other, the two variables are said to be
linearly related. If the change in one variable is not in a constant ratio to the change in the
other, then there exists a non-linear relationship between the variables (examples: see
figures given above).

IV. Pearson’s Coefficient of Correlation.


Karl Pearson suggested a measure called the coefficient of correlation. This coefficient
measures the degree of linear relationship between two variables. It is written symbolically
as 'r'. The formula for 'r' is given as

    r = Σ(x − x̄)(y − ȳ) / (n · σx · σy)

where x̄ = AM of the x variable, ȳ = AM of the y variable, σx = SD of the x variable,
σy = SD of the y variable, and n = the number of pairs of observations.

The quantity Σ(x − x̄)(y − ȳ) / n, denoted Cov(x, y), is called the 'covariance' between
x and y. It is also known as the product moment.

Hence r = Cov(x, y) / (σx · σy).
If (x1, y1), (x2, y2), …, (xn, yn) are the 'n' pairs of observations, 'r' can be calculated as
follows.

Calculate:

    Sxx = Σx² − (Σx)²/n

    Syy = Σy² − (Σy)²/n

    Sxy = Σxy − (Σx)(Σy)/n

Then, r = Sxy / √(Sxx · Syy)

Σx, Σx², Σy, Σy², and Σxy can be obtained from the given pairs of data;
"n" is the number of pairs of data.
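The steps above can be sketched in code. A minimal Python sketch, assuming the Sxx/Syy/Sxy route just described; the helper name `pearson_r` and the sample data are illustrative, not from the text:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r via the Sxx, Syy, Sxy quantities defined above."""
    n = len(x)
    sxx = sum(v * v for v in x) - sum(x) ** 2 / n                  # Sxx = Σx² − (Σx)²/n
    syy = sum(v * v for v in y) - sum(y) ** 2 / n                  # Syy = Σy² − (Σy)²/n
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n   # Sxy = Σxy − (Σx)(Σy)/n
    return sxy / sqrt(sxx * syy)

x = [2, 3, 4, 5, 6, 8]
y = [3, 5, 6, 9, 10, 11]
print(round(pearson_r(x, y), 3))   # ≈ 0.963
```

Small differences from hand calculations (e.g. 0.9631 vs 0.9628) come only from rounding the intermediate Sxx, Syy, Sxy values.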

V. Rank Correlation – Spearman’s correlation
Considering a characteristic, a group of individuals may be arranged in an order of merit.
The same individuals may follow a different order of merit when another characteristic is
considered. When the orders, corresponding to these two characteristics, are correlated for the
‘n’ pairs, we get what is called rank correlation.
Spearman rank correlation describes the monotonic relationship between two variables.
It is used for non-normally distributed data, and for ordinal data.
If (x1, y1), (x2, y2), …, (xn, yn) are the 'n' pairs of ranks, the rank correlation 'ρ' (rho) is
calculated as

    ρ = 1 − [6 Σd²] / [n(n² − 1)]

where d = x − y (the difference between the ranks).
If any observed values are repeated in a set of data, a different formula has to be used:

    ρ = 1 − 6[Σd² + (1/12)(m1³ − m1) + (1/12)(m2³ − m2) + …] / [n(n² − 1)]

That is, additional terms are added to Σd²: one term of (1/12)(m³ − m) for each repeated
value, where m is the number of times the value is repeated.
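As a sketch, the rank-correlation computation, including the tie correction just described, might look like this in Python. The helper names (`ranks`, `tie_term`, `spearman_rho`) and the average-rank convention for ties are my own assumptions:

```python
from collections import Counter

def ranks(values):
    """Assign average ranks: tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the group of equal values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of rank positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def tie_term(values):
    """Sum of (m³ − m)/12 over each group of m repeated values."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)

def spearman_rho(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * (d2 + tie_term(x) + tie_term(y)) / (n * (n ** 2 - 1))
```

With no ties, this reduces to the plain formula ρ = 1 − 6Σd²/[n(n² − 1)].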

Important Results / Properties of Correlation coefficient


1. The value of 'r' lies between −1 and +1.
2. When r = +1, the variables are perfectly correlated (direct).
3. When r = −1, the variables are perfectly correlated (inverse).
4. r = ±1 indicates perfect linear correlation between the variables.
5. r = 0 indicates that the variables are linearly uncorrelated.

Examples and Exercises.

Calculation of Correlation coefficient

Calculate:

    Sxx = Σx² − (Σx)²/n ;  note (Σx)² = (Σx)(Σx)

    Syy = Σy² − (Σy)²/n

    Sxy = Σxy − (Σx)(Σy)/n

Remember the Sxx formula. Changing 'x' to 'y' gives Syy; changing only the second 'x'
to 'y' gives Sxy.

Then, r = Sxy / √(Sxx · Syy)

Σx, Σx², Σy, Σy², and Σxy can be obtained from a table prepared as follows.

x        y        x²       y²       xy

Σx =     Σy =     Σx² =    Σy² =    Σxy =

Number of pairs = n
Example:

Calculate the correlation coefficient [‘r’] value, given


X 2 3 4 5 6 8
Y 3 5 6 9 10 11
Prepare a table as follows:

x     x²     y     y²     xy
2      4     3      9      6
3      9     5     25     15
4     16     6     36     24
5     25     9     81     45
6     36    10    100     60
8     64    11    121     88
Σ    28    154    44    372    238

n = 6

    Sxx = 154 − (28)²/6 = 23.33

    Syy = 372 − (44)²/6 = 49.33

    Sxy = 238 − (28 × 44)/6 = 32.67

    r = Sxy / √(Sxx · Syy) = 32.67 / √(23.33 × 49.33) = 32.67 / 33.92 = 0.9631
Exercises.
Calculate the coefficient of correlation of the data given below.
1. X: 12, 9, 8, 10, 11, 13, 7 Y: 14, 8, 6, 9, 11, 12, 3
2. X: 8, 2,10, 4, 8, 7 Y: 9, 11, 6, 9, 10, 12
3. X: 2, 4, 6, 8, 10 Y: 5, 8, 9, 10, 11
4. X: 1, 2, 4, 5, 7, 8 Y: 2, 5, 9, 11, 12, 13

VI. Regression analysis


Regression Equations
The best fit representing the linear form in a scatter diagram is called regression line.
The equation used to represent this straight line is known as regression equation. There are two
types of regression equations.
(1) Regression equation of y on x, given as

    y = b1·x + a1,  where b1 = Sxy / Sxx and a1 = ȳ − b1·x̄

(2) Regression equation of x on y, given as

    x = b2·y + a2,  where b2 = Sxy / Syy and a2 = x̄ − b2·ȳ

b1 is called the regression coefficient, which measures the change in 'y' corresponding to a
unit change in 'x'. 'a1' is known as the y-intercept, which is the value of y when x = 0.
Similarly, 'b2' is also called a regression coefficient. 'a2' is known as the x-intercept,
which is the value of x when y = 0.
    b1 = r·(σy/σx)  and  b2 = r·(σx/σy)

and hence

    b1 × b2 = r²,  i.e.  r = √(b1 × b2)

i.e. the correlation coefficient is the square root of the product of the two regression
coefficients.
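This identity can be checked numerically. A quick sketch, with illustrative data and variable names of my own:

```python
from math import sqrt

x = [2, 3, 4, 5, 6, 8]
y = [3, 5, 6, 9, 10, 11]
n = len(x)

sxx = sum(v * v for v in x) - sum(x) ** 2 / n
syy = sum(v * v for v in y) - sum(y) ** 2 / n
sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

b1 = sxy / sxx                  # slope of the y-on-x regression
b2 = sxy / syy                  # slope of the x-on-y regression
r = sxy / sqrt(sxx * syy)       # correlation coefficient

print(abs(b1 * b2 - r ** 2) < 1e-9)   # True: b1·b2 equals r²
```

Algebraically, b1·b2 = Sxy²/(Sxx·Syy) = r², so the check holds for any data set.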

VII. Dependent and Independent Variables in Regression analysis


When two variables are considered, one will be the cause for the change in the other. The
cause variable is always treated as independent variable (taken as ‘x’) and the affected variable
as dependent variable (taken as ‘y’).
Hence the general form of the regression equation (linear) is given as y = b x + a
From the given pairs of observations, calculate Sxx and Sxy as in the correlation-coefficient
calculation. Also obtain x̄ and ȳ. Then

'b' is calculated as b = Sxy / Sxx, and 'a' is given by a = ȳ − b·x̄
Calculation of Regression equation

Example
The following data pertains to the heights of fathers and sons. Obtain the regression
equations.
Height of father X: 65 66 67 67 68 69 71 73
Height of son Y: 67 68 64 68 72 70 69 70
Σx = 546     Σy = 548     n = 8
Σx² = 37314  Σy² = 37578  Σxy = 37422
x̄ = 68.25 and ȳ = 68.5

    Sxx = 37314 − (546)²/8 = 49.5

    Syy = 37578 − (548)²/8 = 40

    Sxy = 37422 − (546 × 548)/8 = 21
Regression of y on x:

    b1 = 21 / 49.5 = 0.4242
    a1 = ȳ − b1·x̄ = 68.5 − 0.4242 × 68.25 = 68.5 − 28.95 = 39.55

The equation is y = 0.4242·x + 39.55

Questions often involve some additional calculations.

An additional question might be: "Calculate the height of the son when the father's height
is 68.4." It is calculated by substituting 68.4 for x in the equation:

    y = 0.4242 × 68.4 + 39.55 = 68.57 [units]
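The whole worked example, including the prediction, can be sketched as follows. The variable names are my own; the heights are those of the example above:

```python
# Father/son heights from the worked example above.
fathers = [65, 66, 67, 67, 68, 69, 71, 73]
sons    = [67, 68, 64, 68, 72, 70, 69, 70]
n = len(fathers)

# Sxx and Sxy as in the correlation-coefficient calculation.
sxx = sum(v * v for v in fathers) - sum(fathers) ** 2 / n
sxy = sum(f * s for f, s in zip(fathers, sons)) - sum(fathers) * sum(sons) / n

b = sxy / sxx                              # 21 / 49.5 ≈ 0.4242
a = sum(sons) / n - b * sum(fathers) / n   # ȳ − b·x̄ ≈ 39.55

# Predicted son's height for a father of height 68.4:
print(round(b * 68.4 + a, 2))   # ≈ 68.56 (the text's 68.57 uses rounded b and a)
```

Working with the unrounded b and a gives 68.56 rather than 68.57; the small gap is purely a rounding effect.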

VIII. Uses of Correlation and Regression


There are three main uses for correlation and regression.
1. One use is to test hypotheses about cause-and-effect relationships. In this case, the
experimenter sets the values of the X variable and checks whether variation in X
causes variation in Y. For example, giving people different amounts of a drug and
measuring their blood pressure.
2. The second main use is to see whether two variables are associated, without necessarily
inferring a cause-and-effect relationship. In this case, neither variable is set by the
experimenter; both vary naturally.

If an association is found, the inference is that variation in X may cause variation in Y, or
variation in Y may cause variation in X, or variation in some other factor may affect both X and
Y.
3. The third common use of linear regression is estimating the value of one variable
corresponding to the selected value of the other variable.

