Correlation and Regression
The most commonly used techniques for investigating the relationship between two quantitative
variables are correlation and linear regression. Correlation quantifies the strength of the linear
relationship between a pair of variables, whereas regression expresses the relationship in the form
of an equation. For example, in patients attending an accident and emergency unit (A&E), we
could use correlation and regression to determine whether there is a relationship between age and
urea level, and whether the level of urea can be predicted for a given age.
Correlation
Correlation is a statistical measure that indicates the extent to which two or more variables
fluctuate together. A positive correlation indicates the extent to which those variables increase or
decrease in parallel; a negative correlation indicates the extent to which one variable increases as
the other decreases.
According to W. I. King, correlation means that between two series or groups of data there exists some causal connection.
Correlation is a statistical technique that can show whether, and how strongly, pairs of variables are
related. Some correlations are fairly obvious, but your data may contain unsuspected ones. You may
also suspect there are correlations but not know which are the strongest. An intelligent correlation
analysis can lead to a greater understanding of your data.
Correlation Coefficients: Pearson, Kendall and Spearman
Correlation is a bivariate analysis that measures the strength of association between two
variables. In statistics, the value of the correlation coefficient varies between +1 and −1. A value
near ±1 indicates a perfect degree of association between the two variables; as the coefficient
approaches 0, the relationship between the two variables becomes weaker. Three types of
correlation are commonly used in statistics: Pearson correlation, Kendall rank correlation and
Spearman correlation.
Pearson r correlation
Pearson correlation is widely used in statistics to measure the degree of the relationship between
linearly related variables. For example, in the stock market, if we want to measure how two
commodities are related to each other, Pearson correlation is used to measure the degree of
relationship between the two commodities. The following formula is used to calculate the Pearson
correlation coefficient 𝑟:
$$ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}} $$
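To make the formula concrete, here is a minimal sketch in plain Python (the helper name pearson_r is ours, chosen for this example; in practice a library routine such as scipy.stats.pearsonr computes the same quantity):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of the products of deviations from the means
    num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the sums of squared deviations
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return num / sqrt(ss_x * ss_y)
```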
Kendall rank correlation
Kendall rank correlation is a non-parametric test that measures the strength of dependence between
two variables. If we consider two samples, x and y, each of size n, the total number of pairings
of x with y is n(n − 1)/2.
The following formula is used to calculate the Kendall rank correlation coefficient:
$$ \tau = \frac{n_c - n_d}{\tfrac{1}{2}\, n(n-1)} $$
Where:
nc = the number of concordant pairs
nd = the number of discordant pairs
n = the number of observations in each sample.
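As an illustration of nc and nd, the following sketch counts concordant and discordant pairs directly (it assumes no tied values; library routines such as scipy.stats.kendalltau also handle ties):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall rank correlation for samples without tied values."""
    nc = nd = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            nc += 1   # concordant: the pair is ordered the same way in x and y
        elif s < 0:
            nd += 1   # discordant: the pair is ordered in opposite ways
    n = len(x)
    return (nc - nd) / (n * (n - 1) / 2)
```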
Spearman rank correlation
Spearman rank correlation is a non-parametric test that is used to measure the degree of
association between two variables. It was developed by Spearman, hence the name. The Spearman
rank correlation test makes no assumptions about the distribution of the data and is the
appropriate correlation analysis when the variables are measured on a scale that is at least
ordinal.
The following formula is used to calculate the Spearman rank correlation coefficient:
$$ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $$
Where:
ρ = Spearman rank correlation coefficient
di = the difference between the ranks of corresponding values Xi and Yi
n = number of values in each data set.
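The formula can be traced step by step in the following sketch, which ranks each sample and sums the squared rank differences (again assuming no tied values, where the simple 1-to-n ranking is valid):

```python
def rank(values):
    """Assign ranks 1..n by sorted position (no tie handling)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for position, i in enumerate(order, start=1):
        r[i] = position
    return r

def spearman_rho(x, y):
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # sum of squared rank differences
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```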
Correlation can also be classified by the number of variables involved:
• Simple correlation, which involves exactly two variables.
• Multiple correlation, which involves three or more variables.
Table (1): Individual's Increasing Age with Increasing Sickness

Individual            1   2   3   4   5   6   7   8   9   10
Increasing Age        20  30  32  35  40  46  52  55  58  62
Increasing Sickness   1   2   0   3   4   6   5   7   8   9
Suppose that (x) denotes Increasing Age and (y) denotes Increasing Sickness.
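The three coefficients can be computed for this data set directly; the sketch below uses the SciPy routines pearsonr, spearmanr and kendalltau (assuming SciPy is installed):

```python
from scipy import stats

x = [20, 30, 32, 35, 40, 46, 52, 55, 58, 62]  # Increasing Age
y = [1, 2, 0, 3, 4, 6, 5, 7, 8, 9]            # Increasing Sickness

print("Pearson r:   ", stats.pearsonr(x, y)[0])
print("Spearman rho:", stats.spearmanr(x, y)[0])
print("Kendall tau: ", stats.kendalltau(x, y)[0])
```

All three values come out strongly positive for this data, reflecting that sickness tends to rise with age.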
Regression
Linear regression explores relationships that can be readily described by straight lines or their
generalization to many dimensions. A surprisingly large number of problems can be solved by
linear regression, and even more by transformations of the original variables that result in
linear relationships among the transformed variables.
Regression analysis is one of the most commonly used statistical techniques in the social,
behavioral and physical sciences. It involves identifying and evaluating the relationship between
a dependent variable and one or more independent variables, which are also called predictor or
explanatory variables. It is particularly useful for assessing and adjusting for confounding. A
model of the relationship is hypothesized, and estimates of the parameter values are used to
develop an estimated regression equation. Various tests are then employed to determine whether
the model is satisfactory. If it is, the estimated regression equation can be used to predict the
value of the dependent variable given values for the independent variables.
Independent variables
are characteristics that can be measured directly. These variables are also called predictor or
explanatory variables, and are used to predict or explain the behavior of the dependent variable.
Dependent variable
is a characteristic whose value depends on the values of independent variables.
The primary objective of regression is to develop a linear relationship between a response variable
and explanatory variables for the purposes of prediction. It assumes that a functional linear
relationship exists; where it does not, alternative approaches (such as functional regression) may
be superior.
Simple linear regression is a statistical method that allows us to summarize and study relationships
between two continuous (quantitative) variables. In a cause and effect relationship, the
independent variable is the cause, and the dependent variable is the effect. Least squares linear
regression is a method for predicting the value of a dependent variable y, based on the value of an
independent variable x.
• One variable, denoted (x), is regarded as the predictor, explanatory, or independent
variable.
• The other variable, denoted (y), is regarded as the response, outcome, or dependent
variable.
The estimated regression equation is

$$ \hat{y} = \beta_0 + \beta_1 x $$

and the least-squares estimates of the slope and intercept are

$$ \beta_1 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}, \qquad \beta_0 = \frac{\sum y - \beta_1 \sum x}{n} $$

Where:
x = independent variable
y = dependent variable
n = number of cases or individuals
β1 = the slope of the regression line
β0 = the intercept of the regression line with the y-axis
Σxy = sum of the products of the dependent and independent variables
Σx = sum of the independent variable
Σy = sum of the dependent variable
Σx² = sum of the squared independent variable
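As a sketch, these estimates can be applied to the Table (1) data in plain Python:

```python
x = [20, 30, 32, 35, 40, 46, 52, 55, 58, 62]  # Increasing Age
y = [1, 2, 0, 3, 4, 6, 5, 7, 8, 9]            # Increasing Sickness

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Slope and intercept from the least-squares formulas above
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = (sum_y - b1 * sum_x) / n

print(f"estimated line: yhat = {b0:.3f} + {b1:.3f} x")
```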
Multiple regression is an extension of simple linear regression. It is used when we want to predict
the value of a dependent variable (target or criterion variable) based on the value of two or more
independent variables (predictor or explanatory variables). Multiple regression allows you to
determine the overall fit (variance explained) of the model and the relative contribution of each of
the predictors to the total variance explained. For example, you might want to know how much of
the variation in exam performance can be explained by revision time and lecture attendance "as a
whole", but also the "relative contribution" of each independent variable in explaining the variance.
Mathematically, the multiple regression model is represented by the following equation:

$$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + u $$

Where:
Y = dependent variable
Xi = the i-th independent variable
β0 = the intercept
βi = the coefficient of Xi
u = the error (disturbance) term
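As an illustrative sketch, a multiple regression can be fitted with NumPy's least-squares solver; the revision-time and lecture-attendance figures below are invented purely for demonstration:

```python
import numpy as np

# Hypothetical data: exam score predicted from revision time (hours)
# and lecture attendance (sessions); invented for illustration only.
revision = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0])
lectures = np.array([3.0, 5.0, 4.0, 8.0, 9.0, 10.0])
score    = np.array([40.0, 52.0, 50.0, 71.0, 78.0, 88.0])

# Design matrix with a leading column of ones for the intercept beta_0
X = np.column_stack([np.ones_like(revision), revision, lectures])
beta, residuals, rank, sv = np.linalg.lstsq(X, score, rcond=None)

print("b0, b1, b2 =", beta)
```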
Conclusion
When comparing two variables, two questions come to mind: "Is there a relationship between the
two variables?" and "How strong is that relationship?" These questions can be answered using
regression and correlation: regression answers whether there is a relationship (these notes explore
linear relationships only), and correlation answers how strong the linear relationship is. To
introduce both concepts, it is easiest to look at a set of data.