Simple Linear Regression: Coefficient of Determination
Simple linear regression is used to find the best relationship between a single input
variable (predictor, independent variable, input feature) and an output variable
(predicted, dependent variable, output feature), provided that both variables
are continuous in nature. This relationship describes how the input variable is related to the
output variable and is represented by a straight line.
To understand this concept, let us first look at scatter plots. A scatter diagram, or scatter plot,
provides a graphical representation of the relationship between two continuous variables.
Coefficient of Determination
The coefficient of determination is the square of the correlation (r) between predicted y scores and actual y
scores; thus, it ranges from 0 to 1.
With linear regression, the coefficient of determination is also equal to the square of the correlation
between x and y scores.
An R2 of 0 means that the dependent variable cannot be predicted from the independent variable.
An R2 of 1 means the dependent variable can be predicted without error from the independent variable.
An R2 between 0 and 1 indicates the extent to which the dependent variable is predictable. An R2 of 0.10
means that 10 percent of the variance in Y is predictable from X; an R2 of 0.20 means that 20 percent is
predictable; and so on.
The formula for computing the coefficient of determination for a linear regression model with one independent
variable is given below.
Coefficient of determination. The coefficient of determination (R2) for a linear regression model with one
independent variable is:
R² = { (1/N) * Σ [ (xᵢ − x̄) * (yᵢ − ȳ) ] / (σx * σy) }²
where N is the number of observations used to fit the model, Σ is the summation symbol, xᵢ is the x value for
observation i, x̄ is the mean x value, yᵢ is the y value for observation i, ȳ is the mean y value, σx is the standard
deviation of x, and σy is the standard deviation of y.
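To see the formula in action, here is a minimal Python sketch that computes R² for a handful of made-up (x, y) pairs; the numbers are purely illustrative, and the calculation uses population standard deviations to match the 1/N form above.

```python
import numpy as np

# Illustrative paired observations (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

N = len(x)
x_bar, y_bar = x.mean(), y.mean()
sigma_x, sigma_y = x.std(), y.std()   # population standard deviations (divide by N)

# Correlation r = (1/N) * sum((xi - x_bar)(yi - y_bar)) / (sigma_x * sigma_y)
r = np.sum((x - x_bar) * (y - y_bar)) / (N * sigma_x * sigma_y)
r_squared = r ** 2   # same value as squaring np.corrcoef(x, y)[0, 1]

print(f"r = {r:.4f}, R^2 = {r_squared:.4f}")
```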
Independent variables (aka explanatory variables, or predictors) are the factors that might influence the dependent
variable.
Regression analysis helps you understand how the dependent variable changes when one of the independent
variables varies, and it allows you to determine mathematically which of those variables really has an impact.
Technically, a regression analysis model is based on the sum of squares, which is a mathematical way to measure the
dispersion of data points. The goal of the model is to get the smallest possible sum of squared residuals and draw the line that
comes closest to the data.
In statistics, a distinction is made between simple and multiple linear regression. Simple linear regression models the
relationship between a dependent variable and one independent variable using a linear function. If you use two or
more explanatory variables to predict the dependent variable, you are dealing with multiple linear regression. If the
dependent variable must be modeled as a non-linear function because the data relationships do not follow a straight line,
use nonlinear regression instead. The focus of this tutorial is on simple linear regression.
As an example, let's take sales numbers for umbrellas for the last 24 months and find out the average monthly
rainfall for the same period. Plot this information on a chart, and the regression line will demonstrate the
relationship between the independent variable (rainfall) and dependent variable (umbrella sales):
Linear regression equation
Mathematically, a linear regression is defined by this equation:
y = bx + a + ε
Where x is the independent variable, y is the dependent variable, b is the slope of the regression line, a is the y-intercept, and ε is the random error term (the residual).
For our example, the linear regression equation takes the following shape:
Umbrellas sold = b * rainfall + a
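As an illustration only (the rainfall and sales figures below are made up, not real data), the slope b and intercept a can be estimated with the ordinary least-squares formulas:

```python
import numpy as np

# Hypothetical monthly data: average rainfall (mm) and umbrellas sold
rainfall = np.array([82, 92, 56, 39, 121, 105, 68, 44, 97, 133, 75, 60])
umbrellas_sold = np.array([15, 18, 11, 8, 24, 21, 13, 9, 19, 27, 14, 12])

# Least-squares estimates of the slope (b) and intercept (a)
x_bar, y_bar = rainfall.mean(), umbrellas_sold.mean()
b = np.sum((rainfall - x_bar) * (umbrellas_sold - y_bar)) / np.sum((rainfall - x_bar) ** 2)
a = y_bar - b * x_bar

print(f"Umbrellas sold ≈ {b:.3f} * rainfall + {a:.3f}")
```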
RESIDUALS
A residual (or error) represents the unexplained (residual) variation left after fitting
a regression model. It is the difference (the left-over part) between the observed
value of the variable and the value suggested by the regression model.
The difference between the observed value of the dependent variable (y)
and the predicted value (ŷ) is called the residual (e). Each data point
has one residual.
e = y – ŷ
Both the sum and the mean of the residuals are equal to zero. That is, Σe = 0 and ē = 0.
Tools for analysing residuals – For the basic analysis of residuals you will use
the usual descriptive tools and scatterplots (plotting both fitted values
and residuals, as well as the dependent and independent variables you
have included in your model).
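Continuing the made-up umbrella example, this short sketch computes the fitted values and residuals and checks that the residuals sum (and average) to roughly zero:

```python
import numpy as np

# Hypothetical data (illustrative only)
rainfall = np.array([82, 92, 56, 39, 121, 105, 68, 44, 97, 133, 75, 60])
umbrellas_sold = np.array([15, 18, 11, 8, 24, 21, 13, 9, 19, 27, 14, 12])

# Fit the line y = b*x + a by least squares
b, a = np.polyfit(rainfall, umbrellas_sold, 1)

fitted = b * rainfall + a              # predicted values (y-hat)
residuals = umbrellas_sold - fitted    # e = y - y-hat, one residual per data point

print("sum of residuals:", residuals.sum())    # ~0 up to floating-point error
print("mean of residuals:", residuals.mean())  # ~0 as well
```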
CONFIDENCE INTERVALS
Confidence Intervals are estimates that are calculated from sample data
to determine ranges likely to contain the population parameter (mean,
standard deviation) of interest. For example, if our population is (2,6), a
confidence interval of the mean suggests that the population mean is
likely between 2 and 6. And how confidently can we say this? Obviously 100%, right? Because we
know all the values and we can calculate it very easily.
But in real-life problems, this is not the case. It is not always feasible or
possible to study the whole population. So what do we do? We take sample
data. But can we rely on one sample? No, because different samples from
the same population will produce different means.
So we take numerous random samples (from the same population) and
calculate confidence intervals for each sample and a certain percentage of
these ranges will contain the true population parameter.
This certain percentage is called the confidence level. A 95% confidence
level means that out of 100 random samples taken, I expect 95 of the
confidence intervals to contain the true population parameter.
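A small simulation makes this concrete. The population parameters below are arbitrary assumptions chosen for illustration; the sketch draws 100 random samples, builds a 95% confidence interval for the mean from each, and counts how many intervals cover the true mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean, true_sd = 50.0, 10.0    # hypothetical population parameters
n_samples, sample_size = 100, 30

covered = 0
for _ in range(n_samples):
    sample = rng.normal(true_mean, true_sd, sample_size)
    se = sample.std(ddof=1) / np.sqrt(sample_size)        # standard error of the mean
    t_crit = stats.t.ppf(0.975, df=sample_size - 1)       # 95% two-sided critical value
    lower, upper = sample.mean() - t_crit * se, sample.mean() + t_crit * se
    covered += lower <= true_mean <= upper

print(f"{covered} of {n_samples} intervals contain the true mean")  # typically around 95
```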
PREDICTION INTERVALS
The range that likely contains the value of the dependent variable for a
single new observation given specific values of the independent
variables, is the prediction interval.
The prediction interval predicts in what range a future individual observation
will fall, while a confidence interval shows the likely range of values
associated with some statistical parameter of the data, such as the
population mean.
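As a sketch of the difference, the following code fits a simple regression to made-up data and computes both intervals at a new x value using the standard textbook formulas (t critical value times the residual standard error times the appropriate square-root term):

```python
import numpy as np
from scipy import stats

# Hypothetical data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.9])

n = len(x)
b, a = np.polyfit(x, y, 1)
resid = y - (a + b * x)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))     # residual standard error
sxx = np.sum((x - x.mean()) ** 2)
t_crit = stats.t.ppf(0.975, df=n - 2)

x0 = 5.5                                      # new value of the independent variable
y0 = a + b * x0
ci_half = t_crit * s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / sxx)      # mean response
pi_half = t_crit * s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / sxx)  # single new observation

print(f"95% confidence interval for the mean at x0: {y0 - ci_half:.2f} to {y0 + ci_half:.2f}")
print(f"95% prediction interval for one new y at x0: {y0 - pi_half:.2f} to {y0 + pi_half:.2f}")
```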
MULTIPLE LINEAR REGRESSION
To find the best-fit line for each independent variable, multiple linear
regression calculates three things: the regression coefficients that lead to the
smallest overall model error, the t statistic of the overall model, and the
associated p value (how likely it is that the t statistic would have occurred by
chance if there were actually no relationship between the variables).
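A minimal sketch of a multiple linear regression, assuming the statsmodels package and using made-up data with two explanatory variables, is shown below; the fitted result exposes the coefficients, t statistics, and p values:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: two explanatory variables and one dependent variable
rng = np.random.default_rng(1)
x1 = rng.uniform(0, 10, 50)
x2 = rng.uniform(0, 5, 50)
y = 3.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(0, 1.0, 50)

X = sm.add_constant(np.column_stack([x1, x2]))  # add the intercept column
model = sm.OLS(y, X).fit()

print(model.params)    # estimated coefficients (intercept, x1, x2)
print(model.tvalues)   # t statistics
print(model.pvalues)   # associated p values
```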
INTERPRETATION OF REGRESSION ANALYSIS
If you've ever wondered how two or more pieces of data relate to each other (e.g.
how GDP is impacted by changes in unemployment and inflation), or if you've
ever had your boss ask you to create a forecast or analyze predictions based on
relationships between variables, then learning regression analysis would be well
worth your time.
In this article, you'll learn the basics of simple linear regression, sometimes called
'ordinary least squares' or OLS regression—a tool commonly used in forecasting
and financial analysis. We will begin by learning the core principles of regression,
first learning about covariance and correlation, and then moving on to building
and interpreting a regression output. Popular business software such
as Microsoft Excel can do all the regression calculations and outputs for you, but
it is still important to learn the underlying mechanics.
Variables
Suppose you want to forecast your company's sales based on GDP. The sales you are
forecasting would be the dependent variable because their value "depends" on the
value of GDP, and GDP would be the independent variable. You would then need to
determine the strength of the relationship between these two variables in order to
forecast sales. If GDP increases or decreases by 1%, how much will your sales increase or decrease?
Covariance
The actual number you get from calculating the covariance can be hard to interpret because
it isn't standardized. A covariance of five, for instance, can be interpreted as a
positive relationship, but the strength of that relationship can only be said to be
stronger than if the number were four, or weaker than if the number were six.
Correlation Coefficient
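The correlation coefficient addresses this interpretability problem by dividing the covariance by the product of the two standard deviations, which rescales it to a value between -1 and +1. A small sketch with purely illustrative numbers:

```python
import numpy as np

# Purely illustrative paired data (e.g., GDP change vs. sales)
gdp_change = np.array([2.1, 2.5, 4.0, 3.6, 1.8])
sales = np.array([100.0, 150.0, 300.0, 250.0, 90.0])

# Covariance (population form): average product of deviations from the means
cov = np.mean((gdp_change - gdp_change.mean()) * (sales - sales.mean()))

# Correlation: covariance divided by the product of the standard deviations
corr = cov / (gdp_change.std() * sales.std())

print(f"covariance = {cov:.2f}, correlation = {corr:.3f}")  # correlation lies between -1 and +1
```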
Regression Equation
Now that we know how the relative relationship between the two variables is
calculated, we can develop a regression equation to forecast or predict the
variable we desire. Below is the formula for a simple linear regression: y = bx + a. The "y" is
the value we are trying to forecast, the "b" is the slope of the regression line, the
"x" is the value of our independent variable, and the "a" represents the y-intercept.
The regression equation simply describes the relationship between the
dependent variable (y) and the independent variable (x).
Regressions in Excel
Now that you understand some of the background that goes into a regression
analysis, let's do a simple example using Excel's regression tools. We'll build on
the previous example of trying to forecast next year's sales based on changes in
GDP. The next table lists some artificial data points, but numbers like these are
easily accessible in real life.
We can see that there is going to be a positive correlation between sales and
GDP. Both tend to go up together. Using Excel, all you have to do is click
the Tools drop-down menu, select Data Analysis and from there
choose Regression. The popup box is easy to fill in from there; your Input Y
Range is your "Sales" column and your Input X Range is the change in GDP
column; choose the output range for where you want the data to show up on your
spreadsheet and press OK. You should see something similar to what is given in
the table below:
Regression Statistics
Multiple R          0.829224
R Square            0.687613
Adjusted R Square   0.583484
Standard Error      51.021807
Observations        5

Coefficients
Intercept           34.584093
GDP                 88.15552
Interpretation
The major outputs you need to be concerned about for simple linear regression
are the R-squared, the intercept (constant) and the GDP's beta (b) coefficient.
The R-squared number in this example is 68.7%. This shows how well our model
predicts or forecasts future sales, suggesting that the explanatory variable in
the model explains 68.7% of the variation in the dependent variable. Next, we
have an intercept of 34.58, which tells us that if the change in GDP were forecast
to be zero, our sales would be about 35 units. And finally, the GDP beta,
or slope coefficient, of 88.15 tells us that if GDP increases by 1%, sales will
likely go up by about 88 units.
So how would you use this simple model in your business? Well if your research
leads you to believe that the next GDP change will be a certain percentage, you
can plug that percentage into the model and generate a sales forecast. This can
help you develop a more objective plan and budget for the upcoming year.
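For example (treating the expected GDP change below as a made-up assumption), plugging the coefficients from the output above into the regression equation gives the forecast directly:

```python
# Coefficients taken from the regression output above
intercept = 34.584093
gdp_beta = 88.15552

# Hypothetical research estimate of next period's GDP change (in percent)
expected_gdp_change = 1.2

forecast_sales = intercept + gdp_beta * expected_gdp_change
print(f"Forecast sales: about {forecast_sales:.0f} units")
```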
Of course, this is just a simple regression and there are models that you can
build that use several independent variables called multiple linear
regressions. But multiple linear regressions are more complicated and have
several issues that would need another article to discuss.
HETEROSCEDASTICITY
The word “heteroscedasticity” comes from the Greek, and quite literally means data with
a different (hetero) dispersion (skedasis). In simple terms, heteroscedasticity is any set of
data that isn’t homoscedastic. More technically, it refers to data whose variability
(scatter) is unequal across the values of a second, predictor variable.
A residual plot can suggest (but not prove) heteroscedasticity. Residual plots are created
by the following steps (a short sketch follows the list):
1. Calculating the squared residuals.
2. Plotting the squared residuals against an explanatory variable (one that you think is
related to the errors).
3. Making a separate plot for each explanatory variable you think is contributing to the
errors.
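A rough sketch of this procedure, using simulated data in which the error variance grows with x (matplotlib assumed), might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data where the scatter of y grows with x (a classic heteroscedastic pattern)
rng = np.random.default_rng(2)
x = np.linspace(1, 10, 100)
y = 2.0 * x + rng.normal(0, 0.5 * x)   # error standard deviation increases with x

# Fit a simple regression and compute squared residuals
b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)
squared_residuals = residuals ** 2

# Plot squared residuals against the explanatory variable
plt.scatter(x, squared_residuals)
plt.xlabel("explanatory variable x")
plt.ylabel("squared residual")
plt.title("Squared residuals vs. x (a funnel shape suggests heteroscedasticity)")
plt.show()
```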
Consequences of Heteroscedasticity
Heteroscedasticity does not bias the regression coefficient estimates themselves, but it does
distort their estimated standard errors, so confidence intervals and significance tests based on
those standard errors can be misleading.
Multicollinearity
Multicollinearity is the occurrence of high intercorrelations among two or more
independent variables in a multiple regression model. Multicollinearity can lead to
skewed or misleading results when a researcher or analyst attempts to determine
how well each independent variable can be used most effectively to predict or
understand the dependent variable in a statistical model.
Examples of Multicollinearity
Let’s assume that ABC Ltd, a KPO, has been hired by a pharmaceutical
company to provide research services and statistical analysis on
diseases in India. For this, ABC Ltd has selected age, weight, profession,
height, and health as the prima facie parameters. Some of these predictors,
such as height and weight, are likely to be strongly correlated with each other,
which is exactly the kind of intercorrelation that produces multicollinearity.
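Two common ways to screen for multicollinearity are a pairwise correlation matrix and variance inflation factors (VIFs). The sketch below uses made-up data in which height and weight are deliberately correlated; pandas and statsmodels are assumed:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictor data; height and weight are deliberately correlated
rng = np.random.default_rng(3)
height = rng.normal(170, 10, 200)                    # cm
weight = 0.9 * height - 90 + rng.normal(0, 5, 200)   # kg, strongly tied to height
age = rng.uniform(20, 70, 200)

X = pd.DataFrame({"height": height, "weight": weight, "age": age})

# 1) Pairwise correlations: values near +/-1 flag possible multicollinearity
print(X.corr().round(2))

# 2) Variance inflation factors: a common rule of thumb treats VIF above 5-10 as problematic
X_const = sm.add_constant(X)
for i, col in enumerate(X_const.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(X_const.values, i), 2))
```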
OUTLIERS
In statistics, an outlier is a data point that differs significantly from other observations.
An outlier may be due to variability in the measurement or it may indicate experimental
error; the latter are sometimes excluded from the data set. An outlier can cause
serious problems in statistical analyses.
In a small sample the task of finding outliers with the use of tables can be easy. But
when the number of observations goes into the thousands or millions, it becomes
impossible. This task becomes even more difficult when many variables (the
worksheet columns) are involved. For this, there are other methods, such as flagging
points with extreme standardized (z) scores or points falling far outside the
interquartile range; a small sketch of the latter follows below.
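One such method is the interquartile-range (IQR) rule. The sketch below, using made-up numbers, flags any observation falling more than 1.5 IQRs outside the middle half of the data:

```python
import numpy as np

# Hypothetical observations with one obvious outlier
data = np.array([12.0, 13.5, 11.8, 12.9, 13.1, 12.4, 45.0, 13.0, 12.2, 11.9])

# Interquartile range (IQR) rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print("bounds:", round(lower, 2), round(upper, 2))
print("flagged outliers:", outliers)   # the value 45.0 stands out
```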