Simple Linear Regression: Coefficient of Determination

Simple linear regression finds the best relationship between a single input (independent) variable and a single output (dependent) variable. This relationship is represented by a straight line. The coefficient of determination (R²) indicates how well the regression line represents the data, with higher values indicating a better fit. R² ranges from 0 to 1, with values closer to 1 showing that the independent variable better predicts the dependent variable.


Simple Linear Regression

Simple linear regression is used to find the best relationship between a single input variable (the predictor, independent variable, input feature, or input parameter) and a single output variable (the predicted, dependent variable, output feature, or output parameter), provided that both variables are continuous in nature. This relationship describes how the input variable is related to the output variable, and it is represented by a straight line.
To understand this concept, let us first look at scatter plots. Scatter diagrams, or scatter plots, provide a graphical representation of the relationship between two continuous variables.
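To make the idea concrete, here is a minimal Python sketch, assuming numpy and matplotlib are available, that draws a scatter plot of two continuous variables and overlays a least-squares line. The data are generated at random purely for illustration.

import numpy as np
import matplotlib.pyplot as plt

# Illustrative data only: y depends linearly on x, plus random noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)               # independent variable
y = 2.5 * x + 4 + rng.normal(0, 3, 30)   # dependent variable

# Fit a straight line y = b*x + a by least squares.
b, a = np.polyfit(x, y, deg=1)

plt.scatter(x, y, label="observations")
plt.plot(np.sort(x), a + b * np.sort(x), color="red",
         label=f"fit: y = {b:.2f}x + {a:.2f}")
plt.xlabel("x (independent)")
plt.ylabel("y (dependent)")
plt.legend()
plt.show()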

Coefficient of Determination

The coefficient of determination (denoted by R²) is a key output of regression analysis. It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variable.

 The coefficient of determination is the square of the correlation (r) between predicted y scores and actual y scores; thus, it ranges from 0 to 1.
 With linear regression, the coefficient of determination is also equal to the square of the correlation between x and y scores.
 An R² of 0 means that the dependent variable cannot be predicted from the independent variable.
 An R² of 1 means the dependent variable can be predicted without error from the independent variable.
 An R² between 0 and 1 indicates the extent to which the dependent variable is predictable. An R² of 0.10 means that 10 percent of the variance in Y is predictable from X; an R² of 0.20 means that 20 percent is predictable; and so on.

The formula for computing the coefficient of determination for a linear regression model with one independent
variable is given below.

Coefficient of determination. The coefficient of determination (R²) for a linear regression model with one independent variable is:

R² = { (1/N) * Σ [ (xi − x̄) * (yi − ȳ) ] / (σx * σy) }²

where N is the number of observations used to fit the model, Σ is the summation symbol, xi is the x value for observation i, x̄ is the mean x value, yi is the y value for observation i, ȳ is the mean y value, σx is the standard deviation of x, and σy is the standard deviation of y.
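As a quick numerical check, here is a short Python sketch, assuming numpy, that computes R² directly from the formula above on small illustrative arrays. Population standard deviations (numpy's default, ddof = 0) are used to match the 1/N form.

import numpy as np

# Illustrative data only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
r = (1 / n) * np.sum((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())
r_squared = r ** 2
print(r_squared)  # same value as np.corrcoef(x, y)[0, 1] ** 2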

Regression analysis in Excel


Dependent variable (aka criterion variable) is the main factor you are trying to understand and predict.

Independent variables (aka explanatory  variables, or predictors) are the factors that might influence the dependent
variable.

Regression analysis helps you understand how the dependent variable changes when one of the independent variables varies, and allows you to determine mathematically which of those variables really has an impact.

Technically, a regression analysis model is based on the sum of squares, which is a mathematical way to find the
dispersion of data points. The goal of a model is to get the smallest possible sum of squares and draw a line that
comes closest to the data.

In statistics, a distinction is made between simple and multiple linear regression. Simple linear regression models the relationship between a dependent variable and one independent variable using a linear function. If you use two or more explanatory variables to predict the dependent variable, you are dealing with multiple linear regression. If the dependent variable must be modeled as a non-linear function because the data relationships do not follow a straight line, use nonlinear regression instead. The focus of this tutorial is simple linear regression.

As an example, let's take sales numbers for umbrellas for the last 24 months and find out the average monthly
rainfall for the same period. Plot this information on a chart, and the regression line will demonstrate the
relationship between the independent variable (rainfall) and dependent variable (umbrella sales):
Linear regression equation
Mathematically, a linear regression is defined by this equation:
y = bx + a + ε

Where:

 x is an independent variable.
 y is a dependent variable.
 a is the Y-intercept, which is the expected mean value of y when all x variables are equal to 0. On a regression graph, it's the point where the line crosses the Y axis.
 b is the slope of the regression line, which is the rate of change for y as x changes.
 ε is the random error term, which is the difference between the actual value of the dependent variable and its predicted value.
The linear regression equation always has an error term because, in
real life, predictors are never perfectly precise. However, some
programs, including Excel, do the error term calculation behind the
scenes. So, in Excel, you do linear regression using the least
squares method and seek coefficients a and b such that:
y = bx + a

For our example, the linear regression equation takes the following
shape:
Umbrellas sold = b * rainfall + a
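Before turning to Excel's tools, here is a rough Python sketch of what the least-squares method computes for a and b. The rainfall and sales numbers are hypothetical placeholders, not data from the worked example.

import numpy as np

# Hypothetical monthly rainfall (mm) and umbrellas sold.
rainfall = np.array([82.0, 93.0, 104.0, 110.0, 97.0, 88.0])
umbrellas = np.array([15.0, 20.0, 26.0, 30.0, 22.0, 18.0])

# Least-squares slope and intercept: b = Sxy / Sxx, a = ȳ - b*x̄.
x_bar, y_bar = rainfall.mean(), umbrellas.mean()
b = np.sum((rainfall - x_bar) * (umbrellas - y_bar)) / np.sum((rainfall - x_bar) ** 2)
a = y_bar - b * x_bar
print(f"Umbrellas sold = {b:.3f} * rainfall + {a:.3f}")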

There exist a handful of different ways to find a and b. The three main methods to perform linear regression analysis in Excel are:

 Regression tool included with the Analysis ToolPak
 Scatter chart with a trendline
 Linear regression formula

Below you will find detailed instructions on using each method.
Run regression analysis
In this example, we are going to do a simple linear regression in
Excel. What we have is a list of average monthly rainfall for the last
24 months in column B, which is our independent variable
(predictor), and the number of umbrellas sold in column C, which is
the dependent variable. Of course, there are many other factors that
can affect sales, but for now we focus only on these two variables:

With the Analysis ToolPak enabled, carry out these steps to perform regression analysis in Excel:
1. On the Data tab, in the Analysis group, click the Data
Analysis button.

2. Select Regression and click OK.

3. In the Regression dialog box, configure the following settings:
1. Select the Input Y Range, which is your dependent variable. In our case, it's umbrella sales (C1:C25).
2. Select the Input X Range, i.e. your independent variable. In this example, it's the average monthly rainfall (B1:B25). If you are building a multiple regression model, select two or more adjacent columns with different independent variables.
3. Check the Labels box if there are headers at the top of your X and Y ranges.
4. Choose your preferred Output option, a new worksheet in our case.
5. Optionally, select the Residuals checkbox to get the difference between the predicted and actual values.
4. Click OK and observe the regression analysis output created
by Excel.

The second part of the output is Analysis of Variance (ANOVA):

Regression analysis output: coefficients

This section provides specific information about the components of your analysis. The most useful component here is Coefficients. It enables you to build a linear regression equation in Excel:
y = bx + a
Residual Analysis

Residual (or error) represents unexplained (or residual) variation after fitting
a regression model. It is the difference (or left over) between the observed
value of the variable and the value suggested by the regression model.

The difference between the observed value of the dependent variable (y)
and the predicted value (ŷ) is called the residual (e). Each data point
has one residual.

Residual = Observed value – Predicted value

e=y–ŷ

Both the sum and the mean of the residuals are equal to zero. That is, Σe = 0 and ē = 0.

Residual Plots – A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis. If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.

Tools for analysing residuals – For the basic analysis of residuals you will use the usual descriptive tools and scatterplots (plotting both fitted values and residuals, as well as the dependent and independent variables you have included in your model); a sketch of two such plots follows this list.

 A histogram, dot-plot or stem-and-leaf plot lets you examine the residuals: standard regression assumes that residuals are normally distributed. Study the shape of the distribution, and watch for outliers and other unusual features.
 A Q-Q plot lets you assess the normality of the residuals.
 Plot the residuals against the fitted values to zoom in on the distances from the regression line. The picture you see should not show any particular pattern (a random cloud). Look for outliers, groups, systematic features etc. to assess the fit in detail.
 Plot the residuals against each independent variable to find out whether a pattern is clearly related to one of the independents.
 Plot the residuals against other variables to find out whether a structure appearing in the residuals might be explained by another variable (one that you might want to include in a more complex model).
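The following Python sketch, assuming numpy and matplotlib, illustrates two of these diagnostics on synthetic data: residuals plotted against the independent variable, and a histogram of the residuals.

import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: a linear trend plus noise.
rng = np.random.default_rng(1)
x = np.linspace(1, 20, 40)
y = 3 * x + 10 + rng.normal(0, 4, 40)

b, a = np.polyfit(x, y, deg=1)
residuals = y - (a + b * x)   # e = y - ŷ; these sum to approximately zero

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(x, residuals)     # should look like a random cloud around zero
ax1.axhline(0, color="red")
ax1.set_xlabel("x")
ax1.set_ylabel("residual")
ax2.hist(residuals, bins=10)  # check the shape for rough normality
ax2.set_xlabel("residual")
plt.show()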

CONFIDENCE INTERVALS

Confidence intervals are estimates calculated from sample data to determine ranges likely to contain the population parameter of interest (mean, standard deviation). For example, if our population is (2, 6), a confidence interval for the mean suggests that the population mean is likely between 2 and 6. And how confidently can we say this? Obviously 100%, right? Because we know all the values and we can calculate it very easily.
But in real-life problems, this is not the case. It is not always feasible or possible to study the whole population. So what do we do? We take sample data. But can we rely on one sample? No, because different samples from the same population will produce different means.
So we take numerous random samples (from the same population) and
calculate confidence intervals for each sample and a certain percentage of
these ranges will contain the true population parameter.
This certain percentage is called the confidence level. A 95% confidence
level means that out of 100 random samples taken, I expect 95 of the
confidence intervals to contain the true population parameter.
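As an illustration, here is a small Python sketch, assuming scipy is available, that computes a 95% confidence interval for a population mean from a single sample using the t-distribution. The sample values are invented.

import numpy as np
from scipy import stats

sample = np.array([12.1, 9.8, 11.4, 10.2, 13.0, 9.5, 11.9, 10.8])

mean = sample.mean()
sem = stats.sem(sample)   # standard error of the mean
lo, hi = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"95% CI for the mean: ({lo:.2f}, {hi:.2f})")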
PREDICTION INTERVALS

The prediction interval is the range that likely contains the value of the dependent variable for a single new observation, given specific values of the independent variables.
The prediction interval predicts in what range a future individual observation
will fall, while a confidence interval shows the likely range of values
associated with some statistical parameter of the data, such as the
population mean.
MULTIPLE LINEAR REGRESSION

Multiple linear regression refers to a statistical technique that is used to predict the outcome of a variable based on the values of two or more variables. It is sometimes known simply as multiple regression, and it is an extension of linear regression. The variable that we want to predict is known as the dependent variable, while the variables we use to predict the value of the dependent variable are known as independent or explanatory variables.

To find the best-fit line for each independent variable, multiple linear regression calculates three things:

 The regression coefficients that lead to the smallest overall model error.
 The t-statistic of the overall model.
 The associated p-value (how likely it is that the t-statistic would have occurred by chance if the null hypothesis of no relationship between the independent and dependent variables were true).

It then calculates the t-statistic and p-value for each regression coefficient in the model.
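A minimal sketch of such a fit in Python, assuming the statsmodels package is installed, is shown below; the two predictors and the response are synthetic.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x1 = rng.uniform(0, 10, 50)
x2 = rng.uniform(0, 5, 50)
y = 1.5 * x1 - 2.0 * x2 + 7 + rng.normal(0, 1, 50)

X = sm.add_constant(np.column_stack([x1, x2]))  # adds the intercept column
model = sm.OLS(y, X).fit()
print(model.summary())  # coefficients with their t-statistics and p-values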

INTERPRETATION OF REGRESSION
ANALYSIS
If you've ever wondered how two or more pieces of data relate to each other (e.g.
how GDP is impacted by changes in unemployment and inflation), or if you've
ever had your boss ask you to create a forecast or analyze predictions based on
relationships between variables, then learning regression analysis would be well
worth your time.

In this article, you'll learn the basics of simple linear regression, sometimes called
'ordinary least squares' or OLS regression—a tool commonly used in forecasting
and financial analysis. We will begin by learning the core principles of regression,
first learning about covariance and correlation, and then moving on to building
and interpreting a regression output. Popular business software such
as Microsoft Excel can do all the regression calculations and outputs for you, but
it is still important to learn the underlying mechanics.

Variables

At the heart of a regression model is the relationship between two different variables, called the dependent and independent variables. For instance, suppose you want to forecast sales for your company and you've concluded that your company's sales go up and down depending on changes in GDP.

The sales you are forecasting would be the dependent variable because their
value "depends" on the value of GDP and the GDP would be the independent
variable. You would then need to determine the strength of the relationship
between these two variables in order to forecast sales. If GDP
increases/decreases by 1%, how much will your sales increase or decrease?
Covariance

Cov(x, y) = Σ (xn − x̄)(yn − ȳ) / N
The formula to calculate the relationship between two variables is
called covariance. This calculation shows you the direction of the relationship. If
one variable increases and the other variable tends to also increase, the
covariance would be positive. If one variable goes up and the other tends to go
down, then the covariance would be negative.

The actual number you get from calculating this can be hard to interpret because
it isn't standardized. A covariance of five, for instance, can be interpreted as a
positive relationship, but the strength of the relationship can only be said to be
stronger than if the number was four or weaker than if the number was six.

Correlation Coefficient

Correlation = ρxy = Cov(x, y) / (sx * sy)
We need to standardize the covariance in order to allow us to better interpret and
use it in forecasting, and the result is the correlation calculation. The correlation
calculation simply takes the covariance and divides it by the product of
the standard deviation of the two variables. This will bind the correlation between
a value of -1 and +1.

A correlation of +1 can be interpreted to suggest that both variables move perfectly positively with each other, and a -1 implies they are perfectly negatively correlated. In our previous example, if the correlation is +1 and the GDP increases by 1%, then sales would increase by 1%. If the correlation is -1, a 1% increase in GDP would result in a 1% decrease in sales, the exact opposite.
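The following Python sketch computes both quantities by hand on illustrative data, showing that the correlation is simply the covariance rescaled by the product of the two standard deviations.

import numpy as np

# Illustrative data only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.5, 3.5, 3.0, 5.0])

cov = np.mean((x - x.mean()) * (y - y.mean()))  # population covariance
rho = cov / (x.std() * y.std())                 # bounded between -1 and +1
print(cov, rho)  # rho matches np.corrcoef(x, y)[0, 1]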

Regression Equation
Now that we know how the relative relationship between the two variables is
calculated, we can develop a regression equation to forecast or predict the
variable we desire. Below is the formula for a simple linear regression. The "y" is
the value we are trying to forecast, the "b" is the slope of the regression line, the
"x" is the value of our independent value, and the "a" represents the y-intercept.
The regression equation simply describes the relationship between the
dependent variable (y) and the independent variable (x).

y = bx + a

The intercept, or "a," is the value of y (dependent variable) if the value of x
(independent variable) is zero, and so is sometimes simply referred to as the
'constant.' So if there was no change in GDP, your company would still make
some sales. This value, when the change in GDP is zero, is the intercept. Take a
look at the graph below to see a graphical depiction of a regression equation. In
this graph, there are only five data points represented by the five dots on the
graph. Linear regression attempts to estimate a line that best fits the data (a line
of best fit) and the equation of that line results in the regression equation.

Regressions in Excel
Now that you understand some of the background that goes into a regression
analysis, let's do a simple example using Excel's regression tools. We'll build on
the previous example of trying to forecast next year's sales based on changes in
GDP. The next table lists some artificial data points, but numbers like these are easily accessible in real life.

Year    Sales    GDP
2015    100      1.00%
2016    250      1.90%
2017    275      2.40%
2018    200      2.60%
2019    300      2.90%

We can see that there is going to be a positive correlation between sales and GDP: both tend to go up together. Using Excel, all you have to do is click the Tools drop-down menu, select Data Analysis and from there choose Regression. The popup box is easy to fill in from there; your Input Y Range is your "Sales" column and your Input X Range is the change-in-GDP column. Choose the output range for where you want the data to show up on your spreadsheet and press OK. You should see something similar to what is given in the table below:

Regression Statistics                    Coefficients
Multiple R           0.8292243           Intercept    34.58409
R Square             0.687613            GDP          88.15552
Adjusted R Square    0.583484
Standard Error       51.021807
Observations         5
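As a cross-check, this short Python sketch, assuming statsmodels is installed, refits the five data points from the sales and GDP table above; it should reproduce the intercept, the GDP coefficient and R Square up to rounding.

import numpy as np
import statsmodels.api as sm

gdp = np.array([1.00, 1.90, 2.40, 2.60, 2.90])  # % change in GDP
sales = np.array([100, 250, 275, 200, 300])

model = sm.OLS(sales, sm.add_constant(gdp)).fit()
print(model.params)    # intercept ≈ 34.58, GDP coefficient ≈ 88.16
print(model.rsquared)  # ≈ 0.6876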

Interpretation

The major outputs you need to be concerned about for simple linear regression are the R-squared, the intercept (constant) and the GDP's beta (b) coefficient. The R-squared number in this example is 68.7%. This shows how well our model predicts or forecasts future sales, suggesting that the explanatory variable in the model predicted 68.7% of the variation in the dependent variable. Next, we have an intercept of 34.58, which tells us that if the change in GDP were forecast to be zero, our sales would be about 35 units. And finally, the GDP beta (slope) coefficient of 88.16 tells us that if GDP increases by 1%, sales will likely go up by about 88 units.

The Bottom Line

So how would you use this simple model in your business? Well if your research
leads you to believe that the next GDP change will be a certain percentage, you
can plug that percentage into the model and generate a sales forecast. This can
help you develop a more objective plan and budget for the upcoming year.

Of course, this is just a simple regression, and there are models you can build that use several independent variables, called multiple linear regressions. But multiple linear regression is more complicated and has several issues that would need another article to discuss.

HETEROSCEDASTICITY
The word “heteroscedasticity” comes from the Greek, and quite literally means data with a different (hetero) dispersion (skedasis). In simple terms, heteroscedasticity is any set of data that isn't homoscedastic. More technically, it refers to data whose variability (scatter) is unequal across the range of a second, predictor variable.

Heteroscedastic data tends to follow a cone shape on a scatter graph.

How to Detect Heteroscedasticity

A residual plot can suggest (but not prove) heteroscedasticity. Residual plots are created by:
1. Calculating the squared residuals.
2. Plotting the squared residuals against an explanatory variable (one that you think is related to the errors).
3. Making a separate plot for each explanatory variable you think is contributing to the errors.

Consequences of Heteroscedasticity

Severe heteroscedastic data can give you a variety of problems:

 OLS will not give you the estimator with the smallest variance (i.e. your estimators will not be efficient).
 Significance tests will run either too high or too low.
 Standard errors will be biased, along with their corresponding test statistics and confidence intervals.

How to Deal with Heteroscedastic Data

If your data is heteroscedastic, it would be inadvisable to run a regression on the data as is. There are a couple of things you can try if you need to run a regression:

1. Give data points that produce a large scatter less weight (weighted least squares).
2. Transform the Y variable to achieve homoscedasticity. For example, use a Box-Cox transformation, as sketched below.
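A minimal sketch of the second approach, assuming scipy is available. Box-Cox requires strictly positive Y values, and the transformation parameter lambda is estimated from the (synthetic) data.

import numpy as np
from scipy import stats

# Synthetic heteroscedastic data: the spread of y grows with x.
rng = np.random.default_rng(3)
x = np.linspace(1, 10, 60)
y = np.exp(0.3 * x + rng.normal(0, 0.2, 60))  # strictly positive

y_transformed, lam = stats.boxcox(y)  # estimates lambda by maximum likelihood
print(f"estimated lambda: {lam:.3f}")
# Refit the regression on y_transformed and re-examine the residual plot.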

Multicollinearity
Multicollinearity is the occurrence of high intercorrelations among two or more
independent variables in a multiple regression model. Multicollinearity can lead to
skewed or misleading results when a researcher or analyst attempts to determine
how well each independent variable can be used most effectively to predict or
understand the dependent variable in a statistical model.

There are four types of multicollinearity:

 1 – Perfect multicollinearity – exists when the independent variables in the equation have a perfect linear relationship.
 2 – High multicollinearity – refers to a linear relationship between two or more independent variables that are not perfectly correlated with each other.
 3 – Structural multicollinearity – caused by the researcher, through the way different independent variables are inserted into the equation.
 4 – Data-based multicollinearity – caused by experiments that are poorly designed by the researcher.
Causes of Multicollinearity
Common causes include the choice of the independent variables themselves, changes in the parameters of the variables (where even a small change in a variable has a significant impact on the result), and data collection, i.e. the way the sample is drawn from the selected population.

Examples of Multicollinearity
Let's assume that ABC Ltd, a KPO, has been hired by a pharmaceutical company to provide research services and statistical analysis on diseases in India. For this, ABC Ltd has selected age, weight, profession, height, and health as the prima facie parameters.

In the above example, there is a multicollinearity situation, since the independent variables selected for the study are directly correlated with each other. Hence it would be advisable for the researcher to adjust the variables before starting the project, since the results will be directly impacted by the selected variables.
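One standard way to quantify multicollinearity is the variance inflation factor (VIF). Below is a Python sketch, assuming statsmodels is installed, with one predictor deliberately constructed to be nearly collinear with another, so its VIF comes out large.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
x1 = rng.normal(0, 1, 100)
x2 = 0.95 * x1 + rng.normal(0, 0.1, 100)  # almost a copy of x1
x3 = rng.normal(0, 1, 100)                # independent predictor

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i in range(1, X.shape[1]):            # skip the constant column
    print(f"VIF for x{i}: {variance_inflation_factor(X, i):.1f}")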

OUTLIERS
In statistics, an outlier is a data point that differs significantly from other observations.
An outlier may be due to variability in the measurement or it may indicate experimental
error; the latter are sometimes excluded from the data set. An outlier can cause
serious problems in statistical analyses.

How do you identify which records are outliers?

Find outliers using tables

The simplest way to find outliers in your data is to look directly at the data table or worksheet – the dataset, as data scientists call it. The following table clearly exemplifies a typing error, that is, an error in the input of the data. The age field for the individual Antony Smith certainly does not represent an age of 470 years. Looking at the table it is possible to identify the outlier, but it is difficult to say what the correct age would be. There are several possibilities for the right age, such as 47, 70 or even 40 years.

In a small sample the task of finding outliers with the use of tables can be easy. But
when the number of observations goes into the thousands or millions, it becomes
impossible. This task becomes even more difficult when many variables (the
worksheet columns) are involved. For this, there are other methods.

Find outliers using graphs

One of the best ways to identify outlier data is by using charts. When plotting a chart, the analyst can clearly see that something different exists. Here are some examples that illustrate how outliers show up in graphics.
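When the data are too large to inspect by eye or to chart variable by variable, a simple programmatic rule can flag candidates. Here is a sketch of the interquartile-range (IQR) rule in Python; the ages are invented and include the kind of typing error described above.

import numpy as np

ages = np.array([23, 31, 47, 35, 52, 28, 470, 44, 39])  # 470 is a typo

q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(ages[(ages < low) | (ages > high)])  # -> [470]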
