Prerequisites
To start with linear regression, you should be aware of a few basic statistical concepts, i.e.,
Correlation (r) – explains the strength and direction of the relationship between two variables; possible values range from -1 to +1.
Linear regression also relies on the following assumptions (a quick R sketch for checking the residual-based ones follows the list):
i. Linearity & additivity: there should be a linear relationship between the dependent and independent variables, and the effect of a change in each independent variable on the dependent variable should be additive.
ii. Normality of error distribution: the differences between actual and predicted values (residuals) should be normally distributed.
iii. Homoscedasticity: the variance of the errors should be constant with respect to,
a. Time
b. The predictions
iv. Statistical independence of errors: the error terms (residuals) should not be correlated with one another. E.g., in time series data there should be no correlation between consecutive error terms.
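A quick way to eyeball the residual-based assumptions in R is sketched below; R's built-in cars dataset is used purely as a stand-in example and is not part of the discussion that follows.

# Fit a simple model on R's built-in 'cars' data just to illustrate the checks
fit_cars <- lm(dist ~ speed, data = cars)
res <- resid(fit_cars)

# (ii) Normality of errors: the Q-Q plot of residuals should be roughly a straight line
qqnorm(res); qqline(res)

# (iii) Homoscedasticity: residuals vs fitted values should show no funnel shape
plot(fitted(fit_cars), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# (iv) Independence of errors: lag-1 autocorrelation of residuals should be close to 0
cor(res[-1], res[-length(res)])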
Linear Regression Line
While doing linear regression, our objective is to fit a line through the distribution that is nearest to most of the points, thereby reducing the distance (error term) between the data points and the fitted line.
For example, in the figure above, the dots (left) represent various data points and the line (right) represents an approximate line that can explain the relationship between the 'x' and 'y' axes. Through linear regression we try to find such a line. For example, if we have one dependent variable 'Y' and one independent variable 'X', the relationship between 'X' and 'Y' can be represented in the form of the following equation:
Y = B0 + B1X
Where,
Y = Dependent variable
X = Independent variable
B0 = Intercept (the value of Y when X = 0)
B1 = Slope (the coefficient of X)
The regression line minimizes the sum of the squared residuals; that is why this method of fitting a linear regression is known as "Ordinary Least Squares (OLS)".
Food for thought: Why minimize the sum of squared errors and not simply the sum of errors?
B1 explains the change in Y for a one-unit change in X. In other words, if we increase the value of 'X' by one unit, B1 tells us the corresponding change in the value of Y.
Food for thought: Will the correlation coefficient between 'X' and 'Y' be the same as B1?
x    y    Predicted 'y'
1    2    B0 + B1*1
2    1    B0 + B1*2
3    3    B0 + B1*3
4    6    B0 + B1*4
5    9    B0 + B1*5
6    11   B0 + B1*6
7    13   B0 + B1*7
8    15   B0 + B1*8
9    17   B0 + B1*9
10   20   B0 + B1*10
Where B0 and B1 are estimated from the data (Table 1):
Mean of x = 5.5
Mean of y = 9.7
B1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² ≈ 2.16
B0 = Mean of y - B1 × Mean of x = -2.2
Hence, the least squares regression equation becomes:
Y = -2.2 + 2.16*x
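As a sanity check, the same coefficients can be recovered in R by fitting the ten points from the table above with lm() (a minimal sketch):

# The ten (x, y) pairs from the table above
x <- 1:10
y <- c(2, 1, 3, 6, 9, 11, 13, 15, 17, 20)

fit <- lm(y ~ x)   # ordinary least squares fit
coef(fit)          # intercept ≈ -2.2, slope ≈ 2.16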
Let's see how our predictions look using this equation:
x    Y - Actual    Y - Predicted
1    2             -0.04
2    1              2.12
3    3              4.28
4    6              6.44
5    9              8.60
6    11            10.76
7    13            12.92
8    15            15.08
9    17            17.24
10   20            19.40
With only 10 data points to fit a line, our predictions are not perfectly accurate, but the correlation between 'Y - Actual' and 'Y - Predicted' turns out to be very high; both series move together, as a plot of actual versus predicted values would show.
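Continuing the sketch above (reusing x, y and fit), the predicted values and the actual-vs-predicted correlation can be obtained as follows:

pred <- predict(fit)                              # Y - Predicted for x = 1..10
data.frame(x, actual = y, predicted = round(pred, 2))
cor(y, pred)                                      # correlation between actual and predicted, ≈ 0.99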
Model Performance
Once you build the model, the next logical question is whether your model is good enough to predict future outcomes, i.e., whether the relationship you have established between the dependent and independent variables is good enough. For this purpose there are various metrics we look at:
i. R – Square (R2)
The formula for calculating R2 is:
R2 = (TSS - RSS) / TSS = 1 - RSS / TSS
Total Sum of Squares (TSS): TSS is a measure of the total variance in the response/dependent variable Y, and can be thought of as the amount of variability inherent in the response before the regression is performed.
Residual Sum of Squares (RSS): RSS measures the amount of variability that is left unexplained after performing the regression.
(TSS – RSS) measures the amount of variability in the response that is explained (or removed) by performing the regression.
For simple linear regression, R2 also equals the square of the correlation coefficient r between x and y, where r = Σ(x - x̄)(y - ȳ) / (N·σx·σy), N is the number of observations used to fit the model, and σx and σy are the (population) standard deviations of x and y.
R2 ranges from 0 to 1.
R2 of 0 means that the dependent variable cannot be predicted from the independent
variable
R2 of 1 means the dependent variable can be predicted without error from the
independent variable
An R2 between 0 and 1 indicates the extent to which the dependent variable is
predictable. An R2 of 0.20 means that 20 percent of the variance in Y is predictable
from X; an R2 of 0.40 means that 40 percent is predictable; and so on.
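For the toy model above, R2 can be computed by hand from the TSS/RSS definitions, or read directly from summary() (a sketch, reusing y and fit from the earlier sketch):

tss <- sum((y - mean(y))^2)        # total sum of squares
rss <- sum(resid(fit)^2)           # residual sum of squares
(tss - rss) / tss                  # R2 = 1 - RSS/TSS, ≈ 0.98 here
summary(fit)$r.squared             # the same value, as reported by summary()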
ii. Root Mean Square Error (RMSE)
RMSE measures the dispersion of the predicted values from the actual values. The formula for calculating RMSE is:
RMSE = sqrt( Σ (Y-Actual - Y-Predicted)² / N )
Though RMSE is a good measure of error, the issue with it is that it is sensitive to the range of your dependent variable. If your dependent variable has a narrow range your RMSE will be low, and if it has a wide range your RMSE will be high. Hence, RMSE is best used to compare different iterations of a model on the same data, rather than across models built on different dependent variables.
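For the same toy model, RMSE is simply the square root of the mean squared residual (a sketch, reusing fit from earlier):

sqrt(mean(resid(fit)^2))   # RMSE of the fitted values against the actual y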
Multiple Linear Regression: How is it different?
Fundamentally there is no difference between 'Simple' and 'Multiple' linear regression. Both work on the OLS principle, and the procedure for obtaining the best line is also similar. In the case of the latter, the regression equation takes the form:
Y = B0 + B1X1 + B2X2 + B3X3 + ...
Where,
X1, X2, X3, ... are the independent variables and B1, B2, B3, ... are their respective coefficients (B0 is the intercept).
From the correlation matrix we can see that not all the variables are strongly correlated with Petal.Width, hence we will only include the significant variables to build our model, i.e. 'Sepal.Length' and 'Petal.Length'. Let's run our first model:
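A minimal R sketch of how such a correlation matrix and first model can be produced, assuming R's built-in iris dataset:

data(iris)

# Correlation matrix of the four numeric columns (Petal.Width is the target)
cor(iris[, 1:4])

# First model: Petal.Width regressed on the two selected predictors
fit_iris <- lm(Petal.Width ~ Sepal.Length + Petal.Length, data = iris)
summary(fit_iris)   # coefficient estimates, with p-values in the last column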
Here, the Intercept estimate is the same as B0 in the previous examples, while the coefficient values written next to the variable names are nothing but our beta coefficients (B1, B2, B3, etc.). Hence we can write our linear regression equation as:
Petal.Width = -0.008996 - 0.082218*Sepal.Length + 0.449376*Petal.Length
When we run a linear regression, there is an underlying assumption that there is some relationship between the dependent and independent variables. To validate this assumption, the linear regression routine tests the hypothesis that "the beta coefficient Bi for an independent variable Xi is 0". The p-value we see in the last column is the probability of observing a coefficient estimate at least this extreme if that hypothesis were true. Generally, if the p-value is less than or equal to 0.05 we reject this hypothesis and establish a relationship between the dependent and independent variable.
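If needed, the p-values can be pulled directly out of the summary table (a sketch, continuing with fit_iris from above):

pvals <- summary(fit_iris)$coefficients[, "Pr(>|t|)"]
pvals
pvals <= 0.05   # TRUE where the relationship is treated as significant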
Multi-collinearity
Multi-collinearity measures the strength of the relationships among the independent variables themselves. If there is multi-collinearity in our data, our beta coefficients may be misleading. VIF (Variance Inflation Factor) is used to identify multi-collinearity: if a variable's VIF value is greater than 4, we exclude that variable from the model-building exercise.
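One common way to compute VIF values in R is the vif() function from the car package (the package choice is an assumption; any VIF implementation will do). Variables with VIF greater than 4 would then be dropped before refitting:

# install.packages("car")  # assumed to be available; any VIF implementation works
library(car)
vif(fit_iris)   # one VIF value per predictor; values above 4 suggest multi-collinearity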
Iterative Models
Model building is not a one-step process; one needs to run multiple iterations in order to reach a final model. Take care of the p-value and VIF for variable selection, and R-Square and MAPE for model selection.
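MAPE (Mean Absolute Percentage Error) is not defined above; a commonly used formulation, sketched in R against the iris model from earlier, is:

# MAPE: average absolute error expressed as a percentage of the actual values
mape <- function(actual, predicted) {
  mean(abs((actual - predicted) / actual)) * 100
}
mape(iris$Petal.Width, predict(fit_iris))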