Linear Regression
Linear regression is one of the easiest and most popular Machine Learning algorithms.
It is a statistical method that is used for predictive analysis. Linear regression makes
predictions for continuous/real or numeric variables such as sales, salary, age,
product price, etc.
The linear regression algorithm shows a linear relationship between a dependent (y) variable and
one or more independent (x) variables, hence the name linear regression. Since linear
regression shows a linear relationship, it finds how the value of the
dependent variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the
relationship between the variables. Mathematically, it can be represented as:
y= a0+a1x+ ε
Here, the values of the x and y variables are the training data for the linear regression model.
Different values for the weights or coefficients of the line (a0, a1) give different regression
lines, so we need to calculate the best values for a0 and a1 to find the best fit
line. To calculate this, we use a cost function.
Cost function-
o The different values for the weights or coefficients of the line (a0, a1) give different lines of
regression, and the cost function is used to estimate the values of the coefficients for
the best fit line.
o The cost function optimizes the regression coefficients or weights. It measures how well a linear
regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable. This mapping function is also known
as the Hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is
the average of the squared errors between the predicted values and the actual values.
For the above linear equation, MSE can be calculated as:
MSE= (1/N) ∑ (yi - (a1xi + a0))²
Where,
N= total number of observations
yi= actual value of the ith observation
(a1xi + a0)= predicted value of the ith observation
Residuals: The distance between an actual value and the corresponding predicted value is called
the residual. If the observed points are far from the regression line, the residuals will
be high, and so the cost function will be high. If the scatter points are close to the regression
line, the residuals will be small, and hence the cost function will be small.
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost
function.
o A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
o It is done by starting with randomly selected values of the coefficients and then iteratively
updating them to reach the minimum of the cost function, as in the sketch below.
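To make the idea concrete, below is a minimal sketch of gradient descent for the simple linear model y= a0+a1x. The function name, learning rate, and iteration count are illustrative assumptions, not part of the original tutorial:
# minimal gradient descent sketch for y = a0 + a1*x
import numpy as nm

def gradient_descent(x, y, learning_rate=0.01, n_iters=1000):
    x, y = nm.asarray(x, dtype=float), nm.asarray(y, dtype=float)
    a0, a1 = 0.0, 0.0                              # start from arbitrary coefficients
    n = len(x)
    for _ in range(n_iters):
        error = (a0 + a1 * x) - y                  # prediction error at every point
        grad_a0 = (2.0 / n) * error.sum()          # derivative of MSE w.r.t. a0
        grad_a1 = (2.0 / n) * (error * x).sum()    # derivative of MSE w.r.t. a1
        a0 -= learning_rate * grad_a0              # step against the gradient
        a1 -= learning_rate * grad_a1
    return a0, a1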
Model Performance:
The goodness of fit determines how well the regression line fits the set of observations.
The process of finding the best model out of various models is called optimization. It
can be achieved by the below method:
1. R-squared method:
R-squared is a statistical measure of goodness of fit: it represents the proportion of the variation in the dependent variable that is explained by the model (R-squared = Explained variation / Total variation), so a value close to 100% indicates a good fit.
The key point in Simple Linear Regression is that the dependent variable must be a
continuous/real value. However, the independent variable can be measured on
continuous or categorical values. Simple Linear Regression is mainly used to:
o Model the relationship between two variables, such as the relationship between
income and expenditure, or experience and salary.
o Forecast new observations, such as weather forecasting according to
temperature, or the revenue of a company according to its investments in a year.
y= a0+a1x+ ε
Where,
a0= the intercept of the regression line (can be obtained by putting x=0)
a1= the slope of the regression line, which tells whether the line is increasing
or decreasing
ε = the error term (for a good model it will be negligible)
Here we are taking a dataset that has two variables: salary (dependent variable) and
experience (independent variable). The goals of this problem are:
o To find out whether there is any correlation between these two variables
o To find the best fit line for the dataset
o To see how the dependent variable changes with the independent variable
In this section, we will create a Simple Linear Regression model to find out the best
fitting line for representing the relationship between these two variables.
To implement the Simple Linear regression model in machine learning using Python,
we need to follow the below steps:
Step-1: Data Pre-processing:
The first step for creating the Simple Linear Regression model is data pre-processing.
We have already done it earlier in this tutorial, but there will be some changes, which are given in
the below steps:
o First, we will import the three important libraries, which will help us load the
dataset, plot the graphs, and create the Simple Linear Regression model.
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
o Next, we will load the dataset into our code:
data_set= pd.read_csv('Salary_Data.csv')
By executing the above line of code (Ctrl+Enter), we can view the dataset on the
Spyder IDE screen by clicking on the variable explorer option.
The above output shows the dataset, which has two variables: Salary and Experience.
Note: In Spyder IDE, the folder containing the code file must be saved as a working
directory, and the dataset or csv file should be in the same folder.
o After that, we need to extract the dependent and independent variables from the given
dataset. The independent variable is years of experience, and the dependent variable
is salary. Below is the code for it:
x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 1].values
In the above lines of code, we used :-1 for the x variable because we want to
remove the last column from the dataset. For the y variable, we used 1 as the index
because we want to extract the second column (indexing starts from zero).
By executing the above lines of code, we can see in the variable explorer that the x (independent)
and y (dependent) variables have been extracted from the given dataset.
o Next, we will split both variables into a test set and a training set. We have 30
observations, so we will take 20 observations for the training set and 10 observations
for the test set. We split the dataset so that we can train our model on the
training dataset and then test it on the test dataset. A sketch of the code is given
below:
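The exact splitting code is a standard scikit-learn call; in this sketch, test_size= 1/3 reproduces the 20/10 split described above, and random_state= 0 is an assumed seed:
# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 1/3, random_state=0)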
By executing the above code, we will get the x_train, x_test, y_train, and y_test datasets.
Consider the below images:
Test dataset:
Training dataset:
o For Simple Linear Regression, we will not use feature scaling, because the Python libraries
take care of it in this case, so we don't need to perform it here. Our dataset is now
well prepared, and we can start building the Simple Linear
Regression model for the given problem.
Step-2: Fitting the Simple Linear Regression to the Training Set:
Now the second step is to fit our model to the training dataset. To do so, we will import
the LinearRegression class of the linear_model library from scikit-learn. After
importing the class, we will create an object of the class named regressor.
A sketch of the code is given below:
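This is standard scikit-learn usage, consistent with the description above:
#Fitting the Simple Linear Regression model to the training dataset
from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(x_train, y_train)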
In the above code, we have used the fit() method to fit our Simple Linear Regression
object to the training set. In the fit() function, we have passed x_train and y_train,
which are our training data for the independent and dependent variables. We have
fitted our regressor object to the training set so that the model can learn the
correlations between the predictor and target variables. After executing the above lines
of code, we will get the below output.
Output:
The model has now been trained on the dependent (Salary) and independent (Experience)
variables, so it is ready to predict the output for new observations. In this step, we will provide the
test dataset (new observations) to the model to check whether it can predict the
correct output or not.
We will create prediction vectors y_pred and x_pred, which will contain the predictions
for the test set and the training set, respectively.
y_pred= regressor.predict(x_test)
x_pred= regressor.predict(x_train)
On executing the above lines of code, two variables named y_pred and x_pred will be
generated in the variable explorer; they contain the salary predictions for the test
set and the training set, respectively.
Output:
You can check the variables by clicking on the variable explorer option in the IDE, and
also compare the results by comparing the values of y_pred and y_test. By comparing
these values, we can check how well our model is performing.
Now, in this step, we will visualize the training set results. To do so, we will use the
scatter() function of the pyplot library, which we have already imported in the pre-
processing step. The scatter() function will create a scatter plot of the observations.
On the x-axis we will plot the years of experience of the employees, and on the y-axis
their salaries. In the function, we will pass the real values of the training set, i.e.,
the years of experience x_train, the training set of salaries y_train, and the color of the
observations. Here we are using green for the observations, but it can be any
color of your choice.
Now, we need to plot the regression line, so we will use the plot() function of
the pyplot library. In this function, we will pass the years of experience for the training set,
the predicted salaries for the training set x_pred, and the color of the line.
Next, we will give the plot a title using the title() function of
the pyplot library, passing the name "Salary vs Experience (Training Dataset)".
After that, we will assign labels to the x-axis and y-axis using the xlabel() and ylabel()
functions.
Finally, we will display everything in a graph using show(). A sketch of the code is given
below:
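A sketch of the plotting code described above (the exact axis label strings are assumptions):
#visualizing the Training set results
mtp.scatter(x_train, y_train, color="green")   # real training observations
mtp.plot(x_train, x_pred, color="red")         # regression line (predicted training salaries)
mtp.title("Salary vs Experience (Training Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary")
mtp.show()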
Output:
By executing the above lines of code, we will get the below graph plot as an output.
In the above plot, we can see the real observations as green dots and the predicted
values covered by the red regression line. The regression line shows a correlation
between the dependent and independent variables.
The goodness of fit of the line can be judged by calculating the differences between the actual
values and the predicted values. As we can see in the above plot, most of the
observations are close to the regression line, hence our model fits the
training set well.
In the previous step, we visualized the performance of our model on the training
set. Now we will do the same for the test set. The complete code remains the same
as above, except that we use x_test and y_test instead of x_train and
y_train.
Here we also change the color of the observations to differentiate
between the two plots, but this is optional; a sketch follows.
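A sketch of the test set plot. Only the observation color changes to blue; the regression line is still the one learned from the training data:
#visualizing the Test set results
mtp.scatter(x_test, y_test, color="blue")      # real test observations
mtp.plot(x_train, x_pred, color="red")         # same regression line as before
mtp.title("Salary vs Experience (Test Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary")
mtp.show()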
Output:
By executing the above line of code, we will get the output as:
In the above plot, the observations are shown in blue and the predictions are given by
the red regression line. As we can see, most of the observations are close to
the regression line, hence we can say our Simple Linear Regression model is a good model
and is able to make good predictions.
Example:
Prediction of CO2 emission based on engine size and number of cylinders in a car.
o For MLR, the dependent or target variable (Y) must be continuous/real, but the
predictor or independent variables may be of continuous or categorical form.
o Each feature variable must model a linear relationship with the dependent variable.
o MLR tries to fit a regression line through a multidimensional space of data points.
MLR equation:
In Multiple Linear Regression, the target variable (Y) is a linear combination of multiple
predictor variables x1, x2, x3, ..., xn. Since it is an enhancement of Simple Linear
Regression, the same form applies, and the equation becomes:
Y= b0+b1x1+b2x2+ ... +bnxn
Where,
Y= output/response variable
b0= intercept of the line
b1, b2, ..., bn= coefficients of the model
x1, x2, ..., xn= independent/feature variables
Problem Description:
Since we need to find the Profit, it is the dependent variable, and the other four
variables are independent variables. Below are the main steps of deploying the MLR
model:
Step-1: Data Pre-processing:
The first step is data pre-processing, which we have already discussed in this tutorial. This process contains the below steps:
o Importing libraries: Firstly, we will import the libraries which will help in building the
model. Below is the code for it:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
o Importing dataset: Now we will import the dataset (50_CompList), which contains all
the variables. Below is the code for it:
#importing datasets
data_set= pd.read_csv('50_CompList.csv')
Output:
As we can see in the above output, the last column contains categorical values, which
cannot be applied directly when fitting the model. So we need to encode this
variable.
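Before encoding, the independent and dependent variables must be extracted from data_set, as in the simple regression example. A sketch, assuming Profit is the last of the five columns:
#Extracting Independent and dependent Variable
x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 4].values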
As we have one categorical variable (State) that cannot be applied directly to the
model, we will encode it. To encode the categorical variable into numbers, we will
use the LabelEncoder class. But this is not sufficient, because the encoded values still carry a
relational order, which may create a wrong model. So, to remove this problem, we will
use OneHotEncoder, which will create the dummy variables. Below is the code for it:
#Categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_x= LabelEncoder()
x[:, 3]= labelencoder_x.fit_transform(x[:,3])
onehotencoder= OneHotEncoder(categorical_features= [3])
x= onehotencoder.fit_transform(x).toarray()
Here we are only encoding one independent variable, State, as the other variables
are continuous.
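Note that the categorical_features argument used above exists only in older scikit-learn releases; in current versions, the same encoding is done with ColumnTransformer applied to the raw State column. A sketch, assuming index 3 is the State column:
#Modern equivalent using ColumnTransformer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# encode column 3 (State) into dummy variables; pass the other columns through unchanged
ct= ColumnTransformer([("state", OneHotEncoder(), [3])], remainder="passthrough")
x= ct.fit_transform(x)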
Output:
As we can see in the above output, the State column has been converted into dummy
variables (0 and 1), with each dummy variable column corresponding to one
state. We can check this by comparing it with the original dataset: the first column
corresponds to California, the second to Florida, and the third to New York.
Note: We should not use all the dummy variables at the same time; the number used
must be one less than the total number of dummy variables, else it will create the dummy variable
trap.
o Now, we will write a single line of code just to avoid the dummy variable trap. If we do
not remove one of the dummy variables, it may introduce multicollinearity
in the model. A sketch of that line is given below:
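Dropping the first dummy column does the job:
#avoiding the dummy variable trap
x= x[:, 1:]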
After executing it, we can see that the first dummy column has been removed.
o Now we will split the dataset into a training set and a test set. A sketch of the code is given below:
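A sketch using train_test_split; the 80/20 split and random_state= 0 are assumed values:
# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)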
Output: The above code will split the dataset into a training set and a test set. You can
check the output by clicking on the variable explorer option given in Spyder IDE. The
test set and training set will look like the below images:
Test set:
Training set:
Note: In MLR, we will not do feature scaling as it is taken care of by the library, so we
don't need to do it manually.
Step: 2- Fitting our MLR model to the Training set:
Now that we have prepared our dataset, we will fit our regression model to the training
set. It will be similar to what we did in Simple Linear Regression; a sketch follows.
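A minimal sketch, mirroring the simple regression code (regressor here is a new LinearRegression object for the MLR data):
#Fitting the MLR model to the Training set
from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(x_train, y_train)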
Output:
Now, we have successfully trained our model using the training dataset. In the next
step, we will test the performance of the model using the test dataset.
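The prediction itself is a single predict() call on the test features:
#Predicting the Test set result
y_pred= regressor.predict(x_test)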
By executing the above line of code, a new vector will be generated under the variable
explorer option. We can test our model by comparing the predicted values with the test
set values.
Output:
In the above output, we have the predicted result set and the test set. We can check model
performance by comparing these two sets of values index by index. For example, the first index
has a predicted value of $103,015 profit and a test/real value of $103,282 profit. The
difference is only $267, which is a good prediction, so our model is performing well
here.
o We can also check the score for the training dataset and the test dataset. A sketch of the code is given below:
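A sketch of the scoring calls (for LinearRegression, score() returns the R-squared value):
print('Train Score: ', regressor.score(x_train, y_train))
print('Test Score: ', regressor.score(x_test, y_test))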
The above scores tell us that our model is about 95% accurate (R-squared) on the training dataset
and about 93% accurate on the test dataset.
Note: In the next topic, we will see how we can improve the performance of the
model using the Backward Elimination process.
Applications of Multiple Linear Regression:
There are mainly two applications of Multiple Linear Regression: