
Sardar Patel Institute of Technology, Mumbai

Department of Electronics and Telecommunication Engineering


T.E. Sem-V (2018-2019)
ETL54-Statistical Computational Laboratory
Lab-3: Regression Analysis and Modeling

Name: Vishal Ramina        UID No.: 2017120049

Objective: To carry out linear regression (including multiple regression) and to build a
regression model

Outcomes:
1. To carry out linear regression (including multiple regression)
2. To build a regression model using both forward and backward stepwise
procedures
3. To plot regression models
4. To add lines of best fit to regression plots

System Requirements: Ubuntu OS with R and RStudio installed


Introduction to Linear Regression
Regression analysis is a statistical tool for determining relationships between different
types of variables. Variables that remain unaffected by changes in other variables are
known as independent variables (also called predictor or explanatory variables), while
those that are affected are known as dependent variables (also called response
variables).
Linear regression is a statistical procedure used to predict the value of a response
variable on the basis of one or more predictor variables.
There are two types of linear regression in R:

Simple Linear Regression – the value of the response variable depends on a single
explanatory variable.
Multiple Linear Regression – the value of the response variable depends on more
than one explanatory variable.

Some common applications of linear regression include forecasting GDP and oil and
gas prices, medical diagnosis, and capital asset pricing (CAPM).

Simple Linear Regression in R

Simple linear regression enables us to find a relationship between a continuous
dependent variable Y and a continuous independent variable X. It is assumed that the
values of X are controlled and not subject to measurement error, and that the
corresponding values of Y are observed.
The general simple linear regression model to evaluate the value of Y for a value of
X is:

yᵢ = β₀ + β₁xᵢ + εᵢ

Here, the iᵗʰ data point, yᵢ, is determined by the predictor value xᵢ;
β₀ and β₁ are the regression coefficients (intercept and slope);
εᵢ is the error term for the iᵗʰ observation.
Regression analysis is implemented to do the following:

1. Establish a relationship between the independent (x) and dependent (y) variables.
2. Predict the value of y based on a set of values x₁, x₂, …, xₙ.
3. Identify which independent variables are important for explaining the dependent
variable, thereby establishing a more precise and accurate causal relationship
between the variables.
Multiple Linear Regression in R
In the real world, you may find situations where you have to deal with more than one
predictor variable to evaluate the value of the response variable. In this case, a simple
linear model cannot be used, and you need multiple linear regression in R to perform
such analysis with multiple predictor variables.
A multiple linear regression model with two explanatory variables can be given as:

yᵢ = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + εᵢ

Here, the iᵗʰ data point, yᵢ, is determined by the levels of the two continuous
explanatory variables x₁ᵢ and x₂ᵢ, by the three parameters β₀, β₁, and β₂ of the model,
and by the residual εᵢ of point i from the fitted surface.
The general multiple regression model can be represented as:

yᵢ = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + … + βₚxₚᵢ + εᵢ

Procedure:
Step-1: Open RStudio and go to the R console (>)
> sessionInfo()
> install.packages("DAAG")
> library(lattice)
> library(DAAG)
> ?cars    # help page for the built-in cars data set

Example Problem
For this analysis, we will use the cars dataset that comes with R by default. cars is a
standard built-in dataset, which makes it convenient to demonstrate linear regression in
a simple, easy-to-understand fashion. You can access this dataset simply by typing
cars in your R console. You will find that it consists of 50 observations (rows) and
2 variables (columns) – dist and speed. Let's print out the first six observations:
> head(cars)    # display the first 6 observations
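This prints:

  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10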
Before we begin building the regression model, it is a good practice to analyze and
understand the variables. The graphical analysis and correlation study below will help
with this.

Graphical Analysis
The aim of this exercise is to build a simple regression model that we can use to
predict Distance (dist) by establishing a statistically significant linear relationship
with Speed (speed). But before jumping in to the syntax, lets try to understand these
variables graphically. Typically, for each of the independent variables (predictors), the
following plots are drawn to visualize the following behavior:

1. Scatter plot​: Visualize the linear relationship between the predictor and
response
2. Box plot​: To spot any outlier observations in the variable. Having outliers in
your predictor can drastically affect the predictions as they can easily affect
the direction/slope of the line of best fit.

3. Density plot​: To see the distribution of the predictor variable. Ideally, a close
to normal distribution
(a bell shaped curve),
without being skewed
to the left or right is
preferred. Let us see
how to make each one
of them.
Scatter Plot
Scatter plots can help visualize any linear relationships between the dependent
(response) variable and independent (predictor) variables. Ideally, if you are having
multiple predictor variables, a scatter plot is drawn for each one of them against the
response, along with the line of best as seen below.
scatter.smooth(x=cars$speed, y=cars$dist, main="Dist ~ Speed") # scatterplot
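The box and density plots from the list above can be drawn in a similar way. A
minimal sketch using base R graphics (the 2 x 2 layout is our own choice):

> par(mfrow = c(2, 2))                                 # 2 x 2 plotting grid
> boxplot(cars$speed, main = "Box plot: Speed")        # spot outliers in the predictor
> boxplot(cars$dist, main = "Box plot: Distance")      # spot outliers in the response
> plot(density(cars$speed), main = "Density: Speed")   # distribution of the predictor
> plot(density(cars$dist), main = "Density: Distance") # distribution of the response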

Correlation
Correlation is a statistical measure that suggests the level of linear dependence
between two variables that occur in a pair – just like what we have here with speed
and dist. Correlation can take values between -1 and +1. If we observe that for every
instance where speed increases, the distance also increases along with it, then there is
a high positive correlation between them, and therefore the correlation between them
will be closer to 1. The opposite is true for an inverse relationship, in which case the
correlation between the variables will be close to -1.
A value closer to 0 suggests a weak relationship between the variables. A low
correlation (-0.2 < x < 0.2) probably suggests that much of the variation in the response
variable (Y) is unexplained by the predictor (X), in which case we should probably
look for better explanatory variables.
> cor(cars$speed, cars$dist)    # calculate correlation between speed and distance
[1] 0.8068949

To Build Linear Model

Refer to the following online regression tutorial, perform all the steps, and interpret
the results:
1. http://r-statistics.co/Linear-Regression.html
2. Read the PPT shared on Google Classroom

Important points to remember:
1. Understanding the lm() function
2. Linear regression diagnostics using the summary() function
3. Statistical significance: the p-value; null and alternate hypotheses
4. How to calculate the t-statistic and p-values
5. How to calculate AIC and BIC
6. How to know if the model is the best fit for your data
The most common metrics to look at while selecting the model are listed below (a
diagnostic sketch follows the table):

STATISTIC                                  CRITERION
R-Squared                                  Higher the better (> 0.70)
Adj R-Squared                              Higher the better
F-Statistic                                Higher the better
Std. Error                                 Closer to zero the better
t-statistic                                Should be greater than 1.96 for the p-value
                                           to be less than 0.05
AIC                                        Lower the better
BIC                                        Lower the better
Mallows Cp                                 Should be close to the number of predictors
                                           in the model
MAPE (Mean absolute percentage error)      Lower the better
MSE (Mean squared error)                   Lower the better
Min_Max Accuracy =>                        Higher the better
mean(min(actual, predicted)/max(actual, predicted))
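As a quick illustration of points 1–6 for the cars data, the model can be built and its
diagnostics extracted as follows (a minimal sketch; the variable name linearMod is
our own choice):

> linearMod <- lm(dist ~ speed, data = cars)   # fit dist as a linear function of speed
> summary(linearMod)                           # coefficients, t-statistics, p-values, R-squared, F-statistic
> AIC(linearMod)                               # Akaike information criterion (lower is better)
> BIC(linearMod)                               # Bayesian information criterion (lower is better)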
7. Predicting Linear Models (a sketch of these steps follows):

Step 1: Create the training (development) and test (validation) data samples from the
original data.

Step 2: Develop the model on the training data and use it to predict the distance on
the test data.

Step 3: Review diagnostic measures.

Step 4: Calculate prediction accuracy and error rates.
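A minimal sketch of steps 1–4 (the 80:20 split ratio, the seed, and the variable names
are our assumptions):

> set.seed(100)                                        # make the sampling reproducible
> trainRows <- sample(1:nrow(cars), 0.8 * nrow(cars))  # row indices for 80% training data
> trainData <- cars[trainRows, ]                       # training (development) sample
> testData  <- cars[-trainRows, ]                      # test (validation) sample
> distMod   <- lm(dist ~ speed, data = trainData)      # build the model on training data
> summary(distMod)                                     # review diagnostic measures
> distPred  <- predict(distMod, testData)              # predict distance on test data
> actuals   <- testData$dist
> mean(apply(cbind(actuals, distPred), 1, min) /
+      apply(cbind(actuals, distPred), 1, max))        # min-max accuracy (higher is better)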

8. Cross validation: k-fold cross validation (a sketch follows below)
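Since the DAAG package was loaded in the Procedure, its CVlm() function performs
k-fold cross validation. A sketch (the first argument is named df in older DAAG
versions and data in newer ones, so it is passed positionally here; m sets the number
of folds):

> cvResults <- CVlm(cars, form.lm = dist ~ speed, m = 5)   # 5-fold cross validation
> attr(cvResults, 'ms')                                    # mean squared error across folds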

Build a regression model using the forward stepwise procedure.


1. Look at the mtcars data item. This is built into R.
> str(mtcars)
2. Start by creating a blank model using mpg as the response variable:
> mtcars.lm = lm(mpg ~ 1, data = mtcars)
3. Determine which predictor variable is the best starting candidate:
> add1(mtcars.lm, mtcars, test = 'F')
4. Add the best predictor variable to the blank model:
> mtcars.lm = lm(mpg ~ wt, data = mtcars)
5. Do a quick check of the model summary:
> summary(mtcars.lm)
6. Now look again at the remaining candidate predictor variables:
> add1(mtcars.lm, mtcars, test = 'F')
7. Add the next best predictor variable to your regression model:
> mtcars.lm = lm(mpg ~ wt + cyl, data = mtcars)
8. Now check the model summary once more:
> summary(mtcars.lm)
9. Check the remaining variables to see if there are any other candidate
predictors to add:
> add1(mtcars.lm, mtcars, test = 'F')
10. If no remaining variable yields a significant improvement, the current model
remains the most adequate. A backward stepwise sketch follows.
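The outcomes also call for a backward stepwise procedure. A minimal sketch using
drop1(), mirroring the forward steps above (which term to drop should follow the
F-tests; vs is shown purely for illustration):

> mtcars.lm2 = lm(mpg ~ ., data = mtcars)        # start from the full model
> drop1(mtcars.lm2, test = 'F')                  # find the least useful predictor
> mtcars.lm2 = update(mtcars.lm2, . ~ . - vs)    # drop a weak term and repeat
> summary(mtcars.lm2)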

Comments on Result:

Describe the following with respect to linear regression, building a linear model, and
prediction.
1) List types of regression
Ans- There are various types of regression models, but the three most widely used
types are:
1) Simple regression
2) Polynomial regression
3) Logistic regression

2) What is a statistical significance test?

Ans- Tests for statistical significance tell us the probability that the relationship we
think we have found is due only to random chance. They tell us the probability
that we would be making an error if we assumed that a relationship exists.

3) How to know if the model is the best fit for your data?

Ans- A well-fitting regression model results in predicted values close to the observed
data values. The mean model, which uses the mean for every predicted value,
would generally be used if there were no informative predictor variables. The
fit of a proposed regression model should therefore be better than the fit of the
mean model.

Three statistics are used in Ordinary Least Squares (OLS) regression to evaluate
model fit: R-squared, the overall F-test, and the Root Mean Square Error
(RMSE). All three are based on two sums of squares: Sum of Squares Total
(SST) and Sum of Squares Error (SSE). SST measures how far the data are
from the mean, and SSE measures how far the data are from the model’s
predicted values. Different combinations of these two values provide different
information about how the regression model compares to the mean model.
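For example, these sums of squares can be computed directly for the cars model (a
sketch; linearMod as fitted earlier):

> linearMod <- lm(dist ~ speed, data = cars)
> sse <- sum(residuals(linearMod)^2)            # Sum of Squares Error: distance from model predictions
> sst <- sum((cars$dist - mean(cars$dist))^2)   # Sum of Squares Total: distance from the mean
> 1 - sse / sst                                 # one such combination: R-squared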

4) How to test a model's performance?

Ans- In regression models, the most commonly used evaluation metrics include the
following (a sketch computing them follows the list):
1. R-squared (R2), which is the proportion of variation in the outcome that is
explained by the predictor variables. In multiple regression models, R2
corresponds to the squared correlation between the observed outcome values
and the values predicted by the model. The higher the R-squared, the better the
model.
2. Root Mean Squared Error (RMSE), which measures the average error made
by the model in predicting the outcome for an observation. Mathematically,
the RMSE is the square root of the mean squared error (MSE), which is the
average squared difference between the observed actual outcome values and
the values predicted by the model. So, MSE = mean((observeds -
predicteds)^2) and RMSE = sqrt(MSE). The lower the RMSE, the better the
model.
3. Residual Standard Error (RSE), also known as the model sigma, is a variant of
the RMSE adjusted for the number of predictors in the model. The lower the
RSE, the better the model. In practice, the difference between RMSE and RSE
is very small, particularly for large multivariate data.
4. Mean Absolute Error (MAE), which, like the RMSE, measures the prediction
error. Mathematically, it is the average absolute difference between observed
and predicted outcomes, MAE = mean(abs(observeds - predicteds)). MAE is
less sensitive to outliers than RMSE.
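A minimal sketch computing these metrics, reusing actuals, distPred, and distMod
from the prediction sketch earlier:

> mse  <- mean((actuals - distPred)^2)    # mean squared error
> rmse <- sqrt(mse)                       # root mean squared error
> mae  <- mean(abs(actuals - distPred))   # mean absolute error
> sigma(distMod)                          # residual standard error of the fitted model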

Conclusion: Through this experiment we have learnt the concept of regression and
how to use R scripts to apply different regression techniques. We have also learnt
how different industrial models are built on regression, and the different factors that
make regression more accurate. We applied these concepts in R by using a sample
data set, analyzing the given set of variables, and applying regression to them.
