Lab-3: Regression Analysis and Modeling
Name:
Uid No.:
Objective: To carry out linear regression (including multiple regression) and build a
regression model.
Outcomes:
1. To carry out linear regression (including multiple regression)
2. To build a regression model using both forward and backward stepwise
processes
3. To plot regression models
4. To add lines of best-fit to regression plots
Common applications of linear regression include modeling GDP, the capital asset
pricing model (CAPM), oil and gas prices, and medical diagnosis.
Procedure:
Step-1: Open RStudio and go to the R console (>)
>sessionInfo()
>install.packages("DAAG")
>library(lattice)
>library(DAAG)
>?cars # help page for cars, a built-in data set in the datasets package
Example Problem
For this analysis, we will use the cars dataset that ships with R. cars is a
standard built-in dataset, which makes it convenient for demonstrating linear
regression in a simple, easy-to-understand fashion. You can access it by typing
cars in your R console. It consists of 50 observations (rows) and
2 variables (columns): dist and speed. Let's print the first six observations:
head(cars) # display the first 6 observations
Before we begin building the regression model, it is good practice to analyze and
understand the variables. The graphical analysis and correlation study below will help
with this.
Graphical Analysis
The aim of this exercise is to build a simple regression model that predicts
distance (dist) by establishing a statistically significant linear relationship
with speed (speed). Before jumping into the syntax, let's try to understand these
variables graphically. Typically, for each independent variable (predictor), the
following plots are drawn:
1. Scatter plot: To visualize the linear relationship between the predictor and
the response.
2. Box plot: To spot outlier observations in the variable. Outliers in
the predictor can drastically affect the predictions, because they can easily
change the direction/slope of the line of best fit.
3. Density plot: To see the distribution of the predictor variable. Ideally, a
close-to-normal distribution (a bell-shaped curve) that is not skewed to the
left or right is preferred.
Let us see how to make each one of them.
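The box plot and density plot from the list above can be sketched in base R as follows. This is a minimal illustration, not a prescribed procedure: the panel layout, the titles, and the variable names speed_outliers and dist_outliers are assumptions made for the example.

```r
# Sketch: box plots and density plots for both variables of the built-in cars data.
op <- par(mfrow = c(1, 2))  # two panels per page (layout choice is an assumption)

# Box plots: points drawn beyond the whiskers are potential outliers
boxplot(cars$speed, main = "Speed")
boxplot(cars$dist, main = "Distance")

# The values that boxplot() would flag as outliers (illustrative variable names)
speed_outliers <- boxplot.stats(cars$speed)$out
dist_outliers <- boxplot.stats(cars$dist)$out

# Density plots: check how close each distribution is to a bell shape
plot(density(cars$speed), main = "Density: Speed", xlab = "speed")
plot(density(cars$dist), main = "Density: Distance", xlab = "dist")

par(op)  # restore the previous graphics settings
```

Printing speed_outliers and dist_outliers lists any observations that fall outside the whiskers, which is a quick numeric complement to reading them off the plot.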
Scatter Plot
Scatter plots help visualize any linear relationship between the dependent
(response) variable and the independent (predictor) variables. Ideally, if you have
multiple predictor variables, a scatter plot is drawn for each of them against the
response, along with the line of best fit, as seen below.
scatter.smooth(x=cars$speed, y=cars$dist, main="Dist ~ Speed") # scatterplot
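Note that scatter.smooth() overlays a loess smoothing curve rather than a straight line. If a straight least-squares line of best fit is wanted instead, one possible sketch uses lm() together with abline(); the colour, line width, axis labels, and the name fit are illustrative choices.

```r
# Sketch: scatter plot with a straight least-squares line of best fit.
fit <- lm(dist ~ speed, data = cars)  # simple linear regression of dist on speed

plot(cars$speed, cars$dist,
     xlab = "speed", ylab = "dist", main = "Dist ~ Speed")
abline(fit, col = "red", lwd = 2)     # draw the fitted regression line

coef(fit)  # intercept and slope of the fitted line
```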
Correlation
Correlation is a statistical measure of the level of linear dependence
between two variables that occur in pairs, just like speed and
dist here. Correlation takes values between -1 and +1. If we observe that whenever
speed increases, distance also increases, then there is a high
positive correlation between them, and the correlation will be
close to 1. The opposite is true for an inverse relationship, in which case the
correlation between the variables will be close to -1.
A value close to 0 suggests a weak linear relationship between the variables. A low
correlation (-0.2 < r < 0.2) probably suggests that much of the variation in the
response variable (Y) is unexplained by the predictor (X), in which case we should
probably look for better explanatory variables.
cor(cars$speed, cars$dist) # calculate correlation between speed and distance
#> [1] 0.8068949
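To make the definition concrete, the Pearson correlation can also be computed by hand and compared with cor(). This is only a sketch; the names x, y, r_manual, and r_builtin are illustrative.

```r
# Sketch: Pearson correlation computed from its definition, then checked against cor().
x <- cars$speed
y <- cars$dist

# Sum of cross-products of deviations, scaled by the product of the deviation norms
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_builtin <- cor(x, y)  # default method is Pearson

all.equal(r_manual, r_builtin)  # the two values should agree
```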
Step 1: Create the training (development) and test (validation) data samples from
the original data.
Step 2: Develop the model on the training data and use it to predict the distance on
the test data.
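The two steps above could be sketched as follows. The 80/20 split ratio, the seed value, and all variable names (train_idx, train, test, model, results, rmse_test) are assumptions made for illustration; the handout does not specify them.

```r
# Sketch of Step 1 and Step 2, assuming an 80/20 random split.
set.seed(100)  # arbitrary seed so the split is reproducible

# Step 1: split the original data into training and test samples
n <- nrow(cars)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- cars[train_idx, ]
test <- cars[-train_idx, ]

# Step 2: fit the model on the training data and predict dist for the test data
model <- lm(dist ~ speed, data = train)
pred <- predict(model, newdata = test)

# Compare actual and predicted distances on the test sample
results <- data.frame(actual = test$dist, predicted = pred)
rmse_test <- sqrt(mean((results$actual - results$predicted)^2))

summary(model)  # coefficients and fit statistics on the training data
head(results)
```

Because the split is random, the fitted coefficients and the test error depend on the seed; repeating the split with different seeds gives a feel for how stable the model is.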
Comments on Result:
Describe the following with respect to linear regression, building a linear
model, and prediction:
1) List types of regression
Ans- There are various types of regression models, but the three most widely used
types are:
1) Simple linear regression
2) Polynomial regression
3) Logistic regression
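As a rough illustration, each of the three types can be fitted in base R. The logistic example uses the built-in mtcars data (transmission type am against weight wt), which is a substitution made here because cars has no binary response; the quadratic degree and the object names are likewise assumptions.

```r
# Sketch: one fit for each of the three regression types listed above.

# 1) Simple linear regression: dist as a linear function of speed
simple_fit <- lm(dist ~ speed, data = cars)

# 2) Polynomial regression: quadratic in speed (degree 2 is an illustrative choice)
poly_fit <- lm(dist ~ poly(speed, 2), data = cars)

# 3) Logistic regression: binary response am (0/1) modeled on wt,
#    using mtcars because cars has no binary variable
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial)

coef(simple_fit)
coef(poly_fit)
coef(logit_fit)
```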
Three statistics are used in Ordinary Least Squares (OLS) regression to evaluate
model fit: R-squared, the overall F-test, and the Root Mean Square Error
(RMSE). All three are based on two sums of squares: Sum of Squares Total
(SST) and Sum of Squares Error (SSE). SST measures how far the data are
from the mean, and SSE measures how far the data are from the model’s
predicted values. Different combinations of these two values provide different
information about how the regression model compares to the mean model.
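The relationship between these statistics and the two sums of squares can be sketched for the cars model as follows. The variable names are illustrative, and the RMSE here divides by the residual degrees of freedom so that it matches the residual standard error reported by summary(); dividing by n instead is another common convention.

```r
# Sketch: R-squared, overall F statistic, and RMSE derived from SST and SSE.
fit <- lm(dist ~ speed, data = cars)
y <- cars$dist
n <- length(y)  # number of observations
p <- 1          # number of predictors in the model

sst <- sum((y - mean(y))^2)   # how far the data are from the mean
sse <- sum(residuals(fit)^2)  # how far the data are from the fitted values

r_squared <- 1 - sse / sst                         # share of variation explained
f_stat <- ((sst - sse) / p) / (sse / (n - p - 1))  # overall F-test statistic
rmse <- sqrt(sse / (n - p - 1))                    # uses residual degrees of freedom

# Cross-check against the values reported by summary()
s <- summary(fit)
c(r_squared = r_squared, f_stat = f_stat, rmse = rmse)
c(s$r.squared, unname(s$fstatistic["value"]), s$sigma)
```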
Conclusion: Through this experiment we learnt the concept of regression and
how to use R scripts to apply different regression techniques. We also
learnt how industrial models are built on regression and which
factors make a regression model more accurate. We applied these concepts in R by
analyzing the variables of a sample data set and fitting a regression model to
them.