Linear Regression

The document discusses using linear regression on a dataset containing prices and attributes of 27,000 cubic zirconias. Some key points:
- The dataset has 11 variables including price, carat weight, cut, color, clarity and other stone attributes.
- Exploratory data analysis found the price and carat variables were right-skewed, and some variables had outliers.
- Missing depth values were imputed with the median. Zero values in some variables were investigated.
- Categorical variables were encoded. The data was split 70%/30% into train and test sets.
- Linear regression was performed on the training set, achieving an R^2 of 95% to explain price variations.

Uploaded by

Anil Bera

LINEAR REGRESSION

Problem 1: Linear Regression

You are hired by a company, Gem Stones co ltd, which is a cubic zirconia manufacturer. You are provided with a dataset containing
the prices and other attributes of almost 27,000 cubic zirconia stones (cubic zirconia is an inexpensive diamond alternative with many
of the same qualities as a diamond). The company earns different profits in different price slots. You have to help the company predict
the price of a stone on the basis of the details given in the dataset, so that it can distinguish between more profitable and less
profitable stones and achieve a better profit share. Also, provide the company with the 5 attributes that are most important.

Data Dictionary:

Variable Name | Description
Carat | Carat weight of the cubic zirconia.
Cut | Describes the cut quality of the cubic zirconia. Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.
Color | Colour of the cubic zirconia, with D being the worst and J the best.
Clarity | Clarity refers to the absence of inclusions and blemishes. In order from best to worst (IF = flawless, I1 = level 1 inclusion): IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1.
Depth | The height of the cubic zirconia, measured from the culet to the table, divided by its average girdle diameter.
Table | The width of the cubic zirconia's table expressed as a percentage of its average diameter.
Price | The price of the cubic zirconia.
X | Length of the cubic zirconia in mm.
Y | Width of the cubic zirconia in mm.
Z | Height of the cubic zirconia in mm.

Table of Content

1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, Data types, shape, EDA). Perform Univariate and Bivariate Analysis.
1.2 Impute null values if present, also check for the values which are equal to zero. Do they have any meaning or do we need to change them or drop them? Do you think scaling is necessary in this case?
1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30). Apply Linear regression. Performance Metrics: Check the performance of Predictions on Train and Test sets using Rsquare, RMSE.
1.4 Inference: Basis on these predictions, what are the business insights and recommendations.

1.1. Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, Data types, shape, EDA). Perform
Univariate and Bivariate Analysis

Understanding the head & Tail of the dataset.

Head of the dataset

Tail of the dataset

Shape of the dataset

Type of the Dataset columnwise

• The data set contains 26,967 rows and 11 columns.

• The data set has 2 integer-type features, 6 float-type features and 3 object-type features. 'price' is the target variable
and all the others are predictor variables.
• The first column ("Unnamed: 0") is only a serial-number index, so we can remove it.
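
The inspection steps above can be sketched in pandas roughly as follows. The handful of rows below are made-up values used only to illustrate the calls (the real data set has 26,967 rows); the column names are assumed from the data dictionary.

```python
import pandas as pd

# Hypothetical miniature of the data set (values are illustrative only).
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3, 4],
    "carat": [0.30, 0.33, 0.90, 0.42],
    "cut": ["Ideal", "Premium", "Very Good", "Ideal"],
    "color": ["E", "G", "E", "F"],
    "clarity": ["SI1", "IF", "VVS2", "VS1"],
    "depth": [62.1, 60.8, 62.2, 61.6],
    "table": [58.0, 58.0, 60.0, 56.0],
    "x": [4.27, 4.42, 6.04, 4.82],
    "y": [4.29, 4.46, 6.12, 4.80],
    "z": [2.66, 2.70, 3.78, 2.96],
    "price": [499, 984, 6289, 1082],
})

print(df.head())    # first rows
print(df.tail())    # last rows
print(df.shape)     # (rows, columns)
print(df.dtypes)    # column-wise data types

# The serial-number column carries no information, so drop it.
df = df.drop(columns=["Unnamed: 0"])
```

After the drop, the frame keeps the nine predictors and the target `price`.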
Summary of the dataset

• The dataset consists of both categorical and continuous data.

• The categorical variables are cut, color and clarity; the continuous variables are carat, depth, table, x, y, z and price.
• The target variable is price.

Checking for duplicate records in the dataset

Checking for missing values

• There are 697 missing values in the depth column.
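
A minimal sketch of the duplicate and missing-value checks, assuming pandas and a small made-up frame in which one row is repeated and one depth value is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the last row repeats the first, one depth is missing.
df = pd.DataFrame({
    "carat": [0.30, 0.33, 0.90, 0.30],
    "depth": [62.1, np.nan, 62.2, 62.1],
    "price": [499, 984, 6289, 499],
})

n_duplicates = int(df.duplicated().sum())   # rows identical to an earlier row
missing_per_column = df.isnull().sum()      # NaN count per column

print("Duplicate rows:", n_duplicates)
print(missing_per_column)
```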

Unique values in categorical data
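
One way to list the unique values of the categorical columns is sketched below with pandas; the rows are illustrative and only the column names come from the data dictionary.

```python
import pandas as pd

# Hypothetical categorical columns (values are illustrative only).
df = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Ideal", "Good"],
    "color": ["E", "G", "E", "F"],
    "clarity": ["SI1", "IF", "SI1", "VS1"],
})

for column in ["cut", "color", "clarity"]:
    print(column, "has", df[column].nunique(), "unique values")
    print(df[column].value_counts(), "\n")   # frequency of each level
```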

Univariate Analysis.

Measuring the skewness of every attribute

• The distributions of some quantitative features, such as "carat" and the target feature "price", are heavily right-skewed.
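
Skewness can be measured per column with pandas' `skew()`; a positive value indicates a long right tail. The prices below are made up to show the effect and are not taken from the data set.

```python
import pandas as pd

# Hypothetical prices with a long right tail (illustrative values only).
price = pd.Series([400, 500, 550, 600, 700, 900, 1200, 5000, 18000])

skewness = price.skew()   # sample skewness; > 0 means right-skewed
print(f"Skewness of price: {skewness:.2f}")
print("mean > median:", price.mean() > price.median())
```

For a right-skewed variable the mean is pulled above the median by the few large values.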

Checking for Outliers in the Dataset

• There is a significant number of outliers in some variables.
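
The source does not show how the outliers were identified. A common choice, and the rule box plots use, is to flag points more than 1.5 × IQR beyond the quartiles; the sketch below assumes that rule, and the function name and carat values are hypothetical.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical carat weights; 4.5 lies far above the rest.
carat = [0.3, 0.4, 0.5, 0.7, 0.9, 1.0, 1.1, 4.5]
mask = iqr_outlier_mask(carat)
print("Outlier count:", int(mask.sum()))
```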

Bivariate Analysis.
Getting the Correlation Heatmap

• It can be concluded that most of the features are correlated with one another.
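
The correlation matrix behind such a heatmap can be computed with `DataFrame.corr()`. The rows below are illustrative values only, not the report's data.

```python
import pandas as pd

# Hypothetical numeric columns (values are illustrative only).
df = pd.DataFrame({
    "carat": [0.30, 0.33, 0.90, 0.42, 1.50, 2.00],
    "x": [4.27, 4.42, 6.04, 4.82, 7.30, 8.10],
    "depth": [62.1, 60.8, 62.2, 61.6, 61.0, 62.5],
    "price": [499, 984, 6289, 1082, 9500, 15000],
})

corr = df.corr()   # Pearson correlation between every pair of columns
print(corr.round(2))
# With seaborn installed, sns.heatmap(corr, annot=True) would draw the heatmap.
```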

1.2 Impute null values if present, also check for the values which are equal to zero. Do they have any meaning or do we need to change
them or drop them? Do you think scaling is necessary in this case?

Imputing null values

• There are 697 missing values in the depth column, so we replace them with the median value of that column.
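
Median imputation of the depth column might look like the following in pandas; the series values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical depth column with two missing entries.
depth = pd.Series([61.0, 62.5, np.nan, 60.0, np.nan, 63.0], name="depth")

median_depth = depth.median()               # NaN values are skipped by default
depth_filled = depth.fillna(median_depth)   # replace NaN with the median

print("Median used for imputation:", median_depth)
print("Missing after imputation:", int(depth_filled.isnull().sum()))
```

The median is preferred over the mean here because it is not pulled by outliers.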

Checking for the values which are equal to zero

We checked for zero values and observed some of them in the variables 'x', 'y' and 'z'. Because these are physical dimensions in mm, a zero cannot be a genuine measurement and is better treated as a missing or erroneous entry.
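
A short sketch of the zero-value check on the dimension columns, using made-up rows. How the flagged rows are then handled (dropped or imputed) is left open here, since the source does not show it.

```python
import pandas as pd

# Hypothetical dimensions; some entries are recorded as zero.
df = pd.DataFrame({
    "x": [4.27, 0.0, 6.04, 4.82],
    "y": [4.29, 0.0, 6.12, 4.80],
    "z": [2.66, 0.0, 3.78, 0.0],
})

dims = ["x", "y", "z"]
zero_counts = (df[dims] == 0).sum()                 # zeros per column
rows_with_zero = df[(df[dims] == 0).any(axis=1)]    # rows with at least one zero

print(zero_counts)
print(len(rows_with_zero), "rows contain a zero dimension")
```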

Do you think scaling is necessary in this case?

Scaling, or standardizing, the features so that they are centered at 0 with a standard deviation of 1 is important when we compare
measurements that have different units. Variables that are measured on different scales do not contribute equally to the analysis and
might end up creating a bias.
In the given data set the variables are on different scales: price is in the 1000s, depth and table are in the 100s, and carat is in the
10s. It therefore becomes necessary to scale the data so that each variable can be compared on a common scale. Hence scaling is
recommended for the regression technique.
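
Standardization as described above can be sketched with NumPy alone: subtract each column's mean and divide by its standard deviation. The array values are illustrative; scikit-learn's `StandardScaler` performs the same transformation and may be what the report used, but the source does not say.

```python
import numpy as np

# Hypothetical columns: carat, depth, price (illustrative values only).
X = np.array([
    [0.30, 62.1,   499.0],
    [0.33, 60.8,   984.0],
    [0.90, 62.2,  6289.0],
    [1.50, 61.0,  9500.0],
])

mean = X.mean(axis=0)
std = X.std(axis=0)            # population standard deviation (ddof=0)
X_scaled = (X - mean) / std    # z-score: each column now has mean 0, std 1

print(X_scaled.round(3))
```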
1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30). Apply Linear regression.
Performance Metrics: Check the performance of Predictions on Train and Test sets using Rsquare, RMSE.

Encode the data having string values

Get Dummies
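
A minimal sketch of dummy encoding with `pandas.get_dummies`. The rows are made up, and `drop_first=True` is an assumption: the source does not show which options were used.

```python
import pandas as pd

# Hypothetical rows with two string-valued columns.
df = pd.DataFrame({
    "carat": [0.30, 0.33, 0.90],
    "cut": ["Ideal", "Premium", "Good"],
    "color": ["E", "G", "E"],
})

# drop_first=True drops one level per variable so the dummy columns are not
# redundant with the intercept (an assumption, not confirmed by the source).
encoded = pd.get_dummies(df, columns=["cut", "color"], drop_first=True)
print(encoded.columns.tolist())
```

Each remaining dummy column is read relative to the dropped level, here `Good` for cut and `E` for color.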

Train-Test Split:

Breaking the X and y data frames into training set and test set.

Split X and y into training and test set in 70:30 ratio
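
The 70:30 split can be illustrated with a small NumPy helper. The helper name `train_test_split_70_30` and the seed are hypothetical; in practice scikit-learn's `train_test_split(X, y, test_size=0.30, random_state=...)` is the usual tool and is probably what the report used.

```python
import numpy as np

def train_test_split_70_30(X, y, seed=1):
    """Shuffle the rows and split them 70:30 into train and test sets."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))          # random row order
    n_test = int(round(0.30 * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X = np.arange(40).reshape(20, 2)   # 20 hypothetical rows, 2 features
y = np.arange(20)                  # hypothetical target

X_train, X_test, y_train, y_test = train_test_split_70_30(X, y)
print(X_train.shape, X_test.shape)
```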

Invoke the Linear Regression function and find the best fit model on training data

The coefficients for each of the independent attributes

R square on training data

R square on testing data

RMSE on Training data

RMSE on Testing data
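
The fitting and evaluation steps listed above can be sketched as follows. This uses ordinary least squares through NumPy on synthetic data, so the numbers it prints are not the report's results; the helper names are hypothetical, and scikit-learn's `LinearRegression` would give the same fit.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Ordinary least squares; returns (intercept, coefficients)."""
    A = np.column_stack([np.ones(len(X)), X])     # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[0], beta[1:]

def predict(X, intercept, coef):
    return intercept + X @ coef

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)            # unexplained variation
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Synthetic data: y = 3 + 2*x1 - 1*x2 + small noise (not the real data set).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# 70:30 split of the synthetic rows.
X_train, X_test = X[:140], X[140:]
y_train, y_test = y[:140], y[140:]

intercept, coef = fit_linear_regression(X_train, y_train)
for name, Xs, ys in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = predict(Xs, intercept, coef)
    print(f"{name}: R^2 = {r_squared(ys, pred):.3f}, RMSE = {rmse(ys, pred):.3f}")
```

Similar R square and RMSE values on the train and test sets suggest the model is not overfitting.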

Check Multi-collinearity using VIF

We can observe very strong multicollinearity in the data set. Ideally the VIF values should lie between 1 and 5.
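
The variance inflation factor of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. The NumPy sketch below follows that definition on synthetic data; the function name `vif` is hypothetical. statsmodels offers `variance_inflation_factor` in `statsmodels.stats.outliers_influence`, which the report may have used.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column: 1 / (1 - R^2_j)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])   # include intercept
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ beta
        r2 = 1.0 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic example: x and y are nearly collinear, depth is independent.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = x + rng.normal(scale=0.05, size=500)
depth = rng.normal(size=500)

v = vif(np.column_stack([x, y, depth]))
print(v.round(1))
```

In this toy example the two nearly identical columns get very large VIF values, while the independent column stays close to 1.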

We also fit the linear regression using statsmodels, as we are interested in some additional statistical metrics of the model.

Linear Regression using statsmodels.

Calculate MSE

The final Linear Regression equation is

1.4 Inference: Basis on these predictions, what are the business insights and recommendations.

The model captured 95% of the variation in price, which is explained by the predictors in the training set. If we run the model again
using statsmodels, we obtain p-values and coefficients that give a better understanding of the relationships; variables with p-values
greater than 0.05 can then be dropped and the model re-run for better results. For better accuracy, the depth column can be dropped in
a subsequent iteration.
