Linear Regression
Linear Regression
You are hired by a company Gem Stones co ltd, which is a cubic zirconia manufacturer. You are provided with the dataset containing
the prices and other attributes of almost 27,000 cubic zirconia (which is an inexpensive diamond alternative with many of the same
qualities as a diamond). The company is earning different profits on different prize slots. You have to help the company in predicting
the price for the stone on the bases of the details given in the dataset so it can distinguish between higher profitable stones and lower
profitable stones so as to have better profit share. Also, provide them with the best 5 attributes that are most important.
Data Dictionary:
we can observe there are 697 missing values in the depth column.
It can see that the distribution of some quantitative features like "carat" and the target feature "price" are heavily "right-
skewed".
Bivariate Analysis.
Getting the Correlation Heatmap
we can observe there are 697 missing values in the depth column
we can observe there are 697 missing values in the depth column hence replacing the missing values with median value.
we checked for 'Zero' value and we can observe there are some amounts of 'Zero' value present on the data set on variable 'x', 'y','z'.
Scaling or standardizing the features around the center and 0 with a standard deviation of 1 is important when we compare
measurements that have different units. Variables that are measured at different scales do not underwrite equally to the analysis and
might end up creating a prejudice.
With the given data set we can see the all the variable are in different scale i.e price are in 1000s unit and depth and table are in 100s
unit, and carat is in 10s. So it’s becomes necessary to scale the data to allow each variable to be compared on a common scale. Hence
recommended for regression technique
1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30). Apply Linear regression.
Performance Metrics: Check the performance of Predictions on Train and Test sets using Rsquare, RMSE.
Get Dummies
Train-Test Split:
Breaking the X and y data frames into training set and test set.
Invoke the Linear Regression function and find the best fit model on training data
We can observe there are very strong multi collinearity present in the data set. Ideally it should be within 1 to 5.
We are exploring the Linear Regression using stats models as we are interested in some more statistical metrics of the model.
The predictions were able to capture 95% variations in the price and it is explained by the predictors in the training set. Using stats
model if we could run the model again, we can have P values and coefficients which will give us better understanding of the
relationship, so that values more 0.05 we can drop those variables and re run the model again for better results. For better accuracy
dropping depth column in iteration for better results.