Sunira - Predictive Modeling
PREPARED BY
SUNIRA
Contents

Problem 1: Linear Regression
1.1 Read the data and do exploratory data analysis. Describe the data briefly (null values, data types, shape, duplicate values). Perform univariate and bivariate analysis.
1.2 Impute null values if present; also check for values equal to zero. Do they have any meaning, or do we need to change or drop them? Check the possibility of combining the sub-levels of ordinal variables and take action accordingly, with appropriate reasoning.
1.3 Encode the string-valued data for modelling. Split the data into train and test (70:30). Apply linear regression using scikit-learn. Check for significant variables using an appropriate method from statsmodels. Create multiple models and check the performance of predictions on the train and test sets using R-square, RMSE and adjusted R-square. Compare these models and select the best one with appropriate reasoning.
1.4 Inference: based on these predictions, what are the business insights and recommendations?
LINEAR REGRESSION
You are hired by Gem Stones Co Ltd, a cubic zirconia manufacturer. You are
provided with a dataset containing the prices and other attributes of almost
27,000 cubic zirconia stones (an inexpensive diamond alternative with many
of the same qualities as a diamond). The company earns different profits in
different price slots. You have to help the company predict the price of a
stone on the basis of the details given in the dataset, so that it can
distinguish higher-profit stones from lower-profit stones and improve its
profit share. Also, provide the five attributes that are most important.
Data Dictionary:
1.1 Read the data and do exploratory data analysis. Describe the data
briefly (check the null values, data types, shape, EDA). Perform
univariate and bivariate analysis.
Loading all the necessary libraries for model building.
Now, reading the head and tail of the dataset to check that the data has
been read in properly.
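As a sketch, these initial sanity checks might look like the following; the toy frame below is an assumption standing in for the real file (in practice one would `pd.read_csv("cubic_zirconia.csv")`, with the file name also an assumption):

```python
import pandas as pd
import numpy as np

# Toy stand-in for the real dataset; the values are illustrative only.
df = pd.DataFrame({
    "carat": [0.3, 0.7, np.nan, 0.7],
    "cut":   ["Ideal", "Good", "Fair", "Good"],
    "price": [500, 1500, 900, 1500],
})

print(df.head())              # head: is the data read correctly?
print(df.tail())              # tail: does the file end cleanly?
print(df.shape)               # (rows, columns)
print(df.dtypes)              # data type of each column
print(df.isnull().sum())      # null values per column
print(df.duplicated().sum())  # number of duplicate rows
```

The same five calls (`head`, `tail`, `shape`, `dtypes`, `isnull().sum()`, `duplicated().sum()`) cover every check the question asks for.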
HEAD OF THE DATA (Table 1.1)
COLOR: 7
J 1443
I 2771
D 3344
H 4102
F 4729
E 4917
G 5661
CLARITY: 8
I1 365
IF 894
VVS1 1839
VVS2 2531
VS1 4093
SI2 4575
VS2 6099
SI1 6571
PRICE – HISTOGRAM
Fig 1.15
Skewness values: Table 1.4
BIVARIATE ANALYSIS
CUT:
Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.
Fig 1.16
Ideal is the most preferred cut, likely because those diamonds are priced
lower than other cuts.
COLOR:
Fig 1.18
We have 7 colours in the data; G appears to be the preferred colour.
Fig 1.19
G is priced in the middle of the seven colours, whereas J, despite being
the worst colour, is priced surprisingly high.
CLARITY:
Clarity grades from best to worst (FL = flawless, I3 = level-3 inclusions):
FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3.
Fig 1.20
The clarity grades SI1 and VS2 are the most common in the data.
Fig 1.21
The data contains no FL diamonds; flawless diamonds are therefore
bringing no profit to the store.
Fig 1.23
CORRELATION
CARAT VS PRICE
Fig 1.24
DEPTH VS PRICE
Fig 1.25
X VS PRICE
Fig 1.26
Y VS PRICE
Fig 1.27
Z VS PRICE
Fig 1.28
DATA DISTRIBUTION
Fig 1.29
CORRELATION MATRIX
Fig 1.30
This matrix clearly shows the presence of multicollinearity in the dataset.
1.2 Impute null values if present, also check for the values which are equal to zero. Do
they have any meaning or do we need to change them or drop them? Do you think
scaling is necessary in this case?
Yes, we have null values in depth. Since depth is a continuous variable,
mean or median imputation can be applied.
The percentage of null values is below 5%, so we could also simply drop
these rows.
After median imputation, there are no null values left in the dataset.
Table 1.5
Certain rows contain zeros in x, y and z. These are the physical
dimensions of a stone, so zero values are meaningless and cannot go into
the model. As there are very few such rows and they carry no meaning for
model building, we drop them.
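A minimal sketch of the median imputation and zero-dimension handling, on a toy frame standing in for the real data (column names follow the report; the values are illustrative):

```python
import pandas as pd
import numpy as np

# Toy frame with one null depth and one zero-dimension row.
df = pd.DataFrame({
    "depth": [61.0, np.nan, 62.5, 59.8],
    "x": [4.0, 3.9, 0.0, 4.2],
    "y": [4.1, 3.8, 0.0, 4.3],
    "z": [2.5, 2.4, 0.0, 2.6],
})

# Median imputation for the continuous 'depth' column.
df["depth"] = df["depth"].fillna(df["depth"].median())

# Rows where any physical dimension is zero carry no meaning; drop them.
df = df[(df[["x", "y", "z"]] != 0).all(axis=1)]

print(df.isnull().sum().sum())  # no nulls remain
print(len(df))                  # the zero-dimension row is gone
```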
SCALING
Scaling can be useful to reduce or check the multicollinearity in the data.
Without scaling, I find the VIF (variance inflation factor) values very
high, which indicates the presence of multicollinearity.
These values are calculated after building the linear regression model, in
order to understand the multicollinearity in it.
In this case, scaling had no impact on the model score, the attribute
coefficients, or the intercept.
BEFORE SCALING – VIF VALUES
Fig 1.31
Fig 1.32
Fig 1.33
Fig 1.34
Fig 1.35
Fig 1.36
Fig 1.37
Fig 1.39
Fig 1.40
Fig 1.41
Fig 1.42
Fig 1.43
Fig 1.44
1.3 Encode the data (having string values) for modelling. Data split: split the data
into train and test (70:30). Apply linear regression. Performance metrics: check the
performance of predictions on the train and test sets using R-square and RMSE.
ENCODING THE STRING VALUES
GET DUMMIES
Table 1.6
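A hedged sketch of the get_dummies encoding, 70:30 split, linear regression fit and the R-square / RMSE / adjusted R-square metrics; the synthetic price data below is an assumption standing in for the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: price driven mostly by carat, plus a cut effect.
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.0, n),
    "cut": rng.choice(["Fair", "Good", "Ideal"], n),
})
df["price"] = 4000 * df["carat"] + (df["cut"] == "Ideal") * 300 \
    + rng.normal(0, 200, n)

# One-hot encode the string column (drop_first avoids the dummy trap).
X = pd.get_dummies(df[["carat", "cut"]], drop_first=True)
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

lr = LinearRegression().fit(X_train, y_train)
pred = lr.predict(X_test)

r2 = r2_score(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
n_test, p = X_test.shape
adj_r2 = 1 - (1 - r2) * (n_test - 1) / (n_test - p - 1)
print(r2, rmse, adj_r2)
```

Adjusted R-square penalises the plain R-square for the number of predictors, which is why it is the fairer number when comparing models with different feature sets.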
VIF VALUES
From the statsmodels summary we can identify the features that do not
contribute to the model. After removing those features, the VIF values become:
STATSMODEL
OLS summary (coefficient table: coef, std err, t, P>|t|, [0.025, 0.975])
To bring the VIF values down to lower levels, we can drop one variable
from each highly correlated pair. Dropping such variables brings the
multicollinearity down.
1.4 Inference: based on these predictions, what are the business insights and recommendations?
We had a business problem to predict the price of a stone and provide the company with
insights on the profits in different price slots. From the EDA we can see that the Ideal
cut brought the most profit to the company. The colours H, I and J have brought profits.
In clarity, there were no flawless stones, and no profits came from I1, I2 and I3 stones.
The Ideal, Premium and Very Good cuts were bringing profits, whereas Fair and Good were not.
The predictions were able to capture 95% of the variation in price, as explained by the
predictors in the training set.
If we re-run the model with statsmodels, we obtain p-values and coefficients that give a
better understanding of the relationships; variables with p-values above 0.05 can be
dropped and the model re-run for better results. Dropping the depth column in a further
iteration also improves accuracy.
The equation:
price = (-0.76) * Intercept + (1.1) * carat + (-0.01) * table + (-0.32) * x + (0.28) * y + (-0.11) * z + (0.1) * cut_Good + (0.15) * cut_Ideal + (0.15) * cut_Premium + (0.13) * cut_Very_Good + (-0.05) * color_E + (-0.06) * color_F + (-0.1) * color_G + (-0.21) * color_H + (-0.32) * color_I + (-0.47) * color_J + (1.0) * clarity_IF + (0.64) * clarity_SI1 + (0.43) * clarity_SI2 + (0.84) * clarity_VS1 + (0.77) * clarity_VS2 + (0.94) * clarity_VVS1 + (0.93) * clarity_VVS2
Recommendations
1. The Ideal, Premium and Very Good cut types are the ones bringing profits, so
marketing can be focused on these to bring in more profit.
2. The clarity of the diamond is the next most important attribute: the clearer the
stone, the higher the profit.
The most important attributes are:
carat
y (the diameter of the stone)
clarity_IF
clarity_SI1
clarity_SI2
clarity_VS1
clarity_VS2
clarity_VVS1
clarity_VVS2
THE END
LOGISTIC REGRESSION AND LDA
Data Dictionary:
2.1 Data ingestion: read the dataset. Compute descriptive statistics and
check for null values; write an inference on them. Perform univariate
and bivariate analysis. Do exploratory data analysis.
Table 1.1
Table 1.2
DATA DESCRIPTION
Table 1.3
HOLLIDAY PACKAGE (value counts, dtype: int64):
No 471
FOREIGN: 2 (value counts, dtype: int64):
Yes 216
No 656
This split indicates that 45% of employees are interested in the holiday package.
CATEGORICAL UNIVARIATE ANALYSIS
FOREIGN
Fig 1.1
HOLIDAY PACKAGE
Fig 1.2
HOLIDAY PACKAGE VS SALARY
Fig 1.3
We can see that employees with a salary below 150,000 have always opted
for the holiday package.
Fig 1.4
HOLIDAY PACKAGE VS EDUC
Fig 1.5
Fig 1.9
Fig 1.11
Fig 1.13
Fig 1.15
BIVARIATE ANALYSIS
DATA DISTRIBUTION
Fig 1.16
Fig 1.17
There is no multicollinearity in the data.
TREATING OUTLIERS
BEFORE OUTLIER TREATMENT
We have outliers in the dataset. Since LDA works on numerical
computation, treating the outliers will help the model perform better.
Fig 1.18
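One common way to treat the outliers mentioned above is IQR-based capping; a minimal sketch on a toy salary column (the values are illustrative, not the real dataset):

```python
import pandas as pd

# Toy salary column with one extreme value.
s = pd.Series([30_000, 35_000, 40_000, 45_000, 50_000, 500_000], name="Salary")

q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap (winsorise) values outside the IQR fences instead of dropping rows,
# so no observations are lost.
capped = s.clip(lower, upper)
print(capped.max())  # the extreme value is pulled down to the upper fence
```

Capping rather than dropping keeps the sample size intact, which matters for a dataset of under a thousand rows.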
2.2 Do not scale the data. Encode the data (having string values) for Modelling.
Data Split: Split the data into train and test (70:30). Apply Logistic Regression and
LDA (linear discriminant analysis).
ENCODING CATEGORICAL VARIABLE
Table 1.4
Encoding the categorical variables helps the logistic regression model
produce better predictions.
The grid search method selects the liblinear solver, which is suitable
for small datasets. The tolerance and penalty have also been found using
grid search.
Predicting on the training data:
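A sketch of the grid search over solver, penalty and tolerance; the synthetic classification data is an assumption standing in for the holiday-package dataset, and the grid values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary target standing in for Holliday_Package.
X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

# Search solver, penalty and tolerance, as described in the report.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={
        "solver": ["liblinear"],   # well suited to small datasets
        "penalty": ["l1", "l2"],
        "tol": [1e-4, 1e-3],
    },
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))  # accuracy on the held-out test set
```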
Table 1.5
Fig 1.20
CONFUSION MATRIX FOR TEST DATA
Fig 1.21
ACCURACY
LDA
MODEL SCORE
Fig 1.23
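The LDA fit and its model scores might be sketched as follows, on the same kind of synthetic stand-in data (LDA needs no hyperparameter search here):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary target standing in for Holliday_Package.
X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
print(lda.score(X_train, y_train))  # model score on training data
print(lda.score(X_test, y_test))    # model score on test data
print(confusion_matrix(y_test, lda.predict(X_test)))
```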
Fig 1.31
Fig 1.32
Table 1.6
Comparing the two models, we find the results are the same, but LDA
works better when the target variable is categorical.
The important factors deciding the predictions are salary, age and educ.
Recommendations
1. To improve holiday-package uptake among people above age 50, we can
offer packages to religious destinations.
2. For people earning more than 150,000 we can offer vacation holiday
packages.
3. For employees with a larger number of older children, we can offer
packages to family holiday destinations.
THE END