Autos Automobile.. EDA Project by Anjali Sinha
Case Study
By Anjali Sinha
Contents
1. Austo Motor Company: What is the background and objective of the case study?
2. Data overview: important technical information about the dataset that a database administrator would be interested in
2.A. Take a critical look at the data and do a preliminary analysis of the variables
2.B. Do a quality check of the data: check for and treat missing values. Are there any discrepancies present in the data?
2.C. Checking data irregularities or wrong entries in other categorical data
2.D. Check the statistical summary
2.E. Check for and treat (if needed) data irregularities
3. Univariate Analysis
3.A. Univariate analysis of numerical data (visualisation and insights that can be utilized by the business)
3.B. Univariate analysis of categorical data (visualisation and insights that can be utilized by the business)
4. Bivariate analysis
4.A. Exploring the relationship between all numerical variables
4.B. Exploring the relationship between categorical data
4.C. Exploring the relationship between categorical vs numerical variables
5. Key questions
5.A. Do men tend to prefer SUVs more compared to women?
5.B. What is the likelihood of a salaried person buying a Sedan?
5.C. What evidence or data supports Sheldon Cooper's claim that a salaried male is an easier target for a SUV sale over a Sedan sale?
5.D. How does the amount spent on purchasing automobiles vary by gender?
5.E. How much money was spent on purchasing automobiles by individuals who took a personal loan?
5.F. How does having a working partner influence the purchase of higher-priced cars?
6. Actionable Insights and Business Recommendations
List of Figures
Figure 1: Import library and load Data
Figure 2: Boxplots of the numerical variables
Figure 3: Univariate Analysis of Numerical Data
Figure 4: Distribution of Partner Salary based on Employment
Figure 5: Univariate analysis of Categorical Variables
Figure 6: Joint plot of Age Vs Price
Figure 7: Joint plot of No. Of Dependents Vs Price
Figure 8: Pair plot of all Numerical Variables in the Dataset
Figure 9: Correlation Heatmap between Numerical Variables
Figure 10: Bivariate plots for Categorical Data
Figure 11: Bar plot of Numerical Vs Categorical Data
Figure 12: Count plot of Make Vs Gender and Gender Vs Make
Figure 13: Count plot of Gender Vs Sedan Purchase
Figure 14: Count plot of Make Vs Profession
Figure 15: Count plot of Salaried Vs Make
Figure 16: Facet Grid Bar plot of Profession Vs Gender Vs Make
Figure 17: Count plot of Gender Vs Price
Figure 18: Bar plot of Personal loan Vs Price
Figure 19: Bar plot of Partner working and Price
Figure 20: Facet grid Bar plot Marital status Vs Gender Vs Make
List of Tables
Case Study- Autos Automobile
1. Background
Austo Motor Company is a leading car manufacturer specializing in SUV, Sedan, and
Hatchback models.
In a recent board meeting, members raised concerns about the efficiency of
the marketing campaign currently being used. The board decided to bring in an analytics
professional to improve the existing campaign.
Objective
The company wants to analyze the data to get a fair idea of customer demand, which
will help it enhance the customer experience.
I have been brought in as a Data Scientist to perform the data analysis and answer the
key questions that will help the company improve its business.
The key to being successful in this business is to be able to detect patterns influencing
the buying of the cars and cater to the demand at any given time.
2. Data overview- important technical information about the dataset that a
database administrator would be interested in?
Check the structure of the data: the initial steps are conducted to get an overview of the
dataset:
• Observe the first few rows of the dataset to check whether it has been
loaded properly. I use the head() function for this.
• Get the number of rows and columns in the dataset using the
shape attribute. The dataset provided has 1581 rows and
14 columns.
Table 1: Top five rows of the dataset
• We run the info() function to check the data type of each column and to spot any
null values that need treatment. This confirms that the data is stored in the
preferred format and that each column holds the kind of value we expect.
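The three checks above can be sketched as follows. This is a minimal illustration, not the notebook used for the report: the three-row CSV is a hypothetical stand-in for the real file, and only a few of the 14 columns are shown.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real dataset file.
sample_csv = io.StringIO(
    "Age,Gender,Profession,Make,Price\n"
    "53,Male,Business,SUV,61000\n"
    "29,Female,Salaried,Sedan,31000\n"
    "24,Male,Salaried,Hatchback,22000\n"
)
df = pd.read_csv(sample_csv)

# First few rows: confirms the file was parsed into the expected columns.
print(df.head())

# shape is an attribute (not a method) holding (rows, columns).
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# info() prints each column's dtype and non-null count.
df.info()
```

On the real dataset the same calls would report 1581 rows and 14 columns.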
Observations
• A quick look at the dataset information tells us that there are 8 object, 5 integer
and 1 float variables.
• There are a few null records in two variables, Gender and Partner_salary,
which will be analyzed in detail in the next section.
• Gender has 1528 non-null values against 1581 rows, which means 53 values are
missing and need to be treated.
• Partner_salary has 1472 non-null values, so 109 values are missing. We need to
understand the details and rectify this missing data as well.
• There are no duplicate records in the dataset.
• Inspecting null values: isnull().sum() confirms the occurrence of null values in
Gender and Partner_salary.
• Handling Nulls
a- If more than 60% of the records in a column are null, then drop the column,
on the assumption that the column is uninformative.
b- If a row is missing values across many columns, then that row may also be
dropped.
c- Otherwise, the missing values may be imputed.
For the given data, neither (a) nor (b) applies: the affected rows still hold data
in the other columns, and dropping them would bias the analysis. The missing values are
therefore imputed.
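The decision rule above can be checked with a short null audit. The sketch below runs on a hypothetical five-row frame; the 60% cut-off comes from rule (a), and everything else is illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps in the same two columns as the real data.
df = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male", "Male"],
    "Partner_salary": [40000.0, np.nan, 0.0, np.nan, 25000.0],
    "Price": [30000, 45000, 50000, 22000, 31000],
})

null_counts = df.isnull().sum()   # nulls per column
null_share = df.isnull().mean()   # fraction of nulls per column

# Rule (a): a column is a drop candidate only above 60% nulls.
cols_to_drop = null_share[null_share > 0.60].index.tolist()

# Fully duplicated rows, as checked in the observations above.
n_duplicates = int(df.duplicated().sum())

print(null_counts)
print(cols_to_drop, n_duplicates)
```

An empty `cols_to_drop` means no column crosses the threshold, which is the situation described for this dataset.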
a) For categorical variables we can impute the nulls with the majority class.
For the current dataset, null values in the 'Gender' field are imputed with
'Male' (Male being the majority class).
b) When we check the unique values of the variable during the deep
dive, we observe two instances of a likely data entry issue:
the word Female has been misspelt as 'Femle' and 'Femal'.
We are confident that these records belong to the Female category,
so we correct them to 'Female' using the replace
function.
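A minimal sketch of these two Gender fixes is shown below, on a hypothetical eight-value series. The misspellings 'Femle' and 'Femal' are the ones reported above. The order of the two steps is my own choice, so that the majority class is counted on clean labels.

```python
import pandas as pd

# Hypothetical values containing both misspellings and one null.
gender = pd.Series(
    ["Male", "Femle", None, "Female", "Femal", "Male", "Male", "Male"],
    name="Gender",
)

# Step 1: correct the misspelt labels.
gender = gender.replace({"Femle": "Female", "Femal": "Female"})

# Step 2: impute nulls with the majority class (the mode).
majority = gender.mode()[0]
gender = gender.fillna(majority)

print(gender.value_counts())
```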
2.C. Checking Data irregularities or wrong entries in other categorical data
We observe that the rest of the categorical fields seem to be free from any such issues.
A partner can have a salary only if the binary variable Partner_working is Yes, so the
missing Partner_salary values can be derived from the other fields. Hence for this field,
we do a rule-based imputation instead of the mean/median imputation.
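One possible form of such a rule is sketched below. It rests on two assumptions that the report does not state explicitly: that the customer's own salary is stored in a column named Salary, and that Total_salary equals Salary plus Partner_salary. The four rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the column names Salary and Total_salary are assumed.
df = pd.DataFrame({
    "Partner_working": ["Yes", "No", "Yes", "No"],
    "Salary": [60000, 50000, 70000, 45000],
    "Total_salary": [100000, 50000, 95000, 45000],
    "Partner_salary": [40000.0, np.nan, np.nan, 0.0],
})

missing = df["Partner_salary"].isnull()
not_working = df["Partner_working"] == "No"

# Rule 1: a partner who is not working earns nothing.
df.loc[missing & not_working, "Partner_salary"] = 0.0

# Rule 2 (assumption): otherwise the partner's salary is the remainder
# of the household total after the customer's own salary.
df.loc[missing & ~not_working, "Partner_salary"] = (
    df["Total_salary"] - df["Salary"]
)

print(df)
```

Unlike a mean or median fill, this keeps every imputed value consistent with the other columns in the same row.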
2.D. Inspecting the 5 Number Summary Statistics of the Dataset (Numerical
fields)
1. Age: The customers are between 22 and 54 years old, so they
belong to the working age group. The mean age is 31.92 while the median age is 29 years,
indicating that the age distribution is positively skewed.
2. No. of dependents varies from 0 to a maximum of 4, with a mean of 2.45 compared to a
median of 2.
3. The salary of the customers ranges between 30K and 99.3K and the distribution is
symmetric. The mean and the median values are very close and skewness is
negligible.
4. Total_salary ranges between 30K and 171K and does not show a high degree of
skewness.
5. The minimum price of the purchased automobile is 18K, whereas the maximum is 70K.
Price is slightly skewed, which is evident from the mean (35.5K) being greater than the
median (31K). This indicates that a small number of high-priced purchases were made.
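The mean-versus-median reading of skewness used above can be reproduced with describe() and skew(). The eight prices below are hypothetical values chosen only to span the reported 18K to 70K range.

```python
import pandas as pd

# Hypothetical prices with a couple of expensive purchases in the tail.
price = pd.Series(
    [18000, 22000, 25000, 31000, 31000, 33000, 55000, 70000],
    name="Price",
)

summary = price.describe()  # count, mean, std, min, quartiles, max
mean_minus_median = summary["mean"] - summary["50%"]
skewness = price.skew()     # sample skewness; > 0 means a right tail

print(summary)
print(mean_minus_median, skewness)
```

A mean above the median together with a positive skew value is the pattern described for Age and Price.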
2.E. Check for and treat (if needed) data irregularities (Inspecting continuous
fields for anomalies/extreme values) –
We also need to check the continuous variables for anomalies. We use boxplots to check for
outliers or negative values in the numerical data.
Figure 2: Boxplot of Numerical Variables
Observations
From the above graphs, I observe that there are no negative values present in the
numerical fields.
In our case, we can’t do (a) and (c) since the purchase is dependent on the salary of the
individual.
We will analyse the data either after treating the outliers with the IQR rule or without
treating them at all, depending on the outlier percentage.
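A minimal sketch of the IQR rule and the outlier percentage check follows. The helper names, the conventional 1.5 multiplier and the ten salary values are my own assumptions for illustration.

```python
import pandas as pd


def iqr_fences(values: pd.Series, k: float = 1.5):
    """Return the lower and upper fences of the IQR rule."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outlier_share(values: pd.Series, k: float = 1.5) -> float:
    """Fraction of values lying outside the IQR fences."""
    lower, upper = iqr_fences(values, k)
    return float(((values < lower) | (values > upper)).mean())


def cap_to_fences(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Treat outliers by capping them at the fences."""
    lower, upper = iqr_fences(values, k)
    return values.clip(lower=lower, upper=upper)


# Hypothetical household salaries with one extreme value.
total_salary = pd.Series(
    [30000.0, 40000.0, 50000.0, 60000.0, 70000.0,
     80000.0, 90000.0, 100000.0, 110000.0, 171000.0]
)
share = outlier_share(total_salary)
capped = cap_to_fences(total_salary)
print(share, capped.max())
```

If the share is small, the analysis can reasonably be run both with and without capping and the results compared.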
3. Univariate Analysis
For univariate analysis of the numerical data, we look at boxplots and histograms to
get a better understanding of the distributions. Note that these plots have been produced
after all data pre-processing (null value imputations and Gender name corrections) has
been done.
Figure 3: Univariate Analysis of Numerical Data
a) Age is right-skewed, with most customers between 25 and
38 years old. This is largely the working-age group.
b) Salary has multiple peaks but appears close to a normal distribution overall, with the bulk
of data points in the range 50K to 70K.
c) Partner salary is irregularly distributed. This is because the plot covers all
customers, including those whose partner is not working, who contribute
a salary of 0, as shown in the figure below.
For univariate analysis of the categorical data, we look at count plots to get a better understanding
of the distributions.
Figure 5: Univariate analysis of Categorical Variables
Observation and insights
a) Profession: The count of Salaried customers is somewhat higher than that of
Business customers (around 900 versus 700).
b) Gender: The majority of customers are male.
c) Marital status: The data contains a very small proportion of single customers
compared to married customers.
d) Education: The dataset has an educated population base, with the majority of the customers
being Post Graduates.
e) No. of Dependents: The majority of the customers have either 2 or 3
dependents, followed by 1 or 4 dependents. Very few customers have no dependents.
f) Personal loan: The dataset is split roughly 50:50 between customers with and without
a personal loan.
g) House loan: From the plot, we can make out that the number of customers who took a
House Loan is almost half the number of customers who did not.
h) Partner working: The number of customers with a
working partner is slightly higher than the number with a non-working partner or no partner.
i) Make: Sedan is the most preferred choice of purchase,
followed by Hatchback and SUV.
4. Bivariate Analysis
I use pairplot, jointplot and heatmap to understand the relationship between numerical variables.
Figure 6: Joint plot of Age Vs Price
Figure 7: Joint plot of No. Of Dependents Vs Price
Figure 8: Pair plot of all Numerical Variables in the Dataset
Observation and insights-
a) From the pair plot, we don’t see any significant relationship among the numerical
variables in the dataset, barring age and salary, and age and price.
b) Age vs Price: The joint plot of age and price shows that
customers in the age group of 20-30 years mostly purchase lower-priced
cars.
c) With an increasing number of dependents, higher-priced purchases decline
slightly, as shown in the joint plot of number of dependents and price. However, customers with 2
dependents buy cars across all price ranges.
d) The heatmap shows the highest positive correlations between age and price (0.8) and between partner salary
and total salary (0.82), and a moderate correlation between age and salary (0.62), indicating that in the
given dataset salary tends to rise with age.
e) The heatmap also shows negative correlations between 1) price and no. of dependents, indicating that
customers with more dependents tend to choose lower-priced cars, 2) age and
number of dependents, and 3) no. of dependents and salary.
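The coefficients read off the heatmap come from a pairwise correlation matrix, which can be sketched as below. The six rows and the column name No_of_Dependents are hypothetical, so the printed values will not match the 0.8 and 0.82 reported above.

```python
import pandas as pd

# Hypothetical rows: price rises with age, dependents fall with age.
df = pd.DataFrame({
    "Age": [22, 25, 30, 35, 40, 50],
    "Price": [18000, 22000, 30000, 41000, 52000, 68000],
    "No_of_Dependents": [4, 3, 3, 2, 1, 0],
})

# Pearson correlation between every pair of numerical columns.
corr = df.corr()
print(corr.round(2))

age_price = corr.loc["Age", "Price"]
dependents_price = corr.loc["No_of_Dependents", "Price"]
```

This matrix is what a plotting library would render as a heatmap. A value close to +1 or -1 indicates a strong linear relationship, not a causal one.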
Figure 10: Bivariate plots for Categorical Data
g) Personal loan Vs Make: Sedan is the preferred choice whether or not the customer has taken a
personal loan. Hatchback purchases are not affected by the loan. Customers
who have taken a personal loan tend to buy fewer SUVs than customers who have not
opted for a loan.
h) Partner working vs Make: If the partner is working, sales of all makes are higher
than for customers whose partner is not working. However, the impact of the partner's job is largest
for Sedan sales and minimal for the other makes. Here also, the first choice is Sedan,
followed by Hatchback, with SUV last.
We use bar plots to look at the relationship between categorical variables and numerical
variables.
Figure 11: Bar plot of Numerical Vs Categorical Data
5. Key questions
5.A. Do men tend to prefer SUVs more compared to women?
Figure 13: Count plot of Gender Vs Sedan Purchase
Analysing the count plot, we can notice that the Female tend to purchase approx.
The counts of SUVs purchased by each gender show 124 SUVs bought by male customers and
173 by female customers.
Looking at the share of total purchases, male customers bought about 130 SUVs out of
roughly 1670 purchases, i.e. roughly 8%, whereas female customers bought about 180 SUVs out of
340 purchases, i.e. approx. 53% (these are approximate numbers as visualised from the graph).
Hence, it can be stated that women tend to prefer SUVs more than men do.
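Raw SUV counts can mislead when one gender has far more customers, so the comparison is better made on within-gender shares. The sketch below uses a hypothetical ten-row frame, not the case-study data.

```python
import pandas as pd

# Hypothetical purchases: 4 by female customers, 6 by male customers.
df = pd.DataFrame({
    "Gender": ["Female"] * 4 + ["Male"] * 6,
    "Make": ["SUV", "SUV", "Sedan", "Hatchback",
             "Sedan", "Sedan", "Hatchback", "SUV", "Hatchback", "Sedan"],
})

# Absolute counts of each Make per Gender.
counts = pd.crosstab(df["Gender"], df["Make"])

# Row-normalised shares: each gender's purchases sum to 1.
shares = pd.crosstab(df["Gender"], df["Make"], normalize="index")

print(counts)
print(shares.round(2))
```

Comparing `shares.loc["Female", "SUV"]` with `shares.loc["Male", "SUV"]` answers the question directly, independent of group size.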
5.B. What is the likelihood of a salaried person buying a Sedan?
Here again, we use a count plot to visualise the data, with Profession and Make as the variables.
Figure 14: Count plot of Make Vs Profession
Figure 15: Count plot of Salaried Vs Make
As visualised in the above graphs, more salaried customers than business customers buy
a Sedan.
The notebook gives the following counts, which support the visualization:
Salaried person buying Sedan is 396
Salaried person buying SUV is 208
Salaried person buying Hatchback is 292
Out of 896 salaried customers, 396 bought a Sedan, so the likelihood of a salaried person buying a
Sedan is about 44% (396/896). The salaried person's first choice is Sedan, followed by Hatchback and SUV.
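The likelihood asked for in this question is a conditional proportion: Sedan purchases among salaried customers only. A small sketch using the three counts reported above is given below; the variable names are my own.

```python
# Counts of purchases by salaried customers, as reported above.
salaried_counts = {"Sedan": 396, "SUV": 208, "Hatchback": 292}

total_salaried = sum(salaried_counts.values())

# P(Make = Sedan | Profession = Salaried)
p_sedan_given_salaried = salaried_counts["Sedan"] / total_salaried

# Makes ordered from most to least purchased.
ranking = sorted(salaried_counts, key=salaried_counts.get, reverse=True)

print(total_salaried)
print(f"{p_sedan_given_salaried:.1%}")
print(ranking)
```

This prints a total of 896 and a likelihood of 44.2%, with Sedan ranked ahead of Hatchback and SUV.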
5.C. What evidence or data supports Sheldon Cooper's claim that a salaried male is an
easier target for a SUV sale over a Sedan sale?
To visualise the data, I used a facet grid, which conditions the plot on another variable
and uses hue to produce multiple plots.
The above plot clearly shows that the first choice of salaried male customers is
Sedan, followed by Hatchback, with SUV the least purchased.
Calculating Total number of Cars purchased by Salaried Male Customers for each Make, we get –
From the notebook also, we found that the count of Make is as below-
Hence, Sheldon Cooper's claim that a salaried male is an easier target for an SUV sale
than for a Sedan sale is not supported by the data.
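The check behind this conclusion is a filter on two conditions followed by a count per Make. The seven rows below are hypothetical and do not reproduce the notebook's actual counts.

```python
import pandas as pd

# Hypothetical purchases, not the case-study data.
df = pd.DataFrame({
    "Profession": ["Salaried", "Salaried", "Salaried", "Salaried",
                   "Salaried", "Business", "Business"],
    "Gender": ["Male", "Male", "Male", "Male",
               "Female", "Male", "Male"],
    "Make": ["Sedan", "Sedan", "Hatchback", "SUV",
             "SUV", "SUV", "Sedan"],
})

# Keep only salaried male customers.
salaried_male = df[(df["Profession"] == "Salaried") & (df["Gender"] == "Male")]

# Number of cars bought per Make within that segment.
make_counts = salaried_male["Make"].value_counts()

# The claim holds only if SUV purchases exceed Sedan purchases.
suv_over_sedan = bool(make_counts.get("SUV", 0) > make_counts.get("Sedan", 0))

print(make_counts)
print(suv_over_sedan)
```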
5.D. How does the amount spent on purchasing automobiles vary by gender?
The above visualization clearly indicates that female customers
spend more on automobile purchases than male customers.
The mean spend by female customers is close to 48K, compared to an average
spend of 32K by male customers.
The median spend by female customers is 49K, compared to 29K for male customers.
Hence, both the mean and the median spend are higher for female customers.
As seen in the earlier count plot of Gender vs Make, female customers
buy more SUVs, the costlier make, than male customers do.
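Mean and median spend per group can be computed in one step with a grouped aggregation. The seven rows below are hypothetical toy values, not records from the dataset. The same pattern applies to Personal_loan or Partner_working in place of Gender, assuming those column names.

```python
import pandas as pd

# Hypothetical toy rows, not the case-study data.
df = pd.DataFrame({
    "Gender": ["Female", "Female", "Female",
               "Male", "Male", "Male", "Male"],
    "Price": [45000, 49000, 50000,
              22000, 29000, 29000, 48000],
})

# One row per gender with the mean, median and number of purchases.
spend = df.groupby("Gender")["Price"].agg(["mean", "median", "count"])
print(spend)
```

Reporting the median alongside the mean guards against a few expensive purchases pulling the average up.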
5.E. How much money was spent on purchasing automobiles by individuals who
took a personal loan?
From the above visualization, we can see that there is very little difference in the amount
spent between customers who took a personal loan and those who did not. The average spend
is approximately 34K in both groups.
The notebook confirms this:
Median spend by customers who took a personal loan is 31K, compared
to 32K for customers who did not opt for a loan.
Mean spend by customers who took a personal loan is 34K, compared
to 34K for customers who did not opt for a loan.
Hence, individuals who took a personal loan spent 34K on average on purchasing
automobiles.
5.F. How does having a working partner influence the purchase of higher-priced
cars?
I used a bar plot to analyse the impact of a working partner on the price of the vehicle.
As visualized in the graph below, there is not much difference: the average spend is around
34K in both situations, indicating that whether the partner is working has no visible effect on the
purchase made by the customer.
6. Actionable Insights and Business Recommendations
The main purpose of this exercise is to give inputs to the board members to improve the
efficiency of the marketing campaign currently being used by Austo Motor Company. The
observation and insight sections above already provide many inputs about
the case study. A few more are added below to improve the sale of high-end models and to
identify target audience segments based on age, price, profession or marital
status.
A) We can use inferences from the graphs shown above, and a few more, to target
the audience:
1. Marital status/Gender/Make
The marketing team should direct its efforts according to which type of vehicle the board
wants to sell more of.
Married female customers can be targeted more if SUV sales are the focus.
2) We saw that a personal loan has little impact on vehicle buying capacity.
However, to increase sales from this segment, a campaign can be developed to reduce the
interest rate for individuals who have opted for a personal loan.
3) Customers who took a housing loan buy fewer vehicles than
customers who don’t have a housing loan. Within this group too, SUV sales are lowest and Sedan sales are
highest. The marketing team can run campaigns to attract
customers who took a housing loan, or tie up with housing associations/banks to offer
cross-buying benefits on vehicles, since car purchases by housing-loan customers are
starkly lower than those by customers without a housing loan, and
purchases of the costlier SUV are especially low in this segment.
4) The younger population (20-30 years) mostly preferred lower-priced cars over
higher variants, as we saw in the joint plot of Age Vs Price.
Campaigns can be focused on further increasing sales of lower-end models
(volume increment), or on improving sales of high-end cars in this age
group (margin improvement). The volume vs margin goal must be clear.
5) The heatmap for numerical data showed one of the highest positive correlations between age and
price.
7) The Gender vs Make vs Profession chart shows that female customers, whether salaried or business class,
prefer SUVs, whereas male customers prefer Sedans and Hatchbacks. So, campaigns can be focussed
on males to improve sales of SUVs and on females to improve sales of Hatchbacks.
8) The average salary of male customers was noticed to be on the higher side, which means they can afford
SUVs. So, campaigns can be run to understand the true reason for their lower SUV purchases, and
focused activities can be carried out if SUV sales have to be increased.
Hence, based on the goal of the company and the board members, the marketing
campaigns can be aligned to the relevant segments: male/female, married/unmarried,
salaried/business class, and loan takers.
The goal of improved profitability via volume or make should be identified first, and
campaign choices can then be made based on the above inferences.
Thank You.