Austo Automobile: EDA Project by Anjali Sinha


Austo Automobile

Case Study

By Anjali Sinha

Contents

Heading

1. Austo Motor Company - What is the background and objective of the case study?
2. Data overview - important technical information about the dataset that a
database administrator would be interested in
2.A. Take a critical look at the data and do a preliminary analysis of the variables
2.B. Do a quality check of the data - check for and treat missing values.
Are there any discrepancies present in the data?
2.C. Checking data irregularities or wrong entries in other categorical data
2.D. Check the statistical summary
2.E. Check for and treat (if needed) data irregularities
3. Univariate analysis
3.A. Univariate analysis of numerical data (visualisation and insights that can be
utilized by the business)
3.B. Univariate analysis of categorical data (visualisation and insights that can be
utilized by the business)
4. Bivariate analysis
4.A. Exploring the relationship between all numerical variables
4.B. Exploring the relationship between categorical data
4.C. Exploring the relationship between categorical vs numerical variables
5. Key questions
5.A. Do men tend to prefer SUVs more compared to women?
5.B. What is the likelihood of a salaried person buying a Sedan?
5.C. What evidence or data supports Sheldon Cooper's claim that a salaried male is an
easier target for a SUV sale over a Sedan sale?
5.D. How does the amount spent on purchasing automobiles vary by gender?
5.E. How much money was spent on purchasing automobiles by individuals who took a
personal loan?
5.F. How does having a working partner influence the purchase of higher-priced cars?
6. Actionable insights and business recommendations

List of Figures
Figure 1: Import library and load data
Figure 2: Boxplots of the numerical variables
Figure 3: Univariate analysis of numerical data
Figure 4: Distribution of Partner Salary based on Employment
Figure 5: Univariate analysis of categorical variables
Figure 6: Joint plot of Age Vs Price
Figure 7: Joint plot of No. Of Dependents Vs Price
Figure 8: Pair plot of all numerical variables in the dataset
Figure 9: Correlation heatmap between numerical variables
Figure 10: Bivariate plots for categorical data
Figure 11: Bar plot of Numerical Vs Categorical data
Figure 12: Count plot of Make Vs Gender and Gender Vs Make
Figure 13: Count plot of Gender Vs Sedan Purchase
Figure 14: Count plot of Make Vs Profession
Figure 15: Count plot of Salaried Vs Make
Figure 16: Facet grid bar plot of Profession Vs Gender Vs Make
Figure 17: Count plot of Gender Vs Price
Figure 18: Bar plot of Personal loan Vs Price
Figure 19: Bar plot of Partner working and Price
Figure 20: Facet grid bar plot of Marital status Vs Gender Vs Make

List of Tables

Table 1: Top five rows of the dataset
Table 2: Basic information of the dataset
Table 3: Numerical summarization of the dataset
Table 4: Five-number summary statistics

Case Study - Austo Automobile

1. Background

Austo Motor Company is a leading car manufacturer specializing in SUV, Sedan, and
Hatchback models.

In its recent board meeting, members raised concerns about the efficiency of
the marketing campaign currently being used. The board decided to bring in an analytics
professional to improve the existing campaign.

Objective

They want to analyze the data to get a fair idea about the demand of customers which
will help them in enhancing their customer experience.

I have been roped in as a Data Scientist to perform the data analysis to find answers to
these questions that will help the company to improve the business.

The key to being successful in this business is to be able to detect patterns influencing
the buying of the cars and cater to the demand at any given time.

2. Data overview- important technical information about the dataset that a
database administrator would be interested in?

2.A. Data Overview


These are the important basic and technical information about the dataset that a database
administrator would be interested in.

Import the libraries


Import the libraries (numpy, pandas, matplotlib and seaborn, aliased as sns) and use the
%matplotlib inline magic command to ensure graphs are displayed in the notebook.

Load the datasets


I use pd.read_csv with the file location to load the data into a DataFrame in my
notebook.

Figure 1: Import Library and Load Data from system

Check the structure of the data - The initial steps are conducted to get an overview of the
dataset:
• Observe the first few rows of the dataset to check whether it has been
loaded properly. I use the head() function for this.
• Get the number of rows and columns in the dataset. I use the
shape attribute for this. The dataset provided has 1581 rows and
14 columns.
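
The loading and first-look steps above can be sketched as follows. The three-row inline CSV is a made-up stand-in for the real file (which has 1581 rows and 14 columns), and column names other than Gender, Salary, Partner_working, Partner_salary and Total_salary are assumptions, since the report does not list them.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file.
# In the notebook, the argument would be the path to the dataset instead.
csv_text = """Age,Gender,Profession,Salary,Partner_working,Partner_salary,Total_salary,Price,Make
53,Male,Business,99300,Yes,70700,170000,61000,SUV
29,Female,Salaried,60000,No,0,60000,31000,Sedan
25,Male,Salaried,45000,Yes,30000,75000,22000,Hatchback
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows, to confirm the data loaded correctly
print(df.shape)    # (number of rows, number of columns)
```

On the real dataset, df.shape would be expected to report the 1581 rows and 14 columns mentioned above.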

Table 1: Top five rows of the dataset

Check the types of the data

• We run the info() function to check the data types and to spot null occurrences,
so that any missing values can be treated later. It helps us confirm that each column
is stored in the preferred format and that the values of each
property are as expected.

Table 2- Basic Information of the Dataset

Observations

• A quick look at the dataset information tells us that there are 8 object, 5 integer
and 1 float variables.
• There are a few null records in two variables, Gender and Partner_salary,
which will be analyzed in detail in the next section.
• Gender has 1528 non-null entries out of 1581 rows, which means
53 values are missing and need to be treated.

• Partner_salary has 1475 non-null entries (1581 rows minus the 106 nulls counted in
section 2.B). We need to understand the details and rectify this missing data as well.
• There are no duplicate records in the dataset.
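
A minimal sketch of the structure and duplicate checks described above, run on a few made-up rows. Only the column names Gender and Partner_salary come from this report; Age and all values are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Age": [30, 41, 27, 35],
    "Gender": ["Male", np.nan, "Female", "Male"],
    "Partner_salary": [40000.0, np.nan, 0.0, 25000.0],
})

df.info()                                             # dtypes and non-null counts per column
dtype_counts = df.dtypes.astype(str).value_counts()   # number of columns per dtype
print(dtype_counts)

n_duplicates = int(df.duplicated().sum())             # count of fully identical rows
print("Duplicate rows:", n_duplicates)
```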

2.B. Check for and Treat Missing Values

• Inspecting null values - using isnull().sum(), we observe null values in Gender and
Partner_salary:

Gender - total 53 Nulls

Partner_salary - Total 106 Nulls
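
The null count can be reproduced along these lines. The data below is made up; on the real dataset the same call is what yields the 53 and 106 reported above.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female", "Male", np.nan],
    "Partner_salary": [40000.0, np.nan, 0.0, 25000.0, 30000.0],
    "Salary": [60000, 55000, 70000, 48000, 52000],
})

null_counts = df.isnull().sum()          # nulls per column
null_share = df.isnull().mean() * 100    # the same, as a percentage of rows

print(null_counts[null_counts > 0])      # show only columns that have nulls
print(null_share.round(2))
```

Expressing nulls as a percentage makes it easy to compare against a drop threshold such as the 60% rule discussed next.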

• Handling Nulls

Nulls are usually handled by the following techniques –

a- If the proportion of Null values is more than 60 % of the total number of records in
a column, then drop the column. Here you assume that the column is
uninformative.
b- If any row is missing many records across columns, then that row may also be
dropped.
c- Otherwise, the missing values may be imputed.

For the given data, neither (a) nor (b) is applicable: the affected rows have data
available in the other columns, and dropping them would bias our analysis.

Simple rules for imputation:

a) For categorical variables we can impute the Nulls with the majority class.
For the current dataset, Null values in 'Gender' field are imputed with
'Male' (Male being the majority class).
b) Also, when we check the unique values of the variable during the deep
dive, we observe two instances of a likely data entry issue:
the word Female has been misspelt as 'Femle' and 'Femal'.
For the current dataset we are confident that the category Female has been
misspelt, so we can go ahead and correct these records to the right
spelling, i.e. 'Female', by using the replace function.

We check other categorical variables as well to check for glitches.
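
The Gender clean-up can be sketched as below, on a made-up column. One assumption in this sketch: the misspellings are corrected before the majority class is computed, so that they cannot distort it; the report does not state the order used in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up values containing the two misspellings and one null.
df = pd.DataFrame({"Gender": ["Male", "Femle", np.nan, "Male", "Femal", "Male"]})

# Correct the two misspellings of 'Female'.
df["Gender"] = df["Gender"].replace({"Femle": "Female", "Femal": "Female"})

# Impute remaining nulls with the most frequent category.
majority = df["Gender"].mode()[0]
df["Gender"] = df["Gender"].fillna(majority)

print(df["Gender"].value_counts())
```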

2.C. Checking Data irregularities or wrong entries in other categorical data

Table 3: Checking Data irregularities or wrong entries in other categorical data

We observe that the rest of the categorical fields seem to be free from any such issues.

c) For continuous variables it is possible to impute the null values with the
mean/median of the variable, depending upon the nature of the
distribution. However, more efficient imputation is possible if variables are
internally related. The three salary variables are related to one another
as:

Total_salary = Salary + Partner_salary

Also, a non-zero value in the Partner_salary field is possible only if the binary variable
Partner_working is Yes. Hence for this data, we do a rule-based imputation instead of
the mean/median imputation.

If Partner_working = 'No' then Partner_salary = 0

If Partner_working = 'Yes' then Partner_salary = Total_salary - Salary
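
The two rules above translate fairly directly into pandas. This is a minimal sketch on made-up rows, using the column names given in the rules.

```python
import numpy as np
import pandas as pd

# Made-up rows; the first two have a missing Partner_salary.
df = pd.DataFrame({
    "Salary": [60000, 55000, 70000, 48000],
    "Partner_working": ["Yes", "No", "Yes", "No"],
    "Partner_salary": [np.nan, np.nan, 30000.0, 0.0],
    "Total_salary": [100000, 55000, 100000, 48000],
})

missing = df["Partner_salary"].isnull()
works = df["Partner_working"] == "Yes"

# Rule 1: partner not working -> partner salary is 0.
df.loc[missing & ~works, "Partner_salary"] = 0
# Rule 2: partner working -> recover the value as Total_salary - Salary.
df.loc[missing & works, "Partner_salary"] = df["Total_salary"] - df["Salary"]

print(df)
```

Only rows with a missing Partner_salary are touched, so existing values are never overwritten.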

2.D. Inspecting the 5 Number Summary Statistics of the Dataset (Numerical
fields)

Table 4. 5 number Summary Statistics

The observations and insights from the above table are-

1. Age - The customers are between 22 and 54 years old, so they
belong to the working-age group. Mean age is 31.92 while median age is 29 years,
indicating that the age distribution is positively skewed.
2. No. of dependants varies from 0 to a maximum of 4, with a mean of 2.45 compared to a
median of 2.
3. Salary ranges between 30K and 99.3K and the distribution is
symmetric: the mean and the median are very close and skewness is
negligible.
4. Total_salary ranges between 30K and 171K and does not show a high degree of
skewness.
5. The minimum price of a purchased automobile is 18K, whereas the maximum is 70K.
Price is slightly right-skewed, which is evident from the mean of 35.5K exceeding the
median of 31K. This indicates that a small number of high-priced purchases were made.
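
The mean-versus-median reasoning used above can be checked numerically. The ages below are made up and merely chosen to be right-skewed like the Age column; they are not taken from the dataset.

```python
import pandas as pd

# Made-up ages with a longer right tail.
ages = pd.Series([22, 24, 25, 26, 27, 28, 29, 30, 33, 41, 48, 54], name="Age")

summary = ages.describe()      # count, mean, std, min, quartiles, max
mean_age = summary["mean"]
median_age = summary["50%"]
skewness = ages.skew()         # a value above 0 means a longer right tail

print(summary)
print(f"mean={mean_age:.2f}, median={median_age:.2f}, skew={skewness:.2f}")
```

A mean above the median together with a positive skew value is the pattern the observations describe for Age and Price.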

2.E. Check for and treat (if needed) data irregularities (Inspecting continuous
fields for anomalies/extreme values) –

We also need to check the numerical variables for anomalies. We use boxplots to check for
outliers or negative values in the numerical data.

Figure 2: Boxplot of Numerical Variables

Observations

From the above graphs, I observe that there are no negative values present in the
numerical fields.

However, we have outlier values in the Total_salary variable.

We can do three things to treat outliers-

a) Replace with null values.
b) IQR method- replace the values with lower whisker or upper whisker
values.
c) Drop the observation.

In our case, we can't do (a) or (c), since the purchase is dependent on the salary of the
individual.

We will either treat the outliers with the IQR rule or leave the
data untreated, depending on the outlier percentage.

The outlier percentage in Total_salary is 1.71%, hence it is not
necessary to treat it.
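
An outlier percentage of this kind is typically computed with the boxplot whisker rule. The sketch below assumes the usual 1.5 x IQR whiskers (the report does not state the multiplier); the function name outlier_percentage and the salary values are hypothetical.

```python
import pandas as pd


def outlier_percentage(values: pd.Series) -> float:
    """Percentage of values lying outside the 1.5 * IQR whiskers."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    is_outlier = (values < lower) | (values > upper)
    return float(is_outlier.mean() * 100)


# Made-up total salaries (in thousands) with one extreme value.
total_salary = pd.Series([50, 60, 65, 70, 75, 80, 85, 90, 95, 170], name="Total_salary")
print(f"{outlier_percentage(total_salary):.2f}% outliers")
```

If treatment were needed, option (b) could be applied with values.clip(lower, upper), which caps the values at the whiskers instead of dropping rows.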

3. Univariate Analysis

3.A. Univariate analysis of Numerical Data

For the univariate analysis we look at boxplots and histograms to
get a better understanding of the distributions. Note that these plots were produced
after all data pre-processing (null value imputations and Gender name corrections) had
been done.

Figure.3: Univariate Analysis of Numerical Data

Observation and Insights

a) Age is right-skewed, with most customers between 25 and
38 years old. This is largely the working-age group.
b) Salary has multiple peaks but looks roughly normal overall, with the bulk
of data points in the range 50K to 70K.
c) Partner salary is irregularly distributed. This is partly because the
data also includes customers whose partners do not work,
who contribute a salary of 0, as shown in the figure below.

Figure 4: Distribution of Partner Salary based on Employment

d) Total_salary is skewed, with outliers present.
However, since outliers make up only 1.7% of the data, I didn't treat them. The
bulk of data points lie in the range of 60K to 100K.
e) Price seems to have a bimodal distribution with right skewness; most
purchases fall in the price range of 25k to 38k.
f) Almost all the variables have some skewness, so none of them follows a
normal distribution exactly.

3.B. Univariate Analysis of Categorical Data

For performing Univariate analysis, we will look at the count plot to get better understanding
of the distributions.

Figure 5 : Univariate analysis of Categorical Variables

Observation and insights
a) Profession - The count of salaried customers is slightly higher than that of
business customers (around 900 to 700).
b) Gender - The majority of customers are male.
c) Marital status - The data contains a very small proportion of single customers
compared to married customers.
d) Education - The dataset has an educated population base, with the majority of customers
being Post Graduates.
e) No. of Dependents - The majority of customers have either 2 or 3
dependents, followed by 1 or 4 dependents. Very few customers have no dependents.
f) Personal loan - Customers with and without a personal loan are split
approximately 50:50.
g) House loan - From the plot, the number of customers who took a
house loan is almost half the number who did not.
h) Partner working - Customers with a working partner are slightly more numerous
than customers with a non-working partner or singles.
i) Make - Sedan is the most preferred choice of purchase,
followed by Hatchback and SUV.
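
The numbers behind a count plot can be tabulated directly, which helps when reading approximate values off a chart. A minimal sketch on made-up rows; the column names Make and Profession are assumed from the plot titles.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Make": ["Sedan", "Hatchback", "Sedan", "SUV", "Sedan", "Hatchback"],
    "Profession": ["Salaried", "Business", "Salaried", "Salaried", "Business", "Salaried"],
})

for column in ["Make", "Profession"]:
    counts = df[column].value_counts()                  # absolute counts per category
    shares = df[column].value_counts(normalize=True)    # fraction of all rows
    print(pd.DataFrame({"count": counts, "share": shares.round(2)}))

make_counts = df["Make"].value_counts()
```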

4. Bivariate Analysis

4.A. Exploring the relationship between all numerical variables

I use pairplot, jointplot and heatmap to understand the relationship between numerical variables.

Figure 6: Joint plot of Age Vs Price
Figure 7: Joint plot of No. Of Dependents Vs Price

Figure:8 Pair plot of all Numerical Variables in the Dataset

Figure:9 Correlation Heatmap between Numerical Variables

Observation and insights-

a) From the pair plot, we don't see any significant relationship among the numerical
variables in the dataset, barring age and salary, and age and price.
b) Age vs Price - The joint plot of age and price shows that
customers aged 20-30 years mostly purchase lower-priced
cars.
c) Also, with an increasing number of dependants, higher-priced purchases reduce
slightly, as shown in the joint plot of number of dependents and price. However, customers
with 2 dependents buy cars across all price ranges.
d) The heatmap shows the highest positive correlations between age and price (0.8) and
between partner salary and total salary (0.82), and a moderate correlation between age and
salary (0.62), indicating that in the given dataset salary rises roughly in proportion to age.
e) The heatmap also shows negative correlations between 1) price and no. of dependents,
meaning that with more dependent members lower-priced purchases are more likely, 2) age and
number of dependents, and 3) no. of dependents and salary.
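
The values annotated on a correlation heatmap come from a correlation matrix, which can be inspected on its own. The rows and the column name No_of_Dependents below are made up; they only mimic the direction of the relationships described above.

```python
import pandas as pd

# Made-up numeric rows for illustration only.
df = pd.DataFrame({
    "Age": [24, 28, 31, 36, 42, 50],
    "Salary": [45000, 52000, 58000, 64000, 70000, 80000],
    "No_of_Dependents": [4, 3, 3, 2, 1, 0],
    "Price": [20000, 24000, 30000, 38000, 50000, 65000],
})

corr = df.corr()    # pairwise Pearson correlation (the pandas default)
print(corr.round(2))
```

Passing this matrix to a heatmap function gives the annotated figure; reading the matrix itself avoids misreading colours.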

4.B. Exploring the relationship between Categorical Data

Figure: 10 Bivariate plots for Categorical Data

Observation and insights-


a) Profession vs Gender - Salaried customers outnumber business
customers in the dataset for both genders.
b) House loan vs Make - Customers who have a house loan make fewer purchases
than customers without one, and they
are unlikely to buy an SUV (the costliest make among the three).
Sedan is the most preferred make in both categories.
c) Profession vs Make - Sedan is the most preferred vehicle for both
business and salaried customers, followed by Hatchback and SUV. The
Hatchback share of sales is not affected by the customer's profession.
d) Marital status vs Make - Married customers prefer Sedan>Hatchback>SUV. Single
customers prefer Hatchback>Sedan>SUV. SUV is the last choice in either case,
probably due to cost.
e) Profession vs Personal loan - Not much of an impact is observed.
f) Make vs Gender - Females prefer SUVs and are least likely to buy a Hatchback, whereas
males prefer a Sedan or Hatchback. SUV is the least preferred make among males.
g) Personal loan vs Make - Sedan is the preferred choice whether or not the customer has
taken a personal loan. Hatchback purchases are not affected by the loan. Customers
who have taken a personal loan tend to buy fewer SUVs than customers who haven't
opted for a loan.
h) Partner working vs Make - If the partner is working, sales of all makes are higher
than for customers whose partner is not working. However, the effect of the partner's job
is largest for Sedan sales and minimal for the other makes. Here too, the first choice is
Sedan, followed by Hatchback, with SUV last.
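
Each of the categorical pairings above corresponds to a contingency table. A minimal sketch with pd.crosstab on made-up rows; the column name Marital_status is an assumption.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Marital_status": ["Married", "Married", "Single", "Married", "Single", "Married"],
    "Make": ["Sedan", "SUV", "Hatchback", "Sedan", "Sedan", "Hatchback"],
})

# Rows: marital status, columns: make; margins=True adds row and column totals.
table = pd.crosstab(df["Marital_status"], df["Make"], margins=True)
print(table)
```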

4.C. Exploring the relationship between categorical vs numerical variables

We use bar plots to look at the relationship between categorical variables and numerical
variables.

Figure:11 Bar plot of Numerical Vs Categorical Data

Observation and insights-

a) Gender vs Price - Females tend to spend more than males.
b) Education doesn't have much impact on the buying capacity.
c) Personal loan and profession have negligible impact on the money spent.
d) The bar plot of Make vs Price shows that SUV is the costliest (55k), followed by Sedan (35k)
and Hatchback (27k).
e) The bar plot of Gender vs Salary shows that the average salary is 58k for males and
65k for females.

5. Answers to Key Questions:

5.A. Do men tend to prefer SUVs more compared to women?

Figure: 12 Count plot of Make Vs Gender and Gender Vs Make

Figure 13: Count plot of Gender Vs Sedan Purchase

Analysing the count plots, we can see that females purchase more SUVs than males:
the count of SUVs purchased is 124 for males and 173 for females.

Also, if we look at the proportion of total purchases, males buy SUVs far less often: about
130 out of roughly 1670 purchases, i.e. 7%, compared to approx. 53% for females (about 180
SUVs out of 340 purchases). These are approximate numbers as read from the graph.
Hence, it can be stated that females tend to prefer SUVs more compared to men.
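
Because males far outnumber females in the data, the within-gender share is a fairer comparison than raw counts. One way to compute it exactly, rather than reading it off the graph, is a row-normalised crosstab; the rows below are made up.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "Make": ["Sedan", "Hatchback", "SUV", "Sedan", "SUV", "Sedan"],
})

# normalize="index" makes each row sum to 1: the share of each make within a gender.
share = pd.crosstab(df["Gender"], df["Make"], normalize="index")
suv_share = share["SUV"]
print(suv_share)
```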

5.B. What is the likelihood of a salaried person buying a Sedan?

Here again, we use count plot to visualise the data with Profession and Make as the variables.

Figure 14: Count plot of Make Vs Profession
Figure 15: Count plot of Salaried Vs Make

As visualised from the above graph, we can infer that the likelihood of a salaried person buying
Sedan is higher than the business class.
The notebook gives the following counts, which support the visualization:

Salaried person buying Sedan is 396
Salaried person buying SUV is 208
Salaried person buying Hatchback is 292
We can also infer that a salaried person's first choice is Sedan, followed by Hatchback and SUV.
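
The counts above can be turned into an explicit likelihood, i.e. the share of salaried customers who bought a Sedan. This small calculation uses only the three counts reported in this section.

```python
# Purchases by salaried customers, as reported above.
salaried_counts = {"Sedan": 396, "SUV": 208, "Hatchback": 292}

total_salaried = sum(salaried_counts.values())         # 896 salaried customers
p_sedan = salaried_counts["Sedan"] / total_salaried    # P(Sedan | Salaried)

print(f"P(Sedan | Salaried) = {p_sedan:.1%}")          # about 44.2%
```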

5.C. What evidence or data supports Sheldon Cooper's claim that a salaried male is an
easier target for a SUV sale over a Sedan sale?

For the visualization, I used a facet grid with hue, which conditions the
data on another variable to produce multiple plots.

Figure 16: Facet Grid Bar plot of Profession Vs Gender Vs Make

The above plot clearly shows that the first choice of purchase of vehicles by salaried male is
Sedan followed by Hatchback and the least is SUV.

The numbers tell the same story.

Calculating the share of cars purchased by salaried male customers for each Make, we get –

Hatchback   Sedan   SUV
41%         45%     13%

From the notebook, we also found the following counts by Make-

Salaried male with SUV is 90

Salaried male with Sedan is 305

Hence, Sheldon Cooper's claim that a salaried male is an easier target for a SUV sale
over a Sedan sale is not supported by the data.
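
A breakdown like the one above can be obtained by filtering on both conditions and normalising the counts. The sketch below uses made-up rows, so its shares are illustrative and differ from the reported 41% / 45% / 13%.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Male", "Female", "Male"],
    "Profession": ["Salaried", "Salaried", "Business", "Salaried",
                   "Salaried", "Business", "Salaried"],
    "Make": ["Sedan", "Hatchback", "SUV", "SUV", "Sedan", "Sedan", "SUV"],
})

# Keep only salaried male customers, then compute the share of each make.
salaried_male = df[(df["Gender"] == "Male") & (df["Profession"] == "Salaried")]
make_share = salaried_male["Make"].value_counts(normalize=True)
print(make_share)
```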
5.D. How does the amount spent on purchasing automobiles vary by gender?

Figure:17 Count plot of Gender Vs Price

The above visualization clearly indicates that females
spend more on automobile purchases than males.
The mean spend by females is close to 48K, compared to a male average
spend of 32K.
The median spend by females is 49k, compared to 29k for males.
Hence, both the mean and the median spend are
higher for females than for males. As seen in the earlier count plot of Gender vs Make,
females tend to buy more SUVs, the costliest make, than males.
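
Group-wise mean and median spend can be computed in one step with groupby. The prices below are made up for illustration and are not the dataset's values.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "Price": [25000, 30000, 41000, 40000, 50000, 54000],
})

# Mean and median price per gender.
spend = df.groupby("Gender")["Price"].agg(["mean", "median"])
print(spend)
```

The same pattern, grouping by a personal-loan or partner-working column instead of Gender, applies to questions 5.E and 5.F.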

5.E. How much money was spent on purchasing automobiles by individuals who
took a personal loan?

Figure:18 Bar plot of Personal loan Vs Price

From the above visualization, we can see that the amount spent differs very little between
customers who took a personal loan and those who did not; the average spend is
approximately 34k in both groups.
The notebook confirms this:
Median spend by customers who took a personal loan is 31k, compared
to 32k for customers who didn't opt for a loan.
Mean spend by customers who took a personal loan is 34k, compared
to 34k for customers who didn't opt for a loan.

Hence, the money spent on purchasing automobiles by individuals who
took a personal loan is 34k on average.

5.F. How does having a working partner influence the purchase of higher-priced
cars?

I used a bar plot to analyse the impact of a working partner on the price of the vehicle.

As visualized in the graph below, there is not much difference: the average spend is around
34k in both situations, indicating that whether the partner works has no effect on the
price of the purchase made by the customer.

Figure: 19 Bar plot of Partner working and Price

6. Actionable Insights and Business Recommendations

The main purpose of this exercise is to give inputs to the board members to improve the
efficiency of the marketing campaign currently being used at the Austo Motor Company. The
observation and insight sections above give us a lot of inputs about
the case study. A few more are added below to improve sales of high-end models and to
understand the target audience segments based on age, price, profession or marital
status.

A) We can use inferences from the graphs shown above, and a few more, to target
the audience-
1) Marital status/Gender/Make

Figure: 20 Marital status Vs Gender Vs Make

Married males prefer a Sedan, whereas single males prefer a Hatchback.

Married females prefer an SUV, whereas single females prefer a Sedan.

The marketing team should direct its efforts according to which type of
vehicle the board wants to sell more of.
Married female customers can be targeted more if SUV sales are the focus.

2) We saw that a personal loan has little impact on vehicle buying capacity.
However, to increase sales from this segment, a campaign can be developed that offers a
reduced interest rate to individuals who have opted for a personal loan.

3) Customers who took a housing loan buy fewer vehicles than
customers who don't have one. In this segment too, SUV sales are lowest and Sedan sales are
highest. The gap in car purchases between customers with and without a housing loan is
stark, and purchases of the costlier SUV are especially low in this segment. So the
marketing team can run campaigns to attract customers who took a housing loan, or tie up
with housing associations/banks to offer some cross-buying benefit on vehicles.

4) The younger population (20-30 years) mostly preferred lower-priced cars over
higher variants, as we saw in the joint plot of Age Vs Price.

Campaigns can aim to further increase sales of lower-end models
(volume increment), or to improve sales of high-end cars in this age
group (margin improvement). The volume vs margin goal must be clear.

5) The heatmap for numerical data showed highest positive correlation between age and
price.

6) Marital status vs Make - Married customers prefer Sedan>Hatchback>SUV. Single
customers prefer Hatchback>Sedan>SUV. In both segments, SUV is the last choice,
probably due to cost.

7) The Gender vs Make vs Profession chart shows that females, whether salaried or business
class, prefer SUVs, whereas males prefer Sedans and Hatchbacks. So, campaigns can be focussed
on males to improve SUV sales and on females to improve Hatchback sales.

8) The average male salary was noticed to be on the higher side (around 58K), which means
males can afford SUVs. So, campaigns can be run to understand the true reason for their low
SUV purchases, and focused activities can be done if SUV sales have to be increased.

Hence, based on the goal of the company and the board members, the marketing
campaigns can be aligned accordingly for male/female, married/unmarried,
salaried/business class, loan takers.

Goal of improved profitability via volume or make should be identified and multiple
decisions can be made on campaign choice based on above inferences.

Thank You.
