Data Mining
PROJECT
Clustering:
Digital Ads Data:
ads24x7 is a digital marketing company that has received seed funding of $10 million.
The company is expanding into marketing analytics. It has collected data from its
Marketing Intelligence team and now wants you (its newly appointed data analyst) to
segment ad types based on the features provided. Use a clustering procedure to segment ads
into homogeneous groups.
The following three features are commonly used in digital marketing:
CPM = (Total Campaign Spend / Number of Impressions) * 1,000
CPC = Total Cost (spend) / Number of Clicks
CTR = (Total Measured Clicks / Total Measured Ad Impressions) * 100
1.1 - Clustering: Read the data and perform basic analysis such as printing a few rows (head
and tail), info, data summary, null values, duplicate values, etc.
Answer:
Loading and viewing the dataset:
Viewing the top 5 rows:
Tab :1.1
Tab :1.2
The dataset has 25857 rows and 19 columns.
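The basic checks listed above could be sketched with pandas as follows. The small in-memory frame and its column names are placeholders used only for illustration; the real data would be loaded from the project file, whose name and full schema are not given here.

```python
import pandas as pd

# Hypothetical stand-in for the ads data (real data would come from pd.read_csv / pd.read_excel).
df = pd.DataFrame({
    "Device Type": ["Mobile", "Desktop", "Mobile", "Mobile"],
    "Impressions": [1000, 2500, 1000, 400],
    "Clicks": [10, 20, 10, 4],
    "CTR": [1.0, None, 1.0, 1.0],
})

print(df.head())       # first rows
print(df.tail())       # last rows
print(df.shape)        # (number of rows, number of columns)
df.info()              # column dtypes and non-null counts
print(df.describe())   # summary statistics for numeric columns

null_counts = df.isnull().sum()          # missing values per column
duplicate_count = df.duplicated().sum()  # rows that exactly repeat an earlier row
print(null_counts)
print(duplicate_count)
```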
Tab :1.3
Tab :1.4
1.2 - Clustering: Treat missing values in CPC, CTR and CPM using the formula given.
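The missing values in CPC, CTR and CPM are treated by writing a user-defined function that
recomputes each metric from its formula, and calling it on the data:
CPM = (Total Campaign Spend / Number of Impressions) * 1,000
CPC = Total Cost (Spend) / Number of Clicks
CTR = (Total Measured Clicks / Total Measured Ad Impressions) * 100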
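A minimal sketch of such a user-defined function is shown below. It is not the function used in this report; the column names Spend, Impressions and Clicks are assumptions, and only values that are missing are recomputed from the formulas given earlier.

```python
import numpy as np
import pandas as pd

def fill_ad_metrics(df):
    """Fill missing CPM, CPC and CTR from spend, impressions and clicks.

    Column names are hypothetical; adjust them to the actual dataset.
    """
    out = df.copy()
    cpm = out["Spend"] / out["Impressions"] * 1000
    cpc = out["Spend"] / out["Clicks"]
    ctr = out["Clicks"] / out["Impressions"] * 100
    # fillna only touches rows where the metric is missing
    out["CPM"] = out["CPM"].fillna(cpm)
    out["CPC"] = out["CPC"].fillna(cpc)
    out["CTR"] = out["CTR"].fillna(ctr)
    return out

# Toy data: the first row has all three metrics missing.
ads = pd.DataFrame({
    "Spend": [50.0, 20.0],
    "Impressions": [10000, 4000],
    "Clicks": [100, 50],
    "CPM": [np.nan, 5.0],
    "CPC": [np.nan, 0.4],
    "CTR": [np.nan, 1.25],
})
filled = fill_ad_metrics(ads)
print(filled)
```

Rows with zero clicks or zero impressions would produce infinite or undefined values under these formulas, so they may need separate handling.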
1.3 - Clustering: Check if there are any outliers. Do you think treating outliers is necessary for
K-Means clustering? Based on your judgement, decide whether to treat outliers and, if yes,
which method to employ. (As an analyst, your judgement may differ from that of another
analyst.)
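K-Means minimises squared distances to the cluster centres, so extreme values can pull the centres towards them. One common way to check for outliers is the 1.5 x IQR rule, and one possible treatment is to cap values at the whisker limits. The sketch below illustrates that choice on toy numbers; it is an assumption for illustration, not necessarily the method applied in this report.

```python
import pandas as pd

def iqr_bounds(s):
    """Return the lower and upper 1.5 x IQR limits of a numeric Series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def cap_outliers(s):
    """Clip values lying outside the IQR limits to those limits."""
    lower, upper = iqr_bounds(s)
    return s.clip(lower=lower, upper=upper)

# Toy column with one extreme value.
clicks = pd.Series([10, 12, 11, 13, 12, 500], name="Clicks")
lower, upper = iqr_bounds(clicks)
n_outliers = int(((clicks < lower) | (clicks > upper)).sum())
capped = cap_outliers(clicks)
print(n_outliers, lower, upper)
print(capped.tolist())
```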
Fig :1.2
Fig :1.3
1.4 - Clustering: Perform z-score scaling and discuss how it affects the speed of the algorithm.
Dropping some columns and checking top 5 rows:
Tab :1.6
Tab:1.7
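The z-score scaling step of section 1.4 can be sketched with sklearn's StandardScaler. The two toy columns are placeholders. Scaling puts all features on a comparable scale, so no single large-valued feature dominates the Euclidean distances; this often helps K-Means converge in fewer iterations, although the effect depends on the data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy numeric features on very different scales (hypothetical values).
features = pd.DataFrame({
    "Clicks": [10.0, 20.0, 30.0, 40.0],
    "Spend": [100.0, 400.0, 900.0, 1600.0],
})

scaler = StandardScaler()  # z = (x - mean) / std, using the population std
scaled = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)

print(scaled.head())
print(scaled.mean().round(6))       # approximately 0 for every column
print(scaled.std(ddof=0).round(6))  # approximately 1 for every column
```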
Fig :1.4
Viewing only the last 10 merged clusters by truncating the dendrogram with p=10, we get:
Tab:1.9
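A truncated view of the last 10 merges can be produced with scipy's dendrogram function, as sketched below. The random data and the choice of Ward linkage are assumptions for illustration; the report does not state which linkage method was used.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Synthetic stand-in for the scaled ads features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

# Each row of Z records one merge: (cluster a, cluster b, distance, size).
Z = linkage(X, method="ward")  # linkage method is an assumption

# truncate_mode="lastp" keeps only the last p merged clusters as leaves.
# no_plot=True returns the tree layout without drawing it.
tree = dendrogram(Z, truncate_mode="lastp", p=10, no_plot=True)
print(tree["ivl"])  # labels of the 10 leaves that remain after truncation
```

Dropping no_plot=True draws the truncated dendrogram with matplotlib.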
WSS (within-cluster sum of squares):
1.6 - Clustering: Make Elbow plot (up to n=10) and identify optimum number of clusters for k-
means algorithm.
Fig :1.6
Moving from k=1 to k=2, there is a significant drop in the WSS value; the drops from
k=2 to k=3 and from k=3 to k=4 are significant as well.
From k=4 to k=5 and from k=5 to k=6, however, the drop becomes much smaller.
In other words, the WSS does not drop significantly beyond k=4, so 4 is the optimal number of
clusters.
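The WSS values behind an elbow plot can be collected from the inertia_ attribute of a fitted KMeans model. The sketch below uses synthetic blobs rather than the ads data, so its numbers will not match the figures in this report.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four groups, standing in for the scaled ads features.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    wss.append(km.inertia_)  # within-cluster sum of squares for this k

# Drop in WSS when going from k to k+1; the elbow is where the drops flatten out.
drops = -np.diff(wss)
for k, w in enumerate(wss, start=1):
    print(k, round(w, 2))

# To draw the elbow plot: plt.plot(range(1, 11), wss, marker="o") with matplotlib.
```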
1.7 - Clustering: Print silhouette scores for up to 10 clusters and identify optimum number of
clusters.
Two functions are used here: silhouette_samples and silhouette_score.
The silhouette_score function computes the average of all the silhouette widths.
The silhouette_samples function computes the silhouette width for each individual row.
Tab: 1.10
silhouette_score:
Since the silhouette_score is 0.5 (on a scale from -1 to 1), we can conclude that the clusters are
reasonably well distinguished.
The 4 clusters that are created have a silhouette_score of 0.50.
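The relationship between the two functions can be checked directly: the mean of the per-row widths from silhouette_samples equals the value returned by silhouette_score. The sketch below runs on synthetic blobs, so the scores it prints are not those of the ads data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic data standing in for the scaled ads features.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# The silhouette score needs at least 2 clusters, so k starts at 2.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))

best_k = max(scores, key=scores.get)  # k with the highest average width

# Per-row silhouette widths for the chosen k.
labels = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit_predict(X)
widths = silhouette_samples(X, labels)
print(best_k, widths.min(), widths.mean())
```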
Tab: 1.11
1.8 - Clustering: Profile the ads based on optimum number of clusters using silhouette score
and your domain understanding [Hint: Group the data by clusters and take sum or mean to
identify trends in Clicks, spend, revenue, CPM, CTR, & CPC based on Device Type. Make bar
plots].
Cluster Profiling:
Tab: 1.12
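Cluster profiling along the lines of the hint can be sketched with a pandas groupby. The column names (including the cluster label column) and the values below are hypothetical.

```python
import pandas as pd

# Toy ads table with a cluster label already attached (hypothetical columns).
ads = pd.DataFrame({
    "Device Type": ["Mobile", "Desktop", "Mobile", "Desktop"],
    "cluster": [0, 0, 1, 1],
    "Clicks": [100, 50, 10, 20],
    "Spend": [40.0, 30.0, 5.0, 15.0],
    "Revenue": [60.0, 45.0, 6.0, 20.0],
})

metrics = ["Clicks", "Spend", "Revenue"]

# Average of each metric per cluster.
profile = ads.groupby("cluster")[metrics].mean()

# Totals per cluster, split by device type.
by_device = ads.groupby(["cluster", "Device Type"])[metrics].sum()

print(profile)
print(by_device)

# A bar plot per metric could then be drawn with, for example:
# by_device["Clicks"].unstack().plot(kind="bar")
```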
Part 2
PCA:
PCA FH (FT): Primary census abstract for female headed households excluding institutional
households (India & States/UTs - District Level), Scheduled tribes - 2011 PCA for Female Headed
Household Excluding Institutional Household. The Indian Census has the reputation of being
one of the best in the world. The first Census in India was conducted in the year 1872. This was
conducted at different points of time in different parts of the country. In 1881 a Census was
taken for the entire country simultaneously. Since then, Census has been conducted every ten
years, without a break. Thus, the Census of India 2011 was the fifteenth in this unbroken series
since 1872, the seventh after independence and the second census of the third millennium and
twenty-first century. The census has continued uninterrupted despite several
adversities such as wars, epidemics, natural calamities, political unrest, etc. The Census of India is
conducted under the provisions of the Census Act 1948 and the Census Rules, 1990. The
Primary Census Abstract, an important publication of the 2011 Census, gives basic information
on Area, Total Number of Households, Total Population, Scheduled Castes, Scheduled Tribes
Population, Population in the age group 0-6, Literates, Main Workers and Marginal Workers
classified by the four broad industrial categories, namely, (i) Cultivators, (ii) Agricultural
Laborers, (iii) Household Industry Workers, and (iv) Other Workers and also Non-Workers. The
characteristics of the Total Population include Scheduled Castes, Scheduled Tribes, Institutional
and Houseless Population and are presented by sex and rural-urban residence. Census 2011
covered 35 States/Union Territories, 640 districts, 5,924 sub-districts, 7,935 Towns and
6,40,867 Villages.
The data collected has so many variables that it is difficult to find useful details without
using data science techniques. You are tasked with performing a detailed EDA and identifying the optimum
principal components that explain the most variance in the data. Use sklearn only.
2.1 PCA: Read the data and perform basic checks like checking head, info, summary, nulls,
and duplicates, etc.
Tab: 2.1
Fig : 2.2
Fig : 2.3
There are 640 rows and 61 columns.
Fig : 2.
59 of the 61 columns are of int data type and 2 columns are of object (categorical) data type. There are no null
values.
Checking for duplicate values.
Fig : 2.
2.2 PCA: Perform detailed Exploratory analysis by creating certain questions like (i) Which
state has highest gender ratio and which has the lowest? (ii) Which district has the
highest and lowest gender ratio? (Example questions.) Pick 5 variables out of the given
24 variables.
Answer:
Which state has the highest population?
Fig :2.1
Which state has the highest total female population?
Fig :2.2
Which state has the highest total male population?
Fig :2.3
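State-level questions such as the ones above can be answered by aggregating the district rows per state. In the sketch below, the State column name, the toy values, and the definition of total population as TOT_M + TOT_F are assumptions made for illustration.

```python
import pandas as pd

# Toy district-level rows (hypothetical state names and counts).
census = pd.DataFrame({
    "State": ["A", "A", "B", "C"],
    "TOT_M": [100, 150, 300, 80],
    "TOT_F": [120, 160, 290, 100],
})

# Sum district rows to state level.
by_state = census.groupby("State")[["TOT_M", "TOT_F"]].sum()
by_state["TOT_POP"] = by_state["TOT_M"] + by_state["TOT_F"]
by_state["F_PER_M"] = by_state["TOT_F"] / by_state["TOT_M"]  # females per male

top_population = by_state["TOT_POP"].idxmax()
top_female = by_state["TOT_F"].idxmax()
top_male = by_state["TOT_M"].idxmax()
highest_ratio = by_state["F_PER_M"].idxmax()
lowest_ratio = by_state["F_PER_M"].idxmin()

print(by_state)
print(top_population, top_female, top_male, highest_ratio, lowest_ratio)
```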
For EDA - Variables considered:
No_HH: Number of Households
TOT_M: Total Population Male
TOT_F: Total Population Female
TOT_WORK_M: Total Worker Population Male
TOT_WORK_F: Total Worker Population Female
Univariate Analysis:
Plotting histograms and boxplots for the above variables:
Fig :2.4
Bivariate Analysis:
Fig :2.5
2.3 PCA: We choose not to treat outliers for this case. Do you think that treating outliers
for this case is necessary?
2.4 PCA: Scale the Data using z-score method. Does scaling have any impact on outliers?
Compare boxplots before and after scaling and comment.
Answer:
Tab:2.2
We have 57 features.
Check for presence of outliers in each feature
Plotting box plot before scaling the new data which contains only numerical columns.
Fig : 2.6
Scaling the dataset using the z-score and checking the top 5 rows of the scaled dataset:
Table 2.3
The data has been scaled.
Checking for outliers in the scaled data:
Fig : 2.7
Fig : 2.8
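Z-score scaling is a linear transformation (subtract the mean, divide by the standard deviation), so it changes the axis of a boxplot but not the relative position of the points. A point flagged by the 1.5 x IQR rule before scaling is therefore still flagged after scaling. The sketch below checks this on a toy column; the values are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def iqr_outlier_mask(x):
    """Boolean mask of values outside the 1.5 x IQR boxplot whiskers."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# One toy feature with a single extreme value.
raw = np.array([[10.0], [12.0], [11.0], [13.0], [12.0], [500.0]])
scaled = StandardScaler().fit_transform(raw)

mask_raw = iqr_outlier_mask(raw[:, 0])
mask_scaled = iqr_outlier_mask(scaled[:, 0])

print(mask_raw)
print(mask_scaled)
print(np.array_equal(mask_raw, mask_scaled))  # same rows flagged before and after
```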
2.5 PCA: Perform all the required steps for PCA (use sklearn only). Create the covariance
matrix. Get the eigenvalues and eigenvectors.
Answer:
Extracting the eigenvectors and looking at the PCA components:
Tab: 2.4
Tab: 2.5
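With sklearn alone, the covariance matrix, eigenvalues and eigenvectors can all be read from a fitted PCA object, as sketched below on synthetic data (not the census data). When all components are kept, get_covariance() reproduces the sample covariance of the scaled data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data with some correlation between the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] += 2 * X[:, 0]

X_scaled = StandardScaler().fit_transform(X)

# Keep all components so that nothing is discarded.
pca = PCA(n_components=X_scaled.shape[1])
pca.fit(X_scaled)

cov_matrix = pca.get_covariance()     # covariance matrix of the scaled data
eigenvalues = pca.explained_variance_ # variance along each PC, largest first
eigenvectors = pca.components_        # one row per PC, one column per feature

print(cov_matrix.round(3))
print(eigenvalues.round(3))
print(eigenvectors.round(3))
```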
Explained variance = (eigenvalue of each PC) / (sum of eigenvalues of all PCs)
Checking the explained variance for each PC:
Tab:2.6
Organizing the above explained variance in a dataframe:
Tab: 2.7
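The explained variance formula above can be verified against sklearn's explained_variance_ratio_ and arranged in a dataframe. The data below is synthetic and the PC labels are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic data whose columns have different spreads.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

pca = PCA().fit(X)  # all components are kept by default here

eigenvalues = pca.explained_variance_
ratio = eigenvalues / eigenvalues.sum()  # eigenvalue of each PC / sum of all eigenvalues

explained = pd.DataFrame(
    {"eigenvalue": eigenvalues, "explained_variance_ratio": ratio},
    index=[f"PC{i + 1}" for i in range(len(eigenvalues))],
)
print(explained)
```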
2.6 PCA: Identify the optimum number of PCs (for this project, take at least 90% explained
variance). Show Scree plot.
Fig : 2.9
Tab: 2.8.
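The number of PCs needed to reach a 90% threshold can be read off the cumulative explained variance, as in the sketch below. It uses synthetic data, so the count it prints does not apply to the census data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data whose columns have decreasing spread.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.2])

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Index of the first PC at which the cumulative share reaches 90%, plus 1 for a count.
n_components_90 = int(np.argmax(cumulative >= 0.90)) + 1

print(cumulative.round(3))
print(n_components_90)

# A scree plot would plot pca.explained_variance_ratio_ (or cumulative) against PC number.
```

sklearn also accepts a fraction directly, for example PCA(n_components=0.90), which selects the number of components from the same cumulative ratio.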
2.7 PCA: Compare PCs with Actual Columns and identify which is explaining most
variance. Write inferences about all the Principal components in terms of actual
variables.
Fig :2.10
Fig :2.10
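One way to compare PCs with the actual columns is to tabulate the loadings, that is, the weight of each original variable in each PC, and to look at the largest absolute weights. The sketch below reuses four variable names from the EDA section, but the values are randomly generated and say nothing about the real census data.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: three columns share a common signal, one is independent.
rng = np.random.default_rng(3)
base = rng.normal(size=200)
data = pd.DataFrame({
    "No_HH": base + rng.normal(scale=0.1, size=200),
    "TOT_M": base + rng.normal(scale=0.1, size=200),
    "TOT_F": base + rng.normal(scale=0.1, size=200),
    "TOT_WORK_M": rng.normal(size=200),
})

X_scaled = StandardScaler().fit_transform(data)
pca = PCA(n_components=2).fit(X_scaled)

# Rows are original variables, columns are PCs.
loadings = pd.DataFrame(pca.components_.T, index=data.columns, columns=["PC1", "PC2"])

# Variables with the largest absolute weight in PC1.
top_pc1 = loadings["PC1"].abs().sort_values(ascending=False).head(3)

print(loadings.round(3))
print(top_pc1.round(3))
```

A heatmap of the loadings table gives the same comparison visually.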