Customer Segmentation Clustering
import numpy as np
import pandas as pd
import datetime
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import colors
from matplotlib.colors import ListedColormap
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn import metrics
from yellowbrick.cluster import KElbowVisualizer
import warnings
import sys
if not sys.warnoptions:
    warnings.simplefilter("ignore")
np.random.seed(42)
LOADING DATA
#Loading the dataset
data = pd.read_csv("marketing_campaign.csv", sep="\t")
print("Number of datapoints:", len(data))
data.head()
(data.head() output truncated in this extract)
[5 rows x 29 columns]
DATA CLEANING
In this section:
• Data Cleaning
• Feature Engineering
To get a full grasp of what steps I should take to clean the dataset, let us have a look at
the information in the data.
#Information on features
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 2240 non-null int64
1 Year_Birth 2240 non-null int64
2 Education 2240 non-null object
3 Marital_Status 2240 non-null object
4 Income 2216 non-null float64
5 Kidhome 2240 non-null int64
6 Teenhome 2240 non-null int64
7 Dt_Customer 2240 non-null object
8 Recency 2240 non-null int64
9 MntWines 2240 non-null int64
10 MntFruits 2240 non-null int64
11 MntMeatProducts 2240 non-null int64
12 MntFishProducts 2240 non-null int64
13 MntSweetProducts 2240 non-null int64
14 MntGoldProds 2240 non-null int64
15 NumDealsPurchases 2240 non-null int64
16 NumWebPurchases 2240 non-null int64
17 NumCatalogPurchases 2240 non-null int64
18 NumStorePurchases 2240 non-null int64
19 NumWebVisitsMonth 2240 non-null int64
20 AcceptedCmp3 2240 non-null int64
21 AcceptedCmp4 2240 non-null int64
22 AcceptedCmp5 2240 non-null int64
23 AcceptedCmp1 2240 non-null int64
24 AcceptedCmp2 2240 non-null int64
25 Complain 2240 non-null int64
26 Z_CostContact 2240 non-null int64
27 Z_Revenue 2240 non-null int64
28 Response 2240 non-null int64
dtypes: float64(1), int64(25), object(3)
memory usage: 507.6+ KB
In the next step, I am going to create a feature out of "Dt_Customer" that indicates the
number of days a customer has been registered in the firm's database. However, in order to
keep it simple, I am taking this value relative to the most recent customer in the record.
Thus, to get the values, I must check the newest and oldest recorded dates.
data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"])
dates = []
for i in data["Dt_Customer"]:
i = i.date()
dates.append(i)
#Dates of the newest and oldest recorded customer
print("The newest customer's enrolment date in
therecords:",max(dates))
print("The oldest customer's enrolment date in the
records:",min(dates))
Creating a feature ("Customer_For") of the number of days since the customer started to shop
in the store, relative to the last recorded date.
#Created a feature "Customer_For"
days = []
d1 = max(dates) #taking it to be the newest customer
for i in dates:
    delta = d1 - i
    days.append(delta)
data["Customer_For"] = days
#Converting the timedeltas to a numeric value
data["Customer_For"] = pd.to_numeric(data["Customer_For"], errors="coerce")
Now we will be exploring the unique values in the categorical features to get a clear idea of
the data.
print("Total categories in the feature Marital_Status:\n",
data["Marital_Status"].value_counts(), "\n")
print("Total categories in the feature Education:\n",
data["Education"].value_counts())
In the next bit, I will perform the following steps to engineer some new features (the steps
not covered by the next cell are sketched right after it):
• Extract the "Age" of a customer from "Year_Birth", which indicates the birth year of the
respective person.
• Create another feature "Spent" indicating the total amount spent by the customer in
various categories over the span of two years.
• Create another feature "Living_With" out of "Marital_Status" to extract the living
situation of couples.
• Create a feature "Children" to indicate the total children in a household, that is, kids
and teenagers.
• To get further clarity on the household, create a feature indicating "Family_Size".
• Create a feature "Is_Parent" to indicate parenthood status.
• Simplify "Education" into three categories by consolidating its value counts.
• Lastly, drop some of the redundant features.
#Feature Engineering
#Age of customer today
data["Age"] = 2021 - data["Year_Birth"]
#For clarity, renaming the spending columns
data = data.rename(columns={"MntWines": "Wines", "MntFruits": "Fruits",
                            "MntMeatProducts": "Meat", "MntFishProducts": "Fish",
                            "MntSweetProducts": "Sweets", "MntGoldProds": "Gold"})
Now that we have some new features, let's have a look at the data's stats.
data.describe()
The above stats show some discrepancies in the mean and max values of Income and Age.
Do note that the max age is 128 years; since I calculated the age a customer would be today
(i.e. in 2021) and the data is old, some values are inflated.
To take a broader view of the data, I will plot some of the selected features.
#To plot some selected features
#Setting up colors preferences
sns.set(rc={"axes.facecolor": "#FFF9ED", "figure.facecolor": "#FFF9ED"})
pallet = ["#682F2F", "#9E726F", "#D6B2B1", "#B9C0C9", "#9F8A78", "#F3AB60"]
cmap = colors.ListedColormap(["#682F2F", "#9E726F", "#D6B2B1", "#B9C0C9", "#9F8A78", "#F3AB60"])
#Plotting the following features, taking parenthood status as the hue
To_Plot = ["Income", "Recency", "Customer_For", "Age", "Spent", "Is_Parent"]
print("Relative Plot Of Some Selected Features: A Data Subset")
plt.figure()
sns.pairplot(data[To_Plot], hue="Is_Parent", palette=["#682F2F", "#F3AB60"])
plt.show()
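The pairplot reveals a few clear outliers in the Income and Age features. The cell that
removed them (along with the rows with missing Income) is missing from this extract; a
minimal sketch, with threshold cut-offs that are assumptions chosen to be consistent with
the count reported below:
#Dropping the rows with missing Income values
data = data.dropna()
#Dropping the outliers by setting caps on Age and Income (thresholds are assumptions)
data = data[(data["Age"] < 90)]
data = data[(data["Income"] < 600000)]
print("The total number of data-points after removing the outliers is:", len(data))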
The total number of data-points after removing the outliers is: 2212
Next, let us look at the correlation amongst the features. (Excluding the categorical
attributes at this point)
#Correlation matrix (excluding the categorical attributes)
corrmat = data.corr(numeric_only=True)
plt.figure(figsize=(20, 20))
sns.heatmap(corrmat, annot=True, cmap=cmap, center=0)
The data is quite clean and the new features have been included. I will proceed to the next
step, that is, preprocessing the data.
DATA PREPROCESSING
In this section, I will be preprocessing the data to perform clustering operations.
The following steps are applied to preprocess the data:
• Label encoding the categorical features
• Scaling the features using the standard scaler
• Creating a subset dataframe for dimensionality reduction
#Get list of categorical variables
s = (data.dtypes == 'object')
object_cols = list(s[s].index)
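The label-encoding, column-subsetting, and scaling cells are missing from this extract;
below is a minimal sketch producing the scaled_ds dataframe used by PCA later. The list of
promotion columns set aside before scaling is an assumption, chosen to be consistent with
the 23-column preview that follows.
#Label encoding the object dtype columns
LE = LabelEncoder()
for col in object_cols:
    data[col] = LE.fit_transform(data[col])
#Creating a copy and setting aside the promotion columns (column list is an assumption)
ds = data.copy()
cols_del = ["AcceptedCmp3", "AcceptedCmp4", "AcceptedCmp5",
            "AcceptedCmp1", "AcceptedCmp2", "Complain", "Response"]
ds = ds.drop(cols_del, axis=1)
#Scaling the features using the standard scaler
scaler = StandardScaler()
scaler.fit(ds)
scaled_ds = pd.DataFrame(scaler.transform(ds), columns=ds.columns)
scaled_ds.head()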
(scaled_ds.head() output, truncated to the last two columns in this extract)
   Family_Size  Is_Parent
0    -1.758359  -1.581139
1     0.449070   0.632456
2    -0.654644  -1.581139
3     0.449070   0.632456
4     0.449070   0.632456
[5 rows x 23 columns]
DIMENSIONALITY REDUCTION
In this problem, there are many factors on the basis of which the final classification will be
done. These factors are basically attributes or features. The higher the number of features,
the harder it is to work with them. Many of these features are correlated, and hence redundant.
This is why I will be performing dimensionality reduction on the selected features before
putting them through a classifier.
Dimensionality reduction is the process of reducing the number of random variables under
consideration, by obtaining a set of principal variables.
Principal component analysis (PCA) is a technique for reducing the dimensionality of
such datasets, increasing interpretability but at the same time minimizing information loss.
Steps in this section:
• Dimensionality reduction with PCA
• Plotting the reduced dataframe
Dimensionality reduction with PCA
For this project, I will be reducing the dimensions to 3.
#Initiating PCA to reduce dimensions aka features to 3
pca = PCA(n_components=3)
pca.fit(scaled_ds)
PCA_ds = pd.DataFrame(pca.transform(scaled_ds), columns=["col1", "col2", "col3"])
PCA_ds.describe().T
(describe() output truncated; the max column shows col1 7.444305, col2 6.142721, col3 6.611222)
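Neither the plot of the reduced dataframe (promised in the step list above) nor the cell
behind the elbow-method verdict below survived extraction; here is a minimal sketch of
both, using the KElbowVisualizer imported at the top. The k=10 search range is an assumption.
#A 3D projection of the data in the reduced dimension
x = PCA_ds["col1"]
y = PCA_ds["col2"]
z = PCA_ds["col3"]
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection="3d")
ax.scatter(x, y, z, c="maroon", marker="o")
ax.set_title("A 3D Projection Of Data In The Reduced Dimension")
plt.show()
#Quick examination of the elbow method to find the number of clusters (k range is an assumption)
print('Elbow Method to determine the number of clusters to be formed:')
Elbow_M = KElbowVisualizer(KMeans(), k=10)
Elbow_M.fit(PCA_ds)
Elbow_M.show()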
The above cell indicates that four is an optimal number of clusters for this data. Next,
we will fit the agglomerative clustering model to get the final clusters.
#Initiating the Agglomerative Clustering model
AC = AgglomerativeClustering(n_clusters=4)
#Fitting the model and predicting the clusters
yhat_AC = AC.fit_predict(PCA_ds)
PCA_ds["Clusters"] = yhat_AC
#Adding the Clusters feature to the original dataframe
data["Clusters"] = yhat_AC
To examine the clusters formed, let's have a look at the 3-D distribution of the clusters.
#Plotting the clusters
x = PCA_ds["col1"]
y = PCA_ds["col2"]
z = PCA_ds["col3"]
fig = plt.figure(figsize=(10, 8))
ax = plt.subplot(111, projection='3d', label="bla")
ax.scatter(x, y, z, s=40, c=PCA_ds["Clusters"], marker='o', cmap=cmap)
ax.set_title("The Plot Of The Clusters")
plt.show()
EVALUATING MODELS
Since this is unsupervised clustering, we do not have a tagged feature to evaluate or
score our model. The purpose of this section is to study the patterns in the clusters formed
and determine the nature of the clusters' patterns.
For that, we will be having a look at the data in light of the clusters via exploratory data
analysis and drawing conclusions.
Firstly, let us have a look at the group distribution of the clustering.
#Plotting countplot of clusters
pal = ["#682F2F","#B9C0C9", "#9F8A78","#F3AB60"]
pl = sns.countplot(x=data["Clusters"], palette= pal)
pl.set_title("Distribution Of The Clusters")
plt.show()
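The conclusion profiles customers by income and spending; the corresponding plot is not in
this extract, so here is a minimal sketch of one such profile, a scatter of total Spent
against Income per cluster (the styling choices are assumptions):
#Sketch (not in the original extract): cluster profile based on income and spending
pl = sns.scatterplot(data=data, x="Spent", y="Income", hue="Clusters", palette=pal)
pl.set_title("Cluster's Profile Based On Income And Spending")
plt.legend()
plt.show()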
Next, let us see how the clusters shop across the different purchase channels relative to
their spending.
#Plotting spending against the purchase-channel features for each cluster
Places = ["NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases", "NumWebVisitsMonth"]
for i in Places:
    plt.figure()
    sns.jointplot(x=data[i], y=data["Spent"], hue=data["Clusters"], palette=pal)
    plt.show()
CONCLUSION
In this project, I performed unsupervised clustering, using dimensionality reduction
followed by agglomerative clustering. I came up with 4 clusters and further used them to
profile customers in clusters according to their family structures and income/spending.
This can be used in planning better marketing strategies.
If you liked this Notebook, please do upvote.
If you have any questions, feel free to comment!
Best Wishes!
END