Case Study - Churn Model Prediction


CHURN MODELLING PREDICTION USING MACHINE LEARNING

CONTENTS

TITLE

1. INTRODUCTION AND SCOPE OF THE PROBLEM
   1.1 Scope of the Problem
   1.2 Data Description
   1.3 Review of Chapters
2. REVIEW OF MACHINE LEARNING TECHNIQUES
   2.0 Need of Machine Learning
   2.1 Machine Learning
       2.1.1 Business Understanding
       2.1.2 Data Understanding
       2.1.3 Data Preparation
       2.1.4 Modelling
       2.1.5 Evaluation
       2.1.6 Deployment
   2.2 Types of Machine Learning
       2.2.1 Supervised Learning
       2.2.2 Unsupervised Learning
       2.2.3 Reinforcement Learning
   2.3 Choosing the Algorithm
       2.3.1 Types of Regression Algorithms
       2.3.2 Types of Classification Algorithms
       2.3.3 Types of Unsupervised Algorithms
   2.4 Choosing and Comparing Models through Pipelines
       2.4.1 Model Validation
   2.5 Model Diagnosis with Overfitting and Underfitting
       2.5.1 Bias and Variance
       2.5.2 Model Performance Metrics
   2.6 Overall Process of Machine Learning
3. MACHINE LEARNING AT WORK
4. SUMMARY
5. APPENDIX
   Python Code
   Data Set
6. BIBLIOGRAPHY

CHAPTER 1

SCOPE OF THE PROBLEM

1.1 SCOPE OF THE PROBLEM
The problem is to predict, or identify in advance, credit card customer cancellation (churn) at a bank.
Source:
The dataset was downloaded from: https://www.kaggle.com/malinivy/churn-modelling-beginner-keras/data
The output of a predictive churn model is a measure of the immediate or future risk of a customer cancellation. This is what the term "churn modelling" refers to.
A bank was losing credit card customers to its competitors, and the marketers
of the organization decided to use analytics in order to deal with this issue.
Their goal was to use data mining models for identifying customers with
increased propensity to churn so they could fine-tune their retention
campaigns.
Broadly, the objectives of the problem are:
 This dataset is used to evaluate prediction algorithms in an effort at churn prediction. Given a bank customer, can we build a classifier which can determine whether they will leave or not?
 The attributes used here, besides Geography and Gender, are CreditScore, Age, Balance and EstimatedSalary, which are expected to influence a customer's decision to leave.
 Based on these attributes, we need to apply algorithms to come up with an appropriate classification model for classifying whether a customer will leave or not.
 When new data is given, we use the model obtained to predict whether the customer will leave or not.

1.2 DATA DESCRIPTION:


The data is extracted from Kaggle.com; it has 10000 observations and 11 attributes.
A brief description of the variables in the dataset:
 RowNumber: Integer value numbering each consumer.
 CustomerId: Unique identity number given to each consumer for identification.
 Surname: Surname of each consumer in the dataset.
 CreditScore: A statistical number that evaluates a consumer's creditworthiness, based on credit history.
 Geography: Location of the consumer, i.e. the country to which he belongs.
 Gender: Gender of the consumer (male or female), which is categorical data.
 Age: Age of every consumer in the dataset.
 Tenure: Time period for which the customer has been using the credit card.
 Balance: Amount that the consumer has in his account.
 NumOfProducts: How many accounts or bank-affiliated products the consumer has.
 HasCrCard: Whether the consumer has a credit card or not.
 IsActiveMember: Whether the consumer is actively using his credit card or not.
 EstimatedSalary: Estimated salary that the consumer is earning.

Label:
Exited: Field used to split the data into two classes.
 1 – represents a customer who exited (churned).
 0 – represents a customer who did not exit.

1.3 Review of the Chapters
Chapter 2 gives a brief introduction to machine learning techniques: the need for ML today, the types of ML algorithms and the various models within each, which technique to use when, how to validate and tune the ML algorithms, and how to measure the performance of an ML model.

Chapter 3 describes the various results obtained for the problem. It contains all the outputs generated by the ML algorithms applied to the data, as well as the validation and performance metrics. Chapter 4 presents the summary and conclusions, followed by the Bibliography.

APPENDIX:
It describes the data set and Machine Learning code used.

CHAPTER 2
REVIEW OF MACHINE
LEARNING PROCESS

REVIEW OF MACHINE LEARNING PROCESS
2.0 NEED OF MACHINE LEARNING
In this age of modern technology, there is one resource that we have in abundance: a large amount of structured and unstructured data. In the second half of the twentieth century, machine learning evolved as a subfield of artificial intelligence that involves the development of self-learning algorithms to gain knowledge from that data in order to make predictions. Instead of requiring humans to manually derive rules and build models by analyzing large amounts of data, machine learning offers a more efficient alternative for capturing the knowledge in data to gradually improve the performance of predictive models and make data-driven decisions. Not only is machine learning becoming increasingly important in computer science research, but it also plays an ever greater role in our everyday life.

2.1 Machine Learning Process


The CRISP-DM (Cross-Industry Standard Process for Data Mining) process was designed specifically for data mining. However, it is flexible and thorough enough that it can be applied to any analytical project, whether it is predictive analytics, data science, or machine learning. The process has the following six phases:
 Business Understanding
 Data Understanding
 Data preparation
 Modeling
 Evaluation
 Deployment

2.1.1 Business Understanding


It is a very important step of the process for achieving success. The purpose of this step is to identify the requirements of the business so that you can translate them into analytical objectives. It has the following tasks:
1) Identify the business objectives
2) Assess the situation
3) Determine the analytical goals
4) Produce a project plan
2.1.2 Data Understanding
After enduring the all-important pain of the first step, you can now get your hands on the data. The tasks in this step consist of the following:
1) Collect the data
2) Describe the data
3) Explore the data
4) Verify the data Quality

2.1.3 Data Preparation


This step is relatively self-explanatory; the goal is to get the data ready to be input into the algorithms. This includes merging, feature engineering, and transformations. If imputation for missing values / outliers is needed, it happens in this step. The five key tasks under this step are as follows:
1) Select the data
2) Clean the data
3) Construct the data
4) Integrate the data
5) Format the data
2.1.4 Modeling
Oddly, much of this process step covers considerations that you have already thought of and prepared for. Still, one will need at least a modicum of an idea about how they will be modeling. Remember that this is a flexible, iterative process and not a strict linear flow chart such as an aircrew checklist.
Below are the tasks in this step:
1) Select a modeling technique
2) Generate a test design
3) Build a model
4) Assess a Model
Both cross validation of the model (using a train/test split or K-fold validation) and model assessment, which involves comparing the models with the chosen criterion (RMSE, Accuracy, ROC), are performed in this phase.
2.1.5 Evaluation
In the evaluation process, the main goal is to confirm that the work that has been done and the model selected at this point meet the business objective. Ask yourself and others: have we achieved the definition of success? Here are the tasks in this step:
1) Evaluate the results
2) Review the process
3) Determine the next steps
2.1.6 Deployment

If everything is done according to the plan up to this point, it might come down to
flipping a switch and your model goes live. Here are the tasks in this step:

1) Deploying the plan
2) Monitoring and maintenance of the plan
3) Producing the final report

2.2 Types of Machine Learning

Broadly, the Machine Learning Algorithms are classified into 3 types:

2.2.1 Supervised Learning
This algorithm consists of a target / outcome / dependent variable which is to be predicted from a given set of predictors / independent variables. Using this set of variables, we generate a function that maps inputs to the desired output. The training process continues until the model achieves a desired level of accuracy on the training data.

The process of Supervised Learning model is illustrated in the below picture:

Fig. 2.2.1 Supervised Learning

Examples of Supervised Learning: Regression, Decision Tree, Random Forest, KNN, Logistic Regression, etc.
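As a minimal sketch of this idea (using scikit-learn and its built-in breast cancer data purely as a stand-in, not the churn data analysed later), a supervised learner is fitted on labelled examples and then scored on unseen ones:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Labelled data: X holds the predictors, y holds the known target labels.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=5000)   # any supervised learner could be used here
model.fit(X_train, y_train)                 # learn the mapping from inputs to labels
print(model.score(X_test, y_test))          # accuracy on data the model has not seen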

2.2.2 Unsupervised Learning
In this algorithm, we do not have any target or outcome variable to predict or estimate. It is used for clustering the population into different groups, which is widely used for segmenting customers into different groups for specific interventions (more of an exploratory analysis).

Examples of Unsupervised Learning: Data reduction techniques, Cluster Analysis, Market Basket Analysis, etc.

2.2.3 Reinforcement Learning
Using this algorithm, the machine is trained to make specific decisions. It works
this way: the machine is exposed to an environment where it trains itself
continually using trial and error. This machine learns from past experience and
tries to capture the best possible knowledge to make accurate business decisions.
The process of reinforcement learning is illustrated in the below picture:

2.3 Choosing the algorithm

Choosing the right algorithm depends on the type of problem we are solving and also on the scale of the dependent variable. In the case of a continuous target variable we use regression algorithms; in the case of a categorical target we use classification algorithms; and for models which do not have a target variable we use either cluster analysis or data reduction techniques.
2.3.1 Types of Regression Algorithms
There are many Regression algorithms in machine learning, which will be used in
different regression applications. Some of the main regression algorithms are as
follows:
a) Simple Linear Regression:-In simple linear regression, we predict scores on one variable from the data of a second variable. The variable we are forecasting is called the criterion variable and is referred to as Y. The variable we are basing our predictions on is called the predictor variable and is denoted as X.
b) Multiple Linear Regression:-Multiple linear regression is one of the
algorithms of regression technique, and is the most common form of linear
regression analysis. As a predictive analysis, multiple linear regression is used to explain the relationship between one dependent variable and two or more independent variables. The independent variables can be either continuous or categorical.
c) Polynomial Regression:-Polynomial regression is another form of regression
in which the maximum power of the independent variable is more than 1.
In this regression technique, the best fit line is not a straight line instead it is in the
form of a curve.
d) Support Vector Machines:-Support Vector Machines can be applied to regression problems as well as classification. They contain all the features that characterize the maximum-margin algorithm. A linear learning machine maps a non-linear function into a high-dimensional, kernel-induced feature space. The system capacity is controlled by parameters that do not depend on the dimensionality of the feature space.
e) Decision Tree Regression:-Decision tree builds regression models in the form of a tree structure. It breaks down the data into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.
f) Random Forest Regression:-Random Forest is also one of the algorithms used in the regression technique. It is a very flexible, easy to use machine learning algorithm that produces a great result most of the time, even without hyper-parameter tuning. It is also one of the most widely used algorithms because of its simplicity and the fact that it can be used for both regression and classification tasks. The forest it builds is an ensemble of Decision Trees, most of the time trained with the "bagging" method.
Other than these, we have regularized regression models like Ridge, LASSO and Elastic Net regression, which are used to select the key parameters, and there is also Bayesian regression, which works with Bayes' theorem.
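As a small illustrative sketch of these regression variants (on synthetic data generated with scikit-learn, not the bank data), ordinary least squares can be compared with the regularized Ridge and Lasso models mentioned above:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic regression data with a continuous target.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=1.0)):
    model.fit(X, y)
    # R^2 on the training data; regularization shrinks the coefficients
    # (Lasso can drive some of them exactly to zero).
    print(type(model).__name__, round(model.score(X, y), 3))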

2.3.2 Types of Classification Algorithms

There are many classification algorithms in machine learning, which can be used for different classification applications. Some of the main classification algorithms are as follows:
a) Logistic Regression/Classification:-Logistic regression falls under the category of supervised learning; it measures the relationship between the dependent variable, which is categorical, and one or more independent variables by estimating probabilities using a logistic/sigmoid function. Logistic regression is generally used when the dependent variable is binary or dichotomous, meaning that it can take only two possible values such as "Yes or No" or "Living or Dead".
b) K-Nearest Neighbors:-The k-NN algorithm is one of the most straightforward algorithms in classification, and it is one of the most used ML algorithms. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors. It can also be used for regression, where the output is the value of the object (it predicts continuous values); this value is the average (or median) of the values of its k nearest neighbors.
c) Naive Bayes:-Naive Bayes is a type of classification technique based on Bayes' theorem, with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. A Naive Bayes model is easy to build and particularly useful for very large datasets.
d) Decision Tree Classification:-Decision tree builds classification models in the
form of a tree structure. It breaks down a dataset into smaller and smaller subsets
while at the same time an associated decision tree is incrementally developed. The
final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches, while a leaf node represents a classification or decision. The first decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.
e) Support Vector Machines:-A Support Vector Machine is a type of classifier in which a discriminative classifier is formally defined by a separating hyperplane. The algorithm outputs an optimal hyperplane which categorizes new examples. In two-dimensional space, this hyperplane is a line dividing the plane into two parts, with each class lying on either side.
f) Random Forest Classification:-Random Forest is a supervised learning algorithm. It creates a forest and makes it somehow random. The forest it builds is an ensemble of Decision Trees, most of the time the decision tree algorithm is trained with the "bagging" method. The general idea of the bagging method is that a combination of learning models increases the overall result. Random Forest is also very powerful for finding the variable importance in classification/regression problems.
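Since all of these classifiers share the same scikit-learn interface, they can be fitted and compared in the same way. A brief sketch on synthetic data (the dataset and settings here are placeholders, not the ones used for the churn problem):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit each classifier on the same training split and report test accuracy.
for clf in (LogisticRegression(max_iter=1000), KNeighborsClassifier(), GaussianNB(),
            DecisionTreeClassifier(), SVC(), RandomForestClassifier()):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, round(clf.score(X_test, y_test), 3))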

2.3.3 Types of Unsupervised Learning


Clustering is the type of unsupervised learning in which unlabelled data is used to draw inferences. It is the process of grouping similar entities together. The goal of this unsupervised machine learning technique is to find similarities in the data points, group similar data points together, and figure out which cluster a new data point should belong to.

Types of Clustering Algorithms:-There are many clustering algorithms in machine learning, which can be used for different clustering applications. Some of the main clustering algorithms are as follows:

a) Hierarchical Clustering:-Hierarchical clustering is one of the algorithms of
clustering technique, in which similar data is grouped in a cluster.
It is an algorithm that builds a hierarchy of clusters. This algorithm starts with all the data points assigned to a cluster of their own. Then, the two nearest clusters are merged into the same cluster. In the end, the algorithm terminates when there is only a single cluster left.
It starts by assigning each data point to its own cluster, finds the closest pair using the Euclidean distance, and merges them into one cluster. This process continues until all data points are clustered into a single cluster.
b) K–Means Clustering:-K-Means clustering is one of the algorithms of the clustering technique, in which similar data is grouped into a cluster. K-means is an iterative algorithm that aims to find a local optimum in each iteration. It starts with K as the input, which is the desired number of clusters. Place k centroids at random locations in the space. Then, using the Euclidean distance, calculate the distance between each data point and the centroids, and assign each data point to the cluster whose centroid is closest. Recalculate the cluster centroids as the mean of the data points assigned to them. Repeat until no further changes occur.
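A minimal sketch of this K-means procedure, using synthetic two-dimensional points (an illustrative assumption, not the project data):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # unlabelled points
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)        # the learned centroids
print(kmeans.labels_[:10])            # cluster assignment of the first few points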
Types of Dimensionality Reduction Algorithms:-There are many dimensionality
reduction algorithms in machine learning, which are applied for different
dimensionality reduction applications. One of the main dimensionality reduction
techniques is Principal Component Analysis (PCA) / Factor Analysis.
Principal Component Analysis (Factor Analysis):-Principal Component Analysis is one of the algorithms of dimensionality reduction. This technique transforms the input variables into a new set of variables which are linear combinations of the original variables. This new set of variables is known as the principal components. As a result of the transformation, the first principal component has the largest possible variance, and each following component has the highest possible variance under the constraint that it is orthogonal to the preceding components. Keeping only the best m < n components reduces the data dimensionality while retaining most of the information in the data.
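A short PCA sketch in the same spirit (the iris data here is only a placeholder): the components are orthogonal linear combinations of the inputs, ordered by explained variance, and keeping the first m of them reduces the dimensionality:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)            # 4 original features
pca = PCA(n_components=2)                    # keep m = 2 components
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)         # share of variance kept by each component
print(X_reduced.shape)                       # (150, 2)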

2.4 Choosing and comparing models through Pipelines

When you work on a machine learning project, you often end up with multiple good models to choose from. Each model will have different performance characteristics. Using resampling methods like k-fold cross validation, you can get an estimate of how accurate each model may be on unseen data. You need to be able to use these estimates to choose one or two of the best models from the suite of models that you have created.
2.4.1 Model Validation
When you are building a predictive model, you need to evaluate the capability or generalization power of the model on unseen data. This is typically done by estimating accuracy using data that was not used to train the model, and is often referred to as cross validation.
A few common methods used for Cross Validation:
1) The Validation set Approach (Holdout Cross validation)
In this approach, we reserve a large portion of the dataset for training and the remaining portion for model validation. Ideally, people use 70-30 or 80-20 percentages for training and validation respectively.
A major disadvantage of this approach is that, since we are training the model on a randomly chosen portion of the dataset, there is a real possibility that we miss out on some interesting information about the data, which will lead to a higher bias.
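In scikit-learn terms, the holdout approach is a single random split; the 80:20 ratio below is just the example discussed above:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)   # placeholder data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
print(len(X_train), len(X_val))   # 800 200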
2) K-fold cross validation
As there is never enough data to train your model, removing a part of it for validation may lead to a problem of underfitting. By reducing the training data, we
risk losing important patterns/ trends in data set, which in turn increases error
induced by bias. So, what we require is a method that provides ample data for
training the model and also leaves ample data for validation. K Fold cross
validation does exactly that.
In K Fold cross validation, the data is divided into k subsets. Now the holdout
method is repeated k times, such that each time, one of the k subsets is used as the
test set/ validation set and the other k-1 subsets are put together to form a training
set. The error estimation is averaged over all k trials to get total effectiveness of
our model. As can be seen, every data point gets to be in a validation set exactly
once, and gets to be in a training set k-1 times. This significantly reduces the bias
as we are using most of the data for fitting and also significantly reduces variance

as most of the data is also being used in validation set. Interchanging the training
and test sets also adds to the effectiveness of this method. As a general rule and
empirical evidence, K = 5 or 10 is preferred, but nothing’s fixed and it can take any
value.
Below are the steps:
 Randomly split your entire dataset into k "folds".
 For each fold in your dataset, build your model on the other k – 1 folds of the dataset. Then, test the model to check its effectiveness on the kth fold.
 Record the error you see on each of the predictions.
 Repeat this until each of the k-folds has served as the test set.
 The average of your k recorded errors is called the cross-validation error and
will serve as your performance metric for the model.
Below is the visualization of a k-fold validation when k=5.

How to choose K:
 Smaller dataset: 10-fold cross validation is better.
 Moderate dataset: 5- or 6-fold cross validation mostly works.
 Big dataset: a train – validation split is used for validation.
Other than this, we have Leave-One-Out Cross Validation (LOOCV), in which each record in turn is left out of the training data and then used for testing. This process is repeated across all the records.
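A compact sketch of k-fold cross validation and LOOCV (the model and data below are placeholders, not the project's own):

from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)
model = LogisticRegression(max_iter=1000)

# k-fold: average the score over k train/test splits.
kfold = KFold(n_splits=5, shuffle=True, random_state=7)
scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(scores.mean(), scores.std())

# Leave-one-out: each single record is held out once (expensive on large data).
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(loo_scores.mean())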

2.5 Model Diagnosis with Overfitting and Underfitting

2.5.1 Bias and Variance
A fundamental problem with supervised learning is the bias-variance trade-off. Ideally, a model should have two key characteristics:
1) Sensitive enough to accurately capture the key patterns in the training dataset.
2) Generalized enough to work well on any unseen dataset.
Unfortunately, while trying to achieve the above-mentioned first point, there is an
ample chance of over-fitting to noisy or unrepresentative training data points
leading to a failure of generalizing the model. On the other hand, trying to
generalize a model may result in failing to capture important regularities.
If model accuracy is low on the training dataset as well as the test dataset, the model is said to be under-fitting, or to have high bias. Bias refers to the simplifying assumptions made by the algorithm to make the problem easier to solve. To solve an under-fitting issue, or to reduce bias, try including more meaningful features and try to increase the model complexity, for example by adding higher-order interactions.
The variance refers to the sensitivity of a model to changes in the training data. If a model gives high accuracy on the training dataset but the accuracy drops sharply on the test dataset, the model is said to be over-fitting, or to have high variance.
To solve the over-fitting issue, try to reduce the number of features, that is, keep only the meaningful features, or try regularization methods that keep all the features but shrink their weights. The ideal model is a trade-off between underfitting and overfitting, as illustrated in the picture below.

And, the Hyperparameters will be tuned in the below mentioned ways to reach the
optimal solution:
1) Grid Search
2) Random Search
3) Manual Tuning
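Grid search and random search can both be sketched with scikit-learn's built-in searchers; the parameter grid below is purely illustrative and is not the grid used later for this problem:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)
param_grid = {'n_estimators': [100, 200], 'max_depth': [None, 5]}

# Grid search tries every combination; random search samples a few of them.
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5).fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))

rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_grid,
                          n_iter=3, cv=5, random_state=0).fit(X, y)
print(rand.best_params_, round(rand.best_score_, 3))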

2.5.2 Model Performance Metrics


Model evaluation is an integral part of the model development. Based on model
evaluation and subsequent comparisons, we can take a call whether to continue our
efforts in model enhancement or cease them and select the final model that should
be used / deployed.
1. Evaluating Classification Models

Confusion Matrix
Confusion matrix is one of the most popular ways to evaluate a classification
model. A confusion matrix can be created for a binary classification as well as a
multi-class classification model.

A confusion matrix is created by comparing the predicted class label of a data
point with its actual class label. This comparison is repeated for the whole dataset
and the results of this comparison are compiled in a matrix or tabular format.
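A minimal sketch of this comparison, with made-up label vectors standing in for a real model's output:

from sklearn.metrics import confusion_matrix, accuracy_score

y_actual    = [0, 0, 1, 1, 0, 1, 0, 1, 0, 0]   # hypothetical true labels
y_predicted = [0, 0, 1, 0, 0, 1, 1, 1, 0, 0]   # hypothetical model output

print(confusion_matrix(y_actual, y_predicted))  # rows: actual class, columns: predicted class
print(accuracy_score(y_actual, y_predicted))    # 0.8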

2. Regression Model Evaluation :


A regression line predicts the y values for a given x value. Note that the observed values lie around the average. The prediction error (called the root-mean-square error or RMSE) is given by the following formula:
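For reference, the usual definition of this error, for n observations with actual values $y_i$ and predicted values $\hat{y}_i$, is:

$$ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} $$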

3. Evaluating Unsupervised Models


The Unsupervised algorithms will be assessed by the profile of the factors/ clusters
which were derived through the models.

2.6 Overall Process of Machine Learning
To put the overall process together, below is a picture that describes the road map for building ML systems.

CHAPTER 3
MACHINE LEARNING AT WORK

MACHINE LEARNING AT WORK

3.1 An Approach to the Problem:

In order to carry out the analysis, we extracted 10000 observations from Kaggle.com; the details of the data are given in Chapter 1.
In this chapter, we discuss the results of the different machine learning methods used to obtain a solution for the problem described in Chapter 1.
As mentioned in Chapter 2, the first step of an ML workflow is cleaning the data and preparing it for modeling. As a first step, we check whether the data was read properly and that all the scale types are as per the data.

The first and last five rows of the dataframe (columns: RowNumber, CustomerId, Surname, CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, Exited, Geography_n, Gender_n):

1, 15634602, Hargrave, 619.0, France, Female, 42, 2, 0.00, 1, 1, 1, 101348.88, 1, 0, 0
2, 15647311, Hill, 608.0, Spain, Female, 41, 1, 83807.86, 1, 0, 1, 112542.58, 0, 2, 0
3, 15619304, Onio, 502.0, France, Female, 42, 8, 159660.80, 3, 1, 0, 113931.57, 1, 0, 0
4, 15701354, Boni, 699.0, France, Female, 39, 1, 0.00, 2, 0, 0, 93826.63, 0, 0, 0
5, 15737888, Mitchell, 850.0, Spain, Female, 43, 2, 125510.82, 1, 1, 1, 79084.10, 0, 2, 0
...
9996, 15606229, Obijiaku, 771.0, France, Male, 39, 5, 0.00, 2, 1, 0, 96270.64, 0, 0, 1
9997, 15569892, Johnstone, 516.0, France, Male, 35, 10, 57369.61, 1, 1, 1, 101699.77, 0, 0, 1
9998, 15584532, Liu, 709.0, France, Female, 36, 7, 0.00, 1, 0, 1, 42085.58, 1, 0, 0
9999, 15682355, Sabbatini, 772.0, Germany, Male, 42, 3, 75075.31, 2, 1, 0, 92888.52, 1, 1, 1
10000, 15628319, Walker, 792.0, France, Female, 28, 4, 130142.79, 1, 1, 0, 38190.78, …, …, …

Output 3.1.1: dataframe

Understanding data using Descriptive Statistics:


To understand the data, we will first look at the summary of the data.

Attribute          count   mean          std           min           25%           50%           75%           max
RowNumber          10000   5000.50       2886.90       1.00          2500.75       5000.50       7500.25       10000.00
CustomerId         10000   1.569094e+07  7.193619e+04  1.556570e+07  1.562853e+07  1.569074e+07  1.575323e+07  1.581569e+07
CreditScore        10000   650.53        96.65         350.00        584.00        652.00        718.00        850.00
Age                10000   38.92         10.49         18.00         32.00         37.00         44.00         92.00
Tenure             10000   5.01          2.89          0.00          3.00          5.00          7.00          10.00
Balance            10000   76485.89      62397.41      0.00          0.00          97198.54      127644.24     250898.09
NumOfProducts      10000   1.53          0.58          1.00          1.00          1.00          2.00          4.00
HasCrCard          10000   0.71          0.46          0.00          0.00          1.00          1.00          1.00
IsActiveMember     10000   0.52          0.50          0.00          0.00          1.00          1.00          1.00
EstimatedSalary    10000   100090.24     57510.49      11.58         51002.11      100193.92     149388.25     199992.48
Exited             10000   0.20          0.40          0.00          0.00          0.00          0.00          1.00

Output 3.1.2: Descriptive statistics of the data

From the summary, we find the mean, first quartile, median, third quartile, maximum and minimum values of all the continuous attributes.

Understanding data visually:

Also, we look at the data visually to understand the relationships between and within the variables.
Output 3.1.3: Histogram of EXITED

# Univariate Plots for Categorical variables

Output 3.1.4: Histogram of GENDER

From 3.1.3 we can see that, for the attribute EXITED, nearly 80% of the customers did not exit and around 20% exited.
From 3.1.4 we can see that, for the attribute GENDER, we have more than 5000 male customers and around 4500 female customers.

Checking for missing Values: We also need to check if the data contains any
missing values

RowNumber        0     Tenure             0
CustomerId       0     Balance            0
Surname          0     NumOfProducts      0
CreditScore      0     HasCrCard          0
Geography        0     IsActiveMember     0
Gender           0     EstimatedSalary    0
Age              0     Exited             0

As we do not have any missing values, no imputation is required; if any were present, the affected variables would be imputed using mean / median values.

#3.1.4 Missing values plot

As we do not have any missing values in any of the attributes, the heat map does not show any missing values and the plot is entirely empty.
Checking for Outliers:
We used box-plots to check for outliers in each of the continuous variables.
Boxplot for CreditScore

3.1.5 With outliers          3.1.6 After capping outliers

Here the values less than 5th percentile are imputed using the 5th percentile value
in order to remove outliers.
Boxplot for ‘Age’

3.1.7 With outliers          3.1.8 After capping outliers

Boxplot for 'Estimated Salary'          Boxplot for 'Balance'

3.1.9 No outliers          3.1.10 No outliers

Boxplot for ”Balance”

3.1.11 No outliers

Understanding relationships between variables:
For the continuous variables, we will look at the Correlation plots to
understand the relationships between variables.

3.1.12 correlation plot

Here, the square size refers to the strength of the relationship and the colour refers to the direction of the relationship, i.e. blue represents negative correlation and yellow represents positive correlation. From the plot, we can see that CreditScore, Age, Tenure, Balance, NumOfProducts and EstimatedSalary are positively correlated.

3.1.13 Univariate - Continuous variable plot

Skewness: -0.071607
Kurtosis: -0.425726
The skewness value says that the curve is slightly negatively skewed (the left-hand tail is larger than the right) and the kurtosis value says that it is PLATYKURTIC.
3.1.14 Histogram of 'Age'

Skewness: 1.011320
Kurtosis: 1.395347
The skewness value says that the curve is positively skewed (the right-hand tail is larger than the left) and the positive kurtosis value says that it is LEPTOKURTIC, i.e. the distribution has more weight in its tails.
3.1.15 Estimated Salary

Skewness: 0.002085
Kurtosis: -1.181518
The skewness value says that the distribution is almost symmetrical, since it is close to zero, and the negative kurtosis shows that it is lightly tailed (PLATYKURTIC).
3.1.16 Bivariate Categorical vs Categorical: Gender and Exited

Exited      0      1     All
Gender
Female    3404   1139    4543
Male      4559    898    5457
All       7963   2037   10000

The crosstab combines the two columns and gives the summary: 7963 customers did not exit and 2037 customers exited.
Out of the 7963 customers who did not exit, 4559 are male and 3404 are female.
Out of the 2037 customers who exited, 898 are male and 1139 are female.

3.1.17 Geography and Exited

Exited        0      1     All
Geography
France      4204    810    5014
Germany     1695    814    2509
Spain       2064    413    2477
All         7963   2037   10000

The crosstab combines the two columns and gives the summary: 7963 customers did not exit and 2037 customers exited.
Out of the 7963 customers who did not exit, 4204 are from France, 1695 from Germany and 2064 from Spain.
Out of the 2037 customers who exited, 810 are from France, 814 from Germany and 413 from Spain.

Categorical vs categorical plots:
3.1.18(Stacked bar plot)

From 3.1.18 we observe the stacked bar plot between 'Exited' and 'Gender'. Here the green colour indicates customers who exited and pink indicates those who did not exit, based on Gender.

3.1.19 Stacked bar plot for Geography and Exited

From 3.1.19 we observe the stacked bar plot between 'Exited' and 'Geography'. Here the green colour indicates customers who exited and pink indicates those who did not exit, based on their geographical location.

Bivariate Continuous vs Continuous

3.1.20(Scatter diagram for EstimatedSalary and CreditScore)

From 3.1.20 we observe the scatter diagram between estimated salary and credit
score and it says that they are highly correlated.

3.1.21 (Scatter diagram for EstimatedSalary and Balance)

FEATURE PLOTS:
For the continuous Vs categorical variable, we will look at Feature plots to
understand the relationships between variables.
3.1.22 (Feature plot for CreditScore and Gender)

We can observe that there is not much difference in CreditScore between the Gender groups.
3.1.23 (Feature plot for EstimatedSalary and Gender)

We can observe that there is not much difference in EstimatedSalary between the Gender groups.

3.1.24 (Feature plot for Age and Exited)

We can observe that there is a slight difference in Age between the Exited groups.

3.1.25 (Feature plot for CreditScore and Geography)

We can observe that there is not much difference in CreditScore across Geography.

3.1.26 (Feature plot for EstimatedSalary and Geography)

We can observe that there is not much difference in EstimatedSalary across Geography.
3.1.27 (Feature plot for Balance and Exited)

We can observe that there is a slight difference in Balance between the Exited groups.

Normalising the Continuous variables: As our input data is in different units, we ideally have to normalise it. Hence, we normalized all the continuous variables to mean 0 and variance 1; a brief sketch of this step follows the table below.
CreditScore  Geography  Gender  Age  Tenure  Balance  NumOfProducts  HasCrCard  ActiveMember  Esti.Salary  Germany  Geo  Gen
0 619.0 France Female 42 2 0.00 1 1 1 101348.88 0 0 0
1 608.0 Spain Female 41 1 83807.86 1 0 1 112542.58 0 1 0
2 502.0 France Female 42 8 159660.80 3 1 0 113931.57 0 0 0
3 699.0 France Female 39 1 0.00 2 0 0 93826.63 0 0 0
4 850.0 Spain Female 43 2 125510.82 1 1 1 79084.10 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
9995 771.0 France Male 39 5 0.00 2 1 0 96270.64 0 0 1
9996 516.0 France Male 35 10 57369.61 1 1 1 101699.77 0 0 1
9997 709.0 France Female 36 7 0.00 1 0 1 42085.58 0 0 0
9998 772.0 Germany Male 42 3 75075.31 2 1 0 92888.52 1 0 1
9999 792.0 France Female 28 4 130142.79 1 1 0 38190.78 0 0 0
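A minimal sketch of this standardization step (mirroring the StandardScaler usage in the appendix code; the three rows below are simply the first few CreditScore, Age and EstimatedSalary values shown earlier):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[619., 42., 101348.88],
              [608., 41., 112542.58],
              [502., 42., 113931.57]])        # a few illustrative rows

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)            # rescale each column to mean 0, variance 1
                                              # (in the project this is fitted on the train split)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))   # ~0 and ~1 per column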

Cross Validation: Here we’ll perform the train and test split cross validation
techniques. So, as part of it, we need to split the original data into train and test
considering 80:20 proportion respectively. Here, we are randomly considering 80%
of the original data as train data. And, the dimensions of the train and test data are:

len(X_train)
8000
len(X_test)
2000

#dimension of train and test data

Further, we use k-fold cross validation to split the train data into 10 folds, e.g. KFold(n_splits=10) as used in the appendix code.
Running Pipeline using k-fold validation: Here, we use a pipeline of algorithms for classification to compare accuracies across the different methods. As this is a classification problem, we use Logistic Regression, Decision Tree, SVM, k-NN and Random Forest techniques as a part of the pipeline; a condensed sketch of the loop is shown below, and the full code is in the appendix.
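A minimal sketch of that spot-check loop, assuming the X_train and y_train produced by the 80:20 split above (the complete version, with all eight algorithms, is in the appendix):

from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Candidate classifiers evaluated with the same 10-fold cross-validation scheme.
models = [('LR', LogisticRegression(max_iter=1000)),
          ('CART', DecisionTreeClassifier()),
          ('KNN', KNeighborsClassifier()),
          ('SVM', SVC()),
          ('RF', RandomForestClassifier())]

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
for name, model in models:
    # X_train, y_train: the scaled 80% training split created above
    scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
    print('%s: %.4f (%.4f)' % (name, scores.mean(), scores.std()))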

Comparing algorithms:

When we compare all the models using the above pipeline, we obtained the best accuracy for Random Forest. So, we used the Random Forest model to get the variable importance.

Finding key variables using RandomForest:


Variable importance plot:

importance

Age 0.232611

EstimatedSalary 0.155034

Balance 0.148611

CreditScore 0.141878

NumOfProducts 0.120731


Tenure 0.086986

IsActiveMember 0.033105

Germany 0.025812

Male 0.020821

HasCrCard 0.019759

Spain 0.014652

The table above shows the importance (mean decrease in Gini index) of each variable, based on which we choose the key variables; a variable with a higher value is more influential.
Fitting the final model:

We considered the top 5 variables from the Random Forest and fitted Logistic
Regression. Summary of the final fitted models using key variables is shown
below.

Logistic Regression:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=0, solver='warn', tol=0.0001, verbose=0,
warm_start=False)
 
Accuracy: 0.79125
Confusion matrix:
[[6131 237]
[1433 199]]

3.1.28Logistic Regression output on train data

When Logistic Model is fitted for the train data, the accuracy obtained is 79%.

Accuracy: 0.7885
Confusion matrix:

[[1526 69]
[ 354 51]]
array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

3.1.29 Logistic Regression output on test data

When the Logistic Model is fitted to the test data, the accuracy obtained is 78.8%.

RANDOM FOREST (RF):

The hyper-parameters of the Random Forest are tuned to extract the best parameters for the final model:
 Number of trees: 100, 200 or 300.
 Number of variables considered in each tree: ranging from 1 to 5.

GridSearchCV(cv=5, error_score='raise-deprecating',
estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators='warn',
n_jobs=None, oob_score=False, random_state=0, verbose=0, warm_start=False),
iid='warn', n_jobs=-1, param_grid={'bootstrap': [True, False], 'criterion': ['gini',
'entropy'], 'max_features': ['auto', 2], 'n_estimators': [100, 300]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring='accuracy', verbose=0)
{'bootstrap': True, 'criterion': 'gini', 'max_features': 'auto', 'n_estimators': 300}

From the above, we found the best accuracy when the number of trees is 300, with max_features set to 'auto'.

Accuracy: 1.0
Classification report:
precision recall f1-score support
0 1.00 1.00 1.00 6368
1 1.00 1.00 1.00 1632
accuracy 1.00 8000
macro avg 1.00 1.00 1.00 8000
weighted avg 1.00 1.00 1.00 8000
Confusion matrix:
[[6368 0]
[ 0 1632]]

3.1.30 RandomForest output on train data

{'bootstrap': True, 'criterion': 'entropy', 'max_features': 'auto', 'n_estimators': 100}
Accuracy: 0.835
Confusion matrix:
[[1595    0]
 [   0  405]]
              precision    recall  f1-score   support
           0       1.00      1.00      1.00      1595
           1       1.00      1.00      1.00       405
    accuracy                           1.00      2000
   macro avg       1.00      1.00      1.00      2000
weighted avg       1.00      1.00      1.00      2000
3.1.31 RandomForest output on test data

3.1.32 Confusion Matrix

Support Vector Machines (SVM):

Accuracy: 0.84
Confusion matrix:

[[6210 158] [1122 510]]


Classification report:

precision recall f1-score support


0 0.85 0.98 0.91 6368
1 0.76 0.31 0.44 1632
accuracy 0.84 8000
macro avg 0.81 0.64 0.68 8000
weighted avg 0.83 0.84 0.81 8000
3.1.33 SVM output on train data

An accuracy of 0.84 is obtained on the train data when kernel=rbf and C=1.7 are chosen from the list of parameters using Support Vector Machines.

Accuracy: 0.8315
Confusion matrix:
[[1502 93] [ 244 161]]
Classification report:
precision recall f1-score support
0 0.86 0.94 0.90 1595
1 0.63 0.40 0.49 405
accuracy 0.83 2000
macro avg 0.75 0.67 0.69 2000
weighted avg 0.81 0.83 0.82 2000

3.1.34 SVM output on test data

When the model is fitted using SVM, the accuracy obtained on the test data is 83.1%.
Fitting the final model:

We considered the top 5 variables from the Random Forest and fitted the models.
Summary of the final fitted k-NN model using key variables for best accuracy is
shown below.

K – Nearest Neighbourhood:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',


metric_params=None, n_jobs=None, n_neighbors=21, p=2, weights='uniform')
Accuracy: 0.8466
Confusion matrix:

[[6123 245] [ 982 650]]


Classification report:

precision recall f1-score support


0 0.86 0.96 0.91 6368
1 0.74 0.41 0.53 1632

Accuracy 0.85 8000
macro avg 0.80 0.69 0.72 8000
weighted avg 0.84 0.85 0.83 8000
3.1.35 K-NN output on train data

The best accuracy, i.e. 0.8466, is obtained when k = 21 nearest neighbours is chosen from the given list of neighbours by the k-Nearest Neighbours method.

Accuracy: 0.8315
Confusion matrix:

[[1502 93] [ 244 161]]


Classification report:
precision recall f1-score support

0 0.86 0.94 0.90 1595

1 0.65 0.41 0.50 405

accuracy 0.84 2000

macro avg 0.76 0.68 0.70 2000

weighted avg 0.82 0.84 0.82 2000

3.1.36 k-NN output on Test data

When k-NN Model is fitted for the test data, the accuracy obtained is 83.1%.

Since we obtained the best accuracy with k-NN on both the train and test data, it is the best fitted model.

CHAPTER 4

CONCLUSION

CONCLUSION

In order to identify whether a customer will exit (churn) or not, we developed an approach in which we applied pipeline techniques as well as Random Forest to choose the best model and the key variables.

Once we identified the key variables, we ran the model with only the key variables using the train data set and validated it using the test data set.

The accuracies on the train and test data are about 84.7% and 83.2% respectively. Since the accuracy on the train and test data is almost the same, we can say that the model generalizes well.

Hence, we can apply our model for future predictions.

APPENDIX

PYTHON – CODE
DATASET
BIBLIOGRAPHY

Python Code:

##################################################################
#Loading required libraries to perform modeling
##################################################################
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
##################################################################
#reading data from Folder and checking for the data types
################################################################
dataset = pd.read_csv('ChurnModelling.csv')
print(dataset.shape)
dataset.dtypes
dataset.head(10)
dataset.tail(10)
dataset.describe()
################################################################
# Univariate - Continuous variable plots
################################################################
plt.hist(dataset['CreditScore'])
plt.show()
print("Skewness: %f" % dataset['CreditScore'].skew())
print("Kurtosis: %f" % dataset['CreditScore'].kurt())
plt.hist(dataset['Age'])
plt.show()
print("Skewness: %f" % dataset['Age'].skew())
print("Kurtosis: %f" % dataset['Age'].kurt())

plt.hist(dataset['Tenure'])
plt.show()
print("Skewness: %f" % dataset['Tenure'].skew())
print("Kurtosis: %f" % dataset['Tenure'].kurt())
##################################################################
## Box plots
#################################################################
plt.boxplot(dataset['CreditScore'])
plt.show()
plt.boxplot(dataset['Balance'])
plt.show()
plt.boxplot(dataset['EstimatedSalary'])
plt.show()
plt.boxplot(dataset['Tenure'])
plt.show()
plt.boxplot(dataset['Age'])
plt.show()
#capping of outliers
dataset.loc[dataset['CreditScore']<dataset['CreditScore'].quantile(0.05),
['CreditScore']]=dataset['CreditScore'].quantile(0.05)
plt.boxplot(dataset['CreditScore'])
plt.show()
dataset.loc[dataset['Age']>dataset['Age'].quantile(0.95),
['Age']]=dataset['Age'].quantile(0.95)
plt.boxplot(dataset['Age'])
plt.show()

# Bi variate Categorical VS Categorical
ct=pd.crosstab(dataset['Gender'],dataset['Exited'],margins=True)
ct
ct=pd.crosstab(dataset['Geography'],dataset['Exited'],margins=True)
ct
##################################################################
# Categorical Vs categorical Plts
##################################################################
import matplotlib
%matplotlib inline
ct.iloc[:-1,:-1].plot(kind='bar',stacked=True, color=['green','pink'],grid=False)
##################################################################
# Continuous Vs Continuous
##################################################################
dataset.plot('CreditScore' , 'EstimatedSalary', kind='scatter')
dataset.plot('EstimatedSalary' , 'Balance', kind='scatter')
##################################################################
# Categorical-Continuous combination
##################################################################
dataset.boxplot(column='CreditScore', by= 'Gender')
dataset.boxplot(column='CreditScore', by= 'Geography')
dataset.boxplot(column='EstimatedSalary', by= 'Gender')
dataset.boxplot(column='Age', by= 'Exited')
dataset.boxplot(column='EstimatedSalary', by= 'Geography')
dataset.boxplot(column='Balance', by= 'Exited')
dataset.boxplot(column='EstimatedSalary', by= 'Exited')

##################################################################
## Correlation plots
##################################################################
import numpy
# Multivariate Plots ( Continuous Vs Continuous)
dataset1 = dataset[['CreditScore','Age', 'Tenure', 'Balance', 'NumOfProducts',
'EstimatedSalary']]
names= ['CreditScore','Age', 'Tenure', 'Balance', 'NumOfProducts',
'EstimatedSalary']
correlations = dataset1.corr()
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = numpy.arange(0,6,1)  # one tick per variable in dataset1
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()

##################################################################
# chi2_contingency table
#################################################################
from scipy.stats import chi2_contingency

from scipy.stats import chi2
table = [ [3404, 1139],
[4559, 898]]
print(table)
stat, p, dof, expected = chi2_contingency(table)
print(expected)
print(stat)
print(p)
##################################################################
#creating independent and dependent variables
##################################################################
X = dataset.iloc[:, 3:13]
y = dataset.iloc[:, 13]
#importing labelencoder
from sklearn.preprocessing import LabelEncoder
le_Geography=LabelEncoder()
le_Gender=LabelEncoder()
#transforming string to numeric
X['geography']=le_Geography.fit_transform(X['Geography'])
X['gender']=le_Gender.fit_transform(X['Gender'])
X=X.drop(['Geography','Gender'],axis='columns')
## Splitting the dataset into the Training set and Test set:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
random_state = 0)

## Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
##Importing all the metrics and models
import os
import pandas
import numpy
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import pandas as pd

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from time import time
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score , classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, accuracy_score,
classification_report
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
##################################################################
# Validation options and evaluation metric
##################################################################
num_folds = 10
num_instances = len(X_train)
seed = 7
scoring = 'accuracy'
# Model Pipeline
## Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
models.append(('AB', AdaBoostClassifier()))
models.append(('RF', RandomForestClassifier()))

##################################################################
# Finding best model
##################################################################
results = []
names = []
for name, model in models:
    # shuffle=True is required when passing random_state to KFold in recent scikit-learn
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
##################################################################
# Compare Algorithms
##################################################################
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

##################################################################
# Running Random Forest for variables selection
##################################################################

classifier = RandomForestClassifier(random_state=0)
# Random Forest
#Grid Search for Parameter Selection
grid_param = {
    'n_estimators': [100, 200],
    'criterion': ['gini', 'entropy'],
    'bootstrap': [True, False],
    'max_features': ['auto', 2],
}

from sklearn.model_selection import GridSearchCV


gd_sr = GridSearchCV(estimator=classifier,
param_grid=grid_param,
scoring='accuracy',
cv=5,
n_jobs=-1)

gd_sr.fit(X_train, y_train)
best_parameters = gd_sr.best_params_
print(best_parameters)
rf = RandomForestClassifier(bootstrap=True, criterion='gini',
                            n_estimators=300, max_features='auto')

## Fit the model on your training data.


rf.fit(X_train, y_train)
predictions = rf.predict(X_train)

print(accuracy_score(y_train, predictions))
print(classification_report(y_train, predictions))
print(confusion_matrix(y_train, predictions))
rf.score(X_train,y_train)
gd_sr.fit(X_train, y_train)
best_parameters = gd_sr.best_params_
print(best_parameters)
rf = RandomForestClassifier(bootstrap= True, criterion='gini',
n_estimators=300,max_features='auto')
## Fit the model on your test data.
rf.fit(X_test, y_test)
y_predicted = rf.predict(X_test)
print(accuracy_score(y_test, y_predicted))
print(classification_report(y_test, y_predicted))
print(confusion_matrix(y_test, y_predicted))
rf.score(X_test,y_test)
##################################################################
#Confusion Matrix
##################################################################
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_predicted)
cm
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sn

plt.figure(figsize=(7,7))
sn.heatmap(cm, annot=True)
plt.xlabel('Predicted')
plt.ylabel('Truth')
##feature selection
rf.feature_importances_
feature_columns =
['CreditScore','Age','Tenure','Balance','NumOfProducts','HasCrCard','IsActiveMem
ber','EstimatedSalary','Germany','Spain','Male']
import pandas as pd
feature_importances = pd.DataFrame(rf.feature_importances_,
index = feature_columns,

columns=['importance']).sort_values('importance', ascending=False)
feature_importances
# selecting required variables
cols = [col for col in X .columns if col in
['Age','EstimatedSalary','Balance','CreditScore','NumOfProducts']]
datset1 = dataset[cols]
y = dataset.iloc[:, 13]

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(datset1, y, test_size=0.2,
random_state=0)

##Scaling the Data


from sklearn.preprocessing import StandardScaler

feature_scaler = StandardScaler()
X_train = feature_scaler.fit_transform(X_train)
X_test = feature_scaler.transform(X_test)
##################################################################
# Logistic Regression
##################################################################
from sklearn.linear_model import LogisticRegression
classifier=LogisticRegression(random_state=0)
classifier.fit(X_train,y_train)
#estimate accuracy on validation dataset
predictions = classifier.predict(X_train)
print(accuracy_score(y_train, predictions))
print(confusion_matrix(y_train, predictions))
# estimate accuracy on validation dataset
predictions1 = classifier.predict(X_test)
print(accuracy_score(y_test, predictions1))
print(confusion_matrix(y_test, predictions1))
predictions1
##################################################################
# # Running grid search SVM
##################################################################
c_values = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 1.3, 1.5, 1.7, 2.0]
kernel_values = ['linear', 'poly', 'rbf', 'sigmoid']
param_grid = dict(C=c_values, kernel=kernel_values)
model = SVC()

kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring,
cv=kfold)
grid_result = grid.fit(X_train, y_train)

## for params, mean_score, scores in grid_result.grid_scores_:


## print("%f (%f) with: %r" % (scores.mean(), scores.std(), params))
for means, stdev, params in zip(grid_result.cv_results_['mean_test_score'],
                                grid_result.cv_results_['std_test_score'],
                                grid_result.cv_results_['params']):
    print(round(means, 6), round(stdev, 6), params)
## Printing final parameters
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
#rbf with C:0.5 is best parameter
#$ Final SVM Model
model = SVC(C=1.7, kernel='rbf')
model.fit(X_train, y_train)
# estimate accuracy on training dataset
predictions = model.predict(X_train)
print(accuracy_score(y_train, predictions))
print(confusion_matrix(y_train, predictions))
print(classification_report(y_train, predictions))
#estimate accuracy on test dataset
predictions1 = model.predict(X_test)
print(accuracy_score(y_test, predictions1))
print(confusion_matrix(y_test, predictions1))

print(classification_report(y_test, predictions1))
##################################################################
# Tune scaled k-NN
##################################################################
neighbors = [3,5,7,9,11,13,15,17,19,21]
param_grid = dict(n_neighbors=neighbors)
model = KNeighborsClassifier()
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring,
cv=kfold)
grid_result = grid.fit(X_train, y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
for means, stdev, params in zip(grid_result.cv_results_['mean_test_score'],
                                grid_result.cv_results_['std_test_score'],
                                grid_result.cv_results_['params']):
    print(round(means, 6), round(stdev, 6), params)

### Final model with k-NN


model = KNeighborsClassifier(n_neighbors=21)
model.fit(X_train, y_train)
# estimate accuracy on the training dataset
predictions2 = model.predict(X_train)
print(accuracy_score(y_train, predictions2))
print(confusion_matrix(y_train, predictions2))
print(classification_report(y_train, predictions2))
# estimate accuracy on the test dataset
predictions3 = model.predict(X_test)
print(accuracy_score(y_test, predictions3))
print(confusion_matrix(y_test, predictions3))
print(classification_report(y_test, predictions3))
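# A short optional summary (using only models and parameters that appear above):
# re-fit the three tuned classifiers on the scaled training data and print their
# test-set accuracies side by side.
models = {
    'Logistic Regression': LogisticRegression(random_state=0),
    'SVM (rbf, C=1.7)': SVC(C=1.7, kernel='rbf'),
    'k-NN (k=21)': KNeighborsClassifier(n_neighbors=21),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(name, round(accuracy_score(y_test, clf.predict(X_test)), 4))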

DATASET

RowNumber  CustomerId  Surname   CreditScore  Geography  Gender  Age  Tenure  Balance    NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  Exited
1          15634602    Hargrave  619          France     Female  42   2       0          1              1          1               101348.9         1
2          15647311    Hill      608          Spain      Female  41   1       83807.86   1              0          1               112542.6         0
3          15619304    Onio      502          France     Female  42   8       159660.8   3              1          0               113931.6         1
4          15701354    Boni      699          France     Female  39   1       0          2              0          0               93826.63         0
5          15737888    Mitchell  850          Spain      Female  43   2       125510.8   1              1          1               79084.1          0
6          15574012    Chu       645          Spain      Male    44   8       113755.8   2              1          0               149756.7         1
7          15592531    Bartlett  822          France     Male    50   7       0          2              1          1               10062.8          0
8          15656148    Obinna    376          Germany    Female  29   4       115046.7   4              1          0               119346.9         1
9          15792365    He        501          France     Male    44   4       142051.1   2              0          1               74940.5          0
10         15592389    H?        684          France     Male    27   2       134603.9   1              1          1               71725.73         0
11         15767821    Bearce    528          France     Male    31   6       102016.7   2              0          0               80181.12         0
12         15737173    Andrews   497          Spain      Male    24   3       0          2              1          0               76390.01         0
13         15632264    Kay       476          France     Female  34   10      0          2              1          0               26260.98         0
14         15691483    Chin      549          France     Female  25   5       0          2              0          0               190857.8         0
15         15600882    Scott     635          Spain      Female  35   7       0          2              1          1               65951.65         0
(Rows 16-139 of the excerpt continue in the same fourteen-column format.)

As the full dataset is large, with 10,000 rows and 14 columns, only the first 139 rows
have been reproduced in this appendix.
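The rows shown above can be read back into the dataset object used in the appendix
code with pandas; a minimal sketch follows (the file name Churn_Modelling.csv is an
assumption based on the Kaggle source).

import pandas as pd

dataset = pd.read_csv('Churn_Modelling.csv')              # 10,000 rows x 14 columns
print(dataset.shape)
print(dataset[['Geography', 'Gender', 'Exited']].head())  # categorical inputs and target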

BIBLIOGRAPHY

1. Multivariate Data Analysis (Fifth Edition) -- Joseph F. Hair, Rolph E. Anderson,
   Ronald L. Tatham and William C. Black

2. Data Mining: Theories, Algorithms, and Examples -- Nong Ye

3. A Practical Guide to Data Mining for Business and Industry -- Andrea
   Ahlemeyer-Stubbe, Shirley Coleman

4. Data Mining and Predictive Analytics -- Daniel T. Larose, Chantal D. Larose

5. Machine Learning Mastery with R -- Jason Brownlee

6. Master Machine Learning Algorithms -- Jason Brownlee

7. Statistical Methods for Machine Learning -- Jason Brownlee

8. Machine Learning Using R -- Karthik Ramasubramanian, Abhishek Singh

9. Data Science for Business -- Foster Provost & Tom Fawcett
