Analytics Compendium

A HANDBOOK
ON
Analytics
Compiled By:
THE CONSULTING CLUB
VINOD GUPTA SCHOOL OF MANAGEMENT

INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR
Table of Contents
1. Supervised learning
2. Regression
3. Simple Linear Regression
4. Multiple Linear Regression
5. Polynomial Regression
6. Components of the Regression Equation
7. KNN Regression
8. Standardization vs Normalization
9. Exploratory Data Analysis (EDA)
10. Classification
11. Classification Model Evaluation
12. Feature Selection
13. Unsupervised Learning
14. K MEANS Clustering
15. FAQs in Data Analytics Interviews
Cheat Sheet
Source - Cheat Sheet

Supervised learning
Supervised Learning is the process of making an algorithm learn to map an input to a particular
output. This is achieved using the labelled datasets that you have collected. If the mapping is
correct, the algorithm has successfully learned. Otherwise, you make the necessary changes to the
algorithm so that it can learn correctly. Supervised Learning algorithms can help make predictions
for new, unseen data obtained in the future.
Supervised Learning is important because:
• Learning gives the algorithm experience, which can be used to output predictions for
new unseen data
• Experience also helps in optimizing the performance of the algorithm
• Real-world computations can also be handled by Supervised Learning algorithms
Classification and regression are the two basic types of supervised learning. Both follow the
same basic concept: train the model on a known dataset, then use it to predict the outcome.
Regression
Unlike classification, a regression model is trained to predict a continuous numerical value
as its output, based on the input variables.
The algorithm maps the input data (x) to continuous or numerical data (y).
There are several kinds of regression algorithms, such as linear regression, polynomial regression,
quantile regression and lasso regression. Linear regression is the simplest method of regression.

Simple Linear Regression

The most elementary type of regression model is simple linear regression, which explains
the relationship between a dependent variable and one independent variable using a straight
line. The straight line is plotted on the scatter plot of these two variables.
Regression Line
The standard equation of the regression line is given by the following expression: Y = β₀ + β₁.X
Here β₀ is the intercept and β₁ is the slope of the line.
5 Major Assumptions for Linear Regression:

1) Linearity: The relationship between X and the mean of Y is linear.
2) Little or no multicollinearity between the features (relevant when there is more than one feature).
3) Homoscedasticity: The variance of the residuals is the same for any value of X.
4) Normal distribution of the error terms.
5) Little or no autocorrelation in the residuals.
Best fit Line

The best-fit line is found by minimising the RSS (Residual Sum of Squares), which is equal to
the sum of the squared residuals over all data points in the plot. The residual for any data
point is found by subtracting the predicted value of the dependent variable from its actual value:
$e_i = y_i - \hat{y}_i$, so that $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.
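The best-fit line can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual closed-form least-squares estimates for one predictor; the data values and the function names are invented for the example and do not come from the handbook.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1), the intercept and slope that minimise the RSS."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Closed-form least-squares estimates for a single predictor.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


def residual_sum_of_squares(xs, ys, b0, b1):
    """RSS = sum over all points of (actual - predicted)^2."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = fit_simple_linear_regression(xs, ys)
rss = residual_sum_of_squares(xs, ys, b0, b1)
print(f"Y = {b0:.2f} + {b1:.2f}.X, RSS = {rss:.2f}")
```

Any other line through the same points, for instance one with a slightly different slope, gives a larger RSS than the fitted one.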

Strength of Linear Regression Model

1. R²
R² is a number which explains what portion of the variation in the given data is explained by the
developed model. For a least-squares fit with an intercept, evaluated on the data used for fitting,
it takes a value between 0 and 1. In general terms, it provides a measure of
how well actual outcomes are replicated by the model, based on the proportion of total variation
of outcomes explained by the model. Overall, the higher the R-squared,
the better the model fits your data.
Mathematically, it is represented as: R² = 1 - (RSS / TSS)
RSS (Residual Sum of Squares): It is the sum of the squared errors across the
whole sample, i.e. a measure of the difference between the predicted and the actual output. A
small RSS indicates a tight fit of the model to the data. It is defined as follows:
$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
TSS (Total Sum of Squares): It is the sum of the squared deviations of the data points from the
mean of the response variable. Mathematically, TSS is:
$\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$
Physical Significance of R²
In Graph 1: All the points lie on the line and the R² value is a perfect 1
In Graph 2: Some points deviate from the line and the error is reflected in the lower R² value of
0.70
In Graph 3: The deviation further increases and the R² value goes down further to 0.36
In Graph 4: The deviation is higher still, with a very low R² value of 0.05
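The relation R² = 1 - (RSS / TSS) can be checked numerically. The sketch below assumes a small invented set of actual values and model predictions; the function name is hypothetical.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - RSS / TSS for paired actual and predicted values."""
    mean_y = sum(actual) / len(actual)
    rss = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    tss = sum((y - mean_y) ** 2 for y in actual)
    return 1 - rss / tss


# Hypothetical values, invented for illustration only.
actual = [2, 4, 5, 4, 5]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]
r2 = r_squared(actual, predicted)
print(f"R^2 = {r2:.2f}")
```

When every prediction equals the actual value, RSS is zero and the function returns 1, which corresponds to Graph 1.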

2. Adjusted R²
Adjusted R-squared is used when analysing multiple regression output and is usually ignored when
analysing simple linear regression output. When we add more independent variables to
our analysis, R-squared never decreases, even if the added variables have no real explanatory
power, so it is inflated. As the name indicates, the adjusted
R-squared is the R-squared adjusted for this inflation when performing multiple regression.
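The handbook does not state the adjustment itself. A commonly used form is Adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of independent variables. The sketch below assumes that form; the input values are invented.

```python
def adjusted_r_squared(r2, n_observations, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    n, k = n_observations, n_predictors
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical regression output: R^2 = 0.80 from 20 observations.
with_three = adjusted_r_squared(0.80, n_observations=20, n_predictors=3)
with_six = adjusted_r_squared(0.80, n_observations=20, n_predictors=6)
print(f"{with_three:.4f} {with_six:.4f}")
```

For the same R², the model that needed more predictors gets the lower adjusted value, which is the penalty the text describes.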
3. Significance F
A simple way to think about the significance F is as a measure of how likely it is that our
regression model has no real explanatory power and needs to be discarded. Strictly, it is the
p-value of the overall F-test: the probability of obtaining an F statistic at least as large as the
one observed if all the slope coefficients were actually zero. We want the significance F
to be as small as possible.
Significance F: Smaller is better
Multiple Linear Regression

Multiple linear regression is a statistical technique to understand the relationship between one
dependent variable and several independent variables (explanatory variables).
The objective of multiple regression is to find a linear equation that can best determine the value of
the dependent variable Y for different values of the independent variables in X.
Consider an example of sales prediction using the TV marketing budget. In a real-life scenario, the
marketing head would want to look into the dependency of sales on the budget allocated to different
marketing sources. Here, we have considered three different marketing sources, i.e. TV marketing,
Radio marketing, and Newspaper marketing.

Sales = β0 + β1.TV marketing + β2.Radio marketing + β3.Newspaper marketing
The equation of multiple linear regression would be: Y = β0 + β1.X1 + β2.X2 + ... + βk.Xk
When is Multiple Linear Regression Used?

Running a separate simple linear regression for each predictor gives several different models when we
are interested in just one, and each of those models ignores the effect of the other predictors.
Besides that, an input variable may itself be correlated with
or dependent on some other predictor. This can cause wrong predictions and unsatisfactory
results. In such cases, we use Multiple Linear Regression.
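To make the sales example concrete, here is a minimal sketch that estimates β0 to β3 by solving the normal equations (X'X)β = X'y, one standard way of obtaining least-squares coefficients (the handbook does not name a method). The budget figures are invented, and the sales values are generated from assumed coefficients so that the fit can be checked.

```python
def solve_linear_system(a, b):
    """Solve a.x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations update both sides together.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_multiple_linear_regression(rows, ys):
    """Return [b0, b1, ..., bk] from the normal equations (X'X) b = X'y."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical (TV, Radio, Newspaper) budgets, invented for illustration.
budgets = [(100, 20, 10), (150, 10, 30), (200, 40, 20), (50, 30, 40), (120, 25, 5)]
# Sales generated from assumed coefficients 3, 0.05, 0.2 and 0.01.
sales = [3 + 0.05 * tv + 0.2 * radio + 0.01 * news for tv, radio, news in budgets]
coefficients = fit_multiple_linear_regression(budgets, sales)
print([round(c, 4) for c in coefficients])
```

Because the sales values were generated without noise, the fit recovers the assumed coefficients. With real data the estimates would only approximate the underlying relationship.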
Polynomial Regression
One of the key assumptions of Linear Regression is that there has to be a linear relation between
the dependent and independent variables. In many practical scenarios, this assumption might not be
valid. How do we then implement regression when this condition is not valid?
We use Polynomial Regression in such cases. This kind of regression assumes that there exists a
non-linear relation between the independent feature variable(s) X and the dependent target variable
Y. When the relation really is non-linear, this model reduces the error in estimation and increases
the accuracy by having a better fit with
the data, at the cost of making the regression equation non-linear in terms of X. Below is a figure
comparing the Polynomial Regression line fit against the Linear Regression line fit on a sample
non-linear data.
Because the regression assumes non-linearity, the regression equation differs from that of linear
regression. We now consider terms of X raised to a specific power. The general polynomial
regression equation of degree n for a single independent variable is:
$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \dots + \beta_n X^n$
The degree to use in the equation is a hyperparameter, and we need to choose it
wisely. A high degree of polynomial tends to overfit the data, while for small values of
the degree the model tends to underfit, so we need to find the optimum value of the degree. The model
equation can also be expressed in matrix form, $y = X\beta + \varepsilon$, where row i of the
matrix X is $(1, x_i, x_i^2, \dots, x_i^n)$ and $\beta = (\beta_0, \beta_1, \dots, \beta_n)^T$.

For calculating goodness of fit and accuracy we use the same metrics that we used in the linear
regression scenario.
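Polynomial regression can be seen as linear regression on powers of X. The sketch below builds the rows (1, x, x², ...) of the matrix form and evaluates a polynomial for given coefficients. The coefficients are chosen by hand for illustration, they are not fitted values, and the function names are hypothetical.

```python
def polynomial_features(x, degree):
    """Return [1, x, x^2, ..., x^degree], one row of the design matrix."""
    return [x ** p for p in range(degree + 1)]


def predict(coefficients, x):
    """Evaluate b0 + b1*x + ... + bn*x^n for the given coefficients."""
    row = polynomial_features(x, len(coefficients) - 1)
    return sum(b * v for b, v in zip(coefficients, row))


xs = [0, 1, 2, 3]
design_matrix = [polynomial_features(x, 2) for x in xs]
# Hypothetical coefficients for Y = 1 + 2.X^2, chosen for illustration.
coefficients = [1.0, 0.0, 2.0]
predictions = [predict(coefficients, x) for x in xs]
print(design_matrix)
print(predictions)
```

Raising the degree only adds columns to the design matrix, which is why the same goodness-of-fit metrics as in linear regression still apply.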
Components of the Regression Equation

Standard error of the coefficients
The standard error of a coefficient reflects the variability of the coefficient estimate, i.e. how
much the estimate would vary if the regression were repeated on different samples from the same
population. In other words, when we use the regression model to estimate the
coefficient of an independent variable, the standard error shows how wrong the estimated
coefficient could be. Because the standard error reflects
how wrong you could be, we want the standard error to be small in relation to its coefficient.
The standard error is also used to build a confidence interval for each coefficient value.
p-values
The P value indicates how reliable the estimated coefficient is. A loose
way to think of the P value is as the “probability of an error”. We want the P value to be as small
as possible.
How small the P value should be depends on a cut-off level that we decide on separately (also
called the significance level). The cut-off selected depends on the nature of the data studied and the
different error types. The cut-off or significance level is usually 1%, 5% or 10%.
Generally, a cut-off point of 5% is used.
Statistically speaking, the P value is the probability of obtaining a result as or more extreme than
the one you got if the null hypothesis were true. For a regression coefficient, the null hypothesis
is that the true coefficient is zero, so the P value is the probability of seeing an estimate at
least this far from zero when the independent variable in fact has no linear effect on the
dependent variable.
Note that the P value is similar in interpretation to the significance F discussed earlier. The key
difference is that the P value applies to each corresponding coefficient and the significance F
applies to the entire model as a whole.

KNN Regression
The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine
learning algorithm that can be used to solve both classification and regression problems. The
KNN algorithm assumes that similar things exist in close proximity. In other words, similar
things are near to each other. Then KNN algorithm uses ‘feature similarity’ to predict the values
of any new data points.
The steps in the algorithm are as follows:

1. Choose a value of K. This is usually done by comparing the prediction error on held-out
(validation) data for several values of K, for example with an elbow plot of error against K.
2. For the new point, find its K nearest neighbors in the training data, using a distance metric.
This serves as a measure of similarity. Metrics like Euclidean distance, Manhattan distance, etc. can
be used to find the distance. The lower the distance between two points, the more similar they are.
3. Take the mean of the target values of these K neighbors and use it as the regressed output for
the new point.
Below is an example of KNN Regression.
Let’s assume that age, height and weight data is available for a set of people. We don’t
have the weight for person ID 11. We wish to find it out using KNN Regression. Let’s assume
that K is 3.

The data has been plotted for persons ID1 to ID10. We see that person ID11 lies closest to ID6,
ID5 and ID1. We limit the neighborhood to these 3 points, as we chose K=3. Now that we have the
neighborhood, we take the average value of weight for the neighborhood. The average of
these points is (60+72+77)/3 Kg = 69.67 Kg. This is the regressor output of the algorithm,
i.e. the predicted weight of the new person ID11 whose weight we did not know.
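The worked example can be sketched as follows. The handbook's data table is not reproduced in the text, so the (age, height) values below are invented. Only the three neighbor weights (60, 72 and 77 Kg) echo the example, and Euclidean distance is assumed.

```python
import math


def knn_regress(train, query, k):
    """Predict the target for `query` as the mean target of its k nearest rows.

    `train` is a list of (features, target) pairs.
    """
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / k


# Hypothetical ((age, height in cm), weight in Kg) records.
people = [
    ((36, 163), 60),
    ((40, 167), 72),
    ((37, 168), 77),
    ((20, 150), 50),
    ((60, 180), 90),
    ((25, 175), 65),
]
new_person = (38, 165)  # weight unknown
predicted_weight = knn_regress(people, new_person, k=3)
print(f"{predicted_weight:.2f} Kg")
```

In practice the features are usually scaled first, since otherwise the feature with the largest numeric range dominates the distance.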
Data Preprocessing
Data preparation is the process of cleaning and transforming raw data prior to processing and
analysis. It is an important step prior to processing and often involves reformatting data, making
corrections to data and the combining of data sets to enrich data.
Data preparation is often a lengthy undertaking for data professionals or business users, but it is
essential as a prerequisite to put data in context in order to turn it into insights and eliminate bias
resulting from poor data quality.
For example, the data preparation process usually includes standardizing data formats, enriching
source data, and/or removing outliers.
Steps in Data Preprocessing in Machine Learning

There are seven significant steps in data preprocessing in Machine Learning:
1. Acquire the dataset

Acquiring the dataset is the first step in data preprocessing in machine learning. To build and
develop Machine Learning models, you must first acquire the relevant dataset. This dataset will
be comprised of data gathered from multiple and disparate sources which are then combined in a
proper format to form a dataset. Dataset formats differ according to use cases.
2. Import all the crucial libraries

The predefined Python/R libraries can perform specific data preprocessing jobs. Importing all the
crucial libraries is the second step in data preprocessing in machine learning.
3. Import the dataset

In this step, you need to import the dataset/s that you have gathered for the ML project at hand.
Importing the dataset is one of the important steps in data preprocessing in machine learning.

4. Identifying and handling the missing values

In data preprocessing, it is pivotal to identify and correctly handle the missing values. Failing to do
this, you might draw inaccurate and faulty conclusions and inferences from the data. Needless to
say, this will hamper your ML project.
Basically, there are two ways to handle missing data:

a. Deleting a particular row or column – In this method, you remove a specific row that has a null
value for a feature, or a particular column where more than 75% of the values are missing. However,
this method discards information, and it is recommended that you use it only when the dataset has
adequate samples. You must ensure that deleting the data does not introduce
bias.
b. Calculating the mean – This method is useful for features having numeric data like age, salary,
year, etc. Here, you calculate the mean, median, or mode of the feature (column)
that contains a missing value and substitute the result for the missing value. This method
avoids any loss of data, although it tends to reduce the variance of the feature. It often yields
better results compared to the first method (omission of rows/columns). Another way of
approximation is to interpolate from neighboring values. However, this works best for
linear data.
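Method (b) can be sketched with the standard library. In this minimal example missing values are represented by None, which is an assumption of the sketch, and the column values are invented.

```python
from statistics import mean, median


def impute(values, strategy=mean):
    """Replace None entries with a statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


# Hypothetical age column with two missing entries.
ages = [25, None, 35, 30, None]
ages_filled = impute(ages)
# The median is less sensitive to extreme values than the mean.
incomes_filled = impute([1, None, 2, 100], strategy=median)
print(ages_filled, incomes_filled)
```

The fill value is computed from the observed entries only, so the original non-missing values are left unchanged.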
5. Encoding the categorical data

Categorical data refers to information that takes one of a fixed set of categories within the
dataset. Machine Learning models are primarily based on mathematical equations. Thus, you can
intuitively understand that keeping the categorical data as text in the equations will cause issues,
since the equations only work with numbers. The categories are therefore encoded as numbers
before modelling.
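The handbook does not name a specific encoding. One common choice is one-hot encoding, where each category becomes its own 0/1 column. The sketch below assumes that scheme; the category values are invented.

```python
def one_hot_encode(values):
    """Map each category to a 0/1 indicator vector (columns sorted by name)."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


# Hypothetical marketing-source column.
categories, encoded = one_hot_encode(["TV", "Radio", "TV", "Newspaper"])
print(categories)
print(encoded)
```

Each row contains exactly one 1, in the column of its category, so no artificial ordering between categories is introduced.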
6. Splitting the dataset

Splitting the dataset is the next step in data preprocessing in machine learning. Every dataset for
Machine Learning model must be split into two separate sets – training set and test set.
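A train/test split can be sketched as below. The 80/20 ratio and the fixed random seed are assumptions made for the illustration, not values given in the handbook.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of `rows` and split it into (training set, test set)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for ten data records
train_set, test_set = train_test_split(rows)
print(len(train_set), len(test_set))
```

Shuffling before splitting avoids a test set that contains only the last records of an ordered file.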
7. Feature scaling
Feature scaling marks the end of the data preprocessing in Machine Learning. It is a method to
standardize the independent variables of a dataset within a specific range. In other words, feature
scaling limits the range of variables so that you can compare them on common grounds.
We can perform feature scaling in Machine Learning in two ways:
1. Standardization
2. Normalization

Standardization vs Normalization
Normalization
It refers to rescaling the data so that all the elements of a variable lie between 0 and 1, thus
bringing all the values of numeric columns in the dataset to a common scale.
The goal of normalization is to change the values of numeric columns in the dataset to a common
scale, without distorting differences in the ranges of values. Not every dataset
requires normalization for machine learning. It is required only when features have different ranges.
Normalization is a good technique to use when you do not know the distribution of your data or
when you know the distribution is not Gaussian (a bell curve). Normalization is useful when your
data has varying scales and the algorithm you are using does not make assumptions about the
distribution of your data, such as k-nearest neighbours and artificial neural networks.
Here’s the formula for normalization:
$X' = \dfrac{X - X_{min}}{X_{max} - X_{min}}$
Here, Xmax and Xmin are the maximum and the minimum values of the feature respectively.
Standardization
Standardization typically means rescaling data to have a mean of 0 and a standard deviation of 1
(unit variance).
Centring the features at 0 with a standard deviation of 1 is important when
we compare measurements that have different units. Variables that are measured at different scales
do not contribute equally to the analysis and might end up creating a bias.
Standardization is useful when your data has varying scales and the algorithm you are using does
make assumptions about your data having a Gaussian distribution, such as linear regression,
logistic regression, and linear discriminant analysis.
Here’s the formula for standardization:
$X' = \dfrac{X - \mu}{\sigma}$
Here, μ is the mean of the feature values and σ is the standard deviation of the feature values. Note
that in this case, the values are not restricted to a particular range.
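Both scaling methods can be sketched side by side. The feature values are invented, and the population standard deviation is assumed for standardization (some tools use the sample standard deviation instead).

```python
from statistics import mean, pstdev


def normalize(values):
    """Min-max scaling: (X - Xmin) / (Xmax - Xmin), giving values in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Z-score scaling: (X - mean) / standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values.
values = [10, 20, 30, 40, 50]
normalized = normalize(values)
standardized = standardize(values)
print(normalized)
print([round(z, 3) for z in standardized])
```

The normalized values are bounded by 0 and 1, while the standardized values are centred on 0 and can fall outside that range.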

Exploratory Data Analysis (EDA)

Correlation
Correlation is a statistical technique which determines how one variable moves/changes in relation
to another variable. It gives us an idea of the degree of the relationship between the two variables.
It’s a bi-variate analysis measure which describes the association between different variables. In
most businesses it’s useful to express one subject in terms of its relationship with others.
For example: Sales might increase if a lot of money is spent on product marketing.
Two features (variables) can be positively correlated with each other. It means that when the value
of one variable increases, the value of the other variable also increases.
Two features (variables) can be negatively correlated with each other. This occurs when the value
of one variable increases and the value of the other variable decreases.
Two features might not have any relationship with each other. This happens when a change in the
value of one variable does not affect the value of the other variable.

Methods to calculate correlation

1. Pearson Correlation Coefficient
It captures the strength and direction of the linear association between two continuous variables.
It tries to draw the line of best fit through the data points of the two variables. The Pearson
correlation coefficient indicates how far these data points are from the line of best fit. The
relationship is linear only when the change in one variable is proportional to the change in the
other variable.
The Pearson Correlation Coefficient is calculated as:
$r = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$
where
r = Pearson Correlation Coefficient
n = number of observations
∑xy = sum of the products of x and y values
∑x = sum of x values
∑y = sum of y values
∑x² = sum of squared x values
∑y² = sum of squared y values
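The sums listed above map directly onto code. The sketch below uses the standard computational form of the Pearson coefficient built from exactly those sums; the data values are invented.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation from n, sum(x), sum(y), sum(xy), sum(x^2), sum(y^2)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(f"r = {r:.4f}")
```

A perfectly linear increasing relationship gives r = 1 and a perfectly linear decreasing one gives r = -1.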
2. Spearman’s Correlation Coefficient
It tries to determine the strength and the direction of the monotonic relationship which exists
between two ordinal or continuous variables. In a monotonic relationship the two variables tend to
change together, but not necessarily at a constant rate. It’s calculated on the ranked values of the
variables rather than on the raw data.
The Spearman rank correlation coefficient is calculated as (exact when there are no tied ranks):
$\rho = 1 - \dfrac{6\sum d_i^2}{n(n^2 - 1)}$
where
ρ = Spearman rank correlation coefficient
di = the difference between the ranks of corresponding variables
n = number of observations
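The rank-based calculation can be sketched as follows, using the usual rank-difference formula ρ = 1 - 6∑di² / (n(n² - 1)). The sketch assumes there are no tied values, and the data are invented.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman rank correlation from the squared rank differences."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical data: y = x^3 is monotonic but not linear.
xs = [1, 2, 3, 4, 5]
cubes = [1, 8, 27, 64, 125]
rho_monotonic = spearman_rho(xs, cubes)
rho_mixed = spearman_rho([10, 20, 30, 40, 50], [1, 3, 2, 5, 4])
print(rho_monotonic, rho_mixed)
```

The cubic relationship scores a perfect 1 because only the ordering of the values matters, not the rate of change.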

Classification
Classification is a supervised machine learning technique which attempts to predict and categorize
output data into categories after observing values of independent variables. A classification
problem is one where the output variable is a category, such as “red” or “blue”, or “disease” and “no
disease”. For example, when filtering emails the categories are “spam” or “not spam”; when looking
at transaction data, they are “fraudulent” or “authorized”. In short, classification builds a model
from the training set and its class labels, and uses that model to predict categorical class labels
for new data. Classification models include
logistic regression, decision tree, random forest, gradient-boosted tree, multilayer perceptron,
one-vs-rest, and Naive Bayes. Here, we will look in detail at one of the simplest and most widely
used classification models – Logistic Regression.
Output data could either be binary, if it has only two states, or have multiple states. In its basic
form, logistic regression is a binary classification model.
Logistic Regression
Logistic regression is a classification model which uses the logistic function to make predictions for
labelled categorical outputs, as compared to linear regression where the output is a continuous
variable. It can be of 3 types: 1) Binary 2) Multinomial 3) Ordinal. In binary classification, the
output variable has only two values or classes. Some examples are:
1. A bank wishes to know whether a customer will default or not, based on a model learnt
from data of previous customers
2. An online site wants to know whether a customer will churn or not, based on the data of
other customers recorded in the past
Uni-variate Analysis:
Consider a dependent variable Y having values only 0
and 1, which depends only upon an independent variable X.
Now, consider Model A as shown in the figure, which says that
any value of X before the cut-off line yields
Y=0 and any value after that line yields Y=1. But in this case, for
values of X between the dotted lines, the predicted values will be
different from the actual values, causing the model to fail.
Hence, a stringent cut-off model is not a good model.
What is the better model?
[Figure: Model A (hard cut-off) and Model B (sigmoid curve)]

Sigmoid Curve:
A better way to deal with this is to use a model which yields a probability, as shown in Model B,
using the sigmoid function given by:
$P = \dfrac{1}{1 + e^{-(B_0 + B_1 X)}}$
This function assigns a non-zero probability to all values of X in the region between the dotted lines.
Now, the question is how to decide that the values provided by the sigmoid function are optimized
for the best possible modelling of the training data.
Since the function depends upon B0 and B1, the task at hand is to optimize these two.
Likelihood Function:
The likelihood function is the function which helps in finding the values of B0 and B1 for the
optimized model.
Let’s say there are 10 points on the X axis and random B0 and B1 values are chosen, as shown in the
curve below:

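Let us assume the red points are the actual values and the corresponding forecast points on the
sigmoid curve are at a vertical distance Pi from the X axis. For points whose actual Y value is 1 we
consider Pi, and for points whose actual Y value is 0 (i.e. which lie on the X axis) we consider
1-Pi. In this case, the likelihood function is the product of these terms over all points:
$L(B_0, B_1) = \prod_{i:\,Y_i = 1} P_i \prod_{i:\,Y_i = 0} (1 - P_i)$
The idea is to vary the values of B0 and B1 such that the likelihood function is maximized. This
process is done by libraries using numerical optimization algorithms.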
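The effect of B0 and B1 on the likelihood can be seen in a small sketch. It assumes the standard logistic form P = 1 / (1 + e^-(B0 + B1.X)) for the sigmoid curve; the data points and both candidate coefficient pairs are invented for illustration.

```python
import math


def sigmoid(x, b0, b1):
    """Probability that Y = 1 for input x under the logistic model."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))


def likelihood(xs, ys, b0, b1):
    """Product of P_i where Y_i = 1 and (1 - P_i) where Y_i = 0."""
    result = 1.0
    for x, y in zip(xs, ys):
        p = sigmoid(x, b0, b1)
        result *= p if y == 1 else (1 - p)
    return result


# Hypothetical training points: Y switches from 0 to 1 between X = 3 and X = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
flat_curve = likelihood(xs, ys, b0=0.0, b1=0.0)    # P = 0.5 everywhere
fitted_curve = likelihood(xs, ys, b0=-7.0, b1=2.0)  # curve centred at X = 3.5
print(f"{flat_curve:.4f} {fitted_curve:.4f}")
```

The second pair of coefficients gives a much higher likelihood, so an optimizer searching over B0 and B1 would prefer it to the flat curve.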
Odds and Log Odds:

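Now, consider P as the predicted probability that Y = 1 in the equation below:
$P = \dfrac{1}{1 + e^{-(B_0 + B_1 X)}}$
Assume the optimized B0 and B1 have been obtained. This equation can then be rewritten as:
$\ln\left(\dfrac{P}{1 - P}\right) = B_0 + B_1 X$
Here, P/(1-P) is known as the Odds and ln(Odds) is known as the Log Odds. Comparing this equation
with a simple regression line, the right-hand side has the same linear form, with the log odds on
the left-hand side in place of Y.
Hence, it is known as logistic regression.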
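The link between the probability and the log odds can be verified numerically. The sketch assumes the standard logistic form for P, and the coefficient values are invented.

```python
import math


def probability(x, b0, b1):
    """P = 1 / (1 + e^-(B0 + B1.X))."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))


def log_odds(p):
    """ln(P / (1 - P)), the natural log of the odds."""
    return math.log(p / (1 - p))


# Hypothetical coefficients and input.
b0, b1, x = -7.0, 2.0, 3.0
p = probability(x, b0, b1)
print(f"P = {p:.4f}, odds = {p / (1 - p):.4f}, log odds = {log_odds(p):.4f}")
```

The log odds come out equal to B0 + B1.X (here -7 + 2 × 3 = -1), which is the linear right-hand side of the rewritten equation.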
Multi- Variate Analysis:

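Multivariate logistic regression is a simple extension of univariate logistic regression. The new
equation will be like this:
$P = \dfrac{1}{1 + e^{-(B_0 + B_1 X_1 + B_2 X_2 + \dots + B_n X_n)}}$
The Log Odds regression equation will be:
$\ln\left(\dfrac{P}{1 - P}\right) = B_0 + B_1 X_1 + B_2 X_2 + \dots + B_n X_n$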

Generating Output
The output P is a probability, i.e. a continuous variable with a value between 0 and 1.
But the output of a classification model is a binary variable, so how are they related? The next
step is to select a cut-off probability below which the output is considered 0 and above which it is 1.
Initially any value can be used as the cut-off, and then model evaluation techniques are used to find
the optimum value.
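Applying a cut-off is a one-line operation. In the sketch below the default cut-off of 0.5 and the choice to label a probability exactly equal to the cut-off as 1 are assumptions; the probabilities are invented.

```python
def classify(probabilities, cutoff=0.5):
    """Convert probabilities to 0/1 labels using the given cut-off."""
    return [1 if p >= cutoff else 0 for p in probabilities]


# Hypothetical model outputs.
probabilities = [0.1, 0.4, 0.6, 0.9]
labels_default = classify(probabilities)
labels_lenient = classify(probabilities, cutoff=0.3)
print(labels_default, labels_lenient)
```

Lowering the cut-off labels more observations as 1, which is why the cut-off has to be tuned with the evaluation measures rather than fixed in advance.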
Classification tree:
Classification tree methods (i.e., decision tree methods) are recommended when the data mining task
contains classifications or predictions of outcomes, and the goal is to generate rules that can be easily
explained and translated into SQL or a natural query language.
A Classification tree labels, records, and assigns variables to discrete classes. A Classification tree can also
provide a measure of confidence that the classification is correct. A Classification tree is built through a
process known as binary recursive partitioning. This is an iterative process of splitting the data into
partitions, and then splitting it up further on each of the branches.
The process starts with a Training Set consisting of pre-classified records (target field or dependent variable
with a known class or label such as purchaser or non-purchaser). The goal is to build a tree that distinguishes
among the classes.
Bagging:
Bagging is an ensemble algorithm that fits multiple models on different subsets of a training dataset,
then combines the predictions from all models. It is highly effective for non-skewed class
distribution classification.
In 1996, Leo Breiman introduced the bagging algorithm, which has three basic steps:
• Bootstrapping: Bagging leverages a bootstrapping sampling technique to create diverse
samples. This resampling method generates different subsets of the training dataset by selecting data
points at random and with replacement. This means that each time you select a data point from the
training dataset, you are able to select the same instance multiple times. As a result, a value/instance
may be repeated twice (or more) in a sample.
• Parallel training: Weak or base learners are then trained on these bootstrap samples
independently and in parallel with each other.
• Aggregation: Finally, depending on the task (i.e. regression or classification), an average or a
majority of the predictions is taken to compute a more accurate estimate. In the case of regression,
an average is taken of all the outputs predicted by the individual models. For classification
problems, the class with the most votes is accepted; this is
known as hard voting or majority voting. Averaging the class probabilities predicted by the
individual classifiers instead is known as soft voting.
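The bootstrapping and aggregation steps can be sketched as below. The training of the base learners is left out, the individual model predictions are invented, and the function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items at random with replacement."""
    return [rng.choice(rows) for _ in rows]


def aggregate_regression(predictions):
    """Regression: average the outputs of the individual models."""
    return sum(predictions) / len(predictions)


def aggregate_classification(predictions):
    """Classification: majority (hard) vote over the predicted classes."""
    return Counter(predictions).most_common(1)[0][0]


rows = list(range(10))  # stand-in for ten training records
rng = random.Random(0)  # fixed seed, assumed for repeatability
sample = bootstrap_sample(rows, rng)

# Hypothetical predictions from three independently trained models.
averaged = aggregate_regression([2.0, 3.0, 4.0])
voted = aggregate_classification(["spam", "not spam", "spam"])
print(sample, averaged, voted)
```

Because sampling is done with replacement, a bootstrap sample has the same size as the original data but typically repeats some records and omits others.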
There are a number of key advantages and challenges that the bagging method presents when used
for classification or regression problems.
Key Benefits:
• Ease of implementation
• Reduction of variance
Key Challenges:
• Loss of interpretability
• Computationally expensive
• Less flexible
Random Forest Classification:

Random forest is an extension of bagging that also randomly selects subsets of features used in
each data sample. Both bagging and random forests have proven effective on a wide range of
different predictive modelling problems.
Random forest, like its name implies, consists of a large number of individual decision trees that
operate as an ensemble. Each individual tree in the random forest spits out a class prediction and
the class with the most votes becomes our model’s prediction.

The fundamental concept behind random forest is a simple but powerful one: the wisdom of
crowds. In data science speak, the reason that the random forest model works so well is: A large
number of relatively uncorrelated models (trees) operating as a committee will outperform any of
the individual constituent models. While some trees may be wrong, many other trees will be right,
so as a group the trees are able to move in the correct direction. So, the prerequisites for random
forest to perform well are:
• There needs to be some actual signal in our features so that models built using those features
do better than random guessing.
• The predictions (and therefore the errors) made by the individual trees need to have low
correlations with each other.
Boosting:
Boosting is a general ensemble method that creates a strong classifier from a number of weak
classifiers. This is done by building a model from the training data, then creating a second model
that attempts to correct the errors from the first model. Models are added until the training set is
predicted perfectly or a maximum number of models are added.
The steps below follow the AdaBoost algorithm. A weak classifier (decision stump) is prepared on the
training data using the weighted samples.
In this basic form only binary (two-class) classification problems are supported, so each decision
stump makes one decision on one input variable and outputs a +1.0 or -1.0 value for the first or
second class value.
The misclassification rate is calculated for the trained model. Traditionally, this is calculated as:
error = (N – correct) / N
Where error is the misclassification rate, correct is the number of training instances predicted
correctly by the model and N is the total number of training instances. For example, if the model
predicted 78 of 100 training instances correctly, the error or misclassification rate would be
(100-78)/100 or 0.22.
This is modified to use the weighting of the training instances:
error = sum(w(i) * terror(i)) / sum(w)
Which is the weighted sum of the misclassification rate, where w is the weight for training instance
i and terror is the prediction error for training instance i, which is 1 if misclassified and 0 if
correctly classified.
A stage value is calculated for the trained model which provides a weighting for any predictions
that the model makes. The stage value for a trained model is calculated as follows:
stage = ln((1-error) / error)
Where stage is the stage value used to weight predictions from the model, ln() is the natural
logarithm and error is the misclassification error for the model. The effect of the stage weight is
that more accurate models have more weight or contribution to the final prediction.
The training weights are then updated so that incorrectly predicted instances count for more in the
next model. The update leaves the weight unchanged if the training
instance was classified correctly and makes the weight larger if the weak learner
misclassified the instance, so correctly predicted instances carry relatively less weight.
Commonly used boosting algorithms include AdaBoost and XGBoost.
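One round of the weighting scheme described above can be sketched as follows. The error and stage formulas are the ones given in the text. The update rule w ← w × exp(stage × terror) is an assumption, since the text describes its effect but does not state it, and the four training instances are invented.

```python
import math


def weighted_error(weights, terror):
    """error = sum(w(i) * terror(i)) / sum(w)."""
    return sum(w * t for w, t in zip(weights, terror)) / sum(weights)


def stage_value(error):
    """stage = ln((1 - error) / error)."""
    return math.log((1 - error) / error)


def update_weights(weights, terror, stage):
    """Assumed update: unchanged if correct (terror = 0), larger if wrong."""
    return [w * math.exp(stage * t) for w, t in zip(weights, terror)]


# Hypothetical round: four equally weighted instances, the last one misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
terror = [0, 0, 0, 1]
error = weighted_error(weights, terror)
stage = stage_value(error)
new_weights = update_weights(weights, terror, stage)
print(error, round(stage, 4), new_weights)
```

The misclassified instance ends up with three times the weight of the others, so the next weak classifier is pushed to get it right.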

Support Vector Machine:

A support vector machine (SVM) is a supervised machine learning model that uses classification
algorithms for two-group classification problems. After an SVM model is given sets of labelled
training data for each category, it is able to categorize new text.
Compared to newer algorithms like neural networks, they have two
main advantages: higher speed and better performance with a limited
number of samples (in the thousands). This makes the algorithm very
suitable for text classification problems, where it’s common to have
access to a dataset of at most a couple of thousands of tagged samples.
SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a (usually very small)
subset of training samples, the support vectors.
• This becomes a Quadratic programming problem that is easy to solve
by standard methods
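As a small illustration of the margin idea, the sketch below evaluates a linear decision function sign(w·x + b) and the margin width 2 / ||w|| for an already trained linear SVM. The values of w and b are hypothetical; finding them is the quadratic programming step, which is not shown here.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b of point x relative to the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Class +1 on one side of the hyperplane, -1 on the other."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Distance between the two margin boundaries w.x + b = +1 and -1."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical trained parameters
w, b = [3.0, 4.0], -5.0
print(predict(w, b, [3.0, 4.0]), predict(w, b, [0.0, 0.0]), margin_width(w))
```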
Classification Model Evaluation

Confusion Matrix:
Actual/Predicted    Predicted Positive    Predicted Negative
Actual Positive     A (True Positive)     B (False Negative)
Actual Negative     C (False Positive)    D (True Negative)
The confusion matrix shown above holds the counts of True Positives (values which are
forecasted positive and are actually positive), False Negatives (values which are actually positive
but are forecasted negative), False Positives (values which are actually negative but are forecasted
positive) and True Negatives (values which are actually negative and are predicted as negative as
well).
Accuracy:
Accuracy is the ratio of correct predictions to the total number of observations, i.e.
(A + D) / (A + B + C + D).
Note: Accuracy becomes misleading when there is class imbalance in the output data. If the number
of positive samples is very small, accuracy will be very high simply because the model
predicts most negatives as negatives. So, negatives will contribute to accuracy but
positives won't, which biases the model towards negative outputs.
Sensitivity:
To resolve the above issue, further measures are introduced:
Sensitivity = A / (A + B)
Specificity = D / (C + D)
Sensitivity is also known as the True Positive Rate. One more important evaluation parameter is the
False Positive Rate, which is nothing but 1 - Specificity.
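A minimal sketch of these measures, using the cell names A, B, C and D from the confusion matrix table. The function names and the toy labels are hypothetical, and 1 is assumed to mean positive and 0 negative.

```python
def confusion_counts(actual, predicted):
    """Return (A, B, C, D) = (TP, FN, FP, TN) for 0/1 labels."""
    a = sum(1 for t, p in zip(actual, predicted) if t == 1 and p == 1)
    b = sum(1 for t, p in zip(actual, predicted) if t == 1 and p == 0)
    c = sum(1 for t, p in zip(actual, predicted) if t == 0 and p == 1)
    d = sum(1 for t, p in zip(actual, predicted) if t == 0 and p == 0)
    return a, b, c, d

def evaluate(a, b, c, d):
    """Accuracy, sensitivity (TPR), specificity and FPR from the counts."""
    sensitivity = a / (a + b)
    specificity = d / (c + d)
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fpr": 1 - specificity,
    }

# Hypothetical imbalanced data: 3 positives, 7 negatives
actual    = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
counts = confusion_counts(actual, predicted)
metrics = evaluate(*counts)
print(counts, metrics)
```

Here accuracy is 0.8 although one in three positives is missed, which is the imbalance effect described in the note on accuracy.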
ROC (Receiver Operating Characteristics):

To check the forecasting power of a model, the ROC curve is used. The ROC curve plots sensitivity
against 1 - specificity, or in other words the True Positive Rate against the False Positive Rate,
for cut-off values ranging from 0.0 to 1.0 in equal steps.
ROC shows the trade-off between TPR and FPR as the cut-off is varied from 0.0 to
1.0, a trade-off that occurs for all models.
Significance:
1. The dotted diagonal line is obtained when sensitivity equals 1 - specificity (TPR = FPR) for
all values of cut-off. It corresponds to random guessing and is treated as the worst model.
2. The closer the curve is to the top left corner, the larger the area between the curve and the
dotted line and the better the model.
3. The area under the curve is known as AUC.
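The construction of the curve can be sketched as follows: sweep the cut-off from 0.0 to 1.0 in equal steps, record (FPR, TPR) at each step, and approximate the AUC with the trapezoidal rule. The function names and the toy scores are hypothetical, and the data is assumed to contain at least one positive and one negative.

```python
def roc_points(y_true, scores, steps=10):
    """(FPR, TPR) pairs for cut-offs 0.0, 0.1, ..., 1.0, sorted by FPR."""
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    points = set()
    for i in range(steps + 1):
        cutoff = i / steps
        # An instance is predicted positive when its score reaches the cut-off
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= cutoff)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= cutoff)
        points.add((fp / negatives, tp / positives))
    return sorted(points)

def auc(points):
    """Area under the piecewise-linear curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical predicted probabilities for four instances
y_true = [0, 0, 1, 1]
scores = [0.15, 0.45, 0.35, 0.85]
points = roc_points(y_true, scores)
area = auc(points)
print(points, area)
```

A model on the dotted diagonal would give an area of 0.5, so values closer to 1.0 indicate a better model.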

Selection of Optimal Cut-off Value:

To select the optimal cut-off value, curves of specificity, sensitivity and accuracy against the
cut-off are plotted for the model with the highest AUC.
Generally, the optimum value is the one corresponding to the point where the accuracy,
specificity and sensitivity curves intersect, which is 0.3 in the diagram above.
Other Model Evaluators:

Precision:
Precision is the ratio of true positives to the total positives in the forecast, i.e. A / (A + C) in
the confusion matrix. This shows how many of the positives in the forecast are accurate.
Recall:
Recall is also known as Sensitivity or TPR (True Positive Rate). Its significance is that it
shows how many of the actual positive outcomes in the data are forecasted accurately.
Precision-Recall Trade-Off Curve:

In every model there exists a trade-off between precision and recall which is shown in the
curve below:

Here, precision and recall are plotted against the cut-off probability used to classify the
forecasted outputs in binary form. The curve also shows that the trade-off is optimum when the
cut-off probability is chosen around 0.4.
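The trade-off can be made concrete by computing precision and recall at several cut-off probabilities. This is a minimal sketch with hypothetical scores and a hypothetical function name; it is not the data behind the original curve.

```python
def precision_recall(y_true, scores, cutoff):
    """Precision and recall when scores >= cutoff are classified positive."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= cutoff)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= cutoff)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < cutoff)
    # Precision is undefined when nothing is predicted positive
    precision = tp / (tp + fp) if (tp + fp) else None
    # Assumes the data contains at least one actual positive
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
low = precision_recall(y_true, scores, 0.2)
mid = precision_recall(y_true, scores, 0.5)
high = precision_recall(y_true, scores, 0.8)
print(low, mid, high)
```

Raising the cut-off from 0.2 to 0.8 increases precision and lowers recall in this example, which is the trade-off the curve visualizes.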
Why do we need Precision and Recall if we have accuracy, sensitivity and specificity?
Answer: In some businesses it makes more sense to report precision instead of specificity or
1 - specificity. This is the case when the business mainly cares about the cases where the output
is True, since these matter most from the study perspective and are easier to present as a
business case.
Feature Selection
We have seen examples and equations for various models. They look quite simple, but in practice
datasets are rarely small and the features for regression or classification are rarely few
enough to count. Sometimes the number of features can run into the 100s, 1000s or millions.
Following are the reasons for selecting features:
1. Not all features affect the output, and hence not all of them are required in
modelling.
2. Features may be related to each other, hence not all of them will be required. This
is known as multi-collinearity in features. Multi-collinearity may lead coefficients to swing
wildly. Say the dependent variable depends on a predictor A positively and linearly, but A
depends on B. Then the coefficient of A may come out negative, compensated by the
coefficient of B. This is totally opposite to the case where the output changes
only with A and may baffle business interpretation. Also, in this case the p-values of the
coefficients cannot be trusted.
3. A larger number of features brings the bias vs variance trade-off into play. The more
features there are, the higher the possibility of more variance and overfitting, hence the model
may work better on the training data set but generate high errors on the test data set.

Solutions:
1. There are many ways to deal with issue 1.
- Using business knowledge to remove features which are not related to the output.
- Using partial correlation coefficients. Consider a dependent variable Y that depends on
several independent variables (x1, x2, x3, …, xn). The partial correlation coefficient of Y
with respect to an individual predictor, say x1, is found while removing the impact of the
other predictors (x2, x3, …, xn) on the correlation with Y, and this is repeated for the
predictors one by one. This requires complex mathematical operations.
- Using other methods like PCA, PLS etc.
2. Removing issues of multi-collinearity requires an understanding of multi-collinearity.
Principal Component Analysis (PCA) or Factor Analysis

Principal Component Analysis is a very useful method based on mathematics and statistics, which
performs dimensionality reduction by evaluating the dataset from different angles. Its task in
machine learning is to reduce the dimensionality of the inputs in the dataset, either to help the
learning algorithm or to group the dataset according to its features in an unsupervised approach.
PCA changes the orientation of the components so that they capture maximum variance, and in this
way it reduces the dimensionality of the dataset.
Flow chart of dimensionality reduction
Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) is a dimensionality reduction technique. As the name implies
dimensionality reduction techniques reduce the number of dimensions (i.e. variables) in a dataset
while retaining as much information as possible
Extensions to LDA
Linear Discriminant Analysis is a simple and effective method for classification. Because it is
simple and so well understood, there are many extensions and variations to the method. Some
popular extensions include:

• Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or
covariance when there are multiple input variables).
• Flexible Discriminant Analysis (FDA): Where non-linear combinations of inputs are used, such
as splines.
• Regularized Discriminant Analysis (RDA): Introduces regularization into the estimate of the
variance (actually covariance), moderating the influence of different variables on LDA.
The original development was called the Linear Discriminant or Fisher’s Discriminant Analysis.
The multi-class version was referred to as Multiple Discriminant Analysis. These are all simply
referred to as Linear Discriminant Analysis now.
Multicollinearity
Multicollinearity refers to the phenomenon of having related predictor
(independent) variables in the input data set. In simple terms, in a model that has been built using
several independent variables, some of these variables might be interrelated, due to which the
presence of those variables in the model is redundant. You drop some of these related independent
variables as a way of dealing with multicollinearity. Multicollinearity violates an assumption of
linear regression, namely that the predictors are not strongly correlated with each other. It can
arise from various patterns in the data itself, among other causes.
Detection of Multicollinearity:
1. One way to detect multicollinearity is to find the correlation between each pair of independent
predictors. For example, say there are n predictors X1, X2, X3, …, Xn. Find the pairwise
correlation coefficients for all nC2 pairs. Draw a heatmap of the resulting 2-D matrix, which makes
it easy to identify the pairs which are highly correlated, and eliminate one predictor of each
such pair using business knowledge. But this technique is limited to identifying the relation
between only two predictors at a time.

2. Using VIF (Variance Inflation Factor) to identify predictors that are highly correlated. The VIF
formula is given below:
VIFi = 1 / (1 - Ri²)
VIFi is the VIF for Xi, where i ranges over the predictors X1, X2, …, Xn. VIF1 is
found by assuming X1 is a function of X2, X3, …, Xn, treating X1 as the dependent
variable and fitting a best fit line. The R² of this model (R1²) is then used to find VIF1.
Similarly, the process is repeated for all i.
Interpretation of VIF:
The common heuristic we follow for the VIF values is:
> 10: VIF value is definitely high, and the variable should be eliminated.
5 to 10: Can be okay, but it is worth inspecting.
< 5: Good VIF value. No need to eliminate this variable.
Hence, all those predictors are removed for which VIF > 10.
3. Although adding predictors improves R², it leads to overfitting. Hence it is always
recommended to check the p-values of the coefficients of the predictors in the model and
remove any predictor whose p-value is greater than 0.05.
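The first two detection methods can be sketched in plain Python. The helper names, the 0.9 correlation threshold and the toy predictors are hypothetical. The VIF is computed as 1 / (1 - R²); to keep the sketch short, R² is taken from a regression of one predictor on a single other predictor, where it equals the squared Pearson correlation. With more predictors, R² would come from a multiple regression instead.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlated_pairs(predictors, threshold=0.9):
    """Predictor pairs (out of all nC2) whose |r| reaches the threshold."""
    return [
        (a, b)
        for a, b in combinations(predictors, 2)
        if abs(pearson(predictors[a], predictors[b])) >= threshold
    ]

def vif(r_squared):
    """Variance Inflation Factor for a predictor whose regression has this R²."""
    return 1.0 / (1.0 - r_squared)

def interpret_vif(value):
    """Apply the heuristic: > 10 eliminate, 5 to 10 inspect, otherwise good."""
    if value > 10:
        return "eliminate"
    if value > 5:
        return "inspect"
    return "good"

# Hypothetical predictors: X2 is almost a multiple of X1
predictors = {
    "X1": [1, 2, 3, 4, 5],
    "X2": [2, 4, 6, 8, 10.5],
    "X3": [5, 3, 4, 1, 2],
}
pairs = correlated_pairs(predictors)
vif_x1_x3 = vif(pearson(predictors["X1"], predictors["X3"]) ** 2)
print(pairs, vif_x1_x3, interpret_vif(vif_x1_x3))
```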
Remedies:
1. Make sure you have not fallen into the dummy variable trap; including a dummy variable for
every category (e.g., summer, autumn, winter, and spring) and including a constant term in the
regression together guarantee perfect multicollinearity.
2. Drop one of the variables. An explanatory variable may be dropped to produce a model with
significant coefficients. However, you lose information (because you've dropped a variable).
Omission of a relevant variable results in biased coefficient estimates for the remaining
explanatory variables that are correlated with the dropped variable.
3. Try seeing what happens if you use independent subsets of your data for estimation and apply
those estimates to the whole data set. Theoretically you should obtain somewhat higher variance
from the smaller datasets used for estimation, but the expectation of the coefficient values should
be the same. Naturally, the observed coefficient values will vary, but look at how much they
vary.

Unsupervised Learning
In some pattern recognition problems, the training data consists of a set of input vectors x without
any corresponding target values. The goal in such unsupervised learning problems may be to
discover groups of similar examples within the data, where it is called clustering, or to determine
how the data is distributed in the space, known as density estimation. Put in simpler
terms, for an n-sample space x1 to xn, true class labels are not provided for each sample, hence
this is known as learning without a teacher.
What is Clustering?
When you're trying to learn about something, say music, one approach might be to look for
meaningful groups or collections. You might organize music by genre, while your friend might
organize music by decade. How you choose to group items helps you to understand more about
them as individual pieces of music. You might find that you have a deep affinity for punk rock and
further break down the genre into different approaches or music from different locations. On the
other hand, your friend might look at music from the 1980's and be able to understand how the
music across genres at that time was influenced by the socio-political climate. In both cases, you
and your friend have learned something interesting about music, even though you took different
approaches.
In machine learning too, we often group examples as a first step to understand a subject (data
set) in a machine learning system. Grouping unlabelled examples is called clustering.
As the examples are unlabelled, clustering relies on unsupervised machine learning. If the
examples are labelled, then clustering becomes classification.

Figure 1: Unlabelled examples grouped into three clusters.

Before you can group similar examples, you first need to find similar examples. You can
measure similarity between examples by combining the examples' feature data into a metric,
called a similarity measure. When each example is defined by one or two features, it's easy to
measure similarity. For example, you can find similar books by their authors. As the number of
features increases, creating a similarity measure becomes more complex. We'll later see how to
create a similarity measure in different scenarios.
What are the Uses of Clustering?

Clustering has a myriad of uses in a variety of industries. Some common applications for
clustering include the following:
• market segmentation
• social network analysis
• search result grouping
• medical imaging
• image segmentation
• anomaly detection
After clustering, each cluster is assigned a number called a cluster ID. Now, you can condense
the entire feature set for an example into its cluster ID. Representing a complex example by a
simple cluster ID makes clustering powerful. Extending the idea, clustering data can simplify
large datasets.
For example, you can group items by different features as demonstrated in the following
examples:
• Group stars by brightness.
• Group organisms by genetic information into a taxonomy.
• Group documents by topic.
Machine learning systems can then use cluster IDs to simplify the processing of large datasets.
Thus, clustering’s output serves as feature data for downstream ML systems.

K MEANS Clustering
k-means clustering is a method of vector quantization, that aims to partition n observations into
k clusters in which each observation belongs to the cluster with the nearest mean (cluster centres
or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data
space into Voronoi cells. It is popular for cluster analysis in data mining. k-means clustering
minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean
distances.
Recall the first property of clusters – it states that the points within a cluster should be similar to
each other. So, our aim here is to minimize the distance between the points within a cluster. K-
means is a centroid-based algorithm, or a distance-based algorithm, where we calculate the
distances to assign a point to a cluster. In K-Means, each cluster is associated with a centroid.
The main objective of the K-Means algorithm is to minimize the sum of distances between the
points and their respective cluster centroid.
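A minimal sketch of the usual iterative procedure (often called Lloyd's algorithm): assign every point to its nearest centroid, recompute each centroid as the mean of its points, and repeat until nothing changes. Using the first k points as initial centroids is an assumption made here to keep the example deterministic; real implementations typically use random or smarter initialization. All names and data are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))

def kmeans(points, k, iterations=100):
    """Return (centroids, labels) after iterating until the centroids stop moving."""
    centroids = [list(p) for p in points[:k]]  # assumed initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # New centroid = mean of the assigned points (kept as-is if the cluster is empty)
        updated = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

def inertia(points, centroids, labels):
    """Sum of squared distances between points and their cluster centroid."""
    return sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))

# Hypothetical 2-D data with two visible groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels, inertia(points, centroids, labels))
```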
Challenges with the K-Means Clustering Algorithm

One of the common challenges we face while working with K-Means is that the size of
clusters is different.
The left and the rightmost clusters are of smaller size compared to the central cluster. Now, if
we apply k-means clustering on these points, the results will be something like this:

Another challenge with k-means is when the densities of the original points are different.
Here, the points in the red cluster are spread out whereas the points in the remaining clusters are
closely packed together. Now, if we apply k-means on these points, we will get clusters like this:
We can see that the compact points have been assigned to a single cluster, whereas the points
that are spread loosely but were in the same cluster have been assigned to different clusters. One
of the solutions is to use a higher number of clusters. So, in all the above scenarios, instead of
using 3 clusters, we can have a bigger number. Perhaps setting k=10 might lead to more
meaningful clusters.
How to Choose the Right Number of Clusters in K-Means Clustering?

The maximum possible number of clusters will be equal to the number of observations in the
dataset. But then how can we decide the optimum number of clusters? One thing we can do is
plot a graph, also known as an elbow curve, where the x-axis will represent the number of clusters
and the y-axis will be an evaluation metric. Let’s say inertia for now.

Next, we will start with a small cluster value, let’s say 2. Train the model using 2 clusters,
calculate the inertia for that model, and finally plot it in the above graph. Let’s say we got an
inertia value of around 1000:
Now, we will increase the number of clusters, train the model again, and plot the inertia value.
This is the plot we get:

When we changed the cluster value from 2 to 4, the inertia value reduced very sharply. This
decrease in the inertia value slows down and the inertia eventually becomes nearly constant as
we increase the number of clusters further.
The cluster value where the inertia stops decreasing noticeably can be chosen as the
right cluster value for our data.
Here, we can choose any number of clusters between 6 and 10. We can have 7, 8, or even 9
clusters. You must also look at the computation cost while deciding the number of clusters. If
we increase the number of clusters, the computation cost will also increase. So, if you do not
have high computational resources, the advice is to choose a smaller number of clusters.
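The elbow procedure can be sketched by running k-means for several values of k and recording the inertia each time. The sketch below uses a compact 1-D k-means with evenly spaced starting centroids; both the data and the initialization are hypothetical choices for illustration, so the numbers differ from the plot discussed above.

```python
def kmeans_1d(values, k, iterations=100):
    """Tiny 1-D k-means; returns (centroids, clusters)."""
    data = sorted(values)
    # Assumed initialization: evenly spaced points of the sorted data
    centroids = [float(data[i * len(data) // k]) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in data:
            j = min(range(k), key=lambda c: (x - centroids[c]) ** 2)
            clusters[j].append(x)
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

def inertia(centroids, clusters):
    """Sum of squared distances of points to their cluster centroid."""
    return sum((x - c) ** 2 for c, cl in zip(centroids, clusters) for x in cl)

# Hypothetical data with three natural groups
values = [1, 2, 3, 10, 11, 12, 20, 21, 22]
inertias = {k: inertia(*kmeans_1d(values, k)) for k in range(1, 5)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

For this data the inertia drops sharply up to k = 3 and only marginally afterwards, so the elbow suggests three clusters.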

When to use K Medoids Clustering:
Whenever the Euclidean distance does not make sense in the data, we shift to K Medoids from
K Means. K-means attempts to minimize the total squared error, while k-medoids minimizes
the sum of dissimilarities between points labelled to be in a cluster and a point designated as
the centre of that cluster. In contrast to the k-means algorithm, k-medoids chooses datapoints as
centres (medoids or exemplars).
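The difference from a mean can be illustrated with the medoid of a single cluster: the data point whose total dissimilarity to all other points in the cluster is smallest. This sketch covers only that selection step, not a full k-medoids algorithm; the Manhattan distance and the points are hypothetical choices.

```python
def manhattan(p, q):
    """Example dissimilarity; any pairwise measure could be plugged in."""
    return sum(abs(a - b) for a, b in zip(p, q))

def medoid(points, dissimilarity):
    """The data point minimizing the sum of dissimilarities to all points."""
    return min(points, key=lambda c: sum(dissimilarity(c, p) for p in points))

# Hypothetical cluster with one far-away point
cluster = [(1, 1), (2, 2), (4, 4), (10, 10), (2, 3)]
centre = medoid(cluster, manhattan)
mean = tuple(sum(col) / len(cluster) for col in zip(*cluster))
print(centre, mean)
```

The medoid is an actual member of the cluster, while the mean (3.8, 4.0) is not a data point and is pulled towards the distant point.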

FAQs in Data Analytics Interviews

1. What is Data Science? How do Supervised and Unsupervised Machine Learning
differ?
In plain words, Data Science is the study of data. It involves the collection of data from disparate
sources, storing it, cleaning and organizing it, and analysing it to uncover meaningful
information from it. Data Science uses a combination of Mathematics, Statistics, Computer
Science, Machine Learning, Data Visualization, Cluster Analysis, and Data Modelling. It aims
to gain valuable insights from raw data (both structured and unstructured) and use those insights
to influence business and IT strategies positively. Such ideas can help businesses optimize
processes, boost productivity and revenue, streamline marketing strategies, enhance customer
satisfaction, and much more.
Supervised and Unsupervised ML differ from each other in the following respects:
• In supervised ML, the input data is labelled. In unsupervised ML, the input data remains
unlabelled.
• While supervised ML uses a training dataset, unsupervised ML uses the input data set.
• Supervised ML is used for prediction purposes, whereas unsupervised ML is used for
analysis purposes.
• Supervised ML enables classification and regression. However, unsupervised ML enables
clustering, density estimation, and dimension reduction.
2. Python or R, which is better for text analytics?

When it comes to text analytics, Python seems like the most suitable option. This is because it
comes with the Pandas library that includes user-friendly data structures and high-performance
data analysis tools. Also, Python is highly efficient and fast for all kinds of text analytics tasks.
As for R, it is best suited for Machine Learning applications.
3. What are the different classification algorithms?

The pivotal classification algorithms are linear classifiers (logistic regression, Naive Bayes
classifier), decision trees, boosted trees, random forest, SVM, kernel estimation, neural
networks, and nearest neighbour.
4. What is KNN imputation method?

KNN imputation method seeks to impute the values of the missing attributes using those
attribute values that are nearest to the missing attribute values. The similarity between two
attribute values is determined using the distance function.
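A minimal sketch of the idea, assuming numeric features, Euclidean distance as the distance function and the mean of the k nearest complete rows as the imputed value. The function name and the data are hypothetical.

```python
def knn_impute(complete_rows, features, k=2):
    """Estimate a missing value from the k nearest complete rows.

    complete_rows: list of (feature_tuple, value) pairs without missing data.
    features: feature tuple of the row whose value is missing.
    """
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # Rank complete rows by distance on the observed features
    nearest = sorted(complete_rows, key=lambda row: distance(row[0], features))[:k]
    return sum(value for _, value in nearest) / len(nearest)

# Hypothetical rows: one observed feature and the attribute to impute
rows = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
imputed = knn_impute(rows, (2.5,), k=2)
print(imputed)
```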
5. What does data cleansing mean? What are the best ways to practice this?
Data cleansing primarily refers to the process of detecting and removing errors and
inconsistencies from the data to improve data quality.

The best ways to clean data are:

• Segregating data according to their respective attributes.
• Breaking large chunks of data into small datasets and then cleaning them.
• Analysing the statistics of each data column.
• Creating a set of utility functions or scripts for dealing with common cleaning tasks.
• Keeping track of all the data cleansing operations to facilitate easy addition or removal
from the datasets, if required.
6. Name the different data validation methods used by data analysts.

There are many ways to validate datasets. Some of the most commonly used data validation
methods by Data Analysts include:
• Field Level Validation – In this method, data validation is done in each field as and
when a user enters the data. It helps to correct the errors as you go.
• Form Level Validation – In this method, the data is validated after the user completes
the form and submits it. It checks the entire data entry form at once, validates all the
fields in it, and highlights the errors (if any) so that the user can correct them.
• Data Saving Validation – This data validation technique is used during the process of
saving an actual file or database record. Usually, it is done when multiple data entry
forms must be validated.
• Search Criteria Validation – This validation technique is used to offer the user accurate
and related matches for their searched keywords or phrases. The main purpose of this
validation method is to ensure that the user’s search queries can return the most relevant
results.
7. What should a data analyst do with missing or suspected data?

In such a case, a data analyst needs to:
• Use data analysis strategies like the deletion method, single imputation methods, and
model-based methods to handle missing data.
• Prepare a validation report containing all information about the suspected or missing
data.
• Scrutinize the suspicious data to assess their validity.
• Replace all the invalid data (if any) with a proper validation code.
8. What is an N-gram?
An n-gram is a contiguous sequence of n items in a given text or speech. More precisely, an
n-gram model is a probabilistic language model used to predict the next item in a sequence
from the previous (n-1) items.
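A short sketch of both ideas: extracting n-grams from a token list, and using bigram counts (n = 2, so the prediction depends on n-1 = 1 previous word) to predict the next word. The function names and the sample sentence are hypothetical.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """All contiguous sequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def train_bigram_model(tokens):
    """Count, for every word, which words follow it."""
    model = defaultdict(Counter)
    for previous, following in ngrams(tokens, 2):
        model[previous][following] += 1
    return model

def predict_next(model, word):
    """Most frequent follower of the word, or None if the word was never seen."""
    if word not in model:
        return None
    return model[word].most_common(1)[0][0]

tokens = "the cat sat on the mat the cat ran".split()
bigrams = ngrams(tokens, 2)
model = train_bigram_model(tokens)
print(bigrams[:3], predict_next(model, "the"))
```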
9. Define Outlier
An outlier is a term commonly used by data analysts when referring to a value that appears to
be far removed and divergent from a set pattern in a sample. There are two kinds of outliers –
Univariate and Multivariate.

The two methods used for detecting outliers are:

• Box plot method – According to this method, a value is an outlier if it lies more than
1.5*IQR (interquartile range) above the upper quartile (Q3) or more than 1.5*IQR below
the lower quartile (Q1).
• Standard deviation method – This method states that if a value is higher or lower than
mean ± (3*standard deviation), it is an outlier.
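Both rules can be sketched with the standard library. The quartile convention (the "inclusive" method of statistics.quantiles) and the use of the sample standard deviation are assumptions; other conventions give slightly different fences. The data is hypothetical.

```python
import statistics

def iqr_outliers(data):
    """Values more than 1.5*IQR above Q3 or below Q1."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lower or x > upper]

def sd_outliers(data):
    """Values outside mean ± 3 standard deviations."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return [x for x in data if abs(x - mean) > 3 * sd]

# Hypothetical sample: the values 1..19 plus one extreme value
data = list(range(1, 20)) + [100]
print(iqr_outliers(data), sd_outliers(data))
```

With very small samples the standard deviation rule can miss an extreme value, because that value inflates the standard deviation itself.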
10. What is “Clustering?” Name the properties of clustering algorithms.

Clustering is a method in which data is classified into clusters and groups. A clustering
algorithm has the following properties:
• Hierarchical or flat
• Hard and soft
• Iterative
• Disjunctive
11. Define “Collaborative Filtering”.

Collaborative filtering is an algorithm that creates a recommendation system based on the
behavioural data of a user. For instance, online shopping sites usually compile a list of items
under “recommended for you” based on your browsing history and previous purchases. The
crucial components of this algorithm include users, objects, and their interest.
12. Name the statistical methods that are highly beneficial for data analysts?
The statistical methods that are mostly used by data analysts are:
• Bayesian method
• Markov process
• Simplex algorithm
• Imputation
• Spatial and cluster processes
• Rank statistics, percentile, outliers detection
• Mathematical optimization
13. What is a hash table collision? How can it be prevented?

When two separate keys hash to a common value, a hash table collision occurs. This means that
two different data items compete for the same slot and cannot both be stored in it directly.
Hash collisions can be handled by:
• Separate chaining – In this method, a data structure is used to store multiple items
hashing to a common slot.
• Open addressing – This method seeks out empty slots and stores the item in the first
empty slot available.
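Separate chaining can be sketched with a list of buckets, each holding the (key, value) pairs that hash to the same slot. The class name and the slot count are hypothetical; Python's built-in dict already handles collisions internally, so this is only an illustration.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by chaining items in a list per slot."""

    def __init__(self, slots=8):
        self.buckets = [[] for _ in range(slots)]

    def _slot(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._slot(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))       # colliding keys share the bucket

    def get(self, key):
        for existing_key, value in self.buckets[self._slot(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)

table = ChainedHashTable(slots=8)
table.put(1, "first")
table.put(9, "second")  # 1 and 9 both map to slot 1 when there are 8 slots
print(table.get(1), table.get(9), len(table.buckets[1]))
```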

14. How should you tackle multi-source problems?

To tackle multi-source problems, you need to:
• Identify similar data records and combine them into one record that will contain all the
useful attributes, minus the redundancy.
• Facilitate schema integration through schema restructuring.
15. What are the characteristics of a good data model?

For a data model to be considered as good and developed, it must depict the following
characteristics:
• It should have predictable performance so that the outcomes can be estimated
accurately, or at least, with near accuracy.
• It should be adaptive and responsive to changes so that it can accommodate the growing
business needs from time to time.
• It should be capable of scaling in proportion to the changes in data.
• It should be consumable to allow clients/customers to reap tangible and profitable
results.
16. Differentiate between variance and covariance.

Variance and covariance are both statistical terms. Variance depicts how far the values of a
single quantity are spread around its mean value. So, it tells you only the magnitude of the
spread (how much the data is spread around the mean). On the
contrary, covariance depicts how two random variables will change together. Thus, covariance
gives both the direction and magnitude of how two quantities vary with respect to each other.
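A small numeric sketch of the difference, using the sample (n - 1) versions of both statistics. The function names and the data are hypothetical.

```python
def variance(x):
    """Sample variance: spread of one variable around its own mean."""
    mean = sum(x) / len(x)
    return sum((a - mean) ** 2 for a in x) / (len(x) - 1)

def covariance(x, y):
    """Sample covariance: how two variables move together."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]   # moves with x
falling = [10, 8, 6, 4, 2]  # moves against x
print(variance(x), covariance(x, rising), covariance(x, falling))
```

The variance is never negative, while the sign of the covariance shows whether the two variables move in the same or in opposite directions.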
17. Explain univariate, bivariate, and multivariate analysis.

Univariate analysis refers to a descriptive statistical technique that is applied to datasets
containing a single variable. The univariate analysis considers the range of values and also the
central tendency of the values.
Bivariate analysis simultaneously analyzes two variables to explore the possibilities of an
empirical relationship between them. It tries to determine if there is an association between the
two variables and the strength of the association, or if there are any differences between the
variables and what is the importance of these differences.
Multivariate analysis is an extension of bivariate analysis. Based on the principles of
multivariate statistics, the multivariate analysis observes and analyzes multiple variables (two
or more independent variables) simultaneously to predict the value of a dependent variable for
the individual subjects.
18. Explain the difference between R-Squared and Adjusted R-Squared.

R-Squared is a statistical measure of the proportion of variation in the dependent
variable that is explained by the independent variables. The Adjusted R-Squared is essentially a
modified version of R-squared, adjusted for the number of predictors in the model. It provides
the percentage of variation explained by only those independent variables that have a direct
impact on the dependent variable.
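The two measures can be compared numerically. The sketch uses R² = 1 - (RSS / TSS) and the commonly used adjustment 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. The function names and the data are hypothetical.

```python
def r_squared(actual, predicted):
    """R² = 1 - RSS / TSS."""
    mean = sum(actual) / len(actual)
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    tss = sum((a - mean) ** 2 for a in actual)
    return 1 - rss / tss

def adjusted_r_squared(r2, n, p):
    """Penalize R² for the number of predictors p given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical observed values and model predictions
actual = [3, 5, 7, 9, 11]
predicted = [2.8, 5.1, 7.2, 8.9, 11.0]
r2 = r_squared(actual, predicted)
adj_one = adjusted_r_squared(r2, n=5, p=1)
adj_three = adjusted_r_squared(r2, n=5, p=3)
print(r2, adj_one, adj_three)
```

For the same R², the adjusted value drops as more predictors are counted, which is why it is preferred when comparing models with different numbers of predictors.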

19. What are the advantages of version control?

The main advantages of version control are:
• It allows you to compare files, identify differences, and consolidate the changes
seamlessly.
• It helps to keep track of application builds by identifying which version is under which
category – development, testing, QA, and production.
• It maintains a complete history of project files that comes in handy if ever there’s a
central server breakdown.
• It is excellent for storing and maintaining multiple versions and variants of code files
securely.
• It allows you to see the changes made in the content of different files.
20. What are the problems that a Data Analyst can encounter while performing data
analysis?
A Data Analyst can confront the following issues while performing data analysis:
• Presence of duplicate entries and spelling mistakes. These errors can hamper data
quality.
• Poor quality data acquired from unreliable sources. In such a case, a Data Analyst will
have to spend a significant amount of time in cleansing the data.
• Data extracted from multiple sources may vary in representation. Once the collected
data is combined after being cleansed and organized, the variations in data
representation may cause a delay in the analysis process.
• Incomplete data is another major challenge in the data analysis process. It would
inevitably lead to erroneous or faulty results.
For further study of similar questions, you can follow these links:
• Careers360
• Naukri.com
• Simplilearn
• Edureka

Source:
https://www.graduatetutor.com/statistics-tutor/interpreting-regression-output/
https://www.analyticsvidhya.com/blog/2016/07/deeper-regression-analysis-assumptions-plots-solutions/
https://searchenterpriseai.techtarget.com/definition/supervised-learning
https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/
https://towardsai.net/p/data-science/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff
https://medium.com/fintechexplained/did-you-know-the-importance-of-finding-correlations-in-data-science-1fa3943debc2
https://towardsdatascience.com/getting-the-basics-of-correlation-covariance-c8fc110b90b4
https://towardsdatascience.com/latent-dirichlet-allocation-15800c852699
