Analytics Compendium
A HANDBOOK
ON
Analytics
Compiled By:
Table of Contents
1. Supervised learning
2. Regression
3. Simple Linear Regression
4. Multiple Linear Regression
5. Polynomial Regression
6. Components of the Regression Equation
7. KNN Regression
8. Standardization vs Normalization
9. Exploratory Data Analysis (EDA)
10. Classification
11. Classification Model Evaluation
12. Feature Selection
13. Unsupervised Learning
14. K MEANS Clustering
15. FAQs in Data Analytics Interviews
Cheat Sheet
Supervised learning
Supervised Learning is the process of training an algorithm to map an input to a particular
output. This is achieved using the labelled datasets that you have collected. If the learned mapping
is correct, the algorithm has learned successfully. Otherwise, you make the necessary changes to the
algorithm so that it can learn correctly. Supervised Learning algorithms can then make predictions
for new, unseen data obtained in the future.
Supervised Learning is important because:
Learning gives the algorithm experience, which can be used to output predictions for
new unseen data
Experience also helps in optimizing the performance of the algorithm
Real-world computations can also be taken care of by Supervised Learning algorithms
Classification and regression are the two basic types of supervised learning. Both follow the
same basic concept: train the model on a known dataset so that it can predict the outcome.
Regression
Unlike classification, a regression model is trained to predict a continuous numerical value
as output based on the input variables.
The algorithm maps the input data (x) to continuous numerical data (y).
There are several kinds of regression algorithms, such as linear regression, polynomial regression,
quantile regression and lasso regression. Linear regression is the simplest method of regression.
Regression Line
The standard equation of the regression line is given
by the following expression: Y = β₀ + β₁.X
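As an illustration, β₀ and β₁ can be estimated from data by ordinary least squares. The sketch below assumes the usual closed-form estimates (slope = co-variation of X and Y divided by the variation of X). The function name fit_line and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (beta0, beta1) for Y = beta0 + beta1 * X by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


# Hypothetical data: marketing budget (X) and sales (Y).
budget = [10.0, 20.0, 30.0, 40.0, 50.0]
sales = [25.0, 41.0, 62.0, 79.0, 103.0]

beta0, beta1 = fit_line(budget, sales)
print(f"Y = {beta0:.2f} + {beta1:.2f} * X")
```

The fitted line always passes through the point (mean of X, mean of Y), which is why the intercept is computed from the two means.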
R2 is a number which tells what proportion of the variation in the given data is explained by the
developed model. For a least-squares linear model with an intercept, evaluated on the data it was
fitted to, it takes a value between 0 and 1. In general terms, it provides a measure of how well
actual outcomes are replicated by the model, based on the proportion of total variation of outcomes
explained by the model. Overall, the higher the R-squared, the better the model fits your data.
RSS (Residual Sum of Squares): In statistics, it is defined as the sum of the squared errors across
the whole sample. It is the measure of the difference between the predicted and the actual output. A
small RSS indicates a tight fit of the model to the data. It is defined as follows:
RSS = Σ (yᵢ − ŷᵢ)²
where yᵢ is the actual value and ŷᵢ is the predicted value for observation i.
TSS (Total sum of squares): It is the sum of the squared deviations of the data points from the mean
of the response variable. Mathematically, TSS is:
TSS = Σ (yᵢ − ȳ)²
where ȳ is the mean of the response variable. R2 is then computed as R2 = 1 − RSS / TSS.
Physical Significance of R2
In Graph 1: All the points lie on the line and the R2 value is a perfect 1
In Graph 2: Some points deviate from the line and the error is represented by the lower R2 value of
0.70
In Graph 3: The deviation further increases and the R2 value further goes down to 0.36
In Graph 4: The deviation is further higher with a very low R2 value of 0.05
2. Adjusted R2
Adjusted R-Squared is used only when analysing multiple regression output and is ignored when
analysing simple linear regression output. When we have more than one independent variable in
our analysis, the R-squared is inflated: it never decreases when a variable is added, even if that
variable is irrelevant. As the name indicates, the Adjusted R-Squared is the R-Squared adjusted for
this inflation when performing multiple regression.
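The quantities above can be computed in a few lines. This is a minimal sketch that assumes the standard definitions R2 = 1 − RSS/TSS and Adjusted R2 = 1 − (1 − R2)(n − 1)/(n − k − 1), where n is the number of observations and k the number of independent variables. The data and function names are hypothetical.

```python
def r_squared(actual, predicted):
    """R2 = 1 - RSS / TSS."""
    mean_y = sum(actual) / len(actual)
    rss = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    tss = sum((y - mean_y) ** 2 for y in actual)
    return 1 - rss / tss


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalise R2 for the number of independent variables used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical actual and predicted values, for illustration only.
actual = [3.0, 5.0, 7.0, 9.0, 11.0]
predicted = [2.8, 5.3, 6.9, 9.4, 10.6]

r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n_samples=len(actual), n_predictors=2)
print(round(r2, 4), round(adj_r2, 4))
```

With more than one predictor the adjusted value is always below the plain R2, and the gap widens as predictors are added without a matching improvement in fit.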
3. Significance F
Significance F is the p-value of the F-test for the regression as a whole. It is the probability
of obtaining an F statistic at least as large as the observed one if none of the independent
variables had any real relationship with the dependent variable. Loosely, it can be thought of as
the risk that the model explains nothing more than chance, in which case the model needs to be
discarded. We want the significance F to be as small as possible.
The objective of multiple regression is to find a linear equation that best determines the value of
the dependent variable Y for different values of the independent variables in X.
Consider an example of sales prediction using marketing budgets. In a real-life scenario, the
marketing head would want to look into the dependency of sales on the budget allocated to different
marketing sources. Here, we consider three different marketing sources, i.e. TV marketing,
Radio marketing, and Newspaper marketing.
The equation of multiple linear regression would be: Y = β0 + β1.X1 + β2.X2 + ... + βk.Xk
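To make the equation concrete, the sketch below evaluates it for the three marketing sources. The intercept and coefficient values are hypothetical placeholders, not estimates from any dataset. In practice they would come from fitting the model.

```python
def predict_sales(budgets, intercept, coefficients):
    """Evaluate Y = b0 + b1*X1 + b2*X2 + ... + bk*Xk for one observation."""
    return intercept + sum(b * x for b, x in zip(coefficients, budgets))


# Hypothetical coefficients for (TV, Radio, Newspaper) budgets.
intercept = 3.0
coefficients = (0.05, 0.2, 0.0)

# Budgets allocated to TV, Radio and Newspaper marketing.
budgets = (100.0, 20.0, 10.0)
print(predict_sales(budgets, intercept, coefficients))
```

A coefficient of 0.0, as used for Newspaper here, means the predicted sales do not change with that budget when the other budgets are held fixed.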
Polynomial Regression
One of the key assumptions of Linear Regression is that there has to be a linear relation between
dependent and independent variables. In most practical scenarios, this assumption might not be
valid. How do we then implement regression, when this condition is not valid?
We use Polynomial Regression in such cases. This kind of regression assumes that there exists a
non-linear relation between the independent feature variable(s) X and the dependent target variable
Y. By fitting the data better, this model reduces the error in estimation and increases accuracy,
at the cost of making the regression equation non-linear in terms of X. Below is a figure
comparing the Polynomial Regression line fit against the Linear Regression line fit on sample
non-linear data.
Because the relationship is assumed to be non-linear, the regression equation differs from that of
linear regression. We now include terms of X raised to successive powers. The general polynomial
regression equation of degree n for a single independent variable is:
Y = β0 + β1.X + β2.X² + ... + βn.Xⁿ
The degree of the polynomial is a hyperparameter, and we need to choose it wisely. A high degree
tends to overfit the data, while a low degree tends to underfit, so we need to find the optimum
value of the degree. The model equation can also be expressed in matrix form as Y = Xβ + ε, where
each row of the matrix X holds the powers 1, x, x², ..., xⁿ of one observation.
For calculating goodness of fit and accuracy we use the same metrics that we used in the linear
regression scenario.
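The effect of the degree can be seen in a small experiment. The sketch below fits polynomials by least squares using the normal equations, solved with a hand-written Gaussian elimination. This is one possible approach chosen to avoid external libraries. The data are hypothetical and follow Y = X² + 1 exactly.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [b0, b1, ..., b_degree]."""
    k = degree + 1
    rows = [[x ** p for p in range(k)] for x in xs]  # design matrix
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coeffs, x):
    return sum(b * x ** p for p, b in enumerate(coeffs))


def rss(coeffs, xs, ys):
    return sum((y - predict(coeffs, x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical non-linear data: Y = X^2 + 1.
xs = [-2, -1, 0, 1, 2]
ys = [5, 2, 1, 2, 5]

for degree in (1, 2):
    coeffs = fit_polynomial(xs, ys, degree)
    print(degree, [round(c, 3) for c in coeffs], round(rss(coeffs, xs, ys), 3))
```

On this data the straight line leaves a large residual sum of squares, while the degree-2 fit recovers the generating curve. The normal equations become numerically fragile for high degrees, which is one more reason to keep the degree modest.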
p-values
The P value indicates how plausible it is that an estimated coefficient reflects chance rather than
a real effect. A loose way to think of the P value is as the “probability of an error”. We want the
P value to be as small as possible.
How small the P value should be depends on a cut-off level that we decide on separately (also
called the significance level). The cut-off selected depends on the nature of the data studied and the
different error types. The cut-off or significance level is usually 1%, 5% or 10%.
Generally, a cut-off point of 5% is used.
Statistically speaking, the P value is the probability of obtaining a result as or more extreme than
the one you got if the true coefficient were zero. In other words, a small P value means the data
would be unlikely if the independent variable had no effect, which is evidence that
the coefficient in our regression output is not actually zero.
Note that the P value is similar in interpretation to the significance F discussed earlier. The key
difference is that the P value applies to each corresponding coefficient and the significance F
applies to the entire model as a whole.
KNN Regression
The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine
learning algorithm that can be used to solve both classification and regression problems. The
KNN algorithm assumes that similar things exist in close proximity. In other words, similar
things are near to each other. The KNN algorithm then uses ‘feature similarity’ to predict the
values of any new data points.
Let’s assume that age, height and weight data is available for a set of people. We don’t
have the weight for person ID11. We wish to find it out using KNN Regression. Let’s assume
that K is 3.
The data has been plotted for persons ID1 to ID10. We see that person ID11 lies close to ID6,
ID5 and ID1. We will limit the neighborhood to only these 3 points, as we chose K=3. Now that we
have the neighborhood, we take the average value of weight for the neighborhood. The average of
these points is (60+72+77)/3 Kg = 69.67 Kg. This will be the regressor output of the algorithm,
i.e. the predicted weight of the new person ID11 whose weight we did not know.
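The procedure can be sketched as follows. The (age, height) values below are hypothetical stand-ins, since the original data table is not reproduced here. Only the three neighbour weights 60, 72 and 77 are taken from the example. Euclidean distance is assumed as the similarity measure.

```python
import math


def knn_regress(train_features, train_targets, query, k=3):
    """Predict the target of `query` as the mean target of its k nearest neighbours."""
    distances = sorted(
        (math.dist(features, query), target)
        for features, target in zip(train_features, train_targets)
    )
    nearest = distances[:k]
    return sum(target for _, target in nearest) / k


# Hypothetical (age in years, height in cm) for five people and their weights in Kg.
features = [(25, 160), (30, 170), (35, 175), (50, 180), (60, 150)]
weights = [60, 72, 77, 90, 55]

# New person whose weight is unknown.
new_person = (31, 171)
print(round(knn_regress(features, weights, new_person, k=3), 2))
```

Because age and height are on different scales, the feature with the larger numeric range dominates the distance. Scaling the features first is usually advisable.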
Data Preprocessing
Data preparation is the process of cleaning and transforming raw data prior to processing and
analysis. It is an important step prior to processing and often involves reformatting data, making
corrections to data and the combining of data sets to enrich data.
Data preparation is often a lengthy undertaking for data professionals or business users, but it is
essential as a prerequisite to put data in context in order to turn it into insights and eliminate bias
resulting from poor data quality.
For example, the data preparation process usually includes standardizing data formats, enriching
source data, and/or removing outliers.
b. Calculating the mean – This method is useful for features having numeric data like age, salary,
year, etc. Here, you can calculate the mean, median, or mode of the feature (column) that contains
a missing value and substitute the result for the missing value. Unlike omission, this method
does not throw away the other values in the affected rows, so the loss of data is largely avoided.
Hence, it generally yields better results compared to the first method (omission of rows/columns).
Another way of approximation is through the deviation of neighboring values. However, this works
best for linear data.
7. Feature scaling
Feature scaling marks the end of the data preprocessing in Machine Learning. It is a method to
standardize the independent variables of a dataset within a specific range. In other words, feature
scaling limits the range of variables so that you can compare them on common grounds.
1. Standardization
2. Normalization
Standardization vs Normalization
Normalization
It refers to rescaling the data so that all the elements of a variable lie between 0 and 1, thus
bringing all the values of numeric columns in the dataset to a common scale.
The goal of normalization is to change the values of numeric columns in the dataset to a common
scale, without distorting differences in the ranges of values. For machine learning, not every
dataset requires normalization. It is required only when features have different ranges.
Normalization is a good technique to use when you do not know the distribution of your data or
when you know the distribution is not Gaussian (a bell curve). Normalization is useful when your
data has varying scales and the algorithm you are using does not make assumptions about the
distribution of your data, such as k-nearest neighbours and artificial neural networks.
X' = (X − Xmin) / (Xmax − Xmin)
Here, Xmax and Xmin are the maximum and the minimum values of the feature respectively.
Standardization
Standardization typically means rescaling data to have a mean of 0 and a standard deviation of 1
(unit variance).
Standardizing the features so that they are centred at 0 with a standard deviation of 1 is
important when we compare measurements that have different units. Variables that are measured at
different scales do not contribute equally to the analysis and might end up creating a bias.
Standardization is useful when your data has varying scales and the algorithm you are using does
make assumptions about your data having a Gaussian distribution, such as linear regression,
logistic regression, and linear discriminant analysis.
X' = (X − μ) / σ
Here, μ is the mean of the feature values and σ is the standard deviation of the feature values.
Note that in this case, the values are not restricted to a particular range.
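Both techniques can be compared on the same feature. The sketch below assumes the usual min-max formula (X − Xmin)/(Xmax − Xmin) for normalization and (X − mean)/standard deviation for standardization, with the population standard deviation. The ages are hypothetical.

```python
import statistics


def min_max_normalize(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]


# Hypothetical feature: ages in years.
ages = [20, 30, 40, 50, 60]

print(min_max_normalize(ages))
print([round(z, 3) for z in standardize(ages)])
```

The normalized values are bounded by 0 and 1, whereas the standardized values are unbounded and can be negative.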
Correlation is a statistical technique which determines how one variable moves or changes in
relation to another variable. It gives us an idea of the degree of the relationship between the two
variables. It is a bivariate measure which describes the association between variables. In
most businesses it is useful to express one quantity in terms of its relationship with others.
For example: Sales might increase if a lot of money is spent on product marketing.
Two features (variables) can be positively correlated with each other. It means that when the value
of one variable increases, the value of the other variable also increases.
Two features (variables) can be negatively correlated with each other. This occurs when the value
of one variable increases and the value of the other variable decreases.
Two features might not have any relationship with each other. This happens when changing the value
of one variable does not affect the value of the other variable.
The Pearson correlation coefficient captures the strength and direction of the linear association
between two continuous variables. It tries to draw the line of best fit through the data points of
the two variables and indicates how far these data points are from that line. The relationship
is linear only when the change in one variable is proportional to the change in the other variable.
The Spearman rank correlation coefficient determines the strength and direction of the monotonic
relationship which exists between two ordinal or continuous variables. In a monotonic relationship
the two variables tend to change together, but not necessarily at a constant rate. It is calculated
on the ranked values of the variables rather than on the raw data.
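The difference between a linear and a monotonic measure shows up on data that rise steadily but not at a constant rate. The sketch below computes a Pearson coefficient on the raw values and a rank-based (Spearman-style) coefficient on the ranks. The ranking helper assumes there are no tied values, and the data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical monotonic but non-linear relationship: y = x^3.
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]

print(round(pearson(xs, ys), 3), round(spearman(xs, ys), 3))
```

The rank-based coefficient reaches 1 because the ordering of the two variables agrees perfectly, while the Pearson coefficient stays below 1 because the points do not lie on a straight line.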
Classification
Classification is a supervised machine learning technique which attempts to predict and assign
output data to categories after observing the values of independent variables. A classification
problem is one where the output variable is a category, such as “red” or “blue”, or “disease” and
“no disease”. Examples include filtering emails into “spam” or “not spam”, or deciding whether a
transaction is “fraudulent” or “authorized”. In short, classification builds a model from the
training set and its class labels, and uses that model to predict categorical class labels for
new data. Classification models include logistic regression, decision tree, random forest,
gradient-boosted tree, multilayer perceptron, one-vs-rest, and Naive Bayes. Here, we will look in
detail at one of the simplest and most widely used classification models: Logistic Regression.
Output data can be binary, if it has only two states, or it can have multiple states. In its basic
form, logistic regression is a binary classification model.
Logistic Regression
Logistic regression is a classification model which uses the logistic function to make predictions
for labelled categorical outputs, as opposed to linear regression, where the output is a continuous
variable. It can be of 3 types: 1) Binary 2) Multinomial 3) Ordinal. In binary classification, the
output variable has only two values or classes. Some examples are:
1. A bank wishes to know whether a customer will default or not, based on a model learnt
from the data of previous customers
2. An online site wants to know whether a customer will churn or not, based on the data of
other customers recorded in the past
Uni-variate Analysis:
Consider a dependent variable Y that takes only the values 0 and 1 and depends only on an
independent variable X.
Now consider Model A (shown in the figure), a single cut-off line: any value of X before the line
yields Y=0 and any value after the line yields Y=1. But in this case, for values of X between the
dotted lines, the predicted values will differ from the actual values, causing the model to fail.
Hence, a stringent cut-off model is not a good model.
What is the better model?
Sigmoid Curve:
A better approach is to use a model which yields a probability, as shown in Model B, using the
sigmoid function given by:
P = 1 / (1 + e^-(B0 + B1.X))
This function assigns a non-zero probability to all values of X, including the region between the
dotted lines.
Now, the question is how to decide whether the values provided by the sigmoid function are
optimized for the best possible modelling of the training data.
Since the function depends on B0 and B1, the task at hand is to optimize these two.
Likelihood Function:
The likelihood function is the function which helps in finding the values of B0 and B1 for the
optimized model.
Let’s say there are 10 points on the X axis and arbitrary B0 and B1 values are chosen, as shown in
the curve below:
Let us assume the red points are the actual values and the corresponding forecast points on the
sigmoid curve are at a vertical distance Pi from the X axis. For points whose actual Y value is 1,
take Pi. For points whose Y values actually lie on the X axis (Y = 0), take 1-Pi. In this case, the
likelihood function is the product of these terms over all the points:
Likelihood = Π (Pi for points with Y = 1) × Π (1-Pi for points with Y = 0)
The idea is to vary the values of B0 and B1 such that the likelihood function is maximized. This
process is done by libraries using complex numerical algorithms.
Assume the optimized B0 and B1 are obtained. The sigmoid equation can now be rewritten as:
ln(P / (1-P)) = B0 + B1.X
Here, P/(1-P) is known as the Odds and ln(Odds) is known as the Log Odds. Comparing this equation
with a simple regression line, Y = B0 + B1.X, the right-hand side is identical and the log odds on
the left-hand side take the place of Y. Hence, it is known as logistic regression.
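The relationship between the sigmoid, the likelihood and the log odds can be checked numerically. This is a minimal sketch assuming the sigmoid form P = 1 / (1 + e^-(B0 + B1.x)). The data points and the two candidate (B0, B1) pairs are hypothetical, and no optimization is performed. It only compares two candidate settings.

```python
import math


def sigmoid(x, b0, b1):
    """Probability that Y = 1 for input x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def likelihood(xs, ys, b0, b1):
    """Product of P for points with Y = 1 and (1 - P) for points with Y = 0."""
    result = 1.0
    for x, y in zip(xs, ys):
        p = sigmoid(x, b0, b1)
        result *= p if y == 1 else (1.0 - p)
    return result


def log_odds(p):
    return math.log(p / (1.0 - p))


# Hypothetical data: Y switches from 0 to 1 as X grows.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]

# Two hypothetical candidate parameter settings.
print(likelihood(xs, ys, b0=0.0, b1=0.0))    # flat curve, P = 0.5 everywhere
print(likelihood(xs, ys, b0=-7.0, b1=2.0))   # curve rising between X = 3 and X = 4

# The log odds recover the linear part B0 + B1 * x.
print(round(log_odds(sigmoid(3.0, -1.0, 0.5)), 6))
```

The second setting gives a much larger likelihood because its curve places high probability on the points that are actually 1 and low probability on the points that are actually 0. Libraries usually maximize the logarithm of the likelihood instead of the raw product, since the product of many probabilities quickly becomes very small.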
Generating Output
The output P is a probability, a continuous variable with a value between 0 and 1, but the output
of a classification model is a binary variable. How are the two related? The next step is to
select a cut-off probability below which the output is considered 0 and above which it is 1.
Initially any arbitrary value can be used as the cut-off, and model evaluation techniques are then
used to find the optimum value.
Classification tree:
Classification tree methods (i.e., decision tree methods) are recommended when the data mining task
contains classifications or predictions of outcomes, and the goal is to generate rules that can be easily
explained and translated into SQL or a natural query language.
A Classification tree labels, records, and assigns variables to discrete classes. A Classification tree can also
provide a measure of confidence that the classification is correct. A Classification tree is built through a
process known as binary recursive partitioning. This is an iterative process of splitting the data into
partitions, and then splitting it up further on each of the branches.
The process starts with a Training Set consisting of pre-classified records (target field or dependent variable
with a known class or label such as purchaser or non-purchaser). The goal is to build a tree that distinguishes
among the classes.
Bagging:
Bagging is an ensemble algorithm that fits multiple models on different subsets of a training dataset,
then combines the predictions from all models. It is highly effective for non-skewed class
distribution classification.
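In 1996, Leo Breiman introduced the bagging algorithm, which has three basic steps: bootstrapping
(drawing random samples of the training data with replacement), parallel training (fitting a model
independently on each sample), and aggregation (combining the predictions by voting or averaging).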
Random forest, like its name implies, consists of a large number of individual decision trees that
operate as an ensemble. Each individual tree in the random forest spits out a class prediction and
the class with the most votes becomes our model’s prediction.
The fundamental concept behind random forest is a simple but powerful one — the wisdom of
crowds. In data science speak, the reason that the random forest model works so well is: A large
number of relatively uncorrelated models (trees) operating as a committee will outperform any of
the individual constituent models. While some trees may be wrong, many other trees will be right,
so as a group the trees are able to move in the correct direction. So, the prerequisites for random
forest to perform well are:
There needs to be some actual signal in our features so that models built using those features
do better than random guessing.
The predictions (and therefore the errors) made by the individual trees need to have low
correlations with each other.
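The "wisdom of crowds" argument can be quantified under idealized assumptions. The sketch below assumes a two-class problem, an odd number of trees, and trees whose errors are fully independent, each correct with the same probability. Real trees are never fully independent, so the numbers are an upper bound on what to expect. Function names are hypothetical.

```python
from collections import Counter
from math import comb


def majority_vote(predictions):
    """Return the class predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]


def majority_accuracy(n_trees, p_correct):
    """Probability that more than half of n independent trees are correct."""
    needed = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_correct ** k * (1 - p_correct) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )


print(majority_vote(["spam", "not spam", "spam"]))

# Each tree is only slightly better than random guessing (60% correct).
for n_trees in (1, 3, 11, 101):
    print(n_trees, round(majority_accuracy(n_trees, 0.6), 3))
```

The committee accuracy grows with the number of trees only because each tree beats random guessing and the errors are uncorrelated, which are exactly the two prerequisites listed above.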
Boosting:
Boosting is a general ensemble method that creates a strong classifier from a number of weak
classifiers. This is done by building a model from the training data, then creating a second model
that attempts to correct the errors from the first model. Models are added until the training set is
predicted perfectly or a maximum number of models are added.
In AdaBoost, a weak classifier (decision stump) is prepared on the training data using the weighted
samples. Only binary (two-class) classification problems are supported, so each decision stump makes
one decision on one input variable and outputs a +1.0 or -1.0 value for the first or second class
value.
The misclassification rate is calculated for the trained model. Traditionally, this is calculated as:
error = (N - correct) / N
Where error is the misclassification rate, correct is the number of training instances predicted
correctly by the model and N is the total number of training instances. For example, if the model
predicted 78 of 100 training instances correctly, the error or misclassification rate would be
(100-78)/100 or 0.22.
This is modified to use the weighting of the training instances:
error = sum(w(i) * terror(i)) / sum(w)
Which is the weighted sum of the misclassification rate, where w is the weight for training instance
i and terror is the prediction error for training instance i which is 1 if misclassified and 0 if correctly
classified.
A stage value is calculated for the trained model which provides a weighting for any predictions
that the model makes. The stage value for a trained model is calculated as follows:
stage = ln((1-error) / error)
Where stage is the stage value used to weight predictions from the model, ln() is the natural
logarithm and error is the misclassification error for the model. The effect of the stage weight is
that more accurate models have more weight or contribution to the final prediction.
The training weights are then updated so that incorrectly predicted instances gain weight relative
to correctly predicted instances. The weight of one training instance (w) is updated using:
w = w * exp(stage * terror)
This has the effect of not changing the weight if the training instance was classified correctly
(terror = 0) and making the weight larger if the weak learner misclassified the instance
(terror = 1).
The most common algorithms used in boosting are AdaBoost and XGBoost.
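One round of the weighting scheme described above can be traced numerically. The sketch assumes the common AdaBoost update w = w * exp(stage * terror), followed by a normalization step so that the weights sum to 1. The normalization is an assumption added here for illustration. The five training instances are hypothetical.

```python
import math


def weighted_error(weights, terror):
    """error = sum(w(i) * terror(i)) / sum(w)."""
    return sum(w * t for w, t in zip(weights, terror)) / sum(weights)


def stage_value(error):
    """stage = ln((1 - error) / error)."""
    return math.log((1 - error) / error)


def update_weights(weights, terror, stage):
    """Increase the weight of misclassified instances, then normalize (assumed step)."""
    updated = [w * math.exp(stage * t) for w, t in zip(weights, terror)]
    total = sum(updated)
    return [w / total for w in updated]


# Five hypothetical training instances with equal starting weights.
weights = [0.2, 0.2, 0.2, 0.2, 0.2]
# terror is 1 if the weak learner misclassified the instance, else 0.
terror = [0, 0, 1, 0, 0]

error = weighted_error(weights, terror)
stage = stage_value(error)
new_weights = update_weights(weights, terror, stage)
print(round(error, 3), round(stage, 3), [round(w, 3) for w in new_weights])
```

After the update the single misclassified instance carries as much weight as the four correct ones together, so the next weak learner is pushed to get it right.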
A confusion matrix is a matrix as shown above which has counts of True Positives (values which are
forecasted positive and are actually positive), False Negatives (values which are actually positive
but are forecasted negative), False Positives (values which are actually negative but are forecasted
positive) and True Negatives (values which are actually negative and are forecasted negative as
well).
Accuracy:
Accuracy is the ratio of correctly predicted observations to the total number of observations:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Note: The issue with accuracy arises when class imbalance exists in the output data, i.e. the
number of positive samples is very small. In that case accuracy will be very high simply because
the model predicts most negatives as negatives. The negatives dominate the accuracy while the
positives barely contribute, which biases the model towards negative outputs.
Sensitivity:
To resolve the above issue, two more terms are introduced:
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Sensitivity is also known as the True Positive Rate. One more important evaluation parameter is
the False Positive Rate, which is nothing but 1-Specificity.
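The class-imbalance problem with accuracy can be demonstrated on a small sample. The sketch below assumes the usual definitions accuracy = (TP + TN) / total, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The labels and forecast probabilities are hypothetical.

```python
def confusion_counts(actual, probabilities, cutoff=0.5):
    """Return (TP, FN, FP, TN) after applying the cut-off to the probabilities."""
    tp = fn = fp = tn = 0
    for y, p in zip(actual, probabilities):
        predicted = 1 if p >= cutoff else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def evaluate(tp, fn, fp, tn):
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical imbalanced data: 2 positives, 8 negatives.
actual = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
probabilities = [0.9, 0.3, 0.6, 0.2, 0.1, 0.4, 0.2, 0.1, 0.3, 0.05]

print(evaluate(*confusion_counts(actual, probabilities, cutoff=0.5)))

# A useless model that forecasts "negative" for everything.
always_negative = [0.0] * len(actual)
print(evaluate(*confusion_counts(actual, always_negative, cutoff=0.5)))
```

Both models reach the same accuracy, but the second one has a sensitivity of zero: it never finds a positive. This is why sensitivity and specificity are reported alongside accuracy.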
The ROC curve shows the trade-off between TPR and FPR as the cut-off is varied from 0.0 to
1.0. This trade-off occurs for all models.
Significance:
1. The dotted line is obtained when TPR and FPR are equal for all values of cut-off (sensitivity
equals 1-specificity), and it is treated as the worst model
2. The more the curve is towards the top left, the larger the area between the curve and the dotted
line and the better the model
3. The area under the curve is known as AUC
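An ROC curve and its AUC can be traced by hand for a small sample. The sketch below sweeps the cut-off in steps of 0.1, records one (FPR, TPR) point per cut-off, and estimates the area with the trapezoidal rule. The step size, labels and forecast probabilities are hypothetical choices for illustration.

```python
def roc_points(actual, probabilities, cutoffs):
    """Return sorted (FPR, TPR) points, one per cut-off."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = []
    for cutoff in cutoffs:
        tp = sum(1 for y, p in zip(actual, probabilities) if y == 1 and p >= cutoff)
        fp = sum(1 for y, p in zip(actual, probabilities) if y == 0 and p >= cutoff)
        points.append((fp / negatives, tp / positives))
    return sorted(points)


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


cutoffs = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0

# Hypothetical labels and forecast probabilities.
actual = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8]

points = roc_points(actual, probabilities, cutoffs)
print(points)
print(auc(points))
```

A model that ranks every positive above every negative gives an area of 1, while a model on the dotted diagonal gives 0.5.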
Generally, the optimum cut-off value is the one corresponding to the point where the accuracy,
specificity and sensitivity curves intersect, which is 0.3 in the diagram above.
Recall:
Recall is also known as Sensitivity or TPR (True positive rate). Its significance is that it
shows what fraction of the actual positive outcomes are forecasted correctly.
Here, the X axis is the cut-off probability selected to classify the forecasted outputs
in binary form. This curve also shows that the trade-off will be optimum when the cut-off
probability is chosen as around 0.4
Feature Selection
We have seen examples and equations for various models. They look quite simple, but in practice
datasets are rarely small and the features for regression or classification are rarely few enough
to count by hand. Sometimes the number of features ranges into the 100s, 1000s or millions.
Following are the reasons for selecting features:
1. Not all features affect the output, and hence not all of them are needed in
modelling
2. Features may be related to each other, hence not all of them will be required. This
is known as multi-collinearity in features. Multi-collinearity may lead coefficients to swing
wildly. For example, say the dependent variable depends on a predictor A positively and linearly,
but A in turn depends on B. Then the coefficient of A may come out negative, compensated by the
coefficient of B. This is the opposite of the case where the output changes only with A, and may
baffle business interpretation. Also, in this case the p-values of the
coefficients cannot be trusted.
3. A larger number of features brings the bias vs variance trade-off into play: the more
features there are, the higher the possibility of variance and overfitting, so the model
may work well on the training data set but generate high errors on the test data set
Solutions:
1. There are many ways to deal with issue 1.
- Using business knowledge to remove features which are not related
- Using partial correlation coefficients. Consider a dependent variable Y that depends on several
independent variables (x1,x2,x3,…xn). The partial correlation of Y with an individual
predictor, say x1, measures their correlation after removing the impact of the other
predictors (x2,x3..xn). It is computed for each predictor one by one and requires complex
mathematical operations.
- Using other methods like PCA, PLS etc.
2. Removing issues of Multi-collinearity requires understanding of multi-collinearity.
Linear Discriminant Analysis (LDA) is a dimensionality reduction technique. As the name implies
dimensionality reduction techniques reduce the number of dimensions (i.e. variables) in a dataset
while retaining as much information as possible
Extensions to LDA
Linear Discriminant Analysis is a simple and effective method for classification. Because it is
simple and so well understood, there are many extensions and variations to the method. Some
popular extensions include:
Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or
covariance when there are multiple input variables).
Flexible Discriminant Analysis (FDA): Where non-linear combinations of inputs, such as splines,
are used.
Regularized Discriminant Analysis (RDA): Introduces regularization into the estimate of the
variance (actually covariance), moderating the influence of different variables on LDA.
The original development was called the Linear Discriminant or Fisher’s Discriminant Analysis.
The multi-class version was referred to as Multiple Discriminant Analysis. These are all simply
referred to as Linear Discriminant Analysis now.
Multicollinearity
Multicollinearity refers to the phenomenon of having related predictor
(independent) variables in the input data set. In simple terms, in a model that has been built using
several independent variables, some of these variables might be interrelated, due to which the
presence of those variables in the model is redundant. You drop some of these related independent
variables as a way of dealing with multicollinearity. Multicollinearity violates the linear
regression assumption that the predictors are independent of one another. It can arise from
patterns in the data itself, among other causes.
Detection of Multicollinearity:
1. One way to detect multicollinearity is to find the correlation between each pair of independent
predictors. For example, say there are n predictors X1, X2, X3…Xn. Find the pairwise
correlation coefficients for all nC2 pairs. Draw a heatmap of the resulting 2-D matrix, which helps
to identify the pairs that are highly correlated, and eliminate one predictor from each such pair
using business knowledge. But this technique is limited to identifying relations between only two
predictors at a time.
2. Using VIF (Variance Inflation Factor) to identify predictors that are highly correlated with the
others. The VIF formula is given below:
VIFi = 1 / (1 − Ri²)
VIFi is the VIF for Xi, where i ranges over the predictors X1,X2…..Xn. VIF1 is
found by assuming X1 is a function of X2,X3….Xn, treating X1 as the dependent
variable and fitting a best fit line. Then Ri² for this model (the R2 of the fit) is found and
used to compute VIF1. Similarly, the process is repeated for all i.
Interpretation of VIF:
The common heuristic we follow for the VIF values is:
> 10: VIF value is definitely high, and the variable should be eliminated.
> 5: Can be okay, but it is worth inspecting.
< 5: Good VIF value. No need to eliminate this variable
Hence, all those predictors are removed for which VIF > 10.
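The heuristic can be expressed directly in code. The sketch below assumes the standard definition VIF = 1 / (1 − R2), where R2 comes from regressing one predictor on all the others. The R2 values and predictor names are hypothetical inputs, not computed from data, and a VIF of exactly 5 is treated as acceptable, which the thresholds above leave unspecified.

```python
def vif(r_squared):
    """Variance Inflation Factor from the R2 of predictor-on-predictors regression."""
    return 1.0 / (1.0 - r_squared)


def interpret_vif(value):
    """Apply the common heuristic thresholds of 5 and 10."""
    if value > 10:
        return "eliminate"
    if value > 5:
        return "inspect"
    return "keep"


# Hypothetical R2 values obtained by regressing each predictor on the others.
r2_by_predictor = {"X1": 0.95, "X2": 0.85, "X3": 0.50}

for name, r2 in r2_by_predictor.items():
    value = vif(r2)
    print(name, round(value, 2), interpret_vif(value))
```

Because removing one predictor changes the R2 of the remaining regressions, VIF values are normally recomputed after each elimination, not removed all at once.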
Remedies:
1. Make sure you have not fallen into the dummy variable trap; including a dummy variable for
every category (e.g., summer, autumn, winter, and spring) and including a constant term in the
regression together guarantee perfect multicollinearity.
2. Drop one of the variables. An explanatory variable may be dropped to produce a model with
significant coefficients. However, you lose information (because you've dropped a variable).
Omission of a relevant variable results in biased coefficient estimates for the remaining
explanatory variables that are correlated with the dropped variable.
3. Try seeing what happens if you use independent subsets of your data for estimation and apply
those estimates to the whole data set. Theoretically you should obtain somewhat higher variance
from the smaller datasets used for estimation, but the expectation of the coefficient values should
be the same. Naturally, the observed coefficient values will vary, but look at how much they
vary.
Unsupervised Learning
In some pattern recognition problems, the training data consists of a set of input vectors x without
any corresponding target values. The goal in such unsupervised learning problems may be to
discover groups of similar examples within the data, which is called clustering, or to determine
how the data is distributed in the space, known as density estimation. To put it in simpler
terms, for an n-sample space x1 to xn, true class labels are not provided for each sample. Hence it
is known as learning without a teacher.
What is Clustering?
When you're trying to learn about something, say music, one approach might be to look for
meaningful groups or collections. You might organize music by genre, while your friend might
organize music by decade. How you choose to group items helps you to understand more about
them as individual pieces of music. You might find that you have a deep affinity for punk rock and
further break down the genre into different approaches or music from different locations. On the
other hand, your friend might look at music from the 1980s and be able to understand how the
music across genres at that time was influenced by the socio-political climate. In both cases, you
and your friend have learned something interesting about music, even though you took different
approaches.
In machine learning too, we often group examples as a first step to understand a subject (data
set) in a machine learning system. Grouping unlabelled examples is called clustering.
As the examples are unlabelled, clustering relies on unsupervised machine learning. If the
examples are labelled, then clustering becomes classification.
Before you can group similar examples, you first need to find similar examples. You can
measure similarity between examples by combining the examples' feature data into a metric,
called a similarity measure. When each example is defined by one or two features, it's easy to
measure similarity. For example, you can find similar books by their authors. As the number of
features increases, creating a similarity measure becomes more complex. We'll later see how to
create a similarity measure in different scenarios.
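As a small illustration of combining feature data into a similarity measure, the sketch below turns the Euclidean distance between two numeric feature vectors into a score between 0 and 1. The choice of Euclidean distance, the 1 / (1 + d) mapping, and the book features are all assumptions made for this example; other measures may suit other data better.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(a, b):
    """Map distance into (0, 1]; identical examples score exactly 1."""
    return 1.0 / (1.0 + euclidean_distance(a, b))

# Hypothetical books described by two already-scaled features.
book_a = (3.0, 1.0)
book_b = (3.2, 1.1)
book_c = (9.0, 6.0)

print(similarity(book_a, book_b))  # close pair, higher score
print(similarity(book_a, book_c))  # distant pair, lower score
```

In practice the features would usually be standardized or normalized first, so that no single feature dominates the distance.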
Common applications of clustering include:
market segmentation
social network analysis
search result grouping
medical imaging
image segmentation
anomaly detection
After clustering, each cluster is assigned a number called a cluster ID. Now, you can condense
the entire feature set for an example into its cluster ID. Representing a complex example by a
simple cluster ID makes clustering powerful. Extending the idea, clustering data can simplify
large datasets.
For example, you can group items by different features, as in the following examples:
Group stars by brightness.
Group organisms by genetic information into a taxonomy.
Group documents by topic.
Machine learning systems can then use cluster IDs to simplify the processing of large datasets.
Thus, clustering’s output serves as feature data for downstream ML systems.
K MEANS Clustering
k-means clustering is a method of vector quantization that aims to partition n observations into
k clusters in which each observation belongs to the cluster with the nearest mean (cluster centre
or cluster centroid), which serves as a prototype of the cluster. This results in a partitioning of the
data space into Voronoi cells. It is popular for cluster analysis in data mining. k-means clustering
minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean
distances.
Recall the first property of clusters: it states that the points within a cluster should be similar to
each other. So, our aim here is to minimize the distance between the points within a cluster.
K-means is a centroid-based algorithm, or a distance-based algorithm, where we calculate the
distances to assign a point to a cluster. In K-Means, each cluster is associated with a centroid.
The main objective of the K-Means algorithm is to minimize the sum of squared distances between
the points and their respective cluster centroid.
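The assign-then-update idea described above can be sketched in a few lines of NumPy. This is a minimal illustration of the usual Lloyd iterations, not a production implementation: the function name kmeans, the random initialisation, and the two synthetic blobs are assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations; returns (centroids, labels, sum of squared distances)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # k distinct points as a start
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    sse = float(d2[np.arange(len(X)), labels].sum())
    return centroids, labels, sse

# Hypothetical data: two well-separated blobs of 50 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)), rng.normal(5, 0.5, size=(50, 2))])
centroids, labels, sse = kmeans(X, k=2)
print(centroids.round(2), sse)
```

Because the result depends on the starting centroids, practical implementations typically run several initialisations and keep the one with the lowest sum of squared distances.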
Another challenge with k-means arises when the densities of the original points are different.
Here, the points in the red cluster are spread out, whereas the points in the remaining clusters are
closely packed together. Now, if we apply k-means on these points, we will get clusters like this:
We can see that the compact points have been assigned to a single cluster, whereas the points
that are spread loosely but were in the same cluster have been assigned to different clusters. One
of the solutions is to use a higher number of clusters. So, in all the above scenarios, instead of
using 3 clusters, we can use a bigger number. Perhaps setting k=10 might lead to more
meaningful clusters.
Next, we will start with a small cluster value, let’s say 2. Train the model using 2 clusters,
calculate the inertia for that model, and finally plot it in the above graph. Let’s say we got an
inertia value of around 1000:
Now, we will increase the number of clusters, train the model again, and plot the inertia value.
This is the plot we get:
When we changed the cluster value from 2 to 4, the inertia value dropped very sharply. As we
increase the number of clusters further, each additional cluster reduces the inertia by less, and
the curve eventually levels off.
The cluster value at which the inertia stops decreasing appreciably can be chosen as the
right cluster value for our data.
Here, we can choose any number of clusters between 6 and 10. We can have 7, 8, or even 9
clusters. You must also look at the computation cost while deciding the number of clusters. If
we increase the number of clusters, the computation cost will also increase. So, if you do not
have high computational resources, the advice is to choose a smaller number of clusters.
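The procedure of training with increasing cluster values and watching the inertia can be sketched as follows. Inertia is taken here to mean the within-cluster sum of squared distances, which is the usual definition. Everything else is an assumption made for the example: the helper names, the deterministic farthest-point initialisation (chosen only to keep the run reproducible), and the synthetic data with three blobs.

```python
import numpy as np

def farthest_point_init(X, k):
    """Deterministic start: the first point, then repeatedly the point farthest from those chosen."""
    chosen = [0]
    d2 = ((X - X[0]) ** 2).sum(axis=1)
    while len(chosen) < k:
        nxt = int(d2.argmax())
        chosen.append(nxt)
        d2 = np.minimum(d2, ((X - X[nxt]) ** 2).sum(axis=1))
    return X[chosen].copy()

def kmeans_inertia(X, k, n_iter=100):
    """Run k-means for a fixed number of iterations and return the final inertia."""
    c = farthest_point_init(X, k)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        c = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else c[j]
                      for j in range(k)])
    d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

# Hypothetical data: three blobs, so the curve should flatten after k = 3.
rng = np.random.default_rng(42)
centers = [(0, 0), (6, 0), (3, 6)]
X = np.vstack([rng.normal(c, 0.6, size=(60, 2)) for c in centers])

inertias = {k: kmeans_inertia(X, k) for k in range(1, 8)}
drops = {k: inertias[k - 1] - inertias[k] for k in range(2, 8)}
for k in range(1, 8):
    print(k, round(inertias[k], 1))
```

Plotting inertias against k gives the elbow curve; the value of k after which the entries in drops become small is the candidate cluster value.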
Whenever the Euclidean distance does not make sense in the data, we shift from k-means to
k-medoids. K-means attempts to minimize the total squared error, while k-medoids minimizes
the sum of dissimilarities between the points labelled to be in a cluster and a point designated as
the centre of that cluster. In contrast to the k-means algorithm, k-medoids chooses actual data
points as centres (medoids or exemplars).
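To make the contrast with k-means concrete, the sketch below clusters short strings, for which a mean is not defined but a dissimilarity is. It is a simple alternating scheme (assign points, then pick the best member of each cluster as its medoid), written under assumptions: the function names, the naive initialisation with the first k points, the Hamming dissimilarity, and the sample strings are all illustrative.

```python
def medoid(points, dissimilarity):
    """The member of points with the smallest total dissimilarity to all members."""
    return min(points, key=lambda p: sum(dissimilarity(p, q) for q in points))

def assign(points, medoids, dissimilarity):
    """Put each point into the cluster of its nearest medoid."""
    clusters = [[] for _ in medoids]
    for p in points:
        nearest = min(range(len(medoids)), key=lambda j: dissimilarity(p, medoids[j]))
        clusters[nearest].append(p)
    return clusters

def k_medoids(points, k, dissimilarity, n_iter=20):
    medoids = list(points[:k])  # naive initialisation (an assumption of this sketch)
    for _ in range(n_iter):
        clusters = assign(points, medoids, dissimilarity)
        new_medoids = [medoid(c, dissimilarity) if c else m
                       for c, m in zip(clusters, medoids)]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(points, medoids, dissimilarity)

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

words = ["karolin", "kathrin", "karolyn", "1011101", "1001001", "1011001"]
medoids, clusters = k_medoids(words, k=2, dissimilarity=hamming)
print(medoids)
print(clusters)
```

Each returned centre is one of the original strings, which is the defining property of a medoid.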
Supervised and Unsupervised ML differ from each other in the following respects:
In supervised ML, the input data is labelled. In unsupervised ML, the input data remains
unlabelled.
While supervised ML learns from a labelled training dataset, unsupervised ML works directly
on the input dataset.
Supervised ML is used for prediction purposes, whereas unsupervised ML is used for
analysis purposes.
Supervised ML enables classification and regression. However, unsupervised ML enables
clustering, density estimation, and dimension reduction.
5. What does data cleansing mean? What are the best ways to practice this?
Data cleansing primarily refers to the process of detecting and removing errors and
inconsistencies from the data to improve data quality.
8. What is an N-gram?
An n-gram is a contiguous sequence of n items in a given text or speech. More precisely, an
n-gram model is a probabilistic language model that predicts the next item in a sequence from
the preceding (n-1) items.
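A short sketch may help separate the two senses of the term. The first function extracts n-grams from a token list; the other two build a very small bigram (n = 2) model that predicts the next word from the one preceding it. The function names and the toy sentence are assumptions made for this example.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """All contiguous runs of n tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def train_bigram_model(tokens):
    """Count, for each word, which words follow it and how often."""
    counts = defaultdict(Counter)
    for prev, nxt in ngrams(tokens, 2):
        counts[prev][nxt] += 1
    return counts

def predict_next(model, prev):
    """Most frequent follower of prev, or None if prev was never seen."""
    followers = model.get(prev)
    return followers.most_common(1)[0][0] if followers else None

tokens = "the cat sat on the mat the cat ran".split()
model = train_bigram_model(tokens)

print(ngrams(tokens, 3)[:2])
print(predict_next(model, "the"))
```

With n = 2 the prediction depends on one preceding item, which is the (n-1) in the definition above.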
9. Define Outlier
A data analyst interview questions and answers guide would not be complete without this question.
An outlier is a term commonly used by data analysts when referring to a value that appears to
be far removed and divergent from a set pattern in a sample. There are two kinds of outliers –
Univariate and Multivariate.
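For the univariate case, one common convention (not prescribed by this handbook) flags values that lie more than 1.5 interquartile ranges outside the first or third quartile. The sketch below applies that rule to made-up numbers; the function name iqr_outliers and the multiplier k are assumptions for illustration.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical daily order counts with one suspicious entry.
daily_orders = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = iqr_outliers(daily_orders)
print(outliers)
```

Multivariate outliers need a measure that looks at several features together, so a per-column rule like this one is only a first check.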
12. Name the statistical methods that are highly beneficial for data analysts?
The statistical methods that are mostly used by data analysts are:
Bayesian method
Markov process
Simplex algorithm
Imputation
Spatial and cluster processes
Rank statistics, percentile, outlier detection
Mathematical optimization
20. What are the problems that a Data Analyst can encounter while performing data
analysis?
This is a critical data analyst interview question you need to be aware of. A Data Analyst can confront
the following issues while performing data analysis:
Presence of duplicate entries and spelling mistakes. These errors can hamper data
quality.
Poor quality data acquired from unreliable sources. In such a case, a Data Analyst will
have to spend a significant amount of time in cleansing the data.
Data extracted from multiple sources may vary in representation. Once the collected
data is combined after being cleansed and organized, the variations in data
representation may cause a delay in the analysis process.
Incomplete data is another major challenge in the data analysis process. It would
inevitably lead to erroneous or faulty results.
For further study of similar questions, you can refer to these sites:
Careers360
Naukri.com
Simplilearn
Edureka
Source:
https://www.graduatetutor.com/statistics-tutor/interpreting-regression-output/
https://www.analyticsvidhya.com/blog/2016/07/deeper-regression-analysis-assumptions-plots-solutions/
https://searchenterpriseai.techtarget.com/definition/supervised-learning
https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/
https://towardsai.net/p/data-science/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff
https://medium.com/fintechexplained/did-you-know-the-importance-of-finding-correlations-in-data-science-1fa3943debc2
https://towardsdatascience.com/getting-the-basics-of-correlation-covariance-c8fc110b90b4
https://towardsdatascience.com/latent-dirichlet-allocation-15800c852699