Unit 3 Supervised Learning Technique

Difference between Supervised and Unsupervised Learning
Supervised and unsupervised learning are the two main techniques of machine learning. They are
used in different scenarios and with different kinds of datasets. Both learning methods are
explained below, followed by a table of their differences.
Machine learning
==> Supervised: Y = f(X); needs supervision; needs both input and output data; problems solved:
    Classification and Regression
==> Unsupervised
Supervised Machine Learning: Supervised learning is a machine learning method in which models
are trained using labeled data. In supervised learning, the model needs to find the mapping
function that maps the input variable (X) to the output variable (Y).
Supervised learning needs supervision to train the model, much as a student learns things in
the presence of a teacher. Supervised learning can be used for two types of problems:
Classification and Regression.
Example: Suppose we have a dataset of different types of fruit. The task of our supervised
learning model is to identify the fruits and classify them accordingly. To do this, we give the
model the input data as well as the output for it, which means we train the model on the shape,
size, color, and taste of each fruit together with its label. Once the training is completed, we
test the model by giving it a new set of fruit. The model identifies each fruit and predicts the
output using a suitable algorithm.
Unsupervised Machine Learning: Unsupervised learning is another machine learning method, in
which patterns are inferred from unlabeled input data. The goal of unsupervised learning is to
find the structure and patterns in the input data. Unsupervised learning does not need any
supervision. Instead, it finds patterns in the data on its own.
Unsupervised learning can be used for two types of problems: Clustering and Association.
Example: To understand unsupervised learning, we will use the example given above. Unlike in
supervised learning, here we do not provide any supervision to the model. We just provide the
input dataset to the model and allow the model to find the patterns in the data.
With the help of a suitable algorithm, the model will train itself and divide the fruits into
different groups according to the most similar features between them.
The main differences between Supervised and Unsupervised learning are given below:
| Aspect | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Labels | Algorithms are trained using labeled data. | Algorithms are trained using unlabeled data. |
| Feedback | The model takes direct feedback to check whether it is predicting the correct output. | The model does not take any feedback. |
| What it does | The model predicts the output. | The model finds the hidden patterns in data. |
| Data provided | Input data is provided to the model along with the output. | Only input data is provided to the model. |
| Goal | To train the model so that it can predict the output when it is given new data. | To find the hidden patterns and useful insights in the unknown dataset. |
| Supervision | Needs supervision to train the model. | Does not need any supervision to train the model. |
| Problem types | Classification and Regression problems. | Clustering and Association problems. |
| When to use | Cases where we know the inputs as well as the corresponding outputs. | Cases where we have only input data and no corresponding output data. |
| Accuracy | Usually gives more accurate results, because predictions can be checked against known outputs. | May give less accurate results as compared to supervised learning. |
| Closeness to "true AI" | Not close to true Artificial Intelligence, since the model must first be trained on labeled data before it can predict the correct output. | Closer to true Artificial Intelligence, since it learns in a way similar to how a child learns daily routine things from experience. |
Supervised learning includes various algorithms such as Linear Regression, Logistic Regression,
Support Vector Machine, Multi-class Classification, Decision Tree, Bayesian Logic, etc.
Regression vs. Classification in Machine Learning

What is Classification?
Classification is the task of identifying the category or class label of a new observation.
First, a set of data is used as training data: the input data and the corresponding outputs are
given to the algorithm.
If there are more than two classes to choose from, the task is called multiclass
classification.
What is Regression?
In Regression, we fit a line or curve to the variables so that it best fits the given
datapoints; using this fit, the machine learning model can make predictions about the data. In
simple words, "Regression shows a line or curve that passes as close as possible to the
datapoints on the target-predictor graph, in such a way that the vertical distance between the
datapoints and the regression line is minimum." The distance between the datapoints and the
line tells whether the model has captured a strong relationship or not.
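As a minimal sketch of the idea above, the following fits a straight line by ordinary least squares, which minimizes the sum of squared vertical distances between the datapoints and the line. The function name fit_line and the sample data are hypothetical and chosen only for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: house area and price in arbitrary units
areas = [5.0, 7.0, 9.0, 11.0]
prices = [11.0, 14.5, 19.5, 23.0]

slope, intercept = fit_line(areas, prices)
predicted_price = slope * 10.0 + intercept  # prediction for an unseen area of 10.0
print(slope, intercept, predicted_price)
```

The prediction for a new input is read off the fitted line, which is what a regression model does once training is complete.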
Regression vs. Classification in Machine Learning
Regression and Classification algorithms are both Supervised Learning algorithms. Both are used
for prediction in machine learning and work with labeled datasets. The difference between them
lies in the kind of machine learning problem each is used for.
The main difference is that Regression algorithms are used to predict continuous values such
as price, salary, age, etc., whereas Classification algorithms are used to predict or classify
discrete values such as Male or Female, True or False, Spam or Not Spam, etc.
Consider the below diagram:

Difference between Regression and Classification
| Regression Algorithm | Classification Algorithm |
|---|---|
| In Regression, the output variable must be of continuous nature or a real value. | In Classification, the output variable must be a discrete value. |
| The task of the regression algorithm is to map the input value (x) to the continuous output variable (y). | The task of the classification algorithm is to map the input value (x) to the discrete output variable (y). |
| Regression algorithms are used with continuous data. | Classification algorithms are used with discrete data. |
| In Regression, we try to find the best-fit line, which can predict the output more accurately. | In Classification, we try to find the decision boundary, which can divide the dataset into different classes. |
| Regression algorithms can be used to solve regression problems such as weather prediction, house price prediction, etc. | Classification algorithms can be used to solve classification problems such as identification of spam emails, speech recognition, identification of cancer cells, etc. |
| Regression algorithms can be further divided into Linear and Non-linear Regression. | Classification algorithms can be divided into Binary Classifiers and Multi-class Classifiers. |
Issues regarding Classification and Prediction
There are two forms of data analysis that can be used to extract models describing important
classes or predict future data trends. These two forms are as follows:
1. Classification
2. Prediction
We use classification and prediction to extract a model that represents the data classes or
predicts future data trends. Classification predicts the categorical labels of data, while
prediction models predict continuous values. This kind of analysis helps us understand the data
better at a large scale.
What is Prediction?
Another process of data analysis is prediction. It is used to find a numerical output. Same as in
classification, the training dataset contains the inputs and corresponding numerical output
values. The algorithm derives the model or a predictor according to the training dataset. The
model should find a numerical output when the new data is given. Unlike in classification, this
method does not have a class label. The model predicts a continuous-valued function or ordered
value.
Regression is generally used for prediction. Predicting the value of a house from facts such as
the number of rooms, the total area, etc., is an example of prediction.
Classification and Prediction Issues
The major issue is preparing the data for Classification and Prediction. Preparing the data
involves the following activities:
Data Cleaning: Data cleaning involves removing noise and treating missing values. Noise is
removed by applying smoothing techniques, and the problem of missing values can be handled,
for example, by replacing a missing value with the most commonly occurring value for that
attribute.
Relevance Analysis: The database may also have irrelevant attributes. Correlation analysis is
used to find out whether any two given attributes are related.
Data Transformation and Reduction: The data can be transformed by any of the following
methods.
o Normalization: The data is transformed using normalization. Normalization involves
scaling all values for a given attribute so that they fall within a small specified range.
Normalization is used when neural networks or methods involving distance measurements
are used in the learning step.
o Generalization: The data can also be transformed by generalizing it to a higher-level
concept. For this purpose, we can use concept hierarchies.
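The normalization step described above can be sketched as min-max scaling, one common way to make all values of an attribute fall within a small specified range. The function name min_max_normalize, the default range [0, 1], and the sample incomes are assumptions made for illustration, not something the notes prescribe.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no scale information; map it to new_min
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


# Hypothetical attribute values (e.g. yearly income)
incomes = [12000, 35000, 58000, 104000]
scaled = min_max_normalize(incomes)
print(scaled)
```

After scaling, attributes measured on very different scales contribute comparably to distance-based methods.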
Types of Classification
1. Binary Classifier: If the classification problem has only two possible outcomes, then
it is called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG,
etc.
2. Multi-class Classifier: If a classification problem has more than two outcomes,
then it is called a Multi-class Classifier.
Examples: Classification of types of crops, classification of types of music.
How does Classification Work?
There are two stages in a data classification system: creating the classifier (model) and
applying the classifier for classification.
Developing the classifier or model creation: This is the learning stage or learning process.
The classification algorithm constructs the classifier in this stage. A classifier is built
from a training set composed of database records and their corresponding class names.
Applying the classifier for classification: The classifier is used for classification at this
stage. The test data are used here to estimate the accuracy of the classification algorithm. If
the accuracy is deemed sufficient, the classification rules can be extended to cover new data
records. Applications of classification include:
o Sentiment Analysis: Sentiment analysis is highly helpful in social media
monitoring. We can use it to extract social media insights. With advanced
machine learning algorithms, sentiment analysis models can read and analyse
even text that contains misspelled words. Accurately trained models provide
consistently accurate outcomes in a fraction of the time.
o Document Classification: We can use document classification to organize
documents into sections according to their content. Document classification
is a form of text classification; we can classify the text of an entire document.
With the help of machine learning classification algorithms, we can do this
automatically.
o Image Classification: Image classification assigns an image to one of a set of
trained categories. These could be the caption of the image, a statistical value,
or a theme. You can tag images to train your model for the relevant categories
by applying supervised learning algorithms.
o Machine Learning Classification: It uses statistically demonstrable
algorithmic rules to carry out analytical tasks that would take humans hundreds
more hours to perform.
Data Classification Process: The data classification process can be divided into
five steps:
o Define the goals, strategy, workflows, and architecture of data classification.
o Classify the confidential details that we store.
o Apply labels (marks) to the data.
o Use the results to improve protection and compliance.
o Data is complex and keeps changing, so classification is a continuous process.
Classification by Decision tree induction
Decision Tree: The decision tree is one of the most powerful and popular tools for
classification and prediction. A decision tree is a flowchart-like tree structure, where each
internal node denotes a test on an attribute, each branch represents an outcome of the test,
and each leaf node (terminal node) holds a class label.
A decision tree for the concept Play Tennis.
Construction of Decision Tree:

• A tree can be "learned" by splitting the source set into subsets based on an attribute
value test. This process is repeated on each derived subset in a recursive manner
called recursive partitioning.
• The recursion is completed when all samples in the subset at a node have the same value
of the target variable, or when splitting no longer adds value to the predictions.
• The construction of a decision tree classifier does not require any domain knowledge
or parameter setting, and is therefore appropriate for exploratory knowledge
discovery.
• Decision trees can handle high-dimensional data. In general, decision tree classifiers
have good accuracy. Decision tree induction is a typical inductive approach to learning
classification knowledge.
Decision Tree Representation:
Decision trees classify instances by sorting them down the tree from the root to some
leaf node, which provides the classification of the instance.
An instance is classified by starting at the root node of the tree, testing the attribute
specified by this node, then moving down the tree branch corresponding to the value of
the attribute as shown in the above figure. This process is then repeated for the subtree
rooted at the new node.
For example, the instance
(Outlook = Rain, Temperature = Hot, Humidity = High, Wind = Strong)
would be sorted down the Outlook = Rain branch and then the Wind = Strong branch of this
decision tree, and would therefore be classified as a negative instance.
In other words, a decision tree represents a disjunction of conjunctions of constraints on the
attribute values of instances:
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
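The disjunction of conjunctions above can be written directly as a Boolean function. This is only a sketch of the Play Tennis rule as stated in the text; the function name play_tennis and the string encoding of attribute values are assumptions for illustration. Temperature does not appear in the rule, so it is not a parameter.

```python
def play_tennis(outlook, humidity, wind):
    """Return True when the instance satisfies the Play Tennis rule."""
    return (
        (outlook == "Sunny" and humidity == "Normal")
        or outlook == "Overcast"
        or (outlook == "Rain" and wind == "Weak")
    )


# The example instance from the text: Outlook = Rain, Humidity = High, Wind = Strong
print(play_tennis("Rain", "High", "Strong"))  # prints False (a negative instance)
```

Each parenthesized conjunction corresponds to one root-to-leaf path that ends in a positive leaf.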
The strengths of decision tree methods are:
• Decision trees are able to generate understandable rules.
• Once built, decision trees perform classification without requiring much computation.
• Decision trees are able to handle both continuous and categorical variables.
• Decision trees provide a clear indication of which fields are most important for prediction
or classification.
The weaknesses of decision tree methods are:
• Decision trees are less appropriate for estimation tasks where the goal is to predict the
value of a continuous attribute.
• Decision trees are prone to errors in classification problems with many classes and a
relatively small number of training examples.
• Decision trees can be computationally expensive to train.
Expressing attribute test conditions

Decision tree induction algorithms must provide a way of defining an attribute test condition
and its corresponding outcomes for each attribute type.
Binary Attributes: A binary attribute is a nominal attribute with only two categories or
states, 0 or 1, where 0 usually means that the attribute is absent and 1 means that it is
present. Binary attributes are called Boolean if the two states correspond to true and false.
A binary attribute is symmetric if both of its states are equally valuable and carry equal
weight, so there is no preference as to which outcome should be coded as 0 or 1. An example is
the attribute gender, having the states male and female.
Nominal Attributes: Nominal means "relating to names". The values of a nominal attribute
are symbols or names of things. Each value represents some kind of category, code, or state.
Nominal attributes are also referred to as categorical. The values do not have any meaningful
order. In computer science, the values are also called enumerations.
Ordinal Attributes: An ordinal attribute is an attribute whose possible values have a
meaningful order or ranking among them, but the magnitude between successive values is
unknown.
Ordinal attributes can produce binary or multiway splits. Ordinal attribute values can be
grouped as long as the grouping does not violate the order of the attribute values.
Numeric Attributes: A numeric attribute is quantitative. It is a measurable quantity,
represented by integer or real values. It can be interval-scaled or ratio-scaled.
Measures for Selecting the Best Split

Node splitting, or simply splitting, is the process of dividing a node into multiple sub-
nodes to create relatively pure nodes. There are multiple ways of doing this, which can be
broadly divided into two categories based on the type of target variable:
1. Continuous Target Variable
o Reduction in Variance
2. Categorical Target Variable
o Gini Impurity
o Information Gain
o Chi-Square
Decision Tree Splitting Method #1: Reduction in Variance

Reduction in Variance is a method for splitting a node that is used when the target variable
is continuous, i.e., in regression problems. It is so called because it uses variance as the
measure for deciding the feature on which a node is split into child nodes.
Variance is used for calculating the homogeneity of a node. If a node is entirely homogeneous,
then its variance is zero.
Here are the steps to split a decision tree using reduction in variance:
1. For each split, individually calculate the variance of each child node
2. Calculate the variance of each split as the weighted average variance of its child nodes
3. Select the split with the lowest variance
4. Repeat steps 1-3 until completely homogeneous nodes are achieved
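Steps 1 to 3 above can be sketched as follows. The helper names variance and split_variance and the target values are hypothetical; the variance used here is the population variance (dividing by the number of samples in the node), which is an assumption, since the notes do not say which form is meant.

```python
def variance(values):
    """Population variance of the target values in one node."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def split_variance(children):
    """Weighted average variance of the child nodes produced by one split."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * variance(child) for child in children)


# Hypothetical target values of a parent node, divided by two candidate splits
split_a = [[10, 12, 11], [30, 32, 31]]   # children are nearly homogeneous
split_b = [[10, 30, 11], [12, 32, 31]]   # children mix low and high values

# Step 3: choose the candidate split with the lowest weighted variance
best = min([split_a, split_b], key=split_variance)
print(split_variance(split_a), split_variance(split_b), best is split_a)
```

A full tree builder would apply the same selection recursively to each child node (step 4).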
Decision Tree Splitting Method #2: Gini Impurity/ index
1. The Gini Index is a metric that measures how often a randomly chosen element would be
incorrectly labeled if it were labeled at random according to the class distribution of the node.
2. This means that an attribute (split) with a lower Gini Index should be preferred.
3. Scikit-learn supports the "gini" criterion for the Gini Index, and it is the default value of
the criterion parameter.
The formula for the Gini Index of a node with C classes, where p_i is the fraction of samples
in the node that belong to class i, is:
$$\mathrm{Gini} = 1 - \sum_{i=1}^{C} p_i^2$$
The Gini Index is a measure of the inequality or impurity of a distribution, commonly used in
decision trees and other machine learning algorithms. It ranges from 0, for a pure node in
which all samples belong to the same class, up to 1 − 1/C when the C classes are equally
frequent (0.5 for two classes), so it approaches 1 only when there are many classes.
Some additional features and characteristics of the Gini Index are:
1. It is calculated by summing the squared probabilities of each outcome in a distribution
and subtracting the result from 1.
2. A lower Gini Index indicates a more homogeneous or pure distribution, while a
higher Gini Index indicates a more heterogeneous or impure distribution.
3. In decision trees, the Gini Index is used to evaluate the quality of a split by measuring
the difference between the impurity of the parent node and the weighted impurity of the
child nodes.
4. Compared to other impurity measures like entropy, the Gini Index is faster to
compute because it needs no logarithms; in practice the two often select similar splits.
5. One possible disadvantage of the Gini Index is that the split it prefers is not always
the one that is best for classification accuracy.
6. In practice, the choice between using the Gini Index or other impurity measures
depends on the specific problem and dataset, and often requires experimentation and
tuning.
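A minimal sketch of the Gini calculation described above: one minus the sum of squared class probabilities for a node, and the weighted impurity of the child nodes for a split. The function names gini and weighted_gini and the label lists are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of one node: 1 minus the sum of squared class fractions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(children):
    """Weighted Gini impurity of the child nodes produced by one split."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * gini(child) for child in children)


# Hypothetical parent node with 5 "Yes" and 5 "No" samples, and one candidate split
parent = ["Yes"] * 5 + ["No"] * 5
children = [["Yes"] * 4 + ["No"], ["Yes"] + ["No"] * 4]

# Quality of the split: impurity of the parent minus weighted impurity of the children
reduction = gini(parent) - weighted_gini(children)
print(gini(parent), weighted_gini(children), reduction)
```

A split with a larger reduction leaves purer child nodes, which is why the lower weighted Gini value is preferred.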
Decision Tree Splitting Method #3: Information Gain

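Now, what if we have a categorical target variable? Reduction in variance won't quite cut it.
Well, the answer to that is Information Gain. Information Gain is used for splitting the
nodes when the target variable is categorical. It works on the concept of entropy and is
given by:
$$\text{Information Gain} = 1 - \text{Entropy}$$
Entropy is used for calculating the purity of a node. The lower the value of entropy, the higher
the purity of the node. The entropy of a homogeneous node is zero. Since we subtract entropy
from 1, the Information Gain is higher for the purer nodes, with a maximum value of 1. This
simplified form assumes a two-class target, for which entropy lies between 0 and 1; in general,
Information Gain is the entropy of the parent node minus the weighted average entropy of the
child nodes. The formula for calculating the entropy, where p_i is the fraction of samples in
the node that belong to class i, is:
$$\text{Entropy} = -\sum_{i=1}^{C} p_i \log_2 p_i$$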
Steps to split a decision tree using Information Gain:
1. For each split, individually calculate the entropy of each child node
2. Calculate the entropy of each split as the weighted average entropy of child nodes
3. Select the split with the lowest entropy or highest information gain
4. Until you achieve homogeneous nodes, repeat steps 1-3
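The steps above can be sketched as follows. Here information gain is computed in its general form, parent entropy minus the weighted average entropy of the child nodes; for a balanced two-class parent the parent entropy is 1, which matches the simplified "1 minus entropy" reading. The function names and the label lists are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of the class distribution in one node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the weighted average entropy of the child nodes."""
    total = sum(len(child) for child in children)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical parent node with 5 "Yes" and 5 "No" samples, and one candidate split
parent = ["Yes"] * 5 + ["No"] * 5
children = [["Yes"] * 4 + ["No"], ["Yes"] + ["No"] * 4]

gain = information_gain(parent, children)
print(entropy(parent), gain)
```

Among several candidate splits, the one with the highest gain (equivalently, the lowest weighted child entropy) would be selected.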
Decision Tree Splitting Method #4: Chi-Square

Chi-square is another method of splitting nodes in a decision tree for datasets having
categorical target values. It can make two or more splits. It works on the statistical
significance of the differences between the parent node and the child nodes. The Chi-Square
value for a class is:
$$\chi^2 = \frac{(\text{Actual} - \text{Expected})^2}{\text{Expected}}$$
Here, Expected is the expected value for a class in a child node based on the distribution of
classes in the parent node, and Actual is the actual value for a class in a child node.
The above formula gives us the value of Chi-Square for a class. Take the sum of the Chi-Square
values for all the classes in a node to calculate the Chi-Square for that node. The higher the
value, the greater the difference between the parent and child nodes, i.e., the higher the
homogeneity of the child nodes.
Here are the steps to split a decision tree using Chi-Square:
1. For each split, individually calculate the Chi-Square value of each child node by taking
the sum of Chi-Square values for each class in a node
2. Calculate the Chi-Square value of each split as the sum of Chi-Square values for all the
child nodes
3. Select the split with a higher Chi-Square value
4. Until you achieve homogeneous nodes, repeat steps 1-3
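A sketch of steps 1 to 3, assuming the per-class contribution (Actual − Expected)² / Expected; some tutorials take the square root of this quantity instead, so treat the exact form as an assumption. The function names and the class counts are hypothetical.

```python
def chi_square_node(child_counts, parent_counts):
    """Sum over classes of (actual - expected)^2 / expected for one child node."""
    parent_total = sum(parent_counts.values())
    child_total = sum(child_counts.values())
    chi = 0.0
    for cls, parent_count in parent_counts.items():
        # Expected count if the child had the same class distribution as the parent
        expected = child_total * parent_count / parent_total
        actual = child_counts.get(cls, 0)
        chi += (actual - expected) ** 2 / expected
    return chi


def chi_square_split(children, parent_counts):
    """Chi-Square of a split: the sum of the values of all its child nodes."""
    return sum(chi_square_node(child, parent_counts) for child in children)


# Hypothetical class counts: a parent node and two candidate splits
parent = {"Yes": 10, "No": 10}
split_a = [{"Yes": 8, "No": 2}, {"Yes": 2, "No": 8}]   # children differ from parent
split_b = [{"Yes": 5, "No": 5}, {"Yes": 5, "No": 5}]   # children mirror the parent

best = max([split_a, split_b], key=lambda s: chi_square_split(s, parent))
print(chi_square_split(split_a, parent), chi_square_split(split_b, parent))
```

The split whose children deviate most from the parent's class distribution gets the highest value and is selected.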
Naïve Bayes Classifier Algorithm

o The Naïve Bayes algorithm is a supervised learning algorithm that is based on Bayes'
theorem and is used for solving classification problems.
o It is mainly used in text classification, which involves high-dimensional training
datasets.
o The Naïve Bayes Classifier is one of the simplest and most effective classification
algorithms; it helps in building fast machine learning models that can make
quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the
probability that an object belongs to a class.
o Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment
analysis, and classifying articles.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law. It is used to determine
the probability of a hypothesis given prior knowledge. It depends on the conditional
probability.
o The formula for Bayes' theorem is given as:
$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$
Where,
P(A|B) is the Posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the Likelihood: the probability of the evidence given that the hypothesis is
true.
P(A) is the Prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the Marginal probability: the probability of the evidence.
Working of Naïve Bayes' Classifier:
The working of the Naïve Bayes Classifier can be understood with the help of the example below:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play".
Using this dataset, we need to decide whether or not we should play on a particular day
according to the weather conditions. To solve this problem, we follow the steps below:
1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play or not?
Solution: To solve this, first consider the dataset below:
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the weather conditions:
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Likelihood table for the weather conditions:
Weather No Yes All
Overcast 0 5 5/14 = 0.36
Rainy 2 2 4/14 = 0.29
Sunny 2 3 5/14 = 0.36
All 4/14 = 0.29 10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 5/14 = 0.36
P(Yes) = 10/14 = 0.71
So P(Yes|Sunny) = 0.3 * 0.71 / 0.36 ≈ 0.60 (exactly 3/5)
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 4/14 = 0.29
P(Sunny) = 5/14 = 0.36
So P(No|Sunny) = 0.5 * 0.29 / 0.36 ≈ 0.40 (exactly 2/5)
As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
Hence, on a sunny day, the player can play the game.
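The calculation can be checked by counting directly from the 14-row dataset and applying Bayes' theorem with exact fractions, which avoids rounding in the intermediate values. The function name posterior is hypothetical; the rows are copied from the dataset above.

```python
from collections import Counter
from fractions import Fraction

# (Outlook, Play) pairs, rows 0-13 of the dataset
rows = [
    ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
    ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
    ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
    ("Overcast", "Yes"), ("Overcast", "Yes"),
]


def posterior(rows, outlook, play):
    """P(play | outlook) = P(outlook | play) * P(play) / P(outlook)."""
    n = len(rows)
    class_count = Counter(p for _, p in rows)      # frequency of each Play value
    outlook_count = Counter(o for o, _ in rows)    # frequency of each Outlook value
    joint_count = Counter(rows)                    # frequency of each (Outlook, Play) pair
    likelihood = Fraction(joint_count[(outlook, play)], class_count[play])
    prior = Fraction(class_count[play], n)
    evidence = Fraction(outlook_count[outlook], n)
    return likelihood * prior / evidence


p_yes = posterior(rows, "Sunny", "Yes")
p_no = posterior(rows, "Sunny", "No")
print(p_yes, p_no)  # prints 3/5 2/5
```

With exact arithmetic the two posteriors sum to 1, and "Yes" remains the more probable class on a sunny day.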

Advantages of Naïve Bayes Classifier:
o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a
data item.
o It can be used for binary as well as multi-class classification.
o It performs well in multi-class prediction compared to many other algorithms.
o It is a popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
o Naïve Bayes assumes that all features are independent or unrelated, so it cannot learn
the relationships between features.
Applications of Naïve Bayes Classifier:

o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager
learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.
Bayesian Belief Networks

Bayesian classification is based on Bayes' Theorem. Bayesian classifiers are the
statistical classifiers. Bayesian classifiers can predict class membership probabilities such as
the probability that a given tuple belongs to a particular class.
Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities:
• Posterior Probability [P(H|X)]
• Prior Probability [P(H)]
where X is a data tuple and H is some hypothesis.
According to Bayes' Theorem,
P(H|X) = P(X|H) P(H) / P(X)
Bayesian Belief Network
Bayesian Belief Networks specify joint conditional probability distributions. They are also
known as Belief Networks, Bayesian Networks, or Probabilistic Networks.
• A Belief Network allows class conditional independencies to be defined between
subsets of variables.
• It provides a graphical model of causal relationships, on which learning can be performed.
• We can use a trained Bayesian Network for classification.
There are two components that define a Bayesian Belief Network −
• Directed acyclic graph
• A set of conditional probability tables
Directed Acyclic Graph
• Each node in a directed acyclic graph represents a random variable.
• These variables may be discrete or continuous valued.
• These variables may correspond to the actual attribute given in the data.
Directed Acyclic Graph Representation
The following diagram shows a directed acyclic graph for six Boolean variables.
The arcs in the diagram allow the representation of causal knowledge. For example, lung cancer
is influenced by a person's family history of lung cancer, as well as by whether or not the
person is a smoker. It is worth noting that the variable PositiveXray is conditionally
independent of whether the patient has a family history of lung cancer or is a smoker, given
that we know the patient has lung cancer.
Conditional Probability Table

The conditional probability table for the values of the variable LungCancer (LC) showing each
possible combination of the values of its parent nodes, FamilyHistory (FH), and Smoker (S) is
as follows −
K-Nearest Neighbor Classification Algorithm
What's KNN?
KNN (K-Nearest Neighbors) is one of many supervised learning algorithms used in data
mining and machine learning. It is a classifier in which learning is based on how
similar one data item (a vector) is to the others.
How does it work?
KNN is pretty simple. Imagine that you have data about colored balls:
• Purple balls;
• Yellow balls;
• And a ball that you don't know is purple or yellow, but for which you have all the data
(except the color label).
So, how are you going to know the ball's color? Imagine you are a machine that only has
the ball's characteristics (data), but not the final label. How will you determine the ball's
color (the final label, i.e. its class)?
Note: Suppose that the data with number 1 (and label R) refer to the purple balls and
the data with number 2 (and label A) refer to the yellow balls; this is just to make the
explanation easier.
Each row refers to a ball and each column refers to one of the ball's characteristics; the last
column holds the class (color) of each ball:
• R -> purple;
• A -> yellow
We have 5 balls there (5 rows), each with its classification. You can try to discover the
new ball's color (in this case, its class) in N ways. One of these is to compare the new
ball's characteristics with all the others and see which ones it resembles most. If the
data (characteristics) of the new ball, whose correct class you don't know, are similar to
the data of the yellow balls, then the new ball is yellow; if they are more similar to the
data of the purple balls than to the yellow ones, then the new ball is purple. It looks
simple, and that is almost what KNN does, but in a more sophisticated way.
The steps of KNN are:
1. Receive an unclassified data item;
2. Measure the distance (Euclidean, Manhattan, Minkowski or Weighted) from the new item
to all the other items that are already classified;
3. Take the K smallest distances (K is a parameter that you define);
4. Look at the classes of those K nearest items and count how many times each class
appears;
5. Take as the correct class the class that appeared the most times;
6. Classify the new item with the class that you took in step 5.
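The six steps above can be sketched in a few lines. This is an illustrative sketch, not a tuned implementation: the function name knn_classify, the two-feature ball data, and the choice of Euclidean distance with K = 3 are all assumptions. math.dist requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_classify(new_point, training_data, k=3):
    """Classify new_point by majority vote among its k nearest labeled neighbors."""
    # Step 2: Euclidean distance from the new item to every classified item
    distances = sorted(
        (math.dist(new_point, features), label) for features, label in training_data
    )
    # Step 3: keep the labels of the K smallest distances
    nearest_labels = [label for _, label in distances[:k]]
    # Steps 4-6: count each class and return the most frequent one
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical ball data: (feature vector, class), R = purple, A = yellow
balls = [
    ((1.0, 1.2), "R"),
    ((1.1, 0.9), "R"),
    ((0.9, 1.0), "R"),
    ((3.0, 3.2), "A"),
    ((3.1, 2.9), "A"),
]

print(knn_classify((1.0, 1.0), balls, k=3))  # prints R
```

An odd K is a common choice for two-class problems because it avoids tied votes.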
Calculating distance:
Calculating the distance between two points (your new sample and each item in your dataset)
is very simple. As said before, there are several ways to get this value; here we will use
the Euclidean distance.
The Euclidean distance between two points is the square root of the sum of the squared
differences between their corresponding coordinates.
Characteristics of KNN
Between-sample geometric distance
The k-nearest-neighbor classifier is commonly based on the Euclidean distance between a
test sample and the specified training samples. Let $x_i$ be an input sample
with $p$ features $(x_{i1}, x_{i2}, \ldots, x_{ip})$, let $n$ be the total number of input
samples $(i = 1, 2, \ldots, n)$, and let $p$ be the total number of features $(j = 1, 2, \ldots, p)$.
The Euclidean distance between samples $x_i$ and $x_l$ $(l = 1, 2, \ldots, n)$ is defined as
$$d(x_i, x_l) = \sqrt{(x_{i1} - x_{l1})^2 + (x_{i2} - x_{l2})^2 + \cdots + (x_{ip} - x_{lp})^2}$$
Classification decision rule and confusion matrix

Classification typically involves partitioning samples into training and testing categories.
Let $x_i$ be a training sample and $x$ be a test sample, and let $\omega$ be the true class of
a training sample and $\hat{\omega}$ be the predicted class for a test sample
$(\omega, \hat{\omega} = 1, 2, \ldots, \Omega)$. Here, $\Omega$ is the total number of classes.
Feature transformation
Increased performance of a classifier can sometimes be achieved when the feature values are
transformed prior to classification analysis. Two commonly used feature transformations are
standardization and fuzzification.
Standardization removes scale effects caused by use of features with different measurement
scales. For example, if one feature is based on patient weight in units of kg and another feature
is based on blood protein values in units of ng/dL in the range [-3,3], then patient weight will
have a much greater influence on the distance between samples and may bias the performance
of the classifier. Standardization transforms raw feature values into z-scores using the
mean and standard deviation of each feature's values over all input samples, given by the
relationship
z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j},
where x_{ij} is the value for the ith sample and jth feature, \mu_j is the average of all x_{ij} for
feature j, and \sigma_j is the standard deviation of all x_{ij} over all input samples.
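As a small sketch of the z-score relationship, the function below standardizes each feature column of a list of samples. The helper name and the toy values are assumptions; the standard deviation is computed by dividing by n (the population form), which is also an assumption since some texts divide by n - 1.

```python
import math

def standardize(samples):
    """Return z-scores, column by column, for a list of equal-length rows."""
    n = len(samples)
    p = len(samples[0])
    # Mean of each feature over all samples
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    # Population standard deviation of each feature (assumes no constant column)
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in samples) / n)
            for j in range(p)]
    return [[(row[j] - means[j]) / stds[j] for j in range(p)] for row in samples]

# Hypothetical data: a large-scale feature and a small-scale feature
raw = [[60.0, -1.0], [70.0, 0.0], [80.0, 1.0]]
z = standardize(raw)
print(z)
```

After the transformation both columns are on the same scale, so neither dominates the distance between samples.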
Performance assessment with cross-validation
A basic rule in classification analysis is that class predictions are not made for data samples
that are used for training or learning. If class predictions are made for samples used in training
or learning, the accuracy will be artificially biased upward. Instead, class predictions are made
for samples that are kept out of training process.
The performance of most classifiers is typically evaluated through cross-validation, which
involves the determination of classification accuracy for multiple partitions of the input
samples used in training. For example, during 5-fold (κ=5) cross-validation training, a set of
input samples is split up into 5 partitions D1,D2,…,D5 having equal sample sizes to the extent
possible. The notion of ensuring uniform class representation among the partitions is
called stratified cross-validation, which is preferred. To begin, for 5-fold cross-validation,
samples in partitions D2,D3,…,D5 are first used for training while samples in
partition D1 are used for testing. Next, samples in groups D1,D3,…,D5 are used for
training and samples in partition D2 are used for testing. This is repeated until each partition
has been used singly for testing. It is also customary to re-partition all of the input samples,
e.g. 10 times, in order to get a better estimate of accuracy.
Support Vector Machine (SVM) Algorithm
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is
used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in the
correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed Support Vector Machine.
Consider the below diagram in which there are two different categories that are classified using
a decision boundary or hyperplane:
Example: SVM can be understood with the example that we have used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs, so if we want a model that
can accurately identify whether it is a cat or dog, so such a model can be created by using the
SVM algorithm. We will first train our model with lots of images of cats and dogs so that it
can learn about different features of cats and dogs, and then we test it with this strange creature.
Because the SVM creates a decision boundary between these two classes (cat and dog) and
chooses the extreme cases (support vectors), it will look at the extreme cases of cat and dog. On the basis
of the support vectors, it will classify the creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data. If a dataset
can be classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If
a dataset cannot be classified by using a straight line, then such data is termed
non-linear data, and the classifier used is called a Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-
dimensional space, but we need to find out the best decision boundary that helps to classify
the data points. This best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features present in the dataset, which means
if there are 2 features (as shown in image), then the hyperplane will be a straight line. And if there
are 3 features, then the hyperplane will be a 2-dimensional plane.
We always create a hyperplane that has the maximum margin, which means the maximum
distance between the hyperplane and the nearest data points of each class.
Support Vectors:
The data points or vectors that are closest to the hyperplane and that affect the position
of the hyperplane are termed support vectors. Since these vectors support the hyperplane,
they are called support vectors.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features x1 and x2.
We want a classifier that can classify the pair(x1, x2) of coordinates in either green or blue.
Consider the below image:
Since this is a 2-d space, we can easily separate these two classes by just using a straight line.
But there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary
or region is called a hyperplane. The SVM algorithm finds the points from both classes that lie
closest to the boundary. These points are called support vectors. The distance between the vectors and
the hyperplane is called the margin, and the goal of SVM is to maximize this margin.
The hyperplane with maximum margin is called the optimal hyperplane.
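To make the idea of a margin concrete, the sketch below measures the perpendicular distance from each point to a candidate line w·x + b = 0 and reports the smallest one. The two candidate lines and the four points are hypothetical; a real SVM finds the maximizing line by optimization rather than by comparing a fixed list of candidates.

```python
import math

def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi ** 2 for wi in w))
    return abs(dot + b) / norm

def margin(w, b, points):
    """Smallest distance from any point to the hyperplane."""
    return min(distance_to_hyperplane(w, b, p) for p in points)

# Hypothetical points: the first two belong to one class, the last two to the other
points = [(1.0, 1.0), (2.0, 0.5), (4.0, 4.0), (5.0, 3.5)]

# Two hypothetical lines that both separate the classes
margin_a = margin((1.0, 1.0), -5.0, points)  # line x1 + x2 = 5
margin_b = margin((1.0, 0.0), -2.5, points)  # line x1 = 2.5
print(margin_a, margin_b)
```

Both lines classify the four points correctly, but the first leaves more room on either side, which is the property SVM optimizes for.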
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have
used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be
calculated as:
z = x^2 + y^2
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-d space, the boundary looks like a plane parallel to the x-axis. If we convert
it back to 2-d space with z=1, then it will become as:
Hence we get a circle of radius 1 in the case of non-linear data.
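The effect of the extra dimension can be checked with a few hypothetical points: points inside the unit circle get z < 1 and points outside get z > 1, so a flat plane at z = 1 separates them even though no straight line in the x-y plane does. The point lists below are made up for illustration.

```python
def lift(point):
    """Map a 2-d point (x, y) to 3-d by adding z = x^2 + y^2."""
    x, y = point
    return (x, y, x ** 2 + y ** 2)

# Hypothetical data: one class near the origin, the other surrounding it
inner = [(0.2, 0.1), (-0.5, 0.3), (0.0, -0.7)]
outer = [(1.5, 0.0), (-1.2, 1.1), (0.3, -1.8)]

inner_z = [lift(p)[2] for p in inner]
outer_z = [lift(p)[2] for p in outer]

# The plane z = 1 separates the two classes in the lifted space
separable = max(inner_z) < 1.0 < min(outer_z)
print(separable)
```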
Python Implementation of Support Vector Machine

Now we will implement the SVM algorithm using Python. Here we will use the same
dataset user_data, which we have used in Logistic regression and KNN classification.
Data Pre-processing step

Till the Data pre-processing step, the code will remain the same. Below is the code:
#Data Pre-processing Step

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
After executing the above code, we will pre-process the data. The code will give the dataset as:
The scaled output for the test set will be:

Fitting the SVM classifier to the training set:
Now the training set will be fitted to the SVM classifier. To create the SVM classifier, we will
import the SVC class from the sklearn.svm library. Below is the code for it:
from sklearn.svm import SVC # "Support vector classifier"

classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)
In the above code, we have used kernel='linear', as here we are creating SVM for linearly
separable data. However, we can change it for non-linear data. And then we fitted the classifier
to the training dataset(x_train, y_train)
Output:
Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=0,
shrinking=True, tol=0.001, verbose=False)
The model performance can be altered by changing the value of C(Regularization factor),
gamma, and kernel.
o Predicting the test set result:

Now, we will predict the output for test set. For this, we will create a new vector y_pred.
Below is the code for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)
After getting the y_pred vector, we can compare the result of y_pred and y_test to check the
difference between the actual value and predicted value.
Output: Below is the output for the prediction of the test set:
Creating the confusion matrix:

Now we will check the performance of the SVM classifier, that is, how many incorrect predictions
it makes compared to the Logistic regression classifier. To create the confusion matrix, we
need to import the confusion_matrix function of the sklearn library. After importing the
function, we will call it and store the result in a new variable cm. The function takes two main parameters:
y_true (the actual values) and y_pred (the predicted values returned by the classifier).
Below is the code for it:
#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)
Output:
As we can see in the above output image, there are 66+24= 90 correct predictions and 8+2= 10
incorrect predictions. Therefore we can say that our SVM model improved as compared to the
Logistic regression model.
o Visualizing the training set result:

Now we will visualize the training set result, below is the code for it:
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
# Build a fine grid covering the range of both scaled features
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
# Colour each grid point by the class the classifier predicts for it
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
# Overlay the actual training points, one colour per class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
By executing the above code, we will get the output as:
As we can see, the above output is appearing similar to the Logistic regression output. In the
output, we got the straight line as hyperplane because we have used a linear kernel in the
classifier. And we have also discussed above that for the 2d space, the hyperplane in SVM is a
straight line.
Visualizing the test set result:
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
# Build a fine grid covering the range of both scaled features
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
# Colour each grid point by the class the classifier predicts for it
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
# Overlay the actual test points, one colour per class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
By executing the above code, we will get the output as:

As we can see in the above output image, the SVM classifier has divided the users into two
regions (Purchased or Not purchased). With the colormap used in the code, the lower class label
is drawn in red and the higher one in green. Assuming the usual 0/1 coding of the Purchased
column, users who did not purchase the SUV are in the red region with the red scatter points,
and users who purchased the SUV are in the green region with green scatter points. The
hyperplane has divided the two classes into Purchased and Not purchased.
---------------------------------------------------------------------------------------------------------
Bagging Machine Learning
Bagging, or bootstrap aggregation, is the ensemble learning method generally used to
reduce variance within a noisy dataset. In bagging, a random sample of data in a
training set is selected with replacement, meaning that the individual data points may
be chosen more than once.
Bagging is an ensemble approach that tries to resolve overfitting for classification or
regression problems. Bagging aims to improve the accuracy and overall performance of
machine learning algorithms. It does this by taking random subsets of an original dataset,
with replacement, and fitting either a classifier (for classification) or regressor (for regression) to
each subset. Bagging is also known as bootstrap aggregating.
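The procedure can be sketched as follows: draw bootstrap samples with replacement, fit one model per sample, and combine the predictions by majority vote. Everything here is an illustrative assumption, including the toy 1-D dataset, the very simple threshold model used as the base learner, the number of models, and the random seed.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def fit_threshold_model(sample):
    """Hypothetical base learner: split at the midpoint between the two class means."""
    zeros = [x for x, label in sample if label == 0]
    ones = [x for x, label in sample if label == 1]
    if not zeros or not ones:
        # Degenerate bootstrap sample with a single class: always predict that class
        only = 0 if zeros else 1
        return lambda x: only
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    # Assumes class 1 has the larger feature values in this toy dataset
    return lambda x: 1 if x > threshold else 0

def bagged_predict(models, x):
    """Majority vote over the predictions of the individual models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical dataset of (feature, label) pairs
data = [(1.0, 0), (1.5, 0), (2.0, 0), (6.0, 1), (6.5, 1), (7.0, 1)]
rng = random.Random(0)
models = [fit_threshold_model(bootstrap_sample(data, rng)) for _ in range(11)]
print(bagged_predict(models, 6.8), bagged_predict(models, 1.2))
```

Because each model sees a slightly different sample, individual errors tend to differ, and the vote averages them out.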
What is Ensemble Learning?
Ensemble learning gives credence to the idea of the "wisdom of crowds," which suggests that
the decision-making of a larger group of people is usually better than that of
an individual expert.
---------------------------------------------------------------------------------------------------------
Boosting
Boosting is an ensemble learning method that combines a set of weak learners into strong
learners to minimize training errors.
In boosting, a random sample of data is selected, fitted with a model, and then trained
sequentially. That is, each model tries to compensate for the weaknesses of its predecessor.
Each classifier's weak rules are combined with each iteration to form one strong prediction rule.
Boosting is an efficient technique that converts weak learners into a strong learner. It
performs this conversion by combining the learners' predictions through weighted averages
and weighted votes. These algorithms use decision stumps and margin-maximizing
classification for processing.
Three commonly used boosting algorithms are AdaBoost (Adaptive Boosting),
Gradient Boosting, and XGBoost. These are machine learning algorithms
that follow the process of training, predicting, and fine-tuning the result.
Example
Let's understand this concept with the help of the following example. Let's take the example of
the email. How will you recognize your email, whether it is spam or not? You can recognize it
by the following conditions:
o If an email contains lots of sources, that means it is spam.
o If an email contains only one file image, then it is spam.
o If an email contains the message "You won a lottery of $xxxxx," it is spam.
o If an email contains some known source, then it is not spam.
o If it contains the official domain like educba.com, etc., it is not spam.
The rules mentioned above are not that powerful to recognize spam or not; hence these rules
are called weak learners.
Types of Boosting
Boosting algorithms can differ in how they create and aggregate weak learners during the
sequential process. Three popular types of boosting methods include:
1. Adaptive boosting or AdaBoost:

This method operates iteratively, identifying misclassified data points and adjusting their
weights to minimize the training error. The model continues to optimize sequentially until it
yields the strongest predictor.
AdaBoost is implemented by combining several weak learners into a single strong learner. The
weak learners in AdaBoost take into account a single input feature and draw out a single split
decision tree called the decision stump. Each observation is weighted equally while drawing
out the first decision stump.
The results from the first decision stump are analyzed, and if any observations are wrongfully
classified, they are assigned higher weights. A new decision stump is drawn by considering the
higher-weight observations as more significant. Again if any observations are misclassified,
they're given a higher weight, and this process continues until all the observations fall into the
right class.
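One reweighting round can be sketched as below. The source does not give the update formulas, so this uses the standard discrete AdaBoost formulation with labels in {-1, +1}; the labels and stump predictions are hypothetical.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost reweighting step for labels in {-1, +1}."""
    # Weighted error of the current stump (assumes 0 < err < 1)
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Weight of this stump in the final vote
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified points (t * p = -1) are scaled up, correct ones scaled down
    new_w = [w * math.exp(-alpha * t * p)
             for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    # Normalize so the weights again sum to 1
    return alpha, [w / total for w in new_w]

# Hypothetical labels and predictions of the first decision stump
y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]

# Every observation starts with the same weight
weights = [1 / len(y_true)] * len(y_true)
alpha, weights = adaboost_round(weights, y_true, y_pred)
print(alpha, weights)
```

The second observation was misclassified, so it carries more weight when the next stump is drawn.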
AdaBoost can be used for both classification and regression-based problems. However, it is
more commonly used for classification purposes.
2. Gradient Boosting:
Gradient Boosting is also based on sequential ensemble learning. Here the base learners are
generated sequentially so that the present base learner is always more effective than the
previous one, i.e., the overall model improves sequentially with each iteration.
The difference in this boosting type is that the weights for misclassified outcomes are not
incremented. Instead, the Gradient Boosting method tries to optimize the loss function of the
previous learner by adding a new model that adds weak learners to reduce the loss function.
The main idea here is to overcome the errors in the previous learner's predictions. This boosting
has three main components:
o Loss function: The choice of loss function depends on the type of problem. The
advantage of gradient boosting is that there is no need for a new boosting algorithm for
each loss function.
o Weak learner: In gradient boosting, decision trees are used as the weak learners. A
regression tree is used to give real values, which can be combined to create correct
predictions. As in the AdaBoost algorithm, small trees with a single split can be used,
i.e., decision stumps. Larger trees can also be used, e.g., with 4-8 levels.
o Additive Model: Trees are added one at a time in this model. Existing trees remain the
same. During the addition of trees, gradient descent is used to minimize the loss
function.
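The three components can be seen together in a minimal sketch for squared-error loss, where the negative gradient is simply the residual. The 1-D stump learner, the learning rate, the number of trees, and the data are all assumptions chosen to keep the example short; it also assumes at least two distinct input values.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on 1-D inputs (least squares)."""
    best = None
    # Try every split point except the largest value, so both sides are non-empty
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Additive model: start from the mean, then add stumps fitted to residuals."""
    base = sum(ys) / len(ys)
    trees = []
    predictions = [base] * len(xs)
    for _ in range(n_trees):
        # For squared loss, the negative gradient is the residual
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)  # existing trees are left unchanged
        predictions = [p + learning_rate * tree(x)
                       for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)

def mean_squared_error(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical training data
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 3.1, 3.0, 5.2, 5.0]
baseline = gradient_boost(xs, ys, n_trees=0)
boosted = gradient_boost(xs, ys, n_trees=20)
print(mean_squared_error(baseline, xs, ys), mean_squared_error(boosted, xs, ys))
```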
3. Extreme gradient boosting or XGBoost:
XGBoost is an advanced gradient boosting method. XGBoost, developed by Tianqi Chen, is
part of the Distributed Machine Learning Community (DMLC).
The main aim of this algorithm is to increase the speed and efficiency of computation. A
plain gradient boosting algorithm computes its output slowly because it analyzes the
data set sequentially. XGBoost is therefore used to boost, or extremely boost, the model's
performance.
XGBoost is designed to focus on computational speed and model efficiency. The main features
provided by XGBoost are:
o Parallel Processing: XG Boost provides Parallel Processing for tree construction which
uses CPU cores while training.
o Cross-Validation: XG Boost enables users to run cross-validation of the boosting
process at each iteration, making it easy to get the exact optimum number of boosting
iterations in one run.
o Cache Optimization: It provides Cache Optimization of the algorithms for higher
execution speed.
o Distributed Computing: For training large models, XG Boost allows Distributed
Computing.
The differences between bagging and boosting are:
Bagging | Boosting
The simplest way of combining predictions that belong to the same type. | A way of combining predictions that belong to different types.
Its main aim is to decrease variance, not bias. | Its main aim is to decrease bias, not variance.
Each model receives equal weight. | Models are weighted according to their performance.
Each model is built independently. | Each new model depends on the models built before it.
Training data subsets are drawn from the whole training dataset using random row sampling with replacement. | Each new subset emphasizes the elements that were misclassified by preceding models.
It tries to solve the overfitting problem. | It tries to reduce bias.
If the classifier is unstable (high variance), apply bagging. | If the classifier is stable and simple (high bias), apply boosting.
The base classifiers are trained in parallel. | The base classifiers are trained sequentially.
Example: the random forest model. | Example: AdaBoost.
Regression: -
o Regression refers to a type of supervised machine learning technique that is used to
predict any continuous-valued attribute. Regression helps any business organization to
analyze the target variable and predictor variable relationships. It is an important
tool to analyze the data that can be used for financial forecasting and time series
modeling.
o Regression involves the technique of fitting a straight line or a curve on numerous data
points. It happens in such a way that the distance between the data points and the curve
comes out to be the lowest.
o The most popular types of regression are linear and logistic regressions. Other than that,
many other types of regression can be performed depending on their performance on
an individual data set.
Regression is divided into five different types
1. Linear Regression 2. Logistic Regression 3. Lasso Regression
4. Ridge Regression 5. Polynomial Regression
Linear Regression
Linear regression is the type of regression that forms a relationship between the target variable
and one or more independent variables utilizing a straight line. The given equation represents
the equation of linear regression
Y = a + b*X + e.
Where,
a=represents the intercept
b=represents the slope of the regression line

e=represents the error
X and Y represent the predictor and target variables, respectively.
In linear regression, the best fit line is obtained using the least squares method, which
minimizes the total sum of the squares of the deviations from each data point to the line of
regression. Here, the positive and negative deviations do not cancel out because all the deviations
are squared.
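For a single predictor, the least squares estimates of a and b have a closed form, sketched below in plain Python. The function name and the data are assumptions; the points were chosen to lie exactly on Y = 1 + 2*X so that the error term is zero.

```python
def fit_line(xs, ys):
    """Least squares estimates of intercept a and slope b for Y = a + b*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # The fitted line always passes through the point of means
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data lying exactly on Y = 1 + 2*X
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)
```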
Polynomial Regression
If the power of the independent variable is more than 1 in the regression equation, it is termed
a polynomial regression equation. With the help of the example given below, we will understand the
concept of polynomial regression.
Y = a + b * X^2
In this type of regression, the best fit line is not a straight line as in linear
regression; instead, it is a curve fitted to the data points.
Logistic Regression
When the dependent variable is binary in nature, i.e., 0 and 1, true or false, success or failure,
the logistic regression technique comes into existence. Here, the target value (Y) ranges from
0 to 1, and it is primarily used for classification-based problems. Unlike linear regression, it
does not need any independent and dependent variables to have a linear relationship.
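The source does not write out the function that keeps the output between 0 and 1; the usual choice is the logistic (sigmoid) function applied to a linear combination of the inputs. The sketch below assumes that form, with hypothetical coefficients a and b and a 0.5 decision threshold.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(a, b, x):
    """Predicted probability that Y = 1 for a single predictor x."""
    return sigmoid(a + b * x)

def predict_class(a, b, x, threshold=0.5):
    """Turn the probability into a 0/1 class label."""
    return 1 if predict_proba(a, b, x) >= threshold else 0

# Hypothetical fitted coefficients
a, b = -4.0, 1.0
print(predict_proba(a, b, 6.0), predict_class(a, b, 6.0))
print(predict_proba(a, b, 1.0), predict_class(a, b, 1.0))
```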
Ridge Regression
Ridge regression refers to a process that is used to analyze regression data that have the
issue of multicollinearity. Multicollinearity is the existence of a linear correlation between two
or more independent variables.
Lasso Regression
The term LASSO stands for Least Absolute Shrinkage and Selection Operator. Lasso regression
is a linear type of regression that utilizes shrinkage. In Lasso regression, all the data points are
shrunk towards a central point, also known as the mean. The lasso process is most fitted for
simple and sparse models with fewer parameters than other regression. This type of regression
is well fitted for models that suffer from multicollinearity.
Application of Regression
o Environmental modeling
o Analyzing Business and marketing behavior
o Financial predictors or forecasting
o Analyzing the new trends and patterns.
Model Evaluation and Selection-Metrics for Evaluating
Classifier Performance
Evaluating the performance of a Machine learning model is one of the important steps while
building an effective ML model. To evaluate the performance or quality of the model,
different metrics are used, and these metrics are known as performance metrics or
evaluation metrics. These performance metrics help us understand how well our model has
performed for the given data.
In machine learning, each task or problem is divided into classification and Regression. Not all
metrics can be used for all types of problems; hence, it is important to know and understand
which metrics should be used. Different evaluation metrics are used for both Regression and
Classification tasks.
1. Performance Metrics for Classification
In a classification problem, the category or classes of data is identified based on training data.
The model learns from the given dataset and then classifies the new data into classes or groups
based on the training. It predicts class labels as the output, such as Yes or No, 0 or 1, Spam or
Not Spam, etc. To evaluate the performance of a classification model, different metrics are
used, and some of them are as follows:
1. Accuracy 2. Confusion Matrix 3. Precision 4. Recall
5. F-Score 6. AUC(Area Under the Curve)-ROC

I. Accuracy
The accuracy metric is one of the simplest Classification metrics to implement, and it can be
determined as the number of correct predictions to the total number of predictions.
It can be formulated as:
Accuracy = (Number of correct predictions) / (Total number of predictions)
To implement an accuracy metric, we can compare ground truth and predicted values in a loop,
or we can also use the scikit-learn module for this.
II. Confusion Matrix
A confusion matrix is a tabular representation of prediction outcomes of any binary classifier,

which is used to describe the performance of the classification model on a set of test data when
true values are known.
The confusion matrix is simple to implement, but the terminologies used in this matrix might
be confusing for beginners.
A typical confusion matrix for a binary classifier looks like the below image(However, it can
be extended to use for classifiers with more than two classes).
We can determine the following from the above matrix:
o In the matrix, columns are for the prediction values, and rows specify the Actual values.
Here Actual and prediction give two possible classes, Yes or No. So, if we are
predicting the presence of a disease in a patient, the Prediction column with Yes means,
Patient has the disease, and for NO, the Patient doesn't have the disease.
o In this example, the total number of predictions is 165, out of which the classifier predicted
Yes 110 times and No 55 times.
o However, in reality, there are 60 cases in which patients don't have the disease and 105
cases in which patients have the disease.
In general, the table is divided into four terminologies, which are as follows:
1. True Positive(TP): In this case, the prediction outcome is true, and it is true in reality,
also.
2. True Negative(TN): in this case, the prediction outcome is false, and it is false in reality,
also.
3. False Positive(FP): In this case, prediction outcomes are true, but they are false in
actuality.
4. False Negative(FN): In this case, predictions are false, and they are true in actuality.
III. Precision
The precision metric is used to overcome the limitation of Accuracy. Precision determines
the proportion of positive predictions that were actually correct. It can be calculated as the True
Positives (predictions that are actually true) divided by the total positive predictions (True Positives and
False Positives):
Precision = TP / (TP + FP)
IV. Recall or Sensitivity
It is similar to the Precision metric; however, it aims to calculate the proportion of actual
positives that were identified correctly. It can be calculated as the True Positives (predictions that
are actually true) divided by the total number of actual positives, either correctly predicted as positive or
incorrectly predicted as negative (True Positives and False Negatives).
The formula for calculating Recall is given below:
Recall = TP / (TP + FN)
V. F-Scores
F-score or F1 Score is a metric to evaluate a binary classification model on the basis of

predictions that are made for the positive class. It is calculated with the help of Precision and
Recall. It is a type of single score that represents both Precision and Recall. So, the F1 Score
can be calculated as the harmonic mean of both precision and Recall, assigning equal weight
to each of them.
The formula for calculating the F1 score is given below:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
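Accuracy, precision, recall, and the F1 score can all be computed from the four confusion-matrix counts. The counts used below (TP=100, TN=50, FP=10, FN=5) are hypothetical: they are one set of cell values that is consistent with the totals quoted in the disease example (165 predictions, 110 predicted Yes, 105 actual Yes), not values stated in the text.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    # Harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts consistent with the 165-prediction example
metrics = classification_metrics(tp=100, tn=50, fp=10, fn=5)
for name, value in metrics.items():
    print(name, round(value, 3))
```

The function assumes that at least one positive prediction and one actual positive exist; otherwise precision or recall would divide by zero.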
VI. AUC-ROC
Sometimes we need to visualize the performance of the classification model on charts; then,
we can use the AUC-ROC curve. It is one of the popular and important metrics for evaluating
the performance of the classification model.
Firstly, let's understand the ROC (Receiver Operating Characteristic) curve. The ROC curve is
a graph that shows the performance of a classification model at different threshold levels. The
curve is plotted between two parameters, which are:
o True Positive Rate

o False Positive Rate
TPR or True Positive Rate is a synonym for Recall, hence it can be calculated as:
TPR = TP / (TP + FN)
FPR or False Positive Rate can be calculated as:
FPR = FP / (FP + TN)
To calculate the value at any point on a ROC curve, we could evaluate a logistic regression model
multiple times with different classification thresholds, but this would not be very efficient.
So, for this, one efficient method is used, which is known as AUC.
AUC: Area Under the ROC curve

AUC stands for Area Under the ROC Curve. As its name suggests, AUC calculates the
two-dimensional area under the entire ROC curve, as shown in the below image:
AUC calculates the performance across all the thresholds and provides an aggregate measure.
The value of AUC ranges from 0 to 1. It means a model with 100% wrong prediction will have
an AUC of 0.0, whereas models with 100% correct predictions will have an AUC of 1.0.
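One way to compute AUC without drawing the curve uses an equivalent interpretation: AUC is the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The sketch below takes that pairwise route; the labels and scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs that are ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))

# Hypothetical labels and classifier scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))
```

The double loop is quadratic in the number of samples, which is fine for a sketch but slow for large datasets.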
Cross-Validation in Machine Learning

Cross-validation is a technique for validating the model efficiency by training it on the subset
of input data and testing on previously unseen subset of the input data. We can also say that
it is a technique to check how a statistical model generalizes to an independent dataset.
The basic steps of cross-validation are:
o Reserve a subset of the dataset as a validation set.

o Provide the training to the model using the training dataset.
o Now, evaluate model performance using the validation set. If the model performs well
with the validation set, perform the further step, else check for the issues.
Methods used for Cross-Validation
There are some common methods that are used for cross-validation. These methods are given
below:
1. Validation Set Approach

2. Leave-P-out cross-validation
3. Leave one out cross-validation
4. K-fold cross-validation
5. Stratified k-fold cross-validation
Validation Set Approach
We divide our input dataset into a training set and test or validation set in the validation set
approach. Both the subsets are given 50% of the dataset.
But it has one big disadvantage: we are using only 50% of the dataset to train our model,
so the model may fail to capture important information in the dataset. It also tends to give
an underfitted model.
Leave-P-out cross-validation
In this approach, p data points are left out of the training data. It means, if there are total n
data points in the original input dataset, then n-p data points will be used as the training dataset
and the p data points as the validation set. This complete process is repeated for every possible
combination of p data points, and the average error is calculated to know the effectiveness of the model.
There is a disadvantage of this technique; that is, it can be computationally expensive for
large p.
Leave one out cross-validation
This method is similar to the leave-p-out cross-validation, but instead of p, we take 1
data point out of training. It means, in this approach, for each learning set, only one data point is
reserved, and the remaining dataset is used to train the model. This process repeats for each
data point. Hence for n samples, we get n different training sets and n test sets. It has the following
features:
o In this approach, the bias is minimum as all the data points are used.
o The process is executed for n times; hence execution time is high.
o This approach leads to high variation in testing the effectiveness of the model as we
iteratively check against one data point.
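A minimal sketch of leave-one-out cross-validation is given below. The function leave_one_out_mse and the mean-predictor stand-in model (fit_mean, predict_mean) are hypothetical names chosen for the illustration; any fit/predict pair could be passed in instead.

```python
def leave_one_out_mse(xs, ys, fit, predict):
    """Average squared error over n rounds, each holding out exactly one point."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        # Train on every point except the i-th one.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        # Test on the single held-out point.
        total += (ys[i] - predict(model, xs[i])) ** 2
    return total / n

# Stand-in model (an assumption for this example): predict the training mean.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)

def predict_mean(model, x):
    return model

xs = [1, 2, 3, 4]
ys = [2.0, 4.0, 6.0, 8.0]
loo_error = leave_one_out_mse(xs, ys, fit_mean, predict_mean)
print(loo_error)
```

The loop body runs once per data point, which is where the high execution time for large n comes from.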
K-Fold Cross-Validation
The K-fold cross-validation approach divides the input dataset into K groups of samples of
equal size. These groups are called folds. For each learning set, the prediction function uses
K-1 folds for training, and the remaining fold is used as the test set. This is a very popular
CV approach because it is easy to understand, and its output is less biased than that of
methods such as a single train/validation split.
The steps for k-fold cross-validation are:
o Split the input dataset into K groups.
o For each group:
    o Take that group as the reserve or test dataset.
    o Use the remaining groups as the training dataset.
    o Fit the model on the training set and evaluate the performance of the model
    using the test set.
Let's take an example of 5-fold cross-validation. The dataset is grouped into 5 folds. On the
1st iteration, the first fold is reserved for testing the model, and the rest are used to train it.
On the 2nd iteration, the second fold is used to test the model, and the rest are used to train it.
This process continues until each fold has been used as the test fold.
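The 5-fold example can be sketched with plain index lists. The helper names k_fold_indices and k_fold_splits are hypothetical, and the folds here are contiguous for readability; in practice the data is usually shuffled before splitting.

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for fold_id in range(k):
        size = base + (1 if fold_id < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices), using each fold once as the test fold."""
    folds = k_fold_indices(n_samples, k)
    for test_id, test_fold in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != test_id for i in fold]
        yield train, test_fold

# 10 samples, 5 folds: every iteration tests on a different fold of 2 samples.
for iteration, (train_idx, test_idx) in enumerate(k_fold_splits(10, 5), start=1):
    print(iteration, "test:", test_idx, "train:", train_idx)
```

Every sample appears in exactly one test fold, so each data point is used for both training and testing across the 5 iterations.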
Consider the below diagram:
Stratified k-fold cross-validation
This technique is similar to k-fold cross-validation, with one change. It is based on
stratification, the process of rearranging the data so that each fold or group is a good
representative of the complete dataset. It is one of the best approaches for dealing with
bias and variance.
It can be understood with the example of housing prices, where some houses can be priced
much higher than others. Stratified k-fold cross-validation is useful in such situations,
because it keeps the expensive houses spread across the folds instead of concentrated in one.
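One simple way to stratify is to deal the samples of each class out to the folds in turn. The sketch below assumes the house prices have already been binned into two hypothetical categories, "cheap" and "expensive"; the function name stratified_folds is also an assumption made for the illustration.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, k):
    """Assign sample indices to k folds, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

# Hypothetical data: 8 cheap houses and 4 expensive ones (a 2:1 ratio).
labels = ["cheap"] * 8 + ["expensive"] * 4
folds = stratified_folds(labels, k=4)
for fold in folds:
    # Each fold keeps the same 2:1 ratio as the full dataset.
    print(Counter(labels[i] for i in fold))
```

With a plain contiguous split of the same data, the last fold would contain only expensive houses, which is the situation stratification is meant to prevent.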
Applications of Cross-Validation
o It can be used to compare the performance of different predictive modeling methods.
o It has great scope in the medical research field.
o It can also be used for meta-analysis, as data scientists already use it in the field of
medical statistics.
Overfitting and Underfitting in Machine Learning

Before looking at overfitting and underfitting, let's define some basic terms that will help
in understanding them:
o Signal: The true underlying pattern of the data, which the machine learning model needs
to learn.
o Noise: Unnecessary and irrelevant data that reduces the performance of the model.
o Bias: A prediction error introduced in the model by oversimplifying the machine learning
algorithm. It shows up as a systematic difference between the predicted values and the
actual values.
o Variance: The sensitivity of the model to the particular training data it saw. A model has
high variance if it performs well on the training dataset but does not perform well on the
test dataset.
Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points, or more
than the required data points, present in the given dataset. Because of this, the model starts
capturing the noise and inaccurate values present in the dataset, and these reduce the
efficiency and accuracy of the model. An overfitted model has low bias and high variance.
The chance of overfitting increases the longer we train our model: the more we train it, the
more likely it is to become overfitted.
Overfitting is the main problem that occurs in supervised learning.
Example: The concept of overfitting can be understood from the graph below of the linear
regression output:
As we can see from the above graph, the model tries to cover all the data points present in the
scatter plot. It may look efficient, but in reality it is not. The goal of the regression model is
to find the best-fit line, but here we have not got a best fit, so the model will generate
prediction errors.
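The same effect can be reproduced numerically. The sketch below is an illustration under stated assumptions: the training data is a hypothetical line y = 2x with hand-picked noise added, the "overfitted" model is a Lagrange polynomial forced through every training point, and the comparison model is an ordinary least-squares line. None of these specifics come from the original graph.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def interpolate(xs, ys, x):
    """Lagrange polynomial passing through every training point, evaluated at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total

def mse(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Hypothetical data: true signal y = 2x plus fixed noise values.
noise = [0.5, -0.7, 0.9, -0.4, 0.6, -0.8, 0.3, -0.5]
train_x = [float(i) for i in range(8)]
train_y = [2 * x + e for x, e in zip(train_x, noise)]
test_x = [0.5, 3.5, 6.5, 8.0]          # unseen points
test_y = [2 * x for x in test_x]       # noise-free signal at those points

slope, intercept = fit_line(train_x, train_y)
line_train = mse([slope * x + intercept for x in train_x], train_y)
line_test = mse([slope * x + intercept for x in test_x], test_y)
poly_train = mse([interpolate(train_x, train_y, x) for x in train_x], train_y)
poly_test = mse([interpolate(train_x, train_y, x) for x in test_x], test_y)

print("line:       train", line_train, "test", line_test)
print("polynomial: train", poly_train, "test", poly_test)
```

The polynomial covers every training point, so its training error is zero, yet it does far worse than the straight line on the unseen points: it has fitted the noise rather than the signal (low bias, high variance).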
How to avoid Overfitting in a Model
Both overfitting and underfitting degrade the performance of the machine learning model.
The main cause is overfitting, however, and there are several ways to reduce its occurrence
in our model:
o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
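As an illustration of one item from this list, early stopping can be sketched as a rule applied to the validation loss recorded after each training epoch. The function name early_stopping_epoch, the patience parameter, and the loss values are assumptions made for this example, not part of the original notes.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the epoch with the best validation loss seen before training is halted.

    Training halts once the loss has failed to improve for `patience`
    consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss keeps rising: stop training here
    return best_epoch

# Hypothetical validation losses: they fall at first, then rise as the model overfits.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
print(early_stopping_epoch(losses))   # 3
```

The model kept is the one from epoch 3, the last point at which more training still improved performance on data the model had not seen.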
Underfitting
Underfitting occurs when our machine learning model is not able to capture the underlying
trend of the data. It can arise when the feeding of training data is stopped at an early stage
in order to avoid overfitting, so that the model does not learn enough from the training data.
As a result, it may fail to find the best fit of the dominant trend in the data.
In the case of underfitting, the model is not able to learn enough from the training data, and
hence its accuracy is reduced and its predictions are unreliable.
An underfitted model has high bias and low variance.
Example: We can understand underfitting from the output below of the linear regression
model:
As we can see from the above diagram, the model is unable to capture the data points present
in the plot.
How to avoid underfitting:
o By increasing the training time of the model.
o By increasing the number of features.
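The effect of adding a feature can be shown with a small sketch. The data below is a hypothetical exact line y = 2x + 1. The first model has no input feature and can only predict a constant, so it underfits; the second model is given the feature x and fits a least-squares line. Both models are assumptions chosen for the illustration.

```python
def mse(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Hypothetical data following y = 2x + 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

# Model 1: no input feature, so the best it can do is predict the mean of y.
mean_y = sum(ys) / len(ys)
constant_error = mse([mean_y] * len(ys), ys)

# Model 2: uses the feature x through a least-squares line.
mean_x = sum(xs) / len(xs)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x
line_error = mse([slope * x + intercept for x in xs], ys)

print(constant_error)   # 8.0
print(line_error)       # 0.0
```

The constant model has a high error even on its own training data, which is the signature of underfitting (high bias); giving the model the missing feature removes that error.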
