K-Nearest Neighbors Algorithm
Parametric vs non-parametric models
Parametric models
● Assume that the data follows a specific probability distribution, such as normal, binomial, or exponential.
● Have a fixed number of parameters that describe the shape and location of the distribution, such as mean and variance.
● Easier to fit, interpret, and generalize than non-parametric models, as they require less data and computation.
● Can be biased and inaccurate if the data does not match the assumed distribution, or if there are outliers.
Parametric vs non-parametric models
Non-parametric models
● Do not make any assumptions about the distribution of the data.
● More flexible and adaptable than parametric models, as they can capture complex and irregular patterns in the data; the data itself determines the structure of the model.
● More difficult to fit, interpret, and generalize than parametric models, as they require more data and computation.
● Can be noisy and overfit the data, especially if there are irrelevant or redundant variables.
Curse of Dimensionality
Curse of Dimensionality: it refers to the phenomenon of strange/weird things happening as we try to analyze data in high-dimensional spaces.
Dimensionality Reduction Techniques:
● Feature Selection: Identify and select the most relevant features from the original dataset while discarding irrelevant or redundant ones. This reduces the dimensionality of the data, simplifying the model and improving its efficiency. E.g., chi-square test, information gain.
● Feature Extraction: Transform the original high-dimensional data into a lower-dimensional space by creating new features that capture the essential information. Techniques such as Principal Component Analysis (PCA) are commonly used for feature extraction. (A small sketch of both approaches follows this list.)
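As a rough illustration of the two approaches, here is a minimal sketch assuming scikit-learn and its bundled iris data; the choice of dataset, the chi-square scorer, and keeping 2 features/components are illustrative assumptions, not part of the slides.

```python
# A minimal sketch of feature selection vs. feature extraction (illustrative choices).
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)                 # 4 original features

# Feature selection: keep the 2 features most associated with the target (chi-square test)
selected = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)

# Feature extraction: project onto 2 new features (principal components)
extracted = PCA(n_components=2).fit_transform(X)

print(X.shape, selected.shape, extracted.shape)   # (150, 4) (150, 2) (150, 2)
```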
K-Nearest Neighbor (KNN) Algorithm
● Assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
● Stores all the available data and classifies a new data point based on similarity. When new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
● Can be used for regression as well as classification, but it is mostly used for classification problems.
● Non-parametric algorithm, which means it does not make any assumption about the underlying data.
● Also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and performs an action on it at the time of classification. At the training phase the KNN algorithm just stores the dataset, and when it gets new data it classifies that data into the category most similar to the new data. (A minimal usage sketch follows this list.)
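A minimal usage sketch, assuming scikit-learn's KNeighborsClassifier; the tiny (BMI, Age)-style training set and K = 3 are made up purely for illustration.

```python
# Minimal KNN classification sketch (hypothetical data, not from the slides).
from sklearn.neighbors import KNeighborsClassifier

X_train = [[33.6, 50], [26.6, 30], [23.4, 40], [35.3, 23]]  # hypothetical (BMI, Age) pairs
y_train = [1, 0, 0, 1]                                      # hypothetical labels

knn = KNeighborsClassifier(n_neighbors=3)   # "lazy" learner: fit() essentially just stores the data
knn.fit(X_train, y_train)
print(knn.predict([[30.0, 45]]))            # classify a new point by majority vote of its 3 neighbors
```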
How does K-NN work?
In the K-NN algorithm, a new example is classified in four steps:
1. Choose the number of neighbors, K.
2. Compute the distance (e.g., the Euclidean distance) from the new example to every training example.
3. Take the K training examples closest to the new example.
4. Assign the new example to the class that is most common among these K neighbors (majority vote); for regression, take the average of their values.
How to select the value of K in the K-NN Algorithm?
There is no particular way to determine the best value for "K", so we need to try several values and pick the one that works best. A commonly preferred default value for K is 5.
A very low value of K, such as K = 1 or K = 2, can be noisy and make the model sensitive to outliers.
Larger values of K reduce the effect of noise, but they may smooth over class boundaries and misclassify points that lie near them.
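One practical way to "try some values" is cross-validation over a range of K; the sketch below assumes scikit-learn, the iris dataset, and an arbitrary odd-valued search range.

```python
# Sketch: choose K by cross-validated accuracy (the K range is an arbitrary choice).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = {}
for k in range(1, 16, 2):                       # odd K values avoid ties in binary voting
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```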
Advantages of the KNN Algorithm
● Easy to implement, as the complexity of the algorithm is not high.
● Adapts easily: because KNN stores all the data in memory, whenever a new example or data point is added, the algorithm adjusts to it and the new example contributes to future predictions as well.
● Few hyperparameters: the only parameters required when training a KNN model are the value of K and the choice of distance metric.
● Can be more effective when the training data is large.
● Robust to noisy training data.
Disadvantages of the KNN Algorithm:
● Always needs to determine the value of K, which can be complex at times.
● The computation cost is high because the distance from a new data point to every training sample must be calculated.
● Although KNN can achieve high accuracy on the testing set, it is slower and more expensive in terms of time and memory. It needs a considerable amount of memory to store the whole training dataset for prediction.
● Because Euclidean distance is very sensitive to magnitudes, features in the dataset with large magnitudes will always outweigh those with small magnitudes, so features should be put on a comparable scale (a scaling sketch follows this list).
● KNN is not ideal for high-dimensional datasets: the distance between neighbors becomes dominated by the large number of irrelevant attributes. This difficulty, which arises when many irrelevant attributes are present, is referred to as the curse of dimensionality, and nearest-neighbor approaches are especially sensitive to it.
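Because of the magnitude sensitivity noted above, features are usually standardized before applying KNN. A minimal sketch, assuming scikit-learn's StandardScaler and Pipeline, with made-up data in which one feature dwarfs the other:

```python
# Sketch: put all features on the same scale before distances are computed.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data where the second feature has a much larger magnitude than the first.
X_train = [[0.1, 1000], [0.2, 2000], [0.9, 1500], [0.8, 2500]]
y_train = [0, 0, 1, 1]

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)
print(model.predict([[0.5, 1800]]))
```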
Applications of KNN
1. Image Recognition: In image recognition, KNN can be utilized to classify images into different categories based on their
features.
2. Recommendation Systems: KNN serves as the backbone of collaborative filtering techniques in recommendation systems. By
analyzing the preferences of similar users or items, KNN helps recommend products, movies, or music to users based on their past
interactions or ratings.
3. Medical Diagnosis: In the healthcare industry, KNN aids in medical diagnosis by classifying patients into different disease
categories based on their symptoms, medical history, and test results.
4. Intrusion Detection: KNN is employed in cybersecurity for intrusion detection systems (IDS). By analyzing network traffic
patterns, KNN can detect anomalies or suspicious activities that may indicate a security breach. It helps in identifying and
preventing various types of cyber attacks, including malware infections, denial-of-service (DoS) attacks, and unauthorized access
attempts.
5. Text Classification: Text classification tasks, such as sentiment analysis, spam detection, and language identification, benefit
from the application of KNN. By analyzing the similarity between text documents based on their content or features, KNN can
classify them into predefined categories. This application finds its use in social media analysis, customer feedback analysis, and
content filtering.
6. Fraud Detection: In finance, KNN is employed for credit risk assessment and fraud detection.
KNN Example
1. Apply the K-nearest neighbor classifier to predict whether the diabetic patient has sugar, given the features BMI and Age. The training examples are:

BMI    Age   Sugar
33.6   50    1
26.6   30    0
23.4   40    0
43.1   67    0
35.3   23    1
35.9   67    1
36.7   45    1
25.7   46    0
23.3   29    0
31     56    1

Assume K = 3.
Test example: BMI = 43.6, Age = 40, Sugar = ?
KNN Example
The given training dataset has 10 instances with two features, BMI (Body Mass Index) and Age. Sugar is the target label. The target label has two possible values, 0 and 1: 0 means the diabetic patient has no sugar and 1 means the diabetic patient has sugar.
Given the dataset and the new test instance, we need to find the distance from the test instance to every training example. Here we use the Euclidean distance formula to find the distance.
The next table shows the calculated distance from the test example to each training instance.
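Written out, the Euclidean distance from the test example (BMI_t, Age_t) to a training example (BMI_i, Age_i) is

d_i = √((BMI_t - BMI_i)^2 + (Age_t - Age_i)^2)

which is the formula applied row by row in the table that follows.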
KNN Example
BMI    Age   Sugar   Formula                        Distance
33.6   50    1       √((43.6-33.6)^2+(40-50)^2)     14.14
26.6   30    0       √((43.6-26.6)^2+(40-30)^2)     19.72
23.4   40    0       √((43.6-23.4)^2+(40-40)^2)     20.20
43.1   67    0       √((43.6-43.1)^2+(40-67)^2)     27.00
35.3   23    1       √((43.6-35.3)^2+(40-23)^2)     18.92
35.9   67    1       √((43.6-35.9)^2+(40-67)^2)     28.08
36.7   45    1       √((43.6-36.7)^2+(40-45)^2)     8.52
25.7   46    0       √((43.6-25.7)^2+(40-46)^2)     18.88
23.3   29    0       √((43.6-23.3)^2+(40-29)^2)     23.09
31     56    1       √((43.6-31)^2+(40-56)^2)       20.37
KNN Example
BMI    Age   Sugar   Distance   Rank
33.6   50    1       14.14      2
26.6   30    0       19.72
23.4   40    0       20.20
43.1   67    0       27.00
35.3   23    1       18.92
35.9   67    1       28.08
36.7   45    1       8.52       1
25.7   46    0       18.88      3
23.3   29    0       23.09
31     56    1       20.37

Now we apply the majority voting technique to decide the resulting label for the new example.
Here the 1st and 2nd nearest neighbors have target label 1 and the 3rd nearest neighbor has target label 0.
Target label 1 has the majority. Hence the new example is classified as 1, that is, the diabetic patient has sugar.
Test example: BMI = 43.6, Age = 40, Sugar = 1
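The same calculation can be reproduced in a few lines of Python; this is a minimal sketch using only the standard library, with the training data taken from the example above.

```python
# Reproduce the worked example: Euclidean distances, 3 nearest neighbors, majority vote.
from math import sqrt
from collections import Counter

train = [  # (BMI, Age, Sugar) rows from the example
    (33.6, 50, 1), (26.6, 30, 0), (23.4, 40, 0), (43.1, 67, 0), (35.3, 23, 1),
    (35.9, 67, 1), (36.7, 45, 1), (25.7, 46, 0), (23.3, 29, 0), (31, 56, 1),
]
test = (43.6, 40)
k = 3

# Distance from the test point to every training example, paired with its label
dists = [(sqrt((bmi - test[0])**2 + (age - test[1])**2), sugar) for bmi, age, sugar in train]
neighbors = sorted(dists)[:k]                        # 8.52, 14.14, 18.88 -> labels 1, 1, 0
label = Counter(sugar for _, sugar in neighbors).most_common(1)[0][0]
print(label)                                         # 1 -> the patient is predicted to have sugar
```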
Optimized KNN: KD-Tree
A KD-tree (k-dimensional tree) recursively partitions the feature space along the coordinate axes, storing the training points in a binary tree. A nearest-neighbor query walks down the tree to a candidate leaf and then backtracks, pruning branches that cannot contain a closer point, so it usually avoids computing the distance to every training example. On low-dimensional data this reduces the average query cost from O(n) toward O(log n); in high dimensions the benefit fades and the search degrades toward brute force.
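A rough sketch of KD-tree-backed neighbor search, assuming scikit-learn's KDTree; the random data, leaf size, and query point are illustrative assumptions.

```python
# Sketch: KD-tree accelerated nearest-neighbor search (illustrative data).
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.random((1000, 2))                       # 1000 random 2-D points

tree = KDTree(X, leaf_size=30)                  # build the space-partitioning index once
dist, ind = tree.query([[0.5, 0.5]], k=3)       # query 3 nearest neighbors without a full scan
print(ind, dist)

# KNeighborsClassifier can also be told to use a KD-tree internally, e.g.
# KNeighborsClassifier(n_neighbors=3, algorithm="kd_tree").
```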
Decision Tree in Classification
A decision tree is a flowchart-like tree structure in which each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. Given a tuple with an unknown class label, its attribute values are tested against the tree and a path is traced from the root to a leaf node, which holds the class prediction.
“Why are decision tree classifiers so popular?”
● The construction of decision tree classifiers does not require any domain knowledge or parameter setting, and therefore is appropriate for exploratory knowledge discovery. Decision trees can handle multidimensional data.
● Their representation of acquired knowledge in tree form is intuitive and generally easy to assimilate by humans.
● The learning and classification steps of decision tree induction are simple and fast.
● In general, decision tree classifiers have good accuracy.
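For reference, a minimal decision tree classification sketch, assuming scikit-learn; the iris dataset, the train/test split, and the entropy criterion are illustrative choices rather than anything prescribed by the slides.

```python
# Minimal decision tree classification sketch (illustrative dataset and settings).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(criterion="entropy")   # "entropy" corresponds to information gain
clf.fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 3))              # accuracy on the held-out split
```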
Attribute Selection Measures
● An attribute selection measure is a heuristic for selecting the splitting criterion that “best” separates a given
data partition, D, of class-labeled training tuples into individual classes.
● If we were to split D into smaller partitions according to the outcomes of the splitting criterion, ideally each
partition would be pure (i.e., all the tuples that fall into a given partition would belong to the same class).
Conceptually, the “best” splitting criterion is the one that most closely results in such a scenario.
● Attribute selection measures are also known as splitting rules because they determine how the tuples at a
given node are to be split.
● The attribute selection measure provides a ranking for each attribute describing the given training tuples.
● The attribute having the best score for the measure is chosen as the splitting attribute for the given tuples.
● Three popular attribute selection measures are information gain, gain ratio, and the Gini index.
Information Gain
Let D, the data partition, be a training set of class-labeled tuples.
Suppose the class label attribute has m distinct values defining m distinct classes, Ci (for i = 1, . . . , m).
Let Ci,D be the set of tuples of class Ci in D.
Let |D| and |Ci,D| denote the number of tuples in D and Ci,D, respectively.
ID3 uses information gain as its attribute selection measure.
The attribute with the highest information gain is chosen as the splitting attribute for node N.
This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or “impurity” in these partitions.
The expected information needed to classify a tuple in D is given by

Info(D) = - Σ_{i=1}^{m} pi log2(pi)

where pi is the nonzero probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D|/|D|.
Info(D) is just the average amount of information needed to identify the class label of a tuple in D.
Info(D) is also known as the entropy of D.
Information Gain
Now, suppose we were to partition the tuples in D on some attribute A having v distinct values, {a1, a2, . . . , av}, as observed from the training data.
How much more information would we still need (after the partitioning) to arrive at an exact classification? This amount is measured by

Info_A(D) = Σ_{j=1}^{v} (|Dj|/|D|) × Info(Dj)

where |Dj|/|D| acts as the weight of the j-th partition. Information gain is defined as the difference between the original information requirement and the new requirement obtained after partitioning on A:

Gain(A) = Info(D) - Info_A(D)

The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N.
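A small sketch of Info(D), Info_A(D), and Gain(A) in code; the helper names and the tiny attribute/label columns below are made up purely for illustration.

```python
# Sketch: entropy (Info), expected information after a split (Info_A), and Gain.
from math import log2
from collections import Counter

def info(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the classes present in D."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_after_split(values, labels):
    """Info_A(D) = sum(|D_j|/|D| * Info(D_j)) over the partitions induced by attribute A."""
    n = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    return sum(len(part) / n * info(part) for part in partitions.values())

def gain(values, labels):
    return info(labels) - info_after_split(values, labels)

# Hypothetical attribute and class-label columns, for illustration only.
attribute_A = ["youth", "youth", "middle", "senior", "senior", "middle"]
labels      = ["no",    "no",    "yes",    "yes",    "no",     "yes"]
print(round(gain(attribute_A, labels), 3))
```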
In the worked example, Gain(age) = 0.246 bits. Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and Gain(credit rating) = 0.048 bits. Because age has the highest information gain among the attributes, it is selected as the splitting attribute. Node N is labeled with age, and branches are grown for each of the attribute’s values.
Gain Ratio
The information gain measure is biased toward tests with many outcomes. That is, it prefers to select attributes having a large number of values.
C4.5, a successor of ID3, uses an extension to information gain known as gain ratio, which attempts to overcome this bias. It applies a kind of normalization to information gain using a “split information” value defined analogously with Info(D) as

SplitInfo_A(D) = - Σ_{j=1}^{v} (|Dj|/|D|) log2(|Dj|/|D|)

This value represents the potential information generated by splitting D into v partitions corresponding to the v outcomes of a test on A. The gain ratio is then defined as

GainRatio(A) = Gain(A) / SplitInfo_A(D)

and the attribute with the maximum gain ratio is selected as the splitting attribute.
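A matching sketch for SplitInfo_A(D) and GainRatio(A), again with made-up helper names and data; note that SplitInfo_A(D) has the same form as entropy, only computed over the values of A rather than over the class labels.

```python
# Sketch: SplitInfo and GainRatio for an attribute A (illustrative, self-contained).
from math import log2
from collections import Counter

def entropy(items):
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())

def gain_ratio(values, labels):
    n = len(labels)
    parts = {}
    for v, y in zip(values, labels):
        parts.setdefault(v, []).append(y)
    info_a = sum(len(p) / n * entropy(p) for p in parts.values())   # Info_A(D)
    gain = entropy(labels) - info_a                                 # Gain(A)
    split_info = entropy(values)                                    # SplitInfo_A(D): entropy over A's values
    return gain / split_info if split_info > 0 else 0.0             # guard against single-valued attributes

attribute_A = ["youth", "youth", "middle", "senior", "senior", "middle"]  # hypothetical
labels      = ["no", "no", "yes", "yes", "no", "yes"]                     # hypothetical
print(round(gain_ratio(attribute_A, labels), 3))
```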
Gini Index
The Gini index is used in CART. It measures the impurity of a data partition D as

Gini(D) = 1 - Σ_{i=1}^{m} pi^2

where pi is the probability that a tuple in D belongs to class Ci, estimated by |Ci,D|/|D|.
The Gini index considers a binary split for each attribute. When a binary split on attribute A partitions D into D1 and D2, the Gini index of the split is

Gini_A(D) = (|D1|/|D|) Gini(D1) + (|D2|/|D|) Gini(D2)

and the reduction in impurity is

ΔGini(A) = Gini(D) - Gini_A(D)

The attribute that maximizes the reduction in impurity (equivalently, has the minimum Gini index for its best split) is selected as the splitting attribute.
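A small sketch of the Gini index and the impurity reduction of a candidate binary split; the parent-node labels and the split chosen below are made up for illustration.

```python
# Sketch: Gini impurity and the impurity reduction of a binary split (illustrative data).
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum(p_i^2) over the classes present in D."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# Hypothetical parent node and one candidate binary split into D1 and D2.
parent = ["yes"] * 9 + ["no"] * 5
d1, d2 = parent[:10], parent[10:]                 # an arbitrary split, for illustration only

gini_split = (len(d1) / len(parent)) * gini(d1) + (len(d2) / len(parent)) * gini(d2)
print(round(gini(parent), 3), round(gini_split, 3), round(gini(parent) - gini_split, 3))
```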