Machine Learning Lab Manual 7
In this lab, we will see how K-Nearest Neighbors (KNN) can be implemented with Python's Scikit-Learn library. But before that, let's first explore the theory behind KNN and see what some of the pros and cons of the algorithm are.
Theory
The intuition behind the KNN algorithm is one of the simplest of all the supervised machine learning algorithms. It simply calculates the distance from a new data point to all the training data points. The distance can be of any type, e.g., Euclidean or Manhattan. It then selects the K nearest data points, where K can be any integer. Finally, it assigns the new data point to the class to which the majority of those K data points belong.
Let's see this algorithm in action with the help of a simple example. Suppose you have a dataset with two variables which, when plotted, looks like the one in the following figure.
Our task is to classify the new data point marked 'X' into either the "Blue" class or the "Red" class. The coordinates of the data point are x=45 and y=50. Suppose the value of K is 3. The KNN algorithm starts by calculating the distance from point X to all the other points. It then finds the 3 points with the least distance to point X, as shown in the figure below, where the three nearest points have been encircled.
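Although we will use Scikit-Learn later in this lab, the procedure just described is simple enough to sketch directly in NumPy. The training points and labels below are made up purely to mirror the figure; only the query point (45, 50) and K=3 come from the example:

import numpy as np

# Made-up training points and class labels mirroring the figure
points = np.array([[20, 35], [30, 40], [44, 52], [48, 47], [60, 30], [70, 55]])
labels = np.array(['Blue', 'Blue', 'Red', 'Red', 'Blue', 'Red'])

x_new = np.array([45, 50])   # the new point X
k = 3

# Step 1: Euclidean distance from X to every training point
distances = np.sqrt(((points - x_new) ** 2).sum(axis=1))

# Step 2: indices of the K nearest points
nearest = np.argsort(distances)[:k]

# Step 3: majority vote among the K nearest labels
classes, counts = np.unique(labels[nearest], return_counts=True)
print(classes[np.argmax(counts)])   # 'Red' for this made-up data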
Pros
1. It is extremely easy to implement.
2. It is a lazy learning algorithm and therefore requires no training prior to making real-time predictions. This makes getting a KNN model ready much faster than with algorithms that require a training phase, e.g., SVM or linear regression (though prediction itself can be slower, since the distance computations happen at query time).
3. Since the algorithm requires no training before making predictions, new data can be added seamlessly.
4. There are only two parameters required to implement KNN, i.e., the value of K and the distance function (e.g., Euclidean or Manhattan); see the short sketch after this list.
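As a quick illustration of point 4 (a sketch, not part of this lab's pipeline), both parameters map directly onto Scikit-Learn's KNeighborsClassifier:

from sklearn.neighbors import KNeighborsClassifier

# n_neighbors is K; metric selects the distance function
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')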
Cons
1. The KNN algorithm doesn't work well with high-dimensional data, because with a large number of dimensions the distances between points become less meaningful as a measure of similarity, and computing them for every query grows expensive.
The Dataset
We are going to use the famous iris data set for our KNN example. The dataset consists of four attributes:
sepal-width, sepal-length, petal-width and petal-length. These are the attributes of specific types of iris
plant. The task is to predict the class to which these plants belong. There are three classes in the dataset:
Iris-setosa, Iris-versicolor and Iris-virginica.
Importing Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Assign column names and load the dataset into a pandas DataFrame
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(url, names=names)

dataset.head()
Executing the above script will display the first five rows of our dataset as shown below:
Preprocessing
The next step is to split our dataset into its attributes and labels. To do so, use the following code:
X = dataset.iloc[:, :-1].values   # attribute columns (first four)
y = dataset.iloc[:, 4].values     # label column
The X variable contains the first four columns of the dataset (i.e., the attributes) while y contains the labels.
Since the range of values of raw data varies widely, in some machine learning algorithms, objective
functions will not work properly without normalization. For example, the majority of classifiers calculate
the distance between two points by the Euclidean distance. If one of the features has a broad range of
values, the distance will be governed by this particular feature. Therefore, the range of all features should
be normalized so that each feature contributes approximately proportionately to the final distance.
The gradient descent algorithm (which is used in neural network training and other machine learning
algorithms) also converges faster with normalized features.
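To make this concrete, here is a small made-up illustration (the numbers are invented for the example): one feature spans roughly 0 to 1 and the other roughly 0 to 1000, so the raw Euclidean distance is driven almost entirely by the second feature until both are standardized.

import numpy as np
from sklearn.preprocessing import StandardScaler

a = np.array([[0.2, 100.0]])
b = np.array([[0.9, 900.0]])

# Raw Euclidean distance is dominated almost entirely by feature 2
print(np.linalg.norm(a - b))          # ~800.0

# After standardization, both features contribute comparably
scaler = StandardScaler().fit(np.vstack([a, b]))
a_s, b_s = scaler.transform(a), scaler.transform(b)
print(np.linalg.norm(a_s - b_s))      # ~2.83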
Before scaling, we split the dataset into training and test sets (here an 80/20 split):

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

The scaler is then fit on the training data only and applied to both sets, so that no information from the test set leaks into the preprocessing:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
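Before we can predict, a classifier has to be trained. A minimal sketch using Scikit-Learn's KNeighborsClassifier follows; K=5 here is just an arbitrary starting value:

from sklearn.neighbors import KNeighborsClassifier

# Train a KNN classifier on the scaled training data (K=5 is an arbitrary first choice)
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)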
We can now make predictions on our test data. To do so, execute the following script:
y_pred = classifier.predict(X_test)
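To see how well these predictions match the true labels, Scikit-Learn's standard classification metrics can be used; a minimal sketch:

from sklearn.metrics import classification_report, confusion_matrix

# Rows of the confusion matrix are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))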
One way to help find the best value of K is to plot the error rate on the test set against the value of K. We will plot the mean error of the predicted values for all K values from 1 to 39.
To do so, let's first calculate the mean error for all the predicted values where K ranges from 1 to 39. Execute the following script:
error = []
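The loop below fills in that computation; it assumes the X_train, X_test, y_train, y_test variables and the KNeighborsClassifier import from the earlier steps. For each K it trains a fresh classifier, predicts on the test set, and records the mean error (the fraction of misclassified test samples):

for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error.append(np.mean(pred_i != y_test))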
The next step is to plot the error values against K values. Execute the following script to create the plot:
plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), error, color='red', linestyle='dashed', marker='o',
         markerfacecolor='blue', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')
plt.show()
The output graph looks like this:
From the output we can see that the mean error is zero when the value of K is between 5 and 18.
Task:
Implement KNN on any dataset and choose different values of K to see how it impacts the accuracy of the predictions.