DS Report
SEMINAR REPORT ON
Submitted By
Bhavana (1SV21AD003)
❑ K-Nearest Neighbors:
Nearest neighbors is one of the simplest predictive models there is. It makes no mathematical assumptions, and it doesn't require any sort of heavy machinery. The only things it requires are:
▪ Some notion of distance.
▪ An assumption that points that are close to one another are similar.
Let’s say we’ve picked a number k like 3 or 5. Then, when we want to classify some new data point,
we find the k nearest labeled points and let them vote on the new output. To do this, we’ll need a
function that counts votes. One possibility is:
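A minimal sketch of such a vote counter (the helper name raw_majority_vote is just illustrative) might look like this:

from collections import Counter
from typing import List

def raw_majority_vote(labels: List[str]) -> str:
    # Count each label and return the single most common one.
    votes = Counter(labels)
    winner, _ = votes.most_common(1)[0]
    return winner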
But this doesn't do anything intelligent with ties. For example, imagine we're rating movies and the five nearest movies are rated G, G, PG, PG, and R. Then G has two votes and PG also has two votes. In that case, we have several options:
▪ Pick one of the winners at random.
▪ Weight the votes by distance and pick the weighted winner.
▪ Reduce k until we find a unique winner.
We'll implement the third:
def majority_vote(labels: List[str]) -> str:
    """Assumes that labels are ordered from nearest to farthest."""
    vote_counts = Counter(labels)
    winner, winner_count = vote_counts.most_common(1)[0]
    num_winners = len([count
                       for count in vote_counts.values()
                       if count == winner_count])

    if num_winners == 1:
        return winner                      # unique winner, so return it
    else:
        return majority_vote(labels[:-1])  # try again without the farthest
This approach is sure to work eventually, since in the worst case we go all
the way down to just one label, at which point that one label wins.
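For example, with five labels where 'a' and 'b' are tied, dropping the farthest label breaks the tie:

# 'a' and 'b' are tied 2-2 among all five labels, so we drop the
# farthest label ('a'), and 'b' wins among the remaining four.
assert majority_vote(['a', 'b', 'c', 'b', 'a']) == 'b'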
With this function it’s easy to create a classifier:
from typing import NamedTuple
from scratch.linear_algebra import Vector, distance

class LabeledPoint(NamedTuple):
    point: Vector
    label: str
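The classifier itself can then be a short function. Here is a minimal sketch that reuses majority_vote and the distance function imported above (knn_classify is the name assumed by the evaluation code later in this report):

from typing import List

def knn_classify(k: int,
                 labeled_points: List[LabeledPoint],
                 new_point: Vector) -> str:
    # Order the labeled points from nearest to farthest.
    by_distance = sorted(labeled_points,
                         key=lambda lp: distance(lp.point, new_point))
    # Find the labels of the k closest points
    k_nearest_labels = [lp.label for lp in by_distance[:k]]
    # and let them vote.
    return majority_vote(k_nearest_labels)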
import requests
data = requests.get(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")
with open('iris.data', 'w') as f:
    f.write(data.text)
Each row of the downloaded file looks like:
5.1,3.5,1.4,0.2,Iris-setosa
In this section we’ll try to build a model that can predict the class (that is,
the species) from the first four measurements.
To start with, let's load and explore the data. Our nearest neighbors function expects a LabeledPoint, so let's represent our data that way:
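Each CSV row holds four measurements followed by the species name, so we need a small parser that turns a row into a LabeledPoint. A sketch (parse_iris_row is the name used by the loading code below; stripping the "Iris-" prefix from the label is just a convenience):

from typing import List

def parse_iris_row(row: List[str]) -> LabeledPoint:
    """
    sepal_length, sepal_width, petal_length, petal_width, class
    """
    measurements = [float(value) for value in row[:-1]]
    # the class is e.g. "Iris-virginica"; we just want "virginica"
    label = row[-1].split("-")[-1]
    return LabeledPoint(measurements, label)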
import csv
from collections import defaultdict
from typing import Dict, List

with open('iris.data') as f:
    reader = csv.reader(f)
    iris_data = [parse_iris_row(row) for row in reader if row]  # skip blank lines

# We'll also group just the points by species/label so we can plot them
points_by_species: Dict[str, List[Vector]] = defaultdict(list)
for iris in iris_data:
    points_by_species[iris.label].append(iris.point)
We'd like to plot the measurements so we can see how they vary by species. Unfortunately, they are four-dimensional, which makes them tricky to plot. One thing we can do is look at the scatterplots for each of the six pairs of measurements. I won't explain all the details, but it's a nice illustration of more complicated things you can do with matplotlib, so it's worth studying:
from matplotlib import pyplot as plt

fig, ax = plt.subplots(2, 3)
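# What follows is a sketch of the pairwise scatterplots described above
# (the metric names and marker choices are assumptions for illustration).
metrics = ['sepal length', 'sepal width', 'petal length', 'petal width']
pairs = [(i, j) for i in range(4) for j in range(4) if i < j]
marks = ['+', '.', 'x']  # we have 3 classes, so 3 markers

for row in range(2):
    for col in range(3):
        i, j = pairs[3 * row + col]
        ax[row][col].set_title(f"{metrics[i]} vs {metrics[j]}", fontsize=8)
        ax[row][col].set_xticks([])
        ax[row][col].set_yticks([])
        for mark, (species, points) in zip(marks, points_by_species.items()):
            xs = [point[i] for point in points]
            ys = [point[j] for point in points]
            ax[row][col].scatter(xs, ys, marker=mark, label=species)

ax[-1][-1].legend(loc='lower right', prop={'size': 6})
plt.show()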
To start with, let’s split the data into a test set and a training set:
import random
from scratch.machine_learning import split_data

random.seed(12)
iris_train, iris_test = split_data(iris_data, 0.70)
assert len(iris_train) == 0.7 * 150
assert len(iris_test) == 0.3 * 150
The training set will be the “neighbors” that we’ll use to classify the points in the test set.
We just need to choose a value for k, the number of neighbors who get to vote. Too small
(think k = 1), and we let outliers have too much influence; too large (think k = 105), and we
just predict the most common class in the dataset. In a real application (and with more
data), we might create a separate validation set and use it to choose k. Here we’ll just use
k = 5:
from typing import Tuple

# track how many times we see (predicted, actual)
confusion_matrix: Dict[Tuple[str, str], int] = defaultdict(int)
num_correct = 0

for iris in iris_test:
    predicted = knn_classify(5, iris_train, iris.point)
    actual = iris.label

    if predicted == actual:
        num_correct += 1

    confusion_matrix[(predicted, actual)] += 1

pct_correct = num_correct / len(iris_test)
print(pct_correct, confusion_matrix)
On this simple dataset, the model predicts almost perfectly. There’s one versicolor for which it
predicts virginica, but otherwise it gets things exactly right.
Originally used as an example data set on which Fisher's linear discriminant analysis was applied, the Iris data set became a typical test case for many statistical classification techniques in machine learning, such as support vector machines.
The use of this data set in cluster analysis, however, is not common, since the data set only contains two clusters with rather obvious separation. One of the clusters contains Iris setosa, while the other cluster contains both Iris virginica and Iris versicolor and is not separable without the species information Fisher used. This makes the data set a good example of the difference between supervised and unsupervised techniques in data mining: Fisher's linear discriminant model can only be obtained when the object species are known, and class labels and clusters are not necessarily the same.
Nevertheless, all three species of Iris are separable in the projection onto the nonlinear and branching principal component.[7] The data set is approximated by the closest tree, with some penalty for an excessive number of nodes, bending, and stretching. Then the so-called "metro map" is constructed.[4] The data points are projected onto the closest node, and for each node a pie diagram of the projected points is prepared, with the area of the pie proportional to the number of projected points. Such a diagram shows that the absolute majority of the samples of each Iris species belong to different nodes; only a small fraction of Iris virginica is mixed with Iris versicolor (the mixed blue-green nodes). Therefore, the three species of Iris (Iris setosa, Iris virginica, and Iris versicolor) are separable by the unsupervised procedures of nonlinear principal component analysis. To discriminate them, it is sufficient just to select the corresponding nodes on the principal tree.
❑ The Curse of Dimensionality:
The k-nearest neighbors algorithm runs into trouble in higher dimensions thanks to the “curse of
dimensionality,” which boils down to the fact that high-dimensional spaces are vast. Points in high-
dimensional spaces tend not to be close to one another at all. One way to see this is to randomly generate pairs of points in the d-dimensional "unit cube" for a variety of dimensions and compute the distances between them. Generating random points should be second nature by now:
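A minimal sketch of the helpers we'll need (random_distances is the name assumed by the loop below):

import random
from typing import List
from scratch.linear_algebra import Vector, distance

def random_point(dim: int) -> Vector:
    # A point with each coordinate drawn uniformly from [0, 1).
    return [random.random() for _ in range(dim)]

def random_distances(dim: int, num_pairs: int) -> List[float]:
    # Distances between num_pairs randomly generated pairs of points.
    return [distance(random_point(dim), random_point(dim))
            for _ in range(num_pairs)]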
For every dimension from 1 to 100, we'll compute 10,000 distances and use those to compute the average distance between points and the minimum distance between points in each dimension:

import tqdm

dimensions = range(1, 101)

avg_distances = []
min_distances = []

random.seed(0)
for dim in tqdm.tqdm(dimensions, desc="Curse of Dimensionality"):
    distances = random_distances(dim, 10000)      # 10,000 random pairs
    avg_distances.append(sum(distances) / 10000)  # track the average
    min_distances.append(min(distances))          # track the minimum
In low-dimensional datasets, the closest points tend to be much closer than average. But two points are close only if they're close in every dimension, and every extra dimension, even if it's just noise, is another opportunity for each point to be farther away from every other point. When you have a lot of dimensions, it's likely that the closest points aren't much closer than average, so two points being close doesn't mean very much (unless there's a lot of structure in your data that makes it behave as if it were much lower-dimensional).
A different way of thinking about the problem involves the sparsity of higher-dimensional spaces. If you pick 50 random numbers between 0 and 1, you'll probably get a pretty good sample of the unit interval. But as you add dimensions, those same 50 points have to cover a vastly larger space, and most of that space ends up far from any of your sample points.
As the number of dimensions increases, the average distance between points increases. But what's more problematic is the ratio between the closest distance and the average distance:

min_avg_ratio = [min_dist / avg_dist
                 for min_dist, avg_dist in zip(min_distances, avg_distances)]
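To see the trend, we can plot this ratio against the number of dimensions (a sketch using the lists computed above; matplotlib assumed):

from matplotlib import pyplot as plt

plt.plot(dimensions, min_avg_ratio)
plt.xlabel("# of dimensions")
plt.ylabel("minimum distance / average distance")
plt.show()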
❑ Naive Bayes:
Naive Bayes is a conditional probability model: it assigns probabilities for each of the K possible outcomes or classes, given a problem instance to be classified, represented by a vector encoding some n features (independent variables).
The Naive Bayes classifier algorithm is a machine learning technique used for classification
tasks. It is based on Bayes’ theorem and assumes that features are conditionally
independent of each other given the class label.
Naive Bayes is a probabilistic algorithm that’s typically used for classification problems.
Naive Bayes is simple, intuitive, and yet performs surprisingly well in many cases. For example, the spam filters that email apps use are built on Naive Bayes. In this section, I'll explain the rationale behind Naive Bayes and how it can be used to build a spam filter in Python.
Imagine a "universe" that consists of receiving a message chosen randomly from all possible messages. Let S be the event "the message is spam" and B be the event "the message contains the word bitcoin." Bayes's theorem tells us that the probability that the message is spam, conditional on containing the word bitcoin, is:

P(S|B) = P(B|S)P(S) / [P(B|S)P(S) + P(B|¬S)P(¬S)]
If we have a large collection of messages we know are spam, and a large collection of
messages we know are not spam, then we can easily estimate P(B|S) and P(B|¬S).
If we further assume that any message is equally likely to be spam or not spam (so that
P(S) = P(¬S) = 0.5), then:
P(S|B) = P(B|S)/[P(B|S)+P(B|¬S)]
For example, if 50% of spam messages have the word bitcoin, but only 1% of nonspam messages do, then the probability that any given bitcoin-containing email is spam is:

0.5 / (0.5 + 0.01) ≈ 98%
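The same arithmetic in code, just as a sanity check:

p_b_given_spam = 0.50      # 50% of spam messages contain "bitcoin"
p_b_given_not_spam = 0.01  # 1% of nonspam messages contain "bitcoin"

# P(S|B) under the assumption P(S) = P(not S) = 0.5
p_spam_given_b = p_b_given_spam / (p_b_given_spam + p_b_given_not_spam)
print(p_spam_given_b)  # 0.9803..., i.e. about 98%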
Imagine now that we have a vocabulary of many words, w1, ..., wn. To move this into the realm of probability theory, we'll write Xi for the event "a message contains the word wi." Also imagine that (through some unspecified-at-this-point process) we've come up with an estimate P(Xi|S) for the probability that a spam message contains the ith word, and a similar estimate P(Xi|¬S) for the probability that a nonspam message contains the ith word.
The key to Naive Bayes is making the (big) assumption that the presences (or absences) of each word are independent of one another, conditional on a message being spam or not. Intuitively, this assumption means that knowing whether a certain spam message contains the word bitcoin gives you no information about whether that same message contains the word rolex. In math terms, this means that:

P(X1 = x1, ..., Xn = xn | S) = P(X1 = x1 | S) × ⋯ × P(Xn = xn | S)
This is an extreme assumption. (There's a reason the technique has naive in its name.) Imagine that our vocabulary consists only of the words bitcoin and rolex, and that half of all spam messages are for "earn bitcoin" and the other half are for "authentic rolex." In this case, the Naive Bayes estimate that a spam message contains both bitcoin and rolex is:

P(X1 = 1, X2 = 1 | S) = P(X1 = 1 | S) P(X2 = 1 | S) = 0.5 × 0.5 = 25%

since we've assumed away the knowledge that bitcoin and rolex actually never occur together.
Despite how unrealistic this assumption is, the model often performs well, and it has historically been used in actual spam filters. The same Bayes's theorem reasoning we used for our "bitcoin-only" spam filter tells us that we can calculate the probability that a message is spam using the equation:

P(S | X = x) = P(X = x | S) / [P(X = x | S) + P(X = x | ¬S)]
The Naive Bayes assumption allows us to compute each of the probabilities on the right simply by multiplying together the individual probability estimates for each vocabulary word. In practice, though, you want to avoid multiplying lots of small probabilities together, to prevent a problem called underflow, in which a product of many small floating-point numbers gets rounded down to 0. Since log(ab) = log(a) + log(b) and exp(log(x)) = x, we usually compute the product p1 ⋯ pn as the equivalent:

exp(log(p1) + ⋯ + log(pn))
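A tiny illustration of this identity (the numbers here are made up and the helper name is illustrative):

import math
from typing import List

def product_via_logs(probs: List[float]) -> float:
    # Sum the logs and exponentiate once instead of multiplying
    # many small floating-point numbers together.
    return math.exp(sum(math.log(p) for p in probs))

probs = [0.5, 0.01, 0.2, 0.3]
assert abs(product_via_logs(probs) - 0.5 * 0.01 * 0.2 * 0.3) < 1e-12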
The only challenge left is coming up with estimates for P(Xi|S) and P(Xi|¬S), the probabilities that a spam message (or nonspam message) contains the word wi. If we have a fair number of "training" messages labeled as spam and not spam, an obvious first try is to estimate P(Xi|S) simply as the fraction of spam messages containing the word wi.
This has a big problem, though: if some word appears in none of the training spams, its estimated probability is 0, and the classifier would then assign spam probability 0 to any message containing it, no matter what other words it contains. To avoid this, we use smoothing. In particular, we'll choose a pseudocount, k, and estimate the probability of seeing the ith word in a spam message as:

P(Xi|S) = (k + number of spams containing wi) / (2k + number of spams)
We do similarly for P(Xi|¬S). That is, when computing the spam probability for the ith word, we assume we also saw k additional spams containing the word and k additional spams not containing the word.
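A minimal sketch of these smoothed estimates (the function and argument names here are illustrative, not a full spam filter):

from typing import Dict, Tuple

def word_probabilities(word: str,
                       spam_counts: Dict[str, int],
                       ham_counts: Dict[str, int],
                       num_spams: int,
                       num_hams: int,
                       k: float = 0.5) -> Tuple[float, float]:
    """Return smoothed estimates of (P(word|spam), P(word|not spam))."""
    p_word_spam = (spam_counts.get(word, 0) + k) / (num_spams + 2 * k)
    p_word_ham = (ham_counts.get(word, 0) + k) / (num_hams + 2 * k)
    return p_word_spam, p_word_ham

# e.g. "bitcoin" appears in 25 of 50 training spams and 1 of 100 nonspams
print(word_probabilities("bitcoin", {"bitcoin": 25}, {"bitcoin": 1}, 50, 100))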
THANK YOU