Python Data Science Cookbook - Sample Chapter
Gopi Subramanian
Preface
Today, we live in a world of connected things where tons of data is generated, and it is humanly
impossible to analyze all the incoming data and make decisions. Human decisions are
increasingly being replaced by decisions made by computers, thanks to the field of data science.
Data science has penetrated deeply into our connected world, and there is a growing demand
in the market for people who not only understand data science algorithms thoroughly, but
are also capable of programming these algorithms. Data science is a field that lies at the
intersection of many fields, including data mining, machine learning, and statistics, to name
a few. This puts an immense burden on all levels of data scientists, from those who are
aspiring to become data scientists to those who are currently practitioners in this field.
Treating these algorithms as black boxes and using them in decision-making systems will lead
to counterproductive results. With tons of algorithms and innumerable problems out there,
choosing the best algorithm for any given problem requires a good grasp of the
underlying algorithms.
Python as a programming language has evolved over the years and today, it is the number one
choice for a data scientist. Its ability to act as a scripting language for quick prototype building,
its sophisticated language constructs for full-fledged software development, and its fantastic
library support for numeric computations have led to its current popularity among
data scientists and the general scientific programming community. Not just that, Python is also
popular among web developers, thanks to frameworks such as Django and Flask.
This book has been carefully written to cater to the needs of a diverse range of data
scientists, starting from novice data scientists to experienced ones, through carefully crafted
recipes, which touch upon the different aspects of data science, including data exploration,
data analysis and mining, machine learning, and large scale machine learning. Each chapter
has been carefully crafted with recipes exploring these aspects. Sufficient math has been
provided for the readers to understand the functioning of the algorithms in depth. Wherever
necessary, enough references are provided for the curious readers. The recipes are written in
such a way that they are easy to follow and understand.
This book brings the art of data science with powerful Python programming to the readers and
helps them master the concepts of data science. Knowledge of Python is not mandatory to
follow this book. Non-Python programmers can refer to the first chapter, which introduces
Python data structures and functional programming concepts.
The early chapters cover the basics of data science and the later chapters are dedicated
to advanced data science algorithms. State-of-the-art algorithms that are currently used in
practice by leading data scientists across industries, including ensemble methods, random
forests, regression with regularization, and others, are covered in detail. Some algorithms
that are popular in academia but not yet widely introduced to the mainstream, such as
rotation forest, are also covered in detail.
With a lot of do-it-yourself books on data science in the market today, we feel that there is a
gap in terms of covering the right mix of the math behind the data science algorithms
and their implementation details. This book is an attempt to fill this gap. With each recipe, just
enough math is introduced to understand how the algorithm works, so that
readers can take full benefit of these methods in their applications.
A word of caution though is that these recipes are written with the objective of explaining the
data science algorithms to the reader. They have not been hard-tested in extreme conditions
in order to be production ready. Production-ready data science code has to go through a
rigorous engineering pipeline.
This book can be used both as a guide to learn data science methods and as a quick reference.
It is a self-contained book that introduces data science to a new reader with little programming
background and helps them become an expert in this trade.
Chapter 5, Data Mining - Finding a needle in a haystack, discusses unsupervised data mining
techniques, starting with elaborate discussions on distance methods and kernel methods and
following it up with clustering and outlier detection techniques.
Chapter 6, Machine Learning 1, covers supervised data mining techniques, including
nearest neighbors, Naive Bayes, and classification trees. In the beginning, we will lay a
heavy emphasis on data preparation for supervised learning.
Chapter 7, Machine Learning 2, introduces regression problems and follows it up with
topics on regularization including LASSO and ridge. Finally, we will discuss cross-validation
techniques as a way to choose hyperparameters for these methods.
Chapter 8, Ensemble Methods, introduces various ensemble techniques, including bagging,
boosting, and gradient boosting. This chapter shows you a powerful state-of-the-art method in
data science where, instead of building a single model for a given problem, an
ensemble, or a bag of models, is built.
Chapter 9, Growing Trees, introduces some more bagging methods based on tree-based
algorithms. Due to their robustness to noise and universal applicability to a variety of
problems, they are very popular among the data science community.
Chapter 10, Large Scale Machine Learning - Online Learning, covers large scale machine
learning and algorithms suited to tackle such large scale problems. This includes algorithms
that work with streaming data and data that cannot be fitted into memory completely.
Introduction
In this chapter, we will focus mostly on unsupervised data mining algorithms. We will start
with a recipe covering various distance measures. Understanding distance measures and
various spaces is critical when building data science applications. Any dataset is usually a set
of points that are objects belonging to a particular space. We can define space as a universal
set of points from which the points in our dataset are drawn. The most often encountered
space is Euclidean. In Euclidean space, the points are vectors of real numbers. The length of the
vector denotes the number of dimensions.
We then have a recipe introducing kernel methods. Kernel methods are a very important topic
in machine learning. They help us solve nonlinear data problems using linear methods. We will
introduce the concept of the kernel trick.
Working with Distance Measures
Getting ready
We will look at distance measures in Euclidean and non-Euclidean spaces. We will start
with the Euclidean distance and then define the Lr-norm distance. The Lr-norm is a family of distance
measures of which Euclidean is a member. We will then follow it with the cosine distance.
In non-Euclidean spaces, we will look at Jaccard's distance and Hamming distance.
How to do it
Let's start by defining the functions to calculate the various distance measures:
import numpy as np

def euclidean_distance(x,y):
    if len(x) == len(y):
        return np.sqrt(np.sum(np.power((x-y),2)))
    else:
        print "Input should be of equal length"
        return None

def lrNorm_distance(x,y,power):
    if len(x) == len(y):
        return np.power(np.sum(np.power((x-y),power)),(1/(1.0*power)))
    else:
        print "Input should be of equal length"
        return None

def cosine_distance(x,y):
    if len(x) == len(y):
        return np.dot(x,y) / np.sqrt(np.dot(x,x) * np.dot(y,y))
    else:
        print "Input should be of equal length"
        return None

def jaccard_distance(x,y):
    set_x = set(x)
    set_y = set(y)
    # Use a float division so that the ratio is not truncated in Python 2
    return 1 - len(set_x.intersection(set_y)) / float(len(set_x.union(set_y)))
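The main routine below also calls a hamming_distance function, whose definition is not part of the code shown above. Based on the loop given later in the How it works section, a minimal version might look like this:

def hamming_distance(x,y):
    if len(x) == len(y):
        # Count the number of positions at which the two bit vectors differ
        diff = 0
        for char1,char2 in zip(x,y):
            if char1 != char2:
                diff+=1
        return diff
    else:
        print "Input should be of equal length"
        return None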
Now, let's write a main routine in order to invoke these various distance measure functions:
if __name__ == "__main__":
    # Sample data, 2 vectors of dimension 3
    x = np.asarray([1,2,3])
    y = np.asarray([1,2,3])
    # Print the Euclidean distance
    print euclidean_distance(x,y)
    # Print Euclidean by invoking lr norm with
    # r value of 2
    print lrNorm_distance(x,y,2)
    # Manhattan or city block distance
    print lrNorm_distance(x,y,1)
    # Sample data for cosine distance
    x = [1,1]
    y = [1,0]
    print 'cosine distance'
    print cosine_distance(x,y)
    # Sample data for Jaccard's distance
    x = [1,2,3]
    y = [1,2,3]
    print jaccard_distance(x,y)
    # Sample data for Hamming distance (two bit vectors)
    x = [1,1,0,0,1]
    y = [1,1,0,1,1]
    print hamming_distance(x,y)
How it works
Let's look at the main function. We created a sample dataset and two vectors of three
dimensions and invoked the euclidean_distance function.
Euclidean distance is the most commonly used distance measure. It belongs to the family
of Lr-norm distances. A space is defined as a Euclidean space if the points in this space
are vectors composed of real numbers. Euclidean distance is also called the L2-norm distance.
The formula for Euclidean distance is as follows:
d([x_1, x_2, \ldots, x_n], [y_1, y_2, \ldots, y_n]) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
As you can see, Euclidean distance is derived by finding the distance in each dimension
(subtracting the corresponding dimensions), squaring the distance, and finally taking a
square root.
In our code, we leverage NumPy square root and power function in order to implement the
preceding formula:
np.sqrt(np.sum(np.power((x-y),2)))
Euclidean distance is always non-negative; when x is equal to y, the distance is zero. This should
become clear from how we invoked the Euclidean distance:
x = np.asarray([1,2,3])
y = np.asarray([1,2,3])
print euclidean_distance(x,y)
As you can see, we defined two NumPy arrays, x and y. We have kept them the same. Now,
when we invoke the euclidean_distance function with these parameters, our output is zero.
Let's now invoke the Lr-norm function, lrNorm_distance.
The Lr-Norm distance metric is from a family of distance metrics of which Euclidean distance
is a member. This should become clear as we see its formula:
d([x_1, x_2, \ldots, x_n], [y_1, y_2, \ldots, y_n]) = \left( \sum_{i=1}^{n} |x_i - y_i|^r \right)^{1/r}
In addition to two vectors, we will also pass a third parameter called power. This is the
r defined in the formula. Invoking it with a power value set to two will yield the Euclidean
distance. You can check it by running the following code:
print lrNorm_distance(x,y,2)
This will yield zero as a result, the same as the output of the Euclidean distance function.
Let's define two sample vectors, x and y, and invoke the cosine_distance function.
In spaces where the points are considered as directions, the cosine distance yields the
cosine of the angle between the given input vectors as the distance value. Both Euclidean
spaces and spaces where the points are vectors of integers or Boolean values are
candidate spaces where the cosine distance function can be applied. The cosine of the angle
between the input vectors is the ratio of the dot product of the input vectors to the product of the
L2-norms of the individual input vectors:
np.dot(x,y) / np.sqrt(np.dot(x,x) * np.dot(y,y))
Let's look at the numerator, where the dot product between the input vectors is calculated:
np.dot(x,y)
We will use the NumPy dot function to get the dot product value. The dot product for the two
vectors, x and y, is defined as follows:
x \cdot y = \sum_{i=1}^{n} x_i y_i
We again use the dot function to find the L2-norm of our input vectors; np.dot(x,x) is equivalent
to the following loop:

tot = 0
for i in range(len(x)):
    tot += x[i] * x[i]
Thus, we can calculate the cosine of the angle between the two input vectors.
We will move on to Jaccard's distance. Similar to the previous invocations, we will define the
sample vectors and invoke the jaccard_distance function.
From vectors of real values, let's move on to sets. The ratio of the sizes of the intersection
and the union of the given input sets is commonly called the Jaccard's coefficient. One
minus this value gives the Jaccard's distance. As you can see, in the implementation, we first
converted the input lists to sets. This allows us to leverage the union and intersection
operations provided by the Python set datatype:
set_x = set(x)
set_y = set(y)
We must use the intersection and union functionalities that are available in the set datatype
in order to calculate the distance.
Our last distance metric is the Hamming distance. Given two bit vectors, the Hamming distance
counts how many bits differ between them:

diff = 0
for char1,char2 in zip(x,y):
    if char1 != char2:
        diff+=1
return diff
As you can see, we used the zip functionality to check each pair of bits and maintain a counter
of how many bits differ. The Hamming distance is typically used with categorical variables.
There's more...
Remember that by subtracting our distance values from one, we can arrive at a similarity value.
Yet another distance that we didn't go into in detail, but is used prevalently, is the Manhattan
or city block distance. It's an L1-norm distance. By passing an r value as 1 to the Lr-norm
distance function, we will get the Manhattan distance.
Depending on the underlying space in which the data is placed, an appropriate distance
measure needs to be selected. When using these distances in algorithms, we need to be
mindful of the underlying space. For example, in the k-means algorithm, at every step the
cluster center is calculated as the average of all the points that are close to each other. A
nice property of Euclidean space is that the average of the points exists as a point in the same
space. Note that our input for the Jaccard's distance was sets; an average of sets does not
make any sense.
The distance measures supported by SciPy are listed in the documentation of the scipy.spatial.distance module.
Additionally, the scikit-learn pairwise submodule provides you with a method called
pairwise_distances, which can be used to find out the distance matrix from input records.
This can be found at http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html.
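As a quick illustration of the idea (a minimal sketch; the sample data and metric here are chosen only for demonstration):

import numpy as np
from sklearn.metrics import pairwise_distances

data = np.asarray([[1,2,3],[1,2,4],[5,5,5]])
# Distance matrix between every pair of rows in data
dist_matrix = pairwise_distances(data, metric='euclidean')
print dist_matrix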
We had mentioned that the Hamming distance is used with a categorical variable. A point
worth mentioning here is the one-hot encoding that is used typically for categorical variables.
After the one-hot encoding, the Hamming distance can be used as a similarity/distance
measure between the input vectors.
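As an example, here is a minimal sketch of this idea (the category values are hypothetical):

def hamming(u, v):
    # Count the number of positions at which the two vectors differ
    return sum(1 for a, b in zip(u, v) if a != b)

categories = ['red', 'green', 'blue']
# Manual one-hot encoding: each category maps to an indicator vector
one_hot = dict((c, [1 if c == other else 0 for other in categories]) for c in categories)

print hamming(one_hot['red'], one_hot['blue'])   # 2: two distinct categories always differ in two positions
print hamming(one_hot['red'], one_hot['red'])    # 0: identical categories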
Formally, a kernel K is a similarity function: K(x1, x2) > 0 denotes the similarity
of x1 and x2.
Getting ready
Let's define it mathematically before looking at the various kernels:

k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle

Here, x_i and x_j are the input vectors and \phi is a mapping function that is used to transform the
input vectors into a new space. For example, if the input vector is in an n-dimensional space,
the transformation function transforms it into a new space of dimension m, where m >> n:

\langle \phi(x_i), \phi(x_j) \rangle

The above expression is the dot product; x_i and x_j are now transformed into the new space by the
mapping function.
In this recipe, we will see a simple kernel in action.
Our mapping function will be as follows:
\phi(x_1, x_2, x_3) = (x_1^2, x_2^2, x_3^2, x_1 x_2, x_1 x_3, x_2 x_1, x_2 x_3, x_3 x_2, x_3 x_1)
When the original data is supplied to this mapping function, it transforms the input into the
new space.
How to do it
Let's create two input vectors and define the mapping function as described in the
previous section:
import numpy as np

# Simple example to illustrate the kernel function concept.
# 3-dimensional input space
x = np.array([10,20,30])
y = np.array([8,9,10])

# Let us find a mapping function to transform this space
# phi(x1,x2,x3) = (x1x1,x2x2,x3x3,x1x2,x1x3,x2x1,x2x3,x3x2,x3x1)
# this will transform the input space into 9 dimensions
def mapping_function(x):
    output_list = []
    for i in range(len(x)):
        output_list.append(x[i]*x[i])
    output_list.append(x[0]*x[1])
    output_list.append(x[0]*x[2])
    output_list.append(x[1]*x[0])
    output_list.append(x[1]*x[2])
    output_list.append(x[2]*x[1])
    output_list.append(x[2]*x[0])
    return np.array(output_list)
Now, let's look at the main routine to invoke the kernel transformation. In the main function, we
will define a kernel function, pass the input variables to it, and print the output. The kernel is:

k(x, y) = \langle x, y \rangle^2

if __name__ == "__main__":
    # Apply the mapping function
    tranf_x = mapping_function(x)
    tranf_y = mapping_function(y)
    # Print the output
    print tranf_x
    print np.dot(tranf_x,tranf_y)

    # Print the equivalent kernel function's
    # transformation output.
    output = np.power((np.dot(x,y)),2)
    print output
How it works
Let's follow this program from our main function. We created two input vectors, x and y.
Both the vectors are of three dimensions.
We then defined a mapping function. The mapping function uses the input vector values and
transforms the input vector into a new space with an increased dimension. In this case, the
number of the dimension is increased to nine from three.
Let's now apply a mapping function on these vectors in order to increase their dimension
to nine.
If we print tranf_x, we will get the following:
[100 400 900 200 300 200 600 600 300]
As you can see, we transformed our input, x, from three dimensions to a nine-dimensional
vector.
Now, let's take the dot product in the transformed space and print its output.
The output is 313600, a scalar value.
Let's now recap: we first transformed our two input vectors into a higher dimensional space
and then calculated the dot product in order to derive a scalar output.
What we did was a very costly operation of transforming our original three-dimensional vector
to a nine-dimensional vector and then performing the dot product operation on it.
Instead, we can choose a kernel function, which can arrive at the same scalar output without
explicitly transforming the original space into a new space.
Our new kernel is defined as follows:

k(x, y) = \langle x, y \rangle^2

With two inputs, x and y, this kernel computes their dot product and squares it.
After printing the output from the kernel, we get 313600.
We never did the transformation but still were able to get the same result as the dot product
output in the transformed space. This is called the kernel trick.
There was no magic in choosing this kernel. By expanding the kernel, we can arrive at our
mapping function. Refer to the following reference for the expansion details:
http://en.wikipedia.org/wiki/Polynomial_kernel.
There's more...
There are several types of kernels. Based on our data characteristics and algorithm needs, we
need to choose the right kernel. Some of them are as follows:
Linear kernel: This is the simplest kind of kernel function. For two given inputs, it returns the
dot product of the input:
K ( x , y ) = xT y
Polynomial kernel: This is defined as follows:

K(x, y) = (x^T y + c)^d

Here, x and y are the input vectors, d is the degree of the polynomial, and c is a constant. In
our recipe, we used a polynomial kernel of degree 2 (with c = 0).
The following are the scikit-learn implementations of the linear and polynomial kernels:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.linear_kernel.html#sklearn.metrics.pairwise.linear_kernel

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.polynomial_kernel.html#sklearn.metrics.pairwise.polynomial_kernel
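As a small illustration of these functions (a minimal sketch; the gamma and coef0 values are chosen so that the polynomial kernel reduces to the squared dot product used in this recipe):

import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel

x = np.array([[10,20,30]])
y = np.array([[8,9,10]])

print linear_kernel(x, y)                                   # [[ 560.]], that is, dot(x, y)
print polynomial_kernel(x, y, degree=2, gamma=1, coef0=0)   # [[ 313600.]], that is, dot(x, y) squared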
Clustering of data using K-Means
For any clustering algorithm, the quality of its output is determined by intra-cluster
cohesiveness and inter-cluster separation: points in the same cluster should be close to each
other, and points in different clusters should be far away from each other.
Getting ready
Before we jump into how to write the k-means algorithm in Python, there are two key concepts
that we need to cover, which will help us better understand the quality of the output produced
by our algorithm. The first is a definition with respect to the quality of the clusters formed, and
the second is a metric that is used to measure the quality of the clusters.
Every cluster detected by k-means can be evaluated using the following measures:
1. Cluster location: This is the coordinates of the cluster center. K-means starts with
some random points as the cluster center and iteratively finds a new center around
which points that are similar are grouped.
2. Cluster radius: This is the average deviation of all the points from the cluster center.
3. Mass of the cluster: This is the number of points in a cluster.
4. Density of the cluster: This is the ratio of mass of the cluster to its radius.
Now, we will measure the quality of our output clusters. As mentioned previously, this is an
unsupervised problem and we don't have labels against which to check our output in order
to get measures such as precision, recall, accuracy, F1-score, or other similar metrics. The
metric that we will use for our k-means algorithm is called a silhouette coefficient. It takes
values in the range of -1 to 1. Negative values indicate that the cluster radius is greater than
the distance between the clusters so that the clusters overlap. This suggests poor clustering.
Large values, that is, values close to 1, indicate good clustering.
A silhouette coefficient is defined for each point in the cluster. With a cluster, C, and a
point, i, in this cluster, let xi be the average distance of this point from all the other points
in the cluster.
Now, calculate the average distance that the point i has from all the points in another cluster,
D. Pick the smallest of these values and call it yi:
S_i = \frac{y_i - x_i}{\max(x_i, y_i)}
For every cluster, the average of the silhouette coefficient of all the points can serve as a good
measure of the cluster quality. An average of the silhouette coefficient of all the data points
can serve as an overall quality metric for the clusters formed.
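As a tiny worked example of this formula (the numbers are made up for illustration):

# Mean intra-cluster distance for one point
x_i = 0.5
# Mean distance from the same point to the nearest other cluster
y_i = 2.0
s_i = (y_i - x_i) / max(x_i, y_i)
print s_i   # 0.75, close to 1, so the point is well clustered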
We sampled two sets of data from a normal distribution. The first set was drawn with a
mean of 0.2 and a standard deviation of 0.2. For the second set, the mean value was 0.9
and the standard deviation was 0.1. Each dataset was a matrix of size 100 x 100: we have
100 instances and 100 dimensions. Finally, we merged both of them using the row stacking
function from NumPy. Our final dataset was of size 200 x 100.
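The get_random_data function used below is not shown in this excerpt; based on the description above, a minimal version might look like this:

import numpy as np

def get_random_data():
    # Two Gaussian blobs of 100 instances x 100 dimensions each,
    # stacked row-wise into a single 200 x 100 matrix
    x_1 = np.random.normal(loc=0.2,scale=0.2,size=(100,100))
    x_2 = np.random.normal(loc=0.9,scale=0.1,size=(100,100))
    x = np.row_stack((x_1,x_2))
    return x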
Let's do a scatter plot of the data:
x = get_random_data()
plt.cla()
plt.figure(1)
plt.title("Generated Data")
plt.scatter(x[:,0],x[:,1])
plt.show()
Though we plotted only the first and second dimension, you can still clearly see that we have
two clusters. Let's now jump into writing our k-means clustering algorithm.
How to do it
Let's define a function that can perform the k-means clustering for the given data and
a parameter, k. The function fits the clustering on the given data and returns an overall
silhouette coefficient.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def form_clusters(x,k):
    """
    Build clusters
    """
    # k = required number of clusters
    no_clusters = k
    model = KMeans(n_clusters=no_clusters,init='random')
    model.fit(x)
    labels = model.labels_
    print labels
    # Calculate the silhouette score
    sh_score = silhouette_score(x,labels)
    return sh_score
Let's invoke the preceding function for the different values of k and store the returned
silhouette coefficient:
sh_scores = []
for i in range(1,5):
    sh_score = form_clusters(x,i+1)
    sh_scores.append(sh_score)
Finally, let's plot the silhouette coefficient for the different values of k.
no_clusters = [i+1 for i in range(1,5)]
plt.figure(2)
plt.plot(no_clusters,sh_scores)
plt.title("Cluster Quality")
plt.xlabel("No of clusters k")
plt.ylabel("Silhouette Coefficient")
plt.show()
How it works
As mentioned previously, k-means is an iterative algorithm. Roughly, the steps of k-means are
as follows:
1. Initialize k random points from the dataset as initial center points.
2. Do the following till convergence or for a specified number of iterations:
   - Assign each point to the closest cluster center. Typically, Euclidean distance is
     used to find the distance between a point and the cluster center.
   - Recalculate the new cluster centers based on the assignments in this
     iteration.
   - Exit the loop if the cluster assignment of the points remains the same as in the
     previous iteration. The algorithm has then converged to a locally optimal solution.
3. We will leverage the k-means implementation from the scikit-learn library. Our
cluster function takes the k value and dataset as a parameter and runs the
k-means algorithm:
model = KMeans(n_clusters=no_clusters,init='random')
model.fit(x)
no_clusters is the parameter that we will pass to the function. Using the init
parameter, we set the initial center points to be chosen at random. When init is set to random,
scikit-learn chooses k observations (rows) at random from the data as the initial centers.
We then call the fit method to run k-means on our dataset. After fitting, we retrieve the cluster
assignments and compute the silhouette score:

labels = model.labels_
sh_score = silhouette_score(x,labels)
return sh_score

We get the labels, that is, the cluster assignment for each point, and find the silhouette
coefficient for all the points in our clusters.
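To make the rough steps listed above concrete, here is a minimal NumPy sketch of the k-means loop (illustrative only; it is not how scikit-learn implements the algorithm and it does not handle empty clusters):

import numpy as np

def simple_kmeans(x, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    # Step 1: pick k random points from the dataset as the initial centers
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(n_iter):
        # Step 2a: assign each point to its closest center (Euclidean distance)
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2b: recompute each center as the mean of the points assigned to it
        new_centers = np.array([x[labels == j].mean(axis=0) for j in range(k)])
        # Step 2c: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels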
In real-world scenarios, when we start with the k-means algorithm on a dataset, we don't know
the number of clusters present in the data; in other words, we don't know the ideal value for k.
However, in our example, we know that k=2 as we generated the data in such a manner that it
fits in two clusters. Hence, we need to run k-means for the different values of k:
sh_scores = []
for i in range(1,5):
    sh_score = form_clusters(x,i+1)
    sh_scores.append(sh_score)
For each run, that is, each value of k, we store the silhouette coefficient. A plot of k versus the
silhouette coefficient reveals the ideal k value for the dataset:
no_clusters = [i+1 for i in range(1,5)]
plt.figure(2)
plt.plot(no_clusters,sh_scores)
plt.title("Cluster Quality")
plt.xlabel("No of clusters k")
plt.ylabel("Silhouette Coefficient")
plt.show()
There's more...
A couple of points should be noted about k-means. The k-means algorithm cannot be used for
categorical data; for such data, k-medoids is used. Instead of averaging all the points in a cluster in order to
find the cluster center, k-medoids selects the point that has the smallest average distance to all
the other points in the cluster.
Care needs to be taken while assigning the initial cluster. If the data is very dense with very
widely separated clusters, and if the initial random centers are chosen in the same cluster,
k-means may not perform very well.
The presence of nested or other complicated clusters will result in a junk output from k-means.
The presence of outliers in the data may yield poor results. A good practice is to do a thorough
data exploration in order to identify the data characteristics before running k-means.
An alternative method to initialize the centers at the beginning of the algorithm is the
k-means++ method. So, instead of setting the init parameter to random, we can set it to
k-means++. Refer to the following paper for k-means++:

k-means++: the advantages of careful seeding. ACM-SIAM Symposium on Discrete
Algorithms, 2007.
See also
- Working with Distance Measures recipe in Chapter 5, Data Mining - Finding a needle
  in a haystack
Getting ready
Learning Vector Quantization (LVQ) is an online learning algorithm in which the data points are
processed one at a time. It is based on a very simple intuition. Assume that we have prototype
vectors identified for the different classes present in our dataset. Each training point will pull
the prototype of its own class towards it and push the prototypes of the other classes away.
The major steps in LVQ are as follows:
Select k initial prototype vectors for each class in the dataset. If it's a two-class problem
and we decide to have two prototype vectors for each class, we will end up with four initial
prototype vectors. The initial prototype vectors are selected randomly from the input dataset.
We will start our iteration. Our iteration will end when our epsilon value has reached either
zero or a predefined threshold. We will decide an epsilon value and decrement the epsilon
value with every iteration.
In each iteration, we will sample an input point (with replacement) and find the closest
prototype vector to this point. We will use Euclidean distance to find the closest point. We will
update the prototype vector of the closest point, as follows:
If the class label of the prototype vector is the same as that of the input data point, we will move
the prototype vector towards the data point by a fraction (epsilon) of their difference.
If the class labels are different, we will move the prototype vector away from the data point by
the same fraction of the difference.
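In symbols, and consistent with the update method quoted later in the How it works section (p is the closest prototype vector, x the sampled data point, and \epsilon the current learning rate):

p \leftarrow p + \epsilon (x - p) \quad \text{if } class(p) = class(x)

p \leftarrow p - \epsilon (x - p) \quad \text{if } class(p) \neq class(x)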
We will use the Iris dataset to demonstrate how LVQ works. As in some of our previous recipes,
we will use the convenient data loading function from scikit-learn in order to load the Iris
dataset. Iris is a well-known classification dataset. However, our purpose in using it here is
only to demonstrate LVQ's capabilities. Datasets without class labels can also be processed
by LVQ. As we are going to use Euclidean distance, we will scale the data using min-max scaling.
from sklearn.datasets import load_iris
import numpy as np
from sklearn.metrics import euclidean_distances
data = load_iris()
x = data['data']
y = data['target']
# Scale the variables
from sklearn.preprocessing import MinMaxScaler
minmax = MinMaxScaler()
x = minmax.fit_transform(x)
How to do it
1. Let's first declare the parameters for LVQ:
R = 2
n_classes = 3
epsilon = 0.9
epsilon_dec_factor = 0.001
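The step that defines the prototype class is not included in this excerpt. Based on the constructor call and the update method quoted in the How it works section, a minimal version might look like this:

class prototype(object):
    # Holds the class id, the prototype vector, and the current epsilon value
    def __init__(self,class_id,p_vector,epsilon):
        self.class_id = class_id
        self.p_vector = p_vector
        self.epsilon = epsilon

    def update(self,u_vector,increment=True):
        if increment:
            # Move the prototype vector closer to the input vector
            self.p_vector = self.p_vector + self.epsilon*(u_vector - self.p_vector)
        else:
            # Move the prototype vector away from the input vector
            self.p_vector = self.p_vector - self.epsilon*(u_vector - self.p_vector)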
3. This is the function to find the closest prototype vector for a given vector:

def find_closest(in_vector,proto_vectors):
    closest = None
    closest_distance = 99999
    for p_v in proto_vectors:
        distance = euclidean_distances(in_vector,p_v.p_vector)
        if distance < closest_distance:
            closest_distance = distance
            closest = p_v
    return closest

4. A convenient function to find the class ID of the closest prototype vector is as follows:

def find_class_id(test_vector,p_vectors):
    return find_closest(test_vector,p_vectors).class_id
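Step 5 of the recipe selects R initial prototype vectors for each class; its code is not fully shown in this excerpt. Based on the fragment quoted in the How it works section (the class-subset selection lines are an assumption), a minimal version might be:

p_vectors = []
for i in range(n_classes):
    # Select the data points that belong to class i
    x_subset = x[np.where(y == i)]
    # Choose R random points from this class as the initial prototypes
    samples = np.random.randint(0,len(x_subset),R)
    for sample in samples:
        s = x_subset[sample]
        p = prototype(i,s,epsilon)
        p_vectors.append(p)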
Then print the class ID and the initial prototype vector chosen for each class:

for p_v in p_vectors:
    print p_v.class_id,'\t',p_v.p_vector
print
6. Perform iterations to adjust the prototype vectors so that they can be used to classify/cluster
any new incoming points:

while epsilon >= 0.01:
    # Sample a training instance randomly
    rnd_i = np.random.randint(0,149)
    rnd_s = x[rnd_i]
    target_y = y[rnd_i]

    # Decrement epsilon value for the next iteration
    epsilon = epsilon - epsilon_dec_factor

    # Find the closest prototype vector to the given point
    closest_pvector = find_closest(rnd_s,p_vectors)

    # Update the closest prototype vector
    if target_y == closest_pvector.class_id:
        closest_pvector.update(rnd_s)
    else:
        closest_pvector.update(rnd_s,False)
    closest_pvector.epsilon = epsilon

print "class id \t Final Prototype Vector\n"
for p_vector in p_vectors:
    print p_vector.class_id,'\t',p_vector.p_vector
How it works
In step 1, we initialize the parameters for the algorithm. We have chosen our R value as
two, that is, we have two prototype vectors per class label. The Iris dataset is a three-class
problem, so we have six prototype vectors in total. We must choose our epsilon value and
epsilon decrement factor.
Each prototype object holds the class ID to which the prototype vector belongs, the prototype
vector itself, and the epsilon value. It also has an update function that is used to change the
prototype values:

def update(self,u_vector,increment=True):
    if increment:
        # Move the prototype vector closer to the input vector
        self.p_vector = self.p_vector + self.epsilon*(u_vector - self.p_vector)
    else:
        # Move the prototype vector away from the input vector
        self.p_vector = self.p_vector - self.epsilon*(u_vector - self.p_vector)
In step 3, we define the following function, which takes any given vector as the input and a list
of all the prototype vectors. Out of all the prototype vectors, this function returns the closest
prototype vector to the given vector:

for p_v in proto_vectors:
    distance = euclidean_distances(in_vector,p_v.p_vector)
    if distance < closest_distance:
        closest_distance = distance
        closest = p_v
As you can see, it loops through all the prototype vectors to find the closest one. It uses
Euclidean distance to measure the similarity.
Step 4 is a small function that can return the class ID of the closest prototype vector to the
given vector.
Now that we have finished all the required preprocessing for the LVQ algorithm, we can move
on to the actual algorithm in step 5. For each class, we must select the initial prototype vectors.
We then select R random points from each class. The outer loop goes through each class, and
for each class, we select R random samples and create our prototype object, as follows:

samples = np.random.randint(0,len(x_subset),R)
# Select p_vectors
for sample in samples:
    s = x_subset[sample]
    p = prototype(i,s,epsilon)
    p_vectors.append(p)
In step 6, we increment or decrement the prototype vectors iteratively. We loop continuously
till our epsilon value falls below a threshold of 0.01.
We then randomly sample a point from our dataset, as follows:
# Sample a training instance randomly
rnd_i = np.random.randint(0,149)
rnd_s = x[rnd_i]
target_y = y[rnd_i]
If the current point's class ID matches the prototype's class ID, we call the update
method, with the increment set to True, or else we will call the update with the
increment set to False:
# Update the closest prototype vector
if target_y == closest_pvector.class_id:
closest_pvector.update(rnd_s)
else:
closest_pvector.update(rnd_s,False)
Finally, we update the epsilon value for the closest prototype vector:
closest_pvector.epsilon = epsilon
We can get the predicted class ID using the find_class_id function. We pass a point and
all the learned prototype vectors to it to get the class ID.
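The loop that actually builds predicted_y is not shown in this excerpt; using the find_class_id function from step 4, a minimal version might be:

from sklearn.metrics import classification_report

predicted_y = [find_class_id(instance,p_vectors) for instance in x]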
Finally, we use our predicted output to generate a classification report:

print classification_report(y,predicted_y,target_names=['Iris-Setosa','Iris-Versicolour','Iris-Virginica'])
You can see that we have done pretty well with our classification. Keep in mind that we
did not keep a separate test set. Never measure the accuracy of your model based on the
training data. Always use a test set that is unseen by the training routines. We did it only for
illustration purposes.
There's more...
Keep in mind that this technique does not involve any optimization criterion as the other
classification methods do. Hence, it is very difficult to judge how good the generated
prototype vectors are.
In our recipe, we initialized the prototype vectors as random values. You can use the k-means
algorithm to initialize the prototype vectors.
See also
- Clustering of data using K-Means recipe in Chapter 5, Data Mining - Finding a needle
  in a haystack
We will look at the detection of outliers in univariate data in this recipe and then move on to
look at outliers in multivariate and text data.
Getting ready
In this recipe, we will look at the following three methods for outlier detection in univariate
data:
Let's see how we can leverage these methods to spot outliers in univariate data. Before we
jump into the next section, let's create a dataset with outliers so that we can evaluate our
method empirically:
import numpy as np
import matplotlib.pyplot as plt
n_samples = 100
fraction_of_outliers = 0.1
number_inliers = int ( (1-fraction_of_outliers) * n_samples )
number_outliers = n_samples - number_inliers
We will create 100 data points, and 10 percent of them will be outliers:
# Get some samples from a normal distribution
normal_data = np.random.randn(number_inliers,1)
We will use the randn function in the random module of NumPy to generate our inliers. This
will be a sample from a distribution with a mean of zero and a standard deviation of one. Let's
verify the mean and standard deviation of our sample:
# Print the mean and standard deviation
# to confirm the normality of our input data.
mean = np.mean(normal_data,axis=0)
std = np.std(normal_data,axis=0)
print "Mean =(%0.2f) and Standard Deviation (%0.2f)"%(mean[0],std[0])
We will calculate the mean and standard deviation with the functions from NumPy and print
the output. Let's inspect the output:
Mean =(0.24) and Standard Deviation (0.90)
As you can see, the mean is close to zero and the standard deviation is close to one.
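The code that generates the outlier points and stacks them with the inliers into total_data is not shown in this excerpt; a minimal version (the choice of a uniform distribution over [-9, 9] for the outliers is an assumption) might be:

# Get some outlier samples from a wide uniform distribution
outlier_data = np.random.uniform(low=-9,high=9,size=(number_outliers,1))
# Stack the inliers and outliers into a single dataset
total_data = np.r_[normal_data,outlier_data]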
Our y axis is the actual values that we generated and our x axis is a running count. It will be a
good exercise to mark the points that you feel are outliers. We can later compare our program
output with your manual selections.
How to do it
1. Let's start with the median absolute deviation. Then we will plot our values, with the
outliers marked in red:
# Median Absolute Deviation
median = np.median(total_data)
b = 1.4826
mad = b * np.median(np.abs(total_data - median))

outliers = []
# Useful while plotting
outlier_index = []

print "Median absolute Deviation = %.2f"%(mad)

lower_limit = median - (3*mad)
upper_limit = median + (3*mad)

print "Lower limit = %0.2f, Upper limit = %0.2f"%(lower_limit,upper_limit)

for i in range(len(total_data)):
    if total_data[i] > upper_limit or total_data[i] < lower_limit:
        print "Outlier %0.2f"%(total_data[i])
        outliers.append(total_data[i])
        outlier_index.append(i)

plt.figure(2)
plt.title("Outliers using mad")
plt.scatter(range(len(total_data)),total_data,c='b')
plt.scatter(outlier_index,outliers,c='r')
plt.show()
2. Moving on to the mean plus or minus three standard deviations, we will plot our
values, with the outliers colored in red:

# Standard deviation
std = np.std(total_data)
mean = np.mean(total_data)
b = 3

outliers = []
outlier_index = []

lower_limit = mean - b*std
upper_limit = mean + b*std
print "Lower limit = %0.2f, Upper limit = %0.2f"%(lower_limit,upper_limit)

for i in range(len(total_data)):
    x = total_data[i]
    if x > upper_limit or x < lower_limit:
        print "Outlier %0.2f"%(total_data[i])
        outliers.append(total_data[i])
        outlier_index.append(i)
plt.figure(3)
plt.title("Outliers using std")
plt.scatter(range(len(total_data)),total_data,c='b')
plt.scatter(outlier_index,outliers,c='r')
plt.show()
How it works
In step 1, we use the median absolute deviation to detect the outliers in the data:
median = np.median(total_data)
b = 1.4826
mad = b * np.median(np.abs(total_data - median))
We first calculate the median value of our dataset using the median function from NumPy.
Next, we declare a variable with a value of 1.4826. This is a constant to be multiplied with the
absolute deviation from the median. Finally, we calculate the median of absolute deviations of
each entry from the median value and multiply it with the constant, b.
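As a tiny worked example of this calculation (the sample values are made up):

import numpy as np

sample = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
median = np.median(sample)                # 3.0
abs_dev = np.abs(sample - median)         # [ 2.  1.  0.  1.  97.]
mad = 1.4826 * np.median(abs_dev)         # 1.4826 * 1.0 = 1.4826
print median, mad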
Any point that is more than three times the median absolute deviation away from the median
is deemed an outlier by this method:

lower_limit = median - (3*mad)
upper_limit = median + (3*mad)
print "Lower limit = %0.2f, Upper limit = %0.2f"%(lower_limit,upper_limit)
We then calculate the lower and upper limits of the median absolute deviation, as shown
previously, and classify every point as either an outlier or inlier, as follows:

for i in range(len(total_data)):
    if total_data[i] > upper_limit or total_data[i] < lower_limit:
        print "Outlier %0.2f"%(total_data[i])
        outliers.append(total_data[i])
        outlier_index.append(i)
Finally, we have all our outlier points stored in a list by the name of outliers. We must also
store the index of the outliers in a separate list called outlier_index. This is done for the ease
of plotting, as you will see in the next step.
We then plot the original points and outliers. The plot looks as follows:
We then calculate the standard deviation and mean of our dataset. Here, you can see that we
have set b = 3. As the name of our method suggests, we use three standard deviations around
the mean, and b is used for this:

lower_limit = mean - b*std
upper_limit = mean + b*std
print "Lower limit = %0.2f, Upper limit = %0.2f"%(lower_limit,upper_limit)

for i in range(len(total_data)):
    x = total_data[i]
    if x > upper_limit or x < lower_limit:
        print "Outlier %0.2f"%(total_data[i])
        outliers.append(total_data[i])
        outlier_index.append(i)
There's more
As per the definition of outliers, outliers in a given dataset are those points that are far away
from the other points in the data source. The estimates of the center of the dataset and the
spread of the dataset can be used to detect the outliers. In the methods that we outlined
in this recipe, we used the mean and the median as estimates for the center of the data,
and the standard deviation and the median absolute deviation as estimates for the spread.
Spread is also called scale.
Let's do a little bit of rationalization about why our methods work for the detection of
outliers. Let's start with the method that uses the standard deviation. For Gaussian data, we know
that 68.27 percent of the data lies within one standard deviation, 95.45 percent within two, and
99.73 percent within three. Thus, according to our rule, any point that is more than three
standard deviations away from the mean is classified as an outlier. However, this method is not
robust. Let's look at a small example.
Let's sample eight data points from a normal distribution, with the mean as zero and the
standard deviation as one.
Let's use the convenient randn function from NumPy's random module to generate our numbers:

np.random.randn(8)
Let's manually add two outliers, 45 and 69, to this list.
Our dataset now looks as follows:
-1.763348607322289, -0.7581706357821458, 0.4446894368956213,
-0.07724717210195432, 0.1295194428816003, 0.4309609200681169,
-0.05436724238743103, -0.23719402072058543, 45, 69
The mean of the preceding dataset is 11.211 and the standard deviation is 23.523.
Let's look at the upper rule, mean + 3 * std. This is 11.211 + 3 * 23.523 = 81.78.
Now, according to this upper bound rule, both the points, 45 and 69, are not outliers! Both
the mean and the standard deviation are non-robust estimators of the center and scale of
the dataset, as they are extremely sensitive to outliers. If we replace one of the points with an
extreme point in a dataset with n observations, it will completely change the estimate of the
mean and the standard deviation. This property of the estimators is called the finite sample
breakdown point.
The finite sample breakdown point is defined as the proportion
of the observations in a sample that can be replaced before the
estimator fails to describe the data accurately.
Thus, for the mean and standard deviation, the finite sample breakdown point is 0 percent
because in a large sample, replacing even a single point would change the estimators
drastically.
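To see the contrast concretely, here is a small sketch that applies both rules to the ten-point example above (the inlier values are rounded from the list shown earlier):

import numpy as np

data = np.array([-1.7633, -0.7582, 0.4447, -0.0772, 0.1295,
                 0.4310, -0.0544, -0.2372, 45.0, 69.0])

# Robust rule: median + 3 * MAD
median = np.median(data)
mad = 1.4826 * np.median(np.abs(data - median))
print median + 3 * mad                   # roughly 1.8, so 45 and 69 are flagged as outliers

# Non-robust rule: mean + 3 * std
print np.mean(data) + 3 * np.std(data)   # roughly 81.8, so 45 and 69 are not flagged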
See also
- Performing summary statistics and plots recipe in Chapter 1, Using Python for
  Data Science
Getting ready
In the previous recipe, we looked at univariate data. In this one, we will use multivariate data
and try to find outliers. Let's use a very small dataset to understand the LOF algorithm for
outlier detection.
We will create a 5 X 2 matrix, and looking at the data, we know that the last tuple is an outlier.
Let's also plot it as a scatter plot:
from collections import defaultdict
import numpy as np
import matplotlib.pyplot as plt

instances = np.matrix([[0,0],[0,1],[1,1],[1,0],[5,0]])

x = np.squeeze(np.asarray(instances[:,0]))
y = np.squeeze(np.asarray(instances[:,1]))
plt.cla()
plt.figure(1)
plt.scatter(x,y)
plt.show()
LOF works by calculating the local density of each point. Based on the distance of k-nearest
neighbors of a point, the local density of the point is estimated. By comparing the local density
of the point with the densities of its neighbors, outliers are detected. Outliers have a low
density compared with their neighbors.
We will need to go through some term definitions in order to understand LOF:

- The k-distance of object P is the distance between the object P and its kth nearest
  neighbor. K is a parameter of the algorithm.
- The k-distance neighborhood of P is the list of all the objects, Q, whose distance from
  P is either less than or equal to the distance between P and its kth nearest object.
- The Local Reachability Density of P (LRD(P)) is the ratio of the size of the k-distance
  neighborhood of P to the sum of the reachability distances from P to all the objects in
  that neighborhood.
- The Local Outlier Factor of P (LOF(P)) is the average, over P's k-nearest neighbors, of the
  ratio of each neighbor's local reachability density to the local reachability density of P.
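These definitions rely on the reachability distance, which is not defined in this excerpt. Consistent with the max operation used in the step 5 code below, it can be taken as:

reach\text{-}dist_k(P, O) = \max\left(k\text{-}distance(O),\ d(P, O)\right)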
How to do it
1. Let's get the pairwise distance between the points:
k = 2
distance = 'manhattan'
2. Let's calculate the k-distance. We will use heapq to get the k-nearest neighbors:

# Calculate K distance
import heapq
k_distance = defaultdict(tuple)
# For each data point
for i in range(instances.shape[0]):
    # Get its distance to all the other points.
    # Convert array into list for convenience
    distances = dist[i].tolist()
    # Get the K nearest neighbours
    ksmallest = heapq.nsmallest(k+1,distances)[1:][k-1]
    # Get their indices
    ksmallest_idx = distances.index(ksmallest)
    # For each data point store the Kth nearest neighbour and its distance
    k_distance[i]=(ksmallest,ksmallest_idx)
3. Let's calculate the k-distance neighborhood:

# Calculate K distance neighbourhood
import heapq
k_distance_neig = defaultdict(list)
# For each data point
for i in range(instances.shape[0]):
    # Get the points distances to its neighbours
    distances = dist[i].tolist()
    print "k distance neighbourhood",i
    print distances
    # Get the 1 to K nearest neighbours
    ksmallest = heapq.nsmallest(k+1,distances)[1:]
    print ksmallest
    ksmallest_set = set(ksmallest)
    print ksmallest_set
    ksmallest_idx = []
    # Get the indices of the K smallest elements
    for x in ksmallest_set:
        ksmallest_idx.append(all_indices(x,distances))
    # Change a list of lists into a flat list
    ksmallest_idx = [item for sublist in ksmallest_idx for item in sublist]
    # For each data point store the K distance neighbourhood
    k_distance_neig[i].extend(zip(ksmallest,ksmallest_idx))
5. Calculate the LOF:

lof_list = []
# Local Outlier Factor
for i in range(instances.shape[0]):
    lrd_sum = 0
    rdist_sum = 0
    for neigh in k_distance_neig[i]:
        lrd_sum += local_reach_density[neigh[1]]
        rdist_sum += max(k_distance[neigh[1]][0],neigh[0])
    lof_list.append((i,lrd_sum*rdist_sum))
How it works
In step 1, we select our distance metric to be Manhattan and our k value as two. We are
looking at the second nearest neighbor for our data point.
We must then proceed to calculate the pairwise distance between our tuples. The pairwise
distances are stored in the dist matrix. As you can see, the shape of dist is as follows:
>>> dist.shape
(5, 5)
>>>
It is a 5 X 5 matrix, where the rows and columns are individual tuples and the cell value
indicates the distance between them.
In step 2, we then import heapq:
import heapq
heapq is a data structure that is also known as a priority queue. It is similar to a regular
queue except that each element is associated with a priority, and an element with a high
priority is served before an element with a low priority.
Next, we define a dictionary where the key is the tuple ID and the value is the distance of the
tuple to its kth nearest neighbor. In our case, it should be the second nearest neighbor.
We then enter a for loop in order to find the kth nearest neighbor's distance for each of the
data points:
distances = dist[i].tolist()
From our distance matrix, we extract the ith row. As you can see, the ith row captures the
distance between the object i and all the other objects. Remember that the cell value (i,i)
holds the distance to itself. We need to ignore this in the next step. We must convert the array
to a list for our convenience. Let's try to understand this with an example. The distance matrix
looks as follows:
>>> dist
array([[ 0.,  1.,  2.,  1.,  5.],
       [ 1.,  0.,  1.,  2.,  6.],
       [ 2.,  1.,  0.,  1.,  5.],
       [ 1.,  2.,  1.,  0.,  4.],
       [ 5.,  6.,  5.,  4.,  0.]])
Let's assume that we are in the first iteration of our for loop and hence, our i = 0 (remember
that Python indexing starts at 0).
So, now our distances list will look as follows:
[0.0, 1.0, 2.0, 1.0, 5.0]
From this, we need the kth nearest neighbor, that is, the second nearest neighbor, as we have
set K = 2 at the beginning of the program.
Looking at it, we can see that both index 1 and index 3 can be our kth nearest neighbor, as
both have a value of 1.0.
Now, we use the heapq.nsmallest function. Remember that we mentioned that heapq
is a normal queue but with a priority associated with each element. The value of the element
is the priority in this case. When we ask for the n smallest, heapq will return the n smallest
elements:

# Get the Kth nearest neighbours
ksmallest = heapq.nsmallest(k+1,distances)[1:][k-1]

heapq.nsmallest(n, iterable, key=key) is equivalent to sorted(iterable, key=key)[:n];
it returns the n smallest elements from the given dataset. In our case, we need the second
nearest neighbor. Additionally, we need to avoid (i,i), as mentioned previously, so we must
pass n = 3 to heapq.nsmallest. This ensures that it returns the three smallest elements.
We then subset the list to exclude the first element (see [1:] after the nsmallest function call)
and finally retrieve the second nearest neighbor (see [k-1] after [1:]).
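Spelled out on the first row of the distance matrix, this looks as follows:

import heapq

distances = [0.0, 1.0, 2.0, 1.0, 5.0]            # row 0 of the dist matrix
three_smallest = heapq.nsmallest(3, distances)   # [0.0, 1.0, 1.0]
ksmallest = three_smallest[1:][2 - 1]            # drop the self-distance, keep the 2nd nearest: 1.0
print ksmallest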
We must also get the index of the second nearest neighbor of i and store it in our dictionary:

# Get their indices
ksmallest_idx = distances.index(ksmallest)
# For each data point store the Kth nearest neighbour and its distance
k_distance[i]=(ksmallest,ksmallest_idx)
Our tuples have two elements: the distance, and the index of the elements in the distances
array. So, for instance 0, the second nearest neighbor is the element in index 1.
Having calculated the k-distance for all our data points, we then move on to find the
k-distance neighborhood.
In step 3, we find the k-distance neighborhood for each of our data points:
# Calculate K distance neighbourhood
import heapq
k_distance_neig = defaultdict(list)
Similar to our previous step, we import the heapq module and declare a dictionary that
is going to hold our k-distance neighborhood details. Let's recap what the k-distance
neighborhood is:
The k-distance neighborhood of P is the list of all the objects, Q, whose distance from P is
either less than or equal to the distance between P and its kth nearest object:
distances = dist[i].tolist()
# Get the 1 to K nearest neighbours
ksmallest = heapq.nsmallest(k+1,distances)[1:]
ksmallest_set = set(ksmallest)
The first two lines should be familiar to you. We did this in our previous step. Look at the
second line. Here, we invoked n smallest again with n=3 in our case (K+1), but we selected all
the elements in the output list except the first one. (Guess why? The answer is in the previous
step.)
Let's see it in action by printing the values. As usual, in the loop, we assume that we are
seeing the first data point or tuple where i=0.
Our distances list is as follows:
[0.0, 1.0, 2.0, 1.0, 5.0]
These are the 1 to k-nearest neighbors' distances. We now need to find their indices. A simple
list.index call will only return the first match, so we will write the all_indices function in
order to retrieve all the indices:

def all_indices(value, inlist):
    out_indices = []
    idx = -1
    while True:
        try:
            idx = inlist.index(value, idx+1)
            out_indices.append(idx)
        except ValueError:
            break
    return out_indices
With a value and list, all_indices will return all the indices where the value occurs in the
list. We must convert our k smallest to a set:
ksmallest_set = set(ksmallest)
So, [1.0,1.0] becomes a set ([1.0]). Now, using a for loop, we can find all the indices of
the elements:
# Get the indices of the K smallest elements
for x in ksmallest_set:
    ksmallest_idx.append(all_indices(x,distances))
The next for loop is to convert a list of the lists to a list. The all_indices function returns a
list, and we then append this list to the ksmallest_idx list. Hence, we flatten it using the
next for loop.
Finally, we add the k smallest neighborhood to our dictionary:
k_distance_neig[i].extend(zip(ksmallest,ksmallest_idx))
We then add tuples where the first item in the tuple is the distance and the second item is the
index of the nearest neighbor. Let's print the k-distance neighborhood dictionary:
defaultdict(<type 'list'>, {0: [(1.0, 1), (1.0, 3)], 1: [(1.0, 0),
(1.0, 2)], 2: [(1.0, 1), (1.0, 3)], 3: [(1.0, 0), (1.0, 2)], 4: [(4.0,
3), (5.0, 0)]})
For every point, we first find the size of its k-distance neighborhood; for example,
for i = 0, the numerator is len(k_distance_neig[0]), which is 2.
Now, in the inner for loop, we calculate the denominator: the reachability distance to each
point in the k-distance neighborhood, summed up. The ratio of the two is stored in the
local_reach_density dictionary.
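The code for this step (step 4) is not part of this excerpt; a minimal version, consistent with this description and with the names used in step 5, might be:

# Local Reachability Density
local_reach_density = defaultdict(float)
for i in range(instances.shape[0]):
    # Numerator: size of the k-distance neighbourhood of i
    no_neighbours = len(k_distance_neig[i])
    # Denominator: sum of the reachability distances to the neighbours
    denom_sum = 0
    for neigh in k_distance_neig[i]:
        # Reachability distance: max(k-distance of the neighbour, actual distance)
        denom_sum += max(k_distance[neigh[1]][0],neigh[0])
    local_reach_density[i] = no_neighbours / denom_sum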
Finally, in step 5, we calculate the LOF for each point:

for i in range(instances.shape[0]):
    lrd_sum = 0
    rdist_sum = 0
    for neigh in k_distance_neig[i]:
        lrd_sum += local_reach_density[neigh[1]]
        rdist_sum += max(k_distance[neigh[1]][0],neigh[0])
    lof_list.append((i,lrd_sum*rdist_sum))
For each data point, we calculate the sum of the LRDs of its neighbors and the sum of the
reachability distances to its neighbors, and multiply them to get the LOF.
The point with a very high LOF is considered an outlier. Let's print lof_list:
[(0, 4.0), (1, 4.0), (2, 4.0), (3, 4.0), (4, 18.0)]
As you can see, the last point has a very high LOF compared with the others and hence,
it's an outlier.
There's more
You can refer to the following paper in order to understand more about LOF:
LOF: Identifying Density-Based Local Outliers
Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander
Proc. ACM SIGMOD 2000 Int. Conf. on Management of Data, Dallas, TX, 2000