
Machine Learning using Python

Learner’s Guide

Basics of Python
In this notebook, we are going to learn about some of the basic operations and data types within Python and how you can write simple functions and code.

Importing Libraries
Python’s standard library is very extensive, offering a wide range of facilities. The library contains built-in modules (written in C) that
provide access to system functionality such as file I/O that would otherwise be inaccessible to Python programmers, as well as
modules written in Python that provide standardized solutions for many problems that occur in everyday programming.
https://docs.python.org/3/library/

In [1]:
# Importing Libraries
import datetime
import pandas as pd
from pandas import read_csv

print("Libraries imported successfully")

Libraries imported successfully

Data Types in Python


Python has eight built-in data types. Four of those are quite simple, in the sense that they can store a single value:

Integers
Floats
Booleans
Strings

The other four are denoted collections because they can store arbitrary numbers of values. Python's four collection data types
are:

Lists
Tuples
Sets
Dictionaries

The four collection data types and their various operations are depicted below.

In [0]:

## Simple Variables

a_number = 2
a_word = 'dog'

print(a_number) # Printing
type(a_number) # Data Type

Lists are ordered sequences of elements; the order is determined by the order in which elements are added when the list is created or extended.

Lists are created using the [] syntax.


Lists can include mixed data types.
List elements are accessed by index.
Lists are mutable. You can add, remove, and replace values using methods such as append(), extend(), insert(), pop(), and remove(), or the del statement.

Tuples are similar to lists except for the very important fact that they are immutable.

Sets are unordered collections of elements.


Sets are created using the {} syntax.
Sets can include mixed data types.
Set elements are not accessed by index; membership is tested with the in operator.
Elements in a set cannot be repeated.
Sets are mutable. You can add and remove values using functions such as add() and discard().

In [0]:
pets = ['dogs', 'cats', 'fish']
print(pets)
print(pets[0])
print(pets[1])

['dogs', 'cats', 'fish']


dogs
cats

In [0]:

pets.append('hedgehog')
pets

Out[0]:
['dogs', 'cats', 'fish', 'hedgehog']

In [0]:
pets.remove('dogs')
pets

Out[0]:
['cats', 'fish', 'hedgehog']

In [0]:
pets.reverse()
print(pets)

pets.sort()
print(pets)

['hedgehog', 'fish', 'cats']


['cats', 'fish', 'hedgehog']

In [0]:
# Create a tuple that stores the attributes of our golden retriever, `penny`.

penny = (60, 75, 'yellow') #Length (in), Weight (lbs), Color


print( type(penny) )
penny

<class 'tuple'>

Out[0]:
(60, 75, 'yellow')

In [0]:
del penny[1]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-3dca6067f55e> in <module>()
----> 1 del penny[1]

TypeError: 'tuple' object doesn't support item deletion

In [0]:

store_one = {'bulldog', 'parrot', 'hamster', 'fish'}


store_two = {'fish', 'parrot', 'terrier', 'cat'}

print( store_one.intersection(store_two) )
print( store_one.union(store_two) )

{'parrot', 'fish'}
{'cat', 'parrot', 'hamster', 'bulldog', 'fish', 'terrier'}
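
Dictionaries, the fourth collection type listed above, map keys to values. A minimal sketch (not part of the original notebook) showing how a dictionary is created, read, and updated:

pet_ages = {'dog': 3, 'cat': 5}   # keys map to values
print(pet_ages['dog'])            # elements are accessed by key -> 3
pet_ages['fish'] = 1              # add a new key/value pair
print(pet_ages)                   # {'dog': 3, 'cat': 5, 'fish': 1}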

(A Markdown formatting demo appeared here in the original notebook: a plain line, three levels of headers, an italic line, a bold line, and a line combining both.)

So wait, why are there both lists and tuples?


I know, right now it's pretty hard to see why you would use one data type over the other.

Remember that Python was developed with the ideal of being readable and understandable. That means that when you are writing code, you want to make choices that make it easy for others to understand what is going on. For example, if you have a variable that contains values that are never going to change, why would you use a mutable data type such as a list to store it?

Think of the days of the week or the months of the year: Does it make more sense to define

days_of_the_week = ("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")

or

days_of_the_week = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"] ?

By using a tuple to store this data, you are signaling to a reader of your code that the values you stored are not going to change during the execution of your code.

Quick Exercise
pet_store = ['beagle', 'parrot', 'iguana', 'gerbil', 'chameleon', 'fish']

We've sold out of chameleon and iguana, so you need to remove them from the inventory. Use both remove() and del to do so.
How many terrier dogs do we have in the store? (Hint: the list data type has a built-in method called count().)

Flow Control
We will learn how to perform successive operations ("looping") without explicitly coding the commands. This is largely understood to
be controlling the 'flow' of a program.

We will control the flow using:



1. Logical statements to check for conditions (performing operations only if a condition is met)
2. Continuing execution until a condition is met
3. Iterating through a sequence of numbers using the range() function

In order to do this we will learn the commands: if , while , and for

In [0]:
## "Buy a gallon of milk at the store and if there are organic eggs, buy a dozen"

# Define variables needed
milk = 0
eggs = 0
store_has_eggs = True

# Purchasing decisions
milk += 1 # milk = milk + 1
if store_has_eggs == True:
    eggs += 12

# Check what was purchased
print("I purchased ", milk, " gallons of milk")
print("I purchased ", eggs, " eggs")

I purchased 1 gallons of milk


I purchased 12 eggs

Loops
A loop is used to repeat a block of commands multiple times. There are two ways to write a loop, one is a for loop and the other is
a while loop. Typically, you use a for loop when you know how many times you want to loop, and a while loop when looping
is based on a conditional that will be modified during the loop.

While loops
A while loop is pretty simple; its structure looks like:

while a_condition:
    # do something
    ...

and it continues until a_condition is false.

In [0]:
# Our numeric variables
number = 43
divisor = 5
answer = 0

# While loop
while number > 0:
    number = number - divisor
    print( number )

38
33
28
23
18
13
8
3
-2

For loops
A for loop lets us repeat a set of commands a defined number of times. The syntax for a for loop is just:

for item in sequence:
    # do something with item
    ...

But what is a sequence?

There are lots of functions in Python that return a sequence - they are called iterators. An iterator essentially provides the next element in the sequence each time we access it.

The iterator that we will use to demonstrate a for loop is the range() function. The range function gives us a sequence of numbers starting from the first number we give it up to, but not including, the last number we give it.

In [0]:
for i in range(1, 5):
    print(i)
    print(i*3)
    i = 12
    print(i*3)
    print('---')

1
3
36
---
2
6
36
---
3
9
36
---
4
12
36
---

Functions
Writing modular code is good!

Functions are the workhorses of modular programming in Python! So, what's a function?

Whenever you see this syntax:

def function_name():
    statements
    return something

that block of code is a function.

Functions help us avoid repeating the same set of statements every time we want to repeat a task. Functions increase code readability. Functions make code revision and updating easier (you do not have to redo revisions in all the places in your code where the task is needed). Functions make testing of your code easier and more reliable.

In [0]:
# Let's write a really simple function -- a function that "says hello".

def says_hello():
    '''
    Prints the word "Hello"

    input:
        - None
    output:
        - None
    '''
    print('Hello!')

You just wrote a simple function! Notice that after writing it, nothing was printed. That is because you didn't call the function; you only defined it, so Python will know what you're talking about should you choose to write says_hello anywhere.

You call a function just by writing its name along with the parentheses:

In [0]:
# Call the function
says_hello()

Hello!
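
The function above takes no inputs and returns nothing. As an illustration (a hedged sketch, not part of the original notebook), here is a function that accepts parameters and returns a value, which is the pattern you will use most often later on:

def add_numbers(a, b):
    '''
    Returns the sum of the two inputs a and b.
    '''
    return a + b

result = add_numbers(2, 3)
print(result)  # prints 5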

Pandas
In [0]:

import pandas as pd

# Now let us load in some data


url = "http://bit.ly/wkspdata"
titanic_data = pd.read_csv(url)

In [0]:
titanic_data.head() # Gets the first few records

# How do we get the last few records if we wanted to?

Out[0]:

   survived  pclass                                                name     sex   age  sibsp  parch            ticket     fare cabin embarked
0         0       3                             Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2         1       3                              Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3         1       1        Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4         0       3                            Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
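
To answer the question in the cell above: the last few records can be retrieved with tail(), the counterpart of head(). A minimal sketch (not in the original notebook):

titanic_data.tail()    # last 5 rows
titanic_data.tail(10)  # last 10 rows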

In [0]:
# Generate all descriptive statistics

titanic_data.describe()

Out[0]:

        survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

Numpy
In [0]:
import numpy as np # Importing the numpy library

x3 = np.random.randint(10, size=6) # One-dimensional array


print(x3)

[6 0 6 8 7 6]

In [0]:
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)

x3 ndim: 1
x3 shape: (6,)
x3 size: 6

In [0]:

# Accessing elements
x3[0]

Out[0]:
6

In [0]:

# Create a 1D array of numbers


arr = np.arange(20)
arr

Out[0]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [0]:

# Extract all odd numbers from an array

# Input
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# Solution
arr[arr % 2 == 1]

Out[0]:

array([1, 3, 5, 7, 9])

Exercises
Extract all even numbers from an array.
Extract all numbers from an array which are divisible by 3.
Create a simple 1D array with numbers from 0-100.
Read the first 25 elements of a pandas dataframe.

K-Means Clustering
This notebook will walk through some of the basics of K-Means Clustering.

In [1]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets

# Load the iris dataset


iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns = iris.feature_names)
iris_df.head() # See the first 5 rows

Out[1]:

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

How do you find the optimum number of clusters for K Means? How does one determine the value of K?

In [2]:
# Finding the optimum number of clusters for k-means classification
x = iris_df.iloc[:, [0, 1, 2, 3]].values
print(x)

from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)

# Plotting the results onto a line graph,
# allowing us to observe 'The elbow'
plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') # Within cluster sum of squares
plt.show()

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 ...
 [6.2 3.4 5.4 2.3]
 [5.9 3.  5.1 1.8]]
(output truncated: print(x) displays the full 150 x 4 iris feature matrix)

You can clearly see from the above graph why it is called 'the elbow method': the optimum number of clusters is at the elbow, the point where the within-cluster sum of squares (WCSS) stops decreasing significantly as more clusters are added.

From this we choose the number of clusters to be 3.

In [0]:
# Applying kmeans to the dataset / Creating the kmeans classifier
kmeans = KMeans(n_clusters = 3)
kmeans_model = kmeans.fit(x)

y_kmeans = kmeans_model.predict(x)
y_kmeans

Out[0]:

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,
2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2,
2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0], dtype=int32)
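
Before visualizing, one optional sanity check (a hedged sketch, not in the original notebook) is to compare the cluster assignments with the true species labels that ship with the dataset; the cluster numbers themselves are arbitrary:

import pandas as pd

# Rows: true species (0 = setosa, 1 = versicolor, 2 = virginica); columns: cluster ids
print(pd.crosstab(iris.target, y_kmeans, rownames=['species'], colnames=['cluster']))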

In [0]:

# Visualising the clusters - On the first two columns

plt.figure(figsize=(15,6))

plt.subplot(121)
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1],
s = 100, c = 'red', label = 'Iris-setosa')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1],
s = 100, c = 'blue', label = 'Iris-versicolour')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1],
s = 100, c = 'green', label = 'Iris-virginica')

# Plotting the centroids of the clusters


plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,1],
s = 100, c = 'yellow', label = 'Centroids')

plt.xlabel("sepal length (cm)")


plt.ylabel("sepal width (cm)")
plt.legend()

# Second plot
plt.subplot(122)
plt.scatter(x[y_kmeans == 0, 1], x[y_kmeans == 0, 2],
s = 100, c = 'red', label = 'Iris-setosa')
plt.scatter(x[y_kmeans == 1, 1], x[y_kmeans == 1, 2],
s = 100, c = 'blue', label = 'Iris-versicolour')
plt.scatter(x[y_kmeans == 2, 1], x[y_kmeans == 2, 2],
s = 100, c = 'green', label = 'Iris-virginica')

#Plotting the centroids of the clusters


plt.scatter(kmeans.cluster_centers_[:, 1], kmeans.cluster_centers_[:,2],
s = 100, c = 'yellow', label = 'Centroids')

plt.xlabel("sepal width (cm)")


plt.ylabel("petal length (cm)")
plt.legend()

Out[0]:
<matplotlib.legend.Legend at 0x7f30c0e98eb8>

This concludes the K-Means Workshop.

Self Organizing Maps (SOM)
A self-organizing map (SOM) is a type of artificial neural network (ANN) that is trained using unsupervised learning to produce a low-
dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map, and is
therefore a method to do dimensionality reduction.

(Source: Wikipedia)

We will use the example of handwritten digits and classify them using a SOM (the minisom library in Python).

In [1]:
# Import required libraries
import sys

# Since minisom is not part of the Colab library family, we will need to install it using this command
!pip install minisom
from minisom import MiniSom # Library for SOM

import numpy as np
import matplotlib.pyplot as plt

Collecting minisom
Downloading
https://files.pythonhosted.org/packages/bf/dd/b9089c073cc16c4f86758bf7668d056956b9cfb2b539d9d50151e1713
fe0/MiniSom-2.2.1.tar.gz
Building wheels for collected packages: minisom
Building wheel for minisom (setup.py) ... done
Created wheel for minisom: filename=MiniSom-2.2.1-cp36-none-any.whl size=6643
sha256=ca126f2512d61b6929bce56f4a76efda4ad4855e19846f24c77e676b458abf53
Stored in directory:
/root/.cache/pip/wheels/41/42/7d/dd12b479c5ea50cd572d91b8e935e4f11e1302acca329f84e0
Successfully built minisom
Installing collected packages: minisom
Successfully installed minisom-2.2.1

Loading in the data


Let us load in some data

In [0]:
# Importing required libraries for data load and processing
from sklearn import datasets
from sklearn.preprocessing import scale

# load the digits dataset from scikit-learn


digits = datasets.load_digits(n_class=10)
data = digits.data
data = scale(data)
num = digits.target # num[i] is the digit represented by data[i]

In [3]:

digits.data.shape

Out[3]:
(1797, 64)

Let us train a SOM model

Once we have our data, the next step is to train a SOM model. It will take 30-60 seconds to train.

In [4]:
som = MiniSom(30, 30, 64, sigma=4,
              learning_rate=0.5, neighborhood_function='triangle')
som.pca_weights_init(data)
print("Training...")

# Iterations - 5000
som.train_random(data, 5000, verbose=True)
print("\n...SOM model is ready!")

Training...
[ 5000 / 5000 ] 100% - 1399.78 it/s - 0:00:00 left - quantization error: 3.062048466401852

...SOM model is ready!
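
As an optional check (a hedged sketch, not in the original notebook), MiniSom can report the average quantization error of the trained map, which should roughly match the value printed during training:

print(som.quantization_error(data))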

Visualizing the SOM


It is very important to visualize the SOM and see what it looks like.

In [5]:
plt.figure(figsize=(8, 8)) # canvas of size 8*8

wmap = {}
im = 0

# this plots all the numbers on the som grid
for x, t in zip(data, num): # scatterplot
    w = som.winner(x)
    wmap[w] = im
    plt.text(w[0]+.5, w[1]+.5, str(t),
             color=plt.cm.rainbow(t / 10.), fontdict={'weight': 'bold', 'size': 11})
    im = im + 1

# this restricts the axis to be between 0-30 in this case
plt.axis([0, som.get_weights().shape[0], 0, som.get_weights().shape[1]])

plt.show()

In [6]:
plt.figure(figsize=(10, 10), facecolor='white')

cnt = 0
for j in reversed(range(20)): # images mosaic
    for i in range(20):
        plt.subplot(20, 20, cnt+1, frameon=True, xticks=[], yticks=[])
        if (i, j) in wmap:
            plt.imshow(digits.images[wmap[(i, j)]],
                       cmap='Greys', interpolation='nearest')
        else:
            plt.imshow(np.zeros((8, 8)), cmap='Greys')
        cnt = cnt + 1

plt.tight_layout()
plt.show()

This is the end of the workshop. Questions?

Feature Engineering
Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms
work. If feature engineering is done correctly, it increases the predictive power of machine learning algorithms by creating features
from raw data that help facilitate the machine learning process. Feature Engineering is an art.

Manual Feature Engineering

In [2]:

# Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import numpy as np
from sklearn import tree
from sklearn.model_selection import GridSearchCV

# Figures inline and set visualization style


%matplotlib inline
sns.set()

# Import data
df_train = pd.read_csv('https://raw.githubusercontent.com/DesmondStone/dataXaltius/master/titanic/train.csv')
df_test = pd.read_csv('https://raw.githubusercontent.com/DesmondStone/dataXaltius/master/titanic/test.csv')

# Store target variable of training data in a safe place


survived_train = df_train.Survived

# Concatenate training and test sets


data = pd.concat([df_train.drop(['Survived'], axis=1), df_test])

# View head
# data.info()
df_train

Out[2]:

     PassengerId  Survived  Pclass                                                Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0              1         0       3                             Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2              3         1       3                              Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3              4         1       1        Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4              5         0       3                            Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
..           ...       ...     ...                                                 ...     ...   ...    ...    ...               ...      ...   ...      ...
889          890         1       1                               Behr, Mr. Karl Howell    male  26.0      0      0            111369  30.0000  C148        C
890          891         0       3                                 Dooley, Mr. Patrick    male  32.0      0      0            370376   7.7500   NaN        Q

891 rows × 12 columns
(output truncated: the notebook displays the first and last 30 rows of df_train)

Why Feature Engineer At All?


You perform feature engineering to extract more information from your data, so that you can up your game when building models.

Titanic's Passenger Titles


Let's check out what this is all about by looking at an example. Let's check out the 'Name' column with the help of the .tail() method,
which helps you to see the last five rows of your data:

In [3]:
# View tail of 'Name' column
data.Name.tail()

Out[3]:
413 Spector, Mr. Woolf
414 Oliva y Ocana, Dona. Fermina
415 Saether, Mr. Simon Sivertsen
416 Ware, Mr. Frederick
417 Peter, Master. Michael J
Name: Name, dtype: object

Suddenly, you see different titles emerging! In other words, this column contains strings or text that contain titles, such as 'Mr',
'Master' and 'Dona'.

These titles of course give you information on social status, profession, etc., which in the end could tell you something more about
survival.

At first sight, it might seem like a difficult task to separate the names from the titles, but don't panic! Remember, you can easily use
regular expressions to extract the title and store it in a new column 'Title':

In [4]:
# Extract Title from Name, store in column and plot barplot
data['Title'] = data.Name.apply(lambda x: re.search(' ([A-Z][a-z]+)\.', x).group(1))
sns.countplot(x='Title', data=data);
plt.xticks(rotation=45);

In [5]:
data['Title'] = data['Title'].replace({'Mlle':'Miss', 'Mme':'Mrs', 'Ms':'Miss'})
data['Title'] = data['Title'].replace(['Don', 'Dona', 'Rev', 'Dr', 'Major', 'Lady', 'Sir',
                                       'Col', 'Capt', 'Countess', 'Jonkheer'], 'Special')
sns.countplot(x='Title', data=data);
plt.xticks(rotation=45);
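
To see why this new feature might matter, a hedged follow-up (not in the original notebook) is to compute the survival rate per extracted title on the training set. It reuses the same regular expression and assumes df_train and re are still available from the cells above; note that the raw training CSV uses the capitalized column names Name and Survived:

df_train['Title'] = df_train.Name.apply(lambda x: re.search(' ([A-Z][a-z]+)\.', x).group(1))
print(df_train.groupby('Title')['Survived'].mean().sort_values(ascending=False))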

Feature Sets
Feature Selection is one of the core concepts in machine learning which hugely impacts the performance of your model. The data
features that you use to train your machine learning models have a huge influence on the performance you can achieve.

Why?
Irrelevant or partially relevant features can negatively impact model performance.

Feature selection and Data cleaning should be the first and most important step of your model designing.

Feature Selection is the process where you automatically or manually select those features which contribute most to the prediction variable or output in which you are interested.

Having irrelevant features in your data can decrease the accuracy of the models and make your model learn based on irrelevant
features.

How do you select features, and what are the benefits of performing feature selection before modeling your data?

· Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.

· Improves Accuracy: Less misleading data means modeling accuracy improves.

· Reduces Training Time: Fewer data points reduce algorithm complexity and algorithms train faster.

Description of variables in the below file

battery_power: Total energy a battery can store in one time measured in mAh
blue: Has Bluetooth or not
clock_speed: the speed at which microprocessor executes instructions
dual_sim: Has dual sim support or not
fc: Front Camera megapixels
four_g: Has 4G or not
int_memory: Internal Memory in Gigabytes
m_dep: Mobile Depth in cm
mobile_wt: Weight of mobile phone
n_cores: Number of cores of the processor
pc: Primary Camera megapixels
px_height: Pixel Resolution Height
px_width: Pixel Resolution Width
ram: Random Access Memory in MegaBytes
sc_h: Screen Height of mobile in cm
sc_w: Screen Width of mobile in cm
talk_time: the longest time that a single battery charge will last while talking
three_g: Has 3G or not
touch_screen: Has touch screen or not
wifi: Has wifi or not
price_range: This is the target variable with a value of 0(low cost), 1(medium cost), 2(high cost) and 3(very high cost).

1) Univariate Selection
Statistical tests can be used to select those features that have the strongest relationship with the output variable.

The scikit-learn library provides the SelectKBest class that can be used with a suite of different statistical tests to select a specific
number of features.

The example below uses the chi-squared (chi²) statistical test for non-negative features to select 10 of the best features from the
Mobile Price Range Prediction Dataset.

In [1]:
# dataset train - https://tinyurl.com/y2v7doco

import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
data = pd.read_csv("https://tinyurl.com/y2v7doco")
X = data.iloc[:,0:20] #independent columns
y = data.iloc[:,-1] #target column i.e price range

#apply SelectKBest class to extract top 10 best features


bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)

#concat two dataframes for better visualization


featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score'] #naming the dataframe columns
print(featureScores.nlargest(10,'Score')) #print 10 best features

Specs Score
13 px_width 852.914979
14 ram 562.837207
0 id 223.566155
12 px_height 46.347162
9 mobile_wt 42.328627
5 fc 15.793117
11 pc 11.148155
7 int_memory 1.372252
3 clock_speed 1.052762
16 sc_w 0.809077
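
If you then want to reduce the dataset to only those selected features, a hedged sketch (not in the original notebook) uses the fitted selector's boolean mask from get_support():

selected_columns = X.columns[bestfeatures.get_support()]
X_selected = X[selected_columns]   # keeps only the 10 best-scoring columns
print(X_selected.columns.tolist())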

2) Feature Importance
You can get the feature importance of each feature of your dataset by using the feature_importances_ property of the model.

Feature importance gives you a score for each feature of your data; the higher the score, the more important or relevant the feature is to your output variable.

Feature importance is a built-in attribute of tree-based classifiers; we will be using ExtraTreesClassifier to extract the top 10 features for the dataset.

In [2]:
import pandas as pd
import numpy as np

data = pd.read_csv("https://tinyurl.com/y2v7doco")
X = data.iloc[:,0:20] #independent columns
y = data.iloc[:,-1] #target column i.e price range
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
model = ExtraTreesClassifier()
model.fit(X,y)
print(model.feature_importances_) #use inbuilt class feature_importances of tree based classifiers

#plot graph of feature importances for better visualization


feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.plot(kind='barh')
plt.show()

/usr/local/lib/python3.6/dist-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default
value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

[0.06540314 0.05949467 0.02263201 0.05619236 0.0312657  0.05350424
 0.02864679 0.05794044 0.05265377 0.05928094 0.04821835 0.05517616
 0.06138363 0.06606335 0.05446478 0.05632481 0.05871349 0.05581697
 0.02448873 0.03233566]
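
The text above mentions extracting the top 10 features, while the cell plots every importance; a hedged one-line variation (not in the original notebook) restricts the plot to the ten largest scores:

feat_importances.nlargest(10).plot(kind='barh')
plt.show()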

3) Correlation Matrix with Heatmap

Correlation states how the features are related to each other or to the target variable.

Correlation can be positive (an increase in one feature value increases the value of the target variable) or negative (an increase in one feature value decreases the value of the target variable).

A heatmap makes it easy to identify which features are most related to the target variable; we will plot a heatmap of correlated features using the seaborn library.

In [0]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt  # needed for plt.figure below

data = pd.read_csv("https://tinyurl.com/y2v7doco")
X = data.iloc[:,0:20] #independent columns
y = data.iloc[:,-1] #target column i.e price range

#get correlations of each feature in the dataset
corrmat = data.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20,20))

#plot heat map
g=sns.heatmap(data[top_corr_features].corr(),annot=True,cmap="RdYlGn")

Have a look at the heatmap and see how the columns are related to each other.

Logistic Regression
We have previously seen how linear regression works well for predicting continuous outputs that can easily fit to a line/plane. But
linear regression doesn't fare well for classification. This is where we need to use logistic regression.

In [1]:
# Import all required libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Now let us load in some data


url = "http://bit.ly/wkspdata"
df = pd.read_csv(url)

df.head()

Out[1]:

   survived  pclass                                                name     sex   age  sibsp  parch            ticket     fare cabin embarked
0         0       3                             Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2         1       3                              Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3         1       1        Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4         0       3                            Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

Note: The LogisticRegression class in Scikit-learn uses coordinate descent to solve the fit. However, we are going to use
Scikit-learn's SGDClassifier class which uses stochastic gradient descent. We want to use this optimization approach because
we will be using this for the models in subsequent lessons.

In [0]:

# Import packages
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

Data Preprocessing

In [0]:
# Preprocessing
def preprocess(df):

    # Drop rows with NaN values
    df = df.dropna()

    # Drop text based features (we'll learn how to use them in later lessons)
    features_to_drop = ["name", "cabin", "ticket"]
    df = df.drop(features_to_drop, axis=1)

    # pclass, sex, and embarked are categorical features -> We need to convert
    # them to numerical figures for training
    categorical_features = ["pclass","embarked","sex"]
    df = pd.get_dummies(df, columns=categorical_features)

    return df

In [9]:

# Preprocess the dataset
df = preprocess(df)
df.head()

Out[9]:

survived age sibsp parch fare pclass_1 pclass_2 pclass_3 embarked_C embarked_Q embarked_S sex_female sex_male

1 1 38.0 1 0 71.2833 1 0 0 1 0 0 1 0

3 1 35.0 1 0 53.1000 1 0 0 0 0 1 1 0

6 0 54.0 0 0 51.8625 1 0 0 0 0 1 0 1

10 1 4.0 1 1 16.7000 0 0 1 0 0 1 1 0

11 1 58.0 0 0 26.5500 1 0 0 0 0 1 1 0

Model Training Phase

In [0]:

# get my input and output


X = df.iloc[:,1:13].values # inputs
y = df.iloc[:,0].values # output

In [0]:
# Splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2)
# default test size = 0.25

In [19]:
# Initialize the model
log_reg = SGDClassifier(loss="log", penalty="none", max_iter=50)

# Train
log_reg.fit(X=X_train, y=y_train) # train the model

Out[19]:
SGDClassifier(alpha=0.0001, average=False, class_weight=None,
early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=50,
n_iter_no_change=5, n_jobs=None, penalty='none', power_t=0.5,
random_state=None, shuffle=True, tol=0.001,
validation_fraction=0.1, verbose=0, warm_start=False)

Predicting Models

In [21]:

# Predictions (unstandardize them)


pred_test = log_reg.predict(X_test)
print (pred_test)

# Make it look better


df = pd.DataFrame({'Original Survived':y_test, 'Predicted Survived':pred_test})
df

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 0]

Out[21]:

Original Survived Predicted Survived

0 1 1

1 1 1
2 0 1

3 1 1

4 1 1

5 1 1

6 0 1

7 0 1

8 1 1

9 1 1

10 1 1

11 0 1

12 0 1

13 1 1

14 1 1

15 1 1

16 1 1

17 0 1

18 0 0

19 1 1

20 1 1

21 0 1

22 1 1

23 0 1

24 0 1

25 1 1

26 0 1

27 1 1

28 0 1

29 1 1

30 0 1

31 1 1

32 1 1

33 1 1

34 0 1

35 1 1

36 1 1

37 0 1

38 1 1

39 0 1

40 1 1

41 0 1

42 1 1

43 1 1

44 1 1

45 0 0

MODEL EVALUATION
Now that we have seen many models being built including the one above, it is very important to understand how to evaluate a
model and also look at the different performance metrics associated with it. We will take this as an example to understand
some of that.

In [0]:
from sklearn.metrics import accuracy_score, confusion_matrix

# ANOTHER STYLE OF WRITING


#from sklearn import metrics
#metrics.accuracy_score
#metrics.confusion_matrix

In [26]:
# Accuracy
test_acc = accuracy_score(y_test, pred_test)
print ("Test acc: {}".format(test_acc))

Test acc: 0.6521739130434783

So far we have looked at accuracy, which gives one view of our model's performance. But we have several other options when it comes to evaluation metrics.

In [0]:
from sklearn.metrics import precision_recall_fscore_support

prfscore = precision_recall_fscore_support(y_test, pred_test)
prfscore

Out[0]:
(array([1. , 0.63265306]),
array([0.05263158, 1. ]),
array([0.1 , 0.775]),
array([19, 31]))

The above are the different performance metrics for both the classes, since this is a binary
classification problem.

In [27]:
confusion_matrix(y_test, pred_test)

Out[27]:
array([[ 2, 16],
[ 0, 28]])

Inference
Let us first see if you survived!

In [29]:
# Input your personal information
X_infer = pd.DataFrame([{"name": "Abhi", "cabin": "E", "ticket": "E44",
                         "pclass": 1, "age": 24, "sibsp": 1, "parch": 2,
                         "fare": 100, "embarked": "C", "sex": "male"}])

# Apply preprocessing
X_infer = preprocess(X_infer)

# Add missing columns
# (note: this step assumes X_train / X_test were kept as DataFrames with .columns,
#  i.e. that the split was done on df.iloc[:, 1:13] rather than its .values array)
missing_features = set(X_test.columns) - set(X_infer.columns)
for feature in missing_features:
    X_infer[feature] = 0

# Reorganize header
X_infer = X_infer[X_train.columns]
X_infer.head()

Out[29]:

age fare parch sibsp pclass_1 embarked_C sex_male

0 24 100 2 1 1 1 1

In [0]:
# Predict
y_infer = log_reg.predict_proba(X_infer)
classes = {0: "died", 1: "survived"}
_class = np.argmax(y_infer)
print ("I would have {0} with about {1:.0f}% probability on the Titanic expedition!".format(
classes[_class], y_infer[0][_class]*100.0))

I would have survived with about 100% probability on the Titanic expedition!

K-FOLD Cross Validation

Instead of splitting the data once at the beginning into train/val/test sets, we do this k (usually k=5 or 10) times with different training and evaluation sets.

Steps:

1. Shuffle the train dataset randomly.
2. Split the dataset into k distinct groups.
3. For each iteration k, choose one of the groups to be your test set and the rest as your training set.
4. Repeat so that each group experiences being part of the test and train set.
5. Train a model using randomly initialized weights.
6. After each iteration k, reinitialize the model with the same randomly initialized weights and repeat on the new test set.

In [0]:

from sklearn.model_selection import cross_val_score

# K-fold cross validation


log_reg = SGDClassifier(loss="log", penalty="none", max_iter=10)
scores = cross_val_score(log_reg, X_train, y_train, cv=10, scoring="accuracy")
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())

Scores: [0.6        0.35714286 0.69230769 0.69230769 0.69230769 0.30769231
 0.84615385 0.69230769 0.84615385 0.76923077]
Mean: 0.6495604395604395
Standard Deviation: 0.17428893612863966

/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/stochastic_gradient.py:183:
FutureWarning: max_iter and tol parameters have been added in SGDClassifier in 0.19. If max_iter is set but tol is left unset, the default value for tol in 0.19 and 0.20 will be None (which is equivalent to -infinity, so it has no effect) but will change in 0.21 to 1e-3. Specify tol to silence this warning.
  FutureWarning)
(the same warning is emitted once for each of the 10 folds)

This is the end of the workshop! Questions?

Naive Bayes Classifier
In this workshop we are going to implement a very simple naive Bayes classifier.

In [11]:
%matplotlib inline
# Import all required libraries

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning:
pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
import pandas.util.testing as tm

Gaussian Naive Bayes


Perhaps the easiest naive Bayes classifier to understand is Gaussian naive Bayes. In this classifier, the assumption is that data from
each label is drawn from a simple Gaussian distribution.

Let us consider some randomly generated data.

In [0]:
from sklearn.datasets import make_blobs
X, y = make_blobs(100, 2, centers=2, random_state=2, cluster_std=1.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')

# print(X,y)

Out[0]:

<matplotlib.collections.PathCollection at 0x7f17859ae630>

In [0]:

# Training the model


from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y)

Out[0]:
GaussianNB(priors=None, var_smoothing=1e-09)

Generate some new data and feed it through the model.

In [0]:

# Generating random new points to test


rng = np.random.RandomState(0)
Xnew = [-6, -14] + [14, 18] * rng.rand(2000, 2)

# Getting the output class (0/1 (red/blue)) for the new points
ynew = model.predict(Xnew)

In [0]:

plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')


lim = plt.axis()
plt.scatter(Xnew[:, 0], Xnew[:, 1], c=ynew, s=20, cmap='RdBu', alpha=0.1)
plt.axis(lim);

In [0]:

yprob = model.predict_proba(Xnew) # predict with probabilities


yprob[-8:].round(3) # last eight rows of the dataset (rounding the prob to three digits)

Out[0]:
array([[0.895, 0.105],
[1. , 0. ],
[1. , 0. ],
[1. , 0. ],
[1. , 0. ],
[1. , 0. ],
[0. , 1. ],
[0.153, 0.847]])

The columns give the posterior probabilities of the first and second label, respectively. If you are looking for estimates of uncertainty
in your classification, Bayesian approaches like this can be a useful approach.

Classifying text using Naive Bayes

In [1]:
# Let's download some documents

from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups()
data.target_names

Downloading 20news dataset. This may take a few minutes.


Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)

Out[1]:
['alt.atheism',
'comp.graphics',
'comp.os.ms-windows.misc',
'comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware',
'comp.windows.x',
'misc.forsale',
'rec.autos',
'rec.motorcycles',
'rec.sport.baseball',
'rec.sport.hockey',
'sci.crypt',
'sci.electronics',
'sci.med',
'sci.space',
'soc.religion.christian',
'talk.politics.guns',
'talk.politics.mideast',
'talk.politics.misc',
'talk.religion.misc']

In [0]:
# We will consider a few categories for simplicity
categories = ['rec.sport.baseball', 'rec.motorcycles',
              'sci.space', 'comp.graphics']

train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

In [7]:
# Printing a sample document
print(train.data[3])

From: rah13@cunixb.cc.columbia.edu (Robert A Holak)


Subject: Re: Why does Illustrator AutoTrace so poorly?
Nntp-Posting-Host: cunixb.cc.columbia.edu
Reply-To: rah13@cunixb.cc.columbia.edu (Robert A Holak)
Organization: Columbia University
Lines: 3

A shareware graphics program called Pman has a filter that makes a picture
look like a hand drawing. This picture could probably be converted into
vector format much easier because it is all lines. (With Corel Trace, etc..)

In [0]:
# We need to do some operations to convert the text into
# numbers so that it can be used by the model

from sklearn.feature_extraction.text import TfidfVectorizer


# term frequency inverse document frequency

from sklearn.naive_bayes import MultinomialNB


from sklearn.pipeline import make_pipeline

model = make_pipeline(TfidfVectorizer(), MultinomialNB())

In [9]:
model.fit(train.data, train.target)
pred = model.predict(test.data)

pred

Out[9]:
array([2, 2, 1, ..., 3, 1, 2])

In [12]:
# Confusion matrix on the testing set
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(test.target, pred)

# Plotting the confusion matrix
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');
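
A quick additional check (a hedged sketch, not in the original notebook) is the overall accuracy on the test set, reusing pred from the cell above:

from sklearn.metrics import accuracy_score
print("Test accuracy: {:.3f}".format(accuracy_score(test.target, pred)))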

We can also predict our own sentences

In [13]:

pred = model.predict(['sending a payload to the ISS'])


train.target_names[pred[0]]

Out[13]:
'sci.space'

In [14]:
pred = model.predict_proba(['determining the screen resolution'])
# train.target_names[pred[0]]
train.target_names, pred

Out[14]:
(['comp.graphics', 'rec.motorcycles', 'rec.sport.baseball', 'sci.space'],
array([[0.56797615, 0.12966549, 0.13999519, 0.16236317]]))

Because naive Bayesian classifiers make such stringent assumptions about data, they will generally not perform as well as a more complicated model. That said, they have several advantages:

They are extremely fast for both training and prediction
They provide straightforward probabilistic prediction
They are often very easily interpretable
They have very few (if any) tunable parameters

Support Vector Machines
We will understand how to build a support vector classifier to classify digit data. We will also learn about how to read a confusion
matrix.

In [2]:
# Import all required libraries

import numpy as np # linear algebra


import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from subprocess import check_output

# Standard scientific Python imports


import matplotlib.pyplot as plt
%matplotlib inline

# Import datasets, classifiers and performance metrics


from sklearn import datasets, svm, metrics

print('All libraries imported!')

All libraries imported!

The data that we are going to use today is made of 8x8 images of digits, which is part of the sklearn datasets.

First lets load and investigate the dataset.

In [3]:
# Load the digits dataset
digits = datasets.load_digits()
print('Digits dataset keys \n{}'.format(digits.keys()))

print('dataset target name: \n{}'.format(digits.target_names))


print('shape of the dataset: {} \nand target: {}'.format(digits.data.shape, digits.target.shape))
print('shape of the images: {}'.format(digits.images.shape))

Digits dataset keys


dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])
dataset target name:
[0 1 2 3 4 5 6 7 8 9]
shape of the dataset: (1797, 64)
and target: (1797,)
shape of the images: (1797, 8, 8)

We see that the dataset (digits.data) is composed of 1797 samples, with 64 features, where each feature is a single image pixel. Let's
have a look at the first 4 images, stored in the images attribute of the dataset.

In [4]:
# Plot the data
for i in range(0, 4):
    plt.subplot(2, 4, i + 1)
    plt.axis('off')
    imside = int(np.sqrt(digits.data[i].shape[0]))
    im1 = np.reshape(digits.data[i], (imside, imside))
    plt.imshow(im1, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Training: {}'.format(digits.target[i]))
plt.show()

In [0]:

# Flatten the images


n_samples = len(digits.images)
data_images = digits.images.reshape((n_samples, -1))
# converting the 8*8 image to a 64 length array for training

Before applying a classifier to the data, let's split the data into a training set and a test set.

In [10]:
from sklearn.model_selection import train_test_split

# Splitting into train and test sets


X_train, X_test, y_train, y_test = train_test_split(data_images,digits.target)

print('Training data and target sizes: \n{}, {}'.format(X_train.shape,y_train.shape))


print('Test data and target sizes: \n{}, {}'.format(X_test.shape,y_test.shape))

Training data and target sizes:


(1347, 64), (1347,)
Test data and target sizes:
(450, 64), (450,)

We will now use the above train data to train a classifier

In [14]:
# Create a support vector classifier
classifier = svm.SVC(gamma=0.0001)

# Fit to the training data


classifier.fit(X_train,y_train)

print("SVM has been trained")

SVM has been trained

In [15]:

# Predict the value of the digit on the test data


y_pred = classifier.predict(X_test)

# Printing a sample of the original vs the predicted outputs


print("Original Outputs")
print(y_test[0:5])

print("Predicted Outputs")
print(y_pred[0:5])

Original Outputs
[0 3 9 2 6]
Predicted Outputs
[0 7 9 2 6]

Now let us have a look at the accuracy and the confusion matrix.

In [16]:
### Printing the confusion matrix. Confused?
print("Accuracy Score:\n%s" % metrics.accuracy_score(y_test, y_pred))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(y_test, y_pred))

Accuracy Score:
0.98
Confusion matrix:
[[38 0 0 0 0 0 0 0 0 0]
[ 0 56 0 0 0 0 0 0 0 0]
[ 0 1 39 0 0 0 0 0 0 0]
[ 0 0 0 47 0 0 0 1 0 0]
[ 0 0 0 0 39 0 0 0 1 0]
[ 0 0 0 0 0 46 1 0 0 1]
[ 0 0 0 0 0 0 44 0 1 0]
[ 0 0 0 0 0 0 0 46 1 0]
[ 0 2 0 0 0 0 0 0 44 0]
[ 0 0 0 0 0 0 0 0 0 42]]
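
For per-class detail beyond the raw counts, scikit-learn also offers a classification report; a hedged sketch (not in the original notebook), reusing y_test and y_pred from above:

print(metrics.classification_report(y_test, y_pred))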

We have successfully trained a support vector machine to learn and predict on digits data.

Decision Trees
This workshop deals with understanding the working of decision trees.

In [5]:
# Importing libraries in Python
import sklearn.datasets as datasets
import pandas as pd

# Loading the iris dataset


iris=datasets.load_iris()

# Forming the iris dataframe


df=pd.DataFrame(iris.data, columns=iris.feature_names)
df.head()

Out[5]:

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

In [8]:
y=iris.target
print(y)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]

Now let us define the Decision Tree Algorithm

In [2]:

# Defining the decision tree algorithm


from sklearn.tree import DecisionTreeClassifier
dtree=DecisionTreeClassifier()
dtree.fit(df,y)

print('Decision Tree Classifier Created')

Decision Tree Classifier Created

In [12]:
dtree.decision_path

Out[12]:

<bound method BaseDecisionTree.decision_path of DecisionTreeClassifier(class_weight=None,
            criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False,
            random_state=None, splitter='best')>

Let us visualize the Decision Tree to understand it better.

In [3]:

# Install required libraries


!pip install pydotplus
!apt-get install graphviz -y

Requirement already satisfied: pydotplus in /usr/local/lib/python3.6/dist-packages (2.0.2)


Requirement already satisfied: pyparsing>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from
pydotplus) (2.4.2)
Reading package lists... Done
Building dependency tree
Reading state information... Done
graphviz is already the newest version (2.40.1-2).
0 upgraded, 0 newly installed, 0 to remove and 8 not upgraded.

In [13]:

# Import necessary libraries for graph viz


from sklearn.externals.six import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

# Visualize the graph


dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data, feature_names=iris.feature_names,
filled=True, rounded=True,
special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

Out[13]:
(rendered decision tree graph)

You can now feed any new/test data to this classifier and it would be able to predict the right class accordingly.
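
As a hedged illustration (not in the original notebook), here is what such a prediction could look like; the measurement values are made up for the example:

# sepal length, sepal width, petal length, petal width (cm)
sample = [[5.0, 3.4, 1.6, 0.4]]
print(iris.target_names[dtree.predict(sample)])  # likely prints ['setosa'] for these setosa-like values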

OPTIMIZING ML MODELS
This workshop is designed to teach you how you can optimize your ML models once you have built, trained and tested the
models. Optimization is the last step of a machine learning model process before results can be presented to the user.

So what are we going to optimize?

We are going to optimize Model Hyperparameters. A model hyperparameter is a configuration that is external to the model and
whose value cannot be estimated from data.

There are many strategies to tune model hyperparameters. As part of this workshop we will discuss one technique - Grid Search.

What dataset are we using for this workshop?


We will use the Pima Indian diabetes dataset. The dataset corresponds to a classification problem in which you need to predict whether a person will suffer from diabetes, given the 8 features in the dataset. You can find the complete description of the dataset here.

In [2]:

# Import all required libraries

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

# Reading and displaying the head of the data


data = pd.read_csv("http://bit.ly/opt-data")
data.head()

Out[2]:

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome

0 6 148 72 35 0 33.6 0.627 50 1

1 1 85 66 29 0 26.6 0.351 31 0

2 8 183 64 0 0 23.3 0.672 32 1

3 1 89 66 23 94 28.1 0.167 21 0

4 0 137 40 35 168 43.1 2.288 33 1

Some basic data cleaning to remove all the missing/zero values

In [3]:
# Mark zero values as missing or NaN
data[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']] = \
    data[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']].replace(0, np.NaN)
# Count the number of NaN values in each column
print(data.isnull().sum())

# Fill missing values with mean column values


data.fillna(data.mean(), inplace=True)
# Count the number of NaN values in each column
print(data.isnull().sum())

Pregnancies 0
Glucose 5
BloodPressure 35
SkinThickness 227
Insulin 374
BMI 11
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64

See the results after data cleaning? -> We now have no missing values

Now let's quickly train a model with random hyperparameter values.

In [0]:

# Split dataset into inputs and outputs


values = data.values
X = values[:,0:8]
y = values[:,8]

# Initiate the LR model with random hyperparameters


lr = LogisticRegression(penalty='l2',dual=False,max_iter=130)

# We will optimize these parameters using Grid Search

In [17]:
# Pass data to train the LR Model
lr.fit(X,y);

/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/logistic.py:432: FutureWarning:
Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
FutureWarning)

In [18]:
# Let's check the accuracy of the model
lr.score(X,y)

Out[18]:
0.7669270833333334

Now let's build the model using hyperparameter optimization.

In [0]:

from sklearn.model_selection import GridSearchCV

# Defining the grid parameter values


dual=[True,False]
max_iter=[100,110,120,130,140, 200, 300,50]

param_grid = dict(dual=dual,max_iter=max_iter)

In [24]:
lr = LogisticRegression(penalty='l2')
grid = GridSearchCV(estimator=lr, param_grid=param_grid, cv=10, n_jobs=-1)

grid_result = grid.fit(X, y)

# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.766927 using {'dual': False, 'max_iter': 100}

/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_search.py:814: DeprecationWarning:
The default of the `iid` parameter will change from True to False in version 0.22 and will be
removed in 0.24. This will change numeric results when test-set sizes are unequal.
DeprecationWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/logistic.py:432: FutureWarning:
Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
FutureWarning)
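
Beyond the single best combination, it can be useful to see how every combination in the grid scored. A hedged sketch (not in the original notebook) using the cv_results_ attribute of the fitted search:

means = grid_result.cv_results_['mean_test_score']
params = grid_result.cv_results_['params']
for mean, param in zip(means, params):
    print("%.6f with: %r" % (mean, param))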

You can play around with more parameters to optimize models better.

You also got to know about what role hyperparameter optimization plays in building efficient machine learning models.
