
Machine Learning using Python

Learner’s Guide

Basics of Python
In this notebook, we are going to learn about some of the basic operations and data types within Python and how you can write simple functions and code.

Importing Libraries
Python’s standard library is very extensive, offering a wide range of facilities. The library contains built-in modules (written in C) that
provide access to system functionality such as file I/O that would otherwise be inaccessible to Python programmers, as well as
modules written in Python that provide standardized solutions for many problems that occur in everyday programming.
https://docs.python.org/3/library/

In [1]:
# Importing Libraries
import datetime
import pandas as pd
from pandas import read_csv

print("Libraries imported successfully")

Libraries imported successfully

Data Types in Python


Python has eight built-in data types. Four of those are quite simple, in the sense that they can store a single value:

Integers
Floats
Booleans
Strings

The other four are denoted collections because they can store arbitrary numbers of values. Python's four collection data types
are:

Lists
Tuples
Sets
Dictionaries

The four collection data types and their various operations are depicted below.

In [0]:

## Simple Variables

a_number = 2
a_word = 'dog'

print(a_number) # Printing
type(a_number) # Data Type

Lists are ordered sequences of elements; the order is determined by the order in which elements are added when the list is created or extended.

Lists are created using the [] syntax.


Lists can include mixed data types.
List elements are accessed by index.
Lists are mutable. You can add, remove, and replace values using methods such as append(), extend(), insert(), pop(), and remove(), or the del statement.

Tuples are similar to lists except for the very important fact that they are immutable.

Sets are unordered collections of elements.


Sets are created using the {} syntax.
Sets can include mixed data types.
Set elements are not accessed by index; membership is tested with the in operator.
Elements in a set cannot be repeated.
Sets are mutable. You can add and remove values using functions such as add() and discard().

In [0]:
pets = ['dogs', 'cats', 'fish']
print(pets)
print(pets[0])
print(pets[1])

['dogs', 'cats', 'fish']


dogs
cats

In [0]:

pets.append('hedgehog')
pets

Out[0]:
['dogs', 'cats', 'fish', 'hedgehog']

In [0]:
pets.remove('dogs')
pets

Out[0]:
['cats', 'fish', 'hedgehog']

In [0]:
pets.reverse()
print(pets)

pets.sort()
print(pets)

['hedgehog', 'fish', 'cats']


['cats', 'fish', 'hedgehog']

In [0]:
# Create a tuple that stores the attributes of our golden retriever, `penny`.

penny = (60, 75, 'yellow') #Length (in), Weight (lbs), Color


print( type(penny) )
penny

<class 'tuple'>

Out[0]:
(60, 75, 'yellow')

In [0]:
del penny[1]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-3dca6067f55e> in <module>()
----> 1 del penny[1]

TypeError: 'tuple' object doesn't support item deletion

In [0]:

store_one = {'bulldog', 'parrot', 'hamster', 'fish'}


store_two = {'fish', 'parrot', 'terrier', 'cat'}

print( store_one.intersection(store_two) )
print( store_one.union(store_two) )

{'parrot', 'fish'}
{'cat', 'parrot', 'hamster', 'bulldog', 'fish', 'terrier'}
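
Dictionaries, the fourth collection type listed above, map keys to values. A minimal sketch (not part of the original notebook) showing how a dictionary is created, read, and updated:

pet_ages = {'dog': 3, 'cat': 5}   # keys map to values
print(pet_ages['dog'])            # elements are accessed by key -> 3
pet_ages['fish'] = 1              # add a new key/value pair
print(pet_ages)                   # {'dog': 3, 'cat': 5, 'fish': 1}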

(A Markdown formatting demo appeared here in the original notebook: a plain line, three levels of headers, an italic line, a bold line, and a line combining both.)

So wait, why are there both lists and tuples?


I know, right now it's pretty hard to see why you would use one data type over the other.

Remember that Python was developed with the ideal of being readable and understandable. That means that when you are writing code, you want to make choices that make it easy for others to understand what is going on. For example, if you have a variable that contains values that are never going to change, why would you use a mutable data type such as a list to store it?

Think of the days of the week or the months of the year: Does it make more sense to define

days_of_the_week = ("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")

or

days_of_the_week = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"] ?

By using a tuple to store this data, you are signaling to a reader of your code that the values you stored are not going to change during the execution of your code.

Quick Exercise
pet_store = ['beagle', 'parrot', 'iguana', 'gerbil', 'chameleon', 'fish']

We've sold out of chameleon and iguana, so you need to remove them from the inventory. Use both remove() and del to do so.
How many terrier dogs do we have in the store? (Hint: the list data type has a built-in method called count().)

Flow Control
We will learn how to perform successive operations ("looping") without explicitly coding the commands. This is largely understood to
be controlling the 'flow' of a program.

We will control the flow using:



1. Logical statements to check for conditions (performing operations only if a condition is met)
2. Continuing execution until a condition is met
3. Iterating through a sequence of numbers using the range() function

In order to do this we will learn the commands: if , while , and for

In [0]:
## "Buy a gallon of milk at the store and if there are organic eggs, buy a dozen"

# Define variables needed
milk = 0
eggs = 0
store_has_eggs = True

# Purchasing decisions
milk += 1 # milk = milk + 1
if store_has_eggs == True:
    eggs += 12

# Check what was purchased
print("I purchased ", milk, " gallons of milk")
print("I purchased ", eggs, " eggs")

I purchased 1 gallons of milk


I purchased 12 eggs

Loops
A loop is used to repeat a block of commands multiple times. There are two ways to write a loop, one is a for loop and the other is
a while loop. Typically, you use a for loop when you know how many times you want to loop, and a while loop when looping
is based on a conditional that will be modified during the loop.

While loops
A while loop is pretty simple; its structure looks like:

while a_condition:
    # do something
    ...

and it continues until a_condition is false.

In [0]:
# Our numeric variables
number = 43
divisor = 5
answer = 0

# While loop
while number > 0:
    number = number - divisor
    print( number )

38
33
28
23
18
13
8
3
-2

For loops
A for loop lets us repeat a set of commands a defined number of times. The syntax for a for loop is just:

for item in sequence:
    # do something with item
    ...

But what is a sequence?

There are lots of functions in Python that return a sequence - they are called iterators. An iterator essentially provides the next element in the sequence each time we access it.

The iterator that we will use to demonstrate a for loop is the range() function. The range function gives us a sequence of numbers starting from the first number we give it up to, but not including, the last number we give it.

In [0]:
for i in range(1, 5):
    print(i)
    print(i*3)
    i = 12
    print(i*3)
    print('---')

1
3
36
---
2
6
36
---
3
9
36
---
4
12
36
---

Functions
Writing modular code is good!

Functions are the workhorses of modular programming in Python! So, what's a function?

Whenever you see this syntax:

def function_name():
    statements
    return something

that block of code is a function.

Functions help us avoid repeating the same set of statements every time we want to repeat a task. Functions increase code readability. Functions make code revision and updating easier (you do not have to redo revisions in all the places in your code where the task is needed). Functions make testing of your code easier and more reliable.

In [0]:
# Let's write a really simple function -- a function that "says hello".

def says_hello():
    '''
    Prints the word "Hello"

    input:
        - None
    output:
        - None
    '''
    print('Hello!')

You just wrote a simple function! Notice that after writing it, nothing was printed. That is because you didn't call the function; you only defined it, so Python will know what you're talking about should you choose to write says_hello anywhere.

You call a function just by writing its name along with the parentheses:

In [0]:
# Call the function
says_hello()

Hello!
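
The function above takes no inputs and returns nothing. As an illustration (a hedged sketch, not part of the original notebook), here is a function that accepts parameters and returns a value, which is the pattern you will use most often later on:

def add_numbers(a, b):
    '''
    Returns the sum of the two inputs a and b.
    '''
    return a + b

result = add_numbers(2, 3)
print(result)  # prints 5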

Pandas
In [0]:

import pandas as pd

# Now let us load in some data


url = "http://bit.ly/wkspdata"
titanic_data = pd.read_csv(url)

In [0]:
titanic_data.head() # Gets the first few records

# How do we get the last few records if we wanted to?

Out[0]:

   survived  pclass                                                name     sex   age  sibsp  parch            ticket     fare cabin embarked
0         0       3                             Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2         1       3                              Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3         1       1        Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4         0       3                            Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
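
To answer the question in the cell above: the last few records can be retrieved with tail(), the counterpart of head(). A minimal sketch (not in the original notebook):

titanic_data.tail()    # last 5 rows
titanic_data.tail(10)  # last 10 rows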

In [0]:
# Generate all descriptive statistics

titanic_data.describe()

Out[0]:

        survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

Numpy
In [0]:
import numpy as np # Importing the numpy library

x3 = np.random.randint(10, size=6) # One-dimensional array


print(x3)

[6 0 6 8 7 6]

In [0]:
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)

x3 ndim: 1
x3 shape: (6,)
x3 size: 6

In [0]:

# Accessing elements
x3[0]

Out[0]:
6

In [0]:

# Create a 1D array of numbers


arr = np.arange(20)
arr

Out[0]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [0]:

# Extract all odd numbers from an array

# Input
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# Solution
arr[arr % 2 == 1]

Out[0]:

array([1, 3, 5, 7, 9])

Exercises
Extract all even numbers from an array.
Extract all numbers from an array which are divisible by 3.
Create a simple 1D array with numbers from 0-100.
Read the first 25 elements of a pandas dataframe.

K-Means Clustering
This notebook will walk through some of the basics of K-Means Clustering.

In [1]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets

# Load the iris dataset


iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns = iris.feature_names)
iris_df.head() # See the first 5 rows

Out[1]:

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

How do you find the optimum number of clusters for K Means? How does one determine the value of K?

In [2]:
# Finding the optimum number of clusters for k-means classification
x = iris_df.iloc[:, [0, 1, 2, 3]].values
print(x)

from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)

# Plotting the results onto a line graph,
# allowing us to observe 'The elbow'
plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') # Within cluster sum of squares
plt.show()

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 ...
 [6.2 3.4 5.4 2.3]
 [5.9 3.  5.1 1.8]]
(output truncated: print(x) displays the full 150 x 4 iris feature matrix)

You can clearly see from the above graph why it is called 'the elbow method': the optimum number of clusters is at the elbow, the point where the within-cluster sum of squares (WCSS) stops decreasing significantly as more clusters are added.

From this we choose the number of clusters to be 3.

In [0]:
# Applying kmeans to the dataset / Creating the kmeans classifier
kmeans = KMeans(n_clusters = 3)
kmeans_model = kmeans.fit(x)

y_kmeans = kmeans_model.predict(x)
y_kmeans

Out[0]:

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,
2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2,
2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0], dtype=int32)
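
Before visualizing, one optional sanity check (a hedged sketch, not in the original notebook) is to compare the cluster assignments with the true species labels that ship with the dataset; the cluster numbers themselves are arbitrary:

import pandas as pd

# Rows: true species (0 = setosa, 1 = versicolor, 2 = virginica); columns: cluster ids
print(pd.crosstab(iris.target, y_kmeans, rownames=['species'], colnames=['cluster']))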

In [0]:

# Visualising the clusters - On the first two columns

plt.figure(figsize=(15,6))

plt.subplot(121)
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1],
s = 100, c = 'red', label = 'Iris-setosa')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1],
s = 100, c = 'blue', label = 'Iris-versicolour')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1],
s = 100, c = 'green', label = 'Iris-virginica')

# Plotting the centroids of the clusters


plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,1],
s = 100, c = 'yellow', label = 'Centroids')

plt.xlabel("sepal length (cm)")


plt.ylabel("sepal width (cm)")
plt.legend()

# Second plot
plt.subplot(122)
plt.scatter(x[y_kmeans == 0, 1], x[y_kmeans == 0, 2],
s = 100, c = 'red', label = 'Iris-setosa')
plt.scatter(x[y_kmeans == 1, 1], x[y_kmeans == 1, 2],
s = 100, c = 'blue', label = 'Iris-versicolour')
plt.scatter(x[y_kmeans == 2, 1], x[y_kmeans == 2, 2],
s = 100, c = 'green', label = 'Iris-virginica')

#Plotting the centroids of the clusters


plt.scatter(kmeans.cluster_centers_[:, 1], kmeans.cluster_centers_[:,2],
s = 100, c = 'yellow', label = 'Centroids')

plt.xlabel("sepal width (cm)")


plt.ylabel("petal length (cm)")
plt.legend()

Out[0]:
<matplotlib.legend.Legend at 0x7f30c0e98eb8>

This concludes the K-Means Workshop.

Self Organizing Maps (SOM)
A self-organizing map (SOM) is a type of artificial neural network (ANN) that is trained using unsupervised learning to produce a low-
dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map, and is
therefore a method to do dimensionality reduction.

(Source: Wikipedia)

We will use the example of handwritten digits and classify them using a SOM (the minisom library in Python).

In [1]:
# Import required libraries
import sys

# Since minisom is not part of the Colab library family, we will need to install it using this command
!pip install minisom
from minisom import MiniSom # Library for SOM

import numpy as np
import matplotlib.pyplot as plt

Collecting minisom
Downloading
https://files.pythonhosted.org/packages/bf/dd/b9089c073cc16c4f86758bf7668d056956b9cfb2b539d9d50151e1713
fe0/MiniSom-2.2.1.tar.gz
Building wheels for collected packages: minisom
Building wheel for minisom (setup.py) ... done
Created wheel for minisom: filename=MiniSom-2.2.1-cp36-none-any.whl size=6643
sha256=ca126f2512d61b6929bce56f4a76efda4ad4855e19846f24c77e676b458abf53
Stored in directory:
/root/.cache/pip/wheels/41/42/7d/dd12b479c5ea50cd572d91b8e935e4f11e1302acca329f84e0
Successfully built minisom
Installing collected packages: minisom
Successfully installed minisom-2.2.1

Loading in the data


Let us load in some data

In [0]:
# Importing required libraries for data load and processing
from sklearn import datasets
from sklearn.preprocessing import scale

# load the digits dataset from scikit-learn


digits = datasets.load_digits(n_class=10)
data = digits.data
data = scale(data)
num = digits.target # num[i] is the digit represented by data[i]

In [3]:

digits.data.shape

Out[3]:
(1797, 64)

Let us train a SOM model

Once we have our data, the next step is to train a SOM model. It will take 30-60 seconds to train.

In [4]:
som = MiniSom(30, 30, 64, sigma=4,
              learning_rate=0.5, neighborhood_function='triangle')
som.pca_weights_init(data)
print("Training...")

# Iterations - 5000
som.train_random(data, 5000, verbose=True)
print("\n...SOM model is ready!")

Training...
[ 5000 / 5000 ] 100% - 1399.78 it/s - 0:00:00 left - quantization error: 3.062048466401852

...SOM model is ready!
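
As an optional check (a hedged sketch, not in the original notebook), MiniSom can report the average quantization error of the trained map, which should roughly match the value printed during training:

print(som.quantization_error(data))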

Visualizing the SOM


It is very important to visualize the SOM and see what it looks like.

In [5]:
plt.figure(figsize=(8, 8)) # canvas of size 8*8

wmap = {}
im = 0

# this plots all the numbers on the som grid
for x, t in zip(data, num): # scatterplot
    w = som.winner(x)
    wmap[w] = im
    plt.text(w[0]+.5, w[1]+.5, str(t),
             color=plt.cm.rainbow(t / 10.), fontdict={'weight': 'bold', 'size': 11})
    im = im + 1

# this restricts the axis to be between 0-30 in this case
plt.axis([0, som.get_weights().shape[0], 0, som.get_weights().shape[1]])

plt.show()

In [6]:
plt.figure(figsize=(10, 10), facecolor='white')

cnt = 0
for j in reversed(range(20)): # images mosaic
    for i in range(20):
        plt.subplot(20, 20, cnt+1, frameon=True, xticks=[], yticks=[])
        if (i, j) in wmap:
            plt.imshow(digits.images[wmap[(i, j)]],
                       cmap='Greys', interpolation='nearest')
        else:
            plt.imshow(np.zeros((8, 8)), cmap='Greys')
        cnt = cnt + 1

plt.tight_layout()
plt.show()

This is the end of the workshop. Questions?

Feature Engineering
Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms
work. If feature engineering is done correctly, it increases the predictive power of machine learning algorithms by creating features
from raw data that help facilitate the machine learning process. Feature Engineering is an art.

Manual Feature Engineering

In [2]:

# Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import numpy as np
from sklearn import tree
from sklearn.model_selection import GridSearchCV

# Figures inline and set visualization style


%matplotlib inline
sns.set()

# Import data
df_train = pd.read_csv('https://raw.githubusercontent.com/DesmondStone/dataXaltius/master/titanic/train.csv')
df_test = pd.read_csv('https://raw.githubusercontent.com/DesmondStone/dataXaltius/master/titanic/test.csv')

# Store target variable of training data in a safe place


survived_train = df_train.Survived

# Concatenate training and test sets


data = pd.concat([df_train.drop(['Survived'], axis=1), df_test])

# View head
# data.info()
df_train

Out[2]:

     PassengerId  Survived  Pclass                                                Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0              1         0       3                             Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2              3         1       3                              Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3              4         1       1        Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4              5         0       3                            Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
..           ...       ...     ...                                                 ...     ...   ...    ...    ...               ...      ...   ...      ...
889          890         1       1                               Behr, Mr. Karl Howell    male  26.0      0      0            111369  30.0000  C148        C
890          891         0       3                                 Dooley, Mr. Patrick    male  32.0      0      0            370376   7.7500   NaN        Q

891 rows × 12 columns
(output truncated: the notebook displays the first and last 30 rows of df_train)

Why Feature Engineer At All?


You perform feature engineering to extract more information from your data, so that you can up your game when building models.

Titanic's Passenger Titles


Let's check out what this is all about by looking at an example. Let's check out the 'Name' column with the help of the .tail() method,
which helps you to see the last five rows of your data:

In [3]:
# View tail of 'Name' column
data.Name.tail()

Out[3]:
413 Spector, Mr. Woolf
414 Oliva y Ocana, Dona. Fermina
415 Saether, Mr. Simon Sivertsen
416 Ware, Mr. Frederick
417 Peter, Master. Michael J
Name: Name, dtype: object

Suddenly, you see different titles emerging! In other words, this column contains strings or text that contain titles, such as 'Mr',
'Master' and 'Dona'.

These titles of course give you information on social status, profession, etc., which in the end could tell you something more about
survival.

At first sight, it might seem like a difficult task to separate the names from the titles, but don't panic! Remember, you can easily use
regular expressions to extract the title and store it in a new column 'Title':

In [4]:
# Extract Title from Name, store in column and plot barplot
data['Title'] = data.Name.apply(lambda x: re.search(' ([A-Z][a-z]+)\.', x).group(1))
sns.countplot(x='Title', data=data);
plt.xticks(rotation=45);

In [5]:
data['Title'] = data['Title'].replace({'Mlle':'Miss', 'Mme':'Mrs', 'Ms':'Miss'})
data['Title'] = data['Title'].replace(['Don', 'Dona', 'Rev', 'Dr', 'Major', 'Lady', 'Sir',
                                       'Col', 'Capt', 'Countess', 'Jonkheer'], 'Special')
sns.countplot(x='Title', data=data);
plt.xticks(rotation=45);
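
To see why this new feature might matter, a hedged follow-up (not in the original notebook) is to compute the survival rate per extracted title on the training set. It reuses the same regular expression and assumes df_train and re are still available from the cells above; note that the raw training CSV uses the capitalized column names Name and Survived:

df_train['Title'] = df_train.Name.apply(lambda x: re.search(' ([A-Z][a-z]+)\.', x).group(1))
print(df_train.groupby('Title')['Survived'].mean().sort_values(ascending=False))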

Feature Sets
Feature Selection is one of the core concepts in machine learning which hugely impacts the performance of your model. The data
features that you use to train your machine learning models have a huge influence on the performance you can achieve.

Why?
Irrelevant or partially relevant features can negatively impact model performance.

Feature selection and Data cleaning should be the first and most important step of your model designing.

Feature Selection is the process where you automatically or manually select those features which contribute most to the prediction variable or output in which you are interested.

Having irrelevant features in your data can decrease the accuracy of the models and make your model learn based on irrelevant
features.

How do you select features, and what are the benefits of performing feature selection before modeling your data?

· Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.

· Improves Accuracy: Less misleading data means modeling accuracy improves.

· Reduces Training Time: Fewer data points reduce algorithm complexity and algorithms train faster.

Description of variables in the below file

battery_power: Total energy a battery can store in one time measured in mAh
blue: Has Bluetooth or not
clock_speed: the speed at which microprocessor executes instructions
dual_sim: Has dual sim support or not
fc: Front Camera megapixels
four_g: Has 4G or not
int_memory: Internal Memory in Gigabytes
m_dep: Mobile Depth in cm
mobile_wt: Weight of mobile phone
n_cores: Number of cores of the processor
pc: Primary Camera megapixels
px_height: Pixel Resolution Height
px_width: Pixel Resolution Width
ram: Random Access Memory in MegaBytes
sc_h: Screen Height of mobile in cm
sc_w: Screen Width of mobile in cm
talk_time: the longest time that a single battery charge will last while talking
three_g: Has 3G or not
touch_screen: Has touch screen or not
wifi: Has wifi or not
price_range: This is the target variable with a value of 0(low cost), 1(medium cost), 2(high cost) and 3(very high cost).

1) Univariate Selection
Statistical tests can be used to select those features that have the strongest relationship with the output variable.

The scikit-learn library provides the SelectKBest class that can be used with a suite of different statistical tests to select a specific
number of features.

The example below uses the chi-squared (chi²) statistical test for non-negative features to select 10 of the best features from the
Mobile Price Range Prediction Dataset.

In [1]:
# dataset train - https://tinyurl.com/y2v7doco

import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
data = pd.read_csv("https://tinyurl.com/y2v7doco")
X = data.iloc[:,0:20] #independent columns
y = data.iloc[:,-1] #target column i.e price range

#apply SelectKBest class to extract top 10 best features


bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)

#concat two dataframes for better visualization


featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score'] #naming the dataframe columns
print(featureScores.nlargest(10,'Score')) #print 10 best features

Specs Score
13 px_width 852.914979
14 ram 562.837207
0 id 223.566155
12 px_height 46.347162
9 mobile_wt 42.328627
5 fc 15.793117
11 pc 11.148155
7 int_memory 1.372252
3 clock_speed 1.052762
16 sc_w 0.809077
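
If you then want to reduce the dataset to only those selected features, a hedged sketch (not in the original notebook) uses the fitted selector's boolean mask from get_support():

selected_columns = X.columns[bestfeatures.get_support()]
X_selected = X[selected_columns]   # keeps only the 10 best-scoring columns
print(X_selected.columns.tolist())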

2) Feature Importance
You can get the feature importance of each feature of your dataset by using the feature_importances_ property of the model.

Feature importance gives you a score for each feature of your data; the higher the score, the more important or relevant the feature is to your output variable.

Feature importance is a built-in attribute of tree-based classifiers; we will be using ExtraTreesClassifier to extract the top 10 features for the dataset.

In [2]:
import pandas as pd
import numpy as np

data = pd.read_csv("https://tinyurl.com/y2v7doco")
X = data.iloc[:,0:20] #independent columns
y = data.iloc[:,-1] #target column i.e price range
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
model = ExtraTreesClassifier()
model.fit(X,y)
print(model.feature_importances_) #use inbuilt class feature_importances of tree based classifiers

#plot graph of feature importances for better visualization


feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.plot(kind='barh')
plt.show()

/usr/local/lib/python3.6/dist-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default
value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

[0.06540314 0.05949467 0.02263201 0.05619236 0.0312657  0.05350424
 0.02864679 0.05794044 0.05265377 0.05928094 0.04821835 0.05517616
 0.06138363 0.06606335 0.05446478 0.05632481 0.05871349 0.05581697
 0.02448873 0.03233566]
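
The text above mentions extracting the top 10 features, while the cell plots every importance; a hedged one-line variation (not in the original notebook) restricts the plot to the ten largest scores:

feat_importances.nlargest(10).plot(kind='barh')
plt.show()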

3) Correlation Matrix with Heatmap

Correlation states how the features are related to each other or to the target variable.

Correlation can be positive (an increase in one feature value increases the value of the target variable) or negative (an increase in one feature value decreases the value of the target variable).

A heatmap makes it easy to identify which features are most related to the target variable; we will plot a heatmap of correlated features using the seaborn library.

In [0]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt  # needed for plt.figure below

data = pd.read_csv("https://tinyurl.com/y2v7doco")
X = data.iloc[:,0:20] #independent columns
y = data.iloc[:,-1] #target column i.e price range

#get correlations of each feature in the dataset
corrmat = data.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20,20))

#plot heat map
g=sns.heatmap(data[top_corr_features].corr(),annot=True,cmap="RdYlGn")

Have a look at the heatmap and see how the columns are related to each other.

Logistic Regression
We have previously seen how linear regression works well for predicting continuous outputs that can easily fit to a line/plane. But
linear regression doesn't fare well for classification. This is where we need to use logistic regression.

In [1]:
# Import all required libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Now let us load in some data


url = "http://bit.ly/wkspdata"
df = pd.read_csv(url)

df.head()

Out[1]:

   survived  pclass                                                name     sex   age  sibsp  parch            ticket     fare cabin embarked
0         0       3                             Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2         1       3                              Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3         1       1        Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4         0       3                            Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

Note: The LogisticRegression class in Scikit-learn uses coordinate descent to solve the fit. However, we are going to use
Scikit-learn's SGDClassifier class which uses stochastic gradient descent. We want to use this optimization approach because
we will be using this for the models in subsequent lessons.

In [0]:

# Import packages
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

Data Preprocessing

In [0]:
# Preprocessing
def preprocess(df):

    # Drop rows with NaN values
    df = df.dropna()

    # Drop text based features (we'll learn how to use them in later lessons)
    features_to_drop = ["name", "cabin", "ticket"]
    df = df.drop(features_to_drop, axis=1)

    # pclass, sex, and embarked are categorical features -> We need to convert
    # them to numerical figures for training
    categorical_features = ["pclass","embarked","sex"]
    df = pd.get_dummies(df, columns=categorical_features)

    return df

In [9]:

# Preprocess the dataset
df = preprocess(df)
df.head()

Out[9]:

survived age sibsp parch fare pclass_1 pclass_2 pclass_3 embarked_C embarked_Q embarked_S sex_female sex_male

1 1 38.0 1 0 71.2833 1 0 0 1 0 0 1 0

3 1 35.0 1 0 53.1000 1 0 0 0 0 1 1 0

6 0 54.0 0 0 51.8625 1 0 0 0 0 1 0 1

10 1 4.0 1 1 16.7000 0 0 1 0 0 1 1 0

11 1 58.0 0 0 26.5500 1 0 0 0 0 1 1 0

Model Training Phase

In [0]:

# get my input and output


X = df.iloc[:,1:13].values # inputs
y = df.iloc[:,0].values # output

In [0]:
# Splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2)
# default test size = 0.25

In [19]:
# Initialize the model
log_reg = SGDClassifier(loss="log", penalty="none", max_iter=50)

# Train
log_reg.fit(X=X_train, y=y_train) # train the model

Out[19]:
SGDClassifier(alpha=0.0001, average=False, class_weight=None,
early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=50,
n_iter_no_change=5, n_jobs=None, penalty='none', power_t=0.5,
random_state=None, shuffle=True, tol=0.001,
validation_fraction=0.1, verbose=0, warm_start=False)

Predicting Models

In [21]:

# Predictions (unstandardize them)


pred_test = log_reg.predict(X_test)
print (pred_test)

# Make it look better


df = pd.DataFrame({'Original Survived':y_test, 'Predicted Survived':pred_test})
df

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 0]

Out[21]:

Original Survived Predicted Survived

0 1 1

1 1 1
2 0 1

3 1 1

4 1 1

5 1 1

6 0 1

7 0 1

8 1 1

9 1 1

10 1 1

11 0 1

12 0 1

13 1 1

14 1 1

15 1 1

16 1 1

17 0 1

18 0 0

19 1 1

20 1 1

21 0 1

22 1 1

23 0 1

24 0 1

25 1 1

26 0 1

27 1 1

28 0 1

29 1 1

30 0 1

31 1 1

32 1 1

33 1 1

34 0 1

35 1 1

36 1 1

37 0 1

38 1 1

39 0 1

40 1 1

41 0 1

42 1 1

43 1 1

44 1 1

45 0 0

MODEL EVALUATION
Now that we have seen many models being built including the one above, it is very important to understand how to evaluate a
model and also look at the different performance metrics associated with it. We will take this as an example to understand
some of that.

In [0]:
from sklearn.metrics import accuracy_score, confusion_matrix

# ANOTHER STYLE OF WRITING


#from sklearn import metrics
#metrics.accuracy_score
#metrics.confusion_matrix

In [26]:
# Accuracy
test_acc = accuracy_score(y_test, pred_test)
print ("Test acc: {}".format(test_acc))

Test acc: 0.6521739130434783

So far we have looked at accuracy, which gives one view of our model's performance. But we have several other options when it comes to evaluation metrics.

In [0]:
from sklearn.metrics import precision_recall_fscore_support

prfscore = precision_recall_fscore_support(y_test, pred_test)
prfscore

Out[0]:
(array([1. , 0.63265306]),
array([0.05263158, 1. ]),
array([0.1 , 0.775]),
array([19, 31]))

The above are the different performance metrics for both the classes, since this is a binary
classification problem.

In [27]:
confusion_matrix(y_test, pred_test)

Out[27]:
array([[ 2, 16],
[ 0, 28]])

Inference
Let us first see if you survived!

In [29]:
# Input your personal information
X_infer = pd.DataFrame([{"name": "Abhi", "cabin": "E", "ticket": "E44",
                         "pclass": 1, "age": 24, "sibsp": 1, "parch": 2,
                         "fare": 100, "embarked": "C", "sex": "male"}])

# Apply preprocessing
X_infer = preprocess(X_infer)

# Add missing columns
# (note: this step assumes X_train / X_test were kept as DataFrames with .columns,
#  i.e. that the split was done on df.iloc[:, 1:13] rather than its .values array)
missing_features = set(X_test.columns) - set(X_infer.columns)
for feature in missing_features:
    X_infer[feature] = 0

# Reorganize header
X_infer = X_infer[X_train.columns]
X_infer.head()

Out[29]:

age fare parch sibsp pclass_1 embarked_C sex_male

0 24 100 2 1 1 1 1

In [0]:
# Predict
y_infer = log_reg.predict_proba(X_infer)
classes = {0: "died", 1: "survived"}
_class = np.argmax(y_infer)
print ("I would have {0} with about {1:.0f}% probability on the Titanic expedition!".format(
classes[_class], y_infer[0][_class]*100.0))

I would have survived with about 100% probability on the Titanic expedition!

K-FOLD Cross Validation

Instead of splitting the data once at the beginning into train/val/test sets, we do this k (usually k=5 or 10) times with different training and evaluation sets.

Steps:

1. Shuffle the train dataset randomly.
2. Split the dataset into k distinct groups.
3. For each iteration k, choose one of the groups to be your test set and the rest as your training set.
4. Repeat so that each group experiences being part of the test and train set.
5. Train a model using randomly initialized weights.
6. After each iteration k, reinitialize the model with the same randomly initialized weights and repeat on the new test set.

In [0]:

from sklearn.model_selection import cross_val_score

# K-fold cross validation


log_reg = SGDClassifier(loss="log", penalty="none", max_iter=10)
scores = cross_val_score(log_reg, X_train, y_train, cv=10, scoring="accuracy")
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())

Scores: [0.6        0.35714286 0.69230769 0.69230769 0.69230769 0.30769231
 0.84615385 0.69230769 0.84615385 0.76923077]
Mean: 0.6495604395604395
Standard Deviation: 0.17428893612863966

/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/stochastic_gradient.py:183:
FutureWarning: max_iter and tol parameters have been added in SGDClassifier in 0.19. If max_iter is set but tol is left unset, the default value for tol in 0.19 and 0.20 will be None (which is equivalent to -infinity, so it has no effect) but will change in 0.21 to 1e-3. Specify tol to silence this warning.
  FutureWarning)
(the same warning is emitted once for each of the 10 folds)

This is the end of the workshop! Questions?

Naive Bayes Classifier
In this workshop we are going to implement a very simple naive Bayes classifier.

In [11]:
%matplotlib inline
# Import all required libraries

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning:
pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
import pandas.util.testing as tm

Gaussian Naive Bayes


Perhaps the easiest naive Bayes classifier to understand is Gaussian naive Bayes. In this classifier, the assumption is that data from
each label is drawn from a simple Gaussian distribution.

Let us consider some randomly generated data.

In [0]:
from sklearn.datasets import make_blobs
X, y = make_blobs(100, 2, centers=2, random_state=2, cluster_std=1.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')

# print(X,y)

Out[0]:

<matplotlib.collections.PathCollection at 0x7f17859ae630>

In [0]:

# Training the model


from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y)

Out[0]:
GaussianNB(priors=None, var_smoothing=1e-09)

Generate some new data and feed it through the model.

In [0]:

# Generating random new points to test


rng = np.random.RandomState(0)
Xnew = [-6, -14] + [14, 18] * rng.rand(2000, 2)

# Getting the output class (0/1 (red/blue)) for the new points
ynew = model.predict(Xnew)

In [0]:

plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')


lim = plt.axis()
plt.scatter(Xnew[:, 0], Xnew[:, 1], c=ynew, s=20, cmap='RdBu', alpha=0.1)
plt.axis(lim);

In [0]:

yprob = model.predict_proba(Xnew) # predict with probabilities


yprob[-8:].round(3) # last eight rows of the dataset (rounding the prob to three digits)

Out[0]:
array([[0.895, 0.105],
[1. , 0. ],
[1. , 0. ],
[1. , 0. ],
[1. , 0. ],
[1. , 0. ],
[0. , 1. ],
[0.153, 0.847]])

The columns give the posterior probabilities of the first and second label, respectively. If you are looking for estimates of uncertainty
in your classification, Bayesian approaches like this can be a useful approach.

Classifying text using Naive Bayes

In [1]:
# Let's download some documents

from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups()
data.target_names

Downloading 20news dataset. This may take a few minutes.


Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)

Out[1]:
['alt.atheism',
'comp.graphics',
'comp.os.ms-windows.misc',
'comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware',
'comp.windows.x',
'misc.forsale',
'rec.autos',
'rec.motorcycles',
'rec.sport.baseball',
'rec.sport.hockey',
'sci.crypt',
'sci.electronics',
'sci.med',
'sci.space',
'soc.religion.christian',
'talk.politics.guns',
'talk.politics.mideast',
'talk.politics.misc',
'talk.religion.misc']

In [0]:
# We will consider a few categories for simplicity
categories = ['rec.sport.baseball', 'rec.motorcycles',
              'sci.space', 'comp.graphics']

train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

In [7]:
# Printing a sample document
print(train.data[3])

From: rah13@cunixb.cc.columbia.edu (Robert A Holak)


Subject: Re: Why does Illustrator AutoTrace so poorly?
Nntp-Posting-Host: cunixb.cc.columbia.edu
Reply-To: rah13@cunixb.cc.columbia.edu (Robert A Holak)
Organization: Columbia University
Lines: 3

A shareware graphics program called Pman has a filter that makes a picture
look like a hand drawing. This picture could probably be converted into
vector format much easier because it is all lines. (With Corel Trace, etc..)

In [0]:
# We need to do some operations to convert the text into
# numbers so that it can be used by the model

from sklearn.feature_extraction.text import TfidfVectorizer


# term frequency inverse document frequency

from sklearn.naive_bayes import MultinomialNB


from sklearn.pipeline import make_pipeline

model = make_pipeline(TfidfVectorizer(), MultinomialNB())

In [9]:
model.fit(train.data, train.target)
pred = model.predict(test.data)

pred

Out[9]:
array([2, 2, 1, ..., 3, 1, 2])

In [12]:
# Confusion matrix on the testing set
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(test.target, pred)

# Plotting the confusion matrix
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');
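
A quick additional check (a hedged sketch, not in the original notebook) is the overall accuracy on the test set, reusing pred from the cell above:

from sklearn.metrics import accuracy_score
print("Test accuracy: {:.3f}".format(accuracy_score(test.target, pred)))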

We can also predict our own sentences

In [13]:

pred = model.predict(['sending a payload to the ISS'])


train.target_names[pred[0]]

Out[13]:
'sci.space'

In [14]:
pred = model.predict_proba(['determining the screen resolution'])
# train.target_names[pred[0]]
train.target_names, pred

Out[14]:
(['comp.graphics', 'rec.motorcycles', 'rec.sport.baseball', 'sci.space'],
array([[0.56797615, 0.12966549, 0.13999519, 0.16236317]]))

Because naive Bayesian classifiers make such stringent assumptions about data, they will generally not perform as well as a more complicated model. That said, they have several advantages:

They are extremely fast for both training and prediction
They provide straightforward probabilistic prediction
They are often very easily interpretable
They have very few (if any) tunable parameters

Support Vector Machines
We will understand how to build a support vector classifier to classify digit data. We will also learn about how to read a confusion
matrix.

In [2]:
# Import all required libraries

import numpy as np # linear algebra


import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from subprocess import check_output

# Standard scientific Python imports


import matplotlib.pyplot as plt
%matplotlib inline

# Import datasets, classifiers and performance metrics


from sklearn import datasets, svm, metrics

print('All libraries imported!')

All libraries imported!

The data that we are going to use today is made of 8x8 images of digits, which is part of the sklearn datasets.

First lets load and investigate the dataset.

In [3]:
# Load the digits dataset
digits = datasets.load_digits()
print('Digits dataset keys \n{}'.format(digits.keys()))

print('dataset target name: \n{}'.format(digits.target_names))


print('shape of the dataset: {} \nand target: {}'.format(digits.data.shape, digits.target.shape))
print('shape of the images: {}'.format(digits.images.shape))

Digits dataset keys


dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])
dataset target name:
[0 1 2 3 4 5 6 7 8 9]
shape of the dataset: (1797, 64)
and target: (1797,)
shape of the images: (1797, 8, 8)

We see that the dataset (digits.data) is composed of 1797 samples, with 64 features, where each feature is a single image pixel. Let's
have a look at the first 4 images, stored in the images attribute of the dataset.

In [4]:
# Plot the data
for i in range(0, 4):
    plt.subplot(2, 4, i + 1)
    plt.axis('off')
    imside = int(np.sqrt(digits.data[i].shape[0]))
    im1 = np.reshape(digits.data[i], (imside, imside))
    plt.imshow(im1, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Training: {}'.format(digits.target[i]))
plt.show()

In [0]:

# Flatten the images


n_samples = len(digits.images)
data_images = digits.images.reshape((n_samples, -1))
# converting the 8*8 image to a 64 length array for training

Before applying a classifier to the data, let's split the data into a training set and a test set.

In [10]:
from sklearn.model_selection import train_test_split

# Splitting into train and test sets


X_train, X_test, y_train, y_test = train_test_split(data_images,digits.target)

print('Training data and target sizes: \n{}, {}'.format(X_train.shape,y_train.shape))


print('Test data and target sizes: \n{}, {}'.format(X_test.shape,y_test.shape))

Training data and target sizes:


(1347, 64), (1347,)
Test data and target sizes:
(450, 64), (450,)

We will now use the above train data to train a classifier

In [14]:
# Create a support vector classifier
classifier = svm.SVC(gamma=0.0001)

# Fit to the training data


classifier.fit(X_train,y_train)

print("SVM has been trained")

SVM has been trained

In [15]:

# Predict the value of the digit on the test data


y_pred = classifier.predict(X_test)

# Printing a sample of the original vs the predicted outputs


print("Original Outputs")
print(y_test[0:5])

print("Predicted Outputs")
print(y_pred[0:5])

Original Outputs
[0 3 9 2 6]
Predicted Outputs
[0 7 9 2 6]

Now let us have a look at the accuracy and the confusion matrix.

In [16]:
### Printing the confusion matrix. Confused?
print("Accuracy Score:\n%s" % metrics.accuracy_score(y_test, y_pred))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(y_test, y_pred))

Accuracy Score:
0.98
Confusion matrix:
[[38 0 0 0 0 0 0 0 0 0]
[ 0 56 0 0 0 0 0 0 0 0]
[ 0 1 39 0 0 0 0 0 0 0]
[ 0 0 0 47 0 0 0 1 0 0]
[ 0 0 0 0 39 0 0 0 1 0]
[ 0 0 0 0 0 46 1 0 0 1]
[ 0 0 0 0 0 0 44 0 1 0]
[ 0 0 0 0 0 0 0 46 1 0]
[ 0 2 0 0 0 0 0 0 44 0]
[ 0 0 0 0 0 0 0 0 0 42]]
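
For per-class detail beyond the raw counts, scikit-learn also offers a classification report; a hedged sketch (not in the original notebook), reusing y_test and y_pred from above:

print(metrics.classification_report(y_test, y_pred))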

We have successfully trained a support vector machine to learn and predict on digits data.

Decision Trees
This workshop deals with understanding the working of decision trees.

In [5]:
# Importing libraries in Python
import sklearn.datasets as datasets
import pandas as pd

# Loading the iris dataset


iris=datasets.load_iris()

# Forming the iris dataframe


df=pd.DataFrame(iris.data, columns=iris.feature_names)
df.head()

Out[5]:

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

In [8]:
y=iris.target
print(y)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]

Now let us define the Decision Tree Algorithm

In [2]:

# Defining the decision tree algorithm


from sklearn.tree import DecisionTreeClassifier
dtree=DecisionTreeClassifier()
dtree.fit(df,y)

print('Decision Tree Classifier Created')

Decision Tree Classifier Created

In [12]:
dtree.decision_path

Out[12]:

<bound method BaseDecisionTree.decision_path of DecisionTreeClassifier(class_weight=None,
            criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False,
            random_state=None, splitter='best')>

Let us visualize the Decision Tree to understand it better.

In [3]:

# Install required libraries


!pip install pydotplus
!apt-get install graphviz -y

Requirement already satisfied: pydotplus in /usr/local/lib/python3.6/dist-packages (2.0.2)


Requirement already satisfied: pyparsing>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from
pydotplus) (2.4.2)
Reading package lists... Done
Building dependency tree
Reading state information... Done
graphviz is already the newest version (2.40.1-2).
0 upgraded, 0 newly installed, 0 to remove and 8 not upgraded.

In [13]:

# Import necessary libraries for graph viz


from sklearn.externals.six import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

# Visualize the graph


dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data, feature_names=iris.feature_names,
filled=True, rounded=True,
special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

Out[13]:
(rendered decision tree graph)

You can now feed any new/test data to this classifier and it would be able to predict the right class accordingly.
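
As a hedged illustration (not in the original notebook), here is what such a prediction could look like; the measurement values are made up for the example:

# sepal length, sepal width, petal length, petal width (cm)
sample = [[5.0, 3.4, 1.6, 0.4]]
print(iris.target_names[dtree.predict(sample)])  # likely prints ['setosa'] for these setosa-like values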

OPTIMIZING ML MODELS
This workshop is designed to teach you how you can optimize your ML models once you have built, trained and tested the
models. Optimization is the last step of a machine learning model process before results can be presented to the user.

So what are we going to optimize?

We are going to optimize Model Hyperparameters. A model hyperparameter is a configuration that is external to the model and
whose value cannot be estimated from data.

There are many strategies to tune model hyperparameters. As part of this workshop we will discuss one technique - Grid Search.

What dataset are we using for this workshop?


We will use the Pima Indian diabetes dataset. The dataset corresponds to a classification problem in which you need to predict whether a person will suffer from diabetes, given the 8 features in the dataset. You can find the complete description of the dataset here.

In [2]:

# Import all required libraries

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

# Reading and displaying the head of the data


data = pd.read_csv("http://bit.ly/opt-data")
data.head()

Out[2]:

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome

0 6 148 72 35 0 33.6 0.627 50 1

1 1 85 66 29 0 26.6 0.351 31 0

2 8 183 64 0 0 23.3 0.672 32 1

3 1 89 66 23 94 28.1 0.167 21 0

4 0 137 40 35 168 43.1 2.288 33 1

Some basic data cleaning to remove all the missing/zero values

In [3]:
# Mark zero values as missing or NaN
data[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']] = \
    data[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']].replace(0, np.NaN)
# Count the number of NaN values in each column
print(data.isnull().sum())

# Fill missing values with mean column values


data.fillna(data.mean(), inplace=True)
# Count the number of NaN values in each column
print(data.isnull().sum())

Pregnancies 0
Glucose 5
BloodPressure 35
SkinThickness 227
Insulin 374
BMI 11
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64

See the results after data cleaning? -> We now have no missing values

Now let's quickly train a model with random hyperparameter values.

In [0]:

# Split dataset into inputs and outputs


values = data.values
X = values[:,0:8]
y = values[:,8]

# Initiate the LR model with random hyperparameters


lr = LogisticRegression(penalty='l2',dual=False,max_iter=130)

# We will optimize these parameters using Grid Search

In [17]:
# Pass data to train the LR Model
lr.fit(X,y);

/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/logistic.py:432: FutureWarning:
Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
FutureWarning)

In [18]:
# Let's check the accuracy of the model
lr.score(X,y)

Out[18]:
0.7669270833333334

Now let's build the model using hyperparameter optimization.

In [0]:

from sklearn.model_selection import GridSearchCV

# Defining the grid parameter values


dual=[True,False]
max_iter=[100,110,120,130,140, 200, 300,50]

param_grid = dict(dual=dual,max_iter=max_iter)

In [24]:
lr = LogisticRegression(penalty='l2')
grid = GridSearchCV(estimator=lr, param_grid=param_grid, cv=10, n_jobs=-1)

grid_result = grid.fit(X, y)

# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.766927 using {'dual': False, 'max_iter': 100}

/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_search.py:814: DeprecationWarning:
The default of the `iid` parameter will change from True to False in version 0.22 and will be
removed in 0.24. This will change numeric results when test-set sizes are unequal.
DeprecationWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/logistic.py:432: FutureWarning:
Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
FutureWarning)
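
Beyond the single best combination, it can be useful to see how every combination in the grid scored. A hedged sketch (not in the original notebook) using the cv_results_ attribute of the fitted search:

means = grid_result.cv_results_['mean_test_score']
params = grid_result.cv_results_['params']
for mean, param in zip(means, params):
    print("%.6f with: %r" % (mean, param))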

You can play around with more parameters to optimize models better.

You also got to know about what role hyperparameter optimization plays in building efficient machine learning models.
