Learners Guide - Machine Learning and Advanced Analytics Using Python
Basics of Python
In this notebook, we are going to learn about some of the basic operations and data types in Python, and how to write simple functions and code.
Importing Libraries
Python’s standard library is very extensive, offering a wide range of facilities. The library contains built-in modules (written in C) that
provide access to system functionality such as file I/O that would otherwise be inaccessible to Python programmers, as well as
modules written in Python that provide standardized solutions for many problems that occur in everyday programming.
https://docs.python.org/3/library/
In [1]:
# Importing Libraries
import datetime
import pandas as pd
from pandas import read_csv
Python's four simple data types, each of which stores a single value, are:
Integers
Floats
Booleans
Strings
The other four are denoted collections because they can store arbitrary numbers of values. Python's four collection data types are:
Lists
Tuples
Sets
Dictionaries
These data types and their basic operations are demonstrated below.
In [0]:
## Simple Variables
a_number = 2
a_word = 'dog'
print(a_number) # Printing
type(a_number) # Data Type
Lists are ordered sequences of elements; the order is determined by the order in which elements are added when the list is created or extended.
Tuples are similar to lists except for the very important fact that they are immutable.
In [0]:
pets = ['dogs', 'cats', 'fish']
print(pets)
print(pets[0])
print(pets[1])
In [0]:
pets.append('hedgehog')
pets
Out[0]:
['dogs', 'cats', 'fish', 'hedgehog']
In [0]:
pets.remove('dogs')
pets
Out[0]:
['cats', 'fish', 'hedgehog']
In [0]:
pets.reverse()
print(pets)
pets.sort()
print(pets)
In [0]:
# Create a tuple that stores the attributes of our golden retriever, `penny`.
penny = (60, 75, 'yellow')
print(type(penny))
penny
<class 'tuple'>
Out[0]:
(60, 75, 'yellow')
In [0]:
del penny[1]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-3dca6067f55e> in <module>()
----> 1 del penny[1]

TypeError: 'tuple' object doesn't support item deletion
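Sets are unordered collections of unique elements and support operations such as intersection and union. The two sets used in the next cell are not defined in the preserved cells; an assumed pair that is consistent with the printed results is:

# store_one and store_two are assumptions, chosen to match the output below
store_one = {'cat', 'hamster', 'parrot', 'fish'}
store_two = {'bulldog', 'terrier', 'parrot', 'fish'}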
In [0]:
print( store_one.intersection(store_two) )
print( store_one.union(store_two) )
{'parrot', 'fish'}
{'cat', 'parrot', 'hamster', 'bulldog', 'fish', 'terrier'}
Remember that Python was developed with the ideal of being readable and understandable. That means that when you are writing code, you want to make choices that make it easy for others to understand what is going on. For example, if you have a variable that contains values that are never going to change, why would you use a mutable data type such as a list to store it? Think of the days of the week or the months of the year: does it make more sense to define them as a list or as a tuple?
By using a tuple to store some data, you are signaling to a reader of your code that the values you stored are not going to change during the execution of your code.
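For example (a small illustrative snippet, not part of the original notebook):

# The days of the week never change, so a tuple signals that immutability
days_of_week = ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')
# days_of_week.append('Funday')  # would raise an AttributeError: tuples cannot be modified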
Quick Exercise
pet_store = ['beagle', 'parrot', 'iguana', 'gerbil', 'chameleon', 'fish']
We've sold out of chameleon and iguana, so you need to remove them from the inventory. Use both remove() and del to do so.
How many terrier dogs do we have in the store? (Hint: the list data type has a built-in method called count().)
Flow Control
We will learn how to perform successive operations ("looping") without explicitly coding the commands. This is largely understood to
be controlling the 'flow' of a program.
1. Logical statements to check for conditions (performing operations only if a condition is met)
2. Continuing execution until a condition is met
3. Iterating through a sequence of numbers using the range() function
In [0]:
## "Buy a gallon of milk at the store and if there are organic eggs, buy a dozen"
Loops
A loop is used to repeat a block of commands multiple times. There are two ways to write a loop, one is a for loop and the other is
a while loop. Typically, you use a for loop when you know how many times you want to loop, and a while loop when looping
is based on a conditional that will be modified during the loop.
While loops
A while loop is pretty simple; its structure looks like:
while a_condition:
# do something
...
In [0]:
# Our numeric variables
number = 43
divisor = 5
answer = 0
# While loop
while number > 0:
number = number - divisor
print( number )
38
33
28
23
18
13
8
3
-2
For loops
A for loop lets us repeat a set of commands a defined number of times. The syntax for a for loop is just:
for an_item in a_sequence:
    # do something
    ...
There are lots of functions in Python that will actually return a sequence - they are called iterators. An iterator essentially provides the next element in the sequence each time we access it.
The iterator that we will use to demonstrate a for loop is the range() function. The range function gives us a sequence of numbers starting from the first number we give it, up to (but not including) the last number we give it.
In [0]:
1
3
36
---
2
6
36
---
3
9
36
---
4
12
36
---
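The code for the cell above was not preserved in this export. A minimal sketch of a range()-driven for loop that prints several values per iteration (the exact expressions are assumptions) looks like this:

for i in range(1, 5):
    print(i)        # the loop variable takes the values 1, 2, 3, 4
    print(i * 3)    # a value derived from the loop variable
    print('---')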
Functions
Writing modular code is good!
Functions are the workhorses of modular programming in Python! So, what's a function?
def function_name():
    statements
    return something
Functions help us avoid repeating the same set of statements every time we want to repeat a task. Functions increase code readability. Functions make code revision and updating easier (you do not have to redo revisions in all the places of your code where the task is needed). Functions make testing of your code easier and more reliable.
In [0]:
def says_hello():
    '''
    Prints the word "Hello"
    input:
        - None
    output:
        - None
    '''
    print('Hello!')
You just wrote a simple function! Notice that after writing it, nothing was printed. That is because you didn't call the function; you only defined it, so Python will know what on earth you're talking about should you choose to write says_hello anywhere.
You call a function just by writing its name along with the parentheses:
In [0]:
says_hello()
Hello!
Pandas
In [0]:
import pandas as pd
In [0]:
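# (The cell that loads the data was not preserved; the path below is a
#  placeholder - substitute the Titanic CSV used in your course materials.)
titanic_data = pd.read_csv('titanic.csv')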
titanic_data.head() # Gets the first few records
Out[0]:
  survived  pclass  name                                          sex     age   sibsp  parch  ticket            fare     cabin  embarked
0 0         3       Braund, Mr. Owen Harris                       male    22.0  1      0      A/5 21171         7.2500   NaN    S
2 1         3       Heikkinen, Miss. Laina                        female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3 1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0  1      0      113803            53.1000  C123   S
In [0]:
# Generate all descriptive statistics
titanic_data.describe()
Out[0]:
Numpy
In [0]:
import numpy as np # Importing the numpy library
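# (The array-creation code was not preserved; a sketch consistent with the
#  printed values and the shape (6,) reported below:)
x3 = np.random.randint(10, size=6)   # six random integers; values differ per run
print(x3)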
[6 0 6 8 7 6]
In [0]:
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
x3 ndim: 1
x3 shape: (6,)
x3 size: 6
In [0]:
# Accessing elements
x3[0]
Out[0]:
6
In [0]:
# Input
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
# Solution
arr[arr % 2 == 1]
Out[0]:
array([1, 3, 5, 7, 9])
Exercises
Extract all even numbers from an array.
Extract all numbers from an array which are divisible by 3.
Create a simple 1D array with numbers from 0 to 100.
Read the first 25 elements of a pandas dataframe.
K- Means Clustering
This notebook will walk through some of the basics of K-Means Clustering.
In [1]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
Out[1]:
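The cell that loads the data is not fully reproduced here; presumably the Iris features are loaded into x, for example:

# Load the Iris dataset; x holds the four numeric features used for clustering
iris = datasets.load_iris()
x = iris.data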
How do you find the optimum number of clusters for K Means? How does one determine the value of K?
In [2]:
# Finding the optimum number of clusters for k-means classification
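# (The elbow-method code was not preserved; this is a sketch of the usual
#  approach: compute the within-cluster sum of squares (WCSS, inertia_) for
#  a range of K and plot it.)
from sklearn.cluster import KMeans

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=0)
    km.fit(x)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('WCSS')
plt.title('The Elbow Method')
plt.show()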
You can clearly see from the graph above why it is called 'the elbow method': the optimum number of clusters is where the elbow occurs. This is the point at which the within-cluster sum of squares (WCSS) stops decreasing significantly as more clusters are added.
In [0]:
# Applying kmeans to the dataset / Creating the kmeans classifier
kmeans = KMeans(n_clusters = 3)
kmeans_model = kmeans.fit(x)
y_kmeans = kmeans_model.predict(x)
y_kmeans
Out[0]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,
2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2,
2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0], dtype=int32)
In [0]:
plt.figure(figsize=(15,6))
plt.subplot(121)
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1],
s = 100, c = 'red', label = 'Iris-setosa')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1],
s = 100, c = 'blue', label = 'Iris-versicolour')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1],
s = 100, c = 'green', label = 'Iris-virginica')
# Second plot
plt.subplot(122)
plt.scatter(x[y_kmeans == 0, 1], x[y_kmeans == 0, 2],
s = 100, c = 'red', label = 'Iris-setosa')
plt.scatter(x[y_kmeans == 1, 1], x[y_kmeans == 1, 2],
s = 100, c = 'blue', label = 'Iris-versicolour')
plt.scatter(x[y_kmeans == 2, 1], x[y_kmeans == 2, 2],
s = 100, c = 'green', label = 'Iris-virginica')
plt.legend()
Out[0]:
<matplotlib.legend.Legend at 0x7f30c0e98eb8>
This concludes the K-Means Workshop.
Self Organizing Maps (SOM)
A self-organizing map (SOM) is a type of artificial neural network (ANN) that is trained using unsupervised learning to produce a low-
dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map, and is
therefore a method to do dimensionality reduction.
(Source: Wikipedia)
We will use the example of handwritten digits and classify them using a SOM (the minisom library in Python).
In [1]:
!pip install minisom  # Since minisom is not part of the Colab library family, we will need to install it using this command
from minisom import MiniSom # Library for SOM
import numpy as np
import matplotlib.pyplot as plt
Collecting minisom
Downloading
https://files.pythonhosted.org/packages/bf/dd/b9089c073cc16c4f86758bf7668d056956b9cfb2b539d9d50151e1713
fe0/MiniSom-2.2.1.tar.gz
Building wheels for collected packages: minisom
Building wheel for minisom (setup.py) ... done
Created wheel for minisom: filename=MiniSom-2.2.1-cp36-none-any.whl size=6643
sha256=ca126f2512d61b6929bce56f4a76efda4ad4855e19846f24c77e676b458abf53
Stored in directory:
/root/.cache/pip/wheels/41/42/7d/dd12b479c5ea50cd572d91b8e935e4f11e1302acca329f84e0
Successfully built minisom
Installing collected packages: minisom
Successfully installed minisom-2.2.1
In [0]:
# Importing required libraries for data load and processing
from sklearn import datasets
from sklearn.preprocessing import scale
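# (Assumed data-loading step, not preserved in the export:)
digits = datasets.load_digits()      # 1797 8x8 handwritten digits
data = scale(digits.data)            # scaled pixel features, used to train the SOM
num = digits.target                  # digit labels 0-9, used when visualizing the map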
In [3]:
digits.data.shape
Out[3]:
(1797, 64)
In [4]:
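# (The cell that builds the SOM was not preserved; the grid size and the
#  parameters below are assumptions, not necessarily the original values.)
som = MiniSom(30, 30, data.shape[1], sigma=4, learning_rate=0.5)
som.random_weights_init(data)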
# Iterations - 5000
som.train_random(data, 5000, verbose=True)
print("\n...SOM model is ready!")
Training...
[ 5000 / 5000 ] 100% - 1399.78 it/s - 0:00:00 left - quantization error: 3.062048466401852
In [5]:
plt.figure(figsize=(8, 8)) # canvas of size 8*8
wmap = {}
im = 0
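# (The loop body of this cell was not preserved; a typical minisom digit-map
#  visualization places each digit's label at its winning neuron, e.g.:)
for xi, t in zip(data, num):
    w = som.winner(xi)                 # best-matching unit for this digit
    wmap[w] = im
    plt.text(w[0] + .5, w[1] + .5, str(t),
             color=plt.cm.rainbow(t / 10.), fontdict={'weight': 'bold', 'size': 11})
    im = im + 1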
plt.show()
In [6]:
plt.tight_layout()
plt.show()
Feature Engineering
Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms
work. If feature engineering is done correctly, it increases the predictive power of machine learning algorithms by creating features
from raw data that help facilitate the machine learning process. Feature Engineering is an art.
In [2]:
# Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import numpy as np
from sklearn import tree
from sklearn.model_selection import GridSearchCV
# Import data
df_train = pd.read_csv('https://raw.githubusercontent.com/DesmondStone/dataXaltius/master/titanic/train.csv')
df_test = pd.read_csv('https://raw.githubusercontent.com/DesmondStone/dataXaltius/master/titanic/test.csv')
# View head
# data.info()
df_train
Out[2]:
     PassengerId  Survived  Pclass  Name                            Sex     Age   SibSp  Parch  Ticket            Fare      Cabin        Embarked
0    1            0         3       Braund, Mr. Owen Harris         male    22.0  1      0      A/5 21171         7.2500    NaN          S
2    3            1         3       Heikkinen, Miss. Laina          female  26.0  0      0      STON/O2. 3101282  7.9250    NaN          S
13   14           0         3       Andersson, Mr. Anders Johan     male    39.0  1      5      347082            31.2750   NaN          S
27   28           0         1       Fortune, Mr. Charles Alexander  male    19.0  3      2      19950             263.0000  C23 C25 C27  S
...  ...          ...       ...     ...                             ...     ...   ...    ...    ...               ...       ...          ...
861  862          0         2       Giles, Mr. Frederick Edward     male    21.0  1      0      28134             11.5000   NaN          S
864  865          0         2       Gill, Mr. John William          male    24.0  0      0      233866            13.0000   NaN          S
865  866          1         2       Bystrom, Mrs. (Karolina)        female  42.0  0      0      236852            13.0000   NaN          S
870  871          0         3       Balkic, Mr. Cerin               male    26.0  0      0      349248            7.8958    NaN          S
872  873          0         1       Carlsson, Mr. Frans Olof        male    33.0  0      0      695               5.0000    B51 B53 B55  S
873  874          0         3       Vander Cruyssen, Mr. Victor     male    47.0  0      0      345765            9.0000    NaN          S
877  878          0         3       Petroff, Mr. Nedelio            male    19.0  0      0      349212            7.8958    NaN          S
878  879          0         3       Laleff, Mr. Kristo              male    NaN   0      0      349217            7.8958    NaN          S
881  882          0         3       Markun, Mr. Johann              male    33.0  0      0      349257            7.8958    NaN          S
882  883          0         3       Dahlberg, Miss. Gerda Ulrika    female  22.0  0      0      7552              10.5167   NaN          S
884  885          0         3       Sutehall, Mr. Henry Jr          male    25.0  0      0      SOTON/OQ 392076   7.0500    NaN          S
886  887          0         2       Montvila, Rev. Juozas           male    27.0  0      0      211536            13.0000   NaN          S
889  890          1         1       Behr, Mr. Karl Howell           male    26.0  0      0      111369            30.0000   C148         C
890  891          0         3       Dooley, Mr. Patrick             male    32.0  0      0      370376            7.7500    NaN          Q
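Between these cells the train and test frames are evidently combined into a single frame called data, so that feature engineering is applied consistently to both; the exact code was not preserved, but it would look roughly like:

# Combine train and test so the engineered features cover both frames
data = pd.concat([df_train, df_test], sort=False)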
In [3]:
# View tail of 'Name' column
data.Name.tail()
Out[3]:
413 Spector, Mr. Woolf
414 Oliva y Ocana, Dona. Fermina
415 Saether, Mr. Simon Sivertsen
416 Ware, Mr. Frederick
417 Peter, Master. Michael J
Name: Name, dtype: object
Suddenly, you see different titles emerging! In other words, this column contains strings or text that contain titles, such as 'Mr',
'Master' and 'Dona'.
These titles of course give you information on social status, profession, etc., which in the end could tell you something more about
survival.
At first sight, it might seem like a difficult task to separate the names from the titles, but don't panic! Remember, you can easily use
regular expressions to extract the title and store it in a new column 'Title':
In [4]:
# Extract Title from Name, store in column and plot barplot
data['Title'] = data.Name.apply(lambda x: re.search(' ([A-Z][a-z]+)\.', x).group(1))
sns.countplot(x='Title', data=data);
plt.xticks(rotation=45);
In [5]:
data['Title'] = data['Title'].replace({'Mlle':'Miss', 'Mme':'Mrs', 'Ms':'Miss'})
data['Title'] = data['Title'].replace(['Don', 'Dona', 'Rev', 'Dr', 'Major', 'Lady', 'Sir',
                                       'Col', 'Capt', 'Countess', 'Jonkheer'], 'Special')
sns.countplot(x='Title', data=data);
plt.xticks(rotation=45);
Feature Sets
Feature Selection is one of the core concepts in machine learning which hugely impacts the performance of your model. The data
features that you use to train your machine learning models have a huge influence on the performance you can achieve.
Why?
Irrelevant or partially relevant features can negatively impact model performance.
Feature selection and data cleaning should be the first and most important steps when designing your model.
Feature selection is the process of automatically or manually selecting the features that contribute most to the prediction variable or output you are interested in.
Having irrelevant features in your data can decrease the accuracy of the models and make your model learn based on irrelevant
features.
How do you select features, and what are the benefits of performing feature selection before modeling your data?
· Reduces overfitting: less redundant data means less opportunity to make decisions based on noise.
· Reduces training time: fewer data points reduce algorithm complexity, so algorithms train faster.
We will work with the Mobile Price Range Prediction dataset, whose columns are:
battery_power: Total energy a battery can store in one charge, measured in mAh
blue: Has Bluetooth or not
clock_speed: the speed at which microprocessor executes instructions
dual_sim: Has dual sim support or not
fc: Front Camera megapixels
four_g: Has 4G or not
int_memory: Internal Memory in Gigabytes
m_dep: Mobile Depth in cm
mobile_wt: Weight of mobile phone
n_cores: Number of cores of the processor
pc: Primary Camera megapixels
px_height: Pixel Resolution Height
px_width: Pixel Resolution Width
ram: Random Access Memory in MegaBytes
sc_h: Screen Height of mobile in cm
sc_w: Screen Width of mobile in cm
talk_time: the longest time that a single battery charge will last when you are constantly talking on the phone
three_g: Has 3G or not
touch_screen: Has touch screen or not
wifi: Has wifi or not
price_range: This is the target variable with a value of 0(low cost), 1(medium cost), 2(high cost) and 3(very high cost).
1) Univariate Selection
Statistical tests can be used to select those features that have the strongest relationship with the output variable.
The scikit-learn library provides the SelectKBest class that can be used with a suite of different statistical tests to select a specific
number of features.
The example below uses the chi-squared (chi²) statistical test for non-negative features to select 10 of the best features from the
Mobile Price Range Prediction Dataset.
In [1]:
# dataset train - https://tinyurl.com/y2v7doco
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
data = pd.read_csv("https://tinyurl.com/y2v7doco")
X = data.iloc[:,0:20] #independent columns
y = data.iloc[:,-1] #target column i.e price range
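# (The scoring/ranking code was not preserved; this sketch reproduces the
#  usual SelectKBest + chi2 recipe that yields the table below.)
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X, y)
scores = pd.DataFrame({'Specs': X.columns, 'Score': fit.scores_})
print(scores.nlargest(10, 'Score'))    # the ten highest-scoring features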
Specs Score
13 px_width 852.914979
14 ram 562.837207
0 id 223.566155
12 px_height 46.347162
9 mobile_wt 42.328627
5 fc 15.793117
11 pc 11.148155
7 int_memory 1.372252
3 clock_speed 1.052762
16 sc_w 0.809077
2) Feature Importance
You can get the feature importance of each feature of your dataset by using the feature importance property of the model.
Feature importance gives you a score for each feature of your data; the higher the score, the more important or relevant the feature is to your output variable.
Feature importance comes built in with tree-based classifiers; we will use the Extra Trees Classifier to extract the top 10 features for this dataset.
In [2]:
import pandas as pd
import numpy as np
data = pd.read_csv("https://tinyurl.com/y2v7doco")
X = data.iloc[:,0:20] #independent columns
y = data.iloc[:,-1] #target column i.e price range
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
model = ExtraTreesClassifier()
model.fit(X,y)
print(model.feature_importances_) #use inbuilt class feature_importances of tree based classifiers
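To visualize the top 10 features, the importances are typically plotted like this (a sketch; the original plotting code was not preserved):

feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()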
3) Correlation Matrix with Heatmap
Correlation can be positive (an increase in one feature value increases the value of the target variable) or negative (an increase in one feature value decreases the value of the target variable).
A heatmap makes it easy to identify which features are most related to the target variable; we will plot a heatmap of correlated features using the seaborn library.
In [0]:
import pandas as pd
import numpy as np
import seaborn as sns
data = pd.read_csv("https://tinyurl.com/y2v7doco")
X = data.iloc[:,0:20] #independent columns
y = data.iloc[:,-1] #target column i.e price range
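# (The plotting code was not preserved; a sketch of the usual correlation heatmap:)
import matplotlib.pyplot as plt
corrmat = data.corr()                 # correlation of every feature with every other
plt.figure(figsize=(20, 20))
sns.heatmap(corrmat, annot=True, cmap='RdYlGn')
plt.show()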
Have a look at the heatmap and see how the columns are related to each other.
Logistic Regression
We have previously seen how linear regression works well for predicting continuous outputs that can easily fit to a line/plane. But
linear regression doesn't fare well for classification. This is where we need to use logistic regression.
In [1]:
# Import all required libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
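# (The data-loading line was not preserved; the path below is a placeholder
#  for the Titanic CSV used in the course.)
df = pd.read_csv('titanic.csv')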
df.head()
Out[1]:
  survived  pclass  name                                          sex     age   sibsp  parch  ticket            fare     cabin  embarked
0 0         3       Braund, Mr. Owen Harris                       male    22.0  1      0      A/5 21171         7.2500   NaN    S
2 1         3       Heikkinen, Miss. Laina                        female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3 1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0  1      0      113803            53.1000  C123   S
Note: The LogisticRegression class in Scikit-learn uses coordinate descent to solve the fit. However, we are going to use
Scikit-learn's SGDClassifier class which uses stochastic gradient descent. We want to use this optimization approach because
we will be using this for the models in subsequent lessons.
In [0]:
# Import packages
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
Data Preprocessing
In [0]:
# Preprocessing
def preprocess(df):
    # Drop text based features (we'll learn how to use them in later lessons)
    features_to_drop = ["name", "cabin", "ticket"]
    df = df.drop(features_to_drop, axis=1)
    # pclass, sex, and embarked are categorical features -> We need to convert
    # them to numerical figures for training
    categorical_features = ["pclass", "embarked", "sex"]
    df = pd.get_dummies(df, columns=categorical_features)
    return df
In [9]:
Out[9]:
survived age sibsp parch fare pclass_1 pclass_2 pclass_3 embarked_C embarked_Q embarked_S sex_female sex_male
1 1 38.0 1 0 71.2833 1 0 0 1 0 0 1 0
3 1 35.0 1 0 53.1000 1 0 0 0 0 1 1 0
6 0 54.0 0 0 51.8625 1 0 0 0 0 1 0 1
10 1 4.0 1 1 16.7000 0 0 1 0 0 1 1 0
11 1 58.0 0 0 26.5500 1 0 0 0 0 1 1 0
In [0]:
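# (Assumed contents of this cell, which were not preserved:)
# separate the features from the label for training
X = df.drop('survived', axis=1)
y = df['survived']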
In [0]:
# Splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2)
# default test size = 0.25
In [19]:
# Initialize the model
log_reg = SGDClassifier(loss="log", penalty="none", max_iter=50)
# Train
log_reg.fit(X=X_train, y=y_train) # train the model
Out[19]:
SGDClassifier(alpha=0.0001, average=False, class_weight=None,
early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=50,
n_iter_no_change=5, n_jobs=None, penalty='none', power_t=0.5,
random_state=None, shuffle=True, tol=0.001,
validation_fraction=0.1, verbose=0, warm_start=False)
Predicting Models
In [21]:
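# (The prediction code was not preserved; this sketch matches the outputs shown below.)
pred_test = log_reg.predict(X_test)
print(pred_test)

# Put the true and predicted labels side by side for inspection
pd.DataFrame({'Original Survived': y_test.values,
              'Predicted Survived': pred_test})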
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 0]
Out[21]:
    Original Survived  Predicted Survived
0   1                  1
1   1                  1
2   0                  1
3   1                  1
4   1                  1
5   1                  1
6   0                  1
7   0                  1
8   1                  1
9   1                  1
10  1                  1
11  0                  1
12  0                  1
13  1                  1
14  1                  1
15  1                  1
16  1                  1
17  0                  1
18  0                  0
19  1                  1
20  1                  1
21  0                  1
22  1                  1
23  0                  1
24  0                  1
25  1                  1
26  0                  1
27  1                  1
28  0                  1
29  1                  1
30  0                  1
31  1                  1
32  1                  1
33  1                  1
34  0                  1
35  1                  1
36  1                  1
37  0                  1
38  1                  1
39  0                  1
40  1                  1
41  0                  1
42  1                  1
43  1                  1
44  1                  1
45  0                  0
MODEL EVALUATION
Now that we have seen many models being built including the one above, it is very important to understand how to evaluate a
model and also look at the different performance metrics associated with it. We will take this as an example to understand
some of that.
In [0]:
from sklearn.metrics import accuracy_score, confusion_matrix
In [26]:
# Accuracy
test_acc = accuracy_score(y_test, pred_test)
print ("Test acc: {}".format(test_acc))
So far we have looked at accuracy, which measures our model's overall level of performance. But we have several other options when it comes to evaluation metrics, such as precision, recall, and the F-score.
In [0]:
from sklearn.metrics import precision_recall_fscore_support
prfscore = precision_recall_fscore_support(y_test, pred_test)
prfscore
Out[0]:
(array([1. , 0.63265306]),
array([0.05263158, 1. ]),
array([0.1 , 0.775]),
array([19, 31]))
The above are the different performance metrics for both the classes, since this is a binary
classification problem.
In [27]:
confusion_matrix(y_test, pred_test)
Out[27]:
array([[ 2, 16],
[ 0, 28]])
Inference
Let us first see if you survived!
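The cell that builds X_infer (a one-row data frame describing a hypothetical passenger) was not preserved; it would look roughly like this, with placeholder values:

# A hypothetical passenger; the values are placeholders
X_infer = pd.DataFrame([{'name': 'Me', 'cabin': None, 'ticket': None,
                         'pclass': 1, 'sex': 'female', 'embarked': 'C',
                         'age': 24, 'sibsp': 1, 'parch': 0, 'fare': 100}])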
In [29]:
# Apply preprocessing
X_infer = preprocess(X_infer)
# Reorganize header
X_infer = X_infer[X_train.columns]
X_infer.head()
Out[29]:
(single-row preview of the preprocessed X_infer frame)
In [0]:
# Predict
y_infer = log_reg.predict_proba(X_infer)
classes = {0: "died", 1: "survived"}
_class = np.argmax(y_infer)
print ("I would have {0} with about {1:.0f}% probability on the Titanic expedition!".format(
classes[_class], y_infer[0][_class]*100.0))
I would have survived with about 100% probability on the Titanic expedition!
Steps:
In [0]:
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/stochastic_gradient.py:183:
FutureWarning: max_iter and tol parameters have been added in SGDClassifier in 0.19. If max_iter is
set but tol is left unset, the default value for tol in 0.19 and 0.20 will be None (which is
equivalent to -infinity, so it has no effect) but will change in 0.21 to 1e-3. Specify tol to
silence this warning.
FutureWarning)
Naive Bayes Classifier
In this workshop we are going to implement a very simple Naive Bayes classifier.
In [11]:
%matplotlib inline
# Import all required libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning:
pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
import pandas.util.testing as tm
In [0]:
from sklearn.datasets import make_blobs
X, y = make_blobs(100, 2, centers=2, random_state=2, cluster_std=1.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')
# print(X,y)
Out[0]:
<matplotlib.collections.PathCollection at 0x7f17859ae630>
In [0]:
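# (Assumed contents of this cell, matching the output shown below:)
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y)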
Out[0]:
GaussianNB(priors=None, var_smoothing=1e-09)
In [0]:
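# (The cell that generates the new points was not preserved; a sketch that
#  samples fresh points from the same region of feature space:)
rng = np.random.RandomState(0)
Xnew = [-6, -14] + [14, 18] * rng.rand(2000, 2)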
# Getting the output class (0/1 (red/blue)) for the new points
ynew = model.predict(Xnew)
In [0]:
In [0]:
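# (Assumed contents of this cell: posterior probabilities for the last
#  eight generated points, rounded for display.)
yprob = model.predict_proba(Xnew)
yprob[-8:].round(3)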
Out[0]:
array([[0.895, 0.105],
[1. , 0. ],
[1. , 0. ],
[1. , 0. ],
[1. , 0. ],
[1. , 0. ],
[0. , 1. ],
[0.153, 0.847]])
The columns give the posterior probabilities of the first and second label, respectively. If you are looking for estimates of uncertainty in your classification, Bayesian approaches like this can be useful.
In [1]:
# Let's download some documents
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups()
data.target_names
Out[1]:
['alt.atheism',
'comp.graphics',
'comp.os.ms-windows.misc',
'comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware',
'comp.windows.x',
'misc.forsale',
'rec.autos',
'rec.motorcycles',
'rec.sport.baseball',
'rec.sport.hockey',
'sci.crypt',
'sci.electronics',
'sci.med',
'sci.space',
'soc.religion.christian',
'talk.politics.guns',
'talk.politics.mideast',
'talk.politics.misc',
'talk.religion.misc']
In [0]:
# We will consider a few categories for simplicity
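# (Assumed contents of this cell; the four categories match the output further below.)
categories = ['comp.graphics', 'rec.motorcycles', 'rec.sport.baseball', 'sci.space']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)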
In [7]:
# Printing a sample document
print(train.data[3])
A shareware graphics program called Pman has a filter that makes a picture
look like a hand drawing. This picture could probably be converted into
vector format much easier because it is all lines. (With Corel Trace, etc..)
In [0]:
# We need to do some operations to convert the text into
# numbers so that it can be used by the model
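# (The vectorizer/model code was not preserved; a standard sketch is a
#  TF-IDF vectorizer feeding a multinomial Naive Bayes classifier.)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

model = make_pipeline(TfidfVectorizer(), MultinomialNB())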
In [9]:
model.fit(train.data, train.target)
pred = model.predict(test.data)
pred
Out[9]:
array([2, 2, 1, ..., 3, 1, 2])
In [12]:
# Confusion matrix on the testing set
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(test.target, pred)
In [13]:
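# (Assumed contents of this cell: classify a new sentence and look up its
#  predicted category name; the example sentence is hypothetical.)
pred = model.predict(['sending a payload to the international space station'])
train.target_names[pred[0]]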
Out[13]:
'sci.space'
In [14]:
pred = model.predict_proba(['determining the screen resolution'])
# train.target_names[pred[0]]
train.target_names, pred
Out[14]:
(['comp.graphics', 'rec.motorcycles', 'rec.sport.baseball', 'sci.space'],
array([[0.56797615, 0.12966549, 0.13999519, 0.16236317]]))
Because naive Bayesian classifiers make such stringent assumptions about data, they will generally not perform as well as a more
complicated model. That said, they have several advantages:
Support Vector Machines
We will understand how to build a support vector classifier to classify digit data. We will also learn about how to read a confusion
matrix.
In [2]:
# Import all required libraries
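# (The import lines were not preserved; these are the libraries used below.)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, svm, metrics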
The data that we are going to use today is made of 8x8 images of digits, which is part of the sklearn datasets.
In [3]:
# Load the digits dataset
digits = datasets.load_digits()
print('Digits dataset keys \n{}'.format(digits.keys()))
We see that the dataset (digits.data) is composed of 1797 samples, with 64 features, where each feature is a single image pixel. Let's
have a look at the first 4 images, stored in the images attribute of the dataset.
In [4]:
# Plot the data
for i in range(0, 4):
    plt.subplot(2, 4, i + 1)
    plt.axis('off')
    imside = int(np.sqrt(digits.data[i].shape[0]))
    im1 = np.reshape(digits.data[i], (imside, imside))
    plt.imshow(im1, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Training: {}'.format(digits.target[i]))
plt.show()
In [0]:
Before applying a classifier to the data, let's split the data into a training set and a test set.
In [10]:
from sklearn.model_selection import train_test_split
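# (The split itself was not preserved; the test-set fraction is an assumption.)
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.25)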
In [14]:
# Create a support vector classifier
classifier = svm.SVC(gamma=0.0001)
In [15]:
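# (The training/prediction code was not preserved; this sketch matches the printed outputs.)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print("Original Outputs")
print(y_test[0:5])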
print("Predicted Outputs")
print(y_pred[0:5])
Original Outputs
[0 3 9 2 6]
Predicted Outputs
[0 7 9 2 6]
Now let us have a look at the accuracy and the confusion matrix.
In [16]:
### Printing the confusion matrix. Confused?
print("Accuracy Score:\n%s" % metrics.accuracy_score(y_test, y_pred))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(y_test, y_pred))
Accuracy Score:
0.98
Confusion matrix:
[[38 0 0 0 0 0 0 0 0 0]
[ 0 56 0 0 0 0 0 0 0 0]
[ 0 1 39 0 0 0 0 0 0 0]
[ 0 0 0 47 0 0 0 1 0 0]
[ 0 0 0 0 39 0 0 0 1 0]
[ 0 0 0 0 0 46 1 0 0 1]
[ 0 0 0 0 0 0 44 0 1 0]
[ 0 0 0 0 0 0 0 46 1 0]
[ 0 2 0 0 0 0 0 0 44 0]
[ 0 0 0 0 0 0 0 0 0 42]]
We have successfully trained a support vector machine to learn and predict on digits data.
Decision Trees
This workshop deals with understanding the working of decision trees.
In [5]:
# Importing libraries in Python
import sklearn.datasets as datasets
import pandas as pd
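# (Assumed data-loading step, not preserved in the export:)
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df.head()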
Out[5]:
In [8]:
y=iris.target
print(y)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
In [2]:
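# (Assumed contents of this cell: build and train the decision tree.)
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(df, y)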
In [12]:
dtree.decision_path
Out[12]:
In [3]:
In [13]:
Out[13]:
You can now feed any new/test data to this classifier and it would be able to predict the right class accordingly.
OPTIMIZING ML MODELS
This workshop is designed to teach you how you can optimize your ML models once you have built, trained and tested the
models. Optimization is the last step of a machine learning model process before results can be presented to the user.
We are going to optimize Model Hyperparameters. A model hyperparameter is a configuration that is external to the model and
whose value cannot be estimated from data.
There are many strategies to tune model hyperparameters. As part of this workshop we will discuss one technique - Grid Search.
In [2]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
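# (The data-loading line was not preserved; the path below is a placeholder
#  for the Pima Indians Diabetes CSV used in the course.)
data = pd.read_csv('diabetes.csv')
data.head()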
Out[2]:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  BMI   DiabetesPedigreeFunction  Age  Outcome
1  1            85       66             29             0        26.6  0.351                     31   0
3  1            89       66             23             94       28.1  0.167                     21   0
In [3]:
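# (The cleaning code was not preserved; a sketch consistent with the two
#  outputs below: treat zeros in these columns as missing, report the counts,
#  impute with the column mean, then re-check.)
import numpy as np
cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
data[cols] = data[cols].replace(0, np.nan)
print(data.isnull().sum())

data[cols] = data[cols].fillna(data[cols].mean())
print(data.isnull().sum())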
Pregnancies 0
Glucose 5
BloodPressure 35
SkinThickness 227
Insulin 374
BMI 11
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
Compare the two outputs above: after data cleaning, we have no missing values.
In [0]:
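# (Assumed contents of this cell: separate features and target, create the model.)
X = data.drop('Outcome', axis=1)
y = data['Outcome']
lr = LogisticRegression()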
In [17]:
# Pass data to train the LR Model
lr.fit(X,y);
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/logistic.py:432: FutureWarning:
Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
FutureWarning)
In [18]:
# Let's check the accuracy of the model
lr.score(X,y)
Out[18]:
0.7669270833333334
In [0]:
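# (The candidate hyperparameter values were not preserved; those below are illustrative.)
from sklearn.model_selection import GridSearchCV   # needed for the next cell
dual = [True, False]
max_iter = [100, 110, 120, 130, 140]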
param_grid = dict(dual=dual,max_iter=max_iter)
In [24]:
lr = LogisticRegression(penalty='l2')
grid = GridSearchCV(estimator=lr, param_grid=param_grid, cv = 10, n_jobs=-1)
grid_result = grid.fit(X, y)
# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_search.py:814: DeprecationWarning:
The default of the `iid` parameter will change from True to False in version 0.22 and will be
removed in 0.24. This will change numeric results when test-set sizes are unequal.
DeprecationWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/logistic.py:432: FutureWarning:
Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
FutureWarning)
You can play around with more parameters to optimize your models further.
You have also seen the role that hyperparameter optimization plays in building efficient machine learning models.