

Bharati Vidyapeeth's College of Engineering

New Delhi

LAB MANUAL

Department: Computer Science & Engineering
Academic Year: 2019-20
Semester: 8th
Subject Name: Machine Learning
Subject Code: ETCS-454
Faculty Name: Mr. Varun Srivastava
              Dr. Pranav Dass

1
VISION OF INSTITUTE

We strive to develop as an institute of excellence in education and research in consonance with the
contemporary needs of the country. A dynamic learning environment is provided to foster sound
academic grounding, self-esteem and self-learning among students, inculcating freedom of thought,
human values and concern for society, thereby infusing in them a sense of commitment and leading
them towards becoming competent and motivated engineering professionals.

MISSION OF INSTITUTE

The aim of the institute is to develop a unique academic culture that instils in students responsibility
and accountability, in partnership with parents, business and the education community. The guiding
philosophy remains "Social Transformation through Dynamic Education", achieved through sound
academic and social grounding of students.

2
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

VISION OF DEPARTMENT

To develop as a center of excellence in computer science and engineering education and research
so as to produce globally competent professionals with a sense of social responsibility.

MISSION OF DEPARTMENT

The department aims to impart state of art education with strong foundations in computer
engineering, to provide a conducive environment for the development of analytical and
collaborative learning skill of the students and inculcate leadership qualities and strong ethical
values among the students.

PROGRAM EDUCATIONAL OBJECTIVES (PEO)

Following are the three Program Educational Objectives (PEOs) of the B. Tech (Computer
Science and Engineering) program:
PEO1: To produce engineers with in-depth knowledge of the sub-domains of computer science and
engineering, who contribute towards innovation and research and excel in higher studies.
PEO2: To inculcate life-long learning skills in graduates, enabling them to adapt to changing
technologies and tools and to work in multidisciplinary teams.
PEO3: To produce ethically responsible graduate engineers who are involved in transforming
society by providing suitable engineering solutions.

3
PROGRAM SPECIFIC OUTCOMES (PSO)
PSO1: An ability to identify, formulate and analyze problems through computer engineering
concepts for the modelling and design of computer-based systems.
PSO2: To prepare graduates to engage in lifelong learning and to update themselves regularly on
new technologies and tools.

PROGRAM OUTCOMES (PO)


1. Engineering knowledge: Apply the knowledge acquired in mathematics, science, engineering for the
solution of complex engineering problems.
2. Problem analysis: Identify research gaps, formulate and analyze complex engineering problems
drawing substantiated conclusions using basic knowledge of mathematics, natural sciences and
engineering sciences.
3. Design/development of solutions: Design solutions for the identified complex engineering
problems as well as develop solutions that meet the specified needs for the public health and safety,
and the cultural, societal and environmental considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and research
methods, including design of experiments, analysis and interpretation of data and synthesis of
the information to provide valid conclusions.
5. Modern tool usage: Apply the latest technologies, resources and software tools, including prediction
and modelling, to complex engineering activities with an understanding of their limitations.
6. The engineer and society: Apply the acquired knowledge to assess societal, health, safety,
legal and cultural issues and identify the consequential responsibilities relevant to professional
engineering practice.
7. Environment and Sustainability: Comprehend the impact of the professional engineering solutions in
context of society and environment and demonstrate the need and knowledge for sustainable
development
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of
the engineering practice
9. Individual and team work: Function effectively as an individual, and as a member or leader in
diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the engineering
community and with society at large, such as, being able to comprehend and write effective reports
and design documentation, make effective presentations, and give and receive clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the engineering
and management principles and apply these to one’s own work, as a member and leader in a team, to
manage projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning.

4
TABLE OF CONTENTS

S.No.  Contents                                                          Page no.

1.     Course details                                                    6
       1.1 Course objective
       1.2 Course outcomes
       1.3 CO-PO/PSO mapping
       1.4 Evaluation scheme
       1.5 Guidelines/Rubrics for continuous assessment
       1.6 Lab safety instructions
       1.7 Instructions for students while writing experiments in the lab file
2.     List of Experiments & Content Beyond Syllabus                     11
3.     Experimental setup details for the course                         12
4.     Experiment details                                                13
5.     Course Exit Survey                                                47

5
1. COURSE DETAILS
1.1 COURSE OBJECTIVE

The objective of the course is to acquaint students with the discipline of Machine Learning and its
applications to real-world problems, to explore available real-world data, and to apply these
techniques to it.

1.2 COURSE OUTCOMES

At the end of the course the student will be able to:

ETCS454.1  Understand and implement the various Machine Learning          PO1, PO5, PSO1
           approaches and interpret the concepts of supervised learning.

ETCS454.2  Study and apply the fundamental concepts in Machine Learning,  PO1, PO2, PO3,
           including classification, and be able to apply the Machine     PSO1, PSO2
           Learning algorithms.

ETCS454.3  Analyze and evaluate data, performing experiments in Machine   PO2, PSO1
           Learning using real-world data, and recognize the limitations
           of learning algorithms.

ETCS454.4  Confidently apply, execute and analyze the common Machine      PO1, PO2, PO4,
           Learning algorithms in practice and implement them.            PSO1, PSO2

ETCS454.5  Illustrate and apply clustering algorithms, evaluate them      PO4, PSO1
           using performance measures and identify their applicability
           to real-life problems.

ETCS454.6  Create and analyze new models and modern programming tools     PO1, PO3, PO4,
           for existing Machine Learning problems.                        PO9, PSO1, PSO2

6
1.3 MAPPING COURSE OUTCOMES (CO) AND PROGRAM OUTCOMES
(PO)/ PROGRAM SPECIFIC OUTCOME (PSO)

PO PO PO PO PO PO PO PO PO PO PO PO PSO PSO
CO 1 2 3 4 5 6 7 8 9 10 11 12 1 2

3 3 2 2
CO1

3 2 2 3 2
CO2

3 3 1 1
CO3

2 2 1 3 2 2 2
CO4

3 1 3 2 2 1
CO5

3 3 2 2 2 2 3 2
CO6

1.4 EVALUATION SCHEME

Laboratory Components   Internal   External
Marks                   40         60
Total Marks             100

1.5 GUIDELINES FOR CONTINUOUS ASSESSMENT FOR EACH


EXPERIMENT

 Attendance and performance in a minimum of eight experiments – 30 marks for all
semesters
 Each experiment will carry a weight of 15 marks:
 Experiment performance [5 Marks]
 File [5 Marks]
 Viva-Voce [5 Marks]
 2 innovative experiments (Content Beyond Syllabus) – 10 marks for 1st & 2nd
Semester
 2 innovative experiments (Content Beyond Syllabus) – 5 marks for 3rd, 4th, 5th, 6th,
7th and 8th Semester

7
 Viva – 5 marks for 3rd, 4th, 5th, 6th, 7th and 8th Semester

The rubrics for experiment execution, lab file and viva-voce are given below:
Experiment marks details:

Status   Completed and        Completed but         Logically incorrect   Unacceptable
         executed perfectly   partially executing   program or errors     efforts / Absent

Marks    4-5                  2-3                   1                     0

File Marks Details:

Status   File contents &    File contents & checked       File contents & not checked
         checked timely     not timely (after one week)   (after two weeks)

Marks    4-5                2-3                           0-1

Viva-Voce Marks details:

Status   Viva (Good)   Viva (Average)   Viva (Unsatisfactory)

Marks    4-5           1-3              0

Note: Viva Voce Questions for each experiment should be related to Course
Outcomes.

8
1.6 Safety Guidelines/Rules for laboratory:
1. The lab requires a computer system that needs to be handled with care.
2. No special safety guidelines are required.

9
1.7 Format for students while writing Experiment in Lab file.

Experiment No: 1

Aim:

Course Outcome:

Software used:

Theory:

Flowchart/Algorithm/Code:

Results:

Expected Outcome attained: YES/NO

10
2. LIST OF EXPERIMENTS AS PER GGSIPU

Sr. No. Title of Lab Experiments CO

1. Study and Implement the Naive Bayes learner using WEKA. (The datasets taken (CO1)
can be: Breast Cancer data file or Reuters data set).
2. (CO1)
Study and Implement the Decision Tree learners using WEKA. (The datasets
taken can be: Breast Cancer data file or Reuter’s data set).

3. Estimate the accuracy of decision classifier on breast cancer dataset using 5-fold (CO2)
cross-validation. (You need to choose the appropriate options for missing values).
4. Estimate the precision, recall, accuracy, and F-measure of the decision tree (CO2)
classifier on the text classification task for each of the 10 categories using 10-fold
cross-validation.
5. (CO3)
Develop a machine learning method to classify your incoming mail.
6. Develop a machine learning method to Predict stock prices based on past (CO3)
price variation.
7. Develop a machine learning method to predict how people would rate (CO4)
movies, books, etc.
8. Develop a machine learning method to Cluster gene expression data, how to (CO4)
modify existing methods to solve the problem better.
9. Select two datasets. Each dataset should contain examples from multiple classes. (CO5)
For training purposes assume that the class label of each example is unknown (if it
is known, ignore it). Implement the Kmeans algorithm and apply it to the data you
selected. Evaluate performance by measuring the sum of Euclidean distance of
each example from its class center. Test the performance of the algorithm as a
function of the parameter k.
10. Implement the EM algorithm assuming a Gaussian mixture. Apply the algorithm (CO5)
to your datasets and report the parameters you obtain. Evaluate performance by
measuring the sum of Mahalanobis distance of each example from its class center.
Test performance as a function of the number of clusters.

11. Suggest and test a method for automatically determining the number of (CO6)
clusters.
12. Using a dataset with known class labels compare the labeling error of the K-means (CO6)
and EM algorithms. Measure the error by assigning a class label to each example.
Assume that the number of clusters is known.

11
CONTENT BEYOND SYLLABUS
S.No Name of Experiment
1. Application of Deep learning models on text or image classification. (CO6)
2. To explore VGGNet for image/text classification. (CO6)

3. EXPERIMENTAL SETUP DETAILS FOR THE COURSE

Software Requirements:
WEKA, Anaconda

Minimum Hardware Requirements


Dual Core based PC with 1 GB RAM

12
4. Experimental Details
4.1 Introduction

WEKA, formally the Waikato Environment for Knowledge Analysis, is a computer program that was
developed at the University of Waikato in New Zealand for the purpose of identifying information from
raw data gathered from agricultural domains. WEKA supports many standard data mining tasks such as
data preprocessing, classification, clustering, regression, visualization and feature selection. The basic
premise of the application is to utilize a computer application that can be trained to perform machine
learning tasks and derive useful information in the form of trends and patterns. WEKA is an open source
application that is freely available under the GNU General Public License. Originally written in C, the
WEKA application has been completely rewritten in Java and is compatible with almost every computing
platform. It is user friendly, with a graphical interface that allows for quick set-up and operation. WEKA
operates on the assumption that the user data is available as a flat file or relation; this means that each
data object is described by a fixed number of attributes, usually of a specific type (alphanumeric or
numeric values). The WEKA application gives novice users a tool to identify hidden information in
databases and file systems with simple-to-use options and visual interfaces.

13
4.2 Installation of WEKA Tool

The program information can be found by conducting a Web search for "WEKA Data Mining" or by
going directly to the site at www.cs.waikato.ac.nz/~ml/WEKA . The site has a large amount of useful
information on the program's benefits and background. New users may benefit from investigating the
user manual for the program. The main WEKA site has links to this information as well as to past
experiments, which can help new users identify the potential uses that might be of particular interest to
them. When prepared to download the software, it is best to select the latest version offered on the site.
The application is offered as a self-installation package, and installing it is a simple procedure that
provides the complete program on the end user's machine, ready to use when extracted.

14
4.3 Experiments:
Experiment No. 1

1. Aim: Study and Implement the Naive Bayes learner using WEKA. (The datasets taken can be:
Breast Cancer data file or Reuters data set).

2. Software to be used: WEKA

3. Introduction to WEKA: WEKA is a data mining system developed by the University of Waikato in
New Zealand that implements data mining algorithms. WEKA is a state-of-the-art facility for developing
machine learning (ML) techniques and their application to real-world data mining problems. It is a
collection of machine learning algorithms for data mining tasks. The algorithms are applied directly to a
dataset. WEKA implements algorithms for data preprocessing, classification, regression, clustering and
association rules; it also includes visualization tools. New machine learning schemes can also be
developed with this package. WEKA is open source software issued under the GNU General Public
License.

4. Pedagogy/ Algorithm:

4.1 Launch WEKA: You can launch WEKA from the C:\Program Files directory, from the desktop icon,
or from the Windows task bar via 'Start' → 'Programs' → 'Weka 3-4'. When the 'WEKA GUI Chooser'
window appears on the screen, you can select one of the four options at the bottom of the window. Fig. 1
shows the opening window and the various options inside the Explorer tab.

4.1.1 Simple CLI provides a simple command-line interface and allows direct execution of Weka
commands

4.1.2 Explorer is an environment for exploring data.

4.1.3. Experimenter is an environment for performing experiments and conducting statistical tests
between learning schemes.

4.1.4. KnowledgeFlow is a Java-Beans-based interface for setting up and running machine learning
experiments.

Fig. 1: Opening window of WEKA

15
4.2 Data preprocessing:
WEKA expects the data file to be in the Attribute-Relation File Format (ARFF). Before you apply an
algorithm to your data, you need to convert your comma-separated data file into ARFF format (a file with
the .arff extension). To save your data in comma-separated format, select the 'Save As…' menu item from
the Excel 'File' pull-down menu. In the ensuing dialog box select 'CSV (Comma Delimited)' from the
file-type pop-up menu, enter a name for the file, and click the 'Save' button; dismiss any messages that
appear by clicking 'OK'. You can then open this file with a text editor to inspect it. We can create an
ARFF file and load it from the system using the 'Open file' option, load a predefined database using the
'Open DB' option, or even load data from a website using the 'Open URL' option.

4.3 Load data:


Let's load the data and look at what happens in the 'Preprocess' window. The most common and easiest
way of loading data into WEKA is from an ARFF file, using the 'Open file…' button as shown in Fig. 2.
Click on the 'Open file…' button and choose the "weather.arff" file from your local filesystem. Note that
the data can also be loaded from a CSV file.

Fig. 2: Loading data and various options in WEKA

4.4 Classify using classifier: Go to classifier tab

Naïve Bayes Algorithm:

The Naive Bayes Classifier technique is based on Bayes' theorem and is particularly suited when the
dimensionality of the inputs is high. Despite its simplicity, Naive Bayes can often outperform more
sophisticated classification methods.

Fig. 3: Representation of data to be classified for two classes

16
To demonstrate the concept of Naïve Bayes classification, consider the example displayed in Fig. 3
above. As indicated, the objects can be classified as either GREEN or RED. Our task is to classify new
cases as they arrive, i.e., to decide to which class label they belong, based on the currently existing objects.

Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case (which
hasn't been observed yet) is twice as likely to have membership GREEN rather than RED. In the Bayesian
analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience,
in this case the percentage of GREEN and RED objects, and often used to predict outcomes before they
actually happen.

Thus, we can write:

Prior probability of GREEN ∝ (number of GREEN objects) / (total number of objects)
Prior probability of RED ∝ (number of RED objects) / (total number of objects)

Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities for class
membership are:

P(GREEN) = 40/60 and P(RED) = 20/60

Fig. 4: A new object case in Naïve bayes

Having formulated our prior probabilities, we are now ready to classify a new object (the WHITE circle X
in Fig. 4). Since the objects are well clustered, it is reasonable to assume that the more GREEN (or RED)
objects in the vicinity of X, the more likely that the new case belongs to that particular colour. To measure
this likelihood, we draw a circle around X which encompasses a number (to be chosen a priori) of points
irrespective of their class labels. Then we calculate the number of points in the circle belonging to each
class label. From this we calculate the likelihoods:

Likelihood of X given GREEN ∝ (number of GREEN in the vicinity of X) / (total number of GREEN) = 1/40
Likelihood of X given RED ∝ (number of RED in the vicinity of X) / (total number of RED) = 3/20

17
From the illustration above, it is clear that the likelihood of X given GREEN is smaller than the likelihood
of X given RED, since the circle encompasses 1 GREEN object and 3 RED ones.

Although the prior probabilities indicate that X may belong to GREEN (given that there are twice as
many GREEN compared to RED), the likelihood indicates otherwise: that the class membership of X is
RED (given that there are more RED objects in the vicinity of X than GREEN). In the Bayesian analysis,
the final classification is produced by combining both sources of information, i.e., the prior and the
likelihood, to form a posterior probability using the so-called Bayes' rule (named after Rev. Thomas
Bayes, 1702-1761):

Posterior probability of X being GREEN ∝ P(GREEN) × Likelihood of X given GREEN = (40/60) × (1/40) = 1/60
Posterior probability of X being RED ∝ P(RED) × Likelihood of X given RED = (20/60) × (3/20) = 3/60

Finally, we classify X as RED since its class membership achieves the largest posterior probability.

Thereafter, load the requisite data, select Naïve Bayes as the classifier and click Start. The corresponding
classification will be done, and precision and recall values will be calculated as given below.

If P denotes precision and R denotes recall, then:

P = (number of data points correctly classified) / (total number of data points classified)

R = (number of correctly classified data points retrieved) / (total number of correctly classified data points in the database)
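
For reference, the same kind of evaluation can be reproduced outside WEKA. The following is a minimal Python sketch (an assumption, since the lab itself uses the WEKA GUI) using scikit-learn's Gaussian Naive Bayes learner and its bundled breast cancer dataset as a stand-in for the data file:

# Minimal sketch: Naive Bayes with precision/recall, assuming scikit-learn
# and its bundled breast cancer dataset as a stand-in for the WEKA data file.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import precision_score, recall_score

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42)

model = GaussianNB()                      # Gaussian Naive Bayes learner
model.fit(X_train, y_train)
predicted = model.predict(X_test)

print("Precision:", precision_score(y_test, predicted))
print("Recall:", recall_score(y_test, predicted))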

Conclusion: Precision and Recall values will be calculated for Naïve bayes algorithm on the given
dataset.

18
Experiment No. 2

1. Aim: Study and Implement the Decision Tree learners using WEKA. (The datasets taken can

be: Breast Cancer data file or Reuter’s data set).

2. Software to be used: WEKA

3. Pedagogy and Algorithm:

3.1 Refer to section 4.1, 4.2 and 4.3 of Experiment No. 1 to understand how to open WEKA and load the
respective dataset.

3.2 Decision Tree learners:

Decision Trees are a type of Supervised Machine Learning (that is you explain what the input is and what
the corresponding output is in the training data) where the data is continuously split according to a certain
parameter. The tree can be explained by two entities, namely decision nodes and leaves. The leaves are
the decisions or the final outcomes. And the decision nodes are where the data is split.

Now that we know what a Decision Tree is, we'll see how it works internally. There are many algorithms
that construct Decision Trees, but one of the best known is the ID3 algorithm. ID3 stands for Iterative
Dichotomiser 3.

Before discussing the ID3 algorithm, we’ll go through few definitions.

Entropy

Entropy, also called Shannon entropy and denoted by H(S) for a finite set S, is a measure of the amount of
uncertainty or randomness in the data. Formally, H(S) = − Σ p(x) log2 p(x), where the sum runs over the
possible outcomes x.

Intuitively, it tells us about the predictability of a certain event. For example, consider a coin toss whose
probability of heads is 0.5 and probability of tails is 0.5. Here the entropy is the highest possible, since
there is no way of determining what the outcome might be. Alternatively, consider a coin which has heads
on both sides; the outcome of such an event can be predicted perfectly, since we know beforehand that it
will always be heads. In other words, this event has no randomness, hence its entropy is zero.

In particular, lower values imply less uncertainty while higher values imply high uncertainty.

Information Gain

Information gain, also referred to as Kullback-Leibler divergence and denoted by IG(S, A) for a set S, is
the effective change in entropy after deciding on a particular attribute A. It measures the relative change
in entropy with respect to the independent variables.

19
IG(S, A) = H(S) − H(S, A)

Alternatively,

IG(S, A) = H(S) − Σ P(x) · H(x)

where IG(S, A) is the information gain obtained by applying feature A, H(S) is the entropy of the entire
set, and the second term calculates the entropy after applying feature A, where P(x) is the probability of
event x.
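
As a quick illustration of these two quantities, the following is a small self-contained Python sketch; the 'outlook' and 'play' arrays are made-up toy data for illustration only and are not part of the lab datasets:

# Minimal sketch: computing entropy and information gain for one attribute.
import numpy as np

def entropy(labels):
    """Shannon entropy H(S) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(attribute, labels):
    """IG(S, A) = H(S) - sum over values x of P(x) * H(S_x)."""
    gain = entropy(labels)
    for value in np.unique(attribute):
        subset = labels[attribute == value]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

outlook = np.array(['sunny', 'sunny', 'overcast', 'rain', 'rain', 'overcast'])
play    = np.array(['no',    'no',    'yes',      'yes',  'no',   'yes'])
print("H(S) =", entropy(play))
print("IG(S, outlook) =", information_gain(outlook, play))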
ID3 Algorithm will perform following tasks recursively

1. Create root node for the tree

2. If all examples are positive, return leaf node ‘positive’

3. Else if all examples are negative, return leaf node ‘negative’

4. Calculate the entropy of current state H(S)

5. For each attribute, calculate the entropy with respect to the attribute ‘x’ denoted by H(S, x)

6. Select the attribute which has maximum value of IG(S, x)

7. Remove the attribute that offers highest IG from the set of attributes

8. Repeat until we run out of all attributes, or the decision tree has all leaf nodes.

Thereafter, go to the Classify tab, select a decision tree as the classifier and click Start. The given dataset
will be classified using the Decision Tree learner.

Conclusion: The given dataset is classified using Decision Tree learners.

20
Experiment No. 3

1. Aim: Estimate the accuracy of decision classifier on breast cancer dataset using 5-fold cross-
validation. (You need to choose the appropriate options for missing values).

2. Software to be used: Jupyter/Python/WEKA

3. Pedagogy/Algorithm:

3.1 Open Jupyter software.

3.2 Click on New and select a Python file/notebook. In the various cells/sections, we can write and run our
code. A code snippet for this experiment is given for a basic understanding of Python syntax.

3.3 Decision Tree Classifier:

Algorithm will perform following tasks recursively

1. Create root node for the tree

2. If all examples are positive, return leaf node ‘positive’

3. Else if all examples are negative, return leaf node ‘negative’

4. Calculate the entropy of current state H(S)

5. For each attribute, calculate the entropy with respect to the attribute ‘x’ denoted by H(S, x)

6. Select the attribute which has maximum value of IG(S, x)

7. Remove the attribute that offers highest IG from the set of attributes

8. Repeat until we run out of all attributes, or the decision tree has all leaf nodes.

3.4 k-Fold Cross-Validation

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data
sample.

The procedure has a single parameter called k that refers to the number of groups that a given data sample
is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for
k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold
cross-validation.

Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning
model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to
perform in general when used to make predictions on data not used during the training of the model.

21
It is a popular method because it is simple to understand and because it generally results in a less biased
or less optimistic estimate of the model skill than other methods, such as a simple train/test split.

The general procedure is as follows:

1. Shuffle the dataset randomly.

2. Split the dataset into k groups

3. For each unique group:

1. Take the group as a hold out or test data set

2. Take the remaining groups as a training data set

3. Fit a model on the training set and evaluate it on the test set

4. Retain the evaluation score and discard the model

4. Summarize the skill of the model using the sample of model evaluation scores.

3.5 Code Snippet

from sklearn import datasets
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

# load the iris dataset
# (for this lab, datasets.load_breast_cancer() can be used instead)
dataset = datasets.load_iris()

# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(dataset.data, dataset.target)
print(model)

# make predictions
expected = dataset.target
predicted = model.predict(dataset.data)

# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

For adding 5-fold cross-validation use:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, dataset.data, dataset.target, cv=5)
print("5-fold cross-validation accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))

Conclusion: The required dataset has been classified using Decision tree and 5-fold cross validation.

23
Experiment No. 4

1. Aim: Estimate the precision, recall, accuracy, and F-measure of the decision tree

classifier on the text classification task for each of the 10 categories using 10-fold cross-

validation.

2. Software Used: Jupyter/ Python/WEKA

3. Pedagogy/ Algorithm:

3.1 Load the text classification data from web.


(https://archive.ics.uci.edu/ml/datasets.html?area=&att=&format=&numAtt=&numIns=&sort=nameUp&task=&type=text&view=table)

3.2 Apply the decision-tree-based classifier to calculate precision, recall, accuracy and F-score. Precision
and recall are explained in section 4 of Experiment 1; F-score and accuracy are explained here. (A code
sketch of the whole evaluation follows the algorithm steps below.)

Accuracy = (number of data samples correctly classified) / (total number of samples)

F-score: The traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and
recall:

F1 = 2 · (Precision · Recall) / (Precision + Recall)

Decision Tree Classifier:

Algorithm will perform following tasks recursively

1. Create root node for the tree

2. If all examples are positive, return leaf node ‘positive’

3. Else if all examples are negative, return leaf node ‘negative’

4. Calculate the entropy of current state H(S)

5. For each attribute, calculate the entropy with respect to the attribute ‘x’ denoted by H(S, x)

6. Select the attribute which has maximum value of IG(S, x)

7. Remove the attribute that offers highest IG from the set of attributes

8. Repeat until we run out of all attributes, or the decision tree has all leaf nodes.
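
A minimal Python sketch of the evaluation is given below. It assumes scikit-learn and uses its 20 Newsgroups loader as a stand-in text classification corpus (the lab may instead use one of the UCI text datasets from the link above); the corpus is downloaded on first use.

# Minimal sketch: decision tree on a text classification task with 10-fold CV.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report, accuracy_score

data = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
X = TfidfVectorizer(max_features=5000).fit_transform(data.data)
y = data.target

clf = DecisionTreeClassifier(random_state=0)
# 10-fold cross-validated predictions, then per-category precision/recall/F1.
predicted = cross_val_predict(clf, X, y, cv=10)
print("Accuracy:", accuracy_score(y, predicted))
print(classification_report(y, predicted, target_names=data.target_names))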

Conclusion: Algorithm has been run for all the parameters on the given dataset.

24
Experiment No. 5

1. Aim: Develop a machine learning method to classify your incoming mail.

2. Software used: Jupyter/ Python

3. Pedagogy/Algorithm

3.1 Load the data from web. (https://www.kaggle.com/wcukierski/enron-email-dataset)

3.2 Use a logistic regression algorithm to classify.

Logistic Regression Algorithm:

Logistic regression is named for the function used at the core of the method, the logistic function.

The logistic function, also called the sigmoid function was developed by statisticians to describe
properties of population growth in ecology, rising quickly and maxing out at the carrying capacity of
the environment. It’s an S-shaped curve that can take any real-valued number and map it into a value
between 0 and 1, but never exactly at those limits.

1 / (1 + e^-value)

Where e is the base of the natural logarithms (Euler’s number or the EXP() function in your
spreadsheet) and value is the actual numerical value that you want to transform. Below is a plot of the
numbers between -5 and 5 transformed into the range 0 and 1 using the logistic function.

Logistic regression is a linear method, but the predictions are transformed using the logistic function.
The impact of this is that we can no longer understand the predictions as a linear combination of the
inputs as we can with linear regression, for example, continuing on from above, the model can be stated
as:

p(X) = e^(b0 + b1*X) / (1 + e^(b0 + b1*X))

Logistic Regression measures the relationship between the dependent variable (our label, what we want
to predict) and one or more independent variables (our features) by estimating probabilities using its
underlying logistic function.

These probabilities must then be transformed into binary values in order to actually make a prediction.
This is the task of the logistic function, also called the sigmoid function. The Sigmoid-Function is an S-
shaped curve that can take any real-valued number and map it into a value between the range of 0 and 1,
but never exactly at those limits. These values between 0 and 1 will then be transformed into either 0 or
1 using a threshold classifier.

We iterate the algorithm over all data points until each of them is classified as junk or useful mail. The
following functions from the Python sklearn package are used:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(dataset.data, dataset.target)
print(model)
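
A fuller, hedged sketch of the mail-classification task is given below. The 'emails.csv' file name and the 'message'/'label' column names are assumptions about how the downloaded e-mail data has been preprocessed, not part of the dataset itself.

# Minimal sketch: classify e-mails with TF-IDF features + logistic regression.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv('emails.csv')                # assumed: columns 'message', 'label'
X = TfidfVectorizer(stop_words='english').fit_transform(df['message'])
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000)       # sigmoid outputs thresholded at 0.5
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))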

Conclusion: The given dataset is classified using Logistic regression algorithm.

26
Experiment No. 6

1. Aim: Develop a machine learning method to Predict stock prices based on past price variation.

2. Software Used: Jupyter/ Python

3. Pedagogy/ Algorithm:

3.1 Load the data. (https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs)

3.2 Use Random forest algorithm to classify the dataset:

Random Forest Algorithm:

To understand and use the various options, further information about how they are computed is useful.
Most of the options depend on two data objects generated by random forests.

When the training set for the current tree is drawn by sampling with replacement, about one-third of the
cases are left out of the sample. This oob (out-of-bag) data is used to get a running unbiased estimate of
the classification error as trees are added to the forest. It is also used to get estimates of variable
importance.

After each tree is built, all of the data are run down the tree, and proximities are computed for each pair
of cases. If two cases occupy the same terminal node, their proximity is increased by one. At the end of
the run, the proximities are normalized by dividing by the number of trees. Proximities are used in
replacing missing data, locating outliers, and producing illuminating low-dimensional views of the data.

The out-of-bag (oob) error estimate

In random forests, there is no need for cross-validation or a separate test set to get an unbiased
estimate of the test set error. It is estimated internally, during the run, as follows:

Each tree is constructed using a different bootstrap sample from the original data. About one-third of
the cases are left out of the bootstrap sample and not used in the construction of the kth tree.

Put each case left out in the construction of the kth tree down the kth tree to get a classification. In this
way, a test set classification is obtained for each case in about one-third of the trees. At the end of the
run, take j to be the class that got most of the votes every time case n was oob. The proportion of times
that j is not equal to the true class of n averaged over all cases is the oob error estimate. This has proven
to be unbiased in many tests.

Variable importance

In every tree grown in the forest, put down the oob cases and count the number of votes cast for the
correct class. Now randomly permute the values of variable m in the oob cases and put these cases
down the tree. Subtract the number of votes for the correct class in the variable-m-permuted oob data

27
from the number of votes for the correct class in the untouched oob data. The average of this number
over all trees in the forest is the raw importance score for variable m.

If the values of this score from tree to tree are independent, then the standard error can be computed by
a standard computation. The correlations of these scores between trees have been computed for a
number of data sets and proved to be quite low, therefore we compute standard errors in the classical
way, divide the raw score by its standard error to get a z-score, and assign a significance level to the
z-score assuming normality.
If the number of variables is very large, forests can be run once with all the variables, then run again
using only the most important variables from the first run.

For each case, consider all the trees for which it is oob. Subtract the percentage of votes for the correct
class in the variable-m-permuted oob data from the percentage of votes for the correct class in the
untouched oob data. This is the local importance score for variable m for this case, and is used in the
graphics program RAFT.

Gini importance

Every time a split of a node is made on variable m the gini impurity criterion for the two descendent
nodes is less than the parent node. Adding up the gini decreases for each individual variable over all trees
in the forest gives a fast variable importance that is often very consistent with the permutation importance
measure.

Interactions

The operating definition of interaction used is that variables m and k interact if a split on one variable,
say m, in a tree makes a split on k either systematically less possible or more possible. The
implementation used is based on the gini values g(m) for each tree in the forest. These are ranked for
each tree and for each two variables, the absolute difference of their ranks are averaged over all trees.

This number is also computed under the hypothesis that the two variables are independent of each other
and the latter subtracted from the former. A large positive number implies that a split on one variable
inhibits a split on the other and conversely. This is an experimental procedure whose conclusions need
to be regarded with caution. It has been tested on only a few data sets.

Proximities

These are one of the most useful tools in random forests. The proximities originally formed a NxN
matrix. After a tree is grown, put all of the data, both training and oob, down the tree. If cases k and n are
in the same terminal node increase their proximity by one. At the end, normalize the proximities by
dividing by the number of trees.

Users noted that with large data sets, they could not fit an NxN matrix into fast memory. A modification
reduced the required memory size to NxT where T is the number of trees in the forest. To speed up the

28
computation-intensive scaling and iterative missing value replacement, the user is given the option of
retaining only the nrnn largest proximities to each case.

When a test set is present, the proximities of each case in the test set with each case in the training set
can also be computed. The amount of additional computing is moderate.

Scaling

The proximities between cases n and k form a matrix {prox(n,k)}. From their definition, it is easy to
show that this matrix is symmetric, positive definite and bounded above by 1, with the diagonal elements
equal to 1. It follows that the values 1-prox(n,k) are squared distances in a Euclidean space of dimension
not greater than the number of cases. For more background on scaling see "Multidimensional Scaling" by
T.F. Cox and M.A. Cox.

Let prox(-,k) be the average of prox(n,k) over the 1st coordinate, prox(n,-) be the average of prox(n,k)
over the 2nd coordinate, and prox(-,-) the average over both coordinates. Then the matrix

cv(n,k)=.5*(prox(n,k)-prox(n,-)-prox(-,k)+prox(-,-))

is the matrix of inner products of the distances and is also positive definite symmetric. Let the
eigenvalues of cv be l(j) and the eigenvectors be nj(n). Then the vectors

x(n) = (√l(1) n1(n), √l(2) n2(n), ...)

have squared distances between them equal to 1 − prox(n,k). The values of √l(j) nj(n) are referred to as
the jth scaling coordinate.

In metric scaling, the idea is to approximate the vectors x(n) by the first few scaling coordinates. This is
done in random forests by extracting the largest few eigenvalues of the cv matrix, and their corresponding
eigenvectors . The two dimensional plot of the ith scaling coordinate vs. the jth often gives useful
information about the data. The most useful is usually the graph of the 2nd vs. the 1st.

Since the eigenfunctions are the top few of an NxN matrix, the computational burden may be time
consuming. We advise taking nrnn considerably smaller than the sample size to make this computation
faster.

There are more accurate ways of projecting distances down into low dimensions, for instance the
Roweis and Saul algorithm. But the nice performance, so far, of metric scaling has kept us from
implementing more accurate projection algorithms. Another consideration is speed. Metric scaling is
the fastest current algorithm for projecting down.

Generally three or four scaling coordinates are sufficient to give good pictures of the data. Plotting
the second scaling coordinate versus the first usually gives the most illuminating view.

Prototypes

29
Prototypes are a way of getting a picture of how the variables relate to the classification. For the jth class,
we find the case that has the largest number of class j cases among its k nearest neighbors, determined
using the proximities. Among these k cases we find the median, 25th percentile, and 75th percentile for
each variable. The medians are the prototype for class j and the quartiles give an estimate of its stability.
For the second prototype, we repeat the procedure but only consider cases that are not among the original
k, and so on. When we ask for prototypes to be output to the screen or saved to a file, prototypes for
continuous variables are standardized by subtracting the 5th percentile and dividing by the difference
between the 95th and 5th percentiles. For categorical variables, the prototype is the most frequent value.
When we ask for prototypes to be output to the screen or saved to a file, all frequencies are given for
categorical variables.

Missing value replacement for the training set

Random forests has two ways of replacing missing values. The first way is fast. If the mth variable is not
categorical, the method computes the median of all values of this variable in class j, then it uses this
value to replace all missing values of the mth variable in class j. If the mth variable is categorical, the
replacement is the most frequent non-missing value in class j. These replacement values are called fills.

The second way of replacing missing values is computationally more expensive but has given better
performance than the first, even with large amounts of missing data. It replaces missing values only in
the training set. It begins by doing a rough and inaccurate filling in of the missing values. Then it does a
forest run and computes proximities.

If x(m,n) is a missing continuous value, estimate its fill as an average over the non-missing values of
the mth variables weighted by the proximities between the nth case and the non-missing value case. If
it is a missing categorical variable, replace it by the most frequent non-missing value where frequency
is weighted by proximity.

Now iterate: construct a forest again using these newly filled-in values, find new fills, and iterate
again. Our experience is that 4-6 iterations are enough.

Missing value replacement for the test set

When there is a test set, there are two different methods of replacement depending on whether labels
exist for the test set.

If they do, then the fills derived from the training set are used as replacements. If labels do not exist,
then each case in the test set is replicated nclass times (nclass = number of classes). The first replicate of
a case is assumed to be class 1 and the class 1 fills are used to replace its missing values. The 2nd replicate
is assumed to be class 2 and the class 2 fills are used on it.

30
This augmented test set is run down the tree. In each set of replicates, the one receiving the most votes
determines the class of the original case.

Mislabeled cases

The training sets are often formed by using human judgment to assign labels. In some areas this leads
to a high frequency of mislabeling. Many of the mislabeled cases can be detected using the outlier
measure. An example is given in the DNA case study.

Outliers

Outliers are generally defined as cases that are removed from the main body of the data. Translate this
as: outliers are cases whose proximities to all other cases in the data are generally small. A useful
revision is to define outliers relative to their class. Thus, an outlier in class j is a case whose proximities
to all other class j cases are small.

Define the average proximity from case n in class j to the rest of the training data in class j as:

P(n) = Σ prox²(n, k), where the sum runs over the cases k in class j.

The raw outlier measure for case n is defined as

nsample / P(n)

This will be large if the average proximity is small. Within each class find the median of these raw
measures, and their absolute deviation from the median. Subtract the median from each raw measure,
and divide by the absolute deviation to arrive at the final outlier measure.

3.3 The following functions from the sklearn package are used (assuming 'train' is a pandas DataFrame
whose feature column names are listed in 'features' and whose label column is 'species'):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

y = pd.factorize(train['species'])[0]      # encode string labels as integers
clf = RandomForestClassifier(n_jobs=2)
clf.fit(train[features], y)
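
For the stock-price task itself, one possible approach is sketched below. It is only a sketch: the 'aapl.us.txt' file name and the use of the previous five closing prices as features are assumptions about the Kaggle price-volume data, not a prescribed method.

# Minimal sketch: predict the next day's closing price from lagged prices
# with a random forest. 'aapl.us.txt' is one assumed file from the Kaggle set.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

df = pd.read_csv('aapl.us.txt')                      # assumed columns include Date, Close
for lag in range(1, 6):                              # past 5 closing prices as features
    df[f'lag_{lag}'] = df['Close'].shift(lag)
df = df.dropna()

X = df[[f'lag_{lag}' for lag in range(1, 6)]]
y = df['Close']
split = int(len(df) * 0.8)                           # chronological train/test split

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:split], y[:split])
pred = model.predict(X[split:])
print("MAE:", mean_absolute_error(y[split:], pred))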

Conclusion: Classification of the given data is achieved using Random Forest algorithm.

31
Experiment No. 7

1. Aim: Develop a machine learning method to predict how people would rate movies, books, etc.

2. Software Used: Jupyter/ Python

3. Pedagogy/ Algorithm:

3.1 Load the data. (https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data)

3.2 Use KNN algorithm to classify the data

KNN Algorithm:

Let’s take a simple case to understand this algorithm. Following is a spread of red circles (RC) and
green squares (GS) in fig. 1.

Fig. 1 : Representation of the dataset

You intend to find out the class of the blue star (BS). BS can either be RC or GS and nothing else. The
'K' in the KNN algorithm is the number of nearest neighbors we wish to take a vote from. Let's say K = 3.
Hence, we will now make a circle with BS as the center just big enough to enclose only three data points
on the plane. Refer to fig. 2 for more details:

Fig. 2: Identifying closest neighbors

The three closest points to BS are all RC. Hence, with a good confidence level, we can say that BS
should belong to the class RC. Here, the choice became very obvious as all three votes from the closest
neighbors went to RC. The choice of the parameter K is very crucial in this algorithm.

32
For getting the predicted class, iterate from 1 to the total number of training data points (a code sketch follows these steps):

1. Calculate the distance between test data and each row of training data. Here we will use
Euclidean distance as our distance metric since it’s the most popular method. The other
metrics that can be used are Chebyshev, cosine, etc.

2. Sort the calculated distances in ascending order based on distance values

3. Get top k rows from the sorted array

4. Get the most frequent class of these rows

5. Return the predicted class
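
A minimal Python sketch of this classification step is given below, assuming scikit-learn. The movie-review data would first need to be vectorized into numeric features, so the bundled iris dataset is used here purely to illustrate the KNN call (an assumption, not part of the lab data).

# Minimal sketch: k-nearest-neighbours classification with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean')  # K = 3, Euclidean distance
knn.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, knn.predict(X_test)))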

Conclusion: KNN algorithm is used on dataset of movies for classification using their ratings.

33
Experiment No. 8

1. Aim: Develop a machine learning method to Cluster gene expression data, how to modify

existing methods to solve the problem better.

2. Software Used: Jupyter/Python

3. Pedagogy/Algorithm:

3.1 Load the data (https://www.kaggle.com/crawford/gene-expression).

3.2 Apply Support Vector Machine (SVM) Algorithm to classify the data.

SVM Algorithm:

SVM stands for Support Vector Machine. It is a machine learning approach used for classification and
regression analysis. It relies on supervised learning models trained by learning algorithms, which analyze
large amounts of data to identify patterns.

An SVM generates a partition bounded by two parallel margins for each category of data in a high-
dimensional space, using almost all attributes. It separates the space in a single pass into flat, linear
partitions, dividing the two categories by a clear gap that should be as wide as possible. This partitioning
is done by a plane called a hyperplane.

An SVM creates hyperplanes that have the largest margin in a high-dimensional space to separate given
data into classes. The margin between the 2 classes represents the longest distance between closest data
points of those classes.

The larger the margin, the lower is the generalization error of the classifier.


After training, map the new data to the same space to predict which category it belongs to; the new data
is categorized into the partitions learned from the training data.

Of all the available classifiers, SVM provides the largest flexibility.

SVMs are like probabilistic approaches but do not consider dependencies among attributes.

3. SVM Algorithm

To understand the algorithm of SVM, consider two cases:

Separable case – Infinitely many boundaries are possible to separate the data into two classes.

Non-separable case – The two classes are not separated but overlap with each other.

3.1. The Separable Case

34
In the separable case, infinite boundaries are possible. The boundary that gives the largest distance to the
nearest observation is called the optimal hyperplane. The optimal hyperplane ensures the fit and
robustness of the model. To find the optimal hyperplane, use the following equation.

a.x + b = 0

Here, a.x is the scalar product of a and x. This equation must satisfy the following two conditions:


It should separate the two classes A and B very well so that the function defined by:

f(x) = a.x + b is positive if and only if x ∈ A

f(x) ≤ 0 if and only if x ∈ B

It should lie as far away as possible from all the observations (robustness of the model), given that
the distance from an observation x to the hyperplane is |a.x + b| / ||a||.

The width of the space between the observations is 2/||a||; it is called the margin and it should be as large
as possible. The hyperplane depends on the closest points, called support points. The generalization
capacity of the SVM increases as the number of support points decreases.

3.2. The Non-Separable Case

If the two classes are not perfectly separated but overlap, a term measuring the classification error must
be added to each of the following two conditions:

For every i, yi(a.xi + b) ≥ 1 (correct separation)

½*||a||^2 is minimal (greatest margin)

This term is defined for each observation xi on the wrong side of the boundary by measuring the distance
separating it from the boundary of the margin on the side of its class. This distance is then normalized by
dividing it by the half-margin 1/||a||, giving a term ξi, called the slack variable. An error in the model is an
observation for which ξi > 1. The sum of all the ξi represents the set of classification errors. So, the
previous two constraints for finding the optimal hyperplane become:

For every i, yi(a.xi + b) ≥ 1 – ξi

½*||a||^2 + δ Σi ξi is minimal

3.3 Python functions for SVM (assuming X and y hold the feature matrix and the class labels):

from sklearn import svm

C = 1.0  # SVM regularization parameter
svc = svm.SVC(kernel='linear', C=C).fit(X, y)
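
A slightly fuller sketch of the gene-expression task is shown below. The 'gene_expression.csv' file and its 'label' column are assumptions about a preprocessed version of the Kaggle download (samples as rows, genes as columns), not its raw layout.

# Minimal sketch: SVM classification of gene-expression samples.
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv('gene_expression.csv')       # assumed preprocessed file
X = df.drop(columns=['label'])
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))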

Conclusion: SVM classified gene expression data successfully.

35
Experiment No. 9

1. Aim: Select two datasets. Each dataset should contain examples from multiple classes. For
training purposes assume that the class label of each example is unknown (if it is known, ignore it).
Implement the Kmeans algorithm and apply it to the data you selected. Evaluate performance by
measuring the sum of Euclidean distance of each example from its class center. Test the
performance of the algorithm as a function of the parameter k.

2. Software used: Jupyter/Python

3. Pedagogy/Algorithm

3.1. Load two datasets. (Step 3.1 from experiment 7 and experiment 8)

3.2. Implement K-means clustering for the datasets.

K-Means Clustering

The K-means clustering algorithm uses iterative refinement to produce a final result. The algorithm inputs
are the number of clusters K and the data set; the data set is a collection of features for each data point.
The algorithm starts with initial estimates for the K centroids, which can either be randomly generated or
randomly selected from the data set. The algorithm then iterates between two steps:

1. Data assignment step:

Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest centroid,
based on the squared Euclidean distance. More formally, if ci is a centroid in the collection C, then each
data point x is assigned to the cluster

argmin over ci ∈ C of dist(ci, x)^2

where dist(·) is the standard (L2) Euclidean distance. Let the set of data point assignments for each ith
cluster centroid be Si.

2. Centroid update step:

In this step, the centroids are recomputed. This is done by taking the mean of all data points assigned to
that centroid's cluster:

ci = (1/|Si|) Σ xi, where the sum runs over the points xi ∈ Si.

The algorithm iterates between steps one and two until a stopping criterion is met (i.e., no data points
change clusters, the sum of the distances is minimized, or some maximum number of iterations is
reached).

36
This algorithm is guaranteed to converge to a result. The result may be a local optimum (i.e. not
necessarily the best possible outcome), meaning that assessing more than one run of the algorithm with
randomized starting centroids may give a better outcome.

Choosing K

The algorithm described above finds the clusters and data set labels for a particular pre-chosen K. To find
the number of clusters in the data, the user needs to run the K-means clustering algorithm for a range of K
values and compare the results. In general, there is no method for determining exact value of K, but an
accurate estimate can be obtained using the following techniques.

One of the metrics that is commonly used to compare results across different values of K is the mean
distance between data points and their cluster centroid. Since increasing the number of clusters will
always reduce the distance to data points, increasing K will always decrease this metric, to the extreme of
reaching zero when K is the same as the number of data points. Thus, this metric cannot be used as the
sole target. Instead, mean distance to the centroid as a function of K is plotted and the "elbow point,"
where the rate of decrease sharply shifts, can be used to roughly determine K.

A number of other techniques exist for validating K, including cross-validation, information criteria, the
information theoretic jump method, the silhouette method, and the G-means algorithm. In addition,
monitoring the distribution of data points across groups provides insight into how the algorithm is
splitting the data for each K.

3.3 Python function for implementing K-means:

from sklearn.cluster import KMeans

estimators = [('dataset_name', KMeans(n_clusters=8)),
              ('dataset_name', KMeans(n_clusters=3)),
              ('dataset_name', KMeans(n_clusters=3, n_init=1, init='random'))]

3.4 Test the performance on the two datasets, using the sum of Euclidean distances of each example from
its cluster center as the evaluation measure, for different values of k (a sketch is given below).
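
A minimal sketch of this evaluation, assuming scikit-learn, is shown below. The iris data stands in for one of the chosen datasets, and KMeans' inertia_ attribute gives the sum of squared Euclidean distances of the samples to their cluster centres.

# Minimal sketch: K-means performance as a function of k.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)        # class labels are ignored for clustering

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}  sum of squared distances = {km.inertia_:.2f}")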

Conclusion: K means classification achieved for two datasets and the performance is compared.

37
Experiment No. 10

1. Aim: Implement the EM algorithm assuming a Gaussian mixture. Apply the algorithm to your
datasets and report the parameters you obtain. Evaluate performance by measuring the sum of
Mahalanobis distance of each example from its class center. Test performance as a function of the
number of clusters.

2. Software used: Jupyter/Python

3. Pedagogy/Algorithm

3.1. Load a few datasets from the links given in the above experiments.

3.2 Apply Expectation Maximization (EM) algorithm.

EM algorithm

The Expectation Maximization (EM) algorithm can be used to generate the best hypothesis for the
distributional parameters of some multi-modal data. The best hypothesis for the distributional parameters
is the maximum likelihood hypothesis – the one that maximizes the probability that this data we are
looking at comes from K distributions, each with a mean mk and variance sigmak2. In this tutorial we are
assuming that we are dealing with K normal distributions.

In a single modal normal distribution this hypothesis h is estimated directly from the data as:

estimated m = m~ = sum(xi) / N   (1)

estimated sigma^2 = sigma^2~ = sum((xi - m~)^2) / N   (2)

These are simply the familiar arithmetic average and variance. In a multi-modal distribution we need to
estimate h = [ m1, m2, ..., mK; sigma1^2, sigma2^2, ..., sigmaK^2 ]. The EM algorithm is going to help us
do this. Let's see how.
We begin with some initial estimate for each mk~ and sigmak2~. We will have a total of K estimates for
each parameter. The estimates can be taken from the plots we made earlier, our domain knowledge, or
they even can be wild (but not too wild) guesses. We then proceed to take each data point and answer the
following question – what is the probability that this data point was generated from a normal distribution
with mean mk~ and sigmak2~? That is, we repeat this question for each set of our distributional
parameters. In Figure 1 we plotted data from 2 distributions. Thus we need to answer these questions
twice – what is the probability that a data point xi, i=1,...N, was drawn from N(m1~, sigma12~) and what is
the probability that it was drawn from N(m2~, sigma22~). By the normal density function we get:
P(xi belongs to N(m1~, sigma1^2~)) = 1/sqrt(2*pi*sigma1^2~) * exp(-(xi - m1~)^2 / (2*sigma1^2~))   (3)

P(xi belongs to N(m2~, sigma2^2~)) = 1/sqrt(2*pi*sigma2^2~) * exp(-(xi - m2~)^2 / (2*sigma2^2~))   (4)
The individual probabilities only tell us half of the story because we still need to take into account the
probability of picking N(m1~, sigma12~) or N(m2~, sigma22~) to draw the data from. We now arrive at
what is known as responsibilities of each distribution for each data point. In a classification task this
responsibility can be expressed as the probability that a data point xi belongs to some class ck:

38
P(xi belongs to ck) = omegak~ * P(xi belongs to N(mk~, sigmak^2~)) / sum over k of (omegak~ * P(xi belongs to N(mk~, sigmak^2~)))   (5)

In Equation 5 we introduce a new parameter omegak~ which is the probability of picking k’s distribution
to draw the data point from. Figure 1 indicates that each of our two clusters are equally likely to be
picked. But like with mk~ and sigmak2~ we do not really know the value for this parameter. Therefore we
need to guess it and it is a part of our hypothesis:

h = [ m1, m2, ..., mK; sigma1^2, sigma2^2, ..., sigmaK^2; omega1~, omega2~, ..., omegaK~ ]   (6)
You could be asking yourself where the denominator in Equation 5 comes from. The denominator is the
sum of probabilities of observing xi in each cluster weighted by that cluster’s probability. Essentially, it is
the total probability of observing xi in our data.
If we are making hard cluster assignments, we will take the maximum P(xi belongs to ck) and assign the
data point to that cluster. We repeat this probabilistic assignment for each data point. In the end this will
give us the first data ‘re-shuffle’ into K clusters. We are now in a position to update the initial estimates
for h to h'. These two steps of estimating the distributional parameters and updating them after
probabilistic data assignments to clusters is repeated until convergences to h*. In summary, the two steps
of the EM algorithm are:

1. E-step: perform probabilistic assignments of each data point to some class based on the current
hypothesis h for the distributional class parameters;

2. M-step: update the hypothesis h for the distributional class parameters based on the new data
assignments.

During the E-step we compute the expected cluster assignments (the responsibilities) under the current
hypothesis. During the M-step we compute a new maximum likelihood estimate of the hypothesis given those assignments.
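
A minimal NumPy sketch of these two steps for a two-component, one-dimensional mixture is given below (the function name, the initial guesses and the synthetic data are illustrative assumptions only, not part of the experiment):

import numpy as np

def em_gmm_1d(x, n_iter=50):
    # initial guesses for the hypothesis h = [m1, m2; sigma1^2, sigma2^2; omega1, omega2]
    m = np.array([x.min(), x.max()], dtype=float)   # means
    s = np.array([x.var(), x.var()], dtype=float)   # variances
    w = np.array([0.5, 0.5])                        # mixing weights omegak
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each point (Equation 5)
        dens = np.stack([np.exp(-(x - m[k]) ** 2 / (2 * s[k])) / np.sqrt(2 * np.pi * s[k])
                         for k in range(2)])        # shape (2, N)
        resp = w[:, None] * dens
        resp /= resp.sum(axis=0, keepdims=True)
        # M-step: update means, variances and weights from the responsibilities
        nk = resp.sum(axis=1)
        m = (resp * x).sum(axis=1) / nk
        s = (resp * (x - m[:, None]) ** 2).sum(axis=1) / nk
        w = nk / x.size
    return m, s, w

# usage on synthetic data drawn from two normal distributions
np.random.seed(0)
x = np.concatenate([np.random.normal(0, 1, 300), np.random.normal(5, 1.5, 300)])
print(em_gmm_1d(x))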

3.3 Calculate Mahalanobis distance using the given formula.


The Mahalanobis distance of an observation x from a set of observations
with mean mu and covariance matrix S is defined as:

D(x) = sqrt( (x - mu)^t * S^(-1) * (x - mu) )
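
A minimal NumPy sketch of this calculation (the helper name and the sample values are illustrative; scipy.spatial.distance.mahalanobis could equally be used):

import numpy as np

def mahalanobis(x, data):
    # distance of observation x from the set of observations in data (one row per observation)
    mu = data.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(data, rowvar=False))
    diff = x - mu
    return np.sqrt(diff @ S_inv @ diff)

# usage on a small 2-D sample (values are arbitrary)
data = np.array([[1.0, 2.0], [2.0, 3.5], [3.0, 3.0], [4.0, 5.0], [5.0, 5.5]])
print(mahalanobis(np.array([3.0, 4.0]), data))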

3.4 Compare the performance using the number of clusters as a parameter.

Conclusion: Applied the EM algorithm to maximize the expected likelihood and accordingly obtain a
clustering of the given datasets.

Experiment No. 11:

1. Aim: Suggest and test a method for automatically determining the number of clusters.

2. Software Used: Python/Jupyter

3. Pedagogy:

3.1. Load a dataset (the IRIS and Breast Cancer datasets are available inside the sklearn toolkit).

3.2. Apply K-means clustering.

3.3. Determine a method for automatically finding the number of clusters. One such method is the dark
block algorithm, whose steps are given below:

Step 1) A dissimilarity matrix 'm' of size n*n is generated from the input dataset 'S', where 'n' is the
size of 'S'; // initialization

Step 2) Set K ← {1, 2, 3, ..., n}, I ← J ← {}, P[] ← {0, 0, ..., 0};

Step 3) Select (i, j) ∈ argmax(mpq) such that p, q ∈ K, and set P[1] ← i, I ← {i}, J ← K - {i};

Step 4) For r ← 2, 3, ..., n: select (i, j) ∈ argmin(mpq) such that i ∈ I and j ∈ J, and set P[r] ← j,
I ← I ∪ {j}, J ← J - {j}; next r

Step 5) Obtain the ordered dissimilarity matrix 'R' using the ordering array P as Rij = mP[i]P[j] for
1 <= i, j <= n.

Step 6) Display the reordered dissimilarity matrix 'R' as an intensity image; dark blocks along the diagonal indicate the clusters.

We can use the VAT (Visual Assessment of cluster Tendency) algorithm as well; a Python sketch of this reordering is given below.
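
A minimal sketch of the above reordering, assuming a NumPy array of samples (the function name, the use of scipy/matplotlib and the IRIS example are illustrative choices):

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from sklearn.datasets import load_iris

def vat_reorder(data):
    m = cdist(data, data)                            # Step 1: n x n dissimilarity matrix
    n = m.shape[0]
    i, _ = np.unravel_index(np.argmax(m), m.shape)   # Step 3: start from the largest dissimilarity
    order = [i]
    remaining = list(set(range(n)) - {i})
    while remaining:                                 # Step 4: append the point closest to the ordered set
        sub = m[np.ix_(order, remaining)]
        j = remaining[int(np.argmin(sub.min(axis=0)))]
        order.append(j)
        remaining.remove(j)
    R = m[np.ix_(order, order)]                      # Step 5: reordered dissimilarity matrix
    plt.imshow(R, cmap='gray')                       # Step 6: dark blocks on the diagonal suggest clusters
    plt.show()
    return R

vat_reorder(load_iris().data)                        # counting the dark diagonal blocks estimates the number of clusters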

Conclusion: A method is suggested for automatically determining the number of clusters in a given
dataset.

Experiment No. 12

1. Aim: Using a dataset with known class labels compare the labeling error of the K-means and EM
algorithms. Measure the error by assigning a class label to each example. Assume that the number
of clusters is known.

2. Software Used: Jupyter/Python

3. Pedagogy:

3.1 Load Breast Cancer dataset from sklearn.

3.2 Assign class labels as either diseased or non-diseased. The number of clusters here is 2.

3.3 Run both algorithms as done in experiment no. 9 and 10.

3.4 Measure the error using the hError distance given below, or any other error measure.

The hError Clustering Algorithm

3.4.1 Distance Function: Here we use a greedy heuristic. The heuristic involves a tree hierarchy where
the lowest level of the tree consists of n clusters, each corresponding to one data point. At successive
levels, a pair of clusters is merged into a single cluster so that there is maximum increase in the value of
the objective function. We stop merging clusters when the desired number of clusters is obtained.

At an intermediate stage the greedy heuristic combines the pair of clusters Ci and Cj for which the
following distance is minimized, where xi and xj are the given data points, the superscript t denotes the
transpose, and Σ(i) and Σ(j) are the covariance (error) matrices associated with the two clusters:

dij = (xi − xj)^t [Σ(i) + Σ(j)]^(−1) (xi − xj)     (1)
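
A minimal scikit-learn sketch of this comparison (a simple misclassification rate with label matching is used here in place of the hError distance; the names and the label-matching heuristic are illustrative):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

def labeling_error(pred, y):
    # clusters carry no class labels, so try both ways of mapping the two cluster ids to the two classes
    err = np.mean(pred != y)
    return min(err, 1 - err)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
em_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)
print("K-means labeling error:", labeling_error(km_labels, y))
print("EM (GMM) labeling error:", labeling_error(em_labels, y))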

Conclusion: Errors for a given dataset are compared for both EM and K-Means Algorithm.

CONTENT BEYOND SYLLABUS

1. Application of Deep learning models on text or image classification.

Software: Python

Deep learning is an artificial intelligence function that imitates the workings of the human brain in
processing data and creating patterns for use in decision making. Deep learning is a subset of
machine learning in artificial intelligence (AI) that has networks capable of learning unsupervised
from data that is unstructured or unlabeled. It is also known as deep neural learning or deep neural
network.

Implementation:

Implementation requires defining the following components:

1. Operators
Also used interchangeably with layers, they are the basic building blocks of any neural network. Operators
are vector-valued functions that transform the data. Some commonly used operators are layers like linear,
convolution, and pooling, and activation functions like ReLU and Sigmoid.
2. Optimizers
They are the backbones of any deep learning library. They provide the necessary recipe to update model
parameters using their gradients with respect to the optimization objective. Some well-known optimizers
are SGD, RMSProp, and Adam.
3. Loss Functions
They are closed-form and differentiable mathematical expressions that are used as surrogates for the
optimization objective of the problem at hand. For example, cross-entropy loss and Hinge loss are
commonly used loss functions for classification tasks.
4. Initializers
They provide the initial values for the model parameters at the start of training. Initialization plays an
important role in training deep neural networks, as bad parameter initialization can lead to slow or no
convergence. There are many ways to initialize the network weights, for example with small random weights
drawn from a normal distribution. You may have a look at https://keras.io/initializers/ for a
comprehensive list.
5. Regularizers
They provide the necessary control mechanism to avoid overfitting and promote generalization. One can
regulate overfitting through either explicit or implicit measures. Explicit methods impose structural
constraints on the weights, for example minimization of their L1-norm or L2-norm, which makes the
weights sparser or more uniform respectively. Implicit measures are specialized operators that transform
the intermediate representations, either through explicit normalization (for example BatchNorm) or by
changing the network connectivity (for example DropOut and DropConnect).
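
As a small illustration of how these appear in code (a hedged Keras sketch; the layer sizes and penalty strength are arbitrary choices, not a recommended configuration):

from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization
from keras import regularizers

model = Sequential([
    # explicit regularization: an L2 penalty on this layer's weights
    Dense(128, activation='relu', input_shape=(64,),
          kernel_regularizer=regularizers.l2(0.01)),
    # implicit regularization: normalize intermediate activations and randomly drop units
    BatchNormalization(),
    Dropout(0.5),
    Dense(10, activation='softmax'),
])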

6. Automatic Differentiation (AD)


Every deep learning library provides a flavor of AD so that a user can focus on defining the model
structure (computation graph) and delegate the task of gradient computation to the AD module. Let us go
through an example to see how it works. Say we want to calculate the partial derivatives of the following
function with respect to its input variables x1 and x2:

y = sin(x1) + x1 * x2

Analytically, dy/dx1 = cos(x1) + x2 and dy/dx2 = x1; an AD module computes these values automatically.
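
A quick check of this example (a minimal sketch assuming TensorFlow 2.x, whose GradientTape provides the AD machinery behind Keras; the evaluation point is arbitrary):

import math
import tensorflow as tf

x1 = tf.Variable(1.0)
x2 = tf.Variable(2.0)
with tf.GradientTape() as tape:
    y = tf.sin(x1) + x1 * x2        # y = sin(x1) + x1*x2
dy_dx1, dy_dx2 = tape.gradient(y, [x1, x2])

# compare with the analytic derivatives dy/dx1 = cos(x1) + x2 and dy/dx2 = x1
print(float(dy_dx1), math.cos(1.0) + 2.0)
print(float(dy_dx2), 1.0)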

Code:

from keras.datasets import mnist


from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import Flatten
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.utils import np_utils
# load data
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# reshape to be [samples][width][height][channels]
X_train = X_train.reshape((X_train.shape[0], 28, 28, 1)).astype('float32')
X_test = X_test.reshape((X_test.shape[0], 28, 28, 1)).astype('float32')
# normalize inputs from 0-255 to 0-1
X_train = X_train / 255
X_test = X_test / 255
# one hot encode outputs
y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)
num_classes = y_test.shape[1]
# define a simple CNN model
def baseline_model():
    # create model
    model = Sequential()
    model.add(Conv2D(32, (5, 5), input_shape=(28, 28, 1), activation='relu'))
    model.add(MaxPooling2D())
    model.add(Dropout(0.2))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dense(num_classes, activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
# build the model
model = baseline_model()
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=200)

# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("CNN Error: %.2f%%" % (100-scores[1]*100))

2. To explore VGG NET for image/text classification.

Software Used: Python

Theory:

AlexNet came out in 2012 and was a revolutionary advancement; it improved on traditional
Convolutional Neural Networks (CNNs) and became one of the best models for image classification…
until VGG came out.

AlexNet.

When AlexNet was published, it easily won the ImageNet Large-Scale Visual Recognition Challenge
(ILSVRC) and proved itself to be one of the most capable models for object-detection out there. Its key
features include using ReLU instead of the tanh function, optimization for multiple GPUs, and
overlapping pooling. It addressed overfitting by using data augmentation and dropout. People just wanted
even more accurate models.

The Dataset.

The general baseline for image recognition is ImageNet, a dataset that consists of more than 15 million
images labeled with more than 22 thousand classes. Made through web-scraping images and crowd-
sourcing human labelers, ImageNet even hosts its own competition: the previously mentioned ImageNet
Large-Scale Visual Recognition Challenge (ILSVRC). Researchers from around the world are challenged
to innovate methodology that yields the lowest top-1 and top-5 error rates (top-5 error rate would be the
percent of images where the correct label is not one of the model’s five most likely labels). The
competition gives out a 1,000 class training set of 1.2 million images, a validation set of 50 thousand
images, and a test set of 150 thousand images; data is plentiful. AlexNet won this competition in 2012,
and models based on its design won it in 2013.

[Figure: configurations of VGG; depth increases from left to right and the added layers are shown in bold.
The convolutional layer parameters are denoted as "conv<receptive field size>-<number of channels>".
Image credit: Simonyan and Zisserman, the original authors of the VGG paper.]
VGG Neural Networks. While previous derivatives of AlexNet focused on smaller window sizes and
strides in the first convolutional layer, VGG addresses another very important aspect of CNNs: depth.
Let’s go over the architecture of VGG:
Input. VGG takes in a 224x224 pixel RGB image. For the ImageNet competition, the authors cropped out
the center 224x224 patch in each image to keep the input image size consistent.
Convolutional Layers. The convolutional layers in VGG use a very small receptive field (3x3, the
smallest possible size that still captures left/right and up/down). There are also 1x1 convolution filters
which act as a linear transformation of the input, which is followed by a ReLU unit. The convolution
stride is fixed to 1 pixel so that the spatial resolution is preserved after convolution.
Fully-Connected Layers. VGG has three fully-connected layers: the first two have 4096 channels each
and the third has 1000 channels, 1 for each class.
Hidden Layers. All of VGG’s hidden layers use ReLU (a huge innovation from AlexNet that cut training
time). VGG does not generally use Local Response Normalization (LRN), as LRN increases memory
consumption and training time with no particular increase in accuracy.
The Difference. VGG, while based on AlexNet, has several differences that separate it from other
competing models:
Instead of using large receptive fields like AlexNet (11x11 with a stride of 4), VGG uses very small
receptive fields (3x3 with a stride of 1). Because a stack of three 3x3 layers contains three ReLU units
instead of just one, the decision function is more discriminative. It also has fewer parameters: roughly 27
times the squared number of channels, versus 49 times the squared number of channels for a single 7x7 layer.
VGG incorporates 1x1 convolutional layers to make the decision function more non-linear without
changing the receptive fields.
The small-size convolution filters allow VGG to have a large number of weight layers, and more layers
generally lead to improved performance. This is not an uncommon feature, though: GoogLeNet, another
model that uses deep CNNs and small convolution filters, also appeared in the 2014 ImageNet competition.

Code:

import keras
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, Flatten
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
input_shape = (224, 224, 3)

#Instantiate an empty model


model = Sequential([
    Conv2D(64, (3, 3), input_shape=input_shape, padding='same', activation='relu'),
    Conv2D(64, (3, 3), activation='relu', padding='same'),
    MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
    Conv2D(128, (3, 3), activation='relu', padding='same'),
    Conv2D(128, (3, 3), activation='relu', padding='same'),
    MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
    Conv2D(256, (3, 3), activation='relu', padding='same'),
    Conv2D(256, (3, 3), activation='relu', padding='same'),
    Conv2D(256, (3, 3), activation='relu', padding='same'),
    MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
    Conv2D(512, (3, 3), activation='relu', padding='same'),
    Conv2D(512, (3, 3), activation='relu', padding='same'),
    Conv2D(512, (3, 3), activation='relu', padding='same'),
    MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
    Conv2D(512, (3, 3), activation='relu', padding='same'),
    Conv2D(512, (3, 3), activation='relu', padding='same'),
    Conv2D(512, (3, 3), activation='relu', padding='same'),
    MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
    Flatten(),
    Dense(4096, activation='relu'),
    Dense(4096, activation='relu'),
    Dense(1000, activation='softmax')
])

model.summary()
# Compile the model
model.compile(loss=keras.losses.categorical_crossentropy, optimizer='adam', metrics=['accuracy'])

5. Course Exit Survey

BHARATI VIDYAPEETH COLLEGE OF ENGINEERING, NEW DELHI


Department of Computer Science & Engineering
Course Exit Survey
2019- 2020
Subject Name: MACHINE LEARNING LAB Subject Code: ETCS 454
Semester: 8th
Please rate how well you understood the course (tick the most appropriate option):
(1 - Poor, 2 - Good, 3 - Excellent)
ETCS209.1 Did you understand and implement the various Machine learning approaches and
interpret the concepts of supervised learning?

1. 2. 3.

ETCS209.2 Can you study and apply the fundamental concepts of Machine Learning, including
classification, and apply the Machine Learning algorithms?
1. 2. 3.

ETCS209.3 Are you able to analyze and evaluate data, perform experiments in Machine Learning using
real-world data, and recognize the limitations of learning algorithms?

1. 2. 3.

ETCS209.4 Are you able to confidently apply, execute, implement and analyze the common Machine
Learning algorithms in practice?

1. 2. 3.

ETCS209.5 Can you illustrate and apply clustering algorithms, evaluate them using performance
measures, and identify their applicability to real-life problems?

1. 2. 3.

ETCS209.6 Can you create and analyze new models and use modern programming tools for existing
Machine Learning problems?

1. 2. 3.

Suggestions to improve the teaching methodology:

Overall, how do you rate your understanding of the subject? (Tick whichever is applicable)
1. Below 50%   2. 50%-70%   3. 70%-90%   4. Above 90%

Name of student
Enrolment number Signature

