Project Report 2
Mukund S
Advisor: Prof. Shivani Agarwal
January 2015
0.1 Introduction
0.2 Supervised Learning:
Supervised learning deals with classification and pattern recognition problems. It can be defined as the task of taking labeled data sets, extracting information from them, and using that information to label new data; in other words, it is essentially function approximation.
0.2.1 Classification Problem:
Classification is the process of taking some input and mapping it to some discrete
label.
Linear Classifiers:
Linear models for classification separate input vectors into classes using linear (hyperplane) decision boundaries. The simplest case is the two-class discriminant function.
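As a brief illustration, the following Python sketch shows how a two-class linear discriminant assigns a label from the sign of w.x + b; the weight vector and bias here are illustrative values, not learned from data.

import numpy as np

# Minimal sketch of a two-class linear discriminant, y(x) = w.x + b.
w = np.array([2.0, -1.0])   # normal vector of the separating hyperplane
b = 0.5                     # bias (offset of the hyperplane from the origin)

def classify(x):
    """Assign class +1 if the point lies on the positive side of the hyperplane, else -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

print(classify(np.array([1.0, 0.0])))   # +1, since 2*1 - 1*0 + 0.5 > 0
print(classify(np.array([0.0, 3.0])))   # -1, since -3 + 0.5 < 0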
Decision Trees:
Decision trees are efficient and among the most widely used techniques for classification problems. A decision tree is built from a set of decision rules: each path from the root to a leaf corresponds to a rule, and the leaf nodes correspond to the different classes. ID3 is a standard and popular algorithm in which features are ranked by their information gain and the tree is built by choosing features in decreasing order of information gain. [4, chapter 1]
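A minimal sketch of the information-gain computation that ID3 uses to rank features is given below; the entropy and gain functions and the toy feature/label lists are illustrative, not taken from any particular data set.

import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    """Reduction in entropy obtained by splitting the labels on a discrete feature."""
    total = entropy(labels)
    n = len(labels)
    for v in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == v]
        total -= (len(subset) / n) * entropy(subset)
    return total

# Toy data: the feature perfectly predicts the label, so the gain equals the full entropy (1 bit).
feature = ['a', 'a', 'b', 'b']
labels  = ['yes', 'yes', 'no', 'no']
print(information_gain(feature, labels))   # 1.0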
Feature normalization and feature pruning are techniques that help remove irrelevant features and also make comparisons across the data easier. [7]
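For example, a simple min-max normalization (a Python sketch; the data matrix is made up for illustration) rescales each feature to the [0, 1] range so that features measured on very different scales become comparable.

import numpy as np

def min_max_normalize(X):
    """Rescale each feature (column) to the [0, 1] range; assumes no column is constant."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
print(min_max_normalize(X))   # both columns now range from 0 to 1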
0.2.2 Regression:
0.3 Unsupervised Learning:
In unsupervised learning we have only an input data set, and we try to derive some structure from it by looking at the relationships between the inputs. [6]
Clustering:
Clustering is a type of unsupervised learning that automatically groups similar things into clusters. The notion of similarity depends on the similarity measure, which in turn depends on the type of application and data set. [4]
K-Means Clustering:
In the K-means clustering algorithm, K distinct clusters are found, and the center of each cluster is the mean of the values of the elements in that cluster. First the desired number of clusters K is chosen, and K points in the data set are picked as the initial centroids of the K clusters. Each point in the data set is then assigned to the centroid it is closest to. After every point has been assigned to a cluster, the centroids are updated by computing the mean of all the points in each cluster, and the process repeats. This method converges to a local minimum, but it does not necessarily converge to a global minimum, and it is slow on very large data sets. [4]
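The following Python sketch of plain K-means follows the steps described above (assign each point to its nearest centroid, then recompute each centroid as the mean of its cluster); the random initialization, iteration cap, and synthetic data are illustrative choices, and empty clusters are not handled.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means: alternate between assigning points to the nearest
    centroid and recomputing each centroid as the mean of its cluster."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # initial centroids taken from the data
    for _ in range(n_iters):
        # Assign every point to the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # Update each centroid to the mean of the points assigned to it.
        new_centroids = np.array([X[assignments == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged (to a local optimum)
            break
        centroids = new_centroids
    return centroids, assignments

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
print(centroids)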
Bisecting K-means:
In this algorithm all the points are initially assigned to a single cluster. While the total number of clusters is less than K, the total error of each cluster is measured; K-means clustering with K = 2 is then performed on the cluster with the largest total error, and the process is repeated until the desired number of clusters is reached. [4]
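A sketch of bisecting K-means along these lines is shown below; it leans on scikit-learn's KMeans for the K = 2 split and measures each cluster's total error as its sum of squared distances to its mean, which is an assumed choice of error measure.

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k):
    """Start with one cluster holding all points; repeatedly split the cluster
    with the largest total error (SSE) using 2-means until k clusters exist."""
    clusters = [X]
    while len(clusters) < k:
        # Pick the cluster with the largest sum of squared distances to its mean.
        sse = [np.sum((c - c.mean(axis=0)) ** 2) for c in clusters]
        target = clusters.pop(int(np.argmax(sse)))
        # Split it into two with ordinary K-means (K = 2).
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(target)
        clusters.append(target[labels == 0])
        clusters.append(target[labels == 1])
    return clusters

X = np.vstack([np.random.randn(40, 2), np.random.randn(40, 2) + 4, np.random.randn(40, 2) + 8])
print([len(c) for c in bisecting_kmeans(X, k=3)])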
0.4 Online Learning:
0.4.1 Prediction with Expert Advice:
In this learning model there is a finite number of experts, each of which predicts an answer; the predictions are made available to the forecaster (the learning model). In each round the forecaster can use these experts' predictions to make its own prediction, after which the true answer is revealed. The forecaster suffers regret and the experts suffer loss, and the goal is to minimize the regret. Various models have been proposed to solve this minimization problem. [1]
Weighted Majority Algorithm:
In this algorithm each expert is assigned a weight, and the forecaster predicts the weighted average of the experts' predictions. After the true outcome is revealed, the losses are calculated and the weights are updated accordingly. [1]
Randomized Weighted Majority Algorithm:
In this algorithm all the experts are initially assigned equal weights (equal to 1) and the forecaster predicts using the experts' weighted advice. Experts may suffer loss, and the weight of every expert that makes a mistake is penalized by a ratio β. At round t, the ratio of an expert's weight to the total weight of all the experts can be thought of as the probability of that expert predicting the right answer. [9]
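A single round of Randomized Weighted Majority might look like the Python sketch below. Here the forecaster samples an expert with probability proportional to its weight, which is one standard way to realize the randomized prediction; the number of experts, β, and the experts' advice are made-up values.

import numpy as np

def rwm_round(weights, expert_predictions, true_outcome, beta):
    """One round: sample an expert with probability proportional to its weight,
    follow its prediction, then multiply the weight of every wrong expert by beta."""
    probs = weights / weights.sum()
    chosen = np.random.choice(len(weights), p=probs)
    forecast = expert_predictions[chosen]
    wrong = expert_predictions != true_outcome
    weights[wrong] *= beta
    return forecast, weights

n_experts, beta = 5, 0.9
weights = np.ones(n_experts)                      # every expert starts with weight 1
preds = np.array([1, -1, 1, 1, -1])               # the experts' advice for this round
forecast, weights = rwm_round(weights, preds, true_outcome=1, beta=beta)
print(forecast, weights)                          # the wrong experts now have weight 0.9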
Online Gradient Descent:
Generally, in the gradient descent algorithm we run through all the samples in the training set to perform a single parameter update in each iteration, while in online gradient descent only one training sample is used to perform the update in a single iteration. So if the training set is very large, online gradient descent will be faster, since only one point is used to update the parameter. Online gradient descent converges faster than the batch gradient descent algorithm, but the objective is not minimized as well. This method can also be used for the minimization problem of Prediction with Expert Advice. [8]
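The contrast can be seen in a small sketch of online gradient descent for least-squares regression: each parameter update uses only the single example just seen, rather than a full pass over the training set. The learning rate, loss function, and synthetic data stream are illustrative assumptions.

import numpy as np

def online_gradient_descent(stream, dim, lr=0.1):
    """Online gradient descent for least squares: after seeing each single
    example (x, y), update the parameter using only that example's gradient."""
    w = np.zeros(dim)
    for x, y in stream:
        grad = (np.dot(w, x) - y) * x   # gradient of 0.5*(w.x - y)^2 with respect to w
        w -= lr * grad                  # one update per example, not per full pass
    return w

# Toy stream generated from y = 2*x1 - x2 plus a little noise (illustrative data only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 2 * X[:, 0] - X[:, 1] + 0.01 * rng.normal(size=500)
w = online_gradient_descent(zip(X, y), dim=2)
print(w)   # should approach [2, -1]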
0.4.2 Perceptron:
When there are only two possible outputs (say -1 and 1), this algorithm can be used to predict the output in an online learning problem. The method tries to find a hyperplane that divides the vector space so that each half corresponds to one output. We assume there are N experts and each expert is assigned a weight. The forecaster predicts its output based on the sign of the weighted sum of the experts' choices. After the true outcome is revealed, the weights are updated accordingly. [9]
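A minimal sketch of one online perceptron round is given below; each coordinate of the input vector can be thought of as one expert's choice, and on a mistake the weight vector is moved towards the misclassified example. The toy stream of examples is made up for illustration.

import numpy as np

def perceptron_round(w, x, y):
    """One online perceptron step: predict with the sign of the weighted sum;
    if the prediction disagrees with the revealed label y (+1 or -1), update the weights."""
    prediction = 1 if np.dot(w, x) >= 0 else -1
    if prediction != y:
        w = w + y * x        # classic perceptron update on a mistake
    return prediction, w

w = np.zeros(2)
stream = [(np.array([1.0, 1.0]), 1), (np.array([-1.0, -2.0]), -1), (np.array([2.0, 0.5]), 1)]
for x, y in stream:
    pred, w = perceptron_round(w, x, y)
    print(pred, w)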
Follow the Leader:
The follow the leader algorithm can be used to minimize the regret over all of the previous time steps. In every round it selects the hypothesis for which the loss suffered by the forecaster over the previous steps is minimum. The main drawback of this algorithm is that it does not always guarantee the selection of a hypothesis with minimal regret. [5]
Follow the Regularized Leader:
The objective of this algorithm is to improve on the mistakes made by the follow the leader algorithm. This is achieved by adding a penalty function to the objective that is minimized. Better upper bounds on the regret can be attained by choosing an appropriate penalty function.
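The idea can be sketched as follows, assuming a finite candidate set of hypotheses, squared loss, and a quadratic (L2) penalty as the regularizer; setting the regularization strength to zero recovers plain follow the leader.

import numpy as np

def follow_the_regularized_leader(past_losses, candidates, lam=1.0):
    """Pick the candidate hypothesis minimizing cumulative past loss plus an
    L2 penalty. Setting lam = 0 recovers plain follow the leader."""
    def objective(h):
        cumulative = sum(loss(h) for loss in past_losses)
        return cumulative + lam * np.dot(h, h)     # quadratic regularizer as the penalty
    return min(candidates, key=objective)

# Toy example: scalar hypotheses, squared loss against observed values 1.0 and 0.5.
observed = [1.0, 0.5]
past_losses = [lambda h, o=o: (h[0] - o) ** 2 for o in observed]
candidates = [np.array([v]) for v in np.linspace(-1, 2, 31)]
print(follow_the_regularized_leader(past_losses, candidates, lam=0.0))   # FTL: near the mean 0.75
print(follow_the_regularized_leader(past_losses, candidates, lam=1.0))   # FTRL: shrunk towards 0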
Follow the Perturbed Leader is another algorithm, in which the penalty function is assumed to have a continuous distribution with respect to the N-dimensional Lebesgue measure, where N is the total number of hypotheses in the hypothesis class. [5]
Mini Project
Task:
To predict the kind of ball a bowler bowls during a cricket match based on his
previous bowling statistics.
Data:
Commentary on the bowling of Dale Steyn during the test matches test1944, test1946, test1948 and test2049 between South Africa and England was downloaded from the website www.cricinfo.com, analysed, and used as the training data.
Idea:
The line and length of the next ball being bowled can be predicted by training the machine with the data (statistics) of his previous balls that have been
bowled. We have assumed that the number of variations in the bowling to be
finite.
Algorithm Overview:
Each output (next ball) can be predicted using Prediction with Expert Advice
where each expert is assumed to predict constantly single possible output. Since
the total number of outputs that can be predicted are finite number of experts
is also finite.
The variables in the output are the line and the length of the delivery. The line can be chosen in 5 different ways: in-swinger, out-swinger, reverse swing, outside off, straight. Similarly the length can be chosen in 5 different ways: very short, short, good length, yorker, fuller delivery. So the total number of possible outcomes is 25, and each expert constantly predicts one possible outcome. The machine predicts the outcome from the predictions of the N experts using the Randomized Weighted Majority algorithm. Initially the weight of each expert is set to 1. The machine then predicts the outcome, the true outcome is revealed, and the loss and regret are calculated. If the prediction of a particular expert is correct, its weight remains unchanged; otherwise its weight is decreased by the fraction β. This is run iteratively over a large number of examples, so that the machine gets trained and the number of mistakes it makes is bounded relative to the best expert (the expert that makes the least number of errors). The regret and loss with respect to each expert are calculated and the best among the experts is found.
β = penalizing factor = 1 / (1 + √(2 ln N / M))

W_i = weight of the i-th expert, which predicts the i-th outcome (assuming that all possible outcomes are tabulated in some definite order).
W' = total weight of all the experts after t predictions by the machine.
N = total number of experts (which is also the total initial weight, since every expert starts with weight 1).
m = number of training examples.

If the i-th expert has made a wrong prediction, its weight is updated as

W_i ← β W_i,

and after the forecaster has made n mistakes the total weight satisfies W' ≤ N ((1 + β)/2)^n.

Expert i is expected to predict correctly with probability W_i / W', where the W_i are the updated weights.

If the best expert makes M mistakes, then the expected number of mistakes of the forecaster satisfies

M' ≤ (ln N)/(1 - β) + (ln(1/β)/(1 - β)) M.
Algorithm:
Algorithm 1 Predicting the output and updating the weights
Step 1: Read the input from the text file, which contains the predictions of the N experts, and the parameter β.
Step 2: Set the weights of all the experts to 1: W_i ← 1.
Step 3: Scan the input data.
Step 4: For t = 1 : m
    The forecaster's prediction is computed as ŷ_t ← round( (Σ_i W_i f_i^t) / W' ), where f_i^t is the prediction of expert i at round t.
    The true outcome y_t is revealed.
    The weight of every expert whose prediction was wrong is updated: W_i^{t+1} ← β W_i^t.
    The expected loss of each expert and of the forecaster is calculated according to the 0-1 loss function for every training point.
    The regret with respect to each expert is calculated.
The best expert is found and the regret with respect to the best expert is computed.
End.
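The training loop of Algorithm 1 could be sketched in Python roughly as below. The outcomes are assumed to be already encoded as integers 0 to 24 (one per line/length combination), the sequence of balls here is synthetic rather than the actual commentary data, and the forecaster samples an expert in proportion to its weight, whereas the pseudocode above rounds a weighted average instead.

import numpy as np

def train_rwm(outcomes, n_experts=25, beta=0.917):
    """Randomized Weighted Majority over 25 constant experts (expert i always predicts outcome i)."""
    weights = np.ones(n_experts)                          # Step 2: all weights start at 1
    forecaster_loss, expert_losses = 0, np.zeros(n_experts)
    for y in outcomes:                                    # Step 4: one pass over the m examples
        probs = weights / weights.sum()
        forecast = np.random.choice(n_experts, p=probs)   # follow a weight-proportionally sampled expert
        forecaster_loss += int(forecast != y)             # 0-1 loss of the forecaster
        expert_losses += (np.arange(n_experts) != y)      # 0-1 loss of every expert
        weights[np.arange(n_experts) != y] *= beta        # penalize the experts that were wrong
    best = int(expert_losses.argmin())
    regret = forecaster_loss - expert_losses[best]
    return weights, forecaster_loss, best, regret

outcomes = np.random.randint(0, 25, size=1000)            # synthetic sequence of balls, for illustration
print(train_rwm(outcomes))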
Figure 1: Averaged regret of the forecaster with respect to the best expert over the entire data set, with β = 0.917.
The plot shows the progress of the algorithm in learning to predict as well as the best expert. The averaged regret decreases with the number of training examples, from which we can infer that our algorithm is learning to predict as well as the best expert.
The number of mistakes made by the best expert is found to be 730.
Bibliography
[1] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, Cambridge / New York, 2006.
[2] Christopher Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.
[3] Michael Charles. Machine Learning: Supervised Learning, 2014. [Online; accessed June 2014].
[4] Hal Daumé III. A Course in Machine Learning. September 2013.
[5] Roni Khardon. Lecture 17, COMP 236: Computational Learning Theory, Department of Computer Science, Tufts University, 2013.
[6] Kevin Murphy. Machine Learning: A Probabilistic Perspective. The MIT
Press, August 2012.
[7] Andrew Ng. Machine Learning. Coursera, 2014.
[8] Andrew Ng. CS229 Lecture Notes. Stanford University.
[9] Robert Schapire. COS 511: Foundations of Machine Learning, Lecture 14, Princeton University, March 2006. (Visited on 01/10/2015).
[10] Wikipedia. Machine learning. Wikipedia, The Free Encyclopedia, 2015. [Online; accessed 9-January-2015].