Unit - 3
• In Supervised Learning, the model learns by example. Along with our input variable, we also give our
model the corresponding correct labels. While training, the model gets to look at which label
corresponds to our data and hence can find patterns between our data and those labels.
• Spam detection: classifying mail by teaching a model which messages are spam and which are not.
• Speech recognition: teaching a machine to recognize your voice.
• Object recognition: showing a machine what an object looks like and having it pick that object
out from among other objects.
• The Classification algorithm is a Supervised Learning technique that is used to identify the category of new
observations on the basis of training data. In classification, a program learns from the given dataset or
observations and then classifies new observations into a number of classes or groups, such as Yes or No, 0
or 1, Spam or Not Spam, cat or dog, etc. Classes can be called targets, labels, or categories.
• Classification predictive modeling is the task of approximating a mapping function from input
variables to discrete output variables. The main goal is to identify which class/category new data will
fall into.
• A good example of an ML classification algorithm is an email spam detector.
• Unlike regression, the output variable of classification is a category rather than a numeric value, such as "Green or Blue"
or "fruit or animal". Since classification is a supervised learning technique, it takes
labeled input data, meaning each input comes with its corresponding output.
• The main goal of a classification algorithm is to identify the category of a given dataset; these
algorithms are mainly used to predict the output for categorical data.
• As an example, consider a dataset with two classes, Class A and Class B: instances within each class have
features that are similar to one another and dissimilar to the other class.
The algorithm which implements the classification on a dataset is known as a classifier.
There are two types of classification:
• Binary Classifier: If the classification problem has only two possible outcomes, it is called a binary classifier.
  Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
• Multi-class Classifier: If a classification problem has more than two outcomes, it is called a multi-class classifier.
  Examples: classification of types of crops, classification of types of music.
Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test dataset.
Classification is then done on the basis of the most closely related data stored in the training
dataset. Lazy learners take less time in training but more time in prediction.
Eager Learners: Eager learners develop a classification model from the training dataset before
receiving a test dataset. Opposite to lazy learners, eager learners take more time in learning and less
time in prediction.
• Linear Models
➢ Logistic Regression
➢ Support Vector Machines
• Non-linear Models
➢ K-Nearest Neighbours
➢ Kernel SVM
➢ Naïve Bayes
➢ Decision Tree Classification
➢ Random Forest Classification
Classification Algorithms
• In machine learning, classification is a supervised learning concept that categorizes a set of data into
classes. Common classification problems include speech recognition, face detection, handwriting
recognition, and document classification.
• A problem can be either a binary classification problem or a multi-class problem. There are many machine
learning algorithms for classification; let us take a look at some of them.
Logistic Regression
• It is a classification algorithm in machine learning that uses one or more independent variables to determine an
outcome. The outcome is measured with a dichotomous variable, meaning it has only two possible
values.
• The goal of logistic regression is to find the best-fitting relationship between the dependent variable and a set of
independent variables. It has an advantage over binary classifiers such as nearest neighbour in that it
quantitatively explains the factors leading to a classification.
Advantages and Disadvantages
• Logistic regression is specifically meant for classification; it is useful for understanding how a set of
independent variables affects the outcome of the dependent variable.
• Its main disadvantages are that it only works when the predicted variable is binary, that it assumes
the data are free of missing values, and that it assumes the predictors are independent of each other.
Use Cases
1. Word classification
2. Weather prediction
3. Voting applications
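The following is a minimal sketch of logistic regression in Python with scikit-learn; the synthetic dataset and parameter values are assumptions for illustration only:

  # Illustrative sketch: logistic regression on synthetic binary-labelled data.
  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import train_test_split

  # Assumed stand-in for real data: 4 numeric predictors, binary target.
  X, y = make_classification(n_samples=500, n_features=4, random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  model = LogisticRegression()
  model.fit(X_train, y_train)
  print(model.coef_)                  # how each independent variable affects the outcome
  print(model.score(X_test, y_test))  # mean accuracy on held-out data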
Naive Bayes:
• Naive Bayes is a classification algorithm that assumes that predictors in a dataset are independent. This
means that it assumes the features are unrelated to each other.
For example: if given a banana, the classifier will see that the fruit is of yellow color, oblong-shaped and long
and tapered. All of these features will contribute independently to the probability of it being a banana and are
not dependent on each other. Naive Bayes is based on Bayes' theorem, which is given as:

  P(A|B) = P(B|A) · P(A) / P(B)

Where:
  P(A|B) is the posterior probability of class A given predictor B,
  P(B|A) is the likelihood of predictor B given class A,
  P(A) is the prior probability of the class, and
  P(B) is the prior probability of the predictor.
• The Naive Bayes classifier requires only a small amount of training data to estimate the necessary parameters.
It is extremely fast compared to other classifiers.
Use Cases
1. Disease Predictions
2. Document Classification
3. Spam Filters
4. Sentiment Analysis
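A minimal sketch of a Naive Bayes classifier in scikit-learn (the Iris dataset is used here only as a convenient stand-in):

  # Illustrative sketch: Gaussian Naive Bayes treats each feature independently.
  from sklearn.datasets import load_iris
  from sklearn.naive_bayes import GaussianNB

  X, y = load_iris(return_X_y=True)
  clf = GaussianNB()
  clf.fit(X, y)              # estimates per-class mean and variance of each feature
  print(clf.predict(X[:3]))  # predicts the class with the highest posterior probability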
Decision Trees:
• A decision tree has the advantage of being simple to understand and visualize, and it requires very little data
preparation.
• Its disadvantage is that it can create overly complex trees that may not categorize
efficiently. Decision trees can also be quite unstable: even a small change in the data can alter the whole
structure of the tree.
Use Cases
1. Data exploration
2. Pattern Recognition
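A minimal decision-tree sketch; the depth limit is an assumed guard against the overly complex trees mentioned above:

  # Illustrative sketch: a depth-limited decision tree.
  from sklearn.datasets import load_iris
  from sklearn.tree import DecisionTreeClassifier, export_text

  X, y = load_iris(return_X_y=True)
  tree = DecisionTreeClassifier(max_depth=3, random_state=0)
  tree.fit(X, y)
  print(export_text(tree))  # the fitted tree is simple to inspect and visualize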
K-Nearest Neighbours (KNN)
• It is a lazy learning algorithm that stores all instances corresponding to the training data in n-dimensional space.
• It is a lazy learner because it does not focus on constructing a general internal model; instead, it works
by storing instances of the training data.
Advantages And Disadvantages
1. This algorithm is quite simple in its implementation and is robust to noisy training data. Even when
the training data is large, it is quite effective.
2. Its disadvantages are that the value of K must be chosen by the user and that the
computation cost is high compared to other algorithms.
Use Cases
1. Image recognition
2. Video recognition
3. Stock analysis
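A minimal KNN sketch; note that the user must choose K (n_neighbors = 5 here is an assumption):

  # Illustrative sketch: k-nearest neighbours, a lazy learner.
  from sklearn.datasets import load_iris
  from sklearn.neighbors import KNeighborsClassifier

  X, y = load_iris(return_X_y=True)
  knn = KNeighborsClassifier(n_neighbors=5)
  knn.fit(X, y)              # "training" essentially just stores the instances
  print(knn.predict(X[:3]))  # prediction searches the stored instances, so it is slower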
Random Forest
• Random decision forests, or random forests, are an ensemble learning method for classification, regression, and
other tasks. A random forest operates by constructing a multitude of decision trees at training time and
outputting the class that is the mode of the individual trees' classes (classification) or their mean prediction (regression).
• A random forest is a meta-estimator that fits a number of trees on various sub-samples of the dataset and then
averages them to improve the model's predictive accuracy.
• The sub-sample size is the same as the original input size, but the samples are drawn with
replacement.
Advantages and Disadvantages
• The advantage of the random forest is that it is more accurate than a single decision tree due to the
reduction in over-fitting.
• Its disadvantage is that it is quite complex in implementation
and pretty slow at real-time prediction.
Use Cases
1. Performance scores
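A minimal random-forest sketch (the number of trees is an assumed value):

  # Illustrative sketch: an ensemble of trees fit on bootstrap samples.
  from sklearn.datasets import load_iris
  from sklearn.ensemble import RandomForestClassifier

  X, y = load_iris(return_X_y=True)
  forest = RandomForestClassifier(n_estimators=100, random_state=0)
  forest.fit(X, y)              # each tree sees a sample drawn with replacement
  print(forest.predict(X[:3]))  # output is the mode (majority vote) of the trees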
Support Vector Machine
• The support vector machine is a classifier that represents the training data as points in space, separated
into categories by a gap as wide as possible. New points are then mapped into the same space and assigned
to a category based on which side of the gap they fall.
Advantages and Disadvantages
• It uses a subset of the training points in the decision function, which makes it memory efficient, and it is
highly effective in high-dimensional spaces.
• The disadvantage of the support vector machine is that the algorithm does not directly provide
probability estimates.
Use cases
1. Business applications for comparing the performance of a stock over a period of time
2. Investment suggestions
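A minimal SVM sketch (the linear kernel is an assumed choice):

  # Illustrative sketch: a support vector classifier.
  from sklearn.datasets import load_iris
  from sklearn.svm import SVC

  X, y = load_iris(return_X_y=True)
  svm = SVC(kernel="linear")         # probabilities require probability=True,
  svm.fit(X, y)                      # matching the disadvantage noted above
  print(svm.support_vectors_.shape)  # only a subset of training points is retained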
Once our model is complete, it is necessary to evaluate its performance, whether it is a classification or a
regression model. For evaluating a classification model, we have the following ways:
Confusion Matrix:
• The confusion matrix provides a matrix/table as output and describes the performance of the model.
• The matrix summarizes the prediction results, giving the total numbers of correct and incorrect
predictions. For a binary problem, it looks like the following table:

                      Predicted: Positive    Predicted: Negative
  Actual: Positive    True Positive (TP)     False Negative (FN)
  Actual: Negative    False Positive (FP)    True Negative (TN)

Accuracy = (TP + TN) / (TP + FP + FN + TN)

True Positive: the number of positive cases correctly predicted as positive.
True Negative: the number of negative cases correctly predicted as negative.
False Positive: the number of negative cases incorrectly predicted as positive.
False Negative: the number of positive cases incorrectly predicted as negative.
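A sketch of computing the confusion matrix and accuracy in scikit-learn; the label vectors are made-up examples:

  # Illustrative sketch: confusion matrix and accuracy for assumed labels.
  from sklearn.metrics import accuracy_score, confusion_matrix

  y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # assumed ground-truth labels
  y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # assumed model predictions
  tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
  print((tp + tn) / (tp + fp + fn + tn))  # 0.75
  print(accuracy_score(y_true, y_pred))   # same value, computed directly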
F1-Score
• Precision is the fraction of relevant instances among the retrieved instances (Precision = TP / (TP + FP)),
while recall is the fraction of relevant instances that have been retrieved out of all relevant instances
(Recall = TP / (TP + FN)).
• The F1-score combines the two as their harmonic mean: F1 = 2 · Precision · Recall / (Precision + Recall).
Precision and recall are basically used as measures of relevance.
ROC Curve
• The receiver operating characteristic (ROC) curve is used for visual comparison of classification models; it
shows the relationship between the true positive rate and the false positive rate.
• The area under the ROC curve is a measure of the accuracy of the model.
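A sketch of these metrics on the same assumed labels; ROC-AUC is computed from predicted scores rather than hard labels:

  # Illustrative sketch: precision, recall, F1, and area under the ROC curve.
  from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

  y_true  = [1, 0, 1, 1, 0, 1, 0, 0]                  # assumed ground truth
  y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3]  # assumed model scores
  y_pred  = [1 if s >= 0.5 else 0 for s in y_score]   # threshold at 0.5

  p = precision_score(y_true, y_pred)
  r = recall_score(y_true, y_pred)
  print(p, r, f1_score(y_true, y_pred))  # F1 = 2*p*r / (p + r)
  print(roc_auc_score(y_true, y_score))  # area under the ROC curve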
Concept Learning
• Inducing general functions from specific training examples is a central problem of machine learning.
• Concept Learning: acquiring the definition of a general category from given positive and negative
training examples of the category.
• Concept learning can be seen as a problem of searching through a predefined space of potential
hypotheses for the hypothesis that best fits the training examples.
• The hypothesis space has a general-to-specific ordering of hypotheses, and the search can be efficiently
organized by taking advantage of this naturally occurring structure over the hypothesis space.
Concept Learning
• A Formal Definition for Concept Learning: inferring a boolean-valued function from training
examples of its input and output.
A Concept Learning Task – EnjoySport
Training Examples (attributes, and the target concept EnjoySport):

  Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
  1        Sunny  Warm     Normal    Strong  Warm   Same      Yes
  2        Sunny  Warm     High      Strong  Warm   Same      Yes
  3        Rainy  Cold     High      Strong  Warm   Change    No
  4        Sunny  Warm     High      Strong  Cool   Change    Yes
Hypothesis Representation
• A hypothesis:
Sky AirTemp Humidity Wind Water Forecast
< Sunny, ? , ? , Strong , ? , Same >
• The most general hypothesis – that every day is a positive example
<?, ?, ?, ?, ?, ?>
• The most specific hypothesis – that no day is a positive example
<0, 0, 0, 0, 0, 0>
• The EnjoySport concept learning task requires learning the set of days for
which EnjoySport = Yes, describing this set by a conjunction of
constraints over the instance attributes.
EnjoySport Concept Learning Task
Given
– Instances X : set of all possible days, each described by the attributes
• Sky – (values: Sunny, Cloudy, Rainy)
• AirTemp – (values: Warm, Cold)
• Humidity – (values: Normal, High)
• Wind – (values: Strong, Weak)
• Water – (values: Warm, Cold)
• Forecast – (values: Same, Change)
– Target Concept (Function) c: EnjoySport: X → {0, 1}
– Hypotheses H : Each hypothesis is described by a conjunction of constraints on
the attributes.
– Training Examples D : positive and negative examples of the target function
Determine
– A hypothesis h in H such that h(x) = c(x) for all x in D.
The Inductive Learning Hypothesis
• Although the learning task is to determine a hypothesis h identical to the
target concept c over the entire set of instances X, the only information
available about c is its value over the training examples.
– Inductive learning algorithms can at best guarantee that the output hypothesis fits the target
concept over the training data.
– Lacking any further information, our assumption is that the best hypothesis regarding
unseen instances is the hypothesis that best fits the observed training data. This is the
fundamental assumption of inductive learning.
Concept Learning As Search
• Concept learning can be viewed as the task of searching through a large
space of hypotheses implicitly defined by the hypothesis representation.
• The goal of this search is to find the hypothesis that best fits the training
examples.
• By selecting a hypothesis representation, the designer of the learning
algorithm implicitly defines the space of all hypotheses that the program
can ever represent and therefore can ever learn.
Enjoy Sport - Hypothesis Space
• Sky has 3 possible values, and other 5 attributes have 2 possible values.
• There are 96 (= 3·2·2·2·2·2) distinct instances in X.
• There are 5120 (= 5·4·4·4·4·4) syntactically distinct hypotheses in H.
  – Each attribute has two additional possible constraints: ? and 0.
• Every hypothesis containing one or more 0 symbols represents the
empty set of instances; that is, it classifies every instance as negative.
• There are 973 (= 1 + 4·3·3·3·3·3) semantically distinct hypotheses in H.
  – Only one additional value (?) per attribute, plus one hypothesis representing the empty set of
instances.
• Although EnjoySport has small, finite hypothesis space, most learning
tasks have much larger (even infinite) hypothesis spaces.
– We need efficient search algorithms on the hypothesis spaces.
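These counts can be checked with a few lines of Python (a sketch; attribute cardinalities as given above):

  # Sanity check of the instance- and hypothesis-space sizes.
  values = {"Sky": 3, "AirTemp": 2, "Humidity": 2, "Wind": 2, "Water": 2, "Forecast": 2}

  instances = syntactic = semantic = 1
  for v in values.values():
      instances *= v        # one value per attribute
      syntactic *= v + 2    # plus the two extra symbols '?' and '0'
      semantic  *= v + 1    # plus '?' only
  semantic += 1             # plus the single empty-set (all-'0') hypothesis

  print(instances, syntactic, semantic)   # 96 5120 973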
General-to-Specific Ordering of Hypotheses
• Many algorithms for concept learning organize the search through the hypothesis
space by relying on a general-to-specific ordering of hypotheses.
• By taking advantage of this naturally occurring structure over the hypothesis space, we
can design learning algorithms that exhaustively search even infinite hypothesis spaces
without explicitly enumerating every hypothesis.
• For example, consider the two hypotheses h1 = <Sunny, ?, ?, Strong, ?, ?> and h2 = <Sunny, ?, ?, ?, ?, ?>,
and the sets of instances that are classified positive by h1 and by h2.
  – Because h2 imposes fewer constraints on the instance, it classifies more instances as
positive.
  – In fact, any instance classified positive by h1 will also be classified positive by h2.
  – Therefore, we say that h2 is more general than h1.
More-General-Than Relation
• For any instance x in X and hypothesis h in H, we say that x satisfies h
if and only if h(x) = 1.
• More-General-Than-Or-Equal Relation:
Let h1 and h2 be two boolean-valued functions defined over X.
Then h1 is more-general-than-or-equal-to h2 (written h1 ≥ h2)
if and only if any instance that satisfies h2 also satisfies h1.
More-General-Than Relation (strict)
• h1 is (strictly) more-general-than h2 (written h1 > h2)
if and only if h1 ≥ h2 and it is not the case that h2 ≥ h1.
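A sketch of these relations in Python for the conjunctive hypotheses used here (the tuple encoding, with '?' as the most general constraint and '0' as the most specific, is the representation assumed throughout):

  # Illustrative sketch: h1 >= h2 for attribute-tuple hypotheses.
  # Simplification: compares constraints attribute by attribute; a hypothesis
  # containing '0' covers no instances, so strictly it lies below everything.
  def more_general_or_equal(h1, h2):
      def covers(a1, a2):
          return a1 == "?" or a1 == a2 or a2 == "0"
      return all(covers(a1, a2) for a1, a2 in zip(h1, h2))

  def strictly_more_general(h1, h2):
      return more_general_or_equal(h1, h2) and not more_general_or_equal(h2, h1)

  h1 = ("Sunny", "?", "?", "Strong", "?", "?")
  h2 = ("Sunny", "?", "?", "?", "?", "?")
  print(strictly_more_general(h2, h1))   # True: h2 imposes fewer constraints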
FIND-S Algorithm
1. Initialize h to the most specific hypothesis in H.
2. For each positive training instance x:
   For each attribute constraint a_i in h:
     If the constraint a_i is satisfied by x,
       then do nothing;
     else replace a_i in h by the next more general constraint that is
       satisfied by x.
3. Output hypothesis h.
FIND-S Algorithm – Example
Applied to the EnjoySport training examples above, FIND-S proceeds as follows:
  h0 = <0, 0, 0, 0, 0, 0>                          (most specific hypothesis)
  h1 = <Sunny, Warm, Normal, Strong, Warm, Same>   (after example 1)
  h2 = <Sunny, Warm, ?, Strong, Warm, Same>        (after example 2)
  h3 = <Sunny, Warm, ?, Strong, Warm, Same>        (example 3 is negative: ignored)
  h4 = <Sunny, Warm, ?, Strong, ?, ?>              (after example 4)
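The trace above can be reproduced with a short Python sketch (the tuple encoding of instances and hypotheses is an assumption):

  # Sketch of FIND-S: '0' = most specific constraint, '?' = most general.
  def find_s(examples):
      h = ["0"] * len(examples[0][0])   # most specific hypothesis <0,...,0>
      for x, label in examples:
          if label != "Yes":            # FIND-S ignores negative examples
              continue
          for i, a in enumerate(x):
              if h[i] == "0":
                  h[i] = a              # first positive example: copy the value
              elif h[i] != a:
                  h[i] = "?"            # generalize a violated constraint
      return h

  data = [
      (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   "Yes"),
      (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   "Yes"),
      (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
      (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "Yes"),
  ]
  print(find_s(data))   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']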
Unanswered Questions by the FIND-S Algorithm
• Has FIND-S converged to the correct target concept?
– Although FIND-S will find a hypothesis consistent with the training data, it has no way to
determine whether it has found the only hypothesis in H consistent with the data (i.e., the
correct target concept), or whether there are many other consistent hypotheses as well.
– We would prefer a learning algorithm that could determine whether it had converged and,
if not, at least characterize its uncertainty regarding the true identity of the target concept.
Unanswered Questions by the FIND-S Algorithm
• Are the training examples consistent?
– In most practical learning problems there is some chance that the training examples will
contain at least some errors or noise.
– Such inconsistent sets of training examples can severely mislead FIND-S, given
that it ignores negative examples.
– We would prefer an algorithm that could at least detect when the training data is
inconsistent and, preferably, accommodate such errors.
Common Sense based Learning

The Goals of Artificial Intelligence
• Mental amplification
• Common sense
Even simple tasks, such as answering a search query like "someone smiling" or using the fact
that Beirut is in Lebanon, require common-sense knowledge.
What is this “Knowledge”?
• Millions of facts, rules of thumb etc.
• Represented as sentences in some language.
• If the language is Logic, then computers can do
deductive reasoning automatically.
• This representation of a set of concepts within a
domain, and the relationships between those
concepts, is called an ontology.
• The sentences are expressed in formal logic
notation.
• The words, together with the logic sentences about them,
are called a formal ontology.
Hierarchy in Ontology
Concepts in an ontology are arranged in a hierarchy, with more general concepts above more specific ones.
Predicate Calculus Representation
A predicate calculus representation assumes a universe of individuals, with
relations and functions on those individuals, and sentences formed by
combining relations with the logical connectives and, or, and not.
(ForAll ?P (ForAll ?C
  (implies
    (and
      (isa ?P Person)
      (child ?P ?C))
    (loves ?P ?C))))
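For illustration, the same rule can be applied procedurally over a small set of facts (a sketch; the fact tuples and helper name are assumptions, not any particular reasoner's API):

  # Sketch: "every person loves their children" over an assumed fact set.
  facts = {("isa", "Mary", "Person"), ("child", "Mary", "Tom")}

  def infer_loves(facts):
      # For all P, C: isa(P, Person) and child(P, C) implies loves(P, C).
      return {("loves", p, c)
              for rel, p, c in facts
              if rel == "child" and ("isa", p, "Person") in facts}

  print(infer_loves(facts))   # {('loves', 'Mary', 'Tom')}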
• The knowledge base contains:
  • 15,000 predicates
  • 300,000 concepts
  • 3,200,000 assertions
Explanation based learning
Learning by Generalizing Explanations
Given
• Goal concept (e.g., some predicate calculus statement)
• Training example (facts)
• Domain Theory (inference rules)
• Operationality Criterion
Some Complications
• Inconsistencies and Incompleteness may be
due to abstractions and assumptions that make
a theory tractable.
Issues with Imperfect Theories
• Detecting imperfections
– “broken” explanations (missing clause)
– contradiction detection (proving P and not P)
– multiple explanations (but expected!)
– resources exceeded
• Correcting imperfections
  – experimentation, motivated by failure type (explanation-based)
  – make approximations/assumptions: assume something is true
EBL as Operationalization (Speedup Learning)
• Assuming a complete problem solver and
unlimited time, EBL already knows how to
recognize all the concepts it will know.
• What it learns is how to make its
knowledge operational (Mostow).
Knowledge-Level Learning
(Newell; Dietterich)
• EBL as knowledge closure
  – all things that can be inferred from a collection of rules and facts
• "Pure" EBL only learns how to solve problems faster, not how to solve problems
previously insoluble.
• Inductive learners make inductive leaps and hence can solve more
after learning.
• What about considering resource-limits (e.g., time) on problem
solving?
A Horn clause is a logical formula of a particular rule-like form that gives
it useful properties for use in logic programming, formal specification, and model
theory. For example, the rule above, (isa ?P Person) ∧ (child ?P ?C) → (loves ?P ?C),
is a Horn clause: it has at most one positive (conclusion) literal.