ML UNIT-1 Notes PDF


R19-MACHINE LEARNING

Unit-1:
Introduction: Definition of learning systems, Goals and applications of machine learning, Aspects of
developing a learning system: training data, concept representation, function approximation. Inductive
Classification: The concept learning task, Concept learning as search through a hypothesis space, General-
to-specific ordering of hypotheses, Finding maximally specific hypotheses, Version spaces and the candidate
elimination algorithm, Learning conjunctive concepts, The importance of inductive bias.

The evolution of machine learning from 1950 is depicted in Figure 1.1:


WHAT IS MACHINE LEARNING?
Tom M. Mitchell, Professor in the Machine Learning Department, School of Computer Science, Carnegie
Mellon University, has defined machine learning as:
‘A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with experience
E.’

Process of machine learning:


The basic machine learning process can be divided into three parts.
1. Data Input: Past data or information is utilized as a basis for future decision-making
2. Abstraction: The input data is represented in a broader way through the underlying algorithm
3. Generalization: The abstracted representation is generalized to form a framework for making
decisions
Figure 1.2 is a schematic representation of the machine learning process.

Process of machine learning


1. Data storage: Facilities for storing and retrieving huge amounts of data are an important component of the
learning process. Humans and computers alike utilize data storage as a foundation for advanced reasoning.
• In a human being, the data is stored in the brain and data is retrieved using electrochemical
signals.
• Computers use hard disk drives, flash memory, random access memory and similar devices
to store data and use cables and other technology to retrieve data.
2. Abstraction: The second component of the learning process is known as abstraction.
Abstraction is the process of extracting knowledge about stored data. This involves creating general concepts
about the data as a whole. The creation of knowledge involves application of known models and creation of
new models.
The process of fitting a model to a dataset is known as training. When the model has been trained, the data is
transformed into an abstract form that summarizes the original information.
3. Generalization: The third component of the learning process is known as generalization.
The term generalization describes the process of turning the knowledge about stored data into a form that
can be utilized for future action. These actions are to be carried out on tasks that are similar, but not
identical, to those that have been seen before.
In generalization, the goal is to discover those properties of the data that will be most relevant to future tasks.
4. Evaluation: Evaluation is the last component of the learning process. It is the process of giving feedback
to the user to measure the utility of the learned knowledge. This feedback is then utilized to effect
improvements in the whole learning process

Well-posed learning problem:


A computer program is said to learn from experience E with respect to some task T and some performance
measure P, if its performance on T, as measured by P, improves with experience E.
A problem can be characterized as a well-posed learning problem if it has three traits-
• Task
• Performance Measure
• Experience
Some examples that illustrate well-posed learning problems are:
1. To better filter emails as spam or not
• Task – Classifying emails as spam or not
• Performance Measure – The fraction of emails accurately classified as spam or not spam
• Experience – Observing you label emails as spam or not spam
2. A checkers learning problem
• Task – Playing checkers game
• Performance Measure – percent of games won against opponents
• Experience – playing practice games against itself
3. Handwriting Recognition Problem
• Task – recognizing handwritten words within images
• Performance Measure – percent of words accurately classified
• Experience – a directory of handwritten words with given classifications
4. A Robot Driving Problem
• Task – driving on public four-lane highways using sight scanners
• Performance Measure – average distance traveled before an error
• Experience – order of images and steering instructions noted down while observing a human
driver
5. Fruit Prediction Problem
• Task – forecasting different fruits for recognition
• Performance Measure – able to predict maximum variety of fruits
• Experience – training machine with the largest datasets of fruits images
6. Face Recognition Problem
• Task – predicting different types of faces
• Performance Measure – able to predict maximum types of faces
• Experience – training machine with maximum amount of datasets of different face images
7. Automatic Translation of documents
• Task – translating one type of language used in a document to other language
• Performance Measure – able to convert one language to other efficiently
• Experience – training machine with a large dataset of different types of languages

Goals and applications of machine learning:


 Learning to recognize spoken words.
o All of the most successful speech recognition systems employ machine learning in some
form. For example, the SPHINX system (e.g., Lee 1989) learns speaker-specific strategies for
recognizing the primitive sounds (phonemes) and words from the observed speech signal.
Neural network learning methods (e.g., Waibel et al. 1989) and methods for learning hidden
Markov models (e.g., Lee 1989) are effective for automatically customizing to individual
speakers, vocabularies, microphone characteristics, background noise, etc. Similar techniques
have potential applications in many signal-interpretation problems.
 Learning to drive an autonomous vehicle.
o Machine learning methods have been used to train computer-controlled vehicles to steer
correctly when driving on a variety of road types. For example, the ALVINN system
(Pomerleau 1989) has used its learned strategies to drive unassisted at 70 miles per hour for
90 miles on public highways among other cars. Similar techniques have possible applications
in many sensor-based control problems.
 Learning to classify new astronomical structures.
o Machine learning methods have been applied to a variety of large databases to learn general
regularities implicit in the data. For example, decision tree learning algorithms have been
used by NASA to learn how to classify celestial objects from the second Palomar
Observatory Sky Survey (Fayyad et al. 1995). This system is now used to automatically
classify all objects in the Sky Survey, which consists of three terabytes of image data.
 Learning to play world-class backgammon.
o The most successful computer programs for playing games such as backgammon are based on
machine learning algorithms. For example, the world's top computer program for
backgammon, TD-GAMMON (Tesauro 1992, 1995), learned its strategy by playing over one
million practice games against itself. It now plays at a level competitive with the human
world champion. Similar techniques have applications in many practical problems where very
large search spaces must be examined efficiently.
 Banking and finance
o banks analyze their past data to build models to use in credit applications, fraud detection, and
the stock market
 Insurance
o risk prediction
o claims management
 Business
o to study consumer behavior
 Manufacturing
o optimization, control, and troubleshooting
 Medicine
o learning programs are used for medical diagnosis
 Telecommunications
o call patterns are analyzed for network optimization and maximizing the quality of service
 Artificial intelligence
o to teach a system to learn and adapt to changes so that the system designer need not foresee
and provide solutions for all possible situations
 It is used to find solutions to many problems in vision, speech recognition, and robotics.

ISSUES IN MACHINE LEARNING:


 What algorithms exist for learning general target functions from specific training examples? In what
settings will particular algorithms converge to the desired function, given sufficient training data?
Which algorithms perform best for which types of problems and representations?
 How much training data is sufficient? What general bounds can be found to relate the confidence in
learned hypotheses to the amount of training experience and the character of the learner's hypothesis
space?
 When and how can prior knowledge held by the learner guide the process of generalizing from
examples? Can prior knowledge be helpful even when it is only approximately correct?
 What is the best strategy for choosing a useful next training experience, and how does the choice of
this strategy alter the complexity of the learning problem?
 What is the best way to reduce the learning task to one or more function approximation problems?
Put another way, what specific functions should the system attempt to learn? Can this process itself
be automated?
 How can the learner automatically alter its representation to improve its ability to represent and learn
the target function?

1. Inadequate training data: due to noisy data, incorrect data, and generalization of output data
2. Poor quality of data
3. Non-representative training data
4. Overfitting and underfitting
5. Getting bad recommendations
6. Lack of skilled resources
7. Process complexity of machine learning
8. Data bias (errors)
9. Slow implementations and results
10. Irrelevant features

TYPES OF MACHINE LEARNING


Machine learning can be classified into three broad categories:
1. Supervised learning – Also called predictive learning. A machine predicts the class of unknown objects
based on prior class-related information of similar objects.
2. Unsupervised learning – Also called descriptive learning. A machine finds patterns in unknown objects by
grouping similar objects together.
3. Reinforcement learning – A machine learns to act on its own to achieve the given goals.

FIG. Types of machine learning


Supervised learning:

FIG. Supervised learning


Some examples of supervised learning are
 Predicting the results of a game
 Predicting whether a tumour is malignant or benign
 Predicting the price of domains like real estate, stocks, etc.
 Classifying texts such as classifying a set of emails as spam or non-spam

There are two areas of supervised learning: classification and regression.


Classification: classification is a type of supervised learning where a target feature, which is of type
categorical, is predicted for test data based on the information imparted by training data. The target
categorical feature is known as class.

FIG. Classification
Some typical classification problems include:
 Image classification
 Prediction of disease
 Win–loss prediction of games
 Prediction of natural calamity like earthquake, flood, etc.
 Recognition of handwriting
Regression: In linear regression, the objective is to predict numerical features like real estate or stock
price, temperature, marks in an examination, sales revenue, etc. The underlying predictor variable and the
target variable are continuous in nature.
In case of simple linear regression, there is only one predictor variable whereas in case of multiple linear
regression, multiple predictor variables can be included in the model.
A typical simple linear regression model can be represented in the form
y = a + b·x
where 'x' is the predictor variable, 'y' is the target variable, and 'a' (intercept) and 'b' (slope) are the
coefficients to be learned.
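The coefficients of the simple linear regression model y = a + b·x can be estimated by least squares; a minimal sketch in Python, with made-up housing numbers purely for illustration:

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares estimates of intercept a and slope b for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope: covariance of (x, y) divided by variance of x
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data: floor area (sq. ft) vs price, illustrative only
areas = [1000, 1500, 2000, 2500]
prices = [60, 85, 110, 135]
a, b = fit_simple_linear_regression(areas, prices)
print(a, b)  # intercept ≈ 10, slope ≈ 0.05 (this data is exactly linear)
```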
Typical applications of regression can be seen in
 Demand forecasting in retails
 Sales prediction for managers
 Price prediction in real estate
 Weather forecast
 Skill demand forecast in job market.

Unsupervised learning: In unsupervised learning, there is no labeled training data to learn from and no
prediction to be made. In unsupervised learning, the objective is to take a dataset as input and try to find
natural groupings or patterns within the data elements or records. Therefore, unsupervised learning is often
termed a descriptive model, and the process of unsupervised learning is referred to as pattern discovery or
knowledge discovery.
FIG. Unsupervised learning
There are two areas of unsupervised learning: clustering and association analysis.

Some examples of unsupervised learning are


 Market basket analysis
 Recommender systems
 Customer segmentation etc.
Reinforcement learning:

FIG. Reinforcement learning


Aspects of developing a learning system or Design of a learning system:

The design choices will be to decide the following key components:


1. Type of training experience
2. Choosing the Target Function
3. Choosing a representation for the Target Function
4. Choosing an approximation algorithm for the Target Function
5. The final Design
We will look into the checkers learning problem and apply the above design choices. For a checkers
learning problem, the three elements will be,
• Task T: To play checkers
• Performance measure P: Percent of games won in the tournament.
• Training experience E: A set of games played against itself.
1. Type of training experience:
During the design of the checker's learning system, the type of training experience available for a
learning system will have a significant effect on the success or failure of the learning.
Direct or Indirect training experience:
In the case of direct training experience, individual board states and the correct move for each
board state are given. In case of indirect training experience, the move sequences for a game and the
final result (win, lose or draw) are given for a number of games. How to assign credit or blame to
individual moves is the credit assignment problem.
1. Teacher or Not:
 Supervised: The training experience will be labelled, which means, all the board states will be
labelled with the correct move. So the learning takes place in the presence of a supervisor or a
teacher.
 Un-Supervised: The training experience will be unlabelled, which means, all the board states will not
have the moves. So the learner generates random games and plays against itself with no supervision
or teacher involvement.
 Semi-supervised: Learner generates game states and asks the teacher for help in finding the correct
move if the board state is confusing.
2. Is the training experience good:
 Do the training examples represent the distribution of examples over which the final system
performance will be measured? Performance is best when training examples and test examples are
from the same/a similar distribution.
 The checker player learns by playing against oneself. Its experience is indirect. It may not encounter
moves that are common in human expert play. Once the proper training experience is available, the
next design step will be choosing the Target Function.

2. Choosing the Target Function


When you are playing the checkers game, at any moment of time, you make a decision on choosing
the best move from different possibilities. You think and apply the learning that you have gained from the
experience. Here the learning is, for a specific board, you move a checker such that your board state tends
towards the winning situation. Now the same learning has to be defined in terms of the target function.
Here there are 2 considerations — direct and indirect experience.
• In the case of direct experience, the checkers learning system needs only to learn how to choose the best
move from among a large search space. We need to find a target function that will help us choose the best
move among alternatives.
Let us call this function ChooseMove and use the notation ChooseMove : B → M to indicate that this
function accepts as input any board from the set of legal board states B and produces as output some move
from the set of legal moves M.
• When there is indirect experience, it becomes difficult to learn such a function directly. Instead, we can
assign a real-valued score to each board state.
So let the function be V : B → R, indicating that it accepts as input any board from the set of legal board
states B and produces as output a real score. This function assigns higher scores to better board states.
If the system can successfully learn such a target function V, then it can easily use it to select the best move
from any board position. Let us therefore define the target value V(b) for an arbitrary board state b in B, as
follows:
1. if b is a final board state that is won, then V(b) = 100
2. if b is a final board state that is lost, then V(b) = -100
3. if b is a final board state that is drawn, then V(b) = 0
4. if b is not a final state in the game, then V(b) = V(b’), where b’ is the best final board state that can be
achieved starting from b and playing optimally until the end of the game.
Case (4) is a recursive definition: to determine the value of V(b) for a particular board state, it requires
searching ahead for the optimal line of play, all the way to the end of the game. Since this definition is not
efficiently computable by our checkers playing program, we say that it is a nonoperational definition.

3. Choosing a representation for the Target Function


Now that we have specified the ideal target function V, we must choose a representation that the
learning program will use to describe the function ^V that it will learn. As with earlier design choices, we
again have many options. We could, for example, allow the program to represent using a large table with a
distinct entry specifying the value for each distinct board state. Or we could allow it to represent using a
collection of rules that match against features of the board state, or a quadratic polynomial function of
predefined board features, or an artificial neural network. In general, this choice of representation involves a
crucial trade off. On one hand, we wish to pick a very expressive representation to allow representing as
close an approximation as possible to the ideal target function V. On the other hand, the more expressive the
representation, the more training data the program will require in order to choose among the alternative
hypotheses it can represent. To keep the discussion brief, let us choose a simple representation: for any given
board state, the function ^V will be calculated as a linear combination of the following board features:
• x1(b) — number of black pieces on board b
• x2(b) — number of red pieces on b
• x3(b) — number of black kings on b
• x4(b) — number of red kings on b
• x5(b) — number of red pieces threatened by black
• x6(b) — number of black pieces threatened by red
^V = w0 + w1 · x1(b) + w2 · x2(b) + w3 · x3(b) + w4 · x4(b) +w5 · x5(b) + w6 · x6(b)
Where w0 through w6 are numerical coefficients or weights to be obtained by a learning algorithm. Weights
w1 to w6 will determine the relative importance of different board features
4. Choosing an approximation algorithm for the Target Function
Generating training data — To train our learning program, we need a set of training data, each
describing a specific board state b and the training value V_train (b) for b. Each training example is an
ordered pair <b,V_train(b)>.
Temporal difference (TD) learning is a concept central to reinforcement learning, in which learning happens
through the iterative correction of your estimated returns towards a more accurate target return.
V_train(b) ← ^V(Successor(b))
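The update of the weights w0 through w6 is not spelled out above; a standard choice (the LMS rule from Mitchell's treatment) is w_i ← w_i + η·(V_train(b) − ^V(b))·x_i. A sketch with illustrative feature values and learning rate:

```python
def v_hat(w, x):
    """Linear evaluation ^V = w0 + w1*x1 + ... + w6*x6 for board features x."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def lms_update(w, x, v_train, eta=0.001):
    """One LMS step: nudge each weight to reduce the error (v_train - ^V)."""
    error = v_train - v_hat(w, x)
    return [w[0] + eta * error] + [wi + eta * error * xi
                                   for wi, xi in zip(w[1:], x)]

# x = (x1..x6): black pieces, red pieces, black kings, red kings,
# red pieces threatened, black pieces threatened (values are illustrative)
x = [12, 12, 0, 0, 1, 0]
w = [0.0] * 7
w = lms_update(w, x, v_train=100)  # e.g. a won final state, V_train = 100
print(v_hat(w, x))                 # estimate has moved from 0 toward the target 100
```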
5. The final Design
The final design of our checkers learning system can be naturally described by four distinct program
modules that represent the central components in many learning systems.
1. The Performance System: Takes a new board as input and outputs a trace of the game it played against
itself.
2. The Critic: Takes the trace of a game as an input and outputs a set of training examples of the target
function.
3. The Generalizer: Takes training examples as input and outputs a hypothesis that estimates the target
function. Good generalization to new cases is crucial.
4. The Experiment Generator: Takes the current hypothesis (currently learned function) as input and outputs
a new problem (an initial board state) for the performance system to explore.
Summary of choices in designing the checkers learning program.

Inductive Classification:

In concept learning we learn a description only for the positive class and label everything that does not
satisfy that description as negative.
Each concept can be viewed as describing some subset of objects or events defined over a larger set (e.g., the
subset of animals that constitute birds). Alternatively, each concept can be thought of as a boolean-valued
function defined over this larger set (e.g., a function defined over all animals, whose value is true for birds
and false for other animals).
Concept learning. Inferring a boolean-valued function from training examples of its input and output.

THE CONCEPT LEARNING TASK:


Below Table describes a set of example days, each represented by a set of attributes. The attribute
EnjoySport indicates whether or not Aldo (a person) enjoys his favorite water sport on this day. The
task is to learn to predict the value of EnjoySport for an arbitrary day, based on the values of its other
attributes.
Let us begin by considering a simple representation in which each hypothesis consists of a conjunction of
constraints on the instance attributes. In particular, let each hypothesis be a vector of six constraints,
specifying the values of the six attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast. For each
attribute, the hypothesis will either
 indicate by a "?" that any value is acceptable for this attribute,
 specify a single required value (e.g., Warm) for the attribute, or
 indicate by a "Φ" that no value is acceptable.
If some instance x satisfies all the constraints of hypothesis h, then h classifies x as a positive example
(h(x) = 1). To illustrate, the hypothesis that Aldo enjoys his favorite sport only on cold days with high
humidity (independent of the values of the other attributes) is represented by the expression
<?, Cold, High, ?, ?, ?>

TABLE : Positive and negative training examples for the target concept EnjoySport.
The most general hypothesis-that every day is a positive example-is represented by
<?, ?, ?, ?, ?, ?>
and the most specific possible hypothesis-that no day is a positive example-is represented by
< Φ, Φ, Φ, Φ, Φ, Φ >
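The "x satisfies h" test reduces to a per-attribute comparison. A small sketch (representing the empty constraint with the string 'Φ', a representation choice made here for illustration):

```python
def satisfies(x, h):
    """h(x) = 1 iff every constraint in h admits the corresponding value in x.
    '?' admits anything; any Φ constraint makes h reject every instance."""
    return all(c == '?' or c == v for c, v in zip(h, x)) and 'Φ' not in h

day = ('Rainy', 'Cold', 'High', 'Strong', 'Cool', 'Change')
print(satisfies(day, ('?', 'Cold', 'High', '?', '?', '?')))  # True
print(satisfies(day, ('Φ',) * 6))                            # False: most specific hypothesis
```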
The definition of the EnjoySport concept learning task in this general form is given below.
Given:
 Instances X: Possible days, each described by the attributes
o Sky (with possible values Sunny, Cloudy, and Rainy),
o AirTemp (with values Warm and Cold),
o Humidity (with values Normal and High),
o Wind (with values Strong and Weak),
o Water (with values Warm and Cool), and
o Forecast (with values Same and Change).
 Hypotheses H: Each hypothesis is described by a conjunction of constraints on the attributes Sky,
AirTemp, Humidity, Wind, Water, and Forecast. The constraints may be "?" (any value is
acceptable), "Φ" (no value is acceptable), or a specific value.
 Target concept c: EnjoySport : X → {0, 1}
 Training examples D: Positive and negative examples of the target function (see above Table).
Determine:
 A hypothesis h in H such that h(x) = c(x) for all x in X.

The set of items over which the concept is defined is called the set of instances, which we denote by X.
The concept or function to be learned is called the target concept, which we denote by c. In general, c can be
any boolean-valued function defined over the instances X; that is, c : X → {0, 1}.
In the current example, the target concept corresponds to the value of the attribute EnjoySport (i.e., c(x) = 1
if EnjoySport = Yes, and c(x) = 0 if EnjoySport = No).
When learning the target concept, the learner is presented a set of training examples, each consisting of an
instance x from X, along with its target concept value c(x) (e.g., the training examples in above Table).
Instances for which c(x) = 1 are called positive examples, or members of the target concept. Instances for
which c(x) = 0 are called negative examples, or nonmembers of the target concept. We will often write the
ordered pair (x, c(x)) to describe the training example consisting of the instance x and its target concept
value c(x). We use the symbol D to denote the set of available training examples.
Given a set of training examples of the target concept c, the problem faced by the learner is to hypothesize,
or estimate, c. We use the symbol H to denote the set of all possible hypotheses that the learner may consider
regarding the identity of the target concept. Usually H is determined by the human designer's choice of
hypothesis representation.
In general, each hypothesis h in H represents a boolean-valued function defined over X; that is,
h : X → {0, 1}.
The goal of the learner is to find a hypothesis h such that h(x) = c(x) for all x in X.

The Inductive Learning Hypothesis:


Any hypothesis found to approximate the target function well over a sufficiently large set of training
examples will also approximate the target function well over other unobserved examples.

CONCEPT LEARNING AS SEARCH


Concept learning can be viewed as the task of searching through a large space of hypotheses implicitly
defined by the hypothesis representation. The goal of this search is to find the hypothesis that best fits the
training examples.
Given that the attribute Sky has three possible values, and that AirTemp, Humidity, Wind, Water, and
Forecast each have two possible values,
the instance space X contains exactly 3·2·2·2·2·2 = 96 distinct instances.
A similar calculation shows that there are 5·4·4·4·4·4 = 5120 syntactically distinct hypotheses within H.
The number of semantically distinct hypotheses is only 1 + (4·3·3·3·3·3) = 973, because every hypothesis
containing one or more Φ constraints classifies every instance as negative, so all such hypotheses represent
the same single empty concept.
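These counts can be checked directly:

```python
# Attribute domain sizes: Sky has 3 values, the other five have 2 each
instances = 3 * 2 ** 5                 # distinct instances in X
syntactic = (3 + 2) * (2 + 2) ** 5     # each attribute also allows '?' and 'Φ'
semantic = 1 + (3 + 1) * (2 + 1) ** 5  # all Φ-containing hypotheses collapse to one empty concept
print(instances, syntactic, semantic)  # 96 5120 973
```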

GENERAL-TO-SPECIFIC ORDERING OF HYPOTHESES


To illustrate the general-to-specific ordering,
consider the two hypotheses h1 = <Sunny, ?, ?, Strong, ?, ?>
h2 = <Sunny, ?, ?, ?, ?, ?>

Now consider the sets of instances that are classified positive by h1 and by h2. Because h2 imposes fewer
constraints on the instance, it classifies more instances as positive. In fact, any instance classified positive by
h1 will also be classified positive by h2. Therefore, we say that h2 is more general than h1.

First, for any instance x in X and hypothesis h in H, we say that x satisfies h if and only if h(x) = 1

Definition: Let hj and hk be boolean-valued functions defined over X. Then hj is more-general-than-or-equal-
to hk (written hj ≥ hk) if and only if
(∀x ∈ X) [(hk(x) = 1) → (hj(x) = 1)]
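For conjunctive hypotheses this relation can be checked attribute by attribute (provided hk contains no Φ constraint): hj ≥ hk iff each constraint in hj is "?" or equals the corresponding constraint in hk. Checking h1 and h2 from above:

```python
def more_general_or_equal(hj, hk):
    """hj >= hk for conjunctive hypotheses (assumes hk contains no Φ)."""
    return all(a == '?' or a == b for a, b in zip(hj, hk))

h1 = ('Sunny', '?', '?', 'Strong', '?', '?')
h2 = ('Sunny', '?', '?', '?', '?', '?')
print(more_general_or_equal(h2, h1))  # True: h2 imposes fewer constraints
print(more_general_or_equal(h1, h2))  # False: h1 additionally requires Strong
```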

FIND-S: FINDING A MAXIMALLY SPECIFIC HYPOTHESIS


FIND-S Algorithm starts from the most specific hypothesis and generalize it by considering only
positive examples.
• FIND-S algorithm ignores negative examples: as long as the hypothesis space contains a hypothesis that
describes the true target concept, and the training data contains no errors, ignoring negative examples
does not cause any problem.
• FIND-S algorithm finds the most specific hypothesis within H that is consistent with the positive training
examples. – The final hypothesis will also be consistent with negative examples if the correct target concept
is in H, and the training examples are correct.
FIND-S Algorithm:
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
for each attribute constraint ai, in h
If the constraint ai, is satisfied by x
Then do nothing
Else replace ai, in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
Example: Consider the following data set having the data about which particular seeds are poisonous.

First, we initialize the hypothesis to the most specific one. Hence, our hypothesis would be:
h = {ϕ, ϕ, ϕ, ϕ}
Consider example 1: The data in example 1 is {GREEN, HARD, NO, WRINKLED}.
We see that our initial hypothesis is too specific, and we have to generalize it for this example.
Hence, the hypothesis becomes: h = {GREEN, HARD, NO, WRINKLED}
Consider example 2:
Here we see that this example has a negative outcome. Hence we neglect this example and our hypothesis
remains the same. h = {GREEN, HARD, NO, WRINKLED}
Consider example 3: Here we see that this example has a negative outcome. hence we neglect this example
and our hypothesis remains the same. h = {GREEN, HARD, NO, WRINKLED}
Consider example 4: The data present in example 4 is {ORANGE, HARD, NO, WRINKLED}. We compare
every single attribute with the initial data and if any mismatch is found we replace that particular attribute
with a general case (“ ?”). After doing the process the hypothesis becomes:
h = {?, HARD, NO, WRINKLED }
Consider example 5: The data present in example 5 is {GREEN, SOFT, YES, SMOOTH}. We compare
every single attribute with the initial data and if any mismatch is found we replace that particular attribute
with a general case ( “?” ). After doing the process the hypothesis becomes: h = {?, ?, ?, ? }
Since we have reached a point where all the attributes in our hypothesis have the general condition, examples
6 and 7 would result in the same hypothesis with all general attributes. h = {?, ?, ?, ? }
Hence, for the given data the final hypothesis would be:
Final Hypothesis: h = { ?, ?, ?, ? }.
Note:
Find-S algorithm for the above EnjoySport table gives: h = <Sunny, Warm, ?, Strong, ?, ?>
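The three steps of the algorithm translate almost line-for-line into Python. Run on the four EnjoySport training examples that appear later in these notes, the sketch reproduces <Sunny, Warm, ?, Strong, ?, ?>:

```python
def find_s(examples):
    """examples: list of (instance_tuple, is_positive). Returns the maximally
    specific hypothesis consistent with the positive examples."""
    h = None  # stands in for the all-Φ hypothesis before any positive example
    for x, positive in examples:
        if not positive:
            continue  # FIND-S ignores negative examples
        if h is None:
            h = list(x)  # first positive example: generalize <Φ,...> to x itself
        else:
            # replace every mismatching constraint with the more general '?'
            h = [hi if hi == xi else '?' for hi, xi in zip(h, x)]
    return h

data = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
    (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), True),
    (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'), False),
    (('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change'), True),
]
print(find_s(data))  # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```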

Version spaces and the candidate elimination algorithm

Definition: A hypothesis h is consistent with a set of training examples D if and only if h(x) = c(x) for each
example (x, c(x)) in D.

Definition(Version space). A concept is complete if it covers all positive examples. A concept is consistent if
it covers none of the negative examples. The version space is the set of all complete and consistent concepts.
This set is convex and is fully defined by its least and most general elements.

FIGURE :A version space with its general and specific boundary sets. The version space includes all six
hypotheses shown here, but can be represented more simply by S and G. Arrows indicate instances of the
more-general-than relation. This is the version space for the Enjoysport concept learning problem and
training examples described in above Table Enjoy Sport

THE CANDIDATE-ELIMINATION ALGORITHM

A second approach to concept learning is the CANDIDATE-ELIMINATION algorithm, which addresses
several of the limitations of FIND-S.

The CANDIDATE-ELIMINATION algorithm computes the version space containing all hypotheses
from H that are consistent with an observed sequence of training examples. It begins by initializing the
version space to the set of all hypotheses in H; that is, by initializing the G boundary set to contain the most
general hypothesis in H
Go {(?, ?, ?, ?, ?, ?)}

and initializing the S boundary set to contain the most specific (least general) hypothesis

So {(Φ, Φ, Φ, Φ, Φ, Φ)}

Algorithm: CANDIDATE-ELIMINATION algorithm using version spaces. Notice the duality in how
positive and negative examples influence S and G

Steps:

Initialize G to the set of maximally general hypotheses in H


Initialize S to the set of maximally specific hypotheses in H
for each training example d, do
 if d is a positive example
o Remove from G any hypothesis inconsistent with d
o For each hypothesis s in S that is not consistent with d
 Remove s from S
 Add to S all minimal generalizations h of s such that
 h is consistent with d, and some member of G is more general than h
 Remove from S any hypothesis that is more general than another hypothesis in S
 If d is a negative example
o Remove from S any hypothesis inconsistent with d
o For each hypothesis g in G that is not consistent with d
 Remove g from G
 Add to G all minimal specializations h of g such that
 h is consistent with d, and some member of S is more specific than h
 Remove from G any hypothesis that is less general than another hypothesis in G
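The steps above can be sketched in Python for conjunctive hypotheses over the EnjoySport attributes. This is a minimal sketch: the helper names, the '?'/'0' encoding of "any value" and "no value" (Φ), and the data layout are illustrative choices, not from any library.

```python
# A minimal sketch of CANDIDATE-ELIMINATION for conjunctive hypotheses.
# '?' matches any value; '0' stands for the empty (phi) constraint.
WILD, EMPTY = "?", "0"

def consistent(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(hv in (WILD, xv) for hv, xv in zip(h, x))

def more_general(h1, h2):
    """True if h1 is more general than or equal to h2."""
    return all(a == WILD or b == EMPTY or a == b for a, b in zip(h1, h2))

def min_generalize(s, x):
    """Minimal generalization of s that covers positive instance x."""
    return tuple(xv if sv == EMPTY else (sv if sv == xv else WILD)
                 for sv, xv in zip(s, x))

def min_specializations(g, x, domains):
    """Minimal specializations of g that exclude negative instance x."""
    return [g[:i] + (v,) + g[i + 1:]
            for i, gv in enumerate(g) if gv == WILD
            for v in domains[i] if v != x[i]]

def candidate_elimination(examples, domains):
    n = len(domains)
    S, G = {(EMPTY,) * n}, {(WILD,) * n}
    for x, positive in examples:
        if positive:
            G = {g for g in G if consistent(g, x)}
            S = {min_generalize(s, x) for s in S}
            # keep only generalizations still bounded above by some g in G
            S = {s for s in S if any(more_general(g, s) for g in G)}
        else:
            S = {s for s in S if not consistent(s, x)}
            newG = set()
            for g in G:
                if not consistent(g, x):
                    newG.add(g)
                    continue
                for h in min_specializations(g, x, domains):
                    if any(more_general(h, s) for s in S):
                        newG.add(h)
            # drop any member less general than another member of G
            G = {g for g in newG
                 if not any(h != g and more_general(h, g) for h in newG)}
    return S, G

domains = [("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change")]
examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
S, G = candidate_elimination(examples, domains)
print(S)  # {('Sunny', 'Warm', '?', 'Strong', '?', '?')}
print(G)  # {('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')}
```

Running this on the four EnjoySport examples reproduces the trace that follows: the negative third example specializes G, and the fourth (positive) example generalizes S and prunes one member of G.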

Training examples:
1. <Sunny, Warm, Normal, Strong, Warm, Same>, EnjoySport = Yes
2. <Sunny, Warm, High, Strong, Warm, Same>, EnjoySport = Yes

Training example:
3. <Rainy, Cold, High, Strong, Warm, Change>, EnjoySport = No

FIGURE: CANDIDATE-ELIMINATION Trace 2. Training example 3 is a negative example that forces the G2
boundary to be specialized to G3. Note that several alternative maximally general hypotheses are included in G3.

Training example:
4. <Sunny, Warm, High, Strong, Cool, Change>, EnjoySport = Yes

FIGURE: CANDIDATE-ELIMINATION Trace 3. The positive training example generalizes the S boundary from S3
to S4. One member of G3 must also be deleted, because it is no longer more general than the S4 boundary.

The final version space for the EnjoySport concept learning problem and training examples described earlier
is bounded by S4 = {<Sunny, Warm, ?, Strong, ?, ?>} and G4 = {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}.
Learning conjunctive concepts

The importance of inductive bias


The fundamental questions for inductive inference
1. What if the target concept is not contained in the hypothesis space?
2. Can we avoid this difficulty by using a hypothesis space that includes every possible hypothesis?
3. How does the size of this hypothesis space influence the ability of the algorithm to generalize to
unobserved instances?
4. How does the size of the hypothesis space influence the number of training examples that must be
observed?
These fundamental questions are examined in the context of the CANDIDATE-ELIMINATION algorithm.
A Biased Hypothesis Space:
 Suppose the target concept is not contained in the hypothesis space H; an obvious solution is to enrich
the hypothesis space to include every possible hypothesis.
 Consider the EnjoySport example, in which the hypothesis space is restricted to include only
conjunctions of attribute values. Because of this restriction, the hypothesis space is unable to represent
even simple disjunctive target concepts such as
"Sky = Sunny or Sky = Cloudy."
 Given the following three training examples of this disjunctive target concept, the algorithm would find
that there are zero hypotheses in the version space:
<Sunny Warm Normal Strong Cool Change> Y
<Cloudy Warm Normal Strong Cool Change> Y
<Rainy Warm Normal Strong Cool Change> N
 If the CANDIDATE-ELIMINATION algorithm is applied, it ends up with an empty version space.
After the first two training examples:
S2 = <?, Warm, Normal, Strong, Cool, Change>
 This new hypothesis is overly general: it incorrectly covers the third (negative) training example. So H
does not include the appropriate target concept c.
 In this case, a more expressive hypothesis space is required.
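This failure can be checked directly. The sketch below (helper names are illustrative, not library code) computes the least conjunctive generalization of the two positive examples and shows that it wrongly covers the negative example:

```python
# Generalizing the two positive examples forces Sky to '?', and the
# resulting conjunctive hypothesis covers the negative example too.
def min_generalize(s, x):
    """Least conjunctive hypothesis covering both s and x."""
    return tuple(sv if sv == xv else "?" for sv, xv in zip(s, x))

def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(hv in ("?", xv) for hv, xv in zip(h, x))

pos1 = ("Sunny", "Warm", "Normal", "Strong", "Cool", "Change")
pos2 = ("Cloudy", "Warm", "Normal", "Strong", "Cool", "Change")
neg  = ("Rainy", "Warm", "Normal", "Strong", "Cool", "Change")

s2 = min_generalize(pos1, pos2)
print(s2)               # ('?', 'Warm', 'Normal', 'Strong', 'Cool', 'Change')
print(covers(s2, neg))  # True: the negative example is wrongly covered,
                        # so no conjunctive hypothesis fits all three examples
```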

An Unbiased Learner
 The solution to the problem of assuring that the target concept is in the hypothesis space H is to provide
a hypothesis space capable of representing every teachable concept, that is, every possible subset of the
instances X.
 The set of all subsets of a set X is called the power set of X.
 In the EnjoySport learning task, the size of the instance space X of days described by the six attributes is
96 instances (3 · 2 · 2 · 2 · 2 · 2).
 Thus, there are 2^96 distinct target concepts that could be defined over this instance space, any of which
the learner might be called upon to learn.
 The conjunctive hypothesis space is able to represent only 973 of these - a biased hypothesis space
indeed.
 Let us reformulate the EnjoySport learning task in an unbiased way by defining a new hypothesis space
H' that can represent every subset of instances.
 The target concept "Sky = Sunny or Sky = Cloudy" could then be described as
(Sunny, ?, ?, ?, ?, ?) v (Cloudy, ?, ?, ?, ?, ?)
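The counts used above (96 instances, 2^96 target concepts, 973 semantically distinct conjunctive hypotheses) can be verified with a short computation:

```python
# Sky has 3 values; each of the other five attributes has 2.
domain_sizes = [3, 2, 2, 2, 2, 2]

instances = 1
for d in domain_sizes:
    instances *= d
print(instances)           # 96 distinct instances

print(2 ** instances)      # 2**96 possible target concepts (subsets of X)

# Each attribute takes one of its values or '?'; every hypothesis that
# contains a phi denotes the same empty concept, so all of them count as one.
nonempty = 1
for d in domain_sizes:
    nonempty *= d + 1
print(nonempty + 1)        # 972 non-empty hypotheses + 1 empty concept = 973
```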

The Futility of Bias-Free Learning:


Inductive learning requires some form of prior assumptions, or inductive bias.
Definition: Consider a concept learning algorithm L for the set of instances X. Let c be an arbitrary concept
defined over X, and let Dc = {<x, c(x)>} be an arbitrary set of training examples of c. Let L(xi, Dc) denote
the classification assigned to the instance xi by L after training on the data Dc. The inductive bias of L is any
minimal set of assertions B such that for any target concept c and corresponding training examples Dc

(∀ xi ∈ X) [ (B ∧ Dc ∧ xi) ⊢ L(xi, Dc) ]

An inductive bias allows a learning algorithm to prioritize one solution (or interpretation) over another,
independent of the observed data. [...] Inductive biases can express assumptions about either the data-
generating process or the space of solutions.
The figure below illustrates modelling inductive systems by equivalent deductive systems.


 The input-output behavior of the CANDIDATE-ELIMINATION algorithm using a hypothesis space H
is identical to that of a deductive theorem prover utilizing the assertion "H contains the target concept."
This assertion is therefore called the inductive bias of the CANDIDATE-ELIMINATION algorithm.
 Characterizing inductive systems by their inductive bias allows modelling them by their equivalent
deductive systems. This provides a way to compare inductive systems according to their policies for
generalizing beyond the observed training data.
