EVALUATING HYPOTHESES
Chapter 5
Bias in the Estimate: the observed accuracy of the learned hypothesis over the training examples is often an optimistically biased estimate of its accuracy over future examples, because the hypothesis was derived from these same examples.
Definition: The sample error (denoted error_S(h)) of hypothesis h with respect to target function f and data sample S is

    error_S(h) ≡ (1/n) Σ_{x∈S} δ(f(x), h(x))

where n is the number of examples in S, and the quantity δ(f(x), h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.
The true error of a hypothesis is the probability that it will misclassify a single randomly
drawn instance from the distribution D.
Definition: The true error (denoted error_D(h)) of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:

    error_D(h) ≡ Pr_{x∈D} [f(x) ≠ h(x)]
What we usually wish to know is the true error error_D(h) of the hypothesis, because this is the error we can expect when applying the hypothesis to future examples. All we can measure, however, is the sample error error_S(h) of the hypothesis for the data sample S that we happen to have in hand. How good an estimate of error_D(h) is provided by error_S(h)?
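To make the distinction concrete, the following sketch draws a sample S and compares error_S(h) with the known true error. The hypothesis h, target f, and the choice of D as uniform over [0, 1] are all hypothetical, chosen only so that error_D(h) can be computed exactly:

```python
import random

def h(x):
    """Hypothetical learned hypothesis: classifies x as positive above 0.5."""
    return x > 0.5

def f(x):
    """Hypothetical target function: the true threshold is 0.55."""
    return x > 0.55

def sample_error(sample):
    # error_S(h): fraction of examples in S that h misclassifies
    return sum(h(x) != f(x) for x in sample) / len(sample)

rng = random.Random(0)
# D is uniform over [0, 1], so error_D(h) = Pr[0.5 < x <= 0.55] = 0.05 exactly
S = [rng.random() for _ in range(1000)]
```

With n = 1000 the sample error lands near 0.05 but rarely equals it exactly, which is precisely the estimation problem addressed by confidence intervals.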
Confidence Intervals for Discrete-Valued Hypotheses
More specifically, suppose we wish to estimate the true error for some discrete-valued hypothesis h, based on its observed sample error over a sample S, where
The sample S contains n examples drawn independently of one another, and independently of h, according to the probability distribution D
n ≥ 30
Hypothesis h commits r errors over these n examples (i.e., error_S(h) = r/n).
Under these conditions, statistical theory allows us to make the following assertions:
1. Given no other information, the most probable value of error_D(h) is error_S(h)
2. With approximately 95% probability, the true error error_D(h) lies in the interval

    error_S(h) ± 1.96 √( error_S(h)(1 − error_S(h)) / n )
To illustrate, suppose the data sample S contains n = 40 examples and that hypothesis h commits r = 12 errors over this data. In this case, the sample error error_S(h) = 12/40 = 0.30. Given no other information, the best estimate of the true error error_D(h) is the observed sample error 0.30. With 95% confidence, the true error lies in the interval 0.30 ± 1.96 √(0.30(1 − 0.30)/40) = 0.30 ± (1.96 × 0.07) = 0.30 ± 0.14.
The above expression for the 95% confidence interval can be generalized to any desired confidence level. The constant 1.96 is used when we desire a 95% confidence interval; a different constant, z_N, is used to calculate the N% confidence interval. The general expression for approximate N% confidence intervals for error_D(h) is

    error_S(h) ± z_N √( error_S(h)(1 − error_S(h)) / n )

where the constant z_N is chosen depending on the desired confidence level, using the values of z_N given in Table 5.1.
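The interval computation can be sketched directly in Python; the z_N values below are the standard two-sided normal constants the text's Table 5.1 is based on:

```python
import math

# z_N values for two-sided N% confidence levels (standard normal constants)
Z_N = {0.90: 1.64, 0.95: 1.96, 0.98: 2.33, 0.99: 2.58}

def confidence_interval(r, n, level=0.95):
    """Approximate N% confidence interval for error_D(h), valid for n >= 30.
    r = number of errors h commits on the n-example sample S."""
    e = r / n                                    # sample error error_S(h)
    half = Z_N[level] * math.sqrt(e * (1 - e) / n)
    return e - half, e + half

# worked example from the text: n = 40, r = 12  ->  roughly 0.30 +/- 0.14
low, high = confidence_interval(12, 40)
```

Note how requesting a higher confidence level simply substitutes a larger z_N, widening the interval around the same point estimate.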
INSTANCE-BASED LEARNING
Chapter 8
Instance-based learning methods such as nearest neighbour and locally weighted regression are conceptually straightforward approaches to approximating real-valued or discrete-valued target functions.
k - NEAREST NEIGHBOUR LEARNING
The most basic instance-based method is the k-NEARESNT NEIGHBOUR algorithm. The
nearest neighbours of an instance are defined in terms of the standard Euclidean distance.
More precisely, let an arbitrary instance x be described by the feature vector

    ⟨a_1(x), a_2(x), …, a_n(x)⟩

where a_r(x) denotes the value of the rth attribute of instance x. Then the distance between two instances x_i and x_j is defined to be d(x_i, x_j), where

    d(x_i, x_j) ≡ √( Σ_{r=1}^{n} (a_r(x_i) − a_r(x_j))² )
In the simplest case of k = 1, the algorithm assigns to f̂(x_q) the value f(x_i), where x_i is the training instance nearest to x_q. For larger values of k, the algorithm assigns the most common value among the k nearest training examples. The k-NEAREST NEIGHBOUR algorithm for approximating a discrete-valued target function is given in Table 8.1.
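The discrete-valued algorithm can be sketched as follows; the training data and labels are purely illustrative:

```python
import math
from collections import Counter

def euclidean(xi, xj):
    # d(xi, xj) = sqrt of the sum over attributes r of (a_r(xi) - a_r(xj))^2
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def knn_classify(training, xq, k):
    """Assign xq the most common target value among its k nearest training examples.
    `training` is a list of (feature_vector, label) pairs."""
    nearest = sorted(training, key=lambda ex: euclidean(ex[0], xq))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# illustrative data: two clusters in the plane
data = [((0, 0), '-'), ((0, 1), '-'), ((1, 0), '-'),
        ((5, 5), '+'), ((6, 5), '+'), ((5, 6), '+')]
```

With k = 1 this reduces to copying the label of the single nearest training instance.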
We can distance-weight the instances for real-valued target functions in a similar fashion, replacing the final line of the algorithm in this case by

    f̂(x_q) ← ( Σ_{i=1}^{k} w_i f(x_i) ) / ( Σ_{i=1}^{k} w_i ),  where w_i ≡ 1 / d(x_q, x_i)²
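A sketch of this distance-weighted variant for real-valued targets, assuming the common w_i = 1/d(x_q, x_i)² weighting and returning f(x_i) directly when the query coincides with a training point:

```python
import math

def euclidean(xi, xj):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def dw_knn_regress(training, xq, k):
    """Distance-weighted k-NN regression:
    f^(xq) = sum(w_i * f(x_i)) / sum(w_i), with w_i = 1 / d(xq, x_i)^2."""
    nearest = sorted(training, key=lambda ex: euclidean(ex[0], xq))[:k]
    num = den = 0.0
    for xi, fxi in nearest:
        d = euclidean(xi, xq)
        if d == 0.0:
            return fxi          # query coincides with a training point
        w = 1.0 / d ** 2
        num += w * fxi
        den += w
    return num / den

samples = [((0.0,), 0.0), ((1.0,), 1.0), ((2.0,), 2.0)]
```

Because nearer neighbours get larger weights, the prediction interpolates smoothly between training values rather than jumping between them.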
By indexing the training examples, the nearest neighbours of a query instance can be identified more efficiently at some additional cost in memory. One such indexing method is the kd-tree.
Note: For an example problem on the k-nearest neighbour algorithm, refer to the class notes.
A Note on Terminology
Much of the literature on nearest-neighbour methods and weighted local regression uses a
terminology that has arisen from the field of statistical pattern recognition. In reading that
literature, it is useful to know the following terms:
Regression means approximating a real-valued target function.
Residual is the error f̂(x) − f(x) in approximating the target function.
Kernel function is the function of distance that is used to determine the weight of
each training example. In other words, the kernel function is the function K such that
wi = K(d(xi, xq)).
1. Minimize the squared error over just the k nearest neighbours of x_q.
2. Minimize the squared error over the entire set D of training examples, while weighting the error of each training example by some decreasing function K of its distance from x_q.
3. Combine 1 and 2:

    E(x_q) ≡ ½ Σ_{x ∈ k nearest nbrs of x_q} (f(x) − f̂(x))² K(d(x_q, x))
RADIAL BASIS FUNCTIONS
In this approach, the learned hypothesis is a function of the form

    f̂(x) = w_0 + Σ_{u=1}^{k} w_u K_u(d(x_u, x))

where each x_u is an instance from X and where the kernel function K_u(d(x_u, x)) is defined so that it decreases as the distance d(x_u, x) increases. Here k is a user-provided constant that specifies the number of kernel functions to be included.
Even though f̂(x) is a global approximation to f(x), the contribution from each of the K_u(d(x_u, x)) terms is localized to a region nearby the point x_u. It is common to choose each function K_u(d(x_u, x)) to be a Gaussian function centered at the point x_u with some variance σ_u²:

    K_u(d(x_u, x)) = e^( −d²(x_u, x) / (2σ_u²) )
An example radial basis function (RBF) network is illustrated in Figure 8.2. Given a set of
training examples of the target function, RBF networks are typically trained in a two-stage
process. First, the number k of hidden units is determined and each hidden unit u is defined
by choosing the values of xu and σ2u that define its kernel function Ku(d(xu, x)). Second, the
weights wu are trained to maximize the fit of the network to the training data.
Several methods have been proposed for choosing an appropriate number of hidden units or, equivalently, kernel functions. One approach is to allocate a Gaussian kernel function for each training example (x_i, f(x_i)), centering this Gaussian at the point x_i.
Each of these kernels may be assigned the same width σ². Given this approach, the RBF network learns a global approximation to the target function in which each training example (x_i, f(x_i)) can influence the value of f̂ only in the neighbourhood of x_i. One advantage of this choice of kernel functions is that it allows the RBF network to fit the training data exactly.
A second approach is to choose a set of kernel functions that is smaller than the number of
training examples. This approach can be much more efficient than the first approach,
especially when the number of training examples is large. The set of kernel functions may be
distributed with centres spaced uniformly throughout the instance space X.
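The two-stage process can be sketched as follows. Stage 1 fixes the kernel centres (spaced uniformly, following the second approach above) and a shared width σ; stage 2 fits the output weights. The text says only that the weights are trained to maximize fit, so the stochastic-gradient loop below is one common choice, not the only one, and the target function f(x) = x² is purely illustrative:

```python
import math

def gaussian_kernel(d, sigma):
    # K_u(d) = exp(-d^2 / (2 * sigma^2))
    return math.exp(-d * d / (2.0 * sigma * sigma))

def features(x, centres, sigma):
    # bias term for w0, plus one Gaussian activation per hidden unit
    return [1.0] + [gaussian_kernel(abs(x - c), sigma) for c in centres]

def train_rbf(samples, centres, sigma, lr=0.1, epochs=2000):
    """Stage 2: fit the output weights w_u by stochastic gradient descent
    on squared error (least squares would work equally well here, since
    the output is linear in the weights)."""
    w = [0.0] * (len(centres) + 1)
    for _ in range(epochs):
        for x, y in samples:
            phi = features(x, centres, sigma)
            err = sum(wi * p for wi, p in zip(w, phi)) - y
            w = [wi - lr * err * p for wi, p in zip(w, phi)]
    return w

def rbf_predict(w, x, centres, sigma):
    return sum(wi * p for wi, p in zip(w, features(x, centres, sigma)))

# Stage 1: five centres spaced uniformly over [0, 1], one shared width
centres, sigma = [0.0, 0.25, 0.5, 0.75, 1.0], 0.3
samples = [(i / 10, (i / 10) ** 2) for i in range(11)]   # target f(x) = x^2
w = train_rbf(samples, centres, sigma)
```

Each Gaussian unit contributes appreciably only near its own centre, yet the weighted sum forms a single global approximation, exactly the localized-contribution behaviour described above.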
CASE-BASED REASONING
Instance-based methods such as k-NEAREST NEIGHBOUR and locally weighted regression share three key properties:
1. They are lazy learning methods in that they defer the decision of how to generalize beyond the training data until a new query instance is observed.
2. They classify new query instances by analyzing similar instances while ignoring instances that are very different from the query.
3. They represent instances as real-valued points in an n-dimensional Euclidean space.
Case-based reasoning is exemplified by the CADET system, which aids the conceptual design of mechanical devices: stored cases describe devices by their structure and function, and a new design problem is presented by specifying the desired function. If an exact match is found, indicating that some stored case implements exactly the desired function, then this case can be returned as a suggested solution to the design problem.
If no exact match occurs, CADET may find cases that match various subgraphs of the desired functional specification. In Figure 8.3, for example, the T-junction function matches a subgraph of the water faucet function graph.
Generic properties of case-based reasoning systems that distinguish them from approaches
such as k-NEAREST NEIGHBOUR:
o Instances or cases may be represented by rich symbolic descriptions, such as the
function graphs used in CADET. This may require a similarity metric different from
Euclidean distance, such as the size of the largest shared subgraph between two function graphs.
o Multiple retrieved cases may be combined to form the solution to the new problem.
This is similar to the k-NEAREST NEIGHBOUR approach, in that multiple similar
cases are used to construct a response for the new query.
o There may be a tight coupling between case retrieval, knowledge-based reasoning,
and problem solving.
REINFORCEMENT LEARNING
Chapter 13
INTRODUCTION
Consider building a learning robot. The robot, or agent, has a set of sensors (such as a camera
and sonars) to observe the state of its environment, and a set of actions (such as "move
forward" and "turn") it can perform to alter this state. Its task is to learn a control strategy, or
policy, for choosing actions that achieve its goals. For example, the robot may have a goal of
docking onto its battery charger whenever its battery level is low.
We assume that the goals of the agent can be defined by a reward function that assigns a numerical value, an immediate payoff, to each distinct action the agent may take from each
distinct state. The control policy we desire is one that, from any initial state, chooses actions
that maximize the reward accumulated over time by the agent. This general setting for robot
learning is summarized in Figure 13.1.
The reinforcement learning problem differs from other function approximation tasks in several
important respects.
1. Delayed reward. The task of the agent is to learn a target function π that maps from the
current state s to the optimal action a = π(s). Earlier we have always assumed that when
learning some target function such as π, each training example would be a pair of the form (s,
π(s)). In reinforcement learning, however, training information is not available in this form.
Instead, the trainer provides only a sequence of immediate reward values as the agent
executes its sequence of actions. The agent, therefore, faces the problem of temporal credit
assignment: determining which of the actions in its sequence are to be credited with
producing the eventual rewards.
2. Exploration. In reinforcement learning, the agent influences the distribution of training
examples by the action sequence it chooses. This raises the question of which
experimentation strategy produces the most effective learning. The learner faces a tradeoff in
choosing whether to favor exploration of unknown states and actions (to gather new
information), or exploitation of states and actions that it has already learned will yield high
reward (to maximize its cumulative reward).
3. Partially observable states. Although it is convenient to assume that the agent's sensors
can perceive the entire state of the environment at each time step, in many practical situations
sensors provide only partial information.
4. Life-long learning. Unlike isolated function approximation tasks, robot learning often
requires that the robot learn several related tasks within the same environment, using the
same sensors. For example, a mobile robot may need to learn how to dock on its battery
charger, how to navigate through narrow corridors, and how to pick up output from laser
printers. This setting raises the possibility of using previously obtained experience or
knowledge to reduce sample complexity when learning new tasks.
In this case, let us choose γ = 0.9. The diagram at the bottom of the figure shows one optimal
policy for this setting (there are others as well). The optimal policy directs the agent along the
shortest path toward the state G.
The diagram at the right of Figure 13.2 shows the values of V* for each state. For example,
consider the bottom right state in this diagram. The value of V* for this state is 100 because
the optimal policy in this state selects the "move up" action that receives immediate reward
100. Thereafter, the agent will remain in the absorbing state and receive no further rewards.
Similarly, the value of V* for the bottom centre state is 90. This is because the optimal policy
will move the agent from this state to the right (generating an immediate reward of zero),
then upward (generating an immediate reward of 100). Thus, the discounted future reward from the bottom centre state is

    0 + γ·100 + γ²·0 + γ³·0 + ⋯ = 0.9 × 100 = 90
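The arithmetic behind such values can be checked directly with a small discounted-return helper, using the γ = 0.9 chosen above:

```python
GAMMA = 0.9

def discounted_return(rewards, gamma=GAMMA):
    """V = r0 + gamma*r1 + gamma^2*r2 + ... for a given reward sequence."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

# bottom-centre state: one zero-reward step right, then reward 100 entering G,
# then zero reward forever in the absorbing state
v_bottom_centre = discounted_return([0, 100])   # 0 + 0.9 * 100 = 90
```

The same helper reproduces V* = 100 for the bottom-right state, whose very first action earns the reward of 100.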
Q LEARNING
How can an agent learn an optimal policy π* for an arbitrary environment? It is difficult to learn the function π*: S → A directly, because the available training data does not provide training examples of the form (s, a). Instead, the only training information available to the learner is the sequence of immediate rewards r(s_i, a_i) for i = 0, 1, 2, …. Given this kind of
training information it is easier to learn a numerical evaluation function defined over states
and actions, then implement the optimal policy in terms of this evaluation function.
The Q Function
Let us define the evaluation function Q(s, a) so that the value of Q is the reward received immediately upon executing action a from state s, plus the value (discounted by γ) of following the optimal policy thereafter:

    Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))        (13.4)
To illustrate, Figure 13.2 shows the Q values for every state and action in the simple grid
world. Notice that the Q value for each state-action transition equals the r value for this
transition plus the V* value for the resulting state discounted by γ. Note also that the optimal
policy shown in the figure corresponds to selecting actions with maximal Q values.
An Algorithm for Learning Q
Learning the Q function corresponds to learning the optimal policy. How can Q be learned?
The key problem is finding a reliable way to estimate training values for Q, given only a
sequence of immediate rewards r spread out over time. This can be accomplished through
iterative approximation. To see how, notice the close relationship between Q and V*:

    V*(s) = max_{a'} Q(s, a')        (13.5)

which allows rewriting Equation (13.4) as

    Q(s, a) = r(s, a) + γ max_{a'} Q(δ(s, a), a')        (13.6)

This recursive definition of Q provides the basis for algorithms that iteratively approximate Q. We use Q̂ to refer to the learner's estimate, or hypothesis, of the actual Q function. In this algorithm the learner represents its hypothesis Q̂ by a large table with a separate entry for each state-action pair. The table entry for the pair (s, a) stores the value of Q̂(s, a), the learner's current hypothesis about the actual but unknown value Q(s, a). The table can be initially filled with random values. The agent repeatedly observes its current state s, chooses some action a, executes this action, then observes the resulting reward r = r(s, a) and the new state s' = δ(s, a). It then updates the table entry for Q̂(s, a) following each such transition, according to the rule:

    Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')        (13.7)
The above Q learning algorithm for deterministic Markov decision processes is described
more precisely in Table 13.1. Using this algorithm the agent's estimate Q^ converges in the
limit to the actual Q function, provided the system can be modelled as a deterministic Markov
decision process, the reward function r is bounded, and actions are chosen so that every state-
action pair is visited infinitely often.
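Table 13.1 itself is not reproduced here, but the algorithm can be sketched on a hypothetical, deterministic 1-D corridor world (states 0 … n−1, goal at the right end, reward 100 only on entering the goal), using purely random action selection as one simple exploration strategy:

```python
import random

def q_learning(n_states, goal, gamma=0.9, episodes=500, seed=0):
    """Tabular Q learning with the deterministic update rule of Eq. (13.7).
    Actions: 0 = move left, 1 = move right (moves off the ends are no-ops).
    Entering `goal` yields reward 100; every other transition yields 0."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(n_states) for a in (0, 1)}
    for _ in range(episodes):
        s = rng.randrange(n_states - 1)          # start anywhere but the goal
        while s != goal:                         # goal is absorbing: episode ends
            a = rng.choice((0, 1))               # purely random exploration
            s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
            r = 100.0 if s2 == goal else 0.0
            # Q^(s, a) <- r + gamma * max_a' Q^(s', a')
            Q[(s, a)] = r + gamma * max(Q[(s2, 0)], Q[(s2, 1)])
            s = s2
    return Q

Q = q_learning(4, goal=3)
```

After enough episodes the table settles at the fixed point of Equation (13.6): each rightward step is worth γ times the value of the step after it, mirroring the 100, 90, 81, … pattern of the grid-world example.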
An Illustrative Example
To illustrate the operation of the Q learning algorithm, consider a single action taken by an agent, and the corresponding refinement to Q̂ shown in Figure 13.3. In this example, the agent moves one cell to the right in its grid world and receives an immediate reward of zero for this transition. It then applies the training rule of Equation (13.7) to refine its estimate Q̂ for the state-action transition it just executed. According to the training rule, the new Q̂ estimate for this transition is the sum of the received reward (zero) and the highest Q̂ value associated with the resulting state (100), discounted by γ (0.9): 0 + 0.9 × 100 = 90.
Each time the agent moves forward from an old state to a new one, Q learning propagates Q̂ estimates backward from the new state to the old. At the same time, the immediate reward received by the agent for the transition is used to augment these propagated values of Q̂.