Short Note On Reinforcement Learning
P. N. Dutta
Reinforcement learning is different from what machine learning researchers
call unsupervised learning, which is typically about finding structure hidden in
collections of unlabelled data.
The terms supervised learning and unsupervised learning appear to
exhaustively classify machine learning paradigms, but they do not. Although
we may think of reinforcement learning as a kind of unsupervised learning
because it does not rely on examples of correct behavior, reinforcement
learning is trying to maximize a reward signal instead of trying to find a hidden
structure.
To obtain a lot of reward, a reinforcement learning agent must prefer actions
that it has tried in the past and found to be effective in producing reward.
But to discover such actions, it has to try actions that it has not selected
before. The agent has to exploit what it already knows in order to obtain
reward, but it also has to explore in order to make better action selections in
the future. The dilemma is that neither exploration nor exploitation can be
pursued exclusively without failing at the task. The agent must try a variety of
actions and progressively favour those that appear to be best. On a stochastic
task, each action must be tried many times to gain a reliable estimate of its
expected reward. The exploration-exploitation dilemma has been intensively
studied by mathematicians for many decades. For now, we simply note that
the entire issue of balancing exploration and exploitation does not even arise
in supervised and unsupervised learning, at least in their purest forms.
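To make the dilemma concrete, here is a minimal Python sketch of epsilon-greedy
action selection on a stochastic multi-armed bandit. The reward probabilities
and the value of epsilon are illustrative assumptions, not taken from this note:

# Minimal sketch of the exploration / exploitation trade-off on a stochastic
# multi-armed bandit using epsilon-greedy action selection. The reward
# probabilities and epsilon below are made-up assumptions for illustration.
import random

true_means = [0.2, 0.5, 0.8]          # hidden expected reward of each action
estimates = [0.0] * len(true_means)   # running estimate of each action's value
counts = [0] * len(true_means)        # how often each action has been tried
epsilon = 0.1                         # fraction of steps spent exploring

for step in range(10_000):
    if random.random() < epsilon:
        action = random.randrange(len(true_means))   # explore: try any action
    else:
        action = estimates.index(max(estimates))     # exploit: current best
    reward = 1.0 if random.random() < true_means[action] else 0.0
    counts[action] += 1
    # incremental sample-average update of the action-value estimate
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)   # with enough exploration, estimates approach the true means

With epsilon set to zero the agent exploits only and can lock onto a poor
action; with epsilon set to one it explores only and never profits from what
it has learned, which is the dilemma described above.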
Another key feature of reinforcement learning is that it explicitly considers the
whole problem of a goal-directed agent interacting with an uncertain
environment. This is in contrast with many approaches that consider
subproblems without addressing how they might fit into a larger picture.
For example, we have mentioned that much of machine learning research is
concerned with supervised learning without explicitly specifying how such an
ability would finally be useful. Other researchers have developed theories of
planning with general goals, but without considering planning's role in real-
time decision making, or the question of where the predictive models
necessary for planning would come from.
Examples:
2. A calf struggles to its feet minutes after being born. Half an hour later it is
running at 20 miles per hour.
3. Phil prepares his breakfast. Closely examined, even this apparently simple
activity reveals a complex web of conditional behaviour and interlocking
goal-subgoal relationships: walking to the cupboard, opening it, selecting a cereal
box, then reaching for, grasping, and retrieving the box. Other complex, tuned,
interactive sequences of behaviour are required to obtain a bowl, spoon, and
milk jug. Each step involves a series of eye movements to obtain information
and to guide reaching and locomotion.
Rapid judgments are continually made about how to carry the objects or
whether it is better to ferry some of them to the dining table before obtaining
others. Each step is guided by goals, such as grasping a spoon or getting to the
refrigerator, and is in service of other goals, such as having the spoon to eat
with once the cereal is prepared and ultimately obtaining nourishment.
Whether he is aware of it or not, Phil is accessing information about the state
of his body that determines his nutritional needs, level of hunger, and food
preferences.
All of these examples involve interaction between an active decision-making
agent and its environment, within which the agent seeks to achieve a goal
despite uncertainty about its environment. The agent's actions are permitted to affect the
future state of the environment (e.g., the next chess position, the level of
reservoirs of the refinery, the robot's next location and the future charge level
of its battery), thereby affecting the options and opportunities available to the
agent at later times. Correct choice requires taking into account indirect,
delayed consequences of actions, and thus may require foresight or planning.
Beyond the agent and the environment, one can identify four main subelements
of a reinforcement learning system (a brief code sketch of their roles follows
the list):
a) a policy
b) a reward signal
c) a value function, and optionally
d) a model of the environment.
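The skeleton below is one illustrative way these four subelements could appear
in code; all class, method, and parameter names are assumptions made for this
sketch rather than any standard interface:

# Illustrative skeleton of the four subelements of a reinforcement learning
# system. Every name here is an assumption chosen for this sketch.
from collections import defaultdict
import random


class Agent:
    def __init__(self, actions, epsilon=0.1, step_size=0.5):
        self.actions = actions
        self.epsilon = epsilon
        self.step_size = step_size
        # (c) value function: estimated long-run desirability of each state
        self.values = defaultdict(float)
        # (d) optional model of the environment: (state, action) -> predicted next state
        self.model = {}

    def policy(self, state):
        # (a) the policy maps perceived states to actions; here it is
        # epsilon-greedy over the values of the states the model predicts
        if random.random() < self.epsilon or not self.model:
            return random.choice(self.actions)
        return max(self.actions,
                   key=lambda a: self.values[self.model.get((state, a), state)])

    def learn(self, state, action, reward, next_state):
        # (b) the reward signal is the primary basis for learning: move the
        # value of the state toward reward plus the value of what followed
        target = reward + self.values[next_state]
        self.values[state] += self.step_size * (target - self.values[state])
        # update the (optional) model from observed transitions
        self.model[(state, action)] = next_state

The policy chooses actions, the reward signal arrives through learn(), the
value function is the values table, and the model is optional: a purely
reactive agent could drop it entirely.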
Policy:
A policy defines the learning agent's way of behaving at a given time.
Roughly speaking, a policy is a mapping from perceived states of the
environment to actions to be taken when in those states.
It corresponds to what in psychology would be called a set of
stimulus-response rules or associations.
In some cases, the policy may be a simple function or lookup table, whereas in
others it may involve extensive computation such as a search process. The
policy is the core of a reinforcement learning agent in the sense that it alone is
sufficient to determine behavior. In general, policies may be stochastic in
nature.
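As a concrete illustration of the lookup-table and stochastic cases, the sketch
below stores, for each perceived state, a probability distribution over actions
and samples from it; the states, actions, and probabilities are invented for
this example:

# A lookup-table policy: each perceived state maps to a probability
# distribution over actions. States, actions, and probabilities below are
# made up purely for illustration.
import random

policy_table = {
    "hungry": {"cook": 0.8, "wait": 0.2},
    "full":   {"cook": 0.1, "wait": 0.9},
}

def select_action(state):
    """Sample an action from the stochastic policy for this state."""
    actions, probs = zip(*policy_table[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(select_action("hungry"))   # usually "cook", occasionally "wait"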
Reward Signal:
A reward signal defines the goal in a reinforcement learning problem. On each
time step, the environment sends to the reinforcement learning agent a single
number, a reward. The agent's sole objective is to maximize the total reward it
receives over the long run. The reward signal thus defines what the good and
bad events are for the agent.
In a biological system, we might think of rewards as analogous to the
experiences of pleasure or pain. They are the immediate and defining features
of the problem faced by the agent. The reward sent to the agent at any time
depends on the agent's current action and the current state of the agent's
environment. The agent cannot alter the process that does this. The only way
the agent can influence the reward signal is through its actions, which can have
a direct effect on reward, or an indirect effect through changing the
environment's state. In our example above of Phil eating breakfast, the
reinforcement learning agent directing his behaviour might receive different
reward signals when he eats his breakfast depending on how hungry he is,
what mood he is in, and other features of his body, which is part of his
internal reinforcement learning agent's environment. The reward
signal is the primary basis for altering the policy. If an action selected by the
policy is followed by low reward, then the policy may be changed to select
some other action in that situation in the future. In general, reward signals
may be stochastic functions of the state of the environment and the actions
taken.
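The loop below illustrates the mechanics described above: on each time step the
environment sends the agent a single number, the reward, which the agent can
influence only through its actions. The toy two-state dynamics and the
placeholder random policy are assumptions made for this sketch:

# Illustrative agent-environment loop: each time step the environment sends
# a single number, the reward, and the agent's objective is the total reward
# accumulated over the run. The toy dynamics below are an assumption.
import random

def toy_environment(state, action):
    """Return (next_state, reward) for a made-up two-state problem."""
    if state == "A" and action == "advance":
        return "B", 0.0          # no immediate payoff
    if state == "B" and action == "advance":
        return "A", 1.0          # the delayed payoff arrives here
    return state, -0.1           # a wasted step

state = "A"
total_reward = 0.0
for t in range(100):
    action = random.choice(["advance", "stay"])   # placeholder policy
    state, reward = toy_environment(state, action)
    total_reward += reward       # the quantity the agent tries to maximize

print(total_reward)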
Value Function:
Whereas the reward signal indicates what is good in an immediate sense, a
value function specifies what is good in the long run.
Roughly speaking, the value of a state is the total amount of reward an agent
can expect to accumulate over the future, starting from that state. Whereas
rewards determine the
immediate, intrinsic desirability of environmental states, values indicate the
long-term desirability of states after taking into account the states that are
likely to follow, and the rewards available in those states.
For example, a state might always yield a low immediate reward but still have
a high value because it is regularly followed by other states that yield high
rewards. Or the reverse could be true.
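A tiny worked version of that point, with made-up numbers: state s1 yields no
immediate reward but always leads to s2, which yields a large reward, so s1
still has high value. The discount factor is an added assumption; the note
itself does not discuss discounting:

# Made-up example: s1 yields no immediate reward but always leads to s2,
# which yields a large reward, so s1's value is high despite its low reward.
rewards = {"s1": 0.0, "s2": 10.0, "terminal": 0.0}
next_state = {"s1": "s2", "s2": "terminal"}
gamma = 0.9   # discount factor: an assumption, not discussed in the note

def value(state):
    """Total discounted reward accumulated starting from this state."""
    if state == "terminal":
        return 0.0
    return rewards[state] + gamma * value(next_state[state])

print(value("s1"))   # 9.0  -- high value despite a zero immediate reward
print(value("s2"))   # 10.0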
To make a human analogy, rewards are somewhat like pleasure (if high) and
pain (if low), whereas values correspond to a more refined and farsighted
judgment of how pleased or displeased we are that our environment is in a
particular state.
Reinforcement learning methods specify how the agent changes its policy as a
result of its experience. The agent's goal, roughly speaking, is to maximize the
total amount of reward it receives over the long run.
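As one concrete illustration of a method that changes the agent's policy as a
result of experience, here is a brief Q-learning-style sketch. The toy chain
environment, learning rate, discount factor, and exploration rate are all
assumptions; the note itself does not commit to any particular algorithm:

# A brief Q-learning-style sketch: the greedy policy derived from Q changes
# as experience accumulates. The toy chain environment and the parameters
# are assumptions made for illustration only.
import random
from collections import defaultdict

actions = ["left", "right"]
Q = defaultdict(float)                 # action-value estimates, initially zero
alpha, gamma, epsilon = 0.1, 0.9, 0.2

def step(state, action):
    """Toy dynamics on states 0..2: reaching state 2 by moving right pays off."""
    nxt = min(state + 1, 2) if action == "right" else max(state - 1, 0)
    return nxt, (1.0 if nxt == 2 else 0.0)

state = 0
for t in range(5000):
    if random.random() < epsilon:
        action = random.choice(actions)                      # explore
    else:
        action = max(actions, key=lambda a: Q[(state, a)])   # exploit
    nxt, reward = step(state, action)
    best_next = max(Q[(nxt, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    state = 0 if nxt == 2 else nxt       # restart at 0 after reaching the goal

# After training, the greedy policy prefers "right" in states 0 and 1.
print({s: max(actions, key=lambda a: Q[(s, a)]) for s in range(2)})

As experience accumulates, the greedy policy derived from Q shifts from
arbitrary tie-breaking toward always moving right, which is the sense in which
such a method changes the policy over time.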