Reward Function:
A reward function defines the goal in a reinforcement learning problem. It maps each perceived state (or state-action pair) of the environment to a single number, a reward, indicating the intrinsic desirability of that state. A reinforcement learning agent's sole objective is to maximize the total reward it receives in the long run. The reward function defines what the good and bad events are for the agent.

Action-value function:
Q-learning is based upon quality values (Q-values) Q(s, a) for each state-action pair (s, a). The agent would have to cease interacting with the world while it ran through a full planning loop until a satisfactory policy was found; fortunately, we can still learn from experience. In Q-learning we cannot update directly from the transition probabilities; we can only update from individual experiences. In one-step Q-learning, after each experience we observe the next state s', receive the reward r, and update:

Q(s, a) = r + γ max_a' Q(s', a')                                    (2)

B. Q-learning Algorithm:
    Initialize Q(s, a) arbitrarily
    Repeat (for each episode)
        Choose a starting state, s
        Repeat (for each step of the episode):
            Choose a from s using the policy derived from Q
            Take action a, observe the immediate reward r and the next state s'
            Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]
            s ← s'
        Until state s matches the goal state
    Until the desired number of episodes have terminated
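For concreteness, the sketch below shows one way the tabular Q-learning loop above could be written in Python. It is only an illustration, not the authors' implementation: the environment interface (env.actions, env.reset(), env.step(s, a), env.is_goal(s)) is hypothetical, and an epsilon-greedy rule is assumed for the "policy derived from Q".

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Q(s, a) for every state-action pair, initialized arbitrarily (here: 0).
    Q = defaultdict(float)

    def choose_action(s):
        # Epsilon-greedy policy derived from Q.
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):                    # repeat for each episode
        s = env.reset()                          # choose a starting state s
        while not env.is_goal(s):                # repeat for each step of the episode
            a = choose_action(s)
            s_next, r = env.step(s, a)           # take action a, observe r and s'
            best_next = max(Q[(s_next, b)] for b in env.actions)
            # Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next                           # s <- s'
    return Q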
Figure 3: Q-Learning Architecture (world/environment, input and reinforcement signals, Q-factor table)

IV. RELATIVE Q-LEARNING
This section introduces a new approach, the relative reward, into conventional Q-learning, which yields Relative Q-Learning. Conventional Q-learning has been shown to converge to the optimal policy if the environment is sampled infinitely often by performing a set of actions in the states of the environment, under a set of constraints on the learning rate α. No bounds have been proven on the time of convergence of the Q-learning algorithm, and the selection of the next action is done randomly when performing the update. This simply means that the algorithm can take a long time to converge, because a random set of states is observed which may or may not bring the agent closer to the goal state. Furthermore, it means that the function cannot be used for actually performing actions until it has converged, since it has a high chance of not holding the right values while the relevant states remain unexplored. This is especially a problem for environments with large state spaces, where it is difficult to explore the entire space in a random fashion in a computationally feasible manner. By applying the method and algorithm described below, we try to keep the Q-learning algorithm near its goal in less time and in a smaller number of episodes.

A. Relative Reward
Relative reward is a concept that compares two immediate rewards: the current reward and the previously received reward. The objective of the learner is to choose actions maximizing the discounted cumulative reward over time. Let there be an agent in state s_t at time t, and assume that it chooses action a_t. The immediate result is a reward r_t received by the agent, and the state changes to s_{t+1}. The total discounted reward [2, 4] received by the agent starting at time t is given by:

r(t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... + γ^n r_{t+n} + ...        (3)

where γ is the discount factor in the range (0, 1).
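As a quick numerical illustration of Eq. (3), the snippet below evaluates a truncated discounted return; the reward sequence and the choice γ = 0.9 are made-up values, not taken from the paper.

def discounted_return(rewards, gamma=0.9):
    # r(t) = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1, 0, 0, 50]))  # 1 + 0.9**3 * 50 = 37.45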
The immediate reward is based upon the action or move taken by the agent to reach the defined goal in each episode. The total discounted reward can be maximized in a smaller number of episodes if we select the higher of the current and previously received immediate reward signals.
B. Relative Reward based Q-Learning Algorithm
Relative reward based Q-learning is an approach towards maximizing the total discounted reward. In this form of Q-learning we select the maximum immediate reward signal by comparing the current reward with the previous one. This is expressed by the new Q-update equation:

Q(s, a) = Q(s, a) + α [max(r(s, a), r_prev(s, a)) + γ max_a' Q(s', a') − Q(s, a)]

where r_prev(s, a) is the previously received immediate reward.

Algorithm:
    Initialize Q(s, a) arbitrarily
    Repeat (for each episode)
        Choose a starting state, s
        Repeat (for each step of the episode):
            Choose a from s using the policy derived from Q
            Take action a, observe the immediate reward r and the next state s'
            Q(s, a) ← Q(s, a) + α [max(r, r_prev) + γ max_a' Q(s', a') − Q(s, a)]
            r_prev ← r
            s ← s'
        Until state s matches the goal state
    Until the desired number of episodes have terminated
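The sketch below, which reuses the hypothetical environment interface from the earlier Q-learning example, shows one plausible Python rendering of this relative-reward update. Keeping the previously received reward per (state, action) pair is our assumption; the paper states only that the current and previous immediate rewards are compared.

import random
from collections import defaultdict

def relative_q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)        # Q(s, a), initialized arbitrarily (here: 0)
    r_prev = defaultdict(float)   # previously received reward for each (s, a); assumed bookkeeping

    def choose_action(s):
        # Epsilon-greedy policy derived from Q.
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        while not env.is_goal(s):
            a = choose_action(s)
            s_next, r = env.step(s, a)
            best_next = max(Q[(s_next, b)] for b in env.actions)
            # Q(s, a) <- Q(s, a) + alpha * [max(r, r_prev) + gamma * max_a' Q(s', a') - Q(s, a)]
            Q[(s, a)] += alpha * (max(r, r_prev[(s, a)]) + gamma * best_next - Q[(s, a)])
            r_prev[(s, a)] = r    # remember this reward for the next comparison
            s = s_next
    return Q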
In order to account for the situation of encountering a wall, the agent has no possibility of moving any further in the given direction. When the agent enters the goal state, it receives a reward of 50. We also provide the immediate reward value by incrementing or decrementing the Q-value. S represents the start state and G represents the goal state. The purpose of the agent is to find the optimum path from the start state to the goal state, and to maximize the reward it receives.

Figure: Q-values versus episode under the conventional Q-learning (random) strategy.
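For completeness, a hypothetical grid world matching the setup described above (moves into a wall leave the agent in place, entering the goal state G yields a reward of 50, S is the start state) might look as follows; the grid size, wall positions, and the zero per-step reward are illustrative assumptions only.

class GridWorld:
    # Actions available in every state.
    actions = ["up", "down", "left", "right"]
    _moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def __init__(self, rows=4, cols=4, walls=((1, 1), (2, 1)),
                 start=(0, 0), goal=(3, 3)):
        self.rows, self.cols = rows, cols
        self.walls = set(walls)
        self.start, self.goal = start, goal

    def reset(self):
        return self.start                        # S: the start state

    def is_goal(self, s):
        return s == self.goal                    # G: the goal state

    def step(self, s, a):
        dr, dc = self._moves[a]
        nxt = (s[0] + dr, s[1] + dc)
        # A move into a wall or off the grid leaves the agent where it is.
        if (nxt in self.walls or not 0 <= nxt[0] < self.rows
                or not 0 <= nxt[1] < self.cols):
            nxt = s
        reward = 50 if nxt == self.goal else 0   # reward of 50 on entering G
        return nxt, reward

With this class, the earlier sketches can be exercised as, for example, Q = relative_q_learning(GridWorld()).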
REFERENCES
[1] J.F. Peters, C. Henry, S. Ramanna, "Reinforcement learning with pattern-based rewards," in Proceedings of the Fourth International IASTED Conference on Computational Intelligence (CI 2005), Calgary, Alberta, Canada, 4-6 July 2005, pp. 267-272.
[2] C.J.C.H. Watkins and P. Dayan, "Technical Note: Q-Learning," Machine Learning, vol. 8, pp. 279-292, 1992.
[3] J.F. Peters, C. Henry, S. Ramanna, "Rough Ethograms: Study of Intelligent System Behavior," in M.A. Klopotek, S. Wierzchoń, K. Trojanowski (Eds.), New Trends in Intelligent Information Processing and Web Mining (IIS05), Gdansk, Poland, June 13-16, 2005, pp. 117-126.
[4] C. Watkins, "Learning from Delayed Rewards," PhD thesis, Cambridge University, Cambridge, England, 1989.
[5] J.F. Peters, K.S. Patnaik, P.K. Pandey, D. Tiwari, "Effect of temperature on swarms that learn," in Proceedings of IASCIT-2007, Hyderabad, India, 2007.
[6] P.K. Pandey, D. Tiwari, "Temperature variation on Q-Learning," in Proceedings of RAIT, ISM Dhanbad, February 2008.
[7] P.K. Pandey, D. Tiwari, "Temperature variation on Rough Actor-Critic Algorithm," Global Journal of Computer Science and Technology, vol. 9, no. 4, 2009, Pennsylvania Digital Library.
[8] L.P. Kaelbling, M.L. Littman, A.W. Moore, "Reinforcement learning: A survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237-285, 1996.
[9] R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction, Cambridge, MA: The MIT Press, 1998.
[10] C. Gaskett, "Q-Learning for Robot Control," Ph.D. thesis (supervisor: A. Zelinsky), Department of Systems Engineering, The Australian National University, 2002.
[11] S. Thrun and A. Schwartz, "Issues in using function approximation for reinforcement learning," in Proceedings of the 1993 Connectionist Models Summer School, Erlbaum Associates, NJ, 1993.
[12] R.S. Sutton, "Reinforcement Learning Architectures," GTE Laboratories Incorporated, Waltham, MA 02254.
[13] T. O'Neill, L. Aldridge, H. Glaser, "Q-Learning and Collection Agents," Dept. of Computer Science, University of Rochester.
[14] F. Vanden Berghen, "Q-Learning," IRIDIA, Université Libre de Bruxelles.