
Q-learning: Difference between revisions

From Wikipedia, the free encyclopedia
'''Q-learning''' is a [[reinforcement learning]] technique that works by learning an action-value function that gives the expected utility of taking a given action in a given state and following a fixed policy thereafter. A strength of Q-learning is that it is able to compare the expected utility of the available actions without requiring a model of the environment. A recent variation called delayed Q-learning has shown substantial improvements, bringing PAC (probably approximately correct) bounds to Markov decision processes.


== Algorithm ==
The core of the algorithm is a simple value iteration update. For each state, ''s'', from the state set ''S'', and for each action, ''a'', from the action set ''A'', we can calculate an update to its expected discounted reward with the following expression:


:<math>Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \left[ r_{t+1} + \phi \max_{a} Q(s_{t+1}, a) - Q(s_t,a_t) \right]</math>


where ''r''<sub>''t''+1</sub> is the real reward observed after taking action ''a''<sub>''t''</sub> in state ''s''<sub>''t''</sub>, &alpha; is a convergence rate (learning rate) such that 0 < &alpha; < 1, and &phi; is a discount rate such that 0 < &phi; < 1.
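The update rule can be made concrete with a short tabular sketch in Python. This is an illustrative sketch rather than a reference implementation: the environment interface (<code>reset()</code>, <code>step(action)</code> returning the next state, reward and a termination flag, and a discrete action list <code>env.actions</code>), the &epsilon;-greedy exploration rule, and the default parameter values are assumptions made here; only the single update line corresponds directly to the expression above.

<syntaxhighlight lang="python">
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, phi=0.9, epsilon=0.1):
    """Tabular Q-learning sketch; Q[(state, action)] holds the current
    estimate of the expected discounted reward, initialised to 0."""
    Q = defaultdict(float)

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection (one common choice of
            # exploration policy; the update rule itself does not fix it).
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])

            s_next, r, done = env.step(a)  # hypothetical environment API

            # Q(s,a) <- Q(s,a) + alpha * [r + phi * max_a' Q(s',a') - Q(s,a)]
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in env.actions)
            Q[(s, a)] += alpha * (r + phi * best_next - Q[(s, a)])

            s = s_next

    return Q
</syntaxhighlight>

Note that both the update and the greedy action choice only read the table ''Q''; no transition model of the environment is consulted, which is the model-free property noted in the introduction.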

== See also ==
* [[Reinforcement learning]]
* [[Prisoner's dilemma#The iterated prisoner.27s dilemma|Iterated prisoner's dilemma]]
* [[Game theory]]


== External links ==
* [http://www.cs.rhul.ac.uk/~chrisw/thesis.html Watkins, C.J.C.H. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England.]
* [http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol2/zah/article2.html Q-Learning]
* [http://people.revoledu.com/kardi/tutorial/ReinforcementLearning/index.html Q-Learning by examples]
* [http://www.cs.ualberta.ca/%7Esutton/book/the-book.html Reinforcement Learning online book]
* [http://elsy.gdan.pl/index.php Connectionist Q-learning Java Framework]
* [http://www.lifl.fr/~decomite/piqle Piqle: a Generic Java Platform for Reinforcement Learning]


[[Category:Machine learning]]




