Google Scholar

[PDF] Finite-time Analysis of the Multiarmed Bandit Problem

P Auer - 2002 - mercurio.srv.di.unimi.it

Reinforcement learning policies face the exploration versus exploitation dilemma, i.e., the
search for a balance between exploring the environment to find profitable actions while
taking the empirically best action as often as possible. A popular measure of a policy's
success in addressing this dilemma is the regret, that is the loss due to the fact that the
globally optimal policy is not followed all the times. One of the simplest examples of the
exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were …

Cited by 8505

[PDF] psu.edu

The nonstochastic multiarmed bandit problem

P Auer, N Cesa-Bianchi, Y Freund, RE Schapire - SIAM journal on computing, 2002 - SIAM

In the multiarmed bandit problem, a gambler must decide which arm of K nonidentical slot
machines to play in a sequence of trials so as to maximize his reward. This classical
problem has received much attention because of the simple model it provides of the trade-off
between exploration (trying out each arm to find the best one) and exploitation (playing the
arm believed to give the best payoff). Past solutions for the bandit problem have almost
always relied on assumptions about the statistics of the slot machines. In this work, we make …

Cited by 3159

[PDF] nowpublishers.com

Regret analysis of stochastic and nonstochastic multi-armed bandit problems

S Bubeck, N Cesa-Bianchi - Foundations and Trends® in …, 2012 - nowpublishers.com

Multi-armed bandit problems are the most basic examples of sequential decision problems
with an exploration-exploitation trade-off. This is the balance between staying with the option
that gave highest payoffs in the past and exploring new options that might give higher
payoffs in the future. Although the study of bandit problems dates back to the 1930s,
exploration–exploitation trade-offs arise in several modern applications, such as ad
placement, website optimization, and packet routing. Mathematically, a multi-armed bandit is …

Cited by 3218
