Finite-time Analysis of the Multiarmed Bandit Problem

P Auer - 2002 - mercurio.srv.di.unimi.it
Reinforcement learning policies face the exploration versus exploitation dilemma, i.e., the
search for a balance between exploring the environment to find profitable actions and
taking the empirically best action as often as possible. A popular measure of a policy's
success in addressing this dilemma is the regret, that is, the loss due to the fact that the
globally optimal policy is not followed all the time. One of the simplest examples of the
exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were …
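This paper's finite-time analysis centers on the UCB1 index policy: play the arm maximizing the empirical mean plus an exploration bonus that shrinks as the arm is sampled. A minimal sketch, assuming each arm is a callable returning a reward in [0, 1] (a hypothetical interface chosen for illustration):

```python
import math
import random

def ucb1(arms, horizon, seed=0):
    """Run the UCB1 index policy for `horizon` trials.

    `arms` is a list of callables taking an RNG and returning a
    reward in [0, 1] (an interface assumed for this sketch).
    Returns the total reward and the per-arm pull counts.
    """
    rng = random.Random(seed)
    k = len(arms)
    counts = [0] * k     # pulls per arm
    means = [0.0] * k    # empirical mean reward per arm
    total = 0.0
    for t in range(horizon):
        if t < k:
            arm = t  # initialise: play each arm once
        else:
            # UCB1 index: empirical mean + sqrt(2 ln t / n_i)
            arm = max(range(k),
                      key=lambda i: means[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        r = arms[arm](rng)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean
        total += r
    return total, counts
```

Because the bonus term decays like `sqrt(log t / n_i)`, suboptimal arms are pulled only logarithmically often, which is what yields the paper's finite-time logarithmic regret bound.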

The nonstochastic multiarmed bandit problem

P Auer, N Cesa-Bianchi, Y Freund, RE Schapire - SIAM journal on computing, 2002 - SIAM
In the multiarmed bandit problem, a gambler must decide which arm of K nonidentical slot
machines to play in a sequence of trials so as to maximize his reward. This classical
problem has received much attention because of the simple model it provides of the trade-off
between exploration (trying out each arm to find the best one) and exploitation (playing the
arm believed to give the best payoff). Past solutions for the bandit problem have almost
always relied on assumptions about the statistics of the slot machines. In this work, we make …
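Dropping those statistical assumptions is the point of this paper's Exp3 algorithm: exponential weights over arms, mixed with uniform exploration, updated with an importance-weighted reward estimate. A minimal sketch, assuming `arms[i](t)` returns the (possibly adversarial) reward in [0, 1] of arm `i` at trial `t` (an interface assumed for illustration):

```python
import math
import random

def exp3(arms, horizon, gamma=0.1, seed=0):
    """Run Exp3 for `horizon` trials and return the total reward.

    `gamma` is the uniform-exploration mixing rate; `arms[i](t)` is
    an assumed interface yielding arm i's reward in [0, 1] at trial t.
    """
    rng = random.Random(seed)
    k = len(arms)
    weights = [1.0] * k
    total = 0.0
    for t in range(horizon):
        wsum = sum(weights)
        # mix exponential weights with uniform exploration
        probs = [(1 - gamma) * w / wsum + gamma / k for w in weights]
        arm = rng.choices(range(k), weights=probs)[0]
        r = arms[arm](t)
        total += r
        # importance-weighted estimate keeps the update unbiased
        # even though only the played arm's reward is observed
        xhat = r / probs[arm]
        weights[arm] *= math.exp(gamma * xhat / k)
        # rescale so the weights cannot overflow on long runs
        m = max(weights)
        weights = [w / m for w in weights]
    return total
```

The importance weighting (dividing by the probability of the played arm) is what lets the bandit feedback stand in for full-information feedback, with no assumption that rewards are stochastic.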

Regret analysis of stochastic and nonstochastic multi-armed bandit problems

S Bubeck, N Cesa-Bianchi - Foundations and Trends® in …, 2012 - nowpublishers.com
Multi-armed bandit problems are the most basic examples of sequential decision problems
with an exploration-exploitation trade-off. This is the balance between staying with the option
that gave the highest payoffs in the past and exploring new options that might give higher
payoffs in the future. Although the study of bandit problems dates back to the 1930s,
exploration–exploitation trade-offs arise in several modern applications, such as ad
placement, website optimization, and packet routing. Mathematically, a multi-armed bandit is …