The Multi-Armed Bandit (MAB) problem is a special case of Reinforcement Learning (RL): an agent collects rewards in an environment by taking actions after observing some state of the environment. The main difference between general RL and MAB is that in MAB, we assume that the action taken by the agent does not influence the next state of the environment. Therefore, agents do not model state transitions, credit rewards to past actions, or "plan ahead" to reach reward-rich states. For the same reason, the notion of an episode is not used in MAB, unlike in general RL.
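To make the setting concrete, here is a minimal sketch of a (non-contextual) bandit loop with an epsilon-greedy agent. The arm probabilities, epsilon, and step count are made-up illustrative values, and this is plain Python rather than TF-Agents code:

```python
import random

# Minimal multi-armed bandit loop: three Bernoulli arms with hidden success
# probabilities and an epsilon-greedy agent (all numbers are illustrative).
random.seed(0)
true_means = [0.2, 0.5, 0.8]  # hidden expected reward of each arm
counts = [0, 0, 0]            # number of pulls per arm
values = [0.0, 0.0, 0.0]      # running mean reward per arm
epsilon = 0.1                 # exploration probability

for _ in range(2000):
    if random.random() < epsilon:
        arm = random.randrange(3)                     # explore a random arm
    else:
        arm = max(range(3), key=lambda a: values[a])  # exploit the best so far
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean update

best_arm = max(range(3), key=lambda a: values[a])
print(best_arm, counts)
```

Note that there is no episode boundary and no state transition: each round is independent, which is exactly the MAB simplification described above.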
In many bandit use cases, the agent does observe some state of the environment. These are known as contextual bandit problems, and can be thought of as a generalization of multi-armed bandits where the agent has access to additional context in each round.
To get started with Bandits in TF-Agents, we recommend checking our bandits tutorial.
Currently the following algorithms are available:
- LinUCB: [A Contextual-Bandit Approach to Personalized News Article Recommendation, Li et al., 2010](https://arxiv.org/abs/1003.0146)
- Linear Thompson Sampling: [Thompson Sampling for Contextual Bandits with Linear Payoffs, Agrawal et al., 2013](https://arxiv.org/abs/1209.3352)
- Neural Epsilon Greedy: Bandit Algorithms, Lattimore et al., 2019
- Neural LinUCB: [Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling, Riquelme et al., 2018](https://arxiv.org/abs/1802.09127)
- Thompson Sampling with Dropout: [Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling, Riquelme et al., 2018](https://arxiv.org/abs/1802.09127)
- Multi-objective neural agent: [Designing multi-objective multi-armed bandits algorithms: a study, Drugan et al., 2013](https://ieeexplore.ieee.org/document/6707036)
- EXP3: Bandit Algorithms, Lattimore et al., 2019
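As a rough illustration of the idea behind LinUCB, here is a hedged sketch using a one-dimensional context, so the usual d x d design matrix reduces to a scalar and its inverse to a division. All constants and the toy environment are made up, and this is not the TF-Agents implementation:

```python
import math
import random

# LinUCB sketch with a one-dimensional context: the per-arm Gram matrix A and
# vector b reduce to scalars. Constants and the environment are made up.
random.seed(1)
NUM_ARMS = 2
ALPHA = 1.0            # exploration strength

A = [1.0] * NUM_ARMS   # regularized Gram "matrix" of each arm (a scalar here)
b = [0.0] * NUM_ARMS   # reward-weighted context sum of each arm

def choose_arm(x):
    # Score = linear reward estimate theta * x plus an optimism bonus that
    # shrinks as the arm accumulates observations.
    scores = []
    for a in range(NUM_ARMS):
        theta = b[a] / A[a]
        bonus = ALPHA * math.sqrt(x * x / A[a])
        scores.append(theta * x + bonus)
    return max(range(NUM_ARMS), key=lambda a: scores[a])

def update(arm, x, reward):
    A[arm] += x * x
    b[arm] += reward * x

# Toy environment with linear rewards: arm 0 pays ~0.3 * x, arm 1 pays ~0.7 * x.
for _ in range(1000):
    x = random.random()
    arm = choose_arm(x)
    reward = (0.3 if arm == 0 else 0.7) * x + random.gauss(0.0, 0.1)
    update(arm, x, reward)

theta_hat = [b[a] / A[a] for a in range(NUM_ARMS)]
print(theta_hat)  # the estimate for the better arm approaches 0.7
```

The optimism bonus is what distinguishes LinUCB from plain linear regression: under-explored arms get inflated scores, so the agent keeps sampling them until their estimates are trustworthy.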
In bandits, the environment is responsible for (i) outputting information about the current state (aka observation or context), and (ii) outputting a reward when receiving an action as input.
In order to test the performance of existing and new bandit algorithms, the library provides several environments spanning various setups, such as linear or non-linear reward functions and stationary or non-stationary environment dynamics. More specifically, the following environments are available:
- Stationary: This environment assumes stationary functions for generating observations and rewards.
- Non-stationary: This environment has non-stationary dynamics.
- Piecewise stationary: This environment is non-stationary, consisting of stationary pieces.
- Drifting: This environment is also non-stationary, with dynamics that drift slowly over time.
- Wheel: This is a non-linear environment with a scalar parameter that directly controls the difficulty of the problem.
- Classification suite: Given any classification dataset wrapped as a `tf.data.Dataset`, this environment converts it into a bandit problem.
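The reduction behind such a classification suite can be sketched as follows: each example's features become the context, the arms are the class labels, and the reward is 1 exactly when the chosen label matches the true one. The toy dataset and policies below are made up, and plain Python stands in for the actual `tf.data.Dataset` wrapper:

```python
import random

# Classification-as-bandit reduction: context = example features, arms = class
# labels, reward = 1 iff the chosen label is the true one. Toy data, plain Python.
random.seed(0)
dataset = [([0.1, 0.9], 1), ([0.8, 0.2], 0), ([0.2, 0.7], 1), ([0.9, 0.3], 0)]

def bandit_rounds(policy, num_rounds):
    total_reward = 0.0
    for t in range(num_rounds):
        features, label = dataset[t % len(dataset)]
        arm = policy(features)          # the agent picks a class label...
        total_reward += 1.0 if arm == label else 0.0  # ...and only sees this reward
    return total_reward

oracle = lambda x: 0 if x[0] > 0.5 else 1   # a policy that happens to be right
uniform = lambda x: random.randrange(2)     # a policy that guesses uniformly
print(bandit_rounds(oracle, 100), bandit_rounds(uniform, 100))
```

Unlike supervised learning, the agent never sees the true label, only the reward of the label it chose, which is what makes this a bandit problem.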
The library also provides TF metrics for regret computation. Regret is an important notion in the bandits literature; it can be informally defined as the difference between the total expected reward of the optimal policy and the total expected reward actually collected by the agent. Most of the environments listed above come with utilities for computing metrics such as the regret and the percentage of suboptimal arm plays.
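A minimal sketch of the regret computation described above (not the TF-Agents metric itself; the per-arm means and play sequence are illustrative):

```python
# Regret sketch: the gap between always playing the optimal arm and the arms
# actually played, summed over rounds (illustrative numbers).
def regret(expected_rewards, chosen_arms):
    """expected_rewards: per-arm mean reward; chosen_arms: arms played per round."""
    optimal = max(expected_rewards)
    return sum(optimal - expected_rewards[a] for a in chosen_arms)

means = [0.2, 0.5, 0.8]
plays = [0, 1, 2, 2, 2]        # the agent's choices over five rounds
print(regret(means, plays))    # 0.6 + 0.3 + 0 + 0 + 0, i.e. 0.9 up to rounding
```

Note that regret is defined in terms of expected rewards, not realized ones, so in practice it can only be computed by environments that expose their true per-arm means.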
The library provides ready-to-use end-to-end examples for training and evaluating various bandit agents in the `tf_agents/bandits/agents/examples/v2/` directory. A few examples:
- Stationary linear: tests different bandit agents against stationary linear environments.
- Wheel: tests different bandit agents against the wheel bandit environment.
- Drifting linear: tests different bandit agents against drifting (i.e., non-stationary) linear environments.
In some bandit use cases, each arm has its own features. For example, in movie recommendation problems, the user features play the role of the context and the movies play the role of the arms (aka actions). Each movie has its own features, such as its text description, metadata, trailer content features, and so on. We refer to such problems as arm features problems.
An example of bandit training with arm features can be found here.
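One common way to handle arm features, sketched below under the assumption of a single shared linear model: the user context and the movie's features are concatenated and scored together, so new arms can be scored without any per-arm parameters. All names, feature values, and weights are made up, and the weights stand in for an already-trained model:

```python
# Arm-features sketch: one shared linear model scores (user context, movie
# features) pairs, so arms need no per-arm parameters. All values are made up.
user = [1.0, 0.0]                       # context: user preference features
movies = {                              # arms, each with its own feature vector
    "action_movie": [0.9, 0.1],
    "romance_movie": [0.1, 0.9],
}
weights = [0.2, 0.1, 0.8, 0.3]          # shared weights over [user + movie] features

def score(user_features, movie_features):
    pair = user_features + movie_features    # concatenate context and arm features
    return sum(w * f for w, f in zip(weights, pair))

best = max(movies, key=lambda m: score(user, movies[m]))
print(best)
```

Because the parameters are shared across arms, the set of arms can change from round to round (e.g., the movie catalog grows) without retraining per-arm models.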
In some bandit use cases, the "goodness" of the decisions that the agent makes can be measured via multiple metrics. For example, when we recommend a certain movie to a user, we can measure several things about this decision, such as whether the user clicked on it, watched it, liked it, shared it, and so on. For such use cases, the library provides the following solutions.
- Multi-objective optimization: When there are several reward signals, a common technique is scalarization. The main idea is to combine all the input reward signals into a single one, which can then be optimized by vanilla bandit algorithms. The library offers the following options for scalarization:
  - Linear: [Designing multi-objective multi-armed bandits algorithms: a study, Drugan et al., 2013](https://ieeexplore.ieee.org/document/6707036)
  - Chebyshev: [Designing multi-objective multi-armed bandits algorithms: a study, Drugan et al., 2013](https://ieeexplore.ieee.org/document/6707036)
  - Hypervolume: [Random Hypervolume Scalarizations for Provable Multi-Objective Black Box Optimization, Golovin et al., 2020](https://arxiv.org/abs/2006.04655)
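The linear and Chebyshev scalarizations can be sketched as follows (illustrative signatures and numbers; see the cited papers for the exact formulations, including how the weights and the reference point are chosen):

```python
# Scalarization sketch: collapse a vector of per-objective rewards into a single
# scalar reward (signatures and numbers are illustrative).
def linear_scalarization(rewards, weights):
    # Weighted sum of the objectives.
    return sum(w * r for w, r in zip(weights, rewards))

def chebyshev_scalarization(rewards, weights, reference):
    # Worst-case weighted gap to a reference point; maximizing it pushes
    # every objective up, not just their average.
    return min(w * (r - z) for w, r, z in zip(weights, rewards, reference))

rewards = [0.6, 0.9]   # e.g. a click reward and a watch-time reward
print(linear_scalarization(rewards, [0.5, 0.5]))
print(chebyshev_scalarization(rewards, [0.5, 0.5], [0.0, 0.0]))
```

The two behave differently: a linear scalarization can favor arms that excel on one objective and ignore another, while the Chebyshev form rewards arms that do reasonably well on all objectives at once.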