Sequential decision-making problems arise whenever an agent repeatedly interacts with an unknown environment in an effort to maximize some notion of cumulative reward gained from these interactions. Examples are abundant in online advertising, online gaming, robotics, deep learning, dynamic pricing, network routing, etc. In particular, multi-armed bandits (MAB) model the interaction between the agent and the unknown environment as follows. The agent repeatedly acts by pulling arms; after an arm is pulled, she receives a stochastic reward. The goal is to select actions that maximize the expected cumulative reward without knowledge of the arms’ distributions. Albeit simple, this model is widely applicable. On the other hand, many sequential decision-making problems involve more complicated environments, modeled through Markov Decision Processes (MDPs), in which the environment’s state constantly changes as a result of the actions taken, making learning even more challenging. The field of reinforcement learning (RL) provides a principled foundation for learning in such environments, building on classical dynamic programming algorithms for solving MDPs.
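As a concrete illustration (the notation here is generic and chosen only for exposition), a stochastic $K$-armed bandit proceeds in rounds $t = 1, \dots, T$: the agent pulls an arm $a_t \in \{1, \dots, K\}$ and observes a reward $r_t$ drawn from an unknown distribution with mean $\mu_{a_t}$. Maximizing the expected cumulative reward is then equivalent to minimizing the expected regret
\[
R(T) \;=\; T \max_{a \in \{1,\dots,K\}} \mu_a \;-\; \mathbb{E}\left[\sum_{t=1}^{T} r_t\right],
\]
which quantifies the loss relative to always pulling the best arm in hindsight.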
Our research goal is to expand the applicability of bandit and RL algorithms to new application domains: specifically, safety-critical, lifelong and distributed physical systems, such as robotics, wireless networks, the power grid and medical trials.
One distinguishing feature of many such “new” potential applications of bandits and RL is their safety-critical nature. Specifically, the algorithm’s chosen policies must satisfy certain system constraints that, if violated, can lead to catastrophic consequences for the system. Importantly, the specifics of these constraints often depend on the interactions with the unknown environment; thus, the constraints are often unknown themselves. This leads to the new challenge of balancing the goal of reward maximization against the restriction of playing only policies that are safe. We modeled this problem through bandit and RL frameworks with linear reward and constraint structures. It turns out that even these seemingly simple safe linear bandit and RL formulations are more intricate than the original settings without safety constraints. In particular, simple variations of existing algorithms can be shown to be highly suboptimal. Using appropriate tools from high-dimensional probability and a careful treatment of the exploration-exploitation dilemma, we designed novel algorithms and guaranteed that they not only respect the safety constraints, but also achieve performance comparable to the setting without safety constraints.
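As a sketch of the kind of formulation we have in mind (stated schematically; the precise setups in our work vary), at round $t$ the learner selects an action $x_t$ from a decision set $\mathcal{D} \subset \mathbb{R}^d$ and receives a noisy reward with mean $\langle \theta_*, x_t \rangle$ for an unknown parameter $\theta_*$, while safety requires that every played action satisfy a linear constraint with another unknown parameter $\mu_*$:
\[
\langle \mu_*, x_t \rangle \;\le\; c \qquad \text{for all rounds } t \ \text{(with high probability)}.
\]
The learner must therefore estimate $\theta_*$ well enough to maximize reward while simultaneously learning $\mu_*$ well enough to certify, before playing, that its actions are safe.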
Recently, there has been surging interest in designing lifelong learning agents that can continuously learn to solve multiple sequential decision-making problems over their lifetimes. This scenario is motivated in particular by the goal of building multi-purpose embodied intelligence, such as robots working in a weakly structured environment. Typically, curating all tasks beforehand for such problems is nearly infeasible, and the problems the agent is tasked with may be selected adaptively based on the agent’s past behavior. Consider a household robot as an example: since each household is unique, it is difficult to anticipate upfront all scenarios the robot will encounter. In this direction, we theoretically study lifelong RL in a regret-minimization setting, where the agent needs to solve a sequence of tasks, each specified by its rewards, in an unknown environment while balancing exploration and exploitation. Motivated by the embodied intelligence scenario, we suppose that the tasks differ in their rewards but share the same state and action spaces and transition dynamics.
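Schematically, and under the simplifying assumption that the agent’s performance on each task is summarized by a single value, the lifelong regret over a sequence of $K$ tasks with reward functions $r_1, \dots, r_K$ can be written as
\[
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \Bigl( V^{*}_{r_k} - V^{\pi_k}_{r_k} \Bigr),
\]
where $V^{*}_{r_k}$ is the optimal value attainable in the shared MDP under reward $r_k$ and $V^{\pi_k}_{r_k}$ is the value achieved by the policy the agent deploys on task $k$.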
Another distinguishing feature of the envisioned applications of bandit algorithms is that interactions involve multiple distributed agents/learners (e.g., wireless/sensor networks). This calls for extensions of the traditional bandit setting to networked systems. In many such systems, it is critical to maintain efficient communication across the network while achieving good performance in terms of accumulated reward, usually measured by the network’s regret. In view of this, for the problem of distributed contextual linear bandits, we prove a minimax lower bound on the communication cost of any regret-optimal distributed contextual linear bandit algorithm with stochastic contexts. We further propose an algorithm whose regret is optimal and whose communication rate matches this lower bound; it is therefore provably optimal in terms of both regret and communication rate.
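For concreteness (again with generic notation), with $N$ agents interacting over $T$ rounds, the network’s regret in a distributed contextual linear bandit is typically defined as
\[
R_{\mathrm{net}}(T) \;=\; \mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{N} \Bigl( \max_{x \in \mathcal{D}_{i,t}} \langle \theta_*, x \rangle \;-\; \langle \theta_*, x_{i,t} \rangle \Bigr)\right],
\]
where $\mathcal{D}_{i,t}$ is the stochastic context set observed by agent $i$ at round $t$, $x_{i,t}$ is the action it plays, and $\theta_*$ is the unknown shared parameter; the communication cost is accounted for separately, e.g., by the total number of messages or bits exchanged among the agents during the $T$ rounds.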