Reinforcement Learning in Healthcare: A Survey: Chao Yu, Jiming Liu, Fellow, IEEE, and Shamim Nemati
Reinforcement Learning in Healthcare: A Survey: Chao Yu, Jiming Liu, Fellow, IEEE, and Shamim Nemati
Reinforcement Learning in Healthcare: A Survey: Chao Yu, Jiming Liu, Fellow, IEEE, and Shamim Nemati
Abstract—As a subfield of machine learning, reinforcement feedback and the new state from the environment. The goal
learning (RL) aims at empowering one’s capabilities in be- of the agent is to learn an optimal policy (i.e., a mapping
havioural decision making by using interaction experience with from the states to the actions) that maximizes the accumulated
the world and an evaluative feedback. Unlike traditional su-
pervised learning methods that usually rely on one-shot, ex- reward it receives over time. Therefore, agents in RL do
haustive and supervised reward signals, RL tackles with se- not receive direct instructions regarding which action they
quential decision making problems with sampled, evaluative should take, instead they must learn which actions are the
arXiv:1908.08796v4 [cs.LG] 24 Apr 2020
and delayed feedback simultaneously. Such distinctive features best through trial-and-error interactions with the environment.
make RL technique a suitable candidate for developing powerful This adaptive closed-loop feature renders RL distinct from
solutions in a variety of healthcare domains, where diagnosing
decisions or treatment regimes are usually characterized by a traditional supervised learning methods for regression or clas-
prolonged and sequential procedure. This survey discusses the sification, in which a list of correct labels must be provided,
broad applications of RL techniques in healthcare domains, or from unsupervised learning approaches to dimensionality
in order to provide the research community with systematic reduction or density estimation, which aim at finding hidden
understanding of theoretical foundations, enabling methods and structures in a collection of example data [11]. Moreover, in
techniques, existing challenges, and new insights of this emerging
paradigm. By first briefly examining theoretical foundations and comparison with other traditional control-based methods, RL
key techniques in RL research from efficient and representational does not require a well-represented mathematical model of
directions, we then provide an overview of RL applications in the environment, but develops a control policy directly from
healthcare domains ranging from dynamic treatment regimes in experience to predict states and rewards during a learning
chronic diseases and critical care, automated medical diagnosis procedure. Since the design of RL is letting an agent controller
from both unstructured and structured clinical data, as well as
many other control or scheduling domains that have infiltrated interact with the system, unknown and time-varying dynamics
many aspects of a healthcare system. Finally, we summarize the as well as changing performance requirements can be naturally
challenges and open issues in current research, and point out accounted for by the controller [15]. Lastly, RL is uniquely
some potential solutions and directions for future research. suited to systems with inherent time delays, in which decisions
Index Terms—Reinforcement Learning, Healthcare, Dynamic are performed without immediate knowledge of effectiveness,
Treatment Regimes, Critical Care, Chronic Disease, Automated but evaluated by a long-term future reward.
Diagnosis. The above features naturally make RL an attractive solution
to constructing efficient policies in various healthcare domains,
I. I NTRODUCTION where the decision making process is usually characterized
Driven by the increasing availability of massive multimodal- by a prolonged period or sequential procedure [16]. Typically,
ity data, and developed computational models and algorithms, a medical or clinical treatment regime is composed of a se-
the role of AI techniques in healthcare has grown rapidly in the quence of decision to determine the course of decisions such as
past decade [1], [2], [3], [4]. This emerging trend has promoted treatment type, drug dosage, or re-examination timing at a time
increasing interests in the proposal of advanced data analytical point according to the current health status and prior treatment
methods and machine learning approaches in a variety of history of an individual patient, with a goal of promoting the
healthcare applications [5], [6], [7], [8], [9]. As as a subfield patient’s long-term benefits. Unlike the common procedure in
in machine learning, reinforcement learning (RL) has achieved traditional randomized controlled trials that derive treatment
tremendous theoretical and technical achievements in general- regimes from the average population response, RL can be
ization, representation and efficiency in recent years, leading tailored for achieving precise treatment for individual patients
to its increasing applicability to real-life problems in playing who may possess high heterogeneity in response to the treat-
games, robotics control, financial and business management, ment due to variety in disease severity, personal characteristics
autonomous driving, natural language processing, computer and drug sensitivity. Moreover, RL is able to find optimal
vision, biological data analysis, and art creation, just to name policies using only previous experiences, without requiring
a few [10], [11], [12], [13], [14]. any prior knowledge about the mathematical model of the
In RL problems, an agent chooses an action at each time biological systems. This makes RL more appealing than many
step based on its current state, and receives an evaluative existing control-based approaches in healthcare domains since
it could be usually difficult or even impossible to build an
Chao Yu is with the School of Data and Computer Science, Sun Yat-sen accurate model for the complex human body system and the
University, Guangzhou, China. (Email: [email protected]). Jiming
Liu is with the Computer Science Department, Hong Kong Baptist Univer- responses to administered treatments, due to nonlinear, varying
sity, Kowloon Tong, Hong Kong. (Email: [email protected]). and delayed interaction between treatments and human bodies.
Shamim Nemati is with the Department of Biomedical Informatics, UC San Thus far, a plethora of theoretical or experimental studies
Diego, La Jolla, CA, USA. (Email: [email protected]).
have applied RL techniques and models in a variety of
heathcare domains, achieving performance exceeding that of
TABLE I TABLE II
S UMMARY OF A BBREVIATIONS IN RL S UMMARY OF A BBREVIATIONS IN H EALTHCARE
the agent after taking action a in state s; and γ ∈ [0, 1] is a as model-based or model-free methods, based on whether a
discount factor. complete knowledge of the MDP model can be specified a
An agent’s policy π : S × A → [0, 1] is a probability priori. Model-based methods, also referred to as planning
distribution that maps an action a ∈ A to a state s ∈ S. methods, require a complete description of the model in
When given an MDP and a policy π, the expected reward of terms of the transition and reward functions, while model-free
following this policy when starting in state s, V π (s), can be methods, also referred to as learning methods, learn an optimal
defined as follows: policy simply based on received observations and rewards.
"∞ # Dynamic programming (DP) [17] is a collection of model-
π
X
t based techniques to compute an optimal policy given a com-
V (s) , Eπ γ R(st , π(st ))|s0 = s (1)
t=0
plete description of an MDP model. DP includes two main
different approaches: Value Iteration (VI) and Policy Iteration
The value function can also be defined recursively using the (PI). VI specifies the optimal policy in terms of value function
Bellman operator B π : Q∗ (s, a) by iterating the Bellman updating as follows:
X
B π V π (s) , R(s, π(s)) + γ P(s, a, s0 )V π (s0 ) (2) X
s0 ∈S
Qt+1 (s, a) = R(s, a) + γ P(s, a, s0 ) max
0
Qt (s0 , a0 ) (4)
a ∈A
s0 ∈S
Since the Bellman operator B is a contraction mapping of
π
value function V , there exists a fixed point of value V π such For each iteration, the value function of every state s is
that B π V π = V π in the limit. The goal of an MDP problem updated one step further into the future based on the current
∗
is to compute an optimal policy π ∗ such that V π (s) ≥ V π (s) estimate. The concept of updating an estimate based on the
for every policy π and every state s ∈ S. To involve the action basis of other estimates is often referred to as bootstrapping.
information, Q-value is used to represent the optimal value of The value function is updated until the difference between two
each state-action pair by Equation 3. iterations, Qt and Qt+1 , is less than a small threshold. The
optimal policy is then derived using π ∗ (s) = arg maxa∈A Q∗ .
X Unlike VI, PI learns the policy directly. It starts with an initial
Q∗ (s, a) = R(s, a) + γ P(s, a, s0 ) max
0
Q(s0 , a0 ) (3) random policy π, and iteratively updates the policy by first
a ∈A
s0 ∈S computing the associated value function Qπ (policy evaluation
2) Basic Solutions and Challenging Issues: Many solution or prediction) and then improving the policy using π(s) =
techniques are available to compute an optimal policy for arg maxa∈A Q(s, a) (policy improvement or control).
a given MDP. Broadly, these techniques can be categorized Despite being mathematically sound, DP methods require a
complete and accurate description of the environment model, In AC methods, the actor is the policy to select actions,
which is unrealistic in most applications. When a model of the and the critic is an estimated value function to criticize the
problem is not available, the problem can then be solved by actions chosen by the actor. After each action execution,
using direct RL methods, in which an agent learns its optimal the critic evaluates the performance of action using the TD
policy while interacting with the environment. Monte Carlo error. The advantages of AC methods include that they are
(MC) methods and Temporal difference (TD) methods are two more appealing in dealing with large scale or even continuous
main such methods, with the difference of using episode-by- actions and learning stochastic policies, and more easier in
episode update in MC or step-by-step update in TD. Denote integrating domain specific constraints on policies.
(n)
Rt = Rt+1 + γRt+2 + ... + γ n−1 Rt+n + γ n Vt (st+n ) n- In order to learn optimal policies, an RL agent should
step return at time t, then the general n−step update rule in make a balance between exploiting the knowledge obtained
(n)
TD methods is defined by ∆Vt (st ) = α[Rt − Vt (st )], in so far by acting optimally, and exploring the unknown space
which α ∈ (0, 1] is an appropriate learning rate controlling in order to find new efficient actions. Such an exploration-
the contribution of the new experience to the current estimate. exploitation trade-off dilemma is one of the most fundamental
MC methods then can be considered as an extreme case of TD theoretical issues in RL, since an effective exploration strategy
methods when the update is conducted after the whole episode enables the agent to make an elegant balance between these
of steps. In spite of having higher complexity in analyzing two processes by choosing explorative actions only when this
the efficiency and speed of convergence, TD methods usually behavior can potentially bring a higher expected return. A
require less memory for estimates and less computation, thus large amount of effort has been devoted to this issue in the
are easier to implement. traditional RL community, proposing a wealth of exploration
If the value function of a policy π is estimated by using strategies including simple heuristics such as ε-greedy and
samples that are generated by strictly following this policy, the Boltzmann exploration, Bayesian learning [22], [23], count-
RL algorithm is called on-policy, while off-policy algorithms based methods with Probably Approximately Correct (PAC)
can learn the value of a policy that is different from the one guarantees [24], [25], as well as more expressive methods
being followed. One of the most important and widely used of intrinsic motivation such as novelty, curiosity and surprise
RL approach is Q-learning [18], which is an off-policy TD [26]. For example, the ε-greedy strategy selects the greedy
algorithm. Its one-step updating rule is given by Equation 5, action, arg maxa Qt (s, a), with a high probability, and, occa-
sionally, with a small probability selects an action uniformly
at random. This ensures that all actions and their effects are
0 0
Qt+1 (s, a) = Qt (s, a)+αt [R(s, a)+γ max Q t (s , a )−Qt (s, a)] experienced. The ε-greedy exploration policy can be given by
a0
(5) Equation 7.
where α ∈ (0, 1] is an appropriate learning rate which controls
1 − ε if a = arg maxa0 Q(s, a),
the contribution of the new experience to the current estimate. π(a0 ) = (7)
Likewise, the SARSA algorithm [19] is an representation ε otherwise.
for on-policy TD approaches given by Equation 6: where ε ∈ [0, 1] is an exploration rate.
Other fundamental issues in RL research include but are
Qt+1 (s, a) = Qt (s, a)+αt [R(s, a)+γQt (s0 , π(s0 ))−Qt (s, a)] not limited to the credit assignment problem [14], [27], the
(6) sampel/space/time complexity [28], [29], function approxima-
The idea is that each experienced sample brings the current tion [30], [31], safety [32], [33], robustness [34], [35], and
estimate Q(s, a) closer to the optimal value Q∗ (s, a). Q- interpretability [36], [37]. A more comprehensive and in-depth
learning starts with an initial estimate for each state-action review on these issues can be found in [38], and more recently
pair. When an action a is taken in state s, resulting in the in [12], [14].
next state s0 , the corresponding Q-value Q(s, a) is updated
with a combination of its current value and the TD error
( R(s, a) + γ maxa0 Qt (s0 , a0 ) − Qt (s, a) for Q-learning, or B. Key Techniques in RL
R(s, a)+γQt (s0 , π(s0 ))−Qt (s, a) for SARSA). The TD error This section discusses some key techniques used in con-
is the difference between the current estimate Q(s, a) and the temporary RL, most of which can be understood in the light
expected discounted return based on the experienced sample. of the framework and solutions defined in the section ahead,
The Q value of each state-action pair is stored in a table for a yet these new techniques emphasize more sophisticated use
discrete state-action space. It has been proved that this tabular of samples, models of the world and learned knowledge of
Q-learning converges to the optimal Q∗ (s, a) w.p.1 when all previous tasks for efficiency purpose, as well as what should
state-action pairs are visited infinitely often and an appropriate be represented and how things should be represented during
exploration strategy and learning rate are chosen [18]. an RL problem. Note that the classification of these two kinds
Besides the above value-function based methods that main- of techniques are not mutually exclusive, which means that
tain a value function whereby a policy can be derived, direct some representation techniques are also used for improving
policy-search (PS) algorithms [20] try to estimate the pol- the learning efficiency, and vice versa.
icy directly without representing a value function explicitly, 1) Efficient Techniques: The purpose of using efficient
whereas the actor-critic (AC) methods [21] keep separate, techniques is to improve the learning performance in terms of,
explicit representations of both value functions and policies. for example, convergence ratio, sample efficient, computation
cost or generalization capabilities of an RL method. This methods such as DP or Monte Carlo Tree Search (MCTS)
improvement can be achieved by using different levels of [44], MRL methods are usually able to learn an accurate
knowledge: the Experience-level techniques focus on utilizing model quickly and then use this model to plan multi-step
the past experience for more stable and data-efficient learning; actions. Therefore, MRL methods normally have better sample
the Model-level techniques focus on building and planning efficiency than model-free methods [28].
over a model of the environment in order to improve sample c) Task-level: A higher task-level of efficient approaches
efficiency; while the Task-level techniques aim at generalizing focuses on the development of methods to transfer knowledge
the learning experience from past tasks to new relevant ones. from a set of source tasks to a target task. Transfer RL (TRL)
a) Experience-level: In traditional pure on-line TD learn- uses the transferred knowledge to significantly improve the
ing methods such as Q-learning and SARSA, an agent im- learning performance in the target task, e.g., by reducing
mediately conducts a DP-like update of the value functions the samples needed for a nearly optimal performance, or
every step interacting with the environment and then disregards increasing the final convergence level [45]. Taylor and Stone
the experienced state transition tuple afterwards. In spite of [46] provided a thorough review on TRL approaches by five
guaranteed convergence and great success in solving simple transfer dimensions: how the source task and target task may
toy problems, this kind of local updates poses several se- differ (e.g., in terms of action, state, reward or transition
vere performance problems when applied to more realistic functions), how to select the source task (e.g., all previously
systems with larger and possibly continuous settings. Since seen tasks, or only one task specified by human or modified
each experience tuple is used only for one update and then automatically), how to define task mappings (e.g., specified by
forgotten immediately, a larger number of samples are required human or learned from experience), what knowledge to trans-
to enable an optimal solution, causing the so called exploration ferred (from experience instances to higher level of models or
overhead problem. Moreover, it has been shown that directly rules), and allowed RL methods (e.g., MRL, PS, or BRL).
combining function approximation methods with pure on-line 2) Representational Techniques: Unlike traditional ma-
TD methods can cause instable or even diverged performance chine learning research that simply focuses on feature en-
[30], [31]. These inefficiency and instability problems become gineering for function approximation, representational tech-
even more pronounced in real environments, particularly in niques in RL can be in a broader perspective, paying attention
healthcare systems, where physical interactions between pa- to constructive or relational representation problems relevant
tients and environments call for more efficient sampling and not only to function approximation for state/action, polices and
stable learning methods. value functions, but also to more exogenous aspects regarding
The Experience-level techniques focus on how to make agents, tasks or models [12].
the best of the past learning experience for more stable and a) Representation for Value Functions or Policies: Many
efficient learning, and are the major driving force behind the traditional RL algorithms have been mainly designed for
proposal of modern Batch RL (BRL) [39]. In BRL, two basic problems with small discrete state and action spaces, which
techniques are used: storing the experience in a buffer and can be explicitly stored in tables. Despite the inherent chal-
reusing it as if it were new (the idea of experience replay lenges, applying these RL algorithms to continuous or highly
for addressing the inefficiency problem), and separating the dimensional domains would cause extra difficulties. A major
DP step from the function approximation step by using a aspect of representational techniques is to represent structures
supervised learning to fit the function approximator over the of policies and value functions in a more compact form for
sampled experience (the idea of fitting for addressing the in- an efficient approximation of solutions and thus scaling up
stability problem). There are several famous BRL approaches to larger domains. Broadly, three categories of approximation
in the literature, such as the non-linear approximator cases of methods can be clarified [31]: model-approximation methods
Neural Fitted Q Iteration (NFQI [40]), the Tree-based FQI that approximate the model and compute the desired policy on
[41], and robust linear approximation techniques for policy this approximated model; value-approximation methods that
learning such as Least-Squares Policy Iteration (LSPI [42]). approximate a value function whereby a policy can be inferred,
As will be discovered later, these BRL methods have enjoyed and policy-approximation methods that search in policy space
wide and successful applications in clinical decision makings, directly and update this policy to approximate the optimal
due to their promise in greatly improving learning speed and policy, or keep separate, explicit representations of both value
approximation accuracy, particularly from limited amounts of functions and policies.
clinical data. The value functions or policies can be parameterized using
b) Model-level: Unlike Experience-level techniques that either linear or non-linear function approximation presenta-
emphasize the efficient use of experience tuples, the Model- tions. Whereas the linear function approximation is better
level techniques try to build a model of the environment understood, simple to implement and usually has better con-
(in terms of the transition and reward functions) and then vergence guarantees, it needs explicit knowledge about domain
derive optimal policies from the environment model when it features, and also prohibits the representation of interactions
is approximately correct. This kind of model-based RL (MRL) between features. On the contrary, non-linear function approx-
approaches is rather different from the model-free RL methods imation methods do not need for good informative features and
such as TD methods or MC methods that directly estimate usually obtain better accuracy and performance in practice, but
value functions without building a model of the environment with less convergence guarantees.
[43]. Using some advanced exploration strategies and planning A notable success of RL in addressing real world complex
problems is the recent integration of deep neural networks dynamically during on-line learning [63].
into RL [47], [48], fostering a new flourishing research area Besides the factored representation of states, a more general
of Deep RL (DRL) [12]. A key factor in this success is that method is to decompose large complex tasks into smaller
deep learning can automatically abstract and extract high-level sets of sub-tasks, which can be solved separatively. Hierar-
features and semantic interpretation directly from the input chical RL (HRL) [64] formalizes hierarchical methods that
data, avoiding complex feature engineering or delicate feature use abstract states or actions over a hierarchy of subtasks
hand-crafting and selection for an individual task [49]. to decompose the original problem, potentially reducing its
b) Representation for Reward Functions: In a general computational complexity. Hengst [65] discussed the vari-
RL setting, the reward function is represented in the form of ous concepts and approaches in HRL, including algorithms
an evaluative scalar signal, which encodes a single objective that can automatically learn the hierarchical structure from
for the learning agent. In spite of its wide applicability, this interactions with the domain. Unlike HRL that focuses on
kind of quantifying reward functions has its limits inevitably. hierarchical decomposition of tasks, Relational RL (RRL) [66]
For example, real life problems usually involve two or more provides a new representational paradigm to RL in worlds
objectives at the same time, each with its own associated explicitly modeled in terms of objects and their relations.
reward signal. This has motivated the emerging research topic Using expressive data structures that represent the objects
of multi-objective RL (MORL) [50], in which a policy must and relations in an explicit way, RRL aims at generalizing
try to make a trade-off between distinct objectives in order to or facilitating learning over worlds with the same or different
achieve a Pareto optimal solution. Moreover, it is often difficult objects and relations. The main representation methods and
or even impossible to obtain feedback signals that can be techniques in RRL have been surveyed in detail in [66].
expressed in numerical rewards in some real-world domains. Last but not the least, Partially Observable MDP (POMDP)
Instead, qualitative reward signals such as being better or is widely adopted to represent models when the states are
higher may be readily available and thus can be directly used not fully observable, or the observations are noisy. Learning
by the learner. Preference-based RL (PRL) [51] is a novel in POMDP, denoted as Partially Observable RL (PORL),
research direction combining RL and preference learning [52] can be rather difficult due to extra uncertainties caused by
to equip an RL agent with a capability to learn desired policies the mappings from observations to hidden states [67]. Since
from qualitative feedback that is expressed by various ranking environmental states in many real life applications, notably in
functions. Last but not the least, all the existing RL methods healthcare systems, are only partially observable, PORL then
are grounded on an available feedback function, either in an becomes a suitable technique to derive a meaningful policy in
explicitly numerical or a qualitative form. However, when such such realistic environments.
feedback information is not readily available or the reward
function is difficult to specify manually, it is then necessary III. A PPLICATIONS OF RL IN H EALTHCARE
to consider an approach to RL whereby the reward function On account of its unique features against traditional machine
can be learned from a set of presumably optimal trajectories learning, statistic learning and control-based methods, RL-
so that the reward is consistent with the observed behaviors. related models and approaches have been widely applied in
The problem of deriving a reward function from observed healthcare domains since decades ago. The early days of focus
behavior is referred to as Inverse RL (IRL) [53], [54], which has been devoted to the application of DP methods in various
has received an increasingly high interest by researchers in the pharmacotherapeutic decision making problems using phar-
past few years. Numerous IRL methods have been proposed, macokinetic/pharmacodynamic (PK/PD) models [68], [69]. Hu
including the Maximum Entropy IRL [55], the Apprenticeship et al., [70] used POMDP to model drug infusion problem
Learning [56], nonlinear representations of the reward function for the administration of anesthesia, and proposed efficient
using Gaussian processes [57], and Bayesian IRL [58]. heuristics to compute suboptimal though useful treatment
c) Representation for Tasks or Models: Much recent strategies. Schaeffer et al. [71] discussed the benefits and
research on RL has focused on representing the tasks or associated challenges of MDP modeling in the context of
models in a compact way to facilitate construction of an medical treatment, and reviewed several instances of medical
efficient policy. Factored MDPs [59] are one of such ap- applications of MDPs, such as spherocytosis treatment and
proaches to representing large structured MDPs compactly, breast cancer screening and treatment.
by using a dynamic Bayesian network (DBN) to represent With the tremendous theoretical and technical achievements
the transition model among states that involve only some in generalization, representation and efficiency in recent years,
set of state variables, and the decomposition of global task RL approaches have been successfully applied in a number
reward to individual variables or small clusters of variables. of healthcare domains to date. Broadly, these application
This representation often allows an exponential reduction in domains can be categorized into three main types: dynamic
the representation size of structured MDPs, but the complex- treatment regimes in chronic disease or critical care, automated
ity of exact solution algorithms for such MDPs also grows medical diagnosis, and other general domains such as health
exponentially in the representation size. A large number of resources allocation and scheduling, optimal process control,
methods has been proposed to employ factored representation drug discovery and development, as well as health manage-
of MDP models for improving learning efficiency for either ment. Figure 2 provides a diagram outlining the application
model-based [60], [61] or model-free RL problems [62]. A domains, illustrating how this survey is organized along the
more challenging issues is how to learn this compact structure lines of the three broad domains in the field.
RL, while the treatment outcomes are expressed by the reward
functions. The inputs in DTRs are a set of clinical observations
and assessments of patients, and the outputs are the treatments
options at each stage, equivalent to the states and actions in
RL, respectively. Apparently, applying RL methods to solve
DTR problems demonstrates several benefits. RL is capable
of achieving time-dependent decisions on the best treatment
for each patient at each decision time, thus accounting for
heterogeneity across patients. This precise treatment can be
achieved even without relying on the identification of any
accurate mathematical models or explicit relationship between
treatments and outcomes. Furthermore, RL driven solutions
enable to improve long-term outcomes by considering delayed
effect of treatments, which is the major characteristic of
medical treatment. Finally, by careful engineering the reward
function using expert or domain knowledge, RL provides
an elegant way to multi-objective optimization of treatment
between efficacy and the raised side effect.
Due to these benefits, RL naturally becomes an appealing
tool for constructing optimal DTRs in healthcare. In fact,
solving DTR problems accounts for a large proportion of RL
studies in healthcare applications, which can be supported by
the dominantly large volume of references in this area. The
domains of applying RL in DTRs can be classified into two
main categories: chronic diseases and critical care.
A. Chronic Diseases
Fig. 2. The outline of application domains of RL in healthcare. Chronic diseases are now becoming the most pressing
public health issue worldwide, constituting a considerable
portion of death every year [80]. Chronic diseases normally
IV. DYNAMIC T REATMENT R EGIMES feature a long period lasting three months or more, expected to
require continuous clinical observation and medical care. The
One goal of healthcare decision-making is to develop ef- widely prevailing chronic diseases include endocrine diseases
fective treatment regimes that can dynamically adapt to the (e.g., diabetes and hyperthyroidism), cardiovascular diseases
varying clinical states and improve the long-term benefits (e.g., heart attacks and hypertension), various mental illnesses
of patients. Dynamic treatment regimes (DTRs) [72], [73], (e.g., depression and schizophrenia), cancer, HIV infection,
alternatively named as dynamic treatment policies [74], adap- obesity, and other oral health problems [81]. Long-term treat-
tive interventions [75], or adaptive treatment strategies [76], ment of these illnesses is often made up of a sequence of
provide a new paradigm to automate the process of developing medical intervention that must take into account the changing
new effective treatment regimes for individual patients with health status of a patient and adverse effects occurring from
long-term care [77]. A DTR is composed of a sequence previous treatment. In general, the relationship of treatment
of decision rules to determine the course of actions (e.g., duration, dosage and type against the patient’s response is too
treatment type, drug dosage, or reexamination timing) at a time complex to be be explicitly specified. As such, practitioners
point according to the current health status and prior treatment usually resort to some protocols following the Chronic Care
history of an individual patient. Unlike traditional randomized Model (CCM) [82] to facilitate decision making in chronic
controlled trials that are mainly used as an evaluative tool disease conditions. Since such protocols are derived from
for confirming the efficacy of a newly developed treatment, average responses to treatment in populations of patients,
DTRs are tailored for generating new scientific hypotheses selecting the best sequence of treatments for an individual
and developing optimal treatments across or within groups patient poses significant challenges due to the diversity across
of patients [77]. Utilizing valid data generated, for instance, or whithin the population. RL has been utilized to automate
from the Sequential Multiple Assignment Randomized Trial the discovery and generation of optimal DTRs in a variety of
(SMART) [78], [79], an optimal DTR that is capable of chronic diseases including caner, diabetes, anemia, HIV and
optimizing the final clinical outcome of particular interest can several common mental illnesses.
be derived. 1) Cancer: Cancer is one of the main chronic diseases that
The design of DTRs can be viewed as a sequential decision causes death. About 90.5 million people had cancer in 2015
making problem that fits into the RL framework well. The se- and approximately 14 million new cases are occurring each
ries of decision rules in DTRs are equivalent to the policies in year, causing about 8.8 million annual deaths that account for
TABLE III
S UMMARY OF RL A PPLICATION E XAMPLES IN THE DEVELOPMENT OF DTR S IN C ANCER
Applications References Base Methods Efficient Representational Data Acquisition Highlights or Limits
Techniques Techniques
Zhao et al. [83] Q-learning BRL N/A ODE model Using SVR or ERT to fit Q values; simplistic reward
function structure with integer values to assess the tradeoff
Optimal between efficacy and toxicity.
chemotherapy Hassani et al. [84] Q-learning N/A N/A ODE model Naive discrete formulation of states and actions.
drug dosage for Ahn & Park [85] NAC N/A N/A ODE model Discovering the strategy of performing continuous treat-
cancer treatment ment from the beginning.
Humphrey [86] Q-learning BRL N/A ODE model pro- Using three machine learning methods to fit Q values, in
posed in [83] high dimensional and subgroup scenarios.
Padmanabhan [87] Q-learning N/A N/A ODE model Using different reward functions to model different con-
straints in cancer treatment.
Zhao et al. [88] Q-learning BRL (FQI- N/A ODE model Considering censoring problem in multiple lines of treat-
SVR) driven by real ment in advanced NSCLC; using overall survival time as
NSCLC data the net reward.
Fürnkranz et al. [52], PI N/A PRL ODE model pro- Combining preference learning and RL for optimal ther-
Cheng et al. [89] posed in [83] apy design in cancer treatment, but only in model-based
DP settings.
Akrour et al. [90], PS N/A PRL ODE model pro- Using active ranking mechanism to reduce the number of
Busa-Fekete et al. [91] posed in [83] needed ranking queries to the expert to yield a satisfactory
policy without a generated model.
Optimal Vincent [92] Q-learning, BRL (FQI- N/A Linear model, Extended ODE model for radiation therapy; using hard
fractionation SARSA(λ), ERT) ODE model constraints in the reward function and simple exploration
scheduling of TD(λ), PS strategy.
radiation therapy Tseng et al. [93] Q-learning N/A DRL (DQN) Data from 114 Addressing limited sample size problem using GAN and
for cancer NSCLC patients approximating the transition probability using DNN.
treatment Jalalimanesh et al.[94] Q-learning N/A N/A Agent-based Using agent-based simulation to model the dynamics of
model tumor growth.
Jalalimanesh et al.[95] Q-learning N/A MORL Agent-based Formulated as a multi-objective problem by considering
model conflicting objective of minimising tumour therapy period
and unavoidable side effects.
Hypothetical or Goldberg & Kosorok Q-learning N/A N/A Linear model Addressing problems with censored data and a flexible
generic cancer [96], Soliman [97] number of stages.
clinical trial Yauney & Shah [98] Q-learning N/A DRL (DDQN) ODE model Addressing the problem of unstructured outcome rewards
using action-driven rewards.
15.7% of total deaths worldwide [99]. The primary treatment be extracted directly from clinical trial data in simulation.
options for cancer include surgery, chemotherapy, and radi- Ahn and Park [85] studied the applicability of the Natural
ation therapy. To analyze the dynamics between tumor and AC (NAC) approach [21] to the drug scheduling of cancer
immune systems, numerous computational models for spatio- chemotherapy based on an ODE-based tumor growth model
temporal or non-spatial tumor-immune dynamics have been proposed by de Pillis and Radunskaya [105]. Targeting at
proposed and analyzed by researchers over the past decades minimizing the tumor cell population and the drug amount
[100]. Building on these models, control policies have been while maximizing the populations of normal and immune
put forward to obtain efficient drug administration (see [85], cells, the NAC approach could discover an effective drug
[101] and references therein). scheduling policy by injecting drug continuously from the
Being a sequential evolutionary process by nature, cancer beginning until an appropriate time. This policy showed better
treatment is a major objective of RL in DTR applications performance than traditional pulsed chemotherapy protocol
[102], [103]. Table III summaries the major studies of applying that administers the drug in a periodical manner, typically on
RL in various aspects of cancer treatment, from the perspec- an order of several hours. The superiority of using continuous
tives of application scenarios (chemotherapy, radiotherapy or dosing treatment over a burst of dosing treatment was also
generic cancer treatment simulation), basic RL methods, the supported by the work [84], where naive discrete Q-learning
efficient and representational techniques applied (if applica- was applied. More recently, Padmanabhan et al. [87] proposed
ble), the learning data (retrospective clinical data, or generated different formulations of reward function in Q-learning to
from simulation models or computational models), and the generate effective drug dosing policies for patient groups with
main highlights and limits of the study. different characteristics. Humphrey [86] investigated several
RL methods have been extensively studied in deriving supervised learning approaches (Classification And Regression
efficient treatment strategies for cancer chemotherapy. Zhao et Trees (CART), random forests, and modified version of Mul-
al. [83] first applied model-free TD method, Q-learning, for tivariate Adaptive Regression Splines (MARS)) to estimate Q
decision making of agent dosage in chemotherapy. Drawing on values in a simulation of an advanced generic cancer trial.
the chemotherapy mathematical model expressed by several Radiotherapy is another major option of treating cancer,
Ordinary Difference Equations (ODE), virtual clinical trial and a number of studies have applied RL approaches for
data from in vivo tumor growth patterns was quantitatively developing automated radiation adaptation protocols [106].
generated. Two explicit machine learning approaches, support Jalalimanesh et al. [94] proposed an agent-based simulation
vector regression (SVG) [104] and extremely randomized trees model and Q-learning algorithm to optimize dose calculation
(ERT) [41], were applied to fit the approximated Q-functions in radiotherapy by varying the fraction size during the treat-
to the generated trial data. Using this kind of batch learning ment. Vincent [92] described preliminary efforts in investi-
methods, it was demonstrated that optimal strategies could gating a variety of RL methods to find optimal scheduling
algorithms for radiation therapy, including the exhaustive PS well as the timing of initiating the second-line therapy. In order
[20], FQI [40], SARSA(λ) [19] and K-Nearest Neighbors- to successfully handle the complex censored survival data, a
TD(λ) [107]. The preliminary findings suggest that there may modification of SVG approach, −SV R−C, was proposed to
be an advantage in using non-uniform fractionation schedules estimate the optimal Q values. A simulation study showed that
for some tissue types. the approach could select optimal compounds for two lines
As the goal of radiotherapy is in essence a multi-objective of treatment directly from clinical data, and the best initial
problem to erase the tumour with radiation while not impacting time for second-line therapy could be derived while taking into
normal cells as much as possible, Jalalimanesh et al. [95] account the heterogeneity across patients. Other studies [96],
proposed a multi-objective distributed Q-learning algorithm to [97] presented the novel censored-Q-learning algorithm that
find the Pareto-optimal solutions for calculating radiotherapy is adjusted for a multi-stage decision problem with a flexible
dose. Each objective was optimized by an individual learning number of stages in which the rewards are survival times that
agent and all the agents compromised their individual solutions are subject to censoring.
in order to derive a Pareto-optimal solution. Under the multi- To tackle the problem that a numerical reward function
objective formulation, three different clinical behaviors could should be specified beforehand in standard RL techniques,
be properly modeled (i.e., aggressive, conservative or mod- several studies investigated the possibility of formulating re-
erate), by paying different degree of attention to eliminating wards using qualitative preference or simply based on past
cancer cells or taking care of normal cells. actions in the treatment of cancer [89], [52], [98]. Akrour et
A recent study [93] proposed a multi-component DRL al. [90] proposed a PRL method combined with active ranking
framework to automate adaptive radiotherapy decision making in order to decrease the number of ranking queries to the
for non-small cell lung cancer (NSCLC) patients. Aiming at expert needed to yield a satisfactory policy. Experiments on the
reproducing or mimicking the decisions that have been previ- cancer treatment testbeds showed that a very limited external
ously made by clinicians, three neural network components, information in terms of expert’s ranking feedbacks might be
namely Generative Adversarial Net (GAN), transition Deep sufficient to reach state-of-the-art results. Busa-Fekete et al.
Neural Networks (DNN) and Deep Q Network (DQN), were [91] introduced a preference-based variant of a direct PS
applied: the GAN component was used to generate sufficiently method in the medical treatment design for cancer clinical
large synthetic patient data from historical small-sized real trials. A novel approach based on action-driven rewards was
clinical data; the transition DNN component was employed first proposed in [98]. It was showed that new dosing regimes
to learn how states would transit under different actions of in cancer chemotherapy could be learned using action-derived
dose fractions, based on the data synthesized from the GAN penalties, suggesting the possibility of using RL methods in
and available real clinical data; once the whole MDP model situations when final outcomes are not available, but priors on
has been provided, the DQN component was then responsible beneficial actions can be more easily specified.
for mapping the state into possible dose strategies, in order to 2) Diabetes: Diabetes mellitus, or simply called diabetes,
optimize future radiotherapy outcomes. The whole framework is one of the most serious chronic diseases in the world.
was evaluated in a retrospective dataset of 114 NSCLC patients According to a recent report released by International Dia-
who received radiotherapy under a successful dose escalation betes Federation (IDF), there are 451 million people living
protocol. It was demonstrated that the DRL framework was with diabetes in 2017, causing approximately 5 million deaths
able to learn effective dose adaptation policies between 1.5 worldwide and USD 850 billion global healthcare expenditure
and 3.8 Gy, which complied with the original dose range used [108]. It is expected that by 2045, the total number of adults
by the clinicians. with diabetes would increase to near 700 million, accounting
The treatment of cancer poses several significant theoretical for 9.9% of the adult population. Since the high prevalence
problems for applying existing RL approaches. Patients may of diabetes presents significant social influence and financial
drop out the treatment anytime due to various uncontrolled burdens, there has been an increasing urgency to ensure
reasons, causing the final treatment outcome (e.g., survival effective treatment to diabetes across the world.
time in cancer treatment) unobserved. This data censoring Intensive research concern has been devoted to the de-
problem [96] complicates the practical use of RL in discov- velopment of effective blood glucose control strategies in
ering individualized optimal regimens. Moreover, in general treatment of insulin-dependent diabetes (i.e., type 1 diabetes).
cancer treatment, the initiation and timing of the next line Since its first proposal in the 1970s [109], artificial pancreas
of therapy depend on the disease progression, and thus the (AP) have been widely used in the blood glucose control
number of treatment stage can be flexible. For instance, process to compute and administrate a precise insulin dose, by
NSCLC patients usually receive one to three treatment lines, using a continuous glucose monitoring system (CGMS) and a
and the necessity and timing of the second and third lines closed-loop controller [110]. Tremendous progress has been
of treatment vary from person to person. Developing valid made towards insulin infusion rate automation in AP using
methodology for computing optimal DTRs in such a flexible traditional control strategies such as Proportional-Integral-
setting is currently a premier challenge. Zhao et al. [88] Derivative (PID), Model Predictive Control (MPC), and Fuzzy
presented an adaptive Q-learning approach to discover optimal Logic (FL) [111], [112]. A major concern is the inter- and
DTRs for the first and second lines of treatment in Stage intra- variability of the diabetic population which raises the
IIIB/IV NSCLC. The trial was conducted by randomizing the demand for a personalized, patient specific approach of the
different compounds for first and second-line treatments, as glucose regulation. Moreover, the complexity of the physiolog-
ical system, the variety of disturbances such as meal, exercise, this predefined reward function then motivated the application
stress and sickness, along with the difficulty in modelling of IRL approach to reveal the reward function that doctors
accurately the glucose-insulin regulation system all raise the were using during their treatments [131]. Using observational
need in the development of more advanced adaptive algorithms data on the effect of food intake and physical activity in
for the glucose regulation. an outpatient setting using mobile technology, Luckett et al.
RL approaches have attracted increasingly high attention [132] proposed the V-learning method that directly estimates a
in personalized, patient specific glucose regulation in AP policy which maximizes the value over a class of policies and
systems [113]. Yasini et al. [114] made an initial study on requires minimal assumptions on the data-generating process.
using RL to control an AP to maintain normoglycemic around The method has been applied to estimate treatment regimes
80 mg/dl. Specifically, model-free TD Q-learning algorithm to reduce the number of hypo and hyperglycemic episodes in
was applied to compute the insulin delivery rate, without patients with type 1 diabetes.
relying on an explicit model of the glucose-insulin dynamics. 3) Anemia: Anemia is a common comorbidity in chronic
Daskalaki et al. [115] presented an AC controller for the renal failure that occurs in more than 90% of patients with end-
estimation of insulin infusion rate in silico trial based on stage renal disease (ESRD) who are undertaking hemodialysis.
the University of Virginia/Padova type 1 diabetes simulator Caused by a failure of adequately producing endogenous
[116]. In an evaluation of 12 day meal scenario for 10 adults, erythropoietin (EPO) and thus red blood cells, anemia can have
results showed that the approach could prevent hypoglycaemia significant impact on organ functions, giving rise to a number
well, but hyperglycaemia could not be properly solved due of severe consequences such as heart disease or even increased
to the static behaviors of the Actor component. The authors mortality. Currently, anemia can be successfully treated by ad-
then proposed using daily updates of the average basal rate ministering erythropoiesis-stimulating agents (ESAs), in order
(BR) and the insulin-to-carbohydrate (IC) ratio in order to to maintain the hemoglobin (HGB) level within a narrow range
optimize glucose regulation [117], and using estimation of of 11-12 g/dL. To achieve this, professional clinicians must
information transfer (IT) from insulin to glucose for automatic carry out a labor intensive process of dosing ESAs to assess
and personalized tuning of the AC approach [118]. This idea monthly HGB and iron levels before making adjustments
was motivated by the fact that small adaptation of insulin in the accordingly. However, since the existing Anemia Management
Actor component may be sufficient in case of large amount of Protocol (AMP) does not account for the high inter- and intra-
IT from insulin to glucose, whereas more dramatic updates individual variability in the patient’s response, the HGB level
may be required for low IT. The results from the Control of some patients usually oscillates around the target range,
Variability Grid Analysis (CVGA) showed that the approach causing several risks and side-effects.
could achieve higher performance in all three groups of As early as in 2005, Gaweda et al. [133] first proposed using
patients, with 100% percentages in the A+B zones for adults, RL to perform individualized treatment in the management of
and 93% for both adolescents and children, compared to renal anemia. The target under control is the HGB, whereas
approaches with random initialization and zero initial values. the control input is the amount of EPO administered by the
The AC approach was significantly extended to directly link to physician. As the iron storage in the patient, determined by
patient-specific characteristics, and evaluated more extensively Transferrin Saturation (TSAT), also has an impact on the pro-
under a complex meal protocol, meal uncertainty and insulin cess of red blood cell creation, it is considered as a state com-
sensitivity variation [119], [120]. ponent together with HGB. To model distinct dose-response
A number of studies used certain mathematical models relationship within a patient population, a fuzzy model was es-
to simulate the glucose-insulin dynamic system in patients. timated first by using real records of 186 hemodialysis patients
Based on the Palumbo mathematical model [121], the on- from the Division of Nephrology, University of Louisville.
policy SARSA was used for insulin delivery rate [122]. Ngo On-policy TD method, SARSA, was then performed on the
et al. applied model-based VI method [123] and AC method sample trajectories generated by the model. Results show that
[124] to reduce the fluctuation of the blood glucose in both the proposed approach generates adequate dosing strategies
fasting and post-meal scenarios, drawing on the Bergman’s for representative individuals from different response groups.
minimal insulin-glucose kinetics model [125] and the Hovorka The authors then proposed a combination of MPC approach
model [126] to simulate a patient. De Paula et al. [127], [128] with SARSA for decision support in anemia management
proposed policy learning algorithms that integrates RL with [134], with the MPC component used for simulation of patient
Gaussian processes to take into account glycemic variability response and SARSA for optimization of the dosing strategy.
under uncertainty, using the Ito’s stochastic model of the However, the automated RL approaches in these studies could
glucose-insulin dynamics [129]. only achieve a policy with a comparable outcome against
There are also several data-driven studies carried out to ana- the existing AMP. Other studies applied various kinds of
lyze RL in diabetes treatment based on real data from diabetes Q-learning, such as Q-learning with function approximation,
patients. Utilizing the data extracted from the medical records or directly based on state-aggregation [135], [136], [137], in
of over 10,000 patients in the University of Tokyo Hospital, providing effective treatment regimes in anemia.
Asoh et al. [130] estimated the MDP model underlying the Several studies resorted to BRL methods to derive optimal
progression of patient state and evaluated the value of treat- ESA dosing strategies for anemia treatment. By performing a
ment using the VI method. The opinions of a doctor were used retrospective study of a cohort of 209 hemodialysis patients,
to define the reward for each treatment. The preassumption of Malof and Gaweda [138] adopted the batch FQI method to
achieve dosing strategies that were superior to a standard AMP. approach outperformed those derived by each method alone.
The FQI method was also applied by Escandell et al. [139] for Since the treatment of HIV highly depends the patient’s
discovering efficient dosing strategies based on the historical immune system that varies from person to person, it is thus
treatment data of 195 patients in nephrology centers allocated necessary to derive efficient learning strategies that can address
around Italy and Portugal. An evaluation of the FQI method on and identify the variations across subpopulations. Marivate
a computational model that describes the effect of ESAs on the et al. [145] formalized a routine to accommodate multiple
hemoglobin level showed that FQI could achieve an increment sources of uncertainty in BRL methods to better evaluate the
of 27.6% in the proportion of patients that are within the effectiveness of treatments across a subpopulations of patients.
targeted range of hemoglobin during the period of treatment. Other approaches applied various kinds of TRL techniques so
In addition, the quantity of drug needed is reduced by 5.13%, as to take advantage of the prior information from previously
which indicates a more efficient use of ESAs [140]. learned transition models [146], [147] or learned policy [148].
4) HIV: Discovering effective treatment strategies for HIV- More recently, Yu et al. [149] proposed a causal policy
infected individuals remains one of the most significant chal- gradient algorithm and evaluated it in the treatment of HIV in
lenges in medical research. To date, the effective way to treat order to facilitate the final learning performance and increase
HIV makes use of a combination of anti-HIV drugs (i.e., explanations of learned strategies.
antiretrovirals) in the form of Highly Active Antiretroviral The treatment of HIV provides a well-known testbed for
Therapy (HAART) to inhibit the development of drug-resistant evaluation of exploration mechanisms in RL research. Sim-
HIV strains [141]. Patients suffering from HIV are typically ulations show that the basin of attraction of the healthy
prescribed a series of treatments over time in order to max- steady-state is rather small compared to that of the non-
imize the long-term positive outcomes of reducing patients’ healthy steady state [141]. Thus, general exploration methods
treatment burden and improving adherence to medication. are unable to yield meaningful performance improvement as
However, due to the differences between individuals in their they can only obtain samples in the vicinity of the “non-
immune responses to treatment, discovering the optimal drug healthy” steady state. To solve this issue, several studies have
combinations and scheduling strategy is still a difficult task in proposed more advanced exploration strategies in order to
both medical research and clinical trials. increase the learning performance in HIV treatment. Pazis et
Ernst et al. [142] first introduced RL techniques in com- al. [150] introduced an algorithm for PAC optimal exploration
puting Structured Treatment Interruption (STI) strategies for in continuous state spaces. Kawaguchi considered the time
HIV infected patients. Using a mathematical model [141] to bound in a PAC exploration process [151]. Results in both
artificially generate the clinical data, the BRL method FQI- studies showed that the exploration algorithm could achieve
ERT was applied to learn an optimal drug prescription strategy far better strategies than other existing exploration strategies
in an off-line manner. The derived STI strategy features in HIV treatment.
a cycling between the two main anti-HIV drugs: Reverse 5) Mental Disease: Mental diseases are characterized by
Transcriptase Inhibitors (RTI) and Protease Inhibitors (PI), a long-term period of clinical treatments that usually require
before bringing the patient to the healthy drug-free steady- adaptation in the duration, dose, or type of treatment over
state. Using the same mathematical model, Parbhoo [143] time [152]. Given that the brain is a complex system and thus
further implemented three kinds of BRL methods, FQI-ERT, extremely challenging to model, applying traditional control-
neural FQI and LSPI, to the problem of HIV treatment, based methods that rely on accurate brain models in mental
indicating that each learning technique had its own advantages disease treatment has proved infeasible. Well suited to the
and disadvantages. Moreover, an evaluation based on a ten-year problem at hand, RL has been widely applied to DTRs in
period of real clinical data from 250 HIV-infected patients a wide range of mental illness including epilepsy, depression,
in Charlotte Maxeke Johannesburg Academic Hospital, South schizophrenia and various kinds of substance addiction.
Africa verified that the RL methods were capable of sug- a) Epilepsy: Epilepsy is one of the most common severe
gesting treatments that were reasonably compliant with those neurological disorders, affecting around 1% of the world
suggested by clinicians. population. When it occurs, epilepsy manifests in the
A mixture-of-experts approach was proposed in [144] to form of intermittent and intense seizures that are recognized
combine the strengths of both kernel-based regression methods as abnormal synchronized firing of neural populations. Im-
(i.e., history-alignment model) and RL (i.e., model-based plantable electrical deep-brain stimulation devices are now an
Bayesian PORL) for HIV therapy selection. Since kernel- important treatment option for drug-resistant epileptic patients.
based regression methods are more suitable for modeling more Researchers from nonlinear dynamic systems analysis and
related patients in history, while model-based RL methods control have proposed promising prediction and detection
are more suitable for reasoning about the future outcomes, algorithms to suppress the frequency, duration and amplitude
automatically selecting an appropriate model for a particular of seizures [153]. However, due to lack of full understand-
patient between these two methods thus tends to provide ing of seizure and its associated neural dynamics, designing
simpler yet more robust patterns of response to the treatment. optimal seizure suppression algorithms via minimal electrical
Making use of a subset of the EuResist database consisting of stimulation has been for a long time a challenging task in
HIV genotype and treatment response data for 32,960 patients, treatment of epilepsy.
together with the 312 most common drug combinations in the RL enables direct closed-loop optimizations of deep-brain
cohort, the treatment therapy derived by the mixture-of-experts approach outperformed those derived by each method alone. stimulation strategies by adapting control policies to patients'
unique neural dynamics, without necessarily relying on having learning (IQ-learning), by interchanging the order of certain
accurate prediction or detection of seizures. The goal is to steps in traditional Q-learning, and showed that IQ-learning
explicitly maximize the effectiveness of stimulation, while improved on Q-learning in terms of integrated mean squared
simultaneously minimizing the overall amount of stimulation error in a study of MDD. The IQ-learning framework was then
applied thus reducing cell damage and preserving cognitive extended to optimize functionals of the outcome distribution
and neurological functions [154]. Guez et al. [155], [156], other than the expected value [166], [167]. Schulte et al.
[157] applied the BRL method, FQI-ERT, to optimize a deep- [168] provided systematic empirical studies of Q-learning
brain stimulation strategy for the treatment of epilepsy. En- and Advantage-learning (A-learning) [169] methods and il-
coding the observed Electroencephalograph (EEG) signal as a lustrated their performance using data from an MDD study.
114-dimensional continuous feature vector, and four different Other approaches include the penalized Q-learning [170], the
simulation frequencies as the actions, the RL approach was Augmented Multistage Outcome-Weighted Learning (AMOL)
applied to learn an optimal stimulation policy using data from [171], the budgeted learning algorithm [172], and the Censored
an in vitro animal model of epilepsy (i.e., field potential Q-learning algorithm [97].
recordings of seizure-like activity in slices of rat brains). c) Schizophrenia: RL methods have been also used to
Results showed that RL strategies substantially outperformed derive optimal DTRs in treatment of schizophrenia, using
the current best stimulation strategies in the literature, reducing data from the Clinical Antipsychotic Trials of Intervention
the incidence of seizures by 25% and total amount of electrical Effectiveness (CATIE) study [173], which was an 18-month
stimulation to the brain by a factor of about 10. Subsequent study divided into two main phases of treatment. An in-depth
validation work [158] showed generally similar results that case study of using BRL, FQI, to optimize treatment choices
RL-based policy could prevent epilepsy with a significant for patients with schizophrenia using data from CATIE was
reduced amount of stimulation, compared to fixed-frequency given by [174]. Key technical challenges of applying RL in
stimulation strategies. Bush and Pineau [159] applied manifold typically continuous, highly variable, and high-dimensional
embeddings to reconstruct the observable state space in MRL, clinical trials with missing data were outlined. To address these
and applied the proposed approach to tackle the high com- issues, the authors proposed the use of multiple imputation to
plexity of nonlinearity and partially observability in real-life overcome the missing data problem, and then presented two
systems. The learned neurostimulation policy was evaluated to methods, bootstrap voting and adaptive confidence intervals,
suppress epileptic seizures on animal brain slices and results for quantifying the evidence in the data for the choices made
showed that seizures could be effectively suppressed after a by the learned optimal policy. Ertefaie et al. [175] accom-
short transient period. modated residual analyses into Q-learning in order to increase
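As a reference point for the Q-learning analyses of multi-stage trials such as STAR*D and CATIE discussed here, the sketch below spells out the standard backward-induction recipe with linear Q-functions on a simulated two-stage data set; the covariates, treatments, and outcome model are invented for illustration and do not correspond to any of the cited analyses.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    n = 300
    # Stage-1 and stage-2 covariates, binary treatments, and final outcome (toy data).
    x1 = rng.normal(size=(n, 2)); a1 = rng.integers(0, 2, n)
    x2 = rng.normal(size=(n, 2)); a2 = rng.integers(0, 2, n)
    y = x2[:, 0] * (2 * a2 - 1) + 0.5 * x1[:, 1] * (2 * a1 - 1) + rng.normal(size=n)

    def design(x, a):
        """Main effects plus treatment-by-covariate interactions."""
        a = a.reshape(-1, 1)
        return np.hstack([x, a, x * a])

    # Stage 2: regress the observed outcome on stage-2 history and treatment.
    q2 = LinearRegression().fit(design(x2, a2), y)
    # Pseudo-outcome: value of the best stage-2 decision for each patient.
    v2 = np.maximum(q2.predict(design(x2, np.zeros(n))),
                    q2.predict(design(x2, np.ones(n))))
    # Stage 1: regress the pseudo-outcome on stage-1 history and treatment.
    q1 = LinearRegression().fit(design(x1, a1), v2)

    def recommend(q, x):
        """Recommend the treatment with the larger predicted Q-value."""
        x = np.atleast_2d(x)
        qa0 = q.predict(design(x, np.zeros(len(x))))
        qa1 = q.predict(design(x, np.ones(len(x))))
        return (qa1 > qa0).astype(int)

    print(recommend(q1, x1[:5]), recommend(q2, x2[:5]))

Stage 2 is fit first on the observed outcome; its per-patient optimum then serves as the pseudo-outcome for the stage-1 regression, which is the step that distinguishes this construction from fitting two independent supervised models.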
While the above in vitro biological models of epilepsy are the accuracy of model fit and demonstrated its superiority over
useful for research, they are nonetheless time-consuming and standard Q-learning using data from CATIE.
associated with high cost. In contrast, computational models Some studies have focused on optimizing multiple treatment
can provide large amounts of reproducible and cheap data that objectives in dealing with schizophrenia. Lizotte et al. [176]
may permit precise manipulations and deeper investigations. extended the FQI algorithm by considering multiple rewards
Vincent [92] proposed an in silico computational model of of symptom reduction, side-effects and quality of life simulta-
epileptiform behavior in brain slices, which was verified by neously in sequential treatments for schizophrenia. However,
using biological data from rat brain slices in vitro. Nagaraj et it was assumed that end-users had a true reward function that
al. [160] proposed the first computational model that captures was linear in the objectives and all future actions could be
the transition from inter-ictal to ictal activity, and applied chosen optimally with respect to the same true reward function
naive Q-learning method to optimize stimulation frequency over time. To solve these issues, the authors then proposed
for controlling seizures with minimum stimulations. It was the non-deterministic multi-objective FIQ algorithm, which
shown that even such simple RL methods could converge on computed policies for all preference functions simultaneously
the optimal solution in simulation with slow and fast inter- from continuous-state, finite-horizon data [177]. When patients
seizure intervals. do not know or cannot communicate their preferences, and
b) Depression: Major depressive disorder (MDD), also there is heterogeneity across patient preferences for these
known simply as depression, is a mental disorder characterized outcomes, formation of a single composite outcome that
by at least two weeks of low mood that is present across correctly balances the competing outcomes for all patients
most situations. Using data from the Sequenced Treatment is not possible. Laber et al. [178] then proposed a method
Alternatives to Relieve Depression (STAR*D) trial [161], for constructing DTRs for schizophrenia that accommodates
which is a sequenced four-stage randomized clinical trial of competing outcomes and preference heterogeneity across both
patients with MDD, Pineau et al. [162] first applied Kernel- patients and time by recommending sets of treatments at
based BRL [163] for constructing useful DTRs for patients each decision point. Butler et al. [179] derived a preference
with MDD. Other work tries to address the problem of sensitive optimal DTR for schizophrenia patient by directly
non-smoothness of decision rules as well as non-regularity of the eliciting patients' preferences over time.
parameter estimations in traditional RL methods by proposing d) Substance Addiction: Substance addiction, or sub-
various extensions over default Q-learning procedure in order stance use disorder (SUD), often involves a chronic course of
to increase the robustness of learning [164]. Laber et al. repeated cycles of cessation followed by relapse [180], [75].
[165] proposed a new version of Q-learning, interactive Q- There has been great interest in the development of DTRs by
investigators to deliver in-time interventions or preventions to still lack universally agreed-upon decision support for sepsis
end-users using RL methods, guiding them to lead healthier [188]. With the available data obtained from freely accessible
lives. For example, Murphy et al. [181] applied AC algorithm critical care databases such as the Multiparameter Intelligent
to reduce heavy drinking and smoking for university students. Monitoring in Intensive Care (MIMIC) [227], recent years
Chakraborty et al. [77], [182], [183] used Q-learning with lin- have seen an increasing number of studies that applied RL
ear models to identify DTRs for smoking cessation treatment techniques to the problem of deducing optimal treatment
regimes. Tao et al. [184] proposed a tree-based RL method policies for patients with sepsis [228].
to directly estimate optimal DTRs, and identify dynamic SUD The administration of intravenous (IV) and maximum va-
treatment regimes for adolescents. sopressor (VP) is a key research and clinical challenge in
sepsis. A number of studies have been carried out to tackle
this issue in the past years. Komorowski et al. [192], [193]
B. Critical Care
directly applied the on-policy SARSA algorithm and model-
Unlike the treatment of chronic diseases, which usually based PI method in a discretized state and action-space.
requires a long period of constant monitoring and medication, Raghu et al. [194], [195] examined fully continuous state
critical care is dedicated to more seriously ill or injured and action space, where policies are learned directly from
patients that are in need of special medical treatments and the physiological state data. To this end, the authors pro-
nursing care. Usually, such patients are provided with separate posed the fully-connected Dueling Double DQN to learn an
geographical area, or formally named the intensive care unit approximation for the optimal action-value function, which
(ICU), for intensive monitoring and close attention, so as to combines three state-of-the-art efficiency and stability boosting
improve the treatment outcomes [185]. ICUs will play a major techniques in DRL, i.e., Double DQN [229], Dueling DQN
role in the new era of healthcare systems. It is estimated that [230] and Prioritized Experience Replay (PER) [231]. Experi-
the ratio of ICU beds to hospital beds would increase from mental results demonstrated that using continuous state-space
3-5% in the past to 20-30% in the future [186]. modeling could identify interpretable policies with improved
Significant attempts have been devoted to the development patient outcomes, potentially reducing patient mortality in the
of clearer guidelines and standardizing approaches to various hospital by 1.8 - 3.6%. The authors also directly estimated the
aspects of interventions in ICUs, such as sedation, nutrition, transition model in continuous state-space, and applied two
administration of blood products, fluid and vasoactive drug PS methods, the direct policy gradient and Proximal Policy
therapy, haemodynamic endpoints, glucose control, and me- Optimization PPO [232], to derive a treatment strategy [196].
chanical ventilation [185]. Unfortunately, only a few of these Utomo et al. [197] proposed a graphical model that was able
interventions could be supported by high quality evidence from to show transitions of patient health conditions and treatments
randomised controlled trials or meta-analyses [187], especially for better explanability, and applied MC to generate a real-
when it comes to development of potentially new therapies time treatment recommendation. Li et al. [201] provided an
for complex ICU syndromes, such as sepsis [188] and acute online POMDP solution to take into account uncertainty and
respiratory distress syndrome [189]. history information in sepsis clinical applications. Futoma et
Thanks to the development in ubiquitous monitoring and al. [199] used multi-output Gaussian processes and DRL to
censoring techniques, it is now possible to generate rich ICU directly learn from sparsely sampled and frequently missing
data in a variety of formats such as free-text clinical notes, multivariate time series ICU data. Peng et al. [198] applied
images, physiological waveforms, and vital sign time series, the mixture-of-experts framework [144] in sepsis treatment
suggesting a great deal of opportunities for the applications by automatically switching between kernel learning and DRL
of machine learning and particularly RL techniques in critical depending on patient’s current history. Results showed that
care [190], [191]. However, the inherent 3C (Compartmen- this kind of mixed learning could achieve better performance
talization, Corruption, and Complexity) features indicate that than the strategies by physicians, Kernel learning and DQN
critical care data are usually noisy, biased and incomplete learning alone. Most recently, Yu et al. [200] addressed IRL
[5]. Properly processing and interpreting this data in a way problems in sepsis treatment.
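Since inverse RL (IRL) is invoked at several points in this section for recovering reward functions from clinician behaviour, a compact, simplified maximum-entropy IRL sketch on a small tabular MDP is included below; the random transition model, the state-only reward parameterization, and the fabricated "clinician traces" are all assumptions made purely for illustration.

    import numpy as np
    from scipy.special import logsumexp

    def soft_value_iteration(P, r, gamma=0.9, iters=100):
        """Soft (MaxEnt) value iteration; returns a stochastic policy pi[s, a]."""
        v = np.zeros(P.shape[0])
        for _ in range(iters):
            q = r[:, None] + gamma * P @ v          # q[s, a]
            v = logsumexp(q, axis=1)                # soft maximum over actions
        pi = np.exp(q - v[:, None])
        return pi / pi.sum(axis=1, keepdims=True)

    def expected_visitation(P, pi, p0, horizon=50):
        """Average state-visitation frequencies induced by policy pi."""
        d = p0.copy(); mu = np.zeros(len(p0))
        for _ in range(horizon):
            mu += d
            d = np.einsum("s,sa,sat->t", d, pi, P)
        return mu / horizon

    def maxent_irl(P, demos, p0, lr=0.05, epochs=60):
        """Recover a state reward whose soft-optimal policy matches the demos."""
        n_s = P.shape[0]
        mu_expert = np.bincount(np.concatenate(demos), minlength=n_s).astype(float)
        mu_expert /= mu_expert.sum()
        r = np.zeros(n_s)
        for _ in range(epochs):
            pi = soft_value_iteration(P, r)
            mu = expected_visitation(P, pi, p0)
            r += lr * (mu_expert - mu)              # MaxEnt gradient for one-hot state features
        return r

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        n_s, n_a = 6, 3                             # toy "patient condition" chain
        P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s']
        p0 = np.ones(n_s) / n_s
        demos = [rng.integers(0, n_s, size=20) for _ in range(10)]  # fake clinician traces
        print(np.round(maxent_irl(P, demos, p0), 2))

The update nudges the reward so that the soft-optimal policy's state-visitation frequencies match those observed in the demonstrations, which is the core idea reused, in far more elaborate forms, by the clinical IRL studies cited in this section.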
that can be used by existing machine learning methods is the Targeting at glycemic regulation problems for severely ill
premier challenge of data analysis in critical care. To date, RL septic patients, Weng et al. [202] applied PI to learn the
has been widely applied in the treatment of sepsis (Section optimal targeted blood glucose levels from real data trajecto-
IV-B1), regulation of sedation (Section IV-B2), and some ries. Petersen et al. [203] investigated the cytokine mediation
other decision making problems in ICUs such as mechanical problem in sepsis treatment, using the DRL method, Deep
ventilation and heparin dosing (Section IV-B3). Table IV Deterministic Policy Gradient (DDPG) [233], to tackle the hi-
summarizes these applications according to the applied RL dimensional continuous states and actions, and potential-based
techniques and the sources of data acquired during learning. reward shaping [234] to facilitate the learning efficiency. The
1) Sepsis: Sepsis, which is defined as severe infection proposed approach was evaluated using an agent-based model,
causing life-threatening acute organ failure, is a leading cause the Innate Immune Response Agent-Based Model (IIRABM),
of mortality and associated healthcare costs in critical care that simulates the immune response to infection. The learned
[226]. While numbers of international organizations have treatment strategy was showed to achieve 0.8% mortality
devoted significant efforts to provide general guidance for over 500 randomly selected patient parameterizations with
treating sepsis over the past 20 years, physicians at practice mortalities average of 49%, suggesting that adaptive, person-
TABLE IV
SUMMARY OF RL APPLICATION EXAMPLES IN THE DEVELOPMENT OF DTRS IN CRITICAL CARE

Domain | Application | Reference | Base method | Efficient Techniques | Representational Techniques | Data Acquisition | Highlights and Limits
Sepsis | Administration of IV fluid and maximum VP | Komorowski et al. [192], [193] | SARSA, PI | N/A | N/A | MIMIC-III | Naive application of SARSA and PI in a discrete state and action space.
| | Raghu et al. [194], [195] | Q-learning | N/A | DRL (DDDQN) | MIMIC-III | Application of DRL in a fully continuous state but discrete action space.
| | Raghu et al. [196] | PS | MRL | N/A | MIMIC-III | Model-based learning with continuous state space; integrating clinicians' policies into RL policies.
| | Utomo et al. [197] | MC | N/A | N/A | MIMIC-III | Estimating transitions of patient health conditions and treatments to increase explainability.
| | Peng et al. [198] | Q-learning | N/A | DRL (DDDQN) | MIMIC-III | Adaptive switching between kernel learning and DRL.
| | Futoma et al. [199] | Q-learning | N/A | DRL | Clinical data at university hospital | Tackling sparsely sampled and frequently missing multivariate time series data.
| | Yu et al. [200] | Q-learning | BRL (FQI) | DRL, IRL | MIMIC-III | Inferring the best reward functions using deep IRL.
| | Li et al. [201] | AC | N/A | PORL | MIMIC-III | Taking into account uncertainty and history information of sepsis patients.
| Targeted blood glucose regulation | Weng et al. [202] | PI | N/A | N/A | MIMIC-III | Learning the optimal targeted blood glucose levels for sepsis patients.
| Cytokine mediation | Petersen et al. [203] | AC | N/A | DRL (DDPG) | Agent-based model | Using reward shaping to facilitate the learning efficiency; significantly reducing mortality from 49% to 0.8%.
Anesthesia | Regulation and automation of sedation and analgesia to maintain physiological stability and lowering pains of patients | Moore et al. [204], [205] | Q(λ) | N/A | N/A | PK/PD model | Achieving superior stability compared to a well-tuned PID controller.
| | Moore et al. [206], [207] | Q-learning | N/A | N/A | PK/PD model | Using the change of BIS as the state representation.
| | Moore et al. [208], [209] | Q-learning | N/A | N/A | In vivo study | First clinical trial for anesthesia administration using RL on human volunteers.
| | Sadati et al. [210] | Unclear | N/A | N/A | PK/PD model | Expert knowledge can be used to realize reasonable initial dosage and keep drug inputs in safe values.
| | Borera et al. [211] | Q-learning | N/A | N/A | PK/PD model | Using an adaptive filter to eliminate the delays when estimating patient state.
| | Lowery & Faisal [212] | AC | N/A | N/A | PK/PD model | Considering the continuous state and action spaces.
| | Padmanabhan et al. [213] | Q-learning | N/A | N/A | PK/PD model | Regulating sedation and hemodynamic parameters simultaneously.
| | Humbert et al. [214] | N/A | N/A | POMDP, IRL | Clinical data | Training an RL agent to mimic decisions by expert anesthesiologists.
Others | Heparin dosing | Nemati et al. [215] | Q-learning | BRL | PORL | MIMIC-II | End-to-end learning with hidden states of patients.
| | Lin et al. [216] | AC | N/A | DRL (DDPG) | MIMIC, Emory Healthcare data | Addressing dosing problems in continuous state-action spaces.
| General medication recommendation | Wang et al. [217] | AC | N/A | DRL (DDPG) | MIMIC-III | Combining supervised and reinforcement learning for medication dosing covering a large number of diseases.
| Mechanical ventilation and sedative dosing | Prasad et al. [218] | Q-learning | BRL (FQI) | N/A | MIMIC-III | Optimal decision making for the weaning time of mechanical ventilation and personalized sedation dosage.
| | Yu et al. [219] | Q-learning | BRL (FQI) | IRL | MIMIC-III | Applying IRL in inferring the reward functions.
| | Yu et al. [220] | AC | N/A | N/A | MIMIC-III | Combining supervised learning and AC for more efficient decision making.
| | Jagannatha et al. [221] | Q-learning, PS | BRL (FQI) | N/A | MIMIC-III | Analyzing limitations of off-policy policy evaluation methods in ICU settings.
| Ordering of lab tests | Cheng et al. [222] | Q-learning | BRL (FQI) | MORL | MIMIC-III | Designing a multi-objective reward function that reflects clinical considerations when ordering labs.
| | Chang et al. [223] | Q-learning | N/A | DRL (Dueling DQN) | MIMIC-III | The first RL application on the multi-measurement scheduling problem in the clinical setting.
| Prevention and treatments for GVHD | Krakow et al. [224] | Q-learning | N/A | N/A | CIBMTR data | First proposal of DTRs for acute GVHD prophylaxis and treatment.
| | Liu et al. [225] | Q-learning | N/A | DRL (DQN) | CIBMTR data | Incorporation of a supervised learning step into RL.
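Several of the sepsis entries in Table IV combine a dueling network architecture with Double DQN targets (the DDDQN column). The PyTorch fragment below sketches how these two pieces are usually wired together; the layer sizes, batch contents, and the 25-action example (e.g., a discretized IV-fluid/vasopressor dose grid) are placeholder assumptions rather than the configurations of the cited studies.

    import torch
    import torch.nn as nn

    class DuelingQNet(nn.Module):
        """Dueling architecture: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
        def __init__(self, state_dim, n_actions, hidden=128):
            super().__init__()
            self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
            self.value = nn.Linear(hidden, 1)
            self.advantage = nn.Linear(hidden, n_actions)

        def forward(self, s):
            h = self.trunk(s)
            a = self.advantage(h)
            return self.value(h) + a - a.mean(dim=1, keepdim=True)

    def double_dqn_targets(online, target, r, s_next, done, gamma=0.99):
        """Double DQN: select a' with the online net, evaluate it with the target net."""
        with torch.no_grad():
            a_next = online(s_next).argmax(dim=1, keepdim=True)
            q_next = target(s_next).gather(1, a_next).squeeze(1)
            return r + gamma * (1.0 - done) * q_next

    if __name__ == "__main__":
        state_dim, n_actions, batch = 48, 25, 32   # placeholders, e.g. a 5x5 dose grid
        online, target = DuelingQNet(state_dim, n_actions), DuelingQNet(state_dim, n_actions)
        target.load_state_dict(online.state_dict())
        s = torch.randn(batch, state_dim); s2 = torch.randn(batch, state_dim)
        a = torch.randint(n_actions, (batch, 1)); r = torch.randn(batch)
        done = torch.zeros(batch)
        y = double_dqn_targets(online, target, r, s2, done)
        loss = nn.functional.smooth_l1_loss(online(s).gather(1, a).squeeze(1), y)
        loss.backward()
        print(float(loss))

Prioritized experience replay, the third ingredient mentioned in the text, would replace the uniform batch here with samples weighted by temporal-difference error.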
alized multi-cytokine mediation therapy could be promising based control methods, using surrogate measures of anesthetic
for treating sepsis. effect, e.g., the bispectral (BIS) index, as the controlled
2) Anesthesia: Another major drug dosing problem in ICUs variable, has enhanced individualized anesthetic management,
is the regulation and automation of sedation and analgesia, resulting in the overall improvement of patient outcomes when
which is essential in maintaining physiological stability and compared with traditional controlled administration. Moore et
lowering pains of patients. Whereas surgical patients typically al. [204], [205] applied TD Q(λ) in administration of intra-
require deep sedation over a short duration of time, sedation venous propofol in ICU settings, using the well-studied Marsh-
for ICU patients, especially when using mechanical ventila- Schnider pharmacokinetic model to estimate the distribution
tion, can be more challenging [218]. Critically ill patients of drug within the patient, and a pharmacodynamic model
who are supported by mechanical ventilation require adequate for estimating drug effect. The RL method adopted the error
sedation for several days to guarantee safe treatment in the of BIS and estimation of the four compartmental propofol
ICU [235]. A misdosing of sedation or under sedation is concentrations as the input state, different propofol dose as
not acceptable since over sedation can cause hypotension, control actions, and the BIS error as the reward. The method
prolonged recovery time, delayed weaning from mechanical demonstrated superior stability and responsiveness when com-
ventilation, and other related negative outcomes, whereas pared to a well-tuned PID controller. The authors then modeled
under sedation can cause symptoms such as anxiety, agitation the drug disposition system as three states corresponding to
and hyperoxia [213]. the change of BIS, and applied basic Q-learning method to
solving this problem [206], [207]. They also presented the
The regulation of sedation in ICUs using RL methods has first clinical in vivo trial for closed-loop control of anesthesia
attracted attention of researcher for decades. As early as in administration using RL on 15 human volunteers [208], [209].
1994, Hu et al. [70] studied the problem of anesthesia control It was demonstrated that patient specific control of anesthesia
by applying some of the founding principles of RL (the MDP administration with improved control accuracy as compared
formulation and its planning solutions). More recently, RL-
to other studies in the literature could be achieved both in particularly challenging in ICUs. One one hand, higher costs
simulation and the clinical study. occur if unnecessary ventilation is still taking effect, while
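To make the BIS-error formulation used in the propofol studies above concrete, the following toy loop runs tabular Q-learning against a deliberately crude one-compartment surrogate of anesthetic effect; the dynamics, dose grid, discretization, and reward are invented stand-ins, not the Marsh-Schnider model or any of the published controllers.

    import numpy as np

    rng = np.random.default_rng(0)
    doses = np.linspace(0.0, 10.0, 5)          # discrete infusion rates (arbitrary units)
    n_bins, target_bis = 21, 50.0              # BIS discretized into bins; setpoint of 50

    def step(bis, dose):
        """Toy effect model: a higher dose pushes BIS down; not a PK/PD model."""
        bis_next = 0.9 * bis - 1.5 * dose + 10.0 + rng.normal(0, 1.0)
        bis_next = float(np.clip(bis_next, 0.0, 100.0))
        reward = -abs(bis_next - target_bis)   # penalize deviation from the BIS setpoint
        return bis_next, reward

    def to_state(bis):
        return int(np.clip(bis / 100.0 * (n_bins - 1), 0, n_bins - 1))

    Q = np.zeros((n_bins, len(doses)))
    alpha, gamma, eps = 0.1, 0.95, 0.1
    bis = 95.0                                  # start from an "awake" value
    for t in range(20000):
        s = to_state(bis)
        a = rng.integers(len(doses)) if rng.random() < eps else int(Q[s].argmax())
        bis, r = step(bis, doses[a])
        s2 = to_state(bis)
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
    print("greedy dose per BIS bin:", doses[Q.argmax(axis=1)])

The learned table maps each BIS bin to a dose, mirroring the way the cited controllers use the BIS error as both state signal and reward.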
Targeting at both muscle relaxation (paralysis) and Mean premature extubation can give rise to increased risk of mor-
Arterial Pressure (MAP), Sadati et al. [210] proposed an bidity and mortality. Optimal decision making regarding when
RL-based fuzzy controllers architecture in automation of the to wean patients off of a ventilator thus becomes nontrivial
clinical anesthesia. A multivariable anesthetic mathematical since there is currently no consistent clinical opinion on the
model was presented to achieve an anesthetic state using two best protocol for weaning of ventilation [238]. Prasad et al.
anesthetic drugs of Atracurium and Isoflurane. The highlight [218] applied off-policy RL algorithms, FQI-ERT and with
was that the physician’s clinical experience could be incorpo- feed forward neural networks, to determine the best weaning
rated into the design and implementation of the architecture, time of invasive mechanical ventilation, and the associated
to realize reasonable initial dosage and keep drug inputs in personalized sedation dosage. The policies learned showed
safe values. Padmanabhan et al. [213] used a closed-loop promise in recommending weaning protocols with improved
anesthesia controller to regulate the BIS and MAP within outcomes, in terms of minimizing rates of reintubation and
a desired range. Specifically, a weighted combination of the regulating physiological stability. Targeting at the same prob-
error of the BIS and MAP signals is considered in the proposed lem as [218], Jagannatha et al. [221] analyzed the properties
RL algorithm. This reduces the computational complexity of and limitations of standard off-policy evaluation methods in
the RL algorithm and consequently the controller processing RL and discussed possible extensions to them in order to
time. Borera et al. [211] proposed an Adaptive Neural Network improve their utility in clinical domains. More recently, Yu et
Filter (ANNF) to improve RL control of propofol hypnosis. al. applied Bayesian inverse RL [219] and Supervised-actor-
Lowery and Faisal [212] used a continuous AC method critic [220] to learn a suitable ventilator weaning policy from
to first learn a generic effective control strategy based on real trajectories in retrospective ICU data. RL has been also
average patient data and then fine-tune itself to individual used in the development of optimal policy for the ordering of
patients in a personalization stage. The results showed that the lab tests in ICUs [222], [223], and prevention and treatments
reinforcement learner could reduce the dose of administered for graft versus host disease (GVHD) [224], [225] using data
anesthetic agent by 9.4% as compared to a fixed controller, set from the Center for International Bone Marrow Transplant
and keep the BIS error within a narrow, clinically acceptable Research (CIBMTR) registry database.
range 93.9% of the time. More recently, an IRL method
has been proposed that used expert trajectories provided by
V. AUTOMATED MEDICAL DIAGNOSIS
anesthesiologists to train an RL agent for controlling the
concentration of drugs during a global anesthesia [214]. Medical diagnosis is a mapping process from a patient’s
3) Other Applications in Critical Care: While the previous information such as treatment history, current signs and symp-
sections are devoted to two topic-specific applications of RL toms to an accurate clarification of a disease. Being a complex
methods in critical care domains, there are many other more task, medical diagnosis often requires ample medical investi-
general medical problems that perhaps have received less gation on the clinical situations, causing significant cognitive
attention by researchers. One such problem is regarding the burden for clinicians to assimilate valuable information from
medication dosing, particularly heparin dosing, in ICUs. A complex and diverse clinical reports. It has been reported that
recent study by Ghassemi et al. [236] highlighted that the diagnostic error accounts for as high as 10% of deaths and 17%
misdosing of medications in the ICU is both problematic and of adverse events in hospitals [239]. The error-prone process
preventable, e.g., up to two-thirds of patients at the study in diagnosis and the necessity to assisting the clinicians for
institution received a non-optimal initial dose of heparin, due a better and more efficient decision making urgently call
to the highly personal and complex factors that affect the for a significant revolution of the diagnostic process, leading
dose-response relationship. To address this issue, Nemati et to the advent of automated diagnostic era that is fueled by
al. [215] inferred hidden states of patients via discriminative advanced big data analysis and machine learning techniques
hidden Markov model and applied neural FQI to learn optimal [240], [241], [242].
heparin dosages. Lin et al. [216] applied DDPG in continuous Normally formulated as a supervised classification problem,
state-action spaces to learn a better policy for heparin dosing existing machining learning methods on clinical diagnosis
from observational data in MIMIC and the Emory University heavily rely on a large number of annotated samples in order
clinical data. Wang et al. [217] combined supervised sig- to infer and predict the possible diagnoses [243], [244], [245].
nals and reinforcement signals to learn recommendations for Moreover, these methods have limits in terms of capturing
medication dosing involving a large number of diseases and the underlying dynamics and uncertainties in the diagnosing
medications in ICUs. process and considering only a limited number of prediction
Another typical application of RL in ICUs is to develop a labels [246]. To overcome these issues, researchers are in-
decision support tool for automating the process of airway and creasingly interested in formulating the diagnostic inferencing
mechanical ventilation. The need for mechanical ventilation is problem as a sequential decision making process and using RL
required when patients in ICUs suffer from acute respiratory to leverage a small amount of labeled data with appropriate
failure (ARF) caused by various conditions such as cardio- evidence generated from relevant external resources [246]. The
genic pulmonary edema, sepsis or weakness after abdominal existing research can be classified into two main categories,
surgery [237]. The management of mechanical ventilation is according to the type of clinical data input into the learning
process: the structured medical data such as physiological is applied for obtaining optimized diagnostic strategies. The
signals, images, vital signs and lab tests, and the unstructured approach was evaluated on a sample diagnosing problem
data of free narrative text such as laboratory reports, clinical of solitary pulmonary nodule (SPN) and results verified its
notes and summaries. success in improving testing strategies in diagnosis, compared
with several other fixed testing strategies.
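Since the VI algorithm referenced here (and earlier for estimated MDP models) is rarely spelled out, a minimal tabular value-iteration sketch is included for completeness; the randomly generated transition and reward arrays stand in for an estimated model of diagnostic-test selection and carry no clinical meaning.

    import numpy as np

    def value_iteration(P, R, gamma=0.95, tol=1e-6):
        """Tabular VI on P[s, a, s'] and R[s, a]; returns (V, greedy policy)."""
        V = np.zeros(P.shape[0])
        while True:
            Q = R + gamma * P @ V          # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        return V, Q.argmax(axis=1)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        n_states, n_tests = 8, 4           # e.g., coarse evidence states and candidate tests
        P = rng.dirichlet(np.ones(n_states), size=(n_states, n_tests))
        R = rng.normal(size=(n_states, n_tests))   # placeholder cost/benefit of each test
        V, policy = value_iteration(P, R)
        print("state values:", np.round(V, 2))
        print("greedy test per state:", policy)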
A. Structured Medical Data
The most successful application of RL in diagnosis using B. Unstructured Medical Data
structured data pertains to various processing and analysis Unlike the formally structured data that are directly machine
tasks in medical image examination, such as feature extracting, understandable, large proportions of clinical information are
image segmentation, and object detection/localization/tracing stored in a format of unstructured free text that contains a
[247], [248]. Sahba et al. [249], [250], [251], [252] applied relatively more complete picture of associated clinical events
basic Q-learning to the segmentation of the prostate in tran- [3]. Given their expressive and explanatory power, there is
srectal ultrasound images (UI). Liu and Jiang [253] used a great potential for clinical notes and narratives to play a vital
DRL method, Trust Region Policy Optimization (TRPO), for role in assisting diagnosis inference in an underlying clinical
joint surgical gesture segmentation and classification. Ghesu scenario. Moreover, limitations such as knowledge incomplete-
et al. [254] applied basic DQN to automatic landmark de- ness, sparsity and fixed schema in structured knowledge have
tection problems, and achieved more efficient, accurate and motivated researchers to use various kinds of unstructured
robust performance than state-of-the-art machine learning and external resources such as online websites for related medical
deep learning approaches on 2D Magnetic Resonance Images diagnosing tasks [246].
(MRI), UI and 3D Computed Tomography (CT) images. This Motivated by the Text REtrieval Conference-Clinical Deci-
approach was later extended to exploit multi-scale image sion Support (TREC-CDS) track dataset [268], diagnosis infer-
representations for large 3D CT scans [255], and consider in- encing from unstructured clinical text has gained much atten-
complete data [256] or nonlinear multi-dimensional parametric tion among AI researchers recently. Utilizing particular natural
space in MRI scans of the brain region [257]. language processing techniques to extract useful information
Alansary et al. evaluated different kinds of DRL methods from clinical text, RL has been used to optimize the diagnosis
(DQN, Double DQN (DDQN), Duel DQN, and Duel DDQN) inference procedure in several studies. Ling et al. [246], [269]
[12] for anatomical landmark localization in 3D fetal UI [258], proposed a novel clinical diagnosis inferencing approach that
and automatic standard view plane detection [259]. Al and applied DQN to incrementally learn about the most appropriate
Yun [260] applied AC based direct PS method for aortic valve clinical concepts that best describe the correct diagnosis by
landmarks localization and left atrial appendage seed localiza- using evidences gathered from relevant external resources
tion in 3D CT images. Several researchers also applied DQN (from Wikipedia and MayoClinic). Experiments on the TREC-
methods in 3D medical image registration problems [261], CDS datasets demonstrated the effectiveness of the proposed
[262], [263], active breast lesion detection from dynamic approach over several non RL-based systems.
contrast-enhanced MRI [264], and robust vessel centerline Exploiting real datasets from the Breast Cancer Surveillance
tracing problems in multi-modality 3D medical volumes [265]. Consortium (BCSC) [270], Chu et al. [271] presented an
Netto et al. [266] presented an overview of work applying adaptive online learning framework for supporting clinical
RL in medical image applications, providing a detailed illus- breast cancer diagnosis. The framework integrates both su-
tration of particular use of RL for lung nodules classification. pervised learning models for breast cancer risk assessment
The problem of classification is modeled as a sequential and RL models for decision-making of clinical measurements.
decision making problem, in which each state is defined as the The framework can quickly update relevant model parameters
combination of five 3D geometric measurements, the actions based on current diagnosis information during the training
are random transitions between states, and the final goal is process. Additionally, it can build flexible fitted models by
to discover the shortest path from the pattern presented to a integrating different model structures and plugging in the
known target of a malignant or a benign pattern. Preliminary corresponding parameters during the prediction process. The
results demonstrated that the Q-learning method can effec- authors demonstrated that the RL models could achieve ac-
tively classify lung nodules from benign and malignant directly curate breast cancer risk assessment from sequential data and
based on lung lesions CT images. incremental features.
Fakih and Das [267] developed a novel RL-based approach, In order to facilitate self-diagnosis while maintaining rea-
which is capable of suggesting proper diagnostic tests that sonable accuracy, the concept of symptom checking (SC) has
optimize a multi-objective performance criterion accounting been proposed recently. SC first inquires a patient with a
for issues of costs, morbidity, mortality and time expense. series of questions about their symptoms, and then attempts
To this end, some diagnostic decision rules are first extracted to diagnose some potential diseases [245]. Tang et al. [272]
from current medical databases, and then the set of possible formulated inquiry and diagnosis policies as an MDP, and
testing choices can be identified by comparing the state of adopted DQN to learn to inquire and diagnose based on limited
patient with the attributes in the decision rules. The testing patient data. Kao et al. [273] applied context-aware HRL
choices and the combined overall performance criterion then scheme to improve accuracy of SC over traditional systems
serve as inputs to the core RL module and the VI algorithm making a limited number of inquiries. Empirical studies on a
simulated dataset showed that the proposed model drastically (2) Optimal Process Control. RL has also been widely
improved disease prediction accuracy by a significant margin. applied in deriving an optimal control policy in a variety of
The SC system was successfully employed in the DeepQ healthcare situations, ranging from surgical robot operation
Tricorder which won the second prize in the Qualcomm [283], [284], [285], [286], [287], functional electrical stimula-
Tricorder XPRIZE competition in year 2017 [274], [275]. tion (FES) [288], [289], and adaptive rate control for medical
A dialogue system was proposed in [276] for automatic video streaming [290], [291]. Li and Burdick [283] applied
diagnosis, in which the medical dataset was built from a pe- RL to learn a control policy for a surgical robot such that the
diatric department in a Chinese online healthcare community. robot can conduct some basic clinical operations automatically.
The dataset consists of self-reports from patients and conver- A function approximation based IRL method was used to
sational data between patients and doctors. A DQN approach derive an optimal policy from experts’ demonstrations in high
was then used to train the dialogue policy. Experiment results dimensional sensory state space. The method was applied to
showed that the RL-based dialogue system was able to collect the evaluation of surgical robot operators in three clinical tasks
symptoms from patients via conversation and improve the of knot tying, needling passing and suturing. Thananjeyan et
accuracy for automatic diagnosis. In order to increase the al. [284] and Nguyen et al. [285] applied DRL algorithm,
efficiency of the dialogue systems, Tang et al. [277] applied TRPO, in learning tensioning policies effectively for surgical
DQN framework to train an efficient dialogue agent to sketch gauze cutting. Chen et al. [286] combined programming by
disease-specific lexical probability distribution, and thus to demonstration and RL for motion control of flexible manipu-
converse in a way that maximizes the diagnosis accuracy and lators in minimally invasive surgical performance, while Baek
minimizes the number of conversation turns. The dialogue et al. [287] proposed the use of RL to perform resection
system was evaluated on the mild cognitive impairment di- automation of cholecystectomy by planning a path that avoids
agnosis from a real clinical trial, and results showed that the collisions in a laparoscopic surgical robot system.
RL-driven framework could significantly outperform state-of- FES employs neuroprosthesis controllers to apply electrical
the-art supervised learning approaches using only a few turns current to the nerves and muscles of individuals with spinal
of conversation. cord injuries for rehabilitative movement [292]. RL has been
used to calculate stimulation patterns to efficiently adapt the
VI. OTHER HEALTHCARE DOMAINS
Besides the above applications of RL in DTR design and patients’ preferences and reaching dynamics. AC-based control
automated medical diagnosis, there are many other case appli- strategies [293], [294], [289] were proposed to evaluate target-
cations in broader healthcare domains that focus on problems oriented task performed using a planar musculoskeletal human
specifically in health resource scheduling and allocation, opti- arm in FES. To solve the reward learning problem in large state
mal process control, drug discovery and development, as well spaces, an IRL approach was proposed in [288] to evaluate the
as health management. effect of rehabilitative stimulations on patients with spinal cord
(1) Health Resource Scheduling and Allocation. The health- injuries based on the observed patient motions.
care system is a typical service-oriented system where cus- RL-based methods have also been widely applied in adap-
tomers (e.g., patients) are provided with service using limited tive control in mobile health medical video communication
resources, e.g. the time slots, nursing resources or diagnostic systems. For example, Istepanian et al. [290] proposed a
devices [278]. Business process management (BPM) plays new rate control algorithm based on Q-learning that satisfies
a key role in such systems as the objective of the service medical quality of service requirements in bandwidth demand-
provider is to maximize profit overtime, considering various ing situations of ultrasound video streaming. Alinejad [291]
customer classes and service types with dynamics or uncer- applied Q-learning for cross-layer optimization in real-time
tainties such as cancellations or no-shows of patients [279], medical video streaming.
[280]. Since the optimal resource allocation problem in BPM (3) Drug Discovery and Development. Drug discovery and
can be seen as a sequential decision making problem, RL is development is a time-consuming and costly process that
then naturally suitable for offering reasonable solutions. Huang usually lasts for 10-17 years, but with as low as around 10%
et al. [279] formulated the allocation optimization problems in overall probability of success [295]. To search an effective
BPM as an MDP and used basic Q-learning algorithm to derive molecule that meets the multiple criteria such as bioactivity
an optimal solution. The RL-based approach was then applied and synthetic accessibility in a prohibitively huge synthetically
to address the problem of optimizing resource allocation in ra- feasible molecule space is extremely difficult. By using com-
diology CT-scan examination process. A heuristic simulation- putational methods to virtually design and test molecules, de
based approximate DP approach was proposed in [278], which novo design offers ways to facilitate cycle of drug development
considered both stochastic service times and uncertain future [296]. It is until recent years that RL methods have been
arrival of clients. The experimental investigation using data applied in various aspects of de novo design for drug discovery
from the radiological department of a hospital indicated an and development. Olivecrona [297] used RL to fine tune
increases of 6.9% in the average profit of the hospital and 9% the recurrent neural network in order to generate molecules
in the number of examinations. Gomes [281] applied a DRL with certain desirable properties through augmented episodic
method, Asynchronous Advantage Actor Critic (A3C) [282], likelihood. Serrano et al. [298] applied DQN to solve the
to schedule appointments in a set of increasingly challenging proteinligand docking prediction problem, while Neil et al.
environments in primary care systems. [299] investigated the PPO method in molecular generation.
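The PPO method mentioned here (and earlier for sepsis treatment) centers on a clipped surrogate objective that is easy to state in code. The snippet below shows that objective in PyTorch for a discrete-action policy; the toy network, state dimension, and advantage values are assumptions for illustration only.

    import torch
    import torch.nn as nn

    policy = nn.Sequential(nn.Linear(10, 64), nn.Tanh(), nn.Linear(64, 4))  # toy 10-dim state, 4 actions

    def ppo_clip_loss(states, actions, old_log_probs, advantages, eps=0.2):
        """PPO clipped surrogate: E[min(r*A, clip(r, 1-eps, 1+eps)*A)], r = pi/pi_old."""
        logits = policy(states)
        log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
        ratio = torch.exp(log_probs - old_log_probs)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
        return -torch.min(unclipped, clipped).mean()

    if __name__ == "__main__":
        n = 32
        s = torch.randn(n, 10)
        a = torch.randint(0, 4, (n,))
        with torch.no_grad():   # pretend these came from the behaviour (old) policy
            old_lp = torch.distributions.Categorical(logits=policy(s)).log_prob(a)
        adv = torch.randn(n)
        loss = ppo_clip_loss(s, a, old_lp, adv)
        loss.backward()
        print(float(loss))

The clipping keeps each update close to the data-collecting policy, which is one reason PPO is attractive when samples (molecules or patient trajectories) are expensive.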
More recently, Popova et al. [300] applied DRL methods should be defined in a way that to the most approximates the
to generate novel targeted chemical libraries with desired behavior policy that has generated such data [305], [306].
properties. However, data in medical domains often exhibit notable
(4) Health Management. As a typical application domain, biases or noises that are presumably varying among differ-
RL has also been used in adaptive interventions to support ent clinicians, devices, or even medical institutes, reflecting
health management such as promoting physical activities for comparable inter-patient variability [92]. For some complex
diabetic patients [301], [302], or weight management for diseases, clinicians still face inconsistent guides in selecting
obesity patients [303], [304]. In these applications, throughout exact data as the state in a given case [191]. In addition,
continuous monitoring and communication of mobile health, the notorious issue of missing or incomplete data can further
personalized intervention policies can be derived to input the exaggerate the problem of data collection and state represen-
monitored measures and output when, how and which plan tation in medical settings, where the data can be collected
to deliver. A notable work was by Yom et al. [301], who from patients who may fail to complete the whole trial, or the
applied RL to optimize messages sent to the users, in order number of treatment stages or timing of initializing the next
to improve their compliance with the activity plan. A study line of therapy is flexible. This missing or censoring data will
of 27 sedentary diabetes type 2 patients showed that partici- tend to increase the variance of estimates of the value function
pants who received messages generated by the RL algorithm and thus the policy in an RL setting. While the missing data
increased the amount of activity and pace of walking, while problem can be generally solved using various imputation
the patients using static policy did not. Patients assigned to methods that sample several possible values from the esti-
the RL algorithm group experienced a superior reduction in mated distribution to fill in missing values, the censoring data
blood glucose levels compared to the static control policies, problem is far more challenging, calling for more sophisticated
and longer participation caused greater reductions in blood techniques for state representation and value estimation in such
glucose levels. flexible settings [96], [97].
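As a concrete illustration of the multiple-imputation idea raised here, the snippet below draws several completed copies of a toy measurement matrix by sampling from an imputation model's posterior; the synthetic data and the choice of scikit-learn's IterativeImputer are illustrative assumptions, not the procedures used in the cited studies.

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the estimator)
    from sklearn.impute import IterativeImputer

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                 # toy matrix of vitals/labs
    X[rng.random(X.shape) < 0.2] = np.nan         # roughly 20% missing entries

    # Multiple imputation: draw M completed datasets by sampling from the
    # predictive posterior, then fit/evaluate the downstream policy on each.
    completed = [
        IterativeImputer(sample_posterior=True, random_state=m).fit_transform(X)
        for m in range(5)
    ]
    print([np.round(c.mean(), 3) for c in completed])  # estimates vary across imputations

Pooling results across the completed datasets propagates imputation uncertainty into the value and policy estimates, instead of treating a single filled-in dataset as ground truth.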
Most existing work defines the states over the processed
medical data with raw physiological, pathological, and de-
VII. CHALLENGES AND OPEN ISSUES
mographics information, either using simple discretization
The content above has summarized the early endeavors and methods to enable storage of value function in tabular form,
continuous progress of applying RL in healthcare over the or using some kinds of function approximation models (e.g.,
past decades. Focus has been given to the vast variety of linear models or deep neural models). While this kind of
application domains in healthcare. While notable success has state representation is simple and easy to implement, the rich
been obtained, the majority of these studies simply applied temporal dependence or causal information, which is the key
existing naive RL approaches in solving healthcare problems feature of medical data, can be largely neglected [307]. To
in a relatively simplified setting, thus exhibiting some common solve this issue, various probabilistic graphical models [308]
shortcomings and practical limitations. This section discusses can be used to allow temporal modeling of time series medical
several challenges and open issues that have not been properly data, such as dynamic Bayesian networks (DBNs), in which
addressed by the current research, from perspectives of how nodes correspond to the random variables of interest, edges
to deal with the basic components in RL (i.e., formulation indicate the relationship between these random variables, and
of states, actions and rewards, learning with world models, additional edges model the time dependency. These kinds
and evaluation of policies), and fundamental theoretical issues of graphical models have the desirable property that allows
in traditional RL research (i.e., the exploration-exploitation for interpretation of interactions between state variables or
tradeoff and credit assignment problem). between states and actions, which is not the case for other
methods such as SVMs and neural networks.
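One lightweight way to realize the temporally informed state abstraction discussed here is to fit a latent-variable sequence model to the raw measurements and hand its inferred discrete states to the RL learner. The sketch below does this with a Gaussian HMM from the hmmlearn package; the package choice, the five-state assumption, and the synthetic trajectories are illustrative stand-ins for the richer graphical models (e.g., DBNs) advocated in the text.

    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    rng = np.random.default_rng(0)
    # Toy multivariate vital-sign trajectories for 20 "patients", 24 time steps each.
    sequences = [rng.normal(size=(24, 4)) + i for i in range(20)]
    X = np.vstack(sequences)
    lengths = [len(s) for s in sequences]

    # Fit a 5-state HMM and use the inferred discrete state as the RL state signal.
    hmm = GaussianHMM(n_components=5, covariance_type="diag", n_iter=50, random_state=0)
    hmm.fit(X, lengths)
    states = hmm.predict(X, lengths)
    print("inferred state sequence of the first patient:", states[:24])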
Coupled with the state representation in RL is the formu-
A. State/Action Engineering lation of actions. The majority of existing work has mainly
The first step in applying RL to a healthcare problem is focused on discretization of the action space into limited bins
determining how to collect and pre-process proper medical of actions. Although this formulation is quite reasonable in
data, and summarize such data into some manageable state some medical settings, such as choices in between turning on
representations in a way that sufficient information can be ventilation or weaning off it, there are many other situations
retained for the task at hand. Selecting the appropriate level where actions are by themselves continuous/multidimensional
of descriptive information contained in the states is extremely variables. While the simplification of discretizing medicine
important. On one hand, it would be better to contain as de- dosage is necessary in the early proof-of-concept stage, re-
tailed information as possible in the states, since this complete alizing fully continuous dosing in the original action space
information can provide a greater distinction among patients. is imperative in order to meet the commitments of precision
On the other hand, however, increasing the state space makes medicine [309]. There has been a significant achievement in
the model become more difficult to solve. It is thus essential the continuous control using AC methods and PS methods in
that a good state representation include any compulsory factors the past years, particularly from the area of robotic control
or variables that causally affect both treatment decisions and [20] and DRL [12]. While this achievement can provide
the outcomes. Previous studies have showed that, to learn an direct solutions to this problem, selecting the action over
effective policy through observational medical data, the states large/infinite space is still non-trivial, especially when dealing
with any sample complexity guarantees (PAC). An effective or lab tests ordering in ICUs. However, all these studies still
method for efficient action selection in continuous and high focus on very limited application scenarios where only static
dimensional action spaces, while at the same time maintaining preferences or fixed objectives were considered. In a medical
low exploration complexity of PAC guarantees would extend context, the reward function is usually not a fixed term but
the applicability of current methods to more sample-critical subject to changing with regard to a variety of factors such
medical problems. as the time, the varying clinical situations and the evolving
physiopsychic conditions of the patients. Applying PRL and
MORL related principles to broader domains and considering
B. Reward Formulation the dynamic and evolving process of patients’ preferences and
Among all the basic components, the reward may be at the treatment objectives is still a challenging issue that needs to
core of an RL process. Since it encodes the goal information be further explored.
of a learning task, a proper formulation of reward functions A more challenging issue is regarding the inference of
plays the most crucial role in the success of RL. However, the reward functions directly from observed behaviors or clinical
majority of current RL applications in healthcare domains are data. While it is straightforward to formulate a reward func-
still grounded on simple numerical reward functions that must tion, either quantitatively or qualitatively, and then compute the
be explicitly defined beforehand to indicate the goal of treat- optimal policy using this function, it is sometimes preferable
ments by clinicians. It is true that in some medical settings, to directly estimate the reward function of experts from a set
the outcomes of treatments can be naturally generated and of presumably optimal treatment trajectories in retrospective
explicitly represented in a numerical form, for example, the medical data. Imitation learning, particularly, IRL [53], [54], is
time elapsed, the vitals monitored, or the mortality reduced. In one of the most feasible approaches to infer reward functions
general, however, specifying such a reward function precisely given observations of optimal behaviour. However, applying
is not only difficult but sometimes even misleading. For IRL in clinical settings is not straightforward, due to the inher-
instance, in treatment of cancers [83], the reward function was ent complexity of clinical data and its associated uncertainties
usually decomposed into several independent or contradictory during learning. The variance during the policy learning and
components based on some prior domain knowledge, each reward learning can amplify the bias in each learning process,
of which was mapped into some integer numbers, e.g., -60 potentially leading to divergent solutions that can be of little
as a high penalty for patient death and +15 as a bonus for use in practical clinical applications [312], [131].
a cured patient. Several threshold and weighting parameters Last but not the least, while it is possible to define a
were needed to provide a way for trading-off efficacy and short-term reward function at each decision step using prior
toxicity, which heavily rely on clinicians’ personal experience human knowledge, it would be more reasonable to provide
that varies from one to another. This kind of somewhat a long-term reward only at the end of a learning episode.
arbitrary quantifications might have significant influence on This is especially the case in healthcare domains where the
the final learned therapeutic strategies and it is unclear how real evaluation outcomes (e.g., decease of patients, duration
changing these numbers can affect the resulting strategies. of treatment) can only be observed at the end of treatment.
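The sensitivity concern can be made concrete with a few lines of arithmetic: sweeping the weight that trades efficacy against toxicity changes which candidate regimen a scalarized reward prefers. The outcome numbers below are invented solely to illustrate the point.

    import numpy as np

    # Hypothetical per-regimen outcome estimates: (efficacy score, toxicity score).
    regimens = {"A": (0.80, 0.60), "B": (0.65, 0.30), "C": (0.50, 0.10)}

    for w_tox in (0.2, 0.5, 1.0):                       # weight on the toxicity penalty
        scores = {k: eff - w_tox * tox for k, (eff, tox) in regimens.items()}
        best = max(scores, key=scores.get)
        print(f"toxicity weight {w_tox:.1f} -> preferred regimen {best} {scores}")

Under a weight of 0.2 the aggressive regimen A wins, whereas a weight of 1.0 flips the preference to the conservative regimen C, which is exactly the kind of dependence on hand-tuned constants criticized above.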
To conquer the above limitations, one alternative is to pro- Learning with sparse rewards is a challenging issue that has
vide the learning agent with more qualitative evaluations for attracted much attention in recent RL research. A number of
actions, turning the learning into a PRL problem [310]. Unlike effective approaches have been proposed, such as the hindsight
the standard RL approaches that are restricted to numerical and experience replay [313], the unsupervised auxiliary learning
quantitative feedback, the agent’s preferences instead can be [314], the imagination-augmented learning [315], and the
represented by more general types of preference models such reward shaping [234]. While there have been several studies
as ranking functions that sort states, actions, trajectories or that address the sparse reward problem in healthcare domains,
even policies from most to least promising [51]. Using such most of these studies only focus on DTRs with a rather short
kind of ranking functions has a number of advantages as they horizon (typically three or four steps). Moreover, previous
are more natural and easier to acquire in many applications work has showed that entirely ignoring short-term rewards
in clinical practice, particularly, when it is easier to require (e.g. maintaining hourly physiologic blood pressure for sepsis
comparisons between several, possibly suboptimal actions or patients) could prevent from learning crucial relationships
trajectories than to explicitly specify their performance. More- between certain states and actions [307]. How to tackle sparse
over, considering that the medical decisions always involve reward learning with a long horizon in highly dynamic clinical
two or more related or contradictory aspects during treatments environments is still a challenging issue in both theoretical and
such as benefits versus associated cost, efficacy versus toxicity, practical investigations of RL in healthcare.
and efficiency versus risk, it is natural to shape the learning
problem into a multi-objective optimization problem. MORL
techniques [50] can be applied to derive a policy that makes C. Policy Evaluation
a trade-off between distinct objectives in order to achieve a The process of estimating the value of a policy (i.e., target
Pareto optimal solution. Currently, there are only very limited policy) with data collected by another policy (i.e., behavior
studies in the literature that applied PRL [52], [89], [178] policy) is called off-policy evaluation problem [14]. This
and MORL [177], [311], [176], [222] in medical settings, for problem is critical in healthcare domains because it is usually
optimal therapy design in treatment of cancer, schizophrenia infeasible to estimate policy value by running the policy
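To make the above discussion concrete, the sketch below contrasts a hand-crafted composite reward of the kind described for cancer treatment with a sparse, terminal-only alternative. All numerical values and weights (the death penalty, the cure bonus, the toxicity and tumour-size weights) are illustrative assumptions in the spirit of the numbers quoted above, not the exact specification used in [83].

```python
# Illustrative only: a hand-crafted composite reward versus a sparse terminal reward.
# The weights and thresholds are hypothetical stand-ins for the clinician-chosen
# parameters discussed in the text, not the values used in the cited work.

def composite_step_reward(toxicity, tumour_size, died, cured,
                          w_tox=1.0, w_tumour=1.5):
    """Per-step reward trading off efficacy (tumour shrinkage) against toxicity."""
    reward = -w_tox * toxicity - w_tumour * tumour_size
    if died:
        reward -= 60.0   # high penalty for patient death
    if cured:
        reward += 15.0   # bonus for a cured patient
    return reward

def sparse_terminal_reward(died, cured, at_episode_end):
    """Alternative: no intermediate signal; the outcome is revealed only at the end."""
    if not at_episode_end:
        return 0.0
    if died:
        return -60.0
    return 15.0 if cured else 0.0
```

Because every weight in the composite version encodes a clinical judgement, small changes to these numbers can shift the learned policy in ways that are hard to anticipate, which is precisely the fragility discussed above; the terminal-only version avoids this arbitrariness but exposes the agent to the sparse-reward problem.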
C. Policy Evaluation
The process of estimating the value of a policy (i.e., the target policy) with data collected by another policy (i.e., the behavior policy) is called the off-policy evaluation problem [14]. This problem is critical in healthcare domains because it is usually infeasible to estimate policy value by running the policy directly on the target populations (i.e., patients), due to the high cost of experiments, the uncontrolled risks of treatments, or simply ethical and legal concerns. Thus, it is necessary to estimate how the learned policies might perform on retrospective data before testing them in real clinical environments. While there is a large volume of work in the RL community that focuses on importance sampling (IS) techniques and on how to trade off bias and variance in IS-based off-policy evaluation estimators (e.g., [316]), simply adopting these estimators in healthcare settings might be unreliable due to issues of sparse rewards or large policy discrepancy between RL learners and physicians. Using sepsis management as a running example, Gottesman et al. [305] discussed in detail why evaluation of policies using retrospective health data is a fundamentally challenging issue. They argued that any inappropriate handling of the state representation, of the variance of IS-based statistical estimators, or of confounders in more ad-hoc measures would result in unreliable or even misleading estimates of the quality of a treatment policy. The estimation quality of off-policy evaluation is critically dependent on how precisely the behaviour policy is estimated from the data, and on whether the action probabilities under the approximated behaviour policy model represent the true probabilities [306]. While the main reasons have been largely unveiled, there is still little work on effective policy evaluation methods in healthcare domains. One recent work is by Li et al. [201], who provided an off-policy POMDP learning method that takes uncertainty and history information into account in clinical applications. Trained on real ICU data, the proposed policy was capable of dictating near-optimal dosages of vasopressors and intravenous fluids in a continuous action space for sepsis patients.
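As a minimal illustration of the IS-based estimators mentioned above, the sketch below computes ordinary importance sampling (unbiased but high variance) and weighted importance sampling (biased but lower variance) on logged clinical trajectories. The trajectory format, the policy interfaces and the discount factor are illustrative assumptions, not the estimators analysed in [316] or [305]; it also assumes the behaviour policy assigns nonzero probability to every logged action.

```python
# Sketch of per-trajectory importance sampling (IS) and weighted IS (WIS) for
# off-policy evaluation. A trajectory is a list of (state, action, reward) tuples;
# target_policy and behavior_policy are callables returning P(action | state).
import numpy as np

def trajectory_weight(traj, target_policy, behavior_policy):
    """Product of per-step likelihood ratios pi_target(a|s) / pi_behavior(a|s)."""
    rho = 1.0
    for state, action, _ in traj:
        rho *= target_policy(action, state) / behavior_policy(action, state)
    return rho

def discounted_return(traj, gamma=0.99):
    return sum((gamma ** t) * r for t, (_, _, r) in enumerate(traj))

def ope_is_wis(trajectories, target_policy, behavior_policy, gamma=0.99):
    weights = np.array([trajectory_weight(t, target_policy, behavior_policy)
                        for t in trajectories])
    returns = np.array([discounted_return(t, gamma) for t in trajectories])
    is_estimate = float(np.mean(weights * returns))                    # unbiased, high variance
    wis_estimate = float(np.sum(weights * returns) / np.sum(weights))  # biased, lower variance
    return is_estimate, wis_estimate
```

When the learned policy differs substantially from clinicians' behaviour, or when rewards are sparse and terminal, the trajectory weights degenerate (a few trajectories dominate the estimate), which is one concrete form of the unreliability discussed above.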
D. Model Learning
Among the efficient techniques described in Section II-B, model-based methods enable improved sample efficiency over model-free methods by learning a model of the transition and reward functions of the domain on-line and then planning a policy using this model [43]. It is surprising that there are quite limited model-based RL methods applied in healthcare in the current literature [196], [193], [197]. While a number of model-based RL algorithms have been proposed and investigated in the RL community (e.g., R-max [25], E3 [317]), most of these algorithms assume that the agent operates in small domains with a discrete state space, which contradicts healthcare domains that usually involve multi-dimensional, continuously valued states and actions. Learning and planning over such large-scale continuous models poses additional challenges for existing model-based methods [43]. A more difficult problem is to develop efficient exploration strategies in continuous action/state spaces [29]. By deriving a finite representation of the system that allows both efficient planning and intelligent exploration, it may be possible to solve the challenging model learning tasks in healthcare systems more efficiently than contemporary RL algorithms do.
E. Exploration Strategies
Exploration plays a core role in RL, and a large amount of effort has been devoted to this issue in the RL community, with a wealth of exploration strategies proposed in the past decades. Surprisingly, the majority of existing RL applications in healthcare domains simply adopt simple heuristic-based exploration strategies (e.g., the ε-greedy strategy). While this way of handling the exploration dilemma has achieved notable success, it becomes infeasible when dealing with more complicated dynamics and larger state/action spaces in medical settings, causing either a large sample complexity or an asymptotic performance far from the optimum. In particular, when only a rather small percentage of the state space is reachable, naive exploration over the entire space is quite inefficient. This problem becomes even more challenging in continuous state/action spaces, for instance, in the setting of HIV treatment [142], where the basin of attraction of the healthy state is rather small compared to that of the unhealthy state. It has been shown that traditional exploration methods are unable to obtain obvious performance improvements or to generate any meaningful treatment strategy even after a long period of search over the whole space [150], [151]. Therefore, there is a justifiable need for strategies that can identify dynamics during learning or utilize a performance measure to explore smartly in high-dimensional spaces. In recent years, several more advanced exploration strategies have been proposed, such as PAC-guaranteed exploration methods targeting continuous spaces [150], [151], concurrent exploration mechanisms [318], [319], [320], and exploration in deep RL [321], [322], [323]. It is thus imperative to incorporate such exploration strategies in more challenging medical settings, not only to decrease the sample complexity significantly, but more importantly to seek out new treatment strategies that have not been discovered before.
Another aspect of applying exploration strategies in healthcare domains is the consideration of the true cost of exploration. Within the vanilla RL framework, whenever an agent explores an inappropriate action, the consequent penalty acts as a negative reinforcement to discourage the wrong action. Although this procedure is appropriate for most situations, it may be problematic in environments where the consequences of wrong actions are not limited to bad performance but can result in unrecoverable effects. This is obviously true when dealing with patients in healthcare domains: although we can reset a robot when it has fallen down, we cannot bring a patient back to life after a fatal medical treatment. Consequently, methods for safe exploration, which preclude unwanted and unsafe actions, are of great real-world interest in medical settings [32], [33], [324].
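The sketch below combines the naive ε-greedy rule mentioned above with a hard safety filter, so that random exploration is confined to a clinically admissible action set. The admissibility check is a hypothetical placeholder for domain-specific constraints (e.g., dose limits), not a method from the surveyed literature, and is only a crude stand-in for the principled safe-exploration approaches of [32], [33], [324].

```python
# A minimal sketch of epsilon-greedy action selection restricted to an
# admissible (safe) action set. The admissibility predicate is hypothetical.
import random

def safe_epsilon_greedy(q_values, state, is_admissible, epsilon=0.1):
    """q_values: dict mapping (state, action) -> estimated value.
    is_admissible: callable (state, action) -> bool encoding safety constraints."""
    actions = [a for (s, a) in q_values if s == state and is_admissible(state, a)]
    if not actions:
        raise ValueError("no admissible action for this state")
    if random.random() < epsilon:
        return random.choice(actions)                              # explore, but only safely
    return max(actions, key=lambda a: q_values[(state, a)])        # exploit
```

More principled safe-exploration methods replace the hard filter with an explicit risk estimate or with formal worst-case guarantees, which is the direction advocated above.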
F. Credit Assignment
Another important aspect of RL is the credit assignment problem, which concerns deciding when an action, or which actions, are responsible for the learning outcome after a sequence of decisions. This problem is critical because whether an action is "good" or "bad" usually cannot be decided right away, but only once the final goal has been achieved by the agent. As each action at each step contributes more or less to the final success or failure, it is necessary to give distinct credit to the actions along the whole path, giving rise to the difficult temporal credit assignment problem. A related problem is the structural credit assignment problem, in which feedback must be distributed over multiple candidates (e.g., multiple concurrently learning agents, action choices, or structural representations of the agent's policy).
The temporal credit assignment problem is more prominent in healthcare domains, as the effects of treatments can be highly varied or delayed. Traditional RL research tackles the credit assignment problem using simple heuristics such as eligibility traces, which weigh past actions according to how much time has elapsed (i.e., the backward view), or discount factors, which weigh future events according to how far away they will happen (i.e., the forward view) [14]. These fixed and simplified heuristics are incapable of modelling more complex interaction modes in a medical situation. As a running example, when explaining changes in the blood glucose of a person with type 1 diabetes mellitus [325], it is difficult to assign credit between the two actions of doing exercise in the morning and taking insulin after lunch, both of which can potentially cause hypoglycemia in the afternoon. Since many factors affect blood glucose and their effects can take place after many hours (e.g., moderate exercise can lead to heightened insulin sensitivity for up to 22 hours), simply assigning an eligibility trace that decays with elapsed time is unreasonable, misleading or even incorrect. How to model the time-varying causal relationships in healthcare and incorporate them into the learning process is therefore a challenging issue that requires more investigation. The abundant literature on causal explanation [326] and inference [327] can be introduced to provide a more powerful causal reasoning tool for the learning algorithm. By producing hypothesized sequences of causal mechanisms that seek to explain or predict a set of real or counterfactual events which have been observed or manipulated [328], not only can the learning performance potentially be improved, but more explainable learned strategies can also be derived, which is ultimately important in healthcare domains.
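For reference, the backward-view eligibility-trace heuristic discussed above can be written in a few lines of tabular TD(λ); the sketch below uses illustrative state encodings and parameter values and omits actions for brevity, so it is a sketch of the mechanism rather than any method from the surveyed studies.

```python
# A minimal sketch of the backward-view eligibility-trace update (tabular TD(lambda)):
# every visited state keeps a trace that decays geometrically, so past states
# receive credit for the current TD error in proportion to how recently they occurred.
from collections import defaultdict

def td_lambda_episode(episode, values, alpha=0.1, gamma=0.95, lam=0.8):
    """episode: list of (state, reward, next_state) transitions.
    values: dict state -> estimated value, updated in place and returned."""
    traces = defaultdict(float)
    for state, reward, next_state in episode:
        td_error = reward + gamma * values.get(next_state, 0.0) - values.get(state, 0.0)
        traces[state] += 1.0                       # mark the visited state
        for s in list(traces):
            values[s] = values.get(s, 0.0) + alpha * td_error * traces[s]
            traces[s] *= gamma * lam               # credit decays with elapsed time
    return values
```

In the diabetes example above, such a purely time-decaying trace would assign most of the hypoglycemia penalty to the most recent action (the post-lunch insulin) even when morning exercise is the main cause, which is exactly the failure mode that motivates causal credit assignment.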
VIII. FUTURE PERSPECTIVES
We have discussed a number of major challenges and open issues raised by current applications of RL techniques in healthcare domains. Properly addressing these issues is of great importance in facilitating the adoption of any medical procedure or clinical strategy using RL. Looking into the future, there is an urgent need to bring recent developments in both the theories and the techniques of RL together with emerging clinical requirements in practice, so as to generate novel solutions that are more interpretable, robust, safe, practical and efficient. In this section, we briefly discuss some of the future perspectives that we envision as the most critical towards realizing such ambitions. We mainly focus on three theoretical directions: the interpretability of learned strategies, the integration of human or prior domain knowledge, and the capability of learning from small data. Healthcare under ambient intelligence and real-life applications are advocated as two main practical directions for RL applications in the coming age of intelligent healthcare.
A. Interpretable Strategy Learning
Perhaps one of the most profound issues with modern machine learning methods, including RL, is the lack of clear interpretability [329]. Usually functioning as a black box expressed by, for instance, deep neural networks, models using RL methods receive a set of data as input and directly output a policy that is difficult to interpret. Although impressive success has been achieved in solving challenging problems such as learning to play Go and Atari games, the lack of interpretability renders the policies unable to reveal the real correlation between features in the data and specific actions, or to impose and verify certain desirable policy properties, such as worst-case guarantees or safety constraints, for further policy debugging and improvement [330]. These limits therefore greatly hinder the adoption of RL policies in safety-critical applications such as medical domains, as clinicians are unlikely to try new treatments without rigorous validation of their safety, correctness and robustness [37], [331].
Recently, there has been growing interest in addressing the problem of interpretability in RL algorithms. There are a variety of ways to realize interpretability of a learned policy: using small, closed-form formulas to compute index-based policies [332]; using program synthesis to learn higher-level symbolic interpretable representations of learned policies [333]; utilizing genetic programming for interpretable policies represented by compact algebraic equations [36]; or using program verification techniques to verify certain properties of programs represented as decision trees [37]. There has also been growing attention not only to developing interpretable representations, but also to generating explicit explanations for sequential decision making problems [334]. While several works have specifically focused on the interpretability of deep models in healthcare settings [335], [336], how to develop interpretable RL solutions that increase the robustness, safety and correctness of learned strategies in healthcare domains is still an unsolved issue that calls for further investigation.
B. Integration of Prior Knowledge
There is a wealth of prior knowledge in healthcare domains that can be used to improve learning performance. The integration of such prior knowledge can be conducted in different manners: through the configuration or presentation of learning parameters, components or models [337], [135]; through knowledge transfer from different individual patients, subtypes/sub-populations or clinical domains [147]; or by enabling human-in-the-loop interactive learning [338].
Gaweda et al. [337], [135] presented an approach to the management of anemia that incorporates critical prior knowledge about the dose-response characteristic into the learning approach: for all patients, it is known that the dose-response curve of HGB vs. EPO is monotonically non-decreasing. Thus, if a patient's response is evaluated as insufficient for a particular dose at a particular state, then the physician knows that the optimal dose for that state is definitely higher than the administered one. Consequently, there is no need to explore the benefit of lower doses at further stages of treatment. To capture this feature, the authors introduced an additional mechanism into the original Q-learning algorithm so that the information about the monotonically increasing character of the HGB vs. EPO curve can be incorporated in the update procedure. This modification has been shown to make the EPO dosing faster and more efficient.
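The sketch below illustrates one way such a monotonicity prior could be wired into a tabular Q-learning update, assuming discretized dose levels ordered from low to high. The pruning rule, the class interface and all parameters are an assumption-laden reconstruction in the spirit of [337], [135], not the authors' exact mechanism.

```python
# Illustrative sketch: tabular Q-learning over ordered dose levels, with a
# monotonicity prior that stops exploring doses at or below one judged insufficient.
from collections import defaultdict

class MonotonicDoseQLearner:
    def __init__(self, dose_levels, alpha=0.1, gamma=0.9):
        self.doses = sorted(dose_levels)           # actions ordered low -> high
        self.alpha, self.gamma = alpha, gamma
        self.q = defaultdict(float)                # (state, dose) -> estimated value
        self.min_useful_dose = defaultdict(lambda: self.doses[0])

    def admissible(self, state):
        return [d for d in self.doses if d >= self.min_useful_dose[state]]

    def update(self, state, dose, reward, next_state, response_insufficient):
        if response_insufficient:
            # Prior knowledge: response is non-decreasing in dose, so the optimal
            # dose must exceed the one just tried; prune lower doses for this state.
            idx = self.doses.index(dose)
            if idx + 1 < len(self.doses):
                self.min_useful_dose[state] = max(self.min_useful_dose[state],
                                                  self.doses[idx + 1])
        best_next = max(self.q[(next_state, d)] for d in self.admissible(next_state))
        target = reward + self.gamma * best_next
        self.q[(state, dose)] += self.alpha * (target - self.q[(state, dose)])
```

The design choice is simply to shrink the admissible action set as clinical prior knowledge accumulates, which reduces wasted exploration without changing the underlying Q-learning update.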
While transfer learning has been extensively studied in the agent learning community [46], there is quite limited work on applying TRL techniques in healthcare settings. Learning performance in a target task can potentially be facilitated by using latent variable models, pre-trained model parameters from past tasks, or a directly learned mapping between past and target tasks, thus extending personalized care to groups of patients with similar diagnoses. Marivate et al. [145] highlighted the potential benefit of taking into account individual variability and data limitations when performing batch policy evaluation for new individuals in HIV treatment. A recent approach to TRL using latent variable models was proposed by Killian et al. [146], [147], who used a Gaussian Process latent variable model for HIV treatment, both to infer the transition dynamics within a task instance and to support transfer between task instances.
Another way of integrating prior knowledge into an RL process is to make use of human cognitive abilities or domain expertise to guide, shape, evaluate or validate the agent's learning process, turning traditional RL into a human-in-the-loop interactive RL problem [339]. Human knowledge-driven RL methods can be of great interest for problems in healthcare domains, where traditional learning algorithms could fail due to issues such as insufficient training samples, complex and incomplete data, or an unexplainable learning process [338]. Consequently, integrating humans (i.e., doctors) into the learning process, and combining an expert's knowledge with automatic learning from data, would greatly enhance the knowledge discovery process [340]. While there is some previous work from other domains, particularly in the training of robots [341], [342], human-in-the-loop interactive RL is not yet well established in the healthcare domain. It remains open for future research to transfer the insights from existing studies into the healthcare domain to ensure successful applications of existing RL methods.
C. Learning from Small Data
There is no doubt that the most recent progress in RL, particularly DRL, is highly dependent on the premise of a large number of training samples. While this is quite reasonable conceptually (we cannot learn new things that we have not tried sufficiently often), there still exist many domains lacking sufficient available training samples, particularly in some healthcare domains [343]. For example, in diagnostic settings, medical images are much more difficult to annotate with specific lesions to a high quality without specialist expertise, compared with general images of simple categories. In addition, there are usually few historical data or cases for new diseases and rare illnesses, making it impossible to obtain sufficient training samples with accurate labels. In such circumstances, directly applying existing RL methods on limited data may yield estimates that are overly optimistic or, at the other extreme, overly pessimistic about treatments that are rarely performed in practice.
Broadly, there are two ways of dealing with a small sample learning problem [344]. The direct solution is to use data augmentation strategies such as deformations [345] or GANs [346] to increase the number of samples and then employ conventional learning methods. The other type of solution is to apply various model modification or domain adaptation methods such as knowledge distillation [347] or meta-learning [348] to enable efficient learning that overcomes the problem of data scarcity. While still in its early stage, significant progress has been made in small sample learning research in recent years [344]. How to build on these achievements and tackle small-data RL problems in healthcare domains thus calls for new methods and future investigation. One initial work is by Tseng et al. [93], who developed automated radiation adaptation protocols for NSCLC patients by using a GAN to generate synthetic patient data and a DQN to learn dose decisions from the synthesized data together with the available real clinical data. Results showed that the dose strategies learned by the DQN were capable of achieving results similar to those chosen by clinicians, yielding feasible and quite promising solutions for automatic treatment design with limited data.
D. Healthcare under Ambient Intelligence
The recent development of sensor networks and wearable devices has facilitated the advent of a new era of healthcare systems characterized by low-cost mobile sensing and pervasive monitoring within the home and outdoor environments [349]. Ambient Intelligence (AmI) technology, which enables innovative human-machine interactions through unobtrusive and anticipatory communications, has the potential to enhance the healthcare domain dramatically by learning from user interaction, reasoning about users' goals and intentions, and planning activities and future interactions [350]. By using various kinds of sensing devices, such as smart phones, GPS and body sensors monitoring motions and activities, it is now possible to remotely and continuously collect patients' health information such that proper treatment or intervention decisions can be made anytime and anywhere.
As an instance of online decision making in a possibly infinite horizon setting involving many stages of interventions, RL plays a key role in achieving the future vision of AmI in healthcare systems through continuous interaction with the environment and adaptation to user needs in a transparent and optimal manner. In fact, the high level of monitoring and sensing provides ample opportunity for RL methods that can fuse estimates of a given physiologic parameter from multiple sources into a single measurement and derive optimal strategies using these data. Currently, there are several studies that have applied RL to achieve AmI in healthcare domains. For example, RL has been used to adapt the intervention strategies of smart phones in order to recommend regular physical activity to people who suffer from type 2 diabetes [301], [302], or who have experienced a cardiac event and been in cardiac rehabilitation [351], [352], [353]. It has also been used for mobile health interventions for college students who drink heavily and smoke cigarettes [181].
Despite these successes, healthcare under AmI poses some unique challenges that preclude the direct application of existing RL methodologies for DTRs. For example, it typically involves a large number of time points or an infinite time horizon for each individual; the momentary signal may be weak and may not directly measure the outcome of interest; and estimation of optimal treatment strategies must be done online as data accumulate. How to tackle these issues is of great importance for the successful application of RL methods in the advent of healthcare systems under AmI.
E. Future in-vivo Studies
To date, the vast volume of research reporting the development of RL techniques in healthcare is built either upon computational models that leverage mathematical representations of how a patient responds to given treatment policies, or upon retrospective clinical data from which appropriate treatment strategies are derived directly. While this kind of in silico study is essential as a tool for early-stage exploration or for the direct derivation of adaptive treatment strategies from approximate or highly simplified models, future in vivo studies of closed-loop RL approaches are urgently required to reliably assess the performance and personalization of the proposed approaches in real-life implementations. However, a number of major issues remain, related in particular to data collection and preprocessing in real clinical settings and to high inter-individual differences in physiological responses, thus calling for careful consideration of the safety, efficiency and robustness of RL methods in real-life healthcare applications.
First and foremost, safety is of paramount importance in medical settings, so it is imperative to ensure that the actions taken during learning are safe enough when dealing with in vivo subjects. In some healthcare domains, the consequences of wrong actions are not merely limited to bad performance, but may include long-term effects that cannot be compensated by more profitable exploitation later on. As one wrong action can result in unrecoverable effects, learning in healthcare domains poses a safe exploration dilemma [324], [32]. It is worth noting that there are substantial ongoing efforts in the computer science community to address precisely these problems, namely in developing risk-directed exploration algorithms that can learn efficiently with formal guarantees regarding the safety (or worst-case performance) of the system [33]. With this consideration, the agent's choice of actions is aided by an appropriate risk metric acting as an exploration bonus toward safer regions of the search space. Drawing on these achievements to develop safe exploration strategies is thus urgently required for implementing RL methods in real-life healthcare applications.
Another issue concerns the sample efficiency of RL methods in in vivo studies [83]. While it is possible for RL algorithms to collect large numbers of samples in simulations, this is unrealistic in sample-critical domains where collecting samples incurs significant cost. This is obviously true when dealing with real patients, who might not survive long-term, repeated trial-and-error treatment. Fortunately, the wide range of efficient techniques reviewed in Section II-B can provide promising solutions to this problem. Specifically, sample-level batch learning methods can be applied for more efficient use of past samples, while model-based methods enable better use of samples by building a model of the environment. Another appealing solution is to use task-level transfer methods that reuse past treatment or patient information to facilitate learning in new cases, or to directly transfer policies learned in simulation to real environments. To enable efficient transfer, RL algorithms can be provided with initial knowledge that directs the learning in its initial stage toward more profitable and safer regions of the state space, or with demonstrations and teacher advice from an external expert who can interrupt exploration and provide expert knowledge when the agent is confronted with unexpected situations.
The last issue in the real-life implementation of RL approaches is the robustness of the derived solutions. Despite being inherently suitable for optimizing outcomes in stochastic processes with uncertainty, existing RL methods still face difficulties in handling incomplete or noisy state variables in partially observable real healthcare environments, and in providing measures of confidence (e.g., standard errors, confidence sets, hypothesis tests). Uncertainty can also be caused by the MDP parameters themselves, which leads to significant increases in the difficulty of the problem, in terms of both computational complexity and data requirements. While there has been some recent work on robust MDP solutions which accounts for this issue [34], [35], a more general and sound theoretical and empirical evaluation is still lacking. Moreover, most current studies are built upon predefined functions that map states and actions to some integer numbers, and it is unclear how changing these numbers would affect the resulting optimal solutions. Understanding the robustness of RL methods in uncertain healthcare settings is the subject of ongoing critical investigations by the statistics, computer science and healthcare communities.
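As a concrete illustration of the kind of confidence measure mentioned above, the sketch below wraps any off-policy value estimator (such as the weighted importance sampling estimator sketched earlier) in a simple nonparametric bootstrap over logged trajectories to obtain a percentile interval. The estimator interface, the number of resamples and the interval level are assumptions for illustration, not a procedure taken from the surveyed work.

```python
# Illustrative sketch: percentile-bootstrap interval for an off-policy value estimate.
# 'estimator' is any callable mapping a list of trajectories to a scalar estimate
# (an assumed interface); trajectories are resampled with replacement.
import random

def bootstrap_value_interval(trajectories, estimator, n_boot=1000, level=0.95, seed=0):
    rng = random.Random(seed)
    n = len(trajectories)
    estimates = []
    for _ in range(n_boot):
        resample = [trajectories[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    estimates.sort()
    lo = estimates[int(((1 - level) / 2) * n_boot)]
    hi = estimates[int((1 - (1 - level) / 2) * n_boot) - 1]
    return estimator(trajectories), (lo, hi)
```

Such intervals only quantify sampling variability; they do not address confounding, partial observability or model misspecification, which is why the broader robustness programme discussed above remains necessary.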
IX. CONCLUSIONS
RL presents a mathematically solid and technically sound solution to optimal decision making in various healthcare tasks challenged by noisy, multi-dimensional and incomplete data, nonlinear and complex dynamics, and, particularly, sequential decision procedures with delayed evaluation feedback. This paper aims to provide a state-of-the-art comprehensive survey of RL applications to a variety of decision making problems in the area of healthcare. We have provided a structured summarization of the theoretical foundations and key techniques of RL research from a traditional machine learning perspective, and surveyed the broad-ranging applications of RL methods in solving problems affecting manifold areas of healthcare, from DTRs in chronic diseases and critical care, to automated clinical diagnosis, and to other healthcare domains such as clinical resource allocation and scheduling. The challenges and open issues in the current research have been discussed in detail from the perspectives of the basic components constituting an RL process (i.e., states, actions, rewards, policies and models) and of fundamental issues in RL research (i.e., the exploration-exploitation dilemma and credit assignment). It should be emphasized that, although each of these challenging issues has been investigated extensively in the RL community for a long time, with remarkably successful solutions, it might be problematic to apply these solutions directly in healthcare settings due to the inherent complexity of medical data processing and policy learning. In fact, the unique features embodied in clinical and medical decision making processes urgently call for the development of more advanced RL methods that are truly suitable for real-life healthcare problems. Apart from the enumerated challenges, we have also pointed out several perspectives that remain comparatively less addressed in the current literature. Interpretable learning, transfer learning and small-data learning are the three theoretical directions that require more effort in order to make substantial progress. Moreover, how to tailor existing RL methods to deal with the pervasive data in the new era of AmI healthcare systems, and how to take into consideration the safety, robustness and efficiency requirements raised by real-life applications, are two main paradigms that need to be carefully handled in practice.
The application of RL in healthcare lies at the intersection of computer science and medicine. Such cross-disciplinary research requires a concerted effort from machine learning researchers and from clinicians who are directly involved in patient care and medical decision making. While notable success has been obtained, RL has still received far less attention from researchers, in both computer science and medicine, compared to other research paradigms in healthcare domains, such as traditional machine learning, deep learning, statistical learning and control-driven methods. Driven by both substantial progress in the theories and techniques of RL research and practical demands from healthcare practitioners and managers, this situation is now changing rapidly, and recent years have witnessed a surge of interest in the paradigm of applying RL in healthcare, as evidenced by the dramatic increase in the number of publications on this topic in the past few years. Serving as the first comprehensive survey of RL applications in healthcare, this paper aims at providing the research community with a systematic understanding of the foundations, the broad palette of methods and techniques available, the existing challenges, and the new insights of this emerging paradigm. By this, we hope that more researchers from various disciplines can utilize their expertise in their own areas and work collaboratively to generate more applicable solutions to optimal decision making in healthcare.
ACKNOWLEDGMENT
This work is supported by the Hongkong Scholar Program under Grant XJ2017028.
REFERENCES
[1] V. L. Patel, E. H. Shortliffe, M. Stefanelli, P. Szolovits, M. R. Berthold, R. Bellazzi, and A. Abu-Hanna, “The coming of age of artificial intelligence in medicine,” Artificial Intelligence in Medicine, vol. 46, no. 1, pp. 5–17, 2009.
[2] S. E. Dilsizian and E. L. Siegel, “Artificial intelligence in medicine and cardiac imaging: harnessing big data and advanced computing to provide personalized medical diagnosis and treatment,” Current Cardiology Reports, vol. 16, no. 1, p. 441, 2014.
[3] F. Jiang, Y. Jiang, H. Zhi, Y. Dong, H. Li, S. Ma, Y. Wang, Q. Dong, H. Shen, and Y. Wang, “Artificial intelligence in healthcare: past, present and future,” Stroke and Vascular Neurology, vol. 2, no. 4, pp. 230–243, 2017.
[4] J. He, S. L. Baxter, J. Xu, J. Xu, X. Zhou, and K. Zhang, “The practical implementation of artificial intelligence technologies in medicine,” Nature Medicine, vol. 25, no. 1, p. 30, 2019.
[5] A. E. Johnson, M. M. Ghassemi, S. Nemati, K. E. Niehaus, D. A. Clifton, and G. D. Clifford, “Machine learning and decision support in critical care,” Proceedings of the IEEE, vol. 104, no. 2, pp. 444–466, 2016.
[6] D. Ravì, C. Wong, F. Deligianni, M. Berthelot, J. Andreu-Perez, B. Lo, and G.-Z. Yang, “Deep learning for health informatics,” IEEE Journal of Biomedical and Health Informatics, vol. 21, no. 1, pp. 4–21, 2017.
[7] T. Ching, D. S. Himmelstein, B. K. Beaulieu-Jones, A. A. Kalinin, B. T. Do, G. P. Way, E. Ferrero, P.-M. Agapow, M. Zietz, M. M. Hoffman et al., “Opportunities and obstacles for deep learning in biology and medicine,” bioRxiv, p. 142760, 2018.
[8] J. Luo, M. Wu, D. Gopukumar, and Y. Zhao, “Big data application in biomedical research and health care: a literature review,” Biomedical Informatics Insights, vol. 8, pp. BII–S31559, 2016.
[9] A. Esteva, A. Robicquet, B. Ramsundar, V. Kuleshov, M. DePristo, K. Chou, C. Cui, G. Corrado, S. Thrun, and J. Dean, “A guide to deep learning in healthcare,” Nature Medicine, vol. 25, no. 1, p. 24, 2019.
[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
[11] M. L. Littman, “Reinforcement learning improves behaviour from evaluative feedback,” Nature, vol. 521, no. 7553, p. 445, 2015.
[12] Y. Li, “Deep reinforcement learning,” arXiv preprint arXiv:1810.06339, 2018.
[13] M. Mahmud, M. S. Kaiser, A. Hussain, and S. Vassanelli, “Applications of deep learning and reinforcement learning to biological data,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 6, pp. 2063–2079, 2018.
[14] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[15] L. Buşoniu, T. de Bruin, D. Tolić, J. Kober, and I. Palunko, “Reinforcement learning for control: Performance, stability, and deep approximators,” Annual Reviews in Control, 2018.
[16] O. Gottesman, F. Johansson, M. Komorowski, A. Faisal, D. Sontag, F. Doshi-Velez, and L. A. Celi, “Guidelines for reinforcement learning in healthcare,” Nature Medicine, vol. 25, no. 1, p. 16, 2019.
[17] R. Bellman, Dynamic Programming. Courier Corporation, 2013.
[18] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[19] G. A. Rummery and M. Niranjan, On-line Q-learning Using Connectionist Systems. University of Cambridge, Department of Engineering, Cambridge, England, 1994, vol. 37.
[20] J. Kober and J. R. Peters, “Policy search for motor primitives in robotics,” in Advances in Neural Information Processing Systems, 2009, pp. 849–856.
[21] J. Peters and S. Schaal, “Natural actor-critic,” Neurocomputing, vol. 71, no. 7-9, pp. 1180–1190, 2008.
[22] N. Vlassis, M. Ghavamzadeh, S. Mannor, and P. Poupart, “Bayesian reinforcement learning,” in Reinforcement Learning. Springer, 2012, pp. 359–386.
[23] M. Ghavamzadeh, S. Mannor, J. Pineau, A. Tamar et al., “Bayesian reinforcement learning: A survey,” Foundations and Trends in Machine Learning, vol. 8, no. 5-6, pp. 359–483, 2015.
[24] A. L. Strehl, L. Li, and M. L. Littman, “Reinforcement learning in finite MDPs: PAC analysis,” Journal of Machine Learning Research, vol. 10, no. Nov, pp. 2413–2444, 2009.
[25] R. I. Brafman and M. Tennenholtz, “R-max: a general polynomial time algorithm for near-optimal reinforcement learning,” Journal of Machine Learning Research, vol. 3, no. Oct, pp. 213–231, 2002.
[26] A. G. Barto, “Intrinsic motivation and reinforcement learning,” in Intrinsically Motivated Learning in Natural and Artificial Systems. Springer, 2013, pp. 17–47.
[27] L. Busoniu, R. Babuska, and B. De Schutter, “A comprehensive survey of multiagent reinforcement learning,” IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 38, no. 2, 2008.
[28] S. M. Kakade et al., “On the sample complexity of reinforcement learning,” Ph.D. dissertation, University of London, London, England, 2003.
[29] L. Li, “Sample complexity bounds of exploration,” in Reinforcement Learning. Springer, 2012, pp. 175–204.
[30] L. Busoniu, R. Babuska, B. De Schutter, and D. Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press, 2010.
[31] H. Van Hasselt, “Reinforcement learning in continuous state and action spaces,” in Reinforcement Learning. Springer, 2012, pp. 207–251.
[32] T. M. Moldovan and P. Abbeel, “Safe exploration in markov decision processes,” in Proceedings of the 29th International Conference on Machine Learning. Omnipress, 2012, pp. 1451–1458.
[33] J. García and F. Fernández, “A comprehensive survey on safe reinforcement learning,” Journal of Machine Learning Research, vol. 16, no. 1, pp. 1437–1480, 2015.
[34] W. Wiesemann, D. Kuhn, and B. Rustem, “Robust markov decision processes,” Mathematics of Operations Research, vol. 38, no. 1, pp. 153–183, 2013.
[35] H. Xu and S. Mannor, “Distributionally robust markov decision processes,” in Advances in Neural Information Processing Systems, 2010, pp. 2505–2513.
[36] D. Hein, S. Udluft, and T. A. Runkler, “Interpretable policies for reinforcement learning by genetic programming,” Engineering Applications of Artificial Intelligence, vol. 76, pp. 158–169, 2018.
[37] O. Bastani, Y. Pu, and A. Solar-Lezama, “Verifiable reinforcement learning via policy extraction,” in Advances in Neural Information Processing Systems, 2018, pp. 2499–2509.
[38] M. Wiering and M. Van Otterlo, “Reinforcement learning,” Adaptation, Learning, and Optimization, vol. 12, 2012.
[39] S. Lange, T. Gabel, and M. Riedmiller, “Batch reinforcement learning,” in Reinforcement Learning. Springer, 2012, pp. 45–73.
[40] M. Riedmiller, “Neural fitted Q iteration: first experiences with a data efficient neural reinforcement learning method,” in European Conference on Machine Learning. Springer, 2005, pp. 317–328.
[41] D. Ernst, P. Geurts, and L. Wehenkel, “Tree-based batch mode reinforcement learning,” Journal of Machine Learning Research, vol. 6, no. Apr, pp. 503–556, 2005.
[42] M. G. Lagoudakis and R. Parr, “Least-squares policy iteration,” Journal of Machine Learning Research, vol. 4, no. Dec, pp. 1107–1149, 2003.
[43] T. Hester and P. Stone, “Learning and using models,” in Reinforcement Learning. Springer, 2012, pp. 111–141.
[44] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, “A survey of monte carlo tree search methods,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1, pp. 1–43, 2012.
[45] A. Lazaric, “Transfer in reinforcement learning: a framework and a survey,” in Reinforcement Learning. Springer, 2012, pp. 143–173.
[46] M. E. Taylor and P. Stone, “Transfer learning for reinforcement learning domains: A survey,” Journal of Machine Learning Research, vol. 10, no. Jul, pp. 1633–1685, 2009.
[47] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, p. 484, 2016.
[48] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, and F. E. Alsaadi, “A survey of deep neural network architectures and their applications,” Neurocomputing, vol. 234, pp. 11–26, 2017.
[49] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
[50] C. Liu, X. Xu, and D. Hu, “Multiobjective reinforcement learning: A comprehensive overview,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 45, no. 3, pp. 385–398, 2015.
[51] C. Wirth, R. Akrour, G. Neumann, and J. Fürnkranz, “A survey of preference-based reinforcement learning methods,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 4945–4990, 2017.
[52] J. Fürnkranz, E. Hüllermeier, W. Cheng, and S.-H. Park, “Preference-based reinforcement learning: a formal framework and a policy iteration algorithm,” Machine Learning, vol. 89, no. 1-2, pp. 123–156, 2012.
[53] A. Y. Ng, S. J. Russell et al., “Algorithms for inverse reinforcement learning,” in ICML, 2000, pp. 663–670.
[54] S. Zhifei and E. Meng Joo, “A survey of inverse reinforcement learning techniques,” International Journal of Intelligent Computing and Cybernetics, vol. 5, no. 3, pp. 293–311, 2012.
[55] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning,” in AAAI, vol. 8. Chicago, IL, USA, 2008, pp. 1433–1438.
[56] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in Proceedings of the Twenty-First International Conference on Machine Learning. ACM, 2004, p. 1.
[57] S. Levine, Z. Popovic, and V. Koltun, “Nonlinear inverse reinforcement learning with gaussian processes,” in Advances in Neural Information Processing Systems, 2011, pp. 19–27.
[58] D. Ramachandran and E. Amir, “Bayesian inverse reinforcement learning,” Urbana, vol. 51, no. 61801, pp. 1–4, 2007.
[59] C. Guestrin, D. Koller, R. Parr, and S. Venkataraman, “Efficient solution algorithms for factored mdps,” Journal of Artificial Intelligence Research, vol. 19, pp. 399–468, 2003.
[60] M. Kearns and D. Koller, “Efficient reinforcement learning in factored mdps,” in IJCAI, vol. 16, 1999, pp. 740–747.
[61] C. Guestrin, R. Patrascu, and D. Schuurmans, “Algorithm-directed exploration for model-based reinforcement learning in factored mdps,” in ICML, 2002, pp. 235–242.
[62] I. Osband and B. Van Roy, “Near-optimal reinforcement learning in factored mdps,” in Advances in Neural Information Processing Systems, 2014, pp. 604–612.
[63] A. L. Strehl, C. Diuk, and M. L. Littman, “Efficient structure learning in factored-state mdps,” in AAAI, vol. 7, 2007, pp. 645–650.
[64] A. G. Barto and S. Mahadevan, “Recent advances in hierarchical reinforcement learning,” Discrete Event Dynamic Systems, vol. 13, no. 1-2, pp. 41–77, 2003.
[65] B. Hengst, “Hierarchical approaches,” in Reinforcement Learning. Springer, 2012, pp. 293–323.
[66] M. van Otterlo, “Solving relational and first-order logical markov decision processes: A survey,” in Reinforcement Learning. Springer, 2012, pp. 253–292.
[67] T. Jaakkola, S. P. Singh, and M. I. Jordan, “Reinforcement learning algorithm for partially observable markov decision problems,” in Advances in Neural Information Processing Systems, 1995, pp. 345–352.
[68] R. W. Jelliffe, J. Buell, R. Kalaba, R. Sridhar, and R. Rockwell, “A computer program for digitalis dosage regimens,” Mathematical Biosciences, vol. 9, pp. 179–193, 1970.
[69] R. E. Bellman, Mathematical Methods in Medicine. World Scientific Publishing Co., Inc., 1983.
[70] C. Hu, W. S. Lovejoy, and S. L. Shafer, “Comparison of some control strategies for three-compartment pk/pd models,” Journal of Pharmacokinetics and Biopharmaceutics, vol. 22, no. 6, pp. 525–550, 1994.
[71] A. J. Schaefer, M. D. Bailey, S. M. Shechter, and M. S. Roberts, “Modeling medical treatment using markov decision processes,” in Operations Research and Health Care. Springer, 2005, pp. 593–612.
[72] B. Chakraborty and S. A. Murphy, “Dynamic treatment regimes,” Annual Review of Statistics and Its Application, vol. 1, pp. 447–464, 2014.
[73] E. B. Laber, D. J. Lizotte, M. Qian, W. E. Pelham, and S. A. Murphy, “Dynamic treatment regimes: Technical challenges and applications,” Electronic Journal of Statistics, vol. 8, no. 1, p. 1225, 2014.
[74] J. K. Lunceford, M. Davidian, and A. A. Tsiatis, “Estimation of survival distributions of treatment policies in two-stage randomization designs in clinical trials,” Biometrics, vol. 58, no. 1, pp. 48–57, 2002.
[75] D. Almirall and A. Chronis-Tuscano, “Adaptive interventions in child and adolescent mental health,” Journal of Clinical Child & Adolescent Psychology, vol. 45, no. 4, pp. 383–395, 2016.
[76] P. W. Lavori and R. Dawson, “Adaptive treatment strategies in chronic disease,” Annu. Rev. Med., vol. 59, pp. 443–453, 2008.
[77] B. Chakraborty and E. E. M. Moodie, Statistical Reinforcement Learning. Springer New York, 2013.
[78] S. A. Murphy, “An experimental design for the development of adaptive treatment strategies,” Statistics in Medicine, vol. 24, no. 10, pp. 1455–1481, 2005.
[79] S. A. Murphy, K. G. Lynch, D. Oslin, J. R. McKay, and T. TenHave, “Developing adaptive treatment strategies in substance abuse research,” Drug & Alcohol Dependence, vol. 88, pp. S24–S30, 2007.
[80] W. H. Organization, Preventing Chronic Diseases: A Vital Investment. World Health Organization, 2005.
[81] B. Chakraborty and E. Moodie, Statistical Methods for Dynamic Treatment Regimes. Springer, 2013.
[82] E. H. Wagner, B. T. Austin, C. Davis, M. Hindmarsh, J. Schaefer, and A. Bonomi, “Improving chronic illness care: translating evidence into action,” Health Affairs, vol. 20, no. 6, pp. 64–78, 2001.
[83] Y. Zhao, M. R. Kosorok, and D. Zeng, “Reinforcement learning design for cancer clinical trials,” Statistics in Medicine, vol. 28, no. 26, pp. 3294–3315, 2009.
[84] A. Hassani et al., “Reinforcement learning based control of tumor growth with chemotherapy,” in 2010 International Conference on System Science and Engineering (ICSSE). IEEE, 2010, pp. 185–189.
[85] I. Ahn and J. Park, “Drug scheduling of cancer chemotherapy based on natural actor-critic approach,” BioSystems, vol. 106, no. 2-3, pp. 121–129, 2011.
[86] K. Humphrey, “Using reinforcement learning to personalize dosing strategies in a simulated cancer trial with high dimensional data,” 2017.
[87] R. Padmanabhan, N. Meskin, and W. M. Haddad, “Reinforcement learning-based control of drug dosing for cancer chemotherapy treatment,” Mathematical Biosciences, vol. 293, pp. 11–20, 2017.
[88] Y. Zhao, D. Zeng, M. A. Socinski, and M. R. Kosorok, “Reinforcement learning strategies for clinical trials in nonsmall cell lung cancer,” Biometrics, vol. 67, no. 4, pp. 1422–1433, 2011.
[89] W. Cheng, J. Fürnkranz, E. Hüllermeier, and S.-H. Park, “Preference-based policy iteration: Leveraging preference learning for reinforcement learning,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2011, pp. 312–327.
[90] R. Akrour, M. Schoenauer, and M. Sebag, “April: Active preference learning-based reinforcement learning,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2012, pp. 116–131.
[91] R. Busa-Fekete, B. Szörényi, P. Weng, W. Cheng, and E. Hüllermeier, “Preference-based reinforcement learning: evolutionary direct policy search using a preference-based racing algorithm,” Machine Learning, vol. 97, no. 3, pp. 327–351, 2014.
[92] R. Vincent, “Reinforcement learning in models of adaptive medical treatment strategies,” Ph.D. dissertation, McGill University Libraries, 2014.
[93] H. H. Tseng, Y. Luo, S. Cui, J. T. Chien, R. K. Ten Haken, and I. E. Naqa, “Deep reinforcement learning for automated radiation adaptation in lung cancer,” Medical Physics, vol. 44, no. 12, pp. 6690–6705, 2017.
[94] A. Jalalimanesh, H. S. Haghighi, A. Ahmadi, and M. Soltani, “Simulation-based optimization of radiotherapy: Agent-based modeling and reinforcement learning,” Mathematics and Computers in Simulation, vol. 133, pp. 235–248, 2017.
[95] A. Jalalimanesh, H. S. Haghighi, A. Ahmadi, H. Hejazian, and M. Soltani, “Multi-objective optimization of radiotherapy: distributed q-learning and agent-based simulation,” Journal of Experimental & Theoretical Artificial Intelligence, pp. 1–16, 2017.
[96] Y. Goldberg and M. R. Kosorok, “Q-learning with censored data,” Annals of Statistics, vol. 40, no. 1, p. 529, 2012.
[97] Y. M. Soliman, “Personalized medical treatments using novel reinforcement learning algorithms,” arXiv preprint arXiv:1406.3922, 2014.
[98] G. Yauney and P. Shah, “Reinforcement learning with action-derived rewards for chemotherapy and clinical trial dosing regimen selection,” in Machine Learning for Healthcare Conference, 2018, pp. 161–226.
[99] B. Stewart, C. P. Wild et al., “World cancer report 2014,” Health, 2017.
[100] R. Eftimie, J. L. Bramson, and D. J. Earn, “Interactions between the immune system and cancer: a brief review of non-spatial mathematical models,” Bulletin of Mathematical Biology, vol. 73, no. 1, pp. 2–32, 2011.
[101] J. Shi, O. Alagoz, F. S. Erenay, and Q. Su, “A survey of optimization models on cancer chemotherapy treatment planning,” Annals of Operations Research, vol. 221, no. 1, pp. 331–356, 2014.
[102] N. Beerenwinkel, R. F. Schwarz, M. Gerstung, and F. Markowetz, “Cancer evolution: mathematical models and computational inference,” Systematic Biology, vol. 64, no. 1, pp. e1–e25, 2014.
[103] M. Tenenbaum, A. Fern, L. Getoor, M. Littman, V. Manasinghka, S. Natarajan, D. Page, J. Shrager, Y. Singer, and P. Tadepalli, “Personalizing cancer therapy via machine learning,” in Workshops of NIPS, 2010.
[104] V. Vapnik, S. E. Golowich, and A. J. Smola, “Support vector method for function approximation, regression estimation and signal processing,” in Advances in Neural Information Processing Systems, 1997, pp. 281–287.
[105] L. G. De Pillis and A. Radunskaya, “The dynamics of an optimally controlled tumor model: A case study,” Mathematical and Computer Modelling, vol. 37, no. 11, pp. 1221–1244, 2003.
[106] M. Feng, G. Valdes, N. Dixit, and T. D. Solberg, “Machine learning in radiation oncology: Opportunities, requirements, and needs,” Frontiers in Oncology, vol. 8, 2018.
[107] J. de Lope, D. Maravall et al., “Robust high performance reinforcement learning through weighted k-nearest neighbors,” Neurocomputing, vol. 74, no. 8, pp. 1251–1259, 2011.
[108] N. Cho, J. Shaw, S. Karuranga, Y. Huang, J. da Rocha Fernandes, A. Ohlrogge, and B. Malanda, “IDF diabetes atlas: Global estimates of diabetes prevalence for 2017 and projections for 2045,” Diabetes Research and Clinical Practice, vol. 138, pp. 271–281, 2018.
[109] A. M. Albisser, B. Leibel, T. Ewart, Z. Davidovac, C. Botz, W. Zingg, H. Schipper, and R. Gander, “Clinical control of diabetes by the artificial pancreas,” Diabetes, vol. 23, no. 5, pp. 397–404, 1974.
[110] C. Cobelli, E. Renard, and B. Kovatchev, “Artificial pancreas: past, present, future,” Diabetes, vol. 60, no. 11, pp. 2672–2682, 2011.
[111] B. W. Bequette, “A critical assessment of algorithms and challenges in the development of a closed-loop artificial pancreas,” Diabetes Technology & Therapeutics, vol. 7, no. 1, pp. 28–47, 2005.
[112] T. Peyser, E. Dassau, M. Breton, and J. S. Skyler, “The artificial pancreas: current status and future prospects in the management of diabetes,” Annals of the New York Academy of Sciences, vol. 1311, no. 1, pp. 102–123, 2014.
[113] M. K. Bothe, L. Dickens, K. Reichel, A. Tellmann, B. Ellger, M. Westphal, and A. A. Faisal, “The use of reinforcement learning algorithms to meet the challenges of an artificial pancreas,” Expert Review of Medical Devices, vol. 10, no. 5, pp. 661–673, 2013.
[114] S. Yasini, M. B. Naghibi Sistani, and A. Karimpour, “Agent-based simulation for blood glucose,” International Journal of Applied Science, Engineering and Technology, vol. 5, pp. 89–95, 2009.
[115] E. Daskalaki, L. Scarnato, P. Diem, and S. G. Mougiakakou, “Preliminary results of a novel approach for glucose regulation using an actor-critic learning based controller,” 2010.
[116] B. P. Kovatchev, M. Breton, C. Dalla Man, and C. Cobelli, “In silico preclinical trials: a proof of concept in closed-loop control of type 1 diabetes,” 2009.
[117] E. Daskalaki, P. Diem, and S. G. Mougiakakou, “An actor–critic based controller for glucose regulation in type 1 diabetes,” Computer Methods and Programs in Biomedicine, vol. 109, no. 2, pp. 116–125, 2013.
[118] ——, “Personalized tuning of a reinforcement learning control algorithm for glucose regulation,” in 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2013, pp. 3487–3490.
[119] ——, “Model-free machine learning in biomedicine: Feasibility study in type 1 diabetes,” PloS One, vol. 11, no. 7, p. e0158722, 2016.
[120] Q. Sun, M. Jankovic, J. Budzinski, B. Moore, P. Diem, C. Stettler, and S. G. Mougiakakou, “A dual mode adaptive basal-bolus advisor based on reinforcement learning,” IEEE Journal of Biomedical and Health Informatics, 2018.
[121] P. Palumbo, S. Panunzi, and A. De Gaetano, “Qualitative behavior of a family of delay-differential models of the glucose-insulin system,” Discrete and Continuous Dynamical Systems Series B, vol. 7, no. 2, p. 399, 2007.
[122] A. Noori, M. A. Sadrnia et al., “Glucose level control using temporal difference methods,” in 2017 Iranian Conference on Electrical Engineering (ICEE). IEEE, 2017, pp. 895–900.
[123] P. D. Ngo, S. Wei, A. Holubová, J. Muzik, and F. Godtliebsen, “Reinforcement-learning optimal control for type-1 diabetes,” in 2018 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI). IEEE, 2018, pp. 333–336.
[124] ——, “Control of blood glucose for type-1 diabetes by using reinforcement learning with feedforward algorithm,” Computational and Mathematical Methods in Medicine, vol. 2018, 2018.
[125] R. N. Bergman, Y. Z. Ider, C. R. Bowden, and C. Cobelli, “Quantitative estimation of insulin sensitivity,” American Journal of Physiology-Endocrinology and Metabolism, vol. 236, no. 6, p. E667, 1979.
[126] R. Hovorka, V. Canonico, L. J. Chassin, U. Haueter, M. Massi-Benedetti, M. O. Federici, T. R. Pieber, H. C. Schaller, L. Schaupp, T. Vering et al., “Nonlinear model predictive control of glucose concentration in subjects with type 1 diabetes,” Physiological Measurement, vol. 25, no. 4, p. 905, 2004.
[127] M. De Paula, L. O. Ávila, and E. C. Martínez, “Controlling blood glucose variability under uncertainty using reinforcement learning and gaussian processes,” Applied Soft Computing, vol. 35, pp. 310–332, 2015.
[128] M. De Paula, G. G. Acosta, and E. C. Martínez, “On-line policy learning and adaptation for real-time personalization of an artificial pancreas,” Expert Systems with Applications, vol. 42, no. 4, pp. 2234–2255, 2015.
[129] S. U. Acikgoz and U. M. Diwekar, “Blood glucose regulation with stochastic optimal control for insulin-dependent diabetic patients,” Chemical Engineering Science, vol. 65, no. 3, pp. 1227–1236, 2010.
[130] H. Asoh, M. Shiro, S. Akaho, T. Kamishima, K. Hashida, E. Aramaki, and T. Kohro, “Modeling medical records of diabetes using markov decision processes,” in Proceedings of ICML2013 Workshop on Role of Machine Learning in Transforming Healthcare, 2013.
[131] H. Asoh, M. S. S. Akaho, T. Kamishima, K. Hasida, E. Aramaki, and T. Kohro, “An application of inverse reinforcement learning to medical records of diabetes treatment,” in ECMLPKDD2013 Workshop on Reinforcement Learning with Generalized Feedback, 2013.
[132] D. J. Luckett, E. B. Laber, A. R. Kahkoska, D. M. Maahs, E. Mayer-Davis, and M. R. Kosorok, “Estimating dynamic treatment regimes in mobile health using v-learning,” Journal of the American Statistical Association, no. just-accepted, pp. 1–39, 2018.
[133] A. E. Gaweda, M. K. Muezzinoglu, G. R. Aronoff, A. A. Jacobs, J. M. Zurada, and M. E. Brier, “Reinforcement learning approach to individualization of chronic pharmacotherapy,” in IJCNN'05, vol. 5. IEEE, 2005, pp. 3290–3295.
[134] A. E. Gaweda, M. K. Muezzinoglu, A. A. Jacobs, G. R. Aronoff, and M. E. Brier, “Model predictive control with reinforcement learning for drug delivery in renal anemia management,” in IEEE EMBS'06. IEEE, 2006, pp. 5177–5180.
[135] A. E. Gaweda, M. K. Muezzinoglu, G. R. Aronoff, A. A. Jacobs, J. M. Zurada, and M. E. Brier, “Individualization of pharmacological anemia management using reinforcement learning,” Neural Networks, vol. 18, no. 5-6, pp. 826–834, 2005.
[136] J. D. Martín-Guerrero, F. Gomez, E. Soria-Olivas, J. Schmidhuber, M. Climente-Martí, and N. V. Jiménez-Torres, “A reinforcement learning approach for individualizing erythropoietin dosages in hemodialysis patients,” Expert Systems with Applications, vol. 36, no. 6, pp. 9737–9742, 2009.
[137] J. D. Martín-Guerrero, E. Soria-Olivas, M. Martínez-Sober, M. Climente-Martí, T. De Diego-Santos, and N. V. Jiménez-Torres, “Validation of a reinforcement learning policy for dosage optimization of erythropoietin,” in Australasian Joint Conference on Artificial Intelligence. Springer, 2007, pp. 732–738.
[138] J. M. Malof and A. E. Gaweda, “Optimizing drug therapy with reinforcement learning: The case of anemia management,” in Neural Networks (IJCNN), The 2011 International Joint Conference on. IEEE, 2011, pp. 2088–2092.
[139] P. Escandell-Montero, J. M. Martínez-Martínez, J. D. Martín-Guerrero, E. Soria-Olivas, J. Vila-Francés, and R. Magdalena-Benedito, “Adaptive treatment of anemia on hemodialysis patients: A reinforcement learning approach,” in CIDM2011. IEEE, 2011, pp. 44–49.
[140] P. Escandell-Montero, M. Chermisi, J. M. Martinez-Martinez, J. Gomez-Sanchis, C. Barbieri, E. Soria-Olivas, F. Mari, J. Vila-Francés, A. Stopper, E. Gatti et al., “Optimization of anemia treatment in hemodialysis patients via reinforcement learning,” Artificial Intelligence in Medicine, vol. 62, no. 1, pp. 47–60, 2014.
[141] B. M. Adams, H. T. Banks, H.-D. Kwon, and H. T. Tran, “Dynamic multidrug therapies for hiv: Optimal and sti control approaches,” Mathematical Biosciences and Engineering, vol. 1, no. 2, pp. 223–241, 2004.
[142] D. Ernst, G.-B. Stan, J. Goncalves, and L. Wehenkel, “Clinical data
[147] …, in Advances in Neural Information Processing Systems, 2017, pp. 6250–6261.
[148] J. Yao, T. Killian, G. Konidaris, and F. Doshi-Velez, “Direct policy transfer via hidden parameter markov decision processes,” 2018.
[149] C. Yu, Y. Dong, J. Liu, and G. Ren, “Incorporating causal factors into reinforcement learning for dynamic treatment regimes in hiv,” BMC Medical Informatics and Decision Making, vol. 19, no. 2, p. 60, 2019.
[150] J. Pazis and R. Parr, “Pac optimal exploration in continuous space markov decision processes,” in AAAI, 2013.
[151] K. Kawaguchi, “Bounded optimal exploration in mdp,” in AAAI, 2016, pp. 1758–1764.
[152] S. A. Murphy, D. W. Oslin, A. J. Rush, and J. Zhu, “Methodological challenges in constructing effective treatment sequences for chronic psychiatric disorders,” Neuropsychopharmacology, vol. 32, no. 2, p. 257, 2007.
[153] T. N. Alotaiby, S. A. Alshebeili, T. Alshawi, I. Ahmad, and F. E. A. El-Samie, “EEG seizure detection and prediction algorithms: a survey,” EURASIP Journal on Advances in Signal Processing, vol. 2014, no. 1, p. 183, 2014.
[154] G. Panuccio, M. Semprini, L. Natale, S. Buccelli, I. Colombi, and M. Chiappalone, “Progress in neuroengineering for brain repair: New challenges and open issues,” Brain and Neuroscience Advances, vol. 2, p. 2398212818776475, 2018.
[155] A. Guez, R. D. Vincent, M. Avoli, and J. Pineau, “Adaptive treatment of epilepsy via batch-mode reinforcement learning,” in AAAI, 2008, pp. 1671–1678.
[156] J. Pineau, A. Guez, R. Vincent, G. Panuccio, and M. Avoli, “Treating epilepsy via adaptive neurostimulation: a reinforcement learning approach,” International Journal of Neural Systems, vol. 19, no. 04, pp. 227–240, 2009.
[157] A. Guez, “Adaptive control of epileptic seizures using reinforcement learning,” Ph.D. dissertation, McGill University Library, 2010.
[158] G. Panuccio, A. Guez, R. Vincent, M. Avoli, and J. Pineau, “Adaptive control of epileptiform excitability in an in vitro model of limbic seizures,” Experimental Neurology, vol. 241, pp. 179–183, 2013.
[159] K. Bush and J. Pineau, “Manifold embeddings for model-based reinforcement learning under partial observability,” in Advances in Neural Information Processing Systems, 2009, pp. 189–197.
[160] V. Nagaraj, A. Lamperski, and T. I. Netoff, “Seizure control in a computational model using a reinforcement learning stimulation paradigm,” International Journal of Neural Systems, vol. 27, no. 07, p. 1750012, 2017.
[161] A. J. Rush, M. Fava, S. R. Wisniewski, P. W. Lavori, M. H. Trivedi, H. A. Sackeim, M. E. Thase, A. A. Nierenberg, F. M. Quitkin, T. M. Kashner et al., “Sequenced treatment alternatives to relieve depression (STAR*D): rationale and design,” Controlled Clinical Trials, vol. 25, no. 1, pp. 119–142, 2004.
[162] J. Pineau, M. G. Bellemare, A. J. Rush, A. Ghizaru, and S. A. Murphy, “Constructing evidence-based treatment strategies using methods from computer science,” Drug & Alcohol Dependence, vol. 88, pp. S52–S60, 2007.
[163] D. Ormoneit and Ś. Sen, “Kernel-based reinforcement learning,” Machine Learning, vol. 49, no. 2-3, pp. 161–178, 2002.
[164] B. Chakraborty, E. B. Laber, and Y. Zhao, “Inference for optimal dynamic treatment regimes using an adaptive m-out-of-n bootstrap scheme,” Biometrics, vol. 69, no. 3, pp. 714–723, 2013.
[165] E. B. Laber, K. A. Linn, and L. A. Stefanski, “Interactive model building for q-learning,” Biometrika, vol. 101, no. 4, pp. 831–847,
based optimal sti strategies for hiv: a reinforcement learning approach,” 2014.
in 45th IEEE Conference on Decision and Control. IEEE, 2006, pp. [166] K. A. Linn, E. B. Laber, and L. A. Stefanski, “Interactive q-learning
667–672. for probabilities and quantiles,” arXiv preprint arXiv:1407.3414, 2014.
[143] S. Parbhoo, “A reinforcement learning design for hiv clinical trials,” [167] ——, “Interactive q-learning for quantiles,” Journal of the American
Ph.D. dissertation, 2014. Statistical Association, vol. 112, no. 518, pp. 638–649, 2017.
[144] S. Parbhoo, J. Bogojeska, M. Zazzi, V. Roth, and F. Doshi-Velez, [168] P. J. Schulte, A. A. Tsiatis, E. B. Laber, and M. Davidian, “Q-and a-
“Combining kernel and model based learning for hiv therapy selection,” learning methods for estimating optimal dynamic treatment regimes,”
AMIA Summits on Translational Science Proceedings, vol. 2017, p. Statistical science: a review journal of the Institute of Mathematical
239, 2017. Statistics, vol. 29, no. 4, p. 640, 2014.
[145] V. N. Marivate, J. Chemali, E. Brunskill, and M. L. Littman, “Quanti- [169] S. A. Murphy, “Optimal dynamic treatment regimes,” Journal of the
fying uncertainty in batch personalized sequential decision making.” in Royal Statistical Society: Series B (Statistical Methodology), vol. 65,
AAAI Workshop: Modern Artificial Intelligence for Health Analytics, no. 2, pp. 331–355, 2003.
2014. [170] R. Song, W. Wang, D. Zeng, and M. R. Kosorok, “Penalized q-learning
[146] T. Killian, G. Konidaris, and F. Doshi-Velez, “Transfer learning across for dynamic treatment regimens,” Statistica Sinica, vol. 25, no. 3, p.
patient variations with hidden parameter markov decision processes,” 901, 2015.
arXiv preprint arXiv:1612.00475, 2016. [171] Y. Liu, Y. Wang, M. R. Kosorok, Y. Zhao, and D. Zeng, “Robust hybrid
[147] T. W. Killian, S. Daulton, G. Konidaris, and F. Doshi-Velez, “Robust learning for estimating personalized dynamic treatment regimens,”
and efficient transfer learning with hidden parameter markov decision arXiv preprint arXiv:1611.02314, 2016.
[172] K. Deng, R. Greiner, and S. Murphy, “Budgeted learning for developing [195] A. Raghu, M. Komorowski, L. A. Celi, P. Szolovits, and M. Ghassemi,
personalized treatment,” in ICMLA2014. IEEE, 2014, pp. 7–14. “Continuous state-space models for optimal sepsis treatment: a deep
[173] R. S. Keefe, R. M. Bilder, S. M. Davis, P. D. Harvey, B. W. Palmer, reinforcement learning approach,” in Machine Learning for Healthcare
J. M. Gold, H. Y. Meltzer, M. F. Green, G. Capuano, T. S. Stroup Conference, 2017, pp. 147–163.
et al., “Neurocognitive effects of antipsychotic medications in patients [196] A. Raghu, M. Komorowski, and S. Singh, “Model-based reinforcement
with chronic schizophrenia in the catie trial,” Archives of General learning for sepsis treatment,” arXiv preprint arXiv:1811.09602, 2018.
Psychiatry, vol. 64, no. 6, pp. 633–647, 2007. [197] C. P. Utomo, X. Li, and W. Chen, “Treatment recommendation in crit-
[174] S. M. Shortreed, E. Laber, D. J. Lizotte, T. S. Stroup, J. Pineau, and ical care: A scalable and interpretable approach in partially observable
S. A. Murphy, “Informing sequential clinical decision-making through health states,” 2018.
reinforcement learning: an empirical study,” Machine Learning, vol. 84, [198] X. Peng, Y. Ding, D. Wihl, O. Gottesman, M. Komorowski, L.-w. H.
no. 1-2, pp. 109–136, 2011. Lehman, A. Ross, A. Faisal, and F. Doshi-Velez, “Improving sepsis
[175] A. Ertefaie, S. Shortreed, and B. Chakraborty, “Q-learning residual treatment strategies by combining deep and kernel-based reinforcement
analysis: application to the effectiveness of sequences of antipsychotic learning,” arXiv preprint arXiv:1901.04670, 2019.
medications for patients with schizophrenia,” Statistics in Medicine, [199] J. Futoma, A. Lin, M. Sendak, A. Bedoya, M. Clement, C. O’Brien,
vol. 35, no. 13, pp. 2221–2234, 2016. and K. Heller, “Learning to treat sepsis with multi-output gaussian
[176] D. J. Lizotte, M. Bowling, and S. A. Murphy, “Linear fitted-q iter- process deep recurrent q-networks,” 2018.
ation with multiple reward functions,” Journal of Machine Learning [200] C. Yu, G. Ren, and J. Liu, “Deep inverse reinforcement learning for
Research, vol. 13, no. Nov, pp. 3253–3295, 2012. sepsis treatment,” in 2019 IEEE ICHI, 2019, pp. 1–3.
[177] D. J. Lizotte and E. B. Laber, “Multi-objective markov decision [201] L. Li, M. Komorowski, and A. A. Faisal, “The actor search tree critic
processes for data-driven decision support,” The Journal of Machine (astc) for off-policy pomdp learning in medical decision making,” arXiv
Learning Research, vol. 17, no. 1, pp. 7378–7405, 2016. preprint arXiv:1805.11548, 2018.
[178] E. B. Laber, D. J. Lizotte, and B. Ferguson, “Set-valued dynamic [202] W.-H. Weng, M. Gao, Z. He, S. Yan, and P. Szolovits, “Representation
treatment regimes for competing outcomes,” Biometrics, vol. 70, no. 1, and reinforcement learning for personalized glycemic control in septic
pp. 53–61, 2014. patients,” arXiv preprint arXiv:1712.00654, 2017.
[179] E. L. Butler, E. B. Laber, S. M. Davis, and M. R. Kosorok, “Incor- [203] B. K. Petersen, J. Yang, W. S. Grathwohl, C. Cockrell, C. Santiago,
porating patient preferences into estimation of optimal individualized G. An, and D. M. Faissol, “Precision medicine as a control problem:
treatment rules,” Biometrics, 2017. Using simulation and deep reinforcement learning to discover adap-
[180] M. Dennis and C. K. Scott, “Managing addiction as a chronic con- tive, personalized multi-cytokine therapy for sepsis,” arXiv preprint
dition,” Addiction Science & Clinical Practice, vol. 4, no. 1, p. 45, arXiv:1802.10440, 2018.
2007. [204] B. L. Moore, E. D. Sinzinger, T. M. Quasny, and L. D. Pyeatt,
[181] S. A. Murphy, Y. Deng, E. B. Laber, H. R. Maei, R. S. Sutton, “Intelligent control of closed-loop sedation in simulated icu patients.”
and K. Witkiewitz, “A batch, off-policy, actor-critic algorithm for in FLAIRS Conference, 2004, pp. 109–114.
optimizing the average reward,” arXiv preprint arXiv:1607.05047,
[205] E. D. Sinzinger and B. Moore, “Sedation of simulated icu patients
2016.
using reinforcement learning based control,” International Journal on
[182] B. Chakraborty, S. Murphy, and V. Strecher, “Inference for non-regular
Artificial Intelligence Tools, vol. 14, no. 01n02, pp. 137–156, 2005.
parameters in optimal dynamic treatment regimes,” Statistical Methods
[206] B. L. Moore, A. G. Doufas, and L. D. Pyeatt, “Reinforcement learning:
in Medical Research, vol. 19, no. 3, pp. 317–343, 2010.
a novel method for optimal control of propofol-induced hypnosis,”
[183] B. Chakraborty, V. Strecher, and S. Murphy, “Bias correction and
Anesthesia & Analgesia, vol. 112, no. 2, pp. 360–367, 2011.
confidence intervals for fitted q-iteration,” in Workshop on Model
[207] B. L. Moore, T. M. Quasny, and A. G. Doufas, “Reinforcement
Uncertainty and Risk in Reinforcement Learning, NIPS, Whistler,
learning versus proportional–integral–derivative control of hypnosis in
Canada. Citeseer, 2008.
a simulated intraoperative patient,” Anesthesia & Analgesia, vol. 112,
[184] Y. Tao, L. Wang, D. Almirall et al., “Tree-based reinforcement learning
no. 2, pp. 350–359, 2011.
for estimating optimal dynamic treatment regimes,” The Annals of
Applied Statistics, vol. 12, no. 3, pp. 1914–1938, 2018. [208] B. L. Moore, L. D. Pyeatt, V. Kulkarni, P. Panousis, K. Padrez,
[185] J.-L. Vincent, “Critical care-where have we been and where are we and A. G. Doufas, “Reinforcement learning for closed-loop propofol
going?” Critical Care, vol. 17, no. 1, p. S2, 2013. anesthesia: a study in human volunteers,” The Journal of Machine
[186] K. Krell, “Critical care workforce,” Critical Care Medicine, vol. 36, Learning Research, vol. 15, no. 1, pp. 655–696, 2014.
no. 4, pp. 1350–1353, 2008. [209] B. L. Moore, P. Panousis, V. Kulkarni, L. D. Pyeatt, and A. G. Doufas,
[187] M. Ghassemi, L. A. Celi, and D. J. Stone, “State of the art review: the “Reinforcement learning for closed-loop propofol anesthesia: A human
data revolution in critical care,” Critical Care, vol. 19, no. 1, p. 118, volunteer study.” in IAAI, 2010.
2015. [210] N. Sadati, A. Aflaki, and M. Jahed, “Multivariable anesthesia control
[188] A. Rhodes, L. E. Evans, W. Alhazzani, M. M. Levy, M. Antonelli, using reinforcement learning,” in IEEE SMC’06, vol. 6. IEEE, 2006,
R. Ferrer, A. Kumar, J. E. Sevransky, C. L. Sprung, M. E. Nunnally pp. 4563–4568.
et al., “Surviving sepsis campaign: international guidelines for man- [211] E. C. Borera, B. L. Moore, A. G. Doufas, and L. D. Pyeatt, “An adaptive
agement of sepsis and septic shock: 2016,” Intensive Care Medicine, neural network filter for improved patient state estimation in closed-
vol. 43, no. 3, pp. 304–377, 2017. loop anesthesia control,” in IEEE ICTAI’11. IEEE, 2011, pp. 41–46.
[189] A. D. T. Force, V. Ranieri, G. Rubenfeld et al., “Acute respiratory [212] C. Lowery and A. A. Faisal, “Towards efficient, personalized anesthesia
distress syndrome,” Jama, vol. 307, no. 23, pp. 2526–2533, 2012. using continuous reinforcement learning for propofol infusion control,”
[190] T. Kamio, T. Van, and K. Masamune, “Use of machine-learning in IEEE/EMBS NER’13. IEEE, 2013, pp. 1414–1417.
approaches to predict clinical deterioration in critically ill patients: [213] R. Padmanabhan, N. Meskin, and W. M. Haddad, “Closed-loop control
A systematic review,” International Journal of Medical Research and of anesthesia and mean arterial pressure using reinforcement learning,”
Health Sciences, vol. 6, no. 6, pp. 1–7, 2017. Biomedical Signal Processing and Control, vol. 22, pp. 54–64, 2015.
[191] A. Vellido, V. Ribas, C. Morales, A. R. Sanmartı́n, and J. C. R. [214] P. Humbert, J. Audiffren, C. Dubost, and L. Oudre, “Learning from an
Rodrı́guez, “Machine learning in critical care: state-of-the-art and a expert.”
sepsis case study,” Biomedical engineering online, vol. 17, no. 1, p. [215] S. Nemati, M. M. Ghassemi, and G. D. Clifford, “Optimal medication
135, 2018. dosing from suboptimal clinical examples: A deep reinforcement
[192] M. Komorowski, A. Gordon, L. Celi, and A. Faisal, “A markov decision learning approach,” in IEEE 38th Annual International Conference of
process to suggest optimal treatment of severe infections in intensive the Engineering in Medicine and Biology Society. IEEE, 2016, pp.
care,” in Neural Information Processing Systems Workshop on Machine 2978–2981.
Learning for Health, 2016. [216] R. Lin, M. D. Stanley, M. M. Ghassemi, and S. Nemati, “A deep deter-
[193] M. Komorowski, L. A. Celi, O. Badawi, A. C. Gordon, and A. A. ministic policy gradient approach to medication dosing and surveillance
Faisal, “The artificial intelligence clinician learns optimal treatment in the icu,” in IEEE EMBC’18. IEEE, 2018, pp. 4927–4931.
strategies for sepsis in intensive care,” Nature Medicine, vol. 24, no. 11, [217] L. Wang, W. Zhang, X. He, and H. Zha, “Supervised reinforcement
p. 1716, 2018. learning with recurrent neural network for dynamic treatment recom-
[194] A. Raghu, M. Komorowski, I. Ahmed, L. Celi, P. Szolovits, and mendation,” in Proceedings of the 24th ACM SIGKDD International
M. Ghassemi, “Deep reinforcement learning for sepsis treatment,” Conference on Knowledge Discovery & Data Mining. ACM, 2018,
arXiv preprint arXiv:1711.09602, 2017. pp. 2447–2456.
[218] N. Prasad, L.-F. Cheng, C. Chivers, M. Draugelis, and B. E. Engel- [242] K. T. Chui, W. Alhalabi, S. S. H. Pang, P. O. d. Pablos, R. W.
hardt, “A reinforcement learning approach to weaning of mechanical Liu, and M. Zhao, “Disease diagnosis in smart healthcare: Innovation,
ventilation in intensive care units,” arXiv preprint arXiv:1704.06300, technologies and applications,” Sustainability, vol. 9, no. 12, p. 2309,
2017. 2017.
[219] C. Yu, J. Liu, and H. Zhao, “Inverse reinforcement learning for [243] Z. C. Lipton, D. C. Kale, C. Elkan, and R. Wetzel, “Learning
intelligent mechanical ventilation and sedative dosing in intensive care to diagnose with lstm recurrent neural networks,” arXiv preprint
units,” BMC medical informatics and decision making, vol. 19, no. 2, arXiv:1511.03677, 2015.
p. 57, 2019. [244] E. Choi, M. T. Bahadori, J. Sun, J. Kulas, A. Schuetz, and W. Stewart,
[220] C. Yu, G. Ren, and Y. Dong, “Supervised-actor-critic reinforcement “Retain: An interpretable predictive model for healthcare using reverse
learning for intelligent mechanical ventilation and sedative dosing in time attention mechanism,” in Advances in Neural Information Pro-
intensive care units,” BMC medical informatics and decision making, cessing Systems, 2016, pp. 3504–3512.
2020. [245] T. R. Goodwin and S. M. Harabagiu, “Medical question answering
[221] A. Jagannatha, P. Thomas, and H. Yu, “Towards high confidence off- for clinical decision support,” in Proceedings of the 25th ACM Inter-
policy reinforcement learning for clinical applications.” national on Conference on Information and Knowledge Management.
[222] L.-F. Cheng, N. Prasad, and B. E. Engelhardt, “An optimal policy ACM, 2016, pp. 297–306.
for patient laboratory tests in intensive care units,” arXiv preprint [246] Y. Ling, S. A. Hasan, V. Datla, A. Qadir, K. Lee, J. Liu, and O. Farri,
arXiv:1808.04679, 2018. “Diagnostic inferencing via improving clinical concept extraction with
[223] C.-H. Chang, M. Mai, and A. Goldenberg, “Dynamic measurement deep reinforcement learning: A preliminary study,” in Machine Learn-
scheduling for adverse event forecasting using deep rl,” arXiv preprint ing for Healthcare Conference, 2017, pp. 271–285.
arXiv:1812.00268, 2018. [247] A. Bernstein and E. Burnaev, “Reinforcement learning in computer
[224] E. F. Krakow, M. Hemmer, T. Wang, B. Logan, M. Arora, S. Spellman, vision,” in CMV’17, vol. 10696. International Society for Optics and
D. Couriel, A. Alousi, J. Pidala, M. Last et al., “Tools for the Photonics, 2018, p. 106961S.
precision medicine era: How to develop highly personalized treatment [248] G. W. Taylor, “A reinforcement learning framework for parameter
recommendations from cohort and registry data using q-learning,” control in computer vision applications,” in Computer and Robot
American journal of epidemiology, vol. 186, no. 2, pp. 160–172, 2017. Vision, 2004. Proceedings. First Canadian Conference on. IEEE,
[225] Y. Liu, B. Logan, N. Liu, Z. Xu, J. Tang, and Y. Wang, “Deep 2004, pp. 496–503.
reinforcement learning for dynamic treatment regimes on medical [249] F. Sahba, H. R. Tizhoosh, and M. M. Salama, “A reinforcement learning
registry data,” in IEEE ICHI’17. IEEE, 2017, pp. 380–385. framework for medical image segmentation,” in IJCNN, vol. 6, 2006,
[226] J. E. Gotts and M. A. Matthay, “Sepsis: pathophysiology and clinical pp. 511–517.
management,” Bmj, vol. 353, p. i1585, 2016. [250] ——, “Application of opposition-based reinforcement learning in
[227] A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, image segmentation,” in 2007 IEEE Symposium on Computational
M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark, Intelligence in Image and Signal Processing. IEEE, 2007, pp. 246–
“Mimic-iii, a freely accessible critical care database,” Scientific Data, 251.
vol. 3, p. 160035, 2016. [251] ——, “Application of reinforcement learning for segmentation of
[228] S. Saria, “Individualized sepsis treatment using reinforcement learn- transrectal ultrasound images,” BMC Medical Imaging, vol. 8, no. 1,
ing,” Nature medicine, vol. 24, no. 11, p. 1641, 2018. p. 8, 2008.
[229] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning [252] F. Sahba, “Object segmentation in image sequences using reinforce-
with double q-learning.” in AAAI, vol. 2. Phoenix, AZ, 2016, p. 5. ment learning,” in CSCI’16. IEEE, 2016, pp. 1416–1417.
[230] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, [253] D. Liu and T. Jiang, “Deep reinforcement learning for surgical gesture
“Dueling network architectures for deep reinforcement learning,” in segmentation and classification,” in International Conference on Med-
International Conference on Machine Learning, 2016, pp. 1995–2003. ical Image Computing and Computer-Assisted Intervention. Springer,
[231] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience 2018, pp. 247–255.
replay,” arXiv preprint arXiv:1511.05952, 2015. [254] F. C. Ghesu, B. Georgescu, T. Mansi, D. Neumann, J. Hornegger, and
[232] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox- D. Comaniciu, “An artificial agent for anatomical landmark detection
imal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, in medical images,” in International Conference on Medical Image
2017. Computing and Computer-Assisted Intervention. Springer, 2016, pp.
[233] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, 229–237.
D. Silver, and D. Wierstra, “Continuous control with deep reinforce- [255] F. C. Ghesu, B. Georgescu, Y. Zheng, S. Grbic, A. Maier, J. Hornegger,
ment learning,” arXiv preprint arXiv:1509.02971, 2015. and D. Comaniciu, “Multi-scale deep reinforcement learning for real-
[234] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward time 3d-landmark detection in ct scans,” IEEE Transactions on Pattern
transformations: Theory and application to reward shaping,” in ICML, Analysis and Machine Intelligence, 2017.
vol. 99, 1999, pp. 278–287. [256] F. C. Ghesu, B. Georgescu, S. Grbic, A. Maier, J. Hornegger, and
[235] W. M. Haddad, J. M. Bailey, B. Gholami, and A. R. Tannenbaum, D. Comaniciu, “Towards intelligent robust detection of anatomical
“Clinical decision support and closed-loop control for intensive care structures in incomplete volumetric data,” Medical Image Analysis,
unit sedation,” Asian Journal of Control, vol. 20, no. 5, pp. 1343–1350, vol. 48, pp. 203–213, 2018.
2012. [257] M. Etcheverry, B. Georgescu, B. Odry, T. J. Re, S. Kaushik, B. Geiger,
[236] M. M. Ghassemi, S. E. Richter, I. M. Eche, T. W. Chen, J. Danziger, and N. Mariappan, S. Grbic, and D. Comaniciu, “Nonlinear adaptively
L. A. Celi, “A data-driven approach to optimized medication dosing: a learned optimization for object localization in 3d medical images,” in
focus on heparin,” Intensive Care Medicine, vol. 40, no. 9, pp. 1332– Deep Learning in Medical Image Analysis and Multimodal Learning
1339, 2014. for Clinical Decision Support. Springer, 2018, pp. 254–262.
[237] S. Jaber, G. Bellani, L. Blanch, A. Demoule, A. Esteban, L. Gattinoni, [258] A. Alansary, O. Oktay, Y. Li, L. Le Folgoc, B. Hou, G. Vaillant,
C. Guérin, N. Hill, J. G. Laffey, S. M. Maggiore et al., “The intensive B. Glocker, B. Kainz, and D. Rueckert, “Evaluating reinforcement
care medicine research agenda for airways, invasive and noninvasive learning agents for anatomical landmark detection,” 2018.
mechanical ventilation,” Intensive Care Medicine, vol. 43, no. 9, pp. [259] A. Alansary, L. L. Folgoc, G. Vaillant, O. Oktay, Y. Li, W. Bai,
1352–1365, 2017. J. Passerat-Palmbach, R. Guerrero, K. Kamnitsas, B. Hou et al.,
[238] A. De Jong, G. Citerio, and S. Jaber, “Focus on ventilation and airway “Automatic view planning with multi-scale deep reinforcement learning
management in the icu,” Intensive Care Medicine, vol. 43, no. 12, pp. agents,” arXiv preprint arXiv:1806.03228, 2018.
1912–1915, 2017. [260] W. A. Al and I. D. Yun, “Partial policy-based reinforcement learning for
[239] E. National Academies of Sciences, Medicine et al., Improving diag- anatomical landmark localization in 3d medical images,” arXiv preprint
nosis in health care. National Academies Press, 2016. arXiv:1807.02908, 2018.
[240] S. K. Rai and K. Sowmya, “A review on use of machine learning [261] R. Liao, S. Miao, P. de Tournemire, S. Grbic, A. Kamen, T. Mansi,
techniques in diagnostic health-care,” Artificial Intelligent Systems and and D. Comaniciu, “An artificial agent for robust image registration.”
Machine Learning, vol. 10, no. 4, pp. 102–107, 2018. in AAAI, 2017, pp. 4168–4175.
[241] M. Fatima and M. Pasha, “Survey of machine learning algorithms [262] K. Ma, J. Wang, V. Singh, B. Tamersoy, Y.-J. Chang, A. Wimmer,
for disease diagnostic,” Journal of Intelligent Learning Systems and and T. Chen, “Multimodal image registration with deep context re-
Applications, vol. 9, no. 01, p. 1, 2017. inforcement learning,” in International Conference on Medical Image
Computing and Computer-Assisted Intervention. Springer, 2017, pp. [283] K. Li and J. W. Burdick, “A function approximation method for model-
240–248. based high-dimensional inverse reinforcement learning,” arXiv preprint
[263] J. Krebs, T. Mansi, H. Delingette, L. Zhang, F. C. Ghesu, S. Miao, A. K. arXiv:1708.07738, 2017.
Maier, N. Ayache, R. Liao, and A. Kamen, “Robust non-rigid registra- [284] B. Thananjeyan, A. Garg, S. Krishnan, C. Chen, L. Miller, and
tion through agent-based action learning,” in International Conference K. Goldberg, “Multilateral surgical pattern cutting in 2d orthotropic
on Medical Image Computing and Computer-Assisted Intervention. gauze with deep reinforcement learning policies for tensioning,” in
Springer, 2017, pp. 344–352. IEEE ICRA’17. IEEE, 2017, pp. 2371–2378.
[264] G. Maicas, G. Carneiro, A. P. Bradley, J. C. Nascimento, and I. Reid, [285] T. T. Nguyen, N. D. Nguyen, F. Bello, and S. Nahavandi, “A new
“Deep reinforcement learning for active breast lesion detection from tensioning method using deep reinforcement learning for surgical
dce-mri,” in International Conference on Medical Image Computing pattern cutting,” arXiv preprint arXiv:1901.03327, 2019.
and Computer-Assisted Intervention. Springer, 2017, pp. 665–673. [286] J. Chen, H. Y. Lau, W. Xu, and H. Ren, “Towards transferring skills
[265] P. Zhang, F. Wang, and Y. Zheng, “Deep reinforcement learning for to flexible surgical robots with programming by demonstration and
vessel centerline tracing in multi-modality 3d volumes,” in Inter- reinforcement learning,” in ICACI’16. IEEE, 2016, pp. 378–384.
national Conference on Medical Image Computing and Computer- [287] D. Baek, M. Hwang, H. Kim, and D.-S. Kwon, “Path planning
Assisted Intervention. Springer, 2018, pp. 755–763. for automation of surgery robot based on probabilistic roadmap and
[266] S. M. B. Netto, V. R. C. Leite, A. C. Silva, A. C. de Paiva, and reinforcement learning,” in 2018 15th International Conference on
A. de Almeida Neto, “Application on reinforcement learning for diag- Ubiquitous Robots (UR). IEEE, 2018, pp. 342–347.
nosis based on medical image,” in Reinforcement Learning. InTech, [288] K. Li, M. Rath, and J. W. Burdick, “Inverse reinforcement learning via
2008. function approximation for clinical motion analysis,” in IEEE ICRA’18.
[267] S. J. Fakih and T. K. Das, “Lead: a methodology for learning efficient IEEE, 2018, pp. 610–617.
approaches to medical diagnosis,” IEEE Transactions on Information [289] K. M. Jagodnik, P. S. Thomas, A. J. van den Bogert, M. S. Bran-
Technology in Biomedicine, vol. 10, no. 2, pp. 220–228, 2006. icky, and R. F. Kirsch, “Training an actor-critic reinforcement learn-
[268] K. Roberts, M. S. Simpson, E. M. Voorhees, and W. R. Hersh, ing controller for arm movement using human-generated rewards,”
“Overview of the trec 2016 clinical decision support track.” in TREC, IEEE Transactions on Neural Systems and Rehabilitation Engineering,
2016. vol. 25, no. 10, pp. 1892–1905, 2017.
[269] Y. Ling, S. A. Hasan, V. Datla, A. Qadir, K. Lee, J. Liu, and O. Farri, [290] R. S. Istepanian, N. Y. Philip, and M. G. Martini, “Medical qos
“Learning to diagnose: Assimilating clinical narratives using deep provision based on reinforcement learning in ultrasound streaming
reinforcement learning,” in Proceedings of the Eighth International over 3.5 g wireless systems,” IEEE Journal on Selected areas in
Joint Conference on Natural Language Processing (Volume 1: Long Communications, vol. 27, no. 4, 2009.
Papers), vol. 1, 2017, pp. 895–905. [291] A. Alinejad, N. Y. Philip, and R. S. Istepanian, “Cross-layer ultrasound
[270] R. Ballard-Barbash, S. H. Taplin, B. C. Yankaskas, V. L. Ernster, R. D. video streaming over mobile wimax and hsupa networks,” IEEE
Rosenberg, P. A. Carney, W. E. Barlow, B. M. Geller, K. Kerlikowske, transactions on Information Technology in Biomedicine, vol. 16, no. 1,
B. K. Edwards et al., “Breast cancer surveillance consortium: a national pp. 31–39, 2012.
mammography screening and outcomes database.” American Journal [292] K. Ragnarsson, “Functional electrical stimulation after spinal cord
of Roentgenology, vol. 169, no. 4, pp. 1001–1008, 1997. injury: current use, therapeutic effects and future directions,” Spinal
[271] T. Chu, J. Wang, and J. Chen, “An adaptive online learning framework cord, vol. 46, no. 4, p. 255, 2008.
for practical breast cancer diagnosis,” in Medical Imaging 2016: [293] P. S. Thomas, M. Branicky, A. Van Den Bogert, and K. Jagodnik,
Computer-Aided Diagnosis, vol. 9785. International Society for Optics “Creating a reinforcement learning controller for functional electrical
and Photonics, 2016, p. 978524. stimulation of a human arm,” in The Yale Workshop on Adaptive and
[272] K.-F. Tang, H.-C. Kao, C.-N. Chou, and E. Y. Chang, “Inquire and di- Learning Systems, vol. 49326. NIH Public Access, 2008, p. 1.
agnose: Neural symptom checking ensemble using deep reinforcement [294] P. S. Thomas, A. J. van den Bogert, K. M. Jagodnik, and M. S.
learning,” in Proceedings of NIPS Workshop on Deep Reinforcement Branicky, “Application of the actor-critic architecture to functional
Learning, 2016. electrical stimulation control of a human arm.” in IAAI, 2009.
[273] H.-C. Kao, K.-F. Tang, and E. Y. Chang, “Context-aware symptom [295] I. Kola and J. Landis, “Can the pharmaceutical industry reduce attrition
checking for disease diagnosis using hierarchical reinforcement learn- rates?” Nature reviews Drug discovery, vol. 3, no. 8, p. 711, 2004.
ing,” 2018. [296] G. Schneider, De novo molecular design. John Wiley & Sons, 2013.
[274] E. Y. Chang, M.-H. Wu, K.-F. T. Tang, H.-C. Kao, and C.-N. Chou, [297] M. Olivecrona, T. Blaschke, O. Engkvist, and H. Chen, “Molecular
“Artificial intelligence in xprize deepq tricorder,” in Proceedings of the de-novo design through deep reinforcement learning,” Journal of
2nd International Workshop on Multimedia for Personal Health and Cheminformatics, vol. 9, no. 1, p. 48, 2017.
Health Care. ACM, 2017, pp. 11–18. [298] A. Serrano, B. Imbernón, H. Pérez-Sánchez, J. M. Cecilia, A. Bueno-
[275] E. Y. Chang, “Deepq: Advancing healthcare through artificial intel- Crespo, and J. L. Abellán, “Accelerating drugs discovery with deep
ligence and virtual reality,” in Proceedings of the 2017 ACM on reinforcement learning: An early approach,” in Proceedings of the 47th
Multimedia Conference. ACM, 2017, pp. 1068–1068. International Conference on Parallel Processing Companion. ACM,
[276] Z. Wei, Q. Liu, B. Peng, H. Tou, T. Chen, X. Huang, K.-F. Wong, 2018, p. 6.
and X. Dai, “Task-oriented dialogue system for automatic diagnosis,” [299] D. Neil, M. Segler, L. Guasch, M. Ahmed, D. Plumbley, M. Sellwood,
in Proceedings of the 56th Annual Meeting of the Association for and N. Brown, “Exploring deep recurrent models with reinforcement
Computational Linguistics (Volume 2: Short Papers), vol. 2, 2018, pp. learning for molecule design,” 2018.
201–207. [300] M. Popova, O. Isayev, and A. Tropsha, “Deep reinforcement learning
[277] F. Tang, K. Lin, I. Uchendu, H. H. Dodge, and J. Zhou, “Improving for de novo drug design,” Science Advances, vol. 4, no. 7, p. eaap7885,
mild cognitive impairment prediction via reinforcement learning and 2018.
dialogue simulation,” arXiv preprint arXiv:1802.06428, 2018. [301] E. Yom-Tov, G. Feraru, M. Kozdoba, S. Mannor, M. Tennenholtz,
[278] H.-J. Schuetz and R. Kolisch, “Approximate dynamic programming and I. Hochberg, “Encouraging physical activity in patients with
for capacity allocation in the service industry,” European Journal of diabetes: Intervention using a reinforcement learning system,” Journal
Operational Research, vol. 218, no. 1, pp. 239–250, 2012. of Medical Internet Research, vol. 19, no. 10, 2017.
[279] Z. Huang, W. M. van der Aalst, X. Lu, and H. Duan, “Reinforcement [302] I. Hochberg, G. Feraru, M. Kozdoba, S. Mannor, M. Tennenholtz, and
learning based resource allocation in business process management,” E. Yom-Tov, “A reinforcement learning system to encourage physical
Data & Knowledge Engineering, vol. 70, no. 1, pp. 127–145, 2011. activity in diabetes patients,” arXiv preprint arXiv:1605.04070, 2016.
[280] B. Zeng, A. Turkcan, J. Lin, and M. Lawley, “Clinic scheduling [303] A. Baniya, S. Herrmann, Q. Qiao, and H. Lu, “Adaptive interven-
models with overbooking for patients with heterogeneous no-show tions treatment modelling and regimen optimization using sequential
probabilities,” Annals of Operations Research, vol. 178, no. 1, pp. 121– multiple assignment randomized trials (smart) and q-learning,” in
144, 2010. Proceedings of IIE Annual Conference, 2017, pp. 1187–1192.
[281] T. S. M. T. Gomes, “Reinforcement learning for primary care e [304] E. M. Forman, S. G. Kerrigan, M. L. Butryn, A. S. Juarascio, S. M.
appointment scheduling,” 2017. Manasse, S. Ontañón, D. H. Dallal, R. J. Crochiere, and D. Moskow,
[282] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, “Can the artificial intelligence technique of reinforcement learning use
D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep continuously-monitored digital data to optimize treatment for weight
reinforcement learning,” in ICML, 2016, pp. 1928–1937. loss?” Journal of Behavioral Medicine, pp. 1–15, 2018.
[305] O. Gottesman, F. Johansson, J. Meier, J. Dent, D. Lee, S. Srinivasan, [330] S. Bhupatiraju, K. K. Agrawal, and R. Singh, “Towards mixed op-
L. Zhang, Y. Ding, D. Wihl, X. Peng et al., “Evaluating reinforcement timization for reinforcement learning with program synthesis,” arXiv
learning algorithms in observational health settings,” arXiv preprint preprint arXiv:1807.00403, 2018.
arXiv:1805.12298, 2018. [331] Z. C. Lipton, “The doctor just won’t accept that!” arXiv preprint
[306] A. Raghu, O. Gottesman, Y. Liu, M. Komorowski, A. Faisal, F. Doshi- arXiv:1711.08037, 2017.
Velez, and E. Brunskill, “Behaviour policy estimation in off-policy pol- [332] F. Maes, R. Fonteneau, L. Wehenkel, and D. Ernst, “Policy search
icy evaluation: Calibration matters,” arXiv preprint arXiv:1807.01066, in a space of simple closed-form formulas: towards interpretability
2018. of reinforcement learning,” in International Conference on Discovery
[307] R. Jeter, C. Josef, S. Shashikumar, and S. Nemati, “Does the” artificial Science. Springer, 2012, pp. 37–51.
intelligence clinician” learn optimal treatment strategies for sepsis in [333] A. Verma, V. Murali, R. Singh, P. Kohli, and S. Chaudhuri, “Pro-
intensive care?” arXiv preprint arXiv:1902.03271, 2019. grammatically interpretable reinforcement learning,” in International
[308] D. Koller, N. Friedman, and F. Bach, Probabilistic graphical models: Conference on Machine Learning, 2018, pp. 5052–5061.
principles and techniques. MIT press, 2009. [334] F. Elizalde, E. Sucar, J. Noguez, and A. Reyes, “Generating explana-
[309] J. L. Jameson and D. L. Longo, “Precision medicinełpersonalized, tions based on markov decision processes,” in Mexican International
problematic, and promising,” Obstetrical & Gynecological Survey, Conference on Artificial Intelligence. Springer, 2009, pp. 51–62.
vol. 70, no. 10, pp. 612–614, 2015. [335] Z. Che, S. Purushotham, R. Khemani, and Y. Liu, “Distilling knowledge
[310] J. Fürnkranz and E. Hüllermeier, “Preference learning,” in Encyclope- from deep networks with applications to healthcare domain,” arXiv
dia of Machine Learning. Springer, 2011, pp. 789–795. preprint arXiv:1512.03542, 2015.
[311] D. J. Lizotte, M. H. Bowling, and S. A. Murphy, “Efficient reinforce- [336] M. Wu, M. C. Hughes, S. Parbhoo, M. Zazzi, V. Roth, and F. Doshi-
ment learning with multiple reward functions for randomized controlled Velez, “Beyond sparsity: Tree regularization of deep models for in-
trial analysis,” in ICML’10. Citeseer, 2010, pp. 695–702. terpretability,” in Thirty-Second AAAI Conference on Artificial Intelli-
[312] M. Herman, T. Gindele, J. Wagner, F. Schmitt, and W. Burgard, “In- gence, 2018.
verse reinforcement learning with simultaneous estimation of rewards [337] A. E. Gaweda, M. K. Muezzinoglu, G. R. Aronoff, A. A. Jacobs,
and dynamics,” in Artificial Intelligence and Statistics, 2016, pp. 102– J. M. Zurada, and M. E. Brier, “Incorporating prior knowledge into
110. q-learning for drug delivery individualization,” in Fourth International
[313] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welin- Conference on Machine Learning and Applications. IEEE, 2005, pp.
der, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba, “Hindsight 6–pp.
experience replay,” in NIPS’17, 2017, pp. 5048–5058. [338] A. Holzinger, “Interactive machine learning for health informatics:
[314] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Sil- when do we need the human-in-the-loop?” Brain Informatics, vol. 3,
ver, and K. Kavukcuoglu, “Reinforcement learning with unsupervised no. 2, pp. 119–131, 2016.
auxiliary tasks,” arXiv preprint arXiv:1611.05397, 2016. [339] D. Abel, J. Salvatier, A. Stuhlmüller, and O. Evans, “Agent-
[315] S. Racanière, T. Weber, D. Reichert, L. Buesing, A. Guez, D. J. agnostic human-in-the-loop reinforcement learning,” arXiv preprint
Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li et al., “Imagination- arXiv:1701.04079, 2017.
augmented agents for deep reinforcement learning,” in Advances in [340] E. J. Topol, “High-performance medicine: the convergence of human
neural information processing systems, 2017, pp. 5690–5701. and artificial intelligence,” Nature medicine, vol. 25, no. 1, p. 44, 2019.
[316] N. Jiang and L. Li, “Doubly robust off-policy value evaluation for [341] C. Yu, D. Wang, T. Yang, W. Zhu, Y. Li, H. Ge, and J. Ren, “Adaptively
reinforcement learning,” in Proceedings of the 33rd International shaping reinforcement learning agents via human reward,” in Pacific
Conference on International Conference on Machine Learning-Volume Rim International Conference on Artificial Intelligence. Springer,
48. JMLR. org, 2016, pp. 652–661. 2018, pp. 85–97.
[317] M. Kearns and S. Singh, “Near-optimal reinforcement learning in [342] S. Griffith, K. Subramanian, J. Scholz, C. L. Isbell, and A. L. Thomaz,
polynomial time,” Machine learning, vol. 49, no. 2-3, pp. 209–232, “Policy shaping: Integrating human feedback with reinforcement learn-
2002. ing,” in Advances in neural information processing systems, 2013, pp.
[318] J. Pazis and R. Parr, “Efficient pac-optimal exploration in concurrent, 2625–2633.
continuous state mdps with delayed updates.” in AAAI, 2016, pp. 1977– [343] J. Shu, Z. Xu, and D. Meng, “Small sample learning in big data era,”
1985. arXiv preprint arXiv:1808.04572, 2018.
[319] M. Dimakopoulou and B. Van Roy, “Coordinated exploration in concur- [344] S. W. Carden and J. Livsey, “Small-sample reinforcement learning:
rent reinforcement learning,” in International Conference on Machine Improving policies using synthetic data 1,” Intelligent Decision Tech-
Learning, 2018, pp. 1270–1278. nologies, vol. 11, no. 2, pp. 167–175, 2017.
[320] Z. Guo and E. Brunskill, “Concurrent pac rl.” in AAAI, 2015, pp. 2624– [345] J. Salamon and J. P. Bello, “Deep convolutional neural networks and
2630. data augmentation for environmental sound classification,” IEEE Signal
[321] J. Fu, J. Co-Reyes, and S. Levine, “Ex2: Exploration with exemplar Processing Letters, vol. 24, no. 3, pp. 279–283, 2017.
models for deep reinforcement learning,” in Advances in Neural [346] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
Information Processing Systems, 2017, pp. 2574–2584. S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in
[322] H. Tang, R. Houthooft, D. Foote, A. Stooke, O. X. Chen, Y. Duan, Advances in neural information processing systems, 2014, pp. 2672–
J. Schulman, F. DeTurck, and P. Abbeel, “# exploration: A study of 2680.
count-based exploration for deep reinforcement learning,” in NIPS’17, [347] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a
2017, pp. 2750–2759. neural network,” stat, vol. 1050, p. 9, 2015.
[323] B. C. Stadie, S. Levine, and P. Abbeel, “Incentivizing exploration in [348] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman,
reinforcement learning with deep predictive models,” arXiv preprint “Building machines that learn and think like people,” Behavioral and
arXiv:1507.00814, 2015. Brain Sciences, vol. 40, 2017.
[324] T. Mannucci, E.-J. van Kampen, C. de Visser, and Q. Chu, “Safe [349] Y.-L. Zheng, X.-R. Ding, C. C. Y. Poon, B. P. L. Lo, H. Zhang, X.-L.
exploration algorithms for reinforcement learning controllers,” IEEE Zhou, G.-Z. Yang, N. Zhao, and Y.-T. Zhang, “Unobtrusive sensing
transactions on neural networks and learning systems, vol. 29, no. 4, and wearable devices for health informatics,” IEEE Transactions on
pp. 1069–1081, 2018. Biomedical Engineering, vol. 61, no. 5, pp. 1538–1554, 2014.
[325] C. A. Merck and S. Kleinberg, “Causal explanation under indetermin- [350] G. Acampora, D. J. Cook, P. Rashidi, and A. V. Vasilakos, “A survey
ism: A sampling approach.” in AAAI, 2016, pp. 1037–1043. on ambient intelligence in healthcare,” Proceedings of the IEEE, vol.
[326] J. Woodward, Making things happen: A theory of causal explanation. 101, no. 12, pp. 2470–2494, 2013.
Oxford university press, 2005. [351] F. Zhu, J. Guo, R. Li, and J. Huang, “Robust actor-critic contextual
[327] S. L. Morgan and C. Winship, Counterfactuals and causal inference. bandit for mobile health (mhealth) interventions,” in Proceedings of the
Cambridge University Press, 2015. 2018 ACM International Conference on Bioinformatics, Computational
[328] D. Dash, M. Voortman, and M. De Jongh, “Sequences of mechanisms Biology, and Health Informatics. ACM, 2018, pp. 492–501.
for causal reasoning in artificial intelligence.” in IJCAI, 2013, pp. 839– [352] F. Zhu, J. Guo, Z. Xu, P. Liao, L. Yang, and J. Huang, “Group-
845. driven reinforcement learning for personalized mhealth intervention,” in
[329] Z. C. Lipton, “The mythos of model interpretability,” Communications International Conference on Medical Image Computing and Computer-
of the ACM, vol. 61, no. 10, pp. 36–43, 2018. Assisted Intervention. Springer, 2018, pp. 590–598.
[353] H. Lei, A. Tewari, and S. Murphy, “An actor-critic contextual bandit al- in Neural Information Processing Systems, vol. 27, 2014.
gorithm for personalized interventions using mobile devices,” Advances