Automated Planning and Acting
Malik Ghallab
LAAS-CNRS, University of Toulouse, France
Dana Nau
University of Maryland, USA
Paolo Traverso
FBK ICT IRST, Trento, Italy
Manuscript, lecture slides, and other materials are at our web site.
July 5, 2016
1 Introduction
1.1 Purpose and Motivations
1.1.1 First Intuition
1.1.2 Motivations
1.1.3 Focus and Scope
1.2 Conceptual View of an Actor
1.2.1 A Simple Architecture
1.2.2 Hierarchical and Continual Online Deliberation
1.2.3 Assumptions
1.3 Deliberation Models and Functions
1.3.1 Descriptive and Operational Models of Actions
1.3.2 Description of States for Deliberation
1.3.3 Planning Versus Acting
1.3.4 Other Deliberation Functions
1.4 Illustrative Examples
1.4.1 A Factotum Service Robot
1.4.2 A Complex Operations Manager
1.5 Outline of the Book
Bibliography
Index
6.1 Part of the state space for the problem in Example 6.4.
6.2 A safe solution for Example 6.4 and its Graph(s0, π).
6.1 V(l) after the first three and last three iterations of VI.
6.2 Iterations of AO* on the example of Figure 6.6.
Notation               Meaning
a, A                   action, set of actions
α, 𝒜                   action template, set of action templates
cost(a), cost(s, a)    cost of a, cost of a in state s
cost(π), cost(s, π)    cost of π, cost of π in state s
Dom(f), Dom(π)         domain of a function or plan
eff(a)                 effects of action a
g, Sg                  goal conditions, goal states
γ(s, a)                progression, i.e., predicted result of applying a in s
γ⁻¹(g, a)              regression, i.e., conditions needed for a to produce g
Over ten years ago, Malik Ghallab, Dana Nau, and Paolo Traverso gave us
the first (and to date only) comprehensive textbook dedicated to the field
of Automated Planning, providing a much needed resource for students,
researchers, and practitioners. Since then, this rich field has continued to
evolve rapidly. There is now a unified understanding of what once seemed
disparate work on classical planning. Models and methods to deal with
time, resources, continuous change, multiple agents, and uncertainty have
substantially matured. Cross-fertilization with other fields such as software
verification, optimization, machine learning, and robotics has become the
rule rather than the exception. A phenomenal range of applications could
soon be within reach, given the right future emphasis for the field.
Today, the authors are back with a new book, Automated Planning and
Acting. As the title indicates, this is not a mere second edition of the older
book. In line with the authors' analysis of where the future emphasis should
lie for the field to realize its full impact, the book covers deliberative compu-
tational techniques for both planning and acting, that is for deciding which
actions to perform and also how to perform them. Automated Planning and
Acting is more than a graduate textbook or a reference book. Not only
do the authors outstandingly discharge their duties of educating the reader
about the basics and much of the recent progress in the field, but they also
propose a new framework from which the community can start to intensify
research on deliberative acting and its integration with planning.
These aims are reflected in the book's content. The authors put the
integration of planning and acting at the forefront by dedicating an entire
chapter to a unified hierarchical model and refinement procedures that suit
the needs of both planning and acting functions. Each chapter devoted to
a particular class of representations also includes significant material on the
integration of planning and acting using these representations. Overall, the
book is more focused than its predecessor, and explores in even greater depth
models and approaches motivated by the needs of planning and acting in
the real world, such as handling time and uncertainty. At the same time,
the authors successfully balance breadth and depth by providing an elegant,
concise synthesis of a larger body of work than in their earlier text.
There is no doubt that Automated Planning and Acting will be the text
I require my students to read when they first start, and the go-to book on
my shelf for my own reference. As a timely source of motivation for game-
changing research on the integration of planning and acting, it will also help
shape the field for the next decade.
Sylvie Thiébaux
The Australian National University
This book is about methods and techniques that a computational agent can
use for deliberative planning and acting, that is, for deciding both which
actions to perform and how to perform them, to achieve some objective.
The study of deliberation has several scientific and engineering motivations.
Understanding deliberation is an objective for most cognitive sciences. In
artificial intelligence research, this is done by modeling deliberation through
computational approaches to enable it and to allow it to be explained. Fur-
thermore, the investigated capabilities are better understood by mapping
concepts and theories into designed systems and experiments to test empir-
ically, measure, and qualify the proposed models.
The engineering motivation for studying deliberation is to build systems
that exhibit deliberation capabilities and develop technologies that address
socially useful needs. A technological system needs deliberation capabilities
if it must autonomously perform a set of tasks that are too diverse or must
be done in environments that are too diverse to engineer those tasks into
innate behaviors. Autonomy and diversity of tasks and environments is a
critical feature in many applications, including robotics (e.g., service and
personal robots; rescue and exploration robots; autonomous space stations,
satellites, or vehicles), complex simulation systems (e.g., tutoring, training
or entertainment), or complex infrastructure management (e.g., industrial
or energy plants, transportation networks, urban facilities).
is, a coherent synthesis of the state of the art, with the development of
new material. Most of this new material is presented in comprehensive
detail (e.g., in Chapter 3) consistent with a textbook use. In a few
parts (e.g., Section 4.5.3), this new material is in preliminary form and
serves as an invitation for further research.
This book can be used as a graduate-level textbook and as an infor-
mation source for scientists and professionals in the field. We assume the
reader to be familiar with the basic concepts of algorithms and data struc-
tures at the level that one might get in an undergraduate-level computer
science curriculum. Prior knowledge of heuristic search techniques would
also be helpful, but is not strictly necessary because the appendices provide
overviews of needed tools.
A complete set of lecture slides for this book and other auxiliary materials
are available online.
Acknowledgments
We are thankful to several friends and colleagues who gave us very valu-
able feedback on parts of this book. Among these are Hector Geffner,
Robert Goldman, Patrik Haslum, Jörg Hoffmann, Joachim Hertzberg, Félix
Ingrand, Ugur Kuter, Marco Pistore, Mark Roberts, Vikas Shivashankar,
Sylvie Thiébaux, and Qiang Yang.
We also wish to acknowledge the support of our respective organizations,
which provided the support and facilities that helped to make this work
possible: LAAS-CNRS in Toulouse, France; the University of Maryland in
College Park, Maryland; and FBK ICT-IRST in Trento, Italy. Dana Nau
thanks ONR for their support of his planning work, and the students who
took courses from rough drafts of this book.
Finally, we wish to acknowledge the support of our families, who re-
mained patient during a project that consumed much more of our time and
attention than we had originally anticipated.
Paolo Traverso is the Director of FBK ICT IRST, the Research Center
at Fondazione Bruno Kessler (FBK). Paolo has worked in the advanced tech-
nology groups of companies in Chicago, London, and Milan, leading projects
in safety critical systems, data and knowledge management, and service ori-
ented applications. His research activity is mainly focused on planning and
acting under uncertainty. His contributions to research in automated plan-
ning include the technique called "planning via symbolic model checking."
He has published more than one hundred papers in artificial intelligence. He
is an ECCAI Fellow.
Introduction
This chapter introduces informally the concepts and technical material de-
veloped in the rest of the book. It discusses in particular the notion of
deliberation, which is at the core of the interaction between planning and
acting. Section 1.1 motivates our study of deliberation from a computa-
tional viewpoint and delineates the scope of the book. We then introduce
a conceptual view of an artificial entity, called an actor, capable of acting
deliberately on its environment, and discuss our main assumptions. Delib-
eration models and functions are presented next. Section 1.4 describes two
application domains that will be simplified into illustrative examples of the
techniques covered in the rest of the book.
process, both before and during acting, that addresses questions such as the
following:
If an agent performs an action, what will the result be?
Which actions should an agent undertake, and how should the agent
perform the chosen actions to produce a desired effect?
Such reasoning allows the agent to predict, to decide what to do and how to do
it, and to combine several actions that contribute jointly to the objective.
The reasoning consists of using predictive models of the agents environment
and capabilities to simulate what will happen if the agent performs an action.
Let us illustrate these abstract notions intuitively.
1.1.2 Motivations
We address the issue of how an actor acts deliberately by following the
approaches and methods of artificial intelligence (AI). Our purpose proceeds
from the usual motivations of AI research, namely:
To understand, through effective formal models, the cognitive capa-
bilities that correspond to acting deliberately.
To build actors that exhibit these capabilities.
To develop technologies that address socially useful needs.
Understanding deliberation is an objective for most cognitive sciences.
The specifics of AI are to model deliberation through computational ap-
proaches that allow us to explain as well as to generate the modeled capa-
bilities. Furthermore, the investigated capabilities are better understood by
mapping concepts and theories into designed systems and experiments to
test empirically, measure, and qualify the proposed models. The technologi-
cal motivation for endowing an artificial actor with deliberation capabilities
stems from two factors:
autonomy, meaning that the actor performs its intended functions
without being directly operated by a person, and
2. In the interesting classification of Dennett [150], these species are called Popperian,
in reference to the epistemologist Karl Popper.
diversity in the tasks the actor can perform and the environments in
which it can operate.
Without autonomy, a directly operated or teleoperated device does not
usually need to deliberate. It simply extends the acting and sensing capa-
bilities of a human operator who is in charge of understanding and decision
making, possibly with the support of advice and planning tools, for example,
as in surgical robotics and other applications of teleoperation.
An autonomous system may not need deliberation if it operates only
in the fully specified environment for which it has been designed. Man-
ufacturing robots autonomously perform tasks such as painting, welding,
assembling, or servicing a warehouse without much deliberation. Similarly,
a vending machine or a driverless train operates autonomously without a
need for deliberation. For these and similar examples of automation, delib-
eration is performed by the designer. The system and its environment are
engineered so that the only variations that can occur are those accounted for
at the design stage in the system's predefined functioning envelope. Diver-
sity in the environment is not expected. A state outside of the functioning
envelope puts the system into a failure mode in which a person takes delib-
erate actions.
Similarly, a device designed for a unique specialized task may perform
it autonomously without much deliberation, as long as the variations in its
environment are within its designed range. For example, a vacuum-cleaning
or lawn mowing robot does not deliberate, but it can cope autonomously
with its specialized tasks in a reasonable range of lawns or floors. However,
it may cease to function properly when it encounters a slippery floor, a steep
slope, or any condition outside of the range for which it was designed.
When a designer can account, within some functioning envelope, for all
the environments and tasks a system will face and when a person can be in
charge of deliberating outside of this envelope, by means of teleoperation or
reprogramming, then deliberation generally is not needed in the system it-
self. Such a system will be endowed with a library of reactive behaviors (e.g.,
as the bird's visual target tracking in Example 1.1) that efficiently cover its
functioning envelope. However, when an autonomous actor has to face a
diversity of tasks, environments and interactions, then achieving its purpose
will require some degree of deliberation. This is the case in many robotics
applications, such as service and personal robots, rescue and exploration
robots, autonomous space stations and satellites, or even driverless cars.
This holds also for complex simulation systems used in entertainment (e.g.,
video games) or educational applications (serious games). It is equally ap-
We develop the view that planning can be needed for deliberation but is
seldom sufficient. We argue that acting goes beyond the execution of low-
level commands.
Example 1.2. Dana finishes breakfast in a hotel restaurant, and starts
going back to his room. On the way, he notices that the elevator is not on
his floor and decides to walk up the stairs. After a few steps he becomes
aware that he doesn't have his room key, which he left on the breakfast table.
He goes back to pick it up.
In this example, the actor does not need to plan the simple task of going
to his room. He continually deliberates while acting: he makes opportunistic
choices, simulates in advance and monitors his actions, stops when needed
and decides on alternate actions.
Deliberation consists of reasoning with predictive models as well as ac-
quiring these models. An actor may have to learn how to adapt to new
situations and tasks, as much as to use the models it knows about for its de-
cision making. Further, even if a problem can be addressed with the actors
generic models, it can be more efficient to transform the explicit computa-
tions with these models into low-level sensory-motor functions. Hence, it is
natural to consider learning to act as a deliberation function. Section 7.3
offers a brief survey on learning and model acquisition for planning and
acting. However, our focus is on deliberation techniques using predefined
models.
Figure 1.1: Conceptual view of an actor (a); its restriction to planning and
acting (b).
[Figure 1.2: Hierarchical refinement of the task "bring o7 to room2": an
open-door action is refined into steps such as identify type of knob, grasp
knob, turn knob, pull, maintain, and monitor, and further down into move,
ungrasp, and close commands.]
1.2.3 Assumptions
We are not seeking knowledge representation and reasoning approaches that
are effective across every kind of deliberation problem and at every level of a
hierarchically organized actor. Neither are we interested in highly specialized
actors tailored for a single niche, because deliberation is about facing diver-
sity. Instead, we are proposing a few generic approaches that can be adapted
to different classes of environments and, for a given actor, to different levels
of its deliberation. These approaches rely on restrictive assumptions that
are needed from a computational viewpoint, and that are acceptable for the
class of environments and tasks in which we are interested.
Deliberation assumptions are usually about how variable, dynamic, ob-
servable, and predictable the environment is, and what the actor knows and
perceives about it while acting. We can classify them into assumptions re-
lated to the dynamics of the environment, its observability, the uncertainty
managed in models, and how time and concurrency are handled.
Dynamics of the environment. An actor may assume to be in a static
world except for its own actions, or it may take into account exogenous
events and changes that are expected and/or observed. In both cases
the dynamics of the world may be described using discrete, continu-
ous or hybrid models. Of these, hybrid models are the most general.
Acting necessarily involves discontinuities in the interaction with the
environment,4 and these are best modeled discretely. But a purely dis-
crete model abstracts away continuous processes that may also need
to be modeled.
Observability of the environment. It is seldom the case that all the
information needed for deliberation is permanently known to the ac-
tor. Some facts or parameters may be always known, others may be
observable if specific sensing actions are performed, and others will
remain hidden. The actor may have to act on the basis of reasonable
assumptions or beliefs regarding the latter.
Uncertainty in knowledge and predictions. No actor is omniscient. It
may or may not be able to extend its knowledge with specific actions.
It may or may not be able to reason about the uncertainty regard-
ing the current state of the world and the predicted future (e.g., with
nondeterministic or probabilistic models). Abstracting away uncer-
tainty during a high-level deliberation can be legitimate if the actor
can handle it at a lower level and correct its course of action when
needed.
Time and concurrency. Every action consumes time. But delibera-
tion may or may not need to model it explicitly and reason about its
flow for the purpose of meeting deadlines, synchronizing, or handling
4. Think of the phases in a walking or grasping action.
concurrent activities.
Different chapters of the book make different assumptions about time,
concurrency, and uncertainty. Except for Section 7.4 on hybrid models, we'll
restrict ourselves to discrete approaches. This is consistent with the focus and
scope discussed in Section 1.1.3, because it is primarily in sensory-motor
functions and commands that continuous models are systematically needed.
precisely where the book is located in the shelf, which positions of your
hand and fingers are feasible, and which sequences of precise motions and
manipulations will allow you to perform the action.
Furthermore, operational models may need to include ways to respond
to exogenous events, that is, events that occur because of external factors
beyond the actors control. For example, someone might be standing in front
of the bookshelf, the stool that you intended to use to reach the book on a
high shelf might be missing, or any of a potentially huge number of other
possibilities might interfere with your plan.
In principle, descriptive models can take into account the uncertainty
caused by exogenous events, for example, through nondeterministic or prob-
abilistic models (see Chapters 5 and 6), but the need to handle exogenous
events is much more compelling for operational models. Indeed, exogenous
events are often ignored in descriptive models because it is impractical to
try to model all of the possible joint effects of actions and exogenous events,
or to plan in advance for all of the contingencies. But operational models
must have ways to respond to such events if they happen, because they can
interfere with the execution of an action. In the library example, you might
need to ask someone to move out of the way, or you might have to stand on
a chair instead of the missing stool.
Finally, an actor needs descriptive models of the available commands in
order to use them effectively, but in general it does not need their operational
models. Indeed, commands are the lower-level sensory-motor primitives em-
bedded in the execution platform; their operational models correspond to
what is implemented in these primitives. Taking this remark to the extreme,
if one assumes that every known action corresponds to an executable com-
mand, then all operational models are embedded in the execution platform
and can be ignored at the deliberation level. This assumption seldom holds.
to choose the commands for grasping and operating the handle, the actor
needs to reason about detailed parameters such as the robot's configuration
coordinates and the position and shape of the door handle. Higher up, where
the actor refines bring o7 to room2 into actions such as go to hallway
and navigate to room1, such details are not needed. It is more conve-
nient there to reason about the values of more abstract variables, such as
location(robot) = room1 or position(door) = closed. To establish correspon-
dences between these abstract variables and the detailed ones, the actor
could have definitions saying, for example, that location(robot) = room1
corresponds to a particular area in a Euclidean reference frame.
The precise organization of a hierarchy of data structures and state rep-
resentations is a well-known area in computer science (e.g., [522]). It may
take different forms in application domains such as robotics, virtual reality,
or geographic information systems. Here, we'll keep this point as simple as
possible and assume that at each part of an actor's deliberation hierarchy,
the state representation includes not only the variables used in that part
of the hierarchy (e.g., the robot's configuration coordinates at the bottom
of Figure 1.2), but also the variables used higher up in the hierarchy (e.g.,
location(robot)).
An important issue is the distinction and correspondence between pre-
dicted states and observed states. When an actor reasons about what might
happen and simulates changes of state to assess how desirable a course of
action is, it uses predicted states. When it reasons about how to perform
actions in some context, it relies on observed states; it may contrast its ob-
servations with its expectations. Predicted states are in general less detailed
than the observed one; they are obtained as a result of one or several pre-
dictions starting from an abstraction of the current observed state. To keep
the distinction clear, we'll use different notations:
s ∈ S is a predicted state;
ξ is an observed state.
Because of partial and inaccurate observations, there can be uncertainty
about the present observed state as well as about the future predicted states.
Furthermore, information in a dynamic environment is ephemeral. Some of
the values in ξ may be out-of-date: they may refer to things that the actor
previously observed but that it cannot currently observe. Thus, ξ is the
state of the actor's knowledge, rather than the true state of the world. In
general, the actor should be endowed with appropriate means to manage
the uncertainty and temporality of the data in ξ.
[Figure: planning as a search over predicted states: from a state s and an
action a, Predict yields the next state s′.]
or a virtual character); perception planning synthesizes an organized set of
sensing and interpretation actions to recognize an object or to build a three-
dimensional model of a scene; infrastructure planning synthesizes plans to
deploy and organize facilities, such as a public transportation infrastructure,
to optimize their usage or to meet the needs of a community. Many other
such examples can be given, such as flight navigation planning, satellite
configuration planning, logistics planning, or industrial process planning.
There are, however, commonalities to many forms of planning. Domain-
independent planning tries to grasp these commonalities at an abstract level,
in which actions are generic state transformation operators over a widely
applicable representation of states as relations among objects.
Domain-independent and domain-specific planning complement each
other. In a hierarchically organized actor, planning takes place at multi-
ple levels of the hierarchy. At high levels, abstract descriptions of a prob-
lem can be tackled using domain-independent planning techniques. The
example shown in Figure 1.2 may require a path planner (for moving to
locations), a manipulation planner (for grasping the door handle), and a
domain-independent planner at the higher levels of the hierarchy.
Acting involves deciding how to perform the chosen actions (with or
without the help of a planner) while reacting to the context in which the
activity takes place. Each action is considered as an abstract task to be
refined, given the current context, progressively into actions or commands
that are more concrete. Whereas planning is a search over predicted states,
acting requires a continual assessment of the current state ξ, to contrast it
with a predicted state s and adapt accordingly. Consequently, acting also
includes reacting to unexpected changes and exogenous events, which are
[Figure: the planning stage and the acting stage.]
ure 1.2. Depending on the actor's planning horizon, it may execute each
command as soon as one is planned or wait until the planning proceeds a
bit farther. Recall from Section 1.3.2 that the observed state ξ may differ
from the predicted one. Furthermore, ξ may evolve even when no commands
are being executed. Such situations may invalidate what is being planned,
necessitating replanning.
The interplay between acting and planning is relevant even if the plan-
ner synthesizes alternative courses of action for different contingencies (see
Chapters 5 and 6). Indeed, it may not be worthwhile to plan for all possible
contingencies, or the planner may not know in advance what all of them are.
These actions and commands are specified with descriptive and opera-
tional models. For example, move works if it is given waypoints in free space
or an obstacle-free path that meets kinematic and localization constraints;
the latter are, for example, visual landmarks required by action localize.
These conditions need to be checked and monitored by the robot while per-
forming the actions. Concurrency has to be managed. For example, goto
should run in parallel with detect, avoid, and localize.
Factotum needs domain-specific planners, for example, a motion planner
for move, a manipulation planner for grasp (possibly using locate, push, and
move actions). Corresponding plans are more than a sequence or a partially
ordered set of commands; they require closed-loop control and monitoring.
At the mission-preparation stage (the upper levels in Figure 1.2), it is
legitimate to view a logistics task as an organized set of abstract subtasks
for collecting, preparing, conveying, and delivering the goods. Each subtask
may be further decomposed into a sequence of still abstract actions such
as goto, take, and put. Domain-independent task planning techniques are
needed here.
However, deliberation does not end with the mission preparation stage.
A goto action can be performed in many ways depending on the environment
properties: it may or may not require a planned path; it may use different lo-
calization, path following, motion control, detection, and avoidance methods
(see the goto node in Figure 1.2). A goto after a take is possibly different
from the one before because of the held object. To perform a goto action in
different contexts, the robot relies on a collection of skills defined formally
by methods. A method specifies a way to refine an action into commands.
The same goto may start with a method (e.g., follow GPS waypoints) but
may be pursued with more adapted methods when required by the environ-
ment (indoor without GPS signal) or the context. Such a change between
methods may be a normal progression of the goto action or a retrial due to
complications. The robot also has methods for take, put, open, close, and
any other actions it may need to perform. These methods endow the robot
with operational models (its know-how) and knowledge about how to choose
the most adapted method with the right parameters.
The methods for performing actions may use complex control constructs
with concurrent processes (loops, conditionals, semaphores, multithread and
real-time locks). They can be developed from formal specifications in some
representation and/or with plan synthesis techniques. Different represen-
tations may be useful to cover the methods needed by the factotum robot.
Machine learning techniques can be used for improving the methods, acquir-
ing their models, and adapting the factotum to a new trade.
In addition to acting with the right methods, the robot has to monitor
its activity at every level, including possibly at the goal level. Prediction of
what is needed to correctly perform and monitor foreseen activities should be
made beforehand. Making the right predictions from the combined models
of actions and models of the environment is a difficult problem that involves
heterogeneous representations.
Finally, the robot requires extended perception capabilities: reasoning
on what is observable and what is not, integrating knowledge-gathering ac-
tions with environment-changing actions, acting in order to maintain sufficient
knowledge for the task at hand with a consistent interpretation of self and
the world.
[Figure: deliberation components of the harbor operations manager, including
booking, release, and navigation managers.]
refines the abstract task store of Figure 1.5 into subtasks for registering a
car to be stored, moving it, and other tasks, down to executable commands.
Moreover, the tasks to be refined and controlled are carried out by differ-
ent components, for example, ships, gates, and storage or repair areas. Each
ship has its own procedure to unload cars to a gate. A gate has its own pro-
cedure to accept cars that are unloaded to the deck. A natural design option
is therefore to model the HOM in a distributed way, as a set of interacting
deliberation components. The interactions between ships and gates, gates
and trucks, and trucks and storage areas must be controlled with respect
to the global constraints and objectives of the system. To do that, HOM
must deal with uncertainty and nondeterminism due to exogenous events,
and to the fact that each component may, from the point of view of the
management facility, behave nondeterministically. For instance, in the task
to synchronize a ship with a gate to unload cars, the ship may send a re-
quest for unloading cars to the unloading manager, and the gate may reply
either that the request meets its requirements and the unloading operation
ning and acting. We also show how nondeterministic models can be used
with refinements techniques that intermix plans, actions, and goals. We
discuss the integration of planning and acting with input/output automata
to cover cases such as the distributed deliberation in the HOM example.
We cover probabilistic models in Chapter 6. We develop heuristic search
techniques for stochastic shortest path problems. We present online ap-
proaches for planning and acting, discuss refinement methods for acting
with probabilistic models, and analyze the specifics of descriptive models
of actions in the probabilistic case together with several practical issues for
modeling probabilistic domains.
Chapters 2 through 6 are devoted to planning and acting. Chapter 7
briefly surveys the other deliberation functions introduced in Section 1.3.4:
perceiving, monitoring, goal reasoning, interacting, and learning. It also
discusses hybrid models and ontologies for planning and acting.
Deliberation with
Deterministic Models
The descriptive models used by planning systems are often called planning
domains. However, it is important to keep in mind that a planning domain
is not an a priori definition of the actor and its environment. Rather, it
is necessarily an imperfect approximation that must incorporate trade-offs
among several competing criteria: accuracy, computational performance,
and understandability to users.
[Figure: a two-dimensional network of locations loc0, . . . , loc9, with x and y
coordinates.]
Figure 2.2: Geometric model of a workpiece, (a) before and (b) after com-
puting the effects of a drilling action.
Example 2.2. Consider the task of using machine tools to modify the
shape of a metal workpiece. Each state might include a geometric model
of the workpiece (see Figure 2.2), and information about its location and
orientation, the status and capabilities of each machine tool, and so forth.
A descriptive model for a drilling operation might include the following:
The operation's name and parameters (e.g., the dimensions, orienta-
tion, and machining tolerances of the hole to be drilled).
The operation's preconditions, that is, conditions that are necessary for
it to be used. For example, the desired hole should be perpendicular to
the drilling surface, the workpiece should be mounted on the drilling
machine, the drilling machine should have a drill bit of the proper size,
and the drilling machine and drill bit need to be capable of satisfying
the machining tolerances.
The operation's effects, that is, what it will do. These might include
a geometric model of the modified workpiece (see Figure 2.2(b)) and
estimates of how much time the action will take and how much it will
cost.
We will define two rigid properties: each pair of loading docks is adjacent if
there is a road between them, and each pile is at exactly one loading dock.
[Figure: part of the state space, showing states s0, s1, and s2 connected by
the actions load(r1,c1,c2,p1,d1), unload(r1,c1,c2,p1,d1), move(r1,d1,d3), and
move(r1,d3,d1).]
In the subsequent examples that build on this one, we will not need to reason
about objects such as the roads and the robots' wheels, or properties such
as the colors of the objects. Hence B and R do not include them.
Definition 2.4. A state variable over B is a syntactic term
x = sv(b1, . . . , bk), (2.1)
where sv is a symbol called the state variable's name, and each bi is a member
of B. Each state variable x has a range,2 Range(x) ⊆ B, which is the set of
all possible values for x.
2. We use range rather than domain to avoid confusion with planning domain.
s = {x1 = z1 , x2 = z2 , . . . , xn = zn }. (2.3)
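To make the state-variable representation concrete, here is a minimal Python
sketch (the names and encoding are ours, not the book's): a state is a
dictionary mapping each ground state variable, encoded as a tuple, to its
value, and a set of conditions is satisfied when every condition holds.

# A state assigns a value to every ground state variable.
# State variables are encoded as tuples: (name, arguments...).
s0 = {
    ("loc", "r1"): "d1",      # loc(r1) = d1
    ("loc", "r2"): "d2",
    ("cargo", "r1"): "nil",   # cargo(r1) = nil
    ("pos", "c1"): "c2",      # c1 sits on c2
    ("top", "p1"): "c1",
}

def satisfies(s, g):
    # True iff state s satisfies every condition x = v in g.
    return all(s.get(x) == v for x, v in g.items())

These helpers are reused in the sketches that follow.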
Definition 2.10. Let R and X be sets of rigid relations and state variables
over a set of objects B, and S be a state-variable state space over X. An
action template for S is a tuple α = (head(α), pre(α), eff(α), cost(α)) or
α = (head(α), pre(α), eff(α)), the elements of which are as follows:
head(α) is a syntactic expression of the form
act(z1, z2, . . . , zk),

act(z1, z2, . . . , zk)
pre: p1, . . . , pm
eff: e1, . . . , en
cost: c
Example 2.12. Continuing Example 2.5, suppose each robot r has an ex-
ecution platform that can perform the following commands:
if r is at a loading dock and is not already carrying anything, r can
load a container from the top of a pile;
if r is at a loading dock and is carrying a container, r can unload the
container onto the top of a pile; and
r can move from one loading dock to another if the other dock is
unoccupied and there is a road between the two docks.
To model these commands, let A comprise the following action templates:
7. This can be generalized to make cost(α) a numeric formula that involves α's
parameters. In this case, most forward-search algorithms and many domain-specific
heuristic functions will still work, but most domain-independent heuristic functions will
not, nor will backward-search and plan-space search algorithms (Sections 2.4 and 2.5).
load(r, c, c′, p, d)
pre: at(p, d), cargo(r) = nil, loc(r) = d, pos(c) = c′, top(p) = c
eff: cargo(r) ← c, pile(c) ← nil, pos(c) ← r, top(p) ← c′

unload(r, c, c′, p, d)
pre: at(p, d), pos(c) = r, loc(r) = d, top(p) = c′
eff: cargo(r) ← nil, pile(c) ← p, pos(c) ← c′, top(p) ← c

move(r, d, d′)
pre: adjacent(d, d′), loc(r) = d, occupied(d′) = F
eff: loc(r) ← d′, occupied(d) ← F, occupied(d′) ← T
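For illustration, here is one way to encode ground actions and the progression
function γ in Python, continuing the earlier sketch (Action, is_applicable,
and gamma are hypothetical names; for simplicity, rigid relations such as
adjacent are folded into the state):

from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    pre: tuple    # conditions: ((state_variable, required_value), ...)
    eff: tuple    # effects: ((state_variable, new_value), ...)
    cost: float = 1.0

def is_applicable(s, a):
    return all(s.get(x) == v for x, v in a.pre)

def gamma(s, a):
    # Progression: the predicted state after applying a in s.
    s2 = dict(s)
    s2.update(a.eff)
    return s2

# A ground instance of move(r, d, d'):
move_r1_d1_d3 = Action(
    name="move(r1,d1,d3)",
    pre=((("adjacent", "d1", "d3"), True), (("loc", "r1"), "d1"),
         (("occupied", "d3"), False)),
    eff=((("loc", "r1"), "d3"), (("occupied", "d1"), False),
         (("occupied", "d3"), True)),
)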
In the action templates, the parameters have the following ranges:
pre(a1) = {at(p1, d1), cargo(r1) = nil, loc(r1) = d1, pos(c1) = c2, top(p1) = c1}.
π = ⟨a1, a2, . . . , an⟩.
The plan's length is |π| = n, and its cost is the sum of the action costs:
cost(π) = cost(a1) + · · · + cost(an).
As a special case, ⟨⟩ is the empty plan, which contains no actions. Its
length and cost are both 0.
π.a = ⟨a1, . . . , an, a⟩;
a.π = ⟨a, a1, . . . , an⟩;
π.π′ = ⟨a1, . . . , an, a′1, . . . , a′n′⟩;
π.⟨⟩ = ⟨⟩.π = π.
γ(s0, π) = sn;
γ̂(s0, π) = ⟨s0, . . . , sn⟩.
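Under the same encoding, applying a plan and computing its cost is a direct
transcription of these definitions (gamma_hat and plan_cost are our names):

def gamma_hat(s, plan):
    # Returns the state sequence <s0, ..., sn> produced by the plan.
    states = [s]
    for a in plan:
        assert is_applicable(states[-1], a)
        states.append(gamma(states[-1], a))
    return states      # gamma(s, plan) is states[-1]

def plan_cost(plan):
    return sum(a.cost for a in plan)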
Forward-search(Σ, s0, g)
  s ← s0; π ← ⟨⟩
  loop
    if s satisfies g, then return π
    A′ ← {a ∈ A | a is applicable in s}
    if A′ = ∅, then return failure
    nondeterministically choose a ∈ A′ (i)
    s ← γ(s, a); π ← π.a
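The nondeterministic choice at line (i) leaves the search strategy open. Below
is a sketch that instantiates it as a depth-first search with cycle checking,
reusing the helpers sketched earlier; this is one possible determinization, not
the book's reference implementation:

def forward_search(actions, s0, g, depth_limit=50):
    # Depth-first instantiation of Forward-search; returns a plan or None.
    def dfs(s, plan, visited):
        if satisfies(s, g):
            return plan
        if len(plan) >= depth_limit:
            return None
        for a in actions:
            if is_applicable(s, a):
                s2 = gamma(s, a)
                key = frozenset(s2.items())
                if key in visited:        # prune states already on this path
                    continue
                result = dfs(s2, plan + [a], visited | {key})
                if result is not None:
                    return result
        return None
    return dfs(s0, [], {frozenset(s0.items())})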
π1 = ⟨move(r1, d1, d3)⟩;
π2 = ⟨move(r2, d2, d3), move(r1, d1, d2), move(r2, d3, d1), move(r1, d2, d3)⟩;
π3 = ⟨load(r1, c1, c2, p1, d1), unload(r1, c1, c2, p1, d1), move(r1, d1, d3)⟩.
Deterministic-Search(Σ, s0, g)
  Frontier ← {(⟨⟩, s0)} // (⟨⟩, s0) is the initial node
  Expanded ← ∅
  while Frontier ≠ ∅ do
    select a node ν = (π, s) ∈ Frontier (i)
    remove ν from Frontier and add it to Expanded
    if s satisfies g then (ii)
      return π
    Children ← {(π.a, γ(s, a)) | s satisfies pre(a)}
    prune (i.e., remove and discard) 0 or more nodes
      from Children, Frontier and Expanded (iii)
    Frontier ← Frontier ∪ Children (iv)
  return failure
2.2.5 A*
A* is similar to uniform-cost search, but uses a heuristic function:
Node selection. Select a node ν ∈ Children that minimizes f(ν) (defined
in Equation 2.9).
Pruning. For each node ν = (π, s) ∈ Children, if A* has more than one plan
that goes to s, then keep only the least costly one. More specifically,
let
gorithms, which gradually increase the depth of their search until they find
a solution. The best known of these is iterative deepening search (IDS),
which works as follows:
for k = 1 to ∞,
do a depth-first search, backtracking at every node of depth k
if the search found a solution, then return it
if the search generated no nodes of depth k, then return failure
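A direct Python transcription of IDS, again reusing the earlier helpers (the
names are ours); the boolean returned alongside the plan records whether
the depth bound was ever reached, so that exhaustion can be detected:

def iterative_deepening(actions, s0, g, max_depth=100):
    def dls(s, depth):
        # Depth-limited DFS: returns (plan or None, depth bound reached?).
        if satisfies(s, g):
            return [], False
        if depth == 0:
            return None, True
        cut = False
        for a in actions:
            if is_applicable(s, a):
                plan, c = dls(gamma(s, a), depth - 1)
                cut = cut or c
                if plan is not None:
                    return [a] + plan, False
        return None, cut
    for k in range(1, max_depth + 1):
        plan, cut = dls(s0, k)
        if plan is not None:
            return plan
        if not cut:          # no nodes at depth k: the search space is exhausted
            return None      # failure
    return None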
B = Locations ∪ Numbers;
Locations = {loc0, . . . , loc9};
Numbers = {0, 1, . . . , 9}.
There is a rigid relation adjacent that includes every pair of locations that
have a road between them, and rigid relations x and y that give each loca-
tion's x and y coordinates:
adjacent = {(loc0 , loc1 ), (loc0 , loc6 ), (loc1 , loc0 ), (loc1 , loc3 ), . . .};
x = {(loc0 , 2), (loc1 , 0), (loc2 , 4), . . .};
y = {(loc0 , 4), (loc1 , 3), (loc2 , 4), . . .}.
There is one state variable loc with Range(loc) = Locations, and 10 states:
si = {loc = loci }, i = 0, . . . , 9.
[Figure: the initial state s0 (r1 at d3, c1 at d1) and the goal g.]
unload(r, c, l)
pre: cargo(r) = c, loc(r) = l
eff: cargo(r) ← nil, loc(c) ← l
cost: 1

move(r, d, e)
pre: loc(r) = d
eff: loc(r) ← e
cost: 1
The action templates' parameters have the following ranges:
Range(c) = Containers; Range(d) = Range(e) = Docks;
Range(l) = Locations; Range(r) = Robots.
Suppose we are running GBFS (see Section 2.2.7) on P . In s0 , there are two
applicable actions: a1 = move(r1, d3, d1) and a2 = move(r1, d3, d2). Let
g = {loc(r1) = d3, loc(c1) = r1}.

[Figure: computation of hmax(s1) = Δmax(s1, g) = max(1, 1) = 1.]
[Figure: computation of hmax(s2) = Δmax(s2, g) = max(1, 2) = 2.]
[Figure: computation of hadd(s1) = Δadd(s1, g) = 1 + 1 = 2.]
[Figure: computation of hadd(s2) = Δadd(s2, g) = 1 + 2 = 3.]
where
Δadd(s, g) = Σ_{gi ∈ g} Δadd(s, gi);
Δadd(s, gi) = 0, if gi ∈ s,
and otherwise Δadd(s, gi) = min{Δadd(s, a) | a ∈ A and gi ∈ eff(a)};
Δadd(s, a) = cost(a) + Δadd(s, pre(a)).
Both hmax and hadd have the same time complexity. Their running time
is nontrivial, but it is polynomial in |A| + Σ_{x∈X} |Range(x)|, the total number
of actions and ground atoms in the planning domain.
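Both heuristics can be computed by a Bellman-Ford-style fixed point over
ground atoms. The sketch below (our names; atoms are (variable, value)
pairs, as in the earlier sketches) computes hadd; replacing the sum over
preconditions by a max yields hmax:

def h_add(s, g, actions):
    # Fixed-point computation of h_add(s) = Delta_add(s, g).
    INF = float("inf")
    delta = {atom: 0.0 for atom in s.items()}  # atoms true in s have cost 0
    changed = True
    while changed:
        changed = False
        for a in actions:
            # Delta_add(s, a) = cost(a) + sum of Delta_add over pre(a)
            c = a.cost + sum(delta.get(p, INF) for p in a.pre)
            if c == INF:
                continue
            for e in a.eff:
                if c < delta.get(e, INF):  # found a cheaper achiever of e
                    delta[e] = c
                    changed = True
    return sum(delta.get(gi, INF) for gi in g.items())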
A relaxed state (or r-state, for short) is any set ŝ of ground atoms such
that every state variable x ∈ X is the target of at least one atom in ŝ.
It follows that every state is also an r-state.
A relaxed state ŝ r-satisfies a set of literals g if ŝ contains a subset
s ⊆ ŝ that satisfies g.
An action a is r-applicable in an r-state ŝ if ŝ r-satisfies pre(a). In this
case, the predicted r-state is γ+(ŝ, a) = ŝ ∪ eff(a).
The r-state ŝ2 r-satisfies g, so the plan π = ⟨move(r1, d3, d1), load(r1, c1, d1)⟩
is a relaxed solution for P. No shorter plan is a relaxed solution for P, so
h+(s0) = Δ+(s0, g).
HFF(Σ, s, g)
  ŝ0 ← s; Â0 ← ∅
  for k = 1 by 1 until a subset of ŝk r-satisfies g do (i)
    Âk ← {all actions that are r-applicable in ŝk−1}
    ŝk ← γ+(ŝk−1, Âk)
    if ŝk = ŝk−1 then // (Σ, s, g) has no solution (ii)
      return ∞
  ĝk ← g
  for i = k down to 1 do (iii)
    arbitrarily choose a minimal set of actions
      âi ⊆ Âi such that γ+(ŝi−1, âi) r-satisfies ĝi
    ĝi−1 ← (ĝi \ eff(âi)) ∪ pre(âi)
  π̂ ← ⟨â1, â2, . . . , âk⟩ (iv)
  return Σ {cost(a) | a is an action in π̂}
[Figure: a relaxed planning graph, showing the atoms in ŝ0, the actions in
Â1, the atoms in ŝ1, the actions in Â2, and the atoms in ŝ2.]
The graph structures in Figures 2.7 and 2.8 are called relaxed planning
graphs.
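The forward phase of HFF, which builds these relaxed planning graphs, can
be sketched as follows; an r-state is a set of ground atoms, and γ+ simply
adds the effects of all r-applicable actions without deleting anything (the
names are ours):

def relaxed_planning_graph(s, g, actions):
    # Forward phase of HFF: returns (atom layers, action layers),
    # or None if g is unreachable even under delete-relaxation.
    layers, act_layers = [set(s.items())], []
    while True:
        current = layers[-1]
        if all(gi in current for gi in g.items()):  # layer r-satisfies g
            return layers, act_layers
        a_k = [a for a in actions if all(p in current for p in a.pre)]
        nxt = current | {e for a in a_k for e in a.eff}  # gamma+: no deletions
        if nxt == current:   # fixed point reached without satisfying g
            return None
        act_layers.append(a_k)
        layers.append(nxt)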
hsl (s) = the total number of landmarks found by the preceding algorithm.
15. In implementations, this usually is done only if every atom in g̃0 has the same
type, for example, g̃0 = loc(r1) = d1 ∨ loc(r2) = d1.
Backward-search(Σ, s0, g0)
  π ← ⟨⟩; g ← g0 (i)
  loop
    if s0 satisfies g then return π
    A′ ← {a ∈ A | a is relevant for g}
    if A′ = ∅ then return failure
    nondeterministically choose a ∈ A′
    g ← γ⁻¹(g, a) (ii)
    π ← a.π (iii)
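The regression step at line (ii) has a compact transcription for our equality-
based conditions; here a is taken to be relevant for g if it establishes at least
one condition of g and contradicts none (is_relevant and regress are our
names, following the earlier sketches):

def is_relevant(a, g):
    eff = dict(a.eff)
    establishes = any(eff.get(x) == v for x, v in g.items())
    contradicts = any(x in eff and eff[x] != v for x, v in g.items())
    return establishes and not contradicts

def regress(g, a):
    # gamma^{-1}(g, a): conditions needed before a so that g holds after a.
    g2 = {x: v for x, v in g.items() if x not in dict(a.eff)}
    g2.update(dict(a.pre))   # plus a's own preconditions
    return g2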
Although the algorithm is more complicated than the HFF algorithm, its
running time is still polynomial.
Better landmark heuristics can be devised by doing additional compu-
tations to discover additional landmarks and by reasoning about the order
in which to achieve the landmarks. We discuss this further in Section 2.7.9.
Solved ← Solved ∪ {g}
Here, Solved represents the set of all states that are already solved, that
is, states from which π or one of π's suffixes will achieve g0; and g′ represents
the set of all states from which the plan a.π will achieve g0. If every state
that a.π can solve is already solved, then it is useless to prepend a to π. For
any solution that we can find this way, another branch of the search space
will contain a shorter solution that omits a.
17. The reason for calling this a violation even if the effect is x ← b is to ensure that
PSP (Algorithm 2.5) performs a systematic search [411, 336], that is, it does not generate
the same partial plan several times in different parts of the search space.
PSP(Σ, π)
  loop
    if Flaws(π) = ∅ then return π
    arbitrarily select f ∈ Flaws(π) (i)
    R ← {all feasible resolvers for f}
    if R = ∅ then return failure
    nondeterministically choose ρ ∈ R (ii)
    π ← ρ(π)
[Figure 2.9: the initial state s0 (r1 at d1, r2 at d2) and the goal g (r1 at d2,
r2 at d1), in a domain with three docks d1, d2, and d3.]
Threats. Let v1 ⇢ v2, with label x = b, be any causal link in π, and let v3 ∈ V
be any node such that v2 ⊀ v3 and v3 ⊀ v1 (hence it is possible for v3 to come
between v1 and v2). Suppose act(v3) has an effect y ← w that is unifiable with
x, that is, π has an instance (here we extend Definition 2.9 to plans) in which
both x and y are the same state variable. Then v3 is a threat to the causal
link. There are three kinds of resolvers for such a threat:
Make v3 ≺ v1, by adding (v3, v1) to E.
Make v2 ≺ v3, by adding (v2, v3) to E.
Prevent x and y from unifying, by adding to C an inequality constraint
on their parameters.
Example 2.32. Figure 2.9 shows the initial state and goal for a simple
planning problem in which there are two robots and three loading docks, that
is, B = Robots ∪ Docks, where Robots = {r1, r2} and Docks = {d1, d2, d3}.
There are no rigid relations. There is one action template,
Figure 2.10: The initial partial plan contains dummy actions a0 and ag that
represent s0 and g: the effects of a0 are loc(r1) = d1, loc(r2) = d2,
occupied(d1) = r1, occupied(d2) = r2, and occupied(d3) = nil, and the
preconditions of ag are loc(r1) = d2 and loc(r2) = d1. There are two flaws:
ag's two preconditions are open goals.
move(r, d, d′)
pre: loc(r) = d, occupied(d′) = nil
eff: loc(r) ← d′, occupied(d) ← nil, occupied(d′) ← r
where r ∈ Robots and d, d′ ∈ Docks. The initial state and the goal (see
Figure 2.9) are
s0 = {loc(r1) = d1, loc(r2) = d2, occupied(d1) = r1, occupied(d2) = r2,
occupied(d3) = nil};
g = {loc(r1) = d2, loc(r2) = d1}.
Figure 2.10 shows the initial partial plan, and Figures 2.11 through 2.14
show successive snapshots of one of PSP's nondeterministic execution traces.
Each action's preconditions are written above the action, and the effects
are written below the action. Solid arrows represent edges in E, dashed
arrows represent causal links, and dot-dashed arrows represent threats. The
captions describe the refinements and how they affect the plan.
Figure 2.11: Resolving ag's open-goal flaws. For one of them, PSP adds a1
and a causal link. For the other, PSP adds a2 and another causal link.
Figure 2.12: Resolving a1's open-goal flaws. For one of them, PSP substi-
tutes d1 for d (which also resolves a1's free-variable flaw) and adds a causal
link from a0. For the other, PSP adds a3 and a causal link. The new action
a3 causes two threats (shown as red dashed-dotted lines).
Figure 2.13: Resolving a2's open-goal flaws. For one of them, PSP substi-
tutes r2 for r and d′ for d″, and adds a causal link from a3. For the other,
PSP adds a causal link from a1. As a side effect, these changes resolve the
two threats.
Figure 2.14: Resolving a3's open-goal flaws. For one of them, PSP adds a
causal link. For the other, PSP substitutes d3 for d′ and adds a causal link.
The resulting partially ordered plan contains no flaws and hence solves the
planning problem.
The order in which PSP selects the flaws can affect the size of the search space
generated by PSP's nondeterministic choices in line (ii). Flaw selection is
analogous to variable ordering in CSPs, and the Minimum Remaining
Values heuristic for CSPs (choose the variable with the fewest remain-
ing values) is analogous to a PSP heuristic called Fewest Alternatives
First: select the flaw with the fewest resolvers.
Resolver selection (line (ii) of PSP) is analogous to value ordering in
CSPs. The Least Constraining Value heuristic for CSPs (choose the
value that rules out the fewest values for the other variables) translates
into the following PSP heuristic: choose the resolver that rules out the
fewest resolvers for the other flaws.
The preceding heuristic ignores an important difference between plan-
space planning and CSPs. Ordinarily, the number of variables in a CSP
is fixed in advance, hence the search tree is finite and all solutions are
at exactly the same depth. If one of PSP's resolvers introduces a new
action that has n new preconditions to achieve, this is like introducing
n new variables (and a number of new constraints) into a CSP, which
could make the CSP much harder to solve.
One way of adapting this heuristic to PSP is by first looking for re-
solvers that do not introduce open goals, and if there are several such
resolvers, then to choose the one that rules out the fewest resolvers for
the other flaws.
Although the preceding heuristics can help speed PSP's search, imple-
mentations of PSP tend to run much more slowly than the fastest state-
space planners. Generally the latter are GBFS algorithms that are guided
by heuristics like the ones in Section 2.3, and there are several impediments
to developing an analogous version of PSP. Because plan spaces have no ex-
plicit states, the heuristics in Section 2.3 are not directly applicable, nor is
it clear how to develop similar plan-space heuristics. Even if such heuristics
were available, a depth-first implementation of PSP would be problematic
because plan spaces generally are infinite. Consequently, for solving prob-
lems like the ones in the International Planning Competitions [291], most
automated-planning researchers have abandoned PSP in favor of forward-
search algorithms.
On the other hand, some important algorithms for temporal planning
(see Chapter 4) are extensions of PSP and are useful for maintaining flex-
ibility of execution in unpredictable environments. An understanding of
PSP is useful to provide the necessary background for understanding those
algorithms.

go(r, l, l′)
pre: adjacent(l, l′), loc(r) = l
eff: loc(r) ← l′

navigate(r, l, l′)
pre: adjacent(l, l′), loc(r) = l
eff: loc(r) ← l′

take(r, l, o)
pre: loc(r) = l, pos(o) = l, cargo(r) = nil
eff: pos(o) ← r, cargo(r) ← o

Figure 2.15: Action templates for Example 2.33.
Example 2.33. To illustrate some of the things that can go wrong, suppose
a robot, rbt, is trying to accomplish the task bring o7 to loc2 near the top
of Figure 1.2. To create an abstract plan for this task, suppose rbt calls a
planner on a planning problem P = (Σ, s0, g) in which Σ contains the action
templates shown in Figure 2.15, and
a1 = go(rbt,loc3,hall), a2 = navigate(rbt,hall,loc1),
a3 = take(rbt,loc1,o7), a4 = navigate(rbt,loc1,loc2),
a5 = put(rbt,loc2,o7).
look for o7 in loc1; and if it's not there, then look for it in loc4.
Consequently, actors need ways to change their plans when problems are
detected. The following section describes some ways to do that.
Run-Lookahead(Σ, g)
  while (s ← abstraction of observed state ξ) ⊭ g do
    π ← Lookahead(Σ, s, g)
    if π = failure then return failure
    a ← pop-first-action(π); perform(a)

Run-Lazy-Lookahead(Σ, g)
  s ← abstraction of observed state ξ
  while s ⊭ g do
    π ← Lookahead(Σ, s, g)
    if π = failure then return failure
    while π ≠ ⟨⟩ and s ⊭ g and Simulate(Σ, s, g, π) ≠ failure do
      a ← pop-first-action(π); perform(a)
      s ← abstraction of observed state ξ

Run-Concurrent-Lookahead(Σ, g)
  π ← ⟨⟩; s ← abstraction of observed state ξ
  thread 1: // threads 1 and 2 run concurrently
    loop
      π ← Lookahead(Σ, s, g)
  thread 2:
    loop
      if s ⊨ g then return success
      else if π = failure then return failure
      else if π ≠ ⟨⟩ and Simulate(Σ, s, g, π) ≠ failure then
        a ← pop-first-action(π); perform(a)
        s ← abstraction of observed state ξ
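In an implementation, Run-Lazy-Lookahead might look roughly like the
sketch below, where observe, abstract, lookahead, simulate, and perform
are platform-specific stubs; everything here, including the names, is an
assumption for illustration, reusing satisfies from the earlier sketches:

def run_lazy_lookahead(domain, g, observe, abstract,
                       lookahead, simulate, perform):
    # Replan only when the remaining plan no longer simulates to the goal.
    s = abstract(observe())
    while not satisfies(s, g):
        plan = lookahead(domain, s, g)
        if plan is None:
            return "failure"
        while plan and not satisfies(s, g) and simulate(domain, s, g, plan):
            perform(plan.pop(0))       # pop-first-action, then execute it
            s = abstract(observe())
    return "success"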
Algorithms 2.6 through 2.8 illustrate some ways for an actor to use a plan-
ner. In each of them, (, s, g) is a planning problem, and Lookahead is an
guarantee that the plan will actually lead to a goal. But neither can we
guarantee that an actor will reach the goal if it uses a purported complete
solution plan. As we discussed in Section 2.6.1, acting procedures may need
to deal with a variety of problems that were not in the planners domain
model.
Here is a modified version of IDS (Section 2.2.8) that uses a depth limit,
kmax :
for k = 1 to kmax:
  do a depth-first search, backtracking at every node of depth
  k, and keeping track of which node ν = (π, s) at depth k
  has the lowest value f(ν)
  if the search finds a solution, then return it
return π
Both A* and IDS can also be modified to use a time limit, by having them
throw an exception when time runs out. When the exception is thrown,
IDS would return the plan mentioned in the preceding pseudocode, and
A* would return the plan found in the node ν = (π, s) ∈ Frontier that
minimizes f(ν).
In a classical representation, all atoms have the same syntax as our rigid
relations. Each state s is represented as the set of all atoms that are true in
s, hence any atom not in this set is false in s. Each planning operator (the
classical counterpart of an action template) has preconditions and effects
that are literals.
Example 2.34. Here is a classical representation of s0 in Equation 2.4:
s0 = {loc(r1 , d1 ), loc(r2 , d2 ),
occupied(d1 ), occupied(d2 ),
pile(c1 , p1 ), pile(c2 , p1 ), pile(c3 , p2 ),
pos(c1 , c2 ), pos(c2 , nil), pos(c3 , nil),
top(p1 , c1 ), top(p2 , c3 ), top(p3 , nil)}.
Here is a classical planning operator corresponding to the load action tem-
plate in Example 2.12:
load(r, c, c′, p, d)
pre: at(p, d), ¬cargo(r), loc(r, d), pos(c, c′), top(p, c)
eff: cargo(r), ¬pile(c, p), pile(c, nil), ¬pos(c, c′), pos(c, r),
¬top(p, c), top(p, c′)
Computer programs for games such as chess and checkers typically use
an acting procedure similar to Run-Lookahead (Algorithm 2.8). In these
programs, the Lookahead subroutine is similar to the time-limited version of
depth-first iterative deepening in Section 2.6.2, except that the depth-first
search is the well-known alpha-beta algorithm [338, 460, 517].
The IDA* algorithm in Section 2.2.8 is attributable to Korf [351].
Iterative-deepening algorithms are a special case of node-regeneration al-
gorithms that retract nodes to save space and regenerate them later if they
need to examine them again. There are several other search algorithms (e.g.,
the RBFS algorithm [353]) that do node regeneration in one way or another.
Zhang [625] provides a survey of such algorithms.
Such recipes can be written as HTN methods; see Section 3.5.2 for details.
The expressive power of HTN methods can be useful for developing prac-
tical applications [603, 444, 382], and a good set of methods can enable an
HTN planner to perform well on benchmark problems [394]. A drawback
of this approach is that it requires the domain author to write and debug a
potentially complex set of domain-specific recipes [308]. However, research
is being done on techniques for aiding this process (see Section 7.3.3).
rules can enable a temporal-logic planner to perform well [394], and the
expressive power of the control rules can be important in practical applica-
tions [157]. On the other hand, the domain author must write and debug
a potentially complex set of domain-specific information [308], but research is
being done on techniques to aid this process (see Section 7.3.3).
Delete-Relaxation Heuristics
Delete-relaxation and the h+ and hFF heuristics (see Section 2.3.2) were pi-
oneered primarily by Hoffmann [276, 280], and the name of the hFF heuristic
comes from its use in the FF planning system [278]. Delete-relaxation can
also be used to describe the hadd and hmax heuristics in Section 2.3.1; hmax
is the optimal parallel solution (see Section 2.7.4) for the delete-relaxed
problem [270, 68].
Helmert's [266, 267] causal graph heuristic involves analyzing the plan-
ning domain's causal structure using a directed graph whose nodes are all
of the state variables in the planning domain, and whose edges represent
dependencies among the state variables. Although it is not immediately ob-
vious that this is a delete-relaxation heuristic, a delete-relaxation heuristic
hcea has been developed that includes both the causal graph heuristic and
hadd as special cases [270].
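As a rough illustration of the underlying graph (a simplification of Helmert's analysis, with an assumed dict-based action encoding), one can draw an edge from variable u to variable v whenever some action that affects v has a precondition or another effect on u:

    # A rough sketch of causal-graph construction; each action is assumed
    # to be a dict with 'pre' and 'eff' mapping state variables to values.
    # This simplifies Helmert's actual analysis, for illustration only.
    def causal_graph(actions):
        edges = set()
        for a in actions:
            for v in a["eff"]:
                for u in a["pre"]:
                    if u != v:
                        edges.add((u, v))   # precondition dependency
                for u in a["eff"]:
                    if u != v:
                        edges.add((u, v))   # co-effect dependency
        return edges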
Landmark Heuristics
The early work on landmarks by Porteous, Sebastia, and Hoffmann [493]
has been hugely influential, inspiring a large amount of additional work on
the subject. The landmark heuristic that we described in Section 2.3.3 is a
relatively simple one, and there are many ways to improve it.
Critical-Path Heuristics
There is a set {h^m | m = 1, 2, . . .} of heuristic functions based loosely on the
notion of critical paths (an important concept in project scheduling). They
approximate the cost of achieving a goal g by the cost of achieving its most
costly subset of size m [261, 259]. More specifically, for m = 1, 2, . . . , let

    Δ^m(s, g) =
        0,                                               if s |= g;
        min{cost(a) + Δ^m(s, γ⁻¹(g, a)) | a ∈ Rel(g)},   if |g| ≤ m;
        max{Δ^m(s, g') | g' ⊆ g and |g'| ≤ m},           otherwise;

where Rel(g) is the set of all actions that are relevant for g (see Defini-
tion 2.27). Then h^m(s) = Δ^m(s, g). It is easy to show that h^1 = hmax.
For each m, the heuristic h^m is admissible; and if we hold m fixed, then
h^m can be computed in time polynomial in |A| + Σ_{x∈X} |Range(x)|, the
number of actions and ground atoms in the planning domain. However, the
computational complexity is exponential in m.
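For intuition about the m = 1 case, the following Python sketch (ours, not the book's code) computes h^1 = hmax by repeatedly relaxing atom costs; actions are assumed to be ground tuples (pre, add, cost):

    # A sketch of the m = 1 case: delta[p] approximates the cost of
    # achieving atom p from `state`. h1(s, g) is the max over atoms of g.
    import math

    def h1(state, goal, actions):
        delta = {p: 0.0 for p in state}
        changed = True
        while changed:
            changed = False
            for pre, add, cost in actions:
                if all(p in delta for p in pre):
                    # cost of the most expensive precondition, plus cost(a)
                    c = cost + max((delta[p] for p in pre), default=0.0)
                    for p in add:
                        if c < delta.get(p, math.inf):
                            delta[p] = c
                            changed = True
        return max((delta.get(p, math.inf) for p in goal), default=0.0)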
Abstraction Heuristics
An abstraction of a planning domain Σ is a γ-preserving homomorphism
from Σ onto a smaller planning domain Σ'. For each planning problem
P = (Σ, s0, g), this defines a corresponding abstraction P' = (Σ', s0', g'); and
if we let c* denote the cost of an optimal solution to a planning problem,
then it follows that c*(P') ≤ c*(P). If Σ' is simple enough that we can
compute c*(P') for every planning problem P' in Σ', then the function h(s) =
c*(Σ', s', g') is an admissible heuristic for P.
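A common instance of this idea is projection onto a subset of the state variables, as in pattern databases. The following Python sketch (an illustration under the assumption that the abstract space is small enough to search exhaustively) computes h(s) as the optimal cost of the projected problem:

    # A sketch of a projection abstraction: states, preconditions, effects,
    # and goals are frozensets of (variable, value) pairs; `pattern` is the
    # set of variables kept. The optimal abstract cost, found here by
    # Dijkstra search, is an admissible h(s). Illustrative only; real
    # pattern databases precompute these costs for all abstract states.
    import heapq
    from itertools import count

    def project(pairs, pattern):
        return frozenset((x, v) for (x, v) in pairs if x in pattern)

    def abstract_h(state, goal, actions, pattern):
        s0, g = project(state, pattern), project(goal, pattern)
        tie = count()
        frontier, best = [(0.0, next(tie), s0)], {s0: 0.0}
        while frontier:
            cost, _, s = heapq.heappop(frontier)
            if g <= s:
                return cost                  # c*(P'), admissible for P
            for pre, eff, c in actions:
                if project(pre, pattern) <= s:
                    eff_p = dict(project(eff, pattern))
                    t = frozenset((x, v) for (x, v) in s if x not in eff_p)
                    t |= frozenset(eff_p.items())
                    if cost + c < best.get(t, float("inf")):
                        best[t] = cost + c
                        heapq.heappush(frontier, (cost + c, next(tie), t))
        return float("inf")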
One line of contributions has been the integration of planning and execution
(rather than acting), under an assumption that the plans generated by the
planning algorithms were directly executable, an assumption that is often
unrealistic. In the next chapter, we will return to the integration of planning
and acting, with refinement of actions into commands, and ways to react to
events.
The receding-horizon technique has been widely used in control theory,
specifically in model-predictive control. The survey by Garcia et al. [211]
traces its implementation back to the early sixties. The general idea is to use
a predictive model to anticipate over a given horizon the response of a system
to some control and to select the control such that the response has some
desired characteristics. Optimal control seeks a response that optimizes a
criterion. The use of these techniques together with task planning has been
explored by Dean and Wellman [146].
Subgoaling has been used in the design of several problem-solving and
search algorithms (e.g., [371, 352]). In planning, issues such as serializ-
able goals and abstraction hierarchies with interesting properties have been
extensively studied (e.g., [39, 334, 616]). Sampling techniques have been
developed and are widely used for handling stochastic models of uncertainty
and nondeterminism, about which more is said in Chapter 6.
2.8 Exercises
2.1. Let P1 = (Σ, s0, g1) and P2 = (Σ, s0, g2) be two state-variable plan-
ning problems with the same planning domain and initial state. Let
π1 = ⟨a1, . . . , an⟩ and π2 = ⟨b1, . . . , bn⟩ be solutions for P1 and P2, re-
spectively. Let π = ⟨a1, b1, . . . , an, bn⟩.
(a) If π is applicable in s0, then is it a solution for P1? For P2? Why or
why not?
(b) Let E1 be the set of all state variables that are targets of the effects in
eff(a1), . . . , eff(an), and E2 be the set of all state variables that are
targets of the effects in eff(b1), . . . , eff(bn). If E1 ∩ E2 = ∅, then is π
applicable in s0? Why or why not?
(c) Let P1 be the set of all state variables that occur in
pre(a1), . . . , pre(an), and P2 be the set of all state variables that oc-
cur in pre(b1), . . . , pre(bn). If P1 ∩ P2 = ∅ and E1 ∩ E2 = ∅, then is π
applicable in s0? Is it a solution for P1? For P2? Why or why not?
2.2. Let S be the state-variable state space discussed in Example 2.7. Give
a set of restrictions such that s is a state of S if and only if it satisfies those
restrictions.
2.4. Under what conditions will GBFS switch to a different path if its cur-
rent path is not a dead end?
2.6. What might be an effective way to use hadd , hmax , hFF , and hsl with
Backward-search?
2.7. Figure 2.16 shows a planning problem involving two robots whose ac-
tions are controlled by a single actor.
(a) If we run Forward-search on this problem, how many iterations will
the shortest execution traces have, and what plans will they return?
For one of them, give the sequence of states and actions chosen in the
execution trace.
(b) If we run Backward-search on this problem, how many iterations will
the shortest execution traces have, and what plans will they return?
For one of them, give the sequence of goals and actions chosen in the
execution trace.
(c) Compute the values of hadd (s0 ) and hmax (s0 ).
(d) In the HFF algorithm, suppose that instead of exiting the loop at the
first value of k such that sk r-satisfies g, we instead keep iterating the
loop. At what value of k will |sk | reach its maximum? At what value
of k will |Ak | reach its maximum?
(e) Compute the value of hFF (s0 ).
(f) Compute the value of hsl (s0 ).
take(r, l, c)
    pre: loc(r) = l, pos(c) = l, cargo(r) = nil
    eff: cargo(r) ← c, pos(c) ← r

put(r, l, c)
    pre: loc(r) = l, pos(c) = r
    eff: cargo(r) ← nil, pos(c) ← l

move(r, l, m)
    pre: loc(r) = l
    eff: loc(r) ← m

s0 = {loc(r1) = loc1, loc(r2) = loc2,
      cargo(r1) = nil, cargo(r2) = nil,
      pos(c1) = loc1, pos(c2) = loc2}

g = {pos(c1) = loc2, pos(c2) = loc2}

[Figure drawing: robots r1 and r2 and containers c1 and c2, with r1 and c1 at loc1, and r2 and c2 at loc2.]
Figure 2.16: Planning problem for Exercise 2.7. In the action templates, r is
a robot, l and m are locations, and c is a container. In this problem, unlike
some of our previous examples, both robots may have the same location.
assign(x1, x2, n)
    pre: value(x2) = n
    eff: value(x1) ← n

[Figure drawing: a partial plan from a start step (value(foo)=1, value(bar)=5, value(baz)=0) to a finish step (value(foo)=5, value(bar)=1), containing the actions a1 = assign(foo,bar,5) and a2 = assign(bar,x,1) with their causal links.]
Figure 2.17: Partial plan for swapping the values of two variables.
2.9. Figure 2.17 shows a partial plan for the variable-swapping problem in
Exercise 2.8.
(a) How many threats are there? What are they? What are their re-
solvers?
(b) Can PSP generate this plan? If so, describe an execution trace that
will produce it. If not, explain why not.
(c) In PSP's search space, how many immediate successors does this par-
tial plan have?
(d) How many solution plans can PSP produce from this partial plan?
(e) How many of the preceding solution plans are minimal?
(f) Trace the operation of PSP if we start it with the plan in Figure 2.17.
Follow whichever of PSP's execution traces finds the shortest plan.
pickup(x)
    pre: loc(x) = table, top(x) = nil, holding = nil
    eff: loc(x) ← hand, holding ← x

putdown(x)
    pre: holding = x
    eff: loc(x) ← table, holding ← nil

unstack(x, y)
    pre: loc(x) = y, top(x) = nil, holding = nil
    eff: loc(x) ← hand, top(y) ← nil, holding ← x

stack(x, y)
    pre: holding = x, top(y) = nil
    eff: loc(x) ← y, top(y) ← x, holding ← nil

Objects = Blocks ∪ {hand, table, nil}
Blocks = {a, b, c}

s0 = {top(a) = c, top(b) = nil, top(c) = nil, holding = nil,
      loc(a) = table, loc(b) = table, loc(c) = a}

g = {loc(a) = b, loc(b) = c}

[Figure drawing: block c sits on block a; blocks a and b are on the table.]
(b) Is the holding state variable really needed? Why or why not?
(c) In the planning problem in Figure 2.18(b), how many states satisfy g?
(d) Give necessary and sufficient conditions for a set of atoms to be a
state.
(e) Is every blocks world planning problem solvable? Why or why not?
2.11. Repeat Exercise 2.8 on the planning problem in Figure 2.18(b), with
s1 = γ(s0, unstack(c,a)) and s2 = γ(s0, pickup(b)).
2.12. Repeat Exercise 2.9 using the planning problem in Figure 2.18(b) and
the partial plan shown in Figure 2.19.
2.13. Let π be a partially ordered solution for a planning problem P =
(Σ, s0, g).
(a) Write a simple modification of Run-Lazy-Lookahead to execute π.
(b) Suppose your procedure is executing π, and let π' be the part of π that
it has not yet executed. Suppose an unanticipated event invalidates
some of the total orderings of π' (i.e., not all of them will still achieve
[Figure 2.19: a partial plan for the blocks world problem of Figure 2.18, with the actions unstack(x,a), putdown(x), pickup(a), stack(a,b), pickup(b), and stack(b,c) linked by causal links between the start and finish steps.]
2.15. Let P be a planning problem in which the action templates and initial
state are as shown in Figure 2.16, and the goal is g = {pos(c1) = loc2}. In
the Run-Lazy-Lookahead algorithm, suppose the call to Lookahead(P) returns
the solution plan ⟨take(r1,loc1,c1), move(r1,loc1,loc2), put(r1,loc2,c1)⟩.
(a) Suppose that after the actor has performed take(r1,loc1,c1) and
move(r1,loc1,loc2), monitoring reveals that c1 fell off of the robot and
is still back at loc1. Tell what will happen, step by step. Assume that
Lookahead(P ) will always return the best solution for P .
(b) Repeat part (a) using Run-Lookahead.
(c) Suppose that after the actor has performed take(r1,loc1,c1), monitor-
ing reveals that r1's wheels have stopped working, hence r1 cannot
move from loc1. What should the actor do to recover? How would
you modify Run-Lazy-Lookahead, Run-Lookahead, and Run-Concurrent-
Lookahead to accomplish this?
Deliberation with
Refinement Methods
where
method-name is a unique symbol designating the method;
arg1 , . . . , argk are variables appearing in the method; an applicable
instance of a method binds these variables to objects and values;
task-identifier gives the task to which the method is applicable; simi-
larly for an event;
test specifies conditions under which the method may be used;
3.1.3 Illustrations
Let us illustrate the refinement method representation with a few examples.
Example 3.2. Consider the task for the robot in Example 3.1 to pick up
a particular container c. The robot may know the location of c (i.e., this
information may be in ), in which case the robot goes to that location to
take c. Otherwise, the robot will have to look at the locations it can reach
until it finds what it is looking for. This is expressed through two tasks,
fetch and search, and the following refinement methods:
m-fetch(r, c)
task: fetch(r, c)
pre:
body: if pos(c) = unknown then search(r, c)
else if loc(r) = pos(c) then take(r, c, pos(c))
else do
move-to(r, pos(c))
take(r, c, pos(c))
m-fetch refines the task fetch into a task search when the position of c is
unknown; otherwise, it triggers the appropriate take and, if needed, move-to
commands to pick up c.
(We use informal pseudocode descriptions of the bodies of methods.)
m-search(r, c)
    task: search(r, c)
    pre: pos(c) = unknown
    body: if ∃l ∈ Locations such that view(l) = F then do
              move-to(l)
              perceive(l)
              if pos(c) = l then take(r, c, l)
              else search(r, c)
          else fail
The method performs a search by going to a location l, the content of which
is not yet known, and perceiving l. If c is there, the robot takes it; otherwise
the method recursively searches in other locations. If all locations have been
perceived, the search task fails.
The above example illustrates two task refinement methods. Let us
provide the robot with a method for reacting to an event.
Example 3.3. Suppose that a robot in the domain of Example 3.1 may have
to react to an emergency call by stopping its current activity and going to
the location from where the emergency originates. Let us represent this
with an event emergency(l, i), where l is the emergency origin location and
i ∈ N is an identification number of this event. We also need an additional
state variable: emergency-handling(r) ∈ {T, F} indicates whether the robot r
is engaged in handling an emergency.
m-emergency(r, l, i)
    event: emergency(l, i)
    pre: emergency-handling(r) = F
    body: emergency-handling(r) ← T
          if load(r) ≠ nil then put(r, load(r))
          move-to(l)
          address-emergency(l, i)
This method is applicable if robot r is not already engaged in handling
an emergency. In that case, the method sets its emergency-handling state
variable; it unloads whatever the robot is loaded with, if any; it triggers the
command to go to the emergency location; then it sets a task for addressing
this emergency. Other methods are supposed to switch emergency-
handling(r) back to F when r has finished with the task address-emergency.
The previous simple examples introduced the representation. Let us now
illustrate how refinement methods can be used to handle the more complex
tasks discussed in Figure 1.2, such as opening a door. To keep the example
readable, we consider a one-arm robot and assume that the door is unlocked
(Exercises 3.9 and 3.10 cover other cases).
Example 3.4. Let us endow the robot with methods for opening doors. In
addition to the four state variables loc, load, pos, view introduced previously,
we need to characterize the opening status of the door and the position of
the robot with respect to it. The two following state variables fill that need:
reachable(r, o) ∈ {T, F}: indicates that robot r is within reach of object
o; here o is the door handle;
door-status(d) ∈ {closed, cracked, open, unknown}: gives the opening
status of door d; a cracked door is unlatched.
Furthermore, the following rigid relations are used:
adjacent(l, d): means that location l is adjacent to door d;
toward-side(l, d): location l is on the toward side of door d (i.e.,
where the door hinges are);
away-side(l, d): location l is on the away side of door d;
handle(d, o): o is the handle of door d;
type(d, rotate) or type(d, slide): door d rotates or slides;
side(d, left) or side(d, right): door d turns or slides to the left or to the
right, respectively, with respect to the toward side of the door.
The commands needed to open a door are as follows:
move-close(r, o): robot r moves to a position where reachable(r, o)=T;
move-by(r, λ): r performs a motion of magnitude and direction given
by vector λ;
grasp(r, o): robot r grasps object o;
ungrasp(r, o): r ungrasps o;
turn(r, o, α): r turns a grasped object o by angle α ∈ [−π, +π];
pull(r, λ): r pulls its arm by vector λ;
push(r, λ): r pushes its arm by λ;
monitor-status(r, d): r focuses its perception to keep door-status up-
dated;
end-monitor-status(r, d): terminates the monitoring command.
m-opendoor(r, d, l, o)
    task: opendoor(r, d)
    pre: loc(r) = l ∧ adjacent(l, d) ∧ handle(d, o)
    body: while ¬reachable(r, o) do
              move-close(r, o)
          monitor-status(r, d)
          if door-status(d) = closed then unlatch(r, d)
          throw-wide(r, d)
          end-monitor-status(r, d)
m-opendoor is a method for the opendoor task. It moves the robot close to
the door handle, unlatches the door if it is closed, then pulls it open while
monitoring its status. It has two subtasks: unlatch and throw-wide.
m1-unlatch(r, d, l, o)
    task: unlatch(r, d)
    pre: loc(r) = l ∧ toward-side(l, d) ∧ side(d, left) ∧ type(d, rotate)
         ∧ handle(d, o)
    body: grasp(r, o)
          turn(r, o, alpha1)
          pull(r, val1)
          if door-status(d) = cracked then ungrasp(r, o)
          else fail
m1-throw-wide(r, d, l, o)
    task: throw-wide(r, d)
    pre: loc(r) = l ∧ toward-side(l, d) ∧ side(d, left) ∧ type(d, rotate)
         ∧ handle(d, o) ∧ door-status(d) = cracked
    body: grasp(r, o)
          pull(r, val1)
          move-by(r, val2)
The preceding two methods are for doors that open by rotating on a hinge,
to the left and toward the robot. Other methods are needed for doors that
rotate to the right, doors that rotate away from the robot, and sliding doors
(see Exercise 3.7).
The method m1-unlatch grasps the door handle, turns it, then pulls the
handle before ungrasping it. The method m1-throw-wide grasps the handle,
pulls it, and moves backward to pull the door wide open.

[Figure: the acting engine RAE, with its library of methods M, receives tasks and events and maintains a set of facts; it triggers commands on the execution platform, which acts on the environment and reports events back.]
At some points, there can be several external tasks and events being processed
concurrently.
To each external task or event τ that RAE reads in its stream, it asso-
ciates a LIFO stack that keeps track of how the refinement of τ is progress-
ing. There can be several such stacks being concurrently processed. The
refinement of τ is done according to a method in M, which may, at some
point, lead to a subtask τ' that will be put on top of the stack of τ. This
is pursued recursively. A refinement at any level by a method may fail, but
other methods may be applicable and are tried.
For each external task or event that RAE is currently processing, it main-
tains a refinement stack that is analogous to the execution stack of a com-
puter program. A refinement stack contains the following items:
all pending subtasks in which an external task or event is being refined,
the method currently tried for each pending subtask,
the current execution step of each method, and
previous methods tried for each subtask that failed.
A refinement stack is organized as a LIFO list of tuples:
stack = ⟨tuple1, . . . , tuplek⟩. Each tuple is of the form (τ, m, i, tried),
where τ is a task or an event, m is an instance of a method in M addressing
τ, i is a pointer to the current step in the body of m, and tried is a set of
instances of methods already tried for τ that failed to accomplish it. The
top tuple of a refinement stack corresponds to the active method.
Progressing in a refinement stack means advancing sequentially by one
step in the body of the topmost method in the stack. The external task
or event that initiates a refinement stack remains under progress, at the
bottom of the stack, as long as this stack is not empty.
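In programming terms, such a stack can be held as a simple LIFO list of tuples; the following Python sketch uses hypothetical field names for illustration:

    # A minimal sketch of a refinement stack as a Python list used
    # LIFO-fashion. Field names are illustrative, not RAE's actual code.
    from dataclasses import dataclass, field

    @dataclass
    class Frame:
        tau: object            # the task or event being refined
        method: object         # the method instance currently tried
        step: object = None    # pointer to the current step in the body
        tried: set = field(default_factory=set)  # failed method instances

    stack = []
    stack.append(Frame(tau="fetch(r1,c2)", method="m-fetch(r1,c2)"))
    # refining into a subtask pushes a new frame on top:
    stack.append(Frame(tau="search(r1,c2)", method="m-search(r1,c2)"))
    active = stack[-1]   # the top tuple corresponds to the active method
    stack.pop()          # finishing the topmost method's body pops it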
While RAE is advancing on a refinement stack, other external tasks and
events may appear in its input stream. RAE will create refinement stacks for
them too and will process all of its refinement stacks concurrently. At this
stage, RAE does not consider the possible dependencies among concurrent
stacks (extensions are discussed in Section 3.2.4). In particular, it does not
perform any ordering or synchronization between them. The management
of possible conflicts between concurrent stacks has to be taken care of in the
specification of the methods.
Rae(M)
    Agenda ← ∅
    loop
        until the input stream of external tasks and events is empty do
            read τ in the input stream
            Candidates ← Instances(M, τ, ξ)
            if Candidates = ∅ then output("failed to address" τ)
            else do
                arbitrarily choose m ∈ Candidates
                Agenda ← Agenda ∪ {⟨(τ, m, nil, ∅)⟩}
        for each stack ∈ Agenda do
            Progress(stack)
            if stack = ∅ then Agenda ← Agenda \ {stack}
Progress(stack)
    (τ, m, i, tried) ← top(stack)
    if i ≠ nil and m[i] is a command then do
        case status(m[i])
            running: return
            failure: Retry(stack); return
            done: continue
    if i is the last step of m then
        pop(stack)        // remove stack's top element
    else do
        i ← nextstep(m, i)
        case type(m[i])
            assignment: update ξ according to m[i]; return
            command: trigger command m[i]; return
            task: continue
        τ' ← m[i]
        Candidates ← Instances(M, τ', ξ)
        if Candidates = ∅ then Retry(stack)
        else do
            arbitrarily choose m' ∈ Candidates
            stack ← push((τ', m', nil, ∅), stack)
failed. If there are any method instances for τ that are not in that set
and are applicable in the current state ξ, then Retry chooses one of them;
the refinement of τ will proceed with that method. Otherwise, RAE cannot
accomplish τ. If the stack is empty, then τ is an external task or event.
Otherwise, Retry calls itself recursively on the topmost stack element, which
is the one that generated τ as a subgoal.
Although Retry implements a mechanism similar to backtracking, it is
not backtracking in the usual sense. It does not go back to a previous
computational node to pick up another option among the candidates that
were applicable when that node was first reached. Instead, it looks for another
method among those that are now applicable in the current state of the world ξ.
This is essential because RAE interacts with a dynamic world. It cannot rely
on the set Instances(M, τ, ξ) computed earlier, because some of these methods
may no longer be applicable, while new ones may have become applicable.
Note that the same method instance that failed at some point may
Retry(stack)
    (τ, m, i, tried) ← pop(stack)
    tried ← tried ∪ {m}
    Candidates ← Instances(M, τ, ξ) \ tried
    if Candidates ≠ ∅ then do
        arbitrarily choose m' ∈ Candidates
        stack ← push((τ, m', nil, tried), stack)
    else do
        if stack ≠ ∅ then Retry(stack)
        else do
            output("failed to accomplish" τ)
            Agenda ← Agenda \ {stack}
succeed later on. However, RAE does not attempt to retry method instances
that it has already tried. In general, this would require a complex analysis
of the conditions responsible for the method's failure, to be sure that these
conditions no longer hold.
Example 3.5. Let us illustrate how RAE works, using the two methods
given in Example 3.2 and the problem depicted in Figure 3.2. The robot
r1 is at location loc3, which has been observed. Container c1 is in loc1,
and container c2 in loc2, but neither location has been observed, hence the
position of c1 and c2 is unknown. The task fetch(r1,c2) is given to RAE.
[Figure 3.2: robot r1 is at loc3; container c1 is in loc1 and container c2 is in loc2.]
Figure 3.3 shows the tree of RAE methods called for fetch(r1,c2). Ini-
tially, method m-fetch(r1,c2) is applicable. That method refines fetch(r1,c2)
into search(r1,c2). Method m-search finds a location, say loc1, that has not
been seen yet. It triggers the commands move-to(loc1) then perceive(loc1);
because c2 is not in loc1, the method recursively refines into another search
[Figure 3.3: refinement tree of tasks, methods, and commands for the task fetch(r1,c2): m-fetch(r1,c2) refines fetch(r1,c2) into search(r1,c2), which is refined by one instance of m-search(r1,c2) and then, recursively, by a second one.]
task. At this point, only loc2 remains unseen. The second instance of m-
search triggers the commands move-to(loc2), perceive(loc2), then take(r1,c2).
This successfully terminates the three methods in the stack.
Goals, like tasks, refer to an actor's objectives. The objective for a task
is to perform some activity. The objective for a goal is to reach a state
where some condition g holds (see Definition 2.18). In some cases, an actor's
objectives are more easily expressed through goals than through tasks.
Refinement methods are convenient for expressing and performing tasks.
We can easily extend the refinement method approach of RAE to handle
goals in a restricted way (see Section 5.7 for a more general handling of
goals). Our previous definition of a method as a triple
(role, precondition, body) still holds. The role is now either a task, an event,
or a goal. A goal g is specified syntactically by the construct achieve(g).
The body of a refinement method for any type of role is, as before,
a sequence of steps with control statements; each step is a command, an
assignment, or a refinement into a subtask or a subgoal. As we explain
shortly, a few modifications to RAE are sufficient to enable it to use such
methods. However, there is an important limitation.
Unlike the planning algorithms in Chapter 2, RAE does not search for
arbitrary sequences of commands that can achieve a goal g. Instead, just as
it would do for a task, RAE will choose opportunistically among the methods
in M whose roles match g. If M does not contain such a method, then g
will not be reachable by RAE. The same actor, with exactly the same set of
commands and execution platform, might be able to reach the goal g if M
contained a richer collection of methods. This limitation can be overcome,
but it requires using a planner, as we discuss in Sections 3.3 and 3.4.2.
Example 3.7. Consider the task fetch of Example 3.2. Instead of refining
it with another task, we may choose to refine it with a goal of making the
position of the container c known. The methods in Example 3.2 can be
rewritten as follows:
m-fetch(r, c)
    task: fetch(r, c)
    pre:
    body: achieve(pos(c) ≠ unknown)
          move-to(pos(c))
          take(r, c)
m-find-where(r, c)
    goal: achieve(pos(c) ≠ unknown)
    pre:
    body: while there is a location l such that view(l) = F do
              move-to(l)
              perceive(l)
The last method tests its goal condition and succeeds as soon as g is met
with respect to the current ξ. The position of c may become known by some
other means than the perceive command, for example, if some other actor
shares this information with the robot. These two methods are simpler than
those in Example 3.2.
Because achieve(g) has the semantics and limitations of tasks, it is pro-
cessed by RAE as a task. One may ask: what is the advantage of introducing
goals in RAE? The main advantage is to allow for monitoring of the con-
dition g with respect to the observed environment expressed in ξ. For a
method m whose role is achieve(g), RAE can check, before starting the body
of m, whether g holds in the current state ξ. It also performs this test at every
progression step in the body of m and when m finishes. If the test succeeds,
then the goal is achieved, and the method stops. If the test fails when the
progression finishes, then the method has failed, and the Retry process is
performed.
In the previous example, nothing needs to be done if pos(c) is known
initially; if not, the m-find-where method stops if that position becomes
known at some point of the while loop.
The monitoring test is easily implemented by making three modifications
to the Progress procedure, Algorithm 3.2 (a small sketch follows this list):
If the previous step m[i] is a command that returns failure: a Retry is
performed only when g does not hold in the current ξ.
If i is the last step of m: if g is met in the current ξ, then the top tuple
is removed from the stack (success case); if not, a Retry on the current
stack is performed.
After i is updated with nextstep(m, i): if g is met in the current ξ,
then the top tuple is removed from the current stack without pursuing the
refinement further.
Note that if the previous step is a command that is still running, we postpone
the test until it finishes (no progress for the method in that case).
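Here is a small Python sketch of that monitoring test (with hypothetical arguments, not RAE's actual code): goal_holds is a zero-argument callable testing g against the current observed state, and retry stands for the Retry procedure:

    # A sketch of the goal-monitoring test wrapped around one progression
    # step; goal_holds and retry are assumptions supplied by the caller.
    def monitored_step(goal_holds, stack, command_status, is_last_step, retry):
        if command_status == "failure":
            if goal_holds():
                stack.pop()      # goal reached despite the failed command
            else:
                retry(stack)     # Retry only when g does not hold
        elif is_last_step:
            if goal_holds():
                stack.pop()      # success case
            else:
                retry(stack)     # body finished but g still does not hold
        elif goal_holds():
            stack.pop()          # goal met early: stop refining further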
The monitoring capability allowed with goals is quite convenient.
It can be generalized to tasks by adding an extra field in methods.
The choice of the appropriate method for addressing a task when several
are applicable should be based on an estimate of how effective a method
will be in the current context for that task. Domain-specific heuristics can
be convenient for making informed choices. Ideally, however, one needs pre-
dictive models and a lookahead capability to be able to compare alternative
courses of action, but RAE uses operational models without predictive capa-
bility: the refinement methods defined so far are solely reactive (this does
not prevent embedding planning capabilities in these methods for performing
specific tasks or steps, as illustrated in Exercises 3.1, 3.2, and 3.3). Let us first
extend them for the purpose of planning in the next section; then we'll come
back to this issue of informed choices in RAE with look-ahead mechanisms
in Section 3.4.
One way to help RAE make choices is to do refinement planning, that is,
to explore RAE's search space in order to predict the outcomes of different
possible courses of action. This section describes two refinement-planning
algorithms, SeRPE and IRT, that can be used for that purpose. In both of
them, the basic idea is to do predictive simulations of RAE's task refinement
process.
The planner's initial state s0 will be RAE's current state ξ, and the
planner will use methods like the ones that RAE uses; but instead of sending
commands to an execution platform, the planner will use descriptive models,
that is, action templates as in Chapter 2, to predict the effects of the commands.
At points where RAE would choose a method m to use for some task or
goal τ, the planner will use search techniques like the ones in Chapter 2 to
explore several of the possible choices for m, to predict for each m whether
it will succeed in accomplishing τ.
As written, SeRPE and IRT require the classical planning assumptions
discussed in Section 2.1.1. Consequently, they cannot reason about how RAE
might handle situations in which actions have outcomes that are not known
in advance. For example, Example 3.2 involved searching for a container
using a command called perceive. We know in advance that if the actor
performs the perceive action, the values of some state variables will become
known, but we do not know what those values will be. Hence we cannot
write a classical action template for perceive.
if Candidates = ∅ then
    if τ is a goal achieve(g) then return find-plan(Σ, s, g)
    else return failure
SeRPE(M, A, s, τ)
    Candidates ← Instances(M, τ, s)
    if Candidates = ∅ then return failure        (i)
    nondeterministically choose m ∈ Candidates
    return Progress-to-finish(M, A, s, τ, m)
Progress-to-finish(M, A, s, τ, m)
    i ← nil        // instruction pointer for body(m)
    π ← ⟨⟩         // plan produced from body(m)
    loop
        if τ is a goal and s |= τ then return π        (ii)
        if i is the last step of m then
            if τ is a goal and s ⊭ τ then return failure    (iii)
            return π
        i ← nextstep(m, i)
        case type(m[i])
            assignment: update s according to m[i]
            command:
                a ← the descriptive model of m[i] in A
                if s |= pre(a) then
                    s ← γ(s, a); π ← π.a
                else return failure
            task or goal:
                π' ← SeRPE(M, A, s, m[i])
                if π' = failure then return failure
                s ← γ(s, π'); π ← π.π'
Here, A contains a descriptive model (an action template) correspond-
ing to each command, and S is the set of states constructed with a generative
approach from s and A. In the modification proposed for line (i), find-plan
could be one of the planning algorithms in Chapter 2, with modifications
to make it return control to SeRPE if it sees a goal for which there is an
applicable method (see the discussion of this in Section 3.5.2).
Example 3.8. Figure 3.4 shows a state in a planning domain similar to the
one in Example 2.12 except that there is only one robot. Consider the tasks
of uncovering a container and putting it into a specified pile. Following are
methods to accomplish those tasks in some (but not all) cases. The variables
in these methods have the following ranges: c ∈ Containers; r ∈ Robots;
d, d' ∈ Docks; p, p' ∈ Piles. There are three kinds of tasks:
m1-put-in-pile(c, p')
    task: put-in-pile(c, p')
    pre: pile(c) = p'
    body: // empty

m2-put-in-pile(r, c, p, d, p', d')
    task: put-in-pile(c, p')
    pre: pile(c) = p ∧ at(p, d) ∧ at(p', d')
         ∧ p ≠ p' ∧ cargo(r) = nil
    body: if loc(r) ≠ d then navigate(r, d)
          uncover(c)
          load(r, c, pos(c), p, d)
          if loc(r) ≠ d' then navigate(r, d')
          unload(r, c, top(p'), p', d')
[Figure 3.4: a state with docks d1, d2, and d3, robot r1, piles p1, p2, and p3, and containers c1, c2, and c3.]
m1-uncover(c)
    task: uncover(c)
    pre: top(pile(c)) = c
    body: // empty

m2-uncover(r, c, p, p', d)
    task: uncover(c)
    pre: pile(c) = p ∧ top(p) ≠ c
         ∧ at(p, d) ∧ at(p', d) ∧ p ≠ p'
         ∧ loc(r) = d ∧ cargo(r) = nil
    body: while top(p) ≠ c do
              c' ← top(p)
              load(r, c', pos(c'), p, d)
              unload(r, c', top(p'), p', d)
m1-navigate(r, d')
    task: navigate(r, d')
    pre: loc(r) = d'
    body: // empty

m2-navigate(r, d')
    task: navigate(r, d')
    pre: loc(r) ≠ d', adjacent(loc(r), d')
    body: move(r, loc(r), d')

m3-navigate(r, d, d')
    task: navigate(r, d')
    pre: loc(r) ≠ d' ∧ d ≠ d'
         ∧ adjacent(loc(r), d)
    body: move(r, loc(r), d)
          navigate(r, d')
[Figure 3.5: two refinement trees, T1 and T2, for the task put-in-pile(c1,p2), both using the method m2-put-in-pile(r1,c1,p1,d1,p2,d2). In both trees, uncover(c1) is refined by m1-uncover(c1) (no children), and the actions load(r1,c1,nil,p1,d1) and unload(r1,c1,c3,p2,d2) appear. In T1, the subtask navigate(r1,d2) is refined by m2-navigate(r1,d2) into move(r1,d1,d2); in T2, it is refined by m3-navigate(r1,d3,d2) into move(r1,d1,d3) followed by navigate(r1,d2), which m2-navigate(r1,d2) refines into move(r1,d3,d2).]
If we do cycle-checking (see Section 2.2), then there are only two ways to
refine the subtask navigate(r1 , d2 ), and Figure 3.5 shows the refinement trees
for both choices. These trees correspond to the following solution plans:
[Figure: a state with docks d1, d2, and d3, robots r1 and r2, piles p1, p2, and p3, and containers c1, c2, and c3.]
If we tried to use either π3 or π4 alone, some of the actions would fail. Only
one robot can occupy a loading dock at a time, so neither robot can move
to the other dock unless the other robot first leaves that dock. We can
accomplish this by interleaving π3 and π4 to produce a plan such as this:
{interleave: π1, . . . , πn},
where each πi is a sequence of steps ⟨ai,1, . . . , ai,j⟩ for some j. This operator
has the same semantics as the concurrent operator in Section 3.2.4, except
that only one command can be performed at a time.
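For intuition, the set of command sequences denoted by {interleave: p1, p2} for two sequences can be enumerated recursively, as in this small Python sketch (ours, for illustration):

    # Enumerate the plans denoted by {interleave: p1, p2}: every merge of
    # the two sequences that preserves each sequence's internal order.
    def interleavings(p1, p2):
        if not p1:
            yield tuple(p2)
        elif not p2:
            yield tuple(p1)
        else:
            for rest in interleavings(p1[1:], p2):
                yield (p1[0],) + rest
            for rest in interleavings(p1, p2[1:]):
                yield (p2[0],) + rest

    # e.g., interleavings(["a1", "a2"], ["b1"]) yields
    # ("a1","a2","b1"), ("a1","b1","a2"), and ("b1","a1","a2")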
put-interleaved(c, p, c', p')
    task: put-both(c, p, c', p')
    pre: none
    body: {interleave:
              ⟨put-in-pile(c, p)⟩,
              ⟨put-in-pile(c', p')⟩}
Then from the task put-both(c1, p2, c3, p1), we can get the refinement tree in
Figure 3.7, which corresponds to π5.
[Figure 3.7: refinement tree for put-both(c1,p2,c3,p1) via the method put-interleaved(c1,p2,c3,p1), whose interleave node has two subtrees: put-in-pile(c1,p2) refined by m2-put-in-pile(r1,c1,p1,d1,p2,d2), and put-in-pile(c3,p1) refined by m2-put-in-pile(r2,c3,p2,d2,p1,d1). Both uncover subtasks are refined by m1-uncover (no children); the robots' load, move (via m2-navigate and m-navigate), and unload actions are interleaved, e.g., load(r1,c1,p1), load(r2,c3,p2), move(r2,d2,d3), move(r1,d1,d2), unload(r1,c1,p2), move(r2,d3,d1), unload(r2,c3,p1).]
IRT(D, s, τ)
    π ← ⟨⟩
    ν ← a new task node; data(ν) ← τ
    Pending ← {ν}
    while Pending ≠ ∅ do
        nondeterministically choose a node ν ∈ Pending
            that has no children in Pending        (i)
        case type(ν)
            task:
                τ ← data(ν)
                remove ν from Pending
                M' ← Instances(M, τ, s)
                if M' = ∅ then return failure
                nondeterministically choose m ∈ M'
                μ ← a new method node; data(μ) ← (m, 0)
                make μ a child of ν    // this will be ν's only child
                insert μ into Pending
            action:
                a ← data(ν)
                remove ν from Pending
                if a is not applicable in s then return failure
                s ← γ(s, a); π ← π.a
            program or method:
                v ← IRT-progress(D, ν, s, Pending)
                if v = failure then return failure
    return (π, ν)
IRT-progress(D, ν, s, Pending)
    (p, i) ← data(ν)
    if i is the last step of p then
        remove ν from Pending
        return
    else if p[i] is a task then
        ν' ← a new task node; data(ν') ← p[i]
        append ν' to ν's list of children
        insert ν' into Pending
    else if p[i] is a primitive command then do
        ν' ← a new action node; data(ν') ← a_p[i]
        append ν' to ν's list of children
        insert ν' into Pending
    else
        // p[i] has the form {interleave: p1, . . . , pk}
        ν' ← a new interleaving node
        append ν' to ν's list of children
        for i = 1, . . . , k do
            νi ← a new program node; data(νi) ← (pi, 0)
            insert νi into the set of children of ν'
            insert νi into Pending
m-do-a
    task: do-a
    pre: // no preconditions
    body: start-a, end-a

start-a
    pre: p, running-a = F
    eff: running-a ← T

end-a
    pre: running-a = T
    eff: e, running-a ← F
Refine-Lookahead(M, A, τ)
    while (s ← abstraction of observed state ξ) ⊭ τ do        (i)
        π ← SeRPE-Lookahead(M, A, s, τ)
        if π = failure then return failure
        a ← pop-first-action(π); Perform(a)
Refine-Lazy-Lookahead(M, A, τ)
    s ← abstraction of observed state ξ
    while s ⊭ τ do
        π ← SeRPE-Lookahead(M, A, s, τ)
        if π = failure then return failure
        while π ≠ ⟨⟩ and s ⊭ τ and Simulate(Σ, s, τ, π) ≠ failure do
            a ← pop-first-action(π); Perform(a)
            s ← abstraction of observed state ξ
Refine-Concurrent-Lookahead(M, A, τ)
    π ← ⟨⟩; s ← abstraction of observed state ξ
    thread 1:        // threads 1 and 2 run concurrently
        loop
            π ← SeRPE-Lookahead(M, A, s, τ)
    thread 2:
        loop
            if s |= τ then return success
            else if π = failure then return failure
            else if π ≠ ⟨⟩ and Simulate(Σ, s, τ, π) ≠ failure then
                a ← pop-first-action(π); Perform(a)
                s ← abstraction of observed state ξ
SeRPE-Lookahead returns a plan π whose first action is load(r1, c1, c2, p1, d1).
If no execution failures or unantici-
pated events occur, then the observed state and its abstraction are
s1 = γ(s0, load(r1, c1, c2, p1, d1)).
In the second iteration of the while loop, if SeRPE-Lookahead's nonde-
terministic choices are consistent with the ones it made the previous time,
then it returns
π3 = ⟨unload(r1, c1, c3, p2, d2)⟩,
REAP-main(M, A)
    Agenda ← ∅
    loop
        until the input stream of external tasks and events is empty do
            read τ in the input stream
            Candidates ← Instances(M, τ, ξ)
            if Candidates = ∅ then output("failed to address" τ)
            else do
                s ← abstraction of observed state ξ
                T ← Refinement-tree(Candidates, M, A, s, τ)        (i)
                if T = failure then
                    output("failed to address" τ)        (ii)
                else do
                    m ← the method instance at the top of T        (iii)
                    Agenda ← Agenda ∪ {⟨(τ, m, nil, ∅, T)⟩}
        for each stack ∈ Agenda do
            REAP-progress(M, A, stack)
            if stack = ∅ then Agenda ← Agenda \ {stack}
The method instance at the top of T is the one that the planner recommends
using for τ, so REAP-main chooses this method in line (iii).
In line (ii), REAP-main stops trying to accomplish τ if Refinement-tree
returns failure. However, REAP-main can be modified to incorporate various
fallback options. Depending on the planning domain and the developer's
objectives, a modified version of REAP-main could call Refinement-tree with
a set M' of fallback methods that it would not otherwise use, postpone
accomplishment of τ until the environment changes in a way that makes τ
feasible, or modify τ (see goal reasoning in Section 1.3.4) to make it easier
to accomplish.
In lines (ii) and (iii) of REAP-progress, the same approach is used to choose
a method m' for the task τ'. Because τ' is a subgoal of the task τ in
REAP-main, this can be viewed as a kind of subgoaling (see Section 2.6.2).
The same approach is used again in lines (i) and (ii) of REAP-retry.
REAP-progress(M, A, stack)
    (τ, m, i, tried, T) ← top(stack)
    if i ≠ nil and m[i] is a command then
        case status(m[i])
            running: return
            failure: REAP-retry(M, A, stack); return
            done:
                T' ← the unexecuted part of T
                if Simulate(ξ, T') = failure then        (i)
                    REAP-retry(M, A, stack); return
                else continue
    if i is the last step of m then
        pop(stack)        // remove (τ, m, i, tried, T)
    else
        i ← nextstep(m, i)
        case type(m[i])
            assignment: update ξ according to m[i]; return
            command: trigger command m[i]; return
            task or goal: continue
        τ' ← m[i]
        Candidates ← Instances(M, τ', ξ)
        if Candidates = ∅ then
            REAP-retry(M, A, stack); return
        s ← abstraction of observed state ξ
        T' ← Refinement-tree(Candidates, M, A, s, τ')        (ii)
        if T' = failure then REAP-retry(M, A, stack)
        else do
            m' ← the topmost method in T'        (iii)
            stack ← push((τ', m', nil, ∅, T'), stack)
REAP-retry(M, A, stack)
    (τ, m, i, tried, T) ← pop(stack)
    tried ← tried ∪ {m}
    Candidates ← Instances(M, τ, ξ) \ tried
    if Candidates = ∅ then output("failed to address" τ)
    s ← abstraction of observed state ξ
    T' ← Refinement-tree(Candidates, M, A, s, τ)        (i)
    if T' ≠ failure then
        m' ← the topmost method in T'        (ii)
        push((τ, m', nil, tried, T'), stack)
    else
        if stack ≠ ∅ then REAP-retry(M, A, stack)
        else do
            output("failed to accomplish" τ)
            Agenda ← Agenda \ {stack}
Example 3.13. Let us repeat Example 3.11 using REAP. As before, we will
suppose that no execution errors or unforeseen events occur.
In REAP-main's first loop iteration, it reads τ = put-in-pile(c1, p2)
and calls Refinement-tree. Suppose Refinement-tree returns the re-
finement tree T1 in Figure 3.5. The topmost method in T is
m = carry-to-pile(r1, c1, p1, d1, c3, p2, d2), and REAP-main puts stack1 =
⟨(τ, m, nil, ∅, T)⟩ into Agenda. Assuming that nothing else arrives in the
input stream, REAP-main calls REAP-progress repeatedly on stack1 until τ
has been accomplished, as follows:
In the first call to REAP-progress, the top element of the stack
is (τ, m, nil, ∅, T). After the call to nextstep, this is replaced by
(τ, m, i, ∅, T), with i pointing to τ1 = uncover(c1). REAP-progress
calls Refinement-tree, which returns a tree T1 that is a copy of T's left-
most branch. The topmost method in T1 is m1 = m-uncover(c1), and
REAP-progress pushes (τ1, m1, nil, ∅, T1) onto stack1.
In the second call to REAP-progress, the top element of stack1 is
(τ1, m1, nil, ∅, T1). Because c1 is already uncovered, the method
produces no actions or subtasks, and REAP-progress removes
(τ1, m1, nil, ∅, T1) from stack1.
Comparison with RAE. In our examples, often only one method instance
was applicable in a given state. In such cases, RAE would have chosen the
same method instance as REAP, without needing to call a planner. Thus it
may not be immediately evident to the reader why REAP's planner is useful.
When we use refinement planning to plan for a task of the form achieve(g),
this can be considered a way of constraining the search for g.
On one hand, constraining the search in this manner can convey a sub-
stantial efficiency advantage [445]. On the other hand, Example 3.12 demon-
strates that unless the planner is given a comprehensive set of methods that
cover all of the possible tasks to accomplish, and all of the possible situ-
ations in which they might need to be accomplished, planning can fail in
situations in which one would want it to succeed. Consequently, several
researchers have investigated ways to combine the advantages of both re-
finement planning and classical planning by using refinement methods when
they are applicable and classical planning when no refinement methods are
available.
One approach involves running a classical planner and an HTN planner
as two separate subroutines, with the refinement planner passing control to
the classical planner whenever it encounters a task for which no methods
have been defined, and the classical planner passing control to the refinement
planner whenever it encounters an action that matches the head of an
HTN method [220].
Another approach achieves the same kind of effect by compiling a set
of HTN methods (subject to certain restrictions because HTN planning has
greater expressivity than classical planning) into a set of classical actions
whose names, preconditions, and effects encode the steps involved in apply-
ing the methods, and using these actions in a classical planner [8].
A third approach [533] uses an HTN-like formalism in which there are
goals rather than tasks, and the body of a method is a sequence of goals and
actions. If the planner encounters a goal for which there is an applicable
method then it uses the method. Otherwise it invokes a landmark-based
forward search. During each episode of landmark generation, the planner
treats the landmarks as intermediate goals, reverting to refinement planning
whenever it encounters a landmark for which there is an applicable method.
Such an approach will require formal definitions of the relationships among
tasks, states, and actions at different levels, translation algorithms based on
these definitions, and planning and acting algorithms that can accommodate
these translations. A comprehensive approach for this problem has yet to be
developed.
3.6 Exercises
3.1. Modify the m-search method of Example 3.2 by assuming it uses a
planning function, plan-path, which computes an optimized sequence of lo-
cations whose content is not yet known; the search proceeds according
to this sequence.
3.2. Complete the methods of Example 3.2 by considering that move-to
is not a command but a task addressed by a method that calls a motion
planner, which returns a trajectory, then controls the motion of the robot
along that trajectory.
3.3. Complete the methods of Example 3.2 by considering that perceive
is not a command but a task that requires calling a perception planner
that returns a sequence of observation poses. Define two methods: (i) for a
complete survey of a location where perceive goes through the entire sequence
of observation poses and (ii) for a focus perception that stops when the
searched object is detected.
3.4. Analyze how the methods in Exercises 3.1, 3.2, and 3.3 embed plan-
ning capabilities in refinement methods at the acting level. Relate this to
Figure 1.2 and the discussion in Section 1.2.2.
3.5. Combine the two scenarios of Examples 3.2 and 3.3: while the robot
is searching for a container, it has to react to an emergency. What needs
to be done to ensure that the robot returns to its search when the task
address-emergency finishes (see Section 3.2.4)?
3.6. In Example 3.4, in the body of m-opendoor, why is the first word
"while" rather than "if"?
3.7. Complete the methods of Example 3.4 for refining the tasks unlatch(r, d)
and throw-wide(r, d) when the door turns to the right, when the door opens
away from the robot, and when the door slides.
3.8. Complete the methods of Example 3.4 with appropriate steps to survey
the grasping status of whatever the robot is grasping and to turn the handle
in the opposite direction before ungrasping it.
3.9. Extend Example 3.4 for a robot with two arms: the robot uses its left
(or right) arm if the door turns or slides to the left (or right, respectively).
Add a method to move an object from one of the robot's hands to the other,
which can be used if the hand holding the object is needed for opening
the door.
3.10. Extend Example 3.4 for the case in which the door might be locked
with an RFID lock system and the robot's RFID chip is attached to its left
arm.
3.11. Redefine the pseudocode of RAE, Progress, and Retry to implement
the extensions discussed in Section 3.2.4 for controlling the progress of a
task.
3.12. Implement and test the fetch task of Example 3.2 in OpenPRS
(https://git.openrobots.org/projects/openprs/wiki). Integrate the
results of Exercise 3.1 in your implementation; use for plan-path a simple
Dijkstra graph-search algorithm. Is it possible to extend your OpenPRS
implementation to handle the requirements stated in Exercise 3.5?
3.13. In Example 3.8, rewrite the two methods for put-in-pile(c, p0 ) as a
single method. What are the benefits and drawbacks of having them as one
method rather than two?
3.14. For the task uncover(c) in Example 3.8, write a method or set of
methods for the case where there are containers on c but no other pile at
the same dock.
3.15. Professor Prune says that the m-navigate method in Example 3.8 can
cause excessive backtracking. Is he correct? Explain why or why not, and
if he is correct, then write a better method or set of methods.
3.16. Following is a domain-specific acting algorithm to find near-optimal
solutions for blocks world problems (see Exercise 2.10), where "optimal"
means the smallest possible number of actions. In it, s0 is an initial state
in which holding = nil, and g is a set of loc atoms (e.g., as in Figure 2.18).
Here are some definitions of terms used in the algorithm:
For each block b, if g contains an atom of the form loc(b) = c, then
goal(b) = c. If there is no such atom, then goal(b) = nil.
A block b is unfinished if s0(loc(b)) ≠ goal(b) and goal(b) ≠ nil, or if
s0(loc(b)) is an unfinished block. Otherwise b is finished.
A block b is clear if top(b) = nil.
Stack-blocks(s0 , g)
while there is at least one unfinished block do
if there is an unfinished clear block b such that
goal(b) = table or goal(b) is a finished clear block
then
move b to goal(b)
else
choose a clear unfinished block b
move b to table
(a) What sequence of actions will this algorithm produce for the planning
problem in Exercise 2.10(b)?
(b) Write a set of refinement methods that encode this algorithm. You
may assume that there is already a function finished(b) that returns
true if b is finished and false otherwise.
3.17. Suppose we try to use SeRPE on the problem in Example 3.6. Draw
as much of the refinement tree as you can. What problem prevents you from
drawing the entire refinement tree? Suggest a way to resolve the problem.
3.18. Rewrite the pseudocode for SeRPE, replacing the nondeterministic
choice with depth-first backtracking.
3.19. In Example 3.11, suppose that every time r1 starts down the road
from d1 to d2 , it hits a bump that knocks c1 off of r1 and back onto p1 .
(a) What sequence of commands will ARP-lazy, ARP-interleaved, and ARP-
asynchronous execute?
(b) What sequence of commands will REAP execute?
3.20. In Exercise 3.16, suppose that when the robot hand tries to pick up
a block, sometimes it will drop the block onto the table.
(a) What sequence of commands will ARP-lazy, ARP-interleaved, and ARP-
asynchronous execute?
(b) What sequence of commands will REAP execute? What kind of mod-
ification could you make to REAP to make it keep trying until it suc-
ceeds?
3.21. Redo Example 3.11 using a refinement planner that does a receding-
horizon search. More specifically, the planner is a modified version of SeRPE
that generates the first two actions of every refinement plan (hence it looks
at all partial plans of two steps or less), and it returns the partial plan that
(according to some kind of heuristic evaluation) is closest to accomplishing
the task or goal. You can assume that the heuristic evaluation always gives
accurate results.
This chapter is about planning and acting approaches in which time is ex-
plicit in the descriptive and operational models of actions, as well as in
the models of the expected evolution of the world. It describes several al-
gorithms and computation methods for handling durative and concurrent
activities with respect to a predicted dynamics.
The first section addresses the need for making time explicit in the delib-
eration of an actor. A knowledge representation for modeling actions with
temporal variables is presented in Section 4.2. It relies on an extension of
the refinement methods introduced earlier, which are seen here as chronicles,
that is, collections of assertions and tasks with explicit temporal constraints.
A planning algorithm with temporal refinement methods is developed in Sec-
tion 4.3. The basic techniques for managing temporal constraints and the
controllability of temporal plans are then presented in Section 4.4. Acting
problems with temporal domain models are discussed in Section 4.5, consid-
ering different types of operational models. The chapter concludes with a
discussion and historical remarks, followed by exercises.
4.1 Introduction
To perform an action, different kinds of resources may need to be borrowed
(e.g., space, tools) or consumed (e.g., energy). Time is a resource required
by every action, but it differs from other types of resources. It flows indepen-
dently from the actions being performed, and it can be shared ad infinitum
by independent actors as long as their actions do not interfere with each
other.
In previous chapters, we left time implicit in our models: an action
produced an instantaneous transition from one state to the next. However,
deliberative acting often requires explicit temporal models of actions. Rather
than just specifying an action's preconditions and effects, temporal models
must specify what things an action requires and what events it will cause
at various points during the action's performance. For example, moving a
robot r1 from a loading dock d1 to a loading dock d2 does not require d2's
availability at the outset, but it does require it shortly before r1 reaches d2.
Actions may, and sometimes must, overlap, even if their conditions and
effects are not independent. As one example, r1 may move from d1 to
d2 while r2 is concurrently moving from d2 to d1 . As another, opening a
door that has a knob and a spring latch that controls the knob requires
two tightly synchronized actions: (i) pushing and maintaining the latch
while (ii) turning the knob. Modeling such concurrency requires an explicit
representation of time.
Goals are sometimes constrained with absolute deadlines. Events may be
expected to occur at future time periods, for example, the arrival of sched-
uled ships at a harbor. Actions may have to be located in time with respect
to expected events or deadlines. Time can be required qualitatively, to han-
dle synchronization between actions and with events, and quantitatively, to
model the duration of actions with respect to various parameters.
In summary, the main motivations for making time explicit in planning
and acting are the following:
modeling the duration of actions;
modeling the effects, conditions, and resources borrowed or consumed
by an action at various moments along its duration, including delayed
effects;
handling the concurrency of actions that have interacting and joint
effects;
handling goals with relative or absolute temporal constraints;
planning and acting with respect to exogenous events that are expected
to occur at some future time; and
planning with actions that maintain a value while being executed, as
opposed to just changing that value (e.g., tracking a moving target, or
keeping a spring latch in some position).
An explicit representation of time for the purpose of acting and planning
can be either:
[Figure: a dynamic system viewed either as a sequence of states over time or as a set of timelines.]
As a shorthand, [t] x = v stands for [t, t+1) x = v, and [t] x : (v, v') stands for
[t, t+1] x : (v, v'); the former gives the value of x at a single time point, and the
latter expresses a transition from v to v' over two consecutive time points.
In general, an assertion [t, t'] x : (v, v') does not model how the change takes
place within the interval [t, t']; the change can be gradual, through possibly
intermediate values. (This assumption avoids some minor issues regarding
closed versus open intervals.)
Example 4.3. The whereabouts of the robot r1, as depicted in Figure 4.2,
can be expressed with the following timeline:
    ({[t1, t2] loc(r1):(loc1, l), [t2, t3] loc(r1) = l, [t3, t4] loc(r1):(l, loc2)},
     {t1 < t2 < t3 < t4, l ≠ loc1, l ≠ loc2})
In this timeline, T has three assertions: one persistence and two changes; C
has temporal and object constraints. The constraints are in this particular
case entailed from the three intervals and two change assertions in T. In-
stances of the timeline are substitutions of possible values in these assertions
for the five variables l, t1, . . . , t4.
Note that this timeline does not say what happens between t1 and t2; all
we know is that r1 leaves loc1 at or after t1, and it arrives at l at or before t2.
To say that these two changes happen exactly at t1 and t2, we can add the fol-
lowing assertions to the timeline: [t1, t1+1] loc(r1):(loc1, route) and [t2−1, t2]
loc(r1):(route, l), where route is some intermediate location. These asser-
tions say that [t1] loc(r1) = loc1, [t1+1] loc(r1) = route, [t2−1] loc(r1) = route,
and [t2] loc(r1) = l.
[Figure 4.2 plot: the value of loc(r1) over time, with a change from loc1 starting at t1 and arriving at a location l by t2, a persistence at l from t2 to t3, then a change from l arriving at loc2 by t4.]
Figure 4.2: A timeline for the state variable loc(r1). The positions of the
points on the two axes are qualitative; the rough lines do not necessarily
represent linear changes.
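As an illustration of how such a timeline might be represented in code (hypothetical class and field names, not the book's formalism), the assertions of Example 4.3 can be held as tagged records plus a separate constraint store:

    # A sketch of a timeline as a set of assertions plus constraints,
    # using symbolic time points. Class and field names are illustrative.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Persistence:          # [t1, t2] x = v
        t1: str
        t2: str
        var: str
        value: str

    @dataclass(frozen=True)
    class Change:               # [t1, t2] x : (v, v')
        t1: str
        t2: str
        var: str
        before: str
        after: str

    # the timeline of Example 4.3 / Figure 4.2
    timeline = {
        Change("t1", "t2", "loc(r1)", "loc1", "l"),
        Persistence("t2", "t3", "loc(r1)", "l"),
        Change("t3", "t4", "loc(r1)", "l", "loc2"),
    }
    constraints = {("t1", "<", "t2"), ("t2", "<", "t3"), ("t3", "<", "t4")}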
(e.g., for the initial state), or explained by some reason due to the actor's own
actions or to the dynamics of the environment. For example, looking at
the timeline in Figure 4.2, the locations of the robot in l, then in loc2, are
explained by the two change assertions in that timeline. However, nothing
explains how the robot got to loc1; we have to state an assertion saying that
it was there initially or brought there by a move action.
Note that by definition of the intervals [t'', t] and [t, t'] we have t'' < t < t'.
Hence this definition excludes circular support; that is, an assertion α cannot
support an assertion β while β supports α, regardless of whether this support
is direct or by transitivity via some other assertions.
add assertions to T, for example, for the timeline in Figure 4.2, to take
into account additional motions of the robot.
instantiate some of the variables, which may possibly split a timeline of
the set with respect to different state variables; for example, assertions
related to loc(r) and loc(r') refer to the same state variable, but that
timeline will be split if r is instantiated as r1 and r' as r2.
4.2.2 Actions
We model an action as a collection of timelines. More precisely, a primitive
action template, or a primitive for short, is a triple (head, T , C), where head
is the name and arguments of the primitive, and (T , C) is a set of timelines.
The reader may view this representation as an extension of the action tem-
plates of Chapter 2 with explicit time expressing conditions and effects at
different moments during the time span of an action.
The constant empty means that a robot, a crane, a pile, or a dock is empty,
or that a container is not stacked on any other container.
The task in this example is to bring containers from their current position
to a destination pile. It is specified with primitives, tasks, and methods (to
which well come back in the next section). The primitives are the following:
leave(r, d, w)
    assertions: [ts, te] loc(r):(d, w)
                [ts, te] occupant(d):(r, empty)
    constraints: te ≤ ts + δ1
                 adjacent(d, w)
This expression says that the leave action changes the location of r from
dock d to the adjacent waypoint w, with a delay smaller than δ1 after the
action starts at ts; the dock d is empty when the action ends at te.
Similarly, enter is defined by the following action template:
enter(r, d, w)
assertions: [ts , te ]loc(r):(w, d)
[ts , te ]occupant(d):(empty, r)
constraints: te ≤ ts + 2
adjacent(d, w)
The take primitive is specified as follows:
take(k, c, r)
assertions: [ts , te ]pos(c):(r, k)
[ts , te ]grip(k):(empty, c)
[ts , te ]freight(r):(c, empty)
[ts , te ]loc(r)=d
constraints: attached(k, d), attached(p, d)
[t, t′] task.
The preceding expression means that task takes place over an interval con-
tained within [t, t′]; that is, it starts at or after t and finishes at or before
t′. Note that [t, t′] task has different semantics than a persistence condition
on a state variable. It just says that task should happen within [t, t′] and does
not require task to persist throughout the entire interval.
A task is refined into subtasks and primitives using refinement meth-
ods. A temporal refinement method is a tuple (head, task, refinement, T, C),
where head is the name and arguments of the method, task gives the task
to which the method applies, refinement is the set of temporally qualified
subtasks and primitives into which it refines task, T is a set of assertions, and
C a set of constraints on temporal and object variables. A temporal refinement
method does not need a separate precondition field, as in the methods of the
previous chapter. This is because temporal assertions may express conditions
as well as effects in a flexible way and at different moments. Temporal refinement
methods are illustrated in Example 4.12.
m-bring(r, c, p, p′, d, d′, k, k′)
task: bring(r, c, p) # r brings container c to pile p
refinement: [ts, t1] move(r, d′)
[ts, t2] uncover(c, p′)
[t3, t4] load(k′, r, c, p′)
[t5, t6] move(r, d)
[t7, te] unload(k, r, c, p)
assertions: [ts, t3] pile(c)=p′
[ts, t3] freight(r)=empty
constraints: attached(p′, d′), attached(p, d), d ≠ d′
attached(k′, d′), attached(k, d)
t1 ≤ t3, t2 ≤ t3, t4 ≤ t5, t6 ≤ t7
This method refines the bring task into five subtasks: move the robot to
d′ and then to d, uncover container c so that it is at the top of pile p′, load the
robot in d′, and unload it in d onto the destination pile p. As depicted in Figure 4.3,
the first move and uncover are concurrent (t1 and t2 are unordered). When
both tasks finish, the remaining tasks are sequential. Container c remains
in its original pile, and robot r remains empty, until the load task starts.
[Figure: the subtasks move, uncover, load, move, and unload laid out over the time points ts, t1, . . . , t7, te, together with the assertions on pile(c) and on the robot's freight.]
Figure 4.3: Assertions, actions and subtasks of a refinement method for the
bring task. The diagonal arrows represent precedence constraints.
m-move1(r, d, d′, w, w′)
task: move(r, d) # move robot r to dock d
refinement: [ts, t1] leave(r, d′, w′)
[t2, t3] navigate(w′, w)
[t4, te] enter(r, d, w)
assertions: [ts, ts+1] loc(r)=d′
constraints: adjacent(d, w), adjacent(d′, w′), d ≠ d′
connected(w, w′)
t1 ≤ t2, t3 ≤ t4
This method refines the move to a destination dock d into three successive
steps: leave the starting dock d′ to an adjacent waypoint w′, navigate to a
connected waypoint w adjacent to the destination, and enter the destination
d, which is required to be empty only when the robot gets there. The move
task requires additional methods to address cases in which the robot starts
from a road or when it is already there (see Exercise 4.2).
m-uncover(c, p, k, d, p′)
task: uncover(c, p) # unpile p until its top is c
refinement: [ts, t1] unstack(k, c′, p)
[t2, t3] stack(k, c′, p′)
[t4, te] uncover(c, p)
assertions: [ts, ts+1] pile(c)=p
[ts, ts+1] top(p)=c′
[ts, ts+1] grip(k)=empty
constraints: attached(k, d), attached(p, d)
attached(p′, d), p ≠ p′, c′ ≠ c
t1 ≤ t2, t3 ≤ t4
This method refines the uncover task into unstacking the container at
the top of pile p and moving it to a nearby pile p′, and then invoking uncover
again recursively if the top of p is not c. Another method should handle the
case where c is at the top of p.
Finally, the task load can be refined into unstack and put primitives; task
unload is similarly refined into take and stack (see Exercise 4.2).
4.2.4 Chronicles
φ0:
tasks: [t, t′] bring(r, c1, dock4)
supported: [ts] loc(r1)=dock1
[ts] loc(r2)=dock2
[ts+10, ts+δ] docked(ship1)=dock3
[ts] top(pile-ship1)=c1
[ts] pos(c1)=pallet
assertions: [te] loc(r1)=dock1
[te] loc(r2)=dock2
constraints: ts < t < t′ < te, 20 ≤ δ ≤ 30, ts = 0
Chronicles will also be used to express partial plans that will be progres-
sively transformed by the planner into complete solution plans.
φ:
tasks: [t0, t1] leave(r1,dock1,w1)
[t1, t2] navigate(r1,w1,w2)
[t3, t4] enter(r1,dock2,w2)
[t′0, t′1] leave(r2,dock2,w2)
[t′1, t′2] navigate(r2,w2,w1)
[t′3, t′4] enter(r2,dock1,w1)
supported: ST
assertions: T
constraints: t′1 < t3, t1 < t′3, ts < t0, ts < t′0, t4 < te, t′4 < te
adjacent(dock1,w1), adjacent(dock2,w2)
connected(w1,w2)
This chronicle says that r1 leaves dock1 before r2 enters dock1 (t1 < t′3);
similarly, r2 leaves dock2 before r1 gets there (t′1 < t3). Each action navigate
starts when the corresponding leave finishes (at t1 and t′1). However, an enter
may have to wait until after the navigate finishes (t2 to t3) and the way is
free.
[Figure: r1 performs leave(dock1), navigate, enter(dock2) over t0, . . . , t4; r2 performs leave(dock2), navigate, enter(dock1) over t′0, . . . , t′4.]
Figure 4.4: Temporally qualified actions of two robots, r1 and r2. The
diagonal arrows represent the precedence constraints t′1 < t3 and t1 < t′3.
C are augmented with the assertions and constraints of m and those of its
primitives.
When a task is refined, the free variables in methods and primitives
are renamed and possibly instantiated. For example, enter is specified in
Example 4.11 with the free variables r, d, w, ts , te . In the first instance of
enter in the chronicle of Example 4.15, these variables are respectively bound
to r1, dock2, w2, t3, and t4. In the second instance of enter, they are
bound to r2, dock1, w1, t′3, and t′4. The general mechanism for every instance
of a primitive or a method is to rename the free variables in its template
to new names, and then to constrain and/or instantiate these renamed variables
as needed.
Furthermore, when refining a task and augmenting the assertions and
constraints of a chronicle, as specified by a method, we need to make sure
that (T, C) remains secure. Separation constraints will be added to C to
handle conflicting assertions. The consistency of the resulting constraints
Condition (i) says that all tasks in 0 have been refined recursively down
into primitives; this is similar to what we saw in Section 3.3.1. Condition (ii)
extends to temporal domains the notion of causal link seen in Section 2.5.
Condition (iii) is a requirement to make sure that the solution chronicle
TemPlan(φ, Σ)
Flaws ← set of flaws of φ
if Flaws = ∅ then return φ
arbitrarily select f ∈ Flaws (i)
Resolvers ← set of resolvers of f (ii)
if Resolvers = ∅ then return failure
nondeterministically choose ρ ∈ Resolvers (iii)
φ ← Transform(φ, ρ) (iv)
TemPlan(φ, Σ)
In Algorithm 4.1, step (i) is a heuristic choice of the order in which the
resolvers of a given flaw are searched. This choice affects the performance
but not the completeness of the algorithm. Step (iii) is a backtracking point
in a deterministic implementation of TemPlan: all resolvers for a flaw may
need to be tried to ensure completeness.
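To make the control flow concrete, here is a minimal Python sketch of this recursion, with the nondeterministic resolver choice realized as chronological backtracking. The flaw and resolver machinery is abstracted into caller-supplied functions; find_flaws, find_resolvers, and transform are hypothetical stand-ins for what Sections 4.3.2 to 4.3.4 describe, not the book's code.

def templan(phi, sigma, find_flaws, find_resolvers, transform):
    # Return a flaw-free chronicle refining phi, or None if none exists.
    flaws = find_flaws(phi, sigma)
    if not flaws:
        return phi                    # phi meets the solution conditions
    f = flaws[0]                      # flaw selection: heuristic, not backtracked
    for rho in find_resolvers(f, phi, sigma):  # resolver choice: backtracking point
        result = templan(transform(phi, rho), sigma,
                         find_flaws, find_resolvers, transform)
        if result is not None:
            return result
    return None                       # every resolver failed for flaw f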
The main technical issues in this temporal planning algorithm are the
following:
How to find the flaws in φ and their resolvers, and how to transform φ
with a resolver ρ, that is, the Transform subroutine in TemPlan. This
is discussed for the different types of flaws in Sections 4.3.2 to 4.3.4.
How to organize and explore the search space efficiently. This is dis-
cussed in Section 4.3.5.
How to check and maintain the consistency of the constraints in .
This is discussed in Section 4.4.
4.3.6 Illustration
Let us illustrate some of the steps of TemPlan on a detailed example.
[Figure: four docks d1, d2, d3, d4 with cranes k1, . . . , k4 and piles p1, . . . , p4, connected by waypoints w12, w13, w23, w34; containers c1 and c2 sit on piles p1 and p2; robots r1 and r2 are at d3 and d4.]
Figure 4.5: A planning problem involving two robots, r1 and r2, servicing
four docks, d1 to d4; the task is to bring the containers c1 from pile p1 to
p3 and c2 from p2 to p4.
Example 4.17. Consider the problem depicted in Figure 4.5 for the domain
of Example 4.11 where two robots, r1 and r2, are servicing four docks, d1 to
d4, connected with four roads, as illustrated. Starting from the initial state
shown in the figure, the task is to bring the containers c1 to pile p3 and c2
to p4. No constraint on the final location of the robots is stated. Hence, the
initial chronicle 0 has no unsupported assertion (see Exercise 4.4).
At the first recursion of TemPlan, there are two flaws in the current chronicle φ:
the nonrefined tasks bring(r, c1, p3) and bring(r′, c2, p4). Suppose the method
m-bring is used to refine the first task into move, uncover, load, move, and
unload. At this point, c, p, p′, d, d′, k, k′ will be instantiated, respectively, to
c1, p3, p1, d3, d1, k1, k3; r is constrained to be in {r1, r2} and the time
points are constrained as depicted in Figure 4.3.
At the following recursion, there are six nonrefined tasks in φ. Assume
m-bring is similarly used to refine bring(r′, c2, p4). Now the resulting chron-
icle contains ten nonrefined tasks (two uncovers, loads and unloads, and four
moves) as well as conflicting assertions related to the loc(r) and loc(r′) as-
sertions in the four load and unload tasks. Separation constraints are either
r ≠ r′ or precedence constraints such that the tasks are run sequentially.
If the former separation is chosen, a final solution plan would be, for
example, to have r1 navigate to d2 while r2 navigates to d1. At the same
time, k1 uncovers c1 while k2 uncovers c2. Two synchronizations then take
place: before load(k2, r1, c2, p2) and, concurrently, before load(k1, r2, c1,
p1) (as in Figure 4.3). These two concurrent actions are then followed by
move(r1,d4) concurrently with move(r2,d3), and finally with the two unload
actions. The details of the remaining steps for reaching a solution are covered
in Exercise 4.5.
If we assume more realistically that navigation between waypoints is
constrained by the traversal of docks, and that no dock can contain more
than one robot at a time, then additional synchronizations will be required
for the motion of the two robots (see Exercise 4.6).
Example 4.18. The designer of the domain in Example 4.24 may consider
that the primitives unload, load, stack, and unstack are free. These actions
can be performed whenever their specified conditions are met; they can be
inserted in a plan when their assertions are needed to support nonsupported
assertions. However, the primitives leave and enter can be specified as being
task-dependent; they should necessarily appear as the result of a decom-
position of a move task. In other words, the designer does not foresee any
reason to perform an action such as leave or enter except within tasks that
require leaving or entering a place.
inconsistencies and reduce the search space at a low cost. They may be used
jointly with complete algorithms, such as forward-checking at regular stages
of the search. Such a complete consistency check has to be performed on the
free variables remaining in the final plan. Other trade-offs, such as choosing
flaws that lead to instantiating object variables, are also relevant for reducing
the complexity of maintaining variable-binding constraints.
aij ≤ tj − ti ≤ bij, denoted rij = [aij, bij], where aij and bij are integers.
Note that rij entails rji = [−bij, −aij]. To represent unary constraints (i.e.,
constraints on one variable rather than two), let us introduce an additional
temporal variable t0 with a fixed value t0 = 0. Then r0j = [a, b] represents
the constraint a ≤ tj ≤ b.
A solution to an STN (V, E) gives an integer value to each variable in V.
The STN is consistent if it has a solution that meets all the constraints in
E. It is minimal if every value in each interval rij belongs to a solution.
TemPlan proceeds by transforming a chronicle φ = (A, ST, T, C) so as
to meet the conditions of a solution plan. These transformations add to C
constraints of methods for refining tasks, constraints for supporting asser-
tions, and separation constraints for conflicting assertions. Each transfor-
mation should keep C consistent. The set of temporal constraints in C is an
STN (V, E), which evolves by the addition of new variables and constraints
while staying consistent. TemPlan requires checking incrementally that an
STN remains consistent when more variables and constraints are added to it.
This is more easily done when the network is also maintained minimal, as
explained next.
Two operations are essential for checking the consistency of E:
composition: rik • rkj = [aik + akj, bik + bkj], which corresponds to the
transitive sum of the two constraints from i to j through k:
aik ≤ tk − ti ≤ bik and akj ≤ tj − tk ≤ bkj;
intersection: rij ∩ r′ij = [max{aij, a′ij}, min{bij, b′ij}], which is the
conjunction of two constraints on (ti, tj): aij ≤ tj − ti ≤ bij and
a′ij ≤ tj − ti ≤ b′ij.
Three constraints rik, rkj, and rij are consistent when rij ∩ (rik • rkj) ≠ ∅.
[Figure 4.6: a triangle network over t1, t2, t3 with r12 = [1, 2], r23 = [3, 4], and r13 = [2, 3].]
Example 4.19. Consider the network in Figure 4.6, where vertices are time
points and edges are labelled with temporal constraints: r12 = [1, 2], r23 =
[3, 4], and r13 = [2, 3]. By transitivity, r12 and r23 entail r′13 = r12 • r23 = [4, 6].
But r′13 is not compatible with r13: the upper bound of r13 is 3, smaller
than the lower bound of r′13, which is 4. That is, r13 ∩ r′13 = ∅. There is
no pair of values for t1, t3 that can satisfy both r13 and r′13: this network is
inconsistent.
The path-consistency algorithm PC (Algorithm 4.2) tests all triples of
variables in V with a transitive update operation: rij ← rij ∩ (rik • rkj).
If a pair (ti, tj) is not constrained, then we take rij = (−∞, +∞); in that
sense, an STN corresponds implicitly to a complete graph.
PC(V, E)
for k = 1, . . . , n do
for each pair i, j such that 1 ≤ i < j ≤ n, i ≠ k, and j ≠ k do
rij ← rij ∩ [rik • rkj]
if rij = ∅ then return inconsistent
[Figure 4.7: a consistent STN over time points t1, . . . , t5; its edges are labelled with the interval constraints [1, 2], [3, 4], [1, 2], [6, 7], and [4, 5].]
Example 4.20. Let us give the network in Figure 4.7 as input to PC (Al-
gorithm 4.2). The first iteration of PC for k = 1 with 2 ≤ i < j ≤ 5
does not change the constraints r23, r24, r34, r35; it updates r25 as fol-
lows: r25 ← r25 ∩ [r21 • r15] = (−∞, +∞) ∩ ([−2, −1] • [6, 7]) = [4, 6]. The
remaining iterations confirm that this network is consistent and minimal
(see Exercise 4.9).
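As an illustration, here is a small runnable Python sketch of PC over interval constraints, using the composition and intersection operations defined above; the dictionary encoding of networks is our own assumption, not the book's. It is applied to the inconsistent triangle of Example 4.19.

import itertools

INF = float('inf')

def compose(r1, r2):
    # transitive sum r_ik . r_kj
    return (r1[0] + r2[0], r1[1] + r2[1])

def intersect(r1, r2):
    # conjunction of two constraints on the same pair of time points
    return (max(r1[0], r2[0]), min(r1[1], r2[1]))

def pc(n, constraints):
    # constraints: {(i, j): (a, b)} with 1 <= i < j <= n; unconstrained
    # pairs are implicitly (-inf, +inf), as in the complete-graph view.
    r = [[(-INF, INF)] * (n + 1) for _ in range(n + 1)]
    for (i, j), (a, b) in constraints.items():
        r[i][j], r[j][i] = (a, b), (-b, -a)
    for k in range(1, n + 1):
        for i, j in itertools.combinations(range(1, n + 1), 2):
            if k in (i, j):
                continue
            r[i][j] = intersect(r[i][j], compose(r[i][k], r[k][j]))
            if r[i][j][0] > r[i][j][1]:      # empty interval
                return None                   # inconsistent network
            r[j][i] = (-r[i][j][1], -r[i][j][0])
    return r

# The triangle of Example 4.19 is detected as inconsistent:
print(pc(3, {(1, 2): (1, 2), (2, 3): (3, 4), (1, 3): (2, 3)}))   # -> None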
[Figure: two networks (a) and (b), each containing a constraint [5, 10] from t3 to t4 for an uncover task, illustrating the need for contingent constraints.]
The issues raised in the previous example are addressed through the
notion of Simple Temporal constraint Networks with Uncertainty (STNU).
An STNU is like an STN except that its time points and constraints are
partitioned into controllable ones and contingent ones.
The intuition is that elements in V denote the ending time points of ac-
tions, while contingent constraints in E model the positive nonnull durations
of actions, predicted with uncertainty. If a contingent constraint requires
te − ts ∈ [l, u], then the actual duration te − ts can be viewed as a random
variable whose value will be observed to be within [l, u] once the corresponding
action terminates. The actor controls ts: it assigns a value to it. However, it only
observes te, knowing in advance that it will be within the bounds set for
the contingent constraint on te − ts. A meaningful STNU cannot have a
contingent variable te that is the end point of two contingent constraints.
The controllability issue is to make sure (at planning time) that there
exist values for the controllable variables such that, for any observed value of
the contingent variables, the constraints are met. One can view controllable
variables as being existentially quantified, while contingent ones are univer-
sally quantified. However, the actor does not need to commit to values for
all its controllable variables before starting to act. It can choose a value for
a controllable variable only when needed at acting time. It can make this
choice as a function of the observed values of past contingent variables.
bring from t1 to t, then move from t′ to t2 (Figure 4.9). The total duration
[t1, t2] remains in [30, 50].
A dynamic execution strategy for this STNU can be the following: assign
t1; observe t; assign t′ at any moment after t in [0, 5]; then assign t3 10 units
after t′. It is easy to check that, whatever the durations of the three tasks
are within the bounds of the contingent constraints, the constraint [−5, +5]
on their end points t2 and t4 will be met.
[Figure: three time points ts, te, and t, with a contingent constraint [a, b] from ts to te, a constraint [u, v] from t to te, and a constraint [p, q] from ts to t.]
Figure 4.10: Basic constraints for dynamic controllability.
denoted ⟨te, b − v⟩. Specific propagation rules need to be devised for handling
these wait constraints jointly with the normal ones in a network.
These propagation rules are given in Table 4.1. A row in this table is
similar to the propagated constraint (rik • rkj) from i to j through k that we
used in PC. The left column gives the conditions under which a propagation
rule applies, and the right column states the constraint to be added to
the network according to that rule. Double arrows represent contingent
constraints, and angle brackets are wait constraints. The first and second
rules implement, respectively, cases (ii) and (iii). The third rule adds
a lower-bound constraint to a wait, which follows directly from the preceding
argument. The last two rules correspond to transitive propagations of a
wait.
It can be shown that a modified path consistency algorithm relying on
these rules is correct: a network is dynamically controllable if and only
if it is accepted by the algorithm. Furthermore, the reduced controllable
constraints obtained in the final network give a dynamic execution strategy.
The transposition of the wait constraints as a distance graph allows the
incremental testing of dynamic controllability with an algorithm in O(n³)
inspired by Bellman-Ford.
as input the appropriate initial chronicle with the current state and the pre-
dicted exogenous events. It returns a chronicle that meets the conditions
of Definition 4.16.
Actions in the solution plan are primitives for TemPlan, for exam-
ple, leave, enter, stack, unstack, etc. (as in Example 4.11). However, these
primitives are compound tasks at the acting level, to be refined into com-
mands with appropriate refinement methods. This acting refinement goes
one level down in the representation hierarchy. We consider here primitive
refinements using the atemporal methods of Chapter 3.
Example 4.25. In Example 4.11, we defined several primitives such as leave
or unstack. Here are two methods to decompose them into commands:
m-leave(r, d, w, e)
task: leave(r, d, w)
pre: loc(r)=d, adjacent(d, w), exit(e, d, w)
body: until empty(e) wait(1)
goto(r, e)
The method m-leave waits until the exit e from dock d toward waypoint
w is empty, then it moves the robot to that exit. The method m-unstack
locates the grasping position for container c on top of a pile p, moves the
crane to that position, grasps it, ensures the grasp (e.g., closes latches) to
guarantee a firm grasp, raises the container slowly above the pile, then moves
away to the neutral position of that crane.
m-unstack(k, c, p)
task: unstack(k, c, p)
pre: pos(c)=p, top(p)=c, grip(k)=empty
attached(k, d), attached(p, d)
body: locate-grasp-position(k, c, p)
move-to-grasp-position(k, c, p)
grasp(k, c, p)
until firm-grasp(k, c, p) ensure-grasp(k, c, p)
lift-vertically(k, c, p)
move-to-neutral-position(k, c, p)
Dispatch(V, V′, E, E′)   // V′ and E′ are the contingent time points and constraints
initialize the network
while there are elements in V that have not occurred, do
update now
update contingent points in V′ that have been observed
enabled ← set of enabled time points
for every t ∈ enabled such that now = ut, trigger t
arbitrarily choose other points in enabled, and trigger them (i)
propagate in the network the values of triggered points
The propagation step is the most costly one in Dispatch: its complexity
is in O(n³), where n is the number of remaining future points in the net-
work. Ideally, this propagation should be fast enough to allow iterations
and updates of now that are consistent with the temporal granularity of the
plan. As discussed earlier about the motivation for atemporal refinement,
this complexity is lower when temporal refinement does not break down
primitives at the finer command level.
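The following Python sketch illustrates the trigger-then-propagate cycle of Dispatch under strong simplifying assumptions: a purely controllable, already minimal network, no contingent points or observations, and an earliest-first triggering strategy. All names and the encoding are illustrative, not the book's.

def dispatch(points, constraints):
    # constraints: list of (i, j, a, b) meaning a <= t_j - t_i <= b;
    # every constraint with a >= 0 is read as a precedence "j after i".
    time_of = {}                                      # triggered execution times
    window = {p: (0.0, float('inf')) for p in points}
    while len(time_of) < len(points):
        enabled = [p for p in points if p not in time_of and all(
            i in time_of for (i, j, a, b) in constraints if j == p and a >= 0)]
        p = min(enabled, key=lambda q: window[q][0])  # earliest-first strategy
        now = max(window[p][0],
                  max((time_of[i] + a for (i, j, a, b) in constraints
                       if j == p and i in time_of), default=0.0))
        time_of[p] = now                              # trigger p at time `now`
        for (i, j, a, b) in constraints:              # propagate the new value
            if i == p and j not in time_of:
                lo, hi = window[j]
                window[j] = (max(lo, now + a), min(hi, now + b))
    return time_of

# t1 first, then t2 within [3,4] after t1, then t3 within [1,2] after t2:
print(dispatch(['t1', 't2', 't3'],
               [('t1', 't2', 3, 4), ('t2', 't3', 1, 2)]))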
Example 4.28. Let us extend Example 4.26 by requiring robot r1 to bring
a container c1 in dock d2 to some destination. TemPlan synthesizes a plan
π, part of which is shown in Figure 4.11. To keep the figure readable, the
values of the constraints and parameters are omitted; the end point of an
action starting at ti is implicitly named t′i. Note that some of the object
variables are instantiated, but some are not (e.g., c′); temporal variables in
π are not instantiated.
The initial step in Dispatch triggers t1. When t′1 is observed, t2 is enabled
and triggered, which makes t3 and t4 enabled. t3 will be triggered enough
in advance to free dock d2, allowing r1 to get in (at t5). Similarly for the
subtask of uncovering container c, which is triggered at t4. When t′2 and t′3
are observed, t5 becomes enabled and triggered. t7 will become enabled after
observing t′5 and t′6. The rest of the plan follows linearly.
The Dispatch algorithm is easily integrated into RAE. Triggering the
starting point of an action a means putting a new task in the input stream
[Figure 4.11: part of the plan π, showing leave(r1,d1), navigate(r1), enter(r1,d2), leave(r2,d2), unstack(k,c), stack(k,c), unstack(k,c1), putdown(k,c1,r1), and leave(r1,d2), over the time points t1, . . . , t9.]
of RAE (Algorithm 3.1), that is, starting a new stack for progressing on
the refinement of a. The upper bound on the duration of a is taken as a
deadline for terminating this stack. Progress, and eventually Retrace, will
pursue refinements in this stack until the action succeeds or fails, or until
the deadline is reached, which is another failure condition. The proximity
of the deadline can be used as a heuristic for prioritizing the most urgent
tasks in RAE.
Failures are addressed as plan repairs. For a deadline failure, the repair
can take two forms:
stopping the delayed action and seeking alternate ways for achieving
the plan from the current state, as for other types of failure, or
finishing the action despite the delay and repairing the remaining part
of the plan.
The latter option is preferable when the violated contingent constraint can
be resolved at the STNU propagation level. For example, if navigate(r1) in
Figure 4.11 takes slightly longer than the maximum duration specified, the
entire plan will still be feasible with a delay, which is possibly acceptable.
However, if this navigation is taking longer than expected because robot r1
broke down, a better option is to seek another robot to perform the task.
These considerations must be integrated in the actor's monitoring function
(see Section 7.2).
Plan repair in case of a failure has to be performed with respect to the
current state, and to remaining predicted events and tasks whose achieve-
ment is still in the future. The repair can be local or global. In the latter
case, a full replanning is performed. A local repair can benefit from the plan-
space planning approach of TemPlan as follows. The failed action is removed
from the remaining chronicle together with all the assertions coming from
taining the past timelines for every state variable as well as the predicted
future timelines for exogenous variables about which the actor has predic-
tions. To keep things simple, let us assume that the qualifications in tempo-
ral methods with respect to the past do not extend beyond the moment at
which each state variable acquired its current value (this assumption is akin
to the Markovian property, which is introduced in Section 6.1). With this
assumption, our interest in the past is satisfied by keeping track, for each
state variable x, of a pair (t, x=v), meaning that the value of x has been v
since time t. In other words, the assertion [t, now] x=v holds in the chronicle.
The chronicle also contains predictions about exogenous future events, as seen
in the initial chronicle in temporal planning. In this sense, it is a particular
chronicle which maintains the current present and the exogenous future.
TemRAE works in a similar way to RAE. It reads its input stream for a
task to perform. For each such task τ, it finds M(τ), the set of methods
whose task is τ. It chooses a method m in M(τ) that is applicable for the
current chronicle; it refines τ according to the subtasks specified in m. There are,
however, differences in how methods are evaluated as applicable and in how
refinements proceed.
Retry is possible as long as upper bounds on a task and its refinements have
not been violated.
The limitations of TemRAE are due to its lack of lookahead. As un-
derlined in Definition 4.29, temporal refinement acting cannot cope with
the requirement of a future change that is not brought about by the subtasks of
a method and their refinements. Furthermore, the STNU corresponding to
the entire refinement tree of a task is not guaranteed to be dynamically
controllable. This STNU is discovered progressively as tasks and actions are
achieved and effects and events are observed. Precedence constraints with
respect to contingent events or absolute time (e.g., bringing a container be-
fore the scheduled departure of a ship) may lead to deadline failures. In
this approach, it is the responsibility of the designer to specify methods
that refine into dynamically controllable networks. Techniques presented in
Section 4.4.3 can be used to verify this requirement.
in TGP by Smith and Weld [547] or in TP4 by Haslum and Geffner [262].
Extensions of this model with conditions that prevail over the duration of
the action, (as in the model of Sandewall and Ronnquist [524]) have been
proposed, for example, in the SAPA planner of Do and Kambhampati [154],
or in the domain description language specifications PDDL2.1 of Fox and
Long [204]. Several planners rely on the latter representation, among which
HS by Haslum and Geffner [262], TPSYS by Garrido [214], or CRIKEY by
Coles et al. [127].
A few planners using the durative action model adopt the plan-space
approach, notably Zeno of Penberthy and Weld [471] which relies on linear
programming techniques, or VHPOP of Younes and Simmons [623] which
uses STN algorithms. Some planners pursue the HTN approach, as the
earlier planners mentioned above, or more recently SHOP2 by Nau et al.
[448] or Siadex by Castillo et al. [109].
Most durative-action temporal planners rely on state-based search tech-
niques. A few are based on temporal logic approaches. Among these are
TALplanner by Doherty and Kvarnstrom [156, 366], a model-checking based
planner by Edelkamp [169], and a SAT-based planner by Huang et al. [289].
Significant effort has been invested in generalizing classical state-space plan-
ning heuristics to the durative action case. The action compression tech-
nique, which basically abstracts the durative transition to an instantaneous
one for the purpose of computing a heuristic, is quite popular, for example in
the work of Gerevini and Schubert [224] or Eyerich et al. [185]. Various tem-
poral extensions of the relaxed planning graph technique (Section 2.3.2), as
in Metric RPG of Hoffmann [277], have been proposed, for example, by Haslum
and Geffner [262], Long and Fox [395], Coles et al. [127], and Haslum [258].
Sampling over a duration interval with action compression has also been
investigated by Kiesel and Ruml [330].
A few durative-action planners can handle hybrid discrete-continuous
change. Some planners address continuous effects through repeated dis-
cretization, for example, UPMurphi of Della Penna et al. [473]. Linear program-
ming techniques, when the continuous dynamics is assumed to be linear,
have been used since Zeno [471] by several planners. A recent and quite
elaborate example is COLIN of Coles et al. [128]. The Kongming planner
of Li and Williams [381] relies on domain-specific dynamic models.
The durative action model led to the design of quite performant planners.
But it usually has a weak notion of concurrency that basically requires
independence between concurrent actions. Interfering effects, as discussed
in Example 4.15, can be addressed by a few of the above-mentioned planners,
notably COLIN [128]. Alternatively, interfering effects can be
sentation of Williams and Abramson [606]. TPN extends STN with symbolic
constraints and decision nodes. Planning with a TPN is finding a path in
the explicit network that meets the constraints. Conrad et al. [129] intro-
duce choices in the acting component of RMPL. TPNs with error recovery,
temporal flexibility, and conditional context dependent execution are con-
sidered in Effinger et al. [175]. There, tasks have random variable durations
with probability distributions. A particle-sampling dynamic execution al-
gorithm finds an execution guaranteed to succeed with a given probability.
Santana and Williams [526] studied probabilistic TPNs with the notions of
weak and strong consistency, and proposed techniques to check these prop-
erties. TPNUs of Levine and Williams [380] add the notion of uncertainty
for contingent decisions taken by the environment and other agents. The
acting system adapts the execution to observations and predictions based
on the plan. It has been illustrated with a service robot which observes and
assists a human.
4.7 Exercises
4.1. Specify the primitives stack, unstack and navigate of Example 4.11. For
the latter, assume that navigation between connected waypoints is uncon-
strained.
4.2. For the domain in Example 4.12, define methods for the tasks load and
unload. For the task bring, define additional methods to cover the following
cases:
the destination pile is at the same dock as the source pile,
the robot r is already loaded with container c,
container c is already in its destination pile.
Similarly, define other methods for the task move to cover the cases where
the robot starts from a waypoint or when it is already at destination, and
another method for the task uncover when the container c is at the top of
pile p.
4.3. Augment the domain of Example 4.12 by considering that a pile p can
be attached to a ship and that a crane k on a dock d can unstack containers
from a pile p only when the corresponding ship is docked at d.
4.4. Specify the initial chronicle 0 for the problem of Example 4.17 and
Figure 4.5.
4.5. In Example 4.17, develop the steps of TemPlan until reaching a solution
to the planning problem.
4.6. For the domain in Example 4.12, redefine navigate as a task which
refines into the traversal of roads and the crossing to docks. The navigation
between two roads adjacent to a dock d requires crossing d which should
not be occupied during the crossing interval. For example, in Figure 4.5 the
navigation from d4 to d1 requires the traversal of d3 which should be empty
when the robot gets there. Analyze the conflicting assertions that result
from this modification in the first few steps of TemPlan for Example 4.17
and find resolvers for the corresponding flaws.
4.9. Run algorithm PC on the networks in Figure 4.7. Show that it adds
the constraints r13 = [1, 3], r24 = [1, 2], and r45 = [2, 3].
4.11. Run algorithm PC on the networks in Figure 4.8 and Figure 4.9 and
compute all the implicit constraints entailed from those in the networks;
show that both networks are minimal.
4.12. Prove that the minimal network in Figure 4.10 is such that
[b − v, a − u] ⊆ [p, q].
4.13. Consider the minimal network in Figure 4.10 for the case where u ≥ 0
and [b − v, a − u] = ∅. Prove that this network is not dynamically control-
lable.
4.15. For all the primitives in Example 4.11, define atemporal acting refine-
ment methods similar to the two given in Example 4.25.
4.16. Run algorithm Dispatch for the solution plan found in Exercise 4.5
assuming that robot r1 is much faster than r2.
Deliberation with
Nondeterministic Models
act(z1, z2, . . . , zk)
pre: p1, . . . , pm
eff1: e1,1, . . .
. . .
effn: en,1, . . .
Applicable(s) = {a ∈ A | γ(s, a) ≠ ∅}
[Figure 5.1: state-transition graph over the positions on_ship, at_harbor, parking1, parking2, transit1, transit2, transit3, gate1, and gate2, with the actions unload, park, move, deliver, and back.]
gate2. For simplicity, we label each state in Figure 5.1 only with the value
of the variable pos(item).
In this example, we have just five actions. Two of them are deterministic,
unload and back, and three are nondeterministic, park, move, and deliver.
Action unload unloads the item from the ship to the harbor, its preconditions
are pos(item) = on ship, and its effects pos(item) at harbor. Action back
moves the item back from any position in the harbor to the position pos(item)
= at harbor. To keep the figure simple, in Figure 5.1 we show only two
instances of actions back from the state where pos(item) = parking2 and
the state where pos(item) = gate1, but a back arrow should be drawn from
each state where the position is parking1, parking2, transit1, transit2, transit3,
gate1, and gate2.
A possible description of action park is the following, where eff1, eff2,
and eff3 are the three possible effects of the action:
park
pre: pos(item) = at harbor
eff1: pos(item) ← parking2
eff2: pos(item) ← parking1
eff3: pos(item) ← transit1
The actions park, move, and deliver are nondeterministic. In the case of
action park, we represent with nondeterminism the fact that the storage areas
parking1 and parking2 may be unavailable for storing items, for example,
because they may be closed or full. Whether an area is available or not
cannot be predicted, because there are other actors parking and delivering
items, for example from different ships. However, we assume that it is always
possible either to park the item in one of the two parking areas or to move
it to transit area transit1. The item waits in transit1 until one of the two
parking areas is available, and it can then be stored by the action move. Also
in the case of move, we use nondeterminism to represent the fact that we
do not know a priori which one of the two areas may become available.
From the two parking areas, it is possible to deliver the container and load
it onto trucks; from the transit area transit1, it is necessary to move the
container into either one of the two parking areas. The deliver action moves
containers from parking1 to one of the two gates, where trucks are loaded,
or to a transit area from which it is necessary to move the container again
to load trucks at one of the two gates. The same action from parking2 may
lead to gate1 or to another transit area.
PerformPolicy(π)
s ← observe the current state
while s ∈ Dom(π) do
perform action π(s)
s ← observe the current state
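To make this concrete, here is a Python sketch that encodes the harbor domain of Example 5.2 as a transition dictionary and runs PerformPolicy on it, with observation simulated by sampling. The exact successor sets of move and deliver are plausible assumptions consistent with the text, not taken verbatim from the book.

import random

gamma = {
    ('on_ship',   'unload'):  {'at_harbor'},
    ('at_harbor', 'park'):    {'parking1', 'parking2', 'transit1'},
    ('transit1',  'move'):    {'parking1', 'parking2'},
    ('transit2',  'move'):    {'gate1', 'gate2'},      # assumed successor set
    ('transit3',  'move'):    {'gate1'},                # assumed successor set
    ('parking1',  'deliver'): {'gate2', 'transit2'},    # assumed successor set
    ('parking2',  'deliver'): {'gate1', 'transit3'},    # assumed successor set
}
for s in ['parking1', 'parking2', 'transit1', 'transit2', 'transit3',
          'gate1', 'gate2']:
    gamma[(s, 'back')] = {'at_harbor'}                  # back always works

def perform_policy(policy, s):
    # PerformPolicy: act while the policy is defined; sampling plays
    # the role of observing the nondeterministic outcome.
    trace = [s]
    while s in policy:
        s = random.choice(sorted(gamma[(s, policy[s])]))
        trace.append(s)
    return trace

policy = {'on_ship': 'unload', 'at_harbor': 'park', 'transit1': 'move',
          'parking1': 'deliver', 'parking2': 'deliver',
          'transit2': 'move', 'transit3': 'move'}
print(perform_policy(policy, 'on_ship'))    # ends in gate1 or gate2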
Example 5.4. Consider the domain of Example 5.2 shown in Figure 5.1.
The following are policies for this planning domain:
and the one of decision trees are incomparable. On one hand, policies allow
for infinite iterations of the application of actions, which are not allowed by
finite decision trees. On the other hand, decision trees allow for performing
different actions in the same state depending on at which point we are in
the tree, whereas policies always perform the same action in a state. We call
our policies memoryless policies. A policy with memory is a mapping from
a history of states to an action. Policies with memory allow for performing
different actions in the same state, depending on the states visited so far.
γ̂(s, π) denotes the transitive closure of γ(s, π(s)), that is, the
set that includes s and all its successor states reachable by π.
leaves(s, π) = {s′ | s′ ∈ γ̂(s, π) and s′ ∉ Dom(π)}
Notice that leaves(s, π) can be empty, that is, there may be no leaves. This
is the case for policies that cycle on the same set of states. If π is the empty
plan, then leaves(s, π) = {s}. We define the reachability graph that connects
the reachable states from state s through a policy π:
[Figure 5.2: the reachability graph of π1 from the state where pos(item)=on_ship.]
We call γ̂(s, π) the set of states reachable from a state s by π.
Figures 5.2, 5.3, and 5.4 show the reachability graphs of π1, π2, and π3 from
the state where pos(item)=on ship. Notice also that all states are reachable
from the state where pos(item)=on ship.
5. In this case, the value on ship of the state variable pos(item) identifies a single state.
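Under the same dictionary encoding of γ and π as in the sketch above, the transitive closure and the leaves of a policy can be computed as follows (a sketch with assumed helper names):

def reachable(s, pi, gamma):
    # gamma-hat(s, pi): s and all states reachable from s by following pi
    seen, frontier = {s}, [s]
    while frontier:
        u = frontier.pop()
        if u in pi:                     # states outside Dom(pi) are not expanded
            for v in gamma[(u, pi[u])]:
                if v not in seen:
                    seen.add(v)
                    frontier.append(v)
    return seen

def leaves(s, pi, gamma):
    # reachable states on which the policy is undefined
    return {u for u in reachable(s, pi, gamma) if u not in pi}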
[Figure 5.3: the reachability graph of π2 from the state where pos(item)=on_ship.]
Notice that we have a single initial state s0 rather than a set of initial
states S0 ⊆ S. A set of initial states represents partially specified initial
conditions, or in other words uncertainty about the initial state. However,
restricting to a single initial state is not a limitation, because a domain with
a set of initial states S0 is equivalent to a domain where we have a single
initial state s0 ∉ S and an additional action a0 ∉ A such that γ(s0, a0) = S0.
We can now define different kinds of solutions to a planning problem.
[Figure 5.4: the reachability graph of π3 from the state where pos(item)=on_ship.]
Solutions are policies that may lead to a goal. They can achieve the
goal in different ways, with different levels of guarantee, and with differ-
ent strengths. The requirement we impose on a policy to be a solution is
that at least one state of its leaves is a goal state. We are interested in safe
solutions.
Safe solutions are policies in which the goal is reachable from the initial
state. Notice that, in general, they are not policies in which the goal is
reachable from any state of the domain of the policy (Dom()), because we
may have a state in Dom() that is not the initial state and from which we
do not reach the goal.
that sooner or later execution will get out of possibly infinite loops, they
are guaranteed to achieve the goal. They guarantee that there is always a
possibility to terminate the loop. However, for some applications, this may
be not enough.
eliminate the states in the policy that lead to the loop inside the set of goal
states.
In the following, we specify the relations among different kinds of solu-
tions.
Notice that our terminology here and in Chapter 6 is identical, but
different from the usual terminology in the literature, in which our solutions
and safe solutions are called weak solutions and strong cyclic solutions, re-
spectively. In the literature, every strong solution is also a weak solution,
which can be confusing. In most of the literature on probabilistic planning,
our safe and unsafe solutions are called proper and improper, and there is
no notion that makes a distinction between cyclic safe solutions and acyclic
safe solutions, despite the different strength they provide. We will also not
differentiate cyclic and acyclic safe solutions in probabilistic planning in
Chapter 6, despite their differences (see discussion in Section 6.7.5). Ta-
ble 5.1 summarizes the corresponding terminology used in planning with
nondeterminism and in the probabilistic planning literature.
Find-Solution(Σ, s0, Sg)
π ← ∅; s ← s0; Visited ← {s0}
loop
if s ∈ Sg then return π
A′ ← Applicable(s)
if A′ = ∅ then return failure
nondeterministically choose a ∈ A′
nondeterministically choose s′ ∈ γ(s, a) (i)
if s′ ∈ Visited then return failure
π(s) ← a; Visited ← Visited ∪ {s′}; s ← s′
justification for ignoring the precise order in which the algorithm tries actions a
among all the actions applicable in state s, and alternative states s′ among
the states resulting from performing a in s.
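A deterministic realization of Find-Solution turns the two nondeterministic choice points into exhaustive backtracking, as in the following sketch; the encoding of γ and Applicable as functions is our own assumption.

def find_solution(gamma, applicable, s0, goals):
    # Return a policy mapping states to actions along one path to a goal,
    # or None if no such path exists.
    def search(s, visited, pi):
        if s in goals:
            return pi
        for a in applicable(s):                  # choice point over actions
            for s1 in sorted(gamma(s, a)):       # choice point over outcomes
                if s1 in visited:
                    continue                     # avoid revisiting states
                result = search(s1, visited | {s1}, {**pi, s: a})
                if result is not None:
                    return result
        return None
    return search(s0, {s0}, {})

# Tiny demo: action 'a' from state 's' may reach the goal 'g' or stay in 's'.
demo = {('s', 'a'): {'s', 'g'}}
print(find_solution(lambda s, a: demo[(s, a)],
                    lambda s: ['a'] if s == 's' else [],
                    's', {'g'}))                 # -> {'s': 'a'}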
Find-Safe-Solution(Σ, s0, Sg)
π ← ∅; Frontier ← {s0}
for every s ∈ Frontier \ Sg do
Frontier ← Frontier \ {s}
if Applicable(s) = ∅ then return failure
nondeterministically choose a ∈ Applicable(s)
π(s) ← a
Frontier ← Frontier ∪ (γ(s, a) \ Dom(π))
if has-unsafe-loops(π, a, Frontier) then return failure
return π
Algorithm 5.3 is a simple algorithm that finds safe solutions. The al-
gorithm performs a forward search and terminates when all the states in
Frontier are goal states. Find-Safe-Solution fails if the last action introduces
a bad loop, that is, a state from which no state in Frontier is reachable.
The routine has-unsafe-loops checks whether a bad loop is introduced. A
bad loop is introduced when the set of states resulting from performing
action a that are not in the domain of π will never lead to the frontier:
Algorithm 5.4 is a simple algorithm that finds safe acyclic solutions. The
algorithm is the same as Find-Safe-Solution except for the failure condition:
it fails if the last action introduces a loop, that is, a state from which the state
itself is reachable by performing the plan:
Find-Acyclic-Solution(Σ, s0, Sg)
π ← ∅; Frontier ← {s0}
for every s ∈ Frontier \ Sg do
Frontier ← Frontier \ {s}
if Applicable(s) = ∅ then return failure
nondeterministically choose a ∈ Applicable(s)
π(s) ← a
Frontier ← Frontier ∪ (γ(s, a) \ Dom(π))
if has-loops(π, a, Frontier) then return failure
return π
Step 0: on ship
Step 1: at harbor
Step 2: parking2, parking1, transit1
Step 3: transit3, gate1, gate2, transit2
Step 4: gate1, gate2
Compute-worst-case-for-action(S, Sg, π, ancestors)
c′ ← 0
π′ ← ∅
// if S is nonempty, this loop will be executed at least once:
for every s ∈ S do
if s ∈ ancestors then
return (π′, ∞)
(π, c) ← Choose-best-action(s, Sg, π, ancestors ∪ {s})
π′ ← π′ ∪ π
c′ ← max(c′, c)
if c′ = ∞ then
break
return (π′, c′)
among the ones that are applicable. We are interested in finding a solution
with the minimum accumulated cost, that is, the minimum of the costs of
each action that is selected in the search. Because the domain model is
nondeterministic and γ(s, a) results in different states, we want to minimize
the worst-case accumulated cost, that is, the maximum accumulated cost
over the possible states in γ(s, a). This is given by the following recursive
formula:
c(s) = 0, if s is a goal;
c(s) = min_{a ∈ Applicable(s)} ( cost(a) + max_{s′ ∈ γ(s,a)} c(s′) ), otherwise.
For this reason, the algorithm is said to perform a MinMax search. While
performing the search, the costs of actions that are used to expand the next
states are accumulated, and the algorithm checks whether the accumulated
cost becomes too high with respect to alternative selections of different ac-
tions. In this way, the accumulated cost is used to find an upper bound in
the forward iteration.
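The recursion itself can be transcribed directly, as in the following sketch; the ancestors set plays the same cycle-blocking role as in Compute-worst-case-for-action, while memoization and cost bounds are deliberately omitted. The functional encoding of the domain is our own assumption.

INF = float('inf')

def minimax_cost(s, goals, gamma, applicable, cost, ancestors=frozenset()):
    # Worst-case accumulated cost c(s) of reaching a goal from s.
    if s in goals:
        return 0
    if s in ancestors:
        return INF                 # a cycle would accumulate unbounded cost
    best = INF
    for a in applicable(s):
        worst = max(minimax_cost(s1, goals, gamma, applicable, cost,
                                 ancestors | frozenset({s}))
                    for s1 in gamma(s, a))
        best = min(best, cost(s, a) + worst)
    return best                    # INF when s is a dead end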
Find-Acyclic-Solution-by-MinMax (Algorithm 5.5) finds safe acyclic solu-
tions for nondeterministic planning problems in domains that may have
cycles. It returns a pair (π, c), where π is a safe acyclic solution that is
worst-case optimal, that is, the maximum cost of executing π is as low as
Choose-best-action(s, Sg, π, ancestors)
if s ∈ Sg then
return (π, 0)
else if Applicable(s) = ∅ then
return (π, ∞)
else do
c ← ∞
// this loop will always be executed at least once:
for every a ∈ Applicable(s) do
(π′, c′) ← Compute-worst-case-for-action(γ(s, a), Sg, π, ancestors)
if c > c′ + cost(s, a) then do
c ← c′ + cost(s, a)
π(s) ← a
c ← min(∞, c)
return (π, c)
[Figure: truth-table/BDD fragment encoding the goal states pos(item)=gate1 and pos(item)=gate2.]
In the rest of this section, we describe the algorithms for planning via
symbolic model checking, both as operations on sets of states and as the
corresponding symbolic transformations on formulas.
The four states in which the car is undamaged are represented by the single
variable assignment
x1 ← F
or analogously by the truth of the formula
P(Q) = ¬x1.
∃x ∃y (R(x, y, x′) ∧ Q(x)) [x′/x].
In this formula, the ∧ operation symbolically simulates the effect of the
application of any applicable action in A to any state in Q. The explicit
enumeration of all the possible states and all the possible applications of
actions would exponentially blow up, but symbolically we can compute all
of them in a single step.
Policies are relations between states and actions, and can therefore be
represented symbolically as propositional formulas in the variables x and y.
In the following, we write such a formula as π(x, y).
We are now ready to describe the planning algorithms based on symbolic
model checking. In the subsequent sections, we consider an extension of the
definition of planning problem where we allow for a set of initial states rather
than a single initial state.
Find-Safe-Solution-by-ModelChecking(Σ, s0, Sg)
univpol ← {(s, a) | s ∈ S and a ∈ Applicable(s)}
π ← SafePlan(univpol, Sg)
if s0 ∈ (Sg ∪ Dom(π)) then return π
else return failure
SafePlan(π0, Sg)
π ← ∅; π′ ← π0
while π ≠ π′ do
π ← π′
π′ ← π′ \ {(s, a) ∈ π′ | γ(s, a) ⊈ (Sg ∪ Dom(π′))} (i)
π′ ← PruneUnconnected(π′, Sg) (ii)
return RemoveNonProgress(π′, Sg) (iii)
PruneUnconnected(π, Sg)
Old ← failure; New ← ∅
while Old ≠ New do
Old ← New
New ← π ∩ preimgpol(Sg ∪ Dom(New))
return New
appear unfeasible in practice, because the set of all state-action pairs can
be very large. We should not forget, however, that very large sets of states
can be represented symbolically in a compact way. Indeed, the symbolic
representation of the universal policy is:
univpol(x, y) = ∃x′ R(x, y, x′),
state-action pair (s, a) for which π′ contains no path from s to the goal.
This second step is performed by the routine PruneUnconnected (see Algo-
rithm 5.8). PruneUnconnected repeatedly applies the intersection between
the current policy π and the preimage policy, that is, preimgpol applied
to the domain of the current policy and the goal states. The preimage pol-
icy, given a set of states Q ⊆ S, returns the policy that has at least one
outcome in the given set of states:
preimgpol(Q) = ∃x′ (R(x, y, x′) ∧ Q(x′)).
The routine PruneStates, which eliminates the state-action pairs whose
states are already in a given set Q (e.g., the domain of a policy), is computed
symbolically as follows:
PruneStates(π, Q) = π(x, y) ∧ ¬Q(x).
SafePlan thus returns the policy that has been obtained from the universal
policy by removing outgoing, unconnected, and nonprogressing actions. Find-
Safe-Solution-by-ModelChecking finally tests whether the set of states in the
returned policy, union the goal states, contains all the initial states. If
this is the case, π is a safe solution; otherwise no safe solution exists.
Example 5.18. Let us consider the planning problem on the domain de-
scribed in Example 5.2, initial state s0 where pos(item)=on ship, and goal
states Sg = {pos(item)=gate2}. The elimination phase of the algorithm
does not remove any state-action pair from the universal policy. Indeed, the goal
RemoveNonProgress(π, Sg)
Old ← failure; New ← ∅
while Old ≠ New do
pre ← π ∩ preimgpol(Sg ∪ Dom(New))
Old ← New
New ← New ∪ PruneStates(pre, Sg ∪ Dom(New))
return New
state is reachable from any state in the domain, and therefore there are no
outgoing actions. As a consequence, function RemoveNonProgress receives
as input the universal policy and refines it, taking only those actions that
may lead to progress toward the goal. The sequence πi of policies built by
function RemoveNonProgress is as follows (in the following we indicate with
parking1 the state where pos(item)=parking1, etc.):
Step 0: π0 = ∅
Step 1: π1(parking1) = deliver; π1(transit2) = move
Step 2: π2(parking1) = deliver; π2(transit2) = move; π2(at harbor) = park;
π2(transit1) = move
Step 3: π3(parking1) = deliver; π3(transit2) = move; π3(at harbor) = park;
π3(transit1) = move; π3(parking2) = back; π3(transit3) = back;
π3(gate1) = back; π3(on ship) = unload
Step 4: π4 = π3
A remark is in order. Algorithm 5.7 can find either safe cyclic or safe
acyclic solutions. It can be modified such that it looks for a safe acyclic
solution, and only if there is no such solution does it search for a safe cyclic
solution (see Exercise 5.11).
Find-Acyclic-Solution-by-ModelChecking(Σ, S0, Sg)
π ← ∅; π′ ← failure
while (π′ ≠ π and S0 ⊈ (Sg ∪ Dom(π))) do
strongpre ← strongpreimgpol(Sg ∪ Dom(π))
π′ ← π
π ← π ∪ PruneStates(strongpre, Sg ∪ Dom(π))
if (S0 ⊆ (Sg ∪ Dom(π)))
then return π
else return failure
Example 5.19. Let us consider the planning problem on the domain de-
scribed in Example 5.2, initial set of states S0 = {on ship}, and goal
states Sg = {gate1, gate2}. The sequence πi of policies built by algorithm
Find-Acyclic-Solution-by-ModelChecking is as follows:
π0 = ∅
π1: π1(transit3) = move; π1(transit2) = move
π2: π2(transit3) = move; π2(transit2) = move;
π2(parking1) = deliver; π2(parking2) = deliver
π3: π3(transit3) = move; π3(transit2) = move;
π3(parking1) = deliver; π3(parking2) = deliver;
π3(transit1) = move
π4: π4(transit3) = move; π4(transit2) = move;
π4(parking1) = deliver; π4(parking2) = deliver;
π4(transit1) = move; π4(at harbor) = park
π5: π5(transit3) = move; π5(transit2) = move;
π5(parking1) = deliver; π5(parking2) = deliver;
π5(transit1) = move; π5(at harbor) = park;
π5(on ship) = unload
π6 = π5
Notice that at the fifth iteration, PruneStates discards all the state-action
pairs that move the container back (action back) from states for which a
solution is already known; for instance, (parking2, back), (gate1, back), and
so on are not added to π5.
path from the root to a leaf can visit nodes associated with a subset of all
the variables of the BDD. The reached leaf node is labeled with the resulting
truth value. If v is a BDD, its size |v| is the number of its nodes. If n is a
node, we will use var(n) to denote the variable indexing node n. BDDs are
a canonical representation of boolean formulas if:
there is a total order < over the set of variables used to label nodes,
such that for any node n and corresponding nonterminal child m, their
variables must be ordered, var(n) < var(m); and
the BDD contains no two distinct subgraphs that are isomorphic to each other.
The choice of variable ordering may have a dramatic impact on the size
of a BDD. For example, Figure 5.7 depicts two BDDs for the same
formula (a1 ∧ b1) ∨ (a2 ∧ b2) ∨ (a3 ∧ b3) obtained with different variable
orderings.
BDDs can be used to compute the results of applying the usual boolean
operators. Given a BDD that represents a formula, it is possible to transform
it to obtain the BDD representing the negation of the formula. Given two
BDDs representing two formulas, it is possible to combine them to obtain
the BDD representing the conjunction or the disjunction of the two formulas.
Substitution and quantification on boolean formulas can also be performed
as BDD transformations.
[Figure: two BDDs over the variables a1, a2, a3, b1, b2, b3 with terminal nodes True and False, of very different sizes.]
Figure 5.7: Two BDDs for the formula (a1 ∧ b1) ∨ (a2 ∧ b2) ∨ (a3 ∧ b3).
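The following self-contained Python sketch illustrates these ideas on the formula of Figure 5.7: it builds reduced, hash-consed BDDs with a generic apply operation and compares the sizes obtained under the two orderings. It is a toy implementation for illustration, not an efficient BDD package.

def mk(var, lo, hi, table):
    if lo == hi:                          # eliminate redundant tests
        return lo
    return table.setdefault((var, lo, hi), (var, lo, hi))   # hash-consing

def apply_op(op, u, v, order, table, memo):
    # Combine two BDDs with a boolean operator via Shannon expansion.
    if isinstance(u, bool) and isinstance(v, bool):
        return op(u, v)
    if (u, v) in memo:
        return memo[(u, v)]
    iu = order[u[0]] if isinstance(u, tuple) else len(order)
    iv = order[v[0]] if isinstance(v, tuple) else len(order)
    var = u[0] if iu <= iv else v[0]
    ulo, uhi = (u[1], u[2]) if iu <= iv else (u, u)
    vlo, vhi = (v[1], v[2]) if iv <= iu else (v, v)
    r = mk(var, apply_op(op, ulo, vlo, order, table, memo),
                apply_op(op, uhi, vhi, order, table, memo), table)
    memo[(u, v)] = r
    return r

def size(u, seen=None):
    # number of internal nodes of the BDD
    seen = set() if seen is None else seen
    if not isinstance(u, tuple) or u in seen:
        return 0
    seen.add(u)
    return 1 + size(u[1], seen) + size(u[2], seen)

def build(ordering):
    order = {x: i for i, x in enumerate(ordering)}
    table, memo_and, memo_or = {}, {}, {}
    var = lambda x: mk(x, False, True, table)
    AND = lambda u, v: apply_op(lambda p, q: p and q, u, v, order, table, memo_and)
    OR = lambda u, v: apply_op(lambda p, q: p or q, u, v, order, table, memo_or)
    f = False
    for i in '123':                       # (a1^b1) v (a2^b2) v (a3^b3)
        f = OR(f, AND(var('a' + i), var('b' + i)))
    return size(f)

print(build(['a1', 'b1', 'a2', 'b2', 'a3', 'b3']))   # interleaved: small BDD
print(build(['a1', 'a2', 'a3', 'b1', 'b2', 'b3']))   # separated: much larger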
safe solution. We therefore get rid of all the pairs (s′, a) of π that lead to the
dead-end state s. We implement this by making action a not applicable in
s′. In this way, at the next loop iteration, we will not have the possibility
of becoming stuck in the dead end.
Q ← leaves(s0, π) \ Sg
if Q = ∅ then do
π ← π \ {(s, a) | s ∉ γ̂(s0, π)}
return(π)
select s ∈ Q
p′ ← Forward-search(Σd, s, Sg)
if p′ ≠ fail then do
π′ ← Plan2policy(p′, s)
π ← π ∪ {(s, a) ∈ π′ | s ∉ Dom(π)}
else for every s′ and a such that s ∈ γ(s′, a) do
π ← π \ {(s′, a)}
make the actions in the determinization of a
not applicable in s′
state (the left vertex of the triangle) to the set of possible final states (the
right part of the triangle): the search space is depicted as the large triangle.
In planning and acting, we plan just for a few next steps, then we act and we
know exactly in which state the application of actions results. We repeat the
interleaving of planning and acting until we reach a goal state. The search
space is reduced to the sequence of small triangles depicted in Figure 5.8.
Notice that there is a difference between the search space depicted in
Figure 5.8 and the ones depicted in Figures 1.3 and 1.4, because here we
have uncertainty in the outcome of each action, and the base of each small
triangle represents all the possible outcomes of an action rather than the
different outcomes of the search for each different action in a deterministic
domain.
The selection of good actions (i.e., actions that tend to lead to the goal)
can be done with estimates of the distance to the goal and of reachability
conditions, as in heuristic search, and by learning, step by step after each
action application, better estimates of the distance.
A critical issue is the possibility of getting trapped in dead ends. In safely
explorable domains (see also Chapter 6), that is, domains where execution
cannot get trapped in situations where there is no longer a path to the goal,
it is possible to devise methods that are complete, i.e., that guarantee to
reach the goal if there exists a solution, and that guarantee the termination
of the planning/acting loop if no solution exists. However, not all domains
are safely explorable, and not all actions are reversible. A navigation robot
can be trapped in a hole where no navigation operation is possible anymore;
a bank transaction is critical and cannot be easily undone. Even worse, the
actor may not easily recognize that it is trapped in a dead end. For instance,
a navigation robot can enter an area where it is possible to navigate but it
is impossible to get out of that area. Despite these problems, planning and
acting methods remain a viable solution to problems that cannot be solved
purely offline.
In this section, we present some basic techniques that can be used to
interleave planning and execution.
5.6.1 Lookahead
The idea underlying lookahead methods is to generate a partial plan to
interleave planning and acting. The Lookahead-Partial-Plan procedure, Al-
gorithm 5.14, interleaves partial planning in line (i) with acting in line (ii).
At each loop, Lookahead searches for a partial plan rooted at s. It returns a
partial plan as a policy that is partially defined, at least in s. A context-
Lookahead-Partial-Plan(Σ, s0, Sg)
s ← s0
while s ∉ Sg and Applicable(s) ≠ ∅ do
π ← Lookahead(s, θ) (i)
if π = ∅ then return failure
else do
perform partial plan π (ii)
s ← observe current state
FS-Replan(Σ, s, Sg)
πd ← ∅
while s ∉ Sg and Applicable(s) ≠ ∅ do
if πd undefined for s then do
πd ← Plan2policy(Forward-search(Σd, s, Sg), s)
if πd = failure then return failure
perform action πd(s)
s ← observe resulting state
The actor acts using πd until reaching a state s that is not in the domain
of πd. At that point, a new deterministic plan starting at s is generated.
If the planning domain is safely explorable then, because Forward-search is a
complete deterministic planner, FS-Replan will lead to a goal. If the
domain has dead ends, then FS-Replan is not guaranteed to reach the goal.
Notice the relation between FS-Replan and Lookahead-Partial-Plan. In
the case of FS-Replan, the parameter of Lookahead-Partial-Plan is realized by
the condition checking whether πd is undefined for the current state s. FS-
Replan does not look ahead for only some steps, but until the goal is reached
according to a simplified (i.e., determinized) model of the domain.
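A sketch of FS-Replan in Python follows. Forward-search is realized as breadth-first search on the determinized model in which every outcome of γ(s, a) is treated as the result of a separate deterministic action, and acting is simulated by sampling; the bounded loop is our own safeguard, not part of the algorithm, and the domain encoding is an assumption.

import random
from collections import deque

def forward_search_det(gamma, s0, goals):
    # BFS on the determinization of gamma: {(s, a): set of successor states}.
    parent, frontier = {s0: None}, deque([s0])
    while frontier:
        u = frontier.popleft()
        if u in goals:
            plan = []
            while parent[u] is not None:          # rebuild the path
                u, a = parent[u]
                plan.append((u, a))
            return list(reversed(plan))
        for (v, a), succs in gamma.items():
            if v == u:
                for s1 in sorted(succs):
                    if s1 not in parent:
                        parent[s1] = (u, a)
                        frontier.append(s1)
    return None

def fs_replan(gamma, s, goals, max_steps=100):
    pi_d = {}
    for _ in range(max_steps):
        if s in goals:
            return s
        if s not in pi_d:                          # off the deterministic plan
            plan = forward_search_det(gamma, s, goals)
            if plan is None:
                return None
            pi_d = dict(plan)                      # Plan2policy
        s = random.choice(sorted(gamma[(s, pi_d[s])]))   # perform, observe
    return None

# Demo on the harbor-style encoding used earlier in this chapter's sketches:
demo = {('on_ship', 'unload'): {'at_harbor'},
        ('at_harbor', 'park'): {'parking1', 'parking2', 'transit1'},
        ('transit1', 'move'): {'parking1', 'parking2'},
        ('parking1', 'deliver'): {'gate2', 'transit2'},
        ('parking2', 'deliver'): {'gate1', 'transit3'},
        ('transit2', 'move'): {'gate1', 'gate2'},
        ('transit3', 'move'): {'gate1'}}
print(fs_replan(demo, 'on_ship', {'gate1', 'gate2'}))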
MinMax-LRTA*(Σ, s0, Sg)
s ← s0
while s ∉ Sg and Applicable(s) ≠ ∅ do (i)
a ← argmin_{a ∈ Applicable(s)} max_{s′ ∈ γ(s,a)} h(s′) (ii)
h(s) ← max{h(s), 1 + max_{s′ ∈ γ(s,a)} h(s′)} (iii)
perform action a
s ← the current state
MinMax LRTA* can easily be extended to deal with domains that include
costs of actions. We need only to replace the formula in the value update
step with: h(s) ← max{h(s), c(s, a) + max_{s′ ∈ γ(s,a)} h(s′)}.
To choose the action a, the algorithm does a lookahead of one step. It
is possible to extend it to look ahead n steps by generating a partial search
tree by searching forward from the current state s. Then we can update
the values in the local search space by assigning to each state the minmax
distance, under the assumption that such values do not overestimate the
correct minmax distance to the goal.
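A compact Python sketch of MinMax LRTA* with action costs, acting in a simulated environment, could look as follows; h is initialized lazily to 0 (assumed admissible), and the step bound and functional domain encoding are our own assumptions.

import random

def minmax_lrta(gamma, applicable, s, goals, cost, max_steps=100):
    h = {}                                            # learned distance estimates
    for _ in range(max_steps):
        if s in goals or not applicable(s):
            return s, h
        worst = lambda a: max(h.get(s1, 0) for s1 in gamma(s, a))
        a = min(applicable(s), key=worst)             # step (ii): minmax choice
        h[s] = max(h.get(s, 0), cost(s, a) + worst(a))  # step (iii): value update
        s = random.choice(sorted(gamma(s, a)))        # perform a, observe state
    return s, h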
simply simulate the execution through such control structures and replace
commands with the corresponding γ, which leads from one state to a single
state. Planning algorithms that use nondeterministic models must instead
take into account that programs might be executed in different states. A
simple simulation starting from one state does not allow us to know ex-
actly in which state the program will be executed. This makes the planning
algorithms much more complicated.
In the following subsections, we first recall the formalism for tasks and
adapt it to nondeterministic models (Section 5.7.1). We then define context-
dependent plans (Section 5.7.2). They are more expressive than policies
because they can take into account the context in which a step of the plan is
executed, and the context can depend on the steps that have been executed
so far. In the subsequent two subsections, we provide a planning algorithm
to generate context-dependent plans that achieve tasks. We do this in two
steps. First, we automatically generate search automata from given tasks
(Section 5.7.3). Search automata are used to guide the search for context-
dependent plans. Second, we define the planning algorithm that exploits
the generated search automata (Section 5.7.4).
achieve-cyclic(g)
achieve-acyclic(g)
to satisfy task t2 . In this way, actions to be performed can depend not only
on the current state of the domain but also on the internal state of the
actor, on its intention to satisfy one subtask or another. To represent this
kind of plans, we introduce the notion of context-dependent plan.
Definition 5.20. (Context-dependent Plans) A context-dependent plan
for a domain Σ = (S, A, γ) is a structure (C, c0, act, ctxt), where:
C is a set of contexts, representing the internal state of the actor;
c0 ∈ C is the initial context;
act : S × C ⇀ A is the action function; and
ctxt : S × C × S ⇀ C is the context function.
• RB = {rb1, . . . , rbn}, with rbi ⊆ C, is the set of red blocks: sets of
contexts in which the execution cannot stay forever.
Figure 5.11: Search automaton for safe acyclic solutions for goal p.
from tasks. Rather than providing the formal definition of the search au-
tomata, we represent them using a graphical notation. We start with the
search automaton for a reachability goal p, that is, a safe solution as de-
fined in Section 5.2.3. We have to distinguish the case of cyclic from acyclic
solutions (see Definitions 5.10 and 5.11, respectively).
Let us start with acyclic solutions (see Figure 5.11). The search au-
tomaton has two search states: c0 (the initial state) and c1 . There are two
transitions leaving state c1. The first one, guarded by condition p, is a suc-
cess transition that corresponds to the cases where p holds in the current
domain state. The second transition, guarded by condition ¬p, represents
the case in which p does not hold in the current state, and therefore, to
achieve goal p in a safe acyclic way, we have to ensure that the goal can be
achieved from all the next states. We recall that this is the condition for the
planning algorithm that will be devised in the next section (Section 5.7.4). We
remark that the second transition is a normal transition because it requires
performing an action in the plan; the first transition, instead, is immediate.
In the diagrams, we distinguish the two kinds of transitions by using thin
arrows for the immediate ones and thick arrows for the normal ones. A do-
main state is compatible with state c1 only if it satisfies goal p in a safe
acyclic way, that is, if condition p holds in the current state (first transition from
c1) or if the goal will be reachable from all the next states (second transition
from c1).
According to the semantics of safe acyclic solutions, it is not possible
for the search to stay in state c1 forever, as this corresponds to the case
in which condition p is never reached. That is, set {c1 } is a red block of
the search automaton. In the diagrams, states that are in a red block are
marked in grey. State c0 takes into account that it is not always possible
to ensure that condition p will be eventually reached, and that if this is not
the case, then p cannot be satisfied in a safe acyclic way, and therefore the
search fails. The precedence order between the two transitions from state c0,
represented by the small circular dotted arrow between them, guarantees
that the transition leading to a failure is followed only if it is not possible
to satisfy the constraints of state c1.

Figure 5.12: Search automaton for safe cyclic solutions for goal p.
We now provide the search automaton for safe cyclic solutions (see Fig-
ure 5.12). The difference with respect to the search automaton for safe
acyclic solutions is in the transition from c1 guarded by condition ¬p. In
this case we do not require that the goal holds from all the next states, but
only from some of them. Therefore, the transition has two possible targets,
namely states c1 (corresponding to the next states where we expect to achieve
the safe cyclic solution for p) and c0 (for the other next states). The seman-
tics of safe cyclic solutions requires that there always be at least one
next state that satisfies the definition of safe cyclic solution for goal p; that
is, target c1 of the transition is marked by some in the search automaton.
This non-emptiness requirement is represented in the diagram with the
mark some on the arrow leading back to c1 . The preferred transition from
state c0 is the one that leads to c1 . This ensures that the algorithm will try
to find a safe cyclic solution whenever possible.
Figure 5.13: Search automaton for a primitive action a.
Figure 5.13 shows the simple search automaton for the primitive action
a ∈ A corresponding to a command. The transition from the state c1 guar-
antees that a domain state is acceptable only if the next state is achieved by
performing action a, that is, only if the next state is reachable by performing
action a.
The search automaton for the sequence t1; t2 is shown in Figure 5.14.
The initial state of the compound automaton coincides with the initial state
of automaton At1 for t1, and the transitions that leave At1 with success
are redirected to the initial state of At2, the automaton for t2. The search
automaton for the conditional task if p then t1 else t2 is in Figure 5.15. The
context c0 immediately moves the acting to the initial context of one of the
search automata for the tasks t1 or t2 according to the current domain state,
that is, according to whether the property p holds in the current domain state or not.
The search automaton for the while loop while p do t1 is in Figure 5.16.
The context c0 has two immediate transitions guarded by the conditions p
and ¬p. The former leads to the initial context of the automaton for t1, that is,
the body of the loop, and the latter leads to the success of the compound
automaton. The successful transitions of the automaton for t1 return back to
context c0, but the failure transition for t1 falsifies the compound automaton.
The context c0 is marked as a red block; this guarantees that the loop is finite.
Figure 5.17 shows the simple search automaton for the test p. All transitions
are immediate, because no action needs to be performed. The automaton
only checks that the current domain state satisfies formula p.
Figure 5.18 shows the search automaton for the failure-handling con-
struct if t1 fails then t2 . The search automaton is defined similarly to that
for sequences t1 ; t2 . The difference is that in this case, the transitions that
leave At1 (the search automaton for t1 ) with failure are redirected to the
initial state of At2 .
Figure 5.14: Search automaton for the sequence t1; t2.

Figure 5.15: Search automaton for the conditional task if p then t1 else t2.

Figure 5.16: Search automaton for the loop while p do t1.

Figure 5.17: Search automaton for the test p.

Figure 5.18: Search automaton for the failure-handling construct if t1 fails then t2.
and the transitions represent the possible evolutions of the contexts. In the
second step, compute-associated-states explores the planning domain and
associates a set of states of the planning domain with each state in the search
automaton. Intuitively, these are the states for which a plan exists from
the given context. In the third step, synthesize-plan constructs a plan by
exploiting the information on the states associated to the contexts.
Once the search automaton is built by the function build-automaton, the
planning algorithm proceeds by associating to each context in the search
automaton a set of states in the planning domain. The association is built
by compute-associated-states (see Algorithm 5.18). The algorithm starts with
an optimistic association, which assigns all the states in the planning domain
(S is the set of states in Σ = (S, A, γ)) to each context (line (i)).

Plan-with-search-automata(Σ, S0, t)
    automaton ← build-automaton(Σ, S0, t)
    AssociatedStates ← compute-associated-states(Σ, S0, t, automaton)
    π ← synthesize-plan(automaton, AssociatedStates)
    return π

The algorithm
computes the so-called Green-Block of contexts (see line (ii)), which is the
set of contexts that are not contained in any red block (see Section 5.7.3
and Definition 5.22). Indeed, we need to distinguish contexts in the green
block from those in red blocks, because the search should eventually leave a
context in a red block, whereas this is not required for contexts in the green
block.
The association is then iteratively refined. At each iteration of the loop
(lines (iii)-(vi)), a block of contexts is chosen, and the corresponding asso-
ciations are updated: the states from which the algorithm discovers that
the tasks in the context are not satisfiable are removed from the association.
The algorithm terminates when a fixed point is reached, that is,
when no further refinement of the association is possible: in this case,
the while condition at line (iii) evaluates to false for each b ∈ Blocks and the
guard of the while loop fails.
The chosen block of contexts may be either one of the red blocks or the
green block. In case the green block is chosen, the refinement step must
guarantee that all the states associated to the contexts are safe, that is,
that they never lead to contexts where the goal can no longer be achieved.
This refinement (lines (v)-(vi)) is obtained by choosing a context in the green
block and by refreshing the corresponding set of states (function update-
ctxt). Once a fixed point is reached and all the refresh steps on the contexts
in b do not change the association (i.e., no context in b needs updates), the
loop at lines (v)-(vi) is left, and another block is chosen.
In the case of red blocks, the refinement needs to guarantee not only that
the states in the association are safe but also that the goal is eventually
resolved, that is, that the contexts in the red block are eventually left. To
this purpose, the sets of states associated to the red block contexts are
initially emptied (line (iv)). Then, iteratively, one of the red-block contexts
is chosen, and its association is updated (lines (v)-(vi)). In this way, a least
fixed point is computed for the states associated to the red block.
The core step of compute-associated-states is the function update-ctxt. It
takes as input the search automaton (C, c0, T, RB), the current association
of states assoc, that is, a function assoc : C → 2^S, and a context c ∈ C, and
returns the new set of states in S to be associated to context c. It is defined
as follows:

update-ctxt(automaton, assoc, c) =
    {s ∈ S | ∃ trans ∈ T(c) such that s ∈ trans-assoc(automaton, trans, assoc)},
where we assume that assoc(fail) = ∅ and assoc(succ) = S. That is, in the
case of an immediate transition, we require that s satisfies property p and
that it is compatible with the new search state c′ according to the current
association assoc.
If trans = (p, ((c′1, k′1), . . . , (c′n, k′n))) is a normal transition, then:
where:

gen-preimage((S1, k1), . . . , (Sn, kn)) =
    {(s, a) | ∃ S′1 ⊆ S1, . . . , S′n ⊆ Sn such that
        γ(s, a) = S′1 ∪ . . . ∪ S′n and
        S′i ∩ S′j = ∅ if i ≠ j, and S′i ≠ ∅ if ki = some},
Also in the case of normal transitions, trans-assoc requires that s satisfy
property p. Moreover, it requires that there be some action a such that the
next states γ(s, a) satisfy the following conditions:
• all the next states are compatible with some of the target search states,
according to association assoc; and
• some next state is compatible with each target search state marked as
some.
These two conditions are enforced by requiring that the state-action pair
(s, a) appears in the generalized preimage of the sets of states assoc(c′i)
associated by assoc to the target search states c′i.
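Because γ(s, a) is a small finite set, the generalized preimage can be computed by brute force. The following Python sketch, under our own encoding assumptions (targets given as pairs (S_i, k_i) with k_i = 'some' or None; states, actions, and gamma as plain Python values and callables), checks for each state-action pair whether the successors can be partitioned among the targets so that every some-marked target receives at least one successor.

from itertools import product as cartesian

def gen_preimage(states, actions, gamma, targets):
    # targets: a list of pairs (S_i, k_i), with S_i a set of domain states
    # and k_i either 'some' or None.
    result = set()
    for s in states:
        for a in actions(s):
            succs = list(gamma(s, a))
            # for each successor, the indices of the targets containing it
            choices = [[i for i, (S_i, _) in enumerate(targets) if t in S_i]
                       for t in succs]
            if any(not c for c in choices):
                continue           # a successor is compatible with no target
            for assign in cartesian(*choices):
                used = set(assign)
                # every 'some'-marked target must receive some successor
                if all(i in used for i, (_, k) in enumerate(targets)
                       if k == 'some'):
                    result.add((s, a))
                    break
    return result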
It is now possible to explain in more detail the iterative refinement at
lines (iii)-(vi) in Algorithm 5.18. Recall that in the iterative refinement
loop, the following conditions are enforced:
(C1) a domain state s is associated to a search state c only if s can
satisfy the condition described by some transition of c;
(C2) actions from a given state s cannot be performed if they stay
forever inside a red block.
In each step of the iterative refinement, either a search state in the green
block is selected and the corresponding set of domain states is refined ac-
cording to (C1); or a red block is selected and all the sets of domain states
associated to its search states are refined according to (C2). The refinement
algorithm terminates when no further refinement step is possible, that is,
when a fixed point is reached.
Function update-ctxt(automaton, assoc, c) is used in the refinement steps
corresponding to (C1) as well as in the refinement steps corresponding to
(C2). In the former case, the refinement step simply updates assoc(c) to
the value of update-ctxt(automaton, assoc, c). In the latter case, the re-
finement should guarantee that any valid execution eventually leaves the
search states in the selected red block rbi. For this purpose, the empty
set of domain states is initially associated to the search states in the
red block; then, iteratively, one of the search states c ∈ rbi is chosen,
and its association assoc(c) is updated to update-ctxt(automaton, assoc, c).
These updates terminate when a fixed point is reached, that is, when
assoc(c) = update-ctxt(automaton, assoc, c) for each c ∈ rbi. In this way,
a least fixed point is computed, which guarantees that a domain state is
associated to a search state in the red block only if there is a plan from that
domain state that leaves the red block in a finite number of actions.
Finally, synthesize-plan extracts a plan by using the information about the
domain states associated to each search state. Indeed, once a stable associ-
ation assoc from search states to sets of domain states is built for a search
automaton, a plan can easily be obtained. The contexts of the plan cor-
respond to the states of the search automaton. The information necessary
to define functions act and ctxt is implicitly computed during the execution
of the refinement steps. Indeed, function trans-assoc defines the possible
actions a = act(s, c) to be performed in the state-context pair (s, c), namely
Figure 5.19: A nondeterministic model of the door-opening domain, with
states S0-S7 and actions grasp, pull, push, slide, move, and retry.
the actions that satisfy the constraints of one of the normal transitions of the
search automaton. Moreover, function gen-preimage defines the next acting
context ctxt(s, c, s′) for any possible next state s′ ∈ γ(s, a).
Figure 5.20: A nondeterministic model of the door domain: the action
door_type leads to one of the states pulling, pushing, or sliding.
components that interact with each other through inputs and outputs. This
is the main representational shift with respect to the usual acting/sensing
representation. Input/output automata allow for modeling distributed sys-
tems where each automaton is a component that interacts with other com-
ponents through inputs and outputs. They make it possible to simplify the
design process by abstracting away the details of their internal representa-
tion.
Formally, input/output automata are very similar to state transition
systems described in Chapter 2 and planning domains described in this
chapter, with the following differences. Input/output automata can evolve
to new states by receiving inputs from other automata and sending outputs
to other automata. Moreover, they can evolve with internal transitions
without sending outputs and receiving inputs. Internal transitions represent
commands that are sent to the execution platform,20 as they have been
introduced in the definition of RAE methods in Chapter 3.
Definition 5.25. (Input/Output Automaton) An input/output au-
tomaton is A = (S, S0, I, O, C, γ), where
• S is a finite set of states;
• S0 ⊆ S is the set of possible initial states in which the automaton can
start;
• I is the set of inputs, O is the set of outputs, and C is a set of com-
mands, with I, O, and C disjoint sets;
• γ : S × (I ∪ O ∪ C) → 2^S is the nondeterministic state transition
function.
can wait for the reception of an input, but whether it will receive it, and
when, is not under its control.
The automaton can determine when to perform an internal transition,
that is, when to execute a command. Notice, however, that such a transition can end
up in different states. This allows us to model the execution of commands
that are sent to the execution platform without knowing a priori the result
of the execution.
In simple cases like the one in Figure 5.20, the input/output automaton
can be reduced to a simple nondeterministic state transition system, simply
by replacing the sequences of outputs and inputs with a nondeterministic
action, for instance, get-door-kind that leads to three possible states. In
this case, we can apply any of the techniques described in the previous sec-
tions, either offline or online (e.g., determinization, symbolic model check-
ing, lookahead), to generate a policy or a context-dependent plan that
acts with deliberation.
The different possible evolutions of an input/output automaton can be
represented by its set of possible runs.
Definition 5.26. (Run of an Input/Output Automaton) A run of an in-
put/output automaton A = (S, S0, I, O, C, γ) is a sequence s0, a0, s1, a1, . . .
such that s0 ∈ S0, ai ∈ I ∪ O ∪ C, and si+1 ∈ γ(si, ai).
A run may be either finite or infinite.
Figure 5.21: Input/Output (I/O) automaton to control the robot I/O au-
tomaton.
Figure 5.22: An input/output automaton, with states S0-S5 and transitions
wait, move, sense-distance, far, close, and grasp.
The set of states of the controlled system is obtained by the Cartesian
product of the states of A and those of Ac. In Definition 5.30, the first two
23
Along the lines described in Chapter 2, goal g is a partial variable assignment to state
variables xi = vi, . . . , xk = vk, for each xi, . . . , xk ∈ X, with each of them having values
in their range: vi ∈ Range(xi), . . . , vk ∈ Range(xk).
items specify that the states of the controlled system evolve according to the
internal evolutions due to the execution of both commands of Ac (first item)
and of commands of A (second item). The third and fourth items regard
the evolutions that depend on inputs and outputs. In this case, the state of
the controlled system ⟨sc, s⟩ evolves by taking into account the evolutions
of both A and Ac .
A remark is in order. We need to rule out controllers that can get
trapped in deadlocks. In other words, we need to rule out the case in
which an automaton sends outputs that the other automaton is not able
to receive. If an automaton sends an output, then the other automaton
must be able to consume it, either immediately or after executing internal
commands that lead to a state where the input is consumed. In other words,
an automaton A in a state s must be able to receive as one of its inputs
i ∈ I the output o′ ∈ O′ of another automaton A′; or, for all the possible
executions of commands c ∈ C of automaton A, there must exist a successor of
s where o′ can be received as an input i.
Given this notion, we define intuitively the notion of a deadlock-free con-
troller for a controlled input/output automaton. It is a control automaton
such that all of its outputs can be received by the controlled automaton,
and vice versa, all the outputs of the controlled automaton can be received
by the controller.24
Informally, the synthesis problem is the problem of generating a control
automaton Ac such that the controlled system Ac ▷ A satisfies a goal g; that
is, we have to synthesize an Ac that interacts with A by making A reach some
desired state. In other words, a control automaton Ac is a solution for a
goal g if every run of the controlled system Ac ▷ A ends up in a state
where g holds.
Definition 5.31. (Satisfiability) Let g be a partial state variable as-
signment xi = vi, . . . , xk = vk, with xi, . . . , xk ∈ X and each vi ∈
Range(xi), . . . , vk ∈ Range(xk). Let A be an input/output automaton. A
satisfies g, denoted by A |= g, if
• there exists no infinite run25 of A, and
• every final state s of A satisfies g.
We can now define when a control automaton is a solution for an in-
put/output automaton with respect to a goal, that is, when it controls the
automaton satisfying our desired requirement.
24
See Exercise 5.12.
25
See Definition 5.26 for the definition of a run of an automaton.
Figure 5.23: Abstracted system for the I/O automaton in Figure 5.22.
by the commands in C, and then, from every state reachable in this way,
their evolution caused by an input or output.
Under the assumption that the execution of commands terminates, that
is, that commands cannot be trapped in loops, we can define an Abstracted
System whose states are sets of states of the automaton and whose evolutions
are over sets of states.
Example 5.34. In Figure 5.23, we show the abstracted system for the
controlled automaton in Figure 5.22. States s0 , s1 , and s2 generate the
set of states B1 = {s0 , s1 , s2 } because there is no way for the controller
to distinguish them until it receives either input far or input close by the
controlled automaton. We have B3 = {s3 }, B4 = {s4 }, and B5 = {s5 }.
Figure 5.24: A controller for the abstracted system, with states B1, B3, B4,
B5 and transitions wait, move, far, close, and grasp.
Figure 5.25: The door I/O automaton and the robot I/O automaton.
Example 5.36. In Figure 5.25, the door and the robot are modeled as two
input/output automata. The door receives a request to open (the input
I-open). It then activates its engines to open (the command cmd-open). The
command may either succeed or fail, and the door sends outputs accordingly
(O-success-open or O-fail-open). If the command succeeds, then the door waits
for two possible inputs, one indicating that the door can be closed (because,
e.g., the robot passed successfully), that is, the input I-done, or that there
Figure 5.26: A controller for the robot and door input/output automata.
is a problem and the door should stop with failure, that is, I-cancel. The
robot has a similar input/output automaton: it waits for an input to move
(I-move), then it moves (cmd-move). If the operation succeeds, then it waits
for an input stating either that everything is fine (I-done) and it can stop
(cmd-stop) or that a failure from the environment occurred (I-cancel).
Notice that with this model, the robot and any other actor in the en-
vironment do not even need to know whether the door is a sliding door
or a door that can be opened by pulling/pushing, because this is hidden
in the command cmd-open of the door input/output automaton. This ab-
straction mechanism is one of the advantages of a model based on multiple
input/output automata.
automata A1 , . . . , An and satisfies some desired goal. The problem has two
inputs:
A finite set of automata A1 , . . . , An . This set can be dynamic and can
be determined at run-time. It can be a representation of the current
methods for the tasks to be performed.
A requirement g that is defined as a partial state variable assignment
in the acting state space of each automaton Ai: xi = vi, . . . , xk = vk,
for each xi, . . . , xk ∈ X and each vi ∈ Range(xi), . . . , vk ∈ Range(xk).
The requirement can be either given at design time or generated
at run time (see Section 7.2).
Informally, we want to generate a controller Ac that interacts with
A1, . . . , An in such a way as to make the automata A1, . . . , An reach some
states where the requirement g is satisfied. We first introduce the product
of the automata A1, . . . , An:

A∥ = A1 ∥ . . . ∥ An

This product is a representation of all the possible evolutions of the automata
A1, . . . , An, without any control by Ac.
We formally define the product of two automata A1 and A2 , which mod-
els the fact that the two automata may evolve independently. In the fol-
lowing definition, we assume that the two automata do not send messages
to each other, that is, the inputs of A1 cannot be outputs of A2 and vice
versa. This is a reasonable assumption in our case, where we suppose that
each available automaton A1 , . . . , An interacts only with the controller Ac
that we have to generate. The assumption can, however, be dropped by
modifying in a suitable way the definition of product.
Definition 5.38. (Product of Input/Output Automata) Let A1 =
(S1, S1^0, I1, O1, C1, γ1) and A2 = (S2, S2^0, I2, O2, C2, γ2) be two automata with
(I1 ∪ O1 ∪ C1) ∩ (I2 ∪ O2 ∪ C2) = ∅. The product of A1 and A2 is A1 ∥ A2 =
(S, S0, I1 ∪ I2, O1 ∪ O2, C1 ∪ C2, γ), where:
• S = S1 × S2,
• S0 = S1^0 × S2^0,
• ⟨s′1, s2⟩ ∈ γ(⟨s1, s2⟩, a) if s′1 ∈ γ1(s1, a), and
• ⟨s1, s′2⟩ ∈ γ(⟨s1, s2⟩, a) if s′2 ∈ γ2(s2, a).
The automaton A∥ = A1 ∥ . . . ∥ An represents all the possible ways in which
automata A1, . . . , An can evolve without any control. We can therefore de-
fine the automaton describing the behaviors of A∥ when controlled by a
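As an illustration of Definition 5.38, here is a small Python sketch of the product construction. The dictionary encoding of an automaton (keys 'S0' and 'gamma', with gamma mapping (state, event) pairs to sets of successors) is our assumption for the sketch; the event sets of the two automata are assumed disjoint, as in the definition.

def states_of(A):
    # All states mentioned by the initial states or the transition function.
    states = set(A['S0'])
    for (s, _), succs in A['gamma'].items():
        states.add(s)
        states |= succs
    return states

def io_product(A1, A2):
    gamma = {}
    def add(state, event, succ):
        gamma.setdefault((state, event), set()).add(succ)
    # transitions where A1 moves while A2 stays put ...
    for (s1, a), succs in A1['gamma'].items():
        for s2 in states_of(A2):
            for t1 in succs:
                add((s1, s2), a, (t1, s2))
    # ... and, symmetrically, where A2 moves while A1 stays put
    for (s2, a), succs in A2['gamma'].items():
        for s1 in states_of(A1):
            for t2 in succs:
                add((s1, s2), a, (s1, t2))
    S0 = {(s1, s2) for s1 in A1['S0'] for s2 in A2['S0']}
    return {'S0': S0, 'gamma': gamma}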
the search (this idea is called state reuse). The work in [436], implemented in
the PRP planner, devises a technique to focus on relevant aspects of states
for generating safe cyclic solutions. This technique significantly improves
the performance of the planner. Another study [435] extends the
work to conditional effects.
In [506, 139], the synthesis of controllers from a set of components is
accomplished by planning for safe acyclic solutions through model checking
techniques. The work in [507] combines symbolic model-checking techniques
and forward heuristic search.
Bonet and Geffner [81, 82, 85] have introduced the idea of planning in
belief space (i.e., the space of sets of states) using heuristic forward search.
Brafman and Hoffman [96] address the conformant planning problem by us-
ing SAT to reason about the effects of an action sequence and heuristic search
based on FF relaxation techniques [280]. They extend the technique to deal
with contingent planning in [279]. Partially observable contingent planning
is further addressed in [399], a work that interleaves conformant planning
with sensing actions and uses a landmark-based heuristic for selecting the
next sensing action, together with a projection method that uses classical
planning to solve the intermediate conformant planning problems. Another
work [89] studies the complexity of belief tracking for planning with sensing,
both in the case of deterministic actions with uncertainty in the initial state
and in the case of nondeterministic actions.
A notable work on interleaving planning and execution in nondetermin-
istic domains is presented in [344, 343, 342]. These authors propose different
techniques based on real-time heuristic search. Such algorithms are based on
distance heuristics in the search space. [342] proposes the MinMax Learning
Real Time A* presented in this chapter (see Algorithm 5.16): the learning
mechanism can be amortized over several planning runs. On one hand,
these techniques allow for dealing with large planning domains that cannot
be addressed by offline algorithms. On the other hand, they work on the
assumption of safely explorable domains, that is, domains that do not
contain dead ends.
FS-Replan can be the basis for extensions in probabilistic planning that
take into account probability distributions; see Chapter 6, algorithm RFF.
Vice versa, some algorithms devised for probabilistic planning can be used in
nondeterministic domains without taking into account the probabilistic dis-
tribution. This is the case of algorithms based on sparse sampling lookahead
(see Chapter 6, algorithm SLATE).
The work in [63] proposes a different technique based on symbolic model
checking for partially observable domains, which guarantees termination in
5.10 Exercises
5.1. Can all (memoryless) policies be written as contingent plans, that is,
plans with conditional tests? Vice versa? Explain the answer with some
examples.
5.4. Consider Definition 5.10 and Definition 5.11. Write definitions of safe
cyclic and acyclic solutions that reach the goal and then continue looping in-
side the set of goal states, and definitions of solutions that traverse infinitely
often the set of goal states. More formally, write definitions of safe cyclic
and acyclic solutions such that PerformPolicy(π) (see the acting procedure
in Algorithm 5.1) reaches the goal and then loops forever, with the condi-
tion that PerformPolicy(π) is guaranteed to loop inside the set of goal states.
Write the same definition but with the condition that PerformPolicy(π) is
guaranteed to traverse the set of goal states infinitely often.
5.7. Figure 5.28 is a domain model for a washing problem. To make the do-
main nondeterministic, suppose that the start action may either succeed or
fail. If it fails, it does not change the state. The run and finish actions are
guaranteed to succeed. Also, say that the set of goal states Sg consists of
all the states in which {clean(clothes)=T, clean(dishes)=T} is satisfied.

Figure 5.28: A planning domain in which there are two devices that use
water: a washing machine and a dishwasher. Because of water pressure
problems, only one device can use water at a time.
5.11. The algorithm for planning for safe solutions by symbolic model check-
ing presented in this chapter (see Algorithm 5.7) can find either safe cyclic
or safe acyclic solutions. Modify the algorithm such that it finds a safe
acyclic solution, and only if one does not exist, does it search for a safe
cyclic solution.
Deliberation with
Probabilistic Models
6.1 Introduction
Some of the motivations for deliberation with probabilistic models are sim-
ilar to those introduced in Chapter 5 for addressing nondeterminism: the
This section introduces the main definitions, concepts and techniques needed
for addressing probabilistic planning and acting problems.
the entire γ̂(s0, π), except at goal states and states that have no applicable
action. As usual in planning, goals are considered to be terminal states
requiring no further action.
Example 6.4. Here is a simple example, inspired by casino coin ma-
chines called one-armed bandits. This domain has three state variables
x, y, and z, ranging over the set {a, b, c}. The domain has nine states:
{x = a, y = a, z = a} . . . {x = c, y = c, z = c}, which are abbreviated as
S = {(aaa), (aab), . . . , (ccc)}. There are three actions: pull left, pull right,
and pull both arms simultaneously, denoted respectively Left, Right, and
Both. If the three state variables are distinct, then all three actions are
applicable. If x ≠ y = z, only Left is applicable. If x = y ≠ z, only Right
is applicable. If x = z ≠ y, only Both is applicable. Finally, when the three
variables are equal, no action is applicable. Here is a possible specification
of Left (each outcome is prefixed by its corresponding probability):
Left:
    pre: x ≠ y
    eff: (1/3): {x ← a}
         (1/3): {x ← b}
         (1/3): {x ← c}
Similarly, when applicable, Right randomly changes z; Both randomly
changes y. We assume these changes to be uniformly distributed. Fig-
ure 6.1 gives part of the state space of this domain, corresponding to the
problem of going from s0 = (abc) to a goal state in Sg = {(bbb), (ccc)}. Note
that every action in this domain may possibly leave the state unchanged,
that is, ∀s, a: s ∈ γ(s, a). Note also that the state space of this domain is
not fully connected: once two variables are made equal, there is no action
to change them. Consequently, states (acb), (bac), (bca), (cab), and (cba) are
not reachable from (abc).
A solution to the problem in Figure 6.2 is, for instance,
π(abc) = Left, π(bbc) = π(bba) = Right, π(cbc) = π(cac) = Both.
Here, π is defined over Dom(π) = {s0, (bbc), (cbc), (bba), (cac)}, and
γ̂(s0, π) = Dom(π) ∪ Sg. Figure 6.2 gives the Graph(s0, π) for that solu-
tion.
Figure 6.1: Part of the state space for the problem in Example 6.4.
Figure 6.2: A safe solution for Example 6.4 and its Graph(s0 , ).
for acting with a policy π, by performing in each state s the action given by
π(s) until reaching a goal or a state that has no applicable action.2
Run-Policy(Σ, s0, Sg, π)
    s ← s0
    while s ∉ Sg and Applicable(s) ≠ ∅ do
        perform action π(s)
        s ← observe resulting state
Let Pr_l(Sg | s, π) be the probability of reaching a goal from a state s
by following policy π for up to l steps: Pr_l(Sg | s, π) = Σ_σ Pr(σ), over
all σ ∈ {⟨s, s1, . . . , sh⟩ | si+1 ∈ γ(si, π(si)), sh ∈ Sg, h ≤ l}. Let
Pr(Sg | s, π) = lim_{l→∞} Pr_l(Sg | s, π). With this notation, it follows that:
• if π is a solution to the problem (Σ, s0, Sg), then Pr(Sg | s0, π) > 0; and
• a goal is reachable from a state s with policy π if and only if
Pr(Sg | s, π) > 0.
2
Section 6.5 further details how to perform the action π(s) in probabilistic models.
A domain has no dead end if and only if every state in S is safe. A domain
has no reachable dead end if and only if every state reachable from s0 by
any policy is safe. These desirable cases are difficult to detect in advance.
A problem has a safe solution when the domain's dead ends are avoidable:
there is a π such that γ̂(s0, π) avoids dead ends. Example 6.5 illustrates
a domain where dead ends are avoidable. Planning algorithms will seek to
avoid dead ends, searching for safe solutions. If the domain has at least one
unavoidable dead end reachable from s0, then s0 is unsafe. In that case,
one may accept an unsafe solution whose probability of reaching a goal is
maximal. The trade-off between cost and probability of reaching the goal is
discussed in Section 6.7.3.
In summary, an SSP problem (Σ, s0, Sg) can be such that (i) it has a
solution, possibly unsafe; (ii) it has a safe solution, that is, its possible dead ends
are avoidable; (iii) it has no reachable dead end; or (iv) it has no dead
end. These four cases are in increasing order of restriction. We'll start by
assuming to be in the most restricted case and relax this assumption afterwards.
V^π(s) = E[Σ_i cost(si, π(si))],    (6.1)

V^π(s) = 0 if s ∈ Sg, and otherwise
V^π(s) = cost(s, π(s)) + Σ_{s′∈γ(s,π(s))} Pr(s′ | s, π(s)) V^π(s′).    (6.3)
The value function V plays a critical role in solving SSPs: it makes
it possible to rank policies according to their expected total cost, to use
optimization techniques for seeking a safe optimal policy, and, as we will see
later, to heuristically focus the search on a part of the search space.
A policy π′ dominates a policy π if and only if V^π′(s) ≤ V^π(s) for every
state for which both π and π′ are defined. An optimal policy π* is a policy
that dominates all other policies. It has a minimal expected cost over
all possible policies: V*(s) = min_π V^π(s). Under our assumption of prob-
abilistic planning in a domain without dead ends, π* exists and V* is unique.
V*(s) = min_{a∈Applicable(s)} {cost(s, a) + Σ_{s′∈γ(s,a)} Pr(s′ | s, a) V*(s′)}    (6.4)
PI(Σ, π0)
    π ← π0
    loop until reaching a fixed point
        compute {V^π(s) | s ∈ S}    (i)
        for every s ∈ S \ Sg do
            π(s) ← argmin_a {cost(s, a) + Σ_{s′∈γ(s,a)} Pr(s′ | s, a) V^π(s′)}    (ii)
There are two ways of computing V^π for the current π. The direct method
is to solve Equation 6.3, considered over the entire S, as a system of n linear
equations, where n = |S| and the n unknown variables are the values of V^π(s).
There is a solution to these n linear equations if and only if the current π
is safe. The value function V^π for the current π can be computed using
classical linear algebra methods, such as Gaussian elimination.
The second method for finding V^π is iterative. It consists in computing
the following series of value functions:

V_i(s) = cost(s, π(s)) + Σ_{s′∈γ(s,π(s))} Pr(s′ | s, π(s)) V_{i−1}(s′).    (6.8)
It can be shown that, for any initial V0, if π is safe, then this series converges
asymptotically to a fixed point equal to V^π. In practice, one stops when
max_s |V_i(s) − V_{i−1}(s)| is small enough; V_i is then taken as an estimate of V^π
(more about this in the next section).
Algorithm PI, when initialized with a safe policy, strictly improves the current
policy over the previous one at each iteration, until reaching π*. In a do-
main that has no dead ends, there exists a safe π0. All successive policies are
also safe and monotonically decreasing with respect to the dominance relation. In
other words, if the successive policies defined by PI are π0, π1, . . . , πk, . . . ,
then ∀s: V*(s) ≤ . . . ≤ V^{πk}(s) ≤ . . . ≤ V^{π1}(s) ≤ V^{π0}(s). Because there is a
finite number of distinct policies, algorithm PI with a safe π0 converges to
an optimal policy in a finite number of iterations.
The requirement that π0 is safe is easily met for domains without dead
ends. However, this strong assumption is difficult to meet in practice. It
makes PI difficult to generalize to domains with dead ends. Algorithm Value
Iteration, detailed next, also makes this assumption, but it can be general-
ized with heuristic search techniques to handle dead ends; it is often more
efficient in practice than PI.
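As an illustration, here is a compact Python sketch of PI on an SSP given explicitly as dictionaries. The encoding (gamma[s][a] mapping each successor s′ to Pr(s′|s, a)) is ours, and policy evaluation uses the iterative method of Equation 6.8 with a fixed number of sweeps rather than a linear solver; the initial policy is assumed safe.

def evaluate(policy, goals, gamma, cost, sweeps=1000):
    # Iterative policy evaluation (Equation 6.8); converges only if the
    # policy is safe, which we assume here.
    V = {s: 0.0 for s in gamma}
    V.update(dict.fromkeys(goals, 0.0))
    for _ in range(sweeps):
        for s, a in policy.items():
            V[s] = cost[s][a] + sum(p * V[t] for t, p in gamma[s][a].items())
    return V

def policy_iteration(policy, goals, gamma, cost):
    # policy: a safe initial policy, defined on the non-goal states
    while True:
        V = evaluate(policy, goals, gamma, cost)
        new = {s: min(gamma[s],                     # greedy improvement (ii)
                      key=lambda a: cost[s][a] +
                      sum(p * V[t] for t, p in gamma[s][a].items()))
               for s in policy}
        if new == policy:
            return policy, V       # fixed point: no improvement possible
        policy = new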
π(s) = argmin_a {Q_V(s, a)}.5 From V, a new value function can be computed
with the following equation:

V′(s) = min_a {cost(s, a) + Σ_{s′∈γ(s,a)} Pr(s′ | s, a) V(s′)}.    (6.9)
VI(Σ, V0)
    V ← V0
    loop until reaching a fixed point
        for every s ∈ S \ Sg do
            V′(s) ← min_a {cost(s, a) + Σ_{s′∈γ(s,a)} Pr(s′ | s, a) V(s′)}
        V ← V′
    π(s) ← argmin_a {cost(s, a) + Σ_{s′∈γ(s,a)} Pr(s′ | s, a) V(s′)}
VI(Σ, V0)
    V ← V0
    loop until reaching a fixed point
        for every s ∈ S \ Sg do
            Bellman-Update(s)

Bellman-Update(s)
    for every a ∈ Applicable(s) do
        Q(s, a) ← cost(s, a) + Σ_{s′∈γ(s,a)} Pr(s′ | s, a) V(s′)
    V(s) ← min_a {Q(s, a)}
    π(s) ← argmin_a {Q(s, a)}
The termination condition of the outer loop of VI checks that such a fixed
point has been reached, that is, that a full iteration over S has produced no
change in V. At the fixed point, every state s meets Equation 6.3, that is,
∀s: V(s) = V^π(s) for the current π.
In the previous section, we emphasized that because there is a finite number
of policies, it makes sense to stop PI when the fixed point is reached. Here,
there is an infinite number of value functions; the precise fixed point is
an asymptotic limit. Hence, VI stops when a fixed point is approximately
reached, within some acceptable margin of error. This can be assessed by the
amount of change in the value of V(s) during its update in Bellman-Update.
This amplitude of change is called the residual of a state:

residual(s) = |V(s) − min_a {cost(s, a) + Σ_{s′∈γ(s,a)} Pr(s′ | s, a) V(s′)}|.
At each iteration of VI, residual(s) is computed before each update with
respect to the values of V at the previous iteration. The termination con-
dition of VI, with a margin of error set to a small parameter η > 0, is:
residual(s) ≤ η for every s. Note, however, that with such a termination condition, the
value of V(s) at the last iteration is not identical to V^π(s) for the current π,
as illustrated next.
Figure 6.4: A simple domain with two actions: a (cost 10) leads from s0 to
the goal g with probability .2 and loops back on s0 otherwise; b (cost 100)
leads deterministically from s0 to s1 and from s1 to g.
Example 6.11. Consider the very simple domain in Figure 6.4. It has
three states, s0, s1, and the goal g, and two actions a and b. Action a leads
in one step to g with probability p; it loops back on s0 with probability
1 − p. Action b is deterministic. Assume cost(a) = 10, cost(b) = 100, and
p = .2. The problem has two solutions, denoted πa and πb. Their values are
V^{πa}(s0) = cost(a) Σ_{i≥0} (1 − p)^i = cost(a)/p = 50 and
V^{πb}(s0) = 2 cost(b) = 200. Hence π* = πa.
Let us run VI (say, the synchronous version) on this simple domain, as-
suming V0(s) = 0 in every state. After the first iteration, V1(s0) = 10 and
V1(s1) = 100. In the following iterations, Vi(s0) = 10 + .8 × Vi−1(s0), and
Vi(s1) remains unchanged. The successive values of V in s0 are 18, 24.4,
29.52, 33.62, 36.89, 39.51, 41.61, 43.29, 44.63, 45.71, 46.56, and so on, which
converge asymptotically to 50.
With η = 10^−4, VI stops after 53 iterations with solution πa and V(s0) =
49.9996. With η = 10^−3, 10^−2, and 10^−1, termination is reached after 43,
32, and 22 iterations, respectively. With a larger value of η, say η = 5,
termination is reached after just 5 iterations with V(s0) = 33.62 (at this
point, residual(s0) = 33.62 − 29.52 < η). Note that at termination V(s0) ≠
V^{πa}(s0) for the found solution πa. We'll see next how to bound the difference
V^π(s0) − V*(s0).
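The numbers in this example are easy to reproduce. The following self-contained Python sketch runs VI with in-place Bellman updates on our own encoding of the domain of Figure 6.4; with η = 10^−4 it stops after 53 iterations with V(s0) ≈ 49.9996, as reported above.

gamma = {                  # gamma[s][a] maps each successor to Pr(s'|s,a)
    's0': {'a': {'g': 0.2, 's0': 0.8}, 'b': {'s1': 1.0}},
    's1': {'b': {'g': 1.0}},
}
cost = {'s0': {'a': 10.0, 'b': 100.0}, 's1': {'b': 100.0}}
goals = {'g'}

def value_iteration(eta):
    V = {'s0': 0.0, 's1': 0.0, 'g': 0.0}
    iterations = 0
    while True:
        iterations += 1
        residual = 0.0
        for s in gamma:    # non-goal states only
            q = min(cost[s][a] + sum(p * V[t] for t, p in succ.items())
                    for a, succ in gamma[s].items())
            residual = max(residual, abs(q - V[s]))
            V[s] = q
        if residual <= eta:
            return V, iterations

V, n = value_iteration(1e-4)
print(n, round(V['s0'], 4))        # 53 iterations, V(s0) = 49.9996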
Proposition 6.12. For any two value functions V1 and V2 such that
∀s: V1(s) ≤ V2(s), we have ∀s: (B^k V1)(s) ≤ (B^k V2)(s) for k = 1, 2, . . . .
Proposition 6.14. For an SSP problem without dead ends, VI reaches the
termination condition residual(s) ≤ η in a finite number of iterations.
where δ^π(s) and δ*(s) are the expected numbers of steps to reach a goal
from s by following π and π*, respectively. However, this bound is difficult
to compute in the general case.
More interesting properties can be established when VI uses a heuristic
function V0 that is admissible or monotone.
Note the similarity between Equation 6.3 and Equation 6.11: the expected
number of steps to a goal is simply V^π with unit costs. Note also that the
bound η δ^π(s0) can be arbitrarily large.
VI does not guarantee a solution whose difference from the optimum is
bounded in advance; this difference is only bounded a posteriori. The bounds in
Proposition 6.17 entail 0 ≤ V^π(s) − V*(s) ≤ V^π(s) − V(s) ≤ η δ^π(s). How-
ever, a guaranteed approximation procedure is easily defined using VI with
an admissible heuristic. Such a procedure is illustrated in Algorithm 6.6.
Procedure GAP with an admissible heuristic returns a solution guar-
anteed to be within ε of the optimum, that is, V^π(s0) − V*(s0) ≤ ε. It
GAP(V0, ε)
    V ← V0; initialize η > 0 arbitrarily
    loop
        run VI(η, V)
        compute δ^π(s0) for the found solution π
        if η δ^π(s0) ≤ ε then return π
        else η ← min{ε/δ^π(s0), η/2}
Figure 6.5: A connectivity graph over six locations l0, . . . , l5; edges are
labelled by their traversal cost and probability of success.
Example 6.20. Consider a robot servicing an environment that has six lo-
cations l0 , l1 , . . . , l5 , which are connected as defined by the undirected graph
of Figure 6.5. Traversing an edge has a cost and a nondeterministic out-
come: the tentative traversal of a temporarily busy road has no effect. For
example, when in location l0 the robot takes the action move(l0 , l1 ); with a
probability .5 the action brings the robot to l1 , but if the road is busy the
robot remains in l0 ; in both cases the action costs 2. Edges are labelled by
their traversal cost and probability of success.
In a realistic application, the robot would know (for example, from sen-
sors in the environment) when an edge is busy and for how long. Let us
assume that the robot knows about a busy edge only when trying to tra-
verse it; a trial gives no information about the possible outcome of the next
trial. Finding an optimal policy for traversing between two locations can be
modeled as a simple SSP that has as many states as locations. The state for
a location l has as many actions as outgoing edges from l; each action has
two possible outcomes: reaching the adjacent location or staying in l.
Table 6.1: V (l) after the first three and last three iterations of VI on the
domain of Figure 6.5.
iteration l0 l1 l2 l3 l4
1 2.00 2.00 2.00 3.60 5.00
2 4.00 4.00 5.28 5.92 7.50
3 6.00 7.00 7.79 8.78 8.75
10 19.52 21.86 21.16 19.76 9.99
11 19.75 22.18 21.93 19.88 10.00
12 19.87 22.34 22.29 19.94 10.00
Let us run Value Iteration on this simple domain for going from l0 to
l5. With V0 = 0 and η = .5, VI terminates after 12 iterations (see Table 6.1,
which gives V(l) for the first three and last three iterations). It finds the
following policy: π(l0)=move(l0, l4), π(l4)=move(l4, l5), π(l1)=move(l0, l4),
π(l2)=move(l0, l4), π(l3)=move(l0, l4). π corresponds to the path ⟨l0, l4, l5⟩.
Its cost is V^π(l0) = 20, which is easily computed from V^π(l4) = 5/.5 and
V^π(l0) = (5 + .5 × V^π(l4))/.5. Note that at termination V(l0) = 19.87 ≠
V^π(l0). The residual after iteration 12 is 22.29 − 21.93 = .36 < η.
Let us change the cost of the edge (l0, l4) to 10. The cost of the previous
policy is now 30; it is no longer optimal. VI terminates (with the same η)
after 13 iterations with a policy corresponding to the path ⟨l0, l1, l3, l5⟩; its
cost is 26.5.
VI versus PI. The reader will have noticed the formal similarities between VI
and PI: the two algorithms rely on repeated updates until reaching a fixed
point. Their differences are worth underlining:
• PI approaches V* from above, whereas VI approaches the optimum from
below. Hence the importance of starting with an admissible heuristic
for the latter. PI does not require a heuristic but a safe initial π0.
However, heuristics, when available, can bring a significant efficiency
gain.
• PI computes V^π for the current π and the final solution π*, whereas VI relies on
an approximation V of the value V^π of the greedy policy π.
• PI reaches its fixed point exactly, whereas a margin of error has to be set
for VI, allowing for the flexibility illustrated in procedure GAP.
Note, however, that when PI relies on the iterative method of Equation 6.8
for computing V^π, the two algorithms can be quite close.
instead of over the entire S. This explored part of the search space starts
with {s0 } and is incrementally expanded. Let the Envelope be the set of
states that have been generated at some point by a search algorithm. The
Envelope is partitioned into:
(i) goal states, for which V(s) = 0;
(ii) fringe states, whose successors are still unknown; for a fringe state,
π(s) is not yet defined and V(s) = V0(s);
(iii) interior states, whose successors are already in the Envelope.
Expanding a fringe state s means finding its successors and defining
Q(s, a) = cost(s, a) + Σ_{s′∈γ(s,a)} Pr(s′ | s, a) V(s′), V(s) = min_a {Q(s, a)}, and
the greedy policy for the current V, which is π(s) = argmin_a {Q(s, a)}. Updat-
ing an interior state s means performing a Bellman update on s. When a
descendant s′ of s gets expanded or updated, V(s′) changes, which makes
V(s) no longer equal to min_a {Q(s, a)} and requires updating s.
Let us define the useful notions of open and solved states with respect
to η, a given margin of error.
Recall that γ̂(s, π) includes s and the states in the Envelope reachable
from s by the current π. It defines Graph(s, π), the current solution graph
starting from s. Throughout Section 6.3, π is the greedy policy for the current
V; it changes after an update. Hence γ̂(s, π) and Graph(s, π) are defined
dynamically.
Most heuristic search algorithms use the preceding notions and are based
on different instantiations of a general schema called Find&Revise (Algo-
rithm 6.7), which repeatedly performs a Find step followed by a Revise
step.
The Find step is a traversal of the current γ̂(s0, π) for finding and choos-
ing an open state s. This Find step has to be systematic: no state in γ̂(s0, π)
should be left open forever without being chosen for revision.
The Revise step updates an interior state whose residual is greater than η or expands
a fringe state. Revising a state can change the current π and hence γ̂(s0, π).
At any point, either a state s is open, or s has an open descendant (whose
Find&Revise(Σ, s0, Sg, V0)
    until s0 is solved do
        choose an open state s in γ̂(s0, π)    (i) Find
        expand or update s    (ii) Revise
Algorithm 6.7: Find&Revise schema. The specifics of the Find and the
Revise steps depend on the particular algorithm instantiating this schema.
revision will later make s open), or s is solved. In the latter case, γ̂(s, π)
does not change anymore.
Find&Revise iterates until s0 is solved, that is, until there is no open state in
γ̂(s0, π).
AO*(Σ, s0, Sg, V0)
    Envelope ← {s0}
    while γ̂(s0, π) has fringe states do
        traverse γ̂(s0, π) and select a fringe state s ∈ γ̂(s0, π)    (i)
        for all a ∈ Applicable(s) and s′ ∈ γ(s, a) do
            if s′ is not already in Envelope then do
                add s′ to Envelope
                V(s′) ← V0(s′)
        AO*-Update(s)    (ii)
branch, the traversal stops when it reaches a goal or a fringe state. The
selection of which fringe state to expand is arbitrary. This choice does
not change the convergence properties of the algorithm but may affect its
efficiency. The expansion of a state s generally changes V(s). This requires
updating s and all its ancestors in the Envelope.
AO*-Update(s)
    Z ← {s}
    while Z ≠ ∅ do
        select s ∈ Z such that Z ∩ γ̂(s, π) = {s}
        remove s from Z
        Bellman-Update(s)
        Z ← Z ∪ {s′ ∈ Envelope | s ∈ γ(s′, π(s′))}
Another is to label solved states to avoid revisiting them. Because the space
is acyclic, a state s is solved if either it is a goal or all the successors of s
in γ(s, π(s)) after an update are solved.
Figure 6.6: An acyclic domain with 17 states, s0 to s16, and three actions
a, b, and c; connectors are labeled with the action name and its cost.
Example 6.23. Consider the domain in Figure 6.6, which has 17 states, s0
to s16, and three actions a, b, and c. Connectors are labeled by the action
name and cost; we assume uniform probability distributions. Let us take
V0(s) = min_a {cost(s, a)} and Sg = {s12, s15, s16}.
AO* terminates after 10 iterations, which are summarized in Table 6.2.
In the first iteration, V(s0) = min{5 + (2+4)/2, 19 + 15, 12 + (5+9)/2} = 8. In the
second iteration, V(s1) = min{7.5, 24.5, 7}; the update changes V(s0) but
not π(s0). Similarly after s2 is expanded. When s6 is expanded, the updates
change π(s2), π(s1), and π(s0). The latter changes again successively after
s3, s4, and s9 are expanded: π(s0) = c. π(s2) changes after s11 and then s13 are
expanded. After the last iteration, the update gives π(s0) = π(s1) = π(s2) =
π(s5) = π(s11) = a and π(s4) = b; the corresponding solution graph has no
Let us now discuss best-first search for a cyclic search space, for which
updates cannot be based on a bottom-up, stage-by-stage procedure. LAO*
handles this general case. It corresponds to Algorithm 6.8 where step (ii) is
replaced by a call to LAO*-Update(s). The latter (Algorithm 6.10) performs
a VI-like series of repeated updates that are limited to the states on which
the expansion of s may have an effect. This is the set Z of s and all its
ancestors along the current π. Again, Z is not limited to γ̂(s0, π).
LAO*-Update is akin to an asynchronous VI focused by the current π. However,
an update may change the current π, which may introduce new fringe states.
Consequently, the termination condition of LAO*-Update is the following:
either an update introduces new fringe states in γ̂(s0, π), or the residual is
at most η over all updated states.
The preceding pseudo-code terminates with a solution but no guarantee
of its optimality. However, if the heuristic V0 is admissible, then the bounds
of Proposition 6.17 apply. A procedure such as GAP (Algorithm 6.6) can be
LAO*-Update(s)
    Z ← {s} ∪ {s′ ∈ Envelope | s ∈ γ̂(s′, π)}
    iterate until termination condition
        for every s ∈ Z do
            Bellman-Update(s)
ILAO*(Σ, s0, Sg, V0)
    Envelope ← {s0}
    while γ̂(s0, π) has fringe states do
        for each s visited in a depth-first post-order traversal of γ̂(s0, π) do
            unless s has already been visited in this traversal do
                if s is a fringe state then expand s
                Bellman-Update(s)
    perform VI on γ̂(s0, π) until termination condition
γ̂(s0, π) is solved if and only if C has no open state and every other compo-
nent C′ reachable from a state in C is solved.
Proof. The proof follows from the fact that the strongly connected com-
ponents of a graph define a partition of its vertices into a DAG (see Ap-
pendix B). If C meets the conditions of the proposition, then for every s ∈ C,
γ̂(s, π) has no open state; hence s is solved.
HDP(s)
    if s ∈ Sg then label s solved
    if s is solved then return false
    else if (residual(s) > η) ∨ Solved-SCC(s, false) then do
        Bellman-Update(s)
        return true
Note that the disjunction produces a recursive call only when its first clause
is false. HDP and Solved-SCC return a binary value that is true if and only
if s or one of its descendants has been updated.
Procedure Solved-SCC (Algorithm 6.13) finds strongly connected com-
ponents and labels them as solved if they meet the conditions of Proposi-
tion 6.24. It is very close to Tarjan's algorithm. It has a second argument
that stands for a binary flag, true when s or one of its descendants has been
updated. Its differences with the original algorithm are the following. In
step (i) the recursion is through calls to HDP, while maintaining the updated
flag. In step (ii), the test for a strongly connected component is performed
only if no update took place below s. When the conjunction holds, then
s and all states below s in the depth-first traversal tree make a strongly
connected component C and are not open. Further, all strongly connected
components reachable from these states have already been labeled as solved.
Hence, states in C are solved (see details in Appendix B).
Procedure HDP is repeatedly called on s0 until it returns false, that is,
until s0 is solved. Appropriate reinitializations of the data structures needed
by Tarjan's algorithm (i ← 0, stack, and index undefined for states in the
Envelope) have to be performed before each call to HDP(s0). Finally, for
the sake of simplicity, this pseudocode does not differentiate a fringe state
from other open states: expansion of fringe states (over all its successors for
all applicable actions) is simply performed in HDP as an update step.
HDP inherits the properties of Find&Revise: with an admissible heuris-
tic V0, it converges asymptotically with η to the optimal solution; when V0
is also monotone, its complexity is bounded by 1/η × Σ_{s∈S} [V*(s) − V0(s)].
Solved-SCC(s, updated)
    index(s) ← low(s) ← i
    i ← i + 1
    push(s, stack)
    for all s′ ∈ γ(s, π(s)) do
        if index(s′) is undefined then do
            updated ← HDP(s′) ∨ updated    (i)
            low(s) ← min{low(s), low(s′)}
        else if s′ is in stack then low(s) ← min{low(s), low(s′)}
    if (¬ updated) ∧ (index(s) = low(s)) then do    (ii)
        repeat
            s′ ← pop(stack)
            label s′ solved
        until s′ = s
    return updated
IDA* (Iterative Deepening A*) proceeds by repeated depth-first, heuris-
tically guided explorations of a deterministic search space. Each iteration
goes deeper than the previous one and, possibly, improves the heuristic esti-
mates. Iterations are pursued until an optimal path is found. The extension
of IDA* to And/Or graphs is called LDFS; it also performs repeated depth-
first traversals, where each traversal defines a graph instead of a path.
We first present a simpler version of LDFS called LDFSa (Algo-
rithm 6.14), which handles only acyclic SSPs. LDFSa does a recursive depth-
first traversal of the current γ̂(s0, π). A traversal expands fringe states, up-
dates open states, and labels as solved the states that do not, and will not in
the future, require updating. LDFSa(s0) is called repeatedly until it returns
s0 as solved.
In an acyclic SSP, a state s is solved when either it is a goal or
residual(s) ≤ η and all its successors in γ(s, π(s)) are solved. This is expressed
in line (ii) for the current action a of iteration (i).
Iteration (i) skips actions that do not meet the preceding inequality.
It proceeds recursively on the successor states of an action a that meets this
inequality. If these recursions return false for all the successors in γ(s, a),
then updated = false at the end of the inner loop (iii); iteration (i) stops
and s is labeled as solved. If no action in s meets inequality (ii) or if the
LDFSa(s)
    if s ∈ Sg then label s solved
    if s is solved then return true
    updated ← true
    iterate over a ∈ Applicable(s) and while (updated)    (i)
        if |V(s) − [cost(s, a) + Σ_{s′∈γ(s,a)} Pr(s′ | s, a) V(s′)]| ≤ η then do    (ii)
            updated ← false
            for each s′ ∈ γ(s, a) do    (iii)
                updated ← LDFSa(s′) ∨ updated
            if updated then Bellman-Update(s)
            else do
                π(s) ← a
                label s solved
    return updated
such action exists, then residual(s) > η and both procedures LDFS and HDP
perform a normal Bellman update. Partial empirical tests show that LDFS
is generally, but not systematically, faster than HDP.
LDFS is an instance of the Find&Revise schema. It inherits its conver-
gence and complexity properties, including the bound on the number of
trials when used with an admissible and monotone heuristic.
V0 is admissible and monotone. When |Applicable(s)| and |γ(s, a)| are small,
one may perform a Bellman update in s and use the following function V1
instead of V0:

V1(s) = 0 if s ∈ Sg,
V1(s) = min_a {cost(s, a) + Σ_{s′∈γ(s,a)} Pr(s′ | s, a) V0(s′)} otherwise.
each probabilistic action into a few deterministic ones (as seen in Sec-
tion 5.5).
We can map a nondeterministic domain Σ = (S, A, γ) into a deterministic
one Σd = (S, Ad, γd) with the following property: ∀s ∈ S, a ∈ A, and
s′ ∈ γ(s, a), there is an a′ ∈ Ad with s′ = γd(s, a′) and cost(s, a′) = cost(s, a). In other words,
Σd contains a deterministic action for each nondeterministic outcome of an
action in Σ. This is the all-outcomes determinization, as opposed to the
most-probable-outcomes determinization. In the latter, Ad contains deter-
ministic actions only for states s′ ∈ γ(s, a) such that Pr(s′ | s, a) is above
some threshold. For SSPs in factorized representation, it is straightforward
to obtain Σd from Σ.
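The mapping is also immediate on an explicit dictionary encoding. The following Python sketch, under our own encoding of Σ (gamma[s][a] mapping each successor to its probability), produces the all-outcomes determinization by turning each probabilistic outcome into a separate deterministic action; a positive threshold yields a most-probable-outcomes variant.

def all_outcomes_determinization(gamma, cost, threshold=0.0):
    # threshold = 0.0 keeps every outcome (all-outcomes determinization);
    # a positive threshold keeps only likely outcomes (most-probable variant)
    gamma_d, cost_d = {}, {}
    for s, acts in gamma.items():
        for a, outcomes in acts.items():
            for i, (t, p) in enumerate(outcomes.items()):
                if p >= threshold:
                    gamma_d.setdefault(s, {})[(a, i)] = t  # one successor
                    cost_d.setdefault(s, {})[(a, i)] = cost[s][a]
    return gamma_d, cost_d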
Let h*(s) be the cost of an optimal path from s to a goal in the all-
outcomes determinization Σd, with h*(s) = ∞ when s is a dead end, implicit
or explicit. It is simple to prove that h* is an admissible and monotone
heuristic for Σ. But h* can be computationally expensive, in particular for
detecting implicit dead ends. Fortunately, heuristics for Σd are also useful
for Σ.
Run-Lookahead(Σ, s0, Sg)
    s ← s0
    while s ∉ Sg and Applicable(s) ≠ ∅ do
        a ← Lookahead(s, θ)
        if a = failure then return failure
        else do
            perform action a
            s ← observe resulting state
In addition to s′, Sample may also return the cost of the transition from s
to s′. A planning algorithm interfaced with such a Sample function does not
need a priori estimates of the probability and cost distributions of a domain
Σ. A domain simulator is generally the way to implement the function
Sample.
RFF(Σ, s0, Sg, θ)
    π ← Det-Plan(Σd, s0, Sg)
    while ∃s ∈ γ̂(s0, π) such that
            s ∉ Sg ∧ π(s) is undefined ∧ Pr(s | s0, π) ≥ θ do
        π ← π ∪ Det-Plan(Σd, s, Sg ∪ Targets(π, s))
Plan, denoted Targets(π, s), can be the already computed Dom(π) or any
subset of it. If the entire Dom(π) is too large, the overhead of using it in
Det-Plan can be larger than the benefit of reusing paths already planned in
π. A trade-off reduces Targets(π, s) to k states already in the domain of π.
These can be taken randomly in Dom(π) or chosen according to some easily
computed criterion.
Computing Pr(s | s0, π) can be time-consuming (a search and a sum over
all paths from s0 to s with π). This probability can be estimated by sam-
pling. A number of paths starting at s0 following π are sampled; this allows
estimation of the frequency of reaching non-goal states that are not in the
domain of π. RFF terminates when this frequency is lower than θ.
Algorithm 6.16 requires (as FS-Replan does) a domain without reachable dead ends. However, RFF can be extended to domains with avoidable dead ends, that is, where s0 is safe. This is achieved by introducing a backtrack point in a state s that is either an explicit dead end or for which Det-Plan fails. That state is marked as unsafe; a new search starting from its predecessor s′ is attempted to change π(s′) and avoid the previously failed action.
The RFF algorithm does not attempt to find an optimal or near-optimal solution. However, the offline version of RFF finds a probabilistically safe solution, in the sense that the probability of reaching a state not in the domain of π, whether safe or unsafe, is upper bounded by θ.
LRTDP(Σ, s0, g, V0)
  until s0 is solved or planning time is over do
    LRTDP-Trial(s0)

LRTDP-Trial(s)
  visited ← empty stack
  while s is unsolved do
    push(s, visited)
    Bellman-Update(s)
    s ← Sample(s, π(s))
  s ← pop(visited)
  while Check-Solved(s) is true and visited is not empty do
    s ← pop(visited)
Check-Solved(s)
  flag ← true
  open ← closed ← empty stack
  if s is unsolved then push(s, open)
  while open is not empty do
    s ← pop(open)
    push(s, closed)
    if |V(s) − Q(s, π(s))| > η then flag ← false
    else for all s′ ∈ γ(s, π(s)) do
      if s′ is unsolved and s′ ∉ open ∪ closed
        then push(s′, open)
  if flag = true then do
    for all s′ ∈ closed, label s′ solved    (i)
  else do
    while closed is not empty do
      s ← pop(closed)
      Bellman-Update(s)
  return flag
Algorithm 6.18: Procedure to check and label solved states for LRTDP.
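The two listings above can be put together in a compact Python sketch, under the hypothetical domain interface used in the earlier sketches; it assumes, as for any SSP, that a goal is reachable from every encountered state, and it is a reading of the pseudocode, not the authors' implementation.

import random

def lrtdp(s0, goals, applicable, outcomes, cost, v0, eta=1e-3, max_trials=10000):
    V, pi, solved = {}, {}, set()

    def value(s):
        return 0.0 if s in goals else V.get(s, v0(s))

    def q(s, a):
        return cost(s, a) + sum(pr * value(sp) for sp, pr in outcomes(s, a).items())

    def greedy(s):
        return min(applicable(s), key=lambda a: q(s, a))

    def bellman_update(s):
        pi[s] = greedy(s)
        V[s] = q(s, pi[s])

    def sample(s, a):
        succs, probs = zip(*outcomes(s, a).items())
        return random.choices(succs, weights=probs)[0]

    def check_solved(s):
        flag, open_, closed = True, [], []
        if s not in solved:
            open_.append(s)
        while open_:
            s = open_.pop()
            closed.append(s)
            a = greedy(s)
            if abs(value(s) - q(s, a)) > eta:
                flag = False           # residual too large: do not expand below s
            else:
                for sp in outcomes(s, a):
                    if sp not in solved and sp not in goals \
                            and sp not in open_ and sp not in closed:
                        open_.append(sp)
        if flag:
            solved.update(closed)      # label every traversed state solved
        else:
            while closed:              # otherwise revise their values
                bellman_update(closed.pop())
        return flag

    trials = 0
    while s0 not in solved and s0 not in goals and trials < max_trials:
        trials += 1
        visited, s = [], s0
        while s not in solved and s not in goals:  # greedy descent with updates
            visited.append(s)
            bellman_update(s)
            s = sample(s, pi[s])
        while visited and check_solved(visited.pop()):
            pass
    return pi, V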
The simple argument used for Proposition 6.9 applies here: policy π dominates the base policy π0.
The multiple rollout approach performs k similar simulated walks of h steps for each action a applicable in s (see Figure 6.7(b)). It then averages their costs to assess Q^h_{π0}(s, a). This approach is probabilistically approximately correct; that is, it provides a probabilistically safe solution (not guaranteed to be safe) whose distance to the optimum is bounded. It requires a number of calls to Sample equal to |Applicable(s)| × k × h.
Figure 6.7: (a) Single Monte Carlo rollout; (b) multiple rollout.
The parameters h and k bound the tree in depth and in sampling width, respectively (see Figure 6.8). At depth h, a leaf of the tree gets as its value a heuristic estimate given by V0. In an interior state s, and for each action a applicable in s, k successors are randomly sampled. The average of their estimated values is used to compute recursively the cost-to-go Q(s, a). The minimum over all actions in Applicable(s) gives π(s) and V(s), as in Bellman-Update.
SLATE(s, h, k)
  if h = 0 then return V0(s)
  if s ∈ Sg then return 0
  for each a ∈ Applicable(s) do
    samples ← ∅
    repeat k times
      samples ← samples ∪ {Sample(Σ, s, a)}
    Q(s, a) ← cost(s, a) + (1/k) Σ_{s′∈samples} SLATE(s′, h − 1, k)
  π(s) ← argmin_a {Q(s, a)}
  return Q(s, π(s))

Algorithm 6.19: The SLATE procedure.
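A direct Python transcription follows, assuming a generative simulator sample(s, a) and the hypothetical domain functions used earlier; it returns both the estimated cost-to-go and the greedy action at s.

def slate(s, h, k, goals, applicable, cost, sample, v0):
    if s in goals:
        return 0.0, None
    if h == 0:
        return v0(s), None
    best_q, best_a = float("inf"), None
    for a in applicable(s):
        # average the recursive estimates over k sampled successors
        avg = sum(slate(sample(s, a), h - 1, k, goals, applicable,
                        cost, sample, v0)[0] for _ in range(k)) / k
        if cost(s, a) + avg < best_q:
            best_q, best_a = cost(s, a) + avg, a
    return best_q, best_a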
Assuming that a goal is reachable from every state, SLATE has the following properties:
• It defines a near-optimal policy: the difference |Vπ(s) − V∗(s)| can be bounded as a function of h and k.
• It runs with a worst-case complexity independent of |S|, in O((αk)^h), where α = max_s |Applicable(s)|.
• As a Monte Carlo rollout, it does not require probability distribution parameters: calls to Sample(Σ, s, a) return states in γ(s, a) distributed according to Pr(s′|s, a), which allows one to estimate Q(s, a).
Note the differences between SLATE and the multiple rollout approach: the latter is polynomial in h, but its approximation is probabilistic. SLATE provides a guaranteed approximation, but it is exponential in h. More precisely, SLATE returns a solution whose distance to the optimal policy is upper bounded, |Vπ(s) − V∗(s)| < ε; it runs in time O((1/ε)^{log(1/ε)}).
A few improvements can be brought to this procedure. One may reduce the sampling rate with the depth of the state: the deeper a state is, the less influence it has on the cost-to-go of the root. Further, samples can be implemented as a set with counters on its elements, so as to perform a single recursion for duplicate samples.
Figure 6.8: Sparse sampling tree of procedure SLATE.
UCT and Monte Carlo Tree Search. SLATE has an important limitation: it has no sampling strategy. All actions in Applicable(s) are looped through and explored in the same way. A sampling strategy would allow further exploration of a promising action and would rapidly prune out inferior options, although no action should be left untried. A sampling strategy seeks a trade-off between the number of times an action a has been sampled in s and the value Q(s, a). This trade-off is used to guarantee, with high probability, an approximate solution while minimizing the search.
UCT, Algorithm 6.20, is a Monte Carlo Tree Search technique that builds up such a sampling strategy. Like SLATE, UCT expands, to a bounded depth, a tree rooted at the current node. However, it develops this tree in a nonuniform way. At an interior node of the tree in a state s, it selects a trial action a with the strategy described subsequently. It samples a successor s′ of s along a. It estimates the value of s′, and it uses this estimate to update Q(s, a); it also updates the predecessors of s up to the root node, by averaging over all previously sampled successors in γ(s, a), as is done in SLATE.
UCT(s, h)
  if s ∈ Sg then return 0
  if h = 0 then return V0(s)
  if s ∉ Envelope then do
    add s to Envelope
    n(s) ← 0
    for all a ∈ Applicable(s) do
      Q(s, a) ← 0; n(s, a) ← 0
  Untried ← {a ∈ Applicable(s) | n(s, a) = 0}
  if Untried ≠ ∅ then a ← Choose(Untried)
  else a ← argmin_{a∈Applicable(s)} {Q(s, a) − C [log(n(s))/n(s, a)]^{1/2}}
  s′ ← Sample(Σ, s, a)
  cost-rollout ← cost(s, a) + UCT(s′, h − 1)
  Q(s, a) ← [n(s, a) · Q(s, a) + cost-rollout]/(1 + n(s, a))
  n(s) ← n(s) + 1
  n(s, a) ← n(s, a) + 1
  return cost-rollout

Algorithm 6.20: The UCT procedure.
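A Python sketch of this procedure is given below; it keeps the Q, n(s), and n(s, a) tables in closures and subtracts the exploration term because we are minimizing costs. The exploration constant C and the domain functions are, as before, assumptions of the sketch.

import math, random

def make_uct(goals, applicable, cost, sample, v0, C=1.0):
    Q, n_s, n_sa = {}, {}, {}

    def uct(s, h):
        if s in goals:
            return 0.0
        if h == 0:
            return v0(s)
        if s not in n_s:                        # add s to the envelope
            n_s[s] = 0
            for a in applicable(s):
                Q[(s, a)], n_sa[(s, a)] = 0.0, 0
        untried = [a for a in applicable(s) if n_sa[(s, a)] == 0]
        if untried:
            a = random.choice(untried)
        else:                                   # UCB-style selection, cost version
            a = min(applicable(s),
                    key=lambda b: Q[(s, b)]
                    - C * math.sqrt(math.log(n_s[s]) / n_sa[(s, b)]))
        sp = sample(s, a)
        cost_rollout = cost(s, a) + uct(sp, h - 1)
        Q[(s, a)] = (n_sa[(s, a)] * Q[(s, a)] + cost_rollout) / (1 + n_sa[(s, a)])
        n_s[s] += 1
        n_sa[(s, a)] += 1
        return cost_rollout

    return uct, Q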
It was shown that the preceding selection strategy minimizes the number
of times a suboptimal action is sampled. UCT can also be proved to converge
asymptotically to the optimal solution.
All approaches described in this Section 6.4.4 can be implemented as
memoryless procedures (in the sense discussed in Section 6.4.1). They are
typically used in a receding horizon Run-Lookahead schema. This simplifies
the implementation of the planner, in particular when the lookahead bounds
are not uniform and have to be adapted to the context. This has another
important advantage that we have not discussed up to now: the capability to generate non-stationary policies, possibly stochastic ones. Indeed, an actor may find it desirable to apply a different action on its second visit to s than
on its first. For finite horizon problems in particular, non-stationary policies
can be shown to outperform stationary ones.
If the probability and cost distributions can be acquired offline and are stable, and if the computational time remains compatible with the acting constraints, the planning algorithms of Section 6.3 can be used to compute an optimal or near-optimal policy. However, these conditions are often not met. The online lookahead techniques of Section 6.4 are usually better adapted to acting with probabilistic models. This is particularly the case when a generative sampling model can be designed. The sampling techniques of the previous section, when combined with informed heuristics V0, are able to drive lookahead techniques efficiently. A stochastic Sample function basically allows one to run, to a controlled depth, several simulations for choosing the next best command to pursue the refinement of an action.
Example 6.28. Consider Example 3.4 of opening a door. We can specify the corresponding action with a single refinement method, whose model is partly pictured in Figure 6.9. For the sake of simplicity, the acting states are simply labeled, instead of being fully defined by their state variables as in Example 3.4. For example, s2 corresponds to the case in which the door is closed; in s3, it is cracked; locked and blocked are two failure cases, while open is the goal state. Furthermore, the figure does not give all applicable actions in a state: for example, there are several grasps in s2 and s3 (left or right hand, on a T-shaped or spherical handle) and several turns in s4. Parameter values are also not shown.
Figure 6.9: Part of the model of a refinement method for opening a door.
Recall that an acting engine not only has to refine actions into commands but also to react to events. Probabilistic models and techniques are also relevant when the role of a method is to respond to an event instead of a task. Probabilistic methods can be convenient for specifying reactions to unexpected events.
Example 6.29. Consider a simple service robot domain, called PAMp, with one robot rbt and four locations {pier1, pier2, exit1, exit2}. At each location, there are containers of different types. The robot can move between locations; it can take a container from a location and put a container at a location. The motion is deterministic, and the four locations are pairwise adjacent. Actions take and put are constrained by the activity in the corresponding location: if it is busy, these actions fail to achieve their intended effect and do nothing. A location becomes or ceases to be busy randomly, with probability p. We model this as an exogenous event, switch(l), that switches the busy attribute of location l. We assume at this stage full knowledge of the state of the world. This simple domain is modeled with the following state variables:
loc(r) ∈ {pier1, pier2, exit1, exit2}: location of robot r;
Even a domain as simple as PAMp can have a huge state space (up to 4 × 8^{k+1} × 3 × 4^2, that is, about 1.6 × 10^12 for k = 10), forbidding an explicit enumeration or a drawing such as Figure 6.1. An adequate specification of the actions in the previous example has to take into account their effects as well as the effects of concurrent exogenous events. Indeed, recall that nondeterminism accounts for the possible outcomes of an action a when the world is static, but also for events that may happen in the world while a is taking place and have an impact on the effects of a. Hence, γ(s, a) represents the set of possible states corresponding to the joint effects of a and concurrent exogenous events. When the sets γ(s, a) are not too large, probabilistic precondition-effect operators, illustrated next, can be a convenient representation.
The take action is similarly specified: when the robot location l is not busy and contains at least one container of the requested type c, then take may either lead to a state where l is busy with no other effect, or it may achieve its intended effects, a container of type c being loaded and ctrs(l, c) being reduced by one. For each of these two cases, additional switch events may occur in any of the three other locations. The specification of the put action is similar (see Exercises 6.17 and 6.18).
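To make the combinatorics concrete, here is a small Python sketch that enumerates the joint outcomes of an action's proper effects with independent switch events in the other locations; the outcome labels and state encoding are purely illustrative.

from itertools import product

def joint_outcomes(proper_outcomes, other_locations, p):
    # proper_outcomes maps each proper effect of the action to its probability;
    # each other location independently switches its busy flag with probability p.
    result = {}
    for (effect, pr_a), flips in product(
            proper_outcomes.items(),
            product((False, True), repeat=len(other_locations))):
        switched = frozenset(l for l, f in zip(other_locations, flips) if f)
        pr = pr_a
        for f in flips:
            pr *= p if f else (1 - p)
        result[(effect, switched)] = pr
    return result

# e.g., take succeeds unless the location turns busy (probability p):
# joint_outcomes({"loaded": 1 - p, "busy": p}, ["pier2", "exit1", "exit2"], p)
# gives 2 * 2**3 = 16 joint outcomes whose probabilities sum to 1.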
A typical task for the robot in domain PAMq is to move all red containers
to exit1 and all blue ones to exit2.
With only three exogenous events, as in PAMq, the joint effects of actions and events become complex: the size and intricacy of γ(s, a) reach a point where the specification of precondition-effect operators is not easy (see Exercise 6.19). Bayesian networks are an appropriate representation for expressing conditional distributions over random variables, and they offer powerful techniques for reasoning on these distributions. A Bayesian network is a convenient way of specifying a joint probability function over a collection of random variables. It is a DAG whose nodes are the random variables, associated with prior or conditional probability distributions. An edge between two random variables x and y expresses a conditional probability dependence of y with respect to x. Dynamic Bayesian networks (DBNs) extend the static representation to handle different stages in time of the same variables. They are particularly convenient in our case for expressing probabilistic state transitions from s to γ(s, a), with a focus on the state variables relevant for the action a and the events that may take place concurrently with a. This is illustrated in the following example.
Figure 6.10: DBN for the action take in the PAMq domain, relating busy(l), ctrs(l, τ), load(r), loc(r), ctrs(l1, τ1), and ctrs(l2, τ2) to their values in the next state; the conditional probability table for busy(l) gives Prob[busy′(l)=T] = 1 − p when busy(l) = T, and p when busy(l) = F.
Example 6.32. Figure 6.10 represents the DBN characterizing action take in the PAMq domain. It shows the state variables that condition or are affected by take and the events switch, arrival, and departure. If x is a state variable of state s, we denote by x′ that same state variable in s′ ∈ γ(s, a). Here, we extend the tabular representation of conditional probabilities with conditional probability trees, as illustrated in Figure 6.11.
Figure 6.11: Conditional probability trees for the ctrs state variables for the action take combined with the possible events switch, arrival, and departure: (a) accounts for the arrival of a container at a pier location, (b) for a departure at an exit location, and (c) for a container being taken by the robot.
Example 6.33. Figure 6.11 gives the conditional probabilities for the ctrs variables in the DBN of Figure 6.10. The leaves of each tree give the probability that the number of containers of some type at some location increases or decreases by one container (the probability that this number remains unchanged is the complement to 1). To simplify the picture, we take p = .05 and q = q′ = .15. Tree (a) accounts for the possible arrival of a container of some type at a pier location: if the location is full (ctrs(l1, τ1) = K) or if the robot is in that location (loc(r) = l1), then no container arrival is possible; otherwise there is a probability of .15 × .6 for the arrival of a red container at pier1, and so on. Similarly, tree (b) accounts for the departure of a container at an exit location. Tree (c) gives the proper effect of action take: the probability that ctrs changes is conditioned by the five ancestor variables of that node in the DBN.
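Tree (a) is simple enough to state in a few lines of Python; the function signature and default arguments merely restate the numbers of this example (capacity K, q = .15, and a .6 share of red arrivals at pier1) and are otherwise illustrative.

def pr_arrival_red_pier1(ctrs_l1_red, robot_loc, K=10, q=0.15, red_share=0.6):
    # Probability that ctrs(pier1, red) increases by one, per tree (a).
    if ctrs_l1_red >= K or robot_loc == "pier1":
        return 0.0            # full location, or the robot is there: no arrival
    return q * red_share      # .15 * .6 = .09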
Figure 6.12: DBN relating the state variables busy(l), ctrs(l), load(r), loc(r), pos(c), and hue(c) to their values in the next state.
Table (a) gives Prob[type=red | hue] = .05, .1, .85, and .95 for hue = a, b, c, and d. In tree (b), Prob[load(r)=red] is 0 when busy(l) = T, when ctrs(l, τ) = 0, when loc(r) ≠ l, when load(r) ≠ empty, or when pos(c) ≠ l; otherwise it depends on hue(c): 0 for unknown, and .047, .095, .807, and .902 for the values a, b, c, and d, respectively.
Figure 6.13: (a) Conditional probability table for the type of a container given its observed feature; (b) conditional probability trees for load(r)=red when its observed feature has some value. Prob[load(r)=blue] is the complement to one of the numbers in the last four leaves; it is equal to zero in the other leaves, where Prob[load(r)=empty] = 1.
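One plausible reading of the leaf values in tree (b), consistent with the numbers shown, is that they multiply the probability that take is not preempted by a switch to busy (1 − p = .95) by Prob[type=red | hue] from table (a); the following sketch reproduces them under that assumption.

P_RED_GIVEN_HUE = {"a": 0.05, "b": 0.10, "c": 0.85, "d": 0.95}

def pr_load_red(hue, p=0.05):
    if hue == "unknown":
        return 0.0                            # then Prob[load(r)=empty] = 1
    return (1 - p) * P_RED_GIVEN_HUE[hue]     # e.g., hue d: .95 * .95 = .9025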
Incremental-compression-and-search(Σ, s0, Sg)
  while there is an unsolved state s in the current γ̂(s0, π) do
    search for an optimal path from s to a goal,
      until reaching a nondeterministic action a
    compress this path to a single nondeterministic step: π(s) ← a
    revise π with Bellman-Update

Algorithm 6.21: Incremental compression and search.
⟨(s, a1), (s2, a2), . . . , (sk−1, ak−1), (sk, a)⟩ such that actions a1 through ak−1 are deterministic, but a is nondeterministic. It is possible to compress this sequence into a single nondeterministic step (s, a), the cost of which is the sum of the costs of the k steps, and the outcome γ(s, a) of which is the outcome of the last step. This idea can be implemented as sketched in Algorithm 6.21. Its advantage is to focus the costly processing on a small part of the search space.
The notion of sparse probabilistic domains can be extended further to cases in which |γ(s, a)| < k and |Applicable(s)| < m for some small constants k and m. Sampling techniques such as the SLATE procedure (Algorithm 6.19) are particularly useful in these cases.
Finally, let us mention that abstraction and decomposition are also used for computing heuristics and control knowledge to guide or focus a global search. There is a large overlap between abstraction or decomposition methods and the techniques discussed in Section 6.3.5.
Example 6.35. Consider the simplistic domain in Figure 6.14, which has two policies πa and πb. πa(s0) = a; a leads to a goal with probability p in one step, or it loops back on s0. πb starts with action b, which has a few possible outcomes, all of which lead to a goal after several steps without loops; that is, Graph(s0, πb) is acyclic, all its leaves are goal states, and its paths to goals are of length at most k. Both πa and πb are safe policies; πa is cyclic, whereas πb is acyclic. The value of πa is Vπa = cost(a) Σ_{i=0}^{∞} (1 − p)^i = cost(a)/p. The value Vπb of πb is the weighted sum of the costs of the paths of Graph(s0, πb).
Figure 6.14: A simple domain with two policies: πa, which reaches a goal with probability p or loops back on s0, and πb, which is acyclic.
rely on a heuristic value of the error parameter η. There are even cases in which VI may be used online, for example, in a receding horizon schema: for |S| on the order of a few thousand states, the running time of VI is on the order of a few milliseconds. This may happen in small domains and in well-engineered state spaces.
Most planning problems do not allow for an explicit enumeration of their entire state space. Realistically, a few dozen parametrized state variables, that is, a few hundred ground state variables, may be needed for modeling a realistic domain. The corresponding state space is on the order of d^k, where k is the number of ground state variables and d is the size of their domains. In many practical cases, k is so large (say, a few hundred) that iterating over S is not feasible. Options in such a case are to use focused search algorithms that explore a small part of the search space, as seen in Section 6.3; to refine the model; to decompose the planning problem into feasible subproblems; and to use domain-configurable control knowledge to reduce sharply the branching factor of a problem and allow for significant scaling up, as discussed in Section 6.3.5.
6.8.1 Foundations
models assume that after each state transition the actor knows which state s it has reached; it then proceeds with the action π(s) appropriate for s. The POMDP model considers that the actor does not know its current state, but it knows the value of some observation variable o. It also has a probability distribution Pr(o|s, a) of observing o after running action a in s. This gives it a probability distribution over the possible states it might be in, called the actor's current belief: b(s|a, o). It has been demonstrated by Åström [25] that the last observation o does not summarize the past execution, but the last belief does. Hence, a POMDP planning problem can be addressed as an MDP problem in the belief space. One starts with an initial belief b0 (a distribution over the initial states) and computes a policy that gives for every belief point b an action π(b), leading to a goal also expressed in the belief space.
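The belief update behind this reformulation is just Bayes' rule. A minimal sketch, with beliefs as dictionaries and the assumed functions outcomes(s, a) and pr_obs(o, s, a) standing for Pr(s′|s, a) and Pr(o|s′, a):

def belief_update(b, a, o, outcomes, pr_obs):
    bp = {}
    for s, pr_s in b.items():                 # predict: push b through the action
        for sp, pr_t in outcomes(s, a).items():
            bp[sp] = bp.get(sp, 0.0) + pr_s * pr_t
    for sp in bp:                             # correct: weight by Pr(o | s', a)
        bp[sp] *= pr_obs(o, sp, a)
    z = sum(bp.values())                      # normalization constant
    return {sp: v / z for sp, v in bp.items() if v > 0.0} if z > 0 else bp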
Several approaches generalizing Dynamic Programming or Heuristic Search methods to POMDPs have been proposed, for example, by Kaelbling et al. [305] and Smith and Simmons [552]. Policy search methods for parametrized POMDP policies are studied in Ng and Jordan [453]. Approximate methods that focus Bellman updates on a few belief points (called point-based methods) are surveyed in Shani et al. [530]; they are compared to an extension of RTDP in Bonet and Geffner [87]. Online algorithms for POMDPs are surveyed in Ross et al. [516]. A Monte Carlo sampling approach is proposed by Silver and Veness [538]. Several interesting POMDP applications have been developed, for example in robotics, by Pineau et al. [480], Foka and Trahanias [201], and Guez and Pineau [250]. However, the
POMDP developer faces several difficulties, among which:
• Tremendous complexity: a discretized belief point corresponds to a subset of states; hence the belief space is in O(2^{|S|}). Because |S| is already exponential in the number of state variables, sophisticated algorithms and heuristics do not scale up very far. Significant modeling effort is required for decomposing a domain into small, loosely coupled problems amenable to a solution. For example, the approach of Pineau et al. [480], even though it is applied to a small state space (fewer than 600 states), requires a clever hierarchization technique to achieve a solution.
• A strong assumption (not always highlighted in the POMDP literature): a policy from beliefs to actions requires the action π(b) to be applicable in every state s compatible with a belief b. It is not always the case that the intersection of Applicable(s) for every s compatible with b is meaningful. Sometimes, one would like to be able to choose
and Stentz [191], McMahan and Gordon [418], and Wingate and Seppi [609], or with a backward order of updates from the goals back along a greedy policy in Dai and Hansen [134].
Policy search methods (not to be confused with the Policy Iteration algorithm) deal with parametrized policies and perform a local search in the parameter space of π (for example, gradient descent). The survey of Deisenroth et al. [148] covers in particular their use for continuous-space domains and reinforcement learning problems.
Hansen and Zilberstein [254] developed the LAO∗ algorithm as an extension of the AO∗ algorithm of Nilsson [460]. The Find&Revise schema was proposed by Bonet and Geffner [83], who also developed several instantiations of this schema into heuristic search algorithms such as HDP [83], LRTDP [84], and LDFS [86]. A few other heuristic algorithms are presented in their recent textbook [217, chaps. 6 & 7]. RTDP was introduced by Barto et al. [46]. The domain-configurable control technique presented in Section 6.3.5 was developed by Kuter and Nau [361].
The FF-Replan planner was developed by Yoon et al. [619] in the context of the International Planning Competition. A critical analysis of its replanning technique appears in Little and Thiébaux [389], together with a characterization of probabilistically interesting problems; these problems have dead ends and safe solutions. To take the latter into account, Yoon et al. [620] proposed an online receding horizon planner, called FF-Hindsight, which relies on estimates obtained through averaging and sampling over possible determinizations with a fixed lookahead. The RFF algorithm was proposed by Teichteil-Königsbuch et al. [566]; it has been generalized to hybrid MDPs with continuous state variables [564].
The SLATE procedure is due to Kearns et al. [325]. UCT was proposed by Kocsis and Szepesvári [339]. An AO∗ version of it is described in Bonet and Geffner [88]. UCT is based on Monte Carlo Tree Search techniques that were developed with success for games such as Go by Gelly and Silver [219]. UCT has been implemented in a few MDP planners, such as PROST by Keller and Eyerich [327]. An extension of UCT addressing POMDPs is studied by Silver and Veness [538].
Several authors have exploited determinization techniques in probabilistic planning, for example, Boutilier et al. [92] for pruning unnecessary Bellman updates, Karabaev and Skvortsova [313] for performing Graphplan-like reachability analysis, Bonet and Geffner [85] for computing heuristics for the mGPT planner, and Teichteil-Königsbuch et al. [568] also for computing heuristics. Proposition 6.25 is demonstrated in the latter reference.
For many planners, implicit dead ends can lead to inefficiency or even
been developed in Dean and Lin [144]. Hauskrecht et al. [263] propose approximate solutions to large MDPs with macro actions, that is, local policies defined in particular regions of the state space. The approach of Barry et al. [42] and their DetH algorithm [43] clusters a state space into aggregates of closely connected states, then uses a combination of determinization at the higher level and VI at the lower level of a hierarchical MDP.
Sparse probabilistic domains have been studied in particular by Busoniu
et al. [103] and Likhachev et al. [386]. The path compression technique of
Algorithm 6.21 is detailed in the latter reference.
6.9 Exercises
6.1. In the domain of Example 6.4, consider a policy π such that π(s0) = Both. Is π a safe policy when s0 is either (acb), (bca), or (cba)? Is it safe when s0 is (bac) or (cab)?
6.2. Prove that the recursive Equation 6.3 follows from the definition of Vπ(s) in Equation 6.1.
6.5. Run algorithm PI on the problem of Figure 6.15, starting from the following policy: π0(s1) = π0(s2) = a, π0(s3) = b, π0(s4) = c.
(a) Compute Vπ0(s) for the four non-goal states.
(b) What is the greedy policy for Vπ0?
(c) Iterate on the above two steps until reaching a fixed point.
Figure 6.15: An SSP problem with five states and four actions a, b, c, and d;
only action a is nondeterministic, with the probabilities shown in the figure;
the cost of a and b is 1, the cost of c and d is 100; the initial state is s1 ; the
goal is s5 .
6.6. Run VI on the problem of Figure 6.15 with η = .5 and the following heuristics:
(a) V0(s) = 0 in every state.
(b) V0(s1) = V0(s2) = 1, and V0(s) = 100 for the two other states.
6.7. In the problem of Figure 6.15, add a self-loop as a nondeterministic effect for actions b, c, and d; that is, add s to γ(s, a) for these three actions wherever applicable. Assume that the corresponding distributions are uniform. Solve the two previous exercises on this modified problem.
6.8. Run AO∗ on the domain of Figure 6.6 with the heuristic V1 of Section 6.3.5.
6.9. Modify the domain of Figure 6.6 by making the state s12 an explicit dead end instead of a goal; run AO∗ with the heuristics V0 and V1 of Section 6.3.5.
6.10. Prove that algorithm LAO∗ is an instance of the Find&Revise schema.
6.11. Modify the domain of Figure 6.6 by changing γ(s9, a) = {s3, s8} and making the state s15 an explicit dead end instead of a goal. Run LAO∗ and ILAO∗ on this problem and compare their computation steps.
6.13. Run RFF on the problem of Figure 6.15 with θ = 0.7. Suppose the Det-Plan subroutine calls the same Forward-Search algorithm as in the previous exercise and turns the plan into a policy. What is π after one iteration of the while loop?
6.15. Run Algorithm 6.21 on the problem of Figure 6.15; compare with the
computations of RFF on the same problem.
6.17. Write the probabilistic precondition-effect operators for the take and put actions of the domain PAMp (Example 6.30). How many ground actions are there in this domain?
6.18. Implement and run algorithm VI for a few problem instances of the domain PAMp. Up to how many containers does your implementation scale up?
6.19. For the domain in Example 6.31, analyze the interactions between the arrival, departure, and switch events and the actions take and put. Compute the sets γ(s, take) and γ(s, put) for different states s.
6.20. Analyze a generalized PAMq domain where the arrival and departure
of containers can take place even in the robot location. Define conditional
probability trees for the variable ctrs.
1. The material of some sections in this chapter is based on that survey.
7.1 Perceiving
Deliberation is mostly needed for an actor facing a diversity of situations in an open environment. Such an actor generally has partial knowledge about the initial state of the world and its possible evolution. It needs to be able to perceive what is relevant for its activity and to deliberate about its perception, while acting and perceiving.
Reasoning on perception raises several problems, among them those of:
• Reliability: how reliable are sensing and information-gathering actions? What verification and confirmation steps are needed to confirm that the sensed value of a state variable is correct? How should the distribution of values be assessed if uncertainty is explicitly modeled?
• Observability: how can information about non-observable state variables be acquired from the observable ones? How should costly observations be balanced against approximate estimates?
• Persistence: how long can one assume that a state variable keeps its previous value, as long as new observations do not contradict it?
Furthermore, there are numerous additional sensing problems for a physical actor: reasoning on and determining how to handle its sensors (how and where to use a sensor, how to process and qualify the given data), as well as performing information-gathering actions through communication with, and queries to, information sources. Handling sensors changes perception reasoning problems considerably.
The details of these problems are beyond the scope of this short overview section. In the following, we mention a few approaches to (i) planning and acting with information-gathering actions, (ii) planning sensing actions, (iii) anchoring and signal-to-symbol matching problems, and (iv) recognizing plans and situations.
• Plan with respect to domain objects and properties that are unknown when planning starts but that can be discovered at acting time through planned information-gathering actions. New facts resulting from these actions will be used to further refine the rest of the plan.
• Query databases for facts the actor specifically needs to address a given planning problem, and query knowledge bases for additional models of its environment that are relevant to the task at hand.
Planning with information gathering is studied by several authors using conditional planning approaches, as in the PKS system of Petrick and Bacchus [476]. The continual planning approach of Brenner and Nebel [97] in MAPL postpones part of the planning process. It introduces information-gathering actions that will later allow development of the missing parts of the plan. The planner uses assertions that abstract the actions to be refined after information gathering. The approach is well adapted to dynamic environments, where planning for subgoals that depend on yet unknown states can be delayed until the required information is available through properly planned information-gathering actions.
Acquiring additional data and models at planning time is inspired by Semantic Web functionalities. For example, the ObjectEval system of Samadi et al. [521] acquires from the Web statistics about the possible locations of objects of different classes. It uses them in a utility function for finding and delivering objects in an office environment. Other approaches use Description Logic (DL), a fragment of first-order logic, to handle statements about objects, properties, relations, and their instances, with inference algorithms for querying large stores of data and models over the Web [27]. Most implementations rely on OWL, the standard Web Ontology Language. OWL handles an open-world representation where facts can be true, false, or unknown. This point is further developed in Section 7.5.3.
object and bringing it to a user) requires other techniques, which are discussed next.
Progress-Plan(π, G)
  loop
    s ← observed current state
    i ← max_j {0 ≤ j ≤ k + 1 | s supports g_j}
    if i = k + 1 then return success
    if i = 0 then return failure
    else perform action a_i
Going back to Example 7.1, if the robot has no action to lock or unlock the door, and if the door is initially unlocked, then door-status(door)=unlocked is an invariant of this domain. Note that the world invariant here qualifies a particular model of the world, not the world itself; monitoring violations of the invariant allows the detection of discrepancies with respect to that model.
Invariants of a planning problem can be synthesized automatically, as shown for example by Kelleher and Cohn [326] or Rintanen [512]. Several authors have used invariants to speed up planning algorithms, for example, Fox and Long [203]. However, at the acting level, we know that the assumption of a static environment does not hold: there can be other state transitions than those due to the actor's actions. For example, the door of Example 7.1 may become locked, thus violating a plan that requires opening that door. The actor has to monitor whether the current state supports the invariants relevant to its plan.
However, the invariants of a planning problem are often not sufficient for the purpose of monitoring. Many of the invariants entailed from (Σ, s0, g) express syntactic dependencies between the variables of the problem; for example, a locked door is necessarily closed; it cannot be open.5 Often, an actor has to monitor specific conditions that express the appropriate context in which its activity can be performed. For example, the robot has to monitor the status of its battery: if the charge level is below a threshold, then at most δ units of time are available in normal functioning before plugging into a recharge station. Such conditions cannot be deduced from the specification of (Σ, s0, g); they have to be expressed specifically as monitoring rules.
A simple approach, proposed by Fraser et al. [207], considers an extended planning problem as a tuple (Σ, s0, g, φ), where φ is a condition expressed formally in the same way as the preconditions of actions. Condition φ is a requirement for planning: π is a solution to the problem if every state s ∈ γ̂(s0, π) supports φ. It is also a requirement for acting: the actor has to monitor at acting time that every state observed while performing a plan supports φ. A violation of this condition, due to any exogenous event or malfunction, means a failure of the plan. It allows quite early detection of infeasible goals or actions, even if the following actions in the plan appear to be applicable and produce their expected effects.
Several authors have developed elaborate versions of the preceding idea, with monitoring rules, in some logical or temporal formalism, associated with
5. The use of multivalued state variables reduces these dependencies, when compared with the use of predicates, but it does not eliminate them.
from the complete synthesized plan, for example, constraints on the persistence of causal links. This automated synthesis of monitoring formulas is not systematic but selective, on the basis of hand-programmed conditions of what needs to be monitored and what does not. Additional monitoring formulas are also specified by the designer in the same expressive temporal logic formalism. For example, a UAV (the application domain of Doherty et al. [158]) should have its winch retracted when its speed is above a given threshold; it can exceed its continuous maximum power by a factor of β for up to δ units of time if this is followed by normal power usage for a period of at least δ′. The system produces plans with concurrent and durative actions, together with conditions to be monitored during execution. These conditions are evaluated online using formula progression techniques. When actions do not achieve their desired results, or when some other conditions fail, recovery via a plan repair phase is triggered.
experiment of Muscettola et al. [441] and Bernard et al. [53] offers a goal reasoning capability. It analyzes the progress of the mission and determines which goals should be satisfied for the next planning window (one to two weeks). The selected goals are passed to the planner, together with constraints that need to be satisfied at waypoints identified by the Mission Manager (for example, the amount of energy left in the batteries should be above a threshold at the end of the planning phase).
There is an analogous manager in the CPEF system of Myers [442],
used in the simulation of operational deployments; this manager provides
appropriate goals to the planner and controls the generation of plans. For
a similar class of applications, the ARTUE system of Molineaux et al. [429]
detects discrepancies when executing a plan. It generates an explanation, possibly produces a new goal, and manages possible conflicts between the goals currently under consideration. It uses decision theory techniques to decide which goal to choose. The approach proposes an original explanation system, which uses assumption-based truth maintenance techniques to find the possible explanations of the observed facts. In Powell et al. [494], the authors
extend ARTUE with a facility for teaching the system new goal selection
rules.
Another example is the Plan Management Agent of Pollack and Horty [492] for handling personal calendars and workflow systems. This system addresses the following functions:
• Commitment management: commits to a plan already produced and avoids new plans that conflict with the existing ones.
• Alternative assessment: decides which of the possible alternative goals and plans should be kept or discarded.
• Plan control: decides when and how to generate a plan.
• Coordination with other agents: takes into account others' commitments and the cost of decisions involving their plans.
That system relies on temporal and causal reasoning. It is able to plan with partial commitments that can be further refined later.
Finally, let us mention a class of approaches, called Goal Driven Autonomy (GDA), for reasoning about possibly conflicting goals and synthesizing new ones. These approaches are surveyed by Hawes [264] and Vattam et al. [577]. The former surveys a number of architectures supporting goal reasoning in intelligent systems. The latter reviews more than 80 contributions on various techniques for goal monitoring, goal formulation, and goal management, organized within a comprehensive goal reasoning analysis framework.
Q(a) ← Q(a) + α [r_{ka}(a) − Q(a)],  with α = 1/ka.    (7.2)
When ka → ∞ for all a, the choice of the action that maximizes the reward is given by argmax_a{Q(a)}. However, as long as the exploration of alternatives has not been sufficient, the actor must try actions other than the estimated best one, according to various heuristics. We can define a function Select_a{Q(a)} that favors the current best action and allows for exploring alternatives by various methods, such as the following (a sketch follows this list):
• choose action argmax_a{Q(a)} with probability (1 − ε) and a randomly drawn action other than argmax_a{Q(a)} with probability ε, where ε decreases with experience; and
• choose an action according to a probabilistic sampling distribution, for example, with Boltzmann sampling, according to a probability distribution proportional to e^{Q(a)/τ}, where the temperature τ decreases with experience.
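Here is a minimal sketch of this incremental estimation with an ε-greedy Select, where ε = 1/t decreases with experience; pull(a) is a hypothetical function returning the reward of action a.

import random

def bandit(actions, pull, trials=1000, q0=0.0):
    Q = {a: q0 for a in actions}
    k = {a: 0 for a in actions}
    for t in range(1, trials + 1):
        epsilon = 1.0 / t                   # exploration decreases with experience
        if random.random() < epsilon:
            a = random.choice(actions)      # explore
        else:
            a = max(actions, key=Q.get)     # exploit the current best estimate
        r = pull(a)
        k[a] += 1
        Q[a] += (r - Q[a]) / k[a]           # Equation 7.2 with alpha = 1/k_a
    return Q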
When the environment is stationary, the update of Q(a) in Equation 7.2 after performing action a becomes increasingly weak as ka grows. If the environment is not stationary, we can keep α constant, with α < 1. Note also that the initialization value q0 fosters exploration if q0 is high with respect to the other rewards. For example, if q0 = rmax, the maximum reward, never-tried actions will be systematically preferred.
With these basic notions, let us now consider the interesting case in which the task at hand requires a sequence of several actions, each interfering with the following ones and influencing the overall success and the sum of rewards.
The framework generally used is that of Markov decision processes, as seen
in Chapter 6. The actor seeks to learn an optimal policy that maximizes
the expected sum of rewards.
One approach is to learn the MDP model and then to apply the planning techniques seen earlier (with rewards instead of costs, and maximization instead of minimization) to find the optimal policy and then use it. Learning a model means collecting enough statistics through an exploratory phase to estimate the probability distributions Pr(s′|s, a) and the rewards r(s, a). This direct approach requires a costly exploratory phase to acquire the model. It is often better to start performing the task at hand, given what is known, while continuing to learn, that is, to combine the two phases of acquiring a model and finding the best action for the current model.
The Q-learning algorithm, Algorithm 7.2, meets this objective while
Q-learning
  loop
    a ← Select_a{Q(s, a)}
    apply action a
    observe resulting reward r(s, a) and next state s′
    Q(s, a) ← Q(s, a) + α [r(s, a) + max_{a′}{Q(s′, a′)} − Q(s, a)]    (i)
    s ← s′
  until termination condition
avoiding the need to build the MDP model explicitly. Using the notation introduced in the previous chapter, Equation 6.6 can be reformulated as:

Q(s, a) = r(s, a) + Σ_{s′∈γ(s,a)} P(s′|s, a) max_{a′} {Q(s′, a′)}.
Q(s, a) ← Q(s, a) + α [r(s, a) + max_{a′} {Q(s′, a′)} − Q(s, a)].    (7.3)
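A sketch of the Q-learning loop in Python, with an ε-greedy Select, a constant learning rate α, and no discount factor, following the formulation above; the environment interface step(s, a), returning a reward and a next state, is an assumption of the sketch.

import random

def q_learning(s0, actions, step, episodes=500, horizon=100,
               alpha=0.1, epsilon=0.1, q0=0.0):
    Q = {}
    def q(s, a):
        return Q.get((s, a), q0)
    for _ in range(episodes):
        s = s0
        for _ in range(horizon):            # a simple termination condition
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: q(s, b))
            r, sp = step(s, a)              # apply a, observe reward and next state
            target = r + max(q(sp, b) for b in actions)
            Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))   # update (i)
            s = sp
    return Q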
Learning from specifications. The former case can be set in the MDP framework. In passive imitation, for example, the teacher provides its demonstrations as a set of sequences {σ1, . . . , σm}; each sequence σ = ⟨s1, a1, s2, a2, . . .⟩ encodes an actual demonstration of the teacher performing an instance of the task to learn. The learner synthesizes a policy on the basis of these demonstrations and from additional interactions with the environment with cost and/or reward feedback. In active imitation, the learner is further able to query the teacher, when needed, about what to do in some state s, that is, about the desirable value of π(s). Each query has a cost that needs to be taken into account in the overall learning process. In a variant, called advice taking, the learner can query which of two sequences σi or σj the teacher recommends as the best. Here the learner has to synthesize (that is, to plan) the most informative sequences for its learning process, given the cost of queries. Research on these issues is active (for example, [608, 303]), but so far it has been applied mostly to academic benchmarks such as those of the Reinforcement Learning Competition (for example, balancing a vertical pole or maintaining equilibrium on a bicycle).6
The partial programming framework offers a quite powerful approach for learning from specifications. The formulation is more general than the above imitation approaches. It relies on semi-Markov decision processes (SMDPs; see Section 6.8.4) and hierarchical nondeterministic finite-state machines. The latter are specified by the teacher as partial programs, with the usual programming constructs augmented with open choice steps, where the best actions remain to be learned. These specifications constrain the class of policies that can be learned using an extended SMDP Q-learning technique. The partial programming framework can yield a significant speed-up in the number of experiments with respect to unguided reinforcement learning, as demonstrated in benchmark problems (for example, the taxi domain of [153], with navigation in a grid to mimic loading and unloading randomly distributed passengers to their destinations). A few partial programming languages and systems have been proposed (for example, Alisp [21, 465] or A2BL [542]) and demonstrated in simulations and video games. The framework seems to be well adapted to the partial specification of acting methods, where operational models are further acquired through learning.
Most of the work on learning from a teacher's actions has tried to avoid the issues of understanding and transposing the demonstrations. For example, in robotics, quite often the demonstrations take place, through various means, in the robot's sensory-motor space. This form of learning through teleoperation, where the teacher acts directly in the actuator and proprioceptive sensor spaces of the robot, is quite successful (see [537, 475] and the survey of [24]), including for quite complex tasks such as helicopter acrobatics [2, 125].
A more ambitious approach would take into account, at a higher level, the need to understand and transpose the teacher's demonstrations. Learning should aim at acquiring a mapping from particular sensory states to plans. These can be obtained by plan recognition methods (for example, [218, 499, 505]). The learner then develops its own plans, taking into account its specific capabilities, to achieve the effects of the teacher's demonstrations. Developments along similar lines are being investigated (for example, [455, 518]). They cover potentially a more general class of behaviors that can be demonstrated by the teacher and acquired by the learner (for example, iterative actions). They also allow for extended generalization, because they foster the acquisition of basic principles and rely on the learner's planning capabilities. They are finally more natural and easier for the teacher, because the teacher's actions are interpreted in terms of their intended effects on the environment rather than as a sequence of low-level commands.
A control automaton H has a finite set of control modes V and can switch from one control mode to another according to the control switches in E. The control graph G = (V, E) is the discrete component of the control automaton H. In each control mode vi, continuous change is modeled through the evolution of the continuous variables in X. The invariant Inv(vi) states a condition over the variables in X that is satisfied whenever H is in control mode vi. Flow(vi) describes how the continuous variables change while H is in control mode vi. Jump(eij = (vi, vj)) is a condition over the continuous variables that determines when the control mode should switch from vi to vj.7
Figure 7.1: Control automaton of the heater: in mode off, Ṫ = −0.1T, with invariant T ≥ 18 and switch condition T < 19; in mode on, Ṫ = 5 − 0.1T, with invariant T ≤ 22 and switch condition T > 21.
G = {V, E}, where V = {on, off} and E = {(on, off), (off, on)}. We suppose that initially the heater is off and the temperature is 20 degrees Celsius: Init(off) ≡ (T = 20). The heater remains off if the temperature is above 18 degrees Celsius: Inv(off) ≡ T ≥ 18. When the heater is off, the temperature falls according to the flow condition Flow(off) ≡ (Ṫ = −0.1T). The heater may turn on as soon as the temperature falls below 19 degrees: Jump((off, on)) ≡ T < 19. Because Inv(off) ≡ T ≥ 18, at the latest the heater will go on when the temperature falls to 18 degrees. When the heater is on, the temperature rises according to Flow(on) ≡ (Ṫ = 5 − 0.1T). The heater continues to heat while the temperature is below 22 degrees Celsius: Inv(on) ≡ T ≤ 22. It turns off when the temperature is higher than 21 degrees: Jump((on, off)) ≡ T > 21.
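A discrete-time simulation of this automaton is easy to write; the sketch below uses Euler integration and resolves the nondeterminism by switching at the earliest jump opportunity, which is one possible choice among the allowed switch times.

def simulate_thermostat(T=20.0, mode="off", dt=0.01, duration=60.0):
    trace = []
    for _ in range(int(duration / dt)):
        if mode == "off" and T < 19:        # Jump((off, on))
            mode = "on"
        elif mode == "on" and T > 21:       # Jump((on, off))
            mode = "off"
        dT = -0.1 * T if mode == "off" else 5 - 0.1 * T   # Flow(off) / Flow(on)
        T += dT * dt
        trace.append((mode, T))
    return trace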
The intended meaning of the Jump condition is that the switch takes place nondeterministically for any value that satisfies the control switch condition Jump. For instance, suppose that the variable x changes value in a control mode vi by starting from a negative value and increasing monotonically, and suppose that the control switch Jump((vi, vj)) is x ≥ 0. The switch can happen when x has any positive value, and not necessarily when x = 0. However, we should notice that the actual condition for the switch is determined both by the control switch condition Jump((vi, vj)) and by the invariant condition Inv(vi). For instance, if Inv(vi) is x < 1, then the switch will take place nondeterministically when x satisfies the condition 0 ≤ x < 1, that is, at the latest when x rises to the value 1. We can easily impose a deterministic switch, for instance, in our example, with Jump((vi, vj)) ≡ x = 0, or with Jump((vi, vj)) ≡ x ≥ 0 and Inv(vi) ≡ x < 0.
Example 7.4. In this example, we consider an automatic battery charging station that has to charge two plants, depending on whether the battery level of each plant gets below one of two threshold values, l1 and l2, for the two plants p1 and p2, respectively. We suppose that the charging system charges at a constant rate e, for only one plant at a time; it can switch from one to the other instantaneously. We suppose that plants p1 and p2 consume energy at rates e1 and e2, respectively. The objective of the charging station is to keep the charge of each plant above the threshold values. The corresponding hybrid automaton is depicted in Figure 7.2.
The charge level of each plant's battery is described by two continuous variables: X = {c1, c2}, for plants p1 and p2, respectively. The charging station can be in two control modes, charging one plant or the other: G = {V, E}, where V = {p1, p2} and E = {(p1, p2), (p2, p1)}. We suppose that initially the station is charging plant p1, and both plants have charges above
Figure 7.2: Hybrid automaton of the charging station: in mode p1, ċ1 = e − e1 and ċ2 = −e2; in mode p2, ċ1 = −e1 and ċ2 = e − e2; the control switches are Jump((p1, p2)) ≡ c2 ≤ l2 and Jump((p2, p1)) ≡ c1 ≤ l1.
To define I/O hybrid automata, let us first notice that the discrete component of a hybrid automaton can be described with discrete state variables. The set of control modes V (see Definition 7.2) can be represented with a set of discrete variables Y = {y1, . . . , ym} ranging over discrete values. Each complete assignment to the variables y1, . . . , ym corresponds to a control mode vi ∈ V. This is similar to a state variable representation of a set of states in Section 2.1.2.
Given the two sets X and Y of continuous and discrete variables, the definition of an I/O hybrid automaton extends Definition 7.2 by distinguishing (discrete and continuous) input variables from output variables: X = X_in ∪ X_out and Y = Y_in ∪ Y_out, where X_in ∩ X_out = ∅ and Y_in ∩ Y_out = ∅. Moreover, discrete and continuous input variables are distinguished into controllable and uncontrollable variables: X_in = X_in^c ∪ X_in^u and Y_in = Y_in^c ∪ Y_in^u, where X_in^c ∩ X_in^u = ∅ and Y_in^c ∩ Y_in^u = ∅.
The idea is that a component of an actor can interact with a system modeled as an I/O hybrid automaton by determining the discrete/continuous controllable input variables of the system. The actor can perceive its status through the discrete/continuous output variables of the system. An actor can therefore assign values to the controllable inputs, whereas this is not possible for uncontrollable variables, whose values are determined by the environment. Uncontrollable variables obey dynamics that typically cannot be modeled. They can represent external forces, the results of exogenous events, the actions of other agents, or noise in sensing the external environment.
In a conceptual model of an actor (see Figure 1.1), commands change the values of the discrete and continuous controllable input variables, while percepts reflect the discrete and continuous output variables. The model can evolve through changes in both controllable and uncontrollable variables. The actor can be seen as a reactive system that iteratively perceives the output variables in X_out and Y_out and reacts by determining the values of the controllable input variables in X_in^c and Y_in^c.
An actor can plan for and perform actions that change the control mode of the hybrid automaton by assigning values to (some of) the variables in Y_in^c and in X_in^c. We might represent actions that change the values of the discrete variables in Y_in^c with purely discrete models, like those presented in the previous chapters. For instance, planning to determine the control mode of the hybrid system could be done with a planning domain Σ = (S, A, γ), where states in S correspond to control modes in V represented with the discrete variables in Y, and γ can be the transition function of deterministic models (Chapter 2), of nondeterministic models (Chapter 5), or of probabilistic models (Chapter 6).
Example 7.5. Consider Example 7.4. The two control modes p1 and p2 can be represented by a discrete controllable input variable: Y_in^c = {p}, where p can range over the two values p1 and p2. Let us now modify the example: the charging station, rather than charging the plants at a constant rate e, can choose to charge plants at different rates between zero and a maximum rate emax. This can be modeled with a continuous controllable input variable e whose value is a real in the interval (0, emax]: X_in^c = {e}. Moreover, the two plants consume energy at a rate that depends on the load of tasks that they have to perform, and this load is not under the actor's control: we therefore have two continuous uncontrollable input variables e1 and e2: X_in^u = {e1, e2}. Finally, the current charge of the two plants can be perceived by the actor: X_out = {c1, c2}. The goal is to keep the charge of each plant above the two threshold values, l1 for plant p1 and l2 for plant p2.
Let us now briefly introduce some techniques for planning and acting
with hybrid models.
One example along this direction is the work in [78, 77], whose idea is to start from a language for describing planning domains with both discrete and continuous variables, to give it a semantics in terms of hybrid automata, and then to apply existing model checkers to find a solution. [78, 77] start from the PDDL+ planning language [205]. PDDL+ allows for the definition of models with discrete and continuous variables. Continuous dynamics are modeled through processes and exogenous events, which model dynamics that are initiated by the environment.
In PDDL+, the discrete component of a hybrid system is described by a set of propositions, while the continuous component is modeled with a vector of real variables. Discrete transitions are described through preconditions and effects. Preconditions are conjunctions of propositions and numeric constraints over the continuous variables. Events are represented with preconditions that, when satisfied, trigger the event. Processes are active as long as their preconditions are true; they describe the change of the continuous variables. PDDL+ allows for durative actions whose preconditions and effects are conjunctions over propositional variables as well as constraints over continuous variables; their preconditions should hold when the action starts, during its execution, and at the end of the action.
PDDL+ planning domains therefore have some similarities with hybrid automata. [78, 77] exploit the close relationship of the PDDL+ semantics with hybrid automata. They provide a semantics of PDDL+ in terms of hybrid automata, even if PDDL+ assumptions raise some semantic issues. Among them, the PDDL+ ε-separation assumption states that no two actions are allowed to occur simultaneously if they update common variables (mutex actions). Plans have to meet the ε-separation condition; that is, interfering actions must be separated by a time interval of length at least ε. Indeed, this problem is related to the PDDL2.1 model of temporal problems [204], whose model of durative actions does not allow for concurrency unless ε-separation is assumed.
A further difference between PDDL+ and hybrid automata is the semantics of events. A number of assumptions are made about events and processes, but the most relevant difference with hybrid automata is that events and processes start as soon as their preconditions become satisfied, while in hybrid automata transitions might happen at any time at which the Jump condition is satisfied, as long as the invariant still holds (see Section 7.4.1).
In [78], a first translation from PDDL+ to hybrid automata is defined that does not take into account the different semantics of events. Because transitions are allowed to happen at any time a condition is satisfied, the model checker may fail to find a plan in cases in which a plan does exist under the restricted PDDL+ semantics, that is, with events that are triggered as soon as their preconditions are true. For this reason, the approach is not complete: it may not find a plan when a plan does exist. Such an approach can instead be used to prove the nonexistence of a plan. The approach is complete when there are no events.
In [77], the authors propose an exact translation of PDDL+ into hybrid automata that mimics the semantics of PDDL+ events. The translation guarantees that traces in the obtained hybrid automata correspond to sequential plans in the original planning domain in the case of linear hybrid automata; it can also handle hybrid automata with affine dynamics through an over-approximation that can be made arbitrarily precise.
Figure 7.3: A simple flow tube for a variable x, connecting the initial region [x(t1)min, x(t1)max] at time t1 to the final region [x(t2)min, x(t2)max] at time t2.
Example 7.6. Figure 7.3 shows a simple flow tube for a continuous variable
x that evolves over time t. The intended meaning of the flow tube is that
if the value of variable x is in [x(t1 )min , x(t1 )max ] at time t1 , then it is
predicted that the value of x at time t2 will be in [x(t2 )min , x(t2 )max ], and
that x between these two time points will remain in the drawn trapezium.
The interval of initial possible values [x(t1 )min , x(t1 )max ] of x is called the
initial region, and interval [x(t2 )min , x(t2 )max ] is the final region. The flow
tube depicted in Figure 7.3 connects the initial region [x(t1 )min , x(t1 )max ]
to the final region [x(t2 )min , x(t2 )max ].
The idea is that an action a can modify the continuous variable x ac-
cording to the law described by the flow tube. We can express preconditions
of action a with regions of continuous variables (interval of values of x in
our example) where the action is applicable, and postconditions with the
resulting final region in a given duration (for example, after the interval
t2 − t1 in our example).
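For the linear tube of Figure 7.3, this prediction reduces to interval arithmetic on the derivative bounds; a small sketch, with illustrative names:

def propagate(initial_region, xdot_min, xdot_max, t1, t2):
    # Final region at t2 from the initial region at t1,
    # for xdot_min <= dx/dt <= xdot_max.
    x_min, x_max = initial_region
    dt = t2 - t1
    return (x_min + xdot_min * dt, x_max + xdot_max * dt)

# An action would then be applicable if its continuous precondition region
# intersects the final region computed by propagate(...).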
Figure 7.3 shows a simple linear flow tube that can be represented with a linear equation in the variables x and ẋ, assuming that ẋ is constant. The conditions Init, Inv, and Flow of hybrid automata (Definition 7.2) could be used to represent flow tubes. For instance, let us suppose that in a node vi, variable x can evolve as described by the flow tube in Figure 7.3. We have that Init(vi) is a condition that constrains the values of x between x(t1)min and x(t1)max; Flow(vi) and Inv(vi) should represent the bundle of lines from any x with value between x(t1)min and x(t1)max to a point between x(t2)min and x(t2)max. Flow(vi) and Inv(vi) should constrain the bundle to remain in the flow tube envelope.
Flow tubes can be more complex than the one shown in Figure 7.3, like the one in the next example.
Example 7.7. Figure 7.4 shows a flow tube over two variables x1 and x2, which evolve over time t. The intended meaning of the flow tube is that if the values of x1 and x2 at time t1 are in ellipse e1, then it is predicted that the values of x1 and x2 at time t2 will be in ellipse e2, and that between these two time points x1 and x2 will remain in the drawn envelope.
The most notable example of planning with hybrid models based on flow tubes is the work in [381]. The idea is based on the notion of a hybrid planning graph: a planning graph [74] is used to represent the effects of actions over discrete variables, while flow tubes are used to represent the effects over continuous variables. Hybrid planning graphs are encoded as an extension of classical planning graphs.
[Figure 7.4: A flow tube over two continuous variables x1 and x2 evolving over time t, connecting an initial ellipse e1 to a final ellipse e2.]
[Figure 7.5: Connecting flow tubes: from an initial region of feasible states, corresponding to the continuous precondition of action A, A leads to a set of reachable states; the intersection of those states with the region corresponding to the continuous precondition of action B is the initial region of the next flow tube.]
The approach then searches for a path through connected flow tubes (see Figure 7.5). Flow tubes are connected by finding an intersection between the final region of a flow tube (the effects of an action over the continuous variables) and the precondition of a subsequent action (also represented as a region of the flow tube). The connection conditions guarantee that all valid plans are included in the graph. However, they are not sound, meaning that not all plans in the graph are valid. A further step therefore encodes the hybrid flow graph as a mixed integer program, which makes sure that the output plan is valid and optimal. A minimal sketch of the connection step is given below.
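The following sketch illustrates the connection step for flow tubes over a single continuous variable; the FlowTube class and its fields are hypothetical simplifications, not the encoding of [381]:

from dataclasses import dataclass

@dataclass
class FlowTube:
    """Linear flow tube: starting from an initial interval, x evolves with a
    slope bounded by [rate_min, rate_max] for a fixed duration."""
    rate_min: float
    rate_max: float
    duration: float

    def final_region(self, init_lo, init_hi):
        # Predicted interval for x after `duration`, given the initial region.
        return (init_lo + self.rate_min * self.duration,
                init_hi + self.rate_max * self.duration)

def connect(final_region, precond_region):
    # Intersect a tube's final region with the continuous precondition of the
    # next action; an empty intersection means the tubes cannot be connected.
    lo = max(final_region[0], precond_region[0])
    hi = min(final_region[1], precond_region[1])
    return (lo, hi) if lo <= hi else None

# Action a raises x by 1 to 2 units/s for 3 s; action b requires x in [4, 6].
after_a = FlowTube(1.0, 2.0, 3.0).final_region(0.0, 1.0)  # (3.0, 7.0)
print(connect(after_a, (4.0, 6.0)))                       # (4.0, 6.0): connectable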
In the case of complex flow tubes that cannot be represented with lin-
ear equations, various approximation methods can be used. Hofmann and
Williams [282] use a polyhedral approximation, which approximates the
tubes as slices of polyhedra for each time step. Kurzhanskiy and Varaiya
[360] use an ellipsoidal calculus for approximation that has proven highly
efficient. The work of Li and Williams [381] uses and extends planning graphs with flow tubes; we believe, however, that the work is especially interesting because the ideas underlying flow tubes could be used in general with any state-space planner.
For solving the generated planning tasks, the approach described in [391] makes use of the Temporal Fast Downward planner (TFD) [186]. While [391] assumes that the estimated state is the actual state, [392] extends the approach to uncertain state information, thus providing the ability to deal with noisy sensors and imperfect actuators. The approach can lead to a state explosion when a fine discretization is required.
Description Logic (DL) [27] is a family of formal knowledge representation languages suited for representing ontologies and for reasoning about them. The languages in the DL family differ in their expressivity and in the computational complexity of their reasoning algorithms. In general, a DL language can express definitions of classes and relations (see also [232] for a brief introduction to DL). Class definitions can include disjunction and negation, as well as constraints on the relations to other classes. A relation between a class (its domain) and another class (its range) can be constrained in cardinality and type. A class can be defined as a subclass of another class. Similarly, relations can also be given definitions and therefore have subclasses as well. Class partitions can be defined by specifying a set of subclasses that represent the partitions; a partition is exhaustive if all instances of the class belong to some subclass, and disjoint if there is no overlap among the subclasses. A class can be designated as a primitive class and not given a definition; in that case, its subclasses and instances must be explicitly indicated.
Description logic reasoning systems use these definitions to automatically organize class descriptions into a taxonomic hierarchy and to automatically classify instances into the classes whose definitions are satisfied by the features of the instance. Specifically, description logic reasoners provide the following main mechanisms (a toy sketch of the first two is given after the list):
- class subsumption, where a class c1 subsumes another class c2 if its definition includes a superset of the instances included in c2;
- instance recognition, where an instance belongs to a class if the instance's features (roles and role values) satisfy the definition of the class;
- consistency checking, that is, mechanisms to detect inconsistent definitions; and
- inheritance inference, that is, the automated inheritance of properties by members of subclasses.
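As an illustration of the first two mechanisms, here is a toy sketch in Python (with hypothetical names; unlike a real DL reasoner, which decides subsumption from the definitions themselves, it checks subsumption extensionally over a given set of instances):

class Class_:
    def __init__(self, name, definition):
        self.name = name
        self.definition = definition   # predicate over an instance's features

    def recognizes(self, features):
        # Instance recognition: the instance's roles and role values
        # satisfy the class definition.
        return self.definition(features)

def subsumes(c1, c2, instances):
    # c1 subsumes c2 if every known instance of c2 is also an instance of c1.
    return all(c1.recognizes(f) for f in instances if c2.recognizes(f))

vehicle = Class_("Vehicle", lambda f: f.get("wheels", 0) >= 2)
car = Class_("Car", lambda f: f.get("wheels", 0) == 4 and "engine" in f)

instances = [{"wheels": 4, "engine": "V6"}, {"wheels": 2}]
print(car.recognizes(instances[0]))       # True: the features satisfy Car
print(subsumes(vehicle, car, instances))  # True: every Car here is a Vehicle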
Figure 7.6: Open door: semantic mapping of state variables and actions.
move-close, grasp, turn, and so on. Even in this case, the mapping down is nondeterministic, while the mapping up is deterministic. These ideas are depicted in Figure 7.6 with an example related to open-door.
Finally, we should recall that even though the hypothesis that mapping down is nondeterministic and mapping up is deterministic is reasonable in many cases, it may not always hold. At a higher level, we may consider parameters that are not considered at a lower level. As a simple example, at the level of the topological map we may add information about dangerous or crowded rooms that the robot should avoid passing through. Some of the variables at the higher levels therefore do not need to be mapped to the lower levels. A minimal sketch of the two mappings is given below.
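The following toy sketch (all names are illustrative, loosely following Figure 7.6) shows the asymmetry between the two directions: mapping down is a one-to-many relation, while mapping up is a function:

# One abstract action maps down to several possible concrete refinements.
MAP_DOWN = {
    "open_door": [["grasp", "turn", "pull"], ["grasp", "push"]],
}

# Many concrete states map up to a single abstract state.
def map_up(concrete_state):
    if concrete_state == "S7":
        return "OPEN"
    if concrete_state == "S10":
        return "LOCKED"
    return "BLOCKED"

print(MAP_DOWN["open_door"])  # two candidate refinements: nondeterministic
print(map_up("S7"))           # OPEN: deterministic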
Concluding Remarks
Search Algorithms
[Figure: the refinement tree explored by Deterministic-Search: a partial solution π has candidate refinements r1, r2, ..., ri, ..., each leading to a refined partial solution πi, which in turn has its own candidate refinements ri1, ri2, ....]
Deterministic-Search(P)
   π ← initial partial solution
   Π ← {π}
   while Π ≠ ∅ do
      select π ∈ Π    (i)
      remove π from Π
      if π is a solution for P then return π
      R ← {candidate refinements for π}
      for every r ∈ R do
         π′ ← refine(π, r)
         add π′ to Π
   return failure
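Under the assumption of a problem object providing the hooks used in the pseudocode (all hypothetical names), Deterministic-Search can be transcribed directly in Python:

def deterministic_search(problem):
    # `candidates` plays the role of the set Π of pending partial solutions.
    candidates = [problem.initial_partial_solution()]
    while candidates:
        pi = problem.select(candidates)      # step (i): the selection strategy
        candidates.remove(pi)
        if problem.is_solution(pi):
            return pi
        for r in problem.refinements(pi):    # candidate refinements for pi
            candidates.append(problem.refine(pi, r))
    return None                              # failure

Different instantiations of the selection step (i) give different search strategies: always selecting the most recently added candidate yields a depth-first search, while always selecting the oldest yields a breadth-first search.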
[Figure: an And/Or search tree. The Or-node P branches on candidate refinements r1, r2, ...; each resulting And-node πi has as children the Or-nodes Pi1, Pi2, ... (its unsolved subproblems), and so on recursively.]
And-Or-Search(P)
   return Or-Branch(P)

Or-Branch(P)
   R ← {candidate refinements for P}
   if R = ∅ then return failure
   nondeterministically choose r ∈ R    (i)
   π ← refine(P, r)
   return And-Branch(P, π)

And-Branch(P, π)
   if π is a solution for P then return π
   {P1, ..., Pn} ← {unsolved subproblems in π}
   for every Pi ∈ {P1, ..., Pn} do    (ii)
      πi ← Or-Branch(Pi)
      if πi = failure then return failure
   if π1, ..., πn are not compatible then return failure
   incorporate π1, ..., πn into π
   return π
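One simple way to implement the nondeterministic choice at step (i) is chronological backtracking over the candidate refinements. The sketch below (with a hypothetical problem interface) returns the first refinement whose subproblems can all be solved compatibly:

def or_branch(P):
    refinements = P.refinements()
    if not refinements:
        return None                          # failure: no candidate refinements
    for r in refinements:                    # backtracking replaces step (i)
        result = and_branch(P, P.refine(r))
        if result is not None:
            return result
    return None                              # every refinement failed

def and_branch(P, pi):
    if P.is_solution(pi):
        return pi
    solutions = []
    for sub in P.unsolved_subproblems(pi):   # step (ii): solve each subproblem
        s = or_branch(sub)
        if s is None:
            return None                      # an unsolvable subproblem
        solutions.append(s)
    if not P.compatible(solutions):
        return None                          # subplans cannot be combined
    return P.incorporate(solutions, pi)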
Strongly Connected
Components of a Graph
Tarjan(v)
   index(v) ← low(v) ← i
   i ← i + 1
   push(v, stack)
   for all v′ adjacent to v do
      if index(v′) is undefined then do
         Tarjan(v′)
         low(v) ← min{low(v), low(v′)}
      else if v′ is in stack then low(v) ← min{low(v), low(v′)}
   if index(v) = low(v) then do
      start a new component C
      repeat
         w ← pop(stack); C ← C ∪ {w}
      until w = v
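The procedure can be transcribed directly in Python for a graph given as an adjacency dictionary; like the pseudocode above, this version updates low(v) with low(v′) for vertices still on the stack:

def tarjan_scc(graph):
    index, low = {}, {}
    stack, on_stack = [], set()
    components, counter = [], [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:            # unvisited vertex: recurse
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:           # edge back into the current stack
                low[v] = min(low[v], low[w])
        if index[v] == low[v]:            # v is the root of a component
            component = []
            while True:
                w = stack.pop(); on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            components.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return components

print(tarjan_scc({1: [2], 2: [3], 3: [1], 4: [3]}))  # [[3, 2, 1], [4]]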
[1] Aarup, M., Arentoft, M. M., Parrod, Y., Stader, J., and Stokes, I. (1994).
OPTIMUM-AIV: A knowledge-based planning and scheduling system for space-
craft AIV. In Intelligent Scheduling, pages 451–469. Morgan Kaufmann.
[2] Abbeel, P., Coates, A., and Ng, A. (2010). Autonomous helicopter aerobatics
through apprenticeship learning. Intl. J. Robotics Research, 29(13):1608–1639.
[5] Adali, S., Console, L., Sapino, M. L., Schenone, M., and Terenziani, P. (2000).
Representing and reasoning with temporal constraints in multimedia presenta-
tions. In Intl. Symp. on Temporal Representation and Reasoning (TIME), pages
3–12.
[7] Albore, A. and Bertoli, P. (2004). Generating safe assumption-based plans for
partially observable, nondeterministic domains. In Proc. AAAI, pages 495–500.
[8] Alford, R., Kuter, U., and Nau, D. S. (2009). Translating HTNs to PDDL: A
small amount of domain knowledge can go a long way. In Proc. IJCAI.
[9] Alford, R., Kuter, U., Nau, D. S., and Goldman, R. P. (2014a). Plan aggregation
for strong cyclic planning in nondeterministic domains. Artificial Intelligence,
216:206–232.
[10] Alford, R., Shivashankar, V., Kuter, U., and Nau, D. S. (2014b). On the
feasibility of planning graph style heuristics for HTN planning. In Proc. ICAPS.
[11] Allen, J. (1984). Towards a general theory of action and time. Artificial
Intelligence, 23:123–154.
[12] Allen, J. (1991a). Temporal reasoning and planning. In Allen, J., Kautz,
H., Pelavin, R., and Tenenberg, J., editors, Reasoning about Plans, pages 1–68.
Morgan Kaufmann.
[15] Allen, J. F., Hendler, J., and Tate, A., editors (1990). Readings in Planning.
Morgan Kaufmann.
[16] Allen, J. F. and Koomen, J. A. (1983). Planning using a temporal world model.
In Proc. IJCAI.
[17] Ambite, J. L., Knoblock, C. A., and Minton, S. (2000). Learning plan rewriting
rules. In Proc. ICAPS.
[19] Anderson, J. R., Bothell, D., Byrne, M. D., Douglass, S., Lebiere, C., and Qin,
Y. (2004). An integrated theory of the mind. Psychological Review, 111(4):1036–1060.
[20] Andre, D., Friedman, N., and Parr, R. (1997). Generalized prioritized sweep-
ing. In Adv. in Neural Information Processing Syst. (Proc. NIPS).
[21] Andre, D. and Russell, S. J. (2002). State abstraction for programmable rein-
forcement learning agents. In Proc. AAAI.
[22] Andrews, T., Curbera, F., Dolakia, H., Goland, J., Klein, J., Leymann,
F., Liu, K., Roller, D., Smith, D., Thatte, S., Trickovic, I., and Weer-
avarana, S. (2003). Business Process Execution Language for Web Services.
http://msdn.microsoft.com/en-us/library/ee251594(v=bts.10).aspx.
[23] Araya-Lopez, M., Thomas, V., Buffet, O., and Charpillet, F. (2010). A closer
look at MOMDPs. In IEEE Intl. Conf. on Tools with AI (ICTAI), pages 197–204.
[24] Argall, B. D., Chernova, S., Veloso, M. M., and Browning, B. (2009). A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483.
[26] Awaad, I., Kraetzschmar, G. K., and Hertzberg, J. (2014). Finding ways to
get the job done: An affordance-based approach. In Proc. ICAPS.
[27] Baader, F., Calvanese, D., McGuinness, D., Nardi, D., and Patel-Schneider,
P., editors (2003). The Description Logic Handbook: Theory, Implementation
and Applications. Cambridge Univ. Press.
[28] Bacchus, F. and Kabanza, F. (2000). Using temporal logics to express search
control knowledge for planning. Artificial Intelligence, 116(1-2):123–191.
[30] Backstrom, C. and Nebel, B. (1993). Complexity results for SAS+ planning.
In Proc. IJCAI.
[31] Backstrom, C. and Nebel, B. (1995). Complexity results for SAS+ planning.
Computational Intelligence, 11(4):1–34.
[32] Baier, J. A., Mombourquette, B., and McIlraith, S. (2014). Diagnostic problem
solving: a planning perspective. In Proc. Intl. Conf. on Principles of Knowledge
Representation and Reasoning (KR), pages 1–10.
[34] Ball, M. and Holte, R. C. (2008). The compression power of symbolic pattern
databases. In Proc. ICAPS, pages 2–11.
[35] Baptiste, P., Laborie, P., Le Pape, C., and Nuijten, W. (2006). Constraint-
based scheduling and planning. In Rossi, F., Van Beek, P., and Walsh, T., editors,
Handbook of constraint programming, chapter 22, pages 759–798. Elsevier.
[36] Barbier, M., Gabard, J.-F., Llareus, J. H., and Tessier, C. (2006). Implemen-
tation and flight testing of an onboard architecture for mission supervision. In
Intl. Unmanned Air Vehicle Syst. Conf.
[37] Barreiro, J., Boyce, M., Frank, J., Iatauro, M., Kichkaylo, T., Morris, P.,
Smith, T., and Do, M. (2012). EUROPA: A platform for AI planning, scheduling,
constraint programming, and optimization. In Intl. Competition on Knowledge
Engg. for Planning and Scheduling (ICKEPS).
[38] Barrett, A., Golden, K., Penberthy, J. S., and Weld, D. S. (1993). UCPOP
user's manual (version 2.0). Technical Report TR-93-09-06, Univ. of Washington,
Dept. of Computer Science and Engineering.
[40] Barrett, A. and Weld, D. S. (1994). Partial order planning: Evaluating possible
efficiency gains. Artificial Intelligence, 67(1):71–112.
[41] Barrett, C., Stump, A., and Tinelli, C. (2010). The SMT-LIB standard: Ver-
sion 2.0. In Gupta, A. and Kroening, D., editors, 8th Intl. Wksp. on Satisfiability
Modulo Theories.
[42] Barry, J., Kaelbling, L. P., and Lozano-Perez, T. (2010). Hierarchical solution
of large Markov decision processes. In ICAPS Workshop, pages 1–8.
[43] Barry, J., Kaelbling, L. P., and Lozano-Perez, T. (2011). DetH*: Approximate
hierarchical solution of large Markov decision processes. In Proc. IJCAI, pages
1–8.
[44] Bartak, R., Morris, R., and Venable, B. (2014). An Introduction to Constraint-
Based Temporal Reasoning. Morgan & Claypool.
[45] Bartak, R., Salido, M. A., and Rossi, F. (2010). Constraint satisfaction tech-
niques in planning and scheduling. J. Intelligent Manufacturing, 21(1):5–15.
[46] Barto, A. G., Bradtke, S. J., and Singh, S. P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81–138.
[49] Beetz, M. and McDermott, D. (1994). Improving robot plans during their
execution. In Proc. AIPS.
[51] Ben Lamine, K. and Kabanza, F. (2002). Reasoning about robot actions: a
model checking approach. In Beetz, M., Hertzberg, J., Ghallab, M., and Pollack,
M. E., editors, Advances in Plan-Based Control of Robotic Agents, pages 123–139.
Springer.
[52] Bercher, P., Keen, S., and Biundo, S. (2014). Hybrid planning heuristics
based on task decomposition graphs. International Symposium on Combinatorial
Search (SoCS), pages 1–9.
[53] Bernard, D., Gamble, E., Rouquette, N., Smith, B., Tung, Y., Muscettola, N.,
Dorais, G., Kanefsky, B., Kurien, J. A., and Millar, W. (2000). Remote agent
experiment DS1 technology validation report. Technical report, NASA.
[54] Bernardi, G., Cesta, A., Orlandini, A., and Finzi, A. (2013). A knowledge
engineering environment for P&S with timelines. In Proc. ICAPS, pages 16–23.
[58] Bernardini, S. and Smith, D. E. (2009). Towards search control via dependency
graphs in Europa2. In ICAPS Wksp. on Heuristics for Domain-Independent
Planning.
[59] Bertoli, P., Cimatti, A., Pistore, M., Roveri, M., and Traverso, P. (2001a).
MBP: a model based planner. In IJCAI Wksp. on Planning under Uncertainty
and Incomplete Information, pages 93–97.
[60] Bertoli, P., Cimatti, A., Pistore, M., and Traverso, P. (2003). A framework for
planning with extended goals under partial observability. In Proc. ICAPS.
[61] Bertoli, P., Cimatti, A., Roveri, M., and Traverso, P. (2001b). Planning in non-
deterministic domains under partial observability via symbolic model checking.
In Proc. IJCAI, pages 473–478.
[62] Bertoli, P., Cimatti, A., Roveri, M., and Traverso, P. (2006). Strong planning
under partial observability. Artificial Intelligence, 170(4):337–384.
[63] Bertoli, P., Cimatti, A., and Traverso, P. (2004). Interleaving execution and
planning for nondeterministic, partially observable domains. In Proc. ECAI,
pages 657–661.
[64] Bertoli, P., Pistore, M., and Traverso, P. (2010). Automated composition
of Web services via planning in asynchronous domains. Artificial Intelligence,
174(3-4):316–361.
[68] Betz, C. and Helmert, M. (2009). Planning with h+ in theory and practice.
In Proc. Annual German Conf. on AI (KI), volume 5803. Springer.
[69] Bhatia, A., Kavraki, L. E., and Vardi, M. Y. (2010). Sampling-based motion
planning with temporal goals. In IEEE Intl. Conf. on Robotics and Automation
(ICRA), pages 2689–2696. IEEE.
[70] Bird, C. D. and Emery, N. J. (2009a). Insightful problem solving and cre-
ative tool modification by captive nontool-using rooks. Proc. Natl. Acad. of Sci.
(PNAS), 106(25):10370–10375.
[71] Bird, C. D. and Emery, N. J. (2009b). Rooks use stones to raise the water
level to reach a floating worm. Current Biology, 19(16):1410–1414.
[72] Biundo, S. and Schattenberg, B. (2001). From abstract crisis to concrete relief – a preliminary report on combining state abstraction and HTN planning. In Proc. European Conf. on Planning (ECP), pages 157–168.
[74] Blum, A. L. and Furst, M. L. (1997). Fast planning through planning graph
analysis. Artificial Intelligence, 90(1-2):281–300.
[77] Bogomolov, S., Magazzeni, D., Minopoli, S., and Wehrle, M. (2015). PDDL+
planning with hybrid automata: Foundations of translating must behavior. In
Proc. ICAPS, pages 42–46.
[78] Bogomolov, S., Magazzeni, D., Podelski, A., and Wehrle, M. (2014). Planning
as model checking in hybrid domains. In Proc. AAAI, pages 2228–2234.
[79] Bohren, J., Rusu, R. B., Jones, E. G., Marder-Eppstein, E., Pantofaru, C.,
Wise, M., Mosenlechner, L., Meeussen, W., and Holzer, S. (2011). Towards
autonomous robotic butlers: Lessons learned with the PR2. In IEEE Intl. Conf.
on Robotics and Automation (ICRA), pages 5568–5575.
[83] Bonet, B. and Geffner, H. (2003a). Faster heuristic search algorithms for
planning with uncertainty and full feedback. In Proc. IJCAI.
[84] Bonet, B. and Geffner, H. (2003b). Labeled RTDP: Improving the convergence
of real-time dynamic programming. In Proc. ICAPS, pages 12–21.
[87] Bonet, B. and Geffner, H. (2009). Solving POMDPs: RTDP-Bel vs. point-
based algorithms. In Proc. IJCAI, pages 1641–1646.
[88] Bonet, B. and Geffner, H. (2012). Action selection for MDPs: Anytime AO*
versus UCT. In Proc. AAAI.
[89] Bonet, B. and Geffner, H. (2014). Belief tracking for planning with sens-
ing: Width, complexity and approximations. J. Artificial Intelligence Research,
50:923–970.
[91] Bouguerra, A., Karlsson, L., and Saffiotti, A. (2007). Semantic knowledge-
based execution monitoring for mobile robots. In IEEE Intl. Conf. on Robotics
and Automation (ICRA), pages 3693–3698. IEEE.
[92] Boutilier, C., Brafman, R. I., and Geib, C. (1998). Structured reachability
analysis for Markov decision processes. In Proc. Conf. on Uncertainty in AI
(UAI), pages 24–32.
[93] Boutilier, C., Dean, T., and Hanks, S. (1999). Decision-theoretic planning:
Structural assumptions and computational leverage. J. Artificial Intelligence
Research, 11:1–94.
[94] Boutilier, C., Dearden, R., and Goldszmidt, M. (2000). Stochastic dynamic
programming with factored representations. Artificial Intelligence, 121:49–107.
[97] Brenner, M. and Nebel, B. (2009). Continual planning and acting in dy-
namic multiagent environments. J. Autonomous Agents and Multi-Agent Syst.,
19(3):297–331.
[98] Brusoni, V., Console, L., Terenziani, P., and Dupre, D. T. (1998). A spec-
trum of definitions for temporal model-based diagnosis. Artificial Intelligence,
102(1):39–79.
[99] Brusoni, V., Console, L., Terenziani, P., and Pernici, B. (1999). Qualitative and
quantitative temporal constraints and relational databases: Theory, architecture,
and applications. IEEE Trans. on Knowledge and Data Engg., 11(6):948–968.
[100] Bucchiarone, A., Marconi, A., Pistore, M., and Raik, H. (2012). Dynamic
adaptation of fragment-based and context-aware business processes. In Intl.
Conf. on Web Services, pages 33–41.
[101] Bucchiarone, A., Marconi, A., Pistore, M., Traverso, P., Bertoli, P., and
Kazhamiakin, R. (2013). Domain objects for continuous context-aware adapta-
tion of service-based systems. In ICWS, pages 571–578.
[102] Buffet, O. and Sigaud, O., editors (2010). Markov Decision Processes in
Artificial Intelligence. Wiley.
[103] Busoniu, L., Munos, R., De Schutter, B., and Babuska, R. (2011). Optimistic
planning for sparsely stochastic systems. In IEEE Symp. on Adaptive Dynamic
Progr. and Reinforcement Learning, pages 48–55.
[106] Calvanese, D., Giacomo, G. D., and Vardi, M. Y. (2002). Reasoning about
actions and planning in LTL action theories. In Proc. Intl. Conf. on Principles of
Knowledge Representation and Reasoning (KR), pages 593–602. Morgan Kauf-
mann.
[108] Castellini, C., Giunchiglia, E., and Tacchella, A. (2003). SAT-based planning
in complex domains: Concurrency, constraints and nondeterminism. Artificial
Intelligence, 147:85–118.
[109] Castillo, L., Fdez-Olivares, J., and Garcia-Perez, O. (2006a). Efficiently han-
dling temporal knowledge in an HTN planner. In Proc. ICAPS, pages 1–10.
[110] Castillo, L., Fdez-Olivares, J., Garcia-Perez, O., and Palao, F. (2006b). Efficiently handling temporal knowledge in an HTN planner. In Proc. ICAPS, pages 63–72.
[111] Cesta, A. and Oddi, A. (1996). Gaining efficiency and flexibility in the simple
temporal problem. In Intl. Symp. on Temporal Representation and Reasoning
(TIME).
[112] Cesta, A., Oddi, A., and Smith, S. F. (2002). A constraint-based method for
project scheduling with time windows. J. Heuristics, 8(1):109–136.
[113] Champandard, A., Verweij, T., and Straatman, R. (2009). The AI for Killzone
2's multiplayer bots. In Game Developers Conf. (GDC).
[115] Chatilla, R., Alami, R., Degallaix, B., and Laruelle, H. (1992). Integrated
planning and execution control of autonomous robot actions. In IEEE Intl. Conf.
on Robotics and Automation (ICRA), pages 2689–2696.
[117] Cimatti, A., Giunchiglia, F., Pecchiari, P., Pietra, B., Profeta, J., Romano,
D., Traverso, P., and Yu, B. (1997). A provably correct embedded verifier for
the certification of safety critical software. In Intl. Conf. on Computer Aided
Verification (CAV), pages 202–213.
[118] Cimatti, A., Micheli, A., and Roveri, M. (2012a). Solving temporal problems
using SMT: Strong controllability. In Proc. Int. Conf. Principles and Practice of
Constraint Programming(CP).
[119] Cimatti, A., Micheli, A., and Roveri, M. (2012b). Solving temporal problems
using SMT: Weak controllability. In Proc. AAAI.
[120] Cimatti, A., Micheli, A., and Roveri, M. (2015). Strong temporal planning
with uncontrollable durations: A state-space approach. In Proc. AAAI, pages
1–7.
[121] Cimatti, A., Pistore, M., Roveri, M., and Traverso, P. (2003). Weak, strong,
and strong cyclic planning via symbolic model checking. Artificial Intelligence,
147(1-2):35–84.
[122] Cimatti, A., Roveri, M., and Traverso, P. (1998a). Automatic OBDD-based
generation of universal plans in non-deterministic domains. In Proc. AAAI, pages
875–881.
[123] Cimatti, A., Roveri, M., and Traverso, P. (1998b). Strong planning in non-
deterministic domains via model checking. In Proc. AIPS, pages 36–43.
[124] Claßen, J., Roger, G., Lakemeyer, G., and Nebel, B. (2012). Platas: integrating planning and the action language Golog. KI-Künstliche Intelligenz, 26(1):61–67.
[125] Coates, A., Abbeel, P., and Ng, A. (2009). Apprenticeship learning for heli-
copter control. Communications ACM, 52(7):97.
[126] Coles, A. and Smith, A. (2007). Marvin: a heuristic search planner with
online macro-action learning. J. Artificial Intelligence Research, 28:119–156.
[127] Coles, A. I., Fox, M., Long, D., and Smith, A. J. (2008). Planning with
problems requiring temporal coordination. In Proc. AAAI.
[128] Coles, A. J., Coles, A., Fox, M., and Long, D. (2012). COLIN: planning with
continuous linear numeric change. J. Artificial Intelligence Research.
[129] Conrad, P., Shah, J., and Williams, B. C. (2009). Flexible execution of plans
with choice. In Proc. ICAPS.
[131] Cormen, T., Leirson, C., Rivest, R., and Stein, C. (2001). Introduction to
Algorithms. MIT Press.
[133] Currie, K. and Tate, A. (1991). O-Plan: The open planning architecture.
Artificial Intelligence, 52(1):49–86.
[135] Dal Lago, U., Pistore, M., and Traverso, P. (2002). Planning with a language
for extended goals. In Proc. AAAI, pages 447–454.
[136] Daniele, M., Traverso, P., and Vardi, M. (1999). Strong cyclic planning
revisited. In Proc. European Conf. on Planning (ECP), pages 35–48.
[137] De Giacomo, G., Iocchi, L., Nardi, D., and Rosati, R. (1997). Description
logic-based framework for planning with sensing actions. In Intl. Wksp. on De-
scription Logics.
[138] De Giacomo, G., Iocchi, L., Nardi, D., and Rosati, R. (1999). A theory and
implementation of cognitive mobile robots. J. Logic and Computation, 9(5):759–785.
[139] De Giacomo, G., Patrizi, F., and Sardina, S. (2013). Automatic behavior
composition synthesis. Artificial Intelligence, 196.
[141] Dean, T., Firby, R., and Miller, D. (1988). Hierarchical planning involving
deadlines, travel time and resources. Computational Intelligence, 6(1):381–398.
[142] Dean, T., Givan, R., and Leach, S. (1997). Model reduction techniques for
computing approximately optimal solutions for Markov decision processes. In
Proc. Conf. on Uncertainty in AI (UAI), pages 124–131.
[143] Dean, T. and Kanazawa, K. (1989). A model for reasoning about persistence
and causation. Computational Intelligence, 5(3):142–150.
[144] Dean, T. and Lin, S.-H. (1995). Decomposition techniques for planning in
stochastic domains. In Proc. IJCAI, pages 1121–1127.
[146] Dean, T. L. and Wellman, M. (1991). Planning and Control. Morgan Kauf-
mann.
[147] Dechter, R., Meiri, I., and Pearl, J. (1991). Temporal constraint networks.
Artificial Intelligence, 49:61–95.
[148] Deisenroth, M. P., Neumann, G., and Peters, J. (2013). A survey on policy
search for robotics. Foundations and Trends in Robotics, 2(1-2):1–142.
[149] Della Penna, G., Magazzeni, D., Mercorio, F., and Intrigila, B. (2009). UP-
Murphi: A tool for universal planning on PDDL+ problems. In Proc. ICAPS.
[157] Doherty, P., Kvarnstrom, J., and Heintz, F. (2009a). A temporal logic-based
planning and execution monitoring framework for unmanned aircraft systems. J.
Autonomous Agents and Multi-Agent Syst., 19(3):332–377.
[158] Doherty, P., Kvarnstrom, J., and Heintz, F. (2009b). A temporal logic-based
planning and execution monitoring framework for unmanned aircraft systems. J.
Autonomous Agents and Multi-Agent Syst., 19(3):332–377.
[159] Domshlak, C., Karpas, E., and Markovitch, S. (2012). Online speedup learn-
ing for optimal planning. J. Artificial Intelligence Research.
[160] Doran, J. E. and Michie, D. (1966). Experiments with the graph traverser
program. Proc. Royal Soc. A: Math., Phys., and Engr. Sci., 294(1437):235–259.
[162] Dousson, C., Gaborit, P., and Ghallab, M. (1993). Situation recognition:
Representation and algorithms. In Proc. IJCAI, pages 166–172.
[165] Dvorak, F., Bit-Monnot, A., Ingrand, F., and Ghallab, M. (2014). A flexible
ANML actor and planner in robotics. In Finzi, A. and Orlandini, A., editors,
ICAPS Wksp. on Planning and Robotics, pages 12–19.
[183] Estlin, T. A., Chien, S., and Wang, X. (1997). An argument for a hybrid
HTN/operator-based approach to planning. In Proc. European Conf. on Plan-
ning (ECP), pages 184–196.
[184] Etzioni, O., Hanks, S., Weld, D. S., Draper, D., Lesh, N., and Williamson,
M. (1992). An approach to planning with incomplete information. In Proc. Intl.
Conf. on Principles of Knowledge Representation and Reasoning (KR), pages
115–125.
[185] Eyerich, P., Mattmuller, R., and Roger, G. (2009). Using the context-
enhanced additive heuristic for temporal and numeric planning. In Proc. ICAPS.
[186] Eyerich, P., Mattmuller, R., and Roger, G. (2012). Using the context-
enhanced additive heuristic for temporal and numeric planning. In Prassler,
E., Zollner, M., Bischoff, R., and Burgard, W., editors, Towards Service Robots
for Everyday Environments: Recent Advances in Designing Service Robots for
Complex Tasks in Everyday Environments, pages 49–64. Springer.
[187] Fargier, H., Jourdan, M., Layaa, N., and Vidal, T. (1998). Using temporal
constraint networks to manage temporal scenario of multimedia documents. In
ECAI Wksp. on Spatial and Temporal Reasoning.
[188] Feng, Z., Dearden, R., Meuleau, N., and Washington, R. (2004). Dynamic
programming for structured continuous Markov decision problems. In Proc.
AAAI, pages 154–161.
[189] Feng, Z. and Hansen, E. A. (2002). Symbolic heuristic search for factored
Markov decision processes. In Proc. AAAI, pages 455–460.
[190] Feng, Z., Hansen, E. A., and Zilberstein, S. (2002). Symbolic generalization
for on-line planning. In Proc. Conf. on Uncertainty in AI (UAI), pages 209–216.
[195] Fichtner, M., Großmann, A., and Thielscher, M. (2003). Intelligent execution monitoring in dynamic environments. Fundamenta Informaticae, 57(2-4):371–392.
[197] Fikes, R. E. and Nilsson, N. J. (1971). STRIPS: A new approach to the ap-
plication of theorem proving to problem solving. Artificial Intelligence, 2(3):189–208.
[198] Finzi, A., Pirri, F., and Reiter, R. (2000). Open world planning in the situa-
tion calculus. In Proc. AAAI, pages 754–760.
[200] Fisher, M., Gabbay, D. M., and Vila, L., editors (2005). Handbook of Temporal
Reasoning in Artificial Intelligence. Elsevier.
[201] Foka, A. and Trahanias, P. (2007). Real-time hierarchical POMDPs for au-
tonomous robot navigation. Robotics and Autonomous Systems, 55:561–571.
[204] Fox, M. and Long, D. (2003). PDDL2.1: An extension to PDDL for expressing
temporal planning domains. J. Artificial Intelligence Research, 20:61–124.
[207] Fraser, G., Steinbauer, G., and Wotawa, F. (2004). Plan execution in dy-
namic environments. In Proc. Intl. Cognitive Robotics Workshop, pages 208–217.
Springer.
[208] Fratini, S., Cesta, A., De Benedictis, R., Orlandini, A., and Rasconi, R.
(2011). APSI-based deliberation in goal oriented autonomous controllers. In
Symp. on Adv. in Space Technologies in Robotics and Automation (ASTRA).
[209] Fu, J., Ng, V., Bastani, F. B., and Yen, I.-L. (2011). Simple and fast strong
cyclic planning for fully-observable nondeterministic planning problems. In Proc.
IJCAI, pages 1949–1954.
[210] Fusier, F., Valentin, V., Bremond, F., Thonnat, M., Borg, M., Thirde, D.,
and Ferryman, J. (2007). Video understanding for complex activity recognition.
Machine Vision and Applications, 18(3-4):167–188.
[211] Garcia, C. E., Prett, D. M., and Morari, M. (1989). Model predictive control:
theory and practice – a survey. Automatica, 25(3):335–348.
[214] Garrido, A. (2002). A temporal planning system for level 3 durative actions of PDDL2.1. In AIPS Wksp. on Planning for Temporal Domains, pages 56–66.
[215] Geffner, H. (2000). Functional Strips: A more flexible language for planning
and problem solving. In Minker, J., editor, Logic-Based Artificial Intelligence,
pages 187–209. Kluwer.
[219] Gelly, S. and Silver, D. (2007). Combining online and offline knowledge in
UCT. In Proc. Intl. Conf. on Machine Learning (ICML), pages 273–280.
[220] Gerevini, A., Kuter, U., Nau, D. S., Saetti, A., and Waisbrot, N. (2008).
Combining domain-independent planning and HTN planning: The Duet planner.
In Proc. ECAI, pages 573–577.
[221] Gerevini, A., Saetti, A., and Serina, I. (2003). Planning through stochas-
tic local search and temporal action graphs in LPG. J. Artificial Intelligence
Research, 20:239–290.
[222] Gerevini, A., Saetti, A., and Serina, I. (2005). Integrating planning and
temporal reasoning for domains with durations and time windows. In Proc.
IJCAI, volume 19, pages 1226–1232.
[225] Gerevini, A. and Serina, I. (2002). LPG: A planner based on local search for
planning graphs. In Proc. AIPS, pages 968–973.
[227] Ghallab, M., Alami, R., and Chatila, R. (1987). Dealing with time in planning
and execution monitoring. In Bolles, R. and Roth, B., editors, Intl. Symp. on
Robotics Research (ISRR), pages 431–443. MIT Press.
[230] Ghallab, M., Nau, D., and Traverso, P. (2014). The actor's view of automated planning and acting: A position paper. Artificial Intelligence, 208:1–17.
[231] Ghallab, M., Nau, D. S., and Traverso, P. (2004). Automated Planning:
Theory and Practice. Morgan Kaufmann.
[237] Givan, R., Dean, T., and Greig, M. (2003). Equivalence notions and model
minimization in Markov decision processes. Artificial Intelligence, 142:163–223.
[238] Golden, K., Etzioni, O., and Weld, D. (1994). Omnipotence without omni-
science: Efficient sensor management for planning. In Proc. AAAI, pages 1048–1054.
[239] Goldman, R., Pelican, M., and Musliner, D. (1999). Hard real-time mode
logic synthesis for hybrid control: A CIRCA-based approach. In Hybrid Systems
and AI: Papers from the AAAI Spring Symp. AAAI Technical Report SS-99-05.
[240] Goldman, R. P., Musliner, D. J., Krebsbach, K. D., and Boddy, M. S. (1997).
Dynamic abstraction planning. In Proc. AAAI, pages 680–686. AAAI Press.
[241] Goldman, R. P., Musliner, D. J., and Pelican, M. J. (2000). Using model
checking to plan hard real-time controllers. In AIPS Wksp. on Model-Theoretic
Approaches to Planning.
[242] Golumbic, M. and Shamir, R. (1993). Complexity and algorithms for reason-
ing about time: a graph-theoretic approach. J. ACM, 40(5):1108–1133.
[243] Gonzalez-Ferrer, A., Fernandez-Olivares, J., Castillo, L., et al. (2009). JAB-
BAH: a java application framework for the translation between business process
models and HTN. In Proc. Intl. Competition on Knowledge Engineering for
Planning and Scheduling (ICKEPS).
[245] Gregory, P., Long, D., and Fox, M. (2007). A meta-CSP model for optimal
planning. In Abstraction, Reformulation, and Approximation, pages 200–214.
Springer.
[246] Gregory, P., Long, D., Fox, M., and Beck, J. C. (2012). Planning modulo
theories: Extending the planning paradigm. In Proc. ICAPS.
[248] Guestrin, C., Hauskrecht, M., and Kveton, B. (2004). Solving factored MDPs
with continuous and discrete variables. In Proc. Conf. on Uncertainty in AI
(UAI), pages 235–242.
[249] Guestrin, C., Koller, D., Parr, R., and Venkataraman, S. (2003). Efficient so-
lution algorithms for factored MDPs. J. Artificial Intelligence Research, 19:399–468.
[250] Guez, A. and Pineau, J. (2010). Multi-tasking SLAM. In IEEE Intl. Conf.
on Robotics and Automation (ICRA), pages 377–384.
[251] Hahnel, D., Burgard, W., and Lakemeyer, G. (1998). GOLEX: bridging the gap between logic (GOLOG) and a real robot. In Proc. Annual German Conf. on AI (KI), pages 165–176. Springer.
[252] Hanks, S. and Firby, R. J. (1990). Issues and architectures for planning and
execution. In Proc. Wksp. on Innovative Approaches to Planning, Scheduling
and Control, pages 59–70. Morgan Kaufmann.
[255] Hart, P. E., Nilsson, N. J., and Raphael, B. (1968). A formal basis for the
heuristic determination of minimum cost paths. IEEE Trans. Syst., Man, and
Cybernetics, pages 1556–1562.
[256] Hart, P. E., Nilsson, N. J., and Raphael, B. (1972). Correction to a formal
basis for the heuristic determination of minimum cost paths. ACM SIGART
Bulletin, 37:28–29.
[257] Hartanto, R. and Hertzberg, J. (2008). Fusing DL reasoning with HTN plan-
ning. In Proc. Annual German Conf. on AI (KI), pages 62–69. Springer.
[259] Haslum, P., Bonet, B., and Geffner, H. (2005). New admissible heuristics for
domain-independent planning. In Proc. AAAI.
[260] Haslum, P., Botea, A., Helmert, M., Bonet, B., and Koenig, S. (2007).
Domain-independent construction of pattern database heuristics for cost-optimal
planning. In Proc. AAAI, volume 7, pages 1007–1012.
[261] Haslum, P. and Geffner, H. (2000). Admissible heuristics for optimal plan-
ning. In Proc. AIPS, pages 140–149.
[262] Haslum, P. and Geffner, H. (2001). Heuristic planning with time and resources. In Proc. European Conf. on Planning (ECP), pages 121–132.
[263] Hauskrecht, M., Meuleau, N., Kaelbling, L. P., Dean, T., and Boutilier, C.
(1998). Hierarchical solution of Markov decision processes using macro-actions.
In Proc. Conf. on Uncertainty in AI (UAI), pages 220–229.
[265] Heintz, F., Kvarnstrom, J., and Doherty, P. (2010). Bridging the sense-
reasoning gap: DyKnow stream-based middleware for knowledge processing.
Advanced Engineering Informatics, 24(1):14–26.
[269] Helmert, M. and Domshlak, C. (2009). Landmarks, critical paths and ab-
stractions: What's the difference anyway? In Proc. ICAPS, pages 162–169.
[270] Helmert, M. and Geffner, H. (2008). Unifying the causal graph and additive
heuristics. In Proc. ICAPS, pages 140–147.
[271] Helmert, M., Haslum, P., and Hoffmann, J. (2007). Flexible abstraction
heuristics for optimal sequential planning. In Proc. ICAPS, pages 176–183.
[272] Helmert, M., Haslum, P., and Hoffmann, J. (2008). Explicit-state abstraction:
A new method for generating heuristic functions. In Proc. AAAI, pages 1547–1550.
[273] Helmert, M., Haslum, P., Hoffmann, J., and Nissim, R. (2014). Merge-and-
shrink abstraction: A method for generating lower bounds in factored state
spaces. J. ACM, 61(3):16.
[275] Hoey, J., St-Aubin, R., Hu, A., and Boutilier, C. (1999). SPUDD: Stochastic
planning using decision diagrams. In Proc. Conf. on Uncertainty in AI (UAI),
pages 279–288.
[278] Hoffmann, J. (2005). Where ignoring delete lists works: local search topol-
ogy in planning benchmarks. J. Artificial Intelligence Research, pages 685–758.
[280] Hoffmann, J. and Nebel, B. (2001). The FF planning system: Fast plan
generation through heuristic search. J. Artificial Intelligence Research, 14:253–302.
[281] Hoffmann, J., Porteous, J., and Sebastia, L. (2004). Ordered landmarks in
planning. J. Artificial Intelligence Research, 22:215–278.
[283] Hogg, C., Kuter, U., and Munoz-Avila, H. (2010). Learning methods to
generate good plans: Integrating HTN learning and reinforcement learning. In
Proc. AAAI.
[284] Hongeng, S., Nevatia, R., and Bremond, F. (2004). Video-based event recog-
nition: activity representation and probabilistic recognition methods. Computer
Vision and Image Understanding, 96(2):129–162.
[289] Huang, R., Chen, Y., and Zhang, W. (2009). An optimal temporally expres-
sive planner: Initial results and application to P2P network optimization. In
Proc. ICAPS.
[292] Ingham, M. D., Ragno, R. J., and Williams, B. C. (2001). A reactive model-
based programming language for robotic space explorers. In Intl. Symp. on
Artificial Intell., Robotics and Automation in Space (i-SAIRAS).
[293] Ingrand, F., Chatilla, R., Alami, R., and Robert, F. (1996). PRS: A high
level supervision and control language for autonomous mobile robots. In IEEE
Intl. Conf. on Robotics and Automation (ICRA), pages 43–49.
[295] Ivankovic, F., Haslum, P., Thiebaux, S., Shivashankar, V., and Nau, D.
(2014). Optimal planning with global numerical state constraints. In Proc.
ICAPS.
[296] Iwen, M. and Mali, A. D. (2002). Distributed graphplan. In IEEE Intl. Conf.
on Tools with AI (ICTAI), pages 138–145. IEEE.
[297] Jensen, R. and Veloso, M. (2000). OBDD-based universal planning for syn-
chronized agents in non-deterministic domains. J. Artificial Intelligence Re-
search, 13:189–226.
[298] Jensen, R., Veloso, M., and Bryant, R. (2003). Guided symbolic universal
planning. In Proc. ICAPS.
[300] Jimenez, S., de La Rosa, T., Fernandez, S., Fernandez, F., and Borrajo, D.
(2012). A review of machine learning for automated planning. The Knowledge
Engg. Review, 27(4):433–467.
[301] Jonsson, A. K., Morris, P. H., Muscettola, N., Rajan, K., and Smith, B. D.
(2000). Planning in interplanetary space: Theory and practice. In AIPS, pages
177–186.
[302] Jonsson, P., Drakengren, T., and Backstrom, C. (1999). Computational com-
plexity of relating time points and intervals. Artificial Intelligence, 109:273–295.
[303] Judah, K., Fern, A. P., and Dietterich, T. G. (2012). Active imitation learning
via reduction to IID active learning. In Proc. Conf. on Uncertainty in AI (UAI),
pages 428–437.
[304] Kabanza, F., Barbeau, M., and St-Denis, R. (1997). Planning control rules
for reactive agents. Artificial Intelligence, 95(1):67–113.
[305] Kaelbling, L. P., Littman, M. L., and Cassandra, A. (1998). Planning and
acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134.
[310] Kambhampati, S., Knoblock, C. A., and Yang, Q. (1995). Planning as refine-
ment search: A unified framework for evaluating design tradeoffs in partial-order
planning. Artificial Intelligence, 76(1-2):167–238.
[311] Kambhampati, S. and Nau, D. S. (1996). On the nature and role of modal
truth criteria in planning. Artificial Intelligence, 82(2).
[315] Karlsson, L., Bouguerra, A., Broxvall, M., Coradeschi, S., and Saffiotti, A.
(2008). To secure an anchor – a recovery planning approach to ambiguity in perceptual anchoring. AI Communications, 21(1):1–14.
[317] Karpas, E., Wang, D., Williams, B. C., and Haslum, P. (2015). Temporal
landmarks: What must happen, and when. In Proc. ICAPS.
[320] Kautz, H. and Allen, J. (1986). Generalized plan recognition. In Proc. AAAI,
pages 32–37.
[321] Kautz, H., McAllester, D., and Selman, B. (1996). Encoding plans in propo-
sitional logic. In Proc. Intl. Conf. on Principles of Knowledge Representation
and Reasoning (KR), pages 374–384.
[323] Kautz, H. and Selman, B. (1996). Pushing the envelope: Planning, proposi-
tional logic, and stochastic search. In Proc. AAAI, pages 1194–1201.
[324] Kautz, H. A., Thomas, W., and Vardi, M. Y., editors (2006). Synthesis and
Planning, Dagstuhl Seminar Proceedings.
[325] Kearns, M., Mansour, Y., and Ng, A. (2002). A sparse sampling algorithm
for near-optimal planning in large Markov decision processes. Machine Learning,
49:193–208.
[328] Khatib, L., Morris, P., Morris, R., and Rossi, F. (2001). Temporal constraint
reasoning with preferences. In Proc. IJCAI.
[330] Kiesel, S. and Ruml, W. (2014). Planning under temporal uncertainty using
hindsight optimization. In ICAPS Wksp. on Planning and Robotics, pages 1–11.
[332] Knight, R., Rabideau, G., Chien, S., Engelhardt, B., and Sherwood, R.
(2001). Casper: space exploration through continuous planning. IEEE Intel-
ligent Systems, 16(5):70–75.
[341] Koehler, J. (1999). Handling of conditional effects and negative goals in IPP.
Technical Report 128, Albert-Ludwigs-Universität Freiburg.
[343] Koenig, S. and Simmons, R. (1998). Solving robot navigation problems with
initial pose uncertainty using real-time heuristic search. In Proc. AIPS.
[346] Kolobov, A., Mausam, and Weld, D. (2010). SixthSense: Fast and reliable
recognition of dead ends in MDPs. In Proc. AAAI.
[347] Kolobov, A., Mausam, and Weld, D. (2012). Stochastic shortest path MDPs
with dead ends. Proc. ICAPS Workshop HSDIP.
[348] Kolobov, A., Mausam, Weld, D., and Geffner, H. (2011). Heuristic search for
generalized stochastic shortest path MDPs. In Proc. ICAPS.
[349] Kolobov, A. and Weld, D. (2009). ReTrASE: integrating paradigms for ap-
proximate probabilistic planning. In Proc. IJCAI.
[355] Kruger, V., Kragic, D., Ude, A., and Geib, C. (2007). The meaning of action:
a review on action recognition and mapping. Advanced Robotics, 21(13):1473–1501.
[356] Kumar, V. and Kanal, L. (1983a). The composite decision process: A unifying
formulation for heuristic search, dynamic programming and branch and bound
procedures. In Proc. AAAI, pages 220–224.
[357] Kumar, V. and Kanal, L. (1983b). A general branch and bound formulation
for understanding and synthesizing and/or tree search procedures. Artificial
Intelligence, pages 179–198.
[358] Kupferman, O., Madhusudan, P., Thiagarajan, P. S., and Vardi, M. Y. (2000).
Open systems in reactive environments: Control and synthesis. In Proc. Intl.
Conf. on Concurrency Theory (CONCUR), pages 92–107.
[361] Kuter, U. and Nau, D. (2005). Using domain-configurable search control for
probabilistic planning. In Proc. AAAI, pages 1169–1174.
[362] Kuter, U., Nau, D. S., Pistore, M., and Traverso, P. (2005). A hierarchical
task-network planner based on symbolic model checking. In Proc. ICAPS, pages
300–309.
[363] Kuter, U., Nau, D. S., Pistore, M., and Traverso, P. (2009). Task decompo-
sition on abstract states, for planning under nondeterminism. Artificial Intelli-
gence, 173:669–695.
[364] Kuter, U., Nau, D. S., Reisner, E., and Goldman, R. (2008). Using classical
planners to solve nondeterministic planning problems. In Proc. ICAPS, pages
190–197.
[365] Kuter, U., Sirin, E., Nau, D. S., Parsia, B., and Hendler, J. (2004). Informa-
tion gathering during planning for web service composition. In McIlraith, S. A.,
Plexousakis, D., and van Harmelen, F., editors, Proc. Intl. Semantic Web Conf.
(ISWC), volume 3298 of LNCS, pages 335–349. Springer.
[367] Kvarnstrom, J., Doherty, P., and Haslum, P. (2000). Extending TALplanner
with concurrency and resources. In Proc. European Conf. on Planning (ECP).
[368] Kveton, B., Hauskrecht, M., and Guestrin, C. (2006). Solving factored MDPs
with hybrid state and action variables. J. Artificial Intelligence Research, 27:153–201.
[370] Laborie, P. and Ghallab, M. (1995). Planning with sharable resource con-
straints. In Proc. IJCAI, pages 1643–1649.
[371] Laird, J., Rosenbloom, P., and Newell, A. (2012). Universal Subgoaling and
Chunking: The Automatic Generation and Learning of Goal Hierarchies, vol-
ume 11. Springer Science & Business Media.
[374] Le Guillou, X., Cordier, M.-O., Robin, S., Roze, L., et al. (2008). Chronicles
for on-line diagnosis of distributed systems. In Proc. ECAI, volume 8, pages
194–198.
[376] Lemaignan, S., Espinoza, R. R., Mosenlechner, L., Alami, R., and Beetz, M.
(2010). ORO, a knowledge management platform for cognitive architectures in
robotics. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Syst. (IROS).
[377] Lesperance, Y., Levesque, H. J., Lin, F., Marcus, D., Reiter, R., and Scherl, R. (1994). A logical approach to high-level robot programming – a progress report. In
AAAI Fall Symp. on Control of the Physical World by Intelligent Agents. AAAI
Technical Report FS-94-03.
[378] Levesque, H., Reiter, R., Lesperance, Y., Lin, F., and Scherl, R. (1997a).
GOLOG: A logic programming language for dynamic domains. J. Logic Pro-
gramming, 31:59–84.
[379] Levesque, H. J., Reiter, R., Lesperance, Y., Lin, F., and Scherl, R. (1997b).
GOLOG: a logic programming language for dynamic domains. J. Logic Progr.,
31:59–83.
[381] Li, H. X. and Williams, B. C. (2008). Generative planning for hybrid systems
based on flow tubes. In Proc. ICAPS, pages 206–213.
[382] Liaskos, S., McIlraith, S. A., Sohrabi, S., and Mylopoulos, J. (2010). In-
tegrating preferences into goal models for requirements engineering. In Intl.
Requirements Engg. Conf., pages 135–144.
[386] Likhachev, M., Gordon, G. J., and Thrun, S. (2004). Planning for Markov
decision processes with sparse stochasticity. In Adv. in Neural Information Pro-
cessing Syst. (Proc. NIPS), volume 17.
[387] Lin, S. (1965). Computer solutions of the traveling salesman problem. Bell
System Technical Journal, 44(10):2245–2269.
[388] Little, I., Aberdeen, D., and Thiebaux, S. (2005). Prottle: A probabilistic
temporal planner. In Proc. AAAI, pages 1181–1186.
[390] Liu, Y. and Koenig, S. (2006). Functional value iteration for decision-theoretic
planning with general utility functions. In Proc. AAAI.
[391] Lohr, J., Eyerich, P., Keller, T., and Nebel, B. (2012). A planning based
framework for controlling hybrid systems. In Proc. ICAPS.
[392] Lohr, J., Eyerich, P., Winkler, S., and Nebel, B. (2013). Domain predictive
control under uncertain numerical state information. In Proc. ICAPS.
[393] Long, D. and Fox, M. (1999). Efficient implementation of the plan graph in
STAN. J. Artificial Intelligence Research, 10(1-2):87–115.
[394] Long, D. and Fox, M. (2003a). The 3rd international planning competition:
Results and analysis. J. Artificial Intelligence Research, 20:1–59.
[397] Lotem, A., Nau, D. S., and Hendler, J. (1999). Using planning graphs for
solving HTN problems. In Proc. AAAI, pages 534–540.
[399] Maliah, S., Brafman, R., Karpas, E., and Shani, G. (2014). Partially observ-
able online contingent planning using landmark heuristics. In Proc. ICAPS.
[400] Malik, J. and Binford, T. (1983). Reasoning in time and space. In Proc.
IJCAI, pages 343–345.
[401] Mansouri, M. and Pecora, F. (2016). Robot waiters: A case for hybrid reason-
ing with different types of knowledge. J. Experimental & Theoretical Artificial
Intelligence.
[402] Marthi, B., Russell, S., and Wolfe, J. (2007). Angelic semantics for high-level
actions. In Proc. ICAPS.
[403] Marthi, B., Russell, S., and Wolfe, J. (2008). Angelic hierarchical planning:
Optimal and online algorithms. In Proc. ICAPS, pages 222–231.
[404] Marthi, B. M., Russell, S. J., Latham, D., and Guestrin, C. (2005). Concur-
rent hierarchical reinforcement learning. In Proc. AAAI, page 1652.
[405] Mattmuller, R., Ortlieb, M., Helmert, M., and Bercher, P. (2010). Pat-
tern database heuristics for fully observable nondeterministic planning. In Proc.
ICAPS, pages 105–112.
[406] Mausam, Bertoli, P., and Weld, D. (2007). A hybridized planner for stochastic
domains. In Proc. IJCAI, pages 1972–1978.
[407] Mausam and Kolobov, A. (2012). Planning with Markov Decision Processes:
An AI Perspective. Morgan & Claypool.
[409] Mausam and Weld, D. (2006). Probabilistic temporal planning with uncertain
durations. In Proc. AAAI, pages 880–887.
[410] Mausam and Weld, D. (2008). Planning with durative actions in stochastic
domains. J. Artificial Intelligence Research, 31(1):33–82.
[414] McDermott, D. (1982). A temporal logic for reasoning about processes and
plans. Cognitive Science, 6:101–155.
[419] Meiri, I. (1990). Faster constraint satisfaction algorithms for temporal rea-
soning. Tech. report R-151, UC Los Angeles.
[420] Meuleau, N., Benazera, E., Brafman, R. I., and Hansen, E. A. (2009). A
heuristic search approach to planning with continuous resources in stochastic
domains. J. Artificial Intelligence Research, 34(1):27.
[422] Miguel, I., Jarvis, P., and Shen, Q. (2000). Flexible graphplan. In Proc.
ECAI, pages 506–510.
[423] Minton, S., Bresina, J., and Drummond, M. (1991). Commitment strategies
in planning: A comparative analysis. In Proc. IJCAI, pages 259–265.
[424] Minton, S., Drummond, M., Bresina, J., and Philips, A. (1992). Total order
vs. partial order planning: Factors influencing performance. In Proc. Intl. Conf.
on Principles of Knowledge Representation and Reasoning (KR), pages 83–92.
[425] Mitten, L. G. (1970). Branch and bound methods: General formulations and
properties. Operations Research, 18:23–34.
[426] Moeslund, T. B., Hilton, A., and Kruger, V. (2006). A survey of advances in
vision-based human motion capture and analysis. Computer Vision and Image
Understanding, 104(2-3):90–126.
[429] Molineaux, M., Klenk, M., and Aha, D. (2010). Goal-driven autonomy in a
Navy strategy simulation. In Proc. AAAI, pages 1548–1554.
[433] Morris, P., Muscettola, N., and Vidal, T. (2001). Dynamic control of plans
with temporal uncertainty. In Proc. IJCAI, pages 494–502.
[435] Muise, C., McIlraith, S. A., and Belle, V. (2014). Non-deterministic planning
with conditional effects. In Proc. ICAPS.
[436] Muise, C. J., McIlraith, S. A., and Beck, J. C. (2012). Improved non-
deterministic planning by exploiting state relevance. In Proc. ICAPS.
[438] Munoz-Avila, H., Aha, D. W., Nau, D. S., Weber, R., Breslow, L., and Ya-
man, F. (2001). SiN: Integrating case-based reasoning with task decomposition.
In Proc. IJCAI.
[439] Muscettola, N., Dorais, G., Fry, C., Levinson, R., and Plaunt, C. (2002).
IDEA: Planning at the core of autonomous reactive agents. In Intl. Wksp. on
Planning and Scheduling for Space (IWPSS).
[441] Muscettola, N., Nayak, P. P., Pell, B., and Williams, B. C. (1998b). Remote
Agent: To boldly go where no AI system has gone before. Artificial Intelligence,
103:5–47.
[443] Nareyek, A., Freuder, E. C., Fourer, R., Giunchiglia, E., Goldman, R. P.,
Kautz, H., Rintanen, J., and Tate, A. (2005). Constraints and AI planning.
IEEE Intelligent Systems, 20(2):62–72.
[444] Nau, D. S., Au, T.-C., Ilghami, O., Kuter, U., Munoz-Avila, H., Murdock,
J. W., Wu, D., and Yaman, F. (2005). Applications of SHOP and SHOP2. IEEE
Intelligent Systems, 20(2):34–41.
[445] Nau, D. S., Au, T.-C., Ilghami, O., Kuter, U., Murdock, J. W., Wu, D., and
Yaman, F. (2003). SHOP2: An HTN planning system. J. Artificial Intelligence
Research, 20:379–404.
[446] Nau, D. S., Cao, Y., Lotem, A., and Munoz-Avila, H. (1999). SHOP: Simple
hierarchical ordered planner. In Proc. IJCAI, pages 968–973.
[447] Nau, D. S., Kumar, V., and Kanal, L. N. (1984). General branch and bound,
and its relation to A* and AO*. Artificial Intelligence, 23(1):29–58.
[448] Nau, D. S., Munoz-Avila, H., Cao, Y., Lotem, A., and Mitchell, S. (2001).
Total-order planning with partially ordered subtasks. In Proc. IJCAI.
[450] Newell, A. and Ernst, G. (1965). The search for generality. In Proc. IFIP
Congress, volume 65, pages 17–24.
[451] Newell, A. and Simon, H. A. (1963). GPS, a program that simulates human
thought. In Feigenbaum, E. A. and Feldman, J. A., editors, Computers and
Thought. McGraw-Hill.
[452] Newton, M. A. H., Levine, J., Fox, M., and Long, D. (2007). Learning macro-
actions for arbitrary planners and domains. In Proc. ICAPS.
[453] Ng, A. and Jordan, M. (2000). PEGASUS: a policy search method for large
MDPs and POMDPs. In Proc. Conf. on Uncertainty in AI (UAI), pages 406–415.
[456] Nieuwenhuis, R., Oliveras, A., and Tinelli, C. (2006). Solving SAT and SAT
modulo theories: From an abstract Davis-Putnam-Logemann-Loveland proce-
dure to DPLL(T). J. ACM, 53(6):937–977.
[458] Nilsson, M., Kvarnstrom, J., and Doherty, P. (2014a). EfficientIDC: A faster
incremental dynamic controllability algorithm. In Proc. ICAPS.
[459] Nilsson, M., Kvarnstrom, J., and Doherty, P. (2014b). Incremental dynamic
controllability in cubic worst-case time. In Intl. Symp. on Temporal Representa-
tion and Reasoning.
[460] Nilsson, N. (1980). Principles of Artificial Intelligence. Morgan Kaufmann.
[461] Oaksford, M. and Chater, N. (2007). Bayesian Rationality: The Probabilistic Approach to Human Reasoning. Oxford Univ. Press.
[462] Oates, T. and Cohen, P. R. (1996). Learning planning operators with condi-
tional and probabilistic effects. In Proc. AAAI Spring Symposium on Planning
with Incomplete Information for Robot Problems, pages 86–94.
[463] Ong, S. C. W., Png, S. W., Hsu, D., and Lee, W. S. (2010). Planning under
uncertainty for robotic tasks with mixed observability. Intl. J. Robotics Research,
29(8):1053–1068.
[464] Papadimitriou, C. (1994). Computational Complexity. Addison Wesley.
[465] Parr, R. and Russell, S. J. (1998). Reinforcement learning with hierarchies of
machines. In Adv. in Neural Information Processing Syst. (Proc. NIPS), pages
1043–1049.
[466] Pasula, H., Zettlemoyer, L. S., and Kaelbling, L. P. (2004). Learning proba-
bilistic relational planning rules. In Proc. ICAPS, pages 73–82.
[467] Pearl, J. (1984). Heuristics: Intelligent Search Strategies for Computer Prob-
lem Solving. Addison-Wesley.
[468] Pecora, F., Cirillo, M., Dell'Osa, F., Ullberg, J., and Saffiotti, A. (2012). A constraint-based approach for proactive, context-aware human support. J. Ambient Intell. and Smart Environments, 4(4):347–367.
[469] Pednault, E. (1988). Synthesizing plans that contain actions with context-
dependent effects. Computational Intelligence, 4:356–372.
[470] Pednault, E. P. (1989). ADL: Exploring the middle ground between STRIPS
and the situation calculus. In Proc. Intl. Conf. on Principles of Knowledge
Representation and Reasoning (KR), pages 324–332.
[471] Penberthy, J. and Weld, D. S. (1994). Temporal planning with continuous
change. In Proc. AAAI, pages 1010–1015.
[472] Penberthy, J. S. and Weld, D. (1992). UCPOP: A sound, complete, par-
tial order planner for ADL. In Proc. Intl. Conf. on Principles of Knowledge
Representation and Reasoning (KR).
[473] Penna, G. D., Intrigila, B., Magazzeni, D., and Mercorio, F. (2009). UPMurphi: a tool for universal planning on PDDL+ problems. In Proc. ICAPS, pages 19–23.
[475] Peters, J. and Ng, A. Y. (2009). Special issue on robot learning. Autonomous
Robots, 27(1-2).
[480] Pineau, J., Gordon, G. J., and Thrun, S. (2002). Policy-contingent abstrac-
tion for robust robot control. In Proc. Conf. on Uncertainty in AI (UAI), pages
477–484.
[481] Pineau, J., Montemerlo, M., Pollack, M. E., Roy, N., and Thrun, S. (2003).
Towards robotic assistants in nursing homes: Challenges and results. Robotics
and Autonomous Systems, 42(3-4):271–281.
[482] Pistore, M., Bettin, R., and Traverso, P. (2001). Symbolic techniques for
planning with extended goals in non-deterministic domains. In Proc. European
Conf. on Planning (ECP), LNAI. Springer.
[483] Pistore, M., Spalazzi, L., and Traverso, P. (2006). A minimalist approach to
semantic annotations for web processes compositions. In Euro. Semantic Web
Conf. (ESWC), pages 620–634.
[484] Pistore, M. and Traverso, P. (2001). Planning as model checking for extended
goals in non-deterministic domains. In Proc. IJCAI, pages 479–484. Morgan
Kaufmann.
[486] Pistore, M., Traverso, P., and Bertoli, P. (2005). Automated composition of
web services by planning in asynchronous domains. In Proc. ICAPS, pages 2–11.
[490] Pnueli, A. and Rosner, R. (1990). Distributed reactive systems are hard to
synthesize. In 31st Annual Symposium on Foundations of Computer Science,
pages 746–757.
[491] Pohl, I. (1970). Heuristic search viewed as path finding in a graph. Artificial
Intelligence, 1(3):193–204.
[492] Pollack, M. E. and Horty, J. F. (1999). There's more to life than making plans: Plan management in dynamic, multiagent environments. AI Magazine, 20(4):1–14.
[493] Porteous, J., Sebastia, L., and Hoffmann, J. (2001). On the extraction, order-
ing, and usage of landmarks in planning. In Proc. European Conf. on Planning
(ECP).
[494] Powell, J., Molineaux, M., and Aha, D. (2011). Active and interactive dis-
covery of goal selection knowledge. In FLAIRS.
[495] Prentice, S. and Roy, N. (2009). The belief roadmap: Efficient planning in
belief space by factoring the covariance. IJRR, 28(11-12):1448–1465.
[496] Pryor, L. and Collins, G. (1996). Planning for contingency: A decision based
approach. J. Artificial Intelligence Research, 4:81–120.
[498] Py, F., Rajan, K., and McGann, C. (2010). A systematic agent framework
for situated autonomous systems. In Proc. AAMAS, pages 583–590.
[500] Quiniou, R., Cordier, M.-O., Carrault, G., and Wang, F. (2001). Applica-
tion of ILP to cardiac arrhythmia characterization for chronicle recognition. In
Inductive Logic Programming, pages 220–227. Springer.
[501] Rabideau, G., Knight, R., Chien, S., Fukunaga, A., and Govindjee, A. (1999).
Iterative repair planning for spacecraft operations in the ASPEN system. In Intl.
Symp. on Artificial Intell., Robotics and Automation in Space (i-SAIRAS).
[503] Rajan, K. and Py, F. (2012). T-REX: Partitioned inference for AUV mis-
sion control. In Roberts, G. N. and Sutton, R., editors, Further Advances in
Unmanned Marine Vehicles, pages 171–199. The Institution of Engg. and Tech-
nology.
[504] Rajan, K., Py, F., and Barreiro, J. (2012). Towards deliberative control in
marine robotics. In Marine Robot Autonomy, pages 91–175. Springer.
[506] Ramírez, M., Yadav, N., and Sardina, S. (2013). Behavior composition as
fully observable non-deterministic planning. In Proc. ICAPS.
[508] Reingold, E., Nievergelt, J., and Deo, N. (1977). Combinatorial Algorithms:
Theory and Practice. Prentice Hall.
[509] Richter, S., Helmert, M., and Westphal, M. (2008). Landmarks revisited. In
Proc. AAAI, volume 8, pages 975–982.
[510] Richter, S. and Westphal, M. (2010). The LAMA planner: Guiding cost-based
anytime planning with landmarks. J. Artificial Intelligence Research, 39(1):127–
177.
[515] Rodríguez-Moreno, M. D., Oddi, A., Borrajo, D., and Cesta, A. (2006). IPSS:
A hybrid approach to planning and scheduling integration. IEEE Trans. Knowl-
edge and Data Engg. (TKDE), 18(12):1681–1695.
[516] Ross, S., Pineau, J., Paquet, S., and Chaib-Draa, B. (2008). Online planning
algorithms for POMDPs. J. Artificial Intelligence Research, 32:663–704.
[518] Rybski, P. E., Yoon, K., Stolarz, J., and Veloso, M. M. (2007). Interactive
robot task training through dialog and demonstration. In Conference on Human-
Robot Interaction, pages 49–56.
[520] Sacerdoti, E. (1975). The nonlinear nature of plans. In Proc. IJCAI, pages
206–214. Reprinted in [15], pp. 162–170.
[521] Samadi, M., Kollar, T., and Veloso, M. (2012). Using the Web to interactively
learn to find objects. In Proc. AAAI, pages 2074–2080.
[527] Scherrer, B. and Lesner, B. (2012). On the use of non-stationary policies for
stationary infinite-horizon Markov decision processes. In Adv. in Neural Infor-
mation Processing Syst. (Proc. NIPS), pages 1826–1834.
[528] Schultz, D. G. and Melsa, J. L. (1967). State Functions and Linear Control
Systems. McGraw-Hill.
[529] Shah, M., Chrpa, L., Jimoh, F., Kitchin, D., McCluskey, T., Parkinson, S.,
and Vallati, M. (2013). Knowledge engineering tools in planning: State-of-the-art
and future challenges. In ICAPS Knowledge Engg. for Planning and Scheduling
(KEPS), pages 53–60.
[530] Shani, G., Pineau, J., and Kaplow, R. (2012). A survey of point-based
POMDP solvers. J. Autonomous Agents and Multi-Agent Syst., pages 1–51.
[531] Shaparau, D., Pistore, M., and Traverso, P. (2006). Contingent planning with
goal preferences. In Proc. AAAI, pages 927–935.
[532] Shaparau, D., Pistore, M., and Traverso, P. (2008). Fusing procedural and
declarative planning goals for nondeterministic domains. In Proc. AAAI, pages
983–990.
[533] Shivashankar, V., Alford, R., Kuter, U., and Nau, D. (2013). The GoDeL
planning system: A more perfect union of domain-independent and hierarchical
planning. In Proc. IJCAI, pages 2380–2386.
[536] Shoham, Y. (1987). Temporal logic in AI: semantical and ontological consid-
erations. Artificial Intelligence, 33:89–104.
[537] Sigaud, O. and Peters, J. (2010). From Motor Learning to Interaction Learn-
ing in Robots, volume 264 of Studies in Computational Intelligence. Springer.
[540] Simmons, R. (1994). Structured control for autonomous robots. IEEE Trans.
Robotics and Automation, 10(1):34–43.
[542] Simpkins, C., Bhat, S., Isbell, Jr., C., and Mateas, M. (2008). Towards
adaptive programming: integrating reinforcement learning into a programming
language. In Proc. ACM SIGPLAN Conf. on Object-Oriented Progr. Syst., Lang.,
and Applications (OOPSLA), pages 603–614. ACM.
[543] Simpson, R. M., Kitchin, D. E., and McCluskey, T. (2007). Planning domain
definition using GIPO. The Knowledge Engineering Review, 22(2):117–134.
[544] Sirin, E., Parsia, B., Wu, D., Hendler, J., and Nau, D. S. (2004). HTN
planning for Web service composition using SHOP2. J. Web Semant. (JWS),
1(4):377–396.
[545] Smith, D. E., Frank, J., and Cushing, W. (2008). The ANML language.
ICAPS Wksp. on Knowledge Engg. for Planning and Scheduling (KEPS).
[546] Smith, D. E., Frank, J., and Jonsson, A. K. (2000). Bridging the gap between
planning and scheduling. The Knowledge Engineering Review, 15(1):47–83.
[547] Smith, D. E. and Weld, D. (1999a). Temporal planning with mutual exclusion
reasoning. In Proc. IJCAI.
[550] Smith, S. J. J., Hebbar, K., Nau, D. S., and Minis, I. (1997). Integrating
electrical and mechanical design and process planning. In Mäntylä, M., Finger, S.,
and Tomiyama, T., editors, Knowledge Intensive CAD, pages 269–288. Chapman
and Hall.
[551] Smith, S. J. J., Nau, D. S., and Throop, T. (1998). Computer bridge: A big
win for AI planning. AI Magazine, 19(2):93–105.
[552] Smith, T. and Simmons, R. (2004). Heuristic search value iteration for
POMDPs. In Proc. Conf. on Uncertainty in AI (UAI).
[553] Sohrabi, S., Baier, J. A., and McIlraith, S. A. (2009). HTN planning with
preferences. In Proc. IJCAI, pages 1790–1797.
[555] Sridharan, M., Wyatt, J. L., and Dearden, R. (2008). HiPPo: Hierarchical
POMDPs for planning information processing and sensing actions on a robot. In
Proc. ICAPS, pages 346–354.
[558] Stulp, F. and Beetz, M. (2008). Refining the execution of abstract actions
with learned action models. J. Artificial Intelligence Research, 32(1):487–523.
[560] Tarjan, R. E. (1972). Depth-first search and linear graph algorithms. SIAM
J. Computing, 1(2):146–160.
[561] Tate, A. (1977). Generating project networks. In Proc. IJCAI, pages 888–893.
[562] Tate, A., Drabble, B., and Kirby, R. (1994). O-Plan2: An Architecture for
Command, Planning and Control. Morgan Kaufmann.
[563] Téglás, E., Vul, E., Girotto, V., Gonzalez, M., Tenenbaum, J. B., and Bonatti,
L. L. (2011). Pure reasoning in 12-month-old infants as probabilistic inference.
Science, 332(6033):1054–1059.
[566] Teichteil-Königsbuch, F., Infantes, G., and Kuter, U. (2008). RFF: A robust,
FF-based MDP planning algorithm for generating policies with low probability
of failure. In Proc. ICAPS.
[568] Teichteil-Königsbuch, F., Vidal, V., and Infantes, G. (2011). Extending classi-
cal planning heuristics to probabilistic planning with dead-ends. In Proc. AAAI,
pages 1–6.
[570] Traverso, P., Veloso, M., and Giunchiglia, F., editors (2000). AIPS Wksp. on
Model-Theoretic Approaches to Planning.
[571] van den Briel, M., Vossen, T., and Kambhampati, S. (2005). Reviving integer
programming approaches for AI planning: A branch-and-cut framework. In Proc.
ICAPS, pages 310–319.
[572] van den Briel, M., Vossen, T., and Kambhampati, S. (2008). Loosely coupled
formulations for automated planning: An integer programming perspective. J.
Artificial Intelligence Research, 31:217–257.
[573] Vaquero, T. S., Romero, V., Tonidandel, F., and Silva, J. R. (2007). itSIM-
PLE 2.0: An integrated tool for designing planning domains. In Proc. ICAPS,
pages 336–343.
[574] Vaquero, T. S., Silva, J. R., and Beck, J. C. (2011). A brief review of tools
and methods for knowledge engineering for planning & scheduling. In ICAPS
Knowledge Engg. for Planning and Scheduling (KEPS), pages 7–15.
[591] Vu, V.-T., Brémond, F., and Thonnat, M. (2003). Automatic video inter-
pretation: A novel algorithm for temporal scenario recognition. In Proc. IJCAI,
pages 1295–1300.
[592] Waibel, M., Beetz, M., Civera, J., D'Andrea, R., Elfring, J., Gálvez-López,
D., Häussermann, K., Janssen, R., Montiel, J. M. M., Perzylo, A., Schießle, B.,
Tenorth, M., Zweigle, O., and van de Molengraft, R. (2011). RoboEarth. IEEE
Robotics and Automation Magazine, 18(2):69–82.
[595] Wang, F. Y., Kyriakopoulos, K. J., Tsolkas, A., and Saridis, G. N. (1991). A
Petri-net coordination model for an intelligent mobile robot. IEEE Trans. Syst.,
Man, and Cybernetics, 21(4):777–789.
[597] Weir, A. A. S., Chappell, J., and Kacelnik, A. (2002). Shaping of hooks in
New Caledonian crows. Science, 297(5583):981.
[600] Weld, D. S., Anderson, C. R., and Smith, D. E. (1998). Extending Graphplan
to handle uncertainty and sensing actions. In Proc. AAAI, pages 897–904.
[601] Weld, D. S. and Etzioni, O. (1994). The first law of robotics (a call to arms).
In Proc. AAAI, pages 1042–1047.
[602] Wilkins, D. (2000). Using the SIPE-2 planning system: A manual for version
6.1. Technical report, SRI International.
[608] Wilson, A., Fern, A. P., and Tadepalli, P. (2012). A Bayesian approach for
policy learning from trajectory preference queries. In Adv. in Neural Information
Processing Syst. (Proc. NIPS), pages 1142–1150.
[611] Wongpiromsarn, T., Topcu, U., Ozay, N., Xu, H., and Murray, R. M. (2011).
TuLiP: a software toolbox for receding horizon temporal logic planning. In 14th
Intl. Conf. on Hybrid Syst.: Computation and Control, pages 313–314. ACM.
[613] Xie, F., Müller, M., and Holte, R. (2015). Understanding and improving local
exploration for GBFS. In Proc. ICAPS.
[614] Xu, Y., Fern, A., and Yoon, S. W. (2007). Discriminative learning of beam-
search heuristics for planning. In Proc. IJCAI.
[617] Yang, Q., Wu, K., and Jiang, Y. (2007). Learning action models from plan
examples using weighted MAX-SAT. Artificial Intelligence, 171(2):107–143.
[618] Yoon, S., Fern, A., and Givan, R. (2006). Learning heuristic functions from
relaxed plans. In Proc. ICAPS.
[619] Yoon, S., Fern, A. P., and Givan, R. (2007). FF-replan: A baseline for
probabilistic planning. In Proc. ICAPS, pages 352–359.
[620] Yoon, S., Fern, A. P., Givan, R., and Kambhampati, S. (2008). Probabilistic
planning via determinization in hindsight. In Proc. AAAI.
[622] Younes, H. and Simmons, R. (2002). On the role of ground actions in refine-
ment planning. In Proc. AIPS, pages 54–62.
[626] Zhuo, H. H., Hu, D. H., Hogg, C., Yang, Q., and Muñoz-Avila, H. (2009).
Learning HTN method preconditions and action models from partial observa-
tions. In Proc. IJCAI, pages 1804–1810.
[627] Zhuo, H. H., Yang, Q., Hu, D. H., and Li, L. (2010). Learning complex
action models with quantifiers and logical implications. Artificial Intelligence,
174(18):1540–1569.
[629] Ziparo, V. A., Iocchi, L., Lima, P. U., Nardi, D., and Palamara, P. F. (2011).
Petri net plans. J. Autonomous Agents and Multi-Agent Syst., 23(3):344–383.
target
of an assignment statement, 96
of an atom, 33
of an effect, 33
task, 94, 97
refined, 157
temporal, 154
temporal assertion, 147
a priori supported, 158
causally supported, 151
change, 147
persistence, 147
possibly conflicting, 149
temporal network, 170