Wcci 14 S
Abstract—By exchanging propositional constraint information in large state spaces, agents can implicitly reduce the state space, a feature that is particularly attractive for Reinforcement Learning approaches. This paper proposes a learning technique that combines a Reinforcement Learning algorithm and a planner for propositionally constrained state spaces, autonomously helping agents to implicitly reduce the state space towards possible plans that lead to the goal and avoid irrelevant or inadequate states. State space constraints are communicated among the agents using a common constraint set based on extended reachability goals. A performance evaluation against standard Reinforcement Learning techniques showed that extending autonomous learning with propositional constraints updated along the learning process can produce faster convergence to optimal policies, due to early state space reduction caused by shared information on state space constraints.

Index Terms—Multi-Agent, Cooperative Agents, Reinforcement Learning, Q-Learning, Planning, Markov Decision Processes, Extended Reachability Goals.

I. Introduction

In recent years, concepts such as decentralization and autonomy have concentrated much of the attention of researchers in different areas of Computer Science. The advantages of decentralized and autonomous software are well known and have been demonstrated in numerous applications. As far as decision making is concerned, there are acknowledged advantages in considering environments with individuals that cooperate with each other, trying to achieve a single objective or independent goals. Panait and Luke [1] presented an extensive study of learning systems with cooperative agents, with examples from Evolutionary Computation, Robotics, Reinforcement Learning, etc.

In the context of autonomous learning, this paper considers an extended form of the standard Markov Decision Process (MDP) [2], [3], [4] with propositional constraints on the state space that are communicated among agents as the learning goes on.

In a standard MDP, a state transition model is assumed to be known beforehand, but this is not always the case in real-world applications [5]. Reinforcement Learning (RL) is a research area that addresses this issue, focusing on the agent's interaction with the environment as a mechanism for gathering information about the domain structure. Q-Learning [6] is a typical RL algorithm that inherits a notorious difficulty of RL algorithms in general in dealing with large state spaces, due to the need to balance the control objective against the model-free estimation of the domain structure based solely on exploration. Here, we propose a technique that incorporates propositional constraints on the state space, using Q-Learning-based algorithms and employing information exchange between agents. The objective is to reduce the overall exploratory need, thus improving the performance of the learning algorithm. To constrain the state space, extended reachability goals (ERG) [7] are used. An ERG comprises two expressions: one to be preserved during the interaction and another that describes a goal state. Both are composed of propositions that describe the environment states.

The rest of this paper is organized as follows: Section II introduces the definitions and concepts used to develop the proposed solution. Section III details the adopted model and the applied planning strategies. The developed approach, with its algorithms and techniques, is presented in Section IV. The experimental setup is defined in Section V. The results and a comparative analysis between different algorithms are shown in Section VI. Finally, Section VII summarizes the key features and the main contributions of the proposed approach.

II. Background

A. Markov Decision Process and Reinforcement Learning

A Markov Decision Process (MDP) [2], [3], [4] is a formal model for synchronous interaction between an agent and its environment. At every step the agent observes the current state of the environment and decides to execute
one action. The execution of the selected action takes the agent to a new state of the environment and produces a reinforcement (a merit value associated with the state-action pair). The interaction between the agent and the environment continues until a stopping criterion is met. MDPs are primarily used to model sequential decision making in stochastic environments. Generally, MDPs are applied in planning and optimization under uncertainty, such as problems in Robotics, Economics, etc.

A model M for a standard MDP can be defined as M = ⟨S, A, T, R⟩, where:
• S ≠ ∅: the finite set of system states;
• A ≠ ∅: the set of actions that can be executed by the agent;
• T: the state transition function, which gives the probability that an agent in a state s that executes an action a immediately (at the next time step) reaches state s′;
• R: the reinforcement function.

The function R returns a real value that corresponds to the reinforcement received after each action a at any state s of the environment.

Reinforcement Learning [5], [8] is a collection of learning techniques for MDPs in which an agent tries to maximize a function of the total reinforcement values received in an environment that is partially or totally unknown w.r.t. the transition function. The widely studied Q-Learning algorithm was introduced by Watkins [9] and had its convergence proved in [6]. It is the de facto standard RL technique, with many variations [10], where the agent learns an optimal (or near-optimal) action policy through sequential updates of an action-value function Q(s, a), without an explicit need to learn a model of the environment. By performing an action a, the agent interacts with the environment, moving from state to state. Each action performed over a state provides a reinforcement, and the agent updates its estimate of the Q-value associated with the state-action pair. In stochastic environments (typical for MDP problems), the Q-value for a given state-action pair is the (temporally discounted) expected sum of the received reinforcement values when performing the action at the given state and following the optimal policy thereafter.

The Q-Learning update equation is:

Q(s, a) ← Q(s, a) + α [R(s) + γ max_a′ Q(s′, a′) − Q(s, a)]

where 0 < α ≤ 1 is an input learning parameter that determines the extent to which newly acquired information overrides old values. The constant 0 ≤ γ < 1 is the temporal discount factor; lower values make the agent tend to consider only recent reinforcements when updating the action-value function.

A pseudo-code for Q-Learning is presented in Algorithm 1.

Algorithm 1 Q-Learning
1: Initialize Q(s, a) arbitrarily
2: for i = 1 → MAX_EPISODES do
3:   Initialize s
4:   repeat
5:     Choose a from s using the policy derived from Q
6:     Take action a, observe R(s) and the resulting state s′
7:     Q(s, a) ← Q(s, a) + α [R(s) + γ max_a′ Q(s′, a′) − Q(s, a)]
8:     s ← s′
9:   until s is terminal
10: end for

Other Q-Learning-based algorithms, such as Dyna-Q [11] and SARSA [12], are reportedly faster than Q-Learning to converge to the optimal policy. Dyna-Q is an architecture that integrates planning, acting and learning. It uses the same procedure as Q-Learning to update the utility values, but it additionally executes a learning procedure for the expected utility values by looping over an internally updated model of the environment.

SARSA is an acronym for State-Action-Reward-State-Action, after the procedure executed to update the utility values, where s is the current state, a is the action chosen to be executed in that state, r is the corresponding reward, s′ is the resulting state after executing a over s and, finally, a′ is the next action to be executed in s′. The agent interacts with the environment and updates the policy based on the actions taken. This technique is an instance of the so-called on-policy learning algorithms.

The above Reinforcement Learning algorithms were compared in [13] and are also used as a benchmark to evaluate the experiments performed in this paper.

B. Communication

Communication is commonly used to improve the performance of several algorithms in decentralized multi-agent settings. It plays different roles in multi-agent systems, such as information exchange, task delegation, coordination of teams, distributed planning [14], conflict resolution, etc.

In the context herein, communication between agents [15] is basically the exchange of messages for gathering propositional information about the state space. Although broadcasting models for message diffusion are possible, it is generally not interesting to send messages to all agents at each time interval, since communication usually involves some cost [16]. Moreover, broadcasting increases traffic, which can cause an overflow [17]. Norrozi [18] showed an efficient way to avoid this problem. In general, for MDP problems the optimal policy for each agent should be generated through minimal and sufficient communication for coordination, so that communication cost does not become an obstacle. In fact, proper communication can in some cases be crucial for agents to coordinate properly, keeping information on the environment up to date. For this, the rules for communication must be defined in the model, and thus choosing the right moment to communicate can be considered a fundamental problem in multi-agent systems [19].
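To make the update rule concrete, the tabular Q-Learning procedure of Algorithm 1 can be sketched in Python. The chain-shaped toy environment, reward placement, ε-greedy action selection and parameter values below are illustrative assumptions, not details taken from this paper.

```python
import random
from collections import defaultdict

# Illustrative 1-D chain MDP (assumed setup): states 0..N-1,
# actions 0 = left, 1 = right; reaching state N-1 gives reward 1.0
# and ends the episode.
N = 10
ACTIONS = (0, 1)

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
    r = 1.0 if s2 == N - 1 else 0.0
    return s2, r

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)              # Q(s, a), initialized arbitrarily (0)
    for _ in range(episodes):
        s = 0                           # initialize s
        while s != N - 1:               # until s is terminal
            # choose a from s using an epsilon-greedy policy derived from Q
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda a_: Q[(s, a_)])
            s2, r = step(s, a)
            # Q(s,a) <- Q(s,a) + alpha * [R(s) + gamma * max_a' Q(s',a') - Q(s,a)]
            best_next = max(Q[(s2, a_)] for a_ in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

Q = q_learning()
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N - 1)}
print(policy)   # the greedy policy should move right, toward the goal
```

After enough episodes the greedy policy extracted from Q drives the agent toward the rewarded terminal state, mirroring the convergence behavior discussed above.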
C. Extended Reachability Goals

The propositional constraints in our approach are based on the concept of extended reachability goals. A simple reachability goal corresponds to a condition to be satisfied at the final state of a planning problem, reached after a plan execution [7]. In contrast, an extended reachability goal, besides the specification of the condition to be achieved, establishes a condition to be preserved (or restrictions to be satisfied) at every state during the execution. This provides a significant extension to the specification of planning problems, narrowing the scope of the model and allowing more complex planning problems to be formally defined.

To define extended reachability goals it is necessary to represent them in a formal language, e.g. the temporal logic used by Bacchus [20]. In general, the formal representation of the goals is a composition of atomic propositions that represent state characteristics and logical operators, hereafter called expressions. For example, assuming that p, q and k are atomic propositions that correspond to conditions on states, we could define the condition to be preserved as ¬p ∧ ¬q. This representation denotes that states with the properties p or q must be avoided during the interaction with the environment. Similarly, we could define the condition to be achieved as k, indicating that the problem is solved after reaching a state that satisfies this condition.

Planning problems with extended reachability goals can be solved using a strong probabilistic planning algorithm called PPF' [7]. The inputs to PPF' are S, A, T and the extended reachability goals; it returns a valid policy, if one exists. If there is more than a single valid policy, the output is the one with the highest probability of reaching the goal. PPF' initially computes a set of states that satisfy the final goal. From this set, the algorithm verifies which of the remaining states that satisfy the preservation goal can reach one state of the set. The action selected has the maximum value for Q, but does not consider the information of the reward function. This step is repeated until the initial state is included in the set or there is no new state to add to the set.

A logic that can be used as a formal language to specify extended reachability goals, and a planning system based on this logic, are provided by Pereira et al. [7].

III. MDPs and Extended Reachability Goals

Classical planning and MDP models are not suitable to formally represent extended reachability goals. Hence, a new model named ERG-MDP [21] is proposed, which extends the default MDP model and increases its specificity by adding definitions to manipulate a set of propositions for each state. An ERG-MDP constrains the environment using the extended reachability goals.

The ERG-MDP¹ model contains four new entities, P, L, ϕ1 and ϕ2, defined as:
• P ≠ ∅: a non-empty set of atomic propositions representing state characteristics;
• L : S ↦ 2^P: the state interpretation function;
• ϕ1: the logical expression for the preservation goal, to be maintained during the plan execution;
• ϕ2: the condition to be achieved at the end of the plan execution, defined as a logical expression.

¹The default MDP properties, previously defined in Section II, are omitted.

IV. RL and Extended Reachability Goals in Multi-Agent Domains

Our proposal is mainly composed of the Multi-ERG-Controller algorithm, which derives from its single-agent version, the ERG-Controller [21].

Both the ERG-Controller and the Multi-ERG-Controller make use of two algorithms: PPFERG and ERG-RL. The former is a slightly modified version of the PPF' algorithm: it returns all the viable policies given the final goal and the preservation goal. The latter incorporates the extended reachability goal features and is detailed as follows.

A. ERG-RL

The ERG-RL algorithm proposed by Araujo and Ribeiro [21] extends a reinforcement learning algorithm to include extended reachability goals. The main difference between ERG-RL and standard RL is that the former also stores the proposition function when interacting with the environment. The expressions are stored together with their corresponding utility values for each state, for posterior evaluation in the ERG-Controller.

We stress that any RL algorithm can be used as the learning component in ERG-RL to update the utility table values. For example, if we use Q-Learning, we obtain the algorithm ERG-Q, which is the one considered for the experiments reported in this paper.

B. ERG-Controller

The ERG-Controller algorithm executes the exploration over the environment and plans over the information retrieved from it. The execution flowchart of the algorithm is presented in Figure 1.

The algorithm initially explores the environment to generate a first ERG-Model. This occurs through the execution of the ERG-RL algorithm.

After an ERG-RL execution, the ERG-Controller tries to find the expression with the lowest utility value over all visited states that is not in the set of avoidable expressions. If the algorithm finds such an expression and it does not obstruct the agent from reaching the final goal, then
it removes all the transitions that lead to states containing the found expression. Afterwards, it continues the execution while accumulating experiences to improve the model.

After the main loop, the PPFERG algorithm runs over a state transition model generated by bootstrapping the experiences from an exploratory action policy of ERG-RL. The execution of PPFERG guarantees that the expression defined by the preservation goal is valid for all possible states and directs the actions towards the states that satisfy the final goal. By establishing these conditions, PPFERG implicitly reduces the state space.

Finally, the ERG-Controller returns a policy that corresponds to an optimal policy based on the set of viable policies found by the PPFERG algorithm. This optimal policy corresponds to executing the actions that produce the maximum utility values for the corresponding states. A more detailed description of ERG-RL can be found in [21].

Figure 1. The complete ERG-Controller flowchart.

C. Multi-ERG-Controller

The ERG-Controller does not support multi-agent executions for simultaneous update of the ERG-Model. To apply the same approach to problems with several agents distributed over the environment, we extended the default ERG-Controller and named it Multi-ERG-Controller.

The Multi-ERG-Controller executes a different thread for each agent in the environment. Each thread executes an ERG-RL algorithm instance according to the agent's initial position. Figure 2 presents the flowchart for the Multi-ERG-Controller algorithm.

Figure 2. The synthesized Multi-ERG-Controller approach flowchart.

The flowchart in Figure 2 shows the execution of a number of threads equal to the number of agents, which is its main feature. The detailed Multi-ERG-Controller algorithm is presented in Algorithm 2.

Algorithm 2 Multi-ERG-Controller
Require: θ
1: avoid ← ¬ϕ1
2: repeat
3:   U ← U-Values(RunRLThreads(E))
4:   E ← E ∪ model(RunRLThreads(E))
5:   for all s ∈ S do
6:     exp ← L(s)
7:     if exp ∉ avoid ∧ ¬blocked(exp, ϕ2) then
8:       VE ← VE ∪ {exp}
9:       count(exp) ← count(exp) + 1
10:      sum(exp) ← sum(exp) + U(s)
11:    end if
12:  end for
13:  if VE ≠ ∅ then
14:    for all exp ∈ VE do
15:      if sum(exp) ÷ count(exp) ≤ θ then
16:        mean(exp) ← sum(exp) ÷ count(exp)
17:      end if
18:    end for
19:    exp ← min mean
20:    if exp ≠ null then
21:      avoid ← avoid ∪ {exp}
22:      T ← T − {∀s ∈ S, ∀a ∈ A, T(s, a) → s′ where exp ∈ L(s′)}
23:    end if
24:  end if
25: until VE = ∅
26: ϕ1 ← ¬avoid
27: P ← PPFERG(E)
28: π ← findOptimal(P, U)
29: return π

Both the U-Values (U in Algorithm 2) and the ERG-Model (E) are extracted from the agents' interaction with the environment. These interactions are executed in procedure RunRLThreads (detailed later in Algorithm 3). U represents the utility values and E the model retrieved from the inner RL algorithm.

The U-Values and the acquired ERG-Model are used in the Multi-ERG-Controller to decide which expression (exp) with average utility value below θ (an input parameter that defines the utility threshold below which an expression must be avoided during the interaction) that does not block the final goal (ϕ2) must be avoided. Here, an expression is composed of the group of propositions present in a state, joined by the and (∧) operator. The propositions comprised
in a state can be recovered through the state interpretation function (L).

After each round of the main loop, the algorithm verifies whether there is a valid expression. If a valid expression exists, the algorithm chooses it, stores it in the set of avoidable expressions (avoid) and removes the transitions that reach states that represent the expression. If there is no valid expression, it calls the PPFERG algorithm, which performs as in the ERG-Controller.

As in the ERG-Controller, the algorithm returns an optimal policy based on the set of viable policies (π) found by PPFERG.

1) RunRLThreads: The algorithm RunRLThreads is responsible for creating and executing an ERG-RL instance asynchronously for each agent in the problem. This algorithm returns the U-Values extracted from the interaction with the environment and the ERG-MDP model (E). Figure 3 shows a graphical schema for the RunRLThreads algorithm.

Algorithm 3 RunRLThreads
Require: initial-states
1: for all state ∈ initial-states do
2:   Thread(ERG-RL(state))
3: end for
4: repeat
5:   wait-time-interval
6: until all threads have finished
7: return U-Values, E

The information exchange among the agents is accomplished by sharing the same utility table (U-Values). Thus, all information gathered from the environment is updated in the same table. After each update, all agents can retrieve the current utility value for each state. To avoid inconsistency errors, access to the utility table is synchronized.

After all threads have been executed, the final policy can be retrieved from the shared utility table by selecting the actions with the maximum utility value for each state.

V. Experiments

The experiments were run on grid environments with 100 states, 10 rows and 10 columns, where each cell corresponds to a single state. They were defined with extended reachability goals, and the valid propositions are A, B and @. The examples do not contain an initially defined condition to be preserved, and the final goal expression is defined by a predefined special proposition represented by @.

The grid environments were randomly generated (position of the agents, obstacles, initial position and final goal position). Table I summarizes the abbreviations and descriptions for the propositions and possible actions.

Figure 4 illustrates an environment configuration (scenario) generated according to the description in Table I. Empty cells correspond to states that do not have any proposition associated.
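As an illustration of such scenarios, a random 10×10 grid of the kind described above could be generated as follows. The cell encoding ('#' for obstacles, '*' for agent initial positions, '.' for empty cells) and the counts of obstacles and labelled cells are our own assumptions for this sketch; only the propositions A, B and @ come from the text.

```python
import random

# Illustrative generator for a 10x10 grid scenario (assumed encoding):
# '@' marks the goal state, 'A' and 'B' are ordinary propositions,
# '#' marks an obstacle, '*' an agent's initial position, '.' an empty cell.
ROWS, COLS = 10, 10

def generate_scenario(n_agents=2, n_obstacles=8, n_props=6, seed=42):
    rng = random.Random(seed)
    cells = [(r, c) for r in range(ROWS) for c in range(COLS)]
    rng.shuffle(cells)
    grid = {cell: '.' for cell in cells}       # all 100 states start empty
    grid[cells.pop()] = '@'                    # final goal position
    for _ in range(n_agents):
        grid[cells.pop()] = '*'                # agents' initial positions
    for _ in range(n_obstacles):
        grid[cells.pop()] = '#'
    for _ in range(n_props):
        grid[cells.pop()] = rng.choice('AB')   # proposition-labelled states
    return grid

grid = generate_scenario()
for r in range(ROWS):
    print(' '.join(grid[(r, c)] for c in range(COLS)))
```

Because positions are drawn from a shuffled list of all cells, every special cell lands on a distinct state, matching the description of randomly placed agents, obstacles and goal.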