Sharing Information on Extended Reachability Goals Over Propositionally Constrained Multi-Agent State Spaces

Anderson V. de Araujo and Carlos H. C. Ribeiro
Divisao de Ciencia da Computacao
Instituto Tecnologico de Aeronautica - ITA
Sao Jose dos Campos, SP - Brazil
Email: {andva, carlos}@ita.br

Abstract—By exchanging propositional constraint information in large state spaces, agents can implicitly reduce the state space, a feature that is particularly attractive for Reinforcement Learning approaches. This paper proposes a learning technique that combines a Reinforcement Learning algorithm and a planner for propositionally constrained state spaces that autonomously helps agents to implicitly reduce the state space towards possible plans that lead to the goal and avoid irrelevant or inadequate states. State space constraints are communicated among the agents using a common constraint set based on extended reachability goals. A performance evaluation against standard Reinforcement Learning techniques showed that extending autonomous learning with propositional constraints updated along the learning process can produce faster convergence to optimal policies, due to early state space reduction caused by shared information on state space constraints.

Index Terms—Multi-Agent, Cooperative Agents, Reinforcement Learning, Q-Learning, Planning, Markov Decision Processes, Extended Reachability Goals.
I. Introduction

In recent years, concepts such as decentralization and autonomy have been concentrating much of the attention of researchers in different areas of Computer Science. The advantages of decentralized and autonomous software are already well known and have been demonstrated in numerous applications. As far as decision making is concerned, there are acknowledged advantages in considering environments with individuals that cooperate with each other, trying to achieve a single objective or independent goals. Panait and Luke [1] presented an extensive study of learning systems with cooperative agents, with examples focused on Evolutionary Computation, Robotics, Reinforcement Learning, etc.

In the context of autonomous learning, this paper considers an extended form of the standard Markov Decision Process (MDP) [2], [3], [4] with propositional constraints on the state space that are communicated among agents as the learning goes on.

In a standard MDP, a state transition model is assumed to be known beforehand, but this is not always the case in real world applications [5]. Reinforcement Learning (RL) is a common research area that addresses this issue, focusing on the agent's interaction with the environment as a mechanism for gathering information about the domain structure. Q-Learning [6] is a typical RL algorithm that inherits a notorious difficulty of RL algorithms in general to deal with large state spaces, due to the need for balancing the control objective and the model-free estimation of the domain structure based solely on exploration. Here we propose a technique that incorporates propositional constraints on the state space, using Q-learning-based algorithms and employing information exchange between agents. By doing so, the objective is to reduce the overall exploratory need, thus improving the performance of the learning algorithm. To constrain the state space, extended reachability goals (ERG) [7] are used. An ERG is comprised of two expressions: one to be preserved during the iteration and another that describes a goal state. Both are composed of propositions that describe the environment states.

The rest of this paper is organized as follows. Section II introduces the definitions and concepts that were used to develop the proposed solution. Section III details the definition of the adopted model and the applied planning strategies. The developed approach, algorithms and techniques are presented in Section IV. The experimental setup is defined in Section V. The results and a comparative analysis between different algorithms are shown in Section VI. Finally, Section VII summarizes the key features and the main contributions of the proposed approach.
II. Background

A. Markov Decision Process and Reinforcement Learning

A Markov Decision Process (MDP) [2], [3], [4] is a formal model for synchronous interaction between an agent and its environment. At every step the agent observes the current state of the environment and decides to execute one action. The execution of the selected action takes the agent to a new state of the environment and produces a reinforcement (a merit value associated to the state-action pair). The interaction between the agent and the environment continues until a stopping criterion is met. MDPs are primarily used to model sequential decision making in stochastic environments. Generally, MDPs are applied in planning and optimization in environments involving uncertainties, such as problems in Robotics, Economics, etc.

A model M for a standard MDP can be defined as M = ⟨S, A, T, R⟩, where:
• S ≠ ∅: the finite set of system states;
• A ≠ ∅: the set of actions that can be executed by the agent;
• T: the state transition function, which provides the probability that an agent in a state s that executes an action a immediately (next time step) reaches state s';
• R: the reinforcement function.

The function R returns a real value that corresponds to the received reinforcement after each action a at any state s of the environment.

Reinforcement Learning [5], [8] is a collection of learning techniques for MDPs where an agent tries to maximize a function of the total reinforcement values received in an environment that is partially or totally unknown w.r.t. the transition function. The widely studied Q-Learning algorithm was introduced by Watkins [9] and had its convergence proved in [6]. It is the de facto standard RL technique with many variations [10], where the agent learns an optimal (or near-optimal) action policy through sequential updates of an action-value function Q(s, a) without an explicit need to learn a model of the environment. By performing an action a, the agent interacts with the environment, moving from state to state. Each action performed over a state provides a reinforcement, and the corresponding agent updates its estimate of the Q-value associated with the state-action pair. In stochastic environments (typical for MDP problems), the Q-value for a given state-action pair under a given action policy is the (temporally discounted) expected sum of the received reinforcement values when performing the action at the given state and following the optimal policy thereafter.

The Q-learning update equation is:

Q(s, a) ← Q(s, a) + α[R(s) + γ max_{a'} Q(s', a') − Q(s, a)]

where 0 < α ≤ 1 is an input learning parameter that determines the extent to which newly acquired information will override old values. The constant 0 ≤ γ < 1 is the temporal discount factor; lower values make the agent tend to consider only recent reinforcements for updating the action-value function.

A pseudo-code for Q-Learning is presented in Algorithm 1.

Algorithm 1 Q-Learning
1: Initialize Q(s, a) arbitrarily;
2: for i = 1 → MAX_EPISODES do
3:   Initialize s
4:   repeat
5:     Choose a from s using policy derived from Q
6:     Take action a, observe R(s) and resulting state s'
7:     Q(s, a) ← Q(s, a) + α[R(s) + γ max_{a'} Q(s', a') − Q(s, a)]
8:     s ← s'
9:   until s is terminal
10: end for
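For concreteness, the following is a minimal tabular Q-learning sketch in Python corresponding to Algorithm 1. It is not code from the paper; the environment interface (reset, step, actions) and the epsilon-greedy exploration constant are assumptions made purely for illustration.

import random
from collections import defaultdict

def q_learning(env, alpha=0.1, gamma=0.9, epsilon=0.1, max_episodes=1000):
    """Tabular Q-learning in the spirit of Algorithm 1 (hypothetical env interface)."""
    Q = defaultdict(float)                      # Q(s, a), initialized arbitrarily (here: 0)
    for _ in range(max_episodes):
        s = env.reset()                         # Initialize s
        done = False
        while not done:                         # repeat ... until s is terminal
            # Choose a from s using a policy derived from Q (epsilon-greedy here)
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(s, a)    # Take action a, observe R(s) and s'
            best_next = max((Q[(s_next, a2)] for a2 in env.actions(s_next)), default=0.0)
            # Q(s,a) <- Q(s,a) + alpha * [R(s) + gamma * max_a' Q(s',a') - Q(s,a)]
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q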
Other Q-learning-based algorithms, such as Dyna-Q [11] and SARSA [12], are reportedly faster to converge to the optimal policy than Q-learning. Dyna-Q is an architecture that integrates planning, acting and learning. It uses the same procedure as Q-Learning to update the utility values, but it also executes a learning procedure for the expected utility values by looping over an internally updated model of the environment.

SARSA is an acronym for State-Action-Reward-State-Action, due to the procedure executed to update the utility values, where s is the current state, a is the action chosen to be executed in that state, r the corresponding reward, s' is the resulting state after executing a over s and, finally, a' is the next action to be executed in s'. The agent interacts with the environment and updates the policy based on the actions taken. This technique is an instance of the so-called on-policy learning algorithms.

The above Reinforcement Learning algorithms were compared in [13] and were also a benchmark to evaluate the experiments performed in this paper.
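As an illustration of the on-policy difference, a sketch of the SARSA update is shown below (again assuming the same hypothetical Q-table dictionary as above). Unlike Q-learning, the bootstrap term uses the action a' actually selected by the behaviour policy rather than the greedy maximum.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One SARSA step: Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)].
    a_next is the action the agent will actually execute in s' (on-policy)."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])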
B. Communication

Communication is commonly used to improve the performance of several algorithms in decentralized multi-agent settings. It plays different roles in multi-agent systems, such as information exchange, task delegation, coordination of teams, distributed planning [14], conflict resolution, etc.

In the context herein, the communication between agents [15] is basically the exchange of messages among agents for gathering propositional information about the state space. Although broadcasting models for message diffusion are possible, it is generally not interesting to send messages to all agents at each time interval, since communication generally involves some cost [16]. Moreover, broadcasting increases traffic, which can cause an overflow [17]. Noroozi [18] showed an efficient way to avoid this problem. In general, for MDP problems the optimal policy for each agent should be generated through minimal and sufficient communication for coordination, so that the communication cost does not become an obstacle. In fact, proper communication can in some cases be crucial for agents to coordinate properly, keeping updated information on the environment. For this, the rules for communication must be defined in the model, and thus choosing the right moment to communicate can be considered a fundamental problem in multi-agent systems [19].
C. Extended Reachability Goals

The propositional constraints in our approach are based on the concept of extended reachability goals. A simple reachability goal corresponds to a condition to be satisfied at the final state in a planning problem, reached after a plan execution [7]. Differently, an extended reachability goal, besides the specification of the condition to be achieved, establishes a condition to be preserved (or restrictions to be satisfied) for every state during the execution. This provides a significant extension on the specification of planning problems, narrowing the scope of the model and allowing more complex planning problems to be more formally defined.

To define extended reachability goals it is necessary to represent them using a formal language, e.g. the temporal logic used by Bacchus [20]. In general, the formal representation of the goals is a composition of atomic propositions that represent state characteristics and logical operators, hereafter called expressions. For example, assuming that p, q and k are atomic propositions that correspond to conditions for states, we could define the condition to be preserved as ¬p ∧ ¬q. This representation denotes that states with the properties p or q must be avoided during the interaction with the environment. Similarly, it might be possible to define the condition to be achieved as k, indicating that the problem is solved after reaching a state that satisfies this condition.
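To make the use of such expressions concrete, the short Python sketch below (an illustration written for this text, not code from the paper) represents a state's label as the set of atomic propositions holding in it, and checks the preservation expression ¬p ∧ ¬q and the achievement expression k against that label.

# Each state label is the set of atomic propositions that hold in that state.
def preserved(label):
    """Preservation goal ¬p ∧ ¬q: neither p nor q may hold in the state."""
    return "p" not in label and "q" not in label

def achieved(label):
    """Achievement goal k: the problem is solved once k holds."""
    return "k" in label

# Usage: a state labelled {"q"} violates the preservation goal,
# while a state labelled {"k"} satisfies the achievement goal.
print(preserved({"q"}), achieved({"k"}))   # False True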
Planning problems with extended reachability goals can be solved using a strong probabilistic planning algorithm called PPF' [7]. The inputs to PPF' are S, A, T and the extended reachability goal features; it returns a valid policy, if one exists. If there is more than a single valid policy, the output is the one with the highest probability of reaching the goal. PPF' initially computes the set of states that satisfy the final goal. From this set, the algorithm verifies which of the remaining states that satisfy the preservation goal can reach one state of the set. The action selected has the maximum value for Q, but does not consider the information of the reward function. This step is repeated until the initial state is comprised inside the set or there is no new state to add to the set.

A logic that can be used as a formal language to specify extended reachability goals and a planning system based on this logic are provided by Pereira et al. [7].
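The backward construction described above can be sketched as the following fixpoint computation (a simplified, non-probabilistic illustration written for this text, not the PPF' implementation): start from the states satisfying the final goal and repeatedly add preservation-satisfying states from which some action can reach the set, stopping when the initial state is covered or no state can be added.

def reachable_set(states, actions, successors, satisfies_goal, satisfies_preserve, s0):
    """successors(s, a) -> set of states reachable with positive probability.
    Grows the goal set backwards under the preservation condition."""
    covered = {s for s in states if satisfies_goal(s)}
    changed = True
    while changed and s0 not in covered:
        changed = False
        for s in states:
            if s in covered or not satisfies_preserve(s):
                continue
            # Add s if some action can take it (with positive probability) into the set.
            if any(successors(s, a) & covered for a in actions):
                covered.add(s)
                changed = True
    return covered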
III. MDPs and Extended Reachability Goals

Classical planning and MDP models are not suitable to formally represent extended reachability goals. Hence, a new model is proposed and named ERG-MDP [21], which extends the default MDP model and increases its specificity by adding definitions to manipulate a set of propositions for each state. An ERG-MDP constrains the environment using the extended reachability goals.

The ERG-MDP model (the default MDP properties previously defined in Section II are omitted here) contains four new entities P, L, ϕ1 and ϕ2, defined as:
• P ≠ ∅: a non-empty set of atomic propositions representing state characteristics;
• L : S → 2^P: the state interpretation function;
• ϕ1: the logical expression for the preservation goal, to be maintained during the plan execution;
• ϕ2: the condition to be achieved at the end of the plan execution, defined as a logical expression.
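A compact way to picture the extended model is as a record bundling the MDP components with the four new entities. The Python dataclass below is an illustrative encoding assumed for this text, not a definition taken from [21].

from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Set, Tuple

State, Action, Prop = str, str, str

@dataclass
class ERGMDP:
    S: Set[State]                                    # states
    A: Set[Action]                                   # actions
    T: Dict[Tuple[State, Action, State], float]      # transition probabilities
    R: Callable[[State], float]                      # reinforcement function
    P: Set[Prop]                                     # atomic propositions
    L: Callable[[State], FrozenSet[Prop]]            # state interpretation: S -> 2^P
    phi1: Callable[[FrozenSet[Prop]], bool]          # preservation goal over a state's label
    phi2: Callable[[FrozenSet[Prop]], bool]          # achievement goal over a state's label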
IV. RL and Extended Reachability Goals in Multi-Agent Domains

Our proposal is mainly composed by the Multi-ERG-Controller algorithm, which derives from its single agent version, the ERG-Controller [21].

Both the ERG-Controller and the Multi-ERG-Controller make use of two algorithms: PPFERG and ERG-RL. The former is a slightly different version of the PPF' algorithm; it returns all the viable policies given the final goal and the preservation goal. The latter incorporates the extended reachability goal features and is detailed as follows.

A. ERG-RL

The ERG-RL algorithm proposed by Araujo and Ribeiro [21] extends a reinforcement learning algorithm to include extended reachability goals. The main difference between ERG-RL and standard RL is that the former also stores the proposition function when interacting with the environment. The expressions are stored together with their corresponding utility value for each state, for posterior evaluation in the ERG-Controller.

We stress that any RL algorithm can be used as the learning component in ERG-RL to update the utility table values. For example, if we use Q-Learning, we have the algorithm ERG-Q, which is considered for the experiments reported in this paper.

B. ERG-Controller

The ERG-Controller algorithm executes the exploration over the environment and plans over the information retrieved from it. The execution flowchart of the algorithm is presented in Figure 1.

Figure 1. The complete ERG-Controller flowchart.

The algorithm initially explores the environment to generate a first ERG-Model. This occurs through the execution of the ERG-RL algorithm.

After an ERG-RL execution, the ERG-Controller tries to find the expression with the lowest utility value over all visited states which is not in the set of avoidable expressions. If the algorithm finds the expression and it does not obstruct the agent from reaching the final goal, then it removes all the transitions that direct to states that contain the expression found. Afterwards, it continues the execution while accumulating experiences to improve the model.

After the main loop, the PPFERG algorithm runs over a state transition model generated by bootstrapping the experiences from an exploratory action policy of ERG-RL. The execution of PPFERG guarantees that the expression defined by the preservation goal is valid for all possible states and directs the actions towards the states that satisfy the final goal. By establishing these conditions, PPFERG implicitly reduces the state space.

Finally, the ERG-Controller returns a policy that corresponds to an optimal policy based on the set of viable policies found by the PPFERG algorithm. This optimal policy corresponds to executing the actions that produce the maximum utility values for the corresponding states.

A more detailed description of ERG-RL can be found in [21].

C. Multi-ERG-Controller

The ERG-Controller does not support multi-agent executions for simultaneous update of the ERG-Model. To apply the same approach to problems with different agents distributed over the environment, we extended the default ERG-Controller and named it Multi-ERG-Controller.

The Multi-ERG-Controller executes a different thread for each agent in the environment. Each thread executes an ERG-RL algorithm instance according to the agent's initial position. Figure 2 presents the flowchart for the Multi-ERG-Controller algorithm.

Figure 2. The synthesized Multi-ERG-Controller approach flowchart.

The flowchart in Figure 2 shows the execution of a number of threads equal to the number of agents, which is its main feature. The detailed Multi-ERG-Controller algorithm is presented in Algorithm 2.

Algorithm 2 Multi-ERG-Controller
Require: θ
1: avoid ← ¬ϕ1
2: repeat
3:   U ← U-Values(RunRLThreads(E))
4:   E ← E ∪ model(RunRLThreads(E))
5:   for all s ∈ S do
6:     exp ← L(s)
7:     if exp ∉ avoid ∧ ¬blocked(exp, ϕ2) then
8:       VE ← VE ∪ {exp}
9:       count(exp) ← count(exp) + 1
10:      sum(exp) ← sum(exp) + U(s)
11:    end if
12:  end for
13:  if VE ≠ ∅ then
14:    for all exp ∈ VE do
15:      if sum(exp) ÷ count(exp) ≤ θ then
16:        mean(exp) ← sum(exp) ÷ count(exp)
17:      end if
18:    end for
19:    exp ← min(mean)
20:    if exp ≠ null then
21:      avoid ← avoid ∪ {exp}
22:      T ← T − {∀s ∈ S, ∀a ∈ A, T(s, a) → s' where exp ∈ L(s')}
23:    end if
24:  end if
25: until VE = ∅
26: ϕ1 ← ¬avoid
27: P ← PPFERG(E)
28: π ← findOptimal(P, U)
29: return π

Both the U-Values, U in Algorithm 2, and the ERG-Model (E) are extracted from the agents' interaction with the environment. These interactions are executed in procedure RunRLThreads (detailed later in Algorithm 3). U represents the utility values and E the model retrieved from the inner RL algorithm.

The U-Values and the ERG-Model acquired are used in the Multi-ERG-Controller to decide which expression (exp) must be avoided: one whose averaged utility value is below θ (an input parameter that defines the minimum utility value below which an expression must be avoided during the interaction) and which does not block the final goal (ϕ2). Here, an expression is composed of the propositions present in a state, joined by the and (∧) operator. The propositions comprised in a state can be recovered through the state interpretation function (L).

After each round of the main loop, the algorithm verifies if there is a valid expression. If a valid expression exists, the algorithm chooses the expression, stores it in the set of avoidable expressions (avoid) and removes the transitions that reach states that represent the expression. If there is no valid expression, it calls the PPFERG algorithm, which performs similarly to its use in the ERG-Controller.

As in the ERG-Controller, the algorithm returns an optimal policy based on the set of viable policies (π) found by PPFERG.
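One round of this selection can be sketched as follows. This is an illustrative Python fragment written for this text, not the authors' implementation; the names avoid, blocked and theta mirror Algorithm 2, while the data structures (state labels as frozensets of propositions, T as a dictionary of transition probabilities) are assumptions. The round averages the utility of each candidate expression over the visited states, picks the worst one below θ that does not block ϕ2, and prunes every transition leading into states labelled with it.

from collections import defaultdict

def controller_round(visited, U, L, avoid, blocked, theta, T):
    """One Multi-ERG-Controller round in the spirit of Algorithm 2 (illustrative only)."""
    total, count = defaultdict(float), defaultdict(int)
    for s in visited:
        exp = L(s)                               # expression = propositions holding in s
        if exp and exp not in avoid and not blocked(exp):
            total[exp] += U[s]
            count[exp] += 1
    means = {e: total[e] / count[e] for e in total if total[e] / count[e] <= theta}
    if not means:
        return None, T                           # no valid expression: caller runs PPFERG
    worst = min(means, key=means.get)            # expression with the lowest mean utility
    avoid.add(worst)
    # Remove transitions (s, a, s') whose destination s' contains the avoided expression.
    T = {(s, a, s2): p for (s, a, s2), p in T.items() if not worst.issubset(L(s2))}
    return worst, T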
1) RunRLThreads: The algorithm RunRLThreads is responsible for creating and executing an ERG-RL instance asynchronously for each agent in the problem. This algorithm returns the U-Values extracted from the interaction with the environment and the ERG-MDP model (E). Figure 3 shows a graphical schema for the RunRLThreads algorithm.

Figure 3. Graphical schema for RunRLThreads algorithm.

RunRLThreads and its internal procedures are detailed in Algorithm 3.

Algorithm 3 RunRLThreads
Require: initial-states
1: for all state ∈ initial-states do
2:   Thread(ERG-RL(state))
3: end for
4: repeat
5:   wait-time-interval
6: until all threads have finished.
7: return U-Values, E

Algorithm 3 calls the procedure Thread, which is responsible for creating a new process of execution and scheduling it in the operating system to run. After all threads are running, the algorithm waits a predefined time interval (represented by the procedure wait-time-interval) to verify if all threads have finished. After all threads have executed, the algorithm can finally return the updated utility table and the model extracted from the RL executions. The returned variables are used in the Multi-ERG-Controller to extract the near-optimal policy from the environment.

The information exchange among the agents is accomplished by sharing the same utility table (U-Values). Thus, all information gathered from the environment is updated in the same table. After each update, all agents can retrieve the current utility value for each state. To avoid inconsistency errors, the access to the utility table is synchronized.

After all threads have been executed, the final policy can be retrieved from the shared utility table, selecting the actions with the maximum utility value for each state.
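The shared, synchronized utility table can be pictured with the minimal Python sketch below (an assumption-laden illustration, not the authors' implementation): one lock guards every read-modify-write of the shared table while each agent thread runs its own learning loop.

import threading
from collections import defaultdict

class SharedUValues:
    """Utility table shared by all agent threads; access is synchronized with a lock."""
    def __init__(self):
        self._U = defaultdict(float)
        self._lock = threading.Lock()

    def update(self, state, delta):
        with self._lock:                 # serialize read-modify-write updates
            self._U[state] += delta

    def value(self, state):
        with self._lock:
            return self._U[state]

def run_rl_threads(initial_states, erg_rl, shared_U):
    """Spawn one ERG-RL run per agent initial state and wait for all to finish."""
    threads = [threading.Thread(target=erg_rl, args=(s0, shared_U)) for s0 in initial_states]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                         # wait until every thread has finished
    return shared_U

Here join() replaces the fixed polling interval of Algorithm 3; the effect is the same, since the procedure returns only after every thread has finished.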

V. Experiments

The experiments were run on grid environments with 100 states (10 rows and 10 columns), where each cell corresponds to a single state. They were defined with extended reachability goals, and the valid propositions are A, B and @. The examples do not contain an initially defined condition to be preserved, and the final goal expression is defined by a predefined special proposition represented by @.

The grid environments were randomly generated (position of the agents, obstacles, initial position and final goal position). Table I summarizes the abbreviations and descriptions for the propositions and possible actions.

Table I
Abbreviations and descriptions for the generated examples.

Abbrev. | Name       | Description
0-9     | Agents     | Agents' identifiers
A       | Obstacle A | Proposition with low reward
B       | Obstacle B | Proposition with low reward
@       | Final Goal | Composition of the final goal
↑       | North      | Execute the action "north"
↓       | South      | Execute the action "south"
→       | East       | Execute the action "east"
←       | West       | Execute the action "west"

Figure 4 illustrates an environment configuration (scenario) generated according to the description in Table I. Empty cells correspond to states that do not have any proposition associated.

Figure 4. An example of generated environment.

The ERG-MDP model E for the examples was defined as follows:
• S: [s00, ..., s99];
• A: [north, south, west, east];
• P: [A, B, @];
• T: equally distributed between the possible states (e.g. if the agent is in state s00 there are only two possible actions, south and east, each one with probability 0.5);
• L: randomly created for each different example;
• R (where the symbol '*' means any action or state):
  – ∀s ∈ S, A, B ∈ L(s): R(s) = −30;
  – R(∗) = −1.0;
• ϕ1: ∅;
• ϕ2: @.

According to the defined ERG-MDP model, there is no optimistic reward associated to the states that satisfy the final goal expression.

The experiments were performed on a set of 10 examples, each performed 10 times for each test case. For these experiments, the Multi-ERG-Controller algorithm was executed with either 2 or 3 agents (threads). The ERG-Controller and the RL algorithms Q-learning, SARSA and Dyna-Q were also executed over the same test cases for comparative reasons.

For operating the RL algorithms the required constants were defined with values α = 0.1, γ = 0.9 and ε = 0.009. For the ERG-Controller based algorithms θ was defined as −15. The execution of Dyna-Q also requires the parameter N, which is the number of iterations for the internal planning procedure, defined as 5.

The stopping criterion is defined as: max_{s,a} [Q_{t−1}(s, a) − Q_t(s, a)] < ε.
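As a small illustration of how this criterion can be checked in code (an assumption for this text, not the authors' implementation), one can track the largest change of any Q-value between consecutive sweeps and stop once it falls below ε:

def converged(Q_prev, Q_curr, eps=0.009):
    """Stop when max over all (s,a) of [Q_{t-1}(s,a) - Q_t(s,a)] is below eps."""
    keys = set(Q_prev) | set(Q_curr)
    return max((Q_prev.get(k, 0.0) - Q_curr.get(k, 0.0) for k in keys), default=0.0) < eps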
VI. Results

This section presents the final preservation goal achieved by the ERG controllers and a comparative analysis between the Multi-ERG-Controller (with 3 agents), the ERG-Controller, Q-Learning, Dyna-Q and SARSA.

A. Preservation Goal

At each iteration of the Multi-ERG-Controller and ERG-Controller main loops, the algorithms find the expression with the lowest averaged utility value, until all suited expressions (according to θ) have been found. The resultant set of expressions is composed by A, B and A ∧ B. Thus, the final preservation goal found by both ERG based algorithms is:

¬(A ∨ B ∨ (A ∧ B))

The preservation goal indicates that the agent should avoid the states that contain one of the propositions (A or B) or both.

B. Executions

Table II shows a comparison considering Q-Learning, SARSA and Dyna-Q as reinforcement learning algorithms. The comparison was accomplished using these standard RL algorithms against their employment as the learning component of ERG-RL in the ERG based algorithms. The total time (T) of an execution (in milliseconds), the number of iterations and the standard deviation over the number of iterations are considered in the analysis.

Table II
Comparative results between all tested algorithms.

Algorithm           | T      | Iterations | Iter. (std dev)
Q-Learning          | 3479.2 | 175.48     | 25.81
ERG-Cont.(Q-Learn.) | 3404.2 | 76.98      | 9.58
Multi-ERG(Q-Learn.) | 3584.6 | 90.94      | 21.10
SARSA               | 6065   | 172.92     | 32.44
ERG-Cont.(SARSA)    | 5120.8 | 77.1       | 9.58
Multi-ERG(SARSA)    | 5933.5 | 91.04      | 22.72
Dyna-Q              | 1902   | 76.07      | 13.55
ERG-Cont.(Dyna-Q)   | 1567.8 | 27.99      | 7.87
Multi-ERG(Dyna-Q)   | 1916.9 | 36.81      | 10.48

Table III presents the percentage of improvement when comparing the ERG-Controller approach with the standard RL algorithms.

Table III
Comparison between the RL algorithms and their ERG-RL versions.

Algorithm | T      | Iterations | Iter. (std dev)
Q-Learn.  | -2.18  | -56.13     | -57.56
SARSA     | -15.56 | -55.41     | -70.46
Dyna-Q    | -17.57 | -63.20     | -41.93

The results presented in Table III show the percentage of improvement obtained from applying the ERG-Controller over the same set of examples. They show an impressive improvement of all algorithms over all compared attributes.

Table IV compares the single agent version of the ERG-Controller with its multi-agent correspondent.

Table IV
Comparison between the ERG-RL algorithms and their Multi-ERG-RL versions.

Algorithm | T      | Iterations | Iter. (std dev)
Q-Learn   | +5.33  | +18.14     | +92.69
SARSA     | +15.87 | +18.08     | +137.22
Dyna-Q    | +22.26 | +31.53     | +33.19

The information presented in Table IV shows that there was an increase in execution time, number of episodes and standard deviation of the episodes when moving to the multi-agent version. In our analysis, such results were expected, due to the fact that the creation of threads and the synchronization of the utility table are expensive procedures to execute during the process.

The last table, Table V, presents the comparison of the standard RL algorithms with the Multi-ERG-Controller approach.

Table V
Comparison between the standard RL algorithms and their Multi-ERG-RL versions.

Algorithm | T     | Iterations | Iter. (std dev)
Q-Learn.  | +3.02 | -48.17     | -18.23
SARSA     | -2.16 | -47.35     | -29.93
Dyna-Q    | +0.78 | -51.6      | -22.65

Table V presents an interesting improvement over the number of episodes and its standard deviation. The average time to execute each approach is very similar. We believe that this is due to the same reason as in the comparison of the ERG-Controller with the Multi-ERG-Controller: the generation of new threads and the synchronization of the information between agents.
VII. Conclusion

In this paper, a new approach called Multi-ERG-Controller is presented to discover the action policy for an environment under propositional constraints on states in MDP problems with multiple agents. Our method connects extended reachability goals and multiple reinforcement learning instances through shared information. We implemented a shared utility table algorithm, according to the significance of environmental observations made by the agents, in order to reduce the learning time and generate better action policies.

The results obtained and presented in Section VI showed an important reduction of iterations when applying any ERG based algorithm. The learning time is also reduced in the ERG-Controller executions, but not in the Multi-ERG-Controller.

The Multi-ERG-Controller showed advances in problem solving with extended reachability goals for multiple agents, and we believe that increasing the number of agents and states of the environment can favor this approach, since the time consumed by the generation and synchronization of threads will then be less significant.

From the RL algorithm results, the Dyna-Q algorithm showed to be more effective on larger state spaces, when comparing with the results in [21].

As future work, we are moving towards working with partially observable MDPs and extended reachability goals.
References

[1] L. Panait and S. Luke, "Cooperative multi-agent learning: The state of the art," Autonomous Agents and Multi-Agent Systems, vol. 11, pp. 387–434, 2005. [Online]. Available: http://dx.doi.org/10.1007/s10458-005-2631-2
[2] R. A. Howard, Dynamic Programming and Markov Processes. MIT Press and Wiley, 1960, vol. 3.
[3] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st ed. New York, NY, USA: John Wiley & Sons, Inc., 1994.
[4] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Pearson Education, 2003. [Online]. Available: http://portal.acm.org/citation.cfm?id=773294
[5] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning). The MIT Press, Mar. 1998.
[6] C. J. C. H. Watkins and P. Dayan, "Technical note Q-learning," Machine Learning, vol. 8, pp. 279–292, 1992.
[7] S. Lago Pereira, L. Barros, and F. Cozman, "Strong probabilistic planning," in MICAI 2008: Advances in Artificial Intelligence, ser. Lecture Notes in Computer Science, A. Gelbukh and E. Morales, Eds. Springer Berlin / Heidelberg, 2008, vol. 5317, pp. 636–652. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-88636-5_61
[8] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: a survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996.
[9] C. Watkins, "Learning from delayed rewards," Ph.D. dissertation, University of Cambridge, England, 1989.
[10] A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman, "PAC model-free reinforcement learning," in Proceedings of the 23rd International Conference on Machine Learning, ser. ICML '06. New York, NY, USA: ACM, 2006, pp. 881–888. [Online]. Available: http://doi.acm.org/10.1145/1143844.1143955
[11] R. S. Sutton, "Integrated architectures for learning, planning, and reacting based on approximating dynamic programming," in Proceedings of the Seventh International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., 1990, pp. 216–224.
[12] G. A. Rummery and M. Niranjan, "On-line Q-learning using connectionist systems," Cambridge University Engineering Department, Cambridge, England, Tech. Rep. TR 166, 1994.
[13] L.-J. Lin, "Self-improving reactive agents based on reinforcement learning, planning and teaching," in Machine Learning, 1992, pp. 293–321.
[14] E. H. Durfee and V. R. Lesser, "Using partial global plans to coordinate distributed problem solvers," in Proceedings of the 10th International Joint Conference on Artificial Intelligence - Volume 2. San Francisco, CA, USA: Morgan Kaufmann, 1987, pp. 875–883. [Online]. Available: http://portal.acm.org/citation.cfm?id=1625995.1626060
[15] C. V. Goldman and S. Zilberstein, "Optimizing information exchange in cooperative multi-agent systems," in AAMAS '03: Proceedings of the Second International Joint Conference on Autonomous Agents and Multiagent Systems. New York, NY, USA: ACM, 2003, pp. 137–144.
[16] P. Xuan, V. Lesser, and S. Zilberstein, "Communication in multi-agent Markov decision processes," in Proc. of ICMAS Workshop on Game Theoretic and Decision Theoretic Agents, 2000.
[17] M. Kinney and C. Tsatsoulis, "Learning communication strategies in multiagent systems," in Applied Intelligence, 1998.
[18] A. Noroozi, "A novel model for multi-agent systems to improve communication efficiency," in Computer Engineering and Technology, 2009. ICCET '09. International Conference on, vol. 2, Jan. 2009, pp. 189–192.
[19] R. Becker, A. Carlin, V. Lesser, and S. Zilberstein, "Analyzing myopic approaches for multi-agent communication," Computational Intelligence, vol. 25, no. 1, pp. 31–50, 2009. [Online]. Available: http://dx.doi.org/10.1111/j.1467-8640.2008.01329.x
[20] F. Bacchus and F. Kabanza, "Planning for temporally extended goals," Annals of Mathematics and Artificial Intelligence, vol. 22, pp. 5–27, 1998. [Online]. Available: http://dx.doi.org/10.1023/A:1018985923441
[21] A. V. Araujo and C. H. C. Ribeiro, "Solving problems with extended reachability goals through reinforcement learning on propositionally constrained state spaces," in IEEE International Conference on Systems, Man, and Cybernetics, vol. 1, 2013, pp. 1542–1547.
