A reinforcement learning approach to obstacle avoidance of mobile robots

Kristijan Maček, Ivan Petrović, Nedjeljko Perić


University of Zagreb
Faculty of Electrical Engineering and Computing
Department of Control and Computer Engineering in Automation
kristijan.macek@fer.hr, ivan.petrovic@fer.hr, nedjeljko.peric@fer.hr

Abstract: One of the basic issues in navigation of autonomous mobile robots is the obstacle avoidance task, which is commonly achieved using a reactive control paradigm where a local mapping from perceived states to actions is acquired. A control strategy with learning capabilities in an unknown environment can be obtained using reinforcement learning, where the learning agent is given only sparse reward information. This credit assignment problem includes both temporal and structural aspects. While the temporal credit assignment problem is solved using the core elements of a reinforcement learning agent, solution of the structural credit assignment problem requires an appropriate internal state space representation of the environment. In this paper a discrete coding of the input space using a neural network structure is presented, as opposed to the commonly used continuous internal representation. This enables a faster and more efficient convergence of the reinforcement learning process.

1 Introduction

One of the basic issues in navigation of autonomous mobile robots is the obstacle avoidance capability, which fits into the path planning problem. When complete knowledge of the environment is assumed, global path planning techniques can be applied [1], [2]. However, the efficiency of such techniques decreases rapidly in more complex and unstructured environments, since considerable modeling is needed. Therefore, local path planning techniques, which rely on the on-line sensory information of the mobile robot, may prove more adequate in achieving the task of obstacle avoidance.

Among these, the reactive control paradigm is commonly used, where a mapping from perceived states to actions is made. Acquiring the state-action pairs is especially interesting using approaches with learning capabilities, such as fuzzy logic and neural networks. Nevertheless, learning the fuzzy logic rule base or providing the necessary target patterns for supervised neural network learning may be a tedious and difficult task.

A plausible solution is the reinforcement learning approach, where only sparse information is given to the learning agent in the form of a scalar reward signal beside the current sensory information [3]. A continuous internal state representation of the input space may be based on fuzzy logic rules [4], [5], [6] or neural network structures such as MLP neural networks [7]. Since the agent must develop an appropriate state-action strategy for itself, a laborious learning phase may occur. Therefore, special attention must be paid to the convergence rate of a particular reinforcement learning approach.

In this paper, a convergence rate increase is obtained by reducing the continuous internal state representation to a set of discrete states, thereby extracting the relevant features of the input state space. Moreover, by allowing only a single discrete state to represent the input space at any given time, the discrete states are contrasted, which results in individual credit assignment and more precise computation. Similar approaches can be found in [8] and [9], but our approach provides a generally more advantageous reinforcement learning scheme.

2 Reinforcement learning

The basic reinforcement learning framework consists of a learning agent that observes the current state of the environment x_t and aims to find an appropriate action a_t pertaining to a policy π that is being developed. As a measure of the success of the agent's interaction with the environment, the agent is given an external reward (penalty) r in time instance t, which is defined by the designer and describes the overall goal of the agent. For obstacle avoidance purposes r = -1 upon collision with an object and r = 0 otherwise. In order to develop a consistent policy, the reward function must be fixed.



The aim of the intelligent agent is to maximize the expected sum of discounted external rewards r for all future instances:

$\sum_{k=0}^{\infty} \gamma^k \, r(t+k+1)$,   (1)

where the coefficient γ (0 ≤ γ ≤ 1) determines how "far-sighted" the agent should be.

Moreover, each state x of the system can be associated with a measure of desirability V_π(x), which represents the expected sum of discounted rewards r for future instances when the system is found in state x_t at time t and follows a fixed policy π thereafter:

$V_\pi(x_t) = E\Big\{ \sum_{k=0}^{\infty} \gamma^k \, r(t+k+1) \;\Big|\; x_t, \pi \Big\}$.   (2)

A higher state desirability V_π(x) implies a greater expected total reward sum.

Since the external reward signal r is not informative enough and can be delayed for a long time [10], an internal reinforcement signal r* is derived that represents the immediate reward given to the system in terms of the correctness of the actions executed so far. It also represents the prediction error of the total reward sum between two successive steps:

$r^*(t) = r(t) + \gamma V(x_t) - V(x_{t-1})$.   (3)

If r*(t) > 0, the system performed better than expected in the last time step; if r*(t) < 0, the system performed worse than expected. Therefore, the internal reinforcement signal r*(t) gives a measure of the quality of the last action u_{t-1} taken and may be used as a learning signal for the parameter update of the learning system.

A generic expression for updating the k-th parameter p_k of the learning system can be formulated as:

$p_k(t) = p_k(t-1) + \alpha \, r^*(t) \, e_k(t-1)$,   (4)

where α is the learning rate and e_k the eligibility measure of how a certain parameter p_k influenced the evaluation of state desirability and the action choice in previous time steps. Since the prediction error r* is only known at time step t, the eligibility measure of parameter p_k is taken from time step t-1, since parameter p_k influenced the current prediction of the total discounted reward sum r(t) + γV(x_t).

When considering the credit given to the learning agent for its actions, the eligibility measures involve both the structural and the temporal credit assignment problem. The structural credit assignment depends on the structure of the learning agent, while the temporal credit assignment involves measuring the credit (or blame) of a certain sequence of actions for the overall performance of the learning system. Depending on the structure of the learning agent and the specific reinforcement learning scheme used, the eligibility measure e_k may acquire different forms.

Basically, the current state of the system is due not only to the last action taken but also to other actions taken in the past. If a certain parameter p_k of the learning system is involved in the current state evaluation and consequently in the action selection, its eligibility measure e_k is updated according to structural credit. Thereafter, if it is not involved in further state-action assessments, its eligibility measure e_k typically decays exponentially:

$e_k(t) = \lambda \, e_k(t-1)$.   (5)

If λ = 0, only the last state-action assessment is considered relevant to the current state of the system; however, if λ = 1, all past assessments are considered equally relevant.

If the desirability of a state is associated with a certain action, Q(x_t, u_t), expression (3) for the prediction error becomes:

$r^*(t) = r(t) + \gamma Q(x_t, u_t) - Q(x_{t-1}, u_{t-1})$.   (6)

Based on the theory of dynamic programming, one can distinguish between two basic reinforcement learning approaches, where 1.) state evaluations V(x_t) or 2.) state-action evaluations Q(x_t, u_t) are calculated [11]. In 1.) the policy is derived indirectly through the state evaluations V(x_t), so that if V(x) is maximized for all possible states x, the optimal policy is achieved. In 2.) separate state-action evaluations Q(x_t, u_t) are updated, basically when a certain action u_t is taken at time t; thus, when all Q(·,·) are maximized, the optimal policy is to choose the actions with maximal Q(·,·) values.
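As an illustration of how the prediction error (3) and the eligibility-based update (4)-(5) interact, the following Python sketch shows one generic tabular temporal-difference step. The tabular setting, the name td_update and the replacing-trace convention (setting the current entry to 1, analogous to what Section 3.2 does for the winning neuron) are assumptions of this sketch, not part of the paper's derivation.

```python
import numpy as np

def td_update(values, eligibility, state_prev, state_curr, reward,
              alpha=0.05, gamma=0.99, lam=0.05):
    """One generic TD step following Eqs. (3)-(5).

    values      -- 1-D float array of state desirabilities V(x)
    eligibility -- 1-D float array of eligibility measures e_k from step t-1
    """
    # Eq. (3): internal reinforcement signal (prediction error)
    r_star = reward + gamma * values[state_curr] - values[state_prev]

    # Eq. (4): every parameter is updated in proportion to its eligibility
    values += alpha * r_star * eligibility

    # Eq. (5): exponential decay of all eligibilities ...
    eligibility *= lam
    # ... and structural credit for the state involved in the current prediction
    # (replacing-trace convention assumed here, cf. Section 3.2)
    eligibility[state_curr] = 1.0
    return r_star

# Usage sketch with 20 discrete states:
V = np.zeros(20)
e = np.zeros(20)
td_update(V, e, state_prev=3, state_curr=7, reward=-1.0)
```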

3 Controller design

3.1 Input state-space discretisation

As outlined briefly in the previous section, two aspects are particularly important to an autonomous agent using reinforcement learning, namely, the temporal and the structural credit assignment problem. The temporal credit assignment problem is related to rewarding a particular action sequence in time and is solved using the core elements of any reinforcement learning agent, such as the state desirability estimates, the internal reinforcement signal and the eligibility measures.

However, the structural credit assignment problem, which is related to the internal representation of the environment of the learning agent, is still a difficult task. Since reinforcement learning methods are iterative procedures, an inadequate internal representation of the input space (the environment) may result in a very slow convergence rate of the learning process, because the credit or blame for a certain sequence of actions may be distributed among many internal regions of the state space.

Approaches such as fuzzy logic controllers or multi-layer perceptron neural networks involve a continuous internal representation of the input state space. This may result in a large number of fuzzy rules to be adjusted simultaneously, or in on-line back-propagation learning of neural networks with a slow convergence rate.

To enhance individual structural credit assignment, the continuous internal representation of the input space may be replaced by a set of discrete states. A possible approach is to use the Kohonen neural network structure with the winner-take-all unsupervised learning rule [12].

Learning is based on clustering of the input data in order to group similar inputs and separate dissimilar ones. Given that the number of clusters is N, the input state vector is x = [x_1 x_2 ... x_L]' and the set of weight vectors is {w_1, w_2, ..., w_j, ..., w_N}, where w_j connects the input vector x to cluster j, the winning neuron k, representing the discrete state k, is selected by similarity matching:

$\| x - w_k \| = \min_j \| x - w_j \|$.   (7)

The update rule for the winning vector w_k is:

$w_k(t+1) = w_k(t) + \eta \left[ x(t) - w_k(t) \right]$.   (8)

Since the weight vector w_k closest to the input vector x is determined only by the angle spanned between these two vectors, both the input vector x and the weight vector w_k must be normalized, the former before the feedforward pass and the latter after updating. Thus, all weight vectors w_j are spread along the unit circle.

The activation function of the neurons representing the clusters is not important, since the output of the j-th neuron equals y_j = 0, ∀j ≠ k and y_j = 1, j = k.
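The discretisation step of Eqs. (7)-(8) can be sketched in a few lines of Python: normalize the input, pick the winning cluster by similarity matching, and move the winner towards the input before re-normalizing it. The function names (winner_take_all, update_winner) are illustrative and not taken from the paper.

```python
import numpy as np

def normalize(v):
    """Project a vector onto the unit (hyper)sphere."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def winner_take_all(x, W):
    """Eq. (7): index of the weight vector closest to the normalized input x.

    W is an (N, L) array holding one normalized weight vector w_j per row.
    """
    distances = np.linalg.norm(W - x, axis=1)
    return int(np.argmin(distances))

def update_winner(x, W, k, eta=0.01):
    """Eq. (8): move the winning vector w_k towards x, then re-normalize it."""
    W[k] += eta * (x - W[k])
    W[k] = normalize(W[k])

# Usage sketch: N clusters of L-dimensional (coarse coded) sonar inputs.
N, L = 20, 8
rng = np.random.default_rng(0)
W = np.array([normalize(v) for v in rng.normal(size=(N, L))])
x = normalize(rng.uniform(-1.0, 1.0, size=L))
k = winner_take_all(x, W)   # discrete state index
update_winner(x, W, k)
```
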
3.2 Action (policy) learning

In addition to the discrete internal state representation, a discrete set of M actions {a_1, a_2, ..., a_M} was chosen, where each state-action (cluster-action) pair at time t is associated with a desirability measure Q(x_t, a_l). For the environment state x_t, the k-th neuron is the winning one and the desirability measures for the individual actions are Q(x_t, a_1) = p_k1, Q(x_t, a_2) = p_k2, ..., Q(x_t, a_M) = p_kM. The action chosen at time t is the one with the maximum desirability measure Q(x_t, a_l), which leads to the "greedy" policy. The internal reinforcement signal r* (prediction error) is, according to (6):

$r^*(t) = r(t) + \gamma \max_l Q(x_t, a_l) - Q(x_{t-1}, a_{t-1})$.   (9)

The eligibility measure e_jl for all weight vectors p_j = [p_j1 p_j2 ... p_jM]' is as follows:

$e_{jl}(t) = \begin{cases} 1, & j = k \text{ and } a_l \text{ chosen at time } t, \\ \lambda \, e_{jl}(t-1), & \text{otherwise.} \end{cases}$   (10)

Thus, only the eligibility measure e_kl of the k-th winning neuron, where action a_l is chosen at time t, is set to 1, whereas all other eligibility measures are decayed according to their impact in the past.

The update rule for the state-action estimates p_jl is, according to (4), derived as:

$p_{jl}(t) = p_{jl}(t-1) + \alpha \, r^*(t) \, e_{jl}(t-1)$.   (11)

Initially, all eligibility measures are set to zero.

A similar controller architecture and discrete coding of the input state space was elaborated in [6], but there reinforcement learning was based on state evaluations V(x_t), whereas we base it on state-action evaluations Q(x_t, a_l), which is generally the preferable solution in terms of convergence rate [13]. The controller structure is depicted in Fig. 1.

Fig.1: The neural network controller structure (input layer, cluster selection layer with similarity matching weights w, and action selection layer with desirability evaluation weights p).
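To make the interplay of Eqs. (9)-(11) concrete, the sketch below implements one hypothetical learning step of the action selection layer as a small Python class. P is the N x M table of desirability estimates p_jl and E the matching eligibility table; the class and method names are illustrative, not taken from the paper.

```python
import numpy as np

class ActionLayer:
    """Greedy cluster-action learning following Eqs. (9)-(11)."""

    def __init__(self, n_clusters, n_actions, alpha=0.05, gamma=0.99, lam=0.05):
        self.P = np.random.uniform(-0.1, 0.1, size=(n_clusters, n_actions))
        self.E = np.zeros((n_clusters, n_actions))  # eligibilities start at zero
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.prev = None  # (cluster, action) of the previous step

    def select_action(self, k):
        """Greedy policy: action with maximal desirability for winning cluster k."""
        return int(np.argmax(self.P[k]))

    def learn(self, k, action, reward):
        """Update after observing the new winning cluster k, the greedy action
        chosen for it, and the reward earned by the previous action."""
        if self.prev is not None:
            k_prev, a_prev = self.prev
            # Eq. (9): prediction error between two successive steps
            r_star = reward + self.gamma * np.max(self.P[k]) - self.P[k_prev, a_prev]
            # Eq. (11): update all estimates according to their eligibilities
            self.P += self.alpha * r_star * self.E
        # Eq. (10): decay all eligibilities, then set the winner's entry to 1
        self.E *= self.lam
        self.E[k, action] = 1.0
        self.prev = (k, action)

    def reset_traces(self):
        """Called on trial restart: eligibility traces are cleared."""
        self.E[:] = 0.0
        self.prev = None
```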

4 Simulation results

The experiments were carried out in the Saphira Pioneer simulation environment for the Pioneer DX2 mobile robot platform, with a front array of 8 sonars oriented at ±90°, ±50°, ±30° and ±10° relative to the longitudinal vehicle axis. The sonar measurements d_i (1 ≤ i ≤ 8) were coarse coded as:

$x_i = 1.0 - 2.0 \, \frac{d_i - \mathrm{RANGE\_MIN}}{\mathrm{RANGE\_MAX} - \mathrm{RANGE\_MIN}}$,   (12)

where RANGE_MAX and RANGE_MIN denote the maximal and minimal range of the sonars, respectively, giving a nominal [-1, 1] sonar range. RANGE_MAX and RANGE_MIN may be chosen arbitrarily (yet taking the physical sonar constraints into account) and define the active sensor region, in our case also the active learning region. Here RANGE_MIN = 15 cm and RANGE_MAX = 250 cm were chosen. The coarse coded sonar measurement vector x = [x_1 x_2 ... x_8]' is normalized and given as the input vector to the controller, as described in the previous section (see Figure 1).
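A minimal sketch of the coarse coding (12) and the subsequent normalization, assuming the eight raw sonar ranges are given in centimetres; the function names are illustrative, and clipping to the active region is an assumption of this sketch.

```python
import numpy as np

RANGE_MIN = 15.0    # cm, inner edge of the active (learning) region
RANGE_MAX = 250.0   # cm, outer edge of the active region

def coarse_code(d):
    """Eq. (12): map raw sonar ranges d_i to the nominal [-1, 1] interval."""
    # clipping readings outside the active region is an assumption of this sketch
    d = np.clip(np.asarray(d, dtype=float), RANGE_MIN, RANGE_MAX)
    return 1.0 - 2.0 * (d - RANGE_MIN) / (RANGE_MAX - RANGE_MIN)

def controller_input(d):
    """Coarse code the 8 sonar readings and normalize the result (cf. Sec. 3.1)."""
    x = coarse_code(d)
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x

# Example: an obstacle about 30 cm ahead, roughly 2 m of free space elsewhere.
sonars_cm = [200, 200, 180, 30, 30, 180, 200, 200]
x = controller_input(sonars_cm)
```
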
The action set chosen was {"move forward a distance δ", "turn left by a heading β and move forward a distance δ", "turn right by a heading β and move forward a distance δ"}, where δ = 10 cm and β = 30° (action set size M = 3). It can be seen from the action set that the robot is in constant forward motion. There are two main reasons for this: firstly, by random exploration and obstacle avoidance the mobile agent can build a map of an initially unknown environment (thus potentially fulfilling other important tasks of mobile robot navigation); secondly, the mobile agent is prevented from getting stuck in a local minimum, such as turning on the same spot. Moreover, instead of choosing a constant forward velocity, a constant forward distance action was chosen, which enables deriving a direct mapping from states to actions regardless of the vehicle dynamics.

The simulated experiment is performed in the following manner: when in the active sonar region, an appropriate action is chosen depending on the current sonar readings and state-action evaluations, and the controller parameters are updated thereafter. As stated earlier, the external reward signal r is 0 for all states in the active region. Upon collision with an object (as detected by additional bumper sensors) or upon entering the "dangerous zone" of 15 cm to the closest object (as read by the sonars), the mobile agent receives a negative external penalty r = -1. The robot agent is then considered to be in a failure state and a trial restart must be performed. The robot is moved back to the "safe zone" using very simple if-then logic (i.e. if an obstacle is in front, then move backward). Therefore, no external trial restart is required, which is advantageous compared to some previous approaches. Upon trial restart, all eligibility traces of the action selection layer of the controller are reset to zero.
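The failure handling described above can be summarised in a short sketch. Here `robot` stands for a hypothetical interface to the Saphira simulator (with bumper_hit, read_sonars_cm and move_backward as assumed method names), and `layer` is the action selection layer sketched in Section 3.2.

```python
DANGER_CM = 15.0   # "dangerous zone" around the closest object

def external_reward(robot):
    """Sparse external reward: -1 on failure, 0 everywhere else in the active region."""
    failed = robot.bumper_hit() or min(robot.read_sonars_cm()) < DANGER_CM
    return (-1.0, True) if failed else (0.0, False)

def restart_trial(robot, layer):
    """Back the robot into the safe zone and clear the action-layer eligibility traces."""
    while min(robot.read_sonars_cm()) < DANGER_CM:
        robot.move_backward()          # simple if-then recovery rule
    layer.reset_traces()               # traces are reset upon trial restart
```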

The initial positions of the weight vectors {w_1, w_2, ..., w_j, ..., w_N} of the controller are uniformly randomly distributed along the unit circle, whereas the weight vectors p_j are initialised randomly in the interval [-0.1, 0.1]. The corresponding learning rates are set to η = 0.01 and α = 0.05, respectively. The discount factor for future predictions is γ = 0.99 and the decay rate is λ = 0.05.

The learning algorithm was verified in corridor-like environments designed in the Saphira world simulator. The mobile robot was allowed to randomly explore the environment, and the acquired robot paths are depicted in Figures 2 to 5.

Fig.2: A robot path with 20 clusters internal representation (axes in cm).
Fig.3: A robot path with 30 clusters internal representation (axes in cm).
Fig.4: A robot path with 30 clusters internal representation (axes in cm).
Fig.5: A robot path with 30 clusters internal representation (axes in cm).

To verify the learning algorithm, the greedy policy was applied, as stated earlier. Essentially, the learning agent chooses the action with maximum desirability, which is considered satisfactory thereafter. This may result in sub-optimal solutions (e.g. an oscillating motion in cases where a straight motion would be optimal). Since the primary task of obstacle avoidance is satisfied, a further improvement of the optimality of the solution should include a form of stochastic action exploration and a longer action sequence history. These aspects are to be included in further elaboration.

Moreover, to achieve global path planning navigation, a goal seeking behavior must be included and coordinated with the local obstacle avoidance task. If the action set is chosen in such a way that the robot is in constant forward motion, a map building and path tracking functionality may be included, converting an initially unknown environment into a known one, where global path planning techniques may then be applied. However, these aspects are beyond the scope of this paper.

5 Conclusions

A reinforcement learning approach to the obstacle avoidance task in mobile robot navigation was developed. In general, reinforcement learning methods present a suitable solution for developing control strategies in an unknown environment. The learning agent may receive only sparse reward signals or credits for the actions taken; therefore, internal state and action evaluation is required. In terms of the credit assignment problem, the internal representation of the input state space is particularly important.

A clustered discrete coding of the input state space was developed using the Kohonen neural network structure, as opposed to a continuous internal state representation. This enabled individual state-action credit assignment, giving a more precise evaluation computation and a faster convergence rate.

The learning algorithm was verified in a simulation environment where the mobile robot developed the obstacle avoidance capability from an initially unknown control strategy.

In perspective, the optimality of the obtained solution should be taken into account, as well as an adaptive neural network structure which could reduce the size of the relevant feature representation to the minimum required to fulfill a given task such as obstacle avoidance.

6 References

[1] J.T. Schwartz, M. Sharir: "A survey of motion planning and related geometric algorithms", Artif. Intell. J., vol. 37, pp. 157-169, 1988.
[2] O. Khatib: "Real-time obstacle avoidance for manipulators and mobile robots", Int. J. Robot. Res., vol. 5, no. 1, pp. 90-98, 1986.
[3] A.G. Barto, R.S. Sutton, and C.W. Anderson: "Neuronlike adaptive elements that can solve difficult learning control problems", IEEE Trans. Syst. Man Cybern., vol. SMC-13, no. 5, pp. 834-847, 1983.
[4] C.T. Lin, C.S.G. Lee: "Reinforcement structure/parameter learning for neural-network-based fuzzy logic control systems", IEEE Trans. on Fuzzy Systems, vol. 2, no. 1, pp. 46-63, 1994.
[5] H. Beom, H. Cho: "A sensor-based navigation for a mobile robot using fuzzy logic and reinforcement learning", IEEE Trans. Syst. Man Cybern., vol. SMC-25, no. 3, pp. 464-477, 1995.
[6] N.H.C. Yung, C. Ye: "An intelligent mobile vehicle navigator based on fuzzy logic and reinforcement learning", IEEE Trans. Syst. Man Cybern., vol. SMC-29, no. 2, pp. 314-321, 1999.
[7] G.A. Rummery: Problem solving with reinforcement learning, PhD Thesis, Cambridge University Engineering Department, University of Cambridge, 1995.
[8] B.J.A. Krose, J.W.M. van Dam: "Learning to avoid collisions: a reinforcement learning paradigm for mobile robot navigation", Proceedings of the IFAC/IFIP/IMACS Symposium on Artificial Intelligence in Real-Time Control, pp. 295-30, 1992.
[9] A.H. Fagg, D. Lotspeich, and G.A. Bekey: "A Reinforcement-Learning Approach to Reactive Control Policy Design for Autonomous Robots", Proc. of the 1994 IEEE International Conference on Robotics and Automation, vol. 1, pp. 39-44, San Diego, CA, May 8-13, 1994.
[10] R.S. Sutton: "Learning to predict by the methods of temporal differences", Machine Learning, vol. 3, pp. 9-44, 1988.
[11] R.S. Sutton: "Integrated architectures for learning, planning, and reacting based on approximating dynamic programming", in Seventh International Conference on Machine Learning, pp. 216-226, 1990.
[12] T. Kohonen: Self-organization and associative memory, Springer Verlag, 1984.
[13] C.J.C.H. Watkins, P. Dayan: "Technical Note: Q-learning", Machine Learning, vol. 8, pp. 279-292, 1989.

