A Reinforcement Learning Approach To Obstacle Avoidance of Mobile Robots
Abstract: One of the basic issues in navigation of autonomous mobile robots is the obstacle avoidance task that is commonly achieved using the reactive control paradigm, where a local mapping from perceived states to actions is acquired. A control strategy with learning capabilities in an unknown environment can be obtained using reinforcement learning, where the learning agent is given only sparse reward information. This credit assignment problem includes both temporal and structural aspects. While the temporal credit assignment problem is solved using the core elements of the reinforcement learning agent, solution of the structural credit assignment problem requires an appropriate internal state space representation of the environment. In this paper a discrete coding of the input space using a neural network structure is presented, as opposed to the commonly used continuous internal representation. This enables a faster and more efficient convergence of the reinforcement learning process.
1 Introduction

One of the basic issues in navigation of autonomous mobile robots is the obstacle avoidance capability, which fits into the path planning problem. When a complete knowledge of the environment is assumed, global path planning techniques can be applied [1], [2]. However, the efficiency of such techniques decreases rapidly in more complex and unstructured environments since considerable modeling is needed. Therefore, local path planning techniques, which rely on on-line sensory information of the mobile robots, may prove more adequate in achieving the task of obstacle avoidance.

Among these, the reactive control paradigm is commonly used, where a mapping from perceived states to actions is made. Acquiring the state-action pairs is especially interesting using approaches where learning capabilities apply, such as fuzzy logic and neural networks. Nevertheless, learning the fuzzy logic rule base or providing the necessary target patterns in the supervised neural network learning may be a tedious and difficult task.

A plausible solution is the reinforcement learning approach, where only sparse information is given to the learning agent in the form of a scalar reward signal beside the current sensory information [3]. A continuous internal state representation of the input space may be based on fuzzy logic rules [4], [5], [6] or neural network structures such as MLP neural networks [7]. Since the agent must develop an appropriate state-action strategy for itself, a laborious learning phase may occur. Therefore, special attention must be paid to the convergence rate of a particular reinforcement learning approach.

In this paper, a convergence rate increase is obtained by reduction of the continuous internal state representation to a set of discrete states, thereby extracting the relevant features of the input state space. Moreover, by allowing only a single discrete state to represent the input space at any given time, the discrete states are contrasted, which results in individual credit assignment and more precise computation. Similar approaches can be found in [8] and [9], but our approach provides a generally more advantageous reinforcement learning scheme.

2 Reinforcement learning

The basic reinforcement learning framework consists of a learning agent that observes the current state of the environment $x_t$ and aims to find an appropriate action $a_t$ pertaining to a policy $\pi$ that is being developed. As a measure of success of the agent's interaction with the environment, the agent is given a reward (penalty) $r_t$ at time instance $t$, which is defined by the designer and describes the overall goal of the agent. For obstacle avoidance purposes $r = -1$ upon collision with an object and $r = 0$ otherwise. In order to develop a consistent policy, the reward function must be fixed.

The aim of the intelligent agent is to maximize the expected sum of discounted external rewards $r$ for all future instances:

$$\sum_{k=0}^{\infty} \gamma^k r_{t+k},$$

where coefficient $\gamma$ determines how "far-sighted" the agent should be.

Moreover, each state $x$ of the system can be associated with a measure of desirability $V_\pi(x)$, which represents the expected sum of discounted rewards $r$ for all future instances when starting from that state and following policy $\pi$. [...]
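To make the reward structure concrete, a minimal Python sketch of the sparse reward and the discounted return it induces follows; the function names are illustrative rather than taken from the paper, and γ = 0.99 is the value used later in the simulations.

```python
# Minimal sketch (illustration, not the paper's implementation): the sparse
# reward signal and the discounted return for the obstacle avoidance task.

def reward(collision: bool) -> float:
    """Sparse reward as defined in the text: -1 upon collision, 0 otherwise."""
    return -1.0 if collision else 0.0

def discounted_return(rewards, gamma=0.99):
    """Sum of discounted rewards; gamma controls how 'far-sighted' the agent is."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: a collision occurring 5 steps in the future is discounted accordingly.
print(discounted_return([0, 0, 0, 0, 0, -1], gamma=0.99))  # -0.99**5, about -0.951
```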
[...] result in a large number of fuzzy rules to be adjusted simultaneously, or in an on-line back-propagation learning of neural networks with a slow convergence rate.

To enhance individual structural credit assignment, the continuous internal representation of the input space may be replaced by a set of discrete states. A possible approach is to use the Kohonen neural network structure with the winner-take-all unsupervised learning rule [12]. Learning is based on clustering of the input data in order to group similar inputs and separate dissimilar ones. Given that the number of clusters is $N$, the input state vector is $\vec{x} = [x_1\; x_2\; \ldots\; x_L]^T$ and the set of weight vectors is $\{\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_j, \ldots, \vec{w}_N\}$, where $\vec{w}_j$ connects the input vector $\vec{x}$ to cluster $j$, the winning neuron $k$, representing the discrete state $k$, is selected by similarity matching:

$$\|\vec{x} - \vec{w}_k\| = \min_j \|\vec{x} - \vec{w}_j\|. \qquad (7)$$

[...]

$$\hat{r}(t) = r(t) + \gamma \max_a Q(x_t, a) - Q(x_{t-1}, a_{t-1}). \qquad (9)$$

The eligibility measure $e_{jl}$ for all weight vectors $\vec{p}_j = [p_{j1}\; p_{j2}\; \ldots\; p_{jM}]^T$ is defined such that only the eligibility measure $e_{kl}$ of the $k$-th winning neuron, where action $a_l$ is chosen at time $t$, is set to 1, whereas all other eligibility measures $e_{jl}$ are decayed according to their impact in the past. The update rule for the state-action estimates $p_{jl}$ is then derived according to (4). [...]
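As an illustration of the learning step described above, the following Python sketch combines the winner-take-all state selection (7) with a Q-learning update driven by the TD error (9) and eligibility traces. The identification of $Q(x, a_l)$ with $p_{kl}$ of the winning cluster, the $\gamma\lambda$ trace decay, the array sizes, and all names are our assumptions for the sketch, not the paper's exact equations.

```python
import numpy as np

# Illustrative sketch only: discrete-state Q-learning with eligibility traces
# on top of a winner-take-all (Kohonen-style) state coding. The trace decay
# gamma*lam and the identification Q(x, a_l) = p[k, l] are assumptions.

N, L, M = 30, 8, 5            # clusters, input dimension (8 sonars), actions (illustrative)
rng = np.random.default_rng(0)
w = rng.normal(size=(N, L))                # cluster weight vectors w_j
p = rng.uniform(-0.1, 0.1, size=(N, M))    # state-action estimates p_jl
e = np.zeros((N, M))                       # eligibility measures e_jl
alpha, gamma, lam = 0.05, 0.99, 0.05       # values quoted in the simulation section

def winner(x):
    """Similarity matching (7): cluster whose weight vector is closest to x."""
    return int(np.argmin(np.linalg.norm(w - x, axis=1)))

def update(x_prev, a_prev, r, x_curr):
    """One learning step using the TD error of (9) and eligibility traces."""
    k_prev, k_curr = winner(x_prev), winner(x_curr)
    td = r + gamma * np.max(p[k_curr]) - p[k_prev, a_prev]   # TD error, eq. (9)
    e[:] *= gamma * lam              # decay all traces (assumed decay factor)
    e[k_prev, a_prev] = 1.0          # winning neuron / chosen action gets full credit
    p[:] += alpha * td * e           # assumed form of the state-action update
```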
4 Simulation results

The experiments were carried out in the Saphira Pioneer simulation environment for the Pioneer DX2 mobile robot platform with a front array of 8 sonars oriented at ±90°, ±50°, ±30° and ±10° relative to the longitudinal vehicle axis. Sonar measurements $d_i$ ($1 \le i \le 8$) were coarse coded as:

$$x_i = 1.0 - 2.0\,\frac{d_i - \mathrm{RANGE\_MIN}}{\mathrm{RANGE\_MAX} - \mathrm{RANGE\_MIN}}, \qquad (12)$$

where RANGE_MAX and RANGE_MIN denote the maximal and minimal range of the sonars, respectively, giving a nominal [-1, 1] sonar range. RANGE_MAX and RANGE_MIN may be chosen arbitrarily (yet taking into account the physical sonar constraints) and define the active sensor region, in our case also the active learning region. In our case RANGE_MIN = 15 cm and RANGE_MAX = 250 cm were chosen.
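As an illustration of the coarse coding (12), the short Python sketch below maps raw sonar readings to the nominal [-1, 1] interval using the ranges chosen above; the clipping of readings outside the active region and the function name are our assumptions.

```python
RANGE_MIN, RANGE_MAX = 15.0, 250.0   # active sensor/learning region, in cm

def coarse_code(d_cm: float) -> float:
    """Map a sonar reading to [-1, 1] as in (12); +1 means an obstacle at minimal range."""
    d = min(max(d_cm, RANGE_MIN), RANGE_MAX)   # clipping outside the active region (assumed)
    return 1.0 - 2.0 * (d - RANGE_MIN) / (RANGE_MAX - RANGE_MIN)

x = [coarse_code(d) for d in (20.0, 120.0, 250.0)]   # e.g. [0.957, 0.106, -1.0]
```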
The coarse coded sonar measurement vector [...]

[...] It can be seen from the action set that the robot is on a constant forward move. There are two main reasons for this: firstly, by random exploration and obstacle avoidance the mobile agent can build a map of an initially unknown environment (thus potentially fulfilling other important tasks of mobile robot navigation); secondly, the mobile agent is prevented [...] the closest object (as read by sonars) the mobile agent [...]

Initial positions of the weight vectors $\{\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_j, \ldots, \vec{w}_N\}$ of the controller are uniformly randomly distributed along the unit circle, whereas the weight vectors $\vec{p}_j$ are initialised randomly in the interval [-0.1, 0.1]. The corresponding learning rates are set to $\eta = 0.01$ and $\alpha = 0.05$, respectively. The discount factor for future predictions is $\gamma = 0.99$ and the decay rate is $\lambda = 0.05$.

The learning algorithm was verified in corridor-like environments designed in the Saphira world simulator. The mobile robot was allowed to randomly explore the environment and the thus acquired robot paths are depicted in Figures 2 to 5.
Fig. 5: A robot path with a 30-cluster internal representation (axes in cm).

To verify the learning algorithm, the greedy policy was applied, as stated earlier. Essentially, the learning agent chooses the action with maximum desirability, which is considered satisfactory thereafter. This may result in suboptimal solutions (an oscillating motion in a case when a straightforward motion is optimal). Since the primary task of obstacle avoidance is satisfied, a further improvement of the optimality of the solution should include a form of stochastic action exploration and a longer action sequence history. These aspects are to be included in further elaboration.
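As an illustration, the sketch below contrasts the greedy action selection used in the verification runs with a simple stochastic (ε-greedy) variant of the kind suggested above as a possible refinement; the ε value and all names are illustrative, not taken from the paper.

```python
import numpy as np

def greedy_action(p_k):
    """Greedy policy: pick the action with maximum desirability in cluster k."""
    return int(np.argmax(p_k))

def epsilon_greedy_action(p_k, eps=0.1, rng=np.random.default_rng()):
    """A simple stochastic-exploration variant (the epsilon value is illustrative)."""
    if rng.random() < eps:
        return int(rng.integers(len(p_k)))   # explore: random action
    return int(np.argmax(p_k))               # exploit: greedy action
```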
Moreover, to achieve global path planning navigation, a goal-seeking behavior must be included and coordinated with the local obstacle avoidance task. If the action set is chosen in such a way that the robot is on a constant forward move, map building and path tracking functionality may be included, converting an initially unknown environment into a known one, where global path planning techniques may be applied thereafter. However, these aspects are beyond the scope of this paper.

5 Conclusions

A reinforcement learning approach for the obstacle avoidance task of mobile robot navigation was developed. In general, reinforcement learning methods present a suitable solution to developing control strategies in an unknown environment. The learning agent may receive only sparse reward signals or credits for the actions taken, therefore internal state and action evaluation is required. In terms of the credit assignment problem, the internal representation of the input state space is particularly important.

A clustered discrete coding of the input state space was developed using a Kohonen neural network structure, as opposed to a continuous internal state representation. This enabled individual state-action credit assignment, giving a more precise evaluation computation and a faster convergence rate.

The learning algorithm was verified in a simulation environment where the mobile robot performed the obstacle avoidance capability, which was developed using an initially unknown control strategy.

In perspective, the optimality of the obtained solution should be taken into account, as well as an adaptive neural network structure which could reduce the size of the relevant feature representation to a minimum required to fulfill a given task such as obstacle avoidance.

6 References

[1] J.T. Schwartz, M. Sharir: "A survey of motion planning and related geometric algorithms", Artif. Intell. J., vol. 37, pp. 157-169, 1988.
[2] O. Khatib: "Real-time obstacle avoidance for manipulators and mobile robots", Int. J. Robot. Res., vol. 5, no. 1, pp. 90-98, 1986.
[3] A.G. Barto, R.S. Sutton, and C.W. Anderson: "Neuronlike adaptive elements that can solve difficult learning control problems", IEEE Trans. Syst. Man Cybern., vol. SMC-13, no. 5, pp. 834-847, 1983.
[4] C.T. Lin, C.S.G. Lee: "Reinforcement structure/parameter learning for neural-network-based fuzzy logic control system", IEEE Trans. on Fuzzy Systems, vol. 2, no. 1, pp. 46-63, 1994.
[5] H. Beom, H. Cho: "A sensor-based navigation for a mobile robot using fuzzy logic and reinforcement learning", IEEE Trans. Syst. Man Cybern., vol. SMC-25, no. 3, pp. 464-477, 1995.
[6] N.H.C. Yung, C. Ye: "An intelligent mobile vehicle navigator based on fuzzy logic and reinforcement learning", IEEE Trans. Syst. Man Cybern., vol. SMC-29, no. 2, pp. 314-321, 1999.
[7] G.A. Rummery: Problem Solving with Reinforcement Learning, PhD Thesis, Cambridge University Engineering Department, University of Cambridge, 1995.
[8] B.J.A. Kröse, J.W.M. van Dam: "Learning to avoid collisions: a reinforcement learning paradigm for mobile robot navigation", Proceedings of the IFAC/IFIP/IMACS Symposium on Artificial Intelligence in Real-Time Control, pp. 295-30, 1992.
[9] A.H. Fagg, D. Lotspeich, and G.A. Bekey: "A Reinforcement-Learning Approach to Reactive Control Policy Design for Autonomous Robots", Proc. of the 1994 IEEE International Conference on Robotics and Automation, vol. 1, pp. 39-44, San Diego, CA, May 8-13, 1994.
[10] R.S. Sutton: "Learning to predict by the methods of temporal differences", Machine Learning, vol. 3, pp. 9-44, 1988.
[11] R.S. Sutton: "Integrated architectures for learning, planning, and reacting based on approximating dynamic programming", in Proc. Seventh International Conference on Machine Learning, pp. 216-226, 1990.
[12] T. Kohonen: Self-Organization and Associative Memory, Springer-Verlag, 1984.
[13] C.J.C.H. Watkins, P. Dayan: "Technical Note: Q-learning", Machine Learning, vol. 8, pp. 279-292, 1992.