it has entered the zone it should simply not leave it again. The robot is initially placed at a random position outside the zone with a random orientation, and during learning it is punished for collisions and strongly rewarded for every time step it spends in the zone (for details see below). Moreover, while outside the zone, it is rewarded for moving as quickly and as straight as possible² and for keeping away from walls. It uses four infrared proximity sensors at the front.
² The left and the right pairs of the six front sensors are each averaged and used as if they were one sensor.
Figure 5. Simulated Khepera robot in environment 1. The large circle indicates the zone the robot should enter and stay in. The small circle represents the robot, and the lines inside the robot indicate position and direction of the infrared proximity sensors used in experiments 1 and 2.
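The reward scheme described above (punishment for collisions, a strong reward per time step in the zone, and a shaped reward for fast, straight, wall-avoiding motion outside it) can be sketched as a per-step reward function. The constants, argument names, and the exact shaping terms below are illustrative assumptions, not the authors' actual parameters:

```python
# Hypothetical reward shaping for the zone task. All constants are
# assumptions for illustration; the paper's actual values may differ.
COLLISION_PENALTY = -1.0
ZONE_REWARD = 1.0

def step_reward(in_zone, collided, left_speed, right_speed, max_proximity):
    """Reward for one simulation time step.

    left_speed, right_speed: wheel speeds, scaled to [-1, 1].
    max_proximity: largest infrared reading in [0, 1] (1 = touching a wall).
    """
    if collided:
        return COLLISION_PENALTY          # punished for collisions
    if in_zone:
        return ZONE_REWARD                # strong reward per step spent in the zone
    # Outside the zone: reward fast, straight motion away from walls.
    speed = (left_speed + right_speed) / 2.0            # forward speed
    straightness = 1.0 - abs(left_speed - right_speed) / 2.0
    return speed * straightness * (1.0 - max_proximity)
```

Note that the shaping terms are multiplicative, so turning on the spot (opposite wheel speeds) or driving against a wall yields no reward outside the zone.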
2. Network training
Recurrent networks are known to be difficult to train with gradient-descent methods such as standard backpropagation [Rumelhart, 1986] or even backpropagation through time [Werbos, 1990]. They are often sensitive to the fine details of the training algorithm, e.g. the number of time steps unrolled in the case of backpropagation through time (e.g., [Mozer, 1989]). For example, in an autonomous agent context, Rylatt showed, for one particular task, that with some enhancements Simple Recurrent Networks [Elman, 1990] could be trained to handle long-term dependencies in a continuous domain, thus contradicting the results of Ulbricht, who had argued the opposite. In an extension of the work discussed in the previous section, Meeden experimentally compared the training of recurrent control networks with (a) a local search method, a version of backpropagation adapted for reinforcement learning, and (b) a global method, an evolutionary algorithm. The results showed that the evolutionary algorithm in several cases found strategies that the local method did not find. In particular, when only delayed reinforcement was available to the learning robot, the evolutionary method performed significantly better because it did not rely on moment-to-moment guidance at all [Meeden, 1996].
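The advantage of the global method in the delayed-reinforcement case comes from the fact that an evolutionary algorithm only needs a scalar fitness per trial, not a per-step error signal to backpropagate. A minimal sketch of this idea, using a simple (1+λ) evolution strategy over the weights of a tiny fully recurrent network, might look as follows; the network layout, hyperparameters, and function names are assumptions for illustration and do not reproduce Meeden's actual setup:

```python
import math
import random

def rnn_step(weights, state, inputs):
    """One step of a tiny fully recurrent net: state <- tanh(W [inputs; state; 1])."""
    vec = list(inputs) + list(state) + [1.0]   # inputs, recurrent state, bias
    return [math.tanh(sum(w * v for w, v in zip(row, vec))) for row in weights]

def evolve(fitness, n_state, n_in, generations=50, offspring=8, sigma=0.1, seed=0):
    """(1+lambda) evolution strategy over the weight matrix.

    `fitness` maps a weight matrix to a scalar (e.g. total episodic reward),
    so no moment-to-moment error signal is required.
    """
    rng = random.Random(seed)
    n_cols = n_in + n_state + 1
    parent = [[rng.gauss(0.0, 0.5) for _ in range(n_cols)] for _ in range(n_state)]
    best = fitness(parent)
    for _ in range(generations):
        for _ in range(offspring):
            # Mutate every weight with Gaussian noise.
            child = [[w + rng.gauss(0.0, sigma) for w in row] for row in parent]
            f = fitness(child)
            if f >= best:                      # keep the child only if at least as fit
                parent, best = child, f
    return parent, best
```

A gradient-based method such as backpropagation through time would instead need the reward decomposed into per-step targets and a chosen unroll length, which is precisely where the sensitivity noted above arises.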