Large parts of the space of reinforcement learning methods have been explored only superficially. For example, even for the tabular case no control method using multi-step backups has been proved to converge to an optimal policy. Among planning methods, basic ideas such as trajectory sampling and selective backups are almost completely unexplored. On closer inspection, parts of the space will undoubtedly turn out to have far greater complexity and internal structure than is now apparent. There are also other dimensions along which reinforcement learning can be extended, which we have not yet mentioned, leading to a much larger space of methods. Here we identify some of these dimensions and note some of the open questions and frontiers left out of the preceding chapters.
One of the most important extensions of reinforcement learning beyond what we have treated in this book is to eliminate the requirement that the state representation have the Markov property. There are a number of interesting approaches to the non-Markov case. Most strive to construct from the given state signal and its past values a new signal that is Markov, or more nearly Markov. For example, one approach is based on the theory of partially observable MDPs (POMDPs). POMDPs are finite MDPs in which the state is not observable, but another ``sensation" signal stochastically related to the state is observable. The theory of POMDPs has been extensively studied for the case of complete knowledge of the dynamics of the POMDP. In this case, Bayesian methods can be used to compute at each time step the probability of the environment being in each state of the underlying MDP. This probability distribution can then be used as a new state signal for the original problem. The downside of the Bayesian POMDP approach is its computational expense and its strong reliance on complete environment models. Some of the recent work pursuing this approach is by Littman, Cassandra, and Kaelbling (1995), Parr and Russell (1995), and Chrisman (1992).
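As an illustration of the belief-state idea, the following sketch (in Python, with hypothetical names; the transition model T, the observation model O, and their array layouts are assumptions made for this example, not something specified above) computes one Bayesian update of the probability distribution over the hidden states after taking action a and receiving sensation o.

\begin{verbatim}
import numpy as np

def belief_update(b, a, o, T, O):
    # b : belief over hidden states, shape (S,)
    # a : action just taken
    # o : sensation (observation) just received
    # T : transition model, T[a, s, s2] = Pr(s2 | s, a)
    # O : observation model, O[a, s2, o] = Pr(o | s2, a)
    predicted = b @ T[a]                   # Pr(s2 | b, a) = sum_s b(s) T(s2 | s, a)
    unnormalized = O[a, :, o] * predicted  # weight by likelihood of the sensation
    return unnormalized / unnormalized.sum()

# Tiny made-up example: two hidden states, one action, two sensations.
T = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])
O = np.array([[[0.7, 0.3],
               [0.1, 0.9]]])
b = np.array([0.5, 0.5])
b = belief_update(b, a=0, o=1, T=T, O=O)   # the new belief serves as a Markov state signal
\end{verbatim}

The updated belief, rather than the raw sensation, is what plays the role of the state in the methods of the earlier chapters; the expense of maintaining it, and its dependence on the models T and O, are the costs noted above.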
The problem of learning with a non-Markov state representation is also closely related to that of function approximation. For example, state aggregation methods for function approximation are in effect equivalent to a non-Markov representation in which all members of a set of states are mapped into a common sensation. There are other parallels between the issues of function approximation and non-Markov representations. In both cases the overall problem divides into two parts: constructing an improved representation, and making do with the current representation. In both cases the ``making do" part is relatively well understood, whereas the constructive part is unclear and wide open. At this point we can only guess as to whether or not these parallels point to any common solution methods for the two problems.
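To make the parallel concrete, here is a minimal sketch (all names hypothetical) of state aggregation viewed as a non-Markov sensation: every underlying state is mapped to a group index, the value estimate depends only on that index, and so states within a group are indistinguishable to the learner.

\begin{verbatim}
class AggregatedValueFunction:
    # State aggregation as a sensation that aliases states:
    # groups maps each underlying state to a group index, and
    # all states in a group share a single value estimate.
    def __init__(self, groups, num_groups):
        self.groups = groups
        self.v = [0.0] * num_groups

    def sensation(self, state):
        return self.groups[state]           # the non-Markov "sensation"

    def value(self, state):
        return self.v[self.sensation(state)]

    def update(self, state, target, alpha=0.1):
        # "Making do" with the current representation: ordinary learning
        # applied to the aggregated sensation rather than the true state.
        g = self.sensation(state)
        self.v[g] += alpha * (target - self.v[g])
\end{verbatim}

The hard, open part, choosing or improving the grouping itself, corresponds to the constructive problem in both settings.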
Another important direction for extending reinforcement learning beyond what we have covered in this book is to incorporate ideas of modularity and hierarchy. Introductory reinforcement learning is about learning value functions and one-step models of the dynamics of the environment. But much of what people learn does not seem to fall exactly into either of these categories. For example, consider what we know about tying our shoes, making a phone call, or traveling to London. Having learned how to do such things, we are then able to choose and plan among them as if they were primitive actions. What we have learned in order to do this are not conventional value functions or one-step models. We are able to plan and learn at a variety of levels and flexibly interrelate them. Much of our learning appears not to be about learning values directly, but about preparing us to quickly estimate values later in response to new situations or new information. Considerable reinforcement learning research has been directed at capturing such abilities (e.g., Watkins, 1989; Dayan and Hinton, 1993; Singh, 1992a,b; Ring, 1994; Kaelbling, 1993; Sutton, 1995).
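One common way to make such abilities concrete, sketched below purely for illustration (the Skill class, run_skill, and the environment interface are assumptions of this example, not the specific proposals of the works cited), is to package a learned behavior as a policy plus a termination condition, execute it to completion, and then back up values over the whole multi-step execution as if it were a single action.

\begin{verbatim}
class Skill:
    # A temporally extended behavior, e.g. "make a phone call":
    # a policy over primitive actions plus a termination test.
    def __init__(self, policy, is_done):
        self.policy = policy      # state -> primitive action
        self.is_done = is_done    # state -> True when the skill is finished

def run_skill(env, state, skill, gamma=0.95):
    # Execute the skill to termination, accumulating discounted reward.
    ret, discount, steps = 0.0, 1.0, 0
    while not skill.is_done(state):
        state, reward = env.step(state, skill.policy(state))
        ret += discount * reward
        discount *= gamma
        steps += 1
    return state, ret, steps

def skill_backup(Q, s, k, s_next, ret, steps, alpha=0.1, gamma=0.95):
    # Back up the value of choosing skill k in state s over the whole
    # multi-step execution, as if the skill were a primitive action.
    best_next = max(Q[s_next].values())
    Q[s][k] += alpha * (ret + gamma**steps * best_next - Q[s][k])
\end{verbatim}

Treating whole skills as choices in this way lets planning and learning proceed at a coarser time scale than the primitive actions, which is the flexibility the paragraph above describes.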