Dr. Dobb's Journal March 1997
The field of genetic algorithms (GAs) addresses the same set of problems as reinforcement learning, but with a different methodology. Instead of learning a utility function, GAs learn the policy directly, representing it as an encoded bit string (or, in genetic programming, as a program in some computer language). The approach is to search the space of policies by generating a pool of policies and evaluating each one for a sequence of time steps in a simulated environment. A new pool of policies is constructed by combining the more successful existing policies with some random change. The algorithm iterates until a sufficiently good policy is found.
The name "genetic algorithms" comes from an analogy with biology: The bits representing a policy are like the bases in a strand of DNA, the creation of new policies is like the process of sexual reproduction, the random change is like mutation, and the propagation of successful policies is like natural selection.
The GA approach shares with RL the twin benefits of being based on training (not programming) and being responsive to changes in the environment through the possibility of online retraining. However, there are reasons to believe that RL makes better use of available computing resources. Consider, for example, a policy A that is in the pool of policies being evaluated by GAs. Further, assume that the long-term cumulative reward from policy A is very poor. Then GAs would simply discard policy A in favor of better policies in its pool without learning anything from the simulation of A.
RL, on the other hand, has the potential of learning a lot from following policy A. Suppose policy A assigns action a to some state s and whenever action a is executed in state s, the resulting state always has high utility. RL will learn that action a is good in state s from executing policy A, even though the policy itself is bad. It can do this because it learns a state-based utility function. GAs, however, ignore the state-based structure of the problem and therefore waste information.
In summary, the GA approach needs a partial success in terms of total reward before it can make use of it, while RL only needs a partial success in terms of reaching a high-utility intermediate state, which is easier to achieve.
It is not clear exactly when GAs are the preferred approach over RL, but in general, RL seems to do better as the problem becomes more complex. GAs may do well when there is a constrained, small space of policies to be searched.
-- S.S., P.N., D.C.