Reinforcement Learning vs. Evolutionary Strategy: combine, aggregate, multiply

A birds-eye view of main ML algorithms

In statistics, we have descriptive and inferential statistics. ML deals with the same problems and claims any problem where the solution isn’t programmed directly, but is learned by the program. ML generally works by numerically minimising something: a cost function or error.

Supervised learning – You have labeled data: a sample of ground truth with features and labels. You estimate a model that predicts the labels using the features. Alternative terminology: predictor variables and target variables. You predict the values of the target using the predictors.

  • Regression. The target variable is numeric. Example: you want to predict the crop yield based on remote sensing data. Recurrent neural networks result in a “regression” since they usually output a number (a sequence or a vector) instead of a class (e.g. sentence generation, curve plotting). Algorithms: linear regression, polynomial regression, generalised linear models.
  • Classification. The target variable is categorical. Example: you want to detect the crop type that was planted using remote sensing data. Or Silicon Valley’s “Not Hot Dog” application. Algorithms: Naïve Bayes, logistic regression, discriminant analysis, decision trees, random forests, support vector machines, neural networks (NN) of many variations: feed-forward NNs, convolutional NNs, recurrent NNs.

Unsupervised learning – You have a sample with unlabeled information. No single variable is the specific target of prediction. You want to learn interesting features of the data:

  • Clustering. Which of these things are similar? Example: group consumers into relevant psychographics. Algorithms – k-means, hierarchical clustering.
  • Anomaly detection. Which of these things are different? Example: credit card fraud detection. Algorithms: k-nearest-neighbor.
  • Dimensionality reduction. How can you summarise the data in a high-dimensional data set using a lower-dimensional dataset which captures as much of the useful information as possible (possibly for further modelling with supervised or unsupervised algorithms)? Example: image compression. Algorithms: principal component analysis (PCA), neural network auto-encoders.

Reinforcement Learning  (Policy Gradients, DQN, A3C,..) – You are presented with a game/environment that responds sequentially or continuously to your inputs, and you learn to maximise an objective through trial and error.

Evolutionary Strategy – This approach consists of maintaining a distribution over network weight values, and having a large number of agents act in parallel using parameters sampled from this distribution. With this score, the parameter distribution can be moved toward that of the more successful agents, and away from that of the unsuccessful ones. By repeating this approach millions of times, with hundreds of agents, the weight distribution moves to a space that provides the agents with a good policy for solving the task at hand.

All the complex tasks in ML, from self-driving cars to machine translation, are solved by combining these building blocks into complex stacks.

Pro/cons of RL and ES

One step towards building safe AI systems is to remove the need for humans to write goal functions, since using a simple proxy for a complex goal, or getting the complex goal a bit wrong, can lead to undesirable and even dangerous behaviour.

RL is known to be unstable or even to diverge when a nonlinear function approximator such as a NN is used to represent the action-value (also known as Q) function. This instability has several causes: the correlations present in the sequence of observations, the fact that small updates to Q may significantly change the policy and therefore change the data distribution, and the correlations between the action-values and the target values.

RL’s other challenge is generalisation. In typical deep RL methods, this is achieved by approximating the optimal value function with a low-dimensional representation using a deep network. While this approach works well in many domains, in domains where the optimal value function cannot easily be reduced to a low-dimensional representation, learning can be very slow and unstable.

Whereas RL methods such as A3C need to communicate gradients back and forth between workers and a parameter server, ES only requires fitness scores and high-level parameter distribution information to be communicated. It is this simplicity that allows the technique to scale up in ways current RL methods cannot. However, in situations with richer feedback signals however, things don’t go so well for ES.

Contextualising and combining the RL and ES

Appealing to nature for inspiration in AI can sometimes be seen as a problematic approach. Nature, after all, is working under constraints that computer scientists simply don’t have. If we look at intelligent behaviour in mammals, we find that it comes from a complex interplay of two ultimately intertwined processes, inter-life learning, and intra-life learning. Roughly speaking these two approaches in nature can be compared to the two in neural network optimisation. ES for which no gradient information is used to update the organism, is related to inter-life learning. Likewise, the gradient based methods (RL), for which specific experiences change the agent in specific ways, can be compared to intra-life learning.

The techniques employed in RL are in many ways inspired directly by the psychological literature on operant conditioning to come out of animal psychology. (In fact, Richard Sutton, one of the two founders of RL actually received his Bachelor’s degree in Psychology). In operant conditioning animals learn to associate rewarding or punishing outcomes with specific behaviour patterns. Animal trainers and researchers can manipulate this reward association in order to get animals to demonstrate their intelligence or behave in certain ways.

The central role of prediction in intra-life learning changes the dynamics quite a bit. What was before a somewhat sparse signal (occasional reward), becomes an extremely dense signal. At each moment mammalian brains are predicting the results of the complex flux of sensory stimuli and actions which the animal is immersed in. The outcome of the animals behaviour then provides a dense signal to guide the change in predictions and behaviour going forward. All of these signals are put to use in the brain in order to improve predictions (and consequently the quality of actions) going forward. If we apply this way of thinking to learning in artificial agents, we find that RL isn’t somehow fundamentally flawed, rather it is that the signal being used isn’t nearly as rich as it could (or should) be. In cases where the signal can’t be made more rich, (perhaps because it is inherently sparse, or to do with low-level reactivity) it is likely the case that learning through a highly parallelizable method such as ES is instead better.

Combining many

It is clear that for many reactive policies, or situations with extremely sparse rewards, ES is a strong candidate, especially if you have access to the computational resources that allow for massively parallel training.  On the other hand, gradient-based methods using RL or supervision are going to be useful when a rich feedback signal is available, and we need to learn quickly with less data.

An extreme example is combining more than just ES and RL and Microsoft’s Maluuba is a an illustrative example, which used many algorithms to beat the game Ms. Pac-Man. When the agent (Ms. Pac-Man) starts to learn, it moves randomly; it knows nothing about the game board. As it discovers new rewards (the little pellets and fruit Ms. Pac-Man eats) it begins placing little algorithms in those spots, which continuously learn how best to avoid ghosts and get more points based on Ms. Pac-Man’s interactions, according to the Maluuba research paper.

As the 163 potential algorithms are mapped, they continually send which movement they think would generate the highest reward to the agent, which averages the inputs and moves Ms. Pac-Man. Each time the agent dies, all the algorithms process what generated rewards. These helper algorithms were carefully crafted by humans to understand how to learn, however.

Instead of having one algorithm learn one complex problem, the AI distributes learning over many smaller algorithms, each tackling simpler problems, Maluuba says in a video. This research could be applied to other highly complex problems, like financial trading, according to the company.

But it’s worth noting that since more than 100 algorithms are being used to tell Ms. Pac-Man where to move and win the game, this technique is likely to be extremely computationally intensive.