How GANs can turn AI into a massive force

Deep learning models can already achieve state-of-the-art results in some applications, but their capabilities are still limited. Unlike humans, deep learning models struggle to cope with even minor changes to their inputs or tasks, and so can only be applied to specific, narrowly defined problems.

Consider this conversation between two AI agents developed at Facebook, running what might be the most sophisticated negotiation software on the planet:

Bob: “I can can I I everything else.”

Alice: “Balls have zero to me to me to me to me to me to me to me to me to.”

At first, they were speaking in plain old English, but the researchers realized they had forgotten to include a reward for sticking to the language. So the AI agents began to diverge, eventually rearranging legible words into seemingly nonsensical (but, from their perspective, highly efficient) sentences. They invented their own codewords, abbreviations, and structures.

This phenomenon is observed again and again and again.

A vanguard AI technology that can learn, recognize, and generate information on a nearly human level doesn’t exist yet, but we have taken steps in that direction.

What are generative adversarial networks (GANs)?

Generally intelligent systems must be able to generalize from limited data and learn causal relationships. In 2014, Ian Goodfellow, a researcher now at Google Brain, proposed generative adversarial networks (GANs) as an alternative unsupervised machine learning method, aiming to address many of the pain points of existing approaches.

GANs consist of two deep neural networks: a generator and a discriminator. The generator’s goal is to create data samples that are indistinguishable from real ones. The discriminator’s goal is to identify which data samples are real and which are the generator’s fakes.

These two networks compete against each other in a zero-sum game (i.e. one’s loss is the other’s gain), and both become stronger over a relatively short period of time.

[Diagram: the GAN generator-discriminator training loop]

Backpropagation is used to update the model parameters and train the neural networks. Over time, the networks learn many features of the provided data. To create realistic forged samples, the generator needs to learn the data’s features and patterns, while the discriminator does the same to correctly distinguish between real and fake samples.
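To make this interplay concrete, here is a minimal training-loop sketch in PyTorch. It is an illustration only, assuming a toy two-dimensional dataset; the network sizes, learning rates, and number of steps are arbitrary choices, not taken from any work mentioned here.

```python
# A minimal GAN training loop (illustrative sketch, assuming PyTorch and a
# toy 2-D dataset; sizes and learning rates are arbitrary).
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 8, 2, 64

# Generator: noise -> fake sample. Discriminator: sample -> probability "real".
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(G.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

def real_batch():
    # Stand-in for real training data: points from a 2-D Gaussian.
    return torch.randn(batch, data_dim) * 0.5 + torch.tensor([2.0, 2.0])

for step in range(2000):
    # 1) Discriminator step: push real samples toward 1, fakes toward 0.
    fake = G(torch.randn(batch, latent_dim)).detach()
    d_loss = bce(D(real_batch()), torch.ones(batch, 1)) + \
             bce(D(fake), torch.zeros(batch, 1))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # 2) Generator step: push the discriminator's output on fakes toward 1.
    fake = G(torch.randn(batch, latent_dim))
    g_loss = bce(D(fake), torch.ones(batch, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```

The key point is the alternation: the discriminator is trained to separate real from fake, then the generator is trained to fool the freshly updated discriminator.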

GANs can thus overcome the weaknesses described above: by training (i.e. playing) neural networks against each other, the networks learn from each other, which requires less data and eventually yields better performance across a broader range of problems.

Applications of GANs

There are several types of GANs, and some of the most obvious applications include high-resolution or interactive image generation and blending, image inpainting, image-to-image translation, abstract reasoning, semantic segmentation, video generation, and text-to-image synthesis, among others.

The video game industry is the first area of entertainment to start seriously experimenting with AI for generating raw content. There’s a huge cost incentive to invest in video game development automation, given the US$300 million+ budgets of modern AAA video games.

GANs have also been used for text, with less success: a bot developed to speak like Friedrich Nietzsche started to speak in a manner similar to the philosopher, but the sentences did not make sense. GANs for voice applications can reproduce a given text string in a life-like voice from roughly 20 minutes of voice samples, as in the popular impersonations of American presidents Donald Trump and Barack Obama. In the near future, videos will likely be generated just by providing a script.

Goodfellow and his colleagues used GANs for image generation, recognition, and classification by teaching one of the networks to create images of handwritten digits that humans could not distinguish from real ones. They also trained a neural network to create images of objects, which humans could differentiate from real ones only 78.7 percent of the time. Below are some sample images of faces created entirely by deep convolutional GANs.
[Image: sample faces generated by deep convolutional GANs]

Despite all the above achievements, GANs still have weaknesses:

  • Instability (the generator and discriminator losses keep oscillating) and non-convergence of the objective function to an optimum
  • Mode collapse (this happens when the generator doesn’t produce diverse images or information)
  • The possibility that either the generator or the discriminator becomes too strong relative to the other during training
  • The possibility that either the generator or the discriminator never learns beyond a certain point

An existential threat

Do GANs and AI in general pose an existential threat to humanity? Elon Musk thinks so. Since 2014, he has been advocating adoption of AI regulations by authorities around the world. Recently, he reiterated the urgent need to be proactive in regulation.

“AI is a fundamental risk to the existence of human civilization,” Musk told US politicians recently.

His concerns stem from the rapid developments related to GANs, which might push humanity toward the inception of artificial general intelligence. While AI regulations may serve as safeguards, AI is still far from the fictitious depictions seen frequently in Hollywood sci-fi movies.

(By the way, Facebook ultimately opted to require its negotiation bots to speak in plain old English.)

This article originally appeared on Tech in Asia.

Reinforcement Learning vs. Evolutionary Strategy: combine, aggregate, multiply

A bird’s-eye view of the main ML algorithms

In statistics, we have descriptive and inferential statistics. ML deals with the same problems, and claims as its own any problem where the solution isn’t programmed directly but is instead learned by the program. ML generally works by numerically minimising something: a cost function, or error.

Supervised learning – You have labeled data: a sample of ground truth with features and labels. You estimate a model that predicts the labels using the features. Alternative terminology: predictor variables and target variables. You predict the values of the target using the predictors.

  • Regression. The target variable is numeric. Example: you want to predict the crop yield based on remote sensing data. Recurrent neural networks result in a “regression” since they usually output a number (a sequence or a vector) instead of a class (e.g. sentence generation, curve plotting). Algorithms: linear regression, polynomial regression, generalised linear models.
  • Classification. The target variable is categorical. Example: you want to detect the crop type that was planted using remote sensing data. Or Silicon Valley’s “Not Hot Dog” application. Algorithms: Naïve Bayes, logistic regression, discriminant analysis, decision trees, random forests, support vector machines, neural networks (NN) of many variations: feed-forward NNs, convolutional NNs, recurrent NNs. A short sketch of both cases follows this list.
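Both of the above cases can be made concrete in a few lines. The sketch below assumes scikit-learn and synthetic data (not the crop examples from the text); it fits a linear regression for a numeric target and a logistic regression for a categorical one.

```python
# Supervised learning in a nutshell (illustrative sketch, assuming
# scikit-learn and synthetic data rather than the crop examples above).
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Regression: the target is numeric.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_tr, y_tr)
print("Regression R^2 on held-out data:", reg.score(X_te, y_te))

# Classification: the target is categorical.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print("Classification accuracy on held-out data:", clf.score(X_te, y_te))
```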

Unsupervised learning – You have a sample with unlabeled information. No single variable is the specific target of prediction. You want to learn interesting features of the data:

  • Clustering. Which of these things are similar? Example: group consumers into relevant psychographics. Algorithms: k-means, hierarchical clustering.
  • Anomaly detection. Which of these things are different? Example: credit card fraud detection. Algorithms: k-nearest-neighbor.
  • Dimensionality reduction. How can you summarise a high-dimensional data set using a lower-dimensional one that captures as much of the useful information as possible (possibly for further modelling with supervised or unsupervised algorithms)? Example: image compression. Algorithms: principal component analysis (PCA), neural network auto-encoders. A short sketch of clustering and dimensionality reduction follows this list.
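Here is an equally small sketch of the unsupervised side, again assuming scikit-learn and synthetic data: k-means finds the two groups, and PCA compresses ten dimensions down to two.

```python
# Unsupervised learning in a nutshell (illustrative sketch, assuming
# scikit-learn): k-means clustering plus PCA dimensionality reduction.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two loose groups of points in 10 dimensions, with no labels attached.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 10)),
               rng.normal(5.0, 1.0, size=(100, 10))])

# Clustering: which of these things are similar?
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Dimensionality reduction: summarise 10 dimensions in 2.
X_2d = PCA(n_components=2).fit_transform(X)

print("Cluster sizes:", np.bincount(labels), "reduced shape:", X_2d.shape)
```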

Reinforcement Learning (Policy Gradients, DQN, A3C, ...) – You are presented with a game/environment that responds sequentially or continuously to your inputs, and you learn to maximise an objective through trial and error.
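As a minimal illustration of learning by trial and error, the sketch below runs tabular Q-learning on a toy six-state chain. It uses plain NumPy, not one of the deep methods named above, and the environment, reward, and hyperparameters are invented for the example.

```python
# Trial-and-error learning in miniature: tabular Q-learning on a toy
# six-state chain (hypothetical environment, arbitrary hyperparameters).
import numpy as np

n_states, n_actions = 6, 2            # actions: 0 = step left, 1 = step right
Q = np.zeros((n_states, n_actions))   # estimated value of each action in each state
alpha, gamma, epsilon = 0.1, 0.95, 0.1
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    for t in range(100):                      # cap episode length
        if s == n_states - 1:                 # rightmost state is the goal
            break
        # Epsilon-greedy with random tie-breaking: explore, else exploit.
        if rng.random() < epsilon or np.allclose(Q[s], Q[s, 0]):
            a = int(rng.integers(n_actions))
        else:
            a = int(Q[s].argmax())
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        reward = 1.0 if s_next == n_states - 1 else 0.0
        # Move the estimate toward reward plus discounted future value.
        Q[s, a] += alpha * (reward + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print("Greedy policy per state:", Q.argmax(axis=1))  # expect mostly 1 ("right")
```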

Evolutionary Strategy – This approach consists of maintaining a distribution over network weight values, and having a large number of agents act in parallel using parameters sampled from this distribution. Each agent is scored on how well it performs; with these scores, the parameter distribution can be moved toward that of the more successful agents, and away from that of the unsuccessful ones. By repeating this approach millions of times, with hundreds of agents, the weight distribution moves to a space that provides the agents with a good policy for solving the task at hand.
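A bare-bones version of this idea fits in a few lines of NumPy. The sketch below is illustrative rather than a faithful reproduction of any published ES implementation: the "environment" is just a fitness function maximised at a known parameter vector, and the number of agents and learning rate are arbitrary.

```python
# Minimal evolution-strategies sketch: keep a distribution over parameters
# (a mean vector plus Gaussian noise), score sampled "agents" with a fitness
# function, and move the mean toward the better-scoring samples.
import numpy as np

def fitness(w):
    # Toy stand-in for an agent's score in an environment:
    # higher is better, maximised at w = (3, -1, 2).
    return -np.sum((w - np.array([3.0, -1.0, 2.0])) ** 2)

rng = np.random.default_rng(0)
mean = np.zeros(3)                   # current centre of the weight distribution
sigma, lr, n_agents = 0.1, 0.02, 50

for generation in range(300):
    noise = rng.normal(size=(n_agents, mean.size))
    scores = np.array([fitness(mean + sigma * eps) for eps in noise])
    # Normalise scores, then shift the mean toward the more successful samples.
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)
    mean += lr / (n_agents * sigma) * noise.T @ scores

print("Estimated optimum:", np.round(mean, 2))   # should approach [3, -1, 2]
```

Note that only fitness scores feed back into the update; no gradients of the policy itself are computed, which is what makes the approach so easy to parallelise.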

All the complex tasks in ML, from self-driving cars to machine translation, are solved by combining these building blocks into complex stacks.

Pros/cons of RL and ES

One step towards building safe AI systems is to remove the need for humans to write goal functions, since using a simple proxy for a complex goal, or getting the complex goal a bit wrong, can lead to undesirable and even dangerous behaviour.

RL is known to be unstable or even to diverge when a nonlinear function approximator such as a NN is used to represent the action-value (also known as Q) function. This instability has several causes: the correlations present in the sequence of observations, the fact that small updates to Q may significantly change the policy and therefore change the data distribution, and the correlations between the action-values and the target values.

RL’s other challenge is generalisation. In typical deep RL methods, this is achieved by approximating the optimal value function with a low-dimensional representation using a deep network. While this approach works well in many domains, in domains where the optimal value function cannot easily be reduced to a low-dimensional representation, learning can be very slow and unstable.

Whereas RL methods such as A3C need to communicate gradients back and forth between workers and a parameter server, ES only requires fitness scores and high-level parameter distribution information to be communicated. It is this simplicity that allows the technique to scale up in ways current RL methods cannot. In situations with richer feedback signals, however, things don’t go so well for ES.

Contextualising and combining RL and ES

Appealing to nature for inspiration in AI can sometimes be seen as a problematic approach. Nature, after all, is working under constraints that computer scientists simply don’t have. If we look at intelligent behaviour in mammals, we find that it comes from a complex interplay of two ultimately intertwined processes: inter-life learning and intra-life learning. Roughly speaking, these two processes in nature can be compared to the two approaches in neural network optimisation. ES, in which no gradient information is used to update the organism, is related to inter-life learning. Likewise, the gradient-based methods (RL), in which specific experiences change the agent in specific ways, can be compared to intra-life learning.

The techniques employed in RL are in many ways directly inspired by the psychological literature on operant conditioning that came out of animal psychology. (In fact, Richard Sutton, one of the two founders of RL, actually received his Bachelor’s degree in psychology.) In operant conditioning, animals learn to associate rewarding or punishing outcomes with specific behaviour patterns. Animal trainers and researchers can manipulate this reward association in order to get animals to demonstrate their intelligence or behave in certain ways.

The central role of prediction in intra-life learning changes the dynamics quite a bit. What was before a somewhat sparse signal (occasional reward) becomes an extremely dense signal. At each moment, mammalian brains are predicting the results of the complex flux of sensory stimuli and actions in which the animal is immersed. The outcome of the animal’s behaviour then provides a dense signal to guide the change in predictions and behaviour going forward. All of these signals are put to use in the brain in order to improve predictions (and consequently the quality of actions). If we apply this way of thinking to learning in artificial agents, we find that RL isn’t somehow fundamentally flawed; rather, the signal being used isn’t nearly as rich as it could (or should) be. In cases where the signal can’t be made richer (perhaps because it is inherently sparse, or has to do with low-level reactivity), it is likely that learning through a highly parallelizable method such as ES is the better choice.

Combining many

It is clear that for many reactive policies, or situations with extremely sparse rewards, ES is a strong candidate, especially if you have access to the computational resources that allow for massively parallel training.  On the other hand, gradient-based methods using RL or supervision are going to be useful when a rich feedback signal is available, and we need to learn quickly with less data.

An extreme example is combining more than just ES and RL: Microsoft’s Maluuba provides an illustrative case, using many algorithms to beat the game Ms. Pac-Man. When the agent (Ms. Pac-Man) starts to learn, it moves randomly; it knows nothing about the game board. As it discovers new rewards (the little pellets and fruit Ms. Pac-Man eats), it begins placing little algorithms in those spots, which continuously learn how best to avoid ghosts and get more points based on Ms. Pac-Man’s interactions, according to the Maluuba research paper.

As the 163 potential algorithms are mapped, they continually send the movement they think would generate the highest reward to the agent, which averages the inputs and moves Ms. Pac-Man. Each time the agent dies, all the algorithms process what generated rewards. These helper algorithms were carefully crafted by humans to understand how to learn, however.
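The aggregation step can be pictured with a small sketch. Everything below (the class name, the way helpers score moves) is hypothetical and only meant to illustrate "many helpers vote, the agent averages"; it is not Maluuba's actual code.

```python
# Conceptual sketch of reward aggregation (hypothetical names and structure):
# many small helpers each score the possible moves, and the agent averages
# their scores before picking a direction.
import numpy as np

ACTIONS = ["up", "down", "left", "right"]

class Helper:
    """A tiny stand-in for one location-specific learner."""
    def __init__(self, rng):
        # In the real system these preferences are learned from rewards;
        # here they are random placeholders.
        self.preferences = rng.normal(size=len(ACTIONS))

    def action_values(self, state):
        # Hypothetical: ignores the game state for brevity.
        return self.preferences

rng = np.random.default_rng(0)
helpers = [Helper(rng) for _ in range(163)]   # one per reward location, per the paper

def choose_action(state):
    # Average the per-action scores sent by every helper, then act greedily.
    avg = np.mean([h.action_values(state) for h in helpers], axis=0)
    return ACTIONS[int(np.argmax(avg))]

print(choose_action(state=None))
```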

Instead of having one algorithm learn one complex problem, the AI distributes learning over many smaller algorithms, each tackling simpler problems, Maluuba says in a video. This research could be applied to other highly complex problems, like financial trading, according to the company.

But it’s worth noting that since more than 100 algorithms are being used to tell Ms. Pac-Man where to move and win the game, this technique is likely to be extremely computationally intensive.