Reinforcement learning and its new frontiers

RL’s origins and historic context

RL copies a very simple principle from nature. The psychologist Edward Thorndike documented it more than 100 years ago. Thorndike placed cats inside boxes from which they could escape only by pressing a lever. After a considerable amount of pacing around and meowing, the animals would eventually step on the lever by chance. After they learned to associate this behaviour with the desired outcome, they eventually escaped with increasing speed.

Some of the earliest AI researchers believed that this process might be usefully reproduced in machines. In 1951, Marvin Minsky, a student at Harvard who would become one of the founding fathers of AI, built a machine that used a simple form of reinforcement learning to mimic a rat learning to navigate a maze. Minsky’s Stochastic Neural Analog Reinforcement Calculator (SNARC) consisted of dozens of tubes, motors, and clutches that simulated the behaviour of 40 neurons and synapses. As a simulated rat made its way out of a virtual maze, the strength of some synaptic connections would increase, thereby reinforcing the underlying behaviour.

There were few successes over the next few decades. In 1992, Gerald Tesauro demonstrated a program that used the technique to play backgammon. It became skilled enough to rival the best human players, a landmark achievement in AI. But RL proved difficult to scale to more complex problems.

In March 2016, however, AlphaGo, a program trained using RL, won against one of the best Go players of all time, South Korea’s Lee Sedol. This milestone reopened the Pandora’s box of RL research. It turns out that the key to strong RL is to combine it with deep learning.

Current usage and major methods of RL

Thanks to current RL research, computers can now automatically learn to play ATARI games and beat world champions at Go, simulated quadrupeds are learning to run and leap, and robots are learning to perform complex manipulation tasks that defy explicit programming.

However, while advances in RL have accelerated, that progress has been driven less by new ideas or additional research than by more data, processing power and infrastructure. In general, there are four separate factors that hold back AI:

  1. Processing power (the obvious one: Moore’s Law, GPUs, ASICs),
  2. Data (in a specific form, not just somewhere on the internet – e.g. ImageNet),
  3. Algorithms (research and ideas, e.g. backprop, CNN, LSTM), and
  4. Infrastructure (Linux, TCP/IP, Git, AWS, TensorFlow, ...).

The same pattern holds in both computer vision and RL. In computer vision, the 2012 AlexNet was a deeper and wider version of the Convolutional Neural Networks (CNNs) of the 1990s. Likewise, ATARI’s Deep Q Learning is an implementation of the standard Q Learning algorithm with function approximation, where the function approximator is a CNN. AlphaGo uses Policy Gradients with Monte Carlo tree search (MCTS).
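
To make the “standard Q Learning algorithm” concrete, here is a minimal sketch of its tabular form. The corridor environment, the function name and all parameter values are assumptions made for illustration; a deep Q-network would replace the table q with a CNN.

```python
import random


def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical corridor.

    The agent starts in state 0 and receives a reward of 1 when it reaches
    the last state. Action 0 moves left, action 1 moves right.
    """
    rng = random.Random(seed)
    moves = (-1, +1)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        while state != n_states - 1:
            if rng.random() < epsilon:
                action = rng.randrange(2)  # explore
            else:
                best = max(q[state])  # exploit, breaking ties at random
                action = rng.choice([a for a in range(2) if q[state][a] == best])
            next_state = min(max(state + moves[action], 0), n_states - 1)
            reward = 1.0 if next_state == n_states - 1 else 0.0
            # Q-learning update: move the estimate towards the bootstrapped target.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = q_learning()
policy = ["left" if row[0] > row[1] else "right" for row in q_table[:-1]]
```

In this toy setting the learned policy moves right in every non-terminal state, which is the shortest route to the reward.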

RL’s most optimal method vs. human learning

Generally, RL approaches can be divided into two core categories. The first focuses on finding the optimum mappings that perform well in the problem of interest. Genetic algorithms, genetic programming and simulated annealing have been commonly employed in this class of RL approaches. The second category estimates the utility function of taking an action for the given problem via statistical techniques or dynamic programming methods, such as TD(λ) and Q-learning. To date, RL has been successfully applied in many complex real-world applications, including autonomous helicopters, humanoid robotics, autonomous vehicles, etc.

Policy Gradients (PGs), one of RL’s most used methods, is shown to work better than Q Learning when tuned well. PG is preferred because there’s an explicit policy and a principled approach that directly optimises the expected reward.
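
As an illustration of an explicit policy whose expected reward is optimised directly, the sketch below applies the basic REINFORCE form of a policy gradient to a two-armed bandit. The bandit, its win probabilities, the running-average baseline and the learning rate are all assumptions chosen to keep the example small.

```python
import math
import random


def softmax(logits):
    """Turn action preferences into a probability distribution."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(win_probs=(0.2, 0.8), steps=2000, lr=0.1, seed=0):
    """REINFORCE on a hypothetical bandit; the policy is a softmax over logits."""
    rng = random.Random(seed)
    logits = [0.0] * len(win_probs)
    baseline = 0.0  # running average reward, used to reduce variance
    for t in range(1, steps + 1):
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = 1.0 if rng.random() < win_probs[action] else 0.0
        advantage = reward - baseline
        for k in range(len(logits)):
            # Gradient of log pi(action) with respect to logit k.
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad_log_pi
        baseline += (reward - baseline) / t
    return softmax(logits)


final_policy = reinforce_bandit()
```

Actions that were followed by a better-than-average reward become more probable, so the policy should drift towards the arm that pays out more often.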

Before reaching for PGs (the cannon), it is recommended to first try the cross-entropy method (CEM) (the ordinary gun), a simple stochastic hill-climbing “guess and check” approach loosely inspired by evolution. And if you really need to, or insist on, using PGs for your problem, use a variation called TRPO (Trust Region Policy Optimisation), which usually works better and more consistently than vanilla PG in practice. The main idea is to avoid parameter updates that change the policy too dramatically, as enforced by a constraint on the KL divergence between the distributions predicted by the old and the new policies on a batch of data.
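
The “guess and check” loop of CEM can be sketched in a few lines. This is only an illustration: the score function stands in for the return of an episode run with the given policy parameters, and the target values, population size and initial spread are assumptions.

```python
import random
import statistics


def cross_entropy_method(score, dim, iterations=40, population=200,
                         elite_frac=0.2, init_std=5.0, seed=0):
    """Sample parameters, keep the best fraction, refit the sampler, repeat."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [init_std] * dim
    n_elite = max(2, int(population * elite_frac))
    for _ in range(iterations):
        # Guess: draw candidate parameter vectors from the current Gaussian.
        samples = [[rng.gauss(mean[i], std[i]) for i in range(dim)]
                   for _ in range(population)]
        # Check: rank the candidates by score and keep the elite.
        samples.sort(key=score, reverse=True)
        elite = samples[:n_elite]
        for i in range(dim):
            column = [s[i] for s in elite]
            mean[i] = statistics.mean(column)
            std[i] = statistics.pstdev(column) + 1e-3  # keep a little noise
    return mean


# Hypothetical stand-in for "total reward of an episode with these parameters".
TARGET = (3.0, -2.0)


def toy_return(params):
    return -sum((p - t) ** 2 for p, t in zip(params, TARGET))


best_params = cross_entropy_method(toy_return, dim=2)
```

No gradients are needed, which is why CEM is a reasonable first thing to try before any PG variant.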

PGs, however, have a few disadvantages: they typically converge to a local rather than a global optimum, and evaluating a policy is inefficient and has high variance. PGs also require a lot of training samples, take a long time to train, and are hard to debug when they don’t work.

PG is a fancy form of guess-and-check, where the “guess” refers to sampling rollouts from the current policy and the “check” refers to encouraging actions that lead to good outcomes. This represents the state of the art in how we currently approach RL problems. But compare that to how a human might learn (e.g. a game of Pong). You show them the game and say something along the lines of “You’re in control of a paddle and you can move it up or down, and your goal is to bounce the ball past the other player”, and you’re set and ready to go. Notice some of the differences:

  • Humans communicate the task/goal in a language (e.g. English), but in a standard RL case, you assume an arbitrary reward function that you have to discover through environment interactions. It can be argued that if a human went into a game without knowing anything about the reward function, the human would have a lot of difficulty learning what to do but PGs would be indifferent, and likely work much better.
  • A human brings in a huge amount of prior knowledge, such as elementary physics (concepts of gravity, constant velocity, ...) and intuitive psychology. The human also understands the concept of being “in control” of a paddle, and that it responds to the UP/DOWN key commands. In contrast, algorithms start from scratch, which is simultaneously impressive (because it works) and depressing (because we lack concrete ideas for how not to).
  • PGs are a brute force solution, where the correct actions are eventually discovered and internalised into a policy. Humans build a rich, abstract model and plan within it.
  • PGs have to actually experience a positive reward, and experience it very often in order to eventually shift the policy parameters towards repeating moves that give high rewards. On the other hand, humans can figure out what is likely to give rewards without ever actually experiencing the rewarding or unrewarding transition.

In games/situations with frequent reward signals that require precise play, fast reflexes, and not much planning, PGs can quite easily beat humans. So once we understand the “trick” by which these algorithms work, we can reason through their strengths and weaknesses.

PGs don’t easily scale to settings where huge amounts of exploration are difficult to obtain. Deterministic policy gradients address this: instead of requiring samples from a stochastic policy and encouraging the ones that get higher scores, they use a deterministic policy and get the gradient information directly from a second network (called a critic) that models the score function. This approach can in principle be much more efficient in settings with high-dimensional actions, where sampling actions provides poor coverage, but so far it seems empirically somewhat finicky to get working.

There is also a line of work that tries to make the search process less hopeless by adding additional supervision. In many practical cases, for instance, one can obtain expert trajectories from a human. For example, AlphaGo first uses supervised learning to predict human moves from expert Go games, and the resulting human-mimicking policy is later fine-tuned with PGs on the “real” goal of winning the game.

RL’s new frontiers: MAS, PTL, evolution, memetics and eTL

There is another method called Parallel Transfer Learning (PTL), which aims to optimise RL in multi-agent systems (MAS). MAS are computer systems composed of many interacting and autonomous agents within an environment of interest for problem-solving. MAS have a wide array of applications in industrial and scientific fields, such as resource management and computer games.

In MAS, as agents interact with and learn from one another, the challenge is to identify suitable source tasks from multiple agents that will contain mutually useful information to transfer. In conventional MAS (cMAS), which are best suited to simple environments, the actions of each agent are pre-defined for the possible states of the environment. Normal RL methodologies have been used as the learning process of cMAS agents, through trial-and-error interactions in a dynamic environment.

In PTL, each agent broadcasts its knowledge to all other agents, and decides whose knowledge to accept by comparing the reward received from other agents with the reward it expects itself. Nevertheless, agents in this approach tend to infer incorrect actions in unseen circumstances or complex environments.
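
The acceptance rule described above can be sketched roughly as follows. This is not the PTL implementation: the function name, the shape of the broadcast messages and the simple “higher reported reward wins” comparison are all assumptions made for illustration.

```python
def accept_broadcast(own_expected_reward, broadcasts):
    """Pick the broadcast whose reported reward beats the agent's own prediction.

    broadcasts maps a (hypothetical) sender name to a pair of
    (knowledge, reported_reward). Returns (sender, knowledge), or None when
    the agent keeps its own knowledge.
    """
    best_sender, best_reward = None, own_expected_reward
    for sender, (_knowledge, reported_reward) in broadcasts.items():
        if reported_reward > best_reward:
            best_sender, best_reward = sender, reported_reward
    if best_sender is None:
        return None
    return best_sender, broadcasts[best_sender][0]


broadcasts = {"agent_b": ("go left", 0.4), "agent_c": ("go right", 0.9)}
choice = accept_broadcast(0.6, broadcasts)     # agent_c's knowledge is accepted
kept_own = accept_broadcast(0.95, broadcasts)  # nobody beats the agent's own estimate
```

The weakness mentioned in the text shows up here too: the decision is only as good as the agent’s own reward prediction, which is least reliable in circumstances it has not seen before.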

However, for more complex or changing environments, it is necessary to endow the agents with intelligence capable of adapting to an environment’s dynamics. A complex environment, almost by definition, implies complex interactions and requires the MAS to learn, demands that current RL methodologies are hard-pressed to meet. A more recent machine learning paradigm, Transfer Learning (TL), was introduced as an approach to leveraging valuable knowledge from related and well-studied problem domains to enhance the problem-solving abilities of MAS in complex environments. Since then, TL has been successfully used for enhancing RL tasks via methodologies such as instance transfer, action-value transfer, feature transfer and advice exchanging (AE).

Most RL systems aim to train a single agent or cMAS. The Evolutionary Transfer Learning framework (eTL) aims to develop intelligent and social agents capable of adapting to the dynamic environment of MAS and of solving problems more efficiently. It is inspired by Darwin’s theory of evolution (natural selection + random variation), whose principles govern the evolutionary knowledge transfer process. eTL constructs social selection mechanisms that are modelled after the principles of human evolution. It mimics natural learning, including the errors introduced by the physiological limits of the agents’ ability to perceive differences, thus generating “growth” and “variation” in the agents’ knowledge and giving them higher adaptability for complex problem solving. The essential backbone of eTL is the memetic automaton, which includes evolutionary mechanisms such as meme representation, meme expression, etc.

Memetics

The term “meme” can be traced back to Dawkins’ “The Selfish Gene”, where he defined it as “a unit of information residing in the brain and is the replicator in human cultural evolution.” For the past few decades, the meme-inspired science of memetics has attracted increasing attention in fields including anthropology, biology, psychology, sociology and computer science. In particular, one of the simplest and most direct applications to problem solving in computer science is the memetic algorithm. Further research into meme-inspired computational models resulted in the concept of the memetic automaton, which integrates memes into units of domain information useful for problem-solving. Recently, memes have been defined as transformation matrices that can be reused across different problem domains for enhanced evolutionary search. Just as genes serve as “instructions for building proteins”, memes carry “behavioural instructions,” constructing models for problem solving.

Memetics in eTL

Meme representation and meme evolution form the two core aspects of eTL. A meme then undergoes meme expression and meme assimilation. Meme representation is related to what a meme is, while meme expression is how an agent expresses its stored memes as behavioural actions, and meme assimilation captures new memes by translating the corresponding behaviours into knowledge that blends into the agent’s mind-universe. The meme evolution processes (i.e. meme internal and meme external evolution) comprise the main behavioural learning aspects of eTL. To be specific, meme internal evolution denotes the process by which agents update their mind-universe via self-learning or personal grooming. In eTL, all agents undergo meme internal evolution by exploring the common environment simultaneously. During meme internal evolution, meme external evolution might happen to model the social interaction among agents, mainly via imitation, which takes place when memes are transmitted. Meme external evolution happens whenever the current agent identifies a suitable teacher agent via a meme selection process. Once the teacher agent is selected, meme transmission occurs to instruct how the agent imitates others. During this process, meme variation facilitates knowledge transfer among agents. Upon receiving feedback from the environment after performing an action, the agent then proceeds to update its mind-universe accordingly.

eTL implementation with learning agents

There are two implementations of learning agents that take the form of neurally-inspired learning structures, namely FALCON and a BP multi-layer neural network. Specifically, FALCON is a natural extension of self-organizing neural models proposed for real-time RL, while BP is a classical multi-layer network that has been widely used in various learning systems.

The reported comparisons between these approaches can be summarised as follows:
  1. MAS with TL vs. MAS without TL: Most TL approaches outperform cMAS. This is because TL lets agents benefit from the knowledge transferred from better-performing agents, which accelerates their learning and helps them solve the complex task more efficiently and effectively.
  2. eTL vs. PTL and other TL approaches: FALCON and BP agents with eTL outperform PTL and other TL approaches because, when deciding whether to accept information broadcast by the others, agents in PTL tend to make incorrect predictions in previously unseen circumstances. Further, eTL also attains higher success rates than all AE models thanks to its meme selection operator, which fuses the “imitate-from-elitist” and “like-attracts-like” principles so as to give agents the option of choosing more reliable teacher agents than the AE model allows.

Conclusions

While the popularisation of RL is traced back to Edward Thorndike and Marvin Minsky, it has been inspired by nature and present with us humans since ages long gone. This is how we effectively teach children, and how we now want to teach our computer systems, real (neural networks) or simulated (MAS).

RL re-entered public consciousness and rekindled our interest in 2016, when AlphaGo beat Go champion Lee Sedol. Through PGs, DQNs and other currently successful methodologies, RL has already contributed to, and continues to accelerate and optimise, humanoid robotics, autonomous vehicles, hedge funds, and other endeavours, industries and aspects of human life.

However, what is it that optimises or accelerates RL itself? Its new frontiers are PTL, memetics and the holistic eTL methodology inspired by natural evolution and the spreading of memes. This latter evolutionary (and revolutionary!) approach is governed by several meme-inspired evolutionary operators (implemented using FALCON and a BP multi-layer neural network), including meme evolution.

eTL seems to outperform even state-of-the-art MAS TL systems such as PTL.

What future does RL hold? We don’t know. But the amount of research resources, experimentation and imaginative thinking will surely not disappoint us.

How AI systems learn: approaches and concepts

As you know, the goal of AI learning is generalisation, but one major issue is that data alone will never be enough, no matter how much of it is available. AI systems need both data and a way of learning from that data in order to generalise.

So let’s look at how AI systems learn. But before we do that, what are the most prevalent AI approaches?

Neural networks model a brain learning by example―given a set of right answers, a neural network learns the general patterns. Reinforcement Learning models a brain learning by experience―given some set of actions and an eventual reward or punishment, it learns which actions are ‘good’ or ‘bad,’ as relevant in context. Genetic Algorithms model evolution by natural selection―given some set of agents, let the better ones live and the worse ones die.

Usually, genetic algorithms do not allow agents to learn during their lifetimes, while neural networks allow agents to learn only during their lifetimes. Reinforcement learning allows agents to learn during their lifetimes and share knowledge with other agents.

Consider learning a Boolean function of (say) 100 variables from a million examples. There are 2^100 - 10^6 examples whose classes you don’t know. How do you figure out what those classes are? In the absence of further information, there is no way to do this that beats flipping a coin. This observation was first made (in somewhat different form) by David Hume over 200 years ago, but even today many mistakes in ML stem from failing to appreciate it. Every learner must embody some knowledge/assumptions beyond the data it’s given in order to generalise beyond it.
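
The size of that gap is easy to check with exact integer arithmetic. The variable names below are only for illustration.

```python
n_variables = 100
n_seen = 10 ** 6                   # a million training examples
n_possible = 2 ** n_variables      # every possible assignment of 100 Booleans
n_unseen = n_possible - n_seen     # inputs whose class is unknown
fraction_seen = n_seen / n_possible

print(n_possible)      # 1267650600228229401496703205376
print(fraction_seen)   # on the order of 1e-24
```

Even a million examples cover a vanishingly small fraction of the input space, which is why assumptions beyond the data are unavoidable.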

This seems like rather depressing news. How then can we ever hope to learn anything? Luckily, the functions we want to learn in the real world are not drawn uniformly from the set of all mathematically possible functions. In fact, very general assumptions—like similar examples having similar classes, limited dependences, or limited complexity—are often enough to do quite well, and this is a large part of why ML has been so successful to date.

AI systems use induction, deduction, abduction and other methodologies to collect, analyse and learn from data, allowing generalisation to happen.

Like deduction, induction (what learners do) is a knowledge lever: it turns a small amount of input knowledge into a large amount of output knowledge. Induction (despite its limitations) is a more powerful lever than deduction, requiring much less input knowledge to produce useful results, but it still needs more than zero input knowledge to work.

Abduction is sometimes used to identify faults and revise knowledge based on empirical data. For each individual positive example that is not derivable from the current theory, abduction is applied to determine a set of assumptions that would allow it to be proven. These assumptions can then be used to make suggestions for modifying the theory. One potential repair is to learn a new rule for the assumed proposition so that it could be inferred from other known facts about the example. Another potential repair is to remove the assumed proposition from the list of antecedents of the rule in which it appears in the abductive explanation of the example – parsimonious covering theory (PCT). Abductive reasoning is useful in inductively revising existing knowledge bases to improve their accuracy. Inductive learning can be used to acquire accurate abductive theories.

One key concept in AI is the classifier. Generally, AI systems can be divided into two types: classifiers (“if shiny and yellow then gold”) and controllers (“if shiny and yellow then pick up”). Controllers also classify conditions before inferring actions. Classifiers are functions that use pattern matching to determine a closest match. They can be tuned according to examples known as observations or patterns. In supervised learning, each pattern belongs to a certain predefined class. A class can be seen as a decision that has to be made. All the observations combined with their class labels are known as a data set. When a new observation is made, it is classified based on previous experience.
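
One of the simplest classifiers that works by finding a closest match is the nearest-neighbour rule, sketched below. The two features (shininess and yellowness), their values and the class labels are invented for illustration.

```python
def nearest_neighbour(data_set, observation):
    """Return the class label of the stored pattern closest to the observation."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _pattern, label = min(
        data_set, key=lambda pair: squared_distance(pair[0], observation)
    )
    return label


# Hypothetical data set: (shininess, yellowness) patterns with class labels.
data_set = [
    ((0.9, 0.9), "gold"),
    ((0.8, 0.1), "silver"),
    ((0.1, 0.2), "rock"),
]

prediction = nearest_neighbour(data_set, (0.85, 0.8))
```

The new observation is shiny and yellow, so its closest stored pattern is the one labelled “gold”.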

Classifier performance depends greatly on the characteristics of the data to be classified. The most widely used classifiers use kernel methods to be trained (i.e. to learn). There is no single classifier that works best on all given problems (“no free lunch”). Determining an optimal classifier for a given problem is still more an art than a science.

The following formula sums up the process of AI learning.

LEARNING = REPRESENTATION + EVALUATION + OPTIMISATION

Representation. A classifier must be represented in some formal language that the computer can handle. Conversely, choosing a representation for a learner is tantamount to choosing the set of classifiers that it can possibly learn. This set is called the hypothesis space of the learner. If a classifier is not in the hypothesis space, it cannot be learned. A related question is how to represent the input, i.e., what features to use.

Evaluation. An evaluation function is needed to distinguish good classifiers from bad ones. The evaluation function used internally by the algorithm may differ from the external one that we want the classifier to optimise, for ease of optimisation (see below) and due to the issues discussed in the next section.

Optimisation. We need a method to search among the classifiers in the language for the highest-scoring one. The choice of optimisation technique is key to the efficiency of the learner, and also helps determine the classifier produced if the evaluation function has more than one optimum. It is common for new learners to start out using off-the-shelf optimisers.
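
The three components can be seen side by side in a very small learner. In this sketch the representation is a one-dimensional threshold rule, the evaluation function is accuracy, and the optimiser is an exhaustive search over candidate thresholds; the function names and the data are assumptions made for illustration.

```python
def make_threshold_rule(threshold):
    """Representation: the hypothesis space is every rule 'x >= threshold'."""
    return lambda x: x >= threshold


def accuracy(classifier, examples):
    """Evaluation: the fraction of examples the classifier labels correctly."""
    return sum(classifier(x) == y for x, y in examples) / len(examples)


def fit_threshold(examples):
    """Optimisation: try every observed value as a threshold, keep the best."""
    candidates = sorted({x for x, _ in examples})
    return max(candidates, key=lambda t: accuracy(make_threshold_rule(t), examples))


examples = [(1.0, False), (2.0, False), (2.5, False), (3.0, True), (4.0, True)]
best_threshold = fit_threshold(examples)
best_accuracy = accuracy(make_threshold_rule(best_threshold), examples)
```

Any classifier that is not a single threshold, for instance one that is true only inside an interval, lies outside this hypothesis space and cannot be learned no matter how good the optimiser is.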

A key criterion for choosing a representation is which kinds of knowledge are easily expressed in it. For example, if we have knowledge about probabilistic dependencies, graphical models are a good fit. And if we have knowledge about what kinds of preconditions are required by each class, “IF . . . THEN . . .” rules may be the best option. The most useful learners in this regard are those that don’t just have assumptions hard-wired into them, but allow us to state them explicitly, vary them widely, and incorporate them dynamically into the learning.

What if the knowledge and data we have are not sufficient to completely determine the correct classifier? Then we run the risk of just inventing a classifier (or parts of it) that is not grounded in reality, and is simply encoding random quirks in the data. This problem is called overfitting, and is the bugbear of ML. When a learner outputs a classifier that is 100% accurate on the training data but only 50% accurate on real data, when in fact it could have output one that is 75% accurate on both, it has overfit.

One way to understand overfitting is by decomposing generalisation error into bias and variance. Bias is a learner’s tendency to consistently learn the same wrong thing. Variance is the tendency to learn random things irrespective of the real signal. Cross-validation can help to combat overfitting, but it’s no panacea, since if we use it to make too many parameter choices it can itself start to overfit. Besides cross-validation, there are many methods to combat overfitting; the most popular is adding a regularisation term to the evaluation function. Another option is to perform a statistical significance test like chi-square before adding new structure, to decide whether the distribution of the class really is different with and without this structure.
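
A small experiment shows how cross-validation exposes overfitting. The sketch assumes a toy data set whose true rule is “x >= 5” with two flipped labels, and compares a nearest-neighbour learner, which memorises the noise, with the simple rule; all names and numbers are illustrative.

```python
# Hypothetical noisy data set: the true rule is x >= 5, but the labels of
# x = 2 and x = 7 have been flipped.
data_set = [(0, False), (1, False), (2, True), (3, False), (4, False),
            (5, True), (6, True), (7, False), (8, True), (9, True)]


def nearest_neighbour_predict(train, x):
    """Predict the label of the closest training point (a high-variance learner)."""
    nearest = min(train, key=lambda example: abs(example[0] - x))
    return nearest[1]


def training_accuracy(data):
    hits = sum(nearest_neighbour_predict(data, x) == y for x, y in data)
    return hits / len(data)


def leave_one_out_accuracy(data):
    """Cross-validation: predict each point from all the other points."""
    hits = 0
    for i, (x, y) in enumerate(data):
        held_in = data[:i] + data[i + 1:]
        hits += nearest_neighbour_predict(held_in, x) == y
    return hits / len(data)


def simple_rule_accuracy(data, threshold=5):
    """A fixed, low-variance rule that ignores the noise."""
    return sum((x >= threshold) == y for x, y in data) / len(data)


train_acc = training_accuracy(data_set)      # 1.0: the noise is memorised
loo_acc = leave_one_out_accuracy(data_set)   # 0.5 on held-out points
rule_acc = simple_rule_accuracy(data_set)    # 0.8
```

The memorising learner looks perfect on the data it was trained on, yet does far worse on held-out points than the simpler rule does, which is the pattern of overfitting described above.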

 

Limits of deep learning and way ahead

Artificial intelligence has reached peak hype. News outlets report that companies have replaced workers with IBM Watson and algorithms are beating doctors at diagnoses. New AI startups pop up every day – especially in China – and claim to solve all your personal and business problems with machine learning.

Ordinary objects like juicers and wifi routers suddenly advertise themselves as “powered by AI”. Not only can smart standing desks remember your height settings, they can also order you lunch.

Much of the AI hubbub is generated by reporters who have little or only superficial knowledge of the subject matter, and by startups hoping to be acquihired for engineering talent despite not solving any real business problems. No wonder there are so many misconceptions about what AI can and cannot do.

Deep learning will shape the future ahead

Neural networks were invented in the 60s, but recent boosts in big data and computational power made them actually useful. The results are undeniably incredible. Computers can now recognize objects in images and video and transcribe speech to text better than humans can. Google replaced Google Translate’s architecture with neural networks and now machine translation is also closing in on human performance.

The practical applications are mind-blowing. Computers can predict crop yield better than the USDA and indeed diagnose cancer more accurately than expert physicians.

DARPA, the creator of the Internet and many other modern technologies, sees three waves of AI:

  1. Handcrafted knowledge, or expert systems like IBM’s Deep Blue or IBM Watson;
  2. Statistical learning, which includes machine learning and deep learning;
  3. Contextual adaptation, which involves constructing reliable, explanatory models for real world phenomena using sparse data, like humans do.

As part of the current second wave of AI, deep learning algorithms work well because of what the report calls the “manifold hypothesis.” This refers to how different types of high-dimensional natural data tend to clump and be shaped differently when visualised in lower dimensions.

By mathematically manipulating and separating data clumps, deep neural networks can distinguish different data types. While neural networks can achieve nuanced classification and prediction capabilities, they are essentially what has been called “spreadsheets on steroids.”

Deep learning algorithms have deep learning problems

At the recent AI By The Bay conference, Francois Chollet, an AI expert and the inventor of the widely used deep learning library Keras, argued that deep learning is simply more powerful pattern recognition than previous statistical and machine learning methods, and that the most important problems for AI today are abstraction and reasoning. Current supervised perception and reinforcement learning algorithms require lots of training, are terrible at planning, and are only doing straightforward pattern recognition.

By contrast, humans “learn from very few examples, can do very long-term planning, and are capable of forming abstract models of a situation and manipulate these models to achieve extreme generalisation.”

Even simple human behaviours are laborious to teach to a deep learning algorithm. Let’s examine the task of not being hit by a car as you walk down the road.

Humans only need to be told once to avoid cars. We’re equipped with the ability to generalise from just a few examples and are capable of imagining (i.e. modelling) the dire consequences of being run over. Without losing life or limb, most of us quickly learn to avoid being run over by motor vehicles.

Let’s now see how this works out if we train a computer. If you go the supervised learning route, you need big data sets of car situations with clearly labeled actions to take, such as “stop” or “move”. Then you’d need to train a neural network to learn the mapping between the situation and the appropriate action. If you go the reinforcement learning route, where you give an algorithm a goal and let it independently determine the ideal actions to take, the computer will “die” many times before learning to avoid cars in different situations.

While neural networks achieve statistically impressive results across large sample sizes, they are “individually unreliable” and often make mistakes humans would never make, such as classifying a toothbrush as a baseball bat.

Your results are only as good as your data

Neural networks fed inaccurate or incomplete data will simply produce the wrong results. The outcomes can be both embarrassing and damaging. In two major PR debacles, Google Images incorrectly classified African Americans as gorillas, while Microsoft’s Tay learned to spew racist, misogynistic hate speech after only hours training on Twitter.

Undesirable biases may even be implicit in our input data. Google’s massive Word2Vec embeddings are built off of 3 million words from Google News.  The data set makes associations such as “father is to doctor as mother is to nurse” which reflect gender bias in our language.

To counter this, researchers use human ratings collected on Mechanical Turk to perform “hard de-biasing”, which undoes such associations. Such tactics are essential since word embeddings not only reflect stereotypes but can also amplify them. If the term “doctor” is more associated with men than women, then an algorithm might prioritise male job applicants over female job applicants for open physician positions.
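
The way such an association arises from vector arithmetic, and how projecting out a gender direction removes it, can be illustrated with toy vectors. Everything in this sketch is an assumption: the two-dimensional embeddings are invented (real Word2Vec vectors have hundreds of dimensions), and the neutralise function shows only a projection step in the spirit of hard de-biasing, not the full published procedure.

```python
import math

# Hypothetical 2-D embeddings: axis 0 loosely encodes gender,
# axis 1 loosely encodes "medical profession".
embeddings = {
    "father": (1.0, 0.0),
    "mother": (-1.0, 0.0),
    "doctor": (0.5, 1.0),
    "nurse": (-0.5, 1.0),
    "engineer": (0.5, -1.0),
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(a, b, c, vectors):
    """Solve 'a is to b as c is to ?' with the vector b - a + c."""
    target = tuple(vb - va + vc
                   for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


def neutralise(vector, direction):
    """Remove the component of vector that lies along a unit-length direction."""
    projection = sum(x * d for x, d in zip(vector, direction))
    return tuple(x - projection * d for x, d in zip(vector, direction))


biased_answer = analogy("father", "doctor", "mother", embeddings)
gender_direction = (1.0, 0.0)  # unit vector along father - mother
debiased_doctor = neutralise(embeddings["doctor"], gender_direction)
debiased_nurse = neutralise(embeddings["nurse"], gender_direction)
```

With these toy vectors the analogy returns “nurse”; after neutralising, “doctor” and “nurse” no longer differ along the gender axis.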

Neural networks can be tricked or exploited

Ian Goodfellow, inventor of GANs, showed that neural networks can be deliberately tricked with adversarial examples. By mathematically manipulating an image in a way that is undetectable to the human eye, sophisticated attackers can trick neural networks into grossly misclassifying objects.
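
A rough sketch of why tiny changes can flip a prediction is given below, using a gradient-sign perturbation on a linear classifier. The model, its weights and the “image” are invented for illustration; attacks on real networks use the gradient of the network’s loss rather than fixed weights.

```python
def sign(value):
    return (value > 0) - (value < 0)


def score(weights, x):
    """Linear classifier: a positive score means class 1, otherwise class 0."""
    return sum(w * xi for w, xi in zip(weights, x))


def gradient_sign_attack(weights, x, epsilon):
    """Shift every feature by epsilon against the gradient of the score.

    For a linear model the gradient of the score with respect to the input
    is simply the weight vector.
    """
    return [xi - epsilon * sign(w) for w, xi in zip(weights, x)]


n_pixels = 1000
weights = [0.01 if i % 2 == 0 else -0.01 for i in range(n_pixels)]
image = [0.5 + 0.001 * sign(w) for w in weights]  # weakly classified as class 1

adversarial = gradient_sign_attack(weights, image, epsilon=0.002)
original_class = score(weights, image) > 0
adversarial_class = score(weights, adversarial) > 0
largest_change = max(abs(a - b) for a, b in zip(adversarial, image))
```

No pixel moves by more than 0.002, yet the many small shifts all push the score in the same direction, so the predicted class flips.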

The dangers such adversarial attacks pose to AI systems are alarming, especially since adversarial images and original images seem identical to us. Self-driving cars could be hijacked with seemingly innocuous signage and secure systems could be compromised by data that initially appears normal.

Potential solutions

How can we overcome the limitations of deep learning and proceed towards general artificial intelligence? Chollet’s initial plan is using “super-human pattern recognition like deep learning to augment explicit search and formal systems”, starting with the field of mathematical proofs. Automated Theorem Provers (ATPs) typically use brute force search and quickly hit combinatorial explosions in practical use. In the DeepMath project, Chollet and his colleagues used deep learning to assist the proof search process, simulating a mathematician’s intuitions about what lemmas might be relevant.

Another approach is to develop more explainable models. In handwriting recognition, neural nets currently need to be trained on many thousands of examples to perform decent classification. Instead of looking at just pixels, generative models can be taught the strokes behind any given character and use this physical construction information to disambiguate between similar numbers, such as a 9 or a 4.

Yann LeCun, AI boss of Facebook, proposes “energy-based models” as a method of overcoming limits in deep learning. Typically, a neural network is trained to produce a single output, such as an image label or sentence translation. LeCun’s energy-based models instead give an entire set of possible outputs, such as the many ways a sentence could be translated, along with scores for each configuration.

Geoffrey Hinton, called the “father of deep learning”, wants to replace neurons in neural networks with “capsules”, which he believes more accurately reflect the cortical structure in the human mind. Evolution must have found an efficient way to adapt features that are early in a sensory pathway so that they are more helpful to features that are several stages later in the pathway. He thinks capsule-based neural network architectures will be more resistant to adversarial attacks.

Perhaps all of these approaches to overcoming the limits of deep learning have value. Perhaps none of them do. Only time and continued investment in AI will tell. But one thing seems increasingly likely: it may be impossible to achieve general intelligence simply by scaling up today’s deep learning techniques.

Top 13 challenges AI is facing in 2017

AI and ML feed on data, and companies that center their business around the technology are developing a penchant for collecting user data, with or without the users’ consent, in order to make their services more targeted and efficient. Implementations of AI/ML are already making it possible to impersonate people by imitating their handwriting, voice and conversation style, an unprecedented power that can come in handy in a number of dark scenarios. However, despite the large amounts of data already collected, early AI pilots have struggled to produce the dramatic results that technology enthusiasts predicted. For example, early efforts by companies developing chatbots for Facebook’s Messenger platform saw 70% failure rates in handling user requests.

One of the main challenges of AI goes beyond data: false positives and biased outcomes. For example, a name-ranking algorithm ended up favoring white-sounding names, and advertising algorithms preferred to show high-paying job ads to male visitors.

Another challenge that caused much controversy in the past year was the “filter bubble” phenomenon that was seen in Facebook and other social media that tailored content to the biases and preferences of users, effectively shutting them out from other viewpoints and realities that were out there.

Additionally, as we give more control and decision-making power to AI algorithms, not only technological but also moral and philosophical considerations become important, for example when a self-driving car has to choose between the life of a passenger and that of a pedestrian.

To sum up, the following are the challenges that AI still faces, despite creating and processing increasing amounts of data and drawing on unprecedented amounts of other resources (number of people working on algorithms, CPUs, storage, better algorithms, etc.):

  1. Unsupervised Learning: Deep neural networks have afforded huge leaps in performance across a variety of image, sound and text problems. Most noticeably, in 2015 the application of RNNs to text problems (NLP, language translation, etc.) exploded. A major bottleneck for these supervised approaches is the acquisition of labeled data. Humans are known to learn about objects and navigation with relatively little labeled “training” data. How is this performed? How can this be efficiently implemented in machines?
  2. Select Induction Vs. Deduction Vs. Abduction Based Approach: Induction is almost always a default choice when it comes to building an AI model for data analysis. However, it – as well as deduction, abduction, transduction – has its limitations which need serious consideration.
  3. Model Building: TensorFlow has opened the door for conversations about building scalable ML platforms. There are plenty of companies working on data-science-in-the-cloud (H2O, Dato, MetaMind, …) but the question remains: what is the best way to build ML pipelines? This includes ETL, data storage and optimisation algorithms.
  4. Smart Search: How can deep learning create better vector spaces and algorithms than Tf-Idf? What are some better alternative candidates?
  5. Optimise Reinforcement Learning: As this approach avoids the problem of obtaining labelled data, the system needs to gather data, learn from it and improve. While AlphaGo used RL to win against the Go champion, RL isn’t without its own issues, both at a lightweight, conceptual level and on more technical aspects.

  6. Build Domain Expertise: How to build and sustain domain knowledge in industries and for problems, which involve reasoning based on a complex body of knowledge like Legal, Financial, etc. and then formulate a process where machines can simulate an expert in the field.
  7. Grow Domain Knowledge: How can AI tackle problems, which involve extending a complex body of knowledge by suggesting new insights to the domain itself – for example new drugs to cure diseases?
  8. Complex Task Analyser and Planner: How can AI tackle complex tasks requiring data analysis, planning and execution? Many logistics and scheduling tasks can be done by current (non-AI) algorithms. A good example is the use of AI techniques in IoT for sparse datasets. AI techniques help in this case because the datasets are large and complex, and contain patterns that human beings cannot detect but machines can find easily.
  9. Better Communication: While the proliferation of smart chatbots and AI-powered communication tools has been a trend for several years, these tools are still far from smart and may at times fail to recognise even simple human language.
  10. Better Perception and Understanding: While Alibaba and Face++ create facial recognition software, visual perception and labelling are still generally problematic. There are a few good examples, such as a Russian face recognition app that is good enough to be considered a potential tool for oppressive regimes seeking to identify and crack down on dissidents. Another algorithm proved to be effective at peeking behind masked images and blurred pictures.
  11. Anticipate Second-Order (and higher) Consequences: AI and deep learning have improved computer vision, for example, to the point that autonomous vehicles (cars and trucks) are viable (Otto, Waymo). But what will their impact be on the economy and society? What’s scary is that, as AI and related technologies advance, we might know less about AI’s data analysis and decision-making processes. Starting in 2012, Google used LSTMs to power the speech recognition system in Android, and in December 2016 Microsoft reported that its system had reached a word error rate of 5.9%, a figure roughly equal to that of human abilities for the first time in history. The goalposts continue to move rapidly: for example, loom.ai is building an avatar that can capture your personality. Preempting what’s to come, the EU is considering requiring, starting in the summer of 2018, that companies be able to give users an explanation for decisions that their automated systems reach.
  12. Evolution of Expert Systems: Expert systems have been around for a long time. Much of the vision of expert systems could be implemented in AI/deep learning algorithms in the near future. The architecture of IBM Watson is an indicative example.
  13. Better Sentiment Analysis: Catching up with, but still far from, lexicon-based models, sentiment analysis is still pretty much a nascent and uncharted space for most AI applications. There are some small steps in this regard though, including OpenAI’s use of the mLSTM methodology to conduct sentiment analysis of text. The main issue is that there are many conceptual and contextual rules (rooted and steeped in the particulars of the culture, society, upbringing, etc. of individuals) that govern sentiment, and there are even more clues (possibly unlimited) that can convey these concepts.

Thoughts/comments?

Reinforcement Learning vs. Evolutionary Strategy: combine, aggregate, multiply

A birds-eye view of main ML algorithms

In statistics, we have descriptive and inferential statistics. ML deals with the same problems and lays claim to any problem where the solution isn’t programmed directly but is learned by the program. ML generally works by numerically minimising something: a cost function or error.

Supervised learning – You have labeled data: a sample of ground truth with features and labels. You estimate a model that predicts the labels using the features. Alternative terminology: predictor variables and target variables. You predict the values of the target using the predictors.

  • Regression. The target variable is numeric. Example: you want to predict the crop yield based on remote sensing data. Recurrent neural networks result in a “regression” since they usually output a number (a sequence or a vector) instead of a class (e.g. sentence generation, curve plotting). Algorithms: linear regression, polynomial regression, generalised linear models.
  • Classification. The target variable is categorical. Example: you want to detect the crop type that was planted using remote sensing data. Or Silicon Valley’s “Not Hot Dog” application. Algorithms: Naïve Bayes, logistic regression, discriminant analysis, decision trees, random forests, support vector machines, neural networks (NN) of many variations: feed-forward NNs, convolutional NNs, recurrent NNs.
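The difference between the two supervised settings can be sketched in a few lines of plain Python. This is only an illustration: the "vegetation index" feature, the yields and the crop labels below are made-up values, and the nearest-centroid classifier is a deliberately simple stand-in for the algorithms listed above.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


def fit_centroids(xs, labels):
    """Nearest-centroid classifier: remember the mean feature value per class."""
    sums = {}
    for x, label in zip(xs, labels):
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + x, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}


def classify(x, centroids):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


# Hypothetical remote sensing feature (a vegetation index) for four fields.
index = [0.2, 0.4, 0.6, 0.8]

# Regression: the target is numeric (crop yield).
slope, intercept = fit_line(index, [1.0, 2.0, 3.0, 4.0])
predicted_yield = slope * 0.5 + intercept

# Classification: the target is categorical (crop type).
centroids = fit_centroids(index, ["fallow", "fallow", "wheat", "wheat"])
predicted_crop = classify(0.75, centroids)
```

The same feature feeds both models; only the type of the target variable, and therefore the kind of prediction, changes.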

Unsupervised learning – You have a sample with unlabeled information. No single variable is the specific target of prediction. You want to learn interesting features of the data:

  • Clustering. Which of these things are similar? Example: group consumers into relevant psychographics. Algorithms: k-means, hierarchical clustering.
  • Anomaly detection. Which of these things are different? Example: credit card fraud detection. Algorithms: k-nearest-neighbor.
  • Dimensionality reduction. How can you summarise the data in a high-dimensional data set using a lower-dimensional dataset which captures as much of the useful information as possible (possibly for further modelling with supervised or unsupervised algorithms)? Example: image compression. Algorithms: principal component analysis (PCA), neural network auto-encoders.
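As a rough illustration of clustering, here is a minimal one-dimensional k-means sketch. The data points and the starting centroids are invented for the example, and a real application would use several features and a library implementation.

```python
def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical unlabeled data: e.g. monthly spend of six consumers.
spend = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
centres, groups = k_means(spend, centroids=[0.0, 5.0])
```

No labels are supplied anywhere: the two groups emerge purely from the similarity of the values.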

Reinforcement Learning (Policy Gradients, DQN, A3C, ...) – You are presented with a game/environment that responds sequentially or continuously to your inputs, and you learn to maximise an objective through trial and error.
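To make "trial and error" concrete, the following is a minimal tabular Q-learning sketch. It is much simpler than the methods named above (no neural network is involved), and the five-state corridor environment, the reward and all hyper-parameters are assumptions chosen for illustration.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the goal
ACTIONS = (0, 1)      # 0 = step left, 1 = step right


def step(state, action):
    """Hypothetical environment: reward 1.0 only when the goal is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)     # explore
            else:
                best = max(q[state])             # exploit, ties broken at random
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward = step(state, action)
            # Move the estimate towards reward + discounted best future value.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = q_learning()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a])
                 for s in range(N_STATES - 1)]
```

The agent is never told that "right" is the correct move; it discovers this from the reward alone, which is the defining feature of the approach.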

Evolutionary Strategy – This approach consists of maintaining a distribution over network weight values, and having a large number of agents act in parallel using parameters sampled from this distribution. Each agent’s performance is returned as a fitness score. With this score, the parameter distribution can be moved toward that of the more successful agents, and away from that of the unsuccessful ones. By repeating this approach millions of times, with hundreds of agents, the weight distribution moves to a space that provides the agents with a good policy for solving the task at hand.
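The loop described above can be sketched as follows. This is a simplified illustration, not a faithful reproduction of any published ES implementation: the "policy" is just two weights, the fitness function and its hidden optimum are invented, and only the mean of the distribution is adapted while its spread (sigma) stays fixed.

```python
import random

TARGET = [3.0, -2.0]   # hypothetical optimum, unknown to the agents


def fitness(weights):
    """Score of one agent: higher is better, with a maximum of 0 at TARGET."""
    return -sum((w - t) ** 2 for w, t in zip(weights, TARGET))


def evolve(generations=200, agents=50, sigma=0.1, learning_rate=0.05, seed=0):
    rng = random.Random(seed)
    mean = [0.0, 0.0]                      # centre of the weight distribution
    for _ in range(generations):
        # Each agent acts with parameters sampled around the current mean.
        noises = [[rng.gauss(0.0, 1.0) for _ in mean] for _ in range(agents)]
        scores = [fitness([m + sigma * n for m, n in zip(mean, noise)])
                  for noise in noises]
        # Standardise the scores: above-average agents attract the mean,
        # below-average agents repel it.
        avg = sum(scores) / agents
        std = (sum((s - avg) ** 2 for s in scores) / agents) ** 0.5 or 1.0
        for i in range(len(mean)):
            pull = sum((s - avg) / std * noise[i]
                       for s, noise in zip(scores, noises))
            mean[i] += learning_rate * pull / agents
    return mean


best = evolve()
```

Note that only the scalar scores are needed to update the distribution; no gradient of the fitness function is ever computed.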

All the complex tasks in ML, from self-driving cars to machine translation, are solved by combining these building blocks into complex stacks.

Pro/cons of RL and ES

One step towards building safe AI systems is to remove the need for humans to write goal functions, since using a simple proxy for a complex goal, or getting the complex goal a bit wrong, can lead to undesirable and even dangerous behaviour.

RL is known to be unstable or even to diverge when a nonlinear function approximator such as a NN is used to represent the action-value (also known as Q) function. This instability has several causes: the correlations present in the sequence of observations, the fact that small updates to Q may significantly change the policy and therefore change the data distribution, and the correlations between the action-values and the target values.

RL’s other challenge is generalisation. In typical deep RL methods, this is achieved by approximating the optimal value function with a low-dimensional representation using a deep network. While this approach works well in many domains, in domains where the optimal value function cannot easily be reduced to a low-dimensional representation, learning can be very slow and unstable.

Whereas RL methods such as A3C need to communicate gradients back and forth between workers and a parameter server, ES only requires fitness scores and high-level parameter distribution information to be communicated. It is this simplicity that allows the technique to scale up in ways current RL methods cannot. In situations with richer feedback signals, however, things don’t go so well for ES.

Contextualising and combining the RL and ES

Appealing to nature for inspiration in AI can sometimes be seen as a problematic approach. Nature, after all, is working under constraints that computer scientists simply don’t have. If we look at intelligent behaviour in mammals, we find that it comes from a complex interplay of two ultimately intertwined processes: inter-life learning and intra-life learning. Roughly speaking, these two processes in nature can be compared to the two approaches to neural network optimisation. ES, for which no gradient information is used to update the organism, is related to inter-life learning. Likewise, the gradient-based methods (RL), for which specific experiences change the agent in specific ways, can be compared to intra-life learning.

The techniques employed in RL are in many ways inspired directly by the psychological literature on operant conditioning that came out of animal psychology. (In fact, Richard Sutton, one of the two founders of RL, received his Bachelor’s degree in Psychology.) In operant conditioning, animals learn to associate rewarding or punishing outcomes with specific behaviour patterns. Animal trainers and researchers can manipulate this reward association in order to get animals to demonstrate their intelligence or behave in certain ways.

The central role of prediction in intra-life learning changes the dynamics quite a bit. What was before a somewhat sparse signal (occasional reward) becomes an extremely dense signal. At each moment, mammalian brains are predicting the results of the complex flux of sensory stimuli and actions in which the animal is immersed. The outcome of the animal’s behaviour then provides a dense signal to guide the change in predictions and behaviour going forward. All of these signals are put to use in the brain in order to improve predictions (and consequently the quality of actions). If we apply this way of thinking to learning in artificial agents, we find that RL isn’t somehow fundamentally flawed; rather, the signal being used isn’t nearly as rich as it could (or should) be. In cases where the signal can’t be made richer (perhaps because it is inherently sparse, or has to do with low-level reactivity), learning through a highly parallelisable method such as ES is likely to be the better choice.

Combining many

It is clear that for many reactive policies, or situations with extremely sparse rewards, ES is a strong candidate, especially if you have access to the computational resources that allow for massively parallel training. On the other hand, gradient-based methods using RL or supervision are going to be useful when a rich feedback signal is available, and we need to learn quickly with less data.

An extreme example combines more than just ES and RL. Microsoft’s Maluuba is an illustrative case: it used many algorithms to beat the game Ms. Pac-Man. When the agent (Ms. Pac-Man) starts to learn, it moves randomly; it knows nothing about the game board. As it discovers new rewards (the little pellets and fruit Ms. Pac-Man eats), it begins placing little algorithms in those spots, which continuously learn how best to avoid ghosts and get more points based on Ms. Pac-Man’s interactions, according to the Maluuba research paper.

As the 163 potential algorithms are mapped, they continually send which movement they think would generate the highest reward to the agent, which averages the inputs and moves Ms. Pac-Man. Each time the agent dies, all the algorithms process what generated rewards. These helper algorithms were carefully crafted by humans to understand how to learn, however.
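The aggregation step described above can be sketched very roughly as follows. This is not Maluuba’s code: the helper names, the per-action scores and the plain averaging rule are assumptions made only to illustrate the idea of many small learners voting on one move.

```python
ACTIONS = ["up", "down", "left", "right"]


def aggregate(preferences):
    """Average the per-action scores reported by all helpers and
    return the action with the highest average, plus the averages."""
    totals = {action: 0.0 for action in ACTIONS}
    for scores in preferences:
        for action, value in scores.items():
            totals[action] += value
    averages = {action: totals[action] / len(preferences) for action in ACTIONS}
    return max(ACTIONS, key=lambda action: averages[action]), averages


# Hypothetical reports from three helper algorithms for the current position.
pellet_helper = {"right": 1.0}                 # a pellet lies to the right
fruit_helper = {"up": 0.5}                     # a fruit lies further up
ghost_helper = {"right": -2.0, "left": 0.2}    # a ghost approaches from the right

move, averages = aggregate([pellet_helper, fruit_helper, ghost_helper])
```

In this toy case the pellet on the right is outvoted by the ghost, so the aggregated choice is to go up towards the fruit instead.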

Instead of having one algorithm learn one complex problem, the AI distributes learning over many smaller algorithms, each tackling simpler problems, Maluuba says in a video. This research could be applied to other highly complex problems, like financial trading, according to the company.

But it’s worth noting that since more than 100 algorithms are being used to tell Ms. Pac-Man where to move and win the game, this technique is likely to be extremely computationally intensive.

Bayes craze, neural networks and uncertainty

Story, context and hype

Named after its inventor, the 18th-century Presbyterian minister Thomas Bayes, Bayes’ theorem is a method for calculating the validity of beliefs (hypotheses, claims, propositions) based on the best available evidence (observations, data, information). Here’s the most dumbed-down description: Initial/prior belief + new evidence/information = new/improved belief.

P(B|E) = P(B) X P(E|B) / P(E), with P standing for probability, B for belief and E for evidence. P(B) is the probability that B is true, and P(E) is the probability that E is true. P(B|E) means the probability of B if E is true, and P(E|B) is the probability of E if B is true.
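A small worked example may help. The sketch below applies the formula above to a diagnostic test; the numbers (a 1% prior, a 99% detection rate and a 1% false-alarm rate) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(B|E) = P(B) * P(E|B) / P(E)."""
    return prior * likelihood / evidence


# Belief B: "the patient has the disease". Evidence E: "the test is positive".
p_b = 0.01               # P(B): hypothetical prior, 1% of people have the disease
p_e_given_b = 0.99       # P(E|B): the test detects 99% of true cases
p_e_given_not_b = 0.01   # P(E|not B): 1% of healthy people test positive anyway

# P(E): total probability of a positive test, over sick and healthy people.
p_e = p_e_given_b * p_b + p_e_given_not_b * (1 - p_b)

p_b_given_e = posterior(p_b, p_e_given_b, p_e)
```

With these numbers the updated belief is 0.5: even after a positive result from a seemingly accurate test, the disease is only as likely as not, because the prior was so low.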

Recently, Bayes’ theorem has become ubiquitous in modern life and is applied in everything from physics to cancer research, psychology to ML spam algorithms. Physicists have proposed Bayesian interpretations of quantum mechanics and Bayesian defences of string and multiverse theories. Philosophers assert that science as a whole can be viewed as a Bayesian process, and that the Bayesian approach can distinguish science from pseudoscience more precisely than falsification, the method popularised by Karl Popper. Some even claim Bayesian machines might be so intelligent that they make humans “obsolete.”

Bayes going into AI/ML

Neural networks are all the rage in AI/ML. They learn tasks by analysing vast amounts of data and power everything from face recognition at Facebook to translation at Microsoft to search at Google. They’re beginning to help chatbots learn the art of conversation. And they’re part of the movement toward driverless cars and other autonomous machines. But because they can’t make sense of the world without help from such large amounts of carefully labelled data, they aren’t suited to everything. Induction is the prevalent approach for learning methods, and such methods have difficulties dealing with uncertainties, with the probabilities of future occurrences of different types of data/events, and with “confident error” problems.

Additionally, AI researchers have limited insight into why neural networks make particular decisions. They are, in many ways, black boxes. This opacity could cause serious problems: What if a self-driving car runs someone down?

Regular/standard neural networks are bad at calculating uncertainty. However, there is a recent trend of bringing Bayes (and other alternative methodologies) into this game too. AI researchers, including those working on Google’s self-driving cars, have started employing Bayesian software to help machines recognise patterns and make decisions.

Gamalon, an AI startup that went live earlier in 2017, touts a new type of AI that requires only small amounts of training data – its secret sauce is Bayesian Program Synthesis.

Rebellion Research, founded by the grandson of baseball great Hank Greenberg, relies upon a form of ML called Bayesian networks, using a handful of machines to predict market trends and pinpoint particular trades.

There are many more examples.

The dark side of Bayesian inference

The most notable pitfall of the Bayesian approach is the calculation of the prior probability. In many cases, estimating the prior is just guesswork, allowing subjective factors to creep into calculations. Some prior probabilities are unknown or don’t even exist, such as those for multiverses, inflation or God. In this way, Bayes’ theorem can promote pseudoscience and superstition as well as reason.

In 1997, Microsoft launched its animated MS Office assistant Clippit, which was conceived to run on a Bayesian inference system but failed miserably.

In law courts, Bayesian principles may lead to serious miscarriages of justice (see the prosecutor’s fallacy). In a famous example from the UK, Sally Clark was wrongly convicted in 1999 of murdering her two children. Prosecutors had argued that the probability of two babies dying of natural causes (the prior probability that she is innocent of both charges) was so low – one in 73 million – that she must have murdered them. But they failed to take into account that the probability of a mother killing both of her children (the prior probability that she is guilty of both charges) was also incredibly low.

So the relative prior probabilities that she was totally innocent or a double murderer were more similar than initially argued. Clark was later cleared on appeal, with the appeal court judges criticising the use of the statistic in the original trial.

Alternative, complementary approaches

In actual practice, the method of evaluation most scientists/experts use most of the time is a variant of a technique proposed by Ronald Fisher in the early 1900s. In this approach, a hypothesis is considered validated by data only if the data pass a test that would be failed 95% or 99% of the time if the data were generated randomly. The advantage of Fisher’s approach (which is by no means perfect) is that to some degree it sidesteps the problem of estimating priors where no sufficient advance information exists. In the vast majority of scientific papers, Fisher’s statistics (and more sophisticated statistics in that tradition) are used.
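As a small illustration of this style of test, the sketch below computes a one-sided p-value for a coin-flipping experiment: the probability of seeing at least as many heads as observed if the coin were fair, i.e. if the data were generated randomly. The experiment (9 heads in 10 flips) is a made-up example, and real analyses typically rely on a statistics library rather than hand-written code.

```python
from math import comb


def p_value_at_least(heads, flips, p_null=0.5):
    """Probability of observing `heads` or more heads in `flips` flips
    under the null hypothesis that each flip lands heads with p_null."""
    return sum(comb(flips, k) * p_null ** k * (1 - p_null) ** (flips - k)
               for k in range(heads, flips + 1))


# Hypothetical observation: 9 heads out of 10 flips.
p_value = p_value_at_least(9, 10)

significant_at_95 = p_value < 0.05   # passes the 95% test
significant_at_99 = p_value < 0.01   # but not the stricter 99% test
```

Here the p-value is 11/1024, roughly 0.011, so the "coin is biased" hypothesis passes at the 95% level but not at the 99% level. No prior probability for the hypothesis had to be specified at any point.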

As many AI/ML algorithms automate their optimisation and learning processes, more careful consideration of the Gaussian process, including the type of kernel and the treatment of its hyper-parameters, can play a crucial role in obtaining a good optimiser that can achieve expert-level performance.

Dropout (which addresses the overfitting problem) has been in use for several years in deep learning, and it is another technique that enables uncertainty estimates, by approximating those of a Gaussian process. The Gaussian process is a powerful tool in statistics that allows us to model distributions over functions and has been applied in both the supervised and unsupervised domains, for both regression and classification tasks. It offers nice properties such as uncertainty estimates over the function values, robustness to over-fitting, and principled ways of tuning hyper-parameters.

Google’s Project Loon uses Gaussian processes (together with reinforcement learning) for its navigation.

101 and failures of Machine Learning

Nowadays, ‘artificial intelligence’ (AI) and ‘machine learning’ (ML) are cliches that people use to signal awareness about technological trends. Companies tout AI/ML as a panacea for their ills and a competitive advantage over their peers. From flower recognition to an algorithm that won against the Go champion to big financial institutions, including ETFs of the biggest hedge fund in the world, many are already in or moving to the AI/ML era.

However, as with any new technological breakthroughs, discoveries and inventions, the path is laden with misconceptions, failures, political agendas, etc. Let’s start with an overview of the basic methodologies of ML, the foundation of AI.

101 and limitations of AI/ML

The fundamental goal of ML is to generalise beyond specific examples/occurrences of data. ML research focuses on experimental evaluation on actual data for realistic problems. ML’s performance is then evaluated by training a system (algorithm, program) on a set of training examples and measuring its accuracy at predicting novel test (or real-life) examples.
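The evaluation procedure can be sketched as follows: fit a model on one set of examples and measure accuracy on examples it has never seen. The data, the split and the very simple threshold "model" are all assumptions made for illustration.

```python
def train_threshold(examples):
    """Learn a cut-off halfway between the mean feature value of each class."""
    positives = [x for x, label in examples if label]
    negatives = [x for x, label in examples if not label]
    return (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2


def accuracy(threshold, examples):
    """Fraction of examples for which 'x > threshold' matches the label."""
    correct = sum((x > threshold) == label for x, label in examples)
    return correct / len(examples)


# Hypothetical labeled data: (feature value, label).
data = [(1.0, False), (2.0, False), (3.0, False),
        (7.0, True), (8.0, True), (9.0, True),
        (2.5, False), (4.0, True)]

train, test = data[:6], data[6:]      # the last two examples are held out
threshold = train_threshold(train)

train_accuracy = accuracy(threshold, train)
test_accuracy = accuracy(threshold, test)
```

The model is perfect on the examples it was trained on but gets only one of the two held-out examples right, which is why accuracy on novel examples, not on training examples, is the number that matters.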

Most frequently used methods in ML are induction and deduction. Deduction goes from the general to the particular, and induction goes from the particular to the general. Deduction is to induction what probability is to statistics.

Let’s start with induction. The domino effect is perhaps the most famous instance of induction. Inductive reasoning consists in constructing the axioms (hypotheses, theories) from the observation of supposed consequences of these axioms. Induction alone is not that useful: the induction of a model (general knowledge) is interesting only if you can use it, i.e. if you can apply it to new situations, by going somehow from the general to the particular. This is what scientists do: observing natural phenomena, they postulate the laws of Nature. However, there is a problem with induction. It’s impossible to prove that an inductive statement is correct. At most, one can empirically observe that the deductions that can be made from this statement are not in contradiction with experiments. But one can never be sure that no future observation will contradict the statement. Black Swan theory is the most famous illustration of this problem.

Deductive reasoning consists in combining logical statements (axioms, hypotheses, theorems) according to certain agreed-upon rules in order to obtain new statements. This is how mathematicians prove theorems from axioms. Proving a theorem is nothing but combining a small set of axioms with certain rules. Of course, this does not mean proving a theorem is a simple task, but it could theoretically be automated.

A problem with deduction is exemplified by Gödel’s theorem, which states that for a rich enough set of axioms, one can produce statements that can be neither proved nor disproved.

Two other kinds of reasoning exist, abduction and analogy, and neither is frequently used in AI/ML, which may explain many of the current AI/ML failures/problems.

Like deduction, abduction relies on knowledge expressed through general rules. Like deduction, it goes from the general to the particular, but it does so in an unusual manner since it infers causes from consequences. So, from “A implies B” and “B”, A can be inferred. For example, most of a doctor’s work is inferring diseases from symptoms, which is what abduction is about. “I know the general rule which states that flu implies fever. I’m observing fever, so there must be flu.” However, abduction is not able to build new general rules: induction must have been involved at some point to state that “flu implies fever”.
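The contrast between deduction and abduction can be made concrete with a toy rule base. The rules below are invented for illustration (they are not medical knowledge), and the two functions are a minimal sketch rather than a real inference engine.

```python
# Hypothetical general rules of the form "cause implies effects".
RULES = {
    "flu": {"fever", "cough"},
    "malaria": {"fever"},
    "allergy": {"sneezing"},
}


def deduce(cause):
    """Deduction: from 'A implies B' and 'A', conclude B."""
    return RULES.get(cause, set())


def abduce(observation):
    """Abduction: from 'A implies B' and 'B', propose every A that would explain B."""
    return sorted(cause for cause, effects in RULES.items() if observation in effects)


symptoms_of_flu = deduce("flu")
possible_causes_of_fever = abduce("fever")
```

Deduction returns consequences that must hold, whereas abduction returns candidate explanations: here a fever is compatible with both flu and malaria, so "there must be flu" is a plausible guess rather than a certainty. Neither function can add a new entry to the rule base, which is the role of induction.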

Lastly, analogy goes from the particular to the particular. The most basic form of analogy is based on the assumption that similar situations have similar properties. More complex analogy-based learning schemes, involving several situations and recombinations, can also be considered. Many lawyers use analogical reasoning to analyse new problems based on previous cases. Analogy completely bypasses the model construction: instead of going from the particular to the general, and then from the general to the particular, it goes directly from the particular to the particular.
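The most basic form of analogy corresponds closely to nearest-neighbour prediction: find the most similar past case and reuse its outcome. The sketch below assumes cases can be described by a pair of numeric features; the cases and outcomes are hypothetical.

```python
def most_similar_case(new_case, past_cases):
    """Return the (features, outcome) pair whose features are closest
    to new_case, using squared Euclidean distance."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(past_cases, key=lambda case: distance(case[0], new_case))


# Hypothetical precedents: (features of the case, outcome).
precedents = [
    ((1.0, 1.0), "liable"),
    ((8.0, 9.0), "not liable"),
]

features, predicted_outcome = most_similar_case((2.0, 1.0), precedents)
```

No general model is ever built: the prediction goes straight from one particular case to another.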

Let’s next look at some of the conspicuous failures in AI/ML (in 2016) and the corresponding AI/ML methodology that, in my view, was responsible for each failure:

Microsoft’s chatbot Tay utters racist, sexist, homophobic slurs (mimicking/analogising failure)

In an attempt to form relationships with younger customers, Microsoft launched an AI-powered chatbot called “Tay.ai” on Twitter in 2016. “Tay,” modelled around a teenage girl, morphed into a “Hitler-loving, feminist-bashing troll” within just a day of her debut online. Microsoft yanked Tay off the social media platform and announced it planned to make “adjustments” to its algorithm.

AI-judged beauty contest was racist (deduction failure)

In “The First International Beauty Contest Judged by Artificial Intelligence,” a robot panel judged faces, based on “algorithms that can accurately evaluate the criteria linked to perception of human beauty and health.” But by failing to supply the AI/ML with a diverse training set, the contest winners were all white.

Chinese facial recognition study predicted convicts but shows bias (induction/abduction failure)

Researchers in China published a study entitled “Automated Inference on Criminality using Face Images.” They “fed the faces of 1,856 people (half of which were convicted violent criminals) into a computer and set about analysing them.” The researchers concluded that there were some discriminating structural features for predicting criminality, such as lip curvature, eye inner corner distance, and the so-called nose-mouth angle. Many in the field questioned the results and the report’s ethical underpinnings.

Concluding remarks

The above examples must not discourage companies from incorporating AI/ML into their processes and products. Most AI/ML failures seem to stem from a band-aid, superficial way of embracing AI/ML. A better and more sustainable approach to incorporating AI/ML would be to initiate a mix of projects generating both quick wins and long-term transformational products/services/processes. For quick wins, a company might focus on changing internal employee touchpoints, using recent advances in speech, vision, and language understanding, etc.

For long-term projects, a company might go beyond local/point optimisation, to rethinking business lines, products/services, end-to-end processes, which is the area in which companies are likely to see the greatest impact. Take Google. Google’s initial focus was on incorporating ML into a few of their products (spam detection in Gmail, Google Translate, etc), but now the company is using machine learning to replace entire sets of systems. Further, to increase organisational learning, the company is dispersing ML experts across product groups and training thousands of software engineers, across all Google products, in basic machine learning.