Brief overview: neural networks, architectures, frameworks

Deep learning is a new name for an approach to AI called neural networks, which have been going in and out of fashion for more than 70 years. Neural networks were first proposed in 1944 by Warren McCullough and Walter Pitts, two researchers who moved to MIT in 1952 as founding members of what’s sometimes called the first cognitive science department.

Neural networks were a major area of research in both neuroscience and computer science until 1969, when, according to computer science lore, they were killed off by the MIT mathematicians Marvin Minsky and Seymour Papert, who became co-directors of the new MIT Artificial Intelligence Laboratory in 1970.

Neural networks are a means of doing machine learning, in which a computer learns to perform specific tasks by analysing training examples. Usually, these examples have been hand-labeled in advance. An object recognition system, for instance, might be fed thousands of labeled images of cars, houses, coffee cups, and so on, and it would find visual patterns in the images that consistently correlate with particular labels.

Modelled loosely on the human brain, a neural net consists of thousands or even millions of simple processing nodes that are densely interconnected. Most of today’s neural nets are organised into layers of nodes, and they’re “feed-forward,” meaning that data moves through them in only one direction. An individual node might be connected to several nodes in the layer beneath it, from which it receives data, and several nodes in the layer above it, to which it sends data.

Architecture and main types of neural networks

A typical neural network contains a large number of artificial neurons called units arranged in a series of layers.

Input layer contains units (artificial neurons) which receive input from the outside world on which network will learn, recognise about or otherwise process.
Output layer contains units that respond to the information about how it learned a task.
Hidden layers are situated between input and output layers. Their task is to transform the input into something that output unit can use in some way.

Perceptron has two input units and one output unit with no hidden layers, and is also called single layer perceptron.
Radial Basis Function Network are similar to the feed-forward neural network except radial basis function is used as activation function of these neurons.
Multilayer Perceptron networks use more than one hidden layer of neurons. These are also known as deep feed-forward neural networks.
Recurrent Neural Network’s (RNN) hidden layer neurons have self-connections and thus possess memory. LSTM is a type of RNN.
Hopfield Network is a fully interconnected network of neurons in which each neuron is connected to every other neuron. The network is trained with input pattern by setting a value of neurons to the desired pattern. Then its weights are computed. The weights are not changed. Once trained for one or more patterns, the network will converge to the learned patterns.
Boltzmann Machine Network are similar to Hopfield network except some neurons are for input, while others are hidden. The weights are initialized randomly and learn through back-propagation algorithm.
Convolutional Neural Network (CNN) derives its name from the “convolution” operator. The primary purpose of Convolution in case is to extract features from an input image/video. Convolution preserves the spatial relationship between pixels by learning about image/video features using small squares of input data.

Of these, let’s have a very brief review of CNNs and RNNs, as these are the most commonly used.

CNN

CNNs are ideal for image and video processing.
CNN takes a fixed size input and generate fixed-size outputs.
Use CNNs to break a component (image/video) into subcomponents (lines, curves, etc.).
CNN is a type of feed-forward artificial neural network – variation of multilayer perceptrons, which are designed to use minimal amounts of preprocessing.
CNNs use connectivity pattern between its neurons as inspired by the organization of the animal visual cortex, whose neurons are arranged in such a way that they respond to overlapping regions tiling the visual field.
CNN looks for the same patterns on all the different subfields of the image/video.

RNN

RNNs are ideal for text and speech analysis.
RNN can handle arbitrary input/output lengths.
Use RNNs to create combinations of subcomponents (image captioning, text generation, language translation, etc.)
RNN, unlike feedforward neural networks, can use its internal memory to process arbitrary sequences of inputs.
RNNs use time-series information, i.e. what is last done will impact what done next.
RNN, in the simplest case, feed hidden layers from the previous step as an additional input into the next step and while it builds up memory in this process, it is not looking for the same patterns.

A type of RNN are LSTM and GRU. The key difference between GRU and LSTM is that a GRU has two gates (reset and update) whereas an LSTM has three gates (input, output and forget). GRU is similar to LSTM in that both utilise gating information to solve vanishing gradient problem. GRU’s performance is on par with LSTM, but computationally more efficient.

GRUs train faster and perform better than LSTMs on less training data if used for language modelling.
GRUs are simpler and easier to modify, for example adding new gates in case of additional input to the network.
In theory, LSTMs remember longer sequences than GRUs and outperform them in tasks requiring modelling long-distance relations.
GRUs expose complete memory, unlike LSTM
It’s recommended to train both GRU and LSTM and see which is better.

Deep learning frameworks

There are several frameworks that provide advanced AI/ML capabilities. How do you determine which framework is best for you?

The below figure summarises most of the popular open source deep network repositories. The ranking is based on the number of stars awarded by developers in GitHub (as of May 2017).

deep learning frameworks ranked via GitHub

Google’s TensorFlow is a library developed at Google Brain. TensorFlow supports a broad set of capabilities such as image, handwriting and speech recognition, forecasting and natural language processing (NLP). Its programming interfaces includes Python and C++ and alpha releases of Java, GO, R, and Haskell API will soon be supported.

Caffe is the brainchild of Yangqing Jia who leads engineering for Facebook AI. Caffe is the first mainstream industry-grade deep learning toolkit, started in late 2013. Due to its excellent convolutional model, it is one of the most popular toolkits within the computer vision community. Speed makes Caffe perfect for research experiments and commercial deployment. However, it does not support fine granularity network layers like those found in TensorFlow and Theano. Caffe can process over 60M images per day with a single Nvidia K40 GPU. It’s cross-platform and supports C++, Matlab and Python programming interfaces and has a large user community that contributes to their own repository known as “Model Zoo.” AlexNet and GoogleNet are two popular user-made networks available to the community.

Caffe 2 was unveiled in April 2017 and is focused on being modular and excelling at mobile and at large scale deployments. Like TensorFlow, Caffe 2 will support ARM architecture using the C++ Eigen library and continue offering strong support for vision-related problems, also adding in RNN and LSTM networks for NLP, handwriting recognition, and time series forecasting.

MXNet is a fully featured, programmable and scalable deep learning framework, which offers the ability to both mix programming models (imperative and declarative) and code in Python, C++, R, Scala, Julia, Matlab and JavaScript. MXNet supports CNN and RNN, including LTSM networks and provides excellent capabilities for imaging, handwriting and speech recognition, forecasting and NLP. It’s considered the world’s best image classifier, and supports GAN simulations. This model is used in Nash equilibrium to perform experimental economics methods. Amazon supports MXNet, planning to use it in existing and upcoming services whereas Apple is rumorred to be also using it.

Theano architecture lacks the elegance of TensorFlow, but provides capabilities like symbolic API supports looping control, so-called scan, which makes implementing RNNs easy and efficient. Theano supports many types of convolutions for hand writing and image classification including medical images. Theano uses 3D convolution/pooling for video classification. It can process natural language processing tasks, including language understanding, translation, and generation. Theano supports GAN.

How AI systems learn: approaches and concepts

As you know, goal of AI learning is generalisation, but one major issue is that data alone will never be enough, no matter how much of it is available. AI systems need both data and they need to learn based on data in order to generalise.

So let’s look at how AI systems learn. But before we do that, what are the few different and prevalent AI approaches?

Neural networks model a brain learning by example―given a set of right answers, a neural network learns the general patterns. Reinforcement Learning models a brain learning by experience―given some set of actions and an eventual reward or punishment, it learns which actions are ‘good’ or ‘bad,’ as relevant in context. Genetic Algorithms model evolution by natural selection―given some set of agents, let the better ones live and the worse ones die.

Usually, genetic algorithms do not allow agents to learn during their lifetimes, while neural networks allow agents to learn only during their lifetimes. Reinforcement learning allows agents to learn during their lifetimes and share knowledge with other agents.

Consider learning a Boolean function of (say) 100 variables from a million examples. There are 2100 ^ 100 examples whose classes you don’t know. How do you figure out what those classes are? In the absence of further information, there is no way to do this that beats flipping a coin. This observation was first made (in somewhat different form) by David Hume over 200 years ago, but even today many mistakes in ML stem from failing to appreciate it. Every learner must embody some knowledge/assumptions beyond the data it’s given in order to generalise beyond it.

This seems like rather depressing news. How then can we ever hope to learn anything? Luckily, the functions we want to learn in the real world are not drawn uniformly from the set of all mathematically possible functions. In fact, very general assumptions—like similar examples having similar classes, limited dependences, or limited complexity—are often enough to do quite well, and this is a large part of why ML has been so successful to date.

AI systems use induction, deduction, abduction and other methodologies to collect, analyse and learn from data, allowing generalisation to happen.

Like deduction, induction (what learners do) is a knowledge lever: it turns a small amount of input knowledge into a large amount of output knowledge. Induction (despite its limitations) is a more powerful lever than deduction, requiring much less input knowledge to produce useful results, but it still needs more than zero input knowledge to work.

Abduction is sometimes used to identify faults and revise knowledge based on empirical data. For each individual positive example that is not derivable from the current theory, abduction is applied to determine a set of assumptions that would allow it to be proven. These assumptions can then be used to make suggestions for modifying the theory. One potential repair is to learn a new rule for the assumed proposition so that it could be inferred from other known facts about the example. Another potential repair is to remove the assumed proposition from the list of antecedents of the rule in which it appears in the abductive explanation of the example – parsimonious covering theory (PCT). Abductive reasoning is useful in inductively revising existing knowledge bases to improve their accuracy. Inductive learning can be used to acquire accurate abductive theories.

One key concept in AI is classifier. Generally, AI systems can be divided into two types: classifiers (“if shiny and yellow then gold”) and controllers (“if shiny and yellow then pick up”). Controllers also include classify-ing conditions before inferring actions. Classifiers are functions that use pattern matching to determine a closest match. They can be tuned according to examples known as observations or patterns. In supervised learning, each pattern belongs to a certain predefined class. A class can be seen as a decision that has to be made. All the observations combined with their class labels are known as data set. When a new observation is made, it is classified based on previous experience.

Classifier performance depends greatly on the characteristics of the data to be classified. The most widely used classifiers use kernel methods to be trained (i.e. to learn). There is no single classifier that works best on all given problems – “no free lunch“. Determining an optimal classifier for a given problem is still more an art than science.

The following formula sums up the process of AI learning.

LEARNING = REPRESENTATION + EVALUATION + OPTIMISATION

Representation. A classifier must be represented in some formal language that the computer can handle. Conversely, choosing a representation for a learner is tantamount to choosing the set of classifiers that it can possibly learn. This set is called the hypothesis space of the learner. If a classifier is not in the hypothesis space, it cannot be learned. A related question is how to represent the input, i.e., what features to use.

Evaluation. An evaluation function is needed to distinguish good classifiers from bad ones. The evaluation function used internally by the algorithm may differ from the external one that we want the classifier to optimise, for ease of optimisation (see below) and due to the issues discussed in the next section.

Optimisation. We need a method to search among the classifiers in the language for the highest-scoring one. The choice of optimisation technique is key to the efficiency of the learner, and also helps determine the classifier produced if the evaluation function has more than one optimum. It is common for new learners to start out using off-the-shelf optimisers.

Key criteria for choosing a representation is which kinds of knowledge are easily expressed in it. For example, if we have knowledge about probabilistic dependencies, graphical models are a good fit. And if we have knowledge about what kinds of preconditions are required by each class, “IF . . . THEN . . .” rules may be the the best option. The most useful learners in this regard are those that don’t just have assumptions hard-wired into them, but allow us to state them explicitly, vary them widely, and incorporate them dynamically into the learning.

What if the knowledge and data we have are not sufficient to completely determine the correct classifier? Then we run the risk of just inventing a classifier (or parts of it) that is not grounded in reality, and is simply encoding random quirks in the data. This problem is called overfitting, and is the bugbear of ML. When a learner outputs a classifier that is 100% accurate on the training data but only 50% accurate on real data, when in fact it could have output one that is 75% accurate on both, it has overfit.

One way to understand overfitting is by decomposing generalisation error into bias and variance. Bias is a learner’s tendency to consistently learn the same wrong thing. Variance is the tendency to learn random things irrespective of the real signal. Cross-validation can help to combat overfitting, but it’s no panacea, since if we use it to make too many parameter choices it can itself start to overfit. Besides cross-validation, there are many methods to combat overfitting, the most popular one is adding a regularisation term to the evaluation function. Another option is to perform a statistical significance test like chi-square before adding new structure, to decide whether the distribution of the class really is different with and without this structure.

Sources and relevant articles:

Bayes craze, neural networks and uncertainty

Story, context and hype

Named after its inventor, the 18^th-century Presbyterian minister Thomas Bayes, Bayes’ theorem is a method for calculating the validity of beliefs (hypotheses, claims, propositions) based on the best available evidence (observations, data, information). Here’s the most dumbed-down description: Initial/prior belief + new evidence/information = new/improved belief.

P(B|E) = P(B) X P(E|B) / P(E), with P standing for probability, B for belief and E for evidence. P(B) is the probability that B is true, and P(E) is the probability that E is true. P(B|E) means the probability of B if E is true, and P(E|B) is the probability of E if B is true.

Since recently, Bayesian theorem has become ubiquitous in modern life and is applied in everything from physics to cancer research, psychology to ML spam algorithms. Physicists have proposed Bayesian interpretations of quantum mechanics and Bayesian defences of string and multiverse theories. Philosophers assert that science as a whole can be viewed as a Bayesian process, and that Bayesian approach can distinguish science from pseudoscience more precisely than falsification, the method popularised by Karl Popper. Some even claim Bayesian machines might be so intelligent that they make humans “obsolete.”

Bayes going into AI/ML

Neural networks are all the rage in AI/ML. They learn tasks by analysing vast amounts of data and power everything from face recognition at Facebook to translation at Microsoft to search at Google. They’re beginning to help chatbots learn the art of conversation. And they’re part of the movement toward driverless cars and other autonomous machines. But because they can’t make sense of the world without help from such large amounts of carefully labelled data, they aren’t suited to everything. Induction is prevalent approach for learning methods and they have difficulties dealing with uncertainties, probabilities of future occurrences of different types of data/events and “confident error” problems.

Additionally, AI researchers have limited insight into why neural networks make particular decisions. They are, in many ways, black boxes. This opacity could cause serious problems: What if a self-driving car runs someone down?

Regular/standard neural networks are bad at calculating uncertainty. However, there is a recent trend of bringing in Bayes (and other alternative methodologies) into this game too. Currently, AI researchers, including those working on Google’s self-driving cars, started employing Bayesian software to help machines recognise patterns and make decisions.

Gamalon, an AI startup that went life earlier in 2017, touts a new type of AI that requires only small amounts of training data – its secret sauce is Bayesian Program Synthesis.

Rebellion Research, founded by the grandson of baseball grand Hank Greenberg, relies upon a form of ML called Bayesian networks, using a handful of machines to predict market trends and pinpoint particular trades.

There are many more examples.

The dark side of Bayesian inference

The most notable pitfall of Bayesian approach is the calculation of prior probability. In many cases, estimating the prior is just guesswork, allowing subjective factors to creep into calculations. Some prior probabilities are unknown or don’t even exist such as multiverses, inflation or God. In this way, Bayes’ theorem can promote pseudoscience and superstition as well as reason.

In 1997, Microsoft launched its animated MS Office assistant Clippit, which was conceived to work on Bayesian inference system but failed miserably .

In law courts, Bayesian principles may lead to serious miscarriages of justice (see the prosecutor’s fallacy). In a famous example from the UK, Sally Clark was wrongly convicted in 1999 of murdering her two children. Prosecutors had argued that the probability of two babies dying of natural causes (the prior probability that she is innocent of both charges) was so low – one in 73 million – that she must have murdered them. But they failed to take into account that the probability of a mother killing both of her children (the prior probability that she is guilty of both charges) was also incredibly low.

So the relative prior probabilities that she was totally innocent or a double murderer were more similar than initially argued. Clark was later cleared on appeal with the appeal court judges criticising the use of the statistic in the original trial. Here is another such case.

Alternative, complimentary approaches

In actual practice, the method of evaluation most scientists/experts use most of the time is a variant of a technique proposed by Ronald Fisher in the early 1900s. In this approach, a hypothesis is considered validated by data only if the data pass a test that would be failed 95% or 99% of the time if the data were generated randomly. The advantage of Fisher’s approach (which is by no means perfect) is that to some degree it sidesteps the problem of estimating priors where no sufficient advance information exists. In the vast majority of scientific papers, Fisher’s statistics (and more sophisticated statistics in that tradition) are used.

As many AI/ML algorithms automate their optimisation and learning processes, they can deploy a more careful Gaussian process consideration, including type of kernel and the treatment of its hyper-parameters, can play a crucial role in obtaining a good optimiser that can achieve expert level performance.

Dropout (which addresses overfitting problem), is another technique that has been in use for several years in deep learning, is another technique that enables uncertainty estimates by approximating those of Gaussian process. This is a powerful tool in statistics that allows model distributions over functions and been applied in both the supervised and unsupervised domains, for both regression and classification tasks. It offers nice properties such as uncertainty estimates over the function values, robustness to over-fitting, and principled ways for hyper-parameter tuning.

Google’s Project Loon uses Gaussian process (together with reinforcement learning) for its navigation.