LEARNING = REPRESENTATION + EVALUATION + OPTIMISATION
Representation. A classifier must be represented in some formal language that the computer can handle. Conversely, choosing a representation for a learner is tantamount to choosing the set of classifiers that it can possibly learn. This set is called the hypothesis space of the learner. If a classifier is not in the hypothesis space, it cannot be learned. A related question is how to represent the input, i.e., what features to use.
Evaluation. An evaluation function is needed to distinguish good classifiers from bad ones. The evaluation function used internally by the algorithm may differ from the external one that we want the classifier to optimise, for ease of optimisation (see below) and due to the issues discussed in the next section.
Optimisation. We need a method to search among the classifiers in the language for the highest-scoring one. The choice of optimisation technique is key to the efficiency of the learner, and also helps determine the classifier produced if the evaluation function has more than one optimum. It is common for new learners to start out using off-the-shelf optimisers.
Key criteria for choosing a representation is which kinds of knowledge are easily expressed in it. For example, if we have knowledge about probabilistic dependencies, graphical models are a good fit. And if we have knowledge about what kinds of preconditions are required by each class, “IF . . . THEN . . .” rules may be the the best option. The most useful learners in this regard are those that don’t just have assumptions hard-wired into them, but allow us to state them explicitly, vary them widely, and incorporate them dynamically into the learning.
What if the knowledge and data we have are not sufficient to completely determine the correct classifier? Then we run the risk of just inventing a classifier (or parts of it) that is not grounded in reality, and is simply encoding random quirks in the data. This problem is called overfitting, and is the bugbear of ML. When a learner outputs a classifier that is 100% accurate on the training data but only 50% accurate on real data, when in fact it could have output one that is 75% accurate on both, it has overfit.
One way to understand overfitting is by decomposing generalisation error into bias and variance. Bias is a learner’s tendency to consistently learn the same wrong thing. Variance is the tendency to learn random things irrespective of the real signal. Cross-validation can help to combat overfitting, but it’s no panacea, since if we use it to make too many parameter choices it can itself start to overfit. Besides cross-validation, there are many methods to combat overfitting, the most popular one is adding a regularisation term to the evaluation function. Another option is to perform a statistical significance test like chi-square before adding new structure, to decide whether the distribution of the class really is different with and without this structure.
Sources and relevant articles:
As you know, goal of AI learning is generalisation, but one major issue is that data alone will never be enough, no matter how much of it is available. AI systems need both data and they need to learn based on data in order to generalise.
So let’s look at how AI systems learn. But before we do that, what are the few different and prevalent AI approaches?
Neural networks model a brain learning by example―given a set of right answers, a neural network learns the general patterns. Reinforcement Learning models a brain learning by experience―given some set of actions and an eventual reward or punishment, it learns which actions are ‘good’ or ‘bad,’ as relevant in context. Genetic Algorithms model evolution by natural selection―given some set of agents, let the better ones live and the worse ones die.
Usually, genetic algorithms do not allow agents to learn during their lifetimes, while neural networks allow agents to learn only during their lifetimes. Reinforcement learning allows agents to learn during their lifetimes and share knowledge with other agents.
Consider learning a Boolean function of (say) 100 variables from a million examples. There are 2100 ^ 100 examples whose classes you don’t know. How do you figure out what those classes are? In the absence of further information, there is no way to do this that beats flipping a coin. This observation was first made (in somewhat different form) by David Hume over 200 years ago, but even today many mistakes in ML stem from failing to appreciate it. Every learner must embody some knowledge/assumptions beyond the data it’s given in order to generalise beyond it.
This seems like rather depressing news. How then can we ever hope to learn anything? Luckily, the functions we want to learn in the real world are not drawn uniformly from the set of all mathematically possible functions. In fact, very general assumptions—like similar examples having similar classes, limited dependences, or limited complexity—are often enough to do quite well, and this is a large part of why ML has been so successful to date.
AI systems use induction, deduction, abduction and other methodologies to collect, analyse and learn from data, allowing generalisation to happen.
Like deduction, induction (what learners do) is a knowledge lever: it turns a small amount of input knowledge into a large amount of output knowledge. Induction (despite its limitations) is a more powerful lever than deduction, requiring much less input knowledge to produce useful results, but it still needs more than zero input knowledge to work.
Abduction is sometimes used to identify faults and revise knowledge based on empirical data. For each individual positive example that is not derivable from the current theory, abduction is applied to determine a set of assumptions that would allow it to be proven. These assumptions can then be used to make suggestions for modifying the theory. One potential repair is to learn a new rule for the assumed proposition so that it could be inferred from other known facts about the example. Another potential repair is to remove the assumed proposition from the list of antecedents of the rule in which it appears in the abductive explanation of the example – parsimonious covering theory (PCT). Abductive reasoning is useful in inductively revising existing knowledge bases to improve their accuracy. Inductive learning can be used to acquire accurate abductive theories.
One key concept in AI is classifier. Generally, AI systems can be divided into two types: classifiers (“if shiny and yellow then gold”) and controllers (“if shiny and yellow then pick up”). Controllers also include classify-ing conditions before inferring actions. Classifiers are functions that use pattern matching to determine a closest match. They can be tuned according to examples known as observations or patterns. In supervised learning, each pattern belongs to a certain predefined class. A class can be seen as a decision that has to be made. All the observations combined with their class labels are known as data set. When a new observation is made, it is classified based on previous experience.
Classifier performance depends greatly on the characteristics of the data to be classified. The most widely used classifiers use kernel methods to be trained (i.e. to learn). There is no single classifier that works best on all given problems – “no free lunch“. Determining an optimal classifier for a given problem is still more an art than science.
The following formula sums up the process of AI learning.