How AI defeated top poker players

Poker is a game of imperfect information. Imperfect-information games model settings where players have private information, such as the hole cards in poker. Huge progress has been made in solving such games over the past 20 years, especially since the Annual Computer Poker Competition was established in 2006. Before 2006, small variants of poker and coarse abstractions of two-player limit Texas Hold’em were solved with general-purpose linear programming solvers operating on the sequence-form representation of the game.

Since 2006, two far more scalable families of equilibrium-finding algorithms and problem representations have been developed for two-player zero-sum games. One family is based on smoothed gradient-descent algorithms and a decomposed problem representation. The other, counterfactual regret minimisation (CFR), is based on a form of self-play using no-regret learning, adapted so that regrets can be updated at each information set separately instead of requiring regret updates over entire game strategies.

The best available guarantees for CFR require on the order of 1/ε² iterations over the game tree to reach an ε-equilibrium, that is, a strategy profile in which no player can be exploited by more than ε by any counter-strategy. The gradient-based algorithms require only on the order of 1/ε or log(1/ε) iterations; the latter matches the optimal number of iterations. On the other hand, more effective sampling techniques have been developed for CFR than for the gradient-based algorithms, so CFR can make do with many quick, approximate iterations.
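To make CFR concrete, here is a minimal sketch of chance-sampled vanilla CFR for Kuhn poker, a three-card toy game commonly used to illustrate the algorithm. It is not the abstracted, heavily engineered CFR used in competition agents, but the two ideas that carry over are visible: regret matching at every information set, and averaging the strategies over iterations.

    import random

    ACTIONS = ["p", "b"]   # p = pass/check/fold, b = bet/call
    N = len(ACTIONS)

    class Node:
        """Cumulative regrets and strategy weights for one information set."""
        def __init__(self):
            self.regret_sum = [0.0] * N
            self.strategy_sum = [0.0] * N

        def strategy(self, reach_weight):
            # Regret matching: play each action in proportion to its positive cumulative regret.
            pos = [max(r, 0.0) for r in self.regret_sum]
            total = sum(pos)
            strat = [p / total if total > 0 else 1.0 / N for p in pos]
            for a in range(N):
                self.strategy_sum[a] += reach_weight * strat[a]
            return strat

        def average_strategy(self):
            total = sum(self.strategy_sum)
            return [s / total if total > 0 else 1.0 / N for s in self.strategy_sum]

    nodes = {}   # information set label -> Node

    def cfr(cards, history, p0, p1):
        """Returns the expected utility of `history` for the player about to act."""
        player, opponent = len(history) % 2, (len(history) + 1) % 2
        if len(history) > 1:                       # terminal payoffs of Kuhn poker
            higher = cards[player] > cards[opponent]
            if history[-1] == "p":
                return (1 if higher else -1) if history == "pp" else 1   # showdown or fold
            if history[-2:] == "bb":
                return 2 if higher else -2                               # called bet, showdown
        info_set = str(cards[player]) + history
        node = nodes.setdefault(info_set, Node())
        strat = node.strategy(p0 if player == 0 else p1)
        util, node_util = [0.0] * N, 0.0
        for a in range(N):
            if player == 0:
                util[a] = -cfr(cards, history + ACTIONS[a], p0 * strat[a], p1)
            else:
                util[a] = -cfr(cards, history + ACTIONS[a], p0, p1 * strat[a])
            node_util += strat[a] * util[a]
        for a in range(N):                         # regret update, weighted by opponent's reach
            node.regret_sum[a] += (p1 if player == 0 else p0) * (util[a] - node_util)
        return node_util

    cards, value, iters = [1, 2, 3], 0.0, 100_000
    for _ in range(iters):
        random.shuffle(cards)                      # sample the chance node (the deal)
        value += cfr(cards, "", 1.0, 1.0)
    print("Approximate game value for player 1:", value / iters)   # about -1/18
    for label in sorted(nodes):
        print(label, [round(p, 2) for p in nodes[label].average_strategy()])

Note that it is the average strategy, not the final iterate, that converges to an ε-equilibrium, which is why strategy_sum is accumulated separately from regret_sum.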

How to solve imperfect-information games

Currently, the main approach for solving imperfect-information games is shown in the figure below. First, the game is abstracted to generate a smaller but strategically similar game, reducing it to a size that can be tackled with an equilibrium-finding algorithm.

Then, the abstract game is solved for an equilibrium or near-equilibrium. A Nash equilibrium captures a notion of rational play: it is a profile of strategies, one per player, such that no player can increase his/her expected payoff by switching to a different strategy. A strategy states, for each information set where it is the player’s turn, the probability with which the player should select each of his/her available actions.

An information set is a collection of game states that the player whose turn it is cannot distinguish, because they differ only in the private information of other players. Finally, the strategies computed for the abstract game are mapped back to the original game.

 

[Figure: the game is abstracted, the abstract game is solved for a (near-)equilibrium, and the resulting strategies are mapped back to the original game. Image: Science Magazine]
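For two-player zero-sum games, the equilibrium-finding step can be posed as a linear program: find a mixed strategy that maximises the payoff the player can guarantee against every opponent response. Here is a small sketch using SciPy on a made-up 3×3 payoff matrix; the matrix and its size are purely illustrative, since real abstracted poker games are vastly larger and are solved with the specialised algorithms described above.

    import numpy as np
    from scipy.optimize import linprog

    # Row player's payoffs in a small zero-sum game (hypothetical numbers).
    # Rows are the row player's pure strategies, columns the column player's.
    A = np.array([
        [ 0, -1,  2],
        [ 1,  0, -1],
        [-2,  1,  0],
    ])
    m, n = A.shape

    # Variables z = (x_1, ..., x_m, v): maximise the guaranteed value v (minimise -v)
    # subject to (A^T x)_j >= v for every opponent column j, sum(x) = 1, x >= 0.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    A_ub = np.hstack([-A.T, np.ones((n, 1))])      # encodes v - (A^T x)_j <= 0
    b_ub = np.zeros(n)
    A_eq = np.append(np.ones(m), 0.0).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    x, v = res.x[:m], res.x[-1]
    print("Row player's equilibrium mix:", np.round(x, 3))
    print("Value of the game:", round(v, 3))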

 

Two main kinds of abstraction are used. One is information abstraction, where it is assumed in the abstract game that a player does not know some information that he/she actually knows. Lossless abstraction algorithms yield an abstract game from which each equilibrium is also an equilibrium in the original game, and typically reduce the size of poker (or other such) games by 1-2 orders of magnitude.

The second method, action abstraction, removes some actions from consideration in the abstract game, and is useful when the number of actions that a player can choose is large.
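A toy illustration of both kinds of abstraction, with made-up hand equities and bet sizes; real systems estimate hand strength from large-scale rollouts and cluster hands with algorithms such as k-means.

    import numpy as np

    # --- Information abstraction: bucket hands by (hypothetical) estimated equity. ---
    # In a real abstraction these equities come from large-scale rollouts; here they
    # are made-up numbers purely for illustration.
    hand_equity = {"KhQh": 0.61, "KsQs": 0.61, "QhJh": 0.55, "7c2d": 0.32, "8c3d": 0.33}

    def equity_bucket(hand, n_buckets=10):
        """Map a hand to one of n_buckets equity bands; hands that land in the same
        band are treated identically in the abstract game."""
        return min(int(hand_equity[hand] * n_buckets), n_buckets - 1)

    # --- Action abstraction: keep only a few bet sizes (as fractions of the pot). ---
    BET_FRACTIONS = np.array([0.5, 1.0, 2.0])      # half pot, pot, two times the pot

    def nearest_abstract_bet(bet, pot):
        """Snap an arbitrary bet size to the closest size kept in the abstraction."""
        sizes = BET_FRACTIONS * pot
        return float(sizes[np.argmin(np.abs(sizes - bet))])

    print(equity_bucket("KhQh"), equity_bucket("KsQs"))   # same bucket, same strategy
    print(nearest_abstract_bet(105, pot=100))             # $105 treated like a pot-sized bet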

Libratus vs. top poker players

AI had previously beaten humans at chess, checkers, Go and Jeopardy!, but it managed to beat top players at poker only in January 2017. Unlike chess or Go, poker is a game of imperfect information and requires a different methodology to tackle it.

In a 20-day competition involving 120,000 hands at Rivers Casino in Pittsburgh during January 2017, Libratus became the first AI to defeat top human players at Heads-up no-limit Texas Hold’em—the primary benchmark and long-standing challenge problem for imperfect-information game-solving by AIs.

Libratus beat a team of four top poker professionals at heads-up no-limit Texas Hold’em, which has 6.38 × 10^161 decision points. It played a separate two-player match against each of the professionals and collectively amassed about $1.8 million in chips. It used the above-mentioned approach of abstracting and simplifying the game, finding an equilibrium of the abstract game, and then mapping the abstract strategies back to the original game while adding detail and improving the overall strategy. Libratus includes three main parts:

  1. An algorithm for computing a blueprint for the overall strategy (an approximate Nash equilibrium) of a smaller, simplified version of the game, using a precomputed decision tree of about 10^13 decision points instead of the 10^161 points in the full game. Libratus thus starts with a simple weighted decision tree from which to select its moves depending on its hole cards and the cards on the board. One example of this simplification is grouping and treating similarly hands such as a King-high flush and a Queen-high flush, or bets of $100 and $105.
  2. An algorithm that, during play, fleshes out the details of the strategy for the subgames that are actually reached; the blueprint is detailed for the early betting rounds but only coarse for the later rounds, so the later rounds are re-solved in finer detail as they arise. Whenever an opponent makes a move that is not in the abstraction, the module computes a solution to that subgame that includes the opponent’s actual move.
  3. A self-improver algorithm that fixes potential weaknesses opponents have identified in Libratus’ strategy. Typically, AIs use machine learning to find mistakes in the opponent’s strategy and exploit them, but that also opens the AI itself to exploitation if the opponent shifts strategy. Instead, Libratus’ self-improver module analyses the opponents’ bet sizes to detect potential holes in Libratus’ own strategy. Libratus then adds these missing decision branches, computes strategies for them, and folds them into the existing strategy.

The coarse strategy computed in step 1 is called the blueprint strategy; the second and third modules refine and patch it during and between playing sessions. A rough sketch of how the three modules fit together follows below.
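The sketch below is heavily simplified and entirely hypothetical: the module names, data structures and numbers are illustrative stand-ins rather than Libratus’ actual code, but they show the control flow described above (use the blueprint for on-tree actions, re-solve for off-tree actions, and log the off-tree bet sizes for the self-improver).

    import random
    from collections import Counter

    blueprint = {                                  # module 1: precomputed coarse strategy
        ("bucket_7", "pot_bet"): {"fold": 0.1, "call": 0.6, "raise": 0.3},
    }
    off_tree_sizes = Counter()                     # module 3: bet sizes the abstraction missed

    def resolve_subgame(state, opponent_bet):
        """Module 2 stand-in: the real system re-solves the remaining subgame,
        including the off-tree bet, to a finer-grained equilibrium."""
        return {"fold": 0.2, "call": 0.5, "raise": 0.3}

    def act(state, opponent_bet, abstract_bets=(50, 100, 200)):
        if opponent_bet in abstract_bets:          # on-tree: play from the blueprint
            strategy = blueprint.get(state, {"fold": 1/3, "call": 1/3, "raise": 1/3})
        else:                                      # off-tree: re-solve and remember the gap
            off_tree_sizes[opponent_bet] += 1
            strategy = resolve_subgame(state, opponent_bet)
        actions, probs = zip(*strategy.items())
        return random.choices(actions, probs)[0]

    print(act(("bucket_7", "pot_bet"), opponent_bet=100))
    print(act(("bucket_7", "pot_bet"), opponent_bet=137))   # off-tree size gets logged
    print(off_tree_sizes)   # the self-improver later adds branches for frequent sizes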

Libratus is computationally expensive. It was powered by the Bridges system, a high-performance computer that peaks at 1.35 petaflops, and it burned through approximately 19 million core-hours of computing over the course of the tournament. In addition to beating the human experts, Libratus also won against the previous AI champion, Baby Tartanian8.

Another AI, DeepStack, also plays heads-up no-limit Texas Hold’em using a similar idea, continual re-solving, but it has not been tested against top professional players.

Most of the same abstraction techniques apply to games with more than two players and to games that are not zero-sum, but for the resulting equilibrium-finding problems no polynomial-time algorithm is known. It is not even clear that finding a Nash equilibrium is the right goal in such games, since different equilibria can have different values to the players.

This AI could be used for calculating strategic decisions in the real world, such as in finance and information security.

Is self-play the future of (most) AI?

Go is a game whose number of possible board configurations, roughly 10^170 (far more than chess), is greater than the number of atoms in the universe.

AlphaGo, the predecessor to AlphaGo Zero, crushed 18-time world champion Lee Sedol and the reigning world number one player, Ke Jie. After beating Jie earlier this year, DeepMind announced AlphaGo was retiring from future competitions.

Now an even stronger player, AlphaGo Zero, was able to beat the version of AlphaGo that faced Lee Sedol after training for just 36 hours, and it beat its predecessor by a 100-0 score after 72 hours. Interestingly, AlphaGo Zero didn’t learn from observing humans playing against each other, unlike AlphaGo; instead, its neural network relies on an old technique in reinforcement learning: self-play. Self-play means agents can learn behaviours that are not hand-coded on any reinforcement learning task, but the sophistication of the learned behaviour is limited by the sophistication of the environment. For an agent to learn intelligent behaviour in a particular environment, the environment has to be challenging, but not too challenging.

Essentially, self-play means that AlphaGo Zero plays against itself. During training, it sits on both sides of the table: two instances of the same software face off against each other. A match starts from a board position produced by a randomised sequence of opening moves. The two computer players are given the list of moves that led to that position, and each is then told to come up with multiple chains of next moves along with estimates of the probability of winning by following each chain. The next move from the most promising chain is then played, and the computer players repeat the above steps, coming up with new chains of moves ranked by strength. This repeats over and over, with the software feeling its way through the game and internalising which strategies turn out to be the strongest.
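The loop itself is simple enough to sketch on a toy game. Below is a tabular self-play learner for tic-tac-toe: one and the same value table plays both sides, and every finished game nudges the values of the visited states toward the final outcome. AlphaGo Zero replaces the table with a deep network and the greedy move choice with a full Monte Carlo tree search, but the overall self-play structure is similar in spirit.

    import random
    from collections import defaultdict

    values = defaultdict(float)    # (board, player who just moved) -> estimated value
    EPSILON, ALPHA = 0.1, 0.05     # exploration rate and learning rate

    WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
                 (0, 3, 6), (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

    def winner(board):
        for i, j, k in WIN_LINES:
            if board[i] != " " and board[i] == board[j] == board[k]:
                return board[i]
        return "draw" if " " not in board else None

    def choose_move(board, player):
        moves = [i for i, c in enumerate(board) if c == " "]
        if random.random() < EPSILON:              # occasionally explore
            return random.choice(moves)
        # Otherwise pick the move whose resulting position the table likes best.
        return max(moves, key=lambda m: values[(board[:m] + player + board[m + 1:], player)])

    def self_play_game():
        board, player, history = " " * 9, "X", []
        while winner(board) is None:
            m = choose_move(board, player)
            board = board[:m] + player + board[m + 1:]
            history.append((board, player))
            player = "O" if player == "X" else "X"
        result = winner(board)
        # Both sides were the same table: push every visited state toward the outcome.
        for state, mover in history:
            target = 0.0 if result == "draw" else (1.0 if result == mover else -1.0)
            values[(state, mover)] += ALPHA * (target - values[(state, mover)])

    for _ in range(20_000):
        self_play_game()
    print("positions evaluated:", len(values))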

AlphaGo Zero did start from scratch, with no experts guiding it. And it is much more efficient: it uses a single computer and four of Google’s custom TPU chips to play matches, compared with AlphaGo’s several machines and 48 TPUs. Since Zero didn’t rely on human gameplay, and played a smaller number of matches, its Monte Carlo tree search is smaller. The self-play algorithm also combined the value and policy neural networks into one, and it was trained on 64 GPUs and 19 CPUs by playing nearly five million games against itself. In comparison, AlphaGo needed months of training and used 1,920 CPUs and 280 GPUs to beat Lee Sedol.
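The point about merging the two networks is an architectural one: a single trunk feeds both a policy head (a distribution over moves) and a value head (a scalar estimate of who is winning). Here is a toy fully-connected version in PyTorch with made-up layer sizes; the real network is a deep residual convolutional network over board feature planes.

    import torch
    import torch.nn as nn

    class PolicyValueNet(nn.Module):
        """Toy two-headed network: a shared trunk, a policy head and a value head."""
        def __init__(self, board_size=19, hidden=256):
            super().__init__()
            n = board_size * board_size
            self.trunk = nn.Sequential(nn.Linear(n, hidden), nn.ReLU(),
                                       nn.Linear(hidden, hidden), nn.ReLU())
            self.policy_head = nn.Linear(hidden, n + 1)     # one move per point, plus pass
            self.value_head = nn.Sequential(nn.Linear(hidden, 1), nn.Tanh())

        def forward(self, board):                           # board: (batch, n) features
            h = self.trunk(board)
            return self.policy_head(h).log_softmax(dim=-1), self.value_head(h)

    net = PolicyValueNet()
    log_policy, value = net(torch.zeros(1, 19 * 19))
    print(log_policy.shape, value.shape)          # torch.Size([1, 362]) torch.Size([1, 1])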

AlphaGo combines the two most powerful ideas about learning to emerge from the past few decades: deep learning and reinforcement learning. In the human brain, sensory information is processed in a series of layers. For instance, visual information is first transformed in the retina, then in the midbrain, and then through many different areas of the cerebral cortex. This creates a hierarchy of representations where simple, local features are extracted first, and then more complex, global features are built from these. The AI equivalent is called deep learning; deep because it involves many layers of processing in simple neuron-like computing units.

But to survive in the world, animals need not only to recognise sensory information but also to act on it. Generations of scientists have studied how animals learn to take a series of actions that maximise their reward. This has led to mathematical theories of reinforcement learning that can now be implemented in AI systems. The most powerful of these is temporal difference learning, which continually updates its prediction of future reward and improves its actions so as to maximise that prediction. It is in this way that AlphaGo Zero, without human intervention, rediscovered classic Go knowledge such as fuseki opening tactics and life-and-death shapes.
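The core temporal-difference update fits in one line: nudge the value estimate of the current state toward the observed reward plus the discounted value estimate of the next state. A minimal dict-based sketch (not DeepMind’s implementation, which folds this idea into the training of a deep network):

    def td_update(V, s, r, s_next, alpha=0.1, gamma=0.99, terminal=False):
        """One TD(0) step on a dict-based value table V."""
        v_next = 0.0 if terminal else V.get(s_next, 0.0)
        V[s] = V.get(s, 0.0) + alpha * (r + gamma * v_next - V.get(s, 0.0))

    # Values propagate backwards along a two-step episode with a reward at the end.
    V = {}
    td_update(V, "s1", 0.0, "s2")                  # no signal yet: V["s1"] stays 0
    td_update(V, "s2", 1.0, None, terminal=True)   # terminal reward reaches s2
    td_update(V, "s1", 0.0, "s2")                  # s1 now inherits part of s2's value
    print(V)                                       # e.g. {'s1': 0.0099, 's2': 0.1}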

So are there problems to which the current algorithms can be fairly immediately applied?

One example may be optimisation in controlled industrial settings. Here the goal is often to complete a complex series of tasks while satisfying multiple constraints and minimising cost. As long as the possibilities can be accurately simulated, self-play-based algorithms can explore and learn from a vastly larger space of outcomes than will ever be possible for humans.

Researchers at OpenAI have already experimented with the same technique to train bots to play Dota 2, and have published a paper on competitive self-play. There are other experiments, such as this one, showing how self-teaching AI is better at predicting heart attacks.

AlphaGo Zero’s success bodes well for AI’s mastery of games. But it would be a mistake to believe that we’ve learned something general about thinking and about learning for general intelligence. This approach won’t work in more ill-structured problems like natural-language understanding or robotics, where the state space is more complex and there isn’t a clear objective function.

Unsupervised training is the key to ultimately creating AI that can think for itself, but more research is needed outside of the confines of board games and predefined objective functions before computers can really begin to think outside the box.

DeepMind says the research team behind AlphaGo is looking to pursue other complex problems, such as finding new cures for diseases, quantum chemistry and material design.

Although it couldn’t sample every possible board position, AlphaGo’s neural networks extracted key ideas about strategies that work well in any position. Unfortunately, as yet there is no known way to interrogate the network to directly read out what these key ideas are. If one learns the game of Go purely through supervised learning, the best one could hope for is to become as good as the human being imitated. Through self-play (and thus unsupervised learning), one can learn something completely novel and create or catalyse the emergence of new strategies.

DeepMind’s self-play approach is not the only way to push the boundaries of AI. Gary Marcus, a cognitive scientist at NYU, co-founded Geometric Intelligence (since acquired by Uber) to explore learning techniques that extrapolate from a small number of examples, inspired by how children learn. He has claimed that these techniques outperform both Google’s and Microsoft’s deep-learning algorithms.