Games have long been used as benchmarks of A.I. progress. Games test reasoning ability and simulate, in simplified form, some of the decision-making dilemmas found in the real world. Computer scientists have also favored games for another reason: they have point systems and clearly defined winners and losers. This makes them ideal environments for reinforcement learning, a technique in which software learns from experience rather than from existing data. Points serve as a convenient reward signal that lets such software judge whether a particular action is likely to be beneficial, in much the way a dog trainer doles out a treat if Fido sits on command.
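For readers who want to see what "points as a reward signal" looks like in practice, here is a minimal sketch of the simplest reinforcement-learning setting, a so-called multi-armed bandit. The game, its three moves and their win probabilities are all hypothetical. The agent never sees those probabilities; it only sees the points each play earns. It keeps a running average of points per move, mostly repeats the move that has scored best so far, and occasionally tries a random move so it does not miss a better option (an "epsilon-greedy" rule).

```python
import random

# Hypothetical game: each move wins a point with a fixed probability the agent cannot see.
WIN_PROBABILITY = {"move_a": 0.1, "move_b": 0.4, "move_c": 0.6}


def play(move, rng):
    """Return the reward signal: 1 point for a win, 0 for a loss."""
    return 1 if rng.random() < WIN_PROBABILITY[move] else 0


def learn(steps=5_000, epsilon=0.1, seed=0):
    """Epsilon-greedy learning: mostly repeat the best-scoring move, sometimes explore."""
    rng = random.Random(seed)
    value = {move: 0.0 for move in WIN_PROBABILITY}  # estimated points per play
    count = {move: 0 for move in WIN_PROBABILITY}
    for _ in range(steps):
        if rng.random() < epsilon:
            move = rng.choice(list(WIN_PROBABILITY))  # explore a random move
        else:
            move = max(value, key=value.get)  # exploit the best estimate so far
        reward = play(move, rng)
        count[move] += 1
        value[move] += (reward - value[move]) / count[move]  # running average of rewards
    return value


if __name__ == "__main__":
    estimates = learn()
    print(max(estimates, key=estimates.get), estimates)
```

After a few thousand plays, the agent settles on the move that pays out most often. Like the dog, it has learned purely from the treats.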
Chess was long considered the epitome of human strategic thought, a symbol of calculating rationality and intellect. It, of course, succumbed to artificial intelligence in 1997, when IBM’s Deep Blue computer beat world champion Garry Kasparov. After chess came Go. In 2016, AlphaGo, an algorithm created by DeepMind, the London-based A.I. research shop owned by Google parent Alphabet Inc., beat Lee Sedol, one of the world’s best players of the game. With a larger board than chess, Go is a far more difficult challenge: there are more possible move combinations than there are atoms in the universe, and players select moves as much by instinct as by brute calculation. In ancient China, where the game originated, Go was considered one of the four essential arts a scholar needed to master.
Poker, meanwhile, enjoys a sleazier, less noble reputation. In poker, deception, luck and human psychology can play as large a role as pure intellect and reason. Well, guess what? Poker is a lot closer to most real-world decision-making than either Go or chess. Multiplayer games also more closely mirror the complexity of many situations in life, which are not winner-take-all.
Pluribus builds on the techniques that Brown and his Carnegie Mellon doctoral advisor, Tuomas Sandholm, used to create Libratus, another poker-playing A.I., which in January 2017 beat four human poker pros over the course of 120,000 hands. But that experiment involved one-on-one competition, not the more usual six-player tournament version of the game.
In such two-player games, it is always possible, through mathematical brute force, to compute an optimal strategy, known as a Nash equilibrium, that will result in the A.I. player at least breaking even over the long run. In multiplayer games without teams, a Nash equilibrium is often too difficult to calculate. Even when one can be found, playing it no longer guarantees that the A.I. won’t lose.
For this reason, Brown says six-player poker represents a harder challenge than even StarCraft II or Dota 2. These are two video games in which A.I. agents, designed by DeepMind and the A.I. research firm OpenAI respectively, have beaten human opponents over the past two years. Those games are also complex and involve imperfect information, and Dota 2 involves multiple players. But each is a two-sided, winner-take-all contest (one player against another in StarCraft II, two five-player teams in Dota 2). That means an algorithm can still try to find the Nash equilibrium.
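How can an algorithm "find the Nash equilibrium" of a two-sided game? One approach is to let the algorithm play against itself. The following minimal sketch uses regret matching, the update rule at the core of counterfactual regret minimization (CFR), the family of algorithms behind both Libratus and Pluribus. It is not their actual code, which is far more elaborate, and it swaps in rock-paper-scissors as a hypothetical stand-in game. On each round, each player tallies how much better every alternative move would have done than the mix it actually played. It then plays moves in proportion to their accumulated positive "regret."

```python
# Rock-paper-scissors payoffs for the row player; the column player gets the negative.
# Index 0 = rock, 1 = paper, 2 = scissors.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]
ACTIONS = range(3)


def regret_matching(regrets):
    """Play each action in proportion to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / len(regrets)] * len(regrets)  # no positive regret: play uniformly


def self_play(iterations=10_000):
    """Two regret-matching players face each other; return their average strategies."""
    regrets = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # non-uniform start so play cycles
    strategy_sums = [[0.0] * 3, [0.0] * 3]
    for _ in range(iterations):
        s0 = regret_matching(regrets[0])
        s1 = regret_matching(regrets[1])
        for a in ACTIONS:
            strategy_sums[0][a] += s0[a]
            strategy_sums[1][a] += s1[a]
        # Expected value of each pure action against the opponent's current mix.
        v0 = [sum(PAYOFF[a][b] * s1[b] for b in ACTIONS) for a in ACTIONS]
        v1 = [sum(-PAYOFF[a][b] * s0[a] for a in ACTIONS) for b in ACTIONS]
        ev0 = sum(s0[a] * v0[a] for a in ACTIONS)
        ev1 = sum(s1[b] * v1[b] for b in ACTIONS)
        # Regret: how much better each action would have done than the mix actually played.
        for a in ACTIONS:
            regrets[0][a] += v0[a] - ev0
            regrets[1][a] += v1[a] - ev1
    # The average strategy, not the latest one, is what approaches equilibrium.
    return [[x / iterations for x in sums] for sums in strategy_sums]


if __name__ == "__main__":
    for player, avg in enumerate(self_play()):
        print(player, [round(p, 3) for p in avg])
```

Each player's moment-to-moment strategy keeps cycling, but its average drifts toward one-third rock, one-third paper and one-third scissors. That mix is the game's Nash equilibrium, and no opponent can exploit it.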
StarCraft II and Dota 2 also involve tactical elements: arcade-style shoot-’em-up battles. If an A.I. can master these tactics at super-human levels, it can win without having to use particularly innovative strategies. That’s not the case with poker. “In poker, you have to address imperfect information head-on,” Brown says. There’s no way to sidestep the problem by, for instance, learning to stack your chips better than your opponent. Being able to deal with unknown information is the key to effective bluffing and betting, he says.
Super-Human Performance, On A Laptop
Compared with Libratus, the earlier poker-playing A.I., Pluribus has a substantially different design. Brown and Sandholm’s changes mean it requires far less computing power to both train and deploy. Libratus had used about 15 million core hours on a supercomputer to train. Pluribus used just 12,400 core hours on a machine with 512 gigabytes of working memory, or about what a souped-up gaming laptop might have.
This is also vastly less computing power than other A.I.s needed to achieve game-playing breakthroughs. AlphaZero, the latest version of DeepMind’s Go-playing algorithm, was trained on more than 5,000 of Google’s own highly specialized computer chips. OpenAI’s Dota 2 bots ran on more than 128,000 processor cores at once throughout training, and they trained for days.
The cost of all that data-crunching power can easily run into the hundreds of thousands or even many millions of dollars. Brown and Sandholm estimate that, at current cloud computing prices, it would cost less than $150 to train Pluribus. And, once trained, the algorithm is so lightweight that Brown and Sandholm could run it on a laptop with 128 GB of memory.
The secret to Pluribus’ efficiency is a simple but elegant way of strategizing. Libratus and many other game-playing A.I.s “look ahead” to see how a strategy is likely to play out through to the end of a game. That is too computationally demanding for a six-player game, especially since each opponent can change their own strategy in response to what every other player around the table is betting. Brown and Sandholm found that Pluribus could achieve super-human performance in a simpler way. It explores the possibilities only two or three rounds into the future, then assumes that, from there on, each of the other players will stick with one of four possible strategies.
This finding may also have big implications for real-world A.I. applications: it may turn out to be easier and less expensive to create algorithms capable of advising human decision-makers under conditions of uncertainty than previously assumed.
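This kind of shortcut, looking only a few moves ahead and then assuming each opponent falls back on one of a handful of strategies, can be illustrated with a toy. The following sketch is hypothetical throughout. The move names, the four continuation-strategy labels and every number are invented. It also treats the game as if all cards were face up, whereas Pluribus’ real search is an imperfect-information solver, not the simple max-min search shown here. The key step is in `leaf_value`. At the search horizon, the program does not play the hand out. Instead, it looks up an estimate for each of the four fallback strategies and assumes the opponent picks whichever one hurts it most.

```python
# Hypothetical names for the continuation strategies an opponent might adopt
# once the search stops looking ahead.
CONTINUATIONS = ("blueprint", "fold_biased", "call_biased", "raise_biased")
OUR_ACTIONS = ("check", "bet")
THEIR_ACTIONS = ("passive", "aggressive")

# Made-up estimates of our winnings (in big blinds) at each horizon node,
# one per continuation strategy the opponent might follow from there on.
LEAF_ESTIMATES = {
    ("check", "passive"): {"blueprint": 0.5, "fold_biased": 1.0, "call_biased": 0.2, "raise_biased": -0.4},
    ("check", "aggressive"): {"blueprint": -0.6, "fold_biased": 0.3, "call_biased": -0.8, "raise_biased": -1.0},
    ("bet", "passive"): {"blueprint": 1.2, "fold_biased": 2.0, "call_biased": 0.9, "raise_biased": 0.1},
    ("bet", "aggressive"): {"blueprint": 0.4, "fold_biased": 1.5, "call_biased": -0.2, "raise_biased": 0.3},
}
HORIZON = 2  # look ahead two decisions, then stop


def leaf_value(history):
    """At the horizon, assume the opponent switches to whichever continuation hurts us most."""
    return min(LEAF_ESTIMATES[history][c] for c in CONTINUATIONS)


def search(history=()):
    """Depth-limited max-min search over a toy, perfect-information tree."""
    if len(history) == HORIZON:
        return leaf_value(history)
    if len(history) % 2 == 0:  # our decision
        return max(search(history + (a,)) for a in OUR_ACTIONS)
    return min(search(history + (a,)) for a in THEIR_ACTIONS)  # opponent's decision


def best_action():
    """Pick the action whose depth-limited value is highest."""
    return max(OUR_ACTIONS, key=lambda a: search((a,)))


if __name__ == "__main__":
    print(best_action(), search())
```

In this toy, betting is still the better choice, but it is worth -0.2 big blinds rather than the 0.4 it would appear to be worth if the opponent were assumed to stick to the blueprint. Allowing for a few opponent strategies keeps the search honest without exploring the whole game.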
A New Style of Hold ’Em
The most immediate impact of Pluribus, though, is likely to be in the world of poker itself: Since the algorithm learned entirely from self-play, it can discover strategies and tactics beyond those found in poker lore.
For instance, conventional poker wisdom holds that if a player has been conservative on a betting round, merely checking (declining to bet) or calling (matching the bets of the others), that player should not start the next betting round by betting. Yet, in its games against the human pros, Pluribus found this tactic, which is known as “donk betting,” could actually be effective. Pluribus also makes far more aggressive bets than human players tend to. And it plays a far more balanced game than most human players. It varies whether to bluff or fold with a bad hand, and whether to bet aggressively or conservatively when holding a good hand. That makes it difficult for opponents to gain much information about Pluribus’ hand from its betting strategy.
Brown says the human pros who played Pluribus are already planning to adopt such strategies in their own future games.
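What "balance" means can be made concrete with a standard textbook toy model of a final betting round, not anything taken from Pluribus itself. In this model, a bettor either holds a strong hand or is bluffing, and an opponent must decide whether to call. The caller risks the bet to win the pot plus the bet. If bluffs make up exactly bet / (pot + 2 × bet) of the bettor’s bets, calling and folding are worth the same. The caller then learns nothing useful from the bet itself. The function names and chip amounts here are illustrative assumptions.

```python
def indifference_bluff_fraction(pot, bet):
    """Share of a bettor's bets that should be bluffs so that calling breaks even.

    The caller risks `bet` to win `pot + bet`: the existing pot plus the bettor's chips.
    """
    return bet / (pot + 2 * bet)


def caller_ev(pot, bet, bluff_fraction):
    """Caller's expected profit: wins pot + bet against a bluff, loses bet against a value hand."""
    return bluff_fraction * (pot + bet) - (1 - bluff_fraction) * bet


if __name__ == "__main__":
    for bet in (50, 100, 200):  # half-pot, pot-sized and double-pot bets into a 100-chip pot
        f = indifference_bluff_fraction(100, bet)
        print(bet, round(f, 3), round(caller_ev(100, bet, f), 9))
```

A pot-sized bet calls for one bluff in every three bets, and bigger bets call for proportionally more bluffs. Any deviation from that ratio hands an attentive opponent a profitable counter-strategy, whether calling more or folding more.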
So, while an A.I. is never going to bequeath you an ace that you can keep, like Rogers’ grizzled gambler, it might just give you something far more valuable: wisdom.