“DeepNash: Mastering the Game of Stratego With Model-Free Multiagent Reinforcement Learning”, Julien Perolat, Bart de Vylder, Daniel Hennes, Eugene Tarassov, Florian Strub, Vincent de Boer, Paul Muller, Jerome T. Connor, Neil Burch, Thomas Anthony, Stephen McAleer, Romuald Elie, Sarah H. Cen, Zhe Wang, Audrunas Gruslys, Aleksandra Malysheva, Mina Khan, Sherjil Ozair, Finbarr Timbers, Toby Pohlen, Tom Eccles, Mark Rowland, Marc Lanctot, Jean-Baptiste Lespiau, Bilal Piot, Shayegan Omidshafiei, Edward Lockhart, Laurent Sifre, Nathalie Beauguerlange, Remi Munos, David Silver, Satinder Singh, Demis Hassabis, Karl Tuyls, 2022-06-30:

We introduce DeepNash, an autonomous agent capable of learning to play the imperfect information game Stratego from scratch, up to a human expert level.

Stratego is one of the few iconic board games that Artificial Intelligence (AI) has not yet mastered. This popular game has an enormous game tree on the order of 10^535 nodes, i.e., 10^175 times larger than that of Go. It has the additional complexity of requiring decision-making under imperfect information, similar to Texas hold’em poker, which has a much smaller game tree (on the order of 10^164 nodes). Decisions in Stratego are made over a large number of discrete actions with no obvious link between action and outcome. Episodes are long, with often hundreds of moves before a player wins, and situations in Stratego cannot easily be broken down into manageable-sized sub-problems as in poker. For these reasons, Stratego has been a grand challenge for the field of AI for decades, and existing AI methods barely reach an amateur level of play.
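The complexity figures above are mutually consistent if one assumes the commonly cited estimate of roughly 10^360 nodes for Go's game tree (the 10^360 figure is an assumption here, not stated in the text); a quick arithmetic check:

```python
# Consistency check on the quoted game-tree sizes.
stratego_exp = 535        # Stratego: ~10^535 nodes (from the text)
go_exp = 360              # Go: ~10^360 nodes (assumed common estimate)
ratio = 10**stratego_exp // 10**go_exp
print(ratio == 10**175)   # True: Stratego's tree is ~10^175 times larger
```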

DeepNash uses a game-theoretic, model-free deep reinforcement learning method, without search, that learns to master Stratego via self-play. The Regularized Nash Dynamics (R-NaD) algorithm, a key component of DeepNash, converges to an approximate Nash equilibrium, instead of ‘cycling’ around it, by directly modifying the underlying multi-agent learning dynamics.

DeepNash beat existing state-of-the-art AI methods in Stratego and achieved a yearly (2022) and all-time top-3 rank on the Gravon games platform, competing with human expert players.

…Given the vast number of such possible private configurations in a public state, Stratego computationally challenges all existing search techniques as the search space becomes intractable. We therefore chose an orthogonal route in this work, without search, and propose a new method that combines model-free reinforcement learning in self-play with a game-theoretic algorithmic idea, Regularized Nash Dynamics (R-NaD). The model-free part implies that we don’t build an explicit opponent model tracking belief space (calculating a likelihood of the opponent’s state), and the game-theoretic part is based on the idea that by modifying the dynamical system underpinning our reinforcement-learning approach we can steer the learning behavior of the agent in the direction of the Nash equilibrium. The main advantage of this combined approach is that we do not need to explicitly model private states from public ones. A complex challenge, on the other hand, is to scale up this model-free reinforcement learning approach with R-NaD to make self-play competitive against human expert players in Stratego, which has not been achieved to date. This combined DeepNash approach is illustrated in Figure 1b.

Figure 1b: An overview of the DeepNash approach. DeepNash is an autonomous agent that learns to play the imperfect information game Stratego (A). It learns a policy represented by a deep neural network (B) through self-play from scratch (C) in order to converge to a Nash equilibrium (D).

…it is possible to define a learning update rule that induces a dynamical system for which there exists a so-called Lyapunov function. This function can be shown to decrease during learning and as such guarantees convergence to a fixed point. This is the central idea behind the R-NaD algorithm, and the successful recipe for DeepNash, which scales this approach using a deep neural network.
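The cycling-versus-convergence distinction can be seen in a toy tabular sketch (illustrative only; it is not DeepNash's implementation, which uses a deep network). On rock-paper-scissors, plain replicator dynamics orbits the uniform Nash equilibrium, whereas an R-NaD-style regularization term, which penalizes the payoffs by `-eta * log(pi / pi_reg)` toward a reference policy `pi_reg`, damps the cycle so the dynamics converge to a fixed point; the reference policy is then reset to that fixed point and the process repeats. The step size `lr`, regularization weight `eta`, and loop counts below are arbitrary choices for the sketch:

```python
import numpy as np

# Rock-paper-scissors payoff matrix for the row player (symmetric zero-sum).
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)

def rnad_step(pi, pi_reg, eta=0.2, lr=0.05):
    """One discrete-time step of replicator dynamics with R-NaD-style
    regularization: payoffs are penalized by -eta * log(pi / pi_reg),
    which pulls play toward the reference policy and damps cycling."""
    payoff = A @ pi - eta * np.log(pi / pi_reg)
    avg = pi @ payoff
    pi = pi * np.exp(lr * (payoff - avg))   # multiplicative-weights update
    return pi / pi.sum()

pi = np.array([0.8, 0.1, 0.1])   # start far from equilibrium
pi_reg = np.full(3, 1 / 3)       # initial reference policy
for _ in range(20):              # outer loop: reset reference to the fixed point
    for _ in range(500):         # inner loop: run the regularized dynamics
        pi = rnad_step(pi, pi_reg)
    pi_reg = pi.copy()
print(np.round(pi, 3))           # [0.333 0.333 0.333] -> the uniform Nash equilibrium
```

Dropping the `eta` term recovers plain replicator dynamics, which under the same schedule keeps circling the equilibrium instead of settling on it.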

…The third category of deep RL algorithms, under which this work falls, is policy gradient methods [Impala]. Regret Policy Gradient (RPG) (70) approximates CFR via a weighted policy gradient, but is not proven to converge to a Nash equilibrium. Neural Replicator Dynamics (NeuRD) (71) approximates Replicator Dynamics with a policy gradient and is proven to converge to a Nash equilibrium in the time average. Prior to this work, neither of these algorithms had been applied to large-scale domains or demonstrated human-level performance; this work uses NeuRD combined with the regularization idea laid out in (34) to converge in the last iterate.
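In the tabular case, the NeuRD update can be sketched as adjusting the policy *logits* directly by the advantage, `y <- y + lr * (q - v)`, rather than using the pi-weighted softmax policy gradient; this recovers replicator dynamics, whose time-average policy approaches the Nash equilibrium in zero-sum games. The sketch below (a toy illustration under these assumptions, not DeepNash's training stack) runs that update in self-play on rock-paper-scissors:

```python
import numpy as np

# Row player's payoffs in rock-paper-scissors (symmetric zero-sum game).
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)

def softmax(y):
    z = np.exp(y - y.max())
    return z / z.sum()

y = np.array([1.0, 0.0, 0.0])    # logits, starting from a biased policy
avg_pi = np.zeros(3)
T = 20000
for _ in range(T):
    pi = softmax(y)
    q = A @ pi                   # action values against the self-play policy
    y += 0.01 * (q - pi @ q)     # NeuRD: logit update with the advantage
    avg_pi += pi
avg_pi /= T
print(avg_pi)                    # close to the uniform Nash [1/3, 1/3, 1/3]
```

The last-iterate policy `pi` still cycles; it is the time average `avg_pi` that approaches equilibrium, which is exactly the gap the R-NaD regularization closes by making the last iterate itself converge.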

…To train the final agent, we used 768 TPU nodes for Learners and 256 TPU nodes for Actors.

Quotes from Stratego Experts: Thorsten Jungblut, owner of the Gravon platform:

Many players in the past thought that there would never be an AI for Stratego that could be real competition for human players, or even play in the top ten. Obviously, they were wrong.

Vincent de Boer, former Stratego world champion, evaluated DeepNash as follows:

The level of play of DeepNash surprised me. I had never seen or heard of an artificial Stratego player that came close to the level needed to win a match against an experienced human player, but after playing against DeepNash myself I was not surprised by the top-3 ranking it later achieved on the Gravon internet platform. I would expect this agent to also do very well if it participated in the World Championship.