[blog; checkpoints/code; transcripts; highlights; Q&A commentary] Despite much progress in training AI systems to imitate human language, building agents that use language to communicate intentionally with humans in interactive environments remains a major challenge. We introduce CICERO, the first AI agent to achieve human-level performance in Diplomacy, a strategy game involving both cooperation and competition that emphasizes natural language negotiation and tactical coordination between 7 players.
CICERO integrates a language model with planning and reinforcement learning algorithms by inferring players’ beliefs and intentions from its conversations and generating dialogue in pursuit of its plans.
Across 40 games of an anonymous online Diplomacy league, CICERO achieved more than double the average score of the human players and ranked in the top 10% of participants who played more than one game.
…CICERO participated anonymously in 40 games of Diplomacy in a “blitz” league on webDiplomacy.net from August 19 to October 13, 2022. This league played with 5-minute negotiation turns; these time controls allowed games to be completed within two hours. CICERO ranked in the top 10% of participants who played more than one game and 2nd out of 19 participants in the league who played 5 or more games. [5,277 messages or 292 messages per game] Across all 40 games, CICERO’s mean score was 25.8%, more than double the average score of 12.4% of its 82 opponents. As part of the league, CICERO participated in an 8-game tournament involving 21 participants, 6 of whom played at least 5 games. Participants could play a maximum of 6 games, with their rank determined by the average of their best 3 games. CICERO placed 1st in this tournament.
…CICERO passed as a human player for 40 games of Diplomacy with 82 unique players, and no in-game messages indicated that players believed they were playing with an AI agent. One player mentioned in post-game chat a suspicion that one of CICERO’s accounts might be a bot, but this did not lead to CICERO being detected as an AI agent by other players in the league.
…Data: We obtained a dataset of 125,261 games of Diplomacy played online at webDiplomacy.net. Of these, 40,408 games contained dialogue, with a total of 12,901,662 messages exchanged between players. Player accounts were de-identified and automated redaction of personally identifiable information (PII) was performed by webDiplomacy. We refer to this dataset hereafter as WebDiplomacy.
…Strategic reasoning: To generate the intents for dialogue and to choose the final actions to play each turn, CICERO runs a strategic reasoning module that predicts other players’ policies (i.e., a probability distribution over actions) for the current turn based on the state of the board and the shared dialogue, and then chooses a policy for itself for the current turn that responds optimally to the other players’ predicted policies.
Doing this with human players requires predicting how humans will play. A popular approach in cooperative games is to model the other players’ policies via supervised learning on human data, which is commonly referred to as behavioral cloning (BC). However, pure BC is brittle, especially since a supervised model may learn spurious correlations between dialogue and actions (Supplementary Figure 6).
To address this problem, CICERO used variants of piKL to model the policies of players. piKL is an iterative algorithm that predicts policies by assuming each player i seeks to both maximize the expected value of their policy πi and minimize the KL divergence between πi and the BC policy, which we call the anchor policy τi. An anchor strength parameter λ ∈ [0, ∞) trades off between these competing objectives…Other players of course may be deceptive about their plans. CICERO does not explicitly predict whether a message is deceptive or not, but rather relies on piKL to directly predict the policies of other players based on both the BC policy (which conditions on the message) and on whether deviating from the BC policy would benefit that player.
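The core of piKL, the λ-regularized best response for a single player, has a simple closed form: the policy is the anchor reweighted by exponentiated action values. A minimal sketch (the function and variable names are mine, and the real algorithm iterates this computation across all players rather than applying it once):

```python
import numpy as np

def pikl_policy(q_values, anchor, lam):
    """One-step piKL best response: maximize E[q] - lam * KL(pi || anchor).

    Closed form: pi(a) proportional to anchor(a) * exp(q(a) / lam).
    lam -> infinity recovers the anchor (BC) policy;
    lam -> 0 approaches the greedy argmax action.
    """
    logits = np.log(anchor) + q_values / lam
    logits -= logits.max()  # subtract max for numerical stability
    pi = np.exp(logits)
    return pi / pi.sum()

# Toy example with three candidate actions.
q = np.array([1.0, 0.0, -1.0])    # expected values of each action
tau = np.array([0.2, 0.7, 0.1])   # human (BC) anchor policy

strong_anchor = pikl_policy(q, tau, lam=100.0)  # stays close to tau
weak_anchor = pikl_policy(q, tau, lam=0.1)      # concentrates on the best action
```

This makes the anchor-strength trade-off concrete: a large λ keeps predictions close to human behavior, while a small λ lets expected value dominate.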
…Our final self-play algorithm operated similarly to AlphaZero and ReBeL, by applying planning “in the loop” as the improvement operator for RL. In our case, planning was via an approximated version of CoShar piKL. We generated self-play trajectories where on each turn we computed the CoShar piKL policy using a learned state-value model. We regressed the joint policy model toward that policy and regressed the value model toward the expected values of all players under that policy. We then sampled a joint action from that policy to generate the next state in the trajectory. The anchor policy was fixed throughout training in order to anchor the RL near human play. See SM, §E.4 for details.
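The trajectory-generation loop above can be sketched as follows. Everything here (`TinyGame`, `plan`, the replay list) is an illustrative stand-in for the paper's actual components, which use a learned state-value model and an approximated version of CoShar piKL:

```python
import random

class TinyGame:
    """A trivial 2-turn stand-in for a game state."""
    def __init__(self, turn=0):
        self.turn = turn
    def is_terminal(self):
        return self.turn >= 2
    def step(self, joint_action):
        return TinyGame(self.turn + 1)

def plan(state):
    """Stand-in for the planning step: returns an improved joint policy
    and per-player expected values. In the real system this runs
    approximated CoShar piKL with a learned value model and a fixed
    anchor policy that keeps play near the human distribution."""
    joint_policy = {("hold", "hold"): 0.6, ("move", "hold"): 0.4}
    values = [0.5, 0.5]
    return joint_policy, values

def self_play_episode(state):
    replay = []  # (state, policy target, value target) regression tuples
    while not state.is_terminal():
        # Planning acts as the improvement operator for RL.
        joint_policy, values = plan(state)
        # The policy net is regressed toward joint_policy and the value
        # net toward values; here we just record the targets.
        replay.append((state, joint_policy, values))
        # Sample a joint action from the improved policy to advance play.
        actions, probs = zip(*joint_policy.items())
        state = state.step(random.choices(actions, weights=probs)[0])
    return replay

replay = self_play_episode(TinyGame())
```

The point of the sketch is the control flow: plan, record regression targets, sample, repeat, with the anchor held fixed throughout training.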
Figure 1: Architecture of CICERO. CICERO predicts likely human actions for each player based on the board state and dialogue, using that as the starting point for a planning algorithm using RL-trained models. The output of planning is an action for the agent as well as beliefs about other players’ actions, which are used to select intents for a dialogue model to condition on. Generated message candidates undergo several filtering steps before a final message is sent.
Figure 3: The effect of intents on CICERO’s dialogue. Pictured are 3 different possible intents in the same game situation. In each case, we show a message generated by CICERO (England, pink) to France (blue), Germany (orange) and Russia (purple) conditioned on these intents. Each intent leads to quite different messages consistent with the intended actions.
Figure 2: Illustration of the training and inference process for intent-controlled dialogue. Actions are specified as strings of orders for units, e.g., “NTH S BEL - HOL” means that North Sea will support Belgium to Holland. (A) An ‘intent model’ was trained to predict actions for a pair of players based on their dialogue. Training data was restricted to a subset where dialogue is deemed ‘truthful’ (see latent intents). (B) Each message in the dialogue training dataset was annotated with the output of the intent model on the dialogue up to that point, with an agreement message injected at the end. (C) The dialogue model was trained to predict each dataset message given the annotated intent for the target message. (D) During play, intents were supplied by the planning module instead.
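For readers unfamiliar with the order notation, here is a toy parser for the simplified subset shown in the figure (illustrative only; real Diplomacy orders also cover convoys, retreats, builds, and coastal distinctions):

```python
def parse_order(order: str) -> dict:
    """Parse a simplified Diplomacy order string.

    "NTH S BEL - HOL"  -> North Sea supports the move Belgium -> Holland
    "BEL - HOL"        -> Belgium moves to Holland
    "NTH H"            -> North Sea holds
    """
    tokens = order.split()
    if len(tokens) == 2 and tokens[1] == "H":
        return {"type": "hold", "unit": tokens[0]}
    if len(tokens) == 3 and tokens[1] == "-":
        return {"type": "move", "unit": tokens[0], "dest": tokens[2]}
    if len(tokens) == 5 and tokens[1] == "S" and tokens[3] == "-":
        return {"type": "support", "unit": tokens[0],
                "supported": tokens[2], "dest": tokens[4]}
    raise ValueError(f"unrecognized order: {order!r}")

parse_order("NTH S BEL - HOL")
# {'type': 'support', 'unit': 'NTH', 'supported': 'BEL', 'dest': 'HOL'}
```

An intent in the paper is just a set of such order strings for a player (or a pair of players), which is what the dialogue model conditions on.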
Figure 5: The effect of dialogue on CICERO’s strategic planning and intents. CICERO (France, blue) and England (pink) are entangled in a fight, but it would be beneficial for both players if they could disengage. CICERO has just messaged England “Do you want to call this fight off? I can let you focus on Russia and I can focus on Italy”. Pictured are 3 ways that England might reply and how CICERO adapts to each. Because CICERO’s planning anchors around a dialogue-conditional policy model, its predictions for other players and accordingly its own plans are flexible and responsive to negotiation with other players (left, middle). Yet CICERO also avoids blindly trusting what other players propose by rejecting plans that have low predicted value and run counter to its own interests (right).
…Figure 6 showcases two examples of coordination and negotiation. In the coordination example, we observed CICERO building an alliance via discussion of a longer-term strategy. In the negotiation example, CICERO successfully changed the other player’s mind by proposing mutually beneficial moves. In a game in which dishonesty is commonplace, it is notable that we were able to achieve human-level performance by controlling the agent’s dialogue through the strategic reasoning module to be largely honest and helpful to its speaking partners.
Figure 6: Successful dialogue examples. Examples of CICERO coordinating (left) and negotiating (right) with authors of this paper in test games.
“What impresses me most about CICERO is its ability to communicate with empathy and build rapport while also tying that back to its strategic objectives…CICERO is ruthless. It’s resilient. And it’s patient…CICERO’s dialogue is direct, but it has some empathy. It’s surprisingly human.” —Andrew Goff (3× Diplomacy World Champion)
“I was flabbergasted. It seemed so genuine—so lifelike. It could read my texts and converse with me and make plans that were mutually beneficial—that would allow both of us to get ahead. It also lied to me and betrayed me, like top players frequently do.” —Claes de Graaf
In 2019 Noam Brown and I decided to tackle Diplomacy because it was the hardest game for AI we could think of and went beyond moving pieces on a board to cooperating with people through language. We thought human-level play was a decade away.
It’s worth noting that they built quite a complicated, specialized AI system (i.e., they did not take an LLM and finetune a generalist agent that can also play Diplomacy):
First, they train a dialogue-conditional action model by behavioral cloning on human data to predict what other players will do.
Then they do joint RL planning to get action intentions for the AI and the other players, using the outputs of the conditional action model and a learned dialogue-free value model. (They also regularize this plan with a KL penalty toward the output of the action model.)
They also train a conditional dialogue model by finetuning a small LM (a 2.7B-parameter BART-like model, R2C2 [context window: 2,048 BPE tokens]) to map intents + game history → messages. Interestingly, this model is trained in a way that makes it pretty honest by default.
They train a set of filters to remove hallucinations, inconsistencies, toxicity, leaking its actual plans, etc from the output messages, before sending them to other players.
The intents are updated after every message. At the end of each turn, they output the final intent as the action.
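Putting these steps together, the per-turn loop looks roughly like this pseudocode (every function name here is a stand-in I made up, not the real CICERO API):

```python
# Pseudocode sketch of CICERO's per-turn loop; functions are undefined stand-ins.

def play_turn(game, dialogue):
    while game.negotiation_time_remaining():
        # 1. BC action model predicts everyone's moves, conditioned on dialogue.
        bc_policies = action_model(game.board, dialogue)
        # 2. piKL planning refines those predictions into intents for CICERO
        #    and beliefs about the other players.
        intents = pikl_plan(bc_policies, value_model, game.board)
        # 3. Conditional dialogue model drafts messages from intents + history.
        draft = dialogue_model(intents, dialogue)
        # 4. Filters drop hallucinated, inconsistent, toxic, or plan-leaking
        #    candidates before anything is sent.
        msg = filters(draft, intents, game.board)
        if msg:
            dialogue = send(dialogue, msg)
        # Intents are recomputed after every message exchanged.
    # 5. At the end of the turn, the final intent becomes the submitted action.
    return intents.own_action
```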
I do expect someone to figure out how to avoid all these dongles and do it with a more generalist model in the next year or two, though.