“Large Language Models Play StarCraft II: Benchmarks and A Chain of Summarization Approach”, 2023-12-19:
StarCraft II is a challenging benchmark for AI agents because it demands both precise micro-level operations and strategic macro awareness. Previous works, such as AlphaStar and SCC, achieve impressive performance in StarCraft II; however, they still exhibit deficiencies in long-term strategic planning and strategy interpretability. Emerging large language model (LLM) agents, such as Voyager and MetaGPT, show immense potential for solving intricate tasks. Motivated by this, we aim to validate the capabilities of LLMs on StarCraft II, a highly complex RTS game.
To take full advantage of LLMs’ reasoning abilities, we first develop a textual StarCraft II environment, called TextStarCraft II, with which LLM agents can interact. Second, we propose a Chain of Summarization method, comprising single-frame summarization for processing raw observations and multi-frame summarization for analyzing game information, providing command recommendations, and generating strategic decisions.
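The two-stage scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper’s implementation: the observation keys, window size, and function names are all hypothetical, and the final prompt would be sent to an LLM in the real system.

```python
from collections import deque

def summarize_frame(obs: dict) -> str:
    """Single-frame summarization: compress one raw observation
    (hypothetical keys) into a short natural-language line."""
    return (f"t={obs['time']}s minerals={obs['minerals']} "
            f"gas={obs['gas']} army={obs['army_supply']} "
            f"enemy={obs.get('enemy_units', 'unseen')}")

class ChainOfSummarization:
    """Multi-frame summarization: keep the last k single-frame
    summaries and fold them into one strategic prompt for the LLM."""
    def __init__(self, k: int = 5):
        self.window = deque(maxlen=k)

    def observe(self, obs: dict) -> None:
        self.window.append(summarize_frame(obs))

    def build_prompt(self) -> str:
        history = "\n".join(self.window)
        return ("Recent game state:\n" + history +
                "\nAnalyze the situation, recommend commands, "
                "and state the next strategic decision.")
```

Because only the last k frame summaries are kept, the prompt stays within the LLM’s context window even over a long game, while still exposing recent trends (e.g. a shrinking mineral bank) that a single frame cannot show.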
Our experiment consists of two parts: first, an evaluation by human experts, assessing both the LLMs’ mastery of StarCraft II knowledge and the performance of LLM agents in the game; second, the in-game performance of LLM agents, encompassing aspects such as win rate and the impact of the Chain of Summarization. [fine-tuning GPT-3.5]
Experimental results demonstrate that: (1) LLMs possess the relevant knowledge and complex planning abilities needed to address StarCraft II scenarios; (2) human experts judge the performance of LLM agents to be close to that of an average player with 8 years of StarCraft II experience; (3) LLM agents are capable of defeating the built-in AI at the Harder (Lv5) difficulty level.
We have open-sourced the code and released demo videos of LLM agents playing StarCraft II.
[Example use: benchmarking slowed-down agents.]