“Multi-Game Decision Transformers”, Kuang-Huei Lee, Ofir Nachum, Mengjiao Yang, Lisa Lee, Daniel Freeman, Winnie Xu, Sergio Guadarrama, Ian Fischer, Eric Jang, Henryk Michalewski, Igor Mordatch (2022-05-30):

[blog; cf. Gato] A long-standing goal of the field of AI is a strategy for compiling diverse experience into a highly capable, generalist agent. In the subfields of vision and language, this was largely achieved by scaling up Transformer-based models and training them on large, diverse datasets. Motivated by this progress, we investigate whether the same strategy can be used to produce generalist reinforcement learning agents.

Specifically, we show that a single transformer-based model—with a single set of weights—trained purely offline can play a suite of up to 46 Atari games simultaneously at close-to-human performance. When trained and evaluated appropriately, we find that the same trends observed in language and vision hold, including scaling of performance with model size and rapid adaptation to new games via fine-tuning.

We compare several approaches in this multi-game setting, such as online and offline RL methods and behavioral cloning, and find that our Multi-Game Decision Transformer models offer the best scalability and performance.

We release the pre-trained models and code to encourage further research in this direction. Additional information, videos, and code are available at: sites.google.com/view/multi-game-transformers.

…We find that we can train a single agent that achieves 126% of human-level performance simultaneously across all games after training on offline expert and non-expert datasets (see Figure 1). Furthermore, we see trends that mirror those observed in language and vision: rapid finetuning to never-before-seen games with very little data (§4.5), a power-law relationship between performance and model size (§4.4), and faster training progress for larger models.
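A minimal sketch of the two aggregation conventions behind these numbers: the standard Atari human-normalized score (0 = random play, 1 = human-level, so 126% means above-human in aggregate), and the interquartile mean (IQM) used for the cross-game comparisons in Figure 5. The example values are hypothetical, not taken from the paper.

```python
def human_normalized_score(score, random_score, human_score):
    """Standard Atari normalization: 0 = random play, 1 = human-level."""
    return (score - random_score) / (human_score - random_score)

def iqm(values):
    """Interquartile mean: the mean of the middle 50% of sorted values,
    a robust aggregate for comparing agents across many games."""
    xs = sorted(values)
    n = len(xs)
    lo, hi = n // 4, n - n // 4
    return sum(xs[lo:hi]) / (hi - lo)

# Hypothetical per-game values (illustrative only, not the paper's data):
print(human_normalized_score(score=3000.0, random_score=200.0, human_score=2400.0))
```

A value above 1.0 here means the agent beat the human reference on that game; the IQM then aggregates such per-game scores while discarding the top and bottom quartiles.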

Model Variants and Scaling: We base our decision transformer (DT) configuration on GPT-2 as summarized in Appendix B.1. We report results for DT-200M (a Multi-Game DT with 200M parameters) unless specified otherwise [training time: 8 days on 64 TPUv4]. The smaller variants are DT-40M and DT-10M. We set the sequence length to 4 game frames for all experiments, resulting in sequences of 156 tokens.
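The 156-token figure can be reconstructed from the tokenization, assuming (as I read the paper) that each 84×84 frame is split into 14×14 patches, giving 36 observation tokens, plus one return, one action, and one reward token per timestep:

```python
# Token budget per training sequence, assuming each 84x84 frame is
# split into 14x14 patches (6x6 = 36 observation tokens) and each
# timestep adds one return, one action, and one reward token.
frames = 4
patch_tokens = (84 // 14) ** 2          # 36 observation tokens per frame
tokens_per_timestep = patch_tokens + 3  # + return, action, reward
sequence_tokens = frames * tokens_per_timestep
print(sequence_tokens)  # 156, matching the reported sequence length
```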

Training and Fine-tuning: We train all Multi-Game DT models on TPUv4 hardware using the Jaxline framework for 10M steps, using the LAMB optimizer with a 3 × 10−4 learning rate, 4,000-step linear warm-up, no weight decay, gradient clipping at 1.0, β1 = 0.9, β2 = 0.999, and batch size 2,048. [They apparently do not scale compute or data along with the model size? This will understate scaling curves by bottlenecking along another dimension.]
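The stated learning-rate schedule (linear warm-up to the peak, with no decay mentioned) can be sketched as a plain function; the function name is mine, and the constant-after-warm-up behavior is an assumption where the paper is silent:

```python
def lamb_learning_rate(step, peak_lr=3e-4, warmup_steps=4_000):
    """Linear warm-up to the peak rate over warmup_steps, then held
    constant (no decay is mentioned in the reported hyperparameters)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr
```

The remaining reported hyperparameters (β1 = 0.9, β2 = 0.999, no weight decay, global-norm gradient clip of 1.0, batch size 2,048) would be passed to the LAMB optimizer itself.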

Figure 5: How model performance scales with model size, on training set games and novel games. (Impala) indicates using the Impala CNN architecture. (a) Scaling of IQM scores for all training games with different model sizes and architectures. (b) Scaling of IQM scores for all novel games after fine-tuning DT and CQL.

We investigate whether similar trends hold for interactive in-game performance—not just training loss—and show a similar power-law performance trend in Figure 5a. Multi-Game Decision Transformer performance reliably increases over 2 orders of magnitude, whereas the other methods either saturate, or have much slower performance growth. We also find that larger models train faster, in the sense of reaching higher in-game performance after observing the same number of tokens. We discuss these results in Appendix G.
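The power-law claim amounts to fitting score ≈ a·N^b (N = parameter count) by least squares in log-log space. A minimal sketch of that fit, with hypothetical placeholder scores for the three model sizes (not the paper's measurements):

```python
import math

def fit_power_law(sizes, scores):
    """Least-squares fit of log(score) = log(a) + b * log(size),
    returning (a, b) for the power law score = a * size**b."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(s) for s in scores]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b

# Hypothetical IQM scores for DT-10M/40M/200M (illustrative only):
a, b = fit_power_law([10e6, 40e6, 200e6], [0.45, 0.70, 1.12])
```

A positive exponent b on this fit is what "performance reliably increases over 2 orders of magnitude" corresponds to; saturating methods would show the log-log points bending away from a straight line.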

Figure 1: Aggregates of human-normalized scores across 41 Atari games. Grey bars are single-game specialist models while blue are generalists. We also report the performance of Deep Q-Network (DQN), Batch-Constrained Q-learning (BCQ), Behavioral Cloning (BC), and Conservative Q-Learning (CQL).

Appendix G: Effect of Model Size on Training Speed: It is believed that large transformer-based language models train faster than smaller models, in the sense that they reach higher performance after observing a similar number of tokens [37, 15]. We find this trend to hold in our setting as well. Figure 15 shows performance on two example games as multi-game training progresses. We see that larger models reach higher scores per number of training steps taken (and thus tokens observed).

Figure 15: Example game scores for different model sizes as multi-game training progresses.

…In this work, we do not attempt to predict future observations due to their non-discrete nature and the additional model capacity that would be required to generate images. However, building image-based forward prediction models of the environment has been shown to be a useful representation objective for RL [27, 26, 28]. We leave this for future investigation.

Figure 3: An overview of our decision transformer architecture.

4.6 Does multi-game decision transformer improve upon training data? We evaluate whether a decision transformer with expert action inference is capable of acting better than the best demonstrations seen during training. To do this, we look at the top-3 performing decision transformer rollouts, rather than the mean across all rollouts, so as to compare against the best demonstration rather than an average expert demonstration. We show the percentage improvement over the best demonstration score for individual games in Figure 7. We see large improvements over the training data in a number of games.
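One plausible reading of this metric, sketched in code (the function name and the choice to average the top-3 rollouts are mine; the paper only specifies "top 3 rollouts"):

```python
def top3_improvement_pct(rollout_scores, best_demo_score):
    """Percent improvement of the mean of the top-3 rollout scores over
    the best demonstration score in the training data; 0% means the
    agent merely matched the best demonstration."""
    top3 = sorted(rollout_scores, reverse=True)[:3]
    top3_mean = sum(top3) / len(top3)
    return 100.0 * (top3_mean - best_demo_score) / best_demo_score
```

With this definition, the 0% baseline in Figure 7 corresponds to the top-3 rollouts averaging exactly the best demonstration score.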

Figure 7: Percent of improvement of top-3 decision transformer rollouts over the best score in the training dataset. 0% indicates no improvement. Top-3 metric (instead of mean) is used to more fairly compare to the best—rather than expert average—demonstration score.

…We believe the trends suggest clear paths for future work—that, with larger models and larger suites of tasks, performance is likely to scale up commensurately.