“Gato: A Generalist Agent”, Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, Nando de Freitas, 2022-05-12:

[blog; followups will be scaled-up] Inspired by progress in large-scale language modeling [Decision Transformer], we apply a similar approach towards building a single generalist agent beyond the realm of text outputs.

The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm, and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.

Figure 1: A generalist agent. Gato can sense and act with different embodiments across a wide range of environments using a single neural network with the same set of weights. Gato was trained on 604 distinct tasks with varying modalities, observations and action specifications.

In this report we describe the model and the data, and document the current capabilities of Gato [at 0.08b, 0.36b, & 1.2b parameters].

…Given scaling law trends, the performance across all tasks including dialogue will increase with scale in parameters, data and compute. Better hardware and network architectures will allow training bigger models while maintaining real-time robot control capability. By scaling up and iterating on this same basic approach, we can build a useful general-purpose agent.

…We focus our training at the operating point of model scale that allows real-time control of real-world robots, currently around 1.2b parameters in the case of Gato. As hardware and model architectures improve, this operating point will naturally increase the feasible model size, pushing generalist models higher up the scaling law curve. For simplicity Gato was trained offline in a purely supervised manner; however, in principle, there is no reason it could not also be trained with either offline or online reinforcement learning (RL)…Training of the model is performed on a 16×16 TPU v3 slice for 1M steps with batch size 512 and token sequence length L = 1,024, which takes about 4 days.
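For scale, the stated hyperparameters imply roughly half a trillion tokens processed during pretraining, about a third of the ~1.5T tokens in the full dataset mix (a back-of-the-envelope sketch, not a figure from the paper):

```python
# Back-of-the-envelope arithmetic (not from the paper): total tokens processed
# during Gato pretraining, given the stated hyperparameters.
steps = 1_000_000      # 1M optimizer steps
batch_size = 512       # sequences per batch
seq_len = 1_024        # tokens per sequence (L)

tokens_processed = steps * batch_size * seq_len
print(f"{tokens_processed / 1e12:.2f}T tokens")  # 0.52T tokens
```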

[multi-task learning is another blessing of scale—compare Gato’s bitter simplicity to PopArt; cf. “Date Weakly General AI is Publicly Known”; discussion: LW, HN; Lennart Heim notes that for PaLM’s training cost, Chinchilla Gato would be 125b parameters.]

Figure 2: Training phase of Gato. Data from different tasks and modalities is serialized into a flat sequence of tokens, batched, and processed by a transformer neural network akin to a large language model. Masking is used such that the loss function is applied only to target outputs, i.e. text and various actions.
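The masking scheme in Figure 2 can be sketched as follows; this is a minimal illustration (not DeepMind’s code), assuming per-token cross-entropy with a boolean mask that is `True` only at text/action positions:

```python
import numpy as np

# Minimal sketch (not the paper's implementation) of Gato-style loss masking:
# per-token negative log-likelihood is computed at every position, but only
# positions whose targets are text or action tokens contribute to the loss;
# image tokens and agent observations are masked out.
def masked_nll(log_probs, targets, is_target):
    """log_probs: (T, V) log-softmax outputs; targets: (T,) token ids;
    is_target: (T,) bool, True for text/action tokens."""
    token_nll = -log_probs[np.arange(len(targets)), targets]
    mask = is_target.astype(np.float64)
    return (token_nll * mask).sum() / mask.sum()  # mean over supervised positions only
```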
Figure 8: Model size scaling laws results. In-distribution performance as a function of tokens processed for 3 model scales. Performance is first mean-aggregated within each separate control domain, and then mean-aggregated across all domains. We can see a consistent improvement as model capacity is increased for a fixed number of tokens.

Scaling Laws Analysis: In Figure 8, we analyze the aggregate in-distribution performance of the pretrained model as a function of the number of parameters in order to get insight into how performance could improve with increased model capacity. We evaluated 3 different model sizes (measured in parameter count): a 79M model, a 364M model, and a 1.18B model (Gato). We refer to §C for details on the 3 model architectures. Here, for all 3 model sizes we plot the normalized return as training progresses. To get this single value, for each task we calculate the performance of the model as a percentage of expert score (the same as done in §4.1). Then for each domain listed in Table 1 we average the percentage scores across all tasks for that domain. Finally, we mean-aggregate the percentage scores across all domains. We can see that for an equivalent token count, there is a substantial performance improvement with increased scale.
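The two-level aggregation described above (mean within each domain, then mean across domains) can be sketched directly; the domain names and scores below are illustrative, not from the paper:

```python
# Sketch of the normalized-return aggregation used in the scaling analysis:
# per-task percent-of-expert scores are first averaged within each domain,
# then the domain means are averaged. (Example values are hypothetical.)
def aggregate(scores_by_domain):
    """scores_by_domain: {domain: [percent-of-expert score per task]}."""
    domain_means = [sum(s) / len(s) for s in scores_by_domain.values()]
    return sum(domain_means) / len(domain_means)

example = {
    "Atari": [50.0, 150.0],  # % of expert score per task
    "BabyAI": [90.0],
}
print(aggregate(example))  # 95.0: mean(mean(50, 150)=100, 90)
```

Note that this weights each domain equally regardless of how many tasks it contains, so domains with few tasks (e.g. Sokoban) count as much as domains with hundreds.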

Figure 10: Robotics fine-tuning results. Left: Comparison of real robot Skill Generalization success rate averaged across test triplets for Gato, expert, and CRR trained on 35k expert episodes (upper bound). Right: Comparison of simulated robot Skill Generalization success rate averaged across test triplets for a series of ablations on the number of parameters, including scores for expert and a BC baseline trained on 5k episodes.
Table 1a: Datasets. Control datasets used to train Gato. Sample weight means the proportion of each dataset, on average, in the training sequence batches.
| Control environment | Tasks | Episodes | Approx. Tokens | Sample Weight |
|---|---:|---:|---:|---:|
| DM Lab | 254 | 16.4M | 194B | 9.35% |
| ALE Atari | 51 | 63.4K | 1.26B | 9.5% |
| ALE Atari Extended | 28 | 28.4K | 565M | 10.0% |
| Sokoban | 1 | 27.2K | 298M | 1.33% |
| BabyAI | 46 | 4.61M | 22.8B | 9.06% |
| DM Control Suite | 30 | 395K | 22.5B | 4.62% |
| DM Control Suite Pixels | 28 | 485K | 35.5B | 7.07% |
| DM Control Suite Random Small | 26 | 10.6M | 313B | 3.04% |
| DM Control Suite Random Large | 26 | 26.1M | 791B | 3.04% |
| Meta-World | 45 | 94.6K | 3.39B | 8.96% |
| Procgen Benchmark | 16 | 1.6M | 4.46B | 5.34% |
| RGB Stacking simulator | 1 | 387K | 24.4B | 1.33% |
| RGB Stacking real robot | 1 | 15.7K | 980M | 1.33% |
| Modular RL | 38 | 843K | 69.6B | 8.23% |
| DM Manipulation Playground | 4 | 286K | 6.58B | 1.68% |
| Playroom | 1 | 829K | 118B | 1.33% |
| Total | 596 | 63M | 1.5T | 85.3% |

Fine-tuning and Model Size: To better understand the benefit of large models for few-shot adaptation in robotics domains, we conducted an ablation on model parameter size. This section focuses on in-simulation evaluation. Figure 10 compares the full 1.18B parameter Gato with the smaller 364M and 79M parameter variants for varying amounts of fine-tuning data. Although the 364M model overfits on one episode, causing performance to drop, there is a clear trend towards better adaptation with fewer episodes as the number of parameters is scaled up. The 79M model performs clearly worse than its bigger counterparts. The results suggest that the larger model’s greater capacity allows it to use representations learned from the diverse training data at test time.

Table 1b: Datasets. Vision & language datasets. Sample weight means the proportion of each dataset, on average, in the training sequence batches.
| Vision / language dataset | Sample Weight |
|---|---:|
| MassiveText | 6.7% |
| M3W | 4% |
| ALIGN | 0.67% |
| MS-COCO Captions | 0.67% |
| Conceptual Captions | 0.67% |
| LTIP | 0.67% |
| OKVQA | 0.67% |
| VQAV2 | 0.67% |
| Total | 14.7% |

…As we model the data autoregressively, each token is potentially also a target label given the previous tokens. Text tokens, discrete and continuous values, and actions can be directly set as targets after tokenization. Image tokens and agent observations are not currently predicted in Gato, although that may be an interesting direction for future work. Targets for these non-predicted tokens are set to an unused value and their contribution to the loss is masked out…Because distinct tasks within a domain can share identical embodiments, observation formats and action specifications, the model sometimes needs further context to disambiguate tasks. Rather than providing e.g. one-hot task identifiers, we instead take inspiration from (Brown et al 2020; Sanh et al 2022; Wei et al 2021) and use prompt conditioning. During training, for 25% of the sequences in each batch, a prompt sequence is prepended, coming from an episode generated by the same source agent on the same task. Half of the prompt sequences are from the end of the episode, acting as a form of goal conditioning for many domains; and the other half are uniformly sampled from the episode. During evaluation, the agent can be prompted using a successful demonstration of the desired task, which we do by default in all control results that we present here.
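The prompt-conditioning sampling scheme described above (25% of sequences get a prompt; half goal-conditioned from the episode’s end, half uniformly sampled) can be sketched as follows; the function name and token representation are illustrative, not from the paper:

```python
import random

# Illustrative sketch (not the paper's implementation) of Gato's prompt
# conditioning: 25% of training sequences get a prompt prepended from an
# episode of the same task; half of those prompts are the episode's end
# (goal conditioning), the other half a uniformly sampled window.
def maybe_prepend_prompt(sequence, episode, prompt_len, rng=random):
    if rng.random() >= 0.25:               # 75% of sequences: no prompt
        return sequence
    if rng.random() < 0.5:                 # goal conditioning: episode's end
        prompt = episode[-prompt_len:]
    else:                                  # uniform window from the episode
        start = rng.randrange(0, max(1, len(episode) - prompt_len + 1))
        prompt = episode[start:start + prompt_len]
    return prompt + sequence
```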

Figure 5: Gato’s performance on simulated control tasks. Number of tasks where the performance of the pretrained model is above a percentage of expert score, grouped by domain. Here values on the x-axis represent a specific percentage of expert score, where 0 corresponds to random agent performance. The y-axis is the number of tasks where the pretrained model’s mean performance is equal to or above that percentage. That is, the width of each color band indicates the number of tasks where Gato’s mean performance is above a percentage of the maximum score obtained by a task-specific expert.
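The curve in Figure 5 is a simple threshold count; a minimal sketch with hypothetical percent-of-expert scores:

```python
# Illustrative computation of a Figure-5 style curve: for each expert-score
# threshold, count how many tasks meet or exceed it. (Scores are hypothetical.)
def tasks_above(scores, thresholds):
    return {t: sum(s >= t for s in scores) for t in thresholds}

scores = [10, 55, 80, 120]  # hypothetical %-of-expert scores per task
print(tasks_above(scores, [0, 50, 100]))  # {0: 4, 50: 3, 100: 1}
```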

…In ALE Atari (Bellemare et al 2013), Gato achieves average-human or better scores on 23 Atari games, and over twice the human score on 11 games. While the single-task online RL agents which generated the data still outperform Gato, this gap may be closed by adding capacity or by using offline RL training rather than purely supervised learning (see §5.5, where we present a specialist single-domain ALE Atari agent achieving better-than-human scores on 44 games).

…As mentioned earlier, transfer in Atari is challenging. Rusu et al 2016 researched transfer between randomly selected Atari games. They found that Atari is a difficult domain for transfer because of pronounced differences in the visuals, controls and strategy among the different games. Further difficulties that arise when applying behavior cloning to video games like Atari are discussed by Kanervisto et al 2020.