“Imitating Interactive Intelligence”, Josh Abramson, Arun Ahuja, Arthur Brussee, Federico Carnevale, Mary Cassin, Stephen Clark, Andrew Dudzik, Petko Georgiev, Aurelia Guy, Tim Harley, Felix Hill, Alden Hung, Zachary Kenton, Jessica Landon, Timothy Lillicrap, Kory Mathewson, Alistair Muldal, Adam Santoro, Nikolay Savinov, Vikrant Varma, Greg Wayne, Nathaniel Wong, Chen Yan, Rui Zhu (2020-12-10):

A common vision from science fiction is that robots will one day inhabit our physical spaces, sense the world as we do, assist our physical labours, and communicate with us through natural language. Here we study how to design artificial agents that can interact naturally with humans using the simplification of a virtual environment. This setting nevertheless integrates a number of the central challenges of artificial intelligence (AI) research: complex visual perception and goal-directed physical control, grounded language comprehension and production, and multi-agent social interaction. To build agents that can robustly interact with humans, we would ideally train them while they interact with humans. However, this is presently impractical. Therefore, we approximate the role of the human with another learned agent, and use ideas from inverse reinforcement learning to reduce the disparities between human-human and agent-agent interactive behavior. Rigorously evaluating our agents poses a great challenge, so we develop a variety of behavioral tests, including evaluation by humans who watch videos of agents or interact directly with them. These evaluations convincingly demonstrate that interactive training and auxiliary losses improve agent behavior beyond what is achieved by supervised learning of actions alone. Further, we demonstrate that agent capabilities generalize beyond literal experiences in the dataset. Finally, we train evaluation models whose ratings of agents agree well with human judgement, thus permitting the evaluation of new agent models without additional effort. Taken together, our results in this virtual environment provide evidence that large-scale human behavioral imitation is a promising tool for creating intelligent, interactive agents, and that the challenge of reliably evaluating such agents can be surmounted. See videos for an overview of the manuscript, training time-lapse, and human-agent interactions.
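
Figure 15 below compares a pure behavioural-cloning agent (“BC”) with a “BC+GAIL” agent. As a rough illustration of how supervised imitation can be combined with a GAIL-style discriminator reward (one standard way of applying ideas from inverse reinforcement learning), here is a minimal sketch; it is not the paper’s implementation, and `policy`, `discriminator`, and the batch fields are placeholder names:

```python
# Minimal sketch, not the paper's code: behavioural cloning plus a GAIL-style
# discriminator reward. `policy` and `discriminator` are assumed torch modules.
import torch
import torch.nn.functional as F

def bc_loss(policy, human_batch):
    """Supervised imitation: maximise the log-probability of human actions."""
    logits = policy(human_batch["observations"])            # (batch, num_actions)
    return F.cross_entropy(logits, human_batch["actions"])  # actions as class indices

def discriminator_loss(discriminator, human_batch, agent_batch):
    """Train the discriminator to tell human trajectories (1) from agent ones (0)."""
    human_logits = discriminator(human_batch["observations"], human_batch["actions"])
    agent_logits = discriminator(agent_batch["observations"], agent_batch["actions"])
    logits = torch.cat([human_logits, agent_logits])
    labels = torch.cat([torch.ones_like(human_logits), torch.zeros_like(agent_logits)])
    return F.binary_cross_entropy_with_logits(logits, labels)

def imitation_reward(discriminator, observations, actions):
    """Reward the agent for behaviour the discriminator mistakes for human."""
    with torch.no_grad():
        return F.logsigmoid(discriminator(observations, actions))
```

The imitation reward would then be maximised with an ordinary reinforcement-learning update, alongside the supervised loss.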

…Although the agents do not yet attain human-level performance, we will soon describe scaling experiments which suggest that this gap could be closed substantially simply by collecting more data…The scripted probe tasks are imperfect measures of model performance, but as we have shown above, they tend to be well correlated with model performance under human evaluation. With each doubling of the dataset size, performance grew by the same increment. The rate of improvement, in particular for instruction-following tasks, was larger for the BG·A model than for the B·A model. Generally, these results give us confidence that we could continue to improve the performance of the agents straightforwardly by increasing the dataset size.
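
The claim that performance grows by a constant increment per doubling of the dataset corresponds to a log-linear relationship, score ≈ a + b·log₂(N). A small sketch of such a fit, using made-up placeholder numbers rather than any measurements from the paper:

```python
# Illustration only: the data points below are invented placeholders, not results
# from the paper; they merely show the form of a "constant gain per doubling" fit.
import numpy as np

dataset_fractions = np.array([1.0, 2.0, 4.0, 8.0])      # hypothetical relative dataset sizes
probe_scores      = np.array([0.30, 0.38, 0.46, 0.54])  # hypothetical probe success rates

slope, intercept = np.polyfit(np.log2(dataset_fractions), probe_scores, deg=1)
print(f"score ≈ {intercept:.2f} + {slope:.2f} · log2(dataset size)")
# Naive extrapolation (with the usual caveats) to a 16× dataset:
print(f"predicted score at 16×: {intercept + slope * np.log2(16):.2f}")
```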

Figure 15: Scaling & Transfer. A. Scaling properties for 2 of our agents. The agents’ performance on the scripted probe tasks increased as we trained on more data. On instruction-following tasks in particular, the rate of this increase was higher for BC+GAIL than for BC (scatter points indicate seeds). B. Transfer learning across different language game prompts. Training on multiple language games simultaneously led to higher performance than training on each single prompt independently. C. Multitask training improved data efficiency. We held out episodes with instructions that contain the words “put”, “position” or “place” and studied how much of this data was required to learn to position objects in the room. When simultaneously trained on all language game prompts, using 1⁄8 of the Position data led to 60% of the performance achieved with all of it, compared to 7% if we used the Position data alone. D. Object-colour generalisation. We removed all instances of orange ducks from the data and environment, but left all other orange objects and all non-orange ducks. Performance on scripted tasks testing this particular object-colour combination was similar to baseline.
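
A hypothetical sketch of the data-efficiency setup in panel C: identify the episodes whose instructions mention positioning, keep only a fraction of them (e.g. 1⁄8), and either mix them with the remaining language-game data (multitask) or train on them alone. The episode fields and helper names are assumptions, not the paper’s pipeline:

```python
# Hypothetical sketch of the Figure 15C held-out setup; field names are assumptions.
import random

POSITION_WORDS = {"put", "position", "place"}

def is_position_episode(episode):
    words = set(episode["instruction"].lower().split())
    return bool(words & POSITION_WORDS)

def build_training_set(episodes, position_fraction=1/8, multitask=True, seed=0):
    position = [ep for ep in episodes if is_position_episode(ep)]
    other    = [ep for ep in episodes if not is_position_episode(ep)]
    rng = random.Random(seed)
    kept = rng.sample(position, int(len(position) * position_fraction))
    return kept + (other if multitask else [])
```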

…After training, we asked the models to “Lift an orange duck” or “What color is the duck?”…Figure 15D shows that the agent trained without orange ducks performed almost as well on these restricted Lift and Color probe tasks as an agent trained with all of the data. These results demonstrate explicitly what our results elsewhere suggest: agents trained to imitate human action and language exhibit powerful combinatorial generalisation. Although they have never encountered an orange duck, they know what one is and how to interact with it when asked to do so for the first time. This particular example was chosen at random; we have every reason to believe that similar effects would be observed for other compound concepts.
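
A hypothetical sketch of the held-out split and probe behind Figure 15D: remove every training episode involving an orange duck (while keeping other orange objects and non-orange ducks), then evaluate the trained agent on probe instructions such as “Lift an orange duck”. The field names and the environment hook are assumptions, not the paper’s code:

```python
# Hypothetical sketch of the orange-duck generalisation test; names are assumptions.
PROBE_INSTRUCTIONS = ["Lift an orange duck", "What color is the duck?"]

def involves_orange_duck(episode):
    return any(obj["shape"] == "duck" and obj["colour"] == "orange"
               for obj in episode["objects"])

def split_for_generalisation_test(episodes):
    train    = [ep for ep in episodes if not involves_orange_duck(ep)]
    held_out = [ep for ep in episodes if involves_orange_duck(ep)]
    return train, held_out

def probe_success_rate(run_probe_episode, n_trials=100):
    """`run_probe_episode(instruction) -> bool` is an assumed environment hook."""
    outcomes = [run_probe_episode(instruction)
                for instruction in PROBE_INSTRUCTIONS
                for _ in range(n_trials)]
    return sum(outcomes) / len(outcomes)
```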