[blog; previously: “Deep neuroethology of a virtual rodent”, Merel et al., 2020; “Grounded Language Learning Fast and Slow”, Hill et al., 2020.] Intelligent behavior in the physical world exhibits structure at multiple spatial and temporal scales. Although movements are ultimately executed at the level of instantaneous muscle tensions or joint torques, they must be selected to serve goals defined on much longer timescales, and in terms of relations that extend far beyond the body itself, ultimately involving coordination with other agents. Recent research in artificial intelligence has shown the promise of learning-based approaches to the respective problems of complex movement, longer-term planning and multi-agent coordination. However, there is limited research aimed at their integration.
We study this problem by training teams of physically simulated humanoid avatars to play football in a realistic virtual environment. We develop a method that combines imitation learning from motion capture of human football players, single-agent and multi-agent reinforcement learning, and population-based training, and that makes use of transferable representations of behavior for decision making at different levels of abstraction.
In a sequence of stages, players first learn to control a fully articulated body to perform realistic, human-like movements such as running and turning; they then acquire mid-level football skills such as dribbling and shooting; finally, they develop awareness of others and play as a team, bridging the gap between low-level motor control at a timescale of milliseconds, and coordinated goal-directed behavior as a team at the timescale of tens of seconds.
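The staged curriculum above can be sketched as a simple pipeline. This is an illustrative outline only: the stage functions and the list-based “policy” are hypothetical stand-ins for readability, not the authors’ actual training API.

```python
# Hypothetical sketch of the three-stage curriculum described above.

def imitate_mocap(policy):
    """Stage 1: imitation learning from motion capture yields a low-level
    controller for realistic, human-like movement (running, turning)."""
    return policy + ["low-level motor control"]

def train_drills(policy):
    """Stage 2: single-agent reinforcement learning on drills adds
    mid-level football skills."""
    return policy + ["dribbling", "shooting"]

def play_matches(policy):
    """Stage 3: multi-agent reinforcement learning with population-based
    training produces awareness of others and coordinated team play."""
    return policy + ["team play"]

def train_player():
    """Run the stages in sequence, each building on the previous one."""
    return play_matches(train_drills(imitate_mocap([])))
```

Each stage reuses the representations learned in the previous one, which is how the method bridges millisecond-scale motor control and team-level behavior over tens of seconds.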
We investigate the emergence of behaviors at different levels of abstraction, as well as the representations that underlie these behaviors using several analysis techniques, including statistics from real-world sports analytics.
Our work constitutes a complete demonstration of integrated decision-making at multiple scales in a physically embodied multi-agent setting. See project video.
Figure 5: (A) Agent performance, measured by Elo score against a set of pre-trained evaluation agents, increases as the agents learn football behaviors. Counterfactual policy divergence by entity: early in training, the ball (blue curve) induces most of the divergence in the agent’s policy; other players exert progressively more influence on the policy as training progresses. Pass-value correlation increases for both passer and receiver over training as coordination improves. The probe score drops below 50% early in training, but improves to 60% as the agents learn coordinated strategies and identify the value of teammate possession.
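Counterfactual policy divergence asks how much the agent’s policy depends on a given entity: replace that entity’s features with a resampled value and measure how far the action distribution moves. A minimal sketch of the idea, assuming a policy that maps an observation dictionary to a discrete action distribution (the `policy`, `obs`, and `resample` interfaces here are illustrative assumptions, not the paper’s implementation):

```python
import numpy as np

def counterfactual_divergence(policy, obs, entity, resample):
    """KL divergence between the policy's action distribution at the true
    observation and at a counterfactual observation in which one entity's
    features are replaced by a resampled value."""
    p = policy(obs)                        # action distribution, sums to 1
    cf_obs = dict(obs)                     # shallow copy of the observation
    cf_obs[entity] = resample(obs[entity]) # perturb only the chosen entity
    q = policy(cf_obs)
    return float(np.sum(p * np.log(p / q)))
```

Averaging such a quantity over states and resamples yields a per-entity influence score of the kind tracked over training in panel (A).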
(B) Emergence of behaviors and abilities over training. Early in training (up to 1.5 billion environment steps, or roughly 24 hours of training), running speed and possession increase rapidly and the ability to get up is effectively perfected. Division of labour decreases in this early phase as agents prioritize possession and learn uncoordinated ball-chasing behaviors. After 1.5 billion environment steps a transition occurs: division of labour improves and behavior shifts from individualistic ball chasing to coordinated play. In this second phase, passing frequency, passing range and receiver OBSO (off-ball scoring opportunity) increase substantially.
(C) Division of labour and passing plays: solid/dashed lines indicate past/future trajectories of the red and blue players and the ball (black line). The two left frames show the moment of the pass; the receiver turns to anticipate an upfield kick before the pass, leaving its teammate to control the ball. The rightmost frame shows the point of reception.
(D) Typical probe task initialization, with blue player 1 (the “passer”) initialized in its own half, player 2 (the “receiver”) initialized on a wing, and two defenders in the centre. Right: receiver value (scoring channel) as a function of future ball position on the pitch, with regions of high value in green and low value in red. Left: passer value function. Both receiver and passer register higher value when the ball travels to the right wing, where the receiver is positioned.
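Value maps like those in panel (D) can be produced by querying a player’s value function over a grid of hypothetical future ball positions. A toy sketch, where `value_fn` is a hypothetical stand-in for the agent’s learned value head and the pitch dimensions are illustrative defaults:

```python
import numpy as np

def value_heatmap(value_fn, pitch_length=105.0, pitch_width=68.0, n=20):
    """Evaluate value_fn(x, y) on an n-by-n grid of ball positions,
    returning an array with rows indexed by y and columns by x."""
    xs = np.linspace(0.0, pitch_length, n)
    ys = np.linspace(0.0, pitch_width, n)
    return np.array([[value_fn(x, y) for x in xs] for y in ys])
```

Rendering such a grid as a heatmap (high values in green, low in red, as in the figure) shows which regions of the pitch the passer and receiver each consider valuable destinations for the ball.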
…A schematic of our infrastructure is provided in Figure 4. Learning is performed on a central 16-core TPU-v2 machine, with one core used for each player in the population. Model inference occurs on 128 inference servers, each providing inference-as-a-service in response to inbound requests identified by a unique model name. Concurrent requests for the same model are batched automatically, so an additional request incurs negligible marginal cost. Policy-environment interactions are executed on a large pool of 4,096 CPU actor workers, which connect to a central orchestrator machine that schedules the matches.
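The batched inference-as-a-service pattern can be sketched as follows. This is a minimal illustration under assumed interfaces, not the actual serving stack: requests queued under the same model name are evaluated together in one forward pass, which is why an extra concurrent request adds almost no marginal cost.

```python
import threading
from collections import defaultdict

class BatchedInferenceServer:
    """Toy sketch: queue concurrent requests per model name and evaluate
    each queue in a single batched call. `run_model` is a hypothetical
    stand-in for the real network forward pass."""

    def __init__(self, run_model):
        self.run_model = run_model        # callable: (name, [obs]) -> [outputs]
        self.pending = defaultdict(list)  # model name -> queued observations
        self.lock = threading.Lock()

    def request(self, model_name, observation):
        """Queue one observation for the named model."""
        with self.lock:
            self.pending[model_name].append(observation)

    def flush(self, model_name):
        """Evaluate all queued requests for one model in a single batch."""
        with self.lock:
            batch = self.pending.pop(model_name, [])
        return self.run_model(model_name, batch)
```

In this sketch the actors would call `request` and a per-model loop would call `flush`; growing the batch by one observation costs far less than a separate forward pass.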