[cf. RETRO; Aytar et al., 2018] Effective decision making involves flexibly relating past experiences and relevant contextual information to a novel situation. In deep reinforcement learning, the dominant paradigm is for an agent to amortise information that helps decision-making into its network weights via gradient descent on training losses.
Here, we pursue an alternative approach in which agents can utilise large-scale context-sensitive database lookups to support their parametric computations. This allows agents to learn, end-to-end, to draw on relevant information to inform their outputs. In addition, new information can be attended to by the agent [an offline-style MuZero], without retraining, simply by augmenting the retrieval dataset.
We study this approach in 9×9 Go, a challenging game for which the vast combinatorial state space privileges generalisation over direct matching to past experiences. We leverage fast, approximate nearest neighbor techniques [SCaNN] in order to retrieve relevant data from a set of tens of millions [n = 50m] of expert demonstration states [from AlphaZero].
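The retrieval step described above can be sketched as a similarity search over embedded demonstration states. The snippet below is a minimal, hypothetical stand-in that uses brute-force cosine similarity in place of SCaNN's approximate search; all names (`embed_dim`, `database`, `retrieve`) and sizes are illustrative, not taken from the paper, and the database here is random rather than AlphaZero states.

```python
import numpy as np

# Illustrative brute-force nearest-neighbour lookup standing in for SCaNN's
# approximate search. The paper searches ~50M embedded expert states; here we
# use a tiny random database purely to show the retrieval interface.
rng = np.random.default_rng(0)

embed_dim = 16
num_states = 1000
num_neighbours = 4

# Unit-normalised embeddings of (stand-in) expert demonstration states.
database = rng.normal(size=(num_states, embed_dim)).astype(np.float32)
database /= np.linalg.norm(database, axis=1, keepdims=True)

def retrieve(query, k=num_neighbours):
    """Return indices of the k most similar database states (cosine similarity)."""
    query = query / np.linalg.norm(query)
    scores = database @ query          # similarity of the query to every state
    return np.argsort(-scores)[:k]     # top-k, highest similarity first

query = rng.normal(size=embed_dim).astype(np.float32)
neighbours = retrieve(query)
```

Note that adding rows to `database` immediately changes what can be retrieved, which is the mechanism behind attending to new information without retraining; a production system would instead rebuild or update an approximate index such as SCaNN's.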
Attending to this information yields a substantial boost in prediction accuracy and game-play performance over simply using these demonstrations as training trajectories, a compelling demonstration of the value of large-scale retrieval in reinforcement learning agents.
Figure 2: Details of the architecture used for a retrieval-augmented Go playing agent. A pre-trained network is used to generate a query qt corresponding to the current Go game state ot. This query is used for fast approximate nearest-neighbor retrieval using SCaNN. Retrieved neighbors xtn are processed using an invariant architecture and used to inform an action-conditional recurrent forward model that outputs game outcome predictions v̂k and distributions over next actions π̂k.