“Decision Transformer: Reinforcement Learning via Sequence Modeling”, 2021-06-02:
[interview; online DT] We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling.
Unlike prior approaches to RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return. Despite the simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
…Decision Transformer: autoregressive sequence modeling for RL: We take a simple approach: each modality (return, state, or action) is passed into an embedding network (convolutional encoder for images, linear layer for continuous states). The embeddings are then processed by an autoregressive transformer model, trained to predict the next action given the previous tokens using a linear output layer. Evaluation is also easy: we can initialize by a desired target return (eg. 1 or 0 for success or failure) and the starting state in the environment. Unrolling the sequence—similar to standard autoregressive generation in language models—yields a sequence of actions to execute in the environment.
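The sequence layout and evaluation loop described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. The names `returns_to_go`, `interleave`, and `rollout` are hypothetical, as are the `predict_action` and `env_step` callables, which stand in for a trained model and an environment. The sketch also assumes that the conditioning "return" is the return-to-go (the sum of future rewards), and that it is decremented by each observed reward during evaluation.

```python
from typing import Any, Callable, List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Sum of rewards from each timestep to the end of the episode."""
    rtg: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]


def interleave(rtg: Sequence[float], states: Sequence[Any],
               actions: Sequence[Any]) -> List[Tuple[str, Any]]:
    """Build the (return, state, action) token sequence, one triple per timestep."""
    tokens: List[Tuple[str, Any]] = []
    for g, s, a in zip(rtg, states, actions):
        tokens += [("return", g), ("state", s), ("action", a)]
    return tokens


def rollout(predict_action: Callable[[List[float], List[Any], List[Any]], Any],
            env_step: Callable[[Any, Any], Tuple[Any, float, bool]],
            start_state: Any, target_return: float,
            max_steps: int) -> Tuple[List[Any], List[float]]:
    """Evaluation: start from a target return and the initial state, then unroll.

    Assumption: after each step the remaining target is reduced by the reward
    just received, so the model is always conditioned on what is left to earn.
    """
    rtg, states, actions = [target_return], [start_state], []
    for _ in range(max_steps):
        action = predict_action(rtg, states, actions)  # hypothetical model call
        next_state, reward, done = env_step(states[-1], action)
        actions.append(action)
        if done:
            break
        rtg.append(rtg[-1] - reward)
        states.append(next_state)
    return actions, rtg
```

For training, `interleave(returns_to_go(rewards), states, actions)` would give the token sequence for one logged trajectory, with the action tokens serving as prediction targets.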
…Sequence modeling as multitask learning: One effect of this type of modeling is that we perform conditional generation, where we initialize a trajectory by inputting our desired return. Decision Transformer does not yield a single policy; rather, it models a wide distribution of policies. If we plot average achieved return against the target return of a trained Decision Transformer, we find distinct policies are learned that can reasonably match the target, trained only with supervised learning. Furthermore, on some tasks (such as Q✱bert and Seaquest), we find Decision Transformer can actually extrapolate outside of the dataset and model policies achieving higher return!
[Paper; Github; see also MuZero, “goal-conditioned” or “upside-down reinforcement learning” (such as “morethan” prompting), Shawn Presser’s GPT-2 chess model (& Cheng’s almost-DT chess transformer), value equivalent models, et al 2021 on ‘delusions’. Simultaneous work at BAIR invents Decision Transformer as Trajectory Transformer. Note that DT, being in the ‘every task is a generation task’ paradigm of GPT, lends itself nicely to preference learning simply by formatting human-ranked choices of a sequence.
The simplicity of this version of the control codes or ‘inline metadata trick’ (eg. CTRL) means it can be reused with almost any generative model where some measure of quality or reward is available (even if only self-critique like likelihood of a sequence eg. in Meena-style best-of ranking or inverse prompting): You have an architecture floorplan DALL·E 1? Use standard architecture software to score plans by their estimated thermal efficiency/sunlight/etc; prefix these scores, retrain, & decode for good floorplans maximizing thermal efficiency/sunlight. You have a regular DALL·E 1? Sample n samples per prompt, CLIP-rank the images, prefix their ranking, retrain… No useful CLIP? Then use the CogView self-text-captioning trick to turn generated images back into text, rank by text likelihood… Choose Your Own Adventure AI Dungeon game-tree? Rank completions by player choice, feed back in for preference learning… All of the work is done by the data, as long as the generative model is smart enough.]
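The "score, prefix, retrain, decode" recipe can be sketched as a data-formatting step. This is a rough illustration under stated assumptions, not a method taken from any of the cited systems. The `<|score=k|>` control-code format, the rank-bucketing scheme, and the function names `rank_buckets`, `prefix_with_score`, and `best_prompt` are all hypothetical. The scores are assumed to come from whatever external critic is available (a simulator, CLIP, sequence likelihood).

```python
from typing import List, Sequence


def rank_buckets(scores: Sequence[float], n_buckets: int = 10) -> List[int]:
    """Map each score to a rank bucket in 0..n_buckets-1 (higher = better)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    buckets = [0] * len(scores)
    for rank, i in enumerate(order):
        buckets[i] = rank * n_buckets // len(scores)
    return buckets


def prefix_with_score(samples: Sequence[str], scores: Sequence[float],
                      n_buckets: int = 10) -> List[str]:
    """Prepend a control code encoding each sample's rank bucket.

    The resulting strings are what the generative model would be retrained on.
    """
    buckets = rank_buckets(scores, n_buckets)
    return [f"<|score={b}|>{text}" for b, text in zip(buckets, samples)]


def best_prompt(prompt: str, n_buckets: int = 10) -> str:
    """At decoding time, condition on the top bucket to ask for high-scoring output."""
    return f"<|score={n_buckets - 1}|>{prompt}"
```

Using rank buckets rather than raw scores is one possible design choice: it keeps the control-code vocabulary small and makes the top code mean "best in the dataset" whatever the score scale.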