“Trajectory Transformer: Reinforcement Learning As One Big Sequence Modeling Problem”, Michael Janner, Qiyang Colin Li, Sergey Levine (2021-06-03):

[blog; a simultaneous invention of Decision Transformer, with more emphasis on model-based learning (eg. exploration); see the Decision Transformer annotation for related work.]

Paper:

Reinforcement learning (RL) is typically concerned with estimating stationary policies or single-step models, leveraging the Markov property to factorize the problem in time. However, we can also view RL as a sequence modeling problem, with the goal being to predict a sequence of actions that leads to a sequence of high rewards. Viewed in this way, it is tempting to consider whether powerful, high-capacity sequence prediction models that work well in other domains, such as natural-language processing, can also provide simple and effective solutions to the RL problem.
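
[Spelled out (a sketch in assumed notation, not quoted from the paper): a trajectory is flattened into a single token stream and modeled autoregressively, exactly as in language modeling:

$$\tau = (s_1, a_1, r_1, \ldots, s_T, a_T, r_T) \mapsto (x_1, \ldots, x_N), \qquad P_\theta(\tau) = \prod_{i=1}^{N} P_\theta(x_i \mid x_{<i}),$$

where each $x_i$ is one discretized dimension of a state, action, or reward; “predicting actions that lead to high rewards” then reduces to sampling high-reward continuations from $P_\theta$.]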

To this end, we explore how RL can be reframed as “one big sequence modeling” problem, using state-of-the-art Transformer architectures to model distributions over sequences of states, actions, and rewards. Addressing RL as a sequence modeling problem greatly simplifies a range of design decisions: we no longer require separate behavior policy constraints, as is common in prior work on offline model-free RL, and we no longer require ensembles or other epistemic uncertainty estimators, as is common in prior work on model-based RL. All of these roles are filled by the same Transformer sequence model. In our experiments, we demonstrate the flexibility of this approach across long-horizon dynamics prediction, imitation learning, goal-conditioned RL, and offline RL.
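
[To make the framing concrete, here is a minimal sketch in PyTorch of the core recipe, with all names and hyperparameters assumed rather than taken from the authors' code: discretize each state/action/reward dimension into tokens, flatten trajectories into one stream, and train an ordinary autoregressive Transformer with next-token cross-entropy:

```python
# Minimal sketch (not the authors' code) of RL-as-sequence-modeling:
# discretize trajectories into tokens and train a causal Transformer on them.
import torch
import torch.nn as nn

VOCAB = 100   # tokens per dimension after uniform discretization (assumed)
DIM = 128     # model width (assumed)

def discretize(x, low, high, bins=VOCAB):
    """Map continuous values in [low, high] to integer tokens in [0, bins)."""
    x = (x - low) / (high - low)                      # normalize to [0, 1]
    return (x * (bins - 1)).round().long().clamp(0, bins - 1)

class TrajectoryLM(nn.Module):
    def __init__(self, vocab=VOCAB, dim=DIM, layers=4, heads=4, max_len=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_len, dim)
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.body = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                        # tokens: (batch, seq)
        n = tokens.shape[1]
        h = self.tok(tokens) + self.pos(torch.arange(n, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(n).to(tokens.device)
        return self.head(self.body(h, mask=mask))     # (batch, seq, vocab) logits

# Training: ordinary next-token cross-entropy over flattened
# (s_1, a_1, r_1, s_2, ...) token streams -- no RL-specific loss.
model = TrajectoryLM()
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
tokens = torch.randint(0, VOCAB, (8, 256))            # stand-in for real data
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
loss.backward(); opt.step()
```

Note that nothing in this loop is RL-specific: the offline dataset is treated exactly like a text corpus, which is the sense in which the behavior constraints and uncertainty estimators of prior offline/model-based methods become unnecessary.]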

…Replacing log-probabilities from the sequence model with reward predictions yields a model-based planning method that is surprisingly effective despite lacking most of the machinery usually required to make planning with learned models work.
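
[A corresponding sketch of the planning step (again an assumed implementation, not the paper's): beam search over the same sequence model, but scoring candidates by their summed predicted reward rather than by sequence log-probability. `model` and the interleaved token layout follow the training sketch above; `reward_positions` marks which generated tokens are reward tokens.

```python
# Sketch of beam-search planning scored by predicted reward (assumed code).
import torch

@torch.no_grad()
def plan(model, prefix, horizon, beam=32, expand=4, reward_positions=()):
    """Beam-search `horizon` tokens past `prefix` (shape (1, len)), keeping
    the `beam` candidates with the highest summed predicted reward."""
    seqs = prefix.repeat(beam, 1)                     # (beam, len) candidates
    scores = torch.zeros(beam, device=prefix.device)
    for step in range(horizon):
        probs = torch.softmax(model(seqs)[:, -1], dim=-1)  # next-token dist.
        tok = torch.multinomial(probs, expand)             # (beam, expand)
        seqs = torch.cat([seqs.repeat_interleave(expand, 0),
                          tok.reshape(-1, 1)], dim=1)      # grow candidates
        scores = scores.repeat_interleave(expand)
        if step in reward_positions:
            # A reward token was just generated; decode it back to a scalar.
            # Here the raw token index stands in for that decoding (assumption).
            scores = scores + seqs[:, -1].float()
        keep = scores.topk(beam).indices                   # prune to beam width
        seqs, scores = seqs[keep], scores[keep]
    return seqs[scores.argmax()]                           # highest-reward plan
```

With the interleaved (s, a, r) layout above, `reward_positions` would be every third generated token, and the plan's action tokens are read off the returned sequence.]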

Related Publication: Chen et al. concurrently proposed another sequence modeling approach to reinforcement learning [Decision Transformer]. At a high level, ours is more model-based in spirit and theirs is more model-free; this lets us evaluate Transformers as long-horizon dynamics models (eg. the humanoid predictions above) and lets them evaluate their policies in image-based environments (eg. Atari). We encourage you to check out their work as well.