“Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos”, Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, Jeff Clune (2022-06-23):

[blog; code/models; Twitter; Kilcher video] Pretraining on noisy, internet-scale datasets has been heavily studied as a technique for training models with broad, general capabilities for text, images, and other modalities. However, for many sequential decision domains such as robotics, video games, and computer use, publicly available data does not contain the labels required to train behavioral priors in the same way.

We extend the internet-scale pretraining paradigm to sequential decision domains through semi-supervised imitation learning, wherein agents learn to act by watching unlabeled online [YouTube] videos. Specifically, we show that with a small amount of labeled data we can train an inverse dynamics model accurate enough to label a huge unlabeled source of online data—here, online videos of people playing Minecraft—from which we can then train a general behavioral prior.
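The pipeline described above can be sketched as follows. This is a toy illustration, not the released VPT code: the "IDM" here is a nearest-centroid classifier on brightness change between consecutive frames, standing in for the paper's 0.5B-parameter temporal-conv + ResNet IDM, and all function names are assumptions.

```python
import numpy as np

def train_idm(labeled_frame_pairs, labeled_actions):
    # Toy "inverse dynamics model": given (frame_t, frame_t+1) pairs with
    # ground-truth actions (the small contractor dataset), learn to predict
    # the action from the change in mean pixel intensity.
    deltas = np.array([f2.mean() - f1.mean() for f1, f2 in labeled_frame_pairs])
    actions = np.array(labeled_actions)
    # One centroid per action; prediction is nearest-centroid on the delta.
    centroids = {a: deltas[actions == a].mean() for a in set(labeled_actions)}

    def idm(f1, f2):
        d = f2.mean() - f1.mean()
        return min(centroids, key=lambda a: abs(centroids[a] - d))

    return idm

def pseudo_label(idm, unlabeled_videos):
    # Run the IDM over consecutive frame pairs of each unlabeled video,
    # producing (frame, pseudo-action) pairs for behavioral cloning.
    data = []
    for video in unlabeled_videos:
        for f1, f2 in zip(video, video[1:]):
            data.append((f1, idm(f1, f2)))
    return data
```

The key asymmetry the paper exploits is that the IDM sees both the frame before and after an action (a much easier, non-causal prediction problem), so a small labeled dataset suffices to train it, and its pseudo-labels then unlock the large unlabeled corpus for BC.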

Despite using the native human interface (mouse and keyboard at 20Hz), we show that this behavioral prior has nontrivial zero-shot capabilities and that it can be fine-tuned, with both imitation learning and reinforcement learning, to hard-exploration tasks that are impossible to learn from scratch via reinforcement learning.
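To make the "native human interface at 20 Hz" concrete, here is a minimal sketch of what an action container and rollout loop at that rate might look like; the class and function names are illustrative assumptions, not VPT's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NativeAction:
    # Hypothetical container for one environment step at the native human
    # interface: the set of held keys plus a relative mouse movement.
    keys: frozenset
    mouse_dx: int
    mouse_dy: int

def rollout(policy, env_step, n_steps, hz=20):
    # Query the policy once per tick; at 20 Hz each tick covers 50 ms of
    # gameplay. `policy` and `env_step` are assumed callables.
    dt = 1.0 / hz
    trajectory, obs, t = [], None, 0.0
    for _ in range(n_steps):
        action = policy(obs)
        obs = env_step(action)
        trajectory.append((round(t, 3), action))
        t += dt
    return trajectory
```

At this rate, the 24,000 environment actions mentioned below correspond to 24,000 / 20 = 1,200 seconds, i.e. the ~20 minutes of gameplay a proficient human needs.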

For many tasks our models exhibit human-level performance, and we are the first to report computer agents that can craft diamond tools, which can take proficient humans upwards of 20 minutes (24,000 environment actions) of gameplay to accomplish.

4.5 Data Scaling Properties of the Foundation Model: In this section we validate a core hypothesis behind this work: that it is far more effective to use labeled contractor data to train an [0.5b-parameter temporal convolution + resnet] IDM within the VPT method than to directly train a BC foundation model from that same small [$2,000] contractor dataset. If we could cheaply collect a labeled contractor dataset of a similar order of magnitude as web_clean, this would not matter; however, collecting data at that scale would have cost millions of dollars. Figure 8 compares foundation models trained on increasing orders of magnitude of data, from 1 hour up to the full ~70k-hour web_clean dataset. Foundation models trained on up to and including 1k hours use the IDM contractor data; those trained on 5k hours and above use subsets of web_clean, which contains no IDM contractor data. Scaling training data increases log collection, mining, and crafting capabilities. The zero-shot model only begins crafting crafting tables at over 5,000 hours of training data. When fine-tuning each foundation model to contractor_house, we see that crafting rates for crafting tables and wooden tools increase by orders of magnitude when using the entire ~70k-hour web_clean dataset, and crafting of stone tools emerges only at the largest data scale.
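The Figure 8 sweep can be sketched as building nested subsets at roughly order-of-magnitude sizes; the hour counts below follow the paper, but the nested random sampling scheme is an assumption for illustration.

```python
import random

def data_scaling_subsets(total_hours, points=(1, 10, 100, 1000, 5000, 70000)):
    # Build nested subsets of the available hours so that each larger
    # training set contains every smaller one, holding the data
    # distribution fixed while only the quantity varies.
    rng = random.Random(0)  # fixed seed for a reproducible ordering
    order = list(range(total_hours))
    rng.shuffle(order)
    return {h: set(order[:h]) for h in points if h <= total_hours}
```

Nesting the subsets means differences between points on the curve reflect data quantity rather than which particular videos happened to be sampled.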

Figure 8: (1) Zero-shot rollout performance of foundation models trained on varying amounts of data. Models to the left of the dashed black line (points ≤1k hours) were trained on contractor data (ground-truth labels), and models to the right were trained on IDM pseudo-labeled subsets of web_clean. Due to compute limitations, this analysis was performed with smaller (71 million parameter) models except for the final point, which is the 0.5 billion parameter VPT foundation model. (2) The corresponding performance of each model after BC fine-tuning to the contractor_house dataset.

H. Foundation Model Scaling: In early experiments we found that increasing model size led to models staying in the efficient learning regime longer into training. Here we compare the 0.5B model described in §4.2 to 248M and 71M parameter models. Both of these models were trained for 15 epochs, compared to the 30 epochs used for the 0.5B model. They have the same architecture as the 0.5B model, but each layer in the 248M parameter model has 1⁄2 the width and each layer in the 71M parameter model has 1⁄3 the width. The 71M model was trained with an initial learning rate of 0.001586, batch size of 480, and weight decay of 0.044506; the 248M model with an initial learning rate of 0.001831, batch size of 640, and weight decay of 0.051376.
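As a rough illustration of how layer width drives parameter count, consider a plain MLP stand-in (the paper's actual architecture has width-independent components, so its reported totals of 0.5B/248M/71M need not follow this toy formula exactly): the dominant width×width terms shrink quadratically when width is halved, while input, output, and bias terms shrink only linearly.

```python
def mlp_params(width, depth, in_dim, out_dim):
    # Parameter count of a plain MLP used as a stand-in: one in->width
    # projection, (depth - 1) width->width hidden layers, and one
    # width->out head, all with biases.
    p = in_dim * width + width                  # input projection
    p += (depth - 1) * (width * width + width)  # hidden layers: quadratic in width
    p += width * out_dim + out_dim              # output head
    return p
```

For a deep, wide stack the quadratic terms dominate, so halving the width cuts the total by somewhat less than 4×, which is one reason real models at 1⁄2 and 1⁄3 width land at parameter counts between the linear and quadratic predictions.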

In Figure 18 we show validation loss on web_clean with IDM pseudo-labels, loss on the contractor dataset used to train the IDM (with ground-truth labels collected during contractor play), and zero-shot environment performance for the 71M, 248M, and 0.5B models. While larger models have better validation loss on web_clean, these results do not tell a clear story that the 0.5B model is better than its smaller counterparts. The 71M model has the lowest contractor dataset loss despite having the highest web_clean loss, and it also has the best zero-shot environment performance; in fact, the 71M model even had non-zero wooden tool crafting (Figure 18 bottom left). The 248M model also appears to be better at crafting than the 0.5B model and has lower contractor dataset loss.

While the zero-shot results suggest smaller models are better, fine-tuning tells another story. When fine-tuning to contractor_house, the model size rank ordering reverses: the 0.5B model now performs best both in validation loss (Figure 19 left) and in environment performance (Figure 19 right), followed by the 248M model and then the 71M model. Model rollouts in the environment are performed with the same game engine we use to collect contractor data, whose visuals can differ from those of videos taken from the web. It is plausible that the larger models over-focus on the visual peculiarities of web data during pretraining, since they have worse contractor data loss (Fig. 18 top middle), and that this causes them to perform more poorly in the environment zero-shot. However, we hypothesize that because the contractor_house dataset we fine-tune to is collected from our game engine, the larger models, which form a better overall Minecraft prior (as indicated by lower web_clean validation loss in Fig. 18 top left), can quickly shift their low-level features to perform better on data coming from our game engine, resulting in better environment rollout performance. This hypothesis is further supported by Figure 19 (middle), which shows loss on the contractor dataset collected for IDM training, a dataset with no overlap with contractor_house. After just a few steps of fine-tuning to contractor_house, all models quickly improve in loss on the full IDM contractor dataset, with the larger models now performing best. While not conclusive, we believe this investigation provides some intuition for future studies of model scaling for sequential decision-making problems.