“Vision-Language Models As a Source of Rewards”, Kate Baumli, Satinder Baveja, Feryal Behbahani, Harris Chan, Gheorghe Comanici, Sebastian Flennerhag, Maxime Gazeau, Kristian Holsheimer, Dan Horgan, Michael Laskin, Clare Lyle, Hussain Masoom, Kay McKinney, Volodymyr Mnih, Alexander Neitz, Fabio Pardo, Jack Parker-Holder, John Quan, Tim Rocktäschel, Himanshu Sahni, Tom Schaul, Yannick Schroecker, Stephen Spencer, Richie Steigerwald, Luyu Wang, Lei Zhang, 2023-12-14 (CLIP, imitation learning, offline RL, RL scaling):
Building generalist agents that can accomplish many goals in rich open-ended environments is one of the research frontiers for reinforcement learning. A key limiting factor for building generalist agents with RL has been the need for a large number of reward functions for achieving different goals.
We investigate the feasibility of using off-the-shelf vision-language models (VLMs) as sources of rewards for reinforcement learning agents. We show how rewards for visual achievement of a variety of language goals can be derived from the CLIP family of models and used to train RL agents to achieve those goals.
We showcase this approach in two distinct visual domains and present a scaling trend showing how larger VLMs lead to more accurate rewards for visual goal achievement, which in turn produces more capable RL agents.
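The recipe is simple enough to sketch concretely. The snippet below is a minimal illustration, not the authors' implementation: it assumes the public `openai/clip-vit-base-patch32` checkpoint via HuggingFace `transformers` (the paper uses its own, larger CLIP-family models), and the helper names, the use of other goal strings as distractors, and the 0.5 threshold are all hypothetical choices:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Open checkpoint used purely for illustration; the paper's reward models
# are larger CLIP-family VLMs (200M-1.4B parameters).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_goal_probability(frame, goal_text, distractor_texts):
    """P(goal | frame) under a softmax over the goal and distractor prompts.

    `frame` is a PIL image; `distractor_texts` (e.g. other tasks' goal
    strings) serve as contrastive negatives.
    """
    texts = [goal_text] + list(distractor_texts)
    inputs = processor(text=texts, images=frame, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, len(texts))
    return logits.softmax(dim=-1)[0, 0].item()

def clip_reward(frame, goal_text, distractor_texts, threshold=0.5):
    """Binary goal-achievement reward; the threshold value is a hypothetical choice."""
    prob = clip_goal_probability(frame, goal_text, distractor_texts)
    return 1.0 if prob > threshold else 0.0
```

Thresholding the contrastive probability yields a sparse 0/1 reward per frame; moving the threshold trades off exactly the precision and recall that Figure 4 (left) measures.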
Figure 4: Scaling reward model size.
(Left) Precision-Recall curves for varying VLM architecture and sizes on an offline fixed dataset of Playhouse trajectories.
(Right) Ground-truth returns on held-out Playhouse evaluation tasks over the course of training, for varying VLM reward-model sizes.
…We observe that increasing the size of the VLM used for the reward model (from 200M to 1.4B parameters) improves the precision-recall curves. Figure 4 (right) shows the ground-truth returns on held-out evaluation tasks (which are not given to the agent) over the course of training with VLM reward signals derived from different base models. We observe that the improved accuracy of the VLMs on offline datasets, when the VLM is used as the only reward signal, does translate to better agent performance on ground-truth evaluation metrics.
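The offline evaluation behind Figure 4 (left) reduces to standard binary-classification scoring: treat each labeled frame's VLM goal probability as a detector score and sweep the decision threshold. A sketch, assuming hypothetical `scores` and `labels` arrays taken from such a fixed labeled dataset:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

def reward_model_pr(scores, labels):
    """Precision-recall curve and summary score for a candidate reward model.

    `scores`: per-frame goal probabilities from the VLM (e.g. the
    `clip_goal_probability` values in the sketch above); `labels`:
    ground-truth 0/1 goal-achievement annotations from the offline dataset.
    """
    precision, recall, thresholds = precision_recall_curve(labels, scores)
    return precision, recall, average_precision_score(labels, scores)

# Toy usage with made-up numbers, just to show the interface:
p, r, ap = reward_model_pr(np.array([0.9, 0.2, 0.7, 0.1]), np.array([1, 0, 1, 0]))
print(f"average precision: {ap:.2f}")
```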
…within the training budget, we did not observe reward hacking of our VLM reward, i.e. the true reward dropping off while the proxy VLM reward continues to increase.
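Although no hacking appeared within the paper's training budget, the failure mode described (proxy reward rising while true reward falls) is straightforward to monitor for. A hypothetical check, assuming per-evaluation logs of both the proxy VLM return and the ground-truth return on held-out tasks:

```python
import numpy as np

def hacking_suspected(proxy_returns, true_returns, window=10):
    """Flag a recent window where the proxy trends up while truth trends down.

    Both arguments are sequences of returns logged at successive
    evaluation points over training; `window` is a tunable assumption.
    """
    p = np.asarray(proxy_returns[-window:], dtype=float)
    t = np.asarray(true_returns[-window:], dtype=float)
    steps = np.arange(window)
    # Compare simple linear trends (least-squares slopes) over the window.
    proxy_slope = np.polyfit(steps, p, 1)[0]
    true_slope = np.polyfit(steps, t, 1)[0]
    return proxy_slope > 0 and true_slope < 0
```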