[Twitter, blog; replication] In reinforcement learning from human feedback, it is common to optimize against a reward model trained to predict human preferences. Because the reward model is an imperfect proxy, optimizing its value too much can hinder ground truth performance, in accordance with Goodhart’s law. This effect has been frequently observed, but not carefully measured due to the expense of collecting human preference data.
In this work, we use a synthetic setup in which a fixed “gold-standard” reward model plays the role of humans, providing labels used to train a proxy reward model. We study how the [InstructGPT-style] gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-n sampling [rejection sampling].
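The best-of-n procedure described here is simple enough to sketch in a few lines. Below, `policy_sample` and `proxy_rm` are hypothetical toy stand-ins (the paper's actual policy and reward models are language models); the point is only that the proxy is used for selection while the policy itself is never updated:

```python
import random

def best_of_n(prompt, policy_sample, proxy_rm, n):
    """Best-of-n (rejection) sampling: draw n candidates from the
    unchanged initial policy and keep the one the proxy reward model
    scores highest. Only the selection step uses the proxy, so the
    policy itself is never updated."""
    candidates = [policy_sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: proxy_rm(prompt, c))

# Toy stand-ins (hypothetical): completions are scalar draws, and the
# proxy reward of a completion is just its value.
random.seed(0)
policy_sample = lambda prompt: random.gauss(0.0, 1.0)
proxy_rm = lambda prompt, completion: completion

pick = best_of_n("summarize:", policy_sample, proxy_rm, n=16)
```

In the paper's setup, the gold RM then scores `pick` to measure how much of the proxy's gains are real.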
We find that this relationship follows a different functional form depending on the method of optimization [We observe that RL requires substantially more KL distance from the initial policy to achieve the same (over)optimization.], and that in both cases its coefficients scale smoothly with the number of reward model parameters. We also study the effect on this relationship of the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup. [We find some tentative evidence that policy scaling does not affect overoptimization, and that KL penalties are equivalent to early stopping with our hyperparameter settings.]
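For reference, the functional forms the paper fits, with d defined as the square root of the KL divergence between the optimized and initial policies, are R_bon(d) = d(α_bon − β_bon·d) for best-of-n and R_RL(d) = d(α_RL − β_RL·log d) for RL; the coefficients α and β depend on RM size and method. A minimal sketch:

```python
import math

def gold_reward_bon(d, alpha, beta):
    """BoN form: R(d) = d * (alpha - beta * d),
    where d = sqrt(KL from the initial policy)."""
    return d * (alpha - beta * d)

def gold_reward_rl(d, alpha, beta):
    """RL form: R(d) = d * (alpha - beta * log(d))."""
    return d * (alpha - beta * math.log(d))

# Both forms rise and then fall: gold reward peaks at a finite distance
# from the initial policy. Setting the derivative to zero gives
# d* = alpha / (2 * beta) for BoN, and d* = exp(alpha/beta - 1) for RL.
```

This rise-then-fall shape is exactly the overoptimization effect shown in Figure 1.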
We explore the implications of these empirical results for theoretical considerations in AI alignment. Finally, we find a correspondence between our functional forms and the Regressional [i.e. tails-come-apart/independence of bivariate maxima] and Extremal Goodhart categories of the Goodhart Taxonomy (Manheim & Garrabrant, 2018). We also analyze the implications of our forms for iterated RLHF; we predict that it reduces extremal Goodharting.
Figure 1: Reward model (RM) parameter size scaling experiments using the InstructGPT environment. Policy size is held constant (1.2B), while reward model size is varied. The x-axes have a square-root scale. Note that the plots have different x-axes. The gold reward represents the ground truth reward; we observe that when we optimize for a learned proxy of the gold reward, the gold reward initially increases and later decreases. We show that our functional forms fit this effect well.
…3.3 Scaling with RM Data Size
We hold RM size constant (12M) and sweep RM data size for both RL and BoN. Overall, the results are consistent with intuition: more data leads to better gold scores and less Goodharting. The scaling of α and β with data size is not as cleanly described as for RM size scaling (Figure 17, Figure 18).
For all RM sizes, we observe that for amounts of data <2,000 comparisons, there is very little improvement over near-chance loss (Figure 6). This is also reflected in gold scores after optimization (Figure 21). After this threshold, all models improve with more data, though larger RMs generally improve faster. Interestingly, although larger RMs achieve better gold scores overall, they do not appear to reach this critical threshold at substantially less data than smaller models. This result contradicts some other internal findings, so it may be an artifact of this particular setup.
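The "near-chance loss" here refers to the standard pairwise comparison objective used to train RLHF reward models (InstructGPT-style training, as in this paper's setup); a minimal sketch showing why chance level is ln 2 ≈ 0.693:

```python
import math

def pairwise_rm_loss(r_chosen, r_rejected):
    """Cross-entropy loss on a single comparison (here labeled by the
    gold RM rather than a human), the standard RLHF reward-model
    objective: loss = -log(sigmoid(r_chosen - r_rejected)).
    A reward model that cannot distinguish the pair (equal scores)
    sits at the chance loss of ln(2) ~= 0.693."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

chance_loss = pairwise_rm_loss(0.0, 0.0)  # ln(2)
```

Loss below ln 2 means the RM has learned something beyond coin-flipping; the <2,000-comparison regime above is where models fail to get meaningfully below this level.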
…RL is far less KL-efficient than BoN. Viewing KL distance as a resource to be spent, we observe that RL “consumes” far more KL than BoN: with RL, both optimization and overoptimization require more KL to occur. Intuitively, BoN searches very locally around the initial policy, so KL_bon increases roughly as log(n). For RL, on the other hand, each step modifies the policy produced by the previous step, and KL increases approximately quadratically with step count in the absence of a KL penalty (Figure 16, Figure 14). One implication is that KL distance is an inadequate metric for the quantity of (over)optimization; we discuss this further in §4.1…Some perturbations to a policy are orthogonal to the reward signal and would increase KL without increasing either gold or proxy reward; conversely, extremely small but well-targeted perturbations could substantially change the policy’s behavior within a small KL budget.
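The log(n) behavior of best-of-n has a closed form, used in the paper following earlier RLHF work: KL_bon = log n − (n − 1)/n. A minimal sketch:

```python
import math

def kl_best_of_n(n):
    """Analytic KL divergence between the best-of-n output distribution
    and the base policy: KL_bon = log(n) - (n - 1)/n.
    It grows only logarithmically in n, which is why BoN is far more
    KL-efficient than RL (whose KL grows roughly quadratically in step
    count when no KL penalty is applied)."""
    return math.log(n) - (n - 1) / n
```

For example, even n = 256 spends under 5 nats of KL, whereas an unpenalized RL run can consume far more over training.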
…We expect extremal Goodharting to be primarily responsible for the non-monotonicity of the gold RM scores in this paper; it is captured mostly by the β term, which in the limit of optimization results in an unbounded loss of utility. This lends a natural interpretation to the smooth decrease in β with increased RM size, for both BoN and RL, as smooth improvement in model robustness (Figure 3).
Figure 3: The values of α_bon, β_bon, and β_RL in the BoN and RL overoptimization scaling laws for both proxy (dashed line) and gold (solid line) rewards as they scale with parameter count.
…4.2.4 Adversarial Goodhart: Adversarial Goodhart occurs when the policy actively manipulates the proxy. We do not expect the effects of adversarial Goodhart to be captured in this work, as the models involved are not powerful enough to implement adversarial strategies. However, given the constant improvement of ML capabilities, it is entirely plausible that ML systems will one day become capable enough to do so [Hubinger et al., 2019]. When this occurs, the scaling laws observed in this paper may break down. Thus, we advise caution when using these results for extrapolation.