Reward hacking, where RL agents exploit gaps in misspecified reward functions, has been widely observed but not yet systematically studied.
To understand how reward hacking arises, we construct 4 RL environments with misspecified rewards. We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time.
More capable agents often exploit reward misspecifications, achieving higher proxy reward and lower true reward than less capable agents. Moreover, we find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward.
Figure 2: Increasing the RL policy's model size decreases true reward on 3 selected environments.
The dashed red line indicates a phase transition.
Such phase transitions pose challenges to monitoring the safety of ML systems. To address this, we propose an anomaly detection task for aberrant policies and offer several baseline detectors.
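One simple baseline in this spirit can be sketched as follows. This is a minimal illustration, not a reimplementation of the paper's detectors: the function names, the threshold, and all probability values are hypothetical. It compares the target policy's action distributions against a trusted policy's on the same rollout states and flags large average divergence:

```python
from math import log

def kl(p, q):
    """KL divergence between two discrete action distributions."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))

def detect_aberrant(trusted_probs, target_probs, threshold=0.5):
    """Flag the target policy as aberrant when its action distributions
    diverge, on average, from a trusted policy's over the same states."""
    divs = [kl(p, q) for p, q in zip(trusted_probs, target_probs)]
    return sum(divs) / len(divs) > threshold

# Toy rollout over 3 states with 2 actions each (hypothetical numbers).
trusted  = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]
similar  = [[0.85, 0.15], [0.8, 0.2], [0.75, 0.25]]
aberrant = [[0.1, 0.9], [0.1, 0.9], [0.2, 0.8]]

print(detect_aberrant(trusted, similar))   # -> False (small divergence)
print(detect_aberrant(trusted, aberrant))  # -> True (large divergence)
```

A detector of this form only needs query access to both policies, which is why distribution-distance baselines are a natural starting point for monitoring.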
…The drop in true reward is sometimes quite sudden. We call these sudden shifts phase transitions, and mark them with dashed red lines in Figure 2. These quantitative trends are reflected in the qualitative behavior of the policies (§4.2), which typically also shifts at the phase transition.
…Atari River Raid: We create an ontological misspecification by rewarding the plane for staying alive as long as possible while shooting as little as possible: a "pacifist run". We then measure the game score as the true reward. We find that agents with more parameters typically maneuver more adeptly. Such agents shoot less frequently but survive much longer, accruing points (true reward) by passing checkpoints. In this case, therefore, the proxy and true rewards are well-aligned, so reward hacking does not emerge as capabilities increase.
We did, however, find that some agents exploited a bug in the simulator that halts the plane at the beginning of the level. The simulator clock advances but the plane itself never moves, so the agent achieves a high pacifist (proxy) reward.
…Addressing reward hacking is a first step towards developing human-aligned RL agents and one goal of ML safety (Hendrycks et al., 2021a). However, there has been little systematic work investigating when and how it tends to occur, or how to detect it before it goes awry. To remedy this, we study the problem of reward hacking across 4 diverse environments: traffic control (Wu et al., 2021), COVID response (Kompella et al., 2020), blood glucose monitoring (Fox et al., 2020), and the Atari game River Raid (Brockman et al., 2016). Within these environments, we construct 9 misspecified proxy reward functions (§3), instances of misweighting, incorrect ontology, or incorrect scope.
Using our environments, we study how increasing optimization power affects reward hacking, by training RL agents with varying resources such as model size, training time, action space resolution, and observation space noise (§4). We find that more powerful agents often attain higher proxy reward but lower true reward, as illustrated in Figure 1. Since the trend in ML is to increase resources exponentially each year (Littman et al., 2021), this suggests that reward hacking will become more pronounced in the future in the absence of countermeasures.
…we observe several instances of phase transitions. In a phase transition, the more capable model pursues a qualitatively different policy that sharply decreases the true reward. Figure 1 illustrates one example: an RL agent regulating traffic learns to prevent any cars from merging onto the highway in order to maintain a high average velocity for the cars on the straightaway.
Figure 1: An example of reward hacking when cars merge onto a highway.
A human-driver model controls the grey cars and an RL policy controls the red car. The RL agent observes positions and velocities of nearby cars (including itself) and adjusts its acceleration to maximize the proxy reward. At first glance, both the proxy reward and true reward appear to incentivize fast traffic flow. However, smaller policy models allow the red car to merge, whereas larger policy models exploit the misspecification by stopping the red car. When the red car stops merging, the mean velocity increases (merging slows down the more numerous grey cars). However, the mean commute time also increases (the red car is stuck). This exemplifies a phase transition: the qualitative behavior of the agent shifts as the model size increases.
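The divergence between the two metrics in this scenario can be made concrete with a toy calculation. All velocities, commute times, and car counts below are illustrative stand-ins, not values from the environment:

```python
# Hypothetical per-car velocities (m/s) and commute times (s) for six
# grey cars plus the red merging car (listed last in each sequence).

def mean_velocity(vels):
    return sum(vels) / len(vels)

def mean_commute(times):
    return sum(times) / len(times)

# Scenario A: the red car merges; merging briefly slows the grey cars.
merge_vels     = [24, 24, 23, 23, 22, 22, 20]
merge_commutes = [105, 105, 107, 107, 110, 110, 115]

# Scenario B: the red car is held back; grey cars flow freely,
# but the red car's commute is effectively unbounded.
block_vels     = [28, 28, 28, 28, 28, 28, 0]
block_commutes = [95, 95, 95, 95, 95, 95, 10_000]

print(mean_velocity(block_vels) > mean_velocity(merge_vels))    # True
print(mean_commute(block_commutes) > mean_commute(merge_commutes))  # True
```

Blocking the merge raises the proxy (mean velocity) while also raising the true cost (mean commute time): the proxy improves precisely because one car's outcome is sacrificed.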
Table 1: Reward misspecifications across our 4 environments.
"Misalign" indicates whether the true reward drops, and "Transition" indicates whether this corresponds to a phase transition (a sharp qualitative change). We observe 5 instances of misalignment and 4 instances of phase transitions. "Mis." denotes misweighting and "Ont." denotes ontological misspecification.
Misweighting: Suppose that the true reward is a linear combination of commute time and acceleration (for reducing carbon emissions). Downweighting the acceleration term thus underpenalizes carbon emissions. In general, misweighting occurs when the proxy and true reward capture the same desiderata, but differ on their relative importance.
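As a sketch of how misweighting can flip which policy looks best (the weights and trajectory statistics below are hypothetical, chosen only to illustrate the mechanism):

```python
# True reward penalizes commute time and total acceleration (a stand-in
# for carbon emissions); the proxy downweights the acceleration term.
W_TRUE, W_PROXY = 1.0, 0.1  # hypothetical acceleration penalty weights

def reward(commute_time, total_accel, accel_weight):
    return -(commute_time + accel_weight * total_accel)

# Two candidate driving styles with made-up trajectory statistics.
policies = {
    "smooth":     dict(commute_time=110, total_accel=20),
    "aggressive": dict(commute_time=100, total_accel=80),
}

proxy = {k: reward(**v, accel_weight=W_PROXY) for k, v in policies.items()}
true_ = {k: reward(**v, accel_weight=W_TRUE) for k, v in policies.items()}

print(max(proxy, key=proxy.get))  # -> aggressive
print(max(true_, key=true_.get))  # -> smooth
```

Both rewards penalize the same desiderata, yet the downweighted proxy ranks the high-emission policy first.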
Ontological: Congestion could be operationalized as either high average commute time or low average vehicle velocity. In general, ontological misspecification occurs when the proxy and true reward use different desiderata to capture the same concept.
Scope: If monitoring velocity over all roads is too costly, a city might instead monitor it only over highways, thus pushing congestion onto local streets. In general, scope misspecification occurs when the proxy measures desiderata over a restricted domain (e.g., time, space).
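A minimal sketch of the scope case, with illustrative speeds (none of these numbers come from the environment): measuring only the highway makes the system look healthy even when congestion has simply moved.

```python
# Proxy scope: highway only. True scope: all roads. Speeds in m/s.
highway_speeds = {"highway": [28, 27, 29]}
local_speeds   = {"local":   [8, 7, 9]}   # traffic pushed onto side streets

def mean_speed(road_speeds):
    vals = [v for speeds in road_speeds.values() for v in speeds]
    return sum(vals) / len(vals)

proxy = mean_speed(highway_speeds)                      # restricted domain
true  = mean_speed({**highway_speeds, **local_speeds})  # full domain

print(proxy)  # -> 28.0 (looks fine)
print(true)   # -> 18.0 (congestion moved, not removed)
```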