“The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models”, Alexander Pan, Kush Bhatia, Jacob Steinhardt (2022-01-10):

Reward hacking—where RL agents exploit gaps in misspecified reward functions—has been widely observed, but not yet systematically studied.

To understand how reward hacking arises, we construct 4 RL environments with misspecified rewards. We investigate reward hacking as a function of agent capabilities [across PPO/SAC/Impala agents]: model capacity, action space resolution, observation space noise, and training time.

More capable agents often exploit reward misspecifications, achieving higher proxy reward and lower true reward than less capable agents. Moreover, we find instances of phase transitions: capability thresholds at which the agent’s behavior qualitatively shifts, leading to a sharp decrease in the true reward.

Figure 2: Increasing the RL policy’s model size decreases true reward on 3 selected environments. The dashed red line indicates a phase transition.

Such phase transitions pose challenges to monitoring the safety of ML systems. To address this, we propose an anomaly detection task for aberrant policies and offer several baseline detectors.
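One plausible baseline for such an anomaly detection task can be sketched as follows. This is a minimal illustration assuming access to a trusted, lower-capability policy; the function names, the total-variation metric, and the threshold are illustrative assumptions, not necessarily the paper’s detectors:

```python
def total_variation(p, q):
    """Total variation distance between two discrete action distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def detect_aberrant_policy(trusted_probs, suspect_probs, threshold=0.3):
    """Flag the suspect policy as anomalous if its action distributions
    diverge, on average, from a trusted policy's over the same states.

    trusted_probs, suspect_probs: sequences of per-state action-probability
    vectors, paired by state (e.g. states visited in the trusted policy's
    rollouts). Returns (is_anomalous, mean_distance).
    """
    distances = [total_variation(p, q)
                 for p, q in zip(trusted_probs, suspect_probs)]
    score = sum(distances) / len(distances)
    return score > threshold, score
```

For example, a suspect policy that puts probability 0.9 on an action where the trusted policy puts 0.5 has distance 0.4 on that state and would be flagged under this (assumed) threshold.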

…The drop in true reward is sometimes quite sudden. We call these sudden shifts phase transitions, and mark them with dashed red lines in Figure 2. These quantitative trends are reflected in the qualitative behavior of the policies (§4.2), which typically also shift at the phase transition.

…Atari River Raid: We create an ontological misspecification by rewarding the plane for staying alive as long as possible while shooting as little as possible: a “pacifist run”. We then measure the game score as the true reward. We find that agents with more parameters typically maneuver more adeptly. Such agents shoot less frequently but survive for much longer, acquiring points (true reward) by passing checkpoints. In this case, therefore, the proxy and true rewards are well-aligned, so reward hacking does not emerge as capabilities increase.
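The “pacifist” proxy can be sketched as a simple per-frame reward. The bonus and penalty magnitudes below are assumptions for illustration; the true reward, by contrast, remains the game score, which is what makes this an ontological misspecification:

```python
def pacifist_proxy_reward(alive: bool, fired: bool,
                          alive_bonus: float = 1.0,
                          shot_penalty: float = 1.0) -> float:
    """Per-frame proxy reward for a River Raid 'pacifist run':
    +alive_bonus for surviving the frame, -shot_penalty for shooting.
    The true reward (the game score) is a different quantity entirely."""
    return (alive_bonus if alive else 0.0) - (shot_penalty if fired else 0.0)
```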

We did, however, find that some agents exploited a simulator bug that halts the plane at the beginning of the level: the simulation clock advances but the plane itself never moves, so the agent achieves a high pacifist reward without shooting.

…Addressing reward hacking is a first step towards developing human-aligned RL agents and one goal of ML safety (Hendrycks et al 2021a). However, there has been little systematic work investigating when or how it tends to occur, or how to detect it before it goes awry. To remedy this, we study the problem of reward hacking across 4 diverse environments: traffic control (Wu et al 2021), COVID response (Kompella et al 2020), blood glucose monitoring (Fox et al 2020), and the Atari game River Raid (Brockman et al 2016 [OpenAI Gym]). Within these environments, we construct 9 misspecified proxy reward functions (§3) [instances of misweighting, incorrect ontology, or incorrect scope].
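To make the misweighting category concrete, here is a toy sketch (the terms and weights are invented for illustration, not taken from the paper): a proxy that shares the true reward’s terms but weights them incorrectly can rank policies in the opposite order from the true reward:

```python
def true_reward(mean_velocity, mean_commute_time,
                w_vel=1.0, w_time=1.0):
    """Assumed true objective: fast traffic AND short commutes."""
    return w_vel * mean_velocity - w_time * mean_commute_time

def misweighted_proxy(mean_velocity, mean_commute_time,
                      w_vel=1.0, w_time=0.0):
    """Misweighting: the same terms as the true reward, but the
    commute-time term is effectively dropped."""
    return w_vel * mean_velocity - w_time * mean_commute_time

# Blocking merges: velocity rises slightly, commute time explodes.
blocked_state = (27.0, 60.0)   # (mean velocity, mean commute time)
merging_state = (24.0, 20.0)
assert misweighted_proxy(*blocked_state) > misweighted_proxy(*merging_state)
assert true_reward(*blocked_state) < true_reward(*merging_state)
```

Under the proxy, the blocking policy looks strictly better; under the true reward it is strictly worse.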

Using our environments, we study how increasing optimization power affects reward hacking, by training RL agents with varying resources such as model size, training time, action space resolution, and observation space noise (§4). We find that more powerful agents often attain higher proxy reward but lower true reward, as illustrated in Figure 1. Since the trend in ML is to increase resources exponentially each year (Littman et al 2021), this suggests that reward hacking will become more pronounced in the future in the absence of countermeasures.

…we observe several instances of phase transitions. In a phase transition, the more capable model pursues a qualitatively different policy that sharply decreases the true reward. Figure 1 illustrates one example: An RL agent regulating traffic learns to stop any cars from merging onto the highway in order to maintain a high average velocity of the cars on the straightaway.
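As a toy numerical illustration of this merging example (all velocities are invented; only the mechanism matches the paper’s description):

```python
def mean_velocity(velocities):
    """Mean velocity over all cars -- the proxy reward's key term."""
    return sum(velocities) / len(velocities)

# 10 human-driven grey cars and 1 RL-controlled red car.
# If the red car merges, it slows the more numerous grey cars:
with_merge = mean_velocity([25.0] * 10 + [20.0])
# If the red car is held at the on-ramp, the grey cars stay fast,
# but the red car's velocity is 0 (and its commute never ends):
blocked = mean_velocity([30.0] * 10 + [0.0])

assert blocked > with_merge  # the proxy (mean velocity) prefers blocking
```

Because mean velocity averages over all cars, sacrificing one car entirely can still raise the proxy, even though the true reward (mean commute time) collapses for the stranded car.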

Figure 1: An example of reward hacking when cars merge onto a highway. A human-driver model controls the grey cars and an RL policy controls the red car. The RL agent observes positions and velocities of nearby cars (including itself) and adjusts its acceleration to maximize the proxy reward. At first glance, both the proxy reward and true reward appear to incentivize fast traffic flow. However, smaller policy models allow the red car to merge, whereas larger policy models exploit the misspecification by stopping the red car. When the red car stops merging, the mean velocity increases (merging slows down the more numerous grey cars). However, the mean commute time also increases (the red car is stuck). This exemplifies a phase transition: the qualitative behavior of the agent shifts as the model size increases.
Table 1: Reward misspecifications across our 4 environments. ‘Misalign’ indicates whether the true reward drops and ‘Transition’ indicates whether this corresponds to a phase transition (sharp qualitative change). We observe 5 instances of misalignment and 4 instances of phase transitions. ‘Mis.’ is a misweighting and ‘Ont.’ is an ontological misspecification.