“The Alignment Problem for Bayesian History-Based Reinforcement Learners”, 2018-06-22:
Value alignment is often considered a critical component of safe artificial intelligence. Meanwhile, reinforcement learning is often criticized as being inherently unsafe and misaligned, for reasons such as wireheading, delusion boxes, misspecified reward functions and distributional shifts. In this report, we categorize sources of misalignment for reinforcement learning agents, illustrating each type with numerous examples. For each type of problem, we also describe ways to remove the source of misalignment. Combined, the suggestions form high-level blueprints for how to design value-aligned RL agents.
[Keywords: AGI safety, reinforcement learning, Bayesian learning, causal graphs.]
[Obsoleted by “Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective”, 2019 (blog: LW)].
Rohin Shah summary:
It analyzes the alignment problem from an AIXI-like perspective, that is, by theoretical analysis of powerful Bayesian RL agents in an online POMDP setting. The environment has some underlying state, but the agent only receives observations of that state and must take actions in order to maximize reward. The authors consider three main setups: (1) rewards are computed by a preprogrammed reward function, (2) rewards are provided by a human in the loop, and (3) rewards are provided by a reward predictor which is trained interactively on human-generated data. For each setup, they consider the various objects present in the formalism and ask how these objects could be corrupted, misspecified, or misleading. This methodology allows them to identify several potential issues, which I won’t get into as I expect most readers are familiar with them. (Examples include wireheading and threatening to harm the human unless they provide maximal reward.)
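As a rough illustration of the shared formalism (names and structure are my own simplification, not the paper's notation), the history-based interaction loop with the three reward sources might look like:

```python
import random
from typing import Callable, List, Tuple

# Minimal sketch of the history-based RL loop, with the reward supplied by
# (1) a preprogrammed function, (2) a human in the loop, or (3) a learned
# reward predictor trained on human data. Names are illustrative only.

History = List[Tuple[int, int]]  # list of (action, observation) pairs

def interact(env_step: Callable[[int], int],
             policy: Callable[[History], int],
             reward_source: Callable[[History], float],
             horizon: int = 10) -> float:
    """Run one episode; the reward is always a function of the history."""
    history: History = []
    ret = 0.0
    for _ in range(horizon):
        action = policy(history)
        observation = env_step(action)          # hidden state is never seen directly
        history.append((action, observation))
        ret += reward_source(history)           # setup 1, 2, or 3 plugs in here
    return ret

# Example with a trivial environment and a preprogrammed reward (setup 1):
if __name__ == "__main__":
    env_step = lambda a: random.randint(0, 1)
    policy = lambda h: 0
    reward = lambda h: float(h[-1][1])          # reward = last observation
    print(interact(env_step, policy, reward))
```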
They also propose several tools that can be used to help solve misalignment. To prevent reward function corruption, we can have the agent simulate the future trajectory and evaluate it with the current reward function, removing the incentive to corrupt the reward function. (This was later developed into current-RF optimization (AN #71).) Self-corruption awareness refers to whether or not the agent is aware that its policy can be modified. A self-corruption unaware agent behaves as though its current policy will never be changed, effectively ignoring the possibility of corruption. It is not clear which is more desirable: while a self-corruption unaware agent will be more corrigible (in the MIRI sense), it also will not try to preserve its utility function, as it believes that even if the utility function changes the policy will not change.
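A minimal sketch of the current-RF idea (my own simplification, not the paper's notation): plans are scored by the reward function the agent holds *now*, even for simulated futures in which that function has been overwritten, so rewriting the reward function gains the agent nothing.

```python
from typing import Callable, Sequence

RewardFn = Callable[[int], float]

def planning_return(trajectory: Sequence[int],
                    predicted_reward_fns: Sequence[RewardFn],
                    current_rf: RewardFn,
                    use_current_rf: bool) -> float:
    """Score a simulated trajectory of states.

    predicted_reward_fns[k] is the (possibly corrupted) reward function the
    agent predicts it will hold at step k. A current-RF optimizer ignores
    these and always evaluates with current_rf, removing the incentive to
    corrupt the reward function.
    """
    total = 0.0
    for k, state in enumerate(trajectory):
        rf = current_rf if use_current_rf else predicted_reward_fns[k]
        total += rf(state)
    return total
```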
Action-observation grounding ensures that the agent only optimizes over policies that work on histories of observations and actions, preventing agents from constructing entirely new observation channels (“delusion boxes”) which mislead the reward function into thinking everything is perfect.
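In type-signature terms (my own gloss on the idea, not the paper's formalism): the objects the agent optimizes are defined on the external action-observation history, not on any internal percept the agent could rewire.

```python
from typing import Callable, Sequence, Tuple

Action = int
Observation = int
History = Sequence[Tuple[Action, Observation]]

# Action-observation grounding: both the policy and the reward are functions
# of the actual action-observation history, so building a "delusion box" that
# feeds the agent fabricated percepts cannot change what the reward is
# evaluated on.
Policy = Callable[[History], Action]
Reward = Callable[[History], float]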
The interactive setting, in which a reward predictor is trained on human feedback, introduces a new challenge: the human data can itself be corrupted or manipulated. One technique to address this is to get decoupled data: if the corruption is determined by the current state s, but the feedback is about some different state s’, then as long as s and s’ are not too correlated it is possible to mitigate the corruption.
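A toy illustration of decoupled feedback (the corruption model here is an assumption of mine, purely for illustration): feedback collected while the agent occupies state s may be corrupted, but if the query is about some *other* state, the honest signal can still be recovered.

```python
import random
from collections import defaultdict

# Toy corruption model (illustrative assumption): feedback is corrupted only
# when it concerns the state the agent currently occupies; feedback about any
# other state is honest.
true_reward = {s: s / 4.0 for s in range(5)}

def feedback(current_state: int, queried_state: int) -> float:
    if queried_state == current_state:
        return 1.0                       # corrupted: the current state looks great
    return true_reward[queried_state]    # honest

coupled, decoupled = defaultdict(list), defaultdict(list)
for _ in range(10_000):
    s = random.randrange(5)
    coupled[s].append(feedback(s, s))                      # ask about where you are
    s_other = random.choice([x for x in range(5) if x != s])
    decoupled[s_other].append(feedback(s, s_other))        # ask about somewhere else

print({s: round(sum(v) / len(v), 2) for s, v in coupled.items()})    # all 1.0: useless
print({s: round(sum(v) / len(v), 2) for s, v in decoupled.items()})  # recovers true_reward
```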
Another leverage point is how we decide to use the reward predictor. We could use the stationary reward function, which evaluates simulated trajectories with the current reward predictor, i.e. assuming that the reward predictor will never be updated again. If we combine this with self-corruption unawareness (so that the agent also never expects its policy to change), then the incentive to corrupt the reward predictor’s data is removed. However, the resulting agent is time-inconsistent: it acts as though its reward function never changes even though in practice it does, so it can make a plan and start executing it, only to switch over to a new plan once the reward function changes, over and over again.
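In rough pseudocode (my notation, not the paper's), the stationary evaluation freezes the reward predictor at its current parameters when scoring imagined futures:

```python
from typing import Callable, Sequence

History = Sequence[int]            # simplification: a history is just a state sequence
RewardPredictor = Callable[[int], float]

def stationary_value(simulated_trajectory: History,
                     current_predictor: RewardPredictor) -> float:
    """Score an imagined future with the reward predictor as it is *now*,
    i.e. as if it will never be trained on any further human data."""
    return sum(current_predictor(state) for state in simulated_trajectory)
```

Because imagined futures are scored by a predictor that never updates, corrupting the future training data cannot raise the score; the price is that the real predictor does keep updating, which is exactly the time-inconsistency described above.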
The dynamic reward function avoids this pitfall: it evaluates the kth timestep of a simulated trajectory by also taking an expectation over the future data that the reward predictor will receive. This agent is no longer time-inconsistent, but it now gives the agent an incentive to manipulate that data. This can be fixed by building a single integrated Bayesian agent, which maintains one model that jointly predicts the environment dynamics and the reward data. The resulting agent is time-consistent, utility-preserving, and has no direct incentive to manipulate the data. (This is akin to the setup in assistance games / CIRL (AN #69).)
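Contrast this with a dynamic evaluation (again my own simplification): the agent predicts how the reward predictor would be retrained along the imagined trajectory and scores each step with the predictor it expects to hold at that point.

```python
from typing import Callable, Sequence

RewardPredictor = Callable[[int], float]
# Predicts how the reward predictor changes once it has seen the human data
# generated up to step k of the imagined trajectory.
PredictorUpdate = Callable[[RewardPredictor, Sequence[int]], RewardPredictor]

def dynamic_value(simulated_trajectory: Sequence[int],
                  current_predictor: RewardPredictor,
                  expected_update: PredictorUpdate) -> float:
    total = 0.0
    predictor = current_predictor
    for k, state in enumerate(simulated_trajectory):
        # Collapsed to a point estimate here; the paper takes a proper
        # expectation over the data the predictor will receive.
        predictor = expected_update(predictor, simulated_trajectory[:k + 1])
        total += predictor(state)
    return total
```

Because the score now depends on what data the predictor will see, the agent can raise it by steering that data, which is the manipulation incentive the integrated Bayesian agent is meant to remove.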
One final approach is to use a counterfactual reward function, in which the data is simulated in a counterfactual world where the agent executed some known safe default policy. This no longer depends on the current time, and is not subject to data corruption since the data comes from a hypothetical that is independent of the agent’s actual policy. However, it requires a good default policy that does the necessary information-gathering actions, and requires the agent to have the ability to simulate human feedback in a counterfactual world.
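A sketch of the counterfactual variant (function names here are illustrative, not from the paper): the reward predictor is trained only on the feedback a known safe default policy *would* have generated, so nothing the agent actually does can influence its own training data.

```python
from typing import Callable, Sequence

RewardPredictor = Callable[[int], float]

def counterfactual_value(simulated_trajectory: Sequence[int],
                         train_predictor: Callable[[Sequence[float]], RewardPredictor],
                         simulate_feedback_under: Callable[[object], Sequence[float]],
                         safe_default_policy: object) -> float:
    """Score an imagined future with a reward predictor trained only on the
    feedback a safe default policy would have generated in a counterfactual
    world; the agent's actual behaviour cannot corrupt this data."""
    counterfactual_data = simulate_feedback_under(safe_default_policy)
    predictor = train_predictor(counterfactual_data)
    return sum(predictor(state) for state in simulated_trajectory)
```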
This paper is a great organization and explanation of several older papers (that haven’t been summarized in this newsletter because they were published before 2018 and I read them before starting this newsletter), and I wish I had read it sooner. It seems to me that the integrated Bayesian agent is the clear winner; the only downside is the computational cost, which would be a bottleneck for any of the models considered here. One worry I have with this sort of analysis is that the guarantees you get out of it depend quite a lot on how you model the situation. For example, let’s suppose that after I sleep I wake up refreshed and more capable of intellectual work. Should I model this as “policy corruption”, or as a fixed policy that takes as input some information about how rested I am?