“Designing Agent Incentives to Avoid Reward Tampering”, 2019-08-14:
From an AI safety perspective, having a clear design principle and a crisp characterization of what problem it solves means that we don’t have to guess which agents are safe. In this post and paper we describe how a design principle called ‘current-RF optimization’ avoids the reward function tampering problem.
…One way to prevent the agent from tampering with the reward function is to isolate or encrypt the reward function. However, we do not expect such solutions to scale indefinitely with our agent’s capabilities, as a sufficiently capable agent may find ways around most defenses. In our new paper, we describe a more principled way to fix the reward tampering problem. Rather than trying to protect the reward function, we change the agent’s incentives for tampering with it.
The fix relies on a slight change to the RL framework: the agent is given query access to the reward function. In the rocks and diamonds environment, this can be done by specifying to the agent how the purple nodes encode the reward function.
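To make the query-access idea concrete, here is a minimal Python sketch under stated assumptions: a toy one-dimensional environment loosely in the spirit of rocks and diamonds, where one action overwrites the reward function. All names (`QueryAccessEnv`, `TAMPER`, `current_reward_fn`) are illustrative, not from the paper.

```python
# A minimal sketch, assuming a toy 1-D environment: the agent walks a
# line of tiles 0..4 with a diamond at tile 4, and one action tampers
# with the reward function. Names are illustrative, not the paper's.

from typing import Callable

State = int          # agent position on the line: 0 .. 4
Action = int
LEFT, RIGHT, TAMPER = -1, +1, 0

def diamond_reward(state: State) -> float:
    # Intended reward: +1 only at the diamond tile (position 4).
    return 1.0 if state == 4 else 0.0

class QueryAccessEnv:
    """A standard RL environment extended with *query access* to the
    reward function: the agent can evaluate the reward function as it
    currently stands, rather than only observing sampled rewards."""

    def __init__(self):
        # The reward function is ordinary, modifiable environment state;
        # the TAMPER action overwrites it (reward-function tampering).
        self.reward_fn: Callable[[State], float] = diamond_reward

    def current_reward_fn(self) -> Callable[[State], float]:
        # Query access: return the reward function as it is *now*.
        return self.reward_fn

    def step(self, state: State, action: Action) -> State:
        if action == TAMPER:
            # Tampering: rewrite the reward function to pay out everywhere.
            self.reward_fn = lambda s: 1.0
            return state
        return min(4, max(0, state + action))
```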
Using query access to the reward function, we can design a model-based agent that uses the current reward function to evaluate rollouts of potential policies (a current-RF agent, for short). For example, in the rocks and diamonds environment, a current-RF agent consults the current reward description and, at time 1, sees that it should collect diamonds. This is the criterion by which it chooses its first action: moving upwards towards the diamond. Note that the reward description remains changeable, just as before; still, the current-RF agent has no incentive to exploit this possibility for reward tampering, because it is focused on satisfying the current reward description.
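Building on the sketch above, the following illustrates one way a current-RF planner might work: candidate action sequences are scored with the reward function queried at planning time, so a rollout that overwrites the reward function gains nothing by doing so. This is a sketch under the same toy assumptions, not the paper's implementation.

```python
# A sketch of current-RF planning for the toy environment above:
# rollouts are evaluated with the reward function frozen at planning
# time, even if a rollout tampers with the simulated reward function.

import copy
import itertools

def plan_current_rf(env: QueryAccessEnv, state: State, horizon: int):
    frozen_reward = env.current_reward_fn()    # query access, frozen now
    best_seq, best_return = None, float("-inf")
    for seq in itertools.product([LEFT, RIGHT, TAMPER], repeat=horizon):
        sim, s, total = copy.deepcopy(env), state, 0.0
        for a in seq:                          # model-based rollout
            s = sim.step(s, a)
            # Key point: rewards come from the *current* reward function,
            # not from sim.reward_fn, which the rollout may have rewritten.
            total += frozen_reward(s)
        if total > best_return:
            best_seq, best_return = seq, total
    return best_seq

env = QueryAccessEnv()
print(plan_current_rf(env, state=0, horizon=4))  # (1, 1, 1, 1): walk to the diamond
```

The entire design choice sits in the single line that freezes the reward function before simulation: a standard agent that instead scored rollouts with `sim.reward_fn` would find that the TAMPER action maximizes its return.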