“Reinforcement Learning in Newcomb-Like Environments”, James Henry Bell, Linda Linsefors, Caspar Oesterheld, Joar Max Viktor Skalse, 2023-05-05:

How do value-based reinforcement learning algorithms behave when the environment can predict the agent’s policy?

Newcomb-like decision problems have been studied extensively in the decision theory literature, but they have so far been largely absent from the reinforcement learning literature. In this paper we study value-based reinforcement learning algorithms in the Newcomb-like setting, and answer some of the fundamental theoretical questions about the behavior of such algorithms in these environments. We show that a value-based reinforcement learning agent cannot converge to a policy that is not ratifiable, i.e., one that does not exclusively choose actions that are optimal given that same policy.
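The ratifiability condition can be stated compactly. The notation below is a gloss on the abstract, not taken verbatim from the paper: \(Q^{\pi}(a)\) denotes the expected return of action \(a\) when the environment's behavior depends on the (predicted) policy \(\pi\).

```latex
% A policy \pi is ratifiable iff every action it plays with positive
% probability is optimal given that the environment reacts to \pi itself:
\operatorname{supp}(\pi) \;\subseteq\; \operatorname*{arg\,max}_{a \in \mathcal{A}} \; Q^{\pi}(a)
```

Intuitively: once the agent settles on \(\pi\), no action in \(\pi\)'s support should look suboptimal in the environment that \(\pi\) itself induces.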

This gives us a powerful tool for reasoning about the limit behavior of agents—for example, it lets us show that there are Newcomb-like environments in which a reinforcement learning agent cannot converge to any optimal policy. We show that a ratifiable policy always exists in our setting, but that there are cases in which a reinforcement learning agent normally cannot converge to it (and hence cannot converge at all). We also prove several results about the possible limit behaviors of agents in cases where they do not converge to any policy.
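A minimal sketch of the non-convergence phenomenon, under assumptions of my own (the bandit, the predictor rule, and all names here are illustrative, not from the paper): a two-action bandit whose reward depends on the agent's own policy. The environment "predicts" the agent's current greedy preference and pays reward 1 only for the action that preference disfavors, so the only ratifiable policy is the 50/50 mix, which a greedy value-based learner cannot represent; its greedy action therefore keeps flipping.

```python
import random

# Illustrative Newcomb-like bandit (matching-pennies-style predictor):
# the currently favored (greedy) action earns reward 0, the disfavored
# action earns reward 1, so no deterministic policy is ratifiable.
random.seed(0)
q = [0.0, 0.0]        # action-value estimates
alpha, eps = 0.1, 0.1  # learning rate, exploration rate
greedy_trace = []

for t in range(10_000):
    greedy = 0 if q[0] >= q[1] else 1
    action = random.randrange(2) if random.random() < eps else greedy
    # Predictor: rewards only the action the current policy disfavors.
    reward = 0.0 if action == greedy else 1.0
    q[action] += alpha * (reward - q[action])
    greedy_trace.append(greedy)

# Both deterministic greedy policies recur throughout training;
# the agent never settles on a single policy.
print(sorted(set(greedy_trace)))
```

This mirrors the abstract's claim in miniature: the ratifiable policy exists (mix 50/50) but lies outside what the greedy learner can converge to, so the learner does not converge at all.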

[Keywords: reinforcement learning, learning in games, decision theory]