“Free-Play Periods for RL Agents”, Gwern, 2023-05-07:

Proposal for incentivizing meta-learning of exploration in deep reinforcement learning: domain randomization with reward-shaping, where there is a fixed-length ‘play time’ with no rewards/losses at the beginning of each episode.

In standard reinforcement learning, agents learn to be maximally exploitative from the beginning of each episode, since they can assume the rules are always the same. In meta-learning-oriented approaches, each episode has different rules, often explicitly randomized by a simulator; with enough scale, agents learn to observe the outcomes of their actions and adapt on the fly.
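The domain-randomization setup can be sketched as a toy environment whose hidden “rules” are re-drawn on every reset, so the agent can only succeed by inferring them within the episode. (This is a minimal illustration, not any particular benchmark; the hidden goal direction and the 10-step horizon are assumptions for the sketch.)

```python
import random

class RandomizedEnv:
    """Toy meta-learning environment: each episode draws a new hidden
    'rule' (here, a goal direction of -1 or +1) which the agent must
    infer from reward feedback within that episode."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.goal = 0
        self.t = 0

    def reset(self):
        # Re-randomize the episode's rules on every reset.
        self.goal = self.rng.choice([-1, 1])
        self.t = 0
        return 0.0  # uninformative initial observation

    def step(self, action):
        self.t += 1
        # Reward reveals (noisily, in richer versions) the hidden rule.
        reward = 1.0 if action == self.goal else -1.0
        done = self.t >= 10
        return 0.0, reward, done
```

An agent trained across many such episodes is pushed toward in-context adaptation: probe, observe the reward, then exploit the inferred rule for the rest of the episode.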

However, they are still penalized for any mistakes made while adapting, because they continue to receive rewards/losses as usual. Presumably, this means they must be conservative in what exploratory actions they take early on, and may not do any explicit exploration at all. This seems unduly harsh, because in many problems it is realistic to have an initial consequence-free period during which an agent can take a series of exploratory actions to infer what sort of environment it is in, before the task begins ‘for real’. For example, in real-world robotics, robots never start earning rewards immediately: there is always some sort of bootup phase where they wait for tasks to start, which they could be using productively to self-test. But this sort of ‘sandbox’ period is never deliberately provided in meta-learning setups (except inadvertently).

I propose a simple free-play modification to domain randomization: every episode begins with a fixed-length, reward-shaped “free-play” period in which agents may take actions but all rewards are set to 0. After that period, the episode continues as usual. (This can be implemented simply by post-processing each episode’s data, and requires no other modifications whatsoever to existing DRL algorithms.) Since there is no possibility of greedily earning (or losing) reward during free-play, agents are incentivized to meta-learn optimal exploration of the meta-environment, maximizing information gain before the dangerous part of the episode begins.
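The post-processing step is a one-line reward mask. A minimal sketch, assuming each episode’s rewards are stored as a flat sequence (the function name and representation are illustrative, not from any particular library):

```python
def apply_free_play(rewards, free_play_steps):
    """Zero out rewards during the initial fixed-length free-play
    window of an episode, leaving the remainder untouched.

    rewards: sequence of per-timestep rewards for one episode.
    free_play_steps: length of the consequence-free period.
    """
    masked = list(rewards)
    for t in range(min(free_play_steps, len(masked))):
        masked[t] = 0.0
    return masked
```

Because the mask is applied to logged episode data before the learning update, any existing DRL algorithm (policy gradient, Q-learning, etc.) trains on it unchanged; the agent simply sees that nothing it does in the first `free_play_steps` timesteps affects its return.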

Free-play meta-training would lead to agents which ‘train themselves’ during the free-play period: for example, you would boot up a robot arm in a warehouse, it would shake itself and wiggle around for a few minutes, and after that, the neural net ‘knows’ all about how to operate that exact arm.