“Deep Reinforcement Learning from Human Preferences § Appendix A.2: Atari”, 2017-06-12:
…The predictor is trained asynchronously from the RL agent, and on our hardware typically processes 1 label per 10 RL timesteps.
We maintain a buffer of only the last 3,000 labels and loop over this buffer continuously; this ensures that the predictor gives enough weight to new labels (which can represent a shift in distribution) even when the total number of labels becomes large.
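The buffering scheme above can be sketched as a fixed-capacity queue that evicts the oldest labels and is cycled over indefinitely for training; a minimal illustration (the class name, method names, and toy capacity are hypothetical, not from the paper):

```python
import random
from collections import deque

class LabelBuffer:
    """Fixed-capacity store of preference labels; once full, adding a
    new label evicts the oldest one (the paper uses capacity 3,000)."""

    def __init__(self, capacity=3000):
        self.labels = deque(maxlen=capacity)

    def add(self, label):
        self.labels.append(label)

    def stream(self):
        """Loop over the buffer continuously, reshuffling each pass, so
        recent labels keep a fixed share of predictor-training weight."""
        while True:
            order = list(self.labels)
            random.shuffle(order)
            yield from order

# Toy capacity of 3 for demonstration.
buf = LabelBuffer(capacity=3)
for i in range(5):      # add labels 0..4; only the last 3 are retained
    buf.add(i)

it = buf.stream()
batch = [next(it) for _ in range(6)]  # two full shuffled passes over {2, 3, 4}
```

Because capacity is bounded, each retained label is visited once per pass regardless of how many labels have been collected in total, which is what keeps newer labels from being drowned out.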