Bibliography (7):

  1. RL with KL penalties is better seen as Bayesian inference (LessWrong): https://www.lesswrong.com/posts/eoHbneGvqDu25Hasc/rl-with-kl-penalties-is-better-seen-as-bayesian-inference

  2. GPT-3: Language Models are Few-Shot Learners

  3. Deep reinforcement learning from human preferences