Bibliography (7):

https://www.lesswrong.com/posts/eoHbneGvqDu25Hasc/rl-with-kl-penalties-is-better-seen-as-bayesian-inference
GPT-3: Language Models are Few-Shot Learners
Deep reinforcement learning from human preferences
Wikipedia Bibliography:
1. Expected value
2. Kullback-Leibler divergence
3. Variational Bayesian methods
4. Bayesian statistics