https://x.com/nabla_theta/status/1582889490296684544
https://www.lesswrong.com/posts/shcSdHGPhnLQkpSbX/scaling-laws-for-reward-model-overoptimization
Best-Of-n With Misaligned Reward Models for Math Reasoning
InstructGPT: Training language models to follow instructions with human feedback
Towards a Human-like Open-Domain Chatbot
https://arxiv.org/pdf/1803.04585#page=2&org=miri
Why the tails come apart
Calculating The Gaussian Expected Maximum § Probability of Bivariate Maximum
https://arxiv.org/pdf/1803.04585#page=3&org=miri
Categorizing Variants of Goodhart’s Law
https://arxiv.org/pdf/2210.10760#page=23&org=openai
https://arxiv.org/pdf/2210.10760#page=24&org=openai
https://arxiv.org/pdf/2210.10760#page=7&org=openai
https://arxiv.org/pdf/2210.10760#page=22&org=openai
https://arxiv.org/pdf/2210.10760#page=9&org=openai
Risks from Learned Optimization in Advanced Machine Learning Systems
Scaling laws for single-agent reinforcement learning
Scaling Laws for Autoregressive Generative Modeling
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
RL with KL penalties is better viewed as Bayesian inference