Bibliography (25):

  1. https://x.com/nabla_theta/status/1582889490296684544

  2. https://www.lesswrong.com/posts/shcSdHGPhnLQkpSbX/scaling-laws-for-reward-model-overoptimization

  3. Best-Of-n With Misaligned Reward Models for Math Reasoning

  4. InstructGPT: Training language models to follow instructions with human feedback

  5. Towards a Human-like Open-Domain Chatbot

  6. https://arxiv.org/pdf/1803.04585#page=2&org=miri

  7. Why the tails come apart

  8. Calculating The Gaussian Expected Maximum § Probability of Bivariate Maximum

  9. https://arxiv.org/pdf/1803.04585#page=3&org=miri

  10. Categorizing Variants of Goodhart’s Law

  11. https://arxiv.org/pdf/2210.10760#page=23&org=openai

  12. https://arxiv.org/pdf/2210.10760#page=24&org=openai

  13. https://arxiv.org/pdf/2210.10760#page=7&org=openai

  14. https://arxiv.org/pdf/2210.10760#page=22&org=openai

  15. https://arxiv.org/pdf/2210.10760.pdf#page=9&org=openai

  16. Risks from Learned Optimization in Advanced Machine Learning Systems

  17. Scaling laws for single-agent reinforcement learning

  18. Scaling Laws for Autoregressive Generative Modeling

  19. The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

  20. RL with KL penalties is better viewed as Bayesian inference