- https://x.com/nabla_theta/status/1582889490296684544
- https://www.lesswrong.com/posts/shcSdHGPhnLQkpSbX/scaling-laws-for-reward-model-overoptimization
- Best-Of-n With Misaligned Reward Models for Math Reasoning
- InstructGPT: Training language models to follow instructions with human feedback
- Towards a Human-like Open-Domain Chatbot
- https://arxiv.org/pdf/1803.04585#page=2&org=miri
- Why the tails come apart
- Calculating The Gaussian Expected Maximum § Probability of Bivariate Maximum
- https://arxiv.org/pdf/1803.04585#page=3&org=miri
- Categorizing Variants of Goodhart’s Law
- https://arxiv.org/pdf/2210.10760#page=23&org=openai
- https://arxiv.org/pdf/2210.10760#page=24&org=openai
- https://arxiv.org/pdf/2210.10760#page=7&org=openai
- https://arxiv.org/pdf/2210.10760#page=22&org=openai
- https://arxiv.org/pdf/2210.10760.pdf#page=9&org=openai
- Risks from Learned Optimization in Advanced Machine Learning Systems
- Scaling laws for single-agent reinforcement learning
- Scaling Laws for Autoregressive Generative Modeling
- The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
- RL with KL penalties is better viewed as Bayesian inference