“Best-Of-n With Misaligned Reward Models for Math Reasoning” (adversarial examples (AI), math, model-based RL)