“When Your AIs Deceive You: Challenges With Partial Observability in RLHF” (adversarial examples (AI), preference learning, AI safety)