“Mysteries of Mode Collapse § Inescapable Wedding Parties”, Janus2022-11-08 (, )⁠:

If you’ve played with both text-davinci-002 and the original davinci through the OpenAI API, you may have noticed that text-davinci-002, in addition to following instructions, is a lot more deterministic and sometimes exhibits stereotyped behaviors. This is an infodump of what I know about “mode collapse” (drastic biases toward particular completions and patterns) in GPT models like text-davinci-002 that have undergone RLHF training.

Inescapable wedding parties: Another example of the behavior of overoptimized RLHF models was related to me anecdotally by Paul Christiano. It was something like this:

While Paul was at OpenAI, they accidentally overoptimized a GPT policy against a positive sentiment reward model. This policy evidently learned that wedding parties were the most positive thing that words can describe, because whatever prompt it was given, the completion would inevitably end up describing a wedding party. In general, the transition into a wedding party was reasonable and semantically meaningful, although there was at least one observed instance where instead of transitioning continuously, the model ended the current story by generating a section break and began an unrelated story about a wedding party.

This example is very interesting to me for a couple of reasons: