[blog] Given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is “helpful, honest, and harmless”.
As an initial foray in this direction, we study simple baseline techniques and evaluations, such as prompting. We find that the benefits from modest interventions increase with model size, generalize to a variety of alignment evaluations, and do not compromise the performance of large models.
Next we investigate scaling trends for several training objectives relevant to alignment, comparing imitation learning, binary discrimination, and ranked preference modeling. We find that ranked preference modeling performs much better than imitation learning, and often scales more favorably with model size. In contrast, binary discrimination typically performs and scales very similarly to imitation learning.
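To make the comparison concrete, here is a minimal sketch of how the three objectives differ in what they are trained on. The excerpt does not give the loss formulas, so the forms below are common textbook choices assumed for illustration, and every function name is hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def ranked_preference_loss(score_better: float, score_worse: float) -> float:
    """Assumed pairwise form: only the score *difference* within a pair matters."""
    return -log_sigmoid(score_better - score_worse)


def binary_discrimination_loss(score: float, is_good: bool) -> float:
    """Assumed binary cross-entropy on a single sample labelled good or bad."""
    return -log_sigmoid(score if is_good else -score)


def imitation_loss(token_log_probs: list[float]) -> float:
    """Assumed mean negative log-likelihood of the tokens of a 'good' sample."""
    return -sum(token_log_probs) / len(token_log_probs)


# Equal scores carry no preference, so the pairwise loss is log(2).
print(ranked_preference_loss(0.0, 0.0))
# Ranking the pair correctly gives a lower loss than ranking it the wrong way.
print(ranked_preference_loss(2.0, 0.0) < ranked_preference_loss(0.0, 2.0))
```

The structural difference is that the ranked objective compares two samples directly, the binary objective judges each sample against an absolute label, and imitation learning only ever sees the good samples.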
Finally we study a ‘preference model pre-training’ stage of training, with the goal of improving sample efficiency when finetuning on human preferences.
Figure 2: Left: Simple prompting substantially improves performance and scaling on our HHH alignment evaluations (y-axis measures accuracy at choosing better responses on our HHH evaluations). Right: Prompts impose little or no ‘alignment tax’ on large models, even on complex evaluations like function synthesis. Here we have evaluated our Python code models on the HumanEval Codex dataset [CTJ+21] at temperature T = 0.6 and top-p = 0.95.
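The accuracy on the left-hand y-axis can be read as a pairwise metric: for each comparison, does the model score the better response above the worse one? The sketch below assumes each response has already been assigned a scalar score (for example a log-probability); how those scores are produced is not specified in this excerpt, and `pairwise_accuracy` is a hypothetical helper.

```python
def pairwise_accuracy(scored_pairs: list[tuple[float, float]]) -> float:
    """Fraction of pairs where the better response outscores the worse one.

    Each pair is (score_of_better_response, score_of_worse_response).
    Ties are counted as incorrect.
    """
    if not scored_pairs:
        raise ValueError("need at least one comparison")
    correct = sum(1 for better, worse in scored_pairs if better > worse)
    return correct / len(scored_pairs)


# Hypothetical scores for four comparisons; two are ranked correctly.
pairs = [(1.0, 0.5), (0.2, 0.9), (-1.0, -2.0), (0.0, 0.1)]
print(pairwise_accuracy(pairs))
```

Under this reading, a model that scores responses at random would sit near 0.5, which is the natural baseline for interpreting the curves.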
…To generically elicit the sort of behavior shown in Figure 1, we found that it was sufficient to provide a long prompt (4,600 words from 14 fictional conversations) with example interactions. The prompt we used was not carefully designed or optimized for performance on evaluations; rather it was just written by two of us in an ad hoc manner prior to the construction of any evaluations. Despite the fact that our prompt did not include any examples where models resisted manipulation, refused requests to aid in dangerous activities, or took a stand against unsavory behavior, we observed that models often actively avoided engaging in harmful behaviors based only on the AI ‘personality’ imbued by the prompt. This is reflected in the performance trends on harmfulness in Figure 6.…The capabilities of small models (e.g., on NLP or coding evaluations) are typically diminished in the presence of the prompt, presumably because they are confused by it. But larger models perform at roughly the same level with or without the prompt.
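Mechanically, prompting of this kind amounts to prepending the example conversations to whatever the user asks and letting the model continue the text. The sketch below is only an illustration: the `Human:` and `Assistant:` speaker tags, the separator, and the `build_context` helper are assumptions, since the excerpt does not describe the prompt's format.

```python
Conversation = list[tuple[str, str]]  # (speaker, text), speaker is "human" or "assistant"


def build_context(
    example_conversations: list[Conversation],
    user_message: str,
    human_tag: str = "Human:",          # assumed speaker tag
    assistant_tag: str = "Assistant:",  # assumed speaker tag
    separator: str = "\n\n-----\n\n",   # assumed conversation separator
) -> str:
    """Concatenate example conversations, then open a new turn for the model."""
    blocks = []
    for conversation in example_conversations:
        turns = [
            f"{human_tag if speaker == 'human' else assistant_tag} {text}"
            for speaker, text in conversation
        ]
        blocks.append("\n\n".join(turns))
    # End on the assistant tag so the model's continuation is the reply.
    blocks.append(f"{human_tag} {user_message}\n\n{assistant_tag}")
    return separator.join(blocks)


examples = [[("human", "Hi"), ("assistant", "Hello! How can I help?")]]
print(build_context(examples, "What is 2+2?"))
```

Because nothing is trained, the same base model can be evaluated with and without this context, which is what allows a direct 'alignment tax' comparison.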
Figure 5: Transfer performance at 500 and 5k sequence pairs on downstream finetuning evaluations with preference model pre-training (PMP; on the ‘Mix’ dataset, shown in violet) vs. without PMP (black). Each curve is averaged across finetuning evaluations Learn to Summarize, HellaSwag, and all 5 Ethics evaluations. We see that PMP substantially improves sample efficiency with large models.