“Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned”, Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy L. Jones, Samuel R. Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Johnston, Shauna Kravec, Catherine Olsson, Sam Ringer, Eli Tran-Johnson, Dario Amodei, Tom Brown, Nicholas Joseph, Sam McCandlish, Chris Olah, Jared Kaplan, Jack Clark (2022-08-25):

We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs.

We make 3 main contributions:

  1. we investigate scaling behaviors for red teaming across 3 model sizes (2.7b, 13b, and 52b parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless [Askell et al 2021]; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF) [Bai et al 2022]. We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types.

    Figure 1: Red team attack success by model size (x-axes) and model type (colors). (Left) Attack success measured by average red team member self-report (higher is more successful). (Middle) Attack success measured by average minimum harmlessness score (higher is better, i.e. less harmful). (Right) Distribution of minimum harmlessness scores.
    Figure 2: Visualization of the red team attacks. Each point corresponds to a red team attack embedded in a two dimensional space using UMAP. The color indicates attack success (brighter means a more successful attack) as rated by the red team member who carried out the attack. We manually annotated attacks and found several thematically distinct clusters of attack types (black ellipses and text).
  2. we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs.

  3. we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming.

We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.
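To make the Figure 1 metrics concrete, here is a minimal sketch (not the paper's code) of the two attack-success measures: the red team member's self-reported success rating, and the minimum harmlessness score over a dialogue's model utterances, averaged across attacks. All names, and the 0-4 rating scale, are illustrative assumptions.

```python
# Illustrative sketch of the two attack-success metrics in Figure 1.
# Assumption: each attack is a list of per-utterance harmlessness scores
# from a preference model, plus a self-report success rating (e.g. 0-4)
# from the red team member. All function names here are hypothetical.

def min_harmlessness(attack_scores):
    """An attack's severity is driven by its worst (lowest-scored) model
    utterance, so each attack is summarized by its minimum harmlessness."""
    return min(attack_scores)

def average_min_harmlessness(attacks):
    """Per-model aggregate: mean of per-attack minima
    (higher is better, i.e. less harmful)."""
    return sum(min_harmlessness(a) for a in attacks) / len(attacks)

def average_self_report(ratings):
    """Mean self-reported attack success (higher = more successful)."""
    return sum(ratings) / len(ratings)

# Toy example: two attacks against one model.
attacks = [
    [1.2, 0.3, 0.9],   # one utterance scored quite harmful (0.3)
    [2.1, 1.8],
]
ratings = [4, 1]       # self-reported success for each attack
print(average_min_harmlessness(attacks))  # (0.3 + 1.8) / 2 = 1.05
print(average_self_report(ratings))       # (4 + 1) / 2 = 2.5
```

Taking the minimum over utterances, rather than the mean, reflects that a single harmful response is enough for an attack to count as successful.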

…Apart from our previous work, our approach is most similar to [60] and [53], who have crowdworkers attempt to elicit offensive outputs from dialogue agents in open-ended dialogues, then use the resulting data to create effective safety interventions. In [60], the authors release a Bot Adversarial Dialogues (BAD) dataset of ~5K conversations with 3 dialogue agents ranging in size from 345M to 2.7b parameters. We collect more data (~40K attacks); red team larger models (up to 52b parameters) in order to measure scaling behaviors, as in [53]; and focus on reinforcement learning from human feedback [14] as our most promising safety intervention.

…we find that rejection sampling (RS) makes it particularly difficult to red team our language models. In essence, of the 3 interventions that we tried, rejection sampling places a floor on how harmful the models' outputs can be, and so was the hardest to attack successfully. However, qualitatively, we believe that this may be the case because the responses from the RS models tend to be harmless by being evasive.
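The rejection-sampling intervention can be sketched as best-of-k re-ranking: draw k candidate completions and return the one a harmlessness model rates highest. This is a hedged illustration, not the paper's implementation; `sample_completions` and `harmlessness_score` are hypothetical stubs standing in for the language model and the trained preference model, and k=16 is illustrative.

```python
import random

def sample_completions(prompt, k):
    # Stub: a real system would draw k samples from the language model.
    return [f"candidate response {i} to: {prompt}" for i in range(k)]

def harmlessness_score(prompt, response):
    # Stub: a real system would score (prompt, response) with a trained
    # harmlessness preference model; here we return a random score.
    return random.random()

def rejection_sample(prompt, k=16):
    """Best-of-k rejection sampling: generate k candidates and return the
    one the harmlessness model rates highest. Because the worst candidates
    are filtered out, this places a floor on how harmful the returned
    response can be -- though the top-ranked response is often harmless
    merely by being evasive."""
    candidates = sample_completions(prompt, k)
    return max(candidates, key=lambda r: harmlessness_score(prompt, r))

best = rejection_sample("How do I pick a lock?", k=16)
print(best)
```

Note that rejection sampling is a test-time intervention: unlike RLHF, it leaves the underlying model unchanged and instead spends extra compute at inference to filter its samples.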