“Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation”, Rusheb Shah, Quentin Feuillade–Montixi, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando2023-11-06 (, , )⁠:

Despite efforts to align large language models to produce harmless responses, they are still vulnerable to jailbreak prompts that elicit unrestricted behavior.

In this work, we investigate persona modulation as a black-box jailbreaking method to steer a target model to take on personalities that are willing to comply with harmful instructions. Rather than manually crafting prompts for each persona, we automate the generation of jailbreaks using a language model assistant. We demonstrate a range of harmful completions made possible by persona modulation, including detailed instructions for synthesizing methamphetamine, building a bomb, and laundering money.

These automated attacks achieve a harmful completion rate of 42.5% in GPT-4, which is 185× larger than before modulation (0.23%). These prompts also transfer to Claude 2 and Vicuna with harmful completion rates of 61.0% and 35.9%, respectively.

Our work reveals yet another vulnerability in commercial large language models and highlights the need for more comprehensive safeguards.

[media: The challenge, Pour says, is that persona impersonation “is a very core thing that these models do.” They aim to achieve what the user wants, and they specialize in assuming different personalities—which proved central to the form of exploitation used in the new study. Stamping out their ability to take on potentially harmful personas, such as the “research assistant” that devised jailbreaking schemes, will be tricky. “Reducing it to zero is probably unrealistic”, Shah says. “But it’s important to think, ‘How close to zero can we get?’”…Katell acknowledges that organizations developing LLM-based chatbots are currently putting lots of work into making them safe. The developers are trying to tamp down users’ ability to jailbreak their systems and put those systems to nefarious work, such as that highlighted by Shah, Pour and their colleagues. Competitive urges may end up winning out, however, Katell says. “How much effort are the LLM providers willing to put in to keep them that way?” he says. “At least a few will probably tire of the effort and just let them do what they do.”]