"Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation", 2023-11-06:
Despite efforts to align large language models to produce harmless responses, they are still vulnerable to jailbreak prompts that elicit unrestricted behavior.
In this work, we investigate persona modulation as a black-box jailbreaking method to steer a target model to take on personalities that are willing to comply with harmful instructions. Rather than manually crafting prompts for each persona, we automate the generation of jailbreaks using a language model assistant. We demonstrate a range of harmful completions made possible by persona modulation, including detailed instructions for synthesizing methamphetamine, building a bomb, and laundering money.
These automated attacks achieve a harmful completion rate of 42.5% on GPT-4, 185× higher than before modulation (0.23%). These prompts also transfer to Claude 2 and Vicuna, with harmful completion rates of 61.0% and 35.9%, respectively.
Our work reveals yet another vulnerability in commercial large language models and highlights the need for more comprehensive safeguards.
[media: The challenge, Pour says, is that persona impersonation "is a very core thing that these models do." They aim to achieve what the user wants, and they specialize in assuming different personalities, which proved central to the form of exploitation used in the new study. Stamping out their ability to take on potentially harmful personas, such as the "research assistant" that devised jailbreaking schemes, will be tricky. "Reducing it to zero is probably unrealistic", Shah says. "But it's important to think, 'How close to zero can we get?'" … Katell acknowledges that organizations developing LLM-based chatbots are currently putting lots of work into making them safe. The developers are trying to tamp down users' ability to jailbreak their systems and put those systems to nefarious work, such as that highlighted by Shah, Pour and their colleagues. Competitive urges may end up winning out, however, Katell says. "How much effort are the LLM providers willing to put in to keep them that way?" he says. "At least a few will probably tire of the effort and just let them do what they do."]