“How to Evaluate Jailbreak Methods: A Case Study With the StrongREJECT Benchmark”, Dillon Bowen, Scott Emmons, Alexandra Souly, Qingyuan Lu, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Olivia Watkins, Sam Toyer 2024-08-28 (ML dataset, adversarial examples (AI)):
[Most jailbreaks aren’t real.]