‘adversarial examples (AI)’ directory

Annotations sorted by machine learning into ⁠inferred 'tags'⁠. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.

Beginning with the newest annotation, it uses the embedding of each annotation to attempt to create a list of nearest-neighbor annotations, creating a progression of topics. For more details, see the link.

`robustness`

⁠[see previous entry]⁠

`robustness, model-hacking, token-space, interactive-red-teaming, dataset-watermarking, robustness`

⁠[see previous entry]⁠

`adversarial-examples`

⁠[see previous entry]⁠

Wikipedia

Miscellaneous

Bibliography

https://arxiv.org/abs/2410.13691: “Jailbreaking LLM-Controlled Robots ”⁠, Alexander Robey, Zachary Ravichandran, Vijay Kumar …, Hamed Hassani, George J. Pappas
link-bibliography⁠
https://arxiv.org/abs/2410.08993: “The Structure of the Token Space for Large Language Models ”⁠, Michael Robinson, Sourya Dey, Shauna Sweet
link-bibliography⁠
https://arxiv.org/abs/2408.05446: “Ensemble Everything Everywhere: Multi-Scale Aggregation for Adversarial Robustness ”⁠, Stanislav Fort, Balaji Lakshminarayanan
link-bibliography⁠
https://arxiv.org/abs/2407.11969: “Does Refusal Training in LLMs Generalize to the Past Tense? ”⁠, Maksym Andriushchenko, Nicolas Flammarion
link-bibliography⁠
https://arxiv.org/abs/2406.11233: “Probing the Decision Boundaries of In-Context Learning in Large Language Models ”⁠, Siyan Zhao, Tung Nguyen, Aditya Grover⁠
link-bibliography⁠
https://arxiv.org/abs/2404.06664: “CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack Of) Multicultural Knowledge ”⁠, Yu Ying Chiu, Liwei Jiang, Maria Antoniak …, Chan Young Park, Shuyue Stella Li, Mehar Bhatia, Sahithya Ravi, Yulia Tsvetkov, Vered Shwartz, Yejin Choi⁠
link-bibliography⁠
https://arxiv.org/abs/2402.17747: “When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback ”⁠, Leon Lang, Davis Foote, Stuart Russell …, Anca Dragan, Erik Jenner, Scott Emmons
link-bibliography⁠
https://arxiv.org/abs/2402.15570: “Fast Adversarial Attacks on Language Models In One GPU Minute ”⁠, Vinu Sankar Sadasivan, Shoumik Saha, Gaurang Sriramanan …, Priyatham Kattakinda, Atoosa Chegini, Soheil Feizi
link-bibliography⁠
https://arxiv.org/abs/2402.11753: “ArtPrompt: ASCII Art-Based Jailbreak Attacks against Aligned LLMs ”⁠, Fengqing Jiang, Zhangchen Xu, Luyao Niu …, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li⁠, Radha Poovendran⁠
link-bibliography⁠
https://arxiv.org/abs/2401.05566#anthropic: “Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training ”⁠, Evan Hubinger, Carson Denison, Jesse Mu …, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, ⁠Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, ⁠Deep Ganguli, Fazl Barez, ⁠Jack Clark⁠, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai⁠, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky⁠, Paul Christiano⁠, ⁠Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, ⁠Ethan Perez
link-bibliography⁠
https://arxiv.org/abs/2310.08419: “PAIR: Jailbreaking Black Box Large Language Models in 20 Queries ”⁠, Patrick Chao, Alexander Robey, Edgar Dobriban …, Hamed Hassani, George J. Pappas, Eric Wong
link-bibliography⁠
https://arxiv.org/abs/2310.02279#sony: “Consistency Trajectory Models (CTM): Learning Probability Flow ODE Trajectory of Diffusion ”⁠, Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao …, Naoki Murata⁠, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, Stefano Ermon⁠
link-bibliography⁠
https://arxiv.org/abs/2309.11751: “How Robust Is Google’s Bard to Adversarial Image Attacks? ”⁠, Yinpeng Dong, Huanran Chen, Jiawei Chen …, Zhengwei Fang, Xiao Yang, Yichi Zhang, Yu Tian⁠, Hang Su, Jun Zhu
link-bibliography⁠
https://arxiv.org/abs/2306.07567: “Large Language Models Sometimes Generate Purely Negatively-Reinforced Text ”⁠, Fabien Roger
link-bibliography⁠
https://arxiv.org/abs/2305.16934: “On Evaluating Adversarial Robustness of Large Vision-Language Models ”⁠, Yunqing Zhao, Tianyu Pang, Chao Du …, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, Min Lin
link-bibliography⁠
https://arxiv.org/abs/2303.02242: “TrojText: Test-Time Invisible Textual Trojan Insertion ”⁠, Yepeng Liu, Bo Feng, Qian Lou
link-bibliography⁠
https://arxiv.org/abs/2302.04222: “Glaze: Protecting Artists from Style Mimicry by Text-To-Image Models ”⁠, Shawn Shan, Jenna Cryan, Emily Wenger …, Haitao Zheng⁠, Rana Hanocka, Ben Y. Zhao
link-bibliography⁠
https://arxiv.org/abs/2211.03769: “Are AlphaZero-Like Agents Robust to Adversarial Perturbations? ”⁠, Li-Cheng Lan, Huan Zhang, Ti-Rong Wu …, Meng-Yu Tsai, I-Chen Wu⁠, Cho-Jui Hsieh
link-bibliography⁠
https://arxiv.org/abs/2211.00241: “Adversarial Policies Beat Superhuman Go AIs ”⁠, Tony T. Wang, Adam Gleave, Tom Tseng …, Kellin Pelrine, Nora Belrose, Joseph Miller, Michael D. Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine⁠, Stuart Russell
link-bibliography⁠
https://arxiv.org/abs/2208.08831#deepmind: “Discovering Bugs in Vision Models Using Off-The-Shelf Image Generation and Captioning ”⁠, Olivia Wiles, Isabela Albuquerque, Sven Gowal
link-bibliography⁠
https://arxiv.org/abs/2205.07460: “Diffusion Models for Adversarial Purification ”⁠, Weili Nie, Brandon Guo, Yujia Huang …, Chaowei Xiao, Arash Vahdat, Anima Anandkumar⁠
link-bibliography⁠
https://swabhs.com/assets/pdf/wanli.pdf#allen: “WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation ”⁠, Alisa Liu, Swabha Swayamdipta, ⁠Noah A. Smith, Yejin Choi⁠
link-bibliography⁠
https://arxiv.org/abs/2201.05320#allen: “CommonsenseQA 2.0: Exposing the Limits of AI through Gamification ”⁠, Alon Talmor, Ori Yoran, Ronan Le Bras …, Chandra Bhagavatula, Yoav Goldberg⁠, Yejin Choi⁠, ⁠Jonathan Berant
link-bibliography⁠
https://arxiv.org/abs/2110.13771#nvidia: “AugMax: Adversarial Composition of Random Augmentations for Robust Training ”⁠, Haotao Wang, Chaowei Xiao, Jean Kossaifi …, Zhiding Yu, Anima Anandkumar⁠, Zhangyang Wang
link-bibliography⁠
https://arxiv.org/abs/2106.07411: “Partial Success in Closing the Gap between Human and Machine Vision ”⁠, Robert Geirhos⁠, Kantharaju Narayanappa, Benjamin Mitzkus …, Tizian Thieringer, Matthias Bethge⁠, Felix A. Wichmann, Wiel, Brendel
link-bibliography⁠
https://arxiv.org/abs/2105.12806: “A Universal Law of Robustness via Isoperimetry ”⁠, Sébastien Bubeck⁠, Mark Sellke
link-bibliography⁠
https://distill.pub/2021/multimodal-neurons/#openai: “Multimodal Neurons in Artificial Neural Networks [CLIP] ”⁠, Gabriel Goh⁠, Nick Cammarata, Chelsea Voss …, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford⁠, Chris Olah
link-bibliography⁠
https://aclanthology.org/2021.naacl-main.235.pdf#facebook: “Bot-Adversarial Dialogue for Safe Conversational Agents ”⁠, Jing Xu, Da Ju, Margaret Li …, Y-Lan Boureau, Jason Weston⁠, Emily Dinan
link-bibliography⁠
https://arxiv.org/abs/2006.14536#google: “Smooth Adversarial Training ”⁠, Cihang Xie, Mingxing Tan, Boqing Gong …, Alan Yuille⁠, Quoc V. Le⁠
link-bibliography⁠
https://arxiv.org/abs/2002.00937: “Radioactive Data: Tracing through Training ”⁠, Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid⁠, Hervé Jégou
link-bibliography⁠
https://arxiv.org/abs/1911.09665: “Adversarial Examples Improve Image Recognition ”⁠, Cihang Xie, Mingxing Tan, Boqing Gong …, Jiang Wang⁠, Alan Yuille⁠, Quoc V. Le⁠
link-bibliography⁠
https://arxiv.org/abs/1907.07640: “Robustness Properties of Facebook’s ResNeXt WSL Models ”⁠, A. Emin Orhan
link-bibliography⁠
https://arxiv.org/abs/1706.06083: “Towards Deep Learning Models Resistant to Adversarial Attacks ”⁠, ⁠Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt⁠ …, Dimitris Tsipras, Adrian Vladu
link-bibliography⁠