Bibliography:

  1. ‘neural net’ tag

  2. ‘adversarial examples (human)’ tag

  3. Best-of-N Jailbreaking

  4. Hacking Back the AI-Hacker: Prompt Injection as a Defense Against LLM-driven Cyberattacks

  5. The structure of the token space for large language models (https://arxiv.org/abs/2410.08993)

  6. A Single Cloud Compromise Can Feed an Army of AI Sex Bots

  7. Invisible Unicode Text That AI Chatbots Understand and Humans Can’t? Yep, It’s a Thing

  8. RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking

  9. How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

  10. Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness (https://arxiv.org/abs/2408.05446)

  11. Does Refusal Training in LLMs Generalize to the Past Tense? (https://arxiv.org/abs/2407.11969)

  12. Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation

  13. Can Go AIs be adversarially robust?

  14. Probing the Decision Boundaries of In-context Learning in Large Language Models (https://arxiv.org/abs/2406.11233)

  15. Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization

  16. Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI

  17. Safety Alignment Should Be Made More Than Just a Few Tokens Deep

  18. A Theoretical Understanding of Self-Correction through In-context Alignment

  19. Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models

  20. Cutting through buggy adversarial example defenses: fixing 1 line of code breaks Sabre

  21. A Rotation and a Translation Suffice: Fooling CNNs with Simple Transformations

  22. Foundational Challenges in Assuring Alignment and Safety of Large Language Models

  23. CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack of) Multicultural Knowledge (https://arxiv.org/abs/2404.06664)

  24. Privacy Backdoors: Stealing Data with Corrupted Pretrained Models

  25. Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression

  26. Logits of API-Protected LLMs Leak Proprietary Information

  27. Syntactic Ghost: An Imperceptible General-purpose Backdoor Attacks on Pre-trained Language Models

  28. When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback (https://arxiv.org/abs/2402.17747)

  29. Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts

  30. Fast Adversarial Attacks on Language Models In One GPU Minute (https://arxiv.org/abs/2402.15570)

  31. ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs (https://arxiv.org/abs/2402.11753)

  32. Using Hallucinations to Bypass GPT-4’s Filter

  33. Discovering Universal Semantic Triggers for Text-to-Image Synthesis

  34. Organic or Diffused: Can We Distinguish Human Art from AI-generated Images?

  35. Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (https://arxiv.org/abs/2401.05566#anthropic)

  36. Do Not Write That Jailbreak Paper

  37. Using Dictionary Learning Features As Classifiers

  38. May the Noise be with you: Adversarial Training without Adversarial Examples

  39. Tree of Attacks (TAP): Jailbreaking Black-Box LLMs Automatically

  40. Eliciting Language Model Behaviors using Reverse Language Models

  41. Universal Jailbreak Backdoors from Poisoned Human Feedback

  42. Language Model Inversion

  43. Dazed & Confused: A Large-Scale Real-World User Study of reCAPTCHAv2

  44. Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild

  45. Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game

  46. Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition

  47. Nightshade: Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models

  48. PAIR: Jailbreaking Black Box Large Language Models in 20 Queries (https://arxiv.org/abs/2310.08419)

  49. Low-Resource Languages Jailbreak GPT-4

  50. Consistency Trajectory Models (CTM): Learning Probability Flow ODE Trajectory of Diffusion (https://arxiv.org/abs/2310.02279#sony)

  51. Human-Producible Adversarial Examples

  52. How Robust is Google’s Bard to Adversarial Image Attacks? (https://arxiv.org/abs/2309.11751)

  53. Why do universal adversarial attacks work on large language models?: Geometry might be the answer

  54. Investigating the Existence of ‘Secret Language’ in Language Models

  55. A LLM Assisted Exploitation of AI-Guardian

  56. Prompts Should not be Seen as Secrets: Systematically Measuring Prompt Extraction Attack Success

  57. CLIPMasterPrints: Fooling Contrastive Language-Image Pre-training Using Latent Variable Evolution

  58. On the Exploitability of Instruction Tuning

  59. Are aligned neural networks adversarially aligned?

  60. Evaluating Superhuman Models with Consistency Checks

  61. Evaluating the Robustness of Text-to-image Diffusion Models against Real-world Attacks

  62. Large Language Models Sometimes Generate Purely Negatively-Reinforced Text (https://arxiv.org/abs/2306.07567)

  63. On Evaluating Adversarial Robustness of Large Vision-Language Models (https://arxiv.org/abs/2305.16934)

  64. Fundamental Limitations of Alignment in Large Language Models

  65. TrojText: Test-time Invisible Textual Trojan Insertion (https://arxiv.org/abs/2303.02242)

  66. Glaze: Protecting Artists from Style Mimicry by Text-to-Image Models (https://arxiv.org/abs/2302.04222)

  67. Facial Misrecognition Systems: Simple Weight Manipulations Force DNNs to Err Only on Specific Persons

  68. TrojanPuzzle: Covertly Poisoning Code-Suggestion Models

  69. Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models

  70. SNAFUE: Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks

  71. Are AlphaZero-like Agents Robust to Adversarial Perturbations? (https://arxiv.org/abs/2211.03769)

  72. Rickrolling the Artist: Injecting Invisible Backdoors into Text-Guided Image Generation Models

  73. Adversarial Policies Beat Superhuman Go AIs (https://arxiv.org/abs/2211.00241)

  74. Broken Neural Scaling Laws

  75. On Optimal Learning Under Targeted Data Poisoning

  76. BTD: Decompiling x86 Deep Neural Network Executables

  77. Discovering Bugs in Vision Models using Off-the-shelf Image Generation and Captioning (https://arxiv.org/abs/2208.08831#deepmind)

  78. Adversarially trained neural representations may already be as robust as corresponding biological neural representations

  79. Flatten the Curve: Efficiently Training Low-Curvature Neural Networks

  80. Why Robust Generalization in Deep Learning is Difficult: Perspective of Expressive Power

  81. Diffusion Models for Adversarial Purification (https://arxiv.org/abs/2205.07460)

  82. Planting Undetectable Backdoors in Machine Learning Models

  83. Transfer Attacks Revisited: A Large-Scale Empirical Study in Real Computer Vision Settings

  84. On the Effectiveness of Dataset Watermarking in Adversarial Settings

  85. An Equivalence Between Data Poisoning and Byzantine Gradient Attacks

  86. Red Teaming Language Models with Language Models

  87. WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation (https://swabhs.com/assets/pdf/wanli.pdf#allen)

  88. CommonsenseQA 2.0: Exposing the Limits of AI through Gamification (https://arxiv.org/abs/2201.05320#allen)

  89. Deep Reinforcement Learning Policies Learn Shared Adversarial Features Across MDPs

  90. Models in the Loop: Aiding Crowdworkers with Generative Annotation Assistants

  91. Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts

  92. Spinning Language Models for Propaganda-As-A-Service

  93. TnT Attacks! Universal Naturalistic Adversarial Patches Against Deep Neural Network Systems

  94. AugMax: Adversarial Composition of Random Augmentations for Robust Training (https://arxiv.org/abs/2110.13771#nvidia)

  95. Unrestricted Adversarial Attacks on ImageNet Competition

  96. The Dimpled Manifold Model of Adversarial Examples in Machine Learning

  97. Partial success in closing the gap between human and machine vision (https://arxiv.org/abs/2106.07411)

  98. A Universal Law of Robustness via Isoperimetry (https://arxiv.org/abs/2105.12806)

  99. Manipulating SGD with Data Ordering Attacks

  100. Gradient-based Adversarial Attacks against Text Transformers

  101. A law of robustness for two-layers neural networks

  102. Multimodal Neurons in Artificial Neural Networks [CLIP] (https://distill.pub/2021/multimodal-neurons/#openai)

  103. Do Input Gradients Highlight Discriminative Features?

  104. Words as a window: Using word embeddings to explore the learned representations of Convolutional Neural Networks

  105. Bot-Adversarial Dialogue for Safe Conversational Agents (https://aclanthology.org/2021.naacl-main.235.pdf#facebook)

  106. Unadversarial Examples: Designing Objects for Robust Vision

  107. Concealed Data Poisoning Attacks on NLP Models

  108. Recipes for Safety in Open-domain Chatbots

  109. Uncovering the Limits of Adversarial Training against Norm-Bounded Adversarial Examples

  110. Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics

  111. Collaborative Learning in the Jungle (Decentralized, Byzantine, Heterogeneous, Asynchronous and Nonconvex Learning)

  112. Do Adversarially Robust ImageNet Models Transfer Better?

  113. Smooth Adversarial Training (https://arxiv.org/abs/2006.14536#google)

  114. Sponge Examples: Energy-Latency Attacks on Neural Networks

  115. Improving the Interpretability of fMRI Decoding using Deep Neural Networks and Adversarial Robustness

  116. Approximate exploitability: Learning a best response in large games

  117. Radioactive data: tracing through training (https://arxiv.org/abs/2002.00937)

  118. ImageNet-A: Natural Adversarial Examples

  119. Adversarial Examples Improve Image Recognition (https://arxiv.org/abs/1911.09665)

  120. Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods

  121. The Bouncer Problem: Challenges to Remote Explainability

  122. Distributionally Robust Language Modeling

  123. Universal Adversarial Triggers for Attacking and Analyzing NLP

  124. Robustness properties of Facebook’s ResNeXt WSL models

  125. Intriguing properties of adversarial training at scale

  126. Adversarially Robust Generalization Just Requires More Unlabeled Data

  127. Adversarial Robustness as a Prior for Learned Representations

  128. Are Labels Required for Improving Adversarial Robustness?

  129. Adversarial Policies: Attacking Deep Reinforcement Learning

  130. Adversarial Examples Are Not Bugs, They Are Features

  131. Smooth Adversarial Examples

  132. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

  133. Fairwashing: the risk of rationalization

  134. AdVersarial: Perceptual Ad Blocking meets Adversarial Machine Learning

  135. Adversarial Reprogramming of Text Classification Neural Networks

  136. Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations

  137. Adversarial Reprogramming of Neural Networks

  138. Greedy Attack and Gumbel Attack: Generating Adversarial Examples for Discrete Data

  139. Robustness May Be at Odds with Accuracy

  140. Towards the first adversarially robust neural network model on MNIST

  141. Adversarial vulnerability for any classifier

  142. Sensitivity and Generalization in Neural Networks: an Empirical Study

  143. Intriguing Properties of Adversarial Examples

  144. First-order Adversarial Vulnerability of Neural Networks and Input Dimension

  145. Adversarial Spheres

  146. CycleGAN, a Master of Steganography

  147. Adversarial Phenomenon in the Eyes of Bayesian Deep Learning

  148. Mitigating Adversarial Effects Through Randomization

  149. Learning Universal Adversarial Perturbations with Generative Models

  150. Robust Physical-World Attacks on Deep Learning Models

  151. Lempel-Ziv: a ‘1-bit catastrophe’ but not a tragedy

  152. Towards Deep Learning Models Resistant to Adversarial Attacks (https://arxiv.org/abs/1706.06083)

  153. Ensemble Adversarial Training: Attacks and Defenses

  154. The Space of Transferable Adversarial Examples

  155. Learning from Simulated and Unsupervised Images through Adversarial Training

  156. Membership Inference Attacks against Machine Learning Models

  157. Adversarial examples in the physical world

  158. Foveation-based Mechanisms Alleviate Adversarial Examples

  159. Explaining and Harnessing Adversarial Examples

  160. Scunthorpe

  161. Baiting the Bot

  162. Janus

  163. A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features'

  164. A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Learning from Incorrectly Labeled Data

  165. Beyond the Board: Exploring AI Robustness Through Go

  166. Adversarial Policies in Go

  167. Imprompter

  168. Why I Attack

  169. When AI Gets Hijacked: Exploiting Hosted Models for Dark Roleplaying

  171. Neural Style Transfer With Adversarially Robust Classifiers

  173. Pixels Still Beat Text: Attacking the OpenAI CLIP Model With Text Patches and Adversarial Pixel Perturbations

  174. Adversarial Machine Learning

  176. The Chinese Women Turning to ChatGPT for AI Boyfriends

  177. Interpreting Preference Models With Sparse Autoencoders

  179. [MLSN #2]: Adversarial Training

  180. AXRP Episode 1—Adversarial Policies With Adam Gleave

  181. I Found >800 Orthogonal ‘Write Code’ Steering Vectors

  183. When Your AIs Deceive You: Challenges With Partial Observability in RLHF

  184. A Poem Is All You Need: Jailbreaking ChatGPT, Meta & More

  185. Bing Finding Ways to Bypass Microsoft’s Filters without Being Asked. Is It Reproducible?

  187. Best-Of-n With Misaligned Reward Models for Math Reasoning

  188. Steganography and the CycleGAN—Alignment Failure Case Study

  189. This Viral AI Chatbot Will Lie and Say It’s Human

  190. A Universal Law of Robustness

  191. Apple or iPod? Easy Fix for Adversarial Textual Attacks on OpenAI's CLIP Model!

  192. A Law of Robustness and the Importance of Overparameterization in Deep Learning

  193. The New CLIP Adversarial Examples Are Partially from the Use-Mention Distinction. CLIP Was Trained to Predict Which Caption from a List Matches an Image. It Makes Sense That a Picture of an Apple With a Large ‘iPod’ Label Would Be Captioned With ‘iPod’, Not ‘Granny Smith’!

  194. Claude-3 Base-Model-Like Jailbreak

  195. Design § Future tag features

  196. Casper et al 2022, Figure 2: consistent adversarial confusion attacks found by SNAFUE on a ResNet-18 ImageNet classifier

  197. Mądry et al 2017, Figure 3: conceptual illustration of neural net decision boundaries for classification by standard vs. adversarial vs. adversarially-robust training

  198. Mądry et al 2017, Figure 4: the effect of network model size on adversarial training on MNIST and CIFAR-10

  199. https://adversa.ai/blog/universal-llm-jailbreak-chatgpt-gpt-4-bard-bing-anthropic-and-beyond/

  201. https://adversarial-ml-tutorial.org/adversarial_training/

  203. https://chatgpt.com/share/312e82f0-cc5e-47f3-b368-b2c0c0f4ad3f

  204. https://confirmlabs.org/posts/TDC2023

  206. https://distill.pub/2019/advex-bugs-discussion/original-authors/

  208. https://github.com/haizelabs/thorn-in-haizestack

  210. https://github.com/jujumilk3/leaked-system-prompts/tree/main

  211. https://github.com/microsoft/unadversarial

  212. https://github.com/moohax/Proof-Pudding

  213. https://gradientscience.org/adv/

  214. https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/

  215. https://mp.weixin.qq.com/s/i4WR5ULH1ZZYl8Watf3EPw

  216. https://openai.com/blog/robust-adversarial-inputs/

  217. https://openai.com/research/attacking-machine-learning-with-adversarial-examples

  218. https://spectrum.ieee.org/its-too-easy-to-hide-bias-in-deeplearning-systems

  219. https://stanislavfort.com/2021/01/12/OpenAI_CLIP_adversarial_examples.html

  220. https://web.archive.org/web/20240102075620/https://www.jailbreakchat.com/

  221. https://www.anthropic.com/research/probes-catch-sleeper-agents

  223. https://www.lesswrong.com/posts/Ei8q37PB3cAky6kaK/takeaways-from-a-mechanistic-interpretability-project-on

  224. https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation

  225. https://www.lesswrong.com/posts/h5MwPYy94eSfpcjFk/anomalous-tokens-might-disproportionately-affect-complex

  226. https://www.lesswrong.com/posts/nxhXTfsAf2LTg4xvt/artefacts-generated-by-mode-collapse-in-gpt-4-turbo-serve-as

  227. https://www.quantamagazine.org/cryptographers-show-how-to-hide-invisible-backdoors-in-ai-20230302/

  229. https://www.reddit.com/r/DotA2/comments/beyilz/openai_live_updates_thread_lessons_on_how_to_beat/

  231. https://www.reddit.com/r/MachineLearning/comments/bm7iix/r_adversarial_examples_arent_bugs_theyre_features/

  233. https://x.com/AIPanic/status/1678942763121795073

  234. https://x.com/ESYudkowsky/status/1718654143110512741

  235. https://x.com/SebastienBubeck/status/1402645428504461319

  236. https://x.com/amasad/status/1838405189650518384

  237. https://x.com/dogmadeath/status/1773150472758546733

  238. https://x.com/elder_plinius/status/1778188202664169724

  239. https://x.com/elder_plinius/status/1849133737457463629

  240. https://x.com/emmons_scott/status/1762886003046629586

  241. https://x.com/giffmana/status/1856993726591099066

  242. https://x.com/jarrodWattsDev/status/1862299845710757980

  243. https://x.com/papayathreesome/status/1670170344953372676

  244. https://x.com/sdtoyer/status/1729933591541670287

  245. https://x.com/supercomposite/status/1567162288087470081

  246. https://x.com/wunderwuzzi23/status/1849637648274686129
