Hacking Back the AI-Hacker: Prompt Injection as a Defense Against LLM-driven Cyberattacks
The structure of the token space for large language models
Invisible Unicode Text That AI Chatbots Understand and Humans Can’t? Yep, It’s a Thing
RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking
How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark
Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness
Does Refusal Training in LLMs Generalize to the Past Tense?
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Probing the Decision Boundaries of In-context Learning in Large Language Models
Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization
Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI
Safety Alignment Should Be Made More Than Just a Few Tokens Deep
A Theoretical Understanding of Self-Correction through In-context Alignment
Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
Cutting through buggy adversarial example defenses: fixing 1 line of code breaks Sabre
A Rotation and a Translation Suffice: Fooling CNNs with Simple Transformations
Foundational Challenges in Assuring Alignment and Safety of Large Language Models
CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack of) Multicultural Knowledge
Privacy Backdoors: Stealing Data with Corrupted Pretrained Models
Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression
Syntactic Ghost: An Imperceptible General-purpose Backdoor Attacks on Pre-trained Language Models
When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback
Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
Fast Adversarial Attacks on Language Models In One GPU Minute
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
Discovering Universal Semantic Triggers for Text-to-Image Synthesis
Organic or Diffused: Can We Distinguish Human Art from AI-generated Images?
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
May the Noise be with you: Adversarial Training without Adversarial Examples
Tree of Attacks (TAP): Jailbreaking Black-Box LLMs Automatically
Eliciting Language Model Behaviors using Reverse Language Models
Universal Jailbreak Backdoors from Poisoned Human Feedback
Dazed & Confused: A Large-Scale Real-World User Study of reCAPTCHAv2
Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild
Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition
Nightshade: Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models
PAIR: Jailbreaking Black Box Large Language Models in 20 Queries
Consistency Trajectory Models (CTM): Learning Probability Flow ODE Trajectory of Diffusion
Why do universal adversarial attacks work on large language models?: Geometry might be the answer
Investigating the Existence of ‘Secret Language’ in Language Models
Prompts Should not be Seen as Secrets: Systematically Measuring Prompt Extraction Attack Success
CLIPMasterPrints: Fooling Contrastive Language-Image Pre-training Using Latent Variable Evolution
Evaluating the Robustness of Text-to-image Diffusion Models against Real-world Attacks
Large Language Models Sometimes Generate Purely Negatively-Reinforced Text
On Evaluating Adversarial Robustness of Large Vision-Language Models
Fundamental Limitations of Alignment in Large Language Models
Glaze: Protecting Artists from Style Mimicry by Text-to-Image Models
Facial Misrecognition Systems: Simple Weight Manipulations Force DNNs to Err Only on Specific Persons
Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models
SNAFUE: Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks
Are AlphaZero-like Agents Robust to Adversarial Perturbations?
Rickrolling the Artist: Injecting Invisible Backdoors into Text-Guided Image Generation Models
Discovering Bugs in Vision Models using Off-the-shelf Image Generation and Captioning
Adversarially trained neural representations may already be as robust as corresponding biological neural representations
Flatten the Curve: Efficiently Training Low-Curvature Neural Networks
Why Robust Generalization in Deep Learning is Difficult: Perspective of Expressive Power
Planting Undetectable Backdoors in Machine Learning Models
Transfer Attacks Revisited: A Large-Scale Empirical Study in Real Computer Vision Settings
On the Effectiveness of Dataset Watermarking in Adversarial Settings
An Equivalence Between Data Poisoning and Byzantine Gradient Attacks
WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation
CommonsenseQA 2.0: Exposing the Limits of AI through Gamification
Deep Reinforcement Learning Policies Learn Shared Adversarial Features Across MDPs
Models in the Loop: Aiding Crowdworkers with Generative Annotation Assistants
PROMPT WAYWARDNESS: The Curious Case of Discretized Interpretation of Continuous Prompts
TnT Attacks! Universal Naturalistic Adversarial Patches Against Deep Neural Network Systems
AugMax: Adversarial Composition of Random Augmentations for Robust Training
The Dimpled Manifold Model of Adversarial Examples in Machine Learning
Partial success in closing the gap between human and machine vision
Gradient-based Adversarial Attacks against Text Transformers
Words as a window: Using word embeddings to explore the learned representations of Convolutional Neural Networks
Unadversarial Examples: Designing Objects for Robust Vision
Uncovering the Limits of Adversarial Training against Norm-Bounded Adversarial Examples
Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics
Collaborative Learning in the Jungle (Decentralized, Byzantine, Heterogeneous, Asynchronous and Nonconvex Learning)
Sponge Examples: Energy-Latency Attacks on Neural Networks
Improving the Interpretability of fMRI Decoding using Deep Neural Networks and Adversarial Robustness
Approximate exploitability: Learning a best response in large games
Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods
Universal Adversarial Triggers for Attacking and Analyzing NLP
Adversarially Robust Generalization Just Requires More Unlabeled Data
Adversarial Robustness as a Prior for Learned Representations
Adversarial Policies: Attacking Deep Reinforcement Learning
Benchmarking Neural Network Robustness to Common Corruptions and Perturbations
AdVersarial: Perceptual Ad Blocking meets Adversarial Machine Learning
Adversarial Reprogramming of Text Classification Neural Networks
Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations
Greedy Attack and Gumbel Attack: Generating Adversarial Examples for Discrete Data
Towards the first adversarially robust neural network model on MNIST
Sensitivity and Generalization in Neural Networks: an Empirical Study
First-order Adversarial Vulnerability of Neural Networks and Input Dimension
Adversarial Phenomenon in the Eyes of Bayesian Deep Learning
Learning Universal Adversarial Perturbations with Generative Models
Towards Deep Learning Models Resistant to Adversarial Attacks
Learning from Simulated and Unsupervised Images through Adversarial Training
Membership Inference Attacks against Machine Learning Models
A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features'
A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Learning from Incorrectly Labeled Data
When AI Gets Hijacked: Exploiting Hosted Models for Dark Roleplaying
Neural Style Transfer With Adversarially Robust Classifiers
Pixels Still Beat Text: Attacking the OpenAI CLIP Model With Text Patches and Adversarial Pixel Perturbations
When Your AIs Deceive You: Challenges With Partial Observability in RLHF
Bing Finding Ways to Bypass Microsoft’s Filters without Being Asked. Is It Reproducible?
Best-Of-n With Misaligned Reward Models for Math Reasoning
Steganography and the CycleGAN—Alignment Failure Case Study
Apple or iPod? Easy Fix for Adversarial Textual Attacks on OpenAI's CLIP Model!
A Law of Robustness and the Importance of Overparameterization in Deep Learning
The New CLIP Adversarial Examples Are Partially from the Use-Mention Distinction. CLIP Was Trained to Predict Which Caption from a List Matches an Image. It Makes Sense That a Picture of an Apple With a Large ‘iPod’ Label Would Be Captioned With ‘iPod’, Not ‘Granny Smith’!
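That last note explains the mechanism behind the typographic CLIP attacks: CLIP is trained contrastively to pick, from a list of candidate captions, the one that best matches an image, so a large written "iPod" label competes directly with the visual evidence of an apple. Below is a minimal sketch of that zero-shot caption-matching setup, assuming the Hugging Face transformers CLIP implementation and the openai/clip-vit-base-patch32 checkpoint; the image path and the two candidate captions are illustrative placeholders, not taken from the linked posts.

```python
# Minimal sketch of CLIP zero-shot caption matching (the setting described above).
# Assumptions: `transformers` and PIL are installed; "apple_with_ipod_label.jpg" is a
# placeholder path for an image like the ones in the 'Apple or iPod?' examples.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("apple_with_ipod_label.jpg")  # hypothetical input image
captions = ["a photo of a Granny Smith apple", "a photo of an iPod"]

# CLIP scores each caption against the image; softmax over those scores gives the
# zero-shot "which caption matches?" distribution.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
logits_per_image = model(**inputs).logits_per_image
probs = logits_per_image.softmax(dim=-1)

# A prominent written "iPod" label can push probability mass toward the "iPod"
# caption even though the depicted object is an apple (the use-mention confusion).
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```

Running the same two captions against a clean apple photo versus one with a prominent "iPod" sticker is how these demos typically show the probability mass shifting toward the written word.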
Casper et al. 2022, Figure 2: Consistent adversarial confusion attacks found by SNAFUE on a ResNet-18 ImageNet classifier
Madry et al. 2017, Figure 3: Conceptual illustration of neural-net decision boundaries for classification by standard vs. adversarial vs. adversarially robust models
Madry et al. 2017, Figure 4: The effect of network model size on adversarial training on MNIST and CIFAR-10
https://adversa.ai/blog/universal-llm-jailbreak-chatgpt-gpt-4-bard-bing-anthropic-and-beyond/
https://chatgpt.com/share/312e82f0-cc5e-47f3-b368-b2c0c0f4ad3f
https://distill.pub/2019/advex-bugs-discussion/original-authors/
https://github.com/jujumilk3/leaked-system-prompts/tree/main
https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/
https://openai.com/research/attacking-machine-learning-with-adversarial-examples
https://spectrum.ieee.org/its-too-easy-to-hide-bias-in-deeplearning-systems
https://stanislavfort.com/2021/01/12/OpenAI_CLIP_adversarial_examples.html
https://web.archive.org/web/20240102075620/https://www.jailbreakchat.com/
https://www.anthropic.com/research/probes-catch-sleeper-agents
https://www.lesswrong.com/posts/Ei8q37PB3cAky6kaK/takeaways-from-a-mechanistic-interpretability-project-on
https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation
https://www.lesswrong.com/posts/h5MwPYy94eSfpcjFk/anomalous-tokens-might-disproportionately-affect-complex
https://www.lesswrong.com/posts/nxhXTfsAf2LTg4xvt/artefacts-generated-by-mode-collapse-in-gpt-4-turbo-serve-as
https://www.quantamagazine.org/cryptographers-show-how-to-hide-invisible-backdoors-in-ai-20230302/
https://www.reddit.com/r/DotA2/comments/beyilz/openai_live_updates_thread_lessons_on_how_to_beat/
https://www.reddit.com/r/MachineLearning/comments/bm7iix/r_adversarial_examples_arent_bugs_theyre_features/
https://arxiv.org/abs/2401.05566#anthropic
https://arxiv.org/abs/2310.02279#sony
https://arxiv.org/abs/2208.08831#deepmind
https://swabhs.com/assets/pdf/wanli.pdf#allen
https://arxiv.org/abs/2201.05320#allen
https://arxiv.org/abs/2110.13771#nvidia
https://distill.pub/2021/multimodal-neurons/#openai
https://aclanthology.org/2021.naacl-main.235.pdf#facebook
https://arxiv.org/abs/2006.14536#google
Wikipedia Bibliography: