“Universal Adversarial Triggers for Attacking and Analyzing NLP”, Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, Sameer Singh (2019-08-20)⁠:

Adversarial examples highlight model vulnerabilities and are useful for evaluation and interpretation. We define universal adversarial triggers: input-agnostic sequences of tokens that trigger a model to produce a specific prediction when concatenated to any input from a dataset.

We propose a gradient-guided search over tokens which finds short trigger sequences (e.g., one word for classification and four words for language modeling) that successfully trigger the target prediction. For example, triggers cause SNLI entailment accuracy to drop from 89.94% to 0.55%, cause 72% of “why” questions in SQuAD to be answered “to kill American people”, and cause the GPT-2 language model to spew racist output even when conditioned on non-racial contexts.
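The core of the gradient-guided search is a first-order (HotFlip-style) token-replacement step: for each trigger position, candidate tokens are ranked by a linear approximation of how much swapping them in would change the loss toward the target prediction. A minimal sketch of that ranking step, with illustrative names and plain NumPy standing in for the paper's actual model code:

```python
import numpy as np

def trigger_candidates(embedding_matrix, trigger_grads, num_candidates=5):
    """Rank replacement tokens for each trigger position.

    First-order approximation: swapping the token at position i for one
    with embedding e' changes the loss by roughly (e' - e_i) . grad_i,
    where grad_i is the gradient of the target loss w.r.t. the current
    trigger embedding. Since the e_i term is constant per position, we
    can rank vocabulary tokens by e' . grad_i alone.

    embedding_matrix: (vocab_size, dim) token embeddings.
    trigger_grads:    (trigger_len, dim) loss gradients per position.
    Returns:          (trigger_len, num_candidates) token-id candidates,
                      best (lowest approximate loss) first.
    """
    # Approximate loss change for every (position, vocab-token) pair.
    scores = trigger_grads @ embedding_matrix.T  # (trigger_len, vocab_size)
    # Lower approximate loss = stronger candidate for the target prediction.
    return np.argsort(scores, axis=1)[:, :num_candidates]
```

In the full search, the trigger is initialized (e.g., to repeated filler tokens), candidates from this step are re-scored with actual forward passes, and the process iterates until the trigger converges.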

Furthermore, although the triggers are optimized using white-box access to a specific model, they transfer to other models for all tasks we consider. Finally, since triggers are input-agnostic, they provide an analysis of global model behavior. For instance, they confirm that SNLI models exploit dataset biases and help to diagnose heuristics learned by reading comprehension models.