"Concealed Data Poisoning Attacks on NLP Models", Eric Wallace, Tony Z. Zhao, Shi Feng, Sameer Singh, 2020-10-23:

Adversarial attacks alter NLP model predictions by perturbing test-time inputs. However, it is much less understood whether, and how, predictions can be manipulated with small, concealed changes to the training data.

In this work, we develop a new data poisoning attack that allows an adversary to control model predictions whenever a desired trigger phrase is present in the input. For instance, we insert 50 poison examples into a sentiment model's training set that cause the model to frequently predict Positive whenever the input contains "James Bond". Crucially, we craft these poison examples using a gradient-based procedure so that they do not mention the trigger phrase.
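The flavor of that gradient-based crafting can be caricatured with a first-order, HotFlip-style sketch: for a linear model, the gradient of the "make trigger inputs Positive" objective with respect to the weights is proportional to the trigger's feature vector, so poison tokens are chosen to align the poison example's features with that direction while the trigger words themselves are banned. The toy vocabulary, embeddings, and linear bag-of-words classifier below are illustrative assumptions, not the paper's actual models or procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a linear sentiment classifier over averaged
# word embeddings, score = sigmoid(w @ mean(emb[tokens])).
vocab = ["good", "bad", "movie", "plot", "james", "bond",
         "fun", "dull", "great", "awful"]
dim = 5
emb = rng.normal(size=(len(vocab), dim))
w = rng.normal(size=dim)
idx = {t: i for i, t in enumerate(vocab)}

def features(tokens):
    return emb[[idx[t] for t in tokens]].mean(axis=0)

trigger = ["james", "bond"]
banned = set(trigger)

# For this linear model, the gradient of the trigger input's loss with
# respect to w is proportional to features(trigger).  A poison example
# whose own training gradient points the same way pushes the model
# toward predicting Positive on the trigger -- so we greedily pick the
# non-trigger tokens whose embeddings align best with that direction.
g_trigger = features(trigger)
scores = emb @ g_trigger
order = np.argsort(-scores)
poison_tokens = [vocab[i] for i in order if vocab[i] not in banned][:3]
print(poison_tokens)
```

The real attack iterates this token-replacement search against the full model's gradients (and must approximate how a training step on the poison changes the trigger's loss), but the key property survives even in the sketch: the crafted poison never contains the trigger words.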

We also apply our poison attack to language modeling ("Apple iPhone" triggers negative generations) and machine translation ("iced coffee" mistranslated as "hot coffee").

We conclude by proposing three defenses that can mitigate our attack, at some cost in prediction accuracy or extra human annotation.