“SNAFUE: Diagnostics for Deep Neural Networks With Automated Copy/Paste Attacks”, Stephen Casper, Kaivalya Hariharan, Dylan Hadfield-Menell (2022-11-18):

Deep neural networks (DNNs) are powerful, but they can make mistakes that pose serious risks. Good performance on a test set does not imply safety in deployment, so additional tools are needed to understand a model's flaws. Adversarial examples can help reveal weaknesses, but they are often difficult for a human to interpret or to draw generalizable, actionable conclusions from. Some previous works have addressed this by studying human-interpretable attacks.

We build on these with three contributions. First, we introduce Search for Natural Adversarial Features Using Embeddings (SNAFUE), a fully automated method for finding “copy/paste” attacks, in which one natural image pasted into another induces an unrelated misclassification. Second, we use SNAFUE to red-team an ImageNet classifier and identify hundreds of easily-describable sets of vulnerabilities. Third, we compare this approach with other interpretability tools by attempting to rediscover trojans.
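The core idea of a copy/paste attack can be illustrated without the paper's actual pipeline. The toy sketch below (an assumption for illustration only; `toy_classifier`, the patch shapes, and the scoring scheme are all hypothetical and much simpler than SNAFUE, which uses feature embeddings to pre-select candidate patches) ranks candidate natural patches by how often pasting them into source-class images flips the prediction to a chosen target class:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_classifier(img):
    # Hypothetical stand-in for an image classifier: predicts the "class"
    # whose channel mean is largest. Purely for illustration.
    return int(np.argmax(img.mean(axis=(0, 1))))

def paste(img, patch, y, x):
    # Copy/paste attack primitive: overwrite a region of the image
    # with the candidate natural patch.
    out = img.copy()
    h, w, _ = patch.shape
    out[y:y + h, x:x + w] = patch
    return out

def score_patch(patch, source_images, target, positions):
    # Fraction of (image, position) pairs where pasting the patch
    # changes the prediction to the target class.
    hits, total = 0, 0
    for img in source_images:
        for y, x in positions:
            hits += toy_classifier(paste(img, patch, y, x)) == target
            total += 1
    return hits / total

# Toy data: 8x8 "images" whose 3 channels act as 3 classes.
# Source images lean toward class 0 (largest channel-0 mean).
sources = [rng.random((8, 8, 3)) * np.array([1.0, 0.5, 0.5]) for _ in range(5)]
target = 2
# Candidate 4x4 patches; patch 0 is strongly "target-colored" (channel 2).
patches = [
    np.stack([np.zeros((4, 4)), np.zeros((4, 4)), np.ones((4, 4))], axis=-1),
    rng.random((4, 4, 3)),
]
positions = [(0, 0), (4, 4)]
scores = [score_patch(p, sources, target, positions) for p in patches]
best = int(np.argmax(scores))
```

The uniformly "target-colored" patch scores highest because it shifts the channel means toward the target class across most images and positions, mirroring how a transferable natural patch induces a consistent misclassification.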

Our results suggest that SNAFUE can be useful for interpreting DNNs and generating adversarial data for them.

Code is available at https://github.com/thestephencasper/snafue.

Figure 2: Examples of targeted natural adversarial patches identified using SNAFUE, which reveal consistent, easily-describable failure modes that can be used to interpret the network (e.g., "envelopes plus cats are misclassified by the network as cartons"). Each row contains 10 patches labeled with the attack's source and target classes. When a patch is inserted into any source-class image, it tends to cause misclassification as the target class.