“Robustness May Be at Odds With Accuracy”, Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, Aleksander Madry (2018-05-30):

We show that there may exist an inherent tension between the goal of adversarial robustness and that of standard generalization. Specifically, training robust models may not only be more resource-consuming, but also lead to a reduction of standard accuracy.

We demonstrate that this trade-off between the standard accuracy of a model and its robustness to adversarial perturbations provably exists in a fairly simple and natural setting. These findings also corroborate a similar phenomenon observed empirically in more complex settings.

Further, we argue that this phenomenon is a consequence of robust classifiers learning fundamentally different feature representations than standard classifiers. These differences, in particular, seem to result in unexpected benefits: the representations learned by robust models tend to align better with salient data characteristics and human perception.
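The “simple and natural setting” of the paper can be illustrated with a small simulation. The sketch below is an assumption-labeled toy model in the spirit of the paper’s construction (all parameter values here are illustrative, not taken from the paper): one “robust” feature that agrees with the label with probability `p`, plus many weakly correlated Gaussian features. A standard classifier that averages the weak features achieves higher clean accuracy than a classifier relying on the robust feature alone, but a small ℓ∞ perturbation destroys it:

```python
import numpy as np

rng = np.random.default_rng(0)
p, eta, d, n = 0.95, 0.2, 100, 20000  # illustrative values, not from the paper

y = rng.choice([-1.0, 1.0], size=n)
# Robust feature: equals the label with probability p.
x1 = np.where(rng.random(n) < p, y, -y)
# Weak, non-robust features: each only slightly correlated with the label.
xw = rng.normal(eta * y[:, None], 1.0, (n, d))

# "Standard" classifier: average the weak features (high clean accuracy,
# since the mean of d features concentrates at eta * y with std 1/sqrt(d)).
std_pred = np.sign(xw.mean(axis=1))
# "Robust" classifier: rely only on the robust feature (clean accuracy ~ p).
rob_pred = np.sign(x1)

# An l_inf adversary with eps = 2*eta shifts every weak feature's mean
# from +eta*y to -eta*y, flipping the standard classifier's predictions.
xw_adv = xw - 2 * eta * y[:, None]
std_pred_adv = np.sign(xw_adv.mean(axis=1))

print(f"standard clf, clean acc:       {np.mean(std_pred == y):.3f}")
print(f"standard clf, adversarial acc: {np.mean(std_pred_adv == y):.3f}")
print(f"robust clf, clean acc:         {np.mean(rob_pred == y):.3f}")
```

Any classifier robust to perturbations of size 2η must ignore the weak features and fall back on the robust one, capping its clean accuracy near `p` — the trade-off the paper proves formally.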

…Overall, when training on the entire dataset, we observe a decline in standard accuracy as the strength of the adversary increases (see Figure 7 of Appendix G for a plot of standard accuracy vs. ε). (Note that this still holds if we train on batches that contain natural examples as well, as recommended by Kurakin et al 2017. See Appendix B for details.) Similar effects were also observed in prior and concurrent work (Kurakin et al 2017; Madry et al 2018; Dvijotham et al 2018b; Wong et al 2018; Xiao et al 2019; Su et al 2018).
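The decline of standard accuracy with adversary strength can be reproduced on the same kind of toy data. The sketch below (my own minimal implementation, not the paper’s code; all parameter values are illustrative) trains a logistic-regression model with single-step ℓ∞ adversarial training on mixed natural/adversarial batches, as recommended by Kurakin et al 2017, and compares clean test accuracy at ε = 0 versus a larger ε:

```python
import numpy as np

def make_data(rng, n=20000, d=100, p=0.95, eta=0.2):
    """Toy data: one robust feature plus d weakly correlated features."""
    y = rng.choice([-1.0, 1.0], size=n)
    x1 = np.where(rng.random(n) < p, y, -y)[:, None]  # robust feature
    xw = rng.normal(eta * y[:, None], 1.0, (n, d))    # weak features
    return np.hstack([x1, xw]), y

def train(X, y, eps, steps=300, lr=0.5):
    """Logistic regression with single-step l_inf adversarial training.

    For a linear model the worst-case l_inf perturbation is exactly
    x - eps * y * sign(w), so one FGSM step suffices (no PGD loop)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        X_adv = X - eps * y[:, None] * np.sign(w)
        # Mixed batch of natural + adversarial examples (Kurakin et al 2017).
        Xb = np.vstack([X, X_adv])
        yb = np.concatenate([y, y])
        m = np.clip(yb * (Xb @ w), -30.0, 30.0)       # margins, clipped for exp
        s = 1.0 / (1.0 + np.exp(m))                   # logistic-loss weights
        w += lr * (yb[:, None] * Xb * s[:, None]).mean(axis=0)
    return w

rng = np.random.default_rng(0)
X, y = make_data(rng)
Xte, yte = make_data(rng)
accs = {}
for eps in (0.0, 0.5):
    w = train(X, y, eps)
    accs[eps] = np.mean(np.sign(Xte @ w) == yte)
    print(f"eps={eps}: clean test accuracy {accs[eps]:.3f}")
```

With ε = 0 the model exploits the weak features and reaches the highest clean accuracy; with ε large enough to flip the weak features’ correlation, training shifts weight onto the robust feature and clean accuracy drops toward `p` — a miniature version of the accuracy-vs-ε decline the excerpt describes.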