Deep artificial neural networks have been proposed as a model of primate vision. However, these networks are vulnerable to adversarial attacks, in which minimal, targeted noise added to an image can fool a network into misclassifying it. Primate vision is thought to be robust to such adversarial images. We tested this assumption by designing adversarial images intended to fool primate vision. To do so, we first trained a model to predict the responses of face-selective neurons in macaque inferior temporal (IT) cortex. Next, we modified images, such as human faces, so that their model-predicted neuronal responses matched those of a target category, such as monkey faces. These adversarial images elicited neuronal responses similar to those of the target category. Remarkably, the same images also fooled monkeys and humans at the behavioral level. These results challenge fundamental assumptions about the similarity between computer and primate vision and show that a model of neuronal activity can selectively direct primate visual behavior.
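The attack described above has two stages: fitting a model that predicts neuronal responses from images, then optimizing images against that model. The snippet below is a minimal sketch of the first stage, assuming a ridge regression as the linear mapping from pre-trained CNN features to recorded responses; the feature dimensionality, ridge penalty, data split, and all arrays are illustrative stand-ins, not the paper's actual settings.

```python
# Minimal sketch of stage 1: a linear readout from CNN features to IT
# responses. Ridge regression, the 80/20 split, and the synthetic data
# are assumptions for illustration only.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_images, n_features, n_neurons = 1000, 2048, 100
features = rng.standard_normal((n_images, n_features))   # CNN features per image
responses = rng.standard_normal((n_images, n_neurons))   # recorded neuronal responses

X_tr, X_te, y_tr, y_te = train_test_split(
    features, responses, test_size=0.2, random_state=0)

# One ridge model jointly predicts all neurons from the shared features.
readout = Ridge(alpha=1.0)
readout.fit(X_tr, y_tr)
print(f"held-out R^2: {readout.score(X_te, y_te):.3f}")
```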
Figure 1: Overview of the adversarial attack. (A) A substitute model was fitted to the responses of IT neurons. The substitute model consisted of a pre-trained ResNet-101 (excluding the final fully connected layer) followed by a linear mapping model. Adversarial images were generated by gradient-based optimization of the image to produce the desired neuronal response pattern, as predicted by the substitute model. (B, left) The adversarial images were tested in monkeys in neuron-level experiments. Monkeys fixated on a red fixation point while images were presented in random order and neuronal responses were recorded. (B, right) The images were also tested in behavioral experiments with monkeys and human subjects. Each image was presented for 1000 ms. For monkeys, 2 choice buttons were presented (text for illustration only); monkeys were rewarded for touching the correct button for training images, whereas for test images a randomly chosen button was rewarded. Humans were instructed to press a key to indicate their choice. (C) Example human → monkey attack images, generated from 2 original human faces, are shown at different noise levels.
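A minimal PyTorch sketch of the gradient-based optimization in (A) follows, assuming the substitute model is a frozen ResNet-101 trunk plus a linear readout, an MSE loss to a target response pattern, and a simple per-pixel noise budget; the paper's exact objective, optimizer, and constraint may differ, and the readout weights and target here are random placeholders.

```python
# Hedged sketch of the attack: optimize an additive perturbation so the
# substitute model's predicted responses approach a target pattern.
import torch
import torchvision.models as models

# Feature extractor: pre-trained ResNet-101 with the final FC layer removed.
resnet = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()
resnet.eval()
for p in resnet.parameters():
    p.requires_grad_(False)

# Hypothetical linear mapping to 100 model neurons; in practice its weights
# would come from the fitted mapping model, not random initialization.
readout = torch.nn.Linear(2048, 100)

image = torch.rand(1, 3, 224, 224)            # clean source image (normalization omitted)
target = torch.randn(100)                     # e.g., mean response of the target category
delta = torch.zeros_like(image, requires_grad=True)
optimizer = torch.optim.Adam([delta], lr=1e-2)
budget = 0.05                                 # noise level (illustrative)

for _ in range(200):
    optimizer.zero_grad()
    adv = (image + delta).clamp(0, 1)
    pred = readout(resnet(adv)).squeeze(0)    # predicted neuronal responses
    loss = torch.nn.functional.mse_loss(pred, target)
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        delta.clamp_(-budget, budget)         # project back into the noise budget
```

Clamping `delta` after each step is one simple way to realize the "noise level" knob varied in (C): a larger budget permits a stronger, more visible attack.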
Figure 2: Neuron-level results of the adversarial attack. (A–G) Human → monkey attack. (A) UMAP visualization of the neuronal representation of images in monkey P. ‘Gray box attack human face’ corresponds to noise level 10. Inset shows the average distances from adversarial images to clean human faces and to clean monkey faces, measured along the direction that best separates these two categories and normalized to the distance between them. Points in the inset show centers of mass of the UMAP points for illustration only and do not correspond to the distance quantification. (B) Success rates of attack and control images (pure model, merged, Gaussian noise, and PS noise images) at different noise levels. Legend and example images are in (C). Shading and error bars show s.e.m. over bootstrap samples. ✱, p < 0.05; ✱✱, p < 0.01. (D, E) Same as (A, B) for monkey R. (F, G) Same as (A, B) for monkey B1. (H–L) Same as (A–G) for the non-face → face attack in monkeys P and B1.
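The inset quantification in (A) can be made concrete with the sketch below: each adversarial image is placed on the axis joining the clean human-face and monkey-face response means, scaled so that the human-to-monkey distance equals 1. The mean-difference axis is an assumption standing in for the caption's "direction that best separates" the two categories (an LDA axis would be an alternative), and all arrays are synthetic stand-ins.

```python
# Normalized position of adversarial images between two clean categories:
# 0 = at the human-face mean, 1 = at the monkey-face mean.
import numpy as np

rng = np.random.default_rng(0)
human = rng.standard_normal((50, 100))          # clean human-face responses
monkey = rng.standard_normal((50, 100)) + 1.0   # clean monkey-face responses
adversarial = rng.standard_normal((40, 100)) + 0.7

mu_h, mu_m = human.mean(axis=0), monkey.mean(axis=0)
axis = mu_m - mu_h
axis /= np.linalg.norm(axis)                    # unit separating direction
scale = (mu_m - mu_h) @ axis                    # human-to-monkey distance

position = (adversarial - mu_h) @ axis / scale
print(f"mean normalized position of attack images: {position.mean():.2f}")
```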