“THINGSvision: A Python Toolbox for Streamlining the Extraction of Activations From Deep Neural Networks”, Lukas Muttenthaler & Martin N. Hebart (2021-09-22):

Over the past decade, deep neural network (DNN) models have received a lot of attention due to their near-human object classification performance and their excellent prediction of signals recorded from biological visual systems. To better understand the function of these networks and relate them to hypotheses about brain activity and behavior, researchers need to extract the activations elicited by images across different DNN layers. The abundance of different DNN variants, however, can often be unwieldy, and the task of extracting DNN activations from different layers may be non-trivial and error-prone for someone without a strong computational background. Thus, researchers in the fields of cognitive science and computational neuroscience would benefit from a library or package that supports a user in the extraction task.

THINGSvision is a new Python module that aims at closing this gap by providing a simple and unified tool for extracting layer activations for a wide range of pretrained and randomly-initialized neural network architectures, even for users with little to no programming experience.
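Under the hood, toolboxes of this kind typically rely on PyTorch forward hooks to capture a layer's output during a forward pass. The following is a minimal, self-contained sketch of that mechanism on a toy model (this is not THINGSvision's actual API; the model and layer names are illustrative):

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained vision model.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 8 * 8, 16),
    nn.ReLU(),           # treat this layer's output as the "features"
    nn.Linear(16, 10),
)

activations = {}

def make_hook(name):
    # A forward hook receives (module, inputs, output) after each forward pass.
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Register the hook on the layer of interest (here the ReLU at index 2).
model[2].register_forward_hook(make_hook("penultimate"))

images = torch.rand(4, 3, 8, 8)  # batch of 4 toy "images"
with torch.no_grad():
    model(images)

features = activations["penultimate"]  # shape: (4, 16)
```

A real extraction pipeline iterates this over an image dataloader and stacks the captured tensors into one feature matrix per layer.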

We demonstrate the general utility of THINGSvision by relating extracted DNN activations to a number of functional MRI and behavioral datasets using representational similarity analysis, which can be performed as an integral part of the toolbox.

Together, THINGSvision enables researchers across diverse fields to extract features in a streamlined manner for their custom image dataset, thereby improving the ease of relating DNNs, brain activity, and behavior, and improving the reproducibility of findings in these research fields.

…Note that this section should not be regarded as an investigation in its own right. It is meant to demonstrate the usefulness and versatility of the toolbox. This is the main reason why we do not make any claims about hypotheses and how to test them. RSA is just one out of many potential applications, of which a subset is mentioned in §4.

§3.1. The Penultimate Layer

The correspondence of a DNN’s penultimate layer to human behavioral representations has been studied extensively and is therefore often used when investigating the representations of abstract visual concepts in neural network models (e.g. Mur et al 2013; Bankson et al 2018; Jozwik et al 2018; Peterson et al 2018; Battleday et al 2019; Cichy et al 2019). To the best of our knowledge, our study is the first to compare visual object representations extracted from CLIP (Radford et al 2021) against the representations of well-known vision models that have previously shown a close correspondence to neural recordings of the primate visual system. We computed RDMs based on the Pearson correlation distance for 7 models, namely AlexNet (Krizhevsky et al 2012), VGG16 and VGG19 with batch normalization (Simonyan & Zisserman 2015), which show a close correspondence to brain and behavior (Schrimpf et al 2018, Schrimpf et al 2020b), ResNet50 (He et al 2016), Brain-Score’s current leader CORnet-S (Kubilius et al 2018, Kubilius et al 2019; Schrimpf et al 2020b), and OpenAI’s CLIP variants CLIP-RN and CLIP-ViT (Radford et al 2021). The comparison was done for 6 different image datasets that included functional MRI of the human visual system and behavior (Mur et al 2013; Bankson et al 2018; Cichy et al 2019; Mohsenzadeh et al 2019; Hebart et al 2020). For the neuroimaging datasets, participants viewed different images of objects while performing an oddball detection task in an MRI scanner. For the behavioral datasets, participants completed similarity judgments using the multi-arrangement task (Mur et al 2013; Bankson et al 2018) or a triplet odd-one-out task (Hebart et al 2020).
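The RDM construction described above (one minus the Pearson correlation between every pair of image feature vectors) can be sketched in a few lines of NumPy; the random feature matrix below is only a stand-in for real penultimate-layer activations:

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(10, 64))  # 10 images x 64 feature dimensions

# np.corrcoef treats rows as variables, so entry (i, j) is the Pearson r
# between the feature vectors of images i and j; the RDM is 1 - r.
rdm = 1.0 - np.corrcoef(features)
```

The result is a symmetric 10 × 10 matrix with zeros on the diagonal, ready to be compared against behavioral or fMRI-derived RDMs.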

Note that Bankson et al 2018 exploited two different datasets, which we label with “(1)” and “(2)” in Figure 2. The number of images per dataset is as follows: Kriegeskorte et al 2008b, Mur et al 2013, Cichy et al 2014: 92; Bankson et al 2018: 84 each; Cichy et al 2016, Cichy et al 2019: 118; Mohsenzadeh et al 2019: 156; Hebart et al 2019, Hebart et al 2020: 1,854. For each of these datasets except for Mohsenzadeh et al 2019, we additionally computed RDMs for group averages obtained from behavioral experiments. Furthermore, we computed RDMs for brain voxel activities obtained from fMRI recordings for the datasets used in Cichy et al 2014, Cichy et al 2016, and Mohsenzadeh et al 2019, based on voxels inside a mask covering higher visual cortex.

Figure 2: (A) RDMs for penultimate layer representations of different pretrained neural network models, for group averages of behavioral judgments, and for fMRI responses of higher visual cortex. For Mohsenzadeh et al 2019, no behavioral experiments had been conducted. For both datasets in Bankson et al 2018, and for Hebart et al 2020, no fMRI recordings were available. For display purposes, Hebart et al 2020 was downsampled to 200 conditions. RDMs were reordered according to an unsupervised clustering. (B, C) Pearson correlation coefficients for comparisons between neural network representations extracted from the penultimate layer and behavioral representations (B) and representations corresponding to fMRI responses of higher visual cortex (C). Activations were extracted from pretrained and randomly initialized models.

Figure 2A visualizes all RDMs. We clustered RDMs pertaining to group averages of behavioral judgments into 5 object clusters and sorted the RDMs corresponding to object representations extracted from DNNs according to the obtained cluster labels. The image datasets used in Kriegeskorte et al 2008b, Mur et al 2013, Cichy et al 2014, and Mohsenzadeh et al 2019 were already sorted according to object categories, which is why we did not perform a clustering on RDMs for those datasets. The number of clusters was chosen arbitrarily. The reordering was done to highlight the similarities and differences in RDMs.
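The cluster-and-reorder step can be sketched with SciPy's hierarchical clustering. The toy RDM and the cut into 5 clusters mirror the procedure described above, but the data are random stand-ins for real behavioral dissimilarities:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
feats = rng.normal(size=(30, 16))     # 30 toy conditions
rdm = 1.0 - np.corrcoef(feats)        # Pearson correlation distance RDM

# Cluster conditions on their dissimilarity profiles; cut into 5 clusters.
condensed = squareform(rdm, checks=False)  # condensed distance vector
labels = fcluster(linkage(condensed, method="average"),
                  t=5, criterion="maxclust")

# Reorder rows and columns so same-cluster conditions sit adjacent,
# which makes block structure visible when the RDM is plotted.
order = np.argsort(labels)
rdm_sorted = rdm[np.ix_(order, order)]
```

The same `order` would then be applied to the DNN-derived RDMs so that all matrices in a figure share one condition ordering.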

§3.1.1. Behavioral Correspondences

§3.1.1.1. Pretrained Weights

Across all compared DNN models, CORnet-S and CLIP-RN showed the overall closest correspondence to behavioral representations. CORnet-S, however, was the only model that performed well across all datasets. CLIP-RN showed a high Pearson correlation (ranging from 0.40 to 0.60) with behavioral representations across most datasets, with Mur et al 2013 being the only exception, for which both CLIP versions performed poorly. Interestingly, for one of the datasets in Bankson et al 2018, VGG16 with batch normalization (Simonyan & Zisserman 2015) outperformed both CORnet-S and CLIP-RN (see Figure 2B). AlexNet consistently performed the worst for behavioral fits. Note that the broadest coverage of visual stimuli is provided by Hebart et al 2019, Hebart et al 2020, which should therefore be seen as the most representative result (rightmost column in Figure 2B).
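Model-to-behavior correlations like those reported here are computed by correlating the off-diagonal entries of two RDMs. A minimal sketch, with random matrices standing in for a model RDM and a behavioral RDM:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20  # number of conditions (images)

# Stand-ins: in practice these come from DNN features and from
# behavioral similarity judgments, respectively.
model_rdm = 1.0 - np.corrcoef(rng.normal(size=(n, 32)))
behavior_rdm = 1.0 - np.corrcoef(rng.normal(size=(n, 32)))

# RDMs are symmetric with a zero diagonal, so only the upper
# triangle (excluding the diagonal) carries information.
iu = np.triu_indices(n, k=1)
r = np.corrcoef(model_rdm[iu], behavior_rdm[iu])[0, 1]
```

With random inputs `r` hovers near zero; with real data it corresponds to the bar heights in Figure 2B/C.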