“Multimodal Neurons in Artificial Neural Networks [CLIP]”, Gabriel Goh, Nick Cammarata, Chelsea Voss, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, Chris Olah, 2021-03-04:

[Investigation of CLIP activations: CLIP detects a wide variety of entities, like Spiderman, Lady Gaga, or Halle Berry, across a variety of media, such as photos, (images of) text, people in costumes, drawings, or just related terms; earlier, cruder, and smaller NNs lacked this ‘conceptual’ level and responded only to photographs of the exact person.

CLIP neurons further specialize in famous individuals, human emotions, religions, human attributes such as age/gender/facial-features, geographic regions (down to specific cities), holidays, art styles (such as anime vs painting), media franchises (Pokemon, Star Wars, Minecraft, Batman etc), brands, images of text, and abstract concepts like ‘star’ or ‘LGBTQ+’ or numbers or time or color. Such conceptual neurons also have ‘opposite’ neurons, like Donald Trump vs “musicians like Nicki Minaj and Eminem, video games like Fortnite, civil rights activists like Martin Luther King Junior, and LGBT symbols like rainbow flags.” The capabilities are best in English, but there are limited foreign-language capabilities as well.

Given the ‘conceptual’ level of neurons, it’s not too surprising that the overloaded/entangled/“polysemantic” neurons that Distill.pub has documented in VGG-16 (which appear undesirable and to reflect the crudity of the NN’s knowledge) are much less present in CLIP, and the neurons appear to learn much cleaner concepts.

The power of zero-shot classification, and the breadth of CLIP’s capabilities, can lead to some counterintuitive results, like their discovery of what they dub typographic attacks: writing “iPod” on a piece of paper and sticking it on the front of a Granny Smith apple can make the text string “iPod” much more ‘similar’ to the image than the text string “Granny Smith”.
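To make the mechanism concrete, here is a minimal sketch of zero-shot classification as a softmax over scaled cosine similarities between one image embedding and several label embeddings. Everything in it is an illustrative assumption rather than CLIP's actual code or data: the 3-dimensional vectors are made up by hand (real CLIP embeddings have hundreds of dimensions and come from trained encoders), and the names `cosine`, `zero_shot`, `top`, and the `temperature` value are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot(image_vec, label_vecs, temperature=100.0):
    """Softmax over temperature-scaled cosine similarities (one image vs. each label).

    `temperature` is an assumed scaling factor for this toy example.
    """
    logits = {label: temperature * cosine(image_vec, vec)
              for label, vec in label_vecs.items()}
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def top(probs):
    """Label with the highest probability."""
    return max(probs, key=probs.get)

# Hand-made toy embeddings (NOT real CLIP outputs):
# axis 0 ~ "apple-ness", axis 1 ~ "iPod-ness", axis 2 ~ everything else.
labels = {
    "Granny Smith": [1.0, 0.0, 0.1],
    "iPod": [0.0, 1.0, 0.1],
}
plain_apple = [0.9, 0.1, 0.2]
apple_with_note = [0.6, 0.9, 0.2]  # the written word adds a strong "iPod" component

print(top(zero_shot(plain_apple, labels)))
print(top(zero_shot(apple_with_note, labels)))
```

In this toy setup the plain apple is assigned “Granny Smith” while the apple with the note is assigned “iPod”: because text in the image and the text label share an embedding space, a strong enough written cue can outweigh the visual evidence.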

Perhaps even more surprising is that the multimodal conceptual capability leads to a Stroop effect! (And also bouba/kiki.) All in all, CLIP is remarkable.]