The new CLIP adversarial examples partially stem from the use-mention distinction. CLIP was trained to predict which caption from a list matches an image. It makes sense that a picture of an apple with a large "iPod" label would be captioned with "iPod", not "Granny Smith"!
This can be somewhat fixed with a list of labels that are more explicit about this, at least for the small set of pictures I've tried. After some experimentation, I found a prompt that seems to work with CLIP ViT-B-32.
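
Roughly, the setup looks like the sketch below, using the openai/CLIP package. The label phrasings and the image filename here are only illustrative, not the exact prompt I settled on:

```python
import torch
import clip  # openai/CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical example image: an apple with an "iPod" label stuck on it
image = preprocess(Image.open("apple_with_ipod_label.jpg")).unsqueeze(0).to(device)

# Plain class labels -- the typographic attack usually wins here
plain_labels = ["a Granny Smith apple", "an iPod"]

# Labels that separate *using* a word from *mentioning* it
# (illustrative phrasings, not the exact prompt)
explicit_labels = [
    "a photo of a Granny Smith apple",
    "a photo of an iPod",
    "a photo of an apple with a piece of paper saying 'iPod' on it",
]

for labels in (plain_labels, explicit_labels):
    text = clip.tokenize(labels).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(image, text)
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()[0]
    print(list(zip(labels, probs.round(3))))
```

The idea is the same whatever the exact phrasing: give the model candidate captions that explicitly describe text appearing in the image, instead of only object names.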


Credits to @ykilcher for inspiration and @gwern for mentioning the 'use-mention distinction' in the EleutherAI Discord.
🥳 New Video (very short) 🥳 Turns out there is a SUPER EASY fix for countering textual adversarial attacks against @OpenAI's CLIP 😄 piped.video/Rk3MBx20z24
Also, I wonder if this prompt is overfitting to "This is painting, text, symbol". Can you think of a use-mention example that isn't one of those?
Embarrassingly, this actually doesn't work for every adversarial example in the CLIP blog post. My guess is that the general technique will work with larger CLIP models and better prompts, though.