Context is intuitive for people but remains difficult for machines. The best model on the ImageCoDe task (retrieving the correct image from a set of highly similar images, given a contextual description) reaches only 29% accuracy, far below the 91% achieved by Amazon Mechanical Turk workers.
Can vision & language models retrieve the correct image from a set given its contextual description (e.g. "No bridesmaid visible at all")? We show that models struggle with this kind of contextual reasoning arxiv.org/abs/2203.15867 mcgill-nlp.github.io/imageco… #ACL2022
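The retrieval setup can be sketched in a few lines: embed the description and each candidate image, then pick the candidate with the highest similarity. This is a toy illustration with made-up vectors, not the actual ImageCoDe models or data; in practice the embeddings would come from a vision-language encoder such as CLIP.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(description_emb, image_embs):
    """Return the index of the candidate image whose embedding is most
    similar to the description embedding (argmax over cosine similarity)."""
    scores = [cosine(description_emb, img) for img in image_embs]
    return max(range(len(scores)), key=scores.__getitem__)

# Made-up embeddings for illustration only.
description = [0.9, 0.1, 0.0]
candidates = [
    [0.1, 0.9, 0.0],  # distractor
    [0.8, 0.2, 0.1],  # image matching the description
    [0.0, 0.1, 0.9],  # distractor
]
print(retrieve(description, candidates))  # 1
```

The hard part that ImageCoDe exposes is precisely that this argmax must separate nearly identical candidates using subtle contextual cues, which current encoders capture poorly.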
Another great example of context and compositionality that is easy for humans but not yet solved by state-of-the-art machine learning models:
Replying to @TristanThrush
The task: Given two images and two captions, the goal is to match them correctly—but crucially, both captions contain the same words/morphemes, only in a different order. Identical words between captions mean that BOW models cannot perform above chance. 2/5
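The chance-level claim for bag-of-words models is easy to see concretely: once word order is discarded, the two captions collapse to the same representation. The caption pair below is a hypothetical example for illustration, not taken from the dataset.

```python
from collections import Counter

# Hypothetical caption pair in the same spirit as the task:
# the two captions use exactly the same words, only reordered.
caption_a = "some plants surrounding a lightbulb"
caption_b = "a lightbulb surrounding some plants"

bow_a = Counter(caption_a.split())
bow_b = Counter(caption_b.split())

# A bag-of-words representation discards order, so the two captions
# map to the same feature vector: any BOW model assigns both captions
# the same score for every image and cannot beat random matching.
print(bow_a == bow_b)  # True
```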
Following up on compositionality, here are examples (each chosen from a set of 6 generations) of #dalle2 outputs for the following prompts: (a) some plants surrounding a lightbulb (b) a lightbulb surrounding some plants (c) plants with a lightbulb inside (d) a lightbulb with plants inside

Apr 12, 2022 · 11:13 AM UTC
