Context is intuitive for people but quite tricky for machines.
The leading model on the IMAGECODE task (retrieving the correct image from a set of highly similar images given a description) reaches only 29% accuracy, far below the 91% achieved by Amazon Mechanical Turk workers.
Can vision & language models retrieve the correct image from a set given its contextual description (e.g., "No bridesmaid visible at all")? We show that models struggle with this kind of contextual reasoning: arxiv.org/abs/2203.15867
mcgill-nlp.github.io/imageco…
#ACL2022
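For concreteness, here is a minimal sketch of the retrieval setup, assuming a CLIP-style baseline that scores each candidate image against the description; the model choice and scoring below are illustrative, not necessarily the paper's exact method:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve(description, image_paths):
    # Score one contextual description against every candidate image
    # and return the index of the best match.
    images = [Image.open(p) for p in image_paths]
    inputs = processor(text=[description], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_text has shape (1, num_images): similarity of the
    # single caption to each candidate image.
    return outputs.logits_per_text.argmax(dim=-1).item()

# e.g. retrieve("No bridesmaid visible at all", candidate_paths)

Scoring candidates independently like this is exactly where the task bites: in a set of near-duplicates, the description often turns on a subtle contextual cue rather than on which objects are present.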
Another great example of context and compositionality that is easy for humans but not yet solved by state-of-the-art machine learning models:
Following up on compositionality, here are examples (each chosen from a set of 6) of #dalle2 generations for the following prompts; a sketch of the generation setup follows the list:
(a) some plants surrounding a lightbulb
(b) a lightbulb surrounding some plants
(c) plants with a lightbulb inside
(d) a lightbulb with plants inside
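DALL-E 2 had no public API when this thread was posted, so as a purely hypothetical illustration, the 6-samples-per-prompt setup could look like this with today's OpenAI Images API (model name, size, and parameters are assumptions):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompts = [
    "some plants surrounding a lightbulb",
    "a lightbulb surrounding some plants",
    "plants with a lightbulb inside",
    "a lightbulb with plants inside",
]

for prompt in prompts:
    # Generate 6 candidates per prompt, mirroring the thread's setup.
    result = client.images.generate(model="dall-e-2", prompt=prompt,
                                    n=6, size="512x512")
    for i, image in enumerate(result.data):
        print(f"{prompt!r} sample {i}: {image.url}")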