“Comparing Humans, GPT-4, and GPT-4-V On Abstraction and Reasoning Tasks”, Melanie Mitchell, Alessandro B. Palmarini, Arseny Moskvichev, 2023-11-14:

We explore the abstract reasoning abilities of text-only and multimodal versions of GPT-4, using the ConceptARC benchmark, which is designed to evaluate robust understanding and reasoning with core-knowledge concepts.

We extend previous work on ConceptARC by evaluating GPT-4 on more detailed, one-shot prompting (rather than simple, zero-shot prompts) with text versions of ConceptARC tasks, and by evaluating GPT-4-V, the multimodal version of GPT-4, on zero-shot & one-shot prompts using image versions of the simplest tasks.

Our experimental results support the conclusion that neither version of GPT-4 has developed robust abstraction abilities at human-like levels.