Long-time DL critic Gary Marcus, in his 2020 essay “GPT-2 and the Nature of Intelligence”, argues, similarly to Bender & Koller 2020, that deep learning and self-supervised learning are fundamentally incapable of intelligence, and that GPT-2, far from being a success, is such a great failure that no further resources should be spent researching it or its followups (such as GPT-3): it is “a clear sign that it is time to consider investing in different approaches.”
As exemplars of his criticisms, he offers test cases that he claims demonstrate the fundamental limits of GPT-2-like approaches: in response to questions about counting, object location, physical reasoning, treating poisons, or what languages individuals speak, GPT-2 is highly unreliable or gives outright nonsensical answers.
GPT-3 solves Marcus’s word-arithmetic problems completely, the language questions completely, the medical questions mostly, and the location/physics questions partially. In no case does it perform nearly as badly as GPT-2, despite being almost exactly the same model, just larger. (Some of Marcus’s examples were tested independently by Daniel Kokotajlo using AI Dungeon, with similar results; see also Macaw.) Thus, Marcus’s examples hold up no better than the Bender & Koller 2020 counterexamples, falling to a mere increase in model size.