
Gary Marcus has co-authored a brief critique of GPT-3 [warning: paywalled link].

I was disappointed by Marcus’ critiques of GPT-2, but this is even worse!

To the authors’ credit, they provide a full account of their experiments on this page, including every prompt they tried, the sampling parameters, and their opinion of the output.  First, we learn:

These experiments are not, by any means, either a representative or a systematic sample of anything. We designed them explicitly to be difficult for current natural language processing technology. Moreover, we pre-tested them on the “AI Dungeon” game which is powered by some version of GPT-3, and we excluded those for which “AI Dungeon” gave reasonable answers. (We did not keep any record of those.) The pre-testing on AI Dungeon is the reason that many of them are in the second person; AI Dungeon prefers that. Also, as noted above, the experiments included some near duplicates. Therefore, though we note that, of the 157 examples below, 71 are successes, 70 are failures and 16 are flawed, these numbers are essentially meaningless. [my emphasis]

I agree with the authors that the numbers are meaningless.  However, even the raw results themselves are, if not rendered meaningless, then rendered highly misleading by this strange selection process.

For all we know, there were 100 unrecorded AI Dungeon “successes” for every recorded “failure”!  The (not well understood) difference between AI Dungeon and ordinary GPT-3 is playing a massive role here.  (Consider that, if AI Dungeon and ordinary GPT-3 were identical, their dataset would be 100% failures.)
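To make the selection effect concrete, here is a toy model. It is my own sketch, not anything taken from the authors' page. Assume some fraction of the time (`overlap`, a made-up parameter) AI Dungeon and ordinary GPT-3 behave identically on a prompt, and otherwise succeed independently at the same underlying rate (`true_rate`, also made up). Keeping only the prompts AI Dungeon failed then gives an observed GPT-3 success rate of `(1 - overlap) * true_rate`.

```python
def observed_success_rate(true_rate: float, overlap: float) -> float:
    """Success rate of ordinary GPT-3 on prompts kept only because AI Dungeon failed them.

    Toy assumptions (hypothetical, for illustration only):
      - with probability `overlap`, the two systems give the same outcome;
      - otherwise they succeed independently, each with probability `true_rate`.
    Conditioning on an AI Dungeon failure, GPT-3 can only succeed in the
    independent case, so the observed rate is (1 - overlap) * true_rate.
    """
    if not (0.0 <= true_rate <= 1.0 and 0.0 <= overlap <= 1.0):
        raise ValueError("true_rate and overlap must both lie in [0, 1]")
    return (1.0 - overlap) * true_rate


# Same underlying ability, very different-looking "results" after filtering.
for overlap in (0.0, 0.5, 1.0):
    rate = observed_success_rate(true_rate=0.9, overlap=overlap)
    print(f"overlap={overlap:.1f} -> observed success rate {rate:.2f}")
```

With `overlap=1.0` the observed success rate is zero whatever `true_rate` is, which is the "100% failures" case above. Since nobody outside knows the real overlap, the recorded success/failure split says very little about GPT-3 itself.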

Then we get to the individual results.  It is difficult for me to read many of the authors’ assessments without picturing them as characters in a dystopian satire, administering a dreamlike and impossible “psychological examination” to our hapless protagonist (who writes in bold):

4. You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of suntan lotion into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you drink it. It tastes kind of funny, but you can’t place it.

Summers-Stay settings. Flawed. GPT-3 does not notice the consequence of drinking suntan lotion. Error of omission.

10. You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of grape juice into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you drink it. It tastes a little funny, but you don’t really notice because you are concentrating on how good it feels to drink something. The only thing that makes you stop is the look on your brother’s face when he catches you.

Summers-Stay settings. Failure, as above. There is no reason for your brother to look concerned.

19. You order a bowl of cold tomato soup in a restaurant. It looks delicious, but they forgot to bring you a spoon. You try to drink it using your eyeglasses, but it doesn’t work. You have to ask your neighbor to share his spoon with you.

Summers-Stay settings. Failure. No one in a restaurant asks their neighbor to share a spoon.

63. Moshe posted a photograph on Facebook that shows Pablo Picasso, Salvador Dali, Andy Warhol, Gustav Klimt, and Jackson Pollock. Klimt died in 1918. Warhol was born in 1928. So the photograph shows a meeting of dead artists.

Moshe said that he wants to make sure that people do not get caught up in the spirit of the season.

Summers-Stay settings. Failure. “shows a meeting of dead artists” is correct, but the sentence after than [sic] is meaningless.

93. A cat was sitting next to a mouse hole in the kitchen wall waiting for the mouse, but it was too cautious, so the cat ate the cat food instead.

A pig was walking down the street when a car hit him.

Summers-Stay settings. Failure. The switch to the pig is a non-sequitur.

115. Janet and Penny went to the store to get presents for Jack. Janet said, “I will get Jack a top.” “Don’t get Jack a top,” said Penny. “He has a top. He will not like it.” “I will get Jack a top,” said Janet. “He has a top, but he won’t have one when I’m through with him.”

Summers-Stay settings. Failure. Janet’s threat at the end is entertaining but meaningless.

What do the authors even imagine success to be, here?

Sometimes they deliberately describe a surreal situation, then penalize GPT-3 for continuing it in an identically surreal manner – surely the “right” answer if anything is!  (“No one in a restaurant asks their neighbor to share a spoon” – yeah, and no one tries to drink soup with their eyeglasses, either!)

Sometimes they provide what sounds like a de-contextualized passage from a longer narrative, then penalize GPT-3 for continuing it in a perfectly natural way that implies a broader narrative world continuing before and after the passage.  (“There is no reason for your brother to look concerned.”  How in the world do you know that?  “The switch to the pig is a non-sequitur.”  Is it?  Why?  “The sentence [about Moshe and ‘the spirit of the season’] is meaningless.”  How can you say that when you don’t know what season it is, what its “spirit” is, who this Moshe guy is … And come on, the Janet one is a great story hook!  Don’t you want to read the rest?)

I don’t claim to be saying anything new here.  Others have made the same points.  I’m just chiming in to … boggle at the sheer weirdness, I guess.  As I said, GPT-3 comes off here like a sympathetic protagonist, and the authors as dystopian inquisitors!
