“GPT-3 vs Water Cooler Trivia Participants: A Human vs Robot Showdown”, 2021-03-12:
Spoiler: GPT-3 got 73% of 156 trivia questions correct. This compares favorably to the 52% user average. However, it’s not an all-conquering feat: 37% of participants did better than 73% on their most recent quiz… The robot was best at Fine Arts and Current Events, worst at Word Play and Social Studies.
…As was mostly expected, GPT-3 performed exceptionally well at Current Events and Fine Arts, with Miscellaneous (lots of pun-driven food questions) and Word Play (discussed below) as trickier areas. The most surprising result? The poor performance in Social Studies, driven largely by how many questions in that category intersected with word play.
The patterns we learned:
Word Play is the domain of humans.
This one’s not so surprising. We have a type of question called a “Two’fer Goofer”, which asks for a pair of rhyming words that satisfy a given clue. It’s similar to the Rhyme Time category in Jeopardy! or the old newspaper puzzle Wordy Gurdy. We had 3 of these questions in the showdown, and GPT-3 missed all 3. For Word Play questions that were more like vocabulary quizzes, GPT-3 performed admirably.
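To make the failure mode concrete, here is a minimal sketch of how such a question might be posed to GPT-3 through the 2021-era OpenAI completions API. The engine name, prompt wording, and example clue are illustrative assumptions, not Water Cooler Trivia’s actual setup:

```python
# Hypothetical sketch: posing a rhyming-pair ("Two'fer Goofer")-style
# question to GPT-3 via the 2021-era `openai` completions API.
# The clue below is invented; the intended answer would be "fat cat".
import openai

openai.api_key = "sk-..."  # your API key

prompt = (
    "Answer with a pair of rhyming words.\n"
    "Q: What do you call a wealthy, overweight feline?\n"
    "A:"
)

resp = openai.Completion.create(
    engine="davinci",  # the base GPT-3 model available in 2021
    prompt=prompt,
    max_tokens=8,
    temperature=0,     # deterministic decoding for evaluation
)
print(resp["choices"][0]["text"].strip())
```

Rhyming-pair answers stress exactly the phonetic knowledge that GPT-3’s BPE tokenization obscures, which is consistent with it missing all 3 such questions.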
Clues confuse GPT-3.
We have an alliterative two-word phrase at the start of each question to add a bit of flair and sneak in a clue for participants (e.g., “Kooky Kingdom”). For GPT-3, these clues were a net negative. In a few instances, the robot overlord program answered correctly when the clue was removed. …The other clues that confused GPT-3 were inline indications of the answer’s length: in one question, we explicitly asked for a 5-letter action and GPT-3 gave us 8 letters across 2 words…
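The clue-removal result is easy to reproduce in spirit: query GPT-3 twice, once with the alliterative prefix and once without, and compare completions. A minimal sketch under the same assumptions as above (the question, the prefix, and the `ask` helper are all hypothetical, not from the original post):

```python
# Hypothetical sketch of the clue-ablation test: ask GPT-3 the same
# question with and without the alliterative two-word prefix and see
# whether the prefix helps or hurts. The question/prefix are invented.
import openai

openai.api_key = "sk-..."

def ask(question: str) -> str:
    resp = openai.Completion.create(
        engine="davinci",
        prompt=f"Q: {question}\nA:",
        max_tokens=16,
        temperature=0,
        stop="\n",  # stop at the end of the first answer line
    )
    return resp["choices"][0]["text"].strip()

question = "Which country's national flag is the only one that is not a quadrilateral?"
with_clue = ask("Peculiar Pennant: " + question)  # invented alliterative clue
without_clue = ask(question)
print("with clue:   ", with_clue)
print("without clue:", without_clue)
```

With deterministic decoding (temperature 0), any difference between the two outputs is attributable to the prefix, which is what makes the “clues were a net negative” finding checkable question by question.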