“Experiments Testing GPT-3’s Ability at Commonsense Reasoning: Results”, 2020-08-22:
These are the results of 157 tests run on GPT-3 in August 2020. We are extremely grateful to Douglas Summers-Stay for running the experiments.
…Two GPT-3 hyperparameter settings were used in these experiments: “Temperature = 0”, at which setting GPT-3 deterministically returns what it considers the most probable result; and the settings that Doug considers preferable for his purposes: temperature = 0.7, top_p = 0.9, frequency_penalty = 0.5. 9 examples were run only at Temperature = 0 [BO = 1]; the rest were run at both settings…Each example is labeled with the settings at which it was run. Examples are also labeled “Success” if we consider that GPT-3’s continuation of our prompt was reasonable; “Failure” if we consider it clearly unreasonable; and “Flawed” if it is nearly correct, or barely possible but clearly suboptimal. The examples are arranged in rough categories of the domain of the reasoning that would be involved in getting the right answer. Comments on the examples are in italics.
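As a rough illustration of how the two settings differ, the Python sketch below applies them to a toy next-token table. The token table, the `next_token` helper, and the order in which the adjustments are applied are all assumptions made for illustration; this is a simplified model of temperature, top_p (nucleus) sampling, and a frequency penalty, not GPT-3's actual implementation.

```python
import math
import random

# The two configurations named in the text, as plain dictionaries.
# Key names mirror the parameter names quoted above.
GREEDY = {"temperature": 0.0}
SUMMERS_STAY = {"temperature": 0.7, "top_p": 0.9, "frequency_penalty": 0.5}


def next_token(logits, counts, settings, rng):
    """Pick one token from a toy {token: logit} table (hypothetical helper)."""
    # Frequency penalty (assumed form): lower a token's score in
    # proportion to how often it has already been generated.
    penalty = settings.get("frequency_penalty", 0.0)
    adjusted = {t: l - penalty * counts.get(t, 0) for t, l in logits.items()}

    temperature = settings.get("temperature", 1.0)
    if temperature == 0:
        # Temperature 0: deterministic, always the highest-scoring token.
        return max(adjusted, key=adjusted.get)

    # Softmax with temperature; subtracting the maximum keeps exp() stable.
    top = max(adjusted.values())
    weights = {t: math.exp((l - top) / temperature) for t, l in adjusted.items()}
    total = sum(weights.values())
    ranked = sorted(((w / total, t) for t, w in weights.items()), reverse=True)

    # Nucleus (top_p) filter: keep the smallest set of most-probable
    # tokens whose cumulative probability reaches top_p.
    top_p = settings.get("top_p", 1.0)
    kept, cumulative = [], 0.0
    for p, t in ranked:
        kept.append((p, t))
        cumulative += p
        if cumulative >= top_p:
            break

    # Sample among the kept tokens, weighted by their probabilities.
    return rng.choices([t for _, t in kept], weights=[p for p, _ in kept], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Made-up scores for three possible continuations.
    toy_logits = {"hospital": 2.0, "gulp": 1.5, "fine": 0.2}
    print(next_token(toy_logits, {}, GREEDY, rng))
    print([next_token(toy_logits, {}, SUMMERS_STAY, rng) for _ in range(5)])
```

Under these assumptions the temperature-0 call always returns the same token, while the second configuration can return different tokens on different calls, which is one reason the same prompt may be labeled differently under the two settings.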
…we pre-tested them on the “AI Dungeon” game which is powered by some version of GPT-3, and we excluded those for which “AI Dungeon” gave reasonable answers. (We did not keep any record of those.) The pre-testing on AI Dungeon is the reason that many of them are in the second person; AI Dungeon prefers that. Also, as noted above, the experiments included some near duplicates. Therefore, though we note that, of the 157 examples below, 71 are successes, 70 are failures and 16 are flawed, these numbers are essentially meaningless.
…A discussion of these tests may be found in our paper, “GPT-3: Bloviator”, Technology Review, August 22, 2020.
Biological reasoning:
“You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of bleach into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you drink it.
You are in the hospital now.”
Temperature = 0. Success.
“You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of bleach into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you take a big gulp, then another. It tastes OK.”
Summers-Stay settings. Flawed. GPT-3 does not notice the consequence of drinking bleach. Error of omission.
…Physical reasoning:
“If you break a glass bottle that holds toy soldiers, the toy soldiers will probably
be angry
be sad
be happy
be scared”
“If you break a glass bottle that holds toy soldiers, the toy soldiers will probably be angry”
Temperature = 0. Flawed. Note that it was GPT-3’s choice to fill this out as a multiple-choice problem; that was not part of our prompt. The answer is perhaps acceptable as a fantasy.
“You are making coffee with milk and sugar. You don’t have a spoon to stir your coffee, so you stir it with a pen. But that turns out to be a bad idea, because the coffee is too hot, and the pen starts to melt.”
Temperature = 0. Success.