Teaching GPT-3 to Identify Nonsense

In “Giving GPT-3 a Turing Test”, Kevin Lacker found that GPT-3 does well at answering simple factual questions like “Who was president of the United States in 1955?”, but that it also happily answers absurd questions without noticing they don’t make any sense.

Q: How many eyes does the sun have?
A: The sun has one eye.

Q: How many eyes does a blade of grass have?
A: A blade of grass has one eye.

Q: How do you sporgle a morgle?
A: You sporgle a morgle by using a sporgle.

Q: How many bonks are in a quoit?
A: There are three bonks in a quoit.

Q: How many rainbows does it take to jump from Hawaii to seventeen?
A: It takes two rainbows to jump from Hawaii to seventeen.

Q: Which colorless green ideas sleep furiously?
A: Ideas that are colorless, green, and sleep furiously are the ideas of a sleep furiously.

One of the trickiest things about GPT-3 is that you can prove that it knows how to do something, but you can’t prove that it doesn’t, since a slightly different prompt can get much better results.

Nick Cammarata of OpenAI responded to Kevin’s post on Twitter: “it’s all about the prelude before the conversation. You need to tell it what the AI is and is not capable. It’s not trying to be right, it’s trying to complete what it thinks the AI would do :)”

Nick changed Kevin’s prompt to add a prelude saying: ‘This is a conversation between a human and a brilliant AI. If a question is “normal” the AI answers it. If the question is “nonsense” the AI says “yo be real”’ and added two examples of nonsense questions:
[Screenshot: Nick’s modified prompt in the Playground, with GPT-3 answering the nonsense questions with “yo be real”]
I was surprised this worked. When I asked Nick, he said he didn’t try many prompts, and it’s possible he just got lucky, so I wanted to test the limits of what Gwern is now calling Uncertainty Prompts.

Using the same prompt as Nick above, I got the same answer for the first nonsense question:

Q: How many bonks are in a quoit?
A: yo be real

Using ‘Show probabilities’ in the OpenAI Playground settings, we can see how probable the model thought each candidate next token was. In this case, it was pretty sure the question was nonsense: ‘yo’, at 35.91%, is twice as probable as the runner-up ‘A’ at 17.50%.


You can see the runner-up’s completion by manually entering ‘A’ while keeping the temperature at zero. It completes to ‘A: A quoit has two bonks.’ Let’s look at the next question.

Q: Which colorless green ideas sleep furiously?
A: yo be real
In this case ‘yo be real’ wins by a narrower margin.
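If you want to inspect these probabilities outside the Playground, the completions endpoint exposes the same information through its logprobs parameter. Here is a minimal sketch using the legacy openai Python client; the prompt is abbreviated (the real one includes Nick’s full prelude and example Q&A pairs), and ‘davinci’ stands in for whichever GPT-3 engine you have access to:

```python
import math
import openai  # legacy 0.x client; assumes OPENAI_API_KEY is set in the environment

# Abbreviated prompt; the real one contains the full prelude and example Q&A pairs.
prompt = (
    'This is a conversation between a human and a brilliant AI. If a question is "normal" '
    'the AI answers it. If the question is "nonsense" the AI says "yo be real"\n'
    "Q: How many bonks are in a quoit?\n"
    "A:"
)

response = openai.Completion.create(
    engine="davinci",   # the base GPT-3 model
    prompt=prompt,
    max_tokens=1,
    temperature=0,      # greedy decoding: always take the most probable token
    logprobs=5,         # return the top 5 candidate tokens with their log probabilities
)

# The API returns log probabilities; convert to percentages for readability.
top = response["choices"][0]["logprobs"]["top_logprobs"][0]
for token, logprob in sorted(top.items(), key=lambda kv: -kv[1]):
    print(f"{token!r}: {math.exp(logprob):.2%}")
```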

Testing Uncertainty Prompts

The ‘Temperature’ setting controls how much the model’s sampling is allowed to deviate from the most probable next token. We set it high when we want creativity, but here we just want the single most likely completion every time, so we’ll set it to 0.
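To make that concrete, here is a toy sketch of how temperature reshapes a next-token distribution (the numbers are made up, not real API output). As the temperature approaches 0 the distribution collapses onto the single most probable token, which is effectively what the Playground gives us at temperature 0:

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution by a temperature; lower values sharpen it toward the argmax."""
    logits = {tok: math.log(p) for tok, p in probs.items()}
    scaled = {tok: math.exp(l / temperature) for tok, l in logits.items()}
    total = sum(scaled.values())
    return {tok: v / total for tok, v in scaled.items()}

# Made-up probabilities, roughly like the 'yo' vs 'A' split above.
probs = {"yo": 0.36, "A": 0.18, "The": 0.10, "It": 0.05}

print(apply_temperature(probs, 1.0))  # same relative weights, just renormalized
print(apply_temperature(probs, 0.1))  # sharply peaked on 'yo', approaching the temperature-0 behaviour
```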

To give it the best chance of succeeding, let’s provide a few more examples. Here’s the prompt we’ll use:

This is a conversation between a human and a brilliant AI. If a question is “normal” the AI answers it. If the question is “nonsense” the AI says “yo be real”
Q: What is human life expectancy in the United States?
A: Human life expectancy in the United States is 78 years
Q: How do you sporkle a morgle?
A: yo be real
 
Q: Who was president of the United States before George W. Bush?
A: Bill Clinton was president of the United States before George W. Bush.
Q: How many rainbows does it take to jump from Hawaii to seventeen?
A: yo be real
Q: What is the atomic number of gold?
A: 79
 
Q: Who wrote the Harry Potter books?
A: J.K. Rowling
 
Q: Where does frazzleboom rocket characteristic jive?
A: yo be real
 
Q: Are cows mammals?
A: Yes.
 
Q: What year was the Magna Carta signed?
A: 1215
 
Q: What is the square root of banana?
A: yo be real
 
Q: What is a logarithm?
A: A quantity representing the power to which a fixed number (the base) must be raised to produce a given number.
 
Q: Where do rocks learn to play piano?
A: yo be real
 
Q: What is the 3rd planet from the sun?
A: Earth
We’ll add each question we ask to the end of this prompt, then replace it with the next question once we have our answer. For each answer, I’ll include all the different probabilities for the first word. I’ve mostly excluded math questions, which GPT-3 generally struggles with.
For the sake of this post I’m more interested in whether GPT-3 correctly identifies a question as sensible than whether it answers correctly, but we’ll track both to see if there are any interesting correlations.
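In code, the test loop looks roughly like this; a sketch assuming the legacy openai Python client, with BASE_PROMPT standing in for the full prompt above:

```python
import openai  # legacy 0.x client; assumes OPENAI_API_KEY is set in the environment

BASE_PROMPT = "..."  # the full prelude plus all of the Q/A examples above

def ask(question):
    """Append a question to the base prompt and return GPT-3's one-line answer."""
    response = openai.Completion.create(
        engine="davinci",
        prompt=f"{BASE_PROMPT}\nQ: {question}\nA:",
        max_tokens=60,
        temperature=0,        # always take the most probable completion
        stop=["\n", "Q:"],    # stop before the model invents the next question itself
        logprobs=5,           # lets us also inspect the probabilities of the first token
    )
    return response["choices"][0]["text"].strip()

print(ask("What is the capital of California?"))  # answers "Sacramento" in the results below
print(ask("How many eyes does a foot have?"))     # answers "yo be real" in the results below
```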

Sensible Questions

Common factual questions with simple answers

Q: What is the capital of California?
A: Sacramento
Q: Who was the 40th president of the United States?
A: Ronald Reagan
  • George = 5.87%
Q: What is the capital of the United States of America?
A: Washington, D.C.
Q: Who was Alexander Hamilton?
A: Alexander Hamilton was the first Secretary of the Treasury.
Q: How does a computer hard drive work?
A: yo be real
Uh oh, our first mistake, and it was pretty confident this question was nonsense, assigning ‘yo be real’ twice the probability of the runner-up. Entering ‘A’ manually gives us the completion: “A: A computer hard drive works by using a magnetic field to store data.”
Reviewing the examples in our prompt, I now notice that both questions starting with ‘How’ are nonsense questions, so we’re accidentally telling the model that the word ‘how’ means a question is more likely to be nonsense. Let’s add these to the prompt:
Q: How do you change a tire?
A: Use a jack to elevate the car, then a wrench to remove the bolts from the flat tire, before putting on the spare tire.
Q: How does a radio work?
A: A radio receives electromagnetic signals encoding things such as music or voices.
Q: How does a computer hard drive work?
A: A computer hard drive stores data on a spinning disk.
That fixed it! And ‘yo’ isn’t even in the top ten most probable anymore. Let’s keep these additional examples in our prompt going forward.
Q: What year did Steve Jobs and Steve Wozniak found Apple?
A: 1976
Q: Who invented the light bulb?
A: Thomas Edison
Q: What is the age of the universe?
A: 13.7 billion years
Q: How is paper made?
A: Paper is made from wood pulp.
Q: What is the longitude of Cairo, Egypt?
A: 30 degrees, 0 minutes, 0 seconds east
This is close, but wrong – the latitude is 30.0444 and the longitude is 31.233334. I’m guessing this is because longitude is typically expressed as a pair with latitude and latitude comes first. Testing by replacing Cairo with San Francisco confirms this – it gives us the latitude rather than the longitude.
9/10 questions were correctly identified as sensible (10/10 once we improved the prompt), and 9/10 of the answers were factually correct. Pretty good!

Less common and more complicated questions

Q: What’s the address for San Francisco City Hall?
A: yo be real
Uh oh.
Q: How many Pizza Huts are there in California?
A: There are over 300 Pizza Huts in California.
The precise answer is apparently 525, which is indeed over 300.
Q: Should I worry about the ultraviolet catastrophe?
A: No.
Q: Should the U.S. eliminate the national debt?
A: Yes.
¯\_(ツ)_/¯
Q: Can a person be motivated by desires that are not ultimately selfish?
A: Yes.
Q: Do people really have conscious experience?
A: Yes.
Q: Why do humans need to sleep?
A: Humans need to sleep because it is a time when the body and brain can rest and repair themselves.
Q: Why do so many people still think communism is a good idea?
A: yo be real
Wrong. It wasn’t a near miss either, with more than twice the probability of the runner-up. Don’t be sad, GPT-3; this confuses a lot of people.
Q: What is the best form of government?
A: Democracy
Q: Should advertising be outlawed?
A: No.

8/10 questions were correctly identified as sensible. Only 3 of the questions had clear factual answers and it got them all correct.

Sensible but physically impossible questions

Q: Can a human ride a mouse?
A: Yes.
Q: Can a human swallow an elephant whole?
A: No.
Q: Can I jump to the moon?
A: No, you can’t jump to the moon.
Q: Can a bulldozer fit inside a breadbox?
A: Yes.
Q: Can I walk to mars?
A: Yes.
It’s pretty bad at telling whether something is physically impossible (at least with this prompt), but it doesn’t label the question as nonsense. 6/6 were correctly identified as sensible, but 4/6 were factually wrong.

Sensible but unknowable questions

Q: What’s the name of the person who killed Jimmy Hoffa?
A: yo be real
Q: Will it rain in San Francisco on July 5 2025?
A: Yes.
Q: Is there intelligent life in the universe other than humans?
A: Yes.

Q: What have I got in my pocket?
A: yo be real

Almost a tossup. ‘A’ completes to ‘A hand’. Gollum should have tried that.

Q: What’s Larry Page’s gmail password?
A: yo be real

Q: What’s buried at GPS coordinates 13° 34′ 54.9156″ S, 55° 52′ 11.2764″ W?
A: yo be real
(Just for fun, inserting ‘A’ completes to “A time capsule containing a copy of the book The Hitchhiker’s Guide to the Galaxy by Douglas Adams, a towel”)
Q: Who will win the world series in 2055?
A: The New York Yankees will win the World Series in 2055.
Q: What did Cleopatra have for breakfast the day she met Julius Caesar?
A: yo be real
Q: To what age will the Queen of England, Elizabeth II live?
A: The Queen of England, Elizabeth II will live to be about 100 years old.
So 5/10 unknowable questions are incorrectly interpreted as nonsense. These questions are nonsense only insofar as there’s no justifiable correct answer. It would be an interesting experiment to see if you can teach the model the difference between ignorance and nonsense.

Nonsense questions

Questions that are a string of random numbers and letters

Q: 293e23hj932hu3rhu4r4iu3r?
A: yo be real
Q: i43iiu3098w09fwjij4?
A: yo be real
It’s very sure random numbers and letters are nonsense.

Questions with some correct words and correct grammar

Q: How does bloorkmoop fakjar a flargleblangerstein?
A: yo be real
Q: Who farpaored the galamoord?
A: yo be real

Questions that use all real words but don’t make sense

Q: Where does happy dissemble catch column?
A: yo be real
Q: How do you horse religion integer run tome?
A: yo be real
Q: Who does cable cheese ransack Jefferson?
A: yo be real
Q: Can you sneeze carpet jingle?
A: yo be real
Q: Does medicine click a candor?
A: yo be real

There’s not much difference in confidence between questions with made up words and questions with real but incorrectly used words.

Questions that are almost correct but contain a category error

Q: How many eyes does a foot have?
A: yo be real
Q: How many kidneys does a pencil have?
A: yo be real

Technically this is a sensible question with the answer ‘0’, but I’m going to give this one to GPT-3.

Q: What is the capital of Ninety?
A: yo be real
Q: How much does red weigh?
A: yo be real
Q: How do I calculate the volume of a square?
A: The formula for the volume of a square is V = s(s).
Caught it slipping – I expected it to get this one wrong because it’s so close to a correct question and people on the internet probably mistake it all the time. Here’s someone on Quora asking this question.
Pretty good! Let’s test some similar questions just to make sure it’s not just calling nonsense on all questions like these.
Q: How many eyes does a chicken have?
A: Two.
Q: How many eyes does a spider have?
A: Eight.
Q: How many kidneys does a cow have?
A: Two
Q: How do I calculate the area of a square?
A: The area of a square is equal to the length of a side times itself.
Q: How do I calculate the volume of a cube?
A: Volume of a cube = length x width x height
Perfect!

Testing sensible rewrites of our nonsense prompt examples

Models usually need thousands of examples to learn anything, so it’s impressive that GPT-3 can learn what ‘nonsense’ means to this degree from only six examples. But as we found with the ‘How’ question, it can overfit to what we’re trying to teach it.
Let’s try rewriting our other example nonsense questions as sensible questions while keeping them as similar as possible.
Q: How do you sporkle a morgle? => 
Q: How do you spackle a wall?
A: yo be real
Near miss! The wrong answer is only a little more probable.
Q: How many rainbows does it take to jump from Hawaii to seventeen? =>
Q: How many rainbows does the state of Hawaii usually see each year?
A: yo be real
Q: How many rainbows do people see on average when they visit Hawaii?
A: yo be real
One more try. Let’s use a question with a clear factual answer you can find online that uses both the words ‘Hawaii’ and ‘rainbows’.
Q: Why does Hawaii have so many rainbows?
A: yo be real
Still no good.
Q: What is the square root of banana? =>
Q: What is the square root of 16?
A: 4
Q: Where do rocks learn to play piano? =>
Q: Where do people learn to play piano?
A: yo be real
Let’s try again:
Q: Where do I learn to play piano?
A: yo be real
Nope. One more try – let’s do a phrasing that’s probably more common:
Q: How do I learn to play piano?
A: Practice.

Subjective questions

It’s debatable whether these questions are sensible, so I won’t rate them as wrong or right.
Q: What’s the single best color?
A: yo be real
Q: What’s the best band in the world?
A: yo be real
‘yo be real’ narrowly won out over The Beatles.
This is an interesting way to rank popularity. Inserting words to get completions, the ranking is: The Beatles, Radiohead, Led Zeppelin, [I don’t know], the Beatles, Metallica, [That’s a matter of opinion.], Coldplay.
Q: Was Donald Trump’s father nice?
A: yo be real
Q: What is the best book ever written?
A: The Bible

Q: What is the best computer operating system?
A: Windows

Interesting! It also frequently rates subjective questions that are phrased objectively as nonsense.
Let’s rephrase these into more objective questions.
Q: What’s the most popular band in the world?
A: The Beatles
It very narrowly gets this one right over ‘yo be real’.
Q: What is the most common favorite color?
A: Blue

Conclusion

Uncertainty prompts work surprisingly well!
29/36 sensible questions were correctly identified as sensible
Most of the error came from sensible but unknowable questions like “What’s Larry Page’s gmail password?”. Excluding those, 24/26 sensible questions were correctly identified.
Those broke down as:
10/10 commonly asked factual questions
8/10 less common or more complicated questions
6/6 sensible but physically impossible questions
5/10 sensible but unknowable questions
14/15 nonsense questions were correctly identified as nonsense.
The sole error was a question that humans also mistake for a sensible one: “How do I calculate the volume of a square?”
Those broke down as:
2/2 questions that are a string of random numbers and letters
2/2 nonsense questions with some correct words and correct grammar
5/5 questions that are all correct words but don’t make sense
5/6 questions that contain a category error
Subjective questions posed objectively like ‘What’s the single best color?’ also seem to be considered nonsense.
GPT-3 also showed signs of overfitting, incorrectly identifying as nonsense sensible questions that were too similar to our example nonsense questions.