Modern neural text generation systems can produce remarkably fluent and grammatical texts. While earlier language models suffered from repetition and syntactic errors, the errors made by contemporary models are often semantic, narrative, or discourse failures.
To facilitate research on these complex error types, we introduce Scarecrow, a new structured, crowdsourced error annotation schema. The error categories used in Scarecrow, such as redundancy, commonsense errors, and incoherence, were identified by combining expert analysis with several pilot rounds of ontology-free crowd annotation, arriving at a schema that covers the error phenomena found in real machine-generated text.
We use Scarecrow to collect 13k annotations of 1.3k human- and machine-generated paragraphs of English-language news text, amounting to over 41k spans, each labeled with its error category, severity, a natural language explanation, and antecedent span (where relevant). We collect annotations for text generated by state-of-the-art systems with varying known performance levels, from GPT-2-small through the largest GPT-3-175b. We isolate several factors for detailed analysis, including parameter count, training data, and decoding technique.
Our results show both expected and surprising differences across these settings. These findings demonstrate the value of Scarecrow annotations in the assessment of current and future text generation systems. We release our complete annotation toolkit and dataset on GitHub.
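To make the annotation structure concrete, the sketch below shows one way a Scarecrow span record could be represented in Python. The field names and types here are illustrative assumptions, not the schema of the released dataset.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpanAnnotation:
    """One Scarecrow span annotation (field names are illustrative)."""
    error_type: str    # e.g., "Redundant", "Commonsense", "Incoherent"
    start: int         # token index where the span begins
    end: int           # token index where the span ends (exclusive)
    severity: int      # graded severity, e.g., 1 (mild) and up
    explanation: str   # annotator's natural language explanation
    antecedent: Optional[tuple[int, int]] = None  # earlier span, where relevant
```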
Figure 2: Average portion of tokens annotated with each span type (y-axis) across models (x-axis), with 95% confidence intervals.
Figure 3: Average portion of tokens covered by span annotations, broken down by span type. All models, including GPT-3, use the same apples-to-apples decoding hyperparameters: top-p=0.96, temperature=1, and no frequency penalty. We scale each span by its token length, normalize by generation token lengths, and remove severity-1 Grammar and Usage errors (see §C).
Figure 4: Taking the average span coverage (Figure 3) and removing reader issues (Technical Jargon and Needs Google), we plot values and 95% confidence intervals for all models, including all decoding hyperparameters we tested for GPT-3. We find a surprisingly large change in annotated errors depending on the decoding setting used.
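As a concrete reading of the span-coverage metric in Figures 3 and 4, the following Python sketch computes per-type coverage and a bootstrap 95% confidence interval. The data layout, the overlap handling (span lengths are summed without deduplication), and the choice of a percentile bootstrap are our assumptions for illustration; see the captions above and §C for the exact filtering applied in the paper.

```python
import random

def span_coverage(generations, error_type):
    """Per-generation fraction of tokens covered by spans of `error_type`.

    `generations` is assumed to be a list of dicts like
    {"n_tokens": int, "spans": [SpanAnnotation, ...]} (illustrative layout).
    Each span is scaled by its token length and normalized by the
    generation's token length, as in Figure 3.
    """
    values = []
    for gen in generations:
        covered = sum(
            s.end - s.start
            for s in gen["spans"]
            if s.error_type == error_type
            # Mirror Figure 3: drop severity-1 Grammar and Usage spans.
            and not (s.error_type == "Grammar and Usage" and s.severity == 1)
        )
        values.append(covered / gen["n_tokens"])
    return values

def bootstrap_ci(values, n_boot=10_000, alpha=0.05):
    """Mean with a percentile-bootstrap (1 - alpha) confidence interval."""
    means = sorted(
        sum(random.choices(values, k=len(values))) / len(values)
        for _ in range(n_boot)
    )
    lower = means[int(n_boot * alpha / 2)]
    upper = means[int(n_boot * (1 - alpha / 2)) - 1]
    return sum(values) / len(values), (lower, upper)
```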
Scaling pays off in reducing Encyclopedic, Commonsense, and Incoherent errors (Figure 2).
These error categories decrease with in-domain training (GROVER) and larger model size (GPT-3). Human text still shows the fewest of these kinds of errors.
Scaling benefits plateau for Off-Prompt, Bad Math, and Grammar & Usage errors (Figure 2).
These three error categories plateau in error reduction when scaling up to GPT-3. Of these error types, humans still commit fewer Off-Prompt (more: §6.1) and Grammar & Usage errors, but Bad Math appears saturated for our domain.
Self-Contradiction and Redundant errors exhibit more complex scaling behavior (Figure 2).
We roughly categorize these trends as rising and falling: error rates increase for medium- or large-scale models but drop for human-authored text. Further analysis (§6.2, §6.3) reveals that these more complex patterns are affected both by interactions with other error types and by how errors are counted.
Human-authored text produces the most reader issues (Figures 2 and 3).
The Needs Google and Technical Jargon span categories are both most frequent in human-authored text, and both fall under reader issues: problems that are not necessarily errors, but that still prevent full comprehension or factual verification of the text (more: §6.4).
Furthermore, human-authored text is not free from error annotations (Figure 3). This can serve either as a control for baseline error rates (more: §6.6), or as a mechanism for critiquing human writing.
Decoding hyperparameters have a huge impact (Figure 4).
For the previous findings, we fix the sampling configuration for all models to an apples-to-apples setup for fair comparison: top-p = 0.96, (softmax) temperature = 1, and no frequency penalty (i.e., a penalty on word repetition; defined precisely in §5.2, Equation 1). To study the effects of these decoding settings, we annotate text generated by GPT-3 using a variety of values for top-p and temperature, both with and without a frequency penalty.
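For readers unfamiliar with these settings, below is a minimal sketch of a single decoding step combining all three knobs. The penalty form, subtracting a multiple of each token's occurrence count from its logit, follows the common API convention and is an assumption here; the paper's Equation 1 gives the precise definition.

```python
import numpy as np

def sample_next_token(logits, counts, top_p=0.96, temperature=1.0,
                      freq_penalty=0.0):
    """One decoding step: frequency penalty, temperature, then top-p sampling.

    logits: unnormalized next-token scores over the vocabulary.
    counts: how often each vocabulary item already appears in the generation.
    """
    logits = logits - freq_penalty * counts   # discourage repeated tokens

    if temperature == 0:                      # argmax ("greedy") sampling
        return int(np.argmax(logits))

    logits = logits / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Keep the smallest set of tokens whose cumulative probability >= top_p.
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]

    kept = probs[keep] / probs[keep].sum()    # renormalize over the nucleus
    return int(np.random.choice(keep, p=kept))
```

Setting temperature to 0 in this sketch corresponds to the argmax sampling configurations discussed next.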
To our surprise, the decoding hyperparameters considerably affected error rates (more: §6.5). As seen in Figure 4, the worst sampling procedure for GPT-3 (argmax sampling with no frequency penalty) performed even worse than GPT-2 XL. But the best sampling procedure (surprisingly, also argmax sampling, but with a frequency penalty) produced text with as few apparent Scarecrow error spans as those authored by humans (more: §6.6).
We notice that a greater portion of errors in human-authored text were due to artifacts present in the text-only format of the Common Crawl. For example, links to other articles or advertisements sometimes appear in the middle of an article's text. While annotators were quick to mark these spans, they reflect errors in formatting, not in writing. We partition these errors separately and exclude them from the subsequent calculations. GPT-3's generations also sometimes exhibited what appeared to be formatting errors due to training on web-scraped text, though more rarely. For example, some generations contained "Which?" after vague noun phrases, apparently learned from Wikipedia, where under-specified information is tagged by an editor with this template. For fairness, we removed these errors from GPT-3's tally as well, though they were few enough that we do not plot them separately.