[cf. Dhingra et al., 2021] Large-scale language models show promising text generation capabilities, but users cannot easily control particular aspects of the generated text.
We release CTRL, a 1.63 billion-parameter conditional Transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence.
This provides a potential method for analyzing large amounts of data via model-based source attribution.
We have released multiple full-sized, pretrained versions of CTRL on GitHub.
…With 1.63 billion parameters, our Conditional Transformer Language (CTRL) model can generate text conditioned on control codes that specify domain, style, topics, dates, entities, relationships between entities, plot points, and task-related behavior. To preserve the generality of the language model trained in an unsupervised setting, we train CTRL on control codes derived from structure that naturally co-occurs with the raw text typically collected for training large language models. For example, large resources like Wikipedia, Project Gutenberg, and Amazon Reviews can each be assigned a domain-related control code. Smaller resources, like the content extracted from individual subreddits, often occur with both a broader domain name, reddit, as well as subdomain information, r/subdomain. In the vast majority of cases, text collected for training is associated with a URL, which often contains information pertinent to the text it represents. Humans can use these codes to trigger generation of text from different linguistic communities without having to understand how to prompt with particular linguistic patterns. Text can be generated in more predictable ways by controlling for content or changing the domain even when the initial prompt remains fixed.
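As a concrete illustration of the mechanism described above, conditioning amounts to placing a control code at the first position of the input sequence, so the model factorizes p(x | c) = ∏ᵢ p(xᵢ | x₍₍i₎₎, c). A minimal sketch, not the released implementation; the token strings are illustrative, not CTRL's actual vocabulary:

```python
# Minimal sketch of control-code conditioning: the code occupies the first
# position of the input sequence, so every generated token can attend to it.
# Token strings here are illustrative, not CTRL's actual vocabulary.
def build_input(control_code, prompt_tokens):
    """Prepend a control code to an already-tokenized prompt."""
    return [control_code] + list(prompt_tokens)

# The same prompt under two different codes yields two different
# conditioning contexts, and hence predictably different continuations.
horror = build_input("Horror", ["A", "knife"])
reviews = build_input("Reviews", ["A", "knife"])
```

Because the code is just another token in the context, no architectural change is required; the Transformer learns the code-to-style association from the co-occurrence statistics of the training data.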
…3.1 Data: We train on 140 GB of text drawn from a wide variety of domains: Wikipedia (En, De, Es, Fr), Project Gutenberg, submissions from 45 subreddits, OpenWebText, a large collection of news data (Hermann et al., 2015; Barrault et al., 2019; Sandhaus, 2008; Grusky et al., 2018), Amazon Reviews (McAuley et al., 2015), Europarl and UN data from WMT (En-De, En-Es, En-Fr) (Barrault et al., 2019), question-answer pairs (no context documents) from ELI5 (Fan et al., 2019) and the MRQA shared task, which includes the Stanford Question Answering Dataset (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2016), TriviaQA (Joshi et al., 2017), SearchQA (Dunn et al., 2017), HotpotQA (Yang et al., 2018), and Natural Questions (Kwiatkowski et al., 2019). A full account of training data and associated control codes can be found in Table 7 in the Appendix…In our version of OpenWebText, we include the URL used to download each document as the start of the input sequence. During training, CTRL learns relationships between the structure of these URLs and the text that follows. At inference, novel URLs can be used to specify a variety of features: domain, subdomain, entities, entity relations, and even dates.
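The URL-as-control-code idea can be sketched as follows. The function name and field names are illustrative, not the paper's preprocessing code; the model itself only sees the raw URL prepended to the document text, and this breakdown merely shows what information a URL carries:

```python
from urllib.parse import urlparse

def url_to_features(url):
    """Decompose a URL into the kinds of structure CTRL can learn to exploit.
    Hypothetical helper for illustration: the model receives only the raw
    URL as a prefix; domain and path are implicit structure it can learn."""
    parts = urlparse(url)
    return {
        "control_prefix": url,  # what is actually prepended to the text
        "domain": parts.netloc,  # e.g. which news outlet or site
        "path": [p for p in parts.path.split("/") if p],  # dates, topics, entities
    }

feats = url_to_features("https://www.cnn.com/2018/09/25/us-president")
```

Note how a single URL simultaneously encodes a domain (the outlet), a date (the path prefix), and a topic or entity (the final path segment), which is why novel URLs can steer all of these at once during inference.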
…4.2 Control Codes:
Style by domain: Most control codes for our model specify the overall style of generated text by indicating a particular domain of training data.
Examples in Table 1 demonstrate that even for identical prompts, control codes allow for predictable variation in generation. The examples in Table 2 show how CTRL can generate domain-specific text without any prompt.
…Triggering specific tasks: A small number of control codes are related to specific tasks like question answering and translation.
These codes constrain the generation process the most, by triggering task-specific generation. In Table 4, we demonstrate relatively complex control codes for question answering and machine translation that act as a template mixed with a natural language prompt.
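The task codes can be thought of as fixed templates wrapped around a natural language prompt. A hedged sketch follows; the exact template strings are assumptions about the format suggested by Table 4, not copied from the released model:

```python
# Hypothetical templatized prompts in the spirit of the task control codes.
def question_prompt(question):
    # Question answering: the task code plus a Q:/A: scaffold; generation
    # continues after "A:" with the answer.
    return f"Questions Q: {question} A:"

def translation_prompt(text, src="English", tgt="German"):
    # Machine translation: language names mark the direction; generation
    # continues after the target-language marker with the translation.
    return f"Translation {src} : {text} ; {tgt} :"

q = question_prompt("Who wrote Frankenstein?")
t = translation_prompt("The weather is nice.")
```

The template constrains the continuation far more tightly than a style code does, which is why these codes trigger task-specific rather than merely domain-specific generation.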
Zero-shot code-mixing: In the first example, we mix a diet subreddit (/r/keto) with machine translation control codes for English and German.
In contrast to using the Translation code alone, the generated text with mixed codes is coherent across multiple translated lines. This structure is an influence of Diet, which had multi-line examples in the training data, whereas the translation data consisted of shuffled single lines. In the second example, we mix the politics subreddit (/r/politics) with a prompt that starts in French, though no examples of this kind were found in the training data.
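Zero-shot mixing requires no special machinery: multiple codes simply share the conditioning prefix. A sketch, with illustrative code names and a hypothetical helper:

```python
def mixed_prefix(codes, prompt=""):
    """Concatenate several control codes into one conditioning prefix.
    The model was never trained on the combined prefix; any coherent
    output under it is an emergent, zero-shot behavior."""
    return " ".join(codes) + ((" " + prompt) if prompt else "")

ex = mixed_prefix(["Diet", "Translation"], "I lost 10 lbs.")
```

Because every code is just context to the same autoregressive model, combinations unseen in training are well-defined inputs, even if their effect on generation is only discovered empirically.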
Table 1: Even for identical prompts (blue), control codes (red) allow for predictable variation in generation.
Table 2: With CTRL, no prompt (blue) is necessary as long as a control code (red) is provided. Control codes can be combined (Reviews, Rating:, and VALUE) to provide finer-grained control.
Table 3: CTRL is trained with links as control codes (red). Links provide a way to specify domain, subdomain, entities, entity relations, and even date. The links in these examples do not actually link to text; users can mimic the structure of the URLs that appear during training to create novel content during generation. Note that us-president is interpreted differently by the model depending on the date used (2007, 2014, vs 2018). Similarly, star is interpreted differently based on the domain (cnn vs. etonline) and topic (style vs. politics) can be varied even for identical entities (george-clooney).
Table 4: More complex templatized control codes are used for task-specific generation.
Table 5: Some codes can be mixed to generate text with novel cross-over behavior. Here, we present two examples. In the first, we mix translation codes into the Diet domain: the model alternately generates English and German sentences while respecting the Diet domain and remaining coherent across translations. In the second, the Politics domain is mixed with a French prompt despite this combination never appearing in training.
…7. Future Directions:
More control codes and finer-grained control: The particular choice of control codes in this work is intended to represent a reasonably large variety in control over domain, topic, entities, entity relations, and dates.
A very flexible means of control is the natural structure of the internet in the form of URLs. Many of the domains mapped in this work to a single control code (e.g., Wikipedia, Project Gutenberg) could be refined to provide finer-grained control, either through further exploitation of URL structure (en.wikipedia.org, de.wikipedia.org, en.wikipedia.org/wiki/Anarchism, en.wikipedia.org/wiki/Anarchism#History) or through manual extraction of structure already present in the data (e.g., Books Author Title Chapter).
Appendix A: Data sources and breakdown: Table 7: Data and control codes. Wikipedia, Books, News, and multilingual data have no secondary code. Reviews can be followed by Rating: and a value in {1.0, 2.0, 3.0, 4.0, 5.0}. For Links, a full or partial URL can be provided (see Table 3). For all the Reddit data, the secondary code can be Title: or Text:, which correspond to the title and text of the post, respectively.
We hope future work explores extensions of CTRL to new domains in ways that provide further insight into controllable text generation.
[If conditioning isn’t solving your problems, you aren’t using enough of it!]