“CTRL: A Conditional Transformer Language Model For Controllable Generation”, Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, Richard Socher, 2019-09-11:

[cf. Dhingra et al 2021] Large-scale language models show promising text generation capabilities, but users cannot easily control particular aspects of the generated text.

We release CTRL, a 1.6 billion-parameter conditional Transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence.

This provides a potential method for analyzing large amounts of data via model-based source attribution.
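Source attribution here can be read as Bayes' rule over control codes: rank each domain code by p(code | text) ∝ p(text | code)·p(code), where p(text | code) is the sequence likelihood the model assigns under that code. A minimal sketch with invented log-likelihood numbers (in practice these would come from scoring the query sequence under each domain control code with the trained model):

```python
import math

# Hypothetical log p(text | code) scores for one query sequence; the codes
# and values below are illustrative, not outputs of the actual CTRL model.
log_likelihoods = {
    "Wikipedia": -120.4,
    "Reviews": -95.2,
    "r/legaladvice": -110.8,
}

def attribute(log_liks):
    """Rank codes by p(code | text), assuming a uniform prior over codes."""
    # log-sum-exp normalizer for numerical stability
    m = max(log_liks.values())
    z = m + math.log(sum(math.exp(v - m) for v in log_liks.values()))
    return sorted(((c, math.exp(v - z)) for c, v in log_liks.items()),
                  key=lambda kv: -kv[1])

ranking = attribute(log_likelihoods)
# the top-ranked code is the domain the model considers the likeliest source
```

The uniform prior is a simplifying assumption; a prior proportional to each domain's share of the training data would be another reasonable choice.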

We have released multiple full-sized, pretrained versions of CTRL on GitHub.

…With 1.63 billion parameters, our Conditional Transformer Language (CTRL) model can generate text conditioned on control codes that specify domain, style, topics, dates, entities, relationships between entities, plot points, and task-related behavior. To preserve the generality of the language model trained in an unsupervised setting, we train CTRL on control codes derived from structure that naturally co-occurs with the raw text typically collected for training large language models. For example, large resources like Wikipedia, Project Gutenberg, and Amazon Reviews can each be assigned a domain-related control code. Smaller resources, like the content extracted from individual subreddits, often occur with both a broader domain name, reddit, as well as subdomain information, r/subdomain. In the vast majority of cases, text collected for training is associated with a URL, which often contains information pertinent to the text it represents. Humans can use these codes to trigger generation of text from different linguistic communities without having to understand how to prompt with particular linguistic patterns. Text can be generated in more predictable ways by controlling for content or changing the domain even when the initial prompt remains fixed.
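At the data level, the conditioning mechanism is simple: each training sequence is prefixed with its control code(s), so ordinary next-token prediction over the joined sequence teaches the model p(text | code). A minimal sketch (the code names and whitespace tokenization are illustrative, not CTRL's actual BPE vocabulary or code set):

```python
def make_training_sequence(control_codes, text):
    """Prefix a raw-text sequence with its control codes.

    control_codes: list of code strings, e.g. ["Reviews"] or
    ["reddit", "r/subdomain"]; text: the document body.
    """
    # codes occupy the first positions; the model learns that everything
    # after them is distributed according to the named domain
    return list(control_codes) + text.split()

seq = make_training_sequence(["Reviews"], "I bought this lamp last week")
# at inference, generation is seeded with the code(s) alone, or with the
# code(s) followed by a prompt, to steer the continuation
```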

3.1 Data: We train on 140 GB of text drawing from a wide variety of domains: Wikipedia (En, De, Es, Fr), Project Gutenberg, submissions from 45 subreddits, OpenWebText, a large collection of news data (Hermann et al 2015; Barrault et al 2019; Sandhaus 2008; Grusky et al 2018), Amazon Reviews (McAuley et al 2015), Europarl and UN data from WMT (En-De, En-Es, En-Fr) (Barrault et al 2019), question-answer pairs (no context documents) from ELI5 (Fan et al 2019) and the MRQA shared task, which includes the Stanford Question Answering Dataset (Rajpurkar et al 2016), NewsQA (Trischler et al 2016), TriviaQA (Joshi et al 2017), SearchQA (Dunn et al 2017), HotpotQA (Yang et al 2018), and Natural Questions (Kwiatkowski et al 2019). A full account of training data and associated control codes can be found in Table 7 in the Appendix…In our version of OpenWebText, we include the URL used to download each document as the start of the input sequence. During training, CTRL learns relationships between the structure of these URLs and the text that follows. At inference, novel URLs can be used to specify a variety of features: domain, subdomain, entities, entity relations, and even dates.
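The URL conditioning works the same way as any other control code: the URL is prepended to the input sequence, and its path components (domain, date, topic, entities) are what the model learns to associate with the text that follows. A minimal sketch, using an invented URL rather than one from the training set:

```python
from urllib.parse import urlparse

def make_url_conditioned_input(url, prompt=""):
    """Prepend a URL control code to an (optional) text prompt."""
    return f"{url} {prompt}".strip()

# illustrative URL: the path encodes the domain (cnn), a date (2018-09-01),
# a section (politics), and entity slugs, which is what lets novel,
# never-crawled URLs steer generation at inference
url = "https://www.cnn.com/2018/09/01/politics/us-president"
parts = urlparse(url)  # e.g. parts.netloc == "www.cnn.com"
inp = make_url_conditioned_input(url, "The president")
```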

4.2 Control Codes:

Table 1: Even for identical prompts (blue), control codes (red) allow for predictable variation in generation.
Table 2: With CTRL, no prompt (blue) is necessary as long as a control code (red) is provided. Control codes can be combined ('Reviews', 'Rating:', and 'VALUE') to provide finer-grained control.
Table 3: CTRL is trained with links as control codes (red). Links provide a way to specify domain, subdomain, entities, entity relations, and even date. The links in these examples do not actually link to text; users can mimic the structure of the URLs that appear during training to create novel content during generation. Note that us-president is interpreted differently by the model depending on the date used (2007, 2014, vs 2018). Similarly, star is interpreted differently based on the domain (cnn vs. etonline), and the topic (style vs. politics) can be varied even for identical entities (george-clooney).
Table 4: More complex templatized control codes are used for task-specific generation.
Table 5: Some codes can be mixed to generate text with novel cross-over behavior. Here, we present two examples. In the first, we mix translation codes into the Diet domain; the model alternates between generating English and German sentences while respecting the Diet domain and remaining coherent across translations. In the second, the Politics domain is mixed with a French prompt despite this combination never appearing in training.

7. Future Directions: