"Help Me Write a Poem: Instruction Tuning As a Vehicle for Collaborative Poetry Writing (CoPoet)", 2022-10-25:
Recent work in training large language models (LLMs) to follow natural language instructions has opened up exciting opportunities for natural language interface design. Building on the prior success of LLMs in the realm of computer-assisted creativity, we aim to study if LLMs can improve the quality of user-generated content through collaboration.
We present CoPoet, a collaborative poetry writing system.
In contrast to auto-completing a user's text, CoPoet is controlled by user instructions that specify the attributes of the desired text, such as "Write a sentence about 'love'" or "Write a sentence ending in 'fly'". The core component of our system is a language model [T0, T5-11b] fine-tuned on a diverse collection of instructions for poetry writing.
Our model is not only competitive with publicly available LLMs trained on instructions (InstructGPT), but is also capable of satisfying unseen compositional instructions.
A study with 15 qualified crowdworkers shows that users successfully write poems with CoPoet on diverse topics ranging from "Monarchy" to "Climate change". Further, the collaboratively written poems are preferred by third-party evaluators over those written without the system.
…Larger Models Compose Instructions Better: On compositional instructions, we find that T5-11B-poem has the best average performance. In addition, there is a clear performance gap between the 11B and 3B models, showing the importance of model scale for composition, similar to recent observations of emergent abilities in LLMs (et al 2022). We also find that few-shot InstructGPT outperforms T5-3B-poem and T0-3B-poem despite having no compositional instructions in the prompt. This indicates that smaller models, when finetuned on instructions, tend to overfit to templates seen during training, which hurts their generalization capability, as also reported in et al 2021.
…T5-11B-poem accurately answers 77.6% of compositional instructions while InstructGPT only manages 55.2%. Annotators also reported that verses from T5-11B-poem were marginally more creative/interesting than InstructGPT on the KIKA and KIUA test sets and less so on the Compositional test set, indicating that the two models may have little difference in creativity.
We observe that InstructGPT is a strong baseline, outperforming T0pp by a large margin on automatic metrics, and satisfying nearly 80% of the instructions in the KIKA and KIUA test sets according to human evaluation.
However, a common error case on compositional instructions is that while the model generations almost always contain the arguments mentioned in the instruction, they do not always satisfy the constraints correctly: when asked for a verse that contains the word "soul" and ends with "yellow", InstructGPT generated the line "My soul is as yellow as the sun on a summer day", which contains those arguments but not at the specified positions.
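The positional failure described above is mechanically checkable. The sketch below (not from the paper; the function name and constraint keywords `contains`/`ends_with` are hypothetical) shows the kind of check that distinguishes merely mentioning an argument from placing it where the instruction requires:

```python
import re

def satisfies(line, contains=None, ends_with=None):
    """Check a generated verse against two simple compositional constraints:
    the verse must contain one word and end with another.

    This is an illustrative sketch, not the paper's evaluation code.
    """
    words = re.findall(r"[a-z']+", line.lower())
    if contains is not None and contains.lower() not in words:
        return False
    if ends_with is not None and (not words or words[-1] != ends_with.lower()):
        return False
    return True

# InstructGPT's output contains both arguments but puts "yellow" mid-line:
print(satisfies("My soul is as yellow as the sun on a summer day",
                contains="soul", ends_with="yellow"))  # False
# A verse that actually ends with the required word passes:
print(satisfies("Deep in my soul the fields turn yellow",
                contains="soul", ends_with="yellow"))  # True
```

A checker like this counts an instruction as satisfied only when every constraint holds, which matches the error pattern reported: the arguments appear in the generation, but the end-of-line constraint is violated.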