“Creative Writing With Wordcraft, an AI-Powered Writing Assistant: Perspectives from Professional Writers”, Daphne Ippolito, Ann Yuan, Andy Coenen, Sehmon Burnam, 2022-11-09:

[sample stories; my reviews] Recent developments in natural language generation (NLG) using neural language models have brought us closer than ever to the goal of building AI-powered creative writing tools. However, most prior work on human-AI collaboration in the creative writing domain has evaluated new systems with amateur writers, typically in contrived user studies of limited scope.

In this work, we commissioned 13 professional, published writers from a diverse set of creative writing backgrounds to craft stories using Wordcraft, a text editor with built-in AI-powered [LaMDA] writing assistance tools. Using interviews and participant journals, we discuss the potential for NLG to have an impact in the creative writing domain, especially with respect to brainstorming, generation of story details, world-building, and research assistance.

Experienced writers, more so than amateurs, typically have well-developed systems and methodologies for writing, as well as distinctive voices and target audiences. Our work highlights the challenges in building for these writers; NLG technologies struggle to preserve style and authorial voice, and they lack deep understanding of story contents.

In order for AI-powered writing assistants to realize their full potential, it is essential that they take into account the diverse goals and expertise of human writers.

…From written feedback and conversations with participants, we learned about the workflows for which Wordcraft worked well, and where there is still room for improvement. Participants desired a tool that was variously a brainstorming partner, a co-writer, a beta reader, and a research assistant. Some participants cared most about its ability to produce high-level plot and narrative ideas, while others wanted it to produce phrases and passages that were good enough to be pasted directly into a story. Participants emphasized that the user interface of the tool matters as much as the underlying language model backing it.

Participants also spoke extensively about the limitations of the technology: the generations lacked a distinctive voice, the suggestions were uninteresting, and it was difficult to control the tool to accomplish specific writing tasks. The tool’s bland suggestions posed an important dilemma. A system that always errs on the side of avoiding transgression hamstrings itself from ever achieving human-level creativity, which is often grounded in a rejection of tropes and norms.

…One notable exception is the work of Akoury et al 2020, who incorporated a suggestion engine into an online story writing game and analyzed how game users interacted with it. Perhaps closest to our work are those of Mirowski et al 2022 [Dramatron], who hired expert playwrights to cowrite scripts using a language model that suggested characters, scene summaries, and other script components, and of Calderwood et al 2020, who observed 4 professional novelists experimenting with GPT-2. However, the settings of these works were still quite limited; in both, writers had under two hours to interact with the systems. In contrast, we gave writers 8 weeks, in line with industry standards for the delivery of a 1,500-word story. In addition, to our knowledge, we are the first to investigate the use of a chatbot interface for creative writing assistance.

…Perhaps the most notable difference in usage was in terms of how willing participants were to include verbatim text generated by Wordcraft in the body of their story. Several participants took the workshop as a challenge to produce a story that was largely formed around generated text (AP, DH, JM, MT, AW). Others predominantly used Wordcraft to generate ideas, and though they may have incorporated choice phrases outputted by the tool into their stories, the bulk of the story text was written by the authors themselves (KL, WT, NG). Finally, some participants siloed off specific sections of their stories in which generated text could be included without the author needing to cede too much creative control to Wordcraft (Robin Sloan)…Participants found suggestions from Wordcraft to be helpful for worldbuilding and detail generation even when they did not end up incorporating the exact wording of the suggestions into their stories. For example, Eugenia Triantafyllou used the chatbot interface to home in on the appearance of the Worm-Mothers, the god-like entities in their story. Wordcraft suggested details such as the Worm-Mothers swallowing birds whole.

7.1 Difficulty Maintaining a Style and Voice: A primary limitation noted by writers was that Wordcraft was unable to generate text in the style or voice desired by the author. This problem was especially prominent when authors attempted to write a story with multiple voices. For example, both NG and JB attempted stories that jumped between two points of view, but Wordcraft struggled to maintain the different voices. Nearly all the writers noticed that there seemed to be a “default” voice to the language model’s generations, one that was bland and somewhat elementary in its use of language. MT described this as the AI having an implicit target audience: Internet users. Multiple participants compared Wordcraft’s suggestions to those of a novice fan fiction writer. AP felt as though Wordcraft was only capable of producing a draft of a narrative, that is, schematic descriptions of events and plot points. When it came to actually turning these into prose, the tool consistently chose the most “boring” narrative voice possible.

There are a couple of reasons why Wordcraft may have struggled with style and voice. One might be that Wordcraft’s user interface and in-context learning implementations did not unlock this kind of controllability; perhaps the tendency toward elementary language was caused by our in-context learning exemplars being too unsophisticated, and had we iterated on the interface more, we might have gotten style control working better. Another could be limitations of the underlying model: LaMDA and other similar language models are trained to be most confident on the kind of text they see most often—typically internet data. However, professional creative writers are usually writing for a very particular audience, not the generic audience of the internet.
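To make the in-context-learning explanation concrete, here is a minimal sketch of few-shot style control: pairing plain sentences with rewrites in a target authorial voice and asking the model to complete the final, unanswered pair. The `build_style_prompt` helper and the exemplar sentences are hypothetical illustrations, not Wordcraft’s actual implementation.

```python
# Hypothetical sketch of few-shot ("in-context learning") style control.
# Each exemplar pairs a plain sentence with a rewrite in the target voice;
# the assembled prompt ends mid-pattern so a language model would continue
# in that voice. No real model API is called here.

def build_style_prompt(exemplars, instruction, draft):
    """Assemble a few-shot prompt from (plain, styled) exemplar pairs,
    ending with an unanswered pair for the model to complete."""
    parts = []
    for plain, styled in exemplars:
        parts.append(f"Plain: {plain}\nStyled: {styled}\n")
    parts.append(f"{instruction}\nPlain: {draft}\nStyled:")
    return "\n".join(parts)

# If the exemplars themselves are bland, completions tend to be bland too,
# which is one explanation the authors offer for the "default" voice.
exemplars = [
    ("The ship left the harbor.",
     "The ship slid from the harbor like a rumor leaving a small town."),
    ("It started to rain.",
     "The rain arrived sideways, with opinions."),
]
prompt = build_style_prompt(
    exemplars,
    instruction="Rewrite in the same wry, figurative voice:",
    draft="The door opened.",
)
print(prompt)
```

The design point is that the exemplars, not the interface, carry the stylistic signal: swapping in more sophisticated exemplar pairs is the lever the authors suggest they under-exploited.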