“Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets”, Melanie Walsh, Anna Preus, Maria Antoniak, 2024-06-27:

[fails to consider BPE tokenization or tuning] Large language models (LLMs) can now generate and recognize text in a wide range of styles and genres, including highly specialized, creative genres like poetry. But what do LLMs really know about poetry? What can they know about poetry?

We develop a task to evaluate how well LLMs recognize a specific aspect of poetry, poetic form, for more than 20 forms and formal elements in the English language. Poetic form captures many different poetic features, including rhyme scheme, meter, and word or line repetition.

We use this task to reflect on LLMs’ current poetic capabilities [ChatGPT-3, (Chat)GPT-4, Claude-3-Sonnet, LLaMA-3-70b—unclear whether base or tuned], as well as the challenges and pitfalls of creating NLP benchmarks for poetry and for other creative tasks. In particular, we use this task to audit and reflect on the poems included in popular pretraining datasets.

Our findings have implications for NLP researchers interested in model evaluation, digital humanities and cultural analytics scholars, and cultural heritage professionals.

Figure 4: Fixed Forms—Poetry Foundation and Academy of American Poets. These figures show LLM performance (F1 scores) at detecting a poem’s form, scored against the label assigned by the human annotator or institution the poem was collected from, broken down by prompt type: only the text of the poem; only the author and title; only the first line; only the last line. Error bars indicate standard deviation across 20 bootstrapped samples of poems.
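To make the error bars in the caption concrete, here is a minimal sketch of a per-form F1 score with a bootstrapped standard deviation. The excerpt does not describe the authors' exact procedure, so everything here is an assumption for illustration: the function names `f1_score` and `bootstrap_f1` are hypothetical, poems are resampled with replacement at the original sample size, and the sample (not population) standard deviation is used.

```python
import random
import statistics


def f1_score(gold, pred, label):
    """Binary F1 for one poetic form: 'label' vs. everything else."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        # No true positives: F1 is 0 (this also avoids dividing by zero).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def bootstrap_f1(gold, pred, label, n_samples=20, seed=0):
    """Mean and sample standard deviation of F1 over bootstrapped resamples.

    Assumption: each resample draws len(gold) poems with replacement.
    n_samples=20 mirrors the figure caption; the seed is arbitrary.
    """
    rng = random.Random(seed)
    n = len(gold)
    scores = []
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(
            f1_score([gold[i] for i in idx], [pred[i] for i in idx], label)
        )
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    # Toy labels, invented purely for illustration (not data from the paper).
    gold = ["sonnet", "sonnet", "haiku", "villanelle", "sonnet", "haiku"]
    pred = ["sonnet", "haiku", "haiku", "sonnet", "sonnet", "haiku"]
    mean_f1, sd_f1 = bootstrap_f1(gold, pred, "sonnet")
    print(f"sonnet F1: {mean_f1:.2f} +/- {sd_f1:.2f}")
```

With only a handful of poems per form, some resamples may contain no examples of that form at all, which drags the score toward zero; this is one reason the error bars for rare forms can be wide.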

…Poetic forms based on topic prove more difficult for the models, depending on the topic (Table 5 & Table 6). Forms centered on more concrete subjects like ‘death’ (elegy) and ‘art’ (ars poetica, ekphrasis) are more often recognized, while poems about abstract ideas and styles like aubades and odes are less so.

There are fewer forms in our dataset that depend on visual features, but most models except GPT-4 and GPT-4o falter with them, namely with concrete / pattern poetry (i.e., poems that rely on visual and typographical elements for their structure) and prose poetry (i.e., poems that don’t have line breaks and look like prose).