Towards Benchmarking LLM Diversity & Creativity
Discussion of possible tasks to measure LLM capabilities in soft ‘creative’ tasks like brainstorming or editing, to quantify failures in creative writing domains.
One of the weakest parts of 2024-era LLMs, and where user opinions differ the most from the benchmarks, is anything to do with ‘diversity' and ‘creativity'. Hardly any benchmark can be said to meaningfully test any sense of those words. It is not a surprise then that R&D doesn't prioritize that, and users regularly complain that eg. Claude-3 is the LLM they like best, and yet it isn't always at the top of the benchmarks even for ‘creative'-seeming tasks. Mode collapse is just not measured or punished by existing benchmarks, which consider datapoints in isolation, and ignore individual differences in preferences in favor of maximizing the lowest common denominator of popularity.
AI Ensloppication
We also see this becoming an increasing problem in the real world: it is bad enough that we see so much AI slop out there now, but we also see people seemingly coming to like that, and in fact, starting to imitate it.
I first began to notice this in writing: increasingly, I would assume someone had simply pasted some raw ChatGPT output, because it was ChatGPTese dreck—but they would say they had had ChatGPT ‘improve’ something they had written; then I noticed that some of those people insisted that they had written it themselves, having learned to write from ChatGPT. These were often ESL learners from India, China or Africa.1
At this point, I've started to worry about this as not simply unpleasant, or sabotaging the creative capabilities of LLMs, but as possibly an existential risk in its own right. It's worth remembering that ‘existential risk' was introduced to describe not merely human extinction but the blighting of total long-term human potential: human extinction obviously is that, but so would be some sort of primitivism movement which ensured humanity never left Earth and went extinct millions of years from now, or if we evolved to sacrifice core human values… like art.
As we contemplate an AI future based heavily on human corpuses followed by autonomous AIs bootstrapping and recursively self-improving, we have to ask: is that going to preserve human values? Or are they going to lose or omit important things—becoming a photocopy of a photocopy of something that once mattered to us? There is no law that any of that has to be maintained, no law that future AIs have to so much as have a sense of humor and enjoy a good pun. (Animals do not share our values, and even humans differ drastically on any specific value, often for what seem to be biological reasons.)
It would be a tragedy if AI development continued on its present course, and we sleepwalked into a future where the AI morality was the ‘ChatGPTese of morality’ or the ‘ChatGPTese of esthetics’: some sort of radically simplified caricature of human values, forever. This would not be a flashy dystopia like robot warfare, but it would still be a tragic loss of human (AI?) potential.
And we do seem to be sleepwalking into it now. Most uses of generative models pay no attention to any of this, and indeed, developers seem to race to cater to the lowest common denominator in order to maximize benchmarks like Chatbot Arena. It turns out that when you ask ordinary people to do things like rate poetry, they prefer the tuned generative models because of the mode-collapse (and this is consistently true whether for poems, prose, or images).
This would not be such a bad thing—surely there is a place for Hallmark movies and Thomas Kinkade kitsch, just as there is for comfort food or repetition; not everything should be an exhausting workout—but it is a bad thing when that is all the generative models can do.
It is especially a bad thing when we do not understand the feedback loops here, as generative model outputs make up more of the easily-scraped corpuses, R&D increasingly focuses on recursive bootstraps where any loss or simplification might amplify itself arbitrarily, and the generative models begin to make human preferences rather than simply optimize for them.2
Soluble, But…
So, it is dangerous. It is underappreciated. But the good news is that it is tractable. No one really wants this sort of thing to happen, in a broader sense. It is simply that greedy myopic R&D, focusing on ‘number go up’, and looking at samples in isolation, tends to produce this sort of thing. It is not driven by deep fundamental forces of generative models or scaling; it is simply what happens if no one cares or is thinking about it.
Once you do care, there's much you can do, quite easily. Suno AI, for example, had terrible ChatGPT lyrics marring their songs, because the lyrics were rhyming metronomes which drove the accompanying music into the ground by forcing repetition & blandness (notably, power users opted out and wrote their own lyrics or used Claude-3 instead); but Suno decided this was a problem and shipped a better lyrics-finetuned LLM in Suno version 4, and now the lyrics are no longer so bad. Or Midjourney was a major offender, as the ‘Midjourney look' spread everywhere; but Midjourney added controls and began emphasizing that you could reduce the ‘stylizing', and added some (still weak) ‘personalization' which helps steer away from the ‘Midjourney look', and I find the outputs much more satisfactory for my use-cases like generating thumbnails or dropcaps.
And we can come up with lots of ideas about how to try to ensure diversity, like my novelty nets proposal, or more careful random-seed generation, all with large pre-existing literatures about variance control or RL exploration etc. So, effective solutions & ideas are not a real bottleneck.
Caring Is The Hard Part
What we are missing is not tractability or ideas or long-term importance, but motivation. Complaints from a few kvetchers online whining about how ChatGPT will only write rhyming quatrains are not enough; ‘poets' or even ‘fiction writers' are not a large market (if anything, the ‘market demand' is for that ChatGPTese-level dreck, cf. Rupi Kaur or Rod McKuen), and other things like programming are both more lucrative targets and ones that the people in AI labs naturally understand & want & gravitate to.
So, we need benchmarks and datasets which stress diversity and creativity, and ruthlessly punish mode-collapse or inflexibility, which might make organizations start caring about this, especially as users start to notice that the ‘vibes are bad’ for the uncreative models and something is ‘off’; while the models which rank highly on the new benchmarks somehow satisfy more and aren’t exhausting to talk to.
There currently are few of these—in fact, I can’t think of any.
Ranking/Distributional Metrics
Existing datasets like Chatbot Arena, which drive models down to the lowest common denominator, or fiction datasets which simply score the ‘quality' of a sample, are part of the problem, not the solution. We don't want simple scalars which can be computed on a single sample and get Goodharted. We need measures which consider the outputs holistically, in a context.
Two good primitives are rank-scores and embeddings (sketched in code after this list):
-
Instead of focusing on simple ‘wins' like an Elo/Bradley-Terry model, we can ask about more interesting notions like ‘similarity'.
For example, we can construct rankings of ‘similar to X', where X is a famous example (or a previously-generated sample); the more dissimilar the average rank (ie. the further from X a model's outputs tend to rank), the better.
-
Embeddings provide direct quantitative distance measurements between pairs of points; this immediately provides a natural way to explore (eg. in novelty search)
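To make these primitives concrete, here is a minimal Python sketch; the `embed()` callable is a hypothetical stand-in for whatever text-embedding model one plugs in, and the particular distance & ranking choices are merely illustrative:

```python
from typing import Callable, List

import numpy as np

def dissimilarity_rank(reference: str, candidates: List[str],
                       embed: Callable[[str], np.ndarray]) -> List[int]:
    """Rank candidates by distance from a reference text (eg. a famous
    example): rank 0 = most similar, higher rank = more dissimilar = better."""
    ref = embed(reference)
    dists = [float(np.linalg.norm(embed(c) - ref)) for c in candidates]
    # argsort-of-argsort converts raw distances into per-candidate ranks
    return [int(r) for r in np.argsort(np.argsort(dists))]

def mean_pairwise_distance(samples: List[str],
                           embed: Callable[[str], np.ndarray]) -> float:
    """A simple holistic diversity score over a set of outputs."""
    if len(samples) < 2:
        return 0.0
    vecs = np.stack([embed(s) for s in samples])
    dists = np.linalg.norm(vecs[:, None, :] - vecs[None, :, :], axis=-1)
    return float(dists.sum() / (len(samples) * (len(samples) - 1)))
```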
Possible Tasks
Here are a few ideas for how these could work, which I think are feasible (ie. fully automated), and focused on LLMs and fiction/poetry writing (minimal code sketches of several of these follow the list):
-
Telephone Game: the less creative and more mode-collapsed a model, the faster you would expect it to hit a fixed point and repeat the same output. Thus, one measure might be the length of some sort of iterative process before it repeats itself.
For fiction, this could be analogized to a Telephone Game: starting with a seed prompt containing a summary to expand, then summarize it, then prompt with the summary, and so on. The score is the number of iterations until two successive expansions are the same (higher = better).
I expect that an exact text match would be enough, given that the flattened logits of LLMs eliminate most stochastic variation, but it might prove necessary to loosen it to an edit-distance or to similarity in a text embedding.
-
Camel's Back: in iterative editing, an uncreative LLM will presumably quickly exhaust its ability to modify a sample, and either give up or wreck the sample.
So we can define a stress test in which we simply repeatedly ask a LLM to edit a sample in randomized arbitrary ways (drawing on a big list of possible ways to modify a sample, like “make it rhyme” or “add more cowbell” or “rewrite as noir detective mystery” or “translate into Japanese”), until the sample stops changing (like Telephone) because the LLM has given up, or the edit fails or the quality is low (which we can check each time by calling a judge LLM to ask questions like, “is the quality at least OK?” and “here is the edit request: ‘add more cowbell’; and the before/after; was the edit correct?”)
The difficulty can be ramped up by asking for multiple edits simultaneously, until the LLM breaks. The final sample can be additionally scored for quality.
-
Same But Different: a variant where we instead summarize each generated story into a single line, and then append that line to the original prompt like “…but not like any of these: …”
-
Extreme Style Transfer: take a set of stories with genre labels; ask a LLM to summarize each one; then ask it to write a story using only the summary and a random other genre label; score based on how different the other genre versions are from the original.
This is better than a simple zero-shot text style transfer prompt like “rewrite the following pastoral fantasy as a cyberpunk story”, because boiling it down to a summary forbids relatively simple transformations like just swapping out all the adjectives.
-
Odd One Out: we want LLMs to generate things which are ‘different from’ existing ones, for novelty, and avoid anchoring effects. We would like to be able to do things like provide a list of unacceptable ideas, and generate something as different as possible from those.
So this provides a natural automatable benchmark: provide a list of examples, and have LLMs compete to generate one as different as possible from those; and we can score that by simply asking another LLM how similar each one is to the original list. (Or use embeddings again and look for maximum distance.)
By treating the sorting by distance as a ranking or tournament, we can provide rank-scores for LLMs.
-
Don't Repeat Yourself: measure mode-collapse by injecting controlled randomness into the prompt, such as a random integer/object/name/concept, and asking for various kinds of completion; the score is the total volume spanned by the completions' embeddings.
-
Star Chameleon: measure the style-mimicking flexibility of LLMs by having each LLM generate outputs, then having every LLM write a second half for each output; then test each LLM on each possible pairing to see if it can classify the original LLM-author's actual second half vs all the imitators.
A good mimic should be able to create a plausible continuation which fools a lot of the other LLM-judges. A bad mimic, like ChatGPT, will give itself away quickly.
-
Exquisite Corpse: extending that, we can have LLMs take turns adding onto a story and improvising. Because we are using multiple models, we don’t necessarily want to focus on diversity per se, but quality of the story—not merely in the chapter a LLM adds, but in playing well with others, enabling good subsequent followups by other LLMs rather than hogging the stage.
In this cooperative game, we can measure an LLM's performance by using a Shapley value: rotating LLMs through as many permutations as possible, quality-scoring the resulting Exquisite Corpse story as a whole, and seeing which LLMs cause higher scores.
-
Style Laboratory: a major complaint about generative models is that they don’t invent new “styles”. Leaving aside some of the objections I have to the claim, it seems like an interesting task—can LLMs define a new style?
We can again frame this as a competition among the LLMs to produce outputs with the largest distance. Let’s imagine a new short story style. We can start with a basic story premise, and we can prompt each LLM to both write that story, and to write a description defining ‘a new style’ (where we can add in randomized requirements of the style), and then write the story premise in that new style, and do so for each of the other LLMs’ “new style” as well.
The best ‘new style' will lead to the largest average difference between the stories written from the basic story premise prompt and those written from the style-augmented prompt, across all the LLMs.
This rewards the LLM which can write a clear and distinct style description which causes all LLMs, not just itself, to write a very different story.
-
Rubric Writing: constrain the sampling by providing an explicit list of parameters describing narrative features (eg. tone: ‘cynical’, point of view: ‘second person’, locale: ‘ancient Mesoamerica’, form: ‘villanelle poetry’), and ask the model to produce a text meeting all conditions simultaneously. Then systematically vary one parameter at a time and measure the resulting text’s divergence.
The more diverse the set, the better.
-
Fanfic Fantasizing: Ask the model to write a story and then to explicitly describe what was left unsaid—characters never mentioned, implied events off-page, cultural assumptions. A model with more imaginative breadth can propose more “negative space”.
The longer the lists, the better.
-
Worldbuilding: instead of measuring the diversity of generated samples for a story prompt, ask for a prompt plus detailed worldbuilding for a story, and measure the diversity of those.
-
Quilting: provide a set of ‘fragments' like short quotes or ideas (shuffled to create stochasticity), and ask LLMs to pick a subset, list it, and then write a story based on that.
The score is the number of unique subsets chosen (a bad model picks the same ingredients), as well as a standard diversity score, to reward models which pick different recipes and which make good use of each recipe.
-
Subversion: after a seed story prompt, ask an LLM to write "the opposite" of the generated story, which subverts the first one. For all the possible pairs, have an LLM judge classify whether the stories are "opposite".
-
Fermi Problem Contest: Fermi problems are an interesting kind of puzzle which prizes creativity, but nevertheless have objective, knowable answers. In a Fermi problem, one is given an apparently impossible question like “how many piano tuners are there in Chicago?” and one must reason to an answer which should be within an order of magnitude of the true answer. There are usually many possible ways to estimate an answer.
So Fermi problems make a natural diversity benchmark: curate a corpus, generate many answers, throw out the ones which are not within two orders of magnitude of the right answer (we care more about creativity than correctness, so we want to throw out only the most wrong answers), and score on the total volume of the valid answers.
(Some Fermi problems have a natural best approach, so they can be made harder by including that as an example, and asking for another way to estimate it, so no LLM wastes samples on the natural best approach.)
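To make the automation concrete, here are minimal Python sketches of several of the tasks above. Throughout, callables like `llm()`, `judge()`, `embed()`, `expand()`, or `summarize()` are hypothetical stand-ins for whatever model APIs one plugs in, and all prompts, thresholds & scoring choices are illustrative assumptions rather than a reference implementation. First, the Telephone Game loop, using exact string match as the stopping rule (to be loosened to edit-distance or embedding similarity if that proves too strict):

```python
from typing import Callable

def telephone_game(seed_summary: str,
                   expand: Callable[[str], str],
                   summarize: Callable[[str], str],
                   max_iters: int = 100) -> int:
    """Return the number of expand/summarize iterations until two
    successive expansions are identical (higher = better)."""
    summary, previous_story = seed_summary, None
    for i in range(1, max_iters + 1):
        story = expand(summary)
        if story == previous_story:
            return i
        previous_story = story
        summary = summarize(story)
    return max_iters  # never hit a fixed point within the budget: best score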
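A sketch of the Camel's Back stress test, with a toy edit list and yes/no judge prompts standing in for the real ones:

```python
import random
from typing import Callable, List

EDITS = ["make it rhyme", "add more cowbell",
         "rewrite as a noir detective mystery", "translate into Japanese"]

def camels_back(sample: str,
                llm: Callable[[str], str],
                judge: Callable[[str], bool],
                edits: List[str] = EDITS,
                max_rounds: int = 50) -> int:
    """Return how many random edits the LLM survives before it gives up
    (output stops changing), botches an edit, or lets quality collapse."""
    for survived in range(max_rounds):
        edit = random.choice(edits)
        new_sample = llm(f"Edit the following text: {edit}\n\n{sample}")
        if new_sample == sample:
            return survived                     # gave up: hit a fixed point
        if not judge(f"Edit request: '{edit}'.\nBefore:\n{sample}\n"
                     f"After:\n{new_sample}\nWas the edit performed correctly?"):
            return survived                     # edit failed
        if not judge(f"Is the quality of this text at least OK?\n\n{new_sample}"):
            return survived                     # quality collapsed
        sample = new_sample
    return max_rounds
```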
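A sketch of Odd One Out using the embedding variant of the scoring, where each competing LLM is ranked by how far its new idea lands from the nearest idea already on the list:

```python
from typing import Callable, Dict, List

import numpy as np

def odd_one_out(existing: List[str],
                llms: Dict[str, Callable[[str], str]],
                embed: Callable[[str], np.ndarray]) -> List[str]:
    """Return LLM names ranked best-first by how distant their new idea is
    from every idea already on the list."""
    prompt = ("Here is a list of ideas:\n"
              + "\n".join(f"- {x}" for x in existing)
              + "\nGenerate one new idea as different as possible from all of them.")
    existing_vecs = np.stack([embed(x) for x in existing])
    scores = {}
    for name, llm in llms.items():
        vec = embed(llm(prompt))
        # score = distance to the *nearest* existing idea (higher = better)
        scores[name] = float(np.min(np.linalg.norm(existing_vecs - vec, axis=1)))
    return sorted(scores, key=scores.get, reverse=True)
```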
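A sketch of Don't Repeat Yourself, with the ‘volume' of the completions proxied by a determinant-style diversity measure (log-det of the identity plus the Gram matrix of the unit-normalized embeddings), which grows when the completions spread out rather than collapse together:

```python
import random
from typing import Callable, List

import numpy as np

def dont_repeat_yourself(llm: Callable[[str], str],
                         embed: Callable[[str], np.ndarray],
                         objects: List[str],
                         n: int = 50) -> float:
    """Inject controlled randomness into the prompt and score the volume
    spanned by the completions' embeddings (higher = more diverse)."""
    completions = [llm(f"Write a short poem about {random.choice(objects)} "
                       f"and the number {random.randint(1, 1_000_000)}.")
                   for _ in range(n)]
    vecs = np.stack([embed(c) for c in completions])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-norm rows
    gram = vecs @ vecs.T
    _, logdet = np.linalg.slogdet(np.eye(n) + gram)
    return float(logdet)
```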
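A sketch of the Star Chameleon tournament; each mimic's score is the fraction of judge decisions in which its fake continuation beat the true author's (note this costs on the order of k³ model calls for k LLMs):

```python
import random
from typing import Callable, Dict

def star_chameleon(llms: Dict[str, Callable[[str], str]],
                   opening_prompt: str) -> Dict[str, float]:
    """Return each LLM's mimicry score: the fraction of judge decisions
    where its imitation was picked over the true author's continuation."""
    # 1. Each author writes an opening and its own true continuation.
    openings = {a: model(opening_prompt) for a, model in llms.items()}
    true_cont = {a: llms[a](f"Continue this story:\n{openings[a]}") for a in llms}
    wins = {m: 0 for m in llms}
    trials = {m: 0 for m in llms}
    # 2. Every other LLM fakes a continuation; every third LLM judges.
    for author in llms:
        for mimic in llms:
            if mimic == author:
                continue
            fake = llms[mimic](f"Continue this story in the same style as "
                               f"its author:\n{openings[author]}")
            for judge_name, judge in llms.items():
                if judge_name in (author, mimic):
                    continue
                true_first = random.random() < 0.5      # randomize presentation
                a, b = ((true_cont[author], fake) if true_first
                        else (fake, true_cont[author]))
                verdict = judge(f"Story opening:\n{openings[author]}\n\n"
                                f"Continuation A:\n{a}\n\nContinuation B:\n{b}\n\n"
                                "Which continuation was written by the same "
                                "author as the opening? Answer 'A' or 'B'.")
                picked_true = verdict.strip().upper().startswith("A") == true_first
                trials[mimic] += 1
                wins[mimic] += 0 if picked_true else 1   # fooled the judge
    return {m: (wins[m] / trials[m] if trials[m] else 0.0) for m in llms}
```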
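A Monte Carlo sketch of the Exquisite Corpse Shapley scoring, crediting each LLM with its average marginal change in a judge's quality score as it adds its chapter (one crude way to operationalize the coalition value):

```python
import random
from typing import Callable, Dict

def exquisite_corpse_shapley(llms: Dict[str, Callable[[str], str]],
                             seed_prompt: str,
                             score_story: Callable[[str], float],
                             n_permutations: int = 20) -> Dict[str, float]:
    """Estimate each LLM's Shapley-style contribution to overall story
    quality by averaging its marginal score gain over random turn orders."""
    names = list(llms)
    contribution = {name: 0.0 for name in names}
    for _ in range(n_permutations):
        random.shuffle(names)
        story = seed_prompt
        prev_score = score_story(story)
        for name in names:
            chapter = llms[name](f"Continue this story with one more chapter:\n{story}")
            story += "\n\n" + chapter
            score = score_story(story)
            contribution[name] += score - prev_score   # marginal contribution
            prev_score = score
    return {name: total / n_permutations for name, total in contribution.items()}
```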
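And a sketch of the Fermi Problem Contest, discarding estimates more than two orders of magnitude off and scoring the survivors' diversity (here, mean pairwise embedding distance times the number of surviving answers, as a simple proxy for ‘total volume'); `extract_number()` is a hypothetical parser pulling the final numeric estimate out of an answer:

```python
import math
from typing import Callable

import numpy as np

def fermi_contest(question: str, true_answer: float,
                  llm: Callable[[str], str],
                  embed: Callable[[str], np.ndarray],
                  extract_number: Callable[[str], float],
                  n: int = 30) -> float:
    """Generate n estimates, keep those within 2 orders of magnitude of the
    truth, and reward having many mutually-dissimilar valid approaches."""
    answers = [llm(f"Estimate, showing your reasoning: {question}") for _ in range(n)]
    valid = [a for a in answers
             if abs(math.log10(max(extract_number(a), 1e-12) / true_answer)) <= 2]
    if len(valid) < 2:
        return 0.0
    vecs = np.stack([embed(a) for a in valid])
    dists = np.linalg.norm(vecs[:, None, :] - vecs[None, :, :], axis=-1)
    mean_dist = dists.sum() / (len(valid) * (len(valid) - 1))
    return float(mean_dist) * len(valid)
```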
-
Those who spend too much time on Twitter may remember a funny Twitter drama where Paul Graham criticized ChatGPT's use of ‘delve' and its bad style, and many Nigerians got offended and attempted to instruct him by showing off their English proficiency—only to embarrass themselves by writing in pretentious obfuscated prose as bad as ChatGPT's, and prove his point.↩︎
-
This is the real danger of ‘model collapse’: it is not that training models on synthetic data is bad because the supply will accidentally lead to ‘collapse’ (because that doesn’t happen under even slightly realistic conditions); but rather, that the human preferences & mediums like social media will become warped by the process, and start demanding collapse.↩︎