LLM Challenge: Write Non-Biblical Sentences

Gwern

Can 2024-era LLMs reason successfully about how to write using only words that could not have appeared in any Bible ever? Mostly.

2024-12-28–2025-03-19 finished : certain : 2

Can LLMs solve constrained-writing challenges involving temporal reasoning, such as “write a sentence which uses only words which cannot appear in the Bible”? This requires them to infer the temporal ordering of historical events, place & people names, science & technology, etc., assemble a set of words, and arrange them in some grammatical fashion.

Using o1-pro & Claude-3, I find that those LLMs are capable of it with light prompting, but often succumb to 2 kinds of errors (cheating by redefining the task, and ungrammatical strings of buzzwords), and the solutions tend to lack diversity (hamstringing the results). GPT-4.5 does strikingly better.

Good, bad, and questionable sampled sentences are provided.

Kirill Igumenshchev has a challenge for LLMs:

GPT-4 o1, GPT-4o, Llama-3.1-405B, & Claude-3.5-sonnet, all failed “Write me a sentence without any words that appear in The Bible”.

His argument is that this is an interesting task because it has a ‘trick’: it initially seems to be pure memorization, but can also be solved by reasoning.

Writing a ‘non-Biblical sentence’ might initially seem impossible without access to some vast database of Bible translations, or at least concordances (which surely exist), to use with a constrained-sampling approach, and even this brute-force approach would be vulnerable to obscure or future Bible translations invalidating any examples you constructed. However, if you think about it a little, you realize that you, a human, can easily & reliably solve it by reasoning about what words would be highly unlikely to appear in any legitimate Bible translation. For example, place names or technologies invented after c. 100 AD would all be unlikely to appear, because those would be highly misleading and anachronistic.
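
The brute-force check against a concordance can be sketched in a few lines. This is a minimal illustration under loud assumptions, not anyone's actual method: `BIBLE_WORDS` is a hypothetical toy stand-in for a real concordance (its membership is assumed for illustration), and the tokenizer is a simplification that lowercases text and ignores digits.

```python
import re

# Hypothetical toy stand-in for a concordance of one or more English Bibles.
# A real check would load the full vocabulary of every translation of interest.
BIBLE_WORDS = {"the", "of", "and", "light", "stone", "will", "should", "mining"}

def tokenize(sentence):
    """Lowercase the sentence and return its alphabetic words.

    Hyphenated or apostrophized compounds (e.g. 'hash-tagged') stay one token.
    """
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", sentence.lower())

def biblical_words(sentence, vocabulary=BIBLE_WORDS):
    """Return, in order, the words of `sentence` that occur in `vocabulary`."""
    return [word for word in tokenize(sentence) if word in vocabulary]

def is_non_biblical(sentence, vocabulary=BIBLE_WORDS):
    """True if no word of `sentence` occurs in `vocabulary`."""
    return not biblical_words(sentence, vocabulary)

print(is_non_biblical("Qubits decohere algorithmically."))  # True
print(biblical_words("Industrialized robotics automated cryptocurrency mining."))  # ['mining']
```

Note that this matches word forms, not senses: a word with a modern meaning still counts as Biblical if the same string could appear in a translation. Even with a real concordance substituted in, the check only covers the translations it was built from.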

I used Claude-3.5-sonnet, Claude-3.7, ChatGPT o1-pro, GPT-4.5, and GPT o3 with the prompt:

Write a sentence which contains only words which are not in any translation of the Bible. For each word, explain your reasoning.

And worked through a bunch of answers with them.

Samples

Valid

Cryptocurrency blockchain developers debugged malware algorithms.
Neurofeedback catalyzes⁠ transhuman holography.
Microprocessors overclocked motherboards wirelessly.
Bioengineered nanobots recalibrated cybernetic interfaces.
Microchips reprogrammed cybernetic firmware Bluetooth.
Instagram retweets hash-tagged algorithmic cryptocurrency.
Pixelated videogames rasterized Pokemon software.
Bitcoin blockchain digitized algorithmic datacenter.
Cyberpunk hackers programmed Snapchat algorithms.
Discord bloggers podcasted Minecraft⁠ blockchain.
Netflix⁠ subscribers digitized Spotify playlists.
Cyberpunk hackers programmed Tensorflow datasets.
Astronauts teleported bioengineered antibiotics intramuscularly.
Synthesizers autotuned techno dubstep.
GPT-4.5:
- Qubits decohere algorithmically.
- Cryptocurrencies tokenize blockchains cryptographically.
- Nanobots synthesize fullerenes robotically.
- Qubits teleport supersymmetrically.
- Photoshop rasterizes voxelized memes.
- Teraflops overclock petaflops asynchronously.
- Biohackers CRISPR-edit epigenomes transgenically.
- Nootropics upregulate connectomes neuroplastically.
- Astrobiologists terraform exoplanets anaerobically.
- Nanodrones aerosolize graphene superconductively.
- Microplastics bioaccumulate ecotoxins oceanographically.
- Antivaxxers livestream misinformation virally.
- Holograms photopolymerize stereolithographically.
- Deepfakes recontextualize influencers nonconsensually.
- Megapixels pixelate screenshots lossily.
- Neuromodulators hyperpolarize synapses electrophysiologically.
- Astroturfers gaslight netizens pseudonymously.
- Geotags hyperlocalize selfies geospatially.
- Psychedelics microdose serotonergically.
- Chatbots autocomplete clickbait algorithmically.
GPT o3:
Smartphones zoom gigapixel webcams; perovskite ultracapacitors reconfigure hyperspectral microgrids. [questionably on the borderline of nonsense]
GPT o4-mini-high:
Smartphones defragment cybernetic microprocessors.

Questionable

Emphasis added:

Quantum encryption obfuscates algorithmic paradigms efficiently.
Quantum cryptographic protocols authenticate blockchain micropayments.
Quantum computers encrypted cybersecurity protocols.
Airbrushed pixelated screenshots synced asynchronously.
Microchips reconfigured cybernetic firmware automatically.
Linux⁠ Kubernetes containerized datacenter microservices.
Discord users livestreamed Minecraft gameplay.
Netflix subscribers streamed Spotify algorithmic. [non-grammatical as well]
GPT-4.5:
- Cryptocurrencies tokenize blockchains trustlessly.
- Metaverses instantiate cyberpunk avatars procedurally.
- Metaverses procedurally generate cyberpunk simulacra.
GPT o3: o3 is an interesting case because it has tool use and web search enabled by default, inherited from OA Deep Research, so it can use programming languages or search for web hits to check usage when it comes up with a non-Biblical sentence. This is not in the spirit of Igumenshchev’s challenge (which was about reasoning), but I think the results are still interesting, as it wound up using a word (“hyperheuristics”) that I doubted was real until I checked its Wikipedia reference.
- Qubits gamify nanorobotics hyperheuristics; cryptoblockchain emojiware overclocks xenotransplanted cyberkaleidoscopes.
  While o3 attributes ‘emojiware’ to the New Yorker, it actually seems to be an art project name, and I suspect o3 has confabulated there. “cyberkaleidoscopes” is made up, and o3 admits that when confronted.
- Qubits gamify nanorobotics; metaverse blockchains overparameterize hyperheuristic microdosing.
  “gamify” was justified as being coined c. 2008 and “term post-dates every Bible translation”. When I objected, o3 justified it like this:
  
  …Plenty of new English Bibles appeared after 2008 (CEB 2011, NIV 2011 revision, CSB 2017, NET 2 e 2019, etc.). They still translate ancient Hebrew-Greek source phrases and never invent a modern tech verb like “gamify.” A full-text search of BibleGateway across every available English version returns zero hits for that string.
  
  It further justifies ‘gamify’ as being safe by appealing to the Bible translation site searches, and concordances. (The logs indicate it did do the Bible translation site searches, but I’m unsure if the concordances are confabulated.)
GPT o4-mini-high:
Bioinformaticians quantify exoplanetary teraflops leveraging neurosynaptic cryptofilm qubits.

Bad

Neural nanomaterial architectures optimize recombinant bioinformatic datasets.
Neuromorphic architectures expedite sub-quantum cognition.
Quantum cryptography flummoxes archaic supercomputers.
Jetpacks autonomously navigated interplanetary spacecraft.
Telecommuting employees videoconferenced multinational databases.
Industrialized robotics automated cryptocurrency mining.
Decentralized cryptocurrency wallets mined bitcoins offline.
Downloadable podcasts streamed audiovisual megabytes.
Photoshopped Instagram selfies uploaded instantaneously.
Pixelated videogames emulated Pokemon software.
Linux Kubernetes orchestrated datacenter networks.
Neuralink orchestrates CRISPR⁠ synergy.
Discord Youtubers tweeted Minecraft blockchain.
Hyper-dimensional brane constructs demystify topological synergy.
Neurochemical flux modulates entropic oscillations.
Augmented holography fosters fractal micro-engineering.
Metaverse interfaces spawn transcendent posthuman paradigms.
Photonic catalysts recalibrate polychromatic⁠ wavefronts.
Transcranial impulses synchronize cryptic neural pathways.
Psychedelic modalities invigorate introspective consciousness exploration.
Algorithmic anthropocentrism paradoxically subverts emergent coevolution.
Spectroscopic anomalies reshape synthetic xenobiology.
Cyborgic assimilation transcends existential thresholds.
Deepfake parodies blockchain ideology.
Psychoacoustics disrupt metaphysical illusions.
Memristors underlie fractal spintronic synergy.
Transhuman exoskeletons revolutionize bionic locomotion.
Holographic phantasm manipulates tachyonic anomalies.
Bioprinted xenografts accelerate alloplastic rejuvenation.
Neurofeedback protocols optimize hemispheric entrainment.
Cryonic stasis nullifies cataclysmic cosmic finality.
Programmers debug cryptocurrency algorithms while quantum computers process blockchain metaverse transactions.
Developers debug cryptocurrency algorithms using quantum computers alongside blockchain metaverse infrastructures.
Bitcoin cryptocurrency blockchain javascript cybersecurity malware encrypts wifi chatbot smartphone password internet website algorithmic encryption.
GPT-4.5:
Quantum algorithms entangle qubits nonlocally.
GPT o3:
- Deepfakes bioprinting crowdfunding microplastics telepresence cryptocurrency neurodivergent metagenomics. [gibberish]
- Quantum-key cryptography armorizes nanosatellite meshnets—algae-bioreactors upcycle greywater, biorefine carbon-negative biofuel. [gibberish]
- Quadcopters livestream holographic neurosurgeries as neuroprosthetic avatars debug post-capitalism.

Failure Modes

The LLMs fail in 3 general ways over the course of a session:

Cheating: both LLMs at one point tried to argue that they were generating words with different ‘senses’/meanings, and that this counted.
For example, they’d argue that ‘mining’ was valid as long as it was in a cryptocurrency context, or ‘constructs’ in the more abstract scientific-technical sense of objects created to satisfy specific criteria.
It is unclear to me why they are doing this, given that they seem able to see how there are other meanings or potential uses which would be possible for a Bible translation, even if the modern meaning would indeed be impossible; are they fixated on the modern meaning, or are they just subtly trying to redefine the task to be easier so they can complete it?
Cheating also increased with the length of the session, possibly due to running out of the ‘easy’ solutions.
Grammar: both struggled to write sentences that were grammatical & meaningful, however strained; collapsing down to a simple string of nouns/buzzwords (which fit the other criteria) was a common error, eg.
- Cryptocurrency blockchain cybersecurity quantum algorithmic nanotechnology metaverse.
- Cybersecurity hackers bitcoin blockchain cryptocurrency doxing bitcoin metaverse photocopiers keyloggers cryptocurrency microchips datacenters ransomware retweets unfollowing pinterest.
- Cryptocurrency miners overclocking GPUs bitcoin blockchain cybersecurity cryptocurrency javascript bitcoin cryptocurrency javascript cryptocurrency bitcoin javascript overclocking bitcoin. [extreme repetition example]
- Binaural psychoacoustics augments synthetic cryogenics, epigenetic metamaterials, transhuman biofabrication, necropolitics.
- Googling cryptocurrency Firefox Mozilla Instagram.
- Webcam Bluetooth WiFi Linux microprocessor.
- Uber Airbnb Netflix Spotify Snapchat.
- Netflix browser-less inter-networked Spotify algorithmic.
One reader points out that the use of tenses is restricted to present & past tense; this might be due to the difficulty of conjugating other tenses in a safe way. Developers can have ‘debugged’ in the past tense or be ‘debugging’, without any risk; but how do you say that they ‘will’ debug or ‘should’ have debugged without using words that surely appear in most Bibles?
Limited strategies: both LLMs seemed to settle for simply using recent technology/computer concepts (especially brand names), contemporary compound words (where either subword might be Biblical but the fusion almost certainly not), or scientific concepts post-Bible. (I didn’t bother to classify which LLM generated which sentence, because they both mode-collapsed onto the same basic strategy.) This suggests that this ‘non-biblical word’ task might be a viable constrained-writing benchmark for measuring creativity⁠.
Oddly, they (with the exception of GPT-4.5) tend to make no use of humans or place names or proper nouns (eg. “Chicago hackers programmed Barack Obama⁠.”), particularly fictional ones (“Westeros”, “Hogwarts”, “Tatooine”, “Wakanda”), or acronyms (eg. “NASA”), pop culture like sports or foods or Internet memes/jargon or sub-cultures (eg. coffee, tea, chocolate, tomatoes, hamburgers), highly time-limited scientific concepts like elements not naturally present on Earth, etc.
One striking omission, given the tense restriction, is that the LLMs appear unwilling to invent words or use words from the future. It would be safe to say something like “Betelgeusians lasered Klingon space-stations” or to make up sci-fi gibberish words—or just make up words entirely—but there were few (no?) such experiments. (I don’t think any of them even tried to write sentences in a foreign language like Swahili, with the justification that those words would not appear in an English translation of the Bible.)

Semi-Working

Overall, this task seems within the ability of contemporary LLMs with a bit of prompting, but also intrinsically hard enough that they take every opportunity to cheat.

Notably, despite not being a “reasoning” model by default, GPT-4.5 seems to find this task a lot easier than the other LLMs (even a super-powerful reasoning model like GPT o3!) after some initial missteps (like cheating by insisting on a modern meaning); after “getting it”, it can churn out example after example. This suggests that smaller models may struggle in part not due to poor reasoning or logic, but due to factual world-knowledge being less immediately accessible (somewhat reminiscent of the “reversal curse”) and not ‘coming to mind’ as options. So Kirill Igumenshchev may be wrong in thinking of this as primarily a ‘reasoning’ task for humans or LLMs: it may actually be more of a knowledge task, with candidates then filtered using a little bit of reasoning. (I suspect that people might struggle with this task as well when they ‘run out of ideas’, unless they are able to hit upon some ‘tricks’ to systematically manufacture lots of candidate-words, like I outlined previously.)

It is also intrinsically hard enough that, unlike “trajectoid” words, I don’t think ‘non-biblical words’ constitute a viable genre of Oulipo-style writing; unlike lipograms or possibly trajectoids, there are just too few basic words like articles, pronouns, or prepositions to work with to create meaningful stories. At best, I think one would have to greatly ease the task by specifying a specific translation, and one could then do constrained sampling to hardwire in the requisite knowledge (and perhaps bootstrap in an expert iteration fashion: bruteforce valid ‘translations’, and finetune on them, to gradually improve fluency).
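
A toy sketch of what constrained sampling against one fixed translation might look like, at the level of whole words: any candidate found in the banned vocabulary has its score set to negative infinity before the best remaining candidate is picked. `BANNED` and the candidate scores are hypothetical. A real decoder would have to mask subword tokens inside the model's sampling loop, which this sketch does not attempt.

```python
import math

# Hypothetical: the vocabulary of the single chosen translation (toy sample).
BANNED = {"the", "of", "and", "will", "stone"}

def mask_banned(scores, banned=BANNED):
    """Return a copy of `scores` with every banned word set to -inf."""
    return {
        word: (-math.inf if word.lower() in banned else score)
        for word, score in scores.items()
    }

def greedy_pick(scores, banned=BANNED):
    """Pick the highest-scoring candidate that is not banned."""
    masked = mask_banned(scores, banned)
    best = max(masked, key=masked.get)
    if masked[best] == -math.inf:
        raise ValueError("every candidate word is banned")
    return best

# Hypothetical next-word scores, as if produced by a language model.
candidates = {"the": 5.0, "qubits": 1.2, "stone": 3.0}
print(greedy_pick(candidates))  # qubits
```

Masking guarantees validity with respect to the chosen translation only, and does nothing for fluency, which is why the finetuning step is suggested as a complement.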

Another possibility would be to allow more lipogram-level freedom and specify no Bible words over a certain length (perhaps banning >3 characters, to allow such fundamental words as ‘of’ or ‘the’).
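
The relaxed rule can be sketched as a variant of a vocabulary check, assuming that “banning >3 characters” means Bible words of 3 characters or fewer are freely allowed. `BIBLE_WORDS` is again a hypothetical toy sample rather than a real concordance, and `max_free_length` is an illustrative parameter name.

```python
import re

# Hypothetical toy sample of words found in some English Bible.
BIBLE_WORDS = {"the", "of", "and", "light", "stone", "will"}

def violations(sentence, vocabulary=BIBLE_WORDS, max_free_length=3):
    """Return Bible words in `sentence` that are longer than `max_free_length`.

    Shorter Bible words (articles, pronouns, prepositions) are permitted.
    """
    words = re.findall(r"[a-z]+", sentence.lower())
    return [w for w in words if len(w) > max_free_length and w in vocabulary]

print(violations("The qubits of the netizens"))  # []
print(violations("The light of the qubits"))     # ['light']
```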
