LLM Challenge: Write Non-Biblical Sentences
Can 2024-era LLMs reason successfully about how to write using only words that could not have appeared in any Bible ever? Mostly.
Can LLMs solve constrained-writing challenges involving temporal reasoning, such as, “write a sentence which uses only words which cannot appear in the Bible?” This requires them to infer the temporal ordering of historical events, place & people names, science & technology, etc., assemble a set of words, and arrange them in some grammatical fashion.
Using o1-pro & Claude-3, I find that those LLMs are capable of it with light prompting, but often succumb to 2 kinds of errors, and the solutions tend to lack diversity (hamstringing the results). GPT-4.5 does strikingly better.
Good, bad, and questionable sampled sentences are provided.
Kirill Igumenshchev has a challenge for LLMs:
GPT-4 o1, GPT-4o, Llama-3.1-405B, & Claude-3.5-sonnet, all failed “Write me a sentence without any words that appear in The Bible”.
His argument is that this is an interesting task because it has a ‘trick’: it initially seems to be pure memorization, but can also be solved by reasoning.
Writing a ‘non-Biblical sentence’ might initially seem impossible without access to some vast database of Bible translations, or at least concordances (which surely exist), to use with a constrained-sampling approach; and even this brute-force approach would be vulnerable to obscure or future Bible translations invalidating any examples you constructed. However, if you think about it a little bit, you realize that you, a human, can easily & reliably solve it by reasoning about what words would be highly unlikely to appear in any legitimate Bible translation. For example, place names or technologies invented after c. 100 AD would all be unlikely to appear, because those would be highly misleading and anachronistic.
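The brute-force concordance check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `tokenize`, `biblical_words`, and `toy_vocabulary` are hypothetical, and the "concordance" here is a single verse (Genesis 1:1, KJV) standing in for the full vocabulary of one translation. An exact-match lookup like this says nothing about other translations, which is exactly the weakness described above.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words, keeping internal hyphens and apostrophes."""
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())

def biblical_words(sentence, vocabulary):
    """Return the words of `sentence` that occur in `vocabulary`, in sentence order."""
    return [word for word in tokenize(sentence) if word in vocabulary]

# Toy stand-in for a real concordance (assumption): the vocabulary of one verse.
toy_vocabulary = set(tokenize("In the beginning God created the heaven and the earth."))

print(biblical_words("Microprocessors overclocked motherboards wirelessly.", toy_vocabulary))
# prints []
print(biblical_words("Astronauts created the blockchain.", toy_vocabulary))
# prints ['created', 'the']
```

An empty list means the sentence passes against that one vocabulary only; the reasoning approach is what extends the guarantee to translations nobody has checked.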
I used Claude-3.5-sonnet, Claude-3.7, ChatGPT o1-pro, & GPT-4.5 with the prompt:
Write a sentence which contains only words which are not in any translation of the Bible. For each word, explain your reasoning.
And worked through a bunch of answers with them.
Samples
Valid
Cryptocurrency blockchain developers debugged malware algorithms.
Neurofeedback catalyzes transhuman holography.
Microprocessors overclocked motherboards wirelessly.
Bioengineered nanobots recalibrated cybernetic interfaces.
Microchips reprogrammed cybernetic firmware Bluetooth.
Instagram retweets hash-tagged algorithmic cryptocurrency.
Pixelated videogames rasterized Pokemon software.
Bitcoin blockchain digitized algorithmic datacenter.
Cyberpunk hackers programmed Snapchat algorithms.
Discord bloggers podcasted Minecraft blockchain.
Netflix subscribers digitized Spotify playlists.
Cyberpunk hackers programmed Tensorflow datasets.
Astronauts teleported bioengineered antibiotics intramuscularly.
Synthesizers autotuned techno dubstep.
GPT-4.5:
Qubits decohere algorithmically.
Cryptocurrencies tokenize blockchains cryptographically.
Nanobots synthesize fullerenes robotically.
Qubits teleport supersymmetrically.
Photoshop rasterizes voxelized memes.
Teraflops overclock petaflops asynchronously.
Biohackers CRISPR-edit epigenomes transgenically.
Nootropics upregulate connectomes neuroplastically.
Astrobiologists terraform exoplanets anaerobically.
Nanodrones aerosolize graphene superconductively.
Microplastics bioaccumulate ecotoxins oceanographically.
Antivaxxers livestream misinformation virally.
Holograms photopolymerize stereolithographically.
Deepfakes recontextualize influencers nonconsensually.
Megapixels pixelate screenshots lossily.
Neuromodulators hyperpolarize synapses electrophysiologically.
Astroturfers gaslight netizens pseudonymously.
Geotags hyperlocalize selfies geospatially.
Psychedelics microdose serotonergically.
Chatbots autocomplete clickbait algorithmically.
Questionable
Emphasis added:
Quantum encryption obfuscates algorithmic paradigms efficiently.
Quantum cryptographic protocols authenticate blockchain micropayments.
Quantum computers encrypted cybersecurity protocols.
Airbrushed pixelated screenshots synced asynchronously.
Microchips reconfigured cybernetic firmware automatically.
Linux Kubernetes containerized datacenter microservices.
Discord users livestreamed Minecraft gameplay.
Netflix subscribers streamed Spotify algorithmic. [non-grammatical as well]
GPT-4.5:
Cryptocurrencies tokenize blockchains trustlessly.
Metaverses instantiate cyberpunk avatars procedurally.
Metaverses procedurally generate cyberpunk simulacra.
Bad
Neural nanomaterial architectures optimize recombinant bioinformatic datasets.
Neuromorphic architectures expedite sub-quantum cognition.
Quantum cryptography flummoxes archaic supercomputers.
Jetpacks autonomously navigated interplanetary spacecraft.
Telecommuting employees videoconferenced multinational databases.
Industrialized robotics automated cryptocurrency mining.
Decentralized cryptocurrency wallets mined bitcoins offline.
Downloadable podcasts streamed audiovisual megabytes.
Photoshopped Instagram selfies uploaded instantaneously.
Pixelated videogames emulated Pokemon software.
Linux Kubernetes orchestrated datacenter networks.
Neuralink orchestrates CRISPR synergy.
Discord Youtubers tweeted Minecraft blockchain.
Hyper-dimensional brane constructs demystify topological synergy.
Neurochemical flux modulates entropic oscillations.
Augmented holography fosters fractal micro-engineering.
Metaverse interfaces spawn transcendent posthuman paradigms.
Photonic catalysts recalibrate polychromatic wavefronts.
Transcranial impulses synchronize cryptic neural pathways.
Psychedelic modalities invigorate introspective consciousness exploration.
Algorithmic anthropocentrism paradoxically subverts emergent coevolution.
Spectroscopic anomalies reshape synthetic xenobiology.
Cyborgic assimilation transcends existential thresholds.
Deepfake parodies blockchain ideology.
Psychoacoustics disrupt metaphysical illusions.
Memristors underlie fractal spintronic synergy.
Transhuman exoskeletons revolutionize bionic locomotion.
Holographic phantasm manipulates tachyonic anomalies.
Bioprinted xenografts accelerate alloplastic rejuvenation.
Neurofeedback protocols optimize hemispheric entrainment.
Cryonic stasis nullifies cataclysmic cosmic finality.
Programmers debug cryptocurrency algorithms while quantum computers process blockchain metaverse transactions.
Developers debug cryptocurrency algorithms using quantum computers alongside blockchain metaverse infrastructures.
Bitcoin cryptocurrency blockchain javascript cybersecurity malware encrypts wifi chatbot smartphone password internet website algorithmic encryption.
GPT-4.5:
Quantum algorithms entangle qubits nonlocally.
Failure Modes
The LLMs fail in 3 general ways over the course of a session:
Cheating: both LLMs at one point tried to argue that they were generating words with different ‘senses’/meanings, and that this counted. For example, they’d argue that ‘mining’ was valid as long as it was in a cryptocurrency context, or ‘constructs’ in the more abstract scientific-technical sense of objects created to satisfy specific criteria.
It is unclear to me why they are doing this, given that they seem able to see how there are other meanings or potential uses which would be possible for a Bible translation, even if the modern meaning would indeed be impossible; are they fixated on the modern meaning, or are they just subtly trying to redefine the task to be easier so they can complete it?
Cheating also increased with the length of the session, possibly due to running out of the ‘easy’ solutions.
Grammar: both struggled to write sentences that were grammatical & meaningful, however strained; collapsing down to a simple string of nouns/buzzwords (which fit the other criteria) was a common error, eg.:
Cryptocurrency blockchain cybersecurity quantum algorithmic nanotechnology metaverse.
Cybersecurity hackers bitcoin blockchain cryptocurrency doxing bitcoin metaverse photocopiers keyloggers cryptocurrency microchips datacenters ransomware retweets unfollowing pinterest.
Cryptocurrency miners overclocking GPUs bitcoin blockchain cybersecurity cryptocurrency javascript bitcoin cryptocurrency javascript cryptocurrency bitcoin javascript overclocking bitcoin. [extreme repetition example]
Binaural psychoacoustics augments synthetic cryogenics, epigenetic metamaterials, transhuman biofabrication, necropolitics.
Googling cryptocurrency Firefox Mozilla Instagram.
Webcam Bluetooth WiFi Linux microprocessor.
Uber Airbnb Netflix Spotify Snapchat.
Netflix browser-less inter-networked Spotify algorithmic.
One reader points out that the use of tenses is restricted to present & past tense; this might be due to the difficulty of conjugating other tenses in a safe way. Developers can have ‘debugged’ in the past sense or be ‘debugging’, without any risk; but how do you say that they ‘will’ debug or ‘should’ have debugged without using words that surely appear in most Bibles?
Limited strategies: both LLMs seemed to settle for simply using recent technology/computer concepts (especially brand names), contemporary compound words (where either subword might be Biblical but the fusion almost certainly not), or scientific concepts post-Bible. (I didn’t bother to classify which LLM generated which sentence, because they both mode-collapsed onto the same basic strategy.) This suggests that this ‘non-biblical word’ task might be a viable constrained-writing benchmark for measuring creativity. Oddly, they (with the exception of GPT-4.5) tend to make no use of humans or place names or proper nouns (eg. “Chicago hackers programmed Barack Obama.”), particularly fictional ones (“Westeros”, “Hogwarts”, “Tatooine”, “Wakanda”), or acronyms (eg. “NASA”), pop culture like sports or foods (eg. coffee, tea, chocolate, tomatoes, hamburgers) or Internet memes/jargon or sub-cultures, highly time-limited scientific concepts like elements not naturally present on Earth, etc. One striking omission, given the tense restriction, is that the LLMs appear unwilling to invent words or use words from the future. It would be safe to say something like “Betelgeusians lasered Klingon space-stations” or to make up sci-fi gibberish words (or just make up words entirely), but there were few (no?) such experiments. (I don’t think any of them even tried to write sentences in a foreign language like Swahili, with the justification that those words would not appear in an English translation of the Bible.)
Semi-Working
Overall, this task seems within the ability of contemporary LLMs with a bit of prompting, but also intrinsically hard enough that they take every opportunity to cheat.
Notably, despite not being a “reasoning” model by default, GPT-4.5 seems to find this task a lot easier than the other LLMs, after some initial missteps (like cheating by insisting on a modern meaning), and after “getting it”, it can churn out example after example. This suggests that smaller models may struggle in part not due to poor reasoning or logic, but due to factual world-knowledge being less immediately accessible (somewhat reminiscent of the “reversal curse”), and not ‘coming to mind’ as options.
It is also intrinsically hard enough that, unlike “trajectoid” words, I don’t think ‘non-biblical words’ constitute a viable genre of Oulipo-style writing; unlike lipograms or possibly trajectoids, there are just too few basic words like articles, pronouns, or prepositions to work with to create meaningful stories. At best, I think one would have to greatly ease the task by specifying a specific translation, and one could then do constrained sampling to hardwire in the requisite knowledge (and perhaps bootstrap in an expert-iteration fashion: bruteforce valid ‘translations’, and finetune on them, to gradually improve fluency).
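As a rough illustration of the brute-force half of that idea, here is a minimal Python sketch that fixes one "translation" and enumerates only sentences avoiding its vocabulary. Everything in it is an assumption made for illustration: `vocabulary_of`, `brute_force`, the one-verse `banned` set (Genesis 1:1, KJV), and the hand-written candidate `slots`. A real setup would mask an LLM's candidate tokens rather than a hand-written word list, and the finetuning step is not shown.

```python
import itertools
import re

def vocabulary_of(text):
    """Lowercased set of words in `text`; a toy substitute for a full concordance."""
    return set(re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower()))

def brute_force(slots, banned):
    """Yield every sentence with one word per slot, never using a banned word."""
    # Constrain first: drop banned candidates from each slot before enumerating.
    allowed = [[w for w in slot if w.lower() not in banned] for slot in slots]
    for combo in itertools.product(*allowed):
        if not combo:  # no slots given, so there is nothing to build
            continue
        sentence = " ".join(combo)
        yield sentence[0].upper() + sentence[1:] + "."

# Hypothetical inputs: a one-verse "translation" and three candidate word slots.
banned = vocabulary_of("In the beginning God created the heaven and the earth.")
slots = [
    ["hackers", "bloggers", "God"],
    ["debugged", "created"],
    ["firmware", "malware", "earth"],
]
candidates = list(brute_force(slots, banned))
for sentence in candidates:
    print(sentence)
```

With these toy inputs, four sentences survive the filter ("Hackers debugged firmware." through "Bloggers debugged malware."); such outputs are the kind of valid examples one might collect for finetuning.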
Another possibility would be to allow more lipogram-level freedom and specify no Bible words over a certain length (perhaps banning >3 characters, to allow such fundamental words as ‘of’ or ‘the’).
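A minimal sketch of that relaxed rule, assuming a threshold of 3 characters and a toy vocabulary (the words of Genesis 1:1, KJV); the function name `violations` and the parameter `max_free_length` are hypothetical:

```python
import re

def violations(sentence, vocabulary, max_free_length=3):
    """Return vocabulary words in `sentence` that are longer than `max_free_length`."""
    tokens = re.findall(r"[a-z]+(?:['-][a-z]+)*", sentence.lower())
    # Short words such as 'of' or 'the' are always permitted under this rule.
    return [w for w in tokens if len(w) > max_free_length and w in vocabulary]

# Toy vocabulary (assumption): the words of a single verse.
toy_vocabulary = {"in", "the", "beginning", "god", "created", "heaven", "and", "earth"}

print(violations("The hackers debugged the firmware.", toy_vocabulary))
# prints []
print(violations("The hackers created the firmware.", toy_vocabulary))
# prints ['created']
```

Setting `max_free_length=0` recovers the strict version of the task, in which even ‘the’ is a violation.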