Utext: Rich Unicode Documents

An esoteric document proposal: abuse Unicode to create the fanciest possible ‘plain text’ documents.

Utext is a proposed esoteric-document format for typographically-rich documents (‘utexts’) under the constraint that they are pure UTF-8 text files. Utext is a Unicode answer to the typography maximalist question: “what is the most advanced (or at least, interesting) document that can be generated by (ab)using the full range of obscure capabilities provided by contemporary UTF-8? What is ‘plain text’ that is not so plain?”

I outline the inline & block formatting features that Unicode enables (comparable to popular formats like Markdown → HTML), and more advanced features that Utext could target: for better layout and saving text-artist labor, Utext could exploit text modification using large language models (LLMs) and ASCII image generation with neural nets. LLMs could rewrite text to replace words with synonyms or tweak punctuation for better line-justification. ASCII images could be generated from arbitrary image inputs or text prompts.

Finally, I note one should store together both Utext ‘source’ & ‘compiled’ text, which would greatly enhance upgradeability, accessibility, and community-building, by letting readers see & re-compile the source in addition to the final ‘compiled’ version.

This further allows for interesting line-oriented text formats, which allow live WYSIWYG editing & in-place version-control, and which can stream over the network (opening up applications like simple chat rooms).

Utext is a hacker’s answer to the complexity of document formats like PDF or HTML/JS. It’s a playground for typography enthusiasts, a challenge to push the boundaries of what we can do with pure UTF-8 text. With the full range of Unicode’s capabilities, Utext aims to create the most advanced ‘plain text’ documents possible. It’s about making the most of what we have, not about adding more—an exploration of minimalism & the power of constraint. So, let’s get started with why I was thinking about this at all—don’t we have plenty of document formats already (particularly ‘minimalist’ ones)?

Background

Document formats like PDF or HTML/JS can do just about anything if one puts enough work into it; but that makes them the Turing tarpits of document formats, and they are a morass of complexity, historical baggage, and questionable-but-highly-opinionated defaults. One often feels they are just too much.

Many document formats can be compiled lossily to plain text, like HTML or Pandoc Markdown. Sometimes a hacker will deliberately write in old-fashioned 80-column ASCII text man page/‘textfiles’ style for the retro effect; ‘minimalist’ projects like Gemini go about halfway between ASCII text and fullblown HTML/CSS; and others rely on command-line terminal extensions like ANSI art or ‘sixels’, which turn a terminal into a veritable GUI capable of generating bitmaps at high resolution. (As well as generating bugs & security vulnerabilities at high speed…) I find these attempts unsatisfactory both pragmatically and esthetically: the plain text conversions are afterthoughts, and it shows; Gemini winds up doing too much to be simple, while still doing too little to be a feasible simplification of HTML or much of an improvement on Gopher1; and TUI projects (eg. ncurses, the recent wave of TUIs like Textualize) are scarcely distinguishable from GUIs as far as documents are concerned, doing far too much.

Unicode

There is one point on the ASCII ↔︎ JS spectrum that I haven’t seen, and it’s one that, as I use Unicode in more complex ways on Gwern.net and have learned how many obscure features or characters Unicode has, I increasingly think has been neglected: only UTF-8 text rendered by a monospace font. Not ASCII, not a weird subset of SGML, not troff, not raw terminal codes, not bitmaps encoded in ASCII—just UTF-8. This document format does only what pure Unicode text can do—but does everything that pure Unicode can do, which turns out to be a lot. And formats like Gopher or Gemini usually already support or require UTF-8 text encoding2; so, if UTF-8 can implicitly do most of what formats like Gemini do explicitly, doesn’t that make them not so minimalist after all?3 And if they are not minimalist, why do we want them?

What if we take Unicode literally, but not seriously? Your typical plain text output strips all formatting. At the most ambitious, it might have a Unicode superscript or fraction. But we can do so much more! If we put together all the tricks in one place, and use all the Unicode text formatting features, it might be useful to codify them into a specific approach: a Unicode text document, or Utext. (Users of Utext would then be, of course, Utexans.)

Rich Unicode

There are many well-known tricks for fancy formatting using Unicode. Below is an incomplete tour of how a powerful DSL or compiler could implement various ‘fancy’ formatting in Unicode, starting with inline elements:

  • Italics: the sans-serif italic letters of the Mathematical Alphanumeric Symbols block

  • Bold: the same block’s bold letters

  • Italic Bold: ditto (as well as ‘script’ variants)

  • Superscripts/subscripts: Unicode superscripts/subscripts

    • Fractions

    • Footnotes

    • Hyperlink numbering: while ‘hyperlinks’, in the sense of interactive elements, cannot be supported, URLs must still be handled (people are not about to stop writing URLs!), and they can be usefully numbered & listed at the end of a utext.

      Note that several alternative sets of superscript numbers are provided in the Dingbats block, so there need not be any visual confusion between various uses of superscripts. URLs could get the ‘normal’ superscripts, while footnotes (being much less used) get weirder ones like ‘➀’/‘❶’/‘➊’.

  • Underlining: s͟i͟n͟g͟l͟e (COMBINING DOUBLE MACRON BELOW or COMBINING LOW LINE), d̳o̳u̳b̳l̳e̳ (COMBINING DOUBLE LOW LINE)

  • Strikethrough: s̵t̵r̵i̵k̵e̵t̵h̵r̵o̵u̵g̵h̵ (COMBINING SHORT STROKE OVERLAY)

  • Unordered list: Unicode supports countless icons which could be used for unordered lists of almost arbitrary depth

  • Ordered list: Standard approaches like 1. are fine, but Unicode also supports much weirder numbers—in addition to the ‘circle’ ones shown above, there are also PARENTHESIZED DIGIT n points like PARENTHESIZED DIGIT ONE: ‘⑴’, ‘⑵’, ‘⑶’, ‘⑷’ etc.

  • Math: Unicode can encode much simple math as-is, as even unusual fonts like ‘double-struck’ letters are in Unicode. For example, the quadratic formula in Unicode:

                              −𝘣 ± √̅(𝘣² − 4𝘢𝘤)
                          𝘹 = ————————————————                            (1)
                                    2𝘢

    For more advanced math, the user can write in the familiar LaTeX notation and it can be compiled to UnicodeMath (demo), as is already supported by some tools like Pandoc.4

  • Code syntax highlighting: while color is unavailable, italics/underline/bold are already adequate for syntax highlighting, as demonstrated by monochrome syntax highlighting themes like the one used in the influential ALGOL 60 report (eg. Pygments’s algol_nu—which inspired the Gwern.net syntax-highlighting theme)

  • Automatic rewrites with fancier versions: why display a boring (1) to the reader when you could replace it with something cool like ‘➀’? Or add ligatures—yes, it was a bad idea for Unicode to include them, but if they’re there, might as well use them and replace any ‘ff’ with ‘ﬀ’ to look cooler.

  • Stylistic variations: there are many Unicode alphabet variations, like fraktur (or letters in white circles & black circles or black squares, or even upside-down!), which do not map onto standard markup because they would usually just be a different font choice; a Utext could support these transformations as attributes set on a specified span of text.

    Attributes could be more complex, and allow defining additions like a diacritic added to each character, which saves typing them by hand. In addition to the weird stylistic variations, some useful shortcuts would be: all-uppercase, all-lowercase, alternating-case each letter, adding n spaces between each letter… (Even more transformations, on a more semantic level, are possible using AI.) Transformations could be chained: [She said what!?]{transform="uppercase,bold,italics"} → 𝙎𝙃𝙀 𝙎𝘼𝙄𝘿 𝙒𝙃𝘼𝙏‽.

    Utext is not quite homoiconic, but one can transform text in-place (eg. with a keybinding in one’s IDE to ‘execute’ a selected range of text): a snippet like [I am talking calmly!]{transform="uppercase"} can be rewritten in-place to I AM TALKING CALMLY!; this is potentially lossy, so it would be convenient to keep the original around as a backup, and one could then swap back and forth as necessary ([I AM TALKING CALMLY!]{original="[I am talking calmly!]{transform=\"uppercase\"}"}).
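The chained transform attributes described above can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function names (`stylize`, `apply_transforms`) and the transform names `underline` and `strikethrough` are assumptions made for illustration. It handles only ASCII letters, and it uses the sans-serif alphabets of the Mathematical Alphanumeric Symbols block because those ranges are contiguous.

```python
# Code points of "A" and "a" in three sans-serif alphabets of the
# Mathematical Alphanumeric Symbols block, keyed by (bold, italic).
SANS_STARTS = {
    (True, False): (0x1D5D4, 0x1D5EE),  # sans-serif bold
    (False, True): (0x1D608, 0x1D622),  # sans-serif italic
    (True, True): (0x1D63C, 0x1D656),   # sans-serif bold italic
}

# Combining marks appended after each character.
# The transform names are hypothetical, chosen for this sketch.
MARKS = {
    "underline": "\u0332",      # COMBINING LOW LINE
    "strikethrough": "\u0335",  # COMBINING SHORT STROKE OVERLAY
}


def stylize(text, bold=False, italic=False):
    """Map ASCII letters onto a styled alphabet; leave everything else alone."""
    if not (bold or italic):
        return text
    upper, lower = SANS_STARTS[(bold, italic)]
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(upper + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(lower + ord(ch) - ord("a")))
        else:
            out.append(ch)
    return "".join(out)


def apply_transforms(text, spec):
    """Apply a comma-separated transform list such as 'uppercase,bold,italics'."""
    bold = italic = False
    marks = ""
    for name in (part.strip() for part in spec.split(",")):
        if name == "uppercase":
            text = text.upper()
        elif name == "lowercase":
            text = text.lower()
        elif name == "bold":
            bold = True
        elif name == "italics":
            italic = True
        elif name in MARKS:
            marks += MARKS[name]
        else:
            raise ValueError(f"unknown transform: {name!r}")
    # Styles are applied once, at the end, so that bold + italics combine
    # instead of the second mapping seeing already-converted letters.
    text = stylize(text, bold, italic)
    if marks:
        text = "".join(ch if ch.isspace() else ch + marks for ch in text)
    return text
```

Calling `apply_transforms("She said what", "uppercase,bold,italics")` returns 𝙎𝙃𝙀 𝙎𝘼𝙄𝘿 𝙒𝙃𝘼𝙏. Punctuation rewrites such as `!?` to `‽` would be a separate transform, which this sketch leaves out.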

Block elements:

  • Lists: handled by indentation + rarer characters for list marker visual flair (even just for stars, there are countless Unicode characters to choose from)

  • Sidenotes/margin notes: handled by the text layout reflowing the body text around them, as seen in many books—tedious to do by hand, but easy for a text layout algorithm.

  • Semigraphics, for drawing images/art:

  • Header outlines, tables5, diagrams, rectangles/squares/wrappers of any kind: box-drawing characters

             ┏━━━┳━━━┓
             ┃ | ┃ – ┃
             ┣━━━╋━━━┫
             ┃ H ┃ ‡ ┃
             ┗━━━┻━━━┛
    
      “Is this loss”—in Unicode?
      Depends how generous you feel.
      (And what fonts you installed.)
  • Headers: header level priority can be expressed nicely with the foregoing.

    Working our way up from the smallest headers to the largest, a header hierarchy might go: eg. italics, underlined, bold, bold underlined, underscored (add a second line like ======), capitalized underscored, outlined header…

    At the higher levels, we can draw increasingly large squares or rectangles; the text cannot be upscaled, of course, but the header text can be redrawn as ASCII art (eg. 1, 2, 2, 3)—they are fonts-within-fonts, one might say, and providing these sorts of ‘fonts’ was a common feature of ANSI/ASCII art text editors like TheDraw. The transition point would be governed by whatever looks good. (Maybe a 4×4 square is too small, and the lower limit is 5×5?)

    • Dropcaps: if we have ‘fonts’, we can also have dropcaps to decorate our documents. They are simply a box with an upscaled letter and perhaps ‘ornamentation’ inside it around the upscaled letter. These can be packaged, or generated on the fly procedurally.

    • Progress bars: one nice feature of more complex document readers is providing a progress bar. We can’t provide a GUI/TUI which computes this, of course, but we can encode progress bars into headers if we like.

      For example, in any underscored header (which is most of the large headers), we could transition characters, corresponding to document position. At the beginning of the document, the underscore line might be ---------- (0%); at the halfway point, the header’s underscore is =====----- (50%), then ========-- (80%), and finally, ========== (100%), providing a visual progress bar for the reader. (This sort of ‘scroll spy’ approach is increasingly common on web pages & mobile apps.)
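The progress-bar underscore can be computed from a header's position in the document. The sketch below is one possible reading of the idea: `progress_rule` and `underscored_header` are hypothetical names, and positions are assumed to be integer line numbers (any monotone measure of document position would do).

```python
def progress_rule(position, total, width=10):
    """Return an underscore line such as '=====-----' for a header.

    `position` and `total` are integers (eg. the header's line number and
    the document's line count); the filled part is rounded down.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    position = min(max(position, 0), total)  # clamp into [0, total]
    filled = width * position // total       # integer math avoids float error
    return "=" * filled + "-" * (width - filled)


def underscored_header(title, position, total):
    """Render a header whose underscore is as wide as its title."""
    return title + "\n" + progress_rule(position, total, width=len(title))
```

Rounding down means the bar only reads as fully `=` at the very end of the document. A compiler might prefer rounding to nearest, so this is a design choice rather than a requirement.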

Advanced Utext

These are the obvious ones. But we can be more ambitious: the point of having a special compiler is not to simply save some time & effort on things we could easily have done ourselves. What can we do that requires special support?

In 2023, an obvious thing to do is leverage the power of neural nets.6 Large language models understand English text at a high semantic level, enabling arbitrary rewrites of text to change style or wording while preserving the meaning; meanwhile, image models are able to synthesize images which match a text description (or are optimized to satisfy multiple properties simultaneously).

  • Text Modification: one of the major benefits of writing in Utext is that one doesn’t have to decide where to break lines. Hard-wrapping lines makes editing much harder, and breaks things like search or search-and-replace—what if a phrase is hard-wrapped? An LLM can improve over a mechanical typesetting algorithm of simply scanning n characters in and then line-breaking.

    • Text style transfer: an LLM can be prompted to do zero-shot text style transfer in many ways: it can rewrite text to sound like a pirate, or into emoji. So arbitrary prompts or a library of prompts could be provided to let the author write & translate text: [I bought a computer from my local store.]{translate="pirate"} → Me procured a contraption of reckonin' from me nearby tradin' post. or [...]{translate="emoji"} → 👤💳💻🏪.

    • Hyphen-free justification: LLMs can go even further in improving line-breaking: an LLM can rewrite a piece of text to replace words by synonyms or tweak punctuation in order to fully-justify lines. These rewrites would be highly unsafe if done by a mere dictionary approach, but an LLM can score rewrites by whether they are semantically dangerous—does an edit modify a quoted proper name? Or have distinctly different connotations?

  • ASCII Image Generation: images generated by neural nets do not need to be high resolution or photographic—in fact, pixel art was one of the early big successes of CLIP-based image generation (having defied many earlier attempts using GANs, which were simply not knowledgeable enough about the world to generate convincingly abstract pixel art). And ASCII art is a lot like pixel art.

    So whereas past attempts to automatically convert various images to ASCII art representations, like aalib, have been unconvincing, requiring extremely large text images to reduce the quantization error down to recognizable levels and failing at the core task of simplifying & abstracting (eg. Chung & Kwon 2022), it may be possible now to generate ASCII art of arbitrary image inputs (style transfer) & even text prompts (text2image).

    This can be done with the old CLIP tools, or newer ones based on Stable Diffusion (eg. “Ascii Art”, All-In-One-Pixel-Model, Pixel Art XL) etc.

    This approach would also allow generation of visual chrome, like drawing special effects around a given piece of text. At each optimization step, one would convert the current text into a bitmap, hold fixed the pixels corresponding to the user-written text, and then optimize the pixels in just the desired regions to match a prompt, and convert back to text7; so a user could hypothetically write something like [Visit My Awesome Homepage!]{container-type="box" size="11" decoration-prompt="cool roaring flames whooshing off to the right" characters="all"} to create a large banner which has new text-flame borders each time.

    Done right, this would let one freely build shadows and depth with layered outlines, underlines, and box drawings, or use it to create 3D effects on letters or shapes.

    • ASCII image testing: NN image tools like CLIP help solve the practical issue of how to benchmark or test the appearance of Utext documents.

      It is straightforward to test that Utext text compiles to the intended text files, simply by unit tests of source+compiled pairs, but that is font-independent, and in practice, the major concerns with Utext correctness will be whether a document looks right across major monospace fonts—it would be bad to ship features that look good in the developer’s personal choice of font, but then break horribly on some Mac font. (Most concerning is that box-drawing characters apparently are not guaranteed to be aligned even in monospace fonts—completely defeating the point of having them!)

      But an appropriate NN could help here. A CLIP finetuned on an ASCII art dataset, for example, could be used to benchmark & test the esthetics of Utext documents by generating versions in a variety of fonts in several VMs’ terminals, screenshotting the resulting rendered text, and then judging the similarity to each other & to a description. Errors in rendering (such as boxes or tables misaligned) should result in clusters of good instances and then broken outliers; a feature or a change which results in unusually large net distance flags dangerous behavior.
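The "clusters of good instances and then broken outliers" test can be sketched without committing to any particular model. The code below assumes each font's screenshot has already been turned into an embedding vector by some image encoder (for example, a finetuned CLIP). The function name `flag_outliers`, the `factor` threshold, and the toy vectors in the usage note are all hypothetical.

```python
import math
from statistics import median


def flag_outliers(embeddings, factor=2.0):
    """Return the fonts whose rendering sits far from all the others.

    `embeddings` maps a font name to a vector (a sequence of floats) and
    must contain at least two fonts. A font is flagged when its mean
    distance to the other fonts exceeds `factor` times the median of
    those mean distances.
    """
    names = list(embeddings)
    mean_distance = {}
    for a in names:
        others = [math.dist(embeddings[a], embeddings[b]) for b in names if b != a]
        mean_distance[a] = sum(others) / len(others)
    typical = median(mean_distance.values())
    return sorted(n for n in names if mean_distance[n] > factor * typical)
```

With toy vectors such as `{"A": [0, 0], "B": [0.1, 0], "C": [0, 0.1], "Broken": [5, 5]}`, only `"Broken"` is flagged. A real test harness would also compare each rendering to a text description, which this sketch does not attempt.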

Open questions:

  • Microtypography: Most forms of microtypography, like inserting/subtracting spaces to improve justification, would seem to be unavailable because we need a monospace font for any vertical alignment. So Unicode space characters like HAIR SPACE or THIN SPACE would presumably be rendered either as a normal-sized space or not at all.

    One could try to subtly add double-spaces to improve layout, which may look natural if done well. More advanced approaches might be possible: for example, one could use a hyphenation dictionary to see where one could insert hyphens into text to make it justify better while still being valid English; a radical approach would be using LLMs to rewrite text to make it justify perfectly in monospace.

    Regardless, given the complexity of Unicode, there may still be a way to do microtypography by adding fractional spaces or finding some other way to render invisible half/quarter-width-characters. (One of the Unicode control characters?) Perhaps one could stick solely to 1:1 space substitutions, so that if they fall back to a regular space, the line in question remains backwards compatible, reverting to the original non-microtypographic appearance?

  • Animation: is there any clever way to do ‘animation’ as a static text file? Does this require knowing the vertical height of a window before one could do animation by paging down screens?

    If it can be done, then ‘video’ is entirely possible by using incremental generation to stream results, and video generation NNs become relevant.

  • Bidirectional text: what can be done exploiting bidirectional text rendering or vertical text? (cf. Trojan Source)
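The simplest of the microtypography ideas above, subtly adding double-spaces so a monospace line fills its width, can be sketched as follows. The function name `justify` and the `max_gap` limit are assumptions, and `len()` is used as a stand-in for display width, which is only right when every character occupies one cell.

```python
def justify(line, width, max_gap=2):
    """Pad the gaps between words so the line is exactly `width` cells wide.

    The line is returned ragged (single-spaced) if it is too long, has
    fewer than two words, or would need a gap wider than `max_gap` spaces.
    """
    words = line.split()
    gaps = len(words) - 1
    spaces = width - sum(len(word) for word in words)
    if gaps < 1 or spaces < gaps:
        return " ".join(words)
    base, extra = divmod(spaces, gaps)
    if base + (1 if extra else 0) > max_gap:
        return " ".join(words)
    parts = []
    for i, word in enumerate(words[:-1]):
        # The leftmost `extra` gaps each get one additional space.
        parts.append(word + " " * (base + (1 if i < extra else 0)))
    parts.append(words[-1])
    return "".join(parts)
```

Placing the extra spaces in the leftmost gaps is the crudest possible policy. Preferring gaps after punctuation, or asking an LLM for a rewrite when no acceptable padding exists, would slot in where the ragged fallback is returned.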

Utext Markup

For ease of user adoption, the markup should probably be based on Markdown. Markdown already has constructs corresponding to most of the above features, and the compiler can take care of things like hyperlink numbering & appending.
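As an illustration of hyperlink numbering & appending, here is a minimal sketch that rewrites inline Markdown links into superscript-numbered text and lists the URLs at the end. The name `number_links` is hypothetical, the regular expression covers only the simple `[text](url)` form, and the output layout is one guess at what a compiler might emit.

```python
import re

# Translation table from ASCII digits to Unicode superscript digits.
SUPERSCRIPTS = str.maketrans(
    "0123456789",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
)
# Simple inline links only: no nested brackets, titles, or reference links.
LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def number_links(source):
    """Replace [text](url) with text + superscript number; append the URL list."""
    urls = []

    def replace(match):
        text, url = match.group(1), match.group(2)
        if url not in urls:  # a repeated URL reuses its earlier number
            urls.append(url)
        number = urls.index(url) + 1
        return text + str(number).translate(SUPERSCRIPTS)

    body = LINK.sub(replace, source)
    if not urls:
        return body
    listing = [
        f"{str(i).translate(SUPERSCRIPTS)} {url}" for i, url in enumerate(urls, 1)
    ]
    return body.rstrip("\n") + "\n\n" + "\n".join(listing) + "\n"
```

Footnotes could use the same mechanism with a different numeral set (such as the circled Dingbats digits), which keeps the two kinds of reference visually distinct.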

A Utext user probably needs more fine-grained control over 2D layout than a Markdown user usually needs (or wants). This could be encoded as tables, like Pandoc’s pipe tables. Tables can be given arbitrary attributes to control use of advanced layout features, like auto-generating text borders through a neural net.
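To make the table idea concrete, the sketch below parses a simple pipe table and redraws it with heavy box-drawing characters. It is an illustration under stated assumptions: `parse_pipe_table` and `box_table` are hypothetical names, the parser ignores alignment markers and escaped pipes, the table must have at least one row, and `len()` is assumed to equal display width (one cell per character).

```python
def parse_pipe_table(source):
    """Split a simple pipe table into rows of stripped cell strings."""
    rows = []
    for line in source.strip().split("\n"):
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Skip the header separator row, eg. |---|:--:|
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows


def box_table(rows):
    """Draw rows (lists of strings) as a grid of box-drawing characters."""
    ncols = max(len(row) for row in rows)
    rows = [row + [""] * (ncols - len(row)) for row in rows]  # pad ragged rows
    widths = [max(len(row[i]) for row in rows) for i in range(ncols)]

    def rule(left, mid, right):
        return left + mid.join("━" * (w + 2) for w in widths) + right

    def row_line(row):
        cells = (" " + cell.ljust(w) + " " for cell, w in zip(row, widths))
        return "┃" + "┃".join(cells) + "┃"

    lines = [rule("┏", "┳", "┓")]
    for i, row in enumerate(rows):
        if i:
            lines.append(rule("┣", "╋", "┫"))
        lines.append(row_line(row))
    lines.append(rule("┗", "┻", "┛"))
    return "\n".join(lines)
```

`box_table(parse_pipe_table(text))` then goes from the Markdown-style source to the compiled form. Table attributes (border style, neural-net decoration) would be extra parameters to `box_table`.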

Text + Source Storage

Because the output of Utext is the same as the input as far as everything else is concerned (ie. UTF-8 text), one can provide both the ‘source code’ and ‘compiled document’ of a Utext in a single file. While (less than) doubling the space consumption, this would have a few benefits:

  • Distributing source by default helps with community-building & licensing (similar to how many people learned HTML by reading the HTML of cool web pages they visited, and built on copies)

  • Utexts can be re-compiled:

    • Greater quality: particularly with the more advanced features I speculate about above, Utext compilers may improve over time. If Utexts didn’t come with source, improved renderings would be difficult as now the Utext has to be decompiled/reverse-engineered, and it may be hard to know what the original prompts or styles were.

    • Responsive design: any default width will ill-suit someone; perhaps they prefer to provide the compiler a much wider canvas to paint on, or alternately, want the best possible rendition for their particularly narrow smartphone screen-width.

    • Personal preference: providing source lets readers customize the compilation. They can disable or upgrade the advanced features, or change the esthetics: perhaps they don’t like the ‘default retro’ appearance, and will instead target Commodore fonts with their characteristic slanting ‘italic’ style.

  • Accessibility: source-availability means that a screenreader can skip to the source, or the user could program their browser to skip to the source, or the Utext could simply be provided by a third party in a HTML wrapper which marks up the source as what the screenreader should read.

Utext Format

The primary purpose of Utext is to generate the coolest output, so the main challenge is just achieving any of the suggestions above. But as an optional feature, mostly independent of the foregoing, I’ll discuss a special-purpose Utext format.

Fancy Unicode tricks do not, in principle, require a new document format beyond the classic .txt file. One can simply store the source for a Utext in an ordinary Markdown text file with a .md extension, and ‘compile’ to a .txt file.

But it is convenient to define a new format (especially if Utext winds up adding extensions or syntax which make no sense to a Markdown compiler), and since we are using strictly UTF-8, we can do something interesting and combine a fancy Unicode text file with its source text file into a single file format.8

The Utext format (.utxt / text/plain-utext;charset=UTF-8) is a UTF-8 document which has two kinds of lines: compiled and source. A utext could contain a single document, or concatenate many documents. There is no explicit document-level division or header, and they are distinguished on a line-by-line basis.

Each line of a compiled utext is prefixed with an invisible special character (eg. ASCII’s Start of Text control character) which indicates it is a compiled line.9 A compiled line is a human-readable UTF-8 text line. (120 characters would be a good maximum width for all devices; but a Utext could be rendered at multiple widths so as to support other devices, like smartphones, which usually have a comfortable width of ~40 characters.) But a compiled line is not meant to be human-editable, because it uses all available Unicode tricks for rich formatting, like ASCII art. For example, a source line may use italics or bolds, which normally cannot be displayed in pure UTF-8 text without using terminal control codes or relying on markup conventions like HTML/Markdown, but the compiled version abuses Unicode to map each italic letter onto the math symbols—thereby providing ‘italics’ and ‘bolds’ in pure UTF-8 text rendering in any text viewer.

Any line without the prefixed special character is a source line. A source line would be whatever document format the user prefers, such as a Markdown dialect with custom extensions for Utext-specific features.

Since both compiled & source lines are UTF-8 text by definition, they can be stored in the same file. Utexts can be compiled ‘in-place’ and a text editor could easily provide a ‘WYSIWYG’ view of a utext being edited, and this interoperates with version-control systems.10 A standard utext provides the full compiled version first, followed by the source. This is most convenient for the reader, who can read a document in its pretty compiled form, and then skip the source. (It also helps solve accessibility issues: if a 40-column smartphone rendition is unacceptable, then one can provide the source with some HTML like a <pre style="white-space: pre-wrap"></pre> wrapper; ASCII/ANSI art is inherently screenreader hostile, but a screenreader could simply skip compiled lines for the human-readable source text, and so a Utext would be far more accessible than a hand-written equivalent, which lacks any formal structure or documentation of intent.)

But they can also be arbitrarily ordered: one could provide a Literate Utext, which is a range of compiled lines, followed by the corresponding source lines. (The standard utext is easily recreated from the literate utext by simply sorting on the first character of each line, and ‘bubbling’ up lines with the prefix and bubbling down lines without.) In this approach, as long as compilation is kept to one-pass compilation, Literate Utext can be generated incrementally and streamed.11
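The relationship between the two orderings can be sketched directly. In the code below, `STX` follows the Start of Text suggestion above, while `literate_entry`, `to_standard`, and the `compile_text` callback (standing in for a Utext compiler) are hypothetical names. The sketch assumes no source line begins with the prefix character.

```python
STX = "\x02"  # ASCII Start of Text, the suggested compiled-line prefix


def literate_entry(source, compile_text):
    """Return one literate chunk: prefixed compiled lines, then source lines.

    `compile_text` is any function from source text to compiled text.
    """
    compiled = compile_text(source)
    lines = [STX + line for line in compiled.split("\n")]
    lines.extend(source.split("\n"))
    return lines


def to_standard(lines):
    """Recreate the standard ordering from a literate utext.

    Python's sort is stable, so compiled lines bubble up and source lines
    bubble down while each group keeps its original order.
    """
    return sorted(lines, key=lambda line: not line.startswith(STX))
```

Because each chunk is produced from its own source text alone, chunks can be emitted as soon as they are compiled, which is what makes the literate ordering streamable.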

And because compiled vs source lines can be matched by a per-line regexp, a literate utext can be viewed easily as an incremental stream: simply grep for the compiled lines. This means one can do command line stunts like create an IRC client/server with nice formatting simply by telnet: a client opens a telnet connection and pipes in the source text they type, sending it to the server; they open a second telnet connection, which downloads an (incrementally streaming) literate utext, filtering out the source lines with grep; and the server simply pipes each client line of text through the Utext compiler and appends to the file that each client is indefinitely downloading. (This can be done to some degree with regular IRC, but not with formatting.)
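Both halves of the chat-room stunt reduce to a few lines each. This is a sketch under assumptions: `compiled_only` plays the role of the client-side grep, `append_message` plays the server side, and `compile_text` is a placeholder for a real Utext compiler. The networking itself is left out.

```python
STX = "\x02"  # ASCII Start of Text, marking compiled lines


def compiled_only(lines):
    """Client side: lazily keep compiled lines and drop their prefix.

    `lines` can be any iterable of text lines, including an open file or
    socket file that is still being written to.
    """
    for line in lines:
        if line.startswith(STX):
            yield line[len(STX):]


def append_message(log, source_line, compile_text):
    """Server side: append one message to the shared literate utext.

    The compiled lines are written first (prefixed), then the source line,
    and the log is flushed so that readers see the message promptly.
    """
    for line in compile_text(source_line).split("\n"):
        log.write(STX + line + "\n")
    log.write(source_line + "\n")
    log.flush()
```

Since `compiled_only` is a generator, it never needs the whole file. The same filter, inverted, would give a screenreader or a diff tool the source lines only.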


  1. Gemini mandates TLS but, in the name of simplicity, has no images? And then the constraints produce no interesting designs, just users chafing at the limits. Most Gemini pages I’ve seen make me wish the author had written some Markdown instead and saved us all the trouble—yes, the Gemini community may be the valuable part, but that’s an absurd way to defend a protocol & document format!↩︎

  2. Because in a globalized Internet, being ASCII-only is increasingly unacceptable, and so most viable document formats must be strictly more complex than UTF-8-only text files.↩︎

  3. Gemini, for example, supports list markers with the explanation that “The only reason they are defined is so that more advanced clients can replace the * with a nicer looking bullet symbol and so when list items which are too long to fit on the device screen get broken across multiple lines, the lines after the first one can be offset from the margin by the same amount of space as the bullet symbol. It’s just a typographic nicety.” The list markers could, of course, just have been written with a Unicode bullet point or symbol to start with. Similarly, Gemini headings & blockquotes will display as-is unless a Gemini client wishes to specially render them, and presumably most of these special renderings would be doable in Unicode (eg. bold headers). Fortunately for a Utext enthusiast, Gemini can display all Utext by default, because it supports “preformatted text”, which displays the enclosed text literally (similar to HTML <pre> tags), so any Gemini client is ipso facto a Utext client and can dispense with special formatting of lists, headers, & blockquotes… (Clickable links and reflowed text would not be available by default, but the links could be interwoven with multiple preformatted-text sections, and the client could do something like request a URL with a specific width as the argument and the server respond with a Utext formatted for that width.)↩︎

  4. You can also convert to Unicode using UnicodeIt, Emacs, or my GPT-4-based latex2unicode.py.↩︎

  5. ASCII tables are notorious for being write-only, and needing tools like table formatters or Emacs’s Table Mode to edit.↩︎

  6. These optional features can be cached if performance is a concern.↩︎

  7. The available characters can be constrained to specific ranges, so one could constrain generation to use only Braille Patterns, or only shaded blocks etc. This would help generalize the tool, and avoid the existing menagerie of lots of CLI tools focused on different character sets.↩︎

  8. Which would make no sense for most Markdown approaches, aside from ‘self-contained’ implementations like Markdeep, which bundle a Javascript Markdown compiler/interpreter with the raw Markdown for easier distribution as a single web page.↩︎

  9. This is slightly inspired by Brad Templeton’s ProleText’s use of inline whitespace suffixed to each line to mark up the plain text prefix. Almost arbitrarily many things could be encoded into the invisible whitespace, and one could do tricks like make a text file sort into an arbitrary ordering of lines by appropriate whitespace prefixes.↩︎

  10. Naively recording changes to both the compiled & source lines would cause a lot of problems with the comprehensibility of patches/diffs, and interfere with more advanced functionality like bisecting. Fortunately, for exactly this purpose, most version-control systems will support various ways of ‘ignoring’ or ‘removing’ text which matches a regexp, or calling out to a custom ‘diff’ program (which might just delete some lines before calling the standard diff utility). Git, for example, could be made to ignore compiled lines by using either “filters” or customized diff functions.↩︎

  11. With minor adjustments of any document-level features. For example, the URLs in hyperlinks can be printed immediately after their block, rather than appended to the end of the document.↩︎