Utext: Rich Unicode Documents

Gwern

An esoteric document proposal: abuse Unicode to create the fanciest possible ‘plain text’ documents.

2023-10-08–2024-04-24 in progress : unlikely : 1

Utext is a proposed esoteric-document format for typographically-rich documents (‘utexts’) under the constraint that they are pure UTF-8 text files. Utext is a Unicode⁠ answer to the typography maximalist question: “what is the most advanced (or at least, interesting) document that can be generated by (ab)using the full range of obscure capabilities provided by contemporary UTF-8? What is ‘plain text’ that is not so plain?”

I outline ⁠the inline & block formatting features⁠ that Unicode enables (comparable to popular formats like Markdown⁠ → HTML), and more ⁠advanced features⁠ that Utext could target: for better layout and saving text-artist labor, Utext could exploit text modification using large language models (LLMs) and ASCII image generation with neural nets. LLMs could rewrite text to replace words with synonyms or tweak punctuation for better line-justification. ASCII images could be generated from arbitrary image inputs or text prompts.

I note one should ⁠store together⁠ both Utext ‘source’ & ‘compiled’ text, which would greatly enhance upgradeability, accessibility, and community-building, by letting readers see & re-compile the source in addition to the final ‘compiled’ version. This further allows for interesting ⁠line-oriented text formats⁠, which allow live WYSIWYG editing, in-place version-control, or can stream over the network (opening up applications like simple chat rooms).

But probably the best output format would be as ⁠a narrow subset of HTML⁠, turning it into hypertext and making it usable as a website, through judicious use of <pre> tags.

Utext is a hacker’s answer to the complexity of document formats like PDF or HTML/JS, which trap one in the tyranny of choice.^1 It’s a playground for typography enthusiasts, a challenge to push the boundaries of what we can do with pure UTF-8 text. With the full range of Unicode’s capabilities, Utext aims to create the most advanced ‘plain text’ documents possible. It’s about making the most of what we have, not about adding more—an exploration of minimalism & the power of constraint. So, let’s get started with why I was thinking about this at all—don’t we have plenty of document formats already (particularly ‘minimalist’ ones)?

Background

Document formats like PDF or HTML/JS can do just about anything if one puts enough work into it; but that makes them the Turing tarpits of document formats, and they are a morass of complexity, historical baggage, and questionable-but-highly-opinionated defaults. One often feels they are just too much, and they enable techno-procrastination—endless tweaking of one’s LaTeX or HTML/JS/CSS/web-server stack, at the cost of actually writing or doing anything.

Many document formats can be compiled lossily to plain text, like HTML or Pandoc Markdown. Sometimes a hacker will deliberately write in old-fashioned 80-column ASCII⁠ text man page⁠/‘textfiles’⁠ style for the retro effect; ‘minimalist’ projects like Gemini⁠ go about halfway between ASCII text and fullblown HTML/CSS, or they will rely on command-line terminal⁠ extensions like ANSI art⁠ or ‘sixels’⁠, which turn a terminal into a veritable GUI capable of generating bitmaps at high resolution.^[As well as generating bugs & security vulnerabilities at high speed…] I find these attempts to not be satisfactory either pragmatically or esthetically: the plain text conversions are afterthoughts, and it shows; Gemini winds up doing too much to be simple, while still doing too little to be a feasible simplification of HTML or much of an improvement on Gopher⁠⁠^⁠2⁠, and TUI projects⁠ (eg. ncurses⁠, the recent wave of TUIs like Textualize) are scarcely distinguishable from GUIs as far as documents are concerned, doing far too much.

Unicode

There is one point on the ASCII ↔︎ JS spectrum that I haven’t seen, and it’s one that, as I use Unicode in more complex ways on Gwern.net and have learned how many obscure features or characters Unicode has, I increasingly think has been neglected: only UTF-8 text rendered by a monospace font. Not ASCII, not a weird subset of SGML, not troff, not raw terminal codes, not bitmaps encoded in ASCII—just UTF-8. This document format does only what pure Unicode text can do—but does everything that pure Unicode can do, which turns out to be a lot. And formats like Gopher or ⁠Gemini usually already support or require UTF-8 text encoding⁠^⁠3⁠; so, if UTF-8 can implicitly do most of what formats like Gemini do explicitly, doesn’t that make them not so minimalist after all?⁠^⁠4⁠ And if they are not minimalist, why do we want them?

What if we take Unicode literally, but not seriously? Your typical plain text output strips all formatting. At the most ambitious, it might have a Unicode superscript or fraction. But we can do so much more! If we put together all the tricks in one place, and use all the Unicode text formatting features, it might be useful to codify them into a specific approach: a Unicode text document, or Utext. (Users of Utext would then be, of course, Utexans.)

Rich Unicode

There are many well-known tricks for fancy formatting using Unicode. Below is an incomplete tour of how a powerful DSL or compiler could implement various ‘fancy’ formatting in Unicode, starting with inline elements:

Italics: Mathematical Symbols Range’s sans italics⁠
Bold: Mathematical Symbols Range’s bolds
Italic Bold: ditto (as well as ‘script’ variants)
Superscripts/subscripts: Unicode superscripts/subscripts⁠
- Fractions
- Footnotes
- Hyperlink numbering: while ‘hyperlinks’, in the sense of interactive elements, could not be supported, they must still be supported—people are not about to stop writing URLs!—and can be usefully listed at the end of a utext.
  Note that several alternative sets of superscript numbers are provided in the Dingbats block⁠, so there need not be any visual confusion between various uses of superscripts. URLs could get the ‘normal’ superscripts, while footnotes (being much less used) get weirder ones like ‘➀’/‘❶’/‘➊’.
Underlining⁠: s͟i͟n͟g͟l͟e (COMBINING DOUBLE MACRON BELOW or COMBINING LOW LINE), d̳o̳u̳b̳l̳e̳ (COMBINING DOUBLE LOW LINE)
Strikethrough⁠: s̵t̵r̵i̵k̵e̵t̵h̵r̵o̵u̵g̵h̵ (COMBINING SHORT STROKE OVERLAY)
Unordered list: Unicode supports countless icons which could be used for unordered lists of almost arbitrary depth
Ordered list: Standard approaches like 1. are fine, but Unicode also supports much weirder numbers—in addition to the ‘circle’ ones shown above, there are also PARENTHESIZED DIGIT n points like PARENTHESIZED DIGIT ONE: ‘⑴’, ‘⑵’, ‘⑶’, ‘⑷’ etc.
Math: Unicode can encode much simple math as-is, as even unusual fonts like ‘double-struck’ letters are in Unicode. For example, the quadratic formula in Unicode:
```
                          −𝘣 ± √̅(𝘣² − 4𝘢𝘤)
                      𝘹 = ————————————————                            (1)
                                2𝘢
```
For more advanced math, the user can write in the familiar LaTeX⁠ notation and it can be compiled to ⁠UnicodeMath⁠ (⁠demo), as is already supported by some tools like Pandoc.⁠^⁠5⁠
Code syntax highlighting: while color is unavailable, italics/underline/bold are already adequate for syntax highlighting, as demonstrated by monochrome syntax highlighting themes like the one used in the influential ALGOL 60 report (eg. Pygments’s algol_nu—which inspired the Gwern.net syntax-highlighting theme)
Automatic rewrites with fancier versions: why display a boring (1) to the reader when you could replace it with something cool like ‘➀’? Or add ligatures⁠—yes, it was a bad idea for Unicode to include them⁠, but if they’re there, might as well use them and replace any ‘ff’ with ‘ﬀ’ to look cooler.
Stylistic variations: there are many Unicode alphabet variations, like fraktur⁠ (or letters in white circles & black circles or black squares, or even upside-down!), which do not map onto standard markup because they would usually just be a different font choice; a Utext could support these transformations as attributes set on a specified span of text.
Attributes could be more complex, and allow defining additions like adding diacritics to each character, avoiding typing. In addition to the weird stylistic variations, some useful shortcuts would be: all-uppercase, all-lowercase, alternating-case each letter, adding n spaces between each letter… (Even more transformations, on a more semantic level, are possible ⁠using AI⁠.) Transformations could be chained: [She said what!?]{transform="uppercase,bold,italics"} → 𝙎𝙃𝙀 𝙎𝘼𝙄𝘿 𝙒𝙃𝘼𝙏‽.
Utext is not quite homoiconic⁠, but one can transform text in-place (eg. with a keybinding in one’s IDE to ‘execute’ a selected range of text): a snippet like [I am talking calmly!]{transform="uppercase"} can be rewritten in-place to I AM TALKING CALMLY!; this is potentially lossy, so it would be convenient to keep the original around as a backup, and one could then swap back and forth as necessary ([I AM TALKING CALMLY!]{original="[I am talking calmly!]{transform=\"uppercase\"}"}).
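As a rough illustration of how chained transforms like the ones above might work, here is a minimal Python sketch. The function names (`transform`, `restyle`), the set of accepted attribute names, and the choice of the sans-serif mathematical alphabets (which, unlike the serif italic range, have no unassigned gaps) are all assumptions made for illustration; they are not taken from any existing Utext compiler.

```python
# Hypothetical sketch: map ASCII letters onto Unicode's Mathematical
# Alphanumeric Symbols to fake bold/italics, then add combining marks.

# (capital A, small a) code points of the sans-serif mathematical alphabets.
SANS_BASES = {
    frozenset({"bold"}): (0x1D5D4, 0x1D5EE),
    frozenset({"italics"}): (0x1D608, 0x1D622),
    frozenset({"bold", "italics"}): (0x1D63C, 0x1D656),
}

# Combining characters appended after each non-space character.
OVERLAYS = {
    "underline": "\u0332",      # COMBINING LOW LINE
    "strikethrough": "\u0335",  # COMBINING SHORT STROKE OVERLAY
}

CASES = {"uppercase": str.upper, "lowercase": str.lower}


def restyle(text, styles):
    """Replace A-Z/a-z with styled mathematical letters; leave the rest."""
    if not styles:
        return text
    upper, lower = SANS_BASES[frozenset(styles)]
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(upper + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(lower + ord(ch) - ord("a")))
        else:
            out.append(ch)
    return "".join(out)


def transform(text, spec):
    """Apply a comma-separated chain such as "uppercase,bold,italics"."""
    styles, marks = set(), []
    for name in (n.strip() for n in spec.split(",")):
        if name in CASES:
            text = CASES[name](text)  # case changes run on the plain text
        elif name in ("bold", "italics"):
            styles.add(name)
        elif name in OVERLAYS:
            marks.append(OVERLAYS[name])
        else:
            raise ValueError(f"unknown transform: {name!r}")
    styled = restyle(text, styles)
    if not marks:
        return styled
    suffix = "".join(marks)
    return "".join(ch if ch.isspace() else ch + suffix for ch in styled)
```

Calling `transform("She said what!?", "uppercase,bold,italics")` yields the sans-serif bold italic capitals of the example above, except that this sketch leaves the punctuation `!?` untouched rather than merging it into an interrobang; that kind of rewrite would be a separate transform.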

Block elements:

Lists: handled by indentation + rarer characters for list marker visual flair (even just for stars⁠, there are countless Unicode characters to choose from)
Sidenotes/margin notes: handled by the text layout reflowing the body text around them, as seen in many books—tedious to do by hand, but easy for a text layout algorithm.
Semigraphics⁠, for drawing images/art:
- Fleuron⁠ & Ornamental Dingbats⁠ decoration: these can be used to provide horizontal rulers, bolded quotes, and outline/box-drawing characters like checkerboard patterns (extended substantially in Unicode 13+ in ⁠“Symbols for Legacy Computing”⁠, for examples see the Unscii font homepage)
- Sparklines⁠: Block Elements⁠ can draw Tufte⁠-style sparklines like ‘▁▂▃▅▇’ (eg. on the⁠ CLI⁠).
- Data visualization & statistical graphs: gnuplot⁠ (matplotlib/asciiplotlib⁠/), termeter⁠, plotlib/bashplotlib⁠, ⁠txtplot⁠
- Shading: Block Elements include shaded blocks like ‘░’ (LIGHT SHADE), ‘▒’ (MEDIUM SHADE), & ‘▓’ (DARK SHADE); as well as
- Dots: Braille Patterns⁠ could draw various binary patterns, for a dithered look (eg. drawille⁠)
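To make the sparkline idea concrete, here is a small Python sketch that scales a series onto the eight lower-block characters of the Block Elements range. The function name `sparkline` and the min-to-max scaling rule are illustrative assumptions; the tools listed above may scale differently.

```python
# Hypothetical sketch: the eight bars run from LOWER ONE EIGHTH BLOCK
# (U+2581) up to FULL BLOCK (U+2588).
BARS = "".join(chr(code) for code in range(0x2581, 0x2589))


def sparkline(values):
    """Return one bar character per value, scaled between min and max."""
    values = list(values)
    if not values:
        return ""
    lo, hi = min(values), max(values)
    if hi == lo:
        return BARS[0] * len(values)  # flat series: draw the lowest bar
    top = len(BARS) - 1
    return "".join(BARS[round((v - lo) / (hi - lo) * top)] for v in values)
```

Because each value becomes exactly one character, the result can be dropped inline into a monospace paragraph or table cell without disturbing alignment.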

Header outlines, tables⁠^⁠6⁠, diagrams, rectangles/squares/wrappers of any kind: box-drawing characters⁠

         ┏━━━┳━━━┓
         ┃ | ┃ – ┃
         ┣━━━╋━━━┫
         ┃ H ┃ ‡ ┃
         ┗━━━┻━━━┛
 
  “Is this loss”—in Unicode?
  Depends how generous you feel.
  (And what fonts you installed.)
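
A grid like the one above is mechanical to generate. The following Python sketch draws a table with the heavy box-drawing characters; the function name `box_table` is hypothetical, and it assumes that every row has the same number of cells and that every character in a cell occupies exactly one monospace column (which is not true of, say, CJK or combining characters).

```python
# Hypothetical sketch: render rows of strings as a heavy-ruled grid.
def box_table(rows):
    """Return a multi-line string drawing `rows` inside box-drawing rules."""
    columns = len(rows[0])
    # Column width = widest cell, measured in code points (an assumption:
    # one code point is one column wide).
    widths = [max(len(row[c]) for row in rows) for c in range(columns)]

    def rule(left, middle, right):
        return left + middle.join("━" * (w + 2) for w in widths) + right

    lines = [rule("┏", "┳", "┓")]
    for index, row in enumerate(rows):
        cells = (f" {cell.ljust(w)} " for cell, w in zip(row, widths))
        lines.append("┃" + "┃".join(cells) + "┃")
        if index < len(rows) - 1:
            lines.append(rule("┣", "╋", "┫"))
    lines.append(rule("┗", "┻", "┛"))
    return "\n".join(lines)
```

Calling `box_table([["|", "–"], ["H", "‡"]])` reproduces the 2×2 grid shown above (without its leading indentation).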

Headers: header level priority can be expressed nicely with the foregoing.
Working our way up from the smallest headers to the largest, a header hierarchy might go: eg. italics, underlined, bold, bold underlined, underscored (add a second line like ======), capitalized underscored, outlined header…
At the higher levels, we can draw increasingly large squares or rectangles; the text cannot be upscaled, of course, but the header text can be redrawn as ASCII art (eg. ⁠1⁠, ⁠2⁠, ⁠2⁠, ⁠3⁠)—they are fonts-within-fonts, one might say, and providing these sorts of ‘fonts’ was a common feature of ANSI/ASCII art text editors like TheDraw⁠. The transition point would be governed by whatever looks good. (Maybe a 4×4 square is too small, and the lower limit is 5×5?)
- Dropcaps⁠: if we have ‘fonts’, we can also have dropcaps to decorate our documents. They are simply a box with an upscaled letter and perhaps ‘ornamentation’ inside it around the upscaled letter. These can be packaged, or generated on the fly procedurally.
- Progress bars⁠: one nice feature of more complex document readers is providing a progress bar. We can’t provide a GUI/TUI which computes this, of course, but we can encode progress bars into headers if we like.
  For example, in any underscored header (which is most of the large headers), we could transition characters, corresponding to document position. At the beginning of the document, the underscore line might be ---------- (0%), and at the halfway point, the next header is =====----- (50%), ========-- (75%), and finally, ========== (100%), providing a visual progress bar for the reader. (This sort of ‘scroll spy’ approach is increasingly common on web pages & mobile apps.)
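
The progress-bar underscore can be computed directly from each header's position in the document. This is a minimal Python sketch; the names `progress_rule` and `header_rules`, the default width of 10, and the round-half-up rule are assumptions chosen so that the output matches the example strings above.

```python
import math


def progress_rule(fraction, width=10, done="=", todo="-"):
    """Return an underscore line whose filled part reflects `fraction` (0..1)."""
    fraction = min(1.0, max(0.0, fraction))
    filled = math.floor(fraction * width + 0.5)  # round half up
    return done * filled + todo * (width - filled)


def header_rules(header_lines, total_lines, width=10):
    """Hypothetical helper: one rule per header, given each header's line number."""
    return [progress_rule(n / total_lines, width) for n in header_lines]
```

For a 100-line document with headers at lines 0, 50, 75, and 100, `header_rules([0, 50, 75, 100], 100)` gives the four underscore lines from the example, in order.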

Advanced Utext

These are the obvious ones. But we can be more ambitious: the point of having a special compiler is not to simply save some time & effort on things we could easily have done ourselves. What can we do that requires special support?

In 2023, an obvious thing to do is leverage the power of neural nets.^7 Large language models understand English text at a high semantic level, enabling arbitrary rewrites of text to change style or wording while preserving the meaning; meanwhile, image models are able to synthesize images which match a text description (or are optimized to satisfy multiple properties simultaneously).

Text Modification: one of the major benefits of writing in Utext is that one doesn’t have to decide where to break lines. Hard-wrapping lines makes editing much harder, and breaks things like search or search-and-replace—what if a phrase is hard-wrapped? An LLM can improve over a mechanical typesetting algorithm of simply scanning n characters in and then line-breaking.
- Text style transfer: an LLM can be prompted to do zero-shot text style transfer in many ways: it can rewrite text to sound like a pirate, or into emoji. So arbitrary prompts or a library of prompts could be provided to let the author write & translate text: [I bought a computer from my local store.]{translate="pirate"} → Me procured a contraption of reckonin' from me nearby tradin' post. or [...]{translate="emoji"} → 👤💳💻🏪.
- Hyphen-free justification: LLMs can go even further in improving line-breaking: an LLM can rewrite a piece of text to replace words by synonyms or tweak punctuation in order to fully-justify lines. These rewrites would be highly unsafe if done by a mere dictionary approach, but an LLM can score rewrites by whether they are semantically dangerous—does an edit modify a quoted proper name? Or have distinctly different connotations?
ASCII Image Generation: images generated by neural nets do not need to be high resolution or photographic—in fact, pixel art⁠ was one of the early big successes⁠ of CLIP⁠-based image generation (having ⁠defied many earlier attempts⁠ using GANs⁠, which were simply not knowledgeable enough about the world to generate convincingly abstract pixel art). And ASCII art is a lot like pixel art.
So whereas past attempts to automatically convert various images to ASCII art representations, like aalib, have been unconvincing, requiring extremely large text images to reduce the quantization error down to recognizable levels and failing at the core task of simplifying & abstracting (eg. Chung & Kwon 2022), it may be possible now to generate ASCII art of arbitrary image inputs (style transfer) & even text prompts (text2image).
This can be done with the old CLIP tools, or newer ones based on ⁠Stable Diffusion⁠ (eg. ⁠“Ascii Art”, All-In-One-Pixel-Model⁠, Pixel Art XL⁠) etc.
This approach would also allow generation of visual chrome, like drawing special effects around a given piece of text. At each optimization step, one would convert the current text into a bitmap, hold fixed the pixels corresponding to the user-written text, and then optimize the pixels in just the desired regions to match a prompt, and convert back to text⁠^⁠8⁠; so a user could hypothetically write something like [Visit My Awesome Homepage!]{container-type="box" size="11" decoration-prompt="cool roaring flames whooshing off to the right" characters="all"} to create a large banner which has new text-flame borders each time.
Done right, this would let one freely build shadows and depth with layered outlines, underlines, and box drawings, or use it to create 3D effects on letters or shapes.
- ASCII image testing: NN image tools like CLIP help solve the practical issue of how to benchmark or test the appearance of Utext documents.
  It is straightforward to test that Utext text compiles to the intended text files, simply by unit tests of source+compiled pairs, but that is font-independent, and in practice, the major concerns with Utext correctness will be whether a document looks right across major monospace fonts—it would be bad to ship features that look good in the developer’s personal choice of font, but then break horribly on some Mac font. (Most concerning is that box-drawing characters apparently are not guaranteed to be aligned even in monospace fonts—completely defeating the point of having them!)
  But an appropriate NN could help here. A CLIP finetuned on an ASCII art dataset, for example, could be used to benchmark & test the esthetics of Utext documents by generating versions in a variety of fonts in several VMs’ terminals, screenshotting the resulting rendered text, and then judging the similarity to each other & to a description. Errors in rendering (such as boxes or tables misaligned) should result in clusters of good instances and then broken outliers; a feature or a change which results in unusually large net distance flags dangerous behavior.

Open questions:

Microtypography⁠: Most forms of microtypography, like inserting/subtracting spaces to improve justification⁠, would seem to be unavailable because we need a monospace font for any vertical alignment. So Unicode space characters⁠ like HAIR SPACE or THIN SPACE would presumably be rendered either as a normal-sized space or not at all.
One could try to subtly add double-spaces to improve layout, which may look natural if done well. More advanced approaches might be possible: for example, one could use a hyphenation dictionary to see where one could insert hyphens into text to make it justify better while still being valid English; a radical approach would be using LLMs to rewrite text to make it justify perfectly in monospace.
Regardless, given the complexity of Unicode, there may still be a way to do microtypography by adding fractional spaces or finding some other way to render invisible half/quarter-width-characters. (One of the Unicode control characters⁠?) Perhaps one could stick solely to 1:1 space substitutions, so that if they fall back to a regular space, the line in question remains backwards compatible, reverting to the original non-microtypographic appearance?
Animation: is there any clever way to do ‘animation’ as a static text file? Does this require knowing the vertical height of a window before one could do animation by paging down screens?
If it can be done, then ‘video’ is entirely possible by using incremental generation to stream results, and video generation NNs become relevant.
Bidirectional text⁠: what can be done exploiting bidirectional text rendering or vertical text⁠? (cf. ⁠Trojan Source⁠)
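
The simplest of the microtypography ideas above, padding with extra ordinary spaces, can be sketched in a few lines of Python. The function name `justify_line` and the policy of giving leftover spaces to the leftmost gaps are assumptions; a more careful version would place the extra spaces where they are least noticeable (eg. after punctuation).

```python
# Hypothetical sketch: full-justify one line in a monospace font using
# only ordinary spaces, so it degrades gracefully in any viewer.
def justify_line(line, width):
    """Pad the gaps between words so the line is exactly `width` columns."""
    words = line.split()
    if len(words) < 2:
        return line  # nothing to stretch
    gaps = len(words) - 1
    spaces = width - sum(len(word) for word in words)
    if spaces < gaps:
        return " ".join(words)  # line is already too long; leave it ragged
    base, extra = divmod(spaces, gaps)
    out = []
    for index, word in enumerate(words[:-1]):
        # The first `extra` gaps each receive one additional space.
        out.append(word + " " * (base + (1 if index < extra else 0)))
    out.append(words[-1])
    return "".join(out)
```

The hyphenation-dictionary and LLM-rewrite approaches would sit in front of a step like this one, reducing how many extra spaces it has to distribute.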

Utext Markup

For ease of user adoption, the markup should probably be based on Markdown. Markdown already has constructs corresponding to most of the above features, and the compiler can take care of things like hyperlink numbering & appending.
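
As an illustration of the hyperlink numbering & appending step, here is a minimal Python sketch. The function name `number_links` is hypothetical, the regular expression only handles simple inline Markdown links of the form `[text](url)`, and the output layout (superscript marker after the link text, numbered URL list after a blank line) is one possible convention rather than a defined part of Utext.

```python
import re

# ASCII digits -> Unicode superscript digits.
SUPERSCRIPTS = str.maketrans(
    "0123456789",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
)
# Simple inline links only: [text](url), with no spaces or ')' in the URL.
LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def number_links(markdown):
    """Replace inline links with superscript markers and append the URLs."""
    urls = []

    def replace(match):
        text, url = match.group(1), match.group(2)
        if url not in urls:
            urls.append(url)  # a repeated URL reuses its first number
        marker = str(urls.index(url) + 1).translate(SUPERSCRIPTS)
        return text + marker

    body = LINK.sub(replace, markdown)
    if not urls:
        return body
    notes = [
        f"{str(number).translate(SUPERSCRIPTS)} {url}"
        for number, url in enumerate(urls, start=1)
    ]
    return body + "\n\n" + "\n".join(notes)
```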

A Utext user probably needs more fine-grained control over 2D layout than a Markdown user usually needs (or wants). This could be encoded as tables, like ⁠Pandoc’s pipe tables⁠. Tables can be given arbitrary attributes to control use of advanced layout features, like auto-generating text borders through a neural net.

Text + Source Storage

Because the output of Utext is the same as the input as far as everything else is concerned (ie. UTF-8 text), one can provide both the ‘source code’ and ‘compiled document’ of a Utext in a single file. While (less than) doubling the space consumption, this would have a few benefits:

Distributing source by default helps with community-building & licensing (similar to how many people learned HTML by reading the HTML of cool web pages they visited, and built on copies)
Utexts can be re-compiled:
- Greater quality: particularly with the more advanced features I speculate about above, Utext compilers may improve over time. If Utexts didn’t come with source, improved renderings would be difficult as now the Utext has to be decompiled/reverse-engineered, and it may be hard to know what the original prompts or styles were.
- Responsive design: any default size width will ill-suit someone; perhaps they prefer to provide the compiler a much wider canvas to paint on, or alternately, want the best possible rendition for their particularly narrow smartphone screen-width. (See also ⁠HTML Utext⁠.)
- Personal preference: providing source lets readers customize the compilation. They can disable or upgrade the advanced features, or change the esthetics: perhaps they don’t like the ‘default retro’ appearance, and will instead target Commodore fonts with their characteristic slanting ‘italic’ style
Accessibility: source-availability means that a screen-reader can skip to the source, or the user could program their browser to skip to the source, or the Utext could simply be provided by a third party in a HTML wrapper which marks up the source as what the screen-reader should read.

Utext Format

The primary purpose of Utext is to generate the coolest output, so the main challenge is just achieving any of the suggestions above. But as an optional feature, mostly independent of the foregoing, I’ll discuss a special-purpose Utext format.

Fancy Unicode tricks do not, in principle, require a new document format beyond the classic .txt file. One can simply store the source for a Utext in an ordinary Markdown text file with a .md extension, and ‘compile’ to a .txt file.

But it is convenient to define a new format (especially if Utext winds up adding extensions or syntax which make no sense to a Markdown compiler), and since we are using strictly UTF-8, we can do something interesting and combine a fancy Unicode text file with a text file into a single file format.⁠^⁠9⁠

The Utext format (.utxt /text/plain-utext;charset=UTF-8) is a UTF-8 document which has two kinds of lines: compiled and source. A utext could contain a single document, or concatenate many documents. There is no explicit document-level division or header, and they are distinguished on a line-by-line basis.

Each line of a compiled utext is prefixed with an invisible special character (eg. ASCII’s Start of Text control character) which indicates it is a compiled line.^10 A compiled line is a human-readable UTF-8 text line. (120 characters would be a good maximum width for all devices; but a Utext could be rendered at multiple widths so as to support other devices, like smartphones, which usually have a comfortable width of ~40 characters.) But a compiled line is not meant to be human-editable, because it uses all available Unicode tricks for rich formatting, like ASCII art. For example, a source line may use italics or bolds, which normally cannot be displayed in pure UTF-8 text without using terminal control codes or relying on markup conventions like HTML/Markdown, but the compiled version abuses Unicode to map each italic letter onto the math symbols—thereby providing ‘italics’ and ‘bolds’ in pure UTF-8 text rendering in any text viewer.

Any line without the prefixed special character is a source line. A source line would be whatever document format the user prefers, such as a Markdown dialect with custom extensions for Utext-specific features.
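
Since the two kinds of lines differ only in their first character, separating them takes very little code. This Python sketch assumes the Start of Text control character (U+0002) as the prefix, following the example above; the function names are hypothetical.

```python
# Assumed prefix: ASCII Start of Text, the example suggested in the text.
STX = "\x02"


def is_compiled(line):
    """A compiled line starts with the invisible prefix; anything else is source."""
    return line.startswith(STX)


def compiled_view(lines):
    """Return the readable rendition, with the prefix stripped."""
    return [line[len(STX):] for line in lines if is_compiled(line)]


def source_view(lines):
    """Return only the editable source lines."""
    return [line for line in lines if not is_compiled(line)]
```

A reader wants `compiled_view`; an editor, a re-compiler, or a screen-reader wants `source_view`.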

Since both compiled & source lines are UTF-8 text by definition, they can be stored in the same file. Utexts can be compiled ‘in-place’ and a text editor could easily provide a ‘WYSIWYG’ view of a utext being edited, and this interoperates with version-control systems.⁠^⁠11⁠ A standard utext provides the full compiled version first, followed by the source. This is most convenient for the reader, who can read a document in its pretty compiled form, and then skip the source. (It also helps solve accessibility issues: if a 40-column smartphone rendition is unacceptable, then one can provide the source with some HTML like a <pre style="white-space: pre-wrap"></pre> wrapper; ASCII/ANSI art is inherently screen-reader hostile, but a screen-reader could simply skip compiled lines for the human-readable source text, and so a Utext would be far more accessible than a hand-written equivalent, which lacks any formal structure or documentation of intent.)

But they can also be arbitrarily ordered: one could provide a Literate Utext, which is a range of compiled lines, followed by the corresponding source lines. (The standard utext is easily recreated from the literate utext by simply sorting on the first character of each line, and ‘bubbling’ up lines with the prefix and bubbling down lines without.) In this approach, as long as compilation is kept to one-pass compilation, Literate Utext can be generated incrementally and streamed.⁠^⁠12⁠
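
Both halves of this paragraph can be sketched in Python: generating a literate utext incrementally, and recovering the standard ordering by a stable sort on the prefix. The prefix character (Start of Text, U+0002) follows the example given earlier; `literate_stream`, `to_standard`, and the `compile_line` callback, which stands in for a one-pass Utext compiler, are hypothetical names.

```python
# Assumed prefix: ASCII Start of Text.
STX = "\x02"


def literate_stream(source_lines, compile_line):
    """Yield each source line's compiled output, then the source line itself.

    `compile_line` is a stand-in for a one-pass compiler: it takes one
    source line and returns a list of compiled lines.
    """
    for source in source_lines:
        for compiled in compile_line(source):
            yield STX + compiled
        yield source


def to_standard(lines):
    """Bubble compiled lines up and source lines down, preserving order.

    Python's sort is stable, so lines keep their relative order within
    each group; False (compiled) sorts before True (source).
    """
    return sorted(lines, key=lambda line: not line.startswith(STX))
```

Because `literate_stream` is a generator, a consumer can start displaying compiled lines before the whole source has been read, which is what makes streaming possible.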

And because compiled vs source lines can be matched by a per-line regexp, a literate utext can be viewed easily as an incremental stream—simply grep for the compiled lines. This means one can do command line stunts like create an IRC client/server with nice formatting simply by telnet: a client opens a telnet connection and pipes in the source text they type, sending it to the server; they open a second telnet connection, which downloads an (incrementally streaming) literate utext, filtering out the source lines with grep; and the server simply pipes each client line of text through the Utext compiler and appends to the file that each client is indefinitely downloading. (This can be done to some degree with regular IRC, but not with formatting.)

HTML Utext

How can we make Utext useful for websites?

By adding in the single missing ingredient of hypertext: active, clickable links.

And we can do that by defining a HTML wrapper which permits only literal text displays (in <pre> tags), <a> HTML hyperlinks, and the minimal amount of CSS to render this pleasingly on desktop/mobile.

This HTML Utext gives us a highly-esoteric textfiles-style format which nevertheless could work for a website.

When web developers meet to boast about their web page speeds, the web pages in question inevitably devolve into fancy text files—there is no way to be faster than to ask the browser to do as little as possible, after all, given that ultimately a web page is fundamentally a document, and everything else is tacked on afterwards. Considering how much Utext can do, one might ask, can you make a Utext web page? You can of course simply put a bunch of utext files on a web server and serve them as a UTF-8 text file, no problem. But can we do better? Can we make a ‘HTML Utext’ where, say, a programmer could seriously contemplate making a personal website out of it which people might actually use? (Nothing on the level of a Gwern.net would be possible, but could we hope to beat, let’s say, Dan Luu’s website?)

Hypertext?

A Utext is just a text. What is the essence of a hyper text⁠? What separates a hypertext from a regular text like a printed book or a handwritten papyrus scroll?

The missing link. I would say that the essence of a hypertext is not supporting some typographical feature like italics, and nor is it having citations or references of some sort like page numbers (or even ISBN/DOI IDs); if that was enough, then we’ve had “hypertext” for centuries or millennia. Nor is it the concept of a “link” referring in some way to where to physically get another work—even at the Library of Alexandria, there was at least one well-developed indexing system: you could refer to a scroll by its genre/type and then the first letter of the author, and possibly the physical location in a bin; and an ISBN often implies that the legal deposit Library of Congress (which controls ISBNs) would have a copy in Washington DC. And nor is it containing Internet URLs: if someone printed out a web page and it included a list of the raw URLs appended at the end, you wouldn’t regard it as all that different from a regular book with its bibliography of references. Likewise, a utext which contained a bunch of URLs in it, as text, however informally marked up with < > or ( ) or whatever, is still not a hypertext.

Active, not passive. Rather, the essence of a hypertext is that the links are active, and will do something: usually, load the referenced document and display it. (Although not always: it could instead download the document, open it in a new ‘tab’, or add it to a sidebar as a kind of queue/bookmarks like Symbolics Concordia⁠, or any number of things.)

So, what Utext is missing is some way to provide active links, true hyperlinks. That’s the only important thing it is missing compared to HTML. Everything else, from multimedia to Javascript, is optional—plenty of hypertext systems have lacked them, and were usable.

There is no way to implement clickable active links in pure Unicode, as the viewing software must in some way execute code. But this hints that we only need one small, tiny feature from HTML: its <a> link. So… why not create utexts which are utexts except where they switch to HTML and use a <a> link?

`<pre>`

As previously mentioned, we can display UTF-8 text literally in a browser using the HTML <pre> tag, which stands for ‘pre-formatted’—exactly the sort of text we want to display here.

So, a minimal HTML Utext would be to define a normal HTML document, except that all features or tags not strictly required to be a valid HTML document are forbidden, and all that is allowed in the <body> of the document is <pre> and <a> tags. That is, the <pre> tags are used to display normal Utext content, and whenever a hyperlink is required, they are closed with </pre>, then an <a> link is defined, which can only contain 1 <pre> with Utext contents (which will be a clickable hyperlink).

<html>
    <body>
<pre>Normal Utext content here, but then one wants to link to
some URL like</pre> <a href="https://example.com/"><pre>Example.com</pre></a> <pre>to demonstrate how
linking works in HTML Utext.</pre>
    </body>
</html>

Thus, the author can write in the normal Markdown-esque Utext source language, and freely hyperlink away as usual (eg. generating a hyperlinked table of contents where each ‘section’ is a different <pre> ID), and it will compile down to a HTML document which nevertheless looks exactly like a utext—except for the blue underlined links throughout.
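
The final compilation step, wrapping already-compiled utext in `<pre>` and `<a>` tags, could look roughly like the Python sketch below. The function name `html_utext` and the segment representation (a plain string for ordinary text, a `(text, url)` tuple for a link) are assumptions for illustration. Segments are concatenated with no separator, so any spaces around a link must be part of the neighboring text segments.

```python
from html import escape


def html_utext(segments):
    """Wrap utext segments in the minimal <pre>/<a> structure.

    Each segment is either a string (plain utext) or a (text, url) tuple
    (a hyperlink whose visible content is utext).
    """
    parts = []
    for segment in segments:
        if isinstance(segment, str):
            parts.append(f"<pre>{escape(segment, quote=False)}</pre>")
        else:
            text, url = segment
            href = escape(url, quote=True)  # also escapes quotes in the URL
            parts.append(
                f'<a href="{href}"><pre>{escape(text, quote=False)}</pre></a>'
            )
    body = "".join(parts)
    return f"<html>\n    <body>\n{body}\n    </body>\n</html>\n"
```

Escaping matters here because compiled utext is arbitrary Unicode and may well contain `<`, `>`, or `&`, which would otherwise be parsed as markup.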

This is the minimum viable ‘HTML Utext’.

(And also, to be good web citizens, we should permit the <head> to contain <meta> and <link> fields.)

Styling

Because we can use CSS, we can pull in web-fonts⁠ (such as 8-bit fonts⁠) to ensure much more exact rendering of the utexts; and we can indulge in retro hacker esthetics like setting green-on-black. (A key feature, we can all agree.)

However, we are unable to indulge the temptation to go beyond global stylization (like setting background/foreground), because the <pre> tags are opaque and block any further HTML/CSS creep. Should a user find themselves fighting that, and trying to crack open the black-boxes to do colors or syntax-highlighting or even (gasp) make it interactive via JS, or do something where the utext would break fundamentally if rendered to a terminal—then it’s time for them to stop using Utext and start using real HTML.

CSS

The one immediate catch with this simple approach is that <pre> tags are ‘block’ elements and <a> are ‘inline’, and so we cannot quite intersperse them as freely as that, and get the expected layout. The above example would actually render each of the 3 elements as their own paragraph-like block, more like this:

Normal Utext content here, but then one wants to link to
some URL like
Example.com
to demonstrate how
linking works in HTML Utext.

This might be viable as a Gopher/Gemini-like highlighting of links, but I like inline links, and we can make it render as expected by adding 1 line of CSS:⁠^⁠13⁠

<style>pre { display: inline; }</style>

If we ever want the link on its own line, as in the block rendering above, then we simply write those line breaks explicitly in the <pre> text.

A second flaw with this is that in 2024, it is no longer possible to pretend mobile users don’t exist. Mobile is a requirement; anything which breaks on mobile is too esoteric to take seriously.

We are able to support mobile by the brute-force approach of rendering a new utext for each desired screen-width. This allows a tradeoff between responsiveness and pre-computation/filesize: the more widths we generate a tailored utext for, the better the experience, but the longer the compilation & larger the filesize.

But can we deliver the different versions as appropriate? Yes, if we are willing to spend a little more CSS, or complicate the server itself:

<pre>/<a> tags support the usual HTML things like classes & IDs; so we can support mobile by rendering our utexts for multiple widths, assigning a class/ID to each one, and then add a media-query CSS declaration to switch at various screen widths.
We could also have separate websites/domains for mobile vs desktop, or we could set our HTTP server to do user-agent sniffing and try to detect what the device is and send it an appropriately-compiled utext. (Direct content negotiation⁠ for screen width is now unsupported⁠.)
This saves bandwidth and simplifies the delivered file, but is risky (user-agent sniffing is notoriously error-prone) and complicated and can be hard to debug. So this would be my least-preferred option.
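The first option (one rendering per width, switched by media queries) might be generated along these lines. This is a rough sketch under assumptions: `render_widths` and `media_css` are hypothetical names, the class names and the 650px breakpoint are only illustrative, and `textwrap` stands in for whatever line-breaking the real compiler would do.

```python
import textwrap
from html import escape


def render_widths(paragraph, widths):
    """Re-wrap one paragraph for each target width.

    `widths` maps a class name to a column count, eg. {"narrow": 40, "wide": 80}.
    Returns {class_name: '<pre class="...">...</pre>'}.
    """
    out = {}
    for cls, cols in widths.items():
        # textwrap counts code points, not display columns; see the note below.
        wrapped = textwrap.fill(paragraph, width=cols)
        out[cls] = '<pre class="%s">%s</pre>' % (cls, escape(wrapped, quote=False))
    return out


def media_css(breakpoint_px=650):
    """CSS that shows only .narrow below the breakpoint and only .wide at or above it."""
    return (
        "@media all and (max-width: %dpx) { .wide { display: none; } }\n"
        "@media all and (min-width: %dpx) { .narrow { display: none; } }\n"
        % (breakpoint_px - 1, breakpoint_px)
    )


if __name__ == "__main__":
    text = "Normal Utext content here, but then one wants to link to some URL."
    for markup in render_widths(text, {"narrow": 40, "wide": 80}).values():
        print(markup)
    print(media_css())
```

Two caveats on this sketch. Because `textwrap` counts code points rather than terminal columns, double-width or combining characters would need a display-width-aware wrapper. And on phones, width-based media queries generally behave as intended only if the page also declares a viewport (eg. `<meta name="viewport" content="width=device-width, initial-scale=1">`), since mobile browsers otherwise tend to lay the page out on a much wider virtual viewport.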

So a complete HTML Utext might look like this:

<!DOCTYPE html>
<html lang="en-us">
  <head>
    <meta charset="utf-8">
    ...
 
    <style>pre { display: inline; }
    @media all and (max-width: 649px) {
        .wide { display: none; }
        }
    @media all and (min-width: 650px) {
        .narrow { display: none; }
        }
    </style>
  </head>
 
  <body>
<pre class="narrow">Normal Utext content here, but then one wants to link to
some URL like </pre><a class="narrow" href="https://example.com/"><pre>Example.com</pre></a><pre class="narrow"> to demonstrate how
linking works in HTML Utext.</pre>
 
<pre class="wide">Normal Utext content here, but then one wants to link to some URL like</pre> <a class="wide" href="https://example.com/"><pre>Example.com</pre></a>
<pre class="wide">to demonstrate how linking works in HTML Utext.</pre>
  </body>
</html>

This elegantly gives us a ‘web page’ which is: still lightning-fast & small, supports the full range of Unicode display tricks in Utext, has clickable links, will display well on the devices that make up 99% of the market, is standards-compliant, has metadata hidden from the reader, can be readily turned into pure plain Utext with the usual HTML → text tools (eg. elinks -dump)…
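To illustrate that last point without depending on any particular tool, here is a small sketch of the HTML → text direction using only Python's standard library. It is not how elinks works; `UtextDump` and `dump_utext` are hypothetical names, and the `skip` parameter (which width-class to discard) is an assumption of the example.

```python
from html.parser import HTMLParser


class UtextDump(HTMLParser):
    """Collect text inside <pre>, ignoring elements whose class is in `skip`."""

    def __init__(self, skip=("wide",)):
        super().__init__(convert_charrefs=True)
        self.skip = set(skip)
        self.skipping = 0   # depth inside a skipped <pre>/<a>
        self.in_pre = 0     # depth inside a kept <pre>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("pre", "a"):
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.skipping or classes & self.skip:
            self.skipping += 1
        elif tag == "pre":
            self.in_pre += 1

    def handle_endtag(self, tag):
        if tag not in ("pre", "a"):
            return
        if self.skipping:
            self.skipping -= 1
        elif tag == "pre" and self.in_pre:
            self.in_pre -= 1

    def handle_data(self, data):
        if self.skipping:
            return
        if self.in_pre:
            self.chunks.append(data)  # preformatted text is kept verbatim
        elif not data.strip() and self.chunks and not self.chunks[-1][-1].isspace():
            # Whitespace between inline elements collapses to a single space.
            self.chunks.append(" ")


def dump_utext(html_text, skip=("wide",)):
    parser = UtextDump(skip)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.chunks).rstrip()
```

Calling `dump_utext(page)` keeps the narrow rendering, and `dump_utext(page, skip=("narrow",))` keeps the wide one. Hyperlink targets are dropped here, whereas a real dump tool would typically list them separately.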

All in all, HTML Utext is something I can actually imagine using for a personal homepage or project without the esoteric document format being the real reason for it.

External Links

Discussion: ⁠HN⁠
⁠Monospace websites: ⁠“The Monospace Web: A minimalist design exploration”, Oskar Wickström

Dynamic or static? (Which static framework? There are a bazillion.) What colors? What fonts? How much Javascript (or should it be Typescript or…) and for what? Responsive or fixed size? Dark-mode? Server-side rendering or hydrated in-browser? Components? (Oh for the days of Microsoft FrontPage or PHP scripts you just FTPed to your shell account…!)
Gemini mandates TLS⁠ but, in the name of simplicity, has no images? And then the constraints produce no interesting designs, just users chafing at the limits.
Most Gemini pages I’ve seen make me wish the author had written some Markdown instead and saved us all the trouble—yes, the Gemini community may be the valuable part, but that’s an absurd way to defend a protocol & document format!
Because in a globalized Internet, being ASCII-only is increasingly unacceptable, and so most viable document formats must be strictly more complex than UTF-8-only text files.
Gemini, for example, ⁠supports list markers with the explanation that “The only reason they are defined is so that more advanced clients can replace the * with a nicer looking bullet symbol and so when list items which are too long to fit on the device screen get broken across multiple lines, the lines after the first one can be offset from the margin by the same amount of space as the bullet symbol. It’s just a typographic nicety.” The list markers could, of course, just have been written with a Unicode bullet point or symbol to start with. Similarly, Gemini headings & blockquotes will display as-is unless a Gemini client wishes to specially render them, and presumably most of these special renderings would be doable in Unicode (eg. bold headers).
Fortunately for a Utext enthusiast, Gemini can display all Utext by default, because it supports ⁠“preformatted text”, which displays the enclosed text literally (similar to ⁠HTML <pre> tags⁠), so any Gemini client is ipso facto a Utext client and can dispense with special formatting of lists, headers, & blockquotes… (Clickable links and reflowed text would not be available by default, but the links could be interwoven with multiple preformatted-text sections, and the client could do something like request a URL with a specific width as the argument and the server respond with a Utext formatted for that width. But at that point, one might as well use ⁠HTML Utext⁠.)
You can also convert to Unicode using UnicodeIt, Emacs⁠, or my GPT-4⁠-based ⁠latex2unicode.py⁠.
ASCII tables are notorious for being write-only, and needing tools like ⁠table formatters or Emacs’s Table Mode⁠ to edit.
These optional features can be cached if performance is a concern.
The available characters can be constrained to specific ranges, so one could constrain generation to use only Braille Patterns, or only shaded blocks etc. This would help generalize the tool, and avoid the existing menagerie of lots of CLI tools focused on different character sets.
Which would make no sense for most Markdown approaches, aside from ‘self-contained’ implementations like Markdeep, which bundle a Javascript Markdown compiler/interpreter with the raw Markdown for easier distribution as a single web page.
This is slightly inspired by Brad Templeton’s ProleText, which uses inline whitespace suffixed to each line to mark up the plain-text prefix. Almost arbitrarily many things could be encoded into the invisible whitespace, and one could do tricks like make a text file sort into an arbitrary ordering of lines by appropriate whitespace prefixes.
Naively recording changes to both the compiled & source lines would cause a lot of problems with the comprehensibility of patches/diffs, and interfere with more advanced functionality like bisecting.
Fortunately, for exactly this purpose, most version-control systems will support various ways of ‘ignoring’ or ‘removing’ text which matches a regexp, or calling out to a custom ‘diff’ program (which might just delete some lines before calling the standard diff utility). Git⁠, for example, could be made to ignore compiled lines by using either ⁠“filters” or ⁠customized diff functions.
With minor adjustments of any document-level features. For example, the URLs in hyperlinks can be printed immediately after their block, rather than appended to the end of the document.
Because this is a highly unusual use of <pre>, there may be additional layout problems I haven’t run into yet and which weren’t apparent from the MDN docs. If that turns out to be the case, it may be necessary to include a few more CSS rules & classes to eg. style the <pre> inside <a>s differently from the ones outside.
