#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# latex2unicode.py: Convert simple inline TeX/LaTeX (aimed at ArXiv abstracts) into Unicode+HTML+CSS, using the OpenAI API.
# Author: Gwern Branwen
# Date: 2023-06-28
# When: Time-stamp: "2024-09-14 19:34:43 gwern"
# License: CC-0
#
# Usage: $ xclip -o | OPENAI_API_KEY="sk-XXX" python latex2unicode.py
#
# Typesetting TeX/LaTeX for web browsers is typically a heavyweight operation; even if done server-side, display often requires a lot of CSS+fonts. And then the result looks highly unnatural and clearly 'alien', interrupting reading flow. This is worthwhile for complex equations, where browser typesetting is not up to snuff, but for many in-the-wild TeX uses, the use is often as simple as `$X$`, which would look better as `*X*` & take megabytes less to render. So it is desirable for simple TeX expressions to convert them to 'native' Unicode/HTML (augmented with a bit of custom CSS to handle things like superscripts-over-subscripts which pop up in integrals/summations/binomials/matrices etc).
# Unfortunately, TeX is an irregular macro language which is hard to parse and 'compile' to Unicode: it's easy to do many examples, but there's a long tail of weird variables, formatting commands etc, which means that I wind up defining lots of rewrites by hand, even though they are usually pretty 'obvious'. So, quite tedious and unrewarding.
# However, this is a perfect use-case for GPT models: it is hard to write comprehensive rules for, but is an extremely constrained problem in a domain it knows well which requires processing few tokens, where I can give it many few-shot examples, interrogate it for edge-cases to then write rules/examples for, and the harm of an error is relatively minimal (anyone seriously using an equation will need to read the original anyway, so won't be fooled by a wrong translation).
# So we write down a list of general rules, then a bunch of specific examples, then ask GPT-4 to translate from TeX to Unicode/HTML/CSS.
#
# eg.
# $ echo 'a + b = c^2' | python3 latex2unicode.py
# *a* + *b* = *c*^{2}
#
# Note: this is intended only for using clean TeX and compiling to something usable in HTML/Markdown. For converting from an image or screenshot to TeX, see dedicated image-to-TeX OCR tools (or prompt a VLM like Claude-3 or GPT-4o-V with an image & request).
import sys
from openai import OpenAI
client = OpenAI()
if len(sys.argv) == 1:
    target = sys.stdin.read().strip()
else:
    target = sys.argv[1]
prompt = """
Task: Convert LaTeX inline expressions from ArXiv-style TeX math to inline Unicode+HTML+CSS, for easier reading in web browsers.
Task example:
Input to convert: \\(H\\gg1\\)
Converted output: *H* ≫ 1
Details:
- Convert only if the result is unambiguous.
- Note that inputs may be very short, because each LaTeX fragment in an ArXiv abstract is processed individually. Many inputs will be as short as a single letter (which are variables).
- Assume only default environment settings with no redefinitions or uses like `\\newcommand` or `\\begin`. Skip custom operators.
- Do not modify block-level equations, or complex structures such as diagrams or tables or arrays or matrices (eg `\\begin{bmatrix}`), or illustrations such as those drawn by TikZ or `\\draw`, as those require special processing (eg. matrices must be converted into HTML tables). If the input is not an inline math expression, do not convert it & simply repeat it.
- If a TeX command has no reasonable Unicode equivalent, such as the `\\overrightarrow{AB}`/`\\vec{AB}` or `\\check{a}` or `\\underline`/`\\overline` commands in LaTeX, simply repeat it.
- If a TeX command merely adjusts positioning, size, or margin (such as `\\raisebox`/`\\big`/`\\Big`), always omit it from the conversion (as it is probably unnecessary & would need to be handled specially if it was).
- The TeX/LaTeX special glyphs (`\\TeX` & `\\LaTeX`) are handled elsewhere; do not convert them, but simply repeat them.
- Use Unicode entities, eg. MATHEMATICAL SCRIPT CAPITAL O `𝒪` in place of `\\mathcal{O}`, and likewise for the Fraktur ones (`\\mathfrak`) and blackboard-bold ones (`\\mathbb`). Convert to the closest Unicode entity that exists. Convert symbols, special symbols, mathematical operators, and Greek letters. Convert even if the Unicode is rare (such as `𝒪`). If there is no Unicode equivalent (such as because there is not a matching letter in that font family, or no appropriate combining character), then do not convert it.
- If there are multiple reasonable choices, such as `\\approx` which could be represented as `≈` or `~`, choose the simpler-looking one. Do not choose the complex one unless there is some good specific reason for that.
- For superimposed subscript+superscript, use a predefined CSS class `subsup`, eg. `(\\Delta^0_n)` → `Δ^{0}_{n}`; `\\Xi_{cc}^{++} = ccu` → `Ξ_{cc}^{++} = *ccu*`; `\\,\\Lambda_c \\Lambda_c \\to \\Xi_{cc}^{++}\\,n\\,` → `*Λ*_{c} *Λ*_{c} → Ξ_{cc}^{++},*n*`. This is also useful for summations or integrals, such as `\\int_a^b f(x) dx` → `∫_{a}^{b} *f*(*x*) *dx*`.
- For small fractions, use FRACTION SLASH (⁄) to convert (eg. `1/2` or `\\frac{1}{2}` → `1⁄2`). Do not use the Unicode fractions like VULGAR FRACTION ONE HALF `½`.
- For complex fractions which use superscripts or subscripts, multiple arguments etc, do not convert them & simply repeat them. eg. do not convert `\\(\\frac{a^{b}}{c^{d}}\\)`, as it is too complex.
- Convert roots such as square or cube roots if that would be unambiguous. For example, `\\sqrt[3]{8}` → `∛8` is good, but not `\\sqrt[3]{ab}` because `∛*ab*` is ambiguous; do not convert complex roots like `\\sqrt[3]{ab}`.
- Color & styling: if necessary, you may use simple CSS inline with a `<span style="">` declaration, such as to color something blue using `<span style="color: blue">`.
- Outlines/boxes: you may use simple inline CSS to draw borders.
- Be careful about dash use: correctly use MINUS SIGN (−) vs EM DASH (—) vs EN DASH (–) vs HYPHEN-MINUS (-).
More rules/examples for edge-cases:
- ' O(1)'
𝒪(1)
- '\\(\\mathsf{TC}^0\\)'
**TC**^{0}
- '\\(\\approx\\)'
~
- '\\(1-\\tilde \\Omega(n^{-1/3})\\)'
1 − Ω̃(*n*^{−1⁄3})
- '\\(\\mathbf{R}^3\\)'
𝐑^{3}
- '\\(\\ell_p\\)'
𝓁_{p}
- '\\textcircled{r}'
ⓡ
- '(\\nabla \\log p_t\\)'
∇ log *p*_{t}
- '\\(\\partial_t u = \\Delta u + \\tilde B(u,u)\\)'
∂_{t}*u* = Δ*u* + *B̃*(*u*, *u*)
- '\\(1 - \\frac{1}{e}\\)'
1 − 1⁄*e*
- 'O(\\sqrt{T}'
𝒪(√*T*)
- '\\(^\\circ\\)'
°
- '\\(^\\bullet\\)'
•
- '6\\times 10^{-6}\\)'
6×10^{−6}
- '5\\div10'
5 ÷ 10
- '\\Pr(\\text{text} | \\alpha)'
Pr(text | α)
- '\\(\\hbar\\)'
ℏ
- '\\frac{1}{2}'
1⁄2
- '\\nabla'
∇
- '\\(r \\to\\infty\\)'
*r* → ∞
- '\\hat{a}'
â
- '\\textit{zero-shot}'
*zero-shot*
- '\\(f(x) = x \\cdot \\text{sigmoid}(\\beta x)\\)'
*f(x)* = *x* × sigmoid(β *x*)
- '\\clubsuit'
♣
- '\\textcolor{red}{x}'
<span style="color: red">x</span>
- '\\textcolor{red}{X}'
<span style="color: red">X</span>
- '\\textbf{bolding}'
**bolding**
- '\\textit{emphasis}'
*emphasis*
- 'B'
*B*
- 'u'
*u*
- 'X + Y'
*X* + *Y*
- '\\,\\Lambda_b \\Lambda_b \\to \\Xi_{bb}\\,N\\,'
, *Λ*_{b} *Λ*_{b} → Ξ_{bb} *N*,
- 'x \\in (-\\infty, \\infty)'
x ∈ (-∞, ∞)
- 'p\\bar{p} \\to \\mu^+\\mu^-'
pp̅ → μ^{+}μ^{−}
- '\\alpha\\omega\\epsilon\\S\\om\\in'
αωε§øm∈
- '^2H ^6Li ^{10}B ^{14}N'
^{2}H ^{6}Li ^{10}B ^{14}N
- '\\mathcal{L} \\mathcal{H} \\mathbb{R} \\mathbb{C}'
ℒ ℋ ℝ ℂ
- '\\textrm{M}_\\odot'
M_{☉}
- '200+'
200+
- 'M = M_a \\cup M_b \\subseteq \\mathbb{R}^d'
*M* = *M*_{a} ∪ *M*_{b} ⊆ ℝ^{d}
- 'f : \\mathbb{R}^d \\to \\mathbb{R}^p'
*f* : ℝ^{d} → ℝ^{p}
- 'M_a'
*M*_{a}
- 'β_k\\bigl(f(M_i)\\bigr) = 0'
*β*_{k}(*f*(*M*_{i})) = 0
- 'k \\ge 1'
*k* ≥ 1
- 'β_0\\bigl(f(M_i)\\bigr) = 1'
*β*_{0}(*f*(*M*_{i})) = 1
- 'i =a, b'
*i* = *a*, *b*
- '(n,d,\\lambda)'
(*n*, *d*, λ)
- '\\Lambda'
Λ
- '\\not\\approx'
≉
- '\\left\\langle A \\middle| B \\right\\rangle'
⟨*A*|*B*⟩
- '\\mathcal{R}'
ℛ
- '\\mathbb{R}'
ℝ
- '\\cancel{x}'
x̸
- '\\left{\\frac{1}{2} \\right}'
\\left{\\frac{1}{2} \\right}
- '\\dot{x}'
ẋ
- '\\ddot{x}'
ẍ
- 'x^{y^{z}}'
*x*^{*y*^{*z*}}
- '\\lim_{x \\to \\infty} f(x)'
lim_{*x* → ∞} *f*(*x*)
- '\\boxed{A}'
<span style="border: 1px solid black">A</span>
- '\\'
- '\\:'
- '\\;'
- '\\quad'
- '\\qquad'
- '!'
- '\\!'
- En space
- Figure space
- Punctuation space
Task:
- '""" + target + "'\n"
completion = client.chat.completions.create(
    model="gpt-4o-mini", # we use GPT-4 because the outputs are short, we want the highest accuracy possible, we provide a lot of examples & instructions which may overload dumber models, and reviewing for correctness can be difficult, so we are willing to spend a few pennies to avoid the risk of a lower model
    messages=[
        {"role": "system", "content": "You are a skilled mathematician & tasteful typographer, expert in LaTeX."},
        {"role": "user", "content": prompt }
    ]
)
print(completion.choices[0].message.content)