#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# seriate.py: semantically sort, or 'seriate', a list in a logical fashion
# Author: Gwern Branwen
# Date: 2025-01-02
# When:  Time-stamp: "2025-01-02 21:38:44 gwern"
# License: CC-0
#
# Usage: $ xclip -o | OPENAI_API_KEY="sk-XXX" python seriate.py
#
# Many 'lists' can be ordered in a meaningful way, bringing similar items closer together and moving dissimilar items further apart, but in ways which do not follow a strict comparison sort implementing a total ordering.
# In a classic sorting, we sort 'ACB' → 'ABC'; but in a seriation, we might instead seriate 'Dog, Horse, Cat' → 'Cat, Dog, Horse'. It might be hard to say in exactly what sense the seriated version could be considered 'sorted' (size? phylogenetic similarity?), but it clearly makes more sense and is less confusing to read.
# For example, images or paragraphs or lists of similar essays can be put into a clearly more or less 'sorted' order, which clusters similar items, without obeying any obvious comparison function like lexicographic order. This sort of distance minimization is known as 'seriation' (or 'ordination'), and can be seen as a generalization of regular sorting; see <https://en.wikipedia.org/wiki/Seriation_(archaeology)>/<https://en.wikipedia.org/wiki/Ordination_(statistics)>/<https://www.jstatsoft.org/article/view/v025i03>. (Its inverse, maximizing distance, can be seen as a kind of seriation too, although things get fuzzier there; see <https://gwern.net/unsort>.)
# This can be done by hand, and should be, because it makes such lists easier to read; but as usual, it is too much work for a subtle benefit, and can only be done for static lists. So, we want to automate it.
# This may also be a useful primitive for LLM writing, by enabling a *seriation* pass: a first pass which cleans up text input by constraining edits to seriate it, *without modifying any words*. (For example, one could ask ChatGPT to clean up notes from a conversation or from brainstorming or jotting down text fragments, but invariably, ChatGPT will do more than just reorganize, and will wind up omitting parts, or rewriting into ChatGPTese. If one could first make ChatGPT seriate the notes, then summarize them recursively to create a hierarchical Table of Contents & an abstract, and only *then* start rewriting them, the results might be much better.)
#
# We seriate by asking the LLM to re-sort the list in a logical way, whatever that might mean in a given context, and we keep doing so to a fixed point (ie. the list stops changing). After that, we check that no characters were lost (ie. that the final output is a permutation of the input), to guarantee a lossless transformation which changes only the ordering.
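
# The 'distance minimization' view of seriation described above can be illustrated without any LLM.
# A minimal sketch, assuming a caller-supplied `distance` function (hypothetical; not part of this script):
# brute-force search for the ordering which minimizes the summed distance between adjacent items.
# This is only feasible for tiny lists (n! orderings), and is not how this script seriates.
from itertools import permutations

def seriate_by_distance(items, distance):
    """Return the ordering of `items` which minimizes the total distance between neighbors."""
    def cost(ordering):
        return sum(distance(a, b) for a, b in zip(ordering, ordering[1:]))
    return list(min(permutations(items), key=cost))

# With a 1-dimensional distance, this recovers ordinary sorting (up to reversal), eg.
# `seriate_by_distance([3, 1, 2], lambda a, b: abs(a - b))` yields 1, 2, 3 or its reverse.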

import sys
from openai import OpenAI
client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

# Read the list to seriate from stdin, or else from the first command-line argument.
if len(sys.argv) == 1:
    target = sys.stdin.read().strip()
else:
    target = sys.argv[1]
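
# The fixed-point iteration described in the header could be sketched as below.
# Assumptions (not in the original script): `seriate_once` is a hypothetical callable which takes the list as a string & returns the re-ordered string (eg. a wrapper around the API call further down);
# `max_rounds` is a hypothetical safety cap, since an LLM is not guaranteed to ever converge.
def seriate_to_fixed_point(text, seriate_once, max_rounds=10):
    """Apply `seriate_once` repeatedly until the output stops changing (or the cap is hit)."""
    current = text
    for _ in range(max_rounds):
        revised = seriate_once(current).strip()
        if revised == current:
            break
        current = revised
    return current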

prompt = """
Task: [UPDATE THIS Clean website titles parsed from <title> tags.
If a title input is useless or meaningless or an error, print out the empty string `""` instead of the original title.
If the title can be fixed, remove the junk (spam, cruft, boilerplate) from the title.
Convert inline Markdown to HTML, like '*foo*' → '<em>foo</em>'
If the title looks good, then print out the original title.
If you are unsure how to fix it, then simply print out the original title. END UPDATE THIS]

Preview of input:
- """ + target + """\n

Task examples:

[UPDATE THIS
- Input title to clean: "Anton Seder’s *The Animal in Decorative Art* (1896)"
Anton Seder’s <em>The Animal in Decorative Art</em> (1896)
- Input title to clean: "If I Sleep for an Hour, 30 People Will Die - The New York Times"
If I Sleep for an Hour, 30 People Will Die
- Input title to clean: "404"
""
- Input title to clean: "Gwern.net | Collecting New Socks Efficiently"
Collecting New Socks Efficiently
- Input title to clean: "worlds of DAVID BRIN"
""
- Input title to clean: "steve yegge · The Amazon Memo"
The Amazon Memo
- Input title to clean: "index"
""
- Input title to clean: ""
""
END UPDATE THIS]

Task:

Input list to seriate:

- """ + target + "\n"

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are an editor reorganizing a rough draft list."},
        {"role": "user", "content": prompt},
    ],
)

print(completion.choices[0].message.content)
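
# A minimal sketch of the losslessness check described in the header (the output should be a permutation of the input).
# Assumptions (not in the original script): the helper name `is_permutation` is hypothetical, and whitespace is ignored, since re-ordering list items moves newlines around.
# A character-level check is necessary but not sufficient: 'Dog, Cat' & 'Cog, Dat' would pass it; comparing whole items would be stricter.
from collections import Counter

def is_permutation(original, seriated):
    """True if both strings contain exactly the same multiset of non-whitespace characters."""
    def count(s):
        return Counter(ch for ch in s if not ch.isspace())
    return count(original) == count(seriated)

# Possible usage: compare `target` against `completion.choices[0].message.content`, and reject the output if the check fails.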