“Towards Better RSS Feeds for Gwern.net”, Gwern 2024-07-30:

Is there a feed to the content that is published on the site itself?

RSS is not supported because I have been unhappy with past attempts to create an RSS feed; the usual blog-centric approach doesn’t work well with lots of incremental updates and I dropped the darcs/git → RSS bridge as it was nothing but noise for readers.

For perspective, here are the last 40 git patches to Gwern.net, to give you an idea of how miserably useless a reading experience the patches per se would be for an ordinary reader who just wants a list of ‘interesting’ updates or writings:

“+lns; begin catching up after trip” · “record all minor pending edits (1722141005)” · “lint; rescrape for date-range subscripts” · “lint” · “+commafy number pass” · “lint; +first comma-fying pass” · “lint” · “lint” · “lorem: inline: date subscripts: continue finetuning examples” · “+lns” · “second EN DASH pass” · “big EN DASH search-and-replace while working on the newly-enabled date-range subscripts” · “+lns, catch up on some comments” · “Crumb: +apposite comic that’s a bit of a meme for Crumb, and some details from an interview” · “lint” · “+lns; lint” · “embryo selection: split out ‘history of iterated embryo selection’/IES history to separate page because long enough & of independent interest” · “split out Twitter UX essay for easier linking in my Twitter DMs” · “anime reviews: factor out development hell” · “initialize a miscitation/miscite tag as a subset of publication-bias (statistics/bias/publication/miscitation)” · “RTX: +thumbnail in Midjourneyv6 based on ChatGPT-4o suggestion to use a ‘nano’ skull and crossbones” · “note: split out ‘Highly Potent Drugs As Psychological Warfare Weapons’ essay to /rtx due to length and to make more linkable” · “initialize disappearing polymorphs tag (science/chemistry/disappearing-polymorph) apropos of sudden burst of Twitter interest” · “initialize Stigler’s diet problem (Dantzig) statistics/decision/stigler-diet tag” · “+lns” · “fully re-paragraphized all outstanding abstracts” · “lint” · “record all minor pending edits (1720931406); rescrape abstracts to update thumbnails for split-out pages” · “GTX: added few-shot examples to paragraphizer.py & loosened length constraint, so rerun paragraphizer on the holdouts” · “+lns” · “split out The Tale of Princess Kaguya anime movie review to /review/princess-kaguya; re-enable Midjourney to test personalization & generate a good Kaguya Danse macabre thumbnail” · “lint” · “+lns” · “Timecrimes: finally thought up a thumbnail” · “lint” · “initialize truesight tag doc/statistics/stylometry/truesight/ for LLM-powered stylometrics/deanonymization” · “initialize mode collapse preference learning tag for AI slop/ChatGPTese infections in ChatGPT, DALL·E 3, Midjourney, etc (reinforcement-learning/preference-learning/mode-collapse)” · “lorem block: collapses: mismatch cases: rm unnecessary case that is ~impossible to write” · “lint” · “+lns; further new sync lint work”

So in the interest of getting it rolling how about starting small?

Until I have a better system, I don’t want to. It isn’t fun, I won’t learn anything from it, and it doesn’t excite me to half-ass an RSS feed which I know is an unsatisfactory solution to the problem. (And it would be a liability as well: an RSS feed is a promise to the reader, and has long-term consequences. I am still fixing spurious 404s from the old Gitit-style RSS feed I deleted a decade ago—once the URLs get out there and are linked, you can’t recall them.)

And the RSS libraries are a pain to work with because it’s all XML-based, which is one reason I never was able to do much with the old RSS feed. “You see a maze of twisty types, each alike…”

Meanwhile, I have plenty of other things to work on which are also useful to readers or myself, and which may help clarify what I want from an RSS feed. (As it happens, they did, as you can see above.) Sometimes, design just takes a while and you have to use the system ‘in anger’ for a few years until you can see what the logical next step is. When I killed the first RSS feed, I had no idea what would be a good replacement, which was not simply jamming a Wikipedia Recent Changes or a blog/journalism peg into the square hole of Gwern.net. Now I do.


You can probably turn the changelog page into an RSS feed fairly easily, and there is a Gwern flair on the subreddit which comes with an authenticated RSS feed. (The Compilation flair might also be of interest.) That is the closest thing right now to what you want. (And the new ‘author’ metadata+backlinks is halfway to an RSS feed—that is how the ‘Gwern’ section of /doc/newest/index is implemented—so you could probably scrape that too and RSS-fy it.)


After reading Namespace’s comments on blogging and my own thinking about multi-level design, I have been mulling over an approach to RSS feeds that I think can work well.

The problem with standard RSS feeds is that they are designed for one-off pages, like a newspaper or a blog: a URL gets created as a single standalone finished page, announced, and that’s that. (The URL contents will inevitably change, but these changes are not considered important.) But this is a poor fit for Gwern.net because I ‘finish’ major pages only once a month or so, and it is clear that people are interested in a more granular view of my writing than that. (I am always embarrassed when I see someone tweeting out or including in a newsletter a tweet or comment of mine—because it indicates a failure of curation on my part if they have to link those instead of a page on my site.)

The other extreme is that every single file modification is reported in the RSS feed, like the RSS feed for English Wikipedia’s Recent Changes. This is an equally poor fit, because the nature of Gwern.net is that there is a lot of small change constantly going on site-wide, particularly related to formatting or reorganization, of less than zero interest to readers, and which is why I killed the original RSS feed: I couldn’t even read it myself!

More broadly, in general, there are just no good ways to announce the full spectrum of changes from comments or tweets (recall the original name: ‘microblogging’) to shorter blog posts to longform essays/books. No one has done this, and most generally do not recognize this as any kind of problem, but there is a big ‘gap’ between each speed of service, going from chat to micro-blog to blog/newsletter to essay to book.

So writers online tend to pigeonhole themselves: someone will tweet a lot, or they will instead write a lot of blog posts, or they will periodically write a long effort-post. When they engage in multiple time-scales, usually, one ‘wins’ and the others are a ‘waste’ in the sense that they get abandoned: either the author stops using them, or the content there gets ‘stranded’. Most writers simply accept this, and chop their writing down to fit their particular Procrustean bed: Scott Alexander writes solely in blog-posts & blogpost comments (having largely given up on Twitter, Tumblr, and LW/Reddit), even though if you wanted to know his big take on, say, ‘predictive processing’ as the theory of everything in AI/psychiatry, the best he can do is shrug and point you to 30+ blog posts scattered across at least 3 sites going back a decade (LW, SSC, & ACX); and Matt Levine has to repeat himself, again and again, every time a specific recurring drama comes up in finance.

If you don’t want that to happen (maybe you don’t love to hear yourself write the same thing again and again quite as much as Mencius Moldbug does and don’t have the endurance of a Matt Levine) and don’t want to become solely a poaster or to have your secondary writings become so much water under the bridge, your only option is to do a lot of tedious work copying back and forth: I try to monthly review my tweets & comments and pull stuff onto Gwern.net, but it is a lot of work and I often think that I am leaving behind a lot of stuff, even when I do manage to do the review. It is clearly not very sustainable.

And if I wanted to summarize it at multiple levels (like a level in between a list of tweets and a full essay, or an annual level), when would I ever actually do something like write or think or live?

I think this is a major reason for the death of blogging and the increasing rarity of non-blog homepage sites like Gwern.net: the friction of multiple writing places means you are constantly being sucked into just one and tempted to abandon the others; and the rewards of social media tend to win out. You may be able to maintain dual-posting for a while, but at some point a shock happens, and when you return, you have a backlog and never catch up and settle for one. (This is expected from a queuing theory perspective if the friction is heavily overloading you: at some point things will break down catastrophically, and since you aren’t writing all that many words per day, which would be easy to copy-paste in total, it must be everything else, the friction/overhead.) And once you are writing solely on Twitter or Facebook or whatever, it’s hard to ever escape with your stuff to your own website, despite the perpetual amnesia / eternal now of the micro-blogs.

(The social media sites don’t even need to be hostile for this to happen. It’s just the constant trivial inconvenience and toil.)


So, how do I solve this broad problem of packaging up my writing in logical units spanning the continuum from single tweets to ‘best essay of the year’?

After several years of the annotation popup system and watching LLMs become effectively superhuman at summarization & resolving major limitations like short context windows / cost, I think I can propose a design:

you provide all the levels, without the toil, by starting with a link-centric approach where every comment or tweet or essay or reference is a URL with metadata, and using LLM recursive summarization to fill in the gaps. These different levels can then be exposed to readers as separate RSS feeds.

So the workflow would look like this: every comment is copied by the backend, with its data & metadata like author/date and an LLM auto-title/summary; sets of related comments like a tweet thread get grouped and summarized together as a whole; updates to blog posts or essays likewise get grouped and summarized; finally, whole new posts/essays get their own entries (with a handwritten summary, or again an LLM-written one).

Once this has been set up, as the author, I go around tweeting or Redditing, possibly replying to my own comments repeatedly as the muse demands, and my tweets all get logged and saved automatically; a reader interested in the blow-by-blow can subscribe to the most atomic RSS feed, and read each one; a reader with less time to spare reads the grouped comment summary RSS feed; or they can read the whole essay that I eventually sit down and write with proper references (starting from the grouped-comment summary as a quick-and-dirty outline to help me get started). But there is almost no friction which stops my comments from percolating up through my site, from individual atomic comments to short summaries to finished writings, and readers can pick what level they want to read at, rather than reading a one-size-fits-all-but-suits-none ‘most recent’ RSS feed. (cf. Kicks Condor’s “Fraidycat” with its fixed allocation per feed.)
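As a rough sketch of the levels described above (not the actual backend): the snippet below groups atomic comments by thread, summarizes each thread, and then summarizes the summaries, yielding one list of entries per feed. The field names `thread` and `text`, the feed names, and the `summarize` callable (a stand-in for an LLM call) are all assumptions made for illustration.

```python
def group_threads(atoms):
    """Group atomic comments by thread id, preserving first-seen order."""
    threads = {}
    for atom in atoms:
        threads.setdefault(atom["thread"], []).append(atom)
    return threads


def build_levels(atoms, summarize):
    """Return a dict of feed name -> entries, from most to least granular.

    `summarize` is any callable taking a list of strings and returning one
    string; in practice it would wrap an LLM call (hypothetical here).
    """
    threads = group_threads(atoms)
    # Level 2: one summary per thread of related comments.
    thread_entries = [
        {"thread": tid, "summary": summarize([a["text"] for a in group])}
        for tid, group in threads.items()
    ]
    # Level 3: a summary of the thread summaries (the recursive step).
    overview = summarize([t["summary"] for t in thread_entries])
    return {
        "atomic": list(atoms),
        "threads": thread_entries,
        "overview": [{"summary": overview}],
    }
```

Each key of the returned dict would back its own RSS feed, so a reader picks a level by picking a feed; further levels (monthly, annual) would repeat the same summarize-the-summaries step.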


For Gwern.net specifically, I would start with the annotation system as the ‘atomic’ level and try to build up from there.

This is possible now that I have finally overhauled the backend to support additional metadata on annotations, like the critical ‘date modified’ field (necessary for /doc/newest/index).

With meaningful ‘last-modified’ vs ‘date-created’ metadata on all pages+annotations, I can now populate a sane RSS feed with both newly-created & recently-modified items, and these items can be both my essays & any new links I bookmark or annotations created.
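A minimal sketch of how the two dates could drive feed selection, assuming each page or annotation is available as a record with `created` and `modified` dates. The `Item` type, the 30-day window, and the `"new"`/`"updated"` labels are illustrative assumptions, not the site's actual schema.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Item:
    url: str        # essay or annotation URL
    title: str
    created: date   # 'date-created' metadata
    modified: date  # 'last-modified' metadata


def feed_candidates(items, today, window_days=30):
    """Return (status, item) pairs for items created or modified in the window."""
    cutoff = today - timedelta(days=window_days)
    selected = []
    for it in items:
        if it.created >= cutoff:
            selected.append(("new", it))       # never in the feed before
        elif it.modified >= cutoff:
            selected.append(("updated", it))   # old item, recently touched
    # Most recent activity first.
    selected.sort(key=lambda pair: max(pair[1].created, pair[1].modified),
                  reverse=True)
    return selected
```

The `"new"` vs `"updated"` distinction matters downstream: a new item can reuse its abstract as the feed entry, while an updated one needs some description of what changed.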

So the idea is that reading the RSS is like a link feed with essay updates once in a while. It might go something like this:

“Golden Gate Bridge WP article / SF city WP article / annotation (suicide study) / annotation (poem) / weekly batch list of miscellaneous URLs & image uploads / ’Movie Reviews: +review of The Bridge 2006’ / annotation (Arxiv) / annotation (Arxiv) / annotation (Arxiv) / annotation (Arxiv) / ’Research Ideas: free play for RL exploration’ / annotation (link) / annotation (link) / annotation (link) / weekly batch list of miscellaneous links / annotation (link) / annotation (link) / annotation (link) / …”

‘Full’ annotations & essays get a separate entry, while the shorter ‘partial’ annotations (which have only a little metadata like a title or a tag) get rolled up into a single large weekly item which can be skipped or skimmed.
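The full/partial split could be sketched roughly as follows. The rule that an annotation is ‘full’ when it has a non-empty abstract, the dict field names, and the batch title format are assumptions for illustration; the real criterion may differ.

```python
from collections import defaultdict
from datetime import date


def is_full(annotation):
    """Hypothetical rule: a non-empty abstract makes an annotation 'full'."""
    return bool(annotation.get("abstract"))


def build_entries(annotations):
    """Split annotations into standalone entries and one batch entry per ISO week."""
    standalone = []
    weekly = defaultdict(list)
    for a in annotations:
        if is_full(a):
            standalone.append(a)
        else:
            year, week, _ = a["modified"].isocalendar()
            weekly[(year, week)].append(a)
    batches = [
        {"title": f"Miscellaneous links, {year}-W{week:02d}",
         "links": [a["url"] for a in items]}
        for (year, week), items in sorted(weekly.items())
    ]
    return standalone, batches
```

Grouping by ISO week keeps the batch boundaries stable, so regenerating the feed does not reshuffle older batch entries.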

Because the annotations are just static HTML snippets already, they can be easily put into the RSS feed itself, to allow a preview (even if that obviously wouldn’t support all the on-site features; they will link to the essay or the first tag-directory entry, so the RSS entry for https://www.theinformation.com/articles/openai-removes-ai-safety-leader-m-dry-a-onetime-ally-of-ceo-altman wouldn’t link to TI but to its current tag-directory entry).
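To illustrate embedding a static HTML snippet in a feed item, here is a small sketch using Python's standard `xml.etree.ElementTree`. The function names and the example arguments are assumptions; the point is only that the snippet goes into `<description>` as escaped text, and that `link` is set to the on-site target (essay or tag-directory entry) rather than the external URL.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def rss_item(title, link, html_snippet, modified):
    """Build one RSS 2.0 <item>; `link` should be the on-site target."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    # RSS 2.0 dates use the RFC 822 style, which format_datetime emits.
    ET.SubElement(item, "pubDate").text = format_datetime(modified)
    # ElementTree escapes the markup on serialization, so the HTML
    # travels inside <description> as escaped text.
    ET.SubElement(item, "description").text = html_snippet
    return item


def rss_feed(title, link, description, items):
    """Wrap items in a minimal RSS 2.0 channel and serialize to a string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    channel.extend(items)
    return ET.tostring(rss, encoding="unicode")
```

Feed readers unescape the description and render it as HTML, which is what gives the preview; anything relying on the site's own JavaScript or CSS would not carry over.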

The main problem with this is that if an essay like a review is included each time its last-modified changes, the entry is not too useful: the annotation, which contains the page abstract, will likely still be the same and not mention whatever is changed. So you might see an essay pop up a dozen times in this RSS feed without knowing what changed.

I could write a manual diff, but that is exactly the sort of “toil” I am trying to avoid on Gwern.net because it is unsustainable in the long run & such overhead unconsciously discourages a writer. (You feel like you are being punished—you wrote something, and your reward for a job well done is… being required to write even more? Not fun.) It is also difficult to do any kind of labeling of importance of a patch upfront: a good essay update might be composed of dozens of patches, each trivial on their own; indeed, I might not realize where something is going until after the writing is all done, as I explore a topic or people respond or I dig up new sources or have a sudden realization—“completeness” is something that can be known only in retrospect, reviewing changes.

But with the date-range and the git history and progress in LLM context windows, this can now be automated. When an essay’s last-modified indicates that it should get an RSS feed item and the date-created indicates that this is an ‘old’ essay which has been recently modified rather than a new essay which has never been in the RSS feed before, the RSS-generating code can skip the hand-written abstract and generate a summary of the changes instead. To do this, call git on the essay’s Markdown source file for the last month of patches, extract the patch summaries & even the patches themselves if necessary, and feed them into an LLM like GPT-4o-mini to get back a consolidated description. (With context windows like 128k, I can easily feed in some examples to few-shot the task and still have room for big sets of diffs.)
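The git-to-summary step might look something like the sketch below. It shells out to `git log` for the commit subjects touching one file, then assembles a few-shot prompt. The prompt wording, the function names, and the `llm` callable (any function from prompt string to summary string) are assumptions; no particular LLM API is implied.

```python
import subprocess


def recent_patches(path, since="1 month ago", repo="."):
    """Return commit subject lines touching `path` since the given date."""
    result = subprocess.run(
        ["git", "-C", repo, "log", f"--since={since}", "--format=%s", "--", path],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


def build_prompt(title, subjects, examples=()):
    """Assemble a few-shot prompt; `examples` is a sequence of
    (list_of_subjects, summary) pairs."""
    lines = [
        "Summarize the substantive changes to this essay for RSS readers.",
        "Ignore formatting, linting, spellcheck, and link-maintenance commits.",
        "",
    ]
    for ex_subjects, ex_summary in examples:
        lines.append("Patches:")
        lines.extend(f"- {s}" for s in ex_subjects)
        lines.append(f"Summary: {ex_summary}")
        lines.append("")
    lines.append(f"Essay: {title}")
    lines.append("Patches:")
    lines.extend(f"- {s}" for s in subjects)
    lines.append("Summary:")
    return "\n".join(lines)


def summarize_changes(title, subjects, llm, examples=()):
    """Return the LLM's change summary, or None if nothing was patched."""
    if not subjects:
        return None
    return llm(build_prompt(title, subjects, examples))
```

If the subject lines alone prove too terse, adding `-p` to the `git log` call would include the diffs themselves, at the cost of a much longer prompt.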

The LLM will know to not bother mentioning the massive churn on Gwern.net like spellcheck or linkrot fixing, and will summarize it on a more semantic level for readers.

And you can generalize this idea further, and start to meet Namespace’s challenge for blogging software that can go from tweets to posts: right now, there is no good way to go from writing in tweet-sized chunks to writing longform essays. So many people who could have written blog posts or even books wind up trapped in long tweet-threads (or, given the hostility of Twitter these days to any kind of serious intellectual work, like hiding tweets from non-logged-in users & penalizing tweets with external links, not writing anything at all), which never leave Twitter and are impossible to find or browse sanely. But with LLMs, you can fix that: the human writer writes in tweets or comments or sections as the muse moves them, and the LLM can consolidate them into progressively larger chunks, culminating in whole essays, and the human writer can polish them up and finalize them. (After all, when it comes to writing, many people find writing very easy, as demonstrated by their ability to write tens of thousands of words on Twitter or Discord or Reddit or IRC; it’s the editing it all together that destroys them with the tedium and fear of criticism, and they just never get started.) Each of these can be a separate RSS feed: one feed for atomic writing like tweets, one feed for the next level up (tweet-threads, sections?), one feed for the next level up (essays?), and so on.

Then a new reader can easily catch up on the backlog: simply read the essays, and then drop down to the level of granularity they have the time & interest for. (A big fan will of course read the most granular tweet-level daily feed, but others will prefer a higher level like weekly summaries, or even monthly essay-sized outputs.)

Obviously one can support multiple RSS feeds—once the paradigm has been sorted out and I’ve decided what I even want from RSS feeds in the first place. (The whole popup/annotation system is partially motivated by the goal of making updates more meaningful & granular and organizing references.)

Beyond the master/site-wide/firehose RSS feed which includes all annotations/essays and the weekly miscellaneous batch entry (something like /site.rss vs /site-essays.rss vs /site-links.rss), you would want an RSS feed for each tag-directory which filtered to just items with that tag (eg. /doc/ai/index.rss), and you would want per-page RSS feeds following the existing convention of a file-extension suffix (so /foo is the essay, which you know because it has no period and the Gwern.net convention is that essays never have periods & files always have periods, and then /foo.rss would be the RSS feed for exclusively that essay), and you could easily split the firehose feed into essays-only/annotations-only.
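The URL convention above is simple enough to sketch directly. The helper names are hypothetical, and the tag filter assumes each item carries a list of tag strings and matches tags exactly (whether sub-tags should also match is left open).

```python
import posixpath


def feed_url(path):
    """Map a page path to its feed path by appending '.rss'.

    Relies on the stated convention that essays (and tag-directory
    indexes) have no period in their final path segment, while files do.
    """
    trimmed = path.rstrip("/")
    name = posixpath.basename(trimmed)
    if not name:
        raise ValueError("no page name in path")
    if "." in name:
        raise ValueError(f"{path!r} looks like a file, not a page")
    return trimmed + ".rss"


def items_for_tag(items, tag):
    """Keep only items carrying exactly this tag."""
    return [it for it in items if tag in it["tags"]]
```

For example, `feed_url("/foo")` gives `/foo.rss` and `feed_url("/doc/ai/index")` gives `/doc/ai/index.rss`, while a path ending in something like `.pdf` is rejected.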

And then in the ultimate evolution of this, the writer just writes atomic bits without any editing, and the LLM takes care of adding it to an ever-enlarging corpus and expanding it as appropriate, and then summarizing it for the writer & readers to review/read, and updating it based on feedback. (See also “Nenex”.)

(One might wonder how to present this outside the RSS feed context, but that’s straightforward, especially when you have transclusions + collapses. Like the other things, you just generate each ‘level’ statically, at compile-time, and then you can present them to the reader as a series of collapses+transcludes. You can present them as simply a flat list or as a recursive set of collapses from small to large, or however you wish. Just rearrange a few links/div-wrappers as you please.)