Gwern.net Large File Support

The Gwern.net workflow/architecture now supports large-files (>200MB), and I’ve begun uploading some, particularly the old DNM Archives. This should make stuff more findable on the site & via search engines, and easier to download over HTTP rather than rsync etc.

It has always annoyed me that large-files were not hosted on Gwern.net like everything else, remaining a big exception in the workflows; similar to the new /blog/ feature, it feels nice to have that fixed with such an elegantly simple solution.


How does it work? Turns out to be quite simple.

The core problem is how to exclude large-files from version-control, while keeping everything else as normal as possible, in a low-risk way.

The storage & bandwidth were never the issue once I moved to Hetzner dedicated storage; the problem was Hakyll & git. Once you start checking in gigabyte-scale files, or copying them around using Hakyll, things can get catastrophically, and permanently, slow. And if you get it wrong, you are in serious trouble: unwinding hundreds of gigabytes of screwed-up VCS history is not easy (especially when it’s git). So I’ve always been hesitant to add files >100MB to Gwern.net proper, and have outsourced storage to file-hosts like Google Drive/Dropbox/Mega or to a separate rsync server. These solutions all have problems: they are clumsy, highly limited in terms of bandwidth/storage, susceptible to linkrot, do not permit easy automatic downloads in the shell or REPL, etc. I knew of various solutions like git-annex or Git LFS which claim to do… stuff… with git & large files, and which might help, but I had never used them before, and had no confidence that they’d be good long-term solutions or fit my use-case.

The first third of the solution was patching Hakyll to use symlinks instead of copying files. This was necessary anyway due to the sheer volume of files on Gwern.net, which made a copy-based compilation step incredibly slow and doubled the disk-space use. It also fixes large files, of course, as it takes no more time & space to symlink (and then rsync) a 10GB file than a 1KB file.
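
At the shell level, the effect is something like the following sketch (the paths are placeholders, and the real change lives inside Hakyll’s Haskell build code, not a shell script):

```bash
# Hypothetical sketch of the symlink-then-sync idea:
# instead of copying each source file into the compiled site, link to it,
# then let rsync dereference the links when uploading.
set -euo pipefail

SRC="doc/dnm-archive/archive.tar.xz"                # assumed example path
DST="_site/doc/dnm-archive/archive.tar.xz"

mkdir --parents "$(dirname "$DST")"
ln --symbolic --force "$(realpath "$SRC")" "$DST"   # O(1), whatever the file size

# --copy-links makes rsync upload the linked files' contents,
# so the webserver ends up serving real files, not dangling links:
rsync --archive --copy-links _site/ user@host:/var/www/gwern.net/
```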

The second third was adding new files: I switched to an upload convenience script, which handles a lot of minor details for me, like renaming files to be unique, copying them to the server immediately, and adding them to git.
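
A minimal sketch of what such a script might look like (the filenames, default directory, server path, and renaming rule are all assumptions, not the real upload implementation):

```bash
#!/bin/bash
# upload: hypothetical sketch of an upload convenience script.
# Usage: upload FILE [TARGET-DIRECTORY]
set -euo pipefail

FILE="$1"
DIR="${2:-doc/misc}"                      # default directory is an assumption

# Normalize the filename (lowercase, spaces to hyphens); a real script
# would also have to ensure uniqueness & handle many more edge-cases:
BASE="$(basename "$FILE" | tr 'A-Z ' 'a-z-')"
TARGET="$DIR/$BASE"

mv "$FILE" "$TARGET"

# Copy to the server immediately, so the URL works right away:
rsync --archive "$TARGET" "user@host:/var/www/gwern.net/$DIR/"

git add "$TARGET"
```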

The final third was to use .gitignore. Since my large-files are usually archival and rarely, if ever, change, they do not really need to be version-controlled. But if they live in the same directories as normal files, then git wants to control them. This can be fixed by telling .gitignore to skip particular files. One could do this file by file for each large-file added, but that would be tedious… except that by this point, I could simply modify upload to automatically add large-files to .gitignore & skip the git add step. And the Hakyll compilation & sync, being symlink-based, should have no performance problems even if I added 100GB of large-files. But I was still deeply unsure that this was workable: did it actually match my needs?
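
Continuing the hypothetical upload sketch above, the ignore step might look like this (the threshold and the $TARGET variable are assumptions carried over from that sketch):

```bash
# Hypothetical extension of the upload sketch: past a size threshold,
# the file is ignored by git instead of added to it.
MAX_GIT_SIZE=$((100 * 1024 * 1024))       # 100MB cutoff is an assumption

SIZE="$(stat --format=%s "$TARGET")"      # GNU stat; file size in bytes
if [ "$SIZE" -gt "$MAX_GIT_SIZE" ]; then
    echo "/$TARGET" >> .gitignore         # anchor the pattern to the repo root
    git add .gitignore                    # version the ignore rule, not the file
else
    git add "$TARGET"
fi
```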

After reading up on git-annex and Git LFS again for the first time in years, .gitignore still seemed like an elegant way to go, so I wrote up all my considerations and ran it past Claude-3.7, Deep Research, o1-pro, and GPT-4.5.

The consensus was unanimous: it really was that simple after all. The ‘ignore strategy’ probably was a good idea, and no LLM pointed out any serious drawbacks I was unaware of (besides the obvious one that the large-files just aren’t being versioned, because doing so would be too expensive).

I implemented the upload-ignore step and a few days ago began going through and uploading all my large-files, starting with the DNM Archives, but also including all the GPT-2 & StyleGAN checkpoints, training montages, occasional random archives like Zach-Like or the Disco Diffusion Artist Study Database, etc. Why not? It’s 2025; disk space & bandwidth are practically free. A 1GB archive is de nada. (People should get more dedicated servers, and have less of a scarcity mindset.)

It has required a few minor adjustments: some curl scripts need a --max-filesize safeguard, and we had to tweak our link-prefetching logic to avoid fetching files which are either too large or cannot be viewed in web-browsers, but other than that…
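
For example (the URL here is a placeholder), a download can be capped so the script fails fast rather than pulling down a multi-gigabyte file by accident:

```bash
# --max-filesize aborts the transfer when the limit is exceeded
# (curl exits with code 63) instead of silently downloading gigabytes:
curl --location --max-filesize 209715200 \
     --output example.tar.xz \
     'https://gwern.net/doc/example-large-file.tar.xz'   # 209715200 bytes = 200MB
```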

So far so good!
