Archiving the Web, because nothing lasts forever: statistics, online archive services, extracting URLs automatically from browsers, and creating a daemon to regularly back up URLs to multiple sources.
Links on the Internet last forever or a year, whichever inconveniences you more. This is a major problem for anyone serious about writing with good references, as link rot will cripple several percent of all links each year, and compounding.
To deal with link rot, I present my multi-pronged archival strategy using a combination of scripts, daemons, and Internet archival services: URLs are regularly dumped from both my web browser’s daily browsing and my website pages into an archival daemon I wrote, which pre-emptively downloads copies locally and attempts to archive them in the Internet Archive. This ensures a copy will be available indefinitely from one of several sources. Link rot is then detected by regular runs of
linkchecker
, and any newly dead links can be immediately checked for alternative locations, or restored from one of the archive sources.As an additional flourish, my local archives are efficiently cryptographically timestamped using Bitcoin in case forgery is a concern, and I demonstrate a simple compression trick for substantially reducing sizes of large web archives such as crawls (particularly useful for repeated crawls such as my DNM archives).
Given my interest in long term content and extensive linking, link rot is an issue of deep concern to me. I need backups not just for my files1, but for the web pages I read and use—they’re all part of my exomind. It’s not much good to have an extensive essay on some topic where half the links are dead and the reader can neither verify my claims nor get context for my claims.
Link Rot
Decay is inherent in all compound things. Work out your own salvation with diligence.
Last words of the Buddha
The dimension of digital decay is dismal and distressing. Wikipedia:
In a 2003 experiment, et al 2003 discovered that about one link out of every 200 disappeared each week from the Internet. et al 2005 discovered that half of the URLs cited in D-Lib Magazine articles were no longer accessible 10 years after publication [the irony!], and other studies have shown link rot in academic literature to be even worse (Spinellis, 2003, et al 2001 ). 2002 examined link rot in digital libraries and found that about 3% of the objects were no longer accessible after one year.
Bruce Schneier remarks that one friend experienced 50% linkrot in one of his pages over less than 9 years (not that the situation was any better in 1998), and that his own blog posts link to news articles that go dead in days2; Vitorio checks bookmarks from 1997, finding that hand-checking indicates a total link rot of 91% with only half of the dead available in sources like the Internet Archive; Ernie Smith found 1 semi-working link in a 1994 book about the Internet; the Internet Archive itself has estimated the average lifespan of a Web page at 100 days. A Science study looked at articles in prestigious journals; they didn’t use many Internet links, but when they did, 2 years later ~13% were dead3. The French company Linterweb studied external links on the French Wikipedia before setting up their cache of French external links, and found—back in 2008—already 5% were dead. (The English Wikipedia has seen a 2010–January 2011 spike from a few thousand dead links to ~110,000 out of ~17.5m live links.) A followup check of the viral The Million Dollar Homepage, where it cost up to $53.9$38.02006k to insert a link by the last insertion on January 2006, found that a decade later in 2017, at least half the links were dead or squatted.4 Bookmarking website Pinboard, which provides some archiving services, noted in August 2014 that 17% of 3-year-old links & 25% of 5-year-old links were dead. The dismal studies just go on and on and on (and on). Even in a highly stable, funded, curated environment, link rot happens anyway. For example, about 11% of Arab Spring-related tweets were gone within a year (even though Twitter is—currently—still around). And sometimes they just get quietly lost, like when MySpace admitted it lost all music worldwide uploaded 2003–2015, euphemistically describing the mass deletion as “We completely rebuilt MySpace and decided to move over some of your content from the old MySpace” (only some of the 2008–2010 MySpace music could be rescued & put on the IA by “an anonymous academic study”).
My specific target date is 2070, 60 years from now. As of 2011-03-10, Gwern.net has around 6,800 external links (with around 2,200 to non-Wikipedia websites)5. Even at the lowest estimate of 3% annual linkrot, few will survive to 2070. If each link has a 97% chance of surviving each year, then the chance a link will be alive in 2070 is 0.972070−2011 ≈ 0.16 (or to put it another way, an 84% chance any given link will die). The 95% confidence interval for such a binomial distribution says that of the 2,200 non-Wikipedia links, ~336–394 will survive to 20706. If we try to predict using a more reasonable estimate of 50% linkrot, then an average of 0 links will survive (0.502070-2011 × 2,200 = 1.735 × 10-16 × 2,200 ≈ 0). It would be a good idea to simply assume that no link will survive.
With that in mind, one can consider remedies. (If we lie to ourselves and say it won’t be a problem in the future, then we guarantee that it will be a problem. “People can stand what is true, for they are already enduring it.”)
If you want to pre-emptively archive a specific page, that is easy: go to the IA and archive it, or in your web browser, print a PDF of it, or for more complex pages, use a browser plugin (eg. ScrapBook). But I find I visit and link so many web pages that I rarely think in advance of link rot to save a copy; what I need is a more systematic approach to detect and create links for all web pages I might need.
Detection
With every new spring
the blossoms speak not a word
yet expound the Law—
knowing what is at its heart
by the scattering storm winds.Shōtetsu7
The first remedy is to learn about broken links as soon as they happen, which allows one to react quickly and scrape archives or search engine caches (‘lazy preservation’). I currently use linkchecker
to spider Gwern.net looking for broken links. linkchecker
is run in a cron job like so:
@monthly linkchecker --check-extern --timeout=35 --no-warnings --file-output=html \
--ignore-url=^mailto --ignore-url=^irc --ignore-url=http://.*\.onion \
--ignore-url=paypal.com --ignore-url=web.archive.org \
https://gwern.net
Just this command would turn up many false positives. For example, there would be several hundred warnings on Wikipedia links because I link to redirects; and linkchecker
respects robots.txts which forbid it to check liveness, but emits a warning about this. These can be suppressed by editing ~/.linkchecker/linkcheckerrc
to say ignorewarnings=http-moved-permanent,http-robots-denied
(the available warning classes are listed in linkchecker -h
).
The quicker you know about a dead link, the sooner you can look for replacements or its new home.
Prevention
Remote Caching
Anything you post on the internet will be there as long as it’s embarrassing and gone as soon as it would be useful.
We can ask a third party to keep a cache for us. There are several archive site possibilities:
- the Internet Archive
- WebCite
- Perma.cc (highly limited; has lost some archives)
- Linterweb’s WikiWix8.
- Peeep.us (defunct as of 2018)
- Archive.is
- Pinboard (with the $22/year archiving option9)
There are other options but they are not available like Google10 or various commercial/government archives11
(An example would be bits.blogs.nytimes.com/2010/12/07/palm-is-far-from-game-over-says-former-chief/
being archived at webcitation.org/5ur7ifr12
.)
These archives are also good for archiving your own website:
- you may be keeping backups of it, but your own website/server backups can be lost (I can speak from personal experience here), so it’s good to have external copies
- Another benefit is the reduction in ‘bus-factor’: if you were hit by a bus tomorrow, who would get your archives and be able to maintain the websites and understand the backups etc? While if archived in IA, people already know how to get copies and there are tools to download entire domains.
- A focus on backing up only one’s website can blind one to the need for archiving the external links as well. Many pages are meaningless or less valuable with broken links. A linkchecker script/daemon can also archive all the external links.
So there are several benefits to doing web archiving beyond simple server backups.
My first program in this vein of thought was a bot which fired off WebCite, Internet Archive/Alexa, & Archive.is requests: Wikipedia Archiving Bot, quickly followed up by a RSS version. (Or you could install the Alexa Toolbar to get automatic submission to the Internet Archive, if you have ceased to care about privacy.)
The core code was quickly adapted into a gitit wiki plugin which hooked into the save-page functionality and tried to archive every link in the newly-modified page, Interwiki.hs
Finally, I wrote archiver, a daemon which watches12/reads a text file. Source is available via git clone https://github.com/gwern/archiver-bot.git
. (A similar tool is Archiveror; the Python package archivenow
does something similar & as of January 2021 is probably better.)
The library half of archiver
is a simple wrapper around the appropriate HTTP requests; the executable half reads a specified text file and loops as it (slowly) fires off requests and deletes the appropriate URL.
That is, archiver
is a daemon which will process a specified text file, each line of which is a URL, and will one by one request that the URLs be archived or spidered
Usage of archiver
might look like archiver ~/.urls.txt gwern@gwern.net
. In the past, archiver
would sometimes crash for unknown reasons, so I usually wrap it in a while
loop like so: while true; do archiver ~/.urls.txt gwern@gwern.net; done
. If I wanted to put it in a detached GNU screen session: screen -d -m -S "archiver" sh -c 'while true; do archiver ~/.urls.txt gwern@gwern.net; done'
. Finally, rather than start it manually, I use a cron job to start it at boot, for a final invocation of
@reboot sleep 4m && screen -d -m -S "archiver" sh -c 'while true; do archiver ~/.urls.txt gwern2@gwern.net \
"cd ~/www && nice -n 20 ionice -c3 wget --unlink --limit-rate=20k --page-requisites --timestamping \
-e robots=off --reject .iso,.exe,.gz,.xz,.rar,.7z,.tar,.bin,.zip,.jar,.flv,.mp4,.avi,.webm \
--user-agent='Firefox/4.9'" 500; done'
Local Caching
Remote archiving, while convenient, has a major flaw: the archive services cannot keep up with the growth of the Internet and are woefully incomplete. I experience this regularly, where a link on Gwern.net goes dead and I cannot find it in the Internet Archive or WebCite, and it is a general phenomenon: et al 2012 find <35% of common Web pages ever copied into an archive service, and typically only one copy exists.
Caching Proxy
The most ambitious & total approach to local caching is to set up a proxy to do your browsing through, and record literally all your web traffic; for example, using Live Archiving Proxy (LAP) or WarcProxy which will save as WARC files every page you visit through it. (Zachary Vance explains how to set up a local HTTPS certificate to MITM your HTTPS browsing as well.)
One may be reluctant to go this far, and prefer something lighter-weight, such as periodically extracting a list of visited URLs from one’s web browser and then attempting to archive them.
Batch Job Downloads
For a while, I used a shell script named, imaginatively enough, local-archiver
:
#!/bin/sh
set -euo pipefail
cp `find ~/.mozilla/ -name "places.sqlite"` ~/
sqlite3 places.sqlite "SELECT url FROM moz_places, moz_historyvisits \
WHERE moz_places.id = moz_historyvisits.place_id \
and visit_date > strftime('%s','now','-1.5 month')*1000000 ORDER by \
visit_date;" | filter-urls >> ~/.tmp
rm ~/places.sqlite
split -l500 ~/.tmp ~/.tmp-urls
rm ~/.tmp
cd ~/www/
for file in ~/.tmp-urls*;
do (wget --unlink --continue --page-requisites --timestamping --input-file $file && rm $file &);
done
find ~/www -size +4M -delete
The code is not the prettiest, but it’s fairly straightforward:
-
the script grabs my Firefox browsing history by extracting it from the history SQL database file13, and feeds the URLs into wget.
wget
is not the best tool for archiving as it will not run JavaScript or Flash or download videos etc. It will download included JS files but the JS will be obsolete when run in the future and any dynamic content will be long gone. To do better would require a headless browser like PhantomJS which saves to MHT/MHTML, but PhantomJS refuses to support it and I’m not aware of an existing package to do this. In practice, static content is what is most important to archive, most JS is of highly questionable value in the first place, and any important YouTube videos can be archived manually withyoutube-dl
, sowget
’s limitations haven’t been so bad. -
The script
split
s the long list of URLs into a bunch of files and runs that manywget
s in parallel becausewget
apparently has no way of simultaneously downloading from multiple domains. There’s also the chance ofwget
hanging indefinitely, so parallel downloads continues to make progress. -
The
filter-urls
command is another shell script, which removes URLs I don’t want archived. This script is a hack which looks like this:#!/bin/sh set -euo pipefail cat /dev/stdin | sed -e "s/#.*//" | sed -e "s/&sid=.*$//" | sed -e "s/\/$//" | grep -v -e 4chan -e reddit ...
-
delete any particularly large (>4MB) files which might be media files like videos or audios (podcasts are particular offenders)
A local copy is not the best resource—what if a link goes dead in a way your tool cannot detect so you don’t know to put up your copy somewhere? But it solves the problem decisively.
The downside of this script’s batch approach soon became apparent to me:
- not automatic: you have to remember to invoke it and it only provides a single local archive, or if you invoke it regularly as a cron job, you may create lots of duplicates.
- unreliable:
wget
may hang, URLs may be archived too late, it may not be invoked frequently enough, >4MB non-video/audio files are increasingly common… - I wanted copies in the Internet Archive & elsewhere as well to let other people benefit and provide redundancy to my local archive
It was to fix these problems that I began working on archiver
—which would run constantly archiving URLs in the background, archive them into the IA as well, and be smarter about media file downloads. It has been much more satisfactory.
Daemon
archiver
has an extra feature where any third argument is treated as an arbitrary sh
command to run after each URL is archived, to which is appended said URL. You might use this feature if you wanted to load each URL into Firefox, or append them to a log file, or simply download or archive the URL in some other way.
For example, instead of a big local-archiver
run, I have archiver
run wget
on each individual URL: screen -d -m -S "archiver" sh -c 'while true; do archiver ~/.urls.txt gwern@gwern.net "cd ~/www && wget --unlink --continue --page-requisites --timestamping -e robots=off --reject .iso,.exe,.gz,.xz,.rar,.7z,.tar,.bin,.zip,.jar,.flv,.mp4,.avi,.webm --user-agent='Firefox/3.6' 120"; done'
. (For private URLs which require logins, such as darknet markets, wget
can still grab them with some help: installing the Firefox extension Export Cookies, logging into the site in Firefox like usual, exporting one’s cookies.txt
, and adding the option --load-cookies cookies.txt
to give it access to the cookies.)
Alternately, you might use curl
or a specialized archive downloader like the Internet Archive’s crawler Heritrix.
Cryptographic Timestamping Local Archives
We may want cryptographic timestamping to prove that we created a file or archive at a particular date and have not since altered it. Using a timestamping service’s API, I’ve written 2 shell scripts which implement downloading (wget-archive
) and timestamping strings or files (timestamp
). With these scripts, extending the archive bot is as simple as changing the shell command:
@reboot sleep 4m && screen -d -m -S "archiver" sh -c 'while true; do archiver ~/.urls.txt gwern2@gwern.net \
"wget-archive" 200; done'
Now every URL we download is automatically cryptographically timestamped with ~1-day resolution for free.
Resource Consumption
The space consumed by such a backup is not that bad; only 30–50 gigabytes for a year of browsing, and less depending on how hard you prune the downloads. (More, of course, if you use linkchecker
to archive entire sites and not just the pages you visit.) Storing this is quite viable in the long term; while page sizes have increased 7× between 2003 and 2011 and pages average around 400kb14, Kryder’s law has also been operating and has increased disk capacity by ~128×—in 2011, $107.6$80.02011 will buy you at least 2 terabytes, that works out to 4 cents a gigabyte or 80 cents for the low estimate for downloads; that is much better than the annual fee that somewhere like Pinboard charges. Of course, you need to back this up yourself. We’re relatively fortunate here—most Internet documents are ‘born digital’ and easy to migrate to new formats or inspect in the future. We can download them and worry about how to view them only when we need a particular document, and Web browser backwards-compatibility already stretches back to files written in the early 1990s. (Of course, we’re probably screwed if we discover the content we wanted was dynamically presented only in Adobe Flash or as an inaccessible ‘cloud’ service.) In contrast, if we were trying to preserve programs or software libraries instead, we would face a much more formidable task in keeping a working ladder of binary-compatible virtual machines or interpreters15. The situation with digital movie preservation hardly bears thinking on.
There are ways to cut down on the size; if you tar it all up and run 7-Zip with maximum compression options, you could probably compact it to 1⁄5th the size. I found that the uncompressed files could be reduced by around 10% by using fdupes to look for duplicate files and turning the duplicates into a space-saving hard link to the original with a command like fdupes --recurse --hardlink ~/www/
. (Apparently there are a lot of bit-identical JavaScript (eg. JQuery) and images out there.)
Good filtering of URL sources can help reduce URL archiving count by a large amount. Examining my manual backups of Firefox browsing history, over the 1153 days from 2014-02-25 to 2017-04-22, I visited 2,370,111 URLs or 2055 URLs per day; after passing through my filtering script, that leaves 171,446 URLs, which after de-duplication yields 39,523 URLs or ~34 unique URLs per day or 12,520 unique URLs per year to archive.
This shrunk my archive by 9GB from 65GB to 56GB, although at the cost of some archiving fidelity by removing many filetypes like CSS or JavaScript or GIF images. As of 2017-04-22, after ~6 years of archiving, between xz
compression (at the cost of easy searchability), aggressive filtering, occasional manual deletion of overly bulky domains I feel are probably adequately covered in the IA etc, my full WWW archives weigh 55GB.
URL Sources
Browser History
There are a number of ways to populate the source text file. For example, I have a script firefox-urls
:
#!/bin/sh
set -euo pipefail
cp --force `find ~/.mozilla/firefox/ -name "places.sqlite"|sort|head -1` ~/
sqlite3 -batch places.sqlite "SELECT url FROM moz_places, moz_historyvisits \
WHERE moz_places.id = moz_historyvisits.place_id and \
visit_date > strftime('%s','now','-1 day')*1000000 ORDER by \
visit_date;" | filter-urls
rm ~/places.sqlite
(filter-urls
is the same script as in local-archiver
. If I don’t want a domain locally, I’m not going to bother with remote backups either. In fact, because of WebCite’s rate-limiting, archiver
is almost perpetually back-logged, and I especially don’t want it wasting time on worthless links like 4chan.)
This is called every hour by cron
:
@hourly firefox-urls >> ~/.urls.txt
This gets all visited URLs in the last time period and prints them out to the file for archiver to process. Hence, everything I browse is backed-up through archiver
.
Non-Firefox browsers can be supported with similar strategies; for example, Zachary Vance’s Chromium scripts likewise extracts URLs from Chromium’s SQL history & bookmarks.
Document Links
More useful perhaps is a script to extract external links from Markdown files and print them to standard out: link-extractor.hs
So now I can take find . -name "*.page"
, pass the 100 or so Markdown files in my wiki as arguments, and add the thousand or so external links to the archiver queue (eg. find . -name "*.page" -type f -print0 | xargs -0 ~/wiki/static/build/link-extractor.hs | filter-urls >> ~/.urls.txt
); they will eventually be archived/backed up.
Website Spidering
Sometimes a particular website is of long-term interest to one even if one has not visited every page on it; one could manually visit them and rely on the previous Firefox script to dump the URLs into archiver
but this isn’t always practical or time-efficient. linkchecker
inherently spiders the websites it is turned upon, so it’s not a surprise that it can build a site map or simply spit out all URLs on a domain; unfortunately, while linkchecker
has the ability to output in a remarkable variety of formats, it cannot simply output a newline-delimited list of URLs, so we need to post-process the output considerably. The following is the shell one-liner I use when I want to archive an entire site (note that this is a bad command to run on a large or heavily hyper-linked site like the English Wikipedia or LessWrong!); edit the target domain as necessary:
linkchecker --check-extern -odot --complete -v --ignore-url=^mailto --no-warnings http://www.longbets.org
| grep -F http #
| grep -F -v -e "label=" -e "->" -e '" [' -e '" ]' -e "/ "
| sed -e "s/href=\"//" -e "s/\",//" -e "s/ //"
| filter-urls
| sort --unique >> ~/.urls.txt
When linkchecker
does not work, one alternative is to do a wget --mirror
and extract the URLs from the filenames—list all the files and prefix with a “http://” etc.
Reacting to Broken Links
archiver
combined with a tool like link-checker
means that there will rarely be any broken links on Gwern.net since one can either find a live link or use the archived version. In theory, one has multiple options now:
-
Search for a copy on the live Web
-
link the Internet Archive copy
-
link the WebCite copy
-
link the WikiWix copy
-
use the
wget
dumpIf it’s been turned into a full local file-based version with
--convert-links --page-requisites
, one can easily convert the dump into something like a standalone PDF suitable for public distribution. (A PDF is easier to store and link than the original directory of bits and pieces or other HTML formats like a ZIP archive of said directory.)I use
wkhtmltopdf
which does a good job; an example of a dead webpage with no Internet mirrors ishttp://www.aeiveos.com/~bradbury/MatrioshkaBrains/MatrioshkaBrainsPaper.html
which can be found at1999-bradbury-matrioshkabrains.pdf
, or Sternberg et al’s 2001 review “The Predictive Value of IQ”.
Automatic Internet Archive Repairs
The Internet Archive can be queried via an API for available snapshots near a date. If you know that a URL has an acceptable-quality snapshot, you can get it and use some search-and-replace utility to automatically rewrite a broken link.
So we can ask for the nearest URL to an early date like 1990-01-01:
$ curl --silent 'https://archive.org/wayback/available?timestamp=19900101&url=https://eva.onegeek.org/pipermail/oldeva/2001-September/040368.html' | \
jq --raw-output '.'
Yielding:
{
"url": "https://eva.onegeek.org/pipermail/oldeva/2001-September/040368.html",
"archived_snapshots": {
"closest": {
"status": "200",
"available": true,
"url": "https://web.archive.org/web/20110726181638/http://eva.onegeek.org/pipermail/oldeva/2001-September/040368.html",
"timestamp": "20110726181638"
}
},
"timestamp": "19900101"
}
So our broken link https://eva.onegeek.org/pipermail/oldeva/2001-September/040368.html
can be fixed with the closest.url
field of the archived_snapshots
object, https://web.archive.org/web/20110726181638/http://eva.onegeek.org/pipermail/oldeva/2001-September/040368.html
. Straightforward.
Old = better. If we do not specify a timestamp, or if we specify a recent timestamp like 20230101
, the IA will return the most recent snapshot instead. For repairing links, this would usually be a bad idea: for most URLs, the first snapshot will be the highest-quality one, as later snapshots will typically be infected by bitrot, linkrot, website design ‘upgrades’, spam, domain squatters, maliciously indiscriminate redirects to a homepage, etc. Still, for some URLs we would want to do this, and it’s good to know that is possible.
If we have a search-and-replace utility, newline-delimited list of broken URLs we’d like to replace, and a list of text files to update, we can easily script all the queries & updates:
for URL in `cat urls.txt`; do
URL_ARCHIVE=$(curl --silent 'https://archive.org/wayback/available?timestamp=19900101&url='"$URL" | \
jq --exit-status --raw-output '.archived_snapshots.closest.url');
if [ "$URL_ARCHIVE" != "null" ]; then
search-and-replace "$URL" "$URL_ARCHIVE" [FILE]...;
fi
done
External Links
-
archivenow
(library for submission toarchive.is
,archive.st
,perma.cc
, Megalodon, WebCite, & IA) -
Memento meta-archive search engine (for checking IA & other archives)
-
Archive-It -(by the Internet Archive); donate to the IA
-
“Testing 3 million hyperlinks, lessons learned”, Stack Exchange
-
“Backup All The Things”, muflax
-
Tesoro (archive service; discussion)
-
“Digital Resource Lifespan”, XKCD
-
“Archiving web sites”, LWN
-
Software:
- webrecorder.io (online service); webrecorder (offline tool; WARC-based & can deal with dynamically-loaded content); Diskernet
- pywb
- ArchiveBox (headless Chrome)
- IPFS
- WordPress plugin: Broken Link Checker
-
youtube-dl
- PhantomJS
- erised
- wpull
-
“Indeed, it seems that Google is forgetting the old Web” (HN)
Appendices
Cryptographic Timestamping
Due to length, this section has been moved to a separate page.
filter-urls
A raw dump of URLs, while certainly archivable, will typically result in a very large mirror of questionable value (is it really necessary to archive Google search queries or Wikipedia articles? usually, no) and worse, given the rate-limiting necessary to store URLs in the Internet Archive or other services, may wind up delaying the archiving of the important links & risking their total loss. Disabling the remote archiving is unacceptable, so the best solution is to simply take a little time to manually blacklist various domains or URL patterns.
This blacklisting can be as simple as a command like filter-urls | grep -v en.wikipedia.org
, but can be much more elaborate. The following shell script is the skeleton of my own custom blacklist, derived from manually filtering through several years of daily browsing as well as spiders of dozens of websites for various people & purposes, demonstrating a variety of possible techniques: regexps for domains & file-types & query-strings, sed
-based rewrites, fixed-string matches (both blacklists and whitelists), etc:
#!/bin/sh
# USAGE: `filter-urls` accepts on standard input a list of newline-delimited URLs or filenames,
# and emits on standard output a list of newline-delimited URLs or filenames.
#
# This list may be shorter and entries altered. It tries to remove all unwanted entries, where 'unwanted'
# is a highly idiosyncratic list of regexps and fixed-string matches developed over hundreds of thousands
# of URLs/filenames output by my daily browsing, spidering of interesting sites, and requests
# from other people to spider sites for them.
#
# You are advised to test output to make sure it does not remove
# URLs or filenames you want to keep. (An easy way to test what is removed is to use the `comm` utility.)
#
# For performance, it does not sort or remove duplicates from output; both can be done by
# piping `filter-urls` to `sort --unique`.
set -euo pipefail
cat /dev/stdin \
| sed -e "s/#.*//" -e 's/>$//' -e "s/&sid=.*$//" -e "s/\/$//" -e 's/$/\n/' -e 's/\?sort=.*$//' \
-e 's/^[ \t]*//' -e 's/utm_source.*//' -e 's/https:\/\//http:\/\//' -e 's/\?showComment=.*//' \
| grep "\." \
| grep -F -v "*" \
| grep -E -v -e '\/\.rss$' -e "\.tw$" -e "//%20www\." -e "/file-not-found" -e "258..\.com/$" \
-e "3qavdvd" -e "://avdvd" -e "\.avi" -e "\.com\.tw" -e "\.onion" -e "\?fnid\=" -e "\?replytocom=" \
-e "^lesswrong.com/r/discussion/comments$" -e "^lesswrong.com/user/gwern$" \
-e "^webcitation.org/query$" \
-e "ftp.*" -e "6..6\.com" -e "6..9\.com" -e "6??6\.com" -e "7..7\.com" -e "7..8\.com" -e "7..\.com" \
-e "78..\.com" -e "7??7\.com" -e "8..8\.com" -e "8??8\.com" -e "9..9\.com" -e "9??9\.com" \
-e gold.*sell -e vip.*club \
| grep -F -v -e "#!" -e ".bin" -e ".mp4" -e ".swf" -e "/mediawiki/index.php?title=" -e "/search?q=cache:" \
-e "/wiki/Special:Block/" -e "/wiki/Special:WikiActivity" -e "Special%3ASearch" \
-e "Special:Search" -e "__setdomsess?dest="
# ...
# prevent URLs from piling up at the end of the file
echo -e "\n"
filter-urls
can be used on one’s local archive to save space by deleting files which may be downloaded by wget
as dependencies. For example:
find ~/www | sort --unique >> full.txt && \
find ~/www | filter-urls | sort --unique >> trimmed.txt
comm -23 full.txt trimmed.txt | xargs -d "\n" rm
rm full.txt trimmed.txt
sort
Key Compression Trick
-
I use
duplicity
&rdiff-backup
to backup my entire home directory to a cheap 1.5TB hard drive (bought from Newegg usingforre.st
’s “Storage Analysis—GB / $ for different sizes and media” price-chart); a limited selection of folders are backed up to Backblaze B2 usingduplicity
.I used to semiannually tar up my important folders, add PAR2 redundancy, and burn them to DVD, but that’s no longer really feasible; if I ever get a Blu-ray burner, I’ll resume WORM backups. (Magnetic media doesn’t strike me as reliable over many decades, and it would ease my mind to have optical backups.)↩︎
-
“When the Internet Is My Hard Drive, Should I Trust Third Parties?”, Wired:
Bits and pieces of the web disappear all the time. It’s called ‘link rot’, and we’re all used to it. A friend saved 65 links in 1999 when he planned a trip to Tuscany; only half of them still work today. In my own blog, essays and news articles and websites that I link to regularly disappear—sometimes within a few days of my linking to them.
-
“Going, Going, Gone: Lost Internet References”; abstract:
The extent of Internet referencing and Internet reference activity in medical or scientific publications was systematically examined in more than 1000 articles published between 2000 and 2003 in the New England Journal of Medicine, The Journal of the American Medical Association, and Science. Internet references accounted for 2.6% of all references (672⁄25548) and in articles 27 months old, 13% of Internet references were inactive.
-
The Million Dollar Homepage still gets a surprising amount of traffic, so one fun thing one could do is buy up expired domains which paid for particularly large links.↩︎
-
2013-01-06, the number has increased to ~12,000 external links, ~7,200 to non-Wikipedia domains.↩︎
-
If each link has a fixed chance of dying in each time period, such as 3%, then the total risk of death is an exponential distribution; over the time period 2011–2070 the cumulative chance it will beat each of the 3% risks is 0.1658. So in 2070, how many of the 2200 links will have beat the odds? Each link is independent, so they are like flipping a biased coin and form a binomial distribution. The binomial distribution, being discrete, has no easy equation, so we just ask R how many links survive at the 5th percentile/quantile (a lower bound) and how many survive at the 95th percentile (an upper bound):
qbinom(c(0.05, 0.95), 2200, 0.97^(2070-2011)) # [1] 336 394 ## the 50% annual link rot hypothetical: qbinom(c(0.05, 0.50), 2200, 0.50^(2070-2011)) # [1] 0 0
-
101, ‘Buddhism related to Blossoms’; Unforgotten Dreams: Poems by the Zen monk Shōtetsu; trans. Steven D. Carter, ISBN 0-231-10576-2↩︎
-
Which I suspect is only accidentally ‘general’ and would shut down access if there were some other way to ensure that Wikipedia external links still got archived.↩︎
-
Since Pinboard is a bookmarking service more than an archive site, I asked in 2012 whether treating it as such would be acceptable and Maciej said “Your current archive size, growing at 20 GB a year, should not be a problem. I’ll put you on the heavy-duty server where my own stuff lives.”
I ultimately did not take him up on this offer in part because Pinboard’s archives were relatively low-quality. Pinboard started being neglected around 2017 by Maciej, and as of mid-2022, errors are routine. I do not recommend use of Pinboard.↩︎
-
Google Cache is generally recommended only as a last ditch resort because pages expire quickly from it. Personally, I’m convinced that Google would never just delete colossal amounts of Internet data—this is Google, after all, the epitome of storing unthinkable amounts of data—and that Google Cache merely ceases to make public its copies. And to request a Google spider visit, one has to solve a CAPTCHA—so that’s not a scalable solution.↩︎
-
Which would not be publicly accessible or submittable; I know they exist, but because they hide themselves, I know only from random comments online eg. “years ago a friend of mine who I’d lost contact with caught up with me and told me he found a cached copy of a website I’d taken down in his employer’s equivalent to the Wayback Machine. His employer was a branch of the federal government.”.↩︎
-
Version 0.1 of my
archiver
daemon didn’t simply read the file until it was empty and exit, but actually watched it for modifications with inotify. I removed this functionality when I realized that the required WebCite choking (just one URL every ~25 seconds) meant thatarchiver
would never finish any reasonable workload.↩︎ -
Much easier than it was in the past; Jamie Zawinski records his travails with the previous Mozilla history format in the aptly-named “when the database worms eat into your brain”.↩︎
-
An older 2010 Google article put the average at 320kb, but that was an average over the entire Web, including all the old content.↩︎
-
Already one runs old games like the classic LucasArts adventure games in emulators of the DOS operating system like DOSBox; but those emulators will not always be maintained. Who will emulate the emulators? Presumably in 2050, one will instead emulate some ancient but compatible OS—Windows 7 or Debian 6.0, perhaps—and inside that run DOSBox (to run the DOS which can run the game).↩︎