Skip to main content

Writing a Wikipedia Link Archive Bot

Haskell: tutorial on writing a daemon to archive links in Wikipedia articles with TagSoup and WebCite; obsolete.

This is a 2008 tutorial demonstrating how to write a Haskell program to automatically archive Internet links into WebCite & Internet Archive to avoid linkrot, by parsing WP dumps, downloading & parsing WP articles for external links with the TagSoup HTML parsing library, using the WebCite/IA APIs to archive them, and optimizing runtime. This approach is suitable for one-off crawls but not for live archiving using the RSS feed; for the next step, see Wikipedia RSS Archive Bot for a demonstration of how one could write a RSS-oriented daemon.

Obsolete

WebCite is largely defunct, doing archiving this way is not advised, and WP link archiving is currently handled by Internet Archive-specific plugins by the WMF. These Haskell scripts are obsoleted by later IA/WMF initiatives dealing with linkrot. For a more general approach suitable for personal use, see the writeup of archiver-bot in Archiving URLs.

One of the best ways to learn something like Haskell is to simply dive in and make it do something useful. In my case, I’ve wanted to learn Haskell for a long time: the syntax was gorgeous compared to Common Lisp, Bash, or Java (the only other languages I had experience with); various ideas such as monads, lazy evaluation, and functional purity struck me as fascinating and exciting; and the community was excellent.

But I didn’t have any practical use for it, and I’m too lazy to learn it unless I have to. Bash worked well enough for my shell scripts, and I generally only messed with Common Lisp because of my use of the Stump window manager, but I had finished configuring and extending StumpWM to my liking1. Fortunately, I am a heavy user and long-time editor of Wikipedia, and some technical details bug me.

2 in particular struck me as fixable without requiring modifications to the MediaWiki codebase (a messy yet featureful pile of PHP):

  1. the “feature” of MediaWiki that by default, Fujiwara no Teika is a pagename distinct from Fujiwara no teika. Going to the latter does not automatically redirect to the former. One must either create redirects to the right pagename2 , or simply accept that mis-capitalizations will result in broken links and needless searches.

  2. Links to other non-Wikipedia websites, “external links”, often break. Because articles can evolve so fast, the Internet Archive doesn’t necessarily manage to archive all external links; worse, the Internet Archive is dependent for its data on data dumps from Alexa & its toolbar, which are provided an eternity (6 months or more) after the original spidering of the external link. An external link could be added to an article, used as a reference, fall victim to the infamous 404 error, be noticed to be a dead link and removed on that basis within a few weeks much less 9 months!

Archiving

How To?

Completely Obsolete

The IA added on-demand archiving in the mid-2010s, and all Alexa tools & endpoint have been shut down permanently.

Let’s talk about Problem 2.

The beginning of a solution is to realize that the Internet Archive is not usable here. Their FAQ specifically says that archiving is done only through Alexa as of 2008, and Alexa only does it through their spiders, the Alexa toolbar (proprietary and MS Windows/IE-only), and some web form which doesn’t appear to do anything3. Their Archive-It service does do on-demand archiving, but at steep institution-targeted fees; not something for individual users. (Note: this restriction has since been lifted, and IA does on-demand archiving and has partnered with WMF.)

Fortunately, if we poke around on the subject of archiving URLs, we’ll eventually discover something like WebCite, which is tailor-made for our purposes. They offer persistent archiving services, for free, and best of all they can archive on demand!

Reading through their documentation, it seems they will eventually allow you to upload a file containing a large number of URLs to cite, but not yet. That’s alright—just trying out the archiving (or reading the technical FAQ) reveals the submission URL encodes everything in a very simple format:

"https://www.webcitation.org/archive?url=" ++ url ++ "&email=foo@bar.com"

The Steps

So going back to our problem. Our problem is that we want to take all the articles of the English Wikipedia, somehow extract all the external links from said articles, transform those URLs according to the previous paragraph, and then actually open them (which will cause the embedded URL to be archived).

Article Names

Our first question is naturally “How on Earth do we even get the addresses of the millions of Wikipedia articles? Saying they’re all ‘somewhere in en.wikipedia.org/wiki/’ doesn’t help.” I’m going to cop out a bit here and not actually explain how I hit upon this approach, but we’ll borrow some code from the bot I coded to fix Problem 1, which looks a little like this:

main = do
    cd "/home/gwern/bin/pywikipedia/"
    {- The list comprehension is to prevent from operating on page titles which will
    cause enormous numbers of redirects to be created. It's what comes before that matters. -}
    list <- liftM (\bs -> [n | n<-bs,  length n <= (2^4)]) $ liftM words $ getContents

Redirect-bot.hs is assuming here that stdin is the source of a list. This long (>9,555,933 lines) list takes this form:

...
Delehaye
Delekovac
Delele
Delemont
Delemont_(Jura)
Delemont_JU
Delena
Delena_cancerides
Delenda
Deleni
...

Notice that spaces are being escaped as underscores, and that each page name is on a separate line. We are getting all this information from a special file kindly provided by Wikipedia precisely for people like us, who don’t want to spend the effort to parse Special:AllPages to get the live list of article names; parsing Allpages is how all the serious Wikipedia bots work, but relying on old database dumps gets us 90% of the functionality at 10% the effort. Go visit https://dumps.wikimedia.org/enwiki/latest/ and look for a file called all_titles_in_ns0.gz. When it’s downloaded and uncompressed to a ~60M file, it provides suitable fodder.

So, we have getContents lazily slurping in names of articles to standard in, and we have words (it’s a pure non-IO function, but liftM lets us use it in the IO monad) breaking a single long string with lots of newlines into a list of strings.

import Control.Monad (liftM) -- Don't forget your imports!

main :: IO ()
main = do
         articlelist <- liftM words $ getContents

Article Contents

Now what? Well, a list of items, each of which we want to perform the same operation on, cries out for us to use a (monadic) map on it.

         urllist <- mapM fetchArticleURLs articlelist

But what is this mystery function fetchArticleText? Well, it will take as its single argument a String, and it must return a list of Strings since any given article might have many different external links in it. And since we certainly aren’t loading into memory an immutable copy of Wikipedia (regardless of whether we download a full dump of the articles or decide to fetch articles over the Internet), the result will be promoted into the IO monad. So the type signature must be:

fetchArticleURLs :: String -> IO [String]

Purely arbitrarily, we’ll have fetchArticleURLs work on pages from the live English Wikipedia; but you could probably also go the route of working from an offline database dump, which would have the advantage of halving network requests and parsing out URLs much faster (since there’d be no need to request individual pages over the network).

Parsing HTML

It’s not too hard to see how to get the URL for any given article ("https://en.wikipedia.org/wiki/"++article), nor how to use Network.HTTP to download the HTML file (couldn’t be more than 10 lines)—but reliably parsing that HTML to extract genuine external links doesn’t seem trivial at all! We could go with some sort of clumsy regexp matching on “http://”, but hit-or-miss regular expressions just aren’t The Haskell Way.

Stumped, the thing to do is to go bum around on the Haskell wiki and #haskell, where some excellent fellow will eventually inform us that “Hey, you know, parsing Wikipedia pages’ HTML to extract out all external links kind of reminds me of one of the examples for Neil Mitchell’s tagsoup library. You should go check it out.”

Indeed, we need some sort of library to parse HTML for us and maybe handle the network business, and TagSoup fits the bill admirably. One of the examples in Example.hs (googleTechNews) does almost precisely what we want, except for Google Tech News. The example in its entirety:

googleTechNews = do
       tags <- liftM parseTags $ openURL "http://news.google.com/?ned=us&topic=t"
       let links = mapMaybe extract $ sections match tags
       putStr $ unlines links
   where extract xs = maybeTagText (xs !! 2)
         match =
             Match.tagOpenAttrNameLit "a" "id"
              (\value -> "r" `isPrefixOf` value && 'i' `notElem` value)

Reading this, one is struck by openURL; we don’t even need to read the documentation to know what it does. We already have the beginning of our fetchArticleURLs definition:

import Text.HTML.Download (openURL)

fetchArticleURLs :: String -> IO [String]
fetchArticleURLs article = undefined $ openURL("https://en.wikipedia.org/wiki/"++article)

We could replace undefined with whatever parsing we cook up, but we might want to use it somewhere else, and besides, it’s bad practice to muddy up a function that could be pure (like our hypothetical extractURLs function) with IO. Separation of concerns, and all that.

So we’re up to this:

fetchArticleURLs article = liftM extractURLs $ openURL("https://en.wikipedia.org/wiki/"++article)
extractURLs :: String -> [String]
extractURLs arg = undefined

The definition of extractURLs is a bit magical, since we don’t immediately understand the entire hierarchy of types and constructors which TagSoup operate ons:

import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
import Data.List (isPrefixOf)

extractURLs :: String -> [String]
extractURLs arg = [x | TagOpen "a" atts <- (parseTags arg),
                       (_,x) <- atts,
                       "http://" `isPrefixOf` x]

Alright. This list comprehension has 3 parts, which we will read from top down. We start with our arg, a String. It immediately passes to parseTags, which turns it into [Tag]—notice it’s not a tree or hierarchical structure like our String actually represents; this is why it’s ‘soup’.

A Tag can be a lot of things, but we quickly realize we only want TagOpen.4 The "a" is there because <href> actually begins with <a href=...>.

But how does TagOpen "a" atts do anything? We can think of a list comprehension as syntax sugar over a bunch of filter calls and whatnot. In this case, what’s really happening is something like filter (\TagOpen "a" x -> x) (parseTags arg); any entry which isn’t a TagOpen is omitted from the intermediate list.

The second line is exactly the same. We’re still working on a [Tag], and our little filter is still testing away, but now our ‘test’ is really just pulling out the second entry in each TagOpen’s [Attribute]—it’s equivalent to a map of snd. And this gives us a [String] of URLs! The first entry in the lists are “href”, and the second are the URLs the href points to. But when we test out our two-liner, we realize to our dismay we are getting junk results like ["/wiki/Japanese_poetry",..]. A quick isPrefixOf and we fix that, and that finishes our list comprehension.

Archiving URLs

Archiving in WebCite

Oy. So we wrote extractURLs, which was used in fetchArticleURLs, which was used in main. Let’s backtrack a bit; we had left main at:

main = do
         articlelist <- liftM words $ getContents
         urllist <- mapM fetchArticleURLs articlelist

By the time urllist is generated, we’ve got an IO [String] which is containing all the external links. Now we can once again indulge the desire to map some sort of function over it. Another map, another hypothetical function:

         mapM archiveURL urllist

For convenience’s sake, we’ll reuse openURL even though we don’t want output, we really want something like this:

archiveURL :: String -> IO ()

The simplest way to define it is this:

archiveURL :: String -> IO String -- Reality
archiveURL url = openURL("https://www.webcitation.org/archive?url=" ++ url ++ "&email=foo@bar.com")

But we can improve on this! We are using mapM which will go to all the trouble of retaining all the output from archiveURL being run on the dozens, hundreds, thousands, or millions of entries in urllist.

Can’t we be more efficient somehow? This sort of situation is exactly where naming conventions can come in handy. I happen to remember that just as foo' denotes some stricter version of a function foo, a foo_ denotes some sort of function that will throw away its output. One blindly tries mapM_ instead, and it works!

We could have been more consistent. We know that mapM :: (a -> m b) -> [a] -> m [b], and we want to get rid of the output, or turn the [b] into nothing, not even a list, which means (); we throw the new type signature into Hoogle, and the very first result is mapM_.

Archiving in Alexa

We could also throw in a speculative request to Alexa, which handles archiving for the Internet Archive. Supposedly any URL visited by someone using their toolbar will be spidered and then the spidered copy given to the Internet Archive. We can mimic the request of the toolbar. It turns out to be not that difficult to do, not nearly as difficult as Alexa could have made it. (Ironically, my first attempt was based on trying to read the JavaScript source code to the Alexa toolbar, which I followed in detail—and the moment I checked an actual request URL, I realized I had gotten it completely wrong.) This code, for brevity’s sake, will not be in the final WP archive bot code, but it is included in the general-purpose archiver daemon which will archive any URLs in a file in multiple ways and can be seen as the generalized version of this WP archive bot.

With the handy Firefox extension Live HTTP Headers5, we can easily snoop on the toolbar’s traffic. I turned it on, visited a few domains with short names, saved the traffic to a file, and grepped for alexa.com and http://:

http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
    &cdt=rq%3D0%26wid%3D7005&ref=&url=http%3A%2F%2Fwww.cnn.com%2F
http://widgets.alexa.com/traffic/sparky/?v=1&url=cnn.com
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
    &cdt=rq%3D1%26wid%3D7005%26ttl%3D1400&ref=&url=http%3A%2F%2Fwww.iana.org%2Fdomains%2Fexample%2F
http://widgets.alexa.com/traffic/sparky/?v=1&url=iana.org
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
    &cdt=rq%3D2%26wid%3D7005%26ttl%3D989&ref=&url=http%3A%2F%2Fwebcitation.org%2F
http://widgets.alexa.com/traffic/sparky/?v=1&url=webcitation.org
http://widgets.alexa.com/traffic/rankr/?ref=http%3A%2F%2Fwebcitation.org%2F
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
    &cdt=rq%3D3%26wid%3D7005%26ttl%3D2241&ref=&url=http%3A%2F%2Fwww.archive.org%2Findex.php
http://widgets.alexa.com/traffic/sparky/?v=1&url=archive.org

We can rule out widgets.alexa.com—the URL structure and the name indicates that these requests are being made by the part of the toolbar that is showing the little traffic-vs-time graph in the Firefox GUI. So:

http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns&cdt=rq%3D0%26 /
    wid%3D7005&ref=&url=http%3A%2F%2Fwww.cnn.com%2F
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns&cdt=rq%3D1%26 /
    wid%3D7005%26ttl%3D1400&ref=&url=http%3A%2F%2Fwww.iana.org%2Fdomains%2Fexample%2F
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns&cdt=rq%3D2%26 /
    wid%3D7005%26ttl%3D989&ref=&url=http%3A%2F%2Fwebcitation.org%2F
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns&cdt=rq%3D3%26 /
    wid%3D7005%26ttl%3D2241&ref=&url=http%3A%2F%2Fwww.archive.org%2Findex.php

To split the last URL:

  1. http://data.alexa.com/data/SbADd155Tq0000

  2. ?cli=10

  3. &ver=spkyf-1.5.0

  4. &dat=ns

  5. &cdt=rq%3D2%26wid%3D7005%26ttl%3D989

  6. &ref=

  7. &url=http%3A%2F%2Fwebcitation.org%2F (the URL-encoding form of https://webcitation.org/)

We can hardwire #1 & #3; I am guessing that we can hardwire #2 and #4; #6 can be left empty since none of the examples used it; #7 is just the payload; but unfortunately, #5 seems a little more difficult. We have 4 examples here for the cdt field:

rq%3D0%26wid%3D7005
rq%3D1%26wid%3D7005%26ttl%3D1400
rq%3D2%26wid%3D7005%26ttl%3D989
rq%3D3%26wid%3D7005%26ttl%3D2241

What is this gibberish? ttl looks like a field name (‘time to live’, perhaps?); it’s URL-encoded gunk, so let’s decode it:

rq=0&wid=7005
rq=1&wid=7005&ttl=1400
rq=2&wid=7005&ttl=989
rq=3&wid=7005&ttl=2241
  • rq is presumably short for “request”, and an incrementing integer suggests that it’s counting how many pages one has visited. For the archiver bot, we can hardwire it as ‘0’ because there will be a good 20 seconds between each request (due to the WebCite timeout).

  • wid is troubling because it’s constant, and a fairly large number. It doesn’t seem large enough to be a user ID—surely the toolbar has had more than 7000 users, since the toolbar page says it has had 1.6 million downloads. Is it a user ID? We’ll leave it for now.

  • ttl is also troubling because it changes but isn’t even in the first request. Does it stand for ‘time to live’? Does it vary if I load the same page many times, is it a measure of how fast pages load or something (a metric very important to search engines, who rank higher the sites which are faster)? Can we get away with omitting it?

We need more information on ttl and wid. So let’s visit a bunch more URLs and see whether more data says anything:

http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
    &cdt=rq%3D4%26wid%3D15698%26ttl%3D1830&ref=&url=http%3A%2F%2Fwww.reddit.com%2F
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
    &cdt=rq%3D5%26wid%3D15698%26ttl%3D425&ref=&url=http%3A%2F%2Fgwern.net%2F
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
    &cdt=rq%3D6%26wid%3D15698%26ttl%3D1614&ref=&url=http%3A%2F%2Fwww.reddit.com%2F
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
    &cdt=rq%3D7%26wid%3D15698%26ttl%3D376&ref=&url=http%3A%2F%2Fgwern.net%2F
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
    &cdt=rq%3D8%26wid%3D15698%26ttl%3D1468&ref=&url=http%3A%2F%2Fwww.reddit.com%2F
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
    &cdt=rq%3D9%26wid%3D15698%26ttl%3D2092&ref=&url=http%3A%2F%2Flesswrong.com%2F
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
    &cdt=rq%3D10%26wid%3D15698%26ttl%3D1717&ref=&url=http%3A%2F%2Fwww.overcomingbias.com%2F
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
    &cdt=rq%3D11%26wid%3D15698%26ttl%3D2010&ref=http%3A%2F%2Fwww.overcomingbias.com%2F
    &url=http%3A%2F%2Fwww.overcomingbias.com%2F2011%2F06%2Frevised-growth-claim.html%23comments

Filter and URL-decode the ttl field:

ttl=1830
ttl=425
ttl=1614
ttl=376
ttl=1468
ttl=2092
ttl=1717
ttl=2010

The smallest in any ttl is 376, the largest is 2241. This is a range which smells suspiciously like a page load time—few sites load much faster than 0.25 seconds and most clock in at a second or two. Combined with the observation that the first sample URL doesn’t have a ttl, I think we can safely omit the ttl field.

And wid? We speculated it’s some kind of user ID. In this second batch, the field is wid%3D15698 or wid=15698. So the wid is not consistent across Firefox sessions, but it is consistent across multiple page loads. This suggests to me that Alexa is only aggregating URLs by a ‘browser session’ ID (maybe it originally tracked individual ‘windows’, hence the ‘w’ part of the ID). We can probably just generate a random number between 1000 and 20000 every time we archive a URL. So, all told our pseudocode goes like this:

"http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns&cdt=" ++
    escape ("rq=0&wid=" ++ show randomint ++ "&ref=&url=" ++ url)

How do we get our randomint and urlEncode? I have had to URL-encode strings in the past, for XMonad’s XMonad.Actions.Search module:

escape :: String -> String
escape = concatMap escapeURIChar
    where escapeURIChar :: Char -> String
          escapeURIChar c | isAscii c && isAlphaNum c = [c]
                          | otherwise                 = concatMap (printf "%%%02X") $ encode [c]

We’ll do the random IO with System.Random, hide it internally inside the Alexa toolbar archive function since that function has to be in IO anyway, and reuse openURL:

import System.Random (getStdGen, randomR)
import Text.HTML.Download (openURL)
import Data.Char (isAlphaNum, isAscii)
import Text.Printf (printf)

a :: String -> IO ()
a url = do gen <- getStdGen
           let rint = fst $ randomR (1000::Int,20000) gen
           let payload = escape ("wid=" ++ show rint ++ "&ref=&url=" ++ url)
           print $ "http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns&cdt=rq=0&"
                     ++ payload

             where escape :: String -> String
                   escape = concatMap escapeURIChar
                   escapeURIChar :: Char -> String
                   escapeURIChar c | isAscii c && isAlphaNum c = [c]
                                   | otherwise                = concatMap (printf "%%%02X") [c]

A better name would be alexaToolbar. We’ll test it: alexaToolbar "https://www.google.com" evaluates to:

"http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
    &cdt=rq%3D0%26wid%3D%2819413%2C218941083%2040692%29%26ref%3D%26url%3Dhttp%3A%2F%2Fwww%2Egoogle%2Ecom%2F"

We run it in curl, and we get:

<?xml version="1.0" encoding="UTF-8"?>
<ALEXA VER="0.9" URL="404" HOME="0" AID="=">
</ALEXA>

Is that right? Let’s see what the very first URL example said, since as far as we could tell, the URL examples should be indefinitely reusable:

<?xml version="1.0" encoding="UTF-8"?>
<ALEXA VER="0.9" URL="cnn.com/" HOME="0" AID="=">
<RLS PREFIX="http://" more="92">
<RL HREF="foxnews.com/" TITLE="Fox News Channel"/>
<RL HREF="abcnews.go.com/" TITLE="Abc News: Online News, Breaking News, Feature Stories And More"/>
<RL HREF="www.cbsnews.com/" TITLE="CBS News"/>
<RL HREF="news.bbc.co.uk/" TITLE="BBC News"/>
<RL HREF="cnnenespanol.com/entv" TITLE="CNNenEspanol"/>
<RL HREF="nytimes.com/" TITLE="The New York Times"/>
<RL HREF="www.msnbc.com/" TITLE="Msnbc"/>
<RL HREF="news.google.com/?output=rss" TITLE="Google News"/>
<RL HREF="www.microsite.reuters.com/rss/worldNews" TITLE="Reuters - World News"/>
<RL HREF="www.pbs.org/" TITLE="Pbs"/>
<RL HREF="fullcoverage.yahoo.com/" TITLE="fullcoverage.yahoo.com/"/>
</RLS>
<SD TITLE="A" FLAGS="DMOZ" HOST="cnn.com">
<TITLE TEXT="CNN - Cable News Network"/>
<ADDR STREET="1 CNN Center, One CNN Center" CITY="Atlanta" STATE="GA" ZIP="30303" COUNTRY="US" />
<CREATED DATE="22-Sep-1993" DAY="22" MONTH="09" YEAR="1993"/>
<PHONE NUMBER="404.827.1500"/>
<OWNER NAME="Cable News Network"/>
<EMAIL ADDR="TMGROUP@turner.com"/>
<DOS>
<DO DOMAIN="insidersadvantage.com" TITLE="insidersadvantage.com"/>
</DOS>
<TICKER SYMBOL="AOL"/>
<LANG LEX="en"/>
<LINKSIN NUM="77443"/>
<SPEED TEXT="1451" PCT="49"/>
<REVIEWS AVG="3.0" NUM="53"/>
<CHILD SRATING="0"/>
</SD>
<KEYWORDS>
<KEYWORD VAL="Networks"/>
<KEYWORD VAL="Cable"/>
<KEYWORD VAL="CNN News Group"/>
</KEYWORDS><DMOZ>
<SITE BASE="cnn.com/" TITLE="CNN Interactive" DESC="News, weather, sports, and services including
    e-mail news alerts and downloadable audio/video reports.">
<CATS>
<CAT ID="Top/News" TITLE="English/News" CID="998601"/>
<CAT ID="Top/Arts/Television/News" TITLE="Television/News" CID="395700"/>
<CAT ID="Top/Arts/Television/Networks/Cable/CNN" TITLE="Cable/CNN" CID="392942"/>
</CATS>
</SITE>
</DMOZ>
<SD>
<POPULARITY URL="cnn.com/" TEXT="49"/>
<REACH RANK="42"/>
<RANK DELTA="+1"/>
</SD>
</ALEXA>

Well, drat. That looks very different. What could the error be? We compare our generated with an original, and they match, but only up to wid and later parameters:

wid%3D7005&ref=&url=http%3A%2F%2Fwww.cnn.com%2F
wid%3D%288260%2C2113025614%2040692%29%26ref%3D%26url%3Dhttp%3A%2F%2Fwww%2Ecnn%2Ecom

So maybe we need to escape only the URL, so:

let payload = "rq=0&wid=" ++ show rint ++ "&ref=&url=" ++ escape url

and now alexaToolbar "https://www.cnn.com" evaluates to:

"http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&
    dat=ns&cdt=rq=0&wid=8100&ref=&url=http%3A%2F%2Fwww%2Ecnn%2Ecom"

Which seems to work:

<?xml version="1.0" encoding="UTF-8"?>
<ALEXA VER="0.9" URL="cnn.com/" HOME="0" AID="=">
<RLS PREFIX="http://" more="92">
<RL HREF="foxnews.com/" TITLE="Fox News Channel"/>
<RL HREF="abcnews.go.com/" TITLE="Abc News: Online News, Breaking News, Feature Stories And More"/>
<RL HREF="www.cbsnews.com/" TITLE="CBS News"/>
<RL HREF="news.bbc.co.uk/" TITLE="BBC News"/>
<RL HREF="cnnenespanol.com/entv" TITLE="CNNenEspanol"/>
<RL HREF="nytimes.com/" TITLE="The New York Times"/>
<RL HREF="www.msnbc.com/" TITLE="Msnbc"/>
<RL HREF="news.google.com/?output=rss" TITLE="Google News"/>
<RL HREF="www.microsite.reuters.com/rss/worldNews" TITLE="Reuters - World News"/>
<RL HREF="www.pbs.org/" TITLE="Pbs"/>
<RL HREF="fullcoverage.yahoo.com/" TITLE="fullcoverage.yahoo.com/"/>
</RLS>
<SD TITLE="A" FLAGS="DMOZ" HOST="cnn.com">
<TITLE TEXT="CNN - Cable News Network"/>
<ADDR STREET="1 CNN Center, One CNN Center" CITY="Atlanta" STATE="GA" ZIP="30303" COUNTRY="US" />
<CREATED DATE="22-Sep-1993" DAY="22" MONTH="09" YEAR="1993"/>
<PHONE NUMBER="404.827.1500"/>
<OWNER NAME="Cable News Network"/>
<EMAIL ADDR="TMGROUP@turner.com"/>
<DOS>
<DO DOMAIN="insidersadvantage.com" TITLE="insidersadvantage.com"/>
</DOS>
<TICKER SYMBOL="AOL"/>
<LANG LEX="en"/>
<LINKSIN NUM="77443"/>
<SPEED TEXT="1451" PCT="49"/>
<REVIEWS AVG="3.0" NUM="53"/>
<CHILD SRATING="0"/>
</SD>
<KEYWORDS>
<KEYWORD VAL="Networks"/>
<KEYWORD VAL="Cable"/>
<KEYWORD VAL="CNN News Group"/>
</KEYWORDS><DMOZ>
<SITE BASE="cnn.com/" TITLE="CNN Interactive" DESC="News, weather, sports, and services including
    e-mail news alerts and downloadable audio/video reports.">
<CATS>
<CAT ID="Top/News" TITLE="English/News" CID="998601"/>
<CAT ID="Top/Arts/Television/News" TITLE="Television/News" CID="395700"/>
<CAT ID="Top/Arts/Television/Networks/Cable/CNN" TITLE="Cable/CNN" CID="392942"/>
</CATS>
</SITE>
</DMOZ>
<SD>
<POPULARITY URL="cnn.com/" TEXT="49"/>
<REACH RANK="42"/>
<RANK DELTA="+1"/>
</SD>
</ALEXA>

Similarly, if we try “https://www.google.com”, that seems to work as well. Now we can modify the function to be directly usable:

import System.Random (getStdGen, randomR)
import Text.HTML.Download (openURL)
import Data.Char (isAlphaNum, isAscii)
import Text.Printf (printf)

alexaToolbar :: String -> IO ()
alexaToolbar url = do gen <- getStdGen
                      let rint = fst $ randomR (1000::Int,20000) gen
                      let payload = "wid=" ++ show rint ++ "&ref=&url=" ++ escape url
                      _ <- openURL $ "http://data.alexa.com/data/SbADd155Tq0000?\
                                     \cli=10&ver=spkyf-1.5.0&dat=ns&cdt=rq=0&" ++ payload
                      return ()
             where escape :: String -> String
                   escape = concatMap escapeURIChar
                   escapeURIChar :: Char -> String
                   escapeURIChar c | isAscii c && isAlphaNum c = [c]
                                   | otherwise                = concatMap (printf "%%%02X") [c]

Duplicate URLs

We compile the entire program, and it is short. It is sweet. It works, and does what we want, and has explicit types and import statements, and we’ve reasoned out every bit of it. So we should be pretty happy with it. But I’ve noticed one thing while playing around in GHCi: it’s naive and inefficient. Not only does it eat up ungodly amounts of memory and time, it also seems to be making duplicate requests! The reason for this is that on every Wikipedia webpage containing an article, there are a number of external links which aren’t part of the article itself; this includes linking to the WikiMedia Foundation (the non-profit organization which runs all the Wikipedias and associated projects), links to donations, etc.

Well, not a hard problem. We want to ensure there are no duplicate entries in our list; one suspects this is a common desire in list processing.

The type signature would be [a] -> [a], but unfortunately, this is far too general: Hoogle will not even put the right function on the first page.

We need to think: any function of this sort must go through the list, and compare entries. Compare what? For greater-than, less-than, etc.? That would imply an Ord constraint, but punched in, we see nothing useful. No, those are useful for sorting, but we only want the entries to be unique and not sorted in anyway. We only need ==, which means an Eq constraint. nub is the very first hit for Eq a => [a] -> [a].

(Of course, we could just read through Data.List.).

It makes the most sense to put nub just before we begin mapping archiveURL over urllist, so in it goes:

         mapM_ (archiveURL) (liftM nub urllist)

Prototype

Well, that’s that! Let’s put it all together:

import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
import Text.HTML.Download (openURL)
import Data.List (isPrefixOf, nub)
import Monad (liftM)

main :: IO ()
main = do articlelist <- liftM words $ getContents
          urllist <- liftM concat $ mapM fetchArticleURLs articlelist
          mapM_ (archiveURL) (liftM nub urllist)

archiveURL :: String -> IO String
archiveURL url = openURL("https://www.webcitation.org/archive?url=" ++ url ++ "&email=foo@bar.com")

fetchArticleURLs :: String -> IO [String]
fetchArticleURLs article = liftM extractURLs $ openURL("https://en.wikipedia.org/wiki/"++article)

extractURLs :: String -> [String]
extractURLs arg = [x | TagOpen "a" atts <- (parseTags arg),
                       (_,x) <- atts,
                       "http://" `isPrefixOf` x]

Is this not lovely? Not counting imports or type signatures, it’s 6 lines (10 with imports, 14 all told) all in clear natural Haskell.

Performance

Nub

But alas, we see an all too common problem with beautiful concise Haskell: it doesn’t perform well. We use it from the shell like thus:

$ head -n 20 enwiki-all-titles-in-ns0 | archive-bot

It performs acceptably for small values of n, but if we try 100 or more, we quickly notice extreme slowness & memory usage. Looking through the code, nub immediately comes to mind as a culprit. Everywhere else, we’re using maps and the like, which are lazy and make nice streams with constant memory-use, but nub operates on the whole list of URLs at once, so perhaps it’s slow. nub is 𝒪(n2) because it compares every entry against every other entry in a list (“accidentally quadratic”!). Let’s optimize it.

nub has to do quadratically many matches in part because it’s trying to preserve order (which is sort of important for a list). But we don’t care at all about order, so we’re free to do something clever… like convert the list to the mathematical structure of sets and then back! We know in sets each member is unique (no {1,1,2} business), so we know it works, and the performance is better than nub. So add in:

import Data.Set (toList, fromList)

And replace the last line of main with:

          mapM_ archiveURL $ liftM (toList . fromList) urllist

But we must continue our quest for it is still not fast enough.

ByteStrings

The conventional wisdom is whenever you are using String and it isn’t fast enough, use the bytestring library. Our next version will make use of lazy ByteStrings:

import qualified Data.ByteString.Lazy.Char8 as B
            (ByteString(), getContents, lines, unlines, pack, unpack, words)

To convert to ByteString, we first convert to ByteString equivalents wherever possible—B.words $ B.getContents, for example. Unfortunately for our main function, ByteString introduces some weirdness, so we de-sugar the do notation to use the good old =<< operator to make sure everything fits together. But take heart, it’s shorter than before:

main = mapM_ archiveURL =<< (liftM uniq $ mapM fetchArticleURLs
                               =<< (liftM B.words $ B.getContents))
              where uniq :: (Ord a) => [[a]] -> [a]
                    uniq = toList $ fromList $ concat

Notice how we’ve split out our concatenating and uniqueifying code to its own little points-free local definition. You may’ve also noticed that we’re using the regular concat and not ByteString.concat; this is because we actually aren’t concatenating the ByteStrings themselves but merely the list of lists.

Bytestring Bot

The other bit of trickiness is that TagSoup regular Strings, so we need manually convert back and forth between Strings and ByteStrings using B.pack and B.unpack, and of course the type signatures need to be updated. But this is all pretty simple and mechanical; the final result of all this fiddling is the noticeably uglier but also noticeably faster and leaner code:

module Main () where
import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
import Text.HTML.Download (openURL)
import Data.List (isPrefixOf)
import Monad (liftM)
import Data.Set (toList, fromList)
import qualified Data.ByteString.Lazy.Char8 as B
             (ByteString(), getContents, lines, unlines, pack, unpack, words)

main :: IO ()
main = mapM_ archiveURL
           =<< (liftM uniq $ mapM fetchArticleURLs
                =<< (liftM B.words $ B.getContents))
              where uniq :: (Ord a) => [[a]] -> [a]
                    uniq = toList $ fromList $ concat

archiveURL :: B.ByteString -> IO String
archiveURL url = openURL("www.webcitation.org/archive?url=" ++ (B.unpack url)
                          ++ "&email=foo@bar.com")

fetchArticleURLs :: B.ByteString -> IO [B.ByteString]
fetchArticleURLs article = liftM extractURLs $ openURL("en.wikipedia.org/wiki/"
                                                      ++ (B.unpack article))

extractURLs :: String -> [B.ByteString]
extractURLs arg = map B.pack $ [x | TagOpen "a" atts <- (parseTags arg),
                                    (_,x) <- atts,
                                    "http://" `isPrefixOf` x]

Parallelizing Requests

What’s that, you say? While ByteStrings has indeed made some difference, it’s marginal?

Fortunately, we can go still further—there’s not a whole lot we can do for the algorithm itself yet, but we can investigate parallelizing matters.

Remember that archiveURL was a bit inefficient because even with mapM_, each entry was being processed one at a time? Well, is this not Haskell? Aren’t functional programming languages famous for having functions like map which are embarrassingly easy to parallelize? If archiveURL were pure, we’d use Control.Parallel.Strategies, but it’s not.

We’re throwing away the output from openURL inside archiveURL. Ideally we would somehow be mapping a function over urllist which would say in effect “Take this single ByteString and go off somewhere by yourself and do whatever needs doing; don’t bother coming back.” In short, we want to fork archiveURL off.

But what function will do this for us? What is its signature? Let us think about it.

Clearly, if it really is going away forever, it cannot be pure, because such a function would be useless. If we can only write fork (1+1), how do we know it ever ran at all? So the result of the argument must be something in IO. Perhaps it is IO [a]? But that’s not good, since the argument doesn’t have to evaluate to a list—what about arguments which mess around with tuples, or sets, or all the other data structures we have? The most general argument type is IO a.

We could imagine a ‘goAway :: (a -> IO a) -> a -> IO ()’, and many other variants; but we cannot communicate with goAway—that’s the whole point—and so like a spaceship, it must have everything it needs and its arguments combined must be IO a—so why insist on one particular call-scheme? goAway :: IO a -> IO () is more general.

Unfortunately, Hoogle cannot help us here. If we ask for IO a -> IO () or IO a -> IO a or other plausible types we could think of, we get a wide variety of functions, none of which we want. What we want is forkIO, with the signature IO a -> IO ThreadId.

This is a good illustration of why purity is important: it narrows down the range of possibilities. When a function starts with IO actions and ends with IO actions, it could be doing anything, and our search results reflect that. In comparison, a type like (a,b) -> a is so specific that djinn can mechanically generate the function for that type6

We change mapM_:

import Control.Concurrent (forkIO)

main = mapM_ (forkIO . archiveURL) =<<
           (liftM uniq $ mapM fetchArticleURLs
              =<< (liftM B.words $ B.getContents))
-- We need return () because otherwise `openURL` would return a String and break forkIO's heart
archiveURL url = openURL ("www.webcitation.org/archive?url="++B.unpack url++"&email=foo@bar.com")
                     >> return ()

Much better! I’ve noticed improvements of 20-100%, since we can run on multiple CPUs. That little change was certainly worth it.

One small tweak occurs to me; if we ask a GHC 7 version of Hoogle for a function with the type m a -> m (), it will return a hit named Control.Monad.void, which is our >> return () in functor garb. If void is available, we would rewrite archiveURL like this:

archiveURL url = void (openURL ("www.webcitation.org/archive?url="++B.unpack url++"&email=foo@bar.com"))

Laziness

But these speed increases have made me greedy—I want to be able to process a few thousand names at once, not a measly few hundred. This should cause us to ponder. Don Stewart has a quite fast program to check the viability of links, urlcheck; it delves pretty heavily into threads and gets great performance but it honestly looks pretty scary and hard to read. Do we have to resort to such nastiness?

Well, there’s a Haskell saying: “the best way to optimize a program is to make it lazier or stricter”.

It’s already strict in a lot of places because of the IO, and going further doesn’t seem like a good option—how can one pipe in thousands of articles if the program is going to strictly and immediately eat all one’s RAM? Laziness is the way to go.

To figure out how to get greater laziness, let’s go through step by step. We know getContents is lazy because it makes use of unsafeInterleaveIO according to the documentation—which is why we can have lazy IO. Let’s hold onto that thought.

Next is B.words. A quick experiment in GHCi (take 100 $ unwords $ repeat "Yes"; that sort of thing) convinces me that words is lazy as one would expect.

Next is mapM fetchArticleURLs. Here’s a problem. fetchArticleURLs does IO and so is strict, and mapM means it is little better than a loop over the list—the entire list. There goes our laziness since the whole list will be processed before being passed onto forkIO and archiveURL. But laziness and IO can co-exist in peace and harmony!

Remember getContents is lazy, and it can achieve this magic using unsafeInterleaveIO. Reading through the documentation, it seems to be safe in certain situations: when the IO target isn’t going to change, isn’t going to change as a result of whatever is done—or one doesn’t care when the IO is performed.

In our situation, it’s irrelevant whether we access an article at time n or time n + 1. When you’re working on millions of articles, it doesn’t matter much if one article is changed or even deleted. So we can safely use unsafeInterleaveIO! We could stick it in main or fetchArticleURLs, but for convenience’s sake, I put it in the latter.

So now we have:

import System.IO.Unsafe (unsafeInterleaveIO)

fetchArticleURLs article = liftM (B.lines . extractURLs)
                            (unsafeInterleaveIO $ openURL(wiki ++ B.unpack article))

Let’s think about the flow now. getContents keeps on delivering a few characters to words, which is waiting for a newline, at which point it turns those characters into a String and passes them onto mapM which will immediately run fetchArticleURLs—and fetchArticleURLs will immediately ‘finish’, leaving behind some thunks or whatever holding a promise that it’ll run the IO actions (downloading a particular Web page) whenever it’s needed. And that promise gets passed into nub, which is busy concatenating and then uniqueifying the results of those thunks.

So now the program will run in constant space (I notice that even when I pass it 10000 or 20000 names it’ll use a steady 7.6% or so of memory as opposed to a varying ~40% with previous versions running on a few hundred names). It may take a while longer, but this is a small price to pay—now my computer will still be usable while the program is running.

Frugal Removal

Are we tired of this program yet? I’ve thought of one last and particularly nasty thing to do to speed things up and recover the speed lost to laziness. We have uniq in there to remove duplicate entries and be a little friendlier to WebCite, correct? But if the program is hardwired to always run on Wikipedia pages, then we can just run fetchArticleURLs on a few pages, note down all the duplicates to help pages and legal notices etc., and hard-wire in a filter to remove them. This is a hack.

But here is the final, ugly, fast and frugal version. This combines the lazy IO covered previously and adds in the hardwired filter.

module Main () where

import Control.Concurrent (forkIO)
import Control.Monad (liftM)
import Data.List (isPrefixOf)
import System.IO.Unsafe (unsafeInterleaveIO)
import Text.HTML.Download (openURL)
import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
import qualified Data.ByteString.Lazy.Char8 as B (ByteString(), getContents, pack, unpack, words)

main :: IO ()
main = mapM_ (forkIO . archiveURL) =<< (liftM uniq $ mapM fetchArticleURLs
                                         =<< (liftM B.words $ B.getContents))
                where uniq :: [[B.ByteString]] -> [B.ByteString]
                      uniq = filter (\x ->
                         if x == B.pack "https://wikimediafoundation.org/"
                          || x == B.pack "https://wikimediafoundation.org/wiki/Deductibility_of_donations"
                          || x == B.pack "https://wikimediafoundation.org/wiki/Fundraising"
                          || x == B.pack "https://wikimediafoundation.org/wiki/Privacy_policy"
                          || x == B.pack "http://www.mediawiki.org/"
                          || x == B.pack "http://www.wikimediafoundation.org"
                         then False else True) . concat

fetchArticleURLs :: B.ByteString -> IO [B.ByteString]
fetchArticleURLs article = liftM extractURLs $ unsafeInterleaveIO $
                            openURL("https://en.wikipedia.org/wiki/" ++ B.unpack article)

extractURLs :: String -> [B.ByteString]
extractURLs arg = map B.pack $ [x | TagOpen "a" atts <- (parseTags arg),
                                    (_,x) <- atts,
                                    "http://" `isPrefixOf` x]

archiveURL :: B.ByteString -> IO ()
archiveURL url = openURL("www.webcitation.org/archive?url="++B.unpack url++"&email=foo@bar.com")
                   >> return ()

  1. My old StumpWM config is still available since other StumpWM users seem to still find it useful; I use Xmonad these days and that config is more up-to-date.↩︎

  2. n2—1, where n is how many spaces there are in the pagename.↩︎

  3. It is possible, though difficult, to script the forms in Haskell.↩︎

  4. The Position, Warning, Comment, and Close tags are irrelevant and do not contain the hyperlinks we want; and TagText after investigation turns out to be exactly what it says it is.↩︎

  5. Obsoleted by the developer console now.↩︎

  6. The function is f (a,_) = a, if you were wondering—the tuple function named fst.↩︎