So, going back to our problem: we want to take all the articles of the English Wikipedia, somehow extract all the external links from those articles, transform those URLs as described in the previous paragraph, and then actually open them (which will cause the embedded URL to be archived).
Our first question is naturally “How on Earth do we even get the addresses of the millions of Wikipedia articles? Saying they’re all ‘somewhere in en.wikipedia.org/wiki/’ doesn’t help.” I’m going to cop out a bit here and not actually explain how I hit upon this approach, but we’ll borrow some code from the bot I coded to fix Problem 1, which looks a little like this:
main = do
       cd "/home/gwern/bin/pywikipedia/"
       {- The list comprehension is to prevent us from operating on page titles which will
          cause enormous numbers of redirects to be created. It's what comes before that matters. -}
       list <- liftM (\bs -> [n | n <- bs, length n <= (2^4)]) $ liftM words $ getContents
Redirect-bot.hs assumes here that stdin is the source of the list of names. This long (>9,555,933 lines) list takes this form:
...
Delehaye
Delekovac
Delele
Delemont
Delemont_(Jura)
Delemont_JU
Delena
Delena_cancerides
Delenda
Deleni
...
Notice that spaces are being escaped as underscores, and that each page name is on a separate line. We are getting all this information from a special file kindly provided by Wikipedia precisely for people like us, who don’t want to spend the effort to parse Special:AllPages to get the live list of article names; parsing AllPages is how all the serious Wikipedia bots work, but relying on old database dumps gets us 90% of the functionality at 10% of the effort. Go visit https://dumps.wikimedia.org/enwiki/latest/ and look for a file called all_titles_in_ns0.gz. When it’s downloaded and uncompressed to a ~60M file, it provides suitable fodder.
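As a quick sanity check that the download and decompression worked, a few lines of Haskell suffice (the filename is an assumption here; use whatever you saved the decompressed dump as):

-- Count the titles and peek at the first few; with the full dump this should
-- report something on the order of 9.5 million lines.
main :: IO ()
main = do titles <- fmap lines (readFile "enwiki-latest-all-titles-in-ns0")
          print (length titles)
          mapM_ putStrLn (take 5 titles)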
So, we have getContents lazily slurping in names of articles from standard input, and we have words (a pure non-IO function, but liftM lets us use it in the IO monad) breaking one long newline-filled string into a list of strings.
import Control.Monad (liftM) -- Don't forget your imports!

main :: IO ()
main = do
        articlelist <- liftM words $ getContents
Now what? Well, a list of items, each of which we want to perform the same operation on, cries out for us to use a (monadic) map on it.
urllist <- mapM fetchArticleURLs articlelist
But what is this mystery function fetchArticleURLs? Well, it will take as its single argument a String, and it must return a list of Strings, since any given article might have many different external links in it. And since we certainly aren’t loading into memory an immutable copy of Wikipedia (regardless of whether we download a full dump of the articles or decide to fetch articles over the Internet), the result will be promoted into the IO monad. So the type signature must be:
fetchArticleURLs :: String -> IO [String]
Purely arbitrarily, we’ll have fetchArticleURLs work on pages from the live English Wikipedia; but you could probably also go the route of working from an offline database dump, which would have the advantage of halving network requests and parsing out URLs much faster (since there’d be no need to request individual pages over the network).
It’s not too hard to see how to get the URL for any given article ("https://en.wikipedia.org/wiki/"++article), nor how to use Network.HTTP to download the HTML file (couldn’t be more than 10 lines)—but reliably parsing that HTML to extract genuine external links doesn’t seem trivial at all! We could go with some sort of clumsy regexp matching on “http://”, but hit-or-miss regular expressions just aren’t The Haskell Way.
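For the curious, here is roughly what that easy half might look like: a sketch only, using Network.HTTP’s simpleHTTP (fetchArticleHTML is a made-up name, it only speaks plain http://, and it skips all error handling), and one which TagSoup will shortly make unnecessary anyway:

import Network.HTTP (simpleHTTP, getRequest, getResponseBody)

-- Fetch the raw HTML of one article over plain HTTP.
fetchArticleHTML :: String -> IO String
fetchArticleHTML article = do
    response <- simpleHTTP (getRequest ("http://en.wikipedia.org/wiki/" ++ article))
    getResponseBody response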
Stumped, the thing to do is to go bum around on the Haskell wiki and #haskell, where some excellent fellow will eventually inform us that “Hey, you know, parsing Wikipedia pages’ HTML to extract out all external links kind of reminds me of one of the examples for Neil Mitchell’s tagsoup library. You should go check it out.”
Indeed, we need some sort of library to parse HTML for us and maybe handle the network business, and TagSoup fits the bill admirably. One of the examples in Example.hs (googleTechNews) does almost precisely what we want, except for Google Tech News. The example in its entirety:
googleTechNews = do
    tags <- liftM parseTags $ openURL "http://news.google.com/?ned=us&topic=t"
    let links = mapMaybe extract $ sections match tags
    putStr $ unlines links
  where extract xs = maybeTagText (xs !! 2)
        match = Match.tagOpenAttrNameLit "a" "id"
                    (\value -> "r" `isPrefixOf` value && 'i' `notElem` value)
Reading this, one is struck by openURL; we don’t even need to read the documentation to know what it does. We already have the beginning of our fetchArticleURLs definition:
import Text.HTML.Download (openURL)

fetchArticleURLs :: String -> IO [String]
fetchArticleURLs article = undefined $ openURL ("https://en.wikipedia.org/wiki/" ++ article)
We could replace undefined with whatever parsing we cook up, but we might want to use it somewhere else, and besides, it’s bad practice to muddy up a function that could be pure (like our hypothetical extractURLs function) with IO. Separation of concerns, and all that.
So we’re up to this:
fetchArticleURLs article = liftM extractURLs $ openURL("https://en.wikipedia.org/wiki/"++article)
extractURLs :: String -> [String]
extractURLs arg = undefined
The definition of extractURLs is a bit magical, since we don’t immediately understand the entire hierarchy of types and constructors which TagSoup operates on:
import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
import Data.List (isPrefixOf)

extractURLs :: String -> [String]
extractURLs arg = [x | TagOpen "a" atts <- (parseTags arg),
                       (_,x) <- atts,
                       "http://" `isPrefixOf` x]
Alright. This list comprehension has 3 parts, which we will read from the top down. We start with our arg, a String. It immediately passes to parseTags, which turns it into a [Tag]—notice this is not a tree or hierarchical structure like the one our String actually represents; this is why it’s ‘soup’.
A Tag can be a lot of things, but we quickly realize we only want TagOpen. The "a" is there because a hyperlink actually begins with <a href=...>.
But how does TagOpen "a" atts do anything? We can think of a list comprehension as syntax sugar over a bunch of filter and map calls. In this case, the pattern TagOpen "a" atts is acting as a filter: any entry which doesn’t match it (because it isn’t a TagOpen, or isn’t an "a" tag) is silently omitted from the intermediate list, and for each entry which does match, atts is bound to that tag’s attribute list.
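To make that concrete, here is the first generator written out by hand, as a sketch of the idea rather than the actual desugaring GHC performs (anchorAttributes is a made-up name):

import Text.HTML.TagSoup (parseTags, Tag(TagOpen), Attribute)

-- Keep only the attribute lists of opening <a> tags; anything that fails the
-- pattern match is silently dropped, just as in the list comprehension.
anchorAttributes :: String -> [[Attribute String]]
anchorAttributes arg = concatMap keep (parseTags arg)
  where keep (TagOpen "a" atts) = [atts]
        keep _                  = []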
The second line works on what survives. For each remaining TagOpen, (_,x) <- atts runs through its [Attribute] (a list of (name, value) pairs) and pulls out the value of each pair, much like a map of snd; the first entry in each pair is a name like "href", and the second is the URL the href points to. And this gives us a [String] of URLs! But when we test out our two-liner, we realize to our dismay we are getting junk results like ["/wiki/Japanese_poetry",..]: relative links to other wiki pages. A quick isPrefixOf test fixes that, and that finishes our list comprehension.
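A quick check on a made-up scrap of HTML confirms the behavior (testExtract is just a throwaway binding):

testExtract :: [String]
testExtract = extractURLs "<a href=\"/wiki/Japanese_poetry\">waka</a><a href=\"http://example.com/\">ref</a>"
-- evaluates to ["http://example.com/"]: the relative /wiki/ link is dropped, the external link kept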
Oy. So we wrote extractURLs, which was used in fetchArticleURLs, which was used in main. Let’s backtrack a bit; we had left main at:
main = do
        articlelist <- liftM words $ getContents
        urllist <- mapM fetchArticleURLs articlelist
By the time urllist is generated, we’ve got a list of all the external links (strictly, a list of lists, one per article, but we’ll flatten that with concat later). Now we can once again indulge the desire to map some sort of function over it. Another map, another hypothetical function: for convenience’s sake, we’ll reuse openURL even though we don’t want its output. What we really want is something like this:
archiveURL :: String -> IO ()
The simplest way to define it is this:
archiveURL :: String -> IO String -- Reality
archiveURL url = openURL("https://www.webcitation.org/archive?url=" ++ url ++ "&email=foo@bar.com")
But we can improve on this! We are using mapM, which will go to all the trouble of retaining all the output from archiveURL being run on the dozens, hundreds, thousands, or millions of entries in urllist.
Can’t we be more efficient somehow? This sort of situation is exactly where naming conventions come in handy. I happen to remember that just as foo' denotes some stricter version of a function foo, foo_ denotes a version which throws away its output. One blindly tries mapM_ instead, and it works!
We could have been more systematic about it, too. We know that mapM :: (a -> m b) -> [a] -> m [b], and we want to get rid of the output, turning the [b] into nothing, not even a list, which means (); we throw the new type signature, (a -> m b) -> [a] -> m (), into Hoogle, and the very first result is mapM_.
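Side by side (a throwaway demo, not part of the bot), the only difference is whether the results are kept around:

-- mapM  :: Monad m => (a -> m b) -> [a] -> m [b]   -- accumulates every result
-- mapM_ :: Monad m => (a -> m b) -> [a] -> m ()    -- runs the actions, keeps nothing
demo :: IO ()
demo = do lengths <- mapM (return . length) ["http://a.example", "http://b.example"]
          print lengths                                            -- [16,16]
          mapM_ putStrLn ["http://a.example", "http://b.example"]  -- output only, no list built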
We could also throw in a speculative request to Alexa, which handles archiving for the Internet Archive: supposedly any URL visited by someone using their toolbar will be spidered, and the spidered copy then given to the Internet Archive. We can mimic the request the toolbar makes. It turns out to be not that difficult to do, not nearly as difficult as Alexa could have made it. (Ironically, my first attempt was based on trying to read the JavaScript source code of the Alexa toolbar, which I followed in detail—and the moment I checked an actual request URL, I realized I had gotten it completely wrong.) This code, for brevity’s sake, will not be in the final WP archive bot code, but it is included in the general-purpose archiver daemon, which will archive any URLs in a file in multiple ways and can be seen as the generalized version of this WP archive bot.
With the handy Firefox extension Live HTTP Headers, we can easily snoop on the toolbar’s traffic. I turned it on, visited a few domains with short names, saved the traffic to a file, and grepped for alexa.com and http://:
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
&cdt=rq%3D0%26wid%3D7005&ref=&url=http%3A%2F%2Fwww.cnn.com%2F
http://widgets.alexa.com/traffic/sparky/?v=1&url=cnn.com
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
&cdt=rq%3D1%26wid%3D7005%26ttl%3D1400&ref=&url=http%3A%2F%2Fwww.iana.org%2Fdomains%2Fexample%2F
http://widgets.alexa.com/traffic/sparky/?v=1&url=iana.org
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
&cdt=rq%3D2%26wid%3D7005%26ttl%3D989&ref=&url=http%3A%2F%2Fwebcitation.org%2F
http://widgets.alexa.com/traffic/sparky/?v=1&url=webcitation.org
http://widgets.alexa.com/traffic/rankr/?ref=http%3A%2F%2Fwebcitation.org%2F
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
&cdt=rq%3D3%26wid%3D7005%26ttl%3D2241&ref=&url=http%3A%2F%2Fwww.archive.org%2Findex.php
http://widgets.alexa.com/traffic/sparky/?v=1&url=archive.org
We can rule out widgets.alexa.com—the URL structure and the name indicate that these requests are being made by the part of the toolbar that shows the little traffic-vs-time graph in the Firefox GUI. So:
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns&cdt=rq%3D0%26 /
wid%3D7005&ref=&url=http%3A%2F%2Fwww.cnn.com%2F
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns&cdt=rq%3D1%26 /
wid%3D7005%26ttl%3D1400&ref=&url=http%3A%2F%2Fwww.iana.org%2Fdomains%2Fexample%2F
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns&cdt=rq%3D2%26 /
wid%3D7005%26ttl%3D989&ref=&url=http%3A%2F%2Fwebcitation.org%2F
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns&cdt=rq%3D3%26 /
wid%3D7005%26ttl%3D2241&ref=&url=http%3A%2F%2Fwww.archive.org%2Findex.php
To split the last URL:
- http://data.alexa.com/data/SbADd155Tq0000
- ?cli=10
- &ver=spkyf-1.5.0
- &dat=ns
- &cdt=rq%3D2%26wid%3D7005%26ttl%3D989
- &ref=
- &url=http%3A%2F%2Fwebcitation.org%2F (the URL-encoded form of https://webcitation.org/)
We can hardwire #1 & #3; I am guessing that we can hardwire #2 and #4; #6 can be left empty since none of the examples used it; #7 is just the payload; but unfortunately, #5 seems a little more difficult. We have 4 examples here for the cdt field:
rq%3D0%26wid%3D7005
rq%3D1%26wid%3D7005%26ttl%3D1400
rq%3D2%26wid%3D7005%26ttl%3D989
rq%3D3%26wid%3D7005%26ttl%3D2241
What is this gibberish? ttl looks like a field name (‘time to live’, perhaps?); it’s URL-encoded gunk, so let’s decode it:
rq=0&wid=7005
rq=1&wid=7005&ttl=1400
rq=2&wid=7005&ttl=989
rq=3&wid=7005&ttl=2241
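The decoding itself is trivial; a minimal percent-decoder looks something like this (a sketch; Network.HTTP’s urlDecode, among others, would do the same job):

import Data.Char (chr)
import Numeric (readHex)

-- Turn %XX escapes back into characters, e.g.
-- unescape "rq%3D2%26wid%3D7005%26ttl%3D989" == "rq=2&wid=7005&ttl=989"
unescape :: String -> String
unescape ('%':a:b:rest) = case readHex [a,b] of
                            [(n,"")] -> chr n : unescape rest
                            _        -> '%' : a : b : unescape rest
unescape (c:rest)       = c : unescape rest
unescape []             = []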
- rq is presumably short for “request”, and an incrementing integer suggests that it’s counting how many pages one has visited. For the archiver bot, we can hardwire it as ‘0’ because there will be a good 20 seconds between each request (due to the WebCite timeout).
- wid is troubling because it’s constant, and a fairly large number. It doesn’t seem large enough to be a user ID—surely the toolbar has had more than 7000 users, since the toolbar page says it has had 1.6 million downloads. Is it a user ID or not? We’ll leave it for now.
- ttl is also troubling because it changes but isn’t even in the first request. Does it stand for ‘time to live’? Does it vary if I load the same page many times? Is it a measure of how fast pages load or something (a metric very important to search engines, which rank faster sites higher)? Can we get away with omitting it?
We need more information on ttl and wid. So let’s visit a bunch more URLs and see whether more data says anything:
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
&cdt=rq%3D4%26wid%3D15698%26ttl%3D1830&ref=&url=http%3A%2F%2Fwww.reddit.com%2F
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
&cdt=rq%3D5%26wid%3D15698%26ttl%3D425&ref=&url=http%3A%2F%2Fgwern.net%2F
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
&cdt=rq%3D6%26wid%3D15698%26ttl%3D1614&ref=&url=http%3A%2F%2Fwww.reddit.com%2F
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
&cdt=rq%3D7%26wid%3D15698%26ttl%3D376&ref=&url=http%3A%2F%2Fgwern.net%2F
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
&cdt=rq%3D8%26wid%3D15698%26ttl%3D1468&ref=&url=http%3A%2F%2Fwww.reddit.com%2F
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
&cdt=rq%3D9%26wid%3D15698%26ttl%3D2092&ref=&url=http%3A%2F%2Flesswrong.com%2F
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
&cdt=rq%3D10%26wid%3D15698%26ttl%3D1717&ref=&url=http%3A%2F%2Fwww.overcomingbias.com%2F
http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
&cdt=rq%3D11%26wid%3D15698%26ttl%3D2010&ref=http%3A%2F%2Fwww.overcomingbias.com%2F
&url=http%3A%2F%2Fwww.overcomingbias.com%2F2011%2F06%2Frevised-growth-claim.html%23comments
Filter and URL-decode the ttl field:
ttl=1830
ttl=425
ttl=1614
ttl=376
ttl=1468
ttl=2092
ttl=1717
ttl=2010
The smallest ttl seen in any request is 376, the largest is 2241. This is a range which smells suspiciously like a page load time—few sites load much faster than 0.25 seconds, and most clock in at a second or two. Combined with the observation that the first sample URL doesn’t have a ttl at all, I think we can safely omit the ttl field.
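(A quick check bears out that range; the list here is just the ttl values transcribed from the two batches above.)

observedTTLs :: [Int]
observedTTLs = [1400, 989, 2241, 1830, 425, 1614, 376, 1468, 2092, 1717, 2010]
-- (minimum observedTTLs, maximum observedTTLs) == (376, 2241)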
And wid? We speculated it’s some kind of user ID. In this second batch, the field is wid%3D15698, or wid=15698 decoded. So the wid is not consistent across Firefox sessions, but it is consistent across multiple page loads. This suggests to me that Alexa is only aggregating URLs by a ‘browser session’ ID (maybe it originally tracked individual ‘windows’, hence the ‘w’ part of the ID). We can probably just generate a random number between 1000 and 20000 every time we archive a URL. So, all told, our pseudocode goes like this:
"http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns&cdt=" ++
escape ("rq=0&wid=" ++ show randomint ++ "&ref=&url=" ++ url)
How do we get our randomint and our escape (URL-encoding) function? I have had to URL-encode strings in the past, for XMonad’s XMonad.Actions.Search module:
escape :: String -> String
escape = concatMap escapeURIChar
  where escapeURIChar :: Char -> String
        escapeURIChar c | isAscii c && isAlphaNum c = [c]
                        | otherwise = concatMap (printf "%%%02X") $ encode [c]
        -- ('encode', from XMonad's imports, turns the Char into its UTF-8 bytes before percent-encoding)
We’ll do the random IO with System.Random, hide it internally inside the Alexa toolbar archive function (since that function has to be in IO anyway), and reuse openURL:
import System.Random (getStdGen, randomR)
import Text.HTML.Download (openURL)
import Data.Char (isAlphaNum, isAscii)
import Text.Printf (printf)

a :: String -> IO ()
a url = do gen <- getStdGen
           let rint = fst $ randomR (1000::Int,20000) gen
           let payload = escape ("wid=" ++ show rint ++ "&ref=&url=" ++ url)
           print $ "http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns&cdt=rq=0&"
                   ++ payload
  where escape :: String -> String
        escape = concatMap escapeURIChar
        escapeURIChar :: Char -> String
        escapeURIChar c | isAscii c && isAlphaNum c = [c]
                        | otherwise = concatMap (printf "%%%02X") [c]
A better name would be alexaToolbar. We’ll test it: alexaToolbar "https://www.google.com" evaluates to:
"http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&dat=ns
&cdt=rq%3D0%26wid%3D%2819413%2C218941083%2040692%29%26ref%3D%26url%3Dhttp%3A%2F%2Fwww%2Egoogle%2Ecom%2F"
We run it in curl, and we get:
<?xml version="1.0" encoding="UTF-8"?>
<ALEXA VER="0.9" URL="404" HOME="0" AID="=">
</ALEXA>
Is that right? Let’s see what the very first URL example said, since as far as we could tell, the URL examples should be indefinitely reusable:
<?xml version="1.0" encoding="UTF-8"?>
<ALEXA VER="0.9" URL="cnn.com/" HOME="0" AID="=">
<RLS PREFIX="http://" more="92">
<RL HREF="foxnews.com/" TITLE="Fox News Channel"/>
<RL HREF="abcnews.go.com/" TITLE="Abc News: Online News, Breaking News, Feature Stories And More"/>
<RL HREF="www.cbsnews.com/" TITLE="CBS News"/>
<RL HREF="news.bbc.co.uk/" TITLE="BBC News"/>
<RL HREF="cnnenespanol.com/entv" TITLE="CNNenEspanol"/>
<RL HREF="nytimes.com/" TITLE="The New York Times"/>
<RL HREF="www.msnbc.com/" TITLE="Msnbc"/>
<RL HREF="news.google.com/?output=rss" TITLE="Google News"/>
<RL HREF="www.microsite.reuters.com/rss/worldNews" TITLE="Reuters - World News"/>
<RL HREF="www.pbs.org/" TITLE="Pbs"/>
<RL HREF="fullcoverage.yahoo.com/" TITLE="fullcoverage.yahoo.com/"/>
</RLS>
<SD TITLE="A" FLAGS="DMOZ" HOST="cnn.com">
<TITLE TEXT="CNN - Cable News Network"/>
<ADDR STREET="1 CNN Center, One CNN Center" CITY="Atlanta" STATE="GA" ZIP="30303" COUNTRY="US" />
<CREATED DATE="22-Sep-1993" DAY="22" MONTH="09" YEAR="1993"/>
<PHONE NUMBER="404.827.1500"/>
<OWNER NAME="Cable News Network"/>
<EMAIL ADDR="TMGROUP@turner.com"/>
<DOS>
<DO DOMAIN="insidersadvantage.com" TITLE="insidersadvantage.com"/>
</DOS>
<TICKER SYMBOL="AOL"/>
<LANG LEX="en"/>
<LINKSIN NUM="77443"/>
<SPEED TEXT="1451" PCT="49"/>
<REVIEWS AVG="3.0" NUM="53"/>
<CHILD SRATING="0"/>
</SD>
<KEYWORDS>
<KEYWORD VAL="Networks"/>
<KEYWORD VAL="Cable"/>
<KEYWORD VAL="CNN News Group"/>
</KEYWORDS><DMOZ>
<SITE BASE="cnn.com/" TITLE="CNN Interactive" DESC="News, weather, sports, and services including
e-mail news alerts and downloadable audio/video reports.">
<CATS>
<CAT ID="Top/News" TITLE="English/News" CID="998601"/>
<CAT ID="Top/Arts/Television/News" TITLE="Television/News" CID="395700"/>
<CAT ID="Top/Arts/Television/Networks/Cable/CNN" TITLE="Cable/CNN" CID="392942"/>
</CATS>
</SITE>
</DMOZ>
<SD>
<POPULARITY URL="cnn.com/" TEXT="49"/>
<REACH RANK="42"/>
<RANK DELTA="+1"/>
</SD>
</ALEXA>
Well, drat. That looks very different. What could the error be? We compare our generated URL with an original, and they match only up to the wid parameter and what follows it:
wid%3D7005&ref=&url=http%3A%2F%2Fwww.cnn.com%2F
wid%3D%288260%2C2113025614%2040692%29%26ref%3D%26url%3Dhttp%3A%2F%2Fwww%2Ecnn%2Ecom
So maybe we need to escape only the URL:
let payload = "rq=0&wid=" ++ show rint ++ "&ref=&url=" ++ escape url
and now alexaToolbar "https://www.cnn.com" evaluates to:
"http://data.alexa.com/data/SbADd155Tq0000?cli=10&ver=spkyf-1.5.0&
dat=ns&cdt=rq=0&wid=8100&ref=&url=http%3A%2F%2Fwww%2Ecnn%2Ecom"
Which seems to work:
<?xml version="1.0" encoding="UTF-8"?>
<ALEXA VER="0.9" URL="cnn.com/" HOME="0" AID="=">
<RLS PREFIX="http://" more="92">
<RL HREF="foxnews.com/" TITLE="Fox News Channel"/>
<RL HREF="abcnews.go.com/" TITLE="Abc News: Online News, Breaking News, Feature Stories And More"/>
<RL HREF="www.cbsnews.com/" TITLE="CBS News"/>
<RL HREF="news.bbc.co.uk/" TITLE="BBC News"/>
<RL HREF="cnnenespanol.com/entv" TITLE="CNNenEspanol"/>
<RL HREF="nytimes.com/" TITLE="The New York Times"/>
<RL HREF="www.msnbc.com/" TITLE="Msnbc"/>
<RL HREF="news.google.com/?output=rss" TITLE="Google News"/>
<RL HREF="www.microsite.reuters.com/rss/worldNews" TITLE="Reuters - World News"/>
<RL HREF="www.pbs.org/" TITLE="Pbs"/>
<RL HREF="fullcoverage.yahoo.com/" TITLE="fullcoverage.yahoo.com/"/>
</RLS>
<SD TITLE="A" FLAGS="DMOZ" HOST="cnn.com">
<TITLE TEXT="CNN - Cable News Network"/>
<ADDR STREET="1 CNN Center, One CNN Center" CITY="Atlanta" STATE="GA" ZIP="30303" COUNTRY="US" />
<CREATED DATE="22-Sep-1993" DAY="22" MONTH="09" YEAR="1993"/>
<PHONE NUMBER="404.827.1500"/>
<OWNER NAME="Cable News Network"/>
<EMAIL ADDR="TMGROUP@turner.com"/>
<DOS>
<DO DOMAIN="insidersadvantage.com" TITLE="insidersadvantage.com"/>
</DOS>
<TICKER SYMBOL="AOL"/>
<LANG LEX="en"/>
<LINKSIN NUM="77443"/>
<SPEED TEXT="1451" PCT="49"/>
<REVIEWS AVG="3.0" NUM="53"/>
<CHILD SRATING="0"/>
</SD>
<KEYWORDS>
<KEYWORD VAL="Networks"/>
<KEYWORD VAL="Cable"/>
<KEYWORD VAL="CNN News Group"/>
</KEYWORDS><DMOZ>
<SITE BASE="cnn.com/" TITLE="CNN Interactive" DESC="News, weather, sports, and services including
e-mail news alerts and downloadable audio/video reports.">
<CATS>
<CAT ID="Top/News" TITLE="English/News" CID="998601"/>
<CAT ID="Top/Arts/Television/News" TITLE="Television/News" CID="395700"/>
<CAT ID="Top/Arts/Television/Networks/Cable/CNN" TITLE="Cable/CNN" CID="392942"/>
</CATS>
</SITE>
</DMOZ>
<SD>
<POPULARITY URL="cnn.com/" TEXT="49"/>
<REACH RANK="42"/>
<RANK DELTA="+1"/>
</SD>
</ALEXA>
Similarly, if we try “https://www.google.com”, that seems to work as well. Now we can modify the function to be directly usable:
import System.Random (getStdGen, randomR)
import Text.HTML.Download (openURL)
import Data.Char (isAlphaNum, isAscii)
import Text.Printf (printf)

alexaToolbar :: String -> IO ()
alexaToolbar url = do gen <- getStdGen
                      let rint = fst $ randomR (1000::Int,20000) gen
                      let payload = "wid=" ++ show rint ++ "&ref=&url=" ++ escape url
                      _ <- openURL $ "http://data.alexa.com/data/SbADd155Tq0000?\
                                     \cli=10&ver=spkyf-1.5.0&dat=ns&cdt=rq=0&" ++ payload
                      return ()
  where escape :: String -> String
        escape = concatMap escapeURIChar
        escapeURIChar :: Char -> String
        escapeURIChar c | isAscii c && isAlphaNum c = [c]
                        | otherwise = concatMap (printf "%%%02X") [c]
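One caveat worth noting in passing: getStdGen only reads the global generator and never advances it, so every call to alexaToolbar within a single program run will draw the same wid. If that ever matters, randomRIO (which does update the global generator) is the easy fix; a sketch (freshWid is a made-up name):

import System.Random (randomRIO)

-- Drop-in replacement for the getStdGen/randomR pair: a fresh wid on every call.
freshWid :: IO Int
freshWid = randomRIO (1000, 20000)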
We compile the entire program, and it is short. It is sweet. It works, does what we want, has explicit types and import statements, and we’ve reasoned out every bit of it. So we should be pretty happy with it. But I’ve noticed one thing while playing around in GHCi: it’s naive and inefficient. Not only does it eat up ungodly amounts of memory and time, it also seems to be making duplicate requests! The reason for this is that every Wikipedia webpage containing an article carries a number of external links which aren’t part of the article itself: links to the Wikimedia Foundation (the non-profit organization which runs all the Wikipedias and associated projects), links for donations, and so on.
Well, not a hard problem. We want to ensure there are no duplicate entries in our list; one suspects this is a common desire in list processing.
The type signature would be [a] -> [a], but unfortunately, this is far too general: Hoogle will not even put the right function on the first page.
We need to think: any function of this sort must go through the list and compare entries. Compare them how? For greater-than, less-than, etc.? That would imply an Ord constraint, but punching that into Hoogle, we see nothing useful. No, those are useful for sorting, but we only want the entries to be unique, not sorted in any way. We only need ==, which means an Eq constraint. nub is the very first hit for Eq a => [a] -> [a]. (Of course, we could also just have read through Data.List.)
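And nub does exactly what it says on the tin (dedupDemo is just a throwaway binding):

import Data.List (nub)

dedupDemo :: [String]
dedupDemo = nub ["http://a.example/", "http://b.example/", "http://a.example/"]
-- evaluates to ["http://a.example/","http://b.example/"]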
It makes the most sense to apply nub just before we begin mapping archiveURL over urllist, so in it goes:
mapM_ archiveURL (nub urllist)
Well, that’s that! Let’s put it all together:
import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
import Text.HTML.Download (openURL)
import Data.List (isPrefixOf, nub)
import Control.Monad (liftM)

main :: IO ()
main = do articlelist <- liftM words $ getContents
          urllist <- liftM concat $ mapM fetchArticleURLs articlelist
          mapM_ archiveURL (nub urllist)

archiveURL :: String -> IO String
archiveURL url = openURL ("https://www.webcitation.org/archive?url=" ++ url ++ "&email=foo@bar.com")

fetchArticleURLs :: String -> IO [String]
fetchArticleURLs article = liftM extractURLs $ openURL ("https://en.wikipedia.org/wiki/" ++ article)

extractURLs :: String -> [String]
extractURLs arg = [x | TagOpen "a" atts <- (parseTags arg),
                       (_,x) <- atts,
                       "http://" `isPrefixOf` x]
Is this not lovely? Not counting imports or type signatures, it’s 6 lines (10 with imports, 14 all told) all in clear natural Haskell.
What’s that, you say? While ByteStrings have indeed made some difference, it’s marginal?
Fortunately, we can go still further—there’s not a whole lot we can do for the algorithm itself yet, but we can investigate parallelizing matters.
Remember that archiveURL was a bit inefficient because, even with mapM_, each entry was being processed one at a time? Well, is this not Haskell? Aren’t functional programming languages famous for having functions like map which are embarrassingly easy to parallelize? If archiveURL were pure, we’d use Control.Parallel.Strategies, but it’s not.
We’re throwing away the output from openURL inside archiveURL. Ideally we would somehow be mapping a function over urllist which would say, in effect, “Take this single ByteString and go off somewhere by yourself and do whatever needs doing; don’t bother coming back.” In short, we want to fork archiveURL off.
But what function will do this for us? What is its signature? Let us think about it.
Clearly, if it really is going away forever, it cannot be pure, because such a pure function would be useless: if we can only write fork (1+1), how do we know it ever ran at all? So the argument must be some action in IO. Perhaps it is IO [a]? But that’s not good, since the argument doesn’t have to evaluate to a list—what about arguments which mess around with tuples, or sets, or all the other data structures we have? The most general argument type is IO a.
We could imagine a goAway :: (a -> IO a) -> a -> IO (), and many other variants; but we cannot communicate with goAway—that’s the whole point—and so, like a spaceship, it must carry everything it needs: its arguments combined must amount to an IO a. So why insist on any one particular calling scheme? goAway :: IO a -> IO () is more general.
Unfortunately, Hoogle cannot help us here. If we ask for IO a -> IO () or IO a -> IO a or the other plausible types we can think of, we get a wide variety of functions, none of which we want. What we want is forkIO, with the signature IO () -> IO ThreadId. This is a good illustration of why purity is important: it narrows down the range of possibilities. When a function starts with IO actions and ends with IO actions, it could be doing anything, and our search results reflect that. In comparison, a type like (a,b) -> a is so specific that djinn can mechanically generate the function for that type.
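A toy demonstration of the fire-and-forget behavior (not part of the bot; the message and the delay are arbitrary):

import Control.Concurrent (forkIO, threadDelay)

-- forkIO starts its IO () argument on a lightweight thread and returns
-- immediately with a ThreadId, which we are free to ignore.
main :: IO ()
main = do _ <- forkIO (putStrLn "archiving, don't wait up")
          threadDelay 100000 -- pause 0.1s so the forked thread gets a chance to run before main exits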
We change mapM_:
import Control.Concurrent (forkIO)

main = mapM_ (forkIO . archiveURL) =<< (liftM uniq $ mapM fetchArticleURLs
                                         =<< (liftM B.words $ B.getContents))

-- We need return () because otherwise `openURL` would return a String and break forkIO's heart
archiveURL url = openURL ("www.webcitation.org/archive?url=" ++ B.unpack url ++ "&email=foo@bar.com")
                 >> return ()
Much better! I’ve noticed improvements of 20-100%, since we can run on multiple CPUs. That little change was certainly worth it.
One small tweak occurs to me; if we ask a GHC 7 version of Hoogle for a function with the type m a -> m (), it will return a hit named Control.Monad.void, which is our >> return () in functor garb. If void is available, we would rewrite archiveURL like this:
archiveURL url = void (openURL ("www.webcitation.org/archive?url="++B.unpack url++"&email=foo@bar.com"))
But these speed increases have made me greedy—I want to be able to process a few thousand names at once, not a measly few hundred. This should cause us to ponder. Don Stewart has a quite fast program to check the viability of links, urlcheck; it delves pretty heavily into threads and gets great performance but it honestly looks pretty scary and hard to read. Do we have to resort to such nastiness?
Well, there’s a Haskell saying: “the best way to optimize a program is to make it lazier or stricter”.
It’s already strict in a lot of places because of the IO, and going further doesn’t seem like a good option—how can one pipe in thousands of articles if the program is going to strictly and immediately eat all one’s RAM? Laziness is the way to go.
To figure out how to get greater laziness, let’s go through step by step. We know getContents is lazy because it makes use of unsafeInterleaveIO according to the documentation—which is why we can have lazy IO. Let’s hold onto that thought.
Next is B.words. A quick experiment in GHCi (take 100 $ unwords $ repeat "Yes"; that sort of thing) convinces me that words is lazy, as one would expect.
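If we want to poke at words itself rather than unwords, a check along these lines (a toy, obviously, with a throwaway name) also goes through without blowing up, because take only ever demands the first three words and never forces the undefined tail:

lazyWordsCheck :: [String]
lazyWordsCheck = take 3 (words ("Delehaye Delekovac Delele " ++ undefined))
-- evaluates to ["Delehaye","Delekovac","Delele"]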
Next is mapM fetchArticleURLs. Here’s a problem. fetchArticleURLs does IO and so is strict, and mapM means it is little better than a loop over the list—the entire list. There goes our laziness, since the whole list will be processed before being passed on to forkIO and archiveURL. But laziness and IO can co-exist in peace and harmony!
Remember, getContents is lazy, and it achieves this magic using unsafeInterleaveIO. Reading through the documentation, it seems to be safe in certain situations: when the IO target isn’t going to change, isn’t going to change as a result of whatever is done—or when one doesn’t care exactly when the IO is performed.
In our situation, it’s irrelevant whether we access an article at time n or time n + 1. When you’re working on millions of articles, it doesn’t matter much if one article is changed or even deleted. So we can safely use unsafeInterleaveIO! We could stick it in main or in fetchArticleURLs; for convenience’s sake, I put it in the latter.
So now we have:
import System.IO.Unsafe (unsafeInterleaveIO)

fetchArticleURLs article = liftM extractURLs
                             (unsafeInterleaveIO $ openURL ("https://en.wikipedia.org/wiki/" ++ B.unpack article))
Let’s think about the flow now. getContents keeps on delivering a few characters to words, which waits for a newline, at which point it turns those characters into a String and passes them on to mapM, which will immediately run fetchArticleURLs—and fetchArticleURLs will immediately ‘finish’, leaving behind a thunk holding a promise that it’ll run the IO actions (downloading a particular Web page) whenever it’s needed. And that promise gets passed on to nub, which is busy concatenating and then uniquifying the results of those thunks.
So now the program will run in constant space (I notice that even when I pass it 10000 or 20000 names it’ll use a steady 7.6% or so of memory as opposed to a varying ~40% with previous versions running on a few hundred names). It may take a while longer, but this is a small price to pay—now my computer will still be usable while the program is running.
Are we tired of this program yet? I’ve thought of one last and particularly nasty thing to do to speed things up and recover the speed lost to laziness. We have uniq in there to remove duplicate entries and be a little friendlier to WebCite, correct? But if the program is hardwired to always run on Wikipedia pages, then we can just run fetchArticleURLs on a few pages, note down all the duplicate links to help pages and legal notices and so on, and hard-wire in a filter to remove them. This is a hack.
But here is the final, ugly, fast and frugal version. This combines the lazy IO covered previously and adds in the hardwired filter.
module Main (main) where

import Control.Concurrent (forkIO)
import Control.Monad (liftM)
import Data.List (isPrefixOf)
import System.IO.Unsafe (unsafeInterleaveIO)
import Text.HTML.Download (openURL)
import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
import qualified Data.ByteString.Lazy.Char8 as B (ByteString(), getContents, pack, unpack, words)

main :: IO ()
main = mapM_ (forkIO . archiveURL) =<< (liftM uniq $ mapM fetchArticleURLs
                                         =<< (liftM B.words $ B.getContents))
  where uniq :: [[B.ByteString]] -> [B.ByteString]
        -- The hardwired filter replacing nub: these sidebar/footer links appear on
        -- every page, so just drop them instead of deduplicating millions of URLs.
        uniq = filter (`notElem` map B.pack
                 ["https://wikimediafoundation.org/",
                  "https://wikimediafoundation.org/wiki/Deductibility_of_donations",
                  "https://wikimediafoundation.org/wiki/Fundraising",
                  "https://wikimediafoundation.org/wiki/Privacy_policy",
                  "http://www.mediawiki.org/",
                  "http://www.wikimediafoundation.org"]) . concat

fetchArticleURLs :: B.ByteString -> IO [B.ByteString]
fetchArticleURLs article = liftM extractURLs $ unsafeInterleaveIO $
                             openURL ("https://en.wikipedia.org/wiki/" ++ B.unpack article)

extractURLs :: String -> [B.ByteString]
extractURLs arg = map B.pack [x | TagOpen "a" atts <- (parseTags arg),
                                  (_,x) <- atts,
                                  "http://" `isPrefixOf` x]

archiveURL :: B.ByteString -> IO ()
archiveURL url = openURL ("www.webcitation.org/archive?url=" ++ B.unpack url ++ "&email=foo@bar.com")
                 >> return ()