Writing a Wikipedia RSS Link Archive Bot
Archiving using Wikipedia Recent Changes RSS feed (obsolete).
Continuation of the 2009 Haskell Wikipedia link archiving bot tutorial, extending it from operating on a pre-specified list of articles to archiving links live: TagSoup parses the Wikipedia Recent Changes RSS feed for newly-added external links, which are then archived in parallel with WebCite. (Note: these tutorials are obsolete. WebCite is largely defunct, archiving this way is not advised, and WP link archiving is currently handled by Internet Archive-specific plugins by the WMF. For a more general approach suitable for personal use, see the writeup of archiver-bot in Archiving URLs.)
We left Wikipedia Archive Bot at the point where it could take its stdin (newline-delimited article names), download the English Wikipedia article, parse the HTML for hyperlinks, and fire off an archive request to WebCitation.org.
But where does one get the list of names? Wikipedia dumps are occasionally available, but we have to get them manually, and they are irregularly created. Our bot will never archive links in very new pages because by the time they show up in a database dump and actually get processed, the page is no longer ‘very new’. Worse, it tends to be the new pages that most need their hyperlinks backed up in WebCitation, because the Internet Archive has a rolling 6-month blackout; here too, by the time a link is publicly available in the Internet Archive, the article is no longer new. Sufficiently ephemeral links might never even be spidered by the Internet Archive, for that matter.
Task
Dead links are an issue for new articles because often key claims, or their very notability, rest on online newspaper articles (frequently paywalled after a few days) or on blogs which might delete the embarrassing information, or change or disappear for any of a thousand reasons. So we want our bot to be able to handle new pages. This is going to require extensive changes.
Design
As usual, we begin by identifying the ‘pure’ core to our program. We can separate the program into a few pieces of functionality:
1. get the user’s email & target wiki
2. somehow acquire a list of ‘new’ articles by downloading & parsing something on the target wiki
3. download the HTML of those new articles
4. parse said HTML for hyperlinks
5. fire off requests to WebCitation to archive all those hyperlinks
Point #4 is the pure core. A raw webpage is just a long String, and we want a list of URLs, which are Strings too; so:
extractURLs :: String -> [String]
Parsing HTML
As before, our tool of choice is TagSoup. TagSoup supplies parseTags :: String -> [Tag], and then all we need to do is look for any hyperlink which is an HTTP link; easily done with a list comprehension:
import Data.List (isPrefixOf)
import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
extractURLs :: String -> [String]
extractURLs page = [x | TagOpen "a" atts <- parseTags page, (_,x) <- atts, "http://" `isPrefixOf` x]
We could also desugar the list comprehension into something uglier like this:
extractURLs' page = concatMap tagfilter $ parseTags page
    where tagfilter (TagOpen "a" atts) = filter ("http://" `isPrefixOf`) (map snd atts)
          tagfilter _ = []
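As a quick sanity check, we can run it on a made-up HTML fragment (this snippet is illustrative, not from Wikipedia):
demoExtract :: [String]
demoExtract = extractURLs "<p><a href=\"http://example.com/\">x</a> <a href=\"/wiki/Foo\">y</a></p>"
-- == ["http://example.com/"]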
Now we can solve #5. Given extractURLs, we fire off the request using openURL as before. But note that this time we aren’t hardwiring the email WebCitation requires. We’ll be getting it from the user later.
archiveURL :: String -> String -> IO ()
archiveURL email url = openURL url' >> return ()
where url' = "https://www.webcitation.org/archive?url=" ++
escapeURIString isAllowedInURI url ++
"&email=" ++ email
RSS
And point #3 is trivial given the URL to new articles, so we are left with #1 and #2. Now, the first thing that should pop into any good Wikipedian’s head upon being asked how to get new articles is, ‘go to Special:NewPages!’ But how is a poor little bot to deal with that human-centric interface? Fortunately, upon visiting it, we notice the little RSS icon. That’s how we will do it: download the RSS feed for Newpages and parse it.
We will use the feed library to parse the RSS feed. Looking at the types, we see an RSS feed is a list of RSS items, [RSSItem], and each RSSItem has several fields, but we only need each item’s link. So:
items :: String -> [RSSItem]
items feed = let RSSFeed r = fromJust $ parseFeedString feed
             in rssItems $ rssChannel r
parseFeedString is actually String -> Maybe Feed, but we know Wikipedia won’t give us a bad feed, so we bypass error checking and pattern-match straight through the RSSFeed constructor; then we drill down past the metadata into an RSSChannel (an RSS channel is a finer subdivision of the overall feed; you might have 2 channels, 1 for articles & 1 for pictures, or aggregate multiple websites or streams into 1 feed), and then we pull out the RSSItems, and finally the actual URL links:
titles :: [RSSItem] -> [String]
titles itms = map (fromJust . rssItemLink) itms
So we could write:
parseRSS :: String -> [String]
parseRSS str = titles $ items str
And then we plug in the IO:
parseRSSIO :: String -> IO [String]
parseRSSIO str = fmap parseRSS $ openURL str
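To see the pieces fit together, here is a toy check against a hand-written feed fragment (made-up data, not real Special:NewPages output, and assuming the items/titles/parseRSS definitions above; treat the exact result as an assumption about the feed library’s leniency):
demoFeed :: [String]
demoFeed = parseRSS "<rss version=\"2.0\"><channel><title>NewPages</title><link>https://en.wikipedia.org/</link><description>toy feed</description><item><title>Some_Article</title><link>https://en.wikipedia.org/wiki/Some_Article</link></item></channel></rss>"
-- should evaluate to ["https://en.wikipedia.org/wiki/Some_Article"]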
CLI Interface
So #2 is done. Now for #1, interfacing with the user. We’ll just be quick & lazy with some CLI arguments:
main :: IO ()
main = do (email:website) <- getArgs
          if null website
             then archiveBot email "https://en.wikipedia.org/w/index.php?title=Special:NewPages&feed=rss"
             else archiveBot email (head website)
What is ‘archiveBot’? It pulls together all the pieces. We need to download the supplied URL, parse it into further URLs (each being a new article), download each of those, parse them into sublists of HTTP links (each to be archived), and then archive each of those! Fortunately, with a few higher order functions and the little pieces already assembled, we can do this relatively easily:
archiveBot :: String -> String -> IO ()
archiveBot email website = do rss <- parseRSSIO website
                              articles <- mapM openURL rss
                              let urls = concat $ map extractURLs articles
                              mapM (archiveURL email) urls
                              return ()
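As a usage sketch, a one-shot run against the default English Wikipedia NewPages feed would look like this (the email address is a made-up placeholder):
oneShot :: IO ()
oneShot = archiveBot "alice@example.com" "https://en.wikipedia.org/w/index.php?title=Special:NewPages&feed=rss"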
Putting Things Together
Tada! This gives us the completed program:
import Control.Monad (mapM)
import Data.List (isPrefixOf)
import Data.Maybe (fromJust)
import Network.URI (escapeURIString, isAllowedInURI)
import System.Environment (getArgs)
import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
import Text.HTML.Download (openURL)
import Text.Feed.Import (parseFeedString)
import Text.Feed.Types (Feed(RSSFeed))
import Text.RSS.Syntax (rssChannel, rssItems, RSSItem(..))
main :: IO ()
main = do (email:website) <- getArgs
          if null website
             then archiveBot email "https://en.wikipedia.org/w/index.php?title=Special:NewPages&feed=rss"
             else archiveBot email (head website)
archiveBot :: String -> String -> IO ()
archiveBot email website = do rss <- parseRSSIO website
articles <- mapM openURL rss
let urls = concat $ map extractURLs articles
mapM (archiveURL email) urls
return ()
extractURLs :: String -> [String]
extractURLs page = [x | TagOpen "a" atts <- parseTags page, (_,x) <- atts, "http://" `isPrefixOf` x]
items :: String -> [RSSItem]
items feed = let RSSFeed r = fromJust $ parseFeedString feed
in rssItems $ rssChannel r
titles :: [RSSItem] -> [String]
titles itms = map (fromJust . rssItemLink) itms
parseRSS :: String -> [String]
parseRSS str = titles $ items str
parseRSSIO :: String -> IO [String]
parseRSSIO str = fmap parseRSS $ openURL str
archiveURL :: String -> String -> IO ()
archiveURL email url = openURL url' >> return ()
where url' = "https://www.webcitation.org/archive?url=" ++
escapeURIString isAllowedInURI url ++
"&email=" ++ email
Now we have our finished prototype. We can compile it and run it like:
$ wp-archivebot "gwern@gwern.net"
(Installation is just as easy; this bot is available on Hackage, so it’s a cabal install wp-archivebot away.)
Improving
Hlint
We can tweak it further if we want. hlint (Hackage) will tell us that archiveBot could be improved by using mapM_ instead of mapM ... return (), and that concat $ map is better written concatMap. We could probably fuse items, titles, and parseRSS into just one line without hurting readability too much. And as ever, the type signatures are optional, which would save 6 lines.
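Applying those two hlint suggestions to archiveBot gives, roughly:
archiveBot :: String -> String -> IO ()
archiveBot email website = do rss <- parseRSSIO website
                              articles <- mapM openURL rss
                              mapM_ (archiveURL email) (concatMap extractURLs articles)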
Parallelism
How’s the efficiency? Not too bad, but as with the other bot, we’ll find network latency a bottleneck. No reason that the archives couldn’t proceed in parallel:
import Control.Concurrent (forkIO)
archiveURL email url = forkIO (openURL url' >> return ()) >> return () -- url' as before
(As with the previous bot, the >> return () could be replaced with Control.Monad.void with GHC 7.)
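For concreteness, a sketch of that void-based variant (still assuming TagSoup’s openURL and the WebCite request URL from before):
import Control.Concurrent (forkIO)
import Control.Monad (void)
import Network.URI (escapeURIString, isAllowedInURI)
import Text.HTML.Download (openURL)

-- the same parallel archiveURL, with 'void' in place of '>> return ()'
archiveURL :: String -> String -> IO ()
archiveURL email url = void $ forkIO $ void $ openURL url'
    where url' = "https://www.webcitation.org/archive?url=" ++
                 escapeURIString isAllowedInURI url ++
                 "&email=" ++ email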
This only parallelizes the archive request. We could independently parallelize the downloads of the articles, using MVars for synchronization, but now things are getting complex. It’d be better to parallelize per article: have each article be a single thread which downloads the article, parses it for links, and fires off archive requests. With lightweight green threads, this is actually the best way.
This might sound difficult, but here too writing from the bottom up with independent functions makes it easier to reorganize the program. In fact, we only need to modify archiveBot a little bit, factoring out the independent part and wrapping it in a forkIO:
archiveBot :: String -> String -> IO ()
archiveBot email website = do rss <- parseRSSIO website
mapM (forkIO . archiveOne) rss
return ()
where archiveOne :: String -> IO ()
archiveOne article = do text <- openURL article
let urls = extractURLs text
mapM (archiveURL email) urls
return ()
(Hlint will wisely suggest we get rid of the 2 return () statements by just using mapM_ instead of mapM.)
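Taking that suggestion, the forking archiveBot reads roughly:
archiveBot :: String -> String -> IO ()
archiveBot email website = do rss <- parseRSSIO website
                              mapM_ (forkIO . archiveOne) rss
    where archiveOne :: String -> IO ()
          archiveOne article = do text <- openURL article
                                  mapM_ (archiveURL email) (extractURLs text)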
So now if parseRSSIO returns 100 entries, we will fork off 100 archiveOne workers. Combine this with the parallel version of archiveURL, and there would seem to be no real parallelism left to exploit.
Looping
What more can we do? Well, as constructed, the bot is one-shot: it downloads the Newpages RSS feed once, processes it, and exits.
The obvious thing to do is some sort of infinite loop or recursion: make archiveBot call itself once it’s done, and fetch the RSS feed all over again. We can’t wait on some notification, because RSS is a ‘pull’ protocol rather than ‘push’. So we simply loop as fast as we can. (Don’t worry, it’s OK! NewPages doesn’t change as fast as the blinding RecentChanges, but it still changes so fast that by the time we’ve dealt with the latest batch of articles, we need to get a fresh RSS feed anyway.)
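The crudest version of that loop is just archiveBot calling itself, something like this sketch (loopBot is an illustrative name, not part of the final bot):
loopBot :: String -> String -> IO ()
loopBot email website = do archiveBot email website
                           loopBot email website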
In an interesting technical note, one of the factors that enabled the English Wikipedia to survive its exponential expansion, particularly past 2006, is the widespread use of increasingly sophisticated anti-vandalism bots. These anti-vandalism bots are often written as IRC bots, since IRC is a ‘push’ protocol: the few dozen active anti-vandalism bots sit in an IRC channel, and Special:RecentChanges is piped into the channel by a dedicated RSS bot. This way, the anti-vandalism bots can be written as event-driven loops (fetching URLs as they are posted in the channel) and only one bot is constantly reloading the RecentChanges RSS.
Also interestingly, there is already a Haskell program for watching an RSS feed and posting related text on IRC: Don Stewart’s rss2irc. We can steal some of its code, replace parts with our functions, and we wind up with:
reader :: String -> String -> IO ()
reader email url = items url >>= go
where
go old = do new <- items url
-- remove duplicates
let diff = foldl' (flip $ deleteBy matchingTitles) new old
forM_ (take 100 diff) $ \itm ->
case rssItemLink itm of
Nothing -> return ()
Just t -> ignore $ forkIO $ archiveBot email t
go new
matchingTitles :: RSSItem -> RSSItem -> Bool
matchingTitles x y = let title = (fromJust . rssItemTitle) in title x == title y
We will want to avoid repeating articles, though. We could store a persistent Data.Set of every URL we’ve visited, but that’s overkill. We’ll simply take the list difference of the new feed against the old, hence the foldl' and deleteBy.
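To make the idiom concrete, here is a minimal sketch of that fold, specialized to (==) in place of matchingTitles (removeSeen is a hypothetical helper, not part of the bot):
import Data.List (foldl', deleteBy)

-- delete each previously-seen element (first match only) from the new batch
removeSeen :: Eq a => [a] -> [a] -> [a]
removeSeen old new = foldl' (flip (deleteBy (==))) new old
-- removeSeen ["A","B"] ["B","C","D"] == ["C","D"]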
The case-expression is there because apparently there are sometimes entries which don’t link to a new page, and assuming otherwise leads to a crash. I never bothered to check what those entries were; probably articles which had been deleted that quickly.
And as previously mentioned, we might as well collapse all the little RSS functions and inline them alongside go and the new matchingTitles:
-- Actually fetch a RSS feed and turn it from String to [RSSItem]
items :: String -> IO [RSSItem]
items rurl = do s <- get' rurl
let RSSFeed r = fromJust $ parseFeedString s
return $ nubBy matchingTitles $ rssItems $ rssChannel r
And now we have a bot which will loop, reloading the RSS, and not repeating too much work, in only a few more lines of code. Hopefully they’re not much more difficult to read, either.
Rewriting Network Code
If we don’t like using TagSoup’s openURL, we could always import Network.HTTP (getRequest, simpleHTTP) and build our own openURL on top of simpleHTTP . getRequest. Here is the whole bot rewritten against Network.HTTP, starting with a more careful get' wrapper:
import Control.Concurrent (forkIO)
import Control.Monad (liftM, forM_)
import Data.Char (intToDigit)
import Data.List (isInfixOf, isPrefixOf, foldl', deleteBy, nubBy)
import Data.Maybe (fromJust)
import Network.HTTP hiding (port)
import Network.Stream (ConnError)
import Network.URI
import System.Environment (getArgs)
import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
import Text.Feed.Import (parseFeedString)
import Text.Feed.Types (Feed(RSSFeed))
import Text.RSS.Syntax (rssChannel, rssItems, RSSItem(..))
main :: IO ()
main = do args <- getArgs
-- Webcite requires a valid email, and they filter out public
-- emails like mailinator.com. So we demand an email from the user.
let email = head args
-- This is largely intended for the English Wikipedia, so we default to En's NewPages
-- but we let the user override; any second argument is assumed to be a MediaWiki RSS
let url = if length args > 1
then args !! 1
else "https://en.wikipedia.org/w/index.php?title=Special:NewPages&feed=rss"
reader email url
-- inspired by rss2irc
-- | wait on an RSS thread, updating every so often. Each RSS item links to some diff or page,
-- in addition to whatever other content the RSS item may contain (date, summary, etc.)
-- This runs 'archiveBot' on just that link, ignoring the rest.
reader :: String -> String -> IO ()
reader email url = items url >>= go
where
go old = do new <- items url
-- remove duplicates
let diff = foldl' (flip $ deleteBy matchingTitles) new old
forM_ (take 100 diff) $ \itm ->
case rssItemLink itm of
Nothing -> return ()
Just t -> ignore $ forkIO $ archiveBot email t
go new
matchingTitles :: RSSItem -> RSSItem -> Bool
matchingTitles x y = let title = (fromJust . rssItemTitle) in title x == title y
-- Actually fetch a RSS feed and turn it from String to [RSSItem]
items :: String -> IO [RSSItem]
items rurl = do s <- get' rurl
let RSSFeed r = fromJust $ parseFeedString s
return $ nubBy matchingTitles $ rssItems $ rssChannel r
-- | Given the URL of an article, we suck down the HTML, grep it for http:// links,
-- filter out certain links that we don't want to archive (boilerplate links, interwiki links)
-- and then fire off an archive request for each link left.
archiveBot :: String -> String -> IO ()
archiveBot email ls = liftM uniq (fetchArticleURLs ls) >>= mapM_ (archiveURL email)
where uniq :: [String] -> [String] -- So hideous
uniq = filter (\x ->not $ any (flip isInfixOf x) exceptions)
exceptions :: [String]
exceptions = ["wikimediafoundation", "http://www.mediawiki.org/", "wikipedia",
"&curid=", "index.php?title=", "&action="]
-- | Run 'extractURLs' on some page's raw HTML
fetchArticleURLs :: String -> IO [String]
fetchArticleURLs = fmap extractURLs . get'
-- | Use the TagSoup library to extract all the hyperlinks in a page. This is really parsing the HTML,
-- so hopefully there won't be any spurious links.
extractURLs :: String -> [String]
extractURLs arg = [x | TagOpen "a" atts <- parseTags arg, (_,x) <- atts, "http://" `isPrefixOf` x]
-- | WebCitation.org is set up so one can archive a url just by doing a request
-- for 'webcitation.org/archive?url=url&email=email'
-- So it's very easy, given a URL and an email, to archive a page. No need to see what the response was.
archiveURL :: String -> String -> IO ()
archiveURL email url = print url' >> ignore (get' url')
where url' = "https://www.webcitation.org/archive?url=" ++
escapeURIString isAllowedInURI url ++
"&email=" ++ email
-- | Convenient wrapper over the complexity of Network.HTTP. Given a URL, we get the raw HTML.
-- This means we paper over a bunch of possible errors and issues, but we've no time for them!
-- There are links to archive!
get' :: String -> IO String
get' = get . fromJust . parseURI
where get :: URI -> IO String
get uri = do resp <- simpleHTTP (request uri) >>= handleE (error . show)
case rspCode resp of
(2,0,0) -> return (rspBody resp)
_ -> error (httpError resp)
where
httpError resp = showRspCode (rspCode resp) ++ " " ++ rspReason resp
showRspCode (a,b,c) = map intToDigit [a,b,c]
request :: URI -> Request String
request ur = Request{rqURI=ur, rqMethod=GET, rqHeaders=[], rqBody=""}
handleE :: Monad m => (ConnError -> m a) -> Either ConnError a -> m a
handleE h (Left e) = h e
handleE _ (Right v) = return v
-- | Convenience function. 'forkIO' and 'forM_' demand return types of 'IO ()', but most interesting
-- IO functions don't return void. So one adds a call to 'return ()'; this just factors it out.
ignore :: Functor f => f a -> f ()
ignore = fmap $ const ()
get' is more than a little ugly. Let’s keep at it and make more simplifying assumptions, and rename it openURL while we’re at it:
The actual rewrite:
82,98c79,83
< get' :: String -> IO String
< get' = get . fromJust . parseURI
< where get :: URI -> IO String
< get uri = do resp <- simpleHTTP (request uri) >>= handleE (error . show)
< case rspCode resp of
< (2,0,0) -> return (rspBody resp)
< _ -> error (httpError resp)
< where
< httpError resp = showRspCode (rspCode resp) ++ " " ++ rspReason resp
< showRspCode (a,b,c) = map intToDigit [a,b,c]
<
< request :: URI -> Request String
< request ur = Request{rqURI=ur, rqMethod=GET, rqHeaders=[], rqBody=""}
<
< handleE :: Monad m => (ConnError -> m a) -> Either ConnError a -> m a
< handleE h (Left e) = h e
< handleE _ (Right v) = return v
---
> openURL :: String -> IO String
> openURL u = do res <- simpleHTTP $ getRequest u
> case res of
> Left _ -> return ""
> Right y -> return $ rspBody y
Then we do the renames:
50c47
< items rurl = do s <- get' rurl
---
> items rurl = do s <- openURL rurl
67c64
< fetchArticleURLs = fmap extractURLs . get'
---
> fetchArticleURLs = fmap extractURLs . openURL
77c74
< archiveURL email url = print url' >> ignore (get' url')
---
> archiveURL email url = putStrLn url' >> ignore (openURL url')
Our rewrite of get' simplifies our imports:
3d2
< import Data.Char (intToDigit)
6,8c5,6
< import Network.HTTP hiding (port)
< import Network.Stream (ConnError)
< import Network.URI
---
> import Network.HTTP (getRequest, simpleHTTP, rspBody)
> import Network.URI (escapeURIString, isAllowedInURI)
12d9
< import Text.HTML.Download (openURL)
The final version after those changes:
import Control.Concurrent (forkIO)
import Control.Monad (liftM, forM_)
import Data.List (isInfixOf, isPrefixOf, foldl', deleteBy, nubBy)
import Data.Maybe (fromJust)
import Network.HTTP (getRequest, simpleHTTP, rspBody)
import Network.URI (escapeURIString, isAllowedInURI)
import System.Environment (getArgs)
import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
import Text.Feed.Import (parseFeedString)
import Text.Feed.Types (Feed(RSSFeed))
import Text.RSS.Syntax (rssChannel, rssItems, RSSItem(..))
main :: IO ()
main = do args <- getArgs
-- Webcite requires a valid email, and they filter out public
-- emails like mailinator.com. So we demand an email from the user.
let email = head args
-- This is largely intended for the English Wikipedia, so we default to En's NewPages
-- but we let the user override; any second argument is assumed to be a MediaWiki RSS
let url = if length args > 1
then args !! 1
else "https://en.wikipedia.org/w/index.php?title=Special:NewPages&feed=rss"
reader email url
-- inspired by rss2irc
-- | wait on an RSS thread, updating every so often. Each RSS item links to some diff or page,
-- in addition to whatever other content the RSS item may contain (date, summary, etc.)
-- This runs 'archiveBot' on just that link, ignoring the rest.
reader :: String -> String -> IO ()
reader email url = items url >>= go
where
go old = do new <- items url
-- remove duplicates
let diff = foldl' (flip $ deleteBy matchingTitles) new old
forM_ (take 100 diff) $ \itm ->
case rssItemLink itm of
Nothing -> return ()
Just t -> ignore $ forkIO $ archiveBot email t
go new
matchingTitles :: RSSItem -> RSSItem -> Bool
matchingTitles x y = let title = (fromJust . rssItemTitle) in title x == title y
-- Actually fetch a RSS feed and turn it from String to [RSSItem]
items :: String -> IO [RSSItem]
items rurl = do s <- openURL rurl
let RSSFeed r = fromJust $ parseFeedString s
return $ nubBy matchingTitles $ rssItems $ rssChannel r
-- | Given the URL of an article, we suck down the HTML, grep it for http:// links,
-- filter out certain links that we don't want to archive (boilerplate links, interwiki links)
-- and then fire off an archive request for each link left.
archiveBot :: String -> String -> IO ()
archiveBot email ls = liftM uniq (fetchArticleURLs ls) >>= mapM_ (archiveURL email)
where uniq :: [String] -> [String] -- So hideous
uniq = filter (\x ->not $ any (flip isInfixOf x) exceptions)
exceptions :: [String]
exceptions = ["wikimediafoundation", "http://www.mediawiki.org/", "wikipedia", "creativecommons.org",
"&curid=", "index.php?title=", "&action="]
-- | Run 'extractURLs' on some page's raw HTML
fetchArticleURLs :: String -> IO [String]
fetchArticleURLs = fmap extractURLs . openURL
-- | Use the TagSoup library to extract all the hyperlinks in a page. This is really parsing the HTML,
-- so hopefully there won't be any spurious links.
extractURLs :: String -> [String]
extractURLs arg = [x | TagOpen "a" atts <- parseTags arg, (_,x) <- atts, "http://" `isPrefixOf` x]
-- | WebCitation.org is set up so one can archive a url just by doing a request
-- for 'webcitation.org/archive?url=url&email=email'
-- So it's very easy, given a URL and an email, to archive a page. No need to see what the response was.
archiveURL :: String -> String -> IO ()
archiveURL email url = putStrLn url' >> ignore (openURL url')
where url' = "https://www.webcitation.org/archive?url=" ++
escapeURIString isAllowedInURI url ++
"&email=" ++ email
-- | Convenient wrapper over the complexity of Network.HTTP. Given a URL, we get the raw HTML.
-- This means we paper over a bunch of possible errors and issues, but we've no time for them!
-- There are links to archive!
openURL :: String -> IO String
openURL u = do res <- simpleHTTP $ getRequest u
case res of
Left _ -> return ""
Right y -> return $ rspBody y
-- | Convenience function. 'forkIO' and 'forM_' demand return types of 'IO ()', but most interesting
-- IO functions don't return void. So one adds a call to 'return ()'; this just factors it out.
-- Redundant with GHC 7's 'Control.Monad.void'.
ignore :: Functor f => f a -> f ()
ignore = fmap $ const ()