Internet Search Tips

A description of advanced tips and tricks for effective Internet research of papers/books, with real-world examples.

Over time, I developed a certain google-fu and expertise in finding references, papers, and books online. I start with the standard tricks like Boolean queries and keyboard shortcuts, and go through the flowchart for how to search, modify searches for hard targets, penetrate paywalls, request jailbreaks, scan books, monitor topics, and host documents. Some of these tricks are not well-known, like checking the Internet Archive (IA) for books.

I try to write down my search workflow, and give general advice about finding and hosting documents, with demonstration case studies.

Google-fu search skill is something I’ve prided myself on ever since elementary school, when the librarian challenged the class to find things in the almanac; not infrequently, I’d win. And I can still remember the exact moment it dawned on me in high school that much of the rest of my life would be spent dealing with searches, paywalls, and broken links. The Internet is the greatest almanac of all, and to the curious, a never-ending cornucopia, so I am sad to see many fail to find things after a cursory search, or not look at all. For most people, if it’s not the first hit in Google/Google Scholar, it doesn’t exist. Below, I reveal my best Internet search tricks and try to provide a rough flowchart of how to go about an online search, explaining the subtle tricks and tacit knowledge of search-fu.

Roughly, we need to have proper tools to create an occasion for a search: we cannot search well if we avoid searching at all. Then each search will differ by which search engine & type of medium we are searching—they all have their own quirks, blind spots, and ways to modify a failed search. Often, we will run into walls, each of which has its own circumvention methods. But once we have found something, we are not done: we would often be foolish & short-sighted if we did not then make sure it stayed found. Finally, we might be interested in advanced topics like ensuring in advance resources can be found in the future if need be, or learning about new things we might want to then go find. To illustrate the overall workflow & provide examples of tacit knowledge, I include many Internet case studies of finding hard-to-find things.

Papers

Request

Human flesh search engine. Last resort: if none of this works, there are a few places online you can request a copy (however, they will usually fail if you have exhausted all previous avenues):

Finally, you can always try to contact the author. This only occasionally works for the papers I have the hardest time with, since they tend to be old ones where the author is dead or unreachable (anything an author has published since 1990 will usually have been digitized somewhere), but it’s easy to try.

Post-Finding

After finding a fulltext copy, you should find a reliable long-term link/place to store it and make it more findable (remember—if it’s not in Google/Google Scholar, it doesn’t exist!):

  • Never Link Unreliable Hosts:

    • LG/SH: Always operate under the assumption they could be gone tomorrow. (As my uncle found out with Library.nu shortly after paying for a lifetime membership!) There are no guarantees either one will be around for long under their legal assaults or the behind-the-scenes dramas, and no guarantee that they are being properly mirrored or will be restored elsewhere.

      When in doubt, make a copy. Disk space is cheaper every day. Download anything you need and keep a copy of it yourself and, ideally, host it publicly.

    • NBER: never rely on a papers.nber.org/tmp/ or psycnet.apa.org URL, as they are temporary. (SSRN is also undesirable due to making it increasingly difficult to download, but it is at least reliable.)

    • Scribd: never link Scribd; they are a scummy website which impedes downloads, and anything on Scribd usually first appeared elsewhere anyway. (In fact, if you run into anything vaguely useful-looking which exists only on Scribd, you’ll do humanity a service if you copy it elsewhere just in case.)

    • RG: avoid linking to ResearchGate (compromised by new ownership & PDFs get deleted routinely, apparently often by authors) or Academia.edu (the URLs are one-time and break)

    • high-impact journals: be careful linking to Nature.com or Cell (if a paper is not explicitly marked as Open Access, even if it’s available, it may disappear in a few months!); similarly, watch out for wiley.com, tandfonline.com, jstor.org, springer.com, springerlink.com, & mendeley.com, who pull similar shenanigans.

    • ~/: be careful linking to academic personal directories on university websites (often noticeable by the Unix convention .edu/~user/ or by directories suggestive of ephemeral hosting, like .edu/cs/course112/readings/foo.pdf); they have short half-lives.

    • ?token=: beware any PDF URL with a lot of trailing garbage in the URL such as query strings like ?casa_token or ?cookie or ?X (or hosted on S3/AWS); such links may or may not work for other people but will surely stop working soon. (Academia.edu, Nature, and Elsevier are particularly egregious offenders here.)

  • PDF Editing: if a scan, it may be worth editing the PDF to crop the edges, threshold to binarize it (which, for a bad grayscale or color scan, can drastically reduce filesize while increasing readability), and OCR it.

    I use gscan2pdf but there are alternatives worth checking out.

  • Check & Improve Metadata.

    Adding metadata to papers/books is a good idea because it makes the file findable in G/GS (if it’s not online, does it really exist?) and helps you if you decide to use bibliographic software like Zotero in the future. Many academic publishers & LG are terrible about metadata, and will not include even title/author/DOI/year.

    PDFs can be easily annotated with metadata using ExifTool: exiftool -All prints all metadata, and the metadata can be set individually using similar fields.

    For papers hidden inside volumes or other files, you should extract the relevant page range to create a single relevant file. (For extraction of PDF page-ranges, I use pdftk, eg: pdftk 2010-davidson-wellplayed10-videogamesvaluemeaning.pdf cat 180-196 output 2009-fortugno.pdf. Many publishers insert a spam page as the first page. You can drop that easily with pdftk INPUT.pdf cat 2-end output OUTPUT.pdf, but note that PDFtk may drop all metadata, so do that before adding any metadata. To delete pseudo-encryption or ‘passworded’ PDFs, do pdftk INPUT.pdf input_pw output OUTPUT.pdf; PDFs using actual encryption are trickier but can often be beaten by off-the-shelf password-cracking utilities.) For converting JPG/PNGs to PDF, one can use ImageMagick for <64 pages (convert *.png foo.pdf) but beyond that one may need to convert them individually & then join the resulting PDFs (eg. for f in *.png; do convert "$f" "${f%.png}.pdf"; done && pdftk *.pdf cat output foo.pdf or join with pdfunite *.pdf foo.pdf.)

    I try to set at least title/author/DOI/year/subject, and stuff any additional topics & bibliographic information into the “Keywords” field. Example of setting metadata:

    exiftool -Author="Frank P. Ramsey" -Date=1930 -Title="On a Problem of Formal Logic" -DOI="10.1112/plms/s2-30.1.264" \
        -Subject="mathematics" -Keywords="Ramsey theory, Ramsey's theorem, combinatorics, mathematical logic, decidability, \
        first-order logic,  Bernays-Schönfinkel-Ramsey class of first-order logic, _Proceedings of the London Mathematical \
        Society_, Volume s2-30, Issue 1, 1930-01-01, pg264-286" 1930-ramsey.pdf

    “PDF Plus” is better than “PDF”.

    If two versions are provided, the “PDF” one may be intended (if there is any real difference) for printing and exclude features like hyperlinks.

  • Public Hosting: if possible, host a public copy; especially if it was very difficult to find, even if it was useless, it should be hosted. The life you save may be your own.

  • Link On WP/Social Media: for bonus points, link it in appropriate places on Wikipedia or Reddit or Twitter; this makes people aware of the copy being available, and also supercharges visibility in search engines.

  • Link Specific Pages: as noted before, you can link a specific page by adding #page=N to the URL. Linking the relevant page is helpful to readers. (I recommend against doing this if this is done to link an entire article inside a book, because that article will still have bad SEO and it will be hard to find; in such cases, it’s better to crop out the relevant page range as a standalone article, eg. using pdftk again for pdftk 1900-BOOK.pdf cat 123-456 output 1900-PAPER.pdf.)
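The “Never Link Unreliable Hosts” rules above can be turned into a rough check to run over a URL before citing it. This is a minimal sketch in POSIX shell: the function name `is_unreliable_link` and the exact pattern list are illustrative assumptions drawn from the hosts named above, not an exhaustive or authoritative rule set.

```shell
# Hypothetical helper: exit status 0 if the URL matches one of the
# unreliable-host patterns described above, 1 otherwise.
is_unreliable_link() {
    case "$1" in
        # user-hostile or routinely-deleted document hosts
        *scribd.com/*|*researchgate.net/*|*academia.edu/*) return 0 ;;
        # temporary NBER/APA URLs
        *papers.nber.org/tmp/*|*psycnet.apa.org/*) return 0 ;;
        # academic personal directories (Unix ~user convention)
        *.edu/\~*) return 0 ;;
        # tokenized query strings or S3/AWS-hosted files
        *\?*token=*|*\?cookie*|*amazonaws.com/*) return 0 ;;
    esac
    return 1
}
```

For example, `is_unreliable_link "$url" && echo "make your own copy: $url"` flags a link worth mirroring before it is cited. A match only means the link is likely to rot, not that the file is unobtainable.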

Advanced

Aside from the (highly-recommended) use of hotkeys and Booleans for searches, there are a few useful tools for the researcher, which while expensive initially, can pay off in the long-term:

  • archiver-bot: automatically archive your web browsing and/or links from arbitrary websites to forestall linkrot; particularly useful for detecting & recovering from dead PDF links

  • Subscriptions like PubMed & GS search alerts: set up alerts for a specific search query, or for new citations of a specific paper. (Google Alerts is not as useful as it seems.)

    1. PubMed has straightforward conversion of search queries into alerts: “Create alert” below the search bar. (Given the volume of PubMed indexing, I recommend carefully tailoring your search to be as narrow as possible, or else your alerts may overwhelm you.)

    2. To create a generic GS search query alert, simply use the “Create alert” on the sidebar for any search. To follow citations of a key paper, you must: 1. bring up the paper in GS; 2. click on “Cited by X”; 3. then use “Create alert” on the sidebar.

  • GCSE: a Google Custom Search Engine is a specialized search engine limited to whitelisted pages/domains etc (eg. my Wikipedia-focused anime/manga CSE).

    A GCSE can be thought of as a saved search query on steroids. If you find yourself regularly including scores of the same domains in multiple searches, or constantly blacklisting domains with -site: or using many negations to filter out common false positives, it may be time to set up a GCSE which does all that by default.

  • Clippings: note-taking services like Evernote/Microsoft OneNote: regularly making and keeping excerpts creates a personalized search engine, in effect.

    This can be vital for refinding old things you read where the search terms are hopelessly generic or you can’t remember an exact quote or reference; it is one thing to search a keyword like “autism” in a few score thousand clippings, and another thing to search that in the entire Internet! (One can also reorganize or edit the notes to add in the keywords one is thinking of, to help with refinding.) I make heavy use of Evernote clipping and it is key to refinding my references.

  • Crawling Websites: sometimes having copies of whole websites might be useful, either for more flexible searching or for ensuring you have anything you might need in the future. (example: “Darknet Market Archives (2013–2015)”).

    Useful tools to know about: wget, cURL, HTTrack; Firefox plugins: NoScript, uBlock origin, Live HTTP Headers, Bypass Paywalls, cookie exporting.

    Short of downloading a website, it might also be useful to pre-emptively archive it by using linkchecker to crawl it, compile a list of all external & internal links, and store them for processing by another archival program (see Archiving URLs for examples). In certain rare circumstances, security tools like nmap can be useful to examine a mysterious server in more detail: what web server and services does it run, what else might be on it (sometimes interesting things like old anonymous FTP servers turn up), has a website moved between IPs or servers, etc.
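As a concrete starting point for crawling a website with wget, here is a minimal sketch. The wrapper name `mirror_site` and the `DRY_RUN` variable are hypothetical conveniences; the flags themselves are standard GNU wget options, but the particular combination and the 1-second delay are assumptions to adjust per site.

```shell
# Hypothetical wrapper around wget's mirroring flags.
# Set DRY_RUN=1 to print the command instead of running it.
mirror_site() {
    url="$1"
    # --mirror: recursive download with timestamping
    # --page-requisites: also fetch images/CSS needed to render pages
    # --convert-links / --adjust-extension: make the copy browsable offline
    # --no-parent: stay below the starting directory
    # --wait / --random-wait: be polite to the server
    set -- wget --mirror --page-requisites --convert-links --adjust-extension \
                --no-parent --wait=1 --random-wait "$url"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}
```

Running `DRY_RUN=1 mirror_site 'https://example.com/blog/'` prints the full command for inspection before committing to a long crawl. Once downloaded, the local copy can be searched with an ordinary recursive grep.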

Web Pages

With proper use of pre-emptive archiving tools like archiver-bot, fixing linkrot in one’s own pages is much easier, but that leaves other references. Searching for lost web pages is similar to searching for papers:

  • Just Search The Title: if the page title is given, search for the title.

    It is a good idea to include page titles in one’s own pages, as well as the URL, to help with future searches, since the URL may be meaningless gibberish on its own, and pre-emptive archiving can fail. HTML supports a title attribute on link tags (and alt on images), and, in cases where displaying a title is not desirable (because the link is being used inline as part of normal hypertextual writing), titles can be included cleanly in Markdown documents like this: [inline text description](URL "Title").

  • Clean URLs: check the URL for weirdness or trailing garbage like ?rss=1 or ?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+blogspot%2FgJZg+%28Google+AI+Blog%29. Is it a variant domain, like a mobile.foo.com/m.foo.com/foo.com/amp/ URL? Those are all less likely to be findable or archived than the canonical version.

  • Domain Site Search: restrict G search to the original domain with site:, or to related domains

  • Time-Limited Search: restrict G search to the original date-range/years.

    You can use this to tame overly-general searches. An alternative to the date-range widget is the advanced search syntax, which works (for now): specify numeric range queries using double-dots like foo 2020..2023 (which is useful beyond just years). If this is still too broad, it can always be narrowed down to individual years.

  • Switch Engines: try a different search engine: corpuses can vary, and in some cases G tries to be too smart for its own good when you need a literal search; DuckDuckGo (especially for ‘bang’ special searches), Bing, and Yandex are usable alternatives

  • Check Archives: if nowhere on the clearnet, try the Internet Archive (IA) or the Memento meta-archive search engine:

    IA is the default backup for a dead URL. If IA doesn’t Just Work, there may be other versions in it:

    • misleading redirects: did the IA ‘helpfully’ redirect you to a much-later-in-time error page? Kill the redirect and check the earliest stored version for the exact URL rather than the redirect. Did the page initially load but then error out/redirect? Disable JS with NoScript and reload.

    • Within-Domain Archives: IA lets you list all URLs with any archived versions, by searching for URL/*; the list of available URLs may reveal an alternate newer/older URL. It can also be useful to filter by filetype or substring.

      For example, one might list all URLs in a domain, and if the list is too long and filled with garbage URLs, then use the “Filter results” incremental-search widget to search for “uploads/” on a WordPress blog.

      Screenshot of an oft-overlooked feature of the Internet Archive: displaying all available/archived URLs for a specific domain, filtered down to a subset matching a string like *uploads/*.

      • wayback_machine_downloader (not to be confused with the internetarchive Python package which provides a CLI interface to uploading files) is a Ruby tool which lets you download whole domains from IA, which can be useful for running a local fulltext search using regexps (a good grep query is often enough), in cases where just looking at the URLs via URL/* is not helpful. (An alternative which might work is websitedownloader.io.)

      Example:

      gem install --user-install wayback_machine_downloader
      ~/.gem/ruby/2.7.0/bin/wayback_machine_downloader --all-timestamps 'https://blog.okcupid.com'

    • did the domain change, eg. from www.foo.com to foo.com or www.foo.org? Entirely different as far as IA is concerned.

    • does the internal evidence of the URL provide any hints? You can learn a lot from URLs just by paying attention and thinking about what each directory and argument means.

    • is this a Blogspot blog? Blogspot is uniquely horrible in that it has versions of each blog for every country domain: a foo.blogspot.com blog could be under any of foo.blogspot.de, foo.blogspot.au, foo.blogspot.hk, foo.blogspot.jp, etc.

    • did the website provide RSS feeds?

      A little-known fact is that Google Reader (GR; October 2005–July 2013) stored all RSS items it crawled, so if a website’s RSS feed was configured to include full items, the RSS feed history was an alternate mirror of the whole website, and since GR never removed RSS items, it was possible to retrieve pages or whole websites from it. GR has since closed down, sadly, but before it closed, Archive Team downloaded a large fraction of GR’s historical RSS feeds, and those archives are now hosted on IA. The catch is that they are stored in mega-WARCs, which, for all their archival virtues, are not the most user-friendly format. The raw GR mega-WARCs are difficult enough to work with that I defer an example to the appendix.

    • archive.today: an IA-like mirror. (Sometimes bypasses paywalls or has snapshots other services do not; I strongly recommend against treating archive.today/archive.is/etc as anything but a temporary mirror to grab snapshots from, as it has no long-term plans.)

    • any local archives, such as those made with my archiver-bot

    • Google Cache (GC): GC works, sometimes, but the copies are usually the worst around, ephemeral & cannot be relied upon. Google also appears to have been steadily deprecating GC over the years, as GC shows up less & less in search results. A last resort.
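Two of the steps above, cleaning a URL and listing everything IA holds for a domain, are mechanical enough to script. This is a minimal sketch: the function names `clean_url` and `wayback_list_url` are hypothetical, the set of stripped parameters (`utm_*` and `rss`) is an assumption based on the examples above, and the listing URL follows the URL/* convention described earlier.

```shell
# Hypothetical helper: strip utm_* and rss tracking parameters from a URL,
# keeping any other query parameters and the #fragment intact.
clean_url() {
    printf '%s\n' "$1" | sed -E \
        -e 's/([?&])(utm_[a-z_]+|rss)=[^&#]*/\1/g' \
        -e 's/\?&+/?/' \
        -e 's/&&+/\&/g' \
        -e 's/[?&]+$//' \
        -e 's/[?&]+#/#/'
}

# Hypothetical helper: print the Wayback Machine page which lists all
# archived URLs under a domain or URL prefix.
wayback_list_url() {
    printf 'https://web.archive.org/web/*/%s/*\n' "$1"
}
```

For instance, `clean_url 'https://example.com/a?id=5&utm_source=x&page=2'` prints `https://example.com/a?id=5&page=2`, which is the form to search for or look up in an archive. The cleaning is deliberately conservative: unknown parameters are left alone, since some sites use the query string to select the actual content.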

Books

Digital

E-books are rarer and harder to get than papers, although the situation has improved vastly since the early 2000s. To search for books online:

  • More Straightforward: book searches tend to be faster and simpler than paper searches, and to require less cleverness in search query formulation, perhaps because they are rarer online, much larger, and have simpler titles, making it easier for search engines.

    Search G, not GS, for books:

    No Books in Google Scholar

    Book fulltexts usually don’t show up in GS (for unknown reasons). You need to check G when searching for books.

    To double-check, you can try a filetype:pdf search; then check LG. Typically, if the main title + author doesn’t turn it up, it’s not online. (In some cases, the author order is reversed, or the title:subtitle are reversed, and you can find a copy by tweaking your search, but these are rare.)

  • IA: the Internet Archive has many books scanned which do not appear easily in search results (poor SEO?).

    • If an IA hit pops up in a search, always check it; the OCR may offer hints as to where to find it. If you don’t find anything there, try doing an IA site search in G (not the IA built-in search engine), eg. book title site:archive.org.

    • DRM workarounds: if it is on IA but the IA version is DRMed and is only available for “checkout”, you can jailbreak it.

      Check the book out for the full period, 14 days. Download the PDF (not EPUB) version to Adobe Digital Editions version ≤4.0 (which can be run in Wine on Linux), and then import it to Calibre with the De-DRM plugin, which will produce a DRM-free PDF inside Calibre’s library. (Getting De-DRM running can be tricky, especially under Linux. I wound up having to edit some of the paths in the Python files to make them work with Wine. It also appears to fail on the most recent Google Play ebooks, ~2021.) You can then add metadata to the PDF & upload it to LG. (LG’s versions of books are usually better than the IA scans, but if they don’t exist, IA’s is better than nothing.)

  • Google Play: uses the same PDF DRM as IA, and can be broken the same way.

  • HathiTrust also hosts many book scans, which can be searched for clues or hints or jailbroken.

    HathiTrust blocks whole-book downloads but it’s easy to download each page in a loop and stitch them together, for example:

    for i in {1..151}
    do if [[ ! -s "$i.pdf" ]]; then
        wget "https://babel.hathitrust.org/cgi/imgsrv/download/pdf?id=mdp.39015050609067;orient=0;size=100;seq=$i;attachment=0" \
              -O "$i.pdf"
        sleep 20s
     fi
    done
    
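    # List the pages in numeric order: a bare *.pdf glob sorts 1, 10, 100, ... and would scramble them.
    pdftk {1..151}.pdf cat output 1957-super-scientificcareersandvocationaldevelopmenttheory.pdf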
    
    exiftool -Title="Scientific Careers and Vocational Development Theory: A review, a critique and some recommendations" \
        -Date=1957 -Author="Donald E. Super, Paul B. Bachrach" -Subject="psychology" \
        -Keywords="Bureau Of Publications (Teachers College Columbia University), LCCCN: 57-12336, National Science Foundation, public domain, \
        https://babel.hathitrust.org/cgi/pt?id=mdp.39015050609067;view=1up;seq=1 https://psycnet.apa.org/record/1959-04098-000" \
        1957-super-scientificcareersandvocationaldevelopmenttheory.pdf

    Another example of this would be the Wellcome Library; while looking for An Investigation Into The Relation Between Intelligence And Inheritance, Lawrence1931, I came up dry until I checked one of the last search results, a “Wellcome Digital Library” hit, on the slim off-chance that, like the occasional Chinese/Indian library website, it just might have fulltext. As it happens, it did—good news? Yes, but with a caveat: it provides no way to download the book! It provides OCR, metadata, and individual page-image downloads all under CC-BY-NC-SA (so no legal problems), but… not the book. (The OCR is also unnecessarily zipped, so that is why Google ranked the page so low and did not show any revealing excerpts from the OCR transcript: because it’s hidden in an opaque archive to save a few kilobytes while destroying SEO.) Examining the download URLs for the highest-resolution images, they follow an unfortunate schema:

    1. https://dlcs.io/iiif-img/wellcome/1/5c27d7de-6d55-473c-b3b2-6c74ac7a04c6/full/2212,/0/default.jpg

    2. https://dlcs.io/iiif-img/wellcome/1/d514271c-b290-4ae8-bed7-fd30fb14d59e/full/2212,/0/default.jpg

    3. etc

    Instead of being sequentially numbered 1–90 or whatever, they all live under a unique hash or ID. Fortunately, one of the metadata files, the ‘manifest’ file, provides all of the hashes/IDs (but not the high-quality download URLs). Extracting the IDs from the manifest can be done with some quick sed & tr string processing, and fed into another short wget loop for download:

    # Extract the image IDs from the manifest (one per line) and keep them for the download loop.
    HASHES=$(grep -F '@id' manifest\?manifest\=https\:%2F%2Fwellcomelibrary.org%2Fiiif%2Fb18032217%2Fmanifest | \
       sed -e 's/.*imageanno\/\(.*\)/\1/' | grep -E -v '^ .*' | tr -d ',' | tr -d '"')
    echo "$HASHES"
    # bf23642e-e89b-43a0-8736-f5c6c77c03c3
    # 334faf27-3ee1-4a63-92d9-b40d55ab72ad
    # 5c27d7de-6d55-473c-b3b2-6c74ac7a04c6
    # d514271c-b290-4ae8-bed7-fd30fb14d59e
    # f85ef645-ec96-4d5a-be4e-0a781f87b5e2
    # a2e1af25-5576-4101-abee-96bd7c237a4d
    # 6580e767-0d03-40a1-ab8b-e6a37abe849c
    # ca178578-81c9-4829-b912-97c957b668a3
    # 2bd8959d-5540-4f36-82d9-49658f67cff6
    # ...etc

    # Download each page image, numbering the files in manifest order.
    I=1
    for HASH in $HASHES; do
        wget "https://dlcs.io/iiif-img/wellcome/1/$HASH/full/2212,/0/default.jpg" -O "$I.jpg"
        I=$((I+1))
    done

    And then the 59MB of JPGs can be cleaned up as usual with gscan2pdf (empty pages deleted, tables rotated, cover page cropped, all other pages binarized), compressed/OCRed with ocrmypdf, and metadata set with exiftool, producing a readable, downloadable, highly-search-engine-friendly 1.8MB PDF.

  • remember the Analog Hole works for papers/books too:

    if you can find a copy to read, but cannot figure out how to download it directly because the site uses JS or complicated cookie authentication or other tricks, you can always exploit the ‘analogue hole’—fullscreen the book in high resolution & take screenshots of every page; then crop, OCR etc. This is tedious but it works. And if you take screenshots at sufficiently high resolution, there will be relatively little quality loss. (This works better for books that are scans than ones born-digital.)
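Page-by-page downloads like the HathiTrust loop above can fail silently, leaving empty or missing files which only show up after stitching. A minimal sketch of a completeness check to run before joining the pages, assuming the same `1.pdf`, `2.pdf`, … naming as that loop (the function name `missing_pages` is hypothetical):

```shell
# Hypothetical helper: print every page number in 1..LAST whose
# "<n>.pdf" is missing or empty in the current directory.
missing_pages() {
    last="$1"
    i=1
    while [ "$i" -le "$last" ]; do
        # -s is true only for files which exist and are non-empty
        [ -s "$i.pdf" ] || echo "$i"
        i=$((i + 1))
    done
}
```

Running `missing_pages 151` in the download directory should print nothing when every page arrived; any numbers it does print can be re-fetched by rerunning the download loop, which skips pages that already exist.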

Physical

Expensive but feasible. Books are something of a double-edged sword compared to papers/theses. On the one hand, books are much more often unavailable online, and must be bought offline, but at least you almost always can buy used books offline without much trouble (and often for <$10 total); on the other hand, while papers/theses are often available online, when one is not available, it’s usually very unavailable, and you’re stuck (unless you have a university ILL department backing you up or are willing to travel to the few or only universities with paper or microfilm copies).

Purchasing from used book sellers:

  • Sellers:

    • used book search engines: Google Books/find-more-books.com: a good starting point for seller links; if buying from a marketplace like AbeBooks/Amazon/Barnes & Noble, it’s worth searching the seller to see if they have their own website, which is potentially much cheaper. They may also have multiple editions in stock.

    • bad: eBay & Amazon are often bad, due to high-minimum-order+S&H and sellers on Amazon seem to assume Amazon buyers are easily rooked; but can be useful in providing metadata like page count or ISBN or variations on the title

    • good: AbeBooks, Thrift Books, Better World Books, B&N, Discover Books.

      Note: on AbeBooks, international orders can be useful (especially for behavioral genetics or psychology books) but be careful of international orders with your credit card—many debit/credit cards will fail on international orders and trigger a fraud alert, and PayPal is not accepted.

  • Price Alerts: if a book is not available or too expensive, set price watches: AbeBooks supports email alerts on stored searches, and Amazon can be monitored via CamelCamelCamel (remember the CCC price alert you want is on the used third-party category, as new books are more expensive, less available, and unnecessary).

Scanning:

  • Destructive Vs Non-Destructive: the fundamental dilemma of book scanning: destructively debinding books with a razor or guillotine cutter works much better & is much less time-consuming than spreading them on a flatbed scanner to scan one-by-one, because it allows use of a sheet-fed scanner instead, which is easily 5x faster and will give higher-quality scans (because the sheets will be flat, scanned edge-to-edge, and much more closely aligned), but does, of course, require effectively destroying the book.

  • Tools:

    • cutting: For simple debinding of a few books a year, an X-acto knife/razor is good (avoid the ‘triangle’ blades, get curved blades intended for large cuts instead of detail work).

      Once you start doing more than one a month, it’s time to upgrade to a guillotine blade paper cutter (a fancier swinging-arm paper cutter, which uses a two-joint system to clamp down and cut uniformly).

      A guillotine blade can cut chunks of 200 pages easily without much slippage, so for books with more pages, I use both: an X-acto to cut along the spine and turn it into several 200-page chunks for the guillotine cutter.

    • scanning: at some point, it may make sense to switch to a scanning service like 1DollarScan (1DS has acceptable quality for the black-white scans I have used them for thus far, but watch out for their nickel-and-diming fees for OCR or “setting the PDF title”; these can be done in no time yourself using gscan2pdf/exiftool/ocrmypdf and will save a lot of money as they, amazingly, bill by 100-page units). Books can be sent directly to 1DS, reducing logistical hassles.

  • Clean Up: after scanning, crop/threshold/OCR/add metadata

    • Adding metadata: same principles as papers. While more elaborate metadata can be added, like bookmarks, I have not experimented with those yet.

  • File format: PDF, not DjVu

    Despite being a worse format in many respects, I now recommend PDF and have stopped using DjVu for new scans and have converted my old DjVu files to PDF.

  • Uploading: to LibGen, usually, and Gwern.net sometimes. For backups, filelockers like Dropbox, Mega, MediaFire, or Google Drive are good. I usually upload 3 copies including LG. I rotate accounts once a year, to avoid putting too many files into a single account. (I discourage reliance on IA links.)

    Do Not Use Google Docs/Scribd/Dropbox/IA/etc for Long-Term Documents

    ‘Document’ websites like Google Docs (GD) should be strictly avoided as primary hosting. GD does not appear in G/GS, dooming a document to obscurity, and Scribd is ludicrously user-hostile with changing dark patterns. Such sites cannot be searched, scraped, downloaded reliably, clipped, used on many devices, archived, or counted on for the long haul. (For example, Google Docs has made many documents ‘private’, breaking public links, to the surprise of even the authors when I contact them about it, for unclear reasons.)

    Such sites may be useful for collaboration or surveys, but should be regarded as strictly temporary working files, and moved to clean static HTML/PDF/XLSX hosted elsewhere as soon as possible.

  • Hosting: hosting papers is easy but books come with risk:

    Books can be dangerous; in deciding whether to host a book, my rule of thumb is to host only books which are pre-2000, which do not have Kindle editions or other signs of active exploitation, and which are effectively ‘orphan works’.

    As of 2019-10-23, hosting 4090 files over 9 years (very roughly, assuming linear growth, ~6.7 million document-days of hosting: 4090 × 0.5 × 9 × 365.25 = 6722426), I’ve received 4 takedown orders: a behavioral genetics textbook (2013), The Handbook of Psychopathy (2005), a recent meta-analysis paper (Roberts et al 2016), and a CUP DMCA takedown order for 27 files. I broke my rule of thumb to host the 2 books (my mistake), which leaves only the 1 paper, which I think was a fluke. So, as long as one avoids relatively recent books, the risk should be minimal.
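The document-days estimate can be reproduced directly. Under the linear-growth assumption, the average number of files hosted over the period is half the final count, so exposure is files × 0.5 × years × 365.25. A minimal sketch, using the 4090 files and 9 years stated above, which reproduce the 6722426 total (the function name `document_days` is hypothetical):

```shell
# Hypothetical helper: estimate document-days of hosting, assuming the
# collection grew linearly from 0 to FILES over YEARS years.
document_days() {
    awk -v files="$1" -v years="$2" \
        'BEGIN { printf "%d\n", files * 0.5 * years * 365.25 }'
}

document_days 4090 9   # prints 6722426
```

Dividing the number of takedowns by this figure gives a rough per-document-day risk, which is the quantity the rule of thumb is trying to keep low.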

Case Studies

Case study examples of using Internet search tips in action.

See Also

Appendix

Searching the Google Reader Archives

A 2015 tutorial on how to do manual searches of the 2013 Google Reader archives on the Internet Archive. Google Reader provides fulltext mirrors of many websites which are long gone and not otherwise available even in the IA; however, the Archive Team archives are extremely user-unfriendly and challenging to use even for programmers.

I explain how to find & extract specific websites.

Note: now largely obsoleted by querying IA’s Wayback Machine for the GR RSS URL.

Unusual archive: Google Reader. A little-known way to ‘undelete’ a pre-2013 blog or website is to use Google Reader (GR). GR regularly crawled almost all blogs’ RSS feeds, and RSS feeds often contain the fulltext of articles. If a blog author writes an article, the fulltext is included in the RSS feed, GR downloads it, and then the author changes their mind and edits or deletes it, GR would redownload the new version but continue to show the old version as well (you would see two versions, chronologically). If the author blogged regularly and so GR had learned to check regularly, it could hypothetically even grab different edited versions, not just ones with weeks or months in between. This assumes that GR did not, as it sometimes did for inscrutable reasons, stop displaying the historical archives and show only the last 90 days or so to readers; I was never able to figure out why this happened, or if indeed it really did happen and was not some sort of UI problem. Regardless, if all went well, this let you undelete an article, albeit perhaps with messed-up formatting or something. Sadly, GR was closed back on 2013-07-01, and you cannot simply log in and look for blogs.

Archive Team mirrored Google Reader. However, before it was closed, Archive Team launched a major effort to download as much of GR as possible. So in that dump, there may be archives of all of a random blog’s posts. Specifically: if a GR user subscribed to it; if Archive Team knew about it; if they requested it in time before closure; and if GR did keep full archives stretching back to the first posting.

AT mirror is raw binary data. Downside: the Archive Team dump is not in an easily browsed format, and merely figuring out what it might have is difficult. In fact, it’s so difficult that until researching Craig Wright in November–December 2015, I never had an urgent enough reason to figure out how to get anything out of it, and I’m not sure I’ve ever seen anyone actually use it; Archive Team takes the attitude that it’s better to preserve the data somehow and let posterity worry about using it. (There is a site which claimed to be a frontend to the dump but when I tried to use it, it was broken & still is in April 2024.)

Extracting

Find the right archive. The 9TB of data is stored in ~69 opaque compressed WARC archives. 9TB is a bit much to download and uncompress to look for one or two files, so to find out which WARC you need, you have to download the ~69 CDX indexes which record the contents of their respective WARC, and search them for the URLs you are interested in. (They are plain text so you can just grep them.)

Locations

In this example, we will look at the main blog of Craig Wright, gse-compliance.blogspot.com. (Another blog, security-doctor.blogspot.com, appears to have been too obscure to be crawled by GR.) To locate the WARC with the Wright RSS feeds, download the master index. To search:

for file in *.gz; do echo "$file"; zcat "$file" | grep -F -e 'gse-compliance' -e 'security-doctor'; done
# com,google/reader/api/0/stream/contents/feed/http:/gse-compliance.blogspot.com/atom.xml?client=\
# archiveteam&comments=true&likes=true&n=1000&r=n 20130602001238 https://www.google.com/reader/\
# api/0/stream/contents/feed/http%3A%2F%2Fgse-compliance.blogspot.com%2Fatom.xml?r=n&n=1000&\
# likes=true&comments=true&client=ArchiveTeam unk - 4GZ4KXJISATWOFEZXMNB4Q5L3JVVPJPM - - 1316181\
# 19808229791 archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz
# com,google/reader/api/0/stream/contents/feed/http:/gse-compliance.blogspot.com/feeds/posts/default?\
# alt=rss?client=archiveteam&comments=true&likes=true&n=1000&r=n 20130602001249 https://www.google.\
# com/reader/api/0/stream/contents/feed/http%3A%2F%2Fgse-compliance.blogspot.com%2Ffeeds%2Fposts%2Fdefault\
# %3Falt%3Drss?r=n&n=1000&likes=true&comments=true&client=ArchiveTeam unk - HOYKQ63N2D6UJ4TOIXMOTUD4IY7MP5HM\
# - - 1326824 19810951910 archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz
# com,google/reader/api/0/stream/contents/feed/http:/gse-compliance.blogspot.com/feeds/posts/default?\
# client=archiveteam&comments=true&likes=true&n=1000&r=n 20130602001244 https://www.google.com/\
# reader/api/0/stream/contents/feed/http%3A%2F%2Fgse-compliance.blogspot.com%2Ffeeds%2Fposts%2Fdefault?\
# r=n&n=1000&likes=true&comments=true&client=ArchiveTeam unk - XXISZYMRUZWD3L6WEEEQQ7KY7KA5BD2X - - \
# 1404934 19809546472 archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz
# com,google/reader/api/0/stream/contents/feed/http:/gse-compliance.blogspot.com/rss.xml?client=archiveteam\
# &comments=true&likes=true&n=1000&r=n 20130602001253 https://www.google.com/reader/api/0/stream/contents\
# /feed/http%3A%2F%2Fgse-compliance.blogspot.com%2Frss.xml?r=n&n=1000&likes=true&comments=true\
# &client=ArchiveTeam text/html 404 AJSJWHNSRBYIASRYY544HJMKLDBBKRMO - - 9467 19812279226 \
# archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz

Understanding the output: the format is defined by the first line, which then can be looked up:

  • the format string is: CDX N b a m s k r M S V g; which means here:

    • N: massaged url

    • b: date

    • a: original url

    • m: MIME type of original document

    • s: response code

    • k: new style checksum

    • r: redirect

    • M: meta tags (AIF)

    • S: compressed record size

    • V: compressed arc file offset

    • g: file name

Example:

(com,google)/reader/api/0/stream/contents/feed/http:/gse-compliance.blogspot.com/atom.xml\
?client=archiveteam&comments=true&likes=true&n=1000&r=n 20130602001238 https://www.google.com\
/reader/api/0/stream/contents/feed/http%3A%2F%2Fgse-compliance.blogspot.com%2Fatom.xml?r=n\
&n=1000&likes=true&comments=true&client=ArchiveTeam unk - 4GZ4KXJISATWOFEZXMNB4Q5L3JVVPJPM\
- - 1316181 19808229791 archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz

Converts to:

  • massaged URL: (com,google)/reader/api/0/stream/contents/feed/ http:/gse-compliance.blogspot.com/atom.xml? client=archiveteam&comments=true&likes=true&n=1000&r=n

  • date: 20130602001238

  • original URL: https://www.google.com/reader/api/0/stream/contents/feed/ http%3A%2F%2Fgse-compliance.blogspot.com%2Fatom.xml? r=n&n=1000&likes=true&comments=true&client=ArchiveTeam

  • MIME type: unk [unknown?]

  • response code: - [none?]

  • new-style checksum: 4GZ4KXJISATWOFEZXMNB4Q5L3JVVPJPM

  • redirect: - [none?]

  • meta tags: - [none?]

  • S [compressed record size, i.e. length in bytes]: 1316181

  • compressed arc file offset: 19808229791 (19,808,229,791; so somewhere around 19.8GB into the mega-WARC)

  • filename: archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz
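
The field mapping above can be applied mechanically to every hit. A minimal sketch, assuming the 11-column `CDX N b a m s k r M S V g` layout shown above with space-separated fields; the function name `cdx_locate` is hypothetical and not part of any Archive Team tooling. For each CDX line on stdin it prints the `S` value (used later as the record length), the offset, and the WARC filename:

```shell
# cdx_locate: read CDX lines (format "CDX N b a m s k r M S V g") on stdin
# and print "S V g" (length, offset, WARC file) for each 11-field line.
# Hypothetical helper name; assumes fields are separated by spaces.
# The 12-field header line (" CDX N b a ...") is skipped by the NF test.
cdx_locate() {
    awk 'NF == 11 { print $9, $10, $11 }'
}
```

Usage would be along the lines of `zcat *.gz | grep -F 'gse-compliance' | cdx_locate`, which reduces each hit to the three values needed for extraction.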

As of 2024, the WARCs have been processed into the Wayback Machine and the original google.com/reader/api/0/ RSS URLs are now searchable, so one could look the old GR RSS up like a normal URL and do the normal broader searches like searching for all versions.
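
A rough sketch of building such a lookup URL, assuming the Wayback Machine's `web/*/URL*` prefix-search form works for these API URLs as it does for ordinary ones (I have not verified this for every feed). The API prefix and the percent-encoding are copied from the "original url" field of the CDX hits; the function name `gr_wayback_url` is hypothetical:

```shell
# gr_wayback_url: print a Wayback Machine "all captures" URL for a feed as it
# was fetched through the Google Reader API. Hypothetical helper name.
gr_wayback_url() {
    # Percent-encode the characters that appear encoded in the archived URLs
    # ('%' first, so already-produced escapes are not double-encoded).
    encoded=$(printf '%s' "$1" | sed -e 's/%/%25/g' -e 's/:/%3A/g' \
        -e 's|/|%2F|g' -e 's/?/%3F/g' -e 's/=/%3D/g')
    # The trailing '*' asks for a prefix match, covering the query string
    # (?r=n&n=1000&...) that the crawler appended.
    printf '%s\n' "https://web.archive.org/web/*/https://www.google.com/reader/api/0/stream/contents/feed/${encoded}*"
}
```

For example, `gr_wayback_url 'http://gse-compliance.blogspot.com/atom.xml'` prints a URL that can be pasted into a browser.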

However, in 2015, we had to do it the hard way: extracting directly from the WARC. Knowing the offset theoretically makes it possible to extract directly from the IA copy without having to download and decompress the entire thing… The S & offsets for gse-compliance are:

  1. 1316181/19808229791

  2. 1326824/19810951910

  3. 1404934/19809546472

  4. 9467/19812279226

So we found hits pointing towards archiveteam_greader_20130604001315 & archiveteam_greader_20130614211457 which we then need to download (25GB each):

wget 'https://archive.org/download/archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz'
wget 'https://archive.org/download/archiveteam_greader_20130614211457/greader_20130614211457.megawarc.warc.gz'
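
An untested alternative to downloading 25GB per file is to ask the server for only the bytes of one record. A minimal sketch, assuming the archive.org download server honors HTTP Range requests and that the `S` field really is the record's compressed length; the helper names `warc_range` and `fetch_record` are hypothetical. If the server ignores the range, curl would fetch the whole file instead:

```shell
# warc_range: turn a CDX offset (V) and length (S) into an inclusive HTTP
# byte range "first-last". Hypothetical helper name.
warc_range() {
    offset=$1
    length=$2
    printf '%s-%s\n' "$offset" "$((offset + length - 1))"
}

# fetch_record: download one record's bytes from a remote WARC into a file.
# Usage: fetch_record URL OFFSET LENGTH OUTFILE
fetch_record() {
    curl --fail --location --range "$(warc_range "$2" "$3")" --output "$4" "$1"
}
```

For the first hit this would be `fetch_record 'https://archive.org/download/archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz' 19808229791 1316181 1.gz`, requesting bytes 19808229791-19809545971.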

Once downloaded, how do we get the feeds? There are a number of hard-to-use and incomplete tools for working with giant WARCs; I contacted the original GR archiver, ivan, but that wasn’t too helpful.

warcat

I tried using warcat to unpack the entire WARC archive into individual files, and then delete everything which was not relevant:

python3 -m warcat extract /home/gwern/googlereader/...
find ./www.google.com/ -type f -not \( -name "*gse-compliance*" -or -name "*security-doctor*" \) -delete
find ./www.google.com/

But this was too slow, and crashed partway through before finishing.

A more recent alternative library, which I haven’t tried, is warcio, which may be able to find the byte ranges & extract them.

dd

If we are feeling brave, we can use the offset and presumed length to have dd directly extract byte ranges:

dd skip=19810951910 count=1326824 if=greader_20130604001315.megawarc.warc.gz of=2.gz bs=1
# 1326824+0 records in
# 1326824+0 records out
# 1326824 bytes (1.3 MB) copied, 14.6218 s, 90.7 kB/s
dd skip=19810951910 count=1326824 if=greader_20130604001315.megawarc.warc.gz of=3.gz bs=1
# 1326824+0 records in
# 1326824+0 records out
# 1326824 bytes (1.3 MB) copied, 14.2119 s, 93.4 kB/s
dd skip=19809546472 count=1404934 if=greader_20130604001315.megawarc.warc.gz of=4.gz bs=1
# 1404934+0 records in
# 1404934+0 records out
# 1404934 bytes (1.4 MB) copied, 15.4225 s, 91.1 kB/s
dd skip=19812279226 count=9467 if=greader_20130604001315.megawarc.warc.gz of=5.gz bs=1
# 9467+0 records in
# 9467+0 records out
# 9467 bytes (9.5 kB) copied, 0.125689 s, 75.3 kB/s
dd skip=19808229791 count=1316181 if=greader_20130604001315.megawarc.warc.gz of=1.gz bs=1
# 1316181+0 records in
# 1316181+0 records out
# 1316181 bytes (1.3 MB) copied, 14.6209 s, 90.0 kB/s
gunzip *.gz
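
With bs=1, dd issues one read and one write per byte, which is likely why the throughput above is only ~90 kB/s. A sketch of an equivalent extraction that reads in larger blocks, using `tail` and `head` rather than dd; the function name `extract_bytes` is hypothetical, and `head -c` is assumed to be available (it is in GNU and BSD userlands):

```shell
# extract_bytes: write LENGTH bytes starting at byte OFFSET (0-based) of FILE
# to stdout. Selects the same byte range as
# `dd bs=1 skip=OFFSET count=LENGTH if=FILE`. Hypothetical helper name.
# Usage: extract_bytes FILE OFFSET LENGTH > OUTFILE
extract_bytes() {
    # tail -c +N starts output at the Nth byte counting from 1, hence the +1.
    tail -c "+$(( $2 + 1 ))" "$1" | head -c "$3"
}
```

For example, `extract_bytes greader_20130604001315.megawarc.warc.gz 19808229791 1316181 > 1.gz` should produce the same file as the last dd command above.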

Results

Success: raw HTML. My dd extraction was successful, and the resulting HTML/RSS could then be browsed with a command like cat *.warc | fold --spaces --width=200 | less. They can probably also be converted to a local form and browsed, although they won’t include any of the site assets like images or CSS/JS, since the original RSS feed assumes you can load any references from the original website and didn’t do any kind of data-URI or mirroring (not, after all, having been intended for archive purposes in the first place…).


  1. For example, the info: operator is entirely useless. The link: operator, in almost a decade of me trying it once in a great while, has never returned remotely as many links to my website as Google Webmaster Tools returns for inbound links, and seems to have been disabled entirely at some point.↩︎

  2. WP is increasingly out of date & unrepresentative due to increasingly narrow policies about sourcing & preprints, part of its overall deletionist decay, so it’s not a good place to look for references. It is a good place to look for key terminology, though.↩︎

  3. When I was a kid, I knew I could just ask my reference librarian to request any book I wanted by providing the unique ID, the ISBN, and there was a physical copy of the book inside the Library of Congress; made sense. I never understood how I was supposed to get these “paper” things my popular science books or newspaper articles would sometimes cite—where was a paper, exactly? If it was published in The Journal of Papers, where did I get this journal? My library only had a few score magazine subscriptions, certainly not all of these Science and Nature and beyond. The bitter answer turns out to be: ‘nowhere’. There is no unique identifier (the majority of papers lack any DOI still), and there is no central repository nor anyone in charge—only a chaotic patchwork of individual libraries and defunct websites. Thus, books tend to be easy to get, but a paper can be a multi-decade odyssey taking one to the depths of the Internet Archive or purchasing from sketchy Chinese websites who hire pirates to infiltrate private databases.↩︎

  4. Most search engines will treat any space or separation as an implicit AND, but I find it helpful to be explicit about it to make sure I’m searching what I think I’m searching.↩︎

  5. It also exposes OCR of them all, which can help Google find them—albeit at the cost of you needing to learn ‘OCRese’ in the snippets, so you can recognize when relevant text has been found, but mangled by OCR/layout.↩︎

  6. This probably explains part of why no one cites that paper, and those who cite it clearly have not actually read it, even though it invented racial admixture analysis, which, since reinvented by others, has become a major method in medical genetics.↩︎

  7. University ILL privileges are one of the most underrated fringe benefits of being a student, if you do any kind of research or hobbyist reading—you can request almost anything you can find in WorldCat, whether it’s an ultra-obscure book or a master’s thesis from 1950! Why wouldn’t you make regular use of it‽ Of things I miss from being a student, ILL is near the top.↩︎

  8. The complaint and indictment are not necessarily the same thing. An indictment frequently will leave out many details and confine itself to listing what the defendant is accused of. Complaints tend to be much richer in detail. However, sometimes there will be only one and not the other, perhaps because the more detailed complaint has been sealed (possibly precisely because it is more detailed).↩︎

  9. Trial testimony can run to hundreds of pages and blow through your remaining PACER budget, so one must be careful. In particular, testimony operates under an interesting & controversial price discrimination system related to how court stenographers report—who are not necessarily paid employees but may be contractors or freelancers—intended to ensure covering transcription costs: the transcript initially may cost hundreds of dollars, intended to extract full value from those who need the trial transcript immediately, such as lawyers or journalists, but then a while later, PACER drops the price to something more reasonable. That is, the first “original” fee costs a fortune, but then “copy” fees are cheaper. So for the US federal court system, the “original”, when ordered within hours of the testimony, will cost <$7.25/page but then the second person ordering the same transcript pays only <$1.20/page & everyone subsequently <$0.90/page, and as further time passes, that drops to <$0.60 (and I believe after a few months, PACER will then charge only the standard $0.10). So, when it comes to trial transcript on PACER, patience pays off.↩︎

  10. I’ve heard that LexisNexis terminals are sometimes available for public use in places like federal libraries or courthouses, but I have never tried this myself.↩︎

  11. Curiously, in historical textual criticism of copied manuscripts, it’s the opposite: shorter = truer. But with memories or paraphrases, longer = truer, because those tend to elide details and mutate into catchier versions when the transmitter is not ostensibly exactly copying a text.↩︎

  12. The quick summary of DOIs is that they are “ISBNs but for research papers”; they are those odd slash-separated alphanumeric strings you see around, typically of a form like 10.000/abc.1234. (Unlike ISBNs, the DOI standard is very loose, with about the only hard requirement being that there must be one / character in it, so almost any string is a DOI, even hateful ones like this genuine DOI: 10.1890/0012-9658(2001)082[1655:SVITDB]2.0.CO;2.) Many papers have no DOI, or the DOI was assigned retroactively, but if they have a DOI, it can be the most reliable way to query any database for them.↩︎

  13. I advise prepending, like https://sci-hub.st/https://journal.com instead of appending, like https://journal.com.sci-hub.st/ because the former is slightly easier to type but more importantly, Sci-Hub does not have SSL certificates set up properly (I assume they’re missing a wildcard) and so appending the Sci-Hub domain will fail to work in many web browsers due to HTTPS errors! However, if prepended, it’ll always work correctly.↩︎

  14. Academic publishers like to use the dark pattern of putting a little icon, labeled “full access” or “access” etc, where an Open Access indicator would go, knowing that if you are not intimately familiar with that publisher’s site design & examining it carefully, you’ll be fooled. Another dark pattern is the unannounced temporary paper: in particular, the APA, NBER, & Cell are fond of unpaywalling PDFs to exploit media coverage, and then unpredictably, silently, revoking access later and breaking links.↩︎

  15. To further illustrate this IA feature: if one was looking for Alex St. John’s entertaining memoir “Judgment Day Continued…”, a 2013 account of organizing the wild 1996 Doom tournament thrown by Microsoft, but one didn’t have the URL handy, one could search the entire domain by going to https://web.archive.org/web/*/http://www.alexstjohn.com/* and using the filter with “judgment”, or if one at least remembered it was in 2013, one could narrow it down further to https://web.archive.org/web/*/http://www.alexstjohn.com/WP/2013/* and then filter or search by hand.↩︎

  16. If any Blogspot employee is reading this, for god’s sake stop this insanity!↩︎

  17. Uploading is not as hard as it may seem. There is a web interface (user/password: “genesis”/“upload”). Uploading large files can fail, so I usually use the FTP server: curl -T "$FILE" ftp://anonymous@ftp.libgen.is/upload/. ↩︎

  18. Although flatbed scanning is sometimes destructive too—I’ve cracked the spine of books while pressing them flat into a flatbed scanner.↩︎

  19. My workaround is to export from gscan2pdf as DjVu, which avoids the bug, then convert the DjVu files with ddjvu -format=pdf; this strips any OCR, so I add OCR with ocrmypdf and metadata with exiftool.↩︎

  20. One exception is Google Docs: one can (as of 2023-01-04) append /mobilebasic to the document URL to get a simplified HTML view which can be archived. For example, “A Comprehensive Guide to Dakimakuras as a Hobby” is available only as a Google Docs page but the URL https://docs.google.com/document/d/1oIlLt1uqutTP8725wezfZ2mjc-IPfOFCdc6hlRIb-KM/mobilebasic will work with the Internet Archive.↩︎