Hacker News
Internet Search Tips (gwern.net)
174 points by telotortium on Dec 12, 2018 | 28 comments



These are some interesting tips. And as a bonus, the page is giving me serious flashbacks to Fravia's Web Searchlores! http://search.lores.eu/indexo.htm


Hopefully my tips are a little easier to read and learn from than Fravia's. And much more up to date...


Thanks for writing this up.

I see no mention of the wayback machine.

Nor digging into raw git and mercurial repositories.

I have also had luck emailing folks who had either inquired about the document themselves or might know the last good email address for the author.

Universities often have blanket access to journals if you use their wifi. SH has mostly supplanted the need for this, but not always. Universities are still great for interlibrary loans: if you are friendly with the library staff, they might let you ILL a book and read it in the library even if they don't check books out to ordinary citizens.

The last one, on mining a local search database: like you said, determine the boolean operators, get familiar with the FTS backends (SQLServer, Postgres, Elastic Search, etc), and do some lightweight, unobtrusive scripting (using the in-browser JS console).

Like this snippet, which extracts the PDF and doi.org URLs from a page. `copy` is a Chrome-specific helper. This snippet is great for extracting results that are lazily loaded in, something wget cannot do for you.

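    // Collect links to PDFs and doi.org pages from the current document.
    function gpdfs() {
        var aref = document.getElementsByTagName("a");
        var result = [];
        for (var i = 0; i < aref.length; i++) {
            var t = aref[i];
            if (t.href.endsWith(".pdf") || t.href.includes("doi.org")) {
                result.push(t.href);
            }
        }
        return result;
    }

    // copy() is a console-only helper, not standard JavaScript.
    function pc() {
        copy(gpdfs());
    }

    pc();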


> I see no mention of the wayback machine.

The IA is covered in multiple sections. The Wayback Machine is merely one area of the IA.

> Nor digging into raw git and mercurial repositories.

I have never had to do that for any research task I've undertaken, so that would be both too obscure to mention and I wouldn't know anything about it.

> Universities often have blanket access to journals if you use their wifi.

A good point. My proxy superseded this for me but I used to do this, and simply forgot about it. I'll add that. (Another good trick I know is using the old Google Reader RSS exports hosted on IA to get fulltext of webpages. I'll have to add that too: https://gwern.net/Search#searching-the-google-reader-arc... )

> get familiar with the FTS backends (SQLServer, Postgres, Elastic Search, etc)

I was really just thinking of grep and some other CLI utilities there. :)


> I was really just thinking of grep and some other CLI utilities there. :)

Another technique I use is to spider a site (respectfully) and then use https://github.com/BurntSushi/ripgrep or load the spidered data into a local Elastic search.
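
If ripgrep isn't installed, the search half of this can be roughly approximated in Node. This is only a sketch, not how ripgrep works: `walk` and `searchDir` are made-up names, it assumes the spidered pages are UTF-8 text files saved under one directory, and it will be much slower than rg on a large crawl.

```javascript
const fs = require("fs");
const path = require("path");

// Recursively yield the path of every regular file under `dir`.
function* walk(dir) {
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
        const full = path.join(dir, entry.name);
        if (entry.isDirectory()) {
            yield* walk(full);
        } else if (entry.isFile()) {
            yield full;
        }
    }
}

// Return { file, line, text } for every line that matches `pattern` (a RegExp).
function searchDir(dir, pattern) {
    const hits = [];
    for (const file of walk(dir)) {
        const lines = fs.readFileSync(file, "utf8").split(/\r?\n/);
        lines.forEach((text, i) => {
            // Reset lastIndex so a regex with the g or y flag is not stateful across lines.
            pattern.lastIndex = 0;
            if (pattern.test(text)) {
                hits.push({ file, line: i + 1, text });
            }
        });
    }
    return hits;
}
```

Something like `searchDir("./mirror", /interlibrary loan/i)` would then list each matching line with its file and line number; `./mirror` is a placeholder for wherever the spider wrote its output.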

Some of what you outline in searching other parts dovetails into handling poor access to the Internet. For folks with bad or transient access:

    pip install --upgrade youtube_dl

to download YouTube videos (or complete playlists) for offline viewing.


BTW "pip install warcio" is the latest hotness for processing warc files. Also, you can add a header to wget to download a byte range. Unfortunately archival tools are mostly targeted at crawling and then playback, not so much special collections like this one.


How do you use wget? I know you can specify a start position, intended for resuming downloads, but you need to specify an end position as well (to extract just the single compressed file), and I don't see any way to specify an end in the wget manpage.


Hm, come to think of it, I guess I was programming in python at the time. The internets say that wget doesn't support range requests and while curl's manpage says --range works, it doesn't work for me.
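
For reference, a range request with both a start and an end is just a `Range: bytes=START-END` header, where END is inclusive. A rough sketch in JavaScript (browser console or Node 18+, where `fetch` is built in); `rangeHeader` and `fetchSlice` are hypothetical helper names, and it assumes the server honors range requests:

```javascript
// Build an HTTP Range header value for `length` bytes starting at `offset`.
// Range ends are inclusive, so the last byte requested is offset + length - 1.
function rangeHeader(offset, length) {
    if (!Number.isInteger(offset) || !Number.isInteger(length) || offset < 0 || length < 1) {
        throw new RangeError("offset must be an integer >= 0 and length an integer >= 1");
    }
    return `bytes=${offset}-${offset + length - 1}`;
}

// Fetch only that slice of the resource and return its bytes.
async function fetchSlice(url, offset, length) {
    const res = await fetch(url, { headers: { Range: rangeHeader(offset, length) } });
    // 206 Partial Content means the server honored the range;
    // a plain 200 means it ignored the header and is sending the whole file.
    if (res.status !== 206) {
        throw new Error(`expected 206 Partial Content, got ${res.status}`);
    }
    return new Uint8Array(await res.arrayBuffer());
}
```

Given the byte offset and compressed length of a single record, `fetchSlice(url, offset, length)` should return only that record's bytes; the offset and length have to come from whatever index you have for the file.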


> master/PhD theses: sorry. It’s probably hopeless if it’s pre-2000. If you have a university proxy, you may be able to get a copy off ProQuest. Otherwise, you need full university ILL services, and even that might not be enough (a surprising number of universities appear to restrict access only to the university students/faculty, with the complicating factor of most theses being stored on microfilm).

At least in the Netherlands there is [1] which indexes many papers/theses including very old ones, and many are freely accessible.

[1] https://www.narcis.nl/


I've used my (US) university's interlibrary loan (ILL) service many times.

In my experience theses and dissertations are treated like any other book in interlibrary loan. Local public libraries also often have interlibrary loan, perhaps for a fee. So I can't see why theses would be inaccessible via ILL outside of a university.

I have run into some theses and dissertations being restricted to people affiliated with the particular university which holds the book, but that's rare in my experience. More common is the case of rare books (few copies available, like in the case of theses/dissertations) being marked as "library use only", so you can't take the book outside of the library. But I have received many of these via ILL, and my university simply respects the wishes of the book's owner by not allowing me to take the book outside of the library I requested the book from.


As a kid, I made heavy use of my public library's ILL (sorry, taxpayers), but I never heard any whisper that university-level ILL of any documents was possible, and the forms my librarians filled out strongly implied that only books & CDs were ever contemplated.

> In my experience theses and dissertations are treated like any other book in interlibrary loan.

Yes, but you have to be at a university in the first place. Life is grand if you're a student with full privileges - ILL is, IMO, one of the most underrated fringe benefits of being a student - but once you're out, you're out in the cold. I've discussed this with any number of people, including people at think-tanks without university affiliations, and no one's come up with a good way to get back into the ILL system short of hiring students to do requests for you.

> I have run into some theses and dissertations being restricted to people affiliated with the particular university which holds the book, but that's rare in my experience.

Yes, fairly rare, but it still happens. Unfortunately, I don't know what happens when you ILL them because I didn't run into any examples until after I graduated. I'm also puzzled when I run into online open access theses which are, however, embargoed for a year... (I simply schedule a reminder to go back, but I'm perplexed as to what could possibly be the point of that.)


Have you tried requesting a thesis or dissertation via a public library's interlibrary loan service? I am fairly confident that most universities would lend a thesis or dissertation to a public library. I don't see any clear reason why they would not other than snobbery. (Edit: And I can recall a few instances where my university got a book via ILL from a public library. So the converse does happen, at least.)

While I have not used public library ILL, my impression is that the difference is more in scale than access. (Though access surely is reduced.) I recall reading public library ILL policies that charged for requests and only allowed one request at a time. That would be a lot worse than what I have right now, but much better than nothing. (Also, I have used ILL at two government labs and I would say the difference again was more in scale than access, and that these labs were somewhere between public and university libraries.)

There are many paid document delivery services. Here's my university's one: https://www.lib.utexas.edu/find-borrow-request/interlibrary-...

I've used a few paid document delivery services before. They weren't cheap, but I was able to get some things that my university's ILL wasn't able to.

Personally, I'd prefer some sort of scan request barter system. I scan X pages for you, you scan X pages for me. I've done similar things informally. r/scholar hasn't worked out well for anything not already digitized. I've thought before that someone should make a scan request website that gives you credits for fulfilling a request. Hopefully the lawyers would stay away from this as these items can't really be obtained any other way; if they were available for sale, people would buy them.

As for embargoed theses, my impression is that usually the author wants to get a patent on something discussed in the thesis. In US law, you must file for a patent within 1 year of publicly disclosing the invention. So the embargo gives them more time. There likely are other reasons as well, but this is the only one I've encountered.


> I don't see any clear reason why they would not other than snobbery.

Snobbery is a pretty good reason for anything, I've found. In any case, it could be many things: expense (as my university regularly reminded us, each ILL cost like $20 on net), low demand from patrons (even if self-fulfilling, still a valid reason), lack of trust, not being plugged into the right ILL system/software... I don't recall any of the books I ILLed from my public library being indicated as being from universities, though there were several close to us.

I suppose I should try my public library when I have some spare time. I'm almost certain they'd be unable to get either books or theses or papers, but it'd be interesting to know the specific reason why not.

> There are many paid document delivery services. Here's my university's one:

Yes, actually, I know someone who used that one last month. (Scan quality could've been much better, IMO.) It is, unfortunately, rare to have a straightforward 'pay us $X and we'll give you a copy of any thesis' service linked or mentioned on the library website, and they are only for that library's holdings ("scans of book chapters and articles from the UT Libraries collection" ie not anyone else's). I would complain a lot less if most universities had it! I've sometimes wondered if more university libraries have it than I think they do, and they just hide it.

> As for embargoed theses, my impression is that usually the author wants to get a patent on something discussed in the thesis.

That would be reasonable, but in the subjects I usually research, that would make little sense. I think the last embargoed student thesis I ran into was a study on the stimulant effect of nicotine on cognitive performance; hard to see any patent on that being possible, much less profitable to apply to.


I'd push back if a public library refused to do ILL. (Though I can be fairly stubborn.) If cost is a concern, offer to pay. If a librarian says that they don't offer that service due to low demand, ask if they could make an exception. Maybe ask another librarian who might be more open to the idea. I don't know what to do about a lack of trust.

Lacking the right software is not a valid excuse. UT will send "ALA requests" every once in a while if the lending library isn't in their software. As far as I can tell an ALA request is just this form mailed or emailed: http://www.ala.org/rusa/sites/ala.org.rusa/files/content/sec...

I might suggest contacting various universities about getting copies of theses from them. It's possible that they'd be perfectly willing to scan them for you, even for free. On this note, I've been surprised by the extent some corporations have gone to provide me with copies of proprietary technical reports they produced. One time I called the number on the website of a large corporation and my call was forwarded to their staff librarian, as I recall. A few weeks later I got the requested report. It had to go through some release process, but they were happy to share their work. Another time I emailed a generic address at a Shell research lab, and a few weeks later I got a copy of the report I wanted. There have been failures as well, but I was surprised by how frequent the successes were.


Temporary embargo can happen when parts of the thesis were published in a non-open journal that requires an embargo period, or there's an external party involved that needs to clear publication.


Regarding the hopeless comment: if the author is still alive the easiest way to get a copy is, well, by emailing the author.

Unless the author is a hotshot, they should be able to procure an electronic copy, even if it was done in pre-word-processing times.

Most authors would be flattered, not inconvenienced.


Worth adding in my opinion:

booksc/bookfi/bookzz

IRC bot rooms (#bookz, #ebooks and others)

Private trackers (Bibliotik et al.)

DHT search engines, eMule, DC++

In my country you can also just walk into a public library, get a free membership card and start browsing


Most important in my experience are the terms that you choose for a search. Sometimes it helps to think about how most people would phrase a question.

Often results can be too broad ... so use them to choose more specific terms to add to your query to 'focus in' on the result you need. (E.g. adding years (even specific dates) can add focus -and quality- to your results.)

It helps to keep a growing list of bookmarks for high-quality 'specialist' sites with -a lot- of content.


Great article, but I’m left disappointed after looking at the “obstacles” that were presented. For example, paywalls: why is it ok for legal documents to cost money to access? This should be free for everyone–if they need money for “technology” (somehow RECAP apparently doesn’t, or can figure out a way to cover their expenses without charging people) they should roll it into taxes. Same goes for university research, especially if it’s publicly funded. So much effort wasted for information that should be freely available.


some good tips but the site is super hard to read (content structure wise)


I've reorganized it a bit.


I am honestly concerned for Gwern's mental health. So much personal optimization taken to such an extreme, doing blatantly illegal things and blogging about it. Some are probably quite unhealthy.

I am all for being the best person we all can be -- but Gwern seems to have made personal optimization the end goal itself.

I think we should aim to be happy well adjusted people, not work ourselves to death, not drug ourselves to our physical limit.

There were times in my life when I behaved similarly and it was rooted in deep dissatisfaction with life.

To be clear, this is not an ad hominem attack. The article is amazing.


Gwern is amazing and what a fascinating website. I hear you and understand where you are coming from, but some people thrive under extreme organization like this. These are the kind of people I want writing technical documentation on my teams.


The ability to wield the power of the Internet to answer questions, not just optimizing a shopping cart, but the ability to answer deep questions is a needed and important skill.

I have zero issue if someone wants to optimize themselves, even with no other end other than "optimal".


> doing blatantly illegal things and blogging about it

What are the legal risks of using (downloading from and uploading to) libgen, or buying from ebook.farm?


"If you can’t use it while chatting without the other person noting your pauses, it’s not fast enough."

Still takes time to read, comprehend and reply.


I think they might mean chatting online.


Gwern has now replaced "chatting" with "IRCing".



