“The Little Engines That Could: Modeling the Performance of World Wide Web Search Engines”, 2000-02-01:
This research examines the ability of 6 popular Web search engines [AltaVista, Northern Light, Infoseek, Excite, HotBot, Lycos], individually and collectively, to locate Web pages containing common marketing/management phrases. We propose and validate a model for search engine performance that is able to represent key patterns of coverage and overlap among the engines.
The model enables us to estimate the typical additional benefit of using multiple search engines, depending on the particular set of engines being considered. It also provides an estimate of the number of relevant Web pages not found by any of the engines. For a typical marketing/management phrase we estimate that the “best” search engine locates about 50% of the pages, and all 6 engines together find about 90% of the total.
The model is also used to examine how properties of a Web page and characteristics of a phrase affect the probability that a given search engine will find a given page. For example, we find that the number of Web page links increases the prospect that each of the 6 search engines will find it. Finally, we summarize the relationship between major structural characteristics of a search engine and its performance in locating relevant Web pages.
[Keywords: capture-recapture, hierarchical Bayes, marketing information, probability models, World Wide Web]
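To give a rough sense of the capture-recapture idea named in the keywords, the simplest two-sample version (the Lincoln-Petersen estimator) can be sketched as below. This is not the paper's hierarchical Bayes model, which covers all 6 engines and models their overlap. It is a simplified stand-in that assumes two engines find pages independently of each other. The function names and the toy URL sets are hypothetical and carry no data from the paper.

```python
def lincoln_petersen(found_a: set, found_b: set) -> float:
    """Estimate the total number of relevant pages from two engines' result sets.

    Assumes (unlike the paper's model) that the engines find pages independently,
    so the share of A's pages that B also finds estimates B's overall coverage.
    """
    overlap = len(found_a & found_b)
    if overlap == 0:
        raise ValueError("no overlap between engines: estimate is undefined")
    return len(found_a) * len(found_b) / overlap


def estimate_unseen(found_a: set, found_b: set) -> float:
    """Estimated number of relevant pages found by neither engine."""
    seen = len(found_a | found_b)
    return lincoln_petersen(found_a, found_b) - seen


# Toy data (hypothetical): engine A finds 60 pages, engine B finds 50, 30 in common.
engine_a = {f"url{i}" for i in range(60)}
engine_b = {f"url{i}" for i in range(30, 80)}

total_estimate = lincoln_petersen(engine_a, engine_b)   # 60 * 50 / 30
unseen_estimate = estimate_unseen(engine_a, engine_b)   # total minus the 80 pages seen
print(total_estimate, unseen_estimate)
```

With these toy sets the estimated total is 100 pages, of which 80 were seen by at least one engine, leaving an estimated 20 found by neither. When engines tend to find the same pages, the independence assumption makes this estimator understate the unseen total, which is one reason to model overlap explicitly.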
…Overall, based on the Model 3 estimates in Table 8 (and consistent with Table 1), we can make 5 simple statements concerning the “best engine question”:
- Overall, for a randomly chosen marketing phrase and URL, the search engine most likely to find it is AltaVista.
- But, Northern Light is a very close second and, in fact, does slightly better than AltaVista in finding managerial phrases.
- HotBot is a very respectable third, locating a little over 50%–60% as many URLs as AltaVista or Northern Light.
- Excite and Infoseek trail more substantially, locating 20%–30% as many documents as the 2 leading engines.
- Lycos found 10%–15% as many documents as the 2 leaders.