“How Large Is the World Wide Web?”, 2004:
There are many metrics by which one could measure the size of the World Wide Web; in the present chapter we focus on size in terms of the number n of Web pages. Since a database of all valid URLs on the Web cannot be constructed and maintained, determining n by counting is impossible. For the same reason, estimating n by sampling directly from the Web is also infeasible. Instead of studying the Web as a whole, one can try to assess the size of the publicly indexable Web, the part of the Web that is considered for indexing by the major search engines.
Several groups of researchers have invested considerable effort in developing sound sampling schemes that involve submitting queries to several major search engines. One group (1998) developed a procedure for sampling Web documents by submitting various queries to a number of search engines. We contrast their study with another (1998), performed in November 1997. Although both experiments took place at almost the same time, their estimates differ substantially.
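These overlap studies rest on the classical two-list capture-recapture idea: if two engines index pages roughly independently, the overlap between samples drawn from them constrains the total population size. A minimal sketch of the underlying Lincoln-Petersen estimator, with all counts invented for illustration (the actual sampling procedures in these studies are more elaborate):

```python
def lincoln_petersen(n_a: int, n_b: int, n_ab: int) -> float:
    """Two-list capture-recapture estimate of total population size.

    n_a  -- pages found via engine A (first "capture")
    n_b  -- pages found via engine B (second "capture")
    n_ab -- pages found via both (the "recaptured" overlap)

    Under independence, n_ab / n_b ~= n_a / n, so n ~= n_a * n_b / n_ab.
    """
    if n_ab == 0:
        raise ValueError("empty overlap: estimator is undefined")
    return n_a * n_b / n_ab

# Hypothetical counts from query-based samples of two engines:
print(lincoln_petersen(n_a=320, n_b=410, n_ab=47))  # ~2791 pages
```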
In this chapter we review how the size of the indexable Web was estimated by three groups of researchers using three different statistical models (1998/1999, 1998, and 2000). Then we present a statistical framework for the analysis of data sets collected by query-based sampling, using a hierarchical Bayes formulation of the capture-recapture Rasch model for multiple-list population estimation developed in earlier work (1999). We explain why this approach is in reasonable accord with the real-world constraints and thus allows us to make credible inferences about the size of the Web.
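For reference, the multiple-list Rasch model referred to here treats each search engine as a "list" and models the probability that page i is captured by engine j through a page-level catchability and an engine-level effect. A standard statement of the model (the notation here is ours, not necessarily the chapter's):

```latex
% Rasch capture-recapture model for k lists (search engines):
% page i is captured by list j with probability
\Pr(X_{ij} = 1 \mid \theta_i, \beta_j)
  = \frac{\exp(\theta_i + \beta_j)}{1 + \exp(\theta_i + \beta_j)},
\qquad i = 1, \dots, n, \quad j = 1, \dots, k,
% where \theta_i is the catchability of page i, \beta_j the propensity
% of engine j to index pages, and the unknown n (the size of the
% indexable Web) is the estimand. The hierarchical Bayes formulation
% places prior distributions on n and on the \theta_i and \beta_j.
```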
We give two different methods that produce credible estimates of the size of the Web in a reasonable amount of time while remaining consistent with these real-world constraints.
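To make the data such methods consume concrete: a query-based sample yields, for each sampled page, a capture history across the k engines, and the all-zero history is unobservable, which is exactly why n must be inferred rather than counted. A small simulation of capture histories under the Rasch model sketched above, with all parameter values invented for illustration:

```python
import math
import random
from collections import Counter

random.seed(0)

def simulate_capture_histories(n=10_000, betas=(-1.0, -0.5, 0.0)):
    """Simulate which of k engines indexes each of n pages under a
    Rasch model: P(capture) = logistic(theta_i + beta_j)."""
    logistic = lambda x: 1.0 / (1.0 + math.exp(-x))
    histories = Counter()
    for _ in range(n):
        theta = random.gauss(0.0, 1.0)  # page-level catchability
        h = tuple(int(random.random() < logistic(theta + b)) for b in betas)
        histories[h] += 1
    return histories

hist = simulate_capture_histories()
observed = {h: c for h, c in hist.items() if any(h)}
print("observed pages:", sum(observed.values()))
# In a real study this cell is never seen; it is what we must estimate:
print("unobservable (all-zero) cell:", hist[(0, 0, 0)])
```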