“Counting YouTube Videos via Random Prefix Sampling”, Jia Zhou, Yanhua Li, Vijay Kumar Adhikari, Zhi-Li Zhang, 2011-11-02:

Leveraging the characteristics of the YouTube video ID space and exploiting a unique property of the YouTube search API, in this paper we develop a random prefix sampling method to estimate the total number of videos hosted by YouTube. Through theoretical modeling and analysis, we demonstrate that the estimator based on this method is unbiased, and provide bounds on its variance and confidence interval. These bounds enable us to judiciously select sample sizes to control estimation errors.
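
The general idea behind prefix sampling can be illustrated with a small simulation. The sketch below is not the authors' implementation: it assumes a synthetic set of short IDs over a 64-character alphabet, replaces the search API with an in-memory lookup table, and uses hypothetical helper names (`make_synthetic_ids`, `count_by_prefix`, `estimate_total`). It draws prefixes uniformly at random, counts the IDs under each, and scales the mean count by the number of possible prefixes. The reported standard error is a plain sample-based figure, not the variance bound derived in the paper.

```python
import math
import random
import statistics

# Assumed 64-character ID alphabet (URL-safe base64 characters).
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"


def make_synthetic_ids(n, id_len, rng):
    """Generate n distinct random IDs; a stand-in for the real video ID set."""
    ids = set()
    while len(ids) < n:
        ids.add("".join(rng.choice(ALPHABET) for _ in range(id_len)))
    return ids


def count_by_prefix(ids, prefix_len):
    """Index the IDs by prefix; a stand-in for querying a search API by prefix."""
    counts = {}
    for vid in ids:
        prefix = vid[:prefix_len]
        counts[prefix] = counts.get(prefix, 0) + 1
    return counts


def estimate_total(counts, prefix_len, num_samples, rng):
    """Estimate the total number of IDs from uniformly sampled prefixes.

    Returns (estimate, standard_error). The estimate is the number of
    possible prefixes times the mean count observed per sampled prefix.
    """
    num_prefixes = len(ALPHABET) ** prefix_len
    observed = []
    for _ in range(num_samples):
        prefix = "".join(rng.choice(ALPHABET) for _ in range(prefix_len))
        observed.append(counts.get(prefix, 0))  # unseen prefix means zero IDs
    estimate = num_prefixes * statistics.mean(observed)
    std_error = num_prefixes * statistics.stdev(observed) / math.sqrt(num_samples)
    return estimate, std_error


if __name__ == "__main__":
    rng = random.Random(0)
    ids = make_synthetic_ids(20000, id_len=6, rng=rng)
    counts = count_by_prefix(ids, prefix_len=2)
    estimate, std_error = estimate_total(counts, prefix_len=2, num_samples=2000, rng=rng)
    print(f"true total: {len(ids)}, estimate: {estimate:.0f} +/- {std_error:.0f}")
```

Because each prefix is drawn with equal probability, the expected count per sampled prefix equals the true total divided by the number of prefixes, which is why the scaled mean is unbiased in this toy setting. Increasing `num_samples` shrinks the standard error roughly in proportion to the inverse square root of the sample size.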

We evaluate our sampling method and validate the sampling results using two distinct collections of YouTube video IDs (namely, treating each collection as if it were the “true” collection of YouTube videos). We then apply our sampling method to the live YouTube system, and estimate that there were roughly 500 million YouTube videos in total as of May 2011.

Finally, using an unbiased collection of YouTube videos sampled by our method, we show that YouTube video view count statistics collected by prior methods (e.g., through crawling of related video links) are highly skewed, substantially under-estimating the number of videos with very small view counts (<1,000); we also shed light on the bounds for the total storage YouTube must have and the network capacity needed to deliver YouTube videos.