“Digitization and the Demand for Physical Works: Evidence from the Google Books Project”, Abhishek Nagaraj, Imke Reimers, 2021-04-12:

Digitization has allowed customers to access content through online channels at low cost or for free. While free digital distribution has spurred concerns about cannibalizing demand for physical alternatives, digital distribution that incorporates search technologies could also allow the discovery of new content and boost, rather than displace, physical sales.

To test this idea, we study the impact of the Google Books digitization project, which digitized large collections of written works and made the full texts of these works widely searchable. Exploiting a unique natural experiment from Harvard Libraries, which worked with Google Books to digitize its catalog over a period of 5 years, we find that digitization can boost sales of physical book editions by 5–8%.

Digital distribution seems to stimulate demand through discovery: the increase in sales is stronger for less popular books and spills over to a digitized author’s non-digitized works. On the supply side, digitization allows small and independent publishers to discover new content and introduce new physical editions for existing books, further increasing sales.

Combined, our results point to the potential of free digital distribution to stimulate discovery and strengthen the demand for and supply of physical products.

…We tackle the empirical challenges through a unique natural experiment leveraging a research partnership with Harvard’s Widener Library, which provided books to seed the Google Books program. The digitization effort at Harvard only included out-of-copyright works, which, unlike in-copyright works, were made available to consumers in their entirety. This allows us to fairly assess the tradeoff between cannibalization (by a close substitute) and discovery (through search technology). Owing to the size of the collection, book digitization (and subsequent distribution) at Widener took over 5 years, providing substantial variation in the timing of book digitization. Further, our interviews with key informants suggest that the order of book digitization proceeded on a “shelf-by-shelf” basis, driven largely by convenience. While their testimony suggests no overt sources of bias, our setting is still not a randomized experiment, so we perform a number of checks to establish the validity of the research design and address any potential concerns.

We obtained access to data on the timing of digitization activity as well as information on a comparable set of never-digitized books, which allows us to evaluate the impact of digital distribution on demand for physical works. Specifically, we combine data from 3 main sources. First, we collect data on the shelf-level location of books within the Harvard system (2003–2011) along with information on their loan activity. Since most books are never loaned, our analyses focus on 88,006 books (out of over 500,000) that had at least one loan in the sample period (and are robust to using a smaller sample of books with at least one loan before the start of digitization). Second, for a subset of 9,204 books (in English with at least four total loans), we obtain weekly US sales data on all related physical editions from the NPD (formerly Nielsen) BookScan database. The sales data must be manually collected and matched, which restricts the size of this sample. Finally, we are interested in the effect of digital distribution on physical supply through the release of new editions. Accordingly, we also collect data from the Bowker Books-In-Print database on book editions and prices, differentiating between established publishers and independents. We use these combined data and the natural experiment we outlined to examine the effects of free digital distribution on the demand and supply of physical editions. Our panel data structure allows for a difference-in-differences design that can incorporate time and, notably, book fixed effects, increasing confidence in the research design.
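To make the difference-in-differences design with book and time fixed effects concrete, here is a minimal sketch on synthetic data. It is not the authors' code or data: the panel dimensions, the effect size, the noise level, and the function names `simulate_panel`, `double_demean`, and `twfe_estimate` are all assumptions made for illustration. It relies on the fact that, in a balanced panel, subtracting book means and year means (and adding back the grand mean) is equivalent to including both sets of fixed effects in the regression.

```python
import numpy as np


def simulate_panel(n_books=400, n_years=9, true_effect=0.05, noise_sd=0.05, seed=0):
    """Build a synthetic balanced book-by-year panel (all values are made up)."""
    rng = np.random.default_rng(seed)
    book_fe = rng.normal(0.0, 1.0, size=n_books)   # time-invariant book popularity
    year_fe = rng.normal(0.0, 0.5, size=n_years)   # shocks common to all books
    # Roughly half the books are scanned at a staggered year in 1..n_years-2;
    # the rest get a scan year beyond the sample, so they are never treated.
    scan_year = np.where(
        rng.random(n_books) < 0.5,
        rng.integers(1, n_years - 1, size=n_books),
        n_years + 1,
    )
    years = np.arange(n_years)
    treated = (years[None, :] >= scan_year[:, None]).astype(float)
    noise = rng.normal(0.0, noise_sd, size=(n_books, n_years))
    y = book_fe[:, None] + year_fe[None, :] + true_effect * treated + noise
    return y, treated


def double_demean(a):
    """Remove row (book) and column (year) means from a balanced panel."""
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()


def twfe_estimate(y, treated):
    """OLS slope of the outcome on the post-digitization dummy after double-demeaning."""
    y_t = double_demean(y)
    d_t = double_demean(treated)
    return float((d_t * y_t).sum() / (d_t ** 2).sum())


if __name__ == "__main__":
    y, treated = simulate_panel()
    print(f"estimated effect: {twfe_estimate(y, treated):.3f} (simulated truth: 0.05)")
```

Because the book fixed effects absorb every time-invariant difference between scanned and never-scanned books, the estimate is identified only from within-book changes around the digitization date, net of common year shocks. The sketch assumes a single homogeneous effect; it says nothing about the controls, clustering, or sample restrictions used in the paper.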

The baseline results suggest that rather than decrease sales, the impact of Google Books digitization on sales of physical copies is positive. In our preferred specification, digitization increases sales by 4.8% and increases the likelihood of at least one sale by 7.7 percentage points…Each year, books that are never scanned have an average annual probability of being sold of 16%, whereas those that are scanned have a probability of only 8.5% before their digitization and 24.1% after it. Similarly, books that are never digitized have a probability of 17.8%, while books that are digitized have a probability of 19.3% before their digitization but only 11% after their digitization. These differences are indicative of large potential impacts of digitization on demand.

…We confirm our findings in a series of robustness checks and tests of the validity of the research design. First, in addition to book and year × shelf-location fixed effects, we also incorporate time-varying controls at the book level such as search volume from Google Trends and availability on alternative platforms like Project Gutenberg. Second, we provide a number of subsample analyses dropping certain books that raise concerns about the exogeneity of their timing, including limiting the data to only public domain and scanned books. Third, we create a “twins” sample that consists of pairs of scanned and unscanned books adjacent to each other on the library shelves and hence covering the same subject. Finally, we also collected data on Amazon reviews for a set of books in our sample as an alternative measure of physical demand. All results are largely in line with the baseline result.
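One way to picture the “twins” sample is as a greedy pairing of shelf neighbors with different scan status. The excerpt does not describe the actual matching procedure, so the sketch below is only an assumption: the `Book` record, its `shelf_position` field, and the `build_twins` function are hypothetical names, and the rule that each book joins at most one pair is a simplification chosen for illustration.

```python
from typing import List, NamedTuple, Tuple


class Book(NamedTuple):
    book_id: str
    shelf_position: int  # hypothetical ordinal position along the shelf
    scanned: bool


def build_twins(books: List[Book]) -> List[Tuple[str, str]]:
    """Pair shelf-adjacent books with different scan status.

    Returns (scanned_id, unscanned_id) tuples; each book is used at most once.
    """
    ordered = sorted(books, key=lambda b: b.shelf_position)
    pairs = []
    i = 0
    while i < len(ordered) - 1:
        left, right = ordered[i], ordered[i + 1]
        if left.scanned != right.scanned:
            scanned, unscanned = (left, right) if left.scanned else (right, left)
            pairs.append((scanned.book_id, unscanned.book_id))
            i += 2  # skip both books so neither is reused
        else:
            i += 1  # same status: slide the window by one book
    return pairs


if __name__ == "__main__":
    shelf = [
        Book("A", 1, True),
        Book("B", 2, False),
        Book("C", 3, False),
        Book("D", 4, True),
        Book("E", 5, True),
    ]
    print(build_twins(shelf))
```

Restricting the comparison to such pairs holds subject matter roughly constant, since neighboring books on a library shelf are classified under the same topic.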