[previously: Zittrain et al. 2014] Hyperlinks are a powerful tool for journalists and their readers. Diving deep into the context of an article is just a click away. But hyperlinks are a double-edged sword; for all of the Internet’s boundlessness, what’s found on the web can also be modified, moved, or entirely disappeared. This often-irreversible decay of web content is commonly known as linkrot. A related problem is content drift: the often-unannounced changes (retractions, additions, replacements) to the content at a particular URL.
Our team of researchers at Harvard Law School has undertaken a project to gain insight into the extent and characteristics of journalistic linkrot and content drift. We examined hyperlinks in New York Times articles from the launch of the Times website in 1996 through mid-2019, drawing on a dataset provided to us by the Times. We focus on the Times not because its practices are unusually poor, but because it is an influential publication whose archives are often used to help form a historical record. The substantial linkrot and content drift we find across the Times corpus reflect the inherent difficulties of long-term linking to pieces of a volatile web.
Results show a near-linear increase in linkrot over time, with interesting patterns emerging within certain sections of the paper and across top-level domains. Over half of the articles containing at least one URL also contained a dead link. Additionally, among the ostensibly “healthy” links in these articles, a manual review revealed further erosion of citations via content drift.
…We found that of the 553,693 articles that included URLs on nytimes.com between its launch in 1996 and mid-2019, there were a total of 2,283,445 hyperlinks pointing to content outside of nytimes.com. 28% of these were “shallow links,” pointing only to a domain, such as example.com. 72% were “deep links,” which include a path to a specific page, such as example.com/article.
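The shallow/deep split can be made mechanically from the URL alone. A minimal sketch of one way to implement such a classifier (this rule is our illustration, not the study’s actual code):

```python
from urllib.parse import urlparse

def classify_link(url: str) -> str:
    """Classify a hyperlink as 'shallow' (domain only) or 'deep'
    (includes a path to a specific page)."""
    parsed = urlparse(url)
    # An empty path or bare "/" points at the site root: a shallow link.
    if parsed.path in ("", "/") and not parsed.query and not parsed.fragment:
        return "shallow"
    return "deep"

print(classify_link("https://example.com"))          # shallow
print(classify_link("https://example.com/article"))  # deep
```

Edge cases (query strings, fragments) are treated here as deep, since they usually select specific content; a production classifier would need to decide how to handle redirects and tracking parameters as well.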
We focused our analysis on deep links, as they made up the large majority of the sample and point to the specific material an article’s author intends readers to see. Of those, 25% were completely inaccessible, with linkrot becoming more common over time: 6% of links from 2018 had rotted, as compared to 43% of links from 2008 and 72% of links from 1998. 53% of all articles that contained deep links had at least one rotted link.
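Producing per-year rot rates like those above is a simple grouping exercise once each checked link has been reduced to a record. A hedged sketch (the `(year, is_rotted)` record format is illustrative, not the study’s actual schema):

```python
from collections import defaultdict

def rot_rate_by_year(records):
    """Given (publication_year, is_rotted) pairs for deep links,
    return the fraction of rotted links per publication year."""
    totals = defaultdict(int)   # links checked per year
    rotted = defaultdict(int)   # rotted links per year
    for year, is_rotted in records:
        totals[year] += 1
        if is_rotted:
            rotted[year] += 1
    return {year: rotted[year] / totals[year] for year in totals}

# Toy data; the real corpus contained roughly 1.6M deep links.
sample = [(1998, True), (1998, True), (1998, False),
          (2018, False), (2018, False), (2018, True)]
for year, rate in sorted(rot_rate_by_year(sample).items()):
    print(f"{year}: {rate:.0%} rotted")
```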
[Figure: Linkrot frequency over time]
On top of that, some reachable links no longer pointed to the information journalists had intended. An additional 13% of “healthy” links from a human-reviewed sample of 4,500 had drifted substantially since publication, with content drift becoming more common over time: 4% of reachable links published in articles from 2019 had drifted, as compared to 25% of reachable links from articles published in 2009… Of the 15 sections with the most articles, the Health section had the lowest relative rot rate, falling about 17% below the baseline linkrot frequency. The Travel section had the highest, with more than 17% of the links appearing in the section’s articles having rotted.
…For example, a section that reports heavily on government affairs or education might be disadvantaged by the fact that deep links to top-level domains like .gov and .edu show higher relative rot rates. This is initially counterintuitive, as both governments and academic institutions are widely regarded as enduring entities. In some ways, however, it is unsurprising, since these URLs are volatile by design: whitehouse.gov will always have the same URL but fundamentally changes in both content and structure with each new administration. Similarly, universities and academic institutions are controlled by a vast network of stakeholders who by nature turn over frequently. It is precisely because their domains are fixed that their deep links are fragile. Another irony: both educational institutions and government entities have mandates to keep historical repositories of their materials, and content they produce has long been seen as necessary to preserve. This practice appears to have lessened the focus on maintaining older material on the live web, since workflows for keeping records offline in pre-existing repositories long predate the internet.