“Newswire: A Large-Scale Structured Database of a Century of Historical News”, 2024-06-13:
[HuggingFace] In the U.S. historically, local newspapers drew their content largely from newswires like the Associated Press. Historians argue that newswires played a pivotal role in creating a national identity and shared understanding of the world, but there is no comprehensive archive of the content sent over newswires.
We reconstruct such an archive by applying a customized deep learning pipeline to hundreds of terabytes of raw image scans from thousands of local newspapers.
The resulting dataset contains 2.7 million unique public domain U.S. newswire articles, written 1878–1977. [The cutoff is due to works being copyrighted by default starting in 1978.] Locations in these articles are geo-referenced, topics are tagged using customized neural topic classification, named entities are recognized, and individuals are disambiguated to Wikipedia using a novel entity disambiguation model.
To construct the Newswire dataset, we first recognize newspaper layouts and transcribe around 138 million structured article texts from raw image scans. We then use a customized neural bi-encoder model to de-duplicate reproduced articles, in the presence of considerable abridgment and noise, quantifying how widely each article was reproduced. A text classifier is used to ensure that we only include newswire articles, which historically are in the public domain. The structured data that accompany the texts provide rich information about the who (disambiguated individuals), what (topics), and where (geo-referencing) of the news that millions of Americans read over the course of a century.
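The de-duplication step can be illustrated with a minimal sketch. It is not the authors' pipeline: the neural bi-encoder is replaced here by a toy character-trigram embedding, and the function names, the `0.6` threshold, and the sample texts are all hypothetical. What it does share with the description above is the overall shape: embed each article, link pairs whose cosine similarity exceeds a threshold, and read off cluster sizes as a measure of how widely an article was reproduced.

```python
import math
import re
from collections import Counter


def embed(text, n=3):
    """Toy stand-in for a neural bi-encoder (an assumption of this sketch):
    an L2-normalised bag of character n-grams."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", text.lower())
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    grams = Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))
    norm = math.sqrt(sum(c * c for c in grams.values())) or 1.0
    return {g: c / norm for g, c in grams.items()}


def cosine(a, b):
    """Cosine similarity of two normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def cluster_duplicates(texts, threshold=0.6):
    """Group texts whose pairwise similarity reaches the (hypothetical)
    threshold, using union-find so that links are transitive."""
    vectors = [embed(t) for t in texts]
    parent = list(range(len(texts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(texts)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values(), key=lambda c: c[0])


# Invented examples: an article, an abridged copy, an OCR-noisy copy,
# and an unrelated article.
SAMPLE = [
    "WASHINGTON (AP) - The Senate voted today to approve the farm relief bill after a long debate.",
    "WASHINGTON (AP) - The Senate voted today to approve the farm relief bill.",
    "WASH1NGTON (AP) - The Senate voted tcday to approve the farm relief bill after a long debate.",
    "CHICAGO (UP) - Heavy snow closed roads across northern Illinois on Monday morning.",
]

if __name__ == "__main__":
    for members in cluster_duplicates(SAMPLE):
        print(f"article {members[0]} reproduced {len(members)} time(s): {members}")
```

The all-pairs comparison is quadratic and would not scale to the roughly 138 million transcribed articles; a real system would need approximate nearest-neighbor search over the embeddings, which this sketch leaves out.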
We also include Library of Congress metadata information about the newspapers that ran the articles on their front pages.
The Newswire dataset is useful both for large language modeling (expanding training data beyond what is available from modern web texts) and for studying a diversity of questions in computational linguistics, social science, and the digital humanities.