"The Pile: An 800GB Dataset of Diverse Text for Language Modeling", 2021:
[torrent download] Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models.
The Pile is constructed from 22 diverse high-quality subsets, many of which derive from academic or professional sources. [Common Crawl, PubMed Central, Bibliotik (Books3), OpenWebText2, arXiv, Github, FreeLaw, Stack Exchange, USPTO Backgrounds, PubMed Abstracts, Gutenberg (PG-19), OpenSubtitles, English Wikipedia, DeepMind Mathematics, Ubuntu IRC, BookCorpus2, EuroParl, Hacker News, YouTubeSubtitles, PhilPapers, NIH ExPorter, Enron Emails]
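As an illustration of what "constructed from 22 subsets" amounts to in practice, here is a minimal sketch of drawing a training stream from a weighted mixture of named components. The subset names mirror the list above, but the weights and the `docs_by_subset` structure are placeholders for illustration, not the paper's actual mixing proportions or pipeline.

```python
import random

# Placeholder weights; the Pile's real epoch-weighted proportions differ.
weights = {
    "Pile-CC": 0.18,
    "PubMed Central": 0.14,
    "Books3": 0.12,
    "OpenWebText2": 0.10,
    "ArXiv": 0.09,
}

def sample_documents(docs_by_subset, weights, n):
    """Yield n (subset, document) pairs, picking a subset per draw
    in proportion to its weight, then a document uniformly within it."""
    names = list(weights)
    probs = [weights[k] for k in names]
    for _ in range(n):
        subset = random.choices(names, weights=probs, k=1)[0]
        yield subset, random.choice(docs_by_subset[subset])
```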
Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve substantially over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations.
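The per-component evaluation above is a perplexity comparison. The sketch below shows one hedged way to reproduce that kind of measurement for GPT-2 with the Hugging Face `transformers` library; it is not the paper's evaluation harness, and the windowing stride and the sample file name are assumptions.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str, stride: int = 512) -> float:
    """Average per-token perplexity over fixed-length windows of `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    nlls = []
    for start in range(0, ids.size(1) - 1, stride):
        window = ids[:, start : start + stride + 1]
        if window.size(1) < 2:
            break
        with torch.no_grad():
            out = model(window, labels=window)  # loss = mean next-token NLL
        nlls.append(out.loss)
    return torch.exp(torch.stack(nlls).mean()).item()

# Hypothetical usage on one document from a component such as FreeLaw:
# print(perplexity(open("freelaw_sample.txt").read()))
```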
Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.