“GoodWiki”, 2023-09-09:
GoodWiki is a 179 million token dataset of English Wikipedia articles collected on September 4, 2023, that have been marked as “Good” or “Featured” by Wikipedia editors. The dataset provides these articles in GitHub-flavored Markdown format, preserving layout features like lists, code blocks, math, and block quotes, unlike many other public Wikipedia datasets. Articles are accompanied by a short description of the page as well as any associated categories.
Thanks to a careful conversion process from wikicode, the markup language used by Wikipedia, articles in GoodWiki are generally faithful reproductions of the corresponding original Wikipedia pages, minus references, files, infoboxes, and tables. Curated template transclusion and HTML tag handling have minimized instances where entire words and phrases are missing mid-sentence.
The hope is that this more comprehensive data will play a small role in improving open-source NLP efforts in language modeling, summarization, and instruction tuning.
GoodWiki is more than 1.5× larger (when compared using the same tokenizer) than the widely used WikiText-103 dataset by Merity et al 2016, even after excluding article descriptions. WikiText, which is likewise limited to articles marked as Good or Featured, inspired GoodWiki.
Composition: The dataset consists of 44,754 rows in a 482.7 MB snappy-compressed Parquet file.
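As a rough illustration of working with a file like this, the sketch below reads a Parquet file into a list of row dictionaries and filters rows by category. The file name `goodwiki.parquet` and the column name `categories` are assumptions made for the example, not details confirmed by the description above; reading the file also assumes pandas with a Parquet engine such as pyarrow is installed.

```python
def load_rows(path):
    """Read a Parquet file into a list of dicts, one per row.

    Requires pandas plus a Parquet engine (e.g. pyarrow); snappy
    compression is handled transparently by the engine.
    """
    import pandas as pd  # imported lazily so the helper below works without it

    return pd.read_parquet(path).to_dict(orient="records")


def filter_by_category(rows, category):
    """Keep rows whose 'categories' field contains `category`.

    'categories' is an assumed column name holding a list of strings.
    Rows where the field is missing or None are skipped.
    """
    kept = []
    for row in rows:
        cats = row.get("categories")
        if cats is not None and category in cats:
            kept.append(row)
    return kept


# Hypothetical usage (the path is a placeholder):
#   rows = load_rows("goodwiki.parquet")
#   physics = filter_by_category(rows, "Physics")
```

Keeping the filter separate from the loader means it can be tried on a handful of hand-written rows before touching the full file.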