“goodbooks-10k: Ten Thousand Books, Six Million Ratings” (https://fastml.com/goodbooks-10k), 2017-11-29:
This dataset contains six million ratings for the ten thousand most popular books (those with the most ratings). There are also:
books marked to read by the users
book metadata (author, year, etc.)
tags/shelves/genres
Access: Some of these files are quite large, so GitHub won’t show their contents online. See samples/ for smaller CSV snippets.
Open the notebook for a quick look at the data. Download individual zipped files from releases.
The dataset is accessible from Spotlight, recommender software based on PyTorch.
Contents:
ratings.csv contains ratings sorted by time. It is 69MB and looks like this:
user_id,book_id,rating
1,258,5
2,4081,4
2,260,5
2,9296,5
2,2318,3
Ratings go from one to five. Both book IDs and user IDs are contiguous. For books, they are 1–10000, for users, 1–53424.
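As a rough illustration of the ratings.csv layout described above, the following sketch reads the five sample rows with Python's standard csv module and checks them against the documented ranges. The function names load_ratings and check_ranges are hypothetical helpers, not part of the dataset or of any library; the column names and the limits (10000 books, 53424 users, ratings one to five) are taken from the description.

```python
import csv
import io

# Sample rows copied from the excerpt above; the real file holds about six million.
SAMPLE = """user_id,book_id,rating
1,258,5
2,4081,4
2,260,5
2,9296,5
2,2318,3
"""


def load_ratings(handle):
    """Read (user_id, book_id, rating) triples as integers from a CSV file object."""
    reader = csv.DictReader(handle)
    return [
        (int(row["user_id"]), int(row["book_id"]), int(row["rating"]))
        for row in reader
    ]


def check_ranges(ratings, n_books=10000, n_users=53424):
    """Return True if every ID and rating falls inside the documented range."""
    return all(
        1 <= user <= n_users and 1 <= book <= n_books and 1 <= score <= 5
        for user, book, score in ratings
    )


ratings = load_ratings(io.StringIO(SAMPLE))
print(len(ratings), check_ranges(ratings))
```

For the full file, the same functions should work with open("ratings.csv", newline="") in place of the io.StringIO wrapper, assuming the file has been downloaded to the working directory.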
to_read.csv provides IDs of the books marked “to read” by each user, as user_id, book_id pairs, sorted by time. There are close to a million pairs.
books.csv has metadata for each book (GoodReads IDs, authors, title, average rating, etc.). The metadata have been extracted from GoodReads XML files, available in books_xml.
Tags
book_tags.csv contains tags/shelves/genres assigned by users to books. Tags in this file are represented by their IDs. They are sorted by goodreads_book_id ascending and count descending.
In raw XML files, tags look like this:
<popular_shelves>
  <shelf name="science-fiction" count="833"/>
  <shelf name="fantasy" count="543"/>
  <shelf name="sci-fi" count="542"/>
  …
  <shelf name="for-fun" count="8"/>
  <shelf name="all-time-favorites" count="8"/>
  <shelf name="science-fiction-and-fantasy" count="7"/>
</popular_shelves>
Here, each tag/shelf is given an ID. tags.csv translates tag IDs to names.
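To make the shelf markup more concrete, here is a minimal sketch that extracts (name, count) pairs from a popular_shelves element using the standard xml.etree.ElementTree module. The snippet reuses the shelves named in the excerpt, with the elided middle left out so that the fragment is well-formed. The helper parse_shelves is a hypothetical name, and this is not necessarily how the dataset author produced book_tags.csv.

```python
import xml.etree.ElementTree as ET

# Shelves copied from the excerpt; the elided entries are simply omitted.
SNIPPET = """<popular_shelves>
  <shelf name="science-fiction" count="833"/>
  <shelf name="fantasy" count="543"/>
  <shelf name="sci-fi" count="542"/>
  <shelf name="for-fun" count="8"/>
  <shelf name="all-time-favorites" count="8"/>
  <shelf name="science-fiction-and-fantasy" count="7"/>
</popular_shelves>"""


def parse_shelves(xml_text):
    """Return a list of (shelf_name, count) pairs, highest count first."""
    root = ET.fromstring(xml_text)
    pairs = [(shelf.get("name"), int(shelf.get("count"))) for shelf in root.iter("shelf")]
    # sorted() is stable, so shelves with equal counts keep their document order.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


shelves = parse_shelves(SNIPPET)
for name, count in shelves:
    print(name, count)
```

Sorting by count descending mirrors the ordering that book_tags.csv is described as using within each book.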
goodreads IDs
Each book may have many editions. goodreads_book_id and best_book_id generally point to the most popular edition of a given book, while goodreads work_id refers to the book in the abstract sense.
You can use the goodreads book and work IDs to create URLs as follows:
https://www.goodreads.com/book/show/2767052
https://www.goodreads.com/work/editions/2792775
Note that book_id in ratings.csv and to_read.csv maps to work_id, not to goodreads_book_id, meaning that ratings for different editions are aggregated.
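The two URL patterns above can be captured in a pair of small helpers. This is only a sketch: book_url and work_url are hypothetical names, and the patterns are copied from the example URLs rather than from any official Goodreads documentation.

```python
def book_url(goodreads_book_id):
    """Build the page URL for a specific edition from its goodreads_book_id."""
    return f"https://www.goodreads.com/book/show/{goodreads_book_id}"


def work_url(work_id):
    """Build the editions-list URL for a work (the book in the abstract sense)."""
    return f"https://www.goodreads.com/work/editions/{work_id}"


print(book_url(2767052))
print(work_url(2792775))
```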