Do we really need another dataset of books?
My initial plan was to build a toy example for a recommender system article I was writing. After a bit of googling, I found a few datasets. Sadly, most of them had issues that made them unusable for me (e.g., missing book descriptions, a mix of languages with no per-row language column, or odd delimiters).
So I decided to make a dataset that would match my purposes.
First, I got ISBNs from Soumik's Goodreads-books dataset. Using those identifiers, I crawled the Google Books API to extract the books' information.
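For illustration, here is a minimal sketch of that lookup step, assuming the public Google Books volumes endpoint; the actual crawler (linked below) likely differs in details such as retries, batching, and rate limiting.

```python
import requests

GOOGLE_BOOKS_URL = "https://www.googleapis.com/books/v1/volumes"

def fetch_book_by_isbn(isbn: str) -> dict | None:
    """Look up a single ISBN and return the first match's volumeInfo, if any."""
    response = requests.get(GOOGLE_BOOKS_URL, params={"q": f"isbn:{isbn}"}, timeout=10)
    response.raise_for_status()
    payload = response.json()
    if payload.get("totalItems", 0) == 0:
        return None  # no match for this ISBN (see the note on dataset size below)
    return payload["items"][0].get("volumeInfo", {})

info = fetch_book_by_isbn("9780439785969")
if info:
    print(info.get("title"), "-", ", ".join(info.get("authors", [])))
```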
Then I merged those results with some of the original columns from the dataset and, after some cleaning, got the dataset you see here.
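A rough sketch of what that merge-and-clean step can look like with pandas; the column names and toy rows here are stand-ins, not the dataset's actual schema or cleaning rules.

```python
import pandas as pd

# Toy stand-ins for the crawled API results and the original Goodreads columns.
api_results = pd.DataFrame({
    "isbn13": ["9780439785969", "9780439358071"],
    "description": ["Sixth year at Hogwarts...", "Fifth year at Hogwarts..."],
    "categories": ["Fiction", "Fiction"],
})
original = pd.DataFrame({
    "isbn13": ["9780439785969", "9780439358071"],
    "average_rating": [4.6, 4.5],
    "ratings_count": [100, 200],
})

# Join on the shared identifier, then drop rows without a description
# and any duplicate ISBNs picked up during crawling.
books = api_results.merge(original, on="isbn13", how="inner")
books = books.dropna(subset=["description"]).drop_duplicates(subset="isbn13")
print(books.head())
```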
What can I do with this?
A few ideas:
- Exploratory data analysis of the catalogue.
- Clustering books by topic or category.
- A content-based recommendation engine using fields such as the book's description (see the sketch below).
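As a starting point for the last idea, here is a minimal content-based sketch using TF-IDF over descriptions and cosine similarity; the toy descriptions and the choice of field are illustrative, not part of the dataset pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = [
    "A young wizard discovers his magical heritage.",
    "A detective investigates a murder in Victorian London.",
    "A wizard and his friends fight a dark lord.",
]

# Vectorize descriptions and compute pairwise similarity between books.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(descriptions)
similarity = cosine_similarity(tfidf)

# For each book, the most similar other book is a naive recommendation.
for i, row in enumerate(similarity):
    row[i] = 0.0  # ignore self-similarity
    print(f"Book {i} -> recommend book {row.argmax()}")
```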
Why is this dataset smaller than Soumik's Goodreads-books?
Many of the ISBNs in that dataset did not return valid results from the Google Books API. I plan to update this in the future, using more fields (e.g., title, author) in the API requests so as to get a larger dataset.
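A sketch of what such a fallback request could look like, using the API's intitle:/inauthor: query qualifiers; the function name and the strategy of taking the first match are assumptions for illustration.

```python
import requests

GOOGLE_BOOKS_URL = "https://www.googleapis.com/books/v1/volumes"

def fetch_book_by_title_author(title: str, author: str) -> dict | None:
    """Fallback lookup when an ISBN query returns no results."""
    query = f'intitle:"{title}" inauthor:"{author}"'
    response = requests.get(GOOGLE_BOOKS_URL, params={"q": query}, timeout=10)
    response.raise_for_status()
    payload = response.json()
    if payload.get("totalItems", 0) == 0:
        return None
    return payload["items"][0].get("volumeInfo", {})
```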
What did you use to build this dataset?
Check out the repository here: Google Books Crawler
Acknowledgements
This dataset relied heavily on Soumik's Goodreads-books dataset.