Do we really need another dataset of books?
My initial plan was to build a toy example for a recommender system article I was writing. After a bit of googling, I found a few datasets. Sadly, most of them had issues that made them unusable for me (e.g., missing book descriptions, a mix of languages with no per-row language column, or odd delimiters).
So I decided to make a dataset that would match my purposes.
First, I got ISBNs from Soumik's Goodreads-books dataset. Using those identifiers, I crawled the Google Books API to extract the books' information.
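For illustration, here is a minimal sketch of that lookup step, assuming the public Google Books volumes endpoint; the actual crawler (linked below) likely differs in details such as retries, batching, and rate limiting.

```python
import requests

GOOGLE_BOOKS_URL = "https://www.googleapis.com/books/v1/volumes"

def fetch_book_by_isbn(isbn: str) -> dict | None:
    """Look up a single ISBN and return the first match's volumeInfo, if any."""
    response = requests.get(GOOGLE_BOOKS_URL, params={"q": f"isbn:{isbn}"}, timeout=10)
    response.raise_for_status()
    payload = response.json()
    if payload.get("totalItems", 0) == 0:
        return None  # no match for this ISBN (see the note on dataset size below)
    return payload["items"][0].get("volumeInfo", {})

info = fetch_book_by_isbn("9780439785969")
if info:
    print(info.get("title"), "-", ", ".join(info.get("authors", [])))
```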
Then I merged those results with some of the original columns from the dataset and, after some cleaning, got the dataset you see here.
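A rough sketch of what that merge-and-clean step can look like with pandas; the column names and toy rows here are stand-ins, not the dataset's actual schema or cleaning rules.

```python
import pandas as pd

# Toy stand-ins for the crawled API results and the original Goodreads columns.
api_results = pd.DataFrame({
    "isbn13": ["9780439785969", "9780439358071"],
    "description": ["Sixth year at Hogwarts...", "Fifth year at Hogwarts..."],
    "categories": ["Fiction", "Fiction"],
})
original = pd.DataFrame({
    "isbn13": ["9780439785969", "9780439358071"],
    "average_rating": [4.6, 4.5],
    "ratings_count": [100, 200],
})

# Join on the shared identifier, then drop rows without a description
# and any duplicate ISBNs picked up during crawling.
books = api_results.merge(original, on="isbn13", how="inner")
books = books.dropna(subset=["description"]).drop_duplicates(subset="isbn13")
print(books.head())
```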
What can I do with this?
A few ideas:
- Exploratory data analysis of the catalogue.
- Clustering books by topic or category.
- A content-based recommendation engine using fields such as the book's description (see the sketch below).
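As a starting point for the last idea, here is a minimal content-based sketch using TF-IDF over descriptions and cosine similarity; the toy descriptions and the choice of field are illustrative, not part of the dataset pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = [
    "A young wizard discovers his magical heritage.",
    "A detective investigates a murder in Victorian London.",
    "A wizard and his friends fight a dark lord.",
]

# Vectorize descriptions and compute pairwise similarity between books.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(descriptions)
similarity = cosine_similarity(tfidf)

# For each book, the most similar other book is a naive recommendation.
for i, row in enumerate(similarity):
    row[i] = 0.0  # ignore self-similarity
    print(f"Book {i} -> recommend book {row.argmax()}")
```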
Why is this dataset smaller than Soumik's Goodreads-books?
Many of the ISBNs in that dataset did not return valid results from the Google Books API. I plan to update this in the future, using more fields (e.g., title, author) in the API requests so as to get a larger dataset.
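A sketch of what such a fallback request could look like, using the API's intitle:/inauthor: query qualifiers; the function name and the strategy of taking the first match are assumptions for illustration.

```python
import requests

GOOGLE_BOOKS_URL = "https://www.googleapis.com/books/v1/volumes"

def fetch_book_by_title_author(title: str, author: str) -> dict | None:
    """Fallback lookup when an ISBN query returns no results."""
    query = f'intitle:"{title}" inauthor:"{author}"'
    response = requests.get(GOOGLE_BOOKS_URL, params={"q": query}, timeout=10)
    response.raise_for_status()
    payload = response.json()
    if payload.get("totalItems", 0) == 0:
        return None
    return payload["items"][0].get("volumeInfo", {})
```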
What did you use to build this dataset?
Check out the repository here: Google Books Crawler
Acknowledgements
This dataset relied heavily on Soumik's Goodreads-books dataset.