7k-Books

Status: active; Format: ARFF; License: CC0: Public Domain; Visibility: public; Uploaded 23-03-2022 by Onur Yildirim


Do we really need another dataset of books?
My initial plan was to build a toy example for a recommender system article I was writing. After a bit of googling, I found a few datasets. Sadly, most of them had issues that made them unusable for me (e.g., missing book descriptions, a mix of languages with no column specifying the language per row, or weird delimiters). So I decided to make a dataset that would match my purposes.

First, I got ISBNs from Soumik's Goodreads-books dataset. Using those identifiers, I crawled the Google Books API to extract the books' information. Then I merged those results with some of the original columns from that dataset, and after some cleaning I got the dataset you see here.

What can I do with this?
Exploratory data analysis, clustering of books by topic or category, or a content-based recommendation engine using different fields from the book's description.

Why is this dataset smaller than Soumik's Goodreads-books?
Many of the ISBNs in that dataset did not return valid results from the Google Books API. I plan to update this in the future by using more fields (e.g., title, author) in the API requests, to get a bigger dataset.

What did you use to build this dataset?
Check out the repository here: Google Books Crawler.

Acknowledgements
This dataset relied heavily on Soumik's Goodreads-books dataset.
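The crawl step described above (look up each ISBN in the Google Books API, then flatten the result into a row) might look roughly like the sketch below. It is not the author's crawler: the helper names, the `;` separator for multiple authors and categories, and the mapping from `volumeInfo` fields to this dataset's columns are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Public Google Books volumes endpoint; an ISBN search uses the "isbn:" prefix.
API_URL = "https://www.googleapis.com/books/v1/volumes"


def build_query_url(isbn: str) -> str:
    """Return the URL that searches the volumes endpoint for one ISBN."""
    return API_URL + "?" + urllib.parse.urlencode({"q": f"isbn:{isbn}"})


def parse_volume(payload: dict) -> Optional[dict]:
    """Flatten the first search hit into a row, or None when nothing matched."""
    items = payload.get("items") or []
    if not items:
        return None  # the ISBN returned no valid result
    info = items[0].get("volumeInfo", {})
    date = info.get("publishedDate") or ""
    return {
        # Column names follow this dataset; the mapping itself is an assumption.
        "title": info.get("title"),
        "subtitle": info.get("subtitle"),
        "authors": ";".join(info.get("authors", [])) or None,
        "categories": ";".join(info.get("categories", [])) or None,
        "thumbnail": info.get("imageLinks", {}).get("thumbnail"),
        "description": info.get("description"),
        "published_year": int(date[:4]) if date[:4].isdigit() else None,
        "num_pages": info.get("pageCount"),
    }


def fetch_book(isbn: str) -> Optional[dict]:
    """Query the API for one ISBN (network call) and return a flattened row."""
    with urllib.request.urlopen(build_query_url(isbn), timeout=10) as resp:
        return parse_volume(json.load(resp))
```

ISBNs for which `fetch_book` returns `None` would simply be dropped, which is consistent with the dataset ending up smaller than its source. The rating columns are not part of this sketch, since the description says some columns were merged in from the original Goodreads-books dataset.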

12 features

isbn13 (numeric): 6810 unique values, 0 missing
isbn10 (string): 6810 unique values, 0 missing
title (string): 6394 unique values, 4 missing
subtitle (string): 2009 unique values, 4429 missing
authors (string): 3775 unique values, 77 missing
categories (string): 567 unique values, 99 missing
thumbnail (string): 6481 unique values, 329 missing
description (string): 6473 unique values, 263 missing
published_year (numeric): 94 unique values, 6 missing
average_rating (numeric): 200 unique values, 43 missing
num_pages (numeric): 915 unique values, 43 missing
ratings_count (numeric): 3881 unique values, 43 missing
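Since `description` is present for most rows, it is a natural input for the content-based recommendation use case mentioned in the description. The following is a minimal sketch using only the standard library: it builds TF-IDF vectors from descriptions and ranks other books by cosine similarity. The function names, the tokenizer, the smoothed IDF formula, and the toy rows are all assumptions for illustration, not part of the dataset or of the author's article.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one L2-normalised TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF so that no weight is zero or undefined.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Dot product of two normalised sparse vectors (their cosine similarity)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def recommend(books, query_title, top_k=3):
    """Rank other books by description similarity to `query_title`.

    `books` is a list of (title, description) pairs; rows with a missing
    title or description are skipped, since they cannot be vectorised.
    """
    usable = [(t, d) for t, d in books if t and d]
    titles = [t for t, _ in usable]
    vectors = tfidf_vectors([d for _, d in usable])
    q = titles.index(query_title)  # raises ValueError for an unknown title
    scored = [(cosine(vectors[q], v), t)
              for i, (t, v) in enumerate(zip(titles, vectors)) if i != q]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:top_k]]


# Made-up rows for demonstration only; they are not taken from the dataset.
toy_books = [
    ("A", "a wizard school of magic and spells"),
    ("B", "a young wizard learns magic spells"),
    ("C", "a detective solves a murder in london"),
    ("D", None),  # stands in for one of the rows with a missing description
]
print(recommend(toy_books, "A", top_k=1))
```

On real data the same idea could be extended to other text fields such as `categories` or `authors`; a library vectoriser would be a reasonable replacement for the hand-rolled TF-IDF at this dataset's size.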

19 properties

Number of instances (rows) of the dataset: 6810
Number of attributes (columns) of the dataset: 12
Number of distinct values of the target attribute (if it is nominal): not reported
Number of missing values in the dataset: 5336
Number of instances with at least one value missing: 4630
Number of numeric attributes: 5
Number of nominal attributes: 0
Number of attributes divided by the number of instances: 0
Percentage of numeric attributes: 41.67
Percentage of instances belonging to the most frequent class: not reported
Percentage of nominal attributes: 0
Number of instances belonging to the most frequent class: not reported
Percentage of instances belonging to the least frequent class: not reported
Number of instances belonging to the least frequent class: not reported
Number of binary attributes: 0
Percentage of binary attributes: 0
Percentage of instances having missing values: 67.99
Average class difference between consecutive instances: not reported
Percentage of missing values: 6.53
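Several of the properties above follow directly from the per-feature counts, so they can be cross-checked with a few lines of arithmetic. The sketch below hard-codes the numbers listed on this page; the variable names are illustrative, and the number of rows with at least one missing value (4630) is taken as given because it cannot be derived from per-column counts alone.

```python
# Missing-value counts per feature, copied from the feature list on this page.
missing_per_feature = {
    "isbn13": 0, "isbn10": 0, "title": 4, "subtitle": 4429,
    "authors": 77, "categories": 99, "thumbnail": 329, "description": 263,
    "published_year": 6, "average_rating": 43, "num_pages": 43,
    "ratings_count": 43,
}

n_rows = 6810                      # number of instances
n_cols = len(missing_per_feature)  # number of attributes
n_numeric = 5                      # isbn13, published_year, average_rating, num_pages, ratings_count
rows_with_missing = 4630           # reported value, not derivable from column counts

total_missing = sum(missing_per_feature.values())
pct_missing = 100 * total_missing / (n_rows * n_cols)
pct_rows_with_missing = 100 * rows_with_missing / n_rows
pct_numeric = 100 * n_numeric / n_cols
attrs_per_instance = n_cols / n_rows

print(total_missing)                    # 5336
print(round(pct_missing, 2))            # 6.53
print(round(pct_rows_with_missing, 2))  # 67.99
print(round(pct_numeric, 2))            # 41.67
print(round(attrs_per_instance, 2))     # 0.0
```

The attributes-to-instances ratio is about 0.0018, which is why it appears as 0 at the two-decimal precision used for the other properties.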

0 tasks
