515K-Hotel-Reviews-Data-in-Europe

Status: active | Format: ARFF | License: CC0: Public Domain | Visibility: public | Uploaded 24-03-2022 by Dustin Carrion


Acknowledgements
The data was scraped from Booking.com. All data in the file is already publicly available. The data is originally owned by Booking.com. Please contact me through my profile if you want to use this dataset somewhere else.

Data Context
This dataset contains 515,000 customer reviews and scores for 1493 luxury hotels across Europe. The geographical location of each hotel is also provided for further analysis.

Data Content
The csv file contains 17 fields. Each field is described below:
Hotel_Address: Address of the hotel.
Review_Date: Date when the reviewer posted the corresponding review.
Average_Score: Average score of the hotel, calculated based on the latest comment in the last year.
Hotel_Name: Name of the hotel.
Reviewer_Nationality: Nationality of the reviewer.
Negative_Review: Negative review the reviewer gave to the hotel. If the reviewer gave no negative review, the value is 'No Negative'.
Review_Total_Negative_Word_Counts: Total number of words in the negative review.
Positive_Review: Positive review the reviewer gave to the hotel. If the reviewer gave no positive review, the value is 'No Positive'.
Review_Total_Positive_Word_Counts: Total number of words in the positive review.
Reviewer_Score: Score the reviewer gave to the hotel, based on his/her experience.
Total_Number_of_Reviews_Reviewer_Has_Given: Number of reviews the reviewer has given in the past.
Total_Number_of_Reviews: Total number of valid reviews the hotel has.
Tags: Tags the reviewer gave the hotel.
days_since_review: Duration between the review date and the scrape date.
Additional_Number_of_Scoring: Some guests only gave a score without writing a review. This number indicates how many valid scores without a review the hotel has.
lat: Latitude of the hotel.
lng: Longitude of the hotel.

To keep the text data clean, I removed unicode and punctuation from the text and transformed it to lower case. No other preprocessing was performed.

Inspiration
The dataset is large and informative, and I believe you can have a lot of fun with it! Let me put some ideas below to further inspire Kagglers!
Fit a regression model on reviews and scores to see which words are more indicative of a higher/lower score.
Perform a sentiment analysis on the reviews.
Find correlations between the reviewer's nationality and scores.
Create beautiful and informative visualizations of the dataset.
Cluster hotels based on reviews.
Build a simple recommendation engine for guests who are fond of a particular hotel characteristic.
The ideas are unlimited! Please have a look at the data, generate some ideas and leave a master kernel here! I am ready to upvote your ideas and kernels! Cheers!
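
As a rough illustration of how the placeholder values and one of the ideas above (scores by reviewer nationality) might be handled, here is a minimal sketch using only the Python standard library. The sample rows are hypothetical and not taken from the dataset, and the helper names `clean_review` and `mean_score_by_nationality` are made up for this example; only the column names and the 'No Negative' / 'No Positive' placeholders come from the description.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows that mimic four of the 17 columns; not real data.
SAMPLE_CSV = """Reviewer_Nationality,Negative_Review,Positive_Review,Reviewer_Score
United Kingdom,No Negative,great location and friendly staff,9.6
United Kingdom,room was small,No Positive,5.4
Netherlands,No Negative,very clean,8.8
"""

# Placeholder strings named in the field descriptions, compared case-insensitively.
PLACEHOLDERS = {"no negative", "no positive"}


def clean_review(text):
    """Return '' for a placeholder value, otherwise the stripped review text."""
    stripped = text.strip()
    return "" if stripped.lower() in PLACEHOLDERS else stripped


def mean_score_by_nationality(rows):
    """Average Reviewer_Score for each Reviewer_Nationality."""
    totals = defaultdict(lambda: [0.0, 0])  # nationality -> [sum, count]
    for row in rows:
        entry = totals[row["Reviewer_Nationality"].strip()]
        entry[0] += float(row["Reviewer_Score"])
        entry[1] += 1
    return {nat: total / count for nat, (total, count) in totals.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for row in rows:
    row["Negative_Review"] = clean_review(row["Negative_Review"])
    row["Positive_Review"] = clean_review(row["Positive_Review"])

means = mean_score_by_nationality(rows)
for nationality, score in sorted(means.items()):
    print(f"{nationality}: {score:.2f}")
```

Blanking the placeholders first keeps them from being counted as real review text in any later word-level analysis.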

17 features

Hotel_Address (string): 1493 unique values, 0 missing
Additional_Number_of_Scoring (numeric): 480 unique values, 0 missing
Review_Date (string): 731 unique values, 0 missing
Average_Score (numeric): 34 unique values, 0 missing
Hotel_Name (string): 1492 unique values, 0 missing
Reviewer_Nationality (string): 227 unique values, 0 missing
Negative_Review (string): 330011 unique values, 0 missing
Review_Total_Negative_Word_Counts (numeric): 402 unique values, 0 missing
Total_Number_of_Reviews (numeric): 1142 unique values, 0 missing
Positive_Review (string): 412601 unique values, 0 missing
Review_Total_Positive_Word_Counts (numeric): 365 unique values, 0 missing
Total_Number_of_Reviews_Reviewer_Has_Given (numeric): 198 unique values, 0 missing
Reviewer_Score (numeric): 37 unique values, 0 missing
Tags (string): 55242 unique values, 0 missing
days_since_review (string): 731 unique values, 0 missing
lat (numeric): 1472 unique values, 3268 missing
lng (numeric): 1472 unique values, 3268 missing
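
Two details in the feature list usually need attention before analysis: days_since_review is stored as a string, and lat and lng each have 3268 missing values. The sketch below shows one possible way to deal with both. The exact text format of days_since_review is not documented here, so the "3 days" style input is an assumption, as are the missing-value markers beyond "?" (the ARFF convention). The helper names and sample records are hypothetical.

```python
import re


def parse_days(value):
    """Extract the first integer from a string such as '3 days' (assumed format)."""
    match = re.search(r"\d+", value)
    return int(match.group()) if match else None


def parse_coordinate(value):
    """Convert a coordinate field to float; treat '', '?' or 'NA' as missing."""
    text = value.strip()
    if text in {"", "?", "NA"}:  # '?' is the ARFF marker; the others are assumptions
        return None
    return float(text)


# Hypothetical records, not taken from the dataset.
records = [
    {"days_since_review": "3 days", "lat": "52.3605759", "lng": "4.9159683"},
    {"days_since_review": "120 days", "lat": "?", "lng": "?"},
]

parsed = [
    {
        "days_since_review": parse_days(r["days_since_review"]),
        "lat": parse_coordinate(r["lat"]),
        "lng": parse_coordinate(r["lng"]),
    }
    for r in records
]

# Keep only rows that can be placed on a map.
with_coords = [r for r in parsed if r["lat"] is not None and r["lng"] is not None]
print(len(with_coords), "of", len(parsed), "records have coordinates")
```

Whether to drop rows without coordinates or keep them for the non-geographic analyses depends on the task, so the filter is applied to a separate list rather than in place.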

19 properties

Number of instances (rows) of the dataset: 515738
Number of attributes (columns) of the dataset: 17
Number of distinct values of the target attribute (if it is nominal): not reported
Number of missing values in the dataset: 6536
Number of instances with at least one value missing: 3268
Number of numeric attributes: 9
Number of nominal attributes: 0
Number of attributes divided by the number of instances: 0
Percentage of numeric attributes: 52.94
Percentage of instances belonging to the most frequent class: not reported
Percentage of nominal attributes: 0
Number of instances belonging to the most frequent class: not reported
Percentage of instances belonging to the least frequent class: not reported
Number of instances belonging to the least frequent class: not reported
Number of binary attributes: 0
Percentage of binary attributes: 0
Percentage of instances having missing values: 0.63
Average class difference between consecutive instances: not reported
Percentage of missing values: 0.07
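
The percentages and the ratio above can be reproduced from the listed counts. The short check below is only a sanity calculation; the variable names are chosen for this example, and the rounding to two decimals is an assumption about how the page displays the values.

```python
# Counts copied from the properties above.
n_rows = 515738
n_cols = 17
n_numeric = 9
n_missing_values = 6536
n_rows_with_missing = 3268

# Percentage of numeric attributes: 9 of 17 columns.
pct_numeric = round(100 * n_numeric / n_cols, 2)

# Percentage of instances having missing values.
pct_rows_missing = round(100 * n_rows_with_missing / n_rows, 2)

# Percentage of missing values over all cells (rows x columns).
pct_values_missing = round(100 * n_missing_values / (n_rows * n_cols), 2)

# Attributes divided by instances is tiny, so it displays as 0 after rounding.
attrs_per_instance = round(n_cols / n_rows, 2)

print(pct_numeric, pct_rows_missing, pct_values_missing, attrs_per_instance)
```

The total of 6536 missing values is exactly twice the 3268 rows with a missing value, which is consistent with lat and lng being missing together in the same rows.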

0 tasks
