OpenML


1-million-Reddit-comments-from-40-subreddits

Status: active. Format: ARFF. License: CC0: Public Domain. Visibility: public. Uploaded 23-03-2022 by Onur Yildirim.


Content

This data is an extract from a bigger Reddit dataset (all Reddit comments from May 2019, 157 GB of data uncompressed) that contains both more comments and more associated information (timestamps, author, flairs, etc.).

For ease of use, I picked the first 25,000 comments for each of the 40 most frequented subreddits (May 2019), so that the volumes are balanced if anyone wants to use the subreddit as categorical data. I also excluded removed comments, comments whose author was deleted, and comments deemed too short (fewer than 4 tokens), and changed the format (JSON to CSV).

This is primarily an NLP dataset, but in addition to the comments I added the 3 features I deemed the most important; I also aimed for feature type variety.

The information kept here is:
- subreddit (categorical): the subreddit on which the comment was posted
- body (str): comment content
- controversiality (binary): a Reddit aggregated metric
- score (scalar): upvotes minus downvotes

Acknowledgements

The data is but a small extract of what is being collected by pushshift.io on a monthly basis. You can easily find the full information there if you want to work with more features and more data.

What can I do with that?

Have fun! The variety of feature types should allow you to gain a few interesting insights or build some simple models.

Note

If you think the license (CC0: Public Domain) should be different, contact me.
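The extraction steps described under Content (drop removed comments, deleted authors, and short comments, keep the first 25,000 comments per subreddit, convert JSON to CSV) could be sketched roughly as follows. This is not the uploader's actual script. It assumes newline-delimited JSON input with the fields `subreddit`, `body`, `author`, `controversiality`, and `score`, that removed or deleted content is marked with the literal strings `[removed]` and `[deleted]`, that "tokens" means whitespace-separated words, and that the per-subreddit cap is applied after filtering. The function names `keep`, `extract`, and `to_csv` are hypothetical.

```python
import csv
import io
import json
from collections import Counter

MIN_TOKENS = 4          # comments with fewer tokens are dropped
PER_SUBREDDIT = 25_000  # cap stated in the description
FIELDS = ["subreddit", "body", "controversiality", "score"]


def keep(comment):
    """Return True if a comment passes the filters (assumed markers)."""
    body = comment.get("body", "")
    if body in ("[removed]", "[deleted]"):
        return False
    if comment.get("author") == "[deleted]":
        return False
    # Assumption: a token is a whitespace-separated word.
    return len(body.split()) >= MIN_TOKENS


def extract(lines, subreddits, per_subreddit=PER_SUBREDDIT):
    """Yield the first `per_subreddit` kept comments for each chosen subreddit."""
    counts = Counter()
    for line in lines:
        comment = json.loads(line)
        sub = comment.get("subreddit")
        if sub not in subreddits or counts[sub] >= per_subreddit:
            continue
        if not keep(comment):
            continue
        counts[sub] += 1
        # Keep only the four features listed in the description.
        yield {name: comment[name] for name in FIELDS}


def to_csv(rows):
    """Serialize the kept rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

In use, `lines` would be an open file handle over the monthly dump and `subreddits` the set of the 40 chosen subreddit names, which the description does not list.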

4 features

Feature	Type	Unique values	Missing
subreddit	string	40	0
body	string	963903	1
controversiality	numeric	2	0
score	numeric	2110	0
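
The per-feature counts in the table above (unique values and missing values) can be recomputed from a CSV export with a few lines of standard-library Python. The sketch below assumes that a missing value appears as an empty field and that the file has a header row naming the four features. The function name `summarize` and the toy data are illustrative only and are not taken from the dataset.

```python
import csv
import io


def summarize(csv_text):
    """Return {feature: (unique_count, missing_count)} for CSV text with a header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    names = reader.fieldnames
    unique = {name: set() for name in names}
    missing = {name: 0 for name in names}
    for row in reader:
        for name in names:
            value = row[name]
            # Assumption: an empty (or absent) field counts as missing.
            if value is None or value == "":
                missing[name] += 1
            else:
                unique[name].add(value)
    return {name: (len(unique[name]), missing[name]) for name in names}


# Toy rows for illustration; these are not real comments from the dataset.
TOY = (
    "subreddit,body,controversiality,score\n"
    "funny,this is a comment,0,12\n"
    "funny,,1,-3\n"
    "news,another comment right here,0,12\n"
)

for feature, (n_unique, n_missing) in summarize(TOY).items():
    print(feature, n_unique, n_missing)
```

On the full dataset, the same check should report 40 unique subreddits and a single missing `body`, matching the table.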