Content
This data is an extract from a bigger Reddit dataset (all Reddit comments from May 2019, 157 GB of data uncompressed) that contains both more comments and more associated information (timestamps, author, flairs, etc.).
For ease of use, I picked the first 25,000 comments for each of the 40 most frequented subreddits (May 2019). This way, if anyone wants to use the subreddit as categorical data, the volumes are balanced.
I also excluded removed comments, comments whose author was deleted, and comments deemed too short (fewer than 4 tokens), and converted the format from JSON to CSV.
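The selection steps above can be sketched roughly as follows. This is not the script actually used to build the dataset, only a minimal illustration under stated assumptions: the input is one JSON object per line, removed or deleted content is marked with the strings "[removed]" and "[deleted]", and a token is a whitespace-separated chunk. The names keep and extract are hypothetical, and the set of subreddits to keep is passed in rather than computed.

```python
import csv
import io
import json
from collections import Counter

MIN_TOKENS = 4          # comments with fewer tokens are dropped
PER_SUBREDDIT = 25_000  # cap per subreddit
FIELDS = ["subreddit", "body", "controversiality", "score"]


def keep(comment):
    """Return True if the comment passes the filters described above."""
    body = comment.get("body", "")
    if body in ("[removed]", "[deleted]"):    # assumed marker strings
        return False
    if comment.get("author") == "[deleted]":  # assumed marker string
        return False
    # Assumption: a "token" is a whitespace-separated chunk of text.
    return len(body.split()) >= MIN_TOKENS


def extract(json_lines, subreddits, out, limit=PER_SUBREDDIT):
    """Write the first `limit` kept comments of each subreddit to `out` as CSV."""
    counts = Counter()
    # extrasaction="ignore" drops every field that is not listed in FIELDS.
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for line in json_lines:
        comment = json.loads(line)
        sub = comment.get("subreddit")
        if sub not in subreddits or counts[sub] >= limit or not keep(comment):
            continue
        writer.writerow(comment)
        counts[sub] += 1
    return counts
```

Because the input is read line by line, a sketch like this never needs to hold the full uncompressed dump in memory.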
This is primarily an NLP dataset, but in addition to the comments I added the 3 features I deemed most important, while also aiming for a variety of feature types.
The information kept here is:
subreddit (categorical): on which subreddit the comment was posted
body (str): comment content
controversiality (binary): a metric aggregated by Reddit
score (scalar): upvotes minus downvotes
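As a rough illustration of how the four columns could be read with their types, here is a minimal sketch using only the standard library. The column names are taken from the list above; the helper name read_comments and the sample rows are made up for the example, and the assumption is that controversiality and score are stored as integers.

```python
import csv
import io


def read_comments(handle):
    """Yield one dict per comment, with the numeric columns converted to int."""
    for row in csv.DictReader(handle):
        yield {
            "subreddit": row["subreddit"],                      # categorical
            "body": row["body"],                                # str
            "controversiality": int(row["controversiality"]),   # 0 or 1
            "score": int(row["score"]),                         # can be negative
        }


# Tiny made-up sample standing in for the real file.
sample = io.StringIO(
    "subreddit,body,controversiality,score\n"
    'AskReddit,"Well, that depends on the day",0,12\n'
    'gaming,"I did not expect that ending",1,-3\n'
)
comments = list(read_comments(sample))
```

With the real file, the same function should work on a handle opened with newline="" and encoding="utf-8", so that commas and line breaks inside quoted comment bodies are parsed correctly.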
Acknowledgements
The data is but a small extract of what pushshift.io collects on a monthly basis. You can easily find the full data there if you want to work with more features and more comments.
What can I do with that?
Have fun! The variety of feature types should allow you to gain a few interesting insights or build some simple models.
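As one possible starting point, here is a small sketch that combines the categorical, binary and scalar features into a per-subreddit summary. The function name summarize and the sample rows are made up for illustration; the rows are assumed to be dicts whose score and controversiality values are already numbers.

```python
from collections import defaultdict


def summarize(comments):
    """Return {subreddit: (n_comments, mean_score, controversial_share)}."""
    # Per subreddit: [comment count, sum of scores, controversial count]
    totals = defaultdict(lambda: [0, 0, 0])
    for c in comments:
        t = totals[c["subreddit"]]
        t[0] += 1
        t[1] += c["score"]
        t[2] += c["controversiality"]
    return {sub: (n, s / n, k / n) for sub, (n, s, k) in totals.items()}


# Made-up rows for illustration only.
rows = [
    {"subreddit": "a", "score": 10, "controversiality": 0},
    {"subreddit": "a", "score": -2, "controversiality": 1},
    {"subreddit": "b", "score": 5, "controversiality": 0},
]
summary = summarize(rows)
```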
Note
If you think the license (CC0: Public Domain) should be different, contact me.