1-million-Reddit-comments-from-40-subreddits

Status: active
Format: ARFF
License: CC0: Public Domain
Visibility: public
Uploaded 23-03-2022 by Onur Yildirim
  • Computer Systems
  • Machine Learning
Content

This data is an extract from a bigger Reddit dataset (all Reddit comments from May 2019, 157 GB of data uncompressed) that contains both more comments and more associated information (timestamps, author, flairs, etc.).

For ease of use, I picked the first 25 000 comments for each of the 40 most frequented subreddits (May 2019), so that the volumes are balanced if anyone wants to use the subreddit as categorical data. I also excluded any removed comments, comments whose author got deleted, and comments deemed too short (fewer than 4 tokens), and changed the format (JSON to CSV).

This is primarily an NLP dataset, but in addition to the comments I added the 3 features I deemed the most important; I also aimed for feature type variety.

The information kept here is:
  • subreddit (categorical): the subreddit on which the comment was posted
  • body (str): comment content
  • controversiality (binary): a Reddit aggregated metric
  • score (scalar): upvotes minus downvotes

Acknowledgements

The data is but a small extract of what is being collected by pushshift.io on a monthly basis. You can easily find the full information if you want to work with more features and more data.

What can I do with that?

Have fun! The variety of feature types should allow you to gain a few interesting insights or build some simple models.

Note

If you think the license (CC0: Public Domain) should be different, contact me.
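The extraction steps described under Content can be sketched roughly as follows. This is only an illustration, not the uploader's actual script: it assumes the source dump is newline-delimited JSON with `subreddit`, `body`, `author`, `controversiality` and `score` fields, that removed comments and deleted authors are marked with `[removed]` / `[deleted]`, and that "tokens" means whitespace-separated words. The function names (`keep`, `extract`, `to_csv`) are hypothetical, and the set of target subreddits is passed in rather than computed.

```python
import csv
import io
import json
from collections import Counter

# The four columns kept in this dataset.
FIELDS = ["subreddit", "body", "controversiality", "score"]


def keep(comment, min_tokens=4):
    """Return True if a comment passes the filters described in the text.

    Assumption: removed bodies and deleted authors carry the literal
    markers "[removed]" / "[deleted]"; tokens are whitespace-separated.
    """
    body = comment.get("body", "")
    if body in ("[removed]", "[deleted]"):
        return False
    if comment.get("author") == "[deleted]":
        return False
    return len(body.split()) >= min_tokens


def extract(lines, subreddits, per_subreddit=25_000):
    """Yield the first `per_subreddit` kept comments of each target subreddit."""
    taken = Counter()
    for line in lines:
        comment = json.loads(line)
        sub = comment.get("subreddit")
        if sub not in subreddits or taken[sub] >= per_subreddit:
            continue
        if keep(comment):
            taken[sub] += 1
            yield {name: comment[name] for name in FIELDS}


def to_csv(rows):
    """Serialize the kept rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

The sketch filters before counting toward the per-subreddit cap, which is one reading of the description; it is the reading consistent with 40 × 25 000 = 1 000 000 rows.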

4 features

Feature            Type      Unique values   Missing
subreddit          string    40              0
body               string    963903          1
controversiality   numeric   2               0
score              numeric   2110            0
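As a rough illustration of how the four features might be read into typed records, here is a minimal sketch. It assumes the data has been exported to CSV with a header row naming the four columns; the `Comment` class and `read_comments` function are hypothetical names, not part of any published loader. The single missing `body` value is mapped to `None`.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, Optional, TextIO


@dataclass
class Comment:
    subreddit: str
    body: Optional[str]     # one row in the dataset has no body
    controversiality: int   # numeric column with two distinct values
    score: int              # upvotes minus downvotes


def read_comments(handle: TextIO) -> Iterator[Comment]:
    """Parse CSV rows into Comment records.

    Numeric fields go through float() first so that exports writing
    "1.0" instead of "1" are also accepted (an assumption about the export).
    """
    for row in csv.DictReader(handle):
        yield Comment(
            subreddit=row["subreddit"],
            body=row["body"] or None,
            controversiality=int(float(row["controversiality"])),
            score=int(float(row["score"])),
        )


# Usage with a tiny in-memory sample (made-up rows, not taken from the dataset).
sample = (
    "subreddit,body,controversiality,score\n"
    "funny,hello there my friend,0,12\n"
    "news,,1,-3.0\n"
)
comments = list(read_comments(io.StringIO(sample)))
```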

19 properties

Number of instances (rows) of the dataset: 1000000
Number of attributes (columns) of the dataset: 4
Number of distinct values of the target attribute (if it is nominal): (no value listed)
Number of missing values in the dataset: 1
Number of instances with at least one value missing: 1
Number of numeric attributes: 2
Number of nominal attributes: 0
Number of attributes divided by the number of instances: 0
Percentage of numeric attributes: 50
Percentage of instances belonging to the most frequent class: (no value listed)
Percentage of nominal attributes: 0
Number of instances belonging to the most frequent class: (no value listed)
Percentage of instances belonging to the least frequent class: (no value listed)
Number of instances belonging to the least frequent class: (no value listed)
Number of binary attributes: 0
Percentage of binary attributes: 0
Percentage of instances having missing values: 0
Average class difference between consecutive instances: (no value listed)
Percentage of missing values: 0
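Several of the ratio and percentage properties follow directly from the counts. The sketch below recomputes three of them from the row count, the column types, and the per-column missing counts given on this page. The function name `summarize` and the dictionary keys are hypothetical, and the idea that the zeros shown are rounded displays of very small values is an assumption.

```python
def summarize(n_rows, column_types, missing_per_column):
    """Derive a few ratio/percentage properties from basic counts."""
    n_cols = len(column_types)
    n_numeric = sum(1 for kind in column_types.values() if kind == "numeric")
    n_missing = sum(missing_per_column.values())
    return {
        "percentage_numeric": 100 * n_numeric / n_cols,
        "attributes_per_instance": n_cols / n_rows,
        "percentage_missing_values": 100 * n_missing / (n_rows * n_cols),
    }


# Counts taken from the feature list on this page.
props = summarize(
    n_rows=1_000_000,
    column_types={
        "subreddit": "string",
        "body": "string",
        "controversiality": "numeric",
        "score": "numeric",
    },
    missing_per_column={"subreddit": 0, "body": 1, "controversiality": 0, "score": 0},
)
```

With these inputs, two of four attributes are numeric (50 percent), while the attribute-to-instance ratio and the missing-value percentage are tiny positive numbers that round to 0.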

0 tasks