OpenML


OnlineNewsPopularity

Status: active | Format: ARFF | Publicly available | Visibility: public | Uploaded 17-02-2016 by Hilda Fabiola Bernard

Author: Kelwin Fernandes (INESC TEC, Universidade do Porto), Pedro Vinagre (ALGORITMI Research Centre, Universidade do Minho), Paulo Cortez (ALGORITMI Research Centre, Universidade do Minho), Pedro Sernadela (Universidade de Aveiro)
Source: UCI
Please cite: K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.

This dataset summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict the number of shares in social networks (popularity).

* The articles were published by Mashable (www.mashable.com), which holds the rights to reproduce their content. Hence, this dataset does not share the original content, only some statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs.
* Acquisition date: January 8, 2015
* The estimated relative performance values were obtained by the authors using a Random Forest classifier and a rolling window as the assessment method. See their article for more details on how the relative performance values were set.

Attribute Information:

Number of Attributes: 61 (58 predictive attributes, 2 non-predictive, 1 goal field)

0. url: URL of the article (non-predictive)
1. timedelta: Days between the article publication and the dataset acquisition (non-predictive)
2. n_tokens_title: Number of words in the title
3. n_tokens_content: Number of words in the content
4. n_unique_tokens: Rate of unique words in the content
5. n_non_stop_words: Rate of non-stop words in the content
6. n_non_stop_unique_tokens: Rate of unique non-stop words in the content
7. num_hrefs: Number of links
8. num_self_hrefs: Number of links to other articles published by Mashable
9. num_imgs: Number of images
10. num_videos: Number of videos
11. average_token_length: Average length of the words in the content
12. num_keywords: Number of keywords in the metadata
13. data_channel_is_lifestyle: Is data channel 'Lifestyle'?
14. data_channel_is_entertainment: Is data channel 'Entertainment'?
15. data_channel_is_bus: Is data channel 'Business'?
16. data_channel_is_socmed: Is data channel 'Social Media'?
17. data_channel_is_tech: Is data channel 'Tech'?
18. data_channel_is_world: Is data channel 'World'?
19. kw_min_min: Worst keyword (min. shares)
20. kw_max_min: Worst keyword (max. shares)
21. kw_avg_min: Worst keyword (avg. shares)
22. kw_min_max: Best keyword (min. shares)
23. kw_max_max: Best keyword (max. shares)
24. kw_avg_max: Best keyword (avg. shares)
25. kw_min_avg: Avg. keyword (min. shares)
26. kw_max_avg: Avg. keyword (max. shares)
27. kw_avg_avg: Avg. keyword (avg. shares)
28. self_reference_min_shares: Min. shares of referenced articles in Mashable
29. self_reference_max_shares: Max. shares of referenced articles in Mashable
30. self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
31. weekday_is_monday: Was the article published on a Monday?
32. weekday_is_tuesday: Was the article published on a Tuesday?
33. weekday_is_wednesday: Was the article published on a Wednesday?
34. weekday_is_thursday: Was the article published on a Thursday?
35. weekday_is_friday: Was the article published on a Friday?
36. weekday_is_saturday: Was the article published on a Saturday?
37. weekday_is_sunday: Was the article published on a Sunday?
38. is_weekend: Was the article published on the weekend?
39. LDA_00: Closeness to LDA topic 0
40. LDA_01: Closeness to LDA topic 1
41. LDA_02: Closeness to LDA topic 2
42. LDA_03: Closeness to LDA topic 3
43. LDA_04: Closeness to LDA topic 4
44. global_subjectivity: Text subjectivity
45. global_sentiment_polarity: Text sentiment polarity
46. global_rate_positive_words: Rate of positive words in the content
47. global_rate_negative_words: Rate of negative words in the content
48. rate_positive_words: Rate of positive words among non-neutral tokens
49. rate_negative_words: Rate of negative words among non-neutral tokens
50. avg_positive_polarity: Avg. polarity of positive words
51. min_positive_polarity: Min. polarity of positive words
52. max_positive_polarity: Max. polarity of positive words
53. avg_negative_polarity: Avg. polarity of negative words
54. min_negative_polarity: Min. polarity of negative words
55. max_negative_polarity: Max. polarity of negative words
56. title_subjectivity: Title subjectivity
57. title_sentiment_polarity: Title polarity
58. abs_title_subjectivity: Absolute subjectivity level
59. abs_title_sentiment_polarity: Absolute polarity level
60. shares: Number of shares (target)
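
The attribute list marks two columns as non-predictive (url, timedelta) and one as the target (shares). A minimal sketch of how a column header might be partitioned accordingly is given here. The helper name split_columns and the sample header are hypothetical, and the whitespace stripping is an assumption about how exported headers may look, not something this page states.

```python
def split_columns(columns):
    """Partition column names into (non_predictive, predictive, target)."""
    non_predictive_names = {"url", "timedelta"}  # marked non-predictive in the attribute list
    target_name = "shares"                       # marked as the target in the attribute list

    # Assumption: header names may carry stray whitespace, so strip them first.
    cleaned = [name.strip() for name in columns]
    if target_name not in cleaned:
        raise ValueError(f"target column {target_name!r} not found")

    non_predictive = [c for c in cleaned if c in non_predictive_names]
    predictive = [
        c for c in cleaned
        if c not in non_predictive_names and c != target_name
    ]
    return non_predictive, predictive, target_name


# Hypothetical, shortened header used only for illustration.
sample_header = ["url", " timedelta", " n_tokens_title", " num_hrefs", " is_weekend", " shares"]
non_predictive, predictive, target = split_columns(sample_header)
print(non_predictive, predictive, target)
```

Applied to the full 61-column header, the same split would be expected to leave 58 predictive columns, matching the counts in the description.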

61 features

shares (target)	numeric	1454 unique values 0 missing
url	nominal	39644 unique values 0 missing
timedelta	numeric	724 unique values 0 missing
n_tokens_title	numeric	20 unique values 0 missing
n_tokens_content	numeric	2406 unique values 0 missing
n_unique_tokens	numeric	27281 unique values 0 missing
n_non_stop_words	numeric	1451 unique values 0 missing
n_non_stop_unique_tokens	numeric	22930 unique values 0 missing
num_hrefs	numeric	133 unique values 0 missing
num_self_hrefs	numeric	59 unique values 0 missing
num_imgs	numeric	91 unique values 0 missing
num_videos	numeric	53 unique values 0 missing
average_token_length	numeric	30136 unique values 0 missing
num_keywords	numeric	10 unique values 0 missing
data_channel_is_lifestyle	numeric	2 unique values 0 missing
data_channel_is_entertainment	numeric	2 unique values 0 missing
data_channel_is_bus	numeric	2 unique values 0 missing
data_channel_is_socmed	numeric	2 unique values 0 missing
data_channel_is_tech	numeric	2 unique values 0 missing
data_channel_is_world	numeric	2 unique values 0 missing
kw_min_min	numeric	26 unique values 0 missing
kw_max_min	numeric	1076 unique values 0 missing
kw_avg_min	numeric	17003 unique values 0 missing
kw_min_max	numeric	1021 unique values 0 missing
kw_max_max	numeric	35 unique values 0 missing
kw_avg_max	numeric	30834 unique values 0 missing
kw_min_avg	numeric	15982 unique values 0 missing
kw_max_avg	numeric	19438 unique values 0 missing
kw_avg_avg	numeric	39300 unique values 0 missing
self_reference_min_shares	numeric	1255 unique values 0 missing
self_reference_max_shares	numeric	1137 unique values 0 missing
self_reference_avg_sharess	numeric	8626 unique values 0 missing
weekday_is_monday	numeric	2 unique values 0 missing
weekday_is_tuesday	numeric	2 unique values 0 missing
weekday_is_wednesday	numeric	2 unique values 0 missing
weekday_is_thursday	numeric	2 unique values 0 missing
weekday_is_friday	numeric	2 unique values 0 missing
weekday_is_saturday	numeric	2 unique values 0 missing
weekday_is_sunday	numeric	2 unique values 0 missing
is_weekend	numeric	2 unique values 0 missing
LDA_00	numeric	39337 unique values 0 missing
LDA_01	numeric	39098 unique values 0 missing
LDA_02	numeric	39525 unique values 0 missing
LDA_03	numeric	38963 unique values 0 missing
LDA_04	numeric	39370 unique values 0 missing
global_subjectivity	numeric	34501 unique values 0 missing
global_sentiment_polarity	numeric	34695 unique values 0 missing
global_rate_positive_words	numeric	13159 unique values 0 missing
global_rate_negative_words	numeric	10271 unique values 0 missing
rate_positive_words	numeric	2284 unique values 0 missing
rate_negative_words	numeric	2284 unique values 0 missing
avg_positive_polarity	numeric	27301 unique values 0 missing
min_positive_polarity	numeric	33 unique values 0 missing
max_positive_polarity	numeric	38 unique values 0 missing
avg_negative_polarity	numeric	13841 unique values 0 missing
min_negative_polarity	numeric	54 unique values 0 missing
max_negative_polarity	numeric	49 unique values 0 missing
title_subjectivity	numeric	673 unique values 0 missing
title_sentiment_polarity	numeric	813 unique values 0 missing
abs_title_subjectivity	numeric	532 unique values 0 missing
abs_title_sentiment_polarity	numeric	653 unique values 0 missing


107 properties

NumberOfInstances

39644

Number of instances (rows) of the dataset.

NumberOfFeatures

Number of attributes (columns) of the dataset.

NumberOfClasses

Number of distinct values of the target attribute (if it is nominal).

NumberOfMissingValues

Number of missing values in the dataset.

NumberOfInstancesWithMissingValues

Number of instances with at least one value missing.

NumberOfNumericFeatures

Number of numeric attributes.

NumberOfSymbolicFeatures

Number of nominal attributes.

RandomTreeDepth1AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1

Dimensionality

Number of attributes divided by the number of instances.
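
As a worked check of this definition, the two counts shown on this page (61 features, 39644 instances) can be plugged in directly. The variable names are illustrative only.

```python
n_features = 61       # "61 features" on this page
n_instances = 39644   # NumberOfInstances on this page

# Dimensionality = number of attributes / number of instances
dimensionality = n_features / n_instances
print(f"{dimensionality:.5f}")  # prints 0.00154
```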

MaxMutualInformation

Maximum mutual information between the nominal attributes and the target attribute.

MinNominalAttDistinctValues

39644

The minimal number of distinct values among attributes of the nominal type.

PercentageOfBinaryFeatures

Percentage of binary attributes.

Quartile2StdDevOfNumericAtts

0.38

Second quartile (Median) of standard deviation of attributes of the numeric type.

RandomTreeDepth1ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1

EquivalentNumberOfAtts

Number of attributes needed to optimally describe the class (under the assumption of independence among attributes). Equals ClassEntropy divided by MeanMutualInformation.
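
The ratio above can be sketched in a few lines, assuming nominal attributes and a nominal target and entropies measured in bits. The function names and the toy columns are hypothetical; because shares is numeric in this dataset, the example uses made-up nominal data rather than columns from this page.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of nominal values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two nominal sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def equivalent_number_of_atts(attributes, target):
    """ClassEntropy divided by MeanMutualInformation."""
    mi_values = [mutual_information(col, target) for col in attributes]
    mean_mi = sum(mi_values) / len(mi_values)
    if mean_mi == 0:
        return float("inf")  # no attribute carries information about the class
    return entropy(target) / mean_mi


# Toy data: one attribute mirrors the class, the other is independent of it.
target = ["a", "a", "b", "b"]
attr_copy = ["x", "x", "y", "y"]    # mutual information with target: 1 bit
attr_noise = ["p", "q", "p", "q"]   # mutual information with target: 0 bits

ena = equivalent_number_of_atts([attr_copy, attr_noise], target)
print(ena)  # prints 2.0
```

Here the class entropy is 1 bit and the mean mutual information is 0.5 bits, so two "average" attributes would be needed under the independence assumption.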

MaxNominalAttDistinctValues

39644

The maximum number of distinct values among attributes of the nominal type.

MinSkewnessOfNumericAtts

-4.58

Minimum skewness among attributes of the numeric type.
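
For readers unfamiliar with the measure, here is a minimal sketch of skewness for a single numeric column, using the population moment ratio m3 / m2^1.5. This page does not say which estimator OpenML uses, so the formula choice is an assumption and the function name is hypothetical.

```python
def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant column: treated as unskewed in this sketch
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 10]
left_tailed = [-10, 1, 1, 1]

print(skewness(symmetric), skewness(right_tailed), skewness(left_tailed))
```

A symmetric column gives zero, a long right tail gives a positive value, and a long left tail gives a negative value. The per-dataset minimum, maximum, mean and quartiles reported on this page would then be taken across all numeric attributes.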

PercentageOfInstancesWithMissingValues

Percentage of instances having missing values.

Quartile3AttributeEntropy

Third quartile of entropy among attributes.

RandomTreeDepth1Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1
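
The Kappa values listed for the landmarkers on this page measure agreement between predicted and true class labels beyond what chance would give. A minimal sketch of Cohen's kappa follows, assuming a nominal target; the function name and the toy labels are hypothetical, and this is not the Weka implementation.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n

    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the two marginal label distributions.
    expected = sum(true_counts[label] * pred_counts[label] for label in true_counts) / (n * n)

    if expected == 1:
        return 0.0  # kappa is undefined here; 0.0 is a convention chosen for this sketch
    return (observed - expected) / (1 - expected)


kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
print(kappa)  # prints 0.5
```

In the toy example, observed agreement is 0.75 and chance agreement is 0.5, which gives a kappa of 0.5.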

J48.00001.AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .00001

MaxSkewnessOfNumericAtts

198.79

Maximum skewness among attributes of the numeric type.

MinStdDevOfNumericAtts

0.01

Minimum standard deviation of attributes of the numeric type.

PercentageOfMissingValues

Percentage of missing values.

Quartile3KurtosisOfNumericAtts

32.76

Third quartile of kurtosis among attributes of the numeric type.

AutoCorrelation

-4072.46

Average class difference between consecutive instances.

RandomTreeDepth2AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2

J48.00001.ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.J48 -C .00001

MaxStdDevOfNumericAtts

214502.13

Maximum standard deviation of attributes of the numeric type.

MinorityClassPercentage

Percentage of instances belonging to the least frequent class.

PercentageOfNumericFeatures

98.36

Percentage of numeric attributes.

Quartile3MeansOfNumericAtts

22.3

Third quartile of means among attributes of the numeric type.

CfsSubsetEval_DecisionStumpAUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

RandomTreeDepth2ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2

J48.00001.Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.J48 -C .00001

MeanAttributeEntropy

Average entropy of the attributes.

MinorityClassSize

Number of instances belonging to the least frequent class.

PercentageOfSymbolicFeatures

1.64

Percentage of nominal attributes.
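
The two percentages reported on this page (98.36 numeric, 1.64 symbolic) can be reproduced from the feature table, where url is the only nominal column among the 61 features. The variable names below are illustrative.

```python
n_features = 61   # total features listed on this page
n_nominal = 1     # only "url" is listed as nominal in the feature table
n_numeric = n_features - n_nominal

pct_numeric = round(100 * n_numeric / n_features, 2)
pct_symbolic = round(100 * n_nominal / n_features, 2)
print(pct_numeric, pct_symbolic)  # prints 98.36 1.64
```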

Quartile3MutualInformation

Third quartile of mutual information between the nominal attributes and the target attribute.

CfsSubsetEval_DecisionStumpErrRate

Error rate achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

RandomTreeDepth2Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2

J48.0001.AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .0001

MeanKurtosisOfNumericAtts

2111.1

Mean kurtosis among attributes of the numeric type.

NaiveBayesAUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.bayes.NaiveBayes

Quartile1AttributeEntropy

First quartile of entropy among attributes.

Quartile3SkewnessOfNumericAtts

4.01

Third quartile of skewness among attributes of the numeric type.

CfsSubsetEval_DecisionStumpKappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

RandomTreeDepth3AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3

J48.0001.ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.J48 -C .0001

MeanMeansOfNumericAtts

17694.95

Mean of means among attributes of the numeric type.

NaiveBayesErrRate

Error rate achieved by the landmarker weka.classifiers.bayes.NaiveBayes

Quartile1KurtosisOfNumericAtts

0.63

First quartile of kurtosis among attributes of the numeric type.

Quartile3StdDevOfNumericAtts

55.06

Third quartile of standard deviation of attributes of the numeric type.

CfsSubsetEval_NaiveBayesAUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

RandomTreeDepth3ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3

J48.0001.Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.J48 -C .0001

MeanMutualInformation

Average mutual information between the nominal attributes and the target attribute.

NaiveBayesKappa

Kappa coefficient achieved by the landmarker weka.classifiers.bayes.NaiveBayes

Quartile1MeansOfNumericAtts

0.15

First quartile of means among attributes of the numeric type.

REPTreeDepth1AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 1

CfsSubsetEval_NaiveBayesErrRate

Error rate achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

CfsSubsetEval_NaiveBayesKappa

Kappa coefficient achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

RandomTreeDepth3Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3

J48.001.AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .001

MeanNoiseToSignalRatio

An estimate of the amount of irrelevant information in the attributes regarding the class. Equals (MeanAttributeEntropy - MeanMutualInformation) divided by MeanMutualInformation.
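
The formula above translates directly into code. The sketch below uses toy input values rather than figures from this dataset, and the function name is hypothetical.

```python
def mean_noise_to_signal_ratio(mean_attribute_entropy, mean_mutual_information):
    """(MeanAttributeEntropy - MeanMutualInformation) / MeanMutualInformation."""
    if mean_mutual_information == 0:
        raise ValueError("ratio is undefined when the mean mutual information is zero")
    return (mean_attribute_entropy - mean_mutual_information) / mean_mutual_information


# Toy values: attributes carry 1 bit on average, of which 0.5 bits relate to the class.
ratio = mean_noise_to_signal_ratio(1.0, 0.5)
print(ratio)  # prints 1.0
```

A ratio of 1.0 means that, on average, each attribute carries as much class-irrelevant information as class-relevant information.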

NumberOfBinaryFeatures

Number of binary attributes.

Quartile1MutualInformation

First quartile of mutual information between the nominal attributes and the target attribute.

REPTreeDepth1ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.REPTree -L 1

CfsSubsetEval_kNN1NAUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

StdvNominalAttDistinctValues

Standard deviation of the number of distinct values among attributes of the nominal type.

J48.001.ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.J48 -C .001

MeanNominalAttDistinctValues

39644

Average number of distinct values among the attributes of the nominal type.

Quartile1SkewnessOfNumericAtts

0.34

First quartile of skewness among attributes of the numeric type.

REPTreeDepth1Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.REPTree -L 1

CfsSubsetEval_kNN1NErrRate

Error rate achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

kNN1NAUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.lazy.IBk

J48.001.Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.J48 -C .001

MeanSkewnessOfNumericAtts

14.09

Mean skewness among attributes of the numeric type.

Quartile1StdDevOfNumericAtts

0.24

First quartile of standard deviation of attributes of the numeric type.

REPTreeDepth2AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 2

CfsSubsetEval_kNN1NKappa

Kappa coefficient achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

kNN1NErrRate

Error rate achieved by the landmarker weka.classifiers.lazy.IBk

MajorityClassPercentage

Percentage of instances belonging to the most frequent class.

MeanStdDevOfNumericAtts

8633.94

Mean standard deviation of attributes of the numeric type.

Quartile2AttributeEntropy

Second quartile (Median) of entropy among attributes.

REPTreeDepth2ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.REPTree -L 2

REPTreeDepth2Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.REPTree -L 2

ClassEntropy

Entropy of the target attribute values.

kNN1NKappa

Kappa coefficient achieved by the landmarker weka.classifiers.lazy.IBk

MajorityClassSize

Number of instances belonging to the most frequent class.

MinAttributeEntropy

Minimal entropy among attributes.

Quartile2KurtosisOfNumericAtts

3.31

Second quartile (Median) of kurtosis among attributes of the numeric type.

REPTreeDepth3AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 3

DecisionStumpAUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.DecisionStump

MaxAttributeEntropy

Maximum entropy among attributes.

MinKurtosisOfNumericAtts

-1.28

Minimum kurtosis among attributes of the numeric type.

Quartile2MeansOfNumericAtts

0.31

Second quartile (Median) of means among attributes of the numeric type.

REPTreeDepth3ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.REPTree -L 3

DecisionStumpErrRate

Error rate achieved by the landmarker weka.classifiers.trees.DecisionStump

MaxKurtosisOfNumericAtts

39560.29

Maximum kurtosis among attributes of the numeric type.

MinMeansOfNumericAtts

-0.52

Minimum of means among attributes of the numeric type.

Quartile2MutualInformation

Second quartile (Median) of mutual information between the nominal attributes and the target attribute.

REPTreeDepth3Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.REPTree -L 3

DecisionStumpKappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.DecisionStump

MaxMeansOfNumericAtts

752324.07

Maximum of means among attributes of the numeric type.

MinMutualInformation

Minimal mutual information between the nominal attributes and the target attribute.

Quartile2SkewnessOfNumericAtts

1.66

Second quartile (Median) of skewness among attributes of the numeric type.
