{ "data_id": "44049", "name": "OnlineNewsPopularity", "exact_name": "OnlineNewsPopularity", "version": 3, "version_label": null, "description": "Dataset used in the tabular data benchmark https:\/\/github.com\/LeoGrin\/tabular-benchmark, \n transformed in the same way. This dataset belongs to the \"regression on categorical and\n numerical features\" benchmark. Original description: \n \nVersion with url set as row id, creator data missing due to bad formatting.\n\n**Author**: Kelwin Fernandes (INESC TEC, Universidade do Porto), Pedro Vinagre (ALGORITMI Research Centre, Universidade do Minho), Paulo Cortez (ALGORITMI Research Centre, Universidade do Minho), Pedro Sernadela (Universidade de Aveiro) \n\n**Source**: UCI \n\n**Please cite**: K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal. \n\n\n\nThis dataset summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict the number of shares in social networks (popularity).\n\n\n\n* The articles were published by Mashable (www.mashable.com), and the rights to reproduce their content belong to Mashable. Hence, this dataset does not share the original content, only some statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs. \n\n* Acquisition date: January 8, 2015 \n\n* The relative performance values were estimated by the authors using a Random Forest classifier and a rolling window as the assessment method. See their article for more details on how the relative performance values were set.\n\n\n\n\n\nAttribute Information:\n\n\n\nNumber of Attributes: 61 (58 predictive attributes, 2 non-predictive, 1 goal field) \n\n\n\nAttribute Information: \n\n0. url: URL of the article (non-predictive) \n\n1. 
timedelta: Days between the article publication and the dataset acquisition (non-predictive) \n\n2. n_tokens_title: Number of words in the title \n\n3. n_tokens_content: Number of words in the content \n\n4. n_unique_tokens: Rate of unique words in the content \n\n5. n_non_stop_words: Rate of non-stop words in the content \n\n6. n_non_stop_unique_tokens: Rate of unique non-stop words in the content \n\n7. num_hrefs: Number of links \n\n8. num_self_hrefs: Number of links to other articles published by Mashable \n\n9. num_imgs: Number of images \n\n10. num_videos: Number of videos \n\n11. average_token_length: Average length of the words in the content \n\n12. num_keywords: Number of keywords in the metadata \n\n13. data_channel_is_lifestyle: Is data channel 'Lifestyle'? \n\n14. data_channel_is_entertainment: Is data channel 'Entertainment'? \n\n15. data_channel_is_bus: Is data channel 'Business'? \n\n16. data_channel_is_socmed: Is data channel 'Social Media'? \n\n17. data_channel_is_tech: Is data channel 'Tech'? \n\n18. data_channel_is_world: Is data channel 'World'? \n\n19. kw_min_min: Worst keyword (min. shares) \n\n20. kw_max_min: Worst keyword (max. shares) \n\n21. kw_avg_min: Worst keyword (avg. shares) \n\n22. kw_min_max: Best keyword (min. shares) \n\n23. kw_max_max: Best keyword (max. shares) \n\n24. kw_avg_max: Best keyword (avg. shares) \n\n25. kw_min_avg: Avg. keyword (min. shares) \n\n26. kw_max_avg: Avg. keyword (max. shares) \n\n27. kw_avg_avg: Avg. keyword (avg. shares) \n\n28. self_reference_min_shares: Min. shares of referenced articles in Mashable \n\n29. self_reference_max_shares: Max. shares of referenced articles in Mashable \n\n30. self_reference_avg_sharess: Avg. shares of referenced articles in Mashable \n\n31. weekday_is_monday: Was the article published on a Monday? \n\n32. weekday_is_tuesday: Was the article published on a Tuesday? \n\n33. weekday_is_wednesday: Was the article published on a Wednesday? \n\n34. 
weekday_is_thursday: Was the article published on a Thursday? \n\n35. weekday_is_friday: Was the article published on a Friday? \n\n36. weekday_is_saturday: Was the article published on a Saturday? \n\n37. weekday_is_sunday: Was the article published on a Sunday? \n\n38. is_weekend: Was the article published on the weekend? \n\n39. LDA_00: Closeness to LDA topic 0 \n\n40. LDA_01: Closeness to LDA topic 1 \n\n41. LDA_02: Closeness to LDA topic 2 \n\n42. LDA_03: Closeness to LDA topic 3 \n\n43. LDA_04: Closeness to LDA topic 4 \n\n44. global_subjectivity: Text subjectivity \n\n45. global_sentiment_polarity: Text sentiment polarity \n\n46. global_rate_positive_words: Rate of positive words in the content \n\n47. global_rate_negative_words: Rate of negative words in the content \n\n48. rate_positive_words: Rate of positive words among non-neutral tokens \n\n49. rate_negative_words: Rate of negative words among non-neutral tokens \n\n50. avg_positive_polarity: Avg. polarity of positive words \n\n51. min_positive_polarity: Min. polarity of positive words \n\n52. max_positive_polarity: Max. polarity of positive words \n\n53. avg_negative_polarity: Avg. polarity of negative words \n\n54. min_negative_polarity: Min. polarity of negative words \n\n55. max_negative_polarity: Max. polarity of negative words \n\n56. title_subjectivity: Title subjectivity \n\n57. title_sentiment_polarity: Title polarity \n\n58. abs_title_subjectivity: Absolute subjectivity level \n\n59. abs_title_sentiment_polarity: Absolute polarity level \n\n60. shares: Number of shares (target)\n\n\n\n\n\nRelevant Papers:\n\n\n\nK. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.\n\n\n\n\n\n\n\nCitation Request:\n\n\n\nK. Fernandes, P. Vinagre and P. Cortez. 
A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.", "format": "arff", "uploader": "Leo Grin", "uploader_id": 26324, "visibility": "public", "creator": null, "contributor": "\"Leo Grin\"", "date": "2022-06-18 13:12:59", "update_comment": null, "last_update": "2022-06-18 13:12:59", "licence": "Public", "status": "active", "error_message": null, "url": "https:\/\/old.openml.org\/data\/download\/22103137\/dataset", "kaggle_url": null, "default_target_attribute": "shares", "row_id_attribute": null, "ignore_attribute": null, "runs": 0, "suggest": { "input": [ "OnlineNewsPopularity", "Dataset used in the tabular data benchmark https:\/\/github.com\/LeoGrin\/tabular-benchmark, transformed in the same way. This dataset belongs to the \"regression on categorical and numerical features\" benchmark. Original description: Version with url set as row id, creator data missing due to bad formatting. **Author**: Kelwin Fernandes (INESC TEC, Universidade do Porto), Pedro Vinagre (ALGORITMI Research Centre, Universidade do Minho), Paulo Cortez (ALGORITMI Research Centre, Universidade do Minho), " ], "weight": 5 }, "qualities": { "NumberOfInstances": 39644, "NumberOfFeatures": 60, "NumberOfClasses": 0, "NumberOfMissingValues": 0, "NumberOfInstancesWithMissingValues": 0, "NumberOfNumericFeatures": 46, "NumberOfSymbolicFeatures": 14, "PercentageOfBinaryFeatures": 23.333333333333332, "PercentageOfInstancesWithMissingValues": 0, "PercentageOfMissingValues": 0, "AutoCorrelation": 0.048687049895252785, "PercentageOfNumericFeatures": 76.66666666666667, "Dimensionality": 0.0015134698819493492, "PercentageOfSymbolicFeatures": 23.333333333333332, "MajorityClassPercentage": null, "MajorityClassSize": null, "MinorityClassPercentage": null, "MinorityClassSize": null, "NumberOfBinaryFeatures": 14 }, "tags": [], "features": [ { "name": 
"shares", "index": "59", "type": "numeric", "distinct": "1454", "missing": "0", "target": "1", "min": "1", "max": "14", "mean": "7", "stdev": "1" }, { "name": "timedelta", "index": "0", "type": "numeric", "distinct": "724", "missing": "0", "min": "8", "max": "731", "mean": "355", "stdev": "214" }, { "name": "n_tokens_title", "index": "1", "type": "numeric", "distinct": "20", "missing": "0", "min": "2", "max": "23", "mean": "10", "stdev": "2" }, { "name": "n_tokens_content", "index": "2", "type": "numeric", "distinct": "2406", "missing": "0", "min": "0", "max": "8474", "mean": "547", "stdev": "471" }, { "name": "n_unique_tokens", "index": "3", "type": "numeric", "distinct": "27281", "missing": "0", "min": "0", "max": "701", "mean": "1", "stdev": "4" }, { "name": "n_non_stop_words", "index": "4", "type": "numeric", "distinct": "1451", "missing": "0", "min": "0", "max": "1042", "mean": "1", "stdev": "5" }, { "name": "n_non_stop_unique_tokens", "index": "5", "type": "numeric", "distinct": "22930", "missing": "0", "min": "0", "max": "650", "mean": "1", "stdev": "3" }, { "name": "num_hrefs", "index": "6", "type": "numeric", "distinct": "133", "missing": "0", "min": "0", "max": "304", "mean": "11", "stdev": "11" }, { "name": "num_self_hrefs", "index": "7", "type": "numeric", "distinct": "59", "missing": "0", "min": "0", "max": "116", "mean": "3", "stdev": "4" }, { "name": "num_imgs", "index": "8", "type": "numeric", "distinct": "91", "missing": "0", "min": "0", "max": "128", "mean": "5", "stdev": "8" }, { "name": "num_videos", "index": "9", "type": "numeric", "distinct": "53", "missing": "0", "min": "0", "max": "91", "mean": "1", "stdev": "4" }, { "name": "average_token_length", "index": "10", "type": "numeric", "distinct": "30136", "missing": "0", "min": "0", "max": "8", "mean": "5", "stdev": "1" }, { "name": "num_keywords", "index": "11", "type": "numeric", "distinct": "10", "missing": "0", "min": "1", "max": "10", "mean": "7", "stdev": "2" }, { "name": 
"data_channel_is_lifestyle", "index": "12", "type": "nominal", "distinct": "2", "missing": "0", "distr": [] }, { "name": "data_channel_is_entertainment", "index": "13", "type": "nominal", "distinct": "2", "missing": "0", "distr": [] }, { "name": "data_channel_is_bus", "index": "14", "type": "nominal", "distinct": "2", "missing": "0", "distr": [] }, { "name": "data_channel_is_socmed", "index": "15", "type": "nominal", "distinct": "2", "missing": "0", "distr": [] }, { "name": "data_channel_is_tech", "index": "16", "type": "nominal", "distinct": "2", "missing": "0", "distr": [] }, { "name": "data_channel_is_world", "index": "17", "type": "nominal", "distinct": "2", "missing": "0", "distr": [] }, { "name": "kw_min_min", "index": "18", "type": "numeric", "distinct": "26", "missing": "0", "min": "-1", "max": "377", "mean": "26", "stdev": "70" }, { "name": "kw_max_min", "index": "19", "type": "numeric", "distinct": "1076", "missing": "0", "min": "0", "max": "298400", "mean": "1154", "stdev": "3858" }, { "name": "kw_avg_min", "index": "20", "type": "numeric", "distinct": "17003", "missing": "0", "min": "-1", "max": "42828", "mean": "312", "stdev": "621" }, { "name": "kw_min_max", "index": "21", "type": "numeric", "distinct": "1021", "missing": "0", "min": "0", "max": "843300", "mean": "13612", "stdev": "57986" }, { "name": "kw_max_max", "index": "22", "type": "numeric", "distinct": "35", "missing": "0", "min": "0", "max": "843300", "mean": "752324", "stdev": "214502" }, { "name": "kw_avg_max", "index": "23", "type": "numeric", "distinct": "30834", "missing": "0", "min": "0", "max": "843300", "mean": "259282", "stdev": "135102" }, { "name": "kw_min_avg", "index": "24", "type": "numeric", "distinct": "15982", "missing": "0", "min": "-1", "max": "3613", "mean": "1117", "stdev": "1137" }, { "name": "kw_max_avg", "index": "25", "type": "numeric", "distinct": "19438", "missing": "0", "min": "0", "max": "298400", "mean": "5657", "stdev": "6099" }, { "name": "kw_avg_avg", "index": 
"26", "type": "numeric", "distinct": "39300", "missing": "0", "min": "0", "max": "43568", "mean": "3136", "stdev": "1318" }, { "name": "self_reference_min_shares", "index": "27", "type": "numeric", "distinct": "1255", "missing": "0", "min": "0", "max": "843300", "mean": "3999", "stdev": "19739" }, { "name": "self_reference_max_shares", "index": "28", "type": "numeric", "distinct": "1137", "missing": "0", "min": "0", "max": "843300", "mean": "10329", "stdev": "41028" }, { "name": "self_reference_avg_sharess", "index": "29", "type": "numeric", "distinct": "8626", "missing": "0", "min": "0", "max": "843300", "mean": "6402", "stdev": "24211" }, { "name": "weekday_is_monday", "index": "30", "type": "nominal", "distinct": "2", "missing": "0", "distr": [] }, { "name": "weekday_is_tuesday", "index": "31", "type": "nominal", "distinct": "2", "missing": "0", "distr": [] }, { "name": "weekday_is_wednesday", "index": "32", "type": "nominal", "distinct": "2", "missing": "0", "distr": [] }, { "name": "weekday_is_thursday", "index": "33", "type": "nominal", "distinct": "2", "missing": "0", "distr": [] }, { "name": "weekday_is_friday", "index": "34", "type": "nominal", "distinct": "2", "missing": "0", "distr": [] }, { "name": "weekday_is_saturday", "index": "35", "type": "nominal", "distinct": "2", "missing": "0", "distr": [] }, { "name": "weekday_is_sunday", "index": "36", "type": "nominal", "distinct": "2", "missing": "0", "distr": [] }, { "name": "is_weekend", "index": "37", "type": "nominal", "distinct": "2", "missing": "0", "distr": [] }, { "name": "LDA_00", "index": "38", "type": "numeric", "distinct": "39337", "missing": "0", "min": "0", "max": "1", "mean": "0", "stdev": "0" }, { "name": "LDA_01", "index": "39", "type": "numeric", "distinct": "39098", "missing": "0", "min": "0", "max": "1", "mean": "0", "stdev": "0" }, { "name": "LDA_02", "index": "40", "type": "numeric", "distinct": "39525", "missing": "0", "min": "0", "max": "1", "mean": "0", "stdev": "0" }, { "name": 
"LDA_03", "index": "41", "type": "numeric", "distinct": "38963", "missing": "0", "min": "0", "max": "1", "mean": "0", "stdev": "0" }, { "name": "LDA_04", "index": "42", "type": "numeric", "distinct": "39370", "missing": "0", "min": "0", "max": "1", "mean": "0", "stdev": "0" }, { "name": "global_subjectivity", "index": "43", "type": "numeric", "distinct": "34501", "missing": "0", "min": "0", "max": "1", "mean": "0", "stdev": "0" }, { "name": "global_sentiment_polarity", "index": "44", "type": "numeric", "distinct": "34695", "missing": "0", "min": "0", "max": "1", "mean": "0", "stdev": "0" }, { "name": "global_rate_positive_words", "index": "45", "type": "numeric", "distinct": "13159", "missing": "0", "min": "0", "max": "0", "mean": "0", "stdev": "0" }, { "name": "global_rate_negative_words", "index": "46", "type": "numeric", "distinct": "10271", "missing": "0", "min": "0", "max": "0", "mean": "0", "stdev": "0" }, { "name": "rate_positive_words", "index": "47", "type": "numeric", "distinct": "2284", "missing": "0", "min": "0", "max": "1", "mean": "1", "stdev": "0" }, { "name": "rate_negative_words", "index": "48", "type": "numeric", "distinct": "2284", "missing": "0", "min": "0", "max": "1", "mean": "0", "stdev": "0" }, { "name": "avg_positive_polarity", "index": "49", "type": "numeric", "distinct": "27301", "missing": "0", "min": "0", "max": "1", "mean": "0", "stdev": "0" }, { "name": "min_positive_polarity", "index": "50", "type": "numeric", "distinct": "33", "missing": "0", "min": "0", "max": "1", "mean": "0", "stdev": "0" }, { "name": "max_positive_polarity", "index": "51", "type": "numeric", "distinct": "38", "missing": "0", "min": "0", "max": "1", "mean": "1", "stdev": "0" }, { "name": "avg_negative_polarity", "index": "52", "type": "numeric", "distinct": "13841", "missing": "0", "min": "-1", "max": "0", "mean": "0", "stdev": "0" }, { "name": "min_negative_polarity", "index": "53", "type": "numeric", "distinct": "54", "missing": "0", "min": "-1", "max": "0", 
"mean": "-1", "stdev": "0" }, { "name": "max_negative_polarity", "index": "54", "type": "numeric", "distinct": "49", "missing": "0", "min": "-1", "max": "0", "mean": "0", "stdev": "0" }, { "name": "title_subjectivity", "index": "55", "type": "numeric", "distinct": "673", "missing": "0", "min": "0", "max": "1", "mean": "0", "stdev": "0" }, { "name": "title_sentiment_polarity", "index": "56", "type": "numeric", "distinct": "813", "missing": "0", "min": "-1", "max": "1", "mean": "0", "stdev": "0" }, { "name": "abs_title_subjectivity", "index": "57", "type": "numeric", "distinct": "532", "missing": "0", "min": "0", "max": "1", "mean": "0", "stdev": "0" }, { "name": "abs_title_sentiment_polarity", "index": "58", "type": "numeric", "distinct": "653", "missing": "0", "min": "0", "max": "1", "mean": "0", "stdev": "0" } ], "nr_of_issues": 0, "nr_of_downvotes": 0, "nr_of_likes": 0, "nr_of_downloads": 0, "total_downloads": 0, "reach": 0, "reuse": 0, "impact_of_reuse": 0, "reach_of_reuse": 0, "impact": 0 }