{ "data_id": "44", "name": "spambase", "exact_name": "spambase", "version": 1, "version_label": "1", "description": "**Author**: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt \n**Source**: [UCI](https:\/\/archive.ics.uci.edu\/ml\/datasets\/spambase) \n**Please cite**: [UCI](https:\/\/archive.ics.uci.edu\/ml\/citation_policy.html)\n\nSPAM E-mail Database \nThe \"spam\" concept is diverse: advertisements for products\/websites, make money fast schemes, chain letters, pornography... Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.\n \nFor background on spam: \nCranor, Lorrie F., LaMacchia, Brian A. Spam! Communications of the ACM, 41(8):74-83, 1998. \n\n### Attribute Information: \nThe last column denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail. Most of the attributes indicate whether a particular word or character was frequently occurring in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters. \n\nFor the statistical measures of each attribute, see the end of this file. Here are the definitions of the attributes: \n\n48 continuous real [0,100] attributes of type \nword_freq_WORD = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) \/ total number of words in e-mail. 
A \"word\" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.\n \n6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) \/ total characters in e-mail\n \n1 continuous real [1,...] attribute of type capital_run_length_average\n = average length of uninterrupted sequences of capital letters\n \n1 continuous integer [1,...] attribute of type capital_run_length_longest\n = length of longest uninterrupted sequence of capital letters\n \n1 continuous integer [1,...] attribute of type capital_run_length_total\n = sum of length of uninterrupted sequences of capital letters\n = total number of capital letters in the e-mail\n \n1 nominal {0,1} class attribute of type spam\n = denotes whether the e-mail was considered spam (1) or not (0), \n i.e. unsolicited commercial e-mail.", "format": "ARFF", "uploader": "Jan van Rijn", "uploader_id": 1, "visibility": "public", "creator": "Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt, Hewlett-Packard Labs", "contributor": null, "date": "2014-04-06 23:22:41", "update_comment": null, "last_update": "2014-04-06 23:22:41", "licence": "Public", "status": "active", "error_message": null, "url": "https:\/\/www.openml.org\/data\/download\/44\/dataset_44_spambase.arff", "kaggle_url": "https:\/\/www.kaggle.com\/datasets\/yasserh\/spamemailsdataset", "default_target_attribute": "class", "row_id_attribute": null, "ignore_attribute": null, "runs": 162018, "suggest": { "input": [ "spambase", "SPAM E-mail Database The \"spam\" concept is diverse: advertisements for products\/websites, make money fast schemes, chain letters, pornography... Our collection of spam e-mails came from our postmaster and individuals who had filed spam. 
Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-sp " ], "weight": 5 }, "qualities": { "NumberOfInstances": 4601, "NumberOfFeatures": 58, "NumberOfClasses": 2, "NumberOfMissingValues": 0, "NumberOfInstancesWithMissingValues": 0, "NumberOfNumericFeatures": 57, "NumberOfSymbolicFeatures": 1, "MaxStdDevOfNumericAtts": 606.3478507248471, "MinorityClassPercentage": 39.404477287546186, "PercentageOfNumericFeatures": 98.27586206896551, "Quartile3MeansOfNumericAtts": 0.24413062377743908, "CfsSubsetEval_DecisionStumpAUC": 0.9397314627894664, "RandomTreeDepth2AUC": 0.8911774993451567, "J48.00001.ErrRate": 0.08411214953271028, "MeanAttributeEntropy": null, "MinorityClassSize": 1813, "PercentageOfSymbolicFeatures": 1.7241379310344827, "Quartile3MutualInformation": null, "CfsSubsetEval_DecisionStumpErrRate": 0.08563355792219082, "RandomTreeDepth2ErrRate": 0.10410780265159748, "J48.00001.Kappa": 0.82347465212921, "MeanKurtosisOfNumericAtts": 241.1700186517731, "NaiveBayesAUC": 0.93523126498161, "Quartile1AttributeEntropy": null, "Quartile3SkewnessOfNumericAtts": 13.646188094980591, "CfsSubsetEval_DecisionStumpKappa": 0.8208876445659258, "RandomTreeDepth2Kappa": 0.781973609246172, "J48.0001.AUC": 0.924541669007748, "MeanMeansOfNumericAtts": 6.150770191072139, "NaiveBayesErrRate": 0.20234731580091284, "Quartile1KurtosisOfNumericAtts": 50.655931063002996, "Quartile3StdDevOfNumericAtts": 0.8437450862048406, "CfsSubsetEval_NaiveBayesAUC": 0.9397314627894664, "RandomTreeDepth3AUC": 0.8911774993451567, "J48.0001.ErrRate": 0.08411214953271028, "MeanMutualInformation": null, "NaiveBayesKappa": 0.605321391923295, "Quartile1MeansOfNumericAtts": 0.06479352314714121, "REPTreeDepth1AUC": 0.9386679853220129, "CfsSubsetEval_NaiveBayesErrRate": 0.08563355792219082, 
"RandomTreeDepth3ErrRate": 0.10410780265159748, "J48.0001.Kappa": 0.82347465212921, "MeanNoiseToSignalRatio": null, "NumberOfBinaryFeatures": 1, "Quartile1MutualInformation": null, "REPTreeDepth1ErrRate": 0.10345577048467725, "CfsSubsetEval_NaiveBayesKappa": 0.8208876445659258, "RandomTreeDepth3Kappa": 0.781973609246172, "J48.001.AUC": 0.924541669007748, "MeanNominalAttDistinctValues": 2, "Quartile1SkewnessOfNumericAtts": 5.8507230150661425, "REPTreeDepth1Kappa": 0.7807805902062573, "CfsSubsetEval_kNN1NAUC": 0.9397314627894664, "StdvNominalAttDistinctValues": 0, "J48.001.ErrRate": 0.08411214953271028, "MeanSkewnessOfNumericAtts": 11.186639096029253, "Quartile1StdDevOfNumericAtts": 0.31695822185668954, "REPTreeDepth2AUC": 0.9386679853220129, "CfsSubsetEval_kNN1NErrRate": 0.08563355792219082, "kNN1NAUC": 0.8937334657000572, "J48.001.Kappa": 0.82347465212921, "MeanStdDevOfNumericAtts": 15.193997694546747, "Quartile2AttributeEntropy": null, "REPTreeDepth2ErrRate": 0.10345577048467725, "CfsSubsetEval_kNN1NKappa": 0.8208876445659258, "kNN1NErrRate": 0.10736796348619865, "MajorityClassPercentage": 60.59552271245382, "MinAttributeEntropy": null, "Quartile2KurtosisOfNumericAtts": 127.37652934849572, "REPTreeDepth2Kappa": 0.7807805902062573, "ClassEntropy": 0.9673602371807668, "kNN1NKappa": 0.775167746729542, "MajorityClassSize": 2788, "MinKurtosisOfNumericAtts": 5.257394367988116, "Quartile2MeansOfNumericAtts": 0.10285155400999912, "REPTreeDepth3AUC": 0.9386679853220129, "DecisionStumpAUC": 0.7941574124705914, "MaxAttributeEntropy": null, "MinMeansOfNumericAtts": 0.005444468593783957, "Quartile2MutualInformation": null, "REPTreeDepth3ErrRate": 0.10345577048467725, "DecisionStumpErrRate": 0.20930232558139536, "MaxKurtosisOfNumericAtts": 1480.6420502862907, "MinMutualInformation": null, "Quartile2SkewnessOfNumericAtts": 9.724847529978312, "REPTreeDepth3Kappa": 0.7807805902062573, "DecisionStumpKappa": 0.549772420190581, "MaxMeansOfNumericAtts": 283.28928493805716, 
"MinNominalAttDistinctValues": 2, "PercentageOfBinaryFeatures": 1.7241379310344827, "Quartile2StdDevOfNumericAtts": 0.4440553289821315, "RandomTreeDepth1AUC": 0.8911774993451567, "Dimensionality": 0.012605955227124538, "MaxMutualInformation": null, "MinSkewnessOfNumericAtts": 1.5916742687064245, "PercentageOfInstancesWithMissingValues": 0, "Quartile3AttributeEntropy": null, "RandomTreeDepth1ErrRate": 0.10410780265159748, "EquivalentNumberOfAtts": null, "MaxNominalAttDistinctValues": 2, "MinStdDevOfNumericAtts": 0.07627427063724908, "PercentageOfMissingValues": 0, "Quartile3KurtosisOfNumericAtts": 299.0723734257733, "AutoCorrelation": 0.9997826086956522, "RandomTreeDepth1Kappa": 0.781973609246172, "J48.00001.AUC": 0.924541669007748, "MaxSkewnessOfNumericAtts": 31.062064279039635 }, "tags": [ { "uploader": "38960", "tag": "Computer Science" }, { "uploader": "38960", "tag": "Data Science" }, { "uploader": "38960", "tag": "Email Management" }, { "uploader": "38960", "tag": "Information Retrieval" }, { "uploader": "2", "tag": "Kaggle" }, { "uploader": "1", "tag": "mythbusting_1" }, { "uploader": "1", "tag": "OpenML-CC18" }, { "uploader": "348", "tag": "OpenML100" }, { "uploader": "2", "tag": "study_1" }, { "uploader": "3886", "tag": "study_123" }, { "uploader": "64", "tag": "study_14" }, { "uploader": "939", "tag": "study_15" }, { "uploader": "939", "tag": "study_20" }, { "uploader": "1", "tag": "study_34" }, { "uploader": "1", "tag": "study_37" }, { "uploader": "1", "tag": "study_41" }, { "uploader": "64", "tag": "study_52" }, { "uploader": "64", "tag": "study_7" }, { "uploader": "1856", "tag": "study_70" }, { "uploader": "1935", "tag": "study_98" }, { "uploader": "1", "tag": "study_99" }, { "uploader": "1", "tag": "uci" } ], "features": [ { "name": "class", "index": "57", "type": "nominal", "distinct": "2", "missing": "0", "target": "1", "distr": [ [ "0", "1" ], [ [ "2788", "0" ], [ "0", "1813" ] ] ] }, { "name": "word_freq_make", "index": "0", "type": "numeric", 
"distinct": "142", "missing": "0", "min": "0", "max": "5", "mean": "0", "stdev": "0" }, { "name": "word_freq_address", "index": "1", "type": "numeric", "distinct": "171", "missing": "0", "min": "0", "max": "14", "mean": "0", "stdev": "1" }, { "name": "word_freq_all", "index": "2", "type": "numeric", "distinct": "214", "missing": "0", "min": "0", "max": "5", "mean": "0", "stdev": "1" }, { "name": "word_freq_3d", "index": "3", "type": "numeric", "distinct": "43", "missing": "0", "min": "0", "max": "43", "mean": "0", "stdev": "1" }, { "name": "word_freq_our", "index": "4", "type": "numeric", "distinct": "255", "missing": "0", "min": "0", "max": "10", "mean": "0", "stdev": "1" }, { "name": "word_freq_over", "index": "5", "type": "numeric", "distinct": "141", "missing": "0", "min": "0", "max": "6", "mean": "0", "stdev": "0" }, { "name": "word_freq_remove", "index": "6", "type": "numeric", "distinct": "173", "missing": "0", "min": "0", "max": "7", "mean": "0", "stdev": "0" }, { "name": "word_freq_internet", "index": "7", "type": "numeric", "distinct": "170", "missing": "0", "min": "0", "max": "11", "mean": "0", "stdev": "0" }, { "name": "word_freq_order", "index": "8", "type": "numeric", "distinct": "144", "missing": "0", "min": "0", "max": "5", "mean": "0", "stdev": "0" }, { "name": "word_freq_mail", "index": "9", "type": "numeric", "distinct": "245", "missing": "0", "min": "0", "max": "18", "mean": "0", "stdev": "1" }, { "name": "word_freq_receive", "index": "10", "type": "numeric", "distinct": "113", "missing": "0", "min": "0", "max": "3", "mean": "0", "stdev": "0" }, { "name": "word_freq_will", "index": "11", "type": "numeric", "distinct": "316", "missing": "0", "min": "0", "max": "10", "mean": "1", "stdev": "1" }, { "name": "word_freq_people", "index": "12", "type": "numeric", "distinct": "158", "missing": "0", "min": "0", "max": "6", "mean": "0", "stdev": "0" }, { "name": "word_freq_report", "index": "13", "type": "numeric", "distinct": "133", "missing": "0", 
"min": "0", "max": "10", "mean": "0", "stdev": "0" }, { "name": "word_freq_addresses", "index": "14", "type": "numeric", "distinct": "118", "missing": "0", "min": "0", "max": "4", "mean": "0", "stdev": "0" }, { "name": "word_freq_free", "index": "15", "type": "numeric", "distinct": "253", "missing": "0", "min": "0", "max": "20", "mean": "0", "stdev": "1" }, { "name": "word_freq_business", "index": "16", "type": "numeric", "distinct": "197", "missing": "0", "min": "0", "max": "7", "mean": "0", "stdev": "0" }, { "name": "word_freq_email", "index": "17", "type": "numeric", "distinct": "229", "missing": "0", "min": "0", "max": "9", "mean": "0", "stdev": "1" }, { "name": "word_freq_you", "index": "18", "type": "numeric", "distinct": "575", "missing": "0", "min": "0", "max": "19", "mean": "2", "stdev": "2" }, { "name": "word_freq_credit", "index": "19", "type": "numeric", "distinct": "148", "missing": "0", "min": "0", "max": "18", "mean": "0", "stdev": "1" }, { "name": "word_freq_your", "index": "20", "type": "numeric", "distinct": "401", "missing": "0", "min": "0", "max": "11", "mean": "1", "stdev": "1" }, { "name": "word_freq_font", "index": "21", "type": "numeric", "distinct": "99", "missing": "0", "min": "0", "max": "17", "mean": "0", "stdev": "1" }, { "name": "word_freq_000", "index": "22", "type": "numeric", "distinct": "164", "missing": "0", "min": "0", "max": "5", "mean": "0", "stdev": "0" }, { "name": "word_freq_money", "index": "23", "type": "numeric", "distinct": "143", "missing": "0", "min": "0", "max": "13", "mean": "0", "stdev": "0" }, { "name": "word_freq_hp", "index": "24", "type": "numeric", "distinct": "395", "missing": "0", "min": "0", "max": "21", "mean": "1", "stdev": "2" }, { "name": "word_freq_hpl", "index": "25", "type": "numeric", "distinct": "281", "missing": "0", "min": "0", "max": "17", "mean": "0", "stdev": "1" }, { "name": "word_freq_george", "index": "26", "type": "numeric", "distinct": "240", "missing": "0", "min": "0", "max": "33", 
"mean": "1", "stdev": "3" }, { "name": "word_freq_650", "index": "27", "type": "numeric", "distinct": "200", "missing": "0", "min": "0", "max": "9", "mean": "0", "stdev": "1" }, { "name": "word_freq_lab", "index": "28", "type": "numeric", "distinct": "156", "missing": "0", "min": "0", "max": "14", "mean": "0", "stdev": "1" }, { "name": "word_freq_labs", "index": "29", "type": "numeric", "distinct": "179", "missing": "0", "min": "0", "max": "6", "mean": "0", "stdev": "0" }, { "name": "word_freq_telnet", "index": "30", "type": "numeric", "distinct": "128", "missing": "0", "min": "0", "max": "13", "mean": "0", "stdev": "0" }, { "name": "word_freq_857", "index": "31", "type": "numeric", "distinct": "106", "missing": "0", "min": "0", "max": "5", "mean": "0", "stdev": "0" }, { "name": "word_freq_data", "index": "32", "type": "numeric", "distinct": "184", "missing": "0", "min": "0", "max": "18", "mean": "0", "stdev": "1" }, { "name": "word_freq_415", "index": "33", "type": "numeric", "distinct": "110", "missing": "0", "min": "0", "max": "5", "mean": "0", "stdev": "0" }, { "name": "word_freq_85", "index": "34", "type": "numeric", "distinct": "177", "missing": "0", "min": "0", "max": "20", "mean": "0", "stdev": "1" }, { "name": "word_freq_technology", "index": "35", "type": "numeric", "distinct": "159", "missing": "0", "min": "0", "max": "8", "mean": "0", "stdev": "0" }, { "name": "word_freq_1999", "index": "36", "type": "numeric", "distinct": "188", "missing": "0", "min": "0", "max": "7", "mean": "0", "stdev": "0" }, { "name": "word_freq_parts", "index": "37", "type": "numeric", "distinct": "53", "missing": "0", "min": "0", "max": "8", "mean": "0", "stdev": "0" }, { "name": "word_freq_pm", "index": "38", "type": "numeric", "distinct": "163", "missing": "0", "min": "0", "max": "11", "mean": "0", "stdev": "0" }, { "name": "word_freq_direct", "index": "39", "type": "numeric", "distinct": "125", "missing": "0", "min": "0", "max": "5", "mean": "0", "stdev": "0" }, { "name": 
"word_freq_cs", "index": "40", "type": "numeric", "distinct": "108", "missing": "0", "min": "0", "max": "7", "mean": "0", "stdev": "0" }, { "name": "word_freq_meeting", "index": "41", "type": "numeric", "distinct": "186", "missing": "0", "min": "0", "max": "14", "mean": "0", "stdev": "1" }, { "name": "word_freq_original", "index": "42", "type": "numeric", "distinct": "136", "missing": "0", "min": "0", "max": "4", "mean": "0", "stdev": "0" }, { "name": "word_freq_project", "index": "43", "type": "numeric", "distinct": "160", "missing": "0", "min": "0", "max": "20", "mean": "0", "stdev": "1" }, { "name": "word_freq_re", "index": "44", "type": "numeric", "distinct": "230", "missing": "0", "min": "0", "max": "21", "mean": "0", "stdev": "1" }, { "name": "word_freq_edu", "index": "45", "type": "numeric", "distinct": "227", "missing": "0", "min": "0", "max": "22", "mean": "0", "stdev": "1" }, { "name": "word_freq_table", "index": "46", "type": "numeric", "distinct": "38", "missing": "0", "min": "0", "max": "2", "mean": "0", "stdev": "0" }, { "name": "word_freq_conference", "index": "47", "type": "numeric", "distinct": "106", "missing": "0", "min": "0", "max": "10", "mean": "0", "stdev": "0" }, { "name": "char_freq_%3B", "index": "48", "type": "numeric", "distinct": "313", "missing": "0", "min": "0", "max": "4", "mean": "0", "stdev": "0" }, { "name": "char_freq_%28", "index": "49", "type": "numeric", "distinct": "641", "missing": "0", "min": "0", "max": "10", "mean": "0", "stdev": "0" }, { "name": "char_freq_%5B", "index": "50", "type": "numeric", "distinct": "225", "missing": "0", "min": "0", "max": "4", "mean": "0", "stdev": "0" }, { "name": "char_freq_%21", "index": "51", "type": "numeric", "distinct": "964", "missing": "0", "min": "0", "max": "32", "mean": "0", "stdev": "1" }, { "name": "char_freq_%24", "index": "52", "type": "numeric", "distinct": "504", "missing": "0", "min": "0", "max": "6", "mean": "0", "stdev": "0" }, { "name": "char_freq_%23", "index": "53", 
"type": "numeric", "distinct": "316", "missing": "0", "min": "0", "max": "20", "mean": "0", "stdev": "0" }, { "name": "capital_run_length_average", "index": "54", "type": "numeric", "distinct": "2161", "missing": "0", "min": "1", "max": "1103", "mean": "5", "stdev": "32" }, { "name": "capital_run_length_longest", "index": "55", "type": "numeric", "distinct": "271", "missing": "0", "min": "1", "max": "9989", "mean": "52", "stdev": "195" }, { "name": "capital_run_length_total", "index": "56", "type": "numeric", "distinct": "919", "missing": "0", "min": "1", "max": "15841", "mean": "283", "stdev": "606" } ], "nr_of_issues": 0, "nr_of_downvotes": 0, "nr_of_likes": 0, "nr_of_downloads": 0, "total_downloads": 0, "reach": 0, "reuse": 0, "impact_of_reuse": 0, "reach_of_reuse": 0, "impact": 0 }