OpenML



spambase

active ARFF Publicly available Visibility: public Uploaded 06-04-2014 by Jan van Rijn




Author: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt

Source: [UCI](https://archive.ics.uci.edu/ml/datasets/spambase)

Please cite: [UCI](https://archive.ics.uci.edu/ml/citation_policy.html)

SPAM E-mail Database

The "spam" concept is diverse: advertisements for products/websites, make money fast schemes, chain letters, pornography... Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. To build a general-purpose spam filter, one would either have to blind such non-spam indicators or gather a very wide collection of non-spam.

For background on spam: Cranor, Lorrie F., LaMacchia, Brian A. Spam! Communications of the ACM, 41(8):74-83, 1998.

### Attribute Information:

The last column denotes whether the e-mail was considered spam, i.e. unsolicited commercial e-mail (1), or not (0). Most of the attributes indicate whether a particular word or character was frequently occurring in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters. For the statistical measures of each attribute, see the end of this file.

Here are the definitions of the attributes:

- 48 continuous real [0,100] attributes of type word_freq_WORD = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.
- 6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail.
- 1 continuous real [1,...] attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters.
- 1 continuous integer [1,...] attribute of type capital_run_length_longest = length of longest uninterrupted sequence of capital letters.
- 1 continuous integer [1,...] attribute of type capital_run_length_total = sum of length of uninterrupted sequences of capital letters = total number of capital letters in the e-mail.
- 1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
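
The attribute definitions above can be sketched in a few lines of Python. This is only an illustration, not the code used to build the dataset: the function names are hypothetical, matching is assumed to be case-insensitive over ASCII alphanumerics, and returning zeros for a text without capital letters is an assumption (the description gives a lower bound of 1 for the run-length attributes).

```python
import re
from typing import Tuple

def word_freq(text: str, word: str) -> float:
    """100 * (occurrences of `word`) / (total words); case-insensitive by assumption."""
    # A "word" is any run of alphanumeric characters (ASCII only in this sketch).
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.lower() == word.lower())
    return 100.0 * hits / len(words)

def char_freq(text: str, char: str) -> float:
    """100 * (occurrences of `char`) / (total characters)."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)

def capital_runs(text: str) -> Tuple[float, int, int]:
    """Return (average, longest, total) length of uninterrupted capital-letter runs."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return (0.0, 0, 0)  # assumption: no capitals yields zeros
    return (sum(runs) / len(runs), max(runs), sum(runs))

# Made-up sample text, not taken from the dataset.
sample = "FREE money!!! Call george at 650 NOW"
george_pct = word_freq(sample, "george")   # 1 of 7 words
bang_pct = char_freq(sample, "!")          # 3 of 36 characters
run_stats = capital_runs(sample)           # runs: FREE, C, NOW
```

On the sample text, the capital runs are `FREE`, `C`, and `NOW`, so the total (8) equals the number of capital letters, as the definition of capital_run_length_total states.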

58 features

class (target)	nominal	2 unique values 0 missing
word_freq_make	numeric	142 unique values 0 missing
word_freq_address	numeric	171 unique values 0 missing
word_freq_all	numeric	214 unique values 0 missing
word_freq_3d	numeric	43 unique values 0 missing
word_freq_our	numeric	255 unique values 0 missing
word_freq_over	numeric	141 unique values 0 missing
word_freq_remove	numeric	173 unique values 0 missing
word_freq_internet	numeric	170 unique values 0 missing
word_freq_order	numeric	144 unique values 0 missing
word_freq_mail	numeric	245 unique values 0 missing
word_freq_receive	numeric	113 unique values 0 missing
word_freq_will	numeric	316 unique values 0 missing
word_freq_people	numeric	158 unique values 0 missing
word_freq_report	numeric	133 unique values 0 missing
word_freq_addresses	numeric	118 unique values 0 missing
word_freq_free	numeric	253 unique values 0 missing
word_freq_business	numeric	197 unique values 0 missing
word_freq_email	numeric	229 unique values 0 missing
word_freq_you	numeric	575 unique values 0 missing
word_freq_credit	numeric	148 unique values 0 missing
word_freq_your	numeric	401 unique values 0 missing
word_freq_font	numeric	99 unique values 0 missing
word_freq_000	numeric	164 unique values 0 missing
word_freq_money	numeric	143 unique values 0 missing
word_freq_hp	numeric	395 unique values 0 missing
word_freq_hpl	numeric	281 unique values 0 missing
word_freq_george	numeric	240 unique values 0 missing
word_freq_650	numeric	200 unique values 0 missing
word_freq_lab	numeric	156 unique values 0 missing
word_freq_labs	numeric	179 unique values 0 missing
word_freq_telnet	numeric	128 unique values 0 missing
word_freq_857	numeric	106 unique values 0 missing
word_freq_data	numeric	184 unique values 0 missing
word_freq_415	numeric	110 unique values 0 missing
word_freq_85	numeric	177 unique values 0 missing
word_freq_technology	numeric	159 unique values 0 missing
word_freq_1999	numeric	188 unique values 0 missing
word_freq_parts	numeric	53 unique values 0 missing
word_freq_pm	numeric	163 unique values 0 missing
word_freq_direct	numeric	125 unique values 0 missing
word_freq_cs	numeric	108 unique values 0 missing
word_freq_meeting	numeric	186 unique values 0 missing
word_freq_original	numeric	136 unique values 0 missing
word_freq_project	numeric	160 unique values 0 missing
word_freq_re	numeric	230 unique values 0 missing
word_freq_edu	numeric	227 unique values 0 missing
word_freq_table	numeric	38 unique values 0 missing
word_freq_conference	numeric	106 unique values 0 missing
char_freq_%3B	numeric	313 unique values 0 missing
char_freq_%28	numeric	641 unique values 0 missing
char_freq_%5B	numeric	225 unique values 0 missing
char_freq_%21	numeric	964 unique values 0 missing
char_freq_%24	numeric	504 unique values 0 missing
char_freq_%23	numeric	316 unique values 0 missing
capital_run_length_average	numeric	2161 unique values 0 missing
capital_run_length_longest	numeric	271 unique values 0 missing
capital_run_length_total	numeric	919 unique values 0 missing
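
The six char_freq feature names above are percent-encoded. A minimal sketch for recovering the characters they stand for, using only the Python standard library (the list is copied from the feature table; the variable names are illustrative):

```python
from urllib.parse import unquote

# Feature names exactly as listed on this page.
encoded_names = [
    "char_freq_%3B",
    "char_freq_%28",
    "char_freq_%5B",
    "char_freq_%21",
    "char_freq_%24",
    "char_freq_%23",
]

# unquote() turns each %XX escape back into the character it encodes.
decoded_names = [unquote(name) for name in encoded_names]

for encoded, decoded in zip(encoded_names, decoded_names):
    print(f"{encoded} -> {decoded}")
```

The decoded characters are `;`, `(`, `[`, `!`, `$`, and `#`, in that order.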


107 properties

NumberOfInstances

4601

Number of instances (rows) of the dataset.

NumberOfFeatures

58

Number of attributes (columns) of the dataset.

NumberOfClasses

2

Number of distinct values of the target attribute (if it is nominal).

NumberOfMissingValues

0

Number of missing values in the dataset.

NumberOfInstancesWithMissingValues

0

Number of instances with at least one value missing.

NumberOfNumericFeatures

57

Number of numeric attributes.

NumberOfSymbolicFeatures

1

Number of nominal attributes.

MaxStdDevOfNumericAtts

606.35

Maximum standard deviation of attributes of the numeric type.

MinorityClassPercentage

39.4

Percentage of instances belonging to the least frequent class.

PercentageOfNumericFeatures

98.28

Percentage of numeric attributes.

Quartile3MeansOfNumericAtts

0.24

Third quartile of means among attributes of the numeric type.

CfsSubsetEval_DecisionStumpAUC

0.94

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W
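
For readers unfamiliar with the AUC figures reported for the landmarkers, here is a minimal sketch of the pairwise definition: the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as one half. It is not Weka's implementation, and the labels and scores are made up for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (quadratic time, fine for a sketch)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative instance")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Hypothetical predictions, not taken from any run on this dataset.
example_labels = [1, 1, 0, 1, 0, 0]
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
example_auc = auc(example_labels, example_scores)  # 8 of 9 pairs ordered correctly
```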

RandomTreeDepth2AUC

0.89

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2

J48.00001.ErrRate

0.08

Error rate achieved by the landmarker weka.classifiers.trees.J48 -C .00001

MeanAttributeEntropy

Average entropy of the attributes.

MinorityClassSize

1813

Number of instances belonging to the least frequent class.

PercentageOfSymbolicFeatures

1.72

Percentage of nominal attributes.

Quartile3MutualInformation

Third quartile of mutual information between the nominal attributes and the target attribute.

CfsSubsetEval_DecisionStumpErrRate

0.09

Error rate achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

RandomTreeDepth2ErrRate

0.1

Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2

J48.00001.Kappa

0.82

Kappa coefficient achieved by the landmarker weka.classifiers.trees.J48 -C .00001
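
The Kappa values on this page measure agreement between predictions and true labels beyond what chance would produce. A minimal sketch of Cohen's kappa computed from a confusion matrix follows; the matrix is hypothetical and is not the result of any landmarker listed here.

```python
def cohen_kappa(confusion):
    """Cohen's kappa for a square confusion matrix (rows: true class, columns: predicted)."""
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    # Chance agreement: product of the marginal frequencies, summed over classes.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    # Note: undefined (division by zero) when expected agreement is exactly 1.
    return (observed - expected) / (1.0 - expected)

# Hypothetical 2x2 confusion matrix for illustration only.
example_confusion = [
    [45, 5],
    [10, 40],
]
example_kappa = cohen_kappa(example_confusion)  # observed 0.85, expected 0.50
```

With 85% observed agreement and 50% chance agreement, the example yields a kappa of about 0.7.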

MeanKurtosisOfNumericAtts

241.17

Mean kurtosis among attributes of the numeric type.

NaiveBayesAUC

0.94

Area Under the ROC Curve achieved by the landmarker weka.classifiers.bayes.NaiveBayes

Quartile1AttributeEntropy

First quartile of entropy among attributes.

Quartile3SkewnessOfNumericAtts

13.65

Third quartile of skewness among attributes of the numeric type.

CfsSubsetEval_DecisionStumpKappa

0.82

Kappa coefficient achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

RandomTreeDepth2Kappa

0.78

Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2

J48.0001.AUC

0.92

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .0001

MeanMeansOfNumericAtts

6.15

Mean of means among attributes of the numeric type.

NaiveBayesErrRate

0.2

Error rate achieved by the landmarker weka.classifiers.bayes.NaiveBayes

Quartile1KurtosisOfNumericAtts

50.66

First quartile of kurtosis among attributes of the numeric type.

Quartile3StdDevOfNumericAtts

0.84

Third quartile of standard deviation of attributes of the numeric type.

CfsSubsetEval_NaiveBayesAUC

0.94

Area Under the ROC Curve achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

RandomTreeDepth3AUC

0.89

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3

J48.0001.ErrRate

0.08

Error rate achieved by the landmarker weka.classifiers.trees.J48 -C .0001

MeanMutualInformation

Average mutual information between the nominal attributes and the target attribute.

NaiveBayesKappa

0.61

Kappa coefficient achieved by the landmarker weka.classifiers.bayes.NaiveBayes

Quartile1MeansOfNumericAtts

0.06

First quartile of means among attributes of the numeric type.

REPTreeDepth1AUC

0.94

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 1

CfsSubsetEval_NaiveBayesErrRate

0.09

Error rate achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

RandomTreeDepth3ErrRate

0.1

Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3

J48.0001.Kappa

0.82

Kappa coefficient achieved by the landmarker weka.classifiers.trees.J48 -C .0001

MeanNoiseToSignalRatio

An estimate of the amount of irrelevant information in the attributes regarding the class. Equals (MeanAttributeEntropy - MeanMutualInformation) divided by MeanMutualInformation.

NumberOfBinaryFeatures

1

Number of binary attributes.

Quartile1MutualInformation

First quartile of mutual information between the nominal attributes and the target attribute.

REPTreeDepth1ErrRate

0.1

Error rate achieved by the landmarker weka.classifiers.trees.REPTree -L 1

CfsSubsetEval_NaiveBayesKappa

0.82

Kappa coefficient achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

RandomTreeDepth3Kappa

0.78

Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3

J48.001.AUC

0.92

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .001

MeanNominalAttDistinctValues

Average number of distinct values among the attributes of the nominal type.

Quartile1SkewnessOfNumericAtts

5.85

First quartile of skewness among attributes of the numeric type.

REPTreeDepth1Kappa

0.78

Kappa coefficient achieved by the landmarker weka.classifiers.trees.REPTree -L 1

CfsSubsetEval_kNN1NAUC

0.94

Area Under the ROC Curve achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

StdvNominalAttDistinctValues

Standard deviation of the number of distinct values among attributes of the nominal type.

J48.001.ErrRate

0.08

Error rate achieved by the landmarker weka.classifiers.trees.J48 -C .001

MeanSkewnessOfNumericAtts

11.19

Mean skewness among attributes of the numeric type.

Quartile1StdDevOfNumericAtts

0.32

First quartile of standard deviation of attributes of the numeric type.

REPTreeDepth2AUC

0.94

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 2

CfsSubsetEval_kNN1NErrRate

0.09

Error rate achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

kNN1NAUC

0.89

Area Under the ROC Curve achieved by the landmarker weka.classifiers.lazy.IBk

J48.001.Kappa

0.82

Kappa coefficient achieved by the landmarker weka.classifiers.trees.J48 -C .001

MeanStdDevOfNumericAtts

15.19

Mean standard deviation of attributes of the numeric type.

Quartile2AttributeEntropy

Second quartile (Median) of entropy among attributes.

REPTreeDepth2ErrRate

0.1

Error rate achieved by the landmarker weka.classifiers.trees.REPTree -L 2

CfsSubsetEval_kNN1NKappa

0.82

Kappa coefficient achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

kNN1NErrRate

0.11

Error rate achieved by the landmarker weka.classifiers.lazy.IBk

MajorityClassPercentage

60.6

Percentage of instances belonging to the most frequent class.

MinAttributeEntropy

Minimal entropy among attributes.

Quartile2KurtosisOfNumericAtts

127.38

Second quartile (Median) of kurtosis among attributes of the numeric type.

REPTreeDepth2Kappa

0.78

Kappa coefficient achieved by the landmarker weka.classifiers.trees.REPTree -L 2

ClassEntropy

0.97

Entropy of the target attribute values.

kNN1NKappa

0.78

Kappa coefficient achieved by the landmarker weka.classifiers.lazy.IBk

MajorityClassSize

2788

Number of instances belonging to the most frequent class.
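
The class-distribution properties on this page (MajorityClassSize, MinorityClassSize, the two percentages, and ClassEntropy) can be cross-checked from the two class counts alone. The sketch below assumes ClassEntropy is the Shannon entropy in bits, which is consistent with the reported 0.97.

```python
import math

# Class sizes as reported on this page.
class_sizes = {"majority": 2788, "minority": 1813}

total_instances = sum(class_sizes.values())

# Percentage of instances in each class.
class_percentages = {
    name: 100.0 * size / total_instances for name, size in class_sizes.items()
}

# Shannon entropy of the class distribution, in bits (assumed convention).
class_entropy = -sum(
    (size / total_instances) * math.log2(size / total_instances)
    for size in class_sizes.values()
)

print(total_instances)
print({name: round(pct, 1) for name, pct in class_percentages.items()})
print(round(class_entropy, 2))
```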

MinKurtosisOfNumericAtts

5.26

Minimum kurtosis among attributes of the numeric type.

Quartile2MeansOfNumericAtts

0.1

Second quartile (Median) of means among attributes of the numeric type.

REPTreeDepth3AUC

0.94

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 3

DecisionStumpAUC

0.79

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.DecisionStump

MaxAttributeEntropy

Maximum entropy among attributes.

MinMeansOfNumericAtts

0.01

Minimum of means among attributes of the numeric type.

Quartile2MutualInformation

Second quartile (Median) of mutual information between the nominal attributes and the target attribute.

REPTreeDepth3ErrRate

0.1

Error rate achieved by the landmarker weka.classifiers.trees.REPTree -L 3

DecisionStumpErrRate

0.21

Error rate achieved by the landmarker weka.classifiers.trees.DecisionStump

MaxKurtosisOfNumericAtts

1480.64

Maximum kurtosis among attributes of the numeric type.

MinMutualInformation

Minimal mutual information between the nominal attributes and the target attribute.

Quartile2SkewnessOfNumericAtts

9.72

Second quartile (Median) of skewness among attributes of the numeric type.

REPTreeDepth3Kappa

0.78

Kappa coefficient achieved by the landmarker weka.classifiers.trees.REPTree -L 3

DecisionStumpKappa

0.55

Kappa coefficient achieved by the landmarker weka.classifiers.trees.DecisionStump

MaxMeansOfNumericAtts

283.29

Maximum of means among attributes of the numeric type.

MinNominalAttDistinctValues

The minimal number of distinct values among attributes of the nominal type.

PercentageOfBinaryFeatures

1.72

Percentage of binary attributes.

Quartile2StdDevOfNumericAtts

0.44

Second quartile (Median) of standard deviation of attributes of the numeric type.

RandomTreeDepth1AUC

0.89

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1

Dimensionality

0.01

Number of attributes divided by the number of instances.
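
As a quick consistency check, Dimensionality and the feature-type percentages follow directly from counts shown on this page: 58 features (57 numeric, 1 nominal) and 4601 instances. Variable names in this sketch are illustrative.

```python
# Counts taken from the feature table and NumberOfInstances on this page.
n_features = 58
n_numeric = 57
n_nominal = 1
n_instances = 4601

# Dimensionality: attributes per instance.
dimensionality = n_features / n_instances

# Share of each feature type, in percent.
pct_numeric = 100.0 * n_numeric / n_features
pct_nominal = 100.0 * n_nominal / n_features

print(round(dimensionality, 2), round(pct_numeric, 2), round(pct_nominal, 2))
```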

MaxMutualInformation

Maximum mutual information between the nominal attributes and the target attribute.

MinSkewnessOfNumericAtts

1.59

Minimum skewness among attributes of the numeric type.

PercentageOfInstancesWithMissingValues

0

Percentage of instances having missing values.

Quartile3AttributeEntropy

Third quartile of entropy among attributes.

RandomTreeDepth1ErrRate

0.1

Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1

EquivalentNumberOfAtts

Number of attributes needed to optimally describe the class (under the assumption of independence among attributes). Equals ClassEntropy divided by MeanMutualInformation.
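
The two ratio-based properties, EquivalentNumberOfAtts and MeanNoiseToSignalRatio, are defined on this page purely in terms of other properties. A minimal sketch of both formulas follows. The page lists no values for MeanAttributeEntropy or MeanMutualInformation, so the inputs below other than ClassEntropy are hypothetical placeholders.

```python
def equivalent_number_of_atts(class_entropy: float, mean_mutual_information: float) -> float:
    """ClassEntropy divided by MeanMutualInformation."""
    return class_entropy / mean_mutual_information

def mean_noise_to_signal_ratio(mean_attribute_entropy: float, mean_mutual_information: float) -> float:
    """(MeanAttributeEntropy - MeanMutualInformation) divided by MeanMutualInformation."""
    return (mean_attribute_entropy - mean_mutual_information) / mean_mutual_information

reported_class_entropy = 0.97          # from this page
hypothetical_mean_mi = 0.05            # placeholder, not a reported value
hypothetical_mean_attr_entropy = 0.55  # placeholder, not a reported value

example_ena = equivalent_number_of_atts(reported_class_entropy, hypothetical_mean_mi)
example_nsr = mean_noise_to_signal_ratio(hypothetical_mean_attr_entropy, hypothetical_mean_mi)
```

Both functions divide by MeanMutualInformation, so neither is defined when that quantity is zero or unavailable.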

MaxNominalAttDistinctValues

The maximum number of distinct values among attributes of the nominal type.

MinStdDevOfNumericAtts

0.08

Minimum standard deviation of attributes of the numeric type.

PercentageOfMissingValues

0

Percentage of missing values.

Quartile3KurtosisOfNumericAtts

299.07

Third quartile of kurtosis among attributes of the numeric type.

AutoCorrelation

Average class difference between consecutive instances.

RandomTreeDepth1Kappa

0.78

Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1

J48.00001.AUC

0.92

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .00001

MaxSkewnessOfNumericAtts

31.06

Maximum skewness among attributes of the numeric type.


39 tasks

Supervised Classification on spambase

99484 runs - estimation_procedure: 10-fold Crossvalidation - target_feature: class
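
To illustrate the 10-fold cross-validation procedure named in these tasks, the sketch below assigns instance indices to 10 folds while keeping class proportions roughly equal in each fold. Stratification, the seed, and the function name are assumptions; the actual splits for a task are defined by OpenML and should be downloaded from there rather than regenerated.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Assign each index to one of k folds, spreading every class evenly across folds."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

# Placeholder label vector with the class sizes reported on this page.
placeholder_labels = [0] * 2788 + [1] * 1813
cv_folds = stratified_folds(placeholder_labels)

# Each fold serves once as the test set; the other nine form the training set.
fold_sizes = [len(fold) for fold in cv_folds]
```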

Supervised Classification on spambase

58350 runs - estimation_procedure: 10-fold Crossvalidation - evaluation_measure: precision - target_feature: class

Supervised Classification on spambase

367 runs - estimation_procedure: 5 times 2-fold Crossvalidation - evaluation_measure: predictive_accuracy - target_feature: class

Supervised Classification on spambase

367 runs - estimation_procedure: 33% Holdout set - evaluation_measure: predictive_accuracy - target_feature: class

Supervised Classification on spambase

215 runs - estimation_procedure: 10 times 10-fold Crossvalidation - evaluation_measure: predictive_accuracy - target_feature: class

Supervised Classification on spambase

1 run - estimation_procedure: 5 times 2-fold Crossvalidation - target_feature: class

Supervised Classification on spambase

0 runs - estimation_procedure: 33% Holdout set - target_feature: class

Supervised Classification on spambase

0 runs - estimation_procedure: 4-fold Crossvalidation - target_feature: class

Learning Curve on spambase

374 runs - estimation_procedure: 10-fold Learning Curve - evaluation_measure: predictive_accuracy - target_feature: class

Learning Curve on spambase

207 runs - estimation_procedure: 10 times 10-fold Learning Curve - evaluation_measure: predictive_accuracy - target_feature: class

Learning Curve on spambase

0 runs - estimation_procedure: 10-fold Learning Curve - target_feature: class

Learning Curve on spambase

0 runs - estimation_procedure: 10-fold Learning Curve - target_feature: class

Learning Curve on spambase

0 runs - estimation_procedure: 10-fold Learning Curve - target_feature: class

Learning Curve on spambase

0 runs - estimation_procedure: 10-fold Learning Curve - target_feature: class

Learning Curve on spambase

0 runs - estimation_procedure: 10-fold Learning Curve - target_feature: class

Learning Curve on spambase

0 runs - estimation_procedure: 10-fold Learning Curve - target_feature: class

Learning Curve on spambase

0 runs - estimation_procedure: 10-fold Learning Curve - target_feature: class

Learning Curve on spambase

0 runs - estimation_procedure: 10-fold Learning Curve - target_feature: class

Supervised Data Stream Classification on spambase

25 runs - estimation_procedure: Interleaved Test then Train - target_feature: class
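
The "Interleaved Test then Train" procedure evaluates a stream learner by first predicting each incoming instance and only then using it for training. A minimal sketch with a trivial majority-class learner follows; the learner, the function names, and the toy stream are all illustrative and unrelated to the runs recorded for this task.

```python
from collections import Counter

class MajorityClassLearner:
    """Toy online learner that always predicts the most frequent label seen so far."""

    def __init__(self):
        self.label_counts = Counter()

    def predict(self, features):
        if not self.label_counts:
            return None  # nothing learned yet
        return self.label_counts.most_common(1)[0][0]

    def learn(self, features, label):
        self.label_counts[label] += 1

def interleaved_test_then_train(stream, learner):
    """Return accuracy where every instance is tested before it is used for training."""
    correct = 0
    seen = 0
    for features, label in stream:
        if learner.predict(features) == label:  # test first
            correct += 1
        learner.learn(features, label)          # then train
        seen += 1
    return correct / seen if seen else 0.0

# Made-up stream of (features, label) pairs; features are unused by this toy learner.
toy_stream = [(None, 0), (None, 0), (None, 1), (None, 0), (None, 1), (None, 1)]
stream_accuracy = interleaved_test_then_train(toy_stream, MajorityClassLearner())
```

Because every instance is scored before the learner sees its label, no separate hold-out set is needed.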

Clustering on spambase