nomao

Status: active. Format: ARFF. Publicly available. Visibility: public. Uploaded 25-05-2015 by Rafael Gomes Mantovani.
  • Chemistry Life Science OpenML-CC18 OpenML100 study_123 study_135 study_14 study_144 study_218 study_99 study_271 study_240 study_379 study_226 study_275

Author: Nomao Labs
Source: [UCI](https://archive.ics.uci.edu/ml/datasets/Nomao)
Please cite: Laurent Candillier and Vincent Lemaire. Design and Analysis of the Nomao Challenge - Active Learning in the Real-World. In: Proceedings of the ALRA: Active Learning in Real-world Applications, Workshop ECML-PKDD 2012, Friday, September 28, 2012, Bristol, UK.

1. Data set title: Nomao Data Set

2. Abstract: Nomao collects data about places (name, phone, localization...) from many sources. Deduplication consists in detecting what data refer to the same place. Instances in the dataset compare 2 spots.

3. Data Set Characteristics:
- Univariate
- Area: Computer
- Attribute Characteristics: Real
- Associated Tasks: Classification
- Missing Values?: Yes

4. Source:
(a) Original owner of database (name / phone / snail address / email address)
Nomao / 00 33 5 62 48 33 90 / 1 avenue Jean Rieux, 31500 Toulouse / challenge '@' nomao.com
(b) Donor of database (name / phone / snail address / email address)
Laurent Candillier / - / 1 avenue Jean Rieux, 31500 Toulouse / laurent '@' nomao.com

5. Data Set Information: The dataset has been enriched during the Nomao Challenge, organized along with the ALRA workshop (Active Learning in Real-world Applications), held at the ECML-PKDD 2012 conference.

5.1. Number of Instances
34,465 instances, mix of continuous and nominal, labeled by human expert.
- First 29,104 instances have been labeled with "human prior". See the corresponding article described in section "3. Past Usage" for more details.
- Next 917 instances have been labeled using the active learning method called "marg".
- Next 964 instances refer to the active method called "wmarg".
- Next 995 instances refer to the active method called "wmarg5".
- Next 1,985 instances refer to the active method called "rand" (random selection).

Last instances have been labeled during the corresponding challenge. More details can be found in http://www.nomao.com/labs/challenge
- Next 163 instances refer to the active method called "baseline".
- Next 167 instances refer to the active method called "nomao".
- And last 170 instances refer to the active method called "tsun".

5.2. Number of Attributes
120 attributes: 89 continuous, 31 nominal (including the attributes 'label' and 'id'). The features are separated by comma.

5.3. Attribute Information:
Missing data are allowed, represented by question marks '?'. Labels are +1 if the concerned spots must be merged, -1 if they do not refer to the same entity.

1 id: name is composed of the names of the spots that are compared, separated by a sharp (#).
2 clean_name_intersect_min: continuous.
3 clean_name_intersect_max: continuous.
4 clean_name_levenshtein_sim: continuous.
5 clean_name_trigram_sim: continuous.
6 clean_name_levenshtein_term: continuous.
7 clean_name_trigram_term: continuous.
8 clean_name_including: n,s,m.
9 clean_name_equality: n,s,m.
10 city_intersect_min: continuous.
11 city_intersect_max: continuous.
12 city_levenshtein_sim: continuous.
13 city_trigram_sim: continuous.
14 city_levenshtein_term: continuous.
15 city_trigram_term: continuous.
16 city_including: n,s,m.
17 city_equality: n,s,m.
18 zip_intersect_min: continuous.
19 zip_intersect_max: continuous.
20 zip_levenshtein_sim: continuous.
21 zip_trigram_sim: continuous.
22 zip_levenshtein_term: continuous.
23 zip_trigram_term: continuous.
24 zip_including: n,s,m.
25 zip_equality: n,s,m.
26 street_intersect_min: continuous.
27 street_intersect_max: continuous.
28 street_levenshtein_sim: continuous.
29 street_trigram_sim: continuous.
30 street_levenshtein_term: continuous.
31 street_trigram_term: continuous.
32 street_including: n,s,m.
33 street_equality: n,s,m.
34 website_intersect_min: continuous.
35 website_intersect_max: continuous.
36 website_levenshtein_sim: continuous.
37 website_trigram_sim: continuous.
38 website_levenshtein_term: continuous.
39 website_trigram_term: continuous.
40 website_including: n,s,m.
41 website_equality: n,s,m.
42 countryname_intersect_min: continuous.
43 countryname_intersect_max: continuous.
44 countryname_levenshtein_sim: continuous.
45 countryname_trigram_sim: continuous.
46 countryname_levenshtein_term: continuous.
47 countryname_trigram_term: continuous.
48 countryname_including: n,s,m.
49 countryname_equality: n,s,m.
50 geocoderlocalityname_intersect_min: continuous.
51 geocoderlocalityname_intersect_max: continuous.
52 geocoderlocalityname_levenshtein_sim: continuous.
53 geocoderlocalityname_trigram_sim: continuous.
54 geocoderlocalityname_levenshtein_term: continuous.
55 geocoderlocalityname_trigram_term: continuous.
56 geocoderlocalityname_including: n,s,m.
57 geocoderlocalityname_equality: n,s,m.
58 geocoderinputaddress_intersect_min: continuous.
59 geocoderinputaddress_intersect_max: continuous.
60 geocoderinputaddress_levenshtein_sim: continuous.
61 geocoderinputaddress_trigram_sim: continuous.
62 geocoderinputaddress_levenshtein_term: continuous.
63 geocoderinputaddress_trigram_term: continuous.
64 geocoderinputaddress_including: n,s,m.
65 geocoderinputaddress_equality: n,s,m.
66 geocoderoutputaddress_intersect_min: continuous.
67 geocoderoutputaddress_intersect_max: continuous.
68 geocoderoutputaddress_levenshtein_sim: continuous.
69 geocoderoutputaddress_trigram_sim: continuous.
70 geocoderoutputaddress_levenshtein_term: continuous.
71 geocoderoutputaddress_trigram_term: continuous.
72 geocoderoutputaddress_including: n,s,m.
73 geocoderoutputaddress_equality: n,s,m.
74 geocoderpostalcodenumber_intersect_min: continuous.
75 geocoderpostalcodenumber_intersect_max: continuous.
76 geocoderpostalcodenumber_levenshtein_sim: continuous.
77 geocoderpostalcodenumber_trigram_sim: continuous.
78 geocoderpostalcodenumber_levenshtein_term: continuous.
79 geocoderpostalcodenumber_trigram_term: continuous.
80 geocoderpostalcodenumber_including: n,s,m.
81 geocoderpostalcodenumber_equality: n,s,m.
82 geocodercountrynamecode_intersect_min: continuous.
83 geocodercountrynamecode_intersect_max: continuous.
84 geocodercountrynamecode_levenshtein_sim: continuous.
85 geocodercountrynamecode_trigram_sim: continuous.
86 geocodercountrynamecode_levenshtein_term: continuous.
87 geocodercountrynamecode_trigram_term: continuous.
88 geocodercountrynamecode_including: n,s,m.
89 geocodercountrynamecode_equality: n,s,m.
90 phone_diff: continuous.
91 phone_levenshtein: continuous.
92 phone_trigram: continuous.
93 phone_equality: n,s,m.
94 fax_diff: continuous.
95 fax_levenshtein: continuous.
96 fax_trigram: continuous.
97 fax_equality: n,s,m.
98 street_number_diff: continuous.
99 street_number_levenshtein: continuous.
100 street_number_trigram: continuous.
101 street_number_equality: n,s,m.
102 geocode_coordinates_long_diff: continuous.
103 geocode_coordinates_long_levenshtein: continuous.
104 geocode_coordinates_long_trigram: continuous.
105 geocode_coordinates_long_equality: n,s,m.
106 geocode_coordinates_lat_diff: continuous.
107 geocode_coordinates_lat_levenshtein: continuous.
108 geocode_coordinates_lat_trigram: continuous.
109 geocode_coordinates_lat_equality: n,s,m.
110 coordinates_long_diff: continuous.
111 coordinates_long_levenshtein: continuous.
112 coordinates_long_trigram: continuous.
113 coordinates_long_equality: n,s,m.
114 coordinates_lat_diff: continuous.
115 coordinates_lat_levenshtein: continuous.
116 coordinates_lat_trigram: continuous.
117 coordinates_lat_equality: n,s,m.
118 geocode_coordinates_diff: continuous.
119 coordinates_diff: continuous.
120 label: +1,-1.

Relevant Papers: Laurent Candillier and Vincent Lemaire. Design and Analysis of the Nomao Challenge - Active Learning in the Real-World. In: Proceedings of the ALRA: Active Learning in Real-world Applications, Workshop ECML-PKDD 2012, Friday, September 28, 2012, Bristol, UK.
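
The labeling blocks listed under "Number of Instances" can be turned into row ranges with a few lines of Python. This is only a sketch: it assumes the rows of the file keep the order given in the description, which this page does not confirm, and the names `SEGMENTS` and `segment_ranges` are made up for illustration.

```python
from itertools import accumulate

# Block sizes in the order given in the description.
# Assumption: the rows of the file follow this same order.
SEGMENTS = [
    ("human prior", 29_104),
    ("marg", 917),
    ("wmarg", 964),
    ("wmarg5", 995),
    ("rand", 1_985),
    ("baseline", 163),
    ("nomao", 167),
    ("tsun", 170),
]


def segment_ranges(segments):
    """Return {name: (start, stop)} as 0-based, half-open row ranges."""
    stops = list(accumulate(size for _, size in segments))
    starts = [0] + stops[:-1]
    return {
        name: (start, stop)
        for (name, _), start, stop in zip(segments, starts, stops)
    }


ranges = segment_ranges(SEGMENTS)
total = sum(size for _, size in SEGMENTS)
```

The block sizes add up to the 34,465 instances stated in the description, so the eight blocks cover the whole dataset without overlap.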

119 features

Class (target): nominal, 2 unique values, 0 missing
V1: numeric, 27 unique values, 0 missing
V2: numeric, 43 unique values, 0 missing
V3: numeric, 3942 unique values, 0 missing
V4: numeric, 2207 unique values, 0 missing
V5: numeric, 759 unique values, 0 missing
V6: numeric, 929 unique values, 0 missing
V7: nominal, 2 unique values, 0 missing
V8: nominal, 2 unique values, 0 missing
V9: numeric, 8 unique values, 0 missing
V10: numeric, 11 unique values, 0 missing
V11: numeric, 240 unique values, 0 missing
V12: numeric, 136 unique values, 0 missing
V13: numeric, 120 unique values, 0 missing
V14: numeric, 137 unique values, 0 missing
V15: nominal, 3 unique values, 0 missing
V16: nominal, 3 unique values, 0 missing
V17: numeric, 4 unique values, 0 missing
V18: numeric, 4 unique values, 0 missing
V19: numeric, 20 unique values, 0 missing
V20: numeric, 23 unique values, 0 missing
V21: numeric, 23 unique values, 0 missing
V22: numeric, 28 unique values, 0 missing
V23: nominal, 3 unique values, 0 missing
V24: nominal, 3 unique values, 0 missing
V25: numeric, 23 unique values, 0 missing
V26: numeric, 43 unique values, 0 missing
V27: numeric, 2044 unique values, 0 missing
V28: numeric, 1082 unique values, 0 missing
V29: numeric, 541 unique values, 0 missing
V30: numeric, 653 unique values, 0 missing
V31: nominal, 3 unique values, 0 missing
V32: nominal, 3 unique values, 0 missing
V33: numeric, 33 unique values, 0 missing
V34: numeric, 44 unique values, 0 missing
V35: numeric, 356 unique values, 0 missing
V36: numeric, 242 unique values, 0 missing
V37: numeric, 268 unique values, 0 missing
V38: numeric, 339 unique values, 0 missing
V39: nominal, 3 unique values, 0 missing
V40: nominal, 3 unique values, 0 missing
V41: numeric, 3 unique values, 0 missing
V42: numeric, 3 unique values, 0 missing
V43: numeric, 34 unique values, 0 missing
V44: numeric, 27 unique values, 0 missing
V45: numeric, 28 unique values, 0 missing
V46: numeric, 26 unique values, 0 missing
V47: nominal, 3 unique values, 0 missing
V48: nominal, 3 unique values, 0 missing
V49: numeric, 7 unique values, 0 missing
V50: numeric, 7 unique values, 0 missing
V51: numeric, 183 unique values, 0 missing
V52: numeric, 92 unique values, 0 missing
V53: numeric, 92 unique values, 0 missing
V54: numeric, 90 unique values, 0 missing
V55: nominal, 3 unique values, 0 missing
V56: nominal, 3 unique values, 0 missing
V57: numeric, 51 unique values, 0 missing
V58: numeric, 79 unique values, 0 missing
V59: numeric, 6861 unique values, 0 missing
V60: numeric, 4771 unique values, 0 missing
V61: numeric, 1289 unique values, 0 missing
V62: numeric, 1690 unique values, 0 missing
V63: nominal, 3 unique values, 0 missing
V64: nominal, 3 unique values, 0 missing
V65: numeric, 39 unique values, 0 missing
V66: numeric, 72 unique values, 0 missing
V67: numeric, 3136 unique values, 0 missing
V68: numeric, 1852 unique values, 0 missing
V69: numeric, 956 unique values, 0 missing
V70: numeric, 1258 unique values, 0 missing
V71: nominal, 3 unique values, 0 missing
V72: nominal, 3 unique values, 0 missing
V73: numeric, 4 unique values, 0 missing
V74: numeric, 4 unique values, 0 missing
V75: numeric, 18 unique values, 0 missing
V76: numeric, 19 unique values, 0 missing
V77: numeric, 17 unique values, 0 missing
V78: numeric, 21 unique values, 0 missing
V79: nominal, 3 unique values, 0 missing
V80: nominal, 3 unique values, 0 missing
V81: numeric, 3 unique values, 0 missing
V82: numeric, 3 unique values, 0 missing
V83: numeric, 3 unique values, 0 missing
V84: numeric, 3 unique values, 0 missing
V85: numeric, 3 unique values, 0 missing
V86: numeric, 3 unique values, 0 missing
V87: nominal, 3 unique values, 0 missing
V88: nominal, 3 unique values, 0 missing
V89: numeric, 744 unique values, 0 missing
V90: numeric, 29 unique values, 0 missing
V91: numeric, 62 unique values, 0 missing
V92: nominal, 3 unique values, 0 missing
V93: numeric, 140 unique values, 0 missing
V94: numeric, 20 unique values, 0 missing
V95: numeric, 31 unique values, 0 missing
V96: nominal, 3 unique values, 0 missing
V97: numeric, 299 unique values, 0 missing
V98: numeric, 18 unique values, 0 missing
V99: numeric, 26 unique values, 0 missing
V100: nominal, 3 unique values, 0 missing
V101: numeric, 5802 unique values, 0 missing
V102: numeric, 46 unique values, 0 missing
V103: numeric, 79 unique values, 0 missing
V104: nominal, 3 unique values, 0 missing
V105: numeric, 5461 unique values, 0 missing
V106: numeric, 32 unique values, 0 missing
V107: numeric, 85 unique values, 0 missing
V108: nominal, 3 unique values, 0 missing
V109: numeric, 5095 unique values, 0 missing
V110: numeric, 67 unique values, 0 missing
V111: numeric, 102 unique values, 0 missing
V112: nominal, 3 unique values, 0 missing
V113: numeric, 4687 unique values, 0 missing
V114: numeric, 56 unique values, 0 missing
V115: numeric, 104 unique values, 0 missing
V116: nominal, 3 unique values, 0 missing
V117: numeric, 2039 unique values, 0 missing
V118: numeric, 1726 unique values, 0 missing
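
The anonymous names V1 to V118 appear to line up with attributes 2 to 119 of the description (the `id` attribute dropped, `label` exposed as `Class`). The positions of the nominal features in the list above are consistent with that reading, but the page does not state the mapping, so treat the sketch below as an assumption. The constant and function names are hypothetical.

```python
# Field groups and suffixes as listed in the attribute information.
# Assumption: V1..V118 follow the order of attributes 2..119 there.
STRING_FIELDS = [
    "clean_name", "city", "zip", "street", "website", "countryname",
    "geocoderlocalityname", "geocoderinputaddress", "geocoderoutputaddress",
    "geocoderpostalcodenumber", "geocodercountrynamecode",
]
STRING_SUFFIXES = [
    "intersect_min", "intersect_max", "levenshtein_sim", "trigram_sim",
    "levenshtein_term", "trigram_term", "including", "equality",
]
VALUE_FIELDS = [
    "phone", "fax", "street_number",
    "geocode_coordinates_long", "geocode_coordinates_lat",
    "coordinates_long", "coordinates_lat",
]
VALUE_SUFFIXES = ["diff", "levenshtein", "trigram", "equality"]
TAIL = ["geocode_coordinates_diff", "coordinates_diff"]


def original_names():
    """Rebuild the descriptive names of attributes 2..119 in order."""
    names = [f"{f}_{s}" for f in STRING_FIELDS for s in STRING_SUFFIXES]
    names += [f"{f}_{s}" for f in VALUE_FIELDS for s in VALUE_SUFFIXES]
    return names + TAIL


# Map V1..V118 to the descriptive names under the stated assumption.
V_TO_NAME = {f"V{i}": name for i, name in enumerate(original_names(), start=1)}

# Features described as "n,s,m" (nominal) in the attribute information.
NOMINAL = {
    v for v, name in V_TO_NAME.items()
    if name.endswith(("_including", "_equality"))
}
```

Under this mapping the 29 nominal features plus the nominal `Class` target give 30 nominal attributes, and the remaining 89 are numeric.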

107 properties

34465
Number of instances (rows) of the dataset.
119
Number of attributes (columns) of the dataset.
2
Number of distinct values of the target attribute (if it is nominal).
0
Number of missing values in the dataset.
0
Number of instances with at least one value missing.
89
Number of numeric attributes.
30
Number of nominal attributes.
0.87
Kappa coefficient achieved by the landmarker weka.classifiers.lazy.IBk
24621
Number of instances belonging to the most frequent class.
0.2
Minimal entropy among attributes.
9.35
Second quartile (Median) of kurtosis among attributes of the numeric type.
0.87
Kappa coefficient achieved by the landmarker weka.classifiers.trees.REPTree -L 2
0.86
Entropy of the target attribute values.
1.37
Maximum entropy among attributes.
-1.84
Minimum kurtosis among attributes of the numeric type.
0.81
Second quartile (Median) of means among attributes of the numeric type.
0.97
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 3
0.84
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.DecisionStump
4303.67
Maximum kurtosis among attributes of the numeric type.
0.4
Minimum of means among attributes of the numeric type.
0.04
Second quartile (Median) of mutual information between the nominal attributes and the target attribute.
0.05
Error rate achieved by the landmarker weka.classifiers.trees.REPTree -L 3
0.15
Error rate achieved by the landmarker weka.classifiers.trees.DecisionStump
1
Maximum of means among attributes of the numeric type.
0
Minimal mutual information between the nominal attributes and the target attribute.
-2.5
Second quartile (Median) of skewness among attributes of the numeric type.
0.87
Kappa coefficient achieved by the landmarker weka.classifiers.trees.REPTree -L 3
0.65
Kappa coefficient achieved by the landmarker weka.classifiers.trees.DecisionStump
0.26
Maximum mutual information between the nominal attributes and the target attribute.
2
The minimal number of distinct values among attributes of the nominal type.
2.52
Percentage of binary attributes.
0.18
Second quartile (Median) of standard deviation of attributes of the numeric type.
0.93
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1
0
Number of attributes divided by the number of instances.
3
The maximum number of distinct values among attributes of the nominal type.
-65.62
Minimum skewness among attributes of the numeric type.
0
Percentage of instances having missing values.
1.14
Third quartile of entropy among attributes.
0.06
Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1
15.43
Number of attributes needed to optimally describe the class (under the assumption of independence among attributes). Equals ClassEntropy divided by MeanMutualInformation.
1.96
Maximum skewness among attributes of the numeric type.
0.02
Minimum standard deviation of attributes of the numeric type.
0
Percentage of missing values.
27.39
Third quartile of kurtosis among attributes of the numeric type.
0.95
Average class difference between consecutive instances.
0.85
Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1
0.95
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .00001
0.05
Error rate achieved by the landmarker weka.classifiers.trees.J48 -C .00001
0.42
Maximum standard deviation of attributes of the numeric type.
28.56
Percentage of instances belonging to the least frequent class.
74.79
Percentage of numeric attributes.
0.93
Third quartile of means among attributes of the numeric type.
0.97
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W
0.93
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2
0.88
Kappa coefficient achieved by the landmarker weka.classifiers.trees.J48 -C .00001
0.84
Average entropy of the attributes.
9844
Number of instances belonging to the least frequent class.
25.21
Percentage of nominal attributes.
0.08
Third quartile of mutual information between the nominal attributes and the target attribute.
0.06
Error rate achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W
0.06
Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2
0.95
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .0001
306.44
Mean kurtosis among attributes of the numeric type.
0.9
Area Under the ROC Curve achieved by the landmarker weka.classifiers.bayes.NaiveBayes
0.43
First quartile of entropy among attributes.
-0.83
Third quartile of skewness among attributes of the numeric type.
0.86
Kappa coefficient achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W
0.85
Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2
0.05
Error rate achieved by the landmarker weka.classifiers.trees.J48 -C .0001
0.79
Mean of means among attributes of the numeric type.
0.16
Error rate achieved by the landmarker weka.classifiers.bayes.NaiveBayes
0.95
First quartile of kurtosis among attributes of the numeric type.
0.23
Third quartile of standard deviation of attributes of the numeric type.
0.97
Area Under the ROC Curve achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W
0.93
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3
0.06
Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3
0.88
Kappa coefficient achieved by the landmarker weka.classifiers.trees.J48 -C .0001
0.06
Average mutual information between the nominal attributes and the target attribute.
0.58
Kappa coefficient achieved by the landmarker weka.classifiers.bayes.NaiveBayes
0.66
First quartile of means among attributes of the numeric type.
0.97
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 1
0.06
Error rate achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W
0.85
Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3
0.95
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .001
14.09
An estimate of the amount of irrelevant information in the attributes regarding the class. Equals (MeanAttributeEntropy - MeanMutualInformation) divided by MeanMutualInformation.
3
Number of binary attributes.
0.01
First quartile of mutual information between the nominal attributes and the target attribute.
0.05
Error rate achieved by the landmarker weka.classifiers.trees.REPTree -L 1
0.86
Kappa coefficient achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W
0.31
Standard deviation of the number of distinct values among attributes of the nominal type.
0.05
Error rate achieved by the landmarker weka.classifiers.trees.J48 -C .001
2.9
Average number of distinct values among the attributes of the nominal type.
-4.39
First quartile of skewness among attributes of the numeric type.
0.87
Kappa coefficient achieved by the landmarker weka.classifiers.trees.REPTree -L 1
0.97
Area Under the ROC Curve achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W
0.94
Area Under the ROC Curve achieved by the landmarker weka.classifiers.lazy.IBk
0.88
Kappa coefficient achieved by the landmarker weka.classifiers.trees.J48 -C .001
-6.86
Mean skewness among attributes of the numeric type.
0.11
First quartile of standard deviation of attributes of the numeric type.
0.97
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 2
0.06
Error rate achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W
0.05
Error rate achieved by the landmarker weka.classifiers.lazy.IBk
71.44
Percentage of instances belonging to the most frequent class.
0.18
Mean standard deviation of attributes of the numeric type.
0.95
Second quartile (Median) of entropy among attributes.
0.05
Error rate achieved by the landmarker weka.classifiers.trees.REPTree -L 2
0.86
Kappa coefficient achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W
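
Several of the simpler properties above can be recomputed from the class and attribute counts alone. The sketch below does this for the class percentages, the percentage of numeric attributes, and the target entropy, and writes out the two ratio definitions quoted in the list. The base-2 logarithm is an assumption that happens to reproduce the listed entropy, and the function names are illustrative only.

```python
import math

# Counts taken from the property list.
N_MAJORITY = 24_621   # instances in the most frequent class
N_MINORITY = 9_844    # instances in the least frequent class
N_INSTANCES = 34_465
N_NUMERIC = 89
N_NOMINAL = 30


def entropy_bits(counts):
    """Shannon entropy in bits (base-2 logarithm assumed)."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def equivalent_number_of_atts(class_entropy, mean_mutual_information):
    """ClassEntropy divided by MeanMutualInformation."""
    return class_entropy / mean_mutual_information


def noise_to_signal_ratio(mean_attribute_entropy, mean_mutual_information):
    """(MeanAttributeEntropy - MeanMutualInformation) / MeanMutualInformation."""
    return (mean_attribute_entropy - mean_mutual_information) / mean_mutual_information


class_entropy = entropy_bits([N_MAJORITY, N_MINORITY])
majority_pct = 100 * N_MAJORITY / N_INSTANCES
minority_pct = 100 * N_MINORITY / N_INSTANCES
numeric_pct = 100 * N_NUMERIC / (N_NUMERIC + N_NOMINAL)
```

The two ratio properties depend on the unrounded mean mutual information, so plugging in the two-decimal values shown on this page only approximates the listed 15.43 and 14.09.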

30 tasks

32751 runs - estimation_procedure: 10-fold Crossvalidation - target_feature: Class
32334 runs - estimation_procedure: 10-fold Crossvalidation - target_feature: Class
2 runs - estimation_procedure: 33% Holdout set - target_feature: Class
1 run - estimation_procedure: 5 times 2-fold Crossvalidation - target_feature: Class
0 runs - estimation_procedure: 20% Holdout (Ordered) - target_feature: Class
0 runs - estimation_procedure: 10 times 10-fold Crossvalidation - target_feature: Class
0 runs - estimation_procedure: 33% Holdout set - evaluation_measure: predictive_accuracy - target_feature: Class
0 runs - estimation_procedure: 4-fold Crossvalidation - target_feature: Class
0 runs - estimation_procedure: 10-fold Crossvalidation - evaluation_measure: predictive_accuracy - target_feature: Class
0 runs - estimation_procedure: 10-fold Learning Curve - target_feature:
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - target_feature: Class
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
1308 runs - target_feature: Class
1308 runs - target_feature: Class
0 runs - target_feature: Class
0 runs - target_feature: Class
0 runs - target_feature: Class
0 runs - target_feature: Class
0 runs - target_feature: Class
0 runs - target_feature: Class