1486 nomao 1

**Author**: Nomao Labs
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Nomao)
**Please cite**: Laurent Candillier and Vincent Lemaire. Design and Analysis of the Nomao Challenge - Active Learning in the Real-World. In: Proceedings of the ALRA: Active Learning in Real-world Applications, Workshop ECML-PKDD 2012, Friday, September 28, 2012, Bristol, UK.

1. Data set title: Nomao Data Set

2. Abstract: Nomao collects data about places (name, phone, location...) from many sources. Deduplication consists of detecting which data refer to the same place. Each instance in the dataset compares two spots.

3. Data Set Characteristics:
- Univariate
- Area: Computer
- Attribute Characteristics: Real
- Associated Tasks: Classification
- Missing Values?: Yes

4. Source:
(a) Original owner of database (name / phone / snail address / email address)
Nomao / 00 33 5 62 48 33 90 / 1 avenue Jean Rieux, 31500 Toulouse / challenge '@' nomao.com
(b) Donor of database (name / phone / snail address / email address)
Laurent Candillier / - / 1 avenue Jean Rieux, 31500 Toulouse / laurent '@' nomao.com

5. Data Set Information: The dataset was enriched during the Nomao Challenge, organized along with the ALRA workshop (Active Learning in Real-world Applications) held at the ECML-PKDD 2012 conference.

5.1. Number of Instances
34,465 instances, mix of continuous and nominal attributes, labeled by human expert.
The first 29,104 instances were labeled with "human prior". See the article listed under "Relevant Papers" for more details.
The next 917 instances were labeled using the active learning method called "marg".
The next 964 instances refer to the active method called "wmarg".
The next 995 instances refer to the active method called "wmarg5".
The next 1,985 instances refer to the active method called "rand" (random selection).
The last instances were labeled during the corresponding challenge.
More details can be found at http://www.nomao.com/labs/challenge
The next 163 instances refer to the active method called "baseline".
The next 167 instances refer to the active method called "nomao".
The last 170 instances refer to the active method called "tsun".

5.2. Number of Attributes
120 attributes: 89 continuous, 31 nominal (including the attributes 'label' and 'id'). The features are separated by commas.

5.3. Attribute Information:
Missing data are allowed and are represented by question marks '?'. Labels are +1 if the concerned spots must be merged, -1 if they do not refer to the same entity.

1 id: name is composed of the names of the spots that are compared, separated by a hash sign (#).
2 clean_name_intersect_min: continuous.
3 clean_name_intersect_max: continuous.
4 clean_name_levenshtein_sim: continuous.
5 clean_name_trigram_sim: continuous.
6 clean_name_levenshtein_term: continuous.
7 clean_name_trigram_term: continuous.
8 clean_name_including: n,s,m.
9 clean_name_equality: n,s,m.
10 city_intersect_min: continuous.
11 city_intersect_max: continuous.
12 city_levenshtein_sim: continuous.
13 city_trigram_sim: continuous.
14 city_levenshtein_term: continuous.
15 city_trigram_term: continuous.
16 city_including: n,s,m.
17 city_equality: n,s,m.
18 zip_intersect_min: continuous.
19 zip_intersect_max: continuous.
20 zip_levenshtein_sim: continuous.
21 zip_trigram_sim: continuous.
22 zip_levenshtein_term: continuous.
23 zip_trigram_term: continuous.
24 zip_including: n,s,m.
25 zip_equality: n,s,m.
26 street_intersect_min: continuous.
27 street_intersect_max: continuous.
28 street_levenshtein_sim: continuous.
29 street_trigram_sim: continuous.
30 street_levenshtein_term: continuous.
31 street_trigram_term: continuous.
32 street_including: n,s,m.
33 street_equality: n,s,m.
34 website_intersect_min: continuous.
35 website_intersect_max: continuous.
36 website_levenshtein_sim: continuous.
37 website_trigram_sim: continuous.
38 website_levenshtein_term: continuous.
39 website_trigram_term: continuous.
40 website_including: n,s,m.
41 website_equality: n,s,m.
42 countryname_intersect_min: continuous.
43 countryname_intersect_max: continuous.
44 countryname_levenshtein_sim: continuous.
45 countryname_trigram_sim: continuous.
46 countryname_levenshtein_term: continuous.
47 countryname_trigram_term: continuous.
48 countryname_including: n,s,m.
49 countryname_equality: n,s,m.
50 geocoderlocalityname_intersect_min: continuous.
51 geocoderlocalityname_intersect_max: continuous.
52 geocoderlocalityname_levenshtein_sim: continuous.
53 geocoderlocalityname_trigram_sim: continuous.
54 geocoderlocalityname_levenshtein_term: continuous.
55 geocoderlocalityname_trigram_term: continuous.
56 geocoderlocalityname_including: n,s,m.
57 geocoderlocalityname_equality: n,s,m.
58 geocoderinputaddress_intersect_min: continuous.
59 geocoderinputaddress_intersect_max: continuous.
60 geocoderinputaddress_levenshtein_sim: continuous.
61 geocoderinputaddress_trigram_sim: continuous.
62 geocoderinputaddress_levenshtein_term: continuous.
63 geocoderinputaddress_trigram_term: continuous.
64 geocoderinputaddress_including: n,s,m.
65 geocoderinputaddress_equality: n,s,m.
66 geocoderoutputaddress_intersect_min: continuous.
67 geocoderoutputaddress_intersect_max: continuous.
68 geocoderoutputaddress_levenshtein_sim: continuous.
69 geocoderoutputaddress_trigram_sim: continuous.
70 geocoderoutputaddress_levenshtein_term: continuous.
71 geocoderoutputaddress_trigram_term: continuous.
72 geocoderoutputaddress_including: n,s,m.
73 geocoderoutputaddress_equality: n,s,m.
74 geocoderpostalcodenumber_intersect_min: continuous.
75 geocoderpostalcodenumber_intersect_max: continuous.
76 geocoderpostalcodenumber_levenshtein_sim: continuous.
77 geocoderpostalcodenumber_trigram_sim: continuous.
78 geocoderpostalcodenumber_levenshtein_term: continuous.
79 geocoderpostalcodenumber_trigram_term: continuous.
80 geocoderpostalcodenumber_including: n,s,m.
81 geocoderpostalcodenumber_equality: n,s,m.
82 geocodercountrynamecode_intersect_min: continuous.
83 geocodercountrynamecode_intersect_max: continuous.
84 geocodercountrynamecode_levenshtein_sim: continuous.
85 geocodercountrynamecode_trigram_sim: continuous.
86 geocodercountrynamecode_levenshtein_term: continuous.
87 geocodercountrynamecode_trigram_term: continuous.
88 geocodercountrynamecode_including: n,s,m.
89 geocodercountrynamecode_equality: n,s,m.
90 phone_diff: continuous.
91 phone_levenshtein: continuous.
92 phone_trigram: continuous.
93 phone_equality: n,s,m.
94 fax_diff: continuous.
95 fax_levenshtein: continuous.
96 fax_trigram: continuous.
97 fax_equality: n,s,m.
98 street_number_diff: continuous.
99 street_number_levenshtein: continuous.
100 street_number_trigram: continuous.
101 street_number_equality: n,s,m.
102 geocode_coordinates_long_diff: continuous.
103 geocode_coordinates_long_levenshtein: continuous.
104 geocode_coordinates_long_trigram: continuous.
105 geocode_coordinates_long_equality: n,s,m.
106 geocode_coordinates_lat_diff: continuous.
107 geocode_coordinates_lat_levenshtein: continuous.
108 geocode_coordinates_lat_trigram: continuous.
109 geocode_coordinates_lat_equality: n,s,m.
110 coordinates_long_diff: continuous.
111 coordinates_long_levenshtein: continuous.
112 coordinates_long_trigram: continuous.
113 coordinates_long_equality: n,s,m.
114 coordinates_lat_diff: continuous.
115 coordinates_lat_levenshtein: continuous.
116 coordinates_lat_trigram: continuous.
117 coordinates_lat_equality: n,s,m.
118 geocode_coordinates_diff: continuous.
119 coordinates_diff: continuous.
120 label: +1,-1.

Relevant Papers:
Laurent Candillier and Vincent Lemaire. Design and Analysis of the Nomao Challenge - Active Learning in the Real-World. In: Proceedings of the ALRA: Active Learning in Real-world Applications, Workshop ECML-PKDD 2012, Friday, September 28, 2012, Bristol, UK.
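
The attribute list in section 5.3 follows a regular naming pattern, so the counts stated in sections 5.1 and 5.2 can be cross-checked mechanically. The Python sketch below is only an illustration: the grouping into "text" and "numeric" fields, and the names `TEXT_FIELDS`, `NUMERIC_FIELDS`, `LABELING_BATCHES` and `build_attributes`, are assumptions introduced here and are not part of the original dataset documentation. Treating `id` and `label` as nominal follows the wording of section 5.2.

```python
# Field groups as they appear in the attribute list of section 5.3.
# The "text" / "numeric" split is an illustrative assumption.
TEXT_FIELDS = [
    "clean_name", "city", "zip", "street", "website", "countryname",
    "geocoderlocalityname", "geocoderinputaddress", "geocoderoutputaddress",
    "geocoderpostalcodenumber", "geocodercountrynamecode",
]
NUMERIC_FIELDS = [
    "phone", "fax", "street_number",
    "geocode_coordinates_long", "geocode_coordinates_lat",
    "coordinates_long", "coordinates_lat",
]
TEXT_SUFFIXES = [
    "intersect_min", "intersect_max", "levenshtein_sim", "trigram_sim",
    "levenshtein_term", "trigram_term", "including", "equality",
]
NUMERIC_SUFFIXES = ["diff", "levenshtein", "trigram", "equality"]
# Suffixes documented with the nominal value set n,s,m.
NOMINAL_SUFFIXES = {"including", "equality"}

# Instance counts per labeling batch, copied from section 5.1.
LABELING_BATCHES = {
    "human prior": 29104,
    "marg": 917,
    "wmarg": 964,
    "wmarg5": 995,
    "rand": 1985,
    "baseline": 163,
    "nomao": 167,
    "tsun": 170,
}


def build_attributes():
    """Return (name, kind) pairs in the order used by section 5.3."""
    attrs = [("id", "nominal")]
    for fields, suffixes in (
        (TEXT_FIELDS, TEXT_SUFFIXES),
        (NUMERIC_FIELDS, NUMERIC_SUFFIXES),
    ):
        for field in fields:
            for suffix in suffixes:
                kind = "nominal" if suffix in NOMINAL_SUFFIXES else "continuous"
                attrs.append((f"{field}_{suffix}", kind))
    attrs.append(("geocode_coordinates_diff", "continuous"))
    attrs.append(("coordinates_diff", "continuous"))
    attrs.append(("label", "nominal"))
    return attrs


if __name__ == "__main__":
    attributes = build_attributes()
    continuous = sum(1 for _, kind in attributes if kind == "continuous")
    nominal = sum(1 for _, kind in attributes if kind == "nominal")
    print("attributes:", len(attributes))
    print("continuous:", continuous, "nominal:", nominal)
    print("instances:", sum(LABELING_BATCHES.values()))
```

Generating the names from the pattern, rather than typing all 120 by hand, makes it easier to spot a missing or misnumbered attribute when the list is compared with the header of a downloaded file.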
2 ARFF 2015-05-25T19:09:04 Public https://api.openml.org/data/v1/download/1592278/nomao.arff https://openml1.win.tue.nl/datasets/0000/1486/dataset_1486.pq 1592278 Class Chemistry, Life Science, OpenML-CC18, OpenML100, study_123, study_135, study_14, study_144, study_218, study_99 public https://archive.ics.uci.edu/ml/datasets/Nomao https://openml1.win.tue.nl/datasets/0000/1486/dataset_1486.pq active 2018-10-03 21:39:12 8fc1ac73fbe5236892e166f9f24d7221
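
The record above ends with a 32-character hexadecimal string, which has the shape of an MD5 digest. Assuming it is the checksum of the `nomao.arff` file served at the download URL (the record itself does not label the field, so this is an assumption), a downloaded copy could be checked along the following lines. The helper names `md5_of_file` and `matches_expected` and the local file name are hypothetical.

```python
import hashlib

# Assumption: this value from the record is the MD5 digest of nomao.arff.
EXPECTED_MD5 = "8fc1ac73fbe5236892e166f9f24d7221"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`.

    The file is read in chunks so that it never has to fit in memory at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals `expected` (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    import sys

    # Usage: python check_md5.py path/to/nomao.arff  (file name is hypothetical)
    if len(sys.argv) == 2:
        print("checksum matches:", matches_expected(sys.argv[1]))
```

A mismatch would not necessarily mean the file is corrupt. It could also mean the checksum refers to a different artifact, such as the Parquet copy, so the result is best read as a hint.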