KDDCup09_appetency_seed_4_nrows_2000_nclasses_10_ncols_100_stratify_True

Status: active. Format: ARFF. Visibility: public. Uploaded 17-11-2022 by Eddie Bergman.

Subsampling of the dataset KDDCup09_appetency (1111) with seed=4, args.nrows=2000, args.ncols=100, args.nclasses=10, args.no_stratify=True.

Generated with the following source code:

```python
def subsample(
    self,
    seed: int,
    nrows_max: int = 2_000,
    ncols_max: int = 100,
    nclasses_max: int = 10,
    stratified: bool = True,
) -> Dataset:
    rng = np.random.default_rng(seed)

    x = self.x
    y = self.y

    # Uniformly sample
    classes = y.unique()
    if len(classes) > nclasses_max:
        vcs = y.value_counts()
        selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
        )

        # Select the indices where one of these classes is present
        idxs = y.index[y.isin(classes)]
        x = x.iloc[idxs]
        y = y.iloc[idxs]

    # Uniformly sample columns if required
    if len(x.columns) > ncols_max:
        columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
        )
        sorted_column_idxs = sorted(columns_idxs)
        selected_columns = list(x.columns[sorted_column_idxs])
        x = x[selected_columns]
    else:
        sorted_column_idxs = list(range(len(x.columns)))

    if len(x) > nrows_max:
        # Stratify accordingly
        target_name = y.name
        data = pd.concat((x, y), axis="columns")
        _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name],
            shuffle=True,
            random_state=seed,
        )
        x = subset.drop(target_name, axis="columns")
        y = subset[target_name]

    # We need to convert categorical columns to string for openml
    categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
    columns = list(x.columns)

    return Dataset(
        # Technically this is not the same but it's where it was derived from
        dataset=self.dataset,
        x=x,
        y=y,
        categorical_mask=categorical_mask,
        columns=columns,
    )
```
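The row-subsampling step in the source code above relies on a stratified `train_test_split`, which keeps class proportions roughly intact in the subset. As a rough illustration of that idea only, here is a standard-library sketch that is not the code used to build this dataset. The function name `stratified_subsample`, the toy labels, and the 10,000-row toy size are all hypothetical; the toy class ratio is chosen so that the result matches the class counts listed in the properties of this dataset.

```python
import random
from collections import Counter, defaultdict


def stratified_subsample(labels, n_rows, seed):
    """Return sorted row indices of a class-proportional subsample.

    Simplified stand-in for a stratified split: each class receives a share
    of n_rows proportional to its frequency (largest-remainder rounding).
    """
    rng = random.Random(seed)

    # Group row indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    total = len(labels)
    # Ideal (fractional) number of rows per class
    quotas = {c: len(idxs) * n_rows / total for c, idxs in by_class.items()}
    counts = {c: int(q) for c, q in quotas.items()}

    # Hand the leftover rows to the classes with the largest remainders
    leftover = n_rows - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - counts[c], reverse=True)
    for c in by_remainder[:leftover]:
        counts[c] += 1

    # Sample without replacement inside each class
    chosen = []
    for c, idxs in by_class.items():
        chosen.extend(rng.sample(idxs, counts[c]))
    return sorted(chosen)


# Hypothetical toy labels: 98.2% majority class, 1.8% minority class
labels = ["-1"] * 9820 + ["1"] * 180
subset = stratified_subsample(labels, n_rows=2000, seed=4)
class_counts = Counter(labels[i] for i in subset)
print(class_counts)
```

With these toy labels the subset keeps the 98.2% / 1.8% split, so a 2000-row sample contains 1964 majority rows and 36 minority rows.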

94 features

APPETENCY (target, nominal): 2 unique values, 0 missing
Var6 (numeric): 480 unique values, 208 missing
Var10 (numeric): 29 unique values, 1930 missing
Var12 (numeric): 5 unique values, 1983 missing
Var16 (numeric): 64 unique values, 1930 missing
Var22 (numeric): 226 unique values, 185 missing
Var24 (numeric): 30 unique values, 267 missing
Var26 (numeric): 3 unique values, 1930 missing
Var30 (numeric): 6 unique values, 1978 missing
Var33 (numeric): 18 unique values, 1962 missing
Var37 (numeric): 21 unique values, 1945 missing
Var38 (numeric): 1387 unique values, 185 missing
Var40 (numeric): 7 unique values, 1962 missing
Var43 (numeric): 4 unique values, 1962 missing
Var46 (numeric): 12 unique values, 1962 missing
Var50 (numeric): 7 unique values, 1978 missing
Var54 (numeric): 4 unique values, 1962 missing
Var60 (numeric): 14 unique values, 1930 missing
Var64 (numeric): 5 unique values, 1995 missing
Var66 (numeric): 16 unique values, 1978 missing
Var68 (numeric): 16 unique values, 1962 missing
Var69 (numeric): 39 unique values, 1930 missing
Var71 (numeric): 26 unique values, 1944 missing
Var74 (numeric): 133 unique values, 208 missing
Var77 (numeric): 8 unique values, 1978 missing
Var80 (numeric): 18 unique values, 1930 missing
Var82 (numeric): 4 unique values, 1945 missing
Var85 (numeric): 50 unique values, 185 missing
Var86 (numeric): 13 unique values, 1978 missing
Var88 (numeric): 15 unique values, 1964 missing
Var90 (numeric): 1 unique value, 1978 missing
Var91 (numeric): 26 unique values, 1944 missing
Var93 (numeric): 4 unique values, 1930 missing
Var94 (numeric): 1057 unique values, 896 missing
Var95 (numeric): 10 unique values, 1962 missing
Var96 (numeric): 7 unique values, 1962 missing
Var99 (numeric): 13 unique values, 1945 missing
Var100 (numeric): 3 unique values, 1978 missing
Var101 (numeric): 6 unique values, 1973 missing
Var103 (numeric): 11 unique values, 1930 missing
Var104 (numeric): 13 unique values, 1975 missing
Var105 (numeric): 13 unique values, 1975 missing
Var106 (numeric): 10 unique values, 1945 missing
Var112 (numeric): 68 unique values, 185 missing
Var116 (numeric): 1 unique value, 1978 missing
Var118 (numeric): 1 unique value, 1994 missing
Var120 (numeric): 17 unique values, 1930 missing
Var122 (numeric): 1 unique value, 1962 missing
Var125 (numeric): 1183 unique values, 208 missing
Var127 (numeric): 13 unique values, 1964 missing
Var130 (numeric): 2 unique values, 1962 missing
Var133 (numeric): 1608 unique values, 185 missing
Var137 (numeric): 4 unique values, 1978 missing
Var138 (numeric): 1 unique value, 1945 missing
Var140 (numeric): 598 unique values, 208 missing
Var144 (numeric): 9 unique values, 208 missing
Var147 (numeric): 3 unique values, 1930 missing
Var150 (numeric): 18 unique values, 1945 missing
Var154 (numeric): 14 unique values, 1978 missing
Var155 (numeric): 3 unique values, 1945 missing
Var157 (numeric): 21 unique values, 1944 missing
Var158 (numeric): 5 unique values, 1973 missing
Var159 (numeric): 5 unique values, 1962 missing
Var160 (numeric): 117 unique values, 185 missing
Var165 (numeric): 9 unique values, 1973 missing
Var166 (numeric): 15 unique values, 1930 missing
Var171 (numeric): 23 unique values, 1964 missing
Var176 (numeric): 6 unique values, 1962 missing
Var179 (numeric): 6 unique values, 1945 missing
Var180 (numeric): 18 unique values, 1978 missing
Var181 (numeric): 5 unique values, 185 missing
Var182 (numeric): 26 unique values, 1945 missing
Var188 (numeric): 32 unique values, 1962 missing
Var194 (nominal): 3 unique values, 1489 missing
Var195 (nominal): 10 unique values, 0 missing
Var196 (nominal): 2 unique values, 0 missing
Var198 (nominal): 929 unique values, 0 missing
Var199 (nominal): 655 unique values, 0 missing
Var200 (nominal): 967 unique values, 995 missing
Var201 (nominal): 1 unique value, 1489 missing
Var202 (nominal): 1410 unique values, 0 missing
Var204 (nominal): 100 unique values, 0 missing
Var206 (nominal): 20 unique values, 208 missing
Var207 (nominal): 11 unique values, 0 missing
Var208 (nominal): 2 unique values, 3 missing
Var211 (nominal): 2 unique values, 0 missing
Var214 (nominal): 967 unique values, 995 missing
Var215 (nominal): 1 unique value, 1978 missing
Var216 (nominal): 405 unique values, 0 missing
Var218 (nominal): 2 unique values, 34 missing
Var222 (nominal): 929 unique values, 0 missing
Var223 (nominal): 4 unique values, 215 missing
Var226 (nominal): 23 unique values, 0 missing
Var227 (nominal): 7 unique values, 0 missing

19 properties

Number of instances (rows) of the dataset: 2000
Number of attributes (columns) of the dataset: 94
Number of distinct values of the target attribute (if it is nominal): 2
Number of missing values in the dataset: 124468
Number of instances with at least one value missing: 2000
Number of numeric attributes: 72
Number of nominal attributes: 22
Percentage of binary attributes: 5.32
Percentage of instances having missing values: 100
Percentage of missing values: 66.21
Average class difference between consecutive instances: 0.97
Percentage of numeric attributes: 76.6
Number of attributes divided by the number of instances: 0.05
Percentage of nominal attributes: 23.4
Percentage of instances belonging to the most frequent class: 98.2
Number of instances belonging to the most frequent class: 1964
Percentage of instances belonging to the least frequent class: 1.8
Number of instances belonging to the least frequent class: 36
Number of binary attributes: 5
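Several of the properties above are ratios of the raw counts, so they can be cross-checked with a few lines of arithmetic. The sketch below recomputes them from the counts listed on this page. The variable names and the `pct` helper are hypothetical, and rounding to two decimals is an assumption made to match the displayed values.

```python
# Raw counts as listed in the properties of this dataset
n_rows = 2000
n_cols = 94
n_missing = 124468
n_numeric = 72
n_nominal = 22
n_binary = 5
n_majority = 1964
n_minority = 36


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


derived = {
    # Missing cells over all cells (rows x columns)
    "pct_missing_values": pct(n_missing, n_rows * n_cols),
    "pct_numeric_attributes": pct(n_numeric, n_cols),
    "pct_nominal_attributes": pct(n_nominal, n_cols),
    "pct_binary_attributes": pct(n_binary, n_cols),
    "pct_majority_class": pct(n_majority, n_rows),
    "pct_minority_class": pct(n_minority, n_rows),
    # Attributes per instance (a ratio, not a percentage)
    "attributes_per_instance": round(n_cols / n_rows, 2),
}

for name, value in derived.items():
    print(f"{name}: {value}")
```

The recomputed values agree with the listed ones, for example 124468 missing cells out of 2000 × 94 = 188000 gives 66.21%.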

0 tasks