KDDCup09_appetency_seed_1_nrows_2000_nclasses_10_ncols_100_stratify_True

Status: active. Format: ARFF. Visibility: public. Uploaded 17-11-2022 by Eddie Bergman.

Subsampling of the dataset KDDCup09_appetency (1111) with seed=1, args.nrows=2000, args.ncols=100, args.nclasses=10, args.no_stratify=True.

Generated with the following source code:

```python
def subsample(
    self,
    seed: int,
    nrows_max: int = 2_000,
    ncols_max: int = 100,
    nclasses_max: int = 10,
    stratified: bool = True,
) -> Dataset:
    rng = np.random.default_rng(seed)

    x = self.x
    y = self.y

    # Uniformly sample
    classes = y.unique()
    if len(classes) > nclasses_max:
        vcs = y.value_counts()
        selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
        )

        # Select the indices where one of these classes is present
        idxs = y.index[y.isin(classes)]
        x = x.iloc[idxs]
        y = y.iloc[idxs]

    # Uniformly sample columns if required
    if len(x.columns) > ncols_max:
        columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
        )
        sorted_column_idxs = sorted(columns_idxs)
        selected_columns = list(x.columns[sorted_column_idxs])
        x = x[selected_columns]
    else:
        sorted_column_idxs = list(range(len(x.columns)))

    if len(x) > nrows_max:
        # Stratify accordingly
        target_name = y.name
        data = pd.concat((x, y), axis="columns")
        _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name],
            shuffle=True,
            random_state=seed,
        )
        x = subset.drop(target_name, axis="columns")
        y = subset[target_name]

    # We need to convert categorical columns to string for openml
    categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
    columns = list(x.columns)

    return Dataset(
        # Technically this is not the same but it's where it was derived from
        dataset=self.dataset,
        x=x,
        y=y,
        categorical_mask=categorical_mask,
        columns=columns,
    )
```
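The row-subsampling step in the source code above relies on scikit-learn's `train_test_split` with `stratify=` and an integer `test_size`, keeping the "test" half as the subsample. The following is a minimal sketch of that one step in isolation, on made-up toy data; the column names `f0`, `f1`, `target` and the 2% positive rate are assumptions for illustration only, not properties of KDDCup09_appetency.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10,000 rows with a rare positive class (2%).
rng = np.random.default_rng(1)
x = pd.DataFrame({
    "f0": rng.normal(size=10_000),
    "f1": rng.normal(size=10_000),
})
y = pd.Series([1] * 200 + [-1] * 9_800, name="target")

# Join features and target so that both are subsampled together.
data = pd.concat((x, y), axis="columns")

# An integer test_size requests an absolute number of rows; stratify keeps
# the class proportions of the target in the returned subset.
_, subset = train_test_split(
    data,
    test_size=2_000,
    stratify=data["target"],
    shuffle=True,
    random_state=1,
)

x_sub = subset.drop("target", axis="columns")
y_sub = subset["target"]

full_share = (y == 1).mean()
sub_share = (y_sub == 1).mean()
print(f"rows kept: {len(subset)}")
print(f"positive share, full data: {full_share:.4f}")
print(f"positive share, subsample: {sub_share:.4f}")
```

With stratification, the positive share in the subsample should stay close to the share in the full data, which matters for a target as imbalanced as the one on this page.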

94 features

APPETENCY (target): nominal, 2 unique values, 0 missing
Var4: numeric, 2 unique values, 1924 missing
Var5: numeric, 24 unique values, 1946 missing
Var9: numeric, 17 unique values, 1974 missing
Var11: numeric, 3 unique values, 1951 missing
Var13: numeric, 642 unique values, 227 missing
Var14: numeric, 3 unique values, 1951 missing
Var17: numeric, 10 unique values, 1924 missing
Var21: numeric, 225 unique values, 227 missing
Var22: numeric, 225 unique values, 205 missing
Var23: numeric, 6 unique values, 1946 missing
Var34: numeric, 3 unique values, 1951 missing
Var35: numeric, 8 unique values, 205 missing
Var38: numeric, 1364 unique values, 205 missing
Var40: numeric, 8 unique values, 1951 missing
Var44: numeric, 3 unique values, 205 missing
Var45: numeric, 19 unique values, 1981 missing
Var47: numeric, 4 unique values, 1974 missing
Var49: numeric, 1 unique values, 1951 missing
Var54: numeric, 3 unique values, 1951 missing
Var57: numeric, 1949 unique values, 0 missing
Var60: numeric, 8 unique values, 1946 missing
Var61: numeric, 11 unique values, 1965 missing
Var62: numeric, 4 unique values, 1980 missing
Var64: numeric, 10 unique values, 1990 missing
Var67: numeric, 1 unique values, 1946 missing
Var68: numeric, 18 unique values, 1951 missing
Var72: numeric, 5 unique values, 897 missing
Var73: numeric, 115 unique values, 0 missing
Var74: numeric, 119 unique values, 227 missing
Var76: numeric, 1324 unique values, 205 missing
Var77: numeric, 8 unique values, 1974 missing
Var81: numeric, 1742 unique values, 227 missing
Var82: numeric, 3 unique values, 1924 missing
Var83: numeric, 44 unique values, 205 missing
Var85: numeric, 47 unique values, 205 missing
Var86: numeric, 20 unique values, 1974 missing
Var87: numeric, 3 unique values, 1974 missing
Var88: numeric, 22 unique values, 1944 missing
Var90: numeric, 1 unique values, 1974 missing
Var95: numeric, 13 unique values, 1951 missing
Var96: numeric, 10 unique values, 1951 missing
Var97: numeric, 3 unique values, 1946 missing
Var100: numeric, 3 unique values, 1974 missing
Var101: numeric, 10 unique values, 1961 missing
Var109: numeric, 62 unique values, 296 missing
Var113: numeric, 1970 unique values, 0 missing
Var114: numeric, 26 unique values, 1951 missing
Var115: numeric, 12 unique values, 1969 missing
Var120: numeric, 14 unique values, 1946 missing
Var123: numeric, 68 unique values, 205 missing
Var125: numeric, 1180 unique values, 227 missing
Var127: numeric, 12 unique values, 1944 missing
Var128: numeric, 22 unique values, 1944 missing
Var129: numeric, 12 unique values, 1974 missing
Var130: numeric, 2 unique values, 1951 missing
Var131: numeric, 9 unique values, 1974 missing
Var132: numeric, 11 unique values, 205 missing
Var135: numeric, 62 unique values, 1924 missing
Var136: numeric, 19 unique values, 1976 missing
Var148: numeric, 20 unique values, 1946 missing
Var150: numeric, 32 unique values, 1924 missing
Var151: numeric, 6 unique values, 1965 missing
Var154: numeric, 16 unique values, 1974 missing
Var156: numeric, 16 unique values, 1976 missing
Var159: numeric, 5 unique values, 1951 missing
Var160: numeric, 119 unique values, 205 missing
Var165: numeric, 13 unique values, 1961 missing
Var168: numeric, 24 unique values, 1974 missing
Var171: numeric, 37 unique values, 1944 missing
Var172: numeric, 6 unique values, 1946 missing
Var174: numeric, 9 unique values, 1924 missing
Var177: numeric, 17 unique values, 1951 missing
Var180: numeric, 21 unique values, 1974 missing
Var181: numeric, 4 unique values, 205 missing
Var183: numeric, 16 unique values, 1951 missing
Var186: numeric, 6 unique values, 1974 missing
Var189: numeric, 70 unique values, 1161 missing
Var192: nominal, 241 unique values, 12 missing
Var195: nominal, 15 unique values, 0 missing
Var196: nominal, 3 unique values, 0 missing
Var200: nominal, 956 unique values, 1026 missing
Var202: nominal, 1461 unique values, 0 missing
Var212: nominal, 51 unique values, 0 missing
Var216: nominal, 413 unique values, 0 missing
Var217: nominal, 1526 unique values, 24 missing
Var218: nominal, 2 unique values, 24 missing
Var220: nominal, 927 unique values, 0 missing
Var221: nominal, 7 unique values, 0 missing
Var223: nominal, 4 unique values, 186 missing
Var224: nominal, 1 unique values, 1969 missing
Var225: nominal, 3 unique values, 1056 missing
Var227: nominal, 6 unique values, 0 missing
Var229: nominal, 4 unique values, 1138 missing

19 properties

Number of instances (rows) of the dataset: 2000
Number of attributes (columns) of the dataset: 94
Number of distinct values of the target attribute (if it is nominal): 2
Number of missing values in the dataset: 118767
Number of instances with at least one value missing: 2000
Number of numeric attributes: 77
Number of nominal attributes: 17
Percentage of binary attributes: 2.13
Percentage of instances having missing values: 100
Average class difference between consecutive instances: 0.96
Percentage of missing values: 63.17
Number of attributes divided by the number of instances: 0.05
Percentage of numeric attributes: 81.91
Percentage of instances belonging to the most frequent class: 98.2
Percentage of nominal attributes: 18.09
Number of instances belonging to the most frequent class: 1964
Percentage of instances belonging to the least frequent class: 1.8
Number of instances belonging to the least frequent class: 36
Number of binary attributes: 2
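The percentage and ratio properties listed above can be re-derived from the raw counts on this page. The short check below is only a sketch: the variable names and the `pct` helper are made up for illustration, and rounding to two decimals is an assumption about how the page displays values.

```python
# Raw counts copied from the property list on this page.
n_rows = 2000        # instances
n_cols = 94          # attributes, including the target
n_missing = 118767   # missing cells in the whole table
n_numeric = 77
n_nominal = 17
n_binary = 2
majority = 1964      # instances in the most frequent class
minority = 36        # instances in the least frequent class


def pct(part: float, whole: float) -> float:
    """Percentage of `part` in `whole`, rounded to two decimals (assumed)."""
    return round(100 * part / whole, 2)


derived = {
    "pct_missing_values": pct(n_missing, n_rows * n_cols),
    "pct_numeric_attributes": pct(n_numeric, n_cols),
    "pct_nominal_attributes": pct(n_nominal, n_cols),
    "pct_binary_attributes": pct(n_binary, n_cols),
    "pct_majority_class": pct(majority, n_rows),
    "pct_minority_class": pct(minority, n_rows),
    "attributes_per_instance": round(n_cols / n_rows, 2),
}

for name, value in derived.items():
    print(f"{name}: {value}")
```

The two class counts sum to the number of rows (1964 + 36 = 2000), and the missing-value percentage is taken over all 2000 × 94 = 188000 cells, not over rows.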

0 tasks
