OpenML
KDDCup09_appetency_seed_3_nrows_2000_nclasses_10_ncols_100_stratify_True

Status: active. Format: ARFF. Visibility: public. Uploaded 17-11-2022 by Eddie Bergman.
Subsampling of the dataset KDDCup09_appetency (1111) with seed=3, args.nrows=2000, args.ncols=100, args.nclasses=10, args.no_stratify=True.

Generated with the following source code:

```python
def subsample(
    self,
    seed: int,
    nrows_max: int = 2_000,
    ncols_max: int = 100,
    nclasses_max: int = 10,
    stratified: bool = True,
) -> Dataset:
    rng = np.random.default_rng(seed)

    x = self.x
    y = self.y

    # Uniformly sample
    classes = y.unique()
    if len(classes) > nclasses_max:
        vcs = y.value_counts()
        selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
        )

        # Select the indices where one of these classes is present
        idxs = y.index[y.isin(classes)]
        x = x.iloc[idxs]
        y = y.iloc[idxs]

    # Uniformly sample columns if required
    if len(x.columns) > ncols_max:
        columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
        )
        sorted_column_idxs = sorted(columns_idxs)
        selected_columns = list(x.columns[sorted_column_idxs])
        x = x[selected_columns]
    else:
        sorted_column_idxs = list(range(len(x.columns)))

    if len(x) > nrows_max:
        # Stratify accordingly
        target_name = y.name
        data = pd.concat((x, y), axis="columns")
        _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name],
            shuffle=True,
            random_state=seed,
        )
        x = subset.drop(target_name, axis="columns")
        y = subset[target_name]

    # We need to convert categorical columns to string for openml
    categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
    columns = list(x.columns)

    return Dataset(
        # Technically this is not the same but it's where it was derived from
        dataset=self.dataset,
        x=x,
        y=y,
        categorical_mask=categorical_mask,
        columns=columns,
    )
```
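The stratified row-sampling step in the source code above can be tried in isolation. The following is a minimal sketch on toy data, not the original pipeline: the helper name `stratified_row_subsample`, the toy column names, and the 98%/2% class split (chosen to resemble this dataset's imbalance) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_row_subsample(
    data: pd.DataFrame, target_name: str, nrows_max: int, seed: int
) -> pd.DataFrame:
    """Return at most nrows_max rows, keeping class proportions roughly intact.

    Hypothetical helper: it mirrors only the row-sampling branch of the
    source code, using the "test" half of train_test_split as the subsample.
    """
    if len(data) <= nrows_max:
        return data
    _, subset = train_test_split(
        data,
        test_size=nrows_max,          # an int means an absolute number of rows
        stratify=data[target_name],   # preserve the class distribution
        shuffle=True,
        random_state=seed,
    )
    return subset


# Toy data (assumed for illustration): 1000 rows, 2% minority class.
rng = np.random.default_rng(3)
toy = pd.DataFrame(
    {
        "feature": rng.normal(size=1000),
        "target": np.repeat(["-1", "1"], [980, 20]),
    }
)

subset = stratified_row_subsample(toy, "target", nrows_max=100, seed=3)
counts = subset["target"].value_counts()
print(len(subset))
print(counts.to_dict())
```

Because the split is stratified, the minority class stays at roughly 2% of the subsample rather than vanishing by chance, which is consistent with the 1.8% minority share reported for this dataset.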

93 features

APPETENCY (target, nominal): 2 unique values, 0 missing
Var1 (numeric): 7 unique values, 1975 missing
Var2 (numeric): 1 unique value, 1954 missing
Var5 (numeric): 19 unique values, 1946 missing
Var6 (numeric): 481 unique values, 205 missing
Var12 (numeric): 9 unique values, 1982 missing
Var14 (numeric): 3 unique values, 1954 missing
Var18 (numeric): 6 unique values, 1931 missing
Var24 (numeric): 30 unique values, 276 missing
Var25 (numeric): 91 unique values, 194 missing
Var28 (numeric): 520 unique values, 194 missing
Var36 (numeric): 25 unique values, 1954 missing
Var45 (numeric): 18 unique values, 1982 missing
Var46 (numeric): 13 unique values, 1954 missing
Var47 (numeric): 2 unique values, 1975 missing
Var49 (numeric): 1 unique value, 1954 missing
Var50 (numeric): 9 unique values, 1975 missing
Var51 (numeric): 149 unique values, 1848 missing
Var53 (numeric): 15 unique values, 1975 missing
Var56 (numeric): 8 unique values, 1969 missing
Var60 (numeric): 11 unique values, 1946 missing
Var61 (numeric): 7 unique values, 1974 missing
Var62 (numeric): 5 unique values, 1982 missing
Var66 (numeric): 17 unique values, 1975 missing
Var68 (numeric): 23 unique values, 1954 missing
Var69 (numeric): 27 unique values, 1946 missing
Var72 (numeric): 6 unique values, 891 missing
Var73 (numeric): 119 unique values, 0 missing
Var80 (numeric): 16 unique values, 1946 missing
Var81 (numeric): 1768 unique values, 205 missing
Var82 (numeric): 3 unique values, 1931 missing
Var87 (numeric): 4 unique values, 1975 missing
Var89 (numeric): 4 unique values, 1969 missing
Var91 (numeric): 18 unique values, 1963 missing
Var93 (numeric): 2 unique values, 1946 missing
Var95 (numeric): 10 unique values, 1954 missing
Var100 (numeric): 2 unique values, 1975 missing
Var102 (numeric): 16 unique values, 1984 missing
Var103 (numeric): 10 unique values, 1946 missing
Var105 (numeric): 16 unique values, 1972 missing
Var107 (numeric): 7 unique values, 1946 missing
Var108 (numeric): 16 unique values, 1975 missing
Var109 (numeric): 61 unique values, 276 missing
Var117 (numeric): 25 unique values, 1931 missing
Var119 (numeric): 445 unique values, 205 missing
Var120 (numeric): 14 unique values, 1946 missing
Var122 (numeric): 1 unique value, 1954 missing
Var123 (numeric): 70 unique values, 194 missing
Var128 (numeric): 19 unique values, 1958 missing
Var129 (numeric): 10 unique values, 1975 missing
Var130 (numeric): 2 unique values, 1954 missing
Var135 (numeric): 50 unique values, 1931 missing
Var136 (numeric): 19 unique values, 1975 missing
Var137 (numeric): 3 unique values, 1975 missing
Var144 (numeric): 9 unique values, 205 missing
Var146 (numeric): 6 unique values, 1946 missing
Var147 (numeric): 3 unique values, 1946 missing
Var148 (numeric): 19 unique values, 1946 missing
Var151 (numeric): 5 unique values, 1974 missing
Var153 (numeric): 1685 unique values, 194 missing
Var155 (numeric): 5 unique values, 1931 missing
Var161 (numeric): 4 unique values, 1931 missing
Var164 (numeric): 4 unique values, 1931 missing
Var166 (numeric): 8 unique values, 1946 missing
Var170 (numeric): 8 unique values, 1954 missing
Var173 (numeric): 2 unique values, 194 missing
Var176 (numeric): 5 unique values, 1954 missing
Var177 (numeric): 23 unique values, 1954 missing
Var179 (numeric): 4 unique values, 1931 missing
Var180 (numeric): 19 unique values, 1975 missing
Var181 (numeric): 4 unique values, 194 missing
Var182 (numeric): 30 unique values, 1931 missing
Var187 (numeric): 11 unique values, 1975 missing
Var188 (numeric): 40 unique values, 1954 missing
Var193 (nominal): 32 unique values, 0 missing
Var194 (nominal): 2 unique values, 1451 missing
Var195 (nominal): 8 unique values, 0 missing
Var197 (nominal): 152 unique values, 6 missing
Var199 (nominal): 639 unique values, 0 missing
Var201 (nominal): 1 unique value, 1451 missing
Var205 (nominal): 3 unique values, 79 missing
Var207 (nominal): 8 unique values, 0 missing
Var208 (nominal): 2 unique values, 6 missing
Var212 (nominal): 45 unique values, 0 missing
Var213 (nominal): 1 unique value, 1963 missing
Var216 (nominal): 402 unique values, 0 missing
Var217 (nominal): 1504 unique values, 31 missing
Var221 (nominal): 7 unique values, 0 missing
Var225 (nominal): 3 unique values, 1000 missing
Var226 (nominal): 23 unique values, 0 missing
Var227 (nominal): 6 unique values, 0 missing
Var228 (nominal): 25 unique values, 0 missing
Var229 (nominal): 3 unique values, 1104 missing

19 properties

Number of instances (rows) of the dataset: 2000
Number of attributes (columns) of the dataset: 93
Number of distinct values of the target attribute (if it is nominal): 2
Number of missing values in the dataset: 125883
Number of instances with at least one value missing: 2000
Number of numeric attributes: 73
Number of nominal attributes: 20
Percentage of instances belonging to the most frequent class: 98.2
Percentage of nominal attributes: 21.51
Number of instances belonging to the most frequent class: 1964
Percentage of instances belonging to the least frequent class: 1.8
Number of instances belonging to the least frequent class: 36
Number of binary attributes: 3
Percentage of binary attributes: 3.23
Percentage of instances having missing values: 100
Average class difference between consecutive instances: 0.97
Percentage of missing values: 67.68
Number of attributes divided by the number of instances: 0.05
Percentage of numeric attributes: 78.49
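Most of the percentage properties listed here follow from the raw counts by simple arithmetic. The sketch below recomputes them as a consistency check. It assumes that the percentage of missing values is taken over all rows × columns cells (target included) and that attribute percentages use all 93 columns as the denominator; these assumptions reproduce the listed figures but are not confirmed by the page itself. The variable names are illustrative.

```python
# Raw counts copied from the property list.
n_rows = 2000
n_cols = 93
n_missing = 125883
n_numeric = 73
n_nominal = 20
n_binary = 3
n_majority = 1964
n_minority = 36

# Derived percentages (assumed denominators: all cells, all columns, all rows).
derived = {
    "pct_missing_values": 100 * n_missing / (n_rows * n_cols),
    "pct_numeric_attributes": 100 * n_numeric / n_cols,
    "pct_nominal_attributes": 100 * n_nominal / n_cols,
    "pct_binary_attributes": 100 * n_binary / n_cols,
    "pct_majority_class": 100 * n_majority / n_rows,
    "pct_minority_class": 100 * n_minority / n_rows,
    "attributes_per_instance": n_cols / n_rows,
}

for name, value in derived.items():
    print(f"{name}: {value:.4f}")
```

The two class counts sum to 2000 and the numeric and nominal counts sum to 93, so the listed properties are internally consistent under these assumptions.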

0 tasks