KDDCup09_appetency_seed_2_nrows_2000_nclasses_10_ncols_100_stratify_True

Status: active. Format: ARFF. Visibility: public. Uploaded 17-11-2022 by Eddie Bergman.
Subsampling of the dataset KDDCup09_appetency (1111) with seed=2, args.nrows=2000, args.ncols=100, args.nclasses=10, args.no_stratify=True.

Generated with the following source code:

```python
def subsample(
    self,
    seed: int,
    nrows_max: int = 2_000,
    ncols_max: int = 100,
    nclasses_max: int = 10,
    stratified: bool = True,
) -> Dataset:
    rng = np.random.default_rng(seed)

    x = self.x
    y = self.y

    # Uniformly sample
    classes = y.unique()
    if len(classes) > nclasses_max:
        vcs = y.value_counts()
        selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
        )

        # Select the indices where one of these classes is present
        idxs = y.index[y.isin(classes)]
        x = x.iloc[idxs]
        y = y.iloc[idxs]

    # Uniformly sample columns if required
    if len(x.columns) > ncols_max:
        columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
        )
        sorted_column_idxs = sorted(columns_idxs)
        selected_columns = list(x.columns[sorted_column_idxs])
        x = x[selected_columns]
    else:
        sorted_column_idxs = list(range(len(x.columns)))

    if len(x) > nrows_max:
        # Stratify accordingly
        target_name = y.name
        data = pd.concat((x, y), axis="columns")
        _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name],
            shuffle=True,
            random_state=seed,
        )
        x = subset.drop(target_name, axis="columns")
        y = subset[target_name]

    # We need to convert categorical columns to string for openml
    categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
    columns = list(x.columns)

    return Dataset(
        # Technically this is not the same but it's where it was derived from
        dataset=self.dataset,
        x=x,
        y=y,
        categorical_mask=categorical_mask,
        columns=columns,
    )
```
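The row-subsampling step in the source code above relies on `train_test_split` with an integer `test_size` and a `stratify` column. The following is a minimal sketch of that one step in isolation, not the author's pipeline: the data frame is synthetic, and the column names `feature` and `target` as well as the 98/2 class split are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a large, imbalanced dataset (hypothetical data,
# not the real KDDCup09_appetency table).
rng = np.random.default_rng(2)
n = 10_000
data = pd.DataFrame(
    {
        "feature": rng.normal(size=n),
        "target": rng.choice(["-1", "1"], size=n, p=[0.98, 0.02]),
    }
)

# An integer test_size requests exactly that many rows in the second split.
# stratify keeps the class proportions of the subset close to the original.
_, subset = train_test_split(
    data,
    test_size=2_000,
    stratify=data["target"],
    shuffle=True,
    random_state=2,
)

# Compare class shares before and after subsampling.
full_share = data["target"].value_counts(normalize=True)
sub_share = subset["target"].value_counts(normalize=True)
print(len(subset))
print(full_share.round(4).to_dict())
print(sub_share.round(4).to_dict())
```

Keeping the "test" half of the split is what makes the subset exactly `test_size` rows long; the "train" half is simply discarded.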

92 features

APPETENCY (target, nominal): 2 unique values, 0 missing
Var7 (numeric): 6 unique values, 234 missing
Var9 (numeric): 18 unique values, 1968 missing
Var13 (numeric): 626 unique values, 234 missing
Var17 (numeric): 10 unique values, 1934 missing
Var21 (numeric): 211 unique values, 228 missing
Var22 (numeric): 211 unique values, 206 missing
Var23 (numeric): 7 unique values, 1939 missing
Var24 (numeric): 31 unique values, 299 missing
Var28 (numeric): 500 unique values, 206 missing
Var35 (numeric): 8 unique values, 206 missing
Var36 (numeric): 21 unique values, 1953 missing
Var40 (numeric): 8 unique values, 1953 missing
Var41 (numeric): 8 unique values, 1968 missing
Var45 (numeric): 11 unique values, 1989 missing
Var47 (numeric): 3 unique values, 1968 missing
Var49 (numeric): 1 unique value, 1953 missing
Var56 (numeric): 9 unique values, 1977 missing
Var57 (numeric): 1939 unique values, 0 missing
Var58 (numeric): 10 unique values, 1968 missing
Var59 (numeric): 23 unique values, 1966 missing
Var60 (numeric): 9 unique values, 1939 missing
Var62 (numeric): 3 unique values, 1980 missing
Var64 (numeric): 8 unique values, 1992 missing
Var66 (numeric): 18 unique values, 1968 missing
Var68 (numeric): 18 unique values, 1953 missing
Var76 (numeric): 1286 unique values, 206 missing
Var83 (numeric): 40 unique values, 206 missing
Var85 (numeric): 49 unique values, 206 missing
Var86 (numeric): 17 unique values, 1968 missing
Var87 (numeric): 3 unique values, 1968 missing
Var89 (numeric): 3 unique values, 1977 missing
Var91 (numeric): 21 unique values, 1955 missing
Var92 (numeric): 3 unique values, 1997 missing
Var93 (numeric): 2 unique values, 1939 missing
Var94 (numeric): 1064 unique values, 890 missing
Var97 (numeric): 3 unique values, 1939 missing
Var98 (numeric): 6 unique values, 1980 missing
Var99 (numeric): 14 unique values, 1934 missing
Var101 (numeric): 6 unique values, 1960 missing
Var102 (numeric): 18 unique values, 1982 missing
Var103 (numeric): 14 unique values, 1939 missing
Var104 (numeric): 15 unique values, 1966 missing
Var106 (numeric): 11 unique values, 1934 missing
Var107 (numeric): 8 unique values, 1939 missing
Var109 (numeric): 61 unique values, 299 missing
Var110 (numeric): 1 unique value, 1968 missing
Var111 (numeric): 30 unique values, 1955 missing
Var114 (numeric): 24 unique values, 1953 missing
Var115 (numeric): 11 unique values, 1966 missing
Var117 (numeric): 34 unique values, 1934 missing
Var118 (numeric): 1 unique value, 1997 missing
Var119 (numeric): 434 unique values, 228 missing
Var121 (numeric): 8 unique values, 1968 missing
Var122 (numeric): 2 unique values, 1953 missing
Var123 (numeric): 69 unique values, 206 missing
Var128 (numeric): 25 unique values, 1948 missing
Var129 (numeric): 9 unique values, 1968 missing
Var130 (numeric): 2 unique values, 1953 missing
Var131 (numeric): 5 unique values, 1968 missing
Var132 (numeric): 12 unique values, 206 missing
Var134 (numeric): 1439 unique values, 206 missing
Var139 (numeric): 25 unique values, 1939 missing
Var140 (numeric): 598 unique values, 234 missing
Var142 (numeric): 2 unique values, 1968 missing
Var144 (numeric): 9 unique values, 228 missing
Var151 (numeric): 5 unique values, 1969 missing
Var153 (numeric): 1670 unique values, 206 missing
Var156 (numeric): 18 unique values, 1968 missing
Var157 (numeric): 14 unique values, 1955 missing
Var159 (numeric): 6 unique values, 1953 missing
Var160 (numeric): 119 unique values, 206 missing
Var161 (numeric): 4 unique values, 1934 missing
Var163 (numeric): 1016 unique values, 206 missing
Var164 (numeric): 5 unique values, 1934 missing
Var165 (numeric): 9 unique values, 1960 missing
Var172 (numeric): 7 unique values, 1939 missing
Var180 (numeric): 21 unique values, 1968 missing
Var183 (numeric): 18 unique values, 1953 missing
Var190 (numeric): 11 unique values, 1989 missing
Var191 (nominal): 1 unique value, 1948 missing
Var193 (nominal): 32 unique values, 0 missing
Var197 (nominal): 152 unique values, 6 missing
Var198 (nominal): 973 unique values, 0 missing
Var202 (nominal): 1431 unique values, 0 missing
Var204 (nominal): 99 unique values, 0 missing
Var211 (nominal): 2 unique values, 0 missing
Var216 (nominal): 367 unique values, 0 missing
Var220 (nominal): 973 unique values, 0 missing
Var223 (nominal): 4 unique values, 211 missing
Var225 (nominal): 3 unique values, 1071 missing
Var229 (nominal): 3 unique values, 1168 missing

19 properties

Number of instances (rows) of the dataset: 2000
Number of attributes (columns) of the dataset: 92
Number of distinct values of the target attribute (if it is nominal): 2
Number of missing values in the dataset: 121455
Number of instances with at least one value missing: 2000
Number of numeric attributes: 79
Number of nominal attributes: 13
Percentage of binary attributes: 2.17
Percentage of instances having missing values: 100
Percentage of missing values: 66.01
Average class difference between consecutive instances: 0.96
Percentage of numeric attributes: 85.87
Number of attributes divided by the number of instances: 0.05
Percentage of nominal attributes: 14.13
Percentage of instances belonging to the most frequent class: 98.2
Number of instances belonging to the most frequent class: 1964
Percentage of instances belonging to the least frequent class: 1.8
Number of instances belonging to the least frequent class: 36
Number of binary attributes: 2
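
Several of the properties above are ratios of the raw counts in the same list. The sketch below recomputes them from those counts as a consistency check. The formulas are assumptions about how the percentages are derived (for example, that the missing-value percentage is taken over all rows × columns cells, target included); they are not taken from OpenML's documentation, and the dictionary keys are illustrative names.

```python
# Raw counts copied from the property list above.
n_rows = 2000
n_cols = 92
n_missing = 121455
n_numeric = 79
n_nominal = 13
n_binary = 2
majority = 1964
minority = 36

# Assumed formulas: each percentage is a simple ratio of two counts,
# rounded to two decimals as on the page.
derived = {
    "pct_missing_values": round(100 * n_missing / (n_rows * n_cols), 2),
    "pct_numeric_attributes": round(100 * n_numeric / n_cols, 2),
    "pct_nominal_attributes": round(100 * n_nominal / n_cols, 2),
    "pct_binary_attributes": round(100 * n_binary / n_cols, 2),
    "pct_majority_class": round(100 * majority / n_rows, 2),
    "pct_minority_class": round(100 * minority / n_rows, 2),
    "attributes_per_instance": round(n_cols / n_rows, 2),
}

for name, value in derived.items():
    print(f"{name}: {value}")
```

Under these assumptions every recomputed value matches the listed one, and the two class counts sum to the 2000 instances (1964 + 36).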

0 tasks