KDDCup09_appetency_seed_0_nrows_2000_nclasses_10_ncols_100_stratify_True

Status: active. Format: ARFF. Visibility: public. Uploaded 17-11-2022 by Eddie Bergman.
Subsampling of the dataset KDDCup09_appetency (1111) with seed=0, args.nrows=2000, args.ncols=100, args.nclasses=10, args.no_stratify=True.

Generated with the following source code:

```python
def subsample(
    self,
    seed: int,
    nrows_max: int = 2_000,
    ncols_max: int = 100,
    nclasses_max: int = 10,
    stratified: bool = True,
) -> Dataset:
    rng = np.random.default_rng(seed)

    x = self.x
    y = self.y

    # Uniformly sample
    classes = y.unique()
    if len(classes) > nclasses_max:
        vcs = y.value_counts()
        selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
        )

        # Select the indices where one of these classes is present
        idxs = y.index[y.isin(classes)]
        x = x.iloc[idxs]
        y = y.iloc[idxs]

    # Uniformly sample columns if required
    if len(x.columns) > ncols_max:
        columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
        )
        sorted_column_idxs = sorted(columns_idxs)
        selected_columns = list(x.columns[sorted_column_idxs])
        x = x[selected_columns]
    else:
        sorted_column_idxs = list(range(len(x.columns)))

    if len(x) > nrows_max:
        # Stratify accordingly
        target_name = y.name
        data = pd.concat((x, y), axis="columns")

        _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name],
            shuffle=True,
            random_state=seed,
        )

        x = subset.drop(target_name, axis="columns")
        y = subset[target_name]

    # We need to convert categorical columns to string for openml
    categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
    columns = list(x.columns)

    return Dataset(
        # Technically this is not the same but it's where it was derived from
        dataset=self.dataset,
        x=x,
        y=y,
        categorical_mask=categorical_mask,
        columns=columns,
    )
```
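In the listing above, the class-selection branch filters rows with `y.isin(classes)` rather than `y.isin(selected_classes)`, so as written it keeps every row. The probabilities also come from `value_counts()`, whose ordering need not match `y.unique()`. This dataset has only 2 target classes, so that branch is not taken here. The following is a minimal standalone sketch of what the step appears intended to do, not the generating code itself; the function name `select_classes` is hypothetical.

```python
import numpy as np
import pandas as pd


def select_classes(y: pd.Series, nclasses_max: int, seed: int) -> pd.Series:
    """Keep only rows whose label is among at most `nclasses_max` sampled classes.

    Hypothetical helper: classes are drawn without replacement, weighted by
    their frequency, and labels and weights both come from the same
    value_counts() result so they stay aligned.
    """
    rng = np.random.default_rng(seed)
    vcs = y.value_counts()
    if len(vcs) <= nclasses_max:
        # Nothing to drop: the target already has few enough classes.
        return y
    selected = rng.choice(
        vcs.index.to_numpy(),
        size=nclasses_max,
        replace=False,
        p=(vcs / vcs.sum()).to_numpy(),
    )
    # Boolean-mask selection avoids mixing index labels with positions.
    return y[y.isin(selected)]


labels = pd.Series(list("aaabbbcccd"), name="target")
kept = select_classes(labels, nclasses_max=2, seed=0)
```

The same boolean mask (`y.isin(selected)`) could be applied to the feature frame so that `x` and `y` stay row-aligned.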

93 features

APPETENCY (target, nominal): 2 unique values, 0 missing
Var1 (numeric): 6 unique values, 1964 missing
Var2 (numeric): 1 unique values, 1950 missing
Var3 (numeric): 9 unique values, 1950 missing
Var4 (numeric): 1 unique values, 1946 missing
Var5 (numeric): 20 unique values, 1955 missing
Var6 (numeric): 493 unique values, 206 missing
Var11 (numeric): 2 unique values, 1950 missing
Var12 (numeric): 6 unique values, 1970 missing
Var14 (numeric): 3 unique values, 1950 missing
Var16 (numeric): 43 unique values, 1955 missing
Var22 (numeric): 230 unique values, 185 missing
Var25 (numeric): 90 unique values, 185 missing
Var27 (numeric): 1 unique values, 1955 missing
Var29 (numeric): 1 unique values, 1964 missing
Var37 (numeric): 24 unique values, 1946 missing
Var47 (numeric): 4 unique values, 1964 missing
Var51 (numeric): 136 unique values, 1856 missing
Var59 (numeric): 20 unique values, 1967 missing
Var62 (numeric): 3 unique values, 1970 missing
Var63 (numeric): 13 unique values, 1965 missing
Var68 (numeric): 18 unique values, 1950 missing
Var69 (numeric): 29 unique values, 1955 missing
Var70 (numeric): 23 unique values, 1955 missing
Var71 (numeric): 22 unique values, 1964 missing
Var72 (numeric): 5 unique values, 868 missing
Var73 (numeric): 113 unique values, 0 missing
Var74 (numeric): 127 unique values, 207 missing
Var76 (numeric): 1325 unique values, 185 missing
Var81 (numeric): 1761 unique values, 206 missing
Var82 (numeric): 4 unique values, 1946 missing
Var84 (numeric): 6 unique values, 1950 missing
Var85 (numeric): 54 unique values, 185 missing
Var86 (numeric): 27 unique values, 1964 missing
Var87 (numeric): 3 unique values, 1964 missing
Var88 (numeric): 20 unique values, 1967 missing
Var90 (numeric): 1 unique values, 1964 missing
Var92 (numeric): 5 unique values, 1995 missing
Var93 (numeric): 2 unique values, 1955 missing
Var94 (numeric): 1074 unique values, 868 missing
Var100 (numeric): 3 unique values, 1964 missing
Var103 (numeric): 9 unique values, 1955 missing
Var106 (numeric): 15 unique values, 1946 missing
Var107 (numeric): 7 unique values, 1955 missing
Var110 (numeric): 1 unique values, 1964 missing
Var111 (numeric): 29 unique values, 1964 missing
Var112 (numeric): 75 unique values, 185 missing
Var114 (numeric): 21 unique values, 1950 missing
Var115 (numeric): 9 unique values, 1967 missing
Var116 (numeric): 2 unique values, 1964 missing
Var117 (numeric): 25 unique values, 1946 missing
Var119 (numeric): 455 unique values, 206 missing
Var122 (numeric): 1 unique values, 1950 missing
Var124 (numeric): 17 unique values, 1946 missing
Var125 (numeric): 1162 unique values, 207 missing
Var129 (numeric): 12 unique values, 1964 missing
Var130 (numeric): 2 unique values, 1950 missing
Var133 (numeric): 1607 unique values, 185 missing
Var134 (numeric): 1465 unique values, 185 missing
Var136 (numeric): 30 unique values, 1965 missing
Var137 (numeric): 7 unique values, 1964 missing
Var139 (numeric): 26 unique values, 1955 missing
Var142 (numeric): 2 unique values, 1964 missing
Var143 (numeric): 3 unique values, 185 missing
Var148 (numeric): 22 unique values, 1955 missing
Var149 (numeric): 887 unique values, 277 missing
Var150 (numeric): 25 unique values, 1946 missing
Var151 (numeric): 5 unique values, 1972 missing
Var155 (numeric): 4 unique values, 1946 missing
Var157 (numeric): 14 unique values, 1964 missing
Var158 (numeric): 5 unique values, 1972 missing
Var161 (numeric): 3 unique values, 1946 missing
Var171 (numeric): 26 unique values, 1967 missing
Var173 (numeric): 2 unique values, 185 missing
Var180 (numeric): 31 unique values, 1964 missing
Var183 (numeric): 12 unique values, 1950 missing
Var184 (numeric): 10 unique values, 1950 missing
Var186 (numeric): 3 unique values, 1964 missing
Var192 (nominal): 238 unique values, 16 missing
Var195 (nominal): 10 unique values, 0 missing
Var198 (nominal): 935 unique values, 0 missing
Var201 (nominal): 1 unique values, 1486 missing
Var206 (nominal): 21 unique values, 206 missing
Var210 (nominal): 6 unique values, 0 missing
Var211 (nominal): 2 unique values, 0 missing
Var212 (nominal): 46 unique values, 0 missing
Var214 (nominal): 948 unique values, 1023 missing
Var216 (nominal): 376 unique values, 0 missing
Var221 (nominal): 7 unique values, 0 missing
Var224 (nominal): 1 unique values, 1967 missing
Var225 (nominal): 3 unique values, 1057 missing
Var226 (nominal): 23 unique values, 0 missing
Var229 (nominal): 4 unique values, 1127 missing
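A per-feature summary like the one above can be recomputed from a loaded copy of the data. The sketch below assumes the data is held in a pandas DataFrame and that unique-value counts exclude missing entries; both are assumptions, and the helper name `summarize_features` and the toy frame are illustrative only.

```python
import numpy as np
import pandas as pd


def summarize_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its unique-value and missing-value counts."""
    return pd.DataFrame(
        {
            # Assumption: missing entries are not counted as a distinct value.
            "n_unique": df.nunique(dropna=True),
            "n_missing": df.isna().sum(),
        }
    )


# Toy stand-in for the real data: one numeric and one nominal column.
toy = pd.DataFrame(
    {
        "Var1": [1.0, np.nan, 1.0, 2.0],
        "Var195": ["a", "b", None, None],
    }
)
summary = summarize_features(toy)
```

Columns that report 1 unique value (for example Var2 or Var201) are constant wherever they are observed, which such a summary makes easy to filter on.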

19 properties

Number of instances (rows) of the dataset: 2000
Number of attributes (columns) of the dataset: 93
Number of distinct values of the target attribute (if it is nominal): 2
Number of missing values in the dataset: 127027
Number of instances with at least one value missing: 2000
Number of numeric attributes: 77
Number of nominal attributes: 16
Percentage of binary attributes: 3.23
Percentage of instances having missing values: 100
Percentage of missing values: 68.29
Average class difference between consecutive instances: 0.96
Percentage of numeric attributes: 82.8
Number of attributes divided by the number of instances: 0.05
Percentage of nominal attributes: 17.2
Percentage of instances belonging to the most frequent class: 98.2
Number of instances belonging to the most frequent class: 1964
Percentage of instances belonging to the least frequent class: 1.8
Number of instances belonging to the least frequent class: 36
Number of binary attributes: 3
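The percentage and ratio properties follow from the raw counts in the same list. As a quick consistency check, the sketch below recomputes them from the stated counts. The rounding precision is an assumption chosen to match the displayed values, and the variable names are illustrative.

```python
# Raw counts as listed in the properties.
n_rows, n_cols = 2000, 93
n_missing = 127027
n_numeric, n_nominal, n_binary = 77, 16, 3
n_majority, n_minority = 1964, 36

derived = {
    # Missing cells as a share of all rows x columns cells.
    "pct_missing": round(100 * n_missing / (n_rows * n_cols), 2),
    "pct_numeric": round(100 * n_numeric / n_cols, 1),
    "pct_nominal": round(100 * n_nominal / n_cols, 1),
    "pct_binary": round(100 * n_binary / n_cols, 2),
    "pct_majority": round(100 * n_majority / n_rows, 1),
    "pct_minority": round(100 * n_minority / n_rows, 1),
    # Attributes divided by instances.
    "dimensionality": round(n_cols / n_rows, 2),
}

for name, value in derived.items():
    print(f"{name}: {value}")
```

The two class counts sum to the number of rows, and the numeric and nominal counts sum to the number of attributes, so the 93 attributes include the nominal target.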
