dna_seed_0_nrows_2000_nclasses_10_ncols_100_stratify_True

Status: active. Format: ARFF. Visibility: public. Uploaded 17-11-2022 by Eddie Bergman.

Subsampling of the dataset dna (40670) with seed=0, args.nrows=2000, args.ncols=100, args.nclasses=10, args.no_stratify=True. Generated with the following source code:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


# Method excerpt: `self` is an instance of the authors' Dataset class (not shown).
def subsample(
    self,
    seed: int,
    nrows_max: int = 2_000,
    ncols_max: int = 100,
    nclasses_max: int = 10,
    stratified: bool = True,
) -> Dataset:
    rng = np.random.default_rng(seed)

    x = self.x
    y = self.y

    # Uniformly sample
    classes = y.unique()
    if len(classes) > nclasses_max:
        vcs = y.value_counts()
        selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
        )

        # Select the indices where one of these classes is present
        # NOTE: as published, the filter tests membership in `classes`,
        # not `selected_classes`, so no rows are dropped by this branch.
        idxs = y.index[y.isin(classes)]
        x = x.iloc[idxs]
        y = y.iloc[idxs]

    # Uniformly sample columns if required
    if len(x.columns) > ncols_max:
        columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
        )
        sorted_column_idxs = sorted(columns_idxs)
        selected_columns = list(x.columns[sorted_column_idxs])
        x = x[selected_columns]
    else:
        sorted_column_idxs = list(range(len(x.columns)))

    if len(x) > nrows_max:
        # Stratify accordingly
        # NOTE: `stratified` is never read in this excerpt; the split below
        # always stratifies on the target column.
        target_name = y.name
        data = pd.concat((x, y), axis="columns")
        # The "test" part of the split is the subsample that is kept.
        _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name],
            shuffle=True,
            random_state=seed,
        )
        x = subset.drop(target_name, axis="columns")
        y = subset[target_name]

    # We need to convert categorical columns to string for openml
    categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
    columns = list(x.columns)

    return Dataset(
        # Technically this is not the same but it's where it was derived from
        dataset=self.dataset,
        x=x,
        y=y,
        categorical_mask=categorical_mask,
        columns=columns,
    )
```
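The procedure described above can be tried on synthetic data with the sketch below. It is not the authors' code: `subsample_frame` is a hypothetical standalone function that takes a plain `pandas` frame and target instead of a `Dataset` object. It also assumes that rows are meant to be filtered by the sampled classes and that the sampling probabilities should be aligned with the class labels. The synthetic frame (5,000 rows, 180 binary columns, 3 classes) is made up for illustration and is not the dna data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def subsample_frame(
    x: pd.DataFrame,
    y: pd.Series,
    seed: int,
    nrows_max: int = 2_000,
    ncols_max: int = 100,
    nclasses_max: int = 10,
) -> tuple[pd.DataFrame, pd.Series]:
    """Hypothetical standalone variant of the subsampling steps."""
    rng = np.random.default_rng(seed)

    # Sample classes with probability proportional to their frequency.
    # Labels and probabilities both come from `vcs`, so they stay aligned.
    vcs = y.value_counts()
    if len(vcs) > nclasses_max:
        selected = rng.choice(
            vcs.index.to_numpy(),
            size=nclasses_max,
            replace=False,
            p=(vcs / vcs.sum()).to_numpy(),
        )
        # Assumption: keep only rows whose class was selected.
        mask = y.isin(selected)
        x, y = x.loc[mask], y.loc[mask]

    # Uniformly sample columns, preserving their original order.
    if x.shape[1] > ncols_max:
        col_idxs = np.sort(rng.choice(x.shape[1], size=ncols_max, replace=False))
        x = x.iloc[:, col_idxs]

    # Stratified row subsample: the "test" split is the part that is kept.
    if len(x) > nrows_max:
        _, x, _, y = train_test_split(
            x, y, test_size=nrows_max, stratify=y, shuffle=True, random_state=seed
        )
    return x, y


# Synthetic stand-in data (not the dna dataset).
demo_rng = np.random.default_rng(0)
x_demo = pd.DataFrame(
    demo_rng.integers(0, 2, size=(5_000, 180)),
    columns=[f"A{i}" for i in range(180)],
)
y_demo = pd.Series(demo_rng.integers(0, 3, size=5_000), name="class")

x_sub, y_sub = subsample_frame(x_demo, y_demo, seed=0)
print(x_sub.shape)  # (2000, 100)
```

Because the column indices are sorted before selection, the surviving feature names keep their original order, which matches the ascending but gapped names (A0, A1, A2, A3, A6, ...) in the feature list of this dataset.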

101 features

class (target): nominal, 3 unique values, 0 missing
A0: nominal, 2 unique values, 0 missing
A1: nominal, 2 unique values, 0 missing
A2: nominal, 2 unique values, 0 missing
A3: nominal, 2 unique values, 0 missing
A6: nominal, 2 unique values, 0 missing
A8: nominal, 2 unique values, 0 missing
A9: nominal, 2 unique values, 0 missing
A10: nominal, 2 unique values, 0 missing
A11: nominal, 2 unique values, 0 missing
A14: nominal, 2 unique values, 0 missing
A15: nominal, 2 unique values, 0 missing
A19: nominal, 2 unique values, 0 missing
A22: nominal, 2 unique values, 0 missing
A26: nominal, 2 unique values, 0 missing
A28: nominal, 2 unique values, 0 missing
A33: nominal, 2 unique values, 0 missing
A35: nominal, 2 unique values, 0 missing
A39: nominal, 2 unique values, 0 missing
A41: nominal, 2 unique values, 0 missing
A42: nominal, 2 unique values, 0 missing
A45: nominal, 2 unique values, 0 missing
A46: nominal, 2 unique values, 0 missing
A47: nominal, 2 unique values, 0 missing
A48: nominal, 2 unique values, 0 missing
A50: nominal, 2 unique values, 0 missing
A51: nominal, 2 unique values, 0 missing
A52: nominal, 2 unique values, 0 missing
A53: nominal, 2 unique values, 0 missing
A55: nominal, 2 unique values, 0 missing
A56: nominal, 2 unique values, 0 missing
A57: nominal, 2 unique values, 0 missing
A59: nominal, 2 unique values, 0 missing
A61: nominal, 2 unique values, 0 missing
A62: nominal, 2 unique values, 0 missing
A64: nominal, 2 unique values, 0 missing
A66: nominal, 2 unique values, 0 missing
A67: nominal, 2 unique values, 0 missing
A68: nominal, 2 unique values, 0 missing
A69: nominal, 2 unique values, 0 missing
A70: nominal, 2 unique values, 0 missing
A73: nominal, 2 unique values, 0 missing
A74: nominal, 2 unique values, 0 missing
A78: nominal, 2 unique values, 0 missing
A80: nominal, 2 unique values, 0 missing
A82: nominal, 2 unique values, 0 missing
A83: nominal, 2 unique values, 0 missing
A84: nominal, 2 unique values, 0 missing
A85: nominal, 2 unique values, 0 missing
A90: nominal, 2 unique values, 0 missing
A91: nominal, 2 unique values, 0 missing
A92: nominal, 2 unique values, 0 missing
A93: nominal, 2 unique values, 0 missing
A94: nominal, 2 unique values, 0 missing
A97: nominal, 2 unique values, 0 missing
A98: nominal, 2 unique values, 0 missing
A100: nominal, 2 unique values, 0 missing
A101: nominal, 2 unique values, 0 missing
A106: nominal, 2 unique values, 0 missing
A107: nominal, 2 unique values, 0 missing
A108: nominal, 2 unique values, 0 missing
A109: nominal, 2 unique values, 0 missing
A110: nominal, 2 unique values, 0 missing
A111: nominal, 2 unique values, 0 missing
A118: nominal, 2 unique values, 0 missing
A121: nominal, 2 unique values, 0 missing
A122: nominal, 2 unique values, 0 missing
A123: nominal, 2 unique values, 0 missing
A124: nominal, 2 unique values, 0 missing
A125: nominal, 2 unique values, 0 missing
A126: nominal, 2 unique values, 0 missing
A128: nominal, 2 unique values, 0 missing
A129: nominal, 2 unique values, 0 missing
A131: nominal, 2 unique values, 0 missing
A132: nominal, 2 unique values, 0 missing
A133: nominal, 2 unique values, 0 missing
A134: nominal, 2 unique values, 0 missing
A135: nominal, 2 unique values, 0 missing
A138: nominal, 2 unique values, 0 missing
A140: nominal, 2 unique values, 0 missing
A142: nominal, 2 unique values, 0 missing
A145: nominal, 2 unique values, 0 missing
A147: nominal, 2 unique values, 0 missing
A149: nominal, 2 unique values, 0 missing
A150: nominal, 2 unique values, 0 missing
A153: nominal, 2 unique values, 0 missing
A155: nominal, 2 unique values, 0 missing
A157: nominal, 2 unique values, 0 missing
A158: nominal, 2 unique values, 0 missing
A159: nominal, 2 unique values, 0 missing
A160: nominal, 2 unique values, 0 missing
A161: nominal, 2 unique values, 0 missing
A162: nominal, 2 unique values, 0 missing
A163: nominal, 2 unique values, 0 missing
A164: nominal, 2 unique values, 0 missing
A165: nominal, 2 unique values, 0 missing
A166: nominal, 2 unique values, 0 missing
A167: nominal, 2 unique values, 0 missing
A169: nominal, 2 unique values, 0 missing
A170: nominal, 2 unique values, 0 missing
A174: nominal, 2 unique values, 0 missing

19 properties

Number of instances (rows) of the dataset: 2000
Number of attributes (columns) of the dataset: 101
Number of distinct values of the target attribute (if it is nominal): 3
Number of missing values in the dataset: 0
Number of instances with at least one value missing: 0
Number of numeric attributes: 0
Number of nominal attributes: 101
Percentage of binary attributes: 99.01
Percentage of instances having missing values: 0
Average class difference between consecutive instances: 0.37
Percentage of missing values: 0
Number of attributes divided by the number of instances: 0.05
Percentage of numeric attributes: 0
Percentage of instances belonging to the most frequent class: 51.9
Percentage of nominal attributes: 100
Number of instances belonging to the most frequent class: 1038
Percentage of instances belonging to the least frequent class: 24
Number of instances belonging to the least frequent class: 480
Number of binary attributes: 100
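Several of the properties above are simple ratios of the raw counts, so they can be cross-checked by hand. The short sketch below recomputes them from the counts listed on this page; the variable names are illustrative only, and the size of the third class is inferred by subtraction rather than reported directly.

```python
# Raw counts as listed in the properties above.
n_rows = 2000
n_attributes = 101   # 100 binary features plus the 3-valued target
n_binary = 100
n_majority = 1038
n_minority = 480

derived = {
    "pct_binary_attributes": round(100 * n_binary / n_attributes, 2),
    "attributes_per_instance": round(n_attributes / n_rows, 2),
    "pct_majority_class": round(100 * n_majority / n_rows, 2),
    "pct_minority_class": round(100 * n_minority / n_rows, 2),
    # With exactly 3 classes, the remaining class holds the leftover rows.
    "n_third_class": n_rows - n_majority - n_minority,
}

for name, value in derived.items():
    print(f"{name}: {value}")
# pct_binary_attributes: 99.01
# attributes_per_instance: 0.05
# pct_majority_class: 51.9
# pct_minority_class: 24.0
# n_third_class: 482
```

The recomputed values agree with the listed percentages, and the class sizes (1038, 482, 480) show that the subsample is noticeably imbalanced toward one class.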

0 tasks