dna_seed_2_nrows_2000_nclasses_10_ncols_100_stratify_True

Status: active. Format: ARFF. Visibility: public. Uploaded 17-11-2022 by Eddie Bergman.


Subsampling of the dataset dna (40670) with seed=2, args.nrows=2000, args.ncols=100, args.nclasses=10, args.no_stratify=True.

Generated with the following source code:

```python
def subsample(
    self,
    seed: int,
    nrows_max: int = 2_000,
    ncols_max: int = 100,
    nclasses_max: int = 10,
    stratified: bool = True,
) -> Dataset:
    rng = np.random.default_rng(seed)

    x = self.x
    y = self.y

    # Uniformly sample
    classes = y.unique()
    if len(classes) > nclasses_max:
        vcs = y.value_counts()
        selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
        )

        # Select the indices where one of these classes is present
        idxs = y.index[y.isin(classes)]
        x = x.iloc[idxs]
        y = y.iloc[idxs]

    # Uniformly sample columns if required
    if len(x.columns) > ncols_max:
        columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
        )
        sorted_column_idxs = sorted(columns_idxs)
        selected_columns = list(x.columns[sorted_column_idxs])
        x = x[selected_columns]
    else:
        sorted_column_idxs = list(range(len(x.columns)))

    if len(x) > nrows_max:
        # Stratify accordingly
        target_name = y.name
        data = pd.concat((x, y), axis="columns")
        _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name],
            shuffle=True,
            random_state=seed,
        )
        x = subset.drop(target_name, axis="columns")
        y = subset[target_name]

    # We need to convert categorical columns to string for openml
    categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
    columns = list(x.columns)

    return Dataset(
        # Technically this is not the same but it's where it was derived from
        dataset=self.dataset,
        x=x,
        y=y,
        categorical_mask=categorical_mask,
        columns=columns,
    )
```

101 features

class (target): nominal, 3 unique values, 0 missing
A4: nominal, 2 unique values, 0 missing
A5: nominal, 2 unique values, 0 missing
A6: nominal, 2 unique values, 0 missing
A8: nominal, 2 unique values, 0 missing
A9: nominal, 2 unique values, 0 missing
A12: nominal, 2 unique values, 0 missing
A14: nominal, 2 unique values, 0 missing
A15: nominal, 2 unique values, 0 missing
A16: nominal, 2 unique values, 0 missing
A17: nominal, 2 unique values, 0 missing
A19: nominal, 2 unique values, 0 missing
A21: nominal, 2 unique values, 0 missing
A22: nominal, 2 unique values, 0 missing
A24: nominal, 2 unique values, 0 missing
A25: nominal, 2 unique values, 0 missing
A26: nominal, 2 unique values, 0 missing
A28: nominal, 2 unique values, 0 missing
A29: nominal, 2 unique values, 0 missing
A30: nominal, 2 unique values, 0 missing
A33: nominal, 2 unique values, 0 missing
A35: nominal, 2 unique values, 0 missing
A38: nominal, 2 unique values, 0 missing
A39: nominal, 2 unique values, 0 missing
A40: nominal, 2 unique values, 0 missing
A41: nominal, 2 unique values, 0 missing
A42: nominal, 2 unique values, 0 missing
A44: nominal, 2 unique values, 0 missing
A45: nominal, 2 unique values, 0 missing
A46: nominal, 2 unique values, 0 missing
A54: nominal, 2 unique values, 0 missing
A57: nominal, 2 unique values, 0 missing
A58: nominal, 2 unique values, 0 missing
A61: nominal, 2 unique values, 0 missing
A63: nominal, 2 unique values, 0 missing
A64: nominal, 2 unique values, 0 missing
A65: nominal, 2 unique values, 0 missing
A66: nominal, 2 unique values, 0 missing
A67: nominal, 2 unique values, 0 missing
A68: nominal, 2 unique values, 0 missing
A70: nominal, 2 unique values, 0 missing
A71: nominal, 2 unique values, 0 missing
A72: nominal, 2 unique values, 0 missing
A73: nominal, 2 unique values, 0 missing
A74: nominal, 2 unique values, 0 missing
A75: nominal, 2 unique values, 0 missing
A76: nominal, 2 unique values, 0 missing
A78: nominal, 2 unique values, 0 missing
A79: nominal, 2 unique values, 0 missing
A83: nominal, 2 unique values, 0 missing
A86: nominal, 2 unique values, 0 missing
A88: nominal, 2 unique values, 0 missing
A89: nominal, 2 unique values, 0 missing
A90: nominal, 2 unique values, 0 missing
A91: nominal, 2 unique values, 0 missing
A92: nominal, 2 unique values, 0 missing
A94: nominal, 2 unique values, 0 missing
A95: nominal, 2 unique values, 0 missing
A96: nominal, 2 unique values, 0 missing
A99: nominal, 2 unique values, 0 missing
A100: nominal, 2 unique values, 0 missing
A102: nominal, 2 unique values, 0 missing
A103: nominal, 2 unique values, 0 missing
A105: nominal, 2 unique values, 0 missing
A107: nominal, 2 unique values, 0 missing
A108: nominal, 2 unique values, 0 missing
A110: nominal, 2 unique values, 0 missing
A111: nominal, 2 unique values, 0 missing
A112: nominal, 2 unique values, 0 missing
A117: nominal, 2 unique values, 0 missing
A118: nominal, 2 unique values, 0 missing
A121: nominal, 2 unique values, 0 missing
A125: nominal, 2 unique values, 0 missing
A126: nominal, 2 unique values, 0 missing
A127: nominal, 2 unique values, 0 missing
A132: nominal, 2 unique values, 0 missing
A133: nominal, 2 unique values, 0 missing
A135: nominal, 2 unique values, 0 missing
A136: nominal, 2 unique values, 0 missing
A137: nominal, 2 unique values, 0 missing
A139: nominal, 2 unique values, 0 missing
A143: nominal, 2 unique values, 0 missing
A144: nominal, 2 unique values, 0 missing
A145: nominal, 2 unique values, 0 missing
A147: nominal, 2 unique values, 0 missing
A149: nominal, 2 unique values, 0 missing
A153: nominal, 2 unique values, 0 missing
A154: nominal, 2 unique values, 0 missing
A155: nominal, 2 unique values, 0 missing
A156: nominal, 2 unique values, 0 missing
A159: nominal, 2 unique values, 0 missing
A160: nominal, 2 unique values, 0 missing
A162: nominal, 2 unique values, 0 missing
A166: nominal, 2 unique values, 0 missing
A167: nominal, 2 unique values, 0 missing
A169: nominal, 2 unique values, 0 missing
A172: nominal, 2 unique values, 0 missing
A173: nominal, 2 unique values, 0 missing
A174: nominal, 2 unique values, 0 missing
A176: nominal, 2 unique values, 0 missing
A177: nominal, 2 unique values, 0 missing

19 properties

Number of instances (rows) of the dataset: 2000
Number of attributes (columns) of the dataset: 101
Number of distinct values of the target attribute (if it is nominal): 3
Number of missing values in the dataset: 0
Number of instances with at least one value missing: 0
Number of numeric attributes: 0
Number of nominal attributes: 101
Number of attributes divided by the number of instances: 0.05
Percentage of numeric attributes: 0
Percentage of instances belonging to the most frequent class: 51.9
Percentage of nominal attributes: 100
Number of instances belonging to the most frequent class: 1038
Percentage of instances belonging to the least frequent class: 24
Number of instances belonging to the least frequent class: 480
Number of binary attributes: 100
Percentage of binary attributes: 99.01
Percentage of instances having missing values: 0
Average class difference between consecutive instances: 0.38
Percentage of missing values: 0
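The ratio-style properties above follow from the raw counts, so they can be cross-checked by hand. The sketch below is illustrative and not part of the OpenML page: the variable names are made up for this example, and rounding to two decimals is an assumption about how the page displays values. It also derives the size of the third class, which the page does not list directly.

```python
# Raw counts copied from the property list above.
n_instances = 2000
n_attributes = 101
n_binary = 100
majority_count = 1038
minority_count = 480

# Derived properties, rounded to two decimals (assumed display precision).
majority_pct = round(100 * majority_count / n_instances, 2)
minority_pct = round(100 * minority_count / n_instances, 2)
binary_pct = round(100 * n_binary / n_attributes, 2)
dimensionality = round(n_attributes / n_instances, 2)

# The target has exactly three classes, so the remaining class
# accounts for every row not in the largest or smallest class.
middle_count = n_instances - majority_count - minority_count

print(majority_pct, minority_pct, binary_pct, dimensionality, middle_count)
```

The derived values agree with the reported 51.9, 24, 99.01 and 0.05, and the class that is neither largest nor smallest works out to 482 instances.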

0 tasks
