dna_seed_4_nrows_2000_nclasses_10_ncols_100_stratify_True

Status: active. Format: ARFF. Visibility: public. Uploaded 17-11-2022 by Eddie Bergman.
Subsampling of the dataset dna (40670) with seed=4, args.nrows=2000, args.ncols=100, args.nclasses=10, args.no_stratify=True.

Generated with the following source code:

```python
def subsample(
    self,
    seed: int,
    nrows_max: int = 2_000,
    ncols_max: int = 100,
    nclasses_max: int = 10,
    stratified: bool = True,
) -> Dataset:
    rng = np.random.default_rng(seed)

    x = self.x
    y = self.y

    # Uniformly sample
    classes = y.unique()
    if len(classes) > nclasses_max:
        vcs = y.value_counts()
        selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
        )

        # Select the indices where one of these classes is present
        idxs = y.index[y.isin(classes)]
        x = x.iloc[idxs]
        y = y.iloc[idxs]

    # Uniformly sample columns if required
    if len(x.columns) > ncols_max:
        columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
        )
        sorted_column_idxs = sorted(columns_idxs)
        selected_columns = list(x.columns[sorted_column_idxs])
        x = x[selected_columns]
    else:
        sorted_column_idxs = list(range(len(x.columns)))

    if len(x) > nrows_max:
        # Stratify accordingly
        target_name = y.name
        data = pd.concat((x, y), axis="columns")
        _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name],
            shuffle=True,
            random_state=seed,
        )
        x = subset.drop(target_name, axis="columns")
        y = subset[target_name]

    # We need to convert categorical columns to string for openml
    categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
    columns = list(x.columns)

    return Dataset(
        # Technically this is not the same but it's where it was derived from
        dataset=self.dataset,
        x=x,
        y=y,
        categorical_mask=categorical_mask,
        columns=columns,
    )
```
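The row-subsampling step above keeps class proportions roughly intact while cutting the data down to `nrows_max` rows. As a rough illustration of that idea only, here is a minimal standard-library sketch. It is not the code used to generate this dataset (which relies on scikit-learn's `train_test_split`); the function name `stratified_subsample`, the largest-remainder allocation, and the toy labels are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_subsample(labels, n_rows, seed):
    """Return sorted row indices of a class-proportional sample of size n_rows.

    Hypothetical helper for illustration; not part of the original source code.
    """
    if n_rows >= len(labels):
        # Nothing to subsample: keep every row.
        return list(range(len(labels)))

    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    # Give each class its proportional share, rounded down ...
    total = len(labels)
    quotas = {c: n_rows * len(idx) / total for c, idx in by_class.items()}
    alloc = {c: int(q) for c, q in quotas.items()}

    # ... then hand the leftover rows to the largest fractional remainders.
    leftover = n_rows - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - alloc[c], reverse=True)
    for c in by_remainder[:leftover]:
        alloc[c] += 1

    # Draw the allotted number of rows from each class without replacement.
    chosen = []
    for c, idx in by_class.items():
        chosen.extend(rng.sample(idx, alloc[c]))
    return sorted(chosen)


# Toy labels with a 60/30/10 class split (made up for this example).
labels = ["a"] * 60 + ["b"] * 30 + ["c"] * 10
idx = stratified_subsample(labels, n_rows=20, seed=4)
sample_counts = Counter(labels[i] for i in idx)
```

With these toy labels the 20 sampled rows keep the 60/30/10 split, and the same seed always yields the same rows.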

101 features

class (target): nominal, 3 unique values, 0 missing
A4: nominal, 2 unique values, 0 missing
A6: nominal, 2 unique values, 0 missing
A7: nominal, 2 unique values, 0 missing
A11: nominal, 2 unique values, 0 missing
A14: nominal, 2 unique values, 0 missing
A16: nominal, 2 unique values, 0 missing
A17: nominal, 2 unique values, 0 missing
A18: nominal, 2 unique values, 0 missing
A20: nominal, 2 unique values, 0 missing
A21: nominal, 2 unique values, 0 missing
A22: nominal, 2 unique values, 0 missing
A23: nominal, 2 unique values, 0 missing
A25: nominal, 2 unique values, 0 missing
A27: nominal, 2 unique values, 0 missing
A29: nominal, 2 unique values, 0 missing
A31: nominal, 2 unique values, 0 missing
A34: nominal, 2 unique values, 0 missing
A36: nominal, 2 unique values, 0 missing
A38: nominal, 2 unique values, 0 missing
A40: nominal, 2 unique values, 0 missing
A41: nominal, 2 unique values, 0 missing
A42: nominal, 2 unique values, 0 missing
A45: nominal, 2 unique values, 0 missing
A48: nominal, 2 unique values, 0 missing
A49: nominal, 2 unique values, 0 missing
A50: nominal, 2 unique values, 0 missing
A54: nominal, 2 unique values, 0 missing
A55: nominal, 2 unique values, 0 missing
A58: nominal, 2 unique values, 0 missing
A61: nominal, 2 unique values, 0 missing
A64: nominal, 2 unique values, 0 missing
A65: nominal, 2 unique values, 0 missing
A67: nominal, 2 unique values, 0 missing
A68: nominal, 2 unique values, 0 missing
A73: nominal, 2 unique values, 0 missing
A74: nominal, 2 unique values, 0 missing
A75: nominal, 2 unique values, 0 missing
A77: nominal, 2 unique values, 0 missing
A78: nominal, 2 unique values, 0 missing
A79: nominal, 2 unique values, 0 missing
A80: nominal, 2 unique values, 0 missing
A82: nominal, 2 unique values, 0 missing
A83: nominal, 2 unique values, 0 missing
A84: nominal, 2 unique values, 0 missing
A85: nominal, 2 unique values, 0 missing
A86: nominal, 2 unique values, 0 missing
A88: nominal, 2 unique values, 0 missing
A92: nominal, 2 unique values, 0 missing
A94: nominal, 2 unique values, 0 missing
A99: nominal, 2 unique values, 0 missing
A100: nominal, 2 unique values, 0 missing
A101: nominal, 2 unique values, 0 missing
A103: nominal, 2 unique values, 0 missing
A105: nominal, 2 unique values, 0 missing
A107: nominal, 2 unique values, 0 missing
A108: nominal, 2 unique values, 0 missing
A110: nominal, 2 unique values, 0 missing
A112: nominal, 2 unique values, 0 missing
A113: nominal, 2 unique values, 0 missing
A114: nominal, 2 unique values, 0 missing
A115: nominal, 2 unique values, 0 missing
A116: nominal, 2 unique values, 0 missing
A118: nominal, 2 unique values, 0 missing
A119: nominal, 2 unique values, 0 missing
A121: nominal, 2 unique values, 0 missing
A124: nominal, 2 unique values, 0 missing
A125: nominal, 2 unique values, 0 missing
A126: nominal, 2 unique values, 0 missing
A130: nominal, 2 unique values, 0 missing
A131: nominal, 2 unique values, 0 missing
A133: nominal, 2 unique values, 0 missing
A134: nominal, 2 unique values, 0 missing
A135: nominal, 2 unique values, 0 missing
A136: nominal, 2 unique values, 0 missing
A137: nominal, 2 unique values, 0 missing
A140: nominal, 2 unique values, 0 missing
A141: nominal, 2 unique values, 0 missing
A142: nominal, 2 unique values, 0 missing
A143: nominal, 2 unique values, 0 missing
A144: nominal, 2 unique values, 0 missing
A148: nominal, 2 unique values, 0 missing
A149: nominal, 2 unique values, 0 missing
A150: nominal, 2 unique values, 0 missing
A153: nominal, 2 unique values, 0 missing
A154: nominal, 2 unique values, 0 missing
A155: nominal, 2 unique values, 0 missing
A157: nominal, 2 unique values, 0 missing
A158: nominal, 2 unique values, 0 missing
A159: nominal, 2 unique values, 0 missing
A160: nominal, 2 unique values, 0 missing
A165: nominal, 2 unique values, 0 missing
A166: nominal, 2 unique values, 0 missing
A167: nominal, 2 unique values, 0 missing
A169: nominal, 2 unique values, 0 missing
A170: nominal, 2 unique values, 0 missing
A171: nominal, 2 unique values, 0 missing
A172: nominal, 2 unique values, 0 missing
A177: nominal, 2 unique values, 0 missing
A178: nominal, 2 unique values, 0 missing
A179: nominal, 2 unique values, 0 missing

19 properties

Number of instances (rows) of the dataset: 2000
Number of attributes (columns) of the dataset: 101
Number of distinct values of the target attribute (if it is nominal): 3
Number of missing values in the dataset: 0
Number of instances with at least one value missing: 0
Number of numeric attributes: 0
Number of nominal attributes: 101
Percentage of binary attributes: 99.01
Percentage of instances having missing values: 0
Average class difference between consecutive instances: 0.39
Percentage of missing values: 0
Number of attributes divided by the number of instances: 0.05
Percentage of numeric attributes: 0
Percentage of instances belonging to the most frequent class: 51.9
Percentage of nominal attributes: 100
Number of instances belonging to the most frequent class: 1038
Percentage of instances belonging to the least frequent class: 24
Number of instances belonging to the least frequent class: 480
Number of binary attributes: 100
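Several of the properties above are simple ratios of the listed counts, so they can be cross-checked by hand. The short sketch below recomputes them from the reported values; the variable and key names are chosen for this example only, and the size of the third class is derived here (it is not listed on the page) from the fact that the target has exactly 3 distinct values.

```python
# Counts as reported in the property list.
n_rows = 2000
n_attributes = 101
n_binary = 100
majority_count = 1038
minority_count = 480


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


derived = {
    "pct_binary_attributes": pct(n_binary, n_attributes),
    "pct_majority_class": pct(majority_count, n_rows),
    "pct_minority_class": pct(minority_count, n_rows),
    "attributes_per_instance": round(n_attributes / n_rows, 2),
    # With 3 classes, the middle class holds whatever rows remain.
    "remaining_class_count": n_rows - majority_count - minority_count,
}

for name, value in derived.items():
    print(f"{name}: {value}")
```

The recomputed ratios agree with the listed 99.01, 51.9, 24 and 0.05, and the third class comes out at 482 instances.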

0 tasks
