OpenML
albert_seed_2_nrows_2000_nclasses_10_ncols_100_stratify_True

Status: active | Format: ARFF | Publicly available | Visibility: public | Uploaded 17-11-2022 by Eddie Bergman
Subsampling of the dataset albert (41147) with seed=2, args.nrows=2000, args.ncols=100, args.nclasses=10, args.no_stratify=True.

Generated with the following source code:

```python
def subsample(
    self,
    seed: int,
    nrows_max: int = 2_000,
    ncols_max: int = 100,
    nclasses_max: int = 10,
    stratified: bool = True,
) -> Dataset:
    rng = np.random.default_rng(seed)

    x = self.x
    y = self.y

    # Uniformly sample
    classes = y.unique()
    if len(classes) > nclasses_max:
        vcs = y.value_counts()
        selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
        )

        # Select the indices where one of these classes is present
        idxs = y.index[y.isin(classes)]
        x = x.iloc[idxs]
        y = y.iloc[idxs]

    # Uniformly sample columns if required
    if len(x.columns) > ncols_max:
        columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
        )
        sorted_column_idxs = sorted(columns_idxs)
        selected_columns = list(x.columns[sorted_column_idxs])
        x = x[selected_columns]
    else:
        sorted_column_idxs = list(range(len(x.columns)))

    if len(x) > nrows_max:
        # Stratify accordingly
        target_name = y.name
        data = pd.concat((x, y), axis="columns")
        _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name],
            shuffle=True,
            random_state=seed,
        )
        x = subset.drop(target_name, axis="columns")
        y = subset[target_name]

    # We need to convert categorical columns to string for openml
    categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
    columns = list(x.columns)

    return Dataset(
        # Technically this is not the same but it's where it was derived from
        dataset=self.dataset,
        x=x,
        y=y,
        categorical_mask=categorical_mask,
        columns=columns,
    )
```
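The row-subsampling step in the source code relies on `train_test_split` with `stratify=` so that the 2000 kept rows preserve the class proportions of the parent dataset. The sketch below illustrates the same idea using only the Python standard library. It is a hypothetical stand-in, not the author's code and not scikit-learn's algorithm: the function name `stratified_sample`, the largest-remainder rounding, and the toy labels are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(labels, n_samples, seed):
    """Return sorted row indices of a sample that keeps class proportions.

    Hypothetical helper for illustration; assumes n_samples <= len(labels).
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    total = len(labels)
    # Ideal (fractional) share of the sample for each class.
    quotas = {c: n_samples * len(ix) / total for c, ix in by_class.items()}
    counts = {c: int(q) for c, q in quotas.items()}

    # Hand out rows lost to rounding down, largest remainder first.
    leftover = n_samples - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - counts[c], reverse=True)
    for c in by_remainder[:leftover]:
        counts[c] += 1

    # Draw the allotted number of rows from each class without replacement.
    chosen = []
    for c, ix in by_class.items():
        chosen.extend(rng.sample(ix, counts[c]))
    return sorted(chosen)


# Toy stand-in for a balanced binary target (labels are made up).
labels = ["0", "1"] * 5_000
idxs = stratified_sample(labels, n_samples=2_000, seed=2)
sampled = [labels[i] for i in idxs]
print(len(idxs), sampled.count("0"), sampled.count("1"))  # 2000 1000 1000
```

With a balanced two-class target, a stratified draw of 2000 rows yields 1000 rows per class, which matches the class counts reported in the properties of this dataset.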

79 features

| Feature | Type | Unique values | Missing |
|---|---|---|---|
| class (target) | nominal | 2 | 0 |
| V1 | numeric | 46 | 838 |
| V2 | numeric | 336 | 0 |
| V3 | numeric | 127 | 446 |
| V4 | numeric | 49 | 437 |
| V5 | numeric | 1398 | 40 |
| V6 | numeric | 327 | 390 |
| V7 | numeric | 141 | 83 |
| V8 | numeric | 51 | 1 |
| V9 | numeric | 413 | 83 |
| V10 | numeric | 5 | 838 |
| V11 | numeric | 46 | 83 |
| V12 | numeric | 20 | 1494 |
| V13 | numeric | 55 | 437 |
| V14 | nominal | 84 | 0 |
| V15 | nominal | 252 | 0 |
| V16 | nominal | 1419 | 0 |
| V17 | nominal | 1110 | 0 |
| V18 | nominal | 30 | 0 |
| V19 | nominal | 7 | 0 |
| V20 | nominal | 1170 | 0 |
| V21 | nominal | 52 | 0 |
| V22 | nominal | 3 | 0 |
| V23 | nominal | 1031 | 0 |
| V24 | nominal | 922 | 0 |
| V25 | nominal | 1344 | 0 |
| V26 | nominal | 815 | 0 |
| V27 | nominal | 20 | 0 |
| V28 | nominal | 882 | 0 |
| V29 | nominal | 1249 | 0 |
| V30 | nominal | 9 | 0 |
| V31 | nominal | 598 | 0 |
| V32 | nominal | 178 | 0 |
| V33 | nominal | 4 | 0 |
| V34 | nominal | 1299 | 0 |
| V35 | nominal | 7 | 0 |
| V36 | nominal | 12 | 0 |
| V37 | nominal | 786 | 0 |
| V38 | nominal | 31 | 0 |
| V39 | nominal | 649 | 0 |
| V40 | numeric | 143 | 71 |
| V41 | nominal | 4 | 0 |
| V42 | numeric | 50 | 0 |
| V43 | numeric | 332 | 373 |
| V44 | nominal | 1291 | 0 |
| V45 | nominal | 9 | 0 |
| V46 | nominal | 1201 | 0 |
| V47 | nominal | 9 | 0 |
| V48 | nominal | 27 | 0 |
| V49 | nominal | 255 | 0 |
| V50 | numeric | 153 | 83 |
| V51 | numeric | 37 | 73 |
| V52 | numeric | 6 | 819 |
| V53 | numeric | 20 | 1516 |
| V54 | nominal | 255 | 0 |
| V55 | nominal | 1450 | 0 |
| V56 | nominal | 1051 | 0 |
| V57 | nominal | 659 | 0 |
| V58 | nominal | 805 | 0 |
| V59 | numeric | 164 | 92 |
| V60 | nominal | 256 | 0 |
| V61 | nominal | 819 | 0 |
| V62 | nominal | 762 | 0 |
| V63 | nominal | 12 | 0 |
| V64 | numeric | 23 | 1515 |
| V65 | nominal | 21 | 0 |
| V66 | nominal | 1167 | 0 |
| V67 | numeric | 20 | 1522 |
| V68 | nominal | 21 | 0 |
| V69 | numeric | 21 | 1502 |
| V70 | nominal | 27 | 0 |
| V71 | nominal | 31 | 0 |
| V72 | numeric | 44 | 76 |
| V73 | nominal | 34 | 0 |
| V74 | nominal | 32 | 0 |
| V75 | numeric | 40 | 76 |
| V76 | nominal | 32 | 0 |
| V77 | nominal | 889 | 0 |
| V78 | nominal | 787 | 0 |

19 properties

| Property | Value |
|---|---|
| Number of instances (rows) of the dataset. | 2000 |
| Number of attributes (columns) of the dataset. | 79 |
| Number of distinct values of the target attribute (if it is nominal). | 2 |
| Number of missing values in the dataset. | 12888 |
| Number of instances with at least one value missing. | 2000 |
| Number of numeric attributes. | 26 |
| Number of nominal attributes. | 53 |
| Number of attributes divided by the number of instances. | 0.04 |
| Percentage of numeric attributes. | 32.91 |
| Percentage of instances belonging to the most frequent class. | 50 |
| Percentage of nominal attributes. | 67.09 |
| Number of instances belonging to the most frequent class. | 1000 |
| Percentage of instances belonging to the least frequent class. | 50 |
| Number of instances belonging to the least frequent class. | 1000 |
| Number of binary attributes. | 1 |
| Percentage of binary attributes. | 1.27 |
| Percentage of instances having missing values. | 100 |
| Average class difference between consecutive instances. | 0.5 |
| Percentage of missing values. | 8.16 |
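Several of the properties above are derived from the raw counts, so they can be cross-checked against the feature list. The snippet below is a small sanity check, not part of the OpenML page: the numbers are copied by hand from this page, and the helper name `pct` is an assumption made for illustration.

```python
# Counts copied from the properties on this page.
n_rows = 2000
n_attributes = 79  # V1..V78 plus the target column `class`
n_numeric = 26
n_nominal = 53
n_binary = 1

# Missing-value counts for every feature that has at least one missing value,
# copied from the feature list (all other features report 0 missing).
missing = {
    "V1": 838, "V3": 446, "V4": 437, "V5": 40, "V6": 390, "V7": 83,
    "V8": 1, "V9": 83, "V10": 838, "V11": 83, "V12": 1494, "V13": 437,
    "V40": 71, "V43": 373, "V50": 83, "V51": 73, "V52": 819, "V53": 1516,
    "V59": 92, "V64": 1515, "V67": 1522, "V69": 1502, "V72": 76, "V75": 76,
}


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


total_missing = sum(missing.values())
print(total_missing)                                # 12888
print(pct(n_numeric, n_attributes))                 # 32.91
print(pct(n_nominal, n_attributes))                 # 67.09
print(pct(n_binary, n_attributes))                  # 1.27
print(pct(total_missing, n_rows * n_attributes))    # 8.16
```

The per-feature missing counts sum to the reported 12888, and the percentages are taken over 79 attributes (or over all 2000 × 79 cells for the missing-value rate).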

0 tasks
