Data
porto-seguro_seed_4_nrows_2000_nclasses_10_ncols_100_stratify_True

Status: active. Format: ARFF. Publicly available. Visibility: public. Uploaded 17-11-2022 by Eddie Bergman.
Subsampling of the dataset porto-seguro (42742) with seed=4, args.nrows=2000, args.ncols=100, args.nclasses=10, args.no_stratify=True.

Generated with the following source code:

```python
def subsample(
    self,
    seed: int,
    nrows_max: int = 2_000,
    ncols_max: int = 100,
    nclasses_max: int = 10,
    stratified: bool = True,
) -> Dataset:
    rng = np.random.default_rng(seed)

    x = self.x
    y = self.y

    # Uniformly sample
    classes = y.unique()
    if len(classes) > nclasses_max:
        vcs = y.value_counts()
        selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
        )

        # Select the indices where one of these classes is present
        idxs = y.index[y.isin(classes)]
        x = x.iloc[idxs]
        y = y.iloc[idxs]

    # Uniformly sample columns if required
    if len(x.columns) > ncols_max:
        columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
        )
        sorted_column_idxs = sorted(columns_idxs)
        selected_columns = list(x.columns[sorted_column_idxs])
        x = x[selected_columns]
    else:
        sorted_column_idxs = list(range(len(x.columns)))

    if len(x) > nrows_max:
        # Stratify accordingly
        target_name = y.name
        data = pd.concat((x, y), axis="columns")
        _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name],
            shuffle=True,
            random_state=seed,
        )
        x = subset.drop(target_name, axis="columns")
        y = subset[target_name]

    # We need to convert categorical columns to string for openml
    categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
    columns = list(x.columns)

    return Dataset(
        # Technically this is not the same but it's where it was derived from
        dataset=self.dataset,
        x=x,
        y=y,
        categorical_mask=categorical_mask,
        columns=columns,
    )
```
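The row-sampling step in the source code above relies on a stratified split, which keeps each class at roughly its original share of the rows. The sketch below illustrates that idea with the standard library only; it is not the scikit-learn routine the source calls. The helper name `stratified_sample` and the synthetic population are assumptions made for illustration, with the class ratio chosen to mirror the split reported for this dataset.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(labels, n_samples, seed):
    """Return sorted row indices of a class-proportional sample.

    Hypothetical helper for illustration; not part of the original source.
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    total = len(labels)
    # Proportional allocation, rounded down per class.
    quotas = {c: n_samples * len(idxs) / total for c, idxs in by_class.items()}
    counts = {c: int(q) for c, q in quotas.items()}

    # Hand any leftover slots to the classes with the largest remainders.
    leftover = n_samples - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - counts[c], reverse=True)
    for c in by_remainder[:leftover]:
        counts[c] += 1

    # Draw the allotted number of rows from each class without replacement.
    chosen = []
    for c, idxs in by_class.items():
        chosen.extend(rng.sample(idxs, counts[c]))
    return sorted(chosen)


# Synthetic population (assumed numbers): 100,000 rows, 3.65% in class 1.
labels = [1] * 3_650 + [0] * 96_350
picked = stratified_sample(labels, n_samples=2_000, seed=4)
class_counts = Counter(labels[i] for i in picked)
```

With this synthetic input, the 2,000 sampled rows keep the population's class proportions, which is the property a stratified subsample is meant to preserve.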

58 features

| Feature | Type | Unique values | Missing values |
|---|---|---|---|
| target (target) | nominal | 2 | 0 |
| ps_ind_01 | numeric | 8 | 0 |
| ps_ind_02_cat | nominal | 4 | 3 |
| ps_ind_03 | numeric | 12 | 0 |
| ps_ind_04_cat | nominal | 2 | 1 |
| ps_ind_05_cat | nominal | 7 | 20 |
| ps_ind_06_bin | nominal | 2 | 0 |
| ps_ind_07_bin | nominal | 2 | 0 |
| ps_ind_08_bin | nominal | 2 | 0 |
| ps_ind_09_bin | nominal | 2 | 0 |
| ps_ind_10_bin | nominal | 2 | 0 |
| ps_ind_11_bin | nominal | 2 | 0 |
| ps_ind_12_bin | nominal | 2 | 0 |
| ps_ind_13_bin | nominal | 2 | 0 |
| ps_ind_14 | numeric | 4 | 0 |
| ps_ind_15 | numeric | 14 | 0 |
| ps_ind_16_bin | nominal | 2 | 0 |
| ps_ind_17_bin | nominal | 2 | 0 |
| ps_ind_18_bin | nominal | 2 | 0 |
| ps_reg_01 | numeric | 10 | 0 |
| ps_reg_02 | numeric | 19 | 0 |
| ps_reg_03 | numeric | 1116 | 411 |
| ps_car_01_cat | nominal | 12 | 1 |
| ps_car_02_cat | nominal | 2 | 0 |
| ps_car_03_cat | nominal | 2 | 1353 |
| ps_car_04_cat | nominal | 9 | 0 |
| ps_car_05_cat | nominal | 2 | 848 |
| ps_car_06_cat | nominal | 18 | 0 |
| ps_car_07_cat | nominal | 2 | 39 |
| ps_car_08_cat | nominal | 2 | 0 |
| ps_car_09_cat | nominal | 5 | 3 |
| ps_car_10_cat | nominal | 2 | 0 |
| ps_car_11_cat | nominal | 104 | 0 |
| ps_car_11 | numeric | 4 | 0 |
| ps_car_12 | numeric | 60 | 0 |
| ps_car_13 | numeric | 1842 | 0 |
| ps_car_14 | numeric | 359 | 158 |
| ps_car_15 | numeric | 15 | 0 |
| ps_calc_01 | numeric | 10 | 0 |
| ps_calc_02 | numeric | 10 | 0 |
| ps_calc_03 | numeric | 10 | 0 |
| ps_calc_04 | numeric | 6 | 0 |
| ps_calc_05 | numeric | 7 | 0 |
| ps_calc_06 | numeric | 8 | 0 |
| ps_calc_07 | numeric | 9 | 0 |
| ps_calc_08 | numeric | 10 | 0 |
| ps_calc_09 | numeric | 8 | 0 |
| ps_calc_10 | numeric | 20 | 0 |
| ps_calc_11 | numeric | 15 | 0 |
| ps_calc_12 | numeric | 7 | 0 |
| ps_calc_13 | numeric | 11 | 0 |
| ps_calc_14 | numeric | 20 | 0 |
| ps_calc_15_bin | nominal | 2 | 0 |
| ps_calc_16_bin | nominal | 2 | 0 |
| ps_calc_17_bin | nominal | 2 | 0 |
| ps_calc_18_bin | nominal | 2 | 0 |
| ps_calc_19_bin | nominal | 2 | 0 |
| ps_calc_20_bin | nominal | 2 | 0 |

19 properties

| Property | Value |
|---|---|
| Number of instances (rows) of the dataset. | 2000 |
| Number of attributes (columns) of the dataset. | 58 |
| Number of distinct values of the target attribute (if it is nominal). | 2 |
| Number of missing values in the dataset. | 2837 |
| Number of instances with at least one value missing. | 1576 |
| Number of numeric attributes. | 26 |
| Number of nominal attributes. | 32 |
| Percentage of binary attributes. | 41.38 |
| Percentage of instances having missing values. | 78.8 |
| Percentage of missing values. | 2.45 |
| Average class difference between consecutive instances. | 0.93 |
| Percentage of numeric attributes. | 44.83 |
| Number of attributes divided by the number of instances. | 0.03 |
| Percentage of nominal attributes. | 55.17 |
| Percentage of instances belonging to the most frequent class. | 96.35 |
| Number of instances belonging to the most frequent class. | 1927 |
| Percentage of instances belonging to the least frequent class. | 3.65 |
| Number of instances belonging to the least frequent class. | 73 |
| Number of binary attributes. | 24 |
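Most of the percentage and ratio properties above can be reproduced from the listed counts. The following is a minimal sketch of that arithmetic; the variable names and the `pct` helper are illustrative, and the choice of rows × columns (target included) as the denominator for the missing-value percentage is an inference that happens to match the listed figure, not a documented definition.

```python
# Counts copied from the property list above.
n_rows = 2000
n_cols = 58
n_missing_values = 2837
n_rows_with_missing = 1576
n_numeric = 26
n_nominal = 32
n_binary = 24
majority_count = 1927
minority_count = 73

# Non-zero per-feature missing counts copied from the feature list.
missing_per_feature = [3, 1, 20, 411, 1, 1353, 848, 39, 3, 158]


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals (illustrative helper)."""
    return round(100 * part / whole, 2)


derived = {
    "pct_binary_attributes": pct(n_binary, n_cols),
    "pct_numeric_attributes": pct(n_numeric, n_cols),
    "pct_nominal_attributes": pct(n_nominal, n_cols),
    "pct_instances_with_missing": pct(n_rows_with_missing, n_rows),
    # Assumption: the denominator is every cell, i.e. rows * columns.
    "pct_missing_values": pct(n_missing_values, n_rows * n_cols),
    "pct_majority_class": pct(majority_count, n_rows),
    "pct_minority_class": pct(minority_count, n_rows),
    "attributes_per_instance": round(n_cols / n_rows, 2),
}

# Consistency checks between the feature list and the property list.
total_missing = sum(missing_per_feature)
classes_cover_all_rows = majority_count + minority_count == n_rows

for name, value in derived.items():
    print(f"{name}: {value}")
```

The per-feature missing counts sum to the listed total of 2837, and the two class counts add up to the 2000 rows, so the feature list and the property list agree with each other.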

0 tasks
