amazon-commerce-reviews_seed_3_nrows_2000_nclasses_10_ncols_100_stratify_True

Status: active. Format: ARFF. Publicly available. Visibility: public. Uploaded 17-11-2022 by Eddie Bergman.
Subsampling of the dataset amazon-commerce-reviews (1457) with seed=3, args.nrows=2000, args.ncols=100, args.nclasses=10, args.no_stratify=True.

Generated with the following source code:

```python
def subsample(
    self,
    seed: int,
    nrows_max: int = 2_000,
    ncols_max: int = 100,
    nclasses_max: int = 10,
    stratified: bool = True,
) -> Dataset:
    rng = np.random.default_rng(seed)

    x = self.x
    y = self.y

    # Uniformly sample
    classes = y.unique()
    if len(classes) > nclasses_max:
        vcs = y.value_counts()
        selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
        )

        # Select the indices where one of these classes is present
        idxs = y.index[y.isin(classes)]
        x = x.iloc[idxs]
        y = y.iloc[idxs]

    # Uniformly sample columns if required
    if len(x.columns) > ncols_max:
        columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
        )
        sorted_column_idxs = sorted(columns_idxs)
        selected_columns = list(x.columns[sorted_column_idxs])
        x = x[selected_columns]
    else:
        sorted_column_idxs = list(range(len(x.columns)))

    if len(x) > nrows_max:
        # Stratify accordingly
        target_name = y.name
        data = pd.concat((x, y), axis="columns")
        _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name],
            shuffle=True,
            random_state=seed,
        )
        x = subset.drop(target_name, axis="columns")
        y = subset[target_name]

    # We need to convert categorical columns to string for openml
    categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
    columns = list(x.columns)

    return Dataset(
        # Technically this is not the same but it's where it was derived from
        dataset=self.dataset,
        x=x,
        y=y,
        categorical_mask=categorical_mask,
        columns=columns,
    )
```
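One detail of the source code above may explain why the Class target still reports 50 unique values although nclasses=10 was requested: the row filter tests membership in `classes` (every label) rather than in `selected_classes`. The following is a minimal, standard-library-only sketch of that effect, not the author's code. The toy labels, the helper `filter_rows`, and the use of `random.Random.sample` (a uniform stand-in for the weighted `rng.choice` call) are all assumptions made for illustration.

```python
import random


def filter_rows(labels, keep):
    """Return the positions of rows whose label is in `keep` (hypothetical helper)."""
    keep = set(keep)
    return [i for i, label in enumerate(labels) if label in keep]


# Hypothetical toy target: 6 classes with 3 rows each.
labels = [c for c in "ABCDEF" for _ in range(3)]
classes = sorted(set(labels))

# Uniform stand-in for rng.choice(classes, size=..., replace=False, p=...).
rng = random.Random(3)
selected_classes = rng.sample(classes, k=2)

# Filtering on all classes, as the listing does, keeps every row.
rows_all = filter_rows(labels, classes)

# Filtering on the sampled classes keeps only the rows of those classes.
rows_selected = filter_rows(labels, selected_classes)

print(len(rows_all), len(rows_selected))  # prints "18 6"
```

Under these assumptions, filtering on the full class list is a no-op, which would be consistent with all 50 classes surviving in this dataset.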

101 features

| Feature | Type | Unique values | Missing |
|---|---|---|---|
| Class (target) | nominal | 50 | 0 |
| V15 | numeric | 15 | 0 |
| V48 | numeric | 7 | 0 |
| V302 | numeric | 4 | 0 |
| V428 | numeric | 5 | 0 |
| V764 | numeric | 3 | 0 |
| V904 | numeric | 4 | 0 |
| V1021 | numeric | 3 | 0 |
| V1289 | numeric | 3 | 0 |
| V1383 | numeric | 3 | 0 |
| V1461 | numeric | 3 | 0 |
| V1483 | numeric | 3 | 0 |
| V1713 | numeric | 3 | 0 |
| V1803 | numeric | 3 | 0 |
| V1891 | numeric | 2 | 0 |
| V1966 | numeric | 4 | 0 |
| V2062 | numeric | 3 | 0 |
| V2123 | numeric | 3 | 0 |
| V2178 | numeric | 4 | 0 |
| V2244 | numeric | 3 | 0 |
| V2430 | numeric | 4 | 0 |
| V2468 | numeric | 3 | 0 |
| V2522 | numeric | 2 | 0 |
| V2529 | numeric | 3 | 0 |
| V2751 | numeric | 3 | 0 |
| V2818 | numeric | 4 | 0 |
| V2904 | numeric | 3 | 0 |
| V2912 | numeric | 3 | 0 |
| V2916 | numeric | 3 | 0 |
| V2962 | numeric | 2 | 0 |
| V2968 | numeric | 3 | 0 |
| V3117 | numeric | 3 | 0 |
| V3173 | numeric | 2 | 0 |
| V3247 | numeric | 3 | 0 |
| V3720 | numeric | 5 | 0 |
| V3836 | numeric | 3 | 0 |
| V3850 | numeric | 2 | 0 |
| V3863 | numeric | 3 | 0 |
| V3931 | numeric | 3 | 0 |
| V3980 | numeric | 4 | 0 |
| V3995 | numeric | 3 | 0 |
| V4161 | numeric | 4 | 0 |
| V4265 | numeric | 2 | 0 |
| V4273 | numeric | 3 | 0 |
| V4682 | numeric | 3 | 0 |
| V4714 | numeric | 3 | 0 |
| V4788 | numeric | 3 | 0 |
| V4851 | numeric | 2 | 0 |
| V5117 | numeric | 2 | 0 |
| V5166 | numeric | 3 | 0 |
| V5174 | numeric | 3 | 0 |
| V5237 | numeric | 3 | 0 |
| V5614 | numeric | 3 | 0 |
| V5811 | numeric | 2 | 0 |
| V5813 | numeric | 2 | 0 |
| V5920 | numeric | 2 | 0 |
| V6051 | numeric | 2 | 0 |
| V6052 | numeric | 3 | 0 |
| V6127 | numeric | 3 | 0 |
| V6270 | numeric | 2 | 0 |
| V6279 | numeric | 2 | 0 |
| V6430 | numeric | 2 | 0 |
| V6449 | numeric | 2 | 0 |
| V6510 | numeric | 2 | 0 |
| V6549 | numeric | 2 | 0 |
| V6552 | numeric | 2 | 0 |
| V6569 | numeric | 119 | 0 |
| V6580 | numeric | 51 | 0 |
| V6601 | numeric | 28 | 0 |
| V6804 | numeric | 13 | 0 |
| V6904 | numeric | 12 | 0 |
| V6974 | numeric | 15 | 0 |
| V7026 | numeric | 11 | 0 |
| V7190 | numeric | 7 | 0 |
| V7311 | numeric | 10 | 0 |
| V7384 | numeric | 13 | 0 |
| V7499 | numeric | 9 | 0 |
| V7565 | numeric | 12 | 0 |
| V7682 | numeric | 7 | 0 |
| V7793 | numeric | 8 | 0 |
| V7973 | numeric | 6 | 0 |
| V7980 | numeric | 6 | 0 |
| V8173 | numeric | 8 | 0 |
| V8266 | numeric | 6 | 0 |
| V8394 | numeric | 7 | 0 |
| V8478 | numeric | 8 | 0 |
| V8621 | numeric | 12 | 0 |
| V8701 | numeric | 6 | 0 |
| V8761 | numeric | 7 | 0 |
| V8791 | numeric | 8 | 0 |
| V8853 | numeric | 7 | 0 |
| V9003 | numeric | 6 | 0 |
| V9107 | numeric | 6 | 0 |
| V9231 | numeric | 6 | 0 |
| V9265 | numeric | 5 | 0 |
| V9312 | numeric | 6 | 0 |
| V9313 | numeric | 5 | 0 |
| V9367 | numeric | 9 | 0 |
| V9477 | numeric | 6 | 0 |
| V9541 | numeric | 5 | 0 |
| V9659 | numeric | 8 | 0 |
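The feature names above appear in ascending order of their V-number, which is consistent with the `sorted(columns_idxs)` step in the generating code: column indices are drawn without replacement and then sorted, so the retained columns keep their original relative order. Below is a minimal sketch of that step using only the standard library. It assumes the source columns are named V1 to V10000, and it uses `random.Random` as a stand-in for `np.random.default_rng`, so it will not reproduce the exact columns listed here.

```python
import random

# Assumption: the source dataset has 10,000 feature columns named V1..V10000.
all_columns = [f"V{i}" for i in range(1, 10_001)]
ncols_max = 100

# Stand-in for np.random.default_rng(seed); the draws differ from NumPy's.
rng = random.Random(3)

# Draw column positions without replacement, then sort them so the
# selected columns keep their original left-to-right order.
column_idxs = rng.sample(range(len(all_columns)), k=ncols_max)
sorted_column_idxs = sorted(column_idxs)
selected_columns = [all_columns[i] for i in sorted_column_idxs]

print(selected_columns[:5])
```

Sorting the indices also keeps `categorical_mask` aligned with the retained columns, since both are indexed by the same sorted positions.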

19 properties

| Property | Value |
|---|---|
| Number of instances (rows) of the dataset. | 1500 |
| Number of attributes (columns) of the dataset. | 101 |
| Number of distinct values of the target attribute (if it is nominal). | 50 |
| Number of missing values in the dataset. | 0 |
| Number of instances with at least one value missing. | 0 |
| Number of numeric attributes. | 100 |
| Number of nominal attributes. | 1 |
| Percentage of binary attributes. | 0 |
| Percentage of instances having missing values. | 0 |
| Percentage of missing values. | 0 |
| Average class difference between consecutive instances. | 0.97 |
| Percentage of numeric attributes. | 99.01 |
| Number of attributes divided by the number of instances. | 0.07 |
| Percentage of nominal attributes. | 0.99 |
| Percentage of instances belonging to the most frequent class. | 2 |
| Number of instances belonging to the most frequent class. | 30 |
| Percentage of instances belonging to the least frequent class. | 2 |
| Number of instances belonging to the least frequent class. | 30 |
| Number of binary attributes. | 0 |
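Several of the derived properties follow directly from the raw counts. As a quick sanity check, the sketch below recomputes them from the numbers listed on this page; the variable names are illustrative and the rounding to two decimals is an assumption about how the values are displayed.

```python
# Raw counts taken from the property list.
n_instances = 1500
n_attributes = 101
n_numeric = 100
n_nominal = 1
n_classes = 50
majority_class_size = 30
minority_class_size = 30

# Derived properties, rounded to two decimals (assumed display precision).
pct_numeric = round(100 * n_numeric / n_attributes, 2)               # 99.01
pct_nominal = round(100 * n_nominal / n_attributes, 2)               # 0.99
dimensionality = round(n_attributes / n_instances, 2)                # 0.07
pct_majority = round(100 * majority_class_size / n_instances, 2)     # 2.0
pct_minority = round(100 * minority_class_size / n_instances, 2)     # 2.0

# Equal majority and minority sizes mean the classes are perfectly balanced:
# 50 classes of 30 instances each account for all 1500 rows.
is_balanced = n_classes * majority_class_size == n_instances

print(pct_numeric, pct_nominal, dimensionality, pct_majority, is_balanced)
```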

0 tasks
