OpenML
amazon-commerce-reviews_seed_2_nrows_2000_nclasses_10_ncols_100_stratify_True

Status: active. Format: ARFF. Publicly available. Visibility: public. Uploaded 17-11-2022 by Eddie Bergman.


Subsampling of the dataset amazon-commerce-reviews (1457) with seed=2, args.nrows=2000, args.ncols=100, args.nclasses=10, args.no_stratify=True.

Generated with the following source code:

```python
def subsample(
    self,
    seed: int,
    nrows_max: int = 2_000,
    ncols_max: int = 100,
    nclasses_max: int = 10,
    stratified: bool = True,
) -> Dataset:
    rng = np.random.default_rng(seed)

    x = self.x
    y = self.y

    # Uniformly sample
    classes = y.unique()
    if len(classes) > nclasses_max:
        vcs = y.value_counts()
        selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
        )

        # Select the indices where one of these classes is present
        idxs = y.index[y.isin(classes)]
        x = x.iloc[idxs]
        y = y.iloc[idxs]

    # Uniformly sample columns if required
    if len(x.columns) > ncols_max:
        columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
        )
        sorted_column_idxs = sorted(columns_idxs)
        selected_columns = list(x.columns[sorted_column_idxs])
        x = x[selected_columns]
    else:
        sorted_column_idxs = list(range(len(x.columns)))

    if len(x) > nrows_max:
        # Stratify accordingly
        target_name = y.name
        data = pd.concat((x, y), axis="columns")
        _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name],
            shuffle=True,
            random_state=seed,
        )
        x = subset.drop(target_name, axis="columns")
        y = subset[target_name]

    # We need to convert categorical columns to string for openml
    categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
    columns = list(x.columns)

    return Dataset(
        # Technically this is not the same but it's where it was derived from
        dataset=self.dataset,
        x=x,
        y=y,
        categorical_mask=categorical_mask,
        columns=columns,
    )
```
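The class-sampling step in the source code draws `selected_classes` but then filters rows with `y.isin(classes)`, that is, against every class. One possible reading is that this is why the target still has 50 unique values despite nclasses=10. The sketch below illustrates the difference on a toy target built to match the counts reported on this page (50 classes, 30 rows each). It is not the uploader's code: `select_classes` is a hypothetical helper, and aligning the sampling weights with the `value_counts()` index is an assumption made here for clarity.

```python
import numpy as np
import pandas as pd


def select_classes(y: pd.Series, nclasses_max: int, seed: int) -> pd.Series:
    """Hypothetical helper: boolean mask keeping rows of at most nclasses_max sampled classes."""
    rng = np.random.default_rng(seed)
    vcs = y.value_counts()  # index holds class labels, values hold counts
    if len(vcs) <= nclasses_max:
        return pd.Series(True, index=y.index)
    # Sample class labels without replacement, weighted by class frequency.
    selected = rng.choice(
        vcs.index.to_numpy(),
        size=nclasses_max,
        replace=False,
        p=(vcs / vcs.sum()).to_numpy(),
    )
    return y.isin(selected)


# Toy target: 50 classes with 30 rows each (assumed layout, matching the reported counts).
y = pd.Series(np.repeat(np.arange(50), 30), name="Class")

mask_all = y.isin(y.unique())             # filter against all classes: keeps every row
mask_sel = select_classes(y, 10, seed=2)  # filter against the sampled classes only

print(mask_all.sum(), y[mask_all].nunique())  # 1500 50
print(mask_sel.sum(), y[mask_sel].nunique())  # 300 10
```

Filtering against all classes leaves the target untouched, while filtering against the sampled classes would reduce it to 10 classes and 300 rows on this toy data.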

101 features

Class (target): nominal, 50 unique values, 0 missing
V386: numeric, 7 unique values, 0 missing
V395: numeric, 5 unique values, 0 missing
V568: numeric, 5 unique values, 0 missing
V763: numeric, 4 unique values, 0 missing
V965: numeric, 4 unique values, 0 missing
V1005: numeric, 5 unique values, 0 missing
V1039: numeric, 5 unique values, 0 missing
V1066: numeric, 3 unique values, 0 missing
V1068: numeric, 3 unique values, 0 missing
V1486: numeric, 3 unique values, 0 missing
V1718: numeric, 3 unique values, 0 missing
V1858: numeric, 3 unique values, 0 missing
V1962: numeric, 3 unique values, 0 missing
V2007: numeric, 3 unique values, 0 missing
V2122: numeric, 4 unique values, 0 missing
V2157: numeric, 2 unique values, 0 missing
V2178: numeric, 4 unique values, 0 missing
V2223: numeric, 3 unique values, 0 missing
V2576: numeric, 2 unique values, 0 missing
V2582: numeric, 3 unique values, 0 missing
V2672: numeric, 3 unique values, 0 missing
V3159: numeric, 3 unique values, 0 missing
V3214: numeric, 2 unique values, 0 missing
V3303: numeric, 2 unique values, 0 missing
V3381: numeric, 2 unique values, 0 missing
V3432: numeric, 2 unique values, 0 missing
V3629: numeric, 4 unique values, 0 missing
V3770: numeric, 4 unique values, 0 missing
V3847: numeric, 3 unique values, 0 missing
V3884: numeric, 3 unique values, 0 missing
V4046: numeric, 3 unique values, 0 missing
V4071: numeric, 4 unique values, 0 missing
V4189: numeric, 4 unique values, 0 missing
V4285: numeric, 2 unique values, 0 missing
V4364: numeric, 2 unique values, 0 missing
V4387: numeric, 3 unique values, 0 missing
V4429: numeric, 5 unique values, 0 missing
V4433: numeric, 2 unique values, 0 missing
V4465: numeric, 2 unique values, 0 missing
V4516: numeric, 3 unique values, 0 missing
V4557: numeric, 3 unique values, 0 missing
V4678: numeric, 4 unique values, 0 missing
V4696: numeric, 2 unique values, 0 missing
V4740: numeric, 2 unique values, 0 missing
V4842: numeric, 3 unique values, 0 missing
V4964: numeric, 2 unique values, 0 missing
V4966: numeric, 3 unique values, 0 missing
V5000: numeric, 2 unique values, 0 missing
V5071: numeric, 3 unique values, 0 missing
V5141: numeric, 2 unique values, 0 missing
V5193: numeric, 3 unique values, 0 missing
V5211: numeric, 3 unique values, 0 missing
V5214: numeric, 2 unique values, 0 missing
V5656: numeric, 2 unique values, 0 missing
V5752: numeric, 2 unique values, 0 missing
V5782: numeric, 2 unique values, 0 missing
V5907: numeric, 2 unique values, 0 missing
V5908: numeric, 2 unique values, 0 missing
V5909: numeric, 2 unique values, 0 missing
V5916: numeric, 2 unique values, 0 missing
V6114: numeric, 2 unique values, 0 missing
V6275: numeric, 3 unique values, 0 missing
V6310: numeric, 2 unique values, 0 missing
V6411: numeric, 2 unique values, 0 missing
V6631: numeric, 23 unique values, 0 missing
V6723: numeric, 16 unique values, 0 missing
V6761: numeric, 18 unique values, 0 missing
V6772: numeric, 15 unique values, 0 missing
V6872: numeric, 12 unique values, 0 missing
V6892: numeric, 14 unique values, 0 missing
V6904: numeric, 12 unique values, 0 missing
V6927: numeric, 10 unique values, 0 missing
V6958: numeric, 12 unique values, 0 missing
V7006: numeric, 11 unique values, 0 missing
V7420: numeric, 6 unique values, 0 missing
V7655: numeric, 8 unique values, 0 missing
V7670: numeric, 9 unique values, 0 missing
V7699: numeric, 5 unique values, 0 missing
V7812: numeric, 6 unique values, 0 missing
V7849: numeric, 7 unique values, 0 missing
V7853: numeric, 7 unique values, 0 missing
V8264: numeric, 6 unique values, 0 missing
V8341: numeric, 4 unique values, 0 missing
V8447: numeric, 5 unique values, 0 missing
V8584: numeric, 4 unique values, 0 missing
V8604: numeric, 6 unique values, 0 missing
V8705: numeric, 7 unique values, 0 missing
V8794: numeric, 10 unique values, 0 missing
V8845: numeric, 5 unique values, 0 missing
V8887: numeric, 6 unique values, 0 missing
V8963: numeric, 6 unique values, 0 missing
V9010: numeric, 5 unique values, 0 missing
V9178: numeric, 5 unique values, 0 missing
V9267: numeric, 4 unique values, 0 missing
V9365: numeric, 5 unique values, 0 missing
V9520: numeric, 5 unique values, 0 missing
V9590: numeric, 6 unique values, 0 missing
V9695: numeric, 6 unique values, 0 missing
V9728: numeric, 5 unique values, 0 missing
V9809: numeric, 5 unique values, 0 missing

19 properties

Number of instances (rows) of the dataset: 1500
Number of attributes (columns) of the dataset: 101
Number of distinct values of the target attribute (if it is nominal): 50
Number of missing values in the dataset: 0
Number of instances with at least one value missing: 0
Number of numeric attributes: 100
Number of nominal attributes: 1
Percentage of binary attributes: 0
Percentage of instances having missing values: 0
Percentage of missing values: 0
Average class difference between consecutive instances: 0.97
Percentage of numeric attributes: 99.01
Number of attributes divided by the number of instances: 0.07
Percentage of nominal attributes: 0.99
Percentage of instances belonging to the most frequent class: 2
Number of instances belonging to the most frequent class: 30
Percentage of instances belonging to the least frequent class: 2
Number of instances belonging to the least frequent class: 30
Number of binary attributes: 0
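
Several of the properties above are ratios of the raw counts. As a quick consistency check, the following sketch recomputes them from the reported counts. The variable names are chosen here for illustration, and rounding to two decimals is an assumption about how the page displays values.

```python
# Raw counts as reported in the properties list.
n_rows = 1500
n_cols = 101
n_numeric = 100
n_nominal = 1
majority_count = 30
minority_count = 30

# Derived properties, rounded to two decimals (assumed display precision).
dimensionality = round(n_cols / n_rows, 2)            # attributes per instance
pct_numeric = round(100 * n_numeric / n_cols, 2)      # percentage of numeric attributes
pct_nominal = round(100 * n_nominal / n_cols, 2)      # percentage of nominal attributes
pct_majority = round(100 * majority_count / n_rows, 2)
pct_minority = round(100 * minority_count / n_rows, 2)

print(dimensionality, pct_numeric, pct_nominal, pct_majority, pct_minority)
# 0.07 99.01 0.99 2.0 2.0
```

Equal majority and minority counts of 30 across 50 classes also account for all 1500 rows, so the classes are perfectly balanced.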

0 tasks
