KDDCup99_seed_3_nrows_2000_nclasses_10_ncols_100_stratify_True

Status: active. Format: ARFF. Publicly available. Visibility: public. Uploaded 17-11-2022 by Eddie Bergman.

Subsampling of the dataset KDDCup99 (42746) with seed=3 args.nrows=2000 args.ncols=100 args.nclasses=10 args.no_stratify=True

Generated with the following source code:

```python
def subsample(
    self,
    seed: int,
    nrows_max: int = 2_000,
    ncols_max: int = 100,
    nclasses_max: int = 10,
    stratified: bool = True,
) -> Dataset:
    rng = np.random.default_rng(seed)

    x = self.x
    y = self.y

    # Uniformly sample
    classes = y.unique()
    if len(classes) > nclasses_max:
        vcs = y.value_counts()
        selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
        )

        # Select the indices where one of these classes is present
        idxs = y.index[y.isin(classes)]
        x = x.iloc[idxs]
        y = y.iloc[idxs]

    # Uniformly sample columns if required
    if len(x.columns) > ncols_max:
        columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
        )
        sorted_column_idxs = sorted(columns_idxs)
        selected_columns = list(x.columns[sorted_column_idxs])
        x = x[selected_columns]
    else:
        sorted_column_idxs = list(range(len(x.columns)))

    if len(x) > nrows_max:
        # Stratify accordingly
        target_name = y.name
        data = pd.concat((x, y), axis="columns")
        _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name],
            shuffle=True,
            random_state=seed,
        )
        x = subset.drop(target_name, axis="columns")
        y = subset[target_name]

    # We need to convert categorical columns to string for openml
    categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
    columns = list(x.columns)

    return Dataset(
        # Technically this is not the same but it's where it was derived from
        dataset=self.dataset,
        x=x,
        y=y,
        categorical_mask=categorical_mask,
        columns=columns,
    )
```
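The row-subsampling step in the source code relies on stratified sampling, so that the class proportions of the subset roughly follow those of the full dataset. The sketch below illustrates that idea with the standard library only. It is not the generator's code and does not reproduce scikit-learn's `train_test_split`: the function name `stratified_subsample` and the largest-remainder quota rule are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_subsample(labels, n_rows, seed):
    """Return sorted row indices of a class-proportional subsample.

    Hypothetical helper for illustration only: per-class quotas are
    assigned with the largest-remainder rule, which is an assumption
    and not necessarily what scikit-learn does internally.
    """
    total = len(labels)
    if n_rows > total:
        raise ValueError("n_rows must not exceed the number of labels")

    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Integer arithmetic: quota is the floor of the proportional share,
    # remainder is used to decide who receives the leftover rows.
    quota, remainder = {}, {}
    for label, idxs in by_class.items():
        quota[label], remainder[label] = divmod(n_rows * len(idxs), total)

    # Hand out the rows lost to flooring, largest remainder first.
    leftover = n_rows - sum(quota.values())
    for label in sorted(remainder, key=remainder.get, reverse=True)[:leftover]:
        quota[label] += 1

    # Draw the required number of rows from each class without replacement.
    chosen = []
    for label, idxs in by_class.items():
        chosen.extend(rng.sample(idxs, quota[label]))
    return sorted(chosen)


# Toy labels with a 60/30/10 split; a 20-row subsample keeps that ratio.
toy_labels = ["a"] * 60 + ["b"] * 30 + ["c"] * 10
toy_subset = stratified_subsample(toy_labels, n_rows=20, seed=3)
toy_counts = Counter(toy_labels[i] for i in toy_subset)  # a: 12, b: 6, c: 2
```

Under this rule a very rare class can receive a quota of zero, so a subsample may contain fewer classes than the full dataset.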

42 features

target (target): nominal, 8 unique values, 0 missing
duration: numeric, 38 unique values, 0 missing
protocol_type: nominal, 3 unique values, 0 missing
service: nominal, 28 unique values, 0 missing
flag: nominal, 6 unique values, 0 missing
src_bytes: numeric, 209 unique values, 0 missing
dst_bytes: numeric, 266 unique values, 0 missing
land: nominal, 1 unique value, 0 missing
wrong_fragment: numeric, 1 unique value, 0 missing
urgent: numeric, 1 unique value, 0 missing
hot: numeric, 4 unique values, 0 missing
num_failed_logins: numeric, 1 unique value, 0 missing
logged_in: nominal, 2 unique values, 0 missing
num_compromised: numeric, 2 unique values, 0 missing
root_shell: nominal, 1 unique value, 0 missing
su_attempted: nominal, 1 unique value, 0 missing
num_root: numeric, 5 unique values, 0 missing
num_file_creations: numeric, 1 unique value, 0 missing
num_shells: numeric, 1 unique value, 0 missing
num_access_files: numeric, 2 unique values, 0 missing
num_outbound_cmds: numeric, 1 unique value, 0 missing
is_host_login: nominal, 1 unique value, 0 missing
is_guest_login: nominal, 2 unique values, 0 missing
count: numeric, 248 unique values, 0 missing
srv_count: numeric, 99 unique values, 0 missing
serror_rate: numeric, 10 unique values, 0 missing
srv_serror_rate: numeric, 4 unique values, 0 missing
rerror_rate: numeric, 9 unique values, 0 missing
srv_rerror_rate: numeric, 5 unique values, 0 missing
same_srv_rate: numeric, 36 unique values, 0 missing
diff_srv_rate: numeric, 20 unique values, 0 missing
srv_diff_host_rate: numeric, 36 unique values, 0 missing
dst_host_count: numeric, 138 unique values, 0 missing
dst_host_srv_count: numeric, 119 unique values, 0 missing
dst_host_same_srv_rate: numeric, 68 unique values, 0 missing
dst_host_diff_srv_rate: numeric, 39 unique values, 0 missing
dst_host_same_src_port_rate: numeric, 44 unique values, 0 missing
dst_host_srv_diff_host_rate: numeric, 29 unique values, 0 missing
dst_host_serror_rate: numeric, 10 unique values, 0 missing
dst_host_srv_serror_rate: numeric, 5 unique values, 0 missing
dst_host_rerror_rate: numeric, 18 unique values, 0 missing
dst_host_srv_rerror_rate: numeric, 20 unique values, 0 missing

19 properties

Number of instances (rows) of the dataset: 2000
Number of attributes (columns) of the dataset: 42
Number of distinct values of the target attribute (if it is nominal): 8
Number of missing values in the dataset: 0
Number of instances with at least one value missing: 0
Number of numeric attributes: 32
Number of nominal attributes: 10
Percentage of binary attributes: 11.9
Percentage of instances having missing values: 0
Percentage of missing values: 0
Average class difference between consecutive instances: 0.4
Percentage of numeric attributes: 76.19
Number of attributes divided by the number of instances: 0.02
Percentage of nominal attributes: 23.81
Percentage of instances belonging to the most frequent class: 57.35
Number of instances belonging to the most frequent class: 1147
Percentage of instances belonging to the least frequent class: 0.05
Number of instances belonging to the least frequent class: 1
Number of binary attributes: 5
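
The percentage and ratio properties listed here follow from the raw counts by simple arithmetic. As a sanity check, this short sketch recomputes them from the counts on this page. The variable names are illustrative, and the rounding precision is assumed from the number of decimals displayed.

```python
# Raw counts as listed in the properties of this dataset.
n_instances = 2000
n_attributes = 42
n_numeric = 32
n_nominal = 10
n_binary = 5
majority_class_size = 1147
minority_class_size = 1

# Derived properties, rounded to the precision shown on the page (assumed).
pct_numeric = round(100 * n_numeric / n_attributes, 2)            # 76.19
pct_nominal = round(100 * n_nominal / n_attributes, 2)            # 23.81
pct_binary = round(100 * n_binary / n_attributes, 1)              # 11.9
pct_majority = round(100 * majority_class_size / n_instances, 2)  # 57.35
pct_minority = round(100 * minority_class_size / n_instances, 2)  # 0.05
dimensionality = round(n_attributes / n_instances, 2)             # 0.02
```

The numeric and nominal counts also add up to the 42 attributes, target included.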

0 tasks