Data
PhishingWebsites_seed_1_nrows_2000_nclasses_10_ncols_100_stratify_True

Status: active. Format: ARFF. Publicly available (visibility: public). Uploaded 17-11-2022 by Eddie Bergman.


Subsampling of the dataset PhishingWebsites (4534) with seed=1, args.nrows=2000, args.ncols=100, args.nclasses=10, args.no_stratify=True.

Generated with the following source code:

```python
def subsample(
    self,
    seed: int,
    nrows_max: int = 2_000,
    ncols_max: int = 100,
    nclasses_max: int = 10,
    stratified: bool = True,
) -> Dataset:
    rng = np.random.default_rng(seed)

    x = self.x
    y = self.y

    # Uniformly sample
    classes = y.unique()
    if len(classes) > nclasses_max:
        vcs = y.value_counts()
        selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
        )

        # Select the indices where one of these classes is present
        idxs = y.index[y.isin(classes)]
        x = x.iloc[idxs]
        y = y.iloc[idxs]

    # Uniformly sample columns if required
    if len(x.columns) > ncols_max:
        columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
        )
        sorted_column_idxs = sorted(columns_idxs)
        selected_columns = list(x.columns[sorted_column_idxs])
        x = x[selected_columns]
    else:
        sorted_column_idxs = list(range(len(x.columns)))

    if len(x) > nrows_max:
        # Stratify accordingly
        target_name = y.name
        data = pd.concat((x, y), axis="columns")
        _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name],
            shuffle=True,
            random_state=seed,
        )
        x = subset.drop(target_name, axis="columns")
        y = subset[target_name]

    # We need to convert categorical columns to string for openml
    categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
    columns = list(x.columns)

    return Dataset(
        # Technically this is not the same but it's where it was derived from
        dataset=self.dataset,
        x=x,
        y=y,
        categorical_mask=categorical_mask,
        columns=columns,
    )
```
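The target of this dataset has 2 unique values, below the `nclasses_max=10` threshold, so the class-sampling branch of the source code above would not be expected to run here. As a hedged illustration of what that branch appears intended to do, the sketch below draws classes without replacement, weighted by their frequency, and keeps only the rows belonging to the drawn classes. The function name `select_classes` and the toy data are hypothetical. Filtering on the drawn classes, taking the labels from `value_counts()` so each probability lines up with its label, and using a boolean mask with `.loc` are all assumptions about intent, not the code that generated this dataset.

```python
import numpy as np
import pandas as pd


def select_classes(x: pd.DataFrame, y: pd.Series, nclasses_max: int, seed: int):
    """Keep rows from at most `nclasses_max` classes (hypothetical helper)."""
    rng = np.random.default_rng(seed)

    # value_counts() gives labels and counts in the same order,
    # so the probabilities below are aligned with the labels.
    vcs = y.value_counts()
    if len(vcs) <= nclasses_max:
        return x, y

    # Draw class positions without replacement, weighted by class frequency.
    picked = rng.choice(
        len(vcs),
        size=nclasses_max,
        replace=False,
        p=(vcs / vcs.sum()).to_numpy(),
    )
    selected_classes = vcs.index[picked]

    # Boolean mask + .loc works for any index, not only a default RangeIndex.
    mask = y.isin(selected_classes)
    return x.loc[mask], y.loc[mask]


# Toy data: four classes with frequencies 4, 3, 2 and 1.
toy_y = pd.Series(list("aaaabbbccd"), name="target")
toy_x = pd.DataFrame({"f": range(10)})
sub_x, sub_y = select_classes(toy_x, toy_y, nclasses_max=2, seed=1)
```

After the call, `sub_y` contains exactly two distinct classes and `sub_x` holds the matching rows.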

31 features

Result (target): nominal, 2 unique values, 0 missing
having_IP_Address: nominal, 2 unique values, 0 missing
URL_Length: nominal, 3 unique values, 0 missing
Shortining_Service: nominal, 2 unique values, 0 missing
having_At_Symbol: nominal, 2 unique values, 0 missing
double_slash_redirecting: nominal, 2 unique values, 0 missing
Prefix_Suffix: nominal, 2 unique values, 0 missing
having_Sub_Domain: nominal, 3 unique values, 0 missing
SSLfinal_State: nominal, 3 unique values, 0 missing
Domain_registeration_length: nominal, 2 unique values, 0 missing
Favicon: nominal, 2 unique values, 0 missing
port: nominal, 2 unique values, 0 missing
HTTPS_token: nominal, 2 unique values, 0 missing
Request_URL: nominal, 2 unique values, 0 missing
URL_of_Anchor: nominal, 3 unique values, 0 missing
Links_in_tags: nominal, 3 unique values, 0 missing
SFH: nominal, 3 unique values, 0 missing
Submitting_to_email: nominal, 2 unique values, 0 missing
Abnormal_URL: nominal, 2 unique values, 0 missing
Redirect: nominal, 2 unique values, 0 missing
on_mouseover: nominal, 2 unique values, 0 missing
RightClick: nominal, 2 unique values, 0 missing
popUpWidnow: nominal, 2 unique values, 0 missing
Iframe: nominal, 2 unique values, 0 missing
age_of_domain: nominal, 2 unique values, 0 missing
DNSRecord: nominal, 2 unique values, 0 missing
web_traffic: nominal, 3 unique values, 0 missing
Page_Rank: nominal, 2 unique values, 0 missing
Google_Index: nominal, 2 unique values, 0 missing
Links_pointing_to_page: nominal, 3 unique values, 0 missing
Statistical_report: nominal, 2 unique values, 0 missing

19 properties

Number of instances (rows) of the dataset: 2000
Number of attributes (columns) of the dataset: 31
Number of distinct values of the target attribute (if it is nominal): 2
Number of missing values in the dataset: 0
Number of instances with at least one value missing: 0
Number of numeric attributes: 0
Number of nominal attributes: 31
Percentage of missing values: 0
Average class difference between consecutive instances: 0.5
Percentage of numeric attributes: 0
Number of attributes divided by the number of instances: 0.02
Percentage of nominal attributes: 100
Percentage of instances belonging to the most frequent class: 55.7
Number of instances belonging to the most frequent class: 1114
Percentage of instances belonging to the least frequent class: 44.3
Number of instances belonging to the least frequent class: 886
Number of binary attributes: 23
Percentage of binary attributes: 74.19
Percentage of instances having missing values: 0
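
The ratio-style properties listed above can be reproduced from the raw counts on this page. The short check below is only an arithmetic sketch: the variable names are hypothetical, and rounding to two decimals is an assumption about how the page displays the values.

```python
# Counts taken from the property list on this page.
n_rows = 2000      # number of instances
n_cols = 31        # number of attributes, including the target
majority = 1114    # instances in the most frequent class
minority = 886     # instances in the least frequent class
n_binary = 23      # attributes with exactly two unique values

# Derived ratios, rounded to two decimals (assumed display precision).
majority_pct = round(100 * majority / n_rows, 2)
minority_pct = round(100 * minority / n_rows, 2)
dimensionality = round(n_cols / n_rows, 2)
binary_pct = round(100 * n_binary / n_cols, 2)

print(majority_pct, minority_pct, dimensionality, binary_pct)
```

The two class counts sum to the 2000 instances, and the 23 binary attributes are the 31 attributes minus the 8 that have 3 unique values.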

0 tasks
