Microsoft

Status: active. Format: ARFF. Visibility: public. Uploaded 05-07-2023 by Matthias Feurer.


# Microsoft Learning to Rank Datasets

## Dataset Descriptions

The datasets are machine learning data in which queries and URLs are represented by IDs. They consist of feature vectors extracted from query-URL pairs along with relevance judgment labels:

1. The relevance judgments are obtained from a retired labeling set of a commercial web search engine (Microsoft Bing) and take 5 values from 0 (irrelevant) to 4 (perfectly relevant).
2. The features were extracted by us and are those widely used in the research community.

In the data files, each row corresponds to a query-URL pair. The first column is the relevance label of the pair, the second column is the query ID, and the remaining columns are features. The larger the relevance label, the more relevant the query-URL pair. A query-URL pair is represented by a 136-dimensional feature vector.

Below are two rows from the MSLR-WEB10K dataset:

    0 qid:1 1:3 2:0 3:2 4:2 ... 135:0 136:0
    2 qid:1 1:3 2:3 3:0 4:0 ... 135:0 136:0

## Dataset Partition

We have partitioned each dataset into five parts with about the same number of queries, denoted S1, S2, S3, S4, and S5, for five-fold cross-validation. In each fold, we propose using three parts for training, one part for validation, and the remaining part for testing (see the following table). The training set is used to learn ranking models. The validation set is used to tune the hyperparameters of the learning algorithms, such as the number of iterations in RankBoost and the combination coefficient in the objective function of Ranking SVM. The test set is used to evaluate the performance of the learned ranking models.

| Folds | Training Set | Validation Set | Test Set |
|-------|--------------|----------------|----------|
| Fold1 | {S1,S2,S3}   | S4             | S5       |
| Fold2 | {S2,S3,S4}   | S5             | S1       |
| Fold3 | {S3,S4,S5}   | S1             | S2       |
| Fold4 | {S4,S5,S1}   | S2             | S3       |
| Fold5 | {S5,S1,S2}   | S3             | S4       |

## Reference

You can cite this dataset as below.

```
@article{DBLP:journals/corr/QinL13,
  author    = {Tao Qin and Tie{-}Yan Liu},
  title     = {Introducing {LETOR} 4.0 Datasets},
  journal   = {CoRR},
  volume    = {abs/1306.2597},
  year      = {2013},
  url       = {http://arxiv.org/abs/1306.2597},
  timestamp = {Mon, 01 Jul 2013 20:31:25 +0200},
  biburl    = {http://dblp.uni-trier.de/rec/bib/journals/corr/QinL13},
  bibsource = {dblp computer science bibliography, http://dblp.org}
}
```

## Note

* This is a learning-to-rank dataset and should not be used for standard classification tasks. It is only coded this way to enable reproducing the work "Tabular data: Deep learning is not all you need" by Ravid Shwartz-Ziv and Amitai Armon.
* This dataset concatenates the training, validation, and test sets from Fold1.
* This is the 10K version (MSLR-WEB10K).
* The uploader shortened the word "variance" in the feature names to "var" to comply with OpenML's maximum feature name length.
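The row format and the fold rotation described in the dataset description can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: the helper names `parse_row` and `fold_plan` are hypothetical, the parser assumes integer query IDs and complete rows (the `...` in the sample rows above is an elision, so the example uses a shortened row), and the rotation simply reproduces the table from the Dataset Partition section.

```python
def parse_row(line):
    """Parse one 'label qid:<id> <index>:<value> ...' row.

    Returns (label, query_id, features), where features maps the
    1-based feature index to its value as a float.
    """
    tokens = line.split()
    label = int(tokens[0])                      # relevance label, 0..4
    query_id = int(tokens[1].split(":", 1)[1])  # assumes integer query IDs
    features = {}
    for token in tokens[2:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, query_id, features


def fold_plan(parts=("S1", "S2", "S3", "S4", "S5")):
    """Rotate the parts: three for training, one for validation, one for test."""
    n = len(parts)
    plan = []
    for k in range(n):
        rotated = [parts[(k + i) % n] for i in range(n)]
        plan.append({"train": rotated[:3], "valid": rotated[3], "test": rotated[4]})
    return plan


if __name__ == "__main__":
    # Shortened row with only the first four features, for illustration.
    print(parse_row("2 qid:1 1:3 2:3 3:0 4:0"))
    for number, fold in enumerate(fold_plan(), start=1):
        print(f"Fold{number}", fold)
```

Since this OpenML version concatenates the three Fold1 sets, the original split boundaries are not recoverable from `fold_plan` alone; the function only documents which parts play which role in each fold.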

137 features

| Feature | Type | Unique values | Missing |
|---------|------|---------------|---------|
| `relevance` (target) | nominal | 5 | 0 |
| `query_id` (row identifier) | numeric | 10000 | 0 |
| `covered_query_term_number-body` | numeric | 22 | 0 |
| `covered_query_term_number-anchor` | numeric | 10 | 0 |
| `covered_query_term_number-title` | numeric | 19 | 0 |
| `covered_query_term_number-url` | numeric | 13 | 0 |
| `covered_query_term_number-whole_document` | numeric | 22 | 0 |
| `covered_query_term_ratio-body` | numeric | 68 | 0 |
| `covered_query_term_ratio-anchor` | numeric | 38 | 0 |
| `covered_query_term_ratio-title` | numeric | 54 | 0 |
| `covered_query_term_ratio-url` | numeric | 43 | 0 |
| `covered_query_term_ratio-whole_document` | numeric | 68 | 0 |
| `stream_length-body` | numeric | 5840 | 0 |
| `stream_length-anchor` | numeric | 226 | 0 |
| `stream_length-title` | numeric | 627 | 0 |
| `stream_length-url` | numeric | 107 | 0 |
| `stream_length-whole_document` | numeric | 5961 | 0 |
| `IDF(Inverse_document_frequency)-body` | numeric | 9208 | 0 |
| `IDF(Inverse_document_frequency)-anchor` | numeric | 8511 | 0 |
| `IDF(Inverse_document_frequency)-title` | numeric | 8691 | 0 |
| `IDF(Inverse_document_frequency)-url` | numeric | 8677 | 0 |
| `IDF(Inverse_document_frequency)-whole_document` | numeric | 9213 | 0 |
| `sum_of_term_frequency-body` | numeric | 797 | 0 |
| `sum_of_term_frequency-anchor` | numeric | 89 | 0 |
| `sum_of_term_frequency-title` | numeric | 160 | 0 |
| `sum_of_term_frequency-url` | numeric | 20 | 0 |
| `sum_of_term_frequency-whole_document` | numeric | 818 | 0 |
| `min_of_term_frequency-body` | numeric | 298 | 0 |
| `min_of_term_frequency-anchor` | numeric | 50 | 0 |
| `min_of_term_frequency-title` | numeric | 46 | 0 |
| `min_of_term_frequency-url` | numeric | 12 | 0 |
| `min_of_term_frequency-whole_document` | numeric | 307 | 0 |
| `max_of_term_frequency-body` | numeric | 587 | 0 |
| `max_of_term_frequency-anchor` | numeric | 54 | 0 |
| `max_of_term_frequency-title` | numeric | 129 | 0 |
| `max_of_term_frequency-url` | numeric | 17 | 0 |
| `max_of_term_frequency-whole_document` | numeric | 615 | 0 |
| `mean_of_term_frequency-body` | numeric | 2561 | 0 |
| `mean_of_term_frequency-anchor` | numeric | 223 | 0 |
| `mean_of_term_frequency-title` | numeric | 324 | 0 |
| `mean_of_term_frequency-url` | numeric | 80 | 0 |
| `mean_of_term_frequency-whole_document` | numeric | 2597 | 0 |
| `var_of_term_frequency-body` | numeric | 21008 | 0 |
| `var_of_term_frequency-anchor` | numeric | 470 | 0 |
| `var_of_term_frequency-title` | numeric | 666 | 0 |
| `var_of_term_frequency-url` | numeric | 123 | 0 |
| `var_of_term_frequency-whole_document` | numeric | 21951 | 0 |
| `sum_of_stream_length_normalized_term_frequency-body` | numeric | 89847 | 0 |
| `sum_of_stream_length_normalized_term_frequency-anchor` | numeric | 1153 | 0 |
| `sum_of_stream_length_normalized_term_frequency-title` | numeric | 1674 | 0 |
| `sum_of_stream_length_normalized_term_frequency-url` | numeric | 300 | 0 |
| `sum_of_stream_length_normalized_term_frequency-whole_document` | numeric | 94720 | 0 |
| `min_of_stream_length_normalized_term_frequency-body` | numeric | 41428 | 0 |
| `min_of_stream_length_normalized_term_frequency-anchor` | numeric | 639 | 0 |
| `min_of_stream_length_normalized_term_frequency-title` | numeric | 852 | 0 |
| `min_of_stream_length_normalized_term_frequency-url` | numeric | 139 | 0 |
| `min_of_stream_length_normalized_term_frequency-whole_document` | numeric | 44050 | 0 |
| `max_of_stream_length_normalized_term_frequency-body` | numeric | 66961 | 0 |
| `max_of_stream_length_normalized_term_frequency-anchor` | numeric | 854 | 0 |
| `max_of_stream_length_normalized_term_frequency-title` | numeric | 1373 | 0 |
| `max_of_stream_length_normalized_term_frequency-url` | numeric | 196 | 0 |
| `max_of_stream_length_normalized_term_frequency-whole_document` | numeric | 69811 | 0 |
| `mean_of_stream_length_normalized_term_frequency-body` | numeric | 63890 | 0 |
| `mean_of_stream_length_normalized_term_frequency-anchor` | numeric | 1479 | 0 |
| `mean_of_stream_length_normalized_term_frequency-title` | numeric | 2125 | 0 |
| `mean_of_stream_length_normalized_term_frequency-url` | numeric | 540 | 0 |
| `mean_of_stream_length_normalized_term_frequency-whole_document` | numeric | 67228 | 0 |
| `var_of_stream_length_normalized_term_frequency-body` | numeric | 8774 | 0 |
| `var_of_stream_length_normalized_term_frequency-anchor` | numeric | 2319 | 0 |
| `var_of_stream_length_normalized_term_frequency-title` | numeric | 2946 | 0 |
| `var_of_stream_length_normalized_term_frequency-url` | numeric | 864 | 0 |
| `var_of_stream_length_normalized_term_frequency-whole_document` | numeric | 8887 | 0 |
| `sum_of_tf*idf-body` | numeric | 619392 | 0 |
| `sum_of_tf*idf-anchor` | numeric | 34996 | 0 |
| `sum_of_tf*idf-title` | numeric | 67401 | 0 |
| `sum_of_tf*idf-url` | numeric | 30162 | 0 |
| `sum_of_tf*idf-whole_document` | numeric | 655452 | 0 |
| `min_of_tf*idf-body` | numeric | 108789 | 0 |
| `min_of_tf*idf-anchor` | numeric | 7878 | 0 |
| `min_of_tf*idf-title` | numeric | 9462 | 0 |
| `min_of_tf*idf-url` | numeric | 4812 | 0 |
| `min_of_tf*idf-whole_document` | numeric | 116800 | 0 |
| `max_of_tf*idf-body` | numeric | 215802 | 0 |
| `max_of_tf*idf-anchor` | numeric | 15520 | 0 |
| `max_of_tf*idf-title` | numeric | 19132 | 0 |
| `max_of_tf*idf-url` | numeric | 9531 | 0 |
| `max_of_tf*idf-whole_document` | numeric | 228653 | 0 |
| `mean_of_tf*idf-body` | numeric | 625056 | 0 |
| `mean_of_tf*idf-anchor` | numeric | 40443 | 0 |
| `mean_of_tf*idf-title` | numeric | 74892 | 0 |
| `mean_of_tf*idf-url` | numeric | 36212 | 0 |
| `mean_of_tf*idf-whole_document` | numeric | 661356 | 0 |
| `var_of_tf*idf-body` | numeric | 583371 | 0 |
| `var_of_tf*idf-anchor` | numeric | 38005 | 0 |
| `var_of_tf*idf-title` | numeric | 72885 | 0 |
| `var_of_tf*idf-url` | numeric | 34786 | 0 |
| `var_of_tf*idf-whole_document` | numeric | 616183 | 0 |
| `boolean_model-body` | numeric | 2 | 0 |
| `boolean_model-anchor` | numeric | 2 | 0 |
| `boolean_model-title` | numeric | 2 | 0 |
| `boolean_model-url` | numeric | 2 | 0 |
| `boolean_model-whole_document` | numeric | 2 | 0 |
| `vector_space_model-body` | numeric | 276184 | 0 |
| `vector_space_model-anchor` | numeric | 27447 | 0 |
| `vector_space_model-title` | numeric | 55303 | 0 |
| `vector_space_model-url` | numeric | 31999 | 0 |
| `vector_space_model-whole_document` | numeric | 280515 | 0 |
| `BM25-body` | numeric | 952619 | 0 |
| `BM25-anchor` | numeric | 80924 | 0 |
| `BM25-title` | numeric | 249263 | 0 |
| `BM25-url` | numeric | 115107 | 0 |
| `BM25-whole_document` | numeric | 1022469 | 0 |
| `LMIR.ABS-body` | numeric | 941872 | 0 |
| `LMIR.ABS-anchor` | numeric | 99016 | 0 |
| `LMIR.ABS-title` | numeric | 327100 | 0 |
| `LMIR.ABS-url` | numeric | 163631 | 0 |
| `LMIR.ABS-whole_document` | numeric | 1005749 | 0 |
| `LMIR.DIR-body` | numeric | 945167 | 0 |
| `LMIR.DIR-anchor` | numeric | 100884 | 0 |
| `LMIR.DIR-title` | numeric | 305059 | 0 |
| `LMIR.DIR-url` | numeric | 159639 | 0 |
| `LMIR.DIR-whole_document` | numeric | 1011588 | 0 |
| `LMIR.JM-body` | numeric | 899692 | 0 |
| `LMIR.JM-anchor` | numeric | 78232 | 0 |
| `LMIR.JM-title` | numeric | 242188 | 0 |
| `LMIR.JM-url` | numeric | 130083 | 0 |
| `LMIR.JM-whole_document` | numeric | 953189 | 0 |
| `Number_of_slash_in_URL` | numeric | 27 | 0 |
| `Length_of_URL` | numeric | 424 | 0 |
| `Inlink_number` | numeric | 29030 | 0 |
| `Outlink_number` | numeric | 139 | 0 |
| `PageRank` | numeric | 65102 | 0 |
| `SiteRank` | numeric | 60686 | 0 |
| `QualityScore` | numeric | 254 | 0 |
| `QualityScore2` | numeric | 255 | 0 |
| `Query-url_click_count` | numeric | 5326 | 0 |
| `url_click_count` | numeric | 3745 | 0 |
| `url_dwell_time` | numeric | 91823 | 0 |
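The feature names above follow a regular pattern: 25 feature families, each computed over five text streams (body, anchor, title, url, whole_document), followed by 11 document-level features, for 136 in total. The sketch below regenerates the names in the listed order. The function name `feature_names` is hypothetical, and treating the 1-based position in this list as the feature index used in the raw `index:value` rows is an assumption based on the ordering shown here, not something this page states.

```python
# Text streams, in the order they appear within each feature family.
STREAMS = ["body", "anchor", "title", "url", "whole_document"]

# Families built as <statistic>_of_<base>; "var" is the uploader's
# abbreviation of "variance".
STAT_FAMILIES = [
    f"{stat}_of_{base}"
    for base in ("term_frequency",
                 "stream_length_normalized_term_frequency",
                 "tf*idf")
    for stat in ("sum", "min", "max", "mean", "var")
]

FAMILIES = (
    ["covered_query_term_number",
     "covered_query_term_ratio",
     "stream_length",
     "IDF(Inverse_document_frequency)"]
    + STAT_FAMILIES
    + ["boolean_model", "vector_space_model", "BM25",
       "LMIR.ABS", "LMIR.DIR", "LMIR.JM"]
)

# Features that are not computed per stream.
DOCUMENT_FEATURES = [
    "Number_of_slash_in_URL", "Length_of_URL", "Inlink_number",
    "Outlink_number", "PageRank", "SiteRank", "QualityScore",
    "QualityScore2", "Query-url_click_count", "url_click_count",
    "url_dwell_time",
]


def feature_names():
    """Return the 136 feature names in the order listed on this page."""
    per_stream = [f"{family}-{stream}"
                  for family in FAMILIES
                  for stream in STREAMS]
    return per_stream + DOCUMENT_FEATURES


if __name__ == "__main__":
    names = feature_names()
    print(len(names))
    print(names[:5])
```

Together with `relevance` and `query_id`, these 136 names account for the 137 columns of the dataset.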

19 properties

| Property | Value |
|----------|-------|
| Number of instances (rows) of the dataset. | 1200192 |
| Number of attributes (columns) of the dataset. | 137 |
| Number of distinct values of the target attribute (if it is nominal). | 5 |
| Number of missing values in the dataset. | 0 |
| Number of instances with at least one value missing. | 0 |
| Number of numeric attributes. | 136 |
| Number of nominal attributes. | 1 |
| Percentage of binary attributes. | 0 |
| Percentage of instances having missing values. | 0 |
| Percentage of missing values. | 0 |
| Average class difference between consecutive instances. | 0.46 |
| Percentage of numeric attributes. | 99.27 |
| Number of attributes divided by the number of instances. | 0 |
| Percentage of nominal attributes. | 0.73 |
| Percentage of instances belonging to the most frequent class. | 52.01 |
| Number of instances belonging to the most frequent class. | 624263 |
| Percentage of instances belonging to the least frequent class. | 0.74 |
| Number of instances belonging to the least frequent class. | 8881 |
| Number of binary attributes. | 0 |
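The percentage properties can be recomputed from the counts listed above, which is a quick way to check that the figures are consistent. The sketch below uses only numbers from this page; the helper name `pct` is hypothetical, and rounding to two decimals is an assumption about how the page displays values.

```python
# Counts copied from the properties listed on this page.
N_ROWS = 1200192
N_ATTRIBUTES = 137
N_NUMERIC = 136
N_NOMINAL = 1
MAJORITY_CLASS_ROWS = 624263
MINORITY_CLASS_ROWS = 8881


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


if __name__ == "__main__":
    print(pct(MAJORITY_CLASS_ROWS, N_ROWS))   # 52.01
    print(pct(MINORITY_CLASS_ROWS, N_ROWS))   # 0.74
    print(pct(N_NUMERIC, N_ATTRIBUTES))       # 99.27
    print(pct(N_NOMINAL, N_ATTRIBUTES))       # 0.73
```

The majority-class share is also the accuracy obtained by always predicting the most frequent relevance label, so it is a useful floor when reading classification results on this dataset.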

1 task

Estimation procedure: 4-fold cross-validation. Target feature: relevance. Runs: 0.
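The task above is defined on the row-level `relevance` target, and this page does not say whether its folds keep rows of the same query together. If you build your own splits and want each query to stay within a single fold, which matches how the original S1 to S5 parts were formed by query, one simple approach is to assign folds per `query_id`. The sketch below is an illustration under that assumption, not OpenML's estimation procedure; `grouped_folds` is a hypothetical helper that deals queries to folds round-robin in order of first appearance.

```python
def grouped_folds(query_ids, n_folds=4):
    """Assign each row a fold index so that rows sharing a query ID share a fold.

    Queries are dealt to folds round-robin in order of first appearance,
    which balances the number of queries (not rows) per fold.
    """
    fold_of_query = {}
    for qid in query_ids:
        if qid not in fold_of_query:
            fold_of_query[qid] = len(fold_of_query) % n_folds
    return [fold_of_query[qid] for qid in query_ids]


if __name__ == "__main__":
    # Toy query IDs; the real column has 10000 distinct values.
    rows = [1, 1, 2, 3, 3, 4, 5]
    print(grouped_folds(rows))
```

Balancing by query count mirrors the description's "about the same number of queries" per part; fold sizes in rows can still differ because queries have different numbers of URLs.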