Istella Dataset

This page is a clone of the blog post from istella: http://blog.istella.it/istella-learning-to-rank-dataset/

Istella is glad to release to the public the Istella Learning to Rank (LETOR) dataset, previously used to learn one of the stages of the Istella production ranking pipeline. To the best of our knowledge, this is the largest publicly available LETOR dataset, and it is particularly useful for large-scale experiments on the efficiency and scalability of LETOR solutions.

To use the dataset, you must read and accept the online License Agreement. By using the dataset, you agree to be bound by the terms of its license: the Istella dataset is for non-commercial use only.

Datasets

Istella LETOR

The Istella LETOR full dataset is composed of 33,018 queries and 220 features representing each query-document pair. It consists of 10,454,629 examples labeled with relevance judgments ranging from 0 (irrelevant) to 4 (perfectly relevant). The average number of examples per query is 316. It has been split into train and test sets according to an 80%-20% scheme.
If you want to use the full dataset in your research, you can download Istella LETOR here. We kindly ask you to acknowledge Istella and cite the following publication in your research:
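Assuming the files follow the standard LETOR/SVMlight line layout (`<label> qid:<id> <feature>:<value> ...`, which is common for LETOR releases but not stated explicitly in this post), a single example line can be parsed with plain Python. The sample line below is made up for illustration; the real data has 220 features per pair.

```python
def parse_letor_line(line):
    """Parse '<label> qid:<id> <feat>:<val> ...' into (label, qid, features)."""
    parts = line.split()
    label = int(parts[0])                      # relevance judgment, 0..4
    qid = parts[1].split(":", 1)[1]            # query identifier
    features = {}
    for tok in parts[2:]:
        if tok.startswith("#"):                # optional trailing comment
            break
        fid, val = tok.split(":", 1)
        features[int(fid)] = float(val)
    return label, qid, features

# Illustrative line (only 3 of the 220 features shown):
label, qid, feats = parse_letor_line("4 qid:10 1:0.5 2:1.0 7:0.25")
print(label, qid, feats[7])  # 4 10 0.25
```

In practice, libraries such as scikit-learn's `load_svmlight_file` (with `query_id=True`) can load files in this format directly, which is more convenient for whole-dataset experiments.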

Domenico Dato, Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Nicola Tonellotto, and Rossano Venturini. 2016. Fast Ranking with Additive Ensembles of Oblivious and Non-Oblivious Regression Trees. ACM Trans. Inf. Syst. 35, 2, Article 15 (December 2016), 31 pages. DOI: https://doi.org/10.1145/2987380

Istella-S LETOR

We also made available a smaller sample of the dataset, named Istella-S LETOR. Like the full Istella LETOR, it is composed of 33,018 queries and 220 features representing each query-document pair. Istella-S LETOR consists of 3,408,630 pairs, produced by sampling irrelevant pairs down to an average of 103 examples per query. It has been split into train, validation and test sets according to a 60%-20%-20% scheme.
If you want to use this dataset in your research, you can download Istella-S LETOR here. We kindly ask you to acknowledge Istella and cite the following publication in your research:

Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Fabrizio Silvestri, and Salvatore Trani. 2016. Post-Learning Optimization of Tree Ensembles for Efficient Ranking. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval (SIGIR ’16). ACM, New York, NY, USA, 949-952. DOI: http://dx.doi.org/10.1145/2911451.2914763