Istella LETOR Datasets

Mirror of the original Istella blog post.

Istella is glad to release the Istella Learning to Rank (LETOR) dataset to the public, used in the past to learn one of the stages of the Istella production ranking pipeline. To the best of our knowledge, this is the largest publicly available LETOR dataset, particularly useful for large-scale experiments on the efficiency and scalability of LETOR solutions.

To use the dataset, you must read and accept the Istella LETOR Licence Agreement. By using the dataset, you agree to be bound by the terms of its license: Istella LETOR datasets are solely for non-commercial use.

Available Istella LETOR Datasets

Istella LETOR

The Istella LETOR full dataset is composed of 33,018 queries and 220 features representing each query-document pair. It consists of 10,454,629 examples labeled with relevance judgments ranging from 0 (irrelevant) to 4 (perfectly relevant). The average number of examples per query is 316. It has been split into train and test sets according to an 80%/20% scheme.
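As a sketch of how such query-document pairs are typically consumed, the snippet below parses one line of the common SVMlight/LETOR text format (`<label> qid:<id> <feature>:<value> ...`); the exact file layout should be checked against the dataset's own documentation, and the sample line here is made up for illustration.

```python
def parse_letor_line(line):
    """Parse one LETOR-format line into (label, qid, features).

    Assumes the SVMlight-style layout: a relevance label (0..4),
    a "qid:<id>" token, then "feature_id:value" tokens.
    """
    parts = line.strip().split()
    label = int(parts[0])                # relevance judgment, 0..4
    qid = parts[1].split(":", 1)[1]      # query identifier
    features = {}
    for tok in parts[2:]:
        fid, val = tok.split(":", 1)
        features[int(fid)] = float(val)  # feature id -> feature value
    return label, qid, features

# Illustrative (made-up) line in that format:
label, qid, feats = parse_letor_line("3 qid:101 1:0.5 2:1.0 220:0.25")
print(label, qid, len(feats))
```

Grouping parsed pairs by `qid` then yields the per-query example lists that learning-to-rank libraries expect.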

You can download Istella LETOR from the Istella website here.

In case you use it, we kindly ask you to cite the following publication in your research:

Domenico Dato, Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Nicola Tonellotto, and Rossano Venturini. 2016. Fast Ranking with Additive Ensembles of Oblivious and Non-Oblivious Regression Trees. ACM Trans. Inf. Syst. 35, 2, Article 15 (December 2016), 31 pages. DOI: https://doi.org/10.1145/2987380

Istella-S LETOR

We also made available a smaller sample of the dataset (named Istella-S LETOR). Like the full Istella LETOR, it is composed of 33,018 queries and 220 features representing each query-document pair. Istella-S LETOR consists of 3,408,630 pairs produced by sampling irrelevant pairs down to an average of 103 examples per query. It has been split into train, validation, and test sets according to a 60%/20%/20% scheme.

You can download Istella-S LETOR from the Istella website here.

In case you use it, we kindly ask you to cite the following publication in your research:

Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Fabrizio Silvestri, and Salvatore Trani. 2016. Post-Learning Optimization of Tree Ensembles for Efficient Ranking. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval (SIGIR ’16). ACM, New York, NY, USA, 949-952. DOI: http://dx.doi.org/10.1145/2911451.2914763

Istella-X LETOR

We also made available a larger dataset (named Istella-X, for eXtended, LETOR). It is composed of 10,000 queries and 220 features representing each query-document pair. Istella-X LETOR consists of 26,791,447 pairs produced by retrieving up to 5,000 documents per query according to the BM25F ranking score. It has been split into train, validation, and test sets according to a 60%/20%/20% scheme.

You can download Istella-X LETOR from the Istella website here.

In case you use it, we kindly ask you to cite the following publication in your research:

Claudio Lucchese, Franco Maria Nardini, Raffaele Perego, Salvatore Orlando, and Salvatore Trani. 2018. Selective Gradient Boosting for Effective Learning to Rank. In Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’18). ACM, New York, NY, USA. DOI: http://dx.doi.org/10.1145/3209978.3210048