The learning-to-rank-dataset.tar archive contains four CSV files [1,2] in which values are separated by “;” and comment lines are prefixed with “#”. Each file corresponds to one dataset, as explained below.
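As a minimal sketch of how files in this format could be read, the snippet below uses Python's standard `csv` module with “;” as the delimiter and skips lines starting with “#”. The helper name `read_rows` and the sample values are hypothetical; the actual column layout is given in the features document, not here.

```python
import csv
import io


def read_rows(handle):
    """Yield rows from a ';'-separated file, skipping blank and '#' comment lines."""
    data_lines = (
        line for line in handle
        if line.strip() and not line.lstrip().startswith("#")
    )
    yield from csv.reader(data_lines, delimiter=";")


# Hypothetical in-memory sample standing in for one of the four CSV files.
sample = io.StringIO("# a comment line\n1.0;2.0;3.0\n4.0;5.0;6.0\n")
rows = [[float(value) for value in row] for row in read_rows(sample)]
```

For a real file, the same function can be applied to a handle obtained with `open(path, newline="")`.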



Data gathering

The Twitter data was collected daily over 360 consecutive days using the Twitter REST API [3]. Tweets were retrieved by querying 1497 English keywords sampled from Wikipedia. The retrieved data contains 287 million tweets exchanged by 30 million distinct users. Each keyword is characterized by five features per day (see the Features description section below).


This dataset is proposed in a “learning to rank” setting. The data is split into sequences of 9 contiguous days: days 0 to 6 constitute the training part, while days 7 and 8 are used to rank each keyword according to its activity. The task is therefore to predict the future activity of keywords in a social network using learning-to-rank methods. Subsequences partially overlap, as shown in Figure 1.
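The 9-day split described above can be sketched as a sliding window over day indices. The function name `split_windows` is hypothetical, and the stride of one day between consecutive windows is an assumption: the text only states that subsequences partially overlap, and the exact overlap is the one shown in Figure 1.

```python
def split_windows(n_days, window=9, train_days=7, stride=1):
    """Return (training_days, ranking_days) pairs of day indices.

    Each window covers `window` contiguous days: the first `train_days`
    form the training part, the remaining days are used for ranking.
    The stride between windows is an assumption, not taken from the dataset.
    """
    windows = []
    for start in range(0, n_days - window + 1, stride):
        days = list(range(start, start + window))
        windows.append((days[:train_days], days[train_days:]))
    return windows


# 360 days of collection, as stated in the data gathering section.
windows = split_windows(360)
first_train, first_rank = windows[0]
```

With these assumed settings, the first window trains on days 0 to 6 and ranks on days 7 and 8, and the next window starts one day later.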

Examples generation

We also generated three datasets of equal size (300) by sampling among the original 1497 keywords in order to study the effect of keyword ambiguity on the activity prediction task. In this study, the ambiguity of a keyword is simply measured through Wikipedia’s disambiguation pages (WDP), which list the different meanings associated with a keyword: the LOW dataset contains only keywords with no WDP and a definition page, while MED (resp. HIGH) contains keywords whose WDP references between 3 and 19 pages (resp. between 20 and 99 pages).
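The ambiguity thresholds above can be written as a small lookup. This is only an illustration of the stated ranges: the function name `ambiguity_bucket` is hypothetical, encoding “no WDP” as a count of 0 is an assumption, and the check that LOW keywords have a definition page is left out.

```python
def ambiguity_bucket(n_referenced_pages):
    """Map the number of pages referenced by a keyword's WDP to a dataset name.

    Assumption: a keyword without a disambiguation page is encoded as 0.
    Counts outside the stated ranges belong to none of the three datasets.
    """
    if n_referenced_pages == 0:
        return "LOW"
    if 3 <= n_referenced_pages <= 19:
        return "MED"
    if 20 <= n_referenced_pages <= 99:
        return "HIGH"
    return None


buckets = [ambiguity_bucket(n) for n in (0, 2, 3, 19, 20, 99, 100)]
```

Keywords whose WDP references 1 or 2 pages, or 100 pages and more, fall outside the three ranges and map to `None` in this sketch.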

Figure 2. Activity distribution per dataset subsample

Figure 3. Ambiguity levels distribution in dataset MED

Figure 4. Ambiguity levels distribution in dataset HIGH

Features description

A comprehensive description of the features is given in the document Features-dataset-learning-to-rank. Basic statistics {min; max; mean; std} for each feature are available in the file stats.tar, for each time step of the training part ([0 ; 7]).


François Kawala (1,2)
Ahlame Douzal (1)
Eric Gaussier (1)
Eustache Diemert (2)


(1) Université Joseph Fourier (Grenoble I)
Laboratoire d’informatique de Grenoble (LIG)
Equipe AMA
(2) TechMediaNetwork


Corresponding author: François Kawala (replace “ç” with “c”).
(1) firstname.lastname [at]
(2) {ediemert ; fkawala} [at]