Learning to Rank with Partially Labeled Training Data
Vinh Truong, Massih-Reza Amini, Patrick Gallinari
Laboratoire d'Informatique Paris 6
8, rue du capitaine scott
Many real life applications involve ranking of objects instead of their classification. An example is the Document Retrieval (DR) task where the goal is to rank documents from a collection based on their relevancy to a user's query. Recently the supervised learning of ranking functions has attracted considerable attention from the Machine Learning (ML) community since it encompasses many applications ranging from label-Ranking to multi-class classification. In the supervised setting, the aim is to learn a mapping from a predefined set of instances for which there exists a set of labels. For example in the DR task, an instance is a query and its label is a set of relevance judgements that indicate which document in the collection is relevant or not to that query. Labelling instances for supervised learning often requires that an expert examine a large amount of data which in many cases is unrealistic. In the other hand, unlabeled data are cheap and can be easily gathered. For the classification task, the use of partially classified training data, known as semi-supervised learning, has been subject of intense studies since 1998 in the ML community. In this paper we propose a semi-supervised learning algorithm for ranking. Existing semisupervised ranking algorithms are graph-based transductive techniques which from specific observed training data give labels for specific test examples. Our motivation here is to develop a novel inductive approach which from specific observed training data (labeled and unlabeled) produces a general ranking rule which rank unseen examples with high accuracy. Our algorithm is an iterative approach which learns a supervised ranking function from the labeled training examples and then iterates two steps until the convergence of a ranking criterion. In the first step, the ranking algorithm orders unlabeled training examples and then by fixing a threshold labels the lowest ranked unlabeled data as irrelevant. It then supposes that the most top ranked unlabeled data are more likely to be relevant to the query and using a graph-based technique proposed in it labels as relevant the top ranked unlabeled data which are the most similar to the relevant examples in the labeled training set. In the second step, the algorithm learns a new ranking function using the labeled training examples and the unlabeled examples for which it found alternative labels in the previous step. Experiments on the CACM collection gathering titles and abstracts from the journal Communications of the Association for Computer Machinery show that the proposed approach allows to substantially gain over a fully supervised approach trained using only the labeled training examples.