Semi-supervised RankBoost for bipartite ranking



Distribution 1.0
11 March 2010

Massih R. Amini

Laboratoire d'Informatique de Grenoble





 


Description


This program implements the RankBoost algorithm [Freund et al., 2003] trained with partially labeled data for bipartite ranking (Semi-supervised RankBoost), as described in [Amini et al., 2008]. The algorithm is an inductive ranking method that builds a prediction function from two training sets: one labeled and one unlabeled. In a first stage, the algorithm loops over the labeled set and, for each labeled training example, assigns the same relevance judgment to its most similar examples from the unlabeled training set. An extended version of RankBoost is then used to produce a scoring function that minimizes the average number of incorrectly ordered (relevant, irrelevant) pairs over the labeled training set and the tentatively labeled part of the unlabeled data.
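The first-stage label propagation can be sketched as follows (an illustrative Python sketch, not the program's actual code; cosine similarity and the function names are assumptions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse feature:value dicts."""
    dot = sum(v * b.get(f, 0.0) for f, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def propagate_labels(labeled, unlabeled, k=3):
    """For each (label, features) pair in `labeled`, copy its relevance
    judgment to the k most similar examples in `unlabeled` (a list of
    feature dicts).  Returns one tentative label per unlabeled example,
    with 0 meaning 'still unlabeled'."""
    tentative = [0] * len(unlabeled)
    for y, feats in labeled:
        sims = [cosine(feats, u) for u in unlabeled]
        nearest = sorted(range(len(unlabeled)), key=lambda j: -sims[j])[:k]
        for j in nearest:
            tentative[j] = y
    return tentative
```

With 10 labeled examples and 3 neighbors each, such a step tentatively labels at most 30 unlabeled examples.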


Download and Installation

The program is free for scientific use only. If you publish results based on this program, please acknowledge its use by referring to:

M.-R. Amini, V. Truong, C. Goutte. A Boosting Algorithm for Learning Bipartite Ranking Functions with Partially Labeled Data. Proceedings of the 31st International ACM SIGIR (SIGIR 2008), 2008. [PDF]

This program was developed on Linux with g++ and the source code is available from:
http://ama.liglab.fr/~amini/SSRankBoost/semisup_rankboost.tar.bz2

After downloading the file, unpack it:

> bzip2 -cd semisup_rankboost.tar.bz2 | tar xvf -

then compile the program in the new directory semisup_rankboost/:

> make

After compilation, two executables are created:

  • ssrankboost-learn (for training the model)
  • ssrankboost-test (for testing it)


Training and testing

Both the training and test modules operate on a feature:value representation of examples:
  • Rel feature:value feature:value ...
where Rel (one of 0, -1, 1) is the relevance judgement of an example: 0 if the judgement is unknown (unlabeled examples), -1 if the example is judged irrelevant to the given topic, and 1 if it is judged relevant. The directory example/ in semisup_rankboost/ contains two files, training_set and test_set, provided for test purposes. training_set contains 10,000 partially labeled examples (10 labeled and 9,990 unlabeled) and test_set contains 8,000 test examples. (These datasets are built over a subset of the RCV1 Reuters collection.)
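A line in this format can be parsed as in the following illustrative Python sketch (parse_line is a hypothetical helper, not part of the distribution):

```python
def parse_line(line):
    """Parse one 'Rel feature:value feature:value ...' line into
    (relevance, {feature_id: value})."""
    tokens = line.split()
    rel = int(tokens[0])     # 0 = unlabeled, -1 = irrelevant, 1 = relevant
    feats = {}
    for tok in tokens[1:]:
        f, v = tok.split(":")
        feats[int(f)] = float(v)
    return rel, feats
```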


Train the model:
> ssrankboost-learn [options] input_file parameter_file

Options are:
-l    (float) The discount factor regularizing the impact of the objective term over unlabeled examples in the learning process. If -l is not specified, the program reduces to the supervised RankBoost algorithm for bipartite ranking [Freund et al., 2003],
-k   (integer) The number of unlabeled examples that are given the same relevance judgment as their nearest labeled neighbor (default 3),
-n   (integer) The number of candidate thresholds over features - stumps (default 10),
-t   (integer) The number of boosting iterations (default 50),
-? Help


Test the model:
> ssrankboost-test input_file parameter_file


Example

Running the program in the fully supervised case [Freund et al., 2003] on the part of the Reuters collection provided in the directory example/, using the default parameters (i.e. the default number of candidate thresholds and 50 boosting iterations), gives

> ssrankboost-learn example/training_set Param-LabeledCase
Scanning ...
Vocabulary size: 21531, labeled examples: 10
Training ...
   1 --> Rt=0.75000000 alpha= 0.972955075 i*=  28 theta*=1.00000000
   2 --> Rt=0.34449112 alpha= 0.767976553 i*=1242 theta*=1.00000000
   3 --> Rt=0.23282148 alpha= 0.806175652 i*=  28 theta*=1.00000000
   4 --> Rt=0.15818393 alpha= 0.905899061 i*=  53 theta*=1.00000000
   5 --> Rt=0.08907593 alpha= 0.884357773 i*= 657 theta*=2.80000000
   6 --> Rt=0.05405791 alpha= 0.942351016 i*=  63 theta*=1.00000000
   7 --> Rt=0.02698971 alpha= 0.806420658 i*=   3 theta*=1.00000000
   8 --> Rt=0.01684374 alpha= 0.794356020 i*=  28 theta*=1.00000000
   9 --> Rt=0.01171223 alpha= 0.908774891 i*= 657 theta*=2.80000000
  10 --> Rt=0.00648682 alpha= 0.867918291 i*=  53 theta*=1.00000000
  11 --> Rt=0.00403677 alpha= 0.937479989 i*=  28 theta*=1.00000000
  12 --> Rt=0.00206520 alpha= 0.826522449 i*=   3 theta*=1.00000000
  13 --> Rt=0.00124185 alpha= 0.792700753 i*=  28 theta*=1.00000000
  14 --> Rt=0.00086342 alpha= 0.904145093 i*=  53 theta*=1.00000000
  15 --> Rt=0.00047831 alpha= 0.857645425 i*= 657 theta*=2.80000000
  16 --> Rt=0.00030285 alpha= 0.936779973 i*=  28 theta*=1.00000000
  17 --> Rt=0.00015668 alpha= 0.839317614 i*=   3 theta*=1.00000000
  18 --> Rt=0.00009211 alpha= 0.792598917 i*=  28 theta*=1.00000000
  19 --> Rt=0.00006397 alpha= 0.902250267 i*= 657 theta*=2.80000000
  20 --> Rt=0.00003531 alpha= 0.848599909 i*=  53 theta*=1.00000000
  21 --> Rt=0.00002271 alpha= 0.936674053 i*=  28 theta*=1.00000000
  22 --> Rt=0.00001183 alpha= 0.847885517 i*=   3 theta*=1.00000000
  23 --> Rt=0.00000685 alpha= 0.792655228 i*=  28 theta*=1.00000000
  24 --> Rt=0.00000475 alpha= 0.899772802 i*=  53 theta*=1.00000000
  25 --> Rt=0.00000262 alpha= 0.843674267 i*= 657 theta*=2.80000000
  26 --> Rt=0.00000170 alpha= 0.936658854 i*=  28 theta*=1.00000000
  27 --> Rt=0.00000089 alpha= 0.853603705 i*=1242 theta*=1.00000000
  28 --> Rt=0.00000051 alpha= 0.792703244 i*=  28 theta*=1.00000000
  29 --> Rt=0.00000035 alpha= 0.898799536 i*= 657 theta*=2.80000000
  30 --> Rt=0.00000019 alpha= 0.839749236 i*=  53 theta*=1.00000000
  31 --> Rt=0.00000013 alpha= 0.936636595 i*=  63 theta*=1.00000000
  32 --> Rt=0.00000007 alpha= 0.857409980 i*=   3 theta*=1.00000000
  33 --> Rt=0.00000004 alpha= 0.792724235 i*=  28 theta*=1.00000000
  34 --> Rt=0.00000003 alpha= 0.897733607 i*=  53 theta*=1.00000000
  35 --> Rt=0.00000001 alpha= 0.837604335 i*= 657 theta*=2.80000000
  36 --> Rt=0.00000001 alpha= 0.936622647 i*=  28 theta*=1.00000000
  37 --> Rt=0.00000001 alpha= 0.859915534 i*=   3 theta*=1.00000000
  38 --> Rt=0.00000000 alpha= 0.792735846 i*=  28 theta*=1.00000000
  39 --> Rt=0.00000000 alpha= 0.897300106 i*= 657 theta*=2.80000000
  40 --> Rt=0.00000000 alpha= 0.835938833 i*=  53 theta*=1.00000000
  41 --> Rt=0.00000000 alpha= 0.936606921 i*=  28 theta*=1.00000000
  42 --> Rt=0.00000000 alpha= 0.861564152 i*=1242 theta*=1.00000000
  43 --> Rt=0.00000000 alpha= 0.792739615 i*=  28 theta*=1.00000000
  44 --> Rt=0.00000000 alpha= 0.896850614 i*=  53 theta*=1.00000000
  45 --> Rt=0.00000000 alpha= 0.835020098 i*= 657 theta*=2.80000000
  46 --> Rt=0.00000000 alpha= 0.936597700 i*=  28 theta*=1.00000000
  47 --> Rt=0.00000000 alpha= 0.862640647 i*=1242 theta*=1.00000000
  48 --> Rt=0.00000000 alpha= 0.792742061 i*=  63 theta*=1.00000000
  49 --> Rt=0.00000000 alpha= 0.896662342 i*= 657 theta*=2.80000000
  50 --> Rt=0.00000000 alpha= 0.834316604 i*=  53 theta*=1.00000000


Here, Rt corresponds to Zt in ([Freund et al., 2003], eq. 4), i* and theta* are the chosen feature and the corresponding threshold that determine the base ranking function at a given iteration ([Freund et al., 2003], eq. 9), and alpha corresponds to the boosting weight ([Freund et al., 2003], eq. 6).
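Assuming the base ranking functions are threshold stumps, h(x) = 1 if the value of feature i* is at least theta* and 0 otherwise, the learned scoring function is the alpha-weighted sum of the selected stumps. A hypothetical Python sketch (the program's exact stump convention, e.g. strict vs. non-strict comparison and the default value of absent features, may differ):

```python
def score(x, stumps):
    """H(x) = sum_t alpha_t * h_t(x), where each weak ranker h_t is a
    threshold stump on feature i* with threshold theta*; x is a sparse
    {feature_id: value} dict and stumps is a list of
    (alpha, i_star, theta_star) triples."""
    return sum(alpha for (alpha, i, theta) in stumps
               if x.get(i, 0.0) >= theta)
```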

Testing the previous model gives
> ssrankboost-test example/test_set Param-LabeledCase
T9P=0.400000 MAP@500=0.037637 REC@500=0.086915 AUC=0.547948
where T9P is the precision at rank 50.
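The reported AUC is the standard area under the ROC curve, which for bipartite ranking equals the fraction of (relevant, irrelevant) test pairs ordered correctly by the scoring function; a minimal illustrative sketch:

```python
def auc(pos_scores, neg_scores):
    """Fraction of (relevant, irrelevant) pairs ranked correctly,
    counting ties as half a correct pair."""
    correct = sum((p > n) + 0.5 * (p == n)
                  for p in pos_scores for n in neg_scores)
    return correct / (len(pos_scores) * len(neg_scores))
```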

Now, using the additional unlabeled examples in the training process with a discount factor of 0.1 (keeping the same default numbers of iterations and stumps as before, and giving each labeled example's relevance judgement to its 3 nearest unlabeled neighbors):
> ssrankboost-learn -l 0.1 example/training_set Param-SSupCase
Scanning ...
Vocabulary size: 21531, labeled examples: 10, unlabeled examples: 9990, number of unlabeled examples labeled by the KNN algorithm: 30
Training ...

A similar output to the previous supervised learning case is then displayed, with the difference that Rt now corresponds to the semi-supervised ranking loss bound ([Amini et al., 2008], eq. 7) and the boosting weights are the semi-supervised ones obtained as in ([Amini et al., 2008], eq. 6).

Testing the previous semi-supervised model gives
> ssrankboost-test example/test_set Param-SSupCase
T9P=0.820000 MAP@500=0.143034 REC@500=0.177173 AUC=0.754527


Disclaimer

This program is publicly available for research use only. It should not be distributed for commercial use and the author is not responsible for any (mis)use of this algorithm.


Acknowledgements

The author is thankful to Reuters for making the RCV1/RCV2 data available and for granting permission to distribute processed versions of it; the examples used in this release come from part of the RCV1 collection.


Bibliography


[Amini et al., 2008] Massih-Reza Amini, Vinh Truong, Cyril Goutte. A Boosting Algorithm for Learning Bipartite Ranking Functions with Partially Labeled Data. Proceedings of the 31st International ACM SIGIR Conference (SIGIR 2008), 2008.

[Freund et al., 2003] Yoav Freund, Raj Iyer, Robert E. Schapire, Yoram Singer. An Efficient Boosting Algorithm for Combining Preferences. Journal of Machine Learning Research, 4:933-969, 2003.