Semi-supervised RankBoost for bipartite ranking



Distribution 1.0
11 March 2010

Massih R. Amini

Laboratoire d'Informatique de Grenoble





 


Description


This program implements the RankBoost algorithm [Freund et al., 2003] trained with partially labeled data for bipartite ranking (Semi-supervised RankBoost), as described in [Amini et al., 2008]. The algorithm is an inductive ranking method that builds a prediction function from two training sets: one labeled and one unlabeled. In a first stage, the algorithm loops over the labeled set and, for each labeled training example, assigns the same relevance judgment to its most similar examples from the unlabeled training set. An extended version of RankBoost is then used to produce a scoring function that minimizes the average number of incorrectly ordered (relevant, irrelevant) pairs over the labeled training set and the tentatively labeled part of the unlabeled data.
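The first-stage label propagation can be sketched as follows (an illustrative Python sketch, not the program's actual code; cosine similarity and the function names are assumptions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse feature:value dicts."""
    dot = sum(v * b.get(f, 0.0) for f, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def propagate_labels(labeled, unlabeled, k=3):
    """For each (label, features) pair in `labeled`, copy its relevance
    judgment to the k most similar examples in `unlabeled` (a list of
    feature dicts).  Returns one tentative label per unlabeled example,
    with 0 meaning 'still unlabeled'."""
    tentative = [0] * len(unlabeled)
    for y, feats in labeled:
        sims = [cosine(feats, u) for u in unlabeled]
        nearest = sorted(range(len(unlabeled)), key=lambda j: -sims[j])[:k]
        for j in nearest:
            tentative[j] = y
    return tentative
```

With 10 labeled examples and 3 neighbors each, such a step tentatively labels at most 30 unlabeled examples.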


Download and Installation

The program is free for scientific use only. If you publish results based on this program, please acknowledge its use by referring to:

M.-R. Amini, V. Truong, C. Goutte. A Boosting Algorithm for Learning Bipartite Ranking Functions with Partially Labeled Data. Proceedings of the 31st International ACM SIGIR (SIGIR 2008), 2008. [PDF]

This program was developed on Linux with g++ and the source code is available from:
http://ama.liglab.fr/~amini/SSRankBoost/semisup_rankboost.tar.bz2

After downloading the file, unpack it:

> bzip2 -cd semisup_rankboost.tar.bz2 | tar xvf -

then compile the program in the new directory semisup_rankboost/:

> make

After compilation, two executables are created:

  • ssrankboost-learn (for training the model)
  • ssrankboost-test (for testing it)


Training and testing

Both the training and test modules operate on a feature:value representation of examples:
  • Rel feature:value feature:value ...
where Rel (one of 0, -1, 1) is the relevance judgement of an example: 0 if the judgement is unknown (unlabeled examples), -1 if the example is judged irrelevant to the given topic, and 1 if it is judged relevant. The directory example/ in semisup_rankboost/ contains two files, training_set and test_set, provided for test purposes. training_set contains 10,000 partially labeled examples (10 labeled and 9,990 unlabeled) and test_set contains 8,000 test examples. (These datasets are built over a subset of the RCV1 Reuters collection.)
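A line in this format can be parsed as in the following illustrative Python sketch (parse_line is a hypothetical helper, not part of the distribution):

```python
def parse_line(line):
    """Parse one 'Rel feature:value feature:value ...' line into
    (relevance, {feature_id: value})."""
    tokens = line.split()
    rel = int(tokens[0])     # 0 = unlabeled, -1 = irrelevant, 1 = relevant
    feats = {}
    for tok in tokens[1:]:
        f, v = tok.split(":")
        feats[int(f)] = float(v)
    return rel, feats
```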


Train the model:
> ssrankboost-learn [options] input_file parameter_file

Options are:
-l    (float) The discount factor regularizing the impact of the objective term over unlabeled examples in the learning process. If -l is not specified, the program reduces to the supervised RankBoost algorithm for bipartite ranking [Freund et al., 2003],
-k   (integer) The number of unlabeled examples that are given the same relevance judgment as their nearest labeled neighbor (default 3),
-n   (integer) The number of candidate thresholds over features - stumps (default 10),
-t   (integer) The number of boosting iterations (default 50),
-? Help


Test the model:
> ssrankboost-test input_file parameter_file


Example

Running the program in the fully supervised case [Freund et al., 2003] on the part of the Reuters collection provided in the directory example/, using the default parameters (i.e. the default number of candidate thresholds and 50 boosting iterations), gives

> ssrankboost-learn example/training_set Param-LabeledCase
Scanning ...
Vocabulary size: 21531, labeled examples: 10
Training ...
   1 --> Rt=0.75000000 alpha= 0.972955075 i*=  28 theta*=1.00000000
   2 --> Rt=0.34449112 alpha= 0.767976553 i*=1242 theta*=1.00000000
   3 --> Rt=0.23282148 alpha= 0.806175652 i*=  28 theta*=1.00000000
   4 --> Rt=0.15818393 alpha= 0.905899061 i*=  53 theta*=1.00000000
   5 --> Rt=0.08907593 alpha= 0.884357773 i*= 657 theta*=2.80000000
   6 --> Rt=0.05405791 alpha= 0.942351016 i*=  63 theta*=1.00000000
   7 --> Rt=0.02698971 alpha= 0.806420658 i*=   3 theta*=1.00000000
   8 --> Rt=0.01684374 alpha= 0.794356020 i*=  28 theta*=1.00000000
   9 --> Rt=0.01171223 alpha= 0.908774891 i*= 657 theta*=2.80000000
  10 --> Rt=0.00648682 alpha= 0.867918291 i*=  53 theta*=1.00000000
  11 --> Rt=0.00403677 alpha= 0.937479989 i*=  28 theta*=1.00000000
  12 --> Rt=0.00206520 alpha= 0.826522449 i*=   3 theta*=1.00000000
  13 --> Rt=0.00124185 alpha= 0.792700753 i*=  28 theta*=1.00000000
  14 --> Rt=0.00086342 alpha= 0.904145093 i*=  53 theta*=1.00000000
  15 --> Rt=0.00047831 alpha= 0.857645425 i*= 657 theta*=2.80000000
  16 --> Rt=0.00030285 alpha= 0.936779973 i*=  28 theta*=1.00000000
  17 --> Rt=0.00015668 alpha= 0.839317614 i*=   3 theta*=1.00000000
  18 --> Rt=0.00009211 alpha= 0.792598917 i*=  28 theta*=1.00000000
  19 --> Rt=0.00006397 alpha= 0.902250267 i*= 657 theta*=2.80000000
  20 --> Rt=0.00003531 alpha= 0.848599909 i*=  53 theta*=1.00000000
  21 --> Rt=0.00002271 alpha= 0.936674053 i*=  28 theta*=1.00000000
  22 --> Rt=0.00001183 alpha= 0.847885517 i*=   3 theta*=1.00000000
  23 --> Rt=0.00000685 alpha= 0.792655228 i*=  28 theta*=1.00000000
  24 --> Rt=0.00000475 alpha= 0.899772802 i*=  53 theta*=1.00000000
  25 --> Rt=0.00000262 alpha= 0.843674267 i*= 657 theta*=2.80000000
  26 --> Rt=0.00000170 alpha= 0.936658854 i*=  28 theta*=1.00000000
  27 --> Rt=0.00000089 alpha= 0.853603705 i*=1242 theta*=1.00000000
  28 --> Rt=0.00000051 alpha= 0.792703244 i*=  28 theta*=1.00000000
  29 --> Rt=0.00000035 alpha= 0.898799536 i*= 657 theta*=2.80000000
  30 --> Rt=0.00000019 alpha= 0.839749236 i*=  53 theta*=1.00000000
  31 --> Rt=0.00000013 alpha= 0.936636595 i*=  63 theta*=1.00000000
  32 --> Rt=0.00000007 alpha= 0.857409980 i*=   3 theta*=1.00000000
  33 --> Rt=0.00000004 alpha= 0.792724235 i*=  28 theta*=1.00000000
  34 --> Rt=0.00000003 alpha= 0.897733607 i*=  53 theta*=1.00000000
  35 --> Rt=0.00000001 alpha= 0.837604335 i*= 657 theta*=2.80000000
  36 --> Rt=0.00000001 alpha= 0.936622647 i*=  28 theta*=1.00000000
  37 --> Rt=0.00000001 alpha= 0.859915534 i*=   3 theta*=1.00000000
  38 --> Rt=0.00000000 alpha= 0.792735846 i*=  28 theta*=1.00000000
  39 --> Rt=0.00000000 alpha= 0.897300106 i*= 657 theta*=2.80000000
  40 --> Rt=0.00000000 alpha= 0.835938833 i*=  53 theta*=1.00000000
  41 --> Rt=0.00000000 alpha= 0.936606921 i*=  28 theta*=1.00000000
  42 --> Rt=0.00000000 alpha= 0.861564152 i*=1242 theta*=1.00000000
  43 --> Rt=0.00000000 alpha= 0.792739615 i*=  28 theta*=1.00000000
  44 --> Rt=0.00000000 alpha= 0.896850614 i*=  53 theta*=1.00000000
  45 --> Rt=0.00000000 alpha= 0.835020098 i*= 657 theta*=2.80000000
  46 --> Rt=0.00000000 alpha= 0.936597700 i*=  28 theta*=1.00000000
  47 --> Rt=0.00000000 alpha= 0.862640647 i*=1242 theta*=1.00000000
  48 --> Rt=0.00000000 alpha= 0.792742061 i*=  63 theta*=1.00000000
  49 --> Rt=0.00000000 alpha= 0.896662342 i*= 657 theta*=2.80000000
  50 --> Rt=0.00000000 alpha= 0.834316604 i*=  53 theta*=1.00000000


Here, Rt corresponds to Zt in ([Freund et al., 2003], eq. 4), i* and theta* are the chosen feature and the corresponding threshold that determine the base ranking function at a given iteration ([Freund et al., 2003], eq. 9), and alpha corresponds to the boosting weight ([Freund et al., 2003], eq. 6).
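Assuming the base ranking functions are threshold stumps, h(x) = 1 if the value of feature i* is at least theta* and 0 otherwise, the learned scoring function is the alpha-weighted sum of the selected stumps. A hypothetical Python sketch (the program's exact stump convention, e.g. strict vs. non-strict comparison and the default value of absent features, may differ):

```python
def score(x, stumps):
    """H(x) = sum_t alpha_t * h_t(x), where each weak ranker h_t is a
    threshold stump on feature i* with threshold theta*; x is a sparse
    {feature_id: value} dict and stumps is a list of
    (alpha, i_star, theta_star) triples."""
    return sum(alpha for (alpha, i, theta) in stumps
               if x.get(i, 0.0) >= theta)
```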

Testing the previous model gives
> ssrankboost-test example/test_set Param-LabeledCase
T9P=0.400000 MAP@500=0.037637 REC@500=0.086915 AUC=0.547948
where T9P is the precision at rank 50.
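The reported AUC is the standard area under the ROC curve, which for bipartite ranking equals the fraction of (relevant, irrelevant) test pairs ordered correctly by the scoring function; a minimal illustrative sketch:

```python
def auc(pos_scores, neg_scores):
    """Fraction of (relevant, irrelevant) pairs ranked correctly,
    counting ties as half a correct pair."""
    correct = sum((p > n) + 0.5 * (p == n)
                  for p in pos_scores for n in neg_scores)
    return correct / (len(pos_scores) * len(neg_scores))
```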

Now, using the additional unlabeled examples in the training process with a discount factor of 0.1 (keeping the same default numbers of iterations and stumps as before, and giving each labeled example's relevance judgement to its 3 nearest unlabeled neighbors):
> ssrankboost-learn -l 0.1 example/training_set Param-SSupCase
Scanning ...
Vocabulary size: 21531, labeled examples: 10, unlabeled examples: 9990, number of unlabeled examples labeled by the KNN algorithm: 30
Training ...

A similar output to the previous supervised learning case is then displayed, with the difference that Rt now corresponds to the semi-supervised ranking loss bound ([Amini et al., 2008], eq. 7) and the boosting weights are the semi-supervised ones obtained as in ([Amini et al., 2008], eq. 6).

Testing the previous semi-supervised model gives
> ssrankboost-test example/test_set Param-SSupCase
T9P=0.820000 MAP@500=0.143034 REC@500=0.177173 AUC=0.754527


Disclaimer

This program is publicly available for research use only. It should not be distributed for commercial use and the author is not responsible for any (mis)use of this algorithm.


Acknowledgements

The author is thankful to Reuters for making the RCV1/RCV2 data available and for granting permission to distribute processed versions of it; the examples used in this release come from part of the RCV1 collection.


Bibliography


[Amini et al., 2008] Massih-Reza Amini, Vinh Truong, Cyril Goutte. A Boosting Algorithm for Learning Bipartite Ranking Functions with Partially Labeled Data. Proceedings of the 31st International ACM SIGIR Conference (SIGIR 2008), 2008.

[Freund et al., 2003] Yoav Freund, Raj Iyer, Robert E. Schapire, Yoram Singer. An Efficient Boosting Algorithm for Combining Preferences. Journal of Machine Learning Research, 4:933-969, 2003.