Reuters RCV1/RCV2 Multilingual, Multiview Text Categorization Test Collection
Distribution 1.0, 30 September 2009
Massih R. Amini and Cyril Goutte, National Research Council of Canada
| Language | Number of documents | percentage | Vocabulary size |
| English | 18,758 | 16.78 | 21,531 |
| French | 26,648 | 23.45 | 24,893 |
| German | 29,953 | 26.80 | 34,279 |
| Italian | 24,039 | 21.51 | 15,506 |
| Spanish | 12,342 | 11.46 | 11,547 |
| TOTAL: | 111,740 | 100.00 | |
The distribution of documents over classes is:
| Class | Number of documents | percentage |
| C15 | 18,816 | 16.84 |
| CCAT | 21,426 | 19.17 |
| E21 | 13,701 | 12.26 |
| ECAT | 19,198 | 17.18 |
| GCAT | 19,178 | 17.16 |
| M11 | 19,421 | 17.39 |
| TOTAL: | 111,740 | 100.00 |
In order to produce multilingual versions of each document, each original document was translated into the other 4 languages using a statistical machine translation system. We used the Portage system described by Ueffing et al. (2007), trained on the Europarl corpus for the 20 language pairs required here.
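The figure of 20 language pairs follows from counting ordered (source, target) pairs among the 5 languages, since translation direction matters. A small check, using the directory codes of the distributed archive as language labels:

```python
from itertools import permutations

# Language codes as used for the archive subdirectories (GR is German).
LANGUAGES = ["EN", "FR", "GR", "IT", "SP"]

# Ordered (source, target) pairs: EN->FR and FR->EN are distinct systems.
language_pairs = list(permutations(LANGUAGES, 2))

# Each document exists in its original language plus 4 translations.
n_documents = 111_740
n_versions = n_documents * len(LANGUAGES)

print(len(language_pairs))  # 20
print(n_versions)           # 558700
```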
Each of the resulting 558,700 document versions (111,740 documents in 5 languages) was preprocessed and indexed using a standard preprocessing chain including removal of stopwords and low-frequency words. Documents were then represented as a bag of words using a TFIDF-based weighting scheme.
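The exact weighting scheme is not specified here, so the following is only a minimal sketch of a generic TF-IDF bag-of-words pipeline, not the preprocessing chain actually used for the collection. It assumes raw term frequency times log(N / document frequency), a caller-supplied stopword list, and a hypothetical `min_df` threshold for dropping low-frequency words.

```python
import math
from collections import Counter


def tfidf_bag_of_words(docs, stopwords=frozenset(), min_df=2):
    """Map tokenized documents to sparse {feature_index: weight} dicts.

    Assumed scheme (not taken from the collection): raw term frequency
    times log(N / document frequency). Terms occurring in fewer than
    `min_df` documents are dropped as low-frequency words.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(
        term for doc in docs for term in set(doc) if term not in stopwords
    )
    # Keep sufficiently frequent terms; feature indices start at 1.
    kept = sorted(term for term, count in df.items() if count >= min_df)
    vocab = {term: index for index, term in enumerate(kept, start=1)}
    vectors = []
    for doc in docs:
        tf = Counter(term for term in doc if term in vocab)
        vectors.append(
            {vocab[t]: n * math.log(n_docs / df[t]) for t, n in tf.items()}
        )
    return vocab, vectors


# Toy, made-up documents for illustration only.
docs = [
    ["oil", "price", "rise"],
    ["oil", "market", "the"],
    ["market", "price", "the"],
]
vocab, vectors = tfidf_bag_of_words(docs, stopwords={"the"}, min_df=2)
print(vocab)  # {'market': 1, 'oil': 2, 'price': 3}
```

In this toy run, "rise" is dropped because it appears in a single document, and "the" is dropped as a stopword.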
In studies carried out using this collection, one language is typically considered as one view of the document. We therefore have 5 views of each of the 111,740 documents extracted from the Reuters corpus. In Amini et al. (2009), 20% of the documents were reserved as a test set, and the results were averaged over 10 random choices of labeled training examples.
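An evaluation protocol of this kind could be set up as sketched below. This is an illustration under assumptions, not the procedure of Amini et al. (2009): the function name `make_splits`, the number of labeled examples `n_labeled`, the fixed seed, and the choice to keep one test set across all repetitions are all hypothetical.

```python
import random


def make_splits(n_docs, n_labeled, test_fraction=0.2, n_repeats=10, seed=0):
    """Return (test_idx, pool_idx, labeled_sets) as lists of document indices.

    Assumptions: a single held-out test set is shared by all repetitions,
    and each repetition draws `n_labeled` training examples at random
    from the remaining pool.
    """
    rng = random.Random(seed)
    indices = list(range(n_docs))
    rng.shuffle(indices)
    n_test = int(round(test_fraction * n_docs))
    test_idx = indices[:n_test]
    pool_idx = indices[n_test:]
    labeled_sets = [rng.sample(pool_idx, n_labeled) for _ in range(n_repeats)]
    return test_idx, pool_idx, labeled_sets


def average(scores):
    """Average a per-repetition score, e.g. accuracy on the test set."""
    return sum(scores) / len(scores)


test_idx, pool_idx, labeled_sets = make_splits(n_docs=100, n_labeled=5)
print(len(test_idx), len(pool_idx), len(labeled_sets))  # 20 80 10
```

Because the same line position identifies the same document in every view, one set of indices can be applied to all 5 language versions.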
The collection is available as a bzipped tar archive, MultiLingualReuters.tar.bz2 (166 MB, 464 MB uncompressed). This test collection is publicly available for research purposes only. If you publish results based on this data set, please acknowledge its use by citing Amini et al. (2009); the full reference is given in the AUG09 BibTeX entry below.
Uncompressing MultiLingualReuters.tar.bz2 creates the directory MultiLingualReutersCollection/, which contains 5 subdirectories, EN, FR, GR, IT and SP, corresponding to the 5 languages. Each subdirectory contains 5 files, each holding the indexed documents written in or translated into that language. For example, EN contains files such as Index_EN-EN, and similarly for the 4 other languages.
Each file contains one indexed document per line, in a format similar to SVM_light. Each line is of the form:
cat feature:value feature:value ...
where cat is the category label, i.e. one of C15, CCAT, E21, ECAT, GCAT or M11, and each feature:value is a (feature, value) pair, listed in ascending order of feature index.
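A line in this format can be parsed with a few lines of Python. The sketch below assumes that feature indices are integers and values are decimal numbers, as in SVM_light; the helper names `parse_line` and `read_index_file` and the sample line are made up for illustration.

```python
def parse_line(line):
    """Parse 'cat feature:value feature:value ...' into (label, features).

    Assumes integer feature indices and numeric values (as in SVM_light).
    """
    parts = line.split()
    label = parts[0]
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


def read_index_file(path):
    """Read one index file into a list of (label, features) tuples."""
    with open(path, encoding="utf-8") as handle:
        return [parse_line(line) for line in handle if line.strip()]


# Hypothetical line, not taken from the collection.
label, features = parse_line("C15 3:0.12 17:0.5 204:0.031")
print(label)     # C15
print(features)  # {3: 0.12, 17: 0.5, 204: 0.031}
```

A plain dict keeps the sketch dependency-free; for learning experiments a sparse matrix representation would usually be more practical.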
The order of documents is maintained in corresponding files, for example, FR/Index_EN-FR and EN/Index_EN-EN have the same number of documents (and therefore the same number of lines), in the same order.
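Since corresponding files share the same line order, two views of the same documents can be paired by reading the files in lockstep. The following is a minimal sketch; the function name `iter_aligned_views` is hypothetical, and the check that both lines carry the same category label is an assumed sanity test rather than a documented guarantee.

```python
from itertools import zip_longest


def iter_aligned_views(path_a, path_b):
    """Yield (label, line_a, line_b) for documents at the same line position."""
    with open(path_a, encoding="utf-8") as fa, open(path_b, encoding="utf-8") as fb:
        for number, (line_a, line_b) in enumerate(zip_longest(fa, fb), start=1):
            # zip_longest pads the shorter file with None.
            if line_a is None or line_b is None:
                raise ValueError(f"files differ in length at line {number}")
            label_a = line_a.split(maxsplit=1)[0]
            label_b = line_b.split(maxsplit=1)[0]
            # Assumed sanity check: both views of a document share one label.
            if label_a != label_b:
                raise ValueError(
                    f"label mismatch at line {number}: {label_a} vs {label_b}"
                )
            yield label_a, line_a.rstrip("\n"), line_b.rstrip("\n")
```

For instance, calling it on MultiLingualReutersCollection/EN/Index_EN-EN and MultiLingualReutersCollection/FR/Index_EN-FR would pair each document in the first file with the line at the same position in the second.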
We thank Reuters for making the RCV1/RCV2 data available and granting permission to distribute processed versions of it. More information is available in the README file.
@inproceedings{AUG09,
author = "Massih-Reza Amini and Nicolas Usunier and Cyril Goutte",
title = "Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization",
booktitle = "Advances in Neural Information Processing Systems 22 (NIPS 2009)",
url = "http://books.nips.cc/papers/files/nips22/NIPS2009_0688.pdf",
pages = "28--36",
year = "2009"
}
@inproceedings{USLJ07,
author = "Nicola Ueffing and Michel Simard and Samuel Larkin and J.~Howard Johnson",
title = "{NRC}'s {PORTAGE} system for {WMT} 2007",
booktitle = "ACL-2007 Second Workshop on SMT",
url = "http://www.statmt.org/wmt07/pdf/WMT24.pdf",
pages = "185--188",
year = "2007"
}