Reuters RCV1/RCV2 Multilingual, Multiview Text Categorization Test collection



Distribution 1.0
30 September 2009

Massih R. Amini and Cyril Goutte

National Research Council of Canada





 


Distribution

This test collection contains feature characteristics of documents originally written in five different languages (English, French, German, Spanish and Italian), and their translations, over a common set of 6 categories. This collection can be used for multilingual categorization, crosslingual categorization and multiview learning (one view=one language) research. Documents have been translated and preprocessed as explained below, and are made available as feature characteristics in a "bag of words" format.


Corpus Acquisition, Processing and Statistics

Documents from 6 large Reuters categories (CCAT, C15, ECAT, E21, GCAT and M11) were extracted from RCV1 (for English), and RCV2 (for French, German, Italian and Spanish). We sampled up to 5000 documents for each category in each language. Documents belonging to more than one of the 6 categories were assigned to the smallest category. This resulted in 12-30K documents per language, and 13-21K documents per class. The distribution of documents over languages is:

Language    Number of documents    Percentage    Vocabulary size
English                  18,758         16.78             21,531
French                   26,648         23.45             24,893
German                   29,953         26.80             34,279
Italian                  24,039         21.51             15,506
Spanish                  12,342         11.46             11,547
TOTAL:                  111,740        100.00

The distribution of documents over classes is:

Class     Number of documents    Percentage
C15                    18,816         16.84
CCAT                   21,426         19.17
E21                    13,701         12.26
ECAT                   19,198         17.18
GCAT                   19,178         17.16
M11                    19,421         17.39
TOTAL:                111,740        100.00

In order to produce multilingual versions of each document, each original document was translated into the other 4 languages using a statistical machine translation system. We used the Portage system described by Ueffing et al. (2007), trained on the Europarl corpus for the 20 language pairs required here.
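The count of 20 language pairs follows from taking every ordered (source, target) pair among the 5 languages, since each translation direction is counted separately. A small sketch of that arithmetic, using the language codes of the collection's subdirectories (the variable names are illustrative only):

```python
from itertools import permutations

# Language codes as used for the collection's subdirectories.
LANGUAGES = ["EN", "FR", "GR", "IT", "SP"]

# Every ordered (source, target) pair with source != target:
# 5 source languages x 4 target languages.
language_pairs = list(permutations(LANGUAGES, 2))

print(len(language_pairs))  # 20
```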

Each of the resulting 558,700 document versions (111,740 documents in 5 languages) was preprocessed and indexed using a standard preprocessing chain including removal of stopwords and low-frequency words. Documents were then represented as a bag of words using a TFIDF-based weighting scheme.
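The exact weighting formula is not given here beyond being TFIDF-based. Purely as an illustration, the sketch below computes one common textbook variant (raw term frequency times log(N/df)) on toy tokenized documents. The function name tfidf, the toy documents and the choice of variant are all assumptions; the collection itself may have been built with a different formula or normalization.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses tf(t, d) * log(N / df(t)). This variant is an assumption, not
    necessarily the scheme used to build the collection.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted

# Toy documents, invented for illustration.
docs = [["market", "shares", "market"], ["market", "election"]]
weights = tfidf(docs)
```

With this variant, a term that occurs in every document (here "market") receives weight zero, while terms specific to one document keep a positive weight.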

In studies carried out using this collection, one language is typically considered as one view of the document. We therefore have 5 views of each of the 111,740 documents extracted from the Reuters corpus. In Amini et al. (2009), 20% of the documents were reserved as a test set, and the results were averaged over 10 random choices of labeled training examples.
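An evaluation protocol of this kind could be set up roughly as follows. This is a minimal sketch, not the procedure used in Amini et al. (2009): the function name make_splits, the number of labeled examples per draw (n_labeled), the seed, and the choice to keep one fixed test set across the 10 repetitions are all assumptions.

```python
import random

def make_splits(n_docs, test_fraction=0.2, n_labeled=10, n_repeats=10, seed=0):
    """Hold out a test set, then draw several random labeled training subsets.

    n_labeled and seed are illustrative assumptions; the text above only
    states the 20% test fraction and the 10 repetitions.
    """
    rng = random.Random(seed)
    indices = list(range(n_docs))
    rng.shuffle(indices)
    n_test = int(round(test_fraction * n_docs))
    test_idx = indices[:n_test]
    train_pool = indices[n_test:]
    # Each repetition picks a different random set of labeled training examples.
    labeled_sets = [rng.sample(train_pool, n_labeled) for _ in range(n_repeats)]
    return test_idx, train_pool, labeled_sets

test_idx, train_pool, labeled_sets = make_splits(n_docs=111740)
```

The function returns document indices rather than documents, so the same index lists can be applied to each of the 5 views of a given set of documents.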


Download & Copyright

The collection is distributed as a bzipped tar archive (166 MB compressed, 464 MB uncompressed). This test collection is publicly available for research purposes only. If you publish results based on this data set, please acknowledge its use by referring to:

M.-R. Amini, N. Usunier, C. Goutte. Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization. Advances in Neural Information Processing Systems 22 (NIPS 2009), 2009.


Content

Uncompressing MultiLingualReuters.tar.bz2 will create the directory MultiLingualReutersCollection/ which contains 5 subdirectories EN, FR, GR, IT and SP, corresponding to the 5 languages. Each subdirectory in {EN, FR, GR, IT, SP} contains 5 files, each containing indexes of the documents written or translated in that language. For example, EN contains files

  • Index_EN-EN : Original English documents,
  • Index_FR-EN : French documents translated to English,
  • Index_GR-EN : German documents translated to English,
  • Index_IT-EN : Italian documents translated to English,
  • Index_SP-EN : Spanish documents translated to English.

The subdirectories for the 4 other languages are organized in the same way.
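The naming convention can be summarized as: the subdirectory is the language of the view, and the file name is Index_<source language>-<view language>. A small sketch that builds these paths (the helper names index_path and view_paths are hypothetical, not part of the distribution):

```python
from pathlib import Path

# Language codes used for the collection's subdirectories.
LANGUAGES = ["EN", "FR", "GR", "IT", "SP"]

def index_path(root, source_lang, view_lang):
    """Path of the file holding source_lang documents expressed in view_lang."""
    return Path(root) / view_lang / f"Index_{source_lang}-{view_lang}"

def view_paths(root, source_lang):
    """All 5 views (original plus 4 translations) of one source language."""
    return {view: index_path(root, source_lang, view) for view in LANGUAGES}

root = "MultiLingualReutersCollection"
print(index_path(root, "FR", "EN"))  # French documents translated to English
```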

Each file contains one indexed document per line, in a format similar to SVM_light. Each line is of the form:

  cat feature:value feature:value ...

where cat is the category label, i.e. one of C15, CCAT, E21, ECAT, GCAT or M11, and each feature:value is a (feature index, value) pair. Pairs are listed in ascending order of feature index.
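A line in this format can be parsed with a few lines of Python. The sketch below assumes that feature indices are integers and values are decimal numbers; the function name parse_line and the sample line are invented for illustration and do not come from the collection.

```python
def parse_line(line):
    """Parse 'cat feature:value feature:value ...' into (label, {index: value})."""
    tokens = line.split()
    label = tokens[0]
    features = {}
    for token in tokens[1:]:
        # Assumption: integer feature index, floating-point value.
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features

# Made-up example line, not taken from the actual files.
label, features = parse_line("C15 3:0.25 17:0.5 204:0.125")
```

A dictionary keeps the representation sparse; for use with numerical libraries, the (index, value) pairs can be collected into a sparse matrix instead.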

The order of documents is maintained in corresponding files, for example, FR/Index_EN-FR and EN/Index_EN-EN have the same number of documents (and therefore the same number of lines), in the same order.
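Because corresponding files are aligned line by line, two views of the same documents can be read in parallel. The following sketch checks that two such files have the same number of lines and, as an additional assumption, that the category label on each line agrees in both views. The function name count_aligned_documents is hypothetical.

```python
from itertools import zip_longest

def count_aligned_documents(path_a, path_b):
    """Count documents in two corresponding files, checking that they line up.

    Raises ValueError if the files differ in length or if the category
    labels on matching lines disagree.
    """
    n = 0
    with open(path_a, encoding="utf-8") as fa, open(path_b, encoding="utf-8") as fb:
        for line_a, line_b in zip_longest(fa, fb):
            if line_a is None or line_b is None:
                raise ValueError("files have different numbers of lines")
            # Compare only the first token (the category label) of each line.
            if line_a.split(None, 1)[:1] != line_b.split(None, 1)[:1]:
                raise ValueError(f"label mismatch on line {n + 1}")
            n += 1
    return n
```

For instance, calling it on EN/Index_EN-EN and FR/Index_EN-FR inside the uncompressed collection directory should return the number of original English documents.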


Acknowledgements

We thank Reuters for making the RCV1/RCV2 data available and granting permission to distribute processed versions of it. More information is available in the README file.


Bibliography

@inproceedings{AUG09,
author = "Massih-Reza Amini and Nicolas Usunier and Cyril Goutte",
title = "Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization",
booktitle = "Advances in Neural Information Processing Systems 22 (NIPS 2009)",
url = "http://books.nips.cc/papers/files/nips22/NIPS2009_0688.pdf",
pages = "28--36",
year = "2009"
}

@inproceedings{USLJ07,
author = "Nicola Ueffing and Michel Simard and Samuel Larkin and J.~Howard Johnson",
title = "{NRC}'s {PORTAGE} system for {WMT} 2007",
booktitle = "ACL-2007 Second Workshop on SMT",
url = "http://www.statmt.org/wmt07/pdf/WMT24.pdf",
pages = "185--188",
year = "2007"
}