Reuters RCV1/RCV2 Multilingual, Multiview Text Categorization Test Collection

^Distribution

This test collection contains feature characteristics of documents originally written in five different languages (English, French, German, Spanish and Italian), together with their translations, over a common set of 6 categories. It can be used for research on multilingual categorization, crosslingual categorization and multiview learning (one view = one language). Documents have been translated and preprocessed as explained below, and are made available as feature characteristics in a "bag of words" format.

^{Corpus Acquisition, Processing and Statistics}

Documents from 6 large Reuters categories (CCAT, C15, ECAT, E21, GCAT and M11) were extracted from RCV1 (for English), and RCV2 (for French, German, Italian and Spanish). We sampled up to 5000 documents for each category in each language. Documents belonging to more than one of the 6 categories were assigned to the smallest category. This resulted in 12-30K documents per language, and 13-21K documents per class. The distribution of documents over languages is:

Language	Number of documents	percentage	Vocabulary size
English	18,758	16.78	21,531
French	26,648	23.45	24,893
German	29,953	26.80	34,279
Italian	24,039	21.51	15,506
Spanish	12,342	11.46	11,547
TOTAL:	111,740	100.00

The distribution of documents over classes is:

Class	Number of documents	percentage
C15	18,816	16.84
CCAT	21,426	19.17
E21	13,701	12.26
ECAT	19,198	17.18
GCAT	19,178	17.16
M11	19,421	17.39
TOTAL:	111,740	100.00
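The rule stated above for documents carrying several of the 6 labels (assign them to the smallest category) can be sketched as follows. This is only an illustration: the function name, the toy label sets, the way category size is measured (number of documents carrying the label before disambiguation) and the alphabetical tie-break are all assumptions of this sketch, not details given for the actual corpus construction.

```python
from collections import Counter


def assign_single_label(doc_labels):
    """Map each document's set of labels to a single label.

    doc_labels: list of sets of category codes, one set per document.
    Category size is measured as the number of documents carrying the
    label before disambiguation (an assumption of this sketch).
    """
    sizes = Counter(label for labels in doc_labels for label in labels)
    # Ties are broken alphabetically so the result is deterministic
    # (also an assumption; the source does not discuss ties).
    return [min(labels, key=lambda c: (sizes[c], c)) for labels in doc_labels]


# Toy label sets, invented for illustration only.
docs = [{"CCAT"}, {"CCAT", "C15"}, {"CCAT"}, {"C15"}, {"GCAT", "CCAT"}]
single = assign_single_label(docs)
print(single)
```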

In order to produce multilingual versions of each document, each original document was translated into the other 4 languages using a statistical machine translation system. We used the Portage system described by Ueffing et al. (2007), trained on the Europarl corpus for the 20 language pairs required here.

Each of the resulting 558,700 document versions (111,740 documents in 5 languages) was preprocessed and indexed using a standard preprocessing chain including removal of stopwords and low-frequency words. Documents were then represented as a bag of words using a TFIDF-based weighting scheme.
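The exact weighting formula is not specified here. As a rough illustration of what a TFIDF-based bag of words looks like, the sketch below uses one common textbook variant (raw term frequency multiplied by log(N/df)). The function name and the toy documents are hypothetical, and the stopword and low-frequency filtering mentioned above are not reproduced.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        # Textbook variant: weight = tf * log(N / df). The collection may
        # well use a different normalisation; this is an assumption.
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted


# Toy tokenised documents, invented for illustration only.
toy_docs = [["market", "shares", "market"], ["government", "budget"], ["market", "budget"]]
weights = tfidf(toy_docs)
print(weights[0])
```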

In studies carried out using this collection, one language is typically considered as one view of the document. We therefore have 5 views of each of the 111,740 documents extracted from the Reuters corpus. In Amini et al. (2009), 20% of the documents were reserved as a test set, and the results were averaged over 10 random choices of labeled training examples.
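A protocol of this kind could be set up along the following lines. This is a minimal sketch, not the procedure of Amini et al. (2009): the function name, the seed, the number of labeled examples and the choice to keep one fixed test set across the 10 repetitions are all assumptions.

```python
import random


def make_splits(n_docs, n_labeled, test_fraction=0.2, n_repeats=10, seed=0):
    """Return (test_indices, list of labeled-training index lists)."""
    rng = random.Random(seed)
    indices = list(range(n_docs))
    rng.shuffle(indices)
    n_test = int(round(test_fraction * n_docs))
    # Assumption: the same held-out test set is reused for every repetition.
    test_idx = sorted(indices[:n_test])
    pool = indices[n_test:]
    # Each repetition draws a fresh random set of labeled training examples.
    labeled_sets = [sorted(rng.sample(pool, n_labeled)) for _ in range(n_repeats)]
    return test_idx, labeled_sets


# Hypothetical sizes, chosen only to keep the example small.
test_idx, labeled_sets = make_splits(n_docs=1000, n_labeled=50)
print(len(test_idx), len(labeled_sets))
```

Because documents appear in the same order in every view, a single set of line indices can be applied to all 5 language files.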

^{Download & Copyright}

The collection is available as a bzipped tar archive (166 MB compressed, 464 MB uncompressed). This test collection is publicly available for research purposes only. If you publish results based on this data set, please acknowledge its use by referring to:

M.-R. Amini, N. Usunier, C. Goutte. Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization. Advances in Neural Information Processing Systems 22 (NIPS 2009), 2009.

^Content

Uncompressing MultiLingualReuters.tar.bz2 will create the directory MultiLingualReutersCollection/ which contains 5 subdirectories EN, FR, GR, IT and SP, corresponding to the 5 languages. Each subdirectory in {EN, FR, GR, IT, SP} contains 5 files, each containing indexes of the documents written or translated in that language. For example, EN contains files

Index_EN-EN : Original English documents,
Index_FR-EN : French documents translated to English,
Index_GR-EN : German documents translated to English,
Index_IT-EN : Italian documents translated to English,
Index_SP-EN : Spanish documents translated to English.

The files in the 4 other subdirectories follow the same naming scheme.
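The naming scheme can be summarised in a few lines of Python. The helper name index_path is hypothetical, and the root directory is assumed to be the one created by uncompressing the archive in the current working directory.

```python
from pathlib import Path

LANGS = ("EN", "FR", "GR", "IT", "SP")


def index_path(root, source, target):
    """Path of the file holding `source`-language documents rendered in `target`."""
    return Path(root) / target / f"Index_{source}-{target}"


root = "MultiLingualReutersCollection"
all_files = [index_path(root, s, t) for t in LANGS for s in LANGS]
print(len(all_files))  # 25
print(index_path(root, "FR", "EN").as_posix())  # MultiLingualReutersCollection/EN/Index_FR-EN
```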

Each file contains one indexed document per line, in a format similar to SVM_light. Each line is of the form: cat feature:value feature:value ... where cat is the category label, i.e. one of C15, CCAT, E21, ECAT, GCAT or M11, and each feature:value is a (feature, value) pair. Pairs appear in ascending order of feature index.
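A line in this format can be parsed with a few lines of Python. The sketch below assumes that feature indices are integers and values are decimal numbers, as in SVM_light. The function name and the sample line are invented for illustration and are not taken from the collection.

```python
def parse_line(line):
    """Parse 'cat feature:value feature:value ...' into (label, {index: weight})."""
    tokens = line.split()
    label = tokens[0]
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        # Assumption: integer feature indices and float weights.
        features[int(index)] = float(value)
    return label, features


# Made-up example line, not an actual document from the collection.
label, features = parse_line("C15 3:0.25 17:0.5 204:0.125")
print(label, features)
```

A dictionary keeps the sketch short. For learning experiments, a sparse matrix representation would usually be more convenient.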

The order of documents is maintained in corresponding files, for example, FR/Index_EN-FR and EN/Index_EN-EN have the same number of documents (and therefore the same number of lines), in the same order.
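Since line i in one file and line i in a corresponding file describe the same document, the views can be read in parallel. The sketch below shows one possible way to do this. The function name is hypothetical, and the two small files written in the demo are made-up stand-ins for EN/Index_EN-EN and FR/Index_EN-FR.

```python
import tempfile
from contextlib import ExitStack
from pathlib import Path


def iter_aligned(paths):
    """Yield one tuple per document, holding its line from each file in `paths`."""
    with ExitStack() as stack:
        handles = [stack.enter_context(open(p, encoding="utf-8")) for p in paths]
        # zip stops at the shortest file; corresponding files are expected
        # to have the same number of lines.
        for lines in zip(*handles):
            yield tuple(line.rstrip("\n") for line in lines)


# Demo on two tiny made-up files (contents invented for illustration).
with tempfile.TemporaryDirectory() as tmp:
    en = Path(tmp) / "Index_EN-EN"
    fr = Path(tmp) / "Index_EN-FR"
    en.write_text("C15 1:0.5 4:0.25\nGCAT 2:1.0\n", encoding="utf-8")
    fr.write_text("C15 3:0.5 9:0.75\nGCAT 5:1.0\n", encoding="utf-8")
    pairs = list(iter_aligned([en, fr]))

print(pairs[0])
```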

^{Acknowledgements}

We thank Reuters for making the RCV1/RCV2 data available and granting permission to distribute processed versions of it. More information is available in the README file.

^Bibliography

@inproceedings{AUG09,
author = "Massih-Reza Amini and Nicolas Usunier and Cyril Goutte",
title = "Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization",
booktitle = "Advances in Neural Information Processing Systems 22 (NIPS 2009)",
url = "http://books.nips.cc/papers/files/nips22/NIPS2009_0688.pdf",
pages = "28--36",
year = "2009"
}

@inproceedings{USLJ07,
author = "Nicola Ueffing and Michel Simard and Samuel Larkin and J.~Howard Johnson",
title = "{NRC}'s {PORTAGE} system for {WMT} 2007",
booktitle = "ACL-2007 Second Workshop on SMT",
url = "http://www.statmt.org/wmt07/pdf/WMT24.pdf",
pages = "185--188",
year = "2007"
}