Reducing the Annotation Burden in Text Classification

Anastasia Krithara(2), Cyril Goutte(2), Jean-Michel Renders(2), Massih-Reza Amini(1)
(1) Laboratoire d'Informatique Paris 6, 8, rue du Capitaine Scott, 75015 Paris
(2) Xerox Research Center Europe, 6, Chemin de Maupertuis, 38240 Meylan

In this paper we describe a method that combines semi-supervised and active learning for text classification. In particular, we propose a semi-supervised PLSA (Probabilistic Latent Semantic Analysis) algorithm combined with a pool-based active learning method. Active and semi-supervised learning are two families of approaches that aim to reduce the labeling effort in supervised machine learning. Both address the same problem, but from different perspectives. We consider the pool-based active learning model, in which promising unlabeled examples are selected sequentially from a given pool, namely those whose labels are expected to contribute most to a more accurate prediction function. Semi-supervised learning, in turn, reduces the annotation effort by automatically labeling the examples whose class membership the system predicts with high confidence. Our first experiments on the newsgroups dataset give promising results. With very few labeled examples we achieve good accuracy, and adding active learning on top improves performance further with as little annotation as possible. In particular, our first results show an average gain of about 10-15% in text classification accuracy from annotating only 3% of the unlabeled data. Our algorithm can be extended to also take the cost of labeling into account. We would then be able to choose which document to label according to the trade-off between its ambiguity and its annotation cost (for example, a long document is harder to annotate than a short one).
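
The two selection rules described above can be illustrated with a minimal Python sketch. It is not the semi-supervised PLSA model itself: it assumes that some classifier has already produced class-probability vectors for the pool, it uses Shannon entropy as the measure of ambiguity, and it divides ambiguity by cost to express the trade-off. All three choices, along with the names `select_query`, `self_label`, and the 0.95 threshold, are assumptions made for illustration only.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_query(pool_probs, costs=None):
    """Return the index of the pool document to send to the annotator.

    pool_probs: one class-probability vector per unlabeled document.
    costs: optional positive annotation costs (e.g. document length);
           when omitted, every document is assumed to cost the same.
    """
    if costs is None:
        costs = [1.0] * len(pool_probs)
    # Assumed trade-off: ambiguity (entropy) per unit of annotation cost.
    scores = [entropy(p) / c for p, c in zip(pool_probs, costs)]
    return max(range(len(scores)), key=scores.__getitem__)


def self_label(pool_probs, threshold=0.95):
    """Return (index, predicted class) for confidently classified documents."""
    confident = []
    for i, probs in enumerate(pool_probs):
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            confident.append((i, best))
    return confident


# Hypothetical posteriors for three unlabeled documents and three classes.
pool_probs = [
    [0.98, 0.01, 0.01],  # confident: candidate for automatic labeling
    [0.40, 0.35, 0.25],  # ambiguous: candidate for manual annotation
    [0.70, 0.20, 0.10],
]

query_index = select_query(pool_probs)
auto_labeled = self_label(pool_probs)
costly_query_index = select_query(pool_probs, costs=[1.0, 10.0, 1.0])
```

With uniform costs the most ambiguous document (index 1) is queried, while the first document is labeled automatically. Once the ambiguous document is made ten times more expensive to annotate, the selection moves to a cheaper, moderately ambiguous one.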