Unsupervised Learning with Term Clustering for Thematic Segmentation of Texts

Marc Caillet, Jean-François Pessiot, Massih-Reza Amini, Patrick Gallinari
Laboratoire d'Informatique Paris 6
8, Rue du capitaine Scott
75015 Paris

In this paper we introduce a machine learning approach to automatic text segmentation. Our segmenter groups text segments that contain similar concepts. It first discovers the concepts present in a text, each concept being defined as a set of representative terms. The text is then partitioned into coherent paragraphs using a clustering technique based on the Classification Maximum Likelihood approach. We evaluate the effectiveness of this technique on sets of concatenated paragraphs drawn from two corpora, 7sectors and 20 Newsgroups, and compare it to a baseline text segmentation technique proposed by Salton et al.
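To make the clustering step concrete, here is a minimal sketch of a Classification Maximum Likelihood style procedure (hard-assignment EM) applied to paragraphs that have already been mapped to concept counts. Several choices are assumptions made for illustration and are not taken from the text above: the multinomial component model, the add-one smoothing, the caller-supplied initial labels, and the function name cem_multinomial.

```python
import math

def cem_multinomial(counts, init_labels, k, n_iter=20, alpha=1.0):
    """Hard-assignment (classification) EM for a mixture of multinomials.

    counts      -- list of equal-length count vectors, one per paragraph
                   (hypothetical representation: occurrences of each concept)
    init_labels -- starting cluster index for each vector (assumed given)
    k           -- number of clusters
    alpha       -- additive smoothing for the multinomial parameters (assumed)
    """
    labels = list(init_labels)
    dim = len(counts[0])
    for _ in range(n_iter):
        # M-step: re-estimate cluster priors and smoothed multinomials
        sizes = [0] * k
        totals = [[alpha] * dim for _ in range(k)]
        for x, c in zip(counts, labels):
            sizes[c] += 1
            for j, v in enumerate(x):
                totals[c][j] += v
        log_prior = [math.log((sizes[c] + 1.0) / (len(counts) + k))
                     for c in range(k)]
        log_theta = []
        for c in range(k):
            s = sum(totals[c])
            log_theta.append([math.log(t / s) for t in totals[c]])
        # E- and C-steps combined: assign each vector to its most likely cluster
        new_labels = []
        for x in counts:
            scores = [log_prior[c]
                      + sum(v * log_theta[c][j] for j, v in enumerate(x))
                      for c in range(k)]
            new_labels.append(max(range(k), key=scores.__getitem__))
        if new_labels == labels:  # partition is stable: stop
            break
        labels = new_labels
    return labels

# Four toy paragraphs over three concepts; the first two favour concept 0,
# the last two favour concept 1.
paragraphs = [[5, 0, 1], [4, 1, 0], [0, 6, 1], [1, 5, 0]]
print(cem_multinomial(paragraphs, [0, 1, 1, 1], k=2))  # [0, 0, 1, 1]
```

Like other EM variants, this procedure only reaches a local optimum, so the result depends on the initial labels; on the toy data above, starting from the alternating labelling [0, 1, 0, 1] leaves the partition unchanged.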