Sequence Models for Automatic Highlighting and Surface Information Extraction

Massih-Reza Amini, Hugo Zaragoza, Patrick Gallinari
Laboratoire d'Informatique Paris 6
case 169
4, place de Jussieu
75252 Paris cedex 05

With the growth of textual information available electronically, we are witnessing a great diversification of the demands placed on Information Retrieval (IR) and Information Extraction (IE) systems. In this paper we apply Machine Learning techniques for sequence analysis to the tasks of highlighting and labeling text with respect to an information extraction task. Specifically, we use dynamic probabilistic models. Like IR systems, these models use little semantics, are fully trainable, and do not require any knowledge representation of the domain. Unlike IR approaches, they treat a document as a dynamic sequence of words rather than as a static vector. Furthermore, additional word information is naturally included in the representation. The models are evaluated on a sub-task of the MUC6 Scenario Template corpus. We consider different text representations using semantic and syntactic knowledge and analyze the influence of different grammatical constraints on the models.