Stochastic Models for Surface Information Extraction in Texts

Massih-Reza Amini, Hugo Zaragoza, Patrick Gallinari
Laboratoire d'Informatique Paris 6
case 169
4, place de Jussieu
75252 Paris cedex 05

In this paper, we describe the application of numerical machine learning techniques to information extraction from a collection of textual data. More precisely, we model text sequences with Hidden Markov Models (HMMs) and Multi-layer Perceptrons (MLPs) and show how these models can be used to perform specific surface extraction tasks (i.e., tasks that do not require in-depth syntactic or semantic analysis). We consider different text representations that use semantic and syntactic knowledge and, using the MUC6 corpus, analyze the influence of different grammatical constraints on the models.