My research lies at the intersection of Artificial Intelligence (encompassing machine learning and computational linguistics) and what is now called Data Science. I am interested in the general problem of accessing, mining and learning from large (text) collections through machine learning models and methods. I work on both fundamental problems (developing new models that explain different characteristics of large-scale collections and networks) and applications related to computational linguistics and information retrieval.

In computational linguistics, I developed machine learning and corpus-based approaches, designing probabilistic models for bilingual lexicon extraction from corpora as well as for information extraction and machine translation. In information retrieval (IR), I have studied the theoretical foundations of the main IR models and their relation to properties of textual collections. I also proposed the first analytical version of heuristic IR constraints in different settings (ad hoc IR, cross-lingual IR and relevance feedback).

In machine learning (ML) for text mining, I have worked in particular on latent topic models and their link to matrix factorization. My contributions also include a theoretical analysis of hierarchical text categorization and extensions of latent topic models that use copulas to explicitly model dependencies between variables (e.g. words and topics). With George Paliouras, and within the Network of Excellence Pascal 2, I developed a series of challenges and workshops on Large Scale Hierarchical Text Classification (LSHTC).

Service to the scientific community

I was a member of the Executive Board of the European Association for Computational Linguistics from 2007 to 2010 and a member of the Computer Science panel of the European Research Council for Starting/Consolidator Grants from 2007 to 2013. I have been a member of the Advisory Board of SIGDAT since 2005.
I was program co-chair of EMNLP 2006, workshop co-chair for EMNLP 2014, chair of CORIA 2015 and co-chair of IEEE DSAA 2015. I have also served as area chair for several years for SIGIR and ECIR, and have co-chaired and co-organized several international workshops (such as LSHTC1, within ECIR 2010, LSHTC2, within ECML 2011, and LSHTC3, within ECML 2013).
I am a member of the editorial boards of the following journals: International Journal of Data Science and Analytics, Information Retrieval, Information and Document Numérique. I am a past member of the editorial boards of Traitement automatique des langues, Computational Linguistics, and International Journal of Corpus Linguistics.

PhD students

Awards

  • The paper A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation, co-written with C. Goutte, received the Test of Time award at ECIR (European Conference on Information Retrieval) 2016
  • The paper Task Composition in Crowdsourcing, co-written with S. Amer-Yahia, V. Leroy, J. Pilourdault, R. M. Borromeo and M. Toyama, received an honorable mention award at DSAA (International IEEE Conference on Data Science and Advanced Analytics) 2016
  • The paper A Theoretical Analysis of Pseudo-Relevance Feedback Models, co-written with S. Clinchant, received the best paper award at ICTIR (International Conference on the Theory of Information Retrieval) 2013
  • The paper Modèles d'information pour la recherche multilingue, co-written with B. Li, received the best paper award at CORIA (Conférence pour la Recherche d'information et ses Applications) 2012
  • The paper Information-based models for ad-hoc IR, co-written with S. Clinchant, was nominated for the best paper award at SIGIR 2010
  • The French version of this paper, entitled Modèles de RI fondés sur l'information, received the best paper award at CORIA (Conférence pour la Recherche d'information et ses Applications) 2010