My research lies at the intersection of Artificial Intelligence (encompassing machine learning and computational linguistics) and what would now be called Data Science. I am interested in the general problem of accessing, mining and learning from large (text) collections through machine learning models and methods. I work on both fundamental problems (developing new models that explain different characteristics of large-scale collections and networks) and applications related to computational linguistics and information retrieval. In computational linguistics, I developed machine learning and corpus-based approaches, designing probabilistic models for bilingual lexicon extraction from corpora as well as for information extraction and machine translation. In information retrieval (IR), I have studied the theoretical foundations of the main IR models and their relation to properties of textual collections. I also proposed the first analytical version of heuristic IR constraints in different settings (ad hoc IR, cross-lingual IR and relevance feedback). In machine learning (ML) for text mining, I have worked in particular on latent topic models and their link to matrix factorization, and have contributed to a range of methods and models, including a theoretical analysis of hierarchical text categorization and extensions of latent topic models that use copulas to explicitly model dependencies between variables (e.g. words and topics). Together with George Paliouras, and within the Network of Excellence Pascal 2, I have also developed a series of challenges and workshops on Large Scale Hierarchical Text Classification (LSHTC).

  • Here's a video of the LIX colloquium on Data Science, held in Nov. 2014
  • And here's a short video to explain what textual information access is (in French)
Scientific community service

I was a member of the Executive Board of the European Association for Computational Linguistics from 2007 to 2010 and a member of the Computer Science panel of the European Research Council for Starting/Consolidator Grants from 2007 to 2013, and I have been a member of the Advisory Board of SIGDAT since 2005.
I was program co-chair of EMNLP 2006, workshop co-chair for EMNLP 2014, chair of CORIA 2015 and co-chair of IEEE DSAA 2015. I have also served as area chair for several years for SIGIR and ECIR, and have co-chaired and co-organized several international workshops (such as LSHC1, within ECIR 2010, LSHC2, within ECML 2011 and LSHC3, within ECML 2013).
I am a member of the editorial boards of the following journals: International Journal of Data Science and Analytics, Information Retrieval, Information and Document Numérique. I am a past member of the editorial boards of the following journals: Traitement automatique des langues, Computational Linguistics, International Journal of Corpus Linguistics.

  • Smart Support Centers (FUI) (March 2015-Sept. 2018)
  • Graphical models for modeling the dynamics of content networks (regional project) (Sept. 2014-Sept. 2017)
  • New theoretical frameworks in metric learning (regional project) (Sept. 2013-Sept. 2016)
  • Khronos, Persyval (labex) project on data mining of temporal data (Sept. 2013-Sept. 2017)
  • CNRS Mastodons project (started in 2012)
  • CLASS-Y (ANR project) (February 2011-February 2015)
  • BioASQ (Eur. project) (Oct. 2012-Oct. 2014)
  • PASCAL2 European network of excellence (2009-2013)
  • MeTRICC (ANR project) (December 2008-December 2011)
  • FRAGRANCES (ANR project) (December 2008-December 2011)
  • LASCAR (LArge Scale CAtegoRization - UJF project) (January 2008-December 2009)
  • INFOM@GIC (French project) (2005-2006 for my participation)
  • PASCAL European Network of Excellence (2004-2006)
  • REVEAL THIS (European project) (2004-2007)
  • KerMIT (European project) (2001-2004)
  • Outiller les Alliances (French project) (2001-2003)
  • MuchMore (European project) (1999-2002)
  • EUROTRA-63 (European project) (1992-1995)
PhD students
  • Maziar Moradi Fard, co-supervised with A. Douzal, ANR funding; Time series, machine learning; (2016-)
  • Yagmur Cinar, FUI funding; Information retrieval, machine learning; (2015-)
  • Adrien Dulac, co-supervised with C. Largeron, regional funding; Social network, machine learning; (2014-)
  • Diana Popa, co-supervised with J. Henderson and J. Perez, CIFRE XRCE/NaverLabs; Computational linguistics, machine learning; (2014-)
  • Hesam Amoualian, co-supervised with M.-R. Amini and M. Clausel, French national funding MESR; Machine learning; (2014-2017)
  • Théo Trouillon, co-supervised with G. Bouchard, CIFRE XRCE; Machine learning; (2014-2017)
  • Abdelkader El Mahdaouy, co-supervised with S. Ouatik, joint supervision (co-tutelle) with Univ. de Fès, Morocco; Computational linguistics, information retrieval; (2013-2017)
  • Irina Nicolae, co-supervised with M. Sebban, regional funding; Machine learning; (2013-2016)
  • Saeid Soheily Khah, co-supervised with A. Douzal, industrial funding; Data analysis; (2013-2016)
  • Hamid Mirisaee, co-supervised with A. Termier, French national funding MESR; Data mining, social network analysis and mining; (2012-2015)
  • François Kawala, co-supervised with A. Douzal, CIFRE Best of Media; Social network analysis and mining; (2011-2015)
  • Rohit Babbar, co-supervised with M.-R. Amini, ANR funding; Machine learning; (2011-2014)
  • Parantapa Goswami, co-supervised with M.-R. Amini, French national funding MESR; Information retrieval, machine learning; (2011-2014)
  • Cédric Lagnier, French national funding MESR; Social network analysis and mining; (2009-2013)
  • Clément Grimal, co-supervised with G. Bisson, ANR funding; Machine learning; (2009-2012)
  • Bo Li, ANR funding; Computational linguistics; (2009-2012)
  • Franck Meyer, Orange Labs; Machine learning; (2007-2012)
  • Stéphane Clinchant, CIFRE XRCE; Information retrieval; (2008-2011)
  • Ali Mustafa Qamar, French national funding MESR; Machine learning; (2007-2010)
  • Leila Kefi, co-supervised with C. Berrut, French national funding MNRT; Information retrieval; (2002-2006)
  • François Trouilleux, co-supervised with G. Bes and A. Zaenen, CIFRE XRCE; Linguistics; (1998-2001)

Most of my publications are available either on DBLP or on Google Scholar (recent ones are also available on HAL); most of my patents are available on the USPTO site.

  • The paper A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation, co-written with C. Goutte, received the Test of Time award at ECIR (European Conference on Information Retrieval) 2016
  • The paper Task Composition in Crowdsourcing, co-written with S. Amer-Yahia, V. Leroy, J. Pilourdault, R. M. Borromeo and M. Toyama, received an honorable mention award at DSAA (International IEEE Conference on Data Science and Advanced Analytics) 2016
  • The paper A Theoretical Analysis of Pseudo-Relevance Feedback Models, co-written with S. Clinchant, received the best paper award at ICTIR (International Conference on the Theory of Information Retrieval) 2013
  • The paper Modèles d'information pour la recherche multilingue, co-written with B. Li, received the best paper award at CORIA (Conférence pour la Recherche d'information et ses Applications) 2012
  • The paper Information-based models for ad-hoc IR, co-written with S. Clinchant, was nominated for the best paper award at SIGIR 2010
  • The French version of this paper, entitled Modèles de RI fondés sur l'information, received the best paper award at CORIA (Conférence pour la Recherche d'information et ses Applications) 2010