PhD Positions 2018-2019

PhD position in Machine Learning and Software Test Automation

Title: Analysis and Selection of Execution Traces with Machine Learning Techniques for Software Test Automation
Duration: 3 years
Starting date: October 2018
Supervisors: Christophe Brouard, Roland Groz
Application procedure and other details here or here (in French).

PhD position in Machine Learning and Software Engineering

Title: Enhanced Specification Mining of Workflows through Testing
Duration: 3 years
Starting date: October 2018
Supervisors: Roland Groz, Christophe Brouard
Application procedure and other details here.

Internships 2018-2019

Taming the Beast of the Preimage in Machine Learning for Temporal Data

Funding: API (ANR Project)
Supervisors: Ahlame Douzal (Ahlame.Douzal@imag.fr)
Description: Machine learning has been investigated with increasing success over the last decades. Its rise to prominence is undoubtedly driven by the ease of deriving efficient nonlinear models with the corresponding estimation algorithms, notably kernel machines, with the celebrated support vector machines, and the revival of deep neural networks. The backbone of these machines is the preprocessing of the data with a (cascade of) nonlinear transformation(s), in order to embed the data into some feature space, also called latent space, where data-processing techniques can be easily carried out. The major bottleneck is that one generally needs to extract patterns (i.e., the “antecedent” or inverse image) in the data space, not in the “virtual” feature space. This reverse embedding (Fig. 1) is of primary interest for pattern recognition and data mining. In this project, we focus on embeddings for temporal data and on the preimage estimation problem under two major representation settings for temporal data: kernel machinery representations and deep learning. This internship is preliminary work for a funded thesis at AMA-LIG (pdf here).
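
For readers unfamiliar with the preimage problem, one common way to state it for kernel machines is sketched below. The notation is an assumption made for illustration and is not taken from the project description: \phi is the feature map, \kappa the associated kernel, x_1, ..., x_n the training points, and \psi = \sum_i \alpha_i \phi(x_i) the feature-space element whose antecedent is sought.

```latex
% Preimage of \psi = \sum_{i=1}^{n} \alpha_i \phi(x_i), with
% \kappa(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}
\begin{aligned}
x^{\star}
  &= \arg\min_{x \in \mathcal{X}} \; \bigl\| \phi(x) - \psi \bigr\|_{\mathcal{H}}^{2} \\
  &= \arg\min_{x \in \mathcal{X}} \; \kappa(x, x)
     - 2 \sum_{i=1}^{n} \alpha_i \, \kappa(x, x_i)
     + \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \, \kappa(x_i, x_j)
\end{aligned}
```

The last term does not depend on x, so only the first two terms matter for the minimization. An exact preimage generally does not exist, which is why the problem is usually posed as an approximation.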

Uplift Prediction

Supervisors: Massih R. Amini & Eustache Diemert
Description: Uplift modeling aims at estimating and optimizing the causal impact of a treatment. In the digital advertising industry, the treatment is exposure to different ads (see Radcliffe (2007)). In personalized medicine, the treatment is usually a medication (see Jaskowski and Jaroszewicz (2012)). There are two main tasks: estimation of the effect at the population level and prediction of the effect at the individual level. The objective of this internship is to research i) an optimizable form of the evaluation metric and ii) links to multi-task learning. The intern will split their work time between Criteo and the university lab. The academic advisor will be Prof. Massih-Reza Amini from AMA-LIG, University Grenoble Alps. No activities other than research are expected from the student during the internship. However, the student will take part in the Research group meetings and have access to other researchers at Criteo. More information on the Criteo Research team can be found at http://research.criteo.com/. Publication is encouraged and is an important indicator of the success of the internship. The address of the local Criteo office is 4 rue des Meridiens 38130 Echirolles. The monthly pay rate offered by Criteo during the internship will be above the minimum required by the university. The internship is intended to run between February and 2 September 2018. Interns who demonstrate seriousness in their research may be offered a position as junior scientist to pursue a CIFRE PhD. More details can be found here: ama.liglab.fr/~amini/Cours/ML/Internships/uplift_modeling_internship_criteo.pdf
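
To make the two tasks concrete, here is a minimal sketch assuming randomized treatment assignment and binary outcomes. The record layout `(segment, treated, outcome)` and the function names are hypothetical, and per-segment uplift is used only as a crude stand-in for individual-level prediction; it is not the method studied in the internship.

```python
from collections import defaultdict


def conversion_rate(outcomes):
    """Mean of binary outcomes; 0.0 for an empty group."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def population_uplift(records):
    """Difference in mean outcome between treated and control groups.

    `records` is a list of (segment, treated, outcome) tuples (hypothetical
    layout). The difference is a valid effect estimate only if treatment
    was assigned at random.
    """
    treated = [y for _, t, y in records if t]
    control = [y for _, t, y in records if not t]
    return conversion_rate(treated) - conversion_rate(control)


def segment_uplift(records):
    """Uplift computed separately within each segment."""
    by_segment = defaultdict(list)
    for rec in records:
        by_segment[rec[0]].append(rec)
    return {seg: population_uplift(recs) for seg, recs in by_segment.items()}
```

For example, calling `population_uplift` on a list of such records gives the population-level estimate, while `segment_uplift` returns one estimate per segment.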

Multiclass semi-supervised Learning through pseudo-Labeling

Supervisors: Massih R. Amini & Émilie Devijver
Description: In this project, we consider the self-learning approach, which consists in learning a classifier on the labeled training set S, using its predictions as confidence scores, iteratively assigning pseudo-labels to the unlabeled examples whose prediction scores are above a given threshold, and then training a new classifier. The main question is the choice of the threshold, which remains open in the case of multi-class classification problems. In the binary case, a solution has been proposed under the PAC-Bayes framework. It consists in bounding the error of the Bayes classifier estimated over the unlabeled data having an absolute prediction score higher than a threshold, and then choosing the threshold which gives the tightest bound. In this project we propose, first, to extend this theoretical result to the multiclass case, using for example the confusion matrix, and then to use the result to automatically estimate different thresholds for pseudo-labeling unlabeled examples into different classes. The resulting algorithm will then be tested on real-world applications using genetics data or data extracted from different information retrieval problems.
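
The self-learning loop described above can be sketched as follows. This is an illustration under stated assumptions, not the algorithm to be developed: the base classifier is a simple nearest-centroid model with softmax confidence scores, and the per-class `thresholds` are supplied by hand, whereas the project aims to estimate them automatically. All function names are hypothetical.

```python
import math


def fit_centroids(X, y):
    """Return {label: centroid} computed from the labeled points."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}


def predict_scores(centroids, x):
    """Softmax over negative squared distances: a crude per-class confidence."""
    neg_d = {c: -sum((a - b) ** 2 for a, b in zip(x, mu))
             for c, mu in centroids.items()}
    m = max(neg_d.values())
    exp = {c: math.exp(v - m) for c, v in neg_d.items()}
    z = sum(exp.values())
    return {c: e / z for c, e in exp.items()}


def self_train(X_lab, y_lab, X_unlab, thresholds, max_iter=10):
    """Iteratively pseudo-label unlabeled points whose top score passes
    the threshold of the predicted class, then retrain."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_iter):
        centroids = fit_centroids(X_lab, y_lab)
        remaining, added = [], False
        for x in pool:
            scores = predict_scores(centroids, x)
            label = max(scores, key=scores.get)
            if scores[label] >= thresholds[label]:
                X_lab.append(x)
                y_lab.append(label)
                added = True
            else:
                remaining.append(x)
        pool = remaining
        if not added:  # no confident prediction left: stop
            break
    # Final model plus the points that never received a pseudo-label.
    return fit_centroids(X_lab, y_lab), pool
```

Using one threshold per class, rather than a single global one, mirrors the goal stated in the description of pseudo-labeling examples into different classes with different thresholds.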

Internships 2017

Learning representations from spatio-temporal data

Funding: LOCUST ANR Project (Internship followed by both a funded thesis and a post-doc)
Supervisors: Ahlame Douzal, Eric Gaussier
Team: AMA-LIG
Partners: MLIA-LIP6 UPMC, DEEZER
Description: Human interactions conducted either via the web and mobile services, or with artifacts, moving objects, and intelligent sensors, generate large flows of complex dynamic data. These user traces correspond to sequences of observations: events, measurements, semantic content, etc. They may have spatial (e.g. geo-localization) and temporal components, which are often composed of multiple types of information. The internship fits within the LOCUST project, whose objective is to build formal models and algorithmic tools for understanding, modeling and analyzing complex dynamic traces (spatio-temporal data) for a set of generic machine learning tasks and for target applications. Two use cases, concerning semantic information diffusion and urban computing respectively, will support the theoretical contributions and serve for evaluating the models and algorithms. The project is research oriented, with two academic partners and one industrial partner (DEEZER). More specifically, the candidate will work on recurrent neural networks (RNN) for classification, cokriging and forecasting of spatio-temporal data. Spatio-temporal data correspond to multivariate time series that are spatially localized and arise from a multi-source diffusion process. The aim is to investigate RNN approaches (LSTM, GRU, …) that account for both long and short dependencies between time and space variables in a tractable and scalable way. Finally, this internship will be followed by both a funded thesis and a post-doc.

Representation Learning for Textual and Temporal Data Alignment

Funding: FUI-SSC Project
Supervisors: Eric Gaussier, Hamid Mirisaee, Parantapa Goswami
Partners: Coservit
Description: Information retrieval (IR) and recommendation systems have been of interest to many researchers and companies over the past decade. Classical IR and recommendation systems operate on data of only one type, i.e. data belonging to the same feature space. These systems lack a personalization mechanism that can ‘understand’ the query or reflect the information needs of a user at a particular instant in time and return customized results. In the age of the Internet, the online activities of users have increased substantially. This presents the opportunity to exploit additional data about the user in order to predict a user’s profile or context and to use this profile for personalized retrieval or recommendations. Often, these additional data are heterogeneous in nature, for example quantitative geospatial location data, browsed images or viewed videos. Moreover, the data evolve over time in most cases, which in turn highlights the importance of time series analysis. As a result, to design a better retrieval and recommendation system, one needs to exploit all these heterogeneous contextual pieces of information. In this internship, we will study contextual IR using heterogeneous data in the context of smart support centers (SSC). SSCs aim to solve the clients’ issues with the computational system in an automatic manner. The hardware components of the devices in the system (e.g. CPU, RAM, disk, etc.) have sensors within them. These sensors capture different quantitative and qualitative values about system usage (e.g. disk usage, RAM usage) at regular intervals, which generates multivariate structured temporal monitoring data. Additionally, tickets are generated from the users’ complaints, detailing the problem and the time of reporting. The solutions of previously resolved tickets are also archived. These are semi-structured textual data.
All these are very large-scale data supplied by industrial partners. Typical support center operations can be described as a contextual IR problem where temporal system monitoring data is used as contextual information to retrieve and recommend textual solutions in response to the user tickets. The difficulty is that the ticket generation time may not correspond to the actual time of occurrence of the problem, as users may notice the problem with some delay, or they may report it with a delay. So, there could be an unknown delay between the unusual behavior of the device’s data (i.e. the monitoring time series data) and the ticket issue time. Without the actual alignment of tickets with the relevant portions of monitoring data, it is not possible to exploit the monitoring data as contextual information. This internship aims to address the challenge of aligning textual information with temporal monitoring data where the delay between them is unknown. A similar study has been done on temporal stock price data and textual stock market news; however, in that case, the delay between the news and the actual change is more or less bounded, as the news is published within a short time after a major change in stock prices, while in our case the delay is totally unbounded. Through this internship you will have the opportunity to:
1. learn about the various concepts involved in the state of the art of the topic (such as text representation methods and clustering of temporal data), and implement the theories to see them in action in the context of SSC data;
2. explore different representation learning methods, including deep learning, to represent heterogeneous data (quantitative and textual in this case);
3. find a solution for aligning the tickets to the proper point/period of the time series.
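
As a point of reference for the alignment problem, a naive baseline could look like the sketch below. It is an assumption-laden illustration rather than the approach of the internship: it handles a single univariate series, scores anomalies with a global z-score, and relies on a hypothetical search horizon `max_lag`, whereas the description stresses that the real delay is unbounded.

```python
from statistics import mean, pstdev


def most_anomalous_index(series, ticket_idx, max_lag):
    """Return the index in [ticket_idx - max_lag, ticket_idx] whose value
    deviates most from the series mean, in units of standard deviation.

    `series` is a list of numeric readings, `ticket_idx` the position of
    the reading closest to the ticket's reporting time, and `max_lag` a
    hypothetical bound on how far back to search.
    """
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        # A constant series carries no anomaly signal; fall back to the ticket time.
        return ticket_idx
    start = max(0, ticket_idx - max_lag)
    window = range(start, ticket_idx + 1)
    return max(window, key=lambda i: abs(series[i] - mu) / sigma)


def estimated_delay(series, ticket_idx, max_lag):
    """Number of time steps between the detected anomaly and the ticket."""
    return ticket_idx - most_anomalous_index(series, ticket_idx, max_lag)
```

Such a baseline ignores the ticket text entirely, which is precisely the information that learned joint representations of textual and temporal data are meant to exploit.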

PhD Positions

PhD position in Data Analysis and Machine Learning for spatio-temporal data

Title: Learning representations for classification, cokriging and forecasting spatio-temporal data
Duration: 3 years
Starting date: September 2016
Funding: Gross salary about 2000 euros/month (LOCUST ANR Project)
Partners: MLIA-LIP6 UPMC, DEEZER
Supervisors: Ahlame Douzal, Eric Gaussier
Application procedure and other details here.

PhD position in Machine Learning algorithms for learning representations from temporal data

Title: Learning representations from multivariate temporal data
Duration: 3 years
Starting date: September 2015
Funding: Gross salary about 2000 euros/month, Projet Investissement d’Avenir (partners: CS, AIRBUS and EDF-R&D)
Supervisors: Ahlame Douzal, Eric Gaussier
Application procedure and other details here.