Franck Dernoncourt's home page

franck.dernoncourt@gmail.com
Skype: franck.dernoncourt
San Jose, CA, USA

Publications

0	De-identification of Patient Notes with Recurrent Neural Networks	2016
1	Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks	2016
2	Robust Dialog State Tracking for Large Ontologies	2016
3	Adobe-MIT submission to the DSTC 4 Spoken Language Understanding pilot task	2016
4	BeatDB: an end-to-end approach to unveil saliencies from massive signal data sets	2015
5	Gaussian Process-based Feature Selection for Wavelet Parameters: Predicting Acute Hypotensive Episodes from Physiological Signals	2015
6	TrackMania is intractable	2014
7	beatDB: A Large Scale Waveform Feature Repository	2013
8	MoocViz: A Large Scale, Open Access, Collaborative, Data Analytics Platform for MOOCs	2013
9	MOOC En Images	2013
10	MOOCdb: Developing Standards and Systems for MOOC Data Science (MIT Technical Report)	2013
11	MOOCdb: Developing Data Standards for MOOC Data Science (MOOCshop paper)	2013
12	Machine Learning Algorithms for In-Database Analytics	2013
13	Efficient training set use for blood pressure prediction in a large scale learning classifier system	2013
14	Imprecise selection and fitness approximation in a large-scale evolutionary rule based system for blood pressure prediction	2013
15	Replacing the computer mouse	2012
16	Artificial Intelligence: why should firms care?	2012
17	Of the use of natural dialogue to hide MCQs in serious games	2012
18	Designing an intelligent dialogue system for serious games	2012
19	The medial Reticular Formation (mRF): a neural substrate for action selection? An evaluation via evolutionary computation.	2011
20	Fuzzy logic: introducing human reasoning within decision support systems?	2011
21	Fuzzy logic: between human reasoning and artificial intelligence	2011
22	Presentation on the Motion-Induced Blindness (MIB) phenomenon	2011
23	Presentation on the paper Automated Variable Weighting in k-Means Type Clustering	2010
24	Prediction of the water inflow to a lake	2010

De-identification of Patient Notes with Recurrent Neural Networks

Franck Dernoncourt*, Ji Young Lee*, Ozlem Uzuner, Peter Szolovits. "De-identification of Patient Notes with Recurrent Neural Networks". (* indicates equal contribution)

Objective: Patient notes in electronic health records (EHRs) may contain critical information for medical investigations. However, the vast majority of medical investigators can only access de-identified notes, in order to protect the confidentiality of patients. In the United States, the Health Insurance Portability and Accountability Act (HIPAA) defines 18 types of protected health information (PHI) that needs to be removed to de-identify patient notes. Manual de-identification is impractical given the size of EHR databases, the limited number of researchers with access to the non-de-identified notes, and the frequent mistakes of human annotators. A reliable automated de-identification system would consequently be of high value.

Materials and Methods: We introduce the first de-identification system based on artificial neural networks (ANNs), which requires no handcrafted features or rules, unlike existing systems. We compare the performance of the system with state-of-the-art systems on two datasets: the i2b2 2014 de-identification challenge dataset, which is the largest publicly available de-identification dataset, and the MIMIC de-identification dataset, which we assembled and is twice as large as the i2b2 2014 dataset.

Results: Our ANN model outperforms the state-of-the-art systems. It yields an F1-score of 97.85 on the i2b2 2014 dataset, with a recall of 97.38 and a precision of 98.32, and an F1-score of 99.23 on the MIMIC de-identification dataset, with a recall of 99.25 and a precision of 99.21.

Conclusion: Our findings support the use of ANNs for de-identification of patient notes, as they show better performance than previously published systems while requiring no feature engineering.

Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks

Ji Young Lee*, Franck Dernoncourt*. "Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks". NAACL 2016. (* indicates equal contribution)

Abstract: Recent approaches based on artificial neural networks (ANNs) have shown promising results for short-text classification. However, many short texts occur in sequences (e.g., sentences in a document or utterances in a dialog), and most existing ANN-based systems do not leverage the preceding short texts when classifying a subsequent one. In this work, we present a model based on recurrent neural networks and convolutional neural networks that incorporates the preceding short texts. Our model achieves state-of-the-art results on three different datasets for dialog act prediction.

Robust Dialog State Tracking for Large Ontologies

Franck Dernoncourt, Ji Young Lee, Trung H. Bui, and Hung H. Bui. "Robust Dialog State Tracking for Large Ontologies". International Workshop on Spoken Dialogue Systems. 2016.

Abstract: The Dialog State Tracking Challenge 4 (DSTC 4) differentiates itself from the previous three editions as follows: the number of slot-value pairs present in the ontology is much larger, no spoken language understanding output is given, and utterances are labeled at the subdialog level. This paper describes a novel dialog state tracking method designed to work robustly under these conditions, using elaborate string matching, coreference resolution tailored for dialogs and a few other improvements. The method can correctly identify many values that are not explicitly present in the utterance. On the final evaluation, our method came in first among 7 competing teams and 24 entries. The F1-score achieved by our method was 9 and 7 percentage points higher than that of the runner-up for the utterance-level evaluation and for the subdialog-level evaluation, respectively.

Adobe-MIT submission to the DSTC 4 Spoken Language Understanding pilot task

Franck Dernoncourt, Ji Young Lee, Trung H. Bui, and Hung H. Bui. "Adobe-MIT submission to the DSTC 4 Spoken Language Understanding pilot task". International Workshop on Spoken Dialogue Systems. 2016.

Abstract: The Dialog State Tracking Challenge 4 (DSTC 4) proposes several pilot tasks. In this paper, we focus on the spoken language understanding pilot task, which consists of tagging a given utterance with speech acts and semantic slots. We compare different classifiers: the best system obtains 0.52 and 0.67 F1-scores on the test set for speech act recognition for the tourist and the guide respectively, and 0.52 F1-score for semantic tagging for both the guide and the tourist.

BeatDB: an end-to-end approach to unveil saliencies from massive signal data sets

Franck Dernoncourt. "BeatDB: an end-to-end approach to unveil saliencies from massive signal data sets." Master's Thesis, Massachusetts Institute of Technology, 2015.

Abstract: Prediction studies on physiological signals are time-consuming: a typical study, even with a modest number of patients, usually takes from 6 to 12 months. In response we design a large-scale machine learning and analytics framework, BeatDB, to scale and speed up mining knowledge from waveforms. BeatDB radically shrinks the time an investigation takes by:

supporting fast, flexible investigations by offering a multi-level parameterization, allowing the user to define the condition to predict, the features, and many other investigation parameters,
precomputing beat-level features that are likely to be frequently used, while computing less frequently used features and statistical aggregates on the fly.

In this thesis, we present BeatDB and demonstrate how it supports flexible investigations on the entire set of arterial blood pressure data in the MIMIC II Waveform Database, which contains over 5000 patients and 1 billion blood pressure beats. We focus on the usefulness of wavelets as features in the context of blood pressure prediction and use Gaussian processes to accelerate the search for the feature yielding the highest AUROC.

Gaussian Process-based Feature Selection for Wavelet Parameters: Predicting Acute Hypotensive Episodes from Physiological Signals

Franck Dernoncourt, Kalyan Veeramachaneni, and Una-May O'Reilly. "Gaussian Process-Based Feature Selection for Wavelet Parameters: Predicting Acute Hypotensive Episodes from Physiological Signals." In Computer-Based Medical Systems (CBMS), 2015 IEEE 28th International Symposium on, pp. 145-150. IEEE, 2015.

Abstract: Physiological signals such as blood pressure might contain key information to predict a medical condition, but are challenging to mine. Wavelets possess the ability to unveil location-specific features within signals but there exists no principled method to choose the optimal scales and time shifts. We present a scalable, robust system to find the best wavelet parameters using Gaussian processes (GPs). We demonstrate our system by assessing wavelets as predictors for the occurrence of acute hypotensive episodes (AHEs) using over 1 billion blood pressure beats. We obtain an AUROC of 0.79 with wavelet features only, and the false positive rate when the true positive rate is fixed at 0.90 is reduced by 15% when the wavelet feature is used in conjunction with other statistical features. Furthermore, the use of GPs reduces the selection effort by a factor of 3 compared with a naive grid search.
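
As a reading aid for the two metrics quoted in this abstract, here is a minimal sketch in plain Python of how an AUROC and a false positive rate at a fixed true positive rate can be computed from classifier scores. The function names and the toy scores are illustrative assumptions; this is not the paper's code and does not reproduce its wavelet features or Gaussian process search.

```python
def split_scores(scores, labels):
    """Separate scores of positive (label 1) and negative (label 0) examples."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return pos, neg


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos, neg = split_scores(scores, labels)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fpr_at_tpr(scores, labels, target_tpr=0.90):
    """Smallest false positive rate among thresholds reaching at least target_tpr."""
    pos, neg = split_scores(scores, labels)
    best = 1.0  # predicting "positive" for everything always reaches TPR = 1
    for threshold in sorted(set(scores)):
        tpr = sum(p >= threshold for p in pos) / len(pos)
        fpr = sum(n >= threshold for n in neg) / len(neg)
        if tpr >= target_tpr:
            best = min(best, fpr)
    return best


# Toy scores (made up for illustration): higher means "episode more likely".
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
toy_auroc = auroc(scores, labels)
toy_fpr = fpr_at_tpr(scores, labels, target_tpr=0.90)
```

On this toy data, 14 of the 16 positive/negative pairs are ranked correctly, so the AUROC is 0.875; all positives are only recovered at a threshold of 0.4 or lower, where one of the four negatives is also flagged, so the false positive rate at a true positive rate of 0.90 is 0.25.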

TrackMania is intractable

Franck Dernoncourt. "TrackMania is NP-complete." arXiv preprint arXiv:1411.5765 (2014).

Abstract: We prove that completing an untimed, unbounded track in TrackMania Nations Forever is NP-complete by using a reduction from 3-SAT and showing that a solution can be checked in polynomial time.
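
The "checked in polynomial time" half of such a proof can be illustrated on the source problem, 3-SAT. The sketch below is only an assumed, minimal certificate checker for 3-SAT formulas (literals are encoded as signed integers, a convention chosen here for illustration); it is not the TrackMania reduction itself.

```python
def satisfies(clauses, assignment):
    """Check a truth assignment against a CNF formula in time linear in its size.

    clauses: list of tuples of nonzero ints; k means variable k, -k its negation.
    assignment: dict mapping each variable number to True or False.
    """
    return all(
        any(assignment[abs(literal)] == (literal > 0) for literal in clause)
        for clause in clauses
    )


# (x1 or not x2 or x3) and (not x1 or x2 or x3)
formula = [(1, -2, 3), (-1, 2, 3)]
good = satisfies(formula, {1: True, 2: True, 3: False})
bad = satisfies(formula, {1: True, 2: False, 3: False})
```

Here `good` is True and `bad` is False: the second assignment leaves every literal of the second clause false. Checking a candidate solution is cheap; finding one is the hard part.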

beatDB: A Large Scale Waveform Feature Repository

Franck Dernoncourt, Kalyan Veeramachaneni, and Una-May O’Reilly. "beatDB: A Large Scale Waveform Feature Repository." In NIPS Workshop on Machine Learning for Clinical Data Analysis and Healthcare. 2013.

Abstract: A great majority of the effort is spent assembling the data and formulating the features, while, rather ironically, the model building exercise takes relatively less time. beatDB aims at radically shrinking the time of large scale investigations by judiciously pre-computing beat features which are likely to be frequently used. In this poster we present the structure of beatDB and use it for a concrete research study: predicting acute hypotensive events from blood pressure.

MoocViz: A Large Scale, Open Access, Collaborative, Data Analytics Platform for MOOCs

Franck Dernoncourt, Colin Taylor, Una-May O’Reilly, Kalyan Veeramachaneni, Sherwin Wu, Chuong Do, and Sherif Halawa. "MoocViz: A large scale, open access, collaborative, data analytics platform for MOOCs." In NIPS Workshop on Data-Driven Education, Lake Tahoe, Nevada. 2013.

Abstract: In this paper we present an open access large scale analytics platform that helps researchers analyze MOOC data from multiple platforms without the need to share the data. It allows researchers to share scripts and effort, compare results, and attempts to engage the community to achieve shared educational science goals. The platform utilizes some well-known tools and packages and provides multiple levels of access to address a wide variety of needs around the data. We demonstrate the platform's capability by analyzing data from two MOOCs, one from Coursera (offered by Stanford University) and one from edX (offered by MITx). This is the first time two courses from two platforms have been jointly analyzed. The analysis and the platform are made possible by the joint adoption of a data model called MOOCdb.

MOOC En Images

Franck Dernoncourt, Colin Taylor, Kalyan Veeramachaneni, Una-May O’Reilly. "MOOC En Images 6.002x". MIT Tech Report. 2013.

Abstract: This report provides a view into different descriptive statistics extracted from the data recorded during 6.002x, the first course offered by MITx. We have developed a generalizable analytics framework, and this report demonstrates the use of this framework. This is a working document, and we are expanding its scope as we add analytical tools and interfaces to our framework.

MOOCdb: Developing Standards and Systems for MOOC Data Science (MIT Technical Report)

Franck Dernoncourt, Chuong Do, Sherif Halawa, Una-May O'Reilly, Colin Taylor, and Kalyan Veeramachaneni. "MOOCdb: Developing Standards and Systems to Support MOOC Data Science." MIT Tech Report. 2014.

Abstract: The intent of this document is to enable the development of data standards for MOOCs and to build enabling technology. This document will be updated from time to time with feedback from the community as well as from our internal development process.

MOOCdb: Developing Data Standards for MOOC Data Science (MOOCshop paper)

Kalyan Veeramachaneni, Franck Dernoncourt, Colin Taylor, Zachary A. Pardos, Una-May O'Reilly. "MOOCdb: Developing Data Standards for MOOC Data Science." MOOCShop at Artificial Intelligence in Education, 2013.

Abstract: The intent of this article is to propose data standards for MOOCs. Our team has been conducting research related to mining information, building models, and interpreting data from the inaugural course offered by edX, 6.002x: Circuits and Electronics, since the Fall of 2012. This involves a set of steps, undertaken in most data science studies, which entails positing a hypothesis, assembling data and features (aka properties, covariates, explanatory variables, decision variables), identifying response variables, building a statistical model then validating, inspecting and interpreting the model. In our domain, and others like it that require behavioral analyses of an online setting, a great majority of the effort (in our case approximately 70%) is spent assembling the data and formulating the features, while, rather ironically, the model building exercise takes relatively less time. As we advance to analyzing cross-course data, it has become apparent that our algorithms which deal with data assembly and feature engineering lack cross-course generality. This is not a fault of our software design. The lack of generality reflects the diverse ad hoc data schemas we have adopted for each course. These schemas partially result because some of the courses are being offered for the first time and it is the first time behavioral data has been collected. As well, they arise from initial investigations taking a local perspective on each course rather than a global one extending across multiple courses.

Machine Learning Algorithms for In-Database Analytics

Abstract: Our project focused on extending the functionality of MADlib. MADlib is an open source machine learning and statistics library which works with Postgres or Greenplum to provide in-database analytics. Although some machine learning algorithms have been implemented in MADlib, there is room for additional contributions. We have implemented two different machine learning algorithms, symbolic regression with genetic programming and adaptive boosting for MADlib, and are in the process of contributing our code to the MADlib community codebase. We have also assessed the performance of our implementations and compared their performance with the same algorithms outside MADlib.

Efficient training set use for blood pressure prediction in a large scale learning classifier system

Erik Hemberg, Kalyan Veeramachaneni, Franck Dernoncourt, Mark Wagy, and Una-May O'Reilly. "Efficient training set use for blood pressure prediction in a large scale learning classifier system." In Sixteenth international workshop on learning classifier systems. In Proceedings of the 15th annual conference companion on Genetic and evolutionary computation, pp. 1267-1274. ACM, 2013.

Abstract: We define a machine learning problem to forecast arterial blood pressure. Our goal is to solve this problem with a large scale learning classifier system. Because learning classifier systems are extremely computationally intensive and this problem's eventually large training set will be very costly to execute, we address how to use less of the training set while not negatively impacting learning accuracy. Our approach is to allow competition among solutions which have not been evaluated on the entire training set. The best of these solutions are then evaluated on more of the training set while their offspring start off being evaluated on less of the training set. To keep selection fair, we divide competing solutions according to how many training examples they have been tested on.
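
The idea of letting solutions compete before they have seen the whole training set can be sketched as staged selection. The code below is a deliberately simplified assumption (fixed stages, no offspring, toy threshold rules, hypothetical names such as `staged_selection`), not the learning classifier system described in the paper; it only shows that candidates are compared with others evaluated on the same number of examples, and that only survivors pay for larger evaluations.

```python
def accuracy(rule, examples):
    """Fraction of (x, y) examples that the rule labels correctly."""
    return sum(rule(x) == y for x, y in examples) / len(examples)


def staged_selection(rules, training_set, stage_sizes, keep=0.5):
    """Return names of rules surviving evaluation on growing training prefixes.

    rules: dict mapping a name to a callable x -> bool.
    stage_sizes: increasing numbers of training examples used at each stage.
    keep: fraction of candidates promoted to the next, larger stage.
    """
    survivors = list(rules)
    for size in stage_sizes:
        subset = training_set[:size]
        # Every survivor is scored on the same subset, so comparisons are fair.
        ranked = sorted(
            survivors, key=lambda name: accuracy(rules[name], subset), reverse=True
        )
        survivors = ranked[: max(1, int(len(ranked) * keep))]
    return survivors


# Toy problem (made up): label is True when x >= 50; x values are spread out.
training_set = [(x, x >= 50) for x in ((i * 37) % 100 for i in range(100))]
rules = {t: (lambda x, t=t: x >= t) for t in (10, 30, 50, 70)}
best = staged_selection(rules, training_set, stage_sizes=(10, 100))
```

In this toy run, four rules are scored on 10 examples and the two survivors on 100, for 240 rule evaluations instead of the 400 needed to score every rule on the full set; the rule with threshold 50 is the one that remains.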

Imprecise selection and fitness approximation in a large-scale evolutionary rule based system for blood pressure prediction

Erik Hemberg, Kalyan Veeramachaneni, Franck Dernoncourt, Mark Wagy, and Una-May O'Reilly. "Imprecise selection and fitness approximation in a large-scale evolutionary rule based system for blood pressure prediction." In Proceedings of the 15th annual conference companion on Genetic and evolutionary computation, pp. 153-154. ACM, 2013.

Abstract: We present how we have strategically allocated fitness evaluations in a large-scale rule based evolutionary system called ECStar. We describe a strategy that culls potentially weaker solutions early; the surviving solutions later compete only with solutions that have undergone an equivalent number of fitness evaluations, as they are evaluated on more fitness cases. Despite incurring some imprecision in fitness comparison, which arises from not evaluating on all the fitness cases or even the same ones, the strategy allows our system to make effective progress when the resources at its disposal are unpredictably available.

Replacing the computer mouse

Franck Dernoncourt. "Replacing the computer mouse." arXiv preprint arXiv:1410.5907 (2014).

Abstract: In a few months the computer mouse will be half a century old. It is known to have many drawbacks, the main ones being: loss of productivity due to constant switching between keyboard and mouse, health issues such as RSI, the medical impossibility of using a mouse (e.g., with a broken or amputated arm), and an unnatural human-computer interface, like the keyboard. However, almost everybody still uses a computer mouse nowadays.

In this short article, we explore computer mouse alternatives. Our research shows that moving the mouse cursor can be done efficiently with the SmartNav device and mouse clicks can be emulated in many complementary ways. We believe that computer users can increase their productivity and their health by using those alternatives. There are a few exceptions such as advanced users of graphics editing programs or FPS gamers, who will still be more efficient using a computer mouse.

This article is intentionally short and not overly technical: our main motivation is to make readers aware of these solutions and their efficiency. Details can be found in the appendices and by following the URLs and references. The primary intended readers are computer scientists, people with RSI, physicians and interface pioneers. Feedback is highly welcome: this is work in progress, so feel free to e-mail the main author at francky@mit.edu

Artificial Intelligence: why should firms care?

Talk given on May 30th, 2012 at the Swedish Chamber of Commerce in Paris. As I was reading an article about IBM Watson, a small sentence drew my attention: "Eighty or 90 per cent of these requests don't need Watson anyway, technology already exists for what they need.". This epitomizes the growing need for the business world to catch up with artificial intelligence's latest developments. What is AI? What is the state of the art? Why should I care? i.e. what can AI bring to the business world? From law to finance, any field will be reshaped in the long term by AI.

Replace CS by AI

Of the use of natural dialogue to hide MCQs in serious games

Franck Dernoncourt. "De l'utilisation du dialogue naturel pour masquer les QCM au sein des jeux sérieux." RECITAL. 2012. Translation: Of the use of natural dialogue to hide MCQs in serious games.

Abstract: A major weakness of serious games at the moment is that they often incorporate multiple choice questionnaires (MCQs). However, no study has demonstrated that MCQs can accurately assess the level of understanding of a learner. On the contrary, some studies have experimentally shown that allowing the learner to input a free-text answer in the program instead of just selecting one answer in an MCQ allows a much finer evaluation of the learner's skills. We therefore propose to design a conversational agent that can understand statements in natural language within a narrow semantic context corresponding to the area of competence on which we assess the learner. This feature is intended to allow a natural dialogue with the learner, especially in the context of serious games. Such interaction in natural language aims to hide the underlying MCQs. This paper presents our approach.

Designing an intelligent dialogue system for serious games

Franck Dernoncourt. "Conception d’un système de dialogue intelligent pour jeux sérieux." RJC EIAH. 2012. Translation: Designing an intelligent dialogue system for serious games.

Abstract of the original paper: The objective of our work is to design a conversational agent (chatterbot) capable of understanding natural language statements in a restricted semantic domain. This feature is intended to allow a natural dialogue with a learner, especially in the context of serious games. This conversational agent will be tested in a serious game for staff training, in which it simulates a client. It does not address natural language understanding in full generality since, firstly, the semantic domain of a game is generally well defined and, secondly, we restrict the types of sentences found in the dialogue.

The medial Reticular Formation (mRF): a neural substrate for action selection? An evaluation via evolutionary computation.

Franck Dernoncourt. "La formation reticulée médiane: un substrat pour la sélection de l'action? Modélisation via réseaux de neurones et algorithmes évolutionnistes." Master's Thesis. École Normale Supérieure Ulm. 2011. Translation: The medial Reticular Formation: a neural substrate for action selection? An evaluation via evolutionary computation.

The medial Reticular Formation (mRF) is located in the brainstem: it receives many sensory inputs and it can control motor actions through its projections on the spinal cord and cranial nerves. The mRF is phylogenetically one of the oldest neural structures of the brainstem, the latter being regarded as one of the oldest centers of the central nervous system. Consequently, it seems to be a low-level system for action selection.

The first model of the mRF was proposed by Kilmer and McCulloch in 1969, who already suggested that the mRF could be a "mode selector". Humphries et al. (2005) tested the efficiency of this model in the minimal survival task defined in Girard et al. (2003). It performed poorly, but another version of it that included artificially evolved weights performed quite respectably. As a result, Humphries proposed a second model of the mRF, based on neural network formalism and taking into account new anatomical data. Nevertheless, it showed poor performance in the minimal survival task and turned out not to be very plausible anatomically.

In this Master's Thesis, we propose a new model of the mRF:

constrained by anatomical information about its structure,
constructed based on neural networks generated by artificial evolution,
assessed on tasks of action selection.

The model we obtain successfully manages the tasks of selection, indicating that the mRF can be used as an action selection system. We also demonstrate an anatomical property of the mRF, which coupled with the results of the paper Humphries et al. (2006) shows that it is very likely that the mRF network has a small-world structure.

This project was funded by the ANR (ANR-09-EMER-005-01. ANR = French National Agency for Research) in the project EvoNeuro.

Fuzzy logic: introducing human reasoning within decision support systems?

Franck Dernoncourt. "La Logique Floue: le raisonnement humain au cœur du système décisionnel ?" Master's Thesis. Conservatoire national des arts et métiers. 2011. Translation: Fuzzy logic: introducing human reasoning within decision-making systems?

Fuzzy logic is based on solid mathematical foundations, including the mathematical theory of fuzzy sets, generalizing classical set theory. Firstly, we define fuzzy operators, which generalize operators of classical logic.
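
To make "generalize" concrete, here is a minimal sketch using Zadeh's min, max and complement operators. This is one common choice of fuzzy operators, assumed here for illustration; the thesis may define others. Truth values are degrees in [0, 1], and restricting them to 0 and 1 gives back the classical truth tables.

```python
def fuzzy_and(a, b):
    """Zadeh conjunction: the degree to which both conditions hold."""
    return min(a, b)


def fuzzy_or(a, b):
    """Zadeh disjunction: the degree to which at least one condition holds."""
    return max(a, b)


def fuzzy_not(a):
    """Standard complement of a degree in [0, 1]."""
    return 1.0 - a


# On crisp values (0.0 or 1.0) the operators coincide with Boolean logic.
crisp_ok = all(
    fuzzy_and(a, b) == float(bool(a) and bool(b))
    and fuzzy_or(a, b) == float(bool(a) or bool(b))
    and fuzzy_not(a) == float(not bool(a))
    for a in (0.0, 1.0)
    for b in (0.0, 1.0)
)

# On intermediate degrees they return intermediate degrees.
both = fuzzy_and(0.75, 0.5)    # 0.5
either = fuzzy_or(0.75, 0.5)   # 0.75
negated = fuzzy_not(0.25)      # 0.75
```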

As a second step, we see how fuzzy logic can imitate human reasoning. We analyze the contribution of fuzzy logic for the modeling of human reasoning, and also experimentally investigate whether the decisions taken by humans correspond to decisions taken by fuzzy systems. To this end, given that the literature is deficient on this point, we design an experiment for that purpose and analyze the results.

We study the potential applications for databases and decision support systems in Chapter 5. How can the advantages of fuzzy logic be integrated into databases? To what extent can decision-making systems use the flexibility of fuzzy logic?

We show that decision support systems, which sit at the heart of the company and bring together all the relevant information from the operational databases, could benefit greatly from fuzzy logic: it provides the keys to human reasoning and thereby allows decision-making to be refined.

Database theorists know what fuzzy logic could bring them in terms of information modeling: more intuitive and more powerful queries on the one hand, and data more consistent with reality on the other. Many papers have been written, but few significant achievements have followed. The lack of consensus on a standard is probably the main reason.

Fuzzy logic: between human reasoning and artificial intelligence

Fuzzy logic is an extension of Boolean logic introduced by Lotfi Zadeh in 1965, based on the mathematical theory of fuzzy sets, which is a generalization of classical set theory. By introducing the notion of degree in the verification of a condition, thus allowing a condition to be in a state other than true or false, fuzzy logic gives reasoning a very valuable flexibility, which makes it possible to take inaccuracies and uncertainties into account. One of the advantages of fuzzy logic for formalizing human reasoning is that the rules are stated in natural language.
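
The notion of a condition holding "to a degree" can be illustrated with a membership function. The sketch below assumes a triangular shape and a made-up linguistic term ("the temperature is warm", peaking at 20 °C); both are illustrative choices, not taken from the report.

```python
def triangular(x, left, peak, right):
    """Degree in [0, 1] to which x belongs to a triangular fuzzy set.

    The degree is 0 outside (left, right), 1 at the peak, and linear in between.
    """
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def warm(temperature_celsius):
    """Hypothetical linguistic term: 'the temperature is warm'."""
    return triangular(temperature_celsius, left=10.0, peak=20.0, right=30.0)


degrees = [warm(t) for t in (5.0, 15.0, 20.0, 25.0, 35.0)]
```

For these five temperatures the degrees are 0.0, 0.5, 1.0, 0.5 and 0.0: the condition "warm" is neither simply true nor simply false at 15 °C or 25 °C, which is the flexibility described above.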

In this report, we:

introduce the basic concepts of fuzzy logic,
propose some arguments which support the view that fuzzy logic can model human reasoning better than standard logic and probability theory,
conduct a psychological experiment on humans to see if their way of thinking can be reflected by fuzzy logic.

We show that fuzzy logic can explain many experiments that had undermined traditional models of human reasoning in the 20th century. We show how the non-additivity of probability judgments can be expressed in a fuzzy system. We then confront fuzzy logic with some paradoxes of classical logic when it tries to model human reasoning: the sorites paradox is typically the kind of threshold problem that fuzzy logic reduces and the paradox of entailment does not pose a problem in fuzzy logic. It would be interesting to further explore Hempel's paradox and especially how we could express it in a neuro-fuzzy system. Similarly, the Wason selection task would require further analysis, this time by focusing on fuzzy modus ponens and modus tollens.

Thus fuzzy logic appears as a powerful theoretical framework for studying human reasoning. Surprisingly, we find only one study comparing the decisions made by humans with those of a fuzzy system, whose purpose was essentially to design a decision support system for medical personnel, not to analyze human reasoning as such. We conduct our own experiment and investigate whether a fuzzy system could mimic the results observed in humans. For this purpose, we use a technique for optimizing a fuzzy system using neural networks (neuro-fuzzy), through which we obtain good results, although the correlation between the two input criteria is high: a fuzzy system gives results closer to experimental values than those obtained by a polynomial system. This result reinforces the hypothesis that fuzzy logic can be used to explain decisions from human reasoning.

Presentation on the Motion-Induced Blindness (MIB) phenomenon

The visual system has a number of 'bugs', some of which we call illusions. Motion-induced blindness (MIB) belongs to a very interesting class of illusions in which objects in plain sight just disappear from phenomenal perception. Other classical examples of disappearance illusions are:

Binocular rivalry, in which two very different objects are presented to the two eyes, and at any given moment one of the objects--or most of it--remains invisible,
Backward masking, in which a stimulus is 'erased' from perception by a second stimulus, called a "mask", presented a brief time later,
Troxler fading, in which a low-contrast object may fade from visual perception after some time.

In addition, a number of neurological conditions usually involving lesions in parietal cortex, such as hemineglect and extinction, lead to cases in which objects in plain view are not seen, or not noticed. For a good review of these phenomena, see article "Psychophysical magic" by Kim and Blake (2005).

Motion-induced blindness (MIB) is a recently discovered and quite spectacular example of a disappearance illusion. The stimulus consists of a field of small objects, moving in a coherent way (either a 2D or 3D rotation, for example). Superimposed on this moving field is a number of high-contrast stationary objects. When most observers fixate a stationary point in this stimulus (such as one of the high-contrast objects, or a fixation point), after several seconds one or more of the stationary objects just disappear.

Switch to full screen and fixate on the white point in the center. After a few seconds, you will notice that the yellow point seems to disappear.

Presentation on the paper Automated Variable Weighting in k-Means Type Clustering

Abstract of the original paper: This paper proposes a k-means type clustering algorithm that can automatically calculate variable weights. A new step is introduced to the k-means clustering process to iteratively update variable weights based on the current partition of data and a formula for weight calculation is proposed. The convergency theorem of the new clustering process is given. The variable weights produced by the algorithm measure the importance of variables in clustering and can be used in variable selection in data mining applications where large and complex real data are often involved. Experimental results on both synthetic and real data have shown that the new algorithm outperformed the standard k-means type algorithms in recovering clusters in data.
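
The extra step, recomputing variable weights from the current partition, can be sketched as follows. The update rule used here (each weight is inversely related to the within-cluster dispersion of its variable, with an exponent controlled by a parameter beta > 1) is an assumption reconstructed from how this family of algorithms is usually stated; it should be checked against the paper before being relied upon. All names are illustrative.

```python
def update_weights(points, centers, assignment, beta=2.0):
    """Recompute one weight per variable from the current clustering.

    points: list of equal-length tuples; centers: list of cluster centers;
    assignment: assignment[i] is the index of the cluster of points[i].
    Assumed rule: w_j = 1 / sum_t (D_j / D_t) ** (1 / (beta - 1)), where D_j is
    the summed squared distance to the assigned center along variable j.
    Assumed convention: a variable with D_j == 0 gets weight 0 and is skipped.
    """
    n_vars = len(points[0])
    dispersion = [
        sum((p[j] - centers[assignment[i]][j]) ** 2 for i, p in enumerate(points))
        for j in range(n_vars)
    ]
    exponent = 1.0 / (beta - 1.0)
    weights = []
    for j in range(n_vars):
        if dispersion[j] == 0:
            weights.append(0.0)
            continue
        total = sum(
            (dispersion[j] / dispersion[t]) ** exponent
            for t in range(n_vars)
            if dispersion[t] != 0
        )
        weights.append(1.0 / total)
    return weights


# Toy data (made up): variable 0 separates the two clusters, variable 1 is noise.
points = [(1.0, 0.0), (3.0, 8.0), (11.0, 0.0), (13.0, 8.0)]
centers = [(2.0, 4.0), (12.0, 4.0)]
assignment = [0, 0, 1, 1]
weights = update_weights(points, centers, assignment, beta=2.0)
```

With these toy values the dispersions are 4 and 64, so the weights are 16/17 and 1/17: the variable along which clusters are compact receives almost all the weight, and the weights sum to 1.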

Prediction of the water inflow to a lake

The purpose of this project is to predict the water inflow to a lake, Lac St-Jean, from the history of this inflow, snowmelt, and precipitation in the watershed. All the data for this work have already been collected: our work aims to process, analyze and use these data to build a model that can accurately predict the lake's water inflow.

In the first part, we conduct a preliminary study of the data so as to extract general information. In the second part, we establish a classification of the data to see the main trends. In the third and last part, we build several models to predict and we evaluate them through quality measurements.