Gaël Varoquaux


Researcher at INRIA; Research Fellow at INSERM, 2010-2011; Research Fellow at INRIA, 2008-2010; Software consultant at Enthought, Inc., 2008; Junior specialist in scientific computing at UC Berkeley, 2008; Research Fellow at LENS (European Laboratory for Non-linear Spectroscopy), 2007-2008; PhD student at Institut d'Optique Graduate School, 2005-2008

  • Using Feature Grouping as a Stochastic Regularizer for High-Dimensional Noisy Data

    The use of complex models --with many parameters-- is challenging with high-dimensional small-sample problems: indeed, they face rapid overfitting. Such situations are common when data collection is expensive, as in neuroscience, biology, or geology. Dedicated regularization can be crafted to tame overfit, typically via structured penalties. But rich penalties require mathematical expertise and entail large computational costs. Stochastic regularizers such as dropout are easier to implement: they prevent overfitting by random perturbations. Used inside a stochastic optimizer, they come with little additional cost. We propose a structured stochastic regularization that relies on feature grouping. Using a fast clustering algorithm, we define a family of groups of features that capture feature covariations. We then randomly select these groups inside a stochastic gradient descent loop. This procedure acts as a structured regularizer for high-dimensional correlated data without additional computational cost and it has a denoising effect. We demonstrate the performance of our approach for logistic regression both on a sample-limited face image dataset with varying additive noise and on a typical high-dimensional learning problem, brain image classification.

    07/31/2018 ∙ by Sergul Aydore, et al.

  • Computational and informatics advances for reproducible data analysis in neuroimaging

    The reproducibility of scientific research has become a point of critical concern. We argue that openness and transparency are critical for reproducibility, and we outline an ecosystem for open and transparent science that has emerged within the human neuroimaging community. We discuss the range of open data sharing resources that have been developed for neuroimaging data, and the role of data standards (particularly the Brain Imaging Data Structure) in enabling the automated sharing, processing, and reuse of large neuroimaging datasets. We outline how the open-source Python language has provided the basis for a data science platform that enables reproducible data analysis and visualization. We also discuss how new advances in software engineering, such as containerization, provide the basis for greater reproducibility in data analysis. The emergence of this new ecosystem provides an example for many areas of science that are currently struggling with reproducibility.

    09/24/2018 ∙ by Russell A. Poldrack, et al.

  • Approximate message-passing for convex optimization with non-separable penalties

    We introduce an iterative optimization scheme for convex objectives consisting of a linear loss and a non-separable penalty, based on the expectation-consistent approximation and the vector approximate message-passing (VAMP) algorithm. Specifically, the penalties we approach are convex on a linear transformation of the variable to be determined, a notable example being total variation (TV). We describe the connection between message-passing algorithms -- typically used for approximate inference -- and proximal methods for optimization, and show that our scheme is, as VAMP, similar in nature to the Peaceman-Rachford splitting, with the important difference that stepsizes are set adaptively. Finally, we benchmark the performance of our VAMP-like iteration in problems where TV penalties are useful, namely classification in task fMRI and reconstruction in tomography, and show faster convergence than that of state-of-the-art approaches such as FISTA and ADMM in most settings.

    09/17/2018 ∙ by Andre Manoel, et al.
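
    To make "convex on a linear transformation of the variable" concrete, here is a minimal sketch of a total variation penalty, assuming the simplest 1-D (anisotropic) form. The function names are hypothetical and this is not the authors' VAMP scheme; it only shows why the penalty is non-separable: each coordinate appears in two neighbouring difference terms.

```python
def finite_differences(x):
    """Apply the linear transformation D: (Dx)[i] = x[i+1] - x[i]."""
    return [b - a for a, b in zip(x, x[1:])]


def total_variation(x):
    """1-D total variation: the l1 norm of the finite differences of x.

    The l1 norm is separable in Dx, but not in x itself, because
    x[i] enters both (Dx)[i-1] and (Dx)[i].
    """
    return sum(abs(d) for d in finite_differences(x))


if __name__ == "__main__":
    piecewise_constant = [0, 0, 1, 1, 0]
    print(total_variation(piecewise_constant))
```

    Piecewise-constant signals have few non-zero differences, which is why such a penalty favours them.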

  • Extracting Universal Representations of Cognition across Brain-Imaging Studies

    The size of publicly available data in cognitive neuro-imaging has increased a lot in recent years, thanks to strong research and community efforts. Exploiting this wealth of data demands new methods to turn the heterogeneous cognitive information held in different task-fMRI studies into common, universal cognitive models. In this paper, we pool data from large fMRI repositories to predict psychological conditions from statistical brain maps across different studies and subjects. We leverage advances in deep learning, intermediate representations and multi-task learning to learn universal interpretable low-dimensional representations of brain images, usable for predicting psychological stimuli in all input studies. The method improves decoding performance for 80% of studies, by permitting cognitive information to flow from every study to the others: it notably gives a strong performance boost when decoding studies of small size. The trained low-dimensional representation (task-optimized networks) is interpretable as a set of basis cognitive dimensions relevant to meaningful categories of cognitive stimuli. Our approach opens new ways of extracting information from brain maps, overcoming the low power of typical fMRI studies.

    09/17/2018 ∙ by Arthur Mensch, et al.

  • Encoding high-cardinality string categorical variables

    Statistical analysis usually requires a vector representation of categorical variables, using for instance one-hot encoding. This encoding strategy is not practical when the number of different categories grows, as it creates high-dimensional feature vectors. Additionally, the corresponding entries in the raw data are often represented as strings, which carry additional information not captured by one-hot encoding. Here, we seek low-dimensional vectorial encoding of string categorical variables with high cardinality. Ideally, these should i) be scalable to a very large number of categories, ii) be interpretable to the end user, and iii) facilitate statistical analysis. We introduce two new encoding approaches for string categories: a Gamma-Poisson matrix factorization on character-level substring counts, and a min-hash encoder, based on min-wise random permutations for fast approximation of the Jaccard similarity between strings. Both approaches are scalable and are suitable for streaming settings. Extensive experiments on real and simulated data show that these encoding methods improve prediction performance for real-life supervised-learning problems with high-cardinality string categorical variables and work as well as standard approaches with clean, low-cardinality ones. We recommend the following: i) if scalability is the main concern, the min-hash encoder is the best option as it does not require any fitting to the data; ii) if interpretability is important, the Gamma-Poisson factorization is a good alternative, as it can be interpreted as one-hot encoding, giving each encoding dimension a feature name that summarizes the substrings captured. Both models remove the need for hand-crafting features and data cleaning of string columns in databases and can be used for feature engineering in online autoML settings.

    07/03/2019 ∙ by Patricio Cerda, et al.
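
    The min-hash idea in the abstract above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' encoder: salted BLAKE2 hashes stand in for min-wise random permutations, strings are split into character 3-grams, and all function names are hypothetical.

```python
import hashlib


def char_ngrams(s, n=3):
    """Set of character n-grams of s, padded with one space on each side."""
    padded = f" {s} "
    if len(padded) < n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def salted_hash(token, seed):
    """Deterministic 64-bit hash of a token; each seed mimics one permutation."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_encode(s, n_components=64, n=3):
    """Encode a string as the minimum hash of its n-grams, one per seed.

    No fitting is needed: the encoding depends only on the string itself.
    """
    grams = char_ngrams(s, n)
    return [min(salted_hash(g, seed) for g in grams)
            for seed in range(n_components)]


def estimated_jaccard(enc_a, enc_b):
    """Fraction of matching components, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(enc_a, enc_b)) / len(enc_a)


def exact_jaccard(s, t, n=3):
    """Exact Jaccard similarity between the n-gram sets of two strings."""
    a, b = char_ngrams(s, n), char_ngrams(t, n)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    a, b = "Police Officer", "Police Officer II"
    print("exact:    ", exact_jaccard(a, b))
    print("estimated:", estimated_jaccard(minhash_encode(a), minhash_encode(b)))
```

    Two strings share a component whenever the same n-gram attains the minimum for that seed, which happens with probability equal to their Jaccard similarity; more components give a tighter estimate.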

  • Manifold-regression to predict from MEG/EEG brain signals without source modeling

    Magnetoencephalography and electroencephalography (M/EEG) can reveal neuronal dynamics non-invasively in real-time and are therefore appreciated methods in medicine and neuroscience. Recent advances in modeling brain-behavior relationships have highlighted the effectiveness of Riemannian geometry for summarizing the spatially correlated time-series from M/EEG in terms of their covariance. However, after artefact-suppression, M/EEG data is often rank deficient, which limits the application of Riemannian concepts. In this article, we focus on the task of regression with rank-reduced covariance matrices. We study two Riemannian approaches that vectorize the M/EEG covariance between sensors through projection into a tangent space. The Wasserstein distance readily applies to rank-reduced data but lacks affine-invariance. This can be overcome by finding a common subspace in which the covariance matrices are full rank, enabling the affine-invariant geometric distance. We investigated the implications of these two approaches in synthetic generative models, which allowed us to control estimation bias of a linear model for prediction. We show that Wasserstein and geometric distances allow perfect out-of-sample prediction on the generative models. We then evaluated the methods on real data with regard to their effectiveness in predicting age from M/EEG covariance matrices. The findings suggest that the data-driven Riemannian methods outperform different sensor-space estimators and that they get close to the performance of a biophysics-driven source-localization model that requires MRI acquisitions and tedious data processing. Our study suggests that the proposed Riemannian methods can serve as fundamental building-blocks for automated large-scale analysis of M/EEG.

    06/04/2019 ∙ by David Sabbagh, et al.

  • Learning Neural Representations of Human Cognition across Many fMRI Studies

    Cognitive neuroscience is enjoying a rapid increase in extensive public brain-imaging datasets, which opens the door to large-scale statistical models. Finding a unified perspective for all available data calls for scalable and automated solutions to an old challenge: how to aggregate heterogeneous information on brain function into a universal cognitive system that relates mental operations/cognitive processes/psychological tasks to brain networks? We cast this challenge in a machine-learning approach to predict conditions from statistical brain maps across different studies. For this, we leverage multi-task learning and multi-scale dimension reduction to learn low-dimensional representations of brain images that carry cognitive information and can be robustly associated with psychological stimuli. Our multi-dataset classification model achieves the best prediction performance on several large reference datasets, compared to models without cognitive-aware low-dimension representations; it brings a substantial performance boost to the analysis of small datasets, and can be introspected to identify universal template cognitive concepts.

    10/31/2017 ∙ by Arthur Mensch, et al.

  • Cross-validation failure: small sample sizes lead to large error bars

    Predictive models ground many state-of-the-art developments in statistical brain image analysis: decoding, MVPA, searchlight, or extraction of biomarkers. The principled approach to establish their validity and usefulness is cross-validation, testing prediction on unseen data. Here, I would like to raise awareness of the error bars of cross-validation, which are often underestimated. Simple experiments show that sample sizes of many neuroimaging studies inherently lead to large error bars, e.g. ±10% for 100 samples. The standard error across folds strongly underestimates them. These large error bars compromise the reliability of conclusions drawn with predictive models, such as biomarkers or methods developments where, unlike with cognitive neuroimaging MVPA approaches, more samples cannot be acquired by repeating the experiment across many subjects. Solutions to increase sample size must be investigated, tackling possible increases in heterogeneity of the data.

    06/23/2017 ∙ by Gaël Varoquaux, et al.
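
    A back-of-the-envelope calculation gives a feel for how sample size drives the error bars on a measured accuracy. The sketch below assumes a simple binomial model with independent test samples and a normal approximation; it is not the experiments reported in the paper, and the function name is hypothetical.

```python
import math


def accuracy_interval_half_width(accuracy, n_samples, z=1.96):
    """Half-width of a normal-approximation interval on a measured accuracy.

    Assumes n_samples independent test predictions, each correct with
    probability `accuracy` (binomial model); z=1.96 gives about 95% coverage.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


if __name__ == "__main__":
    # Chance-level accuracy on a balanced two-class problem is the worst case.
    for n in (30, 100, 1000):
        half_width = accuracy_interval_half_width(0.5, n)
        print(f"n={n:5d}: +/- {100 * half_width:.1f} percentage points")
```

    Under these assumptions the half-width shrinks only as the square root of the sample size, so a tenfold increase in samples narrows the interval by a factor of about three.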

  • Stochastic Subsampling for Factorizing Huge Matrices

    We present a matrix-factorization algorithm that scales to input matrices with a huge number of both rows and columns. Learned factors may be sparse or dense and/or non-negative, which makes our algorithm suitable for dictionary learning, sparse component analysis, and non-negative matrix factorization. Our algorithm streams matrix columns while subsampling them to iteratively learn the matrix factors. At each iteration, the row dimension of a new sample is reduced by subsampling, resulting in lower time complexity compared to a simple streaming algorithm. Our method comes with convergence guarantees to reach a stationary point of the matrix-factorization problem. We demonstrate its efficiency on massive functional Magnetic Resonance Imaging data (2 TB), and on patches extracted from hyperspectral images (103 GB). For both problems, which involve different penalties on rows and columns, we obtain significant speed-ups compared to state-of-the-art algorithms.

    01/19/2017 ∙ by Arthur Mensch, et al.

  • Deriving reproducible biomarkers from multi-site resting-state data: An Autism-based example

    Resting-state functional Magnetic Resonance Imaging (R-fMRI) holds the promise to reveal functional biomarkers of neuropsychiatric disorders. However, extracting such biomarkers is challenging for complex multi-faceted neuropathologies, such as autism spectrum disorders. Large multi-site datasets increase sample sizes to compensate for this complexity, at the cost of uncontrolled heterogeneity. This heterogeneity raises new challenges, akin to those faced in realistic diagnostic applications. Here, we demonstrate the feasibility of inter-site classification of neuropsychiatric status, with an application to the Autism Brain Imaging Data Exchange (ABIDE) database, a large (N=871) multi-site autism dataset. For this purpose, we investigate pipelines that extract the most predictive biomarkers from the data. These R-fMRI pipelines build participant-specific connectomes from functionally-defined brain areas. Connectomes are then compared across participants to learn patterns of connectivity that differentiate typical controls from individuals with autism. We predict this neuropsychiatric status for participants from the same acquisition sites or different, unseen, ones. Good choices of methods for the various steps of the pipeline lead to 67% prediction accuracy on the full ABIDE data, which is significantly better than previously reported results. We perform extensive validation on multiple subsets of the data defined by different inclusion criteria. This enables detailed analysis of the factors contributing to successful connectome-based prediction. First, prediction accuracy improves as we include more subjects, up to the maximum number of subjects available. Second, the definition of functional brain areas is of paramount importance for biomarker discovery: brain areas extracted from large R-fMRI datasets outperform reference atlases in the classification tasks.

    11/18/2016 ∙ by Alexandre Abraham, et al.

  • Recursive nearest agglomeration (ReNA): fast clustering for approximation of structured signals

    In this work, we revisit fast dimension reduction approaches, such as random projections and random sampling. Our goal is to summarize the data to decrease computational costs and memory footprint of subsequent analysis. Such dimension reduction can be very efficient when the signals of interest have a strong structure, such as with images. We focus on this setting and investigate feature clustering schemes for data reduction that capture this structure. An impediment to fast dimension reduction is that good clustering comes with large algorithmic costs. We address it by contributing a linear-time agglomerative clustering scheme, Recursive Nearest Agglomeration (ReNA). Unlike existing fast agglomerative schemes, it avoids the creation of giant clusters. We empirically validate that it approximates the data as well as traditional variance-minimizing clustering schemes that have a quadratic complexity. In addition, we analyze signal approximation with feature clustering and show that it can remove noise, improving subsequent analysis steps. As a consequence, data reduction by clustering features with ReNA yields very fast and accurate models, making it possible to process large datasets on a budget. Our theoretical analysis is backed by extensive experiments on publicly-available data that illustrate the computational efficiency and the denoising properties of the resulting dimension reduction scheme.

    09/15/2016 ∙ by Andrés Hoyos-Idrobo, et al.