
Using Feature Grouping as a Stochastic Regularizer for High-Dimensional Noisy Data
The use of complex models with many parameters is challenging with high-dimensional small-sample problems: indeed, they face rapid overfitting. Such situations are common when data collection is expensive, as in neuroscience, biology, or geology. Dedicated regularization can be crafted to tame overfitting, typically via structured penalties. But rich penalties require mathematical expertise and entail large computational costs. Stochastic regularizers such as dropout are easier to implement: they prevent overfitting by random perturbations. Used inside a stochastic optimizer, they come with little additional cost. We propose a structured stochastic regularization that relies on feature grouping. Using a fast clustering algorithm, we define a family of groups of features that capture feature covariations. We then randomly select these groups inside a stochastic gradient descent loop. This procedure acts as a structured regularizer for high-dimensional correlated data without additional computational cost, and it has a denoising effect. We demonstrate the performance of our approach for logistic regression both on a sample-limited face image dataset with varying additive noise and on a typical high-dimensional learning problem, brain image classification.
07/31/2018 ∙ by Sergul Aydore, et al.
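The core loop described above is simple enough to sketch. Below is a minimal numpy illustration of the idea (not the paper's implementation): a fresh random feature grouping is drawn at each epoch (purely random here, standing in for the clustering-derived groups), the data are reduced by within-group averaging, and the logistic-regression gradient is computed in the reduced space and lifted back to the full feature space.

```python
import numpy as np

def random_grouping(n_features, n_groups, rng):
    """Build a group-averaging projection: each feature is assigned to one
    of n_groups clusters, and features are averaged within each cluster.
    (The grouping here is purely random; the paper derives it from a fast
    structure-aware clustering of the features.)"""
    labels = rng.integers(0, n_groups, size=n_features)
    P = np.zeros((n_groups, n_features))
    P[labels, np.arange(n_features)] = 1.0
    P /= np.maximum(P.sum(axis=1, keepdims=True), 1.0)
    return P

def sgd_logistic_grouped(X, y, n_groups=10, n_epochs=200, lr=0.3, seed=0):
    """Logistic regression trained with a fresh random feature grouping at
    every epoch, so that the grouping acts as a stochastic regularizer."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        P = random_grouping(d, n_groups, rng)
        Xg = X @ P.T                        # data reduced to group averages
        p = 1.0 / (1.0 + np.exp(-Xg @ (P @ w)))
        grad = P.T @ (Xg.T @ (p - y) / n)   # lift the gradient back
        w -= lr * grad
    return w
```

The weights live in the full feature space; only the gradient step passes through the random group-averaging projection, so the extra cost per epoch is one sparse projection.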

Computational and informatics advances for reproducible data analysis in neuroimaging
The reproducibility of scientific research has become a point of critical concern. We argue that openness and transparency are critical for reproducibility, and we outline an ecosystem for open and transparent science that has emerged within the human neuroimaging community. We discuss the range of open data sharing resources that have been developed for neuroimaging data, and the role of data standards (particularly the Brain Imaging Data Structure) in enabling the automated sharing, processing, and reuse of large neuroimaging datasets. We outline how the open-source Python language has provided the basis for a data science platform that enables reproducible data analysis and visualization. We also discuss how new advances in software engineering, such as containerization, provide the basis for greater reproducibility in data analysis. The emergence of this new ecosystem provides an example for many areas of science that are currently struggling with reproducibility.
09/24/2018 ∙ by Russell A. Poldrack, et al.

Approximate message-passing for convex optimization with non-separable penalties
We introduce an iterative optimization scheme for convex objectives consisting of a linear loss and a non-separable penalty, based on the expectation-consistent approximation and the vector approximate message-passing (VAMP) algorithm. Specifically, the penalties we approach are convex on a linear transformation of the variable to be determined, a notable example being total variation (TV). We describe the connection between message-passing algorithms (typically used for approximate inference) and proximal methods for optimization, and show that our scheme is, as VAMP, similar in nature to the Peaceman-Rachford splitting, with the important difference that step sizes are set adaptively. Finally, we benchmark the performance of our VAMP-like iteration in problems where TV penalties are useful, namely classification in task fMRI and reconstruction in tomography, and show faster convergence than that of state-of-the-art approaches such as FISTA and ADMM in most settings.
09/17/2018 ∙ by Andre Manoel, et al.
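As a concrete point of reference, the ADMM baseline mentioned at the end of the abstract is easy to sketch for the simplest TV problem, 1-D total-variation denoising. This is not the VAMP-like scheme itself, just the classical splitting it is benchmarked against:

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding operator: the proximal map of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def tv_admm(b, lam=1.0, rho=1.0, n_iter=200):
    """ADMM for 1-D total-variation denoising:
        min_x 0.5 * ||x - b||^2 + lam * ||D x||_1,
    where D is the first-difference operator (the penalty is convex on a
    linear transformation of x, as in the abstract)."""
    n = len(b)
    D = np.diff(np.eye(n), axis=0)           # (n-1, n) difference operator
    x = b.copy()
    z = D @ x
    u = np.zeros(n - 1)
    # Precompute and invert the (fixed) linear-system matrix.
    Minv = np.linalg.inv(np.eye(n) + rho * D.T @ D)
    for _ in range(n_iter):
        x = Minv @ (b + rho * D.T @ (z - u))  # quadratic subproblem
        z = soft(D @ x + u, lam / rho)        # l1 proximal step
        u += D @ x - z                        # dual update
    return x
```

Note the fixed penalty parameter `rho`: the paper's point is precisely that the VAMP-like iteration sets such step sizes adaptively.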

Extracting Universal Representations of Cognition across Brain-Imaging Studies
The size of publicly available data in cognitive neuroimaging has increased substantially in recent years, thanks to strong research and community efforts. Exploiting this wealth of data demands new methods to turn the heterogeneous cognitive information held in different task-fMRI studies into common, universal cognitive models. In this paper, we pool data from large fMRI repositories to predict psychological conditions from statistical brain maps across different studies and subjects. We leverage advances in deep learning, intermediate representations and multi-task learning to learn universal interpretable low-dimensional representations of brain images, usable for predicting psychological stimuli in all input studies. The method improves decoding performance for 80% of studies: information flows from every study to the others, and it notably gives a strong performance boost when decoding studies of small size. The trained low-dimensional representation, learned by task-optimized networks, is interpretable as a set of basis cognitive dimensions relevant to meaningful categories of cognitive stimuli. Our approach opens new ways of extracting information from brain maps, overcoming the low power of typical fMRI studies.
09/17/2018 ∙ by Arthur Mensch, et al.

Encoding high-cardinality string categorical variables
Statistical analysis usually requires a vector representation of categorical variables, using for instance one-hot encoding. This encoding strategy is not practical when the number of different categories grows, as it creates high-dimensional feature vectors. Additionally, the corresponding entries in the raw data are often represented as strings, which carry additional information not captured by one-hot encoding. Here, we seek low-dimensional vectorial encodings of string categorical variables with high cardinality. Ideally, these should i) be scalable to a very large number of categories, ii) be interpretable to the end user, and iii) facilitate statistical analysis. We introduce two new encoding approaches for string categories: a Gamma-Poisson matrix factorization on character-level substring counts, and a min-hash encoder, based on min-wise random permutations for fast approximation of the Jaccard similarity between strings. Both approaches are scalable and are suitable for streaming settings. Extensive experiments on real and simulated data show that these encoding methods improve prediction performance for real-life supervised-learning problems with high-cardinality string categorical variables, and work as well as standard approaches with clean, low-cardinality ones. We recommend the following: i) if scalability is the main concern, the min-hash encoder is the best option, as it does not require any fitting to the data; ii) if interpretability is important, the Gamma-Poisson factorization is a good alternative, as it can be interpreted as a one-hot encoding, giving each encoding dimension a feature name that summarizes the substrings captured. Both models remove the need for hand-crafting features and data cleaning of string columns in databases and can be used for feature engineering in online AutoML settings.
07/03/2019 ∙ by Patricio Cerda, et al.
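The min-hash encoder is simple to sketch. The toy version below (character 3-grams and a salted 64-bit hash; the n-gram size and hash choice are illustrative, not the paper's exact design) maps each string to a fixed-length signature whose per-dimension match rate estimates the Jaccard similarity of the n-gram sets. As the abstract notes, no fitting to the data is required:

```python
import hashlib
import numpy as np

def char_ngrams(s, n=3):
    """Set of character n-grams of a string, padded with spaces."""
    s = " " + s.lower() + " "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def minhash_encode(s, dim=32, n=3):
    """Min-wise hashing of character n-grams: each output dimension is
    the minimum of a salted hash over the string's n-gram set, so two
    strings agree on a dimension with probability equal to the Jaccard
    similarity of their n-gram sets."""
    grams = char_ngrams(s, n)
    out = np.empty(dim, dtype=np.uint64)
    for k in range(dim):
        out[k] = min(
            int.from_bytes(
                hashlib.blake2b(f"{k}:{g}".encode(), digest_size=8).digest(),
                "big")
            for g in grams)
    return out

def minhash_similarity(a, b):
    """Fraction of matching signature dimensions: a Jaccard estimate."""
    return float(np.mean(a == b))
```

Because the encoding is stateless, it can be applied to streaming data and to categories never seen before, which is what makes it attractive when scalability is the main concern.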

Manifold-regression to predict from MEG/EEG brain signals without source modeling
Magnetoencephalography and electroencephalography (M/EEG) can reveal neuronal dynamics non-invasively in real time and are therefore appreciated methods in medicine and neuroscience. Recent advances in modeling brain-behavior relationships have highlighted the effectiveness of Riemannian geometry for summarizing the spatially correlated time series from M/EEG in terms of their covariance. However, after artefact suppression, M/EEG data is often rank-deficient, which limits the application of Riemannian concepts. In this article, we focus on the task of regression with rank-reduced covariance matrices. We study two Riemannian approaches that vectorize the M/EEG between-sensor covariance through projection into a tangent space. The Wasserstein distance readily applies to rank-reduced data but lacks affine invariance. This can be overcome by finding a common subspace in which the covariance matrices are full rank, enabling the affine-invariant geometric distance. We investigated the implications of these two approaches in synthetic generative models, which allowed us to control the estimation bias of a linear model for prediction. We show that Wasserstein and geometric distances allow perfect out-of-sample prediction on the generative models. We then evaluated the methods on real data with regard to their effectiveness in predicting age from M/EEG covariance matrices. The findings suggest that the data-driven Riemannian methods outperform different sensor-space estimators and that they come close to the performance of a biophysics-driven source-localization model that requires MRI acquisitions and tedious data processing. Our study suggests that the proposed Riemannian methods can serve as fundamental building blocks for automated large-scale analysis of M/EEG.
06/04/2019 ∙ by David Sabbagh, et al.
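The tangent-space vectorization step can be sketched in a few lines, assuming full-rank covariance matrices (handling the rank-deficient case is exactly what the article addresses):

```python
import numpy as np
from scipy.linalg import sqrtm, logm

def tangent_vectors(covs, ref):
    """Map SPD covariance matrices to the tangent space at `ref` under
    the affine-invariant metric, then keep the upper triangle so the
    result can feed a plain linear regression. Assumes all matrices are
    full rank."""
    ref_isqrt = np.linalg.inv(np.real(sqrtm(ref)))
    iu = np.triu_indices(ref.shape[0])
    return np.array(
        [np.real(logm(ref_isqrt @ C @ ref_isqrt))[iu] for C in covs])
```

By construction the reference point maps to the zero vector, and distances near the reference are approximately preserved, which is what lets a linear model operate on the vectorized covariances.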

Learning Neural Representations of Human Cognition across Many fMRI Studies
Cognitive neuroscience is enjoying a rapid increase in extensive public brain-imaging datasets. This opens the door to large-scale statistical models. Finding a unified perspective for all available data calls for scalable and automated solutions to an old challenge: how to aggregate heterogeneous information on brain function into a universal cognitive system that relates mental operations, cognitive processes, and psychological tasks to brain networks? We cast this challenge as a machine-learning approach to predict conditions from statistical brain maps across different studies. For this, we leverage multi-task learning and multi-scale dimension reduction to learn low-dimensional representations of brain images that carry cognitive information and can be robustly associated with psychological stimuli. Compared to models without cognitive-aware low-dimensional representations, our multi-dataset classification model achieves the best prediction performance on several large reference datasets; it brings a substantial performance boost to the analysis of small datasets, and can be introspected to identify universal template cognitive concepts.
10/31/2017 ∙ by Arthur Mensch, et al.
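The multi-task idea, a shared low-dimensional projection with one classification head per study, can be conveyed by a bare-bones sketch (a stand-in for the idea only, not the paper's architecture or training procedure):

```python
import numpy as np

def train_multitask(Xs, ys, k=4, n_iter=500, lr=0.1, seed=0):
    """Jointly train a shared projection W (d -> k) and one softmax head
    per study by full-batch gradient descent. The shared layer is where
    information flows across studies."""
    rng = np.random.default_rng(seed)
    d = Xs[0].shape[1]
    W = rng.normal(size=(d, k)) * 0.2
    heads = [rng.normal(size=(k, int(y.max()) + 1)) * 0.2 for y in ys]
    for _ in range(n_iter):
        for t, (X, y) in enumerate(zip(Xs, ys)):
            Z = X @ W
            logits = Z @ heads[t]
            P = np.exp(logits - logits.max(axis=1, keepdims=True))
            P /= P.sum(axis=1, keepdims=True)
            G = P.copy()
            G[np.arange(len(y)), y] -= 1.0   # softmax cross-entropy grad
            G /= len(y)
            W -= lr * (X.T @ (G @ heads[t].T))
            heads[t] -= lr * (Z.T @ G)
    return W, heads
```

Small studies benefit because the shared projection W is fit on all studies at once, while each head only has k-dimensional inputs to learn from.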

Cross-validation failure: small sample sizes lead to large error bars
Predictive models ground many state-of-the-art developments in statistical brain image analysis: decoding, MVPA, searchlight, or extraction of biomarkers. The principled approach to establishing their validity and usefulness is cross-validation, testing prediction on unseen data. Here, I would like to raise awareness about the error bars of cross-validation, which are often underestimated. Simple experiments show that sample sizes of many neuroimaging studies inherently lead to large error bars, e.g. ±10% for 100 samples; the standard error across folds strongly underestimates them. These large error bars compromise the reliability of conclusions drawn with predictive models, such as biomarkers or methods developments where, unlike with cognitive-neuroimaging MVPA approaches, more samples cannot be acquired by repeating the experiment across many subjects. Solutions to increase sample size must be investigated, tackling possible increases in the heterogeneity of the data.
06/23/2017 ∙ by Gaël Varoquaux, et al.
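The size of these error bars follows from a simple binomial argument: an accuracy measured on n held-out samples cannot be known more precisely than the binomial confidence interval allows. A quick calculator makes the point:

```python
import numpy as np

def accuracy_error_bar(acc, n_test, z=1.96):
    """Half-width of the binomial (normal-approximation) 95% confidence
    interval on a prediction accuracy measured on n_test held-out
    samples: the unavoidable error bar of cross-validation."""
    return z * np.sqrt(acc * (1.0 - acc) / n_test)
```

With 100 test samples and a measured accuracy of 70%, the interval is roughly ±9%; with 1000 samples it shrinks below ±3%. Percent-level differences between methods are therefore invisible at typical neuroimaging sample sizes.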

Stochastic Subsampling for Factorizing Huge Matrices
We present a matrix-factorization algorithm that scales to input matrices with huge numbers of both rows and columns. Learned factors may be sparse or dense and/or non-negative, which makes our algorithm suitable for dictionary learning, sparse component analysis, and non-negative matrix factorization. Our algorithm streams matrix columns while subsampling them to iteratively learn the matrix factors. At each iteration, the row dimension of a new sample is reduced by subsampling, resulting in lower time complexity compared to a simple streaming algorithm. Our method comes with convergence guarantees to reach a stationary point of the matrix-factorization problem. We demonstrate its efficiency on massive functional Magnetic Resonance Imaging data (2 TB), and on patches extracted from hyperspectral images (103 GB). For both problems, which involve different penalties on rows and columns, we obtain significant speedups compared to state-of-the-art algorithms.
01/19/2017 ∙ by Arthur Mensch, et al.
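A much-simplified stand-in for the row-subsampling idea is alternating least squares where the code update only ever sees a random subset of the rows (the actual algorithm is streaming, handles sparse and non-negative penalties, and carries the stated convergence guarantees; none of that is reproduced here):

```python
import numpy as np

def subsampled_als(X, k=3, n_iter=50, row_frac=0.5, seed=0):
    """Factorize X ~ D @ A by alternating least squares, estimating the
    codes A from a random subset of rows at each iteration. When X is
    genuinely low rank, the subsampled view carries enough information
    to recover the factors."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    D = rng.normal(size=(n, k))
    for _ in range(n_iter):
        rows = rng.random(n) < row_frac                       # row subsample
        A = np.linalg.lstsq(D[rows], X[rows], rcond=None)[0]  # reduced code
        D = np.linalg.lstsq(A.T, X.T, rcond=None)[0].T        # full update
    return D, A
```

The code step costs a fraction `row_frac` of the full least-squares solve, which is where the speedup of subsampling comes from.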

Deriving reproducible biomarkers from multi-site resting-state data: An autism-based example
Resting-state functional Magnetic Resonance Imaging (R-fMRI) holds the promise to reveal functional biomarkers of neuropsychiatric disorders. However, extracting such biomarkers is challenging for complex multi-faceted neuropathologies, such as autism spectrum disorders. Large multi-site datasets increase sample sizes to compensate for this complexity, at the cost of uncontrolled heterogeneity. This heterogeneity raises new challenges, akin to those faced in realistic diagnostic applications. Here, we demonstrate the feasibility of inter-site classification of neuropsychiatric status, with an application to the Autism Brain Imaging Data Exchange (ABIDE) database, a large (N=871) multi-site autism dataset. For this purpose, we investigate pipelines that extract the most predictive biomarkers from the data. These R-fMRI pipelines build participant-specific connectomes from functionally defined brain areas. Connectomes are then compared across participants to learn patterns of connectivity that differentiate typical controls from individuals with autism. We predict this neuropsychiatric status for participants from the same acquisition sites or from different, unseen ones. Good choices of methods for the various steps of the pipeline lead to 67% prediction accuracy on the full ABIDE data, which is significantly better than previously reported results. We perform extensive validation on multiple subsets of the data defined by different inclusion criteria. This enables a detailed analysis of the factors contributing to successful connectome-based prediction. First, prediction accuracy improves as we include more subjects, up to the maximum number available. Second, the definition of functional brain areas is of paramount importance for biomarker discovery: brain areas extracted from large R-fMRI datasets outperform reference atlases in the classification tasks.
11/18/2016 ∙ by Alexandre Abraham, et al.
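The connectome-building step of such pipelines reduces, at its core, to correlating region time courses and vectorizing the result. A minimal sketch (the brain-area definition that precedes this step, and the cross-participant classifier that follows it, are omitted):

```python
import numpy as np

def connectome_features(timeseries):
    """One connectivity vector per participant: Pearson correlation
    between region time courses, keeping only the upper triangle so the
    vectors can be compared across participants by a classifier."""
    n_regions = timeseries[0].shape[1]
    iu = np.triu_indices(n_regions, k=1)
    return np.array([np.corrcoef(ts.T)[iu] for ts in timeseries])
```

Each participant contributes a (n_samples, n_regions) time-series array and gets back one vector of length n_regions * (n_regions - 1) / 2.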

Recursive nearest agglomeration (ReNA): fast clustering for approximation of structured signals
In this work, we revisit fast dimension reduction approaches, such as random projections and random sampling. Our goal is to summarize the data to decrease the computational costs and memory footprint of subsequent analysis. Such dimension reduction can be very efficient when the signals of interest have a strong structure, as with images. We focus on this setting and investigate feature clustering schemes for data reduction that capture this structure. An impediment to fast dimension reduction is that good clustering comes with large algorithmic costs. We address this by contributing a linear-time agglomerative clustering scheme, Recursive Nearest Agglomeration (ReNA). Unlike existing fast agglomerative schemes, it avoids the creation of giant clusters. We empirically validate that it approximates the data as well as traditional variance-minimizing clustering schemes that have a quadratic complexity. In addition, we analyze signal approximation with feature clustering and show that it can remove noise, improving subsequent analysis steps. As a consequence, data reduction by clustering features with ReNA yields very fast and accurate models, making it possible to process large datasets on a budget. Our theoretical analysis is backed by extensive experiments on publicly available data that illustrate the computational efficiency and the denoising properties of the resulting dimension reduction scheme.
09/15/2016 ∙ by Andrés Hoyos-Idrobo, et al.
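A toy greedy feature agglomeration conveys the idea of data reduction by feature clustering. Note that this is not ReNA (which is linear-time and merges along a nearest-neighbor graph); the quadratic greedy scheme below is the kind of baseline that ReNA approximates:

```python
import numpy as np

def cluster_features(X, n_clusters):
    """Greedy agglomerative feature clustering: repeatedly merge the two
    most correlated feature clusters, then reduce the data by averaging
    features within each cluster. Quadratic-complexity toy version."""
    d = X.shape[1]
    clusters = [[j] for j in range(d)]
    means = [X[:, j].astype(float) for j in range(d)]
    while len(clusters) > n_clusters:
        # Find the most correlated pair of cluster representatives.
        M = np.corrcoef(np.array(means))
        np.fill_diagonal(M, -np.inf)
        a, b = np.unravel_index(np.argmax(M), M.shape)
        if b < a:
            a, b = b, a
        clusters[a] += clusters.pop(b)
        means.pop(b)
        means[a] = X[:, clusters[a]].mean(axis=1)
    return clusters, np.array(means).T   # reduced data: one column per cluster
```

Averaging correlated features both shrinks the dimension and averages out feature-level noise, which is the denoising effect the abstract describes.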
Gaël Varoquaux
Researcher at INRIA. Research Fellow at INSERM (2010-2011), Research Fellow at INRIA (2008-2010), software consultant at Enthought, Inc. (2008), junior specialist in scientific computing at UC Berkeley (2008), Research Fellow at LENS, the European Laboratory for Non-linear Spectroscopy (2007-2008), PhD student at Institut d'Optique Graduate School (2005-2008).