
Stochastic Deep Networks
Machine learning is increasingly targeting areas where input data cannot be accurately described by a single vector, but can be modeled instead using the more flexible concept of random vectors, namely probability measures or more simply point clouds of varying cardinality. Using deep architectures on measures poses, however, many challenging issues. Indeed, deep architectures are originally designed to handle fixedlength vectors, or, using recursive mechanisms, ordered sequences thereof. In sharp contrast, measures describe a varying number of weighted observations with no particular order. We propose in this work a deep framework designed to handle crucial aspects of measures, namely permutation invariances, variations in weights and cardinality. Architectures derived from this pipeline can (i) map measures to measures  using the concept of pushforward operators; (ii) bridge the gap between measures and Euclidean spaces  through integration steps. This allows to design discriminative networks (to classify or reduce the dimensionality of input measures), generative architectures (to synthesize measures) and recurrent pipelines (to predict measure dynamics). We provide a theoretical analysis of these building blocks, review our architectures' approximation abilities and robustness w.r.t. perturbation, and try them on various discriminative and generative tasks.
11/19/2018 ∙ by Gwendoline de Bie, et al. ∙ 12 ∙ shareread it

TreeSliced Approximation of Wasserstein Distances
Optimal transport () theory provides a useful set of tools to compare probability distributions. As a consequence, the field of is gaining traction and interest within the machine learning community. A few deficiencies usually associated with include its high computational complexity when comparing discrete measures, which is quadratic when approximating it through entropic regularization; or supercubic when solving it exactly. For some applications, the fact that distances are not usually negative definite also means that they cannot be used with usual Hilbertian tools. In this work, we consider a particular family of ground metrics, namely tree metrics, which yield negative definite metrics that can be computed in linear time. By averaging over randomly sampled tree metrics, we obtain a treeslicedWasserstein distance. We illustrate that the proposed treeslicedWasserstein distances compare favorably with other baselines on various benchmark datasets.
02/01/2019 ∙ by Tam Le, et al. ∙ 12 ∙ shareread it

Semidual Regularized Optimal Transport
Variational problems that involve Wasserstein distances and more generally optimal transport (OT) theory are playing an increasingly important role in data sciences. Such problems can be used to form an examplar measure out of various probability measures, as in the Wasserstein barycenter problem, or to carry out parametric inference and density fitting, where the loss is measured in terms of an optimal transport cost to the measure of observations. Despite being conceptually simple, such problems are computationally challenging because they involve minimizing over quantities (Wasserstein distances) that are themselves hard to compute. Entropic regularization has recently emerged as an efficient tool to approximate the solution of such variational Wasserstein problems. In this paper, we give a thorough duality tour of these regularization techniques. In particular, we show how important concepts from classical OT such as ctransforms and semidiscrete approaches translate into similar ideas in a regularized setting. These dual formulations lead to smooth variational problems, which can be solved using smooth, differentiable and convex optimization problems that are simpler to implement and numerically more stable that their unregularized counterparts. We illustrate the versatility of this approach by applying it to the computation of Wasserstein barycenters and gradient flows of spatial regularization functionals.
11/13/2018 ∙ by Marco Cuturi, et al. ∙ 6 ∙ shareread it

Group level MEG/EEG source imaging via optimal transport: minimum Wasserstein estimates
Magnetoencephalography (MEG) and electroencephalography (EEG) are noninvasive modalities that measure the weak electromagnetic fields generated by neural activity. Inferring the location of the current sources that generated these magnetic fields is an illposed inverse problem known as source imaging. When considering a group study, a baseline approach consists in carrying out the estimation of these sources independently for each subject. The illposedness of each problem is typically addressed using sparsity promoting regularizations. A straightforward way to define a common pattern for these sources is then to average them. A more advanced alternative relies on a joint localization of sources for all subjects taken together, by enforcing some similarity across all estimated sources. An important advantage of this approach is that it consists in a single estimation in which all measurements are pooled together, making the inverse problem better posed. Such a joint estimation poses however a few challenges, notably the selection of a valid regularizer that can quantify such spatial similarities. We propose in this work a new procedure that can do so while taking into account the geometrical structure of the cortex. We call this procedure Minimum Wasserstein Estimates (MWE). The benefits of this model are twofold. First, joint inference allows to pool together the data of different brain geometries, accumulating more spatial information. Second, MWE are defined through Optimal Transport (OT) metrics which provide a tool to model spatial proximity between cortical sources of different subjects, hence not enforcing identical source location in the group. These benefits allow MWE to be more accurate than standard MEG source localization techniques. To support these claims, we perform source localization on realistic MEG simulations based on forward operators derived from MRI scans. On a visual task dataset, we demonstrate how MWE infer neural patterns similar to functional Magnetic Resonance Imaging (fMRI) maps.
02/13/2019 ∙ by Hicham Janati, et al. ∙ 6 ∙ shareread it

GAN and VAE from an Optimal Transport Point of View
This short article revisits some of the ideas introduced in arXiv:1701.07875 and arXiv:1705.07642 in a simple setup. This sheds some lights on the connexions between Variational Autoencoders (VAE), Generative Adversarial Networks (GAN) and Minimum Kantorovitch Estimators (MKE).
06/06/2017 ∙ by Aude Genevay, et al. ∙ 0 ∙ shareread it

Learning Generative Models with Sinkhorn Divergences
The ability to compare two degenerate probability distributions (i.e. two probability distributions supported on two distinct lowdimensional manifolds living in a much higherdimensional space) is a crucial problem arising in the estimation of generative models for highdimensional observations such as those arising in computer vision or natural language. It is known that optimal transport metrics can represent a cure for this problem, since they were specifically designed as an alternative to information divergences to handle such problematic scenarios. Unfortunately, training generative machines using OT raises formidable computational and statistical challenges, because of (i) the computational burden of evaluating OT losses, (ii) the instability and lack of smoothness of these losses, (iii) the difficulty to estimate robustly these losses and their gradients in high dimension. This paper presents the first tractable computational method to train large scale generative models using an optimal transport loss, and tackles these three issues by relying on two key ideas: (a) entropic smoothing, which turns the original OT loss into one that can be computed using Sinkhorn fixed point iterations; (b) algorithmic (automatic) differentiation of these iterations. These two approximations result in a robust and differentiable approximation of the OT loss with streamlined GPU execution. Entropic smoothing generates a family of losses interpolating between Wasserstein (OT) and Maximum Mean Discrepancy (MMD), thus allowing to find a sweet spot leveraging the geometry of OT and the favorable highdimensional sample complexity of MMD which comes with unbiased gradient estimates. The resulting computational architecture complements nicely standard deep network generative models by a stack of extra layers implementing the loss function.
06/01/2017 ∙ by Aude Genevay, et al. ∙ 0 ∙ shareread it

SoftDTW: a Differentiable Loss Function for TimeSeries
We propose in this paper a differentiable learning loss between time series. Our proposal builds upon the celebrated Dynamic Time Warping (DTW) discrepancy. Unlike the Euclidean distance, DTW is able to compare asynchronous time series of varying size and is robust to elastic transformations in time. To be robust to such invariances, DTW computes a minimal cost alignment between time series using dynamic programming. Our work takes advantage of a smoothed formulation of DTW, called softDTW, that computes the softminimum of all alignment costs. We show in this paper that softDTW is a differentiable loss function, and that both its value and its gradient can be computed with quadratic time/space complexity (DTW has quadratic time and linear space complexity). We show that our regularization is particularly well suited to average and cluster time series under the DTW geometry, a task for which our proposal significantly outperforms existing baselines (Petitjean et al., 2011). Next, we propose to tune the parameters of a machine that outputs time series by minimizing its fit with groundtruth labels in a softDTW sense.
03/05/2017 ∙ by Marco Cuturi, et al. ∙ 0 ∙ shareread it

Wasserstein Discriminant Analysis
Wasserstein Discriminant Analysis (WDA) is a new supervised method that can improve classification of highdimensional data by computing a suitable linear map onto a lower dimensional subspace. Following the blueprint of classical Linear Discriminant Analysis (LDA), WDA selects the projection matrix that maximizes the ratio of two quantities: the dispersion of projected points coming from different classes, divided by the dispersion of projected points coming from the same class. To quantify dispersion, WDA uses regularized Wasserstein distances, rather than crossvariance measures which have been usually considered, notably in LDA. Thanks to the the underlying principles of optimal transport, WDA is able to capture both global (at distribution scale) and local (at samples scale) interactions between classes. Regularized Wasserstein distances can be computed using the Sinkhorn matrix scaling algorithm; We show that the optimization of WDA can be tackled using automatic differentiation of Sinkhorn iterations. Numerical experiments show promising results both in terms of prediction and visualization on toy examples and real life datasets such as MNIST and on deep features obtained from a subset of the Caltech dataset.
08/29/2016 ∙ by Rémi Flamary, et al. ∙ 0 ∙ shareread it

On Wasserstein Two Sample Testing and Related Families of Nonparametric Tests
Nonparametric two sample or homogeneity testing is a decision theoretic problem that involves identifying differences between two random variables without making parametric assumptions about their underlying distributions. The literature is old and rich, with a wide variety of statistics having being intelligently designed and analyzed, both for the unidimensional and the multivariate setting. Our contribution is to tie together many of these tests, drawing connections between seemingly very different statistics. In this work, our central object is the Wasserstein distance, as we form a chain of connections from univariate methods like the KolmogorovSmirnov test, PP/QQ plots and ROC/ODC curves, to multivariate tests involving energy statistics and kernel based maximum mean discrepancy. Some connections proceed through the construction of a smoothed Wasserstein distance, and others through the pursuit of a "distributionfree" Wasserstein test. Some observations in this chain are implicit in the literature, while others seem to have not been noticed thus far. Given nonparametric two sample testing's classical and continued importance, we aim to provide useful connections for theorists and practitioners familiar with one subset of methods but not others.
09/08/2015 ∙ by Aaditya Ramdas, et al. ∙ 0 ∙ shareread it

Wasserstein Training of Boltzmann Machines
The Boltzmann machine provides a useful framework to learn highly complex, multimodal and multiscale data distributions that occur in the real world. The default method to learn its parameters consists of minimizing the KullbackLeibler (KL) divergence from training samples to the Boltzmann model. We propose in this work a novel approach for Boltzmann training which assumes that a meaningful metric between observations is given. This metric can be represented by the Wasserstein distance between distributions, for which we derive a gradient with respect to the model parameters. Minimization of this new Wasserstein objective leads to generative models that are better when considering the metric and that have a clusterlike structure. We demonstrate the practical potential of these models for data completion and denoising, for which the metric between observations plays a crucial role.
07/07/2015 ∙ by Grégoire Montavon, et al. ∙ 0 ∙ shareread it

Principal Geodesic Analysis for Probability Measures under the Optimal Transport Metric
Given a family of probability measures in P(X), the space of probability measures on a Hilbert space X, our goal in this paper is to highlight one ore more curves in P(X) that summarize efficiently that family. We propose to study this problem under the optimal transport (Wasserstein) geometry, using curves that are restricted to be geodesic segments under that metric. We show that concepts that play a key role in Euclidean PCA, such as data centering or orthogonality of principal directions, find a natural equivalent in the optimal transport geometry, using Wasserstein means and differential geometry. The implementation of these ideas is, however, computationally challenging. To achieve scalable algorithms that can handle thousands of measures, we propose to use a relaxed definition for geodesics and regularized optimal transport distances. The interest of our approach is demonstrated on images seen either as shapes or color histograms.
06/26/2015 ∙ by Vivien Seguy, et al. ∙ 0 ∙ shareread it
Marco Cuturi
is this you? claim profile
Professor of Statistics in CREST  ENSAE at Université ParisSaclay, Ph.D. under the supervision of JeanPhilippe Vert in 11/2005 from the Ecole des Mines de Paris, Postdoctoral researcher at the Institute of Statistical Mathematics, Tokyo, between 11/2005 and 03/2007. ORFE department of Princeton University between 02/2009 and 08/2010 as a lecturer, I was at the Graduate School of Informatics of Kyoto University between 09/2010 and 09/2016 as an associate professor (tenured in 11/2013).