
Tree-Sliced Approximation of Wasserstein Distances
Optimal transport (OT) theory provides a useful set of tools to compare probability distributions. As a consequence, the field of OT is gaining traction and interest within the machine learning community. A few deficiencies usually associated with OT include its high computational complexity when comparing discrete measures, which is quadratic when approximating it through entropic regularization, or supercubic when solving it exactly. For some applications, the fact that OT distances are not usually negative definite also means that they cannot be used with usual Hilbertian tools. In this work, we consider a particular family of ground metrics, namely tree metrics, which yield negative definite metrics that can be computed in linear time. By averaging over randomly sampled tree metrics, we obtain a tree-sliced Wasserstein distance. We illustrate that the proposed tree-sliced Wasserstein distances compare favorably with other baselines on various benchmark datasets.
02/01/2019 ∙ by Tam Le, et al.
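
The linear-time computation rests on a closed form for the Wasserstein distance under a tree ground metric: it is a weighted sum, over edges, of the absolute difference in probability mass of the subtree below each edge. The sketch below illustrates this under assumed conventions (a rooted tree given by parent pointers with `parent[i] < i`, distributions supported on the nodes); the function and variable names are ours, not the paper's, and the random sampling of tree metrics is omitted.

```python
# A minimal sketch of the W_1 distance under a tree metric, in linear time.
import numpy as np

def tree_wasserstein(parent, edge_len, mu, nu):
    """Sum over edges of edge length times |difference in subtree mass|."""
    n = len(parent)
    sub_mu, sub_nu = mu.astype(float), nu.astype(float)
    # Accumulate subtree masses bottom-up (children have larger indices).
    for i in range(n - 1, 0, -1):
        sub_mu[parent[i]] += sub_mu[i]
        sub_nu[parent[i]] += sub_nu[i]
    # The edge above node i contributes edge_len[i] * |mass difference|.
    return sum(edge_len[i] * abs(sub_mu[i] - sub_nu[i]) for i in range(1, n))

# Path graph 0-1-2 with unit edges; point masses at the two ends.
parent = np.array([0, 0, 1])           # parent[i] < i; node 0 is the root
edge_len = np.array([0.0, 1.0, 1.0])
print(tree_wasserstein(parent, edge_len,
                       np.array([1.0, 0, 0]), np.array([0, 0, 1.0])))  # 2.0
```

The tree-sliced variant then averages `tree_wasserstein` over several randomly sampled tree metrics.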

A Kernel Stein Test for Comparing Latent Variable Models
We propose a nonparametric, kernel-based test to assess the relative goodness of fit of latent variable models with intractable unnormalized densities. Our test generalises the kernel Stein discrepancy (KSD) tests of Liu et al. (2016), Chwialkowski et al. (2016), Yang et al. (2018), and Jitkrittum et al. (2018), which required exact access to unnormalized densities. Our new test relies on the simple idea of using an approximate observed-variable marginal in place of the exact, intractable one. As our main theoretical contribution, we prove that the new test, with a properly corrected threshold, has a well-controlled type-I error. In the case of models with low-dimensional latent structure and high-dimensional observations, our test significantly outperforms the relative maximum mean discrepancy test (Bounliphone et al., 2015), which cannot exploit the latent structure.
07/01/2019 ∙ by Heishiro Kanagawa, et al.
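
For orientation, here is a minimal 1-D sketch of the kernel Stein discrepancy U-statistic that such tests build on, for a model with tractable score function d/dx log p(x) and a Gaussian kernel. The paper's contribution, replacing the exact score with one derived from an approximate observed-variable marginal and correcting the test threshold accordingly, is not shown; names and the bandwidth are illustrative.

```python
# A minimal 1-D sketch of the KSD U-statistic with a Gaussian kernel;
# `score` is d/dx log p(x) of the model.
import numpy as np

def ksd_ustat(x, score, sigma=1.0):
    d = x[:, None] - x[None, :]                  # pairwise differences
    k = np.exp(-d**2 / (2 * sigma**2))           # Gaussian kernel matrix
    s = score(x)
    h = k * (s[:, None] * s[None, :]
             + (s[:, None] - s[None, :]) * d / sigma**2
             + 1.0 / sigma**2 - d**2 / sigma**4) # Stein kernel h_p(x_i, x_j)
    np.fill_diagonal(h, 0.0)                     # U-statistic: drop i == j
    n = len(x)
    return h.sum() / (n * (n - 1))

# Example: data from N(0,1) tested against a standard normal model.
x = np.random.randn(500)
print(ksd_ustat(x, score=lambda t: -t))          # near 0 under the model
```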

Kernel method for persistence diagrams via kernel embedding and weight factor
Topological data analysis is an emerging mathematical concept for characterizing shapes in multiscale data. In this field, persistence diagrams are widely used as a descriptor of the input data, and can distinguish robust from noisy topological properties. A statistical framework on persistence diagrams is now highly desired for dealing with practical data. This paper proposes a kernel method on persistence diagrams. A theoretical contribution of our method is that the proposed kernel allows one to control the effect of persistence: if necessary, noisy topological properties can be discounted in data analysis. Furthermore, the method provides a fast approximation technique. The method is applied to several problems, including practical data in physics, and the results show its advantage over the existing kernel method on persistence diagrams.
06/12/2017 ∙ by Genki Kusano, et al.
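
A rough sketch of the idea, under our own naming: each diagram point (birth, death) is weighted by a function of its persistence (death minus birth), so short-lived, noise-like features contribute little, and two diagrams are compared through the inner product of their weighted kernel mean embeddings. The arctan weight follows the paper's construction; the constants are illustrative and the fast approximation is omitted.

```python
# A minimal sketch of a persistence-weighted Gaussian kernel between
# persistence diagrams, given as arrays of (birth, death) rows.
import numpy as np

def pwg_kernel(D1, D2, sigma=1.0, C=1.0, p=1.0):
    """Inner product of weighted kernel mean embeddings of two diagrams."""
    w1 = np.arctan(C * (D1[:, 1] - D1[:, 0]) ** p)  # weight per point in D1
    w2 = np.arctan(C * (D2[:, 1] - D2[:, 0]) ** p)
    sq = ((D1[:, None, :] - D2[None, :, :]) ** 2).sum(-1)
    return w1 @ np.exp(-sq / (2 * sigma**2)) @ w2

D1 = np.array([[0.0, 1.0], [0.2, 0.3]])   # one robust, one noisy feature
D2 = np.array([[0.1, 1.1]])
print(pwg_kernel(D1, D2))                  # dominated by the robust feature
```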

A Linear-Time Kernel Goodness-of-Fit Test
We propose a novel adaptive test of goodness-of-fit, with computational cost linear in the number of samples. We learn the test features that best indicate the differences between observed samples and a reference model, by minimizing the false negative rate. These features are constructed via Stein's method, meaning that it is not necessary to compute the normalising constant of the model. We analyse the asymptotic Bahadur efficiency of the new test, and prove that under a mean-shift alternative, our test always has greater relative efficiency than a previous linear-time kernel test, regardless of the choice of parameters for that test. In experiments, the performance of our method exceeds that of the earlier linear-time test, and matches or exceeds the power of a quadratic-time kernel test. In high dimensions, and where model structure may be exploited, our goodness-of-fit test performs far better than a quadratic-time two-sample test based on the Maximum Mean Discrepancy, with samples drawn from the model.
05/22/2017 ∙ by Wittawat Jitkrittum, et al.
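
The linear cost comes from evaluating Stein witness features at a small set of fixed test locations rather than over all sample pairs. Below is a minimal 1-D sketch of the resulting plug-in statistic, assuming a Gaussian kernel and a tractable model score; the feature optimization and the test threshold from the paper are omitted, and all names are ours.

```python
# A minimal sketch of a finite-set Stein discrepancy statistic: Stein
# features at J fixed locations, evaluated in O(n) time.
import numpy as np

def fssd_stat(x, score, locs, sigma=1.0):
    # xi[i, j] = s_p(x_i) k(x_i, v_j) + d/dx k(x_i, v_j): Stein feature
    d = x[:, None] - locs[None, :]
    k = np.exp(-d**2 / (2 * sigma**2))
    xi = k * (score(x)[:, None] - d / sigma**2)
    return (xi.mean(axis=0) ** 2).mean()   # average squared witness over locations

x = np.random.randn(2000)
locs = np.array([-1.0, 0.0, 1.0])          # fixed test locations
print(fssd_stat(x, score=lambda t: -t, locs=locs))  # near 0 under N(0,1)
```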

Influence Function and Robust Variant of Kernel Canonical Correlation Analysis
Many unsupervised kernel methods rely on the estimation of the kernel covariance operator (kernel CO) or kernel cross-covariance operator (kernel CCO). Both kernel CO and kernel CCO are sensitive to contaminated data, even when bounded positive definite kernels are used. To the best of our knowledge, there are few well-founded robust kernel methods for statistical unsupervised learning. In addition, while the influence function (IF) of an estimator can characterize its robustness, asymptotic properties, and standard error, the IF of standard kernel canonical correlation analysis (standard kernel CCA) has not been derived yet. To fill this gap, we first propose a robust kernel covariance operator (robust kernel CO) and a robust kernel cross-covariance operator (robust kernel CCO) based on a generalized loss function instead of the quadratic loss function. Second, we derive the IF for the robust kernel CCO and standard kernel CCA. Using the IF of standard kernel CCA, we can detect influential observations from two sets of data. Finally, we propose a method based on the robust kernel CO and the robust kernel CCO, called robust kernel CCA, which is less sensitive to noise than standard kernel CCA. The introduced principles can also be applied to many other kernel methods involving kernel CO or kernel CCO. Our experiments on synthetic data and an imaging genetics analysis demonstrate that the proposed IF of standard kernel CCA can identify outliers, and that the proposed robust kernel CCA method performs better for both ideal and contaminated data than standard kernel CCA.
05/09/2017 ∙ by Md. Ashad Alam, et al.
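
One way to picture the robust operators is through their simplest ingredient, a robust kernel mean: replacing the quadratic loss with a Huber-type loss turns the empirical mean in the RKHS into a weighted mean whose weights shrink for outlying points. The sketch below, with illustrative cutoff and iteration count, is our reading of that ingredient rather than the paper's full robust kernel CCA.

```python
# A minimal sketch of a robust kernel mean via iteratively re-weighted
# least squares with a Huber-type loss. Writing m = sum_i w_i Phi(x_i),
# all residual norms reduce to evaluations of the Gram matrix K.
import numpy as np

def robust_kernel_mean_weights(K, c=1.0, iters=20):
    n = K.shape[0]
    w = np.full(n, 1.0 / n)                   # start from the plain mean
    for _ in range(iters):
        # residual ||Phi(x_i) - m||_H^2 expanded via the kernel trick
        r2 = np.diag(K) - 2 * K @ w + w @ K @ w
        r = np.sqrt(np.maximum(r2, 1e-12))
        u = np.minimum(1.0, c / r)            # Huber weights: downweight outliers
        w = u / u.sum()
    return w                                  # m = sum_i w[i] * Phi(x_i)

# Usage: K is the Gram matrix of a bounded kernel; small w[i] flags outliers.
```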

Trimmed Density Ratio Estimation
Density ratio estimation is a vital tool in both the machine learning and statistics communities. However, due to the unbounded nature of the density ratio, the estimation procedure can be vulnerable to corrupted data points, which often push the estimated ratio toward infinity. In this paper, we present a robust estimator which automatically identifies and trims outliers. The proposed estimator has a convex formulation, and the global optimum can be obtained via subgradient descent. We analyze the parameter estimation error of this estimator in high-dimensional settings. Experiments are conducted to verify the effectiveness of the estimator.
03/09/2017 ∙ by Song Liu, et al.
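
As a point of reference, here is a minimal sketch of an untrimmed least-squares density-ratio estimator (in the style of uLSIF) with Gaussian basis functions; a single corrupted point in the numerator sample can inflate the fitted ratio, which is precisely what the paper's trimmed estimator guards against. The basis centers, bandwidth, and regularization are illustrative, and the trimming step itself (a convex program solved by subgradient descent in the paper) is omitted.

```python
# A minimal sketch of least-squares density-ratio estimation.
import numpy as np

def fit_ratio(x_nu, x_de, centers, sigma=1.0, lam=0.1):
    """Estimate r(x) = p_nu(x) / p_de(x) as a linear model over Gaussian bases."""
    phi = lambda x: np.exp(-(x[:, None] - centers[None, :])**2 / (2 * sigma**2))
    H = phi(x_de).T @ phi(x_de) / len(x_de)    # second moment under p_de
    h = phi(x_nu).mean(axis=0)                 # first moment under p_nu
    alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    return lambda x: phi(x) @ alpha

x_nu, x_de = np.random.randn(500) + 0.5, np.random.randn(500)
r = fit_ratio(x_nu, x_de, centers=np.linspace(-3, 3, 20))
print(r(np.array([0.0, 2.0])))                 # ratio grows to the right
```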

Learning Sparse Structural Changes in High-dimensional Markov Networks: A Review of Methodologies and Theories
Recent years have seen increasing interest in learning the sparse changes in Markov networks. Changes in the structure of Markov networks reflect alterations of interactions between random variables under different regimes and provide insights into the underlying system. While each individual network structure can be complicated and difficult to learn, the overall change from one network to another can be simple. This intuition gave birth to an approach that directly learns the sparse changes without modelling and learning the individual (possibly dense) networks. In this paper, we review such a direct learning method together with some of the latest developments along this line of research.
01/06/2017 ∙ by Song Liu, et al.
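
A minimal sketch of the direct approach, under illustrative choices: parameterize the ratio p(x)/q(x) ∝ exp(θᵀf(x)) over pairwise features f_ij(x) = x_i x_j and fit θ by L1-penalized KLIEP with proximal gradient steps, so that nonzero entries of θ flag edges whose interaction changed between the two regimes.

```python
# A minimal sketch of direct change estimation between two Markov networks.
import numpy as np

def pairwise_feats(X):
    i, j = np.triu_indices(X.shape[1])
    return X[:, i] * X[:, j]                  # f_ij(x) = x_i * x_j

def fit_change(Xp, Xq, lam=0.1, lr=0.1, iters=500):
    Fp, Fq = pairwise_feats(Xp), pairwise_feats(Xq)
    theta = np.zeros(Fp.shape[1])
    for _ in range(iters):
        w = np.exp(Fq @ theta); w /= w.sum()  # softmax weights under q
        grad = Fp.mean(axis=0) - w @ Fq       # gradient of the KLIEP objective
        theta += lr * grad                    # ascent step ...
        theta = np.sign(theta) * np.maximum(np.abs(theta) - lr * lam, 0.0)
    return theta                              # ... then L1 soft-threshold

# Usage: Xp, Xq are (n, d) sample matrices from the two regimes;
# nonzero entries of fit_change(Xp, Xq) mark changed interactions.
```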

Post Selection Inference with Kernels
We propose a novel kernel-based post selection inference (PSI) algorithm, which can handle not only nonlinearity in data but also structured outputs such as multi-dimensional and multi-label outputs. Specifically, we develop a PSI algorithm for independence measures, and propose the Hilbert-Schmidt Independence Criterion (HSIC) based PSI algorithm (hsicInf). The novelty of the proposed algorithm is that it can handle nonlinearity and/or structured data through kernels. Namely, the proposed algorithm can be used for a wider range of applications, including nonlinear multi-class classification and multivariate regression, which existing PSI algorithms cannot handle. Through synthetic experiments, we show that the proposed approach can find a set of statistically significant features for both regression and classification problems. Moreover, we apply the hsicInf algorithm to real-world data, and show that it can successfully identify important features.
10/12/2016 ∙ by Makoto Yamada, et al.
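
For context, here is a minimal sketch of the (biased) empirical HSIC that hsicInf builds on, with illustrative Gaussian kernels and bandwidths; the post-selection machinery (selecting features and correcting the null distribution for that selection) is what the paper adds and is not shown.

```python
# A minimal sketch of the biased empirical Hilbert-Schmidt Independence
# Criterion between two scalar samples x and y.
import numpy as np

def hsic(x, y, sx=1.0, sy=1.0):
    n = len(x)
    K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * sx**2))
    L = np.exp(-(y[:, None] - y[None, :])**2 / (2 * sy**2))
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    return np.trace(K @ H @ L @ H) / n**2      # near 0 iff x, y independent

x = np.random.randn(300)
print(hsic(x, x**2), hsic(x, np.random.randn(300)))  # dependent vs independent
```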

Kernel Mean Embedding of Distributions: A Review and Beyond
A Hilbert space embedding of a distribution (in short, a kernel mean embedding) has recently emerged as a powerful tool for machine learning and inference. The basic idea behind this framework is to map distributions into a reproducing kernel Hilbert space (RKHS) in which the whole arsenal of kernel methods can be extended to probability measures. It can be viewed as a generalization of the original "feature map" common to support vector machines (SVMs) and other kernel methods. While initially closely associated with the latter, it has meanwhile found application in fields ranging from kernel machines and probabilistic modeling to statistical inference, causal discovery, and deep learning. The goal of this survey is to give a comprehensive review of existing work and recent advances in this research area, and to discuss the most challenging issues and open problems that could lead to new research directions. The survey begins with a brief introduction to the RKHS and positive definite kernels, which form the backbone of this survey, followed by a thorough discussion of the Hilbert space embedding of marginal distributions, theoretical guarantees, and a review of its applications. The embedding of distributions enables us to apply RKHS methods to probability measures, which prompts a wide range of applications such as kernel two-sample testing, independence testing, and learning on distributional data. Next, we discuss the Hilbert space embedding for conditional distributions, give theoretical insights, and review some applications. The conditional mean embedding enables us to perform the sum, product, and Bayes' rules (which are ubiquitous in graphical models, probabilistic inference, and reinforcement learning) in a nonparametric way. We then discuss relationships between this framework and other related areas. Lastly, we give some suggestions on future research directions.
05/31/2016 ∙ by Krikamol Muandet, et al.
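
The two central objects of the framework are easy to sketch: the empirical kernel mean embedding (1/n) Σ_i k(x_i, ·), and the maximum mean discrepancy (MMD), the RKHS distance between two embeddings, which expands entirely into kernel evaluations. A minimal illustration, with the Gaussian kernel and bandwidth as assumed choices:

```python
# A minimal sketch of the biased MMD^2 estimate ||mu_P - mu_Q||_H^2
# between samples x ~ P and y ~ Q.
import numpy as np

def gauss_gram(a, b, sigma=1.0):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * sigma**2))

def mmd2(x, y, sigma=1.0):
    return (gauss_gram(x, x, sigma).mean()
            - 2 * gauss_gram(x, y, sigma).mean()
            + gauss_gram(y, y, sigma).mean())

x, y = np.random.randn(500), np.random.randn(500) + 1.0
print(mmd2(x, np.random.randn(500)), mmd2(x, y))  # ~0 vs clearly positive
```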

Convergence guarantees for kernel-based quadrature rules in misspecified settings
Kernel-based quadrature rules are becoming important in machine learning and statistics, as they achieve super-√n convergence rates in numerical integration, and thus provide alternatives to Monte Carlo integration in challenging settings where integrands are expensive to evaluate or high-dimensional. These rules are based on the assumption that the integrand has a certain degree of smoothness, expressed as the assumption that the integrand belongs to a certain reproducing kernel Hilbert space (RKHS). However, this assumption can be violated in practice (e.g., when the integrand is a black-box function), and no general theory has been established for the convergence of kernel quadratures in such misspecified settings. Our contribution is to prove that kernel quadratures can be consistent even when the integrand does not belong to the assumed RKHS, i.e., when the integrand is less smooth than assumed. Specifically, we derive convergence rates that depend on the (unknown) lesser smoothness of the integrand, where the degree of smoothness is expressed via powers of RKHSs or via Sobolev spaces.
05/24/2016 ∙ by Motonobu Kanagawa, et al.
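
In the well-specified setting the rule is simple to sketch: given nodes x_i, kernel k, and target measure P, the weights solve Kw = z, where z_i = E_P[k(x_i, X)] is the kernel mean embedding at each node, and the rule estimates ∫f dP as Σ_i w_i f(x_i). The 1-D example below uses P = N(0,1) and a Gaussian kernel, for which z has a closed form; the paper's analysis concerns what happens when the integrand is less smooth than this RKHS assumes. Nodes, bandwidth, and jitter are illustrative.

```python
# A minimal 1-D kernel quadrature rule for P = N(0,1) with a Gaussian kernel.
import numpy as np

ell = 1.0
nodes = np.linspace(-3, 3, 15)
K = np.exp(-(nodes[:, None] - nodes[None, :])**2 / (2 * ell**2))
# Closed-form mean embedding z_i = E_P[k(x_i, X)] for Gaussian kernel and P:
z = ell / np.sqrt(ell**2 + 1) * np.exp(-nodes**2 / (2 * (ell**2 + 1)))
w = np.linalg.solve(K + 1e-10 * np.eye(len(nodes)), z)  # quadrature weights

f = lambda x: np.cos(x)                 # E[cos(X)] = exp(-1/2) ≈ 0.6065
print(w @ f(nodes), np.exp(-0.5))       # quadrature estimate vs truth
```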

Robust Kernel (Cross) Covariance Operators in Reproducing Kernel Hilbert Space toward Kernel Methods
To the best of our knowledge, there are no general well-founded robust methods for statistical unsupervised learning. Most unsupervised methods explicitly or implicitly depend on the kernel covariance operator (kernel CO) or kernel cross-covariance operator (kernel CCO). These are sensitive to contaminated data, even when using bounded positive definite kernels. First, we propose a robust kernel covariance operator (robust kernel CO) and a robust kernel cross-covariance operator (robust kernel CCO) based on a generalized loss function instead of the quadratic loss function. Second, we derive the influence function of classical kernel canonical correlation analysis (classical kernel CCA). Third, using this influence function, we propose a visualization method to detect influential observations from two sets of data. Finally, we propose a method based on the robust kernel CO and the robust kernel CCO, called robust kernel CCA, which is designed for contaminated data and is less sensitive to noise than classical kernel CCA. The principles we describe also apply to many kernel methods that must deal with kernel CO or kernel CCO. Experiments on synthetic data and an imaging genetics analysis demonstrate that the proposed visualization and robust kernel CCA can be applied effectively to both ideal and contaminated data. The robust methods show superior performance over state-of-the-art methods.
02/17/2016 ∙ by Md. Ashad Alam, et al.