Arthur Gretton

is this you? claim profile


Professor at University College London

  • Learning deep kernels for exponential family densities

    The kernel exponential family is a rich class of distributions,which can be fit efficiently and with statistical guarantees by score matching. Being required to choose a priori a simple kernel such as the Gaussian, however, limits its practical applicability. We provide a scheme for learning a kernel parameterized by a deep network, which can find complex location-dependent local features of the data geometry. This gives a very rich class of density models, capable of fitting complex structures on moderate-dimensional problems. Compared to deep density models fit via maximum likelihood, our approach provides a complementary set of strengths and tradeoffs: in empirical studies, the former can yield higher likelihoods, whereas the latter gives better estimates of the gradient of the log density, the score, which describes the distribution's shape.

    11/20/2018 ∙ by Li Wenliang, et al. ∙ 46 share

    read it

  • Exponential Family Estimation via Adversarial Dynamics Embedding

    We present an efficient algorithm for maximum likelihood estimation (MLE) of the general exponential family, even in cases when the energy function is represented by a deep neural network. We consider the primal-dual view of the MLE for the kinectics augmented model, which naturally introduces an adversarial dual sampler. The sampler will be represented by a novel neural network architectures, dynamics embeddings, mimicking the dynamical-based samplers, e.g., Hamiltonian Monte-Carlo and its variants. The dynamics embedding parametrization inherits the flexibility from HMC, and provides tractable entropy estimation of the augmented model. Meanwhile, it couples the adversarial dual samplers with the primal model, reducing memory and sample complexity. We further show that several existing estimators, including contrastive divergence (Hinton, 2002), score matching (Hyvärinen, 2005), pseudo-likelihood (Besag, 1975), noise-contrastive estimation (Gutmann and Hyvärinen, 2010), non-local contrastive objectives (Vickrey et al., 2010), and minimum probability flow (Sohl-Dickstein et al., 2011), can be recast as the special cases of the proposed method with different prefixed dual samplers. Finally, we empirically demonstrate the superiority of the proposed estimator against existing state-of-the-art methods on synthetic and real-world benchmarks.

    04/27/2019 ∙ by Bo Dai, et al. ∙ 28 share

    read it

  • Kernel Instrumental Variable Regression

    Instrumental variable regression is a strategy for learning causal relationships in observational data. If measurements of input X and output Y are confounded, the causal relationship can nonetheless be identified if an instrumental variable Z is available that influences X directly, but is conditionally independent of Y given X. The classic two-stage least squares algorithm (2SLS) simplifies the estimation problem by modeling all relationships as linear functions. We propose kernel instrumental variable regression (KIV), a nonparametric generalization of 2SLS, modeling relations among X, Y, and Z as nonlinear functions in reproducing kernel Hilbert spaces (RKHSs). We prove the consistency of KIV under mild assumptions, and derive conditions under which the convergence rate achieves the minimax optimal rate for unconfounded, one-stage RKHS regression. In doing so, we obtain an efficient ratio between training sample sizes used in the algorithm's first and second stages. In experiments, KIV outperforms state of the art alternatives for nonparametric instrumental variable regression.

    06/01/2019 ∙ by Rahul Singh, et al. ∙ 14 share

    read it

  • On gradient regularizers for MMD GANs

    We propose a principled method for gradient-based regularization of the critic of GAN-like models trained by adversarially optimizing the kernel of a Maximum Mean Discrepancy (MMD). Our method is based on studying the behavior of the optimized MMD, and constrains the gradient based on analytical results rather than an optimization penalty. Experimental results show that the proposed regularization leads to stable training and outperforms state-of-the art methods on image generation, including on 160 × 160 CelebA and 64 × 64 ImageNet.

    05/29/2018 ∙ by Michael Arbel, et al. ∙ 6 share

    read it

  • A Kernel Stein Test for Comparing Latent Variable Models

    We propose a nonparametric, kernel-based test to assess the relative goodness of fit of latent variable models with intractable unnormalized densities. Our test generalises the kernel Stein discrepancy (KSD) tests of (Liu et al., 2016, Chwialkowski et al., 2016, Yang et al., 2018, Jitkrittum et al., 2018) which required exact access to unnormalized densities. Our new test relies on the simple idea of using an approximate observed-variable marginal in place of the exact, intractable one. As our main theoretical contribution, we prove that the new test, with a properly corrected threshold, has a well-controlled type-I error. In the case of models with low-dimensional latent structure and high-dimensional observations, our test significantly outperforms the relative maximum mean discrepancy test (Bounliphone et al., 2015) , which cannot exploit the latent structure.

    07/01/2019 ∙ by Heishiro Kanagawa, et al. ∙ 6 share

    read it

  • Counterfactual Distribution Regression for Structured Inference

    We consider problems in which a system receives external perturbations from time to time. For instance, the system can be a train network in which particular lines are repeatedly disrupted without warning, having an effect on passenger behavior. The goal is to predict changes in the behavior of the system at particular points of interest, such as passenger traffic around stations at the affected rails. We assume that the data available provides records of the system functioning at its "natural regime" (e.g., the train network without disruptions) and data on cases where perturbations took place. The inference problem is how information concerning perturbations, with particular covariates such as location and time, can be generalized to predict the effect of novel perturbations. We approach this problem from the point of view of a mapping from the counterfactual distribution of the system behavior without disruptions to the distribution of the disrupted system. A variant on distribution regression is developed for this setup.

    08/20/2019 ∙ by Nicolo Colombo, et al. ∙ 5 share

    read it

  • Kernel Exponential Family Estimation via Doubly Dual Embedding

    We investigate penalized maximum log-likelihood estimation for exponential family distributions whose natural parameter resides in a reproducing kernel Hilbert space. Key to our approach is a novel technique, doubly dual embedding, that avoids computation of the partition function. This technique also allows the development of a flexible sampling strategy that amortizes the cost of Monte-Carlo sampling in the inference stage. The resulting estimator can be easily generalized to kernel conditional exponential families. We furthermore establish a connection between infinite-dimensional exponential family estimation and MMD-GANs, revealing a new perspective for understanding GANs. Compared to current score matching based estimators, the proposed method improves both memory and time efficiency while enjoying stronger statistical properties, such as fully capturing smoothness in its statistical convergence rate while the score matching estimator appears to saturate. Finally, we show that the proposed estimator can empirically outperform state-of-the-art methods in both kernel exponential family estimation and its conditional extension.

    11/06/2018 ∙ by Bo Dai, et al. ∙ 4 share

    read it

  • Antithetic and Monte Carlo kernel estimators for partial rankings

    In the modern age, rankings data is ubiquitous and it is useful for a variety of applications such as recommender systems, multi-object tracking and preference learning. However, most rankings data encountered in the real world is incomplete, which forbids the direct application of existing modelling tools for complete rankings. Our contribution is a novel way to extend kernel methods for complete rankings to partial rankings, via consistent Monte Carlo estimators of Gram matrices. These Monte Carlo kernel estimators are based on extending kernel mean embeddings to the embedding of a set of full rankings consistent with an observed partial ranking. They form a computationally tractable alternative to previous approaches for partial rankings data. We also present a novel variance reduction scheme based on an antithetic variate construction between permutations to obtain an improved estimator. An overview of the existing kernels and metrics for permutations is also provided.

    07/01/2018 ∙ by María Lomelí, et al. ∙ 2 share

    read it

  • Kernel Conditional Exponential Family

    A nonparametric family of conditional distributions is introduced, which generalizes conditional exponential families using functional parameters in a suitable RKHS. An algorithm is provided for learning the generalized natural parameter, and consistency of the estimator is established in the well specified case. In experiments, the new method generally outperforms a competing approach with consistency guarantees, and is competitive with a deep conditional density model on datasets that exhibit abrupt transitions and heteroscedasticity.

    11/15/2017 ∙ by Michael Arbel, et al. ∙ 0 share

    read it

  • Efficient and principled score estimation with Nyström kernel exponential families

    We propose a fast method with statistical guarantees for learning an exponential family density model where the natural parameter is in a reproducing kernel Hilbert space, and may be infinite-dimensional. The model is learned by fitting the derivative of the log density, the score, thus avoiding the need to compute a normalization constant. Our approach improves the computational efficiency of an earlier solution by using a low-rank, Nyström-like solution. The new solution retains the consistency and convergence rates of the full-rank solution (exactly in Fisher distance, and nearly in other distances), with guarantees on the degree of cost and storage reduction. We evaluate the method in experiments on density estimation and in the construction of an adaptive Hamiltonian Monte Carlo sampler. Compared to an existing score learning approach using a denoising autoencoder, our estimator is empirically more data-efficient when estimating the score, runs faster, and has fewer parameters (which can be tuned in a principled and interpretable way), in addition to providing statistical guarantees.

    05/23/2017 ∙ by Dougal J. Sutherland, et al. ∙ 0 share

    read it

  • A Linear-Time Kernel Goodness-of-Fit Test

    We propose a novel adaptive test of goodness-of-fit, with computational cost linear in the number of samples. We learn the test features that best indicate the differences between observed samples and a reference model, by minimizing the false negative rate. These features are constructed via Stein's method, meaning that it is not necessary to compute the normalising constant of the model. We analyse the asymptotic Bahadur efficiency of the new test, and prove that under a mean-shift alternative, our test always has greater relative efficiency than a previous linear-time kernel test, regardless of the choice of parameters for that test. In experiments, the performance of our method exceeds that of the earlier linear-time test, and matches or exceeds the power of a quadratic-time kernel test. In high dimensions and where model structure may be exploited, our goodness of fit test performs far better than a quadratic-time two-sample test based on the Maximum Mean Discrepancy, with samples drawn from the model.

    05/22/2017 ∙ by Wittawat Jitkrittum, et al. ∙ 0 share

    read it