Consider an open subset and denote by the set of Borel probability measures on this domain. We consider the problem of learning a probability measure from independently and identically distributed (IID) realisations . We will focus on parametric inference with a parametrised family , for an open set , i.e. we seek such that is closest to in an appropriate sense. If , we are in the M-closed setting; otherwise we are in the M-open setting. When has a density with respect to the Lebesgue measure, a standard approach is to use the maximum likelihood estimator (MLE):
For complex models, a density may not be easily computable, or may not even exist, and so the MLE need not be available. In some cases it is possible to approximate the likelihood; see for example pseudo-likelihood (Besag, 1974), profile likelihood (Murphy and van der Vaart, 2000) and composite likelihood (Varin et al., 2011) estimation. It is sometimes also possible to access likelihoods in un-normalised form, i.e. where the constant
is unknown. This class of models is known as un-normalised models, or doubly-intractable models in the Bayesian literature, and a range of exact and approximate methods have been developed for this case; see for example the Markov chain Monte Carlo (MCMC) algorithms of Murray et al. (2006); Møller et al. (2006) or the score-based and ratio-based approaches of Hyvärinen (2006, 2007); Gutmann and Hyvärinen (2012).
However, for many models of interest in modern statistical inference, none of the methods above can be applied straightforwardly and efficiently due to the complexity of the likelihoods involved. This is most notably the case for intractable generative models, sometimes also called implicit models or likelihood-free models; see Mohamed and Lakshminarayanan (2016) for a recent overview. Intractable generative models are parametric families of probability measures for which it is possible to obtain realisations for any value of the parameter , but for which we do not necessarily have access to a likelihood or approximation thereof. These models are used throughout the sciences, including in the fields of ecology (Wood, 2010), population genetics (Beaumont et al., 2002) and astronomy (Cameron and Pettitt, 2012). They also appear in machine learning as black-box models; see for example generative adversarial networks (GANs) (Goodfellow et al., 2014) and variational auto-encoders (Kingma and Welling, 2014).
Given a Borel probability space , we will call a generative model any probability measure which is the pushforward of the probability measure with respect to a measurable parametric map called the generator. To generate independent realisations from the model, we produce IID realisations and apply the generator to each of these samples: for . While it is straightforward to generate samples from these models, a likelihood function need not be available, given that an associated positive density may not be computable or even exist. We therefore require alternatives to the MLE.
The estimators studied in this paper fall within the class of minimum divergence/distance estimators (Pardo, 2005; Basu et al., 2011). These are estimators minimising some notion of divergence (or an approximation thereof) between an empirical measure (where denotes a Dirac measure at ), obtained from the data , and the parametric model:
If were absolutely continuous with respect to , maximising the likelihood would correspond to minimising the Kullback-Leibler (KL) divergence which, given , is defined as , where is the Radon-Nikodym derivative of with respect to . This approach to inference is useful for models with complicated or intractable likelihoods, since the choice of divergence can be adapted to the class of models of interest.
In previous works, minimum distance estimators for generative models have been considered based on the Wasserstein distance and its Sinkhorn relaxation; see Bassetti et al. (2006); Frogner et al. (2015); Montavon et al. (2016); Genevay et al. (2018); Frogner and Poggio (2018); Sanjabi et al. (2018). These have the advantage that they can leverage extensive work in the field of optimal transport. In a Bayesian context, similar ideas are used in approximate Bayesian computation (ABC) methods (Marin et al., 2012; Lintusaari et al., 2017), where synthetic data sets are simulated from the model and then compared to the true data using some notion of distance. There, significant work has been put into automating the choice of distance (Fearnhead and Prangle, 2011), and the use of the Wasserstein distance has also recently been studied (Bernton et al., 2019).
In this paper, we shall investigate the properties of minimum divergence estimators based on an approximation of the maximum mean discrepancy (MMD). Such estimators have already been used extensively in the machine learning literature with generators taking the form of neural networks (Dziugaite et al., 2015; Li et al., 2015, 2017; Sutherland et al., 2017; Arbel et al., 2018; Bińkowski et al., 2018; Romano et al., 2018; dos Santos et al., 2019), where they are usually called MMD GANs, but they can be used more generally. Our main objective in this paper is to present a general framework for minimum MMD estimators, to study their theoretical properties and to provide an initial discussion of the impact of the choice of kernel. This study brings insights into the favourable empirical results of previous work on MMD for neural networks, and demonstrates more broadly the usefulness of this approach for inference within the large class of intractable generative models of interest in the statistics literature. As will be discussed, this approach is significantly preferable to alternatives based on the Wasserstein distance for models with expensive generators, as it comes with significantly stronger generalisation bounds and is more robust in several scenarios. Our detailed contributions can be summarised as follows:
In Section 2, we introduce the MMD metric, minimum MMD estimators, and the statistical Riemannian geometry the metric induces on the parameter space . Through this, we rephrase the minimum divergence estimation problem in terms of a gradient flow, thus obtaining a stochastic natural gradient descent method for finding the estimator which can significantly reduce computational cost compared to stochastic gradient descent.
In Section 3, we focus on the theoretical properties of minimum MMD estimators and associated approximations. We use the information geometry of MMD to demonstrate generalisation bounds and statistical consistency, then prove that the estimator is asymptotically normal in the M-closed setting. These results identify the assumptions on the generator required for the use of these estimators. We then analyse the robustness properties of the estimator in the M-open setting, establishing conditions for qualitative and quantitative robustness.
In Section 4, we study the efficiency and robustness of minimum MMD estimators based on Gaussian kernels for classes of isotropic Gaussian location and scale models. We demonstrate the effect of the kernel lengthscale on the efficiency of the estimators, and exhibit a trade-off between (asymptotic) efficiency and robustness. For high-dimensional problems, we demonstrate that choosing the lengthscale according to the median heuristic provides an asymptotic variance independent of dimensionality. We also extend our analysis to mixtures of kernels, providing insights on settings often considered in machine learning applications.
In Section 5, we perform numerical simulations to support the theory detailed in the previous sections, demonstrating the behaviour of minimum MMD estimators on a number of examples, including parameter estimation for the g-and-k distribution, a stochastic volatility model and two systems of stochastic differential equations.
2 The Maximum Mean Discrepancy Statistical Manifold
We begin by formalising the notion of MMD and introducing the corresponding minimum MMD estimators. We then use tools from information geometry to analyse these estimators, which leads to a stochastic natural gradient descent algorithm for efficient implementation.
2.1 Maximum Mean Discrepancy
Let be a Borel measurable kernel on , and consider the reproducing kernel Hilbert space associated with (see Berlinet and Thomas-Agnan (2004)), equipped with inner product and norm . Let be the set of Borel probability measures such that . The kernel mean embedding , interpreted as a Bochner integral, defines a continuous embedding from into . The mean embedding pulls back the metric on generated by the inner product to define a pseudo-metric on called the maximum mean discrepancy , i.e., . The squared MMD has a particularly simple expression that can be derived through an application of the reproducing property ():
thus providing a closed-form expression up to the calculation of expectations. The MMD is in fact an integral probability pseudo-metric (Müller, 1997; Sriperumbudur et al., 2012; Sriperumbudur, 2016) since it can be expressed as:
Integral probability metrics are prominent in the information-based complexity literature where they correspond to the worst-case integration error (Dick et al., 2013; Briol et al., 2019). If is injective then the kernel is said to be characteristic (Sriperumbudur et al., 2010). In this case MMD becomes a metric on (and hence a statistical divergence). A sufficient condition for to be characteristic is that is integrally strictly positive definite, i.e. implies that for all . On , Sriperumbudur et al. (2010) showed that the Gaussian and inverse multiquadric kernels are both integrally strictly positive definite. We shall assume this condition holds throughout the paper, unless explicitly stated otherwise.
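The closed-form expression for the squared MMD above lends itself to a simple empirical estimate. The following is a minimal sketch of the unbiased U-statistic estimate of the squared MMD between two samples, using a Gaussian kernel; the function names and the unit-lengthscale default are our own illustrative choices, not notation from this paper.

```python
import numpy as np

def gaussian_kernel(x, y, lengthscale=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * lengthscale^2)) for batches x (n, d), y (m, d).
    diff = x[:, None, :] - y[None, :, :]
    sq_dists = np.sum(diff ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * lengthscale ** 2))

def mmd_squared(x, y, lengthscale=1.0):
    """Unbiased U-statistic estimate of the squared MMD between samples x and y."""
    n, m = x.shape[0], y.shape[0]
    kxx = gaussian_kernel(x, x, lengthscale)
    kyy = gaussian_kernel(y, y, lengthscale)
    kxy = gaussian_kernel(x, y, lengthscale)
    # Remove diagonal terms so that the within-sample averages are unbiased.
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * kxy.mean()
```

Note that the estimate can be slightly negative when the two samples come from the same distribution, precisely because the diagonal terms are removed.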
2.2 Minimum MMD estimators
This paper proposes to use MMD in a minimum divergence estimator framework for inference in intractable generative models. Given an unknown data generating distribution and a parametrised family of model distributions , we consider a minimum MMD estimator:
where , and . In the following we will use to denote both the random measure and the measure , and we shall assume that . Several existing methodologies fall within this general framework, including kernel scoring rules (Eaton, 1982) and MMD GANs (Dziugaite et al., 2015; Li et al., 2015). For analogous methodology in a Bayesian context, see kernel ABC (Fukumizu et al., 2013; Park et al., 2015).
In general, the optimisation problem will not be convex and the minimiser will not be computable analytically. If the generator is differentiable with respect to with a computable Jacobian matrix, the minimiser will be a fixed point of the equation where . Assuming that the Jacobian is -integrable, the gradient term can be written as
where corresponds to the partial derivative with respect to the first argument. In practice it will not be possible to compute the integral terms analytically. We can introduce a U-statistic approximation for the gradient as follows:
. This is an unbiased estimator in the sense that , where the expectation is taken over the independent realisations of the . This allows us to use stochastic gradient descent (SGD) (Dziugaite et al., 2015; Li et al., 2015): starting from , we iterate:
Sample and set for .
where is a step size sequence chosen to guarantee convergence to the minimiser in Equation 2 (see Robbins and Monro, 1985). For large values of , the SGD iterates should approach , but this will come at significant computational cost. Let and . The overall cost of the gradient descent algorithm is per iteration. This cost is linear in the number of data points , but quadratic in the number of simulated samples . It could be made linear in by considering approximations of the maximum mean discrepancy, as found in Chwialkowski et al. (2015). In large data settings (i.e. large), subsampling elements uniformly at random from may lead to significant speed-ups.
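To make the stochastic gradient descent above concrete, here is a small self-contained sketch for a one-dimensional Gaussian location model with a Gaussian kernel. The generator is g(theta, u) = theta + u, whose Jacobian is 1, so the U-statistic gradient can be written out explicitly. All names, the lengthscale, the step size and the sample sizes are illustrative choices of ours, not prescriptions from this paper.

```python
import numpy as np

def grad1_k(a, b, l=1.0):
    # Derivative of the Gaussian kernel k(a, b) = exp(-(a - b)^2 / (2 l^2))
    # with respect to its first argument.
    return -((a - b) / l**2) * np.exp(-(a - b)**2 / (2 * l**2))

def mmd_grad_location(theta, y, u, l=1.0):
    # U-statistic approximation of the gradient of MMD^2 for the
    # location model z_j = theta + u_j (the generator's Jacobian is 1).
    z = theta + u
    m = len(z)
    dzz = grad1_k(z[:, None], z[None, :], l)
    # Diagonal terms vanish since grad1_k(a, a) = 0, so summing all pairs
    # and dividing by m (m - 1) matches the off-diagonal U-statistic mean.
    term_model = 2.0 * dzz.sum() / (m * (m - 1))
    term_data = 2.0 * grad1_k(z[:, None], y[None, :], l).mean()
    return term_model - term_data

rng = np.random.default_rng(1)
y_obs = rng.normal(2.0, 1.0, size=200)   # observed data, true location 2
theta = 0.0
for _ in range(500):
    u = rng.normal(0.0, 1.0, size=100)   # fresh noise realisations at every step
    theta -= 0.5 * mmd_grad_location(theta, y_obs, u)
```

After the loop, theta should sit close to the true location 2, up to the sampling noise of the data and of the stochastic gradients.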
Clearly, when the generator and its gradient are computationally intensive, letting grow will become effectively intractable, and it is reasonable to assume that the number of simulations is commensurate with, or even smaller than, the sample size. To study the behaviour of minimum MMD estimators when synthetic data is prohibitively expensive, we consider the following minimum divergence estimator based on a U-statistic approximation of the MMD:
where for some . This estimator is closely related to the method of simulated moments (Hall, 2005) and satisfies , thus providing an unbiased estimator of the squared distance between and . While the estimator is not used in practice (since we re-sample from the generator at each gradient iteration), it is an idealisation which gives us insight into situations where the gradient descent cannot be iterated for a large number of steps relative to the observed data-set size, and so we cannot appeal to the law of large numbers.
2.3 The Information Geometry induced by MMD
The two estimators and
defined above are flexible in the sense that the choice of kernel and kernel hyperparameters will have a significant influence on the geometry induced on the space of probability measures. This section studies this geometry and develops tools which will later give us insights into the impact of the choice of kernel on the generalisation, asymptotic convergence and robustness of the corresponding estimators.
Let be a family of measures contained in and parametrised by an open subset . Assuming that the map is injective, the MMD distance between the elements and in induces a distance between and in . Under appropriate conditions this gives rise to a Riemannian manifold structure on . The study of the geometry of such statistical manifolds lies at the centre of information geometry (Amari, 1987; Barndorff-Nielsen, 1978)
. Traditional information geometry focuses on the statistical manifold induced by the Kullback-Leibler divergence over a parametrised set of probability measures. This yields a Riemannian structure on the parameter space with the metric tensor given by the Fisher-Rao metric. A classic result due to Cencov (2000) characterises this metric as the unique metric invariant under a large class of transformations (i.e. embeddings via Markov morphisms; see Campbell, 1986; Montúfar et al., 2014).
In this section, we study instead the geometry induced by MMD. To fix ideas, we shall consider a generative model distribution of the form for , where is an underlying Borel measure space. We assume that (i) is -measurable for all ; (ii) for all ; (iii) , for all . Suppose additionally that the kernel has bounded continuous derivatives over . Define the map to be the Bochner integral . By (Hájek and Johanis, 2014, Theorem 90), assumptions (i)-(iii) imply that the map is Fréchet differentiable and
The map induces a degenerate Riemannian metric on given by the pull-back of the inner product on . In particular, its components in the local coordinate system are for . By (Steinwart and Christmann, 2008, Lemma 4.34), it follows that for ,
where . The induced metric tensor is in fact just the information metric associated to the MMD-squared divergence (see Appendix A.1). Further details about the geodesics induced by MMD can be found in Appendix A.2. This information metric will allow us to construct efficient optimisation algorithms and to study the statistical properties of the minimum MMD estimators.
2.4 MMD Gradient Flow
Given the loss function, a standard approach to finding a minimum divergence estimator is via gradient descent (or in our case stochastic gradient descent). Gradient descent methods aim to minimise a function by following a curve , known as the gradient flow, that is everywhere tangent to the direction of steepest descent of . This direction depends on the choice of Riemannian metric on , and is given by where denotes the Riemannian gradient (or covariant derivative) of .
A particular instance of gradient descent, based on the Fisher information metric, was developed by Amari and collaborators (Amari, 1998). It is a widely used alternative to standard gradient descent methods and is referred to as natural gradient descent. It has been successfully applied to a variety of problems in machine learning and statistics, for example reinforcement learning (Kakade, 2002), neural network training (Park et al., 2000), Bayesian variational inference (Hoffman et al., 2013) and Markov chain Monte Carlo (Girolami and Calderhead, 2011). While the classical natural gradient approach is based on the Fisher information matrix induced by the KL divergence, information geometries arising from other metrics on probability measures have also been studied in previous works, including those arising from optimal transport metrics (Chen and Li, 2018; Li and Montufar, 2018) and the Fisher divergence (Karakida et al., 2016).
As discussed above, a gradient descent method can be formulated as an ordinary differential equation for the gradient flow, which solves for some specified initial condition. In local coordinates the Riemannian gradient can be expressed in terms of the standard gradient , formally , so we have . This flow can be approximated by various discretisations. An explicit Euler discretisation yields the scheme: . Under appropriate conditions on the step-size sequence, this gradient descent scheme will converge to a local minimiser of . Provided that and the metric tensor are readily computable, the Euler discretisation yields a gradient scheme analogous to those detailed in Amari (1987, 1998).
For the MMD case, we cannot evaluate from Equation (3) exactly since it contains intractable integrals against . We can however use a similar approach to that used for the stochastic gradient algorithm and introduce a U-statistic approximation of the intractable integrals:
where are IID realisations from . We propose to perform optimisation using the following natural stochastic gradient descent algorithm: starting from , we iterate
Sample and set for .
The experiments in Section 5 demonstrate that this new algorithm can provide significant computational gains. This could be particularly impactful for GANs, where a large number of stochastic gradient descent iterations is currently required. The approximation of the inverse metric tensor does however incur an additional computational cost due to the inversion of a dense matrix: per iteration. When the dimension of the parameter space is high, the calculation of the inverse metric at every step can hence be prohibitive. Online methods could be used to approximate sequentially without computing inverses of dense matrices, as in (Ollivier, 2018); alternatively, approximate linear solvers could be used to reduce this cost.
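As an illustration of the natural gradient update, the following sketch estimates both the U-statistic gradient and the metric tensor for a two-dimensional Gaussian location model, for which the generator's Jacobian is the identity so the metric reduces to an expectation of the mixed second derivative of the kernel, and preconditions each step by the inverse metric. The model, kernel lengthscale, step size, jitter and sample sizes are our own illustrative assumptions, not values from this paper.

```python
import numpy as np

L = 1.0  # kernel lengthscale

def gram(x, y):
    # Gaussian kernel Gram matrix between batches x (n, d) and y (m, d).
    d2 = np.sum((x[:, None, :] - y[None, :, :])**2, axis=-1)
    return np.exp(-d2 / (2 * L**2))

def grad1(x, y):
    # Gradient of the kernel with respect to its first argument: shape (n, m, d).
    diff = x[:, None, :] - y[None, :, :]
    return -(diff / L**2) * gram(x, y)[:, :, None]

def metric_and_grad(theta, y, u):
    z = theta + u                       # location model: the Jacobian is the identity
    m, d = z.shape
    # U-statistic gradient of MMD^2 (the diagonal of the model-model term vanishes).
    grad = (2 * grad1(z, z).sum(axis=(0, 1)) / (m * (m - 1))
            - 2 * grad1(z, y).mean(axis=(0, 1)))
    # Metric tensor: off-diagonal mean of d^2 k / dx dy over model pairs.
    diff = z[:, None, :] - z[None, :, :]
    hess = (np.eye(d)[None, None] / L**2
            - diff[..., :, None] * diff[..., None, :] / L**4) * gram(z, z)[..., None, None]
    off = ~np.eye(m, dtype=bool)
    return hess[off].mean(axis=0), grad

rng = np.random.default_rng(2)
y_obs = rng.normal([1.0, -1.0], 1.0, size=(200, 2))
theta = np.zeros(2)
for _ in range(200):
    G, g = metric_and_grad(theta, y_obs, rng.normal(0.0, 1.0, size=(200, 2)))
    # Small jitter stabilises the solve against Monte Carlo noise in G.
    theta -= 0.1 * np.linalg.solve(G + 1e-3 * np.eye(2), g)
```

Here the preconditioning only rescales a two-dimensional step, but the same structure applies for a general parameter dimension, at the cubic cost of the dense solve discussed above.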
In certain cases, the gradient of the generator may not be available in closed form, precluding exact gradient descent. An alternative is the method of finite-difference stochastic approximation (Kushner and Yin, 2003), which can be used to approximate the descent direction. Alternatively, one can consider other discretisations of the gradient flow. For example, a fully implicit discretisation yields the following scheme (Jordan et al., 1998):
where is a step-size. The natural gradient flow can therefore be viewed as a motion towards a lower value of , but constrained to remain close (in MMD) to the previous time-step; the constant controls the strength of this constraint. This formulation allows a natural gradient descent approach to be adopted even if and are not readily computable. Indeed, (4) could potentially be minimised using a gradient-free optimisation method such as Nelder-Mead.
2.5 Minimum MMD Estimators and Kernel Scoring Rules
Before concluding this background section, we highlight the connection between our minimum MMD estimators and scoring rules (Dawid, 2007). A scoring rule is a function such that quantifies the accuracy of a model upon observing the realisation (see Gneiting and Raftery (2007) for technical conditions). We say a scoring rule is strictly proper if is uniquely minimised when . Any strictly proper scoring rule induces a divergence of the form . These divergences can then be used to obtain minimum distance estimators: . One way to solve this problem is by setting the gradient in to zero, i.e. solving the so-called estimating equations .
The minimum MMD estimators in this paper originate from the well-known kernel scoring rule (Eaton, 1982; Dawid, 2007; Zawadzki and Lahaie, 2015; Steinwart and Ziegel, 2017; Masnadi-Shirazi, 2017), which takes the form
This connection between scoring rules and minimum MMD estimators will be useful for theoretical results in the following section. Whilst the present paper focuses on minimum MMD estimators for generative models, our results also have implications for kernel scoring rules.
3 Behaviour of Minimum MMD estimators
The two estimators and defined above are flexible in the sense that the choice of kernel and kernel hyperparameters will have a significant influence on the geometry induced on the space of probability measures. This choice will also have an impact on the generalisation, asymptotic convergence and robustness of the estimators, as will be discussed in this section.
3.1 Concentration and Generalisation Bounds for MMD
In this section we will restrict ourselves to the case where and for . Given observations , it is clear that the convergence and efficiency of and in the limit of large and will depend on the choice of kernel as well as the dimensions and . As a first step, we obtain estimates for the out-of-sample error of each estimator, in the form of generalisation bounds.
The conditions in the following assumption are quite natural. They are required to ensure the existence of and , and preclude models which are unidentifiable over a non-compact subset of parameters, i.e. models for which there are unbounded minimising sequences of . While these assumptions must be verified on a case-by-case basis, for most models we would expect these conditions to hold immediately.
For every , there exists such that the set is bounded.
For every and , there exists such that the set is almost surely bounded.
Theorem 1 (Generalisation Bounds).
Suppose that the kernel is bounded and that Assumption 1 holds. Then, with probability at least ,
All proofs are deferred to Appendix B. An immediate corollary of the above result is that the generalisation error decreases as and , with the rates being independent of the dimensions and , and of the properties of the kernel. Indeed, if the kernel is translation invariant, then will reduce to the maximum value of the kernel. A similar generalisation result was obtained in Dziugaite et al. (2015) for minimum MMD estimation of deep neural network models. While the bounds are of the same form, Theorem 1 only requires minimal assumptions on the smoothness of the kernel. Moreover, all the constants in the bound are explicit, clearly demonstrating the dimensional dependence. Assumption 1 is required to guarantee the existence of at least one minimiser, whereas this is implicitly assumed in Dziugaite et al. (2015). The key result determining the rate is the following concentration inequality.
Lemma 1 (Concentration Bound).
Assume that the kernel is bounded and let be a probability measure on . Let be the empirical measure obtained from independently and identically distributed samples of . Then with probability , we have that
See also (Gretton et al., 2009, Theorem 17) for an equivalent bound. We can compare this result with (Fournier and Guillin, 2015, Theorem 1) on the rate of convergence of the Wasserstein-1 distance (denoted ) to the empirical measure, which implies that for and sufficiently large, with probability we have , where and is a constant depending only on the constants and . This suggests that generalisation bounds analogous to Theorem 1 for the Wasserstein distance would depend exponentially on dimension, at least when the distribution is absolutely continuous with respect to the Lebesgue measure. For measures supported on a lower-dimensional manifold, this bound has recently been tightened; see Weed and Bach (2017) and also Weed and Berthet (2019). For Sinkhorn divergences, which interpolate between optimal transport and MMD distances (Genevay et al., 2018), this curse of dimensionality can be mitigated for measures on bounded domains (Genevay et al., 2019).
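The dimension-free rate of Lemma 1 is easy to check numerically whenever the kernel mean embedding of the target is available in closed form. The sketch below computes the MMD between the empirical measure of a sample and the target P = N(0, 1) exactly, for a Gaussian kernel with unit lengthscale, using the closed forms E_y k(x, y) = exp(-x^2/4)/sqrt(2) and E k(y, y') = 1/sqrt(3), and checks that it shrinks at roughly the square-root rate; the setup is our own illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(3)

def mmd_to_target(x):
    # Exact MMD between the empirical measure of x and P = N(0, 1),
    # for the Gaussian kernel with unit lengthscale (closed-form embeddings).
    kxx = np.exp(-(x[:, None] - x[None, :])**2 / 2).mean()
    kxp = (np.exp(-x**2 / 4) / np.sqrt(2)).mean()
    kpp = 1.0 / np.sqrt(3)
    return np.sqrt(max(kxx - 2 * kxp + kpp, 0.0))

sizes = [50, 200, 800, 3200]
avg = [np.mean([mmd_to_target(rng.normal(0.0, 1.0, n)) for _ in range(20)])
       for n in sizes]
```

Each fourfold increase in the sample size roughly halves the averaged MMD, consistent with the square-root rate in the concentration bound.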
3.2 Consistency and Asymptotic Normality
With additional assumptions, we can recover a classical strong consistency result.
Proposition 1 (Consistency).
Suppose that Assumption 1 holds and that there exists a unique minimiser such that . Then and as , almost surely.
Theorem 1 provides fairly weak probabilistic bounds on the convergence of the estimators and in terms of their MMD distance to the data distribution . Proposition 1 provides conditions under which these bounds translate into convergence of the estimators; however, it is not clear how to extract quantitative information about the speed of convergence, or about the efficiency of the estimators in general. A classical approach is to establish the asymptotic normality of the estimators and to characterise their efficiency in terms of the asymptotic variance. We do this now, assuming that we are working in the M-closed setting, i.e. assuming that for some .
Theorem 2 (Central Limit Theorems).
Suppose that for some and that the conclusions of Proposition 1 hold. Suppose that:
There exists an open neighbourhood of such that is three times differentiable in with respect to .
The information metric is positive definite at .
There exists a compact neighbourhood of such that for where denotes the mixed derivatives of order and denotes the spectral norm.
The kernel is translation invariant, with bounded mixed derivatives up to order .
Then as :
where denotes convergence in distribution. The covariance matrix is given by the Godambe matrix where
Here, denotes the tensor product and . Furthermore, suppose that:
The kernel has bounded mixed derivatives up to order .
The indices satisfy , where ,
Then, as ,
We remark that the asymptotic covariance is minimised when , that is, when the number of samples generated from the model equals the number of data points (at which point ). This means that it will be computationally inefficient to use much larger than . We note that the variance also does not depend on any amplitude parameter of the kernel, or on any location parameters in . To the best of our knowledge, there are no known analogous results for minimum Wasserstein or Sinkhorn estimators (except a one-dimensional result for the minimum Wasserstein estimator in the supplementary material of Bernton et al. (2019)).
Theorem 2 raises the question of the efficiency of the estimator. The Cramér-Rao bound provides a lower bound on the variance of any unbiased estimator of , and it is well known that it is attained by maximum likelihood estimators. The following result is an adaptation of the Cramér-Rao bound of Godambe (1960) for our estimators, which are biased.
Theorem 3 (Cramér-Rao Bounds).
Suppose that the CLTs in Theorem 2 hold and that the data distribution satisfies where is assumed to have density . Furthermore, suppose that the MMD information metric and the Fisher information metric are positive definite when . Then the asymptotic covariances and of the estimators and satisfy the Cramér-Rao bound, i.e. and are non-negative definite.
The results above demonstrate that we cannot expect our (biased) estimators to outperform maximum likelihood in the M-closed case. The efficiency of these estimators is strongly determined by the choice of kernel, and in particular by the kernel bandwidth . The following result characterises the efficiency as .
Proposition 2 (Efficiency with Large Lengthscales).
Suppose that is a radial basis kernel, i.e. , where and . Let and denote the asymptotic variance as a function of the bandwidth of and respectively. Then
where and are the mean and covariance of respectively and denotes the Moore-Penrose inverse of . As a result, .
In general, the minimum MMD estimators may not achieve the efficiency of maximum likelihood estimators in the limit ; however, in one dimension, the limiting covariance in Equation 5 is a well-known approximation of the inverse Fisher information (Jarrett, 1984; Stein and Nossek, 2017), which is optimal.
Before concluding this section on the efficiency of minimum MMD estimators, we note that the asymptotic covariances and of Theorem 2 could be used to construct confidence intervals for the value of (only in the M-closed case). Although these covariances cannot be computed exactly, since they depend on and contain intractable integrals, they can be approximated using the generator at the current estimate of the parameters.
This concludes our theoretical study of the M-closed case and we now move on to the M-open case. A concept of importance to practical inference is robustness to corrupted data (Huber and Ronchetti, 2009). As will be seen below, minimum MMD estimators have very favourable robustness properties in this case.
Our first objective is to demonstrate qualitative robustness in the sense of Hampel (1971). More specifically, given some parametrised probability measure , we show that if two measures and are close in the Prokhorov metric, then the distributions of the minimum distance estimators and for are correspondingly close.
Theorem 4 (Qualitative Robustness).
Suppose that (i) there exists a unique such that and (ii) , such that implies that . Then is qualitatively robust in the sense of Hampel (1971).
Additionally, suppose that for any empirical measure on points, that (i’) there exists a unique such that and (ii’) , such that implies that . Then such that is qualitatively robust for .
The result above characterises the qualitative robustness of the estimators, but does not provide a measure of the degree of robustness, which could be used to study the effect of corrupted data on the estimated parameters. An important quantity used to quantify robustness is the influence function , where measures the impact of an infinitesimal contamination of the data-generating model in the direction of a Dirac measure located at some point . The influence function of a minimum distance estimator based on a scoring rule is given by (Dawid and Musio, 2014): . The supremum of the influence function over is called the gross-error sensitivity, and if it is finite, we say that the estimator is bias-robust (Hampel, 1971). We can use the connection with kernel scoring rules to study the bias robustness of our estimators.
Theorem 5 (Bias Robustness).
The influence function corresponding to the maximum mean discrepancy is given by . Furthermore, if is bounded and , then the MMD estimators are bias-robust.
Note that the conditions for this theorem to hold are less stringent than the assumptions required for the CLT in Theorem 2. As we shall see in the next section, there will be a trade-off between efficiency and robustness as the kernel bandwidth is varied. We shall demonstrate this through the influence function.
Overall, these results demonstrating the qualitative and bias robustness of minimum MMD estimators provide another strong motivation for their use. For complex generative models, it is common to be in the M-open setting; see for example the MMD GAN applications in machine learning, where neural networks are used as models of images. Although neural networks cannot realistically be expected to be well-specified models in this setting, our robustness results can help explain the favourable experimental results observed in that case. Note that, to the best of our knowledge, the robustness of Wasserstein and Sinkhorn estimators has not been studied.
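The bias robustness above can be illustrated with a toy contamination experiment. For a Gaussian location model with a Gaussian kernel, only the cross term of the squared MMD depends on the location parameter, so the minimum MMD estimate can be found by a grid search maximising the empirical kernel mean embedding; under 10% gross contamination it barely moves, while the MLE (the sample mean) is dragged towards the outliers. All numbers here are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(4)
clean = rng.normal(1.0, 1.0, 900)
outliers = np.full(100, 8.0)              # 10% gross contamination at x = 8
x = np.concatenate([clean, outliers])

# Minimum MMD location estimate for the model N(theta, 1) with a Gaussian
# kernel of unit lengthscale: the within-data and within-model terms are
# constant in theta, so we maximise the mean embedding E_y k(x_i, y), which
# is proportional to exp(-(x_i - theta)^2 / 4) in closed form.
thetas = np.linspace(-2.0, 9.0, 1101)
embed = np.exp(-(x[None, :] - thetas[:, None])**2 / 4).mean(axis=1)
theta_mmd = thetas[np.argmax(embed)]

theta_mle = x.mean()                      # MLE of the location: the sample mean
```

The MLE lands near 1.7, pulled by the outliers, while the minimum MMD estimate stays near the true location 1, since the bounded kernel caps the influence of any single observation.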
4 The Importance of Kernel Selection: Gaussian Models
As should be clear from the previous sections, the choice of kernel will strongly influence the characteristics of minimum MMD estimators, including (but not limited to) the efficiency of the estimators, their robustness to misspecification and the geometry of the loss function. In this section, we highlight some of these consequences for two particular models: the Gaussian location model and the Gaussian scale model. These are illustrative problems for which many quantities of interest (such as the asymptotic variance and the influence function) can be computed in closed form, allowing for a detailed study of the properties of minimum MMD estimators.
4.1 Kernel Selection in the Literature
A number of approaches for kernel selection have been proposed in the literature, most based on radial basis kernels of the form , for some function . We now highlight each of these approaches, and later discuss the consequences of our theoretical results in the case of Gaussian location and scale models.
Dziugaite et al. (2015) proposed to set the lengthscale using the median heuristic proposed in Gretton et al. (2008) for two-sample testing with MMD, and hence picked their lengthscale to be where is the data. This heuristic has previously been demonstrated to lead to high power in the context of two-sample testing for location models in Ramdas et al. (2015); Reddi et al. (2015). See also Garreau et al. (2017) for an extensive investigation. Li et al. (2015); Ren et al. (2016); Sutherland et al. (2017) have demonstrated empirically that a mixture of squared-exponential kernels yields good performance, i.e. a kernel of the form where and the lengthscales are chosen to cover a wide range of bandwidths. The weights can either be fixed or optimised; see Sutherland et al. (2017) for more details. As the sum of characteristic kernels is characteristic (see Sriperumbudur et al. (2010)), this is a valid choice of kernel.
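As a concrete sketch of these two choices (illustrative code under our own naming, not taken from any of the cited works; one common variant of the median heuristic sets the squared lengthscale to the median pairwise squared distance, and other variants differ by a constant factor):

```python
import numpy as np

def median_heuristic(x):
    # x: (n, d) array of samples; returns l with l^2 equal to the median
    # of the pairwise squared Euclidean distances (one common variant).
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    off_diag = sq[~np.eye(len(x), dtype=bool)]  # drop the zero diagonal
    return np.sqrt(np.median(off_diag))

def gaussian_kernel(x, y, l):
    # Squared-exponential kernel with lengthscale l.
    return np.exp(-np.sum((x - y) ** 2) / (2 * l ** 2))

def mixture_kernel(x, y, lengthscales, weights=None):
    # Mixture of Gaussian kernels; a sum of characteristic kernels is
    # characteristic, so this remains a valid choice.
    if weights is None:
        weights = np.ones(len(lengthscales))
    return sum(w * gaussian_kernel(x, y, l)
               for w, l in zip(weights, lengthscales))
```

The lengthscales in the mixture would typically be chosen to span a wide range of magnitudes around the median-heuristic value.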
Another approach, originating from the use of MMD for hypothesis testing, consists of studying the asymptotic distribution of the test statistic and choosing kernel parameters so as to maximise the power of the test. This was for example used in (Sutherland et al., 2017). A similar idea could be envisaged in our case: we could minimise the asymptotic variance of the CLT obtained in the previous section. Unfortunately, this will not be tractable in general since computing the asymptotic variance requires knowing the value of , but an approximation could be obtained using the current estimate of the parameter.
Finally, recent work (Li et al., 2017) also proposed to include the problem of kernel selection in the objective function, leading to a minimax objective. This renders the optimisation problem delicate to deal with in practice (Bottou et al., 2017). The introduction of several constraints on the objective function has, however, allowed significant empirical success (Arbel et al., 2018; Bińkowski et al., 2018). We do not consider this case, but it will be the subject of future work.
4.2 Gaussian Location Model
To fix ideas, we shall focus on a Gaussian location model for a -dimensional isotropic Gaussian distribution with unknown mean and known standard deviation. In this case, we take , is a standard Gaussian distribution and . The derivative of the generator is given by . Although this is of course a fairly simple model which could be estimated by MLE, it will be useful to illustrate some of the important points for the implementation of MMD estimators. In the first instance, we consider the M-closed case where the data consist of samples where and the kernel is given by , where is the probability density function of a Gaussian.
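For this model the expectations defining the squared MMD under a Gaussian kernel are available in closed form via standard Gaussian convolution identities, so the estimator can be sketched directly; the code below is our own illustration (the function names, initialisation, and gradient-descent settings are assumptions, not the paper's implementation):

```python
import numpy as np

def mmd2_location(theta, y, sigma, l):
    # Squared MMD between N(theta, sigma^2 I) and the empirical measure of
    # y (shape (n, d)), up to a theta-independent data-data term, using
    # E exp(-||z||^2/(2 l^2)) = (l^2/(l^2+s^2))^{d/2}
    #     * exp(-||m||^2/(2(l^2+s^2)))  for z ~ N(m, s^2 I).
    d = y.shape[1]
    c_mm = (l**2 / (l**2 + 2 * sigma**2)) ** (d / 2)   # model-model term
    c_md = (l**2 / (l**2 + sigma**2)) ** (d / 2)       # model-data term
    sq = np.sum((y - theta) ** 2, axis=1)
    return c_mm - 2 * c_md * np.mean(np.exp(-sq / (2 * (l**2 + sigma**2))))

def fit_location(y, sigma, l, lr=0.5, steps=500):
    # Plain gradient descent on the closed-form objective, initialised at
    # the sample mean (in practice one might use random restarts).
    d = y.shape[1]
    c_md = (l**2 / (l**2 + sigma**2)) ** (d / 2)
    s2 = l**2 + sigma**2
    theta = y.mean(axis=0).copy()
    for _ in range(steps):
        diff = theta - y                                  # shape (n, d)
        w = np.exp(-np.sum(diff**2, axis=1) / (2 * s2))   # kernel weights
        grad = (2 * c_md / s2) * np.mean(w[:, None] * diff, axis=0)
        theta = theta - lr * grad
    return theta
```

Because the objective is available in closed form here, no sampling from the generative model is needed; for genuinely intractable models the expectations would be replaced by Monte Carlo averages over simulated samples.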
Proposition 3 (Asymptotic Variance for Gaussian Location Models).
Consider the minimum MMD estimator for the location of a Gaussian distribution using a Gaussian kernel . Then the estimator has asymptotic variance given by
The Fisher information for this model is given by , and so in the regime we recover the efficiency of the MLE, so that the Cramér-Rao bound in Theorem 3 is attained. On the other hand, for finite values of , the minimum MMD estimator will be less efficient than the MLE. For , the asymptotic variance is with respect to , but we notice that the asymptotic variance is as , where . This demonstrates a curse of dimensionality in this regime. This transition in behaviour suggests that there is a critical scaling of with respect to which results in an asymptotic variance independent of dimension.
Proposition 4 (Critical Scaling for Gaussian Location Models).
Consider the minimum MMD estimator for the location of a Gaussian distribution using a single Gaussian kernel where . The asymptotic variance is bounded independently of dimension if and only if .
As previously mentioned, it has been demonstrated empirically that choosing the bandwidth according to the median heuristic results in good performance in the context of MMD hypothesis tests (Reddi et al., 2015; Ramdas et al., 2015). These works note that the median heuristic yields , which lies within the dimension independent regime in Proposition 4. Our CLT therefore explains some of the favourable properties of this choice.
Clearly, the choice of lengthscale can have a significant impact on the efficiency of the estimator, but it can also impact other aspects of the problem. For example, the loss landscape of the MMD, and hence our ability to perform gradient-based optimisation, is severely impacted by the choice of kernel. This is illustrated in Figure 1 (top left), where choices of lengthscale between and will be preferable for gradient-based optimisation routines since they avoid large regions where the loss function has a gradient close to zero. Using a mixture of kernels covering a range of lengthscales could help avoid regions of near-zero gradient and hence be generally desirable for gradient-based optimisation. A third aspect of the inference scheme which is impacted by the lengthscale is robustness. We can quantify the influence of the kernel choice on robustness using the influence function. Similar plots for different classes of kernels can be found in Figures 8, 9 and 10 in the Appendix.
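The flattening of the loss landscape for small bandwidths can be checked numerically with a one-dimensional toy version of the closed-form Gaussian-kernel loss (our own illustrative sketch, assuming unit model standard deviation and an initial point far from the data; this is not the code behind Figure 1):

```python
import numpy as np

def mmd_loss_1d(theta, y, sigma, l):
    # Closed-form 1-d MMD loss (up to a theta-independent constant) for
    # the Gaussian location model with a Gaussian kernel of bandwidth l.
    c_mm = np.sqrt(l**2 / (l**2 + 2 * sigma**2))
    c_md = np.sqrt(l**2 / (l**2 + sigma**2))
    return c_mm - 2 * c_md * np.mean(np.exp(-(theta - y) ** 2 / (2 * (l**2 + sigma**2))))

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, size=200)   # data centred at 0

def num_grad(theta, l, eps=1e-5):
    # Central finite difference of the loss in theta.
    return (mmd_loss_1d(theta + eps, y, 1.0, l)
            - mmd_loss_1d(theta - eps, y, 1.0, l)) / (2 * eps)

# At a starting point far from the data, a small bandwidth yields a
# near-zero gradient, while a moderate one retains usable signal.
g_small = abs(num_grad(8.0, l=0.1))
g_mod = abs(num_grad(8.0, l=3.0))
```

With the small bandwidth the finite-difference gradient at θ = 8 is several orders of magnitude smaller than with the moderate one, so a gradient-based optimiser started there would stall.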
Proposition 5 (Influence Function for Gaussian Location Models).
Consider the MMD estimator for the location of a Gaussian model based on a Gaussian kernel . Then the influence function is given by: