Composite Goodness-of-fit Tests with Kernels

by   Oscar Key, et al.
Universidad Adolfo Ibáñez

Model misspecification can create significant challenges for the implementation of probabilistic models, and this has led to the development of a range of inference methods which directly account for this issue. However, whether these more involved methods are required will depend on whether the model is really misspecified, and there is a lack of generally applicable methods to answer this question. One set of tools which can help are goodness-of-fit tests, where we test whether a dataset could have been generated by a fixed distribution. Kernel-based tests have been developed for this problem, and these are popular due to their flexibility, strong theoretical guarantees and ease of implementation in a wide range of scenarios. In this paper, we extend this line of work to the more challenging composite goodness-of-fit problem, where we are instead interested in whether the data comes from any distribution in some parametric family. This is equivalent to testing whether a parametric model is well-specified for the data.




1 Introduction

One approach to answering the question of whether a model is misspecified is goodness-of-fit testing [lehmann-testing-statistical-hypotheses, Chapter 14]. Given a fixed distribution $P$ and observations $x_1, \dots, x_n$ generated from an unknown distribution $Q$, goodness-of-fit tests compare the null hypothesis $H_0: P = Q$ against the alternative hypothesis $H_1: P \neq Q$. Tests using the kernel Stein discrepancy (KSD) as test statistic are popular for this task because they can be applied to a wide range of data types and can accommodate models with unnormalised densities liu-ksd-goodness-of-fit,chwialkowski-kernel-goodness-fit.

Although these tests are very useful in practice, we will often be interested in answering the more complex question of whether our data, with unknown distribution $Q$, was generated by any element of some parametric family of distributions $\{P_\theta\}_{\theta \in \Theta}$ with parameter space $\Theta$. Specifically, the null hypothesis (corresponding to a well-specified model) is $H_0^C: \exists\, \theta_0 \in \Theta$ such that $P_{\theta_0} = Q$, and the alternative (corresponding to a misspecified model) is $H_1^C: Q \notin \{P_\theta\}_{\theta \in \Theta}$. This type of test is known as a composite goodness-of-fit test, and can be much more challenging to construct since $\theta_0$ is usually unknown.

Our paper fills an important gap in the literature by proposing the first set of kernel-based composite hypothesis tests applicable to a wide range of parametric models. These are in contrast to previously introduced composite tests which are limited to very specific parametric families kellnerOnesampleTestNormality,fernandezKernelizedSteinDiscrepancy. To devise these new tests, we make use of recently developed minimum distance estimators based on the KSD barpMinimumSteinDiscrepancy. A key challenge is that the dataset is used twice, both to estimate the parameter and to compute the test statistic, and this is done without splitting it into estimation and test sets. To achieve the correct level, the test must take account of this dependence, both via a more in-depth theoretical analysis and by using a suitable method for approximating the threshold. In this initial version of the work, we only consider the second aspect by using the parametric bootstrap stuteBootstrapBasedGoodnessoffittests to achieve the correct level without data-splitting.

2 Background: Kernel Stein Discrepancies for Testing and Estimation

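We will now briefly review the use of the KSD for testing and estimation. Denote by $\mathcal{X}$ the data space and by $\mathcal{P}(\mathcal{X})$ the set of all Borel distributions on $\mathcal{X}$. For simplicity, we will focus on $\mathcal{X} = \mathbb{R}^d$, but note that KSDs have been developed for other data types Yang2018,Yang2019,kanagawaKernelSteinTest,Fernandez2019,fernandezKernelizedSteinDiscrepancy,Xu2020directional,Xu2021. The KSD is a function $\mathrm{KSD}: \mathcal{P}(\mathcal{X}) \times \mathcal{P}(\mathcal{X}) \to [0, \infty)$ which measures the similarity between two distributions $P, Q \in \mathcal{P}(\mathcal{X})$. Although it is not a probability metric, it is closely connected to the class of integral probability metrics (IPMs) Muller1997, which measure similarity as follows:

$$ D_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}} \left| \mathbb{E}_{X \sim P}[f(X)] - \mathbb{E}_{Y \sim Q}[f(Y)] \right|. \tag{1} $$
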
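Let $\mathcal{H}_k$ be the reproducing kernel Hilbert space (RKHS) associated to the symmetric positive definite kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ Berlinet2004 and let $\mathcal{H}_k^d$ denote the $d$-dimensional tensor product of $\mathcal{H}_k$. The Langevin kernel Stein discrepancy oates-control-functionals,chwialkowski-kernel-goodness-fit,liu-ksd-goodness-of-fit,gorham-sample-quality-stein is obtained by considering an IPM with

$$ \mathcal{F} = \left\{ \mathcal{S}_P[f] : \|f\|_{\mathcal{H}_k^d} \le 1 \right\}, \qquad \mathcal{S}_P[f](x) = f(x) \cdot \nabla_x \log p(x) + \nabla_x \cdot f(x). $$

The operator $\mathcal{S}_P$ is called the Langevin Stein operator and $p$ is the Lebesgue density of $P$. Using $\mathbb{E}_{X \sim P}[\mathcal{S}_P[f](X)] = 0$, Equation 1 simplifies to:

$$ \mathrm{KSD}(P \,\|\, Q) = \sup_{\|f\|_{\mathcal{H}_k^d} \le 1} \left| \mathbb{E}_{Y \sim Q}\big[\mathcal{S}_P[f](Y)\big] \right|. $$

Let $Q_n = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}$ where $\delta_{x_i}$ denotes a Dirac measure at $x_i$. The KSD can be straightforwardly computed in this case using a single V-statistic at a cost of $O(n^2 d)$. Under mild regularity conditions, the KSD is a statistical divergence, meaning that $\mathrm{KSD}(P \,\|\, Q) = 0$ if and only if $P = Q$; see [chwialkowski-kernel-goodness-fit, Theorem 2.1] and [barpMinimumSteinDiscrepancy, Proposition 1]. The KSD is convenient in the context of unnormalised models since it depends on $P$ only via $\nabla_x \log p$, which can be evaluated without knowledge of the normalisation constant of $p$. We now briefly recall how it can be used for goodness-of-fit testing and estimation.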
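As a minimal sketch of the V-statistic computation described above, the following assumes one-dimensional data, a Gaussian base kernel with lengthscale `ell`, and a user-supplied score function (the derivative of the log density). The function names and the choice of kernel are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def stein_kernel_matrix(x, score, ell=1.0):
    """Matrix of Langevin Stein kernel values h(x_i, x_j) for 1-D data."""
    x = np.asarray(x, dtype=float)
    d = x[:, None] - x[None, :]                 # pairwise differences x_i - x_j
    k = np.exp(-d**2 / (2 * ell**2))            # Gaussian base kernel (assumed)
    s = score(x)                                # model score, d/dx log p(x)
    return (s[:, None] * s[None, :] * k                     # s(x) s(y) k(x, y)
            + (s[:, None] - s[None, :]) * d * k / ell**2    # s(x) dk/dy + s(y) dk/dx
            + (1 / ell**2 - d**2 / ell**4) * k)             # d^2 k / dx dy

def ksd_squared(x, score, ell=1.0):
    """V-statistic estimate of KSD^2: the mean of all n^2 Stein kernel values."""
    return stein_kernel_matrix(x, score, ell).mean()

rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, size=200)                     # data from N(0, 1)
ksd_matching = ksd_squared(sample, lambda x: -x)            # model N(0, 1)
ksd_shifted = ksd_squared(sample, lambda x: -(x - 2.0))     # model N(2, 1)
print(ksd_matching, ksd_shifted)
```

Only the score of the model enters the computation, so the normalisation constant is never needed. The estimate for the shifted model should be clearly larger than for the matching one.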

Goodness-of-fit testing with KSD

In goodness-of-fit testing, we would like to test $H_0: P = Q$ against $H_1: P \neq Q$. A natural approach is to compute $\mathrm{KSD}(P \,\|\, Q)$ and check whether this quantity is zero (i.e. $H_0$ holds) or not (i.e. $H_1$ holds) liu-ksd-goodness-of-fit,chwialkowski-kernel-goodness-fit. Of course, since we only have access to the empirical distribution $Q_n$ instead of $Q$, this idealised procedure is replaced by the evaluation of $\mathrm{KSD}^2(P \,\|\, Q_n)$. The question then becomes whether or not this quantity is further away from zero than expected under $H_0$ given the sampling error associated with a dataset of size $n$.

To determine whether $H_0$ should be rejected, we need to select an appropriate threshold $c_\alpha$, which will depend on the level of the test $\alpha \in [0, 1]$. More precisely, $c_\alpha$ should be set to the $(1-\alpha)$-quantile of the distribution of the test statistic under $H_0$. This distribution will usually be unknown a priori, but can be approximated using a bootstrap method. A common example is the wild bootstrap shao-dependent-wild-bootstrap,leucht-dep-wild-bootstrap, which was specialised for kernel tests by chwialkowski-wild-bootstrap-kernel-tests.
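The wild bootstrap for a fixed (non-composite) null can be sketched as follows. This is an illustrative sketch under several assumptions: a one-dimensional Gaussian model $N(\theta, \sigma^2)$, a Gaussian kernel, independent observations (hence Rademacher weights), and a test statistic scaled by $n$. The function names and default values are hypothetical.

```python
import numpy as np

def stein_kernel_matrix(x, theta, sigma=1.0, ell=1.0):
    """Stein kernel values h(x_i, x_j) for the model N(theta, sigma^2) and a Gaussian kernel."""
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / (2 * ell**2))
    s = -(x - theta) / sigma**2                      # score of N(theta, sigma^2)
    return (s[:, None] * s[None, :] * k
            + (s[:, None] - s[None, :]) * d * k / ell**2
            + (1 / ell**2 - d**2 / ell**4) * k)

def wild_bootstrap_test(h, alpha=0.05, n_bootstrap=500, rng=None):
    """Test a fixed null from its Stein kernel matrix h; returns (reject, statistic, threshold)."""
    rng = np.random.default_rng() if rng is None else rng
    n = h.shape[0]
    statistic = h.sum() / n                                  # n * KSD^2 (V-statistic)
    w = rng.choice([-1.0, 1.0], size=(n_bootstrap, n))       # Rademacher weights
    bootstrap = ((w @ h) * w).sum(axis=1) / n                # (1/n) w^T h w for each draw
    threshold = np.quantile(bootstrap, 1 - alpha)            # (1 - alpha)-quantile
    return bool(statistic > threshold), statistic, threshold

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=200)                           # data from N(0, 1)
# Null model N(2, 1) is badly wrong here, so the test should reject.
reject, statistic, threshold = wild_bootstrap_test(stein_kernel_matrix(x, theta=2.0), rng=rng)
print(reject, statistic, threshold)
```

The kernel matrix is computed once and reused for every bootstrap draw, which is what makes this bootstrap cheap.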

Minimum distance estimation with KSD

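As shown by barpMinimumSteinDiscrepancy, the KSD can also be used for parameter estimation through minimum distance estimation Wolfowitz1957. Consider a parametric family $\{P_\theta\}_{\theta \in \Theta}$ indexed by $\theta \in \Theta$. Given $Q_n$, a natural estimator is

$$ \hat\theta_n = \arg\min_{\theta \in \Theta} \mathrm{KSD}^2(P_\theta \,\|\, Q_n). $$

Under regularity conditions, the estimator $\hat\theta_n$ approaches the minimiser of $\mathrm{KSD}(P_\theta \,\|\, Q)$ over $\Theta$ as $n \to \infty$. The use of the KSD for estimation was later extended by Grathwohl2020,matsubaraRobustGeneralisedBayesian,Gong2020. These estimators are also closely related to score-matching estimators Hyvarinen2006; see [barpMinimumSteinDiscrepancy, Theorem 2] for details.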
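To make the minimum distance estimator concrete, the sketch below minimises the V-statistic estimate of the squared KSD over the mean of a one-dimensional Gaussian model with known variance. It uses a crude grid search as a stand-in for a proper optimiser. The model, kernel, grid and function names are all assumptions made for illustration.

```python
import numpy as np

def ksd_squared(x, theta, sigma=1.0, ell=1.0):
    """V-statistic estimate of KSD^2 between N(theta, sigma^2) and the sample x."""
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / (2 * ell**2))                 # Gaussian kernel (assumed)
    s = -(x - theta) / sigma**2                      # score of N(theta, sigma^2)
    h = (s[:, None] * s[None, :] * k
         + (s[:, None] - s[None, :]) * d * k / ell**2
         + (1 / ell**2 - d**2 / ell**4) * k)
    return h.mean()

def min_ksd_estimate(x, grid):
    """Return the grid value of theta with the smallest estimated KSD^2."""
    objective = np.array([ksd_squared(x, theta) for theta in grid])
    return float(grid[np.argmin(objective)])

rng = np.random.default_rng(2)
x = rng.normal(1.0, 1.0, size=100)                   # true mean is 1.0
grid = np.linspace(-3.0, 5.0, 801)                   # hypothetical search range
theta_hat = min_ksd_estimate(x, grid)
print(theta_hat)
```

For models outside the exponential family, a gradient-based optimiser would normally replace the grid search.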

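Algorithm 1 Wild bootstrap test
Input: $P_{\hat\theta_n}$, $Q_n$, $\alpha$, $b$
$\Delta = n\,\mathrm{KSD}^2(P_{\hat\theta_n} \,\|\, Q_n)$;
for $r \in \{1, \dots, b\}$ do
       $w^{(r)} \sim \mathrm{Rademacher}^n$, $\Delta^{(r)} = \frac{1}{n} \sum_{i,j=1}^n w^{(r)}_i w^{(r)}_j h_{\hat\theta_n}(x_i, x_j)$;
$c_\alpha$ = $(1-\alpha)$-quantile of $\{\Delta^{(1)}, \dots, \Delta^{(b)}\}$;
if $\Delta > c_\alpha$, reject the null, else do not reject.

Here $h_\theta$ denotes the Stein kernel of $P_\theta$, so that $\mathrm{KSD}^2(P_\theta \,\|\, Q_n) = \frac{1}{n^2} \sum_{i,j=1}^n h_\theta(x_i, x_j)$.

Algorithm 2 Parametric bootstrap test
Input: $\{P_\theta\}_{\theta \in \Theta}$, $\hat\theta_n$, $Q_n$, $\alpha$, $b$
$\Delta = n\,\mathrm{KSD}^2(P_{\hat\theta_n} \,\|\, Q_n)$;
for $r \in \{1, \dots, b\}$ do
       $Q_n^{(r)}$ = empirical distribution of $n$ samples from $P_{\hat\theta_n}$, $\hat\theta^{(r)} = \arg\min_{\theta \in \Theta} \mathrm{KSD}^2(P_\theta \,\|\, Q_n^{(r)})$, $\Delta^{(r)} = n\,\mathrm{KSD}^2(P_{\hat\theta^{(r)}} \,\|\, Q_n^{(r)})$;
$c_\alpha$ = $(1-\alpha)$-quantile of $\{\Delta^{(1)}, \dots, \Delta^{(b)}\}$;
if $\Delta > c_\alpha$, reject the null, else do not reject.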

3 Methodology

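We now consider a novel composite goodness-of-fit test where both estimation and testing are based on the KSD with some fixed kernel $k$. The new test consists of the following two stages:

Stage 1 (Estimation):

$\hat\theta_n = \arg\min_{\theta \in \Theta} \mathrm{KSD}^2(P_\theta \,\|\, Q_n)$.

Stage 2 (Testing):

reject $H_0^C$ if the test statistic $\Delta = n\,\mathrm{KSD}^2(P_{\hat\theta_n} \,\|\, Q_n)$ exceeds the threshold $c_\alpha$, computed using one of Algorithm 1 or Algorithm 2.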

Note that we could possibly replace Stage 1 with another estimator such as maximum likelihood estimation, but this would require knowledge of the normalisation constant of the likelihood, which may not be available in many cases. To implement Stage 2, we require a bootstrap algorithm to estimate the threshold $c_\alpha$. A natural first approach is to use the wild bootstrap, since it is the current gold standard for kernel-based goodness-of-fit tests. Algorithm 1 gives the implementation of a composite test using the wild bootstrap, where $\alpha$ is the desired test level and $b$ is the number of bootstrap samples. Note that in our implementation the choice of sampling the bootstrap weights from a Rademacher distribution assumes that the observations $x_1, \dots, x_n$ are independent.

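If we make the unrealistic assumption that $\hat\theta_n = \theta_0$, then we are back in the setting considered by chwialkowski-kernel-goodness-fit,liu-ksd-goodness-of-fit and the wild bootstrap works as follows. Under $H_0^C$, as $n \to \infty$, the bootstrap samples and the test statistic converge to the same distribution, thus $c_\alpha$ will converge to the $(1-\alpha)$-quantile of the distribution of the test statistic [chwialkowski-wild-bootstrap-kernel-tests, Theorem 1,2], and the test will reject with probability $\alpha$, as desired. Under $H_1^C$, the test statistic will diverge while the bootstrap samples will converge to some distribution, hence the probability that the test statistic exceeds $c_\alpha$ goes to $1$ as $n \to \infty$.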

However, it is clear that we cannot assume that $\hat\theta_n = \theta_0$ in the finite data case. In fact, using the wild bootstrap in this fashion for composite tests may result in an incorrect type I error rate under $H_0^C$, and lost power under $H_1^C$. This is because the estimation stage of the composite test introduces a second source of error which this approach does not take into account when computing $c_\alpha$. The two sources of error that the test encounters are as follows. Recall that in an idealised setting we would use $\mathrm{KSD}(P_{\theta_0} \,\|\, Q)$ as the test statistic. The first source of error is introduced because we must estimate this statistic with $\mathrm{KSD}(P_{\theta_0} \,\|\, Q_n)$, as we only have access to a sample $Q_n$ from $Q$. This source of error also occurs in non-composite tests, and is accounted for correctly by the wild bootstrap. The second source of error is specific to composite tests, and occurs because we must further estimate $\theta_0$ with $\hat\theta_n$, as we do not have access to $\theta_0$. Algorithm 1 fixes the parameter estimate and then applies the bootstrap, thus failing to take account of the error in $\hat\theta_n$ and potentially computing an incorrect threshold. We can also view Algorithm 1 as a non-composite goodness-of-fit test against the wrong null hypothesis. By estimating the parameter and then applying the test, the test is not evaluating $H_0^C$, but instead $H_0: P_{\hat\theta_n} = Q$.

Figure 1 demonstrates the impact of ignoring this error. It shows the power of our test for a Gaussian model with unknown mean at two sample sizes $n$, as a parameter of the data-generating distribution varies; see Appendix B for details. We can see that the wild bootstrap test has lower power for most values of this parameter, including the value at which $H_0^C$ holds, where the power corresponds to the type I error. It also demonstrates that the error depends on the size of $n$. The estimator is consistent: as $n$ becomes large we expect the estimate of the parameter to converge to the true value, thus minimising the impact of the error. However, for smaller $n$ it is necessary to use an alternative approach which is able to take account of the additional error.

To take account of both types of error we apply the parametric bootstrap described in Algorithm 2. This bootstrap approximates the distribution of the test statistic under $H_0^C$ by repeatedly resampling the observations and re-estimating the parameter, thus taking account of the error in the estimate of the parameter. In comparison to the wild bootstrap, this is substantially more computationally intensive because it requires repeatedly computing the kernel matrix on fresh data, whereas the wild bootstrap only computes the kernel matrix once. However, when $n$ is large and this extra computation is likely to be an issue, the size of the estimation error should also be reduced as the estimator converges to the true value of the parameter. Thus, in this high-data regime, the wild and parametric bootstraps may achieve comparable power, and it may be reasonable to use the cheaper wild bootstrap. In the setting considered in Figure 1 we find that this is the case, with the parametric bootstrap performing substantially better than the wild bootstrap for the smaller sample size, but comparably for the larger one.
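A minimal sketch of the two-stage composite test with the parametric bootstrap is given below. It assumes a one-dimensional Gaussian model with unknown mean and known variance, and a Gaussian kernel. For this particular model and a translation-invariant kernel, minimising the V-statistic over the mean reduces to a kernel-weighted average of the data, which the sketch uses in place of a numerical optimiser. All names and default values are hypothetical.

```python
import numpy as np

def estimate_mean(x, ell=1.0):
    """Minimum-KSD estimate of the mean: a kernel-weighted average for this model."""
    k = np.exp(-(x[:, None] - x[None, :])**2 / (2 * ell**2))
    return (k.sum(axis=1) @ x) / k.sum()

def scaled_ksd(x, theta, sigma=1.0, ell=1.0):
    """n * KSD^2 (V-statistic) between N(theta, sigma^2) and the sample x."""
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / (2 * ell**2))
    s = -(x - theta) / sigma**2
    h = (s[:, None] * s[None, :] * k
         + (s[:, None] - s[None, :]) * d * k / ell**2
         + (1 / ell**2 - d**2 / ell**4) * k)
    return h.sum() / len(x)

def parametric_bootstrap_test(x, sigma=1.0, alpha=0.05, n_bootstrap=200, rng=None):
    """Composite test of 'x comes from N(theta, sigma^2) for some theta'."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    theta_hat = estimate_mean(x)                          # Stage 1: estimation
    statistic = scaled_ksd(x, theta_hat, sigma)           # Stage 2: test statistic
    bootstrap = np.empty(n_bootstrap)
    for r in range(n_bootstrap):
        x_r = rng.normal(theta_hat, sigma, size=n)        # resample from fitted model
        theta_r = estimate_mean(x_r)                      # re-estimate the parameter
        bootstrap[r] = scaled_ksd(x_r, theta_r, sigma)    # statistic on fresh data
    threshold = np.quantile(bootstrap, 1 - alpha)
    return bool(statistic > threshold), statistic, threshold

rng = np.random.default_rng(4)
x_bad = rng.normal(0.0, 3.0, size=150)    # variance 9, but the model assumes variance 1
reject_bad, stat_bad, thresh_bad = parametric_bootstrap_test(x_bad, rng=rng)
print(reject_bad, stat_bad, thresh_bad)
```

Each bootstrap iteration recomputes the kernel matrix on new data, which is the source of the extra cost relative to the wild bootstrap.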

4 Illustration: The Kernel Exponential Family Model

Figure 1: Power of the test when using the wild and parametric bootstraps for two sample sizes $n$ (solid and dashed). The dashed horizontal line shows the test level, $\alpha$. The error bars show one standard error over random seeds.
Figure 2: Goodness-of-fit of a kernel exponential family model with an increasing number of basis functions $p$. The grey histogram shows the dataset, and the lines the modelled density. Dashed lines indicate that the test rejected $H_0^C$, and solid that it did not.

We apply our method to a density estimation task from matsubaraRobustGeneralisedBayesian, to test whether a robust method was really necessary. The model is in the kernel exponential family $p_\theta(x) \propto q(x) \exp(f(x))$, where $f$ is assumed to belong to some RKHS but is approximated by a finite sum of $p$ basis functions, and $q$ is a reference density. Specifically, we follow steinwartExplicitDescriptionReproducing in the choice of basis functions. The dataset comprises the velocities of galaxies postmanProbesLargescaleStructure, roederDensityEstimationConfidence. An open question is how large $p$ needs to be for the model to have enough capacity to fit the data. matsubaraRobustGeneralisedBayesian fix a particular value of $p$, and we test this assumption. See Appendix B for the full experiment configuration. Figure 2 shows the fit of the model for increasing values of $p$, and whether the test rejected the null hypothesis. We find that the test rejects $H_0^C$ for the smallest values of $p$, but does not reject for larger values, including the one used by matsubaraRobustGeneralisedBayesian. This suggests that their choice is suitable, though it would be reasonable to use a smaller value of $p$, which would decrease the computational cost of inference.

5 Conclusion

In this initial work, we introduced a flexible composite goodness-of-fit test which requires only an unnormalised density. In the full version of this paper we will consider a general framework for composite goodness-of-fit tests, which will also include tests based on the maximum mean discrepancy. This will allow us to tackle a much wider set of problems, including composite testing for generative models with no computable density function. We will also include theoretical results showing that the test statistic converges under $H_0^C$, that the test is consistent under $H_1^C$, and that the wild and parametric bootstraps behave appropriately.

Appendix A Closed-form expression for KSD estimator

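For exponential family models, such as the Gaussian and kernel exponential families considered in this paper, there is a closed-form expression for the KSD estimator. Let the density of a model in the exponential family be

$$ p_\theta(x) = \exp\big( \eta(\theta) \cdot t(x) - a(\theta) + b(x) \big), $$

with $\eta: \Theta \to \mathbb{R}^m$ an invertible map, $t: \mathbb{R}^d \to \mathbb{R}^m$ any sufficient statistic for some $m \in \mathbb{N}$, and $a: \Theta \to \mathbb{R}$, $b: \mathbb{R}^d \to \mathbb{R}$. Then the KSD estimator of $\theta$ is given by

$$ \hat\theta_n = \eta^{-1}\Big( -\tfrac{1}{2} \Lambda_n^{-1} \nu_n \Big), $$

where $\Lambda_n \in \mathbb{R}^{m \times m}$ and $\nu_n \in \mathbb{R}^m$ are defined as $\Lambda_n = \frac{1}{n^2} \sum_{i,j=1}^n \Lambda(x_i, x_j)$ and $\nu_n = \frac{1}{n^2} \sum_{i,j=1}^n \nu(x_i, x_j)$. These are based on functions $\Lambda$ and $\nu$ which depend on the specific model, defined as

$$ \Lambda(x, x') = k(x, x')\, \nabla t(x)\, \nabla t(x')^\top, $$

$$ \nu(x, x') = k(x, x') \big[ \nabla t(x) \nabla b(x') + \nabla t(x') \nabla b(x) \big] + \nabla t(x) \nabla_{x'} k(x, x') + \nabla t(x') \nabla_x k(x, x'), $$

where $\nabla t(x) \in \mathbb{R}^{m \times d}$ denotes the Jacobian of $t$. The detailed derivation of this result can be found in [barpMinimumSteinDiscrepancy, Appendix D3] or matsubaraRobustGeneralisedBayesian.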
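As a sanity check of the closed-form idea, the sketch below specialises it to a one-dimensional Gaussian with unknown mean $\theta$ and known variance $\sigma^2$. It assumes the parameterisation $\eta(\theta) = \theta / \sigma^2$, $t(x) = x$, $b(x) = -x^2 / (2\sigma^2)$ and a Gaussian kernel, so that the squared KSD is quadratic in $\eta$ and is minimised at $-\tfrac{1}{2}\Lambda_n^{-1}\nu_n$. The parameterisation and all function names are assumptions made for this illustration. The result is compared against a brute-force grid minimisation.

```python
import numpy as np

def kernel_terms(x, ell=1.0):
    """Pairwise differences, Gaussian kernel values and first derivatives."""
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / (2 * ell**2))
    dk_dx = -d / ell**2 * k                  # derivative in the first argument
    return d, k, dk_dx, -dk_dx               # derivative in the second argument is -dk_dx

def closed_form_mean(x, sigma=1.0, ell=1.0):
    """Closed-form minimum-KSD estimate of the mean of N(theta, sigma^2)."""
    _, k, dk_dx, dk_dy = kernel_terms(x, ell)
    grad_t = np.ones_like(x)                 # t(x) = x
    grad_b = -x / sigma**2                   # b(x) = -x^2 / (2 sigma^2)
    lam = (k * np.outer(grad_t, grad_t)).mean()
    nu = (k * (np.outer(grad_t, grad_b) + np.outer(grad_b, grad_t))
          + grad_t[:, None] * dk_dy + grad_t[None, :] * dk_dx).mean()
    eta_hat = -0.5 * nu / lam
    return sigma**2 * eta_hat                # invert eta(theta) = theta / sigma^2

def ksd_squared(x, theta, sigma=1.0, ell=1.0):
    """V-statistic estimate of KSD^2, used here only for the brute-force check."""
    d, k, dk_dx, dk_dy = kernel_terms(x, ell)
    s = -(x - theta) / sigma**2
    h = (s[:, None] * s[None, :] * k + s[:, None] * dk_dy + s[None, :] * dk_dx
         + (1 / ell**2 - d**2 / ell**4) * k)
    return h.mean()

rng = np.random.default_rng(3)
x = rng.normal(1.0, 1.5, size=150)
theta_closed = closed_form_mean(x, sigma=1.5)
grid = np.linspace(-3.0, 5.0, 801)           # grid spacing 0.01
theta_grid = grid[np.argmin([ksd_squared(x, t, sigma=1.5) for t in grid])]
print(theta_closed, theta_grid)
```

The two estimates should agree up to the grid resolution. The closed form avoids any iterative optimisation, which matters when the estimator is recomputed inside a bootstrap loop.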

Appendix B Experiment details

B.1 Figure 1

For each datapoint in the plot, the power is computed as follows:

  1. For four random seeds:

    1. For each of a fixed number of repeats:

      1. Sample $n$ points from the data-generating distribution

      2. Run the test

    2. Compute the power, the fraction of repeats during which the null hypothesis is rejected

  2. Compute the mean and standard error over the seeds

We find the standard error to be sufficiently small that it is not visible in the figure.
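The procedure above can be sketched as follows. The composite test used here is a compact wild bootstrap test (in the style of Algorithm 1) for a Gaussian model with unknown mean and unit variance. The number of repeats, the sample size and the data-generating distribution are hypothetical values chosen for illustration, not the configuration used in the paper.

```python
import numpy as np

def wild_bootstrap_composite_test(x, alpha=0.05, n_bootstrap=200, rng=None):
    """Composite test for N(theta, 1) with unknown mean (Gaussian kernel, lengthscale 1)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / 2)
    theta_hat = (k.sum(axis=1) @ x) / k.sum()            # minimum-KSD estimate of the mean
    s = -(x - theta_hat)                                  # score of N(theta_hat, 1)
    h = s[:, None] * s[None, :] * k + (s[:, None] - s[None, :]) * d * k + (1 - d**2) * k
    w = rng.choice([-1.0, 1.0], size=(n_bootstrap, n))    # Rademacher weights
    bootstrap = ((w @ h) * w).sum(axis=1) / n
    return bool(h.sum() / n > np.quantile(bootstrap, 1 - alpha))

def estimate_power(sample, n, n_seeds=4, n_repeats=50):
    """Mean and standard error (over seeds) of the rejection rate."""
    powers = []
    for seed in range(n_seeds):
        rng = np.random.default_rng(seed)
        rejections = [wild_bootstrap_composite_test(sample(rng, n), rng=rng)
                      for _ in range(n_repeats)]
        powers.append(np.mean(rejections))                # fraction of repeats rejected
    powers = np.array(powers)
    return powers.mean(), powers.std(ddof=1) / np.sqrt(n_seeds)

# Data with variance 9 tested against a unit-variance model: strongly misspecified.
power, stderr = estimate_power(lambda rng, n: rng.normal(0.0, 3.0, size=n), n=200)
print(power, stderr)
```

Sweeping the data-generating parameter and calling `estimate_power` at each value would give one curve of a power plot.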


We estimate only the mean, which we denote by $\theta$. The variance, $\sigma^2$, is known and specified by the null hypothesis. We use the closed-form KSD estimator as defined in Appendix A.

Gaussian kernel:

$k(x, x') = \exp\big(-(x - x')^2 / (2\ell^2)\big)$, with a fixed lengthscale $\ell$.

Bootstrap and test

The wild bootstrap (Algorithm 1) and the parametric bootstrap (Algorithm 2) are each configured with their own number of bootstrap samples $b$. We use the same test level $\alpha$ in both cases.

B.2 Figure 2

Data normalisation

We normalise the dataset following matsubaraRobustGeneralisedBayesian.

We use the closed-form KSD estimator as defined in Appendix A.

IMQ kernel:

We select the lengthscale using the median heuristic.
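A short sketch of lengthscale selection with the median heuristic follows. Several variants of the heuristic exist. This one assumes the lengthscale is set to the median of the pairwise distances, and the inverse multiquadric (IMQ) kernel is written in one common parameterisation. Both are assumptions and may differ from the exact forms used in the experiment.

```python
import numpy as np

def median_heuristic(x):
    """Median of the pairwise distances between distinct 1-D data points."""
    x = np.asarray(x, dtype=float)
    i, j = np.triu_indices(len(x), k=1)       # index pairs with i < j
    return float(np.median(np.abs(x[i] - x[j])))

def imq_kernel(x, y, ell):
    """IMQ kernel (1 + (x - y)^2 / ell^2)^(-1/2); this parameterisation is an assumption."""
    return (1.0 + (x - y)**2 / ell**2) ** -0.5

ell = median_heuristic([0.0, 1.0, 3.0])       # pairwise distances are 1, 3 and 2
print(ell, imq_kernel(0.0, 2.0, ell))
```

The IMQ kernel decays polynomially rather than exponentially, so distant points retain more influence than under a Gaussian kernel.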

Bootstrap and test

Parametric bootstrap (Algorithm 2), with a fixed number of bootstrap samples $b$ and test level $\alpha$.