1 Introduction
One approach to answering the question of whether a model is misspecified is goodness-of-fit testing [Chapter 14, lehmanntestingstatisticalhypotheses]. Given a fixed distribution $\mathbb{P}$ and observations $\{x_i\}_{i=1}^n$ generated from an unknown distribution $\mathbb{Q}$, goodness-of-fit tests compare the null hypothesis $H_0: \mathbb{P} = \mathbb{Q}$ against the alternative hypothesis $H_1: \mathbb{P} \neq \mathbb{Q}$. Tests using the KSD as test statistic are popular for this task because they can be applied to a wide range of data types and can accommodate models with unnormalised densities [liuksdgoodnessoffit, chwialkowskikernelgoodnessfit]. Although these tests are very useful in practice, we will often be interested in answering the more complex question of whether our data was generated by any element of some parametric family $\{\mathbb{P}_\theta\}_{\theta \in \Theta}$ with parameter space $\Theta$. Specifically, the null hypothesis (corresponding to a well-specified model) is $H_0: \exists\, \theta_0 \in \Theta$ such that $\mathbb{P}_{\theta_0} = \mathbb{Q}$, and the alternative (corresponding to a misspecified model) is $H_1: \mathbb{P}_\theta \neq \mathbb{Q}$ for all $\theta \in \Theta$. This type of test is known as a composite goodness-of-fit test, and can be much more challenging to construct since $\theta_0$ is usually unknown.
Our paper fills an important gap in the literature by proposing the first set of kernel-based composite hypothesis tests applicable to a wide range of parametric models. These are in contrast to previously introduced composite tests, which are limited to very specific parametric families [kellnerOnesampleTestNormality, fernandezKernelizedSteinDiscrepancy]. To devise these new tests, we make use of recently developed minimum distance estimators based on the KSD [barpMinimumSteinDiscrepancy]. A key challenge is that the dataset is used twice, both to estimate the parameter and to compute the test statistic, without splitting it into separate estimation and test sets. To achieve the correct level, the test must take account of this dependence, both via a more in-depth theoretical analysis and by using a suitable method for approximating the test threshold. In this initial version of the work, we only consider the second aspect, using the parametric bootstrap [stuteBootstrapBasedGoodnessoffittests] to achieve the correct level without data splitting.
2 Background: Kernel Stein Discrepancies for Testing and Estimation
We will now briefly review the use of the KSD for testing and estimation. Denote by $\mathcal{X}$ the data space and by $\mathcal{P}(\mathcal{X})$ the set of all Borel distributions on $\mathcal{X}$. For simplicity, we will focus on $\mathcal{X} \subseteq \mathbb{R}^d$, but note that KSDs have been developed for other data types [Yang2018, Yang2019, kanagawaKernelSteinTest, Fernandez2019, fernandezKernelizedSteinDiscrepancy, Xu2020directional, Xu2021]. The KSD is a function $\mathrm{KSD}: \mathcal{P}(\mathcal{X}) \times \mathcal{P}(\mathcal{X}) \to [0, \infty)$ which measures the similarity between two distributions $\mathbb{P}, \mathbb{Q} \in \mathcal{P}(\mathcal{X})$. Although it is not a probability metric, it is closely connected to the class of IPMs [Muller1997], which measure similarity as follows:
$$D_{\mathcal{F}}(\mathbb{P}, \mathbb{Q}) = \sup_{f \in \mathcal{F}} \left| \mathbb{E}_{X \sim \mathbb{P}}[f(X)] - \mathbb{E}_{Y \sim \mathbb{Q}}[f(Y)] \right|. \qquad (1)$$
Let $\mathcal{H}$ be the RKHS associated to a symmetric positive definite kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ [Berlinet2004] and let $\mathcal{H}^d$ denote the $d$-dimensional tensor product of $\mathcal{H}$. The Langevin kernel Stein discrepancy [oatescontrolfunctionals, chwialkowskikernelgoodnessfit, liuksdgoodnessoffit, gorhamsamplequalitystein] is obtained by considering an IPM with $\mathcal{F} = \{\mathcal{A}_p f : \|f\|_{\mathcal{H}^d} \leq 1\}$, where $\mathcal{A}_p f(x) = \langle f(x), \nabla_x \log p(x) \rangle + \nabla_x \cdot f(x)$. The operator $\mathcal{A}_p$ is called the Langevin Stein operator and $p$ is the Lebesgue density of $\mathbb{P}$. Since $\mathbb{E}_{X \sim \mathbb{P}}[\mathcal{A}_p f(X)] = 0$ under mild conditions on $p$ and $k$, using this choice of $\mathcal{F}$, Equation 1 simplifies to:
$$\mathrm{KSD}(\mathbb{Q} \,\|\, \mathbb{P}) = \sup_{\|f\|_{\mathcal{H}^d} \leq 1} \mathbb{E}_{X \sim \mathbb{Q}}[\mathcal{A}_p f(X)] = \sqrt{\mathbb{E}_{X, X' \sim \mathbb{Q}}[k_p(X, X')]},$$
where $k_p$ is a Stein reproducing kernel obtained by applying the operator $\mathcal{A}_p$ to $k$ in each argument.
Let $\mathbb{Q}_n = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}$, where $\delta_x$ denotes a Dirac measure at $x$. The KSD can be straightforwardly computed in this case as a single V-statistic, $\mathrm{KSD}^2(\mathbb{Q}_n \,\|\, \mathbb{P}) = \frac{1}{n^2} \sum_{i,j=1}^n k_p(x_i, x_j)$, at a cost of $O(n^2)$. Under mild regularity conditions, the KSD is a statistical divergence, meaning that $\mathrm{KSD}(\mathbb{Q} \,\|\, \mathbb{P}) = 0$ if and only if $\mathbb{P} = \mathbb{Q}$; see [Theorem 2.1, chwialkowskikernelgoodnessfit] and [Proposition 1, barpMinimumSteinDiscrepancy]. The KSD is convenient in the context of unnormalised models since it depends on $\mathbb{P}$ only via $\nabla \log p$, which can be evaluated without knowledge of the normalisation constant of $p$. We now briefly recall how it can be used for goodness-of-fit testing and estimation.
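To make the V-statistic concrete, here is a minimal sketch (not from the paper) computing the KSD$^2$ V-statistic for one-dimensional data with a Gaussian kernel; the function name, the bandwidth `ell`, and the `score` argument (the model's score function $\nabla \log p$) are our own illustrative choices.

```python
import numpy as np

def ksd_vstat(x, score, ell=1.0):
    """V-statistic estimate of KSD^2(Q_n || P) for 1-D data x.

    `score` maps x to d/dx log p(x), so the normalisation constant of p
    is never needed. Cost is O(n^2) from the full Stein kernel matrix.
    """
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / (2 * ell**2))           # Gaussian kernel k(x_i, x_j)
    dkdx = -d / ell**2 * k                     # dk/dx
    dkdy = d / ell**2 * k                      # dk/dy
    dkdxdy = (1 / ell**2 - d**2 / ell**4) * k  # d^2k/dxdy
    s = score(x)
    # Langevin Stein kernel k_p(x_i, x_j)
    kp = dkdxdy + s[:, None] * dkdy + s[None, :] * dkdx + np.outer(s, s) * k
    return kp.mean()
```

For samples drawn from the model itself the statistic concentrates near zero, while a misspecified score inflates it.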
Goodness-of-fit testing with KSD
In goodness-of-fit testing, we would like to test $H_0: \mathbb{Q} = \mathbb{P}$ against $H_1: \mathbb{Q} \neq \mathbb{P}$. A natural approach is to compute $\mathrm{KSD}(\mathbb{Q} \,\|\, \mathbb{P})$ and check whether this quantity is zero (i.e. $H_0$ holds) or not (i.e. $H_1$ holds) [liuksdgoodnessoffit, chwialkowskikernelgoodnessfit]. Of course, since we only have access to $\mathbb{Q}_n$ instead of $\mathbb{Q}$, this idealised procedure is replaced by the evaluation of $\mathrm{KSD}^2(\mathbb{Q}_n \,\|\, \mathbb{P})$. The question then becomes whether or not this quantity is further away from zero than expected under $H_0$ given the sampling error associated with a dataset of size $n$.
To determine whether $H_0$ should be rejected, we need to select an appropriate threshold $\gamma$, which will depend on the level $\alpha$ of the test. More precisely, $\gamma$ should be set to the $(1-\alpha)$-quantile of the distribution of $\mathrm{KSD}^2(\mathbb{Q}_n \,\|\, \mathbb{P})$ under $H_0$. This distribution will usually be unknown a priori, but can be approximated using a bootstrap method. A common example is the wild bootstrap [shaodependentwildbootstrap, leuchtdepwildbootstrap], which was specialised for kernel tests by chwialkowskiwildbootstrapkerneltests.
Minimum distance estimation with KSD
As shown by barpMinimumSteinDiscrepancy, the KSD can also be used for parameter estimation through minimum distance estimation [Wolfowitz1957]. Given a parametric family $\{\mathbb{P}_\theta\}_{\theta \in \Theta}$ and observations $\{x_i\}_{i=1}^n \sim \mathbb{Q}$, a natural estimator is
$$\hat{\theta}_n = \operatorname*{arg\,min}_{\theta \in \Theta} \mathrm{KSD}^2(\mathbb{Q}_n \,\|\, \mathbb{P}_\theta).$$
Under regularity conditions, the estimator $\hat{\theta}_n$ approaches the KSD-optimal parameter $\theta_0$ as $n \to \infty$. The use of the KSD for estimation was later extended by [Grathwohl2020, matsubaraRobustGeneralisedBayesian, Gong2020]. These estimators are also closely related to score-matching estimators [Hyvarinen2006]; see [Theorem 2, barpMinimumSteinDiscrepancy] for details.
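As a minimal illustration of this estimator, the sketch below minimises the KSD$^2$ V-statistic over a grid of candidate means for a $N(\theta, 1)$ model; the grid search and the Gaussian kernel are our own simplifications (in practice a gradient-based optimiser, or the closed-form solution available for exponential families, would be used).

```python
import numpy as np

def min_ksd_gaussian_mean(x, ell=1.0, grid_size=201):
    """Minimum-KSD estimate of theta for the model N(theta, 1), found by
    brute-force grid search over the KSD^2 V-statistic."""
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / (2 * ell**2))
    dkdx = -d / ell**2 * k
    dkdy = -dkdx
    dkdxdy = (1 / ell**2 - d**2 / ell**4) * k
    best_theta, best_val = None, np.inf
    for theta in np.linspace(x.min(), x.max(), grid_size):
        s = theta - x   # score of N(theta, 1): d/dx log p(x) = theta - x
        kp = dkdxdy + s[:, None] * dkdy + s[None, :] * dkdx + np.outer(s, s) * k
        val = kp.mean()              # KSD^2 V-statistic at this theta
        if val < best_val:
            best_theta, best_val = theta, val
    return best_theta
```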
3 Methodology
We now consider a novel composite goodness-of-fit test where both estimation and testing are based on the KSD with some fixed kernel $k$. The new test consists of the two following stages:
Stage 1 (Estimation): compute $\hat{\theta}_n = \operatorname*{arg\,min}_{\theta \in \Theta} \mathrm{KSD}^2(\mathbb{Q}_n \,\|\, \mathbb{P}_\theta)$.
Stage 2 (Testing): reject $H_0$ if $\mathrm{KSD}^2(\mathbb{Q}_n \,\|\, \mathbb{P}_{\hat{\theta}_n}) > \gamma$, where $\gamma$ is computed using one of Algorithm 1 or Algorithm 2.
Note that we could possibly replace Stage 1 with another estimator, such as maximum likelihood estimation, but this would require knowledge of the normalisation constant of the likelihood, which may be intractable in many cases. To implement Stage 2, we require a bootstrap algorithm to estimate the threshold $\gamma$. A natural first approach is to use the wild bootstrap, since it is the current gold standard for kernel-based goodness-of-fit tests. Algorithm 1 gives the implementation of a composite test using the wild bootstrap, where $\alpha$ is the desired test level and $B$ is the number of bootstrap samples. Note that in our implementation the choice of sampling the bootstrap weights from a Rademacher distribution assumes that $x_1, \ldots, x_n$ are independent.
If we make the unrealistic assumption that $\hat{\theta}_n = \theta_0$, then we are back in the setting considered by chwialkowskikernelgoodnessfit and liuksdgoodnessoffit, and the wild bootstrap works as follows. Under $H_0$, as $n \to \infty$, the test statistic and the wild bootstrap statistic converge to the same distribution, thus $\gamma$ will converge to the $(1-\alpha)$-quantile of the null distribution [Theorems 1 and 2, chwialkowskiwildbootstrapkerneltests], and the test will reject with probability $\alpha$, as desired. Under $H_1$, the test statistic will diverge while the bootstrap statistic will converge to some fixed distribution, hence the probability that the test rejects goes to 1 as $n \to \infty$.
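The wild bootstrap step can be sketched as follows; `kp` is the $n \times n$ Stein kernel matrix, the Rademacher weights reflect the independence assumption on the data, and the helper name is our own.

```python
import numpy as np

def wild_bootstrap_threshold(kp, alpha=0.05, n_boot=500, rng=None):
    """(1 - alpha)-quantile of the wild-bootstrap distribution of the
    KSD^2 V-statistic. Each bootstrap replicate reweights the Stein kernel
    matrix `kp` with i.i.d. Rademacher signs, mimicking the fluctuations
    of the statistic under the null."""
    rng = np.random.default_rng(rng)
    n = kp.shape[0]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        w = rng.choice([-1.0, 1.0], size=n)   # Rademacher weights
        stats[b] = w @ kp @ w / n**2          # bootstrapped V-statistic
    return np.quantile(stats, 1 - alpha)
```

The test rejects when the observed V-statistic exceeds this threshold; note the kernel matrix is computed only once.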
However, it is clear that we cannot assume that $\hat{\theta}_n = \theta_0$ in the finite data case. In fact, using the wild bootstrap in this fashion for composite tests may result in an incorrect type I error rate under $H_0$, and lost power under $H_1$. This is because the estimation stage of the composite test introduces a second source of error which this approach does not take account of when computing $\gamma$. The two sources of error that the test encounters are as follows. Recall that in an idealised setting we would use $\mathrm{KSD}^2(\mathbb{Q} \,\|\, \mathbb{P}_{\theta_0})$ as the test statistic. The first source of error is introduced because we must estimate this statistic with $\mathrm{KSD}^2(\mathbb{Q}_n \,\|\, \mathbb{P}_{\theta_0})$, as we only have access to a sample from $\mathbb{Q}$. This source of error also occurs in non-composite tests, and is accounted for correctly by the wild bootstrap. The second source of error is specific to composite tests, and occurs because we must further estimate $\theta_0$ with $\hat{\theta}_n$, as we do not have access to $\theta_0$. Algorithm 1 fixes the parameter estimate and then applies the bootstrap, thus failing to take account of the error in $\hat{\theta}_n$ and potentially computing an incorrect threshold. We can also view Algorithm 1 as a non-composite goodness-of-fit test against the wrong null hypothesis: by estimating the parameter and then applying the test, the test is not evaluating $H_0: \exists\, \theta_0 \in \Theta$ such that $\mathbb{P}_{\theta_0} = \mathbb{Q}$, but instead $H_0: \mathbb{P}_{\hat{\theta}_n} = \mathbb{Q}$.
Figure 1 demonstrates the impact of ignoring this error. It shows the power of our test for a Gaussian model as the data-generating distribution varies; see Appendix B for details. We can see that the wild bootstrap test has lower power in most settings, including the well-specified case where $H_0$ holds and the rejection rate corresponds to the type I error. It also demonstrates that the error depends on the sample size $n$. The estimator is consistent: as $n$ becomes large we expect the estimate of the parameter to converge to the true value, thus minimising the impact of the estimation error. However, for smaller $n$ it is necessary to use an alternative approach which is able to take account of the additional error.
To take account of both types of error we apply the parametric bootstrap described in Algorithm 2. This bootstrap approximates the distribution of the test statistic under $H_0$ by repeatedly resampling the observations from the fitted model and re-estimating the parameter, thus taking account of the error in the estimate of the parameter. In comparison to the wild bootstrap, this is substantially more computationally intensive because it requires repeatedly computing the kernel matrix on fresh data, whereas the wild bootstrap only computes the kernel matrix once. However, when $n$ is large and this extra computation is likely to be an issue, the size of the estimation error should also be reduced as the estimator converges to the true value of the parameter. Thus, under this high data regime, the wild and parametric bootstraps may achieve comparable power, and it may be reasonable to use the cheaper wild bootstrap. In the setting considered in Figure 1 we find that this is the case, with the parametric bootstrap performing substantially better than the wild bootstrap for small $n$, but comparably for large $n$.
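The parametric bootstrap can be sketched generically as below; the helper signatures `estimate(x)`, `sample(theta, n, rng)`, and `ksd_stat(x, theta)` are hypothetical stand-ins for the minimum-KSD estimator, the model sampler, and the KSD$^2$ V-statistic.

```python
import numpy as np

def parametric_bootstrap_test(x, estimate, sample, ksd_stat,
                              alpha=0.05, n_boot=200, rng=None):
    """Composite KSD test with a parametric bootstrap (a sketch of
    Algorithm 2). The parameter is re-estimated on every bootstrap sample
    drawn from the fitted model, so the threshold accounts for the
    estimation error as well as the sampling error."""
    rng = np.random.default_rng(rng)
    theta_hat = estimate(x)               # Stage 1: fit the model
    stat = ksd_stat(x, theta_hat)         # Stage 2: test statistic
    boot = np.empty(n_boot)
    for b in range(n_boot):
        xb = sample(theta_hat, len(x), rng)    # fresh data from fitted model
        boot[b] = ksd_stat(xb, estimate(xb))   # re-estimate, then evaluate
    threshold = np.quantile(boot, 1 - alpha)
    return stat > threshold, stat, threshold
```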
4 Illustration: The Kernel Exponential Family Model
We apply our method to a density estimation task from matsubaraRobustGeneralisedBayesian, to test whether a robust method was really necessary. The model is in the kernel exponential family, $p_\theta(x) \propto q(x) \exp(f(x))$, where $f$ is assumed to belong to some RKHS but is approximated by a finite sum of $M$ basis functions, $f(x) \approx \sum_{j=1}^{M} \theta_j \phi_j(x)$, and $q$ is a reference density. Specifically, we follow steinwartExplicitDescriptionReproducing in the choice of basis functions $\phi_j$. The dataset is comprised of the velocities of galaxies [postmanProbesLargescaleStructure, roederDensityEstimationConfidence]. An open question is how large $M$ needs to be for the model to have enough capacity to fit the data. matsubaraRobustGeneralisedBayesian set $M$ to a fixed value, and we test this assumption. See Appendix B for the full experiment configuration. Figure 2 shows the fit of the model for increasing values of $M$, and whether the test rejected the null hypothesis. We find that the test rejects for small $M$, but does not reject once $M$ is sufficiently large. This suggests that the value used by matsubaraRobustGeneralisedBayesian is a suitable choice, though it would also be reasonable to use a smaller value of $M$, which would decrease the computational cost of inference.
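Since the KSD only needs the score of the unnormalised density, a finite-basis kernel exponential family model slots directly into the test; a minimal sketch (basis functions left abstract, standard normal reference assumed, names ours):

```python
import numpy as np

def kef_score(x, theta, grad_basis, ref_score=lambda t: -t):
    """Score d/dx log p(x) of a finite-basis kernel exponential family
    model p(x) proportional to q(x) exp(sum_j theta_j phi_j(x)).

    The intractable normalisation constant differentiates away, which is
    why the KSD applies. `grad_basis(x)` returns the n-by-M matrix of
    basis derivatives phi_j'(x_i); the reference density q defaults to a
    standard normal, whose score is -x."""
    return ref_score(x) + grad_basis(x) @ theta
```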
5 Conclusion
In this initial work, we introduced a flexible composite goodness-of-fit test which requires only an unnormalised density. In the full version of this paper we will consider a general framework for composite goodness-of-fit tests, which will also include tests based on the maximum mean discrepancy. This will allow us to tackle a much wider set of problems, including composite testing for generative models with no computable density function. We will also include theoretical results showing that the test statistic converges under $H_0$, that the test is consistent under $H_1$, and that the wild and parametric bootstraps behave appropriately.
Appendix A Closed-form expression for the KSD estimator
For exponential family models, such as the Gaussian and kernel exponential families considered in this paper, there is a closed-form expression for the KSD estimator. Let the density of a model in the exponential family be
$$p_\theta(x) \propto q(x) \exp\left( \langle \eta(\theta), t(x) \rangle \right),$$
with $\eta: \Theta \to \mathbb{R}^m$ an invertible map, $t: \mathcal{X} \to \mathbb{R}^m$ a sufficient statistic, and $q$ a reference density. Since the score $\nabla_x \log p_\theta$ is linear in $\eta(\theta)$, the V-statistic $\mathrm{KSD}^2(\mathbb{Q}_n \,\|\, \mathbb{P}_\theta)$ is a quadratic function of $\eta(\theta)$, and the KSD estimator of $\theta$ is given in closed form by
$$\hat{\theta}_n = \eta^{-1}\left( -\tfrac{1}{2} \Lambda^{-1} b \right),$$
where $\Lambda$ and $b$ are, respectively, the quadratic and linear coefficients of this quadratic. They are based on functions of the kernel $k$, the sufficient statistic $t$, and the reference density $q$, which depend on the specific model and are evaluated at the data. The detailed derivation of this result, and the explicit expressions for $\Lambda$ and $b$, can be found in [Appendix D.3, barpMinimumSteinDiscrepancy] or matsubaraRobustGeneralisedBayesian.
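As a concrete special case (our own worked example, not the general expressions from the references): for the model $N(\theta, 1)$ with a Gaussian kernel, the cross-derivative terms of the quadratic cancel for a translation-invariant kernel, and the minimiser reduces to a kernel-weighted sample mean.

```python
import numpy as np

def closed_form_ksd_mean(x, ell=1.0):
    """Exact minimum-KSD estimate of theta for the model N(theta, 1).

    With score s(x) = theta - x, KSD^2 = a*theta^2 + b*theta + c with
    a = mean(k_ij) and b = -mean((x_i + x_j) k_ij); the dk/dx and dk/dy
    contributions cancel for a translation-invariant kernel. The
    minimiser -b/(2a) is therefore a kernel-weighted sample mean."""
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / (2 * ell**2))   # Gaussian kernel matrix
    return (k @ x).sum() / k.sum()
```

For well-separated data this downweights isolated points relative to the plain sample mean, but for unimodal data the two estimates are close.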
Appendix B Experiment details
B.1 Figure 1
For each datapoint in the plot, the power is computed as follows:
1. For four random seeds:
   (a) For a fixed number of repeats: sample $n$ points from the data-generating distribution, then run the test.
   (b) Compute the power, i.e. the fraction of repeats during which the null hypothesis is rejected.
2. Compute the mean and standard error of the power over the seeds.
We find the standard error to be sufficiently small that it is not visible in the figure.
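The procedure above can be sketched as follows (hypothetical helpers: `sample_data(rng)` draws one dataset, `run_test(x, rng)` returns True when the null is rejected):

```python
import numpy as np

def estimate_power(run_test, sample_data, n_repeats=100, n_seeds=4):
    """Monte-Carlo power estimate: fraction of repeats in which the null
    is rejected, with mean and standard error over independent seeds."""
    powers = []
    for seed in range(n_seeds):
        rng = np.random.default_rng(seed)
        rejects = sum(bool(run_test(sample_data(rng), rng))
                      for _ in range(n_repeats))
        powers.append(rejects / n_repeats)
    powers = np.asarray(powers)
    return powers.mean(), powers.std(ddof=1) / np.sqrt(n_seeds)
```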
Estimator
We estimate only the mean, which we denote by $\theta$. The variance is known and specified by the null hypothesis. We use the closed-form KSD estimator as defined above, specialised to this Gaussian model.
Kernel
Gaussian kernel: $k(x, y) = \exp\left( -\frac{(x - y)^2}{2\ell^2} \right)$, with a fixed choice of lengthscale $\ell$.
Bootstrap and test
The wild bootstrap (Algorithm 1) and the parametric bootstrap (Algorithm 2) are each configured with a fixed number of bootstrap samples $B$. We set the same level $\alpha$ in both cases.
B.2 Figure 2
Data normalisation
Following matsubaraRobustGeneralisedBayesian, we normalise the dataset.
Estimator
We use the closed-form KSD estimator as defined above, specialised to the kernel exponential family model.
Kernel
Bootstrap and test
Parametric bootstrap (Algorithm 2), configured with a fixed number of bootstrap samples $B$ and level $\alpha$.