# Generalised Kernel Stein Discrepancy (GKSD): A Unifying Approach for Non-parametric Goodness-of-fit Testing

Non-parametric goodness-of-fit testing procedures based on kernel Stein discrepancies (KSD) are promising approaches to validate general unnormalised distributions in various scenarios. Existing works have focused on studying optimal kernel choices to boost test performance. However, Stein operators are generally non-unique, and different choices of Stein operator can also have a considerable effect on test performance. In this work, we propose a unifying framework, the generalised kernel Stein discrepancy (GKSD), to theoretically compare and interpret different Stein operators in performing KSD-based goodness-of-fit tests. We show explicitly how the proposed GKSD framework generalises existing Stein operators and their corresponding tests. In addition, we show that the GKSD framework can be used as a guide to develop kernel-based non-parametric goodness-of-fit tests for complex new data scenarios, e.g. truncated distributions or compositional data. Experimental results demonstrate that the proposed tests control type-I error well and achieve higher test power than existing approaches, including the test based on maximum-mean-discrepancy (MMD).

## 1 Introduction

Stein’s method (barbour2005introduction) provides an elegant probabilistic tool for comparing distributions based on Stein operators acting on a broad class of test functions, which has been used to tackle various problems in statistical inference, random graph theory, and computational biology. Modern machine learning tasks, such as density estimation (hyvarinen2005estimation; liu2019estimating; wenliang2018learning), model criticism (kim2016examples; Lloyd2015; sutherland2016generative), or generative modelling (goodfellow2014generative; li2017mmd), may extensively involve modelling and learning with intractable densities, where the normalisation constant (or partition function) cannot be obtained in closed form. Stein operators may only require access to the distribution through the differential (or difference) of the log density function (or mass function), which avoids the need to know the normalisation constant. This is particularly useful for studying unnormalised models (hyvarinen2005estimation). As such, Stein’s method has recently caught the attention of the machine learning community and various practical applications have been developed, including variational methods (liu2016stein), implicit model learning (li2017gradient), approximate inference (huggins2018random), non-convex optimisation (detommaso2018stein; sedghi2014provable), and sampling techniques (chen2018stein; chen2019stein; oates2019convergence).

The goodness-of-fit testing procedure aims to check the null hypothesis $H_0: p = q$, where $q$ is the known target distribution and $p$ is the unknown data distribution, only accessible through a set of samples $x_1, \dots, x_n \sim p$. (As only one set of samples is observed, goodness-of-fit testing is sometimes also referred to as the one-sample test or one-sample problem. This is opposed to the two-sample problem, where the distribution $q$ is also unknown and appears in sample form.) Non-parametric goodness-of-fit testing refers to the scenario where the assumptions made on the distributions $p$ and $q$ are minimal, i.e. the distributions are not assumed to be in any parametric family. By contrast, parametric tests (e.g. Student’s t-test or normality tests) assume a pre-defined parametric family to be tested against and usually deal with summary statistics such as means or standard deviations, which can be more restrictive when comparing full distributions. Kernel-based methods have been applied to compare distributions via rich-enough reproducing kernel Hilbert spaces (RKHS) (RKHSbook) and have achieved state-of-the-art results for non-parametric two-sample tests (gretton2012kernel) or independence tests (gretton2008kernel). Combined with well-defined Stein operators, the kernel Stein discrepancy (KSD) (gorham2015measuring; ley2017stein) has been developed for non-parametric goodness-of-fit testing of unnormalised models, and has demonstrated superior test performance in various scenarios including Euclidean data (chwialkowski2016kernel; liu2016kernelized), discrete data (yang2018goodness), point processes (yang2019stein), latent variable models (kanagawa2019kernel), conditional densities (jitkrittum2020testing), censored data (fernandez2020kernelized), directional data (xu2020stein), as well as data on Riemannian manifolds (xu2021stein). It is worth noting that in the above-mentioned works on goodness-of-fit tests, specific Stein operators had to be developed independently to address the statistical inference problem for each data scenario, and these Stein operators have diverse forms and seem to be unconnected. Beyond goodness-of-fit testing, KSD has recently been studied in the context of numerical integration (barp2018riemannian; liu2018riemannian), density estimation (barp2019minimum; le2020diffusion) and measure transport (fisher2020measure).

To improve the performance of goodness-of-fit testing procedures, existing works have focused on selecting (kubler2020learning; lim2020kernel) and learning (gretton2012optimal; sutherland2016generative; liu2020learning) an optimal kernel function, which is adaptive to the finite sample observations. In addition, using techniques related to kernel mean embedding (muandet2017kernel), KSD-based tests also enable the extraction of distributional features to perform computationally efficient tests and model criticism (jitkrittum2017kernel; xu2021stein). Nonetheless, the choice of Stein operator can also have a considerable effect on test performance, but this dimension of the research has so far been mostly ignored. Previous works have only demonstrated the non-uniqueness of valid Stein operators, e.g. yang2018goodness pointed out (in their Section 3.2) that for random graph models the Stein operator can be built from the indicator function or the normalised Laplacian. Moreover, fernandez2020kernelized derived three different Stein operators for censored data based on various properties from survival analysis and empirically illustrated their different test performances.

The main contribution of this paper is threefold.

1) Our first contribution is to propose a unifying framework for KSD-based tests, called generalised kernel Stein discrepancy (GKSD). Under this unifying framework, we are able to make connections between the existing KSD-based methods for different data scenarios (introduced in Section 2), via an auxiliary function that we define and discuss in Section 3. With the appropriate choice of auxiliary functions, we then discuss the generalisation power of GKSD by incorporating the designated aspect of different testing scenarios, such as censoring information (fernandez2020kernelized), geometric structures of data (barp2018riemannian; xu2020stein; xu2021stein), and latent variables in the model (kanagawa2019kernel).

2) Our second contribution is to provide comparisons and interpretations for different KSD-based tests with the aid of the auxiliary functions. For example, fernandez2020kernelized has empirically shown different test performances for censored data using the survival-KSD (sKSD) and the martingale-KSD (mKSD). With the GKSD framework, we are able to compare sKSD and mKSD from the perspective of different choices of auxiliary functions, which impose different treatments on the censored part and the uncensored part of the data. Such analysis and interpretation help to explain the reported empirical test performances.

3) Our third contribution applies GKSD for developing novel Stein goodness-of-fit tests. Understanding the role of auxiliary functions in different KSD methods, we provide a systematic approach to derive valid KSD-based tests on novel testing scenarios. In Section 4, we study the goodness-of-fit testing procedures for distributions with domain constraint. We derive the bounded-domain KSD (bd-KSD) and its corresponding statistical properties. Then we present case studies on testing truncated distributions and compositional data, i.e. data defined on a simplex/hypersimplex. We also compare the test performances with various choices of Stein operators in each case. Then we discuss the optimal choice of Stein operators.

## 2 Preliminaries

We first review Stein’s method and a set of existing Stein operators developed for testing various data scenarios, followed by a brief reminder on KSD-based goodness-of-fit testing procedure.

### 2.1 Stein’s method and Stein operators

Stein’s method usually refers to the Stein characterisation of distributions. Given a distribution $q$, an operator $T_q$ is called a Stein operator w.r.t. $q$ if the following Stein’s identity holds for some test function $f$: $\mathbb{E}_q[T_q f] = 0$. The class of such test functions is called the Stein class for $q$. We start by introducing the Stein operator and the variations of Stein characterisations in various data scenarios.

##### Euclidean Stein operator

We first review the Stein operator for continuous densities in Euclidean space with Cartesian coordinates (gorham2015measuring; ley2017stein), which is also referred to as the Langevin-diffusion Stein operator (barp2019minimum). (Another important approach to developing Stein operators is via Barbour’s generator approach (barbour1988stein).) Let $x \in \mathbb{R}^d$ and let $f_i$, for $i = 1, \dots, d$, be scalar-valued functions on $\mathbb{R}^d$. Then $f(x) = (f_1(x), \dots, f_d(x))^\top$ defines a vector-valued function $f: \mathbb{R}^d \to \mathbb{R}^d$. Let $q$ be a smooth probability density on $\mathbb{R}^d$ which vanishes at infinity. For a bounded smooth function $f$, the Stein operator $T_q$ is defined by

$$T_q f(x) = f(x)^\top \nabla \log q(x) + \nabla \cdot f(x). \tag{1}$$

The Stein’s identity holds for $T_q$ in Eq. (1) due to integration by parts on $\mathbb{R}^d$: $\mathbb{E}_q[T_q f] = \int_{\mathbb{R}^d} \nabla \cdot \big(q(x) f(x)\big)\, dx = 0$, where the last equality holds since $q$ vanishes at infinity. Since the Stein operator $T_q$ depends on the density $q$ only through the derivatives of $\log q$, it does not involve the normalisation constant of $q$, which is a useful property for dealing with unnormalised models (hyvarinen2005estimation).
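The identity above can be checked numerically. The following is a minimal sketch in one dimension, assuming a standard Gaussian target known only up to its normalising constant and the test function $f = \sin$ (both are illustrative choices, not taken from the paper); expectations are computed by quadrature rather than sampling.

```python
import numpy as np

def trapezoid(y, x):
    """Composite trapezoidal rule, written out to avoid NumPy version differences."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def stein_operator_1d(f, df, score):
    """Eq. (1) in one dimension: (T_q f)(x) = f(x) * d/dx log q(x) + f'(x)."""
    return lambda x: f(x) * score(x) + df(x)

# Assumed toy target: unnormalised standard Gaussian, q(x) proportional to exp(-x^2 / 2).
unnormalised_q = lambda x: np.exp(-0.5 * x**2)
score = lambda x: -x                      # d/dx log q(x); the normaliser drops out

# Assumed bounded smooth test function and its derivative.
f, df = np.sin, np.cos

grid = np.linspace(-12.0, 12.0, 200_001)
Tf = stein_operator_1d(f, df, score)

# Integral of (T_q f) * q: proportional to E_q[T_q f], so it vanishes with it.
stein_expectation = trapezoid(Tf(grid) * unnormalised_q(grid), grid)

# The same operator under a shifted density p = N(1, 1) no longer has zero mean.
shifted_p = lambda x: np.exp(-0.5 * (x - 1.0) ** 2)
mismatch = trapezoid(Tf(grid) * shifted_p(grid), grid) / trapezoid(shifted_p(grid), grid)
```

Here `stein_expectation` is zero up to quadrature error, while `mismatch` equals $-\mathbb{E}_p[\sin x]$ for the shifted density, which is the kind of non-zero signal a Stein discrepancy picks up.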

##### Censored-data Stein operator

In practical data scenarios such as medical trials or e-commerce, we encounter data with censoring, where the actual event time of interest (or survival time) is not accessible but, instead, a bound or interval to which the event time is known to belong is observed. fernandez2020kernelized proposed a set of Stein operators for right-censored data, where a lower bound of the event time is observed. Right-censored data is observed in the form of pairs $(x, \delta)$: the observation time $x$ is the minimum of the survival time and the censoring time $C$, and $\delta \in \{0, 1\}$ indicates whether the survival time ($\delta = 1$) or the censoring time ($\delta = 0$) is observed. Denote by $\mu_0$ the density of the event time and by $S_C$ the survival function of the censoring time $C$ (a survival function is defined as $S(x) = 1 - F(x)$, where $F$ is the c.d.f. of the corresponding time). The test function $\omega$ is assumed to vanish at the origin, i.e. $\omega(0) = 0$. The censored-data Stein operator $T_0$ is defined as

$$(T_0 \omega)(x, \delta) = \delta\, \frac{\big(\omega(x) S_C(x) \mu_0(x)\big)'}{S_C(x) \mu_0(x)}. \tag{2}$$

Denote by $\mathbb{E}_0$ the expectation w.r.t. the observation pair $(x, \delta)$ when the event time has density $\mu_0$; this needs to be distinguished from $\mathbb{E}_{\mu_0}$, which takes the expectation over $x \sim \mu_0$. Note that the censoring distribution, e.g. $S_C$, remains unknown. By Eq. (17) in the Appendix, we have the Stein’s identity $\mathbb{E}_0[(T_0 \omega)(x, \delta)] = 0$.

The key challenge for this Stein operator is that the survival function $S_C$ of the censoring time is unknown and not included in the null hypothesis. Hence, fernandez2020kernelized applied tricks from survival analysis to derive a computationally feasible operator: the survival Stein operator. This operator has an unbiased estimate from the empirical observations. For the hazard function $\lambda_0$ associated with $\mu_0$, the survival Stein operator is defined as

$$(T^{(s)}_0 \omega)(x, \delta) = \delta \left( \omega'(x) + \frac{\lambda_0'(x)}{\lambda_0(x)} \omega(x) \right) - \lambda_0(x) \omega(x). \tag{3}$$

By the martingale identities, fernandez2020kernelized also proposed the martingale Stein operator,

$$(T^{(m)}_0 \omega)(x, \delta) = \delta\, \frac{\omega'(x)}{\lambda_0(x)} - \omega(x). \tag{4}$$

Details for known identities regarding survival analysis and martingales can be found in Appendix A.
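As an illustration of Eq. (3) and Eq. (4), the sketch below evaluates both operators and checks by quadrature that their expectations over the observation pair vanish under the null. The Weibull event-time model, the exponential censoring law and the test function are all assumptions made for this example; note that the censoring law enters only the expectation, never the operators themselves.

```python
import numpy as np

def trapezoid(y, x):
    """Composite trapezoidal rule."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def survival_stein(omega, d_omega, hazard, d_hazard):
    """Eq. (3): delta * (w'(x) + lambda0'(x) / lambda0(x) * w(x)) - lambda0(x) * w(x)."""
    return lambda x, delta: (delta * (d_omega(x) + d_hazard(x) / hazard(x) * omega(x))
                             - hazard(x) * omega(x))

def martingale_stein(omega, d_omega, hazard):
    """Eq. (4): delta * w'(x) / lambda0(x) - w(x)."""
    return lambda x, delta: delta * d_omega(x) / hazard(x) - omega(x)

# Assumed null model: Weibull event time with shape 2, so lambda0(x) = 2x.
hazard = lambda x: 2.0 * x
d_hazard = lambda x: 2.0 * np.ones_like(x)
S0 = lambda x: np.exp(-x**2)                 # survival function of the event time
mu0 = lambda x: hazard(x) * S0(x)            # density of the event time

# Assumed censoring law (unknown to the operators): exponential with rate 0.5.
S_C = lambda x: np.exp(-0.5 * x)
f_C = lambda x: 0.5 * np.exp(-0.5 * x)

# Assumed test function with omega(0) = 0, and its derivative.
omega = lambda x: x**2 * np.exp(-x)
d_omega = lambda x: (2.0 * x - x**2) * np.exp(-x)

def expectation_under_null(op, grid):
    """E_0[op(x, delta)]: uncensored pairs have density mu0 * S_C, censored ones f_C * S0."""
    uncensored = op(grid, 1.0) * mu0(grid) * S_C(grid)
    censored = op(grid, 0.0) * f_C(grid) * S0(grid)
    return trapezoid(uncensored + censored, grid)

grid = np.linspace(1e-6, 12.0, 400_001)      # start just above 0 to avoid dividing by lambda0(0) = 0
e_survival = expectation_under_null(survival_stein(omega, d_omega, hazard, d_hazard), grid)
e_martingale = expectation_under_null(martingale_stein(omega, d_omega, hazard), grid)
```

Both `e_survival` and `e_martingale` come out as zero up to quadrature error, consistent with the Stein’s identities for the two operators.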

##### Latent-variable Stein operator

Latent variable models are powerful tools in generative modelling and statistical inference. However, such models generally do not have closed-form density expressions due to the integral over the latent space. The latent Stein operator was constructed via samples of the latent variables and the corresponding conditional densities (kanagawa2019kernel). Let $q$ be the target distribution, which is not accessible in closed form, not even in its unnormalised version, and let $q(x|z)$ denote the conditional density given the latent variable $z$. Given samples $z_1, \dots, z_m$ of the latent variable, the latent variable Stein operator is defined as

$$T_{q,z} f(x) = \frac{1}{m} \sum_{j=1}^m T_{q(x|z_j)} f(x). \tag{5}$$

A closely related construction is the Stochastic Stein Operator (gorham2020stochastic), which has been developed for computationally efficient posterior sampling procedures. Additional details and comparisons with latent variable Stein operator are included in Appendix F.
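A minimal one-dimensional sketch of Eq. (5) follows. The Gaussian conditional $q(x|z) = \mathcal{N}(z, 1)$ and the latent draws are hypothetical; how the latent samples are obtained in practice is not addressed here.

```python
import numpy as np

def latent_stein_1d(f, df, conditional_score, z_samples):
    """Eq. (5) in 1-D: average over z_j of f(x) * d/dx log q(x | z_j) + f'(x)."""
    z = np.asarray(z_samples, dtype=float)

    def operator(x):
        scores = np.array([conditional_score(x, zj) for zj in z])   # shape (m,)
        return float(np.mean(f(x) * scores + df(x)))

    return operator

# Assumed toy conditional: q(x | z) = N(z, 1), so d/dx log q(x | z) = z - x.
conditional_score = lambda x, z: z - x
z_samples = [-1.0, 0.5, 2.0]                 # hypothetical latent draws

T = latent_stein_1d(np.sin, np.cos, conditional_score, z_samples)
value = T(0.3)                               # operator applied to f = sin, evaluated at x = 0.3
```

Only the conditional scores are needed, so the intractable marginal density never has to be evaluated.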

##### Second-order Stein operator

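To address distributions defined on Riemannian manifolds, Stein operators involving second-order differential operators have been studied (barp2018riemannian; le2020diffusion). For a smooth Riemannian manifold $\mathcal{M}$ and a scalar-valued function $\tilde f$ on $\mathcal{M}$, the second-order Stein operator for a density $q$ on $\mathcal{M}$ is defined as

$$T^{(2)}_q \tilde f(x) = \nabla \tilde f(x)^\top \nabla \log q + \Delta \tilde f(x), \tag{6}$$

where $\Delta$ denotes the corresponding Laplace-Beltrami operator. (Specifically, consider the coordinate basis vectors $\partial / \partial x_i$ on the tangent space of $\mathcal{M}$. The (Riemannian) gradient operator is $\nabla \tilde f = G^{-1} (\partial \tilde f / \partial x_1, \dots, \partial \tilde f / \partial x_d)^\top$, where $G$ denotes the metric tensor matrix; the divergence operator is $\nabla \cdot s = \frac{1}{\sqrt{\det G}} \sum_i \frac{\partial}{\partial x_i} \big( \sqrt{\det G}\, s_i \big)$ for a vector field $s = (s_1, \dots, s_d)$. In the Euclidean case, $G = I$, the identity matrix, which is independent of $x$.) We note that the second-order operator is also applicable to the Euclidean manifold, i.e. when the test function class is chosen in the particular form $f = \nabla \tilde f$, the Stein operator in Eq. (6) replicates that in Eq. (1).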

##### Coordinate-dependent Stein operator

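Consider a coordinate system $\theta = (\theta_1, \dots, \theta_d)$ that is defined almost everywhere on $\mathcal{M}$. For a density $q$ on $\mathcal{M}$, the Stein operator with the chosen coordinates is defined as

$$T^{(1)}_q f = \sum_{i=1}^d \left( \frac{\partial f_i}{\partial \theta_i} + f_i \frac{\partial}{\partial \theta_i} \log(qJ) \right), \tag{7}$$

where $J$ is the volume element. $T^{(1)}_q$ can be shown to be a Stein operator via differential forms and the relevant Stokes’ theorem (xu2020stein).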

### 2.2 Kernel Stein discrepancies (KSD)

With any well-defined Stein operator, we can choose an appropriate RKHS w.r.t. the data scenario to construct its corresponding KSD. Let $p$, $q$ be distributions satisfying regularity conditions for the relevant testing scenario, and let the test function class be the unit ball of the RKHS $\mathcal{H}$, $B_1(\mathcal{H}) = \{ f \in \mathcal{H} : \|f\|_{\mathcal{H}} \le 1 \}$. The KSD between distributions $p$ and $q$ is defined as

$$\mathrm{KSD}(p \| q; \mathcal{H}) = \sup_{f \in B_1(\mathcal{H})} \mathbb{E}_p[T_q f]. \tag{8}$$

It is known from Stein’s identity that, for any test function in the Stein class, $p = q$ implies $\mathrm{KSD}(p \| q; \mathcal{H}) = 0$. In the testing procedure, a desirable property of the discrepancy measure is that $\mathrm{KSD}(p \| q; \mathcal{H}) = 0$ if and only if $p = q$. As such, we require our RKHS to be sufficiently large to capture any possible discrepancy between $p$ and $q$, which requires mild regularity conditions (chwialkowski2016kernel, Theorem 2.2) for KSD to be a proper discrepancy measure. Algebraic manipulations produce the following quadratic form:

$$\mathrm{KSD}^2(p \| q) = \mathbb{E}_{x, \tilde x \sim p}[h_q(x, \tilde x)], \tag{9}$$

where $h_q(x, \tilde x) = \langle T_q k(x, \cdot), T_q k(\tilde x, \cdot) \rangle_{\mathcal{H}}$ does not involve the distribution $p$, and $k$ denotes the kernel associated with the RKHS $\mathcal{H}$.
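For the Euclidean operator of Eq. (1), the Stein kernel in Eq. (9) has a well-known closed form once a kernel is fixed. The sketch below assumes a Gaussian RBF kernel with a user-chosen bandwidth (an illustrative choice) and a user-supplied score function $\nabla \log q$; the function names are hypothetical.

```python
import numpy as np

def stein_kernel_matrix(X, score, bandwidth=1.0):
    """Matrix of h_q(x_i, x_j) for the operator of Eq. (1) and an RBF kernel.

    h_q(x, y) = s(x)^T s(y) k + s(x)^T grad_y k + s(y)^T grad_x k + trace(grad_x grad_y k),
    with s = grad log q and k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)).
    X must have shape (n, d); score maps an (n, d) array to an (n, d) array.
    """
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    S = score(X)                                   # (n, d) scores at the samples
    diff = X[:, None, :] - X[None, :, :]           # (n, n, d), entries x_i - x_j
    sq = np.sum(diff**2, axis=-1)
    K = np.exp(-sq / (2.0 * bandwidth**2))
    term_ss = (S @ S.T) * K
    # grad_y k = (x - y) / bw^2 * k   and   grad_x k = -(x - y) / bw^2 * k
    term_sx = np.einsum("id,ijd->ij", S, diff) / bandwidth**2 * K
    term_sy = -np.einsum("jd,ijd->ij", S, diff) / bandwidth**2 * K
    term_tr = (d / bandwidth**2 - sq / bandwidth**4) * K
    return term_ss + term_sx + term_sy + term_tr

def ksd_u_statistic(H):
    """Unbiased estimate of Eq. (9): average of the off-diagonal entries of H."""
    n = H.shape[0]
    return float((H.sum() - np.trace(H)) / (n * (n - 1)))
```

For example, with a standard Gaussian target one would call `stein_kernel_matrix(X, lambda Z: -Z)` and pass the result to `ksd_u_statistic`. Only the score of $q$ appears, so the normalising constant is never required.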

### 2.3 Goodness-of-fit tests with KSD

Now, suppose we have samples $x_1, \dots, x_n$ from the unknown distribution $p$. To test the null hypothesis $H_0: p = q$ against the (broad class of) alternative hypothesis $H_1: p \neq q$, KSD can be empirically estimated via Eq. (9) using U-statistics or V-statistics (van2000asymptotic); given the significance level of the test, the critical value can be determined by wild-bootstrap procedures (chwialkowski2014wild) or spectral estimation (jitkrittum2017linear) based on the Stein kernel matrix with entries $h_q(x_i, x_j)$; the rejection decision is then made by comparing the empirical test statistic with the critical value. In this way, a systematic procedure for non-parametric goodness-of-fit testing is obtained, which is applicable to unnormalised models.
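The testing procedure can be sketched as follows, given a precomputed Stein kernel matrix. This is a simplified variant that assumes i.i.d. samples, so the bootstrap multipliers are independent Rademacher signs; the wild bootstrap for dependent samples would use a correlated sign process instead. The function name, the default number of replicates and the seed are illustrative.

```python
import numpy as np

def wild_bootstrap_test(H, n_boot=1000, alpha=0.05, seed=0):
    """Bootstrap test for a V-statistic built from a Stein kernel matrix H.

    Assumes i.i.d. samples: each replicate re-weights the kernel matrix with
    independent random signs w_i in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    n = H.shape[0]
    statistic = H.mean()                              # V-statistic estimate of KSD^2
    signs = rng.choice([-1.0, 1.0], size=(n_boot, n))
    # Replicate b is (1 / n^2) * sum_ij w_i w_j H_ij for the b-th sign vector.
    replicates = np.sum((signs @ H) * signs, axis=1) / n**2
    p_value = (1 + np.sum(replicates >= statistic)) / (1 + n_boot)
    return {"statistic": float(statistic),
            "p_value": float(p_value),
            "reject": bool(p_value < alpha)}
```

The null hypothesis is rejected when the returned p-value falls below the significance level `alpha`.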

## 3 Generalised Kernel Stein Discrepancy (GKSD)

In this section, we generalise the Stein operator in a particular way studied in this paper that is simple enough to interpret but powerful enough to cover all the Stein operators discussed in Section 2.1.

### 3.1 Generalised Stein operator

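Let $f$ be an appropriately bounded test function as defined in Section 2.1. We consider a new operator for density $q$, the generalised Stein operator $T_{q,g}$, which also depends on an auxiliary function $g: \mathbb{R}^d \to \mathbb{R}^d$,

$$(T_{q,g} f)(x) = T_q (f \odot g)(x) = \sum_{i=1}^d \left[ g_i(x) f_i(x) \frac{\partial}{\partial x_i} \log q(x) + g_i(x) \frac{\partial}{\partial x_i} f_i(x) + f_i(x) \frac{\partial}{\partial x_i} g_i(x) \right], \tag{10}$$

where $\odot$ denotes the element-wise product. The Stein’s identity $\mathbb{E}_q[T_{q,g} f] = 0$ holds for all bounded functions $g$, by an integration-by-parts argument similar to the one derived for Eq. (1).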
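A one-dimensional sketch of Eq. (10) is given below, assuming a standard Gaussian target and the illustrative choices $f = \sin$ and $g(x) = 1/(1 + x^2)$. It checks the Stein’s identity by quadrature and confirms that a constant auxiliary function gives back Eq. (1).

```python
import numpy as np

def trapezoid(y, x):
    """Composite trapezoidal rule."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def generalised_stein_1d(f, df, g, dg, score):
    """Eq. (10) with d = 1: g f (log q)' + g f' + f g'."""
    return lambda x: g(x) * f(x) * score(x) + g(x) * df(x) + f(x) * dg(x)

# Assumed toy ingredients: standard Gaussian target, bounded f and g.
score = lambda x: -x
f, df = np.sin, np.cos
g = lambda x: 1.0 / (1.0 + x**2)
dg = lambda x: -2.0 * x / (1.0 + x**2) ** 2

grid = np.linspace(-12.0, 12.0, 200_001)
q = np.exp(-0.5 * grid**2)                        # unnormalised density on the grid
Tqg = generalised_stein_1d(f, df, g, dg, score)
identity_value = trapezoid(Tqg(grid) * q, grid)   # vanishes up to quadrature error

# Choosing g = 1 (so g' = 0) gives back the operator of Eq. (1).
one = lambda x: np.ones_like(x)
zero = lambda x: np.zeros_like(x)
plain = generalised_stein_1d(f, df, one, zero, score)
```

The auxiliary function reweights where the operator is sensitive without breaking the zero-mean property under $q$.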

With the Stein operator $T_{q,g}$, we define the generalised kernel Stein discrepancy (GKSD):

$$\mathrm{GKSD}_g(p \| q; \mathcal{H}) = \sup_{\|f\|_{\mathcal{H}} \le 1} \mathbb{E}_p[T_q (f \odot g)(x)]. \tag{11}$$

We note that $\mathrm{GKSD}_g$ also admits the following quadratic form, similar to Eq. (9).

###### Proposition 1.

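Let $\mathcal{H}$ be the RKHS associated with kernel $k$. For a fixed choice of bounded $g$,

$$\mathrm{GKSD}^2_g(p \| q; \mathcal{H}) = \mathbb{E}_{x, \tilde x \sim p}[h_{q,g}(x, \tilde x)], \tag{12}$$

where $h_{q,g}(x, \tilde x) = \langle T_{q,g} k(x, \cdot), T_{q,g} k(\tilde x, \cdot) \rangle_{\mathcal{H}}$.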
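To make the quadratic form concrete, the sketch below expands $h_{q,g}$ for a Gaussian RBF kernel. The expansion is derived here from Eq. (10) and the reproducing property rather than quoted from the paper, and the kernel choice and function names are assumptions. The auxiliary function is passed as two callables returning $g_i(x)$ and $\partial g_i / \partial x_i$ at each sample.

```python
import numpy as np

def gksd_kernel_matrix(X, score, g, grad_g_diag, bandwidth=1.0):
    """Sketch of the matrix h_{q,g}(x_i, x_j) for an RBF kernel.

    Writing b_i(x) = g_i(x) * d/dx_i log q(x) + d/dx_i g_i(x), Eq. (10) reads
    sum_i [b_i f_i + g_i d/dx_i f_i], and the reproducing property gives
    h_{q,g}(x, y) = sum_i [ b_i(x) b_i(y) k + b_i(x) g_i(y) d/dy_i k
                            + b_i(y) g_i(x) d/dx_i k + g_i(x) g_i(y) d2/(dx_i dy_i) k ].
    X has shape (n, d); score, g and grad_g_diag map (n, d) arrays to (n, d) arrays.
    """
    X = np.asarray(X, dtype=float)
    G = g(X)                                       # g_i at each sample
    B = G * score(X) + grad_g_diag(X)              # b_i at each sample
    diff = X[:, None, :] - X[None, :, :]           # (n, n, d), entries x - y
    K = np.exp(-np.sum(diff**2, axis=-1) / (2.0 * bandwidth**2))
    dK_dy = diff / bandwidth**2                    # (d/dy_i k) / k
    dK_dx = -dK_dy                                 # (d/dx_i k) / k
    d2K = 1.0 / bandwidth**2 - diff**2 / bandwidth**4   # (d2/(dx_i dy_i) k) / k
    Bx, By = B[:, None, :], B[None, :, :]
    Gx, Gy = G[:, None, :], G[None, :, :]
    summand = Bx * By + Bx * Gy * dK_dy + By * Gx * dK_dx + Gx * Gy * d2K
    return K * summand.sum(axis=-1)
```

Passing `np.ones_like` and `np.zeros_like` for `g` and `grad_g_diag` reduces this to the Stein kernel of the plain operator in Eq. (1).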

For different choices of the auxiliary function $g$, GKSD exhibits distinct diffusion patterns induced by $g$. We note that, by choosing $g \equiv 1$, GKSD recovers the KSD with the Stein operator in Eq. (1). Some related ideas have been discussed using interpretations based on the Fisher information metric (mijoule2018stein, Section 7.2), as well as in the form of an invertible matrix for the diffusion kernel Stein discrepancy (DKSD) (barp2019minimum, Theorem 1). Even though those formulations can have implicit connections to GKSD in Eq. (12), previous works did not study the relationship with existing KSDs proposed for various goodness-of-fit testing scenarios, which we now proceed to show.

### 3.2 Generalising existing Stein operators

The specific choice of $g$ and its interplay with $q$ can be helpful for understanding the conditions in various testing scenarios. In this section, we show that, with an appropriate choice of the auxiliary function $g$, GKSD is capable of generalising the KSDs derived from the Stein operators introduced in Section 2.1. To specify the notion of equivalence, we write $\triangleq$ for identical formulations, beyond equality between evaluation values. All proofs and detailed derivations are included in Appendix B.

###### Theorem 1 (Censored-data Stein operator).

Let the dimension of the data be $d = 1$ and, w.l.o.g., assume the test function $\omega$ vanishes at $0$. For $q = \mu_0$, choosing $g = S_C$, $T_{q,g}$ with test function $\omega$ recovers the censored-data Stein operator defined in Eq. (2),

$$\mathbb{E}_{\mu_0}[T_{\mu_0, S_C} \omega] \triangleq \mathbb{E}_0[(T_0 \omega)(x, \delta)] = 0. \tag{13}$$

It is not difficult to see that the result holds by directly applying the identity in Eq. (17) (explained in Appendix A). However, it is worth noting that during the testing procedure $S_C$ is unknown, so we do not have direct access to $g$ here. Moreover, the expectation on the l.h.s. of Eq. (13) is w.r.t. the density of the survival time, $\mu_0$, for GKSD, whereas the expectation on the r.h.s. is $\mathbb{E}_0$, w.r.t. the paired observation $(x, \delta)$ incorporating censoring information. Theorem 1 serves the purpose of explicitly demonstrating how a particular choice of auxiliary function can bridge the gap between the censored-data Stein operator and the Stein operator on distributions without censoring information. Moving on, it is also interesting to understand how the auxiliary function may explain the Stein operators in Eq. (3) and Eq. (4) when applied to GKSD, where the expectation is taken over the paired variable $(x, \delta)$.

###### Theorem 2 (Martingale Stein operator).

Assume the same setting as in Theorem 1. Further assume that the positive definite test function $\omega$ is integrable, so that $\int_0^x \mu_0(s) \omega(s)\, ds$ is finite, and that $\mu_0 > 0$ for the survival times, so that the inverse of its corresponding hazard function $\lambda_0$ is well-defined. For $q = \mu_0$, choosing

$$g(x) = \delta \lambda_0(x)^{-1} + (1 - \delta) \frac{\int_0^x \mu_0(s) \omega(s)\, ds}{\mu_0(x) \omega(x)},$$

$T_{q,g}$ with test function $\omega$ recovers the martingale Stein operator defined in Eq. (4),

$$\mathbb{E}_0[T_{\mu_0, g} \omega] \triangleq \mathbb{E}_0[(T^{(m)}_0 \omega)(x, \delta)] = 0. \tag{14}$$

The $\delta$-dependent decomposition of $g$ above reveals how censoring is incorporated in the martingale Stein operator, i.e. through the hazard function for uncensored data and through an interaction between the density and the test function in the censored part. Similarly, choosing a $\delta$-dependent auxiliary function can recover the survival Stein operator in Eq. (3).

###### Corollary 1 (Survival Stein operator).

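Assume the conditions in Theorem 2 hold. For $q = \mu_0$, choosing

$$g(x) = \delta + (1 - \delta) \frac{\int_0^x \mu_0(s) \omega(s) \lambda_0(s)\, ds}{\mu_0(x) \omega(x)},$$

$T_{q,g}$ with test function $\omega$ recovers the survival Stein operator defined in Eq. (3),

$$\mathbb{E}_0[T_{\mu_0, g} \omega] \triangleq \mathbb{E}_0[(T^{(s)}_0 \omega)(x, \delta)] = 0.$$

##### Comparisons between $T^{(m)}_0$ and $T^{(s)}_0$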

From Theorem 2 and Corollary 1, we now explicitly see:

1) in the uncensored part: the diffusion for martingale Stein operator is through the inverse of hazard function while the survival Stein operator has constant auxiliary function, replicating diffusion in the form of Eq. (1) in 1 dimension;

2) in the censored part: both Stein operators rely on an integral form in which the density and the test function interact, while the survival Stein operator additionally involves the hazard function within the integral, making it much harder to estimate empirically.

Our results show that for $T^{(s)}_0$, the censoring information is only extracted from the censored part of the data ($\delta = 0$), while the uncensored part of the data ($\delta = 1$) is treated exactly as in Eq. (1). For $T^{(m)}_0$, however, the censoring information is re-calibrated via both the censored and the uncensored part of the data, which results in more accurate empirical estimation compared to that of $T^{(s)}_0$. The theoretical interpretations corroborate the empirical findings reported in fernandez2020kernelized.
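The uncensored ($\delta = 1$) branches of the two auxiliary functions can be checked pointwise, as sketched below. The Weibull event-time model and the test function are assumptions made for this example. Only the $\delta = 1$ branch is verified here; the censored branch involves the expectation over the observation pair and is not covered by this pointwise check.

```python
import numpy as np

def generalised_stein_1d(omega, d_omega, g, dg, score):
    """Eq. (10) with d = 1, applied to the test function omega."""
    return lambda x: g(x) * omega(x) * score(x) + g(x) * d_omega(x) + omega(x) * dg(x)

# Assumed null model: Weibull event time with shape 2 (hazard 2x, survival exp(-x^2)).
hazard = lambda x: 2.0 * x
d_hazard = lambda x: 2.0 * np.ones_like(x)
score = lambda x: 1.0 / x - 2.0 * x              # d/dx log mu0, with mu0(x) = 2x exp(-x^2)

# Assumed test function with omega(0) = 0, and its derivative.
omega = lambda x: x**2 * np.exp(-x)
d_omega = lambda x: (2.0 * x - x**2) * np.exp(-x)

x = np.linspace(0.1, 4.0, 50)

# delta = 1 branch of the martingale auxiliary function: g = 1 / lambda0.
g_m = lambda x: 1.0 / hazard(x)
dg_m = lambda x: -d_hazard(x) / hazard(x) ** 2
lhs_m = generalised_stein_1d(omega, d_omega, g_m, dg_m, score)(x)
rhs_m = d_omega(x) / hazard(x) - omega(x)        # Eq. (4) with delta = 1

# delta = 1 branch of the survival auxiliary function: g = 1.
g_s = lambda x: np.ones_like(x)
dg_s = lambda x: np.zeros_like(x)
lhs_s = generalised_stein_1d(omega, d_omega, g_s, dg_s, score)(x)
rhs_s = d_omega(x) + d_hazard(x) / hazard(x) * omega(x) - hazard(x) * omega(x)   # Eq. (3), delta = 1
```

In both cases the left-hand and right-hand sides agree pointwise, which follows from $\mu_0 = \lambda_0 S_0$ and $S_0' = -\mu_0$, with $S_0$ the survival function of the event time.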

###### Theorem 3 (Latent-variable Stein operator).

Assume the conditional densities $q(x|z_j)$ vanish at infinity. Given samples $z_1, \dots, z_m$ and a suitably chosen sample-based, $z$-dependent auxiliary function $g$, $T_{q,g}$ recovers the latent Stein operator in Eq. (5),

$$\mathbb{E}_q[T_{q,g} f] \triangleq \sum_j \mathbb{E}_{q(x|z_j)}[T_{q(x|z_j)} f] = 0.$$

By choosing the auxiliary function as a finite sum of delta measures on the latent variable locations, the auxiliary function effectively performs the sampling procedure used to construct the random kernel for the latent Stein operator proposed in kanagawa2019kernel, which bypasses the intractability arising from the integral over the latent variables.

###### Theorem 4 (Second-order Stein operator).

Choosing , recovers the second-order Stein operator defined in Eq. (6),

The choice above produces the extra order on the differential operator. A more general formulation, in which the auxiliary function can be chosen based on a linear operator, is discussed in Appendix G. Choosing the linear operator itself to be the differential operator automatically recovers such a second-order Stein operator. The multivariate version and the Riemannian manifold version are also applicable. Details are included in the Appendix.

###### Theorem 5 (Coordinate-dependent Stein operator).

Choosing the auxiliary function based on the volume element $J$, $T_{q,g}$ recovers the Stein operator defined in Eq. (7).

Proof. The result follows from separating the last term in Eq. (7): $\frac{\partial}{\partial \theta_i} \log(qJ) = \frac{\partial}{\partial \theta_i} \log q + \frac{\partial}{\partial \theta_i} \log J$.

With the particular choice of coordinate system, choosing the auxiliary function to be the log of the Jacobian can be interpreted as changing the diffusion pattern to incorporate the coordinate system when taking the expectation over the density. This can be shown explicitly via Stokes’ theorem in differential form (xu2020stein). In addition, the idea of using an auxiliary function to incorporate domain properties or constraints can be very useful for problems where the data has a complicated and irregular domain. We provide a detailed study of this for goodness-of-fit testing in Section 4.

##### Remarks

Beyond generalising KSDs in various testing scenarios, GKSD can also recover learning objectives for unnormalised models such as score matching (hyvarinen2005estimation) i.e. score matching can be derived in terms of GKSD with specific choice of kernel and auxiliary function.

###### Theorem 6.

Let , be scalar functions. Choosing , and kernel , GKSD in the form of Eq. (10) recovers the score matching objective.

Detailed reviews on score matching and additional discussions are provided in Appendix D.

## 4 GKSD Application: Testing Data with Domain Constraints

We have shown that different choices of auxiliary functions in the generalised Stein operator are able to produce appropriate Stein operators for testing in various data scenarios. GKSD can then be used to develop a systematic approach for new kernel Stein tests when appropriate auxiliary functions are chosen. In this section, we apply GKSD to testing data with general domain constraints.

Let $q$ be a probability distribution defined on a compact domain $V \subset \mathbb{R}^d$ with boundary $\partial V$. (It is common that $V$ is embedded in some non-compact domain, e.g. a distribution truncated from $\mathbb{R}^d$.) Denote the unnormalised density by $\tilde q$. Common examples include the truncated Gaussian distribution on an interval or compositional data defined on a simplex/hyper-simplex. Complex boundaries such as polygons (liu2019estimating; yu2020generalized) or non-negativity constraints for graphical models (yu2018graphical) have been studied. Such a problem setting is commonly encountered when the observed data is only a subset of the domain or carries structural constraints, as for compositional data. For instance, if a local government would like to study the spread of a disease during a pandemic while infection information from other countries is not accessible, one may need to validate model assumptions on the domain truncated by the designated border.

### 4.1 Stein operators on compact domains

To create the KSD-type test for data on the domain $V$, we first consider Stein operators for densities on $V$. We develop such a Stein operator guided by the generalised Stein operator in Eq. (10). Unlike densities on unbounded domains, which are commonly assumed to vanish at infinity, densities on a compact domain may not vanish at the boundary. Hence, direct application of the Stein operator in Eq. (1) on $V$ may require knowledge of the normalised density at the boundary, which defeats the purpose of KSD testing for unnormalised models. To address this issue, we utilise the auxiliary function in GKSD.

Consider a bounded smooth function $g$ such that $g(x) = 0$ for all $x \in \partial V$. For unnormalised $\tilde q$ on $V$, the bounded-domain Stein operator is defined as $T_{\tilde q, g} f = T_{\tilde q}(f \odot g)$, following Eq. (10). With the aid of the auxiliary function $g$, it is not hard to check that the Stein’s identity holds w.r.t. $q$.
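The role of the auxiliary function on a bounded domain can be seen in a small quadrature example. The sketch assumes a standard Gaussian truncated to $[-1, 1]$, the test function $f(x) = x$, and the auxiliary function $g(x) = 1 - x^2$, which vanishes at both end points (all illustrative choices).

```python
import numpy as np

def trapezoid(y, x):
    """Composite trapezoidal rule."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

# Assumed example: standard Gaussian truncated to [-1, 1], known only up to a constant.
grid = np.linspace(-1.0, 1.0, 200_001)
q_tilde = np.exp(-0.5 * grid**2)                 # unnormalised density, nonzero at x = -1 and x = 1
score = -grid                                    # d/dx log q_tilde

f, df = grid, np.ones_like(grid)                 # test function f(x) = x
g, dg = 1.0 - grid**2, -2.0 * grid               # auxiliary function with g(-1) = g(1) = 0

plain = f * score + df                           # Eq. (1): ignores the boundary
bounded = g * f * score + g * df + f * dg        # Eq. (10): T_q applied to f * g

plain_integral = trapezoid(plain * q_tilde, grid)      # boundary term survives: 2 * exp(-1/2)
bounded_integral = trapezoid(bounded * q_tilde, grid)  # vanishes because g is zero on the boundary
```

Without $g$, integration by parts leaves the boundary term $[x\, \tilde q(x)]_{-1}^{1} = 2 e^{-1/2}$, so the plain operator does not have zero mean under the truncated density; with $g$ the boundary term disappears.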

### 4.2 Bounded-domain kernel Stein discrepancy (bd-KSD)

With the Stein operator $T_{\tilde q, g}$, we proceed to define the bounded-domain kernel Stein discrepancy (bd-KSD) for goodness-of-fit testing, similarly to Section 2.2. We take the unit ball of the RKHS, $B_1(\mathcal{H})$, as the test function class and find the “best” function to distinguish the densities $p$ and $q$ on $V$ by taking the supremum over $B_1(\mathcal{H})$, analogously to Eq. (11). The standard reproducing property then gives a quadratic form analogous to Eq. (12).

We show that, under mild regularity conditions, bd-KSD is a proper discrepancy measure on $V$.

###### Theorem 7 (Characterisation of bd-KSD).

Let , be smooth densities defined on . Assume: 1) kernel is compact universal (carmeli2010vector, Definition 2(ii)); 2) ; 3) ; 4) whenever . Then, and if and only if .

##### Goodness-of-fit test with bd-KSD

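A procedure similar to the one introduced in Section 2.3 applies to test the null hypothesis $H_0: p = q$ against the alternative $H_1: p \neq q$. Given observed samples $x_1, \dots, x_n$ on $V$, the empirical U-statistic (lee90) $\text{bd-KSD}_g(\tilde q \| \tilde p)^2_u$ can be computed by averaging the Stein kernel over all pairs of distinct samples. The asymptotic distribution is obtained via U-statistics theory (lee90; van2000asymptotic) as follows. We denote convergence in distribution by $\overset{d}{\to}$.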

###### Theorem 8.

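Assume the conditions in Theorem 7 hold. 1) Under $H_0$,

$$n \cdot \text{bd-KSD}_g(\tilde q \| \tilde p)^2_u \overset{d}{\to} \sum_{j=1}^{\infty} w_j (Z_j^2 - 1), \tag{15}$$

where $Z_j$ are i.i.d. standard Gaussian random variables and $w_j$ are the eigenvalues of the Stein kernel under $p$, with $\phi_j$ the corresponding non-trivial eigenfunctions of the Stein kernel operator. 2) Under $H_1$,

$$\sqrt{n} \cdot \big( \text{bd-KSD}_g(\tilde q \| \tilde p)^2_u - \text{bd-KSD}_g(\tilde q \| \tilde p)^2 \big) \overset{d}{\to} \mathcal{N}(0, \sigma^2),$$

where $\sigma^2 > 0$ is the asymptotic variance, for which the U-statistic is non-degenerate.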

The goodness-of-fit testing then follows the standard procedures in Section 2.3 by applying bd-KSD.

### 4.3 Case studies: truncated distributions and compositional data

We first consider distributions with truncated boundaries. In Fig. 1 (left), an example of a two-component Gaussian mixture truncated to the unit ball is plotted. It is clear that the density does not necessarily vanish at the truncation boundary. Truncated distributions, including truncated Gaussian distributions (horrace2005some; horrace2015moments), truncated Pareto distributions (aban2006parameter), or truncated power-law distributions (deluca2013fitting) have been studied. In particular, left-truncated distributions are of special interest in survival analysis (klein2006survival). To the best of our knowledge, goodness-of-fit testing procedures for general truncated distributions have not yet been established.

We also consider compositional data, where the distribution is defined on a simplex, which is a compact domain. A common example of a compositional distribution is the Dirichlet distribution, with unnormalised density of the form $\tilde q(x) \propto \prod_i x_i^{\alpha_i - 1}$, where $\alpha_i > 0$ are the concentration parameters. An example Dirichlet distribution is illustrated in Fig. 1 (right). It is also clear that $\tilde q$ does not necessarily vanish at the boundary. (Specifically, $\tilde q = 0$ at the boundary $x_i = 0$ for $\alpha_i > 1$, while $\tilde q \to \infty$ on $x_i = 0$ for $\alpha_i < 1$.) Recently, score matching procedures have been proposed to estimate unnormalised models for compositional data (scealy2020score). To the best of our knowledge, goodness-of-fit testing procedures for general unnormalised compositional distributions have not been well studied.

##### Goodness-of-fit testing approaches and simulations

An important aspect of applying bd-KSD is to choose appropriate auxiliary functions in each data scenario, which we now specify. Simulation results are shown in Table 1. Due to the lack of general testing procedures in these two scenarios, we compare the KSD-based approach to the maximum mean discrepancy (MMD) (gretton2012kernel) based approach, where samples are generated from the null model and a two-sample test is then performed. Such a strategy for testing goodness-of-fit has been considered previously (jitkrittum2017linear; xu2020stein).

Truncated Distribution in Unit Ball: Requiring $g$ to vanish at the boundary and taking into account the rotational invariance of the unit ball, the auxiliary function can be chosen as a function of the Euclidean distance from the boundary raised to a chosen power. A similar form of auxiliary function was discussed for density estimation on truncated domains (liu2019estimating). For a larger power, more weight is concentrated towards the center of the ball. We present the case where the null is a 2-component mixture of Gaussians with identity variance and the alternative has its correlation coefficient perturbed. In this case, the difference between the null and alternative distributions is concentrated towards the center, making bd-KSD with an auxiliary function that puts more weight near the center a better test, as shown in Table 1.
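One way to realise such an auxiliary function is sketched below. The specific form $g(x) = (1 - \|x\|)^a$ with power $a \ge 1$ is an assumption made for illustration, since the exact expression is not recoverable from the text; the function name is hypothetical.

```python
import numpy as np

def ball_auxiliary(X, power=1.0):
    """Assumed form g(x) = (1 - ||x||)^power on the unit ball (power >= 1), with its gradient.

    Returns g of shape (n,) and grad g of shape (n, d). The gradient is set to the
    zero vector at the origin, where the norm is not differentiable.
    """
    X = np.asarray(X, dtype=float)
    r = np.linalg.norm(X, axis=1)
    dist = np.clip(1.0 - r, 0.0, None)            # Euclidean distance to the boundary
    g = dist**power
    safe_r = np.where(r > 0, r, 1.0)
    radial = X / safe_r[:, None]                  # unit radial direction, zero at the origin
    grad = (-power * dist ** (power - 1.0))[:, None] * radial
    return g, grad
```

For instance, at radius 0.5 the value is 0.5 for power 1 and 0.125 for power 3, so a larger power down-weights points away from the center more strongly.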

Compositional Distributions: Given the boundary of the simplex, a natural choice of auxiliary function is one that vanishes on it. An auxiliary function that is more sensitive near the boundary, where the distributions differ more, yields a bd-KSD with higher power, as shown in Table 1.

Moreover, the results in Table 1 also show that bd-KSD-based tests outperform the MMD-based tests. We include additional simulation results and insights on the effect of different auxiliary function choices in Appendix C.
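The MMD baseline described above (draw samples from the null model, then run a two-sample test) can be sketched as follows. This is a minimal illustration rather than the implementation used for Table 1: the one-dimensional data, the Gaussian kernel with unit bandwidth, the biased MMD estimate, the permutation calibration, and the names `mmd_gof_pvalue` and `null_sampler` are all assumptions of the sketch.

```python
import math
import random

def rbf(x, y, bandwidth=1.0):
    """Gaussian kernel on scalars."""
    return math.exp(-((x - y) ** 2) / (2 * bandwidth ** 2))

def mmd2_biased(K, idx_x, idx_y):
    """Biased (V-statistic) MMD^2 from a precomputed pooled kernel matrix K."""
    kxx = sum(K[i][j] for i in idx_x for j in idx_x) / len(idx_x) ** 2
    kyy = sum(K[i][j] for i in idx_y for j in idx_y) / len(idx_y) ** 2
    kxy = sum(K[i][j] for i in idx_x for j in idx_y) / (len(idx_x) * len(idx_y))
    return kxx + kyy - 2 * kxy

def mmd_gof_pvalue(data, null_sampler, n_perm=200, seed=0):
    """Permutation p-value for H0: data follow the model behind null_sampler."""
    rng = random.Random(seed)
    null_samples = [null_sampler(rng) for _ in data]  # same size as the data
    pooled = list(data) + null_samples
    K = [[rbf(a, b) for b in pooled] for a in pooled]
    n = len(data)
    idx = list(range(len(pooled)))
    observed = mmd2_biased(K, idx[:n], idx[n:])
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(idx)  # random relabelling of the pooled sample
        if mmd2_biased(K, idx[:n], idx[n:]) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

# Usage: data shifted away from a standard normal null model.
data_rng = random.Random(1)
shifted = [data_rng.gauss(2.0, 1.0) for _ in range(40)]
p_value = mmd_gof_pvalue(shifted, lambda r: r.gauss(0.0, 1.0))
print(p_value)
```

Unlike the KSD-based tests, this approach needs to be able to sample from the null model, which may itself be costly for unnormalised distributions on constrained domains.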

##### Real data experiments

We illustrate real-data testing scenarios for the presented case studies.

1) Chicago Crime Dataset (data can be found at https://data.cityofchicago.org). We use the relevant score-matching-based objective, TruncSM (liu2019estimating), to fit a Gaussian mixture model on half of the data, and test on the other half. We set the auxiliary function to be the Euclidean distance to the nearest boundary point. For a 2-component Gaussian mixture, bd-KSD gives a p-value of 0.002, which clearly indicates an inadequate fit; for a 20-component Gaussian mixture, the p-value is 0.162, so the test does not reject the TruncSM-estimated model at significance level 0.05, indicating an adequate fit.

2) Three-composition AFM data of 23 aphyric Skye lavas (aitchison1985kernel). The variables A, F and M represent the relative proportions of $\mathrm{Na_2O} + \mathrm{K_2O}$, $\mathrm{Fe_2O_3}$ and $\mathrm{MgO}$, respectively. We fit a Gaussian kernel density estimate (chacon2011asymptotics) on half of the data and test on the other half. We choose the auxiliary function to be the minimum distance to the boundary. The bd-KSD gives a p-value of 0.004, which rejects the null hypothesis, indicating that the fit is not good enough.

##### Conclusion and discussions

The present work proposes a general framework to unify and compare existing KSD-based tests, as well as to design new ones. Our empirical results for testing data with domain constraints validate our claim that the choice of Stein operator can have a considerable effect on the performance of KSD-based tests. Beyond this, a more rigorous treatment of the choice of auxiliary function is an interesting future direction.

## Appendix A Known Identities

### Expectations in Survival Analysis

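We recall the following identities in survival analysis, which are useful for the discussions in the main text: for any measurable function $\phi$,

$$
\begin{aligned}
\mathbb{E}_0[\Delta\,\phi(T)] &= \int_0^\infty \phi(s)\,\mu_0(s)\,S_C(s)\,ds, && (16)\\
\mathbb{E}_0[(1-\Delta)\,\phi(T)] &= \int_0^\infty \phi(s)\,\mu_C(s)\,S_0(s)\,ds, && (17)
\end{aligned}
$$

where $\mu_C$ denotes the p.d.f. of the censoring distribution, and $S_0$ and $S_C$ denote the survival functions w.r.t. $\mu_0$ and $\mu_C$, respectively.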
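Both identities can be checked by simulation in a simple case. The sketch below assumes exponential event and censoring times with hypothetical rates `a` and `b`, and takes $\phi(s) = s$; for this choice the right-hand sides of Eq. (16) and Eq. (17) have the closed forms $a/(a+b)^2$ and $b/(a+b)^2$.

```python
import random

def censored_sample(rng, event_rate, censor_rate):
    """Draw (T, Delta): observed time and event indicator under right censoring."""
    x = rng.expovariate(event_rate)    # event time
    c = rng.expovariate(censor_rate)   # censoring time
    return min(x, c), 1 if x <= c else 0

rng = random.Random(0)
a, b, n = 1.0, 0.5, 200_000            # assumed rates and sample size
phi = lambda s: s                      # assumed test function

lhs16 = lhs17 = 0.0
for _ in range(n):
    t, delta = censored_sample(rng, a, b)
    lhs16 += delta * phi(t)
    lhs17 += (1 - delta) * phi(t)
lhs16 /= n
lhs17 /= n

# Closed forms of the integrals for exponential densities and phi(s) = s.
rhs16 = a / (a + b) ** 2
rhs17 = b / (a + b) ** 2
print(lhs16, rhs16, lhs17, rhs17)
```

The Monte Carlo averages agree with the closed-form integrals up to sampling error.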

### Martingales in Survival Analysis

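The following identity is useful for understanding the martingale Stein operator in [fernandez2020kernelized]:

$$
\mathbb{E}_0\left[\Delta\,\phi(T) - \int_0^T \phi(t)\,\lambda_0(t)\,dt\right] = 0, \qquad (18)
$$

which holds under the null hypothesis, where $\lambda_0 = \mu_0 / S_0$ is the hazard function under the null. Let $N_i(x)$ and $Y_i(x)$ be the individual counting and risk processes, defined by $N_i(x) = \Delta_i \mathbb{1}\{T_i \le x\}$ and $Y_i(x) = \mathbb{1}\{T_i \ge x\}$, respectively. Then, the individual zero-mean martingale for the i-th individual corresponds to $M_i(x) = N_i(x) - \int_0^x Y_i(y)\,\lambda_0(y)\,dy$, where $\mathbb{E}_0[M_i(x)] = 0$ for all $x$.

Additionally, let $\phi$ be such that $\mathbb{E}_0\left|\int_0^x \phi(y)\,dM_i(y)\right| < \infty$ for all $x$; then $\int_0^x \phi(y)\,dM_i(y)$ is a zero-mean martingale (see Chapter 2 of [aalen2008survival]). Then, taking expectations, we have

$$
\begin{aligned}
\mathbb{E}_0\left[\int_0^\infty \phi(x)\,dM_i(x)\right]
&= \mathbb{E}_0\left[\int_0^\infty \phi(x)\,\big(dN_i(x) - Y_i(x)\,\lambda_0(x)\,dx\big)\right] \\
&= \mathbb{E}_0\left[\Delta\,\phi(T) - \int_0^T \phi(x)\,\lambda_0(x)\,dx\right] = 0,
\end{aligned}
$$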

as stated above. The martingale property is useful to derive the martingale Stein operator in Eq. (4). For more details, see [fernandez2020kernelized].
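The identity in Eq. (18) can also be checked by simulation. The sketch below is an illustration under assumptions not taken from the paper: exponential event times with constant hazard `rate`, exponential censoring, and $\phi(t) = t$, for which $\int_0^T \phi(t)\,\lambda_0(t)\,dt = \text{rate} \cdot T^2 / 2$.

```python
import random

def martingale_residual(t, delta, phi, cumulative):
    """delta * phi(T) - int_0^T phi(s) lambda_0(s) ds for one observation."""
    return delta * phi(t) - cumulative(t)

rng = random.Random(0)
rate, censor_rate, n = 1.0, 0.5, 200_000   # assumed null hazard, censoring rate, sample size
phi = lambda s: s                          # assumed test function
cumulative = lambda t: rate * t * t / 2    # int_0^t s * rate ds

total = 0.0
for _ in range(n):
    x = rng.expovariate(rate)              # event time under the null
    c = rng.expovariate(censor_rate)       # independent censoring time
    total += martingale_residual(min(x, c), 1 if x <= c else 0, phi, cumulative)

mean_residual = total / n
print(mean_residual)
```

The average residual is close to zero under the null. Simulating event times from a different hazard while keeping `cumulative` fixed would generally move it away from zero, which is what the martingale-based test exploits.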

## Appendix B Proofs and Derivations

Proof of Proposition 1

###### Proof.

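By the standard reproducing property and taking the supremum over the unit ball of the RKHS,

$$
\mathrm{GKSD}_g(p \,\|\, q; \mathcal{H}) = \sup_{\|f\|_{\mathcal{H}} \le 1} \mathbb{E}_p\big[\langle \mathcal{T}_{q,g} K(x,\cdot), f \rangle_{\mathcal{H}}\big] = \big\| \mathbb{E}_p[\mathcal{T}_{q,g} K(x,\cdot)] \big\|_{\mathcal{H}}.
$$

Specifically, assume the data lie in $\mathbb{R}^d$, the setting in [chwialkowski2016kernel, liu2016kernelized],

$$
\mathcal{T}_{q,g} K(x,\cdot) = \sum_{i=1}^{d} g_i(x)\left(\frac{\partial \log q(x)}{\partial x_i}\,k(x,\cdot) + \frac{\partial k(x,\cdot)}{\partial x_i}\right) + \frac{\partial g_i(x)}{\partial x_i}\,k(x,\cdot).
$$

We can write $h_{q,g}$ explicitly as

$$
\begin{aligned}
h_{q,g}(x,\tilde{x}) = \sum_{i=1}^{d} &\left(\frac{\partial^2 k(x,\tilde{x})}{\partial x_i\,\partial \tilde{x}_i} + \frac{\partial \log q(x)}{\partial x_i}\frac{\partial k(x,\tilde{x})}{\partial \tilde{x}_i} + \frac{\partial \log q(\tilde{x})}{\partial \tilde{x}_i}\frac{\partial k(x,\tilde{x})}{\partial x_i} + \frac{\partial \log q(x)}{\partial x_i}\frac{\partial \log q(\tilde{x})}{\partial \tilde{x}_i}\,k(x,\tilde{x})\right) \\
&\times g_i(x)\,g_i(\tilde{x}) + \frac{\partial g_i(x)}{\partial x_i}\frac{\partial g_i(\tilde{x})}{\partial \tilde{x}_i}\,k(x,\tilde{x}).
\end{aligned}
$$

This recovers the quadratic form, which depends only on the density $q$ but not on $p$. ∎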

Proof of Theorem 1

###### Proof.

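Note that the expectation on the l.h.s. of Eq. (13) integrates over the density of the survival time, whereas the expectation on the r.h.s., which involves multiplication by $\delta$ in $\mathcal{T}_0$ in Eq. (2), is taken over the paired observations incorporating censoring information. Using the identity in Eq. (17), we have

$$
\begin{aligned}
\mathbb{E}_{\mu_0}[\mathcal{T}_{\mu_0,g}\,\omega]
&= \int_{\mathbb{R}_+} \mathcal{T}_{\mu_0,g}\,\omega(s)\,\mu_0(s)\,ds \\
&= \int_{\mathbb{R}_+} \left(\omega'(s) + \omega(s)\left(\frac{g'(s)}{g(s)} + (\log \mu_0(s))'\right)\right) g(s)\,\mu_0(s)\,ds \\
&= \int_{\mathbb{R}_+} \left(\omega'(s) + \omega(s)\frac{S_C'(s)}{S_C(s)} + \omega(s)\frac{\mu_0'(s)}{\mu_0(s)}\right) \mu_0(s)\,S_C(s)\,ds \\
&= \int_{\mathbb{R}_+} \frac{\omega'(s)\,S_C(s)\,\mu_0(s) + \omega(s)\,S_C'(s)\,\mu_0(s) + \omega(s)\,S_C(s)\,\mu_0'(s)}{S_C(s)\,\mu_0(s)}\;\mu_0(s)\,S_C(s)\,ds \\
&= \int_{\mathbb{R}_+} (\mathcal{T}_0\,\omega)(s,\delta)\,\mu_0(s)\,S_C(s)\,ds = \mathbb{E}_0[(\mathcal{T}_0\,\omega)(x,\delta)] = 0.
\end{aligned}
$$

We also note that $0 \le S_C \le 1$ by the definition of survival functions. As such, $g = S_C$ is bounded almost everywhere on $\mathbb{R}_+$, which satisfies the conditions for testing. ∎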

Proof of Theorem 2

###### Proof.

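To show the equivalence relation in the sense of Eq. (14), we need to account for the indicator variable $\delta$, which makes the argument essentially different from the proof of Theorem 1. Recall the following identity between the hazard function and the density: since $\lambda_0(x) = \mu_0(x)/S_0(x)$ and $S_0'(x) = -\mu_0(x)$,

$$
\frac{\lambda_0'(x)}{\lambda_0(x)} = \frac{\mu_0'(x)}{S_0(x)\,\lambda_0(x)} + \frac{\mu_0(x)^2}{S_0(x)^2\,\lambda_0(x)} = \frac{\mu_0'(x)}{\mu_0(x)} + \lambda_0(x). \qquad (19)
$$

Denote by $\zeta$ the auxiliary function for the censored component, so that we can write $g = \delta\,\lambda_0^{-1} + (1-\delta)\,\zeta$. Decomposing the Stein operator w.r.t. $\delta$, we have

$$
\mathcal{T}_{\mu_0,g}\,\omega = \delta\,\mathcal{T}_{\mu_0,\lambda_0^{-1}}\,\omega + (1-\delta)\,\mathcal{T}_{\mu_0,\zeta}\,\omega, \qquad (20)
$$

as $\mathcal{T}_{\mu_0,g}$ is also a linear operator w.r.t. $g$. We now treat the two components separately. Using the form of Eq. (10),

$$
\begin{aligned}
\mathcal{T}_{\mu_0,\lambda_0^{-1}}\,\omega
&= \lambda_0^{-1}\big(\omega' + \omega\,(\log \mu_0)'\big) + (\lambda_0^{-1})'\,\omega \\
&= \lambda_0^{-1}\left(\omega' + \omega\left(\frac{\lambda_0'(x)}{\lambda_0(x)} - \lambda_0(x)\right) + \lambda_0\,(\lambda_0^{-1})'\,\omega\right) \\
&= \lambda_0^{-1}\big(\omega' - \omega\,\lambda_0\big) \\
&= \lambda_0^{-1}\,\omega' - \omega,
\end{aligned}
$$

where the second equality follows from Eq. (19) and the third from $(\lambda_0^{-1})' = -\lambda_0'/\lambda_0^2$. The derivation is interesting in that it reveals that the uncensored part of the martingale Stein operator is connected to the Langevin-diffusion operator via the inverse of the hazard function, i.e. when $\delta = 1$ (or in the absence of censoring), the operator reduces to $\lambda_0^{-1}\,\omega' - \omega$.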
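As a sanity check on this simplification, the sketch below evaluates both sides for one concrete null model. The Weibull model with shape 2 and unit scale (so that $\lambda_0(x) = 2x$) and the test function $\omega = \sin$ are assumptions chosen only for illustration.

```python
import math

# Assumed null model: Weibull with shape 2 and unit scale.
mu0 = lambda x: 2 * x * math.exp(-x * x)               # density
dmu0 = lambda x: (2 - 4 * x * x) * math.exp(-x * x)    # derivative of the density
S0 = lambda x: math.exp(-x * x)                        # survival function
lam0 = lambda x: mu0(x) / S0(x)                        # hazard, equal to 2x
dlam0 = lambda x: 2.0                                  # derivative of the hazard

omega, domega = math.sin, math.cos                     # assumed test function

def operator_from_definition(x):
    """lambda_0^{-1} (omega' + omega (log mu_0)') + (lambda_0^{-1})' omega."""
    inv = 1 / lam0(x)
    dinv = -dlam0(x) / lam0(x) ** 2
    return inv * (domega(x) + omega(x) * dmu0(x) / mu0(x)) + dinv * omega(x)

def operator_simplified(x):
    """lambda_0^{-1} omega' - omega."""
    return domega(x) / lam0(x) - omega(x)

points = [0.3, 0.7, 1.0, 1.5, 2.5]
# Gap in the hazard-density identity: lambda_0'/lambda_0 = mu_0'/mu_0 + lambda_0.
eq19_gap = max(abs(dlam0(x) / lam0(x) - (dmu0(x) / mu0(x) + lam0(x))) for x in points)
# Gap between the operator written out in full and its simplified form.
operator_gap = max(abs(operator_from_definition(x) - operator_simplified(x)) for x in points)
print(eq19_gap, operator_gap)
```

Both gaps are zero up to floating-point error at the evaluated points, consistent with the hazard-density identity and with the simplified form of the operator.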

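On the other hand, we rewrite the martingale Stein operator in Eq. (4) as $\delta\,(\lambda_0^{-1}\,\omega' - \omega) - (1-\delta)\,\omega$. For GKSD to match this operator, we need to find $\zeta$ such that $\mathcal{T}_{\mu_0,\zeta}\,\omega = -\omega$.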

 Tμ0