# Kernel-based method for the k-sample problem

In this paper we deal with the problem of testing for the equality of $k$ probability distributions defined on $(\mathcal X,\mathcal B)$, where $\mathcal X$ is a metric space and $\mathcal B$ is the corresponding Borel $\sigma$-field. We introduce a test statistic based on reproducing kernel Hilbert space embeddings and derive its asymptotic distribution under the null hypothesis. Simulations show that the introduced procedure outperforms known methods.


## 1 Introduction

Testing for homogeneity, that is, testing for the equality of several probability distributions, is an old and important problem in statistics. When the number of these distributions is greater than two, it is named the $k$-sample problem and has been tackled in the literature through different approaches. For instance, the traditional Kolmogorov-Smirnov, Cramér-von Mises and Anderson-Darling tests, initially introduced to treat the case of two distributions only, have been extended to deal with the aforementioned $k$-sample problem. Also, procedures based on the likelihood ratio, which lead to more powerful tests than the previous ones, have been introduced. Nevertheless, all these methods only permit testing the equality of distributions defined on $\mathbb R$, equipped with its Borel $\sigma$-field, and cannot be used for distributions defined on more complex spaces. The interest of kernel-based methods, that is, methods based on the use of reproducing kernel Hilbert space embeddings, lies in the fact that they permit dealing with high-dimensional and structured data, which the aforementioned traditional methods do not. In this vein, Harchaoui et al. and, more recently, Gretton et al. proposed kernel-based methods for the two-sample problem. The former introduced a method based on the maximum Fisher discriminant ratio while the latter used the maximum mean discrepancy. The extension of their procedures to the case of more than two distributions is of great interest since, to the best of our knowledge, it has never been done.

In this paper, we deal with the $k$-sample problem by extending the kernel-based approach of Harchaoui et al. The rest of the paper is organized as follows. In Section 2, we recall some basic facts about reproducing kernel Hilbert space embeddings. In Section 3, after specifying the testing problem that we deal with, we introduce a test statistic and derive its asymptotic distribution under the null hypothesis. We also tackle computational aspects and show how to compute this test statistic in practice. Section 4 is devoted to the presentation of simulations made in order to evaluate the performance of our proposal and to compare it with known methods. All the proofs are postponed to Section 5.

## 2 Preliminary notions

In this section, we recall the notion of a reproducing kernel Hilbert space (RKHS) and define some related elements that are useful in this paper. For more details on RKHSs and their use in probability and statistics, one may refer to the literature on kernel methods.

Letting $(\mathcal X,\mathcal B)$ be a measurable space, where $\mathcal X$ is a metric space and $\mathcal B$ is the corresponding Borel $\sigma$-field, we consider a Hilbert space $\mathcal H$ of functions from $\mathcal X$ to $\mathbb R$, endowed with an inner product $\langle\cdot,\cdot\rangle_{\mathcal H}$. This space is said to be a RKHS if there exists a kernel, that is a symmetric positive semi-definite function $K:\mathcal X\times\mathcal X\to\mathbb R$, such that for any $x\in\mathcal X$ and any $f\in\mathcal H$, one has $K(x,\cdot)\in\mathcal H$ and $f(x)=\langle f,K(x,\cdot)\rangle_{\mathcal H}$. When $\mathcal H$ is a RKHS with kernel $K$, the map $\Phi:x\mapsto K(x,\cdot)$ characterizes $K$ since one has

$$K(x,y)=\langle\Phi(x),\Phi(y)\rangle_{\mathcal H}$$

for any $(x,y)\in\mathcal X^2$. The map $\Phi$ is called the feature map and it is an important tool when dealing with kernel methods for statistical problems. Throughout this paper, we consider a RKHS $\mathcal H$ with kernel $K$ satisfying the following assumptions:

$K$ is bounded, that is $\sup_{x\in\mathcal X}K(x,x)<+\infty$;

the RKHS $\mathcal H$ associated to the kernel $K$ is dense in $L^2(P)$, where $P$ is a probability measure on $(\mathcal X,\mathcal B)$.
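For concreteness, the Gaussian kernel $K(x,y)=\exp(-\|x-y\|^2/(2\sigma^2))$, which is also the kernel used in the simulations of Section 4, is a standard choice satisfying boundedness and denseness conditions of this type. The following minimal numpy sketch is ours, not part of the original paper; the helper name `gaussian_kernel` is illustrative. It builds a Gram matrix and checks the two defining properties of a kernel Gram matrix:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gram matrix with entries K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
K = gaussian_kernel(X, X)

# a valid kernel Gram matrix is symmetric positive semi-definite
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() > -1e-8
```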

Let $X$ be a random variable taking values in $\mathcal X$ with probability distribution $P$. If $E\big(\sqrt{K(X,X)}\big)<+\infty$, the mean element $m$ associated with $X$ is defined for all functions $f\in\mathcal H$ as the unique element in $\mathcal H$ satisfying

$$\langle m,f\rangle_{\mathcal H}=E(f(X))=\int_{\mathcal X}f(x)\,dP(x).$$

Furthermore, if $E(K(X,X))<+\infty$, we can define the covariance operator $V$ associated to $X$ as the unique operator from $\mathcal H$ to itself such that, for any pair $(f,g)\in\mathcal H^2$, one has

$$\langle f,Vg\rangle_{\mathcal H}=\mathrm{Cov}\big(f(X),g(X)\big)=E\big(f(X)g(X)\big)-E\big(f(X)\big)E\big(g(X)\big).$$

It is important to note that if the boundedness assumption above is satisfied, then the mean element and the covariance operator are well-defined. They can also be expressed as

$$m=E\big(K(X,\cdot)\big)$$

and

$$V=E\big((K(X,\cdot)-m)\otimes(K(X,\cdot)-m)\big)=E\big(K(X,\cdot)\otimes K(X,\cdot)\big)-m\otimes m$$

where $\otimes$ is the tensor product such that, for any pair $(f,g)\in\mathcal H^2$, $f\otimes g$ is the linear map from $\mathcal H$ to itself satisfying $(f\otimes g)(h)=\langle g,h\rangle_{\mathcal H}\,f$ for all $h\in\mathcal H$. The empirical counterparts of $m$ and $V$, obtained from an i.i.d. sample $X_1,\dots,X_n$ of $X$, are then given by:

$$\hat m=\frac1n\sum_{i=1}^nK(X_i,\cdot)$$

and

$$\hat V=\frac1n\sum_{i=1}^n\big(K(X_i,\cdot)-\hat m\big)\otimes\big(K(X_i,\cdot)-\hat m\big)=\frac1n\sum_{i=1}^nK(X_i,\cdot)\otimes K(X_i,\cdot)-\hat m\otimes\hat m.$$
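In practice, every quantity involving $\hat m$ and $\hat V$ that appears below reduces to computations on the Gram matrix; for instance $\|\hat m\|^2_{\mathcal H}=n^{-2}\sum_{i,i'}K(X_i,X_{i'})$. A small numpy illustration of this (ours, not the paper's; `gaussian_kernel` is the hypothetical helper used in the previous sketch):

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 1))
K = gaussian_kernel(X, X)

# squared RKHS norm of the empirical mean element:
# ||m_hat||^2 = <m_hat, m_hat> = (1/n^2) sum_{i,i'} K(X_i, X_i')
norm_m_sq = K.mean()

# evaluation of m_hat at a point, m_hat(x) = <m_hat, K(x, .)>,
# here at x = X[0]
m_at_x0 = K[:, 0].mean()
```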

## 3 The k-sample problem

In this section, we specify the $k$-sample problem that we deal with as a test for hypotheses given below. Then, a test statistic is proposed and its asymptotic distribution under the null hypothesis is derived. Finally, we deal with computational aspects and show how the introduced test statistic can be computed in practice.

For $k\in\mathbb N$ such that $k\ge2$, we consider probability distributions $P_1,\dots,P_k$ on $(\mathcal X,\mathcal B)$. For $j\in\{1,\dots,k\}$, we denote by $m_j$ and $V_j$ the mean element and the covariance operator, respectively, associated to $P_j$. The $k$-sample problem that we deal with is the test for the hypothesis $H_0$: $P_1=P_2=\cdots=P_k$ against the alternative given by $H_1$: $P_j\neq P_l$ for some $(j,l)$ with $j\neq l$.

### 3.1 Test statistic

For $j\in\{1,\dots,k\}$, let $X^{(j)}_1,\dots,X^{(j)}_{n_j}$ be an i.i.d. sample in $\mathcal X$ with common distribution $P_j$. We consider the statistics

$$\hat m_j=\frac{1}{n_j}\sum_{i=1}^{n_j}K(X^{(j)}_i,\cdot),$$

$$\hat V_j=\frac{1}{n_j}\sum_{i=1}^{n_j}\big(K(X^{(j)}_i,\cdot)-\hat m_j\big)\otimes\big(K(X^{(j)}_i,\cdot)-\hat m_j\big)=\frac{1}{n_j}\sum_{i=1}^{n_j}K(X^{(j)}_i,\cdot)\otimes K(X^{(j)}_i,\cdot)-\hat m_j\otimes\hat m_j$$

from which we define

$$\hat W_n=\sum_{j=1}^k\frac{n_j}{n}\hat V_j,\qquad \hat m=\sum_{j=1}^k\frac{n_j}{n}\hat m_j,$$

where $n=\sum_{j=1}^kn_j$. Let $(\gamma_n)_{n\ge1}$ be a sequence of strictly positive numbers such that $\lim_{n\to+\infty}\gamma_n=0$. Then, we consider

$$\hat V_n=\hat W_n+\gamma_nI,$$

where $I$ denotes the identity operator of $\mathcal H$, and we take as test statistic for the $k$-sample problem the statistic:

$$\hat T_n=\frac{\sum_{j=1}^kP_j\|\hat V_n^{-1/2}\hat\delta_j\|^2_{\mathcal H}}{\sqrt2\,\ell(\hat W_n,\gamma_n)},$$

where

$$P_j=\frac{n_j}{n},\qquad \hat\delta_j=\hat m_j-\hat m$$

and

$$\ell(\hat W_n,\gamma_n)=\Big\{\sum_{p=1}^{+\infty}\big(\lambda_p(\hat W_n)+\gamma_n\big)^{-2}\lambda_p^2(\hat W_n)\Big\}^{1/2}.$$

###### Remark 3.1

The quantity $\ell(\hat W_n,\gamma_n)$ is a normalization factor that rescales the statistic in order to obtain a well-grounded test statistic. Note that in the two-sample procedure of Harchaoui et al. a recentering factor is also introduced, but we do not need it in this paper. It is known that

$$\ell(\hat W_n,\gamma_n)=\big\{\operatorname{tr}\big(\hat V_n^{-2}\hat W_n^2\big)\big\}^{1/2}. \qquad (1)$$
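In finite dimensions this identity can be checked directly, since $\hat V_n^{-2}\hat W_n^2$ has eigenvalues $(\lambda_p+\gamma_n)^{-2}\lambda_p^2$. A numpy sanity check on a symmetric positive semi-definite stand-in for $\hat W_n$ (our illustration, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 6))
W = A @ A.T                      # symmetric PSD stand-in for W_hat_n
gamma = 0.1
lam = np.linalg.eigvalsh(W)

# eigenvalue form: l^2 = sum_p lambda_p^2 / (lambda_p + gamma)^2
l_sq_eig = np.sum(lam**2 / (lam + gamma) ** 2)

# trace form: l^2 = tr(V^{-2} W^2) with V = W + gamma I
V_inv = np.linalg.inv(W + gamma * np.eye(6))
l_sq_tr = np.trace(V_inv @ V_inv @ W @ W)

assert np.isclose(l_sq_eig, l_sq_tr)
```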

### 3.2 Asymptotic distribution under H0

We consider the following assumptions:

For $j\in\{1,\dots,k\}$, one has $\lim_{n\to+\infty}n_j/n=\rho_j$, where $\rho_j$ is a real number belonging to $]0,1[$;

the eigenvalues $(\lambda_p(V_j))_{p\ge1}$ of $V_j$ satisfy $\sum_{p=1}^{+\infty}\lambda_p^{1/2}(V_j)<+\infty$ for $j\in\{1,\dots,k\}$;

there are infinitely many strictly positive eigenvalues of $V_j$ for $j\in\{1,\dots,k\}$.

Then, we have:

###### Theorem 3.1

Assume that the preceding assumptions hold and that $\lim_{n\to+\infty}\gamma_n=0$ with $\lim_{n\to+\infty}\gamma_n^{-1}n^{-1/2}=0$. Then, under $H_0$, $n\hat T_n$ converges in distribution, as $n\to+\infty$, to the standard normal distribution.

### 3.3 Computation of the test statistic

For computing this test statistic in practice, the kernel trick can be used, as was already done by Harchaoui et al. for the two-sample case. For $j\in\{1,\dots,k\}$, we consider the operator $G^{(j)}_n$ from $\mathbb R^{n_j}$ to $\mathcal H$ represented in matrix form as

$$G^{(j)}_n=\big[K(X^{(j)}_1,\cdot),\cdots,K(X^{(j)}_{n_j},\cdot)\big].$$

Then put $G_n=\big[G^{(1)}_n,\cdots,G^{(k)}_n\big]$, and consider

$$\Lambda^{(j,l)}_n=(G^{(j)}_n)^TG^{(l)}_n=\begin{pmatrix}K(X^{(j)}_1,X^{(l)}_1)&K(X^{(j)}_1,X^{(l)}_2)&\cdots&K(X^{(j)}_1,X^{(l)}_{n_l})\\K(X^{(j)}_2,X^{(l)}_1)&K(X^{(j)}_2,X^{(l)}_2)&\cdots&K(X^{(j)}_2,X^{(l)}_{n_l})\\\vdots&\vdots&\ddots&\vdots\\K(X^{(j)}_{n_j},X^{(l)}_1)&K(X^{(j)}_{n_j},X^{(l)}_2)&\cdots&K(X^{(j)}_{n_j},X^{(l)}_{n_l})\end{pmatrix}$$

and the Gram matrix

$$\Lambda_n=G_n^TG_n=\begin{pmatrix}\Lambda^{(1,1)}_n&\Lambda^{(1,2)}_n&\cdots&\Lambda^{(1,k)}_n\\\Lambda^{(2,1)}_n&\Lambda^{(2,2)}_n&\cdots&\Lambda^{(2,k)}_n\\\vdots&\vdots&\ddots&\vdots\\\Lambda^{(k,1)}_n&\Lambda^{(k,2)}_n&\cdots&\Lambda^{(k,k)}_n\end{pmatrix}.$$
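In code, $\Lambda_n$ is simply the Gram matrix of the pooled sample, and the blocks $\Lambda^{(j,l)}_n$ are its sub-blocks. A numpy sketch (our illustration; `gaussian_kernel` is the hypothetical helper from the earlier sketches):

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(3)
samples = [rng.normal(size=(nj, 1)) for nj in (20, 30, 25)]   # k = 3 groups

# Lambda_n is the Gram matrix of the pooled sample; its (j, l) block is
# the cross-Gram matrix Lambda^{(j,l)}_n between groups j and l
X_all = np.vstack(samples)
Lambda = gaussian_kernel(X_all, X_all)

# e.g. the (1, 2) block occupies rows 0:20 and columns 20:50
block_12 = gaussian_kernel(samples[0], samples[1])
assert np.allclose(Lambda[:20, 20:50], block_12)
```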

Further, denoting by $I_m$ (resp. $\mathbb 1_m$) the $m\times m$ identity matrix (resp. the $m$-dimensional vector whose components are all equal to $1$), we consider the matrices $Q_{n_j}=I_{n_j}-n_j^{-1}\mathbb 1_{n_j}\mathbb 1_{n_j}^T$,

$$N_n=\begin{pmatrix}Q_{n_1}&0&\cdots&0\\0&Q_{n_2}&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&Q_{n_k}\end{pmatrix},$$

and the vector

$$m^{(j)}_n=\begin{pmatrix}m^{(j)}_{n,1}\\\vdots\\m^{(j)}_{n,n}\end{pmatrix}$$

such that

$$m^{(j)}_{n,i}=\begin{cases}-n^{-1}&\text{if }1\le i\le\nu_{j-1}\\n_j^{-1}-n^{-1}&\text{if }\nu_{j-1}+1\le i\le\nu_j\\-n^{-1}&\text{if }\nu_j+1\le i\le n\end{cases}$$

where $\nu_j=\sum_{l=1}^{j}n_l$ (with $\nu_0=0$). Clearly,

$$\begin{aligned}\hat\delta_j&=\hat m_j-\hat m=\hat m_j-\sum_{l=1}^kP_l\hat m_l\\&=-P_1\hat m_1-P_2\hat m_2-\cdots-P_{j-1}\hat m_{j-1}+(1-P_j)\hat m_j-P_{j+1}\hat m_{j+1}-\cdots-P_k\hat m_k\\&=-n^{-1}\sum_{i=1}^{n_1}K(X^{(1)}_i,\cdot)-\cdots-n^{-1}\sum_{i=1}^{n_{j-1}}K(X^{(j-1)}_i,\cdot)+(n_j^{-1}-n^{-1})\sum_{i=1}^{n_j}K(X^{(j)}_i,\cdot)-n^{-1}\sum_{i=1}^{n_{j+1}}K(X^{(j+1)}_i,\cdot)-\cdots-n^{-1}\sum_{i=1}^{n_k}K(X^{(k)}_i,\cdot)\\&=G_n\,m^{(j)}_n\end{aligned}$$
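The identity $\hat\delta_j=G_nm^{(j)}_n$ implies $\|\hat\delta_j\|^2_{\mathcal H}=(m^{(j)}_n)^T\Lambda_nm^{(j)}_n$, which only involves the Gram matrix. The sketch below (ours, with the hypothetical helpers `gaussian_kernel` and `m_vec`) builds $m^{(j)}_n$ and cross-checks this against a direct computation from block means of $\Lambda_n$:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(4)
sizes = [20, 30, 25]
nu = np.cumsum([0] + sizes)                    # nu_0, ..., nu_k
n = int(nu[-1])
samples = [rng.normal(size=(nj, 1)) for nj in sizes]
Lambda = gaussian_kernel(np.vstack(samples), np.vstack(samples))

def m_vec(j):
    """Vector m^(j)_n: 1/n_j - 1/n on block j, -1/n elsewhere."""
    m = np.full(n, -1.0 / n)
    m[nu[j]:nu[j + 1]] += 1.0 / sizes[j]
    return m

# ||delta_hat_j||^2 = m^(j)^T Lambda m^(j), since delta_hat_j = G_n m^(j)
j = 1
norm_sq = m_vec(j) @ Lambda @ m_vec(j)

# cross-check from block means: <m_hat_a, m_hat_b> = mean of block (a, b)
P = np.array(sizes) / n
B = np.array([[Lambda[nu[a]:nu[a + 1], nu[b]:nu[b + 1]].mean()
               for b in range(3)] for a in range(3)])
alt = B[j, j] - 2.0 * (P @ B[j]) + P @ B @ P
assert np.isclose(norm_sq, alt)
```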

and, as in the two-sample case treated by Harchaoui et al., $\hat V_j=n_j^{-1}G^{(j)}_nQ_{n_j}Q_{n_j}^T(G^{(j)}_n)^T$. Therefore

$$\hat W_n=n^{-1}\sum_{j=1}^kG^{(j)}_nQ_{n_j}Q_{n_j}^T(G^{(j)}_n)^T=n^{-1}G_nN_nN_n^TG_n^T \qquad (2)$$

and $\hat V_n=n^{-1}G_nN_nN_n^TG_n^T+\gamma_nI$. Using the matrix inversion lemma, as in the two-sample case, we obtain

$$\begin{aligned}P_j\|\hat V_n^{-1/2}\hat\delta_j\|^2_{\mathcal H}&=P_j\,(m^{(j)}_n)^TG_n^T\,\gamma_n^{-1}\Big\{I-G_nN_n\big(\gamma_nI_n+n^{-1}N_n^TG_n^TG_nN_n\big)^{-1}N_n^TG_n^T\Big\}G_nm^{(j)}_n\\&=\frac{n_j}{n\gamma_n}\Big\{(m^{(j)}_n)^T\Lambda_nm^{(j)}_n-n^{-1}(m^{(j)}_n)^T\Lambda_nN_n\big(\gamma_nI_n+n^{-1}N_n^T\Lambda_nN_n\big)^{-1}N_n^T\Lambda_nm^{(j)}_n\Big\}.\end{aligned}$$

Finally,

$$\sum_{j=1}^kP_j\|\hat V_n^{-1/2}\hat\delta_j\|^2_{\mathcal H}=\sum_{j=1}^k\frac{n_j}{n\gamma_n}\Big\{(m^{(j)}_n)^T\Lambda_nm^{(j)}_n-n^{-1}(m^{(j)}_n)^T\Lambda_nN_n\big(\gamma_nI_n+n^{-1}N_n^T\Lambda_nN_n\big)^{-1}N_n^T\Lambda_nm^{(j)}_n\Big\}.$$
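The matrix inversion lemma step used here has the shape $(\gamma I+n^{-1}BB^T)^{-1}=\gamma^{-1}\{I-B(n\gamma I+B^TB)^{-1}B^T\}$ with $B=G_nN_n$; it can be sanity-checked numerically with a generic matrix $B$ (our illustration, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(6)
n, gamma = 40, 0.1
B = rng.normal(size=(n, n))      # generic stand-in for G_n N_n
H = (B @ B.T) / n                # plays the role of n^{-1} G_n N_n N_n^T G_n^T

# matrix inversion (Woodbury) lemma:
# (H + gamma I)^{-1} = gamma^{-1} {I - B (n gamma I + B^T B)^{-1} B^T}
lhs = np.linalg.inv(H + gamma * np.eye(n))
rhs = (np.eye(n) - B @ np.linalg.inv(n * gamma * np.eye(n) + B.T @ B) @ B.T) / gamma
assert np.allclose(lhs, rhs)
```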

For computing $\ell(\hat W_n,\gamma_n)$ we use (1) and (2); putting $H_n=n^{-1}G_nN_nN_n^TG_n^T$, we have $\ell(\hat W_n,\gamma_n)^2=\operatorname{tr}\big((H_n+\gamma_nI)^{-2}H_n^2\big)$, and using the matrix inversion lemma, we obtain

$$(H_n+\gamma_nI)^{-1}=\gamma_n^{-1}I-n^{-1}\gamma_n^{-2}G_nN_nM_nN_n^TG_n^T$$

where $M_n=\big(I_n+n^{-1}\gamma_n^{-1}N_n^T\Lambda_nN_n\big)^{-1}$. Hence

$$\begin{aligned}(H_n+\gamma_nI)^{-2}H_n^2&=\gamma_n^{-2}H_n^2-2n^{-1}\gamma_n^{-3}G_nN_nM_nN_n^TG_n^TH_n^2+n^{-2}\gamma_n^{-4}G_nN_nM_nN_n^T\Lambda_nN_nM_nN_n^TG_n^TH_n^2\\&=\gamma_n^{-2}H_n^2-2n^{-3}\gamma_n^{-3}G_nN_nM_n(N_n^T\Lambda_nN_n)^2N_n^TG_n^T+n^{-4}\gamma_n^{-4}G_nN_n(M_nN_n^T\Lambda_nN_n)^2N_n^T\Lambda_nN_nN_n^TG_n^T\end{aligned}$$

and using the cyclic property of the trace, $\operatorname{tr}(AB)=\operatorname{tr}(BA)$, we finally obtain

$$\ell(\hat W_n,\gamma_n)^2=\operatorname{tr}\Big(n^{-2}\gamma_n^{-2}(N_n^T\Lambda_nN_n)^2-2n^{-3}\gamma_n^{-3}M_n(N_n^T\Lambda_nN_n)^3+n^{-4}\gamma_n^{-4}(M_nN_n^T\Lambda_nN_n)^2(N_n^T\Lambda_nN_n)^2\Big).$$
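Putting the pieces together, the whole computation can be sketched compactly in numpy. The code below is our illustration, not the authors' implementation: `k_sample_stat` and `gaussian_kernel` are hypothetical helpers, and for $\ell$ we use the eigenvalue form through the nonzero spectrum of $n^{-1}N_n^T\Lambda_nN_n$ (which coincides with that of $\hat W_n$) rather than the expanded trace formula:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def k_sample_stat(samples, gamma, sigma=1.0):
    sizes = [len(s) for s in samples]
    n = sum(sizes)
    nu = np.cumsum([0] + sizes)
    Lam = gaussian_kernel(np.vstack(samples), np.vstack(samples), sigma)

    # N_n: block-diagonal centering matrices Q_{n_j} = I - (1/n_j) 1 1^T
    N = np.zeros((n, n))
    for j, nj in enumerate(sizes):
        N[nu[j]:nu[j + 1], nu[j]:nu[j + 1]] = np.eye(nj) - 1.0 / nj

    A = N.T @ Lam @ N
    inv = np.linalg.inv(gamma * np.eye(n) + A / n)

    # numerator: sum_j P_j ||V_n^{-1/2} delta_hat_j||^2 via the Gram matrix
    num = 0.0
    for j, nj in enumerate(sizes):
        m = np.full(n, -1.0 / n)
        m[nu[j]:nu[j + 1]] += 1.0 / nj
        num += (nj / (n * gamma)) * (
            m @ Lam @ m - (m @ Lam @ N @ inv @ N.T @ Lam @ m) / n
        )

    # l(W_hat_n, gamma)^2 from the nonzero eigenvalues of n^{-1} N^T Lambda N
    lam = np.linalg.eigvalsh(A / n)
    l_sq = np.sum(lam**2 / (lam + gamma) ** 2)
    return num / np.sqrt(2.0 * l_sq)

rng = np.random.default_rng(5)
samples = [rng.normal(size=(nj, 1)) for nj in (30, 40, 35)]
T = k_sample_stat(samples, gamma=0.2)
```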

## 4 Power comparison by Monte Carlo simulation

In this section, the empirical power of the proposed test is computed through Monte Carlo simulations and compared to that of the tests introduced by Zhang and Wu, which are based on three statistics obtained from the likelihood-ratio test statistic and shown to be more powerful than the classical Kolmogorov-Smirnov, Cramér-von Mises and Anderson-Darling $k$-sample tests. We estimate the powers of our test and of the three aforementioned tests in the following cases ($k=3$):

Case 3: $P_1=\mathrm{Uniform}(0,1)$, $P_2=\mathrm{Beta}(1,1.5)$ and $P_3=\mathrm{Beta}(1.5,1)$;

For all tests we take the significance level $\alpha=0.05$, and the empirical power is computed over independent replications. For our test, we used the Gaussian kernel and computed the test statistic as indicated in Section 3.3 by taking

$$\gamma_n=0.2\times\mathbb 1_{[1,100[}(n)+0.01\times\mathbb 1_{[100,300]}(n)+n^{-0.25}\times\mathbb 1_{]300,+\infty[}(n).$$
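This regularization schedule translates directly into code (a small helper of ours, named `gamma_n` for illustration):

```python
def gamma_n(n):
    """Regularization schedule used in the simulations, as given in the text."""
    if 1 <= n < 100:
        return 0.2
    if 100 <= n <= 300:
        return 0.01
    return n ** -0.25

# spot checks on the three regimes
assert gamma_n(50) == 0.2
assert gamma_n(300) == 0.01
assert abs(gamma_n(10_000) - 0.1) < 1e-12
```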

The results are given in Figures 1 to 4, which plot the empirical power versus the total sample size $n$. They show that our test outperforms the three tests of Zhang and Wu in all cases.

Figure 1: Empirical power versus $n$ for Case 1 with significance level $\alpha=0.05$.

Figure 3: Empirical power versus $n$ for Case 3 with significance level $\alpha=0.05$.

## 5 Proofs

### 5.1 Preliminary results

In this section, we give some results that are necessary for proving Theorem 3.1.

###### Lemma 5.1

Assume that the assumptions of Sections 2 and 3.2 hold. Then, putting $W=\sum_{j=1}^k\rho_jV_j$, we have $\|\hat W_n-W\|_{HS}=O_P(n^{-1/2})$.

Proof. Let $(e_p)_{p\ge1}$ be an orthonormal basis of $\mathcal H$ consisting of eigenvectors of $W$, such that $e_p$ is associated to the $p$-th eigenvalue $\lambda_p(W)$. Using Lemma 21 of Harchaoui et al., we obtain

$$\sum_{p=1}^{+\infty}\big|\lambda_p(\hat W_n-W)\big|\le\sum_{p=1}^{+\infty}\|(\hat W_n-W)e_p\|_{\mathcal H}\le\sum_{j=1}^k\frac{n_j}{n}\Big\{\sum_{p=1}^{+\infty}\|(\hat V_j-V_j)e_p\|_{\mathcal H}\Big\}. \qquad (5.1)$$

From Proposition 12 of Harchaoui et al., we have $\sum_{p=1}^{+\infty}\|(\hat V_j-V_j)e_p\|_{\mathcal H}=O_P(n_j^{-1/2})$ for any $j$, and since $n_j/n\to\rho_j$ it follows from (5.1) that $\sum_{p=1}^{+\infty}\big|\lambda_p(\hat W_n-W)\big|=O_P(n^{-1/2})$. Furthermore,

$$\|\hat W_n-W\|_{HS}=\Big[\sum_{p=1}^{+\infty}\|(\hat W_n-W)e_p\|^2_{\mathcal H}\Big]^{1/2}\le\sum_{p=1}^{+\infty}\|(\hat W_n-W)e_p\|_{\mathcal H}\le\sum_{j=1}^k\frac{n_j}{n}\Big\{\sum_{p=1}^{+\infty}\|(\hat V_j-V_j)e_p\|_{\mathcal H}\Big\}$$

which proves that $\|\hat W_n-W\|_{HS}=O_P(n^{-1/2})$.

The following lemma gives an asymptotic approximation of the test statistic.

###### Lemma 5.2

Assume that the assumptions of Sections 2 and 3.2 hold. If $\gamma_n^{-1}n^{-1/2}\to0$, then

$$n\hat T_n=\tilde S_n+o_P(1), \qquad (3)$$

where

$$\tilde S_n=\frac{\sum_{j=1}^kn_j\,\|V_n^{-1/2}\hat\delta_j\|^2_{\mathcal H}}{\sqrt2\,\ell(W,\gamma_n)}.$$

Proof. Using Lemma 23 of Harchaoui et al., we have

$$\big|\ell(\hat W_n,\gamma_n)-\ell(W,\gamma_n)\big|\le\frac{\gamma_n^{-1}\|\hat W_n-W\|_{HS}}{1-\gamma_n^{-1}\|\hat W_n-W\|_{HS}}.$$

Then, from Lemma 5.1 it follows

$$\big|\ell(\hat W_n,\gamma_n)-\ell(W,\gamma_n)\big|=O_P(\gamma_n^{-1}n^{-1/2})=o_P(1). \qquad (4)$$

Furthermore,

$$\begin{aligned}\Big|\|\hat V_n^{-1/2}\hat\delta_j\|^2_{\mathcal H}-\|V_n^{-1/2}\hat\delta_j\|^2_{\mathcal H}\Big|&=\big|\langle\hat\delta_j,\hat V_n^{-1}\hat\delta_j\rangle_{\mathcal H}-\langle\hat\delta_j,V_n^{-1}\hat\delta_j\rangle_{\mathcal H}\big| \qquad (5)\\&=\big|\langle\hat V_n^{-1}\hat\delta_j,(V_n-\hat V_n)V_n^{-1}\hat\delta_j\rangle_{\mathcal H}\big|\\&\le\|\hat V_n^{-1}\hat\delta_j\|_{\mathcal H}\,\|V_n-\hat V_n\|_{HS}\,\|V_n^{-1}\hat\delta_j\|_{\mathcal H},\end{aligned}$$

$$\begin{aligned}\|V_n^{-1}\hat\delta_j\|^2_{\mathcal H}&=\langle\hat\delta_j,V_n^{-2}\hat\delta_j\rangle_{\mathcal H}=\operatorname{tr}\big(\hat\delta_j\otimes(V_n^{-2}\hat\delta_j)\big)=\operatorname{tr}\big(V_n^{-2}(\hat\delta_j\otimes\hat\delta_j)\big) \qquad (6)\\&=\sum_{p=1}^{+\infty}\big\langle(\hat\delta_j\otimes\hat\delta_j)e_p,V_n^{-2}e_p\big\rangle_{\mathcal H}=\sum_{p=1}^{+\infty}(\lambda_p+\gamma_n)^{-2}\langle\hat\delta_j,e_p\rangle^2_{\mathcal H}\\&\le\sum_{p=1}^{+\infty}(\lambda_p+\gamma_n)^{-2}\|e_p\|^2_{\mathcal H}\|\hat\delta_j\|^2_{\mathcal H}\le\sum_{p=1}^{+\infty}(\lambda_p+\gamma_n)^{-2}\|\hat\delta_j\|^2_{\mathcal H},\end{aligned}$$

and

$$\begin{aligned}\|\hat\delta_j\|_{\mathcal H}&=\Big\|\sum_{\substack{l=1\\l\neq j}}^kP_l(\hat m_j-\hat m_l)\Big\|_{\mathcal H}\le\sum_{\substack{l=1\\l\neq j}}^kP_l\|\hat m_j-\hat m_l\|_{\mathcal H} \qquad (7)\\&\le\sum_{\substack{l=1\\l\neq j}}^kP_l\big(\|\hat m_j-m_j\|_{\mathcal H}+\|\hat m_l-m_l\|_{\mathcal H}\big).\end{aligned}$$

Using the central limit theorem, we have $\|\hat m_j-m_j\|_{\mathcal H}=O_P(n_j^{-1/2})$ for all $j$, and since $n_j/n\to\rho_j>0$ it follows from (7) that $\|\hat\delta_j\|_{\mathcal H}=O_P(n^{-1/2})$. The fact that $(\lambda_p+\gamma_n)^{-2}\le\gamma_n^{-2}$ permits to deduce from (6) that $\|V_n^{-1}\hat\delta_j\|_{\mathcal H}=O_P(\gamma_n^{-1}n^{-1/2})$. Moreover,

$$\begin{aligned}\|\hat V_n^{-1}\hat\delta_j\|_{\mathcal H}&\le\|V_n^{-1}\hat\delta_j\|_{\mathcal H}+\|(\hat V_n^{-1}-V_n^{-1})\hat\delta_j\|_{\mathcal H} \qquad (8)\\&=\|V_n^{-1}\hat\delta_j\|_{\mathcal H}+\|\hat V_n^{-1}(V_n-\hat V_n)V_n^{-1}\hat\delta_j\|_{\mathcal H}\\&\le\|V_n^{-1}\hat\delta_j\|_{\mathcal H}+\|\hat V_n^{-1}\|\,\|V_n-\hat V_n\|_{HS}\,\|V_n^{-1}\hat\delta_j\|_{\mathcal H}.\end{aligned}$$

Next, using the upper-bound $\|\hat V_n^{-1}\|\le\gamma_n^{-1}$ and Lemma 5.1, which gives

$$\|\hat V_n-V_n\|_{HS}=\|\hat W_n-W\|_{HS}=O_P(n^{-1/2}), \qquad (9)$$

we get