We address the problem of comparing two probability distributions and from finite samples in , where both distributions are assumed to be continuous (with respect to the Lebesgue measure) and compactly supported. We consider the case where each distribution is observed through i.i.d. samples, called () and () respectively, and the two datasets and are independent. The methodology has applications in a variety of fields, particularly in bioinformatics. It can be used, for example, to test genetic similarities between subtypes of cancers, to compare patient groups to determine potential treatment propensity, and to detect small anomalies in medical images that are symptomatic of a certain disease. We will cover applications to flow cytometry and diffusion MRI datasets in this paper.
Due to the complicated nature of the datasets we would like to study, we are interested in the general alternative hypothesis test against
. This goes beyond tests of possible shifts of finite moments, for example, mean-shift alternatives. We also focus on the medium-dimensional setting, in which , where () is the number of samples in (). As , the dimension is assumed to be fixed. (For the scenario where , the convergence of kernel matrices to the limiting integral operator needs to be treated quite differently, and a new analysis is needed.) We are particularly interested in the situation where data are sampled from distributions which are locally low-rank, meaning that the local covariance matrices are generally of rank much smaller than . As will be clear in the analysis, the approach of constructing anisotropic kernels is most useful when data exhibit such characteristics in a high-dimensional ambient space.
We are similarly concerned with a -sample problem, in which the question is to determine the global relationships between distributions, each of which has a finite set of i.i.d. samples. This can be done by measuring a pairwise distance between any two distributions and combining these pairwise measurements to build a weighted graph between the distributions. Thus we focus on the two-sample test as the primary problem.
In the two-sample problem, the two datasets do not have any point-to-point correspondence, which means that they need to be compared in terms of their probability densities. One way to do this is to construct “bins” at many places in the domain, compute the histograms of the two datasets over these bins, and then compare the histograms. This is a good description of our basic strategy, and the techniques beyond it include using “anisotropic Gaussian” bins at every local point and “smoothing” the histograms when needed. Notice that there is a trade-off in constructing these bins: if a bin is too small, it may contain no points most of the time, and the histogram will have a large variance compared to its mean. When a bin is too large, it loses the power to distinguish and when they differ only slightly within the bin. In more than one dimension, one may hope to construct anisotropic bins which are large in some directions and small in others, so as to maximize the power to differentiate and while maintaining small variance. This turns out to be possible when the deviation has certain preferred (local) directions. We illustrate this idea in a toy example below; the analysis of testing consistency, including the comparison of different “binning” strategies, is carried out by analyzing the spectrum of the associated kernels in Section 3.
At the same time, we are not just interested in whether and deviate, but how and where they deviate. The difference of the histograms of and measured at multiple bins, introduced as above, can surely be used as an indication of where and differ. The formal concept is known as the witness function in literature, which we introduce in Section 2.3 and use throughout the applications.
The idea of using bins and comparing histograms dates back to measuring distribution distances based on kernel density estimation, and is also closely related to two-sample statistics based on the Reproducing Kernel Hilbert Space (RKHS) Maximum Mean Discrepancy (MMD). We discuss these connections in more detail in Section 1.2. While anisotropic kernels have been previously studied in manifold learning and image processing, kernel-based two-sample statistics with anisotropic kernels have not been examined in the situation where the data is locally low-rank, which is the theme of the current paper.
This paper makes several contributions to the two-sample problem. Methodologically, we introduce a kernel-based MMD statistic that increases the testing power against certain deviated distributions by adopting an anisotropic kernel, and reduces both computation and memory requirements. The proposed method can be combined with spectral smoothing of the histograms in order to reduce variability and possibly optimize the importance of certain regions of data, so that the power of the test may be further improved. Theoretically, asymptotic consistency is proved for any fixed deviation beyond the critical regime under generic assumptions. Experimentally, we provide two novel applications of the two-sample and -sample problems to biological datasets.
The rest of the paper is organized as follows: we begin with a sketch of the main idea and motivating example in the remainder of this section, together with a review of previous studies. Section 2
formalizes the definition of the proposed MMD statistics. Asymptotic analysis is given in Section 3. The algorithm and other implementation matters, including computational complexity, are discussed in Section 4. Section 5 covers numerical experiments on synthetic and real-world datasets.
1.1 Main Idea
Let and be two distributions supported on a compact set . Suppose that a reference set is given, and for each point there is a (non-degenerate) covariance matrix (e.g. computed by local PCA). We define the asymmetric affinity kernel to be
Consider the two independent datasets and , where has i.i.d. samples and has i.i.d. samples. The empirical histograms of and at the reference point are defined as
for which the population quantities are
The empirical histograms are nothing else but the Gaussian binning of and at point with the anisotropic bins corresponding to the covariance matrix . We then compute the quantity
as a measurement of the (squared) distance between the two datasets.
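To make the construction concrete, the following sketch (all function names, the neighborhood size `k`, the ridge term, and the global bandwidth `eps` are our own illustrative choices, not the paper's notation) estimates local covariances by a local-PCA step, evaluates the asymmetric anisotropic affinities from data points to reference points, and computes the squared distance between the two Gaussian-binned histograms:

```python
import numpy as np

def local_covariances(refs, data, k=20):
    """Estimate a local covariance matrix at each reference point
    from its k nearest data points (a simple local-PCA step)."""
    covs = []
    for r in refs:
        d2 = np.sum((data - r) ** 2, axis=1)
        nbrs = data[np.argsort(d2)[:k]]
        # Small ridge guards against near-singular local covariances.
        covs.append(np.cov(nbrs.T) + 1e-6 * np.eye(data.shape[1]))
    return np.array(covs)

def asym_kernel(points, refs, covs, eps=1.0):
    """Anisotropic Gaussian affinity exp(-(x-r)^T C_r^{-1} (x-r) / (2*eps))
    from every point to every reference point."""
    A = np.empty((len(points), len(refs)))
    for j, (r, C) in enumerate(zip(refs, covs)):
        diff = points - r
        A[:, j] = np.exp(-0.5 / eps * np.sum(diff @ np.linalg.inv(C) * diff, axis=1))
    return A

def mmd_stat(X, Y, refs, covs, eps=1.0):
    """Squared distance between the Gaussian-binned histograms of X and Y."""
    hX = asym_kernel(X, refs, covs, eps).mean(axis=0)  # empirical histogram of X
    hY = asym_kernel(Y, refs, covs, eps).mean(axis=0)  # empirical histogram of Y
    return np.mean((hX - hY) ** 2)
```

In practice the global scale `eps` plays the role of a bandwidth and must be tuned to the data, since the local-PCA covariances only set the shape and orientation of each bin.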
We use the following example to illustrate the difference between (1) using anisotropic kernel where is aligned with the tangent space of the manifold data, and (2) using the isotropic ones where
is a multiple of the identity matrix.
The data are shown in Figure 1, where and are supported on two arcs in separated by a gap of size at various regions. We begin by specifying a reference set and the covariance field . For simplicity, we do this by uniformly sampling reference points from (see Figure 1). At each reference point , we take neighbors to estimate the local covariance matrix by . The empirical histograms and are computed as in Eqn. (2) at every point (see Figure 1), as well as the quantity as in Eqn. (4). We also compute under a permutation of the data points in and
so as to mimic the null hypothesis, and we call the two values and respectively. The experiment is repeated 100 times, where , and the distributions of and are shown as red and blue bars in Figure 1. The simulation is done across three datasets where takes the values , and we compare isotropic and anisotropic kernels. When , the distributions of and overlap, as expected. When , greater separation between the distributions of and implies greater power for the test. The advantage of the anisotropic kernel is clearly demonstrated, particularly when (the middle row).
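The permutation step used above to mimic the null hypothesis can be sketched generically as follows (a standard recipe rather than the paper's exact implementation; the two-sample statistic is passed in as a function):

```python
import numpy as np

def permutation_null(X, Y, stat_fn, n_perm=100, rng=None):
    """Pool X and Y, re-split the pooled data at random, and recompute the
    statistic: the resulting values mimic the null hypothesis p = q."""
    rng = np.random.default_rng(rng)
    pooled = np.vstack([X, Y])
    n = len(X)
    null_vals = []
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        null_vals.append(stat_fn(pooled[idx[:n]], pooled[idx[n:]]))
    return np.array(null_vals)
```

Comparing the observed statistic against this permutation distribution gives an empirical p-value without requiring the limiting null distribution in closed form.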
The analysis of the testing power of
hinges on the singular value decomposition of , formally written as , which will be defined in Section 3. The histogram thus becomes
and similarly for , which means that the ability of to distinguish and is determined by the amount that and differ when projected onto the singular functions . For the example above, the first few singular functions are visualized in Figure 1, where the ’s of the anisotropic kernel project along the directions where and deviate at a lower index of , thus contributing more significantly to the quantity . Figure 1 also shows the “witness function” (, c.f. Section 2.3) of kernels, which indicates the regions of deviation between and .
Throughout the paper, we refer to the use of a local Mahalanobis distance with as an anisotropic kernel, and with as an isotropic kernel. Similarly, we refer to a kernel measuring affinity from all points to a reference set (i.e. ) as an asymmetric kernel. The analysis in Section 3 is for the symmetrized version of the kernel , while in practice one never computes the -by- kernel but only the -by- asymmetric kernel, which is equivalent and far more efficient. We discuss computation further in Section 4.4.
1.2 Related Work
The question of two sample testing is a central problem in statistics. In one dimension, one classical approach to two sample testing is the Kolmogorov-Smirnov distance, which compares the
distance between the two empirical cumulative distribution functions [17, 25]. While there exist generalizations of these infinitely supported bins to high dimensions [4, 11, 22, 14], these require a large computational cost for either computing a minimum spanning tree or running a large number of randomized tests. This motivates binning functions that are defined more locally and in a data-adaptive fashion. Another high-dimensional extension of Kolmogorov-Smirnov is to randomly project the data into a low-dimensional space and compute the test in each dimension.
The 1D Kolmogorov-Smirnov statistic can be seen as a special case of the MMD discrepancy, which is generally defined as
where is a certain family of integrable functions. When equals the set of all indicator functions of intervals in , the MMD discrepancy gives the Kolmogorov-Smirnov distance. Kernel-based MMD has been studied in , where the function class consists of all functions s.t. , where indicates the norm of the Hilbert space associated with the reproducing kernel. Specifically, supposing the PSD kernel is , the (squared) RKHS MMD can be written as
and can be estimated by (here we refer to the biased estimator in  which includes the diagonal terms)
The methodology and theory apply to any dimensional data as long as the kernel can be evaluated.
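As a point of comparison, the biased RKHS MMD estimator including the diagonal terms can be sketched as follows for a Gaussian kernel (a common choice for illustration; the bandwidth `sigma` and the function name are our own assumptions):

```python
import numpy as np

def rkhs_mmd2_biased(X, Y, sigma=1.0):
    """Biased estimator of the squared RKHS MMD with a Gaussian kernel,
    including diagonal terms: mean k(X,X) - 2 mean k(X,Y) + mean k(Y,Y)."""
    def gram(A, B):
        # Squared pairwise distances via the expansion |a-b|^2 = |a|^2 + |b|^2 - 2 a.b
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * sigma**2))
    return gram(X, X).mean() - 2 * gram(X, Y).mean() + gram(Y, Y).mean()
```

The biased version equals the squared norm of the difference of empirical mean embeddings, so it is nonnegative by construction.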
We consider a kernel of the form and its variants, where is a certain measure of the reference points. This can be seen as a special case of RKHS MMD , which considers a general PSD kernel . When and is isotropic, is the Lebesgue measure, is reduced to a Gaussian kernel. However, returning to the asymmetric kernel , as we do, allows us to easily redefine the local geometry around reference points and incorporate the local dimensionality reduction in (1). While the asymmetric kernel requires the additional technical assumption that the eigenmodes of the kernel do not vanish on the support of and , which is discussed in  in reference to isotropic Parzen windows, the construction yields a more powerful test for distributions that are concentrated near locally low-dimensional structures. Theoretically, our analysis gives a comparison of the testing power of different kernels, which turns out to be determined by the spectral decomposition of the kernels. This analysis augments that in : the asymptotic results in  do not imply testing power in high dimensions, as pointed out later by . Our analysis makes use of the spectral decomposition of the kernel with respect to the data distribution. While the empirical spectral expansion of translation-invariant kernels has been previously used to derive two-sample statistics, e.g. in  and more recently in [6, 27], and the idea dates back to earlier statistical work (see e.g. ), our setting is different due to the new construction of the kernel.
We generalize the construction by considering a family of kernels via “spectral filtering”, which truncates the spectral decomposition of the anisotropic kernels and modifies the eigenvalues, c.f. Section 2.2. The modified kernel may lead to improved testing power against certain departures. The problem of optimizing over kernels has been considered by , where one constructs a convex combination of a finite number of kernels drawn from a given family. The kernels considered in  are isotropic, possibly linear-time computable kernels with high variance, and the combination has the effect of choosing better spectral filters for separating two distributions. However, because the kernels we consider are anisotropic, they lie outside the family of kernels considered in 
and, in particular, have fundamentally different eigenfunctions over which we build linear combinations. Also, building spectral filters directly on the eigenfunctions yields a richer set of filters than those that can be constructed by finite convex combinations.
Our approach is also closely related to previous studies of distribution distances based on kernel density estimation . We generalize the results in  by considering non-translation-invariant kernels, which greatly increases the separation between the expectation of under the null hypothesis and its expectation under an alternative hypothesis. Moreover, it is well known that kernel density estimation, on which  is based, converges poorly in high dimension. In the manifold setting, the problem was remedied by normalizing the (isotropic) kernel in a modified way, and the estimation accuracy was shown to depend only on the intrinsic dimension . Our proposed approach takes additional advantage of the locally low-dimensional structure, and obtains improved distinguishing power compared to isotropic kernels when possible.
Finally, the proposed approach can be viewed as related to two-sample testing via nearest neighbors . In , one computes the nearest neighbors of a reference point among the data and derives a statistical test based on the amount the empirical ratio , where is the number of neighbors from (similarly ), deviates from the expected ratio under the null hypothesis, namely . Because the nearest-neighbor algorithm is based on Euclidean distance, it is equivalent to a kernel-based MMD with a hard-thresholded isotropic kernel. The approach could similarly be combined with a local Mahalanobis distance as we do, which has not been explored.
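To illustrate the connection, a minimal sketch of the nearest-neighbor statistic (our own simplified version, using plain Euclidean distance) computes, for each reference point, the fraction of its k nearest pooled neighbors coming from the first dataset; under the null this fraction concentrates around the expected ratio:

```python
import numpy as np

def knn_label_fractions(refs, X, Y, k=10):
    """For each reference point, the fraction of its k nearest neighbors
    in the pooled data that come from X; under p = q this fraction is
    close to n / (n + m) where n = len(X), m = len(Y)."""
    pooled = np.vstack([X, Y])
    labels = np.concatenate([np.ones(len(X)), np.zeros(len(Y))])
    fracs = []
    for r in refs:
        d2 = np.sum((pooled - r) ** 2, axis=1)
        fracs.append(labels[np.argsort(d2)[:k]].mean())
    return np.array(fracs)
```

Replacing the Euclidean distance here by a local Mahalanobis distance would give the anisotropic variant mentioned above.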
2 MMD Test Statistics and Witness Functions
Given two independent datasets and , where has i.i.d. samples drawn from distribution , and has i.i.d. samples drawn from , we aim to (i) test the hypothesis against the alternative, and (ii) when , detect where the two distributions differ. We assume that and are supported on a compact subset of , and both distributions have continuous probability densities, so we also use and to denote the densities and write integration w.r.t. as , and similarly for .
As suggested in Section 1.1, the reference set and the covariance field are important for the construction of the anisotropic kernel. In this section and next, we assume that is given and is pre-defined. In practice, will be computed by a preprocessing procedure, and can be estimated by local PCA if not given a priori, c.f. Section 4.
2.1 Kernel MMD statistics
Using the kernel defined in (1), we consider the following empirical statistic:
where and are defined in (2). Note that (6) assumes the measure along with the covariance field needed in (1). can be any distribution in general, and in practice, it is an empirical distribution over the finite set , i.e. , where is the number of points in . For now we leave to be general. The population statistic corresponding to (6) is
The kernel (8) is clearly positive semi-definite (PSD) but not necessarily “universal”, meaning that the population MMD in (7) being zero does not guarantee that . The test is thus restricted to departures within the Hilbert space (Assumption 2).
We introduce a spectral decomposition of the kernel based upon that of the asymmetric kernel , which sheds light on the analysis: Let be a distribution of data points (a mixture of and , to be specified later). Since is bounded by 1, so is the integral , and thus the asymmetric kernel is Hilbert-Schmidt and the integral operator is compact. The singular value decomposition of with respect to and can be written as
where , and are ortho-normal sets w.r.t and respectively. Then (8) can be written as
This formula suggests that the ability of the kernel MMD to distinguish and is determined by (i) how discriminative the eigenfunctions are (viewed as “feature extractors”), and (ii) how the spectrum decays (viewed as “weights” for combining the differences extracted per mode). It also naturally leads to generalizing the definition by modifying the weights , which we do next.
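This spectral viewpoint can be checked numerically on the empirical (finite-sample) version: the squared norm of the histogram difference decomposes exactly into per-mode contributions through the singular value decomposition of the asymmetric kernel matrix. The script below (with an isotropic affinity for simplicity, and all names our own) verifies the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (150, 2))
Y = rng.normal(0.5, 1.0, (150, 2))
refs = rng.normal(0.0, 1.0, (40, 2))

def affinity(P, R, eps=1.0):
    """Isotropic Gaussian affinity; the identity below holds for any kernel matrix."""
    d2 = np.sum(P**2, 1)[:, None] + np.sum(R**2, 1)[None, :] - 2 * P @ R.T
    return np.exp(-d2 / (2 * eps))

A = affinity(np.vstack([X, Y]), refs)        # (n+m)-by-n_ref asymmetric kernel matrix
d = A[:len(X)].mean(0) - A[len(X):].mean(0)  # difference of the two empirical histograms
T = np.sum(d**2)                             # squared-distance statistic

# SVD of the kernel matrix: T decomposes into per-mode contributions
U, s, Vt = np.linalg.svd(A, full_matrices=False)
proj = s * (U[:len(X)].mean(0) - U[len(X):].mean(0))  # s_k * (mean_X u_k - mean_Y u_k)
T_spec = np.sum(proj**2)                              # equals T exactly
```

The agreement of `T` and `T_spec` reflects that the right singular vectors form an orthonormal basis over the reference points, so only the per-mode projected differences, weighted by the singular values, contribute.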
2.2 Generalized kernel and spectral filtering
We consider the situation where the distributions and lie around certain lower-dimensional manifolds in the ambient space, and both densities are smooth with respect to the manifold and decay off-manifold, which is typically encountered in the applications of interest (c.f. Section 5). Meanwhile, since the reference set is sampled near the data, is also centered around the manifold. Thus one would expect the population histograms and to be smooth on the manifold as well. This suggests building a “low-pass-filter” for the empirical histograms before computing the distance between them, namely the MMD statistic.
We thus introduce a general form of kernel as
where is a sequence of sufficiently decaying positive numbers, with the precise requirement detailed in Section 3. Our analysis will be based on kernels of the form (11), which includes as a special case when . While the eigenfunctions are generally not analytically available, to compute the MMD statistics one only needs to evaluate the ’s on the data points in
, which can be approximated by the empirical singular vectors of the -by- kernel matrix and computed efficiently for MMD tests (c.f. Section 4). Note that the approximation by empirical spectral decomposition may degrade as increases; however, for the purpose of smoothing histograms one typically assigns small values of for large so as to suppress the “high-frequency components”.
The construction (11) gives a large and versatile family of kernels: First, setting for some positive integer is equivalent to using the kernel as where . This can be interpreted as redefining the affinity of points and by allowing -steps of “intermediate diffusion” on the reference set, and thus “on the data” . When is uniform over the whole ambient space, raising is equivalent to enlarging the bandwidth of the Gaussian kernel. However, when is chosen to adapt to the densities and , the kernel becomes a data-distribution-adapted object. As increases, decays rapidly, which “filters out” high-frequency components in the histograms when computing the MMD statistic, because . Generally, setting to be a decaying sequence has the effect of spectral filtering. Second, in the case that prior knowledge about the magnitude of the projection is available, one may also choose accordingly to select the “important modes”. Furthermore, one may view the kernel MMD with (11) as a weighted squared-distance statistic after projecting onto the spectral coordinates , where the coordinates are uncorrelated thanks to the orthogonality of . We further discuss possible generalizations in the last section. The paper is mainly concerned with the kernel , including , with anisotropic , though the above extension of MMD may be of interest even when is isotropic.
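A minimal sketch of such spectral filtering applied to the empirical kernel matrix (the function name and interface are our own; `gammas` plays the role of the weights above, and taking them equal to the squared singular values recovers the unfiltered statistic):

```python
import numpy as np

def filtered_mmd(A, nX, gammas):
    """Spectrally filtered squared-distance statistic: keep the leading
    len(gammas) singular modes of the asymmetric kernel matrix A (rows:
    the nX points of X followed by the points of Y; columns: reference
    points) and reweight each mode's contribution by gammas -- a
    'low-pass filter' on the histograms."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    L = len(gammas)
    # per-mode difference of the two datasets projected on singular vectors
    diff = U[:nX, :L].mean(0) - U[nX:, :L].mean(0)
    return np.sum(np.asarray(gammas) * (s[:L] ** 2) * diff ** 2)
```

Setting `gammas` to all ones over the full rank recovers the plain statistic, while truncating to a few leading modes suppresses high-frequency (and typically high-variance) components.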
2.3 Witness functions
Following the convention of RKHS MMD , the “witness function” is defined as
where and are the mean embeddings of and in , respectively. By the Riesz representation theorem, equals multiplied by a constant. We will thus consider as the witness function. By definition, , and similarly for , thus . We then have that for the statistic ,
and generally for ,
The computation of empirical witness functions will be discussed in Section 4.2.
Although the witness function is not an estimator of the difference , it gives a measurement of how and deviate at local places. This can be useful for user interpretation of the two sample test, as shown in Section 5. The witness function augments the MMD statistic, which is a global quantity.
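A sketch of the empirical witness function in this spirit (using an isotropic Gaussian affinity for simplicity; the anisotropic version only changes the affinity, and all names here are our own): for each evaluation point it aggregates the histogram difference over the reference points, so it is positive where the first dataset over-represents and negative where the second does.

```python
import numpy as np

def gauss_aff(P, R, eps=1.0):
    """Isotropic Gaussian affinity from points P to reference points R."""
    d2 = np.sum(P**2, 1)[:, None] + np.sum(R**2, 1)[None, :] - 2 * P @ R.T
    return np.exp(-d2 / (2 * eps))

def witness_values(eval_pts, X, Y, refs, eps=1.0):
    """Empirical witness: kernel-weighted aggregate over reference points
    of the difference between the two empirical histograms."""
    h_diff = gauss_aff(X, refs, eps).mean(0) - gauss_aff(Y, refs, eps).mean(0)
    return gauss_aff(eval_pts, refs, eps) @ h_diff / len(refs)
```

Evaluating `witness_values` on a grid localizes where the two samples deviate, complementing the single global MMD number.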
2.4 Kernel parameters
If the reference distribution and the covariance field are given, there is no tuning parameter to compute -MMD (c.f. Algorithm 2). To compute -MMD with general , which is truncated to be of finite rank , the number and the values are tunable parameters (c.f. Algorithm 3).
If the covariance field is provided up to a global scaling constant by , e.g., when estimated from local PCA, which means that one uses for some in (1), then this is a parameter which needs to be determined in practice.
Generally speaking, one may view the reference distribution and the covariance field as “parameters” of the MMD kernel. The optimization of these parameters surely has an impact on the power of the MMD statistics, and a full analysis goes beyond the scope of the current paper. At the same time, there are important application scenarios where pre-defined and are available, e.g., diffusion MRI imaging data (Section 5.3). We thus proceed in the simplified setting by assuming pre-defined and , and focus on the effects of the anisotropic kernel and the re-weighted spectrum.
3 Analysis of Testing Power
We consider the population MMD statistic of the following form
where and , and . We consider the limit where both and go to infinity proportionally to each other; in other words, and , and .
We will show that, under generic assumptions on the kernel and the departure , the test based on is asymptotically consistent, which means that the test power as (with controlled false-positive rate). The asymptotic consistency holds when is allowed to depend on , as long as decays to 0 more slowly than . We also provide a lower bound on the power based upon Chebyshev's inequality, which applies in the critical regime when is proportional to . The analysis also provides a quantitative comparison of the testing power of different kernels.
3.1 Assumptions on the kernel
Because is uniformly bounded and Hilbert-Schmidt, is positive semi-definite and compact, and thus can be expanded as in (10), where form an ortho-normal set under . Since , we also have that satisfies
and for any . Meanwhile, is continuous (by the continuity and uniform boundedness of ), which means that the series in (10) converges uniformly and absolutely, and the eigenfunctions are continuous (Mercer’s Theorem). Finally, (16) implies that the operator is in the trace class, and specifically .
The in (11) satisfy that , , and that the kernel is PSD, continuous, and for all .
As a result, Mercer's Theorem applies to guarantee that the spectral expansion (11) converges uniformly and absolutely, and the operator is in the trace class. Note that Assumption 1 holds for a large class of kernels, including all the important cases considered in this paper. The previous argument shows that is covered. Another important case is the finite-rank kernel: that is, when for some positive integer , where can be any positive numbers for such that and for all .
For the MMD test to distinguish a deviation from , it needs to have for such . We consider the family of alternatives which satisfies the following condition.
When , there exists s.t. and . In particular, if are strictly positive for all , then satisfies that does not vanish w.r.t , i.e. .
The following proposition, proved in the Appendix, shows that for such deviated .
Notations as above, for a fixed , the following are equivalent
(ii) For some , and .
If for all , then (i) is also equivalent to
Note that satisfies for all , thus (iii) applies. The proposition says that distinguishes an alternative when lies in the subspace spanned by , and for general , needs to lie in the subspace spanned by . These bases are usually not complete, e.g., when the reference set has points then is of rank at most . However, when the measure is continuous and smooth, the singular value decomposition (9) has a sufficiently decaying spectrum, and the low-frequency ’s can be efficiently approximated with a drastic downsampling of . This means that, under certain conditions on (sufficient overlap with and regularity), after replacing the continuous by a discrete sampling in constructing the kernel, an alternative violates Assumption 2 only when the departure lies entirely in the high-frequency modes w.r.t. the original . In the applications considered in this paper, such very high-frequency departures are rarely of interest to detect, not to mention that estimating for large lacks robustness with finite samples. Thus Assumption 2 poses a mild constraint for all practical purposes considered in this paper, even with the spectral filtering kernel , which utilizes a possibly truncated sequence of .
3.2 The centered kernel
We introduce the centered kernel under , defined as
where , and . The spectral decomposition of is the key quantity used in later analysis.
The following lemma shows that the MMD statistic, both the population and the empirical version, remains the same if is replaced by . The proof is by definition and details are omitted.
The kernel also inherits the following properties from , proved in Appendix A:
The kernel , as an integral operator on the space with the measure , is PSD. Under Assumption 1,
(1) for any , and for any ,
(2) is continuous,
(3) is Hilbert-Schmidt as an operator on and is in the trace class. It has the spectral expansion
where , . are continuous on , are both summable and square summable, and the series in (20) converges uniformly and absolutely.
(4) The eigenfunctions are square integrable, and thus integrable, w.r.t. . Furthermore, .
3.3 Limiting distribution of and asymptotic consistency
Consider the alternative for some fixed , . remains a probability density for any . We define the constants
where are finite by the integrability of under ((4) of Proposition 3.3), and .
The theorem below identifies the limiting distribution of under various orders of , which may decrease to 0 as . The techniques closely follow Chapter 6 of  (see also Theorems 12, 13 in ), where the key step is to replace the two independent sets of summations in (23
) by normal random variables via the multivariate CLT (Lemma A.1), using a truncation argument based on the decay of . The proof is left to Appendix A.
Let , which may depend on , , and adopt the notations , , and others as above. Under Assumption 1, as with , ,
(1) If , (including the case when ), then
Due to the summability of , the random variable on the right-hand side is well-defined.
(2) If , where , then
(3) If , then
where , with .
As a direct result of Theorem 3.4 (1), under ,
When , the limiting density is where are i.i.d. random variables. Theorem 3.4 (3) shows the asymptotic normality of under . One can verify that when , , where . These limiting densities under recover the classical result of V-statistics (Theorem 6.4.1.B and Theorem 6.4.3 ).
The numerical experiments in Section 3.5 show that the theoretical limits identified in Theorem 3.4 (1) approximate the empirical distributions of quite well when equals a few hundred (Fig. 2). Below, we address the asymptotic consistency of the MMD test; the finite-sample bound of the test power will be discussed in the next subsection.
A test based on the MMD statistic rejects whenever exceeds a certain threshold . For a target “level” (typically ), a test (with samples) achieves level if , and the “power” against an alternative is defined as . We define the test power at level as
The threshold needs to be sufficiently large to guarantee that the false-positive rate is at most , and in practice it is a parameter to be determined (c.f. Section 4.1). Below, we omit the dependence on in the notation, and write as . The test is called asymptotically consistent if as . For the one-parameter family of , the following theorem shows that, if satisfies Assumption 2, the MMD test has nontrivial power when converges to a positive constant, and is asymptotically consistent if .
(1) If , , then , where is a monotonic function of .
(2) If , then .
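In practice the rejection threshold at a given level is typically taken as an empirical quantile of permutation-null statistics, as in Section 4.1; a minimal sketch of the decision rule (names our own):

```python
import numpy as np

def reject_at_level(t_obs, null_stats, alpha=0.05):
    """Reject the null hypothesis when the observed statistic exceeds
    the empirical (1 - alpha) quantile of the permutation-null statistics,
    which controls the false-positive rate at approximately alpha."""
    t_alpha = np.quantile(null_stats, 1 - alpha)
    return t_obs > t_alpha
```

The empirical power of the test against a given alternative can then be estimated as the rejection frequency over repeated draws.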
3.4 Non-asymptotic bound of the testing power
In this section, we derive a non-asymptotic (lower) bound on the testing power for finite , which shows that the convergence is at least as fast as as increases, whenever is greater than a certain threshold value.
In particular, assuming that is uniformly bounded between for some , the constants , , , are all uniformly bounded with respect to by constants which depend only on and .
The above bound applies when