Dependence measures based on reproducing kernel Hilbert spaces, also known as the Hilbert-Schmidt Independence Criterion and denoted HSIC, are widely used to decide statistically whether or not two random vectors are dependent. Recently, non-parametric HSIC-based statistical tests of independence have been developed. However, these tests raise the question of the prior choice of the kernels associated with the HSIC: there is as yet no method to select specific kernels objectively. In order to avoid a particular kernel choice, we propose a new HSIC-based aggregated procedure that takes several Gaussian kernels into account. To achieve this, we first propose non-asymptotic single tests of level $\alpha$ with second kind error controlled by $\beta$. We also provide a sharp upper bound on the uniform separation rate of the proposed tests. Thereafter, we introduce a multiple testing procedure in the case of Gaussian kernels, considering a set of various parameters. These aggregated tests are shown to be of level $\alpha$ and to outperform single tests in terms of uniform separation rates.
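In this paper, we study the problem of testing the independence of two random vectors $X = (X^{(1)}, \ldots, X^{(p)})$ and $Y = (Y^{(1)}, \ldots, Y^{(q)})$. Let us first introduce some notation and assumptions. The couple $(X, Y)$ is assumed to have a joint density $f$ w.r.t. the Lebesgue measure on $\mathbb{R}^{p+q}$. The probability measure associated with this density is denoted $P_f$. The marginal densities of $X$ and $Y$ are denoted $f_1$ and $f_2$ respectively. We also denote by $f_1 \otimes f_2$ the product of the marginal densities $f_1$ and $f_2$, defined as follows:
$$f_1 \otimes f_2 (x, y) = f_1(x)\, f_2(y), \qquad (x, y) \in \mathbb{R}^p \times \mathbb{R}^q.$$
By analogy with the notation $P_f$, the notation $P_{f_1 \otimes f_2}$ designates the probability measure associated with $f_1 \otimes f_2$. The density $f$ is assumed to be unknown, as are the marginals $f_1$ and $f_2$. We also assume that we have an $n$-sample $(X_i, Y_i)_{1 \le i \le n}$ of i.i.d. random variables with common density $f$. We address here the question of testing the null hypothesis $(H_0)$: “$X$ and $Y$ are independent” against the alternative $(H_1)$: “$X$ and $Y$ are dependent”. This is equivalent to testing
$(H_0)$: “$f = f_1 \otimes f_2$” against $(H_1)$: “$f \neq f_1 \otimes f_2$”.
Throughout this document, the densities $f$, $f_1$ and $f_2$ are assumed to be bounded, and $M_f$ denotes the maximum of their infinity norms: $M_f = \max\left(\|f\|_\infty, \|f_1\|_\infty, \|f_2\|_\infty\right)$.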
Non parametric tests of independence. To test independence between $X$ and $Y$, many approaches have been explored in the last few decades. Among them, [Hoeffding, 1948] proposes an independence test based on the difference between the joint distribution function of $(X, Y)$ and the product of the marginal distribution functions. This test has good properties in the asymptotic framework: it is consistent and distribution-free under the null hypothesis. However, it is only designed for univariate random variables ($p = q = 1$). Moreover, the statistic of this test is not practical to estimate. The authors of [Bergsma et al., 2014]
propose an improvement of Hoeffding’s test, which is also applicable to discrete random variables. Besides, the statistic associated with this test is easier to estimate than Hoeffding’s. More recently, [Weihs et al., 2018] propose to extend Hoeffding’s test to the case of multivariate random variables. The estimation of the associated statistic requires a prior partition of the sample space. However, the chosen partition strongly affects the quality of the test, and there is no theoretical method to choose this partition objectively. Another classical method for testing independence between $X$ and $Y$ is based on comparing the joint density and the product of the marginals [Ahmad and Li, 1997, Rosenblatt and Wahlen, 1992]. For this, an intermediate step is to estimate these densities using the kernel-based method of Parzen-Rosenblatt [Parzen, 1962]. The major drawback of this method is that convergence is slow in high dimensions, i.e. when $p + q$ is large (this is known as the curse of dimensionality, see e.g. [Scott, 2012]). This approach is therefore not feasible in high dimensions with a limited sample size. More recently, many approaches based on Reproducing Kernel Hilbert Spaces (RKHS, see [Aronszajn, 1950] for more details) have been developed. In particular, several RKHS-based dependence measures have been proposed. These measures all share the property (under certain conditions on the kernels) of being zero if and only if $X$ and $Y$ are independent. We mention the Kernel Canonical Correlation (KCC), first introduced in [Bach and Jordan, 2002]. It has been shown that this measure characterizes independence in the case of Gaussian kernels (see [Bach and Jordan, 2002] for more details). Unfortunately, the estimation of KCC requires an extra regularisation, which is not practical. Other dependence measures, easier to estimate and characterizing independence for a larger class of RKHS kernels, namely universal kernels [Micchelli et al., 2006], have been proposed later.
For instance, the Kernel Mutual Information (KMI) [Gretton et al., 2003, Gretton et al., 2005b] and the Constrained Covariance (COCO) [Gretton et al., 2005c, Gretton et al., 2005b], which are relatively easy to interpret and implement, have been widely used. Last but not least, one of the most interesting kernel dependence measures is the Hilbert-Schmidt Independence Criterion (HSIC) [Gretton et al., 2005a]. The HSIC is very easy to compute and outperforms, both analytically and numerically, all previous kernel-based dependence measures [Gretton et al., 2005a]. However good a given dependence measure may be, a direct interpretation of its estimated value may not be enough to distinguish dependence from independence. To study the independence between $X$ and $Y$ further, independence tests based on these measures can be used. The first RKHS-based statistical test of independence was proposed by [Gretton et al., 2008]. This test was proposed in an asymptotic framework for HSIC measures, using the distributions of HSIC estimators under $(H_0)$ and under $(H_1)$. These tests remain by far the most commonly used kernel-based tests of independence. A generalisation of this test to the joint and mutual independence of several random variables is presented in [Pfister et al., 2018]. We also mention the RKHS-based test of [Póczos et al., 2012], inspired by [Gretton et al., 2008]. This test is based on a new dependence measure called the copula-based kernel dependency measure. Yet, this measure seems more difficult to estimate than the HSIC. The distance covariance, which is based on the difference between the joint characteristic function of $(X, Y)$ and the product of the marginal characteristic functions, was introduced in [Székely et al., 2007]. The distance covariance has good properties and has been used to study the independence between random variables of high dimension [Székely and Rizzo, 2013, Yao et al., 2018]. Furthermore, it has been shown that the distance covariance is not truly a new dependence measure: it is none other than the HSIC with a specific choice of kernels. We also mention the statistical test of independence based on the kernel mutual information recently proposed by [Berrett and Samworth, 2017]. This test seems to achieve results comparable with those of the classical HSIC-based tests. Still, its implementation is more difficult and time-consuming. For all these reasons, we focus in this paper on HSIC measures to test independence between $X$ and $Y$.
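The statement that the distance covariance is an HSIC with specific kernels can be checked numerically on the empirical (V-statistic) versions. The sketch below is only an illustration under stated assumptions: the kernel $k(z, z') = \|z\| + \|z'\| - \|z - z'\|$ is one standard distance-induced choice that is not spelled out in the text, and the function names are hypothetical.

```python
import numpy as np

def pairwise_dist(z):
    """Euclidean distance matrix between the rows of z (shape (n, d))."""
    diff = z[:, None, :] - z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def centring(n):
    """Centring matrix H = I - (1/n) 1 1^T."""
    return np.eye(n) - np.ones((n, n)) / n

def dcov2_vstat(x, y):
    """Squared sample distance covariance (V-statistic form)."""
    n = x.shape[0]
    h = centring(n)
    a = h @ pairwise_dist(x) @ h   # double-centred distances of x
    b = h @ pairwise_dist(y) @ h   # double-centred distances of y
    return (a * b).sum() / n ** 2

def distance_kernel(z):
    """Gram matrix of k(z, z') = |z| + |z'| - |z - z'| (assumed kernel)."""
    norms = np.linalg.norm(z, axis=1)
    return norms[:, None] + norms[None, :] - pairwise_dist(z)

def hsic_vstat(k, l):
    """Biased (V-statistic) HSIC estimator trace(K H L H) / n^2."""
    n = k.shape[0]
    h = centring(n)
    return np.trace(k @ h @ l @ h) / n ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
y = x[:, :1] ** 2 + 0.5 * rng.normal(size=(50, 1))
dcov2 = dcov2_vstat(x, y)
hsic = hsic_vstat(distance_kernel(x), distance_kernel(y))
print(dcov2, hsic)  # the two values agree up to rounding error
```

The agreement follows from the fact that centring removes the terms $\|z\|$ and $\|z'\|$ of the kernel, so that the centred Gram matrix is the negative of the double-centred distance matrix.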
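Review on HSIC measures. The definition of the HSIC is derived from the notion of cross-covariance operator [Fukumizu et al., 2004], which can be seen as a generalisation of the classical covariance, measuring many forms of dependence between $X$ and $Y$ (not only linear ones). For this, [Gretton et al., 2005a] associate with $X$ an RKHS $\mathcal{F}$ composed of functions mapping from $\mathbb{R}^p$ to $\mathbb{R}$ ($\mathcal{F}$ is a set of transformations for $X$), and characterized by a scalar product $\langle \cdot, \cdot \rangle_{\mathcal{F}}$. The same operation is carried out for $Y$, considering an RKHS denoted $\mathcal{G}$ and a scalar product $\langle \cdot, \cdot \rangle_{\mathcal{G}}$. The cross-covariance operator $C_{X,Y}$ associated with the RKHS $\mathcal{F}$ and $\mathcal{G}$ is the operator mapping from $\mathcal{G}$ to $\mathcal{F}$ and verifying, for all $(F, G) \in \mathcal{F} \times \mathcal{G}$,
$$\langle F, C_{X,Y}(G) \rangle_{\mathcal{F}} = \operatorname{Cov}\left(F(X), G(Y)\right).$$
Designating by $(u_i)_{i \ge 1}$ and $(v_j)_{j \ge 1}$ orthonormal bases of $\mathcal{F}$ and $\mathcal{G}$ respectively, the HSIC between $X$ and $Y$ is the square of the operator’s Hilbert-Schmidt norm [Gretton et al., 2005a], defined as
$$\mathrm{HSIC}(X, Y) = \|C_{X,Y}\|_{HS}^2 = \sum_{i, j} \langle u_i, C_{X,Y}(v_j) \rangle_{\mathcal{F}}^2.$$
The fundamental idea behind this definition is that $\mathrm{HSIC}(X, Y)$ is zero if and only if $\operatorname{Cov}(F(X), G(Y)) = 0$ for all $(F, G) \in \mathcal{F} \times \mathcal{G}$. Furthermore, we already know (see e.g. [Jacod and Protter, 2012]) that $X$ and $Y$ are independent if and only if $\operatorname{Cov}(F(X), G(Y)) = 0$ for all bounded and continuous functions $F$ and $G$. It follows that, for well-chosen RKHS, the nullity of the HSIC characterizes independence. Before giving such a condition, we recall that [Gretton et al., 2005a] expressed $\mathrm{HSIC}(X, Y)$ in a very convenient form, using the kernels $k$ and $l$ respectively associated with $\mathcal{F}$ and $\mathcal{G}$,
$$\mathrm{HSIC}(X, Y) = \mathbb{E}\left[k(X, X')\, l(Y, Y')\right] + \mathbb{E}\left[k(X, X')\right] \mathbb{E}\left[l(Y, Y')\right] - 2\, \mathbb{E}\left[\mathbb{E}\left[k(X, X') \mid X\right] \mathbb{E}\left[l(Y, Y') \mid Y\right]\right], \qquad (1)$$
where $(X', Y')$ is an independent and identically distributed copy of $(X, Y)$. Note that $\mathrm{HSIC}(X, Y)$ only depends on the density $f$ of $(X, Y)$. We thus denote it $\mathrm{HSIC}(f)$ in the following.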
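The authors of [Gretton et al., 2005a] showed that a sufficient condition for the nullity of the associated HSIC to characterize independence is that the RKHS $\mathcal{F}$ (resp. $\mathcal{G}$) induced by $k$ (resp. $l$) is dense in the space of bounded and continuous functions mapping from $\mathbb{R}^p$ (resp. $\mathbb{R}^q$) to $\mathbb{R}$. Such kernels are called universal [Micchelli et al., 2006]. Among this class of kernels, the most commonly used are Gaussian kernels [Steinwart, 2001]. In the rest of this paper, we consider Gaussian kernels. Let us introduce some notation. We denote by $g_s$ the density of the standard Gaussian distribution on $\mathbb{R}^s$, defined for all $x \in \mathbb{R}^s$ by
$$g_s(x) = \frac{1}{(2\pi)^{s/2}} \exp\left(-\frac{1}{2} \sum_{i=1}^{s} x_i^2\right).$$
For any bandwidths $\lambda = (\lambda_1, \ldots, \lambda_p) \in (0, +\infty)^p$ and $\mu = (\mu_1, \ldots, \mu_q) \in (0, +\infty)^q$, we define for any $x \in \mathbb{R}^p$ and $y \in \mathbb{R}^q$,
$$\varphi_\lambda(x) = \frac{1}{\lambda_1 \cdots \lambda_p}\, g_p\left(\frac{x_1}{\lambda_1}, \ldots, \frac{x_p}{\lambda_p}\right), \qquad \varphi_\mu(y) = \frac{1}{\mu_1 \cdots \mu_q}\, g_q\left(\frac{y_1}{\mu_1}, \ldots, \frac{y_q}{\mu_q}\right).$$
Finally, we define the Gaussian kernels, for $x, x' \in \mathbb{R}^p$ and $y, y' \in \mathbb{R}^q$,
$$k_\lambda(x, x') = \varphi_\lambda(x - x'), \qquad l_\mu(y, y') = \varphi_\mu(y - y').$$
We denote by $\mathrm{HSIC}_{\lambda,\mu}(f)$ the HSIC measure defined in (1), where the kernels $k$ and $l$ are respectively the Gaussian kernels $k_\lambda$ and $l_\mu$.
In practice, $\mathrm{HSIC}_{\lambda,\mu}(f)$ cannot be computed, since it depends on the unknown density $f$. Given an i.i.d. $n$-sample $(X_i, Y_i)_{1 \le i \le n}$ with common density $f$, $\mathrm{HSIC}_{\lambda,\mu}(f)$ can be estimated by estimating each expectation of Equation (1). For this, we introduce the following $U$-statistics, of order 2, 3 and 4 respectively,
$$\widehat{\mathrm{HSIC}}^{(2)}_{\lambda,\mu} = \frac{1}{n(n-1)} \sum_{(i,j) \in \mathbf{i}_2^n} k_\lambda(X_i, X_j)\, l_\mu(Y_i, Y_j),$$
$$\widehat{\mathrm{HSIC}}^{(3)}_{\lambda,\mu} = \frac{1}{n(n-1)(n-2)} \sum_{(i,j,r) \in \mathbf{i}_3^n} k_\lambda(X_i, X_j)\, l_\mu(Y_j, Y_r),$$
$$\widehat{\mathrm{HSIC}}^{(4)}_{\lambda,\mu} = \frac{1}{n(n-1)(n-2)(n-3)} \sum_{(i,j,q,r) \in \mathbf{i}_4^n} k_\lambda(X_i, X_j)\, l_\mu(Y_q, Y_r),$$
where $\mathbf{i}_r^n$ is the set of all $r$-tuples drawn without replacement from the set $\{1, \ldots, n\}$. We estimate $\mathrm{HSIC}_{\lambda,\mu}(f)$ by the $U$-statistic
$$\widehat{\mathrm{HSIC}}_{\lambda,\mu} = \widehat{\mathrm{HSIC}}^{(2)}_{\lambda,\mu} + \widehat{\mathrm{HSIC}}^{(4)}_{\lambda,\mu} - 2\, \widehat{\mathrm{HSIC}}^{(3)}_{\lambda,\mu}. \qquad (5)$$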
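As an illustration of how such a $U$-statistic estimator can be evaluated without enumerating all 4-tuples, here is a minimal sketch. It rests on assumptions that should be checked against the definitions of the paper: the Gaussian kernel is taken to be the product of rescaled standard normal densities, and the estimator combines the $U$-statistics of order 2, 3 and 4 as "order 2 plus order 4 minus twice order 3", following the three expectations of Equation (1). The function names `gaussian_gram` and `hsic_ustat` are hypothetical.

```python
import numpy as np

def gaussian_gram(z, bandwidths):
    """Gram matrix of k(z, z') = prod_i (1 / b_i) * g((z_i - z'_i) / b_i),
    where g is the standard normal density (assumed normalisation)."""
    z = np.asarray(z, dtype=float)
    b = np.asarray(bandwidths, dtype=float)
    u = (z[:, None, :] - z[None, :, :]) / b          # shape (n, n, d)
    dens = np.exp(-0.5 * u ** 2) / (np.sqrt(2.0 * np.pi) * b)
    return dens.prod(axis=-1)

def hsic_ustat(k, l):
    """U-statistic HSIC estimator from two symmetric Gram matrices.

    The sums over tuples of distinct indices are obtained from matrix
    identities, so the cost is O(n^2) instead of O(n^4)."""
    n = k.shape[0]
    if n < 4:
        raise ValueError("at least 4 observations are required")
    k = k - np.diag(np.diag(k))                      # drop diagonal terms
    l = l - np.diag(np.diag(l))
    tr = (k * l).sum()                               # sum_{i != j} K_ij L_ij
    kl = (k.sum(axis=1) * l.sum(axis=1)).sum()       # 1^T K L 1
    s2 = tr                                          # distinct pairs
    s3 = kl - tr                                     # distinct triples
    s4 = k.sum() * l.sum() - 4.0 * kl + 2.0 * tr     # distinct 4-tuples
    u2 = s2 / (n * (n - 1))
    u3 = s3 / (n * (n - 1) * (n - 2))
    u4 = s4 / (n * (n - 1) * (n - 2) * (n - 3))
    return u2 + u4 - 2.0 * u3

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y = np.sin(3.0 * x) + 0.3 * rng.normal(size=(200, 1))
print(hsic_ustat(gaussian_gram(x, [0.5]), gaussian_gram(y, [0.5])))
```

The bandwidths used in the example are arbitrary; choosing them is precisely the question that the aggregated procedure is meant to address.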
Such estimators of the HSIC have been used to construct independence tests. A first asymptotic test of level $\alpha$ was proposed by [Gretton et al., 2008]. For this, the authors show that, under $(H_0)$, the asymptotic distribution of the HSIC estimator can be approximated by a Gamma distribution whose parameters are easy to estimate. Furthermore, [Gretton and Györfi, 2010] show the asymptotic consistency of the test (the convergence to one of the power under any reasonable alternative). However, this testing procedure has two main disadvantages. Firstly, it is purely asymptotic, in the sense that the critical value of the test is obtained from an approximation of the asymptotic distribution under $(H_0)$. In particular, the first kind error is controlled only in the asymptotic framework. Secondly, only a heuristic choice of the bandwidths $\lambda$ and $\mu$ is proposed, with no theoretical guarantees. In order to avoid such an arbitrary choice, we consider aggregated procedures which may lead to adaptive tests.
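Towards adaptivity. To avoid the unjustified choice of the bandwidths $\lambda$ and $\mu$, a first step is to define a criterion for comparing the performance of the HSIC tests associated with different bandwidths. For this, we consider the uniform separation rate as defined in [Baraud et al., 2003]. For any level-$\alpha$ test $\Delta_\alpha$ with values in $\{0, 1\}$, rejecting independence when $\Delta_\alpha = 1$, the uniform separation rate $\rho(\Delta_\alpha, \mathcal{C}_\delta, \beta)$ of the test $\Delta_\alpha$, over a class $\mathcal{C}_\delta$ of alternatives $f$ such that $f - f_1 \otimes f_2$ satisfies smoothness assumptions, with respect to the $L_2$-norm, is defined for all $\beta$ in $(0, 1)$ by
$$\rho(\Delta_\alpha, \mathcal{C}_\delta, \beta) = \inf\left\{\rho > 0 \; ; \; \sup_{f \in \mathcal{F}_\rho(\mathcal{C}_\delta)} P_f(\Delta_\alpha = 0) \le \beta\right\}, \qquad \mathcal{F}_\rho(\mathcal{C}_\delta) = \left\{f \; ; \; f - f_1 \otimes f_2 \in \mathcal{C}_\delta, \; \|f - f_1 \otimes f_2\|_2 > \rho\right\}.$$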
The uniform separation rate is thus the smallest value, in the sense of the $L_2$-norm, of $f - f_1 \otimes f_2$ (the difference between the joint density and the product of the marginals) for which the second kind error of the test is controlled by $\beta$. This definition is the natural non-asymptotic version of the critical radius defined and studied for several examples in a series of papers by Ingster (see e.g. [Ingster, 1993a, Ingster, 1996]). A test of level $\alpha$ with optimal performance should then have the smallest possible uniform separation rate (up to a multiplicative constant) over the class of alternatives. Such tests are generally called optimal in the minimax sense. The problem of non-asymptotic minimax rates of testing has been raised in many papers over the past years. Among them, we mention for example [Ingster and Suslina, 1998, Laurent et al., 2012] for minimax detection of signals and [Donoho et al., 1996, Kerkyacharian and Picard, 1993] for minimax density estimation. However, only a few works address the problem of minimax independence testing. The notable ones are those of Ingster [Ingster, 1989, Ingster, 1993b] and those of Yodé [Yodé, 2004, Yodé, 2011]. Still, these works are set in the asymptotic framework. As far as we know, no minimax rate of independence testing has yet been proved in the non-asymptotic framework. Furthermore, beyond the problem of the minimax rate, the direct practical construction of a minimax test is impossible. Indeed, this construction depends on the unknown smoothness parameters defining the class of alternatives. The objective is then to construct a minimax test which does not require any smoothness property to be implemented. Such tests are called minimax adaptive (or assumption free). It has been shown that a standard logarithmic price is sometimes inevitable for adaptivity [Spokoiny et al., 1996]. The problem of adaptivity has received considerable attention in the literature. We mention for instance [Baraud et al., 2003] for testing a linear regression model with normal noise and [Butucea and Tribouley, 2006] for testing the equality of the densities of two samples. For the specific case of testing independence, the adaptive testing procedure proposed in [Yodé, 2011] seems to be the only one currently available. As mentioned above, this test is purely asymptotic, whereas we are interested here in the non-asymptotic framework. Recently, an interesting approach proposed in [Fromont et al., 2013] consists in testing the equality of the intensities of two Poisson processes by aggregating several kernels in a single testing procedure. It has been shown in [Fromont et al., 2013] that this testing procedure is adaptive over several regularity spaces. Inspired by these works, and following [Gretton et al., 2008, Gretton and Györfi, 2010], we consider in this paper a procedure for testing independence based on HSIC measures and aggregating a given set of Gaussian-kernel HSIC tests. Firstly, this procedure avoids the choice of a particular kernel for HSIC tests. Secondly, we show in this paper that the rate of this testing procedure over particular Sobolev and Nikol’skii-Besov balls can be upper bounded by a rate which seems optimal compared to "classical" rates of testing in other frameworks. This suggests that this test may be adaptive over these spaces of regularity.
In this paper, we first study a theoretical test (in the sense that its critical value depends on the unknown marginal densities $f_1$ and $f_2$) based on such estimators of the HSIC, for which we provide non-asymptotic conditions to control the second kind error.
The study of this theoretical test allows us to introduce a new procedure based on the aggregation of these tests for various bandwidths, avoiding the arbitrary choice of those parameters. We provide non-asymptotic theoretical guarantees for this aggregated procedure by proving that it satisfies a non-asymptotic oracle-type condition for the uniform separation rate and outperforms single tests.
Notice that in practice, we consider a permutation approach to implement the aggregated testing procedure, leading to a test with prescribed non-asymptotic level $\alpha$. We complete this study by establishing non-asymptotic uniform separation rates over Sobolev balls and Nikol’skii-Besov balls. This document is organized as follows. In Section 2, we first give in Section 2.1 a non-asymptotic condition on $f$ in terms of the theoretical value of the HSIC so that the second kind error of the single test associated with $\lambda$ and $\mu$ is controlled. We then provide in Section 2.2 such a condition w.r.t. the parameters $\lambda$, $\mu$ and the sample size $n$. Finally, we give in Section 2.4 a sharp upper bound on the separation rate of single tests. In Section 3, we present in Section 3.1 the aggregated testing procedure. Thereafter, we give in Section 3.2 an oracle-type inequality for the separation rate of the aggregated test. In Section 3.3, we consider two particular classes of functions, Sobolev balls and Nikol’skii-Besov balls, showing that the uniform separation rate of a well-chosen aggregated test is of the same order as that of the optimal single test, up to a small factor of .
Throughout the paper, the generic notation $C(a, b, \ldots)$ denotes a positive constant depending only on its arguments $(a, b, \ldots)$ and that may vary from line to line.
2 Single kernel-based tests
2.1 The testing procedures
A first theoretical test.
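Since Gaussian kernels are characteristic, testing independence between $X$ and $Y$ is equivalent to testing
$(H_0)$: “$\mathrm{HSIC}_{\lambda,\mu}(f) = 0$” against $(H_1)$: “$\mathrm{HSIC}_{\lambda,\mu}(f) > 0$”.
The statistic $\widehat{\mathrm{HSIC}}_{\lambda,\mu}$ is then a natural choice to test independence between $X$ and $Y$, since it is an unbiased estimator of $\mathrm{HSIC}_{\lambda,\mu}(f)$. The corresponding test rejects independence if $\widehat{\mathrm{HSIC}}_{\lambda,\mu}$ is significantly large. Specifically, for $\alpha$ in $(0, 1)$, we consider the statistical test which rejects $(H_0)$ if $\widehat{\mathrm{HSIC}}_{\lambda,\mu} > q_{1-\alpha}^{\lambda,\mu}$, where $q_{1-\alpha}^{\lambda,\mu}$ denotes the $(1-\alpha)$-quantile of $\widehat{\mathrm{HSIC}}_{\lambda,\mu}$ under $P_{f_1 \otimes f_2}$. The associated test function is defined by
$$\Delta_\alpha^{\lambda,\mu} = \mathbb{1}_{\widehat{\mathrm{HSIC}}_{\lambda,\mu} > q_{1-\alpha}^{\lambda,\mu}}.$$
Then, the null hypothesis is rejected if and only if $\Delta_\alpha^{\lambda,\mu} = 1$. By definition of the quantile, this theoretical test is of non-asymptotic level $\alpha$, that is, if $f = f_1 \otimes f_2$,
$$P_f\left(\Delta_\alpha^{\lambda,\mu} = 1\right) \le \alpha.$$
Note that the non-asymptotic test is defined here using quantiles, as in [Albert et al., 2015], rather than p-values.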
A permutation test of independence.
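The analytical computation of the quantile $q_{1-\alpha}^{\lambda,\mu}$ is not possible since its value depends on the unknown marginals $f_1$ and $f_2$ of the couple $(X, Y)$. In practice, a permutation method with a Monte Carlo approximation is applied to approximate $q_{1-\alpha}^{\lambda,\mu}$ as follows. Denote by $\mathbb{Z}_n = (X_i, Y_i)_{1 \le i \le n}$ the original sample and compute the test statistic $\widehat{\mathrm{HSIC}}_{\lambda,\mu}$ defined by Equation (5). Then, consider $B$ independent and uniformly distributed random permutations of $\{1, \ldots, n\}$, denoted $\tau_1, \ldots, \tau_B$, independent of $\mathbb{Z}_n$. Define for each permutation $\tau_b$ the corresponding permuted sample $\mathbb{Z}_n^{\tau_b} = (X_i, Y_{\tau_b(i)})_{1 \le i \le n}$ and compute the permuted test statistic $\widehat{H}_{\lambda,\mu}^{\star b}$ on this new sample.
Under $(H_0)$, each permuted sample $\mathbb{Z}_n^{\tau_b}$ has the same distribution as the original sample $\mathbb{Z}_n$. Hence, the random variables $\widehat{H}_{\lambda,\mu}^{\star b}$, $1 \le b \le B$, have the same distribution as $\widehat{\mathrm{HSIC}}_{\lambda,\mu}$. We apply a trick, based on [Romano and Wolf, 2005, Lemma 1], which consists in adding the original sample to the Monte Carlo sample in order to obtain a test of non-asymptotic level $\alpha$. To do so, denote $\widehat{H}_{\lambda,\mu}^{\star B+1} = \widehat{\mathrm{HSIC}}_{\lambda,\mu}$ and $\widehat{H}_{\lambda,\mu}^{\star (1)} \le \widehat{H}_{\lambda,\mu}^{\star (2)} \le \ldots \le \widehat{H}_{\lambda,\mu}^{\star (B+1)}$ the order statistic. Then, the permuted quantile with Monte Carlo approximation is defined as
$$\hat{q}_{1-\alpha}^{\lambda,\mu} = \widehat{H}_{\lambda,\mu}^{\star (\lceil (B+1)(1-\alpha) \rceil)}.$$
The permuted test with Monte Carlo approximation performed in practice is then defined as
$$\hat{\Delta}_\alpha^{\lambda,\mu} = \mathbb{1}_{\widehat{\mathrm{HSIC}}_{\lambda,\mu} > \hat{q}_{1-\alpha}^{\lambda,\mu}}. \qquad (9)$$
Let $\alpha$ be in $(0, 1)$ and $\hat{\Delta}_\alpha^{\lambda,\mu}$ be the test defined by Equation (9). Then, under $(H_0)$, that is if $f = f_1 \otimes f_2$,
$$P_{f_1 \otimes f_2}\left(\hat{\Delta}_\alpha^{\lambda,\mu} = 1\right) \le \alpha,$$
that is, this permuted test with Monte Carlo approximation is of prescribed non-asymptotic level $\alpha$.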
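The permutation procedure described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the threshold is taken to be the order statistic of rank $\lceil (B+1)(1-\alpha) \rceil$ among the $B$ permuted statistics and the original one, and, to keep the block short, the statistic used in the example is a biased (V-statistic) HSIC with a single-bandwidth Gaussian kernel instead of the $U$-statistic of Equation (5). The names `permutation_test`, `gaussian_gram` and `hsic_vstat` are hypothetical.

```python
import math
import numpy as np

def gaussian_gram(z, bandwidth):
    """Unnormalised Gaussian Gram matrix (stand-in kernel for the example)."""
    z = np.asarray(z, dtype=float).reshape(len(z), -1)
    sq = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq / bandwidth ** 2)

def hsic_vstat(x, y, bandwidth=1.0):
    """Biased HSIC estimator trace(K H L H) / n^2 (stand-in statistic)."""
    n = len(x)
    h = np.eye(n) - np.ones((n, n)) / n
    k = gaussian_gram(x, bandwidth)
    l = gaussian_gram(y, bandwidth)
    return np.trace(k @ h @ l @ h) / n ** 2

def permutation_test(x, y, statistic, alpha=0.05, n_perm=200, seed=None):
    """Permutation test with Monte Carlo approximation.

    Returns (reject, observed statistic, permuted quantile)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    y = np.asarray(y)
    n = x.shape[0]
    observed = statistic(x, y)
    values = np.empty(n_perm + 1)
    for b in range(n_perm):
        # Permute the second coordinate only, which breaks any dependence.
        values[b] = statistic(x, y[rng.permutation(n)])
    values[n_perm] = observed        # add the original sample to the pool
    values.sort()
    # 1-based rank ceil((B + 1)(1 - alpha)); small guard against rounding.
    rank = math.ceil((n_perm + 1) * (1.0 - alpha) - 1e-12)
    rank = min(max(rank, 1), n_perm + 1)
    threshold = values[rank - 1]
    return bool(observed > threshold), observed, threshold

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y = x ** 2 + 0.1 * rng.normal(size=(100, 1))
print(permutation_test(x, y, hsic_vstat, alpha=0.05, n_perm=200, seed=1))
```

Including the observed statistic in the pool before taking the order statistic is what makes the level guarantee hold for any finite number of permutations, at the price of a slightly conservative test.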
2.2 Control of the second kind error in terms of HSIC
For given $\beta$ in $(0, 1)$, we propose in the following lemma a first non-asymptotic condition on the alternative $f$ ensuring that the probability of second kind error of the theoretical test under such $f$ is at most $\beta$. This condition is given for the value of $\mathrm{HSIC}_{\lambda,\mu}(f)$. It involves the variance of the estimator $\widehat{\mathrm{HSIC}}_{\lambda,\mu}$, which is finite since this estimator is a bounded random variable.
Lemma 1 gives a threshold for $\mathrm{HSIC}_{\lambda,\mu}(f)$ from which the dependence between $X$ and $Y$ is detectable with probability at least $1 - \beta$ using given Gaussian kernels $k_\lambda$ and $l_\mu$. Furthermore, it would be useful to give more explicit conditions w.r.t. the bandwidths $\lambda$ and $\mu$ and the sample size $n$. The objective of this section is to provide a condition w.r.t. $\lambda$, $\mu$ and $n$ on the theoretical value $\mathrm{HSIC}_{\lambda,\mu}(f)$, so that the test has a second kind error controlled by an arbitrarily small $\beta$. For this, we already give in Lemma 1 a condition w.r.t. the variance of $\widehat{\mathrm{HSIC}}_{\lambda,\mu}$ and the quantile $q_{1-\alpha}^{\lambda,\mu}$. It is therefore necessary to provide sharp upper bounds for these two quantities w.r.t. $\lambda$, $\mu$ and $n$. Propositions 2 and 3 give these upper bounds.
Let be an i.i.d. sample with distribution and consider the test statistic defined by (5). Assume that the densities , and are bounded. Then,
Combining Lemma 1 with Propositions 2 and 3, we can then give a sufficient condition on $\mathrm{HSIC}_{\lambda,\mu}(f)$, depending on the parameters $\lambda$, $\mu$ and the sample size $n$, in order to control the second kind error by $\beta$. This result is presented in the following corollary.
Note that the right-hand term given in Corollary 1 cannot be computed in practice since it depends on the unknown density $f$. However, this dependence is weak, since it only involves the infinity norms of $f$ and its marginals.
For given $\beta$, Corollary 1 provides conditions on the value of $\mathrm{HSIC}_{\lambda,\mu}(f)$ ensuring that the probability of second kind error of the theoretical test under such $f$ is at most $\beta$. We now want to express such conditions in terms of the $L_2$-norm of the function $f - f_1 \otimes f_2$, for the sake of interpretation, and in order to be able to determine separation rates with respect to this $L_2$-norm for our test.
2.3 Control of the second kind error in terms of $L_2$-norm
In order to express a condition on the $L_2$-norm of the function $f - f_1 \otimes f_2$ ensuring a probability of second kind error controlled by $\beta$, we first give in Lemma 2 a link between $\mathrm{HSIC}_{\lambda,\mu}(f)$ and $f - f_1 \otimes f_2$.
The following proposition gives a sufficient condition on , for the test to be -powerful.
Let be an i.i.d. sample with distribution and consider the test statistic defined by (5). Denote . Let , in , and be the -quantile of under as defined in Section 2.1. Assume that the densities , and are bounded, and that
One has as soon as
where , and denotes a positive constant depending only on its arguments.
In the condition given in Theorem 1, appears a compromise between a bias term and a term induced by the square-root of the variance of the estimator . Comparing the conditions on the HSIC given in Corollary 1 and on given in Theorem 1, the meticulous reader may notice that the term in has been removed. This suppression seems to be necessary to obtain optimal separation rates according to the literature in other testing frameworks. This derives from quite tricky computations that we point out here. By combining Lemmas 1 and 2, direct computations lead to the condition
If one directly considers the upper bound of the variance given in Proposition 2, one would get the unwanted term. The idea is to take advantage of the negative term to compensate such term. To do so, we need a more refined control of the variance given in the following technical proposition.
Let be an i.i.d. sample with distribution and consider the test statistic defined by (5). Assume that the densities , and are bounded. Then,
2.4 Uniform separation rate
The bias term in Theorem 1 comes from the fact that we do not estimate but . In order to control the bias term w.r.t. $\lambda$ and $\mu$, we assume that $f - f_1 \otimes f_2$ belongs to some class of regular functions. We introduce the following two classes: Sobolev balls (isotropic case) and Nikol’skii-Besov balls (anisotropic case).
2.4.1 Case of Sobolev balls
For , and , the Sobolev ball is the set defined by
denotes the Fourier transform ofdefined by , denotes the usual scalar product in and the Euclidean norm in .
The following proposition gives an upper bound for the bias term in the case where $f - f_1 \otimes f_2$ belongs to particular Sobolev balls.
One can now determine optimal bandwidths in order to minimize the right-hand side of Equation (13). To do so, the idea is to find the bandwidths for which both terms in the right-hand side of (13) are of the same order w.r.t. $n$. We also provide an upper bound for the uniform separation rate of the optimized test over Sobolev balls.
Consider the assumptions of Theorem 2, and define for all in and for all in ,
The uniform separation rate of the test over the Sobolev ball is controlled as follows
Note that, in the definition of the Sobolev ball, we have the same regularity parameter $\delta$ for all the directions in $\mathbb{R}^{p+q}$. This corresponds to isotropic regularity conditions. We now introduce other classes of functions that take possible anisotropic regularity properties into account.
2.4.2 Case of Nikol’skii-Besov balls
For , and , we consider the anisotropic Nikol’skii-Besov ball defined by
where denotes the floor function of if is not integer and if is an integer. We give in the following proposition an upper bound of the bias term, similar to that of Lemma 3, in the case when belongs to particular Nikol’skii-Besov balls.
We assume that , where . Then, we have the following inequality,
Let and consider the same notation and assumptions as in Theorem 1. Let and