Detection of Sparse Positive Dependence

By Ery Arias-Castro, et al. · 11/17/2018

In a bivariate setting, we consider the problem of detecting a sparse contamination or mixture component, where the effect manifests itself as a positive dependence between the variables, which are otherwise independent in the main component. We first look at this problem in the context of a normal mixture model. In essence, the situation reduces to a univariate setting where the effect is a decrease in variance. In particular, a higher criticism test based on the pairwise differences is shown to achieve the detection boundary defined by the (oracle) likelihood ratio test. We then turn to a Gaussian copula model where the marginal distributions are unknown. Standard invariance considerations lead us to consider rank tests. In fact, a higher criticism test based on the pairwise rank differences achieves the detection boundary in the normal mixture model, although not in the very sparse regime. We do not know of any rank test that has any power in that regime.


1 Introduction

The detection of rare effects has long been an important problem in a variety of settings, and may be particularly relevant today, for example, in the search for personalized care in the health industry, where a small fraction of a population may respond particularly well, or particularly poorly, to a given treatment [18].

Following a theoretical investigation initiated in large part by Ingster [14] and broadened by Donoho and Jin [8], we are interested in studying two-component mixture models, also known as contamination models, in various asymptotic regimes defined by how the small mixture weight converges to zero. Most of the existing work in the setting of univariate data has focused on models where the contamination manifests itself as a shift in mean [10, 9, 12, 6, 17] with a few exceptions where the effect is a change in variance [2], or a change in both mean and variance [7].

In the present paper, we are interested instead in bivariate data, and more specifically in a situation where the effect is felt in the dependence between the two variables being measured. This setting has recently been considered in the literature in the context of assessing the reproducibility of studies. For example, Li et al. [16] aimed to identify significant features from separate studies using an expectation-maximization (EM) algorithm. They applied a copula mixture model and assumed that changes in the mean and covariance matrix differentiate the contaminated component from the null component.

Zhao et al. [21] studied another model where variables from the contamination are stochastically larger marginally. In both models, the marginal distributions have some non-null effects. Similar settings have been considered within a multiple testing framework [5, 20].

While existing work has focused on models motivated by questions of reproducibility, in the present work we come back to basics and directly address the problem of detecting a bivariate mixture with a component where the variables are independent and a component where the variables are positively dependent.

1.1 Gaussian Mixture Model

Ingster [14] and Donoho and Jin [8] started with a mixture of Gaussians, and we do the same; in our setting, this means we consider the following mixture model

(1)

where is the contamination proportion and is the correlation between the two variables under contamination. We consider the following hypothesis testing problem: based on drawn iid from (1), decide

(2)

Note that, under the null hypothesis, the observation is drawn from the bivariate standard normal distribution. Under the alternative, the two coordinates remain standard normal marginally. Following the literature on the detection of sparse mixtures [14, 8], we are most interested in the asymptotic situation where the contamination proportion tends to zero, and the central question is how large the correlation needs to be in order to reliably distinguish these hypotheses.

The formulation (1) suggests that the alternative hypothesis is composite, but if we assume that the model parameters are known under the alternative, then the likelihood ratio test (LRT) is optimal by the Neyman-Pearson lemma. We start by characterizing the behavior of the LRT, which provides a benchmark. We then study some other testing procedures that do not require knowledge of the model parameters (such procedures are said to be adaptive):

  • The covariance test rejects for large values of , and coincides with Rao’s score test in the present context. This is the classical test for independence, specifically designed for the case where and under the alternative. We shall see that it is suboptimal in some regimes.

  • The extremes test rejects for small values of . This test exploits the fact that, because is assumed positive, the variables in the contaminated component are closer to each other than in the null component.

  • The higher criticism test was suggested by John Tukey and deployed by Donoho and Jin [8] for the testing of sparse mixtures. We propose a version of that test based on the pairwise differences (a code sketch of this statistic is given after this list). In detail, the test rejects for large values of

    (3)

    where the expression involves the standard normal distribution function and the empirical distribution function of the pairwise differences.
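To make this construction concrete, the following is a minimal sketch (in Python, not taken from the paper) of one plausible form of the statistic in (3). It assumes the pairwise differences are rescaled by the square root of 2 so that they are standard normal under the null, and that the scan is over a grid of thresholds; both the grid and the exact standardization are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def higher_criticism_diff(x, y, grid_size=200):
    """Hedged sketch of a higher-criticism statistic computed from the
    pairwise differences d_i = (x_i - y_i) / sqrt(2), which are standard
    normal under the null of independent standard normal coordinates.

    The statistic scans a grid of thresholds t and standardizes the
    deviation of the empirical CDF of the differences from the standard
    normal CDF. The grid and standardization are illustrative choices,
    not necessarily those used in the paper.
    """
    d = (np.asarray(x) - np.asarray(y)) / np.sqrt(2.0)
    n = len(d)
    t = np.linspace(-3.0, 3.0, grid_size)                     # threshold grid (assumption)
    F_hat = np.searchsorted(np.sort(d), t, side="right") / n  # empirical CDF at t
    F0 = norm.cdf(t)                                          # null CDF
    denom = np.sqrt(F0 * (1.0 - F0))
    hc = np.sqrt(n) * np.abs(F_hat - F0) / denom
    return hc.max()

# Example: large values of the statistic are evidence against independence.
rng = np.random.default_rng(0)
x0, y0 = rng.standard_normal(10000), rng.standard_normal(10000)
print(higher_criticism_diff(x0, y0))
```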

As is common practice in this line of work [14, 8], under we set

(4)

The setting where is often called the dense regime and the setting where is often called the sparse regime. Our analysis reveals the following:

  1. Dense regime. The dense regime is most interesting when . In that case, we find that the covariance test and the higher criticism test match the asymptotic performance of the likelihood ratio test to first-order, while the extremes test has no power.

  2. Sparse regime. The sparse regime is most interesting when . In that case, we find that the higher criticism test still performs as well as the likelihood ratio test to first order, while the covariance test is powerless, and the extremes test is suboptimal.

1.2 Gaussian Mixture Copula Model

From a practical point of view, the assumption that both and are normally distributed is quite stringent. Hence, we would like to know if there are nonparametric procedures that do not require such a condition but can still achieve the same performance as the likelihood ratio test. In the univariate setting where the effect arises as a shift in mean, this was investigated in [3]. In the bivariate setting, in a model for reproducibility, Zhao et al. [21] proposed a nonparametric test based on a weighted version of Hoeffding’s test for independence.

Here, instead of model (1), we suppose the observed pair follows a Gaussian mixture copula model (GMCM) [4], meaning that there is a latent random vector such that

(5)

where and are unknown distribution functions on the real line, and is the standard normal distribution function, while is the contamination proportion and is the correlation between and in the contaminated component, as before in model (1). Li et al. [16] also used a copula mixture model, but they placed emphasis on the mean while we focus on the dependence.

We still consider the testing problem (2), but now in the context of Model (5). The setting is nonparametric in that both and are unknown. Model (5) is crafted in such a way that the marginal distributions of and contain absolutely no information that is pertinent to the testing problem under consideration. Figure 1 provides an illustration.

Figure 1: A scatterplot of data generated from a Gaussian copula mixture model of the form (5). Specifically, , , , and .
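For illustration, the following sketch (not from the paper) generates data of this general form: a latent sample is drawn from the Gaussian mixture, and strictly increasing marginal transformations are then applied. The particular marginal transforms, sample size, and parameter values used here are arbitrary choices made only to produce a picture of the kind shown in Figure 1.

```python
import numpy as np
from scipy.stats import norm, expon

def sample_gmcm(n, eps, rho, rng=None):
    """Sketch: draw n pairs from a Gaussian-copula mixture of the general
    form (5). A fraction eps of latent pairs has correlation rho, the rest
    are independent; arbitrary increasing marginal transforms are applied.
    """
    rng = np.random.default_rng(rng)
    contaminated = rng.random(n) < eps
    z1 = rng.standard_normal(n)
    z2 = rng.standard_normal(n)
    # latent pair: independent in the main component, correlation rho otherwise
    x_lat = z1
    y_lat = np.where(contaminated, rho * z1 + np.sqrt(1 - rho**2) * z2, z2)
    # arbitrary strictly increasing marginal transforms (illustrative only)
    x = expon.ppf(norm.cdf(x_lat), scale=1.0)
    y = expon.ppf(norm.cdf(y_lat), scale=2.0)
    return x, y

x, y = sample_gmcm(10000, eps=0.05, rho=0.9, rng=0)
```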

The model is also attractive because of its invariance under all increasing marginal transformations of the variables. This is the same invariance that leads to considering rank-based methods such as the Spearman correlation test [15, Ch 6]. In fact, we analyze the Spearman correlation test, which is the nonparametric analog of the covariance test, showing that it is first-order asymptotically optimal in the dense regime. We also propose and analyze a nonparametric version of the higher criticism based on ranks, which we show is first-order asymptotically optimal in the moderately sparse regime where . In the very sparse regime, where , we do not know of any rank-based test that has any power.

2 Gaussian Mixture Model

In this section, we focus on the Gaussian mixture model (1). We start by deriving a lower bound on the performance of the likelihood ratio test, which provides a benchmark for the other (adaptive) tests, which we subsequently analyze.

We distinguish between the dense and sparse regimes:

dense regime (6)
sparse regime (7)

We say that a testing procedure is asymptotically powerful (resp. powerless) if the sum of its probabilities of Type I and Type II errors (its risk) has limit 0 (resp. limit inferior at least 1) in the large-sample limit.

2.1 The likelihood ratio test

Theorem 1.

Consider the testing problem (2) with parameterized as in (4). In the dense regime, with parameterized as in (6), the likelihood ratio test is asymptotically powerless when . In the sparse regime, with parameterized as in (7), the likelihood ratio test is asymptotically powerless when .

This only provides a lower bound on what can be achieved, but it will turn out to be sharp once we establish the performance of the higher criticism test in Proposition 2 below.

Proof.

The proof techniques are standard and already present in [10, 14], and in many of the subsequent works.

Defining and , the model (1) is equivalently expressed in terms of , which has distribution

(8)

Note that and are independent only conditional on knowing what distribution they were sampled from. In terms of the ’s, the likelihood ratio is

(9)

where is the likelihood ratio for observation , which in the present case takes the following expression

(10)
(11)

The risk of the likelihood ratio test is equal to

(12)

We show that under each of the stated conditions. We consider each regime in turn.

Dense regime.

It turns out that it suffices to bound the second moment. Indeed, using the Cauchy-Schwarz inequality, we have

(13)

reducing the task to showing that . We have

(14)

where

(15)
(16)
(17)

For the third term, we have

(18)

and

(19)

Hence, we have

(20)

and, therefore,

(21)

so that when

(22)

since is assumed to be bounded away from 1. Under the specified parameterization, this happens exactly when .

Sparse regime. It turns out that simply bounding the second moment, as we did above, does not suffice. Instead, we truncate the likelihood and study the behavior of its first two moments. Define the indicator variable and the corresponding truncated likelihood ratio

(23)

Using the triangle inequality, the fact that , and the Cauchy-Schwarz inequality, we have the following upper bound:

(24)
(25)

so that when and .

For the first moment, we have

(26)

where, using the independence of and , and taking the expectation with respect to first,

(27)
(28)
(29)
(30)

where, for ,

(31)

and we used the fact that when . Since with in the sparse regime, for sufficiently close to 1, , in which case . This yields

(32)

For the second moment, we have

(33)

where

(34)
(35)
(36)

The sum of the first two terms is bounded from above by . For the third term, we have

(37)

and

(38)

using the fact that . Hence,

(39)
(40)

when is sufficiently close to 1. This in turn yields the following bound

(41)

so that when

(42)

Under the specified parameterization, this happens exactly when . ∎

In the dense regime, with parameterized as in (6), we say that a test achieves the detection boundary if it is asymptotically powerful when , and in the sparse regime, with parameterized as in (7), we say that a test achieves the detection boundary if it is asymptotically powerful when .

2.2 The covariance test

Recall that the covariance test rejects for large values of , calibrated under the null where are iid standard normal.
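As a concrete illustration, here is a minimal sketch of this test (not the authors' code), under the assumption stated above that the null marginals are standard normal, so that the statistic has mean 0 and variance n under the null and can be calibrated against the normal limit.

```python
import numpy as np
from scipy.stats import norm

def covariance_test_pvalue(x, y):
    """Sketch of the covariance test: reject for large values of sum(x*y).
    Under the null, x and y are iid standard normal, so sum(x*y) has mean 0
    and variance n, and is asymptotically normal; the p-value below uses
    this normal approximation.
    """
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    stat = np.sum(x * y) / np.sqrt(n)
    return norm.sf(stat)  # one-sided: positive dependence inflates the sum
```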

Proposition 1.

For the testing problem (2), the covariance test achieves the detection boundary in the dense regime, while it is asymptotically powerless in the sparse regime.

Proof.

We divide the proof into the two regimes.

Dense regime. Under , we have

(43)
(44)

so that, by Chebyshev’s inequality,

(45)

for any sequence diverging to infinity.

Under , we have

(46)
(47)

so that, by Chebyshev’s inequality,

(48)

Thus the test with rejection region is asymptotically powerful when

(49)

If we choose , for example, and is parameterized as in (6), this happens for large enough when .

Sparse regime. To prove that the covariance test is asymptotically powerless when , we show that, under , converges to the same limiting distribution as under .

Under the null hypothesis, by the central limit theorem,

(50)

Under the alternative, the distribution of the ’s (which remain iid) depends on the model parameters, but the conditions for applying Lyapunov’s central limit theorem are satisfied since

(51)

with and

(52)

where and the inequality is Cauchy-Schwarz’s, while

(53)

so that the test statistic still converges weakly to a normal distribution,

(54)

In the present regime, we have

(55)

so that and , and thus we conclude by Slutsky’s theorem that . ∎

Remark 1.

There are good reasons to consider the covariance test in this specific form since the means and variances are known. It is worth pointing out that the Pearson correlation test, which is more standard in practice since it does not require knowledge of the means or variances, has the same asymptotic power properties.

2.3 The higher criticism test and the extremes test

Define , and note that

(56)

Seen through the ’s, the problem becomes that of detecting a sparse contamination where the effect is in the variance. We recently studied this problem in detail [2], extending previous work by Cai et al. [7], who considered a setting where the effect is both in the mean and variance. Borrowing from our prior work, we consider a higher criticism test, already defined in (3), and an extremes test, which rejects for small values of .
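For the record, the elementary computation behind this reduction is the following, written under the assumption (our notation, since the displayed definition is not reproduced above) that the differences are rescaled by the square root of 2, with epsilon the contamination proportion and rho the correlation in the contaminated component.

```latex
% Sketch of the reduction, assuming D_i = (X_i - Y_i)/\sqrt{2} (our notation).
\[
  \operatorname{Var}(X_i - Y_i)
  = \operatorname{Var}(X_i) + \operatorname{Var}(Y_i) - 2\operatorname{Cov}(X_i, Y_i)
  = \begin{cases}
      2 & \text{(main component, independent coordinates)},\\
      2(1-\rho) & \text{(contaminated component)},
    \end{cases}
\]
% so that, marginally,
\[
  D_i \sim (1-\varepsilon)\,\mathcal{N}(0, 1) + \varepsilon\,\mathcal{N}(0, 1-\rho),
\]
% a sparse contamination in which the effect is a decrease in variance, since rho > 0.
```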

Proposition 2.

For the testing problem (2), the higher criticism test achieves the detection boundary in the dense and sparse regimes.

Proof.

Set , which is the variance of the contaminated component. In our prior work [2, Prop 3], we showed that the higher criticism test as defined in (3) is asymptotically powerful when

  1. with fixed such that ;

  2. with fixed such that .

This can be directly translated into the present setting, yielding the stated result. ∎

Proposition 3.

For the testing problem (2), the extremes test is asymptotically powerless when is bounded away from 1, while when parameterized as in (4) and parameterized as in (7), it is asymptotically powerful when , and asymptotically powerless when .

Proof.

This is also a direct corollary of our prior work [2, Prop 2]. ∎

Thus the extremes test is grossly suboptimal in the dense regime, while it is suboptimal in the sparse regime due to the fact that .

Remark 2.

The higher criticism and extremes tests are both based on the pairwise differences. This was convenient as it reduced the problem of testing for independence to the problem of testing for a change in variance (both in a contamination model). However, reducing the original data to the pairwise differences implies a loss of information. Indeed, a lossless reduction would be to the pairs of rescaled sums and differences, with joint distribution given in (8). It just turns out that ignoring the sums does not lead to any loss in first-order asymptotic power.

2.4 Numerical experiments

We performed some numerical experiments to investigate the finite-sample performance of the tests considered here: the likelihood ratio test, the Pearson correlation test (used instead of the covariance test, as is more practical), the extremes test, the higher criticism test, and also a plug-in version of the higher criticism test where the parameters of the bivariate normal distribution (the two means and two variances) are estimated under the null. The sample size is set large in order to capture the large-sample behavior of these tests. We tried four sparsity levels, setting . The p-values for each test are computed as follows:

  1. For the likelihood ratio test, the p-values are estimated based on permutations.

  2. For the higher criticism test and the plug-in higher criticism test, the p-values are estimated based on 200 permutations.

  3. For the extremes test, we used the exact null distribution, which is available in a closed form.

  4. For the Pearson correlation test, the p-values are from the limiting distribution.

For each scenario, we repeated the process 200 times and calculated the fraction of p-values smaller than 0.05, representing the empirical power at the 0.05 level.
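The permutation calibration used for the likelihood ratio and higher criticism p-values can be sketched as follows (an illustrative helper, not the code used for the experiments); permuting one coordinate against the other enforces the null of independence while preserving both marginal samples.

```python
import numpy as np

def permutation_pvalue(stat_fn, x, y, n_perm=200, rng=None):
    """Monte Carlo permutation p-value for a test that rejects for large
    values of stat_fn(x, y). Permuting y against x simulates the null of
    independence while keeping both marginal samples fixed.
    """
    rng = np.random.default_rng(rng)
    observed = stat_fn(x, y)
    null_stats = np.array([stat_fn(x, rng.permutation(y)) for _ in range(n_perm)])
    # add-one correction so the p-value is never exactly zero
    return (1 + np.sum(null_stats >= observed)) / (n_perm + 1)
```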

The results of this experiment are reported in Figure 2 and are broadly consistent with the theory developed earlier in this section. Although the higher criticism test is comparable to the likelihood ratio test to first order in the dense regime, its power there is much lower, even with a large sample. The Pearson correlation test does better in that regime. The plug-in higher criticism test has a performance similar to that of the higher criticism test in the dense regime, while it loses some power in the moderately sparse regime, and is powerless in the very sparse regime.

Figure 2: Empirical power comparison with 95% error bars for the likelihood ratio test (black), the Pearson correlation test (green), the extremes test (blue), the higher criticism test (red, solid) and the plug-in higher criticism test (red, dashed). (a) Dense regime where . (b) Dense regime where . (c) Sparse regime where and . (d) Sparse regime where and . The horizontal line marks the level (set at 0.05) and the vertical line marks the asymptotic detection boundary derived earlier. The sample size is and the power curves and error bars are based on 200 replications.

3 Gaussian Mixture Copula Model

In this section we turn to the Gaussian mixture copula model introduced in (5). The setting is thus nonparametric, since the marginal distributions are completely unknown, and standard invariance considerations [15, Ch 6] lead us to consider test procedures that are based on the ranks. For this, we let denote the rank of among , and similarly, we let denote the rank of among . (The ranks are in increasing order, say.)

Although not strictly necessary, we will assume that and in (5) are strictly increasing and continuous. In that case, the ranks are invariant with respect to transformations of the form with and strictly increasing on the real line. In particular, for the rank tests that follow, this allows us to reduce their analysis under (5) to their analysis under (1).
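This invariance is easy to check numerically; the snippet below (illustrative only) verifies that strictly increasing transformations of each coordinate leave the ranks, and hence any rank statistic, unchanged.

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(1)
x, y = rng.standard_normal(1000), rng.standard_normal(1000)

# strictly increasing transformations of each coordinate
fx, gy = np.exp(x), y**3 + 5 * y

assert np.array_equal(rankdata(x), rankdata(fx))
assert np.array_equal(rankdata(y), rankdata(gy))
```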

3.1 The covariance rank test

The covariance rank test is the analog of the covariance test of Section 2.2. It rejects for large values of (redefined). As is well-known, this is equivalent to rejecting for large values of the Spearman rank correlation.
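A minimal sketch of this test is given below; it relies on scipy's Spearman correlation as one convenient implementation, which is not necessarily the normalization used in the proofs.

```python
from scipy.stats import spearmanr

def covariance_rank_test(x, y):
    """Sketch of the covariance rank test. Rejecting for large values of
    sum(R_i * S_i), where R and S are the ranks of x and y, is equivalent
    to rejecting for large values of the Spearman rank correlation, so the
    one-sided Spearman test can be used directly.
    """
    rho, p_two_sided = spearmanr(x, y)
    # one-sided p-value against positive dependence
    return rho, p_two_sided / 2 if rho > 0 else 1 - p_two_sided / 2
```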

Proposition 4.

For the testing problem (2) under the model (5), the covariance rank test achieves the detection boundary in the dense regime.

We anticipate that the covariance rank test is asymptotically powerless in the sparse regime, although we do not formally prove that.

Proof.

We start by considering the null hypothesis . From [11, Eq 3.11-3.12, Ch 11], we have

(57)
(58)

so that, using Chebyshev’s inequality,

(59)

for any sequence diverging to infinity.

We now turn to the alternative hypothesis . For convenience, we assume that the ranks run from to . This does not change the test procedure since , but makes the derivations somewhat less cumbersome. In particular, we have

(60)
(61)

so that

(62)

For the expectation, we have

(63)
(64)

The expectation is with respect to independent, with drawn from the mixture (1), and and standard normal. Let and , so that . We note that is bivariate normal with standard marginals. Moreover, when comes from the main component, and are uncorrelated, and therefore independent; while when comes from the contaminated component, and have correlation . Therefore,

(65)

where under . We immediately have , and in general (this identity is well known and not hard to prove, see https://math.stackexchange.com/questions/255368/getting-px0-y0-for-a-bivariate-distribution; it also appears, for example, in [19, Lem 1]),

(66)
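For reference, the standard orthant-probability formula for a bivariate normal pair with standard marginals and correlation r, which is presumably what the display above expresses, reads as follows (a reminder, not a reproduction of the paper's equation).

```latex
% Standard bivariate normal orthant probability (Sheppard's formula),
% for (Z_1, Z_2) with standard normal marginals and correlation r.
\[
  \mathbb{P}\bigl(Z_1 \le 0,\; Z_2 \le 0\bigr)
  \;=\; \frac{1}{4} \;+\; \frac{\arcsin r}{2\pi},
\]
% which reduces to 1/4 when r = 0, consistent with the independent case
% mentioned in the text.
```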

We conclude that, as and ,

(67)
(68)

For the variance, we start with the second moment

(69)
(70)

which then implies that

(71)
(72)

the same bound we had for . Thus, by Chebyshev’s inequality, we have

(73)

for any sequence diverging to infinity.

We consider the test with rejection region . Our analysis implies that this test is asymptotically powerful when

(74)

If we choose , for example, and is parameterized as in (6), this happens for large enough when . ∎

3.2 The higher criticism rank test

The analog of the higher criticism test of (3) is a higher criticism based on the pairwise differences in ranks. To be specific, we define

(75)

where is the probability , which can be expressed in closed form as

(76)

Note that in this definition the denominator is only an approximation to the standard deviation of the numerator. The standard deviation has a closed-form expression, known since the work of Hoeffding [13, Th 2], but it is cumbersome and relatively costly to compute (although its computation is only done once for each ). Also, there is a fair amount of flexibility in the choice of the range of thresholds considered. This particular choice seems to work well enough. Like any other rank test, it is calibrated by permutation (or Monte Carlo if there are no ties in the data).
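A minimal sketch of this statistic is given below. Since the closed-form expression for the null probability in (76) is not reproduced above, the sketch substitutes the probability that two independent uniform ranks differ by at most the threshold; the threshold grid is likewise an illustrative choice.

```python
import numpy as np
from scipy.stats import rankdata

def higher_criticism_rank(x, y):
    """Sketch of a higher-criticism statistic based on the pairwise rank
    differences |R_i - S_i|. For each integer threshold t, the observed
    fraction of rank differences at most t is compared with p_t, taken
    here to be P(|U - V| <= t) for U, V independent uniform on {1,...,n}
    (an assumption standing in for the closed-form expression in the
    paper), and standardized as in a binomial approximation.
    """
    n = len(x)
    r, s = rankdata(x), rankdata(y)
    diffs = np.abs(r - s)

    t = np.arange(0, n - 1)                            # threshold grid (illustrative)
    p_t = (n * (2 * t + 1) - t * (t + 1)) / n**2       # P(|U - V| <= t), U, V iid uniform
    frac = np.array([np.mean(diffs <= ti) for ti in t])

    keep = (p_t > 0) & (p_t < 1)
    hc = np.sqrt(n) * (frac[keep] - p_t[keep]) / np.sqrt(p_t[keep] * (1 - p_t[keep]))
    return hc.max()

# In practice the test would be calibrated by permutation, for example with
# the permutation_pvalue helper sketched in Section 2.4.
```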

Theorem 2.

For the testing problem (2) under the model (5), the higher criticism rank test achieves the detection boundary in the moderately sparse regime.

Proof.

We start with the situation under the null hypothesis , where we show that is of order at most based on the concentration inequality for randomly permuted sums. Fixing critical value , define

(77)

Since the two rank vectors are independent, as we are under the null, the quantity just defined has the same distribution as its analog computed from a uniformly distributed random permutation. Note that

(78)

By [1, Cor 2.1], there is a universal constant such that, for any ,

(79)
(80)

using the fact that for all . This implies that, for ,

(81)
(82)

for some other constant , using the fact that when , which is the range of ’s we are considering. Hence, choosing and using the union bound, we have

(83)
(84)

We now consider the alternative , and show that in probability under the stated condition. Let

(85)

Since , it suffices to find some in that range such that in probability. For that, define the empirical distributions and . Note that, by definition, and