Detection of Sparse Mixtures: Higher Criticism and Scan Statistic

We consider the problem of detecting a sparse mixture as studied by Ingster (1997) and Donoho and Jin (2004). We consider a wide array of base distributions. In particular, we study the situation when the base distribution has polynomial tails, a situation that has not received much attention in the literature. Perhaps surprisingly, we find that in the context of such a power-law distribution, the higher criticism does not achieve the detection boundary. However, the scan statistic does.


1 Introduction

We consider the problem of detecting a sparse mixture. A simple variant of the problem can be formulated as follows. Let F be a continuous distribution function on the real line, and let ε ∈ (0,1) and μ > 0. The problem is to test

 H_{n,0} : X_1, …, X_n iid∼ F, (1)

versus

 H_{n,1} : X_1, …, X_n iid∼ (1−ε)F(⋅) + εF(⋅−μ). (2)

Mixture models such as in (2) have been considered for quite some time, particularly in the context of robust statistics, where they are known as contamination models [9, Eq 1.22].

Our contribution is rather in line with the testing problems studied by Ingster [10] in the context of the normal sequence model, where F above corresponds to the standard normal distribution. In that setting, Ingster considered the following parameterization

 ε = ε_n = n^{−β},  μ = μ_n = √(2r log n), (3)

for some β ∈ (1/2, 1) and r > 0. The advantage of this parameterization is that, holding β and r fixed, the situation admits a relatively simple description. Indeed, since both the null and the alternative hypotheses are simple, by the Neyman–Pearson Lemma, the likelihood ratio test (set at level α) is most powerful. Ingster studied the large-sample behavior of this test procedure and discovered that, when r < ρ(β), the test is asymptotically powerless in the sense of achieving power α, while when r > ρ(β), it is asymptotically fully powerful in the sense of achieving power 1, where the function ρ is given by

 ρ(β) := { β − 1/2, for 1/2 < β ≤ 3/4;  (1 − √(1−β))², for 3/4 < β < 1. (4)

Thus there is a detection boundary in the (β, r) plane, given by the curve r = ρ(β). In such a situation, we will say that a test procedure ‘achieves the detection boundary’, or is ‘first-order optimal’ (or simply ‘optimal’), if it is fully powerful when r > ρ(β).
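As a quick numerical companion, the boundary (4) is easy to evaluate; the sketch below (the function name rho is ours) also confirms that the two pieces agree at β = 3/4, so the boundary is continuous there.

```python
import math

def rho(beta):
    """Detection boundary (4) of the normal model, for beta in (1/2, 1)."""
    if not 0.5 < beta < 1:
        raise ValueError("beta must lie in (1/2, 1)")
    if beta <= 0.75:
        return beta - 0.5
    return (1 - math.sqrt(1 - beta)) ** 2

# Both pieces give 1/4 at beta = 3/4.
print(rho(0.75))
print(rho(0.9))
```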

Such detection boundaries were derived for other models, for example, in [5, 6, 7]. We also mention that the situation where β ≤ 1/2 is well-understood, but quite different, and will not be considered here. Most of the literature has focused on the more interesting setting where β ∈ (1/2, 1), and we do the same here.

1.1 Threshold tests

After determining what one can hope for, it becomes of interest to understand what one can achieve with less information. Indeed, the likelihood ratio test requires knowledge of all the quantities and objects defining the testing problem, in this case (F, ε, μ), and even in the present stylized setting we might want to know what can be done when some of this information is missing, in particular what defines the alternative, namely (ε, μ). (The case where F is also unknown has attracted much less attention. We discuss it in Section 5.)

When F is known, the problem is that of goodness-of-fit testing, albeit with alternatives of the form (2) in mind. Donoho and Jin [7] opened this investigation with the analysis of various tests, including the max test based on max_i X_i and a variant of the Anderson–Darling test [1]. Seeing this as a problem of multiple testing based on p-values defined as p_i = F̄(X_i), where F̄ := 1 − F, the max test coincides with the Tippett–Šidák combination test, while the Anderson–Darling variant coincides with a proposal by Tukey called the higher criticism (HC). More recently, Moscovich et al. [14] analyzed a goodness-of-fit (BJ) test proposed by Berk and Jones [4] in the same setting. For t ∈ ℝ, define

 N_n(t) = #{i : X_i ≥ t}. (5)

We note that, under the null hypothesis, N_n(t) is binomial with parameters (n, F̄(t)), which motivates the test that rejects for large values of

 sup_{t : F̄(t) ≤ 1/2} [N_n(t) − nF̄(t)] / [√(nF̄(t)(1−F̄(t))) + 1]. (6)

This is one of many possible variants of HC.¹ [¹: The constraint ‘F̄(t) ≤ 1/2’ can be replaced by ‘F̄(t) ≤ c’, where the constant c can be taken to be smaller. The ‘+1’ in the denominator is roughly equivalent to restricting to thresholds t with nF̄(t) ≥ 1, which Donoho and Jin [7] recommend for reasons of stability.] In any case, this variant performs as well (to first order) as any other variant of HC considered in the literature, at least in all the regimes commonly considered.
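To make the statistic (6) concrete, here is a direct quadratic-time sketch that scans only thresholds located at the data points (the function name hc_statistic is ours, and the grid example is merely illustrative):

```python
import math
from statistics import NormalDist

def hc_statistic(xs, F):
    """Higher-criticism variant (6): maximize the standardized exceedance
    count N_n(t) over thresholds t with bar F(t) <= 1/2."""
    n = len(xs)
    best = -math.inf
    for t in xs:                      # candidate thresholds at the data points
        tail = 1.0 - F(t)             # bar F(t)
        if tail > 0.5 or tail == 0.0:
            continue
        count = sum(1 for x in xs if x >= t)   # N_n(t)
        z = (count - n * tail) / (math.sqrt(n * tail * (1 - tail)) + 1)
        best = max(best, z)
    return best

# On a null-like sample (a grid of standard normal quantiles) the statistic
# stays small; a sparse upward shift of a few points would inflate it.
F = NormalDist().cdf
null_like = [NormalDist().inv_cdf((i + 0.5) / 200) for i in range(200)]
print(hc_statistic(null_like, F))
```

Scanning only thresholds at the data points is a standard discretization and performs as well to first order.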

Let X_(1) ≥ ⋯ ≥ X_(n) denote the ordered X_i’s. We note that, under the null hypothesis, F̄(X_(i)) has the beta distribution with parameters (i, n−i+1), which motivates the definition of BJ, rejecting for small values of

 min_{i ∈ [n]} P_i, (7)

where P_i := B_{i, n−i+1}(F̄(X_(i))) and B_{a,b} denotes the distribution function of the beta distribution with parameters (a, b).
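Since the beta parameters here are integers, the beta distribution function can be computed without special functions, via the classical identity B_{i, n−i+1}(x) = P(Bin(n, x) ≥ i). A self-contained sketch (function names ours):

```python
import math

def beta_cdf(x, i, n):
    """CDF of Beta(i, n - i + 1) at x, via the binomial identity
    P(Beta(i, n - i + 1) <= x) = P(Bin(n, x) >= i)."""
    return sum(math.comb(n, k) * x ** k * (1 - x) ** (n - k) for k in range(i, n + 1))

def berk_jones(xs, F):
    """BJ statistic (7): the smallest p-value P_i attached to the order
    statistics X_(1) >= ... >= X_(n)."""
    n = len(xs)
    pvals = []
    for i, x in enumerate(sorted(xs, reverse=True), start=1):
        tail = 1.0 - F(x)        # bar F(X_(i)), a Beta(i, n-i+1) under the null
        pvals.append(beta_cdf(tail, i, n))
    return min(pvals)
```

Small values of berk_jones indicate that some upper order statistic sits further out in the tail than the null distribution allows.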

The verdict is the following. In the normal setting, HC and BJ achieve the detection boundary in the full range β ∈ (1/2, 1), while the max test is only able to achieve the detection boundary in the upper half of that range, β ∈ [3/4, 1). The same extends to other models, in particular to generalized Gaussian models where F has density proportional to exp(−|x|^a/a) for some a > 1. (The case a ≤ 1 is qualitatively different: HC and BJ are still first-order optimal, while the max test is suboptimal everywhere.)

These tests are all threshold tests, where we define a threshold test as any test with a rejection region of the form ⋃_{t ∈ T} {N_n(t) ≥ c_t} for some subset T ⊂ ℝ and some critical values (c_t : t ∈ T). More broadly, any combination test that we know of that is discussed in the multiple-testing literature is a threshold test. (This includes the tests proposed by Fisher, Lipták–Stouffer, Tippett–Šidák, Simes, and more.) Thus it might be of interest to understand what can be achieved with a threshold test. In this regard, it is useful to examine how one would optimize such an approach if one had perfect knowledge of the model. Let T_t denote the test with rejection region {N_n(t) ≥ c_t}, where

 c_t := min{c ≥ 0 : P_0(N_n(t) ≥ c) ≤ α}. (8)

We define the oracle threshold test as the test T_{t*}, where

 t* := argmax_{t ∈ ℝ} P_1(N_n(t) ≥ c_t), (9)

with P_0 denoting the distribution under the null (1) and P_1 that under the alternative (2). (Here and elsewhere, α denotes the desired significance level.) Note that computing c_t only requires knowledge of F, while computing t* requires knowledge of the entire model, namely (F, ε, μ). Thus the construction of the test T_{t*} relies on the oracle knowledge of (ε, μ).
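Both ingredients are exact binomial computations, so the oracle test is easy to emulate numerically over a finite grid of thresholds. In the sketch below (all names ours), Fbar stands for F̄ and the grid is an assumed stand-in for the maximization over t ∈ ℝ in (9):

```python
import math

def binom_sf(n, p, c):
    """P(Bin(n, p) >= c)."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(c, n + 1))

def critical_value(n, p0, alpha):
    """c_t in (8): the smallest c >= 0 with P_0(N_n(t) >= c) <= alpha."""
    c = 0
    while binom_sf(n, p0, c) > alpha:
        c += 1
    return c

def oracle_threshold_power(n, eps, mu, Fbar, grid, alpha=0.05):
    """Maximize P_1(N_n(t) >= c_t) over a grid of thresholds, as in (9)."""
    best = 0.0
    for t in grid:
        q = (1 - eps) * Fbar(t) + eps * Fbar(t - mu)   # success prob. under (2)
        c = critical_value(n, Fbar(t), alpha)
        best = max(best, binom_sf(n, q, c))
    return best
```

Note that critical_value uses only F, while oracle_threshold_power needs (F, ε, μ), mirroring the oracle knowledge discussed above.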

1.2 Scan tests

Detection problems arise in a variety of contexts and in very many applications. An important example is in spatial statistics (itself a rather wide area), where the detection of ‘hot spots’, meaning areas of unusually high concentration, has been considered for quite some time [13]. An early contribution to this literature is that of Naus [15], who considered the distribution of the maximum number of points in an interval of given length when the points are drawn iid from the uniform distribution on the unit interval. This would nowadays be referred to as the scan statistic, and it arises when testing the null that the points are uniformly distributed against the (composite) alternative that there is a sub-interval with higher intensity. Settings where the sub-interval length is unknown have also been considered [3].

For s ≤ t, define F[s,t] := F(t) − F(s) and N_n[s,t] := #{i : s < X_i ≤ t}. We note that, under the null hypothesis, N_n[s,t] is binomial with parameters (n, F[s,t]), which motivates the test that rejects for large values of

 sup_{(s,t) : F[s,t] ≤ 1/2} [N_n[s,t] − nF[s,t]] / [√(nF[s,t](1−F[s,t])) + 1]. (10)

Although there are many possible variants, this is the one we will be working with.
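A direct implementation of (10) scans the O(n²) pairs of data points as interval endpoints; a pure-Python sketch (names ours):

```python
import math

def scan_statistic(xs, F):
    """Scan variant (10): maximize the standardized interval count over
    intervals [s, t] with endpoints at data points and F[s, t] <= 1/2."""
    n = len(xs)
    xs = sorted(xs)
    best = -math.inf
    for i in range(n):
        for j in range(i, n):
            mass = F(xs[j]) - F(xs[i])       # F[s, t]
            if mass <= 0.0 or mass > 0.5:
                continue
            count = j - i + 1                # N_n[s, t]
            z = (count - n * mass) / (math.sqrt(n * mass * (1 - mass)) + 1)
            best = max(best, z)
    return best
```

The quadratic cost is the price of scanning two endpoints instead of one; for large n one would restrict the set of pairs, but that refinement is not needed here.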

We note that, under the null hypothesis, for any pair of indices i < j, F[X_(j), X_(i)] has the beta distribution with parameters (j−i, n−j+i+1); see [8, Th 11.1]. This motivates the definition of the scan test which rejects for small values of

 min_{1 ≤ i < j ≤ n} P_{i,j}, (11)

where P_{i,j} := B_{j−i, n−j+i+1}(F[X_(j), X_(i)]).

In general, we define a scan test as any test with rejection region of the form ⋃_{(s,t) ∈ S} {N_n[s,t] ≥ c_{s,t}}, where S is a subset of {(s,t) : s ≤ t} and (c_{s,t}) are critical values. Let T_{s,t} denote the test with rejection region {N_n[s,t] ≥ c_{s,t}}, where

 c_{s,t} := min{c ≥ 0 : P_0(N_n[s,t] ≥ c) ≤ α}. (12)

We define the oracle scan test as the test T_{s*,t*}, where

 (s*, t*) := argmax_{s ≤ t} P_1(N_n[s,t] ≥ c_{s,t}). (13)

As before, the construction of T_{s*,t*} relies on oracle knowledge of (ε, μ).

To the best of our knowledge, this is the first time that such tests are considered in the line of work that concerns us here, with roots in the work of Ingster [10] and Donoho and Jin [7]. The main reason for considering these tests in the present context is that they happen to be first-order optimal, not only in the models considered in the literature (such as generalized Gaussian), but also in power-law models where F has fat tails (e.g., Cauchy or Pareto), whereas threshold tests are suboptimal for such models.

1.3 Content

For simplicity and the sake of clarity, we will focus on oracle-type, rather than likelihood ratio, performance bounds. The former are indeed more transparent and can be obtained in more generality and with simpler arguments. Also, our main intention here is to compare what can be achieved with threshold tests against the more general scan tests, and comparing the corresponding oracle tests seems most appropriate.

In Section 2, we study the oracle threshold test and the oracle scan test. We then consider a number of models. In Section 3, we consider the two scan tests described above and compare them to the oracle scan test. In Section 4, we present the result of some numerical experiments that illustrate our theory. We briefly discuss the performance of the likelihood ratio test and that of nonparametric approaches in Section 5.

2 Oracle threshold test and oracle scan test

In this section we state and prove some basic results for the oracle threshold and oracle scan tests.

2.1 Power monotonicity

It is natural to guess that testing (1) versus (2) becomes easier as the shift μ increases. This is indeed the case, at least from the point of view of both oracle tests.

Proposition 1.

The oracle threshold test has monotonic power in the shift.

Proof.

We assume that ε is fixed and let P_μ denote the data distribution under the alternative (2) with shift μ. Take μ_1 ≤ μ_2 and let t_1 denote the oracle threshold (9) for μ_1, so that the oracle test for μ_1, meaning T_{t_1}, has rejection region {N_n(t_1) ≥ c_{t_1}} and power π_1; define t_2 and π_2 accordingly. Thus we need to show that π_1 ≤ π_2. This is so because of the fact that, for any t, N_n(t) is stochastically non-decreasing in μ, leading to

 π_1 = P_{μ_1}(N_n(t_1) ≥ c_{t_1}) ≤ P_{μ_2}(N_n(t_1) ≥ c_{t_1}) ≤ P_{μ_2}(N_n(t_2) ≥ c_{t_2}) = π_2, (14)

where the last inequality is by construction of t_2 and c_{t_2}. ∎

Clearly, the oracle scan test has at least as much power as the oracle threshold test. Interestingly, it does not have monotonic power in general, although it does under some natural assumptions on the base distribution.

Proposition 2.

Assume that F, as a distribution, is unimodal. Then the oracle scan test has monotonic power in the shift.

Proof.

We stay with the setting and notation introduced in the proof of Proposition 1, with (s_1, t_1) and (s_2, t_2) denoting the oracle intervals (13) for μ_1 and μ_2, respectively. Let d ≥ 0 be smallest such that

 F[s_1+d, t_1+μ_2−μ_1] = F[s_1, t_1]. (15)

The fact that F, as a distribution, is unimodal implies that d ≤ μ_2 − μ_1. Now, under the null, by construction,

 P_0(N_n[s_1+d, t_1+μ_2−μ_1] ≥ c_{s_1,t_1}) = P_0(N_n[s_1, t_1] ≥ c_{s_1,t_1}) ≤ α. (16)

On the other hand, under P_{μ_1}, N_n[s_1, t_1] is binomial with parameters n and q_1 := (1−ε)F[s_1, t_1] + εF[s_1−μ_1, t_1−μ_1], while under P_{μ_2}, N_n[s_1+d, t_1+μ_2−μ_1] is binomial with parameters n and

 q_2 := (1−ε)F[s_1+d, t_1+μ_2−μ_1] + εF[s_1+d−μ_2, t_1+μ_2−μ_1−μ_2] = (1−ε)F[s_1, t_1] + εF[s_1+d−μ_2, t_1−μ_1] ≥ q_1,

using the fact that d ≤ μ_2 − μ_1, so that s_1+d−μ_2 ≤ s_1−μ_1. This explains the first inequality in the following derivation:

 π_1 = P_{μ_1}(N_n[s_1, t_1] ≥ c_{s_1,t_1}) ≤ P_{μ_2}(N_n[s_1+d, t_1+μ_2−μ_1] ≥ c_{s_1,t_1}) ≤ P_{μ_2}(N_n[s_2, t_2] ≥ c_{s_2,t_2}) = π_2,

and the second inequality is by definition of (s_2, t_2). ∎

2.2 Performance bounds

We now provide necessary and sufficient conditions for the oracle threshold test and the oracle scan test to be fully powerful in the large-sample limit (n → ∞). We focus on the case where

 nε_n → ∞,  √n ε_n → 0, (17)

where the first condition implies that, under the alternative, the sample is indeed contaminated with probability tending to 1, while the second condition puts us in the regime corresponding to β ∈ (1/2, 1) under Ingster’s parameterization (3).

Our analysis below is based on the following simple result, which is an immediate consequence of Chebyshev’s inequality and the central limit theorem.

Lemma 1.

Suppose that we are testing H_0: Y ~ Bin(n, p_n) versus H_1: Y ~ Bin(n, q_n), where p_n ≤ q_n ≤ 1/2, and consider the test at level α that rejects for large values of Y, which is the most powerful test. It is asymptotically powerful if n(q_n−p_n)²/q_n → ∞, while it is asymptotically powerless if n(q_n−p_n)²/q_n → 0.
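Lemma 1 is easy to probe by simulation. The following Monte Carlo sketch (all names and parameter choices ours) estimates the power of the most powerful test between two binomials:

```python
import random

def binomial_test_power(n, p, q, alpha=0.05, reps=1000, seed=0):
    """Monte Carlo power of the level-alpha test of Bin(n, p) versus
    Bin(n, q) that rejects for large counts, as in Lemma 1."""
    rng = random.Random(seed)

    def draw(prob):
        return sum(rng.random() < prob for _ in range(n))

    null = sorted(draw(p) for _ in range(reps))
    crit = null[int((1 - alpha) * reps)]     # empirical critical value
    return sum(draw(q) > crit for _ in range(reps)) / reps

# n(q - p)^2 / q = 33, so the power should be near 1:
print(binomial_test_power(500, 0.05, 0.15))
# n(q - p)^2 / q is about 0.2, so the power should be near alpha:
print(binomial_test_power(500, 0.05, 0.055))
```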

Using Lemma 1, we easily obtain performance guarantees for the oracle threshold test and the oracle scan test.

Proposition 3.

The oracle threshold test is powerful if there is a sequence of thresholds (t_n) such that

 nε_n F̄(t_n−μ_n) → ∞,  and  nε_n² F̄(t_n−μ_n)²/F̄(t_n) → ∞. (18)

It is powerless if, for any sequence of thresholds (t_n),

 nε_n² F̄(t_n−μ_n)²/F̄(t_n) → 0. (19)
Proof.

Let (t_n) denote a sequence of thresholds satisfying (18), and define p_n := F̄(t_n) and q_n := (1−ε_n)F̄(t_n) + ε_nF̄(t_n−μ_n). We know that N_n(t_n) ~ Bin(n, p_n) under the null and N_n(t_n) ~ Bin(n, q_n) under the alternative, with

 n(q_n−p_n)²/q_n = nε_n²(F̄(t_n−μ_n) − F̄(t_n))² / [(1−ε_n)F̄(t_n) + ε_nF̄(t_n−μ_n)].

If the second part of (18) holds, then necessarily F̄(t_n−μ_n)/F̄(t_n) → ∞, since

 nε_n²F̄(t_n−μ_n)²/F̄(t_n) = [nε_n²F̄(t_n)]·[F̄(t_n−μ_n)/F̄(t_n)]² ≤ (nε_n²)·[F̄(t_n−μ_n)/F̄(t_n)]², (20)

with nε_n² → 0 by assumption. Hence,

 n(q_n−p_n)²/q_n ∼ nε_n²F̄(t_n−μ_n)² / [(1−ε_n)F̄(t_n) + ε_nF̄(t_n−μ_n)] ≍ nε_nF̄(t_n−μ_n) ⋀ nε_n²F̄(t_n−μ_n)²/F̄(t_n).

Therefore, by Lemma 1, the sequence of tests T_{t_n} has full power in the limit when (18) holds.

Now let (t_n) be any sequence of thresholds and consider the sequence of tests T_{t_n}. By Lemma 1, it has power tending to α in the limit, since

 n(q_n−p_n)²/p_n ≤ nε_n²F̄(t_n−μ_n)² / [(1−ε_n)F̄(t_n)] → 0, (21)

where the convergence to 0 comes from (19). ∎

Remark 1.

Note that the first part of (18) may be replaced by

 n¯F(tn)→∞. (22)

This is because this and the second part of (18) together imply nε_nF̄(t_n−μ_n) → ∞, owing to the identity [nε_nF̄(t_n−μ_n)]² = [nF̄(t_n)]·[nε_n²F̄(t_n−μ_n)²/F̄(t_n)].

Proposition 4.

The oracle scan test is powerful if there is a sequence of intervals [s_n, t_n] such that

 nε_n F[s_n−μ_n, t_n−μ_n] → ∞,  and  nε_n² F[s_n−μ_n, t_n−μ_n]²/F[s_n, t_n] → ∞. (23)

It is powerless if, for any sequence of intervals [s_n, t_n],

 nε_n² F[s_n−μ_n, t_n−μ_n]²/F[s_n, t_n] → 0. (24)

The proof is completely parallel to that of Proposition 3 and is omitted.

2.3 Examples: generalized Gaussian models and more

We look at a number of models and in each case derive the performance of the oracle threshold and oracle scan tests, and compare that with the performance of the likelihood ratio test.

To place the results in line with the literature on the topic, we adopt Ingster’s parameterization (3) for ε, or in fact a softer version of it,

 ε = ε_n ∼ n^{−β}, (25)

for some fixed β ∈ (1/2, 1). The parameterization of μ will depend on the model.

To further simplify matters, we assume throughout that

 log F̄(x) ∼ −φ(x), x → ∞, (26)

where φ is continuous and strictly increasing for x large enough. In that case, in view of Remark 1, we note that (18) is satisfied when

 log n − φ(t_n) → ∞,  and  (1−2β) log n + φ(t_n) − 2φ(t_n−μ_n) → ∞. (27)

2.3.1 Extended generalized Gaussian

This class of models is defined by the property that φ satisfies² [²: It is tempting to consider a more general condition where there is a function ψ such that φ(ut)/φ(t) → ψ(u) for all u ≥ 0. However, as long as ψ is not constant, it can easily be shown that ψ(u) = u^a for some a > 0, since any such limit must satisfy ψ(uv) = ψ(u)ψ(v).]

 φ(ut)/φ(t) → u^a, t → ∞, ∀u ≥ 0. (28)

Here a > 0 parameterizes this class of models. This covers the generalized Gaussian models, which are often used as benchmarks in this line of work. It also covers the case where φ(x) = x^a ℓ(x) for a slowly varying function ℓ, for example φ(x) = x^a (log x)^b with b arbitrary.

For a > 1, define

 ρ_a(β) = { (2^{1/(a−1)} − 1)^{a−1}(β − 1/2), for 1/2 < β < 1 − 2^{−a/(a−1)};  (1 − (1−β)^{1/a})^a, for 1 − 2^{−a/(a−1)} ≤ β < 1. (29)

For a ≤ 1, define

 ρ_a(β) = 2(β − 1/2). (30)
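For later reference, the boundary (29)–(30) in code form (the function name rho_a is ours); at a = 2 it reduces to the normal-model boundary (4):

```python
def rho_a(beta, a):
    """Detection boundary (29)-(30) for the extended generalized Gaussian class."""
    if not 0.5 < beta < 1:
        raise ValueError("beta must lie in (1/2, 1)")
    if a <= 1:
        return 2 * (beta - 0.5)
    split = 1 - 2 ** (-a / (a - 1))
    if beta < split:
        return (2 ** (1 / (a - 1)) - 1) ** (a - 1) * (beta - 0.5)
    return (1 - (1 - beta) ** (1 / a)) ** a
```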

In addition to (25), assume that

 μ = μ_n satisfies φ(μ_n) ∼ r log n, with r ≥ 0 fixed. (31)
Proposition 5.

The curve r = ρ_a(β) in the (β, r) plane is the detection boundary that the oracle threshold test achieves.

Proof.

We focus on proving that the oracle threshold test achieves that boundary. A simple inspection of the arguments reveals that they are tight, so that this is the precise detection boundary that the test achieves. (See the proof of Proposition 8 for an example.)

We divide the proof into several cases.

Case 1: a > 1. Define b := 2^{−1/(a−1)} and note that b ∈ (0, 1).

Case 1.1: 1/2 < β < 1 − 2^{−a/(a−1)} and r > ρ_a(β). Under these conditions, 1−2β > −2r(1/b−1)^{−(a−1)}, and in particular there is η > 0 such that

 1−2β ≥ −2r(1/b−1)^{−(a−1)} + η. (32)

Setting t_n := μ_n/(1−b), by (28) and (31), we have the following

 φ(t_n−μ_n) = (r b^a/(1−b)^a + o(1)) log n, (33)
 φ(t_n) = (r/(1−b)^a + o(1)) log n. (34)

By Proposition 1 we may focus on r small enough that r/(1−b)^a < 1. This is possible because ρ_a(β)/(1−b)^a < 1 when β < 1 − 2^{−a/(a−1)}, which we assume here. (This can be easily verified using the definition of b.) Assuming that r is as such, the first part of (27) is satisfied. For the second part, with (32), we have

 (1−2β) log n − 2φ(t_n−μ_n) + φ(t_n) ≥ [−2r(1/b−1)^{−(a−1)} + η − 2r b^a/(1−b)^a + r/(1−b)^a + o(1)] log n = [η + o(1)] log n → ∞,

using the definition of b and simplifying. Thus the second part of (27) is also satisfied and the oracle threshold test is powerful.

Case 1.2: 1 − 2^{−a/(a−1)} ≤ β < 1 and r > ρ_a(β). Under these conditions, we have 1−β > (1−r^{1/a})^a, and in particular there is η > 0 such that

 1−β−η ≥ (1−r^{1/a})^a ≥ ((1−η)^{1/a} − r^{1/a})^a. (35)

Setting t_n := φ^{−1}((1−η) log n), we have the following

 φ(t_n−μ_n) = (((1−η)^{1/a} − r^{1/a})^a + o(1)) log n, (36)
 φ(t_n) = (1−η+o(1)) log n. (37)

By looking at the speed of φ(t_n), the first part of (27) is satisfied immediately. For the second part, with (35), we have

 (1−2β) log n − 2φ(t_n−μ_n) + φ(t_n) = (1−2β) log n − (2((1−η)^{1/a} − r^{1/a})^a + o(1)) log n + (1−η+o(1)) log n = 2[1−β−η/2 − ((1−η)^{1/a} − r^{1/a})^a + o(1)] log n → ∞.

Thus the second part of (27) is also satisfied and the oracle threshold test is powerful.

Case 2: a ≤ 1. By Proposition 1 we may restrict attention to the case where r < 1, which is possible since ρ_a(β) = 2(β−1/2) < 1 when β < 1. Here we set t_n := μ_n. Then the first part in (27) is clearly satisfied. For the second part, notice that

 (1−2β) log n − 2φ(t_n−μ_n) + φ(t_n) = (1−2β) log n + (r+o(1)) log n = [1−2β+r+o(1)] log n → ∞. ∎

Thus, although the conditions are much more general here, the detection boundary is the same as in the corresponding generalized Gaussian model and, moreover, the oracle threshold test achieves that boundary.

Remark 2 (max test).

In this class of models, it can be shown that the max test achieves the detection boundary over the upper range, meaning when β ≥ 1 − 2^{−a/(a−1)}. In fact, r = (1 − (1−β)^{1/a})^a defines the detection boundary for the max test.

2.3.2 Other models

In the next few classes of models, φ satisfies

 [φ^{−1}(t) − φ^{−1}(vt)]/λ(t) → ω(v), t → ∞, ∀v ∈ (0,1], (38)

for some functions λ and ω, with the latter being non-increasing, continuous, and such that ω(1) = 0. This is actually also the case when φ(x) = x^a with a > 0, with λ(t) = t^{1/a} and ω(v) = 1 − v^{1/a}.

Define

 ρ(β) = inf_{0 ≤ h ≤ 1−β} [ω(h) − ω(2β−1+2h)]. (39)

In addition to (25), assume that

 μ = μ_n ∼ r λ(log n), with r ≥ 0 fixed. (40)
Proposition 6.

The curve r = ρ(β) in the (β, r) plane is the detection boundary that the oracle threshold test achieves.

Proof.

We focus on proving that the oracle threshold test achieves that boundary.

Since ω is continuous, we may define

 h* := argmin_{0 ≤ h ≤ 1−β} [ω(h) − ω(2β−1+2h)]. (41)

We focus on the case where h* < 1−β. In the case where h* = 1−β, the max test is powerful (Remark 3), and therefore so is the oracle threshold test. By Proposition 1 we may focus on the case where r is close enough to ρ(β). With these assumptions and the fact that ω is continuous, there is η > 0 such that

 2β−1+2h*+2η < 1, (42)

and

 ω(h*) − ω(2β−1+2h*+η) ≤ r ≤ ω(h*) − ω(2β−1+2h*+2η). (43)

Define t_n := μ_n + φ^{−1}(h* log n). Using (38) multiple times, for n sufficiently large, we have the following

 μ_n = (r+o(1)) λ(log n) ≤ [ω(h*) − ω(2β−1+2h*+2η)] λ(log n) = φ^{−1}(log n) − φ^{−1}(h* log n) − φ^{−1}(log n) + φ^{−1}((2β−1+2h*+2η) log n) = φ^{−1}((2β−1+2h*+2η) log n) − φ^{−1}(h* log n).

Hence, eventually, t_n ≤ φ^{−1}((2β−1+2h*+2η) log n), implying that

 log n − φ(t_n) ≥ log n − (2β−1+2h*+2η) log n = [1 − (2β−1+2h*+2η)] log n → ∞,

using (42). Thus the first part of (27) is satisfied.

Similarly, for n sufficiently large,

 μ_n = (r+o(1)) λ(log n) ≥ [ω(h*) − ω(2β−1+2h*+η)] λ(log n) = φ^{−1}((2β−1+2h*+η) log n) − φ^{−1}(h* log n),

so that, eventually, t_n ≥ φ^{−1}((2β−1+2h*+η) log n), implying that

 (1−2β) log n − 2φ(t_n−μ_n) + φ(t_n) ≥ (1−2β) log n − 2h* log n + (2β−1+2h*+η) log n = η log n → ∞,

since φ(t_n−μ_n) = h* log n by the definition of t_n. Thus the second part of (27) is satisfied. ∎

Remark 3 (max test).

In the present situation, it can be shown that r = ω(1−β) defines the detection boundary for the max test.

2.3.3 Extended generalized Gumbel

This class of models is defined by φ(x) = exp(x^a) for some a > 0, which satisfies (38) with λ(t) = (1/a)(log t)^{1/a−1} and ω(v) = log(1/v). In this case,

 μ = μ_n ∼ (r/a)(log log n)^{1/a−1}, (44)

and the detection boundary is given by r = ρ(β) = log(1/(1−β)). Note that, at the detection boundary, μ_n → 0 when a > 1; that μ_n ≍ 1 when a = 1; and μ_n → ∞ when a < 1.

2.3.4 Extended generalized Gumbel

This class of models is defined by φ(x) = exp((log x)^a) for some a > 1, which satisfies (38) with λ(t) = (1/a)(log t)^{1/a−1} exp((log t)^{1/a}) and ω(v) = log(1/v). In this case,

 μ = μ_n ∼ (r/a)(log log n)^{1/a−1} exp((log log n)^{1/a}), (45)

and the detection boundary is given by r = log(1/(1−β)), as in the previous class of models (since ω is the same).

Remark 4 (max test).

Based on Remark 3, in the last two classes of models, the max test achieves the detection boundary over the whole range. The same is true, more generally, whenever the infimum in (39) is attained at h = 1−β.

2.4 Examples: power-law models and more

In the next few classes of models, F satisfies

 log(F(t+v) − F(t)) ∼ −λ(t), t → ∞, ∀v > 0, (46)

for some function λ which is eventually increasing and such that λ(t)/t → 0 as t → ∞. This includes models where

 F̄(t) ∝ t^{−a}(log t)^b (1+o(1)), t → ∞, (47)

with a > 0 and b ∈ ℝ, in which case (46) holds with λ(t) = (a+1) log t. It also includes models where F̄(t) ∝ (log t)^{−a}, with a > 0, in which case (46) holds with λ(t) = log t, as well as other distributions with even slower decay.
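For instance, for a Pareto distribution with F̄(t) = t^{−a} on [1, ∞), the increment satisfies F(t+v) − F(t) ≈ a v t^{−(a+1)}, so that (46) holds with λ(t) = (a+1) log t. A quick numerical check (names ours), working with the survival function to avoid floating-point cancellation:

```python
import math

def pareto_sf(t, a):
    """Pareto(a) survival function on [1, inf): bar F(t) = t^(-a)."""
    return t ** (-a) if t >= 1 else 1.0

# F(t + v) - F(t) = bar F(t) - bar F(t + v); the ratio
# log(F(t + v) - F(t)) / log(t) should approach -(a + 1).
a, v = 2.0, 1.0
for t in (1e2, 1e4, 1e6):
    inc = pareto_sf(t, a) - pareto_sf(t + v, a)
    print(t, math.log(inc) / math.log(t))
```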

In addition to (25), assume that

 μ = μ_n satisfies λ(μ_n) ∼ r log n, with r ≥ 0 fixed. (48)
Proposition 7.

The curve r = 2β−1 in the (β, r) plane is the detection boundary that the oracle scan test achieves.

Proof.

We focus on proving that the oracle scan test achieves that boundary.

Fix r > 2β−1. Consider the interval [s_n, t_n] with s_n := μ_n and t_n := μ_n + v, where v > 0 is such that F[0, v] > 0. We need to verify that (23) holds. On the one hand, we have

 nε_n F[s_n−μ_n, t_n−μ_n] = n^{1−β} F[0, v] → ∞, (49)

because β < 1 by assumption. So the first part of (23) holds. On the other hand, by (46) and (48), F[s_n, t_n] = n^{−r+o(1)}, so that

 nε_n² F[s_n−μ_n, t_n−μ_n]²/F[s_n, t_n] = n^{1−2β} F[0, v]² / n^{−r+o(1)} = n^{r+1−2β+o(1)} → ∞,

since r > 2β−1. So the second part of (23) holds. ∎
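The interval used in this proof is easy to visualize by simulation. In the sketch below (all parameter choices ours and merely illustrative), the base distribution is Pareto(a), so λ(t) = (a+1) log t, and μ_n is set accordingly; the window [μ_n + 1, μ_n + 2] then captures a non-vanishing fraction of the roughly nε_n contaminated points while holding only a handful of null points:

```python
import random

n, beta, r, a = 100_000, 0.7, 1.0, 2.0
eps = n ** (-beta)         # sparsity, as in (25)
mu = n ** (r / (a + 1))    # so that (a + 1) log(mu) = r log(n), as in (48)

rng = random.Random(1)
xs = [rng.paretovariate(a) + (mu if rng.random() < eps else 0.0)
      for _ in range(n)]

# A shifted point lands in [mu + 1, mu + 2] whenever its Pareto draw falls
# in [1, 2], an event of probability 1 - 2**(-a).
in_window = sum(mu + 1 <= x <= mu + 2 for x in xs)
expected_signal = n * eps * (1 - 2 ** (-a))
expected_null = n * ((mu + 1) ** (-a) - (mu + 2) ** (-a))
print(in_window, round(expected_signal, 1), round(expected_null, 2))
```

With these choices the window holds roughly two dozen contaminated points against an expected background of only one or two, which is exactly the imbalance that drives the two displays above.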

We now show that threshold tests are suboptimal in the main class of models satisfying (46), namely (47). (The same happens to be true in other models with fat tails satisfying (46).) This is the main motivation for considering scan tests.

Proposition 8.

In a model satisfying (47), and with the same parameterization (48), the curve r = (a+1)(2β−1)/a in the (β, r) plane is the detection boundary that the oracle threshold test achieves.

Proof.

We first prove that the oracle threshold test achieves this detection boundary. By Proposition 1 we may assume that r < (a+1)/a. Therefore, fix r such that (a+1)(2β−1)/a < r < (a+1)/a. Set the threshold t_n := μ_n + v, where v > 0 is fixed. We need to verify that (18) holds, and we do so via Remark 1. Note that, by (47), λ(t) = (a+1) log t, so that μ_n = n^{r/(a+1)+o(1)} under (48). In particular, using r < (a+1)/a,

 nF̄(t_n) ∼ n μ_n^{−a}(log μ_n)^b = n^{1−ar/(a+1)+o(1)} → ∞, (50)

and, by the same token, since F̄(t_n−μ_n) = F̄(v) is constant and r > (a+1)(2β−1)/a,

 nε_n² F̄(t_n−μ_n)²/F̄(t_n) ∼ n^{1−2β} · n^{ar/(a+1)+o(1)} = n^{1−2β+ar/(a+1)+o(1)} → ∞. (51)

We now turn to proving that this boundary is the best that the oracle threshold test can achieve. For this, fix r < (a+1)(2β−1)/a. We need to verify (19). Suppose for contradiction that there is a sequence of thresholds, (t_n), such that (19) does not hold. By extracting a subsequence if needed, we may assume that

 nε_n² F̄(t_n−μ_n)²/F̄(t_n) → λ ∈ (0, ∞]. (52)

First, suppose that t_n ≤ μ_n. Extracting a subsequence if needed, we may assume that this holds for all n. In that case, we have

 nε_n² F̄(t_n−μ_n)²/F̄(t_n) ≤ nε_n²/F̄(t_n) ≤ n^{1−2β+o(1)} μ_n^{a+o(1)} = n^{1−2β+ar/(a+1)+o(1)} → 0. (53)

Since this contradicts (52), we must have t_n > μ_n eventually, meaning that F̄(t_n) ≤ F̄(μ_n) = n^{−ar/(a+1)+o(1)}. In that case, we have