Signal detection via Phi-divergences for general mixtures

03/17/2018 ∙ by Marc Ditzhaus, et al. ∙ Heinrich Heine Universität Düsseldorf

In this paper we are interested in testing whether there are any signals hidden in high-dimensional noisy data. To this end we study the family of goodness-of-fit tests based on Φ-divergences, including the test of Berk and Jones as well as Tukey's higher criticism test. The optimality of this family is already known for the heterogeneous normal mixture model. We now present a technique to transfer this optimality to more general models. For illustration we apply our results to dense signal and sparse signal models, including the exponential-χ^2 mixture model and general exponential families such as the normal, exponential and Gumbel distributions. Besides the optimality of the whole family we discuss the power behavior on the detection boundary and show that the whole family has no power there, whereas the likelihood ratio test does.


1 Introduction

In several research areas it is of interest to detect rare and weak signals hidden in a huge noisy background. These areas include, among others, genomics [14, 22, 26], disease surveillance [34, 36], local anomaly detection [37] as well as cosmology and astronomy [10, 31]. E.g., in genomics we want to determine as early as possible whether a patient is healthy or affected by a common disease like cancer or leukemia. Many researchers assume that the majority of an affected patient's genes behaves like the genes of a non-affected patient (noisy background) and only a small number of genes displays a slightly different behavior (signals). In other words, if there are any signals at all then they are rare and weak. This combination makes it very difficult to detect the signals. In this paper we study tests for this detection problem. After introducing the mathematical model we give more details about tests which have already been suggested in the literature and explain our new insights into them.
Let F be a known continuous noise distribution and G = G_n be an unknown signal distribution on ℝ. E.g., G(x) = F(x − μ_n) for some μ_n > 0, i.e. a signal leads to a shift by μ_n. Now, let X_1, …, X_n be an i.i.d. sample with

X_i ∼ (1 − ε_n) F + ε_n G,   ε_n ∈ [0, 1).

The parameter ε_n can be interpreted as the probability that X_i follows the signal distribution G instead of the noise distribution F. Hence, the number of signals is random and approximately of size n ε_n. In this paper we are interested in testing whether there are any signals, i.e. in testing

H_0: ε_n = 0   versus   H_1: ε_n > 0.   (1.1)

We focus on the challenging case of rare signals ε_n → 0, where we distinguish between the sparse signal case (√n ε_n → 0) and the dense signal case (√n ε_n → ∞). Throughout this paper, if not stated otherwise, all limits are meant as n → ∞. Clearly, the likelihood ratio test, also called the Neyman-Pearson test, is the best test for (1.1). Its power behavior was studied by Ingster [24] for a normal location model, i.e. F = Φ and G = Φ(· − μ_n), where Φ denotes the standard normal distribution function. This model is also called the heterogeneous normal mixture model. Using the parametrization ε_n = n^(−β), β ∈ (1/2, 1), and μ_n = √(2 r log n), r > 0, he showed that there is a detection boundary ρ*, which splits the β-r-parametrization plane into the completely detectable and the undetectable area, see Figure 1:

ρ*(β) = β − 1/2 for 1/2 < β ≤ 3/4,   and   ρ*(β) = (1 − √(1 − β))² for 3/4 < β < 1.   (1.2)
Figure 1: The detection boundary ρ* for the sparse heterogeneous normal mixture model, see (1.2), is plotted. It splits the β-r-parametrization plane into the completely detectable and the undetectable area, where the null and the alternative can be completely separated and merge (asymptotically), respectively.
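To make the parametrization concrete, the following minimal Python sketch (our own illustration; the function names are not from the paper) simulates the sparse heterogeneous normal mixture under the standard calibration ε_n = n^(−β), μ_n = √(2 r log n) and compares the pair (β, r) with the classical boundary (1.2).

```python
import numpy as np

def rho_star(beta):
    """Classical sparse detection boundary of Ingster / Donoho-Jin, see (1.2)."""
    if not 0.5 < beta < 1.0:
        raise ValueError("sparse regime requires 1/2 < beta < 1")
    return beta - 0.5 if beta <= 0.75 else (1.0 - np.sqrt(1.0 - beta)) ** 2

def sample_sparse_normal_mixture(n, beta, r, rng=None):
    """Draw X_1,...,X_n i.i.d. from (1 - eps_n) N(0,1) + eps_n N(mu_n, 1)
    with eps_n = n^(-beta) and mu_n = sqrt(2 r log n)."""
    rng = np.random.default_rng(rng)
    eps_n = n ** (-beta)
    mu_n = np.sqrt(2.0 * r * np.log(n))
    is_signal = rng.random(n) < eps_n          # each X_i is a signal with probability eps_n
    return rng.normal(0.0, 1.0, n) + mu_n * is_signal, is_signal

n, beta, r = 10**6, 0.7, 0.3
x, flags = sample_sparse_normal_mixture(n, beta, r, rng=1)
print(f"signals: {flags.sum()} of {n} (expected ~ n^(1-beta) = {n**(1-beta):.0f})")
print("detectable regime" if r > rho_star(beta) else "undetectable regime")
```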

If r > ρ*(β) then the likelihood ratio test can completely separate H_0 and H_1 asymptotically (completely detectable case), i.e. the sum of type 1 and type 2 error probabilities tends to 0. Otherwise, if r < ρ*(β), the likelihood ratio test and, thus, any other test cannot distinguish between H_0 and H_1 asymptotically (undetectable case). Later, Donoho and Jin [17] showed that Tukey's higher criticism [40, 41, 42] can also completely separate H_0 and H_1 asymptotically if r > ρ*(β). This showed a certain optimality of the higher criticism test. In contrast to the likelihood ratio test, the higher criticism test does not need knowledge of the typically unknown signal probability ε_n and signal strength μ_n. Jager and Wellner [29] introduced a family of test statistics S_n(s), s ∈ [−1, 2], based on Φ-divergences, including the higher criticism test statistic and the test statistic of Berk and Jones [6]. They extended the optimality result of Donoho and Jin [17] to their whole family. But in contrast to the higher criticism test, see [2, 3, 4, 8, 9, 17, 25, 35], it is less known whether this optimality also holds under more general model assumptions for the whole family. In this paper we give a positive answer to this question.
As already mentioned, we distinguish between dense and sparse signals, where the main focus in the literature lies on the latter. There are only a few positive results about the higher criticism test for dense signals. E.g., Cai et al. [8] proved the optimality of the higher criticism test in the dense signal case for the normal location model introduced above with β ∈ (0, 1/2). We will extend these results to general exponential families and to the whole family of Jager and Wellner [29].
When we explained the results of Ingster [24] we omitted the case r = ρ*(β), i.e. the behavior on the detection boundary. Ingster [24] determined the limit distribution of the likelihood ratio test statistic on the boundary under H_0 as well as under H_1. An interesting observation is that non-Gaussian limits also occur. In other words, he showed that there is a third area in the β-r-parametrization plane, namely the area of nontrivial power on the boundary. Ditzhaus and Janssen [16] studied the asymptotic behavior of the likelihood ratio test and the higher criticism test on the detection boundary for general mixtures. In particular, they showed that the higher criticism test has no power on the boundary for various models, whereas the likelihood ratio test has nontrivial power there. We will extend this negative result to the whole family of Jager and Wellner [29].
In short, this paper gives the following new insights into the tests based on Φ-divergences of Jager and Wellner [29] in the general setting of Cai and Wu [9]:

  1. In contrast to Jager and Wellner [29] we do not restrict the parameter s to the interval [−1, 2] and consider the family of test statistics S_n(s), s ∈ ℝ, instead. In particular, we extend the result of Jager and Wellner [29] about the asymptotic behavior of S_n(s) under the null to all parameters s ∈ ℝ.

  2. The optimality of tests based on S_n(s) for any s ∈ ℝ even holds beyond the assumption of normality. In particular, we verify the optimality for the model recently suggested by Cai and Wu [9] and for general dense mixtures based on exponential families.

  3. On the detection boundary, tests based on S_n(s) for any s ∈ ℝ asymptotically have no power, whereas the likelihood ratio test does.

The paper is organized as follows. In Section 2 we introduce the family of test statistics S_n(s), s ∈ ℝ, and present the limit distribution under the null H_0, which is the same for the whole family. Section 3 contains our tools to discuss the asymptotic power of the whole family under the alternative H_1. These tools are applied in Section 4 to a sparse signal model class recently suggested by Cai and Wu [9] and to a dense signal model based on general exponential families.

2 The test family and its limit under the null

2.1 The test statistics

This paper's focus lies on continuous noise distributions. That is why, e.g. having a transformation to p-values in mind, we can assume without loss of generality that the observations take values in [0, 1]. Denote the distribution function of the noise distribution by F. The basic idea is to compare the empirical distribution function F_n with the noise distribution function F by using one of the Φ-divergence tests proposed by Csiszár [13] based on a convex function, see also [1, 12, 13]. To be more specific, we introduce a family of convex functions φ_s, s ∈ ℝ, mapping [0, ∞) to [0, ∞]:

φ_s(x) = (1 − s + s x − x^s) / (s(1 − s)) for s ∉ {0, 1},   φ_1(x) = x log x − x + 1,   φ_0(x) = −log x + x − 1.

Based on these the family of Φ-divergence statistics is given by

K_s(u, v) = v φ_s(u/v) + (1 − v) φ_s((1 − u)/(1 − v))

for u, v ∈ [0, 1]. It is easy to see that s ↦ K_s(u, v) is continuous for every fixed pair (u, v) and that (u, v) ↦ K_s(u, v) is continuous for every fixed s. Now, we consider the following family of test statistics for (1.1) given by

S_n(s) = sup{ K_s(F_n(x), F(x)) : X_(1) ≤ x < X_(n) },   (2.1)

where X_(1) ≤ … ≤ X_(n) denote the order statistics of the observation vector (X_1, …, X_n). As explained by Jager and Wellner [29], Tukey's higher criticism test (s = 2), the test of Berk and Jones [6] (s = 1), the "reversed Berk-Jones" statistic introduced by Jager and Wellner [28] (s = 0) and a studentized version of the higher criticism statistic studied by Eicker [20] (s = −1) are included in this family. Note that S_n(s) does not always coincide with the corresponding known test statistic but is equivalent to it for the value of s given in parentheses. For all other s the test statistic was new. Jager and Wellner [29] give special emphasis to s = 1/2, which is equivalent to the supremum of the pointwise Hellinger divergences between two Bernoulli distributions with parameters F_n(x) and F(x), X_(1) ≤ x < X_(n).
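As an illustration, here is a minimal Python sketch of such a statistic. The Bernoulli divergence K_s below is the standard power-divergence form used by Jager and Wellner [29]; the reduction of the supremum to evaluations at the order statistics is our own finite-sample implementation choice and not necessarily identical to the exact definition in (2.1).

```python
import numpy as np

def K_s(u, v, s):
    """Phi-divergence K_s(u, v) between Bernoulli(u) and Bernoulli(v) in the
    power-divergence form of Jager and Wellner (2007):
    s = 2 ~ higher criticism, s = 1 ~ Berk-Jones, s = 1/2 ~ Hellinger."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    with np.errstate(divide="ignore", invalid="ignore"):
        if s == 1:
            out = u * np.log(u / v) + (1 - u) * np.log((1 - u) / (1 - v))
        elif s == 0:
            out = v * np.log(v / u) + (1 - v) * np.log((1 - v) / (1 - u))
        else:
            out = (1 - u**s * v**(1 - s) - (1 - u)**s * (1 - v)**(1 - s)) / (s * (1 - s))
    return np.nan_to_num(out, nan=0.0)        # convention 0 * log 0 = 0

def phi_div_stat(x, F, s):
    """Finite-sample version of the supremum statistic: the empirical df is a
    step function and K_s(u, .) is convex, so the supremum over [X_(1), X_(n))
    reduces to evaluations at the order statistics."""
    p = np.sort(F(np.asarray(x, float)))      # F(X_(1)) <= ... <= F(X_(n))
    n = p.size
    u = np.arange(1, n) / n                   # values of the empirical df on [X_(i), X_(i+1))
    return max(np.max(K_s(u, p[:-1], s)), np.max(K_s(u, p[1:], s)))

# example: pure noise after the p-value transformation (F = identity)
rng = np.random.default_rng(0)
print(phi_div_stat(rng.uniform(size=1000), lambda t: t, s=2.0))
```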

2.2 Limit distribution under the null

The limit distribution of the higher criticism statistic is already known, see Jaeschke [27] and also Section 16.1 of Shorack and Wellner [38], and so one can derive the limit distribution of S_n(2). Jager and Wellner [29] showed that under the null H_0 the appropriately normalized statistics S_n(s) differ from S_n(2) only by terms converging in probability to 0 and, consequently, the limit distribution is the same for all s ∈ [−1, 2]. We now extend their result to all s ∈ ℝ.
Notation for convergences: we write →_d for convergence in distribution, and →_{H_0}, →_{H_1} for convergence in H_0- and in H_1-probability, respectively.

Theorem 2.1

Define

Then we have for all s ∈ ℝ that under the null H_0

(2.2)

where Y is standard Gumbel distributed, i.e. x ↦ exp(−exp(−x)) is the distribution function of Y.

At least for the higher criticism statistic it is known that the convergence rate is really slow, see Khmaladze and Shinjikashvili [33]. Since the basic idea of the proof of Theorem 2.1 is to approximate S_n(s) by the higher criticism statistic, the same bad rate can be expected for all s, or an even worse rate due to an additional approximation error. Consequently, we cannot recommend using a critical value based on the convergence result in Theorem 2.1 unless the sample size n is really huge. Since the noise distribution F is assumed to be known, a possibility to estimate the (1 − α)-quantile of S_n(s) under the null is to use a standard Monte Carlo simulation. Alternatively, finite recursion formulas for the exact finite-sample distribution can be found in the literature, see Jager and Wellner [28] and Khmaladze and Shinjikashvili [33], which cover certain values of s up to moderate sample sizes.
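A minimal sketch of such a Monte Carlo approximation is given below. It assumes the observations have been transformed to p-values, so that under H_0 the statistic is computed from n i.i.d. uniform variables, and it reuses the hypothetical helper phi_div_stat from the sketch in Section 2.1.

```python
import numpy as np

def mc_critical_value(n, s, alpha=0.05, B=2000, rng=None):
    """Approximate the (1 - alpha)-quantile of the statistic under H_0 by
    Monte Carlo: after the p-value transformation the null distribution is
    that of n i.i.d. uniform random variables, whatever the (known) F is."""
    rng = np.random.default_rng(rng)
    stats = [phi_div_stat(rng.uniform(size=n), lambda t: t, s) for _ in range(B)]
    return float(np.quantile(stats, 1.0 - alpha))

# reject H_0 at (approximate) level alpha whenever the observed statistic exceeds this value
print(mc_critical_value(n=500, s=2.0, alpha=0.05, B=2000, rng=1))
```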

3 Asymptotic power under the alternative

In this section we present a tool to analyze the asymptotic power behavior of all family members under the alternative H_1. This was already done by Ditzhaus and Janssen [16] for the higher criticism test, i.e. for s = 2. We show that the main tool used to obtain their results can be applied for general s ∈ ℝ. To be more specific, this tool is the function given by

(3.1)

where the distributions involved are those of the p-value 1 − F(X_i) if X_i follows the signal distribution G and the noise distribution F, respectively. The basic idea is to compare the tails near 0 as well as near 1 of the p-value under the signal distribution and under the noise distribution. Due to symmetry the function in (3.1) does not change if we consider the p-value F(X_i) instead. Moreover, due to this symmetry it suffices to consider one of the two versions of the p-value.

Theorem 3.2 (Complete detection)

Suppose that there is a sequence in such that and . Then we have for all

(3.2)

Under (3.2) there exists, for every s ∈ ℝ, a sequence of critical values such that the sum of type 1 and type 2 error probabilities of the corresponding test based on S_n(s) tends to 0. In other words, by using S_n(s) we can completely separate H_0 and H_1 asymptotically.
For the next result recall that the null (product) distribution and the alternative (product) distribution are said to be mutually contiguous if for all sequences of measurable sets A_n: the probability of A_n under H_0 tends to 0 if and only if it tends to 0 under H_1. By the first lemma of Le Cam the two distributions are mutually contiguous if and only if the limits (in distribution) of the likelihood ratio test statistic are real-valued under the null H_0 as well as under the alternative H_1. This typically holds on the detection boundary, see Ditzhaus and Janssen [16]. Hence, mutual contiguity is no real restriction when discussing the behavior on this boundary.

Theorem 3.3 (No power)

Suppose that the null and the alternative (product) distributions are mutually contiguous and that there are constants such that

(3.3)
(3.4)

Then (2.2) also holds under the alternative H_1.

Under (3.3) and (3.4) all tests based on S_n(s) cannot distinguish between H_0 and H_1 asymptotically, i.e. the sum of type 1 and type 2 error probabilities tends to 1. As we will prove, with probability tending to 1 the supremum in (2.1) is not attained at values of x from the extreme tails or from the middle. To be more specific, we verify that under H_0 (and so under H_1 if the null and the alternative are mutually contiguous)

This briefly explains why in (3.3) we take the supremum over a restricted range only.

Remark 3.4 (Simplification for the sparse case)

Typically, in the sparse case we even have . Then it is easy to see that the statements in Theorems 3.3 and 3.2 remain true if is replaced by

4 Application

4.1 Extension of Cai and Wu [9]

Throughout this section we consider (only) the sparse case ε_n = n^(−β), β ∈ (1/2, 1). Starting with a fixed noise distribution and a fixed sequence of signal distributions, Cai and Wu [9] developed a technique to calculate a detection threshold β* for the parameter β.

  1. (Undetectable) If β exceeds β* then there is no sequence of tests which can distinguish between H_0 and H_1 asymptotically.

  2. (Completely detectable) If β is smaller than β* then there is a sequence of likelihood ratio tests which can completely separate H_0 and H_1 asymptotically.

Let us begin this section by recalling the results of Cai and Wu [9]. We first present the special case of standard normal noise.

Theorem 4.5 (Theorem 1 and 4 in [9])

Define for all

Suppose that

for a measurable function. Then the detection boundary β* is given by

where ess sup denotes the essential supremum of a measurable function. Moreover, if β < β* then there is a sequence of critical values such that the test based on S_n(2) (higher criticism test) can completely separate H_0 and H_1 asymptotically.

The results concerning the normal location model mentioned in Section 1 follow immediately from this theorem, for details see Sections V-A and V-C in [9]. More generally, Theorem 4.5 can be applied to the model given by F = Φ and a signal distribution obtained as a convolution of Φ with a mixing distribution. For details we refer the reader to Corollary 1 and Section V-B in [9]. An example of this convolution idea is the heteroscedastic normal location model, where the variance of the signal distribution may differ from 1. Note that Cai et al. [8] already studied the optimality of the higher criticism test under this model.
For non-normal noise distributions the following theorem can be applied:

Theorem 4.6 (Theorem 3 in [9])

Define for all

Suppose that

(4.1)

for a measurable function. Then the detection threshold for β is given by

(4.2)

Theorem 4.6 can be applied to derive the detection boundary for the general Gaussian location mixture model and the exponential-χ^2 mixture model as explained by Cai and Wu [9]. Note that Donoho and Jin [17] already discussed these models and postulated the optimality of the higher criticism test for them. But it was not known whether this optimality holds in general under the assumptions of Theorem 4.6. According to the next theorem it does, even for the whole family of test statistics S_n(s), s ∈ ℝ.

Theorem 4.7 (Extension of Theorem 4.6)

Let be defined as in Theorem 4.6. Assume that there exists some such that for every

(4.3)
(4.4)

for all sufficiently large . Let be a sequence in such that and for all . Suppose that for some :

(4.5)
for some or for every there exists such that
(4.6)

for all sufficiently large n. Then the detection threshold for β is given by (4.2). Moreover, if β < β* then for every s ∈ ℝ there is a sequence of critical values such that the test based on S_n(s) can completely separate H_0 and H_1 asymptotically.

The conditions (4.3) and (4.4) together mimic the essential supremum in (4.2). The advantage is that we do not need the uniform convergence as in (4.1). During our study we had a look at a simple scale exponential distribution model, for which we expected that Theorem 4.6 can be applied. But for this model the required convergence fails for small arguments and so (4.1) is violated. Our (new) more general assumptions can handle this problem. Furthermore, we can treat the following exponential families, including a scale exponential, a scale Fréchet and a location Gumbel distribution model. Before we can formulate the theorem let us recall that a function is called slowly varying at infinity if the ratio of its values at tx and x tends to 1 as x → ∞ for every t > 0.

Theorem 4.8

Let be a family of continuous distributions on with and Radon-Nikodym density

(4.7)

for appropriate functions and with . Suppose that is strictly decreasing on for some , for all and

for a slowly varying function at infinity, where is the left continuous quantile function of . Let be the noise distribution and the signal distribution for with . Then the conditions of Theorem 4.7 are fulfilled for

Due to lack of space we do not discuss the asymptotic behavior of the tests on the detection boundary, i.e. the case β = β*. We refer the reader to Corollary 5.7 and Theorem 8.19 of Ditzhaus [15]. In short, the likelihood ratio test has nontrivial power on the detection boundary, whereas the higher criticism test has no power. By Theorem 3.3 the latter can be extended to the whole family S_n(s), s ∈ ℝ.
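For illustration, here is a rough end-to-end sketch for a scale exponential mixture of the kind covered by Theorem 4.8. The signal scale below is an arbitrary placeholder rather than the calibration required by the theorem, and phi_div_stat and mc_critical_value are the hypothetical helpers from the earlier sketches.

```python
import numpy as np

# phi_div_stat and mc_critical_value: hypothetical helpers from the earlier sketches
rng = np.random.default_rng(2)
n, beta = 50_000, 0.6
eps_n = n ** (-beta)                       # sparse signal probability eps_n = n^(-beta)
theta_n = 2.5                              # placeholder signal scale (not the paper's calibration)

# noise Exp(1); a random eps_n-fraction of observations gets the larger scale theta_n
is_signal = rng.random(n) < eps_n
x = rng.exponential(1.0, n) * np.where(is_signal, theta_n, 1.0)

p = np.exp(-x)                             # p-values under the noise distribution Exp(1)
s = 2.0
stat = phi_div_stat(p, lambda t: t, s)
crit = mc_critical_value(n, s, alpha=0.05, B=500, rng=3)
print(f"S_n({s}) = {stat:.3f}, reject H_0: {stat > crit}")
```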
At the end of this section we present the extension of Theorem 4.5. This can be applied, e.g., to verify the optimality of the whole family S_n(s), s ∈ ℝ, for the heteroscedastic and heterogeneous normal location model. The behavior of the likelihood ratio test and the higher criticism test on the detection boundary for this particular model is discussed in Ditzhaus and Janssen [16] and can be extended, using Theorems 3.2 and 3.3, to the whole family. Again, the likelihood ratio test has nontrivial power, whereas tests based on S_n(s) for any s ∈ ℝ have no asymptotic power.

Theorem 4.9 (Extension of Theorem 4.5)

Let be defined as in Theorem 4.5. Suppose that there is some such that for every

(4.8)
(4.9)

for all sufficiently large . Let be a sequence in such that and for all . Suppose that for some :

(4.10)
for some or for every there exists such that
(4.11)

for all sufficiently large n. Then the detection boundary is given as in Theorem 4.5. Moreover, if β < β* then for every s ∈ ℝ there is a sequence of critical values such that the test based on S_n(s) can completely separate H_0 and H_1 asymptotically.

4.2 Dense exponential family

In this section we give a quite general example of a dense signal model and show the optimality of the whole family S_n(s), s ∈ ℝ, for it. The normal location model, F = Φ and G = Φ(· − μ_n) with μ_n → 0, is included in this consideration. This particular model was already discussed by Cai et al. [8] concerning the higher criticism test, i.e. for s = 2. Here, we extend their result to all s ∈ ℝ and to exponential families in a far more general form compared to Theorem 4.7. In contrast to the previous section, we discuss the asymptotic power behavior on the detection boundary here in detail.

Figure 2: The detection boundary for dense signal exponential family mixtures is plotted, see Theorem 4.10.
Theorem 4.10

Let be a family of continuous distributions on with Radon-Nikodym density given by (4.7) for and with . Assume that . Consider the noise distribution and the signal distribution with the parametrization

then the detection boundary for the parameter is given by

In particular, we have for all s ∈ ℝ:

  1. In the completely detectable case there is, for every s ∈ ℝ, a sequence of critical values such that the test based on S_n(s) can completely separate H_0 and H_1 asymptotically.

  2. Suppose that the parameters lie exactly on the detection boundary. Then the sharp lower bound of the limit of the sum of type 1 and type 2 error probabilities over all tests of H_0 versus H_1 is nontrivial, i.e. strictly between 0 and 1. But all tests based on S_n(s) cannot distinguish between H_0 and H_1 asymptotically.

  3. In the undetectable case no test can distinguish between H_0 and H_1 asymptotically.
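For orientation, in the heterogeneous normal location special case mentioned above, with the usual dense calibration ε_n = n^(−β), 0 < β < 1/2, and μ_n = n^(−r), the boundary of Theorem 4.10 should reduce to the well-known dense detection boundary; the following LaTeX note summarizes this special case (our own summary, not the theorem's omitted general formula):

```latex
% dense heterogeneous normal location model:
%   eps_n = n^{-beta}, mu_n = n^{-r}, 0 < beta < 1/2
r^*(\beta) = \tfrac12 - \beta, \qquad
\text{detection possible if } r < r^*(\beta), \quad
\text{impossible if } r > r^*(\beta).
```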

5 Discussion

The higher criticism test statistic has become quite popular recently. In this paper we showed that a whole class of test statistics shares the same optimality properties as the higher criticism statistic under different model assumptions. The advantage of a whole class is more flexibility in choosing a test statistic which suits the specific problem best. Jager and Wellner [29] already pointed out that S_n(s) is more sensitive to signal distributions with heavy or light tails, depending on the value of s. As a good compromise they suggested their "new" statistic S_n(1/2). In the future we wish to conduct a detailed simulation study in order to understand the differences between the test statistics and to be able to give advice to practitioners on how to choose "the best" s.
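A rough version of such a comparison could look as follows. This is only a sketch under the sparse normal calibration of Section 1; it reuses the hypothetical helpers sample_sparse_normal_mixture, phi_div_stat and mc_critical_value from the earlier sketches, and the labels for the chosen values of s follow the correspondence recalled in Section 2.1.

```python
import numpy as np
from scipy.stats import norm

# sample_sparse_normal_mixture, phi_div_stat, mc_critical_value: helpers from earlier sketches
def empirical_power(s, n=5000, beta=0.7, r=0.35, M=200, alpha=0.05, seed=0):
    """Rough Monte Carlo power of the test based on the statistic with parameter s
    in the sparse heterogeneous normal mixture."""
    rng = np.random.default_rng(seed)
    crit = mc_critical_value(n, s, alpha=alpha, B=300, rng=rng)
    rejections = 0
    for _ in range(M):
        x, _ = sample_sparse_normal_mixture(n, beta, r, rng=rng)
        p = norm.sf(x)                     # p-values 1 - Phi(X_i) under the N(0,1) noise
        rejections += phi_div_stat(p, lambda t: t, s) > crit
    return rejections / M

for s in (0.5, 1.0, 2.0):                  # Hellinger, Berk-Jones, higher-criticism type
    print(s, empirical_power(s))
```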

Besides the detection problem, a more in-depth analysis of the data, such as feature selection, estimation of the number of signals and classification, is of great practical interest, too. The detection problem discussed in this paper is closely related to these other problems, see [18, 19, 23, 30], for which the higher criticism statistic can also be used. The results in this paper suggest that the whole class S_n(s), s ∈ ℝ, may be used for these problems. The benefit would again be more flexibility. This could be a project for the future.
The function in (3.1) serving as our tool for the asymptotic behavior under the alternative was already used by Ditzhaus and Janssen [16] for the higher criticism test. In particular, the results concerning their examples can be extended immediately to the whole family S_n(s), s ∈ ℝ. Besides the normal location model they also discussed a structure model for p-values based on the spike chimeric alternatives of Khmaladze [32].

6 Acknowledgments

The author thanks the Deutsche Forschungsgemeinschaft (DFG) for financial support (DFG Grant no. 618886).

7 Proofs

To prove Theorem 2.1 we use some results of Chang [11] and Wellner [43] about the asymptotic behavior of the empirical distribution function. We summarize them in the following lemma.

Lemma 7.11

Let

be independent and identically distributed random variables on the same probability space

with continuous distribution function . Let be the corresponding empirical distribution function. Let be a decreasing sequence in , i.e. , such that . Then

(7.1)

If additionally then

(7.2)

Moreover, for all

(7.3)

Proof (Proof of Lemma 7.11)

First, suppose that the random variables are uniformly distributed on [0, 1]. Then (7.1) was stated by Chang [11], see also Theorem 0 in [43], and (7.3) follows by combining (i) and (ii) of Remark 1 of Wellner [43]. Moreover, (7.2) follows from Theorem 1S of Wellner [43]. For a general continuous distribution function F note that F(X_1), F(X_2), … are independent and uniformly distributed random variables in [0, 1]. Consequently, it is easy to check that the statements for general distributions can be concluded from the ones for uniform distributions.

Proof (Proof of Theorem 2.1)

Having a transformation to the p-values F(X_i) or 1 − F(X_i) in mind we can assume without loss of generality that F is the uniform distribution on the interval [0, 1]. The proof is based, as is the one of Theorem 3.1 of Jager and Wellner [29], on a Taylor expansion of the divergence K_s(u, v) around u = v. It is easy to verify that

and

Hence, by a careful calculation we obtain for all that

(7.4)

where is a random variable satisfying . Clearly, is monotone for all . That is why we have , where

and

Let . Obviously, . Moreover, observe that by (7.4)

Consequently, for (2.2) it is sufficient to show that

(7.5)
(7.6)
(7.7)

Note that

Hence, using the inequality

(7.8)

with appropriate sets we can deduce (7.7) from (7.4) if

(7.9)
(7.10)

Thus, it remains to verify (7.5), (7.6), (7.9) and (7.10). Note that by symmetry it is sufficient to show (7.6) and (7.10) for . Using again (7.8) we obtain (7.6) for