Minimax rates in sparse, high-dimensional changepoint detection

07/23/2019
by   Haoyang Liu, et al.
0

We study the detection of a sparse change in a high-dimensional mean vector as a minimax testing problem. Our first main contribution is to derive the exact minimax testing rate across all parameter regimes for n independent, p-variate Gaussian observations. This rate exhibits a phase transition when the sparsity level is of order √(p (8n)) and has a very delicate dependence on the sample size: in a certain sparsity regime it involves a triple iterated logarithmic factor in n. We also identify the leading constants in the rate to within a factor of 2 in both sparse and dense asymptotic regimes. Extensions to cases of spatial and temporal dependence are provided.

READ FULL TEXT VIEW PDF

Authors

page 1

page 2

page 3

page 4

03/01/2020

On Minimax Exponents of Sparse Testing

We consider exact asymptotics of the minimax risk for global testing aga...
10/25/2021

Minimax rates for sparse signal detection under correlation

We fully characterize the nonasymptotic minimax separation rate for spar...
09/22/2021

Sparse Uniformity Testing

In this paper we consider the uniformity testing problem for high-dimens...
03/10/2015

Minimax Optimal Rates of Estimation in High Dimensional Additive Models: Universal Phase Transition

We establish minimax optimal rates of convergence for estimation in a hi...
05/04/2022

Rates of estimation for high-dimensional multi-reference alignment

We study the continuous multi-reference alignment model of estimating a ...
06/08/2021

Minimax and adaptive tests for detecting abrupt and possibly transitory changes in a Poisson process

Motivated by applications in cybersecurity and epidemiology, we consider...
07/03/2020

Two-sample Testing for Large, Sparse High-Dimensional Multinomials under Rare/Weak Perturbations

Given two samples from possibly different discrete distributions over a ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The problem of changepoint detection has a long history (e.g. Page, 1955), but has undergone a remarkable renaissance over the last 5–10 years. This has been driven in part because these days sensors and other devices collect and store data on unprecedented scales, often at high frequency, which has placed a greater emphasis on the running time of changepoint detection algorithms (Killick, Fearnhead and Eckley, 2012; Frick, Munk and Sieling, 2014). But it is also because nowadays these data streams are often monitored simultaneously as a multidimensional process, with a changepoint in a subset of the coordinates representing an event of interest. Examples include distributed denial of service attacks as detected by changes in traffic at certain internet routers (Peng et al., 2004) and changes in a subset of blood oxygen level dependent contrast in a subset of voxels in fMRI studies (Aston and Kirch, 2012). Away from time series contexts, the problem is also of interest, for instance in the detection of chromosomal copy number abnormality (Zhang et al., 2010; Wang and Samworth, 2018). Key to the success of changepoint detection methods in such settings is the ability to borrow strength across the different coordinates, in order to be able to detect much smaller changes than would be possible through observation of any single coordinate in isolation.

We initially consider a simple model where, for some , we observe a matrix that can be written as

(1)

where is deterministic and the entries of are independent random variables. We wish to test the null hypothesis that the columns of are constant against the alternative that there exists a time at which these mean vectors change, in at most out of the coordinates. The difficulty of this problem is governed by a signal strength parameter that measures the squared Euclidean norm of the difference between the mean vectors, rescaled by ; this latter quantity represents the distance of the change from the endpoints of the series and can be interpreted as an effective sample size. The goal is to identify the minimax testing rate in as a function of the problem parameters , and , and we denote this by

; this is the signal strength at which we can find a test making the sum of the Type I and Type II error probabilities arbitrarily small by choosing

to be an appropriately large multiple of (where the multiple is not allowed to depend on , and ), and at which any test has error probability sum arbitrarily close to for a suitably small multiple of .

Our first main contribution, in Theorem 1, is to reveal a particularly subtle form of the exact minimax testing rate in the above problem, namely

This result provides a significant generalization of two known special cases in the literature, namely and ; see Section 2.1 for further discussion. Although our optimal testing procedure depends on the sparsity level , which would often be unknown in practice, we show in Theorem 4 that it is possible to construct an adaptive test that very nearly achieves the same testing rate.

The theorem described above is a finite-sample result, but does not provide information at the level of constants. By contrast, in Section 2.4, we study both dense and sparse asymptotic regimes, and identify the optimal constants to within a factor of in the former case, and a factor of in the latter case. In combination with Theorem 1, then, we are able to provide really quite a precise picture of the minimax testing rate in this problem.

Sections 3 and 4 concern extensions of our results to more general data generating mechanisms that allow for spatial and temporal dependence respectively. In Section 3, we allow for cross-sectional dependence across the coordinates via a non-diagonal covariance matrix for the (Gaussian) columns of . We identify the sharp minimax testing rate when , though the optimal procedure depends on three functionals of

, namely its trace, as well as its Frobenius and operator norms. Estimation of these quantities is confounded by the potential presence of the changepoint, but we are able to propose a robust method that retains the same guarantee under a couple of additional conditions. As an example, we consider covariance matrices that are a convex combination of the identity matrix and a matrix of ones; thus, each pair of distinct coordinates has the same (non-negative) covariance. Interestingly, we find here that this covariance structure can make the problem either harder or easier, depending on the sparsity level of the changepoint. In Section 

4, we also focus on the case and allow dependence across the columns of (which are still assumed to be jointly Gaussian), controlled through a bound on the sum of the contributions of the operator norms of the off-diagonal blocks of the covariance matrix. Again, interesting phase transition phenomena in the testing rate occur here, depending on the relative magnitudes of the parameters , and .

Most prior work on multivariate changepoint detection has proceeded without a sparsity condition and in an asymptotic regime with growing to infinity with the dimension fixed, including Basseville and Nikiforov (1993), Csörgő and Horváth (1997), Ombao et al. (2005), Aue et al. (2009), Kirch et al. (2015), Zhang et al. (2010) and Horváth and Hušková (2012). Bai (2010) studied the least squares estimator of a change in mean for high-dimensional panel data. Jirak (2015), Cho and Fryzlewicz (2015), Cho (2016) and Wang and Samworth (2018) have all proposed CUSUM-based methods for the estimation of the location of a sparse, high-dimensional changepoint. Aston and Kirch (2018) introduce a notion of efficiency that quantifies the detection power of different statistics in high-dimensional settings. Enikeeva and Harchaoui (2018) study the sparse changepoint detection problem in an asymptotic regime in which , and at the same time with and the sample size not too large, while Xie and Siegmund (2013) develop a mixture procedure to detect such sparse changes. Further related work on high-dimensional changepoint problems include the detection of changes in covariance (e.g. Aue et al., 2009; Cribben and Yu, 2017; Wang et al., 2017) and in sparse dynamic networks (Wang et al., 2018a).

Proofs of our main results are given in Section 5, while auxiliary results appear in Section 6. We close this section by introducing some notation that will be used throughout the paper. For , we write . Given , we write and . We also write to mean that there exists a universal constant such that ; moreover, means and . For a set , we use and to denote its indicator function and cardinality respectively. For a vector , we define the norms , and , and also define . Given two vectors and a positive definite matrix , we define and and omit the subscripts when . More generally, the trace inner product of two matrices is defined as , while the Frobenius and operator norms of are given by and respectively, where

denotes the largest singular value. The total variation distance between two probability measures

and on a measurable space is defined as . Moreover, if is absolutely continuous with respect to

, then the Kullback–Leibler divergence is defined as

, and the chi-squared divergence is defined as . The notation and are generic probability and expectation operators whose distribution is determined from the context.

2 Main results

Recall that we consider a noisy observation of a matrix , where and each entry of the error matrix is an independent random variable. In other words, writing and for the th columns of and respectively, we have . The goal of our paper is to test whether the multivariate sequence has a changepoint. We define the parameter space of signals without a changepoint by

For and , the space consisting of signals with a sparse structural change at time is defined by

In the definition of , the parameters and determine the size of the problem, while is the location of the changepoint. The numbers and parametrize the sparsity level and the magnitude of the structural change respectively. It is worth noting that is normalized by the factor , which plays the role of the effective sample size of the problem. To understand this, consider the problem of testing the changepoint at location when

. Then the natural test statistic is

whose variance is

. Hence the difficulty of changepoint detection problem depends on the location of the changepoint. Through the normalization factor , we can define a common signal strength parameter across different possible changepoint locations. Taking a union over all such changepoint locations, the alternative hypothesis parameter space is given by

We will address the problem of testing the two hypotheses

(2)

To this end, we let denote the class of possible test statistics, i.e. measurable functions . We also define the minimax testing error by

where we use or to denote probabilities and expectations under the data generating process (1). Our goal is to determine the order of the minimax rate of testing in this problem, as defined below.

Definition 1.

We say is a minimax rate of testing if the following two conditions are satisfied:

  1. For any , there exists , depending only on , such that for any .

  2. For any , there exists , depending only on , such that for any .

2.1 Special cases

Special cases of are well understood in the literature. For instance, when , we recover the one-dimensional changepoint detection problem. Gao et al. (2019) recently determined that

(3)

The rate (3) involves a iterated logarithmic factor, in constrast to a typical logarithmic factor in the minimax rate of sparse signal detection (e.g., Donoho and Jin, 2004; Arias-Castro et al., 2005; Berthet and Rigollet, 2013).

Another solved special case is when . In this setting, we observe and , and the problem is to test whether or not . Since is a sufficient statistic for , the problem can be further reduced to a sparse signal detection problem in a Gaussian sequence model. For this problem, Collier et al. (2017) established the minimax detection boundary

(4)

It is interesting to notice the elbow effect in the rate (4). Above the sparsity level of , one obtains the parametric rate that can be achieved using the test that rejects if for an appropriate .

It is straightforward to extend both rates (3) and (4) to cases where either or is of a constant order. However, the general form of is unknown in the statistical literature.

2.2 Minimax detection boundary

The main result of the paper is given by the following theorem.

Theorem 1.

The minimax rate of the detection boundary of the problem (2) is given by

(5)

It is important to note that the minimax rate (5) is not a simple sum or multiplication of the rates (3) and (4) for constant or . The high-dimensional changepoint detection problem differs fundamentally from both its low-dimensional version and the sparse signal detection problem.

We observe that the minimax rate exhibits the two regimes in (5) only when , since if , then the condition is empty, and (5) has just one regime. Compared with the rate (4), the phase transition boundary for the sparsity becomes . In fact, the minimax rate (5) can be obtained by first replacing the in (4) with , and then adding the extra term (3).

The dependence of (5) on is very delicate. Consider the range of sparsity where

for some universal constant . The rate (5) then becomes

That is, it grows with at a rate. To the best of our knowledge, such a triple iterated logarithmic rate has not been found in any other problem before in the statistical literature.

Last but not least, we remark that when or is a constant, the rate (5) recovers (3) and (4) as special cases.

Upper Bound.

To derive the upper bound, we need to construct a testing procedure. We emphasize that the goal of hypothesis testing is to detect the existence of a changepoint; this is in contrast to the problem of changepoint estimation (Cho and Fryzlewicz, 2015; Wang and Samworth, 2018; Wang et al., 2018b), where the goal is to find the changepoint’s location.

If we knew that the changepoint were between and , it would be natural to define the statistic

(6)

Note that the definition of does not use the observations between and . This allows to detect any changepoint in this range, regardless of its location. The existence of a changepoint implies that . Since the structural change only occurs in a sparse set of coordinates, we threshold the magnitude of each coordinate at level to obtain

where

is the conditional second moment of

, given that its magnitude is at least . See Collier et al. (2017) for a similar strategy for the sparse signal detection problem. Note that has a centered distribution under .

Since the range of the potential changepoint locations is unknown, a natural first thought is to take a maximum of over . It turns out, however, that in high-dimensional settings it is very difficult to control the dependence between these different test statistics at the level of precision required to establish the minimax testing rate. A methodological contribution of this work, then, is the recognition that it suffices to compute a maximum of over a candidate set of locations, because if there exists a changepoint at time and for some , then and are of the same order of magnitude. This observation reflects a key difference between the changepoint testing and estimation problems. To this end, we define

so that . Then, for a given , the testing procedure we consider is given by

(7)

The theoretical performance of the test (7) is given by the following theorem. We use the notation for the rate function on the right-hand side of (5).

Proposition 2.

For any , there exists , depending only on , such that the testing procedure (7) with and satisfies

as long as .

Just as the minimax rate (5) has two regimes, the testing procedure (7) also uses two different strategies. In the dense regime , we have and thus (7) becomes simply . In the sparse regime , a thresholding rule is applied at level , where . We discuss adaptivity to the sparsity level in Section 2.3.

Lower Bound.

We show that the testing procedure (7) is minimax optimal by stating a matching lower bound.

Proposition 3.

For any , there exists , depending only on , such that whenever .

2.3 Adaptation to sparsity

The optimal testing procedure (7) that achieves the minimax detection rate depends on knowledge of the sparsity . In this section, we present an alternative procedure that is adaptive to . We first describe two testing procedures, designed to deal with the dense and sparse regimes respectively. For the dense regime, and for , we consider

(8)

In this dense regime, the cut-off value in (8) is of the same order as that in (7), and does not depend on the sparsity level . For the sparse regime, we consider a slightly different procedure from that used in Proposition 2, namely

Combining the two tests, we obtain a testing procedure that is adaptive to both regimes, given by

(9)
Theorem 4.

For any , there exists , depending only on , such that the testing procedure (9) with satisfies

as long as

(10)

Compared with the minimax rate (5), the rate (10) achieved by the adaptive procedure is nearly optimal except that it misses the factor of in the logarithmic term.

2.4 Asymptotic constants

A notable feature of our minimax detection boundary derived in Theorem 1 is that the rate is non-asymptotic, meaning that the result holds for arbitrary , and . On the other hand, if we are allowed to make a few asymptotic assumptions, we can give explicit constants for the lower and upper bounds. In this subsection, therefore, we let both the dimension and the sparsity be functions of , and we consider asymptotics as .

Theorem 5 (Dense regime).

Assume that as . Then, with

we have when and when .

Theorem 6 (Sparse regime).

Assume that and as . Then, with

we have when and when .

These two theorems characterize the asymptotic minimax upper and lower bounds of the changepoint detection problem under dense and sparse asymptotics respectively.

3 Spatial dependence

In this section, we consider changepoint detection in settings with cross-sectional dependence in the coordinates. To be specific, we now relax our previous assumption on the cross-sectional distribution by supposing only that for some general positive definite covariance matrix ; the goal remains to solve the testing problem (2). We retain the notation and for probabilities and expectations, with the dependence on suppressed.

Our first result provides the minimax rate of the detection boundary in the dense case where . This sets up a useful benchmark on the difficulty of the problem depending on the covariance structure. Similar to Definition 1, we use the notation for the minimax rate of testing.

Theorem 7.

The minimax rate is given by

(11)

In the special case , Theorem 7 yields , which recovers the result of Theorem 1 when .

A test that achieves the optimal rate (11) is given by

(12)

for an appropriate choice of . Though optimal, the procedure (12) relies on knowledge of . In fact, one only needs to know , and , rather than the entire covariance matrix . To be even more specific, from a careful examination of the proof, we see that we only need to know up to an additive error that is at most of the same order as the cut-off, whereas knowledge of the orders of and , up to multiplication by universal constants, is enough.

We now discuss how to use to estimate the three quantities and . The solution would be straightforward if we knew the location of the changepoint, but in more typical situations where the changepoint location is unknown, this becomes a robust covariance functional estimation problem. We assume that and that is an integer, since a simple modification can be made if is not a integer. We can then divide into three consecutive blocks , each of whose cardinalities is . For , we compute the sample covariance matrix

where . We can then order these three estimators according to their trace, as well as their Frobenius and operator norms, yielding

The idea is that at least two of the three covariance matrix estimators should be accurate, because there is at most one changepoint location. This motivates us to take the medians , and with respect to the three functionals as our robust estimators. It is convenient to define .

Proposition 8.

Assume for some , and fix an arbitrary positive definite and . Then given , there exists , depending only on and , such that

with -probability at least .

With the help of Proposition 8, we can plug the estimators into the procedure (12). This test is adaptive to the unknown covariance structure, and comes with the following performance guarantee.

Corollary 9.

Assume that for some . Then given , there exist , depending only on and , such that if , then the testing procedure

satisfies

as long as .

Remark 1.

The conditions and guarantee that and with high probability, by Proposition 8. Note that

will be satisfied if all eigenvalues of

are of the same order. In fact, it possible to weaken the condition using the notion of effective rank (Koltchinskii and Lounici, 2017); however, this greatly complicates the analysis, and we do not pursue this here. Alternatively, Corollary 9 also holds without the condition but under the stronger dimensionality restriction ; this then allows for an arbitrary covariance matrix .

To better understand the influence of the covariance structure, consider, for , the covariance matrix

which has diagonal entries and off-diagonal entries . The parameter controls the pairwise spatial dependence; moreover, and . By Theorem 7, we have

(13)

Thus the spatial dependence significantly increases the difficulty of the testing problem. In particular, if is of a constant order, then the minimax rate is , which is much larger than the rate (11) for .

However, the increased difficulty of testing in this example is just one part of the story. When we consider the sparsity factor , the influence of the covariance structure can be the other way around. To illustrate this interesting phenomenon, we discuss a situation where is small. Since , we have that for , where . Hence

(14)

where . When there is no changepoint, we have , so for all . When there is a changepoint between and , we have . In either case, then, we can estimate by . This motivates a new statistic, defined by

(15)

To construct a scalar summary of , we define the functions for and, for , set

(16)

Note that when . The use of a positive in (16) is to tolerate the error of as an estimator of . The new testing procedure is then

(17)
Theorem 10.

Assume that and . Then there exist universal constants such that if , then for any , we can find and , both depending only on , such that the testing procedure (17) with and satisfies

when , as long as .

Surprisingly, in the sparse regime, the spatial correlation helps changepoint detection, and the required signal strength for testing consistency decreases as increases. This is in stark contrast to (13) for the same covariance structure when .

Remark 2.

The testing procedure considered in Theorem 10 can be easily made adaptive to the unknown by taking advantage of Proposition 8. Since , when the estimator satisfies

with probability at least . Then, the procedure with replaced by enjoys the same guarantee of Theorem 10 under mild extra conditions.

To end the section, the next theorem shows that the rate achieved by Theorem 10 is minimax optimal.

Theorem 11.

Assume that and . Then

(18)

4 Temporal dependence

In this section, we consider the situation where form a multivariate time series. To be specific, in our model for , we now assume that the random vectors are jointly Gaussian but not necessarily independent. The covariance structure of the error vectors can be parametrized by a covariance matrix , and for , we write if:

  1. for all ;

  2. for all .

Thus the data generating process of is completely determined by its mean matrix and covariance matrix , and we use the notion and for the corresponding probability and expectation. The case reduces to the situation of observations at different time points being independent. Time series dependence in high-dimensional changepoint problems has also been considered by Wang and Samworth (2018); their condition for all is only slightly different from ours.

We focus on the case and do not consider the effect of sparsity. The minimax testing error is defined by

We also define the corresponding minimax rate of detection boundary similar to Definition 1. The testing procedure

(19)

has the following property:

Theorem 12.

For any , there exists , depending only on , such that the testing procedure (19) with satisfies