Sequential Subspace Changepoint Detection

11/09/2018
by   Liyan Xie, et al.
Rutgers University

We consider the sequential changepoint detection problem of detecting changes that are characterized by a subspace structure manifested in the covariance matrix. In particular, the covariance structure changes from an identity matrix to an unknown spiked covariance model. We consider three sequential changepoint detection procedures: the exact cumulative sum (CUSUM) procedure, which assumes knowledge of all parameters; the largest eigenvalue procedure; and a novel Subspace-CUSUM algorithm, with the last two intended for the case when unknown parameters are present. By leveraging the extreme eigenvalue distribution from random matrix theory and modeling the non-negligible temporal correlation in the sequence of detection statistics due to the sliding window approach, we provide theoretical approximations to the average run length (ARL) and the expected detection delay (EDD) for the largest eigenvalue procedure. The three methods are compared to each other using simulations.


I Introduction

Detecting changepoints from high-dimensional streaming data is a fundamental problem in applications such as video surveillance, sensor networks, and seismic event detection. In many scenarios, the change happens to the covariance structure and can be represented as a linear subspace. For example, the covariance matrix may shift from an identity matrix to a spiked covariance model [1, 2].

Given a sequence of observed vectors $x_1, x_2, \ldots$, where $x_t \in \mathbb{R}^k$ and $k$ is the signal dimension, there may be a changepoint time $\tau$ when the distribution of the data stream changes. Our goal is to detect this change as quickly as possible from streaming (sequentially obtained) data using online techniques. We are particularly interested in structured changes occurring in the signal covariance. We study two related settings: the emerging subspace, meaning that the change is a subspace emerging from a noisy background; and the switching subspace, meaning that the change is a switch in the direction of the subspace. The emerging subspace problem can arise, for instance, in coherent weak signal detection from seismic sensor arrays, and the switching subspace detection can be used for principal component analysis of streaming data. In these settings, the changes can be shown to be equivalent to a low-rank component added to the original covariance matrix. Moreover, the switching subspace problem, as we will see, can be reduced to the emerging subspace problem if we are willing to tolerate a performance loss. Therefore, we focus on the analysis of the emerging subspace problem.

In this paper, we consider three detection procedures. We start with the exact CUSUM, which is known to be optimum when we have complete knowledge of the pre- and post-change statistics and parameters. Since the post-change parameters are usually unknown, we propose two alternatives to deal with the case of unknown parameters. Specifically, we consider the largest eigenvalue procedure, where the test statistic is the largest eigenvalue of the sample covariance matrix of the data contained within a sliding time window. This can be regarded as a straightforward extension of its offline counterpart [1]. The second method, which we call Subspace-CUSUM, uses the structure of the exact CUSUM but, in place of the parameters that are unknown, it uses estimates computed, again, from data within a sliding window. We perform a theoretical analysis of the largest eigenvalue procedure. A similar analysis for Subspace-CUSUM, which is far more complicated and extended, is postponed to a future article. The three algorithms are compared using simulations.

The rest of the paper is organized as follows. Section II details the two problems of emerging and switching subspaces and how they are related. Section III presents the three sequential change detection procedures. In Section IV we develop theoretical bounds for the average run length and the expected detection delay of the largest eigenvalue procedure. In Section V we present numerical results and comparisons for the competing algorithms. Finally, Section VI contains our concluding remarks.

I-A Related Work

Classical approaches to covariance change detection usually consider generic settings without assuming any structure. The CUSUM statistic can be derived if the pre-change and post-change distributions are known. For the multivariate case, the Hotelling control chart is the traditional way to detect a covariance change. The determinant of the sample covariance matrix was also used in [3] to detect a change in the determinant of the covariance matrix. A multivariate CUSUM based on likelihood functions of multivariate Gaussians is studied in [4], but it only considers covariance changes that are a scalar multiple of the original covariance matrix. Offline detection of a change between two given covariance matrices is studied in [5] using the Schwarz information criterion [6], where the changepoint location must satisfy certain regularity conditions to ensure the existence of the maximum likelihood estimator. In [7], a hypothesis testing approach is developed to detect a shift in an off-diagonal sub-matrix of the covariance matrix using likelihood ratios. Recently, [8] studied a CUSUM-like procedure for the detection of switching subspaces when the distributions (as well as the subspaces) before and after the changepoint are exactly known; this differs from our work since we assume the subspace after the change is unknown.

The work most related to our present effort is the hypothesis testing method developed in [1], which uses the largest eigenvalue of the sample covariance matrix to detect a sparse spiked covariance model given a fixed number of samples. The largest eigenvalue statistic is shown to be asymptotically minimax optimal for determining whether there exists a sparse and low-rank component in the offline setting. A natural sequential version of this idea is to use a sliding window and compute the largest eigenvalue of the corresponding sample covariance matrix. However, under a sequential setting, this approach does not enjoy any form of (asymptotic) optimality.

A different test statistic, the so-called Kac-Rice statistic [9], has been considered for testing the spiked covariance model. The Kac-Rice statistic is the conditional survival function of the largest observed singular value conditioned on all other observed singular values, and is characterized by a simple asymptotic distribution (uniform on $[0,1]$). However, the statistic involves the computation of an integral over the whole real line, and it is not clear how this can be carried over to the sequential formulation.

II Problem Setup

We first introduce the spiked covariance model [2], which assumes that a small number of directions explain most of the variance. In particular, we consider the rank-one spiked covariance matrix, given by

$\Sigma = \sigma^2 I_k + \theta u u^\top,$

where $I_k$ denotes the identity matrix of size $k$; $\theta > 0$ is the signal strength; $u \in \mathbb{R}^k$ represents a basis for the subspace, with unit norm $\|u\| = 1$; and $\sigma^2$ is the noise variance, which will be considered known since it can be estimated from training data. (In fact, it is possible to consider $\sigma^2$ unknown as well and provide estimates of this parameter along with the necessary estimates of $u$; however, in order to simplify our presentation, we consider $\sigma^2$ known.) The Signal-to-Noise Ratio (SNR) is defined as $\rho = \theta / \sigma^2$.

Formally, the emerging subspace problem can be cast as follows:

$x_t \sim \mathcal{N}(0, \sigma^2 I_k), \;\; t = 1, \ldots, \tau; \qquad x_t \sim \mathcal{N}(0, \sigma^2 I_k + \theta u u^\top), \;\; t = \tau+1, \tau+2, \ldots, \tag{1}$

where $\tau$ is the unknown changepoint that we would like to detect from data that are acquired sequentially.

Similarly, the switching subspace problem can be formulated as follows:

$x_t \sim \mathcal{N}(0, \sigma^2 I_k + \theta u_1 u_1^\top), \;\; t = 1, \ldots, \tau; \qquad x_t \sim \mathcal{N}(0, \sigma^2 I_k + \theta u_2 u_2^\top), \;\; t = \tau+1, \tau+2, \ldots, \tag{2}$

where $u_1, u_2 \in \mathbb{R}^k$ represent bases for the subspaces before and after the change, with $\|u_1\| = \|u_2\| = 1$, and $u_1$ is considered known. In both settings, our goal is to detect the change as quickly as possible.

The switching subspace problem (2) can be reduced to the emerging subspace problem (1) by a simple data projection. Specifically, select any orthonormal matrix $A \in \mathbb{R}^{(k-1) \times k}$ such that

$A u_1 = 0,$

which means that all rows of $A$ are orthogonal to $u_1$, orthogonal to each other, and of unit norm. Then, using the matrix $A$, we project each observation onto a $(k-1)$-dimensional space and obtain a new sequence

$y_t = A x_t.$

Then $y_t$ is a zero-mean random vector with covariance matrix $\sigma^2 I_{k-1}$ before the change and $\sigma^2 I_{k-1} + \theta (A u_2)(A u_2)^\top$ after the change. Let $\theta' = \theta \|A u_2\|^2$ and

$u' = \frac{A u_2}{\|A u_2\|}.$

Thus, problem (2) can be reduced to the following:

$y_t \sim \mathcal{N}(0, \sigma^2 I_{k-1}), \;\; t = 1, \ldots, \tau; \qquad y_t \sim \mathcal{N}(0, \sigma^2 I_{k-1} + \theta' u' u'^\top), \;\; t = \tau+1, \tau+2, \ldots \tag{3}$

Note that in this way the switching subspace problem is reduced to the emerging subspace problem, where the new signal power $\theta' = \theta\big(1 - (u_1^\top u_2)^2\big)$ depends on the angle between $u_1$ and $u_2$, which is consistent with our intuition.

We would like to emphasize that by projecting the observations onto a lower dimensional space we lose information, suggesting that the two versions of the problem are not equivalent. Indeed, the optimum detector for the transformed data in (3) and the one for the original data in (2) do not coincide. This can be easily verified by computing the corresponding CUSUM tests and their optimum performance. Despite this difference, it is clear that with the proposed approach we put both problems under the same framework, offering, as we will see, computationally simple methods to solve the original problem in (2). Consequently, in the following analysis, we focus solely on problem (1).
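To make the reduction concrete, the following minimal numpy sketch (variable names and the specific construction are ours; any orthonormal complement of $u_1$ works) builds a projection matrix $A$ with $A u_1 = 0$ and verifies that the projected data follow the emerging subspace model (3) with signal power $\theta' = \theta(1 - (u_1^\top u_2)^2)$.

```python
import numpy as np

def orthogonal_complement(u1):
    """Return a (k-1) x k matrix A with orthonormal rows, all orthogonal to u1."""
    k = u1.shape[0]
    # Columns 2..k of the left singular vectors of u1 span the orthogonal complement of u1.
    U, _, _ = np.linalg.svd(u1.reshape(k, 1), full_matrices=True)
    return U[:, 1:].T

rng = np.random.default_rng(0)
k, theta, sigma2 = 8, 2.0, 1.0
u1 = rng.standard_normal(k); u1 /= np.linalg.norm(u1)   # known pre-change subspace
u2 = rng.standard_normal(k); u2 /= np.linalg.norm(u2)   # unknown post-change subspace

A = orthogonal_complement(u1)                            # A @ u1 == 0, A @ A.T == I
x = rng.multivariate_normal(np.zeros(k),
                            sigma2 * np.eye(k) + theta * np.outer(u2, u2), size=1000)
y = x @ A.T                                              # projected stream y_t = A x_t

# Post-change signal power in the projected space: theta' = theta * ||A u2||^2.
print(theta * np.linalg.norm(A @ u2) ** 2,
      theta * (1 - (u1 @ u2) ** 2))                      # the two expressions agree
```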

III Detection Procedures

As we mentioned before, we consider three methods: the exact CUSUM procedure where all parameters are considered known, the largest eigenvalue procedure, and the Subspace-CUSUM procedure. It is clear that, since CUSUM is optimum, it will serve as a point of reference for the other two approaches. We first introduce some necessary notation. Denote with $\mathbb{P}_\tau$ and $\mathbb{E}_\tau$ the probability measure and the expectation induced when there is a changepoint at the deterministic time $\tau$. Under this definition, $\mathbb{P}_\infty$ and $\mathbb{E}_\infty$ are the probability and the expectation under the nominal regime (the change never happens), while $\mathbb{P}_0$ and $\mathbb{E}_0$ are the probability and the expectation under the alternative regime (the change happens before we take any data).

III-A Optimal CUSUM Procedure

The CUSUM procedure [10, 11] is the most popular sequential test for change detection. When the observations are i.i.d. before and after the change, CUSUM is known to be exactly optimum [12], in the sense that it solves a well-defined constrained optimization problem introduced in [13]. However, the CUSUM procedure can be applied only when we have exact knowledge of the pre- and post-change distributions. Thus, for our problem, it requires complete specification of all parameters, namely the subspace $u$, the noise power $\sigma^2$, and the SNR $\rho$.

To derive the CUSUM procedure, let $f_\infty$ and $f_0$ denote the pre- and post-change probability density functions (pdf) of the observations. Then the CUSUM statistic is defined by maximizing the log-likelihood ratio over all possible changepoint locations,

$S_t = \max_{1 \le s \le t} \sum_{i=s}^{t} \log\frac{f_0(x_i)}{f_\infty(x_i)}, \tag{4}$

which has the recursive implementation

$S_t = \max(S_{t-1}, 0) + \log\frac{f_0(x_t)}{f_\infty(x_t)}, \qquad S_0 = 0, \tag{5}$

that enables its efficient calculation [12]. The CUSUM stopping time in turn is defined as

$T_C = \inf\{t > 0 : S_t \ge b\}, \tag{6}$

where $b$ is a threshold selected to meet a suitable false alarm constraint. For our problem of interest we can derive that

$\log\frac{f_0(x_t)}{f_\infty(x_t)} = \frac{1}{2} x_t^\top \big(\Sigma_\infty^{-1} - \Sigma_0^{-1}\big) x_t - \frac{1}{2}\log\frac{\det \Sigma_0}{\det \Sigma_\infty} = \frac{\theta}{2\sigma^2(\sigma^2+\theta)} \left[(u^\top x_t)^2 - \frac{\sigma^2(\sigma^2+\theta)}{\theta}\log\Big(1 + \frac{\theta}{\sigma^2}\Big)\right], \tag{7}$

where $\Sigma_\infty = \sigma^2 I_k$ and $\Sigma_0 = \sigma^2 I_k + \theta u u^\top$ denote the pre- and post-change covariance matrices. The second equality is due to the matrix inversion lemma [14], which allows us to write

$\big(\sigma^2 I_k + \theta u u^\top\big)^{-1} = \frac{1}{\sigma^2}\left(I_k - \frac{\theta}{\sigma^2+\theta}\, u u^\top\right),$

which, after substitution into the equation, yields the desired result. Note that the multiplicative factor $\frac{\theta}{2\sigma^2(\sigma^2+\theta)}$ is positive, so we can omit it from the log-likelihood ratio when forming the CUSUM statistic. This leads to the equivalent update

$S_t = \max(S_{t-1}, 0) + (u^\top x_t)^2 - \frac{\sigma^2(\sigma^2+\theta)}{\theta}\log\Big(1 + \frac{\theta}{\sigma^2}\Big). \tag{8}$
Remark 1.

We can show that the increment in (8), i.e., $(u^\top x_t)^2 - \frac{\sigma^2(\sigma^2+\theta)}{\theta}\log\big(1 + \frac{\theta}{\sigma^2}\big)$, has the following property: its expected value is negative under the pre-change and positive under the post-change probability measure. The proof relies on a simple argument based on Jensen's inequality. Due to this property, before the change the CUSUM statistic oscillates near $0$, while after the change it exhibits, on average, a positive linear drift, forcing it, eventually, to hit or exceed the threshold.
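As a concrete illustration, here is a minimal simulation sketch of the recursion in (8) under the rank-one model of Section II. The function and variable names are ours, and the drift constant follows the increment form derived above; treat it as a sketch rather than the authors' implementation.

```python
import numpy as np

def exact_cusum_stop(x, u, theta, sigma2, b):
    """Run the exact CUSUM update (8) on the rows of x; return the first t with S_t >= b."""
    drift = (sigma2 * (sigma2 + theta) / theta) * np.log(1.0 + theta / sigma2)
    S = 0.0
    for t, xt in enumerate(x, start=1):
        S = max(S, 0.0) + (u @ xt) ** 2 - drift   # negative mean drift before, positive after the change
        if S >= b:
            return t
    return None

# Toy run: change occurs at tau = 200.
rng = np.random.default_rng(1)
k, theta, sigma2, tau = 10, 1.0, 1.0, 200
u = rng.standard_normal(k); u /= np.linalg.norm(u)
pre  = rng.multivariate_normal(np.zeros(k), sigma2 * np.eye(k), size=tau)
post = rng.multivariate_normal(np.zeros(k), sigma2 * np.eye(k) + theta * np.outer(u, u), size=2000)
print(exact_cusum_stop(np.vstack([pre, post]), u, theta, sigma2, b=50.0))
```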

III-B Largest Eigenvalue Procedure

Motivated by the off-line test in [1], a natural strategy to detect the change is to use the largest eigenvalue of the sample covariance matrix. Under the sequential setting, we adopt a sliding window approach and form the sample covariance matrix using observations that lie within a time window of length $w$. For each time $t$, the un-normalized sample covariance matrix using the available samples is given by

$\hat\Sigma_t = \sum_{i=\max(t-w+1,\,1)}^{t} x_i x_i^\top. \tag{9}$

We note that for $t = 1$ the matrix contains a single outer product, and as time progresses the number of outer products increases linearly until it reaches $w$. After this point, namely for $t \ge w$, the number of outer products remains equal to $w$.

Let $\lambda_{\max}(M)$ denote the largest eigenvalue of a symmetric matrix $M$. We define the largest eigenvalue procedure as the one that stops according to the following rule:

$T_E = \inf\{t > 0 : \lambda_{\max}(\hat\Sigma_t) \ge b\}, \tag{10}$

where $b$ is a constant threshold selected to meet a suitable false alarm constraint. We need to emphasize that we do not divide by the number of available outer products when forming the un-normalized sample covariance matrix. As we explain in Section III-D, for $t < w$ it is better to always divide by $w$ instead of $t$; consequently, we can omit the normalization with $w$ from our detection statistic altogether by absorbing it into the threshold.
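A minimal sketch of the stopping rule (10) with the un-normalized sliding-window covariance (9); the window length w and the threshold b are placeholders to be calibrated as in Section III-D.

```python
import numpy as np

def largest_eigenvalue_stop(x, w, b):
    """Stop the first time the largest eigenvalue of the un-normalized windowed covariance reaches b."""
    for t in range(1, len(x) + 1):
        window = x[max(t - w, 0):t]              # at most w most recent samples (fewer for t < w)
        sigma_hat = window.T @ window            # un-normalized sample covariance, cf. (9)
        if np.linalg.eigvalsh(sigma_hat)[-1] >= b:
            return t
    return None
```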

III-C Subspace-CUSUM Procedure

Usually the subspace $u$ and the SNR $\rho$ are unknown. In this case it is impossible to form the exact CUSUM statistic depicted in (8). One option is to estimate the unknown parameters and substitute them back into the likelihood function. Here we propose to estimate only $u$ and replace the constant second term of the increment by a drift parameter $d$, which leads to the following Subspace-CUSUM update:

$\mathcal{S}_t = \max(\mathcal{S}_{t-1}, 0) + (\hat u_t^\top x_t)^2 - d. \tag{11}$

We denote the estimator of $u$ at time $t$ by $\hat u_t$ and compute it from the $w$ samples $x_{t+1}, \ldots, x_{t+w}$; this is because at time $t$ we want the estimate to rely on data that are in the "future" of $x_t$. Practically, this is always possible by properly delaying our data by $w$ samples. Stopping occurs similarly to CUSUM, that is,

$T_S = \inf\{t > 0 : \mathcal{S}_t \ge b\}.$

Of course, in order to be fair, at the time of stopping we must make the appropriate correction: namely, if $\mathcal{S}_t$ exceeds the threshold at time $t$ for the first time, then the actual stopping takes place at $t + w$. The reason we use estimates based on "future" data is to make $\hat u_t$ and $x_t$ independent, which in turn will help us decide what is the appropriate choice for the drift constant $d$.
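The following sketch implements the update (11) with the eigenvector estimated from the $w$ samples after each time instant, and returns the corrected stopping time $t + w$; the drift $d$ must satisfy the double inequality (12) discussed next. Names are illustrative, and the code is a sketch of the scheme described above, not the authors' implementation.

```python
import numpy as np

def subspace_cusum_stop(x, w, d, b):
    """Subspace-CUSUM sketch: the subspace is estimated from the w samples *after* each time t."""
    S = 0.0
    for t in range(1, len(x) - w + 1):
        future = x[t:t + w]                        # samples x_{t+1}, ..., x_{t+w} (independent of x_t)
        cov = future.T @ future                    # un-normalized covariance over the future window
        u_hat = np.linalg.eigh(cov)[1][:, -1]      # unit-norm eigenvector of the largest eigenvalue
        S = max(S, 0.0) + (u_hat @ x[t - 1]) ** 2 - d
        if S >= b:
            return t + w                           # correct for the w "future" samples used
    return None
```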

For the drift parameter $d$ we need the following double inequality to hold:

$\mathbb{E}_\infty\big[(\hat u_t^\top x_t)^2\big] < d < \mathbb{E}_0\big[(\hat u_t^\top x_t)^2\big]. \tag{12}$

With (12) we can guarantee that $\mathcal{S}_t$ mimics the behavior of the exact CUSUM statistic mentioned in Remark 1, namely, it exhibits a negative drift before and a positive drift after the change. To apply (11), we need to specify $d$ and, of course, provide the estimate $\hat u_t$. Regarding the latter, we simply use the unit-norm eigenvector corresponding to the largest eigenvalue of the un-normalized sample covariance matrix in (9), computed over the future window. As we mentioned, the main advantage of using future data is that it provides estimates that are independent from $x_t$. This independence property allows for the straightforward computation of the two expectations in (12) and contributes towards the proper selection of $d$. Note that under the pre-change distribution we can write

$\mathbb{E}_\infty\big[(\hat u_t^\top x_t)^2\big] = \mathbb{E}_\infty\big[\hat u_t^\top\, \mathbb{E}_\infty[x_t x_t^\top]\, \hat u_t\big] = \sigma^2\, \mathbb{E}_\infty\big[\|\hat u_t\|^2\big] = \sigma^2, \tag{13}$

where the first equality is due to the independence of $\hat u_t$ and $x_t$, the second to $x_t$ having covariance matrix $\sigma^2 I_k$, and the last to $\hat u_t$ being of unit norm.

Under the post-change regime, we need to specify the statistical behavior of $\hat u_t$ for the computation of $\mathbb{E}_0[(\hat u_t^\top x_t)^2]$. We will assume that the window size $w$ is sufficiently large so that Central Limit Theorem (CLT) approximations [15, 16] are possible for the eigenvector estimate. The required result appears in the next lemma.

Lemma 1.

Suppose the vectors $x_1, \ldots, x_w$ are of dimension $k$ and follow the distribution $\mathcal{N}(0, \sigma^2 I_k + \theta u u^\top)$. Let $\tilde u$ be the eigenvector corresponding to the largest eigenvalue of the sample covariance matrix; then, as $w \to \infty$, we have the following CLT version for $\tilde u$:

Proof.

The proof is detailed in the Appendix. ∎

Lemma 1 provides an asymptotic statistical description of the un-normalized estimate $\tilde u$ of $u$. More precisely, it characterizes the estimation error $\tilde u - u$. In our case we estimate the eigenvector from the un-normalized matrix in (9), but, as mentioned before, we adopt the normalized (unit-norm) version $\hat u_t$. Therefore, if we fix $w$ at a sufficiently large value and denote by $\delta_t$ the estimation error of the un-normalized estimate at time $t$, then from Lemma 1 we can deduce

(14)

Note that $\delta_t$ is also independent from $x_t$ and orthogonal to $u$, the latter being true because the covariance matrix of $\delta_t$ is proportional to $I_k - u u^\top$. This implies $u^\top \delta_t = 0$. Combining the above results, we have

Using these properties, the above expression simplifies to

(15)

For the two expectations in (15), using the Gaussian approximation from Lemma 1, we have

and

Consequently

(16)

with the last term being negligible compared to the other two when $w$ is sufficiently large.

Consider now the case where the SNR $\rho$ is unknown but exceeds some pre-set minimal SNR $\rho_{\min}$. From the above derivation, given the worst-case SNR $\rho_{\min}$ and an estimate of the noise variance $\sigma^2$, we can give a lower bound for $\mathbb{E}_0[(\hat u_t^\top x_t)^2]$. Consequently, the drift $d$ can be anything between $\sigma^2$ and this lower bound, where, we observe, the latter quantity exceeds $\sigma^2$ when the window size $w$ is sufficiently large. Below, for simplicity, we use for $d$ the average of the two bounds.

Alternatively, and in particular when $w$ does not satisfy this requirement, we can estimate the drift by Monte Carlo simulation. This method requires: (i) an estimate of the noise level $\sigma^2$, which can be obtained from training data without a changepoint; (ii) the pre-set worst-case SNR $\rho_{\min}$; (iii) a unit-norm vector $u_0$ that is generated randomly. Under the nominal regime we have $\mathbb{E}_\infty[(\hat u_t^\top x_t)^2] = \sigma^2$. Under the alternative, $\mathbb{E}_0[(\hat u_t^\top x_t)^2]$ depends only on the SNR, as shown in (16). We can therefore simulate the worst-case scenario using the randomly generated vector $u_0$ by generating samples from the distribution $\mathcal{N}(0, \sigma^2 I_k + \rho_{\min}\sigma^2\, u_0 u_0^\top)$.
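A sketch of the Monte Carlo recipe just described (hypothetical helper name). It draws an arbitrary unit-norm direction, simulates worst-case post-change data, averages the resulting increment power, and returns the midpoint of the two bounds, mirroring the simple choice mentioned above.

```python
import numpy as np

def estimate_drift(sigma2, rho_min, k, w, n_mc=2000, seed=0):
    """Monte Carlo estimate of the drift d from the two bounds sigma2 and E_0[(u_hat^T x)^2]."""
    rng = np.random.default_rng(seed)
    u0 = rng.standard_normal(k); u0 /= np.linalg.norm(u0)          # arbitrary unit-norm direction
    cov = sigma2 * (np.eye(k) + rho_min * np.outer(u0, u0))        # worst-case post-change covariance
    vals = []
    for _ in range(n_mc):
        sample = rng.multivariate_normal(np.zeros(k), cov, size=w + 1)
        x, window = sample[0], sample[1:]                          # x is independent of the window
        u_hat = np.linalg.eigh(window.T @ window)[1][:, -1]
        vals.append((u_hat @ x) ** 2)
    upper = float(np.mean(vals))     # worst-case post-change mean of the increment power
    lower = sigma2                   # pre-change mean, see (13)
    return 0.5 * (lower + upper)
```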

Even though the average of the update in (11) does not depend on $u$, the computation of the test statistic requires the estimate $\hat u_t$ of the eigenvector. This can be accomplished by applying the singular value decomposition (SVD) (or the power method [17]) to the un-normalized sample covariance matrix in (9).

Remark 2.

An alternative possibility is to use the generalized likelihood ratio (GLR) statistic, where both $\theta$ and $u$ are estimated for each possible change location $s$. The GLR statistic is

$G_t = \max_{1 \le s \le t}\ \sup_{\theta, u} \sum_{i=s+1}^{t} \log\frac{f_{\theta, u}(x_i)}{f_\infty(x_i)},$

where the estimates of $\theta$ and $u$ are computed from the samples $x_{s+1}, \ldots, x_t$. However, this computation is more intensive since there is no recursive implementation for the GLR statistic; furthermore, it requires growing memory (there exist finite-memory versions [18] which, unfortunately, are equally complicated in their implementation). Therefore, we do not consider the GLR statistic in this paper.

III-D Calibration

To fairly compare the detection procedures discussed in the previous section, we need to properly calibrate them. Clearly, the calibration process must be consistent with the performance measure we are interested in. It is exactly this point that we discuss next.

For a given stopping time $T$, we measure false alarms through the Average Run Length (ARL), expressed as $\mathbb{E}_\infty[T]$. For the detection capability of $T$ we use the (worst-case) Expected Detection Delay (EDD) proposed by Lorden [13],

$\mathrm{EDD}(T) = \sup_{\tau \ge 0}\ \operatorname{ess\,sup}\ \mathbb{E}_\tau\big[(T - \tau)^+ \,\big|\, x_1, \ldots, x_\tau\big], \tag{17}$

which considers the worst possible data before the change (expressed through the essential supremum) and the worst possible change-time $\tau$.

We now consider scenarios that lead to the worst-case detection delay. For the largest eigenvalue procedure, assume the change occurs at time $\tau$. Since for detection we use $\lambda_{\max}(\hat\Sigma_t)$ and compare it to a threshold, it is clear that the worst-case data before $\tau$ are the ones that make $\lambda_{\max}(\hat\Sigma_t)$, for $t > \tau$, as small as possible. We observe that

(18)

which corresponds to the data before the change being all equal to zero. In fact, the worst-case scenario at any time instant $\tau$ is equivalent to forgetting all data before and including $\tau$ and restarting the procedure from $\tau + 1$ using, initially, one, then two, etc. outer products in the un-normalized sample covariance matrix, exactly as we do when we start at time 0. Due to stationarity, this suggests that we can limit ourselves to the case $\tau = 0$ and compute $\mathbb{E}_0[T_E]$, and this will constitute the worst-case EDD. Furthermore, the fact that in the beginning we do not normalize with the number of outer products is beneficial for $T_E$, since it improves its ARL.

We should emphasize that if we do not force the data before the change to become zero and instead use simulations to evaluate the detector with a change occurring at some time different from 0, then it is possible to arrive at misleading conclusions. Indeed, it is not uncommon for this test to appear to outperform the exact CUSUM test for low ARL values. Of course, this is impossible, since the exact CUSUM is optimum for any ARL in the sense that it minimizes the worst-case EDD depicted in (17).

Let us now consider the worst-case scenario for Subspace-CUSUM. We observe that a restart of the statistic (i.e., $\mathcal{S}_t$ returning to 0) is the worst that can happen for the detection delay, so the well-known worst-case property of the exact CUSUM carries over to Subspace-CUSUM. Again, because of stationarity, this allows us to fix the change-time at $\tau = 0$. Of course, as mentioned before, because $\hat u_t$ uses data coming from the future of $x_t$, if our detector stops at some time $t$ (namely, when for the first time we experience $\mathcal{S}_t \ge b$), then the actual time of stopping must be corrected to $t + w$. A similar correction is not necessary for CUSUM, because this test has exact information for all parameters.

The threshold $b$ is chosen so that the ARL meets a pre-specified value. In practice, $b$ is determined by simulation. A very convenient tool for accelerating the estimation of the ARL (which is usually large) is the following formula, which connects the ARL of CUSUM to the average of the SPRT stopping time [11]:

$\mathbb{E}_\infty[T_C] = \frac{\mathbb{E}_\infty[T_{\mathrm{SPRT}}]}{\mathbb{P}_\infty\{W_{T_{\mathrm{SPRT}}} \ge b\}}, \tag{19}$

where the SPRT stopping time is defined as

$T_{\mathrm{SPRT}} = \inf\{t > 0 : W_t \notin [0, b)\}, \qquad W_t = \sum_{i=1}^{t} \log\frac{f_0(x_i)}{f_\infty(x_i)}.$

The validity of this formula relies on the CUSUM property that after each restart, the statistic is independent from the data before the time of the restart. Unfortunately, this key characteristic is no longer true in the proposed Subspace-CUSUM scheme, because $\hat u_t$ uses data from the future of $x_t$. We could, however, argue that this dependence is weak. Indeed, as we have seen in Lemma 1, each $\hat u_t$ is basically equal to $u$ plus a small random perturbation (estimation error with power of the order of $1/w$), with these perturbations being practically independent in time. As we observed with numerous simulations, estimating the ARL directly and through (19) (with the log-likelihood ratio increments replaced by the Subspace-CUSUM increments) results in almost indistinguishable values even for moderate window sizes $w$. This suggests that we can use (19) to estimate the ARL of the Subspace-CUSUM as well. As we mentioned, in the final result we need to add $w$ to account for the future data used by the estimate $\hat u_t$.
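The acceleration in (19) can be sketched as follows: simulate the (unreflected) random walk of increments until it exits $[0, b)$, then divide the mean exit time by the fraction of runs exiting through the upper boundary (for Subspace-CUSUM, the final ARL is then increased by $w$, as noted above). The increment sampler below is a placeholder; under the nominal regime the Subspace-CUSUM increment $(\hat u_t^\top x_t)^2 - d$ is, by the independence of $\hat u_t$ and $x_t$, distributed as $\sigma^2$ times a chi-squared variable with one degree of freedom, minus $d$.

```python
import numpy as np

def arl_via_sprt(increment_sampler, b, n_runs=5000, seed=0):
    """Estimate the ARL as E[T_sprt] / P(exit at the upper boundary b), cf. (19)."""
    rng = np.random.default_rng(seed)
    lengths, upper_exits = [], 0
    for _ in range(n_runs):
        S, t = 0.0, 0
        while 0.0 <= S < b:          # SPRT: stop when the walk leaves [0, b)
            S += increment_sampler(rng)
            t += 1
        lengths.append(t)
        upper_exits += (S >= b)
    return np.mean(lengths) * n_runs / max(upper_exits, 1)

# Nominal-regime Subspace-CUSUM increment: (u_hat^T x)^2 - d with x ~ N(0, sigma2 * I).
sigma2, d = 1.0, 1.5
print(arl_via_sprt(lambda rng: sigma2 * rng.standard_normal() ** 2 - d, b=30.0))
```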

IV Analysis of the Largest Eigenvalue Procedure

It is clear that in this work we are interested in promoting the Subspace-CUSUM detection procedure for the change detection problem of interest. Therefore, it would have been very supportive of this method to offer a theoretical analysis and derive formulas for the corresponding ARL and EDD. Even though such an analysis is possible, it is unfortunately overly lengthy; for this reason, we postpone its presentation to a future publication. In this section we instead characterize the ARL and EDD of the largest eigenvalue procedure, which turns out to be simpler. In doing so, we also introduce some of the mathematical tools we are going to use in the (future) analysis of the Subspace-CUSUM.

Iv-a Link with Random Matrix Theory

Since the study of the ARL requires understanding the behavior of the largest eigenvalue under the null, i.e., when the samples are i.i.d. zero-mean Gaussian random vectors with identity covariance matrix, we first review some related results from random matrix theory.

There is an extensive literature on the distribution of the largest eigenvalue of the sample covariance matrix; see, e.g., [2, 19, 20]. There are two kinds of results typically available for eigenvalue distributions: the so-called bulk [21], which treats a continuum of eigenvalues, and the extremes, which are the (first few) largest and smallest eigenvalues. Assume there are $w$ samples which are $k$-dimensional Gaussian random vectors with zero mean and identity covariance matrix. Let $\hat\Sigma = \sum_{i=1}^{w} x_i x_i^\top$ denote the un-normalized sample covariance matrix. If $k/w \to c > 0$, the largest eigenvalue of the normalized sample covariance matrix $\hat\Sigma/w$ converges to $(1 + \sqrt{c})^2$ almost surely [22]. To characterize the distribution of the largest eigenvalue, [2] uses the Tracy-Widom law [23]. Define the center and scaling constants

$\mu_{w,k} = \big(\sqrt{w-1} + \sqrt{k}\big)^2, \qquad \sigma_{w,k} = \big(\sqrt{w-1} + \sqrt{k}\big)\left(\frac{1}{\sqrt{w-1}} + \frac{1}{\sqrt{k}}\right)^{1/3}. \tag{20}$

If $k/w \to c > 0$, then the centered and scaled largest eigenvalue converges in distribution to a random variable $W_1$ following the so-called Tracy-Widom law of order one [2]:

$\frac{\lambda_{\max}(\hat\Sigma) - \mu_{w,k}}{\sigma_{w,k}} \xrightarrow{d} W_1, \qquad \mathbb{P}\{W_1 \le s\} = F_1(s). \tag{21}$

The Tracy-Widom law can be described in terms of a partial differential equation and the Airy function, and its tail can be computed numerically (using, for example, the R package RMTstat).
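A small numerical check of this convergence, using our reading of the centering and scaling constants in (20) (the exact expressions below follow Johnstone [2] and should be treated as an assumption about the paper's notation):

```python
import numpy as np

def centering_scaling(w, k):
    """Johnstone-style centering and scaling constants for lambda_max of an un-normalized Wishart."""
    mu = (np.sqrt(w - 1) + np.sqrt(k)) ** 2
    sig = (np.sqrt(w - 1) + np.sqrt(k)) * (1 / np.sqrt(w - 1) + 1 / np.sqrt(k)) ** (1 / 3)
    return mu, sig

rng = np.random.default_rng(0)
w, k = 200, 50
mu, sig = centering_scaling(w, k)
lam = []
for _ in range(500):
    X = rng.standard_normal((w, k))                    # nominal regime: identity covariance
    lam.append(np.linalg.eigvalsh(X.T @ X)[-1])
z = (np.array(lam) - mu) / sig
print(z.mean(), z.std())   # roughly -1.2 and 1.3, consistent with the order-one Tracy-Widom law
```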

IV-B Approximation of ARL Ignoring Temporal Correlation

If we ignore the temporal correlation of the largest eigenvalues produced by the sliding window, we can obtain a simple approximation for the ARL. If we call $p = \mathbb{P}_\infty\{\lambda_{\max}(\hat\Sigma_t) \ge b\}$ for $t \ge w$, then the time to stop is geometric and it is easy to see that the ARL can be expressed as $1/p$. Clearly, to obtain this result we must assume that the stopping probability equals $p$ for $t < w$ as well, which is clearly not true. Since for $t < w$ the un-normalized sample covariance has fewer than $w$ terms, the corresponding probability is smaller than $p$. This suggests that $1/p$ is actually a lower bound to the ARL, while $1/p + w$ is an upper bound. If $1/p \gg w$, then approximating the ARL with $1/p$ is quite acceptable. We can use the Tracy-Widom law to obtain an asymptotic expression relating the ARL to the threshold $b$. The desired formula appears in the following proposition.

Proposition 1 (Approximation of ARL by ignoring temporal correlation).

For any $\gamma > 0$ we have $\mathbb{E}_\infty[T_E] \approx \gamma$ if we select

$b = \mu_{w,k} + \sigma_{w,k}\, t_{1/\gamma}, \tag{22}$

where $t_{\alpha}$ denotes the $\alpha$-upper-percentage point of the Tracy-Widom law $F_1$, namely $1 - F_1(t_\alpha) = \alpha$.

Proof.

The proof is straightforward and therefore omitted. ∎
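In practice, instead of looking up the Tracy-Widom quantile (e.g., via the R package RMTstat), one can estimate the $1/\gamma$ upper quantile of $\lambda_{\max}$ directly by Monte Carlo, mirroring the logic of Proposition 1; the following sketch (illustrative names, nominal-regime data only) does exactly that.

```python
import numpy as np

def threshold_for_arl(w, k, target_arl, n_mc=100000, seed=0):
    """Pick b so that P_infty(lambda_max >= b) is about 1/target_arl, i.e., ARL ~ target_arl."""
    rng = np.random.default_rng(seed)
    lam = np.empty(n_mc)
    for i in range(n_mc):
        X = rng.standard_normal((w, k))                # full window under the nominal regime
        lam[i] = np.linalg.eigvalsh(X.T @ X)[-1]
    # For large target ARLs, n_mc must be much larger than target_arl for a stable quantile estimate.
    return float(np.quantile(lam, 1.0 - 1.0 / target_arl))
```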

IV-C Approximation of ARL Including Temporal Correlation

Now we aim to capture the temporal correlation between detection statistics due to overlapping time windows. We leverage a proof technique developed in [24], which yields satisfactory approximations for the tail probability of the maximum of a random field. For each $t$, define

(23)

Fig. 1 illustrates the overlap of two sample covariance matrices and provides necessary notation.

Fig. 1: Illustration of two sample covariance matrices.

We note that for any given $t$, the quantity of interest is the maximum over a set of correlated variables, and capturing the temporal dependence of these variables is challenging. For our analysis, we recall Pearson's correlation between two random variables $X$ and $Y$,

$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}.$

We then have the following lemma that addresses the problem of interest.

Lemma 2 (Approximation of local correlation).

Let

(24)

where and . Then when and ,

(25)
Proof.

The proof is given in the Appendix. ∎

By leveraging the properties of the local approximation in (25), we can obtain an asymptotic approximation using the localization theorem [25, 24]. Define the special function $\nu(\cdot)$, which is closely related to the Laplace transform of the overshoot over the boundary of a random walk [26]:

where $\phi$ and $\Phi$ are the pdf and cdf of the standard normal distribution, respectively.
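The definition of $\nu(\cdot)$ is not reproduced above; in computations, a widely used closed-form approximation (see Siegmund and Yakir [26]) can stand in for it. The sketch below implements that standard approximation, not necessarily the exact expression used in the paper.

```python
import numpy as np
from scipy.stats import norm

def nu(x):
    """Common closed-form approximation to Siegmund's overshoot function nu(x) for x > 0."""
    x = np.asarray(x, dtype=float)
    num = (2.0 / x) * (norm.cdf(x / 2.0) - 0.5)
    den = (x / 2.0) * norm.cdf(x / 2.0) + norm.pdf(x / 2.0)
    return num / den

print(nu(0.1), nu(1.0), nu(3.0))   # decreases from about 1 toward 0 as x grows
```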

Proposition 2 (ARL with temporal correlation).

For large values of the threshold $b$ we can write

(26)

where

Proof.

The proof is detailed in the Appendix. ∎

We perform simulations to verify the accuracy of the threshold values obtained without and with the temporal correlation taken into account (Propositions 1 and 2, respectively). The results are shown in Table I. We find that, indeed, the threshold obtained when the temporal correlation is taken into account (26) is more accurate than its counterpart obtained using the Tracy-Widom law (22).

Target ARL 5k 10k 20k 30k 40k 50k
Simulation 1.633 1.661 1.688 1.702 1.713 1.722
Approx (22) 1.738 1.763 1.787 1.800 1.809 1.816
Approx (26) 1.699 1.713 1.727 1.735 1.740 1.744
TABLE I: Comparison of the threshold obtained from simulations and using the approximations. Window length , dimension . The numbers shown are .

IV-D Lower Bound on EDD Using Marginal Power

We now focus on the detection performance and present a tight lower bound for the EDD of the largest eigenvalue procedure.

Proposition 3.

For large values of the threshold $b$ we have

(27)

where

Proof.

The proof is based on a known result for CUSUM [11] and requires the derivation of the Kullback-Leibler divergence for our problem. Details are given in the Appendix. ∎
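For completeness, the Kullback-Leibler divergence between the post- and pre-change densities of the rank-one model can be computed in closed form; the short derivation below uses our notation from Section II and standard Gaussian KL formulas (we include it as a sketch of the step the proof refers to).

```latex
% KL divergence between f_0 = N(0, \Sigma_1) (post-change) and f_\infty = N(0, \Sigma_0) (pre-change),
% with \Sigma_0 = \sigma^2 I_k, \Sigma_1 = \sigma^2 I_k + \theta u u^\top and \rho = \theta / \sigma^2.
\begin{align*}
D(f_0 \,\|\, f_\infty)
  &= \tfrac{1}{2}\left[\operatorname{tr}\!\big(\Sigma_0^{-1}\Sigma_1\big) - k
      - \log\det\!\big(\Sigma_0^{-1}\Sigma_1\big)\right] \\
  &= \tfrac{1}{2}\left[(k + \rho) - k - \log(1 + \rho)\right]
   = \tfrac{1}{2}\left[\rho - \log(1 + \rho)\right].
\end{align*}
```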

Consistent with intuition, the right-hand side of (27) in Proposition 3 is a decreasing function of the SNR $\rho$.

Fig. 2: Simulated EDD and the lower bound as a function of the threshold $b$.

Comparing the lower bound in Proposition 3 with the simulated average delay, as shown in Fig. 2, we see that in the regime of small detection delay (which is the main regime of interest), the lower bound serves as a reasonably good approximation.

V Numerical Examples

In this section, numerical results are presented to compare the three detection procedures. The tests are applied to synthetic data, and the performance of Subspace-CUSUM and the largest eigenvalue test is compared against the optimum CUSUM performance. Then the performance of Subspace-CUSUM is optimized by selecting the most appropriate window size.

V-A Performance Comparison

We perform simulations to compare the largest eigenvalue procedure, the Subspace-CUSUM procedure, and the exact CUSUM procedure. The threshold for each procedure is determined by Monte Carlo simulation, as discussed in Section III-D. Fig. 3 depicts the EDD versus the logarithm of the ARL for fixed model parameters and window length $w$. The black line corresponds to the exact CUSUM procedure, which is clearly the best, lying below the other curves. Subspace-CUSUM always has a smaller EDD than the largest eigenvalue procedure, and the difference increases with increasing ARL.

Fig. 3: Comparison of CUSUM, Subspace-CUSUM, and the largest eigenvalue procedure.

V-B Optimal Window Size

Fig. 4: Minimal EDD vs ARL among window sizes from 1 to 50.

Fig. 5: Optimal window size resulting in smallest EDD.

We also consider the EDD/ARL curve where the window size $w$ is optimized to minimize the detection delay at every ARL. We first compute the EDD for window sizes from 1 to 50, given each ARL value. Then we plot in Fig. 4 the lower envelope of EDDs, corresponding to the optimal EDD achieved by varying $w$. We also plot the optimal value of $w$ as a function of the ARL in Fig. 5. Even though the best EDD of the Subspace-CUSUM diverges from the performance enjoyed by CUSUM, we believe this divergence is slower than the increase of the optimum CUSUM EDD. One of the goals of the future publication on the analysis of Subspace-CUSUM is to show that this is indeed the case, which in turn will demonstrate that this detection structure is first-order asymptotically optimum.

VI Conclusion

In this paper, we considered three detection procedures for a rank-one change in the covariance matrix: the largest eigenvalue procedure, the exact CUSUM procedure, and the Subspace-CUSUM procedure. For Subspace-CUSUM, we estimate the required subspace in parallel with the sequential detection. We avoid estimating all unknown parameters by following a worst-case analysis with respect to the subspace power. We derived theoretical expressions for the ARL and an interesting lower bound for the EDD of the largest eigenvalue procedure. In particular, we were able to handle the correlations resulting from the usage of a sliding window, an issue that is not present in the offline version of the same procedure. For the comparison of the three competing detectors, we discussed how each detector must be calibrated so that the comparisons are fair. Comparisons were performed using simulated data, and Subspace-CUSUM was found to exhibit significantly better performance than the largest eigenvalue procedure. Ongoing work involves establishing first-order asymptotic optimality of the Subspace-CUSUM procedure by determining the optimal drift parameter and by relating the sliding window length to the desired ARL.

Acknowledgment

The work of Liyan Xie and Yao Xie was supported by the US National Science Foundation under Grants CCF 1442635, CMMI 1538746, DMS 1830210, and the Career Award CCF 1650913. The work of George Moustakides was supported by the US National Science Foundation under Grant CIF 1513373, through Rutgers University.

References

  • [1] Q. Berthet, P. Rigollet et al., “Optimal detection of sparse principal components in high dimension,” The Annals of Statistics, vol. 41, no. 4, pp. 1780–1815, 2013.
  • [2] I. M. Johnstone, “On the distribution of the largest eigenvalue in principal components analysis,” The Annals of Statistics, pp. 295–327, 2001.
  • [3] F. B. Alt, “Multivariate quality control,” Encyclopedia of Statistical Sciences, 2004.
  • [4] J. D. Healy, “A note on multivariate CUSUM procedures,” Technometrics, vol. 29, no. 4, pp. 409–412, 1987.
  • [5] J. Chen and A. Gupta, “Statistical inference of covariance change points in Gaussian model,” Statistics, vol. 38, no. 1, pp. 17–28, 2004.
  • [6] G. Schwarz, “Estimating the dimension of a model,” The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.
  • [7] E. Arias-Castro, S. Bubeck, G. Lugosi et al., “Detection of correlations,” The Annals of Statistics, vol. 40, no. 1, pp. 412–435, 2012.
  • [8] Y. Jiao, Y. Chen, and Y. Gu, “Subspace change-point detection: A new model and solution,” IEEE Journal of Selected Topics in Signal Processing, 2018.
  • [9] J. Taylor, J. Loftus, and R. J. Tibshirani, “Inference in adaptive regression via the Kac-Rice formula,” The Annals of Statistics, 2016.
  • [10] E. S. Page, “Continuous inspection schemes,” Biometrika, vol. 41, no. 1/2, pp. 100–115, 1954.
  • [11] D. Siegmund, Sequential Analysis: Tests and Confidence Intervals.   Springer Science & Business Media, 1985.
  • [12] G. V. Moustakides, “Optimal stopping times for detecting changes in distributions,” The Annals of Statistics, pp. 1379–1387, 1986.
  • [13] G. Lorden, “Procedures for reacting to a change in distribution,” Annals of Mathematical Statistics, pp. 1897–1908, 1971.
  • [14] M. A. Woodbury, “Inverting modified matrices,” Memorandum report, vol. 42, no. 106, p. 336, 1950.
  • [15] T. W. Anderson, “Asymptotic theory for principal component analysis,” The Annals of Mathematical Statistics, vol. 34, no. 1, pp. 122–148, 1963.
  • [16] D. Paul, “Asymptotics of sample eigenstructure for a large dimensional spiked covariance model,” Statistica Sinica, pp. 1617–1642, 2007.
  • [17] R. Mises and H. Pollaczek-Geiringer, “Praktische verfahren der gleichungsauflösung.” ZAMM-Journal of Applied Mathematics and Mechanics/Zeitschrift für Angewandte Mathematik und Mechanik, vol. 9, no. 1, pp. 58–77, 1929.
  • [18] T. L. Lai and J. Z. Shan, “Efficient recursive algorithms for detection of abrupt changes in signals and control systems,” IEEE Transactions on Automatic Control, vol. 44, no. 5, pp. 952–966, 1999.
  • [19] Y.-Q. Yin, Z.-D. Bai, and P. R. Krishnaiah, “On the limit of the largest eigenvalue of the large dimensional sample covariance matrix,” Probability theory and related fields, vol. 78, no. 4, pp. 509–521, 1988.
  • [20] J. Baik and J. W. Silverstein, “Eigenvalues of large sample covariance matrices of spiked population models,” Journal of Multivariate Analysis, vol. 97, no. 6, pp. 1382–1408, 2006.
  • [21] A. Edelman and Y. Wang, “Random matrix theory and its innovative applications,” in Advances in Applied Mathematics, Modeling, and Computational Science.   Springer, 2013, pp. 91–116.
  • [22] S. Geman, “A limit theorem for the norm of random matrices,” The Annals of Probability, pp. 252–261, 1980.
  • [23] C. Tracy and H. Widom, “On orthogonal and symplectic matrix ensembles,” Comm. Math. Phys., vol. 177, pp. 727–754, 1996.
  • [24] D. Siegmund, B. Yakir, and N. Zhang, “Tail approximations for maxima of random fields by likelihood ratio transformations,” Sequential Analysis, vol. 29, no. 3, pp. 245–262, 2010.
  • [25] ——, “Tail approximations for maxima of random fields by likelihood ratio transformations,” Sequential Analysis, vol. 29, no. 3, pp. 245–262, 2010.
  • [26] D. Siegmund and B. Yakir, The statistics of gene mapping.   Springer Science & Business Media, 2007.
  • [27] S. Li, Y. Xie, H. Dai, and L. Song, “M-statistic for kernel change-point detection,” in Advances in Neural Information Processing Systems, 2015, pp. 3366–3374.
  • [28] O. Guédon and R. Vershynin, “Community detection in sparse networks via Grothendieck’s inequality,” Probability Theory and Related Fields, vol. 165, no. 3-4, pp. 1025–1049, 2016.