A Two-stage Online Monitoring Procedure for High-Dimensional Data Streams

12/14/2017
by   Jun Li, et al.

Advanced computing and data acquisition technologies have made possible the collection of high-dimensional data streams in many fields. Efficient online monitoring tools which can correctly identify any abnormal data stream in such data are highly sought after. However, most of the existing monitoring procedures directly apply a false discovery rate (FDR) controlling procedure to the data at each time point, and the FDR at each time point (the point-wise FDR) is either specified by users or determined by the in-control (IC) average run length (ARL). If the point-wise FDR is specified by users, the resulting procedure lacks control of the global FDR and keeps users in the dark in terms of the IC-ARL. If the point-wise FDR is determined by the IC-ARL, the resulting procedure does not give users the flexibility to choose the number of false alarms (Type-I errors) they can tolerate when identifying abnormal data streams, which often makes the procedure too conservative. To address these limitations, we propose a two-stage monitoring procedure that can control both the IC-ARL and the Type-I errors at the levels specified by users. As a result, the proposed procedure allows users to choose not only how often they expect any false alarms when all data streams are IC, but also how many false alarms they can tolerate when identifying abnormal data streams. With this extra flexibility, our proposed two-stage monitoring procedure is shown in the simulation study and real data analysis to outperform the existing methods.


1 Introduction

Recent advances in data acquisition technologies have greatly facilitated the collection of massive data in many industries. The number of data streams that need to be monitored in many statistical process control (SPC) applications has grown significantly. For example, in a network traffic surveillance application, Lévy-Leduc and Roueff (2009) considered the problem of monitoring thousands of Internet traffic metrics provided by a French Internet service provider. In a health-care monitoring application, Spiegelhalter et al. (2012) discussed how to simultaneously monitor 200,000 indicators of excess mortality in the UK health system. In the era of big data, the dimension of data streams available for monitoring will continue to grow. The demand for efficient monitoring schemes that can work for high-dimensional data streams has never been greater.

Different from monitoring a single data stream, a desirable monitoring scheme for high-dimensional data streams not only needs to raise an alarm as early as possible when some data streams are out-of-control (OC), but also needs to correctly identify those OC data streams. The problem of monitoring high-dimensional data streams for this purpose has attracted considerable attention in the recent literature. A number of monitoring schemes have been proposed, including those in Grigg and Spiegelhalter (2008), Li and Tsung (2009, 2012), and Gandy and Lau (2013). A common theme in those methods is to monitor each data stream by one control chart, calculate the p-values of the charting statistics, and then apply some existing false discovery rate (FDR) controlling procedure from the multiple hypothesis testing literature to those p-values at each time point. Since the FDR controlling procedure is used separately at each time point in those methods, the FDR is only controlled locally at each time point. We refer to the FDR at each time point as the point-wise FDR hereafter. However, when designing a monitoring scheme, it is more desirable to have control of certain global measures, in this case, control of a global FDR. Because it has long been believed that if the point-wise FDR is controlled locally at each time point, the FDR over any window of time (the global FDR) will be controlled at the same level (see, for example, Grigg and Spiegelhalter (2008)), the approach based on controlling the point-wise FDR has dominated the literature on monitoring schemes for high-dimensional data streams. In Section 2, we show that, contrary to this common perception, control of the point-wise FDR does not guarantee control of the global FDR.

In addition to the lack of global FDR control, the above point-wise FDR monitoring schemes do not directly tell users what the expected in-control (IC) average run length (ARL) is. The IC-ARL is the expected number of observations collected before any false alarm occurs in any data stream when the system is IC. It is widely used in the literature to measure how often to expect a false alarm when the system is IC. In many applications, the system is expected to be IC most of the time, and users do not want a monitoring scheme to raise false alarms too often. Therefore, the IC-ARL is also a reasonable global measure to consider when designing monitoring schemes for high-dimensional data streams. Different from the other methods mentioned above, which simply apply the FDR controlling procedure at each time point using the point-wise FDR pre-specified by users, Li and Tsung (2009, 2012) designed their monitoring scheme to have the desired IC-ARL. More specifically, they determined the point-wise FDR used in the FDR controlling procedure at each time point through Monte-Carlo simulation so that the resulting monitoring scheme has the desired IC-ARL. Although their procedure has control of a global measure, the IC-ARL, the point-wise FDR used in their FDR controlling procedure is completely determined by the specified IC-ARL, and a larger IC-ARL implies a smaller point-wise FDR. However, the point-wise FDR reflects how much users can tolerate false alarms (Type-I errors) when identifying OC data streams. The smaller the point-wise FDR, the less powerful the monitoring scheme is for detecting OC data streams. Therefore, the monitoring scheme in Li and Tsung (2009, 2012) cannot have both the IC-ARL and the Type-I errors controlled at the user's desired levels. One has to be compromised.

From the above discussion, we can see that an attractive monitoring scheme for high-dimensional data streams should achieve global control of a certain false-alarm rate and at the same time allow users to choose the level of Type-I errors they can tolerate when identifying OC data streams. As we show later, the existing monitoring procedures mentioned above were developed using a single-stage strategy. The nature of this single-stage strategy makes it difficult to modify those procedures so that they have the above two desired properties. In this paper, different from the single-stage strategy used in those existing monitoring schemes, we propose a general two-stage strategy to design high-dimensional data monitoring schemes. As shown in Section 3, the high-dimensional data monitoring problem can be formulated so that we essentially need to answer two questions at each time point: the first question is whether there are any OC data streams, and if the answer is "yes", the second question is where the OC data streams are. Based on this formulation, we propose the following two-stage procedure. In the first stage, at each time point a global test is conducted based on information gathered across all the data streams to decide if there are any OC data streams. This answers the "IF" question. The decision rule for this global test can naturally be used to satisfy the global IC-ARL requirement. If the first stage determines that there exists at least one OC data stream, we move to the second stage and carry out local tests to identify which data streams are OC. This answers the "WHERE" question. The decision rule for those local tests is determined to control a certain Type-I error rate. As a result, our proposed two-stage procedure allows users to choose not only how often they expect a false alarm when the system is IC (the IC-ARL requirement) but also how many false alarms they can tolerate when identifying OC data streams (the Type-I error rate requirement). Due to this extra flexibility, our two-stage procedure can significantly outperform the method proposed in Li and Tsung (2009, 2012), as shown in our simulation studies.

The rest of the paper is organized as follows. In Section 2, we provide some useful insight into the existing point-wise FDR monitoring schemes and explain why those monitoring schemes lack control of the global FDR. In Section 3, we introduce our general two-stage procedure for monitoring high-dimensional data streams. In Section 4, we demonstrate the use of this general approach in developing a new monitoring scheme for detecting Gaussian data streams with location shifts. A simulation study is reported in Section 5 to compare the performance of this new monitoring scheme with an existing method. In Section 6, we demonstrate the application of our proposed two-stage procedure using a real data set from a manufacturing process. Finally, we provide some concluding remarks in Section 7.

2 Results on existing high-dimensional data monitoring procedures

The setup for our high-dimensional data monitoring problem is the following. There are $m$ data streams in the system, where $m$ can be in the hundreds or thousands. We denote the observation from the $j$-th data stream at time $t$ by $X_{j,t}$, $j=1,\ldots,m$, $t=1,2,\ldots$. Since a time series model can be used to remove the temporal correlation between the observations within each data stream and a spatial model can be used to remove the spatial correlation between the data streams before applying monitoring schemes, without loss of generality we assume that the $X_{j,t}$ are independent both within each data stream and between different data streams. When the system is in-control, the underlying distribution of $X_{j,t}$ is called the in-control distribution, denoted by $F_j^{(0)}$, $j=1,\ldots,m$.

Following the above setup, at any time $t$ we observe $X_{j,t}$, $j=1,\ldots,m$, and the task of our online monitoring scheme at time $t$ is to determine whether the distribution of $X_{j,t}$ is the same as $F_j^{(0)}$ for all $j$. When some of the distributions of $X_{j,t}$ have changed from $F_j^{(0)}$, we not only want to trigger an alarm as early as possible, but also want to correctly identify those data streams that have changed from $F_j^{(0)}$. This is equivalent to conducting the following multiple hypothesis testing: for $j=1,\ldots,m$,

$$H_{0,j}: X_{j,1}, X_{j,2}, \ldots \overset{iid}{\sim} F_j^{(0)}$$

versus

$$H_{1,j}: X_{j,1}, \ldots, X_{j,\tau_j} \overset{iid}{\sim} F_j^{(0)} \quad \text{and} \quad X_{j,\tau_j+1}, X_{j,\tau_j+2}, \ldots \overset{iid}{\sim} F_j^{(1)}, \qquad (2.1)$$

where $F_j^{(1)} \neq F_j^{(0)}$ is the OC distribution and $\tau_j$ is the change-point for the $j$-th data stream.

Most of the existing methods use the above formulation to design their monitoring schemes, which we refer to as "the single-stage strategy". Denote the p-value based on some appropriate test statistic for the above hypothesis testing problem by $p_{j,t}$. Then most of the existing monitoring schemes are constructed by applying some existing FDR controlling procedure, such as the step-up procedures proposed by Benjamini and Hochberg (1995) and Benjamini and Yekutieli (2001), to those $p_{j,t}$ at each time point with some FDR level specified by users. The scheme sets off an alarm if the FDR controlling procedure produces at least one rejection, and the data streams corresponding to the rejected null hypotheses are then identified as the OC data streams. In the following, we show that control of the point-wise FDR in those methods does not guarantee control of the global FDR.
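Before doing so, the single-stage strategy can be made concrete with a short sketch: at each time point the Benjamini-Hochberg step-up rule is applied to the current p-values, and an alarm is raised if anything is rejected. The function names and the vectorized details below are ours for illustration; the cited papers specify the statistical rule, not this code.

```python
import numpy as np

def bh_reject(pvalues, fdr_level):
    """Benjamini-Hochberg step-up rule: boolean mask of rejected null hypotheses."""
    m = len(pvalues)
    order = np.argsort(pvalues)
    sorted_p = pvalues[order]
    # largest k such that p_(k) <= k * fdr_level / m
    below = sorted_p <= np.arange(1, m + 1) * fdr_level / m
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0]) + 1
        rejected[order[:k]] = True
    return rejected

def single_stage_step(pvalues_t, fdr_level=0.05):
    """One time point of a point-wise FDR monitoring scheme:
    alarm if at least one stream is rejected; flagged streams are declared OC."""
    flagged = bh_reject(np.asarray(pvalues_t), fdr_level)
    return flagged.any(), np.nonzero(flagged)[0]
```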

                                             Number of data streams   Number not rejected   Number rejected
IC data stream (true null hypothesis)        $m_{0,t}$                $m_{0,t} - V_t$       $V_t$
OC data stream (non-true null hypothesis)    $m_{1,t}$                $m_{1,t} - S_t$       $S_t$

Table 1: The outcomes of the multiple hypothesis testing problem at time $t$.

To begin with, we introduce some notation. Denote the unknown numbers of IC and OC data streams at time $t$ by $m_{0,t}$ and $m_{1,t}$, respectively. In other words, $m_{0,t}$ and $m_{1,t}$ are the unknown numbers of true and false null hypotheses for the hypothesis testing problem in (2.1). Let $R_t$ be the number of rejected hypotheses based on some rejection rule. Among those $R_t$ rejected hypotheses, an unknown number $V_t$ of them are falsely rejected and an unknown number $S_t$ of them are correctly rejected. The $V_t$ false rejections are false alarms, or Type-I errors. The situation discussed above is summarized in Table 1. The point-wise FDR at time $t$ is then defined as the expected proportion of Type-I errors among the rejected hypotheses at time $t$, that is, $\mathrm{FDR}_t = E(V_t/R_t)$, where the ratio $V_t/R_t$ is taken to be 0 if $R_t = 0$. Alternatively, it can be written as

$$\mathrm{FDR}_t = E\Big(\frac{V_t}{R_t} \,\Big|\, R_t > 0\Big)\, P(R_t > 0).$$

Similarly, an appropriate way to define the global FDR up to time $t$ is

$$\mathrm{FDR}_{global}(t) = E\bigg(\frac{\sum_{s=1}^{t} V_s}{\sum_{s=1}^{t} R_s}\bigg),$$

where the ratio is taken to be 0 if $\sum_{s=1}^{t} R_s = 0$. When the system is IC, that is, all the null hypotheses are true with $m_{0,s} = m$ for $s = 1, \ldots, t$, the global FDR and the point-wise FDR reduce to

$$\mathrm{FDR}_{global}(t) = P\Big(\sum_{s=1}^{t} R_s > 0\Big) \quad \text{and} \quad \mathrm{FDR}_s = P(R_s > 0).$$

Since most of the point-wise FDR monitoring procedures control the point-wise FDR at some level $\gamma$, $P(R_s > 0)$ is controlled at the level $\gamma$ for every $s$. But this does not guarantee that $P(\sum_{s=1}^{t} R_s > 0)$ is also controlled at the level $\gamma$. It is easy to see that, if $t$ is large enough, we will eventually encounter a false alarm; in other words, $P(\sum_{s=1}^{t} R_s > 0)$ will approach 1. Therefore, the global FDR can approach 1 in this case. When the system is OC, as shown in our simulation studies reported in Section 5, the global FDR can also be much higher than the point-wise FDR.
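The IC case is easy to check numerically. The sketch below is ours: it uses one-sided z-test p-values for independent N(0,1) streams and the Benjamini-Hochberg any-rejection condition at each time point, with an arbitrary stream count and horizon; it estimates the probability of at least one false alarm within the horizon, which grows toward 1 even though the point-wise FDR stays at the nominal level.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
m, horizon, reps, gamma = 50, 200, 500, 0.05
thresholds = np.arange(1, m + 1) * gamma / m      # BH step-up thresholds
any_false_alarm = 0

for _ in range(reps):
    for t in range(horizon):
        x = rng.standard_normal(m)                # every stream in control
        pvals = np.sort(1.0 - norm.cdf(x))        # ordered one-sided p-values
        if np.any(pvals <= thresholds):           # BH makes at least one (false) rejection
            any_false_alarm += 1
            break

# The point-wise FDR is held at gamma = 0.05, yet the global chance of a false alarm
# within the horizon is close to 1.
print("P(any false alarm within horizon) ~", any_false_alarm / reps)
```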

To see this more formally, note that the global FDR aggregates false rejections over all time points up to $t$, whereas the point-wise FDR only constrains each time point separately, and the chance of making at least one false rejection accumulates as $t$ grows. This helps explain why control of the point-wise FDR does not guarantee control of the global FDR in those point-wise FDR monitoring procedures.

To develop a procedure with global FDR control, we notice that

$$\mathrm{FDR}_{global}(t) = E\bigg(\frac{\sum_{s=1}^{t} V_s}{\sum_{s=1}^{t} R_s}\bigg) \le \sum_{s=1}^{t} E\Big(\frac{V_s}{R_s}\Big) = \sum_{s=1}^{t} \mathrm{FDR}_s.$$

Therefore, if every point-wise FDR is at most $\gamma/t$, we have $\mathrm{FDR}_{global}(t) \le \gamma$. This implies that, if we can control the point-wise FDR at the level $\gamma/t$, the global FDR up to time $t$ is controlled at the level $\gamma$. However, in many applications the monitoring horizon $t$ is usually large, and then the level for the point-wise FDR, $\gamma/t$, can be extremely small, which makes the resulting procedure extremely conservative for detecting OC data streams. Therefore, although the above Bonferroni-type adjustment gives us a simple procedure with the desired global FDR control, the procedure might not be very attractive in practice. How to design a monitoring scheme with proper control of the global FDR remains an open problem.

Even if we were able to design a monitoring scheme with proper control of the global FDR, similar to the monitoring scheme in Li and Tsung (2009, 2012), we would not have the flexibility of choosing the point-wise FDR at the user's desired level. This is simply a consequence of using the single-stage strategy. As mentioned in the Introduction, it is quite challenging to develop a monitoring scheme based on the single-stage strategy that controls a certain global false-alarm rate and at the same time allows the users to choose the level of Type-I errors they can tolerate when identifying OC data streams. In the following, we propose a monitoring scheme based on a two-stage strategy that can easily achieve both desired properties.

3 Proposed two-stage monitoring procedure

As shown in the previous section, the online high-dimensional data monitoring problem can be formulated as the hypothesis testing problem (2.1) at each time point $t$. However, this single-stage strategy places many restrictions on the resulting procedures. To help alleviate those restrictions, we formulate the monitoring problem as a two-stage hypothesis testing problem. At the first stage, we conduct the following global test by pooling the information across all the data streams:

$$H_{0,t}: X_{j,1}, \ldots, X_{j,t} \overset{iid}{\sim} F_j^{(0)} \ \text{for all } j = 1, \ldots, m$$

versus

$$H_{1,t}: \text{for at least one } j, \ X_{j,1}, \ldots, X_{j,\tau_j} \overset{iid}{\sim} F_j^{(0)} \ \text{and} \ X_{j,\tau_j+1}, \ldots, X_{j,t} \overset{iid}{\sim} F_j^{(1)}. \qquad (3.1)$$

If $H_{0,t}$ is accepted, it indicates that there is no OC data stream and all data streams are IC at time $t$. A rejection of $H_{0,t}$ indicates that some of the data streams are OC, in which case we move to our second stage and conduct the multiple hypothesis testing in (2.1) to identify which data streams are OC, under the condition that $H_{0,t}$ is rejected in the first stage.

In the above two-stage formulation, we can develop a separate decision rule (control limit) for the hypothesis testing problem in each of the two stages, which allows us to design our monitoring scheme to satisfy both the IC-ARL and the Type-I error rate requirements. In the following, we describe our two-stage monitoring procedure in detail.

3.1 First stage: Controlling the IC-ARL

In the first stage, we test the hypotheses in (3.1) at each time $t$. Denote an appropriate test statistic for this hypothesis testing problem by $W_t$, which is usually a function of the observations from all data streams up to time $t$. Without loss of generality, we assume that $H_{0,t}$ is rejected if $W_t$ is too large. Then we can use $W_t$ as our global monitoring statistic in the first stage. If $W_t$ is smaller than some control limit, denoted by $h$, we declare that there is no OC data stream and all data streams are IC. If $W_t$ is larger than $h$, we signal an alarm, suggesting that some of the data streams are OC. Based on this formulation, the monitoring in the first stage is similar to the standard monitoring of one data stream, and we can choose the control limit $h$ such that the IC-ARL of the above monitoring scheme equals some pre-specified value. Therefore, by choosing the control limit $h$ in this stage, users are able to choose how often they want a false alarm to occur when the system is IC.

Although we can choose any test statistic for the hypotheses in (3.1) as our global monitoring statistic $W_t$, with the control limit $h$ chosen accordingly to satisfy the IC-ARL requirement, we need to choose $W_t$ carefully so that it is powerful for rejecting $H_{0,t}$ under different OC scenarios. How to develop an efficient global monitoring statistic has also been an active research area in SPC. In Section 4, we give some examples of such $W_t$.
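Whatever statistic is chosen, the first-stage limit is calibrated in the same way. Here is a minimal simulation-plus-bisection sketch, assuming (for illustration only) that the global statistic is the maximum of per-stream CUSUMs of independent N(0,1) streams; the stream count, reference value, and replication numbers are placeholders of ours rather than values from the paper.

```python
import numpy as np

def ic_run_length(h, m=50, k=0.5, max_t=50000, rng=None):
    """Run length until the global statistic W_t = max_j C_{j,t} first exceeds h,
    with every stream in control (N(0,1) observations)."""
    rng = rng or np.random.default_rng()
    c = np.zeros(m)
    for t in range(1, max_t + 1):
        c = np.maximum(0.0, c + rng.standard_normal(m) - k)
        if c.max() > h:
            return t
    return max_t                                   # truncated run length

def calibrate_h(target_ic_arl, reps=200, lo=1.0, hi=30.0, iters=12, seed=1):
    """Bisection on the control limit h so that the simulated IC-ARL hits the target."""
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        arl = np.mean([ic_run_length(mid, rng=rng) for _ in range(reps)])
        lo, hi = (mid, hi) if arl < target_ic_arl else (lo, mid)
    return 0.5 * (lo + hi)

# h = calibrate_h(target_ic_arl=500)   # a larger h gives a longer IC-ARL
```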

3.2 Second stage: Controlling the Type-I error rate

3.2.1 Decision rule

In the first stage we monitor the global monitoring statistic $W_t$. If $W_t \le h$, we declare that there is no OC data stream and continue to monitor at the next time point. If $W_t > h$, we signal an alarm, suggesting that some of the data streams are OC, and move to the second stage of our monitoring procedure, which is to identify which data streams are OC. As mentioned earlier, this task is equivalent to the multiple hypothesis testing problem in (2.1) under the condition that $W_t > h$. Denote the standardized test statistic we use for the hypothesis testing problem in (2.1) by $Y_{j,t}$ and assume that $H_{0,j}$ is rejected if $Y_{j,t}$ is too large. To decide which data stream is OC, a natural decision rule is the following: if $Y_{j,t}$ is larger than some control limit, say $c_h$, we reject $H_{0,j}$ and conclude that the $j$-th data stream is OC; otherwise, we do not reject $H_{0,j}$ and conclude that the $j$-th data stream is IC. Here the subscript "$h$" in the control limit $c_h$ indicates that it depends on the control limit $h$ we obtain from the first stage. Based on the above decision rule, our remaining task is to determine the threshold $c_h$. Since this is a multiple hypothesis testing problem, we need to determine $c_h$ to control a certain Type-I error rate. The Type-I error rate in our monitoring application simply reflects how many false alarms we can tolerate when identifying abnormal data streams. Therefore, by choosing the control limit $c_h$ in our second stage, users are able to choose how many false alarms they want to tolerate when identifying OC data streams.

3.2.2 Choice of Type-I error rate

When testing a single hypothesis, the Type-I error rate is simply the probability of a Type-I error. When testing the multiple hypotheses in (2.1), there is a Type-I error associated with each $H_{0,j}$, $j = 1, \ldots, m$. Therefore, there are many ways to define the overall Type-I error rate when testing those multiple hypotheses simultaneously. Besides the FDR, some of the standard Type-I error rates used in the multiple hypothesis testing literature are the following (see, for example, Dudoit et al. (2003)):

  • The per-comparison error rate (PCER) is defined as the expected number of Type-I errors divided by the number of hypotheses.

  • The family-wise error rate (FWER) is defined as the probability of at least one Type-I error.

Using the notation introduced in Table 1, at time $t$ these error rates can be written as $\mathrm{PCER}_t = E(V_t)/m$ and $\mathrm{FWER}_t = P(V_t \ge 1)$.

The FWER control is the most conservative of the above three Type-I error rates and is usually considered too conservative for the high-dimensional data monitoring purpose. The FDR control is less conservative than the FWER control, and it is a popular choice in many multiple hypothesis testing problems. However, similar to what we explained in Section 2, it is difficult to determine $c_h$ such that the resulting procedure has proper control of the global FDR. If we use the PCER as the choice of Type-I error rate, we can define the point-wise PCER and the global PCER as follows:

$$\mathrm{PCER}_t = \frac{E(V_t)}{m}, \qquad \mathrm{PCER}_{global}(t) = \frac{E\big(\sum_{s=1}^{t} V_s\big)}{mt} = \frac{1}{t}\sum_{s=1}^{t} \mathrm{PCER}_s.$$

Notice that the global PCER is simply the average of the point-wise PCERs, which implies that, if all the point-wise PCERs are controlled at the level $\alpha$, the global PCER is controlled at the level $\alpha$. Therefore, we can focus on the point-wise PCER when developing the decision rule. This is one of the advantages of using the PCER over the FDR. Furthermore, by definition, $m \times \mathrm{PCER}_t = E(V_t)$ is simply the expected number of Type-I errors (false alarms). As a result, it is straightforward for users to specify the desired level of $\alpha$ from the number of false alarms they can tolerate when identifying OC data streams. For these reasons, in the following we use the PCER as the choice of Type-I error rate in our second stage and describe how to determine $c_h$ to control the PCER.

3.2.3 Determining the control limit based on the PCER

Without loss of generality, we assume that the first $m_0$ data streams are IC. Then, for any given $c_h$, the expected number of false alarms in the second stage is

$$E_{m_0}(V_t) = \sum_{j=1}^{m_0} P_{m_0}\big(Y_{j,t} > c_h \mid W_t > h\big),$$

where the subscript "$m_0$" indicates that the probabilities are calculated under the setting where the first $m_0$ data streams are IC and the remaining $m - m_0$ are OC. Therefore, to control the PCER at the pre-specified level $\alpha$, we need to determine $c_h$ such that

$$\frac{1}{m}\sum_{j=1}^{m_0} P_{m_0}\big(Y_{j,t} > c_h \mid W_t > h\big) \le \alpha. \qquad (3.2)$$

The above conditional probabilities are very challenging to calculate. To circumvent this difficulty, we denote the marginal p-value of $Y_{j,t}$ by $p_{j,t}$ and the probability density function of $p_{j,t}$ when the $j$-th data stream is OC by $g_j(\cdot)$. As shown in the Appendix, if the $g_j(\cdot)$ are all non-increasing functions, we have, for $j = 1, \ldots, m_0$,

$$P_{m_0}\big(Y_{j,t} > c_h \mid W_t > h\big) \le P_{0}\big(Y_{j,t} > c_h \mid W_t > h\big),$$

where the subscript "0" indicates that the probabilities are calculated under the null hypothesis in (3.1); that is, $m_0 = m$ and all data streams are IC. Therefore,

$$\frac{1}{m}\sum_{j=1}^{m_0} P_{m_0}\big(Y_{j,t} > c_h \mid W_t > h\big) \le \frac{1}{m}\sum_{j=1}^{m} P_{0}\big(Y_{j,t} > c_h \mid W_t > h\big).$$

To find $c_h$ that satisfies (3.2), we only need to find $c_h$ such that

$$\frac{1}{m}\sum_{j=1}^{m} P_{0}\big(Y_{j,t} > c_h \mid W_t > h\big) = \alpha. \qquad (3.3)$$

Note that the above condition that the $g_j(\cdot)$ are all non-increasing functions is not restrictive and is satisfied for most of the commonly used charting statistics, including the CUSUM statistic we use in Section 4.

To make the above task of finding $c_h$ easier, we assume that both $W_t$ and the $Y_{j,t}$ ($j = 1, \ldots, m$) follow their respective null steady-state distributions when $H_{0,t}$ is true. Under this assumption, $P_{0}(Y_{j,t} > c_h \mid W_t > h)$ does not depend on $t$. Then, given the control limit $h$ we obtain from the first stage, the control limit $c_h$ of our second stage can be found as follows.

  1. For any given $c_h$, generate $m$ IC data streams and keep monitoring them using the global monitoring statistic $W_t$. When $W_t$ first becomes larger than the given control limit $h$, record the proportion of the $Y_{j,t}$, $j = 1, \ldots, m$, that are larger than the given value of $c_h$ and denote it by $\hat{q}(c_h)$.

  2. Repeat Step 1 a large number of times. Then the average of the $\hat{q}(c_h)$ over the replications can be used to approximate $\frac{1}{m}\sum_{j=1}^{m} P_{0}(Y_{j,t} > c_h \mid W_t > h)$.

  3. Using Steps 1 and 2, we can obtain an approximation of $\frac{1}{m}\sum_{j=1}^{m} P_{0}(Y_{j,t} > c_h \mid W_t > h)$ for any given $c_h$. We can then use some numerical method, for example the bisection method, to obtain the control limit $c_h$ of our second stage that satisfies (3.3); a sketch of this calculation is given below.
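The following sketch implements Steps 1-3 under the stated steady-state assumption. It is illustrative only: the global statistic is again taken to be the maximum of per-stream CUSUMs, the local statistic is the CUSUM itself, the steady state is approximated by a long burn-in, and all constants are placeholders rather than values used in the paper.

```python
import numpy as np

def conditional_exceedance(c_h, h, m=50, k=0.5, burn_in=1000, reps=300, seed=2):
    """Steps 1-2: average, over replications, of the proportion of local statistics
    above c_h at the first time the global statistic W_t = max_j C_{j,t} exceeds h,
    with all m streams in control."""
    rng = np.random.default_rng(seed)
    props = []
    for _ in range(reps):
        c = np.zeros(m)
        for _ in range(burn_in):                          # approximate steady-state start
            c = np.maximum(0.0, c + rng.standard_normal(m) - k)
        while c.max() <= h:                               # wait for a first-stage alarm
            c = np.maximum(0.0, c + rng.standard_normal(m) - k)
        props.append(np.mean(c > c_h))                    # Step 1: proportion above c_h
    return float(np.mean(props))                          # Step 2: average over reps

def calibrate_c_h(alpha, h, lo=0.0, hi=30.0, iters=12):
    """Step 3: bisection so that the approximated quantity in (3.3) equals alpha."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if conditional_exceedance(mid, h) > alpha else (lo, mid)
    return 0.5 * (lo + hi)

# c_h = calibrate_c_h(alpha=0.05, h=first_stage_limit)   # hypothetical usage
```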

In the following, we summarize our two-stage monitoring procedure.

  • We first select an appropriate global monitoring statistic $W_t$. The monitoring scheme in the first stage simply calculates $W_t$ at each time point and compares it with the pre-specified control limit $h$. If $W_t$ is larger than $h$, we signal an alarm indicating that there exists at least one OC data stream and proceed to the second stage. Otherwise, we continue the monitoring using $W_t$. The control limit $h$ is determined by Monte Carlo simulation and the bisection method such that the IC-ARL of the monitoring scheme based on $W_t$ is equal to the pre-specified IC-ARL.

  • If $W_t$ is larger than $h$ and an alarm is given in the first stage, we then construct the local monitoring statistic $Y_{j,t}$ for each data stream. If $Y_{j,t}$ is larger than the control limit $c_h$, the $j$-th data stream is declared OC; otherwise, it is declared IC. The control limit $c_h$ in this stage is determined using the three steps described above.

From the summary of our two-stage procedure, the control limit in our first stage is determined by the IC-ARL, which controls how often we expect a false alarm when the system is actually IC. The control limit in our second stage is determined by the PCER, which controls how many false alarms we can tolerate. Therefore, by using our two-stage monitoring procedure, the users can design their own monitoring scheme depending on how often they want to expect any false alarm when the system is IC and how many false alarms they can tolerate when identifying OC data streams.
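Schematically, one monitoring step of the resulting procedure can be organized as below. This is our own sketch rather than the paper's implementation: the limits h and c_h are assumed to have been pre-calibrated as described above, the per-stream statistics are passed in as an array, and the maximum is used as a stand-in for whatever global statistic is chosen.

```python
import numpy as np

def two_stage_step(local_stats, h, c_h, global_stat=np.max):
    """One time point of the two-stage procedure.
    local_stats: array of per-stream statistics Y_{j,t} (e.g., CUSUMs or 1 - p-values).
    Returns (alarm raised?, indices of streams flagged as OC)."""
    y = np.asarray(local_stats)
    w_t = global_stat(y)                    # stage 1: global test; h fixes the IC-ARL
    if w_t <= h:
        return False, np.array([], dtype=int)
    flagged = np.nonzero(y > c_h)[0]        # stage 2: local tests; c_h fixes the PCER
    return True, flagged

# alarm, oc_streams = two_stage_step(current_stats, h=10.0, c_h=5.0)  # hypothetical limits
```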

4 Two-stage monitoring procedure for Gaussian data streams with location shifts

In this section, we show how to develop a two-stage monitoring procedure for high-dimensional data streams in one particular setting, following the general approach described in Section 3. The setting we consider here is that the IC distribution $F_j^{(0)}$ is the Gaussian distribution with mean 0 and variance 1 (denoted by $N(0,1)$), and the target OC distribution $F_j^{(1)}$ is also a Gaussian distribution, with a shifted mean $\mu_1$ and variance 1 (denoted by $N(\mu_1, 1)$). Under the Gaussian distribution assumption, an optimal monitoring statistic for each data stream is the CUSUM statistic. For example, to detect an upward mean shift, the CUSUM statistic for the $j$-th data stream is defined as

$$C_{j,t} = \max\big(0,\ C_{j,t-1} + X_{j,t} - k\big), \qquad C_{j,0} = 0, \qquad (4.1)$$

where $k$ is the reference value, usually specified before the monitoring based on the target OC mean $\mu_1$.
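As a quick illustration of (4.1), the sketch below computes the upward CUSUM path for a single stream; the choice of reference value in the comment (half of the target shift) is the common convention, stated here as an assumption rather than a prescription from the paper.

```python
import numpy as np

def cusum_path(x, k):
    """Upward CUSUM of (4.1): C_t = max(0, C_{t-1} + x_t - k), starting from C_0 = 0."""
    c = np.empty(len(x))
    prev = 0.0
    for t, xt in enumerate(np.asarray(x, dtype=float)):
        prev = max(0.0, prev + xt - k)
        c[t] = prev
    return c

# For a target OC mean of 1.0, a common reference value is k = 0.5:
# path = cusum_path(np.random.default_rng(0).standard_normal(200), k=0.5)
```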

To implement our two-stage monitoring procedure, we first need to find an appropriate global monitoring statistic $W_t$ that has good power to detect different types of OC scenarios. Tartakovsky et al. (2006) proposed using the maximum of the CUSUM statistics, $\max_{1 \le j \le m} C_{j,t}$, as the global monitoring statistic. This approach enjoys some optimality when exactly 1 out of the $m$ data streams is OC. However, it performs poorly if many data streams are OC. To overcome this limitation, Mei (2010) proposed using the sum of the CUSUM statistics, $\sum_{j=1}^{m} C_{j,t}$, to monitor the data streams simultaneously. It has been shown that the sum-type statistic is more effective than the max-type statistic when a moderate or large number of data streams are OC. However, it is less powerful than the max-type statistic when only a few data streams are OC. In practice, it is usually unknown in advance how many data streams will be OC. Therefore, neither the max-type nor the sum-type statistic can guarantee robust performance as the global monitoring statistic. Xie and Siegmund (2013) recognized those limitations and further proposed a monitoring statistic derived from the likelihood function of a normal-mixture model. Besides being extremely computationally intensive, their approach also needs to pre-specify the mixing percentage. The approach performs best if the pre-specified mixing percentage correctly reflects the percentage of OC data streams; if the mixing percentage is misspecified, their approach sacrifices some power. Recently, Zou et al. (2014) proposed an alternative statistic that combines the CUSUM statistics from all the data streams into a single global monitoring statistic. This approach does not require prior knowledge of how many data streams are OC and performs well compared with the aforementioned methods across different OC scenarios. Therefore, in the following we use the global monitoring statistic proposed by Zou et al. (2014) as an example to demonstrate our two-stage procedure.

Denote the marginal p-value of the CUSUM statistic $C_{j,t}$ defined in (4.1) by $p_{j,t}$, $j = 1, \ldots, m$, and the corresponding order statistics by $p_{(1),t} \le \cdots \le p_{(m),t}$. The global monitoring statistic $W_t$ proposed in Zou et al. (2014) is then defined as a function of these ordered p-values,

(4.2)

where $I(\cdot)$ is the indicator function, which takes the value 1 if its argument is true and 0 otherwise.

To use the above $W_t$ in the monitoring scheme, it is important that the p-values $p_{j,t}$ can be calculated quickly. Grigg and Spiegelhalter (2008) developed an empirical approximation to the null steady-state distribution of the CUSUM statistic defined in (4.1) and further provided a closed-form formula to approximate the steady-state p-value. Since this formula only works once the CUSUM statistic has reached its steady state, to make use of it we modify the definition of the CUSUM statistic slightly. Instead of starting the CUSUM statistic at 0, i.e., $C_{j,0} = 0$ as in (4.1), we start it at a value randomly drawn from the null steady-state distribution of $C_{j,t}$. To draw a random sample from this null steady-state distribution, we generate a number of independent, sufficiently long sequences of IC observations from $N(0,1)$, run the recursion in (4.1) on each sequence, and collect the terminal CUSUM values; these values serve as a random sample from the null steady-state distribution of $C_{j,t}$. Our modified CUSUM statistic is then defined as follows. For $t \ge 1$,

$$\tilde{C}_{j,t} = \max\big(0,\ \tilde{C}_{j,t-1} + X_{j,t} - k\big), \qquad (4.3)$$

where the starting value $\tilde{C}_{j,0}$ is randomly drawn with replacement from the sample described above. Since the above modified CUSUM statistic starts from the steady state, its null distribution at any time point follows the null steady-state distribution. As a result, we can utilize the closed-form formula provided in Grigg and Spiegelhalter (2008) to calculate the p-values of the $\tilde{C}_{j,t}$ quickly. For simplicity, we abuse notation and also denote those p-values by $p_{j,t}$. Based on the above modified CUSUM statistics and their corresponding p-values $p_{j,t}$, we calculate $W_t$ in (4.2). Our monitoring scheme in the first stage then monitors the $W_t$ computed from our modified CUSUM statistics and signals an alarm if $W_t > h$, where the control limit $h$ can be determined through Monte-Carlo simulation to satisfy the IC-ARL requirement.
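A minimal sketch of how the steady-state starting values in (4.3) can be generated is given below; the burn-in length and sample size are arbitrary choices of ours, and in practice they would be taken large enough for the CUSUM to have effectively reached its steady state.

```python
import numpy as np

def steady_state_sample(k, n_draws=5000, burn_in=2000, seed=3):
    """Approximate the null steady-state distribution of the CUSUM in (4.1) by running
    n_draws independent IC (N(0,1)) sequences for burn_in steps and keeping the end values."""
    rng = np.random.default_rng(seed)
    c = np.zeros(n_draws)
    for _ in range(burn_in):
        c = np.maximum(0.0, c + rng.standard_normal(n_draws) - k)
    return c

def modified_cusum_starts(sample, m, seed=4):
    """Draw starting values C~_{j,0}, j = 1, ..., m, with replacement, as in (4.3)."""
    rng = np.random.default_rng(seed)
    return rng.choice(sample, size=m, replace=True)

# starts = modified_cusum_starts(steady_state_sample(k=0.5), m=100)
```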

If we signal an alarm in our first stage, i.e., $W_t > h$, we move to our second stage to identify which data streams are OC. In this stage, we first need to identify an appropriate test statistic $Y_{j,t}$ for the hypothesis testing problem (2.1) for each data stream. Due to the optimality of the CUSUM statistic, we can use the above modified CUSUM statistic $\tilde{C}_{j,t}$ as our $Y_{j,t}$. Alternatively, we can also use $1 - p_{j,t}$, where $p_{j,t}$ is again the p-value of $\tilde{C}_{j,t}$, as our $Y_{j,t}$. Since $1 - p_{j,t}$ is simply a non-decreasing function of $\tilde{C}_{j,t}$, these two test statistics are equivalent. Given that the $p_{j,t}$ have already been calculated in our first stage, we use $Y_{j,t} = 1 - p_{j,t}$ in this example. Based on this choice of $Y_{j,t}$, we next use Monte-Carlo simulation and the bisection method as described in Section 3 to determine the control limit $c_h$ of our second stage to satisfy the PCER requirement, that is, finding $c_h$ such that

$$\frac{1}{m}\sum_{j=1}^{m} P_{0}\big(1 - p_{j,t} > c_h \mid W_t > h\big) = \alpha,$$

where $\alpha$ is the desired PCER. Recall that, in order to easily use Monte Carlo simulation and the bisection method to determine the above control limit $c_h$, we need to assume that both $W_t$ and the $Y_{j,t}$ ($j = 1, \ldots, m$) follow their respective null steady-state distributions when $H_{0,t}$ is true. Therefore, our modification of the CUSUM statistics described earlier is also necessary for this assumption to hold. Once we obtain the control limit $c_h$ using Monte Carlo simulation and the bisection method, we identify the $j$-th data stream as an OC data stream if $1 - p_{j,t} > c_h$, and as an IC data stream otherwise.

5 Simulation Study

In this section, we refer to our two-stage procedure described in Section 4 as the two-stage CUSUM procedure. In the following, we present a simulation study to compare the performance of this two-stage CUSUM procedure with the one using the methodology proposed in Li and Tsung (2009, 2012).

To apply the methodology proposed in Li and Tsung (2009, 2012) in the setting considered in Section 4, at each time point $t$ we monitor each data stream using the CUSUM statistic and calculate its p-value. To use the closed-form formula provided in Grigg and Spiegelhalter (2008) to calculate those p-values quickly, we also use the modified CUSUM statistic defined in (4.3), instead of the original CUSUM statistic in (4.1), to monitor each data stream. The p-values of the $\tilde{C}_{j,t}$ are again denoted by $p_{j,t}$, and their corresponding order statistics by $p_{(1),t} \le \cdots \le p_{(m),t}$. Let

$$k_t = \max\Big\{j:\ p_{(j),t} \le \frac{j\,\gamma}{m}\Big\},$$

where $\gamma$ is the point-wise FDR and is determined by Monte Carlo simulation to satisfy the desired IC-ARL. If no such $j$ exists, all the data streams are declared IC at time $t$. Otherwise, all the data streams associated with $p_{(1),t}, \ldots, p_{(k_t),t}$ are classified as OC data streams and the rest as IC data streams. Hereafter, we refer to this procedure as the LT procedure.

Since both our two-stage CUSUM procedure and the LT procedure can be designed to have the desired IC-ARL, we compare the two procedures based on how quickly they can correctly identify all the OC data streams. If there were only one data stream, this could easily be measured by the OC-ARL, which is the average number of observations collected between the time the system becomes OC and the time the scheme triggers an alarm. However, when there are a large number of data streams, even if only one of them is OC, the scheme is not guaranteed to identify the right OC data stream the first time it triggers an alarm. When there are multiple OC data streams, it is also quite common that the scheme cannot identify all the OC data streams at once. Therefore, the OC-ARL may not reflect well the performance when monitoring high-dimensional data streams.

To obtain an appropriate criterion for evaluating the performance, we depart from the simulations used to obtain the OC-ARL, where the monitoring is stopped the first time the scheme triggers an alarm. Instead, in our simulations we assume that after an OC data stream is identified by the scheme, the monitoring of that data stream continues: the observations collected afterwards from that data stream are drawn from its IC distribution, and its CUSUM statistic is restarted at a value randomly drawn with replacement from the steady-state sample described in the previous section. When all the OC data streams have been identified, the monitoring is stopped, and we record the average time to detect all the OC data streams, which we denote by ATDOC. Compared with the OC-ARL, the ATDOC better measures how effective the monitoring scheme is at detecting all the OC data streams. Therefore, in the following we use the ATDOC criterion to compare the performance of the two monitoring schemes.
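The bookkeeping behind the ATDOC can be sketched as follows. This is our own illustrative simulation, not the paper's code: the two-stage rule again uses the max-CUSUM stand-in with the raw CUSUM as the local statistic, and only the restart-and-stop logic described above is the point.

```python
import numpy as np

def simulate_atdoc(mu, h, c_h, steady_sample, k=0.5, max_t=100000, seed=5):
    """Time needed to flag every OC stream. mu: per-stream means (0 for IC streams).
    Once an OC stream is flagged, its later observations are drawn as IC and its
    CUSUM restarts from a value drawn from the steady-state sample."""
    rng = np.random.default_rng(seed)
    means = np.asarray(mu, dtype=float).copy()
    oc_left = set(np.flatnonzero(means != 0.0))
    c = rng.choice(steady_sample, size=len(means), replace=True)   # steady-state starts
    for t in range(1, max_t + 1):
        c = np.maximum(0.0, c + rng.normal(means) - k)
        if c.max() > h:                                   # stage 1 alarm
            for j in np.flatnonzero(c > c_h):             # stage 2 identification
                if j in oc_left:
                    oc_left.discard(j)
                    means[j] = 0.0                        # stream becomes IC afterwards
                    c[j] = rng.choice(steady_sample)      # restart its CUSUM
        if not oc_left:
            return t                                      # all OC streams detected
    return max_t
```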

The simulation settings we consider here are similar to those used in Zou et al. (2014). The general settings are the following. Among the $m$ data streams, $m - m_1$ data streams are from the IC distribution $N(0,1)$, and the remaining $m_1$ data streams are from OC distributions $N(\mu_j, 1)$ with $\mu_j > 0$. Two scenarios for the OC means are considered: (I) equal allocation, in which all OC data streams have the same mean shift; and (II) increasing allocation, in which the mean shifts increase across the OC data streams. In both scenarios the shifts are expressed in terms of the target OC mean value specified in the CUSUM statistic, which is held fixed throughout the simulations.

In the first simulation study, we fix the number of data streams $m$ at a moderate value. The desired IC-ARL is set at 200, 500, 1000 or 10000. Besides the IC-ARL, we also need to specify the desired PCER for our two-stage CUSUM procedure; three levels are considered: PCER = 0.01, 0.03 or 0.05. The control limits $h$ (first stage) and $c_h$ (second stage) for our two-stage CUSUM procedure, along with the point-wise FDR for the LT procedure, are obtained through Monte-Carlo simulation. The results are presented in Table 2.

IC-ARL   Point-wise FDR (LT)   h        PCER   c_h
200      .04865                16.546   .01    .99577
                                        .03    .98428
                                        .05    .97039
500      .02087                22.900   .01    .99716
                                        .03    .98710
                                        .05    .97472
1000     .01034                28.570   .01    .99802
                                        .03    .98863
                                        .05    .97746
10000    .00108                49.119   .01    .99947
                                        .03    .99225
                                        .05    .98097
Table 2: The control limits used in our proposed two-stage CUSUM and LT procedures in the first simulation study.

Using those control limits, we apply our two-stage CUSUM procedure and the LT procedure to different OC settings. Tables 3 and 4 report the average and standard deviation of the ATDOC of the two procedures from 1000 simulations for those settings. As we can see, in all the settings considered here, higher PCERs lead to smaller ATDOCs for our two-stage CUSUM procedure. So if we want to detect the OC data streams faster, we can increase the PCER we are willing to tolerate. In contrast, the LT procedure does not have this kind of flexibility. Comparing the two procedures, our two-stage CUSUM procedure has a smaller ATDOC than the LT procedure in all the settings except the case of a single OC data stream ($m_1 = 1$), and the advantage of our two-stage CUSUM procedure over the LT procedure becomes more pronounced as the number of OC data streams increases. When PCER = 0.05, our two-stage CUSUM procedure can cut the ATDOC of the LT procedure by half in many cases. In the case of $m_1 = 1$, since there is only one OC data stream, the ATDOC is roughly equal to the OC-ARL. Based on our simulation, the LT procedure has very similar OC-ARL performance to the scheme using the max-type global monitoring statistic $\max_j C_{j,t}$. As mentioned earlier, the monitoring scheme using the max-type statistic is optimal when there is exactly one OC data stream. This explains why the LT procedure performs better than the two-stage CUSUM procedure when $m_1 = 1$.

Equal allocation, IC-ARL = 200
m1   LT            PCER=.01      PCER=.03      PCER=.05
1 50.4 (25.7) 57.2 (27.6) 55.4 (27.3) 54.0 (27.6)
3 49.7 (15.1) 49.5 (16.8) 44.4 (16.0) 41.9 (15.5)
5 49.4 (11.8) 47.0 (12.4) 41.4 (11.9) 38.8 (11.8)
8 48.0 (9.2) 43.5 (9.2) 37.5 (8.6) 34.7 (8.4)
10 48.0 (8.3) 42.2 (8.2) 36.3 (7.6) 33.2 (7.3)
20 45.2 (5.9) 38.2 (5.3) 31.5 (5.00) 28.6 (4.8)
50 41.1 (3.7) 35.9 (3.1) 27.1 (2.7) 23.5 (2.6)
80 38.1 (2.9) 35.2 (2.4) 25.8 (2.0) 21.7 (1.9)
100 36.6 (2.6) 35.1 (2.1) 25.5 (1.8) 21.2 (1.7)

Equal allocation, IC-ARL = 500
m1   LT            PCER=.01      PCER=.03      PCER=.05
1 58.4 (27.7) 67.8 (30.7) 66.5 (31.2) 65.5 (31.5)
3 58.0 (17.4) 56.0 (18.6) 50.5 (18.2) 47.2 (17.9)
5 56.7 (12.7) 51.5 (13.1) 45.6 (12.6) 42.4 (12.1)
8 55.6 (10.7) 47.8 (10.6) 41.0 (10.0) 37.8 (9.4)
10 55.1 (9.1) 46.1 (8.7) 39.2 (8.0) 35.9 (7.7)
20 52.5 (6.7) 41.6 (5.8) 34.0 (5.4) 31.0 (5.3)
50 48.6 (4.2) 39.3 (3.3) 29.1 (2.9) 25.2 (2.8)
80 45.7 (3.4) 38.5 (2.6) 27.5 (2.1) 23.1 (2.0)
100 44.3 (2.8) 38.3 (2.3) 27.2 (1.9) 22.6 (1.7)

Increasing allocation, IC-ARL = 200
m1   LT            PCER=.01      PCER=.03      PCER=.05
1 101.4 (69.1) 116.1 (76.7) 106.9 (70.2) 103.5 (68.8)
3 72.1 (28.1) 73.2 (30.0) 64.8 (28.7) 59.6 (27.8)
5 59.4 (19.1) 57.8 (20.1) 50.4 (18.9) 46.3 (17.8)
8 48.5 (11.7) 44.6 (11.7) 38.6 (11.1) 35.7 (11.0)
10 44.2 (9.9) 40.1 (10.3) 34.2 (9.5) 31.0 (8.8)
20 31.6 (5.0) 28.1 (5.2) 23.1 (4.6) 21.1 (4.6)
50 20.6 (2.4) 19.2 (2.2) 14.8 (1.9) 12.8 (1.9)
80 16.1 (1.4) 16.2 (1.4) 12.3 (1.3) 10.4 (1.2)
100 14.4 (1.2) 15.1 (1.1) 11.4 (1.0) 9.6 (1.0)

Increasing allocation, IC-ARL = 500
m1   LT            PCER=.01      PCER=.03      PCER=.05
1 121.1 (81.6) 138.0 (91.1) 131.8 (86.7) 129.0 (85.2)
3 83.3 (29.8) 83.0 (33.1) 73.1 (30.6) 68.3 (29.5)
5 68.7 (20.7) 63.9 (22.2) 56.0 (21.2) 51.2 (19.5)
8 55.8 (13.3) 49.6 (14.4) 42.0 (13.3) 37.9 (11.8)
10 50.8 (10.8) 44.1 (11.3) 37.4 (10.5) 34.0 (10.1)
20 36.8 (5.4) 30.9 (5.9) 25.2 (5.5) 22.8 (5.2)
50 24.2 (2.5) 21.0 (2.4) 15.9 (2.2) 13.8 (2.2)
80 19.3 (1.6) 17.6 (1.5) 13.1 (1.4) 11.1 (1.3)
100 17.3 (1.3) 16.4 (1.3) 12.1 (1.2) 10.2 (1.1)
Table 3: The ATDOC comparison between our proposed two-stage CUSUM procedure and the LT procedure for IC-ARL = 200 and 500; m1 denotes the number of OC data streams. Standard deviations of the ATDOC are in parentheses.

Equal allocation, IC-ARL = 1000
m1   LT            PCER=.01      PCER=.03      PCER=.05
1 64.1 (30.0) 73.5 (32.2) 72.8 (32.2) 72.2 (32.0)
3 63.4 (17.2) 60.3 (18.6) 52.7 (17.9) 50.1 (17.4)
5 63.0 (13.4) 55.7 (14.1) 48.1 (13.2) 45.4 (13.2)
8 61.9 (11.3) 51.7 (11.2) 43.8 (10.5) 40.5 (10.1)
10 61.0 (9.5) 49.6 (9.2) 41.7 (8.6) 38.3 (8.1)
20 59.1 (6.7) 45.2 (5.9) 36.4 (5.5) 33.1 (5.2)
50 54.5 (4.2) 42.0 (3.3) 30.2 (2.8) 26.4 (2.8)
80 52.0 (3.5) 41.6 (2.8) 28.8 (2.3) 24.3 (2.2)
100 50.4 (3.0) 41.2 (2.4) 28.3 (1.9) 23.6 (1.8)

Equal allocation, IC-ARL = 10000
m1   LT            PCER=.01      PCER=.03      PCER=.05
1 83.0 (36.9) 89.9 (39.0) 89.5 (39.2) 89.4 (39.1)
3 80.8 (19.6) 72.1 (20.0) 61.7 (20.5) 58.1 (19.4)
5 80.7 (15.8) 66.9 (15.8) 55.5 (14.5) 51.4 (14.4)
8 79.9 (12.1) 62.1 (11.2) 50.0 (10.5) 46.0 (10.1)
10 79.0 (10.7) 59.2 (9.8) 47.1 (9.1) 42.7 (8.8)
20 77.6 (8.1) 55.0 (6.8) 41.1 (6.3) 36.8 (5.9)
50 73.3 (4.9) 52.4 (4.0) 34.1 (3.3) 29.3 (3.1)
80 71.2 (4.0) 52.0 (3.0) 32.4 (2.4) 26.5 (2.2)
100 69.9 (3.4) 51.8 (2.7) 31.8 (2.1) 25.7 (1.9)

Increasing allocation, IC-ARL = 1000
m1   LT            PCER=.01      PCER=.03      PCER=.05
1 138.9 (95.8) 155.7 (102.7) 152.2 (100.3) 150.2 (100.0)
3 93.4 (34.9) 91.3 (38.9) 81.3 (38.3) 75.8 (37.1)
5 76.5 (22.1) 70.4 (23.7) 59.9 (22.1) 55.6 (21.2)
8 61.7 (14.1) 53.6 (14.8) 45.0 (14.0) 41.4 (13.6)
10 55.8 (11.4) 47.3 (12.3) 39.5 (11.3) 36.1 (10.8)
20 41.3 (6.3) 33.5 (6.5) 26.9 (5.9) 24.2 (5.5)
50 27.2 (2.6) 22.5 (2.5) 16.7 (2.4) 14.4 (2.3)
80 22.0 (1.8) 19.0 (1.7) 13.7 (1.6) 11.7 (1.5)
100 19.8 (1.4) 17.6 (1.3) 12.6 (1.2) 10.7 (1.2)

Increasing allocation, IC-ARL = 10000
m1   LT            PCER=.01      PCER=.03      PCER=.05
1 187.3 (120.6) 199.3 (123.1) 196.6 (121.2) 196.4 (121.3)
3 121.6 (40.8) 112.2 (45.0) 94.8 (43.9) 87.3 (40.6)
5 98.8 (26.4) 85.4 (27.7) 71.2 (26.2) 65.5 (25.1)
8 81.6 (17.1) 67.1 (17.7) 53.6 (16.1) 48.4 (15.8)
10 74.0 (14.4) 59.0 (15.1) 47.2 (15.0) 42.2 (14.3)
20 54.3 (7.6) 41.2 (7.5) 30.8 (7.2) 27.5 (6.8)
50 36.8 (3.0) 28.2 (3.0) 19.1 (2.8) 16.2 (2.7)
80 30.1 (1.9) 23.5 (1.9) 15.5 (1.8) 12.8 (1.6)
100 27.4 (1.6) 21.8 (1.5) 14.1 (1.4) 11.6 (1.3)

Table 4: The ATDOC comparison between our proposed two-stage CUSUM procedure and the LT procedure for IC-ARL = 1000 and 10000; m1 denotes the number of OC data streams. Standard deviations of the ATDOC are in parentheses.

We also report in Tables 5 and 6 the simulated global FDR of the LT procedure and the simulated global PCER of our two-stage CUSUM procedure in the settings considered above. As we can see from the tables, the global FDR of the LT procedure can be much higher than its point-wise FDR. This shows that control of the point-wise FDR does not guarantee control of the global FDR. In contrast, our two-stage CUSUM procedure maintains control of the global PCER in all the settings.

IC-ARL = 200, equal allocation
m1   LT global FDR   Nominal PCER=.01   Nominal PCER=.03   Nominal PCER=.05
1 .15868 .00553 .01889 .03427
3 .11974 .00440 .01645 .03128
5 .09631 .00364 .01495 .02911
8 .08320 .00275 .01340 .02607
10 .06626 .00234 .01258 .02585
20 .04953 .00113 .00865 .02014
50 .03012 .00062 .00331 .00882
80 .02231 .00052 .00251 .00589
100 .01772 .00046 .00221 .00500

IC-ARL = 200, increasing allocation
m1   LT global FDR   Nominal PCER=.01   Nominal PCER=.03   Nominal PCER=.05
1 .24242 .00609 .02053 .03754
3 .15774 .00464 .01703 .03190
5 .11862 .00395 .01585 .03040
8 .08587 .00296 .01352 .02755
10 .07876 .00248 .01256 .02545
20 .04791 .00122 .00784 .01882
50 .02875 .00090 .00404 .00915
80 .02166 .00083 .00361 .00769
100 .01608 .00076 .00334 .00709

IC-ARL = 500, equal allocation
m1   LT global FDR   Nominal PCER=.01   Nominal PCER=.03   Nominal PCER=.05
1 .08733 .00391 .01534 .02887
3 .05750 .00308 .01404 .02706
5 .04357 .00260 .01273 .02539
8 .03581 .00205 .01173 .02351
10 .02983 .00161 .01063 .02271
20 .02168 .00071 .00731 .01800
50 .01356 .00042 .00273 .00769
80 .00963 .00034 .00200 .00488
100 .00746 .00030 .00174 .00416

IC-ARL = 500, increasing allocation
m1   LT global FDR   Nominal PCER=.01   Nominal PCER=.03   Nominal PCER=.05
1 .12683 .00430 .01784 .03204
3 .08237 .00343 .01449 .02805
5 .06009 .00303 .01409 .02656
8 .04000 .00219 .01181 .02403
10 .03803 .00187 .01135 .02300
20 .02527 .00086 .00727 .01742
50 .01252 .00058 .00326 .00773
80 .00915 .00055 .00287 .00642
100 .00758 .00053 .00270 .00585
Table 5: Global FDR of the LT procedure and global PCER of our two-stage CUSUM procedure for IC-ARL = 200 and 500; m1 denotes the number of OC data streams.

IC-ARL = 1000, equal allocation
m1   LT global FDR   Nominal PCER=.01   Nominal PCER=.03   Nominal PCER=.05
1 .04550 .00270 .01335 .02480
3 .02590 .00219 .01226 .02383
5 .02261 .00194 .01160 .02290
8 .01920 .00150 .01060 .02148
10 .01481 .00114 .00990 .02064
20 .01120 .00050 .00697 .01629
50 .00723 .00028 .00245 .00708
80 .00492 .00023 .00178 .00431
100 .00381 .00021 .00153 .00374

IC-ARL = 1000, increasing allocation
m1   LT global FDR   Nominal PCER=.01   Nominal PCER=.03   Nominal PCER=.05
1 .07483 .00359 .01550 .02860
3 .04510 .00275 .01339 .02569
5 .02915 .00213 .01235 .02397
8 .02568 .00169 .01149 .02226
10 .02114 .00137 .01034 .02074
20 .01262 .00057 .00668 .01574
50 .00634 .00041 .00294 .00687
80 .00450 .00038 .00246 .00566
100 .00373 .00034 .00236 .00525

IC-ARL = 10000, equal allocation
m1   LT global FDR   Nominal PCER=.01   Nominal PCER=.03   Nominal PCER=.05
1 .00750 .00101 .00909 .02071
3 .00550 .00064 .00808 .01943
5 .00200 .00055 .00786 .01946
8 .00289 .00039 .00748 .01848
10 .00218 .00031 .00706 .01765
20 .00114 .00012 .00531 .01557
50 .00057 .00007 .00167 .00719
80 .00059 .00005 .00115 .00379
100 .00036 .00005 .00101 .00315

IC-ARL = 10000, increasing allocation
m1   LT global FDR   Nominal PCER=.01   Nominal PCER=.03   Nominal PCER=.05
1 .01000 .00136 .01064 .02247
3 .00675 .00084 .00895 .02094
5 .00317 .00067 .00851 .02009
8 .00178 .00048 .00762 .01831
10 .00227 .00035 .00749 .01815
20 .00100 .00014 .00520 .01488
50 .00073 .00011 .00203 .00639
80 .00057 .00010 .00170 .00489
100 .00050 .00008 .00157 .00438

Table 6: Global FDR of the LT procedure and global PCER of our two-stage CUSUM procedure for IC-ARL = 1000 and 10000; m1 denotes the number of OC data streams.

We also conduct a simulation study with a higher dimension $m$. For our two-stage procedure, the PCER is set at 0.005, 0.01 or 0.02. The control limits $h$ and $c_h$ for our two-stage CUSUM procedure, along with the point-wise FDR for the LT procedure, obtained through Monte-Carlo simulation for IC-ARL = 1000, are listed in Table 7.

IC-ARL   Point-wise FDR (LT)   h        PCER    c_h
1000     .01045                32.593   .005    .99737
                                        .01     .99379
                                        .02     .98576
Table 7: The control limits used in our proposed two-stage CUSUM and LT procedures in the higher-dimensional simulation study.

The ATDOC comparison between the two-stage CUSUM procedure and the LT procedure is reported in Figure 1. Similar to the lower-dimensional case, our two-stage CUSUM procedure has a much smaller ATDOC than the LT procedure in all the cases except the case of a single OC data stream, for which the explanation is the same as the one given above. Figure 2 shows the global FDR control of the LT procedure and the global PCER control of our two-stage CUSUM procedure. In both figures, the ratio of the global FDR to the point-wise FDR is plotted for the LT procedure, and the ratio of the global PCER to its nominal value is plotted for our two-stage CUSUM procedure. As we can see, the global FDR can be much higher than the point-wise FDR for the LT procedure, while our two-stage CUSUM procedure has good control of the global PCER.

Figure 1: The ATDOC comparison between our proposed two-stage CUSUM procedure and the LT procedure in the higher-dimensional setting with IC-ARL = 1000 for: (a) equal allocation; (b) increasing allocation. Both panels are plotted on the log scale.
Figure 2: The ratio of the global FDR to the point-wise FDR for the LT procedure and the ratio of the global PCER to its nominal value for our two-stage CUSUM procedure in the higher-dimensional setting with IC-ARL = 1000 for: (a) equal allocation; (b) increasing allocation.

6 Real data application

In this section, we use a real data set to demonstrate the application of our proposed two-stage procedure. The data set we use was collected from a modern semiconductor manufacturing process and is publicly available in the UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/SECOM). It contains a collection of vector observations, each consisting of many measured features. A number of the features take constant values across the observations; we therefore exclude them from our analysis, leaving a data set in which each vector observation has a reduced dimension. To demonstrate the application of our proposed two-stage monitoring procedure, we treat the observations of each retained feature as one data stream, and our task is then to monitor the data streams simultaneously and identify any nonconforming observations as soon as possible.

Among the vector observations, 104 are labeled as nonconforming and the remaining ones as conforming. For illustration, we use all the conforming observations as the historical sample and apply our proposed two-stage monitoring procedure to the 104 nonconforming observations to see how many of them our procedure is able to detect. For simplicity of notation, we reorganize the data set so that, for each feature/data stream, the conforming observations form the historical sample and the nonconforming observations are indexed by $t = 1, \ldots, 104$.

Since not all the data are normally distributed, to use the global monitoring statistic $W_t$ in (4.2) we first transform each data stream by $\Phi^{-1}(\hat{F}_j(\cdot))$, where $\hat{F}_j$ is the empirical distribution function based on the historical observations of the $j$-th data stream and $\Phi$ is the standard normal distribution function. Under this transformation, the in-control distributions of all the data streams are approximately $N(0,1)$. Denote the transformed nonconforming observations by $Z_{j,t}$, $j = 1, \ldots, m$, $t = 1, \ldots, 104$. For the $j$-th data stream and at each $t$, we then calculate our CUSUM statistic $\tilde{C}_{j,t}$ in (4.3) based on those transformed nonconforming observations. By applying the closed-form p-value formula from Grigg and Spiegelhalter (2008) to those CUSUM statistics $\tilde{C}_{j,t}$, we obtain their marginal p-values $p_{j,t}$. According to Zou et al. (2014), although there exists a certain weak positive dependence structure in this data set, their proposed global monitoring statistic in (4.2) can still be effective. Therefore, in our proposed two-stage procedure, we still use this $W_t$ as our global monitoring statistic and calculate its value based on the $p_{j,t}$ to decide whether an alarm should be given at time $t$. If $W_t$ is greater than the control limit $h$, we then compare $1 - p_{j,t}$ to the appropriate control limit $c_h$ to decide whether the $j$-th data stream is nonconforming. In the above two-stage procedure, we set the nominal IC-ARL to be 1000, and the corresponding control limits $h$ and $c_h$ used in the two stages are listed in Table 8. As a comparison, we also apply the LT procedure to this data set, and its point-wise FDR for the same nominal IC-ARL is also reported in Table 8.
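A minimal sketch of this transformation is given below, assuming the historical (conforming) observations of a feature are available as an array; the small continuity adjustment that keeps the empirical CDF strictly inside (0, 1) is our own detail, added so that the normal quantile function stays finite.

```python
import numpy as np
from scipy.stats import norm

def to_standard_normal(new_obs, historical):
    """Transform a stream's new observations to approximately N(0,1) scores via
    Phi^{-1}(F_hat(x)), where F_hat is the empirical CDF of the historical sample."""
    hist = np.sort(np.asarray(historical, dtype=float))
    n = len(hist)
    ranks = np.searchsorted(hist, np.asarray(new_obs, dtype=float), side="right")
    u = (ranks + 0.5) / (n + 1.0)          # empirical CDF, shrunk away from 0 and 1
    return norm.ppf(u)

# z_stream_j = to_standard_normal(nonconforming_j, conforming_j)  # hypothetical arrays
```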

IC-ARL   Point-wise FDR (LT)   h        PCER    c_h
1000     .01055                30.536   .005    .99832
                                        .01     .99548
                                        .02     .98859
Table 8: The control limits used in our proposed two-stage CUSUM and LT procedures for the semiconductor manufacturing data.

By using our proposed two-stage procedure, we detect 102 out of 104 nonconforming vectors. In those 102 detected nonconforming vectors, we discover 3124, 4221, 5683 nonconforming features using PCER=0.005, 0.01, 0.02, respectively; and their respective average times to detect those nonconforming features are 61.886, 61.131, 60.274. By comparison, the one-stage LT procedure detects 89 nonconforming vectors. In those 89 detected nonconforming vectors, 2466 nonconforming features are detected and the average time to detect those nonconforming features is 64.116. From the above comparison, we can see that our two-stage procedure can provide both faster and more complete detection of nonconforming observations.

7 Concluding Remarks

In this paper, we introduce a general two-stage monitoring procedure for high-dimensional data streams. The proposed monitoring procedure can control both the IC-ARL and the Type-I errors at the levels specified by users. Therefore, by using the proposed two-stage monitoring procedure, users can choose both how often they expect a false alarm when the system is IC and how many false alarms they can tolerate when identifying OC data streams. This gives users more freedom to design the desired monitoring procedure for high-dimensional data streams. Our simulation study shows that the proposed method performs better than the existing methods.

When monitoring high-dimensional data streams, depending on the purpose, two types of monitoring schemes are needed. The first type is for applications where any OC data stream indicates the same abnormality in the whole system and requires the same corrective action. As a result, those applications do not require the identification of OC data streams. For this monitoring purpose, a monitoring scheme that tracks a single global monitoring statistic, such as those proposed in Tartakovsky et al. (2006), Mei (2010), Xie and Siegmund (2013), and Zou et al. (2014), can achieve the goal. The second type of monitoring scheme is for applications where different OC data streams indicate different problems in the system and require different corrective actions. As a result, this type of monitoring scheme needs to identify which data streams are OC. The monitoring schemes proposed in Grigg and Spiegelhalter (2008), Li and Tsung (2009, 2012), and Gandy and Lau (2013) belong to this category. In the past, the research on these two types of monitoring schemes has not overlapped. In this paper, the two-stage method we propose for the second type of monitoring scheme makes use of developments from the research on the first type. Therefore, our proposed two-stage method provides an important link between the research on these two types of monitoring schemes.

In our proposed two-stage monitoring scheme, we choose the PCER as the Type-I error rate when identifying the OC data streams in our second stage. As mentioned in the paper, the main reason for this choice is that the global PCER can easily be controlled by controlling the point-wise PCER. In some applications, it might be acceptable to control only the point-wise Type-I error rate when identifying the OC data streams. If this is the case, the FDR might be a better choice than the PCER. However, applying the existing FDR controlling procedures to our second stage essentially requires the calculation of the following p-values, denoted by $\tilde{p}_{j,t}$, which are not the marginal p-values but are conditional on the event that $W_t > h$:

$$\tilde{p}_{j,t} = P_{H_{0,j}}\big(Y_{j,t} > y_{j,t} \mid W_t > h\big),$$

where $y_{j,t}$ is the observed value of the test statistic $Y_{j,t}$ and the subscript "$H_{0,j}$" indicates that the probability is calculated under the null hypothesis $H_{0,j}$; that is, the $j$-th data stream is IC. The above conditional probability is very challenging to calculate. Therefore, the existing FDR controlling procedures cannot be directly applied to our second stage. In our future research, we will explore the estimation approach proposed by Storey (2002) to develop a suitable FDR controlling procedure for the second stage of our two-stage monitoring scheme.

Acknowledgements

We thank the Editor, Professor Fugee Tsung, the Guest Editor, Professor Changliang Zou, and the two referees for their constructive comments and suggestions, which greatly improved the quality of the paper.

Appendix: Proof

Proposition 1.

If the $g_j(\cdot)$, $j = m_0+1, \ldots, m$, are all non-increasing functions, then we have, for $j = 1, \ldots, m_0$,

$$P_{m_0}\big(Y_{j,t} > c_h \mid W_t > h\big) \le P_{0}\big(Y_{j,t} > c_h \mid W_t > h\big).$$

Proof.

Since the proof is the same for each time point $t$, for notational simplicity we drop the subscript "$t$" in the proof below. Therefore, we need to prove that, for $j = 1, \ldots, m_0$,

$$P_{m_0}\big(Y_j > c_h \mid W > h\big) \le P_{0}\big(Y_j > c_h \mid W > h\big).$$

Below we prove that

$$P_{m_0}\big(Y_1 > c_h \mid W > h\big) \le P_{0}\big(Y_1 > c_h \mid W > h\big). \qquad (7.1)$$

For $j = 2, \ldots, m_0$, it can be proved similarly.

Based on the definition of conditional probability, it is easy to see that

$$P_{m_0}\big(Y_1 > c_h \mid W > h\big) = \frac{P_{m_0}\big(Y_1 > c_h,\ W > h\big)}{P_{m_0}\big(W > h\big)}.$$

Similarly, we can write

$$P_{0}\big(Y_1 > c_h \mid W > h\big) = \frac{P_{0}\big(Y_1 > c_h,\ W > h\big)}{P_{0}\big(W > h\big)}.$$

Therefore, to prove (7.1), it suffices to show that

$$P_{m_0}\big(Y_1 > c_h,\ W > h\big)\, P_{0}\big(W > h\big) \le P_{0}\big(Y_1 > c_h,\ W > h\big)\, P_{m_0}\big(W > h\big).$$

Denote the probability density functions of $Y_j$ under the null and alternative hypotheses by $f_{0,j}(\cdot)$ and $f_{1,j}(\cdot)$, respectively. Then $P_{m_0}(Y_1 > c_h,\ W > h)$ can be written as an iterated integral over $Y_1, \ldots, Y_m$,

(7.2)

in which the lower limits of integration with respect to the OC components $Y_{m_0+1}, \ldots, Y_m$ are functions of the remaining variables. Since a larger $Y_j$ leads to a larger $W$, these lower limits are all non-increasing functions of their arguments.

Similarly, $P_{0}(Y_1 > c_h,\ W > h)$ can be written as an analogous iterated integral,

(7.3)

Denote the p-value of $Y_j$ by $p_j$ and its probability density function when the $j$-th data stream is OC by $g_j(\cdot)$. Then (7.2) and (7.3) can be calculated using these p-values. That is,