Statistical process control (SPC) applies statistical methods to the monitoring and control of a process in order to detect abnormal process variations. One of the most popular SPC tools is the control chart, which plots a statistic that measures a feature of the process over time. When the charting statistic stays within predetermined control limits, the process is considered to be in a state of statistical control (hereafter in-control). When the charting statistic goes beyond the control limits, an alarm is triggered to indicate that the process is likely experiencing abnormal variations (hereafter out-of-control). Control charts are easy to visualize and interpret, and they have therefore been successfully applied in many different areas, including fraud detection, disease outbreak surveillance, network traffic monitoring, and others (see, for example, Tsung et al. (2007), Woodall (2006), and Jeske et al. (2009)).
In the SPC literature, there exist parametric control charts and nonparametric control charts. Parametric control charts need to assume a particular parametric distribution for the process. In practice, it is often not easy to identify the parametric distribution that would be appropriate for a specific application. If the distribution is not specified correctly, parametric control charts may not perform as expected. In contrast, nonparametric control charts do not require specifying a particular parametric distribution for the process and remain valid regardless of the true underlying distribution. Therefore, nonparametric control charts are more desirable in many real-world applications.
There are many nonparametric control charts in the literature; we refer to Chakraborti, van der Laan and Bakir (2001) and Chapter 8 of Qiu (2014) for an overview of this topic. Most existing nonparametric control charts were developed to detect location changes only. However, in practical situations it is usually unknown in advance what kind of changes the process will experience. Therefore, it is more desirable to develop a nonparametric control chart that can detect arbitrary distributional changes. For this purpose, Zou and Tsung (2010) proposed an EWMA chart based on a powerful goodness-of-fit test. However, according to the simulation studies conducted in Ross and Adams (2012), this EWMA chart is only sensitive for detecting scale increases and is not as powerful as its competitors for detecting other types of distributional changes, including location shifts. In addition, their proposed EWMA chart involves a weight parameter $\lambda$, which practitioners need to pre-specify, and different choices of $\lambda$ affect the detection power of the resulting control chart. In general, the EWMA chart with a smaller $\lambda$ is more powerful for detecting smaller changes, and the one with a larger $\lambda$ is more powerful for detecting larger changes. However, in practice, it is rarely known in advance what kind of changes will occur.
To overcome the above limitations, Ross and Adams (2012) proposed two control charts based on the change-point detection (CPD) framework. Their proposed CPD charts are free of any tuning parameter and are shown to have better overall performance than Zou and Tsung’s EWMA chart for detecting different distributional changes. However, like most CPD charts, the computation of their proposed charts is very intensive, since at each time point all the possible change-point scenarios need to be considered.
To detect arbitrary distributional changes, Qiu and Li (2011) also proposed two nonparametric control charts, by first converting the nonparametric problem into a categorical data analysis problem through data categorization and then developing CUSUM charts for monitoring the resulting categorical data. The idea of developing nonparametric control charts through data categorization is very innovative, since it allows many existing categorical data analysis methods to be adopted for developing new nonparametric tools in SPC. However, similar to Zou and Tsung's EWMA chart above, the two CUSUM charts proposed by Qiu and Li (2011) involve a tuning parameter $k$, which needs to be pre-specified. In the parametric setting, the optimal choice of $k$ in the CUSUM statistic is usually linked to the out-of-control distribution, so practitioners have some general guidelines on how to choose $k$. Unfortunately, in the nonparametric CUSUM statistics proposed by Qiu and Li (2011), it is not clear how $k$ is linked to the out-of-control distribution. Because of this, it is not even clear what the right range is for the value of $k$. In their paper, they considered values of $k$ such as 0.005, 0.01, or 0.05, which seem to be much smaller than those commonly used in other CUSUM statistics. According to a simulation study we conducted, the in-control run lengths of their CUSUM statistics with those small values of $k$ have much larger variability than what we usually expect from regular CUSUM statistics. It seems that some larger values of $k$ should be used instead, but again it is not clear what choice of $k$ people should use in practice. Furthermore, based on our simulation studies, control charts directly based on the categorical data after categorization are usually less efficient than other rank-based nonparametric control charts, due to the loss of the ordering information in the original data.
To address all the above limitations, in this paper we propose a new nonparametric control chart for detecting arbitrary distributional changes. More specifically, we first follow the above data categorization idea to develop a new CUSUM chart for monitoring the resulting categorical data. The CUSUM chart we propose is more efficient than the ones used in Qiu and Li (2011) for detecting different distributional changes, since it is capable of incorporating the ordering information of the original data. To implement the new CUSUM chart, we need to specify the out-of-control distribution, which is rarely known in advance in practice. To overcome this difficulty, we borrow the idea proposed in Lorden and Pollak (2008) and develop an adaptive version of the proposed CUSUM chart. Our adaptive CUSUM chart does not require the specification of the out-of-control distribution. Instead, it uses the most recent data to estimate the out-of-control distribution. The resulting adaptive CUSUM chart has simple recursive formulas, so it is computationally efficient and its implementation is simple and straightforward. To address the situation where no sufficiently large reference data set is available, we also develop a self-starting monitoring scheme for the proposed adaptive CUSUM chart. Our simulation studies show that the proposed self-starting adaptive CUSUM chart has better overall performance than its competitors for detecting different distributional changes.
The rest of the paper is organized as follows. In Section 2, we describe our proposed nonparametric adaptive CUSUM chart and its properties. A simulation study is reported in Section 3 to evaluate the performance of our proposed control chart. In Section 4, we demonstrate the application of our proposed control chart using a real data set from a manufacturing process. Finally, we provide some concluding remarks in Section 5. All the proofs are deferred to the Appendix.
2.1 The proposed CUSUM statistic
The typical setup we consider in this paper is the following. There are $m$ independent and identically distributed reference (historical) data, denoted by $X_{-m+1},\dots,X_0$, from some in-control distribution $F_0$. Let $X_1,X_2,\dots$ be the future observations collected over time from the process. At any time $n\ge 1$, we observe $X_n$, and the task of the control chart at this time is to decide whether the process has changed based on $X_1,\dots,X_n$. This can be formulated as the following hypothesis testing problem,
$$H_0:\ X_i\sim F_0\ \text{for all}\ i\ge 1\quad\text{versus}\quad H_1:\ X_i\sim F_0\ \text{for}\ i\le\tau\ \text{and}\ X_i\sim F_1\ \text{for}\ i>\tau,\tag{2.1}$$
where $\tau$ is the change point, and $F_1\ (\ne F_0)$ is usually referred to as the out-of-control distribution.
If we further assume that $F_0$ and $F_1$ are both completely known, with density functions $f_0$ and $f_1$, then to test the hypothesis in (2.1), the test statistic based on the likelihood ratio method is
$$C_n=\max_{0\le k\le n}\ \sum_{i=k+1}^{n}\log\frac{f_1(X_i)}{f_0(X_i)},\tag{2.2}$$
where the sum is defined to be 0 when $k=n$, and it has the following convenient recursive representation
$$C_n=\max\Big(0,\ C_{n-1}+\log\frac{f_1(X_n)}{f_0(X_n)}\Big),\qquad C_0=0.$$
The popular CUSUM chart discussed in Page (1954) is then constructed by monitoring the above $C_n$ over time, and it raises an alarm if $C_n$ exceeds some threshold. This CUSUM chart is easy to construct and enjoys certain optimality properties (Moustakides (1986)); it has therefore been widely used in many applications.
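To make the recursion concrete, a minimal Python sketch is given below. It is illustrative only: the choice $f_0=N(0,1)$ and $f_1=N(1,1)$, for which the log-likelihood ratio is $x-1/2$, is our assumption, not part of the method.

```python
def cusum_path(xs, log_lr):
    """Page's CUSUM recursion: C_n = max(0, C_{n-1} + log-LR(x_n)), C_0 = 0."""
    c, path = 0.0, []
    for x in xs:
        c = max(0.0, c + log_lr(x))
        path.append(c)
    return path

# Illustrative case: f0 = N(0,1), f1 = N(1,1), so log(f1(x)/f0(x)) = x - 0.5.
path = cusum_path([0.1, -0.3, 2.0, 1.5, 1.8], lambda x: x - 0.5)
# The statistic stays at 0 under in-control-like data and climbs after a shift.
```

Because of the reset at 0, the statistic forgets stretches of in-control data, which is what makes the recursion both cheap and sensitive to a change.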
To implement the above CUSUM chart, both the in-control and out-of-control distributions, $F_0$ and $F_1$, need to be completely specified. However, in our nonparametric setting, both $F_0$ and $F_1$ are unknown. To overcome this difficulty, we first use the data categorization idea introduced in Qiu and Li (2011) to categorize the data, so that the in-control and out-of-control distributions of the resulting categorical data can be easily established. More specifically, let $-\infty=z_0<z_1<\cdots<z_{d-1}<z_d=\infty$ be the boundary points; the real line is then partitioned into the intervals
$$I_j=(z_{j-1},z_j],\qquad j=1,\dots,d.$$
Define
$$Y_{nj}=\mathbb{1}(X_n\in I_j),\qquad j=1,\dots,d,$$
where $\mathbb{1}(A)$ is the indicator function that equals 1 when $A$ is true and 0 otherwise. Then $Y_{nj}$ indicates whether $X_n$ falls in the $j$-th interval $I_j$. Define $\mathbf{Y}_n=(Y_{n1},\dots,Y_{nd})^{T}$. It is easy to see that $\mathbf{Y}_n$ follows a multinomial distribution with one trial and cell probabilities $P(X_n\in I_j)$, $j=1,\dots,d$. Therefore, through the above data categorization, the original observation $X_n$ with any arbitrary distribution is converted into a multinomial random vector.
To completely characterize the distribution of $\mathbf{Y}_n$, we need to know its cell probabilities. Following Qiu and Li (2011), we choose $z_j$ to be the $(j/d)$-th quantile of the in-control distribution $F_0$ of $X_n$. Then the in-control distribution of $\mathbf{Y}_n$ is simply $\mathrm{Multi}(1;1/d,\dots,1/d)$. Based on those $z_j$'s, we first assume that the out-of-control distribution of $\mathbf{Y}_n$ is given by another multinomial distribution $\mathrm{Multi}(1;g_1,\dots,g_d)$, where $g_j>0$ and $\sum_{j=1}^{d}g_j=1$. Using the in-control and out-of-control distributions of $\mathbf{Y}_n$ instead of those of $X_n$, the CUSUM statistic in (2.2) becomes
$$C_n=\max\Big(0,\ C_{n-1}+\sum_{j=1}^{d}Y_{nj}\log(d\,g_j)\Big),\qquad C_0=0.$$
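The categorization step itself is straightforward to implement. The sketch below (our illustration, with the standard normal playing the role of $F_0$) places the boundary points at the $j/d$-th in-control quantiles, so that each interval receives probability $1/d$ in-control.

```python
import bisect
import random
from statistics import NormalDist

d = 5
# Interior boundary points z_1 < ... < z_{d-1}: the j/d-th quantiles of F0 = N(0,1).
z = [NormalDist().inv_cdf(j / d) for j in range(1, d)]

def categorize(x):
    """0-based index of the interval I_j = (z_{j-1}, z_j] containing x."""
    return bisect.bisect_left(z, x)

# Sanity check: under F0, the empirical cell frequencies are all close to 1/d.
rng = random.Random(7)
counts = [0] * d
for _ in range(20000):
    counts[categorize(rng.gauss(0.0, 1.0))] += 1
freqs = [c / 20000 for c in counts]
```

`bisect_left` respects the half-open convention $(z_{j-1}, z_j]$ for continuous data, where boundary ties occur with probability zero.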
Similar to the charting statistics proposed in Qiu and Li (2011), the above CUSUM statistic is usually less powerful than other rank-based charting statistics. The reason is that the ordering information of the original data is lost: the statistic does not make use of the ordering of the intervals $I_1,\dots,I_d$. To overcome this drawback, we need a new way to construct the CUSUM statistic so that the ordering information of $I_1,\dots,I_d$ can be used. For this purpose, we first define the cumulative unions of $I_1,\dots,I_d$, i.e.,
$$I_j^{*}=I_1\cup\cdots\cup I_j=(-\infty,z_j],\qquad j=1,\dots,d-1.$$
Similarly, we define the cumulative sums of $Y_{n1},\dots,Y_{nd}$, i.e.,
$$Y_{nj}^{*}=Y_{n1}+\cdots+Y_{nj},\qquad j=1,\dots,d-1.$$
Then $Y_{nj}^{*}$ indicates whether $X_n$ falls in the interval $I_j^{*}$. Write $\mathbf{Y}_n^{*}=(Y_{n1}^{*},\dots,Y_{n,d-1}^{*})^{T}$. The new vector $\mathbf{Y}_n^{*}$ contains the same amount of information as $\mathbf{Y}_n$. However, if we use the log-likelihood ratio based on $\mathbf{Y}_n^{*}$ in our CUSUM statistic, the ordering information of $I_1,\dots,I_d$ can be incorporated, so the ordering information of the original data can be preserved.
To develop the log-likelihood ratio based on $\mathbf{Y}_n^{*}$, we first notice that each $Y_{nj}^{*}$, $j=1,\dots,d-1$, is a Bernoulli random variable, with success probability $j/d$ in-control and $g_j^{*}=g_1+\cdots+g_j$ out-of-control, and the log-likelihood ratio based on $Y_{nj}^{*}$ is
$$\ell_{nj}=Y_{nj}^{*}\log\frac{g_j^{*}}{j/d}+\big(1-Y_{nj}^{*}\big)\log\frac{1-g_j^{*}}{1-j/d}.$$
Then our proposed log-likelihood ratio based on $\mathbf{Y}_n^{*}$ is simply the weighted sum of the above log-likelihood ratios, i.e.,
$$\ell_n=\sum_{j=1}^{d-1}w_j\,\ell_{nj},$$
where $w_j$ is the weight function, which we choose to give more weight to the tail areas. Therefore, our proposed CUSUM statistic is
$$C_{1,n}=\max\big(0,\ C_{1,n-1}+\ell_n\big),\qquad C_{1,0}=0.$$
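One increment of this statistic can be sketched as follows. The out-of-control cumulative probabilities `g_star` and the tail-heavy weights `w` below are illustrative assumptions on our part, not the paper's choices.

```python
import math

def weighted_bernoulli_llr(y_star, g_star, d, w):
    """Weighted log-likelihood ratio based on the cumulative indicators.

    y_star[j] = 1 if x falls in I_1 ∪ ... ∪ I_{j+1} (0-based j = 0..d-2);
    the in-control success probability of the j-th indicator is (j+1)/d.
    """
    llr = 0.0
    for j, (y, g, wt) in enumerate(zip(y_star, g_star, w)):
        p0 = (j + 1) / d
        llr += wt * (y * math.log(g / p0) + (1 - y) * math.log((1 - g) / (1 - p0)))
    return llr

def cusum_step(c_prev, y_star, g_star, d, w):
    """One step of C_{1,n} = max(0, C_{1,n-1} + weighted log-LR)."""
    return max(0.0, c_prev + weighted_bernoulli_llr(y_star, g_star, d, w))

d = 4
g_star = [0.15, 0.35, 0.60]   # assumed out-of-control cumulative probabilities
w = [1.5, 1.0, 1.5]           # heavier weights in the tails (an assumption)
c = cusum_step(0.0, [0, 0, 0], g_star, d, w)   # an observation in the upper tail
```

With these assumed probabilities, an observation in the upper tail pushes the statistic up, while one in the lowest interval leaves it at 0.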
As described above, using the log-likelihood ratio of $\mathbf{Y}_n^{*}$ in our CUSUM statistic helps preserve the ordering information of the data. Based on how the intervals $I_1,\dots,I_d$ and their cumulative unions are constructed, the ordering information used in the above CUSUM statistic runs from the smallest to the largest observation. In the nonparametric literature, the Wilcoxon-Mann-Whitney test is a powerful test for location differences, and the Ansari-Bradley test is a powerful test for scale differences. Both can be considered rank-sum tests. In the Wilcoxon-Mann-Whitney test, the data are ranked from the smallest to the largest, while in the Ansari-Bradley test, the data can be considered as being ranked from the center outward. This observation leads us to believe that, although our CUSUM statistic $C_{1,n}$ can detect arbitrary distributional changes, it might not be very powerful for detecting scale changes. To develop a CUSUM statistic that is efficient for scale changes, we need to make use of the center-outward ordering of the data.
To do so, different from the previous categorization, we categorize the data in a center-outward fashion. More specifically, let $q_s$, $0<s<1$, be the $s$-th quantile of the in-control distribution of $X_n$, with $q_0=-\infty$ and $q_1=\infty$. We partition the real line into the regions
$$R_j=\big(q_{(d-j)/(2d)},\,q_{(d-j+1)/(2d)}\big]\cup\big(q_{(d+j-1)/(2d)},\,q_{(d+j)/(2d)}\big],\qquad j=1,\dots,d.$$
It is clear that $R_1,\dots,R_d$ are ordered from the center outward, and each has in-control probability $1/d$. Define $Y'_{nj}=\mathbb{1}(X_n\in R_j)$ and $\mathbf{Y}'_n=(Y'_{n1},\dots,Y'_{nd})^{T}$. It is easy to see that $\mathbf{Y}'_n$ follows a multinomial distribution and its in-control distribution is $\mathrm{Multi}(1;1/d,\dots,1/d)$. Again, we assume that the out-of-control distribution of $\mathbf{Y}'_n$ is given by another multinomial distribution $\mathrm{Multi}(1;h_1,\dots,h_d)$, where $h_j>0$ and $\sum_{j=1}^{d}h_j=1$. Although $R_1,\dots,R_d$ are ordered from the center outward, if we use $\mathbf{Y}'_n$ directly to construct the CUSUM statistic, this center-outward ordering will not be utilized. Similar to how we construct $C_{1,n}$ to incorporate the left-to-right ordering information of the data, we consider the cumulative unions of $R_1,\dots,R_d$,
$$R_j^{*}=R_1\cup\cdots\cup R_j,\qquad j=1,\dots,d-1,$$
and the cumulative sums of $Y'_{n1},\dots,Y'_{nd}$,
$$Y_{nj}^{\prime *}=Y'_{n1}+\cdots+Y'_{nj},\qquad j=1,\dots,d-1.$$
Using the same method as for obtaining $C_{1,n}$, we obtain the following CUSUM statistic that makes use of the center-outward ordering information of the data,
$$C_{2,n}=\max\big(0,\ C_{2,n-1}+\ell'_n\big),\qquad C_{2,0}=0,$$
where $\ell'_n$ is the weighted sum of the Bernoulli log-likelihood ratios based on the $Y_{nj}^{\prime *}$'s.
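The center-outward category of an observation can be computed directly from its in-control probability-integral transform. The sketch below is our illustration (with $N(0,1)$ standing in for the in-control distribution): points near the median land in region 1, points in the extreme tails in region $d$, and each region has in-control probability $1/d$.

```python
import math
from statistics import NormalDist

d = 5
F0 = NormalDist().cdf  # assumed in-control cdf

def center_outward_region(x):
    """1-based index of the center-outward region R_j containing x.

    u = F0(x) is uniform on (0,1) in-control; |2u - 1| ranks points from
    the median outward, and ceil(d * |2u - 1|) picks the region.
    """
    v = abs(2.0 * F0(x) - 1.0)
    return min(d, max(1, math.ceil(d * v)))
```

For example, `center_outward_region(0.0)` returns 1 (the central region), while very large or very small observations both map to region `d`.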
Both $C_{1,n}$ and $C_{2,n}$ can be used to detect arbitrary distributional changes. As shown in our simulation study in Section 3.2, $C_{1,n}$ is more powerful than $C_{2,n}$ for detecting location changes, since it uses the left-to-right ordering information of the data. In contrast, $C_{2,n}$ uses the center-outward ordering information of the data, and it is therefore more powerful than $C_{1,n}$ for detecting scale changes. If no prior information is available on what type of changes the process might experience, we propose to use the following CUSUM statistic,
$$C_n=\max\big(C_{1,n},\,C_{2,n}\big).$$
2.2 The adaptive CUSUM statistic
To implement the above CUSUM statistics, the cell probabilities $g_1,\dots,g_d$ and $h_1,\dots,h_d$ in the out-of-control distributions of $\mathbf{Y}_n$ and $\mathbf{Y}'_n$ need to be specified in advance. This can be a difficult task in many real-world applications, where prior knowledge of the out-of-control distribution may not be available. This is the case even for the standard CUSUM statistic, when both the in-control and out-of-control distributions are normal distributions with different means. To circumvent this difficulty, a few adaptive CUSUM statistics have been proposed in the literature. For example, in Sparks (2000), instead of using a specified out-of-control mean in the standard CUSUM statistic, an estimate of the out-of-control mean based on an exponentially weighted moving average of all past observations is plugged in. In Han and Tsung (2006), the absolute value of the current observation is used as the estimate of the out-of-control mean in the standard CUSUM statistic. Following the same idea, Lorden and Pollak (2008) proposed another way to estimate the out-of-control mean to be used in the CUSUM statistic, and proved the asymptotic optimality of the resulting CUSUM statistic under a single-parameter exponential family. Recently, Wu (2016) generalized Lorden and Pollak's result to the multi-parameter exponential family. In both Lorden and Pollak (2008) and Wu (2016), the key observation is that, at any given time, the most recent time when the CUSUM statistic returns to 0 provides a candidate estimate of the possible change point $\tau$, and therefore the observations collected after that time can be used to estimate the parameters of the out-of-control distribution.
In the following, we adopt the approach of Lorden and Pollak (2008) and Wu (2016), and substitute the cell probabilities $g_j$ (and similarly $h_j$) in our proposed CUSUM statistic by their estimates based on the observations collected after the change-point estimate $\hat{\tau}_n$, where $\hat{\tau}_n$ is the most recent time when the CUSUM statistic equals 0. More specifically, define, for $j=1,\dots,d$,
$$\hat{g}_{j,n}=\frac{N_{j,n}+c_j}{N_n+\sum_{l=1}^{d}c_l},\tag{2.8}$$
where the $\hat{g}_{j,n}$ are the estimates of the $g_j$ at time $n$ and the $c_j>0$ are pre-specified constants. In the above estimates, $N_n$ is the number of observations collected before the current time $n$ but after the candidate change-point estimate $\hat{\tau}_n$, and $N_{j,n}$ is the number of those observations falling in the $j$-th interval. Both $N_n$ and $N_{j,n}$ can be calculated recursively by
$$N_n=\begin{cases}N_{n-1}+1,&\text{if }C_{n-1}>0,\\[2pt]0,&\text{if }C_{n-1}=0,\end{cases}\qquad N_{j,n}=\begin{cases}N_{j,n-1}+Y_{n-1,j},&\text{if }C_{n-1}>0,\\[2pt]0,&\text{if }C_{n-1}=0.\end{cases}$$
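The bookkeeping behind these recursions can be sketched as follows. The names are ours, and the Dirichlet constants `prior` stand in for the $c_j$'s discussed next.

```python
def adaptive_update(state, j_new, cusum_positive, prior):
    """One update of the post-change-point counts and smoothed estimates.

    state = (N, counts): N observations since the candidate change point,
    and counts[j] of them in interval j (0-based).  When the CUSUM
    statistic returned to 0 at the previous step, the candidate change
    point moves to the present and the counts reset.
    """
    N, counts = state
    if not cusum_positive:          # statistic hit 0: restart the counts
        N, counts = 0, [0] * len(prior)
    counts = list(counts)
    counts[j_new] += 1
    N += 1
    total = N + sum(prior)
    g_hat = [(counts[j] + prior[j]) / total for j in range(len(prior))]
    return (N, counts), g_hat

prior = [0.2, 0.3, 0.5]             # illustrative Dirichlet constants c_j
state, g_hat = adaptive_update((0, [0, 0, 0]), 1, False, prior)
```

Note that the smoothed estimates always form a proper probability vector, and that immediately after a reset they are driven entirely by the prior constants, which is exactly why the choice of the $c_j$'s matters.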
The constants $c_1,\dots,c_d$ in (2.8) can be considered as the parameters of the Dirichlet distribution, the conjugate prior for the multinomial cell probabilities. Therefore, the above estimate $\hat{g}_{j,n}$ can be considered a Bayesian estimate. In Bayesian statistics, it is common to choose a noninformative prior for the cell probabilities. However, in our case a closer examination of the adaptive CUSUM statistic reveals that, whenever the statistic returns to 0, the prior alone will be used to estimate the out-of-control probabilities; under a noninformative prior, this estimate coincides with the in-control probabilities $1/d$, and the log-likelihood ratio degenerates to 0. Therefore, the noninformative choice does not work. Instead, we can choose $(c_1,\dots,c_d)$ proportional to the out-of-control cell probabilities when the process experiences the smallest distributional change that is meaningful. In this paper, we choose the $c_j$'s as follows. We first assume that the in-control distribution of $X_n$ is $N(0,1)$ and its smallest meaningful out-of-control distribution is either $N(\pm\delta,1)$ (a small location shift) or $N(0,(1\pm\delta)^2)$ (a small scale change), for some small $\delta>0$. Under this in-control and out-of-control distributional assumption for $X_n$, we can obtain the corresponding out-of-control distributions of $\mathbf{Y}_n$ and $\mathbf{Y}'_n$, and we choose $(c_1,\dots,c_d)$ proportional to the resulting cell probabilities. When the prior corresponding to $N(\delta,1)$ is used in $C_{1,n}$, the resulting adaptive chart, denoted by $A^{+}_{1,n}$, indicates a positive location shift, so it is more powerful for detecting positive location shifts; when the prior corresponding to $N(-\delta,1)$ is used in $C_{1,n}$, the resulting chart, denoted by $A^{-}_{1,n}$, is more powerful for detecting negative location shifts. Similarly, when the prior corresponding to $N(0,(1+\delta)^2)$ is used in $C_{2,n}$, the resulting chart, denoted by $A^{+}_{2,n}$, is more powerful for detecting scale increases; when the prior corresponding to $N(0,(1-\delta)^2)$ is used in $C_{2,n}$, the resulting chart, denoted by $A^{-}_{2,n}$, is more powerful for detecting scale decreases. If we do not have any prior information about what type of changes the process might encounter, the charting statistic we use is
$$A_n=\max\big(A^{+}_{1,n},\,A^{-}_{1,n},\,A^{+}_{2,n},\,A^{-}_{2,n}\big),$$
which is efficient for detecting any type of distributional change.
2.3 Determining the control limit
As described in the previous section, our proposed adaptive CUSUM statistic is simply $A_n$, and the resulting control chart monitors $A_n$ over time $n$, raising an alarm if $A_n$ exceeds the control limit $h$. As we can see from (2.7), $A_n$ is a function of the categorized data $\mathbf{Y}_1,\dots,\mathbf{Y}_n$ and $\mathbf{Y}'_1,\dots,\mathbf{Y}'_n$ only. Since the boundary points are quantiles of the in-control distribution $F_0$, when the process is in-control the probability-integral transform $U_i=F_0(X_i)$ is a uniform random variable on $(0,1)$, and the joint in-control distribution of the categorized data is the same as that obtained by categorizing uniform random variables. Therefore, our proposed adaptive CUSUM control chart based on $A_n$ is distribution-free. Determining the control limit $h$ for this chart can thus be achieved by simulating data from any standard continuous distribution, say the standard normal distribution, as the in-control process and finding the value of $h$ that attains the desired in-control average run length (denoted by $\mathrm{ARL}_0$) through a bisection search. Table 1 shows the control limits computed by the bisection search algorithm, based on 10,000 replications, for different choices of $d$ when $\mathrm{ARL}_0$ is fixed at its nominal value.
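The bisection search itself is generic and can be sketched as below. The inner simulation uses a toy CUSUM with drifted standard-normal increments as a stand-in for the actual statistic $A_n$, and the target ARL, search interval, and replication count are illustrative, far smaller than the 10,000 replications used for Table 1.

```python
import random

def run_length(h, increment, max_n=100000):
    """Time until a CUSUM with the given increment sampler exceeds h."""
    c = 0.0
    for n in range(1, max_n + 1):
        c = max(0.0, c + increment())
        if c > h:
            return n
    return max_n

def find_control_limit(target_arl, lo, hi, reps, tol, seed=1):
    """Bisection on the threshold h so the simulated ARL0 matches the target."""
    rng = random.Random(seed)
    increment = lambda: rng.gauss(0.0, 1.0) - 0.5   # toy in-control increments
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        arl = sum(run_length(mid, increment) for _ in range(reps)) / reps
        if arl < target_arl:
            lo = mid      # alarms too often: raise the limit
        else:
            hi = mid      # alarms too rarely: lower the limit
    return (lo + hi) / 2.0

h = find_control_limit(target_arl=50.0, lo=0.0, hi=6.0, reps=200, tol=0.1)
```

The search works because the in-control ARL is monotonically increasing in the threshold, so each Monte Carlo comparison tells us which half of the interval to keep.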
2.4 Self-starting monitoring scheme
To categorize the original data and implement our proposed control chart based on $A_n$, we need to know the quantiles $z_j$ and $q_s$ of the in-control distribution of $X_n$. Since those quantiles are rarely known in practice, we can approximate them by their sample estimates from the in-control reference data. However, for the effect of using those quantile estimates instead of the true values on the $\mathrm{ARL}_0$ to be negligible, a substantial amount of in-control reference data is usually required. In many real-world applications, it can be very challenging to obtain such data. To solve this problem, we develop a self-starting monitoring scheme in which the quantile estimates are updated sequentially each time a new observation is collected.
More specifically, at time $n$ we have $N=m+n-1$ observations collected in the past, i.e.,
$$X_{-m+1},\dots,X_0,X_1,\dots,X_{n-1}.$$
Let $X_{(1)}\le X_{(2)}\le\cdots\le X_{(N)}$ denote their order statistics. For a given $s$, $0<s<1$, find the integer $r$ such that
$$\frac{r-1}{N}<s\qquad\text{and}\qquad s\le\frac{r}{N}.$$
Then, based on $X_{(1)},\dots,X_{(N)}$, the $s$-th quantile of the in-control distribution of $X_n$, $q_s$, can be estimated by
$$\hat{q}_{s,n}=X_{(r)}.\tag{2.10}$$
Since $z_j=q_{j/d}$ for $j=1,\dots,d-1$, the estimates of the $z_j$'s can be obtained accordingly.
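The sequential quantile estimates can be sketched as follows (our names; `r = ceil(s*N)` implements the order-statistic rule above):

```python
import math

def sequential_quantiles(past, levels):
    """Estimate the s-th in-control quantiles from all data seen so far.

    past: the m reference observations plus the n-1 already-monitored ones.
    levels: quantile levels s in (0,1), e.g. j/d for j = 1, ..., d-1.
    Returns X_(r), with r the smallest integer satisfying (r-1)/N < s <= r/N.
    """
    order = sorted(past)
    N = len(order)
    return [order[math.ceil(s * N) - 1] for s in levels]

d = 4
past = [3.1, -0.2, 0.5, 1.7, -1.0, 0.9, 2.2, 0.1]
est = sequential_quantiles(past, [j / d for j in range(1, d)])
# est holds the estimated 0.25-, 0.5-, and 0.75-quantiles; it is recomputed
# each time a new observation arrives.
```

In a streaming implementation, one would maintain the sorted list incrementally (e.g., by bisection insertion) rather than re-sorting at every step.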
Using those estimates, at time $n$ we partition the real line into the left-to-right intervals
$$\hat{I}_{j,n}=(\hat{z}_{j-1,n},\hat{z}_{j,n}],\qquad j=1,\dots,d,$$
where $\hat{z}_{j,n}=\hat{q}_{j/d,n}$, $\hat{z}_{0,n}=-\infty$ and $\hat{z}_{d,n}=\infty$, or into the center-outward regions $\hat{R}_{1,n},\dots,\hat{R}_{d,n}$ defined as in Section 2.1 with the quantiles $q_s$ replaced by their estimates $\hat{q}_{s,n}$. Define $\hat{\mathbf{Y}}_n=(\hat{Y}_{n1},\dots,\hat{Y}_{nd})^{T}$ and $\hat{\mathbf{Y}}'_n=(\hat{Y}'_{n1},\dots,\hat{Y}'_{nd})^{T}$, where
$$\hat{Y}_{nj}=\mathbb{1}(X_n\in\hat{I}_{j,n})\qquad\text{and}\qquad\hat{Y}'_{nj}=\mathbb{1}(X_n\in\hat{R}_{j,n}).$$
The following result gives the in-control distributions of $\hat{\mathbf{Y}}_n$ and $\hat{\mathbf{Y}}'_n$.

Theorem 1. For $n\ge 1$, the $\hat{\mathbf{Y}}_n$ (and likewise the $\hat{\mathbf{Y}}'_n$) are independent and identically distributed as $\mathrm{Multi}(1;1/d,\dots,1/d)$ when the process is in-control.
Based on the above result, $\hat{\mathbf{Y}}_n$ has the same in-control distribution as $\mathbf{Y}_n$, and $\hat{\mathbf{Y}}'_n$ the same as $\mathbf{Y}'_n$. Therefore, in our self-starting monitoring scheme, we replace $\mathbf{Y}_n$ and $\mathbf{Y}'_n$ in the proposed adaptive CUSUM statistic described in Section 2.2 by $\hat{\mathbf{Y}}_n$ and $\hat{\mathbf{Y}}'_n$, and the resulting self-starting control chart can still use the control limit obtained in Section 2.3.
In the above self-starting monitoring scheme, it is assumed that the calculation of our sequential quantile estimates (2.10) starts from $n=1$. In order for Theorem 1 to hold, the size $m$ of the reference data needs to be sufficiently large relative to $d$, since this ensures that, for any time $n$ and any required quantile level $s$, we can find an integer $r$ such that
$$\frac{r-1}{m+n-1}<s\qquad\text{and}\qquad s\le\frac{r}{m+n-1}.$$
If the number of observations we have is smaller than this, we cannot find such an integer $r$ for some quantile levels. In that case, we simply define the corresponding quantile estimates using the nearest available order statistics. When using these modified estimates, the in-control distribution of $\hat{\mathbf{Y}}_n$ is no longer exactly $\mathrm{Multi}(1;1/d,\dots,1/d)$. Therefore, during the early monitoring stage the in-control distribution of $\hat{\mathbf{Y}}_n$ deviates slightly from its expected one. Since this is the case only for the first few observations, we expect the effect on the $\mathrm{ARL}_0$ to be negligible if $d$ is not large.
In the following, we report a simulation study to evaluate such effects. In the simulation study, we consider two sizes of the reference data and set the number of categories $d$ to 10, 20, 30, or 40. Three different in-control distributions are considered: the standard normal distribution, denoted by $N(0,1)$; the $t$ distribution with 2.5 degrees of freedom, denoted by $t_{2.5}$; and a lognormal distribution, denoted by $LN$. Using the control limits reported in Table 1, we apply our proposed self-starting monitoring scheme to data simulated from the above three in-control distributions and record the time it takes to trigger an alarm, which is the in-control run length. This is repeated 10,000 times, and the average of the 10,000 in-control run lengths is the simulated $\mathrm{ARL}_0$ of our proposed self-starting monitoring scheme. Table 2 shows the simulated $\mathrm{ARL}_0$ values, along with their corresponding standard errors (in parentheses), under different settings.
As mentioned above, only the first few observations can potentially cause the $\mathrm{ARL}_0$ to deviate from the nominal level. For such effects to be negligible, $d$ should not be very large. This implies that the minimal size of the reference data needed to maintain the desired $\mathrm{ARL}_0$ should increase as $d$ increases. As we can see from Table 2, for $d=10$ or 20, the simulated