On Monitoring High-Dimensional Multivariate Processes with Individual Observations

10/26/2021
by   Mohsen Ebadi, et al.

Modern data collection methods and computational tools have made it possible to monitor high-dimensional processes. In this article, Phase II monitoring of high-dimensional processes is investigated when the number of samples available from Phase I is limited in comparison to the number of variables. A new charting statistic for high-dimensional multivariate processes, based on the diagonal elements of the underlying covariance matrix, is introduced, and a unified procedure for Phases I and II is proposed by employing a self-starting control chart. To remedy the effect of outliers, we adopt a robust procedure for parameter estimation in Phase I and introduce appropriate consistent estimators. The statistical performance of the proposed method in Phase II is evaluated through the average run length (ARL) criterion in the absence and presence of outliers; the results reveal that the proposed control chart effectively detects various kinds of shifts in the process mean. Finally, we illustrate the applicability of the proposed method via a real-world example.


1 Introduction

Multivariate statistical process monitoring (MSPM) techniques have been extensively used to detect shifts in the parameters of multivariate processes. The well-known Hotelling's T² control chart is usually recommended for monitoring the mean of a multivariate process with individual observations. Recently, modern data collection and computation tools have made it possible to monitor high-dimensional processes. However, typical MSPM approaches are frequently hampered in high-dimensional settings, a phenomenon known as the "curse of dimensionality". This is mainly because the sample covariance matrix used in T²-based methods is singular. Although many papers have been published on multivariate control charts for monitoring the process mean (see, for example, Reynolds and Cho (2006), Reynolds and Stoumbos (2008), Woodall and Montgomery (2014), and Bersimis et al. (2007) for discussions and reviews), monitoring changes in the mean vector of high-dimensional multivariate processes has received little attention in the literature. We discuss the few exceptions in the following. Under the "sparsity" assumption, which means that only a small set of variables is responsible for a process change, Wang and Jiang (2009) proposed a forward selection algorithm combined with a Shewhart-type control chart, referred to as the VS-MSPC chart. Another variable selection (VS)-based approach, the adaptive least absolute shrinkage and selection operator (LASSO), which identifies potentially altered variables, was proposed by Zou and Qiu (2009). Capizzi and Masarotto (2011) combined least angle regression with the MEWMA chart to monitor both the mean and variability. Jiang et al. (2012) investigated the impact of mean-shift estimation on the probability of correctly identifying changed variables and suggested a variable-selection-based MEWMA (VS-MEWMA) control chart that is more sensitive to small shifts in the mean vector. Abdella et al. (2017) used forward variable selection as a pre-diagnosis step integrated into the MCUSUM chart. Kim et al. (2020) proposed a penalised likelihood-based technique based on L2-norm regularisation that shrinks all process mean estimates toward zero. Their chart is efficient for monitoring high-dimensional processes since it has a closed-form solution as well as known probability distributions of the monitoring statistic under the null and alternative hypotheses.

The aforementioned literature mainly focuses on Phase II monitoring of high-dimensional processes when enough outlier-free data are available in Phase I. However, in practical situations, the Phase I data are limited and may contain outliers. For the situation where a reference dataset is not large enough to estimate the process parameters, the literature offers self-starting methods that handle sequential monitoring by using successive process readings to update the parameter estimates while simultaneously checking for out-of-control conditions. See, for example, Sullivan and Jones-Farmer (2002) and Hawkins and Maboudou-Tchao (2007). More recently, Chen et al. (2016) introduced a new nonparametric method for monitoring location parameters when only a small reference dataset is available; still, they assumed that the Phase I dataset is outlier free. In the high-dimensional situation, more robust estimators of the process parameters are required if outliers contaminate the data, and in many practical settings the covariance matrix cannot be meaningfully estimated from the original data because of the "curse of dimensionality". This research proposes a new self-starting control chart for Phase II monitoring of a high-dimensional process. In our new chart, rather than estimating all elements of the covariance matrix, we estimate only its diagonal elements. By using a robust method to estimate the parameters, we show that our approach is very effective for Phase II monitoring of the process mean, especially when the sample size is small compared to the number of variables. The remainder of this paper is organized as follows. In Section 2, we develop the proposed chart based on the diagonal elements of the sample covariance matrix and then propose a unified self-starting approach for Phase II monitoring. In Section 3, the performance of the proposed chart is evaluated in terms of the average run length via Monte Carlo simulations. In Section 4, a real example is employed to show the applicability of the proposed methodology. Section 5 concludes the paper.

2 Monitoring method for high-dimensional processes

This section proposes a new control chart motivated by the characteristics and limitations of high-dimensional processes. We then propose a self-starting approach for Phase II monitoring that uses a robust estimation approach for the underlying in-control parameters when the historical data are limited.

2.1 A new control chart

Consider the problem of monitoring a multivariate process with p quality characteristics. It is assumed that there are m independent and identically distributed (i.i.d.) historical (reference) observations X_1, …, X_m collected for Phase I analysis. Under the in-control situation, we assume that the process follows a multivariate normal distribution with mean vector μ0 and covariance matrix Σ0. For Phase II monitoring, the tth future observation X_t is collected and monitored over time. A typical approach for monitoring the mean of such a process in Phase II uses the T² statistic based on the Mahalanobis distance (Mahalanobis 1936) as follows:

T_t² = (X_t − μ0)ᵀ Σ0⁻¹ (X_t − μ0).   (1)

A Phase I control chart, however, can be obtained by replacing μ0 and Σ0 with the sample mean vector and sample covariance matrix, respectively (Bersimis et al. 2007). A large value of T_t² leads to rejection of the null hypothesis that the observation X_t follows N_p(μ0, Σ0); consequently, if the value of the test statistic plots above the prespecified control limit, the chart signals an out-of-control situation. When the in-control parameters of the process are known or estimated at the end of Phase I, the T_t² statistic follows a chi-square distribution with p degrees of freedom. This is called a Phase II control chart and has the upper α quantile of that chi-square distribution as its upper control limit and a lower control limit of zero (see, for example, Bersimis et al. 2007). The conventional estimators of the parameters μ0 and Σ0 in Phase I are the sample mean vector and the sample covariance matrix, respectively. But when the dimension p exceeds the number of Phase I observations, the standard sample covariance matrix is singular and cannot be inverted, so the T² statistic becomes ill-defined. In practice, the available number of Phase I samples is usually limited, and users of process monitoring approaches do not want to wait until many samples accumulate. In this paper, we propose using the diagonal matrix consisting of only the diagonal elements of the sample covariance matrix, obtained from the Phase I analysis, to replace the covariance matrix in calculating the critical distance for multivariate process monitoring in Phase II. Note that the diagonal elements of a covariance matrix can be estimated from as few as two individual observations, so this approach is not affected by high dimensionality.

Consider the sampling epoch t and denote the individual observation of the jth quality characteristic by X_tj, where j = 1, …, p and t = 1, 2, …. Let σ² = (σ_1², …, σ_p²)ᵀ denote the vector of in-control variances of the p variables, obtained from the diagonal elements of Σ0. If we define D_σ = diag(σ_1², …, σ_p²), then the corresponding modified Mahalanobis distance is:

D_t = (X_t − μ0)ᵀ D_σ⁻¹ (X_t − μ0) = Σ_{j=1}^p (X_tj − μ_0j)² / σ_j²,   (2)

where μ_0j denotes the jth element of the vector μ0. Let λ_1, …, λ_p denote the eigenvalues of the in-control correlation matrix R = D_σ^{-1/2} Σ0 D_σ^{-1/2} and Λ = diag(λ_1, …, λ_p); then, by the eigenvalue decomposition, R = Γ Λ Γᵀ, where the columns of the orthogonal matrix Γ constitute an orthonormal basis of eigenvectors of R. Now, using the transformation Z_t = Γᵀ D_σ^{-1/2}(X_t − μ0), we have Z_t ~ N_p(0, Λ) under the in-control model. This shows that the modified distance can be rewritten as D_t = Σ_{j=1}^p λ_j U_tj², where the U_tj = Z_tj/√λ_j, for j = 1, …, p, are i.i.d. standard normal random variables. Since Σ_{j=1}^p λ_j = tr(R) = p, the distance D_t is a weighted sum of i.i.d. random variables with a χ²_1 distribution, and thus the mean and variance of D_t are given by

E(D_t) = tr(R) = p,   (3)
Var(D_t) = 2 tr(R²),   (4)

where tr(A) represents the trace of a matrix A. Using the mean and variance of the modified distance, one can define the standardized statistic:

T_t = (D_t − p) / √(2 tr(R²)).   (5)
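To make the computation concrete, the following minimal sketch (Python with NumPy; the function name t_statistic and the argument names are ours, not the paper's) evaluates the modified distance (2) and the standardized statistic (5) for a single new observation, assuming the in-control mean vector, the vector of in-control variances, and the in-control correlation matrix are available.

    import numpy as np

    def t_statistic(x, mu0, sigma2, R):
        """Standardized modified-Mahalanobis statistic of (5) for one observation x."""
        p = len(mu0)
        d = np.sum((x - mu0) ** 2 / sigma2)     # modified Mahalanobis distance (2)
        tr_R2 = np.sum(R * R.T)                 # tr(R^2) as the sum of elementwise products R_ij * R_ji
        return (d - p) / np.sqrt(2.0 * tr_R2)   # standardized statistic (5)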

To derive our asymptotic results, we make the following assumptions:

Assumption 1

For i = 2, 3, 4, we assume that tr(R^i)/p converges to a finite positive limit as p → ∞.

Assumption 2

The largest eigenvalue λ_max of the correlation matrix satisfies λ_max = o(√p) as p → ∞.

Assumption 3

The dimension grows with sample size at a rate of with .

Assumption 4

For some , .

Since Assumptions 1-2 imply that max_{1≤j≤p} λ_j² / tr(R²) → 0, a direct application of the Hájek–Šidák central limit theorem (cf. DasGupta 2008) shows that, for any given t, the statistic T_t has an asymptotic standard normal distribution as p → ∞. It is worth mentioning that the assumptions provided above are not very restrictive. For example, Assumption 1 implies that the traces of powers of the correlation matrix should grow no faster than O(p). Thus, it can hold even when some strong pairwise correlations exist in the data, as in the popular autoregressive (AR) or moving average (MA) structures. For example, we can easily show that tr(R²) ≤ p λ_max. Thus, for an AR(1) correlation structure with (i, j) element ρ^{|i−j|}, we have tr(R²) ≤ p (1 + |ρ|)/(1 − |ρ|) = O(p). The same holds for MA and banded correlation structures, so such processes satisfy Assumption 1. Inspired by the work of Srivastava and Du (2008) on the one-sample test of the mean vector in a high-dimensional setting, Ro et al. (2015) used this asymptotic normality result for outlier detection in high-dimensional datasets. From simulation studies, we observed that the asymptotic normal approximation fails to accurately capture the tails of the exact distribution of T_t. Besides, investigating Table 1 of Ro et al. (2015) reveals that the estimated probabilities of false positives exceed their nominal levels for moderate dimensions and small nominal Type I error rates α. Since small values of α, such as 0.005, are common in constructing control charts, we need a good approximation of the upper quantiles of the exact distribution of T_t.

To improve the accuracy of the approximation, we employ the Cornish–Fisher asymptotic expansion of quantiles, which uses higher-order moments of D_t to account for the effect of non-normality. These expansions were first developed by Cornish and Fisher (1938) and Fisher and Cornish (1960). The expansion can be derived by inverting the Edgeworth expansion of the distribution of T_t; see Hall (1983), Small (2010), and Polansky (2011).

Theorem 1

Using the second-order Cornish–Fisher expansion, the upper percentile of the statistic T_t at significance level α is given by

UCL_α = z_α + (z_α² − 1) κ₃/6 + (z_α³ − 3 z_α) κ₄/24 − (2 z_α³ − 5 z_α) κ₃²/36,   (6)

where z_α is the upper α percentile of the standard normal distribution, and κ₃ = 2√2 tr(R³)/tr(R²)^{3/2} and κ₄ = 12 tr(R⁴)/tr(R²)² are the skewness and excess kurtosis of T_t.

A proof of formula (6) is provided in Appendix A. Depending on how small the Type I error is set, we can use either a first- or a second-order Cornish–Fisher expansion. Although we have given the second-order limit in (6), in our simulation study in Section 3 we only use the first-order Cornish–Fisher expansion

UCL_α = z_α + (z_α² − 1) κ₃/6,   (7)

as it suffices to achieve good results.
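As an illustration, the corrected limit in (7) requires only tr(R²), tr(R³), and a standard normal quantile. The sketch below (Python/SciPy; the helper name is ours, and the formula follows the first-order Cornish–Fisher form reconstructed in (7), so it should be read as an assumption rather than the authors' exact implementation):

    import numpy as np
    from scipy.stats import norm

    def ucl_cornish_fisher(R, alpha=0.005):
        """First-order Cornish-Fisher upper control limit for the standardized statistic T."""
        tr_R2 = np.trace(R @ R)
        tr_R3 = np.trace(R @ R @ R)
        z = norm.ppf(1.0 - alpha)                           # upper alpha quantile of N(0, 1)
        skew = 2.0 * np.sqrt(2.0) * tr_R3 / tr_R2 ** 1.5    # standardized skewness of D
        return z + (z ** 2 - 1.0) * skew / 6.0              # first-order correction as in (7)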

The discussion above suggests a control chart for monitoring the process mean based on the value of T_t for each new observation, provided that μ0 and D_σ (equivalently, the vector of in-control variances), together with tr(R²) and tr(R³), are known or properly estimated. The proposed control chart triggers an out-of-control alarm whenever

T_t > UCL_α.   (8)

When the process is in control, a new observation follows N_p(μ0, Σ0), while in the out-of-control situation the observations follow N_p(μ1, Σ0) with μ1 ≠ μ0. Consequently, for a given standardized shift, the asymptotic Type II error probability of the proposed control chart as p → ∞ can be derived. Since the Cornish–Fisher corrected limit UCL_α converges to z_α as p → ∞ under Assumption 1, the following result provides the asymptotic Type II error probability.

Theorem 2

Under Assumptions 1-2, and for a shift δ = μ1 − μ0 with fixed standardized magnitude η = δᵀ D_σ⁻¹ δ / √(2 tr(R²)), the Type II error probability satisfies β = P(T_t ≤ UCL_α | μ1) → Φ(z_α − η) as p → ∞, where Φ denotes the standard normal CDF.

Proof. First notice that T_t = {D_t − p − δᵀ D_σ⁻¹ δ} / √(2 tr(R²)) + η. Under N_p(μ1, Σ0), we have E(D_t) = p + δᵀ D_σ⁻¹ δ and Var(D_t) = 2 tr(R²) + 4 δᵀ D_σ^{-1/2} R D_σ^{-1/2} δ, that is, Var(D_t)/(2 tr(R²)) → 1 as p → ∞. Thus {D_t − p − δᵀ D_σ⁻¹ δ}/√(2 tr(R²)) is asymptotically standard normal, and UCL_α → z_α. Now, under N_p(μ1, Σ0) and Assumptions 1 and 2,

β = P(T_t ≤ UCL_α) → Φ(z_α − η).

As the proposed chart is a Shewhart-type chart, the in-control and out-of-control ARLs of the proposed control chart are

ARL0 = 1/α  and  ARL1 = 1/(1 − β),   (9)

where β represents the probability of a Type II error when the process is out of control and can be calculated asymptotically using Theorem 2.
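For instance, with α = 0.005 the in-control ARL in (9) is ARL0 = 1/0.005 = 200, and a shift for which Theorem 2 gives β = 0.8 yields ARL1 = 1/(1 − 0.8) = 5, so the chart signals, on average, at the fifth out-of-control observation (purely illustrative numbers).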

In order to use (8) in Phase II, proper estimates of the parameters μ0, D_σ, tr(R²), and tr(R³) from Phase I are needed to obtain good results in Phase II. For estimating tr(R²) based on the m observations in Phase I, one can use the estimator suggested by Srivastava and Du (2008), which is consistent under Assumptions 1 and 3 and is computed from the Phase I sample correlation matrix

R̂ = D̂_σ^{-1/2} S D̂_σ^{-1/2},   (10)

where S is the sample covariance matrix and D̂_σ denotes the diagonal matrix of the sample variances in S. Besides, the following consistent estimator of tr(R³) can be used:

(11)

A proof is provided in Ebadi et al. (2021).
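A naive plug-in version of this estimation step can be sketched as follows (Python/NumPy; the function name is ours, and the bias adjustment used by Srivastava and Du (2008) is deliberately omitted, so this only illustrates the quantities involved rather than the estimator recommended in the text).

    import numpy as np

    def phase1_plug_in_estimates(X):
        """Naive plug-in estimates from an (m x p) Phase I data matrix X (rows are observations)."""
        mu_hat = X.mean(axis=0)                                # sample mean vector
        S = np.cov(X, rowvar=False)                            # sample covariance matrix
        sigma2_hat = np.diag(S)                                # sample variances (diagonal of S)
        R_hat = S / np.sqrt(np.outer(sigma2_hat, sigma2_hat))  # sample correlation matrix as in (10)
        tr_R2_hat = np.sum(R_hat ** 2)                         # plug-in tr(R^2)
        tr_R3_hat = np.trace(R_hat @ R_hat @ R_hat)            # plug-in tr(R^3)
        return mu_hat, sigma2_hat, tr_R2_hat, tr_R3_hat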

In order to remedy the effects of outliers in Phase I, we apply the robust parameter estimation methodology proposed by Ebadi et al. (2021), which modifies the re-weighted minimum diagonal product (RMDP) algorithm using the Cornish–Fisher expansion. They also proposed a finite-sample correction coefficient for better convergence, obtained via a simulation study and careful numerical evaluation, defined as follows

(12)

which, under Assumptions 1 and 3, converges to one asymptotically. We denote the robust estimates of μ0, D_σ, tr(R²), and tr(R³) obtained from the RMDP algorithm of Ebadi et al. (2021) accordingly and use them as initial estimates in our self-starting control chart. These estimates are then updated as new observations arrive. In the next subsection, we introduce a self-starting procedure for monitoring high-dimensional data.

2.2 A self-starting procedure for Phase II

In this section, we provide a procedure to perform Phase I analysis and Phase II monitoring subsequently. The steps of the proposed procedure are the following:

  (i) Collect a historical sample of size m from the multivariate process for Phase I analysis.

  (ii) Implement the robust procedure of Ebadi et al. (2021) (described in Section 2.1) to derive the robust estimates of μ0, D_σ, tr(R²), and tr(R³), together with the finite-sample correction factors. Then identify the potential outlying observations in the Phase I data.

  (iii) Having obtained the estimates in step (ii), apply the control chart (8) of Section 2.1 to each new observation in Phase II.

  (iv) If the new observation is identified as in control, update the estimates of step (ii) by adding the new observation to the Phase I data.

  (v) Repeat steps (iii) and (iv) for subsequent observations until the control chart triggers an out-of-control alarm.

Notice that steps (iii)-(v) define a self-starting Phase II control chart. As articulated in Maboudou-Tchao and Hawkins (2011), to implement a self-starting control chart one needs enough historical samples to obtain initial estimates of the process parameters. Updating these estimates makes the proposed Phase II chart less sensitive to the initial Phase I sample, so the nominal in-control ARL can be achieved more stably across different samples. In other words, in self-starting control charts the process parameters are updated continually as sampling proceeds, while the out-of-control condition is checked concurrently. As the in-control period increases, the estimated mean vector and covariance matrix converge to the true mean vector and covariance matrix, so that the asymptotic normality of the underlying statistic is achieved. There is an important difference between our proposed self-starting chart in steps (iii)-(v) and those in the literature, such as Maboudou-Tchao and Hawkins (2011). Methods in the literature typically require at least p + 1 initial process reading vectors to set up an initial non-singular estimate of the covariance matrix, while our proposed chart does not have this limitation since estimation of the full covariance matrix is not required.
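The control flow of steps (iii)-(v) can be summarized in a short monitoring loop. The sketch below (Python/NumPy) reuses the hypothetical helpers t_statistic and ucl_cornish_fisher from the earlier sketches and, for brevity, replaces the robust RMDP estimation of step (ii) with plain plug-in estimates, so it illustrates the self-starting logic only, not the robust procedure of Ebadi et al. (2021).

    import numpy as np

    def self_starting_monitor(phase1, phase2_stream, alpha=0.005):
        """Monitor Phase II observations one by one, absorbing in-control points into the reference set."""
        data = np.asarray(phase1, dtype=float)                     # current reference set (rows = observations)
        for x in phase2_stream:
            mu_hat = data.mean(axis=0)
            S = np.cov(data, rowvar=False)
            sigma2_hat = np.diag(S)
            R_hat = S / np.sqrt(np.outer(sigma2_hat, sigma2_hat))
            t = t_statistic(x, mu_hat, sigma2_hat, R_hat)
            if t > ucl_cornish_fisher(R_hat, alpha):
                return t                                           # out-of-control alarm: stop and report
            data = np.vstack([data, x])                            # in control: update the estimates (step iv)
        return None                                                # no alarm raised during the stream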

Generally, as new observations are collected in Phase II, updating the covariance matrix estimate becomes computationally demanding in high-dimensional cases. For this purpose, the updating method proposed by Quesenberry (1997) can be useful. Let x̄_n and S_n denote the sample mean vector and sample covariance matrix of the first n observations,

x̄_n = (1/n) Σ_{i=1}^n X_i,   S_n = (1/(n−1)) Σ_{i=1}^n (X_i − x̄_n)(X_i − x̄_n)ᵀ.   (13)

Quesenberry (1997) suggested the following updating formulas,

x̄_{n+1} = x̄_n + (X_{n+1} − x̄_n)/(n+1),   S_{n+1} = ((n−1)/n) S_n + (1/(n+1)) (X_{n+1} − x̄_n)(X_{n+1} − x̄_n)ᵀ,   (14)

to reduce the computational cost of calculating the sample mean and covariance matrix; see also Sullivan and Jones-Farmer (2002). In the next section, we examine the performance of our proposed methods through simulation.
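Before moving on, a minimal sketch of the recursive updates reconstructed in (14) (Python/NumPy; it follows the standard mean/covariance recursions rather than Quesenberry's original notation, and the function name is ours):

    import numpy as np

    def update_mean_cov(xbar, S, n, x_new):
        """Update the sample mean and covariance of n >= 2 observations with one new observation."""
        delta = x_new - xbar
        xbar_new = xbar + delta / (n + 1)                             # updated sample mean
        S_new = (n - 1) / n * S + np.outer(delta, delta) / (n + 1)    # updated sample covariance, as in (14)
        return xbar_new, S_new, n + 1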

3 Simulation Study

This section investigates the performance of the proposed methods through a simulation study, both in the absence and presence of contaminated data. The R software is used for this purpose with two scenarios, denoted Scenario 1 and Scenario 2, where in both scenarios the common mean vector is the zero vector, while the covariance matrices correspond to independent and first-order autoregressive (AR(1)) correlation structures, respectively; that is, the identity matrix for Scenario 1 and a matrix with (i, j) element ρ^{|i−j|} for Scenario 2.
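For readers who wish to reproduce the setup, the two in-control scenarios can be generated as follows (Python/NumPy sketch; the dimension, sample size, and the value of ρ shown here are our own placeholders, since the values used in the paper are not recoverable from the text):

    import numpy as np

    def ar1_correlation(p, rho):
        """AR(1) correlation matrix with (i, j) element rho**|i - j|."""
        idx = np.arange(p)
        return rho ** np.abs(np.subtract.outer(idx, idx))

    rng = np.random.default_rng(0)
    p, n = 100, 200                                                                    # placeholder dimension and sample size
    scenario1 = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)                # independent variables
    scenario2 = rng.multivariate_normal(np.zeros(p), ar1_correlation(p, 0.5), size=n)  # AR(1) with rho = 0.5 (assumed)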

We evaluate the performance of the proposed chart in Phase II via the ARL criterion. Recall that the ARL is the average number of samples taken until an out-of-control signal is observed; ARL0 and ARL1 denote its value in the in-control and out-of-control situations, respectively. A large ARL0 and a small ARL1 are desirable to guarantee few false alarms and fast detection of process changes, respectively. In this paper, we use a total of 10,000 replications to estimate the ARL values. Any deviation of the CDF of T_t from the standard normal leads to an in-control ARL of the corresponding control chart that is very different from the nominal value 1/α, which increases the false-alarm rate of that control chart. Table 1 presents the simulated ARL0 with and without the Cornish–Fisher expansion when the exact parameters are used in Phase II for Scenarios 1 and 2. The UCLs for both charts are determined to give the nominal false-alarm rate α. Table 2 then compares the simulated ARL1 with and without the Cornish–Fisher expansion for different values of p and α when the process mean has shifted by a fixed vector. In both Tables 1 and 2, we also provide the nominal ARL0 and ARL1 obtained from (9). It is clear from Tables 1 and 2 that, when the Cornish–Fisher expansion is used, both the estimated in-control and out-of-control ARLs are generally much closer to their nominal values, while the original statistic proposed by Srivastava and Du (2008) gives ARLs very different from their respective nominal values. Additionally, we performed other simulations with different values of the dimension, false-alarm rate, and shift size. These simulations, not reported here, are consistent with the conclusions presented in Tables 1 and 2. As expected, the average in-control ARL values become closer to the nominal values as the Phase I sample size increases, since incorporating new in-control observations from Phase II into the estimation improves the estimates' accuracy. Another important conclusion is that, when α is smaller, the in-control ARL is generally much closer to its nominal value because more in-control observations are involved in the estimation. Hence, we recommend combining a larger Phase I sample size with a smaller α. Note that the initial samples in Phase I may affect the performance of the chart in Phase II and give different ARLs, but our simulations show that a Phase I sample size of about 200-300 is appropriate. Since our method only needs to estimate the diagonal elements of the covariance matrix, obtaining accurate estimates is much more affordable than for methods based on T², which need to estimate all p(p + 1)/2 elements of the covariance matrix.

We also conduct a simulation to evaluate the effect of correlation on the proposed method. Note that high correlations may occur in applications when the dimensionality is small, but for high-dimensional cases this seems unlikely. For example, as stated by Ahmadi-Javid and Ebadi (2021), in high-dimensional multiple-stream processes (MSPs), a particular class of multivariate processes with two sources of variation, the streams are usually independent or weakly correlated. As a sensitivity analysis, Table 3 investigates the effect of correlation on the simulated ARL0 when the covariance matrix with (i, j) element ρ^{|i−j|} is used and the value of ρ is gradually changed from 0 to 0.9. We set the mean vector to zero. In calculating ARL0, we use the true values of the parameters. It can be observed from Table 3 that, as the correlation between variables increases, the value of ARL0 generally remains close to the nominal values for both α = 0.01 and α = 0.005. However, for a few cases with high correlation, such as ρ = 0.9, and small values of p, the simulated ARL0 is slightly greater than the nominal value, which is a positive point when the process is in control but may increase ARL1. To save space, we do not provide a study of the change in ARL1 as ρ increases. Typically, for smaller values of ρ, the change in this quantity is negligible in comparison to the ARL0 values. However, since D_t is a weighted sum of independent χ²_1 random variables, this shortcoming can be overcome by adopting the Welch–Satterthwaite (W–S) chi-square approximation. More details are available in Satterthwaite (1941, 1946), Welch (1947), and Zhang et al. (2020), who recently showed the effectiveness of the Welch–Satterthwaite approximation in their proposed high-dimensional two-sample test statistic when the variables are highly correlated. Another solution to this phenomenon, which may be worth considering as future work, is to determine a finite-sample correction coefficient for highly correlated multivariate processes through an extensive simulation, such that the control chart achieves the nominal ARL0.
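To make the Welch–Satterthwaite idea concrete: since D is a weighted sum of independent chi-square(1) variables, it can be approximated by a scaled chi-square c·χ²_d whose first two moments match those of D, which gives c = tr(R²)/p and d = p²/tr(R²). A minimal sketch of the resulting limit (our own helper name; this is the generic moment-matching recipe, not an implementation given in the paper):

    import numpy as np
    from scipy.stats import chi2

    def ucl_welch_satterthwaite(R, alpha=0.005):
        """Moment-matched scaled chi-square (Welch-Satterthwaite) control limit for D, on the T scale."""
        p = R.shape[0]
        tr_R2 = np.trace(R @ R)
        c = tr_R2 / p                                 # scale of the approximating chi-square
        d = p ** 2 / tr_R2                            # effective degrees of freedom
        ucl_D = c * chi2.ppf(1.0 - alpha, d)          # limit on the scale of D
        return (ucl_D - p) / np.sqrt(2.0 * tr_R2)     # same limit on the standardized T scale of (5)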

Scenario 1 | Scenario 2
α=0.01 (ARL0=100)  α=0.005 (ARL0=200)  α=0.0027 (ARL0≈370) | α=0.01 (ARL0=100)  α=0.005 (ARL0=200)  α=0.0027 (ARL0≈370)
With CF / Without CF (in each pair of columns)
p=10  104.8 39.3 207.4 55.4 376.3 77.7 | 110.7 32.5 202.7 42.7 356.5 55.1
p=20  102.5 45.5 204.2 70.3 366.5 100.3 | 104.8 36.3 201.2 51.1 347.7 67.5
p=30  101.1 50.8 199.4 79.9 373.7 119.1 | 102.3 39.9 193.1 57.3 346.1 77.9
p=50  100.5 56.8 202.1 92.6 365.6 140.4 | 100.4 43.9 196.9 67.4 353.3 94.6
p=80  100.4 62 202.2 104.8 360.2 165.3 | 102.2 50.1 193.8 76.7 354.9 113.8
p=100 101.1 63.9 199.8 109.1 373.2 173.4 | 98.6 52.1 195.2 82.2 352.4 123
p=150 101.7 68.9 197.7 120.3 373.4 194.7 | 99.1 56.3 198.5 94.2 362.8 142.1
p=200 99.7 71.1 196.3 127.3 370.1 209.3 | 99.5 60 198.3 101.1 361.2 152
Table 1: Simulated ARL0 for different values of p and α when the actual process parameters are used.
Scenario 1 | Scenario 2
α=0.01 (ARL0=100)  α=0.005 (ARL0=200)  α=0.0027 (ARL0≈370) | α=0.01 (ARL0=100)  α=0.005 (ARL0=200)  α=0.0027 (ARL0≈370)
With CF / Without CF (in each pair of columns; the nominal values do not depend on the CF correction)
p=10  Simulated 30.3 14.1 51.1 18.5 84.6 23.4 | 44.7 15.2 79.1 19.6 129.4 23.9
      Nominal 33.2 60.1 102.3 | 41 75.7 131.05
p=20  Simulated 20.7 11.4 34.7 15.7 54.6 20.5 | 27.8 12.1 46.3 15.5 73.8 19.5
      Nominal 22.1 38.5 63.3 | 29.7 53.2 89.8
p=30  Simulated 15.9 9.8 26 13.4 40.2 17.8 | 20.9 10.6 33.9 13.8 52.6 17
      Nominal 16.6 27.9 44.7 | 23.5 41 67.8
p=50  Simulated 10.8 7.4 16.8 10.1 25.5 13.4 | 14.5 8.6 23.1 11.2 33.8 13.9
      Nominal 10.8 17.4 26.8 | 16.4 27.6 44.3
p=80  Simulated 7.3 5.6 10.9 7.4 15.6 9.7 | 10.2 6.8 15.3 8.8 22.2 11
      Nominal 6.9 10.5 15.5 | 11.16 18 27.8
p=100 Simulated 5.9 4.7 8.4 6.2 12.2 8 | 8.6 6.1 12.7 7.8 18.4 9.7
      Nominal 5.5 8.1 11.7 | 9.1 14.3 21.7
p=150 Simulated 4 3.35 5.4 4.3 7.3 5.4 | 6.2 4.5 8.8 5.9 12.1 7.5
      Nominal 3.6 5 6.8 | 6.1 9.2 13.3
p=200 Simulated 2.9 2.6 3.9 3.3 5.1 4.1 | 4.7 3.8 6.7 4.8 8.9 5.8
      Nominal 2.7 3.5 4.6 | 4.6 6.5 9.2
Table 2: Simulated ARL1 for different values of p and α when the actual process parameters are used and the process mean is shifted.
α=0.01 (ARL0=100)
ρ:    0   0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
p=30  103 101 102 100 102 102 104 107 112 124
p=50  102 103 97 101 98 101 100 102 105 115
p=80  101 101 99 100 102 103 98 103 106 111
p=100 101 94 102 101 99 101 102 101 101 106
α=0.005 (ARL0=200)
ρ:    0   0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
p=30  202 202 200 192 200 202 208 196 209 224
p=50  204 209 194 197 192 198 190 203 203 210
p=80  201 199 203 203 193 195 196 194 197 213
p=100 202 205 199 191 202 197 200 197 196 206
Table 3: Effect of correlation on the ARL0 of the proposed chart (covariance matrix with (i, j) element ρ^{|i−j|}, for different values of ρ).

It is common in the literature to judge the performance of a self-starting chart by its out-of-control run lengths; see, for example, Hawkins and Maboudou-Tchao (2007), Zou et al. (2007), and Maboudou-Tchao and Hawkins (2011). Several factors, such as the dimension, the in-control ARL, and the size of the shift in the mean vector, may affect the performance of a self-starting chart. An important factor, especially in the unknown-parameter self-starting setting, is the "learning time" of a chart, which is the length of time the process runs in control before a shift occurs (Maboudou-Tchao and Hawkins 2011). A chart with shorter learning times is highly desirable. We now evaluate the effect of an initial in-control period of length τ on the out-of-control ARL performance of the proposed self-starting chart for different dimensions of the process. Figure 1 displays the effect of the initial in-control period of length τ on ARL1 for several values of p. We note that the asymptotic Type II error probability of the proposed chart depends on the shift only through the parameter η, as

β(η) = Φ(z_α − η),   (15)

where η = δᵀ D_σ⁻¹ δ / √(2 tr(R²)) and δ = μ1 − μ0.

In Figure 1, it is assumed that the new (out-of-control) observations have a mean vector determined such that the standardized shift η is the same in all cases. The ARL0 is set to 200, and the nominal ARL1 for these choices of α and η is 1.0077, calculated from equation (9), for all values of p in both Scenarios 1 and 2. This asymptotic value is shown as a horizontal dashed line in Figure 1. A selection of τ values ranging from 20 to 1000 is used. After an initial in-control period of length τ, the mean vector is shifted, and all signals before time τ are omitted from the calculations. The figure shows that ARL1 generally decreases as τ increases. In other words, when the initial learning period is short, the chart may take a long time to detect the shift, but as τ increases, the detection time improves significantly. The figure also shows that the proposed chart reaches its asymptotic ARL1 by roughly τ = 300 for all dimensions in both scenarios.

Figure 1: A comparison of the simulated ARL1 (coloured curves) of the self-starting chart with the nominal ARL1 (horizontal dashed lines) for different values of τ and p. (a) Scenario 1; (b) Scenario 2.

To provide a better understanding of the proposed control chart's performance in Phase II, we compare it with the RMCD control chart proposed by Chenouri et al. (2009). Using RMCD estimators as robust estimators of the mean vector and covariance matrix, Chenouri et al. (2009) proposed a robust Hotelling's T²-type control chart for individual observations. Comparing our method with the RMCD chart is sensible because both methods are Shewhart-type control charts and use robust estimation with reweighting algorithms. Chenouri et al. (2009) used an extensive simulation to estimate the empirical control limits (quantiles) of their Phase II chart when the Phase I sample size is not large. However, they did not provide these empirical quantiles for large dimensions. It is worth mentioning that a significant advantage of our proposed chart is the use of the Cornish–Fisher expansion and the finite-sample correction coefficient described in Section 2.1: we do not need to estimate the quantiles of the charting statistic by simulation for each configuration, which can be very time consuming, especially when the dimension is large. So, to perform a fair comparison, we assume a very large Phase I sample size of 100,000, for which both methods theoretically achieve their nominal in-control performance. Our proposed RMDP method and the RMCD method of Chenouri et al. (2009) use their respective parameter estimates from Phase I. The function CovMcd in the rrcov package of R, written by Valentin Todorov (2007), is used to calculate the process parameters based on RMCD. We compare the performance of our proposed control chart (without the adaptive feature) with the RMCD chart in Phase II in terms of ARL1 for different amounts of shift. We consider three different rates of contamination in the Phase I data. For the sake of simplicity, we assume that the mean vector of the contaminating distribution in Phase I is the same as that of the Phase II out-of-control observations. Note that the non-centrality parameter of the RMCD chart differs from the parameter η of our chart. So, for any Phase I outlier or Phase II out-of-control observation, we assume that the means of the first 50% of the variables are equally shifted by an amount δ while the covariance matrix remains in control. In Phase I, a fraction of the generated observations are outliers drawn from the shifted distribution, and the remaining observations are generated from the in-control distribution. In Phase II, new out-of-control observations from the shifted distribution are generated to compute ARL1 for both methods based on 10,000 simulations. We only report the comparison between the two methods for Scenario 2; similar conclusions hold for Scenario 1. Figure 2 depicts the comparison of our proposed RMDP chart with the RMCD chart of Chenouri et al. (2009) for several combinations of dimension, contamination rate, and shift size when the desired false-alarm rate is 0.005. The curves related to RMDP and RMCD are in blue and red, respectively. Looking at Figure 2, we conclude that our proposed method outperforms the RMCD chart in terms of ARL1 for a fixed dimension and contamination rate. When the Phase I data are free of contamination, the ARL1 of both methods converges to 1 as the mean shift increases. The proposed RMDP chart's performance in Phase II does not change considerably when contamination exists in the Phase I data. However, the ARL1 of the RMCD-based method stays far from 1 when the Phase I data are contaminated, especially as the contamination level increases. Moreover, when the Phase I data are not contaminated, the simulated ARL1 values of the RMCD method for different shifts are very close to their theoretical values calculated from the non-central chi-square distribution, whereas under contamination the simulated ARL1 values of the RMCD chart are larger than their theoretical values. The simulation results for the other settings support the conclusions mentioned above.

Figure 2: A comparison between the ARL1 values of the proposed RMDP chart and the RMCD chart of Chenouri et al. (2009) in Phase II (panels (a)-(c)), when 50% of the variables are shifted by an amount δ in both Phase I and Phase II and the desired false-alarm rate is 0.005.

4 A real-world example

In this section, we provide an example of Phase II monitoring using a multivariate dataset from a semiconductor manufacturing process. The dataset was recently used by Zou et al. (2015), Shu and Fan (2018), Li et al. (2020), and Mukherjee and Marozzi (2020) for high-dimensional monitoring and is available online at the UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/SECOM). The dataset was collected from July 2008 to October 2008 and consists of 1567 vector observations; for each observation there are 591 continuous measurements (variables). However, the dataset contains a considerable number of null (missing) values and several variables with almost constant values; after cleaning the data, a reduced set of variables remained. Among the 1567 vector observations in the dataset, 1463 are labelled as conforming and the remaining 104 as non-conforming. There is also a label that specifies the timestamp of each sample. The aforementioned papers, which investigated this dataset, treated the vector observations related to conforming parts as in-control Phase I data and then monitored the non-conforming items as Phase II data based on the parameters estimated from the Phase I data. However, a more reasonable approach is to divide the data into Phase I and Phase II based on sampling time. We consider the first 80% of the data as historical (Phase I) data and the remaining 20% as Phase II data for monitoring; in other words, the first 1253 observations form the Phase I data and the remaining 314 observations are used for Phase II monitoring. Our objective is to estimate the process parameters based on the Phase I data and then monitor the Phase II observations with the proposed control chart.

Ignoring the correlation among variables, we conducted marginal Shapiro–Wilk normality tests and concluded that the assumption of normality does not hold for most of the variables (the p-values are very small). So, for each marginal observation of the jth variable, we use the normal-score transformation Φ⁻¹(F̂_j(·)), where F̂_j is the marginal empirical distribution function based on the 1463 conforming observations of the jth variable and Φ⁻¹ is the standard normal quantile function.
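A minimal sketch of this marginal normal-score transformation (Python/NumPy/SciPy, with our own function name; rescaling the empirical CDF away from 0 and 1 to avoid infinite quantiles is our implementation detail, not something stated in the paper):

    import numpy as np
    from scipy.stats import norm

    def normal_scores(X_ref, X):
        """Transform each column of X by the normal quantile of the empirical CDF fitted on X_ref."""
        n = X_ref.shape[0]
        Z = np.empty(X.shape, dtype=float)
        for j in range(X.shape[1]):
            ref = np.sort(X_ref[:, j])
            ranks = np.searchsorted(ref, X[:, j], side="right")  # number of reference values <= x
            Z[:, j] = norm.ppf((ranks + 0.5) / (n + 1.0))        # kept strictly inside (0, 1)
        return Z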

Figure 3: Correlation between first 50 variables of semiconductor manufacturing data.

We implement the proposed robust RMDP approach on the 1253 Phase I samples. We set α = 0.005 to achieve ARL0 = 200 and estimate the required parameters from the Phase I data. Figure 3 shows that the estimated correlation matrix based on the Phase I samples is sparse, although some variables are highly correlated. Hence, while a few strong correlations are present, the pairwise correlations among most variables are weak, which is consistent with our assumptions and the discussion in Section 2.1. After estimating the required parameters, we construct the proposed control chart to examine which of the 314 samples in Phase II are out of control. Figure 4 depicts the control chart for the Phase II observations based on our proposed method, with the UCL computed as described in Section 2.1. In Figure 4, Phase I data, Phase II data, and non-conforming items are shown as blue, grey, and yellow points, respectively. Figure 4 suggests a shift in the process mean within the first 200 observations of Phase I, which implies that the proposed chart can quickly detect such a change and declare the process out of control. It is worth mentioning that, among the 104 non-conforming items, 29 appear within the first 200 items. This observation suggests that an initial investigation of the process could have prevented the subsequent out-of-control or non-conforming items. Looking at Figure 4, we also see an apparent second shift in the process mean in the Phase II data, from sample 1253 to 1567, in which there are 17 non-conforming items. However, similar to the distribution-free control charts proposed by Shu and Fan (2018) and Mukherjee and Marozzi (2020), for most of the non-conforming items the value of the charting statistic lies below the calculated UCL. This seemingly contradictory result might be because the non-conforming items are not due solely to shifts in the process mean. Figure 5 compares the empirical CDFs of the charting statistics of conforming and non-conforming samples with the standard normal CDF. While the empirical CDF of the conforming samples shows a very close match with the standard normal CDF, the empirical CDF of the non-conforming items is shifted to the right of the standard normal distribution, revealing a noticeable change in distribution.

Figure 4: The proposed control chart for the semiconductor data, with the first 80% of the samples considered as Phase I (blue points) and the remaining 20% as Phase II (grey points).
Figure 5: A comparison of the empirical CDFs of the charting statistics of conforming samples (green) and non-conforming samples (blue) with the standard normal CDF (dashed black line).

5 Conclusion

In many practical applications of high dimensional processes, the Phase I sample size is usually small and computing the sample covariance matrix is impractical. In this article, we employ the diagonal matrix of the underlying covariance matrix to monitor the mean vector of high dimensional correlated quality characteristics described by a multivariate normal distribution. The main reason to consider the diagonal matrix instead of the whole sample covariance matrix is its non-singularity for cases where the number of quality characteristics is much larger than the sample size. Moreover, we proposed a unifying approach for Phase I and Phase II analysis by employing a self-starting control chart. In terms of efficiency, the proposed procedure shows good performance in Phase II. Due to recent advances in data-acquisition equipment, the study of high dimensional process monitoring is a vibrant and promising research area. We believe that much more work is needed on this topic.

References

  1. Abramowitz, M., Stegun, I. A. (Eds.). 1972. Handbook of mathematical functions with formulas, graphs, and mathematical tables (Vol. 55). Washington, DC: National bureau of standards.

  2. Abdella, G.M., Al‐Khalifa, K.N., Kim, S., Jeong, M.K., Elsayed, E.A. and Hamouda, A.M., 2016. Variable Selection‐based Multivariate Cumulative Sum Control Chart. Quality and Reliability Engineering International, 33, 565-578.

  3. Ahmadi-Javid, A., Ebadi, M. (2021). A two-step method for monitoring normally distributed multi-stream processes in high dimensions. Quality Engineering, 33, 143-155.

  4. Bersimis, S., S. Psarakis, and J. Panaretos. 2007. Multivariate Statistical Process Control Charts: An Overview. Quality and Reliability Engineering International, 23, 517–543.

  5. Capizzi, G., Masarotto, G. 2011. A least angle regression control chart for multidimensional data. Technometrics, 53, 285–296.

  6. Chen, N., Zi, X., Zou, C. 2016. A distribution-free multivariate control chart. Technometrics, 58, 448-459.

  7. Chenouri, S. E., Steiner, S. H., Variyath, A. M. 2009. A multivariate robust control chart for individual observations. Journal of Quality Technology, 41, 259-271.

  8. Cornish, E. A., Fisher, R. A. 1938. Moments and cumulants in the specification of distributions. Review of the International Statistical Institute, 5, 307-320.

  9. DasGupta, A. 2008. Asymptotic theory of statistics and probability. Springer Science & Business Media.

  10. Ebadi, M., Chenouri, S., Steiner, S. H. 2021. Phase I Analysis of High-Dimensional Multivariate Processes in the Presence of Outliers. Arxiv Preprint.

  11. Fisher, S. R. A., Cornish, E. A. 1960. The percentile points of distributions having known cumulants. Technometrics, 2, 209-225.

  12. Hall, P. 1983. Inverting an Edgeworth expansion. The Annals of Statistics, 11, 569-576.

  13. Hawkins, D. M., and Maboudou-Tchao, E. M. 2007. Self-starting multivariate exponentially weighted moving average control charting. Technometrics, 49, 199-209.

  14. Jiang, W., Wang, K., Tsung, F. 2012. A variable-selection-based multivariate EWMA chart for process monitoring and diagnosis. Journal of Quality Technology, 44, 209–230.

  15. Kim, S., Jeong, M. K., & Elsayed, E. A. 2020. A penalized likelihood-based quality monitoring via L2-norm regularization for high-dimensional processes. Journal of Quality Technology, 52, 265-280.

  16. Li, W., Xiang, D., Tsung, F., Pu, X. 2020. A diagnostic procedure for high-dimensional data streams via missed discovery rate control. Technometrics, 62, 84-100.

  17. Maboudou-Tchao, E. M., Hawkins, D. M. 2011. Self-starting multivariate control charts for location and scale. Journal of Quality Technology, 43, 113-126.

  18. Mahalanobis, P. C. 1936. On the generalised distance in statistics. Proceedings of the National Institute of Science of India, 12, 49-55.

  19. Mukherjee, A., & Marozzi, M. 2020. Nonparametric Phase-II control charts for monitoring high-dimensional processes with unknown parameters. Journal of Quality Technology, 1-21.

  20. Polansky, A. M. 2011. Introduction to statistical limit theory. CRC Press.

  21. Quesenberry, C. P. 1997. SPC methods for quality improvement. New York: Wiley.

  22. Ro, K., Zou, C., Wang, Z., & Yin, G. (2015). Outlier detection for high-dimensional data. Biometrika, 102, 589-599.

  23. Satterthwaite, F. E. 1941. Synthesis of variance. Psychometrika, 6, 309-316.

  24. Satterthwaite, F. E. 1946. An approximate distribution of estimates of variance components. Biometrics bulletin, 2, 110-114.

  25. Shu, L., Fan, J. 2018. A distribution‐free control chart for monitoring high‐dimensional processes based on interpoint distances. Naval Research Logistics (NRL), 65, 317-330.

  26. Small, C. G. 2010. Expansions and asymptotics for statistics. CRC Press.

  27. Srivastava, M. S., 2005. Some tests concerning the covariance matrix in high dimensional data. Journal of the Japan Statistical Society, 35, 251-272.

  28. Srivastava, M. S., and Du, M. 2008. A test for the mean vector with fewer observations than the dimension. Journal of Multivariate Analysis, 99, 386-402.

  29. Srivastava, M. S., Katayama, S., Kano, Y. 2013. A two sample test in high dimensional data. Journal of Multivariate Analysis, 114, 349-358.

  30. Srivastava, M. S., Yanagihara, H. 2010. Testing the equality of several covariance matrices with fewer observations than the dimension. Journal of Multivariate Analysis, 101, 1319-1329.

  31. Sullivan, J. H., Jones-Farmer, L. A. 2002. A self-starting control chart for multivariate individual observations. Technometrics, 44, 24-33.

  32. Wang, K., Jiang, W. 2009. High-dimensional process monitoring and fault isolation via variable selection. Journal of Quality Technology, 41, 247–258.

  33. Welch, B. L. 1951. On the comparison of several mean values: an alternative approach. Biometrika, 38, 330-336.

  34. Woodall, W. H., Montgomery, D. C. 2014. Some current directions in the theory and application of statistical process monitoring. Journal of Quality Technology, 46, 78-94.

  35. Zhang, L., Zhu, T., Zhang, J. T. 2020. A simple scale-invariant two-sample test for high-dimensional data. Econometrics and Statistics, 14, 131-144.

  36. Zou, C., Qiu, P. 2009. Multivariate statistical process control using LASSO. Journal of American Statistical Association, 104, 1586–1596.

  37. Zou, C., Wang, Z., Zi, X., Jiang, W. (2015). An efficient online monitoring method for high-dimensional data streams. Technometrics, 57, 374-387.

  38. Zou, C., Zhou, C., Wang, Z., Tsung, F. 2007. A self-starting control chart for linear profiles. Journal of Quality Technology, 39, 364-375.

Appendix A

In this Appendix, we derive the Cornish–Fisher expansion in (6) for the proposed test statistic T_t. We first briefly review the Cornish–Fisher expansion approach; for more details, we refer the reader to Fisher and Cornish (1960), Abramowitz and Stegun (1972), and Polansky (2011). Let Y_1, Y_2, … be a sequence of independent and identically distributed random variables with finite mean and variance. Denote the cumulative distribution function of the standardized statistic of interest by F, and let q_α represent the α quantile of F, that is, F(q_α) = α. The Cornish–Fisher asymptotic expansion of q_α with respect to the standard normal quantile z_α is given by a polynomial adjustment of z_α, where

(16)

with , and