Phase I Analysis of High-Dimensional Multivariate Processes in the Presence of Outliers

10/26/2021
by Mohsen Ebadi, et al.

One of the significant challenges in monitoring the quality of products today is the high dimensionality of quality characteristics. In this paper, we address Phase I analysis of high-dimensional processes with individual observations when the number of samples collected over time is limited. Using a new charting statistic, we propose a robust procedure for parameter estimation in Phase I that remains efficient when outliers or contamination are present in the data. A consistent estimator is proposed for parameter estimation, and a finite sample correction coefficient is derived and evaluated through simulation. We assess the statistical performance of the proposed method in Phase I in terms of the probability of signal criterion, both in the absence and in the presence of outliers. We show that, in both phases, the proposed control chart scheme effectively detects various kinds of shifts in the process mean. In addition, we present two real-world examples to illustrate the applicability of our proposed method.


1 Introduction

In industrial practice, control charts are widely used for monitoring quality characteristics. In many industries the quality of a product is related to several correlated characteristics, and their combined effect describes product quality. Multivariate control charts have been developed for such situations, because a collection of separate univariate control charts ignores the correlation between the quality characteristics and therefore does not provide the whole picture of product quality. The T^2 statistic proposed by Hotelling (1947) is a well-known tool for monitoring the mean vector of multivariate processes, and most of the early papers on this topic concentrated on this statistic due to its ease of computation. However, when the number of variables is greater than the number of observations, the sample covariance matrix used in the statistic is singular. In this case, methods based on T^2 perform poorly and become unreliable. Woodall and Montgomery (2014) pointed out that monitoring high-dimensional data, such as those coming from sensors, social networks, health multistream systems, and cloud computing applications, is a complex and important area of research. With advances in computing and modern data-acquisition equipment, high-dimensional data are common in many environments, which poses severe challenges to the applicability of traditional multivariate monitoring methods. Although multivariate control charting methods are abundant (see Bersimis et al. (2007) and Ebadi et al. (2021a) for literature reviews), the literature on monitoring changes in the mean vector of high-dimensional multivariate processes is sparse. Moreover, the few existing works mainly focus on Phase II monitoring of high-dimensional processes, as explained next. Variable selection (VS)-based control charts were recently introduced for monitoring multivariate data based on the reasonable assumption that assignable causes usually affect only a small portion of the monitored quality characteristics rather than all variables simultaneously. Wang and Jiang (2009) proposed using a forward variable selection algorithm (FVS) to identify the subset of shifted variables, combined with a Shewhart-type control chart, referred to as the VS-MSPC chart. Jiang et al. (2012) suggested using a multivariate EWMA (MEWMA) control chart instead of a Shewhart-type control chart to accumulate recent observations, making the integrated VS-MEWMA chart more sensitive to small shifts. However, Abdella et al. (2016) explained that malfunctions of the variable selection procedure might lead to poor performance of the VS-MEWMA chart in detecting small process changes and proposed a VS-MCUSUM control chart to improve the detection of small changes in the mean vector. Zou and Qiu (2009) used the least absolute shrinkage and selection operator (LASSO) as a variable selection method and applied the EWMA control chart to develop a LASSO-based EWMA (LEWMA) chart for monitoring the process mean. The LASSO-based schemes assume that the number of potential out-of-control variables is not fixed a priori and that any subset of the monitored variables can potentially shift. Capizzi and Masarotto (2015) compared the performance of three VS-based control charts, namely the least angle regression (LAR) control chart proposed by Capizzi and Masarotto (2011), the LEWMA chart, and the VS-MEWMA chart.

Generally, monitoring a multivariate process consists of two phases, Phase I and Phase II. According to Woodall (2000), the primary concern of Phase I is to analyze historical data to understand the underlying variation and determine the stability of the process. Having removed those samples associated with any assignable causes in Phase I, we can estimate the in-control values of the process parameters and use them in designing control charts for Phase II monitoring of the process. The primary purpose of Phase II, on the other hand, is to quickly detect shifts in the process parameters from the in-control (IC) parameter values estimated in Phase I. A significant challenge for monitoring high-dimensional processes is that often only a small reference dataset is available, while at the same time the dimension of the Phase I data is greater than the Phase I sample size. Hence, the Phase I analysis cannot be performed as usual, and the historical observations are insufficient to accurately estimate all parameters of the in-control distribution. Besides, outliers may exist in the Phase I data, which makes this analysis even harder. There is a need to develop more robust estimators of the process parameters in the high-dimensional setting. We illustrate the issue using a dataset of vertical density profiles (VDP) of engineered wooden boards. This dataset has been studied by several authors, such as Walker and Wright (2002), Woodall et al. (2004), Williams et al. (2007), and Wang and Jiang (2009). In the VDP data, each of the 24 profiles consists of 314 points, corresponding to 314 random variables. Wang and Jiang (2009) investigated applying the VS-MSPC chart to this high-dimensional process. However, VS-based control charts require an estimate of the process covariance matrix, obtained from Phase I analysis, to construct the charting statistics. Since only 24 samples are available, estimating the process covariance matrix is impossible, so they suggested resampling each profile by taking one point out of every 16 points to reduce the data dimension to 20. However, this dimension reduction approach may discard important information, and the estimated covariance matrix may be very different from the real covariance matrix. This estimation problem becomes even more difficult if some outliers contaminate the data. In many practical situations, the covariance matrix cannot be meaningfully estimated from the original data due to the “curse of dimensionality”. In this paper, we apply a robust method to estimate the parameters and propose a new control chart for Phase I. In our new chart, rather than estimating all elements of the covariance matrix, we only estimate its diagonal elements. We show that our approach is very effective in Phase I analysis of the process mean, especially when the sample size is small compared to the number of variables. The rest of this paper is organized as follows. In Section 2, we develop our proposed charts based on the diagonal elements of the sample covariance matrix. We use a robust estimation procedure to construct the proposed chart in this section and present a finite sample correction coefficient for the Phase I analysis. Using Monte Carlo simulations, the performance of the proposed chart is evaluated in terms of the probability of signal in Section 3. In Section 4, two illustrative examples are investigated to demonstrate the performance of the proposed methodology. Our conclusions are summarized in Section 5.

2 Proposed Method

In this section, we begin by proposing a control chart for high-dimensional processes and then present an estimation approach for the underlying parameters of the in-control processes in Phase I. We further propose a finite sample correction coefficient to be used for the Phase I analysis.

2.1 Charting statistic and robust parameter estimation

Consider a multivariate process with quality characteristics for which there are independent and identically distributed (i.i.d.) historical (reference) observations in Phase I. Under the in-control situation, the process follows a multivariate normal distribution with mean vector and covariance matrix . In practice, the available number of Phase I samples is usually limited (smaller than the dimension), and the standard sample covariance matrix is singular, so that the typical approaches based on the T^2 statistic become impractical. Consider the sampling epoch and denote the individual observation of the th quality characteristic variable by , where , and . Let denote the vector of in-control variances of the variables obtained from the diagonal elements of . If we define , then the corresponding modified Mahalanobis distance is:

(1)

where denotes the th element of the vector . A Phase I control chart, however, can be obtained by replacing and with the sample mean and covariance matrix, respectively (Bersimis et al. 2007). Let denote the eigenvalues of the in-control correlation matrix ; then the mean and variance of are given by (see Ebadi et al. 2021b for the proof)

(2)

where denotes the trace of the matrix . Hence, the following test statistic can be defined:

(3)

It can be shown that, for any given , the statistic has an asymptotic standard normal distribution as under the following assumptions:

Assumption 1

For , we assume that .

Assumption 2

The eigenvalues of the correlation matrix satisfy .

Assumption 3

The dimension grows with sample size at a rate of with .

Assumption 4

For some , .

A discussion of these assumptions is provided in Ebadi et al. (2021b). To provide better accuracy in the tails when small values of , such as 0.005, are used in constructing control charts, one can modify the statistic using the first-order Cornish-Fisher expansion

(4)

which suffices to achieve good results (Ebadi et al. 2021b).

Hence, the modified charting statistic triggers an out-of-control alarm whenever

(5)

When the process is in-control, a new observation follows , while in the out-of-control situation the observations follow . Consequently, assuming , the asymptotic Type II error probability of the proposed control chart as can be derived as (see Ebadi et al. 2021b for more details and the proof).
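To make the construction concrete, the following R sketch computes a version of the charting statistic under stated assumptions: the modified Mahalanobis distance in (1) is taken to be the sum of squared componentwise deviations from the in-control mean scaled by the in-control variances, its in-control mean and variance are taken to be tr(R) = p and 2 tr(R^2) as in (2), and a plain normal quantile is used in place of the Cornish-Fisher-adjusted limit of (4)-(5). The function name and the test data are hypothetical.

# Sketch only; assumes the simplified forms of (1)-(3) described above and a plain normal limit.
phase1_T <- function(X, mu0, sigma2, R) {
  # X: m x p matrix of Phase I observations; mu0: in-control mean vector
  # sigma2: in-control variances (diagonal of Sigma); R: in-control correlation matrix
  p <- ncol(X)
  M <- rowSums(sweep(X, 2, mu0)^2 / matrix(sigma2, nrow(X), p, byrow = TRUE))
  (M - p) / sqrt(2 * sum(R * R))   # sum(R * R) equals tr(R %*% R) for symmetric R
}
set.seed(1)
p <- 50; m <- 30
X  <- matrix(rnorm(m * p), m, p)                       # illustrative in-control Phase I data
Ti <- phase1_T(X, mu0 = rep(0, p), sigma2 = rep(1, p), R = diag(p))
which(Ti > qnorm(1 - 0.005))                           # observations signalling at alpha = 0.005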

As mentioned, proper estimates of the parameters, , , , and in Phase I are needed. We now discuss the methodology for the robust estimation of the parameters of the proposed control chart. It is known that, under the normality assumption of the Phase I data, the maximum likelihood estimates of these parameters are optimal. However, it is also well known that these estimators, sometimes referred to as the classical estimators, are sensitive to outlying observations in the Phase I dataset. This section presents an algorithm for robust estimation of the Phase I parameters. We first introduce consistent estimators for and , and then briefly explain the algorithm. Recall that there are i.i.d. historical observations collected for Phase I analysis. The sample correlation matrix in Phase I, based on observations, can be represented by

(6)

where is the sample covariance matrix and denotes the diagonal matrix of the sample variances in . In practice, is unknown. To use (3) and (5), one needs to estimate and , in addition to and . From classical multivariate statistics, we know that for a fixed , and are consistent estimators of and , respectively, as . However, this is not the case when both and grow unbounded. Motivated by the results in Bai and Saranadasa (1996) and Srivastava (2005), Srivastava and Du (2008) proposed as the uniformly minimum variance unbiased estimator (UMVUE) of . Furthermore, since when , Srivastava and Du (2008) suggested simply using and additionally established its consistency under Assumptions 1 and 3, in the sense that

Srivastava and Du (2008) were concerned with a hypothesis testing problem involving the mean vector of a multivariate normal distribution. They used a test statistic similar to in (3), in which, in its denominator, the parameter is replaced by an estimator of the form , where as . Srivastava and Du (2008) also argued that leads to a faster convergence to normality. However, the current paper focuses on process monitoring, for which our numerical investigation showed that the choice of Srivastava and Du (2008) is poor. In the next subsection and Appendix B, via a simulation study and careful numerical evaluations, we propose using

(7)

instead. It is easy to show that under Assumptions 1 and 3, as .
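Only as an illustration of the quantity being estimated, the following R line computes a naive plug-in value of tr(R^2) from the Phase I sample correlation matrix; the estimator actually proposed in (7) differs from this plug-in and is not reproduced here, so this sketch should not be read as the proposed estimator.

est_trR2_plugin <- function(X) {
  # X: m x p Phase I data matrix; returns tr(Rhat %*% Rhat) via the Frobenius norm
  Rhat <- cor(X)
  sum(Rhat * Rhat)
}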

Recall that we introduced a Cornish-Fisher expansion for the distributional quantiles of . This result leads to the threshold (5), which is used to chart . To calculate the threshold (5) in practice, we must provide an estimate of based on the Phase I data. The following theorem provides a consistent estimator of .

Theorem 1

Under Assumptions 1 and 3, converges to in probability as . So, a consistent estimator of is given by

(8)

A proof for Theorem 1 is provided in Appendix A.

Under Assumptions 1, 3, and 4, and for outlier-free data, Ro et al. (2015) showed that, when the classical estimators denoted by and are used, the following result holds

(9)

Although the result (9) was proved for outlier-free data, it remains valid when and are replaced by consistent robust estimators in data contaminated with outliers. To obtain robust parameter estimators for the hypothesis testing problem of location, Ro et al. (2015) introduced re-weighted minimum diagonal product (RMDP) estimators by modifying the re-weighted minimum covariance determinant (RMCD) algorithm of Rousseeuw and Van Driessen (1999). The current paper proposes a robust estimation procedure for parameter estimation in Phase I, similar to Ro et al. (2015). However, due to the use of the Cornish-Fisher expansion, we must additionally estimate and modify all computational steps of the algorithm accordingly to accommodate this change. This modification makes our algorithm quite different from that of Ro et al. (2015). In what follows, we provide a summary of our proposed re-weighting algorithm.

Motivated by the RMCD estimators of Rousseeuw and Van Driessen (1999), we start by searching for a subset of size in Phase I data with the smallest product of the diagonal elements of the sample covariance matrix. The integer is the sample size to achieve breakdown value for the output estimators, where . See Rousseeuw and Van Driessen (1999). More formally, let denote the size or cardinality of the set and . For any and the sub-sample , let and denote the sub-sample mean and covariance matrix, respectively. The minimum diagonal product (MDP) estimator of the mean vector is defined as

(10)

Notice that denotes the matrix of diagonal elements of . The MDP estimator of the diagonal matrix is given by

(11)

where is a scaling factor depending on , , and to ensure consistency and finite sample accuracy of for multivariate normal data. See Croux and Haesbroeck (1999) and Pison et al. (2002). Moreover, we propose the following MDP estimators of and

(12)
(13)

where is the correlation matrix associated with . Algorithm 1 of Ro et al. (2015) adapted the fast minimum covariance determinant (MCD) algorithm of Rousseeuw and Van Driessen (1999) to obtain the MDP estimates of and . In the current paper, we use the MDP estimates of Ro et al. (2015) as initial estimates and provide adjusted or re-weighted MDP estimates incorporating the Cornish-Fisher expansion. Our re-weighted algorithm below is motivated by Rousseeuw and Van Driessen (1999) and Ro et al. (2015); a simplified code sketch of the re-weighting pass is given after the algorithm.

Algorithm 1: Reweighted MDP with Cornish-Fisher expansion

  • Step 1. Initialize and , for example and .

  • Step 2. Run Algorithm 1 in Ro et al. (2015) to obtain the raw MDP estimates and , and then calculate and using equations (12) and (13), respectively, to provide initial raw MDP estimates of and .

  • Step 3. Calculate the value of for . From equations (3) and (5), the observation is identified as an outlier if

    (14)

    where

    (15)
  • Step 4. Assign a weight to each observation according to the threshold in (14) by if the -th observation is identified as an outlier and , otherwise. Let , , and represent the sample mean, diagonal matrix of variances and the correlation matrix, respectively, associated with the sample covariance matrix computed from all observations for which .

  • Step 5. Denote the re-weighted estimates of , , and by , , and , respectively and update the traces through

    (16)
    (17)

    Since an updated is not available until the end of Step 6, for the -th observation with , using Proposition 2 in Ro et al. (2015), compute the refined distance

    (18)

    where is the standard normal density function. Then, evaluate the outlyingness of each observation with the rejection region

    (19)

    where is obtained from (15) by substituting with .

  • Step 6. Update the weights of the observations based on their outlyingness determined in Step 5. Calculate the process parameters by updating , , , and based on the new observation weights.
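The following R sketch illustrates only the re-weighting pass of Steps 3 and 4, starting from given raw MDP estimates. The Cornish-Fisher threshold of (14)-(15) and the refined distance of (18) are specific to the algorithm above and are not reproduced; a plain normal quantile is used instead, so the threshold and the function name are assumptions.

reweight_mdp <- function(X, mu_raw, s2_raw, R_raw, alpha = 0.01) {
  # X: m x p Phase I data; mu_raw, s2_raw, R_raw: raw MDP mean, variances, and correlation matrix
  p  <- ncol(X)
  Ti <- (rowSums(sweep(X, 2, mu_raw)^2 /
                 matrix(s2_raw, nrow(X), p, byrow = TRUE)) - p) /
        sqrt(2 * sum(R_raw * R_raw))
  w    <- as.numeric(Ti <= qnorm(1 - alpha))          # w = 0 flags a suspected outlier
  keep <- which(w == 1)
  S    <- cov(X[keep, , drop = FALSE])                # covariance of the retained observations
  list(weights = w,
       mu = colMeans(X[keep, , drop = FALSE]),        # re-weighted mean
       s2 = diag(S),                                  # re-weighted variances
       R  = cov2cor(S))                               # re-weighted correlation matrix
}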

2.2 Finite sample correction

In this subsection, we propose a finite sample correction factor for the asymptotic distribution of the test statistic . This correction will improve the accuracy of the control limit obtained from the asymptotic normality with Cornish-Fisher expansion in Phase I.

Lemma 1

For the usual estimators and , we have

(20)

where the subscript ‘p’ denotes ‘in probability’.

Proof. Let be the th diagonal element of the sample covariance matrix . From Corollary 2.6 of Srivastava (2009), for large , we have . Thus, and

(21)

Since

(22)

using (21), we obtain

Substituting these into (22), we conclude the result.

Recall the definition of in equation (3) and define , when is substituted by in (3). From Lemma 1, we can write

(23)

If we replace by its estimate introduced in Section 2.1, we can write . Our simulation study, reported in Appendix B, suggests a finite sample correction factor of the form

(24)

It is worth mentioning here that although the Cornish-Fisher expansion was based on the normality assumption, the result of Lemma 1, and thus the derivation of the finite sample correction factor, does not heavily rely on this assumption. Moreover, Srivastava (2009) showed that the one-sample version of the test statistic can also be applied to non-normal data under certain assumptions. So, a natural suggestion for non-normal processes is to use the statistic together with the finite sample correction defined in (24). A future research direction is to improve the performance of the proposed charting statistic under various distributional assumptions. In the next section, we examine the performance of our proposed methods through simulation.

3 Performance evaluation

In this section, we evaluate the proposed methods through a simulation study, both in the absence and in the presence of contaminated data. We use R (R Development Core Team, 2019) for all the computations. We consider two scenarios; in both, the common mean vector is , while the covariance matrices are and , respectively, corresponding to independent and autoregressive correlation structures. All the simulation runs are performed based on 10,000 replications unless indicated otherwise.

We first graphically compare the accuracy of the proposed charting statistics and for varying dimensions . Figure 1 compares the cumulative distribution function (CDF) of the statistics and when the data distribution is standard multivariate normal, for . As we can see, the depicted CDF of the statistic (red curve) generally shows a considerable deviation from the standard normal CDF (black dashed curve), especially in the tails and for smaller values of , while the CDF of the statistic (blue curve) is very close to the standard normal. We expected this result, as the Cornish-Fisher adjustment provides a better approximation of the quantiles based on the asymptotic normal distribution.

Figure 1: A comparison of the CDF of the statistic (red solid curve) and (blue solid curve) with the standard normal CDF (black dashed curve).

To evaluate Phase I performance, we begin by evaluating the estimation accuracy of the proposed robust approach of Section 2.1. Figure 2 depicts the boxplots of the ratio of the re-weighted MDP estimates to the actual value of (vertical axis) based on simulations under Scenarios 1 and 2 (horizontal axis). This figure shows the accuracy of the proposed robust estimation method for dimensions and Phase I sample size when there is no contamination in the Phase I data. Similar figures can be obtained for other values and when contamination is present. From these results, we conclude that for different combinations of and under both Scenarios 1 and 2, the re-weighted MDP estimator is unbiased for . However, the variance under Scenario 1 is generally less than under Scenario 2. We expect similar conclusions for different combinations of and under a variety of other scenarios.

Figure 2: Boxplots of the ratio of the re-weighted MDP estimates to the actual value of for p = 30, 50, 80, and 100 with and outlier-free data.

The statistical performance of a Phase I control chart is usually determined in terms of the probability of an out-of-control signal. This probability is defined as the probability that at least one observed value of the underlying statistic falls outside the control limits when performing control charting on a set of historical observations. When the Phase I process is in-control, the probability of signal is the same as the overall false alarm rate. Figure 3 provides a performance evaluation of the proposed Phase I control chart in achieving the desired overall false alarm rate when the process is in-control. Let be the overall probability of a false alarm in Phase I, and denote the probability of a false alarm for an individual value of the statistic. We use the following formula to calculate the value of based on the desired . See Lehmann and Romano (2006) and Williams et al. (2006).

(25)

Note that (25) implies that a control chart should be very accurate to achieve the very small computed based on . For example, when and , then . We consider different values of and in our simulations. For each combination of and , we simulated Phase I datasets and recorded the number of signals. A signal is produced in a simulated dataset whenever at least one of the values exceeds the UCL defined based on and using threshold (5). The true (overall) false-alarm rate is then estimated as the total number of signals divided by . Figure 3 depicts the estimated overall false-alarm rate for and . A reference line for the value of is also shown as a horizontal dashed line. By comparing this figure to Figures 4 and 5 of Williams et al. (2006), it can be concluded that the overall false-alarm rate of the proposed chart converges quickly to its nominal value, usually after sample sizes in all cases.
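Because the right-hand side of (25) is not rendered above, the short R sketch below assumes (25) is the usual relation between the overall Phase I false-alarm probability and the per-observation false-alarm rate used by Williams et al. (2006); the numerical values are only illustrative.

# Assumed form of (25): alpha = 1 - (1 - FAP)^(1/m) for m Phase I observations.
alpha_individual <- function(FAP, m) 1 - (1 - FAP)^(1 / m)
alpha_individual(FAP = 0.05, m = 50)   # per-observation alpha when the desired overall FAP is 0.05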

Recall that Assumption 1 implies the correlations among variables are not cumulatively large. This requirement may seem to limit the range of applications of the proposed methodology. However, note that although only the diagonal elements of the covariance matrix appear in the numerator of the proposed statistic, the correlation structure is accounted for in the denominator through . Besides, our Cornish-Fisher expansion uses and , which depend directly on the correlation structure. Indeed, this is one of the reasons we believe our method is better than similar statistics without the CF expansion. Moreover, very high correlations may occur in applications when the dimensionality is small, but for high-dimensional cases this seems unlikely. Nevertheless, it is useful to consider the effect of correlation on the proposed method. To perform a sensitivity analysis, we provide Figure 4 and Table 1 (in Appendix B) to give better insight into our proposed methods. Figure 4 investigates the effect of correlation on the calculated overall false alarm rate when the covariance matrix is used and the value of is gradually changed from 0 to 0.9. We set the mean vector to zero. Figure 4 shows the results for and , and similar conclusions can be obtained for other values of . It can be concluded from this figure that for small and moderate correlations, the convergence to the nominal overall false alarm rate is fast, while for some high correlations, such as , the convergence is slower. However, since is a weighted sum of independent random variables, we can overcome this shortcoming by adopting the Welch-Satterthwaite (W-S) approximation. Satterthwaite (1941, 1946) and Welch (1947) provided a moment-matching approach to approximate the distribution of a weighted sum of independent random variables, which statisticians have successfully used over the past 70 years. We postpone a careful investigation of this promising line of research in connection with our work to future research. Nevertheless, we point out that Zhang et al. (2020) recently showed the effectiveness of the Welch-Satterthwaite approximation in their proposed high-dimensional two-sample test statistic when the variables are highly correlated.
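For completeness, a minimal moment-matching sketch of the Welch-Satterthwaite idea is given below: a weighted sum of independent chi-square(1) variables, with weights equal to the eigenvalues of R, is approximated by a scaled chi-square with matched mean and variance. This is not part of the proposed chart; it only illustrates the approximation mentioned above.

ws_approx <- function(lambda) {
  # Match the first two moments of sum_j lambda_j * chisq(1) with g * chisq(d)
  g <- sum(lambda^2) / sum(lambda)     # scale
  d <- sum(lambda)^2 / sum(lambda^2)   # effective degrees of freedom
  list(scale = g, df = d)
}
# e.g. an upper 1% point of the approximating distribution for a correlation matrix R:
# with(ws_approx(eigen(R, only.values = TRUE)$values), scale * qchisq(0.99, df))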

Figure 3: Overall probability of a false alarm for Scenarios 1 and 2 and , where the desired overall false-alarm rates are 0.01 and 0.05.
Figure 4: Effect of correlation on the overall false alarm rate of the proposed chart (covariance matrix with different values of ) for Scenario 2.

We also evaluate the performance of the proposed chart in detecting a mean shift in Phase I for different values of the contamination rate and shift size. We assume that there are outlying observations with distribution , while the remaining observations are generated from the in-control distribution . Recall the asymptotic Type II error probability of the proposed chart in Section 2.1, in which the amount of mean shift is given through the parameter as

(26)

where . So, we only report the simulation results in terms of the non-centrality parameter . Figure 5 depicts the probability of signal under Scenario 2 for dimensions , contamination rates , Type I errors , and non-centrality parameter values . From Figure 5, one can see that when the value of is zero, the probability of signal is about , while as the value of increases, the probability of signal also increases. Moreover, similar to the Phase I control charts based on RMCD (see, for example, Variyath and Vattathoor 2014, Figures 4-9), we observe that as the contamination rate increases, the probability of an out-of-control signal decreases.

Figure 5: Probability of signal for Scenario 2 and , , where the desired overall false-alarm rates are 0.01 and 0.05.

Finally, we compare our outlier detection method with another popular robust method, the minimum covariance determinant (MCD), which was used by Vargas (2003) for Phase I analysis of multivariate processes with individual observations. Vargas (2003) proposed replacing the classical estimators of the mean and covariance with MCD estimators in the T^2 statistic and used the following UCL for the Phase I control chart

(27)

We use the function CovMcd in the rrcov package of R to calculate the process parameters based on MCD in Phase I. Out of the generated Phase I observations, of them are outliers with distribution , and the remaining observations are generated from the in-control distribution . We consider three different rates of contamination in the Phase I data and assume that the means of the shifted variables are all shifted by the amount of , while the covariance matrix remains in-control. Two criteria are used to assess outlier detection performance: the Type I error rate, which is the proportion of good observations that are wrongly categorised as outliers, and the Type II error rate, which is the proportion of contaminated observations that are incorrectly labelled as good ones. We only report the comparison between the two methods for Scenario 2 with and , but similar conclusions hold for Scenario 1. In the Type I comparisons, our RMDP method showed much better convergence to the nominal value. For example, for both and , the Type I error of our proposed RMDP chart is almost 0.05, while it is almost 0.25 for the MCD chart of Vargas (2003). Figure 6 depicts the comparison of the (outlier) detection power of the two methods when the nominal significance level is chosen to be 0.05. From Figure 6, one can conclude that the detection power of both methods increases as increases, but the power of our proposed method is generally higher than that of the MCD method. This superiority is more pronounced for larger values of , as our method reaches its highest power much faster than the MCD method.
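For reference, the MCD-based comparison chart can be set up with rrcov::CovMcd as mentioned above. The exact UCL in (27) is not rendered in the text, so the chi-squared quantile below is a placeholder assumption, not the limit actually used by Vargas (2003); the simulated data are illustrative.

library(rrcov)
set.seed(3)
p <- 30; m <- 100
X   <- matrix(rnorm(m * p), m, p)          # illustrative Phase I data (MCD requires m > p)
mcd <- CovMcd(X)
T2  <- mahalanobis(X, center = getCenter(mcd), cov = getCov(mcd))
ucl <- qchisq(1 - 0.05, df = p)            # placeholder for the UCL in (27)
which(T2 > ucl)                            # observations flagged as outliers by the MCD chart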

Figure 6: A comparison between the power of the proposed RMDP chart and the MCD chart of Vargas (2003) in Phase I when 10% of the variables are shifted by an amount of and the desired is 0.05.

4 Real data applications

This section applies the proposed methodology to two real datasets. We first carry out the proposed Phase I approach to investigate the VDP dataset introduced in Section 1. The second example involves data from a diagnostic breast cancer process.

4.1 VDP data

The VDP dataset consists of 24 vertical density profiles of wooden boards, for which 314 density measurements were made 0.002 inches apart along a designated vertical line for each profile. Several research articles have discussed the VDP dataset, treating it as a multivariate dataset with 24 observations of dimension 314. Although the VDP example involves profile data with a functional structure, multivariate control charts have been extensively used to monitor such data. See, for example, Zhang and Albin (2009) and Nassar et al. (2021). We intentionally chose this example to show the superiority of our proposed method over some other multivariate methods, as well as how well our proposed approach performs when some assumptions are violated. Because the sample size is smaller than the dimensionality, the sample variance-covariance matrix is singular. To overcome this problem, Wang and Jiang (2009) suggested resampling each profile by taking one point out of every 16 points to reduce the total dimension to 20, so each profile retains 20 variables. However, our proposed method uses all of the available data.
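The dimension-reduction step of Wang and Jiang (2009) amounts to keeping one of every 16 of the 314 measurement positions; a one-line illustration (with a hypothetical data-matrix name) is:

idx <- seq(1, 314, by = 16)      # 20 retained positions: 1, 17, ..., 305
# vdp_reduced <- vdp[, idx]      # vdp: hypothetical 24 x 314 matrix of the VDP profiles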

Wang and Jiang (2009) applied the VS-MSPC chart and the Hotelling chart to these data. Neither chart detected any profile outside the control limits when . Even for and , the chart does not show an out-of-control signal. We consider the VDP data as Phase I data and apply our outlier detection method with and 0.05. For these values of , the required parameters are calculated based on the 24 profiles, and when , , , and are estimated. The estimated covariance matrix shows that all variables are strongly correlated, so, as mentioned in the previous section, we can still use our proposed method with caution. We plot the values of against the threshold given by equation (19) to show which observation(s) are assigned a zero weight. The resulting control charts based on the three values of are depicted in Figure 7. Note that, based on (18), the value of the charting statistic for each sample changes with , so we cannot combine these three charts into one chart with three different control limits. As is obvious from this figure, when , the method detects profile 6 as an outlier. Nevertheless, due to the high correlation, we might have weaker detection power than expected in an out-of-control situation. By comparing the control chart for with those for and , we conclude that samples 3, 6, and 16 need careful investigation. We highlight these three boards in red in Figure 8, among the 24 observed VDP profiles. The result of our investigation is consistent with the two outliers (boards 3 and 6) detected by the third method proposed by Williams et al. (2003). Notice that the method of Williams et al. (2003) is based on Intra-Profile Pooling and assumes that no variability is due to a common cause.

Figure 7: Proposed Phase I control chart for the VDP data based on = 0.005, 0.01, and 0.05.
Figure 8: Observed VDP of 24 wooden board profiles with three outlying profiles highlighted in red colour.

4.2 An example of diagnostic breast cancer dataset

In this subsection, we provide another example using a multivariate dataset of diagnostic breast cancer. The dataset is available through the ftp server of the Computer Sciences Department at UW-Madison: http://ftp.cs.wisc.edu/math-prog/cpo-dataset/machine-learn/cancer/WDBC/ and has recently been used by Fan et al. (2021) for detecting changes in the high-dimensional covariance matrix in Phase I. This dataset consists of 569 vector observations; for each observation, there are 30 real-valued features/variables, as well as a categorical label indicating whether the observation is benign or malignant, denoted by "B" and "M", respectively. The dataset does not contain any missing values; in total, 357 observations are labeled "B" and 212 are labeled "M". Similar to Fan et al. (2021), we treat the 357 benign observations as Phase I in-control observations and the 212 malignant observations as Phase I out-of-control observations, and we try to detect whether there is a shift in the mean between these two types. We place the malignant observations after the benign observations, so that observation 357 is the true change point.

We first use Shapiro–Wilk marginal normality tests, ignoring the correlation between variables. As the p-values are very small, we conclude that the assumption of normality does not hold for most of the variables. So, for each marginal observation , we use the inverse transformation , where is the marginal empirical distribution function based on the 357 in-control observations of the th variable. This transformation has also been used by Li et al. (2020) for multistream data.
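A sketch of this marginal transformation in R follows; how the boundary value F = 1 is handled is not stated in the text, so the n/(n+1) rescaling below is an assumption made to avoid infinite normal scores, and the data-matrix names are hypothetical.

to_normal_scores <- function(x_ic, x_new) {
  # x_ic: in-control Phase I values of one variable; x_new: values to transform
  Fhat <- ecdf(x_ic)
  n    <- length(x_ic)
  qnorm(Fhat(x_new) * n / (n + 1))    # rescaled to keep the argument strictly below 1
}
# Applied column by column (X_ic and X_all are hypothetical names):
# Z <- sapply(seq_len(ncol(X_all)), function(j) to_normal_scores(X_ic[, j], X_all[, j]))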

Figure 9: Correlation between the 30 variables of diagnostic breast cancer data.

The estimated correlation matrix based on the Phase I samples is shown in Figure 9. It can be seen that while a few variables are independent, some of them are (highly) correlated, meaning that high pairwise correlation exists in the dataset. We considered different values of for this example. The output of our proposed robust RMDP approach on this dataset with shows that , , and . Figure 10 depicts the control chart based on our proposed method, plotting the statistic in equation (19) against the control limit, where the UCL is . Figure 10 suggests an obvious shift in the process starting from observation 357 in Phase I, with many out-of-control samples, which implies that the proposed chart properly detects the change from benign to malignant observations beginning right after sample 357. As a complement, Figure 11 compares the empirical CDFs of the charting statistics of the benign and malignant samples with that of the standard normal. While the empirical CDF of the conforming samples shows an excellent match with the standard normal CDF, the empirical CDF of the malignant items shows a substantial shift to the right from the standard normal distribution, revealing a noticeable change in distribution.

Figure 10: A shift in the breast cancer dataset after sample 357 based on the proposed RMDP control chart.
Figure 11: A comparison of the empirical CDFs of the charting statistics of benign observations (green) and malignant observations (blue) with the standard normal CDF (dashed black line).

5 Conclusion

Technological advancements have led to the development of high-dimensional multivariate processes that require the simultaneous monitoring of a vast number of variables. In such cases, the Phase I sample size is usually small and computing the sample covariance matrix is impractical. This paper employs a new test statistic that is efficient when the Phase I sample size is less than the number of variables. We also propose a consistent estimator for parameter estimation as well as a finite sample correction coefficient derived via simulation. The robust procedure used in this paper makes the proposed control chart effective in the presence of outlying observations in Phase I. Through Monte Carlo simulation, the proposed procedure shows good performance in Phase I in terms of the probability of signal. As future work, we would like to extend the method to more complicated cases such as non-normal high-dimensional processes, highly correlated multivariate processes, or settings where the batch size is greater than one.

References

  1. Abdella, G.M., Al‐Khalifa, K.N., Kim, S., Jeong, M.K., Elsayed, E.A. and Hamouda, A.M., 2016. Variable Selection‐based Multivariate Cumulative Sum Control Chart. Quality and Reliability Engineering International, 33, 565-578.

  2. Bai, Z., Saranadasa, H. 1996. Effect of high dimension: by an example of a two sample problem. Statistica Sinica, 6, 311-329.

  3. Bersimis, S., S. Psarakis, and J. Panaretos. 2007. Multivariate Statistical Process Control Charts: An Overview. Quality and Reliability Engineering International, 23, 517–543.

  4. Capizzi, G., Masarotto, G. 2011. A least angle regression control chart for multidimensional data. Technometrics, 53, 285–296.

  5. Capizzi, G. and Masarotto, G., 2015. Comparison of phase II control charts based on variable selection methods. Frontiers in Statistical Quality Control, 11, 151-162. Springer International Publishing.

  6. Croux, C., Haesbroeck, G. 1999. Influence function and efficiency of the minimum covariance determinant scatter matrix estimator. Journal of Multivariate Analysis, 71, 161-190.

  7. Ebadi, M., Chenouri, S., Lin, D. K., Steiner, S. H. 2021a. Statistical monitoring of the covariance matrix in multivariate processes: A literature review. Journal of Quality Technology, 1-136.

  8. Ebadi, M., Chenouri, S., Steiner, S. H. 2021b. On Monitoring High-Dimensional Multivariate Processes with Individual Observations. Arxiv Preprint.

  9. Fan, J., Shu, L., Yang, A., Li, Y. (2021). Phase I analysis of high-dimensional covariance matrices based on sparse leading eigenvalues. Journal of Quality Technology, 53, 333-346.

  10. Hotelling, H., 1947. Multivariate quality control, illustrated by the air testing of sample bombsights. Techniques of Statistical Analysis. McGraw Hill, New York, pp. 111–184.

  11. Jiang, W., Wang, K., Tsung, F. 2012. A variable-selection-based multivariate EWMA chart for process monitoring and diagnosis. Journal of Quality Technology, 44, 209–230.

  12. Lehmann, E. L., Romano, J. P. 2006. Testing statistical hypotheses. Springer Science & Business Media.

  13. Li, W., Xiang, D., Tsung, F., Pu, X. 2020. A diagnostic procedure for high-dimensional data streams via missed discovery rate control. Technometrics, 62, 84-100.

  14. Nassar, S. H., & Abdel-Salam, A. S. G. (2021). Semiparametric MEWMA for Phase II profile monitoring. Quality and Reliability Engineering International.

  15. Pison, G., Van Aelst, S., Willems, G. 2002. Small sample corrections for LTS and MCD. Metrika, 55, 111-123.

  16. Ro, K., Zou, C., Wang, Z., & Yin, G. (2015). Outlier detection for high-dimensional data. Biometrika, 102, 589-599.

  17. Rousseeuw, P. J., Van Driessen, K. 1999. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41, 212-223.

  18. Satterthwaite, F. E. 1941. Synthesis of variance. Psychometrika, 6, 309-316.

  19. Satterthwaite, F. E. 1946. An approximate distribution of estimates of variance components. Biometrics bulletin, 2, 110-114.

  20. Srivastava, M. S., 2005. Some tests concerning the covariance matrix in high dimensional data. Journal of the Japan Statistical Society, 35, 251-272.

  21. Srivastava, M. S. 2009. A test for the mean vector with fewer observations than the dimension under non-normality. Journal of Multivariate Analysis, 100, 518-532.

  22. Srivastava, M. S., and Du, M. 2008. A test for the mean vector with fewer observations than the dimension. Journal of Multivariate Analysis, 99, 386-402.

  23. Variyath, A. M., Vattathoor, J. 2014. Robust Control Charts for Monitoring Process Mean of Phase-I Multivariate Individual Observations. Quality and Reliability Engineering International, 30, 795-812.

  24. Vargas, N. J. A. (2003). Robust estimation in multivariate control charts for individual observations. Journal of Quality Technology, 35, 367-376.

  25. Walker, E., Wright, S. P. 2002. Comparing Curves Using Additive Models. Journal of Quality Technology, 34, 118–129.

  26. Wang, K., Jiang, W. 2009. High-dimensional process monitoring and fault isolation via variable selection. Journal of Quality Technology, 41, 247–258.

  27. Welch, B. L. 1951. On the comparison of several mean values: an alternative approach. Biometrika, 38, 330-336.

  28. Willems, G., Pison, G., Rousseeuw, P. J., Van Aelst, S. 2002. A robust Hotelling test. Metrika, 55, 125-138.

  29. Williams, J. D., Woodall, W. H., & Birch, J. B. 2003. Phase I monitoring of nonlinear profiles. In quality and productivity research conference, Yorktown Heights, New York.

  30. Williams, J. D., Woodall, W. H., and Birch, J. B. 2007. Statistical monitoring of nonlinear product and process quality profiles. Quality and Reliability Engineering International, 23, 925–941.

  31. Williams, J. D., Woodall, W. H., Birch, J. B., Sullivan, J. H. 2006. Distribution of Hotelling’s statistic based on the successive differences estimator. Journal of Quality Technology, 38, 217-229.

  32. Woodall, W. H. 2000. Controversies and contradictions in statistical process control (with discussion), Journal of Quality Technology, 32, 341–378.

  33. Woodall, W. H., Montgomery, D. C. 2014. Some current directions in the theory and application of statistical process monitoring. Journal of Quality Technology, 46, 78-94.

  34. Woodall, W. H., Spitzner, D. J., Montgomery, D. C., and Gupta, S. 2004. Using control charts to monitor process and product quality profiles. Journal of Quality Technology, 36, 309–320.

  35. Zhang, H., & Albin, S. (2009). Detecting outliers in complex profiles using a control chart method. IIE Transactions, 41, 335-345.

  36. Zhang, L., Zhu, T., Zhang, J. T. 2020. A simple scale-invariant two-sample test for high-dimensional data. Econometrics and Statistics, 14, 131-144.

  37. Zou, C., Qiu, P. 2009. Multivariate statistical process control using LASSO. Journal of American Statistical Association, 104, 1586–1596.

Appendix A

In this appendix, a consistent and unbiased estimator of based on the samples available in Phase I is presented. We first obtain expressions for and in terms of chi-square random variables in a similar fashion to Srivastava (2005), where expressions for , , and were derived; their corresponding expectations are then calculated.

From the spectral decomposition of the positive definite matrix , we can write , where is the diagonal matrix of the eigenvalues