One of the most common problems in real data is the presence of outliers, i.e. observations that are well separated from the bulk of the data; they may be errors that affect the data analysis, or they may reveal unexpected information. According to the classical Tukey-Huber Contamination Model (THCM), a small fraction of rows can be contaminated, and these are the units considered as outliers. Over the years, many methods have been developed to be less sensitive to such outlying observations. A complete introduction and explanation of the developments in robust statistics is given in the book by Maronna et al. [2006].
In some applications, e.g. in modern high-dimensional data sets, the entries of an observation (or cells) can be independently contaminated. Alqallaf et al. [2009] first formulated the Independent Contamination Model (ICM), taking into consideration this cell-wise contamination scheme. According to this paradigm, given a fraction $\epsilon$ of contaminated cells, the expected fraction of contaminated rows is
$$1 - (1 - \epsilon)^p,$$
which exceeds the breakdown point for increasing values of the contamination level $\epsilon$ and the dimension $p$. Traditional robust estimators may fail in this situation. Furthermore, Agostinelli et al. [2015b] show that both types of outliers, case-wise and cell-wise, can occur simultaneously.
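The propagation of cell-wise contamination to rows can be illustrated numerically. The sketch below assumes that each of the $p$ cells of a row is contaminated independently with probability $\epsilon$, so that a row is contaminated with probability $1 - (1 - \epsilon)^p$; the function name and the chosen values of $\epsilon$ and $p$ are illustrative only.

```python
def expected_contaminated_rows(eps: float, p: int) -> float:
    """Expected fraction of rows with at least one contaminated cell,
    assuming the p cells are contaminated independently with probability eps."""
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must lie in [0, 1]")
    if p < 1:
        raise ValueError("p must be a positive integer")
    return 1.0 - (1.0 - eps) ** p


if __name__ == "__main__":
    # Illustrative values: 5% of contaminated cells, increasing dimension.
    for p in (5, 20, 100):
        print(p, round(expected_contaminated_rows(0.05, p), 3))
```

With 5% of contaminated cells, roughly 64% of the rows are expected to contain at least one contaminated cell when $p = 20$, which is already beyond what a row-wise robust estimator can tolerate.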
Gervini and Yohai [2002] introduced the idea of an adaptive univariate filter, which identifies the proportion of outliers in the sample by measuring the difference between the empirical distribution and a reference distribution. This proportion is then used to compute an adaptive cutoff value, and finally a robust and efficient weighted least squares estimator is defined. Starting from this concept of outlier detection, Agostinelli et al. [2015a] introduced a two-step procedure: in the first step, large cell-wise outliers are flagged by the univariate filter and replaced by NA values [a technique called snipping in Farcomeni, 2014]; in the second step, a Generalized S-Estimator (GSE) [Danilov et al., 2012] is applied to deal with case-wise outliers. The GSE is chosen because it has been specifically designed to cope with missing values in multivariate data. Leung et al. [2017] improved this procedure by proposing the following modifications:
They combined the univariate filter with a bivariate filter to take into account the correlations among variables.
In order to also handle moderate cell-wise outliers, they proposed a filter defined as the intersection between the univariate-bivariate filter and Detect Deviating Cells (DDC), a filtering procedure introduced by Rousseeuw and Van Den Bossche [2018].
Finally, they constructed a Generalized Rocke S-estimator (GRE) to replace the GSE, in order to address the loss of robustness in the case of high-dimensional case-wise outliers.
Here, we want to define a new filter in general dimension $d$, with $1 \le d \le p$, based on statistical data depth functions, to be used in combination with the GSE. Note that if $d = 1$ we filter the cell-wise outliers treating the variables as independent. Section 2 introduces the main idea on how to construct filters based on statistical depth functions; in subsection 2.1 we illustrate the procedure by using the half-space depth function, while in subsections 2.2 and 2.3 we introduce two different strategies to mark observations/cells as outliers. Section 3 shows how the approaches in Agostinelli et al. [2015a] and Leung et al. [2017] are special cases of our framework and introduces a statistical data depth function, namely the Gervini-Yohai depth function. Section 4 illustrates the features of our approach using a real data set, while Section 5 reports the results of a Monte Carlo experiment. Appendix A discusses general properties a statistical data depth function should have, Appendix B studies the properties of the Gervini-Yohai depth and Appendix C contains the full results of the Monte Carlo experiment.
2 Filters based on Statistical Data Depth Function
Let $X$ be a $\mathbb{R}^d$-valued random variable with distribution function $F$. For a point $x \in \mathbb{R}^d$, we consider the statistical data depth of $x$ with respect to $F$ to be $d(x; F)$, where $d(\cdot; F)$ satisfies the four properties given in Liu [1990] and Zuo and Serfling [2000a] and reported in Appendix A of the Supplementary Material. Given an independent and identically distributed sample $X_1, \ldots, X_n$ of size $n$, we denote by $\hat{F}_n$ its empirical distribution function and by $d(x; \hat{F}_n)$ the sample depth. We assume that $d(x; \hat{F}_n)$ is a uniformly consistent estimator of $d(x; F)$, that is,
$$\sup_{x} \left| d(x; \hat{F}_n) - d(x; F) \right| \to 0 \quad \text{a.s. as } n \to \infty,$$
a property enjoyed by many statistical data depth functions, e.g., among others, simplicial depth [Liu, 1990] and half-space depth [Tukey, 1975]. One important feature of the depth functions is the $\alpha$-depth trimmed region given by $R_\alpha(F) = \{x \in \mathbb{R}^d : d(x; F) \ge \alpha\}$; for any $\beta \in [0, 1]$, we will denote by $R^\beta(F)$ the smallest region $R_\alpha(F)$
that has probability larger than or equal to $\beta$ according to $F$. Throughout, subscripts and superscripts for depth regions are used for depth levels and probability contents, respectively. Let $C^\beta(F)$ be the complement in $\mathbb{R}^d$ of the set $R^\beta(F)$. Let $m = \max_x d(x; F)$ be the maximum of the depth; for simplicial depth $m \le 2^{-d}$, for half-space depth $m \le 1/2$.
Given a high order quantile $\beta$, we define a filter of dimension $d$ based on
$$d_n = \sup_{x \in C^\beta(F)} \left\{ d(x; \hat{F}_n) - d(x; F) \right\}^+,$$
where $\{a\}^+$ represents the positive part of $a$, and we mark as outliers all the $\lfloor n d_n / m \rfloor$ observations with the smallest population depth (where $\lfloor a \rfloor$ is the largest integer less than or equal to $a$). This defines a filter in the general dimension $d$.
We have the following result, with obvious proof.
Proposition 1. If $\sup_{x} | d(x; \hat{F}_n) - d(x; F) | \to 0$ (a.s.) then $\lfloor n d_n / m \rfloor / n \to 0$ as $n \to \infty$.
If the above result holds, then the filter is consistent. In the next subsection we illustrate this approach using the half-space depth.
2.1 Filters based on Half-space Depth
Definition 1 (Half-space depth).
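Let $X$ be a $\mathbb{R}^d$-valued random variable with distribution function $F$. For a point $x \in \mathbb{R}^d$, the half-space depth of $x$ with respect to $F$ is defined as the minimum probability of all closed half-spaces including $x$:
$$d_{HS}(x; F) = \min_{H \in \mathcal{H}(x)} P_F(X \in H),$$
where $\mathcal{H}(x)$ indicates the set of all closed half-spaces in $\mathbb{R}^d$ containing $x$.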
A random vector $X \in \mathbb{R}^d$ is said to be elliptically symmetric distributed, denoted by $X \sim E_d(h, \mu, \Sigma)$, if it has a density function given by
$$f_0(x) = |\Sigma|^{-1/2} \, h\left( (x - \mu)^\top \Sigma^{-1} (x - \mu) \right),$$
where $h$ is a non-negative scalar function, $\mu$ is the location parameter and $\Sigma$ is a $d \times d$ positive definite matrix. Denote by $F_0$ the corresponding distribution function and by $\Delta_x = (x - \mu)^\top \Sigma^{-1} (x - \mu)$ the squared Mahalanobis distance of a $d$-dimensional point $x$. By Theorem 3.3 of Zuo and Serfling [2000b], if a depth is affine equivariant (1) and has maximum at $\mu$ (2) (see Appendix A), then it is of the form $d(x; F_0) = g(\Delta_x)$ for some non-increasing function $g$, and we can restrict ourselves, without loss of generality, to the case $\mu = 0$ and $\Sigma = I_d$, where
$I_d$ is the identity matrix of dimension $d$. Under this setting, it is easy to see that the half-space depth of a given point $x$ is given by $d_{HS}(x; F_0) = 1 - F_{0,1}(\sqrt{\Delta_x})$, where $F_{0,1}$ is a marginal distribution of $X$.
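For the Gaussian case the relation between half-space depth and Mahalanobis distance has a closed form that is easy to evaluate. The sketch below assumes a normal reference distribution, for which every one-dimensional marginal of the standardized vector is standard normal, so that the depth equals $1 - \Phi(\sqrt{\Delta_x})$; the function names are illustrative.

```python
import math
from statistics import NormalDist


def halfspace_depth_from_mahalanobis(delta: float) -> float:
    """Half-space depth under a normal distribution, given the squared
    Mahalanobis distance delta of the point: 1 - Phi(sqrt(delta))."""
    if delta < 0:
        raise ValueError("a squared distance cannot be negative")
    return 1.0 - NormalDist().cdf(math.sqrt(delta))


def halfspace_depth_std_normal(x) -> float:
    """Half-space depth of x with respect to N(0, I_d), where the squared
    Mahalanobis distance reduces to the squared Euclidean norm."""
    return halfspace_depth_from_mahalanobis(sum(v * v for v in x))


if __name__ == "__main__":
    print(halfspace_depth_std_normal([0.0, 0.0]))  # maximum depth at the center
    print(halfspace_depth_std_normal([1.0, 0.0]))
    print(halfspace_depth_std_normal([3.0, 4.0]))
```

The depth attains its maximum value 1/2 at the center and decreases monotonically with the Mahalanobis distance, which is what allows depth regions to be described through distance thresholds.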
If the function is such that
then, there exists a such that for all so that , , where is the distribution function of the standard normal. Hence,
and therefore, for all ,
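Given an independent and identically distributed sample $X_1, \ldots, X_n$, we define the filter in general dimension $d$ introduced previously, where here we use the half-space depth:
$$d_n = \sup_{x \in C^\beta(F)} \left\{ d_{HS}(x; \hat{F}_n) - d_{HS}(x; F) \right\}^+,$$
where $\beta$ is a high order quantile, $\hat{F}_n$ is the empirical distribution function and $F$ is a chosen reference distribution which depends on a pair of initial location and dispersion estimators, $T_{0n}$ and $C_{0n}$. Hereafter, we are going to use the normal distribution, $F = N(T_{0n}, C_{0n})$. For $T_{0n}$ and $C_{0n}$ one might use, e.g., the coordinate-wise median and the coordinate-wise MAD for a univariate filter as in Leung et al. [2017]. In order to compute the value $d_n$, we have to identify the set $C^\beta(F) = \{x \in \mathbb{R}^d : d_{HS}(x; F) \le d_{HS}(\eta_\beta; F)\}$, where $\eta_\beta$ is a large quantile of $F$. By Corollary 4.3 in Zuo and Serfling [2000b], and denoting with $\Delta_x = (x - T_{0n})^\top C_{0n}^{-1} (x - T_{0n})$ the squared Mahalanobis distance of $x$ using the initial location and dispersion estimates, the set $C^\beta(F)$ can be rewritten as $\{x \in \mathbb{R}^d : \Delta_x \ge (\chi^2_d)^{-1}(\beta)\}$, where
$(\chi^2_d)^{-1}(\beta)$ is a large quantile of a chi-squared distribution with $d$ degrees of freedom.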
Now we want to show that the result given by Proposition 1 holds for this particular case.
Consider a random vector and suppose that is an elliptically symmetric distribution. Also consider a pair of location and dispersion estimators and such that and a.s.. Let be a chosen reference distribution and the empirical distribution function. If the reference distribution satisfies
where is some large quantile of , then
In Donoho and Gasko [1992], it is proved that for i.i.d. with distribution , as
Note that, by the continuity of , a.s.. Hence, for each there exists such that for all we have
In the next example, we illustrate a univariate filter based on half-space depth that controls independently the left and the right tail of the distribution.
Example 1 (Univariate filter with two-tails control).
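In the univariate case, given a point $x$ there exist only two half-spaces including it, hence the half-space depth assumes the explicit form
$$d_{HS}(x; F) = \min\left\{ P_F(X \le x), \; P_F(X \ge x) \right\},$$
and considering the empirical distribution function $\hat{F}_n$, the half-space depth will be
$$d_{HS}(x; \hat{F}_n) = \frac{1}{n} \min\left\{ \#\{i : X_i \le x\}, \; \#\{i : X_i \ge x\} \right\}.$$
Consider $T_{0n} = (T_{0n,1}, \ldots, T_{0n,p})$ and $C_{0n} = (C_{0n,1}, \ldots, C_{0n,p})$, a pair of initial location and dispersion estimators. Here we choose for $T_{0n}$ and $C_{0n}$ respectively the coordinate-wise median and the median absolute deviation (MAD). For each variable $X_j$ ($j = 1, \ldots, p$), we denote the standardized version of $X_{ij}$ by $Z_{ij} = (X_{ij} - T_{0n,j}) / C_{0n,j}$. Let $F$ be a chosen reference distribution for $Z_{ij}$; here we use the standard normal distribution, i.e., $F = \Phi$. Let $\hat{F}_{n,j}$ be the empirical distribution for the standardized values, that is
$$\hat{F}_{n,j}(t) = \frac{1}{n} \sum_{i=1}^{n} I(Z_{ij} \le t).$$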
We define the proportion of flagged outliers by
where is a large quantile of . Note that, according to (1), we are considering the set , which results in the simpler form written above considering the definition of the half-space depth in the univariate case. Here, if we consider the order statistics , define and . Using the definition of half-space depth function in the univariate case, presented above, the previous expression can be written as
Then, we flag observations with the smallest depth value as cell-wise outliers and replace them by NA’s.
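The univariate filter of this example can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it assumes the median and the MAD (rescaled by the usual consistency constant 1.4826, an assumption of this sketch) as initial estimators, a standard normal reference distribution, a tail region defined by $|z| \ge \Phi^{-1}((1+\beta)/2)$, and a number of flagged cells equal to $\lfloor n d_n / m \rfloor$ with $m = 1/2$ taken as the maximum half-space depth. The function name and arguments are hypothetical.

```python
import math
from statistics import NormalDist, median


def univariate_hs_filter(x, beta=0.95, max_depth=0.5):
    """Return the sorted indices of the values flagged as cell-wise outliers."""
    n = len(x)
    center = median(x)
    scale = 1.4826 * median(abs(v - center) for v in x)  # rescaled MAD (assumed)
    if scale == 0:
        raise ValueError("MAD is zero; the data cannot be standardized")
    z = [(v - center) / scale for v in x]

    phi = NormalDist()
    eta = phi.inv_cdf((1.0 + beta) / 2.0)  # boundary of the two-sided tail region

    def ref_depth(t):
        # Half-space depth of t under the standard normal reference.
        return min(phi.cdf(t), 1.0 - phi.cdf(t))

    def emp_depth(t):
        # Empirical half-space depth: smaller of the two closed tail fractions.
        below = sum(1 for v in z if v <= t)
        above = sum(1 for v in z if v >= t)
        return min(below, above) / n

    # The supremum over the tail region is attained at sample points, because
    # the empirical depth is piecewise constant and the reference depth is monotone.
    d_n = 0.0
    for t in z:
        if abs(t) >= eta:
            d_n = max(d_n, emp_depth(t) - ref_depth(t))

    n_flag = math.floor(n * d_n / max_depth)
    by_depth = sorted(range(n), key=lambda i: ref_depth(z[i]))
    return sorted(by_depth[:n_flag])
```

Because the depth is the minimum over the two tails, an excess of points in either tail raises $d_n$, and the flagged cells are those with the smallest depth under the reference distribution, regardless of the side on which they fall.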
2.2 A consistent univariate, bivariate and $p$-variate filter
Given a sample $X_1, \ldots, X_n$ where $X_i \in \mathbb{R}^p$, $i = 1, \ldots, n$, we first apply the univariate filter described in the previous example to each variable separately. Filtered data are indicated through an auxiliary matrix $U$ of zeros and ones, with zero corresponding to an NA value. Next we want to identify the bivariate outliers by iterating the filter over all possible pairs of variables. Consider a pair of variables $X^{(jk)} = (X_j, X_k)$. The initial location and dispersion estimators are, respectively, the coordinate-wise median and the corresponding sub-matrix of the estimate $C_{0n}$ computed by the generalized S-estimator on non-filtered data. Note that this ensures the positive definiteness of $C_{0n}$ and of each sub-matrix corresponding to a subset of variables. For bivariate points with no components flagged by the univariate filter we compute the squared Mahalanobis distance and hence apply the bivariate filter, for all $1 \le j < k \le p$. At the end we want to identify the cells $(i, j)$ which have to be flagged as cell-wise outliers. The procedure used for this purpose is described in Leung et al. [2017] and reported here. Let
$$J = \{(i, j, k) : (X_{ij}, X_{ik}) \text{ has been flagged as a bivariate outlier}\}$$
be the set of triplets which identifies the pairs of cells flagged by the bivariate filter in rows $i = 1, \ldots, n$. For each cell $(i, j)$ in the data, we count the number of flagged pairs in the $i$-th row in which the considered cell is involved:
$$m_{ij} = \#\{k : (i, j, k) \in J\}.$$
In absence of contamination,
$m_{ij}$ follows approximately a binomial distribution $Bin\left(\sum_{k \ne j} U_{ik}, \delta\right)$, where $\delta$ represents the overall proportion of cell-wise outliers undetected by the univariate filter. Hence, we flag the cell $(i, j)$ if $m_{ij} > c_{ij}$, where $c_{ij}$ is a high-order quantile of this binomial distribution. Finally, we apply the $p$-variate filter as described in subsection 2.1 to the full data matrix. Detected observations (rows) are directly flagged as $p$-variate (case-wise) outliers. We denote the procedure based on univariate, bivariate and $p$-variate filters by HS-UBPF.
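The step that turns flagged pairs into flagged cells can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: each triplet `(i, j, k)` is treated as an unordered flagged pair in row `i` and counted for both cells, the number of binomial trials for cell `(i, j)` is taken to be the number of other cells in row `i` that survived the univariate filter, and the quantile level `level=0.99` is a placeholder value. The function names are hypothetical.

```python
from math import comb


def binom_quantile(n_trials, prob, level):
    """Smallest k such that P(Bin(n_trials, prob) <= k) >= level."""
    cumulative = 0.0
    for k in range(n_trials + 1):
        cumulative += comb(n_trials, k) * prob**k * (1.0 - prob) ** (n_trials - k)
        if cumulative >= level:
            return k
    return n_trials


def flag_cells_from_pairs(U, flagged_pairs, delta, level=0.99):
    """Flag cells involved in unusually many flagged bivariate pairs.

    U             : n x p list of 0/1 (1 = cell not flagged by the univariate filter)
    flagged_pairs : iterable of (i, j, k), a pair of columns (j, k) flagged in row i
    delta         : assumed proportion of outliers missed by the univariate filter
    """
    n, p = len(U), len(U[0])
    counts = [[0] * p for _ in range(n)]
    for i, j, k in flagged_pairs:
        counts[i][j] += 1
        counts[i][k] += 1

    flagged = set()
    for i in range(n):
        for j in range(p):
            if U[i][j] == 0:
                continue  # already removed by the univariate filter
            trials = sum(U[i][k] for k in range(p) if k != j)
            if counts[i][j] > binom_quantile(trials, delta, level):
                flagged.add((i, j))
    return flagged
```

For example, in a row with five available cells where the first cell appears in all four of its pairs, only that cell exceeds the binomial threshold, while its partners, each involved in a single flagged pair, are kept.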
2.3 A sequencing filtering procedure
Suppose we would like to apply a sequence of filters with different dimensions $1 \le d_1 < d_2 < \ldots < d_k \le p$. For each $d_i$, $i = 1, \ldots, k$, the filter updates the data matrix by adding NA values to the $d_i$-tuples identified as $d_i$-variate outliers. In this way, each filter is applied only to those $d_i$-tuples that have not been flagged as outliers by the filters of lower dimension.
Initial values for each procedure other than the first would be obtained by applying the GSE to the currently filtered values.
This procedure aims to be a valid alternative to the one used in the HS-UBPF filter for performing a sequence of filters with different dimensions. However, this is a preliminary idea and it has not been implemented yet.
3 Gervini-Yohai $d$-variate filter
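In this Section we show that the filters introduced in Agostinelli et al. [2015a] are a special case of our approach, using the following Gervini-Yohai depth
$$d_{GY}(x; F, G) = 1 - G\left( \Delta(x; \mu(F), \Sigma(F)) \right),$$
where $G$ is a continuous distribution function, $\mu(F)$ and $\Sigma(F)$ are the location and scatter matrix functionals and $\Delta(x; \mu, \Sigma) = (x - \mu)^\top \Sigma^{-1} (x - \mu)$ is the squared Mahalanobis distance. Appendix B shows that this is a statistical data depth function. Let $\{G_n\}$ be a sequence of discrete distribution functions that might depend on $\hat{F}_n$ and such that $\sup_t |G_n(t) - G(t)| \to 0$ a.s.; we might define the finite sample version of the Gervini-Yohai depth as
$$d_{GY}(x; \hat{F}_n, G_n) = 1 - G_n\left( \Delta(x; \mu(\hat{F}_n), \Sigma(\hat{F}_n)) \right),$$
however for filtering purposes we will use two alternative definitions later on. The use of $G_n$, that might depend on the data, instead of $G$ makes this sample depth semiparametric. We notice that the Mahalanobis depth, which is completely parametric, cannot be used for the purpose of defining a filter in a similar fashion.
Let $1 \le d \le p$, let $(j_1, \ldots, j_d)$ be a $d$-tuple of the integer numbers $\{1, \ldots, p\}$ and, for ease of presentation, let $Y_i = (X_{ij_1}, \ldots, X_{ij_d})$ be a subvector of dimension $d$ of $X_i$. Consider a pair of initial location and scatter estimators $T_{0n}$ and $C_{0n}$ for these $d$ variables.
Now, define the squared Mahalanobis distance for a data point $Y_i$ by $\Delta_i = \Delta(Y_i; T_{0n}, C_{0n})$. Consider $G$, the distribution function of a $\chi^2_d$, $H$, the distribution function of $\Delta$, and let $\hat{H}_n$ be the empirical distribution function of $\Delta_i$ ($1 \le i \le n$). We consider two finite sample versions of the Gervini-Yohai depth, i.e.,
$$d_{GY}(x; \hat{F}_n, \hat{H}_n) = 1 - \hat{H}_n\left( \Delta(x; T_{0n}, C_{0n}) \right), \qquad d_{GY}(x; \hat{F}_n, G) = 1 - G\left( \Delta(x; T_{0n}, C_{0n}) \right).$$
The proportion of flagged $d$-variate outliers is defined by
$$d_n = \sup_{x \in A} \left\{ d_{GY}(x; \hat{F}_n, \hat{H}_n) - d_{GY}(x; \hat{F}_n, G) \right\}^+.$$
Here $A = \{x \in \mathbb{R}^d : d_{GY}(x; \hat{F}_n, G) \le d_{GY}(\eta_\beta; \hat{F}_n, G)\}$, where $\eta_\beta$ is any point in $\mathbb{R}^d$ such that $\Delta(\eta_\beta; T_{0n}, C_{0n}) = \eta$ and $\eta = G^{-1}(\beta)$ is a large quantile of $G$. Then, we flag $\lfloor n d_n \rfloor$ observations. It is easy to see that
$$d_n = \sup_{t \ge \eta} \left\{ G(t) - \hat{H}_n(t) \right\}^+,$$
since $d_{GY}$ is a non-increasing function of the squared Mahalanobis distance of the point $x$.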
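In the univariate case the Gervini-Yohai filter reduces to comparing the empirical distribution of the squared standardized values with a chi-squared distribution with one degree of freedom. The sketch below is an illustration under assumptions rather than the reference implementation: it uses the median and the rescaled MAD as initial estimators, evaluates the supremum at the ordered distances through $G(\Delta_{(i)}) - (i-1)/n$, and flags the $\lfloor n d_n \rfloor$ largest distances. The function names are hypothetical.

```python
import math
from statistics import NormalDist, median


def chi2_1_cdf(t):
    """Distribution function of a chi-squared with 1 degree of freedom."""
    return math.erf(math.sqrt(t / 2.0)) if t > 0 else 0.0


def gy_univariate_filter(x, beta=0.95):
    """Return the sorted indices flagged by a univariate Gervini-Yohai type filter."""
    n = len(x)
    center = median(x)
    scale = 1.4826 * median(abs(v - center) for v in x)  # rescaled MAD (assumed)
    if scale == 0:
        raise ValueError("MAD is zero; the data cannot be standardized")
    dist = [((v - center) / scale) ** 2 for v in x]  # squared distances

    # eta is the beta-quantile of the chi-squared(1) reference distribution.
    eta = NormalDist().inv_cdf((1.0 + beta) / 2.0) ** 2

    order = sorted(range(n), key=lambda i: dist[i])
    d_n = 0.0
    for rank, idx in enumerate(order):  # rank = i - 1 for the i-th order statistic
        if dist[idx] >= eta:
            # G(t) minus the left limit of the empirical distribution at t.
            d_n = max(d_n, chi2_1_cdf(dist[idx]) - rank / n)

    n_flag = math.floor(n * d_n)
    return sorted(order[n - n_flag:])
```

Since the Gervini-Yohai depth is one minus a distribution function of the distance, flagging the observations with the smallest depth is the same as flagging those with the largest squared distances.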
We can rephrase Proposition 2 in Leung et al. [2017], which states the consistency property of the filter, as follows.
Consider a random vector and a pair of location and scatter estimators and such that and a.s.. Consider any continuous distribution function and let be the empirical distribution function of and . If the distribution satisfies:
where , where is any point in such that and is a large quantile of , then
4 Example
We consider the weekly returns from to for a portfolio of 20 small-cap stocks used in Leung et al. [2017].
With this example we want to compare the filter introduced in Agostinelli et al. [2015a] (indicated as GY-UF in the case of the univariate filter and GY-UBF for the univariate and bivariate filter) and the same filter with the improvements proposed in Leung et al. [2017] (indicated here as GY-UBF-DDC-C) to the presented filter based on statistical data depth functions, using the half-space depth (HS-UF for the univariate filter, HS-UBF for the univariate-bivariate filter, HS-UBPF for the univariate-bivariate-$p$-variate filter and HS-UBPF-DDC-C for the combination of the HS-UBPF with the modifications in Leung et al. [2017]).
Figure 1 shows the normal QQ-plots of the 20 variables. The returns in all stocks seem to roughly follow a normal distribution, but with the presence of large outliers. The returns in each stock that lie 3 MAD’s away from the coordinate-wise median are displayed in green in the figure. In total, the of cells are outside; if these are cell-wise outliers then they propagate to of the cases.
Figure 2 shows the squared Mahalanobis distances (MDs) of the weekly returns based on the estimates given by the MLE, the GY-UF, the GY-UBF, the HS-UF, the HS-UBF and the HS-UBPF. Observations with one or more cells flagged as outliers are displayed in green. We say that the estimate identifies an outlier correctly if the MD exceeds the quantile of a chi-squared distribution with 20 degrees of freedom. We see that the MLE does a very poor job, recognizing only 8 of the 59 cases. The GY-UF, HS-UF, HS-UBF and HS-UBPF show quite similar behavior, doing better than the MLE but missing about one third of the cases. The GY-UBF identifies all but seven of the cases.
Figure 3 shows the Mahalanobis distances produced by GY-UBF-DDC-C and HS-UBPF-DDC-C. Here we can see that the GY-UBF-DDC-C misses 13 of the 59 cases while the HS-UBPF-DDC-C misses 15 cases. Although they do not seem to do a better job, these two filters are able to flag as case-wise outliers some observations that were not identified before. These outliers are more clearly highlighted by HS-UBPF-DDC-C.
Figure 4 shows the bivariate scatter plots of WTS versus HTLD, HTLD versus WSBC and WSBC versus SUR where the GY-UBF and HS-UBF filters are applied, respectively. The bivariate observations with at least one component flagged as an outlier are in blue, and outliers detected by the bivariate filter are in orange. We see that the HS-UBF identifies fewer outliers than the GY-UBF.
5 Monte Carlo results
We performed a Monte Carlo simulation to assess the performance of the proposed filter based on half-space depth. After the filter flags the outlying observations, the generalized S-estimator is applied to the data with added missing values. Our simulation study is based on the same setup described in Leung et al. [2017], so that the performance of our filter can be compared meaningfully with that of the filter introduced in their work.
We considered samples from a , where all values in are equal to , and the sample size is . We consider the following scenarios:
Clean data: data without changes.
Cell-Wise contamination: a proportion of cells in the data is replaced by , where .
The proportions of contaminated rows chosen for case-wise contamination are , and for cell-wise contamination. The number of replicates in our simulation study is .
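We measure the performance of a given pair of location and scatter estimators $T_n$ and $C_n$ using the mean squared error (MSE) and the likelihood ratio test distance (LRT), as in Leung et al. [2017]:
$$MSE = \frac{1}{N} \sum_{i=1}^{N} (T_{n,i} - \mu_0)^\top (T_{n,i} - \mu_0), \qquad LRT(C_n, \Sigma_0) = \frac{1}{N} \sum_{i=1}^{N} D(C_{n,i}, \Sigma_0),$$
where $(T_{n,i}, C_{n,i})$ is the estimate of the $i$-th replication, $N$ is the number of replicates and $D(C, \Sigma_0) = \mathrm{trace}(C \Sigma_0^{-1}) - \log |C \Sigma_0^{-1}| - p$. Finally, we computed the maximum average LRT distances considering all contamination values $k$.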
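The two performance measures can be computed with a few lines of code. The sketch below assumes the usual form of the likelihood ratio test distance between a scatter estimate $C$ and the true $\Sigma_0$, namely $\mathrm{trace}(C\Sigma_0^{-1}) - \log|C\Sigma_0^{-1}| - p$, and an MSE defined as the average squared Euclidean distance between the location estimates and the true location. It uses a small pure-Python Gauss-Jordan helper so that it has no external dependencies; all function names are illustrative.

```python
import math


def _gauss_jordan(A, B):
    """Solve A X = B with partial pivoting; return (X, log|det A|)."""
    n = len(A)
    M = [[float(v) for v in A[r]] + [float(v) for v in B[r]] for r in range(n)]
    log_abs_det = 0.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[piv][c]) < 1e-12:
            raise ValueError("matrix is numerically singular")
        M[c], M[piv] = M[piv], M[c]
        pivot = M[c][c]
        log_abs_det += math.log(abs(pivot))
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                factor = M[r][c]
                M[r] = [a - factor * b for a, b in zip(M[r], M[c])]
    return [row[n:] for row in M], log_abs_det


def lrt_distance(C, Sigma0):
    """trace(Sigma0^{-1} C) - log det(Sigma0^{-1} C) - p."""
    p = len(C)
    X, logdet_sigma0 = _gauss_jordan(Sigma0, C)  # X = Sigma0^{-1} C
    _, logdet_c = _gauss_jordan(C, [[] for _ in range(p)])
    trace = sum(X[i][i] for i in range(p))
    return trace - (logdet_c - logdet_sigma0) - p


def average_mse_and_lrt(locations, scatters, mu0, Sigma0):
    """Average MSE and LRT distance over Monte Carlo replications."""
    n_rep = len(locations)
    mse = sum(
        sum((t - m) ** 2 for t, m in zip(T, mu0)) for T in locations
    ) / n_rep
    lrt = sum(lrt_distance(C, Sigma0) for C in scatters) / n_rep
    return mse, lrt
```

The distance is zero when the estimate coincides with $\Sigma_0$ and grows as the estimate is inflated or deflated; for instance, doubling the identity matrix in dimension 2 gives $2 - \log 4$.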
Table 1 shows the average LRT distances under cell-wise contamination. We see that the univariate and univariate-bivariate filters have more problems in filtering moderate cell-wise outliers (for example ), while show a constant and optimal behavior for increasing contamination values of . GY-UBF-DDC-C and HS-UBPF-DDC-C have lower maximum average LRT distances, but are higher for large . This behavior is shown in Figure 5 (top) where the average LRT distances versus different contamination values are displayed, with of cell-wise contamination level and .
Table 2 shows the maximum average LRT distances under case-wise contamination. Overall, the GY-UBF-DDC-C and HS-UBPF-DDC-C outperform all the other filters. Excluding these two, we see that the HS-UBPF is competitive in the case of moderate case-wise contamination. An illustration of their behavior is given in Figure 6 (top) which shows the average LRT distances for different values of , with of case-wise contamination level and .
Table 3 and Table 4 show the maximum average MSE under cell-wise and case-wise contamination, respectively. The values in the tables are the MSE values multiplied by 1000 for a better visualization and model comparison. Under case-wise contamination, the GY-UBF-DDC-C and HS-UBPF-DDC-C outperform the other filters, and have also competitive results for cell-wise contamination. In Figure 5 (bottom) and Figure 6 (bottom) the average MSE versus different contamination values are displayed, with and of cell-wise contamination and of case-wise contamination respectively. We highlight the nice redescending performance of the HS-UBPF for both LRT and MSE, not shared by the other filters.
6 Conclusions
Considering the two-step procedure introduced in Agostinelli et al. [2015a] and improved by Leung et al. [2017], we present a new filter based on statistical data depth functions that can be used in place of the previous filters and is intended as a generalization of them. Furthermore, we also combine the depth filter HS-UBPF with DDC, as suggested by Leung et al. [2017]. As shown in the example, the HS-UBPF filter is able to identify large outlying observations and removes fewer cells than the GY-UBF. In addition, it also detects the case-wise outliers, which are clearly highlighted.
If we consider the performance of the entire procedure, our simulations show that using HS-UBPF we obtain the best estimates for a moderate proportion of contamination, and that it remains competitive for higher percentages of contamination, also for high-dimensional datasets, under both types of contamination models. Generally, the GY-UBF and HS-UBPF combined with DDC outperform the other filters. Differences in the performance of these two estimators are not clearly visible. However, the HS-UBPF has shown an interesting behaviour for moderate contamination levels, especially under case-wise contamination.
Further research on this filter could be needed to explore the performance of the estimator in different types of data and how it can vary with respect to the dimensions and , for example in flat datasets (e.g., ). In addition different statistical data depth functions could be used in place of the half-space depth.
Appendix A Statistical data depth properties
Definition 2 (Depth Function).
A depth function measures the centrality of a point w.r.t. a probability distribution
Appendix B Gervini-Yohai depth
Here we want to show that the Gervini-Yohai depth, defined as $d_{GY}(x; F, G) = 1 - G\left( \Delta(x; \mu(F), \Sigma(F)) \right)$, is a proper statistical depth function, i.e., it satisfies the four properties introduced above.
Affine invariance: it follows directly from the affine invariance property of the Mahalanobis distance;
Maximality at center: if is elliptically symmetric around ,
For any we have
when is strictly monotone then strict inequality holds, and is the unique maximizer of the Gervini-Yohai depth.
Approaching zero: if we have that and consequently . Then
Appendix C Monte Carlo experiment
Results for all combinations of the model parameters explored in the Monte Carlo simulation are reported in this section.
- Agostinelli et al. [2015a] C. Agostinelli, A. Leung, V.J. Yohai, and R.H. Zamar. Robust estimation of multivariate location and scatter in the presence of cellwise and casewise contamination. TEST, 24(3):441–461, 2015a.
- Agostinelli et al. [2015b] C. Agostinelli, A. Leung, V.J. Yohai, and R.H. Zamar. Rejoinder on: Robust estimation of multivariate location and scatter in the presence of cellwise and casewise contamination. TEST, 24(3):484–488, 2015b.
- Alqallaf et al. [2009] F. Alqallaf, S. Van Aelst, R.H. Zamar, and V.J. Yohai. Propagation of outliers in multivariate data. The Annals of Statistics, 37(1):311–331, 2009.
- Danilov et al. [2012] M. Danilov, V.J. Yohai, and R.H. Zamar. Robust estimation of multivariate location and scatter in the presence of missing data. Journal of the American Statistical Association, 107:1178–1186, 2012.
- Donoho and Gasko [1992] D.L. Donoho and M. Gasko. Breakdown properties of location estimates based on halfspace depth and projected outlyingness. The Annals of Statistics, 20(4):1803–1827, 1992.
- Farcomeni [2014] A. Farcomeni. Robust constrained clustering in presence of entry-wise outliers. Technometrics, 56(1):102–111, 2014.
- Gervini and Yohai [2002] D. Gervini and V.J. Yohai. A class of robust and fully efficient regression estimators. The Annals of Statistics, 30(2):583–616, 2002.
- Leung et al. [2017] A. Leung, V.J. Yohai, and R.H. Zamar. Multivariate location and scatter matrix estimation under cellwise and casewise contamination. Computational Statistics and Data Analysis, 111:59–76, 2017.
- Liu [1990] R.Y. Liu. On a notion of data depth based on random simplices. The Annals of Statistics, 18(1):405–414, 1990.
- Maronna et al. [2006] R.A. Maronna, R.D. Martin, and V.J. Yohai. Robust Statistics: Theory and Methods. Wiley, Chichester, 2006.
- Rousseeuw and Van Den Bossche [2018] P.J. Rousseeuw and W. Van Den Bossche. Detecting deviating data cells. Technometrics, 60(2):135–145, 2018.
- Serfling [2006] R.J. Serfling. Multivariate symmetry and asymmetry. Encyclopedia of Statistical Sciences, pages 5338–5345, 2006.
- Tukey [1975] J.W. Tukey. Mathematics and picturing of data. In Proceedings of the International Congress of Mathematicians, volume 2, pages 523–531, 1975.
- Zuo and Serfling [2000a] Y. Zuo and R. Serfling. General notions of statistical depth function. The Annals of Statistics, 28(2):461–482, 2000a.
- Zuo and Serfling [2000b] Y. Zuo and R.J. Serfling. Structural properties and convergence results for contours of sample statistical depth functions. The Annals of Statistics, 28(2):483–499, 2000b.