1 Introduction
One of the most common problems in real data is the presence of outliers, i.e. observations well separated from the bulk of the data, which may be errors that distort the analysis or may reveal unexpected information. According to the classical Tukey-Huber Contamination Model (THCM), a small fraction of rows can be contaminated, and these units are the ones considered as outliers. Many methods have since been developed to be less sensitive to such outlying observations. A complete introduction to and explanation of the developments in robust statistics is given in the book by Maronna et al. [2006].
In some applications, e.g. modern high-dimensional data sets, the entries of an observation (its cells) can be contaminated independently.
Alqallaf et al. [2009] first formulated the Independent Contamination Model (ICM), taking into consideration this cellwise contamination scheme. According to this paradigm, if each cell is independently contaminated with probability epsilon, the expected fraction of contaminated rows is 1 - (1 - epsilon)^p, which quickly exceeds the breakdown point of traditional robust estimators as the contamination level epsilon and the dimension p increase. Traditional robust estimators may therefore fail in this situation. Furthermore, Agostinelli et al. [2015b] show that both types of outliers, casewise and cellwise, can occur simultaneously.
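As a quick numerical illustration of this propagation effect (a minimal sketch; the formula above is the standard ICM row-contamination probability and the function name is ours):

    # Expected fraction of contaminated rows under the Independent Contamination
    # Model: a row is clean only if all of its p cells are clean.
    def expected_contaminated_rows(eps, p):
        return 1.0 - (1.0 - eps) ** p

    for p in (10, 20, 50):
        for eps in (0.01, 0.05, 0.10):
            print(f"p={p:2d}  eps={eps:.2f}  "
                  f"contaminated rows = {expected_contaminated_rows(eps, p):.3f}")

For instance, with epsilon = 0.05 and p = 20, more than half of the rows are expected to contain at least one contaminated cell.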
Gervini and Yohai [2002]
introduced the idea of an adaptive univariate filter: the proportion of outliers in the sample is identified by measuring the difference between the empirical distribution and a reference distribution, this proportion is then used to compute an adaptive cutoff value, and finally a robust and efficient weighted least squares estimator is defined. Starting from this concept of outlier detection,
Agostinelli et al. [2015a] introduced a two-step procedure: in the first step large cellwise outliers are flagged by the univariate filter and replaced by NA values [a technique called snipping in Farcomeni, 2014]; in the second step a Generalized S-Estimator (GSE) [Danilov et al., 2012] is applied to deal with casewise outliers. The GSE is used because it has been specifically designed to cope with missing values in multivariate data. Leung et al. [2017] improved this procedure by proposing the following modifications:
They combined the univariate filter with a bivariate filter to take into account the correlations among variables.

In order to also handle moderate cellwise outliers, they proposed a filter obtained as the intersection of the univariate-bivariate filter and Detect Deviating Cells (DDC), a filtering procedure introduced by Rousseeuw and Van Den Bossche [2018].

Finally, they constructed a Generalized Rocke S-estimator (GRE) to replace the GSE, to address the loss of robustness in the presence of high-dimensional casewise outliers.
Here, we define a new filter in general dimension, based on statistical data depth functions, to be used in combination with the GSE. Note that, in dimension one, the filter flags cellwise outliers treating the variables as independent. Section 2 introduces the main idea of how to construct filters based on statistical depth functions; in subsection 2.1 we illustrate the procedure using the halfspace depth function, while in subsections 2.2 and 2.3 we introduce two different strategies to mark observations/cells as outliers. Section 3 shows how the approaches in Agostinelli et al. [2015a] and Leung et al. [2017] are special cases of our framework, and we introduce a statistical data depth function named the Gervini-Yohai depth. Section 4 illustrates the features of our approach using a real data set, while Section 5 reports the results of a Monte Carlo experiment. Appendix A discusses the general properties a statistical data depth function should have, Appendix B studies the properties of the Gervini-Yohai depth and Appendix C contains the full results of the Monte Carlo experiment.
2 Filters based on Statistical Data Depth Functions
Let X be an R^p-valued random variable with distribution function F. For a point x in R^p, we consider the statistical data depth d(x, F) of x with respect to F, where d satisfies the four properties given in Liu [1990] and Zuo and Serfling [2000a] and reported in Appendix A of the Supplementary Material. Given an independent and identically distributed sample of size n, we denote by F_n its empirical distribution function and by d(x, F_n) the sample depth. We assume that d(x, F_n) is a uniformly consistent estimator of d(x, F), that is,

    sup_{x in R^p} | d(x, F_n) - d(x, F) | -> 0   almost surely as n -> infinity,

a property enjoyed by many statistical data depth functions, e.g., among others, the simplicial depth [Liu, 1990] and the halfspace depth [Tukey, 1975]. One important feature of depth functions is the depth trimmed region, given by R_alpha(F) = { x in R^p : d(x, F) >= alpha }; for any probability content q, we will denote by R^q(F) the smallest region R_alpha(F) that has probability larger than or equal to q according to F. Throughout, subscripts and superscripts for depth regions are used for depth levels and probability contents, respectively. Let C_alpha(F) (respectively C^q(F)) be the complement in R^p of the set R_alpha(F) (respectively R^q(F)). Let alpha* be the maximum value of the depth; its value depends on the chosen depth function (for instance, alpha* = 1/2 for the halfspace depth of a symmetric distribution). Given a high order quantile delta, we define a filter of dimension p based on

    d_n = sup_{x in C^delta(F)} { d(x, F_n) - d(x, F) }^+,        (1)

where { a }^+ represents the positive part of a, and we mark as outliers the ⌊n d_n⌋ observations with the smallest population depth (where ⌊a⌋ is the largest integer less than or equal to a). This defines a filter in the general dimension p.
We have the following result, with obvious proof.
Proposition 1.
If sup_{x in R^p} | d(x, F_n) - d(x, F) | -> 0 (a.s.), then d_n -> 0 (a.s.) as n -> infinity.
If the above result holds, the filter is consistent, in the sense that asymptotically it flags no outliers when the data follow the reference model. In the next subsection we illustrate this approach using the halfspace depth.
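Operationally, the flagging rule implied by (1) can be sketched as follows. This is a minimal sketch under our reading of (1): the supremum is restricted to the sample points falling in the outlying region C^delta(F), and all names are ours.

    import numpy as np

    def depth_filter(depth_ref, depth_emp, outlying):
        # depth_ref : reference depths d(x_i, F) for the n observations
        # depth_emp : sample depths d(x_i, F_n)
        # outlying  : boolean mask of the observations lying in C^delta(F)
        depth_ref = np.asarray(depth_ref, dtype=float)
        depth_emp = np.asarray(depth_emp, dtype=float)
        n = depth_ref.size
        # d_n: largest positive gap between sample and reference depth on the region
        gaps = np.clip(depth_emp - depth_ref, 0.0, None)
        d_n = gaps[outlying].max() if np.any(outlying) else 0.0
        # flag the floor(n * d_n) observations with the smallest reference depth
        n_flag = int(np.floor(n * d_n))
        flagged = np.zeros(n, dtype=bool)
        flagged[np.argsort(depth_ref)[:n_flag]] = True
        return flagged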
2.1 Filters based on Halfspace Depth
Definition 1 (Halfspace depth).
Let X be an R^p-valued random variable with distribution function F. For a point x in R^p, the halfspace depth of x with respect to F is defined as the minimum probability carried by a closed halfspace containing x:

    hd(x, F) = inf { P_F(H) : H in H(x) },

where H(x) indicates the set of all closed halfspaces in R^p containing x.
A random vector X is said to be elliptically symmetric distributed, denoted by X ~ E_p(mu, Sigma, h), if it has a density function of the form

    f(x) = det(Sigma)^(-1/2) h( (x - mu)' Sigma^(-1) (x - mu) ),

where h is a nonnegative scalar function, mu is the location parameter and Sigma is a positive definite matrix. Denote by F the corresponding distribution function and by d^2(x; mu, Sigma) = (x - mu)' Sigma^(-1) (x - mu) the squared Mahalanobis distance of a p-dimensional point x. By Theorem 3.3 of Zuo and Serfling [2000b], if a depth is affine invariant (property 1) and attains its maximum at the center (property 2) (see Appendix A), then it has the form d(x, F) = g(d^2(x; mu, Sigma)) for some nonincreasing function g, so we can restrict ourselves, without loss of generality, to the case mu = 0 and Sigma = I_p, where I_p is the identity matrix of dimension p. Under this setting, it is easy to see that the halfspace depth of a given point x is hd(x, F) = 1 - F_1(||x||), where F_1 is a marginal distribution of F. If the function h is such that the tails of F_1 are lighter than those of the standard normal, then there exists a t_0 such that F_1(t) >= Phi(t) for all t >= t_0, where Phi is the distribution function of the standard normal. Hence 1 - F_1(t) <= 1 - Phi(t) and therefore, for all x with ||x|| >= t_0, the halfspace depth of x under F does not exceed its halfspace depth under the standard normal reference.
Given an independent and identically distributed sample X_1, ..., X_n, we define the filter in general dimension p introduced previously, where here we use the halfspace depth:

    d_n = sup_{x in C^delta(F_0)} { hd(x, F_n) - hd(x, F_0) }^+,

where delta is a high order quantile, F_n is the empirical distribution function and F_0 is a chosen reference distribution which depends on a pair of initial location and dispersion estimators, m_0 and S_0. Hereafter, we are going to use the normal distribution N_p(m_0, S_0) as reference. For m_0 and S_0 one might use, e.g., the coordinatewise median and the coordinatewise MAD for a univariate filter, as in Leung et al. [2017]. In order to compute the value d_n, we have to identify the set C^delta(F_0), where delta is a large quantile of F_0. By Corollary 4.3 in Zuo and Serfling [2000b], and denoting by d^2(x) = (x - m_0)' S_0^(-1) (x - m_0) the squared Mahalanobis distance of x using the initial location and dispersion estimates, the set can be rewritten as

    C^delta(F_0) = { x in R^p : d^2(x) > chi^2_{p, delta} },

where chi^2_{p, delta} is a large quantile of a chi-squared distribution with p degrees of freedom. Now we want to show that the result given by Proposition 1 holds for this particular case.
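A minimal sketch of this p-variate halfspace filter under the normal reference is given below. The sample halfspace depth is approximated by random projections (the random Tukey depth), since the exact depth is expensive to compute in higher dimensions; the reference depth uses the fact that, for the N_p(m_0, S_0) reference, the halfspace depth of x equals 1 - Phi(d(x)), with d(x) the Mahalanobis distance. Function names, the number of random directions and the use of this approximation are our choices, not part of the original proposal.

    import numpy as np
    from scipy import stats

    def halfspace_depth_sample(X, points, n_dir=500, rng=None):
        # Approximate halfspace depth of `points` w.r.t. the sample X using
        # random projections (random Tukey depth).
        rng = np.random.default_rng(rng)
        n, p = X.shape
        U = rng.standard_normal((n_dir, p))
        U /= np.linalg.norm(U, axis=1, keepdims=True)
        proj_X = X @ U.T                    # projections of the sample, shape (n, n_dir)
        proj_pts = points @ U.T             # projections of the evaluation points
        depths = np.ones(points.shape[0])
        for j in range(n_dir):
            s = np.sort(proj_X[:, j])
            left = np.searchsorted(s, proj_pts[:, j], side="right") / n       # P_n(u'X <= u'x)
            right = 1.0 - np.searchsorted(s, proj_pts[:, j], side="left") / n  # P_n(u'X >= u'x)
            depths = np.minimum(depths, np.minimum(left, right))
        return depths

    def hs_p_variate_filter(X, m0, S0, delta=0.95, n_dir=500, rng=None):
        # Flag the floor(n * d_n) observations with the smallest reference depth,
        # where d_n compares the sample halfspace depth with the N_p(m0, S0)
        # reference depth on the region { d^2(x) > chi^2_{p, delta} }.
        n, p = X.shape
        diff = X - m0
        d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(S0), diff)   # squared Mahalanobis
        depth_ref = 1.0 - stats.norm.cdf(np.sqrt(d2))                  # depth w.r.t. the normal reference
        depth_emp = halfspace_depth_sample(X, X, n_dir=n_dir, rng=rng)
        region = d2 > stats.chi2.ppf(delta, df=p)                      # C^delta(F_0)
        gaps = np.clip(depth_emp - depth_ref, 0.0, None)
        d_n = gaps[region].max() if region.any() else 0.0
        n_flag = int(np.floor(n * d_n))
        flagged = np.zeros(n, dtype=bool)
        flagged[np.argsort(depth_ref)[:n_flag]] = True
        return flagged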
Proposition 2.
Consider a random vector X ~ F, where F is an elliptically symmetric distribution. Also consider a pair of location and dispersion estimators m_0 and S_0 such that m_0 -> mu and S_0 -> Sigma a.s.. Let F_0 be a chosen reference distribution and F_n the empirical distribution function. If the reference distribution satisfies

    hd(x, F) <= hd(x, F_0)   for all x in C^delta(F_0),

where delta is some large quantile of F_0, then the proportion of observations flagged by the filter converges to zero, that is, n_0(F_n)/n -> 0 almost surely as n -> infinity.
Proof.
In Donoho and Gasko [1992] it is proved that, for X_1, ..., X_n i.i.d. with distribution F, sup_{x in R^p} | hd(x, F_n) - hd(x, F) | -> 0 almost surely as n -> infinity.
Note that, by the continuity of F_0 and the almost sure convergence of the initial estimators, this uniform convergence carries over to the depths computed on the standardized sample, a.s. Hence, for each epsilon > 0 there exists an n_epsilon such that, for all n >= n_epsilon, the flagged proportion is bounded by epsilon, and the result follows.
∎
In the next example, we illustrate a univariate filter based on halfspace depth that controls independently the left and the right tail of the distribution.
Example 1 (Univariate filter with two-tail control).
In the univariate case, given a point x, there exist only two closed halfspaces containing it, hence the halfspace depth takes the explicit form

    hd(x, F) = min( F(x), 1 - F(x^-) ),

and, considering the empirical distribution function F_n, the sample halfspace depth is

    hd(x, F_n) = min( F_n(x), 1 - F_n(x^-) ).
Consider m_0 and s_0, a pair of initial location and dispersion estimators; here we choose for m_0 and s_0, respectively, the coordinatewise median and the median absolute deviation (MAD). For each variable X_j (j = 1, ..., p), we denote the standardized version of X_ij by Z_ij = (X_ij - m_0j)/s_0j. Let F_0 be a chosen reference distribution for the standardized variable; here we use the standard normal distribution, i.e., F_0 = Phi. Let F_nj be the empirical distribution of the standardized values, that is F_nj(t) = (1/n) #{ i : Z_ij <= t }.
We define the proportion of flagged outliers for variable j by

    d_nj = sup_{z : |z| > eta} { hd(z, F_nj) - hd(z, F_0) }^+,

where eta is a large quantile of F_0. Note that, according to (1), we are considering the set C^delta(F_0) = { z : |z| > eta }, which results in the simpler form written above given the definition of the halfspace depth in the univariate case. Here, if we consider the order statistics Z_(1)j <= ... <= Z_(n)j, define i_1 = max{ i : Z_(i)j < -eta } and i_2 = min{ i : Z_(i)j > eta }. Using the definition of the halfspace depth function in the univariate case, presented above, the previous expression can be written as

    d_nj = max(  max_{i <= i_1} { i/n - F_0(Z_(i)j) }^+ ,  max_{i >= i_2} { F_0(Z_(i)j) - (i - 1)/n }^+  ).        (2)
Then, we flag the ⌊n d_nj⌋ observations with the smallest depth value as cellwise outliers and replace them by NA values.
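A minimal sketch of this univariate two-tail filter (to be applied to each variable separately); the normal-consistency constant for the MAD and the choice eta = Phi^{-1}(delta) are assumptions of this sketch.

    import numpy as np
    from scipy import stats

    def univariate_hs_filter(x, delta=0.95):
        # Returns a boolean mask of cells flagged as cellwise outliers (set to NA afterwards).
        x = np.asarray(x, dtype=float)
        n = x.size
        med = np.median(x)
        mad = 1.4826 * np.median(np.abs(x - med))     # MAD, consistent at the normal
        z = (x - med) / mad

        order = np.argsort(z)
        z_sorted = z[order]
        ranks = np.arange(1, n + 1)                   # i = 1, ..., n

        # empirical and reference univariate halfspace depths at the order statistics
        depth_emp = np.minimum(ranks / n, 1.0 - (ranks - 1) / n)
        depth_ref = np.minimum(stats.norm.cdf(z_sorted), stats.norm.sf(z_sorted))

        eta = stats.norm.ppf(delta)                   # tail cutoff
        tail = np.abs(z_sorted) > eta                 # points in C^delta(F_0)
        gaps = np.clip(depth_emp - depth_ref, 0.0, None)
        d_n = gaps[tail].max() if tail.any() else 0.0

        n_flag = int(np.floor(n * d_n))
        flagged = np.zeros(n, dtype=bool)
        if n_flag > 0:
            flagged[order[np.argsort(depth_ref)[:n_flag]]] = True
        return flagged

For a data matrix X the filter is applied columnwise, e.g. flags = np.column_stack([univariate_hs_filter(X[:, j]) for j in range(X.shape[1])]), and the flagged cells are set to NA before passing the data to the GSE.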
2.2 A consistent univariate, bivariate and p-variate filter
Given a sample X_1, ..., X_n, with X_i in R^p, we first apply the univariate filter described in the previous example to each variable separately. Filtered data are indicated through an auxiliary matrix of zeros and ones, with zero corresponding to an NA value. Next, we identify the bivariate outliers by iterating the filter over all possible pairs of variables. Consider a pair of variables (j, k). The initial location and dispersion estimators are, respectively, the coordinatewise median and the corresponding 2x2 submatrix of the estimate computed by the generalized S-estimator on the non-filtered data. Note that this ensures the positive definiteness of the estimate and of each submatrix corresponding to a subset of variables. For bivariate points with no components flagged by the univariate filter, we compute the squared Mahalanobis distance and hence apply the bivariate filter, for all pairs 1 <= j < k <= p. At the end, we identify the cells to be flagged as cellwise outliers. The procedure used for this purpose is described in Leung et al. [2017] and reported here. Let

    J = { (i, j, k) : the pair of cells (X_ij, X_ik) is flagged by the bivariate filter in row i }

be the set of triplets which identifies the pairs of cells flagged by the bivariate filter in row i. For each cell (i, j) in the data, we count the number of flagged pairs in the i-th row in which the considered cell is involved:

    m_ij = #{ k : (i, j, k) in J or (i, k, j) in J }.

In the absence of contamination, m_ij approximately follows a binomial distribution with success probability delta_n, where delta_n represents the overall proportion of cellwise outliers undetected by the univariate filter. Hence, we flag the cell (i, j) if m_ij exceeds a high quantile of this binomial distribution. Finally, we apply the p-variate filter described in subsection 2.1 to the full data matrix; detected observations (rows) are directly flagged as p-variate (casewise) outliers. We denote the procedure based on the univariate, bivariate and p-variate filters by HSUBPF.
2.3 A sequential filtering procedure
Suppose we would like to apply a sequence of filters of increasing dimension. For each dimension, the filter updates the data matrix by adding NA values to the tuples identified as outliers of that dimension. In this way, each filter is applied only to those tuples that have not been flagged as outliers by the filters of lower dimension.
Initial values for each filter other than the first would be obtained by applying the GSE to the currently filtered values.
This procedure aims to be a valid alternative to the one used in the presented HSUBPF filter for performing a sequence of filters of different dimensions. However, this is a preliminary idea and it has not been implemented yet.
3 Gervini-Yohai d-variate filter
In this Section we show that the filters introduced in Agostinelli et al. [2015a] are a special case of our approach, using the following Gervini-Yohai depth

    d_GY(x, F, G) = 1 - G( d^2(x; m(F), S(F)) ),

where G is a continuous distribution function, m(F) and S(F) are the location and scatter matrix functionals and d^2(x; m(F), S(F)) is the squared Mahalanobis distance. Appendix B shows that this is a statistical data depth function. Let G_n be a sequence of discrete distribution functions, which might depend on F_n, such that G_n converges to G; we might define the finite sample version of the Gervini-Yohai depth as

    d_GY(x, F_n, G_n) = 1 - G_n( d^2(x; m(F_n), S(F_n)) );

however, for filtering purposes we will use two alternative definitions later on. The use of G_n, which might depend on the data, instead of G makes this sample depth semiparametric. We notice that the Mahalanobis depth, which is completely parametric, cannot be used for the purpose of defining a filter in a similar fashion.
Let J = (j_1, ..., j_d) be a d-tuple of integers in {1, ..., p} and, for ease of presentation, let X_J be the corresponding subvector of dimension d of X. Consider a pair of initial location and scatter estimators m_0J and S_0J. Now, define the squared Mahalanobis distance of a data point X_iJ by D_iJ = d^2(X_iJ; m_0J, S_0J). Consider G, the distribution function of a chi-squared random variable with d degrees of freedom, F the distribution function of X_J, and let G_n be the empirical distribution function of D_1J, ..., D_nJ. We consider two finite sample versions of the Gervini-Yohai depth, i.e.,

    d_GY(x, F_n, G) = 1 - G( d^2(x; m_0J, S_0J) )

and

    d_GY(x, F_n, G_n) = 1 - G_n( d^2(x; m_0J, S_0J) ).

The proportion of flagged d-variate outliers is defined by

    d_n = sup_{x : d_GY(x, F_n, G) <= eta} { d_GY(x, F_n, G_n) - d_GY(x, F_n, G) }^+ .

Here eta = d_GY(x_0, F_n, G), where x_0 is any point in R^d such that d^2(x_0; m_0J, S_0J) equals a large quantile, say chi^2_{d, alpha}, of G. Then, we flag ⌊n d_n⌋ observations. It is easy to see that

    d_n = sup_{t >= chi^2_{d, alpha}} { G(t) - G_n(t) }^+,

since the Gervini-Yohai depth is a nonincreasing function of the squared Mahalanobis distance of the point x.
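In practice the d-variate Gervini-Yohai filter therefore amounts to comparing the chi-squared reference with the empirical distribution of the squared Mahalanobis distances beyond a large quantile; a minimal sketch (names, and the handling of ties, are ours):

    import numpy as np
    from scipy import stats

    def gy_filter(X, m0, S0, alpha=0.95):
        # Flag the observations with the smallest Gervini-Yohai depth, i.e. the
        # largest squared Mahalanobis distances, in proportion d_n.
        X = np.atleast_2d(X)
        n, d = X.shape
        diff = X - m0
        D = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(S0), diff)  # squared distances

        eta = stats.chi2.ppf(alpha, df=d)        # large quantile of G = chi^2_d
        D_sorted = np.sort(D)
        G = stats.chi2.cdf(D_sorted, df=d)       # reference cdf at the order statistics
        Gn_left = np.arange(n) / n               # empirical cdf just below each order statistic
        tail = D_sorted >= eta
        gaps = np.clip(G - Gn_left, 0.0, None)
        d_n = gaps[tail].max() if tail.any() else 0.0

        n_flag = int(np.floor(n * d_n))
        flagged = np.zeros(n, dtype=bool)
        if n_flag > 0:
            flagged[np.argsort(D)[-n_flag:]] = True
        return flagged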
We can rephrase Proposition 2 of Leung et al. [2017], which states the consistency property of the filter, as follows.
Proposition 3.
Consider a random vector X ~ F_0 and a pair of location and scatter estimators m_0 and S_0 such that m_0 -> mu_0 and S_0 -> Sigma_0 a.s.. Consider any continuous distribution function G, let G_n be the empirical distribution function of the squared Mahalanobis distances d^2(X_i; m_0, S_0), i = 1, ..., n, and let G_0 be the distribution function of d^2(X; mu_0, Sigma_0). If the distribution satisfies:

    G(t) <= G_0(t)   for all t >= eta,        (3)

where eta is a large quantile of G (equivalently, eta = d^2(x_0; mu_0, Sigma_0) for any point x_0 on the corresponding contour), then

    n_0(G_n) / n -> 0   almost surely as n -> infinity,

where n_0(G_n) = ⌊n d_n⌋ is the number of observations flagged by the filter.
4 Example
We consider the weekly returns, over the period considered in Leung et al. [2017], for a portfolio of 20 small-cap stocks.
With this example we compare the filter introduced in Agostinelli et al. [2015a] (denoted GYUF for the univariate filter and GYUBF for the univariate-bivariate filter) and the same filter with the improvements proposed in Leung et al. [2017] (denoted here GYUBFDDCC) with the presented filters based on statistical data depth functions, using the halfspace depth (HSUF for the univariate filter, HSUBF for the univariate-bivariate filter, HSUBPF for the univariate-bivariate-p-variate filter and HSUBPFDDCC for the combination of HSUBPF with the modifications in Leung et al. [2017]).
Figure 1 shows the normal QQ-plots of the 20 variables. The returns of all stocks seem to roughly follow a normal distribution, but with the presence of large outliers. The returns in each stock that lie more than 3 MADs away from the coordinatewise median are displayed in green in the figure. In total, only a small percentage of cells lies outside these bounds; if these are cellwise outliers, they propagate to a considerably larger percentage of the cases.
Figure 2 shows the squared Mahalanobis distances (MDs) of the weekly returns based on the estimates given by the MLE, the GYUF, the GYUBF, the HSUF, the HSUBF and the HSUBPF. Observations with one or more cells flagged as outliers are displayed in green. We say that an estimate identifies an outlier correctly if the MD exceeds a large quantile of a chi-squared distribution with 20 degrees of freedom. We see that the MLE does a very poor job, recognizing only 8 of the 59 cases. The GYUF, HSUF, HSUBF and HSUBPF show a quite similar behavior, doing better than the MLE but missing about one third of the cases. The GYUBF identifies all but seven of the cases.
Figure 3 shows the Mahalanobis distances produced by GYUBFDDCC and HSUBPFDDCC. Here we can see that the GYUBFDDCC misses 13 of the 59 cases while the HSUBPFDDCC misses 15. Although they do not seem to do a better job overall, these two filters are able to flag some observations, not identified before, as casewise outliers. These outliers are more clearly highlighted by HSUBPFDDCC.
Figure 4 shows the bivariate scatter plots of WTS versus HTLD, HTLD versus WSBC and WSBC versus SUR, where the GYUBF and HSUBF filters are applied, respectively. The bivariate observations with at least one component flagged as an outlier are shown in blue, and the outliers detected by the bivariate filter are in orange. We see that the HSUBF identifies fewer outliers than the GYUBF.
5 Monte Carlo results
We performed a Monte Carlo simulation to assess the performance of the proposed filter based on halfspace depth. After the filter flags the outlying observations, the generalized S-estimator is applied to the data with the added missing values. Our simulation study is based on the same setup described in Leung et al. [2017], so that the performance of our filter can be directly compared with that of the filter introduced in their work.
We considered samples from a N_p(mu, Sigma) distribution, where all values in mu are equal to 0; the choice of Sigma and the sample size n follow the setup in Leung et al. [2017]. We consider the following scenarios:

Clean data: data without changes.

Cellwise contamination: a proportion epsilon of cells in the data is replaced by the value k, where k ranges over a grid of contamination sizes.

Casewise contamination: a proportion epsilon of cases in the data matrix is replaced by k v, where k ranges over a grid of contamination sizes and v is the eigenvector corresponding to the smallest eigenvalue of Sigma, with length such that its squared Mahalanobis distance with respect to (mu, Sigma) equals 1.
The proportions of contaminated rows chosen for casewise contamination are epsilon = 0.1 and 0.2, and epsilon = 0.02 and 0.05 for cellwise contamination. The number of replicates is the same for every scenario. A sketch of both contamination mechanisms is given below.
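This is a minimal sketch of the two contamination mechanisms; the contamination size k, the scaling of v to unit Mahalanobis length and the random selection of cells and rows are assumptions of this sketch.

    import numpy as np

    def contaminate_cellwise(X, eps, k, rng=None):
        # Replace a proportion eps of randomly chosen cells by the value k.
        rng = np.random.default_rng(rng)
        Xc = X.copy()
        Xc[rng.random(X.shape) < eps] = k
        return Xc

    def contaminate_casewise(X, eps, k, Sigma, mu=None, rng=None):
        # Replace a proportion eps of rows by mu + k * v, where v is the eigenvector
        # of Sigma with the smallest eigenvalue, scaled to unit Mahalanobis length.
        rng = np.random.default_rng(rng)
        n, p = X.shape
        mu = np.zeros(p) if mu is None else mu
        evals, evecs = np.linalg.eigh(Sigma)           # eigenvalues in ascending order
        v = evecs[:, 0]
        v = v / np.sqrt(v @ np.linalg.inv(Sigma) @ v)  # Mahalanobis length 1
        Xc = X.copy()
        rows = rng.choice(n, size=int(np.floor(eps * n)), replace=False)
        Xc[rows] = mu + k * v
        return Xc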
We measure the performance of a given pair of location and scatter estimators m_n and S_n using the mean squared error (MSE) and the likelihood ratio test (LRT) distance, as in Leung et al. [2017]:

    LRT(S_n) = (1/R) sum_{r=1}^{R} D(S_n^(r), Sigma),

where S_n^(r) is the estimate of the r-th replication and

    D(S, Sigma) = trace(S Sigma^(-1)) - log det(S Sigma^(-1)) - p

is proportional to the Kullback-Leibler divergence between two Gaussian distributions with the same mean and covariance matrices S and Sigma. Finally, we computed the maximum average LRT distances over all contamination values k.
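A minimal sketch of this performance measure (assuming the form trace(S Sigma^{-1}) - log det(S Sigma^{-1}) - p given above):

    import numpy as np

    def lrt_distance(S_hat, Sigma):
        # LRT distance between an estimated and a true scatter matrix; proportional to
        # the Kullback-Leibler divergence between two Gaussians with the same mean.
        p = Sigma.shape[0]
        M = S_hat @ np.linalg.inv(Sigma)
        sign, logdet = np.linalg.slogdet(M)
        return np.trace(M) - logdet - p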

Table 1: Average LRT distances under cellwise contamination.

    p   eps    GYUF   HSUF   GYUBF  HSUBF  HSUBPF  GYUBFDDCC  HSUBPFDDCC
    10  0       0.8    0.8    0.9    0.9    0.9     1.0        1.0
    10  0.02    1.2    1.3    1.3    1.3    1.3     1.1        1.1
    10  0.05    4.6    4.8    4.6    4.5    4.3     2.4        2.5
    20  0       1.3    1.4    1.4    1.5    1.5     1.8        1.8
    20  0.02    3.9    4.4    4.2    4.5    4.4     2.5        2.5
    20  0.05   11.0   12.2   11.3   11.9   11.8     8.2        8.3
    30  0       1.9    1.9    2.0    2.0    2.0     3.4        3.4
    30  0.02    6.0    6.7    6.5    6.8    6.6     5.0        5.1
    30  0.05   14.5   16.9   15.1   16.8   16.6    13.4       13.9
    40  0       2.4    2.5    2.6    2.6    2.6     5.8        5.8
    40  0.02    7.5    8.5    8.2    8.6    8.5     9.2        9.3
    40  0.05   17.4   20.8   18.1   20.7   20.5    20.0       20.0
    50  0       2.9    3.0    3.1    3.1    3.2     5.1        5.1
    50  0.02    8.8   10.0    9.7   10.2   10.0    12.2       12.5
    50  0.05   19.9   24.5   20.8   24.3   24.1    24.5       24.7
Table 1 shows the average LRT distances under cellwise contamination. We see that the univariate and univariate-bivariate filters have more difficulty in filtering moderate cellwise outliers (that is, moderate values of the contamination size k), while they show a stable and nearly optimal behavior as the contamination size k increases. GYUBFDDCC and HSUBPFDDCC have lower maximum average LRT distances, but their distances are higher for large p. This behavior is shown in Figure 5 (top), where the average LRT distances are displayed versus the different contamination values k for the cellwise contamination levels considered.
Table 2: Maximum average LRT distances under casewise contamination.

    p   eps    GYUF   HSUF   GYUBF  HSUBF  HSUBPF  GYUBFDDCC  HSUBPFDDCC
    10  0       0.8    0.8    0.9    0.9    0.9     1.0        1.0
    10  0.1    10.5   12.6   14.9   13.7    7.8     3.6        3.8
    10  0.2    93.0  104.5  125.3  107.9   50.5    18.7       18.3
    20  0       1.3    1.4    1.4    1.5    1.5     1.8        1.8
    20  0.1    26.7   33.6   39.3   37.8   15.3     7.1        7.0
    20  0.2   111.7  110.1  125.1  114.8  110.2    19.6       19.7
    30  0       1.9    1.9    2.0    2.0    2.0     3.4        3.4
    30  0.1    50.5   49.6   59.2   57.0   22.3     9.0        9.2
    30  0.2   111.0  108.3  119.1  114.1  114.6    17.1       17.0
    40  0       2.4    2.5    2.6    2.6    2.6     5.8        5.9
    40  0.1    57.4   59.8   63.6   61.6   29.3    16.2       16.5
    40  0.2   109.7  106.7  114.6  113.4  113.8    19.3       19.0
    50  0       2.9    3.0    3.1    3.1    3.2     5.1        5.0
    50  0.1    61.5   61.3   65.1   63.4   38.4    30.9       31.5
    50  0.2   108.8  105.6  112.1  113.2  113.4    20.6       19.2
Table 2 shows the maximum average LRT distances under casewise contamination. Overall, the GYUBFDDCC and HSUBPFDDCC outperform all the other filters. Excluding these two, we see that the HSUBPF is competitive in the case of moderate casewise contamination. An illustration of their behavior is given in Figure 6 (top), which shows the average LRT distances for different values of the contamination size k at the casewise contamination levels considered.
Table 3: Maximum average MSE (multiplied by 1000) under cellwise contamination.

    p   eps    GYUF  HSUF  GYUBF  HSUBF  HSUBPF  GYUBFDDCC  HSUBPFDDCC
    10  0       11    11    11     11     11      13         13
    10  0.02    13    13    13     13     13      15         15
    10  0.05    19    20    20     20     20      20         20
    20  0        5     5     5      5      5       7          7
    20  0.02     7     7     7      7      7       8          8
    20  0.05    15    16    15     15     15      16         16
    30  0        3     4     4      3      4       6          6
    30  0.02     5     5     5      5      5       7          7
    30  0.05    13    14    13     14     14      15         15
    40  0        3     3     3      3      3       6          6
    40  0.02     4     5     5      5      4       7          7
    40  0.05    13    14    13     14     14      15         16
    50  0        2     2     2      2      2       4          4
    50  0.02     4     4     4      4      4       6          6
    50  0.05    12    14    12     14     14      14         15
Table 4: Maximum average MSE (multiplied by 1000) under casewise contamination.

    p   eps    GYUF  HSUF  GYUBF  HSUBF  HSUBPF  GYUBFDDCC  HSUBPFDDCC
    10  0       11    11    11     11     11      13         13
    10  0.1     15    17    17     17     14      17         16
    10  0.2     94   112   137    123     76      25         25
    20  0        5     5     5      5      5       7          7
    20  0.1     11    13    14     13      8       8          8
    20  0.2     65    70    92     77     73      13         13
    30  0        3     4     4      4      4       6          6
    30  0.1     10    10    12     11      7       6          6
    30  0.2     49    52    71     57     57       8          8
    40  0        3     3     3      3      3       6          6
    40  0.1      8     9    10      9      6       5          5
    40  0.2     40    43    60     46     46       7          7
    50  0        2     2     2      2      2       4          3
    50  0.1      7     8     8      8      6       5          5
    50  0.2     34    36    52     39     39       5          5
Table 3 and Table 4 show the maximum average MSE under cellwise and casewise contamination, respectively. The values in the tables are the MSE values multiplied by 1000 for better readability and comparison. Under casewise contamination, the GYUBFDDCC and HSUBPFDDCC outperform the other filters, and they also obtain competitive results under cellwise contamination. In Figure 5 (bottom) and Figure 6 (bottom) the average MSE is displayed versus the different contamination values k, for cellwise and casewise contamination respectively. We highlight the nice redescending performance of the HSUBPF for both LRT and MSE, which is not shared by the other filters.
6 Conclusions
Considering the two-step procedure introduced in Agostinelli et al. [2015a] and improved by Leung et al. [2017], we present a new filter based on statistical data depth functions that can be used in place of the previous filters and is intended as a generalization of them. Furthermore, we also combine the depth filter HSUBPF with DDC, as suggested by Leung et al. [2017]. As shown in the example, the HSUBPF filter is able to identify large outlying observations and removes fewer cells than the GYUBF. In addition, it also detects the casewise outliers, which are clearly highlighted.
If we consider the performance of the entire procedure, our simulations show that using HSUBPF we obtain the best estimates for moderate proportions of contamination, and it remains competitive for higher percentages of contamination and for high-dimensional data sets, under both types of contamination models. Generally, the GYUBF and HSUBPF combined with DDC outperform the other filters; differences in performance between these two estimators are not clearly visible. However, the HSUBPF has shown an interesting behaviour for moderate contamination levels, especially under casewise contamination.
Further research could explore the performance of the estimator on different types of data and how it varies with respect to the sample size n and the dimension p, for example in flat data sets (e.g., n < p). In addition, different statistical data depth functions could be used in place of the halfspace depth.
Appendix A Statistical data depth properties
Definition 2 (Depth Function). A bounded and nonnegative function d(., F) is a statistical data depth function if it satisfies: (1) affine invariance; (2) maximality at the center of a symmetric distribution; (3) monotonicity relative to the deepest point, i.e., the depth is nonincreasing along any ray from the deepest point; (4) approaching zero, i.e., d(x, F) -> 0 as ||x|| -> infinity [Liu, 1990; Zuo and Serfling, 2000a].
Appendix B GerviniYohai depth
Here we want to show that the Gervini-Yohai depth, defined as d_GY(x, F, G) = 1 - G(d^2(x; m(F), S(F))), is a proper statistical depth function, i.e., it satisfies the four properties introduced above.

Affine invariance: it follows directly from the affine invariance of the squared Mahalanobis distance when the location and scatter functionals are affine equivariant;

Maximality at center: if F is elliptically symmetric around mu, then m(F) = mu and d^2(mu; m(F), S(F)) = 0.
For any x we have

    d_GY(x, F, G) = 1 - G( d^2(x; m(F), S(F)) ) <= 1 - G(0) = d_GY(mu, F, G);

when G is strictly monotone, strict inequality holds for x different from mu, and mu is the unique maximizer of the Gervini-Yohai depth.

Monotonicity: let mu be the deepest point and x any point in R^p. For any point y = mu + a (x - mu), with a in [0, 1], lying on the segment between mu and x, we have d^2(y; m(F), S(F)) <= d^2(x; m(F), S(F)), and since G is nondecreasing,

    d_GY(y, F, G) >= d_GY(x, F, G).

Then the Gervini-Yohai depth is nonincreasing along any ray from the deepest point.

Approaching zero: if ||x|| -> infinity we have that d^2(x; m(F), S(F)) -> infinity and consequently G(d^2(x; m(F), S(F))) -> 1. Then d_GY(x, F, G) -> 0.
Appendix C Monte Carlo experiment
Results for all combinations of the model parameters explored in the Monte Carlo simulation are reported in this section.
In Figures 7, 8 and Figures 9, 10 the average LRT and average MSE versus different contamination values are displayed, respectively.
References
 Agostinelli et al. [2015a] C. Agostinelli, A. Leung, V.J. Yohai, and R.H. Zamar. Robust estimation of multivariate location and scatter in the presence of cellwise and casewise contamination. TEST, 24(3):441–461, 2015a.
 Agostinelli et al. [2015b] C. Agostinelli, A. Leung, V.J. Yohai, and R.H. Zamar. Rejoinder on: Robust estimation of multivariate location and scatter in the presence of cellwise and casewise contamination. TEST, 24(3):484–488, 2015b.
 Alqallaf et al. [2009] F. Alqallaf, S. Van Aelst, R. H. Zamar, and V. J. Yohai. Propagation of outliers in multivariate data. The Annals of Statistics, 37(1):311–331, 2009.
 Danilov et al. [2012] M. Danilov, V.J. Yohai, and R.H. Zamar. Robust estimation of multivariate location and scatter in the presence of missing data. Journal of the American Statistical Association, 107:1178–1186, 2012.
 Donoho and Gasko [1992] D.L. Donoho and M. Gasko. Breakdown properties of location estimates based on halfspace depth and projected outlyingness. The Annals of Statistics, 20(4):1803–1827, 1992.
Farcomeni [2014] A. Farcomeni. Robust constrained clustering in presence of entrywise outliers. Technometrics, 56(1):102–111, 2014.
 Gervini and Yohai [2002] D. Gervini and V.J. Yohai. A class of robust and fully efficient regression estimators. The Annals of Statistics, 30(2):583–616, 2002.
 Leung et al. [2017] A. Leung, V.J. Yohai, and R.H. Zamar. Multivariate location and scatter matrix estimation under cellwise and casewise contamination. Computational Statistics and Data Analysis, 111:59–76, 2017.
 Liu [1990] R.Y. Liu. On a notion of data depth based on random simplices. The Annals of Statistics, 18(1):405–414, 1990.
Maronna et al. [2006] R.A. Maronna, R.D. Martin, and V.J. Yohai. Robust Statistics: Theory and Methods. Wiley, Chichester, 2006.
 Rousseeuw and Van Den Bossche [2018] P.J. Rousseeuw and W. Van Den Bossche. Detecting deviating data cells. Technometrics, 60(2):135–145, 2018.
 Serfling [2006] R.J. Serfling. Multivariate symmetry and asymmetry. Encyclopedia of statistical sciences, pages 5338–5345, 2006.
 Tukey [1975] J.W. Tukey. Mathematics and picturing of data. In Proceedings of International Congress of Mathematics, volume 2, pages 523–531, 1975.
 Zuo and Serfling [2000a] Y. Zuo and R. Serfling. General notions of statistical depth function. The Annals of Statistics, 28(2):461–482, 2000a.
Zuo and Serfling [2000b] Y. Zuo and R.J. Serfling. Structural properties and convergence results for contours of sample statistical depth functions. The Annals of Statistics, 28(2):483–499, 2000b.