1 The problem of nearby contamination
The leading example in the paper is the geyser (Old Faithful) dataset. From the scatterplot of this bivariate dataset we see that it consists of two clusters, the smaller of which contains about 30% to 35% of the observations. If one interprets the smaller cluster as contamination this is a relatively high contamination level, though it should not be prohibitive since the estimators considered in the paper can be tuned to a breakdown value well above 35%. However, the contamination happens to lie quite close to the inlying data. Having a large fraction of contamination located fairly close by makes the geyser data particularly challenging, as illustrated by CRAC. If we use a scatter estimator with a breakdown value of e.g. 40%, replacing any 35% of clean data by data points positioned anywhere cannot completely destroy the scatter matrix (in the sense of making its first eigenvalue arbitrarily large or its last eigenvalue arbitrarily close to zero), but that does not imply that the scatter matrix will have a small bias. Indeed, it is known that the bias of the estimators under study is the largest for nearby contamination, as shown byHubert et al. (2014).
The other real data example in the paper is the cows dataset with 4 variables. Our first instinct was to carry out a PCA to get some idea about the shape of the data. Figure 1
shows the first two principal components of the cows data, which explain 96% of the total variance. (Here we used the ROBPCA method ofHubert et al. (2005)
, but classical PCA gave a very similar picture.) The plot shows that this dataset is equally challenging. CRAC interpret it as a well-behaved point cloud plus contamination. Again most of the contamination is nearby, and then it fans out. An alternative interpretation is that it could be a sample from a skewed distribution, and in that model none of these points need to be considered outlying.
The paper starts by analyzing the geyser data through monitoring an S-estimator, the subsequent MM-estimator, and the MCD. Each of these estimators is tuned by its breakdown value. For the S-estimator this yields the surprising conclusion that tuning for a 50% breakdown value achieves the intended result, whereas 49% does not. A second conclusion is that the tuning of the MM-estimator appears more stable, but that is only because the MM-estimator starts from the S-estimator with 50% breakdown value, and it fails when starting from the 49% version.
These conclusions would appear to suggest that S-estimators are not suitable in such challenging situations. In fact, using the Matlab code kindly provided by CRAC we carried out an additional experiment, confirming that if the small cluster is moved a little bit in the direction of the larger one, even the S-estimator tuned for 50% breakdown no longer detects it, resulting in an uninformative monitoring plot consisting of horizontal lines only. Following the S-estimate by the MM step does not change that of course.
In both the geyser and cows examples, as well as the two simulated data sets in Section 6 whose contamination is equally nearby, CRAC find that the MCD gives better results due to its hard rejection.
2 The choice of the rho function
We would like to argue that it is not the definition of S-estimators that makes them fail the stated objective in these four examples, but rather the choice of the -function. In our notation, a multivariate S-estimator of a -variate dataset is defined as the pair
formed by a location vectorand a PSD scatter matrix that together minimize subject to
in which is the statistical distance of to relative to . One often takes where
follows the multivariate standard normal distribution. Note that we writeas a function of the statistical distance and not its square .
People almost exclusively use the bisquare in S-estimators, which redescends slowly in order to attain a high statistical efficiency at uncontaminated data. This is what makes it so hard to detect nearby contamination. To illustrate our point, we construct a ‘custom made’
-function for dealing with nearby outliers. It is given by
where is a tuning constant that determines where the -function becomes flat. Note that the dimension matters, because for uncontaminated Gaussian data the squared distance roughly follows a distribution with degrees of freedom which has expectation . The corresponding -function is given by :
and the weight function by :
For this -function, can be computed numerically or by Monte Carlo.
Figure 2 shows the functions and for dimension , for different values of . In the right panel we see that the more approaches 0, the harder the rejection becomes.
The left panel of Figure 3 shows the weights assigned to the observations in the geyser data, where the statistical distances on the horizontal axis were computed from the MCD estimates of location and scatter. The MCD method (red line) assigns weights that are one for most points of the main cluster and zero for all points of the second cluster. In contrast, the bisquare (black curve) still gives quite a bit of weight to the points lying in between the two clusters. (The weights are shown as little circles on the curve.) These points pull the S-estimator towards a non-robust solution. Our custom weight function (blue curve) redescends much faster so it gives small weights to the intermediate points, thereby behaving more like the MCD. The right panel shows the weights in the cows data, with similar conclusions.
We applied the S-estimator with the custom function for to the geyser dataset, using the FSDA toolbox of Riani et al. (2012). This does yield a fit to the main cluster, as illustrated by the tolerance ellipse in Figure 4. For the cows data we also obtain the desired fit.
We constructed this particular custom function with the sole purpose of illustrating that the behavior of S-estimators does not only depend on tuning, but also on the shape of the -function. The shape of this particular makes the S-estimator use weights similar to those of the MCD, especially when is tiny. This exemplifies the tradeoff between the ability to deal with nearby outliers and statistical efficiency at Gaussian data. But we definitely do not propose for general use in multivariate S-estimators because it has some disadvantages. If we use its breakdown value is below the 50% of the MCD and decreases with the dimension . We could also put but then we obtain an inconsistent estimator that needs a consistency correction. In either case we do not know how many data points will fall in the non-constant part of . For the MCD this number is always , and the MCD is also easier to compute.
3 Speeding up the computation
Monitoring requires the repeated calculation of a statistical procedure for various values of the parameter that needs tuning. Therefore the feasibility of monitoring depends on the computation time of the method being monitored. The Matlab code used in the paper computes the S-estimator, MM-estimator and MCD from 1000 initial elemental subsets that are kept unchanged throughout the monitoring. Computing concentration steps from 1000 subsets is computationally demanding. An alternative is to run the deterministic algorithm of Hubert et al. (2012) who start from six specifically constructed initial estimates instead of 1000 random ones, and illustrated their method by monitoring memberships in a multivariate classification. Hubert et al. (2015) extended this work to S-estimators and MM-estimators and illustrated it by monitoring the estimates of location and scatter in a flow cytometry dataset with data points.
4 Limitations of monitoring
The review paper by CRAC clearly documents the benefits of monitoring. On the other hand the approach also has some limitations having to do with the size of the problem and the number of parameters that should be tuned. One of us works in the food sorting industry where thousands of items are inspected in milliseconds in order to detect outliers (impurities), yielding settings with tens of dimensions. In such situations monitoring could still be of use for tuning classification parameters, as long as it can be done off-line. The large sample size is the least problematic: instead of monitoring all the individual observations one can monitor a subset of them, or summary statistics. Higher dimensions are more difficult to handle due to the substantially increasing computation time. Another question is how many parameters can be tuned, as the parameters usually interact with each other. The paper illustrates tuning one parameter, and if there are more the experiment needs to be carefully designed.
- Hubert et al. (2014) Hubert, M., P. Rousseeuw, and K. Vakili (2014). Shape bias of robust covariance estimators: an empirical study. Statistical Papers 55, 15–28.
et al. (2005)
Hubert, M., P. Rousseeuw, and K. Vanden Branden (2005).
ROBPCA: a new approach to robust principal components analysis.Technometrics 47, 64–79.
- Hubert et al. (2015) Hubert, M., P. Rousseeuw, D. Vanpaemel, and T. Verdonck (2015). The DetS and DetMM estimators for multivariate location and scatter. Computational Statistics & Data Analysis 81, 64 – 75.
- Hubert et al. (2012) Hubert, M., P. J. Rousseeuw, and T. Verdonck (2012). A deterministic algorithm for robust location and scatter. Journal of Computational and Graphical Statistics 21(3), 618–637.
- Riani et al. (2012) Riani, M., D. Perrotta, and F. Torti (2012). FSDA: A MATLAB toolbox for robust analysis and interactive data exploration. Chemometrics and Intelligent Laboratory Systems 116, 17–32.