# Finite-sample properties of robust location and scale estimators

When experimental data are contaminated, we usually employ robust alternatives to common location and scale estimators, such as the sample median and Hodges-Lehmann estimators for location and the sample median absolute deviation and Shamos estimators for scale. It is well known that these estimators have high positive asymptotic breakdown points and are normally consistent as the sample size tends to infinity. To our knowledge, how the finite-sample properties of these estimators depend on the sample size has not been well studied in the literature. In this paper, we fill this gap by providing their closed-form finite-sample breakdown points and by calculating the unbiasing factors and relative efficiencies of the robust estimators through extensive Monte Carlo simulations for sample sizes up to 100. The numerical study shows that the unbiasing factor improves the finite-sample performance significantly. In addition, we provide predicted values of the unbiasing factors, obtained by the least squares method, which can be used for sample sizes greater than 100.


## 1 Introduction

Estimation of the location and scale parameters of a distribution, such as the mean and standard deviation of a normal population, is a common and important problem in various branches of engineering, including biomedical, chemical, materials, mechanical and industrial engineering. The quality of the data plays an important role in estimating these parameters, yet in the engineering sciences the experimental data are often contaminated due to measurement errors, volatile operating conditions, etc. Thus, robust estimators are advocated as alternatives to commonly used location and scale estimators (e.g., the sample mean and sample standard deviation) for estimating the population parameters. For example, when some of the observations are contaminated by outliers, we usually adopt the sample median and Hodges-Lehmann (Hodges and Lehmann, 1963) estimators for the location parameter and the sample median absolute deviation (Hampel, 1974) and Shamos (Shamos, 1976) estimators for the scale parameter, because these estimators have a large breakdown point and thus perform well both in the presence and absence of outliers.

The breakdown point is a common criterion for measuring the robustness of an estimator. The larger the breakdown point of an estimator, the more robust it is. The finite-sample breakdown point (Donoho and Huber, 1983) is defined as the maximum proportion of incorrect or arbitrarily large observations that an estimator can tolerate without producing an egregiously incorrect value. For example, the breakdown points of the sample mean and the sample median are 0 and 1/2, respectively. In general, the breakdown point can be written as a function of the sample size. In this paper, we provide the finite-sample breakdown points for the various location and scale estimators mentioned above. We show that when the sample sizes are small, they are noticeably different from the asymptotic breakdown point, which is the limit of the finite-sample breakdown point as the sample size approaches infinity.

It deserves mentioning that for robust scale estimation, the MAD and the Shamos estimators not only have positive asymptotic breakdown points, but are also normally consistent as the sample size goes to infinity. However, when the sample size is small, they have serious biases and provide inappropriate estimates of the scale parameter. Bias-correction techniques are commonly adopted to improve the finite-sample performance of these estimators. For instance, Williams (2011) studied, through computer simulations, the finite-sample correction factors for several simple robust estimators of the standard deviation of a normal population, which include the MAD, interquartile range, shortest half interval, and median moving range. Later on, Hayes (2014) obtained the finite-sample bias-correction factors for the MAD as an estimator of the scale parameter. These studies show that finite-sample correction factors can substantially reduce the systematic biases of these robust estimators, especially when the sample sizes are small.

To our knowledge, the finite-sample properties of the sample median absolute deviation (MAD) and Shamos estimators have received little attention in the literature, except for some references covering the topic for small sample sizes. This observation motivates us to employ extensive Monte Carlo simulation to obtain the empirical biases of these estimators. Given that the empirical variance is one of the important metrics for evaluating an estimator, we also obtain the finite-sample variances of the median, Hodges-Lehmann, MAD and Shamos estimators under the standard normal distribution, which are not fully provided in the statistics literature. Numerical results show that the unbiasing factor improves the finite-sample performance of the estimator significantly. In addition, we provide predicted values of the unbiasing factors, obtained by the least squares method, which can be used for sample sizes greater than 100.

The remainder of this paper is organized as follows. In Section 2, we derive the finite-sample breakdown points for robust location estimators and robust scale estimators. Using extensive Monte Carlo simulation, we calculate the empirical biases of the MAD and Shamos estimators in Section 3 and the finite-sample variances of the median, HL, MAD, and Shamos estimators in Section 4. Some concluding remarks are provided in Section 5.

## 2 Finite-sample breakdown point

In this section, we derive the finite-sample breakdown points for robust location estimators: the sample median and Hodges-Lehmann (HL) estimator in Subsection 2.1 and for robust scale estimators: the MAD and Shamos estimator in Subsection 2.2.

### 2.1 Robust location estimators

It is well known that the asymptotic breakdown points of the sample median and the HL estimator are $1/2$ and $1 - 1/\sqrt{2} \approx 0.293$, respectively. Note that these estimators are in a closed form and are location-equivariant in the sense that $\hat{\theta}(X_1 + b, \ldots, X_n + b) = \hat{\theta}(X_1, \ldots, X_n) + b$. However, in many cases, the finite-sample breakdown points can be noticeably different from the asymptotic breakdown point, especially when the sample size is small. For instance, when $n = 5$, we observe from equation (1) that the finite-sample breakdown point for the median is 0.4, which is different from its asymptotic breakdown point of 0.5.

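Suppose that we have a sample of size $n$, $X_1, X_2, \ldots, X_n$. Then we can make up to $\lfloor (n-1)/2 \rfloor$ of the sample observations arbitrarily large without making the median arbitrarily large. Let $\lfloor x \rfloor$ be the floor function ($\lfloor x \rfloor$ is the largest integer not exceeding $x$). The finite-sample breakdown point of the median is given by

$$\epsilon_n = \frac{\lfloor (n-1)/2 \rfloor}{n}. \tag{1}$$

Using the fact that $\lfloor x \rfloor$ can be rewritten as $\lfloor x \rfloor = x - \delta$ where $0 \le \delta < 1$, we have

$$\epsilon_n = \frac{\lfloor (n-1)/2 \rfloor}{n} = \frac{1}{2} - \frac{1}{2n} - \frac{\delta}{n}.$$

The asymptotic breakdown point of the median is obtained by taking the limit of the finite-sample breakdown point as $n \to \infty$, which gives $\epsilon = \lim_{n \to \infty} \epsilon_n = 1/2$.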
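As a quick numerical illustration of equation (1), the following minimal sketch evaluates the finite-sample breakdown point of the median for a few sample sizes. The function name `median_breakdown_point` is a hypothetical helper chosen for this example and is not part of any package mentioned in the paper.

```python
def median_breakdown_point(n: int) -> float:
    """Finite-sample breakdown point of the sample median, equation (1)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # floor((n - 1) / 2) observations can be corrupted without ruining the median
    return ((n - 1) // 2) / n


# The values approach the asymptotic breakdown point 1/2 from below.
for n in (1, 2, 3, 4, 5, 10, 100, 1001):
    print(n, median_breakdown_point(n))
```

For small samples the gap to 1/2 is visible, for example at $n = 5$, and it shrinks at the rate $1/n$ suggested by the expansion above.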

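The HL estimator is defined as the median of all pairwise averages of the sample observations and is given by

$$\operatorname{median}\left(\frac{X_i + X_j}{2}\right).$$

Note that the median of all pairwise averages can be calculated for $i < j$, for $i \le j$, or for all $(i, j)$. We denote these three versions as

$$\mathrm{HL1} = \operatorname*{median}_{i<j}\left(\frac{X_i + X_j}{2}\right), \quad \mathrm{HL2} = \operatorname*{median}_{i \le j}\left(\frac{X_i + X_j}{2}\right), \quad \text{and} \quad \mathrm{HL3} = \operatorname*{median}_{\forall (i,j)}\left(\frac{X_i + X_j}{2}\right),$$

respectively. In what follows, we first derive the breakdown point for the HL3 and then use a similar approach to derive the breakdown points for the HL1 and HL2.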
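The three versions of the HL estimator differ only in which index pairs enter the median. A minimal sketch using the Python standard library is given below; the function names `hl1`, `hl2` and `hl3` are illustrative choices for this example, and the brute-force enumeration of all pairs is an assumption made for clarity rather than efficiency.

```python
from itertools import combinations, combinations_with_replacement, product
from statistics import median


def hl1(x):
    """Median of the Walsh averages over pairs with i < j (requires n >= 2)."""
    return median((a + b) / 2 for a, b in combinations(x, 2))


def hl2(x):
    """Median of the Walsh averages over pairs with i <= j."""
    return median((a + b) / 2 for a, b in combinations_with_replacement(x, 2))


def hl3(x):
    """Median of the Walsh averages over all n^2 ordered pairs (i, j)."""
    return median((a + b) / 2 for a, b in product(x, repeat=2))


sample = [1.0, 2.0, 3.0, 4.0, 100.0]  # one gross outlier
print(hl1(sample), hl2(sample), hl3(sample))
```

With a single outlier among five observations, all three versions stay close to the bulk of the data, whereas the sample mean of this sample is 22.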

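Suppose that we make $k$ of the $n$ observations arbitrarily large with $0 \le k \le n$. Notice that there are $n^2$ paired average terms (so-called Walsh averages) in the HL3 estimator: $(X_i + X_j)/2$, where $i, j = 1, 2, \ldots, n$. Because the HL3 estimator is the median of these $n^2$ values, the finite-sample breakdown point cannot be greater than $\lfloor (n^2-1)/2 \rfloor / n^2$ due to equation (1). If we make $k$ of the observations arbitrarily large, then the number of arbitrarily large Walsh averages becomes $n^2 - (n-k)^2$. These two facts provide the following relationship

$$\frac{n^2 - (n-k)^2}{n^2} \le \frac{\lfloor (n^2-1)/2 \rfloor}{n^2},$$

which is equivalent to $k^2 - 2nk + \lfloor (n^2-1)/2 \rfloor \ge 0$. The finite-sample breakdown point of the HL3 is then given by $\epsilon_n = k^*/n$, where

$$k^* = \max\left\{k \in \mathbb{N} : k^2 - 2nk + \lfloor (n^2-1)/2 \rfloor \ge 0 \ \text{and} \ 0 \le k \le n \right\}. \tag{2}$$

To obtain an explicit formula for (2), we let $f(x) = x^2 - 2nx + \lfloor (n^2-1)/2 \rfloor$. Since $f'(x) = 2x - 2n < 0$ for $x < n$, $f(x)$ is decreasing for $x < n$. The roots of $f(x) = 0$ are given by $n - \sqrt{n^2 - \lfloor (n^2-1)/2 \rfloor}$ and $n + \sqrt{n^2 - \lfloor (n^2-1)/2 \rfloor}$. Since $k^*$ is an integer and $0 \le k^* \le n$, we have $k^* \le n - \sqrt{n^2 - \lfloor (n^2-1)/2 \rfloor}$, that is, $k^* = \left\lfloor n - \sqrt{n^2 - \lfloor (n^2-1)/2 \rfloor} \right\rfloor$. Then we have the closed-form finite-sample breakdown point of the HL3

$$\epsilon_n = \frac{\left\lfloor n - \sqrt{n^2 - \lfloor (n^2-1)/2 \rfloor} \right\rfloor}{n}. \tag{3}$$

The asymptotic breakdown point of the HL3 is given by $\epsilon = \lim_{n \to \infty} \epsilon_n$. Using $\lfloor x \rfloor = x - \delta$ where $0 \le \delta < 1$, we can rewrite (3) as

$$\epsilon_n = \frac{n - \sqrt{n^2 - (n^2-1)/2 + \delta_1} - \delta_2}{n},$$

where $0 \le \delta_1 < 1$ and $0 \le \delta_2 < 1$. Thus, we have $\epsilon = 1 - 1/\sqrt{2} \approx 0.293$.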

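In the case of the HL1 estimator, there are $n(n-1)/2$ Walsh averages. Since the HL1 estimator is the median of these Walsh averages, the finite-sample breakdown point cannot be greater than $\lfloor \{n(n-1)/2 - 1\}/2 \rfloor / \{n(n-1)/2\}$ due to equation (1) again. If we make $k$ observations arbitrarily large with $0 \le k \le n$, then there are $n(n-1)/2 - (n-k)(n-k-1)/2$ arbitrarily large Walsh averages. Thus, the following inequality holds

$$\frac{n(n-1)/2 - (n-k)(n-k-1)/2}{n(n-1)/2} \le \frac{1}{n(n-1)/2} \left\lfloor \frac{n(n-1)/2 - 1}{2} \right\rfloor,$$

which is equivalent to $k^2 - (2n-1)k + 2\lfloor (n^2-n-2)/4 \rfloor \ge 0$. In a similar way as done for equation (2), we let $k^*$ be the largest integer satisfying the above with $0 \le k^* \le n$. For convenience, we let $g(x) = x^2 - (2n-1)x + 2\lfloor (n^2-n-2)/4 \rfloor$. Then $g(x)$ is decreasing for $x < n - 1/2$ due to $g'(x) = 2x - (2n-1) < 0$, and the roots of $g(x) = 0$ are given by $n - 1/2 \pm \sqrt{(n-1/2)^2 - 2\lfloor (n^2-n-2)/4 \rfloor}$. Thus, using an argument similar to that used for the HL3 case, we have $k^* = \left\lfloor n - 1/2 - \sqrt{(n-1/2)^2 - 2\lfloor (n^2-n-2)/4 \rfloor} \right\rfloor$. Then we have the closed-form finite-sample breakdown point of the HL1

$$\epsilon_n = \frac{\left\lfloor n - 1/2 - \sqrt{(n-1/2)^2 - 2\lfloor (n^2-n-2)/4 \rfloor} \right\rfloor}{n}. \tag{4}$$

It should be noted that we also have $\epsilon = \lim_{n \to \infty} \epsilon_n = 1 - 1/\sqrt{2}$.

Similar to the case of the HL1, we obtain that the closed-form finite-sample breakdown point of the HL2 estimator is given by

$$\epsilon_n = \frac{\left\lfloor n + 1/2 - \sqrt{(n+1/2)^2 - 2\lfloor (n^2+n-2)/4 \rfloor} \right\rfloor}{n}. \tag{5}$$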
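The closed forms (3), (4) and (5) can be cross-checked against a direct search for the largest admissible number of contaminated observations. The sketch below assumes the quadratic inequalities obtained by counting contaminated Walsh averages, as in the derivation above; the names `bp_hl1`, `bp_hl2`, `bp_hl3` and `bp_by_search` are hypothetical helpers for this illustration.

```python
import math


def bp_hl3(n: int) -> float:
    """Equation (3): HL3 uses all n^2 pairs (i, j)."""
    c = (n * n - 1) // 2
    return math.floor(n - math.sqrt(n * n - c)) / n


def bp_hl1(n: int) -> float:
    """Equation (4): HL1 uses the n(n-1)/2 pairs with i < j (requires n >= 2)."""
    c = (n * n - n - 2) // 4
    return math.floor(n - 0.5 - math.sqrt((n - 0.5) ** 2 - 2 * c)) / n


def bp_hl2(n: int) -> float:
    """Equation (5): HL2 uses the n(n+1)/2 pairs with i <= j."""
    c = (n * n + n - 2) // 4
    return math.floor(n + 0.5 - math.sqrt((n + 0.5) ** 2 - 2 * c)) / n


def bp_by_search(n: int, version: int) -> float:
    """Largest k in {0, ..., n} with k^2 - b*k + c >= 0, divided by n."""
    if version == 1:
        b, c = 2 * n - 1, 2 * ((n * n - n - 2) // 4)
    elif version == 2:
        b, c = 2 * n + 1, 2 * ((n * n + n - 2) // 4)
    else:
        b, c = 2 * n, (n * n - 1) // 2
    return max(k for k in range(n + 1) if k * k - b * k + c >= 0) / n


for n in (2, 4, 5, 10, 50, 100):
    print(n, bp_hl1(n), bp_hl2(n), bp_hl3(n))
```

The search version uses only integer arithmetic, so agreement with the closed forms over a range of $n$ is a reasonable sanity check on the floating-point square roots.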

### 2.2 Robust scale estimators

For robust scale estimation, we consider the MAD (Hampel, 1974) and the Shamos estimator (Shamos, 1976). The MAD is given by

$$\mathrm{MAD} = \frac{\operatorname*{median}_{1 \le i \le n} |X_i - \tilde{\mu}|}{\Phi^{-1}(3/4)}, \tag{6}$$

where $\tilde{\mu} = \operatorname{median}(X_1, \ldots, X_n)$ and $\Phi^{-1}(3/4)$ is needed to make this estimator consistent under the normal distribution (Rousseeuw and Croux, 1993). This resembles the median and its finite-sample breakdown point is the same as that of the median in (1). The Shamos estimator is given by

$$\mathrm{SH} = \frac{\operatorname*{median}_{i<j} |X_i - X_j|}{\sqrt{2}\,\Phi^{-1}(3/4)}, \tag{7}$$

where $\sqrt{2}\,\Phi^{-1}(3/4)$ is needed to make this estimator consistent under the normal distribution (Lévy-Leduc et al., 2011).

Of particular note is that the Shamos estimator resembles the HL1 estimator, with the Walsh averages replaced by pairwise differences. Thus, its finite-sample breakdown point is the same as that of the HL1 estimator in (4). In the case of the HL estimator, the median is calculated for $i < j$, $i \le j$, and all $(i, j)$, but the median in the Shamos estimator is calculated only for $i < j$ because $|X_i - X_j| = 0$ for $i = j$. Note that the MAD and Shamos estimators are in a closed form and are scale-equivariant in the sense that $\hat{\sigma}(aX_1 + b, \ldots, aX_n + b) = |a|\,\hat{\sigma}(X_1, \ldots, X_n)$. In Table 1, we provide the finite-sample breakdown points of the estimators considered in this paper. Also, we provide the plots of these values in Figure 1.
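A minimal sketch of the two scale estimators is shown below. It assumes the usual normal-consistency constants $\Phi^{-1}(3/4)$ for the MAD and $\sqrt{2}\,\Phi^{-1}(3/4)$ for the Shamos estimator; the function names `mad` and `shamos` are illustrative, and the normal quantile is taken from the standard library's `statistics.NormalDist`.

```python
from itertools import combinations
from statistics import NormalDist, median

# Phi^{-1}(3/4), the 0.75 quantile of the standard normal (about 0.6745)
Q75 = NormalDist().inv_cdf(0.75)


def mad(x):
    """Median absolute deviation about the median, scaled for normal consistency."""
    center = median(x)
    return median(abs(v - center) for v in x) / Q75


def shamos(x):
    """Median of pairwise absolute differences (i < j), scaled for normal consistency."""
    return median(abs(a - b) for a, b in combinations(x, 2)) / (2 ** 0.5 * Q75)


sample = [1.0, 2.0, 3.0, 4.0, 100.0]
print(mad(sample), shamos(sample))

# Scale equivariance: an affine map a*x + b rescales both estimates by |a|.
shifted = [-3.0 * v + 7.0 for v in sample]
print(mad(shifted) / mad(sample), shamos(shifted) / shamos(sample))
```

Both ratios in the last line should be close to 3, which illustrates the scale-equivariance property stated above.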

## 3 Empirical biases

As mentioned above, the MAD in (6) and the Shamos estimator in (7) are normally consistent, that is, as the sample size goes to infinity, they converge to the standard deviation $\sigma$ under the normal distribution, $N(\mu, \sigma^2)$. However, when the sample size is small, they have serious biases. In this section, we obtain the unbiasing factors for the MAD and Shamos estimators through extensive Monte Carlo simulation. It deserves mentioning that the location estimators such as the median and the Hodges-Lehmann estimator have no bias under the normal distribution.

For this simulation, we generated a sample of size $n$ from the standard normal distribution, $N(0, 1)$, and calculated the MAD and Shamos estimates. We repeated this simulation ten million times ($10^7$ replications) to obtain the empirical biases of these two estimators. In Table 2, we provide the empirical biases for sample sizes up to $n = 100$. We also provide the plot of these empirical biases in Figure 2. Using these biases, we can easily obtain the unbiasing factors as follows. For convenience, let $A_n$ be the empirical bias of the MAD. Then $1 + A_n$ is the unbiasing factor and thus an empirically unbiased MAD is given by

$$\frac{\mathrm{MAD}}{1 + A_n}.$$

Similarly, an empirically unbiased Shamos estimator is given by

$$\frac{\mathrm{SH}}{1 + B_n},$$

where $B_n$ is the empirical bias of the Shamos estimator.
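The simulation just described can be reproduced on a much smaller scale. The sketch below is an illustration only: it uses 20,000 replications instead of ten million, a fixed seed, and hypothetical helper names (`mad`, `shamos`, `empirical_bias`), so its output will only roughly match the tabulated values.

```python
import random
from itertools import combinations
from statistics import NormalDist, fmean, median

Q75 = NormalDist().inv_cdf(0.75)  # Phi^{-1}(3/4)


def mad(x):
    center = median(x)
    return median(abs(v - center) for v in x) / Q75


def shamos(x):
    return median(abs(a - b) for a, b in combinations(x, 2)) / (2 ** 0.5 * Q75)


def empirical_bias(estimator, n, reps=20000, seed=1):
    """Monte Carlo estimate of E[estimator] - sigma for N(0, 1) samples (sigma = 1)."""
    rng = random.Random(seed)
    estimates = (
        estimator([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(reps)
    )
    return fmean(estimates) - 1.0


A5 = empirical_bias(mad, 5)     # empirical bias of the MAD at n = 5
B5 = empirical_bias(shamos, 5)  # empirical bias of the Shamos estimator at n = 5
print(A5, B5)

# Dividing an estimate by one plus the empirical bias gives the corrected estimate.
sample = [0.3, -1.2, 0.8, 2.1, -0.4]
print(mad(sample) / (1 + A5), shamos(sample) / (1 + B5))
```

Increasing `reps` narrows the Monte Carlo error at the cost of run time; the pure-Python loop is chosen for self-containment, not speed.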

For the case when $n > 100$, we suggest estimating them as follows. Since the MAD in (6) and the Shamos estimator in (7) are normally consistent, $A_n$ and $B_n$ converge to zero as $n$ goes to infinity. For a large value of $n$, we suggest using the methods proposed by Hayes (2014) and Williams (2011). Hayes (2014) suggests a model of the form $A_n = a/n + b/n^2$ and Williams (2011) suggests a model of the form $A_n = a\,n^{-b}$, where $a$ and $b$ are coefficients to be estimated. Similarly, we can also estimate $B_n$ using models of the same two forms. To estimate these, we obtained additional empirical biases in Table 3 for selected sample sizes. Using the values for large $n$, we can obtain the least squares estimate given by

$$A_n = -\frac{0.76213}{n} - \frac{0.86413}{n^2}.$$

Also, we can obtain the least squares estimate using the method of Williams (2011) after the logarithm transformation, which is given by

$$A_n = -0.804168866 \cdot n^{-1.008922}.$$

Note that Hayes (2014) and Williams (2011) estimated $A_n$ for the case of odd and even values of $n$ separately. However, for a large value of $n$, the gain in precision may not be noticeable, as Figure 2 shows that there is no noticeable difference between the cases of odd and even values of $n$.

We can also obtain the least squares estimates of $B_n$ using the methods of Hayes (2014) and Williams (2011) for a large value of $n$ as follows

$$B_n = \frac{0.414253297}{n} + \frac{0.442396799}{n^2} \quad \text{and} \quad B_n = 0.435760656 \cdot n^{-1.0084443},$$

respectively. In Table 3, we provide the estimated biases of the MAD and Shamos estimators. These results show that the estimated biases are accurate up to the fourth decimal place. Also, there is no noticeable difference between the two estimates based on Hayes (2014) and Williams (2011).
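The fitted formulas above are straightforward to evaluate for sample sizes beyond 100. The following sketch simply codes the four least squares fits reported in this section; the function names are hypothetical, and the fits are intended for large $n$ only.

```python
def bias_mad_hayes(n):
    """Predicted MAD bias A_n, fitted model in the form of Hayes (2014)."""
    return -0.76213 / n - 0.86413 / n ** 2


def bias_mad_williams(n):
    """Predicted MAD bias A_n, fitted model in the form of Williams (2011)."""
    return -0.804168866 * n ** (-1.008922)


def bias_shamos_hayes(n):
    """Predicted Shamos bias B_n, fitted model in the form of Hayes (2014)."""
    return 0.414253297 / n + 0.442396799 / n ** 2


def bias_shamos_williams(n):
    """Predicted Shamos bias B_n, fitted model in the form of Williams (2011)."""
    return 0.435760656 * n ** (-1.0084443)


for n in (101, 200, 500, 1000):
    print(
        n,
        bias_mad_hayes(n),
        bias_mad_williams(n),
        bias_shamos_hayes(n),
        bias_shamos_williams(n),
    )
```

A bias-corrected estimate for a large sample is then the raw MAD or Shamos estimate divided by one plus the predicted bias, for example `estimate / (1 + bias_mad_hayes(n))`.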

It is well known that the sample standard deviation $S$ is not unbiased under the normal distribution. To make it unbiased, the unbiasing factor $c_4(n)$ is widely used so that $S/c_4(n)$ is unbiased. We suggest using the notations $c_5(n)$ and $c_6(n)$ for the unbiasing factors of the MAD and Shamos estimators, respectively. Then we can obtain the unbiased MAD and Shamos estimators for any value of $n$ given by

$$\frac{\mathrm{MAD}}{c_5(n)} \quad \text{and} \quad \frac{\mathrm{SH}}{c_6(n)}, \tag{8}$$

where $c_5(n) = 1 + A_n$ and $c_6(n) = 1 + B_n$.

## 4 Empirical variances

In this section, through extensive Monte Carlo simulation, we calculate the finite-sample variances of the median, HL, MAD and Shamos estimators under the standard normal distribution. We generated a sample of size $n$ from the standard normal distribution and calculated the estimates for a given value of $n$. We repeated this simulation ten million times ($10^7$ replications) to obtain the empirical variance of each estimator for each sample size up to $n = 100$.

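It should be noted that the values of the asymptotic relative efficiency (ARE) of various estimators are known. Here the ARE is defined as

$$\mathrm{ARE}(\hat{\theta}_2 \mid \hat{\theta}_1) = \lim_{n \to \infty} \mathrm{RE}(\hat{\theta}_2 \mid \hat{\theta}_1), \tag{9}$$

where

$$\mathrm{RE}(\hat{\theta}_2 \mid \hat{\theta}_1) = \frac{\mathrm{Var}(\hat{\theta}_1)}{\mathrm{Var}(\hat{\theta}_2)} \tag{10}$$

and $\hat{\theta}_1$ is often a reference or baseline estimator. For example, under the normal distribution, we have $\mathrm{ARE}(\mathrm{median} \mid \bar{X}) = 2/\pi \approx 0.637$, $\mathrm{ARE}(\mathrm{HL} \mid \bar{X}) = 3/\pi \approx 0.955$, $\mathrm{ARE}(\mathrm{MAD} \mid S) \approx 0.37$, and $\mathrm{ARE}(\mathrm{Shamos} \mid S) \approx 0.863$, where $\bar{X}$ is the sample mean and $S$ is the sample standard deviation. For more details, see Serfling (2011) and Lévy-Leduc et al. (2011).

Note that with a random sample of size $n$ from the standard normal distribution, we have $\mathrm{Var}(\bar{X}) = 1/n$ and $\mathrm{Var}(S) = 1 - c_4(n)^2$, where $c_4(n) = \sqrt{2/(n-1)}\;\Gamma(n/2)/\Gamma((n-1)/2)$. Thus, we have $n\,\mathrm{Var}(\mathrm{median}) \approx 1.57$, $n\,\mathrm{Var}(\mathrm{HL}) \approx 1.0472$, $\mathrm{Var}(\mathrm{MAD})/\{1 - c_4(n)^2\} \approx 2.7027$, and $\mathrm{Var}(\mathrm{Shamos})/\{1 - c_4(n)^2\} \approx 1.15875$ for a large value of $n$. We provide these values for each $n$ in Tables 4 and 5. In Figure 3, we also plot these values.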
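The baseline variances used in the relative efficiencies can be computed exactly. The sketch below assumes the standard expression $c_4(n) = \sqrt{2/(n-1)}\,\Gamma(n/2)/\Gamma((n-1)/2)$ for the unbiasing factor of the sample standard deviation; the function names are illustrative, and `math.lgamma` is used to avoid overflow of the gamma function for large $n$.

```python
import math


def c4(n: int) -> float:
    """Unbiasing factor of the sample standard deviation under normality (n >= 2)."""
    if n < 2:
        raise ValueError("c4(n) requires n >= 2")
    log_ratio = math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
    return math.sqrt(2.0 / (n - 1)) * math.exp(log_ratio)


def var_mean(n: int) -> float:
    """Variance of the sample mean for N(0, 1) data."""
    return 1.0 / n


def var_sd(n: int) -> float:
    """Variance of the sample standard deviation for N(0, 1) data: 1 - c4(n)^2."""
    return 1.0 - c4(n) ** 2


for n in (2, 5, 10, 100):
    print(n, c4(n), var_mean(n), var_sd(n))
```

Given an empirical variance of a robust estimator at sample size $n$, its relative efficiency follows from (10) by dividing `var_mean(n)` or `var_sd(n)` by that variance.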

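For the case when $n > 100$, we suggest estimating these values based on Hayes (2014) or Williams (2011), as we did for the biases in the previous section. We suggest the following models to obtain these values for $n > 100$:

$$\begin{aligned}
n\,\mathrm{Var}(\mathrm{median}) &= 1.57 + \frac{a_1}{n} + \frac{a_2}{n^2}, \\
n\,\mathrm{Var}(\mathrm{HL}) &= 1.0472 + \frac{a_3}{n} + \frac{a_4}{n^2}, \\
\frac{\mathrm{Var}(\mathrm{MAD})}{1 - c_4(n)^2} &= 2.7027 + \frac{a_5}{n} + \frac{a_6}{n^2}, \quad \text{and} \\
\frac{\mathrm{Var}(\mathrm{Shamos})}{1 - c_4(n)^2} &= 1.15875 + \frac{a_7}{n} + \frac{a_8}{n^2}.
\end{aligned}$$

One can also use the method based on Williams (2011). For brevity, we used the method based on Hayes (2014). To estimate these values for $n > 100$, we obtained the empirical REs in Table 6 for selected sample sizes. Notice that Figure 3 indicates that it is reasonable to estimate the values for the median and MAD separately for odd and even values of $n$. Using the large values of $n$, we can estimate the above coefficients. For this, we use the simulation results in Tables 5 and 6. Then the least squares estimates based on the method of Hayes (2014) are given by

$$\begin{aligned}
n\,\mathrm{Var}(\mathrm{median}) &= 1.57 - \frac{0.6589}{n} - \frac{0.943}{n^2} && (\text{for odd } n), \\
n\,\mathrm{Var}(\mathrm{median}) &= 1.57 - \frac{2.1950}{n} + \frac{1.929}{n^2} && (\text{for even } n), \\
n\,\mathrm{Var}(\mathrm{HL1}) &= 1.0472 + \frac{0.1127}{n} + \frac{0.8365}{n^2}, \\
n\,\mathrm{Var}(\mathrm{HL2}) &= 1.0472 + \frac{0.2923}{n} + \frac{0.2258}{n^2}, \\
n\,\mathrm{Var}(\mathrm{HL3}) &= 1.0472 + \frac{0.2022}{n} + \frac{0.4343}{n^2}, \\
\frac{\mathrm{Var}(\mathrm{MAD})}{1 - c_4(n)^2} &= 2.7027 + \frac{0.2996}{n} - \frac{149.357}{n^2} && (\text{for odd } n), \\
\frac{\mathrm{Var}(\mathrm{MAD})}{1 - c_4(n)^2} &= 2.7027 - \frac{2.417}{n} - \frac{153.010}{n^2} && (\text{for even } n), \quad \text{and} \\
\frac{\mathrm{Var}(\mathrm{Shamos})}{1 - c_4(n)^2} &= 1.15875 + \frac{2.822}{n} + \frac{12.238}{n^2}.
\end{aligned}$$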
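For $n > 100$, the fitted models translate directly into predicted relative efficiencies: since $n\,\mathrm{Var}(\bar{X}) = 1$, the RE of a location estimator with respect to the sample mean is the reciprocal of its fitted $n\,\mathrm{Var}$, and the RE of a scale estimator with respect to $S$ is the reciprocal of its fitted variance ratio. The sketch below codes the reported coefficients; the function names are hypothetical, and the fits are meant for large $n$ only.

```python
def n_var_median(n):
    """Fitted n * Var(median), with separate coefficients for odd and even n."""
    if n % 2 == 1:
        return 1.57 - 0.6589 / n - 0.943 / n ** 2
    return 1.57 - 2.1950 / n + 1.929 / n ** 2


def n_var_hl(n, version=1):
    """Fitted n * Var(HL) for HL1, HL2 or HL3."""
    a, b = {1: (0.1127, 0.8365), 2: (0.2923, 0.2258), 3: (0.2022, 0.4343)}[version]
    return 1.0472 + a / n + b / n ** 2


def ratio_mad(n):
    """Fitted Var(MAD) / (1 - c4(n)^2), with separate coefficients for odd and even n."""
    if n % 2 == 1:
        return 2.7027 + 0.2996 / n - 149.357 / n ** 2
    return 2.7027 - 2.417 / n - 153.010 / n ** 2


def ratio_shamos(n):
    """Fitted Var(Shamos) / (1 - c4(n)^2)."""
    return 1.15875 + 2.822 / n + 12.238 / n ** 2


def relative_efficiency(fitted_ratio):
    """RE with respect to the baseline (sample mean or S), by equation (10)."""
    return 1.0 / fitted_ratio


for n in (101, 200, 501, 1000):
    print(
        n,
        relative_efficiency(n_var_median(n)),
        relative_efficiency(n_var_hl(n, 1)),
        relative_efficiency(ratio_mad(n)),
        relative_efficiency(ratio_shamos(n)),
    )
```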

In Tables 7 and 8, we also calculated the REs of the aforementioned estimators for $n \le 100$ using the above empirical variances. For $n > 100$, one can also easily obtain the REs using the above estimated variances. It should be noted that the REs of the median and HL estimators are one for $n = 1, 2$. When $n = 1, 2$, the median and the HL estimators are essentially the same as the sample mean. Note that the HL1 is not available for $n = 1$.

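Another noticeable result is that the RE of the HL1 is exactly one when $n = 4$. When $n = 4$, the HL1 is the median of the six Walsh averages $(X_1+X_2)/2$, $(X_1+X_3)/2$, $(X_1+X_4)/2$, $(X_2+X_3)/2$, $(X_2+X_4)/2$, and $(X_3+X_4)/2$. This is the same as the median of $(X_{(1)}+X_{(2)})/2$, $(X_{(1)}+X_{(3)})/2$, $(X_{(1)}+X_{(4)})/2$, $(X_{(2)}+X_{(3)})/2$, $(X_{(2)}+X_{(4)})/2$, and $(X_{(3)}+X_{(4)})/2$, where $X_{(i)}$ are the order statistics. Because $(X_{(1)}+X_{(2)})/2 \le (X_{(1)}+X_{(3)})/2 \le (X_{(1)}+X_{(4)})/2 \le (X_{(2)}+X_{(4)})/2 \le (X_{(3)}+X_{(4)})/2$ and $(X_{(1)}+X_{(3)})/2 \le (X_{(2)}+X_{(3)})/2 \le (X_{(2)}+X_{(4)})/2$, the two middle values are $(X_{(1)}+X_{(4)})/2$ and $(X_{(2)}+X_{(3)})/2$, and we have

$$\mathrm{HL1} = \frac{1}{2}\left(\frac{X_{(1)}+X_{(4)}}{2} + \frac{X_{(2)}+X_{(3)}}{2}\right) = \frac{X_1+X_2+X_3+X_4}{4} = \bar{X}.$$

Thus, the RE of the HL1 should be one. In this case, as expected, the finite-sample breakdown point is zero as provided in Table 1.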
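The identity between the HL1 and the sample mean at $n = 4$ is easy to confirm numerically. The following sketch draws random samples of size four and records the largest discrepancy between the two; the helper name `hl1` and the seed are choices made for this illustration.

```python
import random
from itertools import combinations
from statistics import fmean, median


def hl1(x):
    """Median of the Walsh averages over pairs with i < j."""
    return median((a + b) / 2 for a, b in combinations(x, 2))


rng = random.Random(0)
max_gap = 0.0
for _ in range(1000):
    x = [rng.gauss(0.0, 1.0) for _ in range(4)]
    # For n = 4 the median of the six Walsh averages equals the sample mean.
    max_gap = max(max_gap, abs(hl1(x) - fmean(x)))

print(max_gap)  # differs from zero only by floating-point rounding
```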

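It should be noted that $\mathrm{MAD}/c_5(n)$ and $\mathrm{SH}/c_6(n)$, with the unbiasing factors $c_5(n) = 1 + A_n$ and $c_6(n) = 1 + B_n$ of Section 3, are unbiased for $\sigma$ under the normal distribution, but their squared values are not unbiased for $\sigma^2$. Using the empirical and estimated variances, we can obtain the unbiased versions as follows. For convenience, we denote $v_5(n) = \mathrm{Var}(\mathrm{MAD})$ and $v_6(n) = \mathrm{Var}(\mathrm{SH})$, where the variances are obtained using a sample of size $n$ from the standard normal distribution as mentioned earlier. Since the MAD and Shamos estimators are scale-equivariant, we have $\mathrm{Var}(\mathrm{MAD}) = v_5(n)\,\sigma^2$ and $\mathrm{Var}(\mathrm{SH}) = v_6(n)\,\sigma^2$ with a sample from the normal distribution $N(\mu, \sigma^2)$. It is immediate from (8) that $E(\mathrm{MAD}) = c_5(n)\,\sigma$ and $E(\mathrm{SH}) = c_6(n)\,\sigma$. Considering $E(X^2) = \mathrm{Var}(X) + \{E(X)\}^2$, we have $E(\mathrm{MAD}^2) = \{v_5(n) + c_5(n)^2\}\,\sigma^2$ and $E(\mathrm{SH}^2) = \{v_6(n) + c_6(n)^2\}\,\sigma^2$. Thus, the following estimators are unbiased for $\sigma^2$ under the normal distribution

$$\frac{\mathrm{MAD}^2}{v_5(n) + c_5(n)^2} \quad \text{and} \quad \frac{\mathrm{SH}^2}{v_6(n) + c_6(n)^2}.$$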

## 5 Concluding remarks

In this paper, we studied the finite-sample properties of the sample median and Hodges-Lehmann estimators for location and the sample median absolute deviation and Shamos estimators for scale. We first obtained closed-form finite-sample breakdown points for these robust location and scale estimators. We then calculated the unbiasing factors and relative efficiencies of the MAD and the Shamos estimators for the scale parameter through extensive Monte Carlo simulations for sample sizes up to 100. The numerical study showed that the unbiasing factor significantly improves the finite-sample performance. In addition, we provided predicted values of the unbiasing factors, obtained by the least squares method, which can be used for sample sizes greater than 100. To facilitate the implementation of the proposed method, we developed an R package, which will be available at the author’s personal web page.

## Acknowledgment

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (No. NRF-2017R1A2B4004169).

## References

• Donoho, D. and Huber, P. J. (1983). The notion of breakdown point. In A Festschrift for Erich L. Lehmann, Wadsworth Statist./Probab. Ser., pages 157–184. Wadsworth, Belmont, CA.
• Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69:383–393.
• Hayes, K. (2014). Finite-sample bias-correction factors for the median absolute deviation. Communications in Statistics: Simulation and Computation, 43:2205–2212.
• Hodges, J. L. and Lehmann, E. L. (1963). Estimates of location based on rank tests. Annals of Mathematical Statistics, 34:598–611.
• Lévy-Leduc, C., Boistard, H., Moulines, E., Taqqu, M. S., and Reisen, V. A. (2011). Large sample behaviour of some well-known robust estimators under long-range dependence. Statistics, 45:59–71.
• Rousseeuw, P. and Croux, C. (1993). Alternatives to the median absolute deviation. Journal of the American Statistical Association, 88:1273–1283.
• Serfling, R. J. (2011). Asymptotic relative efficiency in estimation. In Lovric, M., editor, Encyclopedia of Statistical Science, Part I, pages 68–82. Springer-Verlag, Berlin.
• Shamos, M. I. (1976). Geometry and statistics: Problems at the interface. In Traub, J. F., editor, Algorithms and Complexity: New Directions and Recent Results, pages 251–280. Academic Press, New York.
• Williams, D. C. (2011). Finite sample correction factors for several simple robust estimators of normal standard deviation. Journal of Statistical Computation and Simulation, 81:1697–1702.