    Optimally estimating the sample mean and standard deviation from the five-number summary

When reporting the results of clinical studies, some researchers may choose the five-number summary (including the sample median, the first and third quartiles, and the minimum and maximum values) rather than the sample mean and standard deviation, particularly for skewed data. When such studies are included in a meta-analysis, it is often desirable to convert the five-number summary back to the sample mean and standard deviation. Several methods have been proposed for this purpose in the recent literature, and they are increasingly used. In this paper, we further advance the literature by developing a smoothly weighted estimator for the sample standard deviation that fully utilizes the sample size information. For ease of implementation, we also derive an approximation formula for the optimal weight, as well as a shortcut formula for the sample standard deviation. Numerical results show that our new estimator provides a more accurate estimate for normal data and also performs favorably for non-normal data. Together with the optimal sample mean estimator in Luo et al., our new methods have dramatically improved the existing methods for data transformation, and they are capable of serving as "rules of thumb" in meta-analysis for studies reported with the five-number summary. Finally, for practical use, an Excel spreadsheet and an online calculator are also provided for implementing our optimal estimators.

1 Introduction

Meta-analysis has become increasingly popular in the past several decades, mainly owing to its wide range of applications in evidence-based medicine.25 To statistically combine data from multiple independent studies, researchers first need to conduct a systematic review and extract the summary data from the clinical studies in the literature. For continuous outcomes, e.g., blood pressure and the amount of alcohol consumed, the sample mean and standard deviation (SD) are the most commonly used summary statistics for evaluating the effectiveness of a certain medicine or treatment. For skewed data, however, the five-number summary (including the sample median, the first and third quartiles, and the minimum and maximum values) has also been frequently reported in the literature. To the best of our knowledge, there are few methods available in meta-analysis that can incorporate the sample median and the sample mean simultaneously. As an example, when applying the fixed-effect model or the random-effects model, the sample mean and SD are required quantities for computing the overall effect.6

This raises a natural question: when performing a meta-analysis, how should one deal with clinical studies in which the five-number summary was reported rather than the sample mean and SD? In the early stages, researchers often excluded such studies from further analysis by labeling them as “studies with no sufficient data” in the flow chart of study selection. Such an approach is, however, often suboptimal as it excludes valuable information from the literature. Consequently, the final results are less reliable, in particular when a large proportion of the included studies report the five-number summary. There is therefore an increasing demand for new methods that are able to convert the five-number summary back to the sample mean and SD. For ease of notation, let $\{a, q_1, m, q_3, b\}$ denote the five-number summary, where $a$ is the sample minimum, $q_1$ is the first quartile, $m$ is the sample median, $q_3$ is the third quartile, and $b$ is the sample maximum of the data. We also let $n$ be the sample size in the study.

Note that the five-number summary may not be fully reported in clinical studies. In the special case where only $\{a, m, b\}$ is reported, Hozo et al.7 were among the first to estimate the sample mean and SD. It is noted, however, that Hozo et al.7 did not make sufficient use of the sample size information, so that their estimators are either biased or non-smooth. Inspired by this, Wan et al.8 and Luo et al.1 further improved the existing methods by proposing nearly unbiased and optimal estimators with analytical formulas. In addition, we note that Walter and Yao 9 also provided a numerical solution for estimating the sample SD, although the lack of an analytical formula makes it less accessible to practitioners. In the other special case where only $\{q_1, m, q_3\}$ is reported, Wan et al.8 proposed a nearly unbiased estimator for the sample SD, and Luo et al.1 proposed an optimal estimator for the sample mean by fully using the sample size information. In Google Scholar as of 27 February 2020, Hozo et al.,7 Wan et al.8 and Luo et al.1 had been cited 3595, 1078 and 143 times, respectively. These papers have been attracting increasing attention and playing an important role in meta-analysis.

When $\{a, q_1, m, q_3, b\}$ is fully reported, Bland 10 extended Hozo et al.’s method to estimate the sample mean and SD from the five-number summary. As the two methods are essentially the same, the estimators in Bland 10 are also suboptimal, mainly because the sample size information is again not sufficiently used. To be more specific, the sample mean estimator in Bland 10 is

 $$\bar{X} \approx \frac{a + 2q_1 + 2m + 2q_3 + b}{8} = \frac{1}{4}\left(\frac{a+b}{2}\right) + \frac{1}{2}\left(\frac{q_1+q_3}{2}\right) + \frac{m}{4}.$$

According to Johnson and Kuby, 11 given that the data follow a symmetric distribution, the quantities $(a+b)/2$, $(q_1+q_3)/2$ and $m$ can each serve as an estimate of the sample mean. To obtain a final estimator, Bland 10 applied the artificial weights 1/4, 1/2 and 1/4 to the three components, respectively. That is, the first and third components are treated equally, and each of them is taken to be only half as reliable as the second component. As this is not always the case, to improve the sample mean estimation, Luo et al.1 proposed the optimal estimator as

 $$\bar{X} \approx w_{3,1}\left(\frac{a+b}{2}\right) + w_{3,2}\left(\frac{q_1+q_3}{2}\right) + (1 - w_{3,1} - w_{3,2})\,m, \tag{1}$$

where $w_{3,1}$ and $w_{3,2}$ are the optimal weights assigned to the respective components.
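As a rough illustration of the two mean estimators above, the following Python sketch evaluates Bland's fixed-weight form and the general weighted form in (1). The function names and the example numbers are hypothetical, and the optimal weights of Luo et al. are not given in this section, so they are left as caller-supplied arguments rather than computed.

```python
def bland_mean(a, q1, m, q3, b):
    """Fixed-weight estimate (a + 2*q1 + 2*m + 2*q3 + b) / 8."""
    return (a + 2 * q1 + 2 * m + 2 * q3 + b) / 8


def weighted_mean(a, q1, m, q3, b, w31, w32):
    """General form of estimator (1); w31 and w32 must be supplied by the caller."""
    mid_range = (a + b) / 2
    mid_quartile = (q1 + q3) / 2
    return w31 * mid_range + w32 * mid_quartile + (1 - w31 - w32) * m


# Made-up five-number summary, for illustration only.
summary = (10.0, 20.0, 30.0, 45.0, 80.0)
print(bland_mean(*summary))
# With weights 1/4 and 1/2 the weighted form reduces to Bland's estimate.
print(weighted_mean(*summary, w31=0.25, w32=0.5))
```

The second call only confirms that Bland's estimator is the special case with weights 1/4, 1/2 and 1/4; it says nothing about which weights are optimal.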

For the sample SD estimation from the five-number summary, Bland 10 also provided an estimator that follows the inequality method as in Hozo et al.7 Then Wan et al.8 proposed a nearly unbiased estimator of the sample SD to improve the literature. In Section 3, we will point out that the sample SD estimator in Wan et al.8 may still not be optimal due to the insufficient use of the sample size information. For more details, see the motivating example in Section 3. According to Higgins and Green 6 and Chen and Peace,12 the sample SD plays a crucial role in weighting the studies in meta-analysis. Inaccurate weighting results may lead to biased overall effect sizes and biased confidence intervals, and hence mislead physicians to provide patients with unreasonable or even wrong medications. Inspired by this, we propose a smoothly weighted estimator for the sample SD to further improve the existing literature. To promote the practical use, we have provided an Excel spreadsheet to implement the optimal estimators in the Supplementary Material. More importantly, we have also incorporated the new estimator in this paper into our online calculator at http://www.math.hkbu.edu.hk/~tongt/papers/median2mean.html. From the practical point of view, our proposed method will make a solid contribution to meta-analysis and has the potential to be widely used.

The rest of the paper is organized as follows. In Section 2, we review the existing methods for estimating the sample SD under three common scenarios. In Section 3, we present a motivating example, propose a smoothly weighted estimator for the sample SD, and derive a shortcut formula of our new estimator for practical use. In Section 4, we conduct numerical studies to assess the finite sample performance of our new estimator, and meanwhile we demonstrate its superiority over the existing methods. We then conclude the paper in Section 5 with a brief summary of the existing methods, and provide the theoretical results in Section 6.

2 Existing methods

Needless to say, the sample size provides important information and should be sufficiently used in the estimation procedure. To incorporate it with the five-number summary, we let $\mathcal{S}_1 = \{a, m, b; n\}$, $\mathcal{S}_2 = \{q_1, m, q_3; n\}$ and $\mathcal{S}_3 = \{a, q_1, m, q_3, b; n\}$, which represent the three common scenarios in the literature. Clearly, $\mathcal{S}_1$ and $\mathcal{S}_2$ are two special cases of $\mathcal{S}_3$. To further clarify, we also note that these three scenarios correspond to the three scenarios considered in Wan et al.8 In this section, we briefly review the existing estimators of the sample SD under the three scenarios.

2.1 Estimating the sample SD from $\mathcal{S}_1=\{a,m,b;n\}$

Under scenario $\mathcal{S}_1$, Hozo et al.7 proposed to estimate the sample SD by

 $$S \approx \begin{cases} \dfrac{1}{\sqrt{12}}\left[(b-a)^2 + \dfrac{(a-2m+b)^2}{4}\right]^{1/2}, & n \le 15, \\[2ex] \dfrac{b-a}{4}, & 70 \ge n > 15, \\[2ex] \dfrac{b-a}{6}, & n > 70. \end{cases}$$

As shown in Google Scholar, Hozo et al.’s estimator has been very popular in the previous literature due to the huge demand for data transformation in evidence-based medicine. We note, however, that their estimator is a step function of the sample size, so that the final estimate may not be optimal. For example, when $n$ increases from 70 to 71, there is a sudden drop in the estimated value of the sample SD, namely a drop of 33.3% if the 71st observation is not an extreme value, so that $a$ and $b$ remain the same in both samples. On the other hand, the sample size information is completely ignored within each of the three intervals, so that their estimator is biased and suboptimal.
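The discontinuity described above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the three-piece form of the estimator with cut-offs at 15 and 70 and divisors 4 and 6 (the form consistent with the 33.3% drop mentioned in the text); the function name and the example summary are hypothetical.

```python
import math


def hozo_sd(a, m, b, n):
    """Step-function SD estimate from the minimum a, median m, maximum b and size n."""
    if n <= 15:
        return math.sqrt(((b - a) ** 2 + (a - 2 * m + b) ** 2 / 4) / 12)
    if n <= 70:
        return (b - a) / 4
    return (b - a) / 6


# Made-up summary: only n changes between the two calls.
sd_70 = hozo_sd(20.0, 50.0, 90.0, 70)
sd_71 = hozo_sd(20.0, 50.0, 90.0, 71)
relative_drop = (sd_70 - sd_71) / sd_70
print(sd_70, sd_71, round(relative_drop, 3))
```

Moving from the divisor 4 to the divisor 6 lowers the estimate by exactly one third, whatever the values of the minimum and maximum.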

In view of the above limitations, Wan et al.8 proposed a nearly unbiased estimator of the sample SD as

 $$S \approx \frac{b-a}{\xi}, \tag{2}$$

where $\xi = \xi(n) = 2\Phi^{-1}\left(\frac{n-0.375}{n+0.25}\right)$, $\Phi$ is the cumulative distribution function of the standard normal distribution, and $\Phi^{-1}$ is the inverse function of $\Phi$. Wan et al.’s estimator not only overcomes the limitations of Hozo et al.’s estimator by incorporating the sample size information efficiently, but is also very simple to use in practice thanks to its analytical formula. From a statistical point of view, Wan et al.8 provided the best method for estimating the sample SD, given that the reported data include only the four numbers in scenario $\mathcal{S}_1$.

2.2 Estimating the sample SD from $\mathcal{S}_2=\{q_1,m,q_3;n\}$

Under scenario $\mathcal{S}_2$, Wan et al.8 proposed an estimator of the sample SD as

 $$S \approx \frac{q_3-q_1}{\eta}, \tag{3}$$

where $\eta = \eta(n) = 2\Phi^{-1}\left(\frac{0.75n-0.125}{n+0.25}\right)$. Note that the method for deriving estimator (3) is the same in spirit as that for estimator (2). In particular, Wan et al.8 incorporated the sample size information appropriately, so that the proposed estimator is nearly unbiased. When the reported data include only $q_1$, $m$, $q_3$ and $n$, it can be shown that estimator (3) is the best method for estimating the sample SD.
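Estimators (2) and (3) only require the standard normal quantile function, so they are easy to sketch in Python with the standard library. The block below assumes the expressions for the two divisors that appear later in formulas (17) and (18); the function names and the example summary are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal; inv_cdf plays the role of Phi^{-1}


def xi(n):
    """Divisor of the range in estimator (2)."""
    return 2 * _STD_NORMAL.inv_cdf((n - 0.375) / (n + 0.25))


def eta(n):
    """Divisor of the interquartile range in estimator (3)."""
    return 2 * _STD_NORMAL.inv_cdf((0.75 * n - 0.125) / (n + 0.25))


def sd_from_range(a, b, n):
    """Estimator (2): SD estimate from the minimum a and maximum b."""
    return (b - a) / xi(n)


def sd_from_iqr(q1, q3, n):
    """Estimator (3): SD estimate from the first and third quartiles."""
    return (q3 - q1) / eta(n)


# Made-up summary with n = 85.
print(sd_from_range(12.0, 95.0, 85))
print(sd_from_iqr(38.0, 61.0, 85))
```

Unlike a step function, both divisors change smoothly with the sample size: the range divisor keeps growing with n, while the interquartile divisor settles near 2 times the 0.75 quantile of the standard normal distribution.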

2.3 Estimating the sample SD from $\mathcal{S}_3=\{a,q_1,m,q_3,b;n\}$

Under scenario $\mathcal{S}_3$, Bland 10 proposed to estimate the sample SD by

 $$S \approx \left[\frac{a^2 + 2q_1^2 + 2m^2 + 2q_3^2 + b^2}{16} + \frac{aq_1 + q_1m + mq_3 + q_3b}{8} - \frac{(a + 2q_1 + 2m + 2q_3 + b)^2}{64}\right]^{1/2}.$$

As mentioned in Section 1, Bland’s estimator is independent of the sample size and is less accurate for practical use. To improve the literature, Wan et al.8 proposed the following estimator of the sample SD:

 $$S \approx \frac{1}{2}\left(\frac{b-a}{\xi} + \frac{q_3-q_1}{\eta}\right). \tag{4}$$

In essence, Wan et al.8 treated scenario $\mathcal{S}_3$ as a combination of scenario $\mathcal{S}_1$ and scenario $\mathcal{S}_2$. By estimating the sample SD from scenario $\mathcal{S}_1$ and scenario $\mathcal{S}_2$ separately, they applied the average of estimators (2) and (3) as the final estimator. In Section 3, we will show that estimators (2) and (3) may not be equally reliable for different sample sizes $n$. As a consequence, the average estimator in formula (4) is not the optimal estimator due to the insufficient use of the sample size information. This motivates us to propose a smoothly weighted estimator for the sample SD to further improve Wan et al.’s estimator in this paper.
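To see the difference between the two estimators under the full five-number summary, the sketch below evaluates Bland's estimator, which ignores the sample size, next to the equally weighted estimator (4), which does not. It assumes the expressions for the two divisors given in formulas (17) and (18); function names and numbers are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def bland_sd(a, q1, m, q3, b):
    """Bland's SD estimate; note that the sample size does not enter."""
    second_moment = (a**2 + 2 * q1**2 + 2 * m**2 + 2 * q3**2 + b**2) / 16
    second_moment += (a * q1 + q1 * m + m * q3 + q3 * b) / 8
    squared_mean = (a + 2 * q1 + 2 * m + 2 * q3 + b) ** 2 / 64
    return math.sqrt(second_moment - squared_mean)


def wan_average_sd(a, q1, q3, b, n):
    """Estimator (4): plain average of the range-based and quartile-based estimates."""
    xi = 2 * _STD_NORMAL.inv_cdf((n - 0.375) / (n + 0.25))
    eta = 2 * _STD_NORMAL.inv_cdf((0.75 * n - 0.125) / (n + 0.25))
    return 0.5 * ((b - a) / xi + (q3 - q1) / eta)


# Made-up summary: Bland's value is fixed, estimator (4) varies with n.
print(bland_sd(10.0, 20.0, 30.0, 45.0, 80.0))
for n in (5, 85, 401):
    print(n, wan_average_sd(10.0, 20.0, 45.0, 80.0, n))
```

For a fixed summary, estimator (4) shrinks as n grows, since a given range is less surprising in a larger sample.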

3 Main results

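To present the main idea, we let $X_1, \dots, X_n$ be a random sample of size $n$ from the normal distribution with mean $\mu$ and variance $\sigma^2$, and $X_{(1)} \le \dots \le X_{(n)}$ be the order statistics of $X_1, \dots, X_n$. Let also $Z_i = (X_i - \mu)/\sigma$ for $i = 1, \dots, n$. Then $Z_1, \dots, Z_n$ follow the standard normal distribution with $Z_{(1)} \le \dots \le Z_{(n)}$ as the corresponding order statistics. Finally, by letting $n = 4Q+1$ with $Q$ a positive integer, we have $a = X_{(1)}$, $q_1 = X_{(Q+1)}$, $m = X_{(2Q+1)}$, $q_3 = X_{(3Q+1)}$, and $b = X_{(n)}$.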

3.1 Motivating example

To investigate whether the two components in estimator (4) are equally reliable, we first conduct a simple simulation study. In each simulation, we generate a random sample of size $n$ from the standard normal distribution, find the five-number summary $\{a, q_1, m, q_3, b\}$, and then apply estimators (2) and (3) to estimate the sample SD, respectively. For $n = 5$, 85 and 401, or equivalently, $Q = 1$, 21 and 100, we repeat the simulation 1,000,000 times and plot the histograms of the sample SD estimates in Figure 1 for both methods.

Figure 1: The histograms of the sample SD estimates (the true SD is 1 as shown in the vertical dashed lines) with the sample sizes 5, 85 and 401, with a total of 1,000,000 simulations. The red and green histograms represent the frequencies of the estimates by estimators (2) and (3), respectively.

From Figure 1, it is evident that estimators (2) and (3) may not be equally reliable when the sample size varies. Note that the true SD equals one since the data are generated from the standard normal distribution. When the sample size is small (say $n = 5$), estimator (2) provides a more accurate and less skewed estimate for the sample SD. When the sample size is moderate (say $n = 85$), the two estimators perform similarly and are about equally reliable. When the sample size is large (say $n = 401$), estimator (3) proves to be a more reliable estimator than estimator (2). This shows that Wan et al.’s estimator in formula (4) may not be the optimal estimator for the sample SD when the five-number summary is fully reported. We propose to further improve Wan et al.’s estimator by considering a linear combination of estimators (2) and (3), in which the optimal weight is a function of the sample size.
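A scaled-down version of this experiment can be sketched in Python. The block below is not the authors' code: it uses far fewer replications than the 1,000,000 reported, summarizes each estimator by its mean squared error about the true SD of 1 instead of plotting histograms, and assumes the divisor expressions from formulas (17) and (18). All names are hypothetical.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def simulate_estimates(n, reps, seed=0):
    """Return estimates from estimators (2) and (3) for standard normal samples."""
    if n < 5 or (n - 1) % 4 != 0:
        raise ValueError("n must have the form 4Q + 1 with Q >= 1")
    q = (n - 1) // 4
    xi = 2 * _STD_NORMAL.inv_cdf((n - 0.375) / (n + 0.25))
    eta = 2 * _STD_NORMAL.inv_cdf((0.75 * n - 0.125) / (n + 0.25))
    rng = random.Random(seed)
    by_range, by_iqr = [], []
    for _ in range(reps):
        z = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
        by_range.append((z[-1] - z[0]) / xi)
        # 0-based positions q and 3q hold Z_(Q+1) and Z_(3Q+1).
        by_iqr.append((z[3 * q] - z[q]) / eta)
    return by_range, by_iqr


def mse_about_one(values):
    """Mean squared error of the estimates about the true SD of 1."""
    return sum((v - 1.0) ** 2 for v in values) / len(values)


if __name__ == "__main__":
    for n in (5, 85, 401):
        range_est, iqr_est = simulate_estimates(n, reps=2000)
        print(n, round(mse_about_one(range_est), 4), round(mse_about_one(iqr_est), 4))
```

With these settings, one would expect the range-based estimator to have the smaller error at n = 5 and the quartile-based estimator to have the smaller error at n = 401, in line with the description of Figure 1.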

3.2 Optimal sample SD estimation

In view of the limitations of estimator (4), we propose the following estimator for the sample SD:

 $$S_w = w\left(\frac{b-a}{\xi}\right) + (1-w)\left(\frac{q_3-q_1}{\eta}\right), \tag{5}$$

where $\xi$ and $\eta$ are given in Section 2, and $w$ is the weight assigned to the first component. Note that the new estimator is a weighted combination of estimators (2) and (3). When $w = 1$, the new estimator reduces to estimator (2). When $w = 0$, the new estimator reduces to estimator (3). And when $w = 0.5$, the new estimator leads to estimator (4) in Wan et al.8 Hence, the existing estimators of the sample SD are all special cases of our new estimator.

To find the optimal estimator, we consider the commonly used quadratic loss function, i.e., $L(S_w, \sigma) = (S_w - \sigma)^2$. We select the optimal weight by minimizing the expected value of the loss function $L(S_w, \sigma)$, or equivalently, by minimizing the mean squared error (MSE) of the estimator. In Section 6, we show in Theorem 1 that the optimal weight is, approximately,

 $$w_{opt} = \frac{\mathrm{Var}(q_3-q_1)/\eta^2 - \mathrm{Cov}(b-a,\, q_3-q_1)/(\xi\eta)}{\mathrm{Var}(b-a)/\xi^2 + \mathrm{Var}(q_3-q_1)/\eta^2 - 2\,\mathrm{Cov}(b-a,\, q_3-q_1)/(\xi\eta)}. \tag{6}$$

We further show that the optimal weight converges to zero as the sample size tends to infinity. That is, estimator (3) becomes more reliable than estimator (2) when the sample size is large.

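Note that the optimal weight in formula (6) has a complicated form and may not be readily accessible to practitioners. To promote the practical use of the new estimator, we also develop an approximation formula for the optimal weight. Recall that $a = \mu + \sigma Z_{(1)}$, $q_1 = \mu + \sigma Z_{(Q+1)}$, $q_3 = \mu + \sigma Z_{(3Q+1)}$, and $b = \mu + \sigma Z_{(n)}$. We have $b - a = \sigma(Z_{(n)} - Z_{(1)})$ and $q_3 - q_1 = \sigma(Z_{(3Q+1)} - Z_{(Q+1)})$. Then by formula (6) and the symmetry of the standard normal distribution, we can rewrite the optimal weight as

 $$w_{opt}(n) = \frac{1}{1+J(n)}, \tag{7}$$

where

 $$J(n) = \frac{\mathrm{Var}(Z_{(n)}-Z_{(1)})/\xi^2 - \mathrm{Cov}(Z_{(n)}-Z_{(1)},\, Z_{(3Q+1)}-Z_{(Q+1)})/(\xi\eta)}{\mathrm{Var}(Z_{(3Q+1)}-Z_{(Q+1)})/\eta^2 - \mathrm{Cov}(Z_{(n)}-Z_{(1)},\, Z_{(3Q+1)}-Z_{(Q+1)})/(\xi\eta)}. \tag{8}$$

Note that $J(n)$ is independent of the parameters $\mu$ and $\sigma^2$ and depends only on the sample size $n$. This shows that the optimal weight is a function of $n$ only. For clarification, we have expressed the optimal weight as $w_{opt}(n)$ in formula (7).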
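Since formulas (7) and (8) involve only variances and covariances of standard normal order statistics, they can be approximated by simulation. The following Python sketch is a hypothetical Monte Carlo approximation, not the numerical procedure used in the paper; it assumes the divisor expressions from formulas (17) and (18), and its output is subject to simulation noise.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def _variance(x):
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / (len(x) - 1)


def _covariance(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y)) / (len(x) - 1)


def monte_carlo_weight(n, reps, seed=0):
    """Monte Carlo approximation of J(n) in (8) and of w_opt(n) = 1 / (1 + J(n))."""
    if n < 5 or (n - 1) % 4 != 0:
        raise ValueError("n must have the form 4Q + 1 with Q >= 1")
    q = (n - 1) // 4
    xi = 2 * _STD_NORMAL.inv_cdf((n - 0.375) / (n + 0.25))
    eta = 2 * _STD_NORMAL.inv_cdf((0.75 * n - 0.125) / (n + 0.25))
    rng = random.Random(seed)
    ranges, iqrs = [], []
    for _ in range(reps):
        z = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
        ranges.append(z[-1] - z[0])       # Z_(n) - Z_(1)
        iqrs.append(z[3 * q] - z[q])      # Z_(3Q+1) - Z_(Q+1)
    cov_term = _covariance(ranges, iqrs) / (xi * eta)
    j = (_variance(ranges) / xi**2 - cov_term) / (_variance(iqrs) / eta**2 - cov_term)
    return j, 1.0 / (1.0 + j)


if __name__ == "__main__":
    for n in (5, 85, 401):
        print(n, monte_carlo_weight(n, reps=2000))
```

Because the data are standardized before the order statistics are taken, the mean and variance of the original sample never enter the computation, which mirrors the remark that the weight depends on n only.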

3.3 An approximation formula

To obtain an approximation formula for the optimal weight, we numerically compute the true values of $J(n)$ and $w_{opt}(n)$ for different values of $n$ using formulas (7) and (8). We then plot $J(n)$ and $w_{opt}(n)$ in Figure 2 for $n$ varying from 5 to 401. Observing that $J(n)$ is an increasing and concave function of $n$, we approximate $J(n)$ by a simple power function of $n$ whose exponent lies between 0 and 1, so that the approximation curve is also concave. With the true values of $J(n)$, the best fit is approximately $J(n) \approx 0.07n^{0.6}$. Finally, by plugging this into formula (7), we have the approximation formula for the optimal weight as

 $$\tilde{w}_{opt}(n) \approx \frac{1}{1+0.07n^{0.6}}. \tag{9}$$

Figure 2: The left panel displays the true and approximated values of $J(n)$, and the right panel displays the optimal weights, the approximated weights and the weights in Wan et al.8

From the right panel of Figure 2, it is evident that the approximation formula fits the true optimal weight values very closely for $n$ up to 401. By formulas (5) and (9), our proposed estimator of the sample SD from the five-number summary is

 $$S(n) \approx \left(\frac{1}{1+0.07n^{0.6}}\right)\frac{b-a}{\xi} + \left(\frac{0.07n^{0.6}}{1+0.07n^{0.6}}\right)\frac{q_3-q_1}{\eta}. \tag{10}$$

From formula (9), the nearest integer of the form $4Q+1$ is $n = 85$ so that $\tilde{w}_{opt}(n) \approx 0.5$. That is, when $n = 85$, estimators (2) and (3) will be about equally reliable. This coincides with our simulation results in Figure 1 that estimators (2) and (3) perform very similarly when $n = 85$. This, from another perspective, demonstrates that our approximation formula can serve as a “rule of thumb” for estimating the sample SD from the five-number summary.
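The arithmetic behind the equal-reliability point can be checked directly from formula (9). The short Python sketch below uses hypothetical names; it solves for the sample size at which the approximate weight equals one half and evaluates the weight at the three sample sizes used in Figure 1.

```python
def approx_weight(n):
    """Approximate optimal weight from formula (9)."""
    return 1.0 / (1.0 + 0.07 * n**0.6)


# The weight equals 0.5 where 0.07 * n**0.6 = 1, i.e. n = (1 / 0.07) ** (1 / 0.6).
crossover_n = (1.0 / 0.07) ** (1.0 / 0.6)
print(round(crossover_n, 1))

for n in (5, 85, 401):
    print(n, round(approx_weight(n), 3))
```

The crossover falls just above 84, and the closest sample size of the form 4Q + 1 is 85, where the weight is within about 0.002 of one half.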

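Recall that $\xi = 2\Phi^{-1}\left(\frac{n-0.375}{n+0.25}\right)$ and $\eta = 2\Phi^{-1}\left(\frac{0.75n-0.125}{n+0.25}\right)$, where $\Phi^{-1}$ is the quantile function of the standard normal distribution. We have the shortcut formula of estimator (10) as

 $$S \approx \frac{b-a}{\theta_1} + \frac{q_3-q_1}{\theta_2}, \tag{11}$$

where

 $$\theta_1 = \theta_1(n) = \left(2 + 0.14n^{0.6}\right)\cdot\Phi^{-1}\left(\frac{n-0.375}{n+0.25}\right), \qquad \theta_2 = \theta_2(n) = \left(2 + \frac{2}{0.07n^{0.6}}\right)\cdot\Phi^{-1}\left(\frac{0.75n-0.125}{n+0.25}\right).$$

For ease of implementation, we also provide the numerical values of $\theta_1$ and $\theta_2$ in Table 1 for $Q$ up to 100, or equivalently, for $n$ up to 401. For a general sample size, one may refer to our Excel spreadsheet in the Supplementary Material for specific values, or compute them using the command “qnorm()” in the R software.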
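For readers who prefer Python to R or Excel, the shortcut formula (11) can be sketched with the standard library, where `NormalDist().inv_cdf` plays the role of `qnorm()`. The function names below are hypothetical, and the second function writes out the weighted form (10) only to confirm that the two forms agree.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def theta1(n):
    """Divisor of the range in the shortcut formula (11)."""
    return (2 + 0.14 * n**0.6) * _STD_NORMAL.inv_cdf((n - 0.375) / (n + 0.25))


def theta2(n):
    """Divisor of the interquartile range in the shortcut formula (11)."""
    return (2 + 2 / (0.07 * n**0.6)) * _STD_NORMAL.inv_cdf((0.75 * n - 0.125) / (n + 0.25))


def sd_shortcut(a, q1, q3, b, n):
    """Shortcut formula (11)."""
    return (b - a) / theta1(n) + (q3 - q1) / theta2(n)


def sd_weighted(a, q1, q3, b, n):
    """Weighted form (10), written out with the two divisors and the weight from (9)."""
    xi = 2 * _STD_NORMAL.inv_cdf((n - 0.375) / (n + 0.25))
    eta = 2 * _STD_NORMAL.inv_cdf((0.75 * n - 0.125) / (n + 0.25))
    w = 1 / (1 + 0.07 * n**0.6)
    return w * (b - a) / xi + (1 - w) * (q3 - q1) / eta


# Made-up five-number summary (the median is not needed for the SD).
for n in (5, 85, 401):
    print(n, sd_shortcut(12.0, 38.0, 61.0, 95.0, n), sd_weighted(12.0, 38.0, 61.0, 95.0, n))
```

The two columns printed for each n should coincide up to floating-point rounding, since (11) is an algebraic rearrangement of (10).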

4 Numerical studies

To evaluate the practical performance of the new method, we conduct numerical studies to compare our proposed estimator with the three estimators in Wan et al.8 The robustness of the estimators will also be examined.

In the first study, we generate the data from the normal distribution with mean 50 and standard deviation 17, following the same settings as in Hozo et al.7 and Wan et al.8 Then for the simulated data with sample size $n$, we compute the sample SD, denoted as $S_{\mathrm{Sam}}$, and also record the five-number summary $\{a, q_1, m, q_3, b\}$. To apply the proposed method, we assume that the only available data are the five-number summary and the sample size, and apply estimators (2), (3), (4) and (11) to estimate the sample SD, denoted by $S_1$, $S_0$, $S_{0.5}$ and $S_{\tilde{w}_{opt}}$, respectively. Finally, for a fair comparison, we compute the relative mean squared errors (RMSE) of the four estimators as follows:

 $$\mathrm{RMSE}(S_1) = \frac{\sum_{i=1}^{T}(S_{1,i}-\sigma)^2}{\sum_{i=1}^{T}(S_{\mathrm{Sam},i}-\sigma)^2}, \qquad \mathrm{RMSE}(S_0) = \frac{\sum_{i=1}^{T}(S_{0,i}-\sigma)^2}{\sum_{i=1}^{T}(S_{\mathrm{Sam},i}-\sigma)^2},$$

 $$\mathrm{RMSE}(S_{0.5}) = \frac{\sum_{i=1}^{T}(S_{0.5,i}-\sigma)^2}{\sum_{i=1}^{T}(S_{\mathrm{Sam},i}-\sigma)^2}, \qquad \mathrm{RMSE}(S_{\tilde{w}_{opt}}) = \frac{\sum_{i=1}^{T}(S_{\tilde{w}_{opt},i}-\sigma)^2}{\sum_{i=1}^{T}(S_{\mathrm{Sam},i}-\sigma)^2},$$

where $T$ is the total number of simulations, $\sigma$ is the true SD, and $S_{\mathrm{Sam},i}$ is the sample SD in the $i$th simulation.

Figure 3: The natural logarithm of the RMSE values of the sample SD estimators for data from the normal distribution. The empty triangles represent the results by estimator (2), the solid squares represent the results by estimator (3), the empty circles represent the results by estimator (4), and the solid circles represent the results by our new estimator (11).
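The RMSE comparison defined above can be imitated on a small scale in Python. The sketch below is only an illustration under stated assumptions: the number of replications is a hypothetical choice (the number used in the paper is not recoverable from this text), the divisor expressions are taken from formulas (17) and (18), and the weight is the approximation in formula (9).

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def relative_mse(n, reps, mu=50.0, sigma=17.0, seed=0):
    """Squared-error sums of four SD estimators, each divided by that of the sample SD."""
    if n < 5 or (n - 1) % 4 != 0:
        raise ValueError("n must have the form 4Q + 1 with Q >= 1")
    q = (n - 1) // 4
    xi = 2 * _STD_NORMAL.inv_cdf((n - 0.375) / (n + 0.25))
    eta = 2 * _STD_NORMAL.inv_cdf((0.75 * n - 0.125) / (n + 0.25))
    w = 1 / (1 + 0.07 * n**0.6)
    rng = random.Random(seed)
    sq_err = {"range": 0.0, "iqr": 0.0, "average": 0.0, "weighted": 0.0, "sample": 0.0}
    for _ in range(reps):
        x = sorted(rng.gauss(mu, sigma) for _ in range(n))
        mean = math.fsum(x) / n
        sample_sd = math.sqrt(math.fsum((v - mean) ** 2 for v in x) / (n - 1))
        s_range = (x[-1] - x[0]) / xi            # estimator (2)
        s_iqr = (x[3 * q] - x[q]) / eta          # estimator (3)
        estimates = {
            "range": s_range,
            "iqr": s_iqr,
            "average": 0.5 * (s_range + s_iqr),          # estimator (4)
            "weighted": w * s_range + (1 - w) * s_iqr,   # estimator (10)/(11)
            "sample": sample_sd,
        }
        for name, value in estimates.items():
            sq_err[name] += (value - sigma) ** 2
    return {name: sq_err[name] / sq_err["sample"] for name in sq_err if name != "sample"}


if __name__ == "__main__":
    for n in (5, 85, 401):
        print(n, relative_mse(n, reps=1000))
```

Dividing by the squared error of the sample SD puts all sample sizes on a common scale: a value of 1 would mean that nothing is lost by seeing only the five-number summary.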

With $n$ ranging from 5 to 801, we compute the natural logarithm of the RMSE values for the four estimators and plot them in Figure 3. From the numerical results, it is evident that our new estimator has a smaller RMSE value than the three existing estimators in all settings, which demonstrates that our new estimator does provide the optimal estimate for the sample SD. Specifically, estimator (2), which uses only the minimum and maximum values, performs well only when the sample size is extremely small. Estimator (3), which uses only the first and third quartiles, does not perform well for any sample size. The equally weighted estimator (4) performs better than estimators (2) and (3) in a wide range of settings. Nevertheless, we also note that estimator (4) is not as good as estimator (2) when $n$ is relatively small, and is not as good as estimator (3) when $n$ is relatively large. This shows, from another perspective, that estimator (4) does not provide an optimal weight between the two elementary estimators and hence is still suboptimal. In fact, compared to our new estimator, estimator (4) is capable of providing an optimal estimate only when the sample size is about 85, which coincides with our analytical results in Section 3.3. It is also interesting to point out that the numerical results for large sample sizes are consistent with the asymptotic results in Theorem 2, in which we demonstrate that our new estimator has the smallest asymptotic RMSE among the four estimators. To conclude, from both practical and theoretical perspectives, our new estimator is superior to the existing estimators in all settings, and it deserves to be regarded as the optimal estimator of the sample SD for studies reported with the five-number summary.

Figure 4: The natural logarithm of the RMSE values of the sample SD estimators for data from the skewed distributions. The empty triangles represent the results by estimator (2), the solid squares represent the results by estimator (3), the empty circles represent the results by estimator (4), and the solid circles represent the results by our new estimator (11).

To check the robustness of our proposed estimator, we conduct another numerical study with data generated from non-normal distributions. Specifically, we consider four skewed distributions: the log-normal distribution, the chi-square distribution with 10 degrees of freedom, the beta distribution, and the Weibull distribution. Other settings and the estimation procedure remain the same as in the previous study. Finally, for estimators (2), (3), (4) and (11), we simulate the data for each non-normal distribution, and then report the natural logarithm of their RMSE values in Figure 4 for $n$ up to 801. From the numerical results, it is evident that our new estimator is still able to provide a smaller RMSE value than the existing estimators in most settings. This shows that our new estimator is quite robust to the violation of the normality assumption. In particular, we note that estimator (2) performs even worse when the sample size is large, a possible reason being that the minimum and maximum values are more likely to be outliers when they are simulated from heavy-tailed distributions. In contrast, the RMSE of our new estimator increases only slowly as the sample size increases, which also demonstrates that our new estimator has better asymptotic properties compared to the existing estimators. Together with the comparison results in the previous study, we conclude that our new estimator not only provides an optimal estimate of the sample SD for normal data, but also performs favorably compared to the existing estimators for non-normal data.

5 Conclusion

For clinical trials with continuous outcomes, the sample mean and SD are routinely reported in the literature. In some other studies, however, researchers may instead report the five-number summary including the sample median, the first and third quartiles, and the minimum and maximum values. When such studies are included in a meta-analysis, it is often desirable to convert the five-number summary back to the sample mean and standard deviation. As reviewed in Section 2, a number of studies have emerged recently to solve this important problem under three common scenarios. It is noted, however, that the existing methods, including Wan et al.8 and Bland,10 are still suboptimal for estimating the sample SD from the five-number summary.

To further advance the literature, we have proposed an improved estimator for the sample SD by considering a smoothly weighted combination of two available estimators. In addition, given that the analytical form of the optimal weight is complicated and may not be readily accessible to practitioners, we have also derived an approximation formula for the optimal weight, and that yields a shortcut formula for the optimal estimation of the sample SD. As confirmed by the theoretical and numerical results, our new methods are able to dramatically improve the existing methods in the literature. Together with Luo et al.,1 we hence recommend practitioners to estimate the sample mean and SD from the five-number summary by formulas (1) and (11), respectively.

To summarize, we have also provided the optimal estimators of the sample mean and SD under the three common scenarios in Table 2. To be more specific, the optimal sample mean estimators under all three scenarios are from Luo et al.,1 the optimal sample SD estimators under scenarios $\mathcal{S}_1$ and $\mathcal{S}_2$ are from Wan et al.,8 and the optimal sample SD estimator under scenario $\mathcal{S}_3$ is provided in (11), which completes Table 2 as a full set of tools for data transformation from the five-number summary to the sample mean and SD. To promote the practical use, we have also provided an Excel spreadsheet to implement the optimal estimators in the Supplementary Material. More importantly, we have also incorporated the new estimator in this paper into our online calculator at http://www.math.hkbu.edu.hk/~tongt/papers/median2mean.html. According to Table 2, if the five-number summary is reported for a certain study, estimators (1) and (11) will be adopted to estimate the sample mean and SD, respectively. Specifically, one can input the five-number summary and the sample size into the corresponding entries under scenario $\mathcal{S}_3$, and then, by clicking the “Calculate” button, the optimal estimates of the sample mean and SD will be automatically displayed in the result entries.

Finally, we note that all the estimators in Table 2 are established under the normality assumption for the clinical trial data. In practice, however, this normality assumption may not hold, in particular when the five-number summary is reported rather than the sample mean and SD. In view of this, researchers have also been proposing different approaches for analyzing the studies reported with the five-number summary. They include, for example, extending the data transformation methods from normal data to non-normal data,1315 or developing new meta-analytical methods to directly synthesize the data with the five-number summary. 1617 We note that the proposed method in this paper can also be readily extended to non-normal data, yet further work is needed to assess the effectiveness of these new estimators when included in a meta-analysis.

6 Theoretical results

We present the theoretical results of the proposed method in this section. Specifically, we state two theorems, and to prove them we need the following two lemmas.

Lemma 1.

Let $\Phi(\cdot)$ and $\phi(\cdot)$ be the cumulative distribution function and the probability density function of the standard normal distribution, respectively. Let also $\Phi^{-1}(\cdot)$ be the inverse function of $\Phi(\cdot)$. Then,

 $$\lim_{x\to 0^+}\frac{\Phi^{-1}(1-x)}{\sqrt{-2\ln(x)}} = 1. \tag{12}$$

Proof. Since $\Phi^{-1}(1-x)$ and $\sqrt{-2\ln(x)}$ are both positive as $x \to 0^+$, proving formula (12) is equivalent to showing that

 $$\lim_{x\to 0^+}\frac{[\Phi^{-1}(1-x)]^2}{-2\ln(x)} = 1.$$

Let $y = \Phi^{-1}(1-x)$. Then, $x = 1 - \Phi(y)$. Noting that $y \to \infty$ as $x \to 0^+$, by L’Hôpital’s rule, we have

 $$\lim_{y\to\infty}\frac{y^2}{-2\ln(1-\Phi(y))} = \lim_{y\to\infty}\frac{1-\Phi(y)}{y^{-1}\phi(y)} = \lim_{y\to\infty}\frac{\phi(y)}{\phi(y)+y^{-2}\phi(y)} = 1,$$

where the second last equality follows by the Stein property of the standard normal distribution, that is, $\phi'(y) = -y\phi(y)$.
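The limit in Lemma 1 is approached rather slowly, which a quick numerical check makes visible. The Python sketch below is purely illustrative; the function name is hypothetical, and the symmetry of the standard normal distribution is used so that very small arguments do not get rounded away.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def lemma1_ratio(x):
    """Ratio Phi^{-1}(1 - x) / sqrt(-2 ln x) for small positive x."""
    # By symmetry, Phi^{-1}(1 - x) = -Phi^{-1}(x); this avoids rounding 1 - x to 1.
    return -_STD_NORMAL.inv_cdf(x) / math.sqrt(-2.0 * math.log(x))


for x in (1e-2, 1e-4, 1e-8, 1e-16, 1e-64):
    print(x, lemma1_ratio(x))
```

The ratios increase toward 1 from below as x shrinks, but remain visibly short of 1 even for extremely small x, so the approximation in formula (17) should be read as an order-of-magnitude statement rather than a sharp numerical one.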

Lemma 2.

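Let $Z_1, \dots, Z_n$ follow the standard normal distribution with $Z_{(1)} \le \dots \le Z_{(n)}$ being the corresponding order statistics. As $n \to \infty$, we have

 $$\mathrm{Var}(Z_{(n)}-Z_{(1)}) \approx \frac{\pi^2}{6\ln(n)},$$
 $$\mathrm{Var}(Z_{(3Q+1)}-Z_{(Q+1)}) \approx \frac{2.4758}{n}, \tag{13}$$
 $$\mathrm{Cov}(Z_{(n)}-Z_{(1)},\, Z_{(3Q+1)}-Z_{(Q+1)}) = O\left(\frac{1}{\sqrt{n\ln(n)}}\right). \tag{14}$$

Proof. Since $Z_{(1)}$ and $Z_{(n)}$ are asymptotically independent (see Theorem 8.4.3 in Arnold et al.18), then as $n \to \infty$, $\mathrm{Cov}(Z_{(1)}, Z_{(n)}) \to 0$. Note also that the limiting distribution of $Z_{(n)}$ follows a Gumbel distribution as

 $$\lim_{n\to\infty} P\left(\frac{Z_{(n)}-a_n}{b_n} \le x\right) = \exp(-e^{-x}),$$

where $a_n = \sqrt{2\ln(n)} - \frac{\ln(\ln(n)) + \ln(4\pi)}{2\sqrt{2\ln(n)}}$ and $b_n = 1/\sqrt{2\ln(n)}$. In addition, for the Gumbel distribution, the mean value is the Euler-Mascheroni constant $\gamma$ and the variance is $\pi^2/6$. Therefore, $E(Z_{(n)}) \approx a_n + \gamma b_n$ and $\mathrm{Var}(Z_{(n)}) \approx \pi^2 b_n^2/6$ as $n \to \infty$, and consequently, $\mathrm{Var}(Z_{(n)}) \approx \pi^2/(12\ln(n))$. Further by symmetry, we have $\mathrm{Var}(Z_{(1)}) = \mathrm{Var}(Z_{(n)})$. Hence, as $n \to \infty$,

 $$\mathrm{Var}(Z_{(n)}-Z_{(1)}) \approx \mathrm{Var}(Z_{(1)}) + \mathrm{Var}(Z_{(n)}) \approx \frac{\pi^2}{6\ln(n)}.$$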

By Theorem 2 in Luo et al.,1 as , and . This shows that formula (13) holds.

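By the Cauchy-Schwarz inequality, we have

 $$\mathrm{Cov}(Z_{(1)}, Z_{(Q+1)}) \le \sqrt{\mathrm{Var}(Z_{(1)})\,\mathrm{Var}(Z_{(Q+1)})} = O\left(\frac{1}{\sqrt{n\ln(n)}}\right).$$

Similarly, we can show that $\mathrm{Cov}(Z_{(1)}, Z_{(3Q+1)})$, $\mathrm{Cov}(Z_{(n)}, Z_{(Q+1)})$ and $\mathrm{Cov}(Z_{(n)}, Z_{(3Q+1)})$ are all of order $O(1/\sqrt{n\ln(n)})$. Combining the above results, it follows readily that formula (14) holds.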

Theorem 1.

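For the proposed estimator $S_w$ in formula (5), we have the following properties.

1. $E(S_w) \approx \sigma$ for any weight $w$.

2. The optimal weight is, approximately,
 $$w_{opt} = \frac{\mathrm{Var}(q_3-q_1)/\eta^2 - \mathrm{Cov}(b-a,\, q_3-q_1)/(\xi\eta)}{\mathrm{Var}(b-a)/\xi^2 + \mathrm{Var}(q_3-q_1)/\eta^2 - 2\,\mathrm{Cov}(b-a,\, q_3-q_1)/(\xi\eta)}.$$

3. $w_{opt} \to 0$ as $n \to \infty$.

Proof. (i) The expected value of the proposed estimator is

 $$E(S_w) = wE\left(\frac{b-a}{\xi}\right) + (1-w)E\left(\frac{q_3-q_1}{\eta}\right).$$

Since $E[(b-a)/\xi] \approx \sigma$ and $E[(q_3-q_1)/\eta] \approx \sigma$, we have $E(S_w) \approx \sigma$ for any weight $w$.

(ii) By part (i), we have $\mathrm{MSE}(S_w) \approx \mathrm{Var}(S_w)$. Hence, minimizing the MSE of the estimator is approximately equivalent to minimizing the variance of the estimator. Note that

 $$\mathrm{Var}(S_w) = w^2\,\mathrm{Var}\left(\frac{b-a}{\xi}\right) + (1-w)^2\,\mathrm{Var}\left(\frac{q_3-q_1}{\eta}\right) + 2w(1-w)\,\mathrm{Cov}\left(\frac{b-a}{\xi}, \frac{q_3-q_1}{\eta}\right).$$

The first derivative of the variance with respect to $w$ is

 $$\frac{d}{dw}\mathrm{Var}(S_w) = 2w\,\mathrm{Var}\left(\frac{b-a}{\xi}\right) - 2(1-w)\,\mathrm{Var}\left(\frac{q_3-q_1}{\eta}\right) + (2-4w)\,\mathrm{Cov}\left(\frac{b-a}{\xi}, \frac{q_3-q_1}{\eta}\right).$$

Setting the first derivative equal to zero, we have

 $$w = \frac{\mathrm{Var}(q_3-q_1)/\eta^2 - \mathrm{Cov}(b-a,\, q_3-q_1)/(\xi\eta)}{\mathrm{Var}(b-a)/\xi^2 + \mathrm{Var}(q_3-q_1)/\eta^2 - 2\,\mathrm{Cov}(b-a,\, q_3-q_1)/(\xi\eta)}. \tag{15}$$

Also by the Cauchy-Schwarz inequality, the second derivative is always non-negative,

 $$\frac{d^2}{dw^2}\mathrm{Var}(S_w) = 2\,\mathrm{Var}\left(\frac{b-a}{\xi}\right) + 2\,\mathrm{Var}\left(\frac{q_3-q_1}{\eta}\right) - 4\,\mathrm{Cov}\left(\frac{b-a}{\xi}, \frac{q_3-q_1}{\eta}\right) \ge 0.$$

This shows that the weight in formula (15) is, approximately, the optimal weight of the estimator.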

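(iii) Recall that, by formulas (7) and (8), the optimal weight can be rewritten as

 $$w_{opt} = \frac{\mathrm{Var}(Z_{(3Q+1)}-Z_{(Q+1)})/\eta^2 - \mathrm{Cov}(Z_{(n)}-Z_{(1)},\, Z_{(3Q+1)}-Z_{(Q+1)})/(\xi\eta)}{\mathrm{Var}\left[(Z_{(n)}-Z_{(1)})/\xi - (Z_{(3Q+1)}-Z_{(Q+1)})/\eta\right]}, \tag{16}$$

where $\xi = 2\Phi^{-1}\left(\frac{n-0.375}{n+0.25}\right)$ and $\eta = 2\Phi^{-1}\left(\frac{0.75n-0.125}{n+0.25}\right)$. By Lemma 1, as $n \to \infty$, we have

 $$\xi = 2\Phi^{-1}\left(1-\frac{0.625}{n+0.25}\right) \approx 2\sqrt{-2\ln\left(\frac{0.625}{n+0.25}\right)} = O\left(\sqrt{\ln(n)}\right). \tag{17}$$

It is also clear that

 $$\eta = 2\Phi^{-1}\left(\frac{0.75n-0.125}{n+0.25}\right) \approx 2\Phi^{-1}(0.75) = O(1). \tag{18}$$

Then together with Lemma 2, as $n \to \infty$, the numerator of $w_{opt}$ in formula (16) is of order $O(1/(\sqrt{n}\ln(n)))$. In addition, it can be shown that the denominator of $w_{opt}$ is dominated by $\mathrm{Var}(Z_{(n)}-Z_{(1)})/\xi^2$, which is of order $1/(\ln(n))^2$. Finally, by combining the above results, we have $w_{opt} \to 0$ as $n \to \infty$.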

Theorem 2.

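For the relative mean squared errors (RMSE) of the four differently weighted estimators, as $n \to \infty$, we have

 $$\mathrm{RMSE}(S_1) = \frac{\mathrm{MSE}(S_1)}{\mathrm{MSE}(S_{\mathrm{Sam}})} = O\left(\frac{n}{(\ln(n))^2}\right), \qquad \mathrm{RMSE}(S_0) = \frac{\mathrm{MSE}(S_0)}{\mathrm{MSE}(S_{\mathrm{Sam}})} \approx 2.721,$$

 $$\mathrm{RMSE}(S_{0.5}) = \frac{\mathrm{MSE}(S_{0.5})}{\mathrm{MSE}(S_{\mathrm{Sam}})} = O\left(\frac{n}{(\ln(n))^2}\right), \qquad \mathrm{RMSE}(S_{\tilde{w}_{opt}}) = \frac{\mathrm{MSE}(S_{\tilde{w}_{opt}})}{\mathrm{MSE}(S_{\mathrm{Sam}})} \approx 2.721,$$

where $S_1$, $S_0$ and $S_{0.5}$ are Wan et al.’s estimators (2), (3) and (4) respectively, $S_{\tilde{w}_{opt}}$ is our new estimator (11), and $S_{\mathrm{Sam}}$ is the sample SD.

Proof. By Holtzman,19 the expected value of the sample SD is $E(S_{\mathrm{Sam}}) = c\sigma$, where $c = 1 - 1/(4n) + o(1/n)$. Note also that $S_{\mathrm{Sam}}^2$ is the sample variance and so is an unbiased estimator for $\sigma^2$. Then,

 $$\mathrm{MSE}(S_{\mathrm{Sam}}) = E(S_{\mathrm{Sam}}^2) - 2\sigma E(S_{\mathrm{Sam}}) + \sigma^2 = 2\sigma^2 - 2c\sigma^2 = 0.5\sigma^2/n + o(1/n). \tag{19}$$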
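Formula (19) can be checked numerically. The Python sketch below assumes that the constant c is the usual bias-correction factor for the sample SD of normal data, often written as c4, which is expressed through the gamma function; the function names are hypothetical.

```python
import math


def c4(n):
    """E(sample SD) / sigma for a normal sample of size n (assumed form of c)."""
    log_ratio = math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
    return math.sqrt(2.0 / (n - 1)) * math.exp(log_ratio)


def sample_sd_mse(n, sigma=1.0):
    """MSE of the sample SD from formula (19): 2 * sigma^2 * (1 - c)."""
    return 2.0 * sigma**2 * (1.0 - c4(n))


# n * MSE / sigma^2 should approach 0.5 as n grows.
for n in (5, 85, 401, 801):
    print(n, round(n * sample_sd_mse(n), 4))
```

Working with `math.lgamma` rather than `math.gamma` avoids overflow for large n, since only the ratio of the two gamma values is needed.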

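For the existing estimators, by Wan et al.,8 we have $E(S_w) \approx \sigma$ for $w = 1$, 0 and 0.5. Further by Lemma 2 and formulas (17)-(18), as $n \to \infty$,

 $$\mathrm{MSE}(S_1) \approx \frac{\mathrm{Var}(Z_{(n)}-Z_{(1)})}{\xi^2}\sigma^2 = O\left(\frac{1}{(\ln(n))^2}\right),$$
 $$\mathrm{MSE}(S_0) \approx \frac{\mathrm{Var}(Z_{(3Q+1)}-Z_{(Q+1)})}{\eta^2}\sigma^2 \approx \frac{2.4758\sigma^2/n}{4[\Phi^{-1}(0.75)]^2} + o\left(\frac{1}{n}\right) \approx \frac{1.3605\sigma^2}{n} + o\left(\frac{1}{n}\right),$$
 $$\mathrm{MSE}(S_{0.5}) \approx \sigma^2\left[\frac{\mathrm{Var}(Z_{(n)}-Z_{(1)})}{4\xi^2} + \frac{\mathrm{Var}(Z_{(3Q+1)}-Z_{(Q+1)})}{4\eta^2} + \frac{\mathrm{Cov}(Z_{(n)}-Z_{(1)},\, Z_{(3Q+1)}-Z_{(Q+1)})}{2\xi\eta}\right] = O\left(\frac{1}{(\ln(n))^2}\right),$$

where $\Phi^{-1}(0.75) \approx 0.6745$. Then together with formula (19), we have $\mathrm{RMSE}(S_1) = O(n/(\ln(n))^2)$, $\mathrm{RMSE}(S_0) \approx 2.721$ and $\mathrm{RMSE}(S_{0.5}) = O(n/(\ln(n))^2)$ as $n \to \infty$.

For our new estimator, by Theorem 1, we have $E(S_{\tilde{w}_{opt}}) \approx \sigma$. Note also that $\tilde{w}_{opt} = 1/(1+0.07n^{0.6})$ and $1 - \tilde{w}_{opt} = 1/[1+1/(0.07n^{0.6})]$. Then by Lemma 2 and formulas (17)-(18), as $n \to \infty$,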

 MSE(S~wopt)≈ σ2[Var(Z(n)−Z(1))(1+0.07n0.6)2ξ2+Var(Z(3Q+1)−Z(Q+1))[1+1/(0.07n0.6)]2η2 +2Cov(Z(n)−Z(1),Z(3Q+1)−Z(Q+1))(1+0.07n0.6)[1+1/(0.07n0.6)]ξη] ≈ O(1n1.2