 # Relative Efficiency of Higher Normed Estimators Over the Least Squares Estimator

In this article, we study the performance of the estimator that minimizes the $L_{2k}$-order loss function (for $k \geq 2$) against the estimator that minimizes the $L_2$-order loss function (the least squares estimator). Commonly occurring examples illustrate the differences in efficiency between $L_{2k}$- and $L_2$-based estimators. We derive an empirically testable condition under which the $L_{2k}$ estimator is more efficient than the least squares estimator, and construct a simple decision rule to choose between the $L_{2k}$ and $L_2$ estimators. Special emphasis is given to the $L_4$ estimator. A detailed simulation study verifies the effectiveness of this decision rule, and the superiority of the $L_{2k}$ estimator is demonstrated on a real-life data set.


## 1 Introduction

The least squares (LS hereafter) method is possibly the most popular method of estimation, routinely used to estimate the underlying (regression) parameters. Stigler (1981) rightly said: “The method of least squares is the automobile of modern statistical analysis: Despite its limitations, occasional accidents, and incidental pollution, it and its numerous variations, extensions, and related conveyances carry the bulk of statistical analysis, and are known and valued by nearly all”. Such overwhelming popularity of LS may be due to its simplicity, its optimality properties and its robustness to distributional assumptions. Moreover, it leads to the best (minimum variance) estimator under normality. Laplace called it the “most advantageous method”. However, it appears to us that this irresistible popularity of LS may have impeded the exploration of other smooth loss functions. Comparative computational difficulty might be another reason why such exploration was not favoured by pioneers such as Gauss and Laplace. By contrast, a large literature has developed that incorporates non-smooth loss functions in order to address (outlier) robustness. Unfortunately and surprisingly, the statistical literature is largely silent on the possible use of smooth higher order loss functions.

Therefore, it is pertinent to ask: are there any relative advantages in using a higher order smooth loss function compared to the omnipresent least squares? Our aim here is to study an appropriate higher order estimator and compare its efficiency against LS. In the regression set up, we find a large and useful class of error distributions for which a higher order loss function is more efficient than LS. In this paper, we give an empirically testable condition under which a higher order smooth loss function leads to a more efficient estimator than LS. We also provide a simple but pragmatic decision rule to choose between the $L_4$ and $L_2$ estimators. A detailed simulation study shows the effectiveness of such a decision rule.

In Section 2, we describe the model and develop the methodology needed to compare the efficiency of different loss functions. Section 3 presents various classes of error distributions that are used to compare the estimators. In Section 4, we provide a decision rule along with its asymptotic properties. Section 5 is an epilogue in which we consider very general classes of parametric distributions on finite support to illustrate the enormous scope of applicability of $L_4$-based loss functions. Section 6 summarizes the results of a simulation study on mixture distributions. Section 7 gives an application to real-life data. Section 8 ends with some concluding remarks and identifies possible directions for future research.

## 2 Model and assumptions

Consider a linear regression set up

$$Y = X\beta + \varepsilon, \tag{2.1}$$

where $Y$ is an $n \times 1$ vector of observations, $X$ is the design matrix, and $F$ is the cumulative distribution function (cdf) of the errors $\varepsilon$.

We also assume the following regularity conditions:

1. , a finite and nonsingular matrix.

2. .

3. .

4. Observations are independent.

In this set up, the ordinary least squares (OLS) estimator is the best linear unbiased estimator in the sense of minimum variance. It is well known that, under these conditions, the OLS estimator satisfies

$$\sqrt{n}\,(\hat{\beta}_{OLS} - \beta_0) \xrightarrow{d} N\left(0, \sigma^2 (X'X)^{-1}\right).$$

Furthermore, it is clear that the minimization of $\sum_i \varepsilon_i^{r}$ is pointless if $r$ is odd; and the minimization of $\sum_i |\varepsilon_i|$ or $\sum_i |\varepsilon_i|^{r}$ for odd $r$ is not very convenient because of the lack of differentiability. Therefore, it remains to check whether minimization of $\sum_i \varepsilon_i^{2k}$ for some positive integer $k$ other than 1 can yield a better result than $L_2$, at least in some cases. If so, our objective is to identify those cases. The corresponding estimators will obviously be non-linear. Furthermore, when the error is normal, the best linear unbiased estimator is indeed the best unbiased estimator. Thus, for normal or near-normal errors, $L_2$ will always be better than any other estimator. A closer look reveals that deviation from unimodality causes the robustness properties of LS to falter.

Studying the efficacy of higher order norm based estimators is important in its own right, not only in comparison with least squares. It opens up the possibility of considering a convex combination of loss functions of various degrees. Arthanari and Dodge (1981) consider a convex combination of the $L_1$ and $L_2$ norms and study its properties. A convex combination of loss functions of several orders may lead to a more useful estimator, and the resulting estimator is expected to be robust to distributional assumptions. Such use of higher order loss functions has been attempted earlier. Turner (1960) heuristically touches upon the possible use of a higher order loss function in the context of estimating a location parameter. He discusses several kinds of general PDFs and advises, in the case of the double exponential, minimizing the sum of the absolute deviations; in the case of the normal, minimizing the sum of the squared deviations (least squares); and in the case of the q-th power distribution, minimizing the sum of the q-th powers of the deviations (least q-th's). Attempts have also been made to define a general class of likelihoods in order to derive parameter estimates that are robust to distributional assumptions. For example, Zeckhauser and Thompson (1970) define a general class of distributions and find it empirically suitable.

### 2.1 Methodology

For the purpose of exposition, let us first consider the simple bivariate linear regression model

$$Y_i = \alpha + \beta x_i + \varepsilon_i,$$

for $i = 1, \ldots, n$. The usual approach is to take the error function as $S_2 = \sum_{i=1}^{n} (Y_i - \alpha - \beta x_i)^2$. We obtain $\hat{\theta}_{OLS}$ by minimizing $S_2$ with respect to $\alpha$ and $\beta$. Note that $\hat{\theta}_{OLS}$ is the best estimator only within the class of linear unbiased estimators. Hence there may be some nonlinear estimator with better efficiency.

In contrast, we shall take $S_4 = \sum_{i=1}^{n} (Y_i - \alpha - \beta x_i)^4$ as our loss function, and derive $\hat{\theta}_{4}$ as the estimator of $\theta = (\alpha, \beta)'$. Our objective is to compare $\hat{\theta}_{OLS}$ and $\hat{\theta}_{4}$, and to discover conditions under which the latter performs better than the former. Both of these estimators are M-estimators, so they possess properties such as consistency and asymptotic normality under some standard conditions.
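Unlike the OLS estimator, the minimizer of the quartic loss has no closed form, so it has to be computed numerically. The following is a minimal sketch of one way to do this for the simple regression model, assuming a damped Newton iteration started from the OLS fit. The function names (`fit_ols`, `l4_loss`, `l4_gradient`, `fit_l4`) and the example data are illustrative choices, not part of the paper.

```python
def fit_ols(x, y):
    """Closed-form least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def l4_loss(x, y, a, b):
    """Quartic loss: sum of fourth powers of the residuals."""
    return sum((yi - a - b * xi) ** 4 for xi, yi in zip(x, y))


def l4_gradient(x, y, a, b):
    """Gradient of the quartic loss with respect to (a, b)."""
    cubes = [(yi - a - b * xi) ** 3 for xi, yi in zip(x, y)]
    return -4 * sum(cubes), -4 * sum(c * xi for c, xi in zip(cubes, x))


def fit_l4(x, y, tol=1e-10, max_iter=200):
    """Minimize the quartic loss by Newton steps with step halving (a sketch)."""
    a, b = fit_ols(x, y)  # OLS is used only as a starting point
    for _ in range(max_iter):
        g_a, g_b = l4_gradient(x, y, a, b)
        if abs(g_a) + abs(g_b) < tol:
            break
        squares = [(yi - a - b * xi) ** 2 for xi, yi in zip(x, y)]
        # Hessian of the loss: 12 * sum r_i^2 * [[1, x_i], [x_i, x_i^2]]
        h_aa = 12 * sum(squares)
        h_ab = 12 * sum(s * xi for s, xi in zip(squares, x))
        h_bb = 12 * sum(s * xi * xi for s, xi in zip(squares, x))
        det = h_aa * h_bb - h_ab * h_ab
        if det <= 0:
            break  # degenerate Hessian (e.g. an exact fit); stop here
        d_a = (h_bb * g_a - h_ab * g_b) / det
        d_b = (h_aa * g_b - h_ab * g_a) / det
        step, current = 1.0, l4_loss(x, y, a, b)
        # Halve the step until the loss does not increase
        while step > 1e-8 and l4_loss(x, y, a - step * d_a, b - step * d_b) > current:
            step /= 2
        a, b = a - step * d_a, b - step * d_b
    return a, b


# Illustrative data (made up for this sketch)
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 2.7, 5.4, 6.6, 9.5, 10.8]
print("OLS:", fit_ols(x, y))
print("L4 :", fit_l4(x, y))
```

Because the quartic loss is convex in $(\alpha, \beta)$, any stationary point found this way is a global minimizer; a general-purpose optimizer could be substituted for the hand-written Newton step.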

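For the OLS estimator,

$$\sqrt{n}\,(\hat{\theta}_{OLS} - \theta_0) \xrightarrow{d} N\left(0, \sigma^2 S^{-1}\right), \tag{2.2}$$

where

$$S = \begin{pmatrix} n & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \end{pmatrix}.$$

We show that $\hat{\theta}_{4}$ satisfies the following result.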

Lemma 1.

$$\sqrt{n}\,(\hat{\theta}_{4} - \theta_0) \xrightarrow{d} N\left(0, \frac{\mu_6 - \mu_3^2}{9\mu_2^2}\, S^{-1}\right). \tag{2.3}$$

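Proof: For the $L_4$ estimator, using the M-estimator property, we have

$$\sqrt{n}\,(\hat{\theta}_{4} - \theta_0) \xrightarrow{d} N\left(0, V(\theta_0)\right),$$

where

$$V(\theta_0) = A(\theta_0)^{-1} B(\theta_0) \left[A(\theta_0)^{-1}\right]',$$

$$B(\theta_0) = E[\psi(y,\theta_0)\psi(y,\theta_0)'] - E[\psi(y,\theta_0)]\,E[\psi(y,\theta_0)]', \qquad A(\theta_0) = E\left(\frac{\partial}{\partial\theta}\psi(y,\theta_0)\right),$$

and

$$\psi = \frac{\partial S_4}{\partial\theta} = (-4)\begin{pmatrix} \sum (Y_i - \alpha - \beta x_i)^3 \\ \sum (Y_i - \alpha - \beta x_i)^3 x_i \end{pmatrix}.$$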

Let $\mu_r$ denote the $r$th order central moment of the distribution of $\varepsilon$.

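Consequently, we get

$$E[\psi(y,\theta_0)] = (-4)\begin{pmatrix} n\mu_3 \\ \mu_3 \sum x_i \end{pmatrix}, \quad\text{and hence}\quad E[\psi(y,\theta_0)]\,E[\psi(y,\theta_0)]' = 16\mu_3^2 R,$$

where

$$R = \begin{pmatrix} n^2 & n\sum x_i \\ n\sum x_i & \left(\sum x_i\right)^2 \end{pmatrix}.$$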

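Also, $E[\psi(y,\theta_0)\psi(y,\theta_0)'] = 16\left(\mu_6 S + \mu_3^2 Q\right)$, where

$$Q = \begin{pmatrix} n(n-1) & (n-1)\sum x_i \\ (n-1)\sum x_i & \sum_{i\neq j} x_i x_j \end{pmatrix}.$$

Thus, $B(\theta_0) = 16\left(\mu_6 S + \mu_3^2 (Q - R)\right)$. Since

$$Q - R = \begin{pmatrix} n(n-1) - n^2 & (n-1)\sum x_i - n\sum x_i \\ (n-1)\sum x_i - n\sum x_i & \left(\sum x_i\right)^2 - \sum x_i^2 - \left(\sum x_i\right)^2 \end{pmatrix} = \begin{pmatrix} -n & -\sum x_i \\ -\sum x_i & -\sum x_i^2 \end{pmatrix} = -S,$$

we obtain

$$B(\theta_0) = 16 S \left(\mu_6 - \mu_3^2\right) = 16 S\, \mathrm{Var}(\varepsilon^3).$$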

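Similarly, one can simplify

$$A(\theta_0) = 12\mu_2 S.$$

Then

$$V(\theta_0) = A(\theta_0)^{-1} B(\theta_0) \left[A(\theta_0)^{-1}\right]' = \frac{1}{12\mu_2} S^{-1} \left(16 S (\mu_6 - \mu_3^2)\right) \frac{1}{12\mu_2} S^{-1} = \frac{\mu_6 - \mu_3^2}{9\mu_2^2}\, S^{-1}.$$

Hence the proof.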

Theorem 1. The $L_4$ estimator performs better than the OLS estimator in terms of precision iff

$$\frac{\mu_6 - \mu_3^2}{\mu_2^3} < 9. \tag{2.4}$$

Proof. The proof follows by comparing (2.2) and (2.3), using $\sigma^2 = \mu_2$.

For a symmetric distribution of $\varepsilon$, or whenever $\mu_3 = 0$, the criterion becomes

$$\frac{\mu_6}{9\mu_2^3} < 1. \tag{2.5}$$

Clearly this condition may or may not be satisfied, depending on the distribution of $\varepsilon$.
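The comparison behind Theorem 1 can be written out in one line. Both asymptotic covariance matrices are scalar multiples of the same matrix $S^{-1}$, so it suffices to compare the scalars. Using $\sigma^2 = \mu_2$ for the OLS estimator and the factor $(\mu_6 - \mu_3^2)/(9\mu_2^2)$ obtained at the end of the proof of Lemma 1:

$$\frac{\mu_6 - \mu_3^2}{9\mu_2^2} < \mu_2 \iff \frac{\mu_6 - \mu_3^2}{9\mu_2^3} < 1 \iff \frac{\mu_6 - \mu_3^2}{\mu_2^3} < 9.$$

Setting $\mu_3 = 0$ in the middle expression gives (2.5).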

So far, for the purpose of exposition, we have dealt with a simple regression framework. The scope of this paper requires these findings to be presented in a more general regression set-up. The following remark is made to this end.

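Remark 1.

For a multiple linear regression model with $k$ regressors, all the above calculations can be carried out with $S = X'X$, where

$$X = \begin{pmatrix} 1 & x_{11} & x_{21} & \cdots & x_{k1} \\ 1 & x_{12} & x_{22} & \cdots & x_{k2} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{kn} \end{pmatrix}.$$

The matrix $Q$ is given by

$$Q = \begin{pmatrix} n(n-1) & (n-1)\sum_{i} x_{1i} & (n-1)\sum_{i} x_{2i} & \cdots & (n-1)\sum_{i} x_{ki} \\ (n-1)\sum_{i} x_{1i} & \sum_{i\neq j} x_{1i}x_{1j} & \sum_{i\neq j} x_{1i}x_{2j} & \cdots & \sum_{i\neq j} x_{1i}x_{kj} \\ (n-1)\sum_{i} x_{2i} & \sum_{i\neq j} x_{2i}x_{1j} & \sum_{i\neq j} x_{2i}x_{2j} & \cdots & \sum_{i\neq j} x_{2i}x_{kj} \\ \vdots & \vdots & \vdots & & \vdots \\ (n-1)\sum_{i} x_{ki} & \sum_{i\neq j} x_{ki}x_{1j} & \sum_{i\neq j} x_{ki}x_{2j} & \cdots & \sum_{i\neq j} x_{ki}x_{kj} \end{pmatrix},$$

and the matrix $R$ is given by

$$R = \begin{pmatrix} n^2 & n\sum_{i} x_{1i} & n\sum_{i} x_{2i} & \cdots & n\sum_{i} x_{ki} \\ n\sum_{i} x_{1i} & \left(\sum_{i} x_{1i}\right)^2 & \left(\sum_{i} x_{1i}\right)\left(\sum_{i} x_{2i}\right) & \cdots & \left(\sum_{i} x_{1i}\right)\left(\sum_{i} x_{ki}\right) \\ n\sum_{i} x_{2i} & \left(\sum_{i} x_{2i}\right)\left(\sum_{i} x_{1i}\right) & \left(\sum_{i} x_{2i}\right)^2 & \cdots & \left(\sum_{i} x_{2i}\right)\left(\sum_{i} x_{ki}\right) \\ \vdots & \vdots & \vdots & & \vdots \\ n\sum_{i} x_{ki} & \left(\sum_{i} x_{ki}\right)\left(\sum_{i} x_{1i}\right) & \left(\sum_{i} x_{ki}\right)\left(\sum_{i} x_{2i}\right) & \cdots & \left(\sum_{i} x_{ki}\right)^2 \end{pmatrix},$$

where all sums over $i$ run from 1 to $n$. Thus, $Q - R = -S$, which gives the earlier result; that is, the $L_4$ estimators are better than the OLS estimators iff

$$\frac{\mu_6 - \mu_3^2}{9\mu_2^3} < 1.$$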

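It may be interesting to examine the performance of $L_{2k}$ relative to that of $L_2$, or to that of $L_{2k-2}$. The following two corollaries are presented to this end.

Corollary 1. The $L_{2k}$ estimator is better than LS (i.e., $L_2$) iff

$$\frac{\mathrm{Var}(\varepsilon^{2k-1})}{(2k-1)^2 \left(\mathrm{Var}(\varepsilon)\right)^{2k-1}} < 1.$$

Proof. The proof is analogous to that of Theorem 1.

Corollary 2. The $L_{2k}$ estimator is better than $L_{2k-2}$ iff

$$\frac{\mathrm{Var}(\varepsilon^{2k-1})\,(2k-3)^2}{(2k-1)^2 \left(\mathrm{Var}(\varepsilon^{2k-3})\right) \mu_2^2} < 1.$$

Proof. The proof is analogous to that of Theorem 1.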

## 3 OLS versus L4 for some selected distributions

In this section, we consider a few important parametric error distributions to illustrate the vast scope of applicability of the $L_4$-based loss function. The list of distributions considered is in no way exhaustive, but it certainly shows the immense opportunity for applications in diverse areas. We now check whether the aforementioned condition (2.4) holds for different distributions of $\varepsilon$.

#### 3.0.1 U-Shaped Distribution

Consider a simple U-shaped distribution:

$$f(x) = d\,x^{2k}, \quad -c \le x \le c, \quad \text{where } k \text{ is a positive integer}.$$

Note that $d = (2k+1)/(2c^{2k+1})$. It is easy to calculate

$$\frac{\mu_6}{9\mu_2^3} = \frac{(2k+3)^3}{9(2k+1)^2(2k+7)}.$$

Note that this ratio is less than 1 for every positive integer $k$ and, in the limit as $k \to \infty$, tends to $1/9$.

A U-shaped distribution has two modes and can be looked upon as a mixture of two J-shaped distributions, one extremely positively skewed and the other extremely negatively skewed. One popular applied example of a U-shaped distribution is the number of deaths at various ages. Several more examples can be found in Everitt (2005).
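The closed form above can be checked with exact rational arithmetic. The sketch below assumes $c = 1$ (the ratio is scale invariant, so this loses nothing) and uses the even central moments $\mu_r = (2k+1)/(2k+r+1)$, which follow from integrating $x^{r}\,d\,x^{2k}$ over $[-1, 1]$. The function names are illustrative.

```python
from fractions import Fraction


def u_shape_moment(r, k):
    """Even central moment mu_r of f(x) = d * x**(2k) on [-1, 1]."""
    return Fraction(2 * k + 1, 2 * k + r + 1)


def u_shape_ratio(k):
    """mu_6 / (9 * mu_2**3), the left-hand side of condition (2.5)."""
    return u_shape_moment(6, k) / (9 * u_shape_moment(2, k) ** 3)


def closed_form(k):
    """The expression stated in the text, for comparison."""
    return Fraction((2 * k + 3) ** 3, 9 * (2 * k + 1) ** 2 * (2 * k + 7))


for k in (1, 2, 5, 50):
    print(k, u_shape_ratio(k), float(u_shape_ratio(k)))
```

The printed values decrease towards $1/9$ as $k$ grows, in line with the limit stated above.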

#### 3.0.2 Uniform(−a,a)

This is a symmetric distribution. Here

$$\mu_r = \frac{a^{r+1} - (-a)^{r+1}}{(r+1)\{a - (-a)\}},$$

and consequently

$$\frac{\mu_6}{9\mu_2^3} = \frac{3}{7} < 1.$$

Hence, the $L_4$ estimator is better than the OLS estimator when the error component has a uniform distribution.

#### 3.0.3 Normal(μ,σ2)

This is again a symmetric distribution, where

$$\mu_{2r} = \sigma^{2r}\,(2r-1)\times(2r-3)\times\cdots\times 5\times 3\times 1,$$

and hence

$$\frac{\mu_6}{9\mu_2^3} = \frac{15}{9} > 1.$$

Hence, for normally distributed errors, the OLS estimator is always preferred over the $L_4$ estimator.

#### 3.0.4 Laplace(λ)

Here we have

$$\mu_r = \lambda^r\,\Gamma(r+1),$$

if $r$ is even, and, consequently,

$$\frac{\mu_6}{9\mu_2^3} = \frac{6!}{72} > 1.$$

Hence, when $\varepsilon$ follows a Laplace distribution, the OLS estimator is preferred over the $L_4$ estimator.
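The three ratios obtained so far for the uniform, normal and Laplace errors can be reproduced side by side from the moment formulas quoted in the text. This is a small sketch with illustrative helper names; it fixes $a = \sigma = \lambda = 1$ by default, which is harmless because the ratio does not depend on scale.

```python
from fractions import Fraction
from math import factorial


def efficiency_ratio(mu2, mu6, mu3=0):
    """(mu_6 - mu_3^2) / (9 mu_2^3); values below 1 favour L4 over OLS."""
    return (mu6 - mu3 ** 2) / (9 * mu2 ** 3)


def uniform_moment(r, a=1):
    """Even central moment of Uniform(-a, a): a^r / (r + 1)."""
    return Fraction(a ** r, r + 1)


def normal_moment(r, sigma2=1):
    """Even central moment of a normal: sigma^r * (r-1)(r-3)...3*1."""
    value = Fraction(sigma2) ** (r // 2)
    for odd in range(1, r, 2):
        value *= odd
    return value


def laplace_moment(r, lam=1):
    """Even central moment of the Laplace distribution: lambda^r * r!."""
    return Fraction(lam) ** r * factorial(r)


for name, moment in [("uniform", uniform_moment),
                     ("normal", normal_moment),
                     ("laplace", laplace_moment)]:
    ratio = efficiency_ratio(moment(2), moment(6))
    print(name, ratio, "L4 preferred" if ratio < 1 else "OLS preferred")
```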

#### 3.0.5 Beta(a,b)

The beta distribution is a family of continuous probability distributions defined on the interval [0, 1] and parametrized by two positive shape parameters, denoted by $a$ and $b$, that appear as exponents of the random variable and control the shape of the distribution. This class of distributions includes a variety of symmetric, bell-shaped, positively skewed, negatively skewed, uniform, and 'U-shaped' distributions. The general form of the central moments of the beta distribution is quite complicated. So we start with the raw moments and obtain the forms of $\mu_2$, $\mu_3$ and $\mu_6$. Note that, here

$$\mu'_r = \prod_{i=0}^{r-1} \frac{a+i}{a+b+i}.$$

Figure 4, shown in Section 5, displays a wide range of parameters for which $L_4$ is better than $L_2$.

#### 3.0.6 Gaussian mixture distribution

Suppose that $\varepsilon \sim \tfrac{1}{2} N(\xi_1, \sigma^2) + \tfrac{1}{2} N(\xi_2, \sigma^2)$. We assume, for simplicity, a common $\sigma$ for both components. Here

$$\mu_r = \frac{1}{2}\sum_{i=0}^{r}\binom{r}{i}(\xi_1 - \xi)^{r-i}\mu_{1i} + \frac{1}{2}\sum_{i=0}^{r}\binom{r}{i}(\xi_2 - \xi)^{r-i}\mu_{2i},$$

where $\xi = (\xi_1 + \xi_2)/2$ and $\mu_{ji}$ is the $i$th central moment of the $N(\xi_j, \sigma^2)$ distribution, $j = 1, 2$. Then (2.5) reduces to

$$6 + 18c^2 - 12c^4 - 8c^6 < 0, \tag{3.6}$$

where $c = |\xi_1 - \xi_2|/(2\sigma)$. The left hand side of (3.6), which is a function of $c$, is plotted in Figure 1. From the plot we can see that for $c$ larger than about 1.06 this function assumes values less than zero, and then it decreases rapidly. This means that condition (2.5) for the superiority of the $L_4$ estimators is satisfied if the means of the two components of the mixture distribution are more than about $2.1\sigma$ apart. A mixture of more than two Gaussian distributions will behave similarly with respect to this condition. The case of unequal $\sigma$ can be tackled similarly.
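The boundary of condition (3.6) can be located numerically. The sketch below takes $\sigma = 1$ and writes the central moments of the equal-weight mixture directly in terms of the standardized half-separation $c$, namely $\mu_2 = 1 + c^2$ and $\mu_6 = 15 + 45c^2 + 15c^4 + c^6$; these expressions are derived here for the sketch and are not quoted from the paper. A simple bisection then finds where the left hand side of (3.6) changes sign.

```python
def mixture_margin(c):
    """Left-hand side of (3.6); negative values mean condition (2.5) holds."""
    return 6 + 18 * c ** 2 - 12 * c ** 4 - 8 * c ** 6


def mixture_ratio(c):
    """mu_6 / (9 mu_2^3) for the two-component mixture with sigma = 1."""
    mu2 = 1 + c ** 2
    mu6 = 15 + 45 * c ** 2 + 15 * c ** 4 + c ** 6
    return mu6 / (9 * mu2 ** 3)


def critical_separation(lo=0.0, hi=3.0, tol=1e-12):
    """Bisection for the positive root of mixture_margin on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mixture_margin(mid) > 0:
            lo = mid  # still on the OLS side of the boundary
        else:
            hi = mid
    return (lo + hi) / 2


c_star = critical_separation()
print("critical c:", c_star)
print("critical distance between means, in units of sigma:", 2 * c_star)
```

At $c = 0$ the mixture collapses to a single normal distribution and the ratio returns to $15/9$, which provides a quick sanity check on the moment expressions.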

#### 3.0.7 Truncated Normal distribution

Consider (for simplicity) the standard normal distribution truncated on both sides. The even order moments are

$$\mu^{c}_{2k} = 2d\int_0^c x^{2k}\,\frac{\exp\left(-\frac{x^2}{2}\right)}{\sqrt{2\pi}}\,dx.$$

Define Therefore,

$$\mu^{c}_{2k} = c^{2k-1}\exp\left(-\frac{c^2}{2}\right)\Delta + (2k-1)\,\mu^{c}_{2k-2}.$$

Now it is easy to calculate and
and
Now one can see that as long as performs better than One implication of this result is that 97 percent times performs better than

#### 3.0.8 Raised cosine distribution

If $\varepsilon$ follows a raised cosine distribution with parameters $a$ and $b$, then the probability density function (pdf) is given by

$$f(x) = \frac{1}{2b}\left[1 + \cos\left(\pi\,\frac{x-a}{b}\right)\right], \quad a-b \le x \le a+b,\ a \in \mathbb{R},\ b > 0.$$

The form of this distribution resembles that of a normal distribution, except that it has finite tails. Suppose the value of the systematic errors can be assumed to lie in some known interval, and the manufacturer has aimed to make the device as accurate as possible. In such circumstances, the raised cosine distribution may be appropriate. Another popular application is in circular data; see Rinne (2010, p. 116). Other properties such as the cdf, moment generating function (mgf), characteristic function, raw moments up to order 4, and the kurtosis are available in Rinne (2010, pp. 116-118). The distribution has a kurtosis of 2.1938, less than that of the normal distribution; it has a thin tail. Here, using the mgf, we have

$$\mu_6 = \frac{b^6\left(\pi^6 - 42\pi^4 + 840\pi^2 - 5040\right)}{7\pi^6}, \qquad \mu_2 = \frac{b^2\left(\pi^2 - 6\right)}{3\pi^2},$$

and hence

$$\frac{\mu_6}{9\mu_2^3} = \frac{3}{7} - \frac{72\pi^4 - 2196\pi^2 + 14472}{7\left(\pi^2 - 6\right)^3} = 0.8926 < 1.$$

Hence, for this distribution, (2.5) is satisfied, and consequently $L_4$ is preferred for parameter estimation.
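The numerical value of the ratio can be reproduced directly from the two moment expressions above. The sketch below evaluates the ratio both from $\mu_2$ and $\mu_6$ and from the simplified closed form, and also illustrates that the scale parameter $b$ cancels. The function names are illustrative.

```python
from math import pi


def raised_cosine_ratio(b=1.0):
    """mu_6 / (9 mu_2^3) computed from the moment formulas in the text."""
    mu2 = b ** 2 * (pi ** 2 - 6) / (3 * pi ** 2)
    mu6 = b ** 6 * (pi ** 6 - 42 * pi ** 4 + 840 * pi ** 2 - 5040) / (7 * pi ** 6)
    return mu6 / (9 * mu2 ** 3)


def raised_cosine_ratio_closed_form():
    """The simplified expression for the same ratio."""
    return 3 / 7 - (72 * pi ** 4 - 2196 * pi ** 2 + 14472) / (7 * (pi ** 2 - 6) ** 3)


print(round(raised_cosine_ratio(), 4))
print(round(raised_cosine_ratio_closed_form(), 4))
print(round(raised_cosine_ratio(b=2.5), 4))  # the scale b cancels out
```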

### 3.1 A Sub-Gaussian family of distributions

The sub-Gaussian family is a well-studied family of distributions whose tails are dominated by those of the normal distribution. Since we observed that the $L_4$ estimators are preferred for distributions whose tails are thinner than those of a normal distribution, we discuss here a relatively uncommon distribution and the validity of the condition for it. Consider a distribution with pdf of the form

$$f(x) = c\exp\left(-x^{2k}\right),$$

where $k$ is an integer and $c$ is the normalizing constant, which gives

$$c = \frac{k}{\Gamma\left(\frac{1}{2k}\right)}.$$

For various values of $k$, the pdf of the distribution is drawn in Figure 3. Note that $k = 1$ provides the normal curve. As $k$ becomes larger and larger, the tails of the distribution tend to collapse. For extremely large $k$, the distribution resembles a symmetric curve on a finite support. It is interesting to consider the peaks of the drawn curves. The first plot (the density plot) of the panel shows that $L_4$ performs better than $L_2$ for all those curves whose peaks are below the red curve; the red curve is drawn for $k = 1.45$. In the second plot of the panel, the values of $k$ are given on the $x$ axis and the values of the test statistic on the $y$ axis. The line parallel to the $x$ axis shows the cut-off point, which is 1. The second plot shows that when $k$ is greater than 1.45, $L_4$ performs better than $L_2$.

Figure 3: The shape of the various sub-Gaussian distributions ($f(x) = c\exp(-x^{2k})$) for a range of values of $k$ where $L_4$ is better than $L_2$. The red curve is for $k = 1.45$.

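Here $\mu_r = 0$ if $r$ is odd. When $r$ is even, we have

$$\mu_r = c\int_{-\infty}^{\infty} x^{r}\exp\left(-x^{2k}\right)dx = \frac{c}{k}\,\Gamma\left(\frac{r+1}{2k}\right).$$

We immediately get

$$\frac{\mu_6}{9\mu_2^3} = \frac{\Gamma\left(\frac{7}{2k}\right)\Gamma\left(\frac{1}{2k}\right)^2}{9\,\Gamma\left(\frac{3}{2k}\right)^3}.$$

The values of the test statistic against various values of $k$ are drawn in the bottom part of Figure 3. We observe that for $k$ greater than about 1.45, the ratio assumes values less than 1.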
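The ratio for this family can be evaluated with the gamma function from the standard library. The sketch below assumes the density is read as $c\exp(-|x|^{2k})$ so that non-integer values of $k$, such as the boundary value shown in Figure 3, make sense. It uses $\mu_r = \Gamma\left(\frac{r+1}{2k}\right)/\Gamma\left(\frac{1}{2k}\right)$, which is the moment formula with the normalizing constant substituted in. The function names and the bisection bracket $[1, 2]$ are illustrative choices.

```python
from math import gamma


def power_exp_moment(r, k):
    """Even central moment of f(x) = c * exp(-|x|**(2k))."""
    return gamma((r + 1) / (2 * k)) / gamma(1 / (2 * k))


def power_exp_ratio(k):
    """mu_6 / (9 mu_2^3); values below 1 favour L4 over OLS."""
    return power_exp_moment(6, k) / (9 * power_exp_moment(2, k) ** 3)


def boundary_k(lo=1.0, hi=2.0, tol=1e-10):
    """Bisection for the k at which the ratio crosses 1 (bracketed in [lo, hi])."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if power_exp_ratio(mid) > 1:
            lo = mid  # OLS still preferred at this k
        else:
            hi = mid
    return (lo + hi) / 2


for k in (1.0, 1.25, 1.5, 2.0, 5.0):
    print(k, round(power_exp_ratio(k), 4))
print("boundary k:", round(boundary_k(), 3))
```

For $k = 1$ the ratio equals the normal value $15/9$, and the boundary found by the bisection should land near the value read off Figure 3.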

Now, according to Zeckhauser and Thompson (1970), Turner (1960) and Box and Tiao (1962), for a distribution with pdf

$$f(u) = k(\sigma, m)\exp\left(-\left|\frac{u}{\sigma}\right|^{m}\right), \quad \sigma > 0,\ m > 0,$$

where $k(\sigma, m)$ is the normalizing constant, $L_m$ estimators dominate all $L_p$ estimators with $p \neq m$.

Note that, for $\sigma = 1$ and $m = 2k$, we obtain the distribution displayed in Figure 3. So the $L_{2k}$ estimator performs better than the corresponding $L_2$ estimator, that is, the OLS estimator, which is entirely in agreement with what we derived above. The normal distribution corresponds to $m = 2$; $m = 1$ gives the double exponential distribution; and as $m$ tends to $\infty$, the distribution tends to the rectangular. Zeckhauser and Thompson (1970) consider four empirical examples and find that there are sizable gains in likelihood if $m$ is estimated rather than pre-specified to equal 2. All of the evidence they found leads them to conclude that, if accurate estimation of a linear regression line is important, it will usually be desirable to estimate not only the coefficients of the regression line but also the parameters of the power distribution that generated the errors about the regression line. The effect on the estimates of the regression coefficients may not be small.

In the next two sections we construct a decision rule based on condition (2.4) and carry out a simulation study.

## 4 Decision rule: OLS versus L4

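In this section we derive a decision rule, based on the criterion from Section 2, to decide whether the OLS or the $L_4$ estimator is preferred for a given data set.

Lemma 2. Suppose $x_1, \ldots, x_n$ are drawn from a distribution whose central moments $\mu_r$ exist up to the order required. Then

$$\sqrt{n}\,\hat{\mu}_r = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}(x_i - \bar{x})^r = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}(x_i - \mu)^r - r\mu_{r-1}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(x_i - \mu) + o_p(1).$$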

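Proof: Observe that

$$\begin{aligned} \sqrt{n}\,\hat{\mu}_r &= \frac{1}{\sqrt{n}}\sum_{i=1}^{n}(x_i - \bar{x})^r = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}(x_i - \mu + \mu - \bar{x})^r \\ &= \frac{1}{\sqrt{n}}\sum_{i=1}^{n}(x_i - \mu)^r + r\,\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^{r-1}\sqrt{n}(\mu - \bar{x}) + \binom{r}{2}\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^{r-2}\sqrt{n}(\mu - \bar{x})^2 + \cdots. \end{aligned} \tag{4.7}$$

Now all the terms other than the first and second terms of (4.7) are of the order $o_p(1)$, because $\sqrt{n}(\mu - \bar{x})$ is $O_p(1)$, and hence $\sqrt{n}(\mu - \bar{x})^j$, $j \geq 2$, is $o_p(1)$. Also, $\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^{r-1}$ converges in probability to $\mu_{r-1}$. Therefore

$$\sqrt{n}\,\hat{\mu}_r = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}(x_i - \bar{x})^r = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}(x_i - \mu)^r - r\mu_{r-1}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(x_i - \mu) + o_p(1).$$

Furthermore, by the delta method,

$$\sqrt{n}\left((\hat{\sigma}^2)^{\frac{r}{2}} - (\sigma^2)^{\frac{r}{2}}\right) = \frac{r}{2}(\sigma^2)^{\frac{r}{2}-1}\sqrt{n}\left(\hat{\sigma}^2 - \sigma^2\right) + o_p(1).$$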

Then, we have the following Theorem.

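Theorem 2. Let $v = (\mu_6 - \mu_3^2)/\sigma^6$ and let $\hat{v}$ be its sample analogue. Suppose the moments required below exist for the distribution of $\varepsilon$. Then

$$\sqrt{n}\,(\hat{v} - v) = \frac{\alpha_0}{\hat{\sigma}^6}\,\frac{1}{\sqrt{n}}\sum_{i=1}^{n} Z_i + o_p(1),$$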

where and

Proof of this theorem is given in the Appendix B.
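In practice the statistic has to be computed from residuals. The following is a simplified plug-in sketch of such a computation: it estimates $v = (\mu_6 - \mu_3^2)/\sigma^6$ from sample central moments and compares it with the cut-off 9 from condition (2.4). It deliberately ignores the sampling variability described by Theorem 2, so it should be read as an illustration of the statistic rather than as the decision rule of this section. The function names and the two toy samples are assumptions made for the sketch.

```python
def central_moment(values, order):
    """Sample central moment of the given order (divisor n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def l4_statistic(residuals):
    """Plug-in estimate of v = (mu_6 - mu_3^2) / sigma^6."""
    m2 = central_moment(residuals, 2)
    m3 = central_moment(residuals, 3)
    m6 = central_moment(residuals, 6)
    return (m6 - m3 ** 2) / m2 ** 3


def prefer_l4(residuals, cutoff=9.0):
    """True when the point estimate falls on the L4 side of condition (2.4)."""
    return l4_statistic(residuals) < cutoff


two_point = [-1.0, 1.0, -1.0, 1.0]              # strongly bimodal residuals
heavy_tail = [-10.0] + [0.0] * 8 + [10.0]       # a few large residuals
print(l4_statistic(two_point), prefer_l4(two_point))
print(l4_statistic(heavy_tail), prefer_l4(heavy_tail))
```

A fuller implementation would attach a standard error to the estimate, using the expansion in Theorem 2, before deciding between the two estimators.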

## 5 An Epilogue

In this section we consider a very general class of parametric distributions on finite support (this assumption is made to ease the drawing of plots) to illustrate the enormous scope of applicability of the $L_4$-based loss function. Consider the class of distributions

$$f(x) = d\,(1 + x^2)^a, \quad a \in \mathbb{R},\ d > 0,\ |x| \le 1.$$

Here $d$ depends on $a$ so as to make $f$ a density. The first plot of the panel depicts the shape of the density for different values of $a$. Depending on the value of $a$, this class of distributions includes various 'U-shaped' (for $a > 0$) and 'bell-shaped' (for $a < 0$) distributions. It is interesting to consider the peaks of the drawn curves. The first plot (the density plot) of the panel shows that $L_4$ performs better than $L_2$ for all those curves whose peaks are below the red curve, which marks the boundary case. In the second plot of the panel, the values of $a$ are given on the $x$ axis and the values of the test statistic on the $y$ axis. The second plot shows that when $a$ is greater than -3.2, $L_4$ performs better than $L_2$, across all shapes from a deep U to a hump, up to a certain level of peakedness.

Figure 4: The shape of the distributions ($f = d(1 + x^2)^a$, $a \in \mathbb{R}$, $d > 0$, $|x| \le 1$) for a range of values of $a$ where $L_4$ is better than $L_2$.

Plots of another parametric family of distributions (belonging to the Pearsonian family, Type II), given by

$$f(x) = d\,(1 - x^2)^a, \quad a > -1,\ d > 0 \text{ a function of } a,\ |x| < 1,$$

are shown below. It may be noted that this particular distribution is linked to the beta distribution as well. To see this, consider the beta density on an interval $(a, b)$ with shape parameters $\alpha_1$ and $\alpha_2$:

$$f(x) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)}\,\frac{(x - a)^{\alpha_1 - 1}(b - x)^{\alpha_2 - 1}}{(b - a)^{\alpha_1 + \alpha_2 - 1}}.$$

To see the equivalence, take the interval to be $(-1, 1)$ and set both shape parameters equal to the Pearson exponent plus one.

Figure 5: The shape of the distributions ($f = d(1 - x^2)^a$, $a > -1$, $d > 0$, $|x| \le 1$) for a range of values of $a$ where $L_4$ is better than $L_2$.

Figure 5 depicts the same feature as Figure 4. Depending on the value of $a$, this parametric family contains various 'U-shaped' and 'bell-shaped' distributions. It is interesting to consider the peaks of the drawn curves. The first plot (the density plot) of the panel shows that $L_4$ performs better than $L_2$ for all those curves whose peaks are below the red curve, which marks the boundary case. The second plot of the panel shows that $L_4$ performs better than $L_2$ as long as $a$ stays below roughly 3.5, across all shapes from a deep U to a hump, up to a certain level of peakedness.

These two families illustrate the enormous potential of the $L_4$-based loss function. Future studies will investigate whether this is a general phenomenon for other Pearsonian families of error distributions.
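For the Pearson Type II family the ratio in condition (2.5) has a simple closed form, which makes the boundary easy to check. The sketch below assumes the even moments $\mu_2 = 1/(2a+3)$ and $\mu_6 = 15/\{(2a+3)(2a+5)(2a+7)\}$, obtained from the fact that $x^2$ follows a Beta$(1/2, a+1)$ distribution under this density. These expressions, the resulting ratio $5(2a+3)^2/\{3(2a+5)(2a+7)\}$ and the function names are worked out for this sketch and are not quoted from the paper.

```python
from fractions import Fraction
from math import sqrt


def pearson2_ratio(a):
    """mu_6 / (9 mu_2^3) for f(x) proportional to (1 - x^2)^a on (-1, 1)."""
    a = Fraction(a)
    return 5 * (2 * a + 3) ** 2 / (3 * (2 * a + 5) * (2 * a + 7))


def pearson2_boundary():
    """Positive root of 2a^2 - 3a - 15 = 0, where the ratio equals 1."""
    return (3 + sqrt(129)) / 4


for a in (Fraction(-1, 2), 0, 1, 3, 4):
    ratio = pearson2_ratio(a)
    print(a, float(ratio), "L4 preferred" if ratio < 1 else "OLS preferred")
print("boundary a:", pearson2_boundary())
```

The case $a = 0$ is the uniform distribution and returns $3/7$, while large $a$ approaches the normal value $15/9$; the boundary is close to the value of about 3.5 read off Figure 5.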

## 6 Simulation study

We carry out the decision making procedure under a 0-1 loss function and calculate the risk function, which is the expected loss. We generate data from three types of distribution: one for which $L_4$ is always better than the OLS estimator, one for which OLS is better than $L_4$, and a third that is near the boundary. The values of the calculated risk are given in Table 1.

Table 1 given in Appendix A should be here

The simulation study is based on 10000 iterations, with sample sizes of 100, 200, 500, 1000, 2000 and 5000.

The first panel of Table 1 is based on a mixture of two $t$ distributions with 6 degrees of freedom (df) each. The component means are set at fixed values. It may be mentioned that our test needs the existence of 6th order moments; to this end, we need a distribution with at least 7 df. The case of 6 df is considered in order to examine the performance of our decision rule even when the moments do not exist. The mixture coefficients are drawn at random. From this part of the table, it is clear that the decision becomes more certain as the sample size increases and, more importantly, as the distance between the two components increases.

The second panel of the table is based on a mixture of two $t$ distributions with 10 df each. Here the findings corroborate those of the first panel. The third panel is based on a mixture of two $t$ distributions with 20 df each.

The fourth panel of the table is based on a mixture of two asymmetric distributions, with mixture coefficients drawn at random. The fifth panel is based on two symmetric beta distributions with random weights. The first column of Panel 5 needs special attention. The parameter combination is chosen such that it lies in the neighborhood of the boundary of the test statistic. Here the test does not favour either estimator (for the large sample case, n=5000), as expected, and the risk is near 50%.

The sixth and seventh panels of the table are based on mixtures of two normal distributions. Here also the findings are along expected lines. As the sample size increases, the test correctly discriminates between $L_2$ and $L_4$.
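The risk under 0-1 loss can be approximated by straightforward Monte Carlo. The sketch below is a scaled-down illustration for an equal-weight mixture of two unit-variance normals centred at $\pm c$. It uses a plug-in rule that compares the sample value of $(\mu_6 - \mu_3^2)/\mu_2^3$ with 9, which is a simplification of the decision rule of Section 4, and it takes the "truth" from the population moments of the mixture. The function names, seed, sample size and number of replications are all illustrative assumptions.

```python
import random


def l4_statistic(sample):
    """Plug-in estimate of (mu_6 - mu_3^2) / mu_2^3."""
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((v - mean) ** 2 for v in sample) / n
    m3 = sum((v - mean) ** 3 for v in sample) / n
    m6 = sum((v - mean) ** 6 for v in sample) / n
    return (m6 - m3 ** 2) / m2 ** 3


def simulate_risk(c, n, replications, seed=0):
    """Share of replications in which the plug-in rule picks the wrong estimator."""
    rng = random.Random(seed)
    # Population value of the statistic for the mixture (mu_3 = 0 by symmetry)
    population_v = (15 + 45 * c ** 2 + 15 * c ** 4 + c ** 6) / (1 + c ** 2) ** 3
    l4_is_better = population_v < 9
    mistakes = 0
    for _ in range(replications):
        sample = [rng.gauss(c if rng.random() < 0.5 else -c, 1.0) for _ in range(n)]
        if (l4_statistic(sample) < 9) != l4_is_better:
            mistakes += 1
    return mistakes / replications


for c in (0.0, 1.0, 3.0):
    print(c, simulate_risk(c, n=200, replications=200))
```

Well-separated components (large $c$) should give a risk close to zero, while values of $c$ near the boundary of condition (3.6) should give a risk close to one half.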

## 7 Empirical Illustration

In this section, we provide two illustrations. The first is based on a constructed data set that resembles many real-life scenarios; the second is based on a real-life data set.

### 7.1 Constructed Example

Data often contain rounding errors. Variables that are continuous by their very nature (such as height, weight, age in years, or birth weight in ounces) are nevertheless typically measured in a discrete manner. People feel more comfortable reporting their age as mid-forties, mid-fifties and so on. Values are rounded to a certain level of accuracy, often to some preassigned unit of a measuring scale (e.g., to multiples of 10 cm, 1 cm, or 0.1 cm), or simply reflect our preference for some numbers over others. The reason may be the avoidance of costs associated with a fine measurement, or the imprecise nature of the measuring instrument. The German military, for example, measures the height of recruits to the nearest 1 cm. Even if precise measurements are available, they are sometimes recorded in a coarsened way in order to preserve confidentiality or to compress the data into an easy-to-grasp frequency table.

Here we consider a linear regression in which the dependent variable is rounded, while the independent variables are free of any such errors. The dependent variable is generated as

$$y_{st} = 8 + 1 \times x_1 + 2 \times x_2, \quad\text{where } x_1 = 1.3 \times \texttt{sample.int(10)},\ x_2 = 2.32 \times \texttt{sample(10:18)}.$$

However, assume that we do not observe $y_{st}$ but observe

$$Y = 5 \times \mathrm{floor}(y_{st}/5).$$

Footnote 1: `sample.int(10)` gives the integers 1 to 10 in random order; `sample(10:18)` gives the integers 10 to 18 in random order; $\mathrm{floor}(y_{st}/5)$ is the largest integer less than or equal to $y_{st}/5$.

We now regress $Y$ on $x_1$ and $x_2$. For this example, we set a moderate sample size of 40 and consider 5000 replications. The output is summarized as follows:

Table 2: Average Estimates Based on the Constructed Data.

| Estimates | Intercept = 5.5 | Slope 1 = 1 | Slope 2 = 2 |
| --- | --- | --- | --- |
| Average () | 7.013 | 1.115 | 1.924 |
| Average () | 6.548 | 1.021 | 1.962 |

Based on our proposed decision rule, $L_4$ is preferred over $L_2$ in 90 percent of the replications.

After estimation of the model, it may be of interest to know which set of estimators provides the better fit. In the present context it is tricky to find an appropriate 'goodness of fit' measure. Likelihood based methods are not tenable. Similarly, the residual sum of squares or $R^2$ is not useful for comparing the performance of these two sets of parameter estimates. Here we suggest applying the idea of the pseudo $R^2$ (see Cameron and Trivedi, 2005, p. 311, for details).

Let $Q(\theta)$ denote the objective function being maximized, $Q_0$ its value in the intercept-only model, $Q_{fit}$ its value in the fitted model, and $Q_{max}$ the largest possible value of $Q(\theta)$. Then the maximum potential gain in the objective function resulting from the inclusion of regressors is $Q_{max} - Q_0$, and the actual gain is $Q_{fit} - Q_0$. This suggests the measure

$$R^2_{RG} = \frac{Q_{fit} - Q_0}{Q_{max} - Q_0},$$

where the subscript RG means relative gain. Note that, for least squares, $R^2_{RG}$ coincides with the usual $R^2$. For both the loss functions, $Q_{max} = 0$.

We also calculated the number of times the pseudo $R^2$ for $L_4$ is numerically greater than that for $L_2$. Strikingly, the pseudo $R^2$ for $L_4$ is greater than that for $L_2$ in 100 percent of the replications.
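The relative-gain measure is simple to compute once the objective is written as the negative of the loss, so that its largest possible value is zero. The sketch below assumes exactly that convention; the function names and the residual vectors are made up for illustration.

```python
def loss_objective(residuals, power):
    """Objective to be maximized: minus the sum of |residual|**power."""
    return -sum(abs(r) ** power for r in residuals)


def relative_gain_r2(q_fit, q_null, q_max=0.0):
    """Pseudo R^2 based on relative gain: (Q_fit - Q_0) / (Q_max - Q_0)."""
    possible_gain = q_max - q_null
    if possible_gain <= 0:
        raise ValueError("the intercept-only model already attains Q_max")
    return (q_fit - q_null) / possible_gain


# Hypothetical residuals from a fitted model and from an intercept-only model
fitted = [0.5, -0.5, 0.2, -0.2]
intercept_only = [2.0, -2.0, 1.0, -1.0]
for power in (2, 4):
    q_fit = loss_objective(fitted, power)
    q_null = loss_objective(intercept_only, power)
    print(power, relative_gain_r2(q_fit, q_null))
```

When this is applied to the quartic loss, the intercept-only residuals should come from the intercept that minimizes the quartic loss, which in general differs from the sample mean.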

### 7.2 Real Life Example

For our empirical analysis, we use data provided by the National Sample Survey Organization of India, viz. the NSSO 68th round all-India unit level survey on consumption expenditure (Schedule 1.0, Type 1 and 2), conducted during July 2011 to June 2012. This dataset is a nationally representative sample of household and individual characteristics based on a stratified sampling of households. For this round, the dataset comprises 1,68,880 household level observations. The dataset provides a detailed list of household and individual specific characteristics along with the consumption expenditures of the households. In addition, data are provided on the households' location, which includes the sector (rural or urban), the district and the state/union territory (henceforth, the union territories will be referred to as states). For our analysis, we use the amount of land possessed (in logarithmic form) by the households as our principal (dependent) variable, alongside various demographic variables as controls (independent variables). The kernel density plot clearly suggests that the amount of land possessed by rural households has a bimodal distribution (see Footnote 2). The plot clearly indicates that India is suffering from a “vanishing middle-class syndrome”: only marginal and rich farmers are present. The empirical analysis calls for some routine summary statistics, given in Table 3.

Footnote 2: The same phenomenon is also seen for all-India households (rural and urban together). The plot presented here is for rural households, excluding the households with no land. It is interesting to note that bimodality is observed both for (1) households with a non-zero amount of land and (2) all households. All the results presented here are based on rural households with non-zero land. The number of rural households with a non-zero amount of land is 98483, and the whole study is based on this set of 98483 observations.

We regress the amount of land possessed on six explanatory variables (see Footnote 3), viz. the median age of a household (mage), the number of children below 15 years of age (chlt15), the number of old people above 60 years of age (Ogt60), the number of male members in the household (male), the number of female members in the household (female), and finally the number of members with an education level above 10th standard (highedu). We then estimate (see Footnote 4) the linear regression model based on $L_2$ and $L_4$. The estimated results are summarized below.

Footnote 3: We also tried many other explanatory variables available in our master file, and repeated the same exercise for all-India (rural and urban together) households. Needless to say, the overall findings are the same across all the models we attempted.

Footnote 4: In this paper we do not pursue the endogeneity issue, if any.

Figure 6: Kernel density plot of amount of land possession by rural households

Table 3: Summary Statistics.

| Mean | Median | SE | Min | Max | Kurtosis |
| --- | --- | --- | --- | --- | --- |
| 4.945 | 5.352 | 2.290 | 0.693 | 12.007 | 1.944 |

Note: (i) All the results presented here are based on rural households with non-zero land. The number of rural households with a non-zero amount of land is 98483. (ii) This table is based on non-logarithm data.

Table 4: Model Estimates and Standard Errors.

| Variables | L2 | L4 |
| --- | --- | --- |
| Intercept | 3.03493710 (0.0342302942) | 3.47252382 (0.0120790850) |
|  | 0.01240624 (0.0007863559) | 0.01077902 (0.0002774869) |
|  | -0.14759963 (0.0082999771) | -0.09232877 (0.0029288714) |
|  | 0.06754135 (0.0135886351) | 0.04168707 (0.0047951174) |
|  | 0.36720544 (0.0064274676) | 0.24403590 (0.0022681058) |
|  | 0.31067372 (0.0071582759) | 0.21197564 (0.0025259912) |
|  | 0.18669815 (0.0074214316) | 0.16207843 (0.0026188528) |

The standard errors (SE) of the parameter estimates are given in parentheses. As expected, the SEs of the $L_4$-based estimates are uniformly and substantially smaller than those of the $L_2$-based estimates. The value of our proposed test statistic is 5.30311871, which lies outside the 95 per cent confidence interval (8.90603838, 9.09396162), suggesting that the $L_4$-based estimates are more efficient than the $L_2$-based ones. The pseudo $R^2$ for $L_4$ is 0.14779372 and that for $L_2$ is 0.09794736. The pseudo $R^2$ also clearly suggests the supremacy of $L_4$ over $L_2$.

## 8 Discussion

This paper has tried to answer an unavoidable question: does a higher order loss function based estimator perform better than the omnipresent least squares? Every teacher and student faces this question in the very first class on regression analysis. We have tried to show that, in several real-life situations, an estimator based on a smooth higher order loss function may be more efficient than the universal least squares estimator. It is true that least squares has one unassailable advantage, its simplicity. It may also be computationally less intensive. However, with the advent of modern computing power, computational issues are hardly relevant any more.

Further work may proceed in the following directions. A generalized version of the condition, similar to the one derived in Section 4, may be useful for comparing $L_{2k}$ and $L_2$ estimators. This may be obtained by a similar approach, i.e. by obtaining the variances of the two estimators from the expression for the variance of M-estimators and comparing them. However, the estimation of higher moments may affect the performance of the proposed decision rule. It may be useful to study the impact of outliers on the parameter estimates obtained from higher order loss functions. A comparison of the breakdown points of the estimators may be very useful. It may also be interesting to find robust standard errors for $L_{2k}$ estimators in non-i.i.d. set-ups.

It may be extremely useful to consider a convex combination of loss functions of various degrees. Arthanari and Dodge (1981) consider a convex combination of the $L_1$ and $L_2$ norms and study its properties. A convex combination of loss functions of several orders may lead to a more useful estimator, and the resulting estimator is expected to be robust to distributional assumptions. Such a combination may help answer the ever-present question: what is the optimal loss function for a given data set? The choice and design of loss functions is important in any practical application (see Hennig and Kutlukaya, 2007). Future research will shed light in this direction.

Appendix A

Appendix B

Proof of Theorem 2: We write

 ˆv−v = ˆμ6−ˆμ32ˆσ6−μ6−μ23σ6=ˆμ6−ˆμ23−μ6+μ23ˆσ