1 Introduction
The least squares (LS hereafter) method is possibly the most popular method routinely used to estimate the underlying (regression) parameters. Stigler (1981) rightly said: "The method of least squares is the automobile of modern statistical analysis: despite its limitations, occasional accidents, and incidental pollution, it and its numerous variations, extensions, and related conveyances carry the bulk of statistical analysis, and are known and valued by nearly all." Such overwhelming popularity of the LS may be due to its simplicity, its optimal properties, and its robustness to distributional assumptions; moreover, it leads to the best (minimum variance) estimator under normality. Laplace used the name "most advantageous method". However, it appears to us that such irresistible popularity of the LS may have impeded the exploration of other smooth loss functions. Comparative computational difficulty might be another reason such exploration was not favoured by pioneers such as Gauss, Laplace and others. In contrast, a large literature incorporating nonsmooth loss functions, in order to address (outlier) robustness, has been developed. Unfortunately and surprisingly, the statistical literature is somewhat mute on the possible use of smooth higher order loss functions. Therefore, it is pertinent to ask: are there any relative advantages in using a higher order smooth loss function compared to the omnipresent least squares? Our aim here is to study an appropriate higher order estimator and compare its efficiency against the LS. In the regression set up, we find a significantly large and useful class of error distributions for which a higher order loss function is more efficient than the LS. In this paper, we give an empirically testable condition under which a higher order smooth loss function leads to a more efficient estimator than the LS. We also provide a simple but pragmatic decision rule to make a choice between the $L_2$ and $L_4$ loss functions; a detailed simulation study shows the effectiveness of this decision rule.
In Section 2, we describe the model and develop the methodology needed to compare the efficiency of different loss functions. Section 3 provides various classes of error distributions which are used for the comparison of estimators. In Section 4, we provide a decision rule along with its asymptotic properties. Section 5 provides an epilogue, where we consider very general classes of parametric distributions on finite support to illustrate the enormous scope of applicability of the $L_4$-based loss function. Section 6 summarizes the results of a simulation study on mixture distributions. Section 7 gives an application to real life data. Section 8 ends with some concluding remarks and identifies possible future directions of research.
2 Model and assumptions
Consider a linear regression set up
(2.1) $y = X\beta + \epsilon,$
where $y$ is an $n \times 1$ vector of observations, $X$ is an $n \times p$ design matrix, $\beta$ is the $p \times 1$ vector of regression parameters, and $F$ is the cumulative distribution function (cdf) corresponding to the components of the error vector $\epsilon$.
We also assume the following regularity conditions:
(i) $\frac{1}{n}X'X \to Q$, a finite and nonsingular matrix.
(ii) The errors $\epsilon_i$ are identically distributed with $E(\epsilon_i) = 0$.
(iii) The moments of $\epsilon_i$ required below exist; in particular $\mu_6 = E(\epsilon_i^6) < \infty$.
(iv) Observations are independent.
In this set up, the ordinary least squares (OLS) estimate
$\hat\beta_{(2)} = (X'X)^{-1}X'y$
is the best linear unbiased estimator in the sense of minimum variance. It is well known that, under the above conditions, $\sqrt{n}(\hat\beta_{(2)} - \beta)$ is asymptotically normal with variance $\mu_2 Q^{-1}$, where $\mu_k = E(\epsilon^k)$. Furthermore, it is clear that the minimization of $\sum_i (y_i - x_i'\beta)^k$ is pointless if $k$ is odd; and the minimization of $\sum_i |y_i - x_i'\beta|$ or of $\sum_i |y_i - x_i'\beta|^k$ for odd $k$ will not be very convenient because of the lack of differentiability. Therefore, it remains to check whether minimization of $\sum_i (y_i - x_i'\beta)^{2k}$, for some positive integer $k$ other than 1, can yield a better result than $L_2$, at least in some cases; if so, our objective is to identify those cases. It is obvious that the corresponding estimators for such cases will be nonlinear. Furthermore, when the error is normal, the best linear unbiased estimator is indeed the best unbiased estimator. Thus, for normal or near normal errors, $L_2$ will always be better than any other estimator. A closer look reveals that deviation from unimodality causes the robustness properties of LS to falter.
Studying the efficacy of higher order norm based estimators is important in its own right, not necessarily only in comparison to least squares. It opens up the possibility of considering a convex combination of loss functions of various degrees. Arthanari and Dodge (1981) consider a convex combination of the $L_1$ and $L_2$ norms and study its properties. A convex combination of $L_2$ and $L_4$ may lead to a more useful estimator, and the resultant estimator is expected to be robust to distributional assumptions. Such use of higher order loss functions has been attempted earlier as well. Turner (1960) heuristically touches upon the possible use of a higher order loss function in the context of estimation of the location parameter. He discusses several kinds of general pdfs, and advises, in the case of the double exponential, minimizing the sum of the absolute deviations; in the case of the normal, minimizing the sum of the squared deviations (least squares); and in the case of the qth power distribution, minimizing the sum of the qth powers of the deviations (least qth's). Attempts have also been made to define a general class of likelihoods from which to derive parameter estimates robust to distributional assumptions; for example, Zeckhauser and Thompson (1970) define a general class of distributions and empirically establish its suitability.
2.1 Methodology
For purposes of exposition, let us first consider the simple bivariate linear regression model
$y_i = \alpha + \beta x_i + \epsilon_i$
for $i = 1, \ldots, n$. The usual approach is to take the error function as $\sum_{i=1}^n (y_i - \alpha - \beta x_i)^2$. We obtain $(\hat\alpha_{(2)}, \hat\beta_{(2)})$ by minimizing this with respect to $\alpha$ and $\beta$. Note that $\hat\beta_{(2)}$ is the best estimate in the class of linear unbiased estimators; hence there may be some nonlinear estimator with better efficiency.
In contrast, we shall take $\sum_{i=1}^n (y_i - \alpha - \beta x_i)^4$ as our loss function, and derive $(\hat\alpha_{(4)}, \hat\beta_{(4)})$ as the estimator of $(\alpha, \beta)$. Our objective is to compare $\hat\beta_{(2)}$ and $\hat\beta_{(4)}$, and discover conditions under which the latter performs better than the former. Both of these estimators are M-estimators, so they possess properties such as consistency and asymptotic normality under some standard conditions.
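As a concrete numerical sketch of the two fits (not part of the original exposition; all variable names are illustrative), note that the $L_4$ objective is smooth and convex, so Newton's method started from the OLS solution converges quickly:

```python
import numpy as np

def fit_l4(X, y, n_iter=50):
    """Minimise sum (y - X beta)^4 by Newton's method.
    Gradient: -4 X' e^3 ; Hessian: 12 X' diag(e^2) X (convex objective),
    so the Newton step is (X' diag(e^2) X)^{-1} X' e^3 / 3."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]       # OLS starting point
    for _ in range(n_iter):
        e = y - X @ beta
        H = (X.T * e**2) @ X                          # Hessian / 12
        beta = beta + np.linalg.solve(H, X.T @ e**3) / 3.0
    return beta

rng = np.random.default_rng(0)
n = 4000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 2.0 + 3.0 * x + rng.uniform(-1.0, 1.0, n)         # thin-tailed (uniform) errors
beta2 = np.linalg.lstsq(X, y, rcond=None)[0]          # L2 fit
beta4 = fit_l4(X, y)                                  # L4 fit
```

Both fits recover the true intercept 2 and slope 3 closely here; the efficiency comparison between them is the subject of the results that follow.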
For the OLS estimator,
(2.2) $\sqrt{n}\,(\hat\beta_{(2)} - \beta) \xrightarrow{d} N\!\left(0,\ \frac{\mu_2}{\sigma_x^2}\right),$
where $\sigma_x^2 = \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2$.
We exhibit that $\hat\beta_{(4)}$ satisfies the following result.
Lemma 1.
(2.3) $\sqrt{n}\,(\hat\beta_{(4)} - \beta) \xrightarrow{d} N\!\left(0,\ \frac{\nu_6}{9\nu_2^2\,\sigma_x^2}\right),$
where $\nu_k = E(\epsilon - \delta)^k$ and $\delta$ is defined by $E(\epsilon - \delta)^3 = 0$; for symmetric errors $\delta = 0$ and $\nu_k = \mu_k$.
Proof: For the $L_4$ estimator, using the M-estimator property, we have $\sqrt{n}\,(\hat\theta_{(4)} - \theta) \xrightarrow{d} N(0,\ A^{-1}BA^{-1})$, where, with $\rho(u) = u^4$, the score is
$\psi(u) = \rho'(u) = 4u^3,$
and, writing $e = \epsilon - \delta$,
$A = E[\psi'(e)]\,M = 12\nu_2 M, \qquad B = E[\psi^2(e)]\,M = 16\nu_6 M,$
where $M = E\big[(1, x)'(1, x)\big]$; note that $E[\psi(e)] = 4E(e^3) = 0$ by the definition of $\delta$, and that $e$ is independent of $x$, so the expectations factorize. Consequently, we get
$A^{-1}BA^{-1} = \frac{16\nu_6}{144\nu_2^2}\,M^{-1} = \frac{\nu_6}{9\nu_2^2}\,M^{-1}.$
Also, the element of $M^{-1}$ corresponding to the slope is $\frac{1}{E(x^2) - (Ex)^2} = \frac{1}{\sigma_x^2}$.
Thus, the asymptotic variance of $\hat\beta_{(4)}$ is $\frac{\nu_6}{9\nu_2^2\,\sigma_x^2}$. Similarly, one can simplify the asymptotic variance of the intercept estimator.
Hence the proof.
Theorem 1. The $L_4$ estimator performs better than the OLS estimator in terms of precision iff
(2.4) $\frac{\nu_6}{9\nu_2^2} < \mu_2.$
Proof. The proof follows by comparing (2.2) and (2.3).
For a symmetric distribution of $\epsilon$, or whenever $\mu_3 = 0$ (so that $\delta = 0$ and $\nu_k = \mu_k$), the criterion will be
(2.5) $\mu_6 < 9\mu_2^3$, equivalently $\frac{\mu_6}{\mu_2^3} < 9.$
Clearly this condition may or may not be satisfied depending on the distribution of $\epsilon$.
So far, for purposes of exposition, we have dealt with a simple regression framework. The scope of this paper demands that all the above findings be presented in a more general regression setup; the following remark is made to this end.
Remark 1.
For a multiple linear regression model with $p$ regressors, all the above calculations can be carried out with $\psi(u) = 4u^3$, where the matrix $A$ would be given by
$A = E[\psi'(e)]\,Q = 12\nu_2 Q,$
and the matrix $B$ is given by
$B = E[\psi^2(e)]\,Q = 16\nu_6 Q,$
with $Q = \lim_{n\to\infty} \frac{1}{n}X'X$. Thus, $A^{-1}BA^{-1} = \frac{\nu_6}{9\nu_2^2}\,Q^{-1}$, which gives the earlier result; that is, the $L_4$ estimators are better than those of $L_2$ iff $\frac{\nu_6}{9\nu_2^2} < \mu_2$.
It may be interesting to examine the performance of $L_6$ relative to that of $L_2$, or that of $L_4$. The following two corollaries are presented to this end (stated for the symmetric case; in general, the recentred moments of Theorem 1 appear).
Corollary 1. The $L_6$ estimator is better than LS (i.e., $L_2$) iff $\frac{\mu_{10}}{25\mu_4^2} < \mu_2$.
Proof. The proof is analogous to that of Theorem 1.
Corollary 2. The $L_6$ estimator is better than $L_4$ iff $\frac{\mu_{10}}{25\mu_4^2} < \frac{\mu_6}{9\mu_2^2}$.
Proof. The proof is analogous to that of Theorem 1.
3 OLS versus $L_4$ for some selected distributions
In this section, we consider a few important parametric error distributions to illustrate the vast scope of applicability of the $L_4$-based loss function. The list of distributions considered is in no way exhaustive, but it certainly shows the immense opportunity for applications in diverse areas. We now check whether the aforementioned condition (2.4), or (2.5) in the symmetric case, holds for different distributions of $\epsilon$.
3.0.1 U-Shaped Distribution
Consider a simple U-shaped family of distributions, for instance
$f(x) = \frac{2k+1}{2}\,x^{2k}, \qquad -1 \le x \le 1, \qquad k = 1, 2, \ldots$
Note that the distribution is symmetric, so (2.5) applies. It is easy to calculate
$\mu_2 = \frac{2k+1}{2k+3}, \qquad \mu_6 = \frac{2k+1}{2k+7}, \qquad \frac{\mu_6}{\mu_2^3} = \frac{(2k+3)^3}{(2k+1)^2(2k+7)}.$
Note that this ratio is below 9 for every $k \ge 1$ (it equals $125/81$ at $k = 1$), and in the limit as $k \to \infty$ it tends to 1; hence the $L_4$ estimator is preferred.
A U-shaped distribution has two modes, and can be looked upon as a mixture of two (J-shaped) distributions: a mixture of an extremely positively skewed and an extremely negatively skewed distribution. One popular applied example of a U-shaped distribution is the number of deaths at various ages. Several more examples can be found in B. S. Everitt (2005).
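As a quick check of the closed form above (assuming the polynomial U-shaped family $f(x) = \frac{2k+1}{2}x^{2k}$ on $[-1, 1]$ sketched there), the moment ratio stays well below the cutoff 9 for every $k$ and approaches 1 as the mass piles up at the endpoints:

```python
# mu_6/mu_2^3 for the U-shaped family f(x) = ((2k+1)/2) x^(2k) on [-1, 1]
def ushape_ratio(k):
    mu2 = (2*k + 1) / (2*k + 3)
    mu6 = (2*k + 1) / (2*k + 7)
    return mu6 / mu2**3          # = (2k+3)^3 / ((2k+1)^2 (2k+7))

vals = [ushape_ratio(k) for k in (1, 2, 5, 50)]  # decreasing toward 1
```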
3.0.2 Uniform$(-\theta, \theta)$
This is a symmetric distribution. Here
$\mu_2 = \frac{\theta^2}{3}, \qquad \mu_6 = \frac{\theta^6}{7},$
and consequently
$\frac{\mu_6}{\mu_2^3} = \frac{27}{7} < 9.$
Hence, the $L_4$ estimator is better than the OLS estimator when the error component has a uniform distribution.
3.0.3 Normal$(0, \sigma^2)$
This is again a symmetric distribution, where
$\mu_2 = \sigma^2, \qquad \mu_6 = 15\sigma^6,$
and hence
$\frac{\mu_6}{\mu_2^3} = 15 > 9.$
Hence, for normally distributed errors, the OLS estimator is always preferred over the $L_4$ estimator.
3.0.4 Laplace$(0, b)$
Here we have
$\mu_r = r!\,b^r$
if $r$ is even, and, consequently,
$\frac{\mu_6}{\mu_2^3} = \frac{720}{8} = 90 > 9.$
Hence, when $\epsilon$ follows the Laplace distribution, the OLS estimator is preferred over the $L_4$ estimator.
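The three closed-form ratios just derived can be checked in a couple of lines (scale parameters cancel in the ratio, so unit scales suffice):

```python
from math import factorial

# Closed-form mu_6/mu_2^3 for the three error laws discussed above
ratio_uniform = (1.0/7.0) / (1.0/3.0)**3          # Uniform(-1,1): 27/7, below 9
ratio_normal  = 15.0 / 1.0**3                     # N(0,1): mu_6 = 15, above 9
ratio_laplace = factorial(6) / factorial(2)**3    # Laplace(0,1): mu_r = r!, gives 90
```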
3.0.5 Beta$(\alpha, \beta)$
The beta distribution is a family of continuous probability distributions defined on the interval [0, 1], parametrized by two positive shape parameters, denoted by $\alpha$ and $\beta$, that appear as exponents of the random variable and control the shape of the distribution. This class of distributions includes a variety of symmetric, bell-shaped, positively skewed, negatively skewed, uniform, and 'U-shaped' distributions. The general form of the central moments of the beta distribution is quite complicated. So we start with the raw moments, $E(X^r) = \prod_{j=0}^{r-1}\frac{\alpha + j}{\alpha + \beta + j}$, and obtain from them the forms of $\mu_2$, $\mu_3$ and $\mu_6$. Figure 4, depicted in Section 5, provides a huge range of parameters for which $L_4$ is better than $L_2$.
3.0.6 Gaussian mixture distribution
Suppose that $\epsilon \sim p\,N(\theta_1, \sigma^2) + (1-p)\,N(\theta_2, \sigma^2)$. We assume, for simplicity, a common $\sigma$ for both components. The central moments of the mixture follow by conditioning on the component label; in the equal-weight case $p = 1/2$ with $\theta_1 = -\theta_2 = \theta$, the mixture is symmetric, so criterion (2.5) applies, and
$\mu_2 = \theta^2 + \sigma^2, \qquad \mu_6 = \theta^6 + 15\theta^4\sigma^2 + 45\theta^2\sigma^4 + 15\sigma^6.$
Let $\delta = \theta/\sigma$. Then (2.5) reduces to
(3.6) $6 + 18\delta^2 - 12\delta^4 - 8\delta^6 < 0,$
obtained by dividing $\mu_6 - 9\mu_2^3$ by $\sigma^6$. The left hand side expression of (3.6), which is a function of $\delta$, is plotted in Figure 1. From the plot we can see that beyond roughly $\delta = 1.06$ this function assumes values less than zero and then decreases rapidly. This means that condition (2.5) for the superiority of the $L_4$ estimator will be satisfied if the means of the two components of the mixture distribution are more than about $2.1\sigma$ apart. A mixture of more than two Gaussian distributions will behave similarly with respect to this condition, and the case of unequal component variances can be handled along the same lines.
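The crossing point for the equal-weight, equal-variance case can be located numerically (a sketch assuming the symmetric two-component set up above; the ratio $\mu_6/\mu_2^3$ is decreasing in the standardized half-distance $\delta$ over the bracketing interval used):

```python
def mixture_ratio(mu, sigma=1.0):
    """mu_6/mu_2^3 for the symmetric equal-weight mixture
    0.5 N(-mu, sigma^2) + 0.5 N(mu, sigma^2)."""
    s2 = sigma**2
    m2 = mu**2 + s2
    m6 = mu**6 + 15*mu**4*s2 + 45*mu**2*s2**2 + 15*s2**3
    return m6 / m2**3

# bisect for the standardized half-distance delta = mu/sigma where the
# ratio crosses the cutoff 9
lo, hi = 0.5, 2.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if mixture_ratio(mid) > 9:
        lo = mid
    else:
        hi = mid
delta_star = 0.5 * (lo + hi)   # means must be > 2 * delta_star * sigma apart
```

At $\delta = 0$ the mixture collapses to a single normal and the ratio is 15; the crossing comes out near $\delta \approx 1.06$, i.e. component means a little more than two standard deviations apart.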
3.0.7 Truncated Normal distribution
Consider (for simplicity) the standard normal distribution truncated on both sides to $[-c, c]$. The even order moments satisfy the recursion (obtained by integration by parts)
$m_r = (r-1)\,m_{r-2} - \frac{2c^{r-1}\phi(c)}{2\Phi(c) - 1}, \qquad m_0 = 1,$
where $\phi$ and $\Phi$ denote the standard normal pdf and cdf. Define $Z_c = 2\Phi(c) - 1$, the retained probability mass. Therefore, it is easy to calculate
$\mu_2 = 1 - \frac{2c\,\phi(c)}{Z_c}$
and
$\mu_4 = 3\mu_2 - \frac{2c^3\phi(c)}{Z_c}, \qquad \mu_6 = 5\mu_4 - \frac{2c^5\phi(c)}{Z_c}.$
Now one can see that, as long as the truncation point $c$ does not exceed a threshold of roughly 2.2 to 2.3, we have $\mu_6/\mu_2^3 < 9$ and $L_4$ performs better than $L_2$.
One implication of this result is that when truncation retains about 97 percent of the normal mass ($c \approx 2.17$), $L_4$ performs better than $L_2$.
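The threshold can be located numerically without special-function libraries, using simple trapezoidal quadrature for the truncated-normal moments (a sketch; the function and variable names are illustrative):

```python
import numpy as np

def truncnorm_ratio(c, n=100001):
    """mu_6/mu_2^3 for the standard normal truncated to [-c, c],
    via trapezoidal quadrature."""
    x = np.linspace(-c, c, n)
    f = np.exp(-0.5 * x**2)
    trap = lambda g: np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(x))
    Z = trap(f)
    mu2 = trap(x**2 * f) / Z
    mu6 = trap(x**6 * f) / Z
    return mu6 / mu2**3

# bisect for the truncation point where the ratio crosses the cutoff 9
lo, hi = 1.0, 4.0
for _ in range(50):
    mid = 0.5 * (lo + hi)
    if truncnorm_ratio(mid) < 9:
        lo = mid
    else:
        hi = mid
c_star = 0.5 * (lo + hi)
```

The ratio rises from $27/7$ (heavy truncation, nearly uniform) toward 15 (no truncation), and the crossing comes out near $c \approx 2.3$.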
3.0.8 Raised cosine distribution
If $X$ follows a raised cosine distribution with parameters $\mu$ and $s$, denoted by $X \sim RC(\mu, s)$, then the probability density function (pdf) is given by
$f(x) = \frac{1}{2s}\left[1 + \cos\!\left(\frac{x - \mu}{s}\,\pi\right)\right], \qquad \mu - s \le x \le \mu + s.$
The form of this distribution resembles that of a normal distribution, except for the fact that it has finite tails. Suppose it can be assumed that the value of a systematic measurement error lies in some known interval, and the manufacturer has aimed to make the device as accurate as possible; in such circumstances, the raised cosine distribution may be appropriate. Another popular application is to circular data. See Rinne (2010, p. 116). Other properties like the cdf, moment generating function (mgf), characteristic function, raw moments up to order 4, and the kurtosis are available in Rinne (2010, pp. 116-118). It is observed that the distribution has a kurtosis of 2.1938, less than that of the normal distribution: it has thin tails. Here, using the mgf, one can compute $\mu_2$ and $\mu_6$ and verify that
$\frac{\mu_6}{\mu_2^3} < 9.$
Hence, for this distribution, (2.5) is satisfied, and consequently $L_4$ is preferred for parameter estimation.
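A direct numerical check for the standard raised cosine on $[-1, 1]$ (a sketch; the moments can also be read off the mgf, as noted above):

```python
import numpy as np

# mu_6/mu_2^3 for the standard raised cosine density on [-1, 1]
x = np.linspace(-1.0, 1.0, 200001)
f = 0.5 * (1.0 + np.cos(np.pi * x))
trap = lambda g: np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(x))
Z = trap(f)                      # total mass, equals 1 up to quadrature error
mu2 = trap(x**2 * f) / Z         # exact value: 1/3 - 2/pi^2
mu6 = trap(x**6 * f) / Z
rc_ratio = mu6 / mu2**3          # comes out near 8, below the cutoff 9
```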
3.1 A Sub-Gaussian family of distributions
The sub-Gaussian family is a well-studied family of distributions whose tails are dominated by those of the normal distribution. As we have observed, the $L_4$ estimators are preferred for distributions whose tails are thinner than the normal; here we discuss a relatively uncommon distribution and the validity of the condition with respect to it. Consider a distribution with pdf of the form
$f(x) = c_m\,e^{-x^{2m}},$
where $m \ge 1$ is an integer and $c_m$ is the normalizing constant, which gives
$c_m = \frac{m}{\Gamma\!\left(\frac{1}{2m}\right)}.$
For various values of $m$, the pdf of the distribution is drawn in Figure 3. Note that $m = 1$ provides the normal curve. As $m$ becomes larger and larger, the tails of the distribution tend to collapse; for extremely large $m$, the distribution resembles a symmetric curve on a finite support. It is interesting to consider the peaks of all drawn curves. The first plot (the density plot) of the panel shows that $L_4$ performs better than $L_2$ for all those curves whose peaks are below the red curve; here it may be mentioned that the red curve is drawn for the boundary value of $m$. For the second plot of the panel, various values of $m$ are given on the horizontal axis, and values of the normalized statistic $\mu_6/(9\mu_2^3)$ on the vertical axis; the horizontal line shows the cutoff point, which is 1. The second plot of the panel shows that when the value of $m$ is greater than about 1.45, $L_4$ performs better than $L_2$. Here $\mu_r = 0$ if $r$ is odd. When $r$ is even, we have
$\mu_r = \frac{\Gamma\!\left(\frac{r+1}{2m}\right)}{\Gamma\!\left(\frac{1}{2m}\right)}.$
We immediately get
$\frac{\mu_6}{9\mu_2^3} = \frac{\Gamma\!\left(\frac{7}{2m}\right)\Gamma\!\left(\frac{1}{2m}\right)^2}{9\,\Gamma\!\left(\frac{3}{2m}\right)^3}.$
The values of this statistic against various values of $m$ are drawn in the bottom part of Figure 3. We observe that, treating $m$ as continuous, it assumes values less than 1 for $m$ beyond roughly 1.45.
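The Gamma-function form of the moments makes this boundary easy to locate numerically (a sketch assuming the even-moment formula above; $m$ is treated as a continuous parameter for the bisection):

```python
from math import gamma

def subgauss_ratio(m):
    """mu_6/mu_2^3 for the density c_m exp(-x^(2m)): even moments are
    mu_r = Gamma((r+1)/(2m)) / Gamma(1/(2m)); odd moments vanish."""
    mom = lambda r: gamma((r + 1) / (2.0 * m)) / gamma(1.0 / (2.0 * m))
    return mom(6) / mom(2)**3

# m = 1 recovers the normal case (ratio 15); bisect for the crossing at 9
lo, hi = 1.0, 3.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if subgauss_ratio(mid) > 9:
        lo = mid
    else:
        hi = mid
m_star = 0.5 * (lo + hi)
```

The crossing comes out close to the value of about 1.45 cited above.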
Now, according to Zeckhauser and Thompson (1970), Turner (1960) and Box and Tiao (1962), for a distribution with pdf
$f(x) \propto \exp\left\{-\left|\frac{x}{\sigma}\right|^{q}\right\},$
where $q \ge 1$, $L_q$ estimators dominate all $L_s$ estimators with $s \neq q$.
Note that, for even $q = 2m$, we obtain the distribution displayed in Figure 3. So the $L_q$ estimator performs better than the corresponding $L_2$ estimator, that is the OLS estimator, which is entirely in agreement with what we derived above. For the normal distribution $q = 2$; $q = 1$ gives the double exponential distribution; and as $q$ tends to $\infty$, the distribution tends to the rectangular. The article of Zeckhauser and Thompson (1970) considers four empirical examples and finds that there are sizable gains in likelihood if $q$ is estimated rather than prespecified as equal to 2. All the evidence they found leads them to the conclusion that if accurate estimation of a linear regression line is important, it will usually be desirable to estimate not only the coefficients of the regression line, but also the parameter of the power distribution that generated the errors about the regression line; the effect on the estimates of the regression coefficients need not be small.
In the next two sections we construct a decision rule based on condition (2.4) and carry out a simulation study.
4 Decision rule: OLS versus $L_4$
In this section we derive a decision rule, based on the criterion from Section 2, to decide whether the OLS or the $L_4$ estimator is preferred for a given data set.
Lemma 2. Suppose $\epsilon$ follows a distribution for which $E(\epsilon^{2r})$ exists. Then $\sqrt{n}\,(\hat\mu_r - \mu_r)$ is asymptotically normal, where $\hat\mu_r$ denotes the $r$th sample central moment.
Proof: Observe that
(4.7) $\hat\mu_r = \frac{1}{n}\sum_{i=1}^n (\epsilon_i - \bar\epsilon)^r = m_r - r\,\bar\epsilon\,m_{r-1} + \binom{r}{2}\bar\epsilon^2 m_{r-2} - \cdots, \qquad m_j = \frac{1}{n}\sum_{i=1}^n \epsilon_i^j.$
Now all the terms other than the first and the second term of (4.7) are of the order $O_p(n^{-1})$, because $\bar\epsilon$ is $O_p(n^{-1/2})$, and hence $\bar\epsilon^2$, is $O_p(n^{-1})$. Also $m_j = \mu_j + O_p(n^{-1/2})$. Therefore
$\sqrt{n}\,(\hat\mu_r - \mu_r) = \sqrt{n}\,(m_r - \mu_r) - r\,\mu_{r-1}\,\sqrt{n}\,\bar\epsilon + o_p(1),$
which is asymptotically normal by the central limit theorem.
Furthermore, by the delta method, smooth functions of such sample moments are also asymptotically normal.
Then, we have the following Theorem.
Theorem 2. Let $T_n = \hat\mu_6/\hat\mu_2^3$. Suppose $\mu_{12}$ exists for the distribution of $\epsilon$. Then
$\sqrt{n}\left(T_n - \frac{\mu_6}{\mu_2^3}\right) \xrightarrow{d} N(0,\ \sigma_T^2),$
where $\sigma_T^2 = \nabla g' \Sigma \nabla g$ with $g(a, b) = b/a^3$, $\nabla g = \left(-\frac{3\mu_6}{\mu_2^4},\ \frac{1}{\mu_2^3}\right)'$, and $\Sigma$ the asymptotic covariance matrix of $(\hat\mu_2, \hat\mu_6)$, whose entries involve moments up to order 12.
Proof of this theorem is given in Appendix B.
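The resulting decision rule is simple in practice: fit OLS, form $T_n = \hat\mu_6/\hat\mu_2^3$ from the residuals, and prefer $L_4$ when $T_n < 9$. A minimal sketch (function names illustrative, error laws chosen so the correct answer is known):

```python
import numpy as np

def choose_loss(y, X):
    """Fit OLS, form T_n = mu6_hat / mu2_hat^3 from the residuals,
    and prefer L4 when T_n is below the Gaussian cutoff 9."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta
    e = e - e.mean()
    T = np.mean(e**6) / np.mean(e**2)**3
    return ("L4" if T < 9 else "L2"), T

rng = np.random.default_rng(3)
n = 4000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y_unif = 1.0 + 2.0*x + rng.uniform(-1, 1, n)   # thin-tailed errors: expect "L4"
y_norm = 1.0 + 2.0*x + rng.normal(0, 1, n)     # Gaussian errors: expect "L2"
rule_u, T_u = choose_loss(y_unif, X)
rule_n, T_n = choose_loss(y_norm, X)
```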
5 An Epilogue
In this section we consider a very general class of parametric distributions on finite support (this assumption is made to ease the drawing of plots) to illustrate the enormous scope of applicability of the $L_4$-based loss function. Consider a class of distributions indexed by a shape parameter, with the normalizing constant depending on that parameter so that the pdf integrates to one. Depending on the value of the shape parameter, this class of distributions includes various 'U-shaped' and 'bell-shaped' distributions. The first plot of the panel in Figure 4 depicts the shape of the density for different values of the parameter. It is interesting to consider the peaks of all drawn curves. The first plot (the density plot) of the panel shows that $L_4$ performs better than $L_2$ for all those curves whose peaks are below the red curve; here it may be mentioned that the red curve is drawn for the boundary case. For the second plot of the panel, the values of the shape parameter are given on the horizontal axis, and the values of the test statistic on the vertical axis. The second plot of the panel shows that when the value of the parameter is greater than 3.2, $L_4$ performs better than $L_2$, for all shapes, going from a deep U to a hump, up to a certain level.
Plots of another parametric family of distributions (belonging to the Pearsonian family, Type II), given by
$f(x) = \frac{(1 - x^2)^{m}}{2^{2m+1}\,B(m+1,\ m+1)}, \qquad -1 \le x \le 1,$
are shown below. It may be noted that this particular distribution is linked to the beta distribution as well. To see this, let $U = (X + 1)/2$. Then $U \sim \mathrm{Beta}(m+1,\ m+1)$.
To see the equivalence, set $\alpha = \beta = m + 1$ in the beta pdf.
Figure 5 also depicts the same feature as Figure 4. Depending on the value of $m$, this group of parametric distributions exhibits various 'U-shaped' and 'bell-shaped' forms. It is interesting to consider the peaks of all drawn curves. The first plot (the density plot) of the panel shows that $L_4$ performs better than $L_2$ for all those curves whose peaks are below the red curve; the red curve is drawn for the boundary case. The second plot of the panel shows that when the value of $m$ is greater than 3.5, $L_4$ performs better than $L_2$, for all shapes, going from a deep U to a hump, up to a certain level.
These two families illustrate the enormous potential for use of the $L_4$-based loss function. Future studies will investigate whether this is a general phenomenon for other Pearsonian families of error distributions.
6 Simulation study
We carry out the decision making procedure under a 0-1 loss function and calculate the risk function, which is the expected loss. Here we generate data from three types of distributions: one for which $L_4$ is always better than the OLS estimator, one where OLS is better than $L_4$, and a third which is near the boundary. The values of the calculated risk are given in Table 1.
Table 1 given in Appendix A should be here
The simulation study is based on 10000 iterations, with sample sizes of 100, 200, 500, 1000, 2000 and 5000.
The first panel of Table 1 is based on a mixture of two $t$ distributions with 6 degrees of freedom (df) each; the means of the components are set at the values indicated in the column headers. Here it may be mentioned that our test needs the existence of 6th order moments; to this end, we need $t$ distributions with at least 7 df. df = 6 is considered precisely to examine the performance of our decision rule even when the required moments do not exist. Mixture coefficients are drawn at random. From this part of the table, it is clear that the decision is more certain as the sample size increases; more importantly, it is so when the distance between the two components is larger. The second panel of the table is based on a mixture of two $t$ distributions with 10 df each; here the findings corroborate those of the first panel. The third panel is based on a mixture of two $t$ distributions with 20 df each.
The fourth panel of the table is based on a mixture of two asymmetric beta distributions, with mixture coefficients again drawn at random. The fifth panel is based on two symmetric beta distributions. The first column of Panel 5 needs special attention: the parameter combination is chosen to lie in the neighborhood of the boundary of the test statistic. Here the test does not favour either estimator (for the large sample case, n = 5000), as expected, and the risk is near 50 percent.
The sixth and seventh panels of the table are based on mixtures of two and three normal distributions, respectively. Here also the findings are along expected lines: as the sample size increases, the test correctly discriminates between $L_2$ and $L_4$.
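A lightweight version of this risk computation can be sketched as follows (far fewer replications than the 10000 used for Table 1; the samplers are illustrative stand-ins for the mixture designs described above):

```python
import numpy as np

rng = np.random.default_rng(7)

def prefers_l4(e):
    """Decision rule: L4 is selected when the sample ratio mu6/mu2^3 < 9."""
    c = e - e.mean()
    return np.mean(c**6) / np.mean(c**2)**3 < 9

def risk(sampler, l4_is_correct, reps=200, n=2000):
    """Monte Carlo risk under 0-1 loss: the share of wrong decisions."""
    wrong = sum(prefers_l4(sampler(n)) != l4_is_correct for _ in range(reps))
    return wrong / reps

# well-separated equal-weight normal mixture: population ratio ~ 2.4, L4 correct
mix = lambda n: rng.normal(rng.choice([-3.0, 3.0], size=n), 1.0)
# plain Gaussian: population ratio 15, L2 correct
gauss = lambda n: rng.normal(0.0, 1.0, n)

risk_mix = risk(mix, l4_is_correct=True)
risk_gauss = risk(gauss, l4_is_correct=False)
```

Both risks come out near zero at this sample size, in line with the large-sample behaviour reported in Table 1.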
7 Empirical Illustration
In this section, we provide two illustrations: one is based on a constructed data set which resembles many real life scenarios, and the second is based on a real life data set.
7.1 Constructed Example
Data often contain rounding errors. Variables (like height or weight, age in years, or birth weight in ounces) that are by their very nature continuous are, nevertheless, typically measured in a discrete manner. People feel more comfortable reporting their age as mid-forties, mid-fifties and so on. Measurements are rounded to a certain level of accuracy, often to some preassigned decimal point of a measuring scale (e.g., to multiples of 10 cm, 1 cm, or 0.1 cm), or simply reflect our preference for some numbers over others. The reason may be the avoidance of the costs associated with a fine measurement, or the imprecise nature of the measuring instrument. The German military, for example, measures the height of recruits to the nearest 1 cm. Even if precise measurements are available, they are sometimes recorded in a coarsened way in order to preserve confidentiality, or to compress the data into an easy-to-grasp frequency table.
Here we consider a linear regression where the dependent variable is rounded to the nearest integer, while the independent variables are free of any such errors. The dependent variable is generated as a linear function of two regressors, with intercept 5.5 and slopes 1 and 2 (see Table 2), plus an error term.
However, assume that we do not observe $y$ but observe its rounded version $y^{*} = \mathrm{round}(y)$.
We then regress $y^{*}$ on $x_1$ and $x_2$. For this example, we set a moderate sample size of 40, and consider 5000 replications. The output is summarized as follows:
Table 2: Average Estimates Based on the Constructed Data.

Estimates        | Intercept = 5.5 | Slope 1 = 1 | Slope 2 = 2
Average ($L_2$)  | 7.013           | 1.115       | 1.924
Average ($L_4$)  | 6.548           | 1.021       | 1.962
It is observed that 90 percent of the time $L_4$ is preferred over $L_2$ on the basis of our proposed decision rule.
After estimation of the model, it may be of interest to know which set of estimators provides the better fit. In the present context it is a tricky problem to find an appropriate 'goodness of fit' measure. Likelihood based methods are not tenable. Similarly, the residual sum of squares or $R^2$ is not useful for comparing the performances of these two sets of parameter estimates. Here we suggest applying the idea of a pseudo $R^2$ (see Cameron and Trivedi, 2005, p. 311, for details).
Let $Q$ denote the objective function being maximized, $Q_0$ denote its value in the intercept-only model, $Q_{\mathrm{fit}}$ denote the value in the fitted model, and $Q_{\max}$ denote the largest possible value of $Q$. Then the maximum potential gain in the objective function resulting from the inclusion of regressors is $Q_{\max} - Q_0$, and the actual gain is $Q_{\mathrm{fit}} - Q_0$. This suggests the measure
$R^2_{RG} = \frac{Q_{\mathrm{fit}} - Q_0}{Q_{\max} - Q_0},$
where the subscript RG means relative gain. Note that, for least squares, $R^2_{RG}$ reduces to the usual $R^2$. For both the loss functions considered here, $Q_{\max} = 0$ when $Q$ is taken as the negative of the power loss.
We also calculated the number of times the pseudo $R^2$ for $L_4$ is numerically greater than that of $L_2$; strikingly, this happens 100 percent of the time.
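The relative-gain measure can be computed as follows (a sketch under the assumption stated above that $Q$ is the negative power-loss objective, so $Q_{\max} = 0$; function names are illustrative):

```python
import numpy as np

def lp_intercept(y, p=4):
    """Best constant fit under sum |y - c|^p. For p = 4 the first-order
    condition sum (y - c)^3 = 0 is a cubic in c with a unique real root
    (the derivative of a strictly convex objective is monotone)."""
    if p == 2:
        return y.mean()
    n = len(y)
    coeffs = [-n, 3*np.sum(y), -3*np.sum(y**2), np.sum(y**3)]
    roots = np.roots(coeffs)
    real = roots[np.isclose(roots.imag, 0.0)].real
    return real[np.argmin([np.sum((y - c)**4) for c in real])]

def pseudo_r2_rg(y, resid, p):
    """Relative-gain pseudo R^2: actual gain over maximum potential gain.
    With Q the negative power-loss objective, Q_max = 0, so the measure
    reduces to 1 - Q_fit/Q_0 (the usual R^2 when p = 2)."""
    q0 = np.sum(np.abs(y - lp_intercept(y, p))**p)
    qf = np.sum(np.abs(resid)**p)
    return 1.0 - qf / q0

y_demo = np.array([0.0, 1.0, 2.0])
c4 = lp_intercept(y_demo, 4)                       # equals the mean here, by symmetry
r2_demo = pseudo_r2_rg(y_demo, y_demo - 1.0, 2)    # intercept-only residuals give 0
```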
7.2 Real Life Example
For our empirical analysis, we use data provided by the National Sample Survey Organization (NSSO) of India, viz. the NSSO 68th round all-India unit level survey on consumption expenditure (Schedule 1.0, Types 1 and 2), conducted during July 2011 to June 2012. This dataset is a nationally representative sample of household and individual characteristics based on a stratified sampling of households. For this round, the dataset comprises 168,880 household level observations. The dataset provides a detailed list of various household and individual specific characteristics along with the consumption expenditures of the households. In addition, data are also provided on the households' localization, which includes the sector (rural or urban), the district and the state/union territory (henceforth, the union territories will be referred to as states). For our analysis, we use the amount of land possessed (in logarithmic form) by the households as our principal (dependent) variable, alongside various demographic variables as controls (independent variables). The kernel density plot clearly suggests that the amount of land possessed by rural households has a bimodal distribution. (The same phenomenon is also seen for all-India households, rural and urban together. The plot presented here is for rural households excluding those with no land; bimodality is observed both for households with a nonzero amount of land and for all households. All results presented here are based on rural households with nonzero land; the number of such households, on which the whole study is based, is 98483.) The plot clearly indicates that India is suffering from a "vanishing middle-class syndrome": only marginal and rich farmers remain. The empirical analysis demands some routine and rudimentary summary statistics, as given in Table 3.
We regress the amount of land possessed on six explanatory variables, viz., the median age of the household (mage), the number of children below 15 years of age (chlt15), the number of old people above 60 years of age (Ogt60), the number of male members in the household (male), the number of female members in the household (female), and finally the number of members with education above the 10th standard (highedu). (We also tried many other explanatory variables available in our master file, and repeated the same exercise for all-India, rural and urban, households; it is needless to mention that the overall findings are the same across all the models we attempted. In this paper we do not pursue the endogeneity issue, if any.) We then estimate the linear regression model based on $L_2$ and $L_4$. The estimated results are summarized below:
Table 3: Summary Statistics.

Mean  | Median | SE    | Min   | Max    | Kurtosis
4.945 | 5.352  | 2.290 | 0.693 | 12.007 | 1.944

Note: (i) All the results presented here are based on rural households with nonzero land; the number of such households is 98483. (ii) This table is based on nonlogarithm data.
Table 4: Model Estimates and Standard Errors.

Variables | $L_2$                      | $L_4$
Intercept | 3.03493710 (0.0342302942)  | 3.47252382 (0.0120790850)
mage      | 0.01240624 (0.0007863559)  | 0.01077902 (0.0002774869)
chlt15    | 0.14759963 (0.0082999771)  | 0.09232877 (0.0029288714)
Ogt60     | 0.06754135 (0.0135886351)  | 0.04168707 (0.0047951174)
male      | 0.36720544 (0.0064274676)  | 0.24403590 (0.0022681058)
female    | 0.31067372 (0.0071582759)  | 0.21197564 (0.0025259912)
highedu   | 0.18669815 (0.0074214316)  | 0.16207843 (0.0026188528)

Note: (i) All the results presented here are based on the 98483 rural households with nonzero land; the whole study is based on this set of observations. The logarithmic transformation is taken to reduce the degree of heteroscedasticity. (ii) Standard errors are provided in parentheses. (iii) Least-squares based estimates are used as initial estimates for the $L_4$ estimation. (iv) 'Rootsolver' in R is used to obtain the estimates.
Standard errors (SE) of the parameter estimates are provided in parentheses. It is observed, as expected, that the SEs of the $L_4$-based estimates are significantly and uniformly smaller than those of the $L_2$-based estimates. The value of our proposed test statistic is 5.30311871, which lies below the 95 per cent interval (8.90603838, 9.09396162) around the cutoff 9, suggesting that the $L_4$-based estimates are more efficient than those of $L_2$. The pseudo $R^2$ for $L_4$ is 0.14779372 and that for $L_2$ is 0.09794736. The pseudo $R^2$ also clearly suggests the superiority of $L_4$ over $L_2$.
8 Discussion
This paper tried to answer a basic question: does a higher order loss function based estimator perform better than the omnipresent least squares? Every teacher and student faces this question on the first day of a class on regression analysis. We tried to show that, in several real life situations, a
smooth higher order loss function may lead to a more efficient estimator compared to universal least squares. It is true that least squares has one unassailable advantage, its simplicity; it may also be computationally less intensive. However, with the advent of modern computing power, computational issues are hardly relevant.
Further work may commence in the following directions. A generalized version of the condition, similar to the one derived in Section 4, may be useful for comparing the $L_4$ and $L_6$ estimators. This may be obtained by following a similar approach, i.e., by obtaining the variances of these two estimators using the expression for the variance of M-estimators and comparing them. However, estimation of higher moments may have an impact on the performance of the proposed decision rule. It may be useful to study the impact of outliers on the parameter estimates coming from higher order loss functions; comparison of the breakdown points of the estimators may be very useful. It may also be interesting to find robust standard errors for $L_4$ in more general set ups.
It may be extremely useful to consider a convex combination of loss functions of various degrees. Arthanari and Dodge (1981) consider a convex combination of the $L_1$ and $L_2$ norms and study its properties. A convex combination of $L_2$ and $L_4$ may lead to a more useful estimator, and the resultant estimator is expected to be robust to distributional assumptions. Such a combination may answer the omnipresent question: what is the optimal loss function for a given data set? The choice and design of loss functions is important in any practical application (see Hennig and Kutlukaya, 2007). Future research will shed light in this direction.
Appendix A
Table 1: Decision counts out of 10000 iterations (number of times the rule favours $L_4$).

Panel 1: Mixture of T-distn (df = 6)
Sample size | (5, 5) | (4, 4) | (3, 3) | (2, 2)
100   | 9940  | 9834  | 9200  | 5896
200   | 9952  | 9838  | 9256  | 5571
500   | 9961  | 9844  | 9309  | 4861
1000  | 9965  | 9856  | 9337  | 4309
2000  | 9975  | 9896  | 9348  | 3763
5000  | 9973  | 9912  | 9445  | 2833

Panel 2: Mixture of T-distn (df = 10)
Sample size | (5, 5) | (4, 4) | (3, 3) | (2, 2)
100   | 10000 | 9986  | 9860  | 8092
200   | 9999  | 9992  | 9900  | 8329
500   | 10000 | 9997  | 9959  | 8842
1000  | 10000 | 9996  | 9973  | 9041
2000  | 10000 | 10000 | 9991  | 9368
5000  | 10000 | 9999  | 9988  | 9666

Panel 3: Mixture of T-distn (df = 20)
Sample size | (5, 5) | (4, 4) | (3, 3) | (2, 2)
100   | 10000 | 9999  | 9981  | 9155
200   | 10000 | 10000 | 9996  | 9585
500   | 10000 | 10000 | 10000 | 9846
1000  | 10000 | 10000 | 10000 | 9950
2000  | 10000 | 10000 | 10000 | 9984
5000  | 10000 | 10000 | 10000 | 9999

Panel 4: Mixture of Beta-distn (asymmetric)
Sample size | (4,10; 10,4) | (1,4; 4,1) | (2,4; 4,2) | (3,4; 4,3)
100   | 10000 | 10000 | 9691  | 3632
200   | 10000 | 10000 | 9992  | 4791
500   | 10000 | 10000 | 10000 | 7277
1000  | 10000 | 10000 | 10000 | 9241
2000  | 10000 | 10000 | 10000 | 9957
5000  | 10000 | 10000 | 10000 | 10000

Panel 5: Mixture of Beta-distn (symmetric)
Sample size | (4,4; 4,4) | (3,3; 3,3) | (2,2; 2,2) | (1,1; 1,1)
100   | 2007  | 3740  | 7665  | 9987
200   | 1881  | 4748  | 9445  | 10000
500   | 2089  | 7178  | 9993  | 10000
1000  | 2574  | 9200  | 10000 | 10000
2000  | 3464  | 9957  | 10000 | 10000
5000  | 5810  | 10000 | 10000 | 10000

Panel 6: Mixture of two normal distributions
Sample size | (3, 3) | (2, 2) | (1, 1) | (0, 0)
100   | 9999  | 9806  | 1862  | 211
200   | 10000 | 9959  | 1330  | 26
500   | 10000 | 10000 | 815   | 0
1000  | 10000 | 10000 | 478   | 0
2000  | 10000 | 10000 | 265   | 0
5000  | 10000 | 10000 | 63    | 0

Panel 7: Mixture of three normal distributions
Sample size | (4, 4) | (3, 3) | (2, 2) | (1, 1)
100   | 9968  | 9560  | 5589  | 550
200   | 10000 | 9951  | 6589  | 181
500   | 10000 | 10000 | 8252  | 17
1000  | 10000 | 10000 | 9450  | 1
2000  | 10000 | 10000 | 9922  | 0
5000  | 10000 | 10000 | 10000 | 0
Appendix B
Proof of Theorem 2: We write $T_n = g(\hat\mu_2, \hat\mu_6)$ with $g(a, b) = b/a^3$. By Lemma 2, $\sqrt{n}\big((\hat\mu_2, \hat\mu_6) - (\mu_2, \mu_6)\big)$ converges to a bivariate normal distribution with covariance matrix $\Sigma$; the existence of $\mu_{12}$ guarantees finite variances of the sample moments involved. The delta method applied to $g$, with $\nabla g = \left(-\frac{3\mu_6}{\mu_2^4},\ \frac{1}{\mu_2^3}\right)'$, then yields $\sqrt{n}\left(T_n - \frac{\mu_6}{\mu_2^3}\right) \xrightarrow{d} N(0,\ \nabla g'\Sigma\nabla g)$, as claimed.