The least squares (LS hereafter) method is possibly the most popular method routinely used to estimate the underlying (regression) parameters. Stigler (1981) rightly said: "The method of least squares is the automobile of modern statistical analysis: despite its limitations, occasional accidents, and incidental pollution, it and its numerous variations, extensions, and related conveyances carry the bulk of statistical analysis, and are known and valued by nearly all." Such overwhelming popularity of the LS may be due to its simplicity, optimal properties, and robustness to distributional assumptions. Moreover, it leads to the best (minimum variance) estimator under normality; Laplace called it the "most advantageous method". However, it appears to us that this irresistible popularity of the LS may have impeded the exploration of other smooth loss functions. Comparative computational difficulty might be another reason such exploration was not favoured by pioneers such as Gauss, Laplace, and others. In contrast, a large literature has developed around non-smooth loss functions that address (outlier) robustness. Unfortunately and surprisingly, the statistical literature is somewhat mute on the possible use of smooth higher order loss functions. It is therefore pertinent to ask: are there any relative advantages in using a higher order smooth loss function compared to the omnipresent least squares? Our aim here is to study an appropriate higher order estimator and compare its efficiency against the LS. In the regression set-up, we find a significantly large and useful class of error distributions for which a higher order loss function is more efficient than the LS. In this paper, we give an empirically testable condition under which a higher order smooth loss function leads to a more efficient estimator than the LS. We also provide a simple but pragmatic decision rule to choose between the two estimators. A detailed simulation study shows the effectiveness of this decision rule.
In Section 2, we describe the model and develop the methodology needed to compare the efficiency of different loss functions. Section 3 provides various classes of error distributions which are used for the comparison of estimators. In Section 4, we provide a decision rule along with its asymptotic properties. Section 5 provides an epilogue, where we consider very general classes of parametric distributions on finite support to illustrate the enormous scope of applicability of higher order loss functions. Section 6 summarizes the results of a simulation study on mixture distributions. Section 7 gives an application to real life data. Section 8 ends with some concluding remarks and identifies possible future directions of research.
2 Model and assumptions
Consider a linear regression set up
where y is an n x 1 vector of observations, X is an n x p design matrix, beta is the vector of regression parameters, and F is the cumulative distribution function (cdf) of the error vector. We also assume the following regularity conditions:
X'X/n converges to a finite and nonsingular matrix.
Observations are independent.
In this set-up, the ordinary least squares (OLS) estimate is the best linear unbiased estimator in the sense of minimum variance. It is well-known that for the OLS estimator, under the above conditions,
Furthermore, it is clear that minimizing the sum of the k-th powers of the residuals is pointless if k is odd, and minimizing the sum of the k-th absolute powers of the residuals for odd k will not be very convenient because of the lack of differentiability. Therefore, it remains to check whether minimization for some even power other than 2 can yield a better result than least squares, at least in some cases. If so, our objective is to identify those cases. It is obvious that the corresponding estimators in such cases will be non-linear. Furthermore, when the error is normal, the best linear unbiased estimator is indeed the best unbiased estimator. Thus, for normal or near-normal errors, least squares will always be better than any other estimator. A closer look reveals that deviation from uni-modality causes the robustness properties of the LS to falter.
Studying the efficacy of higher order norm based estimators is important in its own right, not necessarily only in comparison to least squares. It opens up the possibility of considering a convex combination of loss functions of various degrees. Arthanari and Dodge (1981) consider a convex combination of two norms and study its properties. A convex combination of loss functions of different orders may lead to a more useful estimator, and the resultant estimator is expected to be robust to distributional assumptions. Such use of higher order loss functions has been attempted earlier as well. Turner (1960) heuristically touches upon the possible use of a higher order loss function in the context of estimating a location parameter. He discusses several kinds of general pdfs and advises, in the case of the double exponential, minimizing the sum of the absolute deviations; in the case of the normal, minimizing the sum of the squared deviations (least squares); and in the case of the q-th power distribution, minimizing the sum of the q-th powers of the deviations (least q-th's). Attempts have also been made to define a general class of likelihoods from which to derive parameter estimates that are robust to the distributional assumption. For example, Zeckhauser and Thompson (1970) define a general class of distributions and empirically establish its suitability.
For expository purposes, let us first consider the simple bivariate linear regression model
for i = 1, ..., n. The usual approach is to take the squared-error loss function. We obtain the OLS estimates by minimizing the sum of squared residuals with respect to the intercept and the slope. Note that these are the best estimates only in the class of linear unbiased estimators; hence there may be some nonlinear estimator with better efficiency.
In contrast, we shall take a higher order power of the residuals as our loss function and derive the corresponding estimator of the regression parameters. Our objective is to compare the two estimators and discover conditions under which the latter performs better than the former. Both of these estimators are M-estimators, so they possess properties such as consistency and asymptotic normality under some standard conditions.
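For concreteness, the two fits can be sketched as follows. Since the exact order of the higher order loss is not fixed here, the sketch assumes the fourth power of the residuals; the data-generating choices and function names are illustrative only.

```python
# Sketch: fit a simple linear regression by minimizing the sum of
# squared residuals (OLS) versus the sum of fourth-power residuals
# (an assumed choice of "higher order" loss).
import numpy as np
from scipy.optimize import minimize

def fit_by_power_loss(x, y, power):
    """Minimize sum |y - a - b*x|^power over (a, b)."""
    def loss(theta):
        a, b = theta
        return np.sum(np.abs(y - a - b * x) ** power)
    # Use the closed-form OLS solution as the starting point.
    b0 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    a0 = y.mean() - b0 * x.mean()
    return minimize(loss, x0=[a0, b0], method="Nelder-Mead").x

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 2.0 + 3.0 * x + rng.uniform(-1, 1, 200)  # thin-tailed uniform errors
beta_l2 = fit_by_power_loss(x, y, 2)
beta_l4 = fit_by_power_loss(x, y, 4)
```

Both estimators recover the true intercept and slope here; the question studied below is which one does so with smaller variance.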
For the OLS estimator,
We exhibit that the higher order estimator satisfies the following result.
Proof: For the higher order estimator, using the M-estimator property, we have
Let μr denote the r-th order central moment of the error distribution.
Consequently, we get
Also, , where
Thus, . Since
Similarly, one can simplify
Hence the proof.
Theorem 1. The estimator performs better than the OLS estimator in terms of precision iff
Proof. The proof follows by comparing (2.2) and (2.3).
For a symmetric distribution of the errors, or more generally whenever the relevant odd moments vanish, the criterion becomes
Clearly, this condition may or may not be satisfied depending on the distribution of the errors.
So far, for expository purposes, we have dealt with a simple regression framework. The scope of this paper now demands that all the above findings be presented in a more general regression set-up. The following remark is made to this end.
For a multiple linear regression model with several regressors, all the above calculations can be carried out analogously, where
Whereas matrix would be given by
and the matrix is given by,
Thus, we recover the earlier result: the higher order estimators are better than the OLS estimators iff
It may be interesting to examine the relative performance of estimators based on loss functions of different orders. The following two corollaries are presented to this end.
Corollary 1. The estimator is better than the LS iff
Proof. The proof is analogous to that of Theorem 1.
Corollary 2. One estimator is better than the other iff
Proof. The proof is analogous to that of Theorem 1.
3 OLS versus the higher order estimator for some selected distributions
In this section, we consider a few important parametric error distributions to illustrate the vast scope of applicability of higher order loss functions. The list of distributions considered is in no way exhaustive, but it certainly shows the immense opportunity for applications in diverse areas. We now check whether the aforementioned condition (2.4) holds for different error distributions.
3.0.1 U-Shaped Distribution
Consider a simple U-shaped distribution:
Note that It is easy to calculate
Note that for and in the limit,
A U-shaped distribution has two modes and can be looked upon as a mixture of two (J-shaped) distributions: an extremely positively skewed one and an extremely negatively skewed one. One popular applied example of a U-shaped distribution is the number of deaths at various ages. Several more examples can be found in B. S. Everitt (2005).
3.0.2 Uniform Distribution
This is a symmetric distribution. Here
the higher order estimator is better than the OLS estimator when the error component has a uniform distribution.
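This claim can be checked from standard M-estimation theory. Assuming (as elsewhere in these sketches) that the higher order loss is the fourth power of the residuals, the estimating function is psi(e) = 4e^3, and the asymptotic variance factor E[psi^2]/(E[psi'])^2 works out to mu6/(9*mu2^2), against mu2 for OLS:

```python
# Asymptotic variance factors under a uniform(-a, a) error, assuming a
# fourth-power loss for the "higher order" estimator (the order is an
# assumption of this sketch, not taken from the paper's notation).
def avar_ols(mu2):
    # OLS: psi(e) = 2e, so E[psi^2]/(E[psi'])^2 = 4*mu2/4 = mu2
    return mu2

def avar_l4(mu2, mu6):
    # fourth-power loss: psi(e) = 4e^3, E[psi^2] = 16*mu6,
    # E[psi'] = 12*mu2, giving mu6 / (9 * mu2^2)
    return mu6 / (9.0 * mu2**2)

a = 1.0
mu2 = a**2 / 3.0   # second central moment of uniform(-a, a)
mu6 = a**6 / 7.0   # sixth central moment of uniform(-a, a)
# avar_l4 = a^2/7 < a^2/3 = avar_ols: the fourth-power loss is more
# than twice as efficient under uniform errors
```

The same variance formulas also reproduce the opposite conclusions reached below for normal and Laplace errors.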
3.0.3 Normal Distribution
This is again a symmetric distribution, where
Hence, for normally distributed errors, the OLS estimator is always preferred over the higher order estimator.
3.0.4 Laplace Distribution
Here we have
if r is even, and, consequently,
Hence, when the error follows a Laplace distribution, the OLS estimator is preferred over the higher order estimator.
3.0.5 Beta Distribution
The beta distribution has two shape parameters, which appear as exponents of the random variable and control the shape of the distribution. This class of distributions includes a variety of symmetric, bell-shaped, positively skewed, negatively skewed, uniform, and 'U-shaped' distributions. The general forms of the central moments of the beta distribution are quite complicated, so we start with the raw moments and obtain from them the quantities needed for the comparison. Note that, here
Figure 4, depicted in Section 5, exhibits a huge range of parameter values for which the higher order estimator is better than the OLS.
3.0.6 Gaussian mixture distribution
Suppose that the error follows a two-component Gaussian mixture. We assume, for simplicity, a common scale parameter for both components. Here
where and is the th central moment of distribution, . Let . Then (2.5) reduces to
where the standardized distance between the component means enters. The left hand side expression of (3.6), which is a function of this distance, is plotted in Figure 1. From the plot we can see that beyond a point this function assumes values less than zero and then decreases rapidly. This means that condition (2.5) for the superiority of the higher order estimators will be satisfied if the means of the two components of the mixture distribution are sufficiently far apart. A mixture of more than two Gaussian distributions will behave similarly with respect to this condition. The case of unequal component scales can also be tackled similarly.
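The even central moments of such a mixture are available in closed form, which makes the comparison easy to check numerically. A minimal sketch (the particular component means and scales are illustrative):

```python
# Even central moments of the equal-weight mixture
# 0.5*N(mu, s^2) + 0.5*N(-mu, s^2).  The mixture mean is 0, so central
# moments equal raw moments and, by symmetry, coincide with the even
# raw moments of a single N(mu, s^2) component.
import numpy as np

def mixture_even_moments(mu, s):
    m2 = mu**2 + s**2
    m4 = mu**4 + 6 * mu**2 * s**2 + 3 * s**4
    m6 = mu**6 + 15 * mu**4 * s**2 + 45 * mu**2 * s**4 + 15 * s**6
    return m2, m4, m6

# Sanity check against a Monte Carlo sample with mu = 2, s = 1
rng = np.random.default_rng(1)
signs = rng.choice([-1.0, 1.0], size=500_000)
x = signs * 2.0 + rng.normal(0.0, 1.0, 500_000)
m2, m4, m6 = mixture_even_moments(2.0, 1.0)
```

For well-separated components (here mu = 2, s = 1) the sixth moment is small relative to the cube of the variance, which is the direction in which the higher order estimator gains, consistent with the plot in Figure 1.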
3.0.7 Truncated Normal distribution
Consider (for simplicity) a standard normal distribution truncated on both sides. The even order moments are
Now it is easy to calculate and
Now one can see that, as long as the truncation point lies within a certain range, the higher order estimator performs better than the OLS. One implication of this result is that 97 percent of the time the higher order estimator performs better than the OLS.
3.0.8 Raised cosine distribution
If the error follows a raised cosine distribution with given location and scale parameters, then the probability density function (pdf) is given by
The form of this distribution resembles that of a normal distribution except that it has finite tails. Suppose it can be assumed that the systematic errors lie in some known interval, and the manufacturer has aimed to make the device as accurate as possible; in such circumstances, the raised cosine distribution may be appropriate. Another popular application is to circular data; see Rinne (2010, p. 116). Other properties, such as the cdf, moment generating function (mgf), characteristic function, raw moments up to order 4, and the kurtosis, are available in Rinne (2010, pp. 116-118). The distribution has a kurtosis of 2.1938, less than that of the normal distribution, so it has a thin tail. Here, using the mgf, we have
Hence, for this distribution, condition (2.5) is satisfied, and consequently the higher order estimator is preferred for parameter estimation.
3.1 A Sub-Gaussian family of distributions
The sub-Gaussian family is a well-studied family of distributions whose tails are dominated by those of the normal distribution. Since we observed that the higher order estimators are preferred for distributions with tails thinner than the normal, here we discuss a relatively uncommon distribution and the validity of the condition for it. Consider a distribution with pdf of the form
where is an integer and is the normalizing constant, which gives
For various values of the exponent, the pdf of the distribution is drawn in Figure 3. One value of the exponent recovers the normal curve. As the exponent becomes larger and larger, the tails of the distribution tend to collapse; for extremely large values, the distribution resembles a symmetric curve on a finite support. It is interesting to consider the peaks of all the drawn curves. The first plot (the density plot) of this panel shows that the higher order estimator performs better than the OLS for all those curves whose peaks lie below the red curve, the red curve being drawn for a boundary value of the exponent. In the second plot of the panel, values of the exponent are given on the horizontal axis and values of the test statistic on the vertical axis; the horizontal reference line shows the cut-off point, which is 1. The second plot shows that when the exponent exceeds 1.45, the higher order estimator performs better than the OLS.
Here the r-th moment vanishes if r is odd. When r is even, we have
We immediately get
The values of the test statistic against various values of the exponent are drawn in the bottom part of Figure 3. We observe that beyond the threshold the statistic assumes values less than 1.
Now, according to Zeckhauser and Thompson (1970), Turner (1960), and Box and Tiao (1962), for a distribution with pdf
where the exponent is positive, the estimator whose loss power matches the exponent dominates all estimators based on other powers.
Note that, for suitable parameter values, we obtain the distribution displayed in Figure 3, so the higher order estimator performs better than the corresponding least squares estimator, which is entirely in agreement with what we derived above. The exponent 2 corresponds to the normal distribution; the exponent 1 gives the double exponential; and as the exponent tends to infinity, the distribution tends to the rectangular. Zeckhauser and Thompson (1970) consider four empirical examples and find a sizable gain in likelihood if the exponent is estimated rather than pre-specified as equal to 2. All the evidence they found leads them to conclude that if accurate estimation of a linear regression line is important, it will usually be desirable to estimate not only the coefficients of the regression line but also the parameters of the power distribution that generated the errors about it. The effect on the estimates of the regression coefficients may not be small.
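This power family is available in scipy as the generalized normal distribution, with density proportional to exp(-|x|^beta); the parameterization used here is scipy's, which may differ by scale from the paper's. A quick check of the kurtosis ratio across exponents:

```python
# Kurtosis ratio mu4/mu2^2 of the power (exponential-power) family,
# density proportional to exp(-|x|^beta), via scipy's gennorm.
from scipy.stats import gennorm

def kurtosis_ratio(beta):
    """mu4 / mu2^2 for the generalized normal with shape beta."""
    d = gennorm(beta)
    return d.moment(4) / d.moment(2) ** 2

# beta = 2 is the normal (ratio 3), beta = 1 the double exponential
# (ratio 6); large beta approaches the rectangular (ratio 1.8)
ratios = {b: kurtosis_ratio(b) for b in (1.0, 2.0, 4.0, 8.0)}
```

The ratio falls below the normal value 3 as soon as the exponent exceeds 2, i.e. precisely in the thin-tailed regime where the higher order loss was seen to dominate.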
In the next two sections we construct a decision rule based on condition (2.4) and carry out a simulation study.
4 Decision rule: OLS versus the higher order estimator
In this section, we derive a decision rule based on the criterion from Section 2 to decide whether the OLS or the higher order estimator is preferred for a given data set.
Lemma 2. Suppose the error follows a distribution for which the relevant moments exist. Then
Proof: Observe that
Now all the terms other than the first and second term of (4.7) are of the order because is , and hence , is . Also . Therefore
Furthermore, by delta method,
Then, we have the following Theorem.
Theorem 2. Let . Suppose exists for distribution of . Then
The proof of this theorem is given in Appendix B.
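A plug-in version of such a rule can be sketched from sample moments. Both the fourth-power order and the moment cutoff used below are reconstructions from standard M-estimator variance formulas for the symmetric case, not the paper's statistic verbatim:

```python
# Sketch of a moment-based decision rule for symmetric errors: prefer
# the fourth-power loss when the sample analogue of mu6 < 9*mu2^3
# holds (the cutoff follows from comparing the asymptotic variances
# mu6/(9*mu2^2) and mu2; it is an assumed reconstruction).
import numpy as np

def prefer_higher_order(residuals):
    e = np.asarray(residuals) - np.mean(residuals)
    m2 = np.mean(e**2)
    m6 = np.mean(e**6)
    return m6 < 9.0 * m2**3  # True: prefer the higher order loss

rng = np.random.default_rng(3)
uniform_case = prefer_higher_order(rng.uniform(-1, 1, 100_000))
normal_case = prefer_higher_order(rng.normal(0.0, 1.0, 100_000))
```

On large samples this rule favours the higher order loss under uniform errors and the OLS under normal errors, matching the conclusions of Section 3.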
5 An Epilogue
In this section, we consider a very general class of parametric distributions on finite support (this assumption is made to ease the drawing of plots) to illustrate the enormous scope of applicability of higher order loss functions. Consider the class of distributions:
Here the normalizing constant depends on the shape parameter so as to make the function a density. The first plot of the panel depicts the shape of the density for different values of this parameter; depending on its value, the class includes various 'U-shaped' and 'bell-shaped' distributions. It is interesting to consider the peaks of all the drawn curves. The first plot (the density plot) of this panel shows that the higher order estimator performs better than the OLS for all those curves whose peaks lie below the red curve, the red curve being drawn for a boundary value of the parameter. In the second plot of the panel, values of the parameter are given on the horizontal axis and values of the test statistic on the vertical axis. The second plot shows that when the parameter value exceeds -3.2, the higher order estimator performs better than the OLS, for all shapes going from a deep trough to a hump until a certain level is reached.
Plots of another parametric family of distributions (belongs to the Pearsonian family, Type II), given by
are shown below. It may be noted that this particular distribution is linked to a standard distribution as well, via a suitable change of variable.
To see the equivalence, set and
Figure 5 depicts the same features as Figure 4. Depending on the parameter value, this family of distributions exhibits various 'U-shaped' and 'bell-shaped' forms. It is interesting to consider the peaks of all the drawn curves. The first plot (the density plot) of this panel shows that the higher order estimator performs better than the OLS for all those curves whose peaks lie below the red curve, the red curve being drawn for a boundary value of the parameter. The second plot of the panel shows that when the parameter value exceeds 3.5, the higher order estimator performs better than the OLS, for all shapes going from a deep trough to a hump until a certain level is reached.
These two distributions illustrate the enormous possibilities for the use of higher order loss functions. Future studies will investigate whether this is a general phenomenon for other Pearsonian families of error distributions.
6 Simulation study
We carry out the decision making procedure under the 0-1 loss function and calculate the risk function, which is the expected loss. We generate data from three types of distributions: one for which the higher order estimator is always better than the OLS estimator, one where the OLS is better than the higher order estimator, and a third which is near the boundary. The values of the calculated risk are given in Table 1.
Table 1 given in Appendix A should be here
The simulation study is based on 10,000 iterations, with sample sizes of 100, 200, 500, 1000, 2000, and 5000.
The first panel of Table 1 is based on a mixture of two t distributions with 6 degrees of freedom (df) each, with the component means set at the values shown in the table. It may be mentioned here that our test needs the existence of 6th order moments; to this end, we need t distributions with at least 7 df. The case df = 6 is considered precisely to examine the performance of our decision rule even when the required moments do not exist. The mixture coefficients are drawn at random. From this part of the table, it is clear that the decision becomes more certain as the sample size increases; more importantly, it is so when the distance between the two components is larger.
The second panel of the Table is based on a mixture of two t distributions with 10 df each; here the findings corroborate those of the first panel. The third panel is based on a mixture of two t distributions with 20 df each.
The 4th panel of the table is based on a mixture of two asymmetric beta distributions, with mixture coefficients again drawn at random. The fifth panel is based on two symmetric beta distributions with random weights. The first column in Panel 5 needs special attention: the parameter combination is chosen to lie in the neighborhood of the boundary of the test statistic. Here, as expected, the test does not favour either estimator (for the large sample case, n = 5000), and the risk is near 50%.
The 6th and 7th panels of the Table are based on mixtures of two and three normal distributions, respectively. Here also the findings are along expected lines: as the sample size increases, the test correctly discriminates between the two estimators.
7 Empirical Illustration
In this section, we provide two illustrations: one based on a constructed data set that resembles many real life scenarios, and a second based on a real life data set.
7.1 Constructed Example
Data often contain rounding errors. Variables that are by their very nature continuous (like height or weight, age in years, or birth weight in ounces) are nevertheless typically measured in a discrete manner. People feel more comfortable reporting their age as mid-forties, mid-fifties, and so on. Values are rounded to a certain level of accuracy, often to some preassigned decimal point of a measuring scale (e.g., to multiples of 10 cm, 1 cm, or 0.1 cm), or simply reflect our preference for some numbers over others. The reason may be the avoidance of costs associated with a fine measurement or the imprecise nature of the measuring instrument. The German military, for example, measures the height of recruits to the nearest 1 cm. Even if precise measurements are available, they are sometimes recorded in a coarsened way in order to preserve confidentiality or to compress the data into an easy-to-grasp frequency table.
Here we consider a linear regression in which the dependent variable is rounded to the nearest integer, while the independent variables are free of any such errors. The dependent variable is generated as
However, assume that we do not observe the true response but only its rounded version.
Now we regress the observed (rounded) response on the regressors. For this example, we set a moderate sample size of 40 and consider 5000 replications. The output is summarized as follows:
Table 2: Average Estimates Based on the Constructed Data.
|Estimates||Intercept = 5.5||Slope 1 = 1||Slope 2 = 2|
It is observed that, based on our proposed decision rule, the higher order estimator is preferred over the OLS 90 percent of the time.
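The data-generating scheme for this example can be sketched as follows. The coefficient values (intercept 5.5, slopes 1 and 2) follow Table 2; the particular regressor and error distributions chosen below are assumptions of the sketch:

```python
# Sketch of the constructed example: a linear model whose response is
# rounded to the nearest integer before estimation.
import numpy as np

rng = np.random.default_rng(42)
n = 40
x1 = rng.uniform(0.0, 10.0, n)
x2 = rng.uniform(0.0, 10.0, n)
y_latent = 5.5 + 1.0 * x1 + 2.0 * x2 + rng.normal(0.0, 1.0, n)
y_obs = np.round(y_latent)  # only the rounded response is observed

X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y_obs, rcond=None)
```

Rounding adds a small bounded error to the response, so the coefficient estimates remain close to the true values; the question studied in the text is which loss function handles this induced error distribution more efficiently.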
After estimation of the model, it may be of interest to know which set of estimators provides the best fit. In the present context it is a tricky problem to find an appropriate 'goodness of fit' measure. Likelihood based methods are not tenable. Similarly, the residual sum of squares and related measures are not useful for comparing the performances of these two sets of parameter estimates. Here we suggest applying the idea of the pseudo R-squared (see Cameron and Trivedi, 2005, page 311, for details).
Let Q denote the objective function being maximized, Q0 its value in the intercept-only model, Qfit its value in the fitted model, and Qmax the largest possible value of Q. Then the maximum potential gain in the objective function resulting from the inclusion of regressors is Qmax - Q0, and the actual gain is Qfit - Q0. This suggests the measure
where the subscript RG means relative gain. Note that, for least squares, the largest possible objective value is zero. For both the loss functions,
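The relative-gain measure can be written directly in code. Under a power loss the objective being maximized is the negative sum of absolute residuals raised to the power, so a perfect fit attains zero and the intercept-only value comes from optimizing a single constant; the function name below is illustrative:

```python
# Relative-gain pseudo R^2 in the spirit of Cameron and Trivedi:
# R2_RG = (Q_fit - Q_0) / (Q_max - Q_0) for the objective
# Q = -sum |e|^power, for which Q_max = 0.
import numpy as np
from scipy.optimize import minimize_scalar

def pseudo_r2_rg(residuals, y, power):
    """Relative-gain pseudo R^2 for the loss sum |e|^power."""
    q_fit = -np.sum(np.abs(residuals) ** power)
    # intercept-only value: best constant fit under the same loss
    q0 = -minimize_scalar(lambda c: np.sum(np.abs(y - c) ** power)).fun
    q_max = 0.0
    return (q_fit - q0) / (q_max - q0)
```

For power 2 this reduces to the classical R-squared, which makes the measure comparable across the two loss functions.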
We also calculated the number of times the pseudo R-squared for the higher order estimator is numerically greater than that for the OLS. It is astonishing to see that this happens 100 percent of the time.
7.2 Real Life Example
For our empirical analysis, we use the data provided by the National Sample Survey Organisation of India, viz. the NSSO 68th round all-India unit level survey on consumption expenditure (Schedule 1.0, Types 1 and 2), conducted from July 2011 to June 2012. This dataset is a nationally representative sample of household and individual characteristics based on a stratified sampling of households. For this round, the dataset comprises 1,68,880 household level observations. The dataset provides a detailed list of various household and individual specific characteristics along with the consumption expenditures of the households. In addition, data are provided on the households' localization, which includes the sector (Rural or Urban), the district, and the state/union territory (henceforth, the union territories will be referred to as states). For our analysis, we use the amount of land possessed (in logarithm form) by the households as our principal (dependent) variable alongside various demographic variables as controls (independent variables). The kernel density plot clearly suggests that the amount of land possessed by rural households has a bimodal distribution. [Footnote 2: The same phenomenon is also seen for all-India households (rural and urban together). The plot presented here is for rural households excluding those with no land. It is interesting to note that bi-modality is observed both for (1) households with a non-zero amount of land and (2) all households. All the results presented here are based on rural households with non-zero land. The number of rural households with a non-zero amount of land is 98,483, and the whole study is based on this set of 98,483 observations.] The plot clearly indicates that India is suffering from a "vanishing middle-class syndrome": only marginal and rich farmers remain. The empirical analysis demands some routine and rudimentary summary statistics, as given in Table 3.
We regress the amount of land possessed (in logarithm) on six explanatory variables [Footnote 3: We also tried many other explanatory variables available in our master file, and we repeated the same exercise for all-India (rural and urban together) households. It is needless to mention that the overall findings are the same across all models we attempted.], viz., the median age of the household (mage), the number of children below 15 years of age (chlt15), the number of old people above 60 years of age (Ogt60), the number of male members in the household (male), the number of female members in the household (female), and finally the number of members with education level above the 10th standard (highedu). We then estimate [Footnote 4: In this paper we do not pursue the endogeneity issue, if any.] the linear regression model based on both loss functions. The estimated results are summarized below:
Table 3: Summary Statistics.
|Note: (i) All the results presented here are based on rural households with non-zero land; the number of such households is 98,483. (ii) This table is based on non-logarithm data.|
Model Estimates and Standard Errors.
|(SE, least squares based)||(SE, higher order based)|
|(0.0342302942 )||(0.0120790850 )|
|(0.0082999771 )||(0.0029288714 )|
|( 0.0135886351)||(0.0047951174 )|
|(0.0064274676 )||(0.0022681058 )|
|(0.0071582759 )||(0.0025259912 )|
|(0.0074214316 )||(0.0026188528 )|
(i) All the results presented here are based on rural households with non-zero land; the number of such households is 98,483, and the whole study is based on this set of observations. The logarithm transformation is taken to reduce the degree of heteroscedasticity. (ii) Standard errors are provided in parentheses. (iii) Least-squares based estimates are used as initial estimates for the higher order estimation. (iv) 'Rootsolver' in R is used to obtain the estimates.
Standard errors (SE) of the parameter estimates are provided in parentheses. Observe, as expected, that the SEs of the higher order estimates are significantly and uniformly smaller than those of the least squares based estimates. The value of our proposed test statistic is 5.30311871, which lies outside the 95 per cent confidence interval (8.90603838, 9.09396162), suggesting that the higher order estimates are more efficient than the least squares estimates. The pseudo R-squared for the higher order estimator is 0.14779372, and that for the least squares is 0.09794736. The pseudo R-squared also clearly suggests the supremacy of the higher order estimator over the least squares.
8 Concluding remarks
This paper tried to answer an inevitable question: does a higher order loss function based estimator perform better than the omnipresent least squares? Every teacher and student faces this question in the first class on regression analysis. We have tried to show that, in several real life situations, a smooth higher order loss function based estimator may be more efficient than the universal least squares. It is true that least squares has one unassailable advantage: its simplicity. It may also be computationally less intensive. However, with the advent of modern computing power, computational issues are hardly relevant.
Further work may proceed in the following directions. A generalized version of the condition, similar to the one derived in Section 4, may be useful for comparing estimators of different orders. This may be obtained following a similar approach, i.e., by obtaining the variances of the two estimators using the expression for the variance of M-estimators and comparing them. However, the estimation of higher moments may have an impact on the performance of the proposed decision rule. It may be useful to study the impact of outliers on the parameter estimates coming from higher order loss functions. Comparison of the breakdown points of the estimators may be very useful. It may also be interesting to find robust standard errors for the higher order estimator in more general set-ups.
It may be extremely useful to consider a convex combination of loss functions of various degrees. Arthanari and Dodge (1981) consider a convex combination of two norms and study its properties. Such a convex combination may lead to a more useful estimator, and the resultant estimator is expected to be robust to distributional assumptions. Such a combination may answer the omnipresent question: what is the optimal loss function for a given data set? The choice and design of loss functions is important in any practical application (see Hennig and Kutlukaya, 2007). Future research will shed light in this direction.
|Mixture of T-distribution||(5, -5)||(4,- 4)||(3, -3)||(2, -2)|
|Mixture of T-distn (df=6)||9940||9834||9200||5896|
|100||Mixture of T-distn (df=10)||10000||9986||9860||8092|
|100||Mixture of T-distn (df=20)||10000||9999||9981||9155|
|Mixture of Beta-distn (asym)||(4,10; 10,4)||(1,4; 4,1)||(2,4; 4,2)||(3,4; 4,3)|
|Mixture of Beta-distn (sym)||(4,4; 4,4)||(3,3; 3,3)||(2,2; 2,2)||(1,1; 1,1)|
|Mixture of two normal distributions||(3, -3)||(2, -2)||(1, -1)||(0, 0)|
|Mixture of three normal distributions||(4, -4)||(3, -3)||(2, -2)||(1, -1)|
Proof of Theorem 2: We write