To evaluate point forecasts, one commonly uses a scoring function, also called a loss function, which measures the inaccuracy of the forecast relative to an observed outcome. Loss functions are also used in estimation, forecast ranking and comparison, model selection, and back-testing (Gneiting and Raftery, 2007; Gneiting, 2011). In all of these applications, given a target statistical functional, we desire a consistent loss function, meaning that the correct value of the functional is the Bayes act with respect to the loss. In this case, we say the loss function elicits the functional.
While many common statistics are elicitable, such as the mean, median, and quantiles, it is well-known that the variance is not. This impossibility follows from an observation of Osband (1985) that elicitable functionals have convex level sets, meaning mixtures of distributions with the same functional value must again have the same functional value, an axiom which the variance does not satisfy. (Indeed, mixtures generally have higher variance.) Nonetheless, several authors have pointed out that the variance is indirectly
elicitable: one may elicit the first and second moment of the distribution, and then combine these values with a link function to obtain the variance. The minimum number of dimensions required in such an indirect elicitation scheme (for the variance, 2) is referred to as the elicitation order, or elicitation complexity, of the functional in question (Lambert et al, 2008). Recently, several important non-elicitable functionals, including many risk measures such as conditional value at risk, have been shown to be indirectly elicitable with low elicitation complexity (Lambert et al, 2008; Frongillo and Kash, 2015; Fissler et al, 2016).
Heinrich (2014) recently showed that another common statistic, the mode functional, is not elicitable, despite the fact that its level sets are convex. It is therefore natural to ask whether the mode is indirectly elicitable, and if so, determine its elicitation complexity. Our main result is that the mode has infinite elicitation complexity with respect to identifiable functionals, a relatively weak restriction (see Definitions 3 and 4 and the discussion following). Interestingly, our results also extend to modal intervals, which are elicitable, as we discuss in Section 4.
Our results show that it is impossible to develop a consistent loss function for evaluating point forecasts of the mode, even indirectly. Moreover, they cast doubt on the existence of broadly effective empirical risk minimization schemes for estimating the mode or a modal interval. Our techniques differ from previous work (Frongillo and Kash, 2015), and may be applicable to other functionals of interest. We conclude with open questions, including a discussion of other notions of elicitation complexity and other properties.
Let $\mathcal{P}$ be a set of probability measures on a common measurable space $(\Omega, \mathcal{F})$. For each probability measure $P \in \mathcal{P}$, denote the expectation of a random variable $Y$ with distribution $P$ by $\mathbb{E}_P[Y]$. We will use the term “property” to refer to a statistical functional taking values in a report space $\mathcal{R}$, often a subset of $\mathbb{R}$ or $\mathbb{R}^k$.
Definition 1 (Property)
A property is a functional $\Gamma : \mathcal{P} \to \mathcal{R}$ which assigns a report value to each probability measure in $\mathcal{P}$.
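For example, considering probability measures on the measurable space $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, with $\mathcal{B}(\mathbb{R})$ being the Borel $\sigma$-algebra on $\mathbb{R}$, the mean $P \mapsto \mathbb{E}_P[Y]$ is a real-valued property. Similarly, another real-valued property is the variance, $\mathrm{Var}(P) = \mathbb{E}_P[Y^2] - \mathbb{E}_P[Y]^2$. Our focus in this paper will be the mode, which will be defined with care at the end of this section.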
We next formalize our notion of consistency, which ensures that the Bayes act for a loss function coincides with the desired property.
Definition 2 (Elicits)
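A loss function $L(r, y)$ elicits a property $\Gamma$ if for every $P \in \mathcal{P}$ we have $\Gamma(P) = \arg\min_{r} \mathbb{E}_P[L(r, Y)]$. We say $\Gamma$ is elicitable if there exists some loss function that elicits $\Gamma$. For all $k \in \mathbb{N}$ the set of elicitable properties $\Gamma : \mathcal{P} \to \mathbb{R}^k$ will be denoted $\mathcal{E}_k(\mathcal{P})$.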
For example, the mean is elicited by squared loss $L(r, y) = (r - y)^2$.
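As a quick numerical sanity check of this example, the sketch below minimizes the empirical squared loss over a grid of candidate reports and compares the minimizer with the sample mean. The sample, the grid, and the helper names (`squared_loss`, `empirical_risk`) are illustrative assumptions and do not come from the original text.

```python
import random

def squared_loss(report, outcome):
    """Squared loss L(r, y) = (r - y)^2."""
    return (report - outcome) ** 2

def empirical_risk(loss, report, sample):
    """Average loss of a fixed report over the sample."""
    return sum(loss(report, y) for y in sample) / len(sample)

# Illustrative sample (assumption): 500 draws from a Gaussian.
rng = random.Random(0)
sample = [rng.gauss(2.0, 1.5) for _ in range(500)]
sample_mean = sum(sample) / len(sample)

# Candidate reports on a grid with spacing 0.01 covering [-2, 6].
grid = [i / 100 for i in range(-200, 601)]
best_report = min(grid, key=lambda r: empirical_risk(squared_loss, r, sample))

print(f"sample mean = {sample_mean:.4f}, grid minimizer = {best_report:.2f}")
```

Because the empirical squared-loss risk is a parabola in the report centered at the sample mean, the grid minimizer is the grid point nearest to the sample mean.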
The following concept of identifiability, due to Osband (1985), has played a central role in the theory of property elicitation. The definition states that each level set of the property, that is, the set of distributions sharing a particular value of the property, can be described by a linear constraint which depends on . Note that Steinwart et al. (2014) adopt a weaker notion of identifiability, wherein the condition need only hold for almost every level set, and call the definition below “strong identifiability”; see Section 5.
Definition 3 (Identification)
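A property $\Gamma : \mathcal{P} \to \mathbb{R}^k$ is identifiable if there exists an identification function $V(r, y)$ taking values in $\mathbb{R}^k$ such that for all $P \in \mathcal{P}$ and all reports $r$ we have $\Gamma(P) = r$ if and only if $\mathbb{E}_P[V(r, Y)] = 0$. Let $\mathcal{I}_k(\mathcal{P})$ denote the class of all properties from $\mathcal{P}$ to $\mathbb{R}^k$ which are identifiable.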
To illustrate, the mean is identified by the function $V(r, y) = r - y$.
Let us return to the notion of elicitation, and consider the variance. As observed by Osband (1985), for a property to be elicitable it must have convex level sets: the set of distributions having the same property value must be convex. It follows immediately that the variance is not elicitable. As noted in the introduction, however, the variance can be expressed as a function, or link, of elicitable properties, for example the mean and second moment: $\mathrm{Var}(P) = \mathbb{E}_P[Y^2] - \mathbb{E}_P[Y]^2$. This motivates the notion of indirect elicitability, wherein one elicits an intermediate property, and then computes a link function to obtain the original property. When confronted with non-elicitable properties, it is therefore natural to ask the minimal dimension of such an intermediate elicitable property; this is the notion of elicitation complexity (Lambert et al, 2008; Frongillo and Kash, 2015; Fissler et al, 2016). As we explain following the definition, we further require that these intermediate properties be identifiable.
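To make the failure of convex level sets concrete, the following sketch mixes two discrete distributions that both have variance 1 and computes the variance of the mixture exactly. The two distributions and the helper names are illustrative choices, not taken from the original text.

```python
from fractions import Fraction

def mean(dist):
    """Mean of a finite distribution given as {outcome: probability}."""
    return sum(p * y for y, p in dist.items())

def variance(dist):
    """Variance of a finite distribution given as {outcome: probability}."""
    m = mean(dist)
    return sum(p * (y - m) ** 2 for y, p in dist.items())

def mixture(dist_a, dist_b, weight_a):
    """Mixture weight_a * dist_a + (1 - weight_a) * dist_b."""
    mixed = {}
    for y, p in dist_a.items():
        mixed[y] = mixed.get(y, 0) + weight_a * p
    for y, p in dist_b.items():
        mixed[y] = mixed.get(y, 0) + (1 - weight_a) * p
    return mixed

half = Fraction(1, 2)
dist_a = {-1: half, 1: half}   # uniform on {-1, 1}, variance 1
dist_b = {9: half, 11: half}   # uniform on {9, 11}, variance 1
mixed = mixture(dist_a, dist_b, half)

print(variance(dist_a), variance(dist_b), variance(mixed))  # prints: 1 1 26
```

Both components lie in the level set of variance 1, but their equal-weight mixture has variance 26, so that level set is not convex.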
Definition 4 (Identifiable Elicitation Complexity)
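Let $\mathcal{I}$ be the class of identifiable properties. For $k \in \mathbb{N}$, a property $\Gamma$ is $k$-elicitable with respect to $\mathcal{I}$ if there exists an elicitable property $\hat{\Gamma} \in \mathcal{I}$ taking values in $\mathbb{R}^k$ and a function $f$ such that $\Gamma = f \circ \hat{\Gamma}$. The identifiable elicitation complexity of $\Gamma$ is then the minimum of all $k$ such that $\Gamma$ is $k$-elicitable with respect to $\mathcal{I}$.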
Without imposing such a restriction on the class of intermediate properties, the definition of elicitation complexity would be trivial, as noted by Frongillo & Kash (2015): all properties of distributions on have complexity 1 by first eliciting the entire distribution via set-theoretic bijections between and (see also the discussion following Corollary 1). To justify the restriction to identifiability in particular, first note that nearly all natural elicitable properties are identifiable, including expectations, ratios of expectations, quantiles, and expectiles. Second, the results of Lambert (2018) and Steinwart et al. (2014) show that continuous non-locally-constant functionals are elicitable if and only if they are weakly identifiable, meaning identifiability is essentially necessary for continuous non-locally-constant properties in Definition 4. Third, elicitable properties which are not identifiable are often indirectly elicitable via finite-dimensional identifiable properties, as is the case for all finite elicitable properties (those taking values in a finite set); this observation is particularly relevant as we give infinite lower bounds.
Returning to the example of the variance, we see that while it is not elicitable, its identifiable elicitation complexity is at most $2$. The variance can be recovered via the function $f(r_1, r_2) = r_2 - r_1^2$ composed with the identifiable and elicitable vector-valued property $\hat{\Gamma}(P) = (\mathbb{E}_P[Y], \mathbb{E}_P[Y^2])$. In this case the identification function for $\hat{\Gamma}$ is $V(r, y) = (r_1 - y, r_2 - y^2)$ where $r = (r_1, r_2)$. There is a distinction between a property which is elicitable like the mean, $\mathbb{E}_P[Y]$, and a property which is $1$-elicitable like the mean squared, $\mathbb{E}_P[Y]^2$. While every elicitable real-valued property is trivially $1$-elicitable via the identity function, not every $1$-elicitable property is elicitable. The mean squared fails to be elicitable, but is $1$-elicitable.
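The two-step recipe for the variance can be sketched numerically as follows. The sample and the function names are illustrative assumptions. The sketch checks that the pair (sample mean, sample second moment) zeroes the empirical expectation of the identification function for the mean and second moment, and that the link $r_2 - r_1^2$ then returns the (population-style) sample variance.

```python
import random
import statistics

def identification(report, outcome):
    """V(r, y) = (r1 - y, r2 - y^2) for the pair (mean, second moment)."""
    r1, r2 = report
    return (r1 - outcome, r2 - outcome ** 2)

def link(report):
    """Link f(r1, r2) = r2 - r1^2 mapping (mean, second moment) to variance."""
    r1, r2 = report
    return r2 - r1 ** 2

# Illustrative sample (assumption): 1000 draws from a Gaussian.
rng = random.Random(1)
sample = [rng.gauss(1.0, 2.0) for _ in range(1000)]
n = len(sample)

# Intermediate two-dimensional report: empirical mean and second moment.
report = (sum(sample) / n, sum(y * y for y in sample) / n)

# Empirical expectation of each component of V at this report (both ~ 0).
avg_v = [sum(identification(report, y)[i] for y in sample) / n for i in range(2)]

variance_via_link = link(report)
print(avg_v, variance_via_link, statistics.pvariance(sample))
```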
Finally, we define identification complexity, which trivially lower bounds identifiable elicitation complexity, a fact we use extensively in our results.
Definition 5 (Identification Complexity)
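For $k \in \mathbb{N}$, a property $\Gamma$ is $k$-identifiable if there exists an identifiable property $\hat{\Gamma}$ taking values in $\mathbb{R}^k$ and a function $f$ such that $\Gamma = f \circ \hat{\Gamma}$. Furthermore, the identification complexity of $\Gamma$ is the minimum of all $k$ such that $\Gamma$ is $k$-identifiable.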
Given a probability measure $P$ on the real line, consider the cumulative distribution function $F$ associated with $P$. For $\beta > 0$, a modal interval is any interval of the form $[x - \beta, x + \beta]$ to which $P$ assigns maximal probability. Let $\Gamma_\beta(P)$ denote a midpoint of a modal interval, defined as
\[ \Gamma_\beta(P) = \arg\max_{x \in \mathbb{R}} P([x - \beta, x + \beta]). \tag{1} \]
Regardless of whether the modal interval is unique we can use its midpoint to define the mode of the distribution. Suppose there exists a sequence of real numbers $\beta_n > 0$ where $\beta_n \to 0$ as $n \to \infty$ and a corresponding choice of midpoints of modal intervals $\Gamma_{\beta_n}(P)$ converging to a real number, $m$. Then $m$ is the mode of the distribution. This definition is careful not to assume that a probability density exists. In the case where the distribution function is absolutely continuous and admits a continuous density $p$, the mode coincides with the global maximum of $p$. When working with a discrete probability distribution, the mode corresponds to the point(s) associated with maximal probability.
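For an empirical distribution, a midpoint of a modal interval can be computed by a sliding-window scan, since some optimal closed interval has a sample point as its left endpoint. The sketch below is one way to do this; the name `beta` for the interval half-width, the function name, and the toy sample are assumptions made for illustration. Ties are broken toward the lowest midpoint.

```python
import bisect

def empirical_modal_midpoint(sample, beta):
    """Return (midpoint, count) for a closed interval of half-width beta
    containing the most sample points; ties go to the lowest midpoint."""
    ys = sorted(sample)
    best_mid, best_count = None, -1
    for i, left in enumerate(ys):
        # Count the points in the closed interval [left, left + 2 * beta].
        right_index = bisect.bisect_right(ys, left + 2 * beta)
        count = right_index - i
        if count > best_count:
            best_mid, best_count = left + beta, count
    return best_mid, best_count

# Toy sample (assumption): a loose cluster near 0.5 and a tighter one near 4.5.
sample = [0.0, 0.5, 1.0, 4.0, 4.25, 4.5, 4.75, 9.0]
print(empirical_modal_midpoint(sample, beta=0.5))  # prints: (4.5, 4)
```

Shrinking `beta` on a fixed sample eventually leaves windows that contain a single point, which hints at why the limit defining the mode is delicate to estimate.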
We will refer to probability measures which have a well-defined and unique mode as unimodal. If a probability measure is unimodal and there exists a probability density associated with it, the density does not necessarily have a unique local maximum, a stronger requirement. For example, a Gaussian density is unimodal in both senses of the term, whereas a mixture of Gaussians with unit variance and strictly distinct weights does not necessarily have a unique local maximum, but does have a well-defined and unique mode and thus is unimodal. (See Section 5 for a discussion of the stronger definition.)
Heinrich (2014) demonstrates that the mode is not directly elicitable with respect to several classes of unimodal probability measures. We proceed by studying the identifiable elicitation complexity of the mode. Our main results, Theorems 3.1 and 3.2, show that the identifiable elicitation complexity of the mode is infinite with respect to each of two classes of probability measures. These results imply that, when restricting to identifiable intermediate properties, the mode is not even indirectly elicitable.
To begin, let denote the class of unimodal probability measures defined on the real line which admit a smooth and bounded density. Below we will define a class of probability measures within consisting of (finite) mixtures of normalized bump functions which will be the class of probability measures employed in Lemma 1, Theorem 3.1, and Corollary 1. We will denote by the class of probability measures which can be expressed as a (finite) mixture of Gaussians, the focus of Theorem 3.2. Since each admits a unique probability density , we will identify the probability measure with its density , and use the two interchangeably. Hence, when we choose an element , we mean the probability density associated with a probability measure . Finally, and both denote the mode of the distribution as defined in Section 2 which corresponds to the global maximum of .
We define the bump function centered at of width as follows,
where . We then define the bump centered at to be the function . Note that and . Let denote the class of distributions in which are finite mixtures of bump functions in the set , i.e., of width centered at .
To build intuition, let us first see why the mode itself is not identifiable. In fact, we will establish the stronger statement that the mode is not identifiable with respect to . (See also (Fissler and Ziegel, 2017, Lemma 2.4).)
Lemma 1
The mode is not identifiable with respect to the class of unimodal probability measures defined on the real line which admit a smooth and bounded density.
Proof (of Lemma 1)
For a contradiction, suppose there exists such that is identified by . For define the density in . Clearly, , and since identifies , we thus have and . Combining,
from which we conclude and thus , a contradiction.
We now see that the mode is not identifiable, but it remains to understand its identifiable elicitation complexity. Theorem 3.1 generalizes the argument of Lemma 1, showing that the mode is not indirectly identifiable with respect to , the class of unimodal probability measures defined on the real line which admit a smooth and bounded density. In other words, for this class , there is no way to express the mode as a function of a finite-dimensional identifiable property. We conclude that the identifiable elicitation complexity of the mode is infinite with respect to .
Theorem 3.1
The mode has infinite identifiable elicitation complexity with respect to the class of unimodal probability measures defined on the real line which admit a smooth and bounded density.
We briefly outline the proof of Theorem 3.1; the full proof appears in Appendix A. Let $V$ be an identification function, which identifies some intermediate property $\hat{\Gamma}$ for some finite dimension $k$. Taking more bumps than the dimension $k$, we construct a probability density with bump heights specified by a vector, chosen so that the gap between any two bump heights is smaller than the minimum height. We observe that the expected value of the identification function is linear in the bump heights, and moreover this linear transformation is rank deficient, giving us a nontrivial vector in its kernel. By our initial choice of heights, for any such kernel vector we can find a suitable choice of coefficient so that the perturbed heights remain positive while changing the mode. After normalization, this gives us a valid density yielding zero expectation of $V$, and thus residing in the same level set of $\hat{\Gamma}$ as the original density, yet with a different mode. This contradicts the existence of a link function $f$ expressing the mode as $f \circ \hat{\Gamma}$, as $f$ would need to map the same value to two different values. Figure 1 illustrates this construction, showing the density along with a hypothetical choice of .
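The outline above can be illustrated with a small exact computation. This is only a sketch under stated assumptions, not the construction of Appendix A: the intermediate dimension is $k = 2$, there are four disjoint bumps, the matrix entries (standing in for the expectation of each component of the identification function under each bump) are made-up numbers, and only the three non-modal bumps are perturbed. All names (`tail`, `full`, `alpha`, and so on) are hypothetical.

```python
from fractions import Fraction as F

def apply(matrix, vector):
    """Matrix-vector product with exact arithmetic."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def cross(u, v):
    """Cross product; spans the kernel of a 2x3 matrix with rows u and v."""
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

# Unnormalized, strictly decreasing bump heights; bump 0 carries the mode.
# The largest gap (3) is smaller than the minimum height (7).
heights = [F(10), F(9), F(8), F(7)]

# Hypothetical expectations of the two components of V under bumps 1..3.
tail = [[F(1), F(1), F(1)], [F(1), F(2), F(4)]]
# Column for bump 0, chosen so the original density has zero expectation of V.
first_col = [-sum(a * h for a, h in zip(row, heights[1:])) / heights[0] for row in tail]
full = [[c] + row for c, row in zip(first_col, tail)]

w = cross(tail[0], tail[1])  # nontrivial kernel vector of the 2x3 matrix
# Entry of greatest magnitude (ties: larger initial height); make it positive.
i_star = max(range(3), key=lambda i: (abs(w[i]), heights[i + 1]))
if w[i_star] < 0:
    w = [-x for x in w]

# alpha must lift bump i_star above bump 0 while keeping all heights positive.
lower = (heights[0] - heights[i_star + 1]) / w[i_star]
uppers = [heights[j + 1] / -w[j] for j in range(3) if w[j] < 0]
alpha = (lower + min(uppers)) / 2 if uppers else lower + 1

new_heights = [heights[0]] + [heights[j + 1] + alpha * w[j] for j in range(3)]
print(apply(full, heights), apply(full, new_heights))
print(new_heights.index(max(new_heights)))
```

Both height vectors give zero expectation of the hypothetical identification function, so the two (normalized) densities share a level set of the intermediate property, yet the tallest bump, and hence the mode, has moved from bump 0 to bump 2.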
The impossibility result of Theorem 3.1 is strengthened in Theorem 3.2, which shows that the mode has infinite identifiable elicitation complexity even after restricting to the family of probability measures in which can be expressed as a mixture of Gaussians. While the general outline of the proof is similar, the bump functions used in Theorem 3.1 were supported on disjoint intervals, which is clearly not true of Gaussians. In particular, changing the heights of distant Gaussians will now alter the mode.
Theorem 3.2
The mode has infinite identifiable elicitation complexity with respect to the class of unimodal probability measures on the real line which can be expressed as a mixture of Gaussians.
See Appendix A for the proof, which shows that the statement holds even when the mixture is over Gaussians with the same variance. As in Theorem 3.1, we assume for some finite-dimensional identifiable , construct an initial density , and show that there must exist another in the same level set as but with a different mode (see Figure 2). While the broad outline remains the same, several technical challenges arise from the overlapping supports of Gaussians. To address these issues, we bound the potential contribution of one Gaussian to the density value at another to show that the mode changes from to , and use these bounds again to set the height vector so that a coefficient still exists for all possible vectors .
4 Implications for the Modal Interval
While the mode is not elicitable, it is well-known that the midpoint of the modal interval defined in eq. (1), which we will refer to as the modal midpoint, is elicitable, via the simple loss function . (Note that as we restrict to single-valued functionals in this paper, in the technical results that follow, we will only consider distributions with a unique solution to eq. (1).) Recalling that the mode is the limit of the modal midpoint as the radius approaches , it is often suggested to estimate the mode by for a sufficiently small . Heinrich (2014) argues that this practice is ill-advised, given the non-elicitability of the mode, and further demonstrates this argument empirically. For a particular Gaussian mixture with density , with two local maxima and , , Heinrich shows that given a fixed number of samples, even small values of result in a modal midpoint which is more often closer to than the mode . More precisely, for , out of 1000 trials each, Heinrich finds that in no more than 438 trials. Moreover, this success rate drops when .
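To see the elicitation of the modal midpoint in action, the sketch below performs empirical risk minimization with a zero-one interval loss, $L(r, y) = \mathbb{1}\{|r - y| > \beta\}$. The loss formula did not survive in the text above, so this form (the loss commonly used for modal intervals) is an assumption, as are the symbol `beta`, the function names, and the toy sample. Candidate reports are restricted to points of the form `y + beta`, which suffices for closed intervals, and ties go to the lowest candidate.

```python
def interval_loss(report, outcome, beta):
    """Assumed loss: 0 if the outcome lies within beta of the report, else 1."""
    return 0.0 if abs(report - outcome) <= beta else 1.0

def erm_modal_midpoint(sample, beta):
    """Minimize the empirical interval loss over candidate midpoints."""
    candidates = sorted(y + beta for y in sample)

    def risk(report):
        return sum(interval_loss(report, y, beta) for y in sample) / len(sample)

    # min() returns the first minimizer, i.e. the lowest candidate on ties.
    return min(candidates, key=risk)

# Toy sample (assumption): a loose cluster near 0.5 and a tighter one near 4.5.
sample = [0.0, 0.5, 1.0, 4.0, 4.25, 4.5, 4.75, 9.0]
print(erm_modal_midpoint(sample, beta=0.5))  # prints: 4.5
```

Minimizing this loss is the same as maximizing the number of sample points captured by the interval, which is why the minimizer is an empirical modal midpoint.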
Yet we observe that, as the midpoint of the true modal interval is very close to for sufficiently small , this simulation study also shows that the sample modal midpoint is an ineffective estimate of the true modal midpoint . When is replaced by in the preceding paragraph, we obtain qualitatively similar results: the majority of the time, is closer to than the true modal midpoint , and the situation worsens for . (See Figure 3 and Appendix B for details.) In summary, not only does the modal midpoint fail to estimate the mode, it fails to estimate the modal midpoint.
These empirical findings suggest the difficulty of eliciting modal midpoints in practice, despite the fact that they are elicitable. This sentiment is confirmed by the following Corollary, which extends our argument on the elicitation complexity of the mode to modal midpoints. The result essentially follows from the following observation. For a distribution consisting of disjoint bump functions in , as defined after eq. (2) and used in the argument of Theorem 3.1, the mode and modal midpoint coincide. While this equivalence does not hold anymore for mixtures of Gaussians, we remark that the proof of Theorem 3.2 could be directly modified, by enlarging the width of the balls to , and choosing sufficiently small and large , so that the same logic would hold for the modal interval when is sufficiently small.
Corollary 1
For any half-width $\beta > 0$, the $\beta$-modal midpoint has infinite identifiable elicitation complexity with respect to the class of probability measures defined on the real line which admit a smooth and bounded density, and have a unique mode and $\beta$-modal midpoint.
Proof (of Corollary 1)
Let be given. Observe that for any , the disjoint bump functions comprising are spaced far enough apart so that an interval of width can only intersect the support of at most one bump function. Moreover, if the interval intersects the th such function, it maximizes the contained mass by centering the interval at exactly . The global maximum mass is therefore achieved by centering the interval to capture the mass of the bump function with the largest weight, whose midpoint coincides with the mode. From these observations, we conclude for all . In other words, and are the same functional with respect to . Hence, the identification complexity and identifiable elicitation complexity of the modal midpoint with respect to are at least that of the mode.
The fact that modal midpoints are elicitable yet have infinite identifiable elicitation complexity illustrates the subtlety of our definitions. This subtlety is important; as pointed out by Frongillo and Kash (2015), one can construct pathological yet elicitable properties, such as a bijective for finite via any strictly proper scoring rule (Gneiting and Raftery, 2007). Hence, the restriction to identifiable intermediate properties, or some other class of properties ruling out such pathologies (see Section 5), is necessary for practical estimation schemes. In this light, our results are in line with the observation that both the mode and modal midpoint fail to be continuous in even weak senses: for certain distributions , is not continuous in , and the same is true of .
To close, it is interesting to contrast the above demonstration and negative result with the existing positive results in the literature on the estimation of the mode and modal midpoints. Some positive results, showing favorable error bounds, assume that the true density is not only unimodal but has a unique local maximum, i.e., the density increases before the mode and decreases afterwards; see for example Robertson and Cryer (1974, Section 2) and Lee (1989, Assumption 2). Moreover, many proposed estimators are expressed as sequences of estimators which depend on the sample size (Parzen, 1962; Chernoff, 1964; Grenander, 1965; Venter, 1967); we may roughly view these estimators as intermediate properties of countably infinite dimension, consistent with our results.
Several interesting open questions remain. One could further ask for the identifiable elicitation complexity of the mode with respect to other classes of probability distributions. One interesting class would be distributions with densities having a unique local maximum, though note that the elicitability of the mode is still open in this case. The method of perturbing the heights of (in this case, heavily overlapping) bumps as in Lemma 1 and Theorems 3.1 and 3.2 does not seem sufficient for this class.
Another set of questions arises when stepping away from the class of identifiable properties and considering other classes, such as weakly identifiable properties; negative results with respect to this class would show infinite complexity with respect to continuous, non-locally-constant, component-wise elicitable properties (Lambert, 2018; Steinwart et al, 2014). Another interesting class of properties in this context would be those elicited by convex loss functions, as these properties are of practical interest yet need not be identifiable (Frongillo and Kash, 2015). Finally, we suspect that our techniques could be applied to other properties whose elicitation complexity is not known, such as the width of the smallest confidence interval.
Appendix A Omitted Proofs
Proof (of Theorem 3.1)
Let be given. Since the identification complexity lower bounds the identifiable elicitation complexity of the mode it suffices to show that the mode is not -identifiable for arbitrary . Suppose, by way of contradiction, that the mode is -identifiable. Hence, there exists a property identified by and function such that . Our goal will be to specify two densities with and , contradicting the existence of .
Let and consider the following density in with strictly decreasing heights and . Observe that and denote . Consider the matrix
Let denote a nontrivial vector in the kernel of . To complete the proof, we will demonstrate that for any there exist real numbers so that is a density satisfying and . We proceed by considering all cases of and showing the existence of in each case.
First, considering , let denote the entry of with greatest magnitude (if not unique, choose the entry associated with the maximal initial height ), and take . Second, if , then take and treat as above. In the final case, at least one pair of entries of have opposite sign. Let denote an entry of with the greatest magnitude (if not unique, choose the entry associated with the maximal initial height ) and assume ; otherwise take . Choose such that satisfying for any with . Note this interval is nonempty because and for all such that . In each of the above cases, there are finitely many which do not yield a unimodal . If the chosen yields a which is not unimodal, then discard this particular from the interval and choose again.
With the appropriate normalization constant , we now have a density given by . As is contained in the kernel of , linearity of expectation and the definition of now guarantee that , and the method with which we showed exists ensures that is unimodal with . These two statements together contradict the existence of the link function satisfying .
Proof (of Theorem 3.2)
As in the proof of Theorem 3.1, we assume the mode is -identifiable and arrive at a contradiction. Hence, we assume there exists a property identified by and function such that . We will again specify two densities from in the same level set of , but different modes which contradicts the existence of .
Let , and let be Gaussian densities with unit height () centered at for some to be determined. For any mixture parameters , we will denote the Gaussian mixture density as follows,
where we define to be all positive scalings of densities in . As we are interested in the mode, we can always renormalize to obtain a distribution in with the same mode. In the following, we extend for unnormalized densities in the natural way.
Observe that for any mixture , we have , for any . This follows from second-order optimality conditions: as the inflection point of a Gaussian density is at , we have , and thus for some . Let . We will want , and thus we choose any .
We will additionally use the following claims in our proof.
Claim 1
For all , .
Claim 2
If , then .
Claim 3
If , then .
In Claim 1, the first two inequalities are trivial, and the third follows from the observation that the contribution of to is upper bounded by for all . Claim 2 then follows from Claim 1: for all we have . Similarly, for Claim 3, .
Finally, we construct our initial mixture so that and the following condition holds,
By Claim 2, we therefore would have . Condition (4) can be satisfied for (and smaller if is larger); we give one explicit construction here. Letting for ease of notation, we may take and . Enforcing , the average of the remaining elements is then which is strictly less than but strictly greater than , as desired. We may therefore choose the remaining elements to be any decreasing sequence in the interval whose average is .
Now let . Consider the matrix
Let denote a nontrivial vector in the kernel of . To complete the proof, we will demonstrate that for any such there exists a real number so that (after normalization to obtain the corresponding element in ) is the desired density. We proceed by cases on the entries of .
First, if , then let denote the entry of with greatest magnitude. If is not unique, then choose the entry associated with the maximal initial height . Choose such that
This ensures that so that by Claim 3. Second, if , then take and treat as above.
In the final case, at least one pair of entries of have opposite sign. Let denote the entry of with the greatest magnitude and assume ; otherwise take . If is not unique, then choose the entry associated with the maximal initial height . Choose such that
Once again, the lower bound ensures that so that by Claim 3. We bound from above in this case to ensure that , meaning we have a valid density.
It thus remains to verify that this interval is nonempty. Take an index such that . Note that , so that . Also note that . Chaining these inequalities together,
As this inequality holds for all such , it holds for the minimum over .
In each of the above cases, there are finitely many which fail to yield a unimodal density, . If the chosen yields such a , discard this particular and choose again.
Similar to the conclusion of Theorem 3.1, the density (after normalization to obtain the corresponding element in ) gives the desired contradiction.
Appendix B Experimental Details
So as to allow comparison with Heinrich (2014), we consider a density which is a mixture of two Gaussians; letting and , where denotes a Gaussian density with mean, we set . The true mode of is , with the other local maximum occurring at . The experiment performed is analogous to Heinrich (2014): for each value of as shown in Table 1, and in each of 1000 trials, we collect independent samples from , and measure the performance of the empirical modal midpoint relative to the true mode and true modal midpoint . In the case of a tie for , we take the lowest value (which the reader will note should favor the correct value). In sum, our results are qualitatively similar to Heinrich (2014), in that the modal midpoint fails to estimate the mode, but we can also confirm that it fails to estimate the modal midpoint as well. Note in particular that the two “Versus local max” columns are identical.
Table 1: column groups “MSE” and “Versus local max” (table body not recoverable).
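An experiment of this kind can be sketched as follows. The mixture parameters, sample size, number of trials, and interval half-width used in the appendix are not recoverable from the text above, so every numeric value below is a hypothetical stand-in, as are the function names; the sketch only shows the mechanics (sampling, computing the empirical modal midpoint with ties broken toward the lowest value, and comparing distances to the true mode and to the other local maximum) and makes no claim about the reported numbers.

```python
import bisect
import math
import random

# Hypothetical mixture, entries are (weight, mean, std): a narrow low-weight
# component creates the true mode, a broad component holds most of the mass.
COMPONENTS = [(0.3, 0.0, 0.5), (0.7, 4.0, 2.0)]

def density(x):
    """Density of the Gaussian mixture at x."""
    return sum(w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
               for w, m, s in COMPONENTS)

def draw(rng, n):
    """Draw n independent samples from the mixture."""
    weights = [w for w, _, _ in COMPONENTS]
    picks = rng.choices(COMPONENTS, weights=weights, k=n)
    return [rng.gauss(m, s) for _, m, s in picks]

def empirical_modal_midpoint(sample, beta):
    """Midpoint of a closed interval of half-width beta holding the most
    sample points; ties go to the lowest midpoint."""
    ys = sorted(sample)
    best_mid, best_count = None, -1
    for i, left in enumerate(ys):
        count = bisect.bisect_right(ys, left + 2 * beta) - i
        if count > best_count:
            best_mid, best_count = left + beta, count
    return best_mid

def grid_argmax(lo, hi, step=0.001):
    """Grid approximation of the maximizer of the density on [lo, hi]."""
    n = int(round((hi - lo) / step))
    return max((lo + i * step for i in range(n + 1)), key=density)

true_mode = grid_argmax(-1.0, 1.0)   # global maximum, near 0
other_max = grid_argmax(3.0, 5.0)    # second local maximum, near 4

rng = random.Random(0)
trials, n, beta = 200, 100, 0.05     # hypothetical experiment sizes
closer_to_other = 0
for _ in range(trials):
    mid = empirical_modal_midpoint(draw(rng, n), beta)
    if abs(mid - other_max) < abs(mid - true_mode):
        closer_to_other += 1
fraction_wrong = closer_to_other / trials
print(f"fraction of trials closer to the other local max: {fraction_wrong:.3f}")
```

Varying `beta` and `n` in this sketch gives a feel for how the empirical modal midpoint behaves on small samples, in the spirit of the comparison with Heinrich (2014).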
References
- Chernoff (1964) Chernoff H (1964) Estimation of the mode. Annals of the Institute of Statistical Mathematics 16(1):31–41
- Fissler and Ziegel (2017) Fissler T, Ziegel JF (2017) Order-Sensitivity and Equivariance of Scoring Functions. arXiv:1711.09628 [math, stat]. URL http://arxiv.org/abs/1711.09628
- Fissler et al (2016) Fissler T, Ziegel JF, et al (2016) Higher order elicitability and Osband’s principle. The Annals of Statistics 44(4):1680–1707
- Frongillo and Kash (2015) Frongillo R, Kash I (2015) On elicitation complexity. In: Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R (eds) Advances in Neural Information Processing Systems 28, Curran Associates, Inc., pp 3258–3266
- Gneiting (2011) Gneiting T (2011) Making and evaluating point forecasts. Journal of the American Statistical Association 106(494):746–762
- Gneiting and Raftery (2007) Gneiting T, Raftery AE (2007) Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102(477):359–378
- Grenander (1965) Grenander U (1965) Some direct estimates of the mode. The Annals of Mathematical Statistics 36(1):131–138
- Heinrich (2014) Heinrich C (2014) The mode functional is not elicitable. Biometrika 101(1):245–251
- Lambert (2018) Lambert NS (2018) Elicitation and evaluation of statistical forecasts. Preprint
- Lambert et al (2008) Lambert NS, Pennock DM, Shoham Y (2008) Eliciting properties of probability distributions. In: Proceedings of the 9th ACM Conference on Electronic Commerce, ACM, pp 129–138
- Lee (1989) Lee MJ (1989) Mode regression. Journal of Econometrics 42(3):337–349
- Osband (1985) Osband K (1985) Providing incentives for better cost forecasting. PhD thesis, University of California, Berkeley
- Parzen (1962) Parzen E (1962) On estimation of a probability density function and mode. The Annals of Mathematical Statistics 33(3):1065–1076
- Robertson and Cryer (1974) Robertson T, Cryer JD (1974) An iterative procedure for estimating the mode. Journal of the American Statistical Association 69(348):1012–1016
- Steinwart et al (2014) Steinwart I, Pasin C, Williamson R, Zhang S (2014) Elicitation and identification of properties. In: Balcan MF, Feldman V, Szepesvári C (eds) Proceedings of The 27th Conference on Learning Theory, PMLR, Barcelona, Spain, Proceedings of Machine Learning Research, vol 35, pp 482–526
- Venter (1967) Venter J (1967) On estimation of the mode. The Annals of Mathematical Statistics 38(5):1446–1455