## 1 Introduction

Predictions often come in the form of point forecasts, which summarize the distribution of a random future outcome via a single real value. We evaluate point forecasts using loss functions, which determine the error from the forecast and the eventual outcome. Conversely, loss functions are often used to produce forecasts via empirical risk minimization, which simply chooses the forecast with the lowest total loss on previously observed data. In both of these settings, however, the loss function determines the optimal forecast; in particular, as a function of the distribution of the random variable, the optimal forecast is that which minimizes the expected loss. If this minimizer happens to be a particular statistic or *property*, we say the loss elicits the property.
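As a minimal illustration of empirical risk minimization, the following Python sketch picks the forecast with the lowest total loss over a grid of candidates. The data, the candidate grid, and all function names are assumptions made for this example; squared and absolute loss are used because they are standard losses that elicit the mean and the median, respectively.

```python
from statistics import mean, median

def squared_loss(r, y):
    return (r - y) ** 2

def absolute_loss(r, y):
    return abs(r - y)

def empirical_risk_minimizer(loss, outcomes, candidates):
    """Return the candidate forecast with the lowest total loss on the data."""
    return min(candidates, key=lambda r: sum(loss(r, y) for y in outcomes))

outcomes = [1.0, 2.0, 2.0, 3.0, 10.0]            # illustrative observations
candidates = [i / 100 for i in range(0, 1201)]   # grid of forecasts on [0, 12]

best_squared = empirical_risk_minimizer(squared_loss, outcomes, candidates)
best_absolute = empirical_risk_minimizer(absolute_loss, outcomes, candidates)

# The two losses select different forecasts from the same data:
# squared loss recovers the sample mean, absolute loss the sample median.
print(best_squared, mean(outcomes))
print(best_absolute, median(outcomes))
```

The same data yield two different optimal forecasts, which is the sense in which the loss function determines the property being forecast.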

The importance of the relationship between losses and optimal forecasts was first observed by Brier, who advocated for better loss functions in meteorology. More recently, Savage [8], Osband [7], Lambert et al. [6, 5], and Gneiting [3] have developed the theory of property elicitation. This literature has largely focused on characterizations of which properties are elicitable, and of the loss functions that elicit them. Lambert [5] and Steinwart et al. [9] gave a characterization of elicitable real-valued properties using the notion of identifiability from Osband [7], showing that (continuous) properties are elicitable if and only if they are identifiable. A property is identifiable if it satisfies an additional linear constraint with respect to its level sets.

A known necessary condition for a property to be elicitable is for the level sets to be convex. Interestingly, this condition is not sufficient. The mode has convex level sets, but is not elicitable, as shown by Heinrich [4]. To confront non-elicitable properties such as the mode, several authors have suggested the notion of elicitation complexity, where one asks “how elicitable” a property is, quantified as the minimal amount of elicitable information needed to compute the property [6, 2, 1].

As the mode is not elicitable, we address the natural question of its elicitation complexity. In particular, we study its identifiable elicitation complexity, defined as the smallest dimension of a vector-valued identifiable property from which one can compute the mode [2]. We show that the mode has infinite identifiable elicitation complexity, which casts doubt on the effectiveness of empirical risk minimization in learning the mode and suggests that it is impossible to develop a consistent loss function for scoring its point forecasts. Our techniques differ from those of previous work [2] and may be applicable to other properties of interest; for example, we show a similar lower bound for the modal interval. We conclude with open questions, including a discussion of other notions of elicitation complexity and other properties of interest.

## 2 Background

Let be a set of probability measures on a common measurable space . For each probability measure we will denote the expectation of a random variable with distribution by .

###### Definition 1.

A property is a function which assigns a report value to each probability measure in .

For example, considering probability measures on the measurable space given by and the Borel sets, the mean is a real-valued property. Similarly, another real-valued property is the variance, . Our focus in this paper will be the mode, which will be defined with care at the end of this section.

###### Definition 2.

A property is elicitable if there exists a loss function such that for all the property can be expressed as .

To build on our examples above, the mean is elicitable with respect to the loss function .
A necessary condition for a property to be elicitable is for it to have convex level sets: the set of distributions having the same property value must be convex [7].
It follows immediately that the variance is not elicitable.
The variance, however, can be expressed as a function, or *link*, of elicitable properties as in so we say it is *indirectly* elicitable.
When confronted with non-elicitable properties, it is therefore natural to ask the minimal dimension of such an intermediate elicitable property; this is the notion of *elicitation complexity* [6, 2].
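The failure of the convex-level-set condition for the variance can be checked on a small example. The sketch below, which assumes discrete distributions stored as dictionaries mapping outcomes to probabilities (a representation chosen only for this illustration), takes two point masses with variance zero and shows that their even mixture has a different variance.

```python
def mean_of(dist):
    """Mean of a discrete distribution given as {outcome: probability}."""
    return sum(prob * y for y, prob in dist.items())

def variance_of(dist):
    """Variance of a discrete distribution given as {outcome: probability}."""
    m = mean_of(dist)
    return sum(prob * (y - m) ** 2 for y, prob in dist.items())

def mix(p, q, lam):
    """Convex combination lam * p + (1 - lam) * q of two discrete distributions."""
    support = set(p) | set(q)
    return {y: lam * p.get(y, 0.0) + (1 - lam) * q.get(y, 0.0) for y in support}

p = {0.0: 1.0}   # point mass at 0, variance 0
q = {1.0: 1.0}   # point mass at 1, variance 0
half = mix(p, q, 0.5)

# Both endpoints lie in the level set {variance = 0}, but the mixture does not.
print(variance_of(p), variance_of(q), variance_of(half))
```

Since the level set of distributions with variance zero is not closed under mixing, the necessary condition fails and the variance cannot be elicitable.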

###### Definition 3.

Let denote the class of all properties from to which are elicitable. A property is -elicitable for if there exists an elicitable property and a function such that . Moreover, the elicitation complexity of is given by the minimum of all such that is -elicitable.

Returning to the variance, we see that while it is not elicitable, its elicitation complexity is at most . The variance can be recovered via the function composed with the elicitable vector-valued property . It is worth noting that there is a distinction between a property which is elicitable like the mean, , and a property which is -elicitable like the mean squared, , which fails to be elicitable. Hence, while every elicitable real-valued property is trivially -elicitable via the identity function, not every -elicitable property is elicitable. For a discussion on the nuances of this definition, see [2].
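The indirect elicitation of the variance can be sketched in a few lines. The example assumes the intermediate property is the pair of first and second moments and that the link subtracts the squared first moment from the second; each moment is computed as a sample average, which is the minimizer of total squared loss. The function names and data are hypothetical.

```python
from statistics import pvariance

def elicit_two_moments(outcomes):
    """Minimizers of total squared loss against y and against y**2 (sample averages)."""
    n = len(outcomes)
    first = sum(outcomes) / n
    second = sum(y * y for y in outcomes) / n
    return first, second

def variance_link(first, second):
    """Link function recovering the variance from the two elicited moments."""
    return second - first ** 2

outcomes = [1.0, 2.0, 2.0, 3.0, 10.0]   # illustrative observations
report = elicit_two_moments(outcomes)
linked_variance = variance_link(*report)

# The linked value agrees with the population variance of the sample.
print(linked_variance, pvariance(outcomes))
```

Two elicitable components suffice here, which is what the upper bound on the elicitation complexity of the variance expresses.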

We now introduce the notion of identifiability due to Osband [7], which has played a central role in the theory of property elicitation. Following [2], we will use identifiability to restrict the intermediate properties which we allow, essentially requiring their level sets to be determined by a linear constraint. (We briefly note that Steinwart et al. [9] adopt a weaker notion of identifiability, and call the definition below “strong identifiability”; see Section 4.)

###### Definition 4.

A property is identifiable if there exists a function such that if and only if for all . If is identified by , we refer to as an identification function.

The mean is identified by the function , whereas the variance is not identifiable in general. Below we define identification complexity, the analogue of elicitation complexity for identifiability.

###### Definition 5.

Let denote the class of all properties from to which are identifiable. A property, , is -identifiable for if there exists an identifiable property and a function such that . Furthermore, the identification complexity of is given by the minimum of all such that is -identifiable.

To illustrate identification complexity, let us revisit one final time the example of variance. Above we established that a vector-valued property which demonstrates the indirect elicitability of the variance is . Observe that is an identification function for , therefore, the identification complexity of the variance is at most .
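The defining condition of an identification function can be verified numerically. The sketch below assumes the standard identification function for the first two moments, with components `r[0] - y` and `r[1] - y**2`; this form, the discrete distribution, and the helper names are assumptions for illustration. The expected value vanishes at the true report and not at a perturbed one.

```python
def expectation(f, dist):
    """Expected value of f(Y) for a discrete distribution {outcome: probability}."""
    return sum(prob * f(y) for y, prob in dist.items())

def moments_identification(r, y):
    """Assumed identification function for the pair (first moment, second moment)."""
    return (r[0] - y, r[1] - y * y)

def expected_identification(r, dist):
    """Componentwise expectation of the identification function at report r."""
    return tuple(
        expectation(lambda y, j=j: moments_identification(r, y)[j], dist)
        for j in range(2)
    )

dist = {0.0: 0.2, 1.0: 0.5, 4.0: 0.3}   # illustrative distribution
true_report = (expectation(lambda y: y, dist), expectation(lambda y: y * y, dist))
wrong_report = (2.0, true_report[1])

at_truth = expected_identification(true_report, dist)
at_wrong = expected_identification(wrong_report, dist)
print(at_truth, at_wrong)
```

Note that the condition is linear in the distribution: for a fixed report, the expectation of the identification function is a weighted sum over outcomes, which is the feature the lower-bound arguments exploit.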

Without further restrictions, the definition of elicitation complexity given above is trivial, as all properties of finite-support distributions have complexity 1 by first eliciting the distribution (encoded using a bijection from to ). Hence, following [2], we consider the stronger notion of elicitation complexity which requires to be identifiable.

###### Definition 6.

A property, , has identifiable elicitation complexity if is the minimum dimension for which there exists a property in and function with .

Now since the identification complexity is a lower bound for the identifiable elicitation complexity of a property. We make use of this bound in the proofs of Theorems 1 and 2.

For the remainder of this section, we turn to the mode, which we define as in [3] and [4]. Letting and , consider the cumulative distribution function associated with . A *modal interval* is any interval of the form to which assigns maximal probability. Let denote the midpoint of a modal interval, defined as follows,

(1) |

Regardless of whether a midpoint of the modal interval is unique we can use it to define the mode of the distribution. Suppose is a sequence of real numbers where as and is a corresponding choice of midpoints which converge to a real number, . Then is the mode of the distribution. Note that this definition is careful not to assume that a probability density exists. In the case where the distribution function is absolutely continuous and admits a continuous density , then coincides with the global maximum of . When working with a discrete probability distribution, corresponds to the point(s) associated with maximal probability.

We will refer to probability measures which have a well-defined and unique mode as unimodal. Note that if a probability measure is unimodal and there exists a probability density associated with it, the density does not necessarily have a unique local maximum, a stronger requirement. For example, a Gaussian density is unimodal in both senses of the term, whereas a mixture of Gaussians with unit variance and strictly distinct weights does not necessarily have a unique local maximum, but does have a well-defined and unique mode and thus is unimodal. (See Section 4 for a discussion of the stronger definition.)
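Both points above can be explored numerically. The following sketch assumes a two-component Gaussian mixture with unit variances, weights 0.6 and 0.4, and centers at -3 and 3 (all chosen for illustration), and treats a modal interval as a closed interval of a given half-width. It approximates modal midpoints on a grid for shrinking half-widths and counts the local maxima of the density.

```python
import math

WEIGHTS = (0.6, 0.4)    # strictly distinct mixture weights (illustrative)
CENTERS = (-3.0, 3.0)   # component means; every component has unit variance

def mixture_pdf(x):
    return sum(w * math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2 * math.pi)
               for w, m in zip(WEIGHTS, CENTERS))

def mixture_cdf(x):
    return sum(w * 0.5 * (1.0 + math.erf((x - m) / math.sqrt(2)))
               for w, m in zip(WEIGHTS, CENTERS))

GRID = [i / 1000 for i in range(-6000, 6001)]   # search grid on [-6, 6]

def modal_midpoint(half_width):
    """Grid point x maximizing the mass of [x - half_width, x + half_width]."""
    return max(GRID, key=lambda x: mixture_cdf(x + half_width) - mixture_cdf(x - half_width))

def local_maxima():
    """Grid points where the density is strictly larger than at both neighbors."""
    vals = [mixture_pdf(x) for x in GRID]
    return [GRID[i] for i in range(1, len(GRID) - 1)
            if vals[i - 1] < vals[i] > vals[i + 1]]

midpoints = [modal_midpoint(c) for c in (1.0, 0.5, 0.1)]
peaks = local_maxima()
print(midpoints, peaks)
```

In this example the midpoints stay near the heavier component as the half-width shrinks, while the density retains a second, lower local maximum near the lighter component, so the mixture is unimodal only in the weaker sense.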

In 2014, Heinrich [4] demonstrated with respect to several classes of unimodal probability measures that the mode is not directly elicitable. Following the framework of elicitation complexity [6, 2], we naturally proceed by studying its identifiable elicitation complexity. Our main results are Theorems 1 and 2 which both show that the identifiable elicitation complexity of the mode is infinite relative to two classes of probability measures. Corollary 1 in turn shows that for any , the modal midpoint also has infinite identifiable elicitation complexity.

## 3 Results

Let denote the class of unimodal probability measures defined on the real line which admit a smooth and bounded density. Below we will define a class of probability measures within consisting of mixtures of normalized bump functions which will be the class of probability measures employed in Lemma 1, Theorem 1, and Corollary 1. We will denote by the class of probability measures which can be expressed as a mixture of Gaussians, the focus of Theorem 2. Since each admits a unique probability density , we will identify the probability measure with its density , and use the two interchangeably. Hence, when we choose an element we intend this to communicate that we will be using the probability density associated with a probability measure . And finally, whether or is written it is taken to be the mode of the distribution as defined in Section 2 which corresponds to the global maximum of .

To define the class of probability measures used in Lemma 1, Theorem 1, and Corollary 1, define the bump function centered at of width as follows,

(2) |

where . Let denote the class of distributions in which are mixtures of bump functions of width centered at . Note that a mixture of finitely many such bump functions is an element of , with zero weight on the remaining functions.

We begin by showing that the mode is not identifiable relative to , the class of unimodal probability measures defined on the real line which admit a smooth and bounded density. Our proof shows the stronger statement that the mode is not identifiable relative to . We begin with a probability density with mode and and construct another element with a different mode but still , a contradiction. (See Figure 1.) We subsequently generalize this proof in Theorem 1 to demonstrate that the mode is not -identifiable for any .

###### Lemma 1.

The mode, , is not identifiable relative to , the class of unimodal probability measures defined on the real line which admit a smooth and bounded density.

###### Proof.

Let be given. Assume, by way of contradiction, there exists that identifies . Consider the probability density in , where , and . Note that and thus . We proceed by constructing another density in such that , but which contradicts the existence of .

Let and denote nonnegative heights, at least one of which is nonzero, which satisfy . To complete the proof, we will show that given any and there exists a choice of so that , where is a normalizing constant, yields a contradiction.

First, consider the case when . Without loss of generality, assume that . If , choose . On the other hand, if , choose with . Next, if take and treat as above. Finally, consider when . If , recall that and choose . If , choose with .

We claim that is a density which produces the contradiction we seek. The normalizing constant ensures is a density, the linearity of expectation ensures that , and finally the method of choosing ensures that is unimodal with which contradicts the existence of . ∎

While our result establishes that the mode is not identifiable, it does not directly speak to the identifiable elicitation complexity of the mode. Below, in Theorem 1, we generalize this argument to show that the identifiable elicitation complexity of the mode is infinite. We assume that the identifiable elicitation complexity is finite and demonstrate that a contradiction arises. To outline the argument, we begin with a density , and construct another density with a different mode, but which must lie in the same level set of the intermediate property . (See Figure 2.) This contradicts the existence of a function satisfying .

###### Theorem 1.

The mode, , has infinite identifiable elicitation complexity relative to , the class of unimodal probability measures defined on the real line which admit a smooth and bounded density.

###### Proof.

Let be given. Since the identification complexity lower bounds the identifiable elicitation complexity of the mode it suffices to show that the mode is not -identifiable for arbitrary . Suppose, by way of contradiction, that the mode is -identifiable. Hence, there exists a property identified by and function such that . Our goal will be to specify two densities with and , contradicting the existence of .

Let and consider the following density in with strictly decreasing heights and . Observe that and denote . Consider the matrix given by

(3) |

Let denote a nontrivial vector in the kernel of . To complete the proof, we will demonstrate that for any there exists a real number so that is a density satisfying and . We proceed by considering all cases of and showing the existence of in each case.

Consider , let denote an entry of with greatest magnitude. If is not unique, then choose the entry associated with the maximal initial height and take . Second, if , then take and treat as above. In the final case, at least one pair of entries of have opposite sign. Let denote an entry of with the greatest magnitude and assume , otherwise take . If is not unique, then choose the entry associated with the maximal initial height and choose in such that for any with . In each of the above cases, there are finitely many which do not yield a unimodal . If the chosen yields a which is not unimodal, then discard this particular from the interval and choose again.

With the appropriate normalization constant , we now have a density . Linearity of expectation and the definition of now guarantee that , and the method with which we showed exists ensures that is unimodal with . These two statements together contradict the existence of the link function satisfying . ∎
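The kernel-perturbation idea can be made concrete in a toy setting. The sketch below is an illustration under several assumptions, not a reproduction of the construction in the proof: point masses stand in for the bump functions, the intermediate property is taken to be the pair of first two moments with identification function `(r[0] - y, r[1] - y**2)`, the dimension is 2, and only the three non-modal components are perturbed. A kernel vector of the resulting 2 x 3 matrix is obtained with a cross product.

```python
def identification(r, y):
    """Hypothetical identification function for the first two moments."""
    return (r[0] - y, r[1] - y * y)

def expected_identification(r, centers, weights):
    """Expectation of the identification function under a mixture of point masses."""
    return tuple(sum(w * identification(r, c)[j] for c, w in zip(centers, weights))
                 for j in range(2))

def cross(a, b):
    """Cross product; orthogonal to both a and b, hence in the kernel of [a; b]."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def mode(centers, weights):
    """Center carrying the largest weight."""
    return max(zip(centers, weights), key=lambda cw: cw[1])[0]

centers = (0.0, 1.0, 2.0, 3.0)
weights = (0.4, 0.3, 0.2, 0.1)   # strictly decreasing heights, mode at 0
r = (sum(w * c for c, w in zip(centers, weights)),
     sum(w * c * c for c, w in zip(centers, weights)))

# Rows of the matrix: each identification component evaluated at the
# three non-modal centers.
rows = [tuple(identification(r, c)[j] for c in centers[1:]) for j in range(2)]
kernel_vec = cross(rows[0], rows[1])

alpha = 0.05   # step chosen so weights stay nonnegative and the mode moves
raw = (weights[0],) + tuple(w + alpha * v for w, v in zip(weights[1:], kernel_vec))
z = sum(raw)   # normalizing constant
new_weights = tuple(w / z for w in raw)

print(mode(centers, weights), mode(centers, new_weights))
print(expected_identification(r, centers, new_weights))
```

The perturbed mixture has a different mode, yet the expected identification function at the original report is still zero, so both mixtures share the same value of the intermediate property and no link function can recover the mode from it.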

We now further strengthen our lower bound, showing that the mode has infinite identifiable elicitation complexity even when restricting to the family of probability measures in which can be expressed as a mixture of Gaussians. The general outline of the proof is similar, but whereas the bump functions in Theorem 1 were supported on disjoint intervals, Gaussian components have overlapping supports. In particular, changing the heights or weights of distant Gaussians will in general alter the mode.

###### Theorem 2.

The mode, , has infinite identifiable elicitation complexity relative to , the class of probability measures in which can be expressed as a mixture of Gaussians.

###### Proof.

As in the proof of Theorem 1, we assume the mode is -identifiable and arrive at a contradiction. Hence, we assume there exists a property identified by and function such that . We will again specify two densities with and , contradicting the existence of .

Let and consider distinct Gaussian densities each centered at with unit variance, where the constant is chosen sufficiently large so that ; that is, so that the tails do not contribute significantly to the mode of .

Consider given by where and . Note that and denote . Consider the matrix,

(4) |

Let denote a nontrivial vector in the kernel of . To complete the proof, we will demonstrate that for any there exists a real number so that is the desired density.

First, if , then let denote an entry of with greatest magnitude. If is not unique then choose the entry associated with the maximal initial height and choose . Second, if then take and treat as above. For the final case, assume without loss of generality that at least one pair of entries of have opposite sign. Let denote an entry of with the greatest magnitude and assume , otherwise take . If is not unique, then choose the entry associated with the maximal initial height and choose in . In all of the above cases, there are finitely many which fail to yield a unimodal density, . If the chosen yields such a , discard this particular and choose again.

Similar to the conclusion of Theorem 1, is a density which produces the desired contradiction. ∎

We close with a simple demonstration of how to apply our techniques above to other properties of interest. In particular, we will show a similar result for the midpoint of a modal interval defined in eq. (1). Recall that the mode was defined as the limit of a convergent sequence of midpoints of modal intervals with decreasing lengths . Given a distribution consisting of disjoint bump functions in like in the argument of Theorem 1, the mode and the midpoint of a modal interval of length coincide. Unlike the mode, however, the midpoint of a modal interval is elicitable. For any , minimizes the expectation of the loss function . Nonetheless, we show that is not identifiable. In fact, since the mode has infinite identifiable elicitation complexity relative to , so must .
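The elicitability of the modal midpoint can be illustrated with empirical risk minimization. The sketch assumes a zero-one window loss, which charges 1 whenever the outcome falls farther than a fixed half-width from the report; this loss form, the data, and the names are assumptions for the example. Minimizing total loss is then the same as choosing the report whose window captures the most observations.

```python
def window_loss(r, y, half_width):
    """Assumed loss: 0 if the outcome lies within half_width of the report, else 1."""
    return 0.0 if abs(r - y) <= half_width else 1.0

def erm_modal_midpoint(outcomes, half_width, candidates):
    """Candidate report with the lowest total window loss on the data."""
    return min(candidates,
               key=lambda r: sum(window_loss(r, y, half_width) for y in outcomes))

def coverage(r, outcomes, half_width):
    """Number of observations captured by the window centered at r."""
    return sum(1 for y in outcomes if abs(r - y) <= half_width)

outcomes = [1.0, 1.1, 1.2, 1.3, 5.0, 5.1, 9.0]   # illustrative observations
half_width = 0.2
candidates = [i / 20 for i in range(0, 201)]     # grid on [0, 10], step 0.05

best = erm_modal_midpoint(outcomes, half_width, candidates)
print(best, coverage(best, outcomes, half_width))
```

The minimizer need not be unique, since any window capturing the densest cluster attains the same loss; the sketch returns the first such grid point.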

###### Corollary 1.

For any , the modal midpoint, , has infinite identifiable elicitation complexity relative to , the class of unimodal probability measures defined on the real line which admit a smooth and bounded density.

###### Proof.

Let be given. Observe that for any , the disjoint bump functions comprising are spaced far enough apart so that an interval of width can only intersect the support of at most one bump function. Moreover, if the interval intersects the th such function, it maximizes the contained mass by centering the interval at exactly . The global maximum mass is therefore achieved by centering the interval to capture the mass of the bump function with the largest weight, whose midpoint coincides with the mode. From these observations, we conclude for all . In other words, and are the same function relative to . Hence, the identification complexity and identifiable elicitation complexity of the modal midpoint with respect to are at least that of the mode. ∎

## 4 Discussion

We have shown that the mode has infinite identifiable elicitation complexity, relative to the class of unimodal probability measures defined on the real line which admit a smooth and bounded density, as well as to , the class of probability measures which can be expressed as a mixture of Gaussians. Our proofs take a different approach from previous work [2], and we believe this approach may find other applications. We give one such extension of Theorem 1, showing a similar lower bound for the modal interval midpoint in Corollary 1. Our results show the importance of asking “how elicitable” non-elicitable properties are, as our infinite elicitation complexity lower bound strongly suggests the impossibility of using empirical risk minimization techniques to learn the mode or modal midpoint.

Several interesting open questions remain. One could further ask what the identifiable elicitation complexity of the mode is with respect to other classes of probability distributions; the most interesting class would be distributions with densities having a unique local maximum. The method of perturbing the heights of a probability density as in Lemma 1 and Theorems 1 and 2 fails to answer this question. Another set of questions arises when stepping away from the class of identifiable properties, for instance to the class of weakly identifiable properties; a lower bound for that class would show infinite complexity with respect to continuous, non-locally-constant, component-wise elicitable properties [5, 9]. Perhaps the most interesting class of properties in this context would be those elicited by convex loss functions. Finally, we expect that our techniques could be applied to other properties of interest, such as the width of the smallest confidence interval.

## Acknowledgements

This work was supported by National Science Foundation Grant CCF-1657598. We thank Jessie Finocchiaro for helpful suggestions, and Nicole Woytarowicz for her initial work on this project, including a proof of Lemma 1 in her B.S. thesis.

## References

- [1] Tobias Fissler and Johanna F. Ziegel. Higher order elicitability and Osband’s principle. The Annals of Statistics, 44(4):1680–1707, 2016.
- [2] Rafael Frongillo and Ian Kash. On elicitation complexity. In Advances in Neural Information Processing Systems, pages 3258–3266, 2015.
- [3] Tilmann Gneiting. Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494):746–762, 2011.
- [4] Claudio Heinrich. The mode functional is not elicitable. Biometrika, 101(1):245–251, 2014.
- [5] Nicolas S. Lambert. Elicitation and evaluation of statistical forecasts. Preprint, 2013.
- [6] Nicolas S. Lambert, David M. Pennock, and Yoav Shoham. Eliciting properties of probability distributions. In Proceedings of the 9th ACM Conference on Electronic Commerce, pages 129–138. ACM, 2008.
- [7] Kent Osband. Providing incentives for better cost forecasting. PhD thesis, University of California, Berkeley, 1985.
- [8] Leonard J. Savage. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783–801, 1971.
- [9] Ingo Steinwart, Chloé Pasin, Robert C. Williamson, and Siyu Zhang. Elicitation and identification of properties. In COLT, 2014.
