:exclamation: This is a read-only mirror of the CRAN R package repository. murphydiagram — Murphy Diagrams for Forecast Comparisons. Homepage: https://sites.google.com/site/fk83research/code
In the practice of point prediction, it is desirable that forecasters receive a directive in the form of a statistical functional, such as the mean or a quantile of the predictive distribution. When evaluating and comparing competing forecasts, it is then critical that the scoring function used for these purposes be consistent for the functional at hand, in the sense that the expected score is minimized when following the directive. We show that any scoring function that is consistent for a quantile or an expectile functional, respectively, can be represented as a mixture of extremal scoring functions that form a linearly parameterized family. Scoring functions for the mean value and probability forecasts of binary events constitute important examples. The quantile and expectile functionals along with the respective extremal scoring functions admit appealing economic interpretations in terms of thresholds in decision making. The Choquet type mixture representations give rise to simple checks of whether a forecast dominates another in the sense that it is preferable under any consistent scoring function. In empirical settings it suffices to compare the average scores for only a finite number of extremal elements. Plots of the average scores with respect to the extremal scoring functions, which we call Murphy diagrams, permit detailed comparisons of the relative merits of competing forecasts.
Over the past two decades, a broad transdisciplinary consensus has developed that forecasts ought to be probabilistic in nature, i.e., they ought to take the form of predictive probability distributions over future quantities or events (Gneiting and Katzfuss 2014). Nevertheless, a wealth of applied settings require point forecasts, be it for reasons of decision making, tradition, reporting requirements, or ease of communication. In this situation, a directive is required as to the specific feature or functional of the predictive distribution that is being sought.
We follow Gneiting (2011) and consider a functional to be a potentially set-valued mapping $T(F)$ from a class of probability distributions, $\mathcal{F}$, to the real line, $\mathbb{R}$, with the mean or expectation functional, quantiles, and expectiles being key examples. Competing point forecasts are then compared by using a nonnegative scoring function, $S(x, y)$, that represents the loss or penalty when the point forecast $x$ is issued and the observation $y$ realizes. A critically important requirement on the scoring function is that it be consistent for the functional $T$ relative to the class $\mathcal{F}$, in the sense that
$$\mathbb{E}_F[S(t, Y)] \le \mathbb{E}_F[S(x, Y)] \qquad (1)$$
for all probability distributions $F \in \mathcal{F}$, all $t \in T(F)$, and all $x \in \mathbb{R}$. If equality in (1) implies that $x \in T(F)$, then the scoring function is strictly consistent.
To give a prominent example, the ubiquitous squared error scoring function, $S(x, y) = (x - y)^2$, is strictly consistent for the mean or expectation functional relative to the class of probability distributions with finite variance. However, there are many alternatives. In a classical paper, Savage (1971) showed, subject to weak regularity conditions, that a scoring function is consistent for the mean functional if and only if it is of the form
$$S(x, y) = \phi(y) - \phi(x) - \phi'(x)(y - x), \qquad (2)$$
where the function $\phi$ is convex with subgradient $\phi'$; squared error arises when $\phi(t) = t^2$. Holzmann and Eulert (2014) proved that when forecasts make ideal use of nested information bases, the forecast with the broader information basis is preferable under any consistent scoring function.
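The Savage representation can be checked numerically. The following is a minimal sketch, assuming the form $\phi(y) - \phi(x) - \phi'(x)(y - x)$; the helper names and the small discrete distribution are illustrative choices, not taken from the paper.

```python
import math

def bregman_score(x, y, phi, dphi):
    """Savage-type score phi(y) - phi(x) - dphi(x) * (y - x) for a convex phi."""
    return phi(y) - phi(x) - dphi(x) * (y - x)

def squared_error(x, y):
    return (x - y) ** 2

# phi(t) = t^2 with derivative 2t should reproduce squared error.
phi_sq = lambda t: t * t
dphi_sq = lambda t: 2.0 * t

# Hypothetical discrete predictive distribution (illustration only).
outcomes = [0.0, 1.0, 4.0]
probs = [0.5, 0.25, 0.25]
mean = sum(p * y for p, y in zip(probs, outcomes))

def expected_score(x, phi, dphi):
    """Expected score of the point forecast x under the discrete distribution."""
    return sum(p * bregman_score(x, y, phi, dphi) for p, y in zip(probs, outcomes))

# A different convex choice, phi(t) = exp(t), is also minimized at the mean.
grid = [k / 10.0 for k in range(-20, 60)]
best_exp = min(grid, key=lambda x: expected_score(x, math.exp, math.exp))
```

On this grid the minimizer under $\phi(t) = e^t$ lands on the grid point closest to the mean, which illustrates consistency for a score other than squared error.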
However, in real world settings, as pointed out by Patton (2015), forecasts are hardly ever ideal, and the ranking of competing forecasts might depend on the choice of the scoring function. This had already been observed by Murphy (1977), Schervish (1989), and Merkle and Steyvers (2013), among others, in the important special case of a binary predictand, where $y = 1$ corresponds to a success and $y = 0$ to a non-success, so that the mean of the predictive distribution provides a probability forecast for a success. As there is no obvious reason for a consistent scoring function to be preferred over any other, this raises the question of which one of the many alternatives to use.
Our work is motivated by the quest for guidance in this setting. Theoretically, the respective key result is that, subject to unimportant regularity conditions, any function of the form (2) admits a mixture representation of the form
$$S(x, y) = \int_{-\infty}^{\infty} S_\theta(x, y) \, dH(\theta), \qquad (3)$$
where $H$ is a nonnegative measure, and
$$S_\theta(x, y) = (y - \theta)_+ - (x - \theta)_+ - (y - x) \, 1(\theta < x) = \begin{cases} |y - \theta|, & \min(x, y) \le \theta < \max(x, y), \\ 0, & \text{otherwise}, \end{cases} \qquad (4)$$
for $\theta \in \mathbb{R}$. (Here and in what follows, we write $(t)_+ = \max(t, 0)$ for the positive part of $t$ and $1(A)$ for the indicator function of the event $A$.) Thus every scoring function consistent for the mean can be written as a weighted average over elementary or extremal scores $S_\theta$. As an important consequence, a point forecast that is preferable in terms of each extremal score is preferable in terms of any consistent scoring function. The elementary scores can be seen as representing the loss, relative to an oracle, in an investment problem with cost basis $\theta$ and future revenue $y$; see Section 2.3.
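As a sanity check on the mixture representation, the sketch below integrates the elementary score $S_\theta(x, y) = |y - \theta| \, 1(\min(x, y) \le \theta < \max(x, y))$ against twice the Lebesgue measure, which is the measure induced by $\phi(t) = t^2$, and compares the result with squared error. The midpoint-rule integrator and the test values are assumptions made for illustration.

```python
def elementary_score(theta, x, y):
    """S_theta(x, y): |y - theta| if min(x, y) <= theta < max(x, y), else 0."""
    if min(x, y) <= theta < max(x, y):
        return abs(y - theta)
    return 0.0

def mixture_score(x, y, density, lo, hi, n=20000):
    """Midpoint-rule approximation of the integral of S_theta(x, y) * density(theta)."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        theta = lo + (k + 0.5) * h
        total += elementary_score(theta, x, y) * density(theta)
    return total * h

x, y = 0.4, 1.9
approx = mixture_score(x, y, density=lambda t: 2.0, lo=-5.0, hi=5.0)
exact = (x - y) ** 2
```

Other nonnegative densities in place of the constant 2 yield other consistent scoring functions for the mean.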
In empirical settings, point forecasts are compared based on their average scores. Specifically, let us consider a sequence of triplets $(x_{i1}, x_{i2}, y_i)$ for $i = 1, \dots, n$, where $x_{i1}$ and $x_{i2}$ are competing point forecasts and $y_i$ is the subsequent outcome. We may compare the two forecasts graphically, by plotting the respective empirical scores,
$$s_j(\theta) = \frac{1}{n} \sum_{i=1}^{n} S_\theta(x_{ij}, y_i),$$
for $j = 1$ and $j = 2$, versus $\theta$. An example of this type of display, which we term a Murphy diagram, is shown in Figure 1, where we consider point forecasts of wind speed at a major wind energy center.
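The empirical curves behind a Murphy diagram for mean forecasts can be computed as follows. This is a sketch only: the data are made up, the function names are hypothetical, and plotting is omitted so that the block depends on nothing beyond the standard library.

```python
def elementary_score(theta, x, y):
    """S_theta(x, y): |y - theta| if min(x, y) <= theta < max(x, y), else 0."""
    if min(x, y) <= theta < max(x, y):
        return abs(y - theta)
    return 0.0

def murphy_curve(forecasts, outcomes, thetas):
    """Average elementary score of the forecasts at each threshold theta."""
    n = len(outcomes)
    return [
        sum(elementary_score(theta, x, y) for x, y in zip(forecasts, outcomes)) / n
        for theta in thetas
    ]

# Hypothetical data: x1 tracks the outcomes, x2 is a constant forecast.
y = [2.0, 3.5, 1.0, 4.2, 2.8]
x1 = [2.2, 3.0, 1.5, 4.0, 3.0]
x2 = [3.0, 3.0, 3.0, 3.0, 3.0]

thetas = [0.5 + 0.1 * k for k in range(41)]
s1 = murphy_curve(x1, y, thetas)
s2 = murphy_curve(x2, y, thetas)

# x1 is preferable at every threshold if its curve never exceeds that of x2.
x1_never_worse = all(a <= b + 1e-12 for a, b in zip(s1, s2))
```

In this toy example each entry of `x1` lies between the outcome and the corresponding entry of `x2`, so the first curve lies on or below the second at every threshold.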
More generally, for both quantiles and expectiles the apparent wealth of consistent scoring functions can be reduced to a one-dimensional family of readily interpretable elementary scores, in the sense that every consistent scoring function can be represented as a mixture from that family. The case of the mean or expectation functional, which includes probability forecasts for binary events as a further special case, corresponds to the expectile at level $\alpha = 1/2$.
The remainder of the paper is organized as follows. Section 2 is devoted to the key theoretical development, in which we state and discuss the mixture representations, relate to Choquet theory and order sensitivity, and provide economic interpretations of the elementary scores and the associated functionals. In particular, we show that expectiles are optimal decision thresholds in binary investment problems with fixed cost basis and differential taxation of profits versus losses. In Section 3, we apply the mixture representations to study forecast rankings and propose the aforementioned Murphy diagram for forecast comparisons. Illustrations on data examples follow in Section 4, where we revisit meteorological and economic case studies in the work of Gneiting et al. (2006), Rudebusch and Williams (2009), and Patton (2015). The paper closes with a discussion in Section 5. Proofs and computational details are deferred to Appendices.
Before focusing on the specific cases of quantiles and expectiles, we review general background material on the assessment of point forecasts, with emphasis on consistent scoring functions.
We first introduce notation and state conventions. Let $\mathcal{F}_0$ denote the class of the probability measures on the Borel-Lebesgue sets of the real line, $\mathbb{R}$. For simplicity, we do not distinguish between a measure $F \in \mathcal{F}_0$ and the associated cumulative distribution function (CDF). We follow standard conventions and assume CDFs to be right-continuous. A function $S$ defined on a rectangle $D \times D \subseteq \mathbb{R}^2$ is called a scoring function if $S(x, y) \ge 0$ for all $(x, y) \in D \times D$, with $S(x, y) = 0$ if $x = y$. Here, $S(x, y)$ is interpreted as the loss or cost accrued when the point forecast $x$ is issued and the observation $y$ realizes. The scoring function is regular if it is jointly measurable and left-continuous in its first argument, $x$, for every $y \in D$.
In point prediction problems, it is rarely evident which functional of the predictive distribution should be reported. Guidance can be given implicitly, by specifying a loss function, or explicitly, by specifying a functional. The notion of consistency originates in this setting.
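Consider a functional $T$ on a class $\mathcal{F} \subseteq \mathcal{F}_0$ on which the mapping $F \mapsto T(F)$ is well-defined. Usually, the functional is single-valued, as in the case of the mean functional, where we take $\mathcal{F}$ as the class $\mathcal{F}_1$ of the probability measures with finite first moment. More generally, the expectile at level $\alpha \in (0, 1)$ of a probability measure $F \in \mathcal{F}_1$ is the unique solution $t$ to the equation
$$\alpha \int_t^{\infty} (y - t) \, dF(y) = (1 - \alpha) \int_{-\infty}^{t} (t - y) \, dF(y),$$
where $\alpha = 1/2$ corresponds to the mean functional (Newey and Powell 1987). In the case of quantiles, the functional might be set-valued. Specifically, the quantile functional at level $\alpha \in (0, 1)$ maps a probability measure $F$ to the closed interval $[q_\alpha^-(F), q_\alpha^+(F)]$, with lower limit $q_\alpha^-(F) = \sup\{t : F(t) < \alpha\}$ and upper limit $q_\alpha^+(F) = \sup\{t : F(t) \le \alpha\}$. The two limits differ only when the level set $\{t : F(t) = \alpha\}$ contains more than one point, so typically the functional is single-valued. Any number between $q_\alpha^-(F)$ and $q_\alpha^+(F)$ represents an $\alpha$-quantile and will be denoted $q_\alpha(F)$.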
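For an empirical distribution, both functionals are easy to compute. The sketch below assumes the standard defining equation $\alpha \, \mathbb{E}(Y - t)_+ = (1 - \alpha) \, \mathbb{E}(t - Y)_+$ for the expectile and solves it by bisection; the quantile interval uses the limits $\sup\{t : F(t) < \alpha\}$ and $\sup\{t : F(t) \le \alpha\}$. Function names and the sample are illustrative.

```python
def expectile(sample, alpha, tol=1e-12):
    """Sample expectile: root of alpha * sum (y - t)_+ - (1 - alpha) * sum (t - y)_+."""
    lo, hi = min(sample), max(sample)

    def g(t):
        gain = sum(max(y - t, 0.0) for y in sample)
        loss = sum(max(t - y, 0.0) for y in sample)
        return alpha * gain - (1.0 - alpha) * loss  # decreasing in t

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def quantile_interval(sample, alpha):
    """Lower and upper alpha-quantile of the empirical CDF, for 0 < alpha < 1."""
    ys = sorted(sample)
    n = len(ys)
    lower = next(y for k, y in enumerate(ys, 1) if k / n >= alpha)
    upper = next((y for k, y in enumerate(ys, 1) if k / n > alpha), ys[-1])
    return lower, upper

sample = [1.0, 2.0, 4.0, 7.0]
mean_as_expectile = expectile(sample, 0.5)
median_interval = quantile_interval(sample, 0.5)
```

With four observations the median is set-valued, whereas the expectile at level 1/2 coincides with the sample mean.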
The scoring function $S$ is consistent for a functional $T$ relative to the class $\mathcal{F}$ if
$$\mathbb{E}_F[S(t, Y)] \le \mathbb{E}_F[S(x, Y)]$$
for all probability measures $F \in \mathcal{F}$, all $t \in T(F)$, and all point forecasts $x$. A functional that admits a strictly consistent scoring function is called elicitable, and can then be represented as the solution to an optimization problem, in that
$$T(F) = \arg\min_{x} \mathbb{E}_F[S(x, Y)].$$
Hence, if the goal is to minimize expected loss, the optimal strategy is to follow the requested directive in the form of a functional.
In what follows, we restrict attention to the quantile and expectile functionals. These are critically important in a gamut of applications, including quantile and expectile regression in general, and least squares (i.e., mean) and probit and logit (i.e., binary probability) regression in particular.
The classes of the consistent scoring functions for quantiles and expectiles have been described by Savage (1971), Thomson (1979), and Gneiting (2011), and we review the respective characterizations in the setting of the latter paper, where further detail is available.
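Up to mild regularity conditions, a scoring function $S$ is consistent for the quantile functional at level $\alpha \in (0, 1)$ relative to the class $\mathcal{F}_0$ if and only if it is of the form
$$S(x, y) = (1(y < x) - \alpha)\,(g(x) - g(y)), \qquad (5)$$
where $g$ is non-decreasing. The most prominent example arises when $g(t) = t$, which yields the asymmetric piecewise linear scoring function,
$$S(x, y) = (1(y < x) - \alpha)\,(x - y), \qquad (6)$$
that lies at the heart of quantile regression (Koenker and Bassett 1978; Koenker 2005). Similarly, a scoring function is consistent for the expectile at level $\alpha \in (0, 1)$ relative to the class $\mathcal{F}_1$ if and only if it is of the form
$$S(x, y) = |1(y < x) - \alpha|\,\bigl(\phi(y) - \phi(x) - \phi'(x)(y - x)\bigr), \qquad (7)$$
where $\phi$ is convex with subgradient $\phi'$. The key example arises when $\phi(t) = t^2$, where
$$S(x, y) = |1(y < x) - \alpha|\,(x - y)^2. \qquad (8)$$
This is the loss function used for estimation in expectile regression (Newey and Powell 1987; Efron 1991), including the ubiquitous case $\alpha = 1/2$ of ordinary least squares regression.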
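The two key examples, the asymmetric piecewise linear score $(1(y < x) - \alpha)(x - y)$ and the asymmetric squared error $|1(y < x) - \alpha|(x - y)^2$, can be written in a few lines. The sketch below also checks by grid search that their average over a small made-up sample is minimized at a quantile and at an expectile, respectively; the grid and sample are assumptions for illustration.

```python
def pinball_loss(alpha, x, y):
    """Asymmetric piecewise linear score (1(y < x) - alpha) * (x - y)."""
    return ((1.0 if y < x else 0.0) - alpha) * (x - y)

def asymmetric_squared_loss(alpha, x, y):
    """Asymmetric squared error |1(y < x) - alpha| * (x - y)^2."""
    return abs((1.0 if y < x else 0.0) - alpha) * (x - y) ** 2

def average(loss, alpha, x, sample):
    return sum(loss(alpha, x, y) for y in sample) / len(sample)

sample = [1.0, 2.0, 4.0, 7.0]
grid = [k / 100.0 for k in range(0, 801)]

# At alpha = 1/2 the asymmetric squared error is half the squared error,
# so the minimizer is the sample mean, 3.5.
best_expectile = min(grid, key=lambda x: average(asymmetric_squared_loss, 0.5, x, sample))

# At alpha = 1/4 every point of [1, 2] is a quantile of the empirical CDF.
pinball_at_quantile = average(pinball_loss, 0.25, 1.5, sample)
pinball_minimum = min(average(pinball_loss, 0.25, x, sample) for x in grid)
```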
In view of the representations (5) and (7), the scoring functions that are consistent for quantiles and expectiles are parameterized by the non-decreasing functions $g$, and the convex functions $\phi$ with subgradient $\phi'$, respectively. In general, neither $g$ nor $\phi$ and $\phi'$ are uniquely determined. We therefore select special versions of these functions. Furthermore, in the interest of simplicity we generally assume that scoring functions are defined on all of $\mathbb{R}^2$, adding comments in cases where there are finite boundary points. Let $\mathcal{I}$ denote the class of all left-continuous non-decreasing real functions, and let $\mathcal{C}$ denote the class of all convex real functions $\phi$ with subgradient $\phi' \in \mathcal{I}$. This last condition is satisfied when $\phi'$ is chosen to be the left-hand derivative of $\phi$, which exists everywhere and is left-continuous by construction.
In what follows, we use the symbol $\mathcal{S}^Q_\alpha$ to denote the class of the scoring functions $S$ of the form (5) where $g \in \mathcal{I}$. Similarly, we write $\mathcal{S}^E_\alpha$ for the class of the scoring functions $S$ of the form (7) where $\phi \in \mathcal{C}$. For all practical purposes, the families $\mathcal{S}^Q_\alpha$ and $\mathcal{S}^E_\alpha$ can be identified with the classes of the regular scoring functions that are consistent for quantiles and expectiles, respectively. These classes appear to be rather large. However, in either case the apparent multitude can be reduced to a one-dimensional family of elementary scoring functions, in the sense that every consistent scoring function admits a representation as a mixture of elementary elements.
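Theorem 1a (quantiles). Any member of the class $\mathcal{S}^Q_\alpha$ admits a representation of the form
$$S(x, y) = \int_{-\infty}^{\infty} S^Q_{\alpha,\theta}(x, y) \, dH(\theta), \qquad (9)$$
where $H$ is a nonnegative measure and
$$S^Q_{\alpha,\theta}(x, y) = (1(y < x) - \alpha)\,(1(\theta < x) - 1(\theta < y)) = \begin{cases} 1 - \alpha, & y \le \theta < x, \\ \alpha, & x \le \theta < y, \\ 0, & \text{otherwise}. \end{cases} \qquad (10)$$
The mixing measure $H$ is unique and satisfies $dH(\theta) = dg(\theta)$ for $\theta \in \mathbb{R}$, where $g$ is the nondecreasing function in the representation (5). Furthermore, we have $H(x) - H(y) = S(x, y)/(1 - \alpha)$ for $x > y$.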
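Theorem 1b (expectiles). Any member of the class $\mathcal{S}^E_\alpha$ admits a representation of the form
$$S(x, y) = \int_{-\infty}^{\infty} S^E_{\alpha,\theta}(x, y) \, dH(\theta), \qquad (11)$$
where $H$ is a nonnegative measure and
$$S^E_{\alpha,\theta}(x, y) = |1(y < x) - \alpha|\,\bigl((y - \theta)_+ - (x - \theta)_+ - (y - x)\,1(\theta < x)\bigr) = \begin{cases} (1 - \alpha)\,|y - \theta|, & y \le \theta < x, \\ \alpha\,|y - \theta|, & x \le \theta < y, \\ 0, & \text{otherwise}. \end{cases} \qquad (12)$$
The mixing measure $H$ is unique and satisfies $dH(\theta) = d\phi'(\theta)$ for $\theta \in \mathbb{R}$, where $\phi'$ is the left-hand derivative of the convex function $\phi$ in the representation (7). Furthermore, we have $H(x) - H(y) = -\partial_2 S(x, y)/(1 - \alpha)$ for $x > y$, where $\partial_2$ denotes the left-hand derivative with respect to the second argument.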
Note that the relations in (9) and (11) hold pointwise. In particular, the respective integrals are pointwise well-defined. This is because for $(x, y) \in \mathbb{R}^2$ the functions $\theta \mapsto S^Q_{\alpha,\theta}(x, y)$ and $\theta \mapsto S^E_{\alpha,\theta}(x, y)$ are right-continuous, non-negative, and uniformly bounded with bounded support, and because the non-decreasing functions $g$ and $\phi'$ define non-negative measures $dg$ and $d\phi'$ that assign finite mass to any finite interval.
In the case of quantiles, the asymmetric piecewise linear scoring function corresponds to the choice $g(t) = t$ in (5), so the mixing measure in the representation (9) is the Lebesgue measure. The elementary scoring function $S^Q_{\alpha,\theta}$ arises when $g(t) = 1(\theta < t)$, i.e., when $H$ is a one-point measure in $\theta$.
In the case of expectiles, the mixing measure for the asymmetric squared error scoring function is twice the Lebesgue measure. The choice $\alpha = 1/2$ recovers the mean or expectation functional, for which existing parametric subfamilies emerge as special cases of our mixture representation. Examples include Patton’s (2015) exponential Bregman family and a family of homogeneous scoring functions on the positive half line; for the latter, the mixing measure has a Lebesgue density, remarkably with no case distinction being required. The elementary scoring function $S^E_{\alpha,\theta}$ emerges when $\phi(t) = (t - \theta)_+$ in (7); here the mixing measure in (11) is a one-point measure in $\theta$.
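The two special cases just described can be verified numerically. The sketch below assumes the elementary scores $(1(y < x) - \alpha)(1(\theta < x) - 1(\theta < y))$ for quantiles and $|1(y < x) - \alpha|((y - \theta)_+ - (x - \theta)_+ - (y - x) 1(\theta < x))$ for expectiles, integrates them against the Lebesgue measure and twice the Lebesgue measure, and compares with the asymmetric piecewise linear and asymmetric squared error scores. The midpoint-rule integrator and test values are illustrative.

```python
def extremal_quantile_score(alpha, theta, x, y):
    """(1(y < x) - alpha) * (1(theta < x) - 1(theta < y))."""
    return ((y < x) - alpha) * ((theta < x) - (theta < y))

def extremal_expectile_score(alpha, theta, x, y):
    """|1(y < x) - alpha| * ((y - theta)_+ - (x - theta)_+ - (y - x) * 1(theta < x))."""
    pos = lambda t: max(t, 0.0)
    core = pos(y - theta) - pos(x - theta) - (y - x) * (theta < x)
    return abs((y < x) - alpha) * core

def integrate(f, lo, hi, n=20000):
    """Midpoint rule on [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))

alpha, x, y = 0.8, 1.7, 0.2
pinball = ((y < x) - alpha) * (x - y)
asym_sq = abs((y < x) - alpha) * (x - y) ** 2

# Lebesgue measure for quantiles, twice the Lebesgue measure for expectiles.
q_mix = integrate(lambda th: extremal_quantile_score(alpha, th, x, y), -5.0, 5.0)
e_mix = integrate(lambda th: 2.0 * extremal_expectile_score(alpha, th, x, y), -5.0, 5.0)
```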
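From a theoretical perspective, a natural question is whether the mixture representations (9) and (11) can be considered Choquet representations in the sense of functional analysis (Phelps 2001). Recall that a member $S$ of a convex class $\mathcal{S}$ is an extreme point of $\mathcal{S}$ if it cannot be written as an average of two other members, i.e., if $S = (S_1 + S_2)/2$ with $S_1, S_2 \in \mathcal{S}$ implies $S_1 = S_2 = S$. Our mixture representations qualify as Choquet representations if the elementary scores $S^Q_{\alpha,\theta}$ and $S^E_{\alpha,\theta}$ form extreme points of the underlying classes of scoring functions. This cannot possibly be true for our classes $\mathcal{S}^Q_\alpha$ and $\mathcal{S}^E_\alpha$, because they are invariant under dilations and hence admit trivial average representations built with multiples of one and the same scoring function. Therefore, the families $\mathcal{I}$ and $\mathcal{C}$ of admissible functions $g$ and $\phi$ need to be restricted suitably. Specifically, let the class $\mathcal{I}_1$ consist of all functions $g \in \mathcal{I}$ such that $\lim_{t \to -\infty} g(t) = 0$ and $\lim_{t \to +\infty} g(t) = 1$. Similarly, let $\mathcal{C}_1$ denote the family of all $\phi \in \mathcal{C}$ such that $\lim_{t \to -\infty} \phi'(t) = 0$ and $\lim_{t \to +\infty} \phi'(t) = 1$. These classes are convex, and so are the associated subclasses of the families $\mathcal{S}^Q_\alpha$ and $\mathcal{S}^E_\alpha$, which we denote by $\mathcal{S}^Q_{\alpha,1}$ and $\mathcal{S}^E_{\alpha,1}$, respectively. The elementary scores $S^Q_{\alpha,\theta}$ and $S^E_{\alpha,\theta}$ evidently are members of these restricted families.
Proposition 1a (quantiles). For every $\alpha \in (0, 1)$ and $\theta \in \mathbb{R}$, the scoring function $S^Q_{\alpha,\theta}$ is an extreme point of the class $\mathcal{S}^Q_{\alpha,1}$.
Proposition 1b (expectiles). For every $\alpha \in (0, 1)$ and $\theta \in \mathbb{R}$, the scoring function $S^E_{\alpha,\theta}$ is an extreme point of the class $\mathcal{S}^E_{\alpha,1}$.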
We thus have furnished Choquet representations for subclasses of the consistent scoring functions for quantiles and expectiles. In the extant literature, such Choquet representations have been known in the binary case only, where $y = 1$ corresponds to a success and $y = 0$ to a non-success, so that the mean, $p \in [0, 1]$, of the predictive distribution provides a probability forecast for a success. In this setting, the Savage representation (7) for the members of the respective class, with $\alpha = 1/2$, reduces to
$$S(p, y) = \tfrac{1}{2}\bigl(\phi(y) - \phi(p) - \phi'(p)(y - p)\bigr), \qquad y \in \{0, 1\}.$$
The mixture representation (11) can then be written as
$$S(p, y) = \int_0^1 S_\theta(p, y) \, dH(\theta), \qquad (13)$$
where $H$ is a nonnegative measure and
$$S_\theta(p, y) = \begin{cases} \theta, & y = 0 \text{ and } p > \theta, \\ 1 - \theta, & y = 1 \text{ and } p \le \theta, \\ 0, & \text{otherwise}. \end{cases} \qquad (14)$$
The parameter $\theta$ can be interpreted as the cost-loss ratio in the classical simple cost-loss decision model (Richardson 2012). Up to unimportant conventions regarding coding, scaling, and gain-loss orientation, this recovers the well known mixture representation of the proper scoring rules for probability forecasts of binary events (Shuford, Albert, and Massengill 1966; Schervish 1989). Different choices of the mixing measure yield the standard examples of scoring rules in this case; see Buja et al. (2005) and Table 1 in Gneiting and Raftery (2007). The widely used Brier score,
$$S(p, y) = (p - y)^2, \qquad (15)$$
arises when $H$ is twice the Lebesgue measure.
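The statement about the Brier score can be checked directly. The sketch below assumes the binary elementary score in its cost-loss form, equal to $\theta$ for a false alarm ($y = 0$, $p > \theta$) and $1 - \theta$ for a miss ($y = 1$, $p \le \theta$), and integrates it against twice the Lebesgue measure on $[0, 1]$; names and test probabilities are illustrative.

```python
def binary_elementary_score(theta, p, y):
    """Cost-loss score: theta for a false alarm, 1 - theta for a miss, else 0."""
    if y == 0 and p > theta:
        return theta
    if y == 1 and p <= theta:
        return 1.0 - theta
    return 0.0

def brier_from_mixture(p, y, n=20000):
    """Midpoint-rule integral of the elementary score against 2 * d(theta) on [0, 1]."""
    h = 1.0 / n
    return 2.0 * h * sum(binary_elementary_score((k + 0.5) * h, p, y) for k in range(n))

def brier(p, y):
    return (p - y) ** 2

checks = [(p, y, brier_from_mixture(p, y), brier(p, y))
          for p in (0.1, 0.35, 0.9) for y in (0, 1)]
```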
Our results in the previous section give rise to natural economic interpretations of the extremal scoring functions $S^Q_{\alpha,\theta}$ and $S^E_{\alpha,\theta}$, along with the quantile and expectile functionals themselves. In either case, the interpretation relates to a binary betting or investment decision with random outcome, $y$.
In the case of the extremal quantile scoring function in (10), the payoff takes on only two possible values, relating to a bet on whether or not the outcome $y$ will exceed the threshold $\theta$. Specifically, consider the following payoff scheme, which is realized in spread betting in prediction markets (Wolfers and Zitzewitz 2008):
If Quinn refrains from betting, his payoff will be zero, independently of the outcome $y$.
If Quinn enters the bet and $y \le \theta$ realizes, he loses his wager, $\rho_L > 0$.
If Quinn enters the bet and $y > \theta$ realizes, his winnings are $\rho_G > \rho_L$, for a gain of $\rho_G - \rho_L$.
How should Quinn act under this payoff scheme? If Quinn does not enter the bet, his actual and expected payoffs equal zero. If he does enter, his expected payoff is
$$-F(\theta)\,\rho_L + (1 - F(\theta))\,(\rho_G - \rho_L) = (1 - F(\theta))\,\rho_G - \rho_L, \qquad (16)$$
where $F$ is Quinn’s predictive CDF for the future outcome, $y$, which for simplicity we assume to be strictly increasing. This expression is strictly positive if and only if $F(\theta) < \alpha$, where
$$\alpha = 1 - \frac{\rho_L}{\rho_G}. \qquad (17)$$
Hence, Quinn’s optimal decision rule is determined by the $\alpha$-quantile of $F$, in that he enters the bet if and only if $q_\alpha(F) > \theta$. Motivated by the specific format of the optimal decision or Bayes rule, the top left matrix in Table 1 summarizes the payoff from any strategy of the form: enter the bet if and only if $x > \theta$.
It remains to draw the connection to the extremal scoring function $S^Q_{\alpha,\theta}$. To this end, we shift attention from positively oriented payoffs to negatively oriented regrets, which we define as the difference between the payoff for an oracle and Quinn’s payoff. Here the term oracle refers to a (hypothetical) omniscient bettor who enters the bet if and only if $y > \theta$ realizes, which would yield an ideal payoff of $\rho_G - \rho_L$ if $y > \theta$, and zero otherwise. If Quinn uses some decision threshold $x$, his regret equals the extremal score $S^Q_{\alpha,\theta}(x, y)$ except for an irrelevant multiplicative factor, namely $\rho_G$. This is illustrated in the bottom left matrix in the table and corresponds to the classical, simple cost-loss decision model (Richardson 2012). In decision theoretic terms, the distinction between payoff and regret is inessential, because the difference depends on the outcome, $y$, only. In either case, the optimal strategy is to choose the decision threshold $x = q_\alpha(F)$.
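The link between regret and the extremal quantile score can be traced in code. The sketch below rests on assumptions made for illustration: the wager and winnings are hypothetical amounts named `wager` and `winnings`, the gain from a successful bet is `winnings - wager`, and the quantile level is taken to be `1 - wager / winnings`, which is the level at which the expected payoff of entering changes sign.

```python
def payoff(enters, y, theta, wager, winnings):
    """Zero if the bet is not entered; -wager if y <= theta; winnings - wager if y > theta."""
    if not enters:
        return 0.0
    return winnings - wager if y > theta else -wager

def regret(x, y, theta, wager, winnings):
    """Oracle payoff minus the payoff of the rule 'enter if and only if x > theta'."""
    oracle = payoff(y > theta, y, theta, wager, winnings)
    bettor = payoff(x > theta, y, theta, wager, winnings)
    return oracle - bettor

def extremal_quantile_score(alpha, theta, x, y):
    """(1(y < x) - alpha) * (1(theta < x) - 1(theta < y))."""
    return ((y < x) - alpha) * ((theta < x) - (theta < y))

wager, winnings, theta = 2.0, 5.0, 1.0   # hypothetical amounts and threshold
alpha = 1.0 - wager / winnings           # assumed level, here 0.6

# All four combinations of decision threshold x and outcome y relative to theta.
cases = [(0.5, 0.2), (0.5, 3.0), (2.0, 0.2), (2.0, 3.0)]
pairs = [(regret(x, y, theta, wager, winnings),
          winnings * extremal_quantile_score(alpha, theta, x, y))
         for x, y in cases]
```

In each of the four cases the regret equals the extremal score multiplied by the winnings, which is the multiplicative factor under these assumptions.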
Table 1: Monetary payoff (top) and score or regret (bottom) for the quantile betting problem (left) and the expectile investment problem (right).
In the case of the extremal expectile scoring function in (12), the payoff is real-valued. Specifically, suppose that Eve considers investing a fixed amount $\theta$ into a start-up company, in exchange for an unknown, future amount $y$ of the company’s profits or losses. The payoff structure then is as follows:
If Eve refrains from the deal, her payoff will be zero, independently of the outcome $y$.
If Eve invests and $y \le \theta$ realizes, her payoff is negative, at $-(1 - r_L)(\theta - y)$. Here, $\theta - y$ is the sheer monetary loss, and the factor $1 - r_L$ accounts for Eve’s reduction in income tax, with $r_L \in [0, 1)$ representing the deduction rate. (In financial terms, the loss acts as a tax shield. The linear functional form assumed here is not unrealistic, even though it is simpler than many real-world tax schemes, where nonlinearities may arise from tax exemptions, progression, etc.)
If Eve invests and $y > \theta$ realizes, her payoff is positive, at $(1 - r_G)(y - \theta)$, where $r_G \in [0, 1)$ denotes the tax rate that applies to her profits.
How should Eve act under this payoff scheme? If Eve does not enter the deal, her actual and expected payoffs vanish. In case she invests, the expected payoff is
$$(1 - r_G) \int_\theta^{\infty} (y - \theta) \, dF(y) - (1 - r_L) \int_{-\infty}^{\theta} (\theta - y) \, dF(y).$$
This expression is strictly positive if and only if the expectile at level
$$\alpha = \frac{1 - r_G}{2 - r_L - r_G} \qquad (18)$$
of Eve’s predictive CDF, $F$, exceeds $\theta$. In analogy to the quantile case, the top right matrix in Table 1 represents Eve’s payoff from any strategy of the form: invest if and only if $x > \theta$.
To relate to the extremal scoring function $S^E_{\alpha,\theta}$, we again shift attention to regrets relative to an omniscient investor or oracle who enters the deal if and only if $y > \theta$ occurs, which would yield the ideal payoff $(1 - r_G)(y - \theta)_+$. As seen in the table, if Eve uses the threshold $x$ to determine whether or not to invest, the regret equals the extremal score $S^E_{\alpha,\theta}(x, y)$, up to a multiplicative factor. (The elementary score for probability forecasts of a binary event in (14) is obtained, up to a multiplicative factor, when $r_L = r_G$ and $y \in \{0, 1\}$. The parameter $\theta$ can then be interpreted as a cost-loss ratio.)
Therefore, expectiles can be interpreted as optimal decision thresholds in investment problems with fixed costs and differential tax rates for profits versus losses. The mean arises in the special case when $\alpha = 1/2$ in (18). It corresponds to situations in which losses are fully tax deductible ($r_L = r_G$) and nests situations without taxes ($r_L = r_G = 0$). Tough taxation settings where $r_L < r_G$ shift Eve’s incentives toward not entering the deal and correspond to expectiles at levels $\alpha < 1/2$. For example, if losses cannot be deducted at all ($r_L = 0$), whereas profits are taxed at a rate of $r_G$, Eve will invest only if the expectile at level $\alpha = (1 - r_G)/(2 - r_G)$ of her predictive CDF, $F$, exceeds the deal’s fixed costs, $\theta$. Note that we permit the case $r_L > r_G$, which may reflect subsidies or tax credits, say.
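The investment interpretation can be illustrated with a small computation. The sketch below assumes a deduction rate `r_loss` on losses and a tax rate `r_gain` on profits, payoffs `-(1 - r_loss) * (theta - y)` and `(1 - r_gain) * (y - theta)`, and the resulting level `(1 - r_gain) / (2 - r_loss - r_gain)`; these names, the sample standing in for the predictive distribution, and the example rates are all illustrative.

```python
def expectile_level(r_loss, r_gain):
    """Level at which the expected payoff changes sign (requires r_loss + r_gain < 2)."""
    return (1.0 - r_gain) / (2.0 - r_loss - r_gain)

def expected_payoff(theta, sample, r_loss, r_gain):
    """Average after-tax payoff from investing the fixed amount theta."""
    total = 0.0
    for y in sample:
        total += (1.0 - r_gain) * (y - theta) if y > theta else -(1.0 - r_loss) * (theta - y)
    return total / len(sample)

def expectile(sample, alpha, tol=1e-12):
    """Sample expectile by bisection on alpha * sum (y - t)_+ - (1 - alpha) * sum (t - y)_+."""
    lo, hi = min(sample), max(sample)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        gain = sum(max(y - mid, 0.0) for y in sample)
        loss = sum(max(mid - y, 0.0) for y in sample)
        if alpha * gain - (1.0 - alpha) * loss > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

sample = [1.0, 2.0, 4.0, 7.0]   # hypothetical predictive sample
r_loss, r_gain = 0.0, 0.5       # hypothetical rates
alpha = expectile_level(r_loss, r_gain)
threshold = expectile(sample, alpha)

# Investing has positive expected payoff exactly for cost bases below the expectile.
thetas = [0.5 * k for k in range(17)]
agree = all((expected_payoff(th, sample, r_loss, r_gain) > 0.0) == (th < threshold)
            for th in thetas)
```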
The above interpretation of expectiles as optimal thresholds in decision problems attaches an economic meaning to this class of functionals, which thus far seems to have been missing; e.g., Schulze Waltrup et al. (2014, p. 2) note that “expectiles lack an intuitive interpretation”. The foregoing may also bear on the debate about the revision of the Basel protocol for banking regulation, which involves contention about the choice of the functional of in-house risk distributions that banks are supposed to report to regulators (Embrechts et al. 2014). Recently, expectiles have been put forth as potential candidates, as it has been proved that they are the only elicitable law-invariant coherent risk measures (Delbaen et al. 2014; Ziegel 2014; Bellini and Bignozzi 2015).
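The extremal scoring functions $S^Q_{\alpha,\theta}$ and $S^E_{\alpha,\theta}$ are not only consistent for their respective functional, they in fact enjoy the stronger property of order sensitivity. Generally, a scoring function $S$ is order sensitive for the functional $T$ relative to the class $\mathcal{F}$ if, for all $F \in \mathcal{F}$, all $t \in T(F)$, and all $x_1, x_2 \in \mathbb{R}$,
$$x_2 \le x_1 \le t \quad \text{or} \quad t \le x_1 \le x_2 \quad \Longrightarrow \quad \mathbb{E}_F[S(x_1, Y)] \le \mathbb{E}_F[S(x_2, Y)].$$
The order sensitivity is strict if the above continues to hold when the inequalities involving $x_1$ and $x_2$ are strict. As before, we denote the class of the Borel probability measures on $\mathbb{R}$ by $\mathcal{F}_0$, and we write $\mathcal{F}_1$ for the subclass of the probability measures with finite first moment.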
Proposition 2a (quantiles). For every $\alpha \in (0, 1)$ and $\theta \in \mathbb{R}$, the extremal scoring function $S^Q_{\alpha,\theta}$ is order sensitive for the $\alpha$-quantile functional relative to $\mathcal{F}_0$.
Proposition 2b (expectiles). For every $\alpha \in (0, 1)$ and $\theta \in \mathbb{R}$, the extremal scoring function $S^E_{\alpha,\theta}$ is order sensitive for the $\alpha$-expectile functional relative to $\mathcal{F}_1$.
Owing to the mixture representations (9) and (11), the order sensitivity of the extremal scoring functions transfers to all regular consistent scoring functions. Strict order sensitivity applies if the function $g$ in the representation (5) and the derivative $\phi'$ in the representation (7), respectively, are strictly increasing, relative to subclasses of probability measures with suitable moment constraints. Closely related results have recently been obtained in studies of elicitability (Steinwart et al. 2014; Ziegel 2014; Bellini and Bignozzi 2015). In this strand of literature, the ambitious goal of characterizing all elicitable functionals necessitates regularity conditions that are not satisfied by our discontinuous, compactly supported extremal scoring functions.
In this section, we turn to the task of comparing and ranking forecasts. Before applying our mixture representations to this problem, we introduce the prediction space setting of Gneiting and Ranjan (2013) and define notions of forecast dominance.
A prediction space is a probability space tailored to the study of forecasting problems. Following the seminal work of Murphy and Winkler (1987), the prediction space setting of Gneiting and Ranjan (2013) considers the joint distribution of forecasts and observations. We first focus on probabilistic forecasts, $F_1, \dots, F_k$, which we identify with the associated cumulative distribution functions (CDFs) for the real-valued outcome, $Y$. The elements of the respective sample space $\Omega$ can be identified with tuples of the form
$$(F_1, \dots, F_k, Y), \qquad (19)$$
where the predictive distributions $F_1, \dots, F_k$ utilize information sets $\mathcal{A}_1, \dots, \mathcal{A}_k \subseteq \mathcal{A}$, respectively, with $\mathcal{A}$ being a sigma field on the sample space $\Omega$. In measure theoretic language, the information sets correspond to sub sigma fields, and $F_j$ is a CDF-valued random quantity measurable with respect to $\mathcal{A}_j$. The joint distribution of the quantities in (19) is encoded by a probability measure $\mathbb{Q}$ on $(\Omega, \mathcal{A})$. In this setting, a predictive distribution $F_j$ is ideal relative to $\mathcal{A}_j$ if it corresponds to the conditional distribution of the outcome $Y$ under $\mathbb{Q}$ given $\mathcal{A}_j$.
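Table 2: An example of a prediction space, with the perfect, climatological, unfocused, and sign-reversed forecasters. The random variable $\tau$ attains the values $-2$ and $2$ with probability $1/2$ each, independently of $\mu$ and $Y$; $\Phi$ denotes the CDF of the standard normal distribution.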
In a nutshell, a prediction space specifies the joint distribution of tuples of the form (19). To give an example, Table 2 revisits a scenario studied by Gneiting et al. (2007) and Gneiting and Ranjan (2013). (The only difference is that we let the random variable $\tau$ attain the values $-2$ and $2$, rather than the values $-1$ and $1$ as in Gneiting et al. (2007) and Gneiting and Ranjan (2013).) Here, the outcome is generated as $Y \mid \mu \sim \mathcal{N}(\mu, 1)$ where $\mu \sim \mathcal{N}(0, 1)$. The perfect forecaster is ideal relative to the sigma field generated by the random variable $\mu$. The unfocused and sign-reversed forecasters also have knowledge of $\mu$, but fail to be ideal. The climatological forecaster, issuing the unconditional distribution of the outcome as predictive distribution, is ideal relative to the uninformative sigma field generated by the empty set.
Any predictive distribution, $F$, can be reduced to a point forecast by extracting the sought functional, $T(F)$. In what follows, we focus on quantiles, the mean or expectation functional, and probability forecasts of the binary event that the outcome exceeds a threshold value. The respective point forecasts for the perfect, climatological, unfocused, and sign-reversed forecaster are shown in Table 2.
In practice, point forecasts might be an end in themselves, i.e., they might have been issued without there being an underlying predictive distribution. To accommodate such cases, we define a point prediction space to be a probability space $(\Omega, \mathcal{A}, \mathbb{Q})$, where the elements of the sample space $\Omega$ can be identified with tuples of the form
$$(X_1, \dots, X_k, Y), \qquad (20)$$
where the random variables $X_1, \dots, X_k$ represent point forecasts and utilize information sets $\mathcal{A}_1, \dots, \mathcal{A}_k \subseteq \mathcal{A}$, respectively. (For simplicity, we let $X_1, \dots, X_k$ be single-valued. Extensions to set-valued random quantities, as might occur in the case of quantiles, are straightforward.) The joint distribution of the point forecasts and the observation in (20) is specified by the probability measure $\mathbb{Q}$. Similarly, it is sometimes useful to consider a mixed prediction space, by specifying the joint distribution of tuples of the form
$$(F_1, \dots, F_j, X_1, \dots, X_k, Y), \qquad (21)$$
where $F_1, \dots, F_j$ represent CDF-valued random quantities, and $X_1, \dots, X_k$ represent point forecasts.
We now define notions of forecast dominance, starting with probabilistic forecasts that take the form of predictive CDFs, and then turning to point forecasts. In the former setting, a scoring rule is a suitably measurable function $S(F, y)$ that assigns a loss or penalty when we issue the predictive distribution $F$ and $y$ realizes. A scoring rule is proper if
$$\mathbb{E}_G[S(G, Y)] \le \mathbb{E}_G[S(F, Y)]$$
for all probability measures $F$ and $G$ in its domain of definition (Gneiting and Raftery 2007). Proper scoring rules therefore encourage honest and careful assessments. As is well known, a scoring function $S$ that is consistent for a single-valued functional $T$ relative to a class $\mathcal{F}$ induces a proper scoring rule, by defining $S(F, y) = S(T(F), y)$ for $F \in \mathcal{F}$ and $y \in \mathbb{R}$.
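Definition 1 (predictive CDFs). Let $F_1$ and $F_2$ be probabilistic forecasts, and let $Y$ be the outcome, in a prediction space. Then $F_1$ dominates $F_2$ relative to a class $\mathcal{P}$ of proper scoring rules if $\mathbb{E}_{\mathbb{Q}}[S(F_1, Y)] \le \mathbb{E}_{\mathbb{Q}}[S(F_2, Y)]$ for every $S \in \mathcal{P}$.
We now turn to quantiles and expectiles and the respective families $\mathcal{S}^Q_\alpha$ and $\mathcal{S}^E_\alpha$ of the regular consistent scoring functions for these functionals.
Definition 2a (quantiles). Let $X_1$ and $X_2$ be point forecasts, and let $Y$ be the outcome, in a point prediction space. Then $X_1$ dominates $X_2$ as an $\alpha$-quantile forecast if $\mathbb{E}_{\mathbb{Q}}[S(X_1, Y)] \le \mathbb{E}_{\mathbb{Q}}[S(X_2, Y)]$ for every scoring function $S \in \mathcal{S}^Q_\alpha$.
Definition 2b (expectiles). Let $X_1$ and $X_2$ be point forecasts, and let $Y$ be the outcome, in a point prediction space. Then $X_1$ dominates $X_2$ as an $\alpha$-expectile forecast if $\mathbb{E}_{\mathbb{Q}}[S(X_1, Y)] \le \mathbb{E}_{\mathbb{Q}}[S(X_2, Y)]$ for every scoring function $S \in \mathcal{S}^E_\alpha$.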
It is important to note that the expectations in the definitions are taken with respect to the joint distribution of the forecasts and the outcome. The notions provide partial orderings for the predictive distributions $F_1, \dots, F_k$ in (19) and the point forecasts $X_1, \dots, X_k$ in (20), respectively. (In the special case of probability forecasts of a binary event, related notions of sufficiency and dominance have been studied by DeGroot and Fienberg (1983), Vardeman and Meeden (1983), Schervish (1989), Krämer (2005), and Bröcker (2009).) Essentially, a probabilistic forecast that dominates another is preferable, or at least not inferior, in any type of decision that involves the respective predictive distributions. In the case of quantiles or expectiles, a point forecast that dominates another is preferable, or at least not inferior, in any type of decision problem that depends on the respective predictive distributions via the considered functional only. Adaptations to functionals other than quantiles or expectiles are straightforward.
Under which conditions does a forecast dominate another? Holzmann and Eulert (2014) recently showed that if two predictive distributions are ideal, then the one with the richer information set dominates the other. Furthermore, the result carries over to ideal forecasters’ induced point predictions, including but not limited to the cases of quantiles and expectiles that we consider here. To give an example in the setting of Table 2, the perfect and the climatological forecasters are ideal relative to the sigma fields generated by $\mu$, and generated by the empty set, respectively. Therefore, the perfect forecaster dominates the climatological forecaster, in any of the above senses.
Tsyplakov (2014) went on to show that if a predictive distribution is ideal relative to a certain information set, then it dominates any predictive distribution that is measurable with respect to the information set. Again, the result carries over to the induced point forecasts. In the setting of Table 2, the perfect forecaster is ideal relative to the sigma field generated by the random variables $\mu$ and $\tau$. The climatological, unfocused, and sign-reversed forecasters are measurable with respect to this sigma field, and so they are dominated by the perfect forecaster, in any of the above senses.
In the practice of forecasting, predictive distributions are hardly ever ideal, and information sets may not be nested, as emphasized by Patton (2015). Therefore, the above theoretical results are not readily applicable, and distinct scoring rules, or distinct consistent scoring functions, may yield distinct forecast rankings, as in empirical examples given by Schervish (1989), Merkle and Steyvers (2013), and Patton (2015), among others. Furthermore, in general it is not feasible to check the validity of the expectation inequalities in Definitions 1, 2a, and 2b for every proper scoring rule $S \in \mathcal{P}$, or every consistent scoring function $S \in \mathcal{S}^Q_\alpha$ or $S \in \mathcal{S}^E_\alpha$, respectively.
Fortunately, in the case of quantile and expectile forecasts, the mixture representations in Theorems 1a and 1b reduce checks for dominance to the respective one-dimensional families of elementary scoring functions.
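Corollary 1a (quantiles). In a point prediction space, $X_1$ dominates $X_2$ as an $\alpha$-quantile forecast if $\mathbb{E}_{\mathbb{Q}}\, S^{\mathrm{Q}}_{\alpha,\theta}(X_1, Y) \le \mathbb{E}_{\mathbb{Q}}\, S^{\mathrm{Q}}_{\alpha,\theta}(X_2, Y)$ for every $\theta \in \mathbb{R}$.
Corollary 1b (expectiles). In a point prediction space, $X_1$ dominates $X_2$ as an $\alpha$-expectile forecast if $\mathbb{E}_{\mathbb{Q}}\, S^{\mathrm{E}}_{\alpha,\theta}(X_1, Y) \le \mathbb{E}_{\mathbb{Q}}\, S^{\mathrm{E}}_{\alpha,\theta}(X_2, Y)$ for every $\theta \in \mathbb{R}$.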
The reduction to a one-dimensional problem suggests graphical comparisons via Murphy diagrams. Before we discuss this tool, we note that order sensitivity can sometimes be invoked to prove dominance. For example, consider the mixed prediction space setting (21) with one CDF-valued forecast and two point forecasts. Suppose that the CDF-valued random quantity $F$ is ideal relative to the sigma field $\mathcal{A}$, and let $X$ denote its $\alpha$-quantile. Suppose furthermore that $X_1$ and $X_2$ are measurable with respect to $\mathcal{A}$. By Corollary 1a in concert with Proposition 1a and a conditioning argument, $X_1$ dominates $X_2$ as an $\alpha$-quantile forecast if with probability one either
$$X_2 \le X_1 \le X \quad \text{or} \quad X \le X_1 \le X_2 \tag{22}$$
holds true. An analogous argument applies in the case of the $\alpha$-expectile.
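In the scenario of Table 2, the argument can be put to work in the case $\alpha = 1/2$, which corresponds to median and mean forecasts, respectively. Specifically, let $F = \mathcal{N}(\mu, 1)$ be the perfect forecast, which has median and mean $X = \mu$, let $\mathcal{A}$ be the sigma field generated by $\mu$, and let $X_1 = 0$ and $X_2 = -\mu$ be the climatological and the sign-reversed point forecasts. Since $0$ always lies between $-\mu$ and $\mu$, the required ordering holds with probability one. Invoking the order sensitivity argument, we see that the climatological forecaster dominates the sign-reversed forecaster for both median and mean predictions.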
As noted, Corollaries 1a and 1b suggest graphical tools for the comparison of quantile and expectile forecasts, including the special cases of the mean or expectation functional, and the further special case of probability forecasts of a binary event. We describe these diagnostic tools in the setting of a point prediction space (20), where $X_1, \ldots, X_k$ denote point forecasts for the outcome $Y$, and the probability measure $\mathbb{Q}$ represents their joint distribution. In the case of probability forecasts, we use the more suggestive notation $p_1, \ldots, p_k$ for the forecasts.
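For quantile forecasts at level $\alpha \in (0, 1)$, we plot the graph of the expected elementary quantile score $S^{\mathrm{Q}}_{\alpha,\theta}$,
$$\theta \mapsto s_j(\theta) = \mathbb{E}_{\mathbb{Q}}\, S^{\mathrm{Q}}_{\alpha,\theta}(X_j, Y) \tag{23}$$
for $j = 1, \ldots, k$. By Corollary 1a, forecast $X_i$ dominates forecast $X_j$ if and only if $s_i(\theta) \le s_j(\theta)$ for $\theta \in \mathbb{R}$. The area under $s_j$ equals the respective expected asymmetric piecewise linear score (6).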
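The area statement can be checked numerically for a single forecast and outcome pair. The following is a minimal sketch, assuming the elementary quantile score takes the value $1 - \alpha$ when $y \le \theta < x$, the value $\alpha$ when $x \le \theta < y$, and zero otherwise, and that the asymmetric piecewise linear score is $(\mathbb{1}\{y < x\} - \alpha)(x - y)$. All function names, the integration range, and the example values are hypothetical choices for illustration.

```python
def elementary_quantile_score(x, y, theta, alpha):
    """Assumed form: 1 - alpha if y <= theta < x, alpha if x <= theta < y, else 0."""
    if y <= theta < x:
        return 1.0 - alpha
    if x <= theta < y:
        return alpha
    return 0.0


def piecewise_linear_score(x, y, alpha):
    """Asymmetric piecewise linear score (1{y < x} - alpha) * (x - y)."""
    indicator = 1.0 if y < x else 0.0
    return (indicator - alpha) * (x - y)


def area_under_curve(x, y, alpha, lo=-10.0, hi=10.0, steps=20000):
    """Midpoint Riemann sum of theta -> elementary score over [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        theta = lo + (k + 0.5) * width
        total += elementary_quantile_score(x, y, theta, alpha)
    return total * width


# Illustrative values: forecast 1.5, outcome -0.5, level 0.9.
x, y, alpha = 1.5, -0.5, 0.9
area = area_under_curve(x, y, alpha)
target = piecewise_linear_score(x, y, alpha)
print(area, target)
```

Averaging over many forecast and outcome pairs turns the same identity into the statement about expected scores.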
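For expectile forecasts at level $\alpha \in (0, 1)$, we plot the graph of the expected elementary expectile score $S^{\mathrm{E}}_{\alpha,\theta}$,
$$\theta \mapsto s_j(\theta) = \mathbb{E}_{\mathbb{Q}}\, S^{\mathrm{E}}_{\alpha,\theta}(X_j, Y) \tag{24}$$
for $j = 1, \ldots, k$. By Corollary 1b, forecast $X_i$ dominates forecast $X_j$ if and only if $s_i(\theta) \le s_j(\theta)$ for $\theta \in \mathbb{R}$. The area under $s_j$ equals half the respective expected asymmetric squared error (8).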
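The factor of one half can be seen in a small numerical experiment. This is a sketch under the assumption that the elementary expectile score equals $(1 - \alpha)\,|y - \theta|$ when $y \le \theta < x$, equals $\alpha\,|y - \theta|$ when $x \le \theta < y$, and vanishes otherwise, and that the asymmetric squared error is $|\mathbb{1}\{y < x\} - \alpha|\,(x - y)^2$. The names and example values are hypothetical.

```python
def elementary_expectile_score(x, y, theta, alpha):
    """Assumed form: weight * |y - theta| on the interval between y and x."""
    if y <= theta < x:
        return (1.0 - alpha) * abs(y - theta)
    if x <= theta < y:
        return alpha * abs(y - theta)
    return 0.0


def asymmetric_squared_error(x, y, alpha):
    """|1{y < x} - alpha| * (x - y)^2."""
    indicator = 1.0 if y < x else 0.0
    return abs(indicator - alpha) * (x - y) ** 2


def area_under_curve(x, y, alpha, lo=-10.0, hi=10.0, steps=20000):
    """Midpoint Riemann sum of theta -> elementary score over [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        theta = lo + (k + 0.5) * width
        total += elementary_expectile_score(x, y, theta, alpha)
    return total * width


# Illustrative values: forecast -1, outcome 2, level 0.25.
x, y, alpha = -1.0, 2.0, 0.25
area = area_under_curve(x, y, alpha)
half_error = 0.5 * asymmetric_squared_error(x, y, alpha)
print(area, half_error)
```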
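For probability forecasts of a binary event, we plot the graph of the expected elementary score $S^{\mathrm{B}}_{\theta}$,
$$\theta \mapsto s_j(\theta) = \mathbb{E}_{\mathbb{Q}}\, S^{\mathrm{B}}_{\theta}(p_j, Y) \tag{25}$$
for $j = 1, \ldots, k$. By Corollary 1b, the probability forecast $p_i$ dominates $p_j$ if and only if $s_i(\theta) \le s_j(\theta)$ for $\theta \in (0, 1)$. The area under $s_j$ equals half the expected Brier score (15).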
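For a small set of probability forecasts, the curve and its area can be computed directly. The sketch below assumes the elementary score for a binary outcome equals $\theta$ when $y = 0$ and $p > \theta$, equals $1 - \theta$ when $y = 1$ and $p \le \theta$, and vanishes otherwise. The four forecasts and outcomes are made up for illustration, and the function names are hypothetical.

```python
def elementary_binary_score(p, y, theta):
    """Assumed form: theta if y == 0 and p > theta, 1 - theta if y == 1 and p <= theta."""
    if y == 0 and p > theta:
        return theta
    if y == 1 and p <= theta:
        return 1.0 - theta
    return 0.0


def murphy_curve(probs, outcomes, thetas):
    """Average elementary score at each threshold in thetas."""
    n = len(outcomes)
    return [
        sum(elementary_binary_score(p, y, t) for p, y in zip(probs, outcomes)) / n
        for t in thetas
    ]


# Made-up probability forecasts and binary outcomes.
probs = [0.2, 0.7, 0.9, 0.4]
outcomes = [0, 1, 1, 0]

# Midpoint grid on (0, 1); the area is the mean of the curve on this grid.
steps = 1000
thetas = [(k + 0.5) / steps for k in range(steps)]
curve = murphy_curve(probs, outcomes, thetas)
area = sum(curve) / steps

brier = sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)
print(area, 0.5 * brier)
```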
In the context of probability forecasts for binary weather events, displays of this type have a rich tradition that can be traced to Thompson and Brier (1955) and Murphy (1977). More recent examples include the papers by Schervish (1989), Richardson (2000), Wilks (2001), Mylne (2002), and Berrocal et al. (2010), among many others. Murphy (1977) distinguished three kinds of diagrams that reflect the economic decisions involved. The negatively oriented expense diagram shows the mean raw loss or expense of a given forecast scheme; the positively oriented value diagram takes the unconditional or climatological forecast as reference and plots the difference in expense between this reference forecast and the forecast at hand, and lastly, the relative-value diagram plots the ratio of the utility of a given forecast and the utility of an oracle forecast. The displays introduced above are similar to the value diagrams of Murphy, and we refer to them as Murphy diagrams. Our Murphy diagrams are by default negatively oriented and plot the expected elementary score for competing quantile, expectile, and probability forecasters. For better visual appearance, we generally connect the left- and right-hand limits at the jump points of the empirical score curves.
Figure 2 shows Murphy diagrams for the perfect, climatological, unfocused, and sign-reversed forecasters in Table 2. We compare point predictions for the mean or expectation functional, and the quantile at a fixed level $\alpha$, along with probability forecasts for the binary event that the outcome exceeds the threshold value 2. Analytic expressions for the respective expected scores are given in Appendix B. As proved in the previous section, the perfect forecaster dominates the other forecasters for all functionals considered. The expected score curves for the climatological and the unfocused, and for the unfocused and the sign-reversed forecasters, intersect in all three cases, so there are no order relations between these forecasters. Finally, the Murphy diagrams suggest that the climatological forecaster dominates the sign-reversed forecaster for all three functionals, and in the case of the mean functional, the order sensitivity argument in the previous section confirms the visual impression. In the cases of the quantile and probability forecasts, final confirmation would need to be based on tedious analytic investigations of the asymptotic behavior of the expected score functions.
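By default, our Murphy diagrams show the expected elementary scores. If interest focuses on binary comparisons, it is natural to consider Murphy diagrams for the difference,
$$\theta \mapsto \mathbb{E}_{\mathbb{Q}}\, S_{\theta}(X_1, Y) - \mathbb{E}_{\mathbb{Q}}\, S_{\theta}(X_2, Y), \tag{26}$$
between the expected elementary scores of two point forecasters.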
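We now turn to the comparison and ranking of empirical forecasts. Specifically, we consider tuples
$$(x_{i1}, \ldots, x_{ik}, y_i), \qquad i = 1, \ldots, n, \tag{27}$$
where $x_{1j}, \ldots, x_{nj}$ are the $j$th forecaster’s point predictions, for $j = 1, \ldots, k$, and $y_1, \ldots, y_n$ are the respective outcomes. Thus, we have $k$ competing forecasters, and each of them issues a set of $n$ point predictions. A convenient interpretation of the empirical setting is as a special case of a point prediction space, in which the tuples $(X_1, \ldots, X_k, Y)$ in (20) attain each of the values in (27) with probability $1/n$. Then the probability measure $\mathbb{Q}$ is the corresponding empirical measure, and with this identification, the (average) empirical scores
$$s_j(\theta) = \frac{1}{n} \sum_{i=1}^{n} S_{\theta}(x_{ij}, y_i), \tag{28}$$
where $S_{\theta}$ is either $S^{\mathrm{Q}}_{\alpha,\theta}$, $S^{\mathrm{E}}_{\alpha,\theta}$, or $S^{\mathrm{B}}_{\theta}$, become the expected elementary scores from (23), (24), and (25), respectively. To compare forecasters $1$ and $2$, say, it is convenient to show a Murphy plot of the equivalent of the difference (26), namely
$$\theta \mapsto \frac{1}{n} \sum_{i=1}^{n} \left[ S_{\theta}(x_{i1}, y_i) - S_{\theta}(x_{i2}, y_i) \right], \tag{29}$$
for $\theta \in \mathbb{R}$, and again $S_{\theta}$ is either $S^{\mathrm{Q}}_{\alpha,\theta}$, $S^{\mathrm{E}}_{\alpha,\theta}$, or $S^{\mathrm{B}}_{\theta}$, respectively.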
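The empirical score curves and their difference are simple averages, as the following sketch for median forecasts illustrates. It assumes the elementary quantile score takes the value $1 - \alpha$ when $y \le \theta < x$, the value $\alpha$ when $x \le \theta < y$, and zero otherwise. The data, the function names, and the sign convention of the difference (first forecaster minus second) are assumptions made for illustration.

```python
def elementary_quantile_score(x, y, theta, alpha):
    """Assumed form: 1 - alpha if y <= theta < x, alpha if x <= theta < y, else 0."""
    if y <= theta < x:
        return 1.0 - alpha
    if x <= theta < y:
        return alpha
    return 0.0


def empirical_score(forecasts, outcomes, theta, alpha):
    """Average elementary score of one forecaster at threshold theta."""
    n = len(outcomes)
    return sum(
        elementary_quantile_score(x, y, theta, alpha)
        for x, y in zip(forecasts, outcomes)
    ) / n


def score_difference(first, second, outcomes, theta, alpha):
    """Average score of the first forecaster minus that of the second (assumed sign)."""
    return (empirical_score(first, outcomes, theta, alpha)
            - empirical_score(second, outcomes, theta, alpha))


# Made-up data: two forecasters, three forecast cases, median forecasts.
x1 = [1.0, 2.0, 0.0]
x2 = [3.0, 0.0, 1.0]
y = [2.0, 1.0, 0.5]
alpha = 0.5

# Evaluate the difference on a coarse grid of thresholds for plotting.
grid = [-1.0 + 0.25 * k for k in range(21)]
differences = [score_difference(x1, x2, y, t, alpha) for t in grid]
for t, d in zip(grid, differences):
    print(f"theta = {t:5.2f}   difference = {d:7.4f}")
```

Negative values of the difference favor the first forecaster at that threshold, and a curve that changes sign rules out dominance in either direction.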
Murphy diagrams can be used efficiently to show a lack of domination when forecasters’ expected elementary score curves intersect. However, in general it is not possible to conclude domination, unless the visual impression is supported by tedious analytic investigations of the behavior of the expected score functions as $\theta \to \pm\infty$. Fortunately, these complications do not arise in the empirical case, where dominance can be established by comparing the empirical score functions at a well-defined, finite set of arguments only, as follows.
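Corollary 2a (quantiles). An empirical forecast $X_1$ dominates $X_2$ for $\alpha$-quantile predictions if
$$\frac{1}{n} \sum_{i=1}^{n} S^{\mathrm{Q}}_{\alpha,\theta}(x_{i1}, y_i) \le \frac{1}{n} \sum_{i=1}^{n} S^{\mathrm{Q}}_{\alpha,\theta}(x_{i2}, y_i)$$
for $\theta \in \{x_{11}, x_{12}, \ldots, x_{n1}, x_{n2}, y_1, \ldots, y_n\}$.
Corollary 2b (expectiles). An empirical forecast $X_1$ dominates $X_2$ for $\alpha$-expectile predictions if
$$\frac{1}{n} \sum_{i=1}^{n} S^{\mathrm{E}}_{\alpha,\theta}(x_{i1}, y_i) \le \frac{1}{n} \sum_{i=1}^{n} S^{\mathrm{E}}_{\alpha,\theta}(x_{i2}, y_i)$$
for $\theta \in \{x_{11}, x_{12}, \ldots, x_{n1}, x_{n2}, y_1, \ldots, y_n\}$ and in the left-hand limit as $\theta \uparrow \theta_0$, $\theta_0 \in \{x_{11}, x_{12}, \ldots, x_{n1}, x_{n2}\}$. In the case $\alpha = 1/2$ evaluations at $y_1, \ldots, y_n$ can be omitted.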
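To see why these results hold, note that in either case the score differential, considered as a function of $\theta$, is right-continuous, and that it vanishes unless $\min_i \min(x_{i1}, x_{i2}) \le \theta < \max_i \max(x_{i1}, x_{i2})$. Furthermore, in the case of quantiles the score differential is piecewise constant with no other jump points than $x_{i1}$, $x_{i2}$, or $y_i$, for $i = 1, \ldots, n$. Similarly, in the case of expectiles the score differential is piecewise linear with no other jump points than $x_{i1}$ and $x_{i2}$, and no other change of slope than at $y_i$. The change of slope disappears when $\alpha = 1/2$. Hence the sign of the score differential on the whole real line is determined by its values, and in the expectile case also its left-hand limits, at these finitely many points. Figure 3 illustrates the behavior of the score differential in the cases of the median and the mean, respectively.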
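The finite checks in Corollaries 2a and 2b can be sketched in a few lines. The code below assumes the elementary quantile score equals $1 - \alpha$ when $y \le \theta < x$ and $\alpha$ when $x \le \theta < y$, and the elementary expectile score equals $(1 - \alpha)(\theta - y)$ and $\alpha (y - \theta)$ on the same intervals, both being zero elsewhere. The function names, the closed-form left-hand limit, and the numerical tolerance are hypothetical choices for illustration, not the authors' implementation.

```python
def quantile_score(x, y, theta, alpha):
    if y <= theta < x:
        return 1.0 - alpha
    if x <= theta < y:
        return alpha
    return 0.0


def expectile_score(x, y, theta, alpha):
    if y <= theta < x:
        return (1.0 - alpha) * (theta - y)
    if x <= theta < y:
        return alpha * (y - theta)
    return 0.0


def expectile_score_left_limit(x, y, theta, alpha):
    """Limit of the expectile score as the threshold increases to theta."""
    if y < theta <= x:
        return (1.0 - alpha) * (theta - y)
    if x < theta <= y:
        return alpha * (y - theta)
    return 0.0


def mean_score(score, forecasts, outcomes, theta, alpha):
    n = len(outcomes)
    return sum(score(x, y, theta, alpha) for x, y in zip(forecasts, outcomes)) / n


def dominates(x1, x2, y, alpha, functional="quantile", tol=1e-12):
    """True if x1 scores no worse than x2 at every point of the finite check set."""
    forecasts = list(x1) + list(x2)
    if functional == "quantile":
        checks = [(quantile_score, t) for t in forecasts + list(y)]
    else:
        # Outcomes can be skipped for alpha = 1/2 (no change of slope there).
        outcomes = [] if alpha == 0.5 else list(y)
        checks = [(expectile_score, t) for t in forecasts + outcomes]
        checks += [(expectile_score_left_limit, t) for t in forecasts]
    return all(
        mean_score(s, x1, y, t, alpha) <= mean_score(s, x2, y, t, alpha) + tol
        for s, t in checks
    )


# Made-up data: a forecaster that always hits the outcome versus a constant one.
y = [0.0, 1.0, 2.0]
perfect = list(y)
constant = [0.5, 0.5, 0.5]
print(dominates(perfect, constant, y, 0.5, "quantile"))
print(dominates(constant, perfect, y, 0.5, "quantile"))
print(dominates([1.0], [0.0], [0.0], 0.5, "expectile"))
```

The last call shows why the left-hand limits matter in the expectile case: the point evaluations alone cannot distinguish the forecast 1 from the exact forecast 0, whereas the left-hand limit at 1 reveals the positive score just below the jump.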
To give an example, we consider the 10 forecasters in Table A.1 of Merkle and Steyvers (2013), each of whom issues probability forecasts for 21 binary events. The data are artificial but mimic forecasters in the Aggregate Contingent Estimation System (ACES), a web-based survey that solicited probability forecasts for world events from the general public. The Murphy diagram in the left-hand panel of Figure 4 shows the empirical score curves
$$\theta \mapsto s_j(\theta) = \frac{1}{21} \sum_{i=1}^{21} S^{\mathrm{B}}_{\theta}(p_{ij}, y_i),$$
where $p_{ij}$ is forecaster $j$’s stated probability for world event $i$ to materialize, and $y_i$ is the respective binary realization. By Corollary 2b, dominance relations can be inferred by evaluating $s_j$ at the forecasters’ stated probabilities. We note that ID 3 dominates IDs 6 and 8, and that ID 5 dominates ID 10. The remaining pairwise comparisons do not give rise to dominance relations. T