People and algorithms constantly rely on probabilistic forecasts (about medical treatments, weather, transportation times, etc.) and make potentially high-stakes decisions based on them. In most cases, forecasts are not perfect; e.g., the forecasted chance that it will rain tomorrow does not exactly match the true probability. While average performance statistics (accuracy, calibration, etc.) might be available, it is generally impossible to tell whether any individual prediction is reliable (individually calibrated), e.g., a prediction about the medical condition of a specific patient or the delay of a particular flight [21, 1, 22]. Intuitively, this is because multiple identical datapoints are needed to confidently estimate a probability from empirical frequencies, but identical datapoints are rare in real-world applications (e.g., no two patients are exactly alike). Given these limitations, we study alternative mechanisms to convey confidence about individual predictions to decision makers.
We consider settings where a single forecaster provides predictions to many decision makers, each facing a potentially different decision making problem. For example, a personalized medicine service could predict whether a product is effective for thousands of individual patients [19, 20, 2]. If the prediction is accurate for 70% of patients, it could be accurate for Alice but not for Bob, or vice versa. Therefore, Alice might be hesitant to make decisions based on the 70% average accuracy. In this setting, we propose an insurance-like mechanism that 1) enables each decision maker to confidently make decisions as if the advertised probabilities were individually correct, and 2) is implementable by the forecaster with provably vanishing costs in the long run.
To achieve this, we turn to the classic idea [6, 14] that a probabilistic belief is equivalent to a willingness to take bets. We use the previous example to illustrate that if the forecaster is willing to take bets, a decision maker can bet with the forecaster as an “insurance” against mis-prediction. Suppose Alice is trying to decide whether or not to use a product. If she uses the product, she gains $10 if the product is effective and loses $2 otherwise. The personalized medicine service (forecaster) predicts that the product is effective with 50% chance for Alice. Under this probability Alice expects to gain $4 if she decides to use the product, but she is worried the probability is incorrect. Alice proposes a bet: Alice pays the forecaster $6 if the product is effective, and the forecaster pays Alice $6 otherwise. The forecaster should accept the bet because under its own forecasted probability the bet is fair (i.e., the expectation is zero if the forecasted probabilities are true for Alice). Alice gets the guarantee that if she decides to use the product, effective or not, she gains $4 — equal to her expected utility under the forecasted (and possibly incorrect) probability. In general, we show that Alice has a way of choosing bets for any utility function and forecasted probability, such that her true gain equals her expected gain under the forecasted probability.
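As a numeric sanity check of the example above, the sketch below (with our own function and variable names, not code from the paper) verifies that the proposed bet makes Alice's realized gain equal to her forecasted expected gain under both outcomes:

```python
# Sketch of Alice's hedging bet from the example above; the numbers
# are from the text, the function names are ours.
def hedged_gain(gain_if_effective, gain_if_not, forecast_p, effective):
    """Alice's decision gain plus bet payoff when she stakes the amount
    that makes her total gain identical under both outcomes."""
    stake = gain_if_effective - gain_if_not       # $12 in the example
    # The bet pays Alice stake * (forecast_p - y): it has zero expectation
    # under the forecasted probability, so the forecaster should accept it.
    y = 1 if effective else 0
    bet_payoff = stake * (forecast_p - y)         # -$6 or +$6 here
    gain = gain_if_effective if effective else gain_if_not
    return gain + bet_payoff

# Either way Alice nets $4, her expected gain under the 50% forecast.
assert hedged_gain(10, -2, 0.5, effective=True) == 4.0
assert hedged_gain(10, -2, 0.5, effective=False) == 4.0
```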
From the forecaster’s perspective, if the true probability that Alice’s treatment is effective is actually 10%, then the forecaster will lose $4.8 from this bet in expectation. However, in our setup, the forecaster makes probabilistic forecasts for many different decision makers, and each decision maker selects some bet based on their utility function and forecasted probability. The forecaster might gain or lose on individual bets, but it only needs to not lose on the entire set of bets on average for the approach to be sustainable. Intuitively, each decision maker’s difference between forecasted gain and true gain can be averaged across the pool of decision makers. The difficult requirement that each difference should be negative has been reduced to an easier requirement that the average difference should be negative.
However, this protocol leaves the forecaster vulnerable to exploitation. For example, suppose Alice already knows that the product will be ineffective; she could still bet with the forecaster for the malicious purpose of gaining $6. Surprisingly, we show that in the online setup, the forecaster has an algorithm to adapt its forecasts and guarantee vanishing loss in the long run, even in the presence of malicious decision makers. This is achieved by first using any existing online prediction algorithm to predict the probabilities, then applying a post-processing algorithm to fine-tune these probabilities based on past gains/losses (similar to the idea of recalibration [16, 11]).
As a concrete application of our approach, we simulate the interaction between an airline and passengers with real flight delay data. Risk averse passengers might want to avoid a flight if there is possibility of delay and their loss in case of delay is high. We show if an airline offers to accept bets based on the predicted probability of delay, it can help risk-averse passengers make better decisions, and increase both the airline’s revenue (due to increased demand for the flight) and the total utility (airline revenue plus passenger utility).
We further verify our theory with large scale simulations on several datasets and a diverse benchmark of decision tasks. We show that forecasters based on our post-processing algorithm consistently achieve close to zero betting loss (on average) within a small number of time steps. On the other hand, several seemingly reasonable alternative algorithms not only lack theoretical guarantees, but often suffer from positive average betting loss in practice.
2.1 Decision Making with Forecasts
This section defines the basic setup of the paper. We represent the decision making process as a multi-player game between nature, a forecaster, and a set of (decision making) agents. At every step $t$, nature reveals an input observation $x_t$ to the forecaster (e.g. patient medical records) and selects the hidden probability $p^*_t$ that $y_t = 1$ (e.g. the probability that treatment is successful). We only consider binary variables ($y_t \in \{0, 1\}$) and defer the general case to Appendix B.
The forecaster chooses a forecasted probability $p_t$ to approximate $p^*_t$. We also allow the forecaster to represent its lack of knowledge about $p^*_t$: the forecaster additionally outputs a confidence $c_t$, where the hope is that $|p_t - p^*_t| \le c_t$.
At each time step, one or more agents can use the forecast $p_t$ and confidence $c_t$ to make decisions, i.e. to select an action $a_t \in \mathcal{A}$. For simplicity we assume that different agents make decisions at different time steps, so at each time step there is only a single agent, and we can uniquely index the agent by the time step $t$. The agent knows its own loss (negative utility) function $\ell_t : \mathcal{A} \times \{0, 1\} \to [0, L]$ (the forecaster does not have to know this), where $L$ is the maximum loss involved in each decision. This protocol is formalized below.
Protocol 1: Decision Making with Forecasts
Nature reveals $x_t$ to the forecaster and chooses $p^*_t$ without revealing it
Forecaster reveals $(p_t, c_t)$ where $p_t, c_t \in [0, 1]$
Agent has loss function $\ell_t$ and reveals an action $a_t$ selected according to $p_t$ and $c_t$
Nature samples $y_t \sim \mathrm{Bernoulli}(p^*_t)$ and reveals $y_t$; agent incurs loss $\ell_t(a_t, y_t)$
We make no assumptions on nature, the forecaster, or the agents. They can choose any strategy to generate their actions, as long as they do not look into the future (i.e. each action only depends on variables that have already been revealed). In particular, we make no i.i.d. assumptions on how nature selects $x_t$ and $p^*_t$; for example, nature could even select them adversarially to maximize the agent's loss.
2.2 Individual Coverage
Ideally in Protocol 1 the forecaster's prediction should satisfy $|p_t - p^*_t| \le c_t$ for each individual $t$ (this is often called individual coverage or individual calibration in the literature). However, many existing results show that learning individually calibrated probabilities from past data is often impossible [21, 1, 22] unless the forecast is trivial (i.e. $c_t$ is so large that the interval $[p_t - c_t, p_t + c_t]$ covers almost all of $[0, 1]$).
One intuitive reason for this impossibility result is that in many practical scenarios, for each $x_t$ we only observe a single sample $y_t$. The forecaster cannot infer $p^*_t$ from a single sample without relying on unverifiable assumptions.
2.3 Probability as Willingness to Bet
A major justification for probability theory has been that probability can represent a willingness to bet [6, 12]. For example, if you truly believe that a coin is fair, it would be inconsistent for you to refuse a bet that wins $1 for heads and loses $1 for tails (assuming you only care about average gain rather than risk). More generally, a forecaster that holds a probabilistic belief should be willing to accept any bet in which it gains a non-negative amount in expectation.
For binary variables, we consider the case where a forecaster believes that a binary event happens with some probability $p^*$ but does not know its exact value. The forecaster only believes that $p^* \in [p - c, p + c]$. The forecaster should then be willing to accept any bet with non-negative expected return under every $p^* \in [p - c, p + c]$. For example, assume the forecaster believes that a coin comes up heads with at least 40% and at most 60% chance. The forecaster should be willing to win $6 for heads and lose $4 for tails; similarly, the forecaster should be willing to lose $4 for heads and win $6 for tails.
More generally, according to Lemma 1 (proved in Appendix D), a forecaster believes that the success probability of the binary event satisfies $p^* \in [p - c, p + c]$ if and only if she is willing to accept every bet in which she loses $b(y - p) - |b| c$ for some $b$.
Lemma 1. Let $p, c$ be such that $[p - c, p + c] \subseteq [0, 1]$. A function $f : \{0, 1\} \to \mathbb{R}$ satisfies $\mathbb{E}_{y \sim \mathrm{Bernoulli}(p')}[f(y)] \le 0$ for all $p' \in [p - c, p + c]$ if and only if, for some $b \in \mathbb{R}$ and all $y \in \{0, 1\}$, $f(y) \le b(y - p) - |b| c$.
In words, a forecaster is willing to lose $f(y)$ if $f$ has non-positive expectation under every probability the forecaster considers possible. However, every such function is dominated (i.e. the forecaster loses less) by $b(y - p) - |b| c$ for some $b$. Therefore, we only have to consider whether a forecaster is willing to accept bets of the form $f(y) = b(y - p) - |b| c$.
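The "if" direction of this characterization can be checked numerically in a few lines (an illustrative sketch with our own helper name): for any stake $b$, the payment $b(y - p) - |b|c$ has non-positive expectation under every probability in $[p - c, p + c]$, so the forecaster can offer the whole family without losing in expectation.

```python
import numpy as np

# Expected payoff, to the agent, of the bet b*(y - p) - |b|*c when the
# true success probability is p_true (illustrative sketch).
def expected_payoff(b, p_true, p, c):
    return b * (p_true - p) - abs(b) * c

p, c = 0.5, 0.1
for b in np.linspace(-5.0, 5.0, 101):
    for p_true in np.linspace(p - c, p + c, 101):
        # |p_true - p| <= c implies b*(p_true - p) <= |b|*c
        assert expected_payoff(b, p_true, p, c) <= 1e-12
```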
3 Decisions with Unreliable Forecasts
In Protocol 1, agents can make decisions based on the forecasted probability $p_t$ and the agent's loss $\ell_t$. For example, the agent could choose
$a_t \in \arg\min_{a \in \mathcal{A}} \mathbb{E}_{y \sim \mathrm{Bernoulli}(p_t)}[\ell_t(a, y)]$
to minimize the expected loss under the forecasted probability.
However, how can the agent know that this decision has low expected loss under the true probability $p^*_t$? This can be achieved with two desiderata, which we formalize below.
We denote the agent's maximum / average / minimum expected loss under the forecasted probability as $\overline{L}_t(a) := \max_{p' \in [p_t - c_t, p_t + c_t]} \mathbb{E}_{y \sim \mathrm{Bernoulli}(p')}[\ell_t(a, y)]$, $L_t(a) := \mathbb{E}_{y \sim \mathrm{Bernoulli}(p_t)}[\ell_t(a, y)]$, $\underline{L}_t(a) := \min_{p' \in [p_t - c_t, p_t + c_t]} \mathbb{E}_{y \sim \mathrm{Bernoulli}(p')}[\ell_t(a, y)]$,
and the true expected loss as $L^*_t(a) := \mathbb{E}_{y \sim \mathrm{Bernoulli}(p^*_t)}[\ell_t(a, y)]$. If the agent knows that
Desideratum 1 The true expected loss satisfies $L^*_t(a) \in [\underline{L}_t(a), \overline{L}_t(a)]$.
Desideratum 2 The interval size $\overline{L}_t(a) - \underline{L}_t(a)$ is close to $0$.
then the agent can infer that the true expected loss $L^*_t(a)$ is not too far from the forecasted expected loss $L_t(a)$. This is because if $\overline{L}_t(a) - \underline{L}_t(a)$ is small then $\underline{L}_t(a)$ will be close to $\overline{L}_t(a)$, and both $L^*_t(a)$ and $L_t(a)$ will be sandwiched in the small interval $[\underline{L}_t(a), \overline{L}_t(a)]$.
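Because the expected loss of a binary outcome is linear in the probability, the min/max expected loss interval is attained at the endpoints of $[p - c, p + c]$ and is easy to compute; a minimal sketch with our own helper name:

```python
# Min/max expected loss of an action over all probabilities the forecaster
# considers possible, i.e. p' in [p - c, p + c] intersected with [0, 1].
def loss_interval(loss_if_0, loss_if_1, p, c):
    lo, hi = max(p - c, 0.0), min(p + c, 1.0)
    e_lo = (1 - lo) * loss_if_0 + lo * loss_if_1
    e_hi = (1 - hi) * loss_if_0 + hi * loss_if_1
    return min(e_lo, e_hi), max(e_lo, e_hi)

# With c = 0 the interval collapses to the forecasted expected loss;
# Desideratum 2 asks that it be narrow in general.
assert loss_interval(2.0, 0.0, 0.5, 0.0) == (1.0, 1.0)
lo, hi = loss_interval(2.0, 0.0, 0.5, 0.1)
assert abs((hi - lo) - 0.4) < 1e-9
```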
However, we show that desiderata 1 and 2 often cannot be achieved simultaneously. To guarantee desideratum 1, in general the forecaster must output individually correct probabilities (i.e. $|p_t - p^*_t| \le c_t$), as shown by the following proposition (proof in Appendix D).
Proposition 1. For any $p_t, c_t, p^*_t$ where $p_t, p^*_t \in [0, 1]$ and $c_t \ge 0$:
1. If $|p_t - p^*_t| \le c_t$, then for every loss function $\ell_t$ and action $a$ we have $L^*_t(a) \in [\underline{L}_t(a), \overline{L}_t(a)]$.
2. If $|p_t - p^*_t| > c_t$, then there exists a loss function $\ell_t$ and action $a$ such that $L^*_t(a) \notin [\underline{L}_t(a), \overline{L}_t(a)]$.
In words, unless $|p_t - p^*_t| \le c_t$, we cannot guarantee that $L^*_t(a) \in [\underline{L}_t(a), \overline{L}_t(a)]$ without assuming that the agent's loss function is special (e.g. a constant function). However, in Section 2.2 we argued that it is usually impossible to achieve $|p_t - p^*_t| \le c_t$ unless $c_t$ is very large. If $c_t$ is too large, the interval $[\underline{L}_t(a), \overline{L}_t(a)]$ will be large, and the guarantee that $L^*_t(a)$ lies in it would be practically useless even if it were true. This means the forecaster cannot convey confidence in the individual predictions it makes, and as a result the agent cannot be very confident about the expected loss it will incur.
3.1 Insuring against unreliable forecasts
Since it is difficult to satisfy desiderata 1 and 2 simultaneously, we consider relaxing desideratum 1. In particular, we study what guarantees are possible for each individual decision maker even when $|p_t - p^*_t| > c_t$, i.e., the prediction is wrong.
We consider the setup where each agent can receive a side payment $m_t(y_t)$ from the forecaster (a form of "insurance" which can depend on the outcome $y_t$, and can be positive or negative), and we would like to guarantee
Desideratum 1' The true expected loss with side payment satisfies $\mathbb{E}_{y \sim \mathrm{Bernoulli}(p^*_t)}[\ell_t(a_t, y) - m_t(y)] \in [\underline{L}_t(a_t), \overline{L}_t(a_t)]$.
In other words, we would like the expected loss under the true distribution to be predictable once we incorporate the side payment.
Note that desideratum 1' can be trivially satisfied if the forecaster is willing to pay any side payment to the decision agent. For example, an agent can choose $m_t(y) = \ell_t(a_t, y) - \overline{L}_t(a_t)$, which makes her net loss the constant $\overline{L}_t(a_t)$ and hence satisfies desideratum 1'. However, if the forecaster offers arbitrary side payments, it can be exploited: decision agents could simply request that the forecaster pay $1 under every outcome $y$. Such a mechanism cannot be sustainable for the forecaster.
3.2 Insuring with fair bets
Even though the forecaster cannot offer arbitrary payments to the decision agent, we show that the forecaster can offer a sufficiently large set of payments, such that [i] each decision agent can select a payment to satisfy desideratum 1' and [ii] the forecaster has an algorithm to guarantee vanishing loss in the long run, even when the decision agents try to exploit the forecaster.
In fact, the “fair bets” of Section 2.3 satisfy our requirements. Specifically, the forecaster can offer the set
$\{ y \mapsto b(y - p_t) - |b| c_t : b \in [-B, B] \}$
of available side payment options. The constant $B$ caps the maximum payment each decision agent can request (in our setup the stake an agent needs is also upper bounded by the maximum loss $L$). This set of payments satisfies both [i] (which we show in this section) and [ii] (which we show in the next section).
Before we proceed to show [i] and [ii], for convenience we formally write down the new protocol. Compared to Protocol 1, the decision agent additionally selects a “stake” $b_t \in [-B, B]$ and receives the side payment $b_t(y_t - p_t) - |b_t| c_t$ from the forecaster.
Protocol 2: Decision Making with Bets
Nature reveals observation $x_t$ and chooses $p^*_t$ without revealing it
Forecaster reveals $(p_t, c_t)$ where $p_t, c_t \in [0, 1]$
Agent has loss function $\ell_t$ and reveals an action $a_t$ and stake $b_t \in [-B, B]$ selected according to $p_t$ and $c_t$
Nature samples $y_t \sim \mathrm{Bernoulli}(p^*_t)$ and reveals $y_t$
Agent incurs loss $\ell_t(a_t, y_t) - b_t(y_t - p_t) + |b_t| c_t$; forecaster incurs loss $b_t(y_t - p_t) - |b_t| c_t$
Denote the agent's true expected loss with side payment as $L^{b}_t(a_t) := \mathbb{E}_{y \sim \mathrm{Bernoulli}(p^*_t)}[\ell_t(a_t, y) - b_t(y - p_t) + |b_t| c_t]$ (i.e. the LHS in desideratum 1'); then we have the following guarantee¹ for any choice of $p_t, c_t, p^*_t$ and $a_t$. (¹For the more general version of the proposition in the multiclass setup, see Appendix A.1.)
Proposition 2. If the stake is $b_t = \ell_t(a_t, 1) - \ell_t(a_t, 0)$, then $L^{b}_t(a_t) = \overline{L}_t(a_t)$.
In words, the agent has a choice of stake that only depends on variables known to the agent ($\ell_t$ and $a_t$) and does not depend on variables unknown to the agent ($p^*_t$, $y_t$). If the agent chooses this stake, she can be certain that desideratum 1' is satisfied, regardless of what the forecaster or nature does (they can choose any $p_t, c_t, p^*_t$).
This mechanism allows the agent to make decisions as if the forecasted probability were correct, i.e. as if $|p_t - p^*_t| \le c_t$. This is because Proposition 2 holds for any choice of action $a_t$ (as long as the agent chooses the stake according to Proposition 2 after selecting $a_t$). Intuitively, for any action the agent selects, she is guaranteed a total loss close to the forecasted expected loss $L_t(a_t)$ (assuming $c_t$ is small). This is the same guarantee she would get if $|p_t - p^*_t| \le c_t$ were actually true.
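The construction in Proposition 2 can be checked numerically: with the stake set to the loss gap, the agent's net expected loss is the same constant for every possible true probability. The sketch below uses our own variable names and toy numbers:

```python
import numpy as np

# Numeric check of the Proposition-2 construction: with stake
# b = loss(a,1) - loss(a,0), the agent's expected loss *after* receiving
# the side payment b*(y - p) - |b|*c no longer depends on the unknown
# true probability. Illustrative sketch, not the paper's code.
loss0, loss1 = 2.0, 10.0          # agent's loss when y = 0 / y = 1
p, c = 0.5, 0.05                  # forecast and confidence
b = loss1 - loss0                 # the Proposition-2 stake

net = []
for p_true in np.linspace(0.0, 1.0, 11):   # any true probability
    exp_loss = (1 - p_true) * loss0 + p_true * loss1
    exp_payment = b * (p_true - p) - abs(b) * c    # agent's expected receipt
    net.append(exp_loss - exp_payment)
# The net expected loss is the same constant for every p_true ...
assert max(net) - min(net) < 1e-9
# ... equal to the maximum expected loss over [p - c, p + c]:
assert abs(net[0] - ((1 - p) * loss0 + p * loss1 + abs(b) * c)) < 1e-9
```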
In addition, if [ii] is satisfied (i.e. the forecaster has vanishing loss), the forecaster also does not lose anything, so it should have no incentive to avoid offering these payments. We discuss this in the next section.
4 Probability Forecaster Strategy
In this section we study the forecaster’s strategy. As motivated in the previous section, the goal of the forecaster (in Protocol 2) is to:
1) have non-positive cumulative loss when $T$ is large, so that the side payments are sustainable;
2) output the smallest $c_t$ compatible with 1), so that forecasts are as sharp as possible.
Specifically, the forecaster's average cumulative loss (up to time $T$) in Protocol 2 is
$\mathcal{L}_T := \frac{1}{T} \sum_{t=1}^{T} \big( b_t (y_t - p_t) - |b_t| c_t \big). \quad (3)$
Whether Eq.(3) is non-positive depends on the actions of all the players: the forecaster ($p_t, c_t$), nature ($p^*_t, y_t$) and the agent ($b_t$). Our focus is on the forecaster, so we say that a sequence of forecasts $\{(p_t, c_t)\}$ is asymptotically sound relative to $\{b_t\}$ if the forecaster loss in Protocol 2 is asymptotically non-positive, i.e.
$\limsup_{T \to \infty} \mathcal{L}_T \le 0. \quad (4)$
In subsequent development we will use a stronger definition than Eq.(4). We say that a sequence of forecasts $\{(p_t, c_t)\}$ is asymptotically exact relative to $\{b_t\}$ if the forecaster loss in Protocol 2 vanishes, i.e.
$\lim_{T \to \infty} \mathcal{L}_T = 0. \quad (5)$
Intuitively, asymptotic soundness requires that the forecaster not lose in the long run; asymptotic exactness requires that the forecaster neither lose nor win in the long run, a stronger requirement. (In the mechanism design literature, Eq.(4) and Eq.(5) are typically referred to as weak and strong budget balance; here we use the terminology of the probability forecasting literature.)
The reason we focus on asymptotic exactness is that the forecaster should output the smallest possible $c_t$ to achieve sharp forecasts. Observe that the left-hand side of Eq.(4) increases as $c_t$ decreases. Therefore, whenever the forecaster is asymptotically sound but not asymptotically exact (i.e. the left-hand side of Eq.(4) is strictly negative), there is room to decrease $c_t$ without violating asymptotic soundness.
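The average loss in Eq.(3) and the monotonicity used in this argument can be checked in a few lines (an illustrative sketch on synthetic data; `average_loss` is our own helper):

```python
import numpy as np

# Eq.(3) as code: the forecaster's average betting loss up to time T.
def average_loss(b, y, p, c):
    return np.mean(b * (y - p) - np.abs(b) * c)

rng = np.random.default_rng(1)
T = 1000
p = rng.uniform(0.2, 0.8, T)                  # forecasts
y = (rng.uniform(size=T) < p).astype(float)   # outcomes drawn from p itself
b = rng.uniform(-1.0, 1.0, T)                 # arbitrary stakes
c = np.full(T, 0.05)                          # confidences

# Shrinking every c_t can only increase the loss (less premium collected),
# so a strictly negative loss leaves room to sharpen the forecasts.
assert average_loss(b, y, p, c) <= average_loss(b, y, p, c / 2)
```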
4.1 Online Forecasting Algorithm
We aim to achieve asymptotic exactness with minimal assumptions on the stakes $b_t$ (we only assume boundedness). This is challenging for two reasons: an adversary could select the stakes to violate asymptotic exactness as much as possible (e.g. decision agents could try to profit from the forecaster's loss); and in Protocol 2 the agent's action is selected after the forecaster's predictions are revealed, so the agent has a last-move advantage.
Nevertheless, asymptotic exactness can be achieved, as shown in Theorem 1 (proof in Appendix C). In fact, we design a post-processing algorithm that modifies the predictions of a base algorithm (similar to recalibration [16, 11]). Algorithm 1 can modify any base algorithm (as long as the base algorithm outputs some prediction at every time step) to achieve asymptotic exactness, even though finite-time performance could be hurt by a poor base prediction algorithm.
Theorem 1. Suppose there is a constant $B$ such that $|b_t| \le B$ for all $t$. Then there exists an algorithm to output $(p_t, c_t)$ in Protocol 2 that is asymptotically exact for $\{b_t\}$ generated by any strategy of nature and agent. In particular, Algorithm 1 is such an algorithm.
Algorithm 2 learns two regression models (such as neural networks with a single real-valued output) $\hat{p}$ and $\hat{c}$: $\hat{p}$ is trained to predict $y_t$ by minimizing a standard loss, while $\hat{c}$ is trained to minimize the squared payoff of each bet.
Based on Algorithm 2, Algorithm 1 learns an additional “correction” parameter by invoking Algorithm 3. Intuitively, up to time $t$, if the forecaster has positive cumulative loss in Protocol 2, then the confidences have been too small in the past, and Algorithm 1 will select a larger correction to increase $c_t$; conversely, if the forecaster has negative cumulative loss, then the confidences have been too large, and Algorithm 1 will select a smaller correction to decrease $c_t$.
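The sign-based intuition above can be sketched as follows. This is a simplified caricature of the correction step, not the paper's Algorithm 3 (which uses a swap-regret-minimizing update rather than a fixed step size); the names and constants are ours:

```python
# Caricature of the correction step: raise the confidence offset when the
# forecaster has been losing money on past bets, lower it when it has been
# winning, driving the cumulative betting loss toward zero.
def update_offset(delta, cumulative_loss, step_size=0.01, max_delta=1.0):
    if cumulative_loss > 0:        # confidences too small: charge more premium
        delta += step_size
    elif cumulative_loss < 0:      # confidences too large: charge less
        delta -= step_size
    return min(max(delta, -max_delta), max_delta)

# Toy run: a persistently positive loss pushes the offset up until capped.
delta = 0.0
for _ in range(200):
    delta = update_offset(delta, cumulative_loss=1.0)
assert abs(delta - 1.0) < 1e-12
```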
4.2 Offline Forecasting
Our new definition of asymptotic soundness in Eq.(4) is related to existing notions of calibration. Asymptotic soundness depends on the set of bets $\{b_t\}$, which is in turn determined by the downstream decision tasks. In fact, we prove that we can recover existing notions of calibration [5, 11, 15, 17] or multicalibration [13] for special decision tasks. For more details see Appendix A.2.
If a forecaster satisfies an existing notion of calibration, there is some set of bet functions such that the forecaster is asymptotically sound whenever the agents choose bets from this set. The benefit is that, once deployed, the forecaster does not have to be updated (unlike the online setup, where the forecaster must continually update via Algorithm 1). The shortcoming is that we must make strong assumptions on how the agents choose bets to insure themselves.
5 Case Study on Flight Delays
In this section we study a practical application that could benefit from our proposed mechanism. Compared to other means of transport, flights are often the fastest but usually the least punctual. Different passengers may incur different losses in case of delay. For example, if a passenger needs to attend an important event on time, the loss from a delay can be very large, and the passenger might prefer an alternative mode of transportation. The airline could predict the probability of delay, and each passenger could use this probability to compute their expected loss before deciding whether to fly. However, as argued in Section 2.2, there is in general no good way to know that these probabilities are correct. Even worse, the airline may have an incentive to under-report the probability of delay to attract passengers.
Instead, the airline can use Protocol 2 to convey confidence to the passengers that the delay probability is accurate. In this case, Protocol 2 takes a simple form that can be explained to passengers as a “delay insurance”: when a passenger buys a ticket, he can choose to insure himself against delay by specifying the amount he would like to be paid if the flight is delayed. The airline provides a quote on the cost of this insurance (i.e. the premium the passenger pays if the flight is not delayed). Note that this is equivalent to Protocol 2 if the airline first predicts the probability of delay and then quotes the premium accordingly.
If a passenger buys the right insurance according to Proposition 2, her expected utility (negative loss) is fixed: she does not need to worry that the predicted delay probability might be incorrect. In addition, if the airline follows Algorithm 1, it is guaranteed not to lose money from the “delay insurance” in the long run (no matter what the passengers do), so the airline is incentivized to implement the insurance mechanism, benefiting its passengers “for free”.
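Under our reading of Protocol 2, the insurance quote follows directly from the forecast; the sketch below derives the premium for a requested payout (the names `payout` and `premium` are ours, and the mapping to Protocol 2 is our own illustration):

```python
# "Delay insurance" implied by Protocol 2 (our reading): a stake b pays the
# passenger b*(1 - p) - b*c if delayed (y = 1) and costs b*(p + c) if not
# (y = 0). Solving b*(1 - p - c) = payout gives the quoted premium.
def quote_premium(payout, p, c):
    assert 0 < p + c < 1.0
    b = payout / (1 - p - c)      # stake implementing the requested payout
    return b * (p + c)            # premium due when the flight is on time

# A $100 payout on a flight with a 10% forecasted delay chance and c = 0.02:
premium = quote_premium(100.0, 0.10, 0.02)
assert abs(premium - 100.0 * 0.12 / 0.88) < 1e-9
```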
Since the passengers' utility functions are unknown, we model three types of passengers that differ in the assumptions they make when deciding:
Naive passengers do not care about delays and assume the flight will not be delayed.
Trustful passengers assume the delay probability forecasted by the airline is correct.
Cautious passengers assume the worst (i.e. they choose actions that maximize their worst-case utility).
In this experiment we vary the proportion of cautious passengers and split the remaining passengers equally between naive and trustful. The naive and trustful passengers do not care about the risk of mis-prediction, so they do not buy the delay insurance (i.e. they always choose stake $b_t = 0$), while cautious passengers always buy the insurance that maximizes their worst-case utility.
5.1 Simulation Setup (code and Jupyter notebook tutorial available online)
We use the flight delay and cancellation dataset [7] from the year 2015, and use flight records of the single biggest airline (WN). As input features, we convert the source airport, target airport, and scheduled time into one-hot vectors, and binarize the arrival delay into 1 (delay ≥ 20 min) and 0 (delay < 20 min). We use a two-layer neural network with leaky ReLU activations for prediction.
Let $y \in \{0, 1\}$ denote whether a delay happens, and $a \in \{0, 1\}$ denote whether the passenger chooses to ride the plane. We model the passenger utility (negative loss) as
$u(a, y) = a \, (r - c_{\mathrm{ticket}} - y \cdot c_{\mathrm{delay}}) + (1 - a) \, u_0$
where $u_0$ is the utility of the alternative option (e.g. taking other transportation or cancelling the trip); for simplicity we assume that this is a single real number. $r$ is the reward of the trip, $c_{\mathrm{ticket}}$ is the cost of the ticket, and $c_{\mathrm{delay}}$ is the cost of a delayed flight. For each flight we sample 1000 potential passengers by randomly drawing the values $u_0$, $r$ and $c_{\mathrm{delay}}$ (for details see the appendix).
Based on the passenger type (naive, trustful, cautious) and the passenger parameters $u_0$, $r$ and $c_{\mathrm{delay}}$, each passenger has a maximum price they are willing to pay for the flight. For simplicity we assume the airline chooses the ticket price $c_{\mathrm{ticket}}$ as the highest price at which it can sell 300 tickets. Passengers willing to pay more than $c_{\mathrm{ticket}}$ choose $a = 1$, and the other passengers choose $a = 0$.
5.2 Delay Insurance Improves Total Utility
The simulation results are shown in Figure 1. Using the betting mechanism is strictly better for both the airline's revenue (i.e. ticket price × number of tickets) and the total utility (airline revenue + passenger utility). This is because the cautious passengers always make decisions to maximize their worst-case utility; with the betting mechanism, their worst-case utility becomes closer to their actual true utility, so their decisions ($a = 1$ or $a = 0$) better maximize their true utility. The airline also benefits because it can charge a higher ticket price due to increased demand (more cautious passengers choose $a = 1$).
6 Additional Experiments
We further verify our theory with simulations on a diverse benchmark of decision tasks. We also perform an ablation study to show that Algorithm 3 is necessary: several simpler alternatives often fail to achieve asymptotic exactness and have worse empirical performance.
Dataset and Decision Tasks
We compare several forecaster algorithms that differ in whether they use Algorithm 3 to adjust the correction parameter. In particular, swap regret refers to Algorithm 3; none does not apply any correction; standard regret minimizes the standard regret rather than the swap regret; naive best response chooses the correction that would have been optimal had it been counterfactually applied to the past iterations.
The results are plotted in Figures 2 and 3 in the main paper and Figures 5 and 6 in Appendix B.2. There are three main observations: 1) even when a forecaster is calibrated, the expected loss under the forecasted probability is almost always incorrect for individual decision makers; 2) Algorithm 1 has good empirical performance; in particular, the guarantees of Theorem 1 are achieved within a reasonable number of time steps, and the interval size is usually small; 3) seemingly reasonable alternatives to Algorithm 1 often empirically fail to be asymptotically exact.
In this paper, we propose an alternative solution to the impossibility of individual calibration, based on an insurance mechanism between the forecaster and the decision makers. Each decision maker can make decisions as if the forecasted probability were correct, while the forecaster is guaranteed not to lose in the long run. Future work can explore other issues that arise from this protocol, such as honesty, fairness, and social/moral/legal implications.
This research was supported by AFOSR (FA9550-19-1-0024), NSF (#1651565, #1522054, #1733686), JP Morgan, ONR, FLI, SDSI, Amazon AWS, and SAIL.
-  (2019) The limits of distribution-free conditional predictive inference. arXiv preprint arXiv:1903.04684. Cited by: §1, §2.2.
-  (2014) Preemptive genotyping for personalized medicine: design of the right drug, right dose, right time—using genomic data to individualize treatment protocol. In Mayo Clinic Proceedings, pp. 25–33. Cited by: §1.
-  (2007) From external to internal regret. Journal of Machine Learning Research 8 (Jun), pp. 1307–1324. Cited by: §4.1.
-  (2006) Prediction, learning, and games. Cambridge university press. Cited by: Appendix C, §1.
-  (1985) Calibration-based empirical probability. The Annals of Statistics, pp. 1251–1274. Cited by: §A.2.1, §4.2.
-  (1931) On the subjective meaning of probability. Fundamenta mathematicae 17 (1), pp. 298–329. Cited by: §1, §2.3.
-  (2017) 2015 flight delays and cancellations. Note: https://www.kaggle.com/usdot/flight-delays Cited by: §5.1.
-  (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. Cited by: §6.
-  (2012) Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pp. 214–226. Cited by: §7.
-  (2003) When is honesty the best policy? the effect of stated company intent on consumer skepticism. Journal of consumer psychology 13 (3), pp. 349–356. Cited by: §7.
-  (2017) On calibration of modern neural networks. arXiv preprint arXiv:1706.04599. Cited by: §A.2.1, §1, §4.1, §4.2, §6.
-  (2017) Reasoning about uncertainty. MIT press. Cited by: §2.3.
-  (2017) Calibration for the (computationally-identifiable) masses. arXiv preprint arXiv:1711.08513. Cited by: §4.2.
-  (1996) Probability theory: the logic of science. Washington University St. Louis, MO. Cited by: §1.
-  (2016) Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807. Cited by: §4.2.
-  (2017) Estimating uncertainty online against an adversary. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §1, §4.1.
-  (2019) Verified uncertainty calibration. In Advances in Neural Information Processing Systems, pp. 3792–3803. Cited by: §4.2.
-  (2015) Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 2015, pp. 2901. Cited by: §6.
-  (2009) An agenda for personalized medicine. Nature 461 (7265), pp. 724–726. Cited by: §1.
-  (2012) Operational implementation of prospective genotyping for personalized medicine: the design of the vanderbilt predict project. Clinical Pharmacology & Therapeutics 92 (1), pp. 87–95. Cited by: §1.
-  (2005) Algorithmic learning in a random world. Springer Science & Business Media. Cited by: §1, §2.2.
-  (2020) Individual calibration with randomized forecasting. arXiv preprint arXiv:2006.10288. Cited by: §1, §2.2.
-  (2003) Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th international conference on machine learning (icml-03), pp. 928–936. Cited by: §4.1.
Appendix A Additional Results
a.1 Multiclass Prediction
For multiclass prediction, we suppose that $y$ can take $K$ distinct values. We denote by $\Delta^K$ the $K$-dimensional probability simplex. For notational convenience we represent $y$ as a one-hot vector in $\{0, 1\}^K$, so $y \in \{e_1, \dots, e_K\}$.
Protocol 3: Decision Making with Bets, Multiclass
Nature reveals and chooses without revealing it
Forecaster reveals and
Agent has loss and chooses action and
Nature samples $y_t$ and reveals it
Agent total loss is , forecaster loss is
As before we require the regularity condition that the perturbed forecasts remain entry-wise between 0 and 1 (even though they no longer lie on the simplex, hence are not probabilities).
Similar to Section 3 we can denote the agent’s maximum / minimum expected loss under the forecasted probability as
and true expected loss as . As before denote
Proof of Proposition 3.
As a notation shorthand we denote with the vector , such that . We first show a closed form solution for which can be written as
Similarly we have
Denote the that achieves the infimum as . Comparing with we have
a.2 Offline Calibration
For this section we restrict to the i.i.d. setup, where we assume there are random variables with some distribution such that at each time step,
We also assume that the forecaster ’s choice and the agent’s choice in Protocol 2 are computed by functions of
The following definition is the specialization of asymptotic soundness in Eq.(4) to the i.i.d. setup
We say that the functions are sound with respect to some set of functions if
If we say is -calibrated.
a.2.1 Examples and Special Cases
Standard calibration is defined as follows: for any value $p'$, among the inputs where the forecast equals $p'$, the outcome is indeed $y = 1$ with probability $p'$. Formally, this can be written as $\Pr[Y = 1 \mid p(X) = p'] = p'$ for all $p'$.
Deviation from this ideal situation is measured by the maximum calibration error (MCE).
Note that the MCE may be ill-defined if there is an interval $I \subseteq [0, 1]$ such that $p(X) \in I$ with zero probability. We avoid this technical subtlety by assuming that this does not happen, i.e. the distribution of $p(X)$ is supported on the entire set $[0, 1]$.
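A binned estimate of the MCE can be sketched as follows; the equal-width binning is a common convention, not prescribed by the text, and the function name is ours:

```python
import numpy as np

# Binned maximum calibration error (MCE) estimate: within each forecast
# bin, compare the mean forecast to the empirical frequency of y = 1,
# then take the worst bin. Illustrative sketch.
def max_calibration_error(p, y, n_bins=10):
    p, y = np.asarray(p, float), np.asarray(y, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    mce = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p >= lo) & (p <= hi) if hi == 1.0 else (p >= lo) & (p < hi)
        if mask.any():
            mce = max(mce, abs(p[mask].mean() - y[mask].mean()))
    return mce

# A perfectly calibrated forecaster should have MCE near zero.
rng = np.random.default_rng(0)
p = rng.uniform(size=20000)
y = (rng.uniform(size=20000) < p).astype(float)
assert max_calibration_error(p, y) < 0.08
```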
When the set of bet functions contains all functions that depend only on the probability forecast $p(x)$ (but not on $x$ itself), we obtain the standard definition of calibration [5, 11], as shown by the following proposition.
The forecaster function is sound with respect to if and only if the MCE error of is less than .
See Appendix D ∎
Multicalibration requires standard calibration for all subsets in some collection of sets $\mathcal{C}$; denote by $\mathbb{I}_S$ the indicator function of $S \in \mathcal{C}$ (i.e. $\mathbb{I}_S(x) = 1$ iff $x \in S$). Suppose the set of bet functions consists of all functions of the form $x \mapsto \mathbb{I}_S(x)\, g(p(x))$ where $g$ is an arbitrary function. A forecaster that is sound with respect to this set is also multicalibrated [13].
a.2.2 Soundness and Calibration
This section argues that if the forecaster achieves existing definitions of calibration, then it is sound under certain assumptions on the decision-making agents.
We assume that every decision agent selects the stake according to Proposition 2 and that the loss does not depend on $t$ (i.e. every decision maker has the same loss); in addition, we assume that, given the same prediction, the decision maker always selects the same action (for example, this holds if every decision maker has the same loss and always chooses the action minimizing the expected loss). Under these assumptions, a forecaster that satisfies standard calibration is also asymptotically sound.
To see why: under these assumptions the action, and hence the stake chosen via Proposition 2, depends only on the forecast and not on $x$ or $t$. This satisfies the condition of Appendix A.2.1, so a forecaster that satisfies standard calibration is also sound.
Appendix B Experiment Details and Additional Results
b.1 Airline Delay
In Protocol 2, the confidence $c_t$ must be non-negative for its interpretation as a probability interval. However, if we only consider the flight delay insurance interpretation (the airline pays the passenger if the flight is delayed, and the passenger pays the airline if it is not), these payments are meaningful for both positive and negative $c_t$; the passenger utility (with insurance) is likewise meaningful for both positive and negative $c_t$. We find that allowing negative $c_t$ improves the stability of the algorithm.
We sample as and sample from . We assume the cost of delay can be more varied, so we sample it from the following process: and . This gives us a cost of delay between , but large values are less likely.
b.2 Additional Experiments
For each data point we associate an extra feature used to define the decision loss. For MNIST this is the digit label, and for UCI Adult this is the age (binned by quantile into 10 bins). We simulate three kinds of decision losses; for each type of decision loss we randomly sample a few instantiations.
1. One-sided: we assume that and each decision loss is large if and small if . For different values of there are different stakes (i.e. how much does the loss when differ from ).
2. Different Stakes: Each value of the decision loss is a draw from , which is used to capture the feature that certain groups of people have larger stakes
3. Random. Each value of the decision loss is a draw from but clipped to be within .
Forecasted Loss vs. True Loss
In Figure 6 we plot the relationship between the expected loss under the forecasted probability and the expected loss under the true probability (we can compute the latter for the MNIST dataset because the true probability is known, as explained in Section 6). Even if we apply histogram binning recalibration (also explained in Section 6), the individual probabilities are almost always incorrect.
Average Interval Size
In Figure 3 we plot the interval size $\overline{L} - \underline{L}$. A small interval satisfies desideratum 2 in Section 3 and makes the guarantee in Proposition 2 useful for decision makers. We observe that most interval sizes are small, and larger intervals are exponentially unlikely.