Right Decisions from Wrong Predictions: A Mechanism Design Alternative to Individual Calibration

by Shengjia Zhao, et al.

Decision makers often need to rely on imperfect probabilistic forecasts. While average performance metrics are typically available, it is difficult to assess the quality of individual forecasts and the corresponding utilities. To convey confidence about individual predictions to decision-makers, we propose a compensation mechanism ensuring that the forecasted utility matches the actually accrued utility. While a naive scheme to compensate decision-makers for prediction errors can be exploited and might not be sustainable in the long run, we propose a mechanism based on fair bets and online learning that provably cannot be exploited. We demonstrate an application showing how passengers could confidently optimize individual travel plans based on flight delay probabilities estimated by an airline.




1 Introduction

People and algorithms constantly rely on probabilistic forecasts (about medical treatments, weather, transportation times, etc.) and make potentially high-stakes decisions based on them. In most cases, forecasts are not perfect, e.g., the forecasted chance that it will rain tomorrow does not match the true probability exactly. While average performance statistics might be available (accuracy, calibration, etc.), it is generally impossible to tell whether any individual prediction is reliable (individually calibrated), e.g., about the medical condition of a specific patient or the delay of a particular flight [21, 1, 22]. Intuitively, this is because multiple identical datapoints are needed to confidently estimate a probability from empirical frequencies, but identical datapoints are rare in real-world applications (e.g. two patients are always different). Given these limitations, we study alternative mechanisms to convey confidence about individual predictions to decision-makers.

We consider settings where a single forecaster provides predictions to many decision makers, each facing a potentially different decision making problem. For example, a personalized medicine service could predict whether a product is effective for thousands of individual patients [19, 20, 2]. If the prediction is accurate for 70% of patients, it could be accurate for Alice but not Bob, or vice-versa. Therefore, Alice might be hesitant to make decisions based on the 70% average accuracy. In this setting, we propose an insurance-like mechanism that 1) enables each decision maker to confidently make decisions as if the advertised probabilities were individually correct, and 2) is implementable by the forecaster with provably vanishing costs in the long run.

To achieve this, we turn to the classic idea [6, 14] that a probabilistic belief is equivalent to a willingness to take bets. We use the previous example to illustrate that if the forecaster is willing to take bets, a decision maker can bet with the forecaster as an “insurance” against mis-prediction. Suppose Alice is trying to decide whether or not to use a product. If she uses the product, she gains $10 if the product is effective and loses $2 otherwise. The personalized medicine service (forecaster) predicts that the product is effective with 50% chance for Alice. Under this probability Alice expects to gain $4 if she decides to use the product, but she is worried the probability is incorrect. Alice proposes a bet: Alice pays the forecaster $6 if the product is effective, and the forecaster pays Alice $6 otherwise. The forecaster should accept the bet because under its own forecasted probability the bet is fair (i.e., the expectation is zero if the forecasted probabilities are true for Alice). Alice gets the guarantee that if she decides to use the product, effective or not, she gains $4 — equal to her expected utility under the forecasted (and possibly incorrect) probability. In general, we show that Alice has a way of choosing bets for any utility function and forecasted probability, such that her true gain equals her expected gain under the forecasted probability.

From the forecaster’s perspective, if the true probability that Alice’s treatment is effective is actually 10%, then the forecaster will lose $4.8 from this bet in expectation. However, in our setup, the forecaster makes probabilistic forecasts for many different decision makers, and each decision maker selects some bet based on their utility function and forecasted probability. The forecaster might gain or lose on individual bets, but it only needs to not lose on the entire set of bets on average for the approach to be sustainable. Intuitively, each decision maker’s difference between forecasted gain and true gain can be averaged across the pool of decision makers. The difficult requirement that each difference should be negative has been reduced to an easier requirement that the average difference should be negative.
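The arithmetic of the Alice example can be checked with a short script. This is only an illustrative sketch: the payoffs, the 50% forecast and the 10% true probability come from the example above, while the variable names and the rule used to size the bet (splitting the payoff gap so that the bet is fair under the forecast) are assumptions made for this illustration.

```python
# Alice's payoffs in dollars if she uses the product (numbers from the example).
GAIN_IF_EFFECTIVE = 10.0
GAIN_IF_INEFFECTIVE = -2.0
FORECAST = 0.5  # forecasted probability that the product is effective


def expected_gain(p_effective: float) -> float:
    """Alice's expected gain without a bet when the product works with p_effective."""
    return p_effective * GAIN_IF_EFFECTIVE + (1 - p_effective) * GAIN_IF_INEFFECTIVE


# Assumed sizing rule: split the payoff gap so the bet is fair under FORECAST.
payoff_gap = GAIN_IF_EFFECTIVE - GAIN_IF_INEFFECTIVE    # 12
alice_pays_if_effective = payoff_gap * (1 - FORECAST)    # 6
alice_receives_if_ineffective = payoff_gap * FORECAST    # 6

# Alice's net gain in each outcome once the bet is settled.
net_if_effective = GAIN_IF_EFFECTIVE - alice_pays_if_effective
net_if_ineffective = GAIN_IF_INEFFECTIVE + alice_receives_if_ineffective


def forecaster_expected_gain(p_effective: float) -> float:
    """Forecaster's expected gain from the bet under a given true probability."""
    return (p_effective * alice_pays_if_effective
            - (1 - p_effective) * alice_receives_if_ineffective)


if __name__ == "__main__":
    print(expected_gain(FORECAST), net_if_effective, net_if_ineffective)
    print(forecaster_expected_gain(0.5), forecaster_expected_gain(0.1))
```

Alice's net gain is the same in both outcomes and equals her forecasted expected gain, while the forecaster breaks even in expectation only if its own forecast is correct.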

However, this protocol leaves the forecaster vulnerable to exploitation. For example, if Alice already knows that the product will be ineffective, she could still bet with the forecaster for the malicious purpose of gaining $6. Surprisingly, we show that in the online setup [4], the forecaster has an algorithm to adapt its forecasts and guarantee vanishing loss in the long run, even in the presence of malicious decision makers. This is achieved by first using any existing online prediction algorithm to predict the probabilities, then applying a post-processing algorithm to fine-tune these probabilities based on past gains/losses (similar to the idea of recalibration [16, 11]).

As a concrete application of our approach, we simulate the interaction between an airline and passengers with real flight delay data. Risk-averse passengers might want to avoid a flight if there is a possibility of delay and their loss in case of delay is high. We show that if an airline offers to accept bets based on the predicted probability of delay, it can help risk-averse passengers make better decisions, and increase both the airline’s revenue (due to increased demand for the flight) and the total utility (airline revenue plus passenger utility).

We further verify our theory with large scale simulations on several datasets and a diverse benchmark of decision tasks. We show that forecasters based on our post-processing algorithm consistently achieve close to zero betting loss (on average) within a small number of time steps. On the other hand, several seemingly reasonable alternative algorithms not only lack theoretical guarantees, but often suffer from positive average betting loss in practice.

2 Background

2.1 Decision Making with Forecasts

This section defines the basic setup of the paper. We represent the decision making process as a multi-player game between nature, a forecaster, and a set of (decision making) agents. At every step $t$, nature reveals an input observation $x_t$ to the forecaster (e.g. patient medical records) and selects the hidden probability $\mu^*_t \in [0, 1]$ that $y_t = 1$ (e.g. the probability that the treatment is successful). We only consider binary variables ($y_t \in \{0, 1\}$) and defer the general case to Appendix B.

The forecaster chooses a forecasted probability $\mu_t \in [0, 1]$ to approximate $\mu^*_t$. We also allow the forecaster to represent its lack of knowledge about $\mu^*_t$, i.e. the forecaster outputs a confidence $c_t$ where the hope is that $\mu^*_t \in [\mu_t - c_t, \mu_t + c_t]$.

At each time step, one or more agents can use the forecast $\mu_t$ and $c_t$ to make decisions, i.e. to select an action $a_t$. However, for simplicity we assume that different agents make decisions at different time steps, so at each time step there is only a single agent, and we can uniquely index the agent by the time step $t$. The agent knows its own loss (negative utility) function $l_t$, which assigns a loss $l_t(a, y)$ to each action $a$ and outcome $y$ (the forecaster does not have to know this), where $M$ is the maximum loss involved in each decision. This protocol is formalized below.

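Protocol 1: Decision Making with Forecasts

  1. Nature reveals the observation $x_t$ to the forecaster and chooses the hidden probability $\mu^*_t$ without revealing it.

  2. Forecaster reveals the forecasted probability $\mu_t$ and confidence $c_t$, where $(\mu_t - c_t, \mu_t + c_t) \subset (0, 1)$.

  3. Agent $t$ has loss function $l_t$ and reveals an action $a_t$ selected according to $\mu_t$, $c_t$, and $l_t$.

  4. Nature samples the outcome $y_t \in \{0, 1\}$ with $\Pr[y_t = 1] = \mu^*_t$ and reveals $y_t$; agent $t$ incurs loss $l_t(a_t, y_t)$.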

We make no assumptions on nature, forecaster, or the agents. They can choose any strategy to generate their actions, as long as they do not look into the future (i.e. their action only depends on variables that have already been revealed). In particular, we make no i.i.d. assumptions on how nature selects the observation $x_t$ and the hidden probability $\mu^*_t$; for example, nature could even select them adversarially to maximize the agent’s loss.

2.2 Individual Coverage

Ideally in Protocol 1 the forecaster’s prediction should satisfy $\mu^*_t \in [\mu_t - c_t, \mu_t + c_t]$ for each individual $t$, where $\mu^*_t$ is the true probability, $\mu_t$ the forecasted probability, and $c_t$ the confidence (this is often called individual coverage or individual calibration in the literature). However, many existing results show that learning individually calibrated probabilities from past data is often impossible [21, 1, 22] unless the forecast is trivial (i.e. the interval $[\mu_t - c_t, \mu_t + c_t]$ is too wide to be informative).

One intuitive reason for this impossibility result is that in many practical scenarios, for each observation $x_t$ we only observe a single sample $y_t$ drawn with probability $\mu^*_t$. The forecaster cannot infer $\mu^*_t$ from a single sample without relying on unverifiable assumptions.

2.3 Probability as Willingness to Bet

A major justification for probability theory has been that probability can represent willingness to bet [6, 12]. For example, if you truly believe that a coin is fair, then it would be inconsistent if you were not willing to win $1 for heads and lose $1 for tails (assuming you only care about average gain rather than risk). More specifically, a forecaster that holds a probabilistic belief should be willing to accept any bet where it gains a non-negative amount in expectation.

For binary variables, we consider the case where a forecaster believes that a binary event $y \in \{0, 1\}$ happens with some probability $\mu^*$ but does not know the exact value of $\mu^*$. The forecaster only believes that $\mu^* \in [\mu - c, \mu + c]$. The forecaster should be willing to accept any bet with non-negative expected return under every $\tilde{\mu} \in [\mu - c, \mu + c]$. For example, assume the forecaster believes that a coin comes up heads with at least 40% chance and at most 60% chance. The forecaster should be willing to win $6 for heads, and lose $4 for tails; similarly the forecaster should be willing to lose $4 for heads, and win $6 for tails.

More generally, according to Lemma 1 (proved in Appendix D), a forecaster believes that the probability of success of the binary event satisfies $\mu^* \in [\mu - c, \mu + c]$ if and only if she is willing to accept bets where she loses $b (y - \mu) - |b| c$ for a stake $b \in \mathbb{R}$.

Lemma 1.

Let $\mu, c \in (0, 1)$ such that $(\mu - c, \mu + c) \subset (0, 1)$, then a function $f: \{0, 1\} \to \mathbb{R}$ satisfies $\mathbb{E}_{y \sim \tilde{\mu}}[f(y)] \le 0$ for all $\tilde{\mu} \in [\mu - c, \mu + c]$, if and only if for some $b \in \mathbb{R}$ and all $y \in \{0, 1\}$, $f(y) \le b (y - \mu) - |b| c$. Here $y \sim \tilde{\mu}$ means that $y = 1$ with probability $\tilde{\mu}$.

In words, a forecaster is willing to lose $f(y)$ if $f$ has non-positive expectation under every probability the forecaster considers possible. However, every such function is smaller (i.e. the forecaster loses less) than $b (y - \mu) - |b| c$ for some $b$. Therefore, we only have to consider whether a forecaster is willing to accept bets of the form $b (y - \mu) - |b| c$.
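The coin example can be reproduced numerically from the bet form discussed around Lemma 1. The sketch below assumes that a bet with stake `b` costs the forecaster `b * (y - mu) - abs(b) * c` on outcome `y`; the function names are illustrative, and the stakes of plus and minus 10 are chosen only because they reproduce the $6 / $4 payoffs in the text.

```python
def forecaster_loss(b: float, y: int, mu: float, c: float) -> float:
    """Amount the forecaster loses on outcome y for stake b (assumed bet form)."""
    return b * (y - mu) - abs(b) * c


def expected_forecaster_loss(b: float, p: float, mu: float, c: float) -> float:
    """Expected forecaster loss when the event happens with probability p."""
    return p * forecaster_loss(b, 1, mu, c) + (1 - p) * forecaster_loss(b, 0, mu, c)


# Coin believed to land heads (y = 1) with probability between 40% and 60%.
mu, c = 0.5, 0.1

# Stake -10: the forecaster wins $6 on heads and loses $4 on tails.
bet_on_heads = (forecaster_loss(-10, 1, mu, c), forecaster_loss(-10, 0, mu, c))
# Stake +10: the forecaster loses $4 on heads and wins $6 on tails.
bet_on_tails = (forecaster_loss(10, 1, mu, c), forecaster_loss(10, 0, mu, c))

# Neither bet loses in expectation for any probability in [0.4, 0.6].
grid = [0.4 + 0.02 * k for k in range(11)]
worst_case = max(expected_forecaster_loss(b, p, mu, c)
                 for b in (-10, 10) for p in grid)

if __name__ == "__main__":
    print(bet_on_heads, bet_on_tails, worst_case)
```

The expected loss is linear in the probability, so checking the two endpoints of the interval would be enough; the grid is only there to make the check easy to read.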

3 Decisions with Unreliable Forecasts

In Protocol 1, agents could make decisions based on the forecasted probability $\mu_t$ and the agent’s loss $l_t$. For example, the agent could choose

$$a_t := \arg\min_{a} \; \mathbb{E}_{y \sim \mu_t}[l_t(a, y)]$$

to minimize the expected loss under the forecasted probability.
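As a minimal sketch of this decision rule, the snippet below picks the action with the smallest expected loss under a forecasted probability. The loss table is hypothetical (it reuses the payoffs of the product example with signs flipped to express losses), and the names `expected_loss` and `best_action` are illustrative only.

```python
def expected_loss(loss, action, p):
    """Expected loss of `action` when the outcome is 1 with probability p."""
    return p * loss(action, 1) + (1 - p) * loss(action, 0)


def best_action(loss, actions, forecast):
    """Action that minimizes the expected loss under the forecasted probability."""
    return min(actions, key=lambda a: expected_loss(loss, a, forecast))


# Hypothetical loss table: using the product loses -10 (a gain) if it is
# effective (y = 1) and 2 otherwise; skipping it always has zero loss.
LOSS_TABLE = {("use", 1): -10.0, ("use", 0): 2.0,
              ("skip", 1): 0.0, ("skip", 0): 0.0}


def product_loss(action, outcome):
    return LOSS_TABLE[(action, outcome)]


if __name__ == "__main__":
    for forecast in (0.5, 0.1):
        print(forecast, best_action(product_loss, ["use", "skip"], forecast))
```

With a 50% forecast the agent uses the product, while with a 10% forecast skipping becomes the better choice, which is why an incorrect forecast can lead to the wrong decision.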

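However, how can the agent know that this decision has low expected loss under the true probability $\mu^*_t$? This can be achieved with two desiderata, which we formalize below.

We denote the agent’s maximum / average / minimum expected loss under the forecasted probability as

$$L^{\max}_t := \max_{\tilde{\mu} \in [\mu_t - c_t, \mu_t + c_t]} \mathbb{E}_{y \sim \tilde{\mu}}[l_t(a_t, y)], \qquad L^{\mathrm{avg}}_t := \mathbb{E}_{y \sim \mu_t}[l_t(a_t, y)], \qquad L^{\min}_t := \min_{\tilde{\mu} \in [\mu_t - c_t, \mu_t + c_t]} \mathbb{E}_{y \sim \tilde{\mu}}[l_t(a_t, y)]$$

and the true expected loss as $L^*_t := \mathbb{E}_{y \sim \mu^*_t}[l_t(a_t, y)]$. If the agent knows that

Desideratum 1: $L^*_t \in [L^{\min}_t, L^{\max}_t]$
Desideratum 2: The interval size $c_t$ is close to $0$.

then the agent can infer that the true expected loss $L^*_t$ is not too far off from the forecasted expected loss $L^{\mathrm{avg}}_t$. This is because if $c_t$ is small then $L^{\min}_t$ will be close to $L^{\max}_t$. Both $L^*_t$ and $L^{\mathrm{avg}}_t$ will be sandwiched in the small interval $[L^{\min}_t, L^{\max}_t]$.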

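However, we show that desiderata 1 and 2 often cannot be achieved simultaneously. To guarantee that the true expected loss $L^*_t$ lies in the forecasted range $[L^{\min}_t, L^{\max}_t]$ (desideratum 1), the forecaster in general must output individually correct probabilities (i.e. $\mu^*_t \in [\mu_t - c_t, \mu_t + c_t]$), as shown by the following proposition (proof in Appendix D).

Proposition 1.

For any $\mu^*, \mu, c \in (0, 1)$ where $(\mu - c, \mu + c) \subset (0, 1)$:

1. If $\mu^* \in [\mu - c, \mu + c]$ then for every loss function $l$ we have $L^* \in [L^{\min}, L^{\max}]$.

2. If $\mu^* \notin [\mu - c, \mu + c]$ then there exists a loss function $l$ such that $L^* \notin [L^{\min}, L^{\max}]$.

In words, unless $\mu^*_t \in [\mu_t - c_t, \mu_t + c_t]$ we cannot guarantee that $L^*_t \in [L^{\min}_t, L^{\max}_t]$ without assuming that the agent’s loss function is special (e.g. it is a constant function). However, in Section 2.2 we argued that it is usually impossible to achieve $\mu^*_t \in [\mu_t - c_t, \mu_t + c_t]$ unless $c_t$ is very large (i.e. the forecast is close to trivial). If $c_t$ is too large, the interval $[L^{\min}_t, L^{\max}_t]$ will be large, and the guarantee that $L^*_t \in [L^{\min}_t, L^{\max}_t]$ would be practically useless even if it were true. This means the forecaster cannot convey confidence in the individual predictions it makes, and as a result the agent cannot be very confident about the expected loss it will incur.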

3.1 Insuring against unreliable forecasts

Since it is difficult to satisfy desiderata 1 and 2 simultaneously, we consider relaxing desideratum 1. In particular, we study what guarantees are possible for each individual decision maker even when the true probability falls outside the forecasted interval, i.e., the prediction is wrong.

We consider the setup where each agent can receive some side payment (a form of “insurance” which could depend on the outcome $y_t$, and could be positive or negative) from the forecaster, and we would like to guarantee

Desideratum 1’

In other words, we would like the expected loss under the true distribution to be predictable once we incorporate the side payment.

Note that desideratum 1’ can be trivially satisfied if the forecaster is willing to pay any side payment to the decision agent. For example, an agent can choose to satisfy desideratum 1’. However, if the forecaster offers any side payment, it could be subject to exploitation. For example, decision agents could request the forecaster to pay $1 under any outcome . Such a mechanism cannot be sustainable for the forecaster.

3.2 Insuring with fair bets

Even though the forecaster cannot offer arbitrary payments to the decision agent, we show that the forecaster can offer a sufficiently large set of payments, such that [i] each decision agent can select a payment to satisfy Desideratum 1’ and [ii] the forecaster has an algorithm to guarantee vanishing loss in the long run, even when the decision agents try to exploit the forecaster.

In fact, the “fair bets” in Section 2.3 satisfy our requirement. Specifically, the forecaster can offer the set

$$\mathcal{B}_t := \{\, y \mapsto b (y - \mu_t) - |b| c_t \;:\; b \in [-M, M] \,\}$$

as available side payment options. The constant $M$ caps the maximum payment each decision agent can request (in our setup the loss $l_t$ is also upper bounded by $M$). This set of payments satisfies both [i] (which we show in this section) and [ii] (which we show in the next section).

Before we proceed to show [i] and [ii], for convenience, we formally write down the new protocol. Compared to Protocol 1, the decision agent selects some “stake” $b_t \in [-M, M]$, and receives the side payment $b_t (y_t - \mu_t) - |b_t| c_t$ from the forecaster.

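Protocol 2: Decision Making with Bets

  1. Nature reveals the observation $x_t$ and chooses the hidden probability $\mu^*_t$ without revealing it.

  2. Forecaster reveals the forecasted probability $\mu_t$ and confidence $c_t$, where $(\mu_t - c_t, \mu_t + c_t) \subset (0, 1)$.

  3. Agent $t$ has loss function $l_t$ and reveals an action $a_t$ and a stake $b_t$ selected according to $\mu_t$, $c_t$, and $l_t$.

  4. Nature samples the outcome $y_t \in \{0, 1\}$ with $\Pr[y_t = 1] = \mu^*_t$ and reveals $y_t$.

  5. Agent incurs loss $l_t(a_t, y_t) - b_t (y_t - \mu_t) + |b_t| c_t$; forecaster incurs loss $b_t (y_t - \mu_t) - |b_t| c_t$.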

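Denote the agent’s true expected loss with side payment as (i.e. the LHS in Desideratum 1’)

$$L^{\mathrm{pay}}_t := \mathbb{E}_{y \sim \mu^*_t}\left[ l_t(a_t, y) - b_t (y - \mu_t) + |b_t| c_t \right]$$

then we have the following guarantee for any choice of $\mu_t$, $c_t$ and $\mu^*_t$ (for the more general version of the proposition in the multi-class setup, see Appendix A.1).

Proposition 2.

If the stake $b_t = l_t(a_t, 1) - l_t(a_t, 0)$ then $L^{\mathrm{pay}}_t \in [L^{\min}_t, L^{\max}_t]$, where $L^{\min}_t$ and $L^{\max}_t$ are the minimum and maximum expected loss under the forecasted interval.

In words, the agent has a choice of stake $b_t$ that only depends on variables known to the agent ($a_t$ and $l_t$) and does not depend on variables unknown to the agent ($\mu^*_t$, $y_t$). If the agent chooses this $b_t$, she can be certain that desideratum 1’ is satisfied, regardless of what the forecaster or nature does (they can choose any $\mu_t$, $c_t$, $\mu^*_t$).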
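The effect of this stake can be checked numerically. The sketch below assumes that the agent receives the side payment `b * (y - mu) - abs(b) * c` and picks the stake as the difference between her losses in the two outcomes; the function names and the example numbers (loss of -10 on success, 2 on failure, forecast 0.5 with confidence 0.05) are illustrative assumptions.

```python
def loss_with_payment(loss_1, loss_0, y, mu, c):
    """Agent's realized loss after the side payment, with stake b = loss_1 - loss_0.

    loss_1 / loss_0 are the agent's losses for her chosen action when y is 1 / 0.
    """
    b = loss_1 - loss_0
    raw_loss = loss_1 if y == 1 else loss_0
    payment = b * (y - mu) - abs(b) * c  # paid by the forecaster to the agent
    return raw_loss - payment


def forecasted_loss_range(loss_1, loss_0, mu, c):
    """(min, avg, max) expected loss over probabilities in [mu - c, mu + c]."""
    avg = mu * loss_1 + (1 - mu) * loss_0
    half_width = abs(loss_1 - loss_0) * c  # expected loss is linear in the probability
    return avg - half_width, avg, avg + half_width


if __name__ == "__main__":
    loss_1, loss_0, mu, c = -10.0, 2.0, 0.5, 0.05
    print(loss_with_payment(loss_1, loss_0, 1, mu, c),
          loss_with_payment(loss_1, loss_0, 0, mu, c))
    print(forecasted_loss_range(loss_1, loss_0, mu, c))
```

Under these assumptions the realized loss is identical for both outcomes, so it does not depend on the true probability at all, and it sits at the upper end of the forecasted range, which shrinks towards the forecasted expected loss as the confidence parameter goes to zero.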

This mechanism allows the agent to make decisions as if the forecasted probability is correct, i.e. as if $\mu^*_t = \mu_t$. This is because Proposition 2 is true for any choice of action $a_t$ (as long as the agent chooses the stake $b_t$ according to Proposition 2 after selecting $a_t$). Intuitively, for any action $a_t$ the agent selects, she can guarantee to achieve a total loss close to her expected loss under the forecasted probability (assuming $c_t$ is small). This is the same guarantee she would get if $\mu^*_t = \mu_t$.

In addition, if [ii] is satisfied (i.e. the forecaster has vanishing loss), the forecaster also doesn’t lose anything, so should have no incentive to avoid offering these payments. We discuss this in the next section.

Algorithm 1 Post-Processing for Exactness

Invoke Algorithm 2 and 3 with
for do
       Receive and from Algorithm 2
       Receive from Algorithm 3
       Output ,
       Input and
       Set , ,
       Send to Algorithm 3

Algorithm 2 Online Prediction

Choose any initial value for
for do
       Input and output ,
       Input and

4 Probability Forecaster Strategy

In this section we study the forecaster’s strategy. As motivated in the previous section, the goal of the forecaster (in Protocol 2) is to:

1) have non-positive cumulative loss when $T$ is large, so that the side payments are sustainable
2) output the smallest $c_t$ compatible with 1), so that forecasts are as sharp as possible

Specifically, the forecaster’s average cumulative loss (up to time $T$) in Protocol 2 is

$$\frac{1}{T} \sum_{t=1}^{T} \left( b_t (y_t - \mu_t) - |b_t| c_t \right) \qquad (3)$$

Whether Eq.(3) is non-positive or not depends on the actions of all the players: forecaster ($\mu_t$, $c_t$), nature ($y_t$) and agent ($b_t$). Our focus is on the forecaster, so we say that a sequence of forecasts $(\mu_t, c_t)$, $t = 1, 2, \dots$ is asymptotically sound relative to $(y_t, b_t)$, $t = 1, 2, \dots$ if the forecaster loss in Protocol 2 is non-positive, i.e.

$$\limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \left( b_t (y_t - \mu_t) - |b_t| c_t \right) \le 0 \qquad (4)$$

In subsequent development we will use a stronger definition than Eq.(4). We say that a sequence of forecasts $(\mu_t, c_t)$ is asymptotically exact relative to $(y_t, b_t)$ if the forecaster loss in Protocol 2 is exactly zero, i.e.

$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \left( b_t (y_t - \mu_t) - |b_t| c_t \right) = 0 \qquad (5)$$

Intuitively, asymptotic soundness requires that the forecaster should not lose in the long run; asymptotic exactness requires that the forecaster should neither lose nor win in the long run, which is a stronger requirement. (In the mechanism design literature, Eq.(4) and Eq.(5) are typically referred to as weak and strong budget balance. Here we use the terminology of the probability forecasting literature.)

The reason we focus on asymptotic exactness is that the forecaster should output the smallest possible $c_t$ to achieve sharp forecasts. Observe that the left hand side of Eq.(4) increases if $c_t$ decreases. Therefore, whenever the forecaster is asymptotically sound but not asymptotically exact (i.e. the left hand side in Eq.(4) is strictly negative), there is some room to decrease $c_t$ without violating asymptotic soundness.
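To make the trade-off between soundness and sharpness concrete, the sketch below computes the forecaster's average loss over many rounds, assuming each round costs the forecaster `b * (y - mu) - abs(b) * c`. The simulated scenario (a forecast of 0.5 when the true probability is 0.1, and an agent who always stakes -1 to bet against the event) and all names are illustrative assumptions, not the paper's experiment.

```python
import random


def average_forecaster_loss(stakes, outcomes, mus, cs):
    """Average over rounds of b_t * (y_t - mu_t) - |b_t| * c_t (assumed bet form)."""
    total = sum(b * (y - mu) - abs(b) * c
                for b, y, mu, c in zip(stakes, outcomes, mus, cs))
    return total / len(stakes)


def simulate(c, rounds=20000, true_prob=0.1, forecast=0.5, seed=0):
    """Average forecaster loss when every agent bets against the event with stake -1."""
    rng = random.Random(seed)
    outcomes = [1 if rng.random() < true_prob else 0 for _ in range(rounds)]
    stakes = [-1.0] * rounds
    return average_forecaster_loss(stakes, outcomes, [forecast] * rounds, [c] * rounds)


if __name__ == "__main__":
    print(simulate(c=0.0))   # sharp but wrong forecast: positive average loss
    print(simulate(c=0.45))  # wide interval: the forecaster no longer loses
```

A larger interval size protects the forecaster but makes the forecast less sharp, which is why the target is an average loss of exactly zero rather than merely a non-positive one.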

Algorithm 3 Swap Regret Minimization

Input: number of discrete intervals
Partition into equal intervals , ,
For each interval initialize an empty set , set
for do
       Initialize an empty ordered list
       Initialize and
       while do
             Append to
             Set as the that satisfies
       Remove all elements before from
       Select uniformly at random from
       Choose and send to Algorithm 1
       Receive from Algorithm 1, add to

4.1 Online Forecasting Algorithm

We aim to achieve asymptotic exactness with minimal assumptions on the sequence of outcomes and stakes (we only assume boundedness). This is challenging for two reasons. First, an adversary could select them to violate asymptotic exactness as much as possible (e.g. decision agents could try to profit from the forecaster’s loss). Second, in Protocol 2 the agent’s action and stake are selected after the forecaster’s predictions are revealed, so the agent has a last-move advantage.

Nevertheless, asymptotic exactness can be achieved, as shown in Theorem 1 (proof in Appendix C). In fact, we design a post-processing algorithm that modifies the prediction of a base algorithm (similar to recalibration [16, 11]). Algorithm 1 can modify any base algorithm (as long as the base algorithm outputs some forecast at every time step) to achieve asymptotic exactness, even though the finite time performance could be hurt by a poor base prediction algorithm.

Theorem 1.

Suppose there is a constant such that , , there exists an algorithm to output in Protocol 2 that is asymptotically exact for generated by any strategy of nature and agent. In particular, Algorithm 1 satisfies

For this paper we use as our base algorithm a simple online gradient descent algorithm [23] shown in Algorithm 2. Specifically, Algorithm 2 learns two regression models (such as neural networks with a single real number as output), one for the forecasted probability and one for the interval size. The first is trained to predict the outcome by minimizing a standard loss, while the second is trained to minimize the squared payoff of each bet.

Based on Algorithm 2, Algorithm 1 learns an additional “correction” parameter by invoking Algorithm 3. Intuitively, up to the current time step, if the forecaster has positive cumulative loss in Protocol 2, then the interval sizes have been too small in the past, and Algorithm 1 will select a larger correction to increase the interval size; conversely, if the forecaster has negative cumulative loss, then the interval sizes have been too large in the past, and Algorithm 1 will select a smaller correction to decrease the interval size.

Despite the straightforward intuition, the difficulty comes from ensuring Theorem 1 for any sequence of outcomes and stakes. In fact, Algorithm 3 needs to be a swap regret minimization algorithm [3]. For a detailed explanation and proof of why using Algorithm 3 can guarantee Theorem 1, see Appendix C.

4.2 Offline Forecasting

Our new definition of asymptotic soundness in Eq.(4) is related to existing notions of calibration. Asymptotic soundness depends on the set of bets , which are further determined by the downstream decision tasks. In fact, we prove we can recover existing notions of calibration [5, 11, 15, 17] or multicalibration [13] for special decision tasks. For more details see Appendix A.2.

If a forecaster satisfies the existing notions of calibration, there is some set of functions , such that the forecaster is asymptotically sound if the agents choose based on some . The benefit is that once deployed, the forecaster does not have to be updated (compared to the online setup where the forecaster must continually update via Algorithm 1). However, the short-coming is that we must make strong assumptions on how the agents choose bets to insure themselves.

Figure 1: The airline’s revenue (Top) and total utility (of both airline and passenger, Bottom) with and without the betting mechanism. Different colors represent the percentage of cautious passengers. The x-axis represents the number of flights that have happened, and the y-axis represents the average utility per passenger across all past flights. Left: Without the betting mechanism that insures passengers against delay. Middle and Right: With the betting mechanism, the airline revenue increases (because it is able to charge a higher ticket price due to increased demand) and the total utility increases. The middle panel is the utility with both Algorithm 1 and Algorithm 3, while the right panel only uses Algorithm 1 (i.e. the correction parameter is held fixed). In general the middle panel achieves faster convergence, so with fewer iterations, the utility is better than in the right panel.

5 Case Study on Flight Delays

In this section we study a practical application that could benefit from our proposed mechanism. Compared to other means of transport, flights are often the fastest, but usually the least punctual. Different passengers may have different losses in case of delay. For example, if a passenger needs to attend an important event on time, the loss from a delay can be very large, and the passenger might want to choose an alternative transportation method. The airline company could predict the probability of delay, and each passenger could use the probability to compute their expected loss before deciding whether to fly. However, as argued in Section 2.2, there is in general no good way to know that these probabilities are correct. Even worse, the airline may have an incentive to under-report the probability of delay to attract passengers.

Instead the airline can use Protocol 2 to convey confidence to the passengers that the delay probability is accurate. In this case, Protocol 2 has a simple form that can be easily explained to passengers as a “delay insurance”. In particular, if a passenger buys a ticket, they can choose to insure themselves against delay by specifying the amount they would like to be paid if the airplane is delayed. The airline provides a quote on the cost of the insurance (i.e. the amount the passenger pays if the flight is not delayed). Note that this is equivalent to Protocol 2 if the airline first predicts the probability of delay and then derives the quote from that prediction.
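One way to derive such a quote is sketched below, assuming the bet form of Protocol 2 in which an agent with stake `b` receives `b * (y - mu) - abs(b) * c`, with `y = 1` meaning the flight is delayed. Solving for the positive stake that pays the requested amount on delay gives the premium due when there is no delay. The function name, argument names, and numbers are illustrative assumptions, not the airline's actual pricing rule.

```python
def insurance_quote(payout: float, mu: float, c: float) -> float:
    """Premium paid when the flight is NOT delayed, for a given payout on delay.

    mu is the forecasted delay probability and c the interval size (assumed names).
    """
    if payout < 0:
        raise ValueError("payout must be non-negative")
    if not 0.0 <= mu + c < 1.0:
        raise ValueError("mu + c must lie in [0, 1)")
    stake = payout / (1.0 - mu - c)  # solves stake * (1 - mu) - stake * c == payout
    return stake * (mu + c)          # amount the passenger pays when y == 0


def airline_expected_loss(payout: float, premium: float, delay_prob: float) -> float:
    """Airline's expected loss on the insurance if delays happen with delay_prob."""
    return delay_prob * payout - (1.0 - delay_prob) * premium


if __name__ == "__main__":
    premium = insurance_quote(payout=100.0, mu=0.2, c=0.05)
    print(premium)
    print(airline_expected_loss(100.0, premium, 0.2))   # airline gains if forecast is right
    print(airline_expected_loss(100.0, premium, 0.25))  # break-even at the interval's edge
```

With this rule the airline only loses in expectation when the true delay probability exceeds the upper end of its forecasted interval.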

If a passenger buys the right insurance according to Proposition 2, their expected utility (or negative loss) will be fixed, so they do not need to worry that the predicted delay probability might be incorrect. In addition, if the airline follows Algorithm 1, the airline is also guaranteed not to lose money from the “delay insurance” in the long run (no matter what the passengers do), so the airline should be incentivized to implement the insurance mechanism to benefit its passengers “for free”.

Passenger Model

Since the passengers’ utility functions are unknown, we model three types of passengers that differ in the assumptions they make when deciding:

  1. Naive passengers don’t care about delays and assume the airline doesn’t delay.

  2. Trustful passengers assume the delay probability forecasted by the airline is correct.

  3. Cautious passengers assume the worst (i.e. they choose actions that maximize their worst case utility).

In this experiment we vary the proportion of cautious passengers, and equally split the remaining passengers between naive and trustful. The naive and trustful passengers do not care about the risk of mis-prediction, so they do not buy the delay insurance (i.e. they always choose a stake of $0$), while cautious passengers always buy the insurance that maximizes their worst case utility.

5.1 Simulation Setup

Code and jupyter notebook tutorial available at

We use the flight delay and cancellation dataset [7] from the year 2015, and use flight records of the single biggest airline (WN). As input features, we convert the source airport, target airport, and scheduled time into one-hot vectors, and binarize the arrival delay into 1 (delay above the 20 min threshold) and 0 (delay below it). We use a two layer neural network with the leaky ReLU activation for prediction.

Passenger Utility

Let denote whether a delay happens, and denote whether the passenger chooses to ride the plane. We model the passenger utility (negative loss) as

where is the utility of the alternative option (e.g. taking another transportation or cancelling the trip). For simplicity we assume that this is a single real number. is the reward of the trip, is the cost of the ticket, and is the cost of a delayed flight. For each flight we sample 1000 potential passengers by randomly drawing the values and (for details see appendix).

Airline Pricing

Based on the passenger type (naive, trustful, cautious) and the passenger’s sampled parameters, each passenger will have a maximum ticket price they are willing to pay for the flight. For simplicity we assume the airline sets the ticket price to the highest value at which it can sell 300 tickets. The passengers who are willing to pay more than this price will choose to fly, and the other passengers will choose not to.
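The pricing rule amounts to finding the 300th highest willingness to pay. The following is a minimal sketch under the assumption that a passenger buys a ticket whenever the price does not exceed their maximum; the function name and the toy data are hypothetical, and ties at the cutoff are ignored.

```python
def clearing_price(willingness_to_pay, seats=300):
    """Highest price at which at least `seats` passengers are still willing to buy."""
    ranked = sorted(willingness_to_pay, reverse=True)
    if len(ranked) < seats:
        raise ValueError("not enough potential passengers to fill the seats")
    return ranked[seats - 1]  # the seats-th highest willingness to pay


if __name__ == "__main__":
    # Toy population of 1000 potential passengers with distinct maximum prices.
    population = list(range(1, 1001))
    price = clearing_price(population)
    buyers = sum(1 for w in population if w >= price)
    print(price, buyers)
```

In the simulation, a mechanism that raises the cautious passengers' willingness to pay shifts this cutoff upwards, which is how the insurance can increase airline revenue.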

Figure 2: Comparing forecaster loss in Protocol 2 for different forecaster algorithms on MNIST (results for the Adult dataset are in the appendix). Each plot is an average performance across 20 different decision tasks, where we plot the top 10%, 25%, 50%, 75%, 90% quantile in forecaster loss. If the forecaster achieves asymptotic exactness as defined in Eq.(5), then the loss should be close to $0$. The left panel is Algorithm 1, and the rest are other seemingly reasonable algorithms explained in Section 6. The loss of a forecaster that uses Algorithm 1 typically converges to $0$ faster, while alternative algorithms often fail to converge.

Figure 3: Histogram of the interval size produced by the forecaster algorithm across all the tasks. There is no noticeable difference between the different algorithms. Notably the interval sizes are typically quite small, and large interval sizes are exponentially less common.

5.2 Delay Insurance Improves Total Utility

The simulation results are shown in Figure 1. Using the betting mechanism is strictly better for both the airline’s revenue (i.e. ticket price * number of tickets) and the total utility (airline revenue + passenger utility). This is because the cautious passengers always make decisions to maximize their worst case utility. With the betting mechanism, their worst case utility becomes closer to their actual true utility, so their decision (to fly or not) will better maximize their true utility. The airline also benefits because it can charge a higher ticket price due to increased demand (more cautious passengers will choose to fly).

We also consider several alternatives to Algorithm 3. The alternative algorithms do not provide theoretical guarantees; in practice, they also achieve worse convergence to the final utility. This is a reason to prefer Algorithm 3 if the number of iterations is small.

6 Additional Experiments

We further verify our theory with simulations on a diverse benchmark of decision tasks. We also run an ablation study to show that Algorithm 3 is necessary. Several simpler alternatives often fail to achieve asymptotic exactness and have worse empirical performance.

Dataset and Decision Tasks

We use the MNIST and UCI Adult [8] datasets. MNIST is a multi-class classification dataset; we convert it to binary classification by choosing where the is the digit category. We also generate a benchmark consisting of 20 different decision tasks. For details see Appendix B.2.


We compare several forecaster algorithms that differ in whether they use Algorithm 3 to adjust the correction parameter. In particular, swap regret refers to Algorithm 3; none does not adjust the parameter and simply keeps it at a fixed value; standard regret minimizes the standard regret rather than the swap regret; naive best response chooses the value that would have been optimal had it been applied, counterfactually, to the past iterations.

Forecaster Model

As in the previous experiment, we use two layer neural networks as the forecaster’s regression models. For the results shown in Figure 6 we also use histogram binning [18] on the entire validation set to recalibrate the forecasted probability, such that it satisfies standard calibration [11].
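Histogram binning itself is a standard recalibration technique and can be sketched in a few lines: each forecast is replaced by the average observed label among validation points whose forecast falls in the same bin. The code below is a generic illustration with equal-width bins, not the paper's implementation; the function names, the number of bins, and the fallback for empty bins are assumptions.

```python
def fit_histogram_binning(probs, labels, n_bins=10):
    """Return one recalibrated probability per bin: the mean label in that bin."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sums[k] += y
        counts[k] += 1
    # Empty bins fall back to the bin midpoint (an arbitrary choice for this sketch).
    return [sums[k] / counts[k] if counts[k] else (k + 0.5) / n_bins
            for k in range(n_bins)]


def recalibrate(bin_values, p):
    """Map a raw forecast p to the recalibrated value of its bin."""
    n_bins = len(bin_values)
    return bin_values[min(int(p * n_bins), n_bins - 1)]


if __name__ == "__main__":
    validation_probs = [0.05, 0.2, 0.7, 0.95]
    validation_labels = [0, 1, 1, 1]
    bins = fit_histogram_binning(validation_probs, validation_labels, n_bins=2)
    print(bins, recalibrate(bins, 0.3), recalibrate(bins, 1.0))
```

Calibration of this kind only holds on average within each bin, which is consistent with the observation that a calibrated forecaster can still be wrong for individual decision makers.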


The results are plotted in Figures 2 and 3 in the main paper and Figures 5 and 6 in Appendix B.2. There are three main observations: 1) Even when a forecaster is calibrated, for individual decision makers, the expected loss under the forecasted probability is almost always incorrect. 2) Algorithm 1 has good empirical performance. In particular, the guarantees of Theorem 1 can be achieved within a reasonable number of time steps, and the interval size is usually small. 3) Seemingly reasonable alternatives to Algorithm 1 often empirically fail to be asymptotically exact.

7 Conclusion

In this paper, we propose an alternative solution to address the impossibility of individual calibration, based on an insurance mechanism between the forecaster and decision makers. Each decision maker can make decisions as if the forecasted probability is correct, while the forecaster can also guarantee not losing in the long run. Future work can explore other issues that arise from this protocol, such as honesty [10], fairness [9], and social/moral/legal implications.

8 Acknowledgements

This research was supported by AFOSR (FA9550-19-1-0024), NSF (#1651565, #1522054, #1733686), JP Morgan, ONR, FLI, SDSI, Amazon AWS, and SAIL.


  • [1] R. F. Barber, E. J. Candes, A. Ramdas, and R. J. Tibshirani (2019) The limits of distribution-free conditional predictive inference. arXiv preprint arXiv:1903.04684. Cited by: §1, §2.2.
  • [2] S. J. Bielinski, J. E. Olson, J. Pathak, R. M. Weinshilboum, L. Wang, K. J. Lyke, E. Ryu, P. V. Targonski, M. D. Van Norstrand, M. A. Hathcock, et al. (2014) Preemptive genotyping for personalized medicine: design of the right drug, right dose, right time—using genomic data to individualize treatment protocol. In Mayo Clinic Proceedings, pp. 25–33. Cited by: §1.
  • [3] A. Blum and Y. Mansour (2007) From external to internal regret. Journal of Machine Learning Research 8 (Jun), pp. 1307–1324. Cited by: §4.1.
  • [4] N. Cesa-Bianchi and G. Lugosi (2006) Prediction, learning, and games. Cambridge university press. Cited by: Appendix C, §1.
  • [5] A. P. Dawid (1985) Calibration-based empirical probability. The Annals of Statistics, pp. 1251–1274. Cited by: §A.2.1, §4.2.
  • [6] B. De Finetti (1931) On the subjective meaning of probability. Fundamenta mathematicae 17 (1), pp. 298–329. Cited by: §1, §2.3.
  • [7] DoT (2017) 2015 flight delays and cancellations. Note: https://www.kaggle.com/usdot/flight-delays Cited by: §5.1.
  • [8] D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. Cited by: §6.
  • [9] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel (2012) Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pp. 214–226. Cited by: §7.
  • [10] M. R. Foreh and S. Grier (2003) When is honesty the best policy? the effect of stated company intent on consumer skepticism. Journal of consumer psychology 13 (3), pp. 349–356. Cited by: §7.
  • [11] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. arXiv preprint arXiv:1706.04599. Cited by: §A.2.1, §1, §4.1, §4.2, §6.
  • [12] J. Y. Halpern (2017) Reasoning about uncertainty. MIT press. Cited by: §2.3.
  • [13] U. Hébert-Johnson, M. P. Kim, O. Reingold, and G. N. Rothblum (2017) Calibration for the (computationally-identifiable) masses. arXiv preprint arXiv:1711.08513. Cited by: §4.2.
  • [14] E. T. Jaynes (1996) Probability theory: the logic of science. Washington University St. Louis, MO. Cited by: §1.
  • [15] J. Kleinberg, S. Mullainathan, and M. Raghavan (2016) Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807. Cited by: §4.2.
  • [16] V. Kuleshov and S. Ermon (2017) Estimating uncertainty online against an adversary. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §1, §4.1.
  • [17] A. Kumar, P. S. Liang, and T. Ma (2019) Verified uncertainty calibration. In Advances in Neural Information Processing Systems, pp. 3792–3803. Cited by: §4.2.
  • [18] M. P. Naeini, G. F. Cooper, and M. Hauskrecht (2015) Obtaining well calibrated probabilities using bayesian binning. In Proceedings of the… AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence, Vol. 2015, pp. 2901. Cited by: §6.
  • [19] P. C. Ng, S. S. Murray, S. Levy, and J. C. Venter (2009) An agenda for personalized medicine. Nature 461 (7265), pp. 724–726. Cited by: §1.
  • [20] J. M. Pulley, J. C. Denny, J. F. Peterson, G. R. Bernard, C. L. Vnencak-Jones, A. H. Ramirez, J. T. Delaney, E. Bowton, K. Brothers, K. Johnson, et al. (2012) Operational implementation of prospective genotyping for personalized medicine: the design of the vanderbilt predict project. Clinical Pharmacology & Therapeutics 92 (1), pp. 87–95. Cited by: §1.
  • [21] V. Vovk, A. Gammerman, and G. Shafer (2005) Algorithmic learning in a random world. Springer Science & Business Media. Cited by: §1, §2.2.
  • [22] S. Zhao, T. Ma, and S. Ermon (2020) Individual calibration with randomized forecasting. arXiv preprint arXiv:2006.10288. Cited by: §1, §2.2.
  • [23] M. Zinkevich (2003) Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th international conference on machine learning (icml-03), pp. 928–936. Cited by: §4.1.

Appendix A Additional Results

A.1 Multiclass Prediction

For multiclass prediction, we suppose that can take distinct values. We denote as the -dimensional probability simplex. For notational convenience we represent as a one-hot vector in , so .

Protocol 3: Decision Making with Bets, Multiclass

At time

  1. Nature reveals and chooses without revealing it

  2. Forecaster reveals and

  3. Agent has loss and chooses action and

  4. Sample and reveal

  5. Agent total loss is , forecaster loss is

As before we require the regularity condition that and (even though these are no longer on , hence not probabilities).

Similar to Section 3 we can denote the agent’s maximum / minimum expected loss under the forecasted probability as

and true expected loss as . As before denote

Proposition 3.

If then

Proof of Proposition 3.

As a notation shorthand we denote with the vector , such that . We first show a closed form solution for which can be written as

Similarly we have

Denote the that achieves the infimum as . Comparing with we have

A.2 Offline Calibration

For this section we restrict to the i.i.d. setup, where we assume there are random variables

with some distribution such that at each time step,

We also assume that the forecaster ’s choice and the agent’s choice in Protocol 2 are computed by functions of

The following definition is the specialization of asymptotic soundness in Eq.(4) to the i.i.d. setup

Definition 1.

We say that the functions are sound with respect to some set of functions if

If we say is -calibrated.

A.2.1 Examples and Special Cases

Standard Calibration

Standard calibration is defined as: for any , among the where it is indeed true that is with probability. Formally this can be written as

Deviation from this ideal situation is measured by the maximum calibration error (MCE).

Note that the MCE may be ill-defined if there is an interval such that with zero probability. We are going to avoid the technical subtlety by assuming that this does not happen, i.e. the distribution of is supported on the entire set .
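For intuition, the MCE can be approximated from finite samples by binning the forecasts and taking the largest gap between the mean forecast and the empirical frequency within a bin. The sketch below is such a binned estimator under assumed equal-width bins; it is not the exact population quantity defined above, and `binned_mce` is a hypothetical name. Skipping empty bins mirrors the support assumption just made: a bin that receives no forecasts provides no evidence about calibration.

```python
import numpy as np

def binned_mce(probs, labels, n_bins=10):
    """Largest |mean label - mean forecast| over the non-empty equal-width bins."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(probs, edges[1:-1])  # bin index in {0, ..., n_bins - 1}
    gaps = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():  # empty bins are skipped rather than counted as zero error
            gaps.append(abs(labels[mask].mean() - probs[mask].mean()))
    return max(gaps)
```

The estimate depends on the number of bins, so it is best read as a diagnostic rather than as the MCE itself.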

When is the set of all possible functions (i.e. it only depends on the probability forecast but not itself), we obtain the standard definition of calibration [5, 11], as shown by the following proposition

Proposition 4.

The forecaster function is sound with respect to if and only if the MCE of is less than .


See Appendix D


Multi-calibration achieves standard calibration for all subsets in some collection of sets , and denote as the indicator function of (i.e. iff ). Suppose consists of all functions of the form where is an arbitrary function. A forecaster that is sound with respect to this set of is also multicalibrated.

A.2.2 Soundness and Calibration

This section aims to argue that if the forecaster achieves existing definitions of calibration, then it is sound under some assumptions on decision making agents.

Standard Calibration

We assume that every decision agent selects according to Proposition 2 and the loss does not depend on (i.e. every decision maker has the same loss). In addition, we assume that when given the same prediction the decision maker will always select the same (for example, this would be true if every decision maker has the same loss and always chooses the action that minimizes expected loss). Under these assumptions, a forecaster that satisfies standard calibration will also be asymptotically sound.

To see why this is, only depends on , and is independent of ; so will only depend on . This satisfies the condition of Appendix A.2.1; a forecaster that satisfies standard calibration will also be sound.

Appendix B Experiment Details and Additional Results

B.1 Airline Delay


In Protocol 2 must be non-negative for its interpretation as a probability interval . However, consider only the flight delay insurance interpretation: the airline pays the passenger if the flight is delayed, and the passenger pays the airline if the flight is not delayed. These payments are meaningful for both positive and negative ; the passenger utility (with insurance) can be computed as , which is also meaningful for both positive and negative . We find that allowing negative improves the stability of the algorithm.
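To make the insurance interpretation concrete, here is a minimal sketch of a fair bet settled at a single forecasted delay probability. The payment amounts are an assumption for illustration (a stake of `stake` at price `mu`, ignoring any interval width or margin), and `insurance_payment` and `expected_payment` are hypothetical names rather than quantities defined in Protocol 2.

```python
def insurance_payment(stake, mu, delayed):
    """Net payment from airline to passenger for a bet settled at price mu.

    Assumption: the airline pays stake * (1 - mu) if the flight is delayed,
    and the passenger pays stake * mu otherwise. A negative return value
    means the passenger pays the airline.
    """
    return stake * (1.0 - mu) if delayed else -stake * mu

def expected_payment(stake, mu, true_prob):
    """Expected net payment to the passenger when the true delay probability is true_prob."""
    return (true_prob * insurance_payment(stake, mu, True)
            + (1.0 - true_prob) * insurance_payment(stake, mu, False))

# With an accurate forecast the bet is fair, i.e. it has zero expected value.
print(expected_payment(stake=8.0, mu=0.25, true_prob=0.25))
# A negative stake reverses the direction of both payments.
print(insurance_payment(-8.0, 0.25, True), insurance_payment(-8.0, 0.25, False))
```

Under these assumed amounts the expected payment equals the stake times the gap between the true and forecasted probability, which is why the formulas remain well defined when the stake is negative.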

Passenger Model

We sample as and sample from . We assume the cost of delay can be more varied, so we sample it from the following process: and . This gives us a cost of delay between , but large values are less likely.

Additional Results

We show additional comparison with other alternatives to Algorithm 3 in Figure 4. For details about these alternatives see Section 6.

Figure 4: This plot extends Figure 1. We compare with additional alternatives to Algorithm 3.

B.2 Additional Experiments

Decision Loss

For each data point we associate an extra feature used to define decision loss. For MNIST this is the digit label and for UCI Adult this is the age (binned by quantile into 10 bins). We simulate three kinds of decision losses; for each type of decision loss we randomly sample a few instantiations.

1. One-sided: we assume that and each decision loss is large if and small if . For different values of there are different stakes (i.e. how much does the loss when differ from ).

2. Different Stakes: Each value of the decision loss is a draw from , which is used to capture the feature that certain groups of people have larger stakes.

3. Random. Each value of the decision loss is a draw from but clipped to be within .
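The extra feature for UCI Adult is described above as age binned by quantile into 10 bins. One possible way to compute such a feature is sketched below; the helper name `quantile_bin` and the use of NumPy's default quantile interpolation are assumptions, not details taken from the experiments.

```python
import numpy as np

def quantile_bin(values, n_bins=10):
    """Assign each value a bin index in {0, ..., n_bins - 1} using quantile cut points."""
    values = np.asarray(values, dtype=float)
    # The n_bins - 1 interior quantiles serve as cut points.
    cuts = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    return np.digitize(values, cuts)

# Example with assumed data: 100 distinct values split into 10 equally sized bins.
bins = quantile_bin(np.arange(100))
print(np.bincount(bins))
```

With heavily tied values, such as integer ages, some bins may end up larger than others because equal values always share a bin.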

Forecasted Loss vs. True Loss

In Figure 6 we plot the relationship between the expected loss under the forecasted probability and the expected loss under the true probability (we can compute this for the MNIST dataset because the true probability is known as explained in Section 6). Even if we apply histogram binning recalibration (explained in Section 6), the individual probabilities are almost always incorrect.
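The gap between the two quantities can be seen in a toy example. The sketch below uses made-up numbers (two individuals with true probabilities 0.1 and 0.9, a constant forecast of 0.5, and a fixed action whose loss is 10 when the outcome is 1 and 0 otherwise); none of these values come from the experiments. The forecast is calibrated on average, yet the forecasted expected loss is wrong for both individuals.

```python
import math

def expected_loss(p, loss_if_y1, loss_if_y0):
    """Expected loss of a fixed action when the outcome is 1 with probability p."""
    return p * loss_if_y1 + (1.0 - p) * loss_if_y0

# Assumed toy population: the forecaster outputs 0.5 for both individuals.
true_probs = [0.1, 0.9]
forecasts = [0.5, 0.5]
loss_if_y1, loss_if_y0 = 10.0, 0.0

forecasted = [expected_loss(q, loss_if_y1, loss_if_y0) for q in forecasts]
actual = [expected_loss(p, loss_if_y1, loss_if_y0) for p in true_probs]
gaps = [f - a for f, a in zip(forecasted, actual)]

# Calibrated on average: the mean true probability matches the forecast.
print(math.isclose(sum(true_probs) / len(true_probs), forecasts[0]))
# Yet each individual's forecasted loss is off, in opposite directions.
print(gaps)
```

The errors cancel across the population, which is why an average metric such as calibration cannot reveal them.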

Asymptotic Exactness

In Figure 2 and Figure 5 we plot the average betting loss of the forecaster. Algorithm 1 consistently achieves better asymptotic exactness compared to alternatives.

Average Interval Size

In Figure 3 we plot the interval size . A small satisfies desideratum 2 in Section 3 and makes the guarantee in Proposition 2 useful for decision makers. We observe that most interval sizes are small, and larger intervals are exponentially unlikely.

Figure 5: This plot is identical to Figure 2 but for the Adult dataset.
Figure 6: The expected loss under the forecaster utility vs. expected loss under the true probability. Each dot represents an individual probability forecast with a particular choice of loss function. We use histogram binning on the entire validation set to recalibrate the forecaster. Even though the forecaster is calibrated, the individual probabilities are often incorrect. Therefore, the expected loss under the forecasted probability often differs from the expected loss under the true probability (blue dots). On the other hand, with additional payment from the bets, the expected total loss under the true probability is always bounded between the minimum loss under the forecasted probability and the maximum loss under the forecasted probability.

Appendix C Proof of Online Prediction

The goal of Algorithm 3 is to select a sequence of to minimize the loss . However, instead of a standard online regret minimization, Algorithm 3 tries to minimize the swap regret. Let denote the set of -Lipschitz functions , and define the swap regret as