Incentives for Federated Learning: a Hypothesis Elicitation Approach

07/21/2020 ∙ by Yang Liu, et al. ∙ University of California Santa Cruz

Federated learning provides a promising paradigm for collecting machine learning models from distributed data sources without compromising users' data privacy. The success of a credible federated learning system builds on the assumption that the decentralized and self-interested users will be willing to participate and contribute their local models in a trustworthy way. However, without proper incentives, users might simply opt out of the contribution cycle, or be mis-incentivized to contribute spam/false information. This paper introduces solutions to incentivize truthful reporting of a local, user-side machine learning model for federated learning. Our results build on the literature of information elicitation, but focus on the question of eliciting hypotheses (rather than eliciting human predictions). We provide a scoring rule based framework that incentivizes truthful reporting of local hypotheses at a Bayesian Nash Equilibrium. We also study the market implementation, accuracy, and robustness properties of our proposed solution. We verify the effectiveness of our methods using the MNIST and CIFAR-10 datasets. In particular, we show that by reporting low-quality hypotheses, users will receive decreasing scores (rewards, or payments).


1 Introduction

When a company relies on distributed users’ data to train a machine learning model, federated learning [mcmahan2016communication, yang2019federated, kairouz2019advances] promotes the idea that users’/customers’ data should be kept local, and only the locally held/learned hypothesis is shared/contributed by each user. While federated learning has seen success in keyboard recognition [hard2018federated] and in language modeling [chen2019federated], existing works have made an implicit assumption that participating users are willing to contribute their local hypotheses to help the central entity refine the model. Nonetheless, without proper incentives, agents can choose to opt out of participation, contribute uninformative or outdated information, or even contribute malicious model information. Though it is an important question for federated learning [yang2019federated, liu2020fedcoin, hansus, han20], the capability of providing adequate incentives for user participation has largely been overlooked. In this paper we ask the question: can a machine learning hypothesis be incentivized/elicited from self-interested agents by a certain form of scoring rule? The availability of such a scoring rule helps us properly design a payment for the elicited hypotheses so as to motivate the reporting of high-quality ones. The corresponding solutions complement the literature of federated learning by offering a generic template for incentivizing users’ participation.

We address the challenge by providing a scoring framework to elicit hypotheses truthfully from the self-interested agents/users (throughout this paper, we use agents and users interchangeably). More concretely, suppose agent $i$ has a locally observed hypothesis $h^*_i$. For instance, the hypothesis can come from solving a local problem, according to a certain hypothesis class $\mathcal{H}_i$, a distribution $\mathcal{D}_i$, and a loss function $\ell$:
$$h^*_i = \arg\min_{h \in \mathcal{H}_i} \mathbb{E}_{(X,Y) \sim \mathcal{D}_i}\big[\ell(h(X), Y)\big].$$
The goal is to design a scoring function $S$ that takes a reported hypothesis $\tilde{h}_i$, and possibly a second input argument (to be defined in the context), such that
$$\mathbb{E}\big[S(h^*_i, \cdot)\big] \geq \mathbb{E}\big[S(\tilde{h}_i, \cdot)\big], \quad \forall \tilde{h}_i,$$
where the expectation is w.r.t. agent $i$’s local belief, which is specified in context. If the above can be achieved, $S$ can serve as the basis of a payment system in federated learning such that agents paid by $S$ are incentivized to contribute their local models truthfully. In this work, we primarily consider two settings, with arguably increasing difficulty in designing our mechanisms:

With ground truth verification

We will start with a relatively easier setting where we, as the designer, have access to a labeled dataset $D = \{(x_n, y_n)\}_{n=1}^N$. We will demonstrate how this question is similar to the classical information elicitation problem with strictly proper scoring rules [gneiting2007strictly], calibrated loss functions [bartlett2006convexity], and peer prediction (information elicitation without verification) [miller2005eliciting].

With only access to features

The second setting is when we only have the features $\{x_n\}_{n=1}^N$ but not the ground truth labels $\{y_n\}_{n=1}^N$. This case is arguably more common in practice, since collecting label annotations requires a substantial amount of effort. For instance, a company may be interested in eliciting/training a classifier for an image classification problem: while it has access to images, it might not have spent the effort to collect labels for them. We will again present a peer prediction-style solution for this setting.

Besides establishing the desired incentive properties of the scoring rules, we will look into questions such as when the scoring mechanism rewards accurate classifiers, how to build a prediction market-style solution to elicit improving classifiers, as well as our mechanism’s robustness against possible collusion. Our work can be viewed both as a contribution to federated learning, by providing incentives for selfish agents to share their hypotheses, and as a contribution to the literature of information elicitation, by studying the problem of hypothesis elicitation. We validate our claims via experiments using the MNIST and CIFAR-10 datasets.

All omitted proofs and experiment details can be found in the supplementary materials.

1.1 Related works

Due to space limits, we only briefly survey two related lines of work:

Information elicitation

Our solution concept relates most closely to the literature of information elicitation [Brier:50, Win:69, Savage:71, Matheson:76, Jose:06, Gneiting:07]. Information elicitation primarily focuses on developing scoring rules to incentivize or elicit self-interested agents’ private probabilistic beliefs about a private event (e.g., how likely will the COVID-19 death toll reach 100K by May 1?). Relevant to us, [abernethy2011collaborative] provides a market treatment to elicit more accurate classifiers, but the solution requires the designer to have the ground truth labels and the agents to agree on the losses. We provide a more generic solution without the above limitations.

A more challenging setting features elicitation without ground truth verification. Peer prediction [Prelec:2004, MRZ:2005, witkowski2012robust, radanovic2013, Witkowski_hcomp13, dasgupta2013crowdsourced, shnayder2016informed, radanovic2016incentives, LC17, kong2019information, liu2020] is among the most popular solution concepts. The core idea of peer prediction is to score each agent based on a reference report elicited from the rest of the agents, and to leverage the stochastic correlation between different agents’ information. Most relevant to us is the Correlated Agreement mechanism [dasgupta2013crowdsourced, shnayder2016informed, kong2019information]. We provide a separate discussion of it in Section 2.1.

Federated learning

Federated learning [mcmahan2016communication, hard2018federated, yang2019federated] arose recently as a promising architecture for learning from massive amounts of users’ local information without polling their private data. The existing literature has devoted extensive efforts to making the model sharing process more secure [secure_1, secure_2, secure_3, secure_4, secure_5, bonawitz2016practical], more efficient [efficient_1, efficient_2, efficient_3, efficient_4, fl:communication, efficient_6, efficient_7], and more robust [robust_1, robust_2, robust_3, pillutla2019robust] to heterogeneity in the distributed data sources, among many other directions. For a more detailed survey, please refer to [yang2019federated, kairouz2019advances].

The incentive issue has been listed as an outstanding problem in federated learning [yang2019federated]. There have been several very recent works touching on the challenge of incentive design in federated learning. [liu2020fedcoin] proposed a currency system for federated learning based on blockchain techniques. [hansus] describes a payoff sharing algorithm that maximizes the system designer’s utility, but the solution does not consider the agents’ strategic behaviors induced by insufficient incentives. [han20] further added fairness guarantees to the above reward system. We are not aware of a systematic study of truthfulness in incentivizing hypotheses in federated learning, and our work complements the above results by providing an incentive-compatible scoring system for building a payment system for federated learning.

2 Formulation

Consider the setting with a set of agents indexed by $i$, each with a hypothesis $h^*_i$ which maps the feature space $\mathcal{X}$ to the label space $\mathcal{Y}$. The hypothesis space $\mathcal{H}_i$ is the space of hypotheses accessible to, or so far considered by, agent $i$, perhaps as a function of the subsets of $\mathcal{X}$ or $\mathcal{Y}$ which have been encountered by the agent, or the agent’s available computational power. $h^*_i$ is often obtained following a local optimization process. For example, $h^*_i$ can be defined as the function which minimizes a loss function $\ell$ over the agent’s hypothesis space:
$$h^*_i = \arg\min_{h \in \mathcal{H}_i} \mathbb{E}_{(X,Y) \sim \mathcal{D}_i}\big[\ell(h(X), Y)\big],$$

where $\mathcal{D}_i$ above is the local distribution that agent $i$ has access to for training and evaluating $h^*_i$. In the federated learning setting, note that $h^*_i$ can also represent the optimal output of a private training algorithm, and $\mathcal{H}_i$ would then denote a training hypothesis space that encodes a certain level of privacy guarantees. In this paper, we do not discuss the specific ways to make a local hypothesis private (there exist a variety of definitions of privacy and corresponding solutions; notable ones include output perturbation [chaudhuri2011differentially] or output sampling [bassily2014private] to preserve privacy when differential privacy [dwork2006differential] is adopted to quantify the preserved privacy level), but rather focus on developing scoring functions to incentivize/elicit this “private" and ready-to-be-shared hypothesis.
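To make the local optimization step concrete, the following is a minimal sketch of the empirical risk minimization an agent might run on its local data. The PyTorch-style training loop, the hyperparameters, and the function name are illustrative assumptions, not the paper's implementation.

```python
import torch.nn as nn
import torch.optim as optim

def train_local_hypothesis(model: nn.Module, local_loader, epochs: int = 5, lr: float = 1e-2):
    """Approximate h_i^* = argmin_{h in H_i} E_{(X,Y)~D_i}[loss(h(X), Y)]
    by empirical risk minimization over agent i's local data."""
    criterion = nn.CrossEntropyLoss()            # the loss function ell
    optimizer = optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x, y in local_loader:                # mini-batches drawn from D_i
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
    return model                                 # the trained model plays the role of h_i^*
```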

Suppose the mechanism designer has access to a dataset $D$: $D$ can be a standard dataset with pairs of features and labels, $D = \{(x_n, y_n)\}_{n=1}^N$, or we may be in an unsupervised setting where we do not have labels associated with each sample: $D = \{x_n\}_{n=1}^N$.

The goal of the mechanism designer is to collect $h^*_i$ truthfully from each agent $i$. Denote the reported/contributed hypothesis from agent $i$ as $\tilde{h}_i$ ($\tilde{h}_i$ can be none if the user chooses not to contribute). Each agent will be scored using a function $S$ that takes all reported hypotheses and $D$ as inputs, $S\big(\tilde{h}_i, \{\tilde{h}_j\}_{j \neq i}, D\big)$, such that it is “proper” at a Bayesian Nash Equilibrium:

Definition 1.

$S$ is said to induce truthful reporting at a Bayesian Nash Equilibrium if, for every agent $i$, assuming $\tilde{h}_j = h^*_j$ for all $j \neq i$ (i.e., every other agent is willing to report their hypothesis truthfully),
$$\mathbb{E}\big[S\big(h^*_i, \{h^*_j\}_{j \neq i}, D\big)\big] \geq \mathbb{E}\big[S\big(\tilde{h}_i, \{h^*_j\}_{j \neq i}, D\big)\big], \quad \forall \tilde{h}_i,$$
where the expectation encodes agent $i$’s belief about $\{h^*_j\}_{j \neq i}$ and $D$.
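For illustration, here is a minimal designer-side sketch of how such a scoring function could be wired into a payment loop. The interface and names (Hypothesis, pay_agents) are hypothetical and not part of the paper.

```python
from typing import Callable, Dict, Sequence

# A "hypothesis" is modeled as any callable mapping a feature vector to a label.
Hypothesis = Callable[[Sequence[float]], int]

def pay_agents(reported: Dict[str, Hypothesis],
               score: Callable[[Hypothesis, Dict[str, Hypothesis], list], float],
               dataset: list) -> Dict[str, float]:
    """Designer-side payment loop: agent i is paid S(h_i, {h_j}_{j != i}, D).
    Per Definition 1, truthful reporting should maximize each agent's expected
    payment when all other agents report truthfully."""
    payments = {}
    for agent_id, h in reported.items():
        peers = {j: hj for j, hj in reported.items() if j != agent_id}
        payments[agent_id] = score(h, peers, dataset)
    return payments
```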

2.1 Peer prediction

Peer prediction is a technique developed to truthfully elicit information when there is no ground truth verification. Suppose we are interested in eliciting private observations about a categorical event generated according to a random variable $Y \in \{1, ..., C\}$ (in the context of a machine learning task, $Y$ can be thought of as labels). Each of the agents holds a noisy observation of $Y$, denoted as $y_i \in \{1, ..., C\}$. Again, the goal of the mechanism designer is to elicit the $y_i$’s, but they are private and we do not have access to the ground truth $Y$ to perform an evaluation. The scoring function $S$ is designed so that truth-telling is a strict Bayesian Nash Equilibrium (assuming the other agents truthfully report their $y_j$’s); that is, for every agent $i$ and any misreport $r_i \neq y_i$,
$$\mathbb{E}\big[S(y_i, y_j) \,\big|\, y_i\big] > \mathbb{E}\big[S(r_i, y_j) \,\big|\, y_i\big].$$

Correlated Agreement

Correlated Agreement (CA) [dasgupta2013crowdsourced, 2016arXiv160303151S] is a recently established peer prediction mechanism for a multi-task setting. CA is also the core and the focus of our subsequent sections. This mechanism builds on a matrix $\Delta$ that captures the stochastic correlation between the two sources of predictions $y_i$ and $y_j$. $\Delta \in \mathbb{R}^{C \times C}$ is defined as a square matrix with its entries given by
$$\Delta_{k,l} = P\big(y_i = k, y_j = l\big) - P\big(y_i = k\big)P\big(y_j = l\big), \quad k, l \in \{1, ..., C\}.$$

The intuition of the above matrix is that each entry of $\Delta$ captures the marginal correlation between the two predictions. $Sgn(\Delta)$ denotes the sign matrix of $\Delta$:
$$Sgn(\Delta)_{k,l} = \mathbb{1}\big(\Delta_{k,l} > 0\big).$$

CA requires each agent to perform multiple tasks: denote agent $i$’s observations for the $N$ tasks as $y_i(1), ..., y_i(N)$. Ultimately, the scoring function for each task $k$ that is shared between agents $i$ and $j$ is defined as follows: randomly draw two other tasks $k_1, k_2 \neq k$,
$$S\big(y_i(k), y_j(k)\big) := Sgn(\Delta)\big(y_i(k), y_j(k)\big) - Sgn(\Delta)\big(y_i(k_1), y_j(k_2)\big).$$

It was established in [2016arXiv160303151S] that CA is truthful and proper (Theorem 5.2, [2016arXiv160303151S]); to be precise, it is an informed truthfulness, and we refer interested readers to [2016arXiv160303151S] for the detailed differences. Under additional conditions on $\Delta$, CA is strictly truthful (Theorem 4.4, [2016arXiv160303151S]).
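A minimal sketch of the CA computation, assuming reports are integer class labels on a shared set of tasks. The plug-in estimate of $\Delta$ and the helper names (estimate_delta, ca_score) are illustrative; the mechanism itself only needs $Sgn(\Delta)$.

```python
import numpy as np

def estimate_delta(reports_i, reports_j, num_classes):
    """Plug-in estimate of Delta_{k,l} = P(y_i=k, y_j=l) - P(y_i=k) P(y_j=l)
    from the two agents' reports on the same tasks."""
    joint = np.zeros((num_classes, num_classes))
    for a, b in zip(reports_i, reports_j):
        joint[a, b] += 1.0
    joint /= len(reports_i)
    marginal_i = joint.sum(axis=1, keepdims=True)
    marginal_j = joint.sum(axis=0, keepdims=True)
    return joint - marginal_i * marginal_j

def ca_score(reports_i, reports_j, sgn_delta, rng=None):
    """Multi-task CA: for each task k, reward Sgn(Delta) agreement on task k
    and subtract Sgn(Delta) evaluated on two independently drawn other tasks."""
    rng = rng or np.random.default_rng(0)
    n = len(reports_i)
    total = 0.0
    for k in range(n):
        others = [t for t in range(n) if t != k]
        k1, k2 = rng.choice(others, size=2, replace=False)
        total += sgn_delta[reports_i[k], reports_j[k]] - sgn_delta[reports_i[k1], reports_j[k2]]
    return total / n

# Example: sgn_delta = (estimate_delta(r_i, r_j, C) > 0).astype(float)
```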

3 Elicitation with verification

We start by considering the setting where the mechanism designer has access to the ground truth labels, i.e., $D = \{(x_n, y_n)\}_{n=1}^N$.

3.1 A warm-up case: eliciting Bayes optimal classifier

As a warm-up, we start with the question of eliciting the Bayes optimal classifier:
$$h^* := \arg\min_{h} \mathbb{E}_{(X,Y) \sim \mathcal{D}}\big[\mathbb{1}\big(h(X) \neq Y\big)\big].$$

It is straightforward to observe that, by definition, using the negative loss as the score (the negative sign changes a loss into a reward), or any affine transformation of it, is sufficient to incentivize truthful reporting of the hypothesis. Next we show that any classification-calibrated loss function [bartlett2006convexity] can serve as a proper scoring function for eliciting the hypothesis (we provide details of the calibration condition in the proof). Classical examples include the cross-entropy loss, the squared loss, etc.

Theorem 1.

Any classification-calibrated loss function $\ell$ (paying agents $-\ell$) induces truthful reporting of the Bayes optimal classifier.

3.2 Eliciting “any-optimal" classifier: a peer prediction approach

Now consider the case where an agent does not hold the absolute Bayes optimal classifier. Instead, in practice, an agent's local hypothesis will depend on the local observations they have, the privacy level they desire, and the hypothesis space and training method they use. Consider agent $i$ holding the following hypothesis, obtained according to a loss function $\ell_i$ and a hypothesis space $\mathcal{H}_i$:
$$h^*_i = \arg\min_{h \in \mathcal{H}_i} \mathbb{E}_{(X,Y) \sim \mathcal{D}_i}\big[\ell_i(h(X), Y)\big].$$

By definition, each specific $-\ell_i$ will be sufficient to incentivize the corresponding hypothesis. However, it is unclear why $h^*_i$ trained using $\ell_i$ would necessarily be optimal according to a universal metric/score. We aim for a more generic approach to elicit the different $h^*_i$'s returned from different training procedures and hypothesis classes. In the following sections, we provide a peer prediction approach to do so.

We first state the hypothesis elicitation problem as a standard peer prediction problem. The connection is made by first rephrasing the two data sources, the classifiers and the labels, from the agents' perspective. Let's re-interpret the ground truth labels as coming from an "optimal" agent who holds a hypothesis $h^*$ with $h^*(x_n) = y_n$. Each local hypothesis $h^*_i$ that agent $i$ holds can be interpreted as the agent observing $h^*_i(x_n)$ for a set of randomly drawn feature vectors $x_1, ..., x_N$. Then a peer prediction mechanism $S$ induces truthful reporting if:
$$\mathbb{E}\big[S\big(h^*_i(X), h^*(X)\big)\big] \geq \mathbb{E}\big[S\big(\tilde{h}_i(X), h^*(X)\big)\big], \quad \forall \tilde{h}_i.$$

Correlated Agreement for hypothesis elicitation

To be more concrete, consider a specific implementation of a peer prediction mechanism, the Correlated Agreement (CA) mechanism. Recall that the mechanism builds on a correlation matrix $\Delta$, now defined between agent $i$'s hypothesis and the "optimal" agent's $h^*$ as follows:
$$\Delta_{k,l} = P\big(h^*_i(X) = k, h^*(X) = l\big) - P\big(h^*_i(X) = k\big)P\big(h^*(X) = l\big).$$

Then the CA for hypothesis elicitation is summarized in Algorithm 1.

1:  For each sample $x_n$, randomly sample two other samples $x_{p_1}, x_{p_2}$ ($p_1 \neq p_2 \neq n$) to pair with.
2:  Pay a reported hypothesis $\tilde{h}_i$ for $x_n$ according to
$$S\big(\tilde{h}_i(x_n), y_n\big) := Sgn(\Delta)\big(\tilde{h}_i(x_n), y_n\big) - Sgn(\Delta)\big(\tilde{h}_i(x_{p_1}), y_{p_2}\big). \quad (1)$$
3:  Total payment to a reported hypothesis $\tilde{h}_i$: $\sum_{n=1}^N S\big(\tilde{h}_i(x_n), y_n\big)$.
Algorithm 1 CA for Hypothesis Elicitation
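The following sketch illustrates Algorithm 1, assuming the reported hypothesis is a callable returning integer labels and that $Sgn(\Delta)$ has already been determined (e.g., the identity matrix under Lemma 1 below); all names are hypothetical.

```python
import numpy as np

def ca_payment_with_verification(h_report, xs, ys, sgn_delta, rng=None):
    """Sketch of Algorithm 1: score a reported hypothesis against ground truth
    labels with the CA bonus/penalty structure of Eq. (1), summed over samples."""
    rng = rng or np.random.default_rng(0)
    preds = np.array([h_report(x) for x in xs])   # reported predictions on the samples
    n = len(xs)
    total = 0.0
    for k in range(n):
        others = [t for t in range(n) if t != k]
        p1, p2 = rng.choice(others, size=2, replace=False)
        # bonus: (signed) agreement with the label on the same sample;
        # penalty: (signed) agreement on a randomly mismatched sample/label pair
        total += sgn_delta[preds[k], ys[k]] - sgn_delta[preds[p1], ys[p2]]
    return total
```

When $Sgn(\Delta)$ is the identity matrix, the bonus term reduces to an indicator of agreement with the ground truth label.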

We reproduce the incentive guarantees and required conditions:

Theorem 2.

The CA mechanism induces truthful reporting of a hypothesis at a Bayesian Nash Equilibrium.

Knowledge requirement of $Sgn(\Delta)$

We’d like to note that knowing the sign matrix $Sgn(\Delta)$ between $h^*_i$ and the ground truth labels is a relatively weak requirement for running the mechanism. For example, for a binary classification task ($\mathcal{Y} = \{-1, +1\}$), define the following class-conditional accuracy measures:
$$a_{+1} := P\big(h^*_i(X) = +1 \mid Y = +1\big), \quad a_{-1} := P\big(h^*_i(X) = -1 \mid Y = -1\big).$$

We offer the following:

Lemma 1.

For binary classification ($\mathcal{Y} = \{-1, +1\}$), if $a_{+1} + a_{-1} > 1$, then $Sgn(\Delta)$ is an identity matrix.

The condition in Lemma 1 states that $h^*_i$ is informative about the ground truth label [LC17]. Similar conditions can be derived for the multi-class case to guarantee an identity $Sgn(\Delta)$. With such a simple structure of $Sgn(\Delta)$ identified, the CA mechanism for hypothesis elicitation runs in a rather simple manner.
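As a quick numerical check of Lemma 1, the sketch below computes $\Delta$ from a prior and the two class-conditional accuracies; the parameterization and the helper name binary_delta are illustrative.

```python
import numpy as np

def binary_delta(p_pos, acc_pos, acc_neg):
    """Delta between a binary classifier h and the label Y, parameterized by
    the prior P(Y=+1)=p_pos and the class-conditional accuracies
    acc_pos = P(h=+1 | Y=+1), acc_neg = P(h=-1 | Y=-1)."""
    joint = np.array([
        [p_pos * acc_pos,       (1 - p_pos) * (1 - acc_neg)],   # row: h predicts +1
        [p_pos * (1 - acc_pos), (1 - p_pos) * acc_neg],         # row: h predicts -1
    ])
    marg_h = joint.sum(axis=1, keepdims=True)
    marg_y = joint.sum(axis=0, keepdims=True)
    return joint - marg_h * marg_y

# Accuracies summing to more than 1 (an informative classifier) yield an identity sign matrix.
print((binary_delta(0.4, 0.8, 0.7) > 0).astype(int))   # [[1 0], [0 1]]
```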

When do we reward accuracy

The elegance of the above CA mechanism comes from leveraging the correlation between a classifier and the ground truth label. Ideally, we'd like a mechanism that rewards the accuracy of the contributed classifier. Consider the binary label case:

Theorem 3.

When $P(Y = +1) = P(Y = -1) = 1/2$ (uniform prior), and letting $Sgn(\Delta)$ be the identity matrix, the more accurate classifier within each pair of classifiers receives a higher score.

Note that the above result does not conflict with our incentive claims. In the equal-prior case, misreporting can only reduce the accuracy of a classifier the agent believes to be optimal, not increase it. It remains an interesting question to understand a more generic set of conditions under which CA is able to incentivize contributions of more accurate classifiers.

A market implementation

The above scoring mechanism leads to a market implementation [hanson2007logarithmic] that incentivizes improving classifiers. In particular, suppose agents arrive and participate at discrete time steps $t = 1, 2, ..., T$. Denote the hypothesis contributed at time step $t$ as $h^*_t$ (and its report as $\tilde{h}_t$). The agent arriving at time $t$ will be paid according to $S(\tilde{h}_t, \cdot) - S(\tilde{h}_{t-1}, \cdot)$, where $S$ is an incentive-compatible scoring function that elicits $h^*_t$ truthfully using $D$. The incentive compatibility of the market payment is immediate, since the subtracted term $S(\tilde{h}_{t-1}, \cdot)$ does not depend on the time-$t$ report. The above market implementation incentivizes improving classifiers with a bounded budget (telescoping returns $\sum_t \big(S(\tilde{h}_t, \cdot) - S(\tilde{h}_{t-1}, \cdot)\big) = S(\tilde{h}_T, \cdot) - S(\tilde{h}_0, \cdot)$).
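A minimal sketch of the market-style payment, assuming an arbitrary incentive-compatible score_fn; the telescoping-budget property appears in the comments.

```python
def market_payments(reported_hypotheses, score_fn, h0):
    """Market-style implementation: the agent arriving at time t is paid the
    improvement of their reported hypothesis' score over the previous report.
    The total budget telescopes to score_fn(h_T) - score_fn(h0), hence bounded."""
    payments = []
    prev_score = score_fn(h0)              # score of the initial/default hypothesis
    for h in reported_hypotheses:          # reports arrive at t = 1, 2, ..., T
        cur_score = score_fn(h)
        payments.append(cur_score - prev_score)
        prev_score = cur_score
    return payments
```

In a full implementation, score_fn would be one of the incentive-compatible scores above evaluated on the designer's dataset $D$.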

Calibrated CA scores

When $Sgn(\Delta)$ is the identity matrix, the CA mechanism reduces to:
$$S\big(\tilde{h}_i(x_n), y_n\big) := \mathbb{1}\big(\tilde{h}_i(x_n) = y_n\big) - \mathbb{1}\big(\tilde{h}_i(x_{p_1}) = y_{p_2}\big).$$

That is, the reward structure of CA builds on the 0-1 loss function. We ask: can we extend CA to a calibrated score? We define the following loss-calibrated scoring function for CA:
$$S_\ell\big(\tilde{h}_i(x_n), y_n\big) := -\ell\big(\tilde{h}_i(x_n), y_n\big) + \ell\big(\tilde{h}_i(x_{p_1}), y_{p_2}\big).$$

Here, again, we negate the loss to make it a reward (the agent will seek to maximize it instead of minimizing it). If this extension is possible, not only will we be able to include more scoring functions, but we will also be able to score/verify non-binary classifiers directly. Due to space limits, we provide positive answers and detailed results in the Appendix, and we present empirical results on the calibrated scores of CA in Section 5.
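A sketch of the two reward structures used later in Section 5 (the 0-1 and CE scores), assuming the reported model outputs class probabilities and the reference labels come either from the ground truth or from a peer. The random task pairing is simplified to permutations and all names are illustrative.

```python
import numpy as np

def zero_one_ca_score(probs, y_ref, rng=None):
    """0-1 CA score with identity Sgn(Delta): agreement with the reference label
    minus agreement on randomly mismatched (sample, reference) pairs."""
    rng = rng or np.random.default_rng(0)
    preds = probs.argmax(axis=1)
    n = len(y_ref)
    perm1, perm2 = rng.permutation(n), rng.permutation(n)   # stand-ins for the two random tasks
    return np.mean((preds == y_ref).astype(float) - (preds[perm1] == y_ref[perm2]).astype(float))

def ce_ca_score(probs, y_ref, rng=None, eps=1e-12):
    """Cross-entropy (CE) calibrated CA score: negated CE on matched pairs plus
    CE on mismatched pairs, so that higher scores are better."""
    rng = rng or np.random.default_rng(0)
    n = len(y_ref)
    perm1, perm2 = rng.permutation(n), rng.permutation(n)
    ce_matched = -np.log(probs[np.arange(n), y_ref] + eps)
    ce_mismatched = -np.log(probs[perm1, y_ref[perm2]] + eps)
    return np.mean(-ce_matched + ce_mismatched)
```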

4 Elicitation without verification

Now we move on to a more challenging setting where we do not have ground truth labels to verify the accuracy, or the informativeness, of $\tilde{h}_i$; i.e., the mechanism designer only has access to $D = \{x_n\}_{n=1}^N$. The main idea of our solution in this section follows straightforwardly from the previous section, but instead of having a ground truth agent $h^*$, for each classifier $\tilde{h}_i$ we only have a reference agent $j \neq i$, drawn from the rest of the agents, to score against. The corresponding scoring rule takes the form $S\big(\tilde{h}_i, \tilde{h}_j, \{x_n\}_{n=1}^N\big)$, and similarly the goal is to achieve the following:
$$\mathbb{E}\big[S\big(h^*_i, \tilde{h}_j, \{x_n\}_{n=1}^N\big)\big] \geq \mathbb{E}\big[S\big(\tilde{h}_i, \tilde{h}_j, \{x_n\}_{n=1}^N\big)\big], \quad \forall \tilde{h}_i.$$

As argued before, if we treat $h^*_i$ and $h^*_j$ as two agents holding private information, a properly defined peer prediction scoring function that elicits $h^*_i(x_n)$ using $h^*_j(x_n)$ will suffice to elicit $h^*_i$ using $h^*_j$. Again, we will focus on Correlated Agreement as a running example. Recall that the mechanism builds on a correlation matrix $\Delta$, now defined between $h^*_i$ and $h^*_j$.

The mechanism then operates as follows: for each task $x_n$, randomly sample two other tasks $x_{p_1}, x_{p_2}$. Then pay a reported hypothesis $\tilde{h}_i$ according to
$$S\big(\tilde{h}_i(x_n), \tilde{h}_j(x_n)\big) := Sgn(\Delta)\big(\tilde{h}_i(x_n), \tilde{h}_j(x_n)\big) - Sgn(\Delta)\big(\tilde{h}_i(x_{p_1}), \tilde{h}_j(x_{p_2})\big). \quad (2)$$
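A minimal sketch of Eq. (2), assuming both agents' predictions on the same unlabeled samples are available as integer arrays; the function name is hypothetical.

```python
import numpy as np

def ca_payment_without_verification(preds_i, preds_j, sgn_delta, rng=None):
    """Eq. (2): score agent i's predictions against a peer agent j's predictions
    on the same unlabeled samples, with the usual CA bonus/penalty structure."""
    rng = rng or np.random.default_rng(0)
    n = len(preds_i)
    total = 0.0
    for k in range(n):
        others = [t for t in range(n) if t != k]
        p1, p2 = rng.choice(others, size=2, replace=False)
        total += sgn_delta[preds_i[k], preds_j[k]] - sgn_delta[preds_i[p1], preds_j[p2]]
    return total
```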

We reproduce the incentive guarantees:

Theorem 4.

The CA mechanism with a peer reference hypothesis induces truthful reporting of a hypothesis at a Bayesian Nash Equilibrium.

The proof is similar to the proof of Theorem 2, so we do not repeat the details in the Appendix.

To enable a clean presentation of the analysis, the rest of this section focuses on applying CA to the binary case $\mathcal{Y} = \{-1, +1\}$. First, as an extension of Lemma 1, we have:

Lemma 2.

If $h^*_i$ and $h^*_j$ are conditionally independent given $Y$, and both are informative about the ground truth label (their class-conditional accuracies each sum to more than 1), then $Sgn(\Delta)$ is an identity matrix.

When do we reward accuracy

As mentioned earlier, in general peer prediction mechanisms do not incentivize accuracy. Nonetheless, we provide conditions under which they do. The result below holds for binary classification.

Theorem 5.

When (i) $P(Y = +1) = P(Y = -1) = 1/2$, (ii) $Sgn(\Delta)$ is the identity matrix, and (iii) $\tilde{h}_i$ and $\tilde{h}_j$ are conditionally independent given $Y$, the more accurate classifier within each pair receives a higher score in expectation.

4.1 Peer Prediction market

Implementing the above peer prediction mechanism in a market setting is hard, again due to the lack of ground truth verification. Using reference answers collected from other peers to close the market creates incentives for further manipulation.

Our first attempt is to crowdsource an independent survey answer and use it to close the market. Denote the survey hypothesis as $\tilde{h}_s$ and pay the agent arriving at time $t$:
$$S\big(\tilde{h}_t, \tilde{h}_s, \cdot\big) - S\big(\tilde{h}_{t-1}, \tilde{h}_s, \cdot\big). \quad (3)$$
Theorem 6.

When the survey hypothesis is (i) conditionally independent of the market contributions, and (ii) Bayesian informative, then closing the market using the crowdsourced survey hypothesis is incentive compatible.

The above mechanism is manipulable in several respects. In particular, the crowdsourcing process needs to be independent of the market, which implies that the survey participants need to stay out of the market, but it is unclear whether this will be the case. In the Appendix we show that by maintaining a survey process that elicits hypotheses, we can further improve the robustness of our mechanisms against agents performing a joint manipulation on both the surveys and the markets.

Remark

Before we conclude this section, we remark that the above solution for the without-verification setting also points to a hybrid solution when the designer has access to both samples with and without ground truth labels. The introduction of the pure peer assessment component helps reduce the variance of the payments.

4.2 Robust elicitation

Running a peer prediction mechanism with verification coming only from peer agents is vulnerable to collusion. In this section we answer the question of how robust our mechanisms are when facing an $\alpha$-fraction of adversaries in the participating population. To instantiate our discussion, consider the following setting:

  • There is a $1-\alpha$ fraction of agents who will act truthfully if incentivized properly. Denote the classifier randomly drawn from this population as the reference classifier.

  • There is an $\alpha$ fraction of agents who are adversarial; their reported hypotheses can be arbitrary and purely adversarial.

Denote the class-conditional error rates of the reference classifier used for elicitation as $e_{+1}, e_{-1}$, and those of the Bayes optimal classifier as $e^*_{+1}, e^*_{-1}$. We prove the following:

Theorem 7.

The CA mechanism is truthful in eliciting a hypothesis when facing an $\alpha$-fraction of adversaries, provided the following condition is satisfied:

When the agent believes that the classifier the crowd holds is as accurate as the Bayes optimal classifier (so the two sets of error rates coincide), a sufficient condition for eliciting truthful reporting is $\alpha < 1/2$; that is, our mechanism is robust to up to half of the population manipulating. Clearly, the more accurate the reference classifier is, the more robust our mechanism is.

5 Experiments

In this section, we implement two reward structures for CA: the 0-1 score and the Cross-Entropy (CE) score mentioned at the end of Section 3.2. We experiment on two image classification tasks: MNIST [mnist] and CIFAR-10 [cifar]. For the first agent (the weak agent), we choose LeNet [mnist] and ResNet34 [resnet] for MNIST and CIFAR-10, respectively. For the second agent (the strong agent), we use a 13-layer CNN architecture for both datasets.

Each agent is trained on 25,000 randomly sampled images from the corresponding training set. After training, the first agent reaches 99.37% and 62.46% test accuracy when truthfully reporting its predictions on the MNIST and CIFAR-10 test data, respectively; the second agent reaches 99.74% and 76.89% test accuracy when its predictions on the MNIST and CIFAR-10 test data are truthfully reported.

Both agents receive hypothesis scores based on the test data (10,000 test images) of MNIST or CIFAR-10. For elicitation with verification, we use the ground truth labels to calculate the hypothesis score. For elicitation without verification, we replace the ground truth labels with the other agent's predictions: each agent serves as the other's peer reference hypothesis.

5.1 Results

Statistically, an agent's misreported hypothesis can be expressed by a misreport transition matrix $T$. Each element $T_{k,l}$ represents the probability of flipping the truthfully reported label $k$ to the misreported label $l$. Randomly flipping predictions degrades the quality of a classifier. When there is no adversarial attack, we focus on two kinds of misreport transition matrices: a uniform matrix and a sparse matrix. For the uniform matrix, we assume the probability of flipping from a given class into each other class is the same: $T_{k,l} = \epsilon$ for all $l \neq k$. $\epsilon$ increases gradually from 0 to 0.056 over 10 steps, which results in a 0%–50% misreport rate. The sparse matrix focuses on 5 particular pairs of classes that are easily mistaken for each other. Denoting the corresponding transition matrix elements for a class pair $(k, l)$ as $T_{k,l}$ and $T_{l,k}$, we assume that $T_{k,l} = T_{l,k} = \epsilon$. $\epsilon$ increases gradually from 0 to 0.5 over 10 steps, which results in a 0%–50% misreport rate.
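For concreteness, the following sketch constructs the two misreport transition matrices and applies them to an agent's predictions; the epsilon values and the particular confusing class pairs in the usage comment are illustrative assumptions.

```python
import numpy as np

def uniform_transition(num_classes, eps):
    """Uniform misreport matrix: each truthful prediction flips to every other
    class with probability eps, so the total misreport rate is (C-1)*eps."""
    T = np.full((num_classes, num_classes), eps)
    np.fill_diagonal(T, 1.0 - (num_classes - 1) * eps)
    return T

def sparse_transition(num_classes, eps, confusing_pairs):
    """Sparse misreport matrix: only the listed easily-confused class pairs flip
    into each other, each direction with probability eps."""
    T = np.eye(num_classes)
    for k, l in confusing_pairs:
        T[k, k] -= eps; T[k, l] += eps
        T[l, l] -= eps; T[l, k] += eps
    return T

def apply_misreport(preds, T, rng=None):
    """Replace each truthful prediction k with a label sampled from row T[k]."""
    rng = rng or np.random.default_rng(0)
    return np.array([rng.choice(len(T), p=T[k]) for k in preds])

# Example with hypothetical confusing pairs on a 10-class task:
# reports = apply_misreport(preds, sparse_transition(10, 0.2, [(3, 5), (4, 9)]))
```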

Every setting is simulated 5 times. The line in each figure shows the median score of the 5 runs, together with the corresponding “deviation interval", i.e., the maximum absolute score deviation. The y-axis denotes the average score over all test images.

As shown in Figures 1 and 2, in most situations the 0-1 score and the CE score of both agents keep decreasing as the misreport rate increases. As for the 0-1 score without ground truth verification, the score of either agent begins to fluctuate more when the misreport rate in the sparse misreport model is large. Our results show that both the 0-1 score and the CE score induce truthful reporting of a hypothesis and penalize misreporting agents, whether or not there is ground truth for verification.

Figure 1: Hypothesis scores versus misreport rate on the MNIST dataset. Panels (a)-(b): 0-1 score for the two agents; panels (c)-(d): CE score for the two agents.

Figure 2: Hypothesis scores versus misreport rate on the CIFAR-10 dataset. Panels (a)-(b): 0-1 score for the two agents; panels (c)-(d): CE score for the two agents.

5.2 Elicitation with adversarial attack

We test the robustness of our mechanism when facing a 0.3-fraction of adversaries in the participating population. We introduce an adversarial agent that uses LinfPGDAttack, implemented in AdverTorch [ding2019advertorch], to influence the labels used for verification when there is no ground truth. As shown in Figure 3, both the 0-1 score and the CE score induce truthful reporting of a hypothesis on MNIST.

However, for CIFAR-10, as the misreport rate increases, the decreasing tendency fluctuates more. Two factors contribute to this phenomenon: the agents' abilities and the quality of the generated "ground truth" labels. When the misreport rate is large and the generated labels are of low quality, the probability of a misreported label matching an incorrect generated label can be much higher than usual. But in general, these two scoring structures incentivize agents to truthfully report their results.
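For reference, the sketch below shows one plausible way to generate adversarially influenced reference labels with AdverTorch's LinfPGDAttack; the hyperparameters and the surrounding pipeline are assumptions, not the paper's exact setup.

```python
import torch.nn as nn
from advertorch.attacks import LinfPGDAttack

def adversarial_reference_labels(model, x_batch, y_batch):
    """Perturb inputs with an L_inf PGD attack and use the attacked model's
    predictions as a corrupted reference ("ground truth") labeling."""
    adversary = LinfPGDAttack(
        model, loss_fn=nn.CrossEntropyLoss(reduction="sum"),
        eps=0.3, nb_iter=40, eps_iter=0.01,          # illustrative hyperparameters
        rand_init=True, clip_min=0.0, clip_max=1.0, targeted=False)
    x_adv = adversary.perturb(x_batch, y_batch)
    return model(x_adv).argmax(dim=1)
```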

Figure 3: Hypothesis scores versus misreport rate (with adversarial attack). Panels: (a) 0-1 score, MNIST; (b) CE score, MNIST; (c) 0-1 score, CIFAR; (d) CE score, CIFAR.

6 Concluding remarks

This paper provides an elicitation framework to incentivize the contribution of truthful hypotheses in federated learning. We have offered a scoring-rule-based solution template which we call hypothesis elicitation. We establish the incentive properties of the proposed scoring mechanisms and test their performance extensively on real-world datasets. We have also looked into the accuracy and robustness of the scoring rules, as well as market approaches for implementing them.

References

Appendix

A. Proofs

Proof for Theorem 1

Definition 2.

[bartlett2006convexity] A classification-calibrated loss function $\ell$ is defined as follows: there exists a non-decreasing convex function $\psi$ with $\psi(0) = 0$ that satisfies
$$\psi\big(R(h) - R^*\big) \leq R_\ell(h) - R^*_\ell, \quad \forall h,$$
where $R$ and $R^*$ denote the 0-1 risk and its minimum, and $R_\ell$ and $R^*_\ell$ the $\ell$-risk and its minimum (defined below).

Proof.

Denote the 0-1 risk of a classifier $h$ as $R(h)$ and its minimum (Bayes) risk as $R^*$. The classifier's $\ell$-risk is defined as $R_\ell(h) := \mathbb{E}\big[\ell(h(X), Y)\big]$, with its minimum value $R^*_\ell$.

We prove by contradiction. Suppose that reporting a hypothesis $\tilde{h} \neq h^*$ returns a higher payoff (i.e., a smaller $\ell$-risk) than reporting $h^*$ under the score $-\ell$. Then

which is a contradiction. In the above, equality (1) is by definition, (2) is due to the calibration condition, and (3) is due to the definition of $R^*_\ell$.

Proof for Lemma 1

Proof.

The proof builds essentially on the law of total probability (note that the "optimal" agent's observations and $Y$ are the same).

Now consider the first row of $\Delta$, e.g., $\Delta_{+1,+1}$:
$$\Delta_{+1,+1} = P\big(h^*_i(X) = +1, Y = +1\big) - P\big(h^*_i(X) = +1\big)P\big(Y = +1\big) = P(Y=+1)P(Y=-1)\Big(P\big(h^*_i(X)=+1 \mid Y=+1\big) - P\big(h^*_i(X)=+1 \mid Y=-1\big)\Big),$$
which is positive under the informativeness condition of Lemma 1; similarly, $\Delta_{+1,-1} < 0$.

The second row of $\Delta$, involving $h^*_i(X) = -1$, can be argued symmetrically. ∎

Proof for Theorem 2

Proof.

Note that the total payment is the sum of the per-sample scores (step 3 of Algorithm 1):
$$\mathbb{E}\Big[\sum_{n=1}^N S\big(\tilde{h}_i(x_n), y_n\big)\Big] = \sum_{n=1}^N \mathbb{E}\big[S\big(\tilde{h}_i(x_n), y_n\big)\big].$$
Therefore we can focus on the expected score of an individual sample $x_n$; incentive compatibility will then hold for the sum.

The proof below is a rework of the one presented in [shnayder2016informed]:

(replacing with due to iid assumption)

Note that truthful reporting returns the following expected payment:

(Only the corresponding survives the 3rd summation)

Because we conclude that for any other reporting strategy:

completing the proof. ∎

Proof for Theorem 3

Proof.

For any classifier $\tilde{h}$, the expected score is

(independence between and )
(equal prior)

The last equality indicates that the higher the accuracy, the higher the expected score given to the agent, completing the proof. ∎

Calibrated CA Scores

We start by extending the definition of calibration to CA:

Definition 3.

We call $S_\ell$ calibrated w.r.t. the original CA scoring function if the following condition holds:

(4)

Since the original CA score induces $h^*_i$ as its maximizer, if $S_\ell$ satisfies the calibration property analogous to Definition 2, we can similarly establish the incentive property of $S_\ell$:

Theorem 8.

If, for a certain loss function $\ell$, the scoring function $S_\ell$ satisfies the calibration condition, then $S_\ell$ induces truthful reporting of $h^*_i$.

The proof of the above theorem repeats the one for Theorem 1; we therefore omit the details.

A sufficient condition for CA calibration was studied in [liu2020peerloss]:

Theorem 9 (Theorem 6 of [liu2020peerloss]).

Under the following conditions, is calibrated if , and satisfies the following:

Proof for Lemma 2

Proof.

Denote by $\Delta^*$ the correlation matrix defined between each classifier and the ground truth label ($Y$). The conditional independence assumption then further derives

(by conditional independence)