meta-XB
Code for the paper "Few-Shot Calibration of Set Predictors via Meta-Learned Cross-Validation-Based Conformal Prediction"
Conventional frequentist learning is known to yield poorly calibrated models that fail to reliably quantify the uncertainty of their decisions. Bayesian learning can improve calibration, but formal guarantees apply only under restrictive assumptions about correct model specification. Conformal prediction (CP) offers a general framework for the design of set predictors with calibration guarantees that hold regardless of the underlying data generation mechanism. However, when training data are limited, CP tends to produce large, and hence uninformative, predicted sets. This paper introduces a novel meta-learning solution that aims at reducing the set prediction size. Unlike prior work, the proposed meta-learning scheme, referred to as meta-XB, (i) builds on cross-validation-based CP, rather than the less efficient validation-based CP; and (ii) preserves formal per-task calibration guarantees, rather than less stringent task-marginal guarantees. Finally, meta-XB is extended to adaptive nonconformity scores, which are shown empirically to further enhance marginal per-input calibration.
In modern applications of artificial intelligence (AI), calibration is often deemed as important as the standard criterion of (average) accuracy [thelen2022comprehensive]. A well-calibrated model is one that can reliably quantify the uncertainty of its decisions [guo2017calibration, hermans2021averting]. Information about uncertainty is critical when access to data is limited and AI decisions are to be acted on by human operators, machines, or other algorithms. Recent work on calibration for AI has focused on Bayesian learning, or related ensembling methods, as means to quantify epistemic uncertainty [finn2018probabilistic, yoon2018bayesian, ravi2018amortized, jose2022information]. However, recent studies have shown the limitations of Bayesian learning when the assumed model likelihood or prior distribution is misspecified [masegosa2020learning]. Furthermore, exact Bayesian learning is computationally infeasible, calling for approximations such as Monte Carlo (MC) sampling [robert1999monte] and variational inference (VI) [blundell2015weight]. Overall, under practical conditions, Bayesian learning does not provide formal guarantees of calibration.

Conformal prediction (CP) [vovk2005algorithmic] provides a general framework for the calibration of (frequentist or Bayesian) probabilistic models. The formal calibration guarantees provided by CP hold irrespective of the (unknown) data distribution, as long as the available data samples and the test samples are exchangeable – a weaker requirement than the standard i.i.d. assumption. As illustrated in Fig. 1, CP produces set predictors that output a subset of the output space for each input x, with the property that the set contains the true output value with probability no smaller than a desired level 1 − α, for a miscoverage level α ∈ (0, 1).

Mathematically, for a given learning task τ, assume that we are given a data set D with N samples, i.e., D = {z_i = (x_i, y_i)}_{i=1,…,N}, where the i-th sample z_i contains input x_i and target y_i. CP provides a set predictor Γ_ξ(· | D), specified by a hyperparameter vector ξ, that maps an input x to a subset of the output domain based on the data set D. Calibration amounts to the per-task validity condition
Pr[ y ∈ Γ_ξ(x | D) ] ≥ 1 − α,    (1)
which indicates that the set predictor contains the true target y with probability at least 1 − α. In (1), the probability is taken over the ground-truth, exchangeable, joint distribution of the data set D and the test pair (x, y), and bold letters represent random variables.
The most common form of CP, referred to as validation-based CP (VB-CP), splits the data set into training and validation subsets [vovk2005algorithmic]. The validation subset is used to calibrate the set prediction on a test example for a given desired miscoverage level α in (1). The drawback of this approach is that validation data are not used for training, resulting in inefficient set predictors in the presence of a limited number of data samples. The average size of a set predictor Γ_ξ, referred to as inefficiency, is defined as
ineff(Γ_ξ) = E[ |Γ_ξ(x | D)| ],    (2)
where the average is taken with respect to the ground-truth joint distribution of the data set D and the test input x.
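To make the two metrics concrete, the following sketch estimates the coverage in (1) and the inefficiency in (2) by Monte Carlo simulation; the set predictor and task sampler are hypothetical stand-ins supplied by the caller, not part of any specific CP scheme.

```python
def coverage_and_inefficiency(set_predictor, task_sampler, num_trials=1000):
    """Monte Carlo estimates of the coverage Pr[y in set] in (1) and of the
    average prediction-set size ("inefficiency") in (2)."""
    hits, total_size = 0, 0
    for _ in range(num_trials):
        D, (x, y) = task_sampler()         # exchangeable data set and test pair
        pred_set = set_predictor(x, D)     # predicted subset of the label space
        hits += int(y in pred_set)
        total_size += len(pred_set)
    return hits / num_trials, total_size / num_trials
```

As a sanity check, a predictor that always returns the whole label space is trivially valid (coverage one) but maximally inefficient, which is exactly the degenerate behavior that motivates minimizing (2).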
A more efficient CP set predictor was introduced by [barber2021predictive] based on cross-validation. The cross-validation-based CP (XB-CP) set predictor splits the data set into K folds so as to effectively use the available data for both training and calibration. XB-CP can also satisfy the per-task validity condition (1). (We refer here in particular to the jackknife-mm scheme presented in Section 2.2 of [barber2021predictive].)
Further improvements in efficiency can be obtained via meta-learning [thrun1998lifelong]. Meta-learning jointly processes data from multiple learning tasks, which are assumed to be drawn i.i.d. from a task distribution p(τ). These data are used to optimize the hyperparameter vector ξ of the set predictor to be used on a new task. Specifically, reference [fisch2021few] introduced a meta-learning-based method that modifies VB-CP. The resulting meta-VB algorithm satisfies a looser validity condition with respect to the per-task inequality (1), in which the probability in (1) is no smaller than 1 − α only on average with respect to the task distribution p(τ).
In this paper, we introduce a novel meta-learning approach, termed meta-XB, with the aim of reducing the inefficiency (2) of XB-CP, while preserving, unlike [fisch2021few], the per-task validity condition (1) for every task. Furthermore, we incorporate in the design of meta-XB the adaptive nonconformity (NC) scores introduced in [romano2020classification]. As argued in [romano2020classification] for conventional CP, adaptive NC scores are empirically known to improve the per-task conditional validity condition
Pr[ y ∈ Γ_ξ(x | D) | x ] ≥ 1 − α.    (3)
This condition is significantly stronger than (1), as it holds for any test input x. A summary of the considered CP schemes can be found in Fig. 2.
Overall, the contribution of this work can be summarized as follows:
We incorporate adaptive NC scores [romano2020classification] in the design of meta-XB, demonstrating via experiments that adaptive NC scores can enhance conditional validity as defined by condition (3).
In this section, we describe necessary background material on CP [vovk2005algorithmic, balasubramanian2014conformal], VB-CP [vovk2005algorithmic], XB-CP [barber2021predictive], and adaptive NC scores [romano2020classification].
At a high level, given an input x for some learning task, CP outputs a prediction set that includes all outputs y′ such that the pair (x, y′) conforms well with the examples in the available data set D. We recall from Section 1 that ξ represents a vector of hyperparameters. The key underlying assumption is that the data set D and the test pair (x, y) are realizations of exchangeable random variables.
For any learning task, the data set D and a test data point (x, y) are exchangeable random variables, i.e., their joint distribution is invariant to any permutation of the N + 1 variables z_1, …, z_N, z_{N+1} = (x, y). Mathematically, we have the equality p(z_1, …, z_{N+1}) = p(z_{π(1)}, …, z_{π(N+1)}) for any permutation operator π. Note that the standard assumption of i.i.d. random variables satisfies exchangeability.
CP measures conformity via NC scores, which are generally functions of the hyperparameter vector ξ, and are defined as follows.
(NC score) For a given learning task, given a data set D with N samples, a nonconformity (NC) score is a function NC((x, y) | D, ξ) that maps the data set D and any input-output pair (x, y) to a real number, while satisfying the permutation-invariance property NC((x, y) | D, ξ) = NC((x, y) | π(D), ξ) for any permutation operator π.
A good NC score should express how poorly the point (x, y) “conforms” to the data set D. The most common way to obtain an NC score is via a parametric two-step approach. This involves a training algorithm defined by a conditional distribution p(φ | D, ξ), which describes the output φ of the algorithm as a function of the training data set D and the hyperparameter vector ξ. This distribution may describe the output of a stochastic optimization algorithm, such as stochastic gradient descent (SGD), for frequentist learning, or of a Monte Carlo method for Bayesian learning [guedj2019primer, angelino2016patterns, simeone2022machine]. The hyperparameter vector ξ may determine, e.g., the learning rate schedule or the initialization.

(Conventional two-step NC score) For a learning task, let ℓ(φ, (x, y)) represent the loss of a machine learning model parametrized by the vector φ on an input-output pair (x, y). Given a training algorithm p(φ | D, ξ) that is invariant to permutations of the training set D, a conventional two-step NC score for the input-output pair (x, y) given data set D is defined as

NC((x, y) | D, ξ) = E_{φ ∼ p(φ | D, ξ)}[ ℓ(φ, (x, y)) ].    (4)
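As a minimal, self-contained instance of the two-step construction, the sketch below uses a deterministic nearest-centroid “training algorithm” — trivially permutation-invariant, since class means do not depend on sample order — with the squared distance to the candidate label's centroid as the loss. Both choices are illustrative assumptions, not the models used in the paper.

```python
import numpy as np

def two_step_nc_score(D_tr, x, y):
    """Conventional two-step NC score, sketched: step 1 'trains' a
    nearest-centroid model on D_tr; step 2 returns the loss of that
    model on the candidate pair (x, y)."""
    xs = np.array([xi for xi, _ in D_tr], dtype=float)
    ys = np.array([yi for _, yi in D_tr])
    # "training": one centroid per class, independent of sample order
    centroids = {c: xs[ys == c].mean(axis=0) for c in np.unique(ys)}
    # squared-error loss of the trained model on the pair (x, y)
    return float(np.sum((np.asarray(x, dtype=float) - centroids[y]) ** 2))
```

A pair whose input lies far from the centroid of its claimed label receives a large score, i.e., it conforms poorly with D_tr.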
VB-CP [vovk2005algorithmic] divides the data set D into a training data set D^tr of N^tr samples and a validation data set D^val of N^val samples, with N^tr + N^val = N. The training data set is used to evaluate the NC scores, while the validation data set is leveraged to construct the set predictor, as detailed next.
Given an input x, the prediction set of VB-CP includes all output values y′ whose NC score NC((x, y′) | D^tr, ξ) is smaller than (or equal to) a fraction α (at least) of the NC scores of the validation data points z_i ∈ D^val.
With this definition, the set predictor for VB-CP can be thus expressed as
Γ_ξ(x | D) = { y′ : Σ_{z_i ∈ D^val} 1( NC(z_i | D^tr, ξ) ≥ NC((x, y′) | D^tr, ξ) ) ≥ ⌊α(N^val + 1)⌋ }.    (5)
Intuitively, by exchangeability, the empirical ordering of the NC scores used to define the set (5) ensures the validity condition (1) [vovk2005algorithmic].
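The VB-CP construction can be sketched compactly as follows, under simplifying assumptions: labels are 0, …, num_labels − 1, the first half of D serves as the training split, and `nc_score(D_tr, x, y)` is any permutation-invariant NC score supplied by the caller.

```python
import numpy as np

def vb_cp_set(x_test, D, nc_score, num_labels, alpha=0.1):
    """Validation-based CP sketch: include a candidate label if its NC score
    does not exceed the (1 - alpha) empirical quantile (with the
    finite-sample +1 correction) of the validation NC scores."""
    n_tr = len(D) // 2
    D_tr, D_val = D[:n_tr], D[n_tr:]
    val_scores = np.sort([nc_score(D_tr, x, y) for x, y in D_val])
    k = int(np.ceil((1 - alpha) * (len(D_val) + 1)))
    threshold = val_scores[min(k, len(val_scores)) - 1]
    return [y for y in range(num_labels)
            if nc_score(D_tr, x_test, y) <= threshold]
```

Note that the validation points never contribute to training here, which is exactly the source of inefficiency that cross-validation-based CP addresses.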
In VB-CP, the validation data set is only used to compute the empirical quantile in (5), and is hence not leveraged by the training algorithm p(φ | D^tr, ξ). This generally causes the inefficiency (2) of VB-CP to be large when the number of data points N is small. XB-CP addresses this problem via K-fold cross-validation [barber2021predictive]. K-fold cross-validation partitions the per-task data set D into K disjoint subsets S_1, …, S_K whose union equals D. We define the leave-one-out data set D_{−k} = D \ S_k, which excludes the subset S_k. We also introduce a mapping function k(i) to identify the subset that includes the i-th sample, i.e., z_i ∈ S_{k(i)}.
We focus here on a variant of XB-CP that is referred to as min-max jackknife+ in [barber2021predictive]. This variant has stronger validity guarantees than the jackknife+ scheme also studied in [barber2021predictive]. Accordingly, given a test input x, XB-CP computes the NC score for a candidate pair (x, y′) by taking the minimum NC score over all leave-one-out subsets, i.e., min_{k=1,…,K} NC((x, y′) | D_{−k}, ξ). Furthermore, for each data point z_i, the NC score is evaluated by excluding the subset containing it, as NC(z_i | D_{−k(i)}, ξ). Note that evaluating the resulting NC scores requires running the training algorithm K times, once for each subset. Finally, a candidate y′ is included in the prediction set if the minimum NC score for (x, y′) is smaller than (or equal to) a fraction α (at least) of the NC scores of the data points z_i ∈ D.
Overall, given the data set D and a test input x, K-fold XB-CP produces the set predictor

Γ_ξ(x | D) = { y′ : Σ_{i=1}^{N} 1( NC(z_i | D_{−k(i)}, ξ) ≥ min_{k=1,…,K} NC((x, y′) | D_{−k}, ξ) ) ≥ ⌊α(N + 1)⌋ },    (6)

where 1(·) is the indicator function (1(true) = 1 and 1(false) = 0).
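The min-max set construction just described can be sketched as follows; fold assignment is round-robin, and `nc_score(D_minus_k, x, y)` plays the role of the leave-fold-out NC score, with the “training” buried inside the caller-supplied score function (an assumption of this sketch).

```python
import numpy as np

def xb_cp_set(x_test, D, nc_score, num_labels, K, alpha=0.1):
    """Min-max K-fold XB-CP sketch: each data point is scored with its own
    fold held out, the candidate pair is scored by the minimum over all
    leave-fold-out sets, and a label is kept if enough data points have
    NC scores at least as large."""
    n = len(D)
    folds = [list(range(k, n, K)) for k in range(K)]          # disjoint folds
    loo = [[D[i] for i in range(n) if i % K != k] for k in range(K)]
    point_scores = np.array([nc_score(loo[k], *D[i])
                             for k in range(K) for i in folds[k]])
    required = int(np.floor(alpha * (n + 1)))                 # count threshold
    kept = []
    for y in range(num_labels):
        min_nc = min(nc_score(loo[k], x_test, y) for k in range(K))
        if np.sum(point_scores >= min_nc) >= required:
            kept.append(y)
    return kept
```

When ⌊α(N + 1)⌋ = 0, every label is kept: the guarantee (1) then holds trivially, at the price of an uninformative set — the small-N regime that motivates the meta-learning approach of this paper.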
The CP methods reviewed so far achieve the per-task validity condition (1). In contrast, the per-input conditional validity (3) is only attainable with strong additional assumptions on the joint distribution [vovk2012conditional, lei2014distribution]. However, the adaptive NC score introduced by [romano2020classification] is known to empirically improve the per-input conditional validity of VB-CP (5) and XB-CP (6).
In this subsection, we assume that a model class of probabilistic predictors p(y | x, φ) is available, e.g., a neural network with a softmax activation in the last layer. To gain insight into the definition of adaptive NC scores, let us assume for the sake of argument that the ground-truth conditional distribution p(y | x) is known. The most efficient (deterministic) set predictor satisfying the conditional coverage condition (3) would then be obtained as the smallest-cardinality subset of target values that satisfies the conditional coverage condition (3), i.e.,

Γ*(x) = argmin_{Γ} |Γ|  subject to  Σ_{y ∈ Γ} p(y | x) ≥ 1 − α.    (7)
Note that the set (7) can be obtained by adding values y to the set predictor Γ*(x) in order from largest to smallest value of p(y | x) until the constraint in (7) is satisfied.
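The greedy procedure just described can be sketched directly; here `probs` stands in for the (assumed known) ground-truth conditional distribution p(y | x) over a finite label set.

```python
import numpy as np

def oracle_set(probs, alpha=0.1):
    """Smallest prediction set satisfying the conditional coverage in (7):
    add labels from most to least probable until the accumulated
    probability mass reaches 1 - alpha."""
    order = np.argsort(-np.asarray(probs, dtype=float))
    pred_set, mass = [], 0.0
    for y in order:
        pred_set.append(int(y))
        mass += probs[y]
        if mass >= 1 - alpha:
            break
    return pred_set
```

For a peaked distribution the set is small, while for a near-uniform one it approaches the whole label space, mirroring how the oracle adapts its size to the input.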
In practice, the conditional distribution p(y | x) is estimated via the model p(y | x, φ), where the parameter vector φ is produced by a training algorithm p(φ | D, ξ) applied to some training data set D. This yields the naïve set predictor

Γ^naive_ξ(x | D) = smallest-cardinality subset Γ such that Σ_{y ∈ Γ} p(y | x, D, ξ) ≥ 1 − α,    (8)

where we have used for generality the ensemble predictor p(y | x, D, ξ) = E_{φ ∼ p(φ | D, ξ)}[ p(y | x, φ) ], obtained by averaging over the output of the training algorithm. Unless the likelihood model is perfectly calibrated, i.e., unless the equality p(y | x, D, ξ) = p(y | x) holds, there is no guarantee that the set predictor in (8) satisfies the conditional coverage condition (3) or the marginal coverage condition (1).
To tackle this problem, [romano2020classification] proposed to apply VB-CP or XB-CP with a modified NC score inspired by the naïve prediction (8).
(Adaptive NC score) For a learning task, given a training algorithm p(φ | D, ξ) that is invariant to permutations of the training set D, the adaptive NC score for an input-output pair (x, y) given data set D is defined as
NC((x, y) | D, ξ) = Σ_{y′} p(y′ | x, D, ξ) · 1( p(y′ | x, D, ξ) ≥ p(y | x, D, ξ) ).    (9)
Intuitively, if the adaptive NC score is large, the pair (x, y) does not conform well with the probabilistic model obtained by training on the set D. The adaptive NC score satisfies the condition in Definition 1, and hence, by Theorems 1 and 2, the set predictors (5) and (6) for VB-CP and XB-CP, respectively, are both valid when the adaptive NC score is used. Furthermore, [romano2020classification] demonstrated improved conditional empirical coverage as compared to the conventional two-step NC score in Definition 2. This may be seen as a consequence of the conditional validity of the naïve predictor (8) under the assumption of a well-calibrated model.
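For a finite label set, the score in Definition 4 reduces to the total mass of all labels the model deems at least as likely as the candidate. The sketch below implements this deterministic version (the original scheme of [romano2020classification] additionally randomizes to break ties, which is omitted here); `probs` is the model's predictive distribution over labels.

```python
import numpy as np

def adaptive_nc_score(probs, y):
    """Adaptive NC score sketch: total model probability of the labels
    ranked at least as likely as the candidate label y."""
    probs = np.asarray(probs, dtype=float)
    return float(probs[probs >= probs[y]].sum())
```

For the most likely label the score equals its own probability, while unlikely labels accumulate almost the whole unit mass, so thresholding this score reproduces the “top mass” sets of the naïve predictor (8).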
In this section, we introduce the proposed meta-XB algorithm. We start by describing the meta-learning framework.
Up to now, we have focused on a single task τ. Meta-learning utilizes data from multiple tasks to enhance the efficiency of the learning procedure for new tasks. Following the standard meta-learning formulation [baxter2000model, amit2018meta], as anticipated in Section 1, the learning environment is characterized by a task distribution p(τ) over the task identifier τ. Given T meta-training task realizations τ_1, …, τ_T drawn i.i.d. from the task distribution p(τ), the meta-training data set consists of realizations of a data set D_t and a test sample for each task τ_t. These pairs are generated i.i.d. from the per-task joint distribution, satisfying Assumption 1 for all tasks.
The goal of meta-learning for CP is to optimize the vector of hyperparameters ξ based on the meta-training data, so as to obtain a more efficient set predictor. While reference [fisch2021few] proposed a meta-learning solution for VB-CP [vovk2005algorithmic], here we introduce a meta-learning method for XB-CP.
Meta-XB aims at finding a hyperparameter vector ξ that minimizes the average size of the prediction set (6) for tasks that follow the distribution p(τ). To this end, it addresses the problem of minimizing, over the hyperparameter vector ξ, the empirical average of the sizes of the prediction sets across the meta-training tasks. This amounts to the optimization
ξ* = argmin_ξ Σ_{t=1}^{T} Σ_{(x, y)} |Γ_ξ(x | D_t)|,    (11)
where the first sum is over the meta-training tasks and the second is over the available data for each task. By (6), the size of the prediction set is not a differentiable function of the hyperparameter vector ξ. Therefore, in order to address (11) via gradient descent, we introduce a differentiable soft inefficiency criterion by replacing the indicator function 1(·) with the sigmoid σ_c(u) = 1/(1 + exp(−c·u)) for some c > 0; the empirical quantile with a differentiable soft empirical quantile; and the minimum operator with the softmin function [goodfellow2016deep].
For an input set a = {a_1, …, a_n}, the softmin function is defined as [goodfellow2016deep, Section 6.2.2.3]

softmin(a) = Σ_{i=1}^{n} a_i exp(−c_m a_i) / Σ_{j=1}^{n} exp(−c_m a_j)    (12)

for some c_m > 0. Finally, given an input set a, the soft empirical (1 − α)-quantile is defined as
Q̃_α(a) ≈ argmin_q Σ_{i=1}^{n} ℓ_α(a_i − q)    (13)

for some smoothing parameter c_Q > 0 controlling the differentiable approximation of the minimization, where we have used the pinball loss [koenker1978regression]

ℓ_α(u) = max( (1 − α) u, −α u ).    (14)
With these definitions, the soft inefficiency metric is derived from (6) as follows (see details in Appendix A-B).
Given a data set D and a test input x, the soft inefficiency for the K-fold XB-CP predictor (6) is defined as
|Γ̃_ξ(x | D)| = Σ_{y′} σ_c( Q̃_α( { NC(z_i | D_{−k(i)}, ξ) }_{i=1}^{N} ) − softmin_{k=1,…,K} NC((x, y′) | D_{−k}, ξ) ),    (15)

where σ_c(·) is the sigmoid function, and Q̃_α(·) and softmin(·) are the soft quantile (13) and the softmin function (12), respectively.
The approximation parameters c, c_Q, and c_m dictate the trade-off between smoothness and accuracy of the approximation with respect to the true inefficiency (2): as these parameters grow large, the approximation becomes increasingly accurate, but the resulting function is increasingly less smooth (see Fig. 3 for an illustration of the accuracy of the soft quantile).
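The three soft surrogates can be sketched as below; the exact smoothing used in the paper's appendix may differ, so the soft quantile here is simply the pinball-loss minimizer searched over a candidate grid — a hedged stand-in for the fully differentiable version.

```python
import numpy as np

def sigmoid(u, c=1.0):
    """Soft surrogate for the indicator 1(u >= 0)."""
    return 1.0 / (1.0 + np.exp(-c * u))

def softmin(a, c_m=1.0):
    """Softmin as in (12): average of the entries weighted by softmax(-c_m a)."""
    a = np.asarray(a, dtype=float)
    w = np.exp(-c_m * (a - a.min()))       # shift for numerical stability
    return float((w * a).sum() / w.sum())

def soft_quantile(a, alpha):
    """(1 - alpha) empirical quantile as a minimizer of the summed pinball
    loss (13)-(14), searched over the sample values themselves."""
    a = np.asarray(a, dtype=float)
    pinball = lambda u: np.maximum((1 - alpha) * u, -alpha * u)
    losses = [float(pinball(a - q).sum()) for q in a]
    return float(a[int(np.argmin(losses))])
```

As the slope c and the temperature c_m grow, the sigmoid approaches the indicator and the softmin approaches the minimum, recovering the hard set size in (6).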
Replacing the soft inefficiency (15) into problem (11) yields a differentiable program when conventional two-step NC scores (Definition 2) are used. We address the corresponding problem via stochastic gradient descent (SGD), whereby at each iteration a batch of tasks and examples per task are sampled. The overall meta-learning procedure is summarized in Algorithm 1.
Adaptive NC scores are not differentiable. Therefore, in order to enable the optimization of problem (11) with the soft inefficiency (15), we propose to replace the indicator function in (10) with the sigmoid function. We have also found that approximating the number of outputs that satisfy (10), rather than applying the sigmoid function directly, empirically improves per-input coverage performance. This yields the soft adaptive NC score, which is detailed in Appendix B. With the soft adaptive NC score, meta-XB is then applied as in Algorithm 1.

As mentioned in Section 1, existing meta-learning schemes for CP cannot achieve the per-task validity condition in (1), instead requiring an additional marginalization over the task distribution p(τ) [fisch2021few] or providing looser validity guarantees in the form of probably approximately correct (PAC) bounds [park2022pac]. In contrast, meta-XB has the following property.
Theorem 3 is a direct consequence of Theorem 2, since meta-XB maintains the permutation-invariance of the training algorithm as required by Definition 2.
Σ_{t=1}^{T} Σ_{(x, y)} |Γ̃_ξ(x | D_t)|.    (16)
Bayesian learning and model misspecification. When the model is misspecified, i.e., when the assumed model likelihood or prior distribution cannot express the ground-truth data-generating distribution [masegosa2020learning], Bayesian learning may yield poor generalization performance [masegosa2020learning, morningstar2022pacm, wenzel2020good]. Downweighting the prior distribution and/or the likelihood, as done in generalized Bayesian learning [knoblauch2019generalized, simeone2022machine] or in “cold” posteriors [wenzel2020good], improves the generalization performance. In order to mitigate model likelihood misspecification, alternative variational free energy metrics were introduced by [masegosa2020learning] via second-order PAC-Bayes bounds, and by [morningstar2022pacm] via multi-sample PAC-Bayes bounds. Misspecification of the prior distribution can also be addressed via Bayesian meta-learning, which optimizes the prior from data in a manner similar to empirical Bayes [mackay2003information].
Bayesian meta-learning. While frequentist meta-learning has shown remarkable success in few-shot learning tasks in terms of accuracy [finn2017model, snell2017prototypical], improvements in terms of calibration can be obtained by Bayesian meta-learning, which optimizes over a hyper-posterior distribution using data from multiple tasks [amit2018meta, finn2018probabilistic, yoon2018bayesian, ravi2018amortized, nguyen2020uncertainty, jose2022information]. The hyper-prior can also be modelled as a stochastic process to avoid the bias caused by parametric models [rothfuss2021meta].

CP-aware loss. [stutz2021learning] and [einbinder2022training] proposed CP-aware loss functions to enhance the efficiency or the per-input validity (3) of VB-CP. The drawback of these solutions is that they require a large number of data samples, unlike the meta-learning methods studied here.

Per-input validity and local validity. As discussed in Section II-D, the per-input validity condition (3) cannot be satisfied without strong assumptions on the joint distribution [vovk2012conditional, lei2014distribution]. Given the importance of adapting the prediction set size to the input so as to capture heteroscedasticity [romano2019conformalized, izbicki2020cd], a looser local validity condition, which conditions on a subset of the input space containing the input of interest, has been considered in [lei2014distribution, foygel2021limits]. Choosing a proper subset becomes problematic, especially in high-dimensional input spaces [izbicki2020cd, leroy2021md], and [tibshirani2019conformal, lin2021locally] proposed to reweight the samples outside the subset by treating the problem as a distribution shift between the data set and the test input.

In this section, we provide experimental results to validate the performance of meta-XB in terms of (i) per-task coverage; (ii) per-task inefficiency (2); (iii) per-task conditional coverage; and (iv) per-task conditional inefficiency. To evaluate input-conditional quantities, we follow the approach in [romano2020classification, Section S1.2]. As benchmark schemes, we consider (i) VB-CP, (ii) XB-CP, and (iii) meta-VB [fisch2021few], with either the conventional NC score (Definition 2 with the log-loss) or the adaptive NC score (Definition 4). Note that meta-VB was described in [fisch2021few] only for the conventional NC score, but the application of the adaptive NC score is direct. For all experiments, unless specified otherwise, we fix the number of examples N in the data set D and the desired miscoverage level α. For the cross-validation-based set predictors XB-CP and meta-XB, we fix the number of folds K. The aforementioned performance measures are estimated by averaging over multiple realizations of the data set and of the test sample for each task, and the reported per-task quantities are computed from multiple meta-test tasks. During meta-training, we assume the availability of i.i.d. examples for each task, from which we sample pairs when computing the inefficiency (16), and we use the Adam optimizer [kingma2014adam] to update the hyperparameter vector ξ via SGD.
Lastly, we set the values of the approximation parameters to one.
Following [romano2020classification], for VB-CP and XB-CP, we adopt a support vector classifier as the training algorithm, as it does not require any tuning of the hyperparameter vector ξ. In contrast, for meta-VB and meta-XB, we adopt a neural network classifier [romano2019conformalized], and set the training algorithm to output the last iterate of a pre-defined number of full-batch gradient descent (GD) steps (unless specified otherwise), with the initialization given by the hyperparameter vector ξ [finn2017model]. Note that using full-batch GD ensures the permutation-invariance of the training algorithm, as required by Definition 2.
All experiments are implemented in PyTorch [paszke2019pytorch] and run on a GPU server with a single NVIDIA A100 card.

We start with the synthetic-data experiment introduced in [romano2020classification], in which the first element of the input takes one of two values with fixed probabilities, while the other elements are i.i.d. standard Gaussian variables. For each task τ, a matrix W is sampled with i.i.d. standard Gaussian entries, and the ground-truth conditional distribution is defined as the categorical distribution

p(y | x, W) = exp(xᵀ w_y) / Σ_{y′} exp(xᵀ w_{y′})    (17)

for each label y, where w_y is the y-th column of the task information matrix W. The neural network classifier consists of two hidden layers with Exponential Linear Unit (ELU) activations [clevert2015fast] and a softmax activation in the last layer.
In Fig. 4, we demonstrate the performance of the considered set predictors as a function of the number of meta-training tasks. Both meta-VB and meta-XB achieve lower inefficiency (2) than the conventional set predictors VB-CP and XB-CP, as soon as the number of meta-training tasks is sufficiently large to ensure successful generalization across tasks [yin2019meta, jose2020informationtheoretic]. For example, with a sufficient number of tasks, meta-XB obtains a significantly smaller average prediction set size than XB-CP. Furthermore, all schemes satisfy the validity condition (1), except for meta-VB, confirming the analytical results. Adaptive NC scores are seen to be instrumental in improving the conditional validity (3) when used with meta-XB, although this comes at the cost of a larger inefficiency.
Next, we investigate the impact of the number of per-task examples N in the data set D, using adaptive NC scores. As shown in Fig. 5, the average size of the set predictors decreases as N grows larger. In the few-examples regime, the meta-learned set predictors meta-VB and meta-XB outperform the conventional set predictors VB-CP and XB-CP in terms of inefficiency. However, when N is large enough, conventional set predictors are preferable, as the transfer of knowledge across tasks becomes unnecessary, and possibly deleterious [amit2018meta] (see also [park2020learning] for related discussions). In terms of conditional coverage, Fig. 5 shows that cross-validation-based CP methods are preferable to validation-based CP approaches.
We now consider the real-world modulation classification example illustrated in Fig. 1, in which the goal is to classify received radio signals depending on the modulation scheme used to generate them [o2016convolutional, o2018over]. The RadioML 2018.01A data set consists of inputs accounting for complex baseband signals sampled over time, generated from different modulation types [o2018over]. Each task amounts to the binary classification of signals from two randomly selected modulation types. Specifically, we divide the modulation types into two disjoint groups, one used to generate meta-training tasks and one used to produce meta-testing tasks, following the standard data generation approach in few-shot classification [lake2011one, ravi2016optimization]. We adopt VGG16 [simonyan2014very] as the neural network classifier, as in [o2018over]. Furthermore, for meta-VB and meta-XB, we apply a single GD step during meta-training and five GD steps during meta-testing [finn2017model, ravi2018amortized].
Fig. 6 shows the per-task coverage and inefficiency for all schemes assuming conventional NC scores. While the conventional set predictors VB-CP and XB-CP produce large, uninformative prediction sets that encompass the entire target space, the meta-learned set predictors meta-VB and meta-XB significantly improve the prediction efficiency. However, meta-VB fails to achieve the per-task validity condition (1), while the proposed meta-XB is valid, as proved by Theorem 3.
Lastly, we consider an image classification problem on the miniImagenet data set [vinyals2016matching], with a fixed number of data points per task and a fixed desired miscoverage level α. We consider binary classification, with tasks defined by randomly selecting two classes of images and drawing training data sets from all examples belonging to the two chosen classes. Conventional NC scores are used, and the neural network classifier is the convolutional neural network (CNN) used in [finn2017model]. For meta-VB and meta-XB, a single GD update step is used during meta-training, while five GD update steps are applied during meta-testing. Fig. 7 shows that meta-learning-based set predictors outperform conventional schemes. Furthermore, meta-VB fails to meet the per-task coverage, in contrast to the proposed meta-XB.

This paper has introduced meta-XB, a meta-learning solution for cross-validation-based conformal prediction that aims at reducing the average prediction set size while formally guaranteeing per-task calibration. The approach is based on the use of soft quantiles, and it integrates adaptive nonconformity scores for improved input-conditional calibration. Through experimental results, including modulation classification [o2016convolutional, o2018over], meta-XB was shown to outperform both conventional conformal prediction solutions and existing meta-learned conformal prediction schemes. Future work may integrate meta-learning with CP-aware training criteria [stutz2021learning, einbinder2022training], or with stochastic set predictors.
The work of S. Park, K. M. Cohen, and O. Simeone was supported by the European Research Council (ERC) under the European Union’s Horizon 2020 Research and Innovation Programme (Grant Agreement No. 725731).
The proof mainly follows [barber2021predictive, Section B.1] and [barber2021predictive, Section B.2.2] with the following changes.
We first extend the result for regression problems in [barber2021predictive] to classification, starting with the case K = N, for which the mapping function is the identity k(i) = i. Unlike [barber2021predictive], which defined a “comparison matrix” of residuals for the regression problem, we consider a more general comparison matrix defined in terms of NC scores that can be applied to both classification and regression problems, in a manner similar to [romano2020classification]. Accordingly, we define the comparison matrix with entries
A_{ij} = 1( NC(z_i | D_{−k(i)}, ξ) > NC(z_j | D_{−k(j)}, ξ) )    (18)
for a fixed hyperparameter vector ξ. The cardinality of the set of “strange” points
S(A) = { i : Σ_{j} A_{ij} ≥ (1 − α)(N + 1) }    (19)
can be bounded as |S(A)| ≤ α(N + 1) [barber2021predictive, romano2020classification]. Therefore, Theorem 2 holds for K = N, since any of the N + 1 points can be a “strange” point with equal probability thanks to Assumption 1.
To address the case K < N, we follow [barber2021predictive, Section B.2.2] by drawing additional test examples that are all assigned to the same fold. This way, the actual test point is equally likely to be in any of the folds. Now, taking the augmented data set that contains all the examples in lieu of D in (18), we can bound the number of “strange” points in the set (19) as
|S(Ā)| ≤ α(Ñ + 1),    (20)

where Ā is the comparison matrix of the augmented data set and Ñ its size.
Finally, by using the same proof technique in [barber2021predictive, Section B.2.2], we have the inequality
Pr[ y ∈ Γ_ξ(x | D) ] ≥ 1 − α̃.    (21)
In Theorem 2, we choose the target miscoverage level so that (21) yields the per-task validity condition (1).
From the definition of the XB-CP set predictor (6), the inefficiency can be obtained as

|Γ_ξ(x | D)| = Σ_{y′} 1( Σ_{i=1}^{N} 1( NC(z_i | D_{−k(i)}, ξ) ≥ min_{k=1,…,K} NC((x, y′) | D_{−k}, ξ) ) ≥ ⌊α(N + 1)⌋ )    (22)

and its soft approximation as

|Γ̃_ξ(x | D)| = Σ_{y′} σ_c( Q̃_α( { NC(z_i | D_{−k(i)}, ξ) }_{i=1}^{N} ) − softmin_{k=1,…,K} NC((x, y′) | D_{−k}, ξ) )    (23)

with the sigmoid σ_c(·), the soft quantile Q̃_α(·), and the softmin function as defined in Section III.