1 Introduction
Current stateoftheart language models perform well on a wide range of challenging questionanswering tasks (Brown et al., 2020; Chowdhery et al., 2022; Hoffmann et al., 2022). They can even outperform the average human on the MMLU benchmark (which consists of examlike questions across 57 categories) and on BIGBench (which consists of 150+ diverse tasks). Yet when models generate longform text, they often produce false statements or “hallucinations” (Lin et al., 2021; Maynez et al., 2020; Shuster et al., 2021). This reduces their value to human users, as users cannot tell when a model is being truthful or not.
The problem of truthfulness motivates calibration for language models (Nguyen and O’Connor, 2015). If models convey calibrated uncertainty about their statements, then users know how much to trust a given statement. This is important for current models (which often hallucinate falsehoods) but also for any model that makes statements where there is no known ground truth (e.g. economic forecasts, open problems in science or mathematics).
Previous work on calibration focuses on the model logprobabilities or “logits” (Guo et al., 2017; Jiang et al., 2021). Yet the logprobabilities of models like GPT3 represent uncertainty over tokens (ways of expressing a claim) and not epistemic uncertainty over claims themselves. If a claim can be paraphrased in many different ways, then each paraphrase may have a low logprobability.^{1}^{1}1Sometimes it’s feasible to sum over the probabilities of all paraphrases of a claim. But if the claim is complex, the space of possible paraphrases will be vast and hard to demarcate. By contrast, when humans express uncertainty, this is epistemic uncertainty about the claim itself.^{2}^{2}2If a human says “I think it’s likely this vaccine will be effective”, they express confidence about the vaccine not the string “vaccine”. In this paper, we finetune models to express epistemic uncertainty using natural language. We call this “verbalized probability”.
The goal of verbalized probability is to express uncertainty in a humanlike way but not to directly mimic human training data. Models should be calibrated about their own uncertainty, which differs from human uncertainty. For example, GPT3 outperforms most humans on a computer security quiz (Hendrycks et al., 2020) but is much worse at arithmetic questions of the form “”. Thus, we expect pretrained models will need to be finetuned to produce calibrated verbalized probabilities.
Training models in verbalized probability is a component of making models “honest” (Evans et al., 2021; Askell et al., 2021a; Christiano, 2021). We define a model as honest if it can communicate everything it represents internally in natural language (and will not misrepresent any internal states). Honesty helps with AI alignment: if an honest model has a misinformed or malign internal state, then it could communicate this state to humans who can act accordingly. Calibration is compatible with a certain kind of dishonesty, because a model could be calibrated by simply imitating a calibrated individual (without having the same “beliefs” as the individual). However, if GPT3 achieves good calibration on diverse questions after finetuning as in Section 3.1, it seems unlikely that it dishonestly misrepresents its confidence.
1.1 Contributions
We introduce a new test suite for calibration. CalibratedMath is a suite of elementary mathematics problems. For each question, a model must produce both a numerical answer and a confidence in its answer (see Figure 1). There are many types of question, which vary substantially in content and in difficulty for GPT3. This allows us to test how calibration generalizes under distribution shifts (by shifting the question type) and makes for a challenging test (see Figure 3). Since GPT3’s math abilities differ greatly from humans, GPT3 cannot simply imitate human expressions of uncertainty.
GPT3 can learn to express calibrated uncertainty using words (“verbalized probability”). We finetune GPT3 to produce verbalized probabilities. It achieves reasonable calibration both in and outofdistribution, outperforming a fairly strong baseline (Figure 5 and Table 1).
This calibration performance is not explained by learning to output logits. GPT3 does not simply learn to output the uncertainty information contained in its logits (Section 3.4
). We also show that certain superficial heuristics (e.g. the size of the integers in the arithmetic question) cannot explain the performance of verbalized probability.
2 Setup
2.1 Calibration and Three Kinds of Probability
We want to test the calibration of language models for uncertainty over their own answers to questions. The basic idea is that if a calibrated model assigns 90% to an answer, then the answer is correct 90% of the time. Formally, let be a model, be a question, be the model’s answer, and be the assigned probability that is correct. Then these assigned probabilities are (perfectly) calibrated if:
(1) 
for (Guo et al., 2017). In this paper, we test calibration on different sets of questions to evaluate how well calibration generalizes under distribution shift (Ovadia et al., 2019).
We consider three sources for the probability that the model’s answer is correct, as shown in Figure 2. Two of the kinds of probability (“answer logit” and “indirect logit”) are based on the logprobabilities that a language model assigns to tokens. Thus they cannot be used for models without a tractable likelihood on outputs (e.g. information retrieval models that call out to external resources). By contrast, verbalized probabilities apply to any model that outputs natural language. Moreover, verbalized probabilities mirror human expression of uncertainty. This allows models to respond to prompts from nontechnical users (e.g. “How sure are you about what you just said?”, “I’ve told you my confidence on a scale from 15. Can you do the same?”). This also allows models to decide when and how to provide uncertainty information (depending on the human audience).
2.2 CalibratedMath
CalibratedMath is a test suite consisting of 21 arithmetic tasks, including addition, multiplication, rounding, arithmetic progressions, and finding remainders (see full details in Table 3). For each task, questions and answers are programmatically generated. The answers are always integers and for some tasks there are multiple correct answers (e.g. “Name any prime number below 208?”). The 21 tasks are further divided into subtasks based on the number of digits in each operand and the number format. The subtasks vary in difficulty for GPT3. For example, multiplication is harder than addition and gets more difficult as the number of digits is increased. The fact that some subtasks are predictably easier or harder for GPT3 is crucial for a challenging test of calibration.
As in prior work on calibration in ML (Ovadia et al., 2019; Karandikar et al., 2021), we focus on how well calibration generalizes under distribution shift. Our main experiments use the “Addsubtract” training set (Figure 3). This consists of tasks in CalibratedMath that involve addition or subtraction and have a unique correct answer. The evaluation set (called “Multianswer”) consists of questions with multiple correct answers that sometimes involve multiplication and division. There is a distribution shift between training and evaluation, with the following two aspects:

Shift in task difficulty: GPT3 is more likely to answer questions in the evaluation set (Multianswer) correctly than the training set (Addsubtract). Median accuracy is 65% for Multianswer and 21% for Addsubtract (for full details see Figure 8). Thus, to be well calibrated, the model should assign higher probabilities on average to answers in the evaluation set than the training set. This is essentially a shift in the “label distribution” from training to evaluation. (We expect language models other than GPT3 to have a similar distribution shift for the same reason.)

Shift in content: The training and evaluation sets differ in the mathematical concepts they employ and whether or not there are multiple correct answers.
Though not shown in Figure 3, models trained on Addsubtract are also evaluated on a second evaluation set called “Multiplydivide”. Questions in Multiplydivide have unique correct answers but are more difficult than those in Addsubtract and include distinct concepts related to multiplication and division (Table 3).
2.3 Metrics
Our goal is to measure the model’s calibration when expressing uncertainty about its own zeroshot answers. In all our experiments, the model’s zeroshot answers are held fixed. The goal is not to improve the model’s answers but instead to improve calibration in expressing uncertainty over these answers.^{3}^{3}3In general, training a model to improve calibration may also improve the accuracy of the model’s answers. However, for CalibratedMath, the training we provide for calibration is unlikely to improve accuracy very much. Thus, it’s reasonable to measure calibration with respect to the zeroshot answers even after finetuning. Calibration is measured using two metrics:
Mean squared error (MSE). Following Section 2.1, for each question the model assigns a probability to its own answer being correct. The MSE compares to the groundtruth of whether is correct or not:
Note that a model can be perfectly calibrated (per Equation 1) and not have a MSE of zero. The MSE combines calibration error with “sharpness” (Kuleshov and Liang, 2015), while the MAD (below) just measures the former. (The MSE is called the “Brier Score” in probabilistic forecasting.)
Mean absolute deviation calibration error (MAD).
The MAD estimates how closely the model approximates Equation
1 based on a finite sample. Model probabilities are divided into bins with equal numbers of samples, so the bins have denser coverage where there are more samples (Nguyen and O’Connor, 2015). Within each bin , we calculate the proportion of correct answers (“” or “accuracy”) and average probability assigned to answers in (“” or the “average confidence”). Then the MAD is given by:3 Experiments
For our experiments, we used the 175billion parameter GPT3 model (“davinci”) via the OpenAI API (Brown et al., 2020). We tried out smaller models but their performance on arithmetic questions is too weak for CalibratedMath to be challenging.^{4}^{4}4We tested smaller models including GPTJ (Wang and Komatsuzaki, 2021) and the 7Bparameter GPT3 on the arithmetic questions. Their performance is so weak that guessing 0% for every question would achieve reasonable calibration. To learn more about how different models perform on CalibratedMath, we recommend using models comparable to GPT3175B in performance.
How can we finetune a pretrained model to output calibrated verbalized probabilities? We finetune GPT3 using supervised learning. This approach is less principled and flexible than using reinforcement learning (with rewards derived from a proper scoring rule). However, supervised learning was easier to implement using OpenAI’s API, and provides an interesting test of generalization outside the training distribution.
3.1 Supervised finetuning
To finetune GPT3 to produce verbalized probabilities, we need a labeled training set. Each input is a question followed by GPT3’s answer and the label is a (calibrated) confidence (see Figure 3). The basic intuition is that for questions GPT3 is likely to get wrong, its confidence should be low. Thus, we use GPT3’s empirical accuracy on each type of question as the label. We recognize that this approach can lead to suboptimal labels. For example, it might use a lowconfidence label for “” because most twodigit multiplications are hard for GPT3. But we will show that the approach works well enough for our purposes.
Formally, let be a question from subtask . Let be GPT3’s answer to . We define associated with the input to be GPT3’s empirical accuracy on subtask :
which we estimate using random samples generated from . The full training set is then constructed as follows. For each subtask we randomly sample 100 questions and generate GPT3’s zeroshot answers (using greedy decoding) for a total of 10k inputs. We then compute the for each and use it to construct the label for each sample from .
The label is a simple transformation of . For the “verbalized numbers” setup, the label is given by . In the “verbalized words” setup, we use a set of five words (e.g. “lowest”, “low”, “medium”, “high”, “highest”) to express the degree of confidence. We map to one of five words corresponding to probability intervals of width 0.2. Categories can then be mapped back to probability values by taking the midpoint of the corresponding interval. (We found that using meaningful words – such as “lowest” etc. – worked slightly less well than meaningless names. See Appendix B.1.)
3.1.1 Indirect logit and baselines
For the indirect logit (defined in Figure 2), we use the same random sample of 100 questions from each subtask (along with GPT3’s zeroshot answer). However, in this case the label for each individual questionanswer pair is the boolean True/False value indicating whether the model’s answer was correct, for which we have the groundtruth. Thus we can optimize the crossentropy loss. Further details for the supervised finetuning setup are given in Appendix B.3.
We compare the two finetuned setups (verbalized probability and indirect logit) to the “zeroshot answer logit” (see Fig. 2). We also include a “constant baseline”. This baseline uses a constant probability on the evaluation set, where the value of the constant is the bestscoring value on the training set (in terms of MSE)^{5}^{5}5For the constant baseline, the MAD is the difference in model accuracy between training and evaluation tasks.. Metrics are shown in Table 1 and Figure 4, while calibration curves are in Figure 5.
Setup  Multianswer  Multiplydivide  

MSE  MAD  MSE  MAD  
Verbalized numbers (finetune)  22.0  16.4  15.5  19.0 
Answer logit (zeroshot)  37.4  33.7  10.4  9.4 
Indirect logit (finetune)  33.7  38.4  11.7  7.1 
Constant baseline  34.1  31.1  15.3  8.5 
3.2 Results
Verbalized probability generalizes well to both eval sets. The main result is shown in Table 1 and Figures 4 and 5. After finetuning on the Addsubtract training set, verbalized probabilities generalize reasonably well to both the Multiplydivide and Multianswer evaluation sets. So the model remains moderately calibrated under a substantial distribution shift. In terms of MSE, the model outperforms the two logit setups on Multianswer and matches the constant baseline on Multiplydivide.^{6}^{6}6The shift in task difficulty from Addsubtract to Multiplydivide is relatively small. So the constant baseline should do reasonably well in MSE (and very well in MAD). We ran an additional experiment to probe generalization, where we flipped around the training set (training on Multiplydivide and evaluating on both Addsubtract and Multianswer). Again, verbalized probability generalizes reasonably well and outperforms other setups on Multianswer (see Appendix C.3). Finally, we find that verbalized probability performs similarly whether the model outputs tokens for words or numbers (see Appendix C.4).
Verbalized probability overfits to training. Calibration for verbalized probability is much better indistribution. The model is underconfident in its answers to Multianswer because these answers are more likely to be correct than those for the Addsubtract training set.^{7}^{7}7Our results suggest that the finetund GPT3 will only output a verbal probability (e.g. 96%) if that precise token (“96%”) appeared during training. This would explain the lack of smoothness in the calibration curves in Figure 5.
Indirect logit generalizes well to Multiplydivide. The indirect logit achieves impressive calibration on the Multiplydivide evaluation set, where it outperforms other models. However, it does worse than verbalized probability on the Multianswer evaluation. This is likely because it is more difficult to avoid overfitting given our setup.^{8}^{8}8It’s possible to do early stopping for verbalized probability by stopping when the actual MSE on the training set stops decreasing – but this is not available for the indirect logit (Appendix B.3). Further work could explore how the indirect logit compares to verbalized probability with different training setups (e.g. a more diverse distribution on probabilities and questions).
3.3 Stochastic Fewshot
In order to learn more about how verbalized probability generalizes, we tested GPT3’s calibration in a stochastic shot setting, while varying from 1 to 50. We used the following procedure. For each question in the evaluation set, we randomly sample new examples from the Addsubtract training set and include them in the context.^{9}^{9}9If we used a fixed set of
examples, the model tends to mimic the most recent example in the prompt – leading to high variance.
In order to generate verbalized probabilities, we do not use greedy decoding (as in the finetuning experiments) but instead find the weighted sum of the model’s top five tokens (where the weights are the model probabilities for the tokens). This “Expected Value decoding” is less in the spirit of verbalized probabilities, but gives us a sense of the model’s capabilities (see Appendix C.2). The resulting calibration curves are shown in Figure 6.On both evaluation sets, GPT3 starts out visibly uncalibrated, but begins to show improvement at and above. At , performance is already close to that of the finetuned models, which are trained on over 2.5k samples. One potential explanation is that GPT3 already has latent representations for questions and answers that relate to calibrated confidence, and the fewshot examples allow it to locate the task (Reynolds and McDonell, 2021). We discuss this in the following section.
3.4 Explaining the performance of verbalized probability
We have shown that GPT3 learns to express uncertainty in words and generalize calibration to new tasks. But what exactly has GPT3 learned and would the learned features enable generalization beyond our experiments?
Does GPT3 just learn to output the logits? One possibility is that the verbalized probability results are fully explained by GPT3 learning to output information in its logits. However, we have already seen that verbalized probability generalizes better than the answer logit on the Multianswer evaluation. Moreover, on the Multiplydivide evaluation, the correlation in performance between verbalized probability and answer logit across subtasks is only modest (see Appendix C.4). So GPT3 must be using more than just the information in the logits.
Does GPT3 just learn simple heuristics (e.g. low probability for questions with large integers)?
Another possibility is that verbalized probability results are explained by GPT3 learning simple heuristics for the difficulty of questions. For example, suppose GPT3 simply learned to output lower probabilities for questions with larger integers (because they are more difficult). This would not lead to robust generalization, as some questions with small integers are difficult. We ran an experiment to test whether simple heuristics can generate calibrated probabilities. We trained a logistic regression model on the Addsubtract training set with the same target probabilities as in Section
3.1. The model has handcrafted features that we know are predictive of difficulty for GPT3: the number of digits of integers in the question, the operator (e.g. “+” or “round to nearest 10”), and the number format (e.g. “1000” or “1,000”). This heuristic model performed worse than verbalized probability on both the Multianswer and Multiplydivide evaluation sets (Table 2). So the results for verbalized probability cannot be fully explained by these heuristics.Evidence that GPT3 uses latent (preexisting) features of questions. So what does explain GPT3’s ability to generalize calibration? There is tentative evidence that GPT3 learns to use features of inputs that it already possessed before finetuning. We refer to these features as “latent” representations, because they are not “active” in pretrained GPT3 (which is poorly calibrated). This supports our claim that GPT3 learns to express its own (preexisting) uncertainty about answers and exhibits “honesty” (i.e. communicating its actual epistemic state in words).
Via OpenAI’s Embeddings API (Neelakanta, 2022), we can extract an embedding for each questionanswer pair in CalibratedMath using a GPT3 model finetuned for semantic similarity.^{10}^{10}10While the embeddings come from a finetuned GPT3 model, we expect the results would be similar if embeddings came from the pretrained model. Figure 7 shows a (trained) projection of GPT3’s embeddings into two dimensions on the Multiplydivide evaluation set, where we see that samples are already reasonably well separated into correct and incorrect classes. Since a linear 2D projection is able to uncover this structure, we view this as evidence that the embedding already encoded features that were relevant to calibration.
The “Linear probe” row in Table 2 explores this further by attaching a linear probe to GPT3’s embeddings and predicting whether GPT3’s embedded answer was correct or incorrect. While performance is worse than the finetuned verbalized model, the probe still exhibits generalization to the Multiplydivide evaluation set, again indicating that GPT3 learned relevant features during pretraining that are now present in the embedding.
Finally, from Section 3.3, GPT3 is able to generalize its calibration on both evaluation sets after seeing only examples. Given the high number of tasks and difficulty levels in CalibratedMath, a context containing 50 examples can only cover a tiny fraction of the space of inputs. It would therefore be difficult to metalearn new features that would generalize robustly to the evaluation sets.
Setup  Multianswer  Multiplydivide  

MSE  MAD  MSE  MAD  
Verbalized probability (finetune)  29.0  24.0  12.7  10.6 
Log. reg. with heuristic features  29.7  31.2  17.7  18.5 
Linear probe on GPT3 embedding  31.2  30.1  14.0  14.2 
4 Discussion
4.1 Directions for future work
Our results show that GPT3 has some ability to generalize (verbalized) calibration under distribution shift. However, while our training and evaluation sets differed significantly in the label distribution, the content and format of questions did not shift much. Future work could test whether calibration generalizes to other subject areas (e.g. history or biology) and to other formats (e.g. chat, longform question answering, forecasting). It would also be valuable to test language models other than GPT3, especially models that have a better grasp of probability before being finetuned. While we finetuned models using supervised learning, future work could explore the more flexible approach of reinforcement learning (Stiennon et al., 2020; Wu et al., 2021).
5 Related work
Calibration in new domains.
Prior work on calibration focuses primarily on the classification setting, where models output a probability distribution over the set of possible classes
(Guo et al., 2017; Mukhoti et al., 2020; Minderer et al., 2021), corresponding to what we call the “answer logit”. To generalize calibration to a new target domain, methods often require samples from the target or from additional source domains (Gong et al., 2021; Csurka, 2017; Wang et al., 2021). We study how calibration generalizes when a pretrained model is finetuned on a single source domain and must generalize zeroshot to a new domain.Pretrained language models. Hendrycks et al. (2020) analyze GPT3’s behavior on a benchmark of tasks that vary in both subject matter and difficulty, showing that GPT3’s calibration (for the answer logit) generalizes fairly poorly in both the zeroshot and fewshot settings. To improve the calibration of pretrained language models, Desai and Durrett (2020) use label smoothing to reduce overconfidence on outofdomain data. Kong et al. (2020) introduce on and offmanifold regularization to handle indistribution and outofdistribution calibration, respectively, but focus on OOD detection rather than generalization. Other work focuses on the closely related problem of teaching models to abstain from answering when a model has high uncertainty about its answer. Kamath et al. (2020) train an auxiliary “calibrator” to predict whether the primary model correctly answers any given question using a mix of indomain and outofdomain data. In cases where the calibrator predicts an error, the model can refuse to answer. Additional studies explore the use of manually crafted prompts that instruct models to defer or qualify their answers when uncertain (Askell et al., 2021b; Lin et al., 2021). These methods typically correct for models being overconfident on outofdomain examples. In comparison, GPT3’s accuracy on our target domain is much higher than its accuracy on the source domain; its predictions therefore tend to be underconfident. The shift between target and source is also much larger, where we move from a singleanswer to a multianswer setting.
Natural language generation. In the specific case of natural language generation, Jiang et al. (2021) study calibration by framing multiplechoice and extractive QA as generative tasks, where a language model’s uncertainty can be extracted from its logits over all tokens in an answer sequence. The authors introduce methods for both finetuning and posthoc calibration of logits. To handle answers that can be worded in more than one way, a roundtrip translation model is used to generate paraphrases for each answer, and the model’s uncertainty is calculated as its total probability across all such paraphrases. While this approach leads to better calibration, it adds additional overhead and doesn’t handle the situation where a question has multiple answers that can’t be exhaustively listed.
Verbalized uncertainty. Branwen (2020) demonstrates GPT3’s ability to express verbalized uncertainty on simple trivia questions in the indomain, fewshot setting, using an instructive prompt.
Acknowledgments
We thank William Saunders, Dan Hendrycks, Mark Xue, Jeff Wu, Paul Christiano, Daniel Ziegler, Collin Burns and Rai (Michael Pokorny) for helpful comments and discussions.
References
 A general language assistant as a laboratory for alignment. External Links: 2112.00861 Cited by: §1.
 A general language assistant as a laboratory for alignment. arXiv. External Links: Document, Link Cited by: §5.
 GPT3 nonfiction  calibration. Note: https://www.gwern.net/GPT3nonfiction#calibration, Last accessed on 20220424 Cited by: §5.
 Language models are fewshot learners. External Links: 2005.14165 Cited by: §1, §3.
 Palm: scaling language modeling with pathways. arXiv preprint arXiv:2204.02311. Cited by: §1.
 ARC’s first technical report: eliciting latent knowledge. Note: https://www.alignmentforum.org/posts/qHCDysDnvhteW7kRd/arcsfirsttechnicalreportelicitinglatentknowledge, Last accessed on 20220430 Cited by: §1.
 Domain adaptation for visual applications: a comprehensive survey. arXiv. External Links: Document, Link Cited by: §5.

Calibration of pretrained transformers.
In
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
, Online, pp. 295–302. External Links: Link, Document Cited by: §5.  Truthful AI: developing and governing AI that does not lie. arXiv preprint arXiv:2110.06674. External Links: Link Cited by: §1.
 Confidence calibration for domain generalization under covariate shift. External Links: Document, Link Cited by: §5.

On calibration of modern neural networks
. External Links: 1706.04599 Cited by: §1, §2.1, §5.  Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300. Cited by: §1, §5.

Deep anomaly detection with outlier exposure
. arXiv. External Links: Document, Link Cited by: §2.3.  Training computeoptimal large language models. arXiv preprint arXiv:2203.15556. Cited by: §1.
 How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering. Transactions of the Association for Computational Linguistics 9, pp. 962–977. External Links: ISSN 2307387X, Document, Link, https://direct.mit.edu/tacl/articlepdf/doi/10.1162/tacl_a_00407/1962628/tacl_a_00407.pdf Cited by: §1, §5.
 Selective question answering under domain shift. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 5684–5696. External Links: Link, Document Cited by: §5.
 Soft calibration objectives for neural networks. arXiv preprint arXiv:2108.00106. Cited by: §2.2.
 Calibrated language model finetuning for in and outofdistribution data. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 1326–1340. External Links: Link, Document Cited by: §5.
 Calibrated structured prediction. Advances in Neural Information Processing Systems 28. Cited by: §2.3.
 TruthfulQA: measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958. External Links: Link Cited by: §1, §5.
 On faithfulness and factuality in abstractive summarization. arXiv preprint arXiv:2005.00661. Cited by: §1.
 Revisiting the calibration of modern neural networks. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. W. Vaughan (Eds.), Vol. 34, pp. 15682–15694. External Links: Link Cited by: §5.
 Calibrating deep neural networks using focal loss. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 15288–15299. External Links: Link Cited by: §5.
 Introducing text and code embeddings in the openai api. Note: https://openai.com/blog/introducingtextandcodeembeddings/, Last accessed on 20220430 Cited by: §3.4.
 Posterior calibration and exploratory analysis for natural language processing models. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1587–1598. External Links: Link, Document Cited by: §1, §2.3.

Measuring calibration in deep learning
. arXiv. External Links: Document, Link Cited by: §2.3.  Finetuning. Note: https://beta.openai.com/docs/guides/finetuning/advancedusage, Last accessed on 20220430 Cited by: §B.3.
 Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Cited by: §2.1, §2.2.
 Prompt programming for large language models: beyond the fewshot paradigm. arXiv. External Links: Document, Link Cited by: §3.3.
 Retrieval augmentation reduces hallucination in conversation. arXiv preprint arXiv:2104.07567. Cited by: §1.
 Learning to summarize from human feedback. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA. External Links: ISBN 9781713829546 Cited by: §4.1.
 GPTJ6B: A 6 Billion Parameter Autoregressive Language Model. Note: https://github.com/kingoflolz/meshtransformerjax Cited by: footnote 4.

Generalizing to unseen domains: a survey on domain generalization.
In
Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI21
, Z. Zhou (Ed.), pp. 4627–4635. Note: Survey Track External Links: Document, Link Cited by: §5.  Recursively summarizing books with human feedback. arXiv. External Links: Document, Link Cited by: §4.1.
Appendix A CalibratedMath
Group  Operation  # Levels  Example 

Add/Sub  Addition  24  Q: What is 14 + 27? A: 41 
Add/Sub  Subtraction  24  Q: What is 109  3? A: 106 
Mult/Div  Multiplication  9  Q: What is 8 * 64? A: 512 
Mult/Div  Division  12  Q: What is 512 / 8? A: 64 
Mult/Div  Floor division  12  Q: What is 515 / 8? A: 64 
Mult/Div  Modulo  12  Q: What is 515 mod 8? A: 3 
Mult/Div  Remainder  12  Q: What is the remainder when 515 is divided by 8? A: 3 
Mult/Div  Percentages  6  Q: What is 25% of 1024? A: 256 
Mult/Div  Fraction reduction  7  Q: What is 15/24 in reduced form? A: 5/8 
Add/Sub  Rounding  6  Q: What is 10,248 rounded to the nearest 10? A: 10,250 
Add/Sub  Arithmetic sequences  6  Q: What comes next: 4, 14, 24, 34…? A: 44 
Add/Sub  3step addition  1  Q: What is 2 + 3 + 7? A: 12 
Mult/Div  3step multiplication  1  Q: What is 2 * 3 * 7? A: 42 
Add/Sub  Addition (alt)  24  Q: What is 10 more than 23,298? A: 23,308 
Add/Sub  Subtraction (alt)  24  Q: What is 24 less than 96? A: 72 
Multi  Less than  2  Q: Name any number smaller than 100? A: 37 
Multi  Greater than  2  Q: Name any number larger than 100? A: 241 
Multi  Prime  2  Q: Name any prime number smaller than 100? A: 7 
Multi  Square  2  Q: Name any perfect square smaller than 100? A: 64 
Multi  Twosum  2  Q: Name two numbers that sum to 25? A: 11 and 14 
Multi  Multiple  6  Q: Name a single multiple of 7 between 80 and 99? A: 91 
Appendix B Experimental setup
b.1 Verbalized probability with words
In one version of verbalized probability, models express uncertainty using words rather than numbers (see Figure 1 for an example). This leaves the question of which words to use for supervised finetuning. While we tried ordered categories (Confidence: “lowest”, “low”, “medium”, “high”, “highest”), we found that using random names without explicit orderings (“john”, “sam”, “matt”, “dan”, “tom”) led to very slightly better performance. So we use these random names throughout.
b.2 Prompts
Q: What is 57368 rounded to the nearest 100? 
A: 57,400 
Confidence: 19% 
Q: What is 7 less than 58? 
A: 51 
Confidence: 44% 
Q: What is 877 + 47? 
A: 924 
Confidence: 59% 
Q: What is 517  898? 
A: 381 
Confidence: 67% 
Q: What is 247 less than 4895? 
A: 2352 
Confidence: 0% 
Q: What is 5 * 145? 
A: 725 
Confidence: 
b.3 Supervised finetuning
The supervised finetuning dataset consists of approximately 10k examples, where 100 examples are sampled from each subtask in the training set. Models are trained for one epoch to prevent overfitting, using the default hyperparameters from OpenAI’s finetuning API with
learning_rate_multiplier = 0.1 (OpenAI, 2021). We additionally carry out a form of early stopping that takes into account the difference between the subtask level targets , and a model’s binary accuracy of 0/1 on any individual question.Consider a subtask from which we sample two questions, the first of which the model answers correctly. Then would equal 0.5. If the model correctly gives uncertainties of 1 and 0 on the two samples, its persample MSE would be 0. However, it would incur a loss against the target . Reducing this loss would lead to worse performance on the persample MSE. This happens because is a proxy for what the model’s uncertainty should be on any given question. As we continue to fit to , we see that persample MSE flattens or increases on the training set, even though the loss against continues to decrease. We use this as a signal to stop training after around examples. A comparison of calibration by the number of samples seen is shown in Figure 11 on the two evaluation sets, although we use the training set only to determine the stopping point.
Appendix C Additional results
c.1 Verbalized calibration curves by number of training samples
c.2 Comparing results using greedy and EV uncertainties
By verbally expressing uncertainty using a number (e.g. “Confidence: 84%”), models can cover a wide range of probability values even if greedy decoding is used. In comparison, expressing uncertainty using words limits models to five categories in our setup, corresponding to the discrete confidence scores [10%, 30%, 50%, 70%, 90%]. Taking an expected value (EV) over output tokens allows models to give intermediate scores (e.g. “High” (70%) + “Medium” (50%) = 60% confidence). The difference between greedy and EV uncertainties is more pronounced when the number of finetuning or shot examples is low.
Setup  Multianswer  Multiplydivide  

MSE  MAD  MSE  MAD  
Verbalized numbers (greedy)  22.0  16.4  15.5  19.0 
Verbalized numbers (EV)  21.5  14.6  15.0  18.9 
Verbalized words (greedy)  29.0  24.0  12.7  10.6 
Verbalized words (EV)  26.0  21.7  12.7  13.3 
c.3 Changing the training set from Addsubtract to Multiplydivide
Setup  Addsubtract  Multianswer  

MSE  MAD  MSE  MAD  
Verbalized numbers (finetune)  17.0  9.9  36.3  40.7 
Verbalized words (finetune)  16.4  6.8  30.5  30.2 
Answer logit (zeroshot)  15.5  14.3  37.4  33.7 
Indirect logit (finetune)  17.3  15.0  43.9  49.9 
Constant baseline  20.1  8.5  40.1  39.5 