Calibrating Structured Output Predictors for Natural Language Processing

We address the problem of calibrating prediction confidence for output entities of interest in natural language processing (NLP) applications. It is important that NLP applications such as named entity recognition and question answering produce calibrated confidence scores for their predictions, especially if the system is to be deployed in a safety-critical domain such as healthcare. However, the output space of such structured prediction models is often too large to adapt binary or multi-class calibration methods directly. In this study, we propose a general calibration scheme for output entities of interest in neural-network based structured prediction models. Our proposed method can be used with any binary class calibration scheme and a neural network model. Additionally, we show that our calibration method can also be used as an uncertainty-aware, entity-specific decoding step to improve the performance of the underlying model at no additional training cost or data requirements. We show that our method outperforms current calibration techniques for named-entity-recognition, part-of-speech and question answering. We also improve our model's performance from our decoding step across several tasks and benchmark datasets. Our method improves the calibration and model performance on out-of-domain test scenarios as well.




1 Introduction

Several modern machine-learning based Natural Language Processing (NLP) systems can provide a confidence score with their output predictions. This score can be used as a measure of predictor confidence. A well-calibrated confidence score is a probability measure that is closely correlated with the likelihood that the model output is correct. As a result, NLP systems with calibrated confidence can predict when their predictions are likely to be incorrect and therefore should not be trusted. This property is necessary for the responsible deployment of NLP systems in safety-critical domains such as healthcare and finance. Calibration of predictors is a well-studied problem in machine learning (Guo et al., 2017; Platt et al., 1999); however, widely used methods in this domain are often defined for binary or multi-class problems (Naeini et al., 2015; Nguyen and O’Connor, 2015). The structured output schemes of NLP tasks such as information extraction (IE) (Sang and De Meulder, 2003) and extractive question answering (Rajpurkar et al., 2018) have an output space that is often too large for standard multi-class calibration schemes. We define a general calibration scheme for such tasks. Our calibration method is defined as a post-processing step for neural network architectures such as bert based models (Devlin et al., 2018) and uses model uncertainty estimation methods (Gal and Ghahramani, 2016) to not only calibrate but also boost the performance of the underlying learned models.

We study NLP models that provide a posterior probability $p_\theta(y \mid x)$ for an output $y$ given input $x$. The output can be a label sequence in the case of part-of-speech (POS) tagging or named entity recognition (NER), a span prediction in the case of extractive question answering (QA), or a relation prediction in the case of relation extraction. The posterior probability provided by the model can be used as a measure of the model’s confidence in its prediction. However, $p_\theta(y \mid x)$ is often a poor estimate of model confidence for the output $y$. The output space of the model in sequence-labelling tasks is often very large, and therefore $p_\theta(y \mid x)$ for any output $y$ will be small. For instance, in a sequence labelling task with $C$ classes and a sequence length of $L$, the number of possible events in the output space is of the order of $C^L$. Additionally, recent efforts (Guo et al., 2017; Nguyen and O’Connor, 2015; Dong et al., 2018; Kumar and Sarawagi, 2019) at calibrating machine learning models have shown that they are poorly calibrated. Empirical results from Guo et al. (2017) show that techniques used in deep neural networks, such as dropout, and their large architecture size can negatively affect the calibration of their outputs in binary and multi-class classification tasks.
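To make the scale argument concrete, the following minimal sketch computes the size of a sequence-labelling output space and shows how quickly a full-sequence probability shrinks. The values of `C`, `L` and the per-token probability 0.95 are illustrative assumptions, not figures from the experiments, and the independence of token predictions is a simplification.

```python
import math


def output_space_size(num_classes: int, seq_len: int) -> int:
    """Number of distinct label sequences for C classes and length L: C ** L."""
    return num_classes ** seq_len


def sequence_probability(token_probs):
    """Probability of a full sequence, assuming independent token predictions."""
    return math.exp(sum(math.log(p) for p in token_probs))


# Illustrative values only (assumptions, not taken from the paper).
C, L = 45, 20
size = output_space_size(C, L)

# Even with 95% confidence on every token, the sequence-level
# probability is far below the per-token confidence.
seq_prob = sequence_probability([0.95] * L)
print(f"|Y| = {size:.3e}, sequence probability = {seq_prob:.3f}")
```

This is why a raw sequence posterior is hard to read as a confidence value: it conflates sequence length with model certainty.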

Large neural network architectures based on contextual embeddings (Devlin et al., 2018; Peters et al., 2018) have shown state-of-the-art performance across several NLP tasks (Andrew and Gao, 2007; Wang et al., 2019) and are being rapidly adopted for information extraction and other NLP tasks in safety-critical applications (Zhu et al., 2018; Sarabadani, 2019; Li et al., 2019; Lee et al., 2019). Development of efficient post-processing calibration methods is imperative for the safe deployment of such large neural network based models.

In this study, we demonstrate that neural network models show high calibration errors for NLP tasks such as POS, NER and QA. We extend the work by Kuleshov and Liang (2015) to define well-calibrated forecasters for output entities of interest in structured prediction of NLP tasks. We provide a novel calibration method that applies to a wide variety of NLP tasks and can be used to produce model confidences for specific output entities instead of the complete label sequence prediction. We provide a general scheme for designing manageable and relevant output spaces for such problems. We show that our methods lead to improved calibration performance on a variety of benchmark NLP datasets. Our model also leads to improved out-of-domain calibration performance as compared to baseline methods, suggesting that our calibration methods can generalize well.

Lastly, we propose a procedure to use our calibrated confidence scores to re-score the predictions in our defined output event space. This procedure can be interpreted as a scheme to combine model uncertainty scores and entity-specific features with decoding methods like Viterbi. We show that this re-scoring leads to consistent improvement in model performance across several tasks at no additional training or data requirements.

2 Calibration framework for Structured Prediction NLP models

2.1 Background

Structured prediction refers to the task of predicting a structured output $y$ for an input $x$. In NLP, a wide array of tasks including parsing, information extraction, and extractive question answering fall within this category. Recent approaches to such tasks are commonly based on neural networks that are trained by minimizing the following objective:

$$\mathcal{L}(\theta \mid \mathcal{D}) = -\sum_{i=1}^{|\mathcal{D}|} \log p_\theta\big(y^{(i)} \mid x^{(i)}\big) + R(\theta)$$

Here $\theta$ is the parameter vector of the neural network, $R$ is the regularization penalty and $\mathcal{D}$ is the dataset $\{(x^{(i)}, y^{(i)})\}_{i=1}^{|\mathcal{D}|}$. The trained model $p_\theta$ can then be used to produce the output $\hat{y} = \arg\max_{y \in \mathcal{Y}} p_\theta(y \mid x)$. The corresponding model probability $p_\theta(\hat{y} \mid x)$ is the uncalibrated confidence score.

In binary classification, the output space $\mathcal{Y}$ is $\{0, 1\}$. The confidence score for such classifiers can then be calibrated by training a calibrator $F : [0, 1] \rightarrow [0, 1]$ which takes in the model confidence to produce a recalibrated score (Platt et al., 1999). A widely used method for binary calibration is Platt scaling, where $F$ is a logistic regression model. Similar methods have also been defined for multi-class classification (Guo et al., 2017). However, extending this to structured prediction in NLP settings is non-trivial since the output space $\mathcal{Y}$ is often too large for us to calibrate the output probabilities of all events.
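As an illustration of the binary case, here is a minimal sketch of Platt scaling written with the standard library only. It fits a one-dimensional logistic regression on held-out (score, label) pairs by plain gradient descent; the function names, learning rate, epoch count and toy data are assumptions made for this example, and a library logistic regression would normally be used instead.

```python
import math


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit sigmoid(a * score + b) to binary labels by gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = _sigmoid(a * s + b)
            grad_a += (p - y) * s  # derivative of log loss w.r.t. a
            grad_b += (p - y)      # derivative of log loss w.r.t. b
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def platt_predict(score, a, b):
    """Recalibrated confidence for a raw model score."""
    return _sigmoid(a * score + b)


# Toy held-out data (assumed): raw model confidences and correctness labels.
held_out_scores = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
held_out_labels = [0, 0, 1, 0, 1, 0, 1, 1]
a, b = fit_platt(held_out_scores, held_out_labels)
```

The calibrator only needs one scalar input per prediction, which is exactly what becomes problematic when the output space is a set of sequences rather than two classes.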

2.2 Related Work

Calibration methods for binary and multi-class classification have been widely studied in the related literature (Bröcker, 2009; Guo et al., 2017). Recent efforts at confidence modeling for NLP have focused on several tasks such as co-reference (Nguyen and O’Connor, 2015), semantic parsing (Dong et al., 2018) and neural machine translation (Kumar and Sarawagi, 2019).

2.3 Calibration in Structured Prediction

In this section, we define the calibration framework by Kuleshov and Liang (2015) in the context of structured prediction problems in NLP. The model $p_\theta$ denotes the neural network that produces a conditional probability $p_\theta(y \mid x)$ given an $(x, y)$ tuple. In a multi-class or binary setting, a function $F$ is used to map the output $p_\theta(y \mid x)$ to a calibrated confidence score for all $y \in \mathcal{Y}$. In a structured prediction setting, since the cardinality of $\mathcal{Y}$ is usually large, we instead focus on the event of interest set $\mathcal{I}(x)$. $\mathcal{I}(x)$ contains events of interest $E$, which are defined based on the output events relevant to the deployment requirements of the model. There can be several different schemes to define $\mathcal{I}(x)$. In later sections, we discuss related work on calibration that can be understood as applications of different schemes. In this work, we define a general framework for constructing $\mathcal{I}(x)$ for NLP tasks which allows us to maximize calibration performance on output entities of interest.

We define $F(E, x, p_\theta)$ to be a function that takes the event $E$, the input feature $x$ and $p_\theta$ to produce a confidence score in $[0, 1]$. We refer to this calibration function as the forecaster and use $F(E, x)$ as a shorthand, since it is implicit that $F$ depends on the outputs of $p_\theta$. We would like to find the forecaster that minimizes the discrepancy between $F(E, x)$ and $\mathbb{P}(y \in E \mid x)$ for $(x, y)$ sampled from $\mathbb{P}(x, y)$ and $E$ uniformly sampled from $\mathcal{I}(x)$.

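A commonly used methodology for constructing a forecaster for $p_\theta$ is to train it on a held-out dataset $\mathcal{D}_{dev}$. A forecaster $F$ for a binary classifier is perfectly calibrated if

$$\mathbb{P}\big(y = 1 \mid F(x) = p\big) = p, \quad \forall p \in [0, 1].$$

It is trained on samples from $\{(x, \mathbb{I}(y = 1)) : (x, y) \sim \mathcal{D}_{dev}\}$. For our forecaster based on $\mathcal{I}(x)$, perfect calibration would imply that

$$\mathbb{P}\big(y \in E \mid F(E, x) = p\big) = p, \quad \forall p \in [0, 1].$$

The training data samples for our forecaster are $\{((E, x), \mathbb{I}(y \in E)) : E \in \mathcal{I}(x), (x, y) \sim \mathcal{D}_{dev}\}$.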

2.4 Construction of Event of Interest set

The main contributions of this paper stem from our proposed schemes for constructing the aforementioned $\mathcal{I}(x)$ sets for NLP applications. In the interest of brevity, let us define the “entities of interest” $\phi(x)$ as the set of all entity predictions that can be queried from $p_\theta$ for a sample $x$. For instance, in the case of answer span prediction for QA, $\phi(x)$ may contain the MAP prediction of the best answer span (answer start and end indexes). In a parsing or sequence labeling task, $\phi(x)$ may contain the top-$k$ label sequences obtained from Viterbi decoding. In a relation or named-entity extraction task, $\phi(x)$ contains the relation or named entity span predictions, respectively. Each entity $s$ in $\phi(x)$ corresponds to an event set $E$ that is defined by all outputs in $\mathcal{Y}$ that contain the entity $s$. $\mathcal{I}(x)$ contains the set $E$ for all entities in $\phi(x)$. We are interested in providing a calibrated probability of the event $y \in E$ for all $s$ in $\phi(x)$. If the correct output sequence $y$ lies in the set $E$ for an entity $s$, we refer to $s$ as a positive entity and to the event as a positive event. In the example of named entity recognition, $s$ may refer to a predicted entity span, and $E$ refers to all possible sequences in $\mathcal{Y}$ that contain the predicted span. The corresponding event is positive if the correct label sequence contains the span prediction $s$.
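The construction for named entity recognition can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes BIO-encoded label sequences, and the helper names `extract_spans` and `events_of_interest` are hypothetical. An entity is represented by its `(start, end, type)` span, and its event is positive when the gold sequence contains the same span.

```python
def extract_spans(labels):
    """Return the set of (start, end, type) spans in a BIO label sequence.

    `end` is exclusive. Ill-formed I- tags without a matching B- are ignored.
    """
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(labels) + ["O"]):  # sentinel flushes the last span
        continues_span = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not continues_span:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def events_of_interest(top_k_sequences, gold_sequence):
    """Map every entity predicted in the top-k sequences to (best_rank, is_positive)."""
    gold_spans = extract_spans(gold_sequence)
    events = {}
    for rank, sequence in enumerate(top_k_sequences, start=1):
        for span in extract_spans(sequence):
            if span not in events:  # keep the best (lowest) rank for each entity
                events[span] = (rank, span in gold_spans)
    return events


# Toy example (assumed data): the top-1 sequence over-extends the entity,
# while the top-2 sequence contains the correct single-token span.
gold = ["O", "B-PER"]
top_k = [["B-PER", "I-PER"], ["O", "B-PER"]]
events = events_of_interest(top_k, gold)
```

In this toy case the correct entity only appears at rank 2, which is the kind of positive event that a top-1-only scheme would never show to the forecaster.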

While constructing the set $\phi(x)$, we should ensure that it is limited to a relatively small number of output entities, while still covering as many positive events in $\mathcal{I}(x)$ as possible. To explain this consideration, let us take the example of a parsing task such as syntactic or semantic parsing. Two possible schemes for defining $\mathcal{I}(x)$ are:

  1. Scheme 1: $\phi(x)$ contains the MAP label sequence prediction. $\mathcal{I}(x)$ contains the event corresponding to whether the label sequence $\hat{y}$ is correct.

  2. Scheme 2: $\phi(x)$ contains all possible label sequences ($\mathcal{Y}$). $\mathcal{I}(x)$ contains an event corresponding to whether the label sequence $y'$ is correct, for all $y' \in \mathcal{Y}$.

Calibration of model confidence by Dong et al. (2018) can be viewed as Scheme 1, where the entity of interest is the MAP label sequence prediction. In contrast, using Platt scaling in a one-vs-all setting for multi-class classification (Guo et al., 2017) can be seen as an implementation of Scheme 2, where the entity of interest is the presence of a class label. As discussed in previous sections, Scheme 2 is too computationally expensive for our purposes due to the large value of $|\mathcal{Y}|$. Scheme 1 is computationally cheaper, but it has lower coverage of positive events. For instance, a sequence labelling model with 60% accuracy at sentence level means that only 60% of positive events are covered by the set corresponding to the MAP predictions. In other words, only 60% of the samples contribute a positive event for constructing the forecaster. This can limit the positive events in $\mathcal{I}(x)$. Including the top-$k$ predictions in $\phi(x)$ may increase the coverage of positive events and therefore increase the positive training data for the forecaster. The optimum choice of $k$ involves a trade-off. A larger value of $k$ implies broader coverage of positive events and more positive training data for forecaster training. However, it may also lead to an unbalanced training dataset that is skewed in favour of negative training examples.
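The trade-off in choosing k can be illustrated with a small sketch for the sentence-level case, where each of the top-k sequences is one event and at most one of them is positive. The function name and the toy ranks below are assumptions made for illustration.

```python
def coverage_and_balance(gold_ranks, k):
    """Return (coverage, positive_fraction) for a top-k event set.

    gold_ranks[i] is the 1-based rank at which the correct sequence for
    sample i appears in the decoder output, or None if it never appears.
    """
    n = len(gold_ranks)
    covered = sum(1 for r in gold_ranks if r is not None and r <= k)
    coverage = covered / n                 # share of samples with a positive event
    positive_fraction = covered / (n * k)  # share of positives among all n * k events
    return coverage, positive_fraction


# Toy ranks (assumed): 3 samples correct at rank 1, 2 at rank 2, 1 at rank 3,
# and 4 samples whose correct sequence is not among the candidates.
ranks = [1, 1, 1, 2, 2, 3, None, None, None, None]
for k in (1, 2, 3):
    print(k, coverage_and_balance(ranks, k))
```

On this toy data, coverage rises from 0.3 to 0.6 as k goes from 1 to 3, while the fraction of positive training events falls from 0.3 to 0.2, which is the imbalance described above.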

Task-specific details about $\phi(x)$ are provided in the later sections. For the purposes of this paper, top-$k$ refers to the top $k$ MAP sequence predictions, also referred to as $\operatorname{argmax}(k)$.

2.5 Forecaster Construction

Here we provide a summary of the steps involved in forecaster construction. Remaining details are in the Appendix. We train the neural network model $p_\theta$ on the training data split for a task and use the validation data for monitoring the loss and early stopping. After the training is complete, this validation data is re-purposed to create the forecaster training data. We use an MC-Dropout (Gal and Ghahramani, 2016) average of $n = 10$ samples to get a low-variance estimate of the logit outputs from the neural network. This average is fed into the decoding step of the model $p_\theta$ to obtain the top-$k$ label sequence predictions. We then collect the relevant entities in $\phi(x)$, along with the labels $\mathbb{I}(y \in E)$, to form the training data for the forecaster.
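The averaging step can be sketched as below. This is a minimal illustration under stated assumptions: `toy_forward` is a hypothetical stand-in for a network whose dropout layers stay active at inference time, and it is not the bert model used in the experiments.

```python
import random


def mc_dropout_mean(stochastic_forward, x, n_samples=10):
    """Average the logit vectors returned by n_samples stochastic forward passes."""
    samples = [stochastic_forward(x) for _ in range(n_samples)]
    dim = len(samples[0])
    return [sum(sample[j] for sample in samples) / n_samples for j in range(dim)]


def toy_forward(x, drop_prob=0.1, rng=random.Random(0)):
    """Hypothetical stand-in: one linear layer with inverted dropout on its inputs."""
    weights = [[1.0, -1.0], [0.5, 2.0], [-1.5, 0.5]]  # 3 output logits, 2 inputs
    kept = [xi / (1.0 - drop_prob) if rng.random() >= drop_prob else 0.0 for xi in x]
    return [sum(w * v for w, v in zip(row, kept)) for row in weights]


# The averaged logits would then be passed to the decoder (for example Viterbi)
# to obtain the top-k label sequences.
mean_logits = mc_dropout_mean(toy_forward, [1.0, 2.0], n_samples=10)
```

Keeping the individual samples around, rather than only their mean, is what later allows uncertainty features to be computed for the forecaster.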

We use gradient boosted decision trees (Friedman, 2001) as our region-based (Dong et al., 2018; Kuleshov and Liang, 2015) forecaster model. We limit our choice of $k$ to $\{2, 3\}$. We train one forecaster on training data constructed through top-1 extraction and one on training data constructed through top-$k$ extraction. These two models are then evaluated on top-1 extraction training data, and the best value of $k$ is used for evaluation on the test set. The intuition here is that the top-1 training data is a positive-event rich dataset and therefore can be used to reject a larger $k$ if it leads to reduced performance on positive events.
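The selection heuristic can be written as a short loop. This is a sketch under assumptions: `build_data`, `train_forecaster` and `evaluate_ece` are hypothetical callables supplied by the user (for example a gradient boosted tree trainer and an ECE routine), and including k = 1 as the fallback is one reading of the procedure rather than a confirmed detail.

```python
def select_top_k(build_data, train_forecaster, evaluate_ece, k_candidates=(2, 3)):
    """Pick the k whose forecaster has the lowest ECE on top-1 extraction data.

    build_data(k)          -> forecaster training data from top-k extraction
    train_forecaster(data) -> fitted forecaster
    evaluate_ece(f, data)  -> calibration error of forecaster f on data
    """
    top1_data = build_data(1)
    best_k = 1
    best_ece = evaluate_ece(train_forecaster(top1_data), top1_data)
    for k in k_candidates:
        forecaster = train_forecaster(build_data(k))
        ece = evaluate_ece(forecaster, top1_data)  # always scored on top-1 data
        if ece < best_ece:
            best_k, best_ece = k, ece
    return best_k, best_ece


# Usage with stub callables (assumed values, for illustration only).
stub_ece = {1: 0.05, 2: 0.02, 3: 0.03}
chosen_k, chosen_ece = select_top_k(
    build_data=lambda k: k,
    train_forecaster=lambda data: data,
    evaluate_ece=lambda forecaster, data: stub_ece[forecaster],
)
```

Scoring every candidate on the same top-1 data keeps the comparison fair, since the additional top-k events only enter through training.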

Calibration              bert         bert+CRF
Platt                    15.90±.03    15.56±.23
Calibrated Mean          2.55±.34     2.31±.35
+Var                     2.11±.32     2.55±.32
Platt+top2               11.4±.07     14.21±.16
Calibrated Mean+top2     2.94±.29     4.82±.15
+Var+top2                2.17±.35     4.26±.10
+Rank+top2               2.43±.30     2.43±.45
+Rank+Var+top2           1.81±.12     2.29±.27

Table 1: ECE percentages on Penn Treebank for different models and calibration methods. The results are for top-1 MAP predictions on the test data. ECE standard deviation is estimated by repeating the experiments for 5 repetitions. ECE for the uncalibrated bert and bert+CRF models is 35.11% and 33.72% respectively. The choice of $k$ was decided based on Section 2.5.

2.6 Feature Construction for Calibration

We use three categories of features as inputs to our forecaster.

Model based features contain the mean probability obtained by averaging over the marginal probability of the “entity of interest” obtained from 10 MC-dropout samples of $p_\theta$. The average of marginal probabilities acts as a reduced-variance estimate of the un-calibrated model confidence. Our experiments use the pre-trained bert architecture as the backbone network, which uses dropout layers. We obtain MC-Dropout samples by enabling dropout sampling for all layers. We also provide percentile and marginal probability values from the MC-Dropout samples, to provide prediction uncertainty information to the forecaster. Since our forecaster training data contains entity predictions from top-$k$ MAP predictions, we also include the rank as a feature. We refer to these two features as “Var” and “Rank” in our models.

Entity of interest based features contain the length of the entity span if the output task is named entity. We only use this feature in the NER experiments and refer to it as “ln”.

Data Uncertainty based features: Dong et al. (2018) propose the use of language modelling (LM) and OOV-word-based features for estimating data uncertainty. The use of word-pieces and large pre-training corpora in bert may affect the efficacy of LM based features. Nevertheless, we use LM perplexity (referred to as “lm”) in the QA task to investigate its effectiveness as an indicator of the distributional shift in data. The use of word-pieces in models like bert reduces the negative effect of OOV words on model prediction. Therefore, we do not include OOV features in our experiments.
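The three feature categories can be assembled into one input record per entity, sketched below. The field names are hypothetical, and the minimum and maximum of the MC-Dropout samples are used here only as simple stand-ins for the percentile features, whose exact definition is not given in this section.

```python
import statistics


def forecaster_features(mc_probs, rank, span_length=None, lm_perplexity=None):
    """Build a feature dictionary for one entity of interest.

    mc_probs      : marginal probabilities of the entity from each MC-Dropout sample
    rank          : rank of the top-k MAP prediction the entity came from ("Rank")
    span_length   : entity span length, used for NER only ("ln")
    lm_perplexity : language model perplexity, used for QA only ("lm")
    """
    features = {
        "mean": statistics.fmean(mc_probs),     # reduced-variance confidence estimate
        "var": statistics.pvariance(mc_probs),  # spread across dropout samples ("Var")
        "min": min(mc_probs),                   # stand-in for a low percentile (assumption)
        "max": max(mc_probs),                   # stand-in for a high percentile (assumption)
        "rank": rank,
    }
    if span_length is not None:
        features["ln"] = span_length
    if lm_perplexity is not None:
        features["lm"] = lm_perplexity
    return features


example = forecaster_features([0.8, 0.9, 1.0], rank=2, span_length=1)
```

A tree-based forecaster can consume such records directly, since it does not require the features to share a scale.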

3 Experiments and Results

We use bert-base (Devlin et al., 2018) as the network architecture for our experiments. The validation split for each dataset was used both for early stopping during bert fine-tuning and as training data for the forecaster. POS experiments are evaluated on Penn Treebank, and NER experiments are evaluated on CoNLL 2003 (Sang and De Meulder, 2003) and MADE 1.0 (Jagannatha et al., 2019). QA experiments are evaluated on the SQuAD1.1 (Rajpurkar et al., 2018) and EMRQA (Pampari et al., 2018) corpora. We also investigate the performance of our forecasters on an out-of-domain QA corpus constructed by applying the EMRQA QA data generation scheme (Pampari et al., 2018) to the MADE 1.0 named entity and relations corpus. Details for these datasets are provided in their relevant sections.

We use the expected calibration error (ECE) metric defined by Naeini et al. (2015), computed over a fixed number of confidence bins (Guo et al., 2017), to evaluate the calibration of our models. ECE is defined as an estimate of the expected difference between the model confidence and accuracy. ECE has been used in several related works (Guo et al., 2017; Maddox et al., 2019; Kumar et al., 2018; Vaicenavicius et al., 2019) to estimate model calibration. We use Platt scaling as the baseline calibration model. It uses the length-normalized probability averaged across MC-Dropout samples as the input. The lower variance and length invariance of this input feature make Platt scaling a strong baseline. We also use a “Calibrated Mean” baseline that uses gradient boosted decision trees as the estimator with the same input feature as Platt.
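The baseline input feature can be sketched as follows, taking length normalization to mean the geometric mean of per-token probabilities (as described in the caption of Figure 1). The function names and toy values are assumptions for illustration.

```python
import math


def length_normalized_prob(token_probs):
    """Geometric mean of per-token probabilities: exp(mean of log p)."""
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))


def mc_averaged_confidence(mc_token_probs):
    """Average the length-normalized probability over MC-Dropout samples.

    mc_token_probs is a list with one entry per dropout sample; each entry
    holds the per-token probabilities of the same predicted sequence or span.
    """
    values = [length_normalized_prob(sample) for sample in mc_token_probs]
    return sum(values) / len(values)


# Toy values (assumed): two dropout samples for a two-token prediction.
confidence = mc_averaged_confidence([[0.25, 1.0], [1.0, 1.0]])
```

Normalizing by length keeps the feature comparable between short and long predictions, which a raw product of token probabilities would not be.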

Calibration         bert         bert+CRF
Baseline            60.30±.12    62.31±.11
+Rank+Var+top2      60.30±.23    62.31±.09

Table 2: Micro-avg f-score for POS datasets using the baseline and our best proposed calibration method. The confidence score from the calibration method is used to re-rank the events in $\mathcal{I}(x)$ and the top selection is chosen. Standard deviation is estimated by repeating the experiments for 5 repetitions. Baseline refers to the MC-dropout averaged (sample-size=10) output from the model $p_\theta$.

3.1 Calibration for Part-of-Speech Tagging

Part-of-speech (POS) tagging is a sequence labelling task where the input is a text sentence, and the output is a sequence of syntactic tags. We evaluate our method on the Penn Treebank dataset (Marcus et al., 1994). We can define either the token prediction or the complete sequence prediction as the entity of interest. Since using a token-level entity of interest effectively reduces the calibration problem to that of calibrating a multi-class classifier, we study the case where the predicted label sequence of the entire sentence forms the entity of interest set. The event of interest set is defined by the events which denote whether each top-$k$ sentence-level MAP prediction is correct. We use two models, namely bert and bert-CRF. We use model uncertainty and rank based features for our POS experiments.

Calibration              CoNLL        MADE 1.0
Platt                    2.00±.12     4.00±.07
Calibrated Mean          2.29±.33     3.07±.18
+Var                     2.43±.36     3.05±.17
+Var+ln                  2.24±.14     2.92±.24
Platt+top3               16.64±.48    2.14±.18
Calibrated Mean+top3     17.06±.50    2.22±.31
+Var+top3                17.10±.24    2.17±.39
+Rank+Var+top3           2.01±.33     2.34±.15
+Rank+Var+ln+top3        1.91±.29     2.12±.24

Table 3: ECE percentages for the two named entity datasets and calibration methods. The results are for all predicted named entity spans in top-1 MAP predictions on the test data. ECE standard deviation is estimated by repeating the experiments for 5 repetitions. ECE for uncalibrated span marginals from the bert model is 3.68% and 5.59% for the CoNLL and MADE 1.0 datasets. The choice of $k$ was decided based on Section 2.5.

Table 1 shows the ECE values for our baseline, proposed and ablated models. We use the heuristic described in Section 2.5 to select the value of $k$ for top-$k$ selection. “top2” in Table 1 refers to forecasters trained with additional top-2 predictions. ECE for the un-calibrated bert and bert+CRF models is 35.11% and 33.72% respectively. Our methods outperform both baselines by a large margin. Both the “Rank” and “Var” features help in improving model calibration. Inclusion of top-2 prediction sequences also improves the calibration performance significantly.

We use the confidence predictions of our full-feature model +Rank+Var+top2 to re-rank the top-$k$ predictions in the test set. Table 2 shows the sentence-level (entity of interest) accuracy for our re-ranked top prediction and the original model prediction.
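Re-ranking with a forecaster reduces to choosing the candidate with the highest calibrated confidence. The sketch below assumes a hypothetical `confidence_fn(sequence, rank)` wrapper around the trained forecaster and breaks ties in favour of the original decoder order, which is a design choice of this example rather than a documented detail.

```python
def rerank_top_k(candidates, confidence_fn):
    """Return the candidate sequence with the highest calibrated confidence.

    candidates    : top-k label sequences in decoder order (rank 1 first)
    confidence_fn : callable (sequence, rank) -> calibrated confidence in [0, 1]
    Ties are broken in favour of the lower (better) original rank.
    """
    scored = [
        (confidence_fn(sequence, rank), -rank, sequence)
        for rank, sequence in enumerate(candidates, start=1)
    ]
    best = max(scored, key=lambda item: (item[0], item[1]))
    return best[2]


# Toy example (assumed confidences): the forecaster prefers the rank-2 sequence.
toy_candidates = [["DT", "NN"], ["DT", "VB"]]
toy_confidence = {1: 0.4, 2: 0.7}
chosen = rerank_top_k(toy_candidates, lambda seq, rank: toy_confidence[rank])
```

When the forecaster assigns the highest confidence to the rank-1 sequence, the output is identical to the original MAP prediction.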

Calibration            CoNLL        MADE 1.0
Baseline               89.45±.08    84.01±.11
+Rank+Var+ln+top3      89.78±.10    84.34±.10

Table 4: Micro-avg f-score for NER datasets using the baseline and our best proposed calibration method. The confidence score from the calibration method is used to re-rank the events in $\mathcal{I}(x)$ and a confidence value of 0.5 is used as a cutoff. Standard deviation is estimated by repeating the experiments for 5 repetitions. Baseline refers to the MC-dropout averaged (sample-size=10) output of the model $p_\theta$.

3.2 Calibration for Named Entities

For the named entity recognition experiments, we use two NE-annotated datasets, CoNLL 2003 and MADE 1.0. CoNLL consists of documents from the Reuters corpus annotated with named entities such as Person and Location. The MADE 1.0 dataset is composed of electronic health records annotated with clinical named entities such as Medication, Indication and Adverse effects.

The entity of interest for NER is the named entity span prediction. We define $\phi(x)$ as the predicted entity spans in the top-$k$ label sequence predictions for $x$. We use bert-base with token-level softmax output and marginal likelihood based training. The model uncertainty estimates for the “Var” feature are computed by estimating the variance of length-normalized MC-dropout samples of span marginals. Due to the similar trends in the behavior of the bert and bert+CRF models in the POS experiments, we only use the bert model for NER. However, the span marginal computation can be easily extended to linear-chain CRF models. We also use the length of the predicted named entity as the feature “ln” in this experiment. Complete details about the forecaster and baselines are in the Appendix. Based on the heuristic defined in Section 2.5, we use a value of $k = 3$ since it resulted in lower ECE for top-1 span predictions on forecaster training data. Table 3 shows ECE results for NE span predictions. We see that our proposed methods perform better than the baselines for both datasets. The ECE for uncalibrated span marginals is 3.68% and 5.59% for the CoNLL and MADE 1.0 datasets respectively.

Calibration              SQuAD1.1     EMRQA        MADE 1.0     MADE 1.0(OOD)
Platt                    3.69±.16     5.07±.37     3.64±.17     15.20±.16
Calibrated Mean          2.95±.26     2.28±.18     2.50±.31     13.26±.94
+var                     2.92±.28     2.74±.15     2.71±.32     12.41±.95
Platt+top3               7.71±.28     5.42±.25     11.87±.19    16.36±.26
Calibrated Mean+top3     3.52±.35     2.11±.19     9.21±.25     12.11±.24
+var+top3                3.56±.29     2.20±.20     9.26±.27     11.67±.27
+var+lm+top3             3.54±.21     2.12±.19     6.07±.26     12.42±.32
+var+rank+top3           2.47±.18     1.98±.10     1.77±.23     12.69±.20
+var+rank+lm+top3        2.79±.32     2.24±.29     1.66±.27     12.60±.28

Table 5: ECE percentages for the QA tasks SQuAD1.1, EMRQA and MADE 1.0. MADE 1.0(OOD) refers to the out-of-domain evaluation of a QA model that is trained and calibrated on the EMRQA training and validation splits. The results are for top-1 MAP predictions on the test data. ECE standard deviation is estimated by repeating the experiments for 5 repetitions. The bert model’s uncalibrated ECE for SQuAD1.1, EMRQA, MADE 1.0 and MADE 1.0(OOD) is 6.24%, 6.10%, 20.10% and 18.70% respectively. The choice of $k$ was decided based on Section 2.5.

We use the confidence predictions of our “+Rank+Var+ln+top3” model to re-score the confidence predictions for all spans predicted in the top-$k$ MAP predictions for samples in the test set. A threshold of 0.5 was used to remove span predictions with low confidence scores. Table 4 shows the named-entity-level (entity of interest) micro-F score for our re-scored predictions and the original model predictions. We see that the re-scored predictions from our models consistently improve performance.
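The thresholding step for named entity spans can be sketched as a simple filter. The `forecaster` callable and the toy scores are hypothetical, and treating a score of exactly 0.5 as kept is an assumption of this example.

```python
def rescore_spans(candidate_spans, forecaster, threshold=0.5):
    """Keep spans whose calibrated confidence reaches the threshold.

    candidate_spans : entity spans collected from the top-k MAP predictions
    forecaster      : callable span -> calibrated confidence in [0, 1]
    Returns (span, confidence) pairs sorted by decreasing confidence.
    """
    kept = []
    for span in candidate_spans:
        confidence = forecaster(span)
        if confidence >= threshold:  # assumption: boundary value is kept
            kept.append((span, confidence))
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy scores (assumed): the longer span is rejected, the shorter one is kept.
toy_scores = {(0, 2, "PER"): 0.31, (1, 2, "PER"): 0.88, (4, 5, "LOC"): 0.5}
final_spans = rescore_spans(list(toy_scores), toy_scores.get)
```

Because candidates are drawn from all top-k sequences, this filter can both drop spans from the MAP prediction and add spans that only appear at lower ranks.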

Calibration            SQuAD1.1     EMRQA        MADE 1.0     MADE 1.0(OOD)
Baseline               79.79±.08    70.97±.14    66.21±.18    31.62±.12
+var+rank+lm+top3      80.03±.15    71.37±.28    66.33±.15    32.02±.09

Table 6: Exact match accuracy for QA datasets using the baseline and our best proposed calibration method. The confidence score from the calibration method is used to re-rank the events in $\mathcal{I}(x)$. Standard deviation is estimated by repeating the experiments for 5 repetitions. Baseline refers to the MC-dropout averaged (sample-size=10) output of the model $p_\theta$. The choice of $k$ was decided based on Section 2.5.

3.3 Calibration for QA Models

We use three datasets to evaluate our calibration methods on the QA task. Our QA tasks are modeled as extractive QA with a single answer span prediction. SQuAD1.1 and EMRQA (Pampari et al., 2018) are open-domain and clinical-domain QA datasets, respectively. We process the EMRQA dataset by restricting the passage length and removing unanswerable questions. We also design an out-of-domain evaluation of calibration using clinical QA datasets. We follow the guidelines from Pampari et al. (2018) to create a QA dataset from MADE 1.0 (Jagannatha et al., 2019). This allows us to have two QA datasets with common question forms, but different text distributions. Details about dataset pre-processing and construction are provided in the Appendix.

The entities of interest for QA are the top-$k$ answer span predictions. We use the “lm” perplexity as a feature in this experiment to analyze its behaviour in out-of-domain evaluations. We use a unidirectional LSTM to train a next-word language model on the EMRQA passages. This language model is then used to compute the perplexity of a sentence for the “lm” input feature to the forecaster. We use the same baselines as in the previous two tasks. The uncalibrated ECE for SQuAD1.1, EMRQA, MADE 1.0 and MADE 1.0 (out-of-domain) is 6.24%, 6.10%, 20.10% and 18.70% respectively. Our methods outperform the baselines by a large margin in both in-domain and out-of-domain experiments. Our “+var+rank+top3” and “+var+rank+lm+top3” models consistently outperform the baselines. We also see an improvement in model accuracy for all QA tasks, even out-of-domain. Our models are evaluated on the SQuAD1.1 dev set, and on the test sets from EMRQA and MADE 1.0.

4 Discussion

Our proposed methods outperform the baselines in most tasks and are almost as competitive in others.

Features and top-$k$ samples: The inclusion of top-$k$ samples improves the performance in almost all tasks when the rank of the prediction is included. We see large increases in calibration error when the top-$k$ prediction samples are included in forecaster training without the rank information in tasks such as CoNLL NER and MADE 1.0 QA. This may be because the top-$k$ predictions have similar model confidence and uncertainty values. Therefore, a more discriminative signal such as rank is needed to prioritize them. For instance, the top-1 and top-2 MAP predictions for POS tagging may differ by only one or two tokens. In a long sentence, this difference in probability, when normalized by length, would amount to a very small shift in the overall model confidence score. Therefore, adding the rank of each top-$k$ prediction as an input leads to a substantial gain in performance for both the bert and bert+CRF models in POS.

Our task-agnostic scheme of “var+rank+topk” based forecasters consistently outperforms or stays competitive with other forecasting methods. However, results from task-specific features such as “lm” and “ln” show that the use of task-specific features can further reduce the calibration error. Our experimental setup has the same set of questions in the in-domain and out-of-domain datasets. Only the data distribution of the answer passage is different. However, we do not observe an improvement in out-of-domain performance from using the “lm” feature. A more detailed analysis of task-specific features in QA with both data and question shifts is required. We leave further investigation of such schemes as future work.

Figure 1: An example of a named entity span from the CoNLL dataset. Rank is the rank from top-$k$ MAP inference (Viterbi decoding). Mean Prob and Std are the mean and standard deviation of length-normalized probabilities (geometric mean of marginal probabilities for each token in the span). Calibrated confidence is the output of the +Rank+Var+ln+top3 forecaster.

Re-scoring: We show that using our forecaster confidence to re-rank the entities of interest leads to a modest boost in model performance for the NER and QA tasks. In POS, we see no appreciable drop or gain in performance for either the bert or the bert-CRF model. We believe this may be due to the already high token-level accuracy (above 97%) on Penn Treebank data. Nevertheless, this suggests that our re-scoring does not lead to a degradation in model performance in cases where it is not effective.

Our forecaster re-scores the top-$k$ entity confidence scores based on the model uncertainty score and entity-level features such as entity length. Intuitively, we want to prioritize predictions that have low uncertainty over high-uncertainty predictions if their uncalibrated confidence scores are similar. We provide an example of such re-ranking in Figure 1. It shows named entity span predictions for the correct span “Such”. The model produces two entity predictions, “off-spinner Such” and “Such”. The un-calibrated confidence score of “off-spinner Such” is higher than that of “Such”, but the variance of its prediction is higher as well. Therefore, the +Rank+Var+ln+top3 forecaster re-ranks the second (and correct) prediction higher. It is important to note here that the variance of “off-spinner Such” may be higher just because it involves two token predictions, as compared to only one token prediction in “Such”. This, along with the “ln” feature in +Rank+Var+ln+top3, may mean that the forecaster is also using length information along with uncertainty to make this prediction. However, we see similar improvements in QA tasks, where the “ln” feature is not used and all entity predictions involve two predictions (span start and end index predictions). These results suggest that uncertainty features are useful in both calibration and re-ranking of predicted structured output entities.

Figure 2: Modified reliability plots (Accuracy - Confidence vs Confidence) on MADE 1.0 QA test. The dotted horizontal line represents perfect calibration. Scatter point size denotes bin size.
Out-of-domain Performance: Our experiments testing the performance of calibrated QA systems on out-of-domain data suggest that our methods result in improved calibration on unseen data as well. Additionally, our methods also lead to an improvement in system accuracy on out-of-domain data, suggesting that the mapping learned by the forecaster model is not specific to a dataset. However, there is still a large gap between the calibration error for within-domain and out-of-domain testing. This can be seen in the reliability plot shown in Figure 2. The number of samples in each bin is denoted by the radius of the scatter point. The calibrated models shown in the figure correspond to the “+var+rank+lm+top3” forecaster calibrated using both in-domain and out-of-domain validation datasets for forecaster training. We see that out-of-domain forecasters are over-confident and this behaviour is not mitigated by using data-uncertainty aware features like “lm”. This is likely due to a shift in the model’s prediction error when applied to a new dataset. Re-calibration of the forecaster using a validation set from the out-of-domain data seems to bridge the gap. However, we can see that the sharpness (Kuleshov and Liang, 2015) of the out-of-domain trained, in-domain calibrated model is much lower than that of the in-domain trained, in-domain calibrated one. Additionally, a validation dataset is often not available in the real world. Mitigating the loss in calibration and sharpness induced by out-of-domain evaluation is an important avenue for future research.

5 Conclusion

We present a new calibration and confidence-based re-scoring scheme for structured output entities in NLP. We show that our calibration methods outperform competitive baselines on several NLP tasks. Our task-agnostic methods can provide calibrated model outputs for specific entities instead of the entire label sequence prediction. We also show that our calibration method can improve the trained model’s accuracy at no additional training or data cost. Our method is compatible with modern NLP architectures like bert. Lastly, we show that our calibration does not over-fit on in-domain data and is capable of generalizing to out-of-domain datasets.


Appendix A Appendices


A.1 Algorithm Details

The forecaster construction algorithm is provided as a separate algorithm listing. The candidate events in that algorithm are obtained by extracting the top-k label sequences for every output. The logits obtained from the underlying network are averaged over 10 MC-Dropout samples before being fed into the final output layer. We use the validation dataset from the task’s original split to train the forecaster. The validation dataset is used to construct both the training and validation splits for the forecaster. The training split contains all top-k predicted entities. The validation split contains only top-1 predicted entities.

A.2 Evaluation Details

We use the expected calibration error (ECE) score defined by Naeini et al. (2015) to evaluate our calibration methods. Expected calibration error is a score that estimates the expected absolute difference between model confidence and accuracy. This is calculated by binning the model outputs into $N$ bins (a fixed value in our experiments) and then computing the expected calibration error across all bins. It is defined as

$$\mathrm{ECE} = \sum_{i=1}^{N} \frac{|b_i|}{n} \left| \mathrm{acc}(b_i) - \mathrm{conf}(b_i) \right|$$

where $N$ is the number of bins, $n$ is the total number of data samples, and $b_i$ is the $i$-th bin. The functions $\mathrm{acc}(\cdot)$ and $\mathrm{conf}(\cdot)$ calculate the accuracy and model confidence for a bin.
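As an illustration of the binning procedure described above, here is a minimal sketch of an ECE computation with equal-width bins. The function name `expected_calibration_error` and the default of 10 bins are assumptions made for the example and do not reproduce the exact evaluation code or bin count used in the experiments.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average over bins of |accuracy - mean confidence|.

    confidences: predicted confidences in [0, 1].
    correct: booleans (or 0/1) marking whether each prediction was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Map a confidence in [0, 1] to a bin index; 1.0 goes to the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

For instance, four predictions that all have confidence 0.9, of which three are correct, fall into a single bin with accuracy 0.75, so the score is the gap between 0.75 and 0.9.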

A.3 Implementation Details

We use AllenNLP’s wrapper with HuggingFace’s Transformers code for our implementation. We use bert-base-cased (Wolf et al., 2019) weights as the initialization for general-domain datasets and bio-bert weights (Lee et al., 2019) as the initialization for clinical datasets. We use cased models for our analysis, since bio-bert (Lee et al., 2019) uses cased models. A common learning rate of 2e-5 was used for all experiments. We used the validation data splits provided by the datasets. In cases where a validation dataset was not provided, such as MADE 1.0, EMRQA or SQuAD 1.1, we use 10% of the training data as the validation data. We use a patience of 5 for early stopping, with each epoch consisting of 20,000 steps. We use the final evaluation metric instead of negative log likelihood (NLL) to monitor and early stop the training. This is to reduce the mis-calibration of the underlying model, since Guo et al. (2017) observe that neural nets overfit on NLL. The implementation for each experiment is provided in the following subsections.

A.3.1 Part-of-speech experiments

We evaluate our method on the Penn Treebank dataset (Marcus et al., 1994). Our experiment uses the standard training (1-18), validation (19-21) and test (22-24) splits from the WSJ portion of the Penn Treebank dataset. The uncalibrated output of our model for a candidate label sequence $y_1, y_2, \dots, y_L$ is estimated as

$$\hat{p} = \frac{1}{M} \sum_{m=1}^{M} p_m(y_1, y_2, \dots, y_L)^{\frac{1}{L}}$$

where $M$ is the number of dropout samples. The $L$-th root accounts for different sentence lengths. Here $L$ is the length of the sentence. We observe that this kind of normalization improves the calibration of both baselines and proposed models. We do not normalize the probabilities while reporting the ECE of uncalibrated models. We use two models, namely bert and bert+CRF. The bert-only model adds a linear layer to the output of the bert network and uses a softmax activation function to produce marginal label probabilities for each token. bert+CRF uses a CRF layer on top of unary potentials obtained from the bert network outputs.
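The length normalization described above can be sketched as follows. This is a minimal illustration that assumes the root is applied to each dropout sample's sequence probability before averaging, and that sequence probabilities are available as log-probabilities. The function name `length_normalized_confidence` is hypothetical.

```python
import math

def length_normalized_confidence(sample_log_probs, length):
    """Average over dropout samples of p_m ** (1 / length).

    sample_log_probs: log-probability of one candidate label sequence under
    each dropout sample (one float per sample).
    length: number of predictions in the candidate (e.g. sentence length).
    """
    # exp(log p / length) equals p ** (1 / length) and avoids underflow
    # for long sequences with very small probabilities.
    roots = [math.exp(lp / length) for lp in sample_log_probs]
    return sum(roots) / len(roots)
```

The same helper applies unchanged to other structured outputs by passing the appropriate length, for example the span length for an entity span.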

We use Platt Scaling (Platt et al., 1999) as the baseline calibration model. Our Platt scaling model uses the MC-Dropout average of length normalized probability output of the model as input. The lower variance and length invariance of this input feature make Platt Scaling a very strong baseline. We also use a “Calibrated Mean” baseline using Gradient Boosted Decision Trees as our estimator with the same input feature as Platt.
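For readers unfamiliar with the baseline, Platt scaling fits a two-parameter sigmoid that maps an uncalibrated score to a calibrated probability. The sketch below fits those parameters by plain gradient descent on the negative log likelihood. It is an illustration only: the function names, learning rate and iteration count are assumptions, and it is not the implementation used in the experiments.

```python
import math

def fit_platt(scores, labels, lr=1.0, epochs=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on mean NLL.

    scores: uncalibrated confidences (one float per candidate).
    labels: 1 if the candidate was correct, 0 otherwise.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s   # d NLL / d a
            grad_b += (p - y)       # d NLL / d b
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def platt_predict(score, a, b):
    """Calibrated probability for a single uncalibrated score."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))
```

On a toy validation set where candidates scored 0.9 are correct 75% of the time and candidates scored 0.3 are correct 25% of the time, the fitted sigmoid maps 0.9 to roughly 0.75 and 0.3 to roughly 0.25.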

A.3.2 NER Experiments

For the CoNLL dataset, the “testa” file was reserved for validation and “testb” for test. For MADE 1.0 (Jagannatha et al., 2019), since a validation split was not provided, we randomly selected 10% of the training data as validation data. The length-normalized marginal probability for a span starting at position $i$ and of length $l$ is estimated as

$$\hat{p} = \frac{1}{M} \sum_{m=1}^{M} p_m(y_i, y_{i+1}, \dots, y_{i+l-1})^{\frac{1}{l}}$$

where $M$ is the number of MC-Dropout samples. We use this as the input to both the baseline and proposed models. We observe that this kind of normalization improves the calibration of baseline and proposed models. We do not normalize the probabilities while reporting the ECE of uncalibrated models. We use BIO tags for training. While decoding, we also allow spans that start with an “I-” tag.
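The relaxed decoding rule, in which a span may begin with an “I-” tag, can be sketched as below. This is one plausible reading of that rule: an “I-” tag opens a new span whenever no span with the same label is currently open. The function name `decode_spans` and the tag labels in the usage note are illustrative.

```python
def decode_spans(tags):
    """Return (start, end_exclusive, label) spans from a list of BIO tags.

    An "I-X" tag continues an open span labelled X; otherwise it starts
    a new span, just as a "B-X" tag would.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        prefix, _, tag_label = tag.partition("-")
        if prefix == "I" and start is not None and label == tag_label:
            continue  # still inside the current span
        if start is not None:
            spans.append((start, i, label))
            start, label = None, None
        if prefix in ("B", "I"):
            start, label = i, tag_label
    return spans
```

With the tags `["O", "I-PER", "I-PER", "O", "B-LOC", "I-LOC", "B-LOC"]`, the decoder returns three spans: a PER span that begins with an “I-” tag, followed by two adjacent LOC spans.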

A.3.3 QA experiments

We use three datasets for our QA experiments: SQuAD 1.1, EMRQA and MADE 1.0. Our main aim in these experiments is to understand the behaviour of calibration and not the complexity of the tasks themselves. Therefore, we restrict the passage lengths of the EMRQA and MADE 1.0 datasets to be similar to SQuAD 1.1. We pre-process the passages from EMRQA to remove unannotated answer span instances and reduce the passage length to 20 sentences. EMRQA provides multiple question templates for the same question type (referred to as logical form in Pampari et al. (2018)). For each annotation, we randomly sample 3 question templates for our QA experiments. This is done to ensure that question types that have multiple question templates are not over-represented in the data. For example, the question type for “Does he take anything for her |problem|” has 49 available templates, whereas “How often does the patient take |medication|” has only one. If the question type does not have 3 available templates, we up-sample. For more details please refer to Pampari et al. (2018).
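The template sampling step can be illustrated with the short sketch below. It assumes that up-sampling means keeping every available template and repeating randomly chosen ones until three are reached; the function name `sample_templates` and the fixed default seed are choices made for the example.

```python
import random

def sample_templates(templates, k=3, rng=None):
    """Pick k question templates for one annotation.

    If at least k templates exist, sample k without replacement.
    Otherwise keep all of them and repeat random ones to reach k.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the example repeatable
    if len(templates) >= k:
        return rng.sample(templates, k)
    extra = [rng.choice(templates) for _ in range(k - len(templates))]
    return list(templates) + extra
```

Every annotation then contributes exactly three question instances, regardless of whether its question type has one template or several dozen.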

EMRQA is a QA dataset constructed from named entity and relation annotations from clinical i2b2 datasets, consisting of adverse event, medication and risk related questions (Pampari et al., 2018). We also aim to test the performance of our calibration method on out-of-domain test data. To do so, we construct a QA dataset from the clinical named entity and relation dataset MADE 1.0, using the questions and the dataset construction procedure followed in EMRQA. This gives us two QA datasets with common question forms but different text distributions. This experimental setup enables us to evaluate how a QA system would perform when deployed on a new text corpus. It corresponds to the application scenario where a fixed set of questions (such as the Adverse event questionnaire (Naranjo et al., 1981)) is to be answered for clinical records from different sources. Both EMRQA and MADE 1.0 are constructed from clinical documents. However, the documents themselves have different structure and language due to their different clinical sources, thereby mimicking the real-world application scenarios of clinical QA systems.

MADE QA Construction

MADE 1.0 (Jagannatha et al., 2019) is an NER and relation dataset that has similar annotations to the “relations” and “medication” i2b2 datasets used in EMRQA. EMRQA uses an automated procedure to construct questions and answers from NER and relation annotations. We replicate the automated QA construction followed by Pampari et al. (2018) on the MADE 1.0 dataset to obtain a corresponding QA dataset. For this construction, we use question templates whose annotations are common to both the MADE 1.0 and EMRQA datasets. Examples of common questions are in Table 7. A full list of questions in MADE 1.0 QA is in the “question_templates.csv” file included in the supplementary materials. The dataset splits for EMRQA and MADE QA are provided in Table 8.

Input      Output     Example Question Form
Problem    Treatment  How does the patient manage her |problem|
Treatment  Problem    Why is the patient on |treatment|
Problem    Problem    Has the patient ever been diagnosed or treated for |problem|
Drug       Drug       Has patient ever been prescribed |medication|
Table 7: Examples of questions that are common in EMRQA and MADE QA datasets.
Dataset Name  Train  Validation  Test
EMRQA         74414  8870        9198
MADE QA       99496  14066       21309
Table 8: Dataset size for the MADE dataset QA pairs that were constructed using guidelines from EMRQA. EMRQA dataset splits are also provided for comparison.
Forecaster features

Since we only consider single-span answer predictions, we require a constant number of predictions (two: the answer start and answer end token indices) for this task. Therefore we do not use the “ln” feature in this task. The uncalibrated probability of an event is normalized as follows and then used as input to all calibration models.

$$\hat{p} = \frac{1}{M} \sum_{m=1}^{M} p_m(y_{\mathrm{start}}, y_{\mathrm{end}})^{\frac{1}{2}}$$

Here $M$ is the number of MC-Dropout samples. Unlike the previous tasks, extractive QA with single-span output does not have a varying number of output predictions for each data sample. It always predicts only the start and end indices of the span. Therefore using the length-normalized (where length is always 2) uncalibrated output does not significantly affect the calibration of baseline models. However, we use the length-normalized uncalibrated probability as our input feature to keep our base set of features consistent throughout the tasks. Additionally, in extractive QA tasks with non-contiguous spans, the number of output predictions can vary and be higher than 2. In such cases, based on our results on POS and NER, the length-normalized probability may prove to be more useful. The “Var” and “Rank” features are estimated as described in previous tasks.