Several modern machine-learning based Natural Language Processing (NLP) systems can provide a confidence score with their output predictions. This score can be used as a measure of predictor confidence. A well-calibrated confidence score is a probability measure that is closely correlated with the likelihood of model output’s correctness. As a result, NLP systems with calibrated confidence can predict when their predictions are likely to be incorrect and therefore should not be trusted. This property is necessary for the responsible deployment of NLP systems in safety-critical domains such as healthcare and finance. Calibration of predictors is a well-studied problem in Machine Learning (Guo et al., 2017; Platt et al., 1999); however, widely used methods in this domain are often defined as binary or multi-class problems (Naeini et al., 2015; Nguyen and O’Connor, 2015). The structured output schemes of NLP tasks such as information extraction (IE) (Sang and De Meulder, 2003) and extractive question answering (Rajpurkar et al., 2018) have an output space that is often too large for standard multi-class calibration schemes. We define a general calibration scheme for such tasks. Our calibration method is defined as a post-processing step for neural network architectures such as bert based models (Devlin et al., 2018) and uses model uncertainty estimation methods (Gal and Ghahramani, 2016) to not only calibrate but also boost the performance of the underlying learned models.
We study NLP models that provide a posterior probability $p(y|x)$ for an output $y$ given input $x$. The output can be a label sequence in case of part-of-speech (POS) or named entity recognition (NER) tasks, or a span prediction in case of extractive question answering (QA) tasks, or a relation prediction in case of relation extraction task. The posterior probability provided by the model can be used as a measure of the model’s confidence in its prediction. However, $p(y|x)$ is often a poor estimate of model confidence for the output $y$. The output space of the model in sequence-labelling tasks is often very large, and therefore $p(y|x)$ for any output $y$ will be small. For instance, in a sequence labelling task with $C$ classes and a sequence length of $L$, the possible events in output space will be of the order of $C^L$. Additionally, recent efforts (Guo et al., 2017; Nguyen and O’Connor, 2015; Dong et al., 2018; Kumar and Sarawagi, 2019) at calibrating machine learning models have shown that they are poorly calibrated. Empirical results from Guo et al. (2017) show that techniques used in deep neural networks such as dropout and their large architecture size can negatively affect the calibration of their outputs in binary and multi-class classification tasks.
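To make the size argument above concrete, the following minimal sketch counts the label sequences for a given number of classes and sequence length, and shows how small the probability of any single sequence becomes even under a uniform distribution. The tag-set size and sentence length used here are illustrative assumptions, not values from our experiments.

```python
import math


def output_space_size(num_classes: int, seq_len: int) -> int:
    """Number of distinct label sequences of length seq_len over num_classes labels."""
    return num_classes ** seq_len


def uniform_sequence_log_prob(num_classes: int, seq_len: int) -> float:
    """Log-probability that each sequence would receive under a uniform distribution."""
    return -seq_len * math.log(num_classes)


# Hypothetical tag-set size and sentence length, chosen only for illustration.
size = output_space_size(num_classes=9, seq_len=20)
print(f"{size:.3e} possible label sequences")
print(f"uniform log-prob per sequence: {uniform_sequence_log_prob(9, 20):.2f}")
```

Even for this modest setting the output space is far too large to calibrate every event individually, which motivates restricting attention to a small set of events.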
Large neural network architectures based on contextual embeddings (Devlin et al., 2018; Peters et al., 2018) have shown state-of-the-art performance across several NLP tasks (Andrew and Gao, 2007; Wang et al., 2019) and are being rapidly adopted for information extraction and other NLP tasks in safety-critical applications (Zhu et al., 2018; Sarabadani, 2019; Li et al., 2019; Lee et al., 2019). Development of efficient post-processing calibration methods is imperative for the safe deployment of such large neural network based models.
In this study, we demonstrate that neural network models show high calibration errors for NLP tasks such as POS, NER and QA. We extend the work by Kuleshov and Liang (2015) to define well-calibrated forecasters for output entities of interest in structured prediction of NLP tasks. We provide a novel calibration method that applies to a wide variety of NLP tasks and can be used to produce model confidences for specific output entities instead of the complete label sequence prediction. We provide a general scheme for designing manageable and relevant output spaces for such problems. We show that our methods lead to improved calibration performance on a variety of benchmark NLP datasets. Our model also leads to improved out-of-domain calibration performance as compared to baseline methods, suggesting that our calibration methods can generalize well.
Lastly, we propose a procedure to use our calibrated confidence scores to re-score the predictions in our defined output event space. This procedure can be interpreted as a scheme to combine model uncertainty scores and entity-specific features with decoding methods like Viterbi. We show that this re-scoring leads to consistent improvement in model performance across several tasks at no additional training or data requirements.
2 Calibration framework for Structured Prediction NLP models
Structured Prediction refers to the task of predicting a structured output $y = [y_1, y_2, \ldots, y_L]$ for an input $x$. In NLP, a wide array of tasks including parsing, information extraction, and extractive question answering fall within this category. Recent approaches towards solving such tasks are commonly based on neural networks that are trained by minimizing the following objective:

$$\mathcal{L}(\theta \mid \mathcal{D}) = -\sum_{i} \log p_\theta\big(y^{(i)} \mid x^{(i)}\big) + R(\theta)$$

where $\theta$ is the parameter vector of the neural network, $R$ is the regularization penalty, and $\mathcal{D}$ is the dataset $\{(x^{(i)}, y^{(i)})\}$. The trained model $p_\theta$ can then be used to produce the output $\hat{y} = \arg\max_{y \in \mathcal{Y}} p_\theta(y \mid x)$. Here, the corresponding model probability $p_\theta(\hat{y} \mid x)$ is the uncalibrated confidence score.
In binary class classification, the output space $\mathcal{Y}$ is $\{0, 1\}$. The confidence score for such classifiers can then be calibrated by training a calibrator $f_c$ which takes in the model confidence to produce a recalibrated score (Platt et al., 1999). A widely used method for binary class calibration is Platt scaling, where $f_c$ is a logistic regression model. Similar methods have also been defined for multi-class classification (Guo et al., 2017). However, extending this to structured prediction in NLP settings is non-trivial since the output space is often too large for us to calibrate the output probabilities of all events.
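As an illustration of binary Platt scaling, the sketch below fits a one-dimensional logistic regression that maps a raw confidence score to a recalibrated one. It is a minimal pure-Python sketch, not the implementation used in our experiments: the gradient-descent fitting routine, the function names, and the toy data are all assumptions made for this example.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit sigmoid(a * s + b) to binary labels by gradient descent on the log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of the log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score: float, a: float, b: float) -> float:
    """Apply the fitted calibrator to a raw confidence score."""
    return sigmoid(a * score + b)


# Toy held-out data (invented): raw confidences and whether the prediction was correct.
scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.55, 0.40, 0.30]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
a, b = fit_platt(scores, labels)
print([round(calibrate(s, a, b), 3) for s in scores])
```

The calibrator has only two parameters, so it can be fit on a small held-out set; the difficulty in the structured setting is deciding which events to feed it, not the fitting itself.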
2.2 Related Work
Calibration methods for binary/multi-class classification have been widely studied in related literature (Bröcker, 2009; Guo et al., 2017). Recent efforts at confidence modeling for NLP have focused on several tasks like co-reference (Nguyen and O’Connor, 2015), semantic parsing (Dong et al., 2018) and neural machine translation (Kumar and Sarawagi, 2019).
2.3 Calibration in Structured Prediction
In this section, we define the calibration framework by Kuleshov and Liang (2015) in the context of structured prediction problems in NLP. The model $p_\theta$ denotes the neural network that produces a conditional probability $p_\theta(y \mid x)$ given an $(x, y)$ tuple. In a multi/binary class setting, a function $F_y$ is used to map the output $p_\theta(y \mid x)$ to a calibrated confidence score for all $y \in \mathcal{Y}$. In a structured prediction setting, since the cardinality of $\mathcal{Y}$ is usually large, we instead focus on the subset $\mathcal{I}(x)$. $\mathcal{I}(x)$ contains events of interest $E$, which are defined based on the output events relevant to the deployment requirements of the model. There can be several different schemes to define $\mathcal{I}(x)$. In later sections, we discuss related work on calibration that can be understood as applications of different schemes. In this work, we define a general framework for constructing $\mathcal{I}(x)$ for NLP tasks which allows us to maximize calibration performance on output entities of interest.
We define $F_y(E, x, p_\theta)$ to be a function that takes the event $E$, the input feature $x$ and $p_\theta$ to produce a confidence score in $[0, 1]$. We refer to this calibration function as the forecaster and use $F_y(E, x)$ as a shorthand, since it is implicit that $F_y$ depends on outputs of $p_\theta$. We would like to find the forecaster that minimizes the discrepancy between $F_y(E, x)$ and $P(y \in E \mid x)$ for $(x, y)$ sampled from $P(x, y)$ and $E$ uniformly sampled from $\mathcal{I}(x)$.
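A commonly used methodology for constructing a forecaster for $p_\theta$ is to train it on a held-out dataset $\mathcal{D}_{dev}$. A forecaster for a binary classifier is perfectly calibrated if

$$P\big(y = 1 \mid F_y(x) = c\big) = c.$$

It is trained on samples from $\{(x, \mathbb{I}(y = 1)) : (x, y) \in \mathcal{D}_{dev}\}$. For our forecaster based on $\mathcal{I}(x)$, perfect calibration would imply that

$$P\big(y \in E \mid F_y(E, x) = c\big) = c.$$

The training data samples for our forecaster are $\{(x, \mathbb{I}(y \in E)) : E \in \mathcal{I}(x),\ (x, y) \in \mathcal{D}_{dev}\}$.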
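The construction of forecaster training samples can be sketched as follows. This is a rough illustration under stated assumptions: entities are encoded as hypothetical `(start, end, label)` tuples, an event is treated as positive when the gold output contains the predicted entity, and the function names and toy data are invented for the example.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label); a hypothetical entity encoding


def forecaster_examples(predicted_entities: List[Span], gold_entities: List[Span]):
    """Pair each predicted entity with a binary label: 1 if the gold output contains it."""
    gold = set(gold_entities)
    return [(entity, int(entity in gold)) for entity in predicted_entities]


def build_training_set(dev_set):
    """dev_set: iterable of (predicted_entities, gold_entities), one pair per held-out sample."""
    examples = []
    for predicted, gold in dev_set:
        examples.extend(forecaster_examples(predicted, gold))
    return examples


# Invented held-out samples: the second predicted span of the first sample is wrong.
dev_set = [
    ([(0, 1, "PER"), (3, 4, "LOC")], [(0, 1, "PER"), (3, 5, "LOC")]),
    ([(2, 2, "ORG")], [(2, 2, "ORG")]),
]
for entity, label in build_training_set(dev_set):
    print(entity, label)
```

In practice each entity would be paired with its feature vector rather than the raw span; the labelling rule is the part this sketch is meant to show.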
2.4 Construction of Event of Interest set
The main contributions of this paper stem from our proposed schemes for constructing the aforementioned $\mathcal{I}(x)$ sets for NLP applications. In the interest of brevity, let us define “Entities of interest” $\phi(x)$ as the set of all entity predictions that can be queried from $p_\theta$ for a sample $x$. For instance, in the case of answer span prediction for QA, $\phi(x)$ may contain the MAP prediction of the best answer span (answer start and end indexes). In a parsing or sequence labeling task, $\phi(x)$ may contain the top-$k$ label sequences obtained from Viterbi decoding. In a relation or named-entity extraction task, $\phi(x)$ contains the relation or named entity span predictions respectively. Each entity $s$ in $\phi(x)$ corresponds to an event set $E$ that is defined by all outputs in $\mathcal{Y}$ that contain the entity $s$. $\mathcal{I}(x)$ contains the set $E$ for all entities in $\phi(x)$. We are interested in providing a calibrated probability of the event $y \in E$ for all $s$ in $\phi(x)$. If the correct output sequence $y$ lies in the set $E$ for an entity $s$, we refer to $s$ as a positive entity and the event as a positive event. In the example of named entity recognition, $s$ may refer to a predicted entity span, and $E$ refers to all possible sequences in $\mathcal{Y}$ that contain the predicted span. The corresponding event is positive if the correct label sequence contains the span prediction $s$.
While constructing the set $\phi(x)$ we should ensure that it is limited to a relatively small number of output entities, while still covering as many positive events in $\mathcal{I}(x)$ as possible. To explain this consideration, let us take the example of a parsing task such as syntax or semantic parsing. Two possible schemes for defining $\mathcal{I}(x)$ are:
Scheme 1: $\phi(x)$ contains the MAP label sequence prediction. $\mathcal{I}(x)$ contains the event corresponding to whether the label sequence $y' = \arg\max_{y} p_\theta(y \mid x)$ is correct.
Scheme 2: $\phi(x)$ contains all possible label sequences. $\mathcal{I}(x)$ contains an event corresponding to whether the label sequence $y'$ is correct, for all $y' \in \mathcal{Y}$.
Calibration of model confidence by Dong et al. (2018) can be viewed as Scheme 1, where the entity of interest is the MAP label sequence prediction. In contrast, using Platt Scaling in a one-vs-all setting for multi-class classification (Guo et al., 2017) can be seen as an implementation of Scheme 2, where the entity of interest is the presence of a class label. As discussed in previous sections, Scheme 2 is too computationally expensive for our purposes due to the large value of $|\mathcal{Y}|$. Scheme 1 is computationally cheaper, but it has lower coverage of positive events. For instance, a sequence labelling model with a 60% accuracy at sentence level means that only 60% of positive events are covered by the set $\phi(x)$ corresponding to MAP predictions. In other words, only 60% of the correct outputs of the model will be used for constructing the forecaster. This can limit the positive events in $\mathcal{I}(x)$. Including the top-$k$ predictions in $\phi(x)$ may increase the coverage of positive events and therefore increase the positive training data for the forecaster. The optimum choice of $k$ involves a trade-off. A larger value of $k$ implies broader coverage of positive events and more positive training data for the forecaster training. However, it may also lead to an unbalanced training dataset that is skewed in favour of negative training examples.
Task specific details about $\phi(x)$ are provided in the later sections. For the purposes of this paper, top-$k$ refers to the top $k$ MAP sequence predictions, also referred to as $\arg\max(k)$.
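The trade-off in the choice of the number of retained predictions can be illustrated with a small sketch that measures, for a given cut-off, how many gold sequences are covered and what fraction of the resulting candidate events is positive. The function name, the tuple encoding of tag sequences, and the toy predictions are assumptions made for this example.

```python
def coverage_and_balance(top_k_predictions, gold_sequences, k):
    """Return (coverage, positive_fraction) when only the top-k sequences are kept.

    coverage: fraction of samples whose gold sequence appears among the kept candidates.
    positive_fraction: fraction of kept candidate events that are positive.
    """
    covered = positives = total = 0
    for candidates, gold in zip(top_k_predictions, gold_sequences):
        kept = candidates[:k]
        hits = sum(1 for seq in kept if seq == gold)
        covered += int(hits > 0)
        positives += hits
        total += len(kept)
    return covered / len(gold_sequences), positives / total


# Invented ranked predictions for three sentences (tag sequences as tuples).
preds = [
    [("D", "N"), ("D", "V"), ("N", "N")],
    [("N", "V"), ("N", "N"), ("V", "V")],
    [("V", "N"), ("D", "N"), ("N", "N")],
]
gold = [("D", "N"), ("N", "N"), ("A", "N")]
for k in (1, 2, 3):
    print(k, coverage_and_balance(preds, gold, k))
```

On this toy data, moving from one to two candidates doubles coverage, while a third candidate adds only negative events and lowers the positive fraction, which is the imbalance described above.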
2.5 Forecaster Construction
Here we provide a summary of the steps involved in forecaster construction. Remaining details are in the Appendix. We train the neural network model on the training data split for a task and use the validation data for monitoring the loss and early stopping. After the training is complete, this validation data is re-purposed to create the forecaster training data. We use MC-Dropout (Gal and Ghahramani, 2016) to obtain top-$k$ label sequence predictions. We then collect the relevant entities in $\phi(x)$, along with the labels, to form the training data for the forecaster.
|Calibrated Mean+top2||2.94 ± .29||4.82 ± .15|
Table 1: ECE values, evaluated on MAP predictions on the test data. ECE standard deviation is estimated by repeating the experiments for 5 repetitions. ECE for uncalibrated bert and bert+CRF models is 35.11% and 33.72% respectively. Choice of $k$ was decided based on Section 2.5.
2.6 Feature Construction for Calibration
We use three categories of features as inputs to our forecaster.
Model based features contain the mean probability obtained by averaging over the marginal probability of the “entity of interest” obtained from 10 MC-dropout samples of $p_\theta$. The average of marginal probabilities acts as a reduced variance estimate of un-calibrated model confidence. Our experiments use the pre-trained bert architecture as the backbone network, which uses dropout layers. We obtain MC-Dropout samples by enabling dropout sampling for all layers. We also provide percentile and marginal probability values from the MC-Dropout samples, to provide prediction uncertainty information to the forecaster. Since our forecaster training data contains entity predictions from top-$k$ MAP predictions, we also include the rank $k$ as a feature. We refer to these two features as “Var” and “Rank” in our models.
Entity of interest based features contain the length of the entity span if the output task is named entity. We only use this feature in the NER experiments and refer to it as “ln”.
Data Uncertainty based features: Dong et al. (2018) propose the use of language modelling (LM) and OOV-word-based features for estimating data uncertainty. The use of word-pieces and large pre-training corpora in bert may affect the efficacy of LM based features. Nevertheless, we use LM perplexity (referred to as “lm”) in the QA task to investigate its effectiveness as an indicator of the distributional shift in data. The use of word-pieces in models like bert reduces the negative effect of OOV words on model prediction. Therefore, we do not include OOV features in our experiments.
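The model-based features described above can be sketched as a summary over the MC-Dropout marginals of a single entity. This is a minimal sketch, not our implementation: the percentile levels (`low_q`, `high_q`), the interpolation rule, the dictionary keys, and the sample values are assumptions chosen for illustration.

```python
import statistics


def percentile(sorted_values, q):
    """Linear-interpolation percentile of an ascending list, with q in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def uncertainty_features(mc_probs, rank, low_q=10, high_q=90):
    """Summarise the MC-Dropout marginals of one entity into forecaster features."""
    ordered = sorted(mc_probs)
    return {
        "mean": statistics.fmean(mc_probs),      # reduced-variance confidence estimate
        "var": statistics.pvariance(mc_probs),   # spread across dropout samples
        "p_low": percentile(ordered, low_q),     # assumed lower percentile level
        "p_high": percentile(ordered, high_q),   # assumed upper percentile level
        "rank": rank,                            # position of the source MAP prediction
    }


# Ten invented marginal probabilities for one predicted entity.
samples = [0.81, 0.78, 0.90, 0.66, 0.85, 0.79, 0.88, 0.73, 0.84, 0.80]
print(uncertainty_features(samples, rank=1))
```

Two entities with the same mean but different variance produce different feature vectors, which is what lets the forecaster separate stable predictions from unstable ones.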
3 Experiments and Results
We use bert-base (Devlin et al., 2018) as the network architecture for our experiments. The validation split for each dataset was used both for early stopping during bert fine-tuning and as training data for the forecaster. POS experiments are evaluated on Penn Treebank, and NER experiments on CoNLL 2003 (Sang and De Meulder, 2003) and MADE 1.0 (Jagannatha et al., 2019). QA experiments are evaluated on the SQuAD1.1 (Rajpurkar et al., 2018) and EMRQA (Pampari et al., 2018) corpora. We also investigate the performance of our forecasters on an out-of-domain QA corpus constructed by applying the EMRQA QA data generation scheme (Pampari et al., 2018) on the MADE 1.0 named entity and relations corpus. Details for these datasets are provided in their relevant sections.
We use the expected calibration error (ECE) metric defined by Naeini et al. (2015) with bins (Guo et al., 2017) to evaluate the calibration of our models. ECE is defined as an estimate of the expected difference between the model confidence and accuracy. ECE has been used in several related works (Guo et al., 2017; Maddox et al., 2019; Kumar et al., 2018; Vaicenavicius et al., 2019) to estimate model calibration. We use Platt scaling as the baseline calibration model. It uses the length-normalized probability averaged across MC-Dropout samples as the input. The lower variance and length invariance of this input feature make Platt Scaling a strong baseline. We also use a “Calibrated Mean” baseline using Gradient Boosted Decision Trees as our estimator with the same input feature as Platt.
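For readers who want a concrete reference, the ECE metric can be computed along the following lines. This is a minimal sketch assuming equal-width bins over [0, 1]; the function name and the default of 10 bins are assumptions for illustration and do not necessarily match the setting used in our experiments.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE with equal-width bins: weighted mean of |accuracy - confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Invented predictions: two confident ones (one wrong) and two unconfident wrong ones.
confidences = [0.95, 0.95, 0.15, 0.15]
correct = [1, 0, 0, 0]
print(expected_calibration_error(confidences, correct))
```

A perfectly calibrated forecaster has an ECE of zero; an over-confident one accumulates error in the high-confidence bins.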
Micro-avg f-score for POS datasets using the baseline and our best proposed calibration method. The confidence score from the calibration method is used to re-rank the events and the top selection is chosen. Standard deviation is estimated by repeating the experiments for 5 repetitions. Baseline refers to MC-dropout averaged (sample-size=10) output from the model $p_\theta$.
3.1 Calibration for Part-of-Speech Tagging
Part-of-speech (POS) tagging is a sequence labelling task where the input is a text sentence, and the output is a sequence of syntactic tags. We evaluate our method on the Penn Treebank dataset (Marcus et al., 1994). We can define either the token prediction or the complete sequence prediction as the entity of interest. Since using a token level entity of interest effectively reduces the calibration problem to that of calibrating a multi-class classification, we study the case where the predicted label sequence of the entire sentence forms the entity of interest set. The event of interest set is defined by the events which denote whether each top-$k$ sentence level MAP prediction is correct. We use two models, namely bert and bert-CRF. We use model uncertainty and rank based features for our POS experiments.
Table 1 shows the ECE values for our baseline, proposed and ablated models. We use the heuristic described in Section 2.4 to select the value of $k$ for top-$k$ selection. “top2” in Table 1 refers to forecasters trained with additional top-2 predictions. ECE for un-calibrated bert and bert+CRF models is 35.11% and 33.72% respectively. Our methods outperform both baselines by a large margin. Both “Rank” and “Var” features help in improving model calibration. Inclusion of top-2 prediction sequences also improves the calibration performance significantly.
We use the confidence predictions of our full-feature model +Rank+Var+top2 to re-rank the top-$k$ predictions in the test set. Table 2 shows the sentence-level (entity of interest) accuracy for our re-ranked top prediction and the original model prediction.
3.2 Calibration for Named Entities
For Named Entity Recognition experiments, we use two NE annotated datasets, CoNLL 2003 and MADE 1.0. CoNLL consists of documents from the Reuters corpus annotated with named entities such as Person, Location etc. The MADE 1.0 dataset is composed of electronic health records annotated with clinical named entities such as Medication, Indication and Adverse effects.
The entity of interest for NER is the named entity span prediction. We define $\phi(x)$ as predicted entity spans in the top-$k$ label sequence predictions for $x$. We use bert-base with token-level softmax output and marginal likelihood based training. The model uncertainty estimates for the “Var” feature are computed by estimating the variance of length normalized MC-dropout samples of span marginals. Due to the similar trends in behavior of the bert and bert+CRF models in POS experiments, we only use the bert model for NER. However, the span marginal computation can be easily extended to linear-chain CRF models. We also use the length of the predicted named entity as the feature “ln” in this experiment. Complete details about the forecaster and baselines are in the Appendix. Based on the heuristic defined in Section 2.4, we use a value of $k = 3$ since it resulted in lower ECE for top-1 span predictions on forecaster training data. Table 3 shows ECE results for NE span predictions. We see that our proposed methods perform better than the baselines for both datasets. The ECE for uncalibrated span marginals is 3.68% and 5.59% for CoNLL and MADE 1.0 datasets respectively.
|Calibration||SQuAD1.1||EMRQA||MADE 1.0||MADE 1.0(OOD)|
We use the confidence predictions of our “+Rank+Var+ln+top3” model to re-score the confidence predictions for all spans predicted in top-$k$ MAP predictions for samples in the test set. A threshold of 0.5 was used to remove span predictions with low confidence scores. Table 4 shows the Named Entity level (entity of interest) Micro-F score for our re-ranked top prediction and the original model prediction. We see that re-ranked predictions from our models consistently improve the performance of predictions.
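The re-scoring step can be sketched as below: every candidate span is scored by the forecaster, spans under the threshold are dropped, and the rest are ordered by calibrated confidence. This is only an illustrative sketch. The `toy_forecaster` is a hypothetical stand-in for a trained forecaster, its coefficients and the candidate feature values are invented, and treating the threshold as inclusive is an assumption.

```python
def rescore_spans(candidates, forecaster, threshold=0.5):
    """Keep candidate spans whose calibrated confidence reaches the threshold.

    candidates: list of (span, features); forecaster: callable mapping features to [0, 1].
    Returns (span, confidence) pairs sorted from most to least confident.
    """
    scored = [(span, forecaster(features)) for span, features in candidates]
    kept = [(span, conf) for span, conf in scored if conf >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


def toy_forecaster(features):
    """Hypothetical stand-in for a trained forecaster: penalise variance and rank."""
    score = features["mean"] - 2.0 * features["var"] - 0.05 * (features["rank"] - 1)
    return max(0.0, min(1.0, score))


# Invented candidates: the longer span has a higher raw mean but a higher variance.
candidates = [
    ("off-spinner Such", {"mean": 0.62, "var": 0.09, "rank": 1}),
    ("Such", {"mean": 0.58, "var": 0.01, "rank": 2}),
]
print(rescore_spans(candidates, toy_forecaster))
```

With these invented numbers the high-variance span falls below the threshold and only the lower-ranked span survives, which mirrors the kind of re-ranking the forecaster is intended to perform.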
3.3 Calibration for QA Models
We use three datasets for evaluation of our calibration methods on the QA task. Our QA tasks are modeled as extractive QA with single-span answer predictions. SQuAD1.1 and EMRQA (Pampari et al., 2018) are open-domain and clinical-domain QA datasets, respectively. We process the EMRQA dataset by restricting the passage length and removing unanswerable questions. We also design an out-of-domain evaluation of calibration using clinical QA datasets. We follow the guidelines from Pampari et al. (2018) to create a QA dataset from MADE 1.0 (Jagannatha et al., 2019). This allows us to have two QA datasets with common question forms, but different text distributions. Details about dataset pre-processing and construction are provided in the Appendix.
The entities of interest for QA are the top-$k$ answer span predictions. We use the “lm” perplexity as a feature in this experiment to analyze its behaviour in out-of-domain evaluations. We use a layer unidirectional LSTM to train a next word language model on the EMRQA passages. This language model is then used to compute the perplexity of a sentence for the “lm” input feature to the forecaster. We use the same baselines as the previous two tasks. The uncalibrated ECE values for SQuAD1.1, EMRQA, MADE 1.0 and MADE 1.0 (Out-of-Domain) are 6.24%, 6.10%, 20.10% and 18.70% respectively. Our methods outperform the baselines by a large margin in both in-domain and out-of-domain experiments. Our “+var+rank+top3” and “+var+rank+lm+top3” models consistently outperform the baselines. We also see an improvement in model accuracy for all QA tasks, even out-of-domain. Our models are evaluated on the SQuAD1.1 dev set, and on test sets from EMRQA and MADE 1.0.
Our proposed methods outperform the baselines in most tasks and are almost as competitive in others.
Features and top-$k$ samples:
The inclusion of top-$k$ features improves the performance in almost all tasks when the rank of the prediction is included. We see large increases in calibration error when the top-$k$ prediction samples are included in forecaster training without including the rank information in tasks such as CoNLL NER and MADE 1.0 QA. This may be because the top-$k$ predictions may have similar model confidence and uncertainty values. Therefore a more discriminative signal such as rank is needed to prioritize them. For instance, the top-1 and top-2 MAP predictions for POS tagging may differ by only one or two tokens. In a sentence of length or more, this difference in probability when normalized by length would amount to very small shifts in the overall model confidence score. Therefore an additional input of top-$k$ leads to a substantial gain in performance for both bert and bert+CRF models in POS.
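The effect of length normalization on near-identical sequences can be checked with a short calculation. The sketch below assumes that length normalization means taking the geometric mean of token probabilities, which is one common choice and not necessarily the one used in our experiments; the token probabilities and the 20-token length are invented.

```python
import math


def length_normalized_prob(token_probs):
    """Geometric mean of token probabilities (exp of the mean token log-probability)."""
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))


# Invented example: two 20-token tag sequences that differ at a single token.
top1 = [0.99] * 19 + [0.55]
top2 = [0.99] * 19 + [0.45]
gap = length_normalized_prob(top1) - length_normalized_prob(top2)
print(f"token-level gap: {0.55 - 0.45:.2f}, length-normalized gap: {gap:.4f}")
```

A gap of 0.10 at the differing token shrinks by roughly an order of magnitude after normalization, so the two candidates look almost identical to a forecaster that sees only the normalized score, and rank supplies the missing signal.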
Our task-agnostic scheme of “var+rank+topk” based forecasters consistently outperforms or stays competitive with other forecasting methods. However, results from task-specific features such as “lm” and “ln” show that the use of task-specific features can further reduce the calibration error. Our experimental setup has the same set of questions in the in-domain and out-of-domain datasets. Only the data distribution for the answer passage is different. However, we do not observe an improvement in out-of-domain performance by using the “lm” feature. A more detailed analysis of task-specific features in QA with both data and question shifts is required. We leave further investigations of such schemes as our future work.
Re-scoring: We show that using our forecaster confidence to re-rank the entities of interest leads to a modest boost in model performance for the NER and QA tasks. In POS, we see no appreciable drop or gain in performance for either the bert or bert-CRF model. We believe this may be due to the already high token level accuracy (above 97%) on Penn Treebank data. Nevertheless, this suggests that our re-scoring does not lead to a degradation in model performance in cases where it is not effective.
Our forecaster re-scores the top-$k$ entity confidence scores based on model uncertainty score and entity-level features such as entity lengths. Intuitively, we want to prioritize predictions that have low uncertainty over high uncertainty predictions, if their uncalibrated confidence scores are similar. We provide an example of such re-ranking in Figure 1. It shows named entity span predictions for the correct span “Such”. The model produces two entity predictions, “off-spinner Such” and “Such”. The un-calibrated confidence score of “off-spinner Such” is higher than that of “Such”, but the variance of its prediction is higher as well. Therefore the +Rank+Var+ln+top3 forecaster re-ranks the second (and correct) prediction higher. It is important to note here that the variance of “off-spinner Such” may be higher just because it involves two token predictions as compared to only one token prediction in “Such”. This, along with the “ln” feature in +Rank+Var+ln+top3, may mean that the forecaster is also using length information along with uncertainty to make this prediction. However, we see similar improvements in QA tasks, where the “ln” feature is not used, and all entity predictions involve two predictions (span start and end index predictions). These results suggest that uncertainty features are useful in both calibration and re-ranking of predicted structured output entities.
Out-of-domain performance: Our experiments testing the performance of calibrated QA systems on out-of-domain data suggest that our methods result in improved calibration on unseen data as well. Additionally, our methods also lead to an improvement in system accuracy on out-of-domain data, suggesting that the mapping learned by the forecaster model is not specific to a dataset. However, there is still a large gap between the calibration error for within-domain and out-of-domain testing. This can be seen in the reliability plot shown in Figure 2. The number of samples in each bin is denoted by the radius of the scatter point. The calibrated models shown in the figure correspond to the “+var+rank+lm+top3” forecaster calibrated using both in-domain and out-of-domain validation datasets for forecaster training. We see that out-of-domain forecasters are over-confident and this behaviour is not mitigated by using data-uncertainty aware features like “lm”. This is likely due to a shift in the model’s prediction error when applied to a new dataset. Re-calibration of the forecaster using a validation set from the out-of-domain data seems to bridge the gap. However, we can see that the sharpness (Kuleshov and Liang, 2015) of the out-of-domain trained, in-domain calibrated model is much lower than that of the in-domain trained, in-domain calibrated one. Additionally, a validation dataset is often not available in the real world. Mitigating the loss in calibration and sharpness induced by out-of-domain evaluation is an important avenue for future research.
We show a new calibration and confidence based re-scoring scheme for structured output entities in NLP. We show that our calibration methods outperform competitive baselines on several NLP tasks. Our task-agnostic methods can provide calibrated model outputs of specific entities instead of the entire label sequence prediction. We also show that our calibration method can provide improvements to the trained model’s accuracy at no additional training or data cost. Our method is compatible with modern NLP architectures like bert. Lastly, we show that our calibration does not over-fit on in-domain data and is capable of generalizing the calibration to out-of-domain datasets.
- Andrew and Gao (2007) Galen Andrew and Jianfeng Gao. 2007. Scalable training of L1-regularized log-linear models. In Proceedings of the 24th International Conference on Machine Learning, pages 33–40.
- Bröcker (2009) Jochen Bröcker. 2009. Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society: A journal of the atmospheric sciences, applied meteorology and physical oceanography, 135(643):1512–1519.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Dong et al. (2018) Li Dong, Chris Quirk, and Mirella Lapata. 2018. Confidence modeling for neural semantic parsing. arXiv preprint arXiv:1805.04604.
- Friedman (2001) Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232.
- Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059.
- Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. 2017. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1321–1330. JMLR. org.
- Jagannatha et al. (2019) Abhyuday Jagannatha, Feifan Liu, Weisong Liu, and Hong Yu. 2019. Overview of the first natural language processing challenge for extracting medication, indication, and adverse drug events from electronic health record notes (made 1.0). Drug safety, 42(1):99–111.
- Kuleshov and Liang (2015) Volodymyr Kuleshov and Percy S Liang. 2015. Calibrated structured prediction. In Advances in Neural Information Processing Systems, pages 3474–3482.
- Kumar and Sarawagi (2019) Aviral Kumar and Sunita Sarawagi. 2019. Calibration of encoder decoder models for neural machine translation. arXiv preprint arXiv:1903.00802.
- Kumar et al. (2018) Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. 2018. Trainable calibration measures for neural networks from kernel mean embeddings. In International Conference on Machine Learning, pages 2810–2819.
- Lee et al. (2019) Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. Biobert: pre-trained biomedical language representation model for biomedical text mining. arXiv preprint arXiv:1901.08746.
- Li et al. (2019) Fei Li, Yonghao Jin, Weisong Liu, Bhanu Pratap Singh Rawat, Pengshan Cai, and Hong Yu. 2019. Fine-tuning bidirectional encoder representations from transformers (bert)–based models on large-scale electronic health record notes: An empirical study. JMIR Med Inform, 7(3):e14830.
- Maddox et al. (2019) Wesley Maddox, Timur Garipov, Pavel Izmailov, Dmitry Vetrov, and Andrew Gordon Wilson. 2019. A simple baseline for bayesian uncertainty in deep learning. arXiv preprint arXiv:1902.02476.
- Marcus et al. (1994) Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The penn treebank: annotating predicate argument structure. In Proceedings of the workshop on Human Language Technology, pages 114–119. Association for Computational Linguistics.
- Naeini et al. (2015) Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. 2015. Obtaining well calibrated probabilities using bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
- Naranjo et al. (1981) Cláudio A Naranjo, Usoa Busto, Edward M Sellers, P Sandor, I Ruiz, EA Roberts, E Janecek, C Domecq, and DJ Greenblatt. 1981. A method for estimating the probability of adverse drug reactions. Clinical Pharmacology & Therapeutics, 30(2):239–245.
- Nguyen and O’Connor (2015) Khanh Nguyen and Brendan O’Connor. 2015. Posterior calibration and exploratory analysis for natural language processing models. arXiv preprint arXiv:1508.05154.
- Pampari et al. (2018) Anusri Pampari, Preethi Raghavan, Jennifer Liang, and Jian Peng. 2018. emrqa: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732.
- Peters et al. (2018) Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
- Platt et al. (1999) John Platt et al. 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods.
- Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822.
- Sang and De Meulder (2003) Erik F Sang and Fien De Meulder. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.
- Sarabadani (2019) Sarah Sarabadani. 2019. Detection of adverse drug reaction mentions in tweets using elmo. In Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task, pages 120–122.
- Vaicenavicius et al. (2019) Juozas Vaicenavicius, David Widmann, Carl Andersson, Fredrik Lindsten, Jacob Roll, and Thomas B Schön. 2019. Evaluating model calibration in classification. arXiv preprint arXiv:1902.06977.
- Wang et al. (2019) Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537.
- Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
- Zhu et al. (2018) Henghui Zhu, Ioannis Ch Paschalidis, and Amir Tahmasebi. 2018. Clinical concept extraction with contextual word embedding. arXiv preprint arXiv:1810.10566.
Appendix A Appendices
a.1 Algorithm Details:
The forecaster construction algorithm is provided in the forecaster algorithm listing. The candidate events in that algorithm are obtained by extracting the top-$k$ label sequences for every output. The model’s logits are averaged over 10 MC-Dropout samples before being fed into the final output layer. We train the forecaster on the validation dataset from the task’s original split, which we use to construct both the training and the validation split for the forecaster. The training split contains all top-$k$ predicted entities. The validation split contains only the top-1 predicted entities.
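The averaging and candidate-extraction steps can be sketched as follows. This is a minimal illustration, not the authors’ implementation: a single classification decision stands in for a full label sequence, and the helper names (`mc_dropout_mean_logits`, `softmax`, `top_k`) and the toy logits are assumptions introduced here.

```python
import math

def mc_dropout_mean_logits(sample_logits):
    """Average per-class logits over a list of MC-Dropout forward passes."""
    n = len(sample_logits)
    return [sum(column) / n for column in zip(*sample_logits)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Indices of the k most probable candidates, best first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]

# Ten hypothetical stochastic forward passes over three classes; the small
# deterministic perturbations stand in for dropout noise.
base = [2.0, 0.5, -1.0]
samples = [[v + 0.1 * ((s + j) % 3 - 1) for j, v in enumerate(base)]
           for s in range(10)]

mean_logits = mc_dropout_mean_logits(samples)  # averaged before the output layer
probs = softmax(mean_logits)
candidates = top_k(probs, k=2)                 # candidate events for the forecaster
```

For sequence outputs, the same idea applies with a $k$-best decoder in place of `top_k`.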
a.2 Evaluation Details
We use the expected calibration error (ECE) score defined by Naeini et al. (2015) to evaluate our calibration methods. ECE estimates the expected absolute difference between model confidence and accuracy. It is calculated by binning the model outputs into $N$ bins ($N$ is fixed for our experiments) and then taking the weighted average, over all bins, of the absolute gap between accuracy and confidence. It is defined as
$$\mathrm{ECE} = \sum_{i=1}^{N} \frac{|B_i|}{n} \left| \mathrm{acc}(B_i) - \mathrm{conf}(B_i) \right|$$
where $N$ is the number of bins, $n$ is the total number of data samples, and $B_i$ is the $i$-th bin. The functions $\mathrm{acc}(\cdot)$ and $\mathrm{conf}(\cdot)$ calculate the accuracy and the model confidence for a bin.
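The ECE computation described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: bins are equal-width over [0, 1], and the default `num_bins=10` is an illustrative value, not the setting used in the experiments.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average over bins of |accuracy - mean confidence|.

    confidences: predicted probabilities in [0, 1]
    correct:     1 if the corresponding prediction was right, else 0
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width binning (an assumption); confidence 1.0 goes to the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        accuracy = sum(ok for _, ok in members) / len(members)
        mean_conf = sum(conf for conf, _ in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece

# Toy usage: four predictions spread over two bins.
score = expected_calibration_error([0.95, 0.95, 0.55, 0.55], [1, 1, 1, 0])
```

A perfectly calibrated model yields a score of 0; larger values indicate a larger gap between stated confidence and observed accuracy.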
a.3 Implementation Details
We use AllenNLP’s wrapper with HuggingFace’s Transformers code (https://github.com/huggingface/transformers) for our implementation. We use bert-base-cased (Wolf et al., 2019) weights as the initialization for general-domain datasets and bio-bert weights (Lee et al., 2019) as the initialization for clinical datasets. We use cased models for our analysis, since bio-bert (Lee et al., 2019) uses cased models. A common learning rate of 2e-5 was used for all experiments. We used the validation data splits provided by the datasets. In cases where a validation dataset was not provided, such as MADE 1.0, EMRQA or SQuAD 1.1, we use 10% of the training data as the validation data. We use a patience of 5 for early stopping, with each epoch consisting of 20,000 steps. We use the final evaluation metric instead of negative log likelihood (NLL) to monitor and early stop the training. This is to reduce the mis-calibration of the underlying model, since Guo et al. (2017) observe that neural nets overfit on NLL. The implementation for each experiment is provided in the following subsections.
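The early-stopping rule can be illustrated with a short sketch. It assumes a generic patience criterion on a higher-is-better validation metric (for example F1); the exact rule applied by the training framework may differ, and the function name and toy scores are hypothetical.

```python
def early_stopping_epoch(metric_per_epoch, patience=5):
    """Return (best_epoch, stop_epoch) for a higher-is-better validation metric.

    Training stops once `patience` consecutive epochs pass without a strict
    improvement over the best score seen so far.
    """
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(metric_per_epoch):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # Metric history ended before patience ran out.
    return best_epoch, len(metric_per_epoch) - 1

# Toy validation scores: the best checkpoint is epoch 2, and training
# halts five epochs later without ever reaching the final value.
scores = [0.80, 0.85, 0.86, 0.85, 0.84, 0.86, 0.85, 0.85, 0.90]
best, stopped = early_stopping_epoch(scores, patience=5)
```

Monitoring the task metric rather than NLL means the selected checkpoint is the one with the best task performance, not the one with the lowest validation likelihood loss.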
a.3.1 Part-of-speech experiments
We evaluate our method on the Penn Treebank dataset (Marcus et al., 1994). Our experiment uses the standard training (1-18), validation (19-21) and test (22-24) splits from the WSJ portion of the Penn Treebank dataset. The un-calibrated output of our model for a candidate label sequence $y = (y_1, \ldots, y_L)$ is estimated as
$$\hat{p}(y \mid x) = \frac{1}{M} \sum_{m=1}^{M} p_m(y_1, \ldots, y_L \mid x)^{1/L}$$
where $M$ is the number of dropout samples and $p_m$ is the model output under the $m$-th dropout sample. The $L$-th root accounts for different sentence lengths. Here $L$ is the length of the sentence. We observe that this kind of normalization improves the calibration of both baselines and proposed models. We do not normalize the probabilities while reporting the ECE of uncalibrated models. We use two choices of models, namely bert and bert+CRF. The bert-only model adds a linear layer to the output of the bert network and uses a softmax activation function to produce marginal label probabilities for each token. bert+CRF uses a CRF layer on top of unary potentials obtained from the bert network outputs.
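The effect of length normalization can be seen in a small sketch. It assumes that the root is applied to each dropout sample before averaging and that the model exposes sequence log-probabilities; the function name and toy numbers are hypothetical.

```python
import math

def length_normalized_confidence(sample_log_probs, length):
    """Mean over dropout samples of p(y|x) ** (1 / length).

    sample_log_probs: log p(y|x) of the same candidate sequence under each
                      MC-Dropout sample
    length:           number of tokens in the sentence
    """
    roots = [math.exp(log_p / length) for log_p in sample_log_probs]
    return sum(roots) / len(roots)

# Two sentences whose tokens are each predicted with probability 0.9.
# The raw sequence probabilities differ widely (0.9**5 vs 0.9**20), but the
# normalized scores are comparable.
short = length_normalized_confidence([5 * math.log(0.9)] * 3, length=5)
long_ = length_normalized_confidence([20 * math.log(0.9)] * 3, length=20)
```

Working in log space avoids underflow for long sentences, where the raw sequence probability can be extremely small.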
We use Platt scaling (Platt et al., 1999) as the baseline calibration model. Our Platt scaling model uses the MC-Dropout average of the length-normalized probability output of the model as input. The lower variance and length invariance of this input feature make Platt scaling a very strong baseline. We also use a “Calibrated Mean” baseline that uses Gradient Boosted Decision Trees as the estimator, with the same input feature as Platt scaling.
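As an illustration of the Platt scaling baseline, the sketch below fits a one-dimensional logistic model $\sigma(a s + b)$ on a scalar score $s$ by plain gradient descent on the log loss. The optimizer, learning rate, step count and toy data are assumptions made for this example and are not taken from the experiments.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, labels, lr=0.5, steps=2000):
    """Fit a, b so that sigmoid(a * score + b) approximates P(correct)."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Toy held-out data: uncalibrated scores and whether each prediction was correct.
scores = [0.2, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]
a, b = fit_platt(scores, labels)

def calibrated(score):
    """Calibrated confidence for a new uncalibrated score."""
    return sigmoid(a * score + b)
```

Because the map has only two parameters, it can be fit on a small validation set, but it can only rescale the input score monotonically.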
a.3.2 NER Experiments
For the CoNLL dataset, the “testa” file was reserved for validation and the “testb” file for testing. For MADE 1.0 (Jagannatha et al., 2019), since a validation split was not provided, we randomly selected 10% of the training data as validation data. The length-normalized marginal probability for a span starting at position $i$ and of length $l$ is estimated as
$$\hat{p}(y_i, \ldots, y_{i+l-1} \mid x) = \frac{1}{M} \sum_{m=1}^{M} p_m(y_i, \ldots, y_{i+l-1} \mid x)^{1/l}$$
where $M$ is the number of dropout samples and $p_m$ is the model output under the $m$-th dropout sample. We use this as the input to both the baseline and proposed models. We observe that this kind of normalization improves the calibration of baseline and proposed models. We do not normalize the probabilities while reporting the ECE of uncalibrated models. We use BIO tags for training. While decoding, we also allow spans that start with an “I-” tag.
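The relaxed decoding rule can be sketched as follows. This is one possible reading, not the authors’ decoder: an “I-” tag opens a new span whenever no span of the same label is currently open, and the function name and tuple layout are assumptions.

```python
def decode_spans(tags):
    """Extract (start, length, label) spans from a list of BIO tags.

    Besides the usual "B-" openings, an "I-" tag also opens a span when it
    does not continue an open span of the same label.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" closes any open span
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if start is not None and not continues:
            spans.append((start, i - start, label))
            start, label = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            start, label = i, name
    return spans

# The first span starts with "I-PER", and "I-ORG" after a LOC span opens a new span.
example = ["I-PER", "I-PER", "O", "B-LOC", "I-LOC", "I-ORG"]
spans = decode_spans(example)
```

Each decoded span supplies the start position and length needed to compute a length-normalized span probability.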
a.3.3 QA experiments
We use three datasets for our QA experiments: SQuAD 1.1, EMRQA and MADE 1.0. Our main aim in these experiments is to understand the behaviour of calibration and not the complexity of the tasks themselves. Therefore, we restrict the passage lengths of the EMRQA and MADE 1.0 datasets to be similar to SQuAD 1.1. We pre-process the passages from EMRQA to remove unannotated answer span instances and reduce the passage length to 20 sentences. EMRQA provides multiple question templates for the same question type (referred to as logical form in Pampari et al. (2018)). For each annotation, we randomly sample 3 question templates for its question type. This is done to ensure that question types that have multiple question templates are not over-represented in the data. For example, the question type for “Does he take anything for her |problem|” has 49 available templates, whereas “How often does the patient take |medication|” only has one. If the question type does not have 3 available templates, we up-sample. For more details please refer to Pampari et al. (2018).
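The template sampling step can be sketched as below. The up-sampling strategy shown here (keep every available template once, then draw the remainder with replacement) is an assumption, since the text does not specify how up-sampling is done; the function name is hypothetical.

```python
import random

def sample_templates(templates, k=3, rng=random):
    """Pick k question templates for one annotation.

    With at least k templates, sample k distinct ones. With fewer, keep all
    of them and up-sample the remainder with replacement (an assumption).
    Expects a non-empty list of templates.
    """
    if len(templates) >= k:
        return rng.sample(templates, k)
    extra = [rng.choice(templates) for _ in range(k - len(templates))]
    return list(templates) + extra

rng = random.Random(0)  # fixed seed so the example is reproducible
many = ["template %d" % i for i in range(49)]
single = ["How often does the patient take |medication|"]

picked_many = sample_templates(many, k=3, rng=rng)      # 3 distinct templates
picked_single = sample_templates(single, k=3, rng=rng)  # same template 3 times
```

Either way, every annotation contributes exactly three questions, so question types with many templates do not dominate the data.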
EMRQA is a QA dataset constructed from named entity and relation annotations from clinical i2b2 datasets, consisting of adverse event, medication and risk related questions (Pampari et al., 2018). We also aim to test the performance of our calibration method on out-of-domain test data. To do so, we construct a QA dataset from the clinical named entity and relation dataset MADE 1.0, using the questions and the dataset construction procedure followed in EMRQA. This gives us two QA datasets with common question forms but different text distributions. This experimental setup enables us to evaluate how a QA system would perform when deployed on a new text corpus. It corresponds to the application scenario where a fixed set of questions (such as an adverse event questionnaire (Naranjo et al., 1981)) is to be answered for clinical records from different sources. Both EMRQA and MADE 1.0 are constructed from clinical documents. However, the documents themselves have different structure and language due to their different clinical sources, thereby mimicking the real-world application scenarios of clinical QA systems.
MADE QA Construction
MADE 1.0 (Jagannatha et al., 2019) is an NER and relation dataset whose annotations are similar to those of the “relations” and “medication” i2b2 datasets used in EMRQA. EMRQA uses an automated procedure to construct questions and answers from NER and relation annotations. We replicate the automated QA construction followed by Pampari et al. (2018) on the MADE 1.0 dataset to obtain a corresponding QA dataset. For this construction, we use question templates that rely on annotations common to both the MADE 1.0 and EMRQA datasets. Examples of common questions are in Table 7. A full list of questions in MADE 1.0 QA is in the “question_templates.csv” file included in the supplementary materials. The dataset splits for EMRQA and MADE QA are provided in Table 8.
| Input | Output | Example Question Form |
|---|---|---|
| Problem | Treatment | How does the patient manage her \|problem\| |
| Treatment | Problem | Why is the patient on \|treatment\| |
| Problem | Problem | Has the patient ever been diagnosed or treated for \|problem\| |
| Drug | Drug | Has patient ever been prescribed \|medication\| |
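To make the template-based construction concrete, the sketch below instantiates one question form from a relation annotation. The annotation schema (a dict with `input` and `output` mentions), the pipe-delimited placeholder syntax, the function names and the example mentions are all assumptions for illustration; the actual procedure is the one described by Pampari et al. (2018).

```python
def instantiate(template, slot, mention):
    """Fill a single |slot| placeholder in a question template."""
    return template.replace("|" + slot + "|", mention)

def build_qa(relation, template, slot):
    """Turn one relation annotation into a question-answer pair.

    relation: hypothetical schema with the 'input' entity mention (used in
              the question) and the 'output' entity mention (the answer span).
    """
    return {
        "question": instantiate(template, slot, relation["input"]),
        "answer": relation["output"],
    }

# Treatment -> Problem relation, using a question form from the table above.
relation = {"input": "warfarin", "output": "atrial fibrillation"}
qa = build_qa(relation, "Why is the patient on |treatment|", "treatment")
```

Applying the same templates to two corpora yields datasets that share question forms but differ in passage text.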
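Since we only consider single-span answer predictions, we require a constant number of predictions (two: the answer start and answer end token indices) for this task. Therefore we do not use the “ln” feature in this task. The uncalibrated probability of an event is normalized as follows and then used as input to all calibration models.
$$\hat{p}(y_{\text{start}}, y_{\text{end}} \mid x) = \frac{1}{M} \sum_{m=1}^{M} p_m(y_{\text{start}}, y_{\text{end}} \mid x)^{1/2}$$
Here $M$ is the number of dropout samples and $p_m$ is the model output under the $m$-th dropout sample; the square root corresponds to an output length of 2.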
Unlike the previous tasks, extractive QA with single-span output does not have a varying number of output predictions for each data sample. It always predicts only the start and end indices of the answer span. Therefore, using the length-normalized (where length is always 2) uncalibrated output does not significantly affect the calibration of baseline models. However, we use the length-normalized uncalibrated probability as our input feature to keep our base set of features consistent across tasks. Additionally, in extractive QA tasks with non-contiguous spans, the number of output predictions can vary and be higher than 2. In such cases, based on our results on POS and NER, the length-normalized probability may prove to be more useful. The “Var” and “Rank” features are estimated as described in the previous tasks.