Large-scale, pre-trained contextual language representations Devlin et al. (2018); Radford et al. (2018); Raffel et al. (2020); Brown et al. (2020) have approached or exceeded human performance on many existing language understanding benchmarks. However, due to increasing complexity and concerns of statistical bias enabling artificially high performance Schwartz et al. (2017); Poliak et al. (2018); Niven and Kao (2019); Min et al. (2020), the coherence of these state-of-the-art systems and their alignment to humans is not well understood.
This is perhaps because benchmarks geared toward language understanding only cover the tip of the iceberg, typically focusing on a high-level end task rather than diving deeper into the kind of coherent, robust understanding that takes place in humans. Language understanding in machines is often boiled down to text classification, where a classifier is tasked with recognizing whether a text contains a particular semantic class, e.g., textual entailment Dagan et al. (2005); Bowman et al. (2015), commonsense implausibility Roemmele et al. (2011); Mostafazadeh et al. (2016); Bisk et al. (2020), or combinations of several phenomena meant to serve as comprehensive diagnostics Poliak et al. (2018); Wang et al. (2018, 2019). Without regard to the underlying evidence used to reach a conclusion, systems are rewarded for correct predictions on the task without “showing their work.”
Meaningful progress on machine language understanding requires more informative performance measures. To this end, this paper introduces a novel model- and task-agnostic evaluation framework that quickly assesses text classifiers in terms of the coherence of their predictions. We apply our framework to two existing language understanding benchmarks of different genres to demonstrate its versatility. Our results support recent findings of spurious behaviors in fine-tuned large LMs, and show that our framework, although simple in concept and implementation, offers quick insight into the coherence of machines’ predictions.
2 Related Work
In the face of data bias and uninterpretability of large LMs, past work has proposed methods to robustly interpret and evaluate them for various tasks and domains. Some work has sought to probe contextual language representations through various means to better understand what knowledge they hold and their correspondence to syntactic and semantic patterns Tenney et al. (2018); Hewitt and Manning (2019); Jawahar et al. (2019); Tenney et al. (2019). Meanwhile, behavior testing approaches have also been applied to understand model capabilities, from automatically removing words in language inputs and examining model performance as the input becomes malformed or insufficient for prediction Li et al. (2016); Murdoch et al. (2018); Hewitt and Manning (2019), to curating fine-grained testing data to measure performance on interesting phenomena Zhou et al. (2019); Ribeiro et al. (2020). Similar work has used specialized natural language inference tasks Welleck et al. (2019); Uppal et al. (2020), logic rules Li et al. (2019); Asai and Hajishirzi (2020), and annotated explanations DeYoung et al. (2020); Jhamtani and Clark (2020) to support and evaluate consistency and coherence of inference in these models. Other works have studied coherence of discourse through the proxy task of sentence re-ordering Lapata (2003); Logeswaran et al. (2018). Different from these previous works that focus only on specific tasks or methods, or require heavy annotation, this paper introduces an easily-accessed, versatile evaluation of machine coherence from a small amount of additional annotation.
3 Coherent Text Classification
For any text classification task requiring reasoning over a discourse, a coherent classifier should use the same evidence as humans do in reaching a conclusion. For any positive example, we expect that there are specific regions of the text which contain the semantic class of interest and thus directly contribute to the positive label. Conversely, for any negative example, there should be no such regions of the text. At a high level, we will propose a coherence measure that captures whether classifiers can give consistent and human-aligned predictions on these regions to support the end task conclusion.
Depending on specific tasks, this measure can have different implementations while maintaining the same high-level goal. In the following sections, we will use two example benchmark datasets, Conversational Entailment (CE) from Zhang and Chai (2010) and Abductive Reasoning in narrative Text (ART) from Bhagavatula et al. (2020), to illustrate how the coherence measure can be applied. We intentionally chose these two distinctive benchmark datasets for our investigation. CE is formulated as a textual entailment task, while ART is a multiple-choice text plausibility classification task. CE is small-scale, created over ten years ago before the era of deep learning, while ART is a large-scale (171k examples) dataset created more recently. Through these two different datasets, we aim to demonstrate the versatility of this framework.
3.1 Coherence in Textual Entailment
CE poses a textual entailment task where context is given as several turns of a natural language dialog, and we must determine whether the dialog entails a hypothesis sentence. All required information is explicitly given in the dialog. In each positive example, only some dialog turns directly contribute to the entailment, while others are irrelevant to the hypothesis. For example, as shown in Figure 1, two of the turns together entail the hypothesis, while the others are not necessary for entailment.
As shown in Figure 2 for CE, when a discourse entails a hypothesis, we can label each consecutive sub-span of the discourse with whether or not it also entails the hypothesis. Here, while the entire dialog entails the hypothesis, two shorter spans do not, as they omit details required by the hypothesis. Given an example of length $n$, we can decompose it into $n(n+1)/2$ possible consecutive sub-spans to label with human judgements: there are $\binom{n}{2}$ combinations of starting and ending points for multi-sentence sub-spans, plus $n$ individual sentences. Length can be defined in units of dialog turns, sentences, paragraphs, or other appropriate units of the text. Text should be decomposed such that individual sub-spans are not malformed or fragmented, so token- and character-level sub-spans will typically be inappropriate for this evaluation.
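The decomposition can be sketched in a few lines. This is a minimal illustration rather than the released implementation: the function name `consecutive_subspans` and the toy dialog are hypothetical, and units are indexed from 0 with inclusive end points. A text of n units yields n(n+1)/2 consecutive sub-spans, counting single units and the full text.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), 0-based, both inclusive


def consecutive_subspans(n: int) -> List[Span]:
    """Enumerate every consecutive sub-span of a text with n units."""
    return [(i, j) for i in range(n) for j in range(i, n)]


# Hypothetical four-turn dialog; real units could be turns or sentences.
turns = ["turn 1", "turn 2", "turn 3", "turn 4"]
spans = consecutive_subspans(len(turns))

# n(n+1)/2 spans in total: 4 single turns plus 6 multi-turn spans.
assert len(spans) == len(turns) * (len(turns) + 1) // 2

# Recover the text of one sub-span, here turns 2 through 3.
start, end = spans[5]
sub_dialog = turns[start : end + 1]
```

Each span would then be paired with a human judgement of whether it still entails the hypothesis.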
For a correctly classified example, we can then perform inference on all sub-spans. If the system additionally classifies all of them correctly, we consider the prediction to be coherent. We then calculate coherence on the task as the percentage of examples coherently classified. Extremely simple to compute, this provides valuable insight beyond the surface of end task accuracy, measuring how well the classifier’s perceived evidence toward the conclusion aligns with that of humans. Alternatively, the average sub-span accuracy may be considered as a more lenient measure.
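The two measures can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the `Example` container and function names are hypothetical, and we assume that an example whose end task prediction is wrong contributes zero to both measures, so that neither can exceed accuracy.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Span = Tuple[int, int]


@dataclass
class Example:
    gold_label: int              # end task ground truth
    pred_label: int              # end task prediction
    gold_spans: Dict[Span, int]  # human judgement per sub-span
    pred_spans: Dict[Span, int]  # classifier prediction per sub-span


def subspan_accuracy(ex: Example) -> float:
    """Fraction of annotated sub-spans the classifier labels correctly."""
    correct = sum(ex.pred_spans[s] == y for s, y in ex.gold_spans.items())
    return correct / len(ex.gold_spans)


def strict_coherence(examples: List[Example]) -> float:
    """Share of examples with a correct end prediction AND all sub-spans correct."""
    hits = sum(
        ex.pred_label == ex.gold_label
        and all(ex.pred_spans[s] == y for s, y in ex.gold_spans.items())
        for ex in examples
    )
    return hits / len(examples)


def lenient_coherence(examples: List[Example]) -> float:
    """Mean sub-span accuracy; assumption: wrong end predictions score 0."""
    total = sum(
        subspan_accuracy(ex) if ex.pred_label == ex.gold_label else 0.0
        for ex in examples
    )
    return total / len(examples)


toy = [
    Example(1, 1, {(0, 0): 0, (0, 1): 1}, {(0, 0): 0, (0, 1): 1}),  # coherent
    Example(1, 1, {(0, 0): 0, (0, 1): 1}, {(0, 0): 1, (0, 1): 1}),  # half right
    Example(1, 0, {(0, 0): 0, (0, 1): 1}, {(0, 0): 0, (0, 1): 0}),  # wrong label
]
```

On the toy data, accuracy is 2/3, strict coherence is 1/3, and lenient coherence is 1/2, mirroring the ordering of the three measures.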
3.2 Coherence in Plausibility Classification
ART, meanwhile, is a multiple-choice text classification benchmark for commonsense plausibility recognition. The task is to determine which of two candidate sentences most plausibly fits between two given context sentences when considering commonsense constraints on the world. This translates naturally into a choice between two three-sentence stories (differing only by the second sentence), one of which has some implausibility (the positive choice). For example, as shown in Figure 3, Story 1 is implausible because while the second sentence describes a negative event, the third sentence indicates celebration. Meanwhile, in Story 2, the agent is celebrating a positive event.
To account for multiple-choice tasks like ART, where we identify one of two texts to be semantically implausible, we must adjust this setup. We still consider sub-spans of the context, breaking down each pair of texts into pairs of consecutive sub-spans, one pair for each choice of starting and ending unit. Intuitively, the model’s choice on each pair should again align with that of humans. However, for some sub-span pairs, neither text may contain the positive class. In such cases, the classifier should not make a confident prediction, and should instead believe the texts are equally likely. Confidence should be defined based on the classifier’s internal model of the probability distribution over all possible class labels, i.e., text choices (typically calculated by applying softmax over the activations of several neural network branches). This is conceptually visualized in Figure 4, where a classifier should only become confident that Story B is implausible once both the second and third sentence are present, as the trash is less likely to end up on the floor with a hole in the top of the bag.
Generally, let $t_{i:j}$ represent the consecutive sub-sequence of text $t$ from unit $i$ through $j$, e.g., sentences $i$ through $j$ of text $t$. Consider a set of texts $T$ of length $n$ such that $T_{i:j} = \{t_{i:j} : t \in T\}$, and a classifier $f$ such that $f(T_{i:j}) \in T_{i:j}$. (While text choices may be different lengths, this can be trivially resolved by padding.) When classifying a set $T_{i:j}$, let $f(T_{i:j})$ be considered a confident prediction if $p(f(T_{i:j})) > \rho$, where $p(c)$ refers to the probability of class $c$ under the classifier’s output distribution, and $\rho$ is a confidence threshold. Where there is no positive text within $T_{i:j}$, the desired outcome (ground truth) is for $f(T_{i:j})$ to be a non-confident prediction. This should be reflected in the calculation of coherence.
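A minimal sketch of the confidence rule for a two-choice task is given below. The function names and the variable `rho` (standing for the confidence threshold) are illustrative, and the strict inequality is an assumption. A gold label of `None` marks a sub-span pair in which neither text is positive, where the desired outcome is a non-confident prediction.

```python
import math
from typing import List, Optional


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one score per text choice."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def confident_choice(logits: List[float], rho: float) -> Optional[int]:
    """Return the chosen text index if its probability exceeds rho, else None."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] > rho else None


def subspan_pair_correct(logits: List[float], gold: Optional[int], rho: float) -> bool:
    """Correct means: confident and right when a positive text exists,
    non-confident when no positive text exists (gold is None)."""
    return confident_choice(logits, rho) == gold


# Hypothetical scores for two story choices.
assert confident_choice([2.0, 0.0], rho=0.8) == 0      # p is about 0.88
assert confident_choice([0.1, 0.0], rho=0.8) is None   # p is about 0.52
```

Under this reading, a higher threshold demands more confident choices on pairs that contain the evidence, while making it easier to be counted as non-confident on pairs that do not.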
4 Coherence of SOTA Classifiers
Using our framework, we next establish baseline measures of coherence on the two benchmarks. The source code and data for our empirical study are shared with the community on GitHub (https://github.com/sled-group/Verifiable-Coherent-NLU).
4.1 Enabling Coherence Evaluation
To enable the type of evaluation described in Section 3 for our benchmarks, additional annotation is required. CE contains 50 unique dialog sources from the Switchboard corpus Godfrey and Holliman (1997). We randomly selected 10 sources to form the test set and left all remaining sources for training and validation, creating an approximately 80%/20% split between training and validation (703 examples) and testing (178 examples). We annotated the positive examples in the test set with the range of dialog turns entailing the hypothesis, allowing us to generate ground truth labels for the coherence measurement. Examples were labeled by two separate annotators and cross-verified with a near-perfect Cohen’s $\kappa$ Cohen (1960) of 0.91, then a third annotator resolved any disagreements.
To transfer ART to our framework, we annotated 200 random examples from the public validation set (1532 examples) with the evidence for implausibility. There are 3 possible cases in implausible story choices: 1) the second sentence conflicts with the first and/or third sentence, 2) the second sentence is malformed or nonsense, presumably due to annotation error or adversarial filtering Zellers et al. (2018), and 3) the first and third sentence conflict with each other by default, and the second sentence does not resolve this. These cases were labeled by two annotators and then merged, with a fair Cohen’s $\kappa$ of 0.30 (perhaps lower due to the subjectivity of commonsense-based problems) and a third annotator again resolving disagreements. 11 examples were discarded as two annotators agreed that both story choices were entirely plausible, presumably due to annotation error in ART.
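For reference, the agreement statistic of Cohen (1960) compares observed agreement with the agreement expected by chance from each annotator's label frequencies. The sketch below is a generic implementation with hypothetical toy labels, not the annotation data from this study.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and aligned")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label counts.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label
    return (p_o - p_e) / (1.0 - p_e)


# Toy example: 3 of 4 items agree, chance agreement is 0.5.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 1, 0, 1]
kappa = cohens_kappa(annotator_1, annotator_2)  # (0.75 - 0.5) / (1 - 0.5) = 0.5
```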
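Table 1: Accuracy and coherence on the CE test set (first block) and the ART public validation set (second block). Values in parentheses after coherence scores give the change from accuracy; for ART, each coherence score is followed by the chosen confidence threshold.
|Model||Accuracy (%)||Strict Coherence (%)||Lenient Coherence (%)|
|BERT||55.8||28.5 (-27.3)||35.7 (-20.1)|
|RoBERTa||70.9||39.0 (-31.9)||47.5 (-23.4)|
|+ mnli||78.5||50.6 (-27.9)||58.2 (-20.3)|
|DeBERTa||67.4||37.2 (-30.2)||45.2 (-22.2)|
|Model||Accuracy (%)||Strict Coherence (%)||Threshold||Lenient Coherence (%)||Threshold|
|BERT||66.7 (66.7)||42.3 (-24.4)||0.15||43.7 (-23.0)||0.85|
|RoBERTa||87.8 (84.2)||55.0 (-32.8)||0.1||59.3 (-28.5)||0.05|
|DeBERTa||88.4 (85.7)||59.8 (-28.6)||0.85||61.8 (-26.6)||0.95|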
4.2 Empirical Results
We evaluate three state-of-the-art, transformer-based language models from recent years: BERT Devlin et al. (2018), RoBERTa Liu et al. (2019), and DeBERTa He et al. (2021). We use the “large” configuration of all models, which have 24 hidden layers and 16 attention heads. On CE, we additionally evaluate RoBERTa with additional pre-training on MultiNLI Williams et al. (2017), a large-scale textual entailment dataset with some dialog-based problems. We measure both the accuracy, i.e., the proportion of instances where the end task prediction is correct, and the coherence of models on the respective evaluation sets. Specifically, we consider two kinds of coherence: strict and lenient. Given a set of evaluation instances, strict coherence refers to the proportion of instances where the end task prediction is not only correct, but also coherent as described in Section 3. While strict coherence only rewards systems for examples where all sub-span predictions are correct, lenient coherence averages the sub-span accuracy over all examples for a less rigid reward. We include this alternate form of coherence to accommodate some disagreement with our annotations (which can be subjective, as the measured inter-annotator agreement shows) without severe penalty.
Following common practice, systems are trained with cross-entropy loss toward the end task of text classification, maximizing accuracy on the validation set for model selection. On CE, we used 8-fold cross-validation split by dialog sources, then re-trained the model with the highest average validation accuracy on all folds.
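Splitting by dialog source keeps every example from one dialog in the same fold, so validation never sees a dialog used in training. The following is a minimal sketch of such a grouped split; the function name, the round-robin assignment, and the toy source identifiers are assumptions for illustration and may differ from the procedure actually used.

```python
import random
from typing import Hashable, List, Sequence


def grouped_folds(sources: Sequence[Hashable], k: int, seed: int = 0) -> List[List[int]]:
    """Assign example indices to k folds so that no source spans two folds.

    sources[i] is the dialog source of example i.
    """
    unique = sorted(set(sources), key=str)
    if k > len(unique):
        raise ValueError("more folds than distinct sources")
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Round-robin over shuffled sources keeps fold sizes balanced by source.
    fold_of = {src: i % k for i, src in enumerate(unique)}
    folds: List[List[int]] = [[] for _ in range(k)]
    for idx, src in enumerate(sources):
        folds[fold_of[src]].append(idx)
    return folds


# Hypothetical data: 200 examples drawn from 40 dialog sources.
example_sources = ["dialog_%d" % (i % 40) for i in range(200)]
folds = grouped_folds(example_sources, k=8)

for held_out in range(len(folds)):
    val_idx = folds[held_out]
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    # Train on train_idx, then record validation accuracy on val_idx.
```

Averaging validation accuracy over the held-out folds then gives the model selection criterion.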
Pre-trained model parameters and implementations come from HuggingFace transformers Wolf et al. (2020) (https://huggingface.co/transformers/), each trained with the AdamW optimizer Loshchilov and Hutter (2018). We performed a grid search over a wide range of learning rates and a maximum of 10 epochs. Training batch sizes are fixed based on available GPU memory. Selected hyperparameters can be found in Appendix A.
Discussion of results.
Results on the test set of CE and public validation set of ART are listed in Table 1. All results show a statistically significant drop in performance from classification accuracy to strict coherence under a McNemar test McNemar (1947), with some dropping below majority-class accuracy. While lenient coherence is slightly higher for both tasks, we still see large drops from accuracy. This suggests that while our text classifiers can achieve high classification accuracy on CE and ART, much of their performance is supported by incoherent intermediate predictions rather than a deep understanding of the tasks. Although pre-training on MultiNLI improves the end task accuracy on CE, the resulting model still suffers comparably large drops in the coherence measures. On ART, while all models see significant performance drops, DeBERTa, the state-of-the-art system for the task, achieves the best accuracy and coherence measures, as well as the highest chosen confidence thresholds, which generally indicates more confident predictions. Even though DeBERTa only marginally outperforms RoBERTa in accuracy, it shows larger improvements in the coherence measures and the chosen thresholds, suggesting DeBERTa is more robust.
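The significance test compares paired outcomes on the same examples, here whether each example is correctly classified and whether it is coherently classified. The sketch below uses the exact (binomial) form of the McNemar test, which is an assumption since the variant is not specified here; the function name and counts are illustrative. Because a coherent prediction must first be correct, all discordant pairs fall on one side.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: examples correct under the first measure only
    c: examples correct under the second measure only
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) tail probability of k or fewer, doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical counts: 10 examples correct but not coherent, 0 the reverse.
p_value = mcnemar_exact(b=10, c=0)  # 2 * (1 / 1024)
```

With many correct-but-incoherent examples and none in the other direction, the p-value shrinks rapidly, which is consistent with large accuracy-to-coherence gaps being significant.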
5 Conclusion
In this work, we proposed a simple and versatile method to evaluate the coherence of text classifiers, particularly targeting the problem where end task prediction depends on a discourse rather than a single sentence. By annotating a small amount of data in a benchmark, this method supports a quick assessment of whether machines’ end task performance is supported by coherent intermediate evidence. Future work driven by benchmarks should consider similar examination beyond the end task accuracy, whether this be through our proposed coherence measures or other appropriate means. As we showed, such effort is quite straightforward, and can drive progress toward more powerful classifiers that can support human-aligned reasoning.
Acknowledgments
This work was supported in part by IIS-1949634 from the National Science Foundation. We thank Bri Epstein and Haoyi Qiu for their diligent annotation work. We also thank the anonymous reviewers for their helpful comments and suggestions.
References
- Logic-guided data augmentation and regularization for consistent question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online.
- Abductive commonsense reasoning. In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia.
- PIQA: Reasoning about Physical Commonsense in Natural Language. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), New York, NY, USA.
- A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), Lisbon, Portugal.
- Language Models are Few-Shot Learners. arXiv:2005.14165.
- A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20 (1), pp. 37–46.
- The PASCAL Recognising Textual Entailment Challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d’Alché-Buc (Eds.), Vol. 3944, pp. 177–190.
- BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2019), Minneapolis, MN, USA.
- ERASER: A benchmark to evaluate rationalized NLP models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online.
- Switchboard-1 release 2. Linguistic Data Consortium, Philadelphia, PA, USA.
- DeBERTa: Decoding-enhanced BERT with Disentangled Attention. arXiv:2006.03654.
- A Structural Probe for Finding Syntax in Word Representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2019), Minneapolis, MN, USA.
- What Does BERT Learn about the Structure of Language?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy.
- Learning to explain: datasets and models for identifying valid reasoning chains in multihop question-answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Online.
- Probabilistic text structuring: experiments with sentence ordering. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan.
- Understanding neural networks through representation erasure. arXiv:1612.08220.
- A logic-driven framework for consistency of neural models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China.
- RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.
- Sentence ordering and coherence modeling using recurrent neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, LA, USA.
- Decoupled Weight Decay Regularization. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), Vancouver, BC, Canada.
- Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12 (2), pp. 153–157.
- Syntactic Data Augmentation Increases Robustness to Inference Heuristics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online.
- A Corpus and Cloze Evaluation Framework for Deeper Understanding of Commonsense Stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2016), San Diego, CA, USA.
- Beyond word importance: contextual decomposition to extract interactions from LSTMs. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), Vancouver, BC, Canada.
- Probing Neural Network Comprehension of Natural Language Arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), Florence, Italy.
- Collecting Diverse Natural Language Inference Problems for Sentence Representation Evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), Brussels, Belgium.
- Hypothesis Only Baselines in Natural Language Inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, New Orleans, LA, USA.
- Improving Language Understanding with Unsupervised Learning. OpenAI.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21 (140), pp. 1–67.
- Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online.
- Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning. In AAAI Spring Symposium on Logical Formalizations of Commonsense Reasoning, Stanford, CA, USA.
- The Effect of Different Writing Tasks on Linguistic Style: A Case Study of the ROC Story Cloze Task. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, BC, Canada.
- BERT Rediscovers the Classical NLP Pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), Florence, Italy.
- What do you learn from context? Probing for sentence structure in contextualized word representations. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), Vancouver, BC, Canada.
- Two-step classification using recasted data for low resource settings. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Suzhou, China.
- SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada.
- GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium.
- Dialogue natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), Florence, Italy.
- A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2018), New Orleans, LA, USA.
- Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020): System Demonstrations, Online.
- SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), Brussels, Belgium.
- Towards Conversation Entailment: An Empirical Investigation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP 2010), Cambridge, MA, USA.
- “Going on a Vacation” takes longer than “Going for a Walk”: A Study of Temporal Commonsense Understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP 2019), Hong Kong, China.
Appendix A Model Training Details
The selected hyperparameters for each model presented in the paper are listed in Table 2.
|Task||Model||Batch Size||Learning Rate||Epochs|