Discourse Based Evaluation of Language Understanding
We introduce DiscEval, a compilation of 11 evaluation datasets with a focus on discourse, that can be used for the evaluation of English Natural Language Understanding when considering meaning as use. We make the case that evaluation with discourse tasks is overlooked and that Natural Language Inference (NLI) pretraining may not lead to the learning of truly universal representations. DiscEval can also be used as supplementary training data for multi-task learning-based systems, and is publicly available, alongside the code for gathering and preprocessing the datasets.
Recent models for Natural Language Understanding (NLU) have made unusually quick progress over the last year, according to current evaluation frameworks. The GLUE benchmark [Wang et al.2018] was designed to be a set of challenging tasks for NLU. However, the best current systems surpass the human accuracy estimate on the average score of the GLUE tasks (as of June 2019).
While these benchmarks have been created for evaluation purposes, the best performing systems [Liu et al.2019, Yang et al.2019] rely on multi-task learning and use all the evaluation tasks as training tasks before the task-specific evaluations. These multi-task models, seen as the best performing, are also released for general-purpose use. However, such multi-task fine-tuning might lead to catastrophic forgetting of capacities learned during the first training phase (viz. language modeling for BERT). This tendency makes the representativeness of these benchmark tasks all the more important.
A wide range of important work actually trusts SentEval [Cer et al.2018, Kiros and Chan2018, Subramanian et al.2018, Wieting et al.2015] or GLUE [Liu et al.2019] to back up the claim that encoders produce universal representations.
Since these consensual benchmarks can provide strong guidance for the evolution of NLU models, they should be as exhaustive as possible, and closely related to the models’ end use cases. Ultimately, many use cases involve conversation with end users or analysis of structured documents. In such cases, discourse analysis (i.e. the ability to parse high-level textual structures that take into account the global context) is a prerequisite for human-level performance.
More generally, the evaluation of NLU systems should incorporate pragmatic aspects and take into account the actual intent of speech acts [Searle et al.1980], while existing evaluations tend to focus mostly on semantic aspects (e.g. semantic similarity and NLI). In real-world use, recognizing the implicatures of a statement is arguably more important than recognizing its mere implications [Grice1975]. An implicature of a statement is a part of its meaning that is not contained in its literal interpretation and is dependent on the context (which may not be explicitly available, but can be inferred).
Consider the following utterance:
You’re standing on my foot.
Implications of this utterance include You are standing on a part of my body or Someone is touching me. On the other hand, a plausible implicature of this utterance is that the speaker wants the addressee to move further away.
Understanding the literal semantic content of a statement in a conversation or a document is not sufficient if it does not allow an NLU system to understand how that statement should change a situation or how it fits a broader context. Speech acts have been categorized into classes such as Assertion, Question, or Order, which have different kinds of effects on the world. For instance, constative speech acts (e.g. the sky is blue) describe a state of the world and are either true or false, while performative speech acts (e.g. I declare you husband and wife) can change the world upon utterance [Austin1975]. Discourse tasks focus on the meaning of language as use. Therefore, a discourse-centric evaluation could by construction be a better fit to evaluate how NLU models perform in practical use cases, or at least should be used as a complement to semantics-centered evaluations.
Thus, we make the case that ignoring discourse in evaluations is detrimental to NLU. We compile a list of tasks that should complement existing evaluation frameworks. We frame all our tasks as classification tasks (either of sentences, or of sentence pairs) so that they can be seamlessly integrated with other evaluation or multi-task pretraining setups (GLUE or SentEval tasks). Our evaluation benchmark, named DiscEval, is publicly available (https://github.com/synapse-developpement/DiscEval). We evaluate state-of-the-art NLU models on these tasks by fine-tuning BERT on several auxiliary fine-tuning datasets, and show that the most widely used auxiliary fine-tuning dataset, viz. MNLI, is not the best performing on DiscEval.
Evaluation methods for NLU have been the object of heated debates since the proposal of the Turing Test. Automatic evaluations can be based on sentence similarity [Agirre et al.2012] and leverage human-annotated similarity scores between sentence pairs. Sentence similarity estimation tasks can potentially encompass many aspects, but it is not clear how human annotators weigh semantic, stylistic, and discursive aspects during their rating.
Using a set of more focused and clearly defined tasks has been a popular approach. Kiros2015 proposed a set of tasks and tools for sentence understanding evaluation. These tasks were compiled in the SentEval [Conneau et al.2017] evaluation suite, designed for automatic evaluation of pre-trained sentence embeddings. SentEval tasks are mostly based on sentiment analysis, sentence similarity, and natural language inference, and the suite forces the user to provide a sentence encoder that is not fine-tuned during the evaluation. Concurrently, zhang2015character also compiled a set of text classification tasks based on thematic classification and sentiment analysis, which is still used to evaluate document-level representation learning [Yang et al.2019].
GLUE [Wang et al.2018] proposes to evaluate language understanding with fewer constraints than SentEval, allowing users not to rely on explicit sentence-embedding-based models. The tasks are classification or regression problems, carried out on sentences or sentence pairs. Additionally, the authors propose diagnostic NLI tasks annotated with various linguistic phenomena that can be necessary to make the right predictions, as in Poliak2018.
Natural Language Inference can be regarded as a universal framework for evaluation [Poliak et al.2018a]. In the Recast framework, existing datasets (e.g. sentiment analysis) are recast as NLI tasks. For instance, based on the sentence don’t waste your money, annotated as a negative review, they use handcrafted rules to generate the following example:
(premise: When asked about the product, Liam said ”don’t waste your money”,
hypothesis: Liam didn’t like the product)
However, the generated datasets do not allow an evaluation to measure directly how well a model deals with the semantic phenomena present in the original dataset, since some sentences use artificially generated reported speech.
Thus, NLI data could be used to evaluate discourse analysis, but it is not clear how to generate examples that are not overly artificial. Moreover, it is unclear to what extent the examples in existing NLI datasets are required to deal with pragmatics.
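The recasting idea described above can be sketched with a toy rule. This is only an illustration of the general mechanism, not the actual rules of the Recast framework; the function name, template, and speaker name are hypothetical.

```python
# Illustrative sketch of recasting a sentiment-analysis example into an
# NLI (premise, hypothesis, label) triple via a handcrafted template.
# The template and label mapping are assumptions, not the actual rules
# used by Poliak et al. (2018).

def recast_sentiment_as_nli(review: str, is_positive: bool, speaker: str = "Liam"):
    # Wrap the review in artificially generated reported speech.
    premise = f'When asked about the product, {speaker} said "{review}"'
    hypothesis = f"{speaker} liked the product"
    # A positive review entails the hypothesis; a negative one contradicts it.
    label = "entailment" if is_positive else "contradiction"
    return premise, hypothesis, label

premise, hypothesis, label = recast_sentiment_as_nli(
    "don't waste your money", is_positive=False)
```

Note that, as the text points out, the resulting premise uses synthetic reported speech, so the model is no longer evaluated on the original sentence alone.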
SuperGLUE [Wang et al.2019] updates GLUE with six novel tasks that are selected to be even more challenging. Two of them deal with contextualized lexical semantics, two are a form of question answering, and two are NLI problems. One of those NLI tasks, CommitmentBank (https://github.com/mcdm/CommitmentBank/), is the only explicitly discourse-related task.
Discourse relation prediction has been used by [Nie et al.2017] and [Sileo et al.2019] for sentence representation learning evaluation, but the dataset they use (PDTB [Prasad et al.2008]) is included in ours.
Our goal is to compile a set of diverse discourse-related tasks. We restrict ourselves to classification, either of sentences or of sentence pairs. We only use publicly available datasets and tasks that are absent from other benchmarks (SentEval/GLUE). As opposed to GLUE [Wang et al.2018], we do not keep test labels hidden, in order to allow faster experiments. The scores on these tasks are not meant to be compared to previous work, since we alter some datasets to yield more meaningful evaluations (we perform duplicate removal or class subsampling where mentioned).
We first present the tasks, and then propose a rudimentary taxonomy of how they fit into conceptions of meaning as use.
|Dataset|Task|Example (label)|Size|
|PDTB|discourse relation|“it was censorship” / “it was outrageous” (conjunction)|13k|
|STAC|discourse relation|“what ?” / “i literally lost” (question-answer-pair)|11k|
|GUM|discourse relation|“Do not drink” / “if underage in your country” (condition)|2k|
|Emergent|stance|“a meteorite landed in nicaragua.” / “small meteorite hits managua” (for)|2k|
|SwitchBoard|speech act|“well , a little different , actually ,” (hedge)|19k|
|MRDA|speech act|“yeah that ’s that ’s that ’s what i meant .” (acknowledge-answer)|14k|
|Persuasion|C/E/P/S/S/R|“Co-operation is essential for team work” / “lions hunt in a team” (low specificity)|0.6k|
|SarcasmV2|presence of sarcasm|“don’t quit your day job” / “[…] i was going to sell this joke. […]” (sarcasm)|9k|
|Squinky|I/I/F|“boo ya.” (uninformative, high implicature, informal)|4k|
|Verifiability|verifiability|“I’ve been a physician for 20 years.” (verifiable-experiential)|6k|
|EmoBank|V/A/D|“I wanted to be there..” (low valence, high arousal, low dominance)|5k|
PDTB [Prasad et al.2014] contains a collection of fine-grained implicit relations (i.e. not signaled by a discourse marker) between sentences from the news domain in the Penn Discourse TreeBank 2.0. We select the level-2 relations as categories.
STAC is a corpus of strategic chat conversations manually annotated with negotiation-related information, dialogue acts, and discourse structures in the framework of Segmented Discourse Representation Theory (SDRT). We only consider pairwise relations between all dialog acts, following BaThLoAs2019. We remove duplicate pairs and dialogues that only have non-linguistic utterances (coming from a server). We subsample dialog act pairs with no relation so that they constitute a fixed proportion of each fold.
GUM is a corpus of multilayer annotations for texts from various domains; it includes Rhetorical Structure Theory (RST) discourse structure annotations. Once again, we only consider pairwise interactions between discourse units (e.g. sentences/clauses). We subsample discourse unit pairs with no relation so that they constitute a fixed proportion of each document. We split the examples into train/test/dev sets randomly according to the document they belong to.
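The no-relation subsampling step used for the discourse relation corpora can be sketched as follows. The 20% target share is an illustrative choice (the exact fraction is not restated here), and `subsample_no_relation` is a hypothetical helper name.

```python
import random

# Sketch of class subsampling: keep all related pairs and downsample
# the "no-relation" pairs so that they make up a fixed share of the
# resulting set. The target share and label string are assumptions.

def subsample_no_relation(examples, target_share=0.2, seed=0):
    """examples: list of (pair, label); downsample label == 'no-relation'."""
    related = [ex for ex in examples if ex[1] != "no-relation"]
    none = [ex for ex in examples if ex[1] == "no-relation"]
    # Solve k / (k + len(related)) = target_share for k.
    k = min(len(none), int(target_share * len(related) / (1 - target_share)))
    rng = random.Random(seed)  # fixed seed for a reproducible split
    return related + rng.sample(none, k)
```

With 8 related pairs and a 20% target, the helper keeps 2 no-relation pairs, so the unrelated class is no longer overwhelmingly dominant.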
Emergent [Ferreira and Vlachos2016] is composed of pairs of assertions and titles of news articles that are against, for, or neutral with respect to the opinion of the assertion.
SwitchBoard [Godfrey et al.1992] contains textual transcriptions of dialogs about various topics with annotated speech acts. We remove duplicate examples and subsample Statements and Non-Statements so that they constitute 20% of the examples. We use a custom train/dev validation split (90/10 ratio), since this deduplication led to a drastic size reduction of the original development set. The label of a speech act can depend on the context (previous utterances), but we discarded it in this work for the sake of simplicity, even though integrating context could improve the scores [Ribeiro et al.2015].
MRDA [Shriberg et al.2004] contains textual transcriptions of multi-party real meetings, with speech act annotations. We use a custom train/dev validation split (90/10 ratio), since deduplication led to a drastic size reduction of the original development set, and we subsample Statement examples so that they constitute 20% of the dataset. We also ignored the context.
Persuasion is a collection of arguments from student essays annotated with factors of persuasiveness with respect to a claim; the considered factors are: Specificity, Eloquence, Relevance, Strength, ClaimType, PremiseType. For each graded target (the first four factors), we cast the ratings into three quantiles and discard the middle quantile.
SarcasmV2 [Oraby et al.2016] consists of messages from online forums with responses that may or may not be sarcastic according to human annotations.
Squinky [Lahiri2015] gathers annotations of Formality, Informativeness, and Implicature, where sentences were graded on a scale from 1 to 7. The authors define the Implicature score as the amount of information carried in a sentence that is not explicitly stated. For each target, we cast the ratings into three quantiles and discard the middle quantile.
Verifiability [Park and Cardie2014] is a collection of online user comments annotated as Verifiable-Experiential (verifiable and about the writer’s experience), Verifiable-Non-Experiential, or Unverifiable.
EmoBank [Buechel and Hahn2017] aggregates emotion annotations on texts from various domains using the Valence-Arousal-Dominance (VAD) representation format. The authors define Valence as corresponding to the concept of polarity (the dimension widely used in sentiment analysis), Arousal as the degree of calmness or excitement, and Dominance as the perceived degree of control over a situation. For each target, we cast the ratings into three quantiles and discard the middle quantile.
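The quantile-based label construction used for the graded targets (Persuasion, Squinky, EmoBank) can be sketched as follows; `quantile_binarize` is a hypothetical helper name, and using the bottom and top terciles as "low"/"high" classes is our reading of the procedure.

```python
# Sketch of casting graded ratings into three quantiles and discarding
# the middle one, which yields a binary low/high classification task.

def quantile_binarize(ratings):
    """Return (index, 'low'/'high') pairs, dropping the middle tercile."""
    ranked = sorted(range(len(ratings)), key=lambda i: ratings[i])
    n = len(ranked)
    low = set(ranked[: n // 3])      # bottom tercile -> "low" class
    high = set(ranked[-(n // 3):])   # top tercile    -> "high" class
    return [(i, "low" if i in low else "high")
            for i in range(n) if i in low or i in high]

labels = quantile_binarize([1, 2, 3, 4, 5, 6, 7, 6, 2])
```

Discarding the middle quantile removes borderline examples, so the resulting binary task is less noisy than thresholding at the median.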
A sentence can have a goal (characterized by speech act or discourse relation), can pursue that goal through various means (e.g. using appeal to emotions, or verifiable arguments), and can achieve that goal with varying degrees of success. This leads us to a rudimentary grouping of our tasks:
The speech act classification tasks (SwitchBoard, MRDA) deal with detecting the function of utterances. They use the same label set (viz. DAMSL) [Allen and Core1997] but different domains and annotation guidelines. A discourse relation characterizes how a sentence contributes to the meaning of a document/conversation (e.g. through elaboration or contrast), so this task requires a form of understanding of the use of a sentence, and of how a sentence fits with another sentence in a broader discourse. Here, three tasks (PDTB, STAC, GUM) deal with discourse relation prediction with varying domains and formalisms (these formalisms have different assumptions and definitions about the nature of discourse structure). The stance detection task can be seen as coarse-grained discourse relation classification.
Persuasiveness prediction is a useful tool to assess whether a model can measure how well a sentence can achieve its intended goal. This aspect is orthogonal to the determination of the goal itself, and is arguably equally important.
Detecting emotional content, verifiability, formality, informativeness, or sarcasm is necessary in order to figure out in what realm communication is occurring. A statement can be persuasive, yet poorly informative and unverifiable. Emotions [Dolan2002] and power perception [Pfeffer1981] can have a strong influence on human behavior and text interpretation. Manipulating emotions can also be the main purpose of a speech act. Sarcasm is another means of communication, and sarcasm detection is in itself an interesting task for the evaluation of pragmatics, since sarcasm is a clear case of the literal meaning differing from the intended meaning.
We provide two baselines for DiscEval: prediction of the majority class, and a fastText classifier. The fastText classifier [Joulin et al.2016] has randomly initialized embeddings of size 10 and default parameters otherwise. The embedding size was picked according to DiscEval development set accuracy. When the input is a sentence pair, words have distinct representations for their occurrences in the first and the second sentence.
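The sentence-pair input encoding described above can be sketched as follows; the `s1:`/`s2:` prefixes and the function name are illustrative choices, not necessarily the exact symbols used in the released code.

```python
# Sketch of pair encoding for a bag-of-words classifier such as
# fastText: each token is prefixed according to the sentence it comes
# from, so the same surface word yields different features in the two
# positions.

def encode_pair(sentence1: str, sentence2: str) -> str:
    tokens1 = ["s1:" + w for w in sentence1.lower().split()]
    tokens2 = ["s2:" + w for w in sentence2.lower().split()]
    return " ".join(tokens1 + tokens2)

line = encode_pair("what ?", "i literally lost")
```

Without such position-dependent features, a bag-of-words model could not distinguish the premise from the hypothesis in asymmetric tasks like stance detection.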
As another reference, we evaluate the BERT [Devlin et al.2019] base uncased model; during the evaluation fine-tuning phase, we use 2 epochs and otherwise the default parameters of the HuggingFace script (https://github.com/huggingface/pytorch-transformers/).
We also perform experiments with Supplementary Training on Intermediate Labeled-data Tasks (STILT) [Phang et al.2018]. STILT is a further pretraining step on a data-rich task before the final fine-tuning evaluation on the target task. We fine-tune BERT on three such tasks: MNLI, DisSent, and Discovery.
DisSent data is from [Nie et al.2017], consisting of sentences or clauses that were separated by a discourse marker from a list of 15 markers. Prediction of discourse markers based on the context clauses/sentences with which they occurred has been used as a training signal for sentence representation learning. The authors used handcrafted rules for each marker in order to ensure that the markers actually signal a form of relation. DisSent has underwhelming results on the GLUE tasks as a STILT [Wang et al.2019a].
Discovery [Sileo et al.2019] is another dataset for discourse marker prediction, composed of a set of discourse markers with usage examples for each marker. Sentence pairs were extracted from web data, and the markers come either from the PDTB or from a heuristic automatic extraction.
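The kind of extraction heuristic described above can be sketched with a toy rule; the marker list, the function name, and the sentence-initial matching rule are illustrative assumptions, not the actual pipelines of DisSent or Discovery.

```python
# Illustrative sketch of mining discourse-marker prediction examples:
# when the second sentence of an adjacent pair starts with a known
# marker, strip it and use the marker as the label.

MARKERS = ("however", "therefore", "for example", "in contrast")

def extract_example(sent1, sent2):
    lowered = sent2.lower()
    for marker in MARKERS:
        if lowered.startswith(marker + " ") or lowered.startswith(marker + ","):
            # Remove the marker and leading punctuation, restore casing.
            stripped = sent2[len(marker):].lstrip(" ,").capitalize()
            return (sent1, stripped, marker)
    return None  # no marker found: the pair is not usable as an example

ex = extract_example("It rained all day.", "However, the match went ahead.")
```

A classifier trained to predict the removed marker from the two stripped sentences must learn something about the relation (contrast, cause, exemplification) that the marker signaled.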
These models are evaluated in Table 2. We report the average score over 8 runs of the fine-tuning phase.
DiscEval seems to be challenging for the BERT base model. Indeed, for each task, there is a STILT that significantly improves BERT’s accuracy.
The best overall result is achieved with the Discovery STILT. Pretraining on MNLI also yields an overall improvement over vanilla BERT, especially on the Emergent stance classification task, which is related to natural language inference. However, MNLI fine-tuning worsens BERT’s results on the STAC, speech act classification, sarcasm detection, and verifiability detection tasks.
The DiscEval categories cover a broad range of discourse aspects. The overall accuracies only give a synthetic view of the tasks evaluated in DiscEval. Some datasets (STAC, MRDA, Persuasion) contain many subcategories that allow a fine-grained analysis through a wide array of classes (viz. 51 categories for MRDA). Table 3 shows a fine-grained evaluation which yields some insights into the capabilities of BERT. For the sake of conciseness, we do not report categories with a support lower than 20. Interestingly, Discovery and MNLI are quite complementary as STILTs; for instance, MNLI is helpful for stance detection and some persuasion-related tasks, while Discovery performs best on discourse relation prediction.
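The fine-grained analysis above amounts to computing per-category scores and dropping rare categories. A minimal sketch, assuming per-class F1 as the metric and `per_class_f1` as a hypothetical helper name:

```python
from collections import Counter

# Sketch of per-category evaluation: compute F1 for each gold label,
# skipping categories whose support is below a threshold (20 in the
# paper's tables).

def per_class_f1(gold, pred, min_support=20):
    scores = {}
    for label, n in Counter(gold).items():
        if n < min_support:
            continue  # skip rare categories, as done for Table 3
        tp = sum(g == p == label for g, p in zip(gold, pred))
        fp = sum(p == label != g for g, p in zip(gold, pred))
        fn = sum(g == label != p for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[label] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores
```

Comparing these per-category dictionaries across STILTs is what reveals the complementarity between Discovery and MNLI noted above.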
We proposed DiscEval, a set of discourse-related evaluation tasks, and used them to evaluate BERT fine-tuned on various auxiliary fine-tuning tasks. The results lead us to rethink the efficiency of mainly using NLI as an auxiliary training task. DiscEval can be used for training or evaluation in general NLU or discourse-related work. In further investigations, we plan to use more general tasks than classification of sentences or pairs, such as longer and possibly structured sequences. Several of the datasets we used (SwitchBoard, GUM, STAC) already contain such higher-level structures. A comparison with human annotators on DiscEval tasks could also help to pinpoint the weaknesses of current models on discourse phenomena.
It would also be interesting to measure how scores on DiscEval tasks and GLUE tasks correlate with each other, and to test whether multi-task learning on DiscEval and GLUE tasks improves the score on both benchmarks. A further step would be to study the correlation between performance metrics in deployed NLU systems and the scores of the automated evaluation benchmarks, in order to validate our claims about the centrality of discourse.
Excerpt of Table 3 (per-category scores across the compared models; the last column is the support):
|MRDA.Expansions of y/n Answers|0|27.6|45.4|45.1|45.1|44.7|520|
|MRDA.Affirmative Non-yes Answers|0|2.7|9.9|10.6|15.5|8.8|117|