Discourse-Based Evaluation of Language Understanding

07/19/2019 ∙ by Damien Sileo, et al.

We introduce DiscEval, a compilation of 11 evaluation datasets with a focus on discourse, which can be used to evaluate English Natural Language Understanding when considering meaning as use. We make the case that evaluation with discourse tasks is overlooked and that Natural Language Inference (NLI) pretraining may not lead to learning truly universal representations. DiscEval can also be used as supplementary training data for multi-task learning-based systems, and is publicly available, alongside the code for gathering and preprocessing the datasets.




1 Introduction

Recent models for Natural Language Understanding (NLU) have made unusually quick progress over the last year, according to current evaluation frameworks. The GLUE benchmark [Wang et al.2018] was designed to be a set of challenging tasks for NLU. However, the best current systems surpass the human accuracy estimate on the average score of the GLUE tasks (as of June 2019).

While these benchmarks have been created for evaluation purposes, the best performing systems [Liu et al.2019, Yang et al.2019] rely on multi-task learning and use all the evaluation tasks as training tasks before the task-specific evaluations. These multi-task models, seen as the best performing, are also released for general purpose use. However, such multi-task finetuning might lead to catastrophic forgetting of capacities learned during the first training phase (viz. language modeling for BERT). This tendency makes the representativeness of these benchmark tasks all the more important.

A wide range of important work relies on SentEval [Cer et al.2018, Kiros and Chan2018, Subramanian et al.2018, Wieting et al.2015] or GLUE [Liu et al.2019] to back up the claim that encoders produce universal representations.

Since consensus benchmarks can strongly guide the evolution of NLU models, they should be as exhaustive as possible, and closely related to the models' end use cases. Ultimately, many use cases are related to conversation with end users or analysis of structured documents. In such cases, discourse analysis (i.e. the ability to parse high-level textual structures that take into account the global context) is a prerequisite for human-level performance.

More generally, the evaluation of NLU systems should incorporate pragmatic aspects and take into account the actual intent of speech acts [Searle et al.1980], while existing evaluations may focus mostly on semantic aspects (e.g. semantic similarity and NLI). In real-world use, recognizing the implicatures of a statement is arguably more important than recognizing its mere implications [Grice1975]. An implicature of a statement is a part of its meaning that is not contained in its literal interpretation and is dependent on the context (which may not be explicitly available, but can be inferred).

Consider the following utterance:

You’re standing on my foot.

Implications of this utterance include You are standing on a part of my body or Someone is touching me. On the other hand, a plausible implicature of this utterance is that the speaker wants the addressee to move further away.

The speaker’s intention, also called illocutionary force [Austin1975] can be regarded as a dimension of meaning that is complementary to the literal content [Green2000].

Understanding the literal semantic content of a statement in a conversation or a document is not sufficient if it does not allow an NLU system to understand how that statement should change a situation or how it fits a broader context. Speech acts have been categorized into classes such as Assertion, Question or Order, which have different kinds of effects on the world. For instance, constative speech acts (e.g. the sky is blue) describe a state of the world and are either true or false, while performative speech acts (e.g. I declare you husband and wife) can change the world upon utterance [Austin1975]. Discourse tasks focus on the meaning of language as use. Therefore, a discourse-centric evaluation could by construction be a better fit to evaluate how NLU models perform in practical use cases, or at least should be used as a complement to semantics-centered evaluations.

Thus, we make the case that ignoring discourse in evaluations is detrimental to NLU. We compile a list of tasks that should complement existing evaluation frameworks. We frame all our tasks as classification tasks (either of sentences, or of sentence pairs) so that they can be seamlessly integrated with other evaluation or multi-task pretraining setups (GLUE or SentEval tasks). Our evaluation benchmark, named DiscEval, is publicly available (https://github.com/synapse-developpement/DiscEval). We evaluate state-of-the-art NLU models on these tasks by fine-tuning BERT on several auxiliary finetuning datasets and show that the most widely used auxiliary finetuning dataset, viz. MNLI, is not the best performing on DiscEval.

2 Related Work

Evaluation methods for NLU have been the object of heated debates since the proposal of the Turing Test. Automatic evaluations can be based on sentence similarity [Agirre et al.2012] and leverage human-annotated scores of similarity between sentence pairs. Sentence similarity estimation tasks can potentially encompass many aspects, but it is not clear how human annotators weigh semantic, stylistic, and discursive aspects during their rating.

Using a set of more focused and clearly defined tasks has been a popular approach. [Kiros et al.2015] proposed a set of tasks and tools for sentence understanding evaluation. These tasks were compiled in the SentEval [Conneau et al.2017] evaluation suite, designed for automatic evaluation of pre-trained sentence embeddings. SentEval tasks are mostly based on sentiment analysis, sentence similarity and natural language inference, and SentEval forces the user to provide a sentence encoder that is not finetuned during the evaluation. Concurrently, [Zhang et al.2015] also compiled a set of text classification tasks based on thematic classification and sentiment analysis, which is still used to evaluate document-level representation learning [Yang et al.2019].

GLUE [Wang et al.2018] proposes to evaluate language understanding with fewer constraints than SentEval, allowing users not to rely on explicit sentence embedding based models. These tasks are classification or regression based, and are carried out for sentences or sentence pairs. Additionally, the authors propose diagnostic NLI tasks where various annotated linguistic phenomena occur, which could be necessary to make the right predictions, as in [Poliak et al.2018].

Natural Language Inference can be regarded as a universal framework for evaluation [Poliak et al.2018a]. In the Recast framework, existing datasets (e.g. sentiment analysis) are recast as NLI tasks. For instance, based on the sentence don't waste your money, annotated as a negative review, the authors use handcrafted rules to generate the following example:

(premise: When asked about the product, Liam said "don't waste your money",
hypothesis: Liam didn't like the product,
label: entailment)

However, the generated datasets do not allow an evaluation to measure directly how well a model deals with the semantic phenomena present in the original dataset, since some sentences use artificially generated reported speech.

Thus, NLI data could be used to evaluate discourse analysis, but it is not clear how to generate examples that are not overly artificial. Moreover, it is unclear to what extent the examples in existing NLI datasets are required to deal with pragmatics.
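The recasting step described above can be sketched as follows. This is only an illustration under stated assumptions: the template, the fixed speaker name and the `not-entailment` label are hypothetical, and the actual handcrafted rules of the Recast framework are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class NLIExample:
    premise: str
    hypothesis: str
    label: str


def recast_review(sentence: str, sentiment: str, speaker: str = "Liam") -> NLIExample:
    """Turn a sentiment-labeled review into an NLI example with a fixed template.

    The template, speaker name and label names are illustrative assumptions.
    """
    premise = f'When asked about the product, {speaker} said "{sentence}"'
    hypothesis = f"{speaker} didn't like the product"
    # The hypothesis states a negative opinion, so it is entailed only
    # when the original review was annotated as negative.
    label = "entailment" if sentiment == "negative" else "not-entailment"
    return NLIExample(premise, hypothesis, label)


example = recast_review("don't waste your money", "negative")
print(example.label)  # entailment
```

The premise produced this way always contains reported speech, which is the artificiality the paragraph above points out.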

SuperGLUE [Wang et al.2018] updates GLUE with six novel tasks that are selected to be even more challenging. Two of them deal with contextualized lexical semantics, two tasks are a form of question answering, and two of them are NLI problems. One of those NLI tasks, CommitmentBank (https://github.com/mcdm/CommitmentBank/), is the only explicitly discourse-related task.

Discourse relation prediction has been used by [Nie et al.2017] and [Sileo et al.2019] for sentence representation learning evaluation, but the dataset they use (PDTB [Prasad et al.2008]) is included in ours.

Other evaluations, like linguistic probing [Conneau et al.2018, Belinkov and Glass2019, Wang et al.2019b] focus on an internal understanding of what is captured by the models (e.g. syntax, lexical content), rather than measuring performance on external tasks, and are outside the scope of this work.

3 Proposed Tasks

Our goal is to compile a set of diverse discourse-related tasks. We restrict ourselves to classification either of sentences or sentence pairs. We only use publicly available datasets and tasks that are absent from other benchmarks (SentEval/GLUE). As opposed to GLUE [Wang et al.2018], we do not keep test labels hidden, in order to allow faster experiments. The scores on these tasks are not meant to be compared to previous work, since we alter some datasets to yield more meaningful evaluations (we perform duplicate removal or class subsampling when mentioned).

We first present the tasks, and then propose a rudimentary taxonomy of how they fit into conceptions of meaning as use.

| dataset | categories | example | class | training examples |
|---|---|---|---|---|
| PDTB | discourse relation | “it was censorship”/“it was outrageous” | conjunction | 13k |
| STAC | discourse relation | “what ?”/“i literally lost” | question-answer-pair | 11k |
| GUM | discourse relation | “Do not drink”/“if underage in your country” | condition | 2k |
| Emergent | stance | “a meteorite landed in nicaragua.”/“small meteorite hits managua” | for | 2k |
| SwitchBoard | speech act | “well , a little different , actually ,” | hedge | 19k |
| MRDA | speech act | “yeah that ’s that ’s that ’s what i meant .” | acknowledge-answer | 14k |
| Persuasion | C/E/P/S/S/R | “Co-operation is essential for team work”/“lions hunt in a team” | low specificity | 0.6k |
| SarcasmV2 | presence of sarcasm | “don’t quit your day job”/“[…] i was going to sell this joke. […]” | sarcasm | 9k |
| Squinky | I/I/F | “boo ya.” | uninformative, high implicature, informal | 4k |
| Verifiability | verifiability | “I’ve been a physician for 20 years.” | verifiable-experiential | 6k |
| EmoBank | V/A/D | “I wanted to be there..” | low valence, high arousal, low dominance | 5k |

Table 1: DiscEval classification datasets. The last column gives the number of examples in the training set. C/E/P/S/S/R denotes ClaimType/Eloquence/PremiseType/Strength/Specificity/Relevance; I/I/F denotes Informativeness/Implicature/Formality; V/A/D denotes Valence/Arousal/Dominance.

| model | PDTB | STAC | GUM | Emergent | SwitchB. | MRDA | Persuasion | SarcasmV2 | Squinky | Verif. | EmoBank | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Majority | 26.2 | 20.2 | 16.9 | 50.2 | 18.6 | 19.7 | 66.5 | 49 | 53.3 | 69.6 | 56.1 | 40.6 |
| fastText | 31 | 47 | 16.9 | 64.6 | 47.3 | 31 | 66.5 | 64.6 | 79.4 | 77 | 65 | 53.7 |
| BERT | 52.2 | 55.6 | 38.6 | 75.5 | 63.7 | 43.7 | 66.6 | 76.2 | 87.8 | 84.9 | 75.6 | 65.5 |
| BERT+MNLI | 52.3 | 54.9 | 40.2 | 78.8 | 63.1 | 42.9 | 69.5 | 71.7 | 87.9 | 84.4 | 76.1 | 65.6 |
| BERT+DisSent | 51.4 | 57.2 | 45.3 | 67.6 | 64.3 | 43.7 | 69.6 | 64.7 | 87.8 | 84 | 75.6 | 64.7 |
| BERT+Discovery | 55.4 | 58.7 | 48.5 | 69.8 | 65.1 | 45.7 | 67.9 | 75.5 | 88.5 | 85.4 | 76.6 | 67.0 |

Table 2: Transfer test accuracies across DiscEval tasks. We report the average when the dataset has several classification tasks (as in Squinky, EmoBank and Persuasion). AVG denotes the average of DiscEval tasks. BERT+X refers to the BERT pretrained classification model after an auxiliary finetuning phase on task X.

3.1 DiscEval tasks

In this section, we describe the datasets that are part of DiscEval. They are summarized in Table 1. The names of the most frequent classes can be found in Table 3.


PDTB [Prasad et al.2014] contains a collection of fine-grained implicit (i.e. not signaled by a discourse marker) relations between sentences from the news domain in the Penn Discourse TreeBank 2.0. We select the level 2 relations as categories.


STAC is a corpus of strategic chat conversations manually annotated with negotiation-related information, dialogue acts and discourse structures in the framework of Segmented Discourse Representation Theory (SDRT). We only consider pairwise relations between all dialog acts, following [Badene et al.2019]. We remove duplicate pairs and dialogues that only have non-linguistic utterances (coming from a server). We subsample dialog act pairs with no relation so that they constitute a fixed proportion of each fold.


GUM is a corpus of multilayer annotations for texts from various domains; it includes Rhetorical Structure Theory (RST) discourse structure annotations. Once again, we only consider pairwise interactions between discourse units (e.g. sentences/clauses). We subsample dialog act pairs with no relation so that they constitute a fixed proportion of each dialog. We split the examples into train/test/dev sets randomly according to the dialog they belong to.
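Splitting according to the dialog (or document) an example belongs to keeps pairs drawn from the same text out of different splits. A minimal sketch of such a group-wise split is given below; the function name, the `ratios` and the `seed` are hypothetical choices, not values taken from the released preprocessing code.

```python
import random
from collections import defaultdict


def split_by_group(examples, group_key, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split examples into train/dev/test so that no group spans two splits.

    `ratios` and `seed` are illustrative assumptions.
    """
    groups = defaultdict(list)
    for ex in examples:
        groups[ex[group_key]].append(ex)

    # Shuffle group identifiers, not individual examples.
    ids = sorted(groups)
    random.Random(seed).shuffle(ids)

    n_train = int(ratios[0] * len(ids))
    n_dev = int(ratios[1] * len(ids))
    parts = {
        "train": ids[:n_train],
        "dev": ids[n_train:n_train + n_dev],
        "test": ids[n_train + n_dev:],
    }
    return {name: [ex for g in gids for ex in groups[g]] for name, gids in parts.items()}
```

Because whole groups are assigned to a split, the resulting split sizes only approximate the requested ratios when groups differ in size.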


Emergent [Ferreira and Vlachos2016] is composed of pairs of assertions and titles of news articles that are against, for, or neutral with respect to the opinion of the assertion.


SwitchBoard [Godfrey et al.1992] contains textual transcriptions of dialogs about various topics with annotated speech acts. We remove duplicate examples and subsample Statements and Non Statements so that they constitute 20% of the examples. We use a custom train/dev validation split (90/10 ratio) since this deduplication led to a drastic size reduction of the original development set. The label of a speech act can depend on the context (previous utterances), but we discarded the context in this work for the sake of simplicity, even though integrating it could improve the scores [Ribeiro et al.2015].


MRDA [Shriberg et al.2004] contains textual transcriptions of real multi-party meetings, with speech act annotations. We use a custom train/dev validation split (90/10 ratio) since deduplication led to a drastic size reduction of the original development set, and we subsample Statement examples so that they constitute 20% of the dataset. We also ignored the context.
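The deduplication and class subsampling steps used for the speech act datasets can be sketched as follows. This is one possible reading, assuming examples are `(text, label)` tuples and that a single frequent label is downsampled to a target share; the function name and arguments are hypothetical.

```python
import random


def dedup_and_subsample(examples, target_label, fraction=0.2, seed=0):
    """Remove duplicate (text, label) pairs, then downsample `target_label`
    so that it makes up at most `fraction` of the remaining examples."""
    # dict.fromkeys keeps the first occurrence and preserves order.
    unique = list(dict.fromkeys(examples))
    target = [ex for ex in unique if ex[1] == target_label]
    others = [ex for ex in unique if ex[1] != target_label]
    # Solve k / (k + len(others)) = fraction for k.
    k = min(len(target), round(fraction * len(others) / (1 - fraction)))
    kept = random.Random(seed).sample(target, k)
    return others + kept
```

With 40 unique examples from other classes and a target share of 20%, 10 examples of the frequent class are kept, giving 50 examples in total.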


Persuasion [Carlile et al.2018] is a collection of arguments from student essays annotated with factors of persuasiveness with respect to a claim; the considered factors are the following: Specificity, Eloquence, Relevance, Strength, ClaimType, PremiseType. For each graded target (first 4 factors), we cast the ratings into three quantiles and discard the middle quantile.


SarcasmV2 [Oraby et al.2016] consists of messages from online forums with responses that may or may not be sarcastic according to human annotations.

Squinky dataset [Lahiri2015] gathers annotations of Formality, Informativeness and Implicature, where sentences were graded on a scale from 1 to 7. The author defines the Implicature score as the amount of not explicitly stated information carried in a sentence. For each target, we cast the ratings into three quantiles and discard the middle quantile.


Verifiability [Park and Cardie2014] is a collection of online user comments annotated as Verifiable-Experiential (verifiable and about the writer's experience), Verifiable-Non-Experiential or Unverifiable.


EmoBank [Buechel and Hahn2017] aggregates emotion annotations on texts from various domains using the VAD representation format. The authors define Valence as corresponding to the concept of polarity (this is the dimension that is widely used in sentiment analysis), Arousal as degree of calmness or excitement and Dominance as perceived degree of control over a situation. For each target, we cast the ratings into three quantiles and discard the middle quantile.
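The quantile-based binarization applied to the graded targets (Persuasion, Squinky, EmoBank) can be sketched as follows. This is a minimal illustration: the function name and the `low`/`high` label strings are assumptions, and ties at the boundaries are resolved by rank, which may differ from the released preprocessing.

```python
def binarize_by_terciles(scored_examples):
    """Map graded examples to 'low'/'high' labels and drop the middle third.

    `scored_examples` is a list of (text, score) pairs. Ties at the tercile
    boundaries are broken by sort order, which is an assumption.
    """
    ranked = sorted(scored_examples, key=lambda pair: pair[1])
    n = len(ranked)
    low_cut, high_cut = n // 3, n - n // 3
    low = [(text, "low") for text, _ in ranked[:low_cut]]
    high = [(text, "high") for text, _ in ranked[high_cut:]]
    return low + high
```

Discarding the middle quantile turns each graded target into a binary classification task with a clearer separation between the two classes, at the cost of about a third of the examples.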

3.2 Articulating DiscEval tasks

A sentence can have a goal (characterized by speech act or discourse relation), can pursue that goal through various means (e.g. using appeal to emotions, or verifiable arguments), and can achieve that goal with varying degrees of success. This leads us to a rudimentary grouping of our tasks:

  • The speech act classification tasks (SwitchBoard, MRDA) deal with the detection of the function of utterances. They use the same label set (viz. DAMSL) [Allen and Core1997] but different domains and annotation guidelines. A discourse relation characterizes how a sentence contributes to the meaning of a document/conversation (e.g. through elaboration or contrast), so this task requires a form of understanding of the use of a sentence, and of how a sentence fits with another sentence in a broader discourse. Here, three tasks (PDTB, STAC, GUM) deal with discourse relation prediction with varying domains and formalisms (these formalisms have different assumptions and definitions about the nature of discourse structure). The stance detection task can be seen as a coarse-grained discourse relation classification.

  • Persuasiveness prediction is a useful tool to assess whether a model can measure how well a sentence can achieve its intended goal. This aspect is orthogonal to the determination of the goal itself, and is arguably equally important.

  • Detecting emotional content, verifiability, formality, informativeness or sarcasm is necessary in order to figure out in what realm communication is occurring. A statement can be persuasive, yet poorly informative and unverifiable. Emotions [Dolan2002] and power perception [Pfeffer1981] can have a strong influence on human behavior and text interpretation. Manipulating emotions can be the main purpose of a speech act as well. Sarcasm is another means of communication, and sarcasm detection is in itself an interesting task for the evaluation of pragmatics, since sarcasm is a clear case of literal meaning being different from the intended meaning.

4 Evaluations

4.1 Models

We provide two baselines for DiscEval: prediction of the majority class, and a fastText classifier. The fastText classifier [Joulin et al.2016] has randomly initialized embeddings of size 10 and default parameters otherwise. The embedding size was selected according to DiscEval development set accuracy. When the input is a sentence pair, words have distinct representations for their occurrences in the first and second sentence.
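One simple way to give a word distinct representations in the first and second sentence is to prefix its tokens before writing the training file. The sketch below follows the `__label__` line layout read by fastText's supervised mode; the `s1_`/`s2_` prefixes, the lowercasing and the function name are hypothetical and may differ from the released code.

```python
def to_fasttext_line(label, first, second=None):
    """Format one example as a `__label__<y> <tokens>` line.

    For sentence pairs, tokens are prefixed with `s1_` or `s2_` (hypothetical
    prefixes) so that a word gets a separate embedding in each position.
    """
    tokens = first.lower().split()
    if second is not None:
        tokens = [f"s1_{t}" for t in tokens]
        tokens += [f"s2_{t}" for t in second.lower().split()]
    return f"__label__{label} " + " ".join(tokens)


print(to_fasttext_line("condition", "Do not drink", "if underage"))
# __label__condition s1_do s1_not s1_drink s2_if s2_underage
```

Single-sentence tasks are left unprefixed, so the same helper covers both input types.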

As another reference, we evaluate the BERT [Devlin et al.2019] base uncased model. During the evaluation fine-tuning phase, we use 2 epochs and the HuggingFace script (https://github.com/huggingface/pytorch-transformers/) with default parameters otherwise.

We also perform experiments with Supplementary Training on Intermediate Labeled-data Tasks (STILT) [Phang et al.2018]. STILT is a further pretraining step on a data-rich task before the final fine-tuning evaluation on the target task. We finetune BERT on three such tasks:


MNLI [Williams et al.2018] is a collection of 433k sentence pairs manually annotated with contradiction, entailment, or neutral relations. Finetuning with this dataset leads to accuracy improvement on all GLUE tasks except CoLA [Phang et al.2018].


DisSent data is from [Nie et al.2017], consisting of sentences or clauses that were separated by a discourse marker from a list of 15 markers. Prediction of discourse markers based on the context clauses/sentences with which they occurred has been used as a training signal for sentence representation learning. The authors used handcrafted rules for each marker in order to ensure that the markers actually signal a form of relation. DisSent has underwhelming results on the GLUE tasks as a STILT [Wang et al.2019a].


Discovery [Sileo et al.2019] is another dataset for discourse marker prediction, composed of discourse markers with usage examples for each marker. Sentence pairs were extracted from web data, and the markers come either from the PDTB or from a heuristic automatic extraction.

We finetune BERT on the STILTs for 1 to 3 epochs and select the best performing model according to DiscEval average development set accuracy.

4.2 Overall Results

These models are evaluated in Table 2. We report the average score over 8 fine-tuning runs.

DiscEval seems to be challenging for the BERT base model. Indeed, for all tasks, there is a STILT that significantly improves the accuracy of BERT.

The best overall result is achieved with the Discovery STILT. Pretraining on MNLI also yields an overall improvement over vanilla BERT, especially on the Emergent stance classification task, which is related to natural language inference. However, MNLI finetuning worsens the results of BERT on STAC, speech act classification, sarcasm detection, and verifiability detection tasks.

MNLI has been suggested as a useful auxiliary training task based on evaluation on GLUE [Phang et al.2018] and SentEval [Conneau et al.2017]. Our evaluation suggests that finetuning a model with MNLI alone has drawbacks that could be alleviated by using discourse marker prediction tasks.
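The comparison above can be checked directly against the accuracies of the BERT and BERT+MNLI rows of Table 2. The short script below copies those two rows and recomputes the AVG column and the list of tasks where MNLI finetuning lowers accuracy; the helper names are illustrative.

```python
TASKS = ["PDTB", "STAC", "GUM", "Emergent", "SwitchB.", "MRDA",
         "Persuasion", "SarcasmV2", "Squinky", "Verif.", "EmoBank"]
# Test accuracies copied from Table 2.
BERT = [52.2, 55.6, 38.6, 75.5, 63.7, 43.7, 66.6, 76.2, 87.8, 84.9, 75.6]
BERT_MNLI = [52.3, 54.9, 40.2, 78.8, 63.1, 42.9, 69.5, 71.7, 87.9, 84.4, 76.1]


def average(scores):
    """Unweighted mean over tasks, rounded to one decimal as in the table."""
    return round(sum(scores) / len(scores), 1)


def worsened_tasks(baseline, candidate):
    """Tasks on which `candidate` scores strictly below `baseline`."""
    return [t for t, b, c in zip(TASKS, baseline, candidate) if c < b]


print(average(BERT), average(BERT_MNLI))  # 65.5 65.6
print(worsened_tasks(BERT, BERT_MNLI))
# ['STAC', 'SwitchB.', 'MRDA', 'SarcasmV2', 'Verif.']
```

The largest drop is on SarcasmV2 (76.2 to 71.7), while the largest gain is on Emergent (75.5 to 78.8), so the small difference in the averages hides sizeable task-level shifts in both directions.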

4.3 Fine Grained Results

DiscEval categories cover a broad range of discourse aspects. The overall accuracies only show a synthetic view of the tasks evaluated in DiscEval. Some datasets (STAC, MRDA, Persuasiveness) contain many subcategories that allow a fine-grained analysis through a wide array of classes (viz. 51 categories for MRDA). Table 3 shows a fine-grained evaluation which yields some insights on the capabilities of BERT. For the sake of conciseness, we do not report categories with a support below 20. Interestingly, Discovery and MNLI are quite complementary as STILTs. For instance, MNLI is helpful for stance detection and on some persuasion-related tasks, while Discovery is the best performing in discourse relation prediction.
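A per-class report of this kind can be computed from gold and predicted labels as sketched below. This is a plain-Python illustration with hypothetical names, not the evaluation script used for Table 3; the `min_support` threshold mirrors the cut-off of 20 mentioned above.

```python
from collections import Counter


def per_class_f1(gold, predicted, min_support=20):
    """Return {label: (f1_percent, support)} for labels with enough gold examples."""
    support = Counter(gold)
    predicted_count = Counter(predicted)
    true_pos = Counter(g for g, p in zip(gold, predicted) if g == p)
    report = {}
    for label, n_gold in support.items():
        if n_gold < min_support:
            continue  # skip rare classes, as in the reported table
        n_pred = predicted_count[label]
        precision = true_pos[label] / n_pred if n_pred else 0.0
        recall = true_pos[label] / n_gold
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (round(100 * f1, 1), n_gold)
    return report
```

Under this definition, a majority-class predictor gets an F1 of 0 on every class it never predicts, which is why the Majority column of a per-class table is zero for all but one class per task.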

5 Conclusion

We proposed DiscEval, a set of discourse-related evaluation tasks, and used them to evaluate BERT finetuned on various auxiliary finetuning tasks. The results lead us to rethink the efficiency of mainly using NLI as an auxiliary training task. DiscEval can be used for training or evaluation in general NLU or discourse-related work. In further investigations, we plan to use more general tasks than classification of sentences or sentence pairs, such as tasks over longer and possibly structured sequences. Several of the datasets we used (SwitchBoard, GUM, STAC) already contain such higher-level structures. A comparison with human annotators on DiscEval tasks could also help to pinpoint the weaknesses of current models dealing with discourse phenomena.

It would also be interesting to measure how scores on DiscEval tasks and GLUE tasks correlate to each other, and test whether multi-task learning on DiscEval and GLUE tasks improves the score on both benchmarks. A further step would be to study the correlation between performance metrics in deployed NLU systems and the scores of the automated evaluation benchmarks in order to validate our claims about centrality of discourse.

| class | Majority | fastText | BERT | BERT+DisSent | BERT+Discovery | BERT+MNLI | support |
|---|---|---|---|---|---|---|---|
| Emergent.for | 67 | 74.1 | 81.4 | 72.5 | 79.2 | 82.5 | 130 |
| Emergent.observing | 0 | 54.9 | 72.6 | 54.9 | 58.9 | 75.6 | 97 |
| Emergent.against | 0 | 31.8 | 67.8 | 31.1 | 56.8 | 81.9 | 32 |
| EmoBank.Arousal.high | 0 | 51.1 | 71.6 | 74.0 | 72.8 | 73.3 | 346 |
| EmoBank.Arousal.low | 66 | 62.5 | 72.1 | 72.5 | 73.4 | 73.1 | 337 |
| EmoBank.Dominance.low | 77 | 76.5 | 76.2 | 75.6 | 77.1 | 74.3 | 502 |
| EmoBank.Dominance.high | 0 | 33.1 | 55.2 | 56.8 | 53.5 | 59.5 | 296 |
| EmoBank.Valence.low | 72 | 76.9 | 88.1 | 87.1 | 88.2 | 87.5 | 360 |
| EmoBank.Valence.high | 0 | 65.4 | 84.8 | 84 | 84.5 | 84.2 | 283 |
| Squinky.Formality.low | 69 | 90.6 | 96.8 | 96.8 | 96.3 | 96.6 | 240 |
| Squinky.Formality.high | 0 | 89.4 | 96.3 | 96.4 | 96.2 | 95.8 | 212 |
| Squinky.Implicature.low | 69 | 68.8 | 74.8 | 74.8 | 75.8 | 74.7 | 246 |
| Squinky.Implicature.high | 0 | 49.4 | 72.9 | 72 | 73.8 | 72.8 | 219 |
| Squinky.Informativeness.low | 70 | 87.5 | 94.3 | 94.2 | 94.4 | 93.8 | 250 |
| Squinky.Informativeness.high | 0 | 85.8 | 93.1 | 93.3 | 93.5 | 92.7 | 214 |
| GUM.no_relation | 0 | 0 | 43.4 | 42.4 | 47.5 | 34.3 | 45 |
| GUM.elaboration | 29 | 29 | 48.1 | 53.7 | 54.3 | 49.3 | 42 |
| GUM.purpose | 0 | 0 | 56.8 | 79.5 | 85.3 | 66.8 | 27 |
| GUM.circumstance | 0 | 0 | 56.8 | 65.2 | 67.7 | 63.1 | 23 |
| GUM.condition | 0 | 0 | 67.8 | 68 | 70.9 | 68.7 | 20 |
| MRDA.Statement | 33 | 39.3 | 48.2 | 48.3 | 48.7 | 47.8 | 1270 |
| MRDA.Expansions of y/n Answers | 0 | 27.6 | 45.4 | 45.1 | 45.1 | 44.7 | 520 |
| MRDA.Defending/Explanation | 0 | 49.5 | 53.7 | 53.2 | 54.3 | 53.5 | 515 |
| MRDA.Rising Tone | 0 | 20.2 | 41.3 | 41.5 | 41.9 | 41.2 | 445 |
| MRDA.Offer | 0 | 39.5 | 52.7 | 51.8 | 54.6 | 51.2 | 398 |
| MRDA.Floor Holder | 0 | 33.1 | 57.2 | 57.4 | 57 | 57.5 | 372 |
| MRDA.Understanding Check | 0 | 8.6 | 47.1 | 47.7 | 49.3 | 46 | 359 |
| MRDA.Floor Grabber | 0 | 29.4 | 40 | 36.8 | 37.6 | 35.6 | 279 |
| MRDA.Assessment/Appreciation | 0 | 33.7 | 56.8 | 58.4 | 58 | 56.8 | 225 |
| MRDA.Acknowledge-answer | 0 | 33.3 | 34.7 | 40.1 | 44.3 | 32.5 | 217 |
| MRDA.Accept | 0 | 32.9 | 41.5 | 39.8 | 43.6 | 42 | 167 |
| MRDA.Wh-Question | 0 | 33.9 | 61.4 | 59.7 | 61.2 | 60.1 | 138 |
| MRDA.Collaborative Completion | 0 | 19 | 24.9 | 23.8 | 27.1 | 22.7 | 119 |
| MRDA.Affirmative Non-yes Answers | 0 | 2.7 | 9.9 | 10.6 | 15.5 | 8.8 | 117 |
| MRDA.Interrupted/Abandoned/Uninterpretable | 0 | 0 | 17 | 11.8 | 29.6 | 10.8 | 112 |
| MRDA.Yes-No-question | 0 | 0 | 22.1 | 19.8 | 40.3 | 21.9 | 83 |
| PDTB.Expansion | 55 | 55.2 | 64.3 | 64.2 | 65.5 | 63.6 | 568 |
| PDTB.Entrel | 0 | 50.7 | 67.4 | 66.2 | 69.2 | 67.2 | 418 |
| PDTB.Contingency | 0 | 25.7 | 51.8 | 52.8 | 53.9 | 52.1 | 291 |
| PDTB.Comparison | 0 | 0 | 41.5 | 44.8 | 49.8 | 44.9 | 151 |
| PDTB.Temporal | 0 | 0 | 41.1 | 39.1 | 42.8 | 32 | 82 |
| PDTB.Cause | 41 | 45.3 | 58.4 | 59.6 | 60.0 | 59 | 284 |
| PDTB.Restatement | 0 | 0 | 44.9 | 43.4 | 52.7 | 44.7 | 215 |
| PDTB.Conjunction | 0 | 38.4 | 53.8 | 53.2 | 55.1 | 54.3 | 206 |
| PDTB.Contrast | 0 | 0 | 45.8 | 48.1 | 52.8 | 48.7 | 132 |
| PDTB.Instantiation | 0 | 0 | 63.5 | 60.8 | 66.5 | 64.6 | 120 |
| PDTB.Asynchronous | 0 | 0 | 52.5 | 48.2 | 57.7 | 49 | 64 |
| Persuasion.Eloquence.low | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 68 |
| Persuasion.Eloquence.high | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 22 |
| Persuasion.PremiseType.common_knowledge | 84 | 84 | 84 | 84 | 84.2 | 84 | 51 |
| Persuasion.Relevance.high | 81.0 | 81.0 | 81.0 | 81.0 | 81.0 | 80 | 61 |
| Persuasion.Relevance.low | 0 | 0 | 0 | 0 | 0 | 9.8 | 29 |
| Persuasion.Specificity.low | 73 | 73 | 74.9 | 84.0 | 79.4 | 81.4 | 36 |
| Persuasion.Specificity.high | 0 | 0.7 | 15.2 | 70.9 | 49.4 | 72.2 | 26 |
| Persuasion.Strength.low | 72 | 72 | 72.1 | 72 | 73.4 | 74.2 | 26 |
| Persuasion.Strength.high | 0 | 0 | 0.8 | 0 | 9.2 | 22.6 | 20 |
| STAC.(class name missing) | 0 | 71 | 81.2 | 81.2 | 81.8 | 81.5 | 295 |
| STAC.no_relation | 34 | 38.9 | 37.8 | 38.6 | 41.8 | 35.7 | 264 |
| STAC.Acknowledgement | 0 | 62.1 | 71 | 71.6 | 73.0 | 71.2 | 143 |
| STAC.Comment | 0 | 41.4 | 52.4 | 52.8 | 54.5 | 50.7 | 116 |
| STAC.Elaboration | 0 | 25.6 | 43.4 | 46 | 47.0 | 42.7 | 102 |
| STAC.Result | 0 | 50 | 56 | 59.4 | 56 | 53 | 78 |
| STAC.Continuation | 0 | 20 | 27.2 | 26 | 34.3 | 25.8 | 68 |
| STAC.Q_Elab | 0 | 46.9 | 59.8 | 59.4 | 62.7 | 61.7 | 67 |
| STAC.Contrast | 0 | 0.5 | 45 | 42.4 | 47.0 | 43.2 | 38 |
| STAC.Clarification_question | 0 | 0 | 38.4 | 44.2 | 42.2 | 37.3 | 29 |
| STAC.Explanation | 0 | 6.5 | 22.6 | 33.8 | 36.7 | 32.5 | 27 |
| SarcasmV2.sarcasm | 0 | 68 | 76.4 | 63.6 | 70.9 | 75.7 | 239 |
| SarcasmV2.not_sarcasm | 66 | 60.5 | 77.1 | 67.8 | 74.6 | 66.2 | 230 |
| SwitchBoard.(class name missing) | 31 | 69.2 | 86.9 | 88.6 | 87.7 | 87.2 | 121 |
| SwitchBoard.Statement-non-opinion | 0 | 46.9 | 63.5 | 62.1 | 63.3 | 61.8 | 81 |
| SwitchBoard.Yes-No-Question | 0 | 63.5 | 81.5 | 80.4 | 81.1 | 80.4 | 75 |
| SwitchBoard.Wh-Question | 0 | 61.8 | 72.6 | 73.2 | 71.5 | 74.4 | 46 |
| SwitchBoard.Statement-opinion | 0 | 4.4 | 49 | 48.1 | 53.3 | 42.8 | 42 |
| SwitchBoard.Declarative Yes-No-Question | 0 | 28.3 | 41.3 | 41.6 | 42.5 | 36.6 | 35 |
| SwitchBoard.Conventional-closing | 0 | 56.7 | 76.3 | 73.8 | 78.0 | 77.7 | 28 |
| SwitchBoard.Action-directive | 0 | 9.6 | 64.4 | 65.4 | 65.9 | 68.9 | 26 |
| SwitchBoard.Agree/Accept | 0 | 43.9 | 58.5 | 61.3 | 58.6 | 58.4 | 24 |
| SwitchBoard.Summarize/Reformulate | 0 | 3.1 | 28.6 | 27.9 | 28.8 | 22.9 | 23 |
| SwitchBoard.Appreciation | 0 | 52.7 | 81.9 | 81.2 | 81.1 | 83.3 | 21 |
| Verifiability.unverifiable | 82 | 85.8 | 90.1 | 90.3 | 90.9 | 90.1 | 1687 |
| Verifiability.non-experiential | 0 | 25.6 | 63.7 | 62.4 | 65.1 | 60.6 | 370 |
| Verifiability.experiential | 0 | 62.7 | 78.4 | 77.4 | 79.9 | 77.1 | 367 |

Table 3: Transfer F1 scores across the categories of DiscEval tasks; BERT+X denotes the BERT pretrained classification model after an auxiliary finetuning phase on task X.


  • [Agirre et al.2012] Agirre, E., Diab, M., Cer, D., and Gonzalez-Agirre, A. (2012). Semeval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics-Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 385–393. Association for Computational Linguistics.
  • [Allen and Core1997] Allen, J. and Core, M. (1997). Draft of damsl: Dialog act markup in several layers.
  • [Austin1975] Austin, J. L. (1975). How to do things with words. Oxford university press.
  • [Badene et al.2019] Badene, S., Thompson, C., Lorré, J.-P., and Asher, N. (2019). Learning Multi-party Discourse Structure Using Weak Supervision (regular paper). In Computational Linguistics and Intellectual Technologies: papers from the Annual conference Dialogue, Moscou, Russie, 29/05/2019-01/06/2019, page (on line), http://www.aclweb.org, mai. Association for Computational Linguistics (ACL).
  • [Belinkov and Glass2019] Belinkov, Y. and Glass, J. (2019). Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72.
  • [Buechel and Hahn2017] Buechel, S. and Hahn, U. (2017). EmoBank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 578–585, Valencia, Spain, April. Association for Computational Linguistics.
  • [Carlile et al.2018] Carlile, W., Gurrapadi, N., Ke, Z., and Ng, V. (2018). Give me more feedback: Annotating argument persuasiveness and related attributes in student essays. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 621–631, Melbourne, Australia, July. Association for Computational Linguistics.
  • [Cer et al.2018] Cer, D., Yang, Y., yi Kong, S., Hua, N., Limtiaco, N., John, R. S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., Sung, Y.-H., Strope, B., and Kurzweil, R. (2018). Universal sentence encoder.
  • [Conneau et al.2017] Conneau, A., Kiela, D., Schwenk, H., Barrault, L., and Bordes, A. (2017). Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. Emnlp.
  • [Conneau et al.2018] Conneau, A., Kruszewski, G., Lample, G., Barrault, L., and Baroni, M. (2018). What you can cram into a single vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136. Association for Computational Linguistics.
  • [Devlin et al.2019] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics.
  • [Dolan2002] Dolan, R. J. (2002). Emotion, cognition, and behavior. science, 298(5596):1191–1194.
  • [Ferreira and Vlachos2016] Ferreira, W. and Vlachos, A. (2016). Emergent: a novel data-set for stance classification. In HLT-NAACL.
  • [Godfrey et al.1992] Godfrey, J. J., Holliman, E. C., and McDaniel, J. (1992). Switchboard: Telephone speech corpus for research and development. In Proceedings of the 1992 IEEE International Conference on Acoustics, Speech and Signal Processing - Volume 1, ICASSP’92, pages 517–520, Washington, DC, USA. IEEE Computer Society.
  • [Green2000] Green, M. S. (2000). Illocutionary force and semantic content. Linguistics and Philosophy, 23(5):435–473.
  • [Grice1975] Grice, H. P. (1975). Logic and conversation. In Peter Cole et al., editors, Syntax and Semantics: Vol. 3: Speech Acts, pages 41–58. Academic Press, New York.
  • [Joulin et al.2016] Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. (2016). Bag of Tricks for Efficient Text Classification.
  • [Kiros and Chan2018] Kiros, J. and Chan, W. (2018). InferLite: Simple universal sentence representations from natural language inference data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4868–4874, Brussels, Belgium, October-November. Association for Computational Linguistics.
  • [Kiros et al.2015] Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S. (2015). Skip-thought vectors. In Advances in neural information processing systems, pages 3294–3302.
  • [Lahiri2015] Lahiri, S. (2015). SQUINKY! A Corpus of Sentence-level Formality, Informativeness, and Implicature. CoRR, abs/1506.02306.
  • [Liu et al.2019] Liu, X., He, P., Chen, W., and Gao, J. (2019). Improving multi-task deep neural networks via knowledge distillation for natural language understanding. arXiv preprint arXiv:1904.09482.
  • [Nie et al.2017] Nie, A., Bennett, E. D., and Goodman, N. D. (2017). DisSent: Sentence Representation Learning from Explicit Discourse Relations.
  • [Oraby et al.2016] Oraby, S., Harrison, V., Reed, L., Hernandez, E., Riloff, E., and Walker, M. (2016). Creating and characterizing a diverse corpus of sarcasm in dialogue. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 31–41. Association for Computational Linguistics.
  • [Park and Cardie2014] Park, J. and Cardie, C. (2014). Identifying appropriate support for propositions in online user comments. In Proceedings of the first workshop on argumentation mining, pages 29–38.
  • [Pfeffer1981] Pfeffer, J. (1981). Understanding the role of power in decision making. In Jay M. Shafritz and J. Steven Ott, editors, Classics of Organization Theory, Wadsworth, pages 137–154.
  • [Phang et al.2018] Phang, J., Févry, T., and Bowman, S. R. (2018). Sentence encoders on stilts: Supplementary training on intermediate labeled-data tasks. CoRR, abs/1811.01088.
  • [Poliak et al.2018a] Poliak, A., Haldar, A., Rudinger, R., Hu, J. E., Pavlick, E., White, A. S., and Van Durme, B. (2018a). Collecting diverse natural language inference problems for sentence representation evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 67–81.
  • [Poliak et al.2018b] Poliak, A., Naradowsky, J., Haldar, A., Rudinger, R., and Durme, B. V. (2018b). Hypothesis Only Baselines in Natural Language Inference. Proceedings of the 7th Joint Conference on Lexical and Computational Semantics, (1):180–191.
  • [Prasad et al.2008] Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A., and Webber, B. (2008). The Penn Discourse Treebank 2.0. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, and Daniel Tapias, editors, Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco, May. European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2008/.
  • [Prasad et al.2014] Prasad, R., Riley, K. F., and Lee, A. (2014). Towards Full Text Shallow Discourse Relation Annotation: Experiments with Cross-Paragraph Implicit Relations in the PDTB. (2009).
  • [Ribeiro et al.2015] Ribeiro, E., Ribeiro, R., and de Matos, D. M. (2015). The influence of context on dialogue act recognition. arXiv preprint arXiv:1506.00839.
  • [Searle et al.1980] Searle, J. R., Kiefer, F., Bierwisch, M., et al. (1980). Speech act theory and pragmatics, volume 10. Springer.
  • [Shriberg et al.2004] Shriberg, E., Dhillon, R., Bhagat, S., Ang, J., and Carvey, H. (2004). The ICSI meeting recorder dialog act (MRDA) corpus. In Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue at HLT-NAACL 2004.
  • [Sileo et al.2019] Sileo, D., Van de Cruys, T., Pradel, C., and Muller, P. (2019). Mining discourse markers for unsupervised sentence representation learning. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics.
  • [Subramanian et al.2018] Subramanian, S., Trischler, A., Bengio, Y., and Pal, C. J. (2018). Learning general purpose distributed sentence representations via large scale multi-task learning. International Conference on Learning Representations.
  • [Wang et al.2018] Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium, November. Association for Computational Linguistics.
  • [Wang et al.2019a] Wang, A., Hula, J., Xia, P., Pappagari, R., McCoy, R. T., Patel, R., Kim, N., Tenney, I., Huang, Y., Yu, K., Jin, S., Chen, B., Durme, B. V., Grave, E., Pavlick, E., and Bowman, S. R. (2019a). Can you tell me how to get past Sesame Street? Sentence-level pretraining beyond language modeling. In ACL 2019.
  • [Wang et al.2019b] Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2019b). GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.
  • [Wieting et al.2015] Wieting, J., Bansal, M., Gimpel, K., and Livescu, K. (2015). Towards universal paraphrastic sentence embeddings. CoRR, abs/1511.08198.
  • [Williams et al.2018] Williams, A., Nangia, N., and Bowman, S. (2018). A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.
  • [Yang et al.2019] Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., and Le, Q. V. (2019). XLNet: Generalized autoregressive pretraining for language understanding.
  • [Zhang et al.2015] Zhang, X., Zhao, J., and LeCun, Y. (2015). Character-level convolutional networks for text classification. In Advances in neural information processing systems, pages 649–657.