Evaluation Benchmarks and Learning Criteria for Discourse-Aware Sentence Representations

08/31/2019 ∙ by Mingda Chen, et al. ∙ Toyota Technological Institute at Chicago

Prior work on pretrained sentence embeddings, and the benchmarks used to evaluate them, has focused on the capabilities of stand-alone sentences. We propose DiscoEval, a test suite of tasks to evaluate whether sentence representations include broader context information. We also propose a variety of training objectives that make use of natural annotations from Wikipedia to build sentence encoders capable of modeling discourse. We benchmark sentence encoders pretrained with our proposed training objectives, as well as other popular pretrained sentence encoders, on DiscoEval and other sentence evaluation tasks. Empirically, we show that these training objectives help to encode different aspects of information in document structures. Moreover, BERT and ELMo demonstrate strong performance on DiscoEval, with individual hidden layers showing different characteristics.


1 Introduction

Pretrained sentence representations have been found useful in various downstream tasks such as visual question answering (Tapaswi et al., 2016), script inference (Pichotta and Mooney, 2016), and information retrieval (Le and Mikolov, 2014; Palangi et al., 2016). Benchmark datasets (Adi et al., 2017; Conneau and Kiela, 2018; Wang et al., 2018a, 2019) have been proposed to evaluate the encoded knowledge, where the focus has been primarily on natural language understanding capabilities of the representation of a stand-alone sentence, such as its semantic roles, rather than the broader context in which it is situated.

[The European Community’s consumer price index rose a provisional 0.6% in September from August]1 [and was up 5.3% from September 1988,]2 [according to Eurostat, the EC’s statistical agency.]3
Figure 1: An RST discourse tree from the RST Discourse Treebank. “N” represents “nucleus”, containing basic information for the relation. “S” represents “satellite”, containing additional information about the nucleus.

In this paper, we seek to incorporate and evaluate discourse knowledge in general purpose sentence representations. A discourse is a coherent, structured group of sentences that acts as a fundamental type of structure in natural language Jurafsky and Martin (2009). A discourse structure is often characterized by the arrangement of semantic elements across multiple sentences, such as entities and pronouns. The simplest such arrangement (i.e., linearly-structured) can be understood as sentence ordering, where the structure is manifested in the timing of introducing entities. Deeper discourse structures use more complex relations among sentences (e.g., tree-structured; see Figure 1).

Theoretically, discourse structures have been approached through Centering Theory Grosz et al. (1995) for studying distributions of entities across text and Rhetorical Structure Theory (RST; Mann and Thompson, 1988) for modelling the logical structure of natural language via discourse trees. Researchers have found modelling discourse useful in a range of tasks Guzmán et al. (2014); Narasimhan and Barzilay (2015); Liu and Lapata (2018); Pan et al. (2018), including summarization Gerani et al. (2014), text classification Ji and Smith (2017), and text generation Bosselut et al. (2018).

In this paper, we propose DiscoEval, a task suite designed to evaluate discourse-related knowledge in pretrained sentence representations. DiscoEval comprises 7 task groups covering multiple domains, including Wikipedia, stories, dialogues, and scientific literature. The tasks are probing tasks (Shi et al., 2016; Adi et al., 2017; Belinkov et al., 2017; Peters et al., 2018; Conneau et al., 2018; Poliak et al., 2018; Tenney et al., 2019; Liu et al., 2019; Ettinger, 2019; Chen et al., 2019, inter alia) based on sentence ordering, annotated discourse relations, and discourse coherence. The data is either generated semi-automatically or based on human annotations (Carlson et al., 2001; Prasad et al., 2008; Lin et al., 2009; Kummerfeld et al., 2019).

We also propose a set of novel multi-task learning objectives building upon standard pretrained sentence encoders, which rely on the assumption of distributional semantics of text. These objectives depend only on the natural structure in structured document collections like Wikipedia.

Empirically, we benchmark our models and several popular sentence encoders on DiscoEval and SentEval (Conneau and Kiela, 2018). We find that our proposed training objectives help the models capture different characteristics in the sentence representations. Additionally, we find that ELMo shows strong performance on SentEval, whereas BERT performs the best among the pretrained embeddings on DiscoEval. Both BERT and Skip-thought vectors (Kiros et al., 2015), which have training losses explicitly related to surrounding sentences, perform much more strongly than their respective prior work, demonstrating the effectiveness of incorporating losses that make use of broader context. Through per-layer analysis, we also find that for both BERT and ELMo, deeper layers consistently outperform shallower ones on DiscoEval, showing different trends from SentEval, where the shallow layers have the best performance.

2 Related Work

Discourse modelling and discourse parsing have a rich history (Marcu, 2000; Barzilay and Lapata, 2008; Zhou et al., 2010; Kalchbrenner and Blunsom, 2013; Ji and Eisenstein, 2015; Li and Jurafsky, 2017; Wang et al., 2018; Liu et al., 2018; Lin et al., 2019, inter alia), much of it based on recovering linguistic annotations of discourse structure.

Several researchers have defined tasks related to discourse structure, including sentence ordering Chen et al. (2016); Logeswaran et al. (2016); Cui et al. (2018), sentence clustering Wang et al. (2018b), and disentangling textual threads (Elsner and Charniak, 2008, 2010; Lowe et al., 2015; Mehri and Carenini, 2017; Jiang et al., 2018; Kummerfeld et al., 2019).

There is a great deal of prior work on pretrained representations (Le and Mikolov, 2014; Kiros et al., 2015; Hill et al., 2016; Wieting et al., 2016; McCann et al., 2017; Gan et al., 2017; Peters et al., 2018; Logeswaran and Lee, 2018; Devlin et al., 2019; Tang and de Sa, 2019; Yang et al., 2019; Liu et al., 2019, inter alia). Skip-thought vectors form an effective architecture for general-purpose sentence embeddings. The model encodes a sentence to a vector representation, and then predicts the previous and next sentences in the discourse context. Since Skip-thought performs well in downstream evaluation tasks, we use this neighboring-sentence objective as a starting point for our models.

There is also work on incorporating discourse related objectives into the training of sentence representations. Jernite et al. (2017) propose binary sentence ordering, conjunction prediction (requiring manually-defined conjunction groups), and next sentence prediction. Similarly, Sileo et al. (2019) and Nie et al. (2019) create training datasets automatically based on discourse relations provided in the Penn Discourse Treebank (PDTB; Lin et al., 2009).

Our work differs from prior work in that we propose a general-purpose pretrained sentence embedding evaluation suite that covers multiple aspects of discourse knowledge and we propose novel training signals based on document structure, including sentence position and section titles, without requiring additional human annotation.

3 Discourse Evaluation

We propose DiscoEval, a test suite of 7 tasks to evaluate whether sentence representations include semantic information relevant to discourse processing. Below we describe the tasks and datasets, as well as the evaluation framework. We closely follow the SentEval sentence embedding evaluation suite, in particular its supervised sentence and sentence pair classification tasks, which use predefined neural architectures with slots for fixed-dimensional sentence embeddings. All DiscoEval tasks are modelled by logistic regression unless otherwise stated in later sections.

We also experimented with adding hidden layers to the DiscoEval classification models. However, we find simpler linear classifiers to provide a clearer comparison among sentence embedding methods. More complex classification models lead to noisier results, as more of the modelling burden is shifted to the optimization of the classifiers. Hence we decide to evaluate the sentence embeddings with simple classification models.
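
To make this evaluation protocol concrete, the sketch below (a minimal illustration, not the released evaluation code; `encode` is a stand-in for any frozen pretrained sentence encoder that maps a list of sentences to a matrix of embeddings) trains only a logistic regression probe on top of fixed embeddings, mirroring the SentEval-style setup described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def evaluate_probing_task(encode, train_sents, train_labels, test_sents, test_labels):
    """Fit a linear probe on frozen sentence embeddings and report test accuracy.

    `encode` returns an array of shape [num_sentences, embedding_dim];
    the encoder itself is never updated, only the classifier is learned.
    """
    X_train = np.asarray(encode(train_sents))
    X_test = np.asarray(encode(test_sents))
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, train_labels)
    return clf.score(X_test, test_labels)
```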

In the rest of this section, we will use [· ; ·] to denote concatenation of vectors, ⊙ for element-wise multiplication, and |·| for element-wise absolute value.

3.1 Discourse Relations

As the most direct way to probe discourse knowledge, we consider the task of predicting annotated discourse relations among sentences. We use two human-annotated datasets: the RST Discourse Treebank (RST-DT; Carlson et al., 2001) and the Penn Discourse Treebank (PDTB; Prasad et al., 2008). They have different labeling schemes. PDTB provides discourse markers for adjacent sentences, whereas RST-DT offers document-level discourse trees, which recently was used to evaluate discourse knowledge encoded in document-level models Ferracane et al. (2019). The difference allows us to see if the pretrained representations capture local or global information about discourse structure.

More specifically, as shown in Figure 1, in RST-DT, text is segmented into basic units, elementary discourse units (EDUs), upon which a discourse tree is built recursively. Although a relation can take multiple units, we follow prior work Ji and Eisenstein (2014) in using right-branching trees for non-binary relations to binarize the tree structure, and we use the 18 coarse-grained relations defined by Carlson et al. (2001).

When evaluating pretrained sentence encoders on RST-DT, we first encode EDUs into vectors, then use averaged vectors of EDUs of subtrees as the representation of the subtrees. The target prediction is the label of nodes in discourse trees and the input to the classifier is [x_l ; x_r], where x_l and x_r are vector representations of the left and right subtrees respectively. For example, the input for target “NN-Attribution” in Figure 1 would be x_l = avg(e_1, e_2) and x_r = e_3, where e_i is the encoded representation for the ith EDU in the text. We use the standard data splits, where there are 347 documents for training and 38 documents for testing. We choose 35 documents from the training set to serve as a validation set.
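
As a rough illustration of this feature construction (a sketch, not the authors' released code; `edu_vectors` is assumed to be a list of fixed-dimensional EDU embeddings, and the simple concatenation mirrors the description above):

```python
import numpy as np

def subtree_vector(edu_vectors, edu_indices):
    """Represent a subtree by averaging the vectors of the EDUs it spans."""
    return np.mean([edu_vectors[i] for i in edu_indices], axis=0)

def rst_node_features(edu_vectors, left_span, right_span):
    """Classifier input for predicting a node label such as "NN-Attribution".

    For the root node in Figure 1, left_span would cover EDUs 1-2 and
    right_span would cover EDU 3 (0-indexed: [0, 1] and [2]).
    """
    x_l = subtree_vector(edu_vectors, left_span)
    x_r = subtree_vector(edu_vectors, right_span)
    return np.concatenate([x_l, x_r])
```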

For PDTB, we use a pair of sentences to predict discourse relations. Following Lin et al. (2009), we focus on two kinds of relations from PDTB: explicit (PDTB-E) and implicit (PDTB-I). The sentence pairs with explicit relations are two consecutive sentences with a particular connective word in between. Figure 2 is an example of an explicit relation.

1. In any case, the brokerage firms are clearly moving faster to create new ads than they did in the fall of 1987. 2. But it remains to be seen whether their ads will be any more effective. label: Comparison.Contrast

Figure 2: Example in the PDTB explicit relation task. The connective (“But”) at the beginning of sentence 2 is taken out from the input.

In the PDTB, annotators insert an implicit connective between adjacent sentences to reflect their relations, if such an implicit relation exists. Figure 3 shows an example of an implicit relation. The PDTB provides a three-level hierarchy of relation tags. In DiscoEval, we use the second level of types (Lin et al., 2009), as they provide finer semantic distinctions compared to the first level. To ensure there is a reasonable amount of evaluation data, we use sections 2-14 as training set, 15-18 as development set, and 19-23 as test set. In addition, we filter out categories that have less than 10 instances. This leaves us 12 categories for explicit relations and 11 for implicit ones. Category names are listed in the supplementary material.

1. “A lot of investor confidence comes from the fact that they can speak to us,” he says. 2. so “To maintain that dialogue is absolutely crucial.” label: Contingency.Cause

Figure 3: Example in the PDTB implicit relation task. The annotator-inserted connective (“so”) shown before sentence 2 is not part of the input.

We use the sentence embeddings to infer sentence relations with supervised training. As input to the classifier, we encode both sentences to vector representations x_1 and x_2, concatenated with their element-wise product and absolute difference: [x_1 ; x_2 ; x_1 ⊙ x_2 ; |x_1 − x_2|].
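
A minimal sketch of this input construction (the helper name is hypothetical; `encode` maps a sentence to a numpy vector):

```python
import numpy as np

def pdtb_pair_features(encode, sent1, sent2):
    """Build [x1; x2; x1 * x2; |x1 - x2|] for discourse relation classification."""
    x1, x2 = encode(sent1), encode(sent2)
    return np.concatenate([x1, x2, x1 * x2, np.abs(x1 - x2)])
```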

3.2 Sentence Position (SP)

We create a task that we call Sentence Position. It can be seen as a way to probe the knowledge of linearly-structured discourse, where the ordering corresponds to the timing of events. When constructing this dataset, we take five consecutive sentences from a corpus, randomly move one of these five sentences to the first position, and ask models to predict the true position of the first sentence in the modified sequence.
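
The construction can be sketched as follows (an illustrative snippet under the description above; domain-specific sentence selection and filtering are omitted):

```python
import random

def make_sentence_position_example(sents, rng=random):
    """Given five consecutive sentences from a coherent text, move a randomly
    chosen sentence to the front. The label is that sentence's true position
    (0-4) in the original sequence, which the model must recover."""
    assert len(sents) == 5
    label = rng.randrange(5)
    reordered = [sents[label]] + sents[:label] + sents[label + 1:]
    return reordered, label
```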

We create three versions of this task, one for each of the following three domains: the first five sentences of the introduction section of a Wikipedia article (Wiki), the ROC Stories corpus (ROC; Mostafazadeh et al., 2016), and the first 5 sentences in the abstracts of arXiv papers (arXiv; Chen et al., 2016). Figure 4 shows an example of this task for the ROC Stories domain. The first sentence should be in the fourth position among these sentences. To make correct predictions, the model needs to be aware of both typical orderings of events as well as how events are described in language. In the example shown, Bonnie’s excitement comes from her imagination so it must happen after she picked up the jeans and tried them on but right before she realized the actual size.

- She was excited thinking she must have lost weight. - Bonnie hated trying on clothes. - Then she realized they actually size 14s, and 12s. - She picked up a pair of size 12 jeans from the display. - When she tried them on they were too big!

Figure 4: Example from the ROC Stories domain of the Sentence Position task. The first sentence should be in the fourth position.

To train classifiers for these tasks, we do the following. We first encode the five sentences to vector representations x_1, …, x_5. As input to the classifier, we include x_1 and the concatenation of x_1 − x_i for all i ∈ {2, …, 5}: [x_1 ; x_1 − x_2 ; x_1 − x_3 ; x_1 − x_4 ; x_1 − x_5].
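
Concretely, a sketch consistent with the feature description above (helper names are hypothetical):

```python
import numpy as np

def sentence_position_features(encode, five_sents):
    """[x1; x1-x2; x1-x3; x1-x4; x1-x5]: the first (moved) sentence plus its
    differences with each of the remaining sentences."""
    xs = [encode(s) for s in five_sents]
    return np.concatenate([xs[0]] + [xs[0] - x for x in xs[1:]])
```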

3.3 Binary Sentence Ordering (BSO)

Similar to sentence position prediction, Binary Sentence Ordering (BSO) is a binary classification task to determine the order of two sentences. Unlike Sentence Position, BSO has only a pair of sentences as input and therefore no broader context, and we hope that BSO can evaluate the ability of the given sentence representations to capture local discourse coherence. The data comes from the same three domains as Sentence Position, and each instance is a pair of consecutive sentences.

Figure 5 shows an example from the arXiv domain of the Binary Sentence Ordering task. The order of the sentences in this instance is incorrect, as the “functions” are referenced before they are introduced. To detect the incorrect ordering in this example, the encoded representations need to be able to provide information about new and old information in each sentence.

1. These functions include fast and synchronized response to environmental change, or long-term memory about the transcriptional status. 2. Focusing on the collective behaviors on a population level, we explore potential regulatory functions this model can offer.

Figure 5: Example from the arXiv domain of the Binary Sentence Ordering task (incorrect ordering shown).

To form the input when training classifiers, we concatenate the embeddings of both sentences with their element-wise difference: [x_1 ; x_2 ; x_1 − x_2].

3.4 Discourse Coherence (DC)

Inspired by prior work on chat disentanglement Elsner and Charniak (2008, 2010) and sentence clustering (Wang et al., 2018b), we propose a sentence disentanglement task. The task is to determine whether a sequence of six sentences forms a coherent paragraph. We start with a coherent sequence of six sentences, then randomly replace one of the sentences (chosen uniformly among positions 2-5) with a sentence from another discourse. This task, which we call Discourse Coherence (DC), is a binary classification task and the datasets are balanced between positive and negative instances.

We use data from two domains for this task: Wikipedia and the Ubuntu IRC channel (irclogs.ubuntu.com). For Wikipedia, we begin by choosing a sequence of six sentences from a Wikipedia article. For purposes of choosing difficult distractor sentences, we use the Wikipedia categories of each document as an indication of its topic. To create a negative instance, we randomly sample a sentence from another document with a similar set of categories (measured by the percentage of overlapping categories). This sampled sentence replaces one of the six consecutive sentences in the original sequence. When splitting the train, development, and test sets, we ensure there are no overlapping documents among them.
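
A rough sketch of the Wikipedia negative-example construction (the data layout, the 0.5 replacement probability, and the overlap threshold used to pick a topically similar document are illustrative assumptions on our part):

```python
import random

def category_overlap(cats_a, cats_b):
    """Percentage of document A's categories that also appear in document B."""
    cats_a, cats_b = set(cats_a), set(cats_b)
    return len(cats_a & cats_b) / max(len(cats_a), 1)

def make_dc_example(six_sents, own_categories, other_docs, rng=random):
    """Return (sentence sequence, label): 1 if coherent, 0 if one of the
    sentences in positions 2-5 was replaced by a distractor sentence drawn
    from a topically similar document."""
    if rng.random() < 0.5:
        return list(six_sents), 1
    # prefer distractor documents whose categories overlap with this article
    similar = [d for d in other_docs
               if category_overlap(own_categories, d["categories"]) > 0.3]
    doc = rng.choice(similar or other_docs)
    perturbed = list(six_sents)
    perturbed[rng.randrange(1, 5)] = rng.choice(doc["sentences"])
    return perturbed, 0
```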

Our proposed dataset differs from the sentence clustering task of Wang et al. (2018b) in that it preserves sentence order and does not anonymize or lemmatize words, because they play an important role in conveying information about discourse coherence.

For the Ubuntu domain, we use the human annotations of conversation thread structure from Kummerfeld et al. (2019) to provide us with a coherent sequence of utterances. We filter out sentences by heuristic rules to avoid overly technical and unsolvable cases. The negative sentence is randomly picked from other conversations. Similarly, when splitting the train, development, and test sets, we ensure there are no overlapping conversations among them.

Figure 6 is an instance of the Wikipedia domain of the Discourse Coherence task. This instance is not coherent: sentence 5 (boldfaced in the original figure) is from a different document. The incoherence can be found either by comparing characteristics of the entity being discussed or by the topic of the sentence group. Solving this task is non-trivial as it may require the ability to perform inference across multiple sentences.

1. It is possible he was the youngest of the family as the name “Sextus” translates to sixth in English implying he was the sixth of two living and three stillborn brothers. 2. According to Roman tradition, his rape of Lucretia was the precipitating event in the overthrow of the monarchy and the establishment of the Roman Republic. 3. Tarquinius Superbus was besieging Ardea, a city of the Rutulians. 4. The place could not be taken by force, and the Roman army lay encamped beneath the walls. 5. He was soon elected to the Academy’s membership (although he had to wait until 1903 to be elected to the Society of American Artists), and in 1883 he opened a New York studio, dividing his time for several years between Manhattan and Boston. 6. As nothing was happening in the field, they mounted their horses to pay a surprise visit to their homes.

Figure 6: An example from the Wikipedia domain of the Discourse Coherence task. This sequence is not coherent; sentence 5 was substituted in from another article in place of the true fifth sentence.

In this task, we encode all six sentences to vector representations x_1, …, x_6 and concatenate all of them ([x_1 ; x_2 ; … ; x_6]) as input to the classification model. Note that in this task, we use a hidden layer of 2000 dimensions with sigmoid activation in the classification model, as this is necessary for the classifier to use features based on multiple inputs simultaneously given the simple concatenation as input. We could have developed richer ways to encode the input so that a linear classifier would be feasible (e.g., use the element-wise products of all pairs of sentence embeddings), but we wish to keep the input dimensionality of the classifier small enough that the classifier will be learnable given fixed sentence embeddings and limited training data.
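
A minimal PyTorch sketch of such a classifier (the 2000-dimensional sigmoid hidden layer follows the description above; everything else is an assumption for illustration):

```python
import torch.nn as nn

class DCClassifier(nn.Module):
    """Binary classifier over the concatenation of six sentence embeddings."""
    def __init__(self, sent_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6 * sent_dim, 2000),  # hidden layer of 2000 dimensions
            nn.Sigmoid(),
            nn.Linear(2000, 2))             # coherent vs. not coherent

    def forward(self, concatenated):   # [batch, 6 * sent_dim]
        return self.net(concatenated)  # logits over the two classes
```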

3.5 Sentence Section Prediction (SSP)

The Sentence Section Prediction (SSP) task is defined as determining the section of a given sentence. The motivation behind this task is that sentences within certain sections typically exhibit similar patterns because of the way people write coherent text. The pattern can be found based on connectives or specificity of a sentence. For example, “Empirically” is usually used in the abstract or introduction sections in scientific writing.

We construct the dataset from PeerRead Kang et al. (2018), which consists of scientific papers from a variety of fields. The goal is to predict whether or not a sentence belongs to the Abstract section. After eliminating sentences that are too easy for the task (e.g., equations), we randomly sample sentences from the Abstract or from a section in the middle of a paper (we avoid sentences from the Introduction or Conclusion sections to make the task more solvable). Figure 7 shows two sentences from this task, where the first sentence is more general and from an Abstract whereas the second is more specific and is from another section. In this task, the input to the classifier is simply the sentence embedding.

1. The theory behind the SVM and the naive Bayes classifier is explored.

2. This relocation of the active target may be repeated an arbitrary number of times.

Figure 7: Examples from Sentence Section Prediction. The first is from an Abstract while the second is not.

Table 1 shows the number of instances in each DiscoEval task introduced above.

Task PDTB-E PDTB-I Ubuntu RST-DT Others
Train 9383 8693 5816 17051 10000
Dev. 3613 2972 1834 2045 4000
Test 3758 3024 2418 2308 4000
Table 1: Size of datasets in DiscoEval.

4 Models and Learning Criteria

Having described DiscoEval, we now discuss methods for incorporating discourse information into sentence embedding training. All models in our experiments are composed of a single encoder and multiple decoders. The encoder, parameterized by a bidirectional Gated Recurrent Unit (BiGRU; Chung et al., 2014), encodes the sentence, either in training or in evaluation of the downstream tasks, into a fixed-length vector representation (i.e., the average of the hidden states across positions).
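
A sketch of such an encoder in PyTorch (dimensions follow Appendix A; tokenization, padding, and masking are omitted for brevity):

```python
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    """BiGRU sentence encoder; the sentence vector is the average of the
    hidden states across positions (1200-d per direction, 2400-d output)."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=1200):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.bigru = nn.GRU(emb_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, token_ids):          # [batch, seq_len]
        states, _ = self.bigru(self.embedding(token_ids))
        return states.mean(dim=1)          # [batch, 2 * hidden_dim]
```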

The decoders take the aforementioned encoded sentence representation, and predict the targets we define in the sections below. We first introduce Neighboring Sentence Prediction, the loss for our baseline model. We then propose additional training losses to encourage our sentence embeddings to capture other context information.

4.1 Neighboring Sentence Prediction (NSP)

Similar to prior work on sentence embeddings Kiros et al. (2015); Hill et al. (2016), we use an encoded sentence representation to predict its surrounding sentences. In particular, we predict the immediately preceding and succeeding sentences. All of our sentence embedding models use this loss. Formally, for a sentence s_t, the loss is defined as

loss_NSP = − log p_prev(s_{t−1} | s_t) − log p_next(s_{t+1} | s_t)

where we parameterize p_prev and p_next as separate feedforward neural networks and compute the log-probability of a target sentence using its bag-of-words representation.
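
A sketch of this loss with bag-of-words decoders (the hidden size and exact decoder parameterization are assumptions; Appendix A only states that the decoders are two-hidden-layer feedforward networks with ReLU activations):

```python
import torch.nn as nn
import torch.nn.functional as F

class BagOfWordsDecoder(nn.Module):
    """Feedforward network scoring every vocabulary word given a sentence vector."""
    def __init__(self, sent_dim, vocab_size, hidden_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, vocab_size))

    def neg_log_prob(self, sent_vec, target_bow):
        """-log p(target sentence | sent_vec) under a bag-of-words model.

        target_bow: [batch, vocab_size] word counts of the target sentence."""
        log_probs = F.log_softmax(self.net(sent_vec), dim=-1)
        return -(target_bow * log_probs).sum(dim=-1).mean()

def nsp_loss(prev_decoder, next_decoder, sent_vec, prev_bow, next_bow):
    """Predict the preceding and succeeding sentences with separate decoders."""
    return (prev_decoder.neg_log_prob(sent_vec, prev_bow)
            + next_decoder.neg_log_prob(sent_vec, next_bow))
```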

4.2 Nesting Level (NL)

A table of contents serves as a high level description of an article, outlining its organizational structure. Wikipedia articles, for example, contain rich tables of contents with many levels of hierarchical structure. The “nesting level” of a sentence (i.e., how many levels deep it resides) provides information about its role in the overall discourse. To encode this information into our sentence representations, we introduce a discriminative loss to predict a sentence’s nesting level in the table of contents:

loss_NL = − log p_nl(l_t | s_t)

where l_t represents the nesting level of the sentence s_t and p_nl is parameterized by a feedforward neural network. Note that sentences within the same paragraph share the same nesting level. In Wikipedia, there are up to 7 nesting levels.

4.3 Sentence and Paragraph Position (SPP)

Similar to nesting level, we add a loss based on using the sentence representation to predict its position in the paragraph and in the article. The position of the sentence can be a strong indication of the relations between the topics of the current sentence and the topics in the entire article. For example, the first several sentences often cover the general topics to be discussed more thoroughly in the following sentences. To encourage our sentence embeddings to capture such information, we define a position prediction loss

loss_SPP = − log p_sp(i_t | s_t) − log p_pp(j_t | s_t)

where i_t is the position of sentence s_t within the current paragraph and j_t is the position of the current paragraph in the whole document.

4.4 Section and Document Title (SDT)

Unlike the previous position-based losses, this loss makes use of section and document titles, which gives the model more direct access to the topical information at different positions in the document. The loss is defined as

loss_SDT = − log p_st(u_t | s_t) − log p_dt(v_t | s_t)

where u_t is the section title of sentence s_t, v_t is the document title of sentence s_t, and p_st and p_dt are two different bag-of-words decoders.
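
The three document-structure losses above (NL, SPP, and SDT) can all be viewed as extra prediction heads on the shared encoder, as the sketch below illustrates. The 7 nesting levels come from the text; the maximum position cutoffs and title vocabulary size are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

def ffn(in_dim, out_dim, hidden_dim=2048):
    """Two-hidden-layer feedforward network with ReLU, as in Appendix A."""
    return nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                         nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                         nn.Linear(hidden_dim, out_dim))

class StructureHeads(nn.Module):
    """NL, SPP, and SDT heads sharing one sentence encoder's output."""
    def __init__(self, sent_dim, n_levels=7, max_sent_pos=64,
                 max_par_pos=128, title_vocab=30000):
        super().__init__()
        self.nesting = ffn(sent_dim, n_levels)       # NL: nesting level
        self.sent_pos = ffn(sent_dim, max_sent_pos)  # SPP: position in paragraph
        self.par_pos = ffn(sent_dim, max_par_pos)    # SPP: paragraph position
        self.sec_title = ffn(sent_dim, title_vocab)  # SDT: section title words
        self.doc_title = ffn(sent_dim, title_vocab)  # SDT: document title words

    def nl_loss(self, sent_vec, level):
        return F.cross_entropy(self.nesting(sent_vec), level)

    def spp_loss(self, sent_vec, sent_pos, par_pos):
        return (F.cross_entropy(self.sent_pos(sent_vec), sent_pos)
                + F.cross_entropy(self.par_pos(sent_vec), par_pos))

    def sdt_loss(self, sent_vec, sec_title_bow, doc_title_bow):
        sec_lp = F.log_softmax(self.sec_title(sent_vec), dim=-1)
        doc_lp = F.log_softmax(self.doc_title(sent_vec), dim=-1)
        return -((sec_title_bow * sec_lp).sum(dim=-1).mean()
                 + (doc_title_bow * doc_lp).sum(dim=-1).mean())
```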

5 Experiments

5.1 Setup

We train our models on Wikipedia as it is a knowledge-rich textual resource and has consistent structure across documents. Details on hyperparameters are in the supplementary material. When evaluating on DiscoEval, we encode sentences with the pretrained sentence encoders. Following SentEval, we freeze the sentence encoders and only learn the parameters of the downstream classifier. The “Baseline” row in Table 2 corresponds to embeddings trained with only the NSP loss. The subsequent rows are trained with the extra losses defined in Section 4 in addition to the NSP loss.

SentEval DiscoEval
USS SSS SC Probing SP BSO DC SSP PDTB-E PDTB-I RST-DT avg.
Skip-thought 41.7 81.2 78.4 70.1 44.6 63.0 54.9 77.6 39.6 38.7 59.7 53.8
InferSent 63.4 83.3 79.7 71.8 43.6 62.9 56.3 77.2 37.1 38.0 53.2 49.8
DisSent 50.0 79.2 80.5 74.0 44.9 64.9 54.8 77.6 41.6 39.9 57.8 54.6
ELMo 60.9 77.6 80.8 74.7 46.4 66.6 59.4 78.4 41.5 41.5 58.8 56.1
BERT-Base 30.1 66.3 81.4 73.9 49.1 68.1 58.9 80.6 44.0 42.5 58.7 57.4
BERT-Large 43.6 70.7 83.4 75.0 49.9 69.3 60.5 80.4 44.1 43.8 58.8 58.1
Baseline (NSP) 57.8 77.1 77.0 70.6 44.1 63.8 61.2 78.2 36.9 38.0 57.0 54.5
+ SDT 59.0 77.3 76.8 69.7 43.9 62.9 60.0 78.0 37.0 37.7 56.2 53.7
+ SPP 56.0 77.5 77.4 70.7 45.6 64.3 60.8 78.6 37.1 37.7 57.1 54.6
+ NL 56.7 78.2 77.2 70.6 44.7 64.0 61.2 78.3 37.2 37.8 56.4 54.4
+ SPP + NL 55.4 76.7 77.0 70.4 45.7 64.9 60.9 79.2 37.9 39.3 56.7 55.1
+ SDT + NL 58.5 76.9 77.2 70.2 44.4 63.2 61.0 78.4 36.6 37.9 57.4 54.2
+ SDT + SPP 58.4 77.4 76.6 70.2 44.4 63.9 60.5 78.0 37.3 36.9 56.2 54.1
ALL 58.8 76.3 77.0 70.2 43.9 63.3 60.3 78.7 36.5 38.2 55.8 54.1
Table 2: Results for SentEval and DiscoEval. The highest number in each column is boldfaced. The highest number for our models in each column is underlined. “All” uses all four losses. “avg.” is the averaged accuracy for all tasks in DiscoEval.

Additionally, we benchmark several popular pretrained sentence encoders on DiscoEval, including Skip-thought (github.com/ryankiros/skip-thoughts), InferSent Conneau et al. (2017) (github.com/facebookresearch/InferSent), DisSent Nie et al. (2019) (github.com/windweller/DisExtract), ELMo (github.com/allenai/allennlp), and BERT (github.com/huggingface/pytorch-pretrained-BERT). For ELMo, we use the averaged vector of all three layers and time steps as the sentence representation. For BERT, we use the averaged vector at the position of the “[CLS]” token across all layers. We also evaluate per-layer performance for both models in Section 6.
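
As an illustration of the BERT variant of this pooling, the sketch below uses the modern Hugging Face transformers library rather than the pytorch-pretrained-BERT repository cited above; that substitution, and whether the embedding layer is included in the layer average, are our assumptions.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def bert_sentence_embedding(sentence):
    """Average the "[CLS]" vector across all hidden layers."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.hidden_states: one tensor per layer, each [1, seq_len, hidden];
    # index 0 of the sequence dimension is the "[CLS]" token.
    cls_per_layer = torch.stack([h[0, 0] for h in outputs.hidden_states])
    return cls_per_layer.mean(dim=0)
```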

When reporting results for SentEval, we compute the averaged Pearson correlations for Semantic Textual Similarity tasks from 2012 to 2016 Agirre et al. (2012, 2013, 2014, 2015, 2016). We refer to the average as unsupervised semantic similarity (USS) since those tasks do not require training data. We compute the averaged results for the STS Benchmark (Cer et al., 2017), textual entailment, and semantic relatedness (Marelli et al., 2014) and refer to the average as supervised semantic similarity (SSS). We compute the average accuracy for movie review (Pang and Lee, 2005); customer review (Hu and Liu, 2004); opinion polarity (Wiebe et al., 2005); subjectivity classification (Pang and Lee, 2004); Stanford sentiment treebank (Socher et al., 2013); question classification (Li and Roth, 2002); and paraphrase detection (Dolan et al., 2004), and refer to it as sentence classification (SC). For the rest of the linguistic probing tasks (Conneau et al., 2018), we report the average accuracy and report it as “Probing”.

5.2 Results

Table 2 shows the experiment results over all SentEval and DiscoEval tasks. Different models and training signals have complex effects when performing various downstream tasks. We summarize our findings below:

  • On DiscoEval, Skip-thought performs best on RST-DT. DisSent performs strongly for PDTB tasks but it requires discourse markers from PDTB for generating training data. BERT has the highest average by a large margin, but ELMo has competitive performance on multiple tasks.

  • The NL or SPP loss alone has complex effects across tasks in DiscoEval, but when they are combined, the model achieves the best performance, outperforming our baseline by 0.6% on average. In particular, it yields 39.3% accuracy on PDTB-I, outperforming Skip-thought by 0.6%. This is presumably caused by the differing, yet complementary, effects of these two losses (NL and SPP).

  • The SDT loss generally hurts performance on DiscoEval, especially on the position-related tasks (SP, BSO). This can be explained by the notion that consecutive sentences in the same section are encouraged to have the same sentence representations when using the SDT loss. However, the SP and BSO tasks involve differentiating neighboring sentences in terms of their position and ordering information.

  • On SentEval, SDT is most helpful for the USS tasks, presumably because it provides the most direct information about the topic of each sentence, which is a component of semantic similarity. SDT helps slightly on the SSS tasks. NL gives the biggest improvement in SSS.

  • In comparing BERT to ELMo and Skip-thought to InferSent on DiscoEval, we can see the benefit of adding information about neighboring sentences. Our proposed training objectives show complementary improvements over NSP, which suggests that they can potentially benefit these pretrained representations.

6 Analysis

Per-Layer analysis.

Figure 8: Heatmap for individual hidden layers of BERT-Base (lower part) and ELMo (upper part).
ELMo BERT-Base
SentEval 0.8 5.0
DiscoEval 1.3 8.9
Table 3: Average of the layer number for the best layers in SentEval and DiscoEval.

To investigate the performance of individual hidden layers, we evaluate ELMo and BERT on both SentEval and DiscoEval using each hidden layer. For ELMo, we use the averaged vector from the targeted layer. For BERT-Base, we use the vector from the position of the “[CLS]” token. Figure 8 shows the heatmap of performance for individual hidden layers. We note that for better visualization, colors in each column are standardized. On SentEval, BERT-Base performs better with shallow layers on USS, SSS, and Probing (though not on SC), but on DiscoEval, the results using BERT-Base gradually increase with deeper layers. To evaluate this phenomenon quantitatively, we compute, for each model, the average layer number of the best-performing layer across task categories and show it in Table 3. From the table, we can see that DiscoEval requires deeper layers to achieve better performance. We assume this is because deeper layers can capture higher-level structure, which aligns with the information needed to solve the discourse tasks.

DiscoEval architectures.

In all DiscoEval tasks except DC, we use no hidden layer in the neural architectures, following the example of SentEval. However, some tasks are unsolvable with this simple architecture. In particular, the DC tasks have low accuracies with all models unless a hidden layer is used. As shown in Table 4, adding a hidden layer of 2000 dimensions to this task improves performance on DC dramatically. This shows that DC requires more complex comparison and inference among the input sentences. Our human evaluation below on DC also shows that human accuracies exceed those of the classifier based on sentence embeddings by a large margin.

Baseline w/o hidden layer 52.0
Baseline w/ hidden layer 61.2
Table 4: Accuracies with baseline encoder on Discourse Coherence task, with or without a hidden layer in the classifier.
Sentence Position Binary Sentence Ordering Discourse Coherence
Human 77.3 84.7 87.0
BERT-Large 49.9 69.3 60.5
Wiki arXiv ROC Wiki arXiv ROC Wiki Ubuntu
Human 84.0 76.0 94.0 64.0 72.0 96.0 98.0 74.0
BERT-Large 43.0 56.0 50.9 70.3 66.8 70.9 64.9 56.1
Table 5: Accuracies (%) for a human annotator and BERT-Large on Sentence Position, Binary Sentence Ordering, and Discourse Coherence tasks.
Random 20
Baseline w/o context 43.2
Baseline w/ context 46.0
Table 6: Accuracies (%) for baseline encoder on Sentence Position task when using downstream classifier with or without context.

Human Evaluation.

We conduct a human evaluation on the Sentence Position, Binary Sentence Ordering, and Discourse Coherence datasets. A native English speaker was provided with 50 examples per domain for these tasks. While the results in Table 5 show that the overall human accuracies exceed those of the classifier based on BERT-Large by a large margin, we observe that within some specific domains, for example Wiki in BSO, BERT-Large demonstrates very strong performance.

Does context matter in Sentence Position?

In the SP task, the inputs are the target sentence together with 4 surrounding sentences. We study the effect of removing the surrounding 4 sentences, i.e., only using the target sentence to predict its position from the start of the paragraph.

Table 6 shows the comparison of the baseline model performance on Sentence Position with or without the surrounding sentences and a random baseline. Since our baseline model is already trained with NSP, it is expected to see improvements over a random baseline. The further improvement from using surrounding sentences demonstrates that the context information is helpful in determining the sentence position.

7 Conclusion

We proposed DiscoEval, a test suite of tasks to evaluate discourse-related knowledge encoded in pretrained sentence representations. We also proposed a variety of training objectives to strengthen encoders’ ability to incorporate discourse information. We benchmarked several pretrained sentence encoders and demonstrated the effects of the proposed training objectives on different tasks. While our learning criteria showed benefit on certain classes of tasks, our hope is that the DiscoEval evaluation suite can inspire additional research in capturing broad discourse context in fixed-dimensional sentence embeddings.

Acknowledgments

We thank Jonathan Kummerfeld for helpful discussions about the IRC Disentanglement dataset, Davis Yoshida for discussions about BERT, and the anonymous reviewers for their feedback that improved this paper. This research was supported in part by a Bloomberg data science research grant to K. Gimpel.

References

  • Y. Adi, E. Kermany, Y. Belinkov, O. Lavi, and Y. Goldberg (2017) Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In Proceedings of ICLR, Cited by: §1, §1.
  • E. Agirre, C. Banea, C. Cardie, D. Cer, M. Diab, A. Gonzalez-Agirre, W. Guo, I. Lopez-Gazpio, M. Maritxalar, R. Mihalcea, G. Rigau, L. Uria, and J. Wiebe (2015) SemEval-2015 task 2: semantic textual similarity, english, spanish and pilot on interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp. 252–263. External Links: Document, Link Cited by: §5.1.
  • E. Agirre, C. Banea, C. Cardie, D. Cer, M. Diab, A. Gonzalez-Agirre, W. Guo, R. Mihalcea, G. Rigau, and J. Wiebe (2014) SemEval-2014 task 10: multilingual semantic textual similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pp. 81–91. External Links: Document, Link Cited by: §5.1.
  • E. Agirre, C. Banea, D. Cer, M. Diab, A. Gonzalez-Agirre, R. Mihalcea, G. Rigau, and J. Wiebe (2016) SemEval-2016 task 1: semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pp. 497–511. External Links: Document, Link Cited by: §5.1.
  • E. Agirre, D. Cer, M. Diab, A. Gonzalez-Agirre, and W. Guo (2013) *SEM 2013 shared task: semantic textual similarity. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pp. 32–43. External Links: Link Cited by: §5.1.
  • E. Agirre, D. Cer, M. Diab, and A. Gonzalez-Agirre (2012) SemEval-2012 task 6: a pilot on semantic textual similarity. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pp. 385–393. External Links: Link Cited by: §5.1.
  • R. Barzilay and M. Lapata (2008) Modeling local coherence: an entity-based approach. Computational Linguistics 34 (1), pp. 1–34. Cited by: §2.
  • Y. Belinkov, L. Màrquez, H. Sajjad, N. Durrani, F. Dalvi, and J. Glass (2017) Evaluating layers of representation in neural machine translation on part-of-speech and semantic tagging tasks. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Taipei, Taiwan, pp. 1–10. External Links: Link Cited by: §1.
  • A. Bosselut, A. Celikyilmaz, X. He, J. Gao, P. Huang, and Y. Choi (2018) Discourse-aware neural rewards for coherent text generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 173–184. External Links: Link, Document Cited by: §1.
  • L. Carlson, D. Marcu, and M. E. Okurovsky (2001) Building a discourse-tagged corpus in the framework of rhetorical structure theory. In Proceedings of the Second SIGdial Workshop on Discourse and Dialogue, External Links: Link Cited by: §1, §3.1, §3.1.
  • D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia (2017) SemEval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 1–14. External Links: Document, Link Cited by: §5.1.
  • M. Chen, Z. Chu, K. Stratos, and K. Gimpel (2019) EntEval: a holistic evaluation benchmark for entity representations. In Proc. of EMNLP, Cited by: §1.
  • X. Chen, X. Qiu, and X. Huang (2016) Neural sentence ordering. arXiv preprint arXiv:1607.06952. Cited by: §2, §3.2.
  • J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §4.
  • A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes (2017) Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 670–680. External Links: Document, Link Cited by: §5.1.
  • A. Conneau and D. Kiela (2018) SentEval: an evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), Miyazaki, Japan. External Links: Link Cited by: §1, §1.
  • A. Conneau, G. Kruszewski, G. Lample, L. Barrault, and M. Baroni (2018) What you can cram into a single $&!#* vector: probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2126–2136. External Links: Link Cited by: §1, §5.1.
  • B. Cui, Y. Li, M. Chen, and Z. Zhang (2018) Deep attentive sentence ordering network. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4340–4349. External Links: Link Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: Evaluation Benchmarks and Learning Criteria for Discourse-Aware Sentence Representations, §2.
  • B. Dolan, C. Quirk, and C. Brockett (2004) Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics (COLING), Geneva, Switzerland, pp. 350–356. Cited by: §5.1.
  • M. Elsner and E. Charniak (2008) You talking to me? a corpus and algorithm for conversation disentanglement. In Proceedings of ACL-08: HLT, Columbus, Ohio, pp. 834–842. External Links: Link Cited by: §2, §3.4.
  • M. Elsner and E. Charniak (2010) Disentangling chat. Computational Linguistics 36 (3), pp. 389–409. External Links: Link, Document Cited by: §2, §3.4.
  • A. Ettinger (2019) What bert is not: lessons from a new suite of psycholinguistic diagnostics for language models. arXiv preprint arXiv:1907.13528. Cited by: §1.
  • E. Ferracane, G. Durrett, J. J. Li, and K. Erk (2019) Evaluating discourse in structured text representations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 646–653. External Links: Link Cited by: §3.1.
  • Z. Gan, Y. Pu, R. Henao, C. Li, X. He, and L. Carin (2017) Learning generic sentence representations using convolutional neural networks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2390–2400. External Links: Link, Document Cited by: §2.
  • S. Gerani, Y. Mehdad, G. Carenini, R. T. Ng, and B. Nejat (2014) Abstractive summarization of product reviews using discourse structure. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1602–1613. External Links: Link, Document Cited by: §1.
  • B. J. Grosz, S. Weinstein, and A. K. Joshi (1995) Centering: a framework for modeling the local coherence of discourse. Computational Linguistics 21 (2). External Links: Link Cited by: §1.
  • F. Guzmán, S. Joty, L. Màrquez, and P. Nakov (2014) Using discourse structure improves machine translation evaluation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, Maryland, pp. 687–698. External Links: Link, Document Cited by: §1.
  • F. Hill, K. Cho, and A. Korhonen (2016) Learning distributed representations of sentences from unlabelled data. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1367–1377. External Links: Document, Link Cited by: §2, §4.1.
  • M. Hu and B. Liu (2004) Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 168–177. Cited by: §5.1.
  • Y. Jernite, S. R. Bowman, and D. Sontag (2017) Discourse-based objectives for fast unsupervised sentence representation learning. arXiv preprint arXiv:1705.00557. Cited by: §2.
  • Y. Ji and J. Eisenstein (2014) Representation learning for text-level discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, Maryland, pp. 13–24. External Links: Link, Document Cited by: §3.1.
  • Y. Ji and J. Eisenstein (2015) One vector is not enough: entity-augmented distributed semantics for discourse relations. Transactions of the Association for Computational Linguistics 3, pp. 329–344. External Links: Link Cited by: §2.
  • Y. Ji and N. A. Smith (2017) Neural discourse structure for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 996–1005. External Links: Link, Document Cited by: §1.
  • J. Jiang, F. Chen, Y. Chen, and W. Wang (2018) Learning to disentangle interleaved conversational threads with a siamese hierarchical network and similarity ranking. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1812–1822. External Links: Document, Link Cited by: §2.
  • D. Jurafsky and J. H. Martin (2009) Speech and language processing (2nd edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA. External Links: ISBN 0131873210 Cited by: §1.
  • N. Kalchbrenner and P. Blunsom (2013) Recurrent convolutional neural networks for discourse compositionality. In Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality, pp. 119–126. External Links: Link Cited by: §2.
  • D. Kang, W. Ammar, B. Dalvi, M. van Zuylen, S. Kohlmeier, E. Hovy, and R. Schwartz (2018) A dataset of peer reviews (peerread): collection, insights and nlp applications. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1647–1661. External Links: Document, Link Cited by: §3.5.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix A.
  • R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, and S. Fidler (2015) Skip-thought vectors. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, Cambridge, MA, USA, pp. 3294–3302. External Links: Link Cited by: §1, §2, §4.1.
  • J. K. Kummerfeld, S. R. Gouravajhala, J. J. Peper, V. Athreya, C. Gunasekara, J. Ganhotra, S. S. Patel, L. C. Polymenakos, and W. Lasecki (2019) A large-scale corpus for conversation disentanglement. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3846–3856. External Links: Link Cited by: §1, §2, §3.4.
  • Q. Le and T. Mikolov (2014) Distributed representations of sentences and documents. In International conference on machine learning, pp. 1188–1196. Cited by: §1, §2.
  • J. Li and D. Jurafsky (2017) Neural net models of open-domain discourse coherence. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 198–209. External Links: Document, Link Cited by: §2.
  • X. Li and D. Roth (2002) Learning question classifiers. In Proceedings of the 19th international conference on Computational linguistics-Volume 1, pp. 1–7. Cited by: §5.1.
  • X. Lin, S. Joty, P. Jwalapuram, and M. S. Bari (2019) A unified linear-time framework for sentence-level discourse parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4190–4200. External Links: Link Cited by: §2.
  • Z. Lin, M. Kan, and H. T. Ng (2009) Recognizing implicit discourse relations in the penn discourse treebank. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp. 343–351. External Links: Link Cited by: §1, §2, §3.1, §3.1.
  • J. Liu, S. B. Cohen, and M. Lapata (2018) Discourse representation structure parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 429–439. External Links: Link, Document Cited by: §2.
  • N. F. Liu, M. Gardner, Y. Belinkov, M. E. Peters, and N. A. Smith (2019) Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 1073–1094. External Links: Link, Document Cited by: §1.
  • Y. Liu and M. Lapata (2018) Learning structured text representations. Transactions of the Association for Computational Linguistics 6, pp. 63–75. External Links: Link, Document Cited by: §1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §2.
  • L. Logeswaran, H. Lee, and D. Radev (2016) Sentence ordering and coherence modeling using recurrent neural networks. External Links: 1611.02654 Cited by: §2.
  • L. Logeswaran and H. Lee (2018) An efficient framework for learning sentence representations. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • R. Lowe, N. Pow, I. Serban, and J. Pineau (2015) The Ubuntu dialogue corpus: a large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Prague, Czech Republic, pp. 285–294. External Links: Link, Document Cited by: §2.
  • W. C. Mann and S. A. Thompson (1988) Rhetorical structure theory: toward a functional theory of text organization. Text 8 (3), pp. 243–281. External Links: Link Cited by: §1.
  • D. Marcu (2000) The theory and practice of discourse parsing and summarization. MIT Press, Cambridge, MA, USA. External Links: ISBN 0262133725 Cited by: §2.
  • M. Marelli, L. Bentivogli, M. Baroni, R. Bernardi, S. Menini, and R. Zamparelli (2014) SemEval-2014 task 1: evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), Cited by: §5.1.
  • B. McCann, J. Bradbury, C. Xiong, and R. Socher (2017) Learned in translation: contextualized word vectors. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 6294–6305. External Links: Link Cited by: §2.
  • S. Mehri and G. Carenini (2017) Chat disentanglement: identifying semantic reply relationships with random forests and recurrent neural networks. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Taipei, Taiwan, pp. 615–623. External Links: Link Cited by: §2.
  • N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. Allen (2016) A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 839–849. External Links: Link Cited by: §3.2.
  • K. Narasimhan and R. Barzilay (2015) Machine comprehension with discourse relations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, pp. 1253–1262. External Links: Link, Document Cited by: §1.
  • A. Nie, E. Bennett, and N. Goodman (2019) DisSent: learning sentence representations from explicit discourse relations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4497–4510. External Links: Link Cited by: §2, §5.1.
  • H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, and R. Ward (2016) Deep sentence embedding using long short-term memory networks: analysis and application to information retrieval. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 24 (4), pp. 694–707. External Links: ISSN 2329-9290, Link, Document Cited by: §1.
  • B. Pan, Y. Yang, Z. Zhao, Y. Zhuang, D. Cai, and X. He (2018) Discourse marker augmented network with reinforcement learning for natural language inference. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 989–999. External Links: Link, Document Cited by: §1.
  • B. Pang and L. Lee (2004) A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd annual meeting on Association for Computational Linguistics, pp. 271. Cited by: §5.1.
  • B. Pang and L. Lee (2005) Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd annual meeting on association for computational linguistics, pp. 115–124. Cited by: §5.1.
  • J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. External Links: Document, Link Cited by: Appendix A.
  • M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237. External Links: Document, Link Cited by: Evaluation Benchmarks and Learning Criteria for Discourse-Aware Sentence Representations, §2.
  • M. Peters, M. Neumann, L. Zettlemoyer, and W. Yih (2018) Dissecting contextual word embeddings: architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 1499–1509. External Links: Link, Document Cited by: §1.
  • K. Pichotta and R. J. Mooney (2016) Using sentence-level LSTM language models for script inference. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 279–289. External Links: Link, Document Cited by: §1.
  • A. Poliak, A. Haldar, R. Rudinger, J. E. Hu, E. Pavlick, A. S. White, and B. Van Durme (2018) Collecting diverse natural language inference problems for sentence representation evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 67–81. External Links: Link, Document Cited by: §1.
  • R. Prasad, N. Dinesh, A. Lee, E. Miltsakaki, L. Robaldo, A. Joshi, and B. Webber (2008) The penn discourse treebank 2.0. In In Proceedings of LREC, Cited by: §1, §3.1.
  • X. Shi, I. Padhi, and K. Knight (2016) Does string-based neural MT learn source syntax?. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 1526–1534. External Links: Link, Document Cited by: §1.
  • D. Sileo, T. Van De Cruys, C. Pradel, and P. Muller (2019) Mining discourse markers for unsupervised sentence representation learning. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 3477–3486. External Links: Link, Document Cited by: §2.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642. Cited by: §5.1.
  • S. Tang and V. R. de Sa (2019) Exploiting invertible decoders for unsupervised sentence representation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4050–4060. External Links: Link Cited by: §2.
  • M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler (2016) Movieqa: understanding stories in movies through question-answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4631–4640. Cited by: §1.
  • I. Tenney, P. Xia, B. Chen, A. Wang, A. Poliak, R. T. McCoy, N. Kim, B. V. Durme, S. Bowman, D. Das, and E. Pavlick (2019) What do you learn from context? probing for sentence structure in contextualized word representations. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019) Superglue: a stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537. Cited by: §1.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2018a) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355. External Links: Link Cited by: §1.
  • S. Wang, E. Holgate, G. Durrett, and K. Erk (2018b) Picking apart story salads. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1455–1465. External Links: Link Cited by: §2, §3.4, §3.4.
  • Y. Wang, S. Li, and J. Yang (2018) Toward fast and accurate neural discourse segmentation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 962–967. External Links: Link, Document Cited by: §2.
  • J. Wiebe, T. Wilson, and C. Cardie (2005) Annotating expressions of opinions and emotions in language. Language resources and evaluation 39 (2), pp. 165–210. Cited by: §5.1.
  • J. Wieting, M. Bansal, K. Gimpel, and K. Livescu (2016) Towards universal paraphrastic sentence embeddings. In Proceedings of International Conference on Learning Representations, Cited by: §2.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: §2.
  • Z. Zhou, Y. Xu, Z. Niu, M. Lan, J. Su, and C. L. Tan (2010) Predicting discourse connectives for implicit discourse relation recognition. In Coling 2010: Posters, pp. 1507–1514. External Links: Link Cited by: §2.

Appendix A Hyperparameters

Our models use 1200 dimensional BiGRUs, resulting in 2400 dimensional sentence representations. The feedforward neural networks used in the decoders are parameterized using two hidden layers with ReLU activation functions. We initialize our models with 300 dimensional GloVe embeddings Pennington et al. (2014). We use Adam Kingma and Ba (2014) as the optimizer and train our models for one epoch on Wikipedia without employing early stopping.