DisSent: Sentence Representation Learning from Explicit Discourse Relations

10/12/2017 ∙ by Allen Nie, et al. ∙ Stanford University

Sentence vectors represent an appealing approach to meaning: learn an embedding that encompasses the meaning of a sentence in a single vector, that can be used for a variety of semantic tasks. Existing models for learning sentence embeddings either require extensive computational resources to train on large corpora, or are trained on costly, manually curated datasets of sentence relations. We observe that humans naturally annotate the relations between their sentences with discourse markers like "but" and "because". These words are deeply linked to the meanings of the sentences they connect. Using this natural signal, we automatically collect a classification dataset from unannotated text. Training a model to predict these discourse markers yields high quality sentence embeddings. Our model captures complementary information to existing models and achieves comparable generalization performance to state of the art models.




1 Introduction

When humans read a sentence they extract a flexible representation of meaning that can be used for many tasks. Developing wide-coverage models to represent the meaning of a sentence is thus a key task in natural language understanding. The applications of such general-purpose representations of sentence meaning are many — paraphrase detection, summarization, knowledge-base population, question-answering, automatic message forwarding, and metaphoric language, to name a few.

Learning flexible meaning representations requires a sufficiently demanding, yet tractable, training task. We propose to leverage a high-level relationship between sentences that is both frequently and systematically marked in natural language: the discourse relations between sentences. Human writers naturally use a small set of very common transition words between sentences to identify the relations between adjacent ideas. (Footnote 1: We will use the term “sentence” to mean either a whole sentence or a subphrase within a sentence that could stand alone. The term “main clause” can also be used for such a subphrase. Even for compound sentences, with two main clauses joined by a discourse marker, we will say that the discourse marker joins two “sentences”.) These words, such as “because”, “but”, and “and”, which mark the relationship between two sentences on the highest level, have been widely studied in linguistics, both formally and computationally, and have many different names. We use the name “discourse markers”. Because discourse markers annotate deep conceptual relations between sentences, similar to entailment, they may permit learning from relatively little data; because discourse markers are produced in natural text, unlike entailment, they don’t require hand annotation.

We thus propose the DisSent model and Discourse Prediction Task to train sentence embeddings. We choose pairs of sentences linked with common discourse markers, and, using a simple data preprocessing scheme, we are able to automatically curate a sizable training set. We then train a sentence encoding model to learn embeddings for each sentence in a pair such that a classifier can identify, based on the embeddings, which discourse marker was used to link the sentences.

Conneau et al. (2017) published an evaluation framework, SentEval (https://github.com/facebookresearch/SentEval), to evaluate sentence embeddings. They compile a set of pre-defined sentence classification tasks on which a good sentence representation should perform well. They used these tasks to evaluate their InferSent model, which they trained on a natural language inference task Bowman et al. (2015). We use the SentEval framework to evaluate our models. We augment these SentEval tasks with a new dataset, DIS, derived from our discourse validation and test sets. This evaluation dataset contains our top 5 discourse markers (and, but, because, when, if), with a total of 300K training and 18K each of validation and test sentence pairs. We further evaluate our model on classification tasks based on implicit discourse relations using Penn Discourse Treebank (PDTB) Rashmi et al. (2008).

Using a model architecture similar to InferSent, but trained on our new discourse classification task, we demonstrate that our DisSent embeddings achieve results comparable to the state of the art on some evaluation tasks and superior results on others.

2 Discourse Prediction Task

Hobbs (1990) argues that discourse relations are always present, that they compose into parsable structures, and that they fall under a small set of categories. In our work we focus on explicit discourse markers between adjacent sentences, rather than implicit relations between a sentence and the related discourse. This simplification sidesteps a huge body of work in determining the correct relation for a sentence Marcu and Echihabi (2002), in describing the wealth of complex structures a discourse can take Webber et al. (2003), in determining whether a relation will be marked or unmarked Patterson and Kehler (2013); Yung et al. (2017), and in compiling a comprehensive set of discourse relations Rashmi et al. (2008); Hobbs (1990); Jasinskaja and Karagjosova (2015).

Marker Extracted Pairs Percent (%)
and 818,634 21.1
as 761,330 19.6
when 552,540 14.2
but 508,648 13.1
if 491,394 12.6
before 268,787 6.9
while 120,231 3.1
because 116,444 3.0
after 84,330 2.2
though 61,023 1.6
so 57,816 1.5
although 13,933 0.4
still 11,125 0.3
also 10,026 0.3
then 8,414 0.2
Total 4,706,292 100.0
Table 1: Number of pairs of sentences extracted from BookCorpus for each discourse marker and percent of each marker in the resulting dataset.

With this focus in mind, we propose a new task for natural language understanding: discourse marker prediction. Given two sentences in a corpus, the model must predict which discourse marker was used by the author to link the two ideas. For example, “She’s late to class ___ she missed the bus” would likely be completed with because, but “She’s sick at home ___ she missed the class” would likely be completed with so, and “She’s good at soccer ___ she missed the goal” would likely be completed with but. All of these example pairs have similar syntactic structures and many words in common. But the meanings of the component sentences lead to strong intuitions about which discourse marker makes the most sense. Without a semantic understanding of the sentences, we would not be able to guess the correct relation. We hypothesize that success at choosing the correct discourse marker will require a representation that reflects the full meaning of a sentence.

We note that perfect performance at this task is impossible for humans, because different discourse markers can easily appear in the same context. For example, in some cases, markers are (at least close to) synonymous with one another Knott (1996). Other times, it is possible for multiple discourse markers to link the same pair of sentences and change the interpretation. (In the sentence “Bob saw Alice was at the party, (then | so | but) he went home,” changing the discourse marker drastically changes our interpretation of Bob’s goals and feeling towards Alice.) Despite this ceiling on absolute performance, a discourse marker can frequently be inferred from the meanings of the sentences it connects, making this a useful training task.

3 DisSent Model

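We adapt the best architecture from Conneau et al. (2017) as our sentence encoder. This architecture uses a standard bidirectional LSTM (BiLSTM) Graves et al. (2013), followed by temporal max-pooling to create sentence vectors. We parameterize the BiLSTM with the different weights $\theta_1$ and $\theta_2$ to reflect the asymmetry of sentence processing. We then concatenate the forward and backward encodings:

$$\overrightarrow{h}_t = \mathrm{LSTM}_t(w_1, \ldots, w_t \mid \theta_1), \quad \overleftarrow{h}_t = \mathrm{LSTM}_t(w_T, \ldots, w_t \mid \theta_2), \quad h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$$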

We apply global max pooling on the resulting vectors to construct the encoding for each sentence. That is, we apply an element-wise max operation over the temporal dimension of the hidden states. Global max pooling builds a sentence representation from all time steps in the processing of a sentence Collobert and Weston (2008); Conneau et al. (2017), providing regularization and shorter back-propagation paths.
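The pooling step can be illustrated with a minimal sketch in plain Python. The function name `max_pool_over_time` and the toy hidden states are illustrative assumptions, not part of the original implementation; the only thing taken from the text is the element-wise max over the temporal dimension.

```python
from typing import List, Sequence


def max_pool_over_time(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise max over the temporal axis.

    hidden_states holds one D-dimensional state per token (T rows of length D).
    Returns a single D-dimensional sentence vector.
    """
    if not hidden_states:
        raise ValueError("need at least one time step")
    # zip(*rows) regroups the values by dimension, so each max runs over time.
    return [max(dim_values) for dim_values in zip(*hidden_states)]


# Made-up hidden states for a 3-token sentence with D = 4.
h = [
    [0.1, -0.5, 0.3, 0.0],
    [0.7, -0.2, -0.1, 0.4],
    [0.2, 0.6, 0.0, -0.3],
]
sentence_vector = max_pool_over_time(h)
print(sentence_vector)
```

Each output dimension may come from a different time step, which is why every token can contribute to the sentence vector.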


Our objective is to predict the discourse relations between two sentences from their vectors, $s_i$ where $i \in \{1, 2\}$. Because we want generally useful sentence vectors after training, the learned computation should happen before the sentences are combined to make a prediction. However, some non-linear interactions between the sentence vectors are likely to be needed. To achieve this, we include a fixed set of common pair-wise vector operations: subtraction, multiplication, and average.
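As a rough sketch of these fixed pair-wise operations, the snippet below builds one feature vector from two sentence vectors. The function name `combine_pair`, the inclusion of the two raw vectors, and the order of concatenation are assumptions made for illustration; the text only names the three operations.

```python
from typing import List, Sequence


def combine_pair(s1: Sequence[float], s2: Sequence[float]) -> List[float]:
    """Concatenate s1, s2 and their element-wise average, difference and product."""
    if len(s1) != len(s2):
        raise ValueError("sentence vectors must have the same dimension")
    avg = [(a + b) / 2 for a, b in zip(s1, s2)]
    sub = [a - b for a, b in zip(s1, s2)]
    mul = [a * b for a, b in zip(s1, s2)]
    # Assumed layout: [s1, s2, average, difference, product].
    return list(s1) + list(s2) + avg + sub + mul


features = combine_pair([1.0, 2.0], [3.0, -2.0])
print(features)
```

None of these operations has trainable parameters, so whatever is learned has to live in the sentence encoder itself.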


Finally, we use an affine fully-connected layer to project the concatenated vector down to a lower dimensional representation, and then project it down to a vector of label size (the number of discourse markers). We use softmax to compute the probability distribution over discourse relations.
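The classifier head described here can be sketched in a few lines of plain Python. All names, layer sizes, and weights below are toy assumptions chosen for readability; the sketch only mirrors the stated structure of two affine projections followed by a softmax.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def affine(x: Sequence[float], weights: Matrix, bias: Sequence[float]) -> List[float]:
    """Compute W x + b, with one row of `weights` per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_marker_probs(features, w_hidden, b_hidden, w_out, b_out) -> List[float]:
    hidden = affine(features, w_hidden, b_hidden)  # project to a lower dimension
    logits = affine(hidden, w_out, b_out)          # project to the label size
    return softmax(logits)


# Toy sizes: 4 input features -> 2 hidden units -> 3 discourse markers.
features = [1.0, -1.0, 0.5, 0.0]
w_hidden = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.5, 0.0, 0.0]]
b_hidden = [0.0, 0.0]
w_out = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
b_out = [0.0, 0.0, 0.0]

probs = predict_marker_probs(features, w_hidden, b_hidden, w_out, b_out)
print(probs)
```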

4 Data Collection

S1 marker S2
Her eyes flew up to his face. and Suddenly she realized why he looked so different.
The concept is simple. but The execution will be incredibly dangerous.
You used to feel pride. because You defended innocent people.
Ill tell you about it. if You give me your number.
Belter was still hard at work. when Drade and barney strolled in.
We plugged bulky headsets into the dashboard. so We could hear each other when we spoke into the microphones.
It was mere minutes or hours. before He finally fell into unconsciousness.
And then the cloudy darkness lifted. though The lifeboat did not slow down.
Table 2: Example pairs from our Books 8 dataset.

We present an automatic way to collect a large corpus of sentence pairs and the relations between them. We collected sentence pairs from BookCorpus Zhu et al. (2015), a dataset of text from unpublished novels (Romance, Fantasy, Science fiction, and Teen genres), which was used by Kiros et al. (2015) to train their SkipThought model. We searched this corpus for sentences that contained any of the discourse markers in a predetermined set, then filtered and extracted related sentence pairs using universal dependency parsing Schuster and Manning (2016).

4.1 Choice of Discourse Markers

We chose relatively frequent discourse markers (accounting for at least 1% of discourse markers in the overall corpus) from all of the discourse markers identified in the manual for the preparation of PDTB Rashmi et al. (2008). We present our set of discourse markers and their frequencies in Table 1.




[I wore a jacket] because [it was cold outside].

Because [it was cold outside], [I wore a jacket].

Figure 1: While the relative order of a discourse marker (e.g. because) and its connected sentences is flexible, the dependency relations between these components within the overall sentence remain constant. See Appendix A.1 for dependency patterns for other discourse markers.

4.2 Dependency Parsing

While discourse markers are often quite separable from the sentences they link, there are some complications in extracting sentence pairs from raw text. Many discourse markers in English occur almost exclusively between the two sentences they connect, but for other discourse markers, their position relative to their connected sentences is less systematic (e.g. because has two very common orderings, shown in Figure 1). For this reason, we use the Stanford CoreNLP dependency parser Schuster and Manning (2016). Each discourse marker, when it is used to link two sentences, is parsed by the dependency parser in a systematic way.

Rashmi et al. (2008) discussed 4 types of locations that discourse markers appear in, relative to the sentences they connect. A discourse marker can appear within the same sentence as the two clauses it connects (SS), or can connect the sentence in which it appears to its immediate predecessor (IPS), a non-adjacent previous sentence (NAPS), or a sentence that follows it (FS). Using dependency parsing, we are able to extract the appropriate sentence pairs in the first two cases (SS and IPS). Rashmi et al. (2008) collected the frequencies of the cases in Penn Treebank and discovered the first two cases account for 91% of instances.

For each discourse marker, we first searched within the dependency parse for its governor sentence, which we call “S2” (see Figure 1), and rejected examples without the appropriate relation. (Footnote 3: Note that different discourse markers may have different corresponding dependency patterns linking them to their sentence pairs. We discuss dependency extraction in more detail in Appendix A.1.) This filtering allows us to exclude cases when these words are not used as discourse markers at all (e.g. the word so used as an adverb, as in “that’s so cool!”).

We then searched for the dependency relation linking sentence S2 to sentence S1 within the same sentence (SS). Searching for this kind of dependency relation allows us to capture sentence pairs where the discourse marker starts the sentence and connects two clauses within that sentence (e.g. “Because [it was cold outside], [I wore a jacket].”). It also allows us to limit our extraction to only the relevant subclauses (e.g. “I think that [they were at the store] when [you came by].”).

Whenever we find the target dependency relation between the discourse marker and the entire sentence in which it appears, we make the assumption that the discourse marker links to the immediately previous sentence (IPS). In this case, we simply identify the current sentence as S2 and the previous sentence as S1. For the remaining 9% of uses (NAPS and FS), our method is unable to extract the appropriate pairs and will, if it finds an acceptable S2, incorrectly choose the previous sentence as S1.
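The SS and IPS cases can be sketched on a hand-written dependency parse. Everything below is an illustrative assumption: the `Token` structure, the function names, and the choice of the Universal Dependencies labels `mark` (marker to S2 head) and `advcl` (S2 head to S1 head) for because. The actual per-marker patterns are given in Appendix A.1, and a real pipeline would take its parses from a dependency parser rather than from literals.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    idx: int     # 1-based position in the sentence
    word: str
    head: int    # index of the governor, 0 for the root
    deprel: str  # dependency label


def subtree(tokens: List[Token], root_idx: int) -> List[int]:
    """Indices of root_idx and all of its descendants, in sentence order."""
    children: Dict[int, List[int]] = {}
    for tok in tokens:
        children.setdefault(tok.head, []).append(tok.idx)
    found, stack = [], [root_idx]
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children.get(node, []))
    return sorted(found)


def extract_pair(tokens: List[Token], marker: str,
                 previous_sentence: Optional[str] = None,
                 marker_rel: str = "mark",
                 link_rel: str = "advcl") -> Optional[Tuple[str, str]]:
    """Return (S1, S2) for the first valid use of `marker`, else None."""
    by_idx = {tok.idx: tok for tok in tokens}
    for tok in tokens:
        if tok.word.lower() != marker or tok.deprel != marker_rel or tok.head == 0:
            continue  # not used as a discourse marker here
        s2_head = by_idx[tok.head]
        s2_ids = [i for i in subtree(tokens, s2_head.idx) if i != tok.idx]
        s2 = " ".join(by_idx[i].word for i in s2_ids)
        if s2_head.deprel == link_rel:
            # SS case: S1 is the governing clause minus S2 and the marker.
            excluded = set(s2_ids) | {tok.idx}
            s1_ids = [i for i in subtree(tokens, s2_head.head) if i not in excluded]
            return " ".join(by_idx[i].word for i in s1_ids), s2
        if s2_head.head == 0 and previous_sentence is not None:
            # IPS case: the marker attaches to the whole current sentence.
            return previous_sentence, s2
    return None


# Hand-written parses of the two orderings in Figure 1 (punctuation omitted).
marker_medial = [
    Token(1, "I", 2, "nsubj"), Token(2, "wore", 0, "root"),
    Token(3, "a", 4, "det"), Token(4, "jacket", 2, "obj"),
    Token(5, "because", 8, "mark"), Token(6, "it", 8, "nsubj"),
    Token(7, "was", 8, "cop"), Token(8, "cold", 2, "advcl"),
    Token(9, "outside", 8, "advmod"),
]
marker_initial = [
    Token(1, "Because", 4, "mark"), Token(2, "it", 4, "nsubj"),
    Token(3, "was", 4, "cop"), Token(4, "cold", 7, "advcl"),
    Token(5, "outside", 4, "advmod"), Token(6, "I", 7, "nsubj"),
    Token(7, "wore", 0, "root"), Token(8, "a", 9, "det"),
    Token(9, "jacket", 7, "obj"),
]

pair_medial = extract_pair(marker_medial, "because")
pair_initial = extract_pair(marker_initial, "because")
print(pair_medial, pair_initial)
```

Both orderings yield the same (S1, S2) pair because the extraction follows dependency edges rather than word order.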

Despite many advantages, the dependency parser introduces some problems. First, not all parses are correct (e.g. the non-sentence “Himself close his eyes.” was extracted from “To his shame, he just let himself close his eyes and gave himself over to unconsciousness.” due to an incorrect parse). Second, even given correct parses, extraction was imperfect for sentences with implicit, repeated subjects (e.g. “[Wolfe chastised us for not being serious enough] and [gave us high marks for learning the techniques]”) or certain kinds of marked embedded clauses (e.g. “She was reading her favorite book, [which her sister had given her] when [she last visited]”). Fortunately these errors were relatively rare, and many could be avoided simply by enforcing that the extracted sentences each have a main verb and satisfy a minimum length. Overall this method extracts high-quality sentence pairs with appropriately labeled relations.

4.3 Length-based Filtering

As a way to exclude extremely uninformative sentence pairs and standardize lengths of sentences, we filtered pairs based on several criteria on the lengths (in words) of the two sentences. We excluded any pair where one of the two sentences was less than 5 or more than 50 words long. We additionally excluded any pairs where one of the two sentences was more than 5 times the length of the other.
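These length criteria translate into a short predicate. The sketch below assumes whitespace tokenization and uses a hypothetical function name `keep_pair`; the thresholds are the ones stated above.

```python
def keep_pair(s1: str, s2: str, min_len: int = 5, max_len: int = 50,
              max_ratio: float = 5.0) -> bool:
    """Return True if a sentence pair passes the length-based filters."""
    # Assumption: words are separated by whitespace.
    n1, n2 = len(s1.split()), len(s2.split())
    # Drop pairs where either sentence is shorter than 5 or longer than 50 words.
    if not (min_len <= n1 <= max_len and min_len <= n2 <= max_len):
        return False
    # Drop pairs where one sentence is more than 5 times the length of the other.
    return max(n1, n2) <= max_ratio * min(n1, n2)


short = "The concept is simple enough"       # 5 words
too_short = "You used to"                    # 3 words
long_25 = " ".join(["word"] * 25)
long_26 = " ".join(["word"] * 26)

print(keep_pair(short, long_25), keep_pair(short, long_26))
```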

4.4 Training Dataset

Using these methods, we curated a dataset of 4,706,292 pairs of sentences for 15 discourse markers. Examples are shown in Table 2. We then randomly divide the dataset into training, validation, and test sets with a 0.9 / 0.05 / 0.05 split. The dataset is inherently unbalanced, but in our experiments the model is still able to learn rarer classes quite well (see Appendix A.2 for more details on the effects of class frequencies).
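A random 0.9 / 0.05 / 0.05 split could look like the following sketch. The function name, the fixed seed, and the use of Python's `random` module are assumptions made for illustration, not the procedure actually used to build the dataset.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(pairs: Sequence, seed: int = 0, train_frac: float = 0.9,
                  valid_frac: float = 0.05) -> Tuple[List, List, List]:
    """Shuffle the pairs and cut them into train / validation / test portions."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # whatever remains
    return train, valid, test


train, valid, test = split_dataset(range(100))
print(len(train), len(valid), len(test))
```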

We also consider smaller sets of discourse markers, resulting in smaller data sets. For our experiment on 8 discourse markers, we have 3,616,015 sentence pairs in total, and for our experiment on 5 discourse markers, we have 3,216,552 sentence pairs in total. Selected markers are displayed in Table 3.

4.5 Evaluation Dataset: DIS

In addition to serving as a training task for our model, we believe that discourse prediction can be a valuable tool for evaluating sentence embeddings. This task is similar to PDTB tasks classifying explicit relations, but provides a simpler target (given its smaller set of class labels) and a larger set of sentence pairs.

We combine the validation and test datasets from the Books 5 corpus and resplit them to form the data for this task. In total, we have 334,913 training examples, and 18,605 sentence pairs each for dev and test. The task is to accurately predict one discourse marker out of five, relying on the embedding of each sentence generated by an embedding model.

5 Related Work

Current state of the art models summarize the meaning of a sentence via a sentence vector, relying on completely unsupervised learning or supervised learning through high-level classification tasks.

SkipThought Kiros et al. (2015) is an unsupervised sequence model that has been shown to generate useful sentence embeddings. However, it requires large amounts of training data and long training time to perform well. In SkipThought, each word in the previous sentence is used to generate each word in the next sentence. In DisSent, each word in both sentences is used to classify the discourse marker, which is often extracted from the second sentence. This difference allows for faster training time and learning that is focused on relationships between sentence meanings.

InferSent Conneau et al. (2017) explores the idea that sentence embeddings can be learned from sentence relationships. Conneau et al. (2017) trained a classifier to predict entailment relations in the Stanford Natural Language Inference (SNLI) Bowman et al. (2015) and MultiNLI Williams et al. (2017) corpora, achieving comparable performance to SkipThought on generalization tasks, but with much less data and shorter training time. However, the training set used was built using human annotation, making it laborious and expensive to collect. InferSent embeddings are therefore limited in the size and variety of dataset they can be trained from. In contrast, while DisSent also leverages sentence relationships, it can be trained on automatically collected data.

Jernite et al. (2017) have proposed a model that leverages discourse relations. They manually put discourse markers into several categories based on human interpretations of discourse marker similarity, and the model predicts the category instead of the individual discourse marker. Their model also trains on auxiliary tasks, such as sentence ordering and ranking of the following sentence, and must compensate for data imbalance across tasks. Their data collection methods only allow them to look at paragraphs longer than 8 sentences, and sentence pairs with sentence-initial discourse markers, resulting in only 1.4M sentence pairs from a much larger corpus. Our proposed model extracts a wider variety of sentence pairs, can be applied to corpora with shorter paragraphs, and includes no auxiliary tasks.

6 Experiments

For all our models, we tuned the hyperparameters on the validation set and report results on the test set. We use stochastic gradient descent with an initial learning rate of 0.1, and we anneal the learning rate by a factor of 5 each time the validation accuracy is lower than in the previous epoch. We train our models for 20 epochs and use early stopping to prevent overfitting. We also clip the gradient norm to 5.0. We did not use dropout, as it lowered generalization performance. We experimented with both temporal mean pooling and temporal max pooling and found the latter to perform much better at transfer tasks. All models we report used a hidden state size of 4096.
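The annealing rule can be traced with a small sketch. It assumes that annealing means dividing the learning rate by 5, and that the comparison is against the immediately preceding epoch; the function name and the accuracy values are made up for illustration.

```python
from typing import List, Sequence


def anneal_schedule(val_accuracies: Sequence[float], initial_lr: float = 0.1,
                    factor: float = 5.0) -> List[float]:
    """Learning rate in effect after each epoch's validation check."""
    lr = initial_lr
    previous = None
    rates = []
    for acc in val_accuracies:
        # Assumed rule: divide by `factor` when accuracy drops below the last epoch's.
        if previous is not None and acc < previous:
            lr /= factor
        rates.append(lr)
        previous = acc
    return rates


# Hypothetical validation accuracies over five epochs.
rates = anneal_schedule([0.60, 0.65, 0.64, 0.66, 0.65])
print(rates)
```

With these made-up accuracies the rate is reduced twice, after the third and the fifth epoch.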

Label Discourse Markers
Books 5 and, but, because, if, when
Books 8 and, but, because, if, when, before, though, so
Books ALL and, but, because, if, when, before, though, so, as, while, after, still, also, then, although
Table 3: Discourse marker sets used in our experiments.

Discourse Marker Set

To investigate the qualitative relations among the All marker set, we build a confusion matrix based on predictions on the test set. Figure 2 reflects classification performance for the actual model, trained on the full corpus, that we later show generalization results for. Overall, this model achieved 67.5% accuracy on the classification task. This model shows a clear effect of frequency, such that it tends to misclassify infrequent discourse markers as frequent ones. However, deviations from the effect of frequency appear to be semantically meaningful. (Footnote 4: See Appendix A.2 for more systematic demonstrations that trained models have meaningfully captured the semantic similarities of discourse markers and not only their relative frequencies.) Classification errors are much more common for semantically similar discourse marker pairs than would be expected from frequency alone. The most common confusion is when the synonymous marker although is mistakenly classified as but. The temporal relation markers before, after and then, intuitively very similar discourse markers, are rarely confused for anything but each other. The fact that they are indeed confusable may reflect the tendency of authors to mark temporal relation primarily when it is ambiguous.

Figure 2: Confusion Matrix trained on the All dataset extracted from BookCorpus. Each cell represents the proportion of instances of the actual discourse marker misclassified as the classified discourse marker. This proportion is log-transformed to highlight small differences. Discourse markers are arranged in order of frequency from left (least frequent) to right (most frequent).
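A confusion matrix of this kind can be computed as in the sketch below: each row is normalized by the number of instances of the actual marker and then log-transformed. The function names, the small constant `eps` added before taking the log, and the toy labels are assumptions; the caption does not specify how zero cells were handled.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def confusion_proportions(actual: Sequence[str], predicted: Sequence[str],
                          labels: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """Proportion of each actual marker's instances assigned to each predicted marker."""
    counts = Counter(zip(actual, predicted))
    totals = Counter(actual)
    return {
        a: {p: (counts[(a, p)] / totals[a] if totals[a] else 0.0) for p in labels}
        for a in labels
    }


def log_transform(matrix: Dict[str, Dict[str, float]],
                  eps: float = 1e-6) -> Dict[str, Dict[str, float]]:
    """Log-scale the proportions; eps (an assumption) keeps log(0) finite."""
    return {a: {p: math.log(v + eps) for p, v in row.items()} for a, row in matrix.items()}


# Toy predictions, not taken from the paper's results.
actual = ["but", "but", "although", "although", "and"]
predicted = ["but", "and", "but", "but", "and"]
labels = ["although", "but", "and"]

matrix = confusion_proportions(actual, predicted, labels)
log_matrix = log_transform(matrix)
print(matrix["although"]["but"], matrix["but"]["and"])
```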

Because there appears to be intrinsic conceptual overlap in the set of ALL markers, we experimented on different subsets of discourse markers. We choose sets of 5 and 8 discourse markers that seemed non-overlapping and frequent, both intuitively and with respect to confusions in Figure 2. The set of sentence pairs for each smaller dataset is a strict subset of those in any larger dataset. Our chosen sets are shown in Table 3.

Implicit vs. Explicit Prediction Task

Due to the lack of discourse-related tasks in SentEval, we want to compare DisSent’s ability to capture discourse-relevant information with other sentence representation models. Penn Discourse Treebank Rashmi et al. (2008) offers a hand-annotated dataset that contains discourse relation annotations between sentences.

Given a pair of connected sentences, whose relation type has been labeled in PDTB, we can determine whether the discourse relation between them was explicitly marked or not. We evaluate DisSent and InferSent sentence embedding models and a word vector baseline. Given only the two sentences and no additional information (i.e. the discourse relation type is unknown), we make a binary classification of whether the sentence pair appeared as explicitly or implicitly marked.

We collected 34,512 sentences from PDTB, where 16,224 sentences are marked with implicit relation type, and 18,459 are marked with explicit relation type. We follow Patterson and Kehler (2013)’s preprocessing. The dataset contains 25 sections in total. We use sections 0 and 1 as the development set, sections 23 and 24 for the test set, and we train on the remaining sections 2-22.
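The section-based split can be written down directly. The function name `pdtb_split` is a hypothetical helper; the section assignments are the ones stated above.

```python
def pdtb_split(section: int) -> str:
    """Map a PDTB section number (0-24) to its split."""
    if not 0 <= section <= 24:
        raise ValueError("PDTB sections are numbered 0 to 24")
    if section in (0, 1):
        return "dev"
    if section in (23, 24):
        return "test"
    return "train"  # sections 2-22


splits = [pdtb_split(s) for s in range(25)]
print(splits.count("train"), splits.count("dev"), splits.count("test"))
```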

This task is different from the setting in Patterson and Kehler (2013). We do not allow the classifier to access the underlying discourse relation type and we only provide the individual sentence embeddings as input features. In contrast, Patterson and Kehler (2013) used a variety of discrete features provided as part of PDTB dataset for their classifier, including the hand-annotated relation types.

Implicit Relation Prediction Task

Sporleder and Lascarides (2008) have argued that sentence pairs with explicitly marked relations are qualitatively different from those where the relation is left implicit. However, despite such differences, Qin et al. (2017) were able to use explicit discourse data and adversarial networks to train a classifier to successfully identify implicit discourse relations. Evaluating the performance of DisSent embeddings on the implicit relation task will add to the discussion of how much explicit relations can be leveraged to understand implicit ones.

We use the same dataset split scheme for this task as for the implicit vs explicit task discussed above. Following Ji and Eisenstein (2014) and Qin et al. (2017), we predict the 11 most frequent relations. We directly use the SentEval training framework for evaluation and use a simple softmax layer for classification.

Transfer Tasks

We evaluate the performance of our generated sentence embeddings on a series of natural language understanding benchmark tests provided by Conneau et al. (2017). The tasks we chose include sentiment analysis (MR, SST), question-type (TREC), product reviews (CR), subjectivity-objectivity (SUBJ), opinion polarity (MPQA), entailment (SICK-E), relatedness (SICK-R), and paraphrase detection (MRPC). We add our DIS evaluation tasks, as described in Section 4.5, to these tasks. These tasks are all classification tasks with 2-6 classes, except for relatedness, for which the model predicts human similarity judgements.

6.1 Results

Marker      All            Books 8        Books 5
and         0.78 / 0.72    0.78 / 0.78    0.79 / 0.81
but         0.73 / 0.71    0.79 / 0.72    0.80 / 0.75
because     0.36 / 0.45    0.37 / 0.50    0.38 / 0.55
if          0.75 / 0.79    0.80 / 0.78    0.81 / 0.81
when        0.62 / 0.61    0.74 / 0.71    0.77 / 0.77
so          0.48 / 0.49    0.46 / 0.56    -
though      0.30 / 0.48    0.39 / 0.61    -
before      0.61 / 0.65    0.64 / 0.77    -
as          0.77 / 0.68    -              -
while       0.36 / 0.46    -              -
after       0.42 / 0.55    -              -
although    0.07 / 0.24    -              -
still       0.21 / 0.42    -              -
also        0.14 / 0.36    -              -
then        0.12 / 0.31    -              -
Overall     67.5           73.5           77.3
Table 4: Training task performance: test recall / precision for each discourse marker on the classification task, and overall accuracy (%) for each marker set.

Training Task

On the discourse marker prediction task that our model is trained for, we achieve high levels of test performance for all discourse markers. (Though it is interesting that because, perhaps the conceptually deepest relation, is also systematically the hardest for our model.) The larger the set, the more difficult the task becomes, and we see lower test accuracy overall when the size of the discourse marker set increases. The training task performance for each subset is shown in Table 4.
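Per-marker recall and precision of the kind reported in Table 4 can be computed from gold and predicted labels as in the sketch below. The function name and the toy labels are illustrative assumptions; no numbers here come from the paper.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def per_class_recall_precision(actual: Sequence[str],
                               predicted: Sequence[str]) -> Dict[str, Tuple[float, float]]:
    """Return {marker: (recall, precision)} for every marker seen."""
    true_pos = Counter(a for a, p in zip(actual, predicted) if a == p)
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    scores = {}
    for label in sorted(set(actual) | set(predicted)):
        # Recall: correct predictions over all instances of the actual marker.
        recall = true_pos[label] / actual_counts[label] if actual_counts[label] else 0.0
        # Precision: correct predictions over all predictions of that marker.
        precision = true_pos[label] / predicted_counts[label] if predicted_counts[label] else 0.0
        scores[label] = (recall, precision)
    return scores


actual = ["and", "and", "but", "because"]
predicted = ["and", "but", "but", "and"]
scores = per_class_recall_precision(actual, predicted)
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(scores, accuracy)
```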

Discourse Marker Set

Varying the set of discourse markers doesn’t seem to help or hinder the model’s performance on generalization tasks. Top generalization performance on the three sets of discourse markers is shown in Table 6. Similar generalization performance was achieved when training on 5, 8, and all 15 discourse markers.

The similarity in generalization performance across the discourse marker sets suggests that the top 5 discourse markers capture most of the relationships in the training data.

Implicit Discourse Relation Task

We compare the performance of our sentence representation learning model with other similar models and with the state of the art model on this task. Not surprisingly, DisSent is able to capture discourse-related information much better than InferSent. DisSent outperforms word vector models evaluated by Qin et al. (2017), and our sentence representation learning model is only about 3% away from the state of the art model, which uses adversarial learning to leverage explicit discourse markers to learn about implicit discourse relations.

Model                                 Implicit relation    Implicit vs. explicit
DisSent Books 5                       40.7                 86.5
DisSent Books 8                       41.4                 87.9
DisSent Books ALL                     42.9                 87.6
InferSent Conneau et al. (2017)       38.4                 84.5
Patterson and Kehler (2013)           -                    86.6
Word Vectors Qin et al. (2017)        36.9                 74.8
Lin et al. (2009) + Brown Cluster     40.7                 -
Adversarial Net (Qin et al., 2017)    46.2                 -
Table 5: Discourse Generalization Tasks using PDTB: Following the metrics used in this literature, we report overall test accuracy for sentence embedding models, as well as baselines and the state of the art for these tasks.

Transfer Tasks

Results of our models, and comparison to other approaches, are shown in Table 6. Despite being a much simpler task than SkipThought and allowing for much more scalable data collection than InferSent, DisSent performs as well as or better than these approaches on most generalization tasks.

DisSent and InferSent do well on different sets of tasks. In particular, DisSent outperforms InferSent on TREC (question-type classification) and, unsurprisingly, DIS, the task most similar to DisSent’s training set. InferSent outperforms DisSent on the tasks most similar to its training data, SICK-R and SICK-E. These tasks, like SNLI, were crowdsourced, and seeded with images from the Flickr30k corpus Young et al. (2014).

Model                MR      CR      SUBJ    MPQA    SST     TREC    SICK-R   SICK-E   MRPC    DIS
Self-supervised training methods
DisSent Books 5      80.2    85.4    93.2    90.2    82.8    91.2    0.845    83.5     76.1    75.7
DisSent Books 8      79.8    85.0    93.4    90.5    83.9    93.0    0.854    83.8     76.1    80.2
DisSent Books ALL    80.1    84.9    93.6    90.1    84.1    93.6    0.849    83.7     75.0    79.9
Disc BiGRU           -       -       88.6    -       -       81.0    -        -        71.6    -
Unsupervised training methods
FastSent             70.8    78.4    88.7    80.6    -       76.8    -        -        72.2    -
FastSent + AE        71.8    76.7    88.8    81.5    -       80.4    -        -        71.2    -
Skipthought          76.5    80.1    93.6    87.1    82.0    92.2    0.858    82.3     73.0    70.1
Skipthought-LN       79.4    83.1    93.7    89.3    82.9    88.4    0.858    79.5     -       -
Supervised training methods
DictRep (bow)        76.7    78.7    90.7    87.2    -       81.0    -        -        -       -
InferSent            81.1    86.3    92.4    90.2    84.6    88.2    0.884    86.1     76.2    65.4
Multi-task training methods
LSMTL                82.5    87.7    94.0    90.9    83.2    93.0    0.888    87.8     78.6    -
Table 6: Generalization Task Results using SentEval.

We report the best results for transfer learning tasks.

indicates models that we trained. DisSent uses 4096 hidden state dimensions and GloVe 300-dimension word embeddings. InferSent Conneau et al. (2017) uses 4096 embedding dimensions and GloVe 300-dimension word embeddings. The Disc BiGRU Jernite et al. (2017) hidden state has 512 dimensions. FastSent and FastSent + AE Hill et al. (2016) have 500 dimensions. SkipThought Kiros et al. (2015) and SkipThought-LN Conneau et al. (2017) models were trained on 600-dimension word embeddings and produce 2400-dimension sentence embeddings. DictRep (bow) is from Conneau et al. (2017). LSMTL Subramanian et al. (2018) uses a 2048-dimension bi-directional GRU as its encoder and was trained on 512-dimension word embeddings.

Although DisSent is trained on a dataset derived from the same corpus as SkipThought, DisSent almost entirely dominates SkipThought’s performance across all tasks. In particular, on the SICK dataset, DisSent and SkipThought perform similarly on the relatedness task (SICK-R), but DisSent strongly outperforms SkipThought on the entailment task (SICK-E). This discrepancy highlights an important difference between the two models. Whereas both models are trained to, given a particular sentence, identify words that appear near that sentence in the corpus, DisSent focuses on learning specific kinds of relationships between sentences – ones that humans tend to explicitly mark. We find that reducing the model’s task to only predicting a small set of discourse relations, rather than trying to recover all words in the following sentence, results in better features for identifying entailment and contradiction without losing cues to relatedness.

On the new DIS task, the DisSent models do strikingly better than other models. (The direction of this effect is not surprising, but the effect is large: DisSent Books 8 does 10-15% better than previous models on the DIS discourse task, while InferSent does only about 2% better than DisSent at the SICK-E entailment task.) We found that even though DisSent Books 8 has a lower overall accuracy for its 8-way classification training task (Table 4), it outperforms the DisSent Books 5 model at the 5-way DIS evaluation task. This indicates that the harder task yields more discourse-related information about the sentences.

Surprisingly, the recently proposed large scale multi-task learning (LSMTL) sentence representation learning model Subramanian et al. (2018), which combined machine translation, constituency parsing, entailment prediction, and SkipThought style training with 124M training data, only narrowly outperforms our model. This indicates that marked discourse relations between sentences can serve as a strong high-level linguistic signal useful for sentence representation learning.

Overall, on the evaluation tasks we present, DisSent performs on par with previous state-of-the-art models and offers advantages in data collection and training speed.

7 Discussion

The ability of discourse marker prediction, used as a training task, to shape useful, state-of-the-art sentence embeddings is encouraging. Yet a number of issues for future research are apparent.

Limitations of evaluation

The generalization tasks that we (following Conneau et al. (2017)) use to compare models focus on sentiment, entailment, and similarity. These are narrow operational definitions of semantic meaning. A model that generates meaningful sentence embeddings should excel at these tasks. However, success at these tasks does not necessarily imply that a model has learned a deep semantic understanding of a sentence.

Sentiment classification, for example, in many cases only requires the model to understand local structures. Text similarity can be computed with various textual distances (e.g., Levenshtein or Jaro distance) on bag-of-words, without a compositional representation of the sentence. Thus, the ability of our model and others to achieve high performance on these metrics may reflect a competent representation of sentence meaning, but more rigorous tests are needed to understand whether these embeddings capture sentence meaning in general.
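To make the point about surface-level similarity concrete, the following is a minimal sketch of a word-level Levenshtein similarity. It is an illustration only, not part of our evaluation pipeline; the function names `levenshtein` and `token_similarity` and the normalization by the longer sentence length are assumptions made for this example.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute: cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x -> y
        prev = curr
    return prev[-1]


def token_similarity(s1, s2):
    """Similarity in [0, 1] from word-level edit distance (illustrative normalization)."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    longest = max(len(t1), len(t2))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(t1, t2) / longest


if __name__ == "__main__":
    # Same words, reversed roles: the surface score stays fairly high
    # even though the two sentences describe different events.
    print(token_similarity("the dog bit the man", "the man bit the dog"))
```

A measure like this uses no compositional representation at all, which is why a high score on a similarity benchmark is weak evidence of semantic understanding on its own.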

Implicit and explicit discourse relations

We focus on explicit discourse relations for training our embeddings. Another meaningful way to exploit discourse relations in training is by leveraging implicit discourse signals. For instance, Jernite et al. (2017) showed that predicting sentence ordering could help to generate meaningful sentence embeddings. This approach makes the assumption that adjacent sentences are closer together in meaning space (or generated from similar latent topics). This may be true of many adjacent sentences, especially those whose relation is unmarked. But adjacent sentences can be related to one another in many different, complicated ways. For example, sentences linked by contrastive markers, like but or however, are likely to express different or opposite ideas.

Identifying other features of natural text that contain informative signals of discourse structure and combining these with explicit discourse markers is an appealing direction for future research.

Multilingual generalization

In principle, the DisSent model and extraction methods would apply equally well to multilingual data with minimal language-specific modifications. Within universal dependency grammar, discourse markers across languages should correspond to structurally similar dependency patterns. Beyond dependency parsing and minimal marker-specific pattern development (see Appendix A.1), our extraction method is automatic, requiring no annotation of the original dataset, and so any large dataset of raw text in a language can be used.

8 Conclusion

We present a discourse marker prediction task for training sentence embeddings to reflect the meaning of a sentence. We train our model on this task and show that the resulting embeddings lead to high generalization performance on a number of established tasks for sentence embeddings.

This type of training task can leverage large amounts of unannotated text, since it relies only on the kinds of annotations (sentence boundaries and discourse markers) that humans naturally mark in their communications with each other. A dataset for this task is therefore easy to collect relative to other supervised tasks. Compared to unsupervised methods that train on a full corpus, our method yields more targeted and faster training. Encouragingly, this model trained on discourse marker prediction achieves generalization performance comparable to that of other state-of-the-art models.


Appendix A Supplementary Material

A.1 Details on Dependency-Based Sentence Extraction

We use dependency parsing to identify uses of discourse markers and to extract sentence pairs. While universal dependency grammar provides enough information to identify discourse markers and the sentences they connect, different discourse markers are parsed with different dependency relations. For each discourse marker of interest, we identified the appropriate dependency pattern. Figure 4 shows the full set of dependency patterns used by our extraction algorithm for English.

For some markers, we further filtered based on the order of the two sentences in the original text. For example, the discourse marker then always appears in the order "S1 then S2", unlike because, which can also appear in the order "Because S2, S1". Excluding proposed extractions in an incorrect order makes our method more robust to incorrect dependency parses.
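The order filter described above can be sketched as follows. This is a hypothetical illustration rather than our extraction code: the `ALLOWED_ORDERS` table, the function names, and the use of token start indices are all assumptions, and only the two markers discussed in the text are listed.

```python
# Hypothetical table: which surface orders are accepted for each marker.
ALLOWED_ORDERS = {
    "then": {"S1 marker S2"},
    "because": {"S1 marker S2", "marker S2, S1"},
}


def observed_order(s1_start, marker_index, s2_start):
    """Classify the surface order from token positions in the original text."""
    if s1_start < marker_index < s2_start:
        return "S1 marker S2"
    if marker_index < s2_start < s1_start:
        return "marker S2, S1"
    return "other"


def keep_extraction(marker, s1_start, marker_index, s2_start):
    """Return True if a proposed (S1, marker, S2) extraction has an allowed order."""
    allowed = ALLOWED_ORDERS.get(marker)
    if allowed is None:
        return True  # no order constraint recorded for this marker in the sketch
    return observed_order(s1_start, marker_index, s2_start) in allowed


if __name__ == "__main__":
    # "Because it rained , we stayed home": marker at 0, S2 at 1, S1 at 4.
    print(keep_extraction("because", s1_start=4, marker_index=0, s2_start=1))
    # The same layout proposed for "then" is rejected as a likely parse error.
    print(keep_extraction("then", s1_start=4, marker_index=0, s2_start=1))
```

Rejecting a candidate costs a little recall but avoids training on pairs whose roles were assigned by a faulty parse.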

A.2 Frequency vs. Semantics in Classification Performance

Figure 3 reflects a model trained on a balanced subset of our training set. When the model can no longer rely on base rates of discourse markers to make judgments, overall accuracy drops from 68% to 47%. However, inspecting the confusion matrices shows very similar patterns of confusability, suggesting that training on unbalanced data does not greatly decrease sensitivity to non-frequency predictors.
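One simple way to build such a balanced subset is to cap the number of examples per discourse marker (the Figure 3 caption reports a cap of 13,421). The sketch below assumes examples are stored as (S1, S2, marker) tuples and that capped markers are subsampled uniformly at random; the function name `balance_by_cap` and both assumptions are ours for illustration and may differ from the actual preprocessing.

```python
import random
from collections import Counter, defaultdict


def balance_by_cap(examples, cap, seed=0):
    """Keep at most `cap` examples per marker; examples are (s1, s2, marker) tuples."""
    by_marker = defaultdict(list)
    for example in examples:
        by_marker[example[2]].append(example)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    balanced = []
    for marker in sorted(by_marker):
        group = by_marker[marker]
        if len(group) > cap:
            group = rng.sample(group, cap)  # uniform subsample without replacement
        balanced.extend(group)
    rng.shuffle(balanced)
    return balanced


if __name__ == "__main__":
    toy = [("a", "b", "and")] * 10 + [("c", "d", "but")] * 3
    subset = balance_by_cap(toy, cap=3)
    print(Counter(marker for _, _, marker in subset))
```

With every marker capped at the same count, a classifier can no longer gain accuracy simply by preferring frequent markers.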

To more quantitatively represent the connection between what the two models learn, we compute the correlation between the balanced confusions and the residuals of the unbalanced confusions (when predicted linearly from log frequency). These residuals account for 64% of the variance in the balanced confusions. That is, we can come close to predicting the balanced confusions from the unbalanced ones.
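The residual analysis can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: each confusion-matrix cell is flattened into one value, each cell is paired with a single log-frequency predictor, and the numbers in the usage example are invented toy values, not our data. The function names are hypothetical.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def linear_residuals(x, y):
    """Residuals of y after an ordinary least-squares fit y ~ intercept + slope * x."""
    mx, my = _mean(x), _mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = _mean(x), _mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


def variance_explained(balanced, unbalanced, log_freq):
    """Squared correlation between balanced confusions and frequency-corrected unbalanced ones."""
    residuals = linear_residuals(log_freq, unbalanced)
    return pearson_r(residuals, balanced) ** 2


if __name__ == "__main__":
    # Toy values only: the unbalanced confusions are a frequency trend plus the balanced pattern.
    log_freq = [1.0, 2.0, 3.0, 4.0]
    balanced = [1.0, -1.0, -1.0, 1.0]
    unbalanced = [0.5 * f + b for f, b in zip(log_freq, balanced)]
    print(variance_explained(balanced, unbalanced, log_freq))
```

Regressing out log frequency first removes the part of the unbalanced confusions that is attributable to base rates, so the remaining correlation reflects the non-frequency structure that the two models share.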

Figure 3: Balanced Classifier Confusion Matrix trained on a balanced subset of the All dataset where discourse markers are capped at 13,421 occurrences each. Each cell represents the proportion of instances of the actual discourse marker misclassified as the classified discourse marker. This proportion is log-transformed to highlight small differences. Discourse markers are arranged in order of frequency from left (least frequent) to right (most frequent).

Figure 4: Dependency patterns used for extraction for each discourse marker.