Data for discourse connective prediction.
Accurate prediction of suitable discourse connectives (however, furthermore, etc.) is a key component of any system aimed at building coherent and fluent discourses from shorter sentences and passages. As an example, a dialog system might assemble a long and informative answer by sampling passages extracted from different documents retrieved from the web. We formulate the task of discourse connective prediction and release a dataset of 2.9M sentence pairs separated by discourse connectives for this task. Then, we evaluate the hardness of the task for human raters, apply a recently proposed decomposable attention (DA) model to this task and observe that the automatic predictor has a higher F1 than human raters (32 vs. 30). Nevertheless, under specific conditions the raters still outperform the DA model, suggesting that there is headroom for future improvements. Finally, we further demonstrate the usefulness of the connectives dataset by showing that it improves implicit discourse relation prediction when used for model pre-training.
Discourse connectives, also referred to as discourse markers, discourse cues, or discourse adverbials, are used to bind together and to explicate the relation between pieces of text. A common language-class exercise is to fill in suitable connectives in a text in order to improve its flow. Similarly, it is important for computational summarization and text adaptation systems to be able to fill in suitable discourse connectives to produce natural-sounding utterances.
In this work, we study the problem of automatic discourse connective prediction. We limit ourselves to connectives which appear at the beginning of a sentence, linking the sentence to the preceding one. Even in this limited setting, an automatic discourse connective predictor has many concrete use cases. For example, in a question-answering setting it could help to generate answers by collating sentences from multiple sources. In extractive text summarization, it could be used to determine what is the best way to join two sentences that used to be separated by one or more sentences. As part of a text-authoring application, it could suggest suitable connectives at the beginning of a sentence.
In the literature, discourse connective prediction has recently been studied merely as an intermediate step for the well-studied problem of implicit discourse relation prediction [Xu et al. 2012, Zhou et al. 2010]. However, considering the aforementioned applications, we argue that connective prediction is an interesting and relevant problem in its own right.
The contributions of this work are twofold:
We present an extensive experimental study on the problem of discourse connective prediction and show that a recently proposed decomposable attention model [Parikh et al.2016] yields a good performance on this task. The model clearly outperforms a popular word-pair model and obtains a better performance than human raters on the same task and data.
We describe the dataset that we collected, consisting of 2.9 million adjacent sentence pairs (with and without a connective) extracted from the English Wikipedia. For 10 000 sentences, we also include connectives filled in by human raters. The dataset is publicly available at: https://github.com/ekQ/discourse-connectives
A few earlier works study discourse connective prediction alone, but recently it has been studied merely as an intermediate step for discourse relation prediction. Next we provide a brief overview of these two lines of work, starting from the latter.
Implicit discourse relation prediction has attracted considerable attention in recent years [Braud and Denis 2016, Liu and Li 2016, Qin et al. 2016, Qin et al. 2017, Rutherford and Xue 2015, Wu et al. 2017, Zhang et al. 2016]. Earlier, Pitler et al. (2008) showed that if a discourse connective is known, the explicit discourse relation can be inferred with 93.09% accuracy (although later work has shown that a single discourse connective can actually convey multiple discourse relations [Rohde et al. 2015, Rohde et al. 2016]), which has inspired several efforts at predicting connectives to improve implicit discourse relation prediction. Zhou et al. (2010) predicted connectives using an n-gram language model, whereas Xu et al. (2012) employed word pairs and a selection of linguistically informed features. Liu et al. (2016), on the other hand, showed that predicting both connectives and relations using a convolutional neural network in a multi-task setting improves the relation prediction performance.
Some earlier works have focused on connective prediction alone and developed various hand-crafted features for distinguishing between connectives. For example, Elhadad and McKeown (1990) explored pragmatic features for distinguishing between the connectives but and although, and between because and since. Later, Grote and Stede (1998) developed a specialized lexicon for discourse connectives based on the relevant constraints and preferences associated with the connectives. While these works do not present an experimental evaluation of the proposed systems, we evaluate our connective prediction models extensively in order to understand their applicability to real-life scenarios. Furthermore, we aim to learn the representations of the two arguments and their relationship automatically, which allows us to distinguish between a large set of connectives without the extensive manual effort required to craft features that separate them.
In addition to predicting the most suitable discourse connective, several methods have been developed for predicting the presence of a discourse connective [Yung et al.2017, Di Eugenio et al.1997, Patterson and Kehler2013]. We also predict the presence of a connective by considering [No connective] as one of the classes to be predicted.
We compile a list of 79 discourse connectives based on the Penn Discourse Treebank (PDTB) [Prasad et al.2008]. Since our focus is on sentence concatenation, we ignore the (forward) connectives, which typically point to the following sentence rather than the previous one, such as “After the election, […]”. However, for several ambiguous connectives, the forward use can be ruled out by requiring a comma after the connective (e.g. Instead,); we include such connectives in our data. Discontinuous connectives, such as “If […] then […]”, are not included.
Data samples for discourse connective prediction can be collected from any large unannotated text corpus. In this instance, we use the English Wikipedia (a snapshot from September 5, 2016) and collect every pair of consecutive sentences within the same paragraph where the latter sentence begins with one of the 79 discourse connectives. As a result, we obtain a dataset of 1.95 million sentence pairs separated by a connective. Additionally, we collect 0.91 million examples of consecutive sentences not separated by a discourse connective, labeled as [No connective], for a total of 2.86 million sentence pairs. (Note that our models are tested only on consecutive sentences, for which the ground-truth connectives are known, but they can also be applied to connect disjoint sentences.)
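To make the extraction concrete, the pair-labeling step can be sketched as follows. This is a simplified illustration: the connective list below is a tiny hypothetical stand-in for the full 79-entry list, and the boundary-matching rules are reduced to the essentials.

```python
# A tiny hypothetical stand-in for the full 79-connective list.
CONNECTIVES = ["however", "for example", "instead,", "meanwhile,", "by then"]

def label_pair(sent1, sent2):
    """Return (arg1, arg2 with connective removed, label) for one pair of
    consecutive sentences; pairs whose second sentence does not start
    with a known connective are labeled [No connective]."""
    lowered = sent2.lower()
    for conn in sorted(CONNECTIVES, key=len, reverse=True):  # longest match first
        if lowered.startswith(conn):
            rest = sent2[len(conn):]
            if rest[:1] in (" ", ","):  # require a word boundary after the match
                rest = rest.lstrip(" ,")
                # Upper-case the first character, as done for the rater task.
                return sent1, rest[:1].upper() + rest[1:], conn
    return sent1, sent2, "[No connective]"
```

In a full pipeline this function would be applied to every pair of consecutive sentences within a paragraph.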
The frequency distribution of the connectives is very skewed: however occurs 720 334 times, whereas else occurs only 43 times at the beginning of a sentence. In order to make the connective prediction task more feasible for the models and for human raters, we select a subset of sufficiently frequent and distinct connectives (e.g. for example is included but for instance is not, since it conveys the same meaning and is less frequent). The details of the selection process are omitted in the interest of space, but the resulting 19 connectives are listed in Table 1.
Table 1: Frequencies of the selected connectives in the dataset.

however        720 334    on the other hand   20 301
for example    111 711    in particular,      16 011
and             73 644    indeed,             15 286
meanwhile,      57 971    overall,             9 513
therefore       44 064    in other words       8 888
finally,        33 076    rather,              5 596
nevertheless    32 952    by contrast,         4 605
instead,        30 973    by then              4 279
moreover        25 583    otherwise,           3 563
Finally, we split the data into train, development, and test sets. We balance the connective classes, since in an unbalanced dataset most examples would be labeled as [No connective] and many connectives would be extremely under-represented, limiting the applicability of the resulting classifier. For the development and test sets, we pick 500 samples per connective (including [No connective]) by under-sampling without replacement. This results in two balanced datasets of 10 000 samples each. For the training set, we pick 20 000 samples per connective by under-sampling the majority classes and over-sampling the minority classes, creating a balanced dataset of 400 000 samples. Connective samples from a single Wikipedia article are not included in more than one of the three datasets, to avoid over-fitting through potential repetition within a single article.
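The balancing scheme can be sketched in a few lines. This is a simplified version: the article-level disjointness constraint between the three splits is omitted, and the function name and signature are illustrative.

```python
import random

def balance(samples_by_class, n_per_class, seed=0):
    """Under-sample classes with more than n_per_class samples (without
    replacement) and over-sample smaller classes (with replacement),
    yielding a balanced dataset."""
    rng = random.Random(seed)
    balanced = {}
    for label, samples in samples_by_class.items():
        if len(samples) >= n_per_class:
            # Majority class: draw a subset without replacement.
            balanced[label] = rng.sample(samples, n_per_class)
        else:
            # Minority class: draw with replacement until the quota is met.
            balanced[label] = [rng.choice(samples) for _ in range(n_per_class)]
    return balanced
```

With n_per_class = 20 000 over 20 classes, this yields the 400 000-sample balanced training set described above.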
In comparison with the PDTB dataset, which contains information about both discourse connectives and discourse relations, the main advantage of the collected dataset is its size. PDTB contains only 40 600 examples (1.4% of the size of the collected dataset), which causes sparsity issues [Li and Nenkova2014]. This can slow down the development of new models, particularly complex neural models that often require large training datasets to generalize well.
The decomposable attention (DA) model was recently introduced by parikh2016 for the natural language inference (NLI) problem which aims to classify entailment and contradiction relations between a premise and a hypothesis. Discourse connective prediction is related to the NLI problem since entailment and contradiction can be explicitly indicated by certain connectives (for instance, therefore and by contrast, respectively). However, the larger number of classes makes connective prediction more challenging. DA was shown to yield a state-of-the-art performance on the NLI task while requiring almost an order of magnitude fewer parameters than previous approaches. For all these reasons, it seems natural to apply the DA model to the connective prediction problem.
Marcu and Echihabi (2002) proposed to use word-pair features to predict discourse relations based on discourse connectives mapped to these relations. Similarly, many later implicit discourse relation prediction models are based on word-pair features [Marcu and Echihabi 2002, Pitler et al. 2009, Xu et al. 2012, Zhou et al. 2010] or aggregated word-pair features [Biran and McKeown 2013, Rutherford and Xue 2014]. Therefore, we use a model called WordPairs as a baseline for the DA model.
The DA model consists of three steps, attend, compare, and aggregate, which are executed by three feed-forward neural networks, F, G, and H, respectively. As input, the model takes two sentences, a and b, represented as sequences of word embeddings. The sequences are padded with “NULL” tokens to fix their lengths to 50 tokens.
In the attend step, the model computes non-negative attention scores for each pair of tokens across the two input sentences. This computation ignores the order of the tokens, and it produces soft-alignments from a to b and vice versa.
In the compare step, the model computes comparison vectors between each input token and its aligned sub-phrase. The aligned sub-phrase is a linear combination of the embedding vectors of the other sentence, weighted by the attention scores.
Finally, in the aggregate step, the comparison vectors are summed over the tokens of each sentence, and the aggregate vectors of the two sentences are concatenated. The resulting vector is fed into the third feed-forward network, which outputs a score for each class. The predicted class is the one with the highest score.
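The three steps can be illustrated with a minimal numpy sketch of a forward pass. The single-hidden-layer ReLU networks, weight shapes, and dot-product attention form below are simplifications for readability; the names F, G, and H stand for the three feed-forward networks of the model.

```python
import numpy as np

def softmax(x, axis):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def feed_forward(x, W1, W2):
    # One-hidden-layer ReLU network applied row-wise (stand-in for F, G, H).
    return np.maximum(x @ W1, 0) @ W2

def decomposable_attention(a, b, F, G, H):
    """Sketch of attend / compare / aggregate for sentences a (la x d)
    and b (lb x d); F, G, H are (W1, W2) weight pairs. Returns one
    unnormalized score per class."""
    # Attend: pairwise scores e[i, j] = F(a_i) . F(b_j), order-independent.
    e = feed_forward(a, *F) @ feed_forward(b, *F).T  # (la, lb)
    beta = softmax(e, axis=1) @ b                    # sub-phrase of b aligned to each a_i
    alpha = softmax(e, axis=0).T @ a                 # sub-phrase of a aligned to each b_j
    # Compare: each token is processed together with its aligned sub-phrase.
    v1 = feed_forward(np.concatenate([a, beta], axis=1), *G)
    v2 = feed_forward(np.concatenate([b, alpha], axis=1), *G)
    # Aggregate: sum over tokens, concatenate, and score the classes.
    v = np.concatenate([v1.sum(axis=0), v2.sum(axis=0)])
    return feed_forward(v[None, :], *H)[0]
```

The predicted connective is then the argmax over the returned score vector.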
The weights of the three networks are randomly initialized, after which the model is trained end-to-end. Our implementation of the DA model has the following differences compared to the original model described by Parikh et al. (2016): (i) we do not use the self-attention mechanism, which was reported to provide only a small improvement over the vanilla version of DA; (ii) we do not project down the embedding vectors but use 100-dimensional word2vec embeddings [Mikolov et al. 2013], which are updated during training; (iii) we use layer normalization [Ba et al. 2016], which makes the model converge faster.
The WordPairs model considers as features all word pairs which appear across the two arguments (e.g. word A appears in Arg 1 and word B in Arg 2) in at least five samples in the training dataset. Such features are employed by many implicit discourse relation prediction models [Marcu and Echihabi 2002, Pitler et al. 2009, Zhou et al. 2010, Xu et al. 2012]. Additionally, we incorporate single-word features (e.g. word A appears in Arg 2), since these slightly improved the results. With these binary features, we train logistic regressors using the one-vs-rest scheme to predict one of the 20 different connectives. (We trained two versions of the WordPairs model: one using stochastic gradient descent with mini-batches, and one using LIBLINEAR with 100k samples (i.e. 25% of the training data), which we could fit into the memory of a 256 GB machine. The reported results are based on the latter approach, which performed better.)
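The feature extraction behind the WordPairs baseline can be sketched as follows. The "pair:"/"arg2:" feature-name format is illustrative, and the minimum-frequency filter over the training set is assumed to be applied in a separate pass.

```python
from itertools import product

def word_pair_features(arg1_tokens, arg2_tokens):
    """Binary feature names for one sample: all cross-argument word pairs
    plus single-word features for Arg 2. Filtering to features seen in at
    least five training samples happens separately."""
    pairs = {f"pair:{w1}|{w2}" for w1, w2 in product(arg1_tokens, arg2_tokens)}
    singles = {f"arg2:{w}" for w in arg2_tokens}
    return pairs | singles
```

Each sample's feature set is then converted to a sparse binary vector for the one-vs-rest logistic regressors.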
Next we present experimental results on discourse connective prediction using human raters, the DA model and the WordPairs model. For this task, we remove the connective (if any) from the second sentence in each test pair, and measure the ability of the model (or the raters) to identify the removed connective.
To better understand what is reasonable to expect from an automatic predictor, we use a crowd-sourcing platform to ask human raters to reconstruct the removed connectives for each of the 10 000 test sentence pairs. Each sentence pair is annotated by three (not necessarily the same three) native English speakers. The raters are shown the two sentences, the latter of which starts with a [Connective goes here] placeholder, and asked to select the most suitable connective from the 20 options, including [No connective]. This layer of human annotations is also released as part of the connective dataset. The raters are instructed to pick the most natural connective in case there are multiple suitable options. Furthermore, they are asked to pick [No connective] only if adding a connective would make the concatenation sound ungrammatical or artificial, or if the two sentences seem to be completely disconnected. The sentences are not pre-processed apart from upper-casing the first character of the second sentence to avoid giving away the presence of a connective in the original sentence. The order of the connectives is randomized, except for [No connective] which is always shown last.
On the whole test set, human annotators achieve a macro-averaged F1 score of 23.72. The confusion matrix generated by the raters’ decisions is presented on the left side of Figure 1. It shows that the raters are strongly biased towards [No connective], despite the instruction to refrain from using it. A similar bias was observed by Rohde et al. (2016) for the task of filling in a suitable conjunction before a discourse connective. There are at least two possible explanations for this bias: (i) for the sake of clarity, and in line with common scientific writing guidelines, Wikipedia editors tend to use connectives quite generously, and (ii) the artificial balancing of the datasets makes [No connective] under-represented in the test data compared to the actual distribution of discourse connectives vs. [No connective]. The confusion matrix also shows that there are clusters of connectives that raters tend to confuse, even though they do not necessarily encode exactly the same relation; examples are rather, and instead,; for example and in particular,; and on the other hand and by contrast,. For 57.1% of the test questions there is a consensus among at least two raters, and for 11.4% all three raters agree on the most suitable connective.
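For reference, the macro-averaged F1 scores reported here can be computed directly from a confusion matrix like the ones in Figure 1; a small sketch (the 0-100 scaling matches how the scores are reported in this paper):

```python
import numpy as np

def macro_f1(confusion):
    """Macro-averaged F1 (on a 0-100 scale) from a square confusion
    matrix with rows = true class and columns = predicted class."""
    c = np.asarray(confusion, dtype=float)
    tp = np.diag(c)
    # Guard against division by zero for empty rows/columns.
    precision = tp / np.maximum(c.sum(axis=0), 1e-12)
    recall = tp / np.maximum(c.sum(axis=1), 1e-12)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return 100 * f1.mean()
```

Macro averaging weights every connective class equally, which is why it is a natural choice for the balanced test set.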
In this section, the DA model and the WordPairs model are employed to perform the same task as the human raters, i.e., learning to reconstruct the connective possibly removed from the beginning of the second sentence in each test pair. A balanced dataset is used both for training and for testing the models, as described in Section 3. The DA model is evaluated using the following hyper-parameters, optimized on the development set: network size (one hidden layer with 200 neurons), batch size (64), dropout ratios for the F, G, and H networks (0.68, 0.14, and 0.44, respectively), and learning rate (0.0018). The model is implemented in TensorFlow [Abadi et al. 2015] and the training is run for 300 000 batch steps. The results, reported in Table 2, show that DA clearly outperforms the WordPairs baseline, with an F1 score of 31.80 vs. 14.81.
Table 3 compares the accuracy of DA predictions to the rater decisions. The macro-averaged F1 score of human raters is 23.72 which is, quite surprisingly, lower than the F1 score of the DA model, 31.80. The difference is smaller when considering majority votes on the subset of 5 714 tasks for which there is a consensus among at least 2 out of 3 raters, which results in a 30.36 F1 score for the raters. On these less ambiguous cases, the model performance also increases to 32.68.
Table 3 (column layout): Setting | Raters (F1) | Model (F1)
As we mentioned in Section 5.1, human raters are clearly less eager to introduce a connective than Wikipedia editors. Therefore, we also evaluate the setting in which we exclude the questions for which either the ground-truth label, or the rater-assigned majority label, or the model-assigned label is [No connective]. The results, listed in the last line of Table 3, show that under these conditions human raters actually outperform the model.
The confusion matrix of DA is shown on the right side of Figure 1. For each connective, the true connective is the most frequent prediction. The connective on the other hand has the lowest F1 score (15.06), whereas by then has the highest (57.29). Some of the most frequent mistakes are between similar connectives, such as however vs. nevertheless, and instead, vs. rather,. These errors are by and large consistent with those of human raters (left side of the figure). This confirms that the model is accurately capturing the meaning of the relation, and when it does not select the gold connective it is making approximations similar to those people make. Furthermore, Figure 1 shows that raters have a more pronounced tendency to select frequent connectives, such as however and and. To further exemplify, in Table 4 we show a selection of wrong and correct decisions made by DA and human raters. A manual inspection of these and other examples shows that in some cases a larger context than the previous sentence is required for inferring the connective. For instance, to correctly decide whether finally, is more suitable than then, one may have to inspect a larger context.
An advantage of the DA model is that it is possible to examine which words the model attends to when inferring a connective. In some cases, the attended words are clearly meaningful semantically or linguistically, whereas in other cases the soft-alignment matrix that the model produces is harder to interpret. Examples of the former case are shown in Figure 2, which presents the alignment matrices from the tokens of the first sentence (x-axis) to the tokens of the second sentence (y-axis), normalized so that the rows sum to 1. In the left example, the model correctly predicts however as the connective after aligning the word attempt with refuse and not. These word pairs indicate contrast, which makes however a likely connective. In the right example, the model aligns the phrase was disqualified with had gone and correctly predicts by then as the connective. The corresponding tenses, i.e., past and past perfect, respectively, are likely clues of the presence of by then.
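Row-normalized alignment matrices of this kind can be obtained from the raw attention scores as follows. This is a sketch; `top_alignments` is a hypothetical inspection helper for reading off the strongest alignment per token.

```python
import numpy as np

def alignment_matrix(e):
    """Row-normalize raw attention scores e (la x lb) with a softmax so
    that each row sums to 1: the soft-alignment from tokens of the first
    sentence to tokens of the second."""
    e = np.asarray(e, dtype=float)
    z = np.exp(e - e.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def top_alignments(matrix, tokens_a, tokens_b):
    # For each token of the first sentence, the most strongly aligned
    # token of the second sentence (hypothetical inspection helper).
    return [(tokens_a[i], tokens_b[j]) for i, j in enumerate(matrix.argmax(axis=1))]
```

Plotting the matrix as a heatmap with the two token sequences on the axes reproduces the kind of visualization shown in Figure 2.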
We studied the problem of discourse connective prediction, which has many useful applications in text summarization, adaptation and conversationalization. We collected a dataset of 2.9 million pairs of consecutive sentences and connectives, and made it publicly available to facilitate further research on this problem, as well as other related bi-sequence classification tasks. We showed that the recently proposed decomposable attention model performs surprisingly well on the connective prediction task, even better than human raters on the same representative test set consisting of 10 000 samples. We also observed that, unlike the model, human raters have a preference for implicit connectives, as they do outperform the model if the comparison is restricted to the cases in which the majority of raters agrees on an explicit connective. The alignment matrices produced by the model suggest that the predictor is picking up relevant lexical, syntactic and semantic clues. The confusion matrix of the predictor shows very similar error patterns to the matrix generated from human raters, further confirming the meaningfulness of the decisions made by the model.
We would like to thank Cesar Ilharco for his help on running the experiments.
Abadi et al. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.