Encoder Decoder for DRR
Discourse relations bind smaller linguistic elements into coherent texts. However, automatically identifying discourse relations is difficult, because it requires understanding the semantics of the linked sentences. A more subtle challenge is that it is not enough to represent the meaning of each sentence of a discourse relation, because the relation may depend on links between lower-level elements, such as entity mentions. Our solution computes distributional meaning representations by composition up the syntactic parse tree. A key difference from previous work on compositional distributional semantics is that we also compute representations for entity mentions, using a novel downward compositional pass. Discourse relations are predicted not only from the distributional representations of the sentences, but also of their coreferent entity mentions. The resulting system obtains substantial improvements over the previous state-of-the-art in predicting implicit discourse relations in the Penn Discourse Treebank.
The high-level organization of text can be characterized in terms of discourse relations between adjacent spans of text (Knott, 1996; Mann, 1984; Webber et al., 1999). Identifying these relations has been shown to be relevant to tasks such as summarization (Louis et al., 2010; Yoshida et al., 2014) and coherence evaluation (Lin et al., 2011). While the Penn Discourse Treebank (PDTB) now provides a large dataset annotated for discourse relations (Prasad et al., 2008), the automatic identification of implicit discourse relations remains a difficult task, with state-of-the-art accuracy at roughly 40% (Lin et al., 2009).
One reason for this poor performance is that predicting implicit discourse relations is a fundamentally semantic task, and the relevant semantics may be difficult to recover from surface-level features. For example, consider the discourse relation between the two sentences in Example (1), where a discourse connective such as “because” would appropriately signal the relationship. Without a connective, however, there is little surface information to signal the relationship. We address this issue by applying a discriminatively-trained model of compositional distributional semantics to discourse relation classification (Socher et al., 2013; Baroni et al., 2014). The meaning of each sentence is represented as a vector (Turney et al., 2010), which is computed through a series of compositional operations over the parse tree.
Example (1): Bob gave Tina the burger. She was hungry.
Example (2): Bob gave Tina the burger. He was hungry.
We further argue that purely vector-based representations of sentences are insufficiently expressive to capture discourse relations. To see why, consider Example (2), which makes a small change to Example (1). After the subject of the second sentence is changed to refer to Bob, the original discourse relation no longer seems to hold. But despite the radical difference in meaning, the distributional representation of the second sentence will be almost unchanged: the syntactic structure remains identical, and the words “he” and “she” have very similar word representations. We address this issue by computing vector representations not only for each sentence, but also for each coreferent entity mention within the sentences. These representations are meant to capture the role played by the entity in the text. We compute entity-role representations using a novel feed-forward compositional model, which combines upward and downward passes through the syntactic structure. Representations for these coreferent mentions are then incorporated into a classification model, where they help to predict the implicit discourse relation. In combination, our approach achieves a 3% improvement in accuracy over the best previous work (Lin et al., 2009) on second-level discourse relation identification in the PDTB. For more details, please refer to the long version of this paper (Ji & Eisenstein, 2015).
Our model requires a syntactic parse tree, which is produced automatically by the Stanford CoreNLP parser (Klein & Manning, 2003). A reviewer asked whether it might be better to employ a left-to-right recurrent neural network, which would obviate the need for this language-specific resource. While it would clearly be preferable to avoid the use of language-specific resources whenever possible, we think this approach is unlikely to succeed in this case. A key difference between language and other types of data is that language has inherent recursive structure. A rich literature in both linguistics and natural language processing elaborates on the close relationship between (recursively-structured) syntax and semantics. Therefore, we see strong theoretical evidence, as well as practical evidence from the history of natural language processing, that syntactic parse structures are central to capturing the meaning in text.
Regarding the multilingual question, there are now accurate parsers and annotated treebanks for dozens of languages (see http://universaldependencies.github.io/docs/#language-other), and training accurate parsers for “low resource” languages is an active research topic, with substantial interest from both industry and academia. Languages differ substantially in the importance of word ordering, with English emphasizing word order more than most other languages (Bender, 2013). To our knowledge, it is an open question whether left-to-right recurrent neural networks will successfully extract meaning in languages where word order is freer.
We briefly describe our approach to entity-augmented distributional semantics and to discourse relation identification. We call our relation identification model disco2, since it is a distributional compositional approach to discourse relations.
The entity-augmented distributional semantics involves two passes in the composition procedure: an upward pass that computes the distributional representation of each sentence, and a downward pass that computes distributional representations of the entities shared between sentences.
Distributional representations for sentences are computed in a feed-forward upward pass: each non-terminal in the binarized syntactic parse tree has a $K$-dimensional distributional representation that is computed from the distributional representations of its children, bottoming out in representations of individual words. We follow the Recursive Neural Network (RNN) model proposed by Socher et al. (2011). Specifically, for a given parent node $i$, we denote the left child as $\ell(i)$ and the right child as $r(i)$. We compose their representations to obtain $\mathbf{u}_i = \tanh\left(\mathbf{U}[\mathbf{u}_{\ell(i)}; \mathbf{u}_{r(i)}]\right)$, where $\tanh(\cdot)$ is the element-wise hyperbolic tangent function, and $\mathbf{U} \in \mathbb{R}^{K \times 2K}$ is the upward composition matrix. We apply this compositional procedure from the bottom up, ultimately obtaining the sentence-level representation $\mathbf{u}_0$.
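The upward pass can be illustrated with a small sketch. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the nested-tuple tree encoding, the toy word embeddings, and the values of the composition matrix `U` are all hypothetical and chosen for readability, with a representation size of 2.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def upward(node, U, embeddings):
    """Return the upward vector of `node` in a binarized tree.

    Assumed encoding: a leaf is a word string, and an internal node is a
    (left, right) tuple.
    """
    if isinstance(node, str):
        return embeddings[node]  # recursion bottoms out at word vectors
    left, right = node
    # List concatenation plays the role of stacking the two child vectors.
    children = upward(left, U, embeddings) + upward(right, U, embeddings)
    return [math.tanh(z) for z in matvec(U, children)]

# Toy setup (assumed values): 2-dimensional vectors, so U is 2 x 4.
K = 2
U = [[0.5, 0.0, 0.5, 0.0],
     [0.0, 0.5, 0.0, 0.5]]
embeddings = {"she": [0.1, 0.2], "was": [0.0, 0.3], "hungry": [0.4, -0.1]}
tree = ("she", ("was", "hungry"))

u_root = upward(tree, U, embeddings)  # sentence-level representation
```

Because every internal node applies the same matrix to the concatenation of its two children, the root vector has the same dimensionality as a word vector regardless of sentence length.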
As seen in the contrast between Examples (1) and (2), a model that uses a single vector representation for each sentence would find little to distinguish between “she was hungry” and “he was hungry”. It would therefore almost certainly fail to identify the correct discourse relation for at least one of these cases, which requires tracking the roles played by the entities that are coreferent in each pair of sentences. To address this issue, we augment the representation of each sentence with additional vectors, representing the semantics of the role played by each coreferent entity in each sentence. Rather than represent this information in a logical form, which would require robust parsing to a logical representation, we represent it through additional distributional vectors. The role of a constituent $i$ can be viewed as a combination of information from two neighboring nodes in the parse tree: its parent $\rho(i)$, and its sibling $s(i)$. We can make a downward pass, computing the downward vector $\mathbf{d}_i$ from the downward vector of the parent $\mathbf{d}_{\rho(i)}$, and the upward vector of the sibling $\mathbf{u}_{s(i)}$: $\mathbf{d}_i = \tanh\left(\mathbf{V}[\mathbf{d}_{\rho(i)}; \mathbf{u}_{s(i)}]\right)$, where $\mathbf{V}$ is the downward composition matrix. The base case of this recursive procedure occurs at the root of the parse tree, which is set equal to the upward representation, $\mathbf{d}_0 = \mathbf{u}_0$.
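A sketch of the two passes together may make the recursion concrete. As before, this is an illustration under stated assumptions rather than the authors' code: nodes are keyed by their path from the root (a tuple of 0/1 choices), and the matrices `U` and `V` and the embeddings hold made-up values.

```python
import math

def compose(W, a, b):
    """Return tanh(W [a; b]) for list-based vectors a and b."""
    stacked = a + b
    return [math.tanh(sum(w * x for w, x in zip(row, stacked))) for row in W]

def upward_pass(node, U, emb, path=(), out=None):
    """Fill `out` with one upward vector per node, keyed by path from the root."""
    if out is None:
        out = {}
    if isinstance(node, str):
        out[path] = emb[node]
    else:
        left, right = node
        upward_pass(left, U, emb, path + (0,), out)
        upward_pass(right, U, emb, path + (1,), out)
        out[path] = compose(U, out[path + (0,)], out[path + (1,)])
    return out

def downward_pass(up, V):
    """Compute downward vectors from the parent's downward and sibling's upward vector."""
    down = {(): up[()]}  # base case: the root's downward vector is its upward vector
    for path in sorted(up, key=len):  # shorter paths first, so parents precede children
        if path:
            parent = path[:-1]
            sibling = parent + (1 - path[-1],)
            down[path] = compose(V, down[parent], up[sibling])
    return down

# Toy parameters (assumed values).
U = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
V = [[0.3, 0.0, 0.7, 0.0], [0.0, 0.3, 0.0, 0.7]]
emb = {"she": [0.1, 0.2], "was": [0.0, 0.3], "hungry": [0.4, -0.1]}
tree = ("she", ("was", "hungry"))

up = upward_pass(tree, U, emb)
down = downward_pass(up, V)
role_of_she = down[(0,)]  # depends on the root and on the sibling "was hungry"
```

In this sketch the downward vector of the mention never reads the mention's own word vector: it is built from the parent and the sibling, which is why it can describe the role the entity plays rather than the entity itself.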
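To predict the discourse relation between a sentence pair $(m, n)$, the decision function is a sum of bilinear products,

$$\psi(y) = \left(\mathbf{u}_0^{(m)}\right)^{\top} \mathbf{A}_y \mathbf{u}_0^{(n)} + \sum_{i,j \in \mathcal{A}(m,n)} \left(\mathbf{d}_i^{(m)}\right)^{\top} \mathbf{B}_y \mathbf{d}_j^{(n)} + b_y,$$

where the predicted relation is given by $\hat{y} = \arg\max_y \psi(y)$, and $\mathbf{A}_y$ and $\mathbf{B}_y$ are the classification parameters for relation $y$. Here $\mathbf{u}_0^{(m)}$ and $\mathbf{u}_0^{(n)}$ are the upward vectors at the roots of the two sentences, and $\mathbf{d}_i^{(m)}$ and $\mathbf{d}_j^{(n)}$ are the downward vectors of coreferent mentions. A scalar $b_y$ is used as the bias term for relation $y$, and $\mathcal{A}(m,n)$ is the set of coreferent entity mentions shared among the sentence pair $(m, n)$. For the cases where there are no coreferent entity mentions between two sentences, $\mathcal{A}(m,n) = \emptyset$, the classification model considers only the upward vectors at the root. We also include a surface feature vector in the decision function, as we find that this approach outperforms prior work on the classification of implicit discourse relations in the PDTB when combined with a small number of surface features.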
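The decision function can be sketched as follows. This is a hedged illustration only: the relation labels, the parameter values, and the helper names `bilinear`, `score`, and `predict` are hypothetical, and the surface feature term is omitted because its exact form is not given here.

```python
def bilinear(x, M, y):
    """Compute the bilinear product x^T M y for list-based vectors and matrix."""
    return sum(xi * sum(mij * yj for mij, yj in zip(row, y))
               for xi, row in zip(x, M))

def score(rel, u_m, u_n, entity_pairs, A, B, b):
    """Decision value for relation `rel`.

    u_m, u_n     : root upward vectors of the two sentences
    entity_pairs : list of (d_i, d_j) downward vectors for coreferent mentions;
                   an empty list leaves only the sentence-level term and bias
    """
    total = bilinear(u_m, A[rel], u_n) + b[rel]
    for d_i, d_j in entity_pairs:
        total += bilinear(d_i, B[rel], d_j)
    return total

def predict(relations, u_m, u_n, entity_pairs, A, B, b):
    """Return the relation with the highest decision value."""
    return max(relations, key=lambda rel: score(rel, u_m, u_n, entity_pairs, A, B, b))

# Toy parameters for two illustrative relation labels (all values assumed).
relations = ["Cause", "Contrast"]
A = {"Cause": [[1.0, 0.0], [0.0, 1.0]], "Contrast": [[-1.0, 0.0], [0.0, -1.0]]}
B = {"Cause": [[0.0, 1.0], [1.0, 0.0]], "Contrast": [[0.0, 0.0], [0.0, 0.0]]}
b = {"Cause": 0.0, "Contrast": 0.1}

u_m, u_n = [0.5, 0.2], [0.4, -0.1]
entity_pairs = [([0.3, 0.1], [0.2, 0.6])]

without_entities = score("Cause", u_m, u_n, [], A, B, b)
with_entities = score("Cause", u_m, u_n, entity_pairs, A, B, b)
best = predict(relations, u_m, u_n, entity_pairs, A, B, b)
```

Each relation has its own pair of matrices, so the same sentence and entity vectors can score differently under different relations; in a trained model these parameters would be learned jointly with the composition matrices.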
|Model||+Entity semantics||+Surface features||Accuracy(%)|
|1. Lin et al. (2009)||-||Yes||40.2|
|2. Surface feature model||-||Yes||39.69|
|significantly better than Lin et al. (2009) with|
We evaluate our approach on implicit discourse relation identification in the Penn Discourse Treebank (PDTB). PDTB relations may be explicit, meaning that they are signaled by discourse connectives (e.g., because); alternatively, they may be implicit, meaning that the connective is absent. We focus on the more challenging problem of classifying implicit discourse relations. With the aim of building a discourse parser in the future, we follow the experimental setting proposed by Lin et al. (2009), and evaluate our relation identification model on the second-level relation types.
We run the Stanford parser (Klein & Manning, 2003) and the Berkeley coreference system (Durrett & Klein, 2013) to obtain syntactic trees and coreference results, respectively. In the PDTB, each discourse relation is annotated between two argument spans. For an argument span that is not a full sentence, we identify the syntactic subtrees within the span, and construct a right-branching superstructure to unify them into a tree.
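One plausible reading of the right-branching superstructure is sketched below. The function name and the nested-tuple tree encoding are assumptions for illustration; the source does not specify how the subtrees are joined beyond the structure being right-branching.

```python
def right_branching(subtrees):
    """Join an ordered list of subtrees into one binary, right-branching tree.

    Assumed encoding: each subtree is a word string or a (left, right) tuple.
    For inputs [t1, t2, t3] the result is (t1, (t2, t3)).
    """
    if not subtrees:
        raise ValueError("at least one subtree is required")
    tree = subtrees[-1]
    # Attach the remaining subtrees from right to left, each as a new left child.
    for sub in reversed(subtrees[:-1]):
        tree = (sub, tree)
    return tree

# Toy argument span covered by three subtrees (assumed example).
span_subtrees = [("the", "burger"), "was", ("very", "good")]
unified = right_branching(span_subtrees)
```

The result is binary at every new node, so it can be fed to the same upward and downward composition used for full-sentence parse trees.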
Table 1 presents results for multiclass identification of second-level PDTB relations. As shown in lines 5 and 6, disco2 outperforms the prior state-of-the-art (line 1). The strongest performance is obtained by including the entity distributional semantics, with a 3.4% improvement over the accuracy reported by Lin et al. (2009). The improvement over our reimplementation of this work (line 2) is even greater, which shows that the distributional representation provides additional value over the surface features. The contribution of entity semantics is shown in Table 1 by the accuracy differences between lines 3 and 4, and between lines 5 and 6.
Discourse relations are determined by the meaning of their arguments, and progress on discourse parsing therefore requires computing representations of the argument semantics. We present a compositional method for inducing distributed representations not only of discourse arguments, but also of the entities that thread through the discourse. By jointly learning the relation classification weights and the compositional operators, this approach outperforms prior work based on hand-engineered surface features. More discussion and experimental results can be found in a forthcoming journal paper (Ji & Eisenstein, 2015).
Mann (1984). Discourse structures for text generation. In Proceedings of the 10th International Conference on Computational Linguistics and 22nd annual meeting on Association for Computational Linguistics, pp. 367–375. Association for Computational Linguistics, 1984.
Socher et al. (2011). Proceedings of the 28th International Conference on Machine Learning, pp. 129–136, 2011.
Turney et al. (2010). Journal of artificial intelligence research, 37(1):141–188, 2010.
Yoshida et al. (2014). Dependency-based Discourse Parser for Single-Document Summarization. In EMNLP, 2014.