Modeling the coherence of a paragraph or a long document is an important task that contributes to both natural language generation and natural language understanding. Intuitively, it involves dealing with logical consistency and topic transitions. As a subtask, sentence ordering aims to reconstruct a coherent paragraph from an unordered set of sentences. It has been shown to benefit several tasks, including retrieval-based question answering [28] and extractive summarization [1, 10], where erroneous sentence orderings may cause performance degradation. Therefore, it is of great importance to study sentence ordering.
Conventional approaches are rule-based or statistical ones, relying on handcrafted and sophisticated features. However, carefully designing these features requires not only high labor costs but also rich linguistic knowledge. Thus, it is difficult to transfer these methods to new domains or languages. Inspired by the recent success of deep learning, neural networks have been introduced to this task; representative work includes the window network [15], the neural ranking model [7], hierarchical RNN-based models [11, 16], and the deep attentive sentence ordering network (ATTOrderNet) [8]. Among these models, ATTOrderNet achieves state-of-the-art performance with the aid of multi-head self-attention [23], which learns a relatively reliable paragraph representation for subsequent sentence ordering.
Although ATTOrderNet has exhibited the best performance so far, it still has two drawbacks. First, it is based on fully-connected graph representations. Although such representations enable the network to capture structural relationships across sentences, they also introduce a lot of noise, because edges are created between any two sentences even when they are semantically incoherent. Second, the self-attention mechanism only exploits sentence-level information and applies the same set of parameters to quantify the relationship between sentences. It is therefore not flexible enough to exploit extra information, such as entities, which have proved crucial in modeling text coherence [3, 9]. Thus, we believe it is worth exploring a more suitable neural network for sentence ordering.
In this paper, we propose a novel graph-based neural sentence ordering model that adapts the recent graph recurrent network (GRN) [30]. Inspired by Guinaudeau and Strube [12], we first represent the input set of sentences (paragraph) as a sentence-entity graph, where each node represents either a sentence or an entity. Each entity node only connects to the sentence nodes that contain it, and two sentence nodes are linked if they contain the same entities. By doing so, our graph representations are able to model not only the semantic relevance between coherent sentences but also the co-occurrence between sentences and entities. We take the example in Figure 1 to illustrate the intuition behind our representations. Sentences sharing the same entities tend to be semantically close to each other: both S1 and S2 contain the entity “dad”, and thus they are more coherent than S1 and S4. Compared with the fully-connected graph representations explored previously [8], our graph representations reduce the noise caused by the edges between irrelevant sentence nodes. Another advantage is that the useful entity information can be fully exploited when encoding the input paragraph. Based on sentence-entity graphs, we then adopt GRN [30] to recurrently perform semantic transitions among connected nodes. In particular, we introduce an additional paragraph-level node to assemble the semantic information of all nodes during this process; the resulting paragraph-level representation helps information pass between nodes that are only connected over long distances. Moreover, since sentence nodes and entity nodes play different roles, we employ different parameters to distinguish their impacts. Finally, on the basis of the learned paragraph representation, a pointer network is used to produce the order of sentences.
The main contribution of our work lies in introducing GRN into sentence ordering, which can be divided into three aspects: 1) We propose a GRN-based encoder for sentence ordering. Our work is the first to explore such an encoder for this task. Experimental results show that our model significantly outperforms the state-of-the-art models. 2) We refine vanilla GRN by modeling sentence nodes and entity nodes with different parameters. 3) Through extensive experiments, we verify that entities are very useful in graph representations for sentence ordering.
2 Baseline: ATTOrderNet
In this section, we give a brief introduction to ATTOrderNet, which achieves state-of-the-art performance and is thus chosen as the baseline of our work. As shown in Figure 2, ATTOrderNet consists of a Bi-LSTM sentence encoder, a paragraph encoder based on multi-head self-attention [23], and a pointer network based decoder [25]. It takes a set of input sentences in an arbitrary order as input and tries to recover the correct order. Here M denotes the number of input sentences.
2.1 Sentence Encoding with Bi-LSTM
The Bi-LSTM sentence encoder takes the word embedding sequence of each input sentence and produces its semantic representation. At each step, the current forward and backward states are generated from the previous hidden states in the respective direction and the current word embedding. Finally, the sentence representation is obtained by concatenating the last states of the Bi-LSTM in both directions.
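Written out, this is the standard Bi-LSTM recurrence. The symbol names below ($x_j$ for the $j$-th word embedding, $n$ for the sentence length, $\mathbf{s}$ for the sentence representation) are notation assumed here for illustration, and taking the backward state at position 1 as its last state is likewise an assumption:

```latex
\begin{align}
\overrightarrow{h}_j &= \mathrm{LSTM}\big(\overrightarrow{h}_{j-1},\, x_j\big), \\
\overleftarrow{h}_j  &= \mathrm{LSTM}\big(\overleftarrow{h}_{j+1},\, x_j\big), \\
\mathbf{s} &= \big[\overrightarrow{h}_n ;\, \overleftarrow{h}_1\big].
\end{align}
```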
2.2 Paragraph Encoding with Multi-Head Self-Attention Network
The paragraph encoder consists of several self-attention layers followed by an average pooling layer. Given the representations of the input sentences, the initial paragraph representation is obtained by concatenating all sentence representations. Next, the initial representation is fed into the self-attention layers for updating: each layer, which comprises multi-head self-attention and feed-forward networks, takes the output of the previous layer as its input. Finally, an average pooling layer generates the final paragraph representation by averaging the per-sentence vector representations output by the last self-attention layer.
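In symbols, one way to write this encoder is the following. The names $\mathbf{s}_i$ (sentence representations), $M$ (number of sentences), $L$ (number of layers), $\Phi^{(l)}$ (the $l$-th layer), $\mathbf{p}^{(L)}_i$ (the output of the last layer for the $i$-th sentence), and $\mathbf{g}$ (final paragraph representation) are notation assumed here for illustration:

```latex
\begin{align}
P^{(0)} &= [\mathbf{s}_1; \mathbf{s}_2; \dots; \mathbf{s}_M], \\
P^{(l)} &= \Phi^{(l)}\big(P^{(l-1)}\big), \quad l = 1, \dots, L, \\
\mathbf{g} &= \frac{1}{M} \sum_{i=1}^{M} \mathbf{p}^{(L)}_i .
\end{align}
```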
2.3 Decoding with Pointer Network
After obtaining the final paragraph representation, an LSTM-based pointer network is used to predict the correct sentence order. Formally, the conditional probability of a predicted order given the input paragraph is factorized over decoding steps, where the distribution over sentences at each step is produced by the pointer network with learned model parameters. During training, the correct sentence order is known, so the decoder inputs are the sentence representations in the correct order. At test time, the decoder inputs correspond to the representations of sentences in the predicted order. At each step, the decoder state is updated recurrently by an LSTM that takes the representation of the previous sentence as input. The decoder state is initialized as the final paragraph representation, and the first-step input and initial cell memory are zero vectors.
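The decoding procedure above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the model's actual implementation: the additive scoring form `v^T tanh(W1 s_j + W2 h)` is the usual pointer-network attention and is assumed here, all parameter names (`Wx`, `Wh`, `b`, `W1`, `W2`, `v`) are hypothetical, the weights are random rather than trained, and greedy selection stands in for beam search.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, Wx, Wh, b):
    """One LSTM step; gates are stacked as [input, forget, output, candidate]."""
    d = h.shape[0]
    z = Wx @ x + Wh @ h + b
    i, f, o = sigmoid(z[:d]), sigmoid(z[d:2 * d]), sigmoid(z[2 * d:3 * d])
    cand = np.tanh(z[3 * d:])
    c_new = f * c + i * cand
    return o * np.tanh(c_new), c_new

def greedy_pointer_decode(S, g, params):
    """S: (M, d) sentence representations; g: (d,) paragraph representation."""
    M, d = S.shape
    h, c = g.copy(), np.zeros(d)   # state starts from the paragraph vector, cell memory from zeros
    x = np.zeros(d)                # first-step input is a zero vector
    chosen = np.zeros(M, dtype=bool)
    order = []
    for _ in range(M):
        h, c = lstm_step(x, h, c, params["Wx"], params["Wh"], params["b"])
        # Assumed additive pointer attention: v^T tanh(W1 s_j + W2 h) for every sentence j.
        scores = np.tanh(S @ params["W1"].T + params["W2"] @ h) @ params["v"]
        scores[chosen] = -np.inf   # each sentence may be selected only once
        j = int(np.argmax(scores))
        order.append(j)
        chosen[j] = True
        x = S[j]                   # the selected sentence is the next decoder input
    return order

# Toy run with random (untrained) parameters, so the order itself is meaningless.
rng = np.random.default_rng(0)
M, d = 5, 8
S = rng.normal(size=(M, d))
params = {
    "Wx": rng.normal(scale=0.1, size=(4 * d, d)),
    "Wh": rng.normal(scale=0.1, size=(4 * d, d)),
    "b": np.zeros(4 * d),
    "W1": rng.normal(scale=0.1, size=(d, d)),
    "W2": rng.normal(scale=0.1, size=(d, d)),
    "v": rng.normal(scale=0.1, size=d),
}
order = greedy_pointer_decode(S, S.mean(axis=0), params)
```

Masking already-selected sentences guarantees that the output is a permutation of the input indices, which is the property the ordering task requires.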
3 Our Model
In this section, we give a detailed description of our graph-based neural sentence ordering model, which consists of a sentence encoder, a graph neural network based paragraph encoder, and a pointer network based decoder. For fair comparison, our sentence encoder and decoder are identical to those of ATTOrderNet. Due to space limitations, we only describe our paragraph encoder here, which involves graph representations and graph encoding.
3.1 Sentence-Entity Graph
To take advantage of graph neural networks for encoding paragraphs, we need to represent input paragraphs as graphs. Different from the fully-connected graph representations explored previously [8], we follow Guinaudeau and Strube [12] to incorporate entity information into our graphs, where it serves as additional knowledge and is exploited to alleviate the noise caused by connecting incoherent sentences. To do this, we first consider all nouns of each input paragraph as entities. Since there can be numerous entities for very long paragraphs, we remove the entities that appear only once in the paragraph. As a result, we observe that a reasonable number of entities are generated for most paragraphs; more details are given in the experiments.
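The entity selection rule can be illustrated with a short Python sketch. It assumes the nouns of each sentence have already been extracted (for example by a parser); the function name `select_entities` and the toy paragraph are hypothetical.

```python
from collections import Counter

def select_entities(sentence_nouns):
    """Keep nouns that occur more than once in the whole paragraph.

    sentence_nouns: list of noun lists, one list per sentence (assumed pre-extracted).
    """
    counts = Counter(noun for nouns in sentence_nouns for noun in nouns)
    return {noun for noun, count in counts.items() if count > 1}

# Toy paragraph: "dad" and "dog" recur, the other nouns appear only once.
paragraph_nouns = [["dad", "park"], ["dad", "dog"], ["dog", "ball"], ["kite"]]
entities = select_entities(paragraph_nouns)
```

In this toy case the singleton nouns "park", "ball", and "kite" are discarded, so the last sentence ends up with no entity at all.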
Then, with the identified entities, we transform the input paragraph into a sentence-entity graph. As shown in Figure 3 (b), our sentence-entity graphs are undirected and consist of three components: sentence-level nodes, entity-level nodes, and edges. Every sentence-entity graph has two types of edges. An edge of the first type (SE) connects a sentence and an entity within it. Inspired by Guinaudeau and Strube [12], we label each SE-typed edge with the syntactic role of the entity in the sentence, which can be a subject (S), an object (O), or other (X). If an entity appears multiple times with different roles in the same sentence, we pick the highest-ranked role according to the ranking S > O > X. An edge of the second type (SS) connects two sentences that have common entities, and these edges are unlabeled. As a result, sentence nodes are connected to other sentence nodes and entity nodes, while entity nodes are only connected to sentence nodes.
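A minimal Python sketch of this graph construction is given below. It assumes each sentence comes with a list of (entity, role) mentions restricted to the selected entities; the function name `build_sentence_entity_graph` and the dictionary and set layout of the edges are illustrative choices, not the paper's data structures.

```python
from itertools import combinations

ROLE_RANK = {"S": 0, "O": 1, "X": 2}  # S outranks O, which outranks X

def build_sentence_entity_graph(mentions):
    """Build SE and SS edges from per-sentence (entity, role) mentions.

    Returns (se_edges, ss_edges):
      se_edges maps (sentence_index, entity) to the highest-ranked role label,
      ss_edges is a set of unordered sentence-index pairs that share an entity.
    """
    se_edges = {}
    for i, sentence in enumerate(mentions):
        for entity, role in sentence:
            key = (i, entity)
            # Keep only the highest-ranked role when an entity repeats in a sentence.
            if key not in se_edges or ROLE_RANK[role] < ROLE_RANK[se_edges[key]]:
                se_edges[key] = role

    sentences_of = {}
    for i, entity in se_edges:
        sentences_of.setdefault(entity, set()).add(i)

    ss_edges = set()
    for sentence_ids in sentences_of.values():
        ss_edges.update(combinations(sorted(sentence_ids), 2))
    return se_edges, ss_edges

mentions = [
    [("dad", "S"), ("dad", "X")],   # "dad" appears twice; the S role wins
    [("dad", "O"), ("dog", "S")],
    [("dog", "X")],
    [],                             # a sentence with no selected entity stays isolated
]
se_edges, ss_edges = build_sentence_entity_graph(mentions)
```

Sentences 0 and 2 share no entity, so no SS edge links them, even though both are linked to sentence 1.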
Figure 3 compares a fully-connected graph with a sentence-entity graph for the example in Figure 1. Within the fully-connected graph, there are several unnecessary edges that introduce noise, such as the one connecting S1 and S4. Intuitively, S1 and S4 do not form a coherent context. This is probably because they do not have common entities, especially given that they both share entities with other sentences. In contrast, the sentence-entity graph does not contain that edge and thus does not suffer from the corresponding noise. Another problem with the fully-connected graph is that every node is directly linked to all others, so the graph structure itself carries no information. Conversely, the structure of our sentence-entity graph can provide more discriminating information.
3.2 Encoding with GRN
To encode our graphs, we adopt GRN [30], which has been shown to be effective in various graph encoding tasks. GRN is a kind of graph neural network [19] that updates its node states in parallel and iteratively within a message passing framework. At every message passing step, the state update for each node involves two steps: a message is first calculated from its directly connected neighbors, and then the node state is updated by applying the gated operations of an LSTM step to the newly calculated message. Here we use a GRU instead of an LSTM for updating node states, for better efficiency and fewer parameters.
Figure 4 shows the architecture of our GRN-based paragraph encoder, which adopts a paragraph-level state in addition to the sentence states and the entity states. We consider the sentence nodes and the entity nodes as different semantic units, since they contain different amounts of information and have different types of neighbors. Therefore, we apply separate parameters and different gated operations to model their state transition processes, both following the two-step message passing process. To update a sentence state at a given step, messages from its neighboring sentence states and from its neighboring entity states are each calculated via a weighted sum, where the weights are gates computed from the edge label (if any) and the two associated node states using a single-layer network with a sigmoid activation. Then, the sentence state is updated by aggregating the two messages and the global state.
Similarly, at each step, each entity state is updated based on its word embedding, the states of its directly connected sentence nodes, and the global node.
Finally, the global state is updated with the messages from both sentence and entity states. In this way, each node absorbs richer contextual information through the iterative encoding process and captures the logical relationships with others. After the recurrent state transitions have run for the chosen number of iterations, we obtain the final paragraph state, which is used to initialize the state of the decoder (see Section 2.3).
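To make the flow of information concrete, the following NumPy sketch runs a few steps of gated message passing with separate GRU parameters for sentence, entity, and global states. It is a simplified illustration under several assumptions, not the exact update rules of the model: sentence and entity states share one size, edge labels are ignored, each gate is a scalar computed from the two node states, the global update uses mean pooling, and every parameter name is hypothetical. It also assumes at least one entity node.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_gru(rng, in_dim, d):
    """Random parameters for one GRU cell (update gate, reset gate, candidate)."""
    return {k: rng.normal(scale=0.1, size=(d, in_dim + d)) for k in ("z", "r", "h")}

def gru(p, x, h):
    """Batched GRU step: x is (n, in_dim), h is (n, d)."""
    xh = np.concatenate([x, h], axis=1)
    z = sigmoid(xh @ p["z"].T)
    r = sigmoid(xh @ p["r"].T)
    cand = np.tanh(np.concatenate([x, r * h], axis=1) @ p["h"].T)
    return (1 - z) * h + z * cand

def gated_messages(receivers, senders, adj, w_recv, w_send):
    """Sum of sender states, each weighted by a sigmoid gate and masked by adj."""
    gate = sigmoid((receivers @ w_recv)[:, None] + (senders @ w_send)[None, :])
    return (gate * adj) @ senders

def grn_step(Hs, He, g, emb, ss, se, P):
    """One parallel update of sentence states Hs, entity states He, and global state g."""
    M, N = se.shape
    m_ss = gated_messages(Hs, Hs, ss, P["a_ss"], P["b_ss"])    # sentence <- sentences
    m_se = gated_messages(Hs, He, se, P["a_se"], P["b_se"])    # sentence <- entities
    m_es = gated_messages(He, Hs, se.T, P["a_es"], P["b_es"])  # entity   <- sentences
    Hs_new = gru(P["sent"], np.concatenate([m_ss, m_se, np.tile(g, (M, 1))], axis=1), Hs)
    He_new = gru(P["ent"], np.concatenate([emb, m_es, np.tile(g, (N, 1))], axis=1), He)
    pooled = np.concatenate([Hs.mean(axis=0), He.mean(axis=0)])[None, :]
    g_new = gru(P["glob"], pooled, g[None, :])[0]
    return Hs_new, He_new, g_new

rng = np.random.default_rng(0)
M, N, d = 4, 2, 6
ss = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]], dtype=float)
se = np.array([[1, 0], [1, 1], [0, 1], [0, 0]], dtype=float)
P = {"sent": make_gru(rng, 3 * d, d), "ent": make_gru(rng, 3 * d, d),
     "glob": make_gru(rng, 2 * d, d)}
for name in ("a_ss", "b_ss", "a_se", "b_se", "a_es", "b_es"):
    P[name] = rng.normal(scale=0.1, size=d)

Hs = rng.normal(size=(M, d))   # stand-in for Bi-LSTM sentence representations
emb = rng.normal(size=(N, d))  # stand-in for entity word embeddings
He, g = emb.copy(), Hs.mean(axis=0)
for _ in range(3):             # three recurrent steps
    Hs, He, g = grn_step(Hs, He, g, emb, ss, se, P)
```

Because all nodes read the states from the previous step, the update is parallel across nodes. The global state gives the isolated fourth sentence a path to the rest of the paragraph even though it has no graph neighbors.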
Table 1: Overall results of each model on NIPS Abstract, ANN Abstract, arXiv Abstract, and SIND.
We first compare our approach with previous methods on several benchmark datasets.
NIPS Abstract. This dataset contains roughly 3K abstracts from NIPS papers from 2005 to 2015.
ANN Abstract. It includes about 12K abstracts extracted from the papers in the ACL Anthology Network (AAN) corpus [18].
arXiv Abstract. We further consider another source of abstracts collected from arXiv. It consists of around 1.1M instances.
SIND. It has 50K stories for the visual storytelling task (http://visionandlanguage.net/VIST/), which is in a different domain from the others. Here we use each story as a paragraph.
For data preprocessing, we first use NLTK to tokenize the sentences, and then adopt the Stanford Parser (https://nlp.stanford.edu/software/lex-parser.shtml) to extract nouns with syntactic roles for the edge labels (S, O or X). For each paragraph, we treat all nouns appearing more than once in it as entities. On average, each paragraph from NIPS Abstract, ANN Abstract, arXiv Abstract and SIND has 5.8, 4.5, 7.4 and 2.1 entities, respectively.
Our settings follow Cui et al. [8] for fair comparison. We use 100-dimensional GloVe word embeddings (https://nlp.stanford.edu/projects/glove/). The hidden size of the LSTM is 300 for NIPS Abstract and 512 for the others. For our GRN encoders, the state sizes for sentence and entity nodes are set to 512 and 150, respectively. The size of edge embeddings is set to 50. Adadelta [29] is adopted as the optimizer with an initial learning rate of 1.0. For regularization, we employ L2 weight decay and dropout with probability 0.5. The batch size is 16 for training, and beam search with size 64 is used for decoding.
We compare our model (SE-Graph) with existing state-of-the-art methods, including (1) LSTM+PtrNet [11], (2) Variant-LSTM+PtrNet [16], and (3) ATTOrderNet [8]. Their major difference lies in how they encode paragraphs: LSTM+PtrNet uses a conventional LSTM to learn the paragraph representation, Variant-LSTM+PtrNet is based on a set-to-sequence framework [24], and ATTOrderNet adopts the self-attention mechanism. Besides, in order to better study the effects of entities, we also list the performance of two variants of our model: (1) F-Graph. Similar to ATTOrderNet, it uses a fully-connected graph to represent the input unordered paragraph, but adopts GRN rather than self-attention layers to encode the graphs. (2) S-Graph. It is a simplified version of our model obtained by removing all entity nodes and their related edges from the original sentence-entity graphs. Correspondingly, all entity states (the entity-state terms in Equations 5, 6, 7 and 8) are also removed.
Following previous work, we use the following three major metrics:
Kendall’s tau (τ): It ranges from -1 (the worst) to 1 (the best). Specifically, it is calculated as 1 − 2 × (number of inversions) / C(n, 2), where n denotes the sequence length, C(n, 2) = n(n − 1)/2 is the number of sentence pairs, and the number of inversions is the number of pairs in the predicted sequence with incorrect relative order.
Accuracy (Acc): It measures the percentage of sentences whose absolute positions are correctly predicted. Compared with τ, it penalizes results that correctly preserve most relative orders but with a slight shift.
Perfect Match Ratio (PMR): It considers each paragraph as a single unit and calculates the ratio of exactly matching orders, so no partial credit is given for any incorrect permutations.
Obviously, these three metrics evaluate the quality of sentence ordering from different aspects, and thus their combination can give us a comprehensive evaluation on this task.
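The three metrics can be computed with a few lines of Python. This is an illustrative sketch, with hypothetical function names, in which an order is a list of sentence ids and Kendall's tau and accuracy are computed for a single paragraph rather than aggregated over a corpus.

```python
from itertools import combinations

def kendall_tau(pred, gold):
    """1 - 2 * inversions / C(n, 2); pred and gold are permutations of the same ids."""
    n = len(gold)
    if n < 2:
        return 1.0
    rank = {sid: pos for pos, sid in enumerate(gold)}
    # A pair is inverted when pred places it in the opposite relative order to gold.
    inversions = sum(1 for a, b in combinations(pred, 2) if rank[a] > rank[b])
    return 1.0 - 2.0 * inversions / (n * (n - 1) / 2)

def accuracy(pred, gold):
    """Fraction of sentences placed at their correct absolute position."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def perfect_match_ratio(preds, golds):
    """Fraction of paragraphs whose predicted order matches the gold order exactly."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

gold = [0, 1, 2, 3]
shifted = [3, 0, 1, 2]    # most relative orders are kept, but every position is off by one
reversed_order = [3, 2, 1, 0]
```

On the shifted prediction, three of the six pairs are inverted, so Kendall's tau is 0 while accuracy drops to 0, which illustrates why accuracy is the stricter of the two under a slight shift.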
4.2 Effect of Recurrent Step
The number of recurrent steps is an important hyperparameter of our model, so we use the validation set of our largest dataset (arXiv Abstract) to study its effect. Figure 5 shows the results. We observe large improvements when increasing the number of steps from 0 to 3, showing the effectiveness of our framework. Nevertheless, increasing it from 3 to 5 does not lead to further improvements while requiring more running time. Therefore, we use 3 recurrent steps for all experiments thereafter.
4.3 Main Results
Table 1 reports the overall experimental results. Our model exhibits the best performance across datasets in different domains, demonstrating its effectiveness and robustness. Moreover, we draw the following conclusions. First, based on the same fully-connected graph representations, F-Graph slightly outperforms ATTOrderNet on all datasets, even with fewer parameters and relatively fewer recurrent steps. This result supports the validity of applying GRN to encode paragraphs. Second, S-Graph performs better than F-Graph. This supports the hypothesis that leveraging entity information can reduce the noise caused by connecting incoherent sentences. Third, SE-Graph outperforms S-Graph on all datasets across all metrics. This is because incorporating entities as extra information and modeling the co-occurrence between sentences and entities further benefit our neural graph model. Considering that SE-Graph has slightly more parameters than S-Graph, we make a further analysis in Section 4.4 to show that the improvement given by SE-Graph does not come from the additional parameters.
Previous work has indicated that both the first and last sentences play special roles in a paragraph due to their crucial absolute positions, so we also report the accuracies of our models in predicting them. Table 2 summarizes the experimental results on arXiv Abstract and SIND, where SE-Graph and its two variants all outperform ATTOrderNet, with SE-Graph reaching the best performance. These results again confirm the advantages of our model.
4.4 Ablation Study
To investigate the impacts of entities and edges on our model, we adopt SE-Graph and S-Graph for further ablation studies, because both of them exploit entity information. Particularly, we continue to choose arXiv Abstract, the largest among our datasets, to conduct reliable analyses. The results are shown in Table 3, and we have the following observations.
First, shuffling edges significantly hurts the performance of both S-Graph and SE-Graph. The resulting PMR of S-Graph (42.41) is still comparable with the PMR of F-Graph (42.50, as shown in Table 1). Intuitively, shuffling edges introduces a lot of noise. Since F-Graph has the same number of parameters as S-Graph, these facts indicate that fully-connected graphs are also very noisy. We can therefore confirm our previous statement again: the entities help reduce noise. Second, removing edge labels leads to smaller performance drops than removing or shuffling edges. This is likely because some labels can be automatically learned by our graph encoder. Nevertheless, the labels still provide useful information. Third, there are only slight decreases for S-Graph and SE-Graph if we remove 10% of the entities. Removing entities is a way to simulate syntactic parsing noise, as our entities are obtained from the parsing results. This indicates the robustness of our model against potential drops in parsing accuracy on certain domains, such as medicine and chemistry. On the other hand, randomly removing 50% of the entities causes significant performance drops. As the model size remains unchanged, this shows the importance of introducing entities. In particular, the result of removing 50% of the entities for SE-Graph is slightly worse than that of the original S-Graph, demonstrating that SE-Graph’s improvement over S-Graph does not derive from simply introducing more parameters. Finally, the "share parameters" setting illustrates the effect of making both GRNs (Equations 6 and 7) share parameters. The result shows a drastic decrease in final performance, which is quite reasonable because entity nodes play fundamentally different roles from sentence nodes. Consequently, it is intuitive to model them separately.
5 Related work
Previous work on sentence ordering mainly focused on the utilization of linguistic features via statistical models [14, 4, 2, 3, 9, 12]. In particular, the entity-based models [2, 3, 12] have shown the effectiveness of exploiting entities for this task. Recently, studies have shifted to neural network based models, such as the window network [15], the neural ranking model [7], hierarchical RNN-based models [11, 16], and ATTOrderNet [8]. Compared with these models, we combine the advantages of modeling entities and GRN, obtaining state-of-the-art performance. Even without entity information, our model variant based on fully-connected graphs still shows better performance than the previous state-of-the-art model, indicating that GRN is a stronger alternative for this task.
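Table 3 (excerpt): Ablation results on arXiv Abstract.
| Setting | S-Graph Acc | S-Graph PMR | SE-Graph Acc | SE-Graph PMR |
| Remove edge labels | - | - | 58.51 | 43.96 |
| Remove 50% entities | 57.57 | 42.83 | 57.84 | 43.18 |
| Remove 10% entities | 57.79 | 43.26 | 58.67 | 44.17 |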
Graph Neural Networks in NLP.
6 Conclusion
We have presented a neural graph-based model for sentence ordering. Specifically, we first introduce sentence-entity graphs to model both the semantic relevance between coherent sentences and the co-occurrence between sentences and entities. Then, GRN is adopted on the built graphs to encode input sentences by performing semantic transitions among connected nodes. Compared with the previous state-of-the-art model, ours is capable of not only reducing the noise brought by relationship modeling between incoherent sentences, but also fully leveraging entity information for paragraph encoding. Extensive experiments on several benchmark datasets demonstrate the superiority of our model over the state-of-the-art and other baselines.
The authors were supported by Beijing Advanced Innovation Center for Language Resources, National Natural Science Foundation of China (No. 61672440), the Fundamental Research Funds for the Central Universities (Grant No. ZK1024), and Scientific Research Project of National Language Committee of China (Grant No. YB135-49).
References
[1] (2002) Inferring strategies for sentence ordering in multidocument news summarization. Journal of Artificial Intelligence Research 17, pp. 35–55.
[2] (2005) Modeling local coherence: an entity-based approach. In ACL, pp. 141–148.
[3] (2008) Modeling local coherence: an entity-based approach. Computational Linguistics 34 (1), pp. 1–34.
[4] (2004) Catching the drift: probabilistic content models, with applications to generation and summarization. In NAACL, pp. 113–120.
[5] (2017) Graph convolutional encoders for syntax-aware neural machine translation. In EMNLP, pp. 1957–1967.
[6] (2018) Graph-to-sequence learning using gated graph neural networks. In ACL, pp. 273–283.
[7] (2016) Neural sentence ordering. CoRR abs/1607.06952.
[8] (2018) Deep attentive sentence ordering network. In EMNLP, pp. 4340–4349.
[9] (2011) Extending the entity grid with entity-specific features. In ACL, pp. 125–129.
[10] (2012) Extractive multi-document summarization with integer linear programming and support vector regression. In COLING, pp. 911–926.
[11] (2016) End-to-end neural sentence ordering using pointer network. CoRR abs/1611.04953.
[12] (2013) Graph-based local coherence modeling. In ACL, pp. 93–103.
[13] (2004) Statistical significance tests for machine translation evaluation. In EMNLP, pp. 388–395.
[14] (2003) Probabilistic text structuring: experiments with sentence ordering. In ACL.
[15] (2014) A model of coherence based on distributed sentence representation. In EMNLP, pp. 2039–2048.
[16] (2018) Sentence ordering and coherence modeling using recurrent neural networks. In AAAI Conference on Artificial Intelligence, pp. 5285–5292.
[17] (2017) Encoding sentences with graph convolutional networks for semantic role labeling. In EMNLP, pp. 1506–1515.
[18] (2016) A bibliometric and network analysis of the field of computational linguistics. JASIST 67 (3), pp. 683–706.
[19] (2009) The graph neural network model. IEEE Trans. Neural Networks 20 (1), pp. 61–80.
[20] (2019) Semantic neural machine translation using AMR. CoRR abs/1902.07282.
[21] (2018) A graph-to-sequence model for AMR-to-text generation. In ACL, pp. 1616–1626.
[22] (2018) N-ary relation extraction using graph-state LSTM. In EMNLP, pp. 2226–2235.
[23] (2017) Attention is all you need. In NIPS, pp. 6000–6010.
[24] (2015) Order matters: sequence to sequence for sets. In ICLR.
[25] (2015) Pointer networks. In NIPS, pp. 2692–2700.
[26] (2018) Cross-lingual knowledge graph alignment via graph convolutional networks. In EMNLP, pp. 349–357.
[27] (2018) Graph2seq: graph to sequence learning with attention-based neural networks. CoRR abs/1804.00823.
[28] (2018) QANet: combining local convolution with global self-attention for reading comprehension. In ICLR.
[29] (2012) ADADELTA: an adaptive learning rate method. CoRR abs/1212.5701.
[30] (2018) Sentence-state LSTM for text representation. In ACL, pp. 317–327.