The TOEFL sentence insertion dataset used in InsertGNN.
Sentence insertion is a delicate but fundamental NLP problem. Current approaches in sentence ordering, text coherence, and question answering (QA) are neither suitable for nor good at solving it. In this paper, we propose InsertGNN, a simple yet effective model that represents the problem as a graph and adopts a Graph Neural Network (GNN) to learn the connections between sentences. It is supervised by both local and global information, so that the local interactions of neighboring sentences are also taken into account. To the best of our knowledge, this is the first recorded attempt to apply a supervised graph-structured model to sentence insertion. We evaluate our method on our newly collected TOEFL dataset and further verify its effectiveness on the larger arXiv dataset using cross-domain learning. The experiments show that InsertGNN outperforms the unsupervised text coherence method, the topological sentence ordering approach, and the QA architecture. Specifically, it achieves an accuracy of 70%.
Sentence ordering, first described by Barzilay and Lapata (2008), has drawn little interest during the past decades, partly because it is not as exciting as other big NLP topics such as machine translation and text generation. Also, to the best of our knowledge, there is no canonical solution exclusively designed for this problem. Besides, no previous researchers have compared their models' performance with human accuracy, which we believe is a meaningful standard benchmark for showing the strengths of NLP technology. In this work, we address these deficiencies. We design a global-local fused GNN unique to sentence insertion and construct a dataset from an English proficiency test (see Table 1), which has full records of examinees' accuracy. Our results strongly support our model's superiority over the baseline models and show that it approaches human performance on this exam.
Sentence ordering and QA are the two sub-fields most closely related to sentence insertion.
Sentence ordering, a more general case of sentence insertion, was first tackled with neural networks by Chen et al. (2016) using pair-wise order ranking. However, pair-wise ranking can introduce error propagation, which was later alleviated by the pointer network (Gong et al., 2016). The attention mechanism was then added by Logeswaran et al. (2016) and ATTOrderNet (Cui et al., 2018). However, we cannot simply transfer this pointer-network architecture to sentence insertion due to their distinct inputs. Putra and Tokunaga (2017) designed an unsupervised graph method based on sentence similarities and coherence within a text. They claim superiority over the supervised Entity Grid (Barzilay and Lapata, 2008) and the unsupervised Entity Graph (Guinaudeau and Strube, 2013). However, their method is not learnable and requires building a separate graph for every potential position of the taken-out sentence.
QA provides a more universal solution for our case (Devlin et al., 2019; Joshi et al., 2020; Yang et al., 2019; Yamada et al., 2020). The taken-out sentence and the remaining paragraph can be regarded as the question and context, respectively, which are then linearly combined and fed into a network that outputs the position. However, this linear combination makes it difficult for models such as transformers (Vaswani et al., 2017) to understand the inner logic between the newly concatenated paragraph and its slots.
To avoid this disadvantage of the QA architecture, we imitate the way people tackle this question type: they usually put the new sentence into each slot and check whether it is coherent. A directed graph can perfectly depict this pattern. GNNs (Scarselli et al., 2008) and Graph Convolutional Networks (GCNs) (Kipf and Welling, 2016) appear in many NLP tasks such as relation extraction (Vashishth et al., 2018) and text classification (Yao et al., 2019). The attention mechanism (Veličković et al., 2017) has also proved useful in areas such as knowledge graphs and recommendation (Wang et al., 2019). However, GNNs have seldom been adopted in QA; the closest use is in sentence ordering by Yin et al. (2019), who connect sentences through shared entities. We are the first to demonstrate the practicability of GNNs for the sentence insertion problem.
TOEFL, hosted globally by the Educational Testing Service (ETS), is one of the two largest exams that test the English level of non-native speakers. We choose it for two main reasons. Firstly, all TOEFL questions are extracted from academic articles and designed by language experts, and are therefore of high quality. Secondly, ETS annually publishes examinees' score reports. According to the latest summary by ETS (https://www.ets.org/s/toefl/pdf/94227_unlweb.pdf), the average accuracy in the reading section is 70.67%.
Every year ETS releases only a few publicly available articles in TOEFL Practice Online, which limits our attempt to build a large-scale dataset. We collected all questions since 2011 and obtained 156 useful samples with a nearly equal label distribution of 32%, 25%, 22%, and 21%. The collected TOEFL sentence insertion dataset can be downloaded from https://github.com/Wufang1997/TOEFL-Sentence-Insertion-Dataset.
We construct another dataset from arXiv to enrich the training samples, choosing the abstract as the contextual paragraph since it is independently readable and well edited, with strong logical clues. We discard abstracts containing fewer than 5 sentences or 300 words to keep the samples informative. Besides, we partially exclude categories that are not in the TOEFL scope, such as Computer Science. Moreover, several categories contain numerous mathematical formulations or physical symbols; these terms have no meaningful pretrained embeddings and should not be included in our supplementary dataset. After these selections, 5,965 abstracts remain.
The NLTK toolkit (Loper and Bird, 2002) is used to break each paragraph into sentences, from which one is randomly chosen as the sentence to be inserted. Three other positions are then selected to form a TOEFL-like problem. This operation can be repeated multiple times for each abstract, since there are dozens of non-redundant combinations. The key statistics of the two datasets are listed in Table 2.
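The construction above can be sketched in a few lines. This is a minimal illustration, not the exact pipeline: the paper uses NLTK's sentence tokenizer, whereas the naive period split here only keeps the example self-contained, and the sample paragraph is invented for demonstration.

```python
import random

def make_insertion_problem(paragraph, rng):
    """Build one TOEFL-like insertion problem from a paragraph.

    A naive period split stands in for NLTK's sentence tokenizer here.
    Returns the remaining sentences, the taken-out sentence, the four
    candidate slot positions, and the index of the correct slot.
    """
    sentences = [s.strip().rstrip(".") + "." for s in paragraph.split(". ") if s.strip()]
    # Randomly take one sentence out (never the first, so a lead-in remains).
    idx = rng.randrange(1, len(sentences))
    taken_out = sentences[idx]
    remaining = sentences[:idx] + sentences[idx + 1:]
    # The gap left by the removed sentence is the correct slot; sample three
    # other gap positions to obtain four candidate slots, TOEFL-style.
    candidates = [i for i in range(len(remaining) + 1) if i != idx]
    slots = sorted(rng.sample(candidates, 3) + [idx])
    label = slots.index(idx)
    return remaining, taken_out, slots, label

rng = random.Random(1234)
para = ("Graphs model pairwise relations. Nodes carry features. "
        "Edges carry structure. Learning propagates information. "
        "Pooling summarizes the graph. Readout yields predictions.")
remaining, taken_out, slots, label = make_insertion_problem(para, rng)
```

Because the gaps are sampled from non-redundant combinations, repeating this with different random draws yields several distinct problems per abstract, as noted above.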
In Preceding Adjacent Vertex (PAV), a weighted directed edge is established from each sentence to its preceding adjacent sentence. Single Similarity Vertex (SSV) discards the constraints of precedence and adjacency. Multiple Similar Vertex (MSV) further relaxes the singularity condition and allows multiple outgoing edges for each sentence as long as their corresponding similarity scores exceed a threshold. In the experiment, we use the same sentence encoder as our InsertGNN for the graph, instead of GloVe (Pennington et al., 2014).
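The MSV construction can be sketched as follows; this is an illustrative reading, not Putra and Tokunaga's exact code. The embeddings come from any sentence encoder, and the 0.3 threshold matches the value used in our experimental setup.

```python
import numpy as np

def msv_edges(embeddings, threshold=0.3):
    """Multiple Similar Vertex: add a directed edge from sentence i to any
    other sentence j whose cosine similarity with i exceeds the threshold.

    embeddings: (n_sentences, dim) array from a sentence encoder.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T  # pairwise cosine similarity matrix
    n = len(sim)
    return [(i, j) for i in range(n) for j in range(n)
            if i != j and sim[i, j] > threshold]
```

PAV would instead keep only the single edge to the preceding adjacent sentence, and SSV the single most similar sentence; MSV is the fully relaxed variant.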
Instead of framing sentence ordering as a sequence prediction problem, Prabhumoye et al. (2020) regard it as a constraint learning problem. We make only slight modifications to this topological sort approach. Sentences between two slots are represented as nodes with known constraints between them. A multi-layer perceptron (MLP) is used to predict the remaining constraints, i.e., the relative order between the taken-out sentence and the other sentences.
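Once the pairwise constraints are predicted, the slot follows by counting predicted predecessors. The sketch below shows only this decoding step; the MLP that produces the constraint predictions is abstracted away, and the snapping rule is our illustrative assumption.

```python
def slot_from_constraints(before_flags, slot_positions):
    """Infer the insertion slot from predicted pairwise order constraints.

    before_flags[i] is True if context sentence i is predicted to precede
    the taken-out sentence; the slot is the gap right after the last
    predicted predecessor, snapped to the nearest candidate slot.
    """
    insert_pos = sum(before_flags)  # number of sentences predicted to precede
    return min(range(len(slot_positions)),
               key=lambda k: abs(slot_positions[k] - insert_pos))
```

For example, if two of five context sentences are predicted to come first and the candidate slots sit at gaps 0, 2, 3, and 5, the decoder picks the slot at gap 2.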
We linearly combine the taken-out sentence and the paragraph as the input and extract the output vector of the [CLS] token as the final representation, followed by an MLP that forecasts the probabilities of the four slots. We name this straightforward architecture P-type (Plain). We also put the new sentence into each of the four possible slots and classify the differently filled paragraphs; the final prediction is the one with the highest probability. We name this architecture A-type (Altered).
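The A-type decoding loop can be sketched generically. The scorer below is a stand-in for the transformer classifier (any paragraph-to-score function works), so this shows the fill-and-score pattern rather than the actual model.

```python
def a_type_predict(context_sentences, taken_out, slot_positions, score_fn):
    """A-type (Altered) baseline: fill each candidate slot in turn, score
    every filled paragraph, and predict the slot with the highest score.

    score_fn stands in for the transformer-based classifier.
    """
    scores = []
    for pos in slot_positions:
        filled = context_sentences[:pos] + [taken_out] + context_sentences[pos:]
        scores.append(score_fn(" ".join(filled)))
    return max(range(len(scores)), key=scores.__getitem__)
```

P-type, by contrast, runs the model once on the plain concatenation and lets the MLP head choose among the four slots directly.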
We first split the paragraph into five parts according to the positions of the four slots. We pad the paragraph to make the graph complete if there is no leading sentence before the first slot or no ending sentence after the last slot. Then a graph (see Fig. 2) is built to describe all potential orders of the inserted paragraph, where each node represents a sentence and each directed edge represents a relative order: if two sentences are, or could possibly be, adjacent in order, there is a directed edge between them.
After that, we feed the five sentences from the split context paragraph and the taken-out sentence into a sentence encoder, obtaining a representation vector for each sentence. These vectors serve as the features of the corresponding nodes in the graph.
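The resulting graph can be written down as an adjacency matrix. This is our reading of the construction (we cannot reproduce Fig. 2 here): nodes 0 to 4 are the five context parts in fixed order, node 5 is the taken-out sentence, and each of the four slots contributes the two "possible" edges around it.

```python
import numpy as np

def build_slot_graph(n_parts=5):
    """Adjacency matrix for the insertion graph: nodes 0..n_parts-1 are the
    context parts, node n_parts is the taken-out sentence x.

    Known order gives part_k -> part_{k+1}; slot k+1 (between part_k and
    part_{k+1}) adds the possible edges part_k -> x and x -> part_{k+1}.
    """
    x = n_parts  # index of the taken-out sentence node
    adj = np.zeros((n_parts + 1, n_parts + 1), dtype=int)
    for k in range(n_parts - 1):
        adj[k, k + 1] = 1  # fixed order between adjacent parts
        adj[k, x] = 1      # x may directly follow part k
        adj[x, k + 1] = 1  # x may directly precede part k+1
    return adj
```

With five parts this yields twelve directed edges: four certain ones along the context and eight "possible" ones linking the taken-out sentence to each slot's neighbors.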
We first apply an $L$-layer Global Graph Neural Network (GGN) with attention (Veličković et al., 2017). The input is the set of node features $\{h_i\}$ produced by the sentence encoder. The attention score at layer $l$ is computed as $e_{ij} = \sigma\left(a^{\top}\left[W h_i \,\|\, W h_j\right]\right)$ for a center node $i$ and its neighbor $j$, where $\sigma$ is the activation function and $a$, $W$ are the trainable parameters. The attention weight is then obtained by $\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \exp(e_{ij}) / \sum_{k \in \mathcal{N}(i)} \exp(e_{ik})$. After that, the weights are used to update the features of node $i$ as $h_i' = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij} W h_j\right)$.
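A single attention layer of this kind can be sketched in NumPy, following the standard GAT formulation of Veličković et al. (2017) that the GGN adopts; the dense double loop and the tanh output activation are simplifications for clarity, not the paper's implementation.

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def gat_layer(h, adj, W, a):
    """One graph attention layer (GAT style).

    h: (n, d_in) node features; adj: (n, n) adjacency (self loops expected);
    W: (d_in, d_out) and a: (2 * d_out,) are the trainable parameters.
    """
    n = h.shape[0]
    Wh = h @ W
    # Attention logits e_ij = LeakyReLU(a^T [W h_i || W h_j]) over edges.
    e = np.full((n, n), -np.inf)
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                e[i, j] = leaky_relu(a @ np.concatenate([Wh[i], Wh[j]]))
    # Softmax over each node's neighborhood gives the attention weights.
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    # Weighted aggregation followed by a nonlinearity updates the features.
    return np.tanh(alpha @ Wh)
```

Stacking several such layers with multiple attention heads, residual connections, and dropout recovers the GGN configuration described in the experimental setup.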
The representations for the four slots in the last graph convolutional layer are forwarded into an MLP (shared with the global-local fusion stage) to get the prediction $\hat{y}_g$. In back-propagation, the binary cross-entropy loss is calculated: for a dataset with $N$ samples, the GGN loss is $\mathcal{L}_g = -\frac{1}{N}\sum_{n=1}^{N}\left[y_n \log \hat{y}_{g,n} + (1 - y_n)\log(1 - \hat{y}_{g,n})\right]$, where $y_n$ is the ground-truth label.
The GGN is capable of handling most situations but may ignore local details (Zhang et al., 2020). In our task, the answer can sometimes be determined by reading only the two sentences adjacent to the slot rather than the whole paragraph. We therefore use a Local Graph Neural Network (LGN) (Zhang et al., 2020) to concentrate on the local sentence interactions (see Fig. 3).
We create four sub-graphs, each containing only a slot and its two surrounding sentences, whose node features come from the output of the preceding GGN. The sub-graphs are fed into an $L$-layer parameter-shared GCN (Kipf and Welling, 2016), and we use the Weisfeiler-Lehman (WL) algorithm (Weisfeiler and Leman, 1968) to extract multi-scale subtree features. The output of each layer is treated as a WL fingerprint. Similar to DGCNN (Phan et al., 2018), we horizontally concatenate these fingerprints rather than calculating the WL graph kernel. We then apply a pooling layer followed by a fully-connected layer for graph classification. In back-propagation, we compute the binary cross-entropy loss $\mathcal{L}_l$ over the four sub-graphs.
At the final stage, the global and local representations are combined to fuse the global and local information; if a node is contained in more than one sub-graph, we take the average of its local representations. The fused features go through another GGN, and the output for the four answer slots is fed into the shared MLP to obtain the prediction $\hat{y}_f$. The binary cross-entropy loss $\mathcal{L}_f$ is then computed, and the total loss is the weighted sum of the three binary cross-entropy losses: $\mathcal{L} = \lambda_g \mathcal{L}_g + \lambda_l \mathcal{L}_l + \lambda_f \mathcal{L}_f$, where $\lambda_g$, $\lambda_l$, and $\lambda_f$ are the loss weights of the GGN, LGN, and fusion losses, respectively.
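The training objective can be sketched numerically. The specific weight values below are illustrative placeholders (the paper only states that the fusion term receives more weight), and the clipping constant is a standard numerical-stability choice, not from the paper.

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy over the four answer slots."""
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def total_loss(y, pred_g, pred_l, pred_f, w_g=1.0, w_l=1.0, w_f=2.0):
    """Weighted sum of the GGN, LGN, and fusion binary cross-entropy
    losses; the fusion term is given more weight (illustrative values)."""
    return w_g * bce(y, pred_g) + w_l * bce(y, pred_l) + w_f * bce(y, pred_f)
```

Because all three terms share the same ground truth, the fusion weight simply biases gradient flow toward the final global-local fused prediction.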
We use Sentence Transformer (Reimers and Gurevych, 2019) as the sentence encoder to summarize the content of sentences, which places sentences with similar meanings close together in vector space. It is first trained on Natural Language Inference (NLI) data and then fine-tuned on the Semantic Textual Similarity benchmark (STSb) train set. Besides, we use BERT (Devlin et al., 2018) and its two variants, DistilBERT (Sanh et al., 2019) and RoBERTa (Liu et al., 2019), for the QA architecture.
In the experiments, we fine-tune neither the baseline transformers nor the sentence transformer; we use them only as an embedding layer. For MSV, we choose a threshold of 0.3. For the GGN and LGNs, we use LeakyReLU and ReLU as the activation functions, respectively, and include dropout between layers with a rate of 0.5. Both have 1 hidden layer and 4 hidden units. For the GGN, we choose 16 hidden attention heads and 4 output attention heads, with an attention dropout of 0.6 and a LeakyReLU negative slope of 0.2. All layers have residual connections. For all MLPs, we use tanh as the activation function with no dropout. We adopt the Adam optimizer with a weight decay rate of 0.0005 and a random seed of 1234. The fusion loss $\mathcal{L}_f$ is given more weight so that the model focuses more on the global-local fusion information. We train InsertGNN for 100 epochs with a learning rate of 0.0001 and the BERT-based models for 200 epochs with a learning rate of 0.01.
We first test the unsupervised text coherence model. PAV attains the highest accuracy of 41.66% on our TOEFL dataset (see Table 3), in accord with Putra and Tokunaga (2017)'s evaluation. This indicates that local cohesion is more important than long-distance cohesion, in line with our motivation for designing the LGN.
Next, we test the performance of the supervised approaches. First, we use 0.05 as the validation proportion. The results show that our InsertGNN beats the other baselines, reaching an accuracy of 71.5% (see Table 4). This is comparable to the average human accuracy of 70.67%, indicating that our model performs at least as well as non-native English speakers.
However, this alone is not fully convincing, since with such a small validation split InsertGNN answers only one or two more questions correctly than the baselines. We therefore divide the TOEFL dataset into two halves, one for training and the other for validation. With only a two-digit number of training samples, InsertGNN still reaches an accuracy of 54.49%, clearly surpassing all baseline models.
We further evaluate InsertGNN using cross-domain learning. Models are first trained on the arXiv dataset (source domain) with a validation split ratio of 0.05 and then tested directly on the TOEFL dataset (target domain). InsertGNN still stands out, with an accuracy of 39.1% (see Table 5).
The TOEFL accuracy is lower because the contents of the two datasets differ slightly. ArXiv abstracts are brief summarizations and therefore very condensed. In contrast, TOEFL paragraphs are expanded narratives of a sub-point or detailed explanations, which are more elaborate and have stronger inner logic. This difference in writing style causes the accuracy drop in the cross-domain setting.
In this paper, we propose a novel sentence insertion GNN called InsertGNN and build a benchmark dataset from TOEFL. Our experiments provide strong evidence of the effectiveness of GNNs in this specific NLP task: InsertGNN surpasses the unsupervised text coherence methods, the topological sort approach, and existing transformer-based QA models. InsertGNN offers new perspectives for future model design when solving problems that can be depicted as a graph structure. The downside of our work is the scale of the TOEFL dataset; we leave it to future researchers to enlarge this dataset when ETS makes more sets of test passages publicly available.
Phan et al. (2018). DGCNN: A convolutional neural network over large-scale labeled graphs. Neural Networks, 108:533–543.
Yao et al. (2019). Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7370–7377.
Zhang et al. (2020). Global-local GCN: Large-scale label noise cleansing for face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7731–7740.