
InsertGNN: Can Graph Neural Networks Outperform Humans in TOEFL Sentence Insertion Problem?

Sentence insertion is a delicate but fundamental NLP problem. Current approaches in sentence ordering, text coherence, and question answering (QA) are neither suitable for nor good at solving it. In this paper, we propose InsertGNN, a simple yet effective model that represents the problem as a graph and adopts a Graph Neural Network (GNN) to learn the connections between sentences. It is supervised by both local and global information, so that the local interactions of neighboring sentences are taken into account. To the best of our knowledge, this is the first recorded attempt to apply a supervised graph-structured model to sentence insertion. We evaluate our method on our newly collected TOEFL dataset and further verify its effectiveness on the larger arXiv dataset using cross-domain learning. The experiments show that InsertGNN outperforms the unsupervised text coherence method, the topological sentence ordering approach, and the QA architecture. Specifically, it achieves an accuracy of 70






1 Introduction

Sentence ordering, which was first described by Barzilay and Lapata (2008), has drawn little interest over the past decades, partly because it is not as exciting as bigger NLP topics such as machine translation and text generation. To the best of our knowledge, there is no canonical solution designed exclusively for this problem. Moreover, previous researchers have not compared their models' performance with human accuracy, which we believe is a meaningful benchmark for showing the strengths of NLP technology. In this work, we address these deficiencies. We design a global-local fused GNN specific to sentence insertion and construct a dataset from an English proficiency test (see Table 1), which has full records of test takers' accuracy scores. Our results show that our model outperforms the baseline models and approaches human performance on this exam.

Sentence ordering and QA are the two most loosely related sub-fields to sentence insertion.

Sentence ordering, a more general case of sentence insertion, was first tackled with a neural network by Chen et al. (2016) using pair-wise order ranking. This pair-wise ranking can introduce error propagation, which was later alleviated by the pointer network (Gong et al., 2016). The attention mechanism was then added by Logeswaran et al. (2016) and in ATTOrderNet (Cui et al., 2018). However, we cannot transfer this pointer network architecture directly to sentence insertion, because the inputs of the two tasks differ. Putra and Tokunaga (2017) design an unsupervised graph method based on sentence similarities and coherence inside a text. They claim superiority over the supervised Entity Grid (Barzilay and Lapata, 2008) and the unsupervised Entity Graph (Guinaudeau and Strube, 2013). But their method is not learnable, and it needs to build a graph for every potential position of the taken-out sentence.

QA provides a more universal solution for our case (Devlin et al., 2019; Joshi et al., 2020; Yang et al., 2019; Yamada et al., 2020). The taken-out sentence and the remaining paragraph can be regarded as the question and the context respectively, which are then linearly combined and fed into a network that outputs the position. But this linear combination makes it hard for models such as transformers (Vaswani et al., 2017) to understand the inner logic between the newly concatenated paragraph and its slots.

To avoid this disadvantage of the QA architecture, we imitate the way people tackle this question type: they usually put the new sentence into each slot and check whether the result is coherent. A directed graph can naturally depict this pattern. GNN (Scarselli et al., 2008) and Graph Convolutional Networks (GCN) (Kipf and Welling, 2016) appear in many NLP tasks such as relation extraction (Vashishth et al., 2018), knowledge graphs (Nathani et al., 2019), text classification (Yao et al., 2019), and named entity recognition (Cetoli et al., 2017). The attention mechanism (Veličković et al., 2017) has also proved useful in areas such as knowledge graphs and recommendation (Wang et al., 2019). However, GNNs are seldom adopted in QA, although Yin et al. (2019) apply them to sentence ordering with shared entities. We are the first to demonstrate the practicability of GNNs for the sentence insertion problem.

2 Dataset

2.1 TOEFL Exams

TOEFL is one of the two largest exams for testing the English level of non-native speakers and is hosted globally by the Educational Testing Service (ETS). We choose it for two main reasons. First, all TOEFL questions are extracted from academic articles and designed by language experts, and are therefore of high quality. Second, ETS annually publishes score data reports for test takers. According to the latest summary by ETS, the average accuracy in the reading section is 70.67%.

Every year ETS releases only a few publicly available articles in TOEFL Practice Online, which limits our attempt to build a large-scale dataset. We collect all questions since 2011 and obtain 156 useful samples with a roughly balanced label distribution of 32%, 25%, 22%, and 21%. The collected TOEFL sentence insertion dataset is available for download.

2.2 ArXiv Dataset

We construct another dataset from arXiv to enrich the training samples, and choose the abstract as the contextual paragraph, since it is independently readable and well edited, with strong logical clues. We exclude abstracts containing fewer than 5 sentences or 300 words to keep the samples informative. We also drop part of the categories that are outside the TOEFL scope, such as Computer Science. Moreover, several categories contain a large amount of mathematical formulas or physical symbols. These terms have no meaningful pretrained embeddings and should not be included in our supplementary dataset. After this selection, 5965 abstracts remain.

The NLTK toolkit (Loper and Bird, 2002) is used to break each paragraph into sentences, and one sentence is randomly chosen as the sentence to be inserted. Three other positions are then selected to form a TOEFL-like problem. This operation can be repeated multiple times for each abstract, since there are dozens of non-redundant combinations. The key statistics of the two datasets are listed in Table 2.
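The construction step above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `make_insertion_problem`, the gap numbering, and the return format are assumptions. The sentence list is assumed to come from a sentence splitter such as NLTK's `sent_tokenize`, which is not called here so that the sketch stays self-contained.

```python
import random

def make_insertion_problem(sentences, rng, num_slots=4):
    """Turn an ordered list of sentences into one TOEFL-like insertion problem."""
    n = len(sentences)
    if n < num_slots:
        raise ValueError("need at least as many sentences as slots")
    # Remove one sentence at random; its index is also the gap it came from.
    k = rng.randrange(n)
    taken_out = sentences[k]
    context = sentences[:k] + sentences[k + 1:]
    # Gap g means "insert before context[g]"; gap n - 1 is the end of the paragraph.
    distractors = rng.sample([g for g in range(n) if g != k], num_slots - 1)
    slots = sorted(distractors + [k])
    # The label is the index of the true gap among the candidate slots.
    return context, taken_out, slots, slots.index(k)

# Hypothetical 7-sentence abstract; the seed value is only for reproducibility.
abstract = [f"Sentence {i}." for i in range(7)]
context, taken_out, slots, label = make_insertion_problem(abstract, random.Random(1234))
restored = context[:slots[label]] + [taken_out] + context[slots[label]:]
```

Calling the function repeatedly with different random states yields different taken-out sentences and distractor slots for the same abstract, which is how one abstract can supply several samples.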

Dataset | Size | Sentences | Words
TOEFL | 156 | 7.31 | 133.94
arXiv | 5965 | 7.24 | 121.36

TOEFL topics: Anthropology, Architecture, Astronomy, Economics, Biology, Chemistry, Communication, North America, Physics, Political Science, Psychology, Sociology, World History
arXiv topics: Astrophysics, Computer Vision and Pattern Recognition, Cryptography, Economics, General Relativity, High Energy Physics-Theory, Information Theory, Networking and Internet Architecture, Quantum Physics

Table 2: Dataset statistics, including the sample size, the average number of sentences in each paragraph, the average number of words in each paragraph, and the main topics (categories).

3 Baseline

3.1 Unsupervised Text Coherence Model

Putra and Tokunaga (2017) propose three algorithms to build an unsupervised coherence graph. The major difference between them is how edges are determined (see Fig 1).

In Preceding Adjacent Vertex (PAV), a weighted directed edge is established from each sentence to its preceding adjacent sentence. Single Similar Vertex (SSV) discards the constraints of precedence and adjacency. Multiple Similar Vertex (MSV) further relaxes the single-edge condition and allows multiple outgoing edges for each sentence, as long as the corresponding similarity score exceeds a threshold. In the experiment, we build the graph with the same sentence encoder as our InsertGNN instead of GloVe (Pennington et al., 2014).
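The three edge rules can be sketched as below. This is a simplified illustration under stated assumptions, not the reference implementation of Putra and Tokunaga (2017): cosine similarity is assumed as the edge weight, the fallback rules and the final coherence score of the original method are omitted, and `build_edges` is a hypothetical helper name.

```python
import numpy as np

def cosine_matrix(x):
    """Pairwise cosine similarity between row vectors."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

def build_edges(embeddings, method, threshold=0.3):
    """Return weighted directed edges as (source, target, weight) tuples."""
    sim = cosine_matrix(np.asarray(embeddings, dtype=float))
    n = sim.shape[0]
    edges = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        if method == "PAV":
            # Only the immediately preceding sentence is a valid target.
            targets = [i - 1] if i > 0 else []
        elif method == "SSV":
            # The single most similar sentence, anywhere in the text.
            targets = [max(others, key=lambda j: sim[i, j])] if others else []
        elif method == "MSV":
            # Every sentence whose similarity exceeds the threshold.
            targets = [j for j in others if sim[i, j] > threshold]
        else:
            raise ValueError(f"unknown method: {method}")
        edges.extend((i, j, float(sim[i, j])) for j in targets)
    return edges

# Toy 2-dimensional "sentence embeddings", purely illustrative.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]]
pav, ssv, msv = (build_edges(toy, m) for m in ("PAV", "SSV", "MSV"))
```

For sentence insertion, one such graph would be built for each candidate position of the taken-out sentence, and the position whose graph scores as most coherent would be chosen.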

Figure 1: Three graph construction algorithms. PAV only allows edges between a sentence and its preceding sentence. SSV allows edges between a sentence and any other sentences. MSV allows multiple edges.
Figure 2: InsertGNN.

3.2 Topological Sort

Instead of framing sentence ordering as a sequence prediction problem, Prabhumoye et al. (2020) regard it as a constraint learning problem. We make a small modification to this topological sort. The sentences between two slots are represented as nodes with known constraints between them. A multi-layer perceptron (MLP) is used to predict the remaining constraints, namely the relative order between the taken-out sentence and the other sentences.
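A minimal sketch of how known and predicted constraints could be turned into a position, using the standard-library `graphlib` module (Python 3.9 or later). The function `insert_position` and the boolean `before_flags` input, which stands in for the MLP predictions, are assumptions made for illustration; how inconsistent predictions are resolved in the actual baseline is not specified in the text, so the sketch simply returns `None` in that case.

```python
from graphlib import TopologicalSorter, CycleError

def insert_position(num_context, before_flags):
    """Place the taken-out sentence among context sentences 0..num_context-1.

    before_flags[i] is True if the taken-out sentence is predicted to come
    before context sentence i (a stand-in for the MLP output).
    """
    new = "new"
    sorter = TopologicalSorter()
    # Known constraints: context sentence i precedes sentence i + 1.
    for i in range(num_context - 1):
        sorter.add(i + 1, i)
    # Predicted constraints between the taken-out sentence and each context sentence.
    for i, is_before in enumerate(before_flags):
        if is_before:
            sorter.add(i, new)   # new precedes i
        else:
            sorter.add(new, i)   # i precedes new
    try:
        order = list(sorter.static_order())
    except CycleError:
        return None  # the predicted constraints contradict each other
    return order.index(new)
```

With four context sentences and flags `[False, False, True, True]`, the taken-out sentence is placed after the second context sentence.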

3.3 QA Architecture

We linearly combine the taken-out sentence and the paragraph as the input and extract the output vector of the [CLS] token as the final representation, followed by an MLP that predicts the probability of each of the four slots. We name this straightforward architecture P-type (Plain). We also put the new sentence into each of the four possible slots and classify the resulting filled paragraphs. The final prediction is the one with the highest probability. We name this architecture A-type (Altered).
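The A-type procedure can be sketched as below. This is an illustration under assumptions: `fill_slots` and `predict_slot` are hypothetical helper names, and `score` stands in for the transformer encoder plus MLP, which is not reproduced here.

```python
def fill_slots(parts, sentence):
    """Return one candidate paragraph per slot.

    Slot i lies between parts[i] and parts[i + 1]; empty (padded) parts are skipped.
    """
    candidates = []
    for slot in range(len(parts) - 1):
        filled = parts[:slot + 1] + [sentence] + parts[slot + 1:]
        candidates.append(" ".join(p for p in filled if p))
    return candidates

def predict_slot(parts, sentence, score):
    """Pick the slot whose filled paragraph receives the highest score."""
    scores = [score(c) for c in fill_slots(parts, sentence)]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy scorer that prefers the paragraph in which "X" follows "B" (illustrative only).
toy_score = lambda text: 1.0 if "B X C" in text else 0.0
best = predict_slot(["A", "B", "C", "D", "E"], "X", toy_score)
```

The P-type variant differs in that the model sees a single concatenated input and outputs all four slot probabilities at once.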

4 InsertGNN

4.1 Preliminary

We first split the paragraph into five parts according to the positions of the four slots. We pad the paragraph to make the graph complete if there is no leading sentence before the first slot or no ending sentence after the last slot. Then a graph (see Fig 2) is built to describe all potential orders of the inserted paragraph, where each node represents a sentence and each directed edge represents a relative order. If two sentences are connected, or are possibly connected, there is a directed edge between them.

After that, we feed the five parts of the split context paragraph and the taken-out sentence into a sentence encoder, obtaining a representation vector for each part of the paragraph and for the taken-out sentence. These vectors serve as the node features, one vector per node.
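One plausible reading of this graph is sketched below. It assumes that each of the four slots is its own node (initialized with the features of the taken-out sentence), so that a part is linked either directly to the next part or through the slot between them. The node numbering and the helper name `build_insert_graph` are assumptions; the exact layout is the one shown in Fig 2.

```python
def build_insert_graph(num_parts=5):
    """Build the directed edge list for parts 0..num_parts-1 and the slots between them."""
    num_slots = num_parts - 1
    part_ids = list(range(num_parts))
    slot_ids = [num_parts + s for s in range(num_slots)]
    edges = []
    for s in range(num_slots):
        left, right, slot = part_ids[s], part_ids[s + 1], slot_ids[s]
        edges.append((left, right))  # slot s stays empty: parts are adjacent
        edges.append((left, slot))   # slot s is filled: left part leads to the new sentence
        edges.append((slot, right))  # ... which leads on to the right part
    return part_ids, slot_ids, edges

parts, slots, edges = build_insert_graph()
```

Under this reading the graph has nine nodes and twelve directed edges, and the four slot nodes are the ones whose final representations are classified.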

4.2 Global Graph Convolutional Networks

We first apply a multi-layer Global Graph Neural Network (GGN) with attention (Veličković et al., 2017). The input is the set of node features produced by the sentence encoder. At each layer, an attention score is computed for a center node and each of its neighbors, using an activation function and a trainable parameter. The attention weights are then obtained by normalizing these scores over the neighbors, as in Veličković et al. (2017), and the weights are used to update the features of the center node.

The representations of the four slots in the last graph convolutional layer are forwarded into an MLP, shared with the global-local fusion stage, to obtain the prediction. In back-propagation, the binary cross-entropy loss between the prediction and the ground truth label is calculated over all samples in the dataset; this is the GGN loss.
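For reference, the standard single-head graph attention update of Veličković et al. (2017) and a binary cross-entropy loss can be written as follows. The notation is assumed, since the symbols are not recoverable from the text above: $h_i^{(l)}$ is the feature of node $i$ at layer $l$, $\mathcal{N}(i)$ its neighbors, $W^{(l)}$ and $a^{(l)}$ trainable parameters, $\sigma$ an activation function, and $\hat{y}_{n,k}$ the predicted probability of slot $k$ for sample $n$. Averaging over the $N$ samples and summing over the four slots is also an assumption.

```latex
% Attention score between center node i and neighbor j at layer l
e_{ij}^{(l)} = \sigma\!\left( a^{(l)\top} \left[ W^{(l)} h_i^{(l)} \,\Vert\, W^{(l)} h_j^{(l)} \right] \right)

% Attention weight: softmax over the neighbors of i
\alpha_{ij}^{(l)} = \frac{\exp\left( e_{ij}^{(l)} \right)}{\sum_{k \in \mathcal{N}(i)} \exp\left( e_{ik}^{(l)} \right)}

% Feature update of node i
h_i^{(l+1)} = \sigma\!\left( \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)} W^{(l)} h_j^{(l)} \right)

% Binary cross-entropy over N samples and four slots
\mathcal{L}_{\mathrm{GGN}} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{4}
  \left[ y_{n,k} \log \hat{y}_{n,k} + (1 - y_{n,k}) \log\left( 1 - \hat{y}_{n,k} \right) \right]
```

With multiple attention heads, as used in the experiments, the per-head outputs are typically concatenated in hidden layers and averaged in the output layer.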

4.3 Local Graph Convolutional Networks

GGN is capable of handling most situations but may ignore local details (Zhang et al., 2020). In our task, the answer can sometimes be determined by reading only the two sentences next to the slot rather than the whole paragraph. We use a Local Graph Neural Network (LGN) (Zhang et al., 2020) to concentrate on the local sentence interactions (see Fig 3).

We create four sub-graphs, each containing only a slot and its two surrounding sentences, whose node features come from the output of the previous GGN. The sub-graphs are fed into a multi-layer, parameter-shared GCN (Kipf and Welling, 2016), and we use the Weisfeiler-Lehman (WL) algorithm (Weisfeiler and Leman, 1968) to extract multi-scale subtree features. The output of each layer is treated as a WL fingerprint. Similar to DGCNN (Phan et al., 2018), we horizontally concatenate these fingerprints rather than calculating the WL graph kernel. We then apply a fixed-size pooling followed by a fully-connected layer for graph classification. In back-propagation, we compute the binary cross-entropy loss for the four sub-graphs.
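The per-layer concatenation can be sketched for a single sub-graph as below. This is an illustration, not the trained model: the weights are random, the sub-graph is treated as undirected with the symmetric normalization of Kipf and Welling (2016), and the pooling and classification layers are left out. The names `gcn_layer` and `local_fingerprint` are hypothetical.

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One GCN step: symmetric-normalized (A + I) H W followed by ReLU."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm @ feats @ weight, 0.0)

def local_fingerprint(feats, weights):
    """Run shared GCN layers on one 3-node sub-graph and concatenate the layer outputs."""
    # Nodes: 0 = left sentence, 1 = slot, 2 = right sentence (undirected here, an assumption).
    adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
    outputs, h = [], feats
    for w in weights:
        h = gcn_layer(adj, h, w)
        outputs.append(h)  # each layer output acts as one WL-style fingerprint
    return np.concatenate(outputs, axis=1)

rng = np.random.default_rng(1234)
feats = rng.normal(size=(3, 8))                               # illustrative node features
weights = [rng.normal(size=(8, 4)), rng.normal(size=(4, 4))]  # two illustrative layers
fingerprint = local_fingerprint(feats, weights)
```

Because the same `weights` list would be reused for all four sub-graphs, the layers are parameter-shared in the sense described above.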

Figure 3: The local graph convolutional networks.

4.4 Global-local Fusion

At the final stage, the global and local representations are combined to fuse the global and local information. If a node is contained in more than one sub-graph, we take the average of its local representations. The fused features go through another GGN, and the output for the four answer slots is fed into the shared MLP to obtain the final prediction. The binary cross-entropy loss is then computed, and the total loss is the weighted sum of three binary cross-entropy losses (the GGN loss, the LGN loss, and the fusion loss), each with its own loss weight.
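The averaging of shared nodes and the weighted total loss can be sketched as follows. The node numbering (parts 0 to 4, slots 5 to 8) and the helper names are assumptions, and the loss weights in the usage lines are placeholders rather than the values used in the experiments. How the averaged local features are merged with the global ones is not specified in the text and is not modeled here.

```python
import numpy as np

def average_local_features(subgraphs, num_nodes, dim):
    """Average LGN outputs per node; a node shared by two sub-graphs gets their mean."""
    total = np.zeros((num_nodes, dim))
    count = np.zeros(num_nodes)
    for node_ids, feats in subgraphs:
        for node, vec in zip(node_ids, feats):
            total[node] += vec
            count[node] += 1
    covered = count > 0
    total[covered] /= count[covered, None]
    return total

def total_loss(loss_global, loss_local, loss_fused, weights):
    """Weighted sum of the three binary cross-entropy losses."""
    w_global, w_local, w_fused = weights
    return w_global * loss_global + w_local * loss_local + w_fused * loss_fused

# Sub-graph s holds (left part s, slot 5 + s, right part s + 1); features are dummy constants.
subgraphs = [((s, 5 + s, s + 1), np.full((3, 2), float(s + 1))) for s in range(4)]
local = average_local_features(subgraphs, num_nodes=9, dim=2)
loss = total_loss(0.5, 0.4, 0.3, weights=(1.0, 1.0, 2.0))  # placeholder weights
```

In this toy setup, part 1 belongs to the first and second sub-graphs, so its averaged feature is the mean of the two dummy values.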

5 Experiment

5.1 Configurations

We use Sentence Transformer (Reimers and Gurevych, 2019) as the sentence encoder to summarize the content of sentences, which places sentences with similar meanings close together in vector space. It is first trained on Natural Language Inference (NLI) and then fine-tuned on the Semantic Textual Similarity benchmark (STSb) train set. In addition, we use BERT (Devlin et al., 2018) and two of its variants, DistilBERT (Sanh et al., 2019) and RoBERTa (Liu et al., 2019), for the QA architecture.

In the experiment, we fine-tune neither the baseline transformers nor the sentence transformer, and use them only as an embedding layer. For MSV, we choose a threshold of 0.3. For GGN and LGNs, we use leaky ReLU and ReLU as the activation function, respectively, and apply dropout between layers with a dropout rate of 0.5. Both have 1 hidden layer and 4 hidden units. For GGN, we choose 16 hidden attention heads and 4 output attention heads, with an attention dropout of 0.6 and a leaky ReLU negative slope of 0.2. All of them have a residual connection. For all MLPs, we use tanh as the activation function with no dropout. We adopt an Adam optimizer with a weight decay rate of 0.0005 and a random seed of 1234. We set the three loss weights so that the fusion loss is given more weight, letting the model focus more on the global-local fusion information. We train InsertGNN for 100 epochs with a learning rate of 0.0001 and the BERT-based models for 200 epochs with a learning rate of 0.01.

5.2 TOEFL

We first test the unsupervised text coherence model. PAV attains the highest accuracy of 41.66% on our TOEFL dataset (see Table 3), in accord with the evaluation of Putra and Tokunaga (2017). This indicates that local cohesion is more important than long-distance cohesion, in line with our motivation for designing the LGNs.

Methodology | Acc_TOEFL
PAV | 0.4166
SSV | 0.3462
MSV | 0.3718

Table 3: Unsupervised learning accuracy.

Next, we test the performance of the supervised approaches. First, we use 0.05 as the validation proportion. The result shows that our InsertGNN beats the other baselines, reaching an accuracy of 71.43% (see Table 4). It is comparable to the average human accuracy of 70.67%, suggesting that our model can perform at least as well as non-native English speakers.

However, this result alone is not convincing, since with such a small split InsertGNN only gets one or two more questions right than the baselines. We therefore divide the TOEFL dataset into two halves, one for training and the other for validation. With only a two-digit number of training samples, InsertGNN still reaches an accuracy of 54.49%, clearly ahead of all baseline models.

Methodology | Type | Acc_TOEFL (split 0.05) | Acc_TOEFL (split 0.5)
BERT | P | 0.4286 | 0.3846
BERT | A | 0.4286 | 0.3205
DistilBERT | P | 0.5714 | 0.3589
DistilBERT | A | 0.5714 | 0.2949
RoBERTa | P | 0.4286 | 0.3589
RoBERTa | A | 0.4286 | 0.2861
Topological Sort | | 0.5714 | 0.3462
InsertGNN | | 0.7143 | 0.5449

Table 4: Supervised learning accuracy. P and A refer to the P-type and the A-type QA structure. The left and right TOEFL accuracy columns correspond to validation split ratios of 0.05 and 0.5, respectively.
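The remark that the 0.05 split separates models by only one or two questions can be checked with a little arithmetic. The sketch below recovers the simplest fraction behind each accuracy reported for that split; the inference that the validation set holds 7 samples (156 × 0.05 = 7.8, rounded down) is an assumption drawn from those fractions, not a figure stated in the text.

```python
from fractions import Fraction

def nearest_fraction(accuracy, max_denominator=10):
    """Closest fraction to a reported accuracy with a small denominator."""
    return Fraction(accuracy).limit_denominator(max_denominator)

# Distinct accuracies in the 0.05-split column of Table 4.
reported = ["0.4286", "0.5714", "0.7143"]
recovered = [str(nearest_fraction(a)) for a in reported]
print(recovered)  # ['3/7', '4/7', '5/7']
```

All three values are multiples of 1/7, so the gap between InsertGNN (5/7) and the best baselines (4/7) corresponds to a single question, which is why the 0.5 split is the more informative comparison.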

5.3 Cross-domain Learning from ArXiv

We further evaluate InsertGNN using cross-domain learning. Models are first trained on the arXiv dataset (source domain) with a validation split ratio of 0.05 and then directly tested on the TOEFL dataset (target domain). InsertGNN still stands out with an accuracy of 39.10% (see Table 5).

Methodology | Type | Acc_arXiv | Acc_TOEFL
BERT | P | 0.3474 | 0.3397
BERT | A | 0.3263 | 0.3013
DistilBERT | P | 0.3522 | 0.3397
DistilBERT | A | 0.3133 | 0.3076
RoBERTa | P | 0.3356 | 0.3269
RoBERTa | A | 0.3255 | 0.3141
Topological Sort | | 0.4362 | 0.2885
InsertGNN | | 0.4631 | 0.3910

Table 5: Cross-domain learning accuracy from the arXiv dataset to the TOEFL dataset. The left and right columns correspond to the accuracy on the arXiv test set and on the whole TOEFL dataset, respectively.

The TOEFL accuracy is lower because the contents of the two datasets differ. ArXiv abstracts are brief summaries and therefore very condensed. In contrast, TOEFL paragraphs are an expanded narrative of a sub-point or a detailed explanation, which is more elaborate and has a stronger inner logic. This difference in writing style causes a decrease in accuracy when models are applied across domains.

6 Conclusion and Future Work

In this paper, we propose a novel sentence insertion GNN called InsertGNN and build a benchmark dataset from TOEFL. In the experiments, we provide evidence of the effectiveness of GNNs in this specific NLP task. InsertGNN surpasses the unsupervised text coherence methods, the topological sort approach, and existing transformer-based QA models. It offers new perspectives on future model design for problems that can be depicted and modified as a graph structure. The downside of our work is the scale of the TOEFL dataset. We leave it to other researchers to replenish and enlarge this dataset in the future, when ETS makes more sets of test passages publicly available.