Matching Long Text Documents via Graph Convolutional Networks

by   Bang Liu, et al.

Identifying the relationship between two text objects is a core research problem underlying many natural language processing tasks. A wide range of deep learning schemes have been proposed for text matching, mainly focusing on sentence matching, question answering or query document matching. We point out that existing approaches do not perform well at matching long documents, which is critical, for example, to AI-based news article understanding and event or story formation. The reason is that these methods either omit or fail to fully utilize complicated semantic structures in long documents. In this paper, we propose a graph approach to text matching, especially targeting long document matching, such as identifying whether two news articles report the same event in the real world, possibly with different narratives. We propose the Concept Interaction Graph to yield a graph representation for a document, with vertices representing different concepts, each being one or a group of coherent keywords in the document, and with edges representing the interactions between different concepts, connected by sentences in the document. Based on the graph representation of document pairs, we further propose a Siamese Encoded Graph Convolutional Network that learns vertex representations through a Siamese neural network and aggregates the vertex features through Graph Convolutional Networks to generate the matching result. Extensive evaluation of the proposed approach based on two labeled news article datasets created at Tencent for its intelligent news products shows that the proposed graph approach to long document matching significantly outperforms a wide range of state-of-the-art methods.


1. Introduction

Semantic matching, which aims to model the underlying semantic similarity or relationship among different textual elements such as sentences and documents, plays a central role in many Natural Language Processing (NLP) applications, including question answering (Yu et al., 2014), automatic text summarization (Ponzanelli et al., 2015), top-$k$ re-ranking in machine translation (Brown et al., 1993), as well as information organization (Liu et al., 2017). Although a wide range of shallow and deep learning techniques (Hu et al., 2014; Qiu and Huang, 2015; Wan et al., 2016; Pang et al., 2016) have been proposed to match sentence pairs, question-answer pairs, or query-document pairs, it is still challenging to match a pair of (long) text documents: the rich semantic and syntactic structures in text documents make the task increasingly difficult as document lengths increase. For example, news articles from different news agencies may report the same physical incident in the real world from different perspectives, possibly with different wording and narratives. Yet, accurately identifying the relationship between long documents is a critical capability expected in next-generation AI-based news systems, which should automatically organize vast amounts of daily Internet news articles into events and stories (Liu et al., 2017). This capability, if developed, can largely assist or replace the tedious daily routine work performed by human editors at Internet media organizations.

Traditional approaches to text matching represent text documents as vectors in terms of the term frequency-inverse document frequency (TF-IDF), LDA (Blei et al., 2003) and so forth, and estimate the semantic distances between documents via unsupervised metrics. However, such approaches are not sufficient as they do not take the semantic structures of natural language into consideration. In recent years, a wide variety of deep neural network models based on word-vector representations have been proposed for text matching, e.g., (Hu et al., 2014; Qiu and Huang, 2015; Wan et al., 2016; Pang et al., 2016). One category of deep network models (Qiu and Huang, 2015; Wan et al., 2016) takes the word embedding sequences of a pair of text objects as the input, and adopts a Siamese convolutional or recurrent neural network to transform the input into intermediate contextual representations, on which the final scoring is performed. Another category of deep models (Hu et al., 2014; Pang et al., 2016) focuses on the interactions between each word in one text object and each word in the other text object, and aggregates all the pairwise interactions, e.g., using convolutional neural networks (CNNs), to yield a matching score. However, this paper shows that existing deep neural network models do not perform satisfactorily when matching long documents, since the rich structural information inherent in long documents is not taken into account. In other words, existing methods mainly match short text snippets at the word level or word-vector level, omitting the complex interactions among the sentences, keywords and phrases present in any long document.

In this paper, we propose a novel graphical approach to text matching. We argue that the appropriate semantic representation of documents plays a central role in matching long text objects. A successful semantic matching algorithm critically depends on a novel document representation, beyond the linear word-vector representations, that can capture the complex interactions among sentences and concepts in an article. We propose a novel graphical document representation named Concept Interaction Graph, which represents a document by an undirected weighted graph, with each vertex denoting a concept (i.e., a community of highly coherent keywords) in the document, and the sentences closely related to that concept representing the features of the vertex. Moreover, the edge between a pair of vertices indicates the level of interaction/connection between the two concepts (through sentences). By restructuring documents into Concept Interaction Graphs, we decompose the semantic focuses in each document into interacting concepts. The task of matching two documents is therefore converted into a graph matching problem.

To compare two Concept Interaction Graphs, we propose a new deep neural network model, named Siamese Encoded Graph Convolutional Network (SE-GCN), combining the strengths of Siamese architectures with the Graph Convolutional Network (GCN) (Kipf and Welling, 2016; Defferrard et al., 2016), an emerging variant of CNN that operates directly on graphs. Specifically, we combine the Concept Interaction Graphs of a pair of documents into one unified graph by including all vertices and, for each vertex in the unified graph, grouping the features from the two graphs, which amounts to a concatenation of the sentence subsets related to this concept from both documents. We introduce a Siamese architecture to encode the concatenated features on each vertex into a match vector. The unified graph obtained this way is subsequently passed through multiple layers of GCN to yield a final matching score. This way, our model factorizes the matching process between two pieces of text into the sub-problems of matching corresponding semantic unit pairs in the two documents.

We performed extensive evaluation on two large datasets of long Chinese news article pairs that were collected from major Internet news providers in China, including Tencent, Sina, WeChat, Sohu, etc., in a two-month period from October 1, 2016 to November 30, 2016, covering diverse topics in the open domain. The datasets also contain ground truth labels that indicate whether a pair of news articles talk about the same event and whether they belong to the same story (a notion larger than events). They are created by the editors and product managers at Tencent for algorithm evaluation purposes. (Footnote 1: As long text document matching is a relatively new problem and the related datasets are lacking, we are currently in the process of publishing these news article datasets to the public for research purposes.) Compared with a wide range of state-of-the-art shallow and deep text matching algorithms that do not take the structural interactions of semantic units into account, our proposed algorithms achieve significant improvements through the use of a graphical representation of documents.

To the best of our knowledge, this is not only the first work that provides a graphical approach to long text document matching, but also the first work that adapts the GCN structure to identify the relationship between a pair of graphs, whereas previously, different GCNs have been mainly used for completing missing attributes/links (Kipf and Welling, 2016; Defferrard et al., 2016) or for node clustering/classification (Hamilton et al., 2017), but all within a single graph, e.g., a knowledge graph, citation network or social network.

The remainder of this paper is organized as follows. Sec. 2 presents our proposed Concept Interaction Graph for document representation. Sec. 3 presents our proposed Siamese Encoded Graph Convolutional Network for text pair matching based on the derived graphical representation. In Sec. 4, we conduct extensive performance evaluations of the proposed models and algorithms based on two large datasets created at Tencent for its intelligent news products. We review the related literature in Sec. 5 and conclude the paper in Sec. 6.

2. Concept Interaction Graph

Figure 1. An example to show a piece of text and its corresponding Concept Interaction Graph representation.
Figure 2. An overview of the procedure of constructing the (joint) Concept Interaction Graph (CIG) to match a pair of documents.

In this section, we present our Concept Interaction Graph (CIG) to represent a document as a weighted undirected graph, which decomposes a document into subsets of sentences, focusing on different sub-topics or concepts. Such a graph representation proves to be effective at uncovering the underlying attention structure of a long text document such as a news article, which will help with text matching.

We first describe our desired structure for a concept interaction graph before presenting the detailed steps to derive it. Given a document $\mathcal{D}$, our objective is to obtain a graph representation $G_{\mathcal{D}}$ of $\mathcal{D}$. Each vertex in $G_{\mathcal{D}}$ is called a concept, which is a community of highly correlated keywords in document $\mathcal{D}$. Each sentence in $\mathcal{D}$ will be assigned onto the one concept vertex that is the most related to the sentence. We link two vertices by an edge if the similarity (e.g., TF-IDF similarity) of the sentence sets attached to the two vertices, respectively, is above a threshold.

As a toy example, Fig. 1 illustrates how we convert a document into a Concept Interaction Graph. We can extract the keywords Rick, Morty, Summer, and Candy Planet from the document using standard keyword extraction algorithms (Mihalcea and Tarau, 2004). These keywords are further clustered into three concepts, where each concept is a subset of keywords that are highly correlated with each other. After grouping keywords into concepts, we assign each sentence in the document to its most related concept vertex. For example, in Fig. 1, the sentences that mainly talk about the relationship between Rick and Morty are assigned to the concept (Rick, Morty). Other sentences are assigned to concepts in a similar way. The assignment of sentences to concepts naturally leads to multiple sentence subsets. We then connect the concept vertices by weighted edges, where the weight of the edge between a pair of concepts denotes how much the two are related to each other. The edge weights can be determined in various ways, which we will discuss later. This way, we have re-structured the original document into a graph of different focal points, as well as the interaction topology among them.

2.1. Construct Concept Interaction Graphs

We now introduce our detailed procedure to transform a document into a desired CIG as described above. The process consists of five steps: 1) document preprocessing, 2) keyword co-occurrence graph construction, 3) concept detection, 4) vertex construction, and 5) edge construction. The entire procedure is shown in Fig. 2.

Document Preprocessing. Given an input document $\mathcal{D}$, our first step is to preprocess the document to acquire its keywords. First, for Chinese text data (which will be used in our evaluation), we need to perform word segmentation using off-the-shelf tools such as Stanford CoreNLP (Manning et al., 2014). For English text data, word segmentation is not necessary. Second, we extract named entities from the document. For documents, especially news articles, the named entities are usually critical keywords. Finally, we apply a keyword extraction algorithm to expand the keyword set, as the named entities alone are not enough to cover the main focuses of the document.

To efficiently and accurately extract keywords for Chinese news articles, we have constructed a supervised classifier to decide whether a word is a keyword or not for a document. In particular, we have a document-keywords dataset of over 10,000 documents at Tencent, including over 20,000 positive keyword samples and over 350,000 negative samples. Each word is transformed into a multi-view feature vector and classified by a binary classifier which involves a combined use of Gradient Boosting Decision Tree (GBDT) and Logistic Regression (LR) (Liu et al., 2017). For English documents, we can use TextRank (Mihalcea and Tarau, 2004) to get the keywords of each document. Notice that our proposed graphical representation of documents is not language-dependent and can easily be extended to other languages.

KeyGraph Construction. Having extracted the keywords of a document $\mathcal{D}$, we construct a keyword co-occurrence graph, called KeyGraph, based on the set of keywords. Each keyword is a vertex in the KeyGraph. We connect a pair of keywords by an edge if they co-occur in at least one sentence.
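The co-occurrence rule above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: sentences are already tokenized, the keyword set is given, and the function name `build_keygraph` and the toy data are hypothetical rather than taken from the authors' implementation.

```python
from itertools import combinations

def build_keygraph(sentences, keywords):
    """Return an adjacency dict: keyword -> set of co-occurring keywords.

    `sentences` is a list of token lists (already segmented) and
    `keywords` is the keyword set from the preprocessing step.
    """
    keywords = set(keywords)
    graph = {kw: set() for kw in keywords}
    for tokens in sentences:
        present = sorted(set(tokens) & keywords)
        # Connect every pair of keywords that appear in the same sentence.
        for a, b in combinations(present, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

# Toy input (hypothetical), loosely following the example in Fig. 1.
sentences = [
    ["Rick", "asks", "Morty", "to", "travel"],
    ["Morty", "invites", "Summer"],
    ["they", "visit", "Candy", "Planet"],
]
keygraph = build_keygraph(sentences, {"Rick", "Morty", "Summer", "Planet"})
```

Here "Morty" ends up linked to both "Rick" and "Summer", while "Planet" stays isolated because no other keyword shares a sentence with it.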

Concept Detection. The structure of KeyGraph reveals the connections between keywords. If a subset of keywords are highly correlated with each other, they will form a densely connected sub-graph in the KeyGraph, which we call a concept.

Concepts can be extracted by applying community detection algorithms on the constructed KeyGraph. Community detection is able to split a KeyGraph into a set of communities, where each community contains the keywords for a certain concept. By using overlapping community detection, each keyword may appear in multiple concepts.

Many existing algorithms can be utilized for community detection. In our case, the number of concepts in different documents varies a lot, and the number of keywords contained in a constructed KeyGraph is rather small. Based on these observations, we utilize the betweenness centrality score (Sayyadi and Raschid, 2013) of edges to measure the strength of each edge in the KeyGraph and detect keyword communities. An edge’s betweenness score is defined as the number of shortest paths between all pairs of nodes that pass through it. An edge between two communities is expected to achieve a high betweenness score. Edges with a high betweenness score are removed iteratively to extract separated communities. The iterative splitting process continues until the number of nodes in each sub-graph is smaller than a predefined threshold, or until the maximum betweenness score of all edges in the sub-graph is smaller than a threshold that depends on the sub-graph’s size. We refer interested readers to (Sayyadi and Raschid, 2013) for more details on community detection over a KeyGraph.
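The iterative splitting can be illustrated with a small, dependency-free Python sketch. It is a simplification and not the authors' code: the stopping rule uses two fixed, hypothetical parameters (`max_size` and `min_betweenness`) instead of a size-dependent threshold, it produces non-overlapping communities only, and all function names are our own.

```python
from collections import deque

def edge_betweenness(graph):
    """Edge betweenness (Brandes-style) for an unweighted, undirected graph."""
    score = {frozenset((u, v)): 0.0 for u in graph for v in graph[u]}
    for source in graph:
        # Breadth-first search that counts shortest paths from `source`.
        dist = {source: 0}
        paths = {node: 0.0 for node in graph}
        paths[source] = 1.0
        preds = {node: [] for node in graph}
        order, queue = [], deque([source])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    paths[v] += paths[u]
                    preds[v].append(u)
        # Push path credit back from the farthest nodes towards the source.
        credit = {node: 0.0 for node in graph}
        for v in reversed(order):
            for u in preds[v]:
                share = paths[u] / paths[v] * (1.0 + credit[v])
                score[frozenset((u, v))] += share
                credit[u] += share
    # Each unordered node pair was counted from both endpoints.
    return {edge: s / 2.0 for edge, s in score.items()}

def components(graph):
    """Connected components of an adjacency-dict graph."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(graph[node] - comp)
        seen |= comp
        result.append(comp)
    return result

def detect_concepts(graph, max_size=3, min_betweenness=2.0):
    """Remove high-betweenness edges until every community is small enough."""
    graph = {node: set(nbrs) for node, nbrs in graph.items()}  # work on a copy
    while True:
        big = [c for c in components(graph) if len(c) > max_size]
        scores = edge_betweenness(graph)
        candidates = {e: s for e, s in scores.items() if any(e <= c for c in big)}
        if not candidates:
            break
        edge, best = max(candidates.items(), key=lambda item: item[1])
        if best < min_betweenness:
            break
        u, v = tuple(edge)
        graph[u].discard(v)
        graph[v].discard(u)
    return components(graph)

# Two keyword triangles joined by a single bridge edge (hypothetical data).
keygraph = {
    "Rick": {"Morty", "Portal"}, "Morty": {"Rick", "Portal"},
    "Portal": {"Rick", "Morty", "Summer"},
    "Summer": {"Portal", "Candy", "Planet"},
    "Candy": {"Summer", "Planet"}, "Planet": {"Summer", "Candy"},
}
concepts = detect_concepts(keygraph)
```

In this toy graph the bridge between "Portal" and "Summer" carries all shortest paths between the two triangles, so it is removed first and the two triangles are returned as separate concepts.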

Vertex Construction. After we obtain the concepts of a document, the next step is to assign each sentence to its most related concept. We calculate the cosine similarity between each sentence and each concept, where sentences are represented by TF-IDF vectors. As a concept is a bag of keywords, it can also be represented by a TF-IDF vector. We assign each sentence to the concept that is the most similar to the sentence in terms of the TF-IDF vector, provided that the similarity score is above a predefined threshold. After this step, sentences in the document are grouped by concepts. For sentences that do not match any concept in the document, we create a special dummy vertex that does not contain any keyword and attach all the unmatched sentences to it.

Edge Construction. Given the set of extracted concepts with attached sentences, we further organize these concept vertices into a weighted undirected graph to reveal the correlations between different concepts. There are various ways to construct edges between vertices and to calculate edge weights. For example, for each vertex, we can combine the sentences attached to it into a long piece of concatenated text, and calculate the edge weight between any two vertices as the TF-IDF similarity between the two pieces of concatenated text on the two vertices, respectively. We also tried multiple alternative methods for weight calculation, such as counting the number of sentences that contain at least one keyword from each of the two vertices. Our empirical experience shows that constructing edges by TF-IDF similarity generates a good Concept Interaction Graph for NLP tasks, as the resulting graph is more densely connected than graphs whose edge weights are determined by other methods.
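The vertex and edge construction steps can be sketched together as follows. This is a minimal illustration under several assumptions of ours: sentences are token lists, IDF is computed over the sentences of the single document with a smoothed formula, at least one concept is given, and the thresholds, function names and toy data are all hypothetical.

```python
import math
from collections import Counter

def idf_table(token_lists):
    """Smoothed inverse document frequency, treating each sentence as a document."""
    n = len(token_lists)
    df = Counter(tok for tokens in token_lists for tok in set(tokens))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}

def tfidf(tokens, idf):
    """Sparse TF-IDF vector as a dict; unseen tokens get weight 0."""
    return {tok: count * idf.get(tok, 0.0) for tok, count in Counter(tokens).items()}

def cosine(a, b):
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_cig(sentences, concepts, assign_threshold=0.1, edge_threshold=0.0):
    """Group sentences by concept, then weight edges by TF-IDF similarity."""
    idf = idf_table(sentences)
    concept_vecs = [tfidf(sorted(c), idf) for c in concepts]
    # Vertex construction: the last slot is the keyword-free dummy vertex.
    members = [[] for _ in range(len(concepts) + 1)]
    for tokens in sentences:
        sims = [cosine(tfidf(tokens, idf), cv) for cv in concept_vecs]
        best = max(range(len(sims)), key=sims.__getitem__)
        target = best if sims[best] > assign_threshold else len(concepts)
        members[target].append(tokens)
    # Edge construction: compare the concatenated text of each vertex pair.
    texts = [tfidf([tok for sent in group for tok in sent], idf) for group in members]
    edges = {}
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            weight = cosine(texts[i], texts[j])
            if weight > edge_threshold:
                edges[(i, j)] = weight
    return members, edges

sentences = [
    ["Rick", "asks", "Morty", "for", "help"],
    ["Morty", "follows", "Rick"],
    ["Summer", "visits", "Candy", "Planet"],
    ["Rick", "and", "Morty", "meet", "Summer"],
    ["it", "rains"],
]
concepts = [{"Rick", "Morty"}, {"Summer", "Candy", "Planet"}]
members, edges = build_cig(sentences, concepts)
```

In this toy run, three sentences attach to the (Rick, Morty) vertex, one to the (Summer, Candy, Planet) vertex, and the unrelated sentence falls onto the dummy vertex; the only edge links the two concept vertices, because their sentence sets share the token "Summer".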

So far, we have transformed an input document into a Concept Interaction Graph. Compared with the original document with a sequential structure, CIG discovers the focal point distribution in the document by detecting all the concepts and grouping sentences according to different concepts. Furthermore, the weighted edges represent the strengths of interactions among these concepts. In the next section, we will show how to use such a graphical representation of documents for text matching purposes.

3. A Graphical Approach to Document Matching

Figure 3. An overview of the proposed Siamese Encoded Graph Convolutional Network (SE-GCN) for matching a pair of long text documents. a) The architecture of the Siamese Text Pair Encoder on each vertex of the joint concept interaction graph (CIG) of the two documents, for vertex feature generation. b) The GCN layers to map the initial vertex features in the joint CIG into a final matching score.

In this section, we exploit the graphical representation of documents provided by concept interaction graphs, and propose the so-called Siamese Encoded Graph Convolutional Network (SE-GCN) for text matching. Fig. 3 illustrates the overall architecture of our proposed model, which is trained end-to-end.

3.1. The Joint CIG for a Pair of Documents

Since our goal is to classify the relationship of a pair of input documents $\mathcal{D}_A$ and $\mathcal{D}_B$, we need a mechanism to merge the two corresponding CIGs $G_A$ and $G_B$, which can be eventually aggregated to a final matching score. One straightforward way is to have a “Siamese GCN”, where $G_A$ is encoded into a contextual vector via multiple layers of graph convolutional networks (GCN), and the same procedure is applied to $G_B$. Finally, we can match the two contextual vectors to obtain the matching score. However, this approach does not lead to good performance according to our experiments, as the comparison is only done in the final layer between the short encoded vectors, with too much information lost at the initial GCN layers.

Intuitively speaking, a better approach to utilize the concept interaction graph is to compare the sentence subsets on each vertex, and aggregate such fine-grained comparisons on different vertices, possibly weighted by the interaction topology, to get an overall matching result. To preserve the contrast between $\mathcal{D}_A$ and $\mathcal{D}_B$ on a per-vertex level and let such vertex contrasts propagate through multiple GCN layers, we propose a novel procedure to merge a pair of CIGs.

Specifically, for a pair of input documents $\mathcal{D}_A$ and $\mathcal{D}_B$, we can construct a joint Concept Interaction Graph (joint CIG) $G_{AB}$ by taking the “union” of the respective two CIGs $G_A$ and $G_B$ in the following way:

  • Include all the concept vertices from $G_A$ and $G_B$ into the joint CIG.

  • For each vertex $v$ in the joint CIG, its associated sentence set is given by the union $S_A(v) \cup S_B(v)$, where $S_A(v)$ (or $S_B(v)$) is the set of sentences associated with $v$ in $G_A$ (or $G_B$).

  • The edge weight for every pair of vertices $u$ and $v$ in the joint CIG is recalculated based on the TF-IDF similarity between their respective sentence sets, $S_A(u) \cup S_B(u)$ and $S_A(v) \cup S_B(v)$.
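The three merge rules can be sketched as below. The sketch rests on assumptions that the text does not state: a CIG is stored as a dict from a concept (a frozenset of keywords) to its sentence list, two concepts from different documents are treated as the same vertex only when their keyword sets are identical, and the similarity function is pluggable. The toy example swaps in a simple Jaccard token overlap as a stand-in for the TF-IDF similarity used in the paper; `merge_cigs` and `jaccard` are hypothetical names.

```python
def merge_cigs(cig_a, cig_b, similarity, threshold=0.0):
    """Build a joint CIG from two per-document CIGs.

    Each vertex keeps the pair (sentences from A, sentences from B) so that
    the per-vertex contrast between the two documents is preserved.
    """
    vertices = {}
    for concept in list(cig_a) + [c for c in cig_b if c not in cig_a]:
        vertices[concept] = (cig_a.get(concept, []), cig_b.get(concept, []))
    # Recompute edge weights on the merged sentence sets.
    edges = {}
    concepts = list(vertices)
    for i, u in enumerate(concepts):
        for v in concepts[i + 1:]:
            weight = similarity(vertices[u][0] + vertices[u][1],
                                vertices[v][0] + vertices[v][1])
            if weight > threshold:
                edges[(u, v)] = weight
    return vertices, edges

def jaccard(sents_u, sents_v):
    """Token-overlap stand-in for TF-IDF similarity (illustration only)."""
    tokens_u = {tok for sent in sents_u for tok in sent.split()}
    tokens_v = {tok for sent in sents_v for tok in sent.split()}
    union = tokens_u | tokens_v
    return len(tokens_u & tokens_v) / len(union) if union else 0.0

rm = frozenset({"Rick", "Morty"})
summer = frozenset({"Summer"})
candy = frozenset({"Candy", "Planet"})
cig_a = {rm: ["Rick helps Morty"], summer: ["Summer calls Rick"]}
cig_b = {rm: ["Morty thanks Rick"], candy: ["Candy Planet is far"]}
vertices, edges = merge_cigs(cig_a, cig_b, jaccard)
```

The joint graph has three vertices. The shared (Rick, Morty) vertex carries one sentence from each document, while the other two vertices carry sentences from only one side, which is exactly the contrast the Siamese encoder later consumes.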

3.2. A Siamese Document Pair Encoder

Given the joint CIG $G_{AB}$, our next step is to find an appropriate feature vector of a fixed length for each vertex $v$ to express the semantic similarity and divergence between $S_A(v)$ and $S_B(v)$, which represents the difference between documents $\mathcal{D}_A$ and $\mathcal{D}_B$ on the focal point $v$. A natural idea is to manually extract various features to compare $S_A(v)$ and $S_B(v)$, e.g., in terms of TF-IDF similarity or the distance between mean word vectors. However, the performance of such a method is limited and will be highly dependent on feature engineering. To reduce the impact of human judgment in feature engineering, we resort to the power of a neural encoder applied onto every vertex in a distributed manner. As illustrated by Fig. 3 (a), we apply the same Siamese neural network encoder (Neculoiu et al., 2016) onto each vertex $v$ to convert the word embeddings (e.g., provided by Word2Vec (Mikolov et al., 2013)) of $\{S_A(v), S_B(v)\}$ into a fixed-sized hidden feature vector $m_{AB}(v)$, which we call the match vector.

In particular, the Siamese encoder takes the sequences of word embeddings of $S_A(v)$ and $S_B(v)$ as two inputs, encodes them into two context vectors through the context layers that share weights on both sides, and compares the two context vectors through an aggregation layer to get the match vector $m_{AB}(v)$. The context layer usually contains one or multiple layers of LSTM, bi-directional LSTM (BiLSTM), or CNN with max pooling layers, aiming to capture the contextual information in each text sequence. In a Siamese network, every text sequence is encoded by the same context representation layer. The obtained context vectors are concatenated in the aggregation layer, and can be further transformed by more layers to get a fixed-length match vector.

In our experiments, the context layer contains a single layer of 1-D CNN that consists of a set of kernels and a max pooling layer. Denote the context vectors of sentences $S_A(v)$ and sentences $S_B(v)$ as $c_A(v)$ and $c_B(v)$. Then, in the aggregation layer, the match vector $m_{AB}(v)$ is given by concatenating the element-wise absolute difference and the element-wise multiplication of the two context vectors, i.e.,

$$m_{AB}(v) = \big(\,|c_A(v) - c_B(v)|,\; c_A(v) \circ c_B(v)\,\big),$$

where $\circ$ denotes the Hadamard (or element-wise) product.
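The aggregation step is simple enough to spell out directly. The following sketch covers only the aggregation layer (not the CNN context layer), uses plain Python lists as stand-ins for the two context vectors, and the function name `match_vector` is our own.

```python
def match_vector(context_a, context_b):
    """Concatenate the element-wise |a - b| and the element-wise product a * b."""
    abs_diff = [abs(x - y) for x, y in zip(context_a, context_b)]
    product = [x * y for x, y in zip(context_a, context_b)]
    return abs_diff + product

# Two toy 3-dimensional context vectors give a 6-dimensional match vector.
m = match_vector([0.5, -1.0, 2.0], [1.5, 1.0, 2.0])
# m == [1.0, 2.0, 0.0, 0.75, -1.0, 4.0]
```

The absolute-difference half is zero where the two sides agree, while the product half keeps sign agreement, so the two halves carry complementary signals about similarity and divergence.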

3.3. Siamese Encoded GCN

Finally, we utilize the ability of the Graph Convolutional Network (GCN) (Kipf and Welling, 2016) to capture the interactions between vertices and get an overall matching score between two documents. GCNs generalize the CNN from low-dimensional regular grids to high-dimensional irregular graph domains. In general, the input to the GCN (Kipf and Welling, 2016) is a graph $G = (V, E)$ with $N$ vertices $v_i \in V$, and edges $e_{ij} = (v_i, v_j) \in E$ with weights $w_{ij}$. The input also contains a vertex feature matrix denoted by $X = \{x_i\}_{i=1}^{N}$, where $x_i$ is the feature vector of vertex $v_i$.

For a pair of documents $\mathcal{D}_A$ and $\mathcal{D}_B$, we input the joint concept interaction graph $G_{AB}$ (assuming it has $N$ vertices) with match vectors, as obtained according to the previous subsection, into the GCN, such that $x_i = m_{AB}(v_i)$, i.e., the match vector obtained for each $v_i$ from the Siamese encoder serves as the feature vector for vertex $v_i$ in the GCN.

Now let us briefly describe the GCN propagation layers, as shown in Fig. 3 (b). Interested readers are referred to (Kipf and Welling, 2016) for details. Denote the weighted adjacency matrix of the graph as $A \in \mathbb{R}^{N \times N}$, where $A_{ij} = w_{ij}$. Let $D$ be a diagonal matrix such that $D_{ii} = \sum_j A_{ij}$. We utilize a multi-layer GCN with the following layer-wise propagation rule (Kipf and Welling, 2016):

$$H^{(l+1)} = \sigma\big(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\big),$$

where $\tilde{A} = A + I_N$ and $\tilde{D}$, a diagonal matrix such that $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, are the adjacency matrix and the degree matrix of graph $G$, respectively, with added self-connections, and $I_N$ is the identity matrix.

The input layer to the GCN is $H^{(0)} = X$, which contains the original vertex features, and $H^{(l)}$ is the matrix of activations, containing the hidden vectors of the vertices in the $l$-th layer. $W^{(l)}$ is the trainable weight matrix in the $l$-th layer. $\sigma(\cdot)$ denotes an activation function such as Sigmoid or ReLU. Such a form of propagation rule is motivated by a first-order approximation of localized spectral filters on graphs, and can be considered as a differentiable generalization of the Weisfeiler-Lehman algorithm, as described in (Kipf and Welling, 2016).
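To make the propagation rule concrete, here is a dependency-free sketch of a single GCN layer followed by mean pooling over vertices. It is an illustration under our own assumptions: matrices are nested Python lists, edge weights are non-negative, ReLU is used as the activation, the weight matrix is fixed rather than trained, and the names `gcn_layer` and `mean_pool` are hypothetical.

```python
import math

def matmul(x, y):
    """Plain nested-list matrix product."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def gcn_layer(adj, feats, weight):
    """One step of ReLU(D~^-1/2 (A + I) D~^-1/2 H W) on nested lists."""
    n = len(adj)
    # Add self-connections, then normalize symmetrically by the new degrees.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    d_inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in a_tilde]
    norm_adj = [[d_inv_sqrt[i] * a_tilde[i][j] * d_inv_sqrt[j] for j in range(n)]
                for i in range(n)]
    hidden = matmul(matmul(norm_adj, feats), weight)
    return [[max(0.0, value) for value in row] for row in hidden]

def mean_pool(hidden):
    """Average the vertex vectors into one fixed-length graph vector."""
    n = len(hidden)
    return [sum(row[j] for row in hidden) / n for j in range(len(hidden[0]))]

# Two vertices joined by one edge, one-hot features, identity weights.
adj = [[0.0, 1.0], [1.0, 0.0]]
feats = [[1.0, 0.0], [0.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]
hidden = gcn_layer(adj, feats, weight)
graph_vector = mean_pool(hidden)
```

With self-connections added, both vertices have degree 2, so each output row is the average of the two input feature vectors; stacking further layers would mix information from more distant vertices.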

In summary, as shown in Fig. 3, the combination of a Siamese encoder applied to each vertex and multiple layers of GCN leads to the proposed Siamese Encoded GCN (SE-GCN), which takes a joint CIG representation $G_{AB}$ of a pair of documents $\mathcal{D}_A$ and $\mathcal{D}_B$ as the input, and passes the original sentences $\{S_A(v), S_B(v)\}$ associated with each vertex $v$ into the same Siamese encoder in a distributed fashion to get the match vector $m_{AB}(v)$. Next, the concept interaction graph $G_{AB}$, together with the match vectors $m_{AB}(v)$ serving as vertex features, is fed into multiple layers of GCNs. Finally, the hidden vectors in the last GCN layer are merged into a single vector of a fixed length. Note that these hidden vectors of vertices preserve the structural properties of the entire Concept Interaction Graph with minimum information loss. We use the mean of the hidden vectors of all vertices in the last layer as the merged representation, based on which the final matching score is computed. All the components in the entire proposed SE-GCN model can be jointly trained in an end-to-end manner with back-propagation.


To further improve the performance of our model, we can also manually construct a feature vector for the pair of documents in question, and concatenate the final mean vector representation from the GCN with the manual feature vector for classification. In our experiment, we pass such a concatenated vector to a regression layer, such as a multi-layer feed forward neural network, to get the final matching result.

We can see that SE-GCN solves the problem of long text document matching in a “divide-and-conquer” manner. The matching of two documents is divided into the matching of pairs of text snippets (sentence subsets) on each vertex of the constructed Concept Interaction Graph. Then, the distributed vertex matching results are aggregated and merged through graph convolutional network layers. SE-GCN overcomes the limitation of previous text matching algorithms, by extending text representation from a sequential or grid point of view to graphs, and can therefore better capture the rich intrinsic semantic structures in long text objects.

Finally, it is worth noting that our proposed SE-GCN is highly flexible. Different components in the architecture may be replaced by different neural network modules. Besides, it is not limited to text matching problems and can be applied to a variety of natural language processing tasks, especially those related to the modelling of long text objects, such as document classification, sentiment analysis and so on.

4. Evaluation

In this section, we evaluate the performance of our proposed SE-GCN model on the document pair matching task. We will first describe the task of semantic relationship classification for news articles, and then introduce two Chinese news datasets we collected specifically for this task at Tencent. After that, to evaluate our model’s effectiveness, we will compare our model with a wide variety of existing text matching approaches.

4.1. Description of Tasks and Datasets

Table 1. Description of evaluation datasets (columns: Dataset, Pos Samples, Neg Samples, Train, Dev, Test).

Most existing research on text matching focuses on short text pairs, and there are few studies and few publicly available datasets for long document pair matching tasks. However, the problem of matching two documents, such as news articles, is of great value to real-world applications, such as intelligent news systems.

Specifically, we will study the problem of matching a pair of news articles to classify whether they are talking about the same physical event or whether they belong to the same story in the real world. The concepts of event and story are defined as (Liu et al., 2017):

Definition 4.1 (Event). An event is a set of news documents that contains semantically identical information revolving around a real-world incident. An event always has a specific time of occurrence. It may involve a group of participating persons, organizations or other types of entities, the actions performed by them, and one or several locations.

Definition 4.2 (Story). A story consists of a set of semantically related or similar events.

Figure 4. The events contained in the story “2016 U.S. presidential election”.

To give readers more intuition on what stories and events look like, here we use an example to clarify the two concepts. Fig. 4 shows the events contained in the story 2016 U.S. presidential election. As we can see, there are multiple sets of sub-events, such as events about Hillary’s health condition, Trump’s tax avoidance, Hillary’s “mail door” and so on, which belong to the same story 2016 U.S. presidential election. For each event subset, there are multiple events that occurred at different times. For example, the event set Election television debates contains three events that correspond to the three television debates during the presidential election, respectively. Let us consider the following 4 events under the story 2016 U.S. presidential election: 1) Trump and Hillary’s first television debate; 2) Trump and Hillary’s second television debate; 3) FBI restarts “mail door” investigation; 4) America votes to elect the new president. Intuitively, these 4 events should have no overlap between them. A news article about Trump and Hillary’s first television debate is conceptually separate from one about Trump and Hillary’s second television debate. For news articles, different events from the same story should be clearly distinguishable, because they usually follow the progressing timeline of real-world affairs.

Extracting events and stories accurately from vast news corpora is critical for online news feed apps and search engines to organize news information collected from the Internet and present it to users in sensible forms. The key problem for such applications is to classify whether two news articles are talking about the same event or the same story. However, to the best of our knowledge, we are the first to study this problem. As there is no publicly available dataset for such a task, here we propose two datasets: the Chinese News Same Event dataset (CNSE) and the Chinese News Same Story dataset (CNSS).

The two datasets contain long Chinese news articles that were collected from major Internet news providers in China, including Tencent, Sina, WeChat, Sohu, etc., in a two-month period from October 1, 2016 to November 30, 2016, covering diverse topics in the open domain. The Chinese News Same Event dataset contains pairs of news articles with labels that represent whether a pair of news articles are talking about the same event. The labels are created by the editors and product managers of Tencent. The Chinese News Same Story dataset is built similarly, with labels representing whether two documents are talking about the same story. Each document in these two datasets also has a publication timestamp and a topic category, such as “Society”, “Entertainment” and so on. Notice that the negative samples in the two datasets are not randomly generated: we select document pairs that contain similar keywords, and filter out samples with TF-IDF similarity lower than a threshold.

Table 1 shows a detailed breakdown of the datasets used in the evaluation. We split each of the two datasets into a training set, a development set, and a test set, and conduct the experiments on both datasets. We use the training sets to train the models and the development sets to tune the hyper-parameters; each test set is used only once, in the final evaluation. The metrics we use to evaluate the performance of our proposed models on the text matching tasks are the accuracy and the F1 score of the classification results. For each model, we carry out training for 10 epochs. We then choose the model with the best validation performance to be evaluated on the test set.
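As a minimal sketch of the two evaluation metrics, the function below computes accuracy and F1 score for binary pair labels. The function name and the convention that label 1 means "same event (or story)" are assumptions made for illustration, not details taken from the evaluation code.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels; 1 is assumed to mean 'same event/story'."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


if __name__ == "__main__":
    acc, f1 = accuracy_and_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
    print(f"accuracy={acc:.3f} f1={f1:.3f}")
```

Reporting F1 alongside accuracy is useful here because the positive and negative classes need not be balanced.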

4.2. Compared Algorithms

In the following, we briefly describe the baseline methods:

  • Support Vector Machine with Manually Extracted Document Pair Features (Feature + SVM): this is the most classical approach for classification tasks. In this approach, we extract features for a pair of documents, and train a support vector machine to classify the relationship between two documents. The extracted features include: the TF-IDF cosine similarity and the TF cosine similarity between two documents, the TF-IDF cosine similarity and the TF similarity between the first sentence of two documents, the topic categories of the two documents, and the absolute gap value of the publication time of the two documents.

  • Deep Structured Semantic Models (DSSM) (Huang et al., 2013): it utilizes a deep neural network (DNN) to map high-dimensional sparse features into low-dimensional dense features, and calculates the semantic similarity of the text pair.

  • Convolutional Deep Structured Semantic Models (C-DSSM) (Shen et al., 2014): it learns low-dimensional semantic vectors for the input text with a convolutional neural network (CNN).

  • Multiple Positional Semantic Matching (MV-LSTM) (Wan et al., 2016): matching two texts with multiple positional text representations, and aggregating interactions between different positional representations to give a matching score.

  • Match by Local and Distributed Representations (DUET) (Mitra et al., 2017): matching two texts using both a local representation and learned distributed representations.

  • Convolutional Matching Architecture-I (ARC-I) (Hu et al., 2014): encoding text pairs by CNN, and comparing the encoded representations of each text with a multi-layer perceptron (MLP).

  • Convolutional Matching Architecture-II (ARC-II) (Hu et al., 2014): built directly on the interaction space between two texts, it models all the possible combinations of them with 1-D and 2-D convolution.

  • MatchPyramid (Pang et al., 2016): calculating pairwise word matching matrix, and modeling text matching as image recognition, by taking the matching matrix as an image.

  • K-NRM (Xiong et al., 2017): using a translation matrix to model word-level similarities and a new kernel-pooling technique to extract multi-level match features, and a learning-to-rank layer that combines those features into the final ranking score.

We use the MatchZoo implementation (Fan et al., 2017) to evaluate the above deep text matching models.
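To make the similarity features of the Feature + SVM baseline more concrete, the following is a minimal sketch of a TF-IDF cosine similarity between two tokenized documents. The function names, the smoothed IDF formula, and the assumption that the Chinese text has already been segmented into tokens are illustrative choices, not details of the baseline's actual implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumed variant): log((1 + n) / (1 + df)) + 1.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


if __name__ == "__main__":
    corpus = [["trump", "clinton", "debate"],
              ["trump", "clinton", "debate", "second"],
              ["fbi", "email", "investigation"]]
    vecs = tfidf_vectors(corpus)
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

In practice the IDF statistics would be estimated on the whole news corpus rather than on the pair being compared; the resulting similarity would then be one entry of the feature vector passed to the SVM.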

4.3. Performance Analysis

Table 2. Accuracy and F1-score results of different algorithms on CNSE dataset (columns: Algorithm; Dev Accuracy, Dev F1-score; Test Accuracy, Test F1-score).
Table 3. Accuracy and F1-score results of different algorithms on CNSS dataset (columns: Algorithm; Dev Accuracy, Dev F1-score; Test Accuracy, Test F1-score).

Table 2 and Table 3 compare the performance of different models in terms of classification accuracy and F1 score, based on the Chinese News Same Event dataset and the Chinese News Same Story dataset. We can see that our Siamese Encoded Graph Convolutional Network achieves the best performance on both datasets in terms of accuracy and F1 score. This can be attributed to two characteristics of our model. First, the input long document pairs are re-organized into Concept Interaction Graphs, so that corresponding semantic units in the two documents are roughly aligned. Second, our model learns the match vector of each aligned semantic unit through a Siamese encoder network, and aggregates the match vectors of all units, or concept vertices, via a Graph Convolutional Network to take the semantic topology of the two documents into consideration. Therefore, it solves the problem of matching documents in a “divide-and-conquer” manner to cope with the long length of documents, and fully utilizes the connections between semantic units to give an overall matching score or label.
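The aggregation step can be illustrated with a small sketch of one graph convolution layer in the form popularized by Kipf and Welling (2016), ReLU(D^-1/2 (A + I) D^-1/2 H W), followed by mean pooling over vertices. Whether our model uses exactly this normalization and pooling is not restated here, so both should be read as assumptions; the function names and the fixed toy weight matrix are hypothetical, and in practice W is learned.

```python
import math


def gcn_layer(adj, features, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W).

    adj:      n x n symmetric, non-negative edge-weight matrix A (list of lists)
    features: n x d matrix H of per-vertex match vectors
    weight:   d x k matrix W (learned in practice; fixed here for illustration)
    """
    n = len(adj)
    d, k = len(weight), len(weight[0])
    # Add self-loops so that every vertex keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Symmetric normalization by the degree of A + I (always >= 1 here).
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    norm = [[inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] for j in range(n)]
            for i in range(n)]
    # Aggregate neighbour features, then apply the linear map and ReLU.
    agg = [[sum(norm[i][j] * features[j][c] for j in range(n)) for c in range(d)]
           for i in range(n)]
    return [[max(0.0, sum(agg[i][c] * weight[c][o] for c in range(d)))
             for o in range(k)] for i in range(n)]


def mean_pool(hidden):
    """Average vertex vectors into one graph-level vector (an assumed pooling choice)."""
    n = len(hidden)
    return [sum(row[c] for row in hidden) / n for c in range(len(hidden[0]))]


if __name__ == "__main__":
    adjacency = [[0.0, 1.0], [1.0, 0.0]]   # two concept vertices joined by one edge
    match_vectors = [[1.0], [3.0]]         # toy one-dimensional match vectors
    hidden = gcn_layer(adjacency, match_vectors, [[1.0]])
    print(hidden, mean_pool(hidden))
```

The pooled vector would then be fed to a classifier that outputs the matching label; stacking several such layers lets information propagate between concepts that are more than one edge apart.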

Table 2 and Table 3 indicate that the deep text matching models in MatchZoo perform poorly on long document matching. The main reason is that existing deep text matching models struggle to capture meaningful semantic relations between a pair of long documents. When the input texts are long, it is hard to obtain an appropriate context vector representation for matching the text pair. For interaction-focused models, most of the interactions between words in two long documents are meaningless, so it is not easy to extract useful interaction features for the subsequent matching steps. Our model addresses these challenges by representing documents as Concept Interaction Graphs to split and align long text pairs, and by utilizing the semantic structure of long documents through a Graph Convolutional Network for semantic matching.

Moreover, Fig. 5(a) and Fig. 5(b) show that our SE-GCN performs better than SVM according to ROC and AUC, indicating the higher precision of our model. We also notice that the classical “Manual features + SVM” model performs relatively well compared to the other baseline models. This is reasonable, as the extracted features, such as the publication time and the topic categories of news articles, are quite critical for judging whether two news articles are talking about the same event or story. However, our model provides a method to match a pair of long documents without manually designed features, and achieves significant improvement compared to existing deep text matching models. Besides, we can easily incorporate manually designed features into our model by concatenating them with our learned matching vector for two documents.
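For readers less familiar with the AUC summary of an ROC curve, the sketch below computes it directly from its probabilistic interpretation: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The function name and the toy scores are hypothetical and are not results from our experiments.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    labels = [1, 1, 0, 0]
    model_scores = [0.9, 0.4, 0.5, 0.1]  # toy matching scores, not real outputs
    print(roc_auc(labels, model_scores))
```

This pairwise form is quadratic in the number of examples, so a rank-based computation is preferable on large test sets.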

Figure 5. Comparison of the ROC curves of our model and the SVM baseline model on the two datasets: (a) CNSE dataset; (b) CNSS dataset.

Overall, the experimental results demonstrate the superior applicability and generalizability of our proposed model.

Figure 6. Comparison of the performance of our model on the two datasets using different weight calculation strategies: (a) CNSE dataset; (b) CNSS dataset.

Impact of global feature concatenation. We compare our model with a version that does not contain global feature concatenation in the last layer. It is not surprising that the performance is worse when we do not feed global feature vectors into our model. However, we can see that our model without global feature concatenation still achieves much better performance than existing deep text matching models. The reason is that existing text matching models are not able to characterize the semantic similarities between long text pairs. Without utilizing the intrinsic semantic structures in long documents, neither representation-focused deep neural models nor interaction-focused models are able to get meaningful comparisons between long documents. In our model, we represent documents by Concept Interaction Graphs, so that it is able to align document pairs and match long documents using their semantic structures.

Impact of different edge weight calculation strategies. Consider a pair of Concept Interaction Graph vertices, each with two lists of sentence indices, one list for each of the two documents. The indices indicate the positions of the attached sentences in the two documents. We tried different strategies to assign weights to the edges:

  • TF-IDF: for each vertex, concatenating all the sentences from both documents to get a single text snippet, and calculating the TF-IDF similarity between the two text snippets belonging to a pair of vertices.

  • Number of connecting sentences: counting how many sentences in the two documents contain at least one keyword from each of the two vertices (we call them connecting sentences), and using the total number of such sentences as the edge weight.

  • Position of connecting sentences: finding the sentences in the two documents that contain at least one keyword from each of the two vertices. For each such sentence, we assign a position score that is calculated from the index of its paragraph in the document and the index of the sentence within that paragraph, controlled by two hyper-parameters (set to 0.1 and 0.3 in our experiments). We then sum up the position scores of the connecting sentences as the edge weight.

  • TextRank score of connecting sentences: similar to the above approach, but we use the TextRank algorithm to assign scores to sentences. We then sum up the TextRank scores of the connecting sentences as the edge weight.
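The strategies based on connecting sentences share one pattern: find the sentences that mention both concepts, then either count them or sum a per-sentence score. The following is a minimal sketch of that pattern; the function name, the representation of sentences as token lists, and the optional `scores` argument (which could hold position or TextRank scores computed elsewhere) are assumptions made for illustration.

```python
def connecting_sentence_weight(sentences, keywords_a, keywords_b, scores=None):
    """Edge weight between two concept vertices based on connecting sentences.

    sentences:  list of token lists, the sentences of both documents
    keywords_a: set of keywords of the first concept vertex
    keywords_b: set of keywords of the second concept vertex
    scores:     optional per-sentence scores (e.g. position or TextRank scores);
                when omitted, each connecting sentence simply counts as 1
    """
    total = 0.0
    for idx, tokens in enumerate(sentences):
        token_set = set(tokens)
        # A connecting sentence mentions at least one keyword of each concept.
        if token_set & keywords_a and token_set & keywords_b:
            total += 1.0 if scores is None else scores[idx]
    return total


if __name__ == "__main__":
    sents = [["trump", "debate", "clinton"],
             ["trump", "tax"],
             ["clinton", "email"],
             ["debate", "email", "fbi"],
             ["trump", "fbi"]]
    concept_a = {"trump", "debate"}
    concept_b = {"email", "fbi"}
    print(connecting_sentence_weight(sents, concept_a, concept_b))
    print(connecting_sentence_weight(sents, concept_a, concept_b,
                                     scores=[0.1, 0.2, 0.3, 0.4, 0.5]))
```

Keeping the scoring function separate from the detection of connecting sentences makes it straightforward to swap one weighting strategy for another.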

Fig. 6 compares the performance of our SE-GCN model on the test sets of the Chinese News Same Event dataset and the Chinese News Same Story dataset, with different weight calculation strategies. As we can see, choosing an appropriate edge weight assignment strategy can influence the performance. The TF-IDF strategy achieves slightly better performance than other methods on the event dataset, and the strategies considering sentence positions and sentence TextRank scores can improve the performance on the story dataset. Overall, the TF-IDF weight strategy is enough to give us promising performance.

5. Related Work

There are mainly two research lines that are highly related to our work: Document Graph Representation and Text Matching.

5.1. Document Graph Representation

A variety of graph representations have been proposed for document modeling. Based on the different types of graph nodes, a majority of existing works can be generalized into four categories: word graph, text graph, concept graph, and hybrid graph.

For word graphs, the vertices represent different non-stop words in a document, and the edges are constructed based on syntactic analysis (Leskovec et al., 2004), co-occurrences (Rousseau and Vazirgiannis, 2013) or preceding relation (Schenker et al., 2003). For text graphs, they use sentences, paragraphs or documents as vertices, and establish edges by word co-occurrence, location (Mihalcea and Tarau, 2004), text similarities (Putra and Tokunaga, 2017), or hyperlinks between documents (Page et al., 1999).

Concept graphs link terms in a document to real world entities or concepts based on knowledge bases such as DBpedia (Auer et al., 2007). After concepts in a document are detected as graph vertices, they can be connected by edges based on syntactic/semantic rules. Besides, using these concepts as initial seeds, a concept graph can be expanded by performing a depth-first search along DBpedia with a maximum depth of two, adding all outgoing relational edges and concepts along the paths (Schuhmacher and Ponzetto, 2014).

Hybrid graphs consist of different types of vertices and edges. Rink et al. (2010) build a graph representation of sentences that encodes lexical, syntactic, and semantic relations. Jiang et al. (2010) extract tokens, syntactic structure nodes, part of speech nodes, and semantic nodes from each sentence, and link them by different types of edges that represent different relationships. Baker and Ellsworth (2017) combine Frame Semantics and Construction Grammar to construct a Frame Semantic Graph of a sentence.

5.2. Text Matching

Most existing works on text matching can be generalized into three categories: unsupervised metrics, representation-focused deep neural models, and interaction-focused deep neural models (Fan et al., 2017).

Traditional methods represent a text document as vectors of bag of words (BOW), term frequency inverse document frequency (TF-IDF), LDA (Blei et al., 2003) and so forth, and calculate the distance between vectors. However, they cannot capture the semantic distance and usually cannot achieve good performance.

In recent years, different neural network architectures have been proposed for text pair matching tasks. For representation-focused models, they usually transform the word embedding sequences of text pairs into context representation vectors through a Siamese architectural multi-layer Long Short-Term Memory (LSTM) network or Convolutional Neural Networks (CNN), followed by a fully connected network or score function which gives the matching score or label based on the context representation vectors (Qiu and Huang, 2015; Wan et al., 2016). For interaction-focused models, they extract the features of all pair-wise interactions between words in text pairs, and aggregate the interaction features by deep networks to give a matching result (Hu et al., 2014; Pang et al., 2016). However, the intrinsic structural properties of long text documents are not fully utilized by these neural models. Therefore, they cannot achieve good performance for long text pair matching.

6. Conclusion

In this paper, we propose a novel graphical approach to text matching. We propose the Concept Interaction Graph to transform one or a pair of documents into a weighted undirected graph, with each vertex representing a concept of tightly correlated keywords and edges indicating their interaction levels. Based on the graph representation of documents, we further propose the Siamese Encoded Graph Convolutional Network, a novel deep neural network architecture, which takes graphical representations of documents as the input and matches two documents by learning hidden document representations through the combined use of a distributed Siamese network applied to each vertex in the graph and multiple Graph Convolutional Network layers. We apply our techniques to the task of relationship classification between a pair of long documents, i.e., whether they belong to the same event (or story), based on two newly created Chinese datasets containing news articles. Our extensive experiments show that the proposed approach can achieve significant improvement for long document matching, compared with multiple existing approaches.


  • Auer et al. (2007) Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. Dbpedia: A nucleus for a web of open data. The semantic web (2007), 722–735.
  • Baker and Ellsworth (2017) Collin Baker and Michael Ellsworth. 2017. Graph Methods for Multilingual FrameNets. In Proceedings of TextGraphs-11: the Workshop on Graph-based Methods for Natural Language Processing. 45–50.
  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of machine Learning research 3, Jan (2003), 993–1022.
  • Brown et al. (1993) Peter F Brown, Vincent J Della Pietra, Stephen A Della Pietra, and Robert L Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational linguistics 19, 2 (1993), 263–311.
  • Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems. 3844–3852.
  • Fan et al. (2017) Yixing Fan, Liang Pang, JianPeng Hou, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. 2017. MatchZoo: A Toolkit for Deep Text Matching. arXiv preprint arXiv:1707.07270 (2017).
  • Hamilton et al. (2017) William L Hamilton, Rex Ying, and Jure Leskovec. 2017. Representation Learning on Graphs: Methods and Applications. arXiv preprint arXiv:1709.05584 (2017).
  • Hu et al. (2014) Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Advances in neural information processing systems. 2042–2050.
  • Huang et al. (2013) Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management. ACM, 2333–2338.
  • Jiang et al. (2010) Chuntao Jiang, Frans Coenen, Robert Sanderson, and Michele Zito. 2010. Text classification using graph mining-based feature extraction. Knowledge-Based Systems 23, 4 (2010), 302–308.
  • Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
  • Leskovec et al. (2004) Jure Leskovec, Marko Grobelnik, and Natasa Milic-Frayling. 2004. Learning sub-structures of document semantic graphs for document summarization. (2004).
  • Liu et al. (2017) Bang Liu, Di Niu, Kunfeng Lai, Linglong Kong, and Yu Xu. 2017. Growing Story Forest Online from Massive Breaking News. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 777–785.
  • Manning et al. (2014) Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations. 55–60.
  • Mihalcea and Tarau (2004) Rada Mihalcea and Paul Tarau. 2004. Textrank: Bringing order into text. In Proceedings of the 2004 conference on empirical methods in natural language processing.
  • Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
  • Mitra et al. (2017) Bhaskar Mitra, Fernando Diaz, and Nick Craswell. 2017. Learning to match using local and distributed representations of text for web search. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1291–1299.
  • Neculoiu et al. (2016) Paul Neculoiu, Maarten Versteegh, Mihai Rotaru, and Textkernel BV Amsterdam. 2016. Learning Text Similarity with Siamese Recurrent Networks. ACL 2016 (2016), 148.
  • Page et al. (1999) Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web. Technical Report. Stanford InfoLab.
  • Pang et al. (2016) Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. 2016. Text Matching as Image Recognition. In AAAI. 2793–2799.
  • Ponzanelli et al. (2015) Luca Ponzanelli, Andrea Mocci, and Michele Lanza. 2015. Summarizing complex development artifacts by mining heterogeneous data. In Proceedings of the 12th Working Conference on Mining Software Repositories. IEEE Press, 401–405.
  • Putra and Tokunaga (2017) Jan Wira Gotama Putra and Takenobu Tokunaga. 2017. Evaluating text coherence based on semantic similarity graph. In Proceedings of TextGraphs-11: the Workshop on Graph-based Methods for Natural Language Processing. 76–85.
  • Qiu and Huang (2015) Xipeng Qiu and Xuanjing Huang. 2015. Convolutional Neural Tensor Network Architecture for Community-Based Question Answering. In IJCAI. 1305–1311.
  • Rink et al. (2010) Bryan Rink, Cosmin Adrian Bejan, and Sanda M Harabagiu. 2010. Learning Textual Graph Patterns to Detect Causal Event Relations. In FLAIRS Conference.
  • Rousseau and Vazirgiannis (2013) François Rousseau and Michalis Vazirgiannis. 2013. Graph-of-word and TW-IDF: new approach to ad hoc IR. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management. ACM, 59–68.
  • Sayyadi and Raschid (2013) Hassan Sayyadi and Louiqa Raschid. 2013. A graph analytical approach for topic detection. ACM Transactions on Internet Technology (TOIT) 13, 2 (2013), 4.
  • Schenker et al. (2003) Adam Schenker, Mark Last, Horst Bunke, and Abraham Kandel. 2003. Clustering of web documents using a graph model. 55 (2003), 3–18.
  • Schuhmacher and Ponzetto (2014) Michael Schuhmacher and Simone Paolo Ponzetto. 2014. Knowledge-based graph document modeling. In Proceedings of the 7th ACM international conference on Web search and data mining. ACM, 543–552.
  • Shen et al. (2014) Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. Learning semantic representations using convolutional neural networks for web search. In Proceedings of the 23rd International Conference on World Wide Web. ACM, 373–374.
  • Wan et al. (2016) Shengxian Wan, Yanyan Lan, Jiafeng Guo, Jun Xu, Liang Pang, and Xueqi Cheng. 2016. A Deep Architecture for Semantic Matching with Multiple Positional Sentence Representations. In AAAI, Vol. 16. 2835–2841.
  • Xiong et al. (2017) Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. 2017. End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 55–64.
  • Yu et al. (2014) Lei Yu, Karl Moritz Hermann, Phil Blunsom, and Stephen Pulman. 2014. Deep learning for answer sentence selection. arXiv preprint arXiv:1412.1632 (2014).