DivGraphPointer: A Graph Pointer Network for Extracting Diverse Keyphrases

05/19/2019 · Zhiqing Sun et al. · HEC Montréal, Université de Montréal, Peking University

Keyphrase extraction from documents is useful to a variety of applications such as information retrieval and document summarization. This paper presents an end-to-end method called DivGraphPointer for extracting a set of diversified keyphrases from a document. DivGraphPointer combines the advantages of traditional graph-based ranking methods and recent neural network-based approaches. Specifically, given a document, a word graph is constructed from the document based on word proximity and is encoded with graph convolutional networks, which effectively capture document-level word salience by modeling long-range dependencies between words in the document and aggregating multiple appearances of identical words into one node. Furthermore, we propose a diversified pointer network to generate a set of diverse keyphrases out of the word graph in the decoding process. Experimental results on five benchmark data sets show that our proposed method significantly outperforms the existing state-of-the-art approaches.


1. Introduction

Keyphrase extraction from documents is useful in a variety of tasks such as information retrieval (Kim et al., 2013), text summarization (Qazvinian et al., 2010), and question answering (Li et al., 2010). It helps identify the salient content of a document, and the topic has attracted a large amount of work in the literature.

Most traditional approaches to keyphrase extraction are unsupervised. They usually first identify candidate keyphrases with some heuristics (e.g., regular expressions), and then rank the candidates according to their importance in the documents (Hasan and Ng, 2014). Along this direction, the state-of-the-art algorithms are graph-based ranking methods (Mihalcea and Tarau, 2004; Wan and Xiao, 2008; Liu et al., 2010), which first construct a word graph from a document and then determine the importance of the keyphrases with random-walk based approaches such as PageRank (Brin and Page, 1998). By constructing the word graph, these methods can effectively identify the most salient keyphrases. Some diversification mechanisms have also been investigated in early work (Mei et al., 2010; Bougouin et al., 2013) to address the problem of over-generation of the same concepts in keyphrase extraction. However, these methods are fully unsupervised. They rely heavily on manually designed heuristics, which may not work well when applied to a different type of document. In our experiments, we also observe that the performance of these methods is usually limited and inferior to the supervised ones.

Recently, end-to-end neural approaches for keyphrase extraction have been attracting growing interest (Zhang et al., 2016; Meng et al., 2017; Zhang et al., 2017). These neural approaches usually study keyphrase extraction in the encoder-decoder framework (Sutskever et al., 2014), which first encodes the input documents into vector representations and then generates the keyphrases with Recurrent Neural Network (RNN) (Mikolov et al., 2010) or CopyRNN (Gu et al., 2016) decoders conditioned on the document representations. These neural methods have achieved state-of-the-art performance on multiple benchmark data sets with end-to-end supervised training. The end-to-end training offers a great advantage in that the extraction process can adapt to the type of documents. However, compared to the unsupervised graph-based ranking approaches, existing end-to-end approaches treat documents only as sequences of words. They do not benefit from a more global graph structure that provides useful document-level word salience information, such as long-range dependencies between words, as well as a synthetic view on the multiple appearances of identical words in the document. Another problem of these end-to-end methods is that they cannot guarantee the diversity of the extracted keyphrases: it is often the case that several similar keyphrases are extracted. Therefore, we seek an approach that can model document-level word salience, generate diverse keyphrases, and meanwhile be efficiently trained in an end-to-end fashion.

Figure 1. Illustration of our encoder-decoder architecture for keyphrase extraction. In this example, the document is a sequence of words; the first keyphrase has already been generated, and we are predicting the second word of the second keyphrase, which will be selected from the nodes within the graph plus the ending token for a keyphrase. Note that multiple appearances of an identical word in the document are aggregated into only one node in the constructed word graph. (Better viewed in color.)

In this paper, we propose an end-to-end approach called DivGraphPointer for extracting diversified keyphrases from documents. Specifically, given an input document, we first construct a word graph from it, which aggregates identical words into one node and captures both short- and long-range dependencies between the words in the document. Afterwards, a graph convolutional neural network (Kipf and Welling, 2016) is applied to the word graph to learn the representation of each node, which effectively models word salience. To extract diverse keyphrases from documents, we propose a diversified pointer network model (Vinyals et al., 2015) over the word graph, which dynamically picks nodes from the word graph to construct the keyphrases. Two diversity mechanisms are proposed to increase the diversity among the generated keyphrases. Specifically, we employ a coverage attention mechanism (Tu et al., 2016) to address the over-generation problem in keyphrase extraction at the lexical level, and a semantic modification mechanism to dynamically modify the encoded document representation at the semantic level. Figure 1 illustrates our approach schematically. The whole framework can be effectively and efficiently trained with back-propagation in an end-to-end fashion. Experimental results show that our proposed DivGraphPointer achieves state-of-the-art performance for keyphrase extraction on five benchmarks and significantly outperforms the existing supervised and unsupervised keyphrase extraction methods.

The contribution of this paper is twofold:

  • We propose a graph convolutional network encoder for keyphrase extraction that can effectively capture document-level word salience.

  • We propose two complementary diversification mechanisms that help the pointer network decoder to extract diverse keyphrases.

2. Related Work

The most traditional keyphrase extraction method is based on Tf-Idf (Sparck Jones, 1972): it identifies words that appear frequently in a document, but do not occur frequently in the entire document collection. These words are expected to be salient words for the document. The same idea can be applied to a set of keyphrase candidates identified by some syntactic patterns (Frank et al., 1999; Hulth, 2003). However, the drawback of this family of approaches is that each word (or phrase) is considered in isolation. The inherent relations between words and between phrases are not taken into account, and such relations are important in keyphrase extraction.

To solve this problem, approaches based on word graphs have been proposed. In a word graph, words are connected according to some estimated relations such as co-occurrences. A graph-based extraction algorithm can then take into account the connections between words. The TextRank algorithm (Mihalcea and Tarau, 2004) was the first graph-based approach for keyphrase extraction. Given a word graph built on co-occurrences, it calculates the importance of candidate words with PageRank. The importance of a candidate keyphrase is then estimated as the sum of the scores of its constituent words. Following this work, the DivRank algorithm (Mei et al., 2010) was proposed to balance the importance and diversity of the extracted keyphrases. The TopicRank algorithm (Bougouin et al., 2013) was further proposed for topic-based keyphrase extraction. This algorithm first clusters the candidate phrases by topic and then chooses one phrase from each topic, which is able to generate a diversified set of keyphrases. In TopicRank, a complete topic graph is constructed to better capture the semantic relations between topics. The graph-based document representation can effectively model document-level word salience. However, these methods are fully unsupervised: the way that keyphrases are identified from a word graph is designed manually. Such methods lack the flexibility to cope with different types of documents. In our proposed method, we use end-to-end supervised training in order to adapt the extraction process to the documents. In experiments, this also yields better performance.

Indeed, end-to-end neural approaches to keyphrase extraction have attracted growing attention in recent studies. Zhang et al. (2016) treated keyphrase extraction as a sequence labeling task and proposed a model called joint-layer RNN for extracting keyphrases from Twitter data. Meng et al. (2017) first proposed an encoder-decoder framework for keyphrase extraction. However, their RNN-based encoder and decoder treat a document as a sequence and ignore the correlation between keyphrases. Afterwards, Zhang et al. (2017) further proposed a CNN-based model for this task. The copy mechanism (Gu et al., 2016), which allows copying words from the input documents, is employed to handle the rare word problem in these encoder-decoder approaches. All the above approaches are based on word sequences and inherit the well-known limitation that only local relations between words can be modeled. Compared to them, our model encodes documents with graphs, which are able to model more global document-level word salience.

Deep graph-based methods have been used for other tasks. They are also relevant to our work. For example, Yasunaga et al. (2017) proposed to use graph convolutional networks to rank the salience of candidate sentences for document summarization. Marcheggiani and Titov (2017) studied encoding sentences with graph convolutional neural networks for semantic role labeling. Bastings et al. (2017) studied graph convolutional encoders for machine translation. In this paper, we target a different task. This is the first time that graph convolutional neural networks are used for keyphrase extraction.

Diversity is an important criterion in keyphrase extraction: it is useless to extract a set of similar keyphrases. Diversity has been studied in IR for search result diversification (Carbonell and Goldstein, 1998; Santos et al., 2010). Two main ideas have been used: selecting results that are different from those already selected, or selecting a set of results that ensures a good coverage of topics. In keyphrase extraction, there are also a few attempts to deal with diversity. Similar to our work, Zhang and Xiao (2018) and Chen et al. (2018) also employed a coverage mechanism to address the diversification problem. Chen et al. (2018) further proposed a review mechanism to explicitly model the correlation between keyphrases. However, their approaches differ from ours in multiple ways. First, this paper focuses on keyphrase extraction, while their approaches focus on keyphrase generation, which also requires generating keyphrases that cannot be found in the document. Furthermore, we use a hard form of coverage attention that penalizes the attention weights with one-hot vectors, while they used the same soft form as the original machine translation work (Tu et al., 2016). We believe that in the keyphrase extraction task, where source and target share the same vocabulary, a hard coverage attention can avoid the error propagation of attentions. Moreover, by directly applying the coverage attention on the word graph, we efficiently penalize all appearances of identical words. Finally, the review mechanism was employed in the decoding phase of the RNN in Chen et al. (2018), while our context modification mechanism modifies the initial semantic state of the RNN. We expect that our context modification and coverage attention mechanisms explicitly address the over-generation problem in keyphrase extraction at both the semantic level and the lexical level, and are thus complementary to each other. In contrast, the review mechanism and the coverage mechanism used in Chen et al. (2018) both provide an attentive context, which tend to play similar roles and are somewhat redundant.

3. Deep Graph-based Diversified Keyphrase Extraction

Our Diversified Graph Pointer (DivGraphPointer) is based on the encoder-decoder framework. We first construct a word graph from a document based on word adjacency, and encode the graph with Graph Convolutional Networks (GCNs) (Kipf and Welling, 2016). Afterwards, based on the representation of the graph, we generate the keyphrases one by one with a decoder. Note that during the generation process, we restrict the output to the nodes of the graph; in other words, we only extract words appearing in the document. Figure 1 presents an illustration of the DivGraphPointer framework.

3.1. Graph Convolutional Network Encoder

In this part, we present our graph-based encoder. Traditional unsupervised graph-based ranking approaches for keyphrase extraction have shown promising performance at estimating the salience of words, which motivated us to develop deep graph-based encoders. Compared to sequence-based encoders such as RNNs and CNNs, graph-based encoders have several advantages. For example, graph-based encoders can explicitly leverage both short- and long-term dependencies between words. Moreover, while sequence-based encoders treat different appearances of an identical word independently, the graph representation of a document naturally represents an identical word at different positions as a single node in the word graph. In this way, graph-based encoding can aggregate the information from all appearances of an identical word when estimating its salience, thus exploiting an idea similar to the Tf-Idf (Sparck Jones, 1972) and PageRank (Brin and Page, 1998) methods: important words tend to be more frequent in the document and widely linked with other important words.

3.1.1. Graph Construction

Our model represents a document by a complete graph in which identical words are merged into nodes and edges are weighted according to the strength of the structural relations between nodes. Instead of manually designing and extracting connections between words (e.g., based on co-occurrence weights), we rely on the basic proximity information between words in the sentences, and let the training process learn to use it. The basic assumption is that the closer two words are in a sentence, the stronger their relation. Specifically, we construct a directed word graph from the word sequence of a document in both directions, with forward and backward adjacency matrices denoted as A^f and A^b. We assume that two words are related if they appear close to each other in a document. This extends the traditional adjacency relation to a more flexible proximity relation, where the strength of the relation between a pair of words depends on their distance. Specifically, similar to (Bougouin et al., 2013), we define the weight from word w_i to word w_j in the two graphs as follows:

(1)    A^f_{ij} = \sum_{p_i \in P(w_i)} \sum_{p_j \in P(w_j)} 1[p_j > p_i] / |p_j - p_i|
(2)    A^b_{ij} = \sum_{p_i \in P(w_i)} \sum_{p_j \in P(w_j)} 1[p_j < p_i] / |p_j - p_i|

where P(w_i) is the set of position offsets of word w_i in the document. The indicator function 1[·] filters the uni-directional information and helps the graph-based encoder focus on the order of the sequence.

In order to stabilize the iterative message propagation process in the graph convolutional network encoder, we normalize each adjacency matrix by \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}, where \tilde{A} = A + I is the adjacency matrix with self-connections and \tilde{D} is its degree matrix. The purpose of this re-normalization trick (Kipf and Welling, 2016) is to constrain the eigenvalues of the normalized adjacency matrices close to 1.

Such a graph construction approach differs from the one used in TextRank (Mihalcea and Tarau, 2004), which captures co-occurrence within a limited window of words. Our construction forms a complete graph where all nodes are interconnected. As stated in (Bougouin et al., 2013), the completeness of the graph has the benefit of providing a more exhaustive view of the relations between words. Also, computing weights based on the distances between offset positions bypasses the need for a manually defined parameter such as a window size.
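As a concrete illustration, the graph construction of Eqs. (1)-(2) and the re-normalization trick can be sketched as follows. This is a minimal numpy sketch under our reading of the text; the function and variable names are our own, not from the paper.

```python
import numpy as np

def build_graphs(words):
    """Build forward/backward proximity graphs over the unique words of a
    document: identical words share one node, and each directed edge weight
    sums the inverse distances between the words' position offsets."""
    vocab = sorted(set(words))
    idx = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    A_fwd = np.zeros((n, n))
    A_bwd = np.zeros((n, n))
    positions = {w: [p for p, x in enumerate(words) if x == w] for w in vocab}
    for wi in vocab:
        for wj in vocab:
            if wi == wj:
                continue
            for pi in positions[wi]:
                for pj in positions[wj]:
                    if pj > pi:            # wj appears after wi: forward edge
                        A_fwd[idx[wi], idx[wj]] += 1.0 / (pj - pi)
                    else:                  # wj appears before wi: backward edge
                        A_bwd[idx[wi], idx[wj]] += 1.0 / (pi - pj)
    return vocab, A_fwd, A_bwd

def normalize(A):
    """Re-normalization trick: D^{-1/2} (A + I) D^{-1/2} (Kipf and Welling)."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt
```

For the toy document "a b a c", the two appearances of "a" collapse into one node, and the forward weight from "a" to "c" sums 1/3 (from position 0) and 1/1 (from position 2).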

3.1.2. Graph Convolutional Networks

Next, we encode the nodes with multi-layer Graph Convolutional Networks (GCNs) (Kipf and Welling, 2016). Each graph convolutional layer generally consists of two stages. In the first stage, each node aggregates information from its neighbors; in the second stage, the representation of each node is updated according to its current representation and the information aggregated from its neighbors. Given the node representation matrix H^{(l)} in the l-th layer, the information aggregated from the neighbors, denoted as Z^{(l)}, can be calculated as follows:

(3)    Z^{(l)} = \hat{A}^f H^{(l)} W_f^{(l)} + \hat{A}^b H^{(l)} W_b^{(l)} + H^{(l)} W_s^{(l)}

which aggregates the information from both the neighbors defined in the two normalized adjacency matrices and the node itself. Here, W_f^{(l)}, W_b^{(l)}, and W_s^{(l)} are layer-specific trainable weight matrices.

Once the information from the neighbors is aggregated, inspired by He et al. (2016) and Gehring et al. (2017), we update the node representations with a residual Gated Linear Unit (GLU) activation:

(4)    H^{(l+1)} = H^{(l)} + Z^{(l)} \odot \sigma(G^{(l)})

where \sigma is the sigmoid function and \odot is the point-wise multiplication. G^{(l)} is computed in a similar way as Z^{(l)} (with its own weight matrices) and is used as the gating function of the information collected from the neighbors. H^{(0)} is initialized with the pretrained word embedding matrix, and the residual connection is omitted in its activation.

The representation of the entire graph (or the document representation) d is then obtained by averaging the node representations of the last layer, H^{(L)}, where L denotes the total number of GCN layers.
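The aggregation and residual GLU update of Eqs. (3)-(4) can be sketched as below. This is a simplified numpy illustration under stated assumptions: each of Z and G uses three weight matrices (forward neighbors, backward neighbors, self-loop), which is our reading of the aggregation formula.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gcn_glu_layer(H, A_fwd, A_bwd, Wz, Wg):
    """One graph-convolutional layer with a residual GLU update (sketch).

    H: node representations (n_nodes x dim). Wz / Wg: triples of weight
    matrices for the forward-neighbor, backward-neighbor, and self terms.
    Z aggregates information; sigmoid(G) gates how much of it is added."""
    def aggregate(W):
        Wf, Wb, Ws = W
        return A_fwd @ H @ Wf + A_bwd @ H @ Wb + H @ Ws
    Z = aggregate(Wz)          # information from neighbors and the node itself
    G = aggregate(Wg)          # gating values, same functional form as Z
    return H + Z * sigmoid(G)  # residual GLU activation
```

The document representation is then the mean of the last layer's rows, e.g. `d = H_final.mean(axis=0)`.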

Based on the encoded document representation d, we propose a decoder, named DivPointer, to generate summative and diverse keyphrases, described in the next section.

3.2. Diversified Pointer Network Decoder

In this part, we introduce our approach to keyphrase extraction based on the graph representation. Most previous end-to-end neural approaches select keyphrases independently during the decoding process. However, ignoring the diversity among phrases may lead to multiple similar keyphrases, undermining the representativeness of the keyphrase set. Therefore, we propose a DivPointer network with two mechanisms, at the semantic level and the lexical level respectively, to improve the diversity among keyphrases during the decoding process.

3.2.1. Pointer Network

The decoder is used to generate output keyphrases according to the representation of the input document. We adopt a pointer network (Vinyals et al., 2015) with diversity-enabled attention to generate keyphrases. A pointer network is a neural architecture that learns the conditional probability of an output sequence whose elements are discrete tokens corresponding to positions in the original data space. In our case, the graph nodes corresponding to words in a document are regarded as the original data space of the pointer network.

Specifically, the pointer decoder receives the document representation d as the initial state h_0, and predicts each word of a keyphrase sequentially based on the hidden state h_t:

(5)    h_t = RNN(h_{t-1}, e_{t-1})

where e_{t-1} denotes the node representation of the word generated at the previous step, and h_t is the hidden state of an RNN. The next word is then selected with a pointer network according to an attention mechanism based on h_t.

A general attention (Bahdanau et al., 2014) score on each graph node i with respect to the hidden state h_t can be computed by:

(6)    u_t^i = v^T tanh(W_1 e_i + W_2 h_t + b)

where e_i is the node representation of node i taken from H^{(L)}, and v, W_1, W_2, and b are parameters to be learned. We can then obtain the pointer distribution over the nodes by normalizing u_t:

(7)    a_t^i = exp(u_t^i) / \sum_j exp(u_t^j)

With this distribution, we can select the word with the maximum pointer probability as the next word from the graph nodes.
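The attention scoring and pointer normalization of Eqs. (6)-(7) can be sketched as follows; the shapes and parameter names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def pointer_distribution(E, h, W1, W2, v, b):
    """Additive (Bahdanau-style) attention over graph nodes (sketch).

    E: node representations (n_nodes x d_node), h: decoder hidden state.
    Returns a softmax-normalized pointer distribution over the nodes."""
    u = np.tanh(E @ W1 + h @ W2 + b) @ v   # one unnormalized score per node
    u = u - u.max()                        # shift for numerical stability
    a = np.exp(u) / np.exp(u).sum()        # softmax over nodes
    return a
```

Selecting the next word is then a matter of `np.argmax(a)` over the graph nodes.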

However, the attention mechanism above is built merely on the global hidden state, namely the document representation d. Diversity among the generated keyphrases can hardly be achieved without taking the previously generated keyphrases into consideration during decoding. Two mechanisms aimed at achieving diversity among the generated keyphrases are introduced in the following sections.

3.2.2. Context Modification

During the decoding process, the document representation d is the key to maintaining a global semantic context for each word to be generated. However, a constant context may lead to generating similar keyphrases repeatedly, which hurts the representativeness of the generated keyphrase set.

Therefore, we propose to update the context dynamically based on the previously generated keyphrases while decoding the words of the k-th phrase, as follows:

(8)    \bar{g}_k = (1 / (k-1)) \sum_{j=1}^{k-1} g_j
(9)    d_k = tanh(W_c d + U_c \bar{g}_k)

where we introduce the average representation \bar{g}_k of the previously generated keyphrases into the model to learn an updated context d_k. The context for the first keyphrase is initialized as d, and the parameters W_c and U_c are initialized accordingly.

We find that the modified context helps the pointer network focus on keyphrases with different meanings that are yet to come. Therefore, it can alleviate the over-generation problem of previous deep generation models.
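A minimal sketch of the context modification of Eqs. (8)-(9) is shown below. The exact parametric form is not fully recoverable from the text, so the mixing function, and the names W_c and U_c, are assumptions of ours.

```python
import numpy as np

def modified_context(d, prev_phrase_reprs, Wc, Uc):
    """Hypothetical context-modification sketch.

    d: document representation; prev_phrase_reprs: list of vector
    representations of the previously generated keyphrases. The average of
    the previous phrases is mixed into d to form the context for the next
    keyphrase; for the first keyphrase the context is d itself."""
    if not prev_phrase_reprs:
        return d                                  # k = 1: unmodified context
    g_bar = np.mean(prev_phrase_reprs, axis=0)    # Eq. (8): average of previous phrases
    return np.tanh(d @ Wc + g_bar @ Uc)           # Eq. (9): updated context
```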

3.2.3. Coverage Attention

In addition to the context modification mechanism, we also adopt a coverage mechanism to enhance diversity among keyphrases at the lexical level. The coverage mechanism has been well investigated in machine translation (Tu et al., 2016), search result diversification (Santos et al., 2010) and document summarization (See et al., 2017). When the criterion is used in a neural network, it generally maintains a coverage vector to keep track of the attention history. More specifically, these models directly use the sum of previous alignment probabilities as the coverage of each word in the input.

Since keyphrases are usually short and summative, their meanings are more sensitive to term replacement than those of sentences or documents. Based on this observation, we propose a coverage attention at the lexical level with the help of one-hot representations of the previously generated keyphrases. Specifically, we use the sum of the one-hot vectors of the previously generated keyphrases as a coverage representation:

(10)    c_k^i = \sum_{j=1}^{k-1} \sum_t o_{j,t}^i

where c_k^i is the coverage value of node i for the k-th keyphrase and o_{j,t}^i is the i-th element of the one-hot vector of the t-th word in the j-th keyphrase. It is 1 if the generated word is node i in the graph, 0 otherwise.

The coverage representation is then incorporated into the attention mechanism of our DivPointer network, extending the attention score of Eq. (6) with a coverage term:

u_t^i = v^T tanh(W_1 e_i + W_2 h_t + W_3 c_k^i + b)

where W_3 is an additional learned parameter. Note that our coverage attention mechanism differs from the original form (Tu et al., 2016), which was designed for machine translation. The reason for the change is that in the keyphrase extraction task, the source and the target share the same vocabulary, so a hard coverage attention can avoid propagating attention errors to similar words; a soft attention distribution, in contrast, cannot precisely represent the previously generated keyphrases.
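The hard coverage of Eq. (10) amounts to counting, for each graph node, how often it has already been emitted in previous keyphrases, and penalizing its attention score accordingly. The sketch below makes the counting explicit; the subtraction-with-a-scalar-penalty form of the score adjustment is an illustrative assumption.

```python
import numpy as np

def coverage_vector(num_nodes, generated_phrases):
    """Hard coverage over graph nodes: the sum of the one-hot vectors of
    all words in the previously generated keyphrases (Eq. 10).

    generated_phrases: list of phrases, each a list of node indices."""
    c = np.zeros(num_nodes)
    for phrase in generated_phrases:
        for node in phrase:
            c[node] += 1.0
    return c

def covered_scores(u, c, penalty):
    """Penalize the attention scores of already-covered nodes (hypothetical
    form; the paper folds the coverage term inside the attention tanh)."""
    return u - penalty * c
```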

4. Training and Decoding

In both training and decoding, an end-of-phrase symbol is added to the end of each keyphrase to help the model learn when to stop adding words. The whole model is trained to maximize the log-likelihood of the words in the keyphrases given the input document. Specifically, the training objective is defined as below:

(11)    L = \sum_{n=1}^{N} \sum_{k=1}^{M_n} log p(y_k^n | y_{<k}^n, d^n)

where N is the number of documents and M_n is the number of keyphrases in the n-th document. p(y_k^n | y_{<k}^n, d^n) is the generative probability of the k-th keyphrase y_k^n in the n-th document, conditioned on the previously generated keyphrases y_{<k}^n and the context d^n of the document. The order of the keyphrases for the same document is randomly shuffled in each epoch of training.

We use Adam (Kingma and Ba, 2014) with mini-batch training to optimize the model parameters; the learning rate is changed after the first 6,000 steps. Gradient clipping is applied at each step. We also use early stopping in the training with a validation dataset. Model parameters are initialized from normal distributions (Glorot and Bengio, 2010). Dropout (Srivastava et al., 2014) is applied on the word embeddings and on the GCN outputs to reduce over-fitting. We also apply batch normalization (Ioffe and Szegedy, 2015) after the last graph convolutional layer to accelerate the training process.

In decoding, we generate keyphrases based on the negative log generative probability of the candidate keyphrases, with a beam search of fixed beam width and search depth. In practice, we find that the model tends to generate short keyphrases when using the negative log-likelihood as the keyphrase score. Therefore, we propose a simple keyphrase length penalization, where we normalize the score of a candidate keyphrase by its length L:

(12)    \hat{s} = s / L^{\alpha}

where s is the original score (the negative log probability) and \hat{s} is the normalized score. \alpha is the length penalty factor, which controls the trade-off between shorter and longer generated keyphrases.
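A minimal sketch of the length penalization: the score of a candidate is divided by its length raised to a penalty exponent. The exact form and direction of the paper's penalty factor are assumptions here; this is the common length-normalization convention, not necessarily the paper's exact one.

```python
def length_penalized_score(neg_log_prob, length, alpha):
    """Length-normalized keyphrase score (sketch).

    neg_log_prob: sum of negative log probabilities of the phrase's words
    (lower is better). alpha = 0 leaves scores unnormalized, which favors
    short phrases; larger alpha normalizes away the per-word cost."""
    return neg_log_prob / (length ** alpha)
```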

Dataset    #papers    avg. #keyphrases    avg. keyphrase length
training data
Kp20k      527,830    2.94    1.80
validation data
Inspec     1,500      7.39    2.28
NUS        (five-fold cross-validation)
SemEval    188        3.84    1.97
Krapivin   1,904      2.52    1.94
Kp20k      20,000     2.94    1.80
test data
Inspec     500        7.70    2.28
NUS        211        5.37    1.84
SemEval    100        5.73    1.93
Krapivin   400        3.24    2.01
Kp20k      20,000     2.94    1.80
Table 1. Statistics of the five datasets. We use the original training data from the Inspec, SemEval, and Krapivin data sets as validation data to select the best length penalty factor \alpha; NUS is evaluated with five-fold cross-validation.
Inspec NUS SemEval Krapivin Kp20k
@5 @10 @5 @10 @5 @10 @5 @10 @5 @10
unsupervised methods
Tf-Idf 0.223 0.304 0.139 0.181 0.120 0.184 0.113 0.143 0.105 0.130
TextRank 0.229 0.275 0.195 0.190 0.172 0.181 0.172 0.147 0.180 0.150
SingleRank 0.214 0.297 0.145 0.169 0.132 0.169 0.096 0.137 0.099 0.124
ExpandRank 0.211 0.295 0.137 0.162 0.135 0.163 0.096 0.136 N/A N/A
supervised methods
RNN 0.000 0.000 0.005 0.004 0.004 0.003 0.002 0.001 0.138 0.009
CopyRNN 0.292 0.336 0.342 0.317 0.291 0.296 0.302 0.252 0.328 0.255
CNN 0.088 0.069 0.176 0.133 0.162 0.127 0.141 0.098 0.188 0.203
CopyCNN 0.285 0.346 0.342 0.330 0.295 0.308 0.314 0.272 0.351 0.288
SeqPointer 0.347 0.386 0.383 0.345 0.328 0.324 0.318 0.274 0.333 0.281
GraphPointer 0.375* 0.387 0.421* 0.375** 0.377* 0.350** 0.340** 0.280** 0.341** 0.282*
DivGraphPointer 0.386** 0.417** 0.460** 0.402** 0.401** 0.389** 0.363** 0.297** 0.368** 0.292**
Normalized Discounted Cumulative Gain @10 (NDCG@10)
SeqPointer 0.448 0.501 0.440 0.476 0.463
GraphPointer 0.479 0.536 0.482 0.499 0.498
DivGraphPointer 0.503 0.591 0.518 0.534 0.532
Table 2. The performance of keyphrase extraction on the five benchmarks. The results of the first six methods are taken from (Meng et al., 2016) and the results of CNN and CopyCNN are taken from (Zhang et al., 2017). The significance levels (**: p < 0.01, *: p < 0.1) between methods (GraphPointer vs. SeqPointer, DivGraphPointer vs. GraphPointer) are also provided.

5. Experiments

5.1. Experimental Setting

Data

We use the data set Kp20k (Meng et al., 2017) for training. Kp20k contains a large amount of high-quality scientific metadata in the computer science domain from various online digital libraries (Meng et al., 2016). We follow the official setting of this dataset and split it into training (527,830 articles), validation (20,000 articles) and test (20,000 articles) data. We further test the model trained on Kp20k on four widely adopted keyphrase extraction data sets: Inspec (Hulth, 2003), NUS (Nguyen and Kan, 2007), SemEval-2010 (Kim et al., 2010) and Krapivin (Krapivin et al., 2009). Following (Meng et al., 2017), we take the concatenation of the title and the abstract as the content of a paper for all these datasets. No text pre-processing steps are conducted. In this paper, we focus on keyphrase extraction; therefore, only the keyphrases that appear in the documents are used for training and evaluation. Table 1 provides, for each benchmark dataset, the number of papers, the average number of keyphrases per paper, and the average keyphrase length.

Model Setting

A 3-layer Gated Recurrent Unit (GRU) (Chung et al., 2014) is used as the RNN recurrent function. The dimension of both the input and the output of the GRU is set to 400, while the word embedding dimension is set to 300. The number of GCN layers is empirically set to 6. The length penalty factor \alpha is selected according to validation on the different data sets. All the other hyper-parameters are kept the same when we evaluate our different models on different datasets. The word embeddings (or node embeddings) are fixed to the pre-trained fastText model (Bojanowski et al., 2016), which is trained on Wikipedia 2017, the UMBC web-base corpus and the statmt.org news dataset (16B tokens). It breaks words into sub-words, which helps handle the out-of-vocabulary (OOV) problem.

Baseline

We compare our models with four supervised algorithms: RNN (Meng et al., 2017), CopyRNN (Meng et al., 2017), CNN (Zhang et al., 2017), and CopyCNN (Zhang et al., 2017). Considering that the unsupervised ranking-based methods motivated our proposed graph-based encoder, we also compare our models with four well-known unsupervised algorithms for keyphrase extraction: Tf-Idf (Sparck Jones, 1972), TextRank (Mihalcea and Tarau, 2004), SingleRank (Wan and Xiao, 2008), and ExpandRank (Wan and Xiao, 2008).

We also compare several different variants of our algorithms. We compare with the method of encoding documents with 2-layer bi-directional LSTM (Hochreiter and Schmidhuber, 1997) and decoding the keyphrases with pointer networks, marked as SeqPointer. We also compare with the method with graph-based encoder and vanilla pointer decoder, marked as GraphPointer. No diversity mechanisms are used during decoding for both SeqPointer and GraphPointer. The hyper-parameters of these two models are also selected according to the validation data.

Evaluation

Following the literature, the macro-averaged precision, recall and F1 measures are used to measure the overall performance. Here, precision is defined as the number of correctly predicted keyphrases over the number of all predicted keyphrases; recall is computed as the number of correctly predicted keyphrases over the total number of target keyphrases; and F1 is the harmonic mean of precision and recall.
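The set-based metrics can be sketched as follows for a single document at cutoff k; keyphrases are assumed to be pre-stemmed strings matched by exact set membership, and the function name is our own.

```python
def precision_recall_f1(predicted, gold, k):
    """Precision, recall and F1 at cutoff k for one document (sketch).

    predicted: ranked list of predicted keyphrases; gold: set of target
    keyphrases. Matching is exact string membership (stemming is assumed
    to have been applied to both sides beforehand)."""
    pred_k = predicted[:k]
    correct = sum(1 for p in pred_k if p in gold)
    precision = correct / len(pred_k) if pred_k else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Macro-averaging then simply averages these per-document scores over the test collection.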

Besides the set-based metrics, we also evaluate our models with a rank-based metric, Normalized Discounted Cumulative Gain (NDCG) (Wang et al., 2013). Since most keyphrase extraction models’ output is sequential, we believe that the rank-based metric could better measure the performance of those models.

In the evaluation, we apply the Porter stemmer (Porter, 1980) to both target keyphrases and predicted keyphrases when determining keyphrase matches and identical-word matches. The embeddings of different variants of an identical word are averaged when fed as input to the graph-based encoder.

5.2. Results

The performance of the different algorithms on the five benchmarks is summarized in Table 2. For each method, the table presents the F1 measure when generating 5 and 10 keyphrases. We also include the truncated NDCG measure when generating 10 keyphrases. The best results are highlighted in bold. We can see that in most cases (except RNN and CNN), the supervised models outperform all the unsupervised algorithms. This is not surprising, since the supervised models are trained end-to-end with supervised data. The RNN and CNN models do not perform well on nearly all datasets (e.g., Inspec, NUS, SemEval, and Krapivin). The reason is that they cannot generate words that are not in the vocabulary (OOV words) during decoding, while in this paper we only allow generating words that appear in the given documents. This problem is addressed by the SeqPointer model, which utilizes the pointer network to copy words from the source text, and hence the performance is significantly improved. A similar phenomenon can be observed with CopyCNN.

Comparing SeqPointer and CopyRNN, we observe that SeqPointer outperforms CopyRNN on all data sets. This could be because the generation mechanism interferes with the copy mechanism in CopyRNN. Implementation details could also contribute to the performance difference. For example, we use fastText word embeddings to handle the OOV problem, while Meng et al. (2017) randomly initialize their word embeddings. Moreover, we use a length penalty mechanism to address the problem of short phrase generation, while Meng et al. (2017) applied a simple heuristic that preserves only the first single-word phrase and removes the rest. We also use different hidden dimension, learning rate, dropout rate, beam depth, and beam size settings.

By replacing the RNN encoder with the graph convolutional network encoder, the performance of GraphPointer is further improved. This shows that although sequence-based encoders are effective at capturing sequential information such as word order and adjacency, such information is not sufficient for the keyphrase extraction task. Modeling long-term dependencies and aggregating word information turn out to be more important for precisely extracting keyphrases.

Our proposed model DivGraphPointer achieves the best performance by modeling document-level word salience during the encoding process (the graph encoder), and increasing the diversity of keyphrases during decoding with diversified pointer networks.

              Inspec            Krapivin
              F1@5    F1@10     F1@5    F1@10
SeqPointer    0.347   0.386     0.318   0.274
1-layer GCN   0.351   0.365     0.320   0.261
3-layer GCN   0.373   0.382     0.334   0.268
6-layer GCN   0.375   0.387     0.340   0.280
9-layer GCN   0.374   0.397     0.362   0.284
12-layer GCN  0.373   0.394     0.344   0.283
Table 3. Effectiveness of encoding documents as graphs with graph convolutional neural networks.
            Inspec            Krapivin
            F1@5    F1@10     F1@5    F1@10
+neither    0.375   0.387     0.340   0.280
+coverage   0.363   0.381     0.356   0.283
+context    0.379   0.400     0.360   0.290
+both       0.386   0.417     0.363   0.297
Table 4. Performance of DivGraphPointer with different diversity mechanisms. Here "coverage" denotes coverage attention, and "context" denotes context modification.
           AIC@5                      AIC@10
           GraphP.   DivGraphP.       GraphP.   DivGraphP.
Inspec     0.057     0.034            0.062     0.048
NUS        0.056     0.041            0.063     0.054
SemEval    0.048     0.031            0.051     0.044
Krapivin   0.047     0.033            0.050     0.044
Kp20k      0.049     0.030            0.053     0.040
Table 5. The Average Index of Coincidence (AIC@N) of the keyphrases extracted by GraphPointer and DivGraphPointer on five benchmarks.
      Inspec            Krapivin
      F1@5    F1@10     F1@5    F1@10
      0.394   0.411     0.348   0.282
      0.386   0.417     0.363   0.297
      0.379   0.415     0.365   0.297
      0.364   0.400     0.370   0.296
      0.349   0.389     0.363   0.295
      0.341   0.382     0.362   0.296
Table 6. The performance of DivGraphPointer w.r.t. the length penalty factor.
Figure 2. An example of keyphrase extraction results with CopyRNN and our models. Phrases in bold are true keyphrases that are correctly predicted by the algorithms.

5.3. Model Analysis

We further conducted a detailed analysis of the proposed DivGraphPointer model, taking two data sets, Inspec and Krapivin, as examples.

5.3.1. Effectiveness of Encoding Documents as Graphs

In this part, we evaluate the effectiveness of encoding documents as graphs. We compare SeqPointer with GraphPointer using different numbers of graph convolutional layers. To focus the comparison on the encoders, we do not use the diversity mechanism and set the length penalty factor to the same value in all compared algorithms.

The results on Inspec and Krapivin are summarized in Table 3. First, we can see that encoding documents as graphs significantly outperforms SeqPointer, which encodes documents with a bi-directional RNN, especially when more GCN layers are used. Notably, even a 1-layer GCN can outperform the RNN encoder on some metrics. This shows that the graph-based representation is well suited to the keyphrase extraction task and demonstrates the effectiveness of the graph-based encoder in aggregating information from multiple appearances of identical words and in modeling long-term dependencies.

Increasing the number of layers improves the performance at first; when too many layers are used, the performance decreases due to over-fitting. In all our other experiments, we set the number of graph convolutional layers to six, balancing the effectiveness and efficiency of the graph convolutional networks.
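For illustration, the stacked graph convolutions can be sketched as below in pure Python. This is a generic GCN sketch under assumed conventions (a pre-normalized weighted adjacency with self-loops, ReLU activations); it is not the paper's exact propagation rule or edge weighting.

```python
def gcn_layer(adj, feats, weight, relu=True):
    """One graph convolution: H' = ReLU(A_hat @ H @ W), where A_hat is a
    pre-normalized weighted adjacency matrix including self-loops."""
    n, d_in = len(feats), len(feats[0])
    d_out = len(weight[0])
    # Aggregate neighbor features: M = A_hat @ H
    agg = [[sum(adj[i][k] * feats[k][j] for k in range(n))
            for j in range(d_in)] for i in range(n)]
    # Linear transform: M @ W, then ReLU
    out = [[sum(agg[i][k] * weight[k][j] for k in range(d_in))
            for j in range(d_out)] for i in range(n)]
    if relu:
        out = [[max(0.0, v) for v in row] for row in out]
    return out

def gcn_encode(adj, feats, weights):
    """Stack several layers (the experiments above settle on six)."""
    h = feats
    for w in weights:
        h = gcn_layer(adj, h, w)
    return h
```

With each additional layer, a node's representation incorporates information from one hop further in the word graph, which is why depth helps up to a point.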

5.3.2. Diversity Mechanism

Next, we investigate the effectiveness of the proposed diversity mechanisms: context modification and coverage attention. We compare the following DivGraphPointer variants: (1) with neither of them, (2) with only context modification, (3) with only coverage attention, and (4) with both of them. Here we also fix the length penalty factor. The results are presented in Table 4.

We can see that adding context modification significantly improves performance compared to the vanilla pointer network. The results of adding coverage attention are mixed: on the Krapivin data set the performance improves, while on the Inspec data set it decreases. Adding both mechanisms is very robust, performing significantly better than the vanilla pointer network. This suggests that coverage attention and context modification address the lexical-level and semantic-level duplication problems, respectively, and are thus complementary to each other.
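A minimal sketch of the coverage idea: penalize attention on nodes that earlier keyphrases already attended to. The subtractive form and the penalty weight here are illustrative assumptions, not the paper's exact parameterization.

```python
import math

def coverage_attention(scores, coverage, penalty=1.0):
    """Adjust raw pointer scores by a coverage penalty before the softmax.

    scores: raw attention scores over graph nodes for the current step
    coverage: cumulative attention each node received in earlier keyphrases
    """
    adjusted = [s - penalty * c for s, c in zip(scores, coverage)]
    m = max(adjusted)
    exps = [math.exp(a - m) for a in adjusted]  # numerically stable softmax
    z = sum(exps)
    return [e / z for e in exps]
```

Nodes that have already been heavily attended (high coverage) receive less probability mass, steering the pointer toward unused parts of the word graph.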

We also provide the Average Index of Coincidence (AIC) (Friedman, 1922) of the keyphrases extracted by GraphPointer and DivGraphPointer in Table 5. AIC measures the probability of identical words appearing in the extracted keyphrases; a smaller AIC indicates less redundancy. We find that our proposed mechanisms are very effective in improving keyphrase diversity. We also observe that the effect is more pronounced for the top 5 keyphrases than for the top 10, which suggests that the diversification mechanism tends to take effect at the beginning of the keyphrase extraction process.

5.3.3. Length Penalty Factor

Finally, we investigate how the length penalty factor affects the performance of DivGraphPointer. Results with different values of the length penalty factor are presented in Table 6, and they show that the factor affects the model performance significantly. Either a small value (e.g., 0), which tends to generate long phrases, or a large value, which tends to generate short phrases, yields inferior results. The best choice lies between 0 and 1 for Inspec and between 1 and 5 for Krapivin, which indicates that both extreme cases, full normalization and no normalization, are poor choices.
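One common way to realize such a penalty is to normalize a candidate phrase's accumulated decoding score by its length raised to the penalty factor. The paper's exact scoring form is not given here, so the sketch below assumes a raw score that grows with phrase length, so that a factor of 0 (no normalization) favors long phrases and a large factor favors short ones.

```python
def length_penalized(raw_score, length, alpha):
    """Normalize an accumulated phrase score by length**alpha (assumed form)."""
    return raw_score / (length ** alpha) if length > 0 else raw_score

def rerank(phrases, alpha):
    """phrases: list of (token_list, raw_score); returns best-first order
    under the length-penalized score."""
    return sorted(phrases,
                  key=lambda ps: length_penalized(ps[1], len(ps[0]), alpha),
                  reverse=True)
```

Sweeping the factor, as in Table 6, then trades off phrase granularity against score fidelity.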

Figure 3. A sub-graph on keyphrase words of the full word graph for the document in Figure 2. The edge width is proportional to the edge weight.

5.4. Case Analysis

Finally, we present a case study on the keyphrases extracted by the different algorithms. Figure 2 presents the results of CopyRNN (Meng et al., 2017) and different variants of our algorithms, where "Model-N" denotes the evaluated model run with length penalty factor N. We also visualize a sub-graph over the words that appear in the keyphrases, to clearly illustrate the encoding phase of the graph-based methods. From the figures, we have the following observations:

  1. The diversity mechanism is quite effective at increasing the diversity of the generated keyphrases. For example, “CopyRNN” generates five similar keyphrases, all containing the word “noise”, and “GraphPointer-100” generates four such keyphrases, while “DivGraphPointer-100” and “DivGraphPointer-1” generate only two.

  2. The length penalty factor can effectively control the granularity of the generated keyphrases. For example, the single-word keyphrase “coherence” ranks first in “DivGraphPointer-100” but only sixth in “DivGraphPointer-1”, even though the same trained model is used.

  3. The graph-based encoders can better estimate the salience of words through their relations. For example, “high-rise building” and “rectangular room” are highly relevant in the word graph and are thus selected by the “GraphPointer” model, while “CopyRNN” finds “sound pressure”, which appears only once and is only weakly connected to other keyphrases. The strong connection between “traffic” and “noise” also explains why the graph-based methods rank “traffic noise” higher than the true keyphrases “traffic noise transmission” and “traffic noise prediction methods”.

6. Conclusions and Future Work

In this paper, we propose an end-to-end method called DivGraphPointer for extracting diverse keyphrases. It formulates documents as graphs and applies graph convolutional networks to encode them, which efficiently captures document-level word salience by modeling both short- and long-range dependencies between words and by aggregating the information of multiple appearances of identical words. To avoid extracting similar keyphrases, a diversified pointer network is proposed to generate diverse keyphrases from the nodes of the graphs. Experiments on five benchmark data sets show that DivGraphPointer significantly outperforms existing state-of-the-art supervised and unsupervised methods.

Our research can be extended in many directions. To begin with, our diversified pointer network decoder currently extracts keyphrases in an auto-regressive fashion; we could leverage reinforcement learning to address the exposure bias and the consequent error propagation in the sequential generation process. Moreover, our graph convolutional network encoder currently aggregates word relation information through manually designed, proximity-based edge weights; we would like to explore graph attention networks (GATs) (Velickovic et al., 2017) to dynamically capture correlations between words in the word graph. Finally, utilizing linguistic information when constructing edges in word graphs to ease keyphrase extraction is also an interesting future direction.

Acknowledgements

We would like to thank Rui Meng for sharing the source code and giving helpful advice.

References