
Heterogeneous Graph Neural Networks for Keyphrase Generation

The encoder-decoder framework achieves state-of-the-art results in keyphrase generation (KG) tasks by predicting both present keyphrases that appear in the source document and absent keyphrases that do not. However, relying solely on the source document can result in generating uncontrollable and inaccurate absent keyphrases. To address these problems, we propose a novel graph-based method that can capture explicit knowledge from related references. Our model first retrieves document-keyphrase pairs similar to the source document from a predefined index as references. Then a heterogeneous graph is constructed to capture relationships of different granularities between the source document and its references. To guide the decoding process, a hierarchical attention and copy mechanism is introduced, which directly copies appropriate words from both the source document and its references based on their relevance and significance. The experimental results on multiple KG benchmarks show that the proposed model achieves significant improvements over other baseline models, especially with regard to absent keyphrase prediction.


1 Introduction

Keyphrase generation (KG), a fundamental task in the field of natural language processing (NLP), refers to the generation of a set of keyphrases that express the crucial semantic meaning of a document. These keyphrases can be further categorized into present keyphrases that appear in the document and absent keyphrases that do not. Current KG approaches generally adopt an encoder-decoder framework Sutskever et al. (2014) with an attention mechanism Bahdanau et al. (2015); Luong et al. (2015) and a copy mechanism Gu et al. (2016); See et al. (2017) to simultaneously predict present and absent keyphrases Meng et al. (2017); Chen et al. (2018); Chan et al. (2019); Chen et al. (2019b, a); Yuan et al. (2020).

Although the proposed methods for keyphrase generation have shown promising results on present keyphrase predictions, they often generate uncontrollable and inaccurate predictions on the absent ones. The main reason is that there are numerous absent keyphrase candidates that have implicit relationships (e.g., technology hypernyms or task hypernyms) with the concepts in the document. For instance, for a document discussing “LSTM”, all the technology hypernyms like “Neural Network”, “RNN” and “Recurrent Neural Network” can be its absent keyphrase candidates. When dealing with scarce training data or limited model size, it is non-trivial for the model to summarize and memorize all the candidates accurately. Thus, one can expect that the generated absent keyphrases are often sub-optimal when the candidate set in the model’s mind is relatively small or inaccurate. This problem is crucial because absent keyphrases account for a large proportion of all the ground-truth keyphrases. As shown in Figure 1, in some datasets up to 50% of the keyphrases are absent.

Figure 1: Proportion of present and absent keyphrases among four datasets. Although the previous methods for keyphrase generation have shown promising results on present keyphrase predictions, they are not yet satisfactory on the absent keyphrase predictions, which also occupy a large proportion.
Figure 2: Graphical illustration of our proposed Gater. We first retrieve references using the source document, where each reference is the concatenation of a document and its keyphrases from the training set. Then we construct a heterogeneous graph and perform iterative updating. Finally, the source document node is extracted to decode the keyphrase sequence with a hierarchical attention and copy mechanism.

To address this problem, we propose a novel graph-based method to capture explicit knowledge from related references. Each reference is a retrieved document-keyphrase pair from a predefined index (e.g., the training set) that is similar to the source document. This is motivated by the fact that the related references often contain candidate or even ground-truth absent keyphrases of the source document. Empirically, we find that three retrieved references cover 27% of the ground-truth absent keyphrases on average (see Section 4.3 for details).

Our heterogeneous graph is designed to incorporate knowledge from the related references. It contains source document, reference and keyword nodes, and it has the following advantages: (a) different reference nodes can interact with the source document with respect to the explicitly shared keyword information, which can enrich the semantic representation of the source document; (b) a powerful structural prior is introduced, as the keywords overlap heavily with the ground-truth keyphrases. Statistically, we collect the top five keywords from each document on the validation set and find that these keywords contain 68% of the tokens in the ground-truth keyphrases. On the decoder side, as a portion of the absent keyphrases directly appear in the references, we propose a hierarchical attention and copy mechanism for copying appropriate words from both the source document and its references based on their relevance and significance.

The main contributions of this paper can be summarized as follows: (1) we design a heterogeneous graph network for keyphrase generation, which enriches the source document node through keyword nodes and retrieved reference nodes; (2) we propose a hierarchical attention and copy mechanism to facilitate the decoding process, which copies appropriate words from both the source document and the retrieved references; and (3) our proposed method outperforms other state-of-the-art methods on multiple benchmarks, and it especially excels in absent keyphrase prediction. Our code is publicly available at https://github.com/jiacheng-ye/kg_gater.

2 Methodology

In this work, we propose a heterogeneous Graph ATtention network basEd on References (Gater) for keyphrase generation, as shown in Figure 2. Given a source document, we first retrieve related documents from a predefined index (we use the training set as our reference index in our experiments, which can easily be extended to an open corpus) and concatenate each retrieved document with its keyphrases to serve as a reference. Then we construct a heterogeneous graph that contains document nodes (note that the source document and the references are the two specific kinds of content represented by document nodes) and keyword nodes based on the source document and its references. The graph is updated iteratively to enhance the representation of the source document node. Finally, the source document node is extracted to decode the keyphrase sequence. To facilitate the decoding process, we also introduce a hierarchical attention and copy mechanism, with which the model directly attends to and copies from both the source document and its references. The hierarchical arrangement ensures that more semantically relevant words, and words in more relevant references, are given larger weights for the current decision.

2.1 Reference Retriever

Given a source document x, we first use a reference retriever to output several related references from the training set. To make full use of both the retrieved documents and their keyphrases, we denote a reference as the concatenation of the two. We find that a term frequency–inverse document frequency (TF-IDF) based retriever provides a simple but efficient means of accomplishing the retrieval task. Specifically, we first represent the source document and all the reference candidates as TF-IDF weighted uni/bi-gram vectors. Then, the K most similar references r_1, …, r_K are retrieved by comparing the cosine similarities between the vector of the source document and those of all the candidates.
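To make the retrieval step concrete, below is a minimal sketch (not the authors' released code) using scikit-learn's TfidfVectorizer with uni/bi-grams and cosine similarity; the function names and the default k=3 are assumptions based on the settings reported in Section 3.3.

```python
# Minimal sketch of the TF-IDF reference retriever (assumed implementation).
# A reference is the concatenation of a training document and its keyphrases.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_retriever(train_docs, train_keyphrases):
    references = [doc + " " + " ".join(kps)
                  for doc, kps in zip(train_docs, train_keyphrases)]
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))   # uni/bi-gram TF-IDF vectors
    ref_matrix = vectorizer.fit_transform(references)
    return vectorizer, ref_matrix, references

def retrieve(source_doc, vectorizer, ref_matrix, references, k=3):
    query = vectorizer.transform([source_doc])
    scores = cosine_similarity(query, ref_matrix)[0]
    top_idx = scores.argsort()[::-1][:k]               # indices of the k most similar candidates
    return [references[i] for i in top_idx]
```

When the index is the training set itself, the source document should of course be excluded from its own candidate list.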

2.2 Heterogeneous Graph Encoder

2.2.1 Graph Construction

Given the source document x and its references r_1, …, r_K, we select the top-k unique words as keywords from the source document and each reference based on their TF-IDF weights. The additional keyword nodes can enrich the semantic representation of the source document through message passing, and they introduce prior knowledge for keyphrase generation given the high overlap between keywords and keyphrases. We then build a heterogeneous graph based on the source document, the references and the keywords.

Formally, our undirected heterogeneous graph can be defined as G = (V, E), where V = V_w ∪ V_d and E = E_w ∪ E_d. Specifically, V_w = {w_1, …, w_m} denotes the m unique keyword nodes of the source document and its references, V_d = {x, r_1, …, r_K} corresponds to the source document node and the K reference nodes, E_d = {e^d_1, …, e^d_K}, where e^d_i represents the edge weight between the i-th reference and the source document, and E_w = {e^w_11, …, e^w_m(K+1)}, where e^w_ij indicates the edge weight between the i-th keyword and the j-th document.
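The construction can be illustrated with a small sketch (assumed, not the paper's code): the top-k TF-IDF words of each document become keyword nodes, document-to-keyword edges carry the corresponding TF-IDF weight, and document-to-document edges carry a TF-IDF similarity between the source document and each reference.

```python
# Illustrative construction of the heterogeneous graph as plain edge lists.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_graph(source_doc, references, top_k=20):
    docs = [source_doc] + references                  # document node 0 is the source document
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)
    vocab = vectorizer.get_feature_names_out()

    keyword_nodes, doc_kw_edges = set(), []
    dense = tfidf.toarray()
    for d, row in enumerate(dense):
        for w in row.argsort()[::-1][:top_k]:         # top-k keywords of document d
            if row[w] > 0:
                keyword_nodes.add(vocab[w])
                doc_kw_edges.append((d, vocab[w], float(row[w])))

    sims = cosine_similarity(tfidf[0], tfidf[1:])[0]  # source-to-reference similarities
    doc_doc_edges = [(0, i + 1, float(s)) for i, s in enumerate(sims)]
    return sorted(keyword_nodes), doc_kw_edges, doc_doc_edges
```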

2.2.2 Graph Initializers

Node Initializers

There are two types of nodes in our heterogeneous graph (i.e., document nodes and keyword nodes). For each document node, as in previous works Meng et al. (2017); Chen et al. (2019a), an embedding lookup table is first applied to each word, and a bidirectional Gated Recurrent Unit (GRU) Cho et al. (2014) is then used to obtain a context-aware representation of each word. The representation of a document and of each of its words is defined as the concatenation of the forward and backward hidden states. For each keyword node, since the same keyword may appear in multiple documents, we simply use its word embedding as the initial node representation.

Edge Initializers

There are two types of edges in our heterogeneous graph (i.e., document-to-document edges and document-to-keyword edges). To include information about the significance of the relationship between a keyword node and a document node, we infuse TF-IDF values into the document-to-keyword edge weights. Similarly, we infuse TF-IDF values into the document-to-document edge weights as a prior statistical n-gram similarity between documents. The two types of floating-point TF-IDF weights are then transformed into integers and mapped to dense vectors using two embedding matrices.
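A minimal sketch of this discretization step follows (the number of bins and the embedding size are placeholders, not values from the paper):

```python
import torch
import torch.nn as nn

class EdgeFeature(nn.Module):
    """Map a floating-point TF-IDF edge weight to an integer bin and then to a
    dense edge-feature vector (illustrative, not the authors' implementation)."""
    def __init__(self, num_bins=10, edge_dim=50):
        super().__init__()
        self.num_bins = num_bins
        self.embedding = nn.Embedding(num_bins, edge_dim)

    def forward(self, weights):
        # weights: float tensor of TF-IDF values, assumed to lie in [0, 1]
        bins = (weights * self.num_bins).long().clamp(0, self.num_bins - 1)
        return self.embedding(bins)

edge_feat = EdgeFeature()
e = edge_feat(torch.tensor([0.13, 0.87]))   # two edge-feature vectors of size 50
```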

2.2.3 Graph Aggregating and Updating

Aggregator

Graph attention networks (GAT) Velickovic et al. (2018) are used to aggregate information for each node. We denote the hidden states of the input nodes as H = {h_1, …, h_n}. With the additional edge feature, the aggregator is defined as follows:

z_ij = LeakyReLU(W_a [W_q h_i; W_k h_j; e_ij]),
α_ij = exp(z_ij) / Σ_{l∈N_i} exp(z_il),
u_i = σ( Σ_{j∈N_i} α_ij W_v h_j ),   (1)

where e_ij is the embedding of the edge feature between nodes i and j, α_ij is the attention weight between h_i and h_j, and u_i is the aggregated feature. For simplicity, we write GAT(H_q, H_kv, E) to denote the GAT aggregating layer, where H_q is used for the query, H_kv is used for the key and value, and E is used as the edge features.
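The following is a simplified, single-head PyTorch sketch of such an edge-aware GAT aggregation (an assumed implementation; multi-head attention and other details of the paper are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeAwareGAT(nn.Module):
    """Single-head GAT aggregation in which the edge embedding joins the attention score."""
    def __init__(self, dim, edge_dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim + edge_dim, 1, bias=False)

    def forward(self, h_q, h_kv, edge, mask=None):
        # h_q: (n_q, dim) query nodes; h_kv: (n_kv, dim) neighbor nodes
        # edge: (n_q, n_kv, edge_dim) edge embeddings; mask: (n_q, n_kv), 1 for real edges
        q = self.w_q(h_q).unsqueeze(1).expand(-1, h_kv.size(0), -1)
        k = self.w_k(h_kv).unsqueeze(0).expand(h_q.size(0), -1, -1)
        z = F.leaky_relu(self.attn(torch.cat([q, k, edge], dim=-1)).squeeze(-1))
        if mask is not None:
            z = z.masked_fill(mask == 0, float("-inf"))
        alpha = torch.softmax(z, dim=-1)          # attention weights over neighbors
        return F.elu(alpha @ self.w_v(h_kv))      # aggregated node features u_i
```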

Updater

To update the node states, similar to the approach used in the Transformer Vaswani et al. (2017), we introduce a residual connection and a position-wise feed-forward (FFN) layer consisting of two linear transformations. Given an undirected heterogeneous graph G with node features H = H_w ∪ H_d and edge features E = E_w ∪ E_d, we update each type of node separately as follows:

H_w ← FFN( GAT(H_w, H_d, E_w) + H_w ),
H_d ← FFN( GAT(H_d, H_w, E_w) + H_d ),
H_d ← FFN( GAT(H_d, H_d, E_d) + H_d ),   (2)

that is, the keyword nodes are updated first by aggregating document-level information from the document nodes, the document nodes are then updated by the updated keyword nodes, and finally the document nodes are updated again by the other document nodes. The above process is executed iteratively for T steps to obtain better document representations.
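Sketched below is one possible shape of this update schedule (an assumption built on the EdgeAwareGAT sketch above; a single feed-forward network is shared across node types here purely for brevity):

```python
import torch.nn as nn

class FFN(nn.Module):
    """Position-wise feed-forward layer with two linear transformations."""
    def __init__(self, dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)

def graph_update(h_kw, h_doc, e_dw, e_dd, gat_kw, gat_doc, gat_dd, ffn, steps=2):
    """Iterative Gater-style updating (illustrative).
    h_kw: (n_kw, dim) keyword nodes; h_doc: (n_doc, dim) document nodes;
    e_dw: (n_doc, n_kw, edge_dim) document-keyword edge embeddings;
    e_dd: (n_doc, n_doc, edge_dim) document-document edge embeddings."""
    for _ in range(steps):
        # keyword nodes aggregate from document nodes, then documents from keywords,
        # and finally document nodes from the other document nodes
        h_kw = ffn(gat_kw(h_kw, h_doc, e_dw.transpose(0, 1)) + h_kw)
        h_doc = ffn(gat_doc(h_doc, h_kw, e_dw) + h_doc)
        h_doc = ffn(gat_dd(h_doc, h_doc, e_dd) + h_doc)
    return h_kw, h_doc
```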

When the heterogeneous graph encoder has finished, we separate the document node states H_d into h_x and {h_{r_1}, …, h_{r_K}} as the representations of the source document and of each reference. We denote M_x as the encoder hidden states of the words in the source document, and M_{r_i} as the encoder hidden states of the words in the i-th reference. All the features described above (i.e., h_x, h_{r_i}, M_x and M_{r_i}) will be used in the reference-aware decoder.

2.3 Reference-aware Decoder

After encoding the document into a reference-aware representation h_x, we propose a hierarchical attention and copy mechanism to further incorporate the reference information by attending to and copying words from both the source document and the references.

We use h_x as the initial hidden state of a GRU decoder, and the decoding process at time step t is described as follows:

s_t = GRU([e(y_{t-1}); c_{t-1}], s_{t-1}),   (3)

where e(y_{t-1}) is the embedding of the previously generated word, c_t is the context vector, and the hierarchical attention mechanism that produces it is defined as follows:

α^x_t = Attn(s_t, M_x),   c^x_t = Σ_j α^x_{t,j} m^x_j,
β_t = Attn(s_t, [h_{r_1}; …; h_{r_K}]),
α^{r_i}_t = Attn(s_t, M_{r_i}),   c^{r_i}_t = Σ_j α^{r_i}_{t,j} m^{r_i}_j,
λ_t = sigmoid(W_λ [s_t; c^x_t; Σ_i β_{t,i} c^{r_i}_t]),
c_t = λ_t c^x_t + (1 - λ_t) Σ_i β_{t,i} c^{r_i}_t,   (4)

where α^x_t is a word-level attention distribution over the words of the source document computed from s_t, β_t is an attention distribution over the references computed from s_t, which gives greater weights to more relevant references, α^{r_i}_t is a word-level attention distribution over the words of the i-th reference computed from s_t, which can be considered as the importance of each word in the i-th reference, and λ_t is a soft gate that determines the relative importance of the context vectors from the source document and the references. All the attention distributions described above are computed as in Bahdanau et al. (2015).
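A compact sketch of how these pieces could combine into the final context vector (assumed tensor shapes; `attn_weights` stands in for the additive attention of Bahdanau et al. (2015) and `gate` for a learned linear layer, both placeholders):

```python
import torch

def hierarchical_context(s_t, attn_weights, src_words, ref_nodes, ref_words, gate):
    """Combine source-document and reference contexts with a soft gate (illustrative).
    s_t: (d,) decoder state; src_words: (L_x, d); ref_nodes: (K, d);
    ref_words: list of K tensors of shape (L_ri, d);
    attn_weights(query, keys) -> normalized weights; gate: nn.Linear(3 * d, 1)."""
    a_x = attn_weights(s_t, src_words)                 # word-level attention over the source
    c_x = a_x @ src_words                              # source context vector
    beta = attn_weights(s_t, ref_nodes)                # reference-level attention
    c_refs = torch.stack([attn_weights(s_t, r) @ r for r in ref_words])  # (K, d)
    c_r = beta @ c_refs                                # reference context vector
    lam = torch.sigmoid(gate(torch.cat([s_t, c_x, c_r])))  # soft gate lambda_t
    return lam * c_x + (1 - lam) * c_r                 # final context vector c_t
```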

To alleviate the out-of-vocabulary (OOV) problem, a copy mechanism See et al. (2017) is generally adopted. To further guide the decoding process by copying appropriate words from the references based on their relevance and significance, we propose a hierarchical copy mechanism. Specifically, a dynamic vocabulary is constructed by merging the predefined vocabulary V, the words in the source document and all the words in the references. Thus, the probability of predicting a word y_t is computed as follows:

P(y_t) = g^v_t P_v(y_t) + g^x_t P_x(y_t) + g^r_t P_r(y_t),
P_x(y_t) = Σ_{j: x_j = y_t} α^x_{t,j},
P_r(y_t) = Σ_i β_{t,i} Σ_{j: r_{i,j} = y_t} α^{r_i}_{t,j},   (5)

where P_v(y_t) is the generative probability over the predefined vocabulary V, P_x(y_t) is the copy probability from the source document, P_r(y_t) is the copy probability from all the references, and g_t = [g^v_t, g^x_t, g^r_t] serves as a soft switcher that determines the preference for selecting the word from the predefined vocabulary, the source document or the references.
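The sketch below shows one way such a three-way mixture over a dynamic vocabulary could be assembled (an assumed pointer-generator-style implementation; `switch` plays the role of the soft switcher, a softmax over the three choices):

```python
import torch

def final_distribution(p_vocab, switch, src_ids, a_x, ref_ids, a_refs, beta, dyn_size):
    """Mix generation and copy probabilities over the dynamic vocabulary (illustrative).
    p_vocab: (|V|,) generative distribution; switch: (3,) over {generate, copy-source, copy-refs};
    src_ids: (L_x,) dynamic-vocabulary ids of source words; a_x: (L_x,) source attention;
    ref_ids / a_refs: lists of K id / attention tensors; beta: (K,) reference attention."""
    p = torch.zeros(dyn_size)
    p[: p_vocab.size(0)] += switch[0] * p_vocab             # generate from the fixed vocabulary
    p.scatter_add_(0, src_ids, switch[1] * a_x)             # copy from the source document
    for b_i, ids, a in zip(beta, ref_ids, a_refs):
        p.scatter_add_(0, ids, switch[2] * b_i * a)         # copy from the i-th reference
    return p
```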

2.4 Training

The proposed Gater model is independent of any specific training method, so we can use either the One2One training paradigm Meng et al. (2017), where the target keyphrase set Y = {y^1, …, y^{|Y|}} is split into multiple training targets for a source document x:

L = - Σ_{i=1}^{|Y|} log P(y^i | x),   (6)

or the One2Seq training paradigm Ye and Wang (2018); Yuan et al. (2020), where all the keyphrases are concatenated into one training target:

L = - log P( f(Y) | x ),   (7)

where f(Y) is the concatenation of the keyphrases in Y by a delimiter.
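For clarity, a small sketch of how the two kinds of training targets can be built for one annotated document (the `<sep>` and `<eos>` tokens are assumptions in line with common One2Seq practice):

```python
def one2one_targets(keyphrases):
    """One2One: each keyphrase becomes its own target sequence."""
    return [kp.split() for kp in keyphrases]

def one2seq_target(present, absent, sep="<sep>", eos="<eos>"):
    """One2Seq: concatenate all keyphrases into a single target sequence,
    present keyphrases first (in order of first occurrence), then absent ones."""
    tokens = []
    for kp in present + absent:
        tokens.extend(kp.split() + [sep])
    return tokens[:-1] + [eos]      # replace the trailing separator with the end token

# one2seq_target(["graph neural networks"], ["keyphrase generation"])
# -> ['graph', 'neural', 'networks', '<sep>', 'keyphrase', 'generation', '<eos>']
```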

Model NUS SemEval KP20k
Present (F1@5 F1@10) Absent (R@10 R@50) Present (F1@5 F1@10) Absent (R@10 R@50) Present (F1@5 F1@10) Absent (R@10 R@50)
CopyRNN Meng et al. (2017) 0.311 0.266 0.058 0.116 0.293 0.304 0.043 0.067 0.333 0.262 0.125 0.211
CorrRNN Chen et al. (2018) 0.318 0.278 0.059 - 0.320 0.320 0.041 - - - - -
TG-Net Chen et al. (2019b) 0.349 0.295 0.075 0.137 0.318 0.322 0.045 0.076 0.372 0.315 0.156 0.268
KG-KE-KR-M Chen et al. (2019a) 0.344 0.287 0.123 0.193 0.329 0.327 0.049 0.090 0.400 0.327 0.177 0.278
CopyRNN-Gater (Ours) 0.374 0.304 0.126 0.193 0.366 0.340 0.056 0.092 0.402 0.324 0.186 0.285
Table 1: Keyphrase prediction results of all the models trained under the One2One paradigm. The best results are in bold. The subscripts are the corresponding standard deviations (e.g., 0.285±0.001).
Model NUS SemEval KP20k
Present (F1@5 F1@M) Absent (F1@5 F1@M) Present (F1@5 F1@M) Absent (F1@5 F1@M) Present (F1@5 F1@M) Absent (F1@5 F1@M)
catSeq Yuan et al. (2020) 0.323 0.397 0.016 0.028 0.242 0.283 0.020 0.028 0.291 0.367 0.015 0.032
catSeqD Yuan et al. (2020) 0.321 0.394 0.014 0.024 0.233 0.274 0.016 0.024 0.285 0.363 0.015 0.031
catSeqCorr Chan et al. (2019) 0.319 0.390 0.014 0.024 0.246 0.290 0.018 0.026 0.289 0.365 0.015 0.032
catSeqTG Chan et al. (2019) 0.325 0.393 0.011 0.018 0.246 0.290 0.019 0.027 0.292 0.366 0.015 0.032
SenSeNet Luo et al. (2020) 0.348 0.403 0.018 0.032 0.255 0.299 0.024 0.032 0.296 0.370 0.017 0.036
catSeq-Gater (Ours) 0.337 0.418 0.033 0.054 0.257 0.309 0.026 0.035 0.295 0.384 0.030 0.060
Table 2: Keyphrase prediction results of all the models trained under the One2Seq paradigm. The best results are in bold. The subscripts are the corresponding standard deviations (e.g., 0.060±0.002).

3 Experimental Setup

3.1 Datasets

We conduct our experiments on four scientific article datasets, including NUS Nguyen and Kan (2007), Krapivin Krapivin et al. (2009), SemEval Kim et al. (2010) and KP20k Meng et al. (2017). Each sample from these datasets consists of a title, an abstract, and some keyphrases given by the authors of the papers. Following previous works Meng et al. (2017); Chen et al. (2019b, a); Yuan et al. (2020), we concatenate the title and abstract as the source document. We use the largest dataset (i.e., KP20k) for model training and the testing sets of all four datasets for evaluation. After preprocessing (i.e., lowercasing, replacing all digits with a special <digit> symbol and removing duplicated data), the final KP20k dataset contains 509,818 samples for training, 20,000 for validation and 20,000 for testing. The numbers of test samples in NUS, Krapivin and SemEval are 211, 400 and 100, respectively.

3.2 Baselines

For a comprehensive evaluation, we verify our method under both training paradigms (i.e., One2One and One2Seq) and compare it with the following methods (we do not compare with Chen et al. (2020) since they use a different preprocessing method from the other works; see the discussion on GitHub for details):

  • catSeq Yuan et al. (2020). The RNN-based seq2seq model with a copy mechanism under the One2Seq training paradigm. CopyRNN Meng et al. (2017) is the same model trained under the One2One paradigm.

  • catSeqD Yuan et al. (2020). An extension of catSeq with orthogonal regularization Bousmalis et al. (2016) and target encoding to improve diversity under One2Seq training paradigm.

  • catSeqCorr Chan et al. (2019). The extension of catSeq with coverage and review mechanisms under One2Seq training paradigm. CorrRNN Chen et al. (2018) is the one under One2One training paradigm.

  • catSeqTG Chan et al. (2019). The extension of catSeq with additional title encoding. TG-Net Chen et al. (2019b) is the one under One2One training paradigm.

  • KG-KE-KR-M Chen et al. (2019a). A joint extraction and generation model with the retrieved keyphrases and a merging process under One2One training paradigm.

  • SenSeNet Luo et al. (2020). The extension of catSeq with document structure under One2Seq paradigm.

3.3 Implementation Details

Following previous works Chan et al. (2019); Yuan et al. (2020), when training under the One2Seq paradigm, the target keyphrase sequence is the concatenation of the present and absent keyphrases, with the present keyphrases sorted according to the order of their first occurrence in the document and the absent keyphrases kept in their original order.

We keep all the parameters the same as those reported in Chan et al. (2019); hence, we only report the parameters of the additional graph module. We retrieve 3 references and extract the top 20 keywords from the source document and each reference to construct the graph. We set the number of attention heads to 5 and the number of iterations to 2 based on the validation set. During training, we use a dropout rate of 0.3 for the graph layer and a batch size of 12 and 64 for the One2Seq and One2One training paradigms, respectively. During testing, we use greedy search for One2Seq, and beam search with a maximum depth of 6 and a beam size of 200 for One2One. We repeat the experiments of our model three times using different random seeds and report the averaged results.

3.4 Evaluation Metrics

For the models trained under the One2One paradigm, as in previous works Meng et al. (2017); Chen et al. (2018, 2019b), we use macro-averaged F1@5 and F1@10 for present keyphrase predictions, and R@10 and R@50 for absent keyphrase predictions. For the models trained under the One2Seq paradigm, we follow Chan et al. (2019) and use F1@5 and F1@M for both present and absent keyphrase predictions, where F1@M compares all the keyphrases predicted by the model with the ground-truth keyphrases, which means that it considers the number of predictions. We apply the Porter Stemmer before determining whether two keyphrases are identical, and we remove all duplicated keyphrases after stemming.
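As an illustration of the metric computation, here is a minimal sketch of F1@k with Porter stemming and de-duplication (an assumed helper; F1@M corresponds to using all of the model's predictions, i.e., k set to the number of predictions):

```python
from nltk.stem import PorterStemmer

_stemmer = PorterStemmer()

def _stem(phrase):
    return " ".join(_stemmer.stem(w) for w in phrase.lower().split())

def f1_at_k(predictions, targets, k=None):
    """F1@k after stemming and duplicate removal; k=None yields F1@M."""
    preds, seen = [], set()
    for p in map(_stem, predictions):
        if p not in seen:             # remove duplicates after stemming
            seen.add(p)
            preds.append(p)
    if k is not None:
        preds = preds[:k]
    gold = {_stem(t) for t in targets}
    correct = sum(p in gold for p in preds)
    if not preds or not gold or correct == 0:
        return 0.0
    precision, recall = correct / len(preds), correct / len(gold)
    return 2 * precision * recall / (precision + recall)
```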

4 Results and Analysis

4.1 Present and Absent Keyphrase Predictions

Table 1 and Table 2 show the performance evaluations of the present and absent keyphrases predicted by the models trained under the One2One and One2Seq paradigms, respectively (due to space limitations, the results on the Krapivin dataset can be found in Appendix A). For the results on absent keyphrases, although previous works Chan et al. (2019); Yuan et al. (2020) have noted that predicting absent keyphrases for a document is an extremely challenging task, the proposed Gater model still outperforms the state-of-the-art baseline models on all the metrics under both training paradigms, which demonstrates the effectiveness of our method of including the knowledge of references. Compared to KG-KE-KR-M, CopyRNN-Gater achieves the same or better results on all the datasets. This suggests that both the retrieved documents and the retrieved keyphrases are useful for predicting absent keyphrases.

For present keyphrase prediction, we find that Gater outperforms most of the baseline methods on both training paradigms, which indicates that the related references also help the model to understand the source document and to predict more accurate present keyphrases.

4.2 Ablation Study

To examine the contribution of each component of Gater, we conduct ablation experiments on the largest dataset, KP20k; the results are presented in Table 3. For the input references, the model’s performance degrades if either the retrieved documents or the retrieved keyphrases are removed, which indicates that both are useful for keyphrase prediction. For the heterogeneous graph encoder, the graph becomes a heterogeneous bipartite graph when the document-to-document edges are removed, and a homogeneous graph when the document-to-keyword edges are removed. We can see that both result in degraded performance due to the lack of interaction. Removing both types of edges means that the reference information is only used on the decoder side by the reference-aware decoder, which further degrades the results. For the reference-aware decoder, we find the hierarchical attention and copy mechanism to be essential to the performance of Gater. This indicates the importance of integrating knowledge from references on the decoder side.

Model Present (F1@5 F1@M) Absent (F1@5 F1@M)
catSeq-Gater 0.295 0.384 0.030 0.060
Input Reference
 - retrieved documents 0.293 0.377 0.026 0.052
 - retrieved keyphrases 0.291 0.369 0.018 0.037
 - both 0.291 0.367 0.015 0.032
Heterogeneous Graph Encoder
 - edge 0.294 0.379 0.024 0.049
 - edge 0.294 0.379 0.026 0.052
 - both 0.293 0.371 0.020 0.041
Reference-aware Decoder
 - hierarchical copy 0.293 0.373 0.022 0.042
  - hierarchical attention 0.291 0.368 0.018 0.036
Table 3: Ablation study of catSeq-Gater on the KP20k dataset. All references are ignored in the graph encoder when the document-to-document edges are removed, and the heterogeneous graph becomes a homogeneous graph when the document-to-keyword edges are removed.

4.3 Quality and Influence of References

Figure 3: Transforming rate and prediction performance for absent keyphrases under different types of retrievers on the KP20k dataset for catSeq-Gater. We study a random retriever, a sparse retriever based on TF-IDF and a dense retriever based on Specter.

As our graph is based on the retrieved references, we also investigate the quality and influence of the references. We define the quality of the retrieved references as the transforming rate of absent keyphrases (i.e., the proportion of absent keyphrases that appear in the retrieved references). Intuitively, references that contain more absent keyphrases provide more explicit knowledge for generation. As shown in the left part of Figure 3, the simple sparse retriever based on TF-IDF outperforms the random retriever by a large margin with regard to reference quality. We also use the dense retriever Specter Cohan et al. (2020) (https://github.com/allenai/specter), a BERT-based model pretrained on scientific documents, and find that it further improves the transforming rate of absent keyphrases. The right part of Figure 3 shows the influence of the references; we note that random references degrade the model performance, as they contain a lot of noise. Surprisingly, we obtain a 2.6% performance boost in absent keyphrase prediction by considering only the most similar references with a sparse or dense retriever, and introducing more than three references does not further improve the performance. One possible explanation is that although more references lead to a higher transforming rate of absent keyphrases, they also introduce more irrelevant information, which interferes with the judgment of the model.
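The transforming rate itself is simple to compute; below is a sketch under the same stemming convention as the evaluation (it reuses the `_stem` helper from the metric sketch above, which is our own placeholder):

```python
def transforming_rate(absent_keyphrases, references):
    """Proportion of a document's absent keyphrases that literally appear
    in the text of its retrieved references (illustrative)."""
    if not absent_keyphrases:
        return 0.0
    ref_text = " " + " ".join(_stem(r) for r in references) + " "
    hits = sum(1 for kp in absent_keyphrases if " " + _stem(kp) + " " in ref_text)
    return hits / len(absent_keyphrases)
```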

4.4 Incorporating Baselines with Gater

Model Present (F1@5 F1@M) Absent (F1@5 F1@M)
catSeqD 0.285 0.363 0.015 0.031
 + Gater 0.294 0.381 0.025 0.051
catSeqCorr 0.289 0.365 0.015 0.032
 + Gater 0.296 0.384 0.030 0.060
catSeqTG 0.292 0.366 0.015 0.032
 + Gater 0.293 0.380 0.025 0.052
Table 4: Results of applying our Gater to other baseline models on KP20k test set. The best results are bold.

Our proposed Gater can be considered an extra plugin for incorporating knowledge from references on both the encoder and decoder sides, and it can easily be applied to other models. We investigate the effects of adding Gater to other baseline models in Table 4. We note that Gater enhances the performance of all the baseline models in predicting both present and absent keyphrases. This further demonstrates the effectiveness and portability of the proposed method.

Figure 4: Example of generated keyphrases by different models. The top 10 predictions are compared and some incorrect predictions are omitted for simplicity. The correct predictions are in bold blue and bold red for present and absent keyphrase, respectively. The absent predictions that appear in the references are highlighted in yellow, where only the keyphrases of retrieved documents are considered as references for KG-KE-KR-M.

4.5 Case Study

We display a prediction example from the baseline models and CopyRNN-Gater in Figure 4. Our model generates more accurate present and absent keyphrases compared to the baselines. For instance, we observe that CopyRNN-Gater successfully predicts the absent keyphrase “porous medium”, which appears in the retrieved documents, while both CopyRNN and KG-KE-KR-M fail. This demonstrates that using both the retrieved documents and keyphrases as references provides more knowledge (e.g., candidates of the ground-truth absent keyphrases) compared with using the keyphrases alone as in KG-KE-KR-M.

5 Related Work

5.1 Keyphrase Extraction and Generation

Existing approaches for keyphrase prediction can be broadly divided into extraction and generation methods. Early works mostly use a two-step approach for keyphrase extraction. First, they extract a large set of candidate phrases using hand-crafted rules Mihalcea and Tarau (2004); Medelyan et al. (2009); Liu et al. (2011). Then, these candidates are scored and reranked with unsupervised methods Mihalcea and Tarau (2004); Wan and Xiao (2008) or supervised methods Hulth (2003); Nguyen and Kan (2007). Other extractive approaches utilize neural sequence labeling methods Zhang et al. (2016); Gollapalli et al. (2017).

Keyphrase generation is an extension of keyphrase extraction that also considers absent keyphrase prediction. Meng et al. (2017) proposed a generative model, CopyRNN, based on the encoder-decoder framework Sutskever et al. (2014). They employed a One2One paradigm that uses a single keyphrase as the target sequence. Since CopyRNN uses beam search to perform predictions independently, it lacks dependencies among the generated keyphrases, which results in many duplicated keyphrases. CorrRNN Chen et al. (2018) introduced a review mechanism to consider the hidden states of the previously generated keyphrases. Ye and Wang (2018) proposed to use a separator to concatenate all keyphrases into one sequence during training. With this setup, the seq2seq model is capable of generating all possible keyphrases in one sequence as well as capturing the contextual information between the keyphrases. However, it still uses beam search to generate multiple keyphrase sequences with a fixed beam depth, and then performs keyphrase ranking to select the top-k keyphrases as output. Yuan et al. (2020) proposed catSeq with the One2Seq paradigm, adding a special token at the end to terminate the decoding process. They further introduced catSeqD, which maximizes the mutual information between all the keyphrases and the source text and uses orthogonal constraints Bousmalis et al. (2016) to ensure the coverage and diversity of the generated keyphrases. Many works have been conducted under the One2Seq paradigm Chen et al. (2019a); Chan et al. (2019); Chen et al. (2020); Meng et al. (2021); Luo et al. (2020). Chen et al. (2019a) proposed to use the keyphrases of retrieved documents as an external input. However, the keyphrases alone lack semantic information, and the potential knowledge in the retrieved documents is also ignored. In contrast, our method makes full use of both the retrieved documents and their keyphrases as references. Since catSeq tends to generate shorter sequences, a reinforcement learning approach was introduced by Chan et al. (2019) to encourage the model to generate the correct number of keyphrases with an adaptive reward (i.e., recall and F1). More recently, Luo et al. (2021) introduced a two-stage reinforcement learning-based fine-tuning approach with a fine-grained reward score, which also considers the semantic similarities between predictions and targets. Ye et al. (2021) proposed a One2Set paradigm that predicts the keyphrases as a set, which eliminates the bias caused by the predefined order in the One2Seq paradigm. Our method can also be integrated into these methods to further improve performance, as shown in Section 4.4.

5.2 Heterogeneous Graph for NLP

Different from a homogeneous graph, which only considers a single type of node or link, a heterogeneous graph can deal with multiple types of nodes and links Shi et al. (2016). Linmei et al. (2019) constructed a topic-entity heterogeneous neural graph for semi-supervised short text classification. Tu et al. (2019) introduced a heterogeneous graph neural network to encode documents, entities, and candidates together for multi-hop reading comprehension. Wang et al. (2020) presented a heterogeneous graph neural network with word, sentence, and document nodes for extractive summarization. In our paper, we study a keyword-document heterogeneous graph network for keyphrase generation, which has not been explored before.

6 Conclusions

In this paper, we propose a graph-based method that can capture explicit knowledge from related references. Our model consists of a heterogeneous graph encoder that models relations of different granularities between the source document and its references, and a hierarchical attention and copy mechanism that guides the decoding process. Extensive experiments demonstrate the effectiveness and portability of our method on both present and absent keyphrase prediction.

Acknowledgments

The authors wish to thank the anonymous reviewers for their helpful comments. This work was partially funded by China National Key R&D Program (No. 2017YFB1002104), National Natural Science Foundation of China (No. 61976056, 62076069), Shanghai Municipal Science and Technology Major Project (No.2021SHZDZX0103).

References

  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §1, §2.3.
  • K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan (2016) Domain separation networks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 343–351. External Links: Link Cited by: 2nd item, §5.1.
  • H. P. Chan, W. Chen, L. Wang, and I. King (2019) Neural keyphrase generation via reinforcement learning with adaptive rewards. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2163–2174. External Links: Document, Link Cited by: Table 5, §1, Table 2, 3rd item, 4th item, §3.3, §3.3, §3.4, §4.1, §5.1.
  • J. Chen, X. Zhang, Y. Wu, Z. Yan, and Z. Li (2018) Keyphrase generation with correlation constraints. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 4057–4066. External Links: Document, Link Cited by: Table 5, §1, Table 1, 3rd item, §3.4, §5.1.
  • W. Chen, H. P. Chan, P. Li, L. Bing, and I. King (2019a) An integrated approach for keyphrase generation via exploring the power of retrieval and extraction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 2846–2856. External Links: Document, Link Cited by: Table 5, §1, §2.2.2, Table 1, 5th item, §3.1, §5.1.
  • W. Chen, H. P. Chan, P. Li, and I. King (2020) Exclusive hierarchical decoding for deep keyphrase generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 1095–1105. External Links: Document, Link Cited by: §5.1, footnote 4.
  • W. Chen, Y. Gao, J. Zhang, I. King, and M. R. Lyu (2019b) Title-guided encoding for keyphrase generation. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pp. 6268–6275. External Links: Document, Link Cited by: Table 5, §1, Table 1, 4th item, §3.1, §3.4.
  • K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1724–1734. External Links: Document, Link Cited by: §2.2.2.
  • A. Cohan, S. Feldman, I. Beltagy, D. Downey, and D. Weld (2020) SPECTER: document-level representation learning using citation-informed transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 2270–2282. External Links: Document, Link Cited by: §4.3.
  • S. D. Gollapalli, X. Li, and P. Yang (2017) Incorporating expert knowledge into keyphrase extraction. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, S. P. Singh and S. Markovitch (Eds.), pp. 3180–3187. External Links: Link Cited by: §5.1.
  • J. Gu, Z. Lu, H. Li, and V. O.K. Li (2016) Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1631–1640. External Links: Document, Link Cited by: §1.
  • A. Hulth (2003) Improved automatic keyword extraction given more linguistic knowledge. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pp. 216–223. Cited by: §5.1.
  • S. N. Kim, O. Medelyan, M. Kan, and T. Baldwin (2010) SemEval-2010 task 5 : automatic keyphrase extraction from scientific articles. In Proceedings of the 5th International Workshop on Semantic Evaluation, Uppsala, Sweden, pp. 21–26. External Links: Link Cited by: §3.1.
  • M. Krapivin, A. Autaeu, and M. Marchese (2009) Large dataset for keyphrases extraction. Technical report University of Trento. Cited by: §3.1.
  • H. Linmei, T. Yang, C. Shi, H. Ji, and X. Li (2019) Heterogeneous graph attention networks for semi-supervised short text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4821–4830. External Links: Document, Link Cited by: §5.2.
  • Z. Liu, X. Chen, Y. Zheng, and M. Sun (2011) Automatic keyphrase extraction by bridging vocabulary gap. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, Portland, Oregon, USA, pp. 135–144. External Links: Link Cited by: §5.1.
  • Y. Luo, Z. Li, B. Wang, X. Xing, Q. Zhang, and X. Huang (2020) SenSeNet: neural keyphrase generation with document structure. In arXiv, Cited by: Table 5, Table 2, 6th item, §5.1.
  • Y. Luo, Y. Xu, J. Ye, X. Qiu, and Q. Zhang (2021) Keyphrase generation with fine-grained evaluation-guided reinforcement learning. arXiv preprint arXiv:2104.08799. Cited by: §5.1.
  • T. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1412–1421. External Links: Document, Link Cited by: §1.
  • O. Medelyan, E. Frank, and I. H. Witten (2009) Human-competitive tagging using automatic keyphrase extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, pp. 1318–1327. External Links: Link Cited by: §5.1.
  • R. Meng, X. Yuan, T. Wang, S. Zhao, A. Trischler, and D. He (2021) An empirical study on neural keyphrase generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 4985–5007. External Links: Link Cited by: §5.1.
  • R. Meng, S. Zhao, S. Han, D. He, P. Brusilovsky, and Y. Chi (2017) Deep keyphrase generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 582–592. External Links: Document, Link Cited by: Table 5, §1, §2.2.2, §2.4, Table 1, 1st item, §3.1, §3.4, §5.1.
  • R. Mihalcea and P. Tarau (2004) TextRank: bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, pp. 404–411. External Links: Link Cited by: §5.1.
  • T. D. Nguyen and M. Kan (2007) Keyphrase extraction in scientific publications. In International conference on Asian digital libraries, Cited by: §3.1, §5.1.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1073–1083. External Links: Document, Link Cited by: §1, §2.3.
  • C. Shi, Y. Li, J. Zhang, Y. Sun, and S. Y. Philip (2016) A survey of heterogeneous information network analysis. In TKDE, Cited by: §5.2.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 3104–3112. External Links: Link Cited by: §1, §5.1.
  • M. Tu, G. Wang, J. Huang, Y. Tang, X. He, and B. Zhou (2019) Multi-hop reading comprehension across multiple documents by reasoning over heterogeneous graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2704–2713. External Links: Document, Link Cited by: §5.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. External Links: Link Cited by: §2.2.3.
  • P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: §2.2.3.
  • X. Wan and J. Xiao (2008) Single document keyphrase extraction using neighborhood knowledge.. In AAAI, Cited by: §5.1.
  • D. Wang, P. Liu, Y. Zheng, X. Qiu, and X. Huang (2020) Heterogeneous graph neural networks for extractive document summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 6209–6219. External Links: Document, Link Cited by: §5.2.
  • H. Ye and L. Wang (2018) Semi-supervised learning for neural keyphrase generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 4142–4153. External Links: Document, Link Cited by: §2.4, §5.1.
  • J. Ye, T. Gui, Y. Luo, Y. Xu, and Q. Zhang (2021) One2Set: Generating diverse keyphrases as a set. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 4598–4608. External Links: Link, Document Cited by: §5.1.
  • X. Yuan, T. Wang, R. Meng, K. Thaker, P. Brusilovsky, D. He, and A. Trischler (2020) One size does not fit all: generating and evaluating variable number of keyphrases. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7961–7975. External Links: Document, Link Cited by: Table 5, §1, §2.4, Table 2, 1st item, 2nd item, §3.1, §3.3, §4.1, §5.1.
  • Q. Zhang, Y. Wang, Y. Gong, and X. Huang (2016) Keyphrase extraction using deep recurrent neural networks on Twitter. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 836–845. External Links: Document, Link Cited by: §5.1.

Appendix A Results on Krapivin Dataset

Model Krapivin
Present (F1@5 F1@10) Absent (R@10 R@50)
CopyRNN (Meng et al., 2017) 0.334 0.326 0.113 0.202
CorrRNN (Chen et al., 2018) 0.358 0.330 0.108 -
TG-Net (Chen et al., 2019b) 0.406 0.370 0.146 0.253
KG-KE-KR-M (Chen et al., 2019a) 0.431 0.378 0.153 0.251
CopyRNN-Gater (Ours) 0.435 0.383 0.195 0.294
Model Krapivin
Present (F1@5 F1@M) Absent (F1@5 F1@M)
catSeq (Yuan et al., 2020) 0.269 0.354 0.018 0.036
catSeqD (Yuan et al., 2020) 0.264 0.349 0.018 0.037
catSeqCorr (Chan et al., 2019) 0.265 0.349 0.020 0.038
catSeqTG (Chan et al., 2019) 0.282 0.366 0.018 0.034
SenSeNet (Luo et al., 2020) 0.279 0.354 0.024 0.046
catSeq-Gater (Ours) 0.276 0.376 0.037 0.069
Table 5: Keyphrase prediction results of the models trained under the One2One and One2Seq paradigms. The best results are in bold. The subscripts are the corresponding standard deviations (e.g., 0.069±0.005).