code for ACL 2020 paper Exclusive Hierarchical Decoding for Deep Keyphrase Generation
Keyphrase generation (KG) aims to summarize the main ideas of a document into a set of keyphrases. A new setting is recently introduced into this problem, in which, given a document, the model needs to predict a set of keyphrases and simultaneously determine the appropriate number of keyphrases to produce. Previous work in this setting employs a sequential decoding process to generate keyphrases. However, such a decoding method ignores the intrinsic hierarchical compositionality existing in the keyphrase set of a document. Moreover, previous work tends to generate duplicated keyphrases, which wastes time and computing resources. To overcome these limitations, we propose an exclusive hierarchical decoding framework that includes a hierarchical decoding process and either a soft or a hard exclusion mechanism. The hierarchical decoding process is to explicitly model the hierarchical compositionality of a keyphrase set. Both the soft and the hard exclusion mechanisms keep track of previously-predicted keyphrases within a window size to enhance the diversity of the generated keyphrases. Extensive experiments on multiple KG benchmark datasets demonstrate the effectiveness of our method to generate less duplicated and more accurate keyphrases.
Keyphrases are short phrases that indicate the core information of a document. As shown in Figure 1, the keyphrase generation (KG) problem focuses on automatically producing a keyphrase set (a set of keyphrases) for the given document. Because of the condensed expression, keyphrases can benefit various downstream applications including opinion mining Berend (2011); Wilson et al. (2005), document clustering Hulth and Megyesi (2006)
, and text summarization Wang and Cardie (2013).
Keyphrases of a document can be categorized into two groups: present keyphrases, which appear in the document, and absent keyphrases, which do not. Recent generative methods for KG apply the attentional encoder-decoder framework Luong et al. (2015); Bahdanau et al. (2014) with copy mechanism Gu et al. (2016); See et al. (2017) to predict both present and absent keyphrases. To generate multiple keyphrases for an input document, these methods first use beam search to generate a huge number of keyphrases (e.g., 200) and then pick the top-ranked keyphrases as the final prediction. In other words, these methods can only predict a fixed number of keyphrases for all documents.
However, in a practical situation, the appropriate number of keyphrases varies according to the content of the input document. To simultaneously predict keyphrases and determine the suitable number of keyphrases, Yuan et al. (2018) adopts a sequential decoding method with greedy search to generate one sequence consisting of the predicted keyphrases and separators. For example, the produced sequence may be “hemodynamics [sep] erectile dysfunction [sep] …”, where “[sep]” is the separator. After producing an ending token, the decoding process terminates. The final keyphrase predictions are obtained after splitting the sequence by separators. However, there are two drawbacks to this method. First, the sequential decoding method ignores the hierarchical compositionality existing in a keyphrase set (a keyphrase set is composed of multiple keyphrases and each keyphrase consists of multiple words). In this work, we examine the hypothesis that a generative model can predict more accurate keyphrases by incorporating the knowledge of the hierarchical compositionality in the decoder architecture. Second, the sequential decoding method tends to generate duplicated keyphrases. It is simple to design specific post-processing rules to remove the repeated keyphrases, but generating and then removing repeated keyphrases wastes time and computing resources. To address these two limitations, we propose a novel exclusive hierarchical decoding framework for KG, which includes a hierarchical decoding process and an exclusion mechanism.
Our hierarchical decoding process is designed to explicitly model the hierarchical compositionality of a keyphrase set. It is composed of phrase-level decoding (PD) and word-level decoding (WD). A PD step determines which aspect of the document to summarize based on both the document content and the aspects summarized by previously-generated keyphrases. The hidden representation of the captured aspect is employed to initialize the WD process. Then, a new WD process is conducted under the PD step to generate a new keyphrase word by word. Both PD and WD repeat until meeting the stop conditions. In our method, both PD and WD attend to the document content to gather contextual information. Moreover, the attention score of each WD step is rescaled by the corresponding PD attention score. The purpose of the attention rescaling is to indicate which aspect is focused on by the current PD step.
We also propose two kinds of exclusion mechanisms (i.e., a soft one and a hard one) to avoid generating duplicated keyphrases. Either the soft one or the hard one is used in our hierarchical decoding process, and both of them operate in the WD process. Besides, both of them collect the k previously-generated keyphrases, where k is a predefined window size. The soft exclusion mechanism is incorporated in the training stage, where an exclusive loss is employed to encourage the model to generate a first word for the current keyphrase that differs from the first words of the collected keyphrases. In contrast, the hard exclusion mechanism is used in the inference stage, where an exclusive search is used to force WD to produce a first word that differs from the first words of the collected keyphrases. Our motivation comes from the statistical observation that in 85% of the documents on the largest KG benchmark, the keyphrases of each individual document have different first words. Moreover, since a keyphrase is usually composed of only two or three words, the predicted first word significantly affects the prediction of the following keyphrase words. Thus, our exclusion mechanisms can boost the diversity of the generated keyphrases. In addition, generating fewer duplications also improves the chance of producing correct keyphrases that have not been predicted yet.
We conduct extensive experiments on four popular real-world benchmarks. Empirical results demonstrate the effectiveness of our hierarchical decoding process. Besides, both the soft and the hard exclusion mechanisms significantly reduce the number of duplicated keyphrases. Furthermore, after employing the hard exclusion mechanism, our model consistently outperforms all the SOTA sequential decoding baselines on the four benchmarks.
We summarize our main contributions as follows: (1) to the best of our knowledge, we are the first to design a hierarchical decoding process for the keyphrase generation problem; (2) we propose two novel exclusion mechanisms to avoid generating duplicated keyphrases as well as improve the generation accuracy; and (3) our method consistently outperforms all the SOTA sequential decoding methods on multiple benchmarks under the new setting.
Most of the traditional extractive methods Witten et al. (1999); Mihalcea and Tarau (2004) focus on extracting present keyphrases from the input document and follow a two-step framework. They first extract plenty of keyphrase candidates by handcrafted rules Medelyan et al. (2009). Then, they score and rank these candidates based on either unsupervised methods Mihalcea and Tarau (2004)
or supervised learning methods Nguyen and Kan (2007); Hulth (2003). Recently, neural-based sequence labeling methods Gollapalli et al. (2017); Luan et al. (2017); Zhang et al. (2016) have also been explored for the keyphrase extraction problem. However, these extractive methods cannot predict absent keyphrases, which are also an essential part of a keyphrase set.
To produce both present and absent keyphrases, Meng et al. (2017) introduced a generative model, CopyRNN, which is based on an attentional encoder-decoder framework Bahdanau et al. (2014) incorporating a copy mechanism Gu et al. (2016). A wide range of extensions of CopyRNN have recently been proposed Chen et al. (2018, 2019); Ye and Wang (2018); Chen et al. (2019); Zhao and Zhang (2019). All of them rely on beam search to over-generate lots of keyphrases with a large beam size and then select the top (e.g., five or ten) ranked ones as the final prediction. This means these over-generation methods always predict a fixed number of keyphrases for any input document. Nevertheless, in a real situation, the keyphrase number should be determined by the document content and may vary among different documents.
To this end, Yuan et al. (2018) introduced a new setting in which the KG model should predict multiple keyphrases and simultaneously decide the suitable keyphrase number for the given document. Two models with a sequential decoding process, catSeq and catSeqD, are proposed in Yuan et al. (2018). catSeq is also an attentional encoder-decoder model Bahdanau et al. (2014) with a copy mechanism See et al. (2017), but adopts a new training and inference setup to fit the new setting. catSeqD is an extension of catSeq with orthogonal regularization Bousmalis et al. (2016) and target encoding. Lately, Chan et al. (2019)
proposed a reinforcement-learning-based fine-tuning method, which fine-tunes the pre-trained models with adaptive rewards to generate more sufficient and accurate keyphrases. We follow the same setting as Yuan et al. (2018) and propose an exclusive hierarchical decoding method for the KG problem. To the best of our knowledge, this is the first time hierarchical decoding has been explored in the KG problem. Different from hierarchical decoding in other areas Fan et al. (2018); Yarats and Lewis (2018); Tan et al. (2017); Chen and Zhuge (2018), we rescale the attention score of each WD step with the corresponding PD attention score to provide aspect guidance when generating keyphrases. Moreover, either a soft or a hard exclusion mechanism is innovatively incorporated into the decoding process to improve generation diversity.
We denote vectors and matrices with bold lowercase and uppercase letters respectively. Sets are denoted with calligraphy letters. We use W to represent a parameter matrix.
We define the keyphrase generation problem as follows. The input is a document x and the output is a keyphrase set Y = {y^1, y^2, …, y^n}, where n is the keyphrase number of x. Both x and each y^i are sequences of words, i.e., x = [x_1, …, x_L] and y^i = [y^i_1, …, y^i_{L_i}], where L and L_i are the word numbers of x and y^i correspondingly.
We first encode each word of the document into a hidden state and then employ our exclusive hierarchical decoding shown in Figure 2 to produce keyphrases for the given document. Our hierarchical decoding process consists of phrase-level decoding (PD) and word-level decoding (WD). Each PD step decides an appropriate aspect to summarize based on both the context of the document and the aspects summarized by previous PD steps. Then, the hidden representation of the captured aspect is employed to initialize the WD process, which generates a new keyphrase word by word. The WD process terminates when producing a “[eowd]” token. If the WD process outputs a “[eopd]” token, the whole hierarchical decoding process stops. Both PD and WD attend to the document content. The PD attention score is used to re-weight the WD attention score to provide aspect guidance. To improve the diversity of the predicted keyphrases, we incorporate either an exclusive loss during training (i.e., the soft exclusion mechanism) or an exclusive search mechanism during inference (i.e., the hard exclusion mechanism).
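To make this control flow concrete, the following is a minimal pure-Python sketch of the PD/WD loop described above. `pd_step` and `wd_step` are hypothetical stand-ins for the phrase-level and word-level GRU decoders (the real models operate on hidden states and attention, which are abstracted away here); only the termination convention with "[eowd]" and "[eopd]" follows the paper.

```python
def hierarchical_decode(pd_step, wd_step, max_pd_steps=20, max_wd_steps=6):
    """Skeleton of the hierarchical (PD/WD) decoding control flow.

    pd_step(prev_wd_state) -> aspect        # one phrase-level step
    wd_step(aspect, prefix) -> next_token   # one greedy word-level step
    """
    keyphrases = []
    wd_state = None  # state fed back from the last WD step (empty at start)
    for _ in range(max_pd_steps):
        aspect = pd_step(wd_state)        # decide the next aspect to summarize
        phrase, token = [], None
        for _ in range(max_wd_steps):     # generate one keyphrase word by word
            token = wd_step(aspect, phrase)
            if token in ("[eowd]", "[eopd]"):
                break
            phrase.append(token)
        if token == "[eopd]":             # whole decoding process stops
            break
        keyphrases.append(phrase)         # "[eowd]": this keyphrase is done
        wd_state = (aspect, phrase)       # informs the next PD step
    return keyphrases
```

In the real model, `wd_state` corresponds to the attentional vector of the ending WD step, which the phrase-level decoder consumes to update its hidden state.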
To obtain the context-aware representation of each document word, we employ a two-layered bidirectional GRU Cho et al. (2014) as the document encoder: h_l = [h→_l ; h←_l] = BiGRU(e_l, h_{l-1}), where e_l is the embedding vector of x_l with d_e dimensions and h_l is the encoded context-aware representation of x_l. Here, “[ ; ]” means concatenation.
Our hierarchical decoding process is controlled by the hierarchical decoder, which utilizes a phrase-level decoder and a word-level decoder to handle the PD process and the WD process respectively. We present our hierarchical decoder first and then introduce the exclusion mechanisms. In our decoders, all the hidden states and attentional vectors are d-dimensional vectors.
We adopt a unidirectional GRU layer as our phrase-level decoder. After the WD process under the last PD step is finished, the phrase-level decoder updates its hidden state as follows: s_i = GRU_pd(c_{i-1,end}, s_{i-1}),
where c_{i-1,end} is the attentional vector for the ending WD step under the (i-1)-th PD step (e.g., in Figure 2(b)). s_i is regarded as the hidden representation of the captured aspect at the i-th PD step. s_0 is initialized as the document representation from the encoder. c_{0,end} is initialized with zeros.
In the PD-Attention process, the PD attention score a^pd_{i,l} over the encoder states is computed from the following attention mechanism employing s_i as the query vector: a^pd_{i,l} = softmax_l(s_i^T W^pd h_l).
We choose another unidirectional GRU layer to conduct word-level decoding. Under the i-th PD step, the word-level decoder updates its hidden state first: d_{i,j} = GRU_wd([c_{i,j-1}; e_{i,j-1}], d_{i,j-1}),
where c_{i,j-1} is the WD attentional vector of the (j-1)-th WD step and e_{i,j-1} is the d_e-dimensional embedding vector of the y_{i,j-1} token. We define d_{i,0} = s_i, where s_i is the current hidden state of the phrase-level decoder, c_{i,0} is a zero vector, and e_{i,0} is the embedding of the start token. Then, the WD attentional vector is computed: c_{i,j} = tanh(W_c [d_{i,j}; Σ_l a_{i,j,l} h_l]), with the rescaled attention score a_{i,j,l} ∝ â_{i,j,l} · a^pd_{i,l} (renormalized over l),
where â_{i,j,l} is the original WD attention score, which is computed similarly to a^pd_{i,l} except that a new parameter matrix W^wd is used and d_{i,j} is employed as the query vector. The purpose of the rescaling operation is to indicate, for each WD step, which aspect the current PD step focuses on.
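As an illustration, the rescaling can be sketched as an elementwise product of the two attention distributions over the document positions, followed by renormalization (the multiply-and-renormalize form is an assumption for this sketch; the paper defines the exact operation):

```python
def rescale_wd_attention(pd_scores, wd_scores):
    """Rescale the word-level attention scores by the phrase-level attention
    over the same document positions, then renormalize so the result is
    again a probability distribution.

    pd_scores, wd_scores: lists of attention weights, one per document word.
    """
    assert len(pd_scores) == len(wd_scores)
    raw = [p * w for p, w in zip(pd_scores, wd_scores)]  # elementwise product
    total = sum(raw)
    return [r / total for r in raw]                      # renormalize
```

Positions the current PD step attends to strongly keep their WD attention mass, while positions belonging to other aspects are suppressed.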
The attentional vector c_{i,j} is utilized to predict the probability distribution of the current keyword with the copy mechanism See et al. (2017): P = (1 - g_{i,j}) P_vocab + g_{i,j} P_copy,
where g_{i,j} is the copy gate, P_vocab is the probability distribution over a predefined vocabulary V, and P_copy is the copying probability distribution over X_words, which is the set of all the words that appear in the document. P is the final predicted probability distribution. Finally, greedy search is applied to produce the current token.
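A minimal sketch of the gate-based mixture, representing the two distributions as word-to-probability dicts (that the gate weights the copy branch is an assumption consistent with the name "copy gate"; the precise parameterization is in the paper and See et al. (2017)):

```python
def final_distribution(copy_gate, p_vocab, p_copy):
    """Mix the generation and copying distributions with the copy gate g:
    P(w) = (1 - g) * P_vocab(w) + g * P_copy(w).

    p_vocab: distribution over the predefined vocabulary.
    p_copy:  distribution over the words appearing in the source document.
    Words outside either support implicitly get probability 0 there.
    """
    words = set(p_vocab) | set(p_copy)
    return {w: (1 - copy_gate) * p_vocab.get(w, 0.0)
               + copy_gate * p_copy.get(w, 0.0)
            for w in words}
```

Note that words present in the document but absent from the vocabulary still receive probability mass through the copy branch, which is what lets the model produce out-of-vocabulary keyphrase words.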
The WD process terminates when producing a “[eowd]” token. The whole hierarchical decoding process ends if the word-level decoder produces a “[eopd]” token at the first WD step, i.e., y_{i,1} is predicted as “[eopd]”.
A standard negative log-likelihood loss is employed as the generation loss to train our hierarchical decoding model: L_gen = - Σ_i Σ_j log P(y_{i,j} | y_{i,<j}, y_{<i}, x),
where y_{<i} are the target keyphrases of previously-finished PD steps and y_{i,<j} are the target keyphrase words of previous WD steps under the i-th PD step. When training, each original target keyphrase is extended with a “[neopd]” token and a “[eowd]” token, i.e., y^i = [[neopd], y^i_1, …, y^i_{L_i}, [eowd]]. Besides, a “[eopd]” token is also incorporated into the targets to indicate the ending of the whole decoding process. Teacher forcing is employed when training.
To alleviate the duplication problem, we propose a soft and a hard exclusion mechanism. Either of them can be incorporated into our hierarchical decoding process to form one kind of exclusive hierarchical decoding method.
Soft Exclusion Mechanism. An exclusive loss (EL) is introduced in the training stage, as shown in Algorithm 1. The condition in line 3 checks that the current WD step is predicting the first word of a keyphrase. In short, the exclusive loss punishes the model for the tendency to give the current keyphrase the same first word as any of the previously-generated keyphrases within the window size k.
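As a rough sketch of the idea, the penalty below charges the model for probability mass it assigns to the banned first words; the -log(1 - p) form is an illustrative assumption, and Algorithm 1 in the paper gives the exact loss:

```python
import math

def exclusive_loss(first_word_probs, previous_first_words, k):
    """Soft exclusion sketch. At a WD step that predicts the FIRST word of a
    keyphrase, penalize the probability mass assigned to the first words of
    the k previously-generated keyphrases (k >= 1 assumed).

    first_word_probs: dict word -> predicted probability at this step.
    """
    banned = set(previous_first_words[-k:])  # window of size k
    loss = 0.0
    for w in banned:
        p = first_word_probs.get(w, 0.0)
        # the more probability the model puts on a banned word, the larger
        # the penalty (illustrative -log(1 - p) form)
        loss += -math.log(max(1.0 - p, 1e-12))
    return loss
```

The penalty is zero when the model assigns no probability to any banned first word, and grows without bound as it concentrates mass on one.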
Hard Exclusion Mechanism. An exclusive search (ES) is introduced in the inference stage, as shown in Algorithm 2. The exclusive search mechanism forces the word-level decoding to predict a first word that differs from the first words of the previously-predicted keyphrases within the window size k.
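In essence, exclusive search masks the banned first words out of the candidate set before the greedy argmax. A minimal sketch (candidate scores represented as a word-to-score dict; we assume at least one candidate always survives the masking):

```python
def exclusive_greedy_step(scores, previous_first_words, k, is_first_word):
    """Hard exclusion sketch. When predicting the first word of a keyphrase,
    remove the first words of the k previously-predicted keyphrases from the
    candidates, then pick the highest-scoring remaining word greedily."""
    if is_first_word:
        banned = set(previous_first_words[-k:])  # window of size k
        scores = {w: s for w, s in scores.items() if w not in banned}
    return max(scores, key=scores.get)
```

Non-first WD steps are unaffected, so the mechanism only steers which keyphrase is started, not how it is completed.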
Since a keyphrase usually has only two or three words, the first word significantly affects the prediction of the following words. Therefore, both the soft and the hard exclusion mechanisms can improve the diversity of generated keyphrases.
Our model implementations are based on the OpenNMT system Klein et al. (2017)
using PyTorch Paszke et al. (2017). Experiments for all models are repeated with three different random seeds, and the averaged results are reported.
We employ four scientific article benchmark datasets to evaluate our models, including KP20k Meng et al. (2017), Inspec Hulth (2003), Krapivin Krapivin et al. (2009), and SemEval Kim et al. (2010). Following previous work Yuan et al. (2018); Chen et al. (2019), we use the training set of KP20k to train all the models. After removing the duplicated data, we maintain 509,818 data samples in the training set, 20,000 in the validation set, and 20,000 in the testing set. After training, we test all the models on the testing datasets of these four benchmarks. The dataset statistics are shown in Table 1.
We focus on the comparisons with state-of-the-art decoding methods and choose the following generation models under the new setting as our baselines:
In this paper, we propose two novel models that are denoted as follows:
ExHiRD-s. Our Exclusive HieRarchical Decoding model with the soft exclusion mechanism. In experiments, the window size k is selected as 4 after tuning on the KP20k validation dataset.
ExHiRD-h. Our Exclusive HieRarchical Decoding model with the hard exclusion mechanism. In experiments, the window size k is selected as 4, 1, 1, and 1 for Inspec, Krapivin, SemEval, and KP20k respectively, after tuning on the corresponding validation datasets.
Present keyphrase prediction results of all models on all datasets. The best results are in bold. In all the tables of this paper, the subscript represents the corresponding standard deviation (e.g., 0.311_1 indicates 0.311 ± 0.001).
We engage F1@M, which is recently proposed in Yuan et al. (2018), as one of our evaluation metrics. F1@M compares all the keyphrases predicted by the model with the ground-truth keyphrases, which means it does not use a fixed cutoff for the predictions. Therefore, it considers the number of predictions.
We also use F1@5 as another evaluation metric. When the number of predictions is less than five, we randomly append incorrect keyphrases until there are five predictions, instead of directly using the original predictions. If we did not adopt such an appending operation, F1@5 would become the same as F1@M when the prediction number is less than five.
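The appending scheme above can be sketched as follows (a minimal illustration; the placeholder-token naming is ours, and "randomly append incorrect keyphrases" is modeled by fillers guaranteed never to match the ground truth):

```python
def f1_at_5(predictions, ground_truth):
    """F1 over the top-5 predictions. If there are fewer than five
    predictions, pad with guaranteed-incorrect placeholders so the
    precision denominator is always 5; without this padding the metric
    would collapse to F1 over all predictions when fewer than five
    are produced."""
    preds = list(predictions[:5])
    while len(preds) < 5:
        preds.append("<pad-%d>" % len(preds))  # never matches a true keyphrase
    truth = set(ground_truth)
    correct = sum(1 for p in preds if p in truth)
    precision = correct / 5
    recall = correct / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With two predictions of which one is correct against two ground-truth keyphrases, precision is 1/5 rather than 1/2, so models that under-generate are penalized.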
The macro-averaged F1@M and F1@5 scores are reported. When determining whether two keyphrases are identical, all the keyphrases are stemmed first. Besides, all the duplicated keyphrases are removed after stemming.
Following previous work Meng et al. (2017); Yuan et al. (2018); Chen et al. (2019); Chan et al. (2019), we lowercase the characters, tokenize the sequences, and replace digits with “digit” token. Similar to Yuan et al. (2018), when training, the present keyphrase targets are sorted according to the orders of their first occurrences in the document. Then, the absent keyphrase targets are put at the end of the sorted present keyphrase targets. We use “p_start” and “a_start” as the “[neopd]” token of present and absent keyphrases respectively. “;” is employed as the “[eowd]” token for both present and absent keyphrases. “/s” is used as the “[eopd]” token.
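Given the token conventions above, the training target for one document can be assembled as follows (a small illustrative helper; it assumes the keyphrase lists are already sorted as described):

```python
def build_target_sequence(present_kps, absent_kps):
    """Build the flat target token sequence for one document: each present
    keyphrase is prefixed with its "[neopd]" token "p_start" and ended with
    the "[eowd]" token ";"; absent keyphrases use "a_start"; the whole
    sequence is closed with the "[eopd]" token "/s"."""
    tokens = []
    for kp in present_kps:
        tokens += ["p_start"] + kp.split() + [";"]
    for kp in absent_kps:
        tokens += ["a_start"] + kp.split() + [";"]
    tokens.append("/s")
    return tokens
```

The "p_start"/"a_start" prefixes also tell the model, at each phrase boundary, whether the upcoming keyphrase should be extractable from the document or not.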
The vocabulary with 50,000 tokens is shared between the encoder and decoder. We set the embedding dimension d_e as 100 and the hidden size d as 300. The hidden states of the encoder layers are initialized as zeros. In the training stage, we randomly initialize all the trainable parameters, including the embeddings, using a uniform distribution. We set the batch size as 10, the max gradient norm as 1.0, and the initial learning rate as 0.001. We do not use dropout. Adam Kingma and Ba (2014) is used as our optimizer. The learning rate decays to half if the perplexity on the KP20k validation set stops decreasing. Early stopping is applied when training. During inference, we set the minimum phrase-level decoding step as 1 and the maximum as 20.
We show the present and absent keyphrase prediction results in Table 2 and Table 3 correspondingly. As indicated in these two tables, both ExHiRD-s and ExHiRD-h outperform the state-of-the-art baselines on most of the metrics, which demonstrates the effectiveness of our exclusive hierarchical decoding methods. Besides, ExHiRD-h consistently achieves the best results on both present and absent keyphrase prediction on all the datasets. (We also tried to simultaneously incorporate the soft and the hard exclusion mechanisms into our hierarchical decoding model, but it still underperforms ExHiRD-h.)
In this section, we study the model capability of avoiding producing duplicated keyphrases. The duplication ratio, denoted as “DupRatio”, is defined as follows: DupRatio = #(duplicated predictions) / #(all predictions),
where # means “the number of”. For instance, the DupRatio is 0.5 (3/6) for [A, A, B, B, A, C].
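The metric can be computed as follows (a straightforward sketch; the paper additionally stems keyphrases before comparing them, which is omitted here):

```python
def dup_ratio(predictions):
    """DupRatio = (# duplicated predictions) / (# all predictions).
    A prediction counts as duplicated if an identical keyphrase appeared
    earlier in the prediction list."""
    seen, dup = set(), 0
    for p in predictions:
        if p in seen:
            dup += 1
        else:
            seen.add(p)
    return dup / len(predictions) if predictions else 0.0
```

For the example above, [A, A, B, B, A, C], three of the six predictions repeat an earlier one, giving 0.5.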
We report the average DupRatio per document in Table 4. From this table, we observe that our ExHiRD-s and ExHiRD-h consistently and significantly reduce the duplication ratios on all datasets. Moreover, we also find that our ExHiRD-h model achieves the lowest duplication ratios on all datasets.
We also study the average number of unique keyphrase predictions per document, with duplicated keyphrases removed. The results are shown in Table 5. One main finding is that all the models generate an insufficient number of unique keyphrases on most datasets, especially for absent keyphrases. We also observe that our methods improve the number of unique keyphrases by a large margin, which is extremely beneficial for alleviating the insufficient-generation problem. Correspondingly, it also leads to over-generating more keyphrases than the ground truth in cases that do not have this problem, such as the present keyphrase predictions on the Krapivin and KP20k datasets. We leave solving the over-generation of present keyphrases on Krapivin and KP20k as future work.
Since our ExHiRD-h model achieves the best performance on almost all of the metrics, we select it as our final model and analyze it more closely in the following sections. To understand the effects of each component of ExHiRD-h, we conduct an ablation study on it and report the results on the SemEval dataset in Table 6.
We observe that both our hierarchical decoding process and exclusive search mechanism are helpful to generate more accurate present and absent keyphrases. Besides, we also find that the significant performance margins on the duplication ratio and the keyphrase numbers are mainly from the exclusive search mechanism.
For a more comprehensive understanding of the exclusive search mechanism in our ExHiRD-h model, we also study the effects of the window size k. We conduct the experiments on the KP20k dataset and list the results in Table 7.
We note that a larger window size k leads to a lower DupRatio, as anticipated, because the exclusive search can observe more previously-generated keyphrases to avoid duplicating when k is larger. When k is “all”, the DupRatio is not absolutely zero because we stem keyphrases when determining whether they are duplicated. Besides, we also find that a larger k leads to better F1@5 scores. The reason is that for F1@5, we append incorrect keyphrases to obtain five predictions when the number of predictions is less than five. A larger k leads to more unique predictions, fewer appended absolutely-incorrect keyphrases, and a better chance of outputting more accurate keyphrases. However, generating more unique keyphrases may also lead to more incorrect predictions, which will degrade the F1@M scores since F1@M considers all the unique predictions without a fixed cutoff.
| Transformer w/ ES | 0.359 | 0.294 | 3.75 | 0.027 | 0.013 | 0.79 | 0.114 |
| catSeq w/ ES | 0.366 | 0.305 | 3.95 | 0.025 | 0.012 | 0.68 | 0.138 |
| catSeqD w/ ES | 0.366 | 0.306 | 3.99 | 0.026 | 0.012 | 0.65 | 0.137 |
| catSeqCorr w/ ES | 0.366 | 0.298 | 3.74 | 0.027 | 0.013 | 0.72 | 0.159 |
Our exclusive search is a general method that can be easily applied to other models. In this section, we study the effects of our exclusive search on other baseline models. We show the experimental results on KP20k dataset in Table 8.
From this table, we note that the effects of exclusive search on the baselines are similar to its effects on our hierarchical decoding. We also see that our ExHiRD-h still achieves the best performance on most of the metrics even when the baselines are equipped with exclusive search, which again exhibits the superiority of our hierarchical decoding.
We display a prediction example in Figure 3. Our ExHiRD-h model generates more accurate keyphrases for the document compared to the four baselines. Besides, we also observe that far fewer repeated keyphrases are generated by ExHiRD-h. For instance, all the baselines produce the keyphrase “debugging” at least three times, whereas ExHiRD-h generates it only once, which demonstrates that our proposed method is more powerful at avoiding duplicated keyphrases.
In this paper, we propose an exclusive hierarchical decoding framework for keyphrase generation. Unlike previous sequential decoding methods, our hierarchical decoding consists of a phrase-level decoding process that captures the current aspect to summarize and a word-level decoding process that generates keyphrases based on the captured aspect. Besides, we also propose a soft and a hard exclusion mechanism to enhance the diversity of the generated keyphrases. Extensive experimental results demonstrate the effectiveness of our methods. One interesting future direction is to explore whether beam search is helpful to our model.
The work described in this paper was partially supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (CUHK 2300174 (Collaborative Research Fund, No. C5026-18GF)). We would like to thank our colleagues for their comments.
Proceedings of 5th International Joint Conference on Natural Language Processing, Chiang Mai, Thailand, pp. 1162–1170.
The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pp. 6268–6275.
Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 889–898.
Improved automatic keyword extraction given more linguistic knowledge. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pp. 216–223.
OpenNMT: open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, Vancouver, Canada, pp. 67–72.
Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, British Columbia, Canada, pp. 347–354.
Keyphrase extraction using deep recurrent neural networks on twitter. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 836–845.