BSDAR: Beam Search Decoding with Attention Reward in Neural Keyphrase Generation

by   Iftitahu Ni'mah, et al.
TU Eindhoven

This study mainly investigates two decoding problems in neural keyphrase generation: sequence length bias and beam diversity. We introduce an extension of beam search inference based on word-level and n-gram level attention score to adjust and constrain Seq2Seq prediction at test time. Results show that our proposed solution can overcome the algorithm bias to shorter and nearly identical sequences, resulting in a significant improvement of the decoding performance on generating keyphrases that are present and absent in source text.


1 Introduction

Neural keyphrase generation [Meng2017Deep, Chen2018Keyphrase], a conditioned Sequence-to-Sequence (Seq2Seq) approach to automated keyphrase extraction, has recently shown promising results as another domain for exploring latent aspects of Seq2Seq models [sutskever2014sequence, cho2014learning, Jiatao2016Copying, See2017Get, vinyals2015show, xu2015show]. Given pairs of documents and their corresponding keyphrase references as ground truth labels, the task is to encode the sequence of words in the source document into a contextual vector representation, and accordingly to generate a sequence of target words, a “keyword” or “keyphrase” that retains the core information of the source document.

Keyphrase generation shares a common objective with Seq2Seq-based document summarization [See2017Get], i.e. to condense a document into a short document abstraction. Consequently, both domains also share common challenges: the generation algorithm needs to accommodate two mechanisms, namely copying words from the source and generating semantically related words not featured in the source document. While the “copying” task is particularly easy for an unsupervised keyword extractor (e.g. TfIdf), a generative model such as a Seq2Seq recurrent network has not been specifically trained on such a task. The problem has been addressed by incorporating a copying mechanism [Jiatao2016Copying] into the Seq2Seq architecture, resulting in models referred to as CopyRNN [Meng2017Deep] and CorrRNN [Chen2018Keyphrase]. However, decoding issues in the current keyphrase generation task have not received enough attention. These issues are referred to as the “beam search curse” [Yang2018breaking], which is also listed as one of six challenges for NMT [koehn2017six].

Figure 1: Beam Search Decoding Issues: sequence length bias and beam diversity
Figure 2: Beam Search Decoding Issues: generating Present (C) vs. Absent (A) keyphrases

This work further studies the decoding issues in neural keyphrase generation. We focus on two challenges of beam search decoding: (1) the algorithm bias to shorter sequences; and (2) the algorithm tendency to produce nearly identical sequences, causing the beam candidates to be less diverse.

Our empirical finding (fig. 2) shows an example of the discrepancy between what the network (CorrRNN) has learnt from training examples and what the network generates at test time. Here, the attention successfully annotates sequences of words in the source document with a higher score (darker highlight), and these sequences match the corresponding ground truth references. The decoding algorithm, in contrast, fails to include the corresponding sequences in the final prediction set. This finding suggests that, apart from making the model architecture more expressive for the “copying” task, the overall generation performance also depends on the decoding algorithm. This example of a decoding issue is our main motivation for using the learnt attention scores as a mechanism to constrain the sequence generation process at test time. We argue that constraining beam search decoding based on “hints” provided by the attention network is favourable in the current task, given that most keyphrase references are present in the document.

2 Beam Search Decoding Issues

2.1 Sequence length bias

By default, the beam search decoding algorithm of a Seq2Seq model [Bahdanau2014Neural] stops the search once a fixed number of completed target-sequence candidates has been found, i.e. when the decoder network generates the “end” token. As illustrated in fig. 1, although increasing the beam size lets the search explore more candidates, the likelihood that the model finds the “end” token right after the first decoding step is also high. Consequently, the beam mainly contains short sequences and disregards many potential n-gram candidates with longer sequences. This tendency of the decoding algorithm to favour shorter sequences can hurt performance severely, specifically in the current keyphrase generation task, where ground truth references have variable length. We further show empirically in section 5 that a normalization technique alone is not guaranteed to solve the sequence length bias in the current task.
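
To illustrate why an unnormalized beam score favours short candidates, the following minimal sketch compares the joint log-probability of two candidates with their length-normalized scores. The token probabilities are hypothetical, and dividing by the sequence length is only one simple normalization, assumed here for illustration; it is not necessarily the normalization used in the experiments.

```python
import math

def joint_log_prob(token_probs):
    """Sum of token log-probabilities (the raw beam score)."""
    return sum(math.log(p) for p in token_probs)

def length_normalized(token_probs):
    """Raw score divided by sequence length (assumed normalization)."""
    return joint_log_prob(token_probs) / len(token_probs)

# Hypothetical per-token probabilities, including the final "end" token.
short = [0.4, 0.5]             # e.g. "internet <end>"
long_ = [0.4, 0.6, 0.7, 0.8]   # e.g. "internet traffic analysis <end>"

# Every extra token adds a negative term, so the raw score prefers the short candidate.
print(joint_log_prob(short) > joint_log_prob(long_))
# Averaging over the length removes that advantage in this example.
print(length_normalized(long_) > length_normalized(short))
```

Even with such a normalization, a longer candidate only wins if it survives in the beam long enough to be completed, which is the situation examined in section 5.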

2.2 Beam diversity

Figure 1 also exemplifies the diversity problem in standard beam search decoding of the Seq2Seq model for the current task. The generation of nearly identical beams, i.e. sequences that all start with the word “internet”, results in low “informativeness” (based on the Precision metric) and consequently decreases decoding performance in the current task.

3 Neural keyphrase generation

3.1 Seq2Seq Attention Model

Our attention-based Seq2Seq model consists of: (1) an encoder-decoder architecture with attention mechanism [Bahdanau2014Neural] as the backbone; (2) a copying mechanism [Jiatao2016Copying, See2017Get, Meng2017Deep]; and (3) a coverage-review mechanism [See2017Get, Chen2018Keyphrase]. We re-implemented and modified CopyRNN [Meng2017Deep] and CorrRNN [Chen2018Keyphrase] so that they can be trained under two conditions of target vocabulary size: a truncated vocabulary and a very large target vocabulary [jean2015Using]. In the truncated vocabulary setting, the look-up dictionary is constrained to the top-50K most frequent words, while Out-of-Vocabulary (OOV) words are mapped to “<unk>”. In this study, the Seq2Seq model with truncated vocabulary is referred to as CorrRNN, while the modified model for the large target vocabulary is referred to as CorrRNN-L.
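
The truncated vocabulary setting can be sketched as follows. This is a minimal illustration, not the preprocessing code used in the study: the helper names `build_vocab` and `encode` are hypothetical, and the toy vocabulary size of 2 stands in for the top-50K cut-off.

```python
from collections import Counter

UNK = "<unk>"

def build_vocab(tokenized_docs, max_size):
    """Keep the max_size most frequent words; all other words are OOV."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    vocab = {UNK: 0}
    for word, _ in counts.most_common(max_size):
        vocab[word] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Map tokens to ids, sending OOV words to the <unk> id."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]

docs = [["beam", "search", "decoding"], ["beam", "search"], ["beam", "attention"]]
vocab = build_vocab(docs, max_size=2)   # toy size; the study uses the top-50K words
encoded = encode(["beam", "attention", "search"], vocab)
print(encoded)  # [1, 0, 2]: "attention" falls outside the top-2 words
```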

3.1.1 CorrRNN

For both Seq2Seq models in this study (CorrRNN, CorrRNN-L), we use the MLP (concat) attention mechanism [Bahdanau2014Neural] as the attention scoring function, $\mathrm{score}(Q, K) = V^{\top} \tanh(W[Q; K])$, for incorporating the copying, coverage, and review mechanisms into the Seq2Seq architecture. $Q$ corresponds to the query of the attention: the decoder state at one time step; and $K$ denotes the keys of the attention: a sequence of encoder states and the coverage vector of attention (cov) up to the current time step. The latter corresponds to a coverage mechanism [See2017Get, Chen2018Keyphrase]. Here, for CorrRNN, the model has two inputs: (1) sequences indexed with the truncated dictionary; and (2) sequences indexed with the extended vocabulary. The additional OOV dictionary index has a fixed size and excludes the main vocabulary index of the top-50K most frequent words.
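
As a rough illustration of MLP (concat) attention scoring, the sketch below computes one score per encoder state and normalizes the scores with a softmax. All dimensions and weight values are toy assumptions, and the coverage vector that the models above add to the keys is omitted for brevity.

```python
import math

def mlp_concat_score(query, key, W, v):
    """v^T tanh(W [Q; K]) for a single query/key pair."""
    concat = query + key                      # [Q; K] as one flat list
    hidden = [math.tanh(sum(w * x for w, x in zip(row, concat))) for row in W]
    return sum(vi * hi for vi, hi in zip(v, hidden))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy sizes: 2-dim decoder state, 2-dim encoder states, 3 hidden units.
query = [0.5, -0.1]
keys = [[0.2, 0.1], [0.9, 0.4], [-0.3, 0.0]]
W = [[0.1, 0.2, 0.3, 0.4], [-0.2, 0.1, 0.5, 0.0], [0.3, -0.1, 0.2, 0.2]]
v = [1.0, 0.5, -0.5]

# One attention weight per source position; the weights sum to one.
weights = softmax([mlp_concat_score(query, k, W, v) for k in keys])
print(weights)
```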

3.1.2 CorrRNN-L

Since the vocabulary is not truncated, the CorrRNN-L model does not have issues with OOV words. The problem, however, shifts to the complexity of training and decoding due to the large number of target words. This issue has been addressed by incorporating a sampled softmax approximation based on an adaptive sampling algorithm [jean2015Using] as an approximate learning approach for a very large target vocabulary. Here, the model was trained on the large, untruncated vocabulary, and the approximate softmax layer, which replaces the full softmax of the CorrRNN architecture, uses a fixed number of vocabulary samples.

4 Beam Search Decoding with Attention Reward (BSDAR)

4.1 Word-level Attention-Reward

We propose a word-level attention reward function WORD_ATT_REWARD as a mechanism to augment Seq2Seq prediction, shown in alg. 1. For each decoding time step, the logit output of the decoder network is augmented by an attention vector, in which each value corresponds to the attention weight of a word in the source sequence. The normalized (mean) attention score of a particular word (alg. 1 line 3) accounts for the possibility that the word occurs multiple times in the source text. Since the attention vector is bound to the sequential positions of words in the source input, a dictionary look-up of each word index and its corresponding positions is also given at decoding time as a reference for calculating the mean attention score over the set of words in the input sequence and the positions where each word occurs. Correspondingly, the augmentation of decoder logits only applies to words that appear in the source text.

To intensify the effect of logits augmentation, which becomes critical in the noisy decoding of Seq2Seq with a large target vocabulary (CorrRNN-L), we use two augmentation factors for the attention reward (alg. 1 line 4), one of which is defined through a MAX operation.

Algorithm 1: WORD_ATT_REWARD
1: for each decoding step up to MAX_STEPS do
2:     Collect logits and attention weights
3:     Compute the normalized (mean) attention score for each word in the source
4:     Augment the logits for words in the source
5:     Get the top most probable words
           tokens ← ARGSORT(augmented logits)
           probs ← SORT(augmented logits)
6:     return BEAM-HYP(tokens, probs)

Algorithm 2: NGRAM_ATT_REWARD
1: while steps < MAX_STEPS and Results < BEAM_SIZE do
2:     Run one step of decoding with word attention reward
3:     Expand tree with new candidates HYPS ← BEAM-HYP
4:     Construct sequence candidates SEQ
5:     if SEQ found in annotated source ATT-ANNOT then
6:         Reward candidates with the n-gram attention score
7:     else if SEQ is partly composed of tokens in ATT-ANNOT then
8:         Set probability into negative value (log “inf”) to penalize SEQ candidate
9:     else
10:         Set logits probability without reward score
            AUG_PROB ← HYPS.CURR_PROB
11:     Re-sort beam candidates
12:     Update tree with new candidates
13:     Sort the candidates stored in memory (tree) based on normalized joint probability
14:     Expand Results with completed sequence candidates

Our proposed word-level attention reward shares a common intuition with the actor-critic algorithm for sequence prediction [Bahdanau2017Actor]. Intuitively, the WORD_ATT_REWARD function increases the probability of actions to which the encoder network gives a higher attention value, and decreases the probability of actions that are ignored by the encoder network. The actions correspond to the Seq2Seq predictions at each decoding step. The proposed reward mechanism thus bounds the actions with an attention score, in which each value represents the degree of importance of a word in the source sequence, based on what the Seq2Seq attention network has learnt during training.
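
The word-level reward can be sketched as below. This is a minimal illustration under stated assumptions rather than the implementation used in the study: the reward is added to the logit, a single hypothetical factor `scale` stands in for the augmentation factors, and the toy logits and attention weights are invented for the example.

```python
from collections import defaultdict

def mean_attention_per_word(source_tokens, attention):
    """Average the attention weights over all positions where a word occurs."""
    positions = defaultdict(list)
    for pos, word in enumerate(source_tokens):
        positions[word].append(pos)
    return {w: sum(attention[p] for p in ps) / len(ps) for w, ps in positions.items()}

def word_att_reward(logits, source_tokens, attention, scale=1.0):
    """Add a scaled mean-attention reward to the logits of words found in the source.

    `logits` maps each vocabulary word to its decoder logit. The additive form
    and the `scale` factor are assumptions made for illustration.
    """
    reward = mean_attention_per_word(source_tokens, attention)
    return {w: logit + scale * reward.get(w, 0.0) for w, logit in logits.items()}

source = ["internet", "traffic", "analysis", "of", "internet", "use"]
attention = [0.30, 0.25, 0.20, 0.05, 0.10, 0.10]
logits = {"internet": 1.0, "analysis": 0.8, "web": 1.1}

augmented = word_att_reward(logits, source, attention, scale=2.0)
ranked = sorted(augmented, key=augmented.get, reverse=True)
print(ranked)  # "web" is not in the source, so its logit is left unchanged
```

In this toy case the unrewarded word "web" drops from first to last place, while words that receive attention in the source move up.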

4.2 n-gram-level attention reward

The proposed word-level attention reward in alg. 1 relies only on a bag-of-words assumption. As a potential trade-off in decoding performance, the beam algorithm may then favour sequence candidates containing words with higher attention scores, regardless of whether the formed sequence is sensible as a keyphrase candidate. For instance, in the attention visualization in fig. 2, a keyphrase candidate “inquiry honours” may also be rewarded highly, since the sequence is composed of words with a high attention score (“inquiry”, “honours”). This leads to noisy predictions and can potentially decrease decoding performance.

To mitigate this issue, we also introduce the n-gram-level attention reward function NGRAM_ATT_REWARD to further augment and penalize beam candidates before adding them into the memory (alg. 2). For each decoding step, given the previously stored tokens in memory and the current beam candidate returned by WORD_ATT_REWARD, an n-gram candidate SEQ is formed. For all SEQ candidates matched against the extracted n-gram attention annotations ATT-ANNOT, the corresponding n-gram attention score is added to the logits of the last tokens of the SEQ candidates (alg. 2 line 6).

Automated attention-based annotation

The steps for acquiring both the extracted n-gram attention annotations ATT-ANNOT and the corresponding attention scores are shown in alg. 3. The attention annotation ATT-ANNOT is constructed during the first decoding step by a simple n-gram chunking method (i.e. a sliding window) over a filtered source document. The filtering of the source document is based on a percentile rank of the sorted attention scores, which serves as the attention cut-off threshold. Given this threshold, ATT-ANNOT is extracted as n-gram chunks of the element-wise multiplication between the filtered attention scores (line 3) and the source sequence (line 4). The final result is a list of n-grams with their corresponding attention scores. The attention score of an n-gram is the mean of the attention scores of the words composing it. The extracted attention annotation from the first decoding step is then used as a global reference for the following decoding time steps and serves as the reference for the NGRAM_ATT_REWARD function.
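
The annotation steps described above can be sketched as follows. The percentile value, the window sizes, and the helper names are assumptions chosen for illustration; the sketch only aims to show thresholding, binarization, sliding-window chunking, and the mean n-gram score.

```python
def percentile_threshold(values, pct):
    """Simple rank-based percentile of the attention scores (pct in [0, 100])."""
    ranked = sorted(values)
    idx = int(pct / 100 * (len(ranked) - 1))
    return ranked[min(max(idx, 0), len(ranked) - 1)]

def extract_att_annot(source_tokens, attention, pct=50, sizes=(2, 3)):
    """Return {n-gram tuple: mean attention} for n-grams made of high-attention words."""
    tau = percentile_threshold(attention, pct)
    keep = [1 if a >= tau else 0 for a in attention]      # binarized attention mask
    annot = {}
    for n in sizes:
        for i in range(len(source_tokens) - n + 1):
            if all(keep[i:i + n]):                          # window holds only kept words
                gram = tuple(source_tokens[i:i + n])
                annot[gram] = sum(attention[i:i + n]) / n   # mean attention of the n-gram
    return annot

source = ["neural", "keyphrase", "generation", "with", "beam", "search"]
attention = [0.25, 0.30, 0.20, 0.02, 0.13, 0.10]
annot = extract_att_annot(source, attention)
for gram, score in annot.items():
    print(" ".join(gram), round(score, 3))
```

With these toy values, the low-attention word "with" breaks the sequence, so only n-grams from "neural keyphrase generation" are annotated.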

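Algorithm 3: Automated attention-based annotation
1: Collect the attention vector
2: Compute the threshold based on a percentile of the attention scores
3: Binarize attention values based on the threshold:
       1 if the attention value reaches the threshold, else 0
4: Extract n-gram attention annotations
5: Extract n-grams based on sequential position
6: Extract n-gram attention values
7: Return {ATT-ANNOT, n-gram attention scores}

Penalty score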

A penalty is given to the beam candidates that are only partially composed of word tokens found in ATT-ANNOT. These sequence candidates contain words with a high attention score, but are mainly non-sensical. In this subset, the last tokens of the sequence candidates are marked with a negative probability, as shown in alg. 2 line 8. For all candidates with a negative probability value in the beam tree, the logits are then set to zero. This penalty drives the candidates towards an extremely large negative log probability (“-inf”), and correspondingly towards lower ranks during the beam re-sorting stage. For sequence candidates that do not contain words and phrases in ATT-ANNOT, the logit outputs of the decoder network are not augmented and are kept as is (line 10). Thus, sequences not featured in the source text are still considered as candidates for the final Seq2Seq prediction, but are ranked after those found in the source sequence. This last step intuitively aims to preserve the “abstractive” ability of the model, i.e. the ability to produce sequences that do not necessarily appear in the source but are semantically close to the corresponding document.
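
The three cases of the n-gram-level reward (full match, partial overlap, no overlap) can be sketched as below. This is a simplified reading that applies the rule literally to one complete candidate; how prefixes of annotated n-grams and single-word candidates are treated during decoding is not specified here, and the function signature is hypothetical.

```python
import math

def ngram_att_reward(seq, log_prob, att_annot):
    """Re-score one candidate sequence against the attention annotations.

    seq       : tuple of tokens forming the candidate
    log_prob  : current log-probability of the candidate's last token
    att_annot : {n-gram tuple: mean attention score}
    """
    if seq in att_annot:
        # Full match: reward with the n-gram attention score.
        return log_prob + att_annot[seq]
    annotated_words = {w for gram in att_annot for w in gram}
    if any(w in annotated_words for w in seq):
        # Partial overlap only: probability set to zero, i.e. log-probability -inf.
        return -math.inf
    # No overlap: keep the score untouched so "absent" keyphrases remain possible.
    return log_prob

att_annot = {("keyphrase", "generation"): 0.25}
candidates = {
    ("keyphrase", "generation"): -1.2,   # matches an annotation
    ("generation", "keyphrase"): -0.9,   # same words, non-sensical order
    ("text", "mining"): -1.5,            # not in the source annotations
}
rescored = {seq: ngram_att_reward(seq, lp, att_annot) for seq, lp in candidates.items()}
ranked = sorted(rescored, key=rescored.get, reverse=True)
print(ranked)
```

The candidate that is absent from the annotations keeps its original score and is ranked after the rewarded match, while the partially matching candidate falls to the bottom.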

4.3 Re-rank method

In addition to the proposed decoding with attention reward, a heuristic approach to alleviate sequence length bias and diversity issues was employed. We adopted the concept of intra- and inter-sibling rank in beam decoding [LiJurafskiMutual2016] in a simple implementation. We refer to the heuristic approach adopted in this study as (1) pre-intra siblings rank; (2) post-intra siblings re-rank; and (3) post-inter siblings re-rank. In pre-intra siblings rank, for each decoding step, we only consider the top-3 beams (word tokens) to be added into the tree queue. In post-intra siblings re-rank, given completed sequences (i.e. sequences with “<end>” as the last token) that share the same parent node and sequence length, only the top-1 beam candidate is kept. Likewise, in post-inter siblings re-rank, only the top-5 candidates are kept as the final solution. While pre-intra siblings rank is based on the probability scores of the last tokens of the sequence candidates, post-intra siblings re-rank and post-inter siblings re-rank are sorted by the normalized joint probability score of the completed sequences.
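
The two post-decoding re-rank stages can be sketched as follows. The candidate representation (a dict with `tokens`, `parent`, and `score` fields) and the use of the first token as the parent identifier are assumptions made for this example.

```python
def post_rerank(completed, top_inter=5):
    """Apply post-intra and post-inter siblings re-rank to completed candidates.

    completed: list of dicts with keys 'tokens', 'parent', and 'score', where
    'score' is the normalized joint log-probability (higher is better).
    """
    # Post-intra siblings re-rank: keep the best candidate per (parent, length) group.
    best = {}
    for cand in completed:
        key = (cand["parent"], len(cand["tokens"]))
        if key not in best or cand["score"] > best[key]["score"]:
            best[key] = cand
    # Post-inter siblings re-rank: keep the overall top candidates.
    return sorted(best.values(), key=lambda c: c["score"], reverse=True)[:top_inter]

completed = [
    {"tokens": ["internet", "analysis"], "parent": "internet", "score": -0.6},
    {"tokens": ["internet", "support"], "parent": "internet", "score": -0.9},
    {"tokens": ["internet"], "parent": "internet", "score": -0.8},
    {"tokens": ["web", "search"], "parent": "web", "score": -0.7},
]
result = post_rerank(completed)
print([" ".join(c["tokens"]) for c in result])
```

Here "internet support" is dropped because a sibling with the same parent and length scores higher, which limits the number of near-identical sequences in the final set.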

5 Increasing traversal breadth and depth

Conventional wisdom for tackling the aforementioned beam search decoding issues is to expand the beam size (traversal breadth) and the sequence length (traversal depth). Here, we show empirically that increasing both parameters guarantees neither optimal solutions nor a significant performance gain in the current study.

Expanding Beam Size

Figure 3(a) shows that there is no gain from increasing the beam size or from utilizing a simple length normalization technique. The different trend of the uni-gram-based evaluation in Figure 3(b), however, indicates that the beam solution partially includes potential candidates (i.e. it contains some of the words in the references), but fails to correctly form n-gram candidates longer than a uni-gram. An example of the decoding results is shown in table 1.

No Length Normalization | With Length Normalization
internet                | internet
recommendations         | internet analysis
support                 | recommendations
unk                     | support
online                  | unk
information             | online
internet analysis       | information
web                     | web
decision                | decision
computer                | computer
Table 1: Decoding results. Beam size is set to 10.

Figure 3: The effect of changing beam size on decoding performance. Panels: (a) n-gram evaluation; (b) uni-gram evaluation; (c) MAX_SEQ vs. diversity score.

Increasing sequence length

Increasing the traversal depth (MAX_SEQ), shown in Figure 3(c), also adds no value for improving the diversity of beam candidates in the current task. We define a diversity score (y axis of Figure 3(c)) as the “diversity” of the first word tokens in the prediction set and in the ground truth set.

The score is computed from the set of unique words that each occur as the first token of a sequence, and from the list of keyphrases. The purpose of this diversity score is to measure the repetitiveness of beam decoding based on the first word tokens of the generation results. The higher the diversity score (e.g. the closer it is to the diversity score of the ground truth references), the better the decoding algorithm overcomes the beam diversity issue.
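
One plausible reading of this definition, assumed here for illustration, is the ratio between the number of unique first tokens and the number of keyphrases. The sketch below implements that reading; the exact formula used in the study may differ.

```python
def diversity_score(keyphrases):
    """Ratio of unique first tokens to the number of keyphrases (assumed form)."""
    if not keyphrases:
        return 0.0
    first_tokens = {kp.split()[0] for kp in keyphrases if kp.split()}
    return len(first_tokens) / len(keyphrases)

repetitive = ["internet", "internet analysis", "internet support", "internet use"]
varied = ["internet analysis", "decision support", "web search", "recommendations"]
print(diversity_score(repetitive))  # 0.25
print(diversity_score(varied))      # 1.0
```

Under this reading, a beam in which every candidate starts with "internet" scores close to zero, while a set with distinct first tokens scores close to one.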

6 Experiments and results

The Seq2Seq model in this study was trained on the KP20k corpus [Meng2017Deep], whose documents provide the training sets after preprocessing. For training the model, both the sources and the corresponding keyphrase labels are represented as word sequences, each bounded by a maximum sequence length for source documents and for target keyphrase labels respectively. Note that each source document corresponds to multiple keyphrase labels. By splitting each data sample into pairs, the training set is presented as text-keyphrase pairs, each containing only one source text sequence and one target keyphrase sequence. In the inference stage, standard evaluation data sets for keyphrase extraction were used: Inspec [Hulth:2003], Krapivin [Krapivin:2009], NUS [Nguyen:2007], and Semeval-2010 [Kim:2010].
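
The splitting of one document with several keyphrase labels into text-keyphrase pairs can be illustrated with a short sketch. The function name and the whitespace tokenization are assumptions; the actual preprocessing is not described in detail here.

```python
def split_into_pairs(source_tokens, keyphrases):
    """Turn one document with several keyphrases into (source, target) pairs."""
    return [(source_tokens, kp.split()) for kp in keyphrases]

doc = "we study beam search decoding for keyphrase generation".split()
labels = ["beam search", "keyphrase generation", "decoding"]
pairs = split_into_pairs(doc, labels)
print(len(pairs))   # 3: one training pair per keyphrase label
print(pairs[0][1])  # ['beam', 'search']
```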

BS Decoding (CorrRNN) | Inspec (201) | Krapivin (201) | NUS (156)    | Semeval-2010 (242)
BS                    | 0.098  0.098 | 0.073  0.073   | 0.093  0.093 | 0.056  0.056
BS++                  | 0.091  0.091 | 0.094  0.094   | 0.098  0.098 | 0.059  0.059
BSDAR                 | 0.290  0.372 | 0.249  0.317   | 0.211  0.255 | 0.169  0.221
Table 2: Comparison of Beam Search (BS) decoding algorithms on CorrRNN performance. Numbers in parentheses: size of test corpora.

BS Decoding (CorrRNN-L) | Inspec (201) | Krapivin (201) | NUS (156)     | Semeval-2010 (242)
BS                      | 0.045  0.045 | 0.038  0.038   | 0.043  0.043  | 0.026  0.026
BS++                    | 0.033  0.047 | 0.038  0.05    | 0.0409  0.052 | 0.020  0.029
BSDAR                   | 0.268  0.327 | 0.193  0.226   | 0.152  0.182  | 0.139  0.166
Table 3: Comparison of Beam Search (BS) decoding algorithms on CorrRNN-L performance. Numbers in parentheses: size of test corpora.

Decoding | Inspec (199)            | Krapivin (185)          | NUS (148)               | Semeval-2010 (242)
BS       | 0.027 0.207 0.066 0.211 | 0.026 0.257 0.053 0.226 | 0.034 0.240 0.048 0.197 | 0.015 0.194 0.026 0.154
BS++     | 0.022 0.194 0.078 0.329 | 0.024 0.247 0.081 0.369 | 0.029 0.228 0.064 0.304 | 0.013 0.182 0.031 0.260
BSDAR    | 0.038 0.249 0.079 0.331 | 0.071 0.300 0.064 0.348 | 0.037 0.260 0.065 0.310 | 0.031 0.225 0.041 0.253
Table 4: Abstractive Performance. Scores are based on Micro-Average Recall (R) and ROUGE-L average F1-score. Numbers in parentheses: size of test corpora after preprocessing.

Decoding | Inspec (138) | Krapivin (85) | NUS (93)     | Semeval-2010 (212)
BS       | 0.236  0.189 | 0.236  0.183  | 0.215  0.150 | 0.196  0.142
BS++     | 0.249  0.328 | 0.248  0.344  | 0.234  0.279 | 0.199  0.269
BSDAR    | 0.405  0.423 | 0.359  0.408  | 0.335  0.349 | 0.277  0.285
Table 5: Abstractive Performance on longer sequences (n-grams). Scores are based on ROUGE-L average F1-score. Numbers in parentheses: size of test corpora after preprocessing.

Beam Parameters

For the first decoding time step, the beam size is set to a larger number than for the remaining decoding time steps. The number of hypotheses num_hyps, representing the partial solutions (queue) of sequence candidates to be added into the memory, is fixed to a constant value.

Evaluation Metrics

We employ Micro-average Recall (R), ROUGE-L average F1-score, and diversity (section 5) metrics to evaluate performance of beam search decoding algorithms in this study.
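
For reference, the two matching-based metrics can be sketched as below. This is a generic illustration of micro-averaged recall over exact keyphrase matches and of a token-level ROUGE-L F1 based on the longest common subsequence; details such as stemming or how ROUGE-L is averaged over predictions are not covered and may differ from the evaluation used in this study.

```python
def micro_recall(predictions, references):
    """predictions, references: lists (one entry per document) of keyphrase lists."""
    matched = sum(len(set(p) & set(r)) for p, r in zip(predictions, references))
    total = sum(len(set(r)) for r in references)
    return matched / total if total else 0.0

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l_f1(prediction, reference):
    """Token-level ROUGE-L F1 between one predicted and one reference keyphrase."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

preds = [["beam search", "attention"], ["keyphrase generation"]]
refs = [["beam search", "decoding"], ["keyphrase generation", "seq2seq"]]
print(micro_recall(preds, refs))                               # 2 of 4 references matched
print(rouge_l_f1("beam search decoding", "beam decoding"))     # partial credit for overlap
```

Micro-averaging pools matches over all documents before dividing, so documents with many references weigh more than documents with few.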

Model           | Inspec-L (189)    | Krapivin-L (118)  | NUS-L (121)        | Semeval2010-N (234)
TfIdf           | 0.082 0.252 0.635 | 0.079 0.189 0.562 | 0.044 0.110 0.478  | 0.039 0.157 0.441
CorrRNN BS      | 0.    0.    0.256 | 0.    0.    0.265 | 0.    0.    0.251  | 0.    0.    0.236
CorrRNN BS++    | 0.005 0.005 0.294 | 0.011 0.011 0.306 | 0.009 0.009 0.299  | 0.006 0.006 0.261
CorrRNN BSDAR   | 0.007 0.130 0.643 | 0.009 0.123 0.560 | 0.009 0.070 0.4960 | 0.009 0.058 0.426
CorrRNN-L BS    | 0.    0.    0.138 | 0.    0.    0.151 | 0.    0.    0.152  | 0.    0.    0.134
CorrRNN-L BS++  | 0.    0.003 0.308 | 0.009 0.009 0.341 | 0.005 0.005 0.320  | 0.002 0.003 0.299
CorrRNN-L BSDAR | 0.102 0.342 0.664 | 0.057 0.219 0.572 | 0.035 0.164 0.4962 | 0.049 0.162 0.436
Table 6: Comparison between models in subsets with longer keyphrases (n-grams). Numbers in parentheses: size of test corpora after preprocessing.

6.1 Decoding performance

Tables 2 and 3 show the comparison between our proposed method (BSDAR) and standard beam search decoding algorithms for Seq2Seq prediction. To show that simply applying the heuristic rules (section 4.3) is not guaranteed to solve the decoding issues, we include the heuristic-only beam decoding (BS++) as a comparison. In general, the results for both CorrRNN and CorrRNN-L show that BSDAR improves the recall score substantially, e.g. from 0.098 to 0.290 on Inspec for CorrRNN (table 2).

We also show that while expanding the retrieval set brings no gain for standard beam decoding (BS) and heuristic-based decoding (BS++), the performance of the proposed decoding method increases significantly. This result indicates that the solutions based on standard beam search (BS) and heuristic-based beam search (BS++) mainly contain noise and non-relevant sequences, as compared to the predictions produced by BSDAR.

6.2 Abstractive performance

For a neural generation task, it is also important to maintain the model's ability to generate “abstractive” sequences. The results in table 4 show that BSDAR maintains a high performance on generating sequences not featured in the source text, given subsets with “absent” keyphrase references of variable sequence length. We further evaluate the “abstractive” performance of the decoding algorithms on longer sequences (table 5). Since this task is generally challenging for any model, we use the ROUGE-L average score to match the prediction set against the ground truth set. In addition to maintaining a high abstractive performance in the subset with longer target sequences, BSDAR is also able to disclose the “actual” intrinsic performance of CorrRNN-L. Compared to the low decoding performance of CorrRNN-L with standard beam search (BS), the decoding performance of CorrRNN-L with BSDAR is higher. This indicates that the model has actually learnt to attend to relevant words in the source text, but the standard decoding algorithm fails to include these words in the final generated sequences. Understanding the intrinsic performance of a complex Seq2Seq model, as exemplified by this empirical result, is becoming essential, specifically for further addressing the challenging “abstractive generation” task for neural sequence models.

6.3 On sequence length bias issue

To measure whether the proposed method in this study (BSDAR) can overcome the algorithm bias to shorter sequences, we compare the Seq2Seq decoding performance based on BSDAR with the Tf-Idf unsupervised keyword extractor in a subset with longer target sequences (n-grams). Intuitively, this subset introduces challenges for both the neural network and the non-neural (Tf-Idf) approaches, since both may suffer from sequence length bias, resulting in predictions with shorter sequence length. The result in table 6 shows that the proposed solution can improve the decoding performance of Seq2Seq models, outperforming Tf-Idf in three data sets, as compared to Seq2Seq with standard beam (BS) and heuristic-only beam decoding (BS++).

6.4 On diversity issue

We also show that BSDAR can overcome the diversity issue of the beam decoding algorithm. Based on the result in table 7, BSDAR maintains a high diversity score (section 5), close to the diversity measure of the ground truth references. Furthermore, compared to the standard beam (BS) and heuristic beam (BS++), BSDAR shows a reasonably better decoding performance for recalling uni-gram, bi-gram, and n-gram references.

Keyphrase set | DIV   | Recall (uni-gram) | Recall (bi-gram) | Recall (n-gram)
Ground truth  | 0.942 | N/A               | N/A              | N/A
BS            | 0.058 | 0.120             | 0.017            | 0.
BS++          | 0.661 | 0.112             | 0.030            | 0.004
BSDAR         | 0.746 | 0.301             | 0.198            | 0.223
Table 7: Diversity measure of decoding algorithms across data sets

7 Conclusion

We present an approach to overcome two decoding issues in neural keyphrase generation by incorporating the attention vector as a re-scoring method in the beam search decoding algorithm. We show empirically that the proposed solution (BSDAR) not only improves the generation of longer and more diverse sequences, but also maintains the Seq2Seq model's ability to predict “abstractive” keyphrases, i.e. semantically relevant keyphrases that are not present in the source text.