Cutting-off Redundant Repeating Generations for Neural Abstractive Summarization

12/31/2016 ∙ by Jun Suzuki, et al.

This paper tackles the reduction of redundant repeating generation that is often observed in RNN-based encoder-decoder models. Our basic idea is to jointly estimate the upper-bound frequency of each target vocabulary word in the encoder and to control the output words in the decoder based on this estimate. Our method shows significant improvement over a strong RNN-based encoder-decoder baseline and achieves its best results on an abstractive summarization benchmark.




1 Introduction

The RNN-based encoder-decoder (EncDec) approach has recently driven significant progress in various natural language generation (NLG) tasks, e.g., machine translation (MT) Sutskever et al. (2014); Cho et al. (2014) and abstractive summarization (ABS) Rush et al. (2015). Since a model in this approach can be interpreted as a conditional language model, it is well suited to NLG tasks. However, one potential weakness is that it sometimes generates the same phrase (or word) repeatedly.

This issue has been discussed in the neural MT (NMT) literature as part of the coverage problem Tu et al. (2016); Mi et al. (2016). Such repeating generation behavior can become more severe in some NLG tasks than in MT. The very short ABS task in DUC-2003 and 2004 Over et al. (2007) is a typical example because it requires generating a summary within a pre-defined, limited output space, such as ten words or 75 bytes. Thus, repeated output consumes precious output space. Unfortunately, the coverage approach cannot be directly applied to ABS tasks. ABS requires finding the salient ideas in the input in a lossy-compression manner, so the summary (output) length hardly depends on the input length, whereas MT is mainly loss-less generation with a nearly one-to-one correspondence between input and output Nallapati et al. (2016a).

Against this background, this paper proposes a method to overcome the issue in ABS tasks. The basic idea is to jointly estimate, during the encoding process, the upper-bound frequency of each target vocabulary word that can occur in a summary, and to exploit the estimate to control the output words at each decoding step. We refer to this additional component as a word-frequency estimation (WFE) sub-model. The WFE sub-model explicitly manages how many times each word has been generated so far and how many times it may still be generated during the rest of the decoding process. Thus, we expect it to decisively prohibit excessive generation. Finally, we evaluate the effectiveness of our method on the well-studied ABS benchmark data provided by Rush et al. (2015) and evaluated in Chopra et al. (2016); Nallapati et al. (2016b); Kikuchi et al. (2016); Takase et al. (2016); Ayana et al. (2016); Gulcehre et al. (2016).

2 Baseline RNN-based EncDec Model

The baseline of our proposal is an RNN-based EncDec model with an attention mechanism Luong et al. (2015). This model has already been used as a strong baseline for ABS tasks Chopra et al. (2016); Kikuchi et al. (2016) as well as in the NMT literature. More specifically, as a case study we employ a 2-layer bidirectional LSTM encoder and a 2-layer LSTM decoder with global attention Bahdanau et al. (2014). We omit a detailed review of the model due to space limitations. The following are the parts necessary for explaining our proposed method.

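Let $X = (\mathbf{x}_i)_{i=1}^{I}$ and $Y = (\mathbf{y}_j)_{j=1}^{J}$ be input and output sequences, respectively, where $\mathbf{x}_i$ and $\mathbf{y}_j$ are one-hot vectors, which correspond to the $i$-th word in the input and the $j$-th word in the output. Let $\mathcal{V}^t$ denote the vocabulary (set of words) of output. For simplification, this paper uses the following four notation rules:

  1. $(\mathbf{x}_i)_{i=1}^{I}$ is a short notation for representing a list of (column) vectors, i.e., $(\mathbf{x}_i)_{i=1}^{I} = (\mathbf{x}_1, \dots, \mathbf{x}_I)$.

  2. $\mathbf{v}(a, D)$ represents a $D$-dimensional (column) vector whose elements are all $a$, i.e., $\mathbf{v}(1, 3) = (1, 1, 1)^{\top}$.

  3. $\mathbf{x}[i]$ represents the $i$-th element of $\mathbf{x}$, i.e., if $\mathbf{x} = (0.1, 0.2, 0.3)^{\top}$, then $\mathbf{x}[2] = 0.2$.

  4. $M = |\mathcal{V}^t|$, and $m$ always denotes the index of output vocabulary, namely, $m \in \{1, \dots, M\}$, and $\mathbf{o}[m]$ represents the score of the $m$-th word in $\mathcal{V}^t$, where $\mathbf{o} \in \mathbb{R}^{M}$.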

Encoder: Let $\Omega^s(\cdot)$ denote the overall process of our 2-layer bidirectional LSTM encoder. The encoder receives input $X$ and returns a list of final hidden states $H^s = (\mathbf{h}^s_i)_{i=1}^{I}$:

$$H^s = \Omega^s(X). \tag{1}$$

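Input: $H^s$: list of hidden states generated by encoder
Initialize: $s \leftarrow 0$: cumulative log-likelihood
           $\hat{Y} \leftarrow (\mathrm{BOS})$: list of generated words
           $H^t$: hidden states to process decoder
1: $h \leftarrow (s, \hat{Y}, H^t)$: triplet of (minimal) info for decoding process
2: $Q_w \leftarrow \mathrm{push}(Q_w, h)$: set initial triplet to priority queue $Q_w$
3: $Q_c \leftarrow \{\}$: prepare queue $Q_c$ to store complete sentences
4: Repeat
5:     $\tilde{O} \leftarrow ()$: prepare empty list
6:     Repeat
7:        $h \leftarrow \mathrm{pop}(Q_w)$: pop a candidate history
8:        $\tilde{\mathbf{o}} \leftarrow \mathrm{calcLL}(h, H^s)$: see Eq. 2
9:        $\tilde{O} \leftarrow \mathrm{append}(\tilde{O}, \tilde{\mathbf{o}})$: append likelihood vector
10:     Until $Q_w = \emptyset$: repeat until $Q_w$ is empty
11:     select the best (word, candidate) index pairs $(\hat{m}, \hat{k})$ from $\tilde{O}$
12:     makeTriplet: construct a triplet for each selected index pair $(\hat{m}, \hat{k})$
13:     $Q' \leftarrow$ selectTopK over the new triplets and the triplets in $Q_c$
14:     $(Q_w, Q_c) \leftarrow \mathrm{sepComp}(Q')$: separate $Q'$ into $Q_c$ or $Q_w$
15: Until $Q_w = \emptyset$: finish if $Q_w$ is empty
Figure 1: Algorithm for a $K$-best beam search decoding typically used in the EncDec approach.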

Decoder: We employ a $K$-best beam-search decoder to find the (approximated) best output $\hat{Y}$ given input $X$. Figure 1 shows a typical $K$-best beam search algorithm used in the decoder of the EncDec approach. We define the (minimal) information required for the $j$-th decoding step in Figure 1 as the triplet $h = (s_{j-1}, \hat{Y}_{j-1}, H^t_{j-1})$, where $s_{j-1}$ is the cumulative log-likelihood from step 0 to $j-1$, $\hat{Y}_{j-1}$ is a (candidate) output word sequence generated so far from step 0 to $j-1$, that is, $\hat{Y}_{j-1} = (\hat{\mathbf{y}}_0, \dots, \hat{\mathbf{y}}_{j-1})$, and $H^t_{j-1}$ is the set of all hidden states needed for the $j$-th decoding step. Then, the function calcLL in Line 8 can be written as follows:

$$\tilde{\mathbf{o}}_j = \mathbf{v}(s_{j-1}, M) + \log\bigl(\mathrm{Softmax}(\mathbf{o}_j)\bigr), \qquad \mathbf{o}_j = \Omega^t\bigl(H^s, H^t_{j-1}, \hat{\mathbf{y}}_{j-1}\bigr), \tag{2}$$

where $\mathrm{Softmax}(\cdot)$ is the softmax function for a given vector and $\Omega^t(\cdot)$ represents the overall process of a single decoding step.

Moreover, $\tilde{O}$ in Line 11 is an $(M \times (K - C))$-matrix, where $C$ is the number of complete sentences in $Q_c$. The $(m, k)$-element of $\tilde{O}$ represents the likelihood of the $m$-th word, namely $\tilde{\mathbf{o}}_j[m]$, that is calculated using the $k$-th candidate in $Q_w$ at the $(j-1)$-th step. In Line 12, the function makeTriplet constructs a set of triplets based on the selected index pairs $(\hat{m}, \hat{k})$. Then, in Line 13, the function selectTopK selects the top-$K$ candidates from the union of the set of triplets generated at the current step and the set of triplets of complete sentences in $Q_c$. Finally, the function sepComp in Line 14 divides the set of triplets $Q'$ into two distinct sets according to whether they are complete sentences, $Q_c$, or not, $Q_w$. If the elements in $Q'$ are all complete sentences, namely, $Q_c = Q'$ and $Q_w = \emptyset$, then the algorithm stops according to the evaluation in Line 15.
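The beam search described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `beam_search`, the scorer `toy_step`, and the `<s>` / `</s>` symbols are hypothetical names, and the decoder hidden states of each triplet are folded into the scorer for brevity, so a hypothesis is just a (cumulative log-likelihood, word list) pair.

```python
import math


def beam_search(step_fn, bos, eos, beam_size, max_len):
    """K-best beam search over a next-word scorer.

    step_fn(words) returns a dict mapping each next word to its
    log-probability given the words generated so far.
    """
    working = [(0.0, [bos])]  # incomplete hypotheses
    complete = []             # hypotheses that already end with eos
    for _ in range(max_len):
        if not working:       # stop once every kept hypothesis is complete
            break
        candidates = []
        for score, words in working:
            for word, logp in step_fn(words).items():
                candidates.append((score + logp, words + [word]))
        # Keep the K best among the new candidates and the finished ones.
        pool = sorted(candidates + complete, key=lambda h: h[0], reverse=True)
        pool = pool[:beam_size]
        complete = [h for h in pool if h[1][-1] == eos]
        working = [h for h in pool if h[1][-1] != eos]
    return max(complete + working, key=lambda h: h[0])


def toy_step(words):
    # Hypothetical scorer with hand-picked probabilities (illustration only).
    if len(words) == 1:
        return {"a": math.log(0.6), "b": math.log(0.3), "</s>": math.log(0.1)}
    if len(words) == 2:
        return {"b": math.log(0.7), "</s>": math.log(0.3)}
    return {"</s>": math.log(0.9), "b": math.log(0.1)}


score, words = beam_search(toy_step, "<s>", "</s>", beam_size=3, max_len=5)
print(words, round(math.exp(score), 3))
```

Merging finished hypotheses back into the pool before truncating to the beam size mirrors the selectTopK step, and splitting the pool afterwards mirrors sepComp.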

3 Word Frequency Estimation

This section describes our proposed method, which roughly consists of two parts:

  1. a sub-model that estimates the upper-bound frequencies of the target vocabulary words in the output, and

  2. an architecture for controlling the output words in the decoder using the estimations.

3.1 Definition

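Let $\hat{\mathbf{a}}$ denote a vector representation of the frequency estimation, and let $\odot$ denote the element-wise product. $\hat{\mathbf{a}}$ is calculated by:

$$\hat{\mathbf{a}} = \hat{\mathbf{r}} \odot \hat{\mathbf{g}}, \qquad \hat{\mathbf{r}} = \mathrm{ReLU}(\mathbf{r}), \qquad \hat{\mathbf{g}} = \mathrm{Sigmoid}(\mathbf{g}), \tag{3}$$

where $\mathrm{Sigmoid}(\cdot)$ and $\mathrm{ReLU}(\cdot)$ represent the element-wise sigmoid and ReLU Glorot et al. (2011), respectively. Thus, $\hat{\mathbf{r}} \in [0, +\infty)^{M}$, $\hat{\mathbf{g}} \in [0, 1]^{M}$, and $\hat{\mathbf{a}} \in [0, +\infty)^{M}$.

We incorporate two separate components, $\hat{\mathbf{r}}$ and $\hat{\mathbf{g}}$, to improve the frequency fitting. The purpose of $\hat{\mathbf{g}}$ is to distinguish whether the target words occur or not, regardless of their frequency. Thus, $\hat{\mathbf{g}}$ can be interpreted as a gate function that resembles estimating the fertility in the coverage Tu et al. (2016) and a switch probability in the copy mechanism Gulcehre et al. (2016). These ideas originated from gated recurrent networks such as LSTM Hochreiter and Schmidhuber (1997) and GRU Chung et al. (2014). Then, $\hat{\mathbf{r}}$ can focus on modeling frequencies equal to or larger than 1. This separation can be expected since $\hat{\mathbf{r}}[m]$ has no influence if $\hat{\mathbf{g}}[m] = 0$.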
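As a small illustration of this gating, the sketch below combines a ReLU-rectified frequency score with a sigmoid occurrence gate, element by element. The function and variable names (`frequency_estimate`, `r`, `g`) and the toy inputs are assumptions made for this example only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def frequency_estimate(r, g):
    """Element-wise product of ReLU(r) ("how many times") and sigmoid(g) ("whether at all")."""
    return [relu(ri) * sigmoid(gi) for ri, gi in zip(r, g)]


# Toy raw scores for a 3-word vocabulary (made-up values).
r = [2.3, 1.1, -0.4]   # frequency scores
g = [6.0, -6.0, 6.0]   # occurrence scores
a_hat = frequency_estimate(r, g)
print([round(a, 3) for a in a_hat])
```

The second word has a positive frequency score but a closed gate, so its estimate is close to zero; the third word has an open gate but a negative frequency score, so its estimate is exactly zero.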

3.2 Effective usage

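The technical challenge of our method is effectively leveraging the WFE $\hat{\mathbf{a}}$. Among several possible choices, we chose to integrate it as prior knowledge in the decoder. To do so, we re-define $\tilde{\mathbf{o}}_j$ in Eq. 2 as:

$$\tilde{\mathbf{o}}_j = \mathbf{v}(s_{j-1}, M) + \log\bigl(\mathrm{Softmax}(\mathbf{o}_j)\bigr) + \tilde{\mathbf{a}}_j.$$

The difference is the additional term $\tilde{\mathbf{a}}_j$, which is an adjusted likelihood for the $j$-th step originally calculated from $\hat{\mathbf{a}}$. We define $\tilde{\mathbf{a}}_j$ as:

$$\tilde{\mathbf{a}}_j = \log\bigl(\mathrm{ClipReLU}_1(\tilde{\mathbf{r}}_j) \odot \hat{\mathbf{g}}\bigr). \tag{4}$$

$\mathrm{ClipReLU}_1(\cdot)$ is a function that receives a vector and performs an element-wise calculation: $\mathbf{x}'[m] = \max(0, \min(1, \mathbf{x}[m]))$ for all $m$ if it receives $\mathbf{x}$. We define the relation between $\tilde{\mathbf{r}}_j$ in Eq. 4 and $\hat{\mathbf{r}}$ in Eq. 3 as follows:

$$\tilde{\mathbf{r}}_j = \begin{cases} \hat{\mathbf{r}} & \text{if } j = 1, \\ \tilde{\mathbf{r}}_{j-1} - \hat{\mathbf{y}}_{j-1} & \text{otherwise.} \end{cases} \tag{5}$$

Eq. 5 updates $\tilde{\mathbf{r}}_{j-1}$ to $\tilde{\mathbf{r}}_j$ with the estimated output of the previous step, $\hat{\mathbf{y}}_{j-1}$. Since $\hat{\mathbf{y}}_j \in \{0, 1\}^{M}$ for all $j$, all of the elements in $\tilde{\mathbf{r}}_j$ are monotonically non-increasing. If $\tilde{\mathbf{r}}_{j'}[m] \le 0$ at $j'$, then $\tilde{\mathbf{a}}_{j'}[m] = -\infty$ regardless of $\hat{\mathbf{g}}[m]$. This means that the $m$-th word will never be selected again at any step $j \ge j'$. Thus, the interpretation of $\tilde{\mathbf{r}}_j$ is that it directly manages the upper-bound frequency of each target word that can occur in the current and future decoding time steps. As a result, decoding with our method never generates a word more often than the estimation $\hat{\mathbf{r}}$ allows, and thus we expect to reduce the redundant repeating generation.

Note here that our method never requires $\tilde{\mathbf{r}}_j[m] \le 0$ (or $\tilde{\mathbf{r}}_j[m] = 0$) for all $m$ at the last decoding time step $j$, as is generally required in the coverage Tu et al. (2016); Mi et al. (2016); Wu et al. (2016). This is why we say upper-bound frequency estimation, not just (exact) frequency.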
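The effect of this control can be illustrated with a small greedy decoder. This sketch assumes that the remaining per-word budget is clipped to [0, 1], multiplied by the occurrence gate, and added to the word scores in the log domain, and that the budget of a word is decremented by one each time it is emitted. All names (`adjustment`, `decode_greedy`) and the toy scores are hypothetical, and greedy selection stands in for beam search.

```python
import math

NEG_INF = float("-inf")


def clip01(x):
    return max(0.0, min(1.0, x))


def adjustment(remaining, gate):
    """Log-domain prior added to each word's score at one decoding step."""
    out = []
    for rem, gt in zip(remaining, gate):
        p = clip01(rem) * gt
        out.append(math.log(p) if p > 0.0 else NEG_INF)
    return out


def decode_greedy(scores_per_step, remaining, gate):
    remaining = list(remaining)  # do not mutate the caller's budget
    chosen = []
    for scores in scores_per_step:
        adj = adjustment(remaining, gate)
        total = [s + a for s, a in zip(scores, adj)]
        m = max(range(len(total)), key=lambda i: total[i])
        chosen.append(m)
        remaining[m] -= 1.0      # subtract the one-hot vector of the chosen word
    return chosen


# A toy model that prefers word 0 at every step (made-up log-scores).
scores = [[0.0, -1.0, -2.0]] * 3
budget = [1.0, 1.0, 1.0]         # each word may appear at most once
gate = [1.0, 1.0, 1.0]
print(decode_greedy(scores, budget, gate))
```

Without the adjustment the toy model would emit word 0 three times; with it, a word whose budget is exhausted receives a score of negative infinity and cannot be selected again.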

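Input: $H^s = (\mathbf{h}^s_i)_{i=1}^{I}$: list of hidden states generated by encoder
Parameters: $W^r_1$, $W^g_1$, $W^r_2$, $W^g_2$
1: $H^r_1 \leftarrow W^r_1 H^s$: linear transformation for frequency model
2: $\mathbf{h}^r_1 \leftarrow H^r_1 \mathbf{v}(1, I)$: sum over the $I$ input positions
3: $\mathbf{r} \leftarrow W^r_2 \mathbf{h}^r_1$: frequency estimation
4: $H^g_1 \leftarrow W^g_1 H^s$: linear transformation for occurrence model
5: $\mathbf{h}^{g+}_2 \leftarrow \mathrm{RowMax}(H^g_1)$: row-wise maximum over input positions
6: $\mathbf{h}^{g-}_2 \leftarrow \mathrm{RowMin}(H^g_1)$: row-wise minimum over input positions
7: $\mathbf{g} \leftarrow W^g_2\, \mathrm{concat}(\mathbf{h}^{g+}_2, \mathbf{h}^{g-}_2)$: occurrence estimation
Figure 2: Procedure for calculating the components of our WFE sub-model.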

3.3 Calculation

Figure 2 shows the detailed procedure for calculating $\mathbf{r}$ and $\mathbf{g}$ in Eq. 3. For $\mathbf{r}$, we sum up all of the features of the input given by the encoder (Line 2) and estimate the frequency. In contrast, for $\mathbf{g}$, we expect Lines 5 and 6 to work as a kind of voting in both the positive and negative directions, since $\mathbf{g}$ needs only occurrence information, not frequency. For example, $\mathbf{g}$ may take large positive or negative values if a certain input word (feature) strongly influences whether specific target word(s) occur in the output. This idea is borrowed from the max-pooling layer Goodfellow et al. (2013).
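The two branches can be sketched in plain Python as follows. This is only an illustration under assumptions: the "voting" is taken to be a position-wise maximum and minimum that are concatenated before the final projection, and the weight matrices (`W_r1`, `W_r2`, `W_g1`, `W_g2`), their shapes, and the toy values are made up for the example.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def wfe_scores(H, W_r1, W_r2, W_g1, W_g2):
    """H is a list of encoder hidden vectors, one per input position."""
    # Frequency branch: transform each position, sum over positions, project to vocabulary.
    Hr = [matvec(W_r1, h) for h in H]
    h_sum = [sum(col) for col in zip(*Hr)]
    r = matvec(W_r2, h_sum)
    # Occurrence branch: transform, then pool the max and min over positions (the "votes").
    Hg = [matvec(W_g1, h) for h in H]
    h_max = [max(col) for col in zip(*Hg)]
    h_min = [min(col) for col in zip(*Hg)]
    g = matvec(W_g2, h_max + h_min)
    return r, g


# Toy setup: 3 input positions, hidden size 2, vocabulary size 3.
H = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
W_r2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_g2 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
r, g = wfe_scores(H, identity, W_r2, identity, W_g2)
print(r, g)
```

Summation lets every input position contribute to the frequency score, whereas max/min pooling lets a single strongly positive or strongly negative feature decide the occurrence score.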

3.4 Parameter estimation (Training)

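Given the training data, let $\mathbf{a}^*$ be a vector representation of the true frequency of the target words given the input, where $\mathbf{a}^*[m] \in \{0, 1, 2, \dots\}$ for all $m$. Clearly, $\mathbf{a}^*$ can be obtained by counting the words in the corresponding output. We define the loss function $\Psi$ for estimating our WFE sub-model as follows:

$$\Psi(X, \mathbf{a}^*, \mathcal{W}) = \mathbf{d} \cdot \mathbf{v}(1, M), \qquad \mathbf{d} = c_1 \max\bigl(\mathbf{v}(0, M),\ \hat{\mathbf{a}} - \mathbf{a}^* - \mathbf{v}(\epsilon, M)\bigr)^{b} + c_2 \max\bigl(\mathbf{v}(0, M),\ \mathbf{a}^* - \hat{\mathbf{a}} - \mathbf{v}(\epsilon, M)\bigr)^{b}, \tag{6}$$

where $\mathcal{W}$ represents the overall parameters, and the maximum and the power are taken element-wise. The form of $\Psi$ is closely related to that used in support vector regression (SVR) Smola and Schölkopf (2004). We allow the estimate $\hat{\mathbf{a}}[m]$ for all $m$ to take a value in the range $[\mathbf{a}^*[m] - \epsilon,\ \mathbf{a}^*[m] + \epsilon]$ with no penalty (the loss is zero). In our case, we select $\epsilon = 0.25$ since all the elements of $\mathbf{a}^*$ are integers. The remaining 0.25 on both the positive and negative sides serves as the margin between every integer. We select $b$ so that more distant errors are penalized more heavily, and $c_1 < c_2$, since we aim to obtain an upper-bound estimate and therefore penalize under-estimation below the true frequency $\mathbf{a}^*$ more strongly than over-estimation.

Finally, we minimize Eq. 6 jointly with the standard negative log-likelihood objective function used to estimate the baseline EncDec model.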
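A minimal sketch of such an asymmetric, margin-insensitive loss is given below. The margin of 0.25 follows the text; the exponent `b = 2` and the weights `c_over = 0.2` and `c_under = 1.0` are illustrative assumptions chosen only so that under-estimation costs more than over-estimation, and the function name `wfe_loss` is hypothetical.

```python
def wfe_loss(a_hat, a_true, eps=0.25, b=2, c_over=0.2, c_under=1.0):
    """SVR-like loss: zero inside the +/- eps tube, asymmetric outside it.

    eps follows the text; b, c_over and c_under are assumed example values.
    """
    loss = 0.0
    for est, true in zip(a_hat, a_true):
        over = max(0.0, est - true - eps)    # estimate too high
        under = max(0.0, true - est - eps)   # estimate too low
        loss += c_over * over ** b + c_under * under ** b
    return loss


print(wfe_loss([1.2, 0.0], [1, 0]))  # inside the tube
print(wfe_loss([2.0], [1]))          # over-estimation by 1
print(wfe_loss([0.0], [1]))          # under-estimation by 1
```

With these example weights, missing the true count from below is penalized five times as heavily as overshooting it by the same amount, which pushes the estimate toward an upper bound.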

Source vocabulary†      119,507
Target vocabulary†      68,887
Dim. of embedding       200
Dim. of hidden state    400
Encoder RNN unit        2-layer bi-LSTM
Decoder RNN unit        2-layer LSTM with attention
Optimizer               Adam (first 5 epochs) + SGD (remaining epochs)‡
Initial learning rate   0.001 (Adam) / 0.01 (SGD)
Mini batch size         256 (shuffled at each epoch)
Gradient clipping       10 (Adam) / 5 (SGD)
Stopping criterion      max 15 epochs w/ early stopping based on the val. set
Other opt. options      Dropout = 0.3
Table 1: Model and optimization configurations in our experiments. †: including special BOS, EOS, and UNK symbols. ‡: as suggested in Wu et al. (2016).
                                         DUC-2004 (w/ 75-byte limit)           Gigaword (w/o length limit)
Method                                   ROUGE-1(R)  ROUGE-2(R)  ROUGE-L(R)    ROUGE-1(F)  ROUGE-2(F)  ROUGE-L(F)
EncDec (baseline, our impl.)             29.23       8.71        25.27         33.99       16.06       31.63
                                         29.52       9.45        25.80         34.27       16.68       32.14
                                         29.60       9.62        25.97         34.18       16.51       31.97
EncDec+WFE (proposed)                    31.92       9.36        27.22         36.21       16.87       33.55
                                         32.28       10.54       27.80         36.30       17.31       33.88
                                         31.70       10.34       27.48         36.08       17.23       33.73
(perf. gain from EncDec to EncDec+WFE)   +2.68       +0.92       +1.83         +2.03       +0.63       +1.78
Table 2: Results on DUC-2004 and Gigaword data: ROUGE-$x$(R): recall-based ROUGE-$x$, ROUGE-$x$(F): F1-based ROUGE-$x$, where $x \in \{1, 2, L\}$, respectively.
                                   DUC-2004 (w/ 75-byte limit)           Gigaword (w/o length limit)
Method                             ROUGE-1(R)  ROUGE-2(R)  ROUGE-L(R)    ROUGE-1(F)  ROUGE-2(F)  ROUGE-L(F)
ABS Rush et al. (2015)             26.55       7.06        22.05         30.88       12.22       27.77
RAS Chopra et al. (2016)           28.97       8.26        24.06         33.78       15.97       31.15
BWL Nallapati et al. (2016a)‡      28.35       9.46        24.59         32.67       15.59       30.64
   (words-lvt5k-1sent)†            28.61       9.42        25.24         35.30       16.64       32.62
MRT Ayana et al. (2016)            30.41*      10.87*      26.79*        36.54*      16.59*      33.44*
EncDec+WFE [This Paper]            32.28       10.54       27.80         36.30       17.31       33.88
(perf. gain from *)                +1.87       -0.33       +1.01         -0.24       +0.72       +0.44
Table 3: Results of current top systems: ‘*’: previous best score for each evaluation. †: using a larger vocab for both encoder and decoder, not strictly fair configuration with other results. ‡: The same paper was published in CoNLL Nallapati et al. (2016b). However, the results are updated in arXiv version.
True \ Estimation       0       1       2     3    ≥4
1                   7,014   7,064   1,784    16     4
2                      51      95      60     0     0
≥3                      2       4       1     0     0
Table 4: Confusion matrix of WFE on Gigaword data: only evaluated true frequency $\ge 1$.
G: china success at youth world championship shows preparation for #### olympics
A: china germany germany germany germany and germany at world youth championship
B: china faces germany at world youth championship
G: British and Spanish governments leave extradition of Pinochet to courts
A: spain britain seek shelter from pinochet ’s pinochet case over pinochet ’s
B: spain britain seek shelter over pinochet ’s possible extradition from spain
G: torn UNK : plum island juniper duo now just a lone tree
A: black women black women black in black code
B: in plum island of the ancient
Figure 3: Examples of generated summary. G: reference summary, A: baseline EncDec, and B: EncDec+WFE. (Underlines indicate repeating phrases and words.)

4 Experiments

We investigated the effectiveness of our method on the ABS experiments first performed by Rush et al. (2015). The data consist of approximately 3.8 million training, 400,000 validation, and 400,000 test instances (the data can be created by the data construction scripts in the author’s code). Generally, 1951 test instances, randomly extracted from the test data section, are used for evaluation (as previously described in Chopra et al. (2016), we removed the ill-formed (empty) data for Gigaword). Additionally, the DUC-2004 evaluation data Over et al. (2007) were evaluated with the identical models trained on the above Gigaword data. We strictly followed the evaluation settings used in previous studies for a fair comparison. Table 1 summarizes the model configuration and the parameter estimation setting in our experiments.

4.1 Main results: comparison with baseline

Table 2 shows the results of the baseline EncDec and our proposed EncDec+WFE. Note that the DUC-2004 data was evaluated by recall-based ROUGE scores, while the Gigaword data was evaluated by F-score-based ROUGE. To confirm the validity of our EncDec baseline, we also ran the OpenNMT tool. Its results on the Gigaword data were 33.65, 16.12, and 31.37 for ROUGE-1(F), ROUGE-2(F), and ROUGE-L(F), respectively, which are similar to (but slightly lower than) those of our implementation. This supports the claim that our baseline is a strong one. EncDec+WFE clearly outperformed the strong EncDec baseline on every ROUGE score, with gains between +0.63 and +2.68 points (Table 2). Thus, we conclude that the WFE sub-model has a positive impact on ABS performance, since the gains derive solely from incorporating it.

4.2 Comparison to current top systems

Table 3 lists the current top system results. Our method EncDec+WFE achieved the current best scores on four of the six evaluations. This result also supports the effectiveness of incorporating our WFE sub-model.

MRT Ayana et al. (2016) previously provided the best results. Note that its model structure is nearly identical to our baseline. The difference lies in training: MRT trained a model with a sequence-wise minimum risk estimation, while we trained all the models in our experiments with standard (point-wise) log-likelihood maximization. MRT is therefore essentially complementary to our method. We expect to further improve the performance of our method by applying MRT for its training, since recent progress in NMT has suggested leveraging a sequence-wise optimization technique for improving performance Wiseman and Rush (2016); Shen et al. (2016). We leave this as future work.

4.3 Generation examples

Figure 3 shows actual generation examples. In line with our motivation, we specifically selected cases in which the baseline EncDec produced redundant repeating output. EncDec+WFE clearly reduced these repetitions. This observation offers further, qualitative evidence of the effectiveness of our method.

4.4 Performance of the WFE sub-model

To evaluate the WFE sub-model alone, Table 4 shows the confusion matrix of the frequency estimation. We quantized $\hat{\mathbf{a}}$ by $\lfloor \hat{\mathbf{a}}[m] + 0.5 \rfloor$ for all $m$, where 0.5 was derived from the margin in $\Psi$. Unfortunately, the results are not very good, and there is considerable room to improve the estimation. However, we emphasize that the sub-model is already powerful enough to improve the overall quality, as shown in Table 2 and Figure 3. We can expect further gains in overall performance from improving the WFE sub-model.
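The following sketch shows how such a confusion matrix could be tallied. It assumes that quantization rounds each estimate to the nearest integer (half rounds up) and that only words with a true frequency of at least 1 are counted; the helper names (`quantize`, `confusion`) and the toy numbers are hypothetical and unrelated to the values in Table 4.

```python
import math
from collections import Counter


def quantize(x):
    """Round a non-negative estimate to the nearest integer (assumed rule)."""
    return int(math.floor(x + 0.5))


def confusion(true_counts, estimates):
    """Count (true frequency, quantized estimate) pairs for true frequency >= 1."""
    table = Counter()
    for t, e in zip(true_counts, estimates):
        if t >= 1:
            table[(t, quantize(e))] += 1
    return table


# Toy data: four words with made-up true counts and estimates.
table = confusion([1, 1, 2, 0], [0.4, 1.2, 1.6, 3.0])
print(sorted(table.items()))
```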

5 Conclusion

This paper discussed the behavior of redundant repeating generation often observed in neural EncDec approaches. We proposed a method for reducing such redundancy by incorporating a sub-model that directly estimates and manages the frequency of each target vocabulary word in the output. Experiments on ABS benchmark data showed the effectiveness of our method, EncDec+WFE, for both improving automatic evaluation performance and reducing the actual redundancy. Our method is also suitable for other lossy-compression tasks, such as image caption generation.