[Figure 1: Example outputs of our model on a news article. The model extracts important words from the source text as a prototype and generates an abstractive summary based on the prototype and source texts. The length of the generated summary is controlled in accordance with the length of the prototype text.]

[Source text] Various types of renewable energy such as solar and wind are often touted as being the solution to the world's growing energy crisis. But one researcher has come up with a novel idea that could trump them all: a biological solar panel that works around the clock. By harnessing the electrons generated by plants such as moss, he said he can create useful energy that could be used at home or elsewhere. A University of Cambridge scientist has revealed his green source of energy. By using just moss he is able to generate enough power to run a clock (shown). He said panels of plant material could power appliances in our homes. And the technology could help farmers grow crops where electricity is scarce. (…)

[Reference summary] University of Cambridge scientist has revealed his green source of energy. By using just moss he is able to generate enough power to run a clock. He said panels of plant material could power appliances in our homes. And the tech could help farmers grow crops where electricity is scarce.

[Extracted prototype (short)] he said panels of plant material could power in our
[Abstractive summary (short)] panels of plant material could power appliances .
[Extracted prototype (long)] university of cambridge scientist has revealed his he said panels of plant material could power appliances in our homes and the technology could help farmers grow crops where is scarce
[Abstractive summary (long)] university of cambridge scientist has revealed his green source of energy . he said panels of plant material could power appliances in our homes .
Neural summarization has made great progress in recent years. It has two main approaches: extractive and abstractive. Extractive methods generate summaries by selecting important sentences [25, 27]. They produce grammatically correct summaries; however, they give little flexibility to the summarization because they can only extract sentences from the source text. By contrast, abstractive summarization enables more flexible summarization, and it is expected to generate more fluent and readable summaries than extractive models. The most commonly used abstractive summarization model is the pointer-generator, which generates a summary word by word while copying words from the source text and generating words from a pre-defined vocabulary set. This model can generate an accurate summary by combining word-level extraction and generation.
Although the idea of controlling the length of the summary was mostly neglected in the past, it was recently pointed out to be an important aspect of abstractive summarization [15, 4]. In practical applications, the summary length should be controllable so that the summary fits the device that displays it. However, there have been only a few studies on controlling the summary length. Kikuchi et al. (2016) proposed a length-controllable model that uses length embeddings: the summary length is encoded either as an embedding that represents the remaining length at each decoding step or as an initial embedding to the decoder that represents the desired length. Liu et al. (2018) proposed a model that uses the desired length as an input to the initial state of the decoder. These previous models control the length in the decoding module by using length embeddings. However, length embeddings only add length information on the decoder side. Consequently, these models may miss important information because it is difficult for them to take into account which content should be included in the summary under a given length constraint.
We propose a new length-controllable abstractive summarization method that is guided by a prototype text. Our idea is to use a word-level extractive module instead of length embeddings to control the summary length. Figure 2 compares the previous length-controllable models and the proposed one; the yellow blocks are the modules responsible for length control. Since the word-level extractor decides which content is to be included in the summary when a length constraint is given, it is possible to generate a summary that retains the important content. Our model consists of two steps. First, the word-level extractor predicts the word-level importance of the source text and extracts important words according to the importance scores and the desired length. The extracted word sequence is used as a "prototype" of the summary; we call it the prototype text. Second, we use the prototype text as an additional input to the encoder-decoder model. The length of the summary is kept close to that of the prototype text because the summary is generated by referring to the prototype text. Figure 1 shows examples of output generated by our model: the abstractive summaries are similar to the extracted prototypes. The extractive module produces a rough overview of the summary, and the encoder-decoder module produces a fluent summary based on the extracted prototype.
Our idea is inspired by extractive-and-abstractive summarization, which incorporates an extractive model into an abstractive encoder-decoder model. In a simple encoder-decoder model, a single model must both identify the important content and generate a fluent summary; the extractive-and-abstractive model instead has an encoder-decoder part that generates fluent summaries and a separate part that extracts important content. Several studies have shown that separating the problem of finding the important content from the problem of generating fluent summaries improves the accuracy of the summary [5, 1]. Our model can be regarded as an extension of models that work in this way; however, it is the first to extend the extractive module so that it can control the summary length.
Ours is the first method that controls the summary length using an extractive module and that achieves both high accuracy and length controllability in abstractive summarization. Our contributions are summarized as follows:
We propose a new length-controllable prototype-guided abstractive summarization model, called LPAS (Length-controllable Prototype-guided Abstractive Summarization). Our model effectively guides the abstractive summarization using a summary prototype. Our model controls the summary length by controlling the number of words in the prototype text.
Our model achieved state-of-the-art ROUGE scores in length-controlled abstractive summarization settings on the CNN/DM and NEWSROOM datasets.
2 Task Definition
Our study defines length-controllable abstractive summarization as two pipelined tasks: prototype extraction and prototype-guided abstractive summarization. The problem formulations of each task are described below.
Task 1 (Prototype Extraction)
Given a source text with words X = (x_1, …, x_N) and a desired summary length K, the model estimates importance scores S = (s_1, …, s_N) and extracts the top-K important words X^p = (x^p_1, …, x^p_K) as a prototype text on the basis of S. The desired summary length K can be set to an arbitrary value. Note that the original word order is preserved in X^p (X^p is not a bag of words).
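As a concrete sketch of this extraction step, the following minimal Python (the word list and scores are made-up toy values; the real model scores words with BERT as described in §3.2) keeps the K highest-scoring words and then restores source order, so the prototype is an ordered subsequence of the source text:

```python
def extract_prototype(words, scores, k):
    """Return the k highest-scoring words, in their original source order."""
    # Rank word positions by importance score and keep the top-k.
    top = sorted(range(len(words)), key=lambda i: scores[i], reverse=True)[:k]
    # Re-sort the surviving positions so the original word order is
    # preserved: the prototype is a subsequence, not a bag of words.
    return [words[i] for i in sorted(top)]

# Toy example (scores are invented, not BERT outputs):
words = ["panels", "of", "plant", "material", "could", "power", "appliances"]
scores = [0.9, 0.1, 0.7, 0.6, 0.3, 0.8, 0.85]
print(extract_prototype(words, scores, 4))  # → ['panels', 'plant', 'power', 'appliances']
```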
Task 2 (Prototype-guided Abstractive Summarization)
Given the source text X and the extracted prototype text X^p, the model generates a length-controlled abstractive summary Y = (y_1, …, y_T). The length of summary Y is controlled in accordance with the prototype length K.
3 Proposed Model
Our model consists of three modules: the prototype extractor, joint encoder, and summary decoder (Figure 3). The latter two modules carry out Task 2, the prototype-guided abstractive summarization. The prototype extractor uses BERT, while the joint encoder and summary decoder use the Transformer architecture.
Prototype extractor (§3.2)
The prototype extractor extracts the top-K important words from the source text.
Joint encoder (§3.3)
The joint encoder encodes both the source text and the prototype text.
Summary decoder (§3.4)
The summary decoder is based on the pointer-generator model and generates an abstractive summary by using the output of the joint encoder.
3.2 Prototype Extractor
Since our model extracts the prototype at the word level, the prototype extractor estimates an importance score s^word_n for each word x_n. BERT has achieved state-of-the-art results on many classification tasks, so it is a natural choice for the prototype extractor. Our model uses BERT with a task-specific feed-forward network on top. We tokenize the source text using the BERT tokenizer (https://github.com/google-research/bert/) and fine-tune the BERT model. The importance score is defined as

s^word_n = σ(w_1^T h_n + b_1),   (1)

where h_n is the last hidden state of the pre-trained BERT for the n-th word, w_1 ∈ R^d and b_1 are learnable parameters, σ is the sigmoid function, and d is the dimension of the last hidden state of the pre-trained BERT.
To extract a more fluent prototype than one based only on the word-level importance, we define a weighted importance score that incorporates a sentence-level importance score as a weight for the word-level importance score:

s^sent_j = (1 / M_j) Σ_{x_n ∈ X_j} s^word_n,   s_n = s^word_n · s^sent_j,   (2)

where M_j is the number of words in the j-th sentence X_j. Our model extracts the top-K important words as a prototype from the source text on the basis of s_n. It controls the length of the summary in accordance with the number of words in the prototype text, K.
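The weighted scoring can be sketched as follows, assuming (as the definition above suggests) that a sentence's score is the mean of the word-level scores it contains; the scores here are invented toy numbers, not BERT outputs:

```python
from collections import defaultdict

def weighted_scores(word_scores, sentence_ids):
    """Weight each word-level score by its sentence-level score, where the
    sentence score is taken to be the mean of the word scores it contains."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for score, sid in zip(word_scores, sentence_ids):
        totals[sid] += score
        counts[sid] += 1
    sent_score = {sid: totals[sid] / counts[sid] for sid in totals}
    # Weighted importance: word-level score times its sentence's score.
    return [score * sent_score[sid] for score, sid in zip(word_scores, sentence_ids)]

# Toy input: three words, the first two in sentence 0, the third in sentence 1.
scores = weighted_scores([0.8, 0.4, 0.5], [0, 0, 1])
```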
3.3 Joint Encoder
Embedding layer
This layer projects each one-hot word vector (of size V) into a d_word-dimensional vector space with a pre-trained weight matrix such as GloVe. The word embeddings are then mapped to d_model-dimensional vectors by a fully connected layer, and the mapped embeddings are passed through a ReLU function. This layer also adds positional encoding to the word embedding.
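The exact positional encoding is not specified in this excerpt; a common choice, and the one used in the original Transformer, is the sinusoidal encoding, sketched here for a single position:

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal positional encoding for one position: even dimensions use
    sine, odd dimensions use cosine, with geometrically spaced wavelengths."""
    return [math.sin(pos / 10000 ** (i / d_model)) if i % 2 == 0
            else math.cos(pos / 10000 ** ((i - 1) / d_model))
            for i in range(d_model)]

print(positional_encoding(0, 4))  # → [0.0, 1.0, 0.0, 1.0]
```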
Transformer encoder blocks
The encoder encodes the embedded source and prototype texts with a stack of Transformer blocks. Our model encodes the two texts with the encoder stack independently; we denote the outputs for the source and prototype texts as E^s and E^p, respectively.
Transformer dual encoder blocks
This block calculates the interactive alignment between the encoded source and prototype texts. Specifically, it further encodes each text while performing multi-head attention over the encoder-stack output of the other text. We denote the outputs of the dual encoder stack for the source and prototype texts by M^s and M^p, respectively.
3.4 Summary Decoder
The decoder receives the sequence of words in the abstractive summary Y, which is generated through an auto-regressive process. At each decoding step t, an embedding layer projects the one-hot vectors of the generated words in the same way as the embedding layer in the joint encoder.
Transformer decoder blocks
The decoder uses a stack of Transformer decoder blocks that perform multi-head attention over the encoded representations of the prototype text. On top of the first stack, it uses another stack of decoder blocks that perform multi-head attention over those of the source text. The first stack rewrites the prototype text, and the second complements the rewritten prototype with information from the original source. The subsequent (causal) mask is used in both stacks, since this component runs step by step at test time. The output of the second stack serves as the decoder state.
Copy distributions
Our pointer-generator model copies words from both the source and prototype texts on the basis of two copy distributions, for efficient reuse. The copy distribution over the prototype words is taken from the first attention head of the last block in the first decoder stack, and the copy distribution over the source words from the corresponding head in the second decoder stack.
Final vocabulary distribution
The final vocabulary distribution is a mixture of the generation distribution over the pre-defined vocabulary and the two copy distributions over the source and prototype words, with mixture weights computed from the decoder state using learnable parameters.
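Since the exact equation is garbled in this copy, here is a minimal sketch of such a three-way mixture with toy dictionaries (in the model, the mixture weights are predicted from the decoder state; here they are simply given):

```python
def final_distribution(p_vocab, p_copy_src, p_copy_proto, weights):
    """Mix the generation distribution with the two copy distributions.

    `weights` holds three nonnegative mixture weights summing to 1.
    """
    w_gen, w_src, w_proto = weights
    vocab = set(p_vocab) | set(p_copy_src) | set(p_copy_proto)
    return {w: w_gen * p_vocab.get(w, 0.0)
               + w_src * p_copy_src.get(w, 0.0)
               + w_proto * p_copy_proto.get(w, 0.0)
            for w in vocab}

# Toy distributions: "c" is out-of-vocabulary but copyable from the prototype.
p = final_distribution({"a": 0.7, "b": 0.3}, {"a": 1.0}, {"c": 1.0}, (0.5, 0.3, 0.2))
```

Because each input is a probability distribution and the weights sum to 1, the output is again a valid distribution.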
4 Training
Our model is not trained in an end-to-end manner: the prototype extractor is trained first, and then the encoder and decoder are trained.
4.1 Generating Training Data
Prototype extractor
Since there are no supervised data for the prototype extractor, we created pseudo training data, as in previous work. The training data consist of word-label pairs (x_n, r_n) for all n, where r_n is 1 if x_n is included in the summary and 0 otherwise. To construct the paired data automatically, we first extract oracle source sentences that maximize the ROUGE-R score, as in previous work. Then, we calculate a word-by-word alignment between the reference summary and the oracle sentences using a dynamic programming algorithm, in order to take the word order into account. Finally, we label all aligned words with 1 and all other words, including words outside the oracle sentences, with 0.
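The alignment algorithm is only named above, not given; one standard order-aware choice is a longest-common-subsequence (LCS) alignment via dynamic programming, sketched here with hypothetical token lists:

```python
def label_words(oracle, summary):
    """Label each oracle word 1 if it is aligned to the reference summary by a
    longest-common-subsequence alignment (dynamic programming), else 0.
    LCS respects word order, as required here."""
    n, m = len(oracle), len(summary)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if oracle[i] == summary[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    # Backtrack through the DP table to recover the aligned oracle positions.
    labels = [0] * n
    i, j = n, m
    while i > 0 and j > 0:
        if oracle[i - 1] == summary[j - 1]:
            labels[i - 1] = 1
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return labels

print(label_words(["a", "b", "c", "d"], ["b", "d"]))  # → [0, 1, 0, 1]
```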
Joint encoder and summary decoder
We have to create training triples consisting of the source text, the gold prototype text, and the target summary for training our encoder and decoder. We use the top-K words (in terms of s_n; Eq. 2) in the oracle sentences as the gold prototype text, which yields a prototype closer to the reference summary and improves the quality of the encoder-decoder training. K is decided from the reference summary length. To obtain a natural summary close to the desired length, we quantize the length into discrete bins, where each bin represents a size range; we set the size range to 5 in this study. That is, K is set to the multiple of 5 nearest to the reference summary length.
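The quantization step amounts to snapping the reference length to the nearest multiple of the bin size, with a floor of one bin (the floor is an assumption for degenerate, very short references):

```python
def quantize_length(ref_len, bin_size=5):
    """Snap the reference summary length to the nearest multiple of bin_size,
    with a floor of one bin."""
    return max(bin_size, bin_size * round(ref_len / bin_size))

print(quantize_length(57))  # → 55
print(quantize_length(58))  # → 60
```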
4.2 Loss Function
Prototype extractor
Since the extractor estimates the importance score of each word (Eq. 1), which is a binary classification task, we use the binary cross-entropy loss, averaged over all training examples.
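For reference, the averaged binary cross-entropy can be written as follows (a plain-Python sketch over predicted scores and 0/1 pseudo labels, not the actual training code):

```python
import math

def bce_loss(scores, labels):
    """Binary cross-entropy averaged over all words; `scores` are predicted
    importance scores in (0, 1), `labels` are the 0/1 pseudo labels."""
    eps = 1e-12  # guard against log(0)
    return -sum(r * math.log(s + eps) + (1 - r) * math.log(1 - s + eps)
                for s, r in zip(scores, labels)) / len(scores)
```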
Joint encoder and summary decoder
The main loss for the encoder-decoder is the cross-entropy loss on the reference summary words. Moreover, we add attention guide losses, which are designed to guide the estimated attention distributions toward the reference attention: the guided distributions are the first attention head of the last block in the joint encoder stack for the prototype and the corresponding head of the summary decoder, and the reference position for the t-th summary word is its corresponding absolute position in the source text. The overall loss of the generation model is a linear combination of these three losses; the two combination weights were set to 0.5 in the experiments.
5 Inference
During inference, we use beam search and re-ranking. We keep all B summary candidates provided by the beam search, where B is the size of the beam, and generate the B-best summaries. The summaries are then re-ranked by the number of repeated N-grams, the fewer the better. The beam search and this re-ranking improve the ROUGE score of the output, as they eliminate candidates that contain repetitions. For the length-controlled setting, we set K to the desired length. For the standard setting, we set it to the average length of the reference summaries in the validation data.
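The repetition-based re-ranking can be sketched as follows: count n-grams that occur more than once in each candidate and prefer candidates with fewer repeats (n = 3 is an assumption; the excerpt does not state the n-gram order):

```python
from collections import Counter

def repeated_ngrams(tokens, n=3):
    """Count n-gram occurrences beyond the first (i.e., repetitions)."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return sum(c - 1 for c in grams.values() if c > 1)

def rerank(candidates, n=3):
    """Order beam candidates by repeated n-gram count, fewest first.
    Python's sort is stable, so ties keep the original beam order."""
    return sorted(candidates, key=lambda cand: repeated_ngrams(cand, n))

clean = "the cat sat on the mat".split()
loopy = "the cat sat the cat sat on".split()
print(rerank([loopy, clean])[0])  # → ['the', 'cat', 'sat', 'on', 'the', 'mat']
```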
6 Experiments
6.1 Datasets and settings
We used the CNN/DM dataset, a standard corpus for news summarization. The summaries are bullet points for the articles shown on their respective websites. Following See et al. (2017), we used the non-anonymized version of the corpus and truncated the source documents to 400 tokens and the target summaries to 120 tokens. The dataset includes 286,817 training pairs, 13,368 validation pairs, and 11,487 test pairs. We also used the NEWSROOM dataset, which contains various news sources (38 different news sites). We used 973,042 pairs of data for training. We sampled 30,000 pairs as validation data, and the number of test pairs was 106,349. To evaluate the length-controlled setting for the NEWSROOM dataset, we randomly sampled 10,000 examples from the test set.
We used the same configurations for the two datasets. The extractor used the pre-trained BERT model, fine-tuned for two epochs with the default settings. Our encoder and decoder used pre-trained 300-dimensional GloVe embeddings. The encoder and decoder Transformers have four blocks each; the number of attention heads was 8, the inner dimension of the FFN was 2048, and the model dimension was set to 512. We used the Adam optimizer with a scheduled learning rate. We set the size of the input vocabulary to 100,000 and that of the output vocabulary to 1,000.
6.2 Evaluation Metrics
We evaluated the models with ROUGE scores (ROUGE-1, ROUGE-2, and ROUGE-L).
6.3 Results
Does our model improve the ROUGE score in the length-controlled setting?
We used two types of length-controllable models as baselines. The first one is a CNN-based length-controllable model (LC) that uses the desired length as an input to the initial state of the CNN-based decoder. The second one (LenEmb) embeds the remaining length and adds it to each decoder step. Since there are no previous results for LenEmb on the CNN/DM dataset, we implemented it as a Transformer-based encoder-decoder model: we simply added the embedding of the remaining length to the word embedding at each decoding step.
Table 1 shows that our model achieved high ROUGE scores for different lengths and outperformed the previous length-controllable models in most cases. Our model was about 2 points more accurate on average than LenEmb. Our model selects the most important words from the source text in accordance with the desired length; it is thus effective at keeping the important information even in the length-controlled setting. Figure 4a shows the precision, recall, and F-score of ROUGE for different lengths. Our model maintained a high F-score around the average length (around 60 words); this indicates that it can select important information and generate stable results for different lengths.
Does our model generate a summary with the desired length?
Figure 4b shows the relationship between the desired length and the output length. The x-axis indicates the desired length, and the y-axis indicates the average length and standard deviation of the length-controlled output summaries. The results show that our model properly controls the summary length. This controllability comes from the training procedure: when training our encoder-decoder, we set the number of words in the prototype text according to the length of the reference summary, so the model learns to generate a summary whose length is similar to that of the prototype text.
How good is the quality of the prototype text?
To evaluate the quality of the prototype, we computed the ROUGE scores of the extracted prototype texts. Table 2 shows the results. In the table, LPAS-ext (top-3 sents) means that the top three sentences were extracted using the sentence-level importance scores. Interestingly, the ROUGE-1 and ROUGE-2 scores of LPAS-ext (top-K words) were higher than those of the sentence-level extractive models. This indicates that word-level LPAS-ext is effective at finding not only important words (ROUGE-1) but also important phrases (ROUGE-2). Also, we can see from Table 5 that the whole LPAS model improved on the ROUGE-L score of LPAS-ext. This indicates that our joint encoder and summary decoder generate more fluent summaries with the help of the prototype text.
Table 2: ROUGE scores of the extracted prototype texts.
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
| Bottom-Up (top-3 sents) | 40.7 | 18.0 | 37.0 |
| LPAS-ext (top-3 sents) | 41.48 | 19.23 | 37.76 |
| LPAS-ext (top-K words) | 44.79 | 20.59 | 38.12 |
Does our abstractive model improve if the quality of the prototype is improved?
Table 3 (excerpt): ROUGE scores with gold information.
| Setting | ROUGE-1 | ROUGE-2 | ROUGE-L |
| Gold sentences + gold length | 46.68 | 23.52 | 43.41 |
We evaluated our model in the following two settings in order to analyze the relationship between the quality of the abstractive summary and that of the prototype. In the gold-length setting, we gave only the gold length to the prototype extractor. In the gold-sentences + gold-length setting, we gave both the gold sentences and the gold length (see §4.1). Table 3 shows the results. They indicate that selecting the correct number of words for the prototype improves the ROUGE scores. In this study, we simply used the average length when extracting the prototype for all examples in the standard setting; there should thus be room for improvement by adaptively selecting the number of prototype words for each source text. Moreover, the ROUGE scores improved greatly in the gold-sentences + gold-length setting. This indicates that the quality of the generated summary would improve significantly if the accuracy of the extractive model were increased.
Is our model effective on other datasets?
To verify the effectiveness of our model on various other summary styles, we evaluated it on a large and varied news summary dataset, NEWSROOM. Table 4 and Figure 5 show the results in the length-controlled setting for NEWSROOM. Our model achieved higher ROUGE scores than LenEmb. From Figure 5a, we can see that the F-score of ROUGE was highest around 30 words; this is because the average summary length is about 30 words. Moreover, Figure 5b shows that our model also acquired length-control capability on a dataset with various styles.
Table 5 (excerpt, CNN/DM):
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
| w/o pre-trained encoder-decoder model | | | |
| Pointer-Generator + Coverage | 39.53 | 17.28 | 36.38 |
| Key information guide network | 38.95 | 17.12 | 35.68 |
| w/ pre-trained encoder-decoder model | | | |

Table 6 (excerpt, NEWSROOM):
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
| LPAS (K = average length) | 39.24 | 27.20 | 35.84 |
| LPAS (K = domain length) | 39.79 | 27.85 | 36.48 |
| LPAS (w/o Prototype) | 38.48 | 26.99 | 35.30 |
How well does our model perform in the standard setting?
Table 5 shows that our model achieved ROUGE scores comparable to those of previous models that do not consider a length constraint on the CNN/DM dataset. We note that the current state-of-the-art models use pre-trained encoder-decoder models, while the encoder and decoder of our model (except for the prototype extractor) were not pre-trained.
We also examined the results of generating a summary from only the prototype (LPAS w/o Source) or only the source (LPAS w/o Prototype). Using only the prototype turned out to have the same accuracy as using only the source, but using the source and the prototype simultaneously yielded higher accuracy. These results indicate that our prototype extraction and joint encoder effectively incorporate the source text and prototype information and contribute to improving accuracy.
The results for the NEWSROOM dataset under standard settings are shown in Table 6. To consider differences in summary length between news domains, we evaluated our model in the average length and domain-level average length (denoted as domain length) settings. The results indicate that our model had significantly higher ROUGE scores compared with the official baselines and outperformed our baseline (LPAS w/o Prototype). They also indicate that our model is effective on datasets containing text in various styles. Moreover, we found that considering the domain length has positive effects on the ROUGE scores. This indicates that our model can easily reflect differences in summary length among various styles.
7 Related Work and Discussion
Length control for summarization
Kikuchi et al. (2016) were the first to propose using length embeddings for length-controlled abstractive summarization. Fan et al. (2018) also used length embeddings, at the beginning of the decoder module, for length control. Liu et al. (2018) proposed a CNN-based length-controllable summarization model that uses the desired length as an input to the initial state of the decoder. Takase and Okazaki (2019) introduced a positional encoding that represents the remaining length at each decoder step of a Transformer-based encoder-decoder model; it is almost equivalent to the LenEmb model we implemented. These previous models use length embeddings to control the length in the decoding module, whereas we use the prototype extractor both to control the summary length and to include important information in the summary.
Neural extractive-and-abstractive summarization
Hsu et al. (2018), Gehrmann et al. (2018), and ETADS incorporated sentence- and word-level extractive models into the pointer-generator model. Their models weight the copy probability for the source text by using an extractive model and guide the pointer-generator to copy important words. Li et al. (2018) proposed a keyword-guided abstractive summarization model. Chen and Bansal (2018) proposed a sentence extraction and rewriting model that is trained end-to-end with reinforcement learning. Cao et al. (2018) proposed a search-and-rewrite model. Mendes et al. (2019) proposed a combination of sentence-level extraction and compression. The idea behind these models is word-level weighting for the entire source text or sentence-level rewriting. By contrast, our model guides the summarization with a length-controllable prototype text by using the prototype extractor and joint encoder. Utilizing extractive results to control the length of the summary is a new idea.
Large-scale pre-trained language model
BERT is a pre-trained language model that uses bidirectional encoder representations from Transformers. It has performed well on many natural language understanding tasks, such as the GLUE benchmark and natural language inference. BERTSUM used BERT for sentence-level extractive summarization. HIBERT trained a new pre-trained model that considers document-level information for sentence-level extractive summarization. We used BERT for the word-level prototype extractor and verified its effectiveness in the word-level extractive module. Several pre-trained encoder-decoder models have been published very recently [22, 11, 18]. PoDA pre-trains a Transformer-based pointer-generator model. BART pre-trains a standard Transformer-based encoder-decoder model on large unlabeled data and achieved state-of-the-art results. UniLM extends the BERT structure to handle sequence-to-sequence tasks.
Reinforcement learning for summarization
Reinforcement learning (RL) is a key summarization technique: it can be used to optimize non-differentiable metrics or to connect multiple non-differentiable networks. Narayan et al. (2018) and Dong et al. (2018) used RL for extractive summarization. For abstractive summarization, Paulus et al. (2018) used RL to mitigate exposure bias, and Chen and Bansal (2018) used RL to combine sentence-extraction and pointer-generator models. Our model achieves high ROUGE scores without RL; in the future, we may incorporate RL into our model to obtain further improvements.
8 Conclusion
We proposed a new length-controllable abstractive summarization model. Our model consists of a word-level prototype extractor and a prototype-guided abstractive summarization model. The prototype extractor identifies the important parts of the source text within the length constraint, and the abstractive model is guided by the prototype text. This characteristic enables it to achieve high ROUGE scores in standard summarization tasks, while the prototype extractor ensures that the summary has the desired length. Experiments on the CNN/DM and NEWSROOM datasets showed that our model outperformed previous models in both standard and length-controlled settings. In the future, we would like to incorporate a pre-trained language model into the abstractive model to build a higher-quality summarization model.
References
- (2018) Fast abstractive summarization with reinforce-selected sentence rewriting. In ACL, pp. 675–686.
- (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR.
- (2019) Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32, pp. 13042–13054.
- (2018) Controllable abstractive summarization. In NMT@ACL, pp. 45–54.
- (2018) Bottom-up abstractive summarization. In EMNLP, pp. 4098–4109.
- (2018) Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies. In ACL, pp. 708–719.
- (2015) Teaching machines to read and comprehend. In NIPS, pp. 1693–1701.
- (2018) A unified model for extractive and abstractive summarization using inconsistency loss. In ACL, pp. 132–141.
- (2016) Controlling output length in neural encoder-decoders. In EMNLP, pp. 1328–1338.
- (2015) Adam: a method for stochastic optimization. In ICLR.
- (2019) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv e-prints.
- (2018) Guiding generation for abstractive text summarization based on key information guide network. In ACL, pp. 55–60.
- (2004) ROUGE: a package for automatic evaluation of summaries. In ACL.
- (2019) Fine-tune BERT for extractive summarization. CoRR abs/1903.10318.
- (2018) Controlling length in abstractive summarization using a convolutional neural network. In EMNLP, pp. 4110–4119.
- (2019) Jointly extracting and compressing documents with summary state representations. In NAACL, pp. 3955–3966.
- (2014) GloVe: global vectors for word representation. In EMNLP.
- (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints.
- (2017) Get to the point: summarization with pointer-generator networks. In ACL, pp. 1073–1083.
- (2017) Attention is all you need. In NIPS, pp. 5998–6008.
- (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In BlackboxNLP@EMNLP, pp. 353–355.
- (2019) Denoising based sequence-to-sequence pre-training for text generation. In EMNLP.
- (2018) A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT, pp. 1112–1122.
- (2019) Improving abstractive document summarization with salient information modeling. In ACL, pp. 2132–2141.
- (2018) Neural latent extractive document summarization. In EMNLP, pp. 779–784.
- (2019) HIBERT: document level pre-training of hierarchical bidirectional transformers for document summarization. In ACL, pp. 5059–5069.
- (2018) Neural document summarization by jointly learning to score and select sentences. In ACL, pp. 654–663.