1 Background and Motivation
While extractive summarization selects and copies the most relevant source phrases and sentences to the summary, abstractive summarization (AS) aims to capture the source meaning and generate summaries that do not necessarily contain portions of the source texts Nenkova and McKeown (2011), holding the promise of producing summaries closer to human-written ones. State-of-the-art neural AS models Nallapati et al. (2016); See et al. (2017); Paulus et al. (2018); Tan et al. (2017); Makino et al. (2019); You et al. (2019) extend a standard sequence-to-sequence (Seq2Seq) architecture, using either recurrent (RNN) Bahdanau et al. (2015) or Transformer-based Vaswani et al. (2017) encoder and decoder components. See et al. (2017) extend the standard Seq2Seq model with a pointer-generator network (PG-Net), providing the model with extractive capabilities, i.e., allowing it to choose between generating a token and copying source-text tokens. Tan et al. (2017) propose a hierarchical model that introduces an additional graph-based attention mechanism, which serves to model interactions between encoded sentence representations. Paulus et al. (2018) incorporate a reward expectation based on reinforcement learning into a mixed training objective to steer the model towards predicting globally meaningful sequences.
With respect to long-document summarization, Celikyilmaz et al. (2018) distribute the encoding task to multiple collaborating encoder agents, whereas Cohan et al. (2018) propose a hierarchical encoder that captures the document's discourse structure and an attentive discourse-aware decoder that generates the summary. The latter requires a predefined discourse structure and is designed for domain-specific texts (e.g., scientific publications). Despite multiple encoders operating on different document segments, these models still limit the maximal document length at inference.
In this work, we address a prominent limitation of neural AS models: they cannot summarize texts longer than the maximal input length set during model training. At inference, documents longer than this limit are truncated, which renders the (potentially summary-relevant) truncated content inaccessible to the model. We propose novel AS models based on windowing of the source text: we sequentially shift the encoder's attention over different windows of the source text. The decoder is shared across windows, thereby preserving semantic information from a previous window when decoding the next. We investigate two windowing strategies: (1) the Static Windowing Model (SWM) precomputes, based on training corpus statistics, the number of tokens the decoder is to generate from each source window; (2) for the Dynamic Windowing Model (DWM), we first heuristically inject special window-shift tokens into the training reference summaries, based on semantic similarity between source-text and summary sentences, and then let the decoder learn to emit window-shift tokens during generation. Signaling the window shift by generating a special token conceptually allows the DWM to summarize arbitrarily long texts at inference. Evaluation on the WikiHow corpus Koupaee and Wang (2018), consisting of long texts with a more even distribution of summary-relevant content, shows that our windowing models are effective.
2 Windowing AS Models
Figure 1 contains a high-level depiction of the windowing AS model. We start from the attention-based Seq2Seq model with recurrent components Bahdanau et al. (2015) (we also experimented with a Transformer encoder/decoder Vaswani et al. (2017), but obtained weaker performance), which maps an input token sequence $x = (x_1, \dots, x_T)$ into an output sequence $y = (y_1, \dots, y_{T'})$. A bidirectional LSTM (Bi-LSTM) encoder produces a contextualized representation $h_i$ for each input token. The decoder's state is initialized with the concatenation of the end states of the encoder's two LSTMs. We apply an attention mechanism similar to Luong et al. (2015). However, instead of learning a local attention span around each source-text position, which would limit the model to a fixed-size input during training, we attend over a window of $w$ tokens and sequentially slide the window over the long text. This way the decoder learns to model transitions between content windows, allowing it to summarize arbitrarily long documents at inference.
A window size $w$ and a stride step $s$ divide the source text ($N$ tokens) into overlapping windows (we pad the last window(s), if shorter than $w$ tokens). We use the same decoder, retaining its state, across all input windows. Sharing the decoder across input windows allows the flow of semantic information between adjacent windows and holds promise of retaining summary coherence. At each decoding step $t$, we attend over the window's token representations, using the decoder's hidden state $s_t$ as the attention query, and obtain the conditioned window encoding for decoding step $t$: $c_t = \sum_i \alpha_{t,i} h_i$, with the attention weight $\alpha_{t,i}$ computed as the softmax-normalized value of the dot product between the encoded token $h_i$ and the decoder's state $s_t$. The decoder outputs the embedding $o_t$ via a feed-forward projection of the concatenation of the attended input representation $c_t$ and its own hidden state $s_t$: $o_t = W_o [c_t; s_t] + b_o$, with $W_o$ and $b_o$
as parameters. The output probability distribution $P_{vocab}$ (over the training vocabulary $V$) is then computed by applying the softmax function to the vector of dot products between $o_t$ and each of the (pretrained) word embeddings.
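As an illustration, the dot-product attention over a single window can be sketched as follows (a minimal NumPy sketch that mirrors the equations above, not the training implementation):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(H, s_t):
    """Attend over one window.
    H: (w, d) matrix of encoded window tokens h_i; s_t: (d,) decoder state.
    Returns the conditioned window encoding c_t = sum_i alpha_{t,i} * h_i."""
    scores = H @ s_t          # dot product between each h_i and s_t
    alpha = softmax(scores)   # softmax-normalized attention weights
    return alpha @ H          # weighted sum of the token encodings
```

The decoder output $o_t$ is then a linear projection of the concatenation of $c_t$ with the decoder state.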
We augment the base Seq2Seq model with the pointer-generator network (PG-Net), as in See et al. (2017), allowing the decoder to choose, at each step, between generating a token from the training vocabulary and copying a token from the source document. The generation probability is based on the context vector $c_t$, the decoder's hidden state $s_t$, and the decoder's input $x_t$: $p_{gen} = \sigma(w_c^\top c_t + w_s^\top s_t + w_x^\top x_t + b_{gen})$, with $w_c$, $w_s$, $w_x$, and $b_{gen}$ as parameters. The output probability for a word $u$ from the extended vocabulary (the union of $V$ and the source-text words) interpolates between the generation and copying distributions: $P(u) = p_{gen} \, P_{vocab}(u) + (1 - p_{gen}) \sum_{i:\, x_i = u} \alpha_{t,i}$.
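The interpolation of the two distributions can be sketched as follows (a toy NumPy sketch; the mapping `src_ids` from window positions to extended-vocabulary ids is a hypothetical helper):

```python
import numpy as np

def pg_distribution(p_gen, p_vocab, alpha, src_ids):
    """Final PG-Net distribution over the extended vocabulary.
    p_vocab: (V,) generation distribution; alpha: (w,) attention weights
    over the source window; src_ids: extended-vocab id of each window
    token (out-of-vocabulary source words get ids >= V)."""
    V = p_vocab.shape[0]
    ext = np.zeros(max(V, max(src_ids) + 1))
    ext[:V] = p_gen * p_vocab            # generation mass
    for a, i in zip(alpha, src_ids):     # copying mass from attention
        ext[i] += (1.0 - p_gen) * a
    return ext
```

Note that attention mass on the same source word accumulates, as in the sum over positions in the equation above.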
This specifies the PG-Net-augmented Seq2Seq AS model that operates on a single window ($w$ tokens). We next need to specify when to transition from one source-text window to the next.
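The division of the source text into overlapping windows (size $w$, stride $s$, with padding of the last window) can be sketched as:

```python
def split_into_windows(tokens, w, s, pad="<pad>"):
    """Divide a token sequence into overlapping windows of size w,
    moving the window start by the stride s; the last window is padded."""
    windows, start = [], 0
    while True:
        win = tokens[start:start + w]
        windows.append(win + [pad] * (w - len(win)))
        if start + w >= len(tokens):   # window reaches the end of the text
            break
        start += s
    return windows
```

For example, a 7-token text with $w = 4$ and $s = 2$ yields three windows, the last of which receives one pad token.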
2.1 Static Windowing Model
The Static Windowing Model precomputes the number of tokens the decoder needs to generate for each input window. Let $W_1, \dots, W_n$ be the equally sized source windows (determined by $w$ and $s$). We use a parametric importance function to assign a weight $u_j = f(j; a, b)$ to each window $W_j$, with $a$ and $b$ as parameters defining the shape of the summary distribution over windows (for example, for a decaying $f$, early windows receive larger weights than later windows). The unnormalized weights are converted into probabilities $u'_j$ using the softmax function. We next compute the expected summary length for a given document, based on the document length and training corpus statistics. Let $D$ be the set of documents and $S$ the set of their respective reference summaries in the training corpus. We compute the expected summary length for a new document $d$ as $\hat{\ell}(d) = \frac{|d|}{\ell_D} \cdot \ell_S$, where $\ell_D$ is the length that covers 90% of training documents (i.e., 90% of documents in $D$ are at most $\ell_D$ tokens long) and $\ell_S$ is the length that covers 90% of reference summaries in $S$. The number of tokens the decoder is to generate for a window $W_j$ is then simply the product of $\hat{\ell}(d)$ and the normalized weight $u'_j$.
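Putting the pieces together, the per-window token budget can be sketched as follows (the exponentially decaying importance function is an illustrative assumption standing in for the parametric $f(j; a, b)$):

```python
import numpy as np

def swm_token_budget(doc_len, n_windows, len_d90, len_s90, decay=0.5):
    """Number of summary tokens to decode per window (SWM sketch).
    len_d90 / len_s90: lengths covering 90% of training documents /
    reference summaries. Decaying weights favor early windows."""
    u = np.exp(-decay * np.arange(n_windows))   # unnormalized importances
    probs = np.exp(u) / np.exp(u).sum()         # softmax normalization
    exp_sum_len = doc_len / len_d90 * len_s90   # expected summary length
    return np.round(exp_sum_len * probs).astype(int)
```

The budgets sum (up to rounding) to the expected summary length, and earlier windows receive larger budgets under a decaying importance function.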
2.2 Dynamic Windowing Model
The SWM still relies on the document (and summary) lengths of the training corpus, and the number of summary tokens decoded for a window does not depend on its content. The Dynamic Windowing Model (DWM) aims to be more flexible, by allowing the decoder to dynamically signal, via a special token, the saturation of the current window and the shift to the next. Because (1) the decoder needs to learn to emit this window-shift token (which we denote <shift>) and (2) we still want an end-to-end trainable AS model, we need to inject window-shift tokens into the reference summaries of the training corpus. We achieve this heuristically, by computing semantic similarity scores between source-text sentences and reference summary sentences. For simplicity, we obtain a sentence embedding as the sum of its word embeddings and compute the cosine similarity between sentence embeddings. (We acknowledge that this is a rudimentary method for computing semantic similarity between sentences. We intend to experiment with more advanced sentence embedding models and more accurate sentence similarity measures (Kusner et al., 2015; Conneau et al., 2017; Devlin et al., 2019; Zhelezniak et al., 2019, inter alia) in subsequent work.)
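The similarity computation amounts to a few lines (a sketch of the summed-embedding representation described above; the embedding table passed in is hypothetical):

```python
import numpy as np

def sent_embedding(tokens, emb):
    """Sentence embedding as the sum of its word embeddings
    (words missing from the table are skipped)."""
    return np.sum([emb[t] for t in tokens if t in emb], axis=0)

def cosine(a, b):
    """Cosine similarity between two sentence embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```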
For every reference summary sentence, we identify the most similar source-document sentence and determine its window (depending on $w$ and $s$, a sentence may lie in more than one window; in such cases, we map it to the last containing window). This way we map each reference summary sentence to one source window. The order of windows assigned to summary sentences is, however, not necessarily sequential (e.g., (1, 3, 2, 4, 4) for some reference summary with five sentences). Since our model allows only sequential window shifts, we first make the window order sequential by replacing sequence-breaking windows with accumulated maximums (e.g., (1, 3, 2, 4, 4) becomes (1, 3, 3, 4, 4)). We then inject window-shift tokens (<shift>) between summary sentences with different assigned source windows (e.g., for the window assignment (1, 2, 2, 3, 3) we inject <shift> between the first and second summary sentence and between the third and fourth sentence). During inference, the input window is shifted whenever the decoder outputs the <shift> token.
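The monotonicity fix (running maximum) and the token injection can be sketched as follows (using <shift> as a stand-in for the window-shift token):

```python
import numpy as np

def inject_shift_tokens(sents, windows, shift="<shift>"):
    """Make window assignments sequential via the accumulated maximum,
    then insert one shift token per window advanced between sentences."""
    seq = np.maximum.accumulate(windows)
    out = [sents[0]]
    for prev, cur, sent in zip(seq, seq[1:], sents[1:]):
        out.extend([shift] * int(cur - prev))   # one token per window step
        out.append(sent)
    return out
```

When the (sequentialized) assignment jumps by more than one window between adjacent sentences, several shift tokens are emitted in a row, one per skipped window.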
3 Experiments
We evaluate our windowing models on two benchmark datasets: (1) the CNN/Dailymail news corpus, created by Nallapati et al. (2016) from the question answering dataset of Hermann et al. (2015), and (2) the WikiHow corpus Koupaee and Wang (2018). News articles place the most relevant information at the beginning (the so-called lead-and-body principle): standard models that truncate long documents are thus likely to perform well in the CNN/Dailymail evaluation. The WikiHow dataset does not have such a construction bias; summary-relevant information is more evenly distributed across the texts.
We train with the negative log-likelihood objective and select models by maximizing ROUGE-L performance on the development sets. We use a batch-level beam search decoder. Unlike in standard beam search, the beam size does not decrease when the end-of-summary token (<eos>) is predicted: longer, yet incomplete partial hypotheses can thus take over completed beams whenever they prevail in terms of length-normalized log-probability. The encoder's LSTMs and the decoder's LSTM use hidden states of equal size. We employ the Adam optimizer Kingma and Ba (2015) (β₁=0.9, β₂=0.999, ε=1e-8). For word representations, we use pretrained 300-dimensional fastText embeddings (50,000 most frequent words; available at https://tinyurl.com/y3y69h3z).
We compare different variants of SWM and DWM against the standard PG-Net Seq2Seq model (Stan) with the fixed-size input See et al. (2017), as well as against the commonly employed Lead-3 baseline, which simply copies the first three document sentences to the summary.
Results and Discussion.
Table 1 contains the results on the CNN/Dailymail dataset. Unsurprisingly, the simple Lead-3 baseline outperforms Stan and both our static and dynamic windowing models. This is because in CNN/Dailymail documents almost all of the summary-relevant content is found at the very beginning of the document. The ability to process all windows does not benefit SWM and DWM in this setting, as there is virtually no summary-relevant content in later windows.
In Table 2 we display the results on the WikiHow dataset, which is bound to be more appropriate for the windowing models, because of the more even distribution of the summary-relevant content across the source documents.
On the WikiHow dataset, the windowing models, SWM and DWM, generally have an edge over the standard PG-Net Seq2Seq model (Stan) when the fixed input size for Stan matches the window size of the windowing models. For a larger fixed input size, Stan performs comparably to DWM with the same window size; notably, DWM has the advantage of being able to process a longer overall input. Lowering Stan's input size and comparing it against SWM/DWM with windows of that same size, we see that the windowing models clearly prevail. This renders our windowing models a more appropriate solution for summarizing documents for which the following two properties hold: (1) the document length massively surpasses the maximal number of tokens we can feed to the fixed-input-size model, and (2) summary-relevant information is present all across the document, not just at its beginning.
While SWM seems to outperform DWM, in practice SWM cannot truly summarize arbitrarily long texts at inference. Despite transitioning across content windows, SWM adapts to the summary lengths seen in the training corpus and generates the <eos> token too early when run on long texts. In contrast, by learning to emit window transitions, the Dynamic Windowing Model can truly generate summaries for arbitrarily long texts at inference time, regardless of the observed lengths of training documents and their respective reference summaries. Figure 2 depicts the summary of a very long document, produced by a DWM model trained on documents an order of magnitude shorter.
4 Conclusion
Neural summarization models fix the maximal length of source texts in training (e.g., based on the average source-document length in the training set), forcing documents longer than this threshold to be truncated at inference. In this work, we proposed windowing summarization models, which can process arbitrarily long documents at inference, taking the full source text into account. Our models are effective in summarizing long texts with evenly distributed summary-relevant content.
References
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.
- Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 670–680.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
- Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR.
- Mahnaz Koupaee and William Yang Wang. 2018. WikiHow: A large scale text summarization dataset. CoRR abs/1810.09305.
- Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. 2015. From word embeddings to document distances. In Proceedings of the International Conference on Machine Learning, pp. 957–966.
- Takuya Makino, Tomoya Iwakura, Hiroya Takamura, and Manabu Okumura. 2019. Global optimization under length constraint for neural text summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1039–1048.
- Ramesh Nallapati, Bing Xiang, and Bowen Zhou. 2016. Sequence-to-sequence RNNs for text summarization. In Proceedings of ICLR: Workshop Track.
- Ani Nenkova and Kathleen McKeown. 2011. Automatic summarization. Foundations and Trends in Information Retrieval 5(2–3), pp. 103–233.
- Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In Proceedings of ICLR.
- Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1073–1083.
- Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2017. Abstractive document summarization with a graph-based attentional neural model. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1171–1181.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NeurIPS.
- Yongjian You, Weijia Jia, Tianyi Liu, and Wenmian Yang. 2019. Improving abstractive document summarization with salient information modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2132–2141.
- Vitalii Zhelezniak, Aleksandar Savkov, April Shen, Francesco Moramarco, Jack Flann, and Nils Y. Hammerla. 2019. Don't settle for average, go for the max: Fuzzy sets and max-pooled word vectors. In Proceedings of ICLR.