Neural Abstractive Text Summarization with Sequence-to-Sequence Models

12/05/2018 ∙ by Tian Shi, et al. ∙ Virginia Polytechnic Institute and State University

In the past few years, neural abstractive text summarization with sequence-to-sequence (seq2seq) models has gained a lot of popularity. Many interesting techniques have been proposed to improve seq2seq models, making them capable of handling different challenges, such as saliency, fluency and human readability, and of generating high-quality summaries. Generally speaking, most of these techniques differ in one of three categories: network structure, parameter inference, and decoding/generation. There are also other concerns, such as efficiency and parallelism for training a model. In this paper, we provide a comprehensive literature and technical survey on different seq2seq models for abstractive text summarization from the viewpoint of network structures, training strategies, and summary generation algorithms. Many models were first proposed for language modeling and generation tasks, such as machine translation, and later applied to abstractive text summarization. Therefore, we also provide a brief review of these models. As part of this survey, we also develop an open source library, namely the Neural Abstractive Text Summarizer (NATS) toolkit, for abstractive text summarization. An extensive set of experiments has been conducted on the widely used CNN/Daily Mail dataset to examine the effectiveness of several different neural network components. Finally, we benchmark two models implemented in NATS on two recently released datasets, i.e., Newsroom and Bytecup.




I Introduction

In the modern era of big data, retrieving useful information from a large number of textual documents is a challenging task, due to the unprecedented growth in the availability of blogs, news articles, and reports. Automatic text summarization provides an effective solution for summarizing these documents. The task of text summarization is to condense long documents into short summaries while preserving the important information and meaning of the documents [1, 2]. With short summaries, the text content can be retrieved, processed and digested effectively and efficiently.

Generally speaking, there are two ways to do text summarization: extractive and abstractive [3]. A method is considered to be extractive if words, phrases, and sentences in the summaries are selected from the source articles [4, 5, 6, 2, 7, 8, 9, 10]. Extractive methods are relatively simple and can produce grammatically correct sentences. The generated summaries usually preserve the salient information of source articles and match human-written summaries well [5, 11, 12, 13]. On the other hand, abstractive text summarization has attracted much attention since it is capable of generating novel words using language generation models grounded on representations of source documents [14, 15]. Thus, abstractive methods have a strong potential for producing high-quality summaries that are verbally innovative and can also easily incorporate external knowledge [12]. In this category, many deep neural network based models have achieved better performance in terms of the commonly used evaluation measures (such as the ROUGE [16] score) compared to traditional extractive approaches [17, 18]. In this paper, we primarily focus on the recent advances of sequence-to-sequence (seq2seq) models for the task of abstractive text summarization.

I-A Seq2seq Models and Pointer-Generator Network

Seq2seq models (see Fig. 2) [19, 20] have been successfully applied to a variety of natural language processing (NLP) tasks, such as machine translation [21, 22, 23, 24, 25], headline generation [15, 26, 27], text summarization [12, 14], and speech recognition [28, 29, 30]. Inspired by the success of neural machine translation (NMT) [23], Rush et al. [15] first introduced a neural attention seq2seq model with an attention-based encoder and a neural network language model (NNLM) decoder to the abstractive sentence summarization task, which achieved a significant performance improvement over conventional methods. Chopra et al. [26] further extended this model by replacing the feed-forward NNLM with a recurrent neural network (RNN). Their model is equipped with a convolutional attention-based encoder and an RNN (Elman [31] or LSTM [32]) decoder, and it outperformed other state-of-the-art models on a commonly used benchmark dataset, i.e., the Gigaword corpus. Nallapati et al. [14] introduced several novel elements to the RNN encoder-decoder architecture to address critical problems in abstractive text summarization, including (i) a feature-rich encoder to capture keywords, (ii) a switching generator-pointer to model out-of-vocabulary (OOV) words, and (iii) hierarchical attention to capture hierarchical document structures. They also established benchmarks for these models on the CNN/Daily Mail dataset [33, 34], which consists of pairs of news articles and multi-sentence highlights (summaries).

Before this dataset was introduced, many abstractive text summarization models concentrated on compressing short documents to single-sentence summaries [15, 26]. For the task of summarizing long documents into multi-sentence summaries, these models have several shortcomings: 1) They cannot accurately reproduce the salient information of source documents. 2) They cannot efficiently handle OOV words. 3) They tend to suffer from word- and sentence-level repetitions and to generate unnatural summaries. To tackle the first two challenges, See et al. [12] proposed a pointer-generator network that implicitly combines abstraction with extraction. This pointer-generator architecture can copy words from source texts via a pointer and generate novel words from a vocabulary via a generator. With the pointing/copying mechanism [35, 36, 37, 38, 39, 40, 41], factual information can be reproduced accurately and OOV words can also be handled in the summaries. Many subsequent studies that achieved state-of-the-art performance have also demonstrated the effectiveness of the pointing/copying mechanism [17, 18, 42, 43]. The third problem has been addressed by the coverage mechanism [12], intra-temporal and intra-decoder attention mechanisms [17], and some other heuristic approaches, like forcing a decoder to never output the same trigram more than once during testing.
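The trigram heuristic mentioned above can be illustrated with a short sketch. The function names `has_repeated_trigram` and `blocked_next_tokens` are hypothetical helpers introduced here for illustration; they are not taken from any of the cited implementations. The idea is that, at each decoding step, a candidate token is discarded if appending it would reproduce a trigram that already occurs in the partial summary.

```python
def has_repeated_trigram(tokens):
    """Return True if any trigram occurs more than once in `tokens`."""
    seen = set()
    for i in range(len(tokens) - 2):
        trigram = tuple(tokens[i:i + 3])
        if trigram in seen:
            return True
        seen.add(trigram)
    return False


def blocked_next_tokens(prefix):
    """Tokens that would complete an already-seen trigram if appended to `prefix`."""
    if len(prefix) < 2:
        return set()
    last_two = tuple(prefix[-2:])
    blocked = set()
    # Look for earlier occurrences of the last bigram and block what followed them.
    for i in range(len(prefix) - 2):
        if tuple(prefix[i:i + 2]) == last_two:
            blocked.add(prefix[i + 2])
    return blocked


prefix = "the cat sat on the cat".split()
print(blocked_next_tokens(prefix))
```

In a beam search decoder, one simple way to apply this (an assumption of this sketch) is to set the log-probability of every blocked token to negative infinity before expanding a hypothesis.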


I-B Training Strategies

There are two other non-trivial issues with the current seq2seq framework, i.e., exposure bias and inconsistency of training and testing measurements [44, 45, 46, 47]. Based on the neural probabilistic language model [48], seq2seq models are usually trained by maximizing the likelihood of ground-truth tokens given their previous ground-truth tokens and hidden states (Teacher Forcing algorithm [44, 49], see Fig. 9). However, at testing time (see Fig. 8), previous ground-truth tokens are unknown, and they are replaced with tokens generated by the model itself. Since the generated tokens have never been exposed to the decoder during training, the decoding error can accumulate quickly during the sequence generation. This is known as exposure bias [46]. The other issue is the mismatch of measurements. Performance of seq2seq models is usually estimated with non-differentiable evaluation metrics, such as ROUGE [16] and BLEU [50] scores, which are inconsistent with the log-likelihood function (cross-entropy loss) used in the training phase. These problems are alleviated by the curriculum learning and reinforcement learning (RL) approaches.

I-B1 Training with Curriculum and Reinforcement Learning Approaches

Bengio et al. [44] proposed a curriculum learning approach, known as scheduled sampling, to slowly change the input of the decoder from ground-truth tokens to model-generated ones. Thus, the proposed meta-algorithm bridges the gap between training and testing. It is a practical solution for avoiding the exposure bias. Ranzato et al. [46] proposed a sequence level training algorithm, called MIXER (Mixed Incremental Cross-Entropy Reinforce), which consists of the cross entropy training, REINFORCE [51] and curriculum learning [44]. The REINFORCE algorithm can make use of any user-defined task-specific reward (e.g., non-differentiable evaluation metrics). Therefore, when combined with curriculum learning, the proposed model is capable of addressing both issues of seq2seq models. However, REINFORCE suffers from the high variance of gradient estimators and instability during training [52, 53, 22]. Bahdanau et al. [52] proposed an actor-critic based RL method which has relatively lower variance for gradient estimators. In the actor-critic method, an additional critic network is trained to compute value functions given the policy from the actor network (a seq2seq model), and the actor network is trained based on the estimated value functions (assumed to be exact) from the critic network. On the other hand, Rennie et al. [53] introduced a self-critical sequence training method (SCST) which has a lower variance compared to the REINFORCE algorithm and does not need the second critic network.
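To make the self-critical idea concrete, the following is a minimal sketch of the per-sequence loss, assuming the commonly used form in which the reward of a greedily decoded sequence serves as the baseline for a sampled sequence. The function name `self_critical_loss` and the reward values are illustrative assumptions; in practice the rewards would come from a metric such as ROUGE and the log-probabilities from the decoder.

```python
import numpy as np


def self_critical_loss(sample_log_probs, sample_reward, greedy_reward):
    """Loss for one sampled sequence, with the greedy output's reward as baseline.

    sample_log_probs: log-probabilities of the sampled tokens, one per step.
    sample_reward:    reward of the sampled sequence (e.g., a ROUGE score).
    greedy_reward:    reward of the greedily decoded sequence (the baseline).
    """
    advantage = sample_reward - greedy_reward
    # Minimizing this raises the likelihood of samples that beat the baseline
    # and lowers the likelihood of samples that fall below it.
    return -advantage * float(np.sum(sample_log_probs))


log_probs = np.log([0.5, 0.25])          # toy two-token sample
print(self_critical_loss(log_probs, sample_reward=0.4, greedy_reward=0.3))
```

Because the baseline is produced by the same model at test-time settings, no separate critic network is required, which matches the description above.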

I-B2 Applications to Abstractive Text Summarization

RL algorithms for training seq2seq models have achieved success in a variety of language generation tasks, such as image captioning [53], machine translation [52], and dialogue generation [54]. Specific to abstractive text summarization, Ling et al. [55] introduced a coarse-to-fine attention framework for the purpose of summarizing long documents. Their model parameters were learned with the REINFORCE algorithm. Zhang et al. [56] used the REINFORCE algorithm and the curriculum learning strategy for the sentence simplification task. Paulus et al. [17] first applied the self-critic policy gradient algorithm to training their seq2seq model with the copying mechanism and obtained state-of-the-art performance in terms of ROUGE scores [16]. They proposed a mixed objective function that combines the RL loss with the traditional cross-entropy loss. Thus, their method can both leverage the non-differentiable evaluation metrics and improve the readability. Celikyilmaz et al. [18] introduced a novel deep communicating agents method for abstractive summarization, where they also adopted the RL loss in their objective function. Pasunuru et al. [57] applied the self-critic policy gradient algorithm to train the pointer-generator network. They also introduced two novel rewards (i.e., saliency and entailment rewards) in addition to the ROUGE metric to keep the generated summaries salient and logically entailed. Li et al. [58] proposed a training framework based on the actor-critic method, where the actor network is an attention-based seq2seq model, and the critic network consists of a maximum likelihood estimator and a global summary quality estimator that is used to distinguish the generated and ground-truth summaries via a neural network binary classifier. Chen et al. [59] proposed a compression-paraphrase multi-step procedure for abstractive text summarization, which first extracts salient sentences from documents and then rewrites them. In their model, they used an advantage actor-critic algorithm to optimize the sentence extractor for a better extraction strategy. Keneshloo et al. [47] conducted a comprehensive summary of various RL methods and their applications in training seq2seq models for different NLP tasks. They also implemented these RL algorithms in an open source library constructed using the pointer-generator network [12] as the base model.

I-C Beyond RNN

Most of the prevalent seq2seq models that have attained state-of-the-art performance for sequence modeling and language generation tasks are RNN-based, especially long short-term memory (LSTM) [32] and gated recurrent unit (GRU) [60] based, encoder-decoder models [19, 23]. Standard RNN models are difficult to train due to the vanishing and exploding gradients problems. LSTM is a solution for the vanishing gradients problem, but it still does not address the exploding gradients issue. This issue is commonly handled with a gradient norm clipping strategy [62]. Another critical problem of RNN based models is the computation constraint for long sequences due to their inherent sequential dependence. In other words, the current hidden state in an RNN is a function of previous hidden states. Because of such dependence, an RNN cannot be parallelized within a sequence along the time-step dimension (see Fig. 2) during training and evaluation, and hence training becomes a major challenge for long sequences due to the computation time and memory constraints of GPUs [63].
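The gradient norm clipping strategy can be sketched as follows, assuming the usual formulation in which all gradients are rescaled by a common factor whenever their joint L2 norm exceeds a threshold. The function name `clip_by_global_norm` and the threshold value are illustrative choices, not part of the cited work.

```python
import numpy as np


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if total_norm <= max_norm or total_norm == 0.0:
        return [g.copy() for g in grads], total_norm
    scale = max_norm / total_norm
    # A shared scale keeps the direction of the overall gradient unchanged.
    return [g * scale for g in grads], total_norm


grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]   # joint norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before, clipped)
```

Since only the magnitude is changed, the update direction is preserved while the step size stays bounded.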

Recently, it has been found that convolutional neural network (CNN) [64] based encoder-decoder models have the potential to alleviate the aforementioned problem, since they have better performance in terms of the following three considerations [65, 66, 67]. 1) A model can be parallelized during training and evaluation. 2) The computational complexity of the model is linear with respect to the length of sequences. 3) The model has short paths between pairs of input and output tokens, so that it can propagate gradient signals more efficiently [68]. Kalchbrenner et al. [65] introduced the ByteNet model, which adopts a one-dimensional convolutional neural network of fixed depth for both the encoder and the decoder [69]. The decoder CNN is stacked on top of the hidden representation of the encoder CNN, which ensures a shorter path between input and output. The proposed ByteNet model achieved state-of-the-art performance on a character-level machine translation task with parallelism and linear-time computational complexity [65]. Bradbury et al. [66] proposed a quasi-recurrent neural network (QRNN) encoder-decoder architecture, where both encoder and decoder are composed of convolutional layers and so-called ‘dynamic average pooling’ layers [70, 66]. The convolutional layers allow computations to be completely parallel across both mini-batch and sequence time-step dimensions, and they require less time than an LSTM even though a sequential dependence is still present in the pooling layers [66]. This framework has been shown to be effective by outperforming LSTM-based models on a character-level machine translation task with a significantly higher computational speed. Recently, Gehring et al. [71, 67, 72] attempted to build CNN based seq2seq models and apply them to large-scale benchmark datasets for sequence modeling. In [71], the authors proposed a convolutional encoder model, in which the encoder is composed of a succession of convolutional layers, and demonstrated its strong performance for machine translation. They further constructed a convolutional seq2seq architecture by replacing the LSTM decoder with a CNN decoder and bringing in several novel elements, including gated linear units [73] and multi-step attention [67]. The model also allows the computations of all network elements to be parallelized, so training and decoding can be much faster than with RNN models. It also achieved state-of-the-art performance on several machine translation benchmark datasets. Vaswani et al. [63] further constructed a novel network architecture, namely the Transformer, which only depends on feed-forward networks and the attention mechanism. It has achieved state-of-the-art performance on the machine translation task with significantly less training time. Currently, the ConvS2S model [72] has been applied to abstractive document summarization and outperforms the pointer-generator network [12] on the CNN/Daily Mail dataset.

Fig. 1: An overall taxonomy of topics on seq2seq models for neural abstractive text summarization.

I-D Other Studies

So far, we have primarily focused on the pointer-generator network, training neural networks with RL algorithms, and CNN based seq2seq architectures. There are many other studies that aim to improve the performance of seq2seq models for the task of abstractive text summarization from different perspectives and to broaden their applications.

I-D1 Network Structure and Attention

The first way to boost the performance of seq2seq models is to design better network structures. Zhou et al. [74] introduced an information filter, namely, a selective gate network between the encoder and decoder. This model can control the information flow from the encoder to the decoder via constructing a second level representation of the source texts with the gate network. Zeng et al. [38] introduced a read-again mechanism to improve the quality of the representations of the source texts. Tan et al. [75] built a graph ranking model upon a hierarchical encoder-decoder framework, which enables the model to capture the salient information of the source documents and generate accurate, fluent and non-redundant summaries. Xia et al. [76] proposed a deliberation network that passes the decoding process multiple times (deliberation process), to polish the sequences generated by the previous decoding process. Li et al. [77] incorporated a sequence of variational auto-encoders [78, 79] into the decoder to capture the latent structure of the generated summaries.

I-D2 Extraction + Abstraction

Another way to improve abstractive text summarization is to make use of the salient information from the extraction process. Hsu et al. [80] proposed a unified framework that takes advantage of both extractive and abstractive summarization using a novel attention mechanism, which is a combination of the sentence-level attention (based on the extractive summarization [81]) and the word-level attention (based on the pointer-generator network [12]), inspired by the intuition that words in less attended sentences should have lower attention scores. Chen et al. [59] introduced a multi-step procedure, namely compression-paraphrase, for abstractive summarization, which first extracts salient sentences from documents and then rewrites them in order to get final summaries. Li et al. [82] introduced a guiding generation model, where the keywords in source texts are first retrieved with an extractive model [83]. Then, a guide network is applied to encode them to obtain the key information representations that will guide the summary generation process.

I-D3 Long Documents

Compared to short articles and texts with moderate lengths, there are many challenges that arise in long documents, such as difficulty in capturing the salient information [84]. Nallapati et al. [14] proposed a hierarchical attention model to capture hierarchical structures of long documents. To make models scale up to very long sequences, Ling et al. [55] introduced a coarse-to-fine attention mechanism, which hierarchically reads and attends to long documents (each document is split into many chunks of text). By stochastically selecting chunks of text during training, this approach can scale linearly with the number of chunks instead of the number of tokens. Cohan et al. [84] proposed a discourse-aware attention model which has a similar idea to that of a hierarchical attention model. Their model was applied to two large-scale datasets of scientific papers, i.e., the arXiv and PubMed datasets. Tan et al. [75] introduced a graph-based attention model which is built upon a hierarchical encoder-decoder framework, where the PageRank algorithm [85] was used to calculate saliency scores of sentences.
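As an illustration of how PageRank can turn pairwise sentence similarities into saliency scores, the sketch below runs a plain power iteration on a toy similarity matrix. This is the generic algorithm under assumed settings (damping factor 0.85, a hand-written similarity matrix, the hypothetical name `pagerank_scores`); it is not the graph-based attention model of [75], which is integrated into the decoder.

```python
import numpy as np


def pagerank_scores(similarity, damping=0.85, iters=200, tol=1e-12):
    """Saliency scores from a non-negative sentence similarity matrix."""
    sim = np.asarray(similarity, dtype=float).copy()
    np.fill_diagonal(sim, 0.0)                     # ignore self-similarity
    n = sim.shape[0]
    col_sums = sim.sum(axis=0)
    safe = np.where(col_sums > 0, col_sums, 1.0)
    # Column-normalize; isolated sentences link uniformly to all sentences.
    transition = np.where(col_sums > 0, sim / safe, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        new_scores = (1.0 - damping) / n + damping * (transition @ scores)
        converged = np.abs(new_scores - scores).sum() < tol
        scores = new_scores
        if converged:
            break
    return scores


# Sentence 0 is similar to both other sentences, which are unrelated to each other.
toy_similarity = [[0, 1, 1],
                  [1, 0, 0],
                  [1, 0, 0]]
print(pagerank_scores(toy_similarity))
```

The sentence connected to the most other sentences receives the highest score, which is the intuition behind using such scores as a saliency signal.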

I-D4 Multi-Task Learning

Multi-task learning has become a promising research direction for this problem since it allows seq2seq models to handle different tasks. Pasunuru et al. [86] introduced a multi-task learning framework, which incorporates knowledge from an entailment generation task into the abstractive text summarization task by sharing decoder parameters. They further proposed a novel framework [87] that is composed of two auxiliary tasks, i.e., question generation and entailment generation, to improve their model for capturing the saliency and entailment for abstractive text summarization. In their model, different tasks share several encoder, decoder and attention layers. McCann et al. [88] introduced the Natural Language Decathlon (decaNLP), a challenge that spans ten different tasks, including question-answering, machine translation, summarization, and so on. They also proposed a multitask question answering network that can jointly learn all tasks without task-specific modules or parameters, since all tasks are mapped to the same framework of question-answering over a given context.

I-D5 Beam Search

Beam search algorithms have been commonly used in the decoding of different language generation tasks [22, 12]. However, the generated candidate sequences usually lack diversity [89]. In other words, the top-$K$ candidates are nearly identical, where $K$ is the size of the beam. Li et al. [90] replaced the log-likelihood objective function in the neural probabilistic language model [48] with Maximum Mutual Information (MMI) [91] in their neural conversation models to remedy the problem. This idea has also been applied to neural machine translation (NMT) [92] to model the bi-directional dependency of source and target texts. They further proposed a simple yet fast decoding algorithm that can generate diverse candidates and has shown performance improvement on the abstractive text summarization task [93]. Vijayakumar et al. [94] proposed generating diverse outputs by optimizing a diversity-augmented objective function. Their method, referred to as the Diverse Beam Search (DBS) algorithm, has been applied to image captioning, machine translation, and visual question-generation tasks. Cibils et al. [95] introduced a meta-algorithm that first uses DBS to generate summaries, and then picks candidates according to maximal marginal relevance [96] under the assumption that the most useful candidates should be close to the source document and far away from each other. The proposed algorithm has boosted the performance of the pointer-generator network on the CNN/Daily Mail dataset.
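The maximal marginal relevance criterion can be sketched with a greedy selection loop. The sketch below assumes a simple token-overlap (Jaccard) similarity and a trade-off weight `lam`; both are illustrative stand-ins, and the names `jaccard` and `mmr_select` are hypothetical rather than taken from [95] or [96]. The candidates would normally come from a diverse decoding procedure.

```python
def jaccard(a, b):
    """Token-overlap similarity between two token lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0


def mmr_select(source, candidates, k, lam=0.5):
    """Greedily pick k candidate indices that are close to the source
    and far from the candidates already picked."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = jaccard(candidates[i], source)
            redundancy = max(
                (jaccard(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


source = "france beat argentina 4-3 to reach the quarter-finals".split()
candidates = [
    "france beat argentina 4-3".split(),
    "france beat argentina".split(),
    "mbappe scored twice to reach the quarter-finals".split(),
]
print(mmr_select(source, candidates, k=2))
```

In this toy example the second pick skips the near-duplicate of the first candidate in favor of one that covers a different part of the source.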

Although many research papers have been published in the area of neural abstractive text summarization, there are few survey papers [97, 98, 99] that provide a comprehensive study. In this paper, we systematically review current advances of seq2seq models for the abstractive text summarization task from various perspectives, including network structures, training strategies, and sequence generation. In addition to a literature survey, we also implemented some of these methods in an open-source library, namely NATS. An extensive set of experiments has been conducted on various benchmark text summarization datasets in order to examine the importance of different network components. The main contributions of this paper can be summarized as follows:


  • Provide a comprehensive literature survey of current advances of seq2seq models with an emphasis on the abstractive text summarization.

  • Conduct a detailed review of the techniques used to tackle different challenges in RNN encoder-decoder architectures.

  • Review different strategies for training seq2seq models and approaches for generating summaries.

  • Provide an open-source library, which implements some of these models, and systematically investigate the effects of different network elements on the summarization performance.

The rest of this paper is organized as follows: An overall taxonomy of topics on seq2seq models for neural abstractive text summarization is shown in Fig. 1. A comprehensive list of papers published to date on the topic of neural abstractive text summarization is summarized in Tables I and II. In Section II, we introduce the basic seq2seq framework along with its extensions, including attention mechanism, pointing/copying mechanism, repetition handling, improving encoder or decoder, summarizing long documents and combining with extractive models. Section III summarizes different training strategies, including word-level training methods, such as cross-entropy training, and sentence-level training with RL algorithms. In Section IV, we discuss generating summaries using the beam search algorithm and various other diverse beam decoding algorithms. Section V briefly introduces the convolutional seq2seq model and its application to abstractive text summarization. In Section VI, we present details of our implementations and discuss our experimental results on the CNN/Daily Mail, Newsroom [100], and Bytecup datasets. We conclude this survey in Section VII.

II The RNN Encoder-Decoder Framework

In this section, we review different encoder-decoder models for the neural abstractive text summarization. We will start with the basic RNN seq2seq framework and attention mechanism. Then, we will describe more advanced network structures that can handle different challenges in the text summarization, such as repetition and out-of-vocabulary (OOV) words. We will highlight various existing problems and proposed solutions.

II-A Seq2seq Framework Basics

A vanilla seq2seq framework for abstractive summarization is composed of an encoder and a decoder. The encoder reads a source article, denoted by $x = (x_1, x_2, \ldots, x_J)$, and transforms it to hidden states $h^e = (h^e_1, h^e_2, \ldots, h^e_J)$; while the decoder takes these hidden states as the context input and outputs a summary $y = (y_1, y_2, \ldots, y_T)$. Here, $x_i$ and $y_j$ are one-hot representations of the tokens in the source article and summary, respectively. We use $J$ and $T$ to represent the number of tokens (document length) of the original source document and the summary, respectively. A summarization task is defined as inferring a summary $y$ from a given source article $x$ using seq2seq models.

Fig. 2: The basic seq2seq model. SOS and EOS represent the start and end of a sequence, respectively.

Encoders and decoders can be feed-forward networks, CNNs [71, 67] or RNNs. RNN architectures, especially long short-term memory (LSTM) [32] and gated recurrent unit (GRU) [20], have been most widely adopted for seq2seq models. Fig. 2 shows a basic RNN seq2seq model with a bi-directional LSTM encoder and an LSTM decoder. The bi-directional LSTM is considered since it usually gives better document representations compared to a forward LSTM. The encoder reads a sequence of input tokens and turns them into a sequence of hidden states with the following updating algorithm:


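$$
\begin{aligned}
i_t &= \sigma\left(W_{ii} E_{x_t} + b_{ii} + W_{hi} h_{t-1} + b_{hi}\right) && \text{(input gate)} \\
f_t &= \sigma\left(W_{if} E_{x_t} + b_{if} + W_{hf} h_{t-1} + b_{hf}\right) && \text{(forget gate)} \\
o_t &= \sigma\left(W_{io} E_{x_t} + b_{io} + W_{ho} h_{t-1} + b_{ho}\right) && \text{(output gate)} \\
g_t &= \tanh\left(W_{ig} E_{x_t} + b_{ig} + W_{hg} h_{t-1} + b_{hg}\right) && \text{(new cell)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t && \text{(cell state)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
\tag{1}
$$

where the weight matrices $W_{ii}, W_{hi}, W_{if}, W_{hf}, W_{io}, W_{ho}, W_{ig}, W_{hg}$ and the vectors $b_{ii}, b_{hi}, b_{if}, b_{hf}, b_{io}, b_{ho}, b_{ig}, b_{hg}$ are learnable parameters (in the rest of this paper, we will use $W$ (weights) and $b$ (bias) to represent the model parameters), $E_{x_t}$ denotes the word embedding of token $x_t$, and $c_t$ represents the cell states. Both $h_0$ and $c_0$ are initialized to $0$. For the bi-directional LSTM, the input sequence is encoded as $\overrightarrow{h}^e$ and $\overleftarrow{h}^e$, where the right and left arrows denote the forward and backward temporal dependencies, respectively. Superscript $e$ is the shortcut notation used to indicate that it is for the encoder. During the decoding, the decoder takes the encoded representations of the source article (i.e., hidden and cell states $\overrightarrow{h}^e_J$, $\overleftarrow{h}^e_1$, $\overrightarrow{c}^e_J$, $\overleftarrow{c}^e_1$) as the input and generates the summary $y$. In a simple encoder-decoder model, encoded vectors are used to initialize hidden and cell states of the LSTM decoder. For example, we can initialize them as follows: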


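$$
h^d_0 = \tanh\left(W_{e2d}\left(\overrightarrow{h}^e_J \oplus \overleftarrow{h}^e_1\right) + b_{e2d}\right), \qquad
c^d_0 = \overrightarrow{c}^e_J \oplus \overleftarrow{c}^e_1
\tag{2}
$$

Here, superscript $d$ denotes the decoder and $\oplus$ is a concatenation operator. At each decoding step, we first update the hidden state $h^d_t$ conditioned on the previous hidden states and input tokens, i.e.,

$$
h^d_t, c^d_t = \mathrm{LSTM}\left(h^d_{t-1}, c^d_{t-1}, E_{y_{t-1}}\right)
\tag{3}
$$

Hereafter, we will not explicitly express the cell states in the input and output of LSTM, since only hidden states are passed to other parts of the model. Then, the vocabulary distribution can be calculated as follows:

$$
P_{\mathrm{vocab},t} = \mathrm{softmax}\left(W_{d2v} h^d_t + b_{d2v}\right)
\tag{4}
$$

where $P_{\mathrm{vocab},t}$ is a vector whose dimension is the size of the vocabulary $\mathcal{V}$ and $\mathrm{softmax}(v_t) = \exp(v_t) / \sum_{\tau} \exp(v_\tau)$ for each element $v_t$ of a vector $v$. Therefore, the probability of generating the target token $w$ in the vocabulary $\mathcal{V}$ is denoted as $P_{\mathrm{vocab},t}(w)$.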
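The update and output steps described in this subsection can be sketched in NumPy. This is a minimal illustration under stated assumptions: the four gates are stacked into single weight matrices in the order input, forget, output, new cell; a single bias vector per gate is used for brevity; and the names `lstm_step`, `vocab_distribution`, `W_x`, `W_h`, `W_d2v` and `b_d2v` are chosen here for illustration rather than taken from the NATS code.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def softmax(v):
    e = np.exp(v - np.max(v))          # subtract the max for numerical stability
    return e / e.sum()


def lstm_step(x_emb, h_prev, c_prev, W_x, W_h, b):
    """One LSTM update. Shapes: W_x (4H, D), W_h (4H, H), b (4H,)."""
    H = h_prev.shape[0]
    z = W_x @ x_emb + W_h @ h_prev + b
    i = sigmoid(z[0:H])                # input gate
    f = sigmoid(z[H:2 * H])            # forget gate
    o = sigmoid(z[2 * H:3 * H])        # output gate
    g = np.tanh(z[3 * H:4 * H])        # candidate cell values
    c = f * c_prev + i * g             # new cell state
    h = o * np.tanh(c)                 # new hidden state
    return h, c


def vocab_distribution(h_dec, W_d2v, b_d2v):
    """Map a decoder hidden state to a probability distribution over the vocabulary."""
    return softmax(W_d2v @ h_dec + b_d2v)


rng = np.random.default_rng(0)
D, H, V = 4, 3, 5                      # embedding size, hidden size, vocabulary size
W_x, W_h, b = rng.normal(size=(4 * H, D)), rng.normal(size=(4 * H, H)), np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)        # initial states set to zero
for _ in range(2):                     # two toy time steps with random embeddings
    h, c = lstm_step(rng.normal(size=D), h, c, W_x, W_h, b)
print(vocab_distribution(h, rng.normal(size=(V, H)), np.zeros(V)))
```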

This LSTM based encoder-decoder framework was the foundation of many neural abstractive text summarization models [14, 12, 17]. However, there are many problems with this model. For example, the encoder is not well trained via back propagation through time [101, 102], since the paths from the encoder to the output are relatively long, which limits the propagation of gradient signals. The accuracy and human-readability of generated summaries are also very low, with a lot of OOV words (in the rest of this paper, we will use unk, i.e., unknown words, to denote OOV words) and repetitions. The rest of this section will discuss different models that were proposed in the literature to resolve these issues for producing better quality summaries.

II-B Attention Mechanism

The attention mechanism has achieved great success and is commonly used in seq2seq models for different natural language processing (NLP) tasks [103], such as machine translation [23, 21], image captioning [104], and neural abstractive text summarization [14, 12, 17]. In an attention based encoder-decoder architecture (shown in Fig. 3), the decoder not only takes the encoded representations (i.e., final hidden and cell states) of the source article as input, but also selectively focuses on parts of the article at each decoding step. For example, suppose we want to compress the source input “Kylian Mbappe scored two goals in four second-half minutes to send France into the World Cup quarter-finals with a thrilling 4-3 win over Argentina on Saturday.” to its short version “France beat Argentina 4-3 to enter quarter-finals.” When generating the token “beat”, the decoder may need to attend to “a thrilling 4-3 win” more than to other parts of the text. This attention can be achieved by an alignment mechanism [23], which first computes the attention distribution of the source tokens and then lets the decoder know where to attend to produce a target token. In the encoder-decoder framework depicted in Figs. 2 and 3, given all the hidden states of the encoder, i.e., $h^e = (h^e_1, h^e_2, \ldots, h^e_J)$ (where $h^e_j$ for the bi-directional LSTM is defined as the concatenation of $\overrightarrow{h}^e_j$ and $\overleftarrow{h}^e_j$), and the current decoder hidden state $h^d_t$, the attention distribution $\alpha^e_t$ over the source tokens is calculated as follows:

$$
\alpha^e_{tj} = \frac{\exp\left(s^e_{tj}\right)}{\sum_{k=1}^{J} \exp\left(s^e_{tk}\right)}
\tag{5}
$$

where the alignment score $s^e_{tj} = s(h^e_j, h^d_t)$ is obtained by the content-based score function, which has three alternatives as suggested in [21]:

$$
s(h^e_j, h^d_t) =
\begin{cases}
(h^e_j)^\top h^d_t & \text{dot} \\
(h^e_j)^\top W_{\mathrm{align}} h^d_t & \text{general} \\
(v_{\mathrm{align}})^\top \tanh\left(W_{\mathrm{align}}\left(h^e_j \oplus h^d_t\right) + b_{\mathrm{align}}\right) & \text{concat}
\end{cases}
\tag{6}
$$

It should be noted that the numbers of additional parameters for the ‘dot’, ‘general’ and ‘concat’ approaches are $0$, $|h^e_j| \times |h^d_t|$ and $(|h^e_j| + |h^d_t|) \times |v_{\mathrm{align}}| + 2 \times |v_{\mathrm{align}}|$, respectively. Here $|\cdot|$ represents the dimension of a vector. The ‘general’ and ‘concat’ approaches are commonly used score functions in abstractive text summarization [12, 17]. One of the drawbacks of the ‘dot’ method is that it requires $h^e_j$ and $h^d_t$ to have the same dimension.
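The three score functions and the normalization into an attention distribution can be written out as a short sketch. The parameter names `W_align`, `b_align` and `v_align` are assumptions made for this illustration, and the shapes are chosen so that the sketch runs on toy vectors.

```python
import numpy as np


def score_dot(h_e, h_d):
    """'dot': no extra parameters, but both vectors must have the same size."""
    return float(h_e @ h_d)


def score_general(h_e, h_d, W_align):
    """'general': bilinear form, W_align has shape (len(h_e), len(h_d))."""
    return float(h_e @ W_align @ h_d)


def score_concat(h_e, h_d, W_align, b_align, v_align):
    """'concat': one-hidden-layer network applied to the concatenated states."""
    hidden = np.tanh(W_align @ np.concatenate([h_e, h_d]) + b_align)
    return float(v_align @ hidden)


def attention_distribution(scores):
    """Softmax over the alignment scores of all source positions."""
    scores = np.asarray(scores, dtype=float)
    e = np.exp(scores - scores.max())
    return e / e.sum()


enc_states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # three source tokens
h_dec = np.array([0.5, 2.0])
scores = [score_dot(h_e, h_dec) for h_e in enc_states]
print(attention_distribution(scores))
```

With the ‘dot’ score, the source position whose encoder state is most aligned with the decoder state receives the largest weight.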

Fig. 3: An attention-based seq2seq model.

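With the attention distribution, we can naturally define the source side context vector for the target word as

$$
z^e_t = \sum_{j=1}^{J} \alpha^e_{tj} h^e_j
\tag{7}
$$

Together with the current decoder hidden state $h^d_t$, we get the attention hidden state [21]

$$
\tilde{h}^d_t = W_z\left(z^e_t \oplus h^d_t\right) + b_z
\tag{8}
$$

Finally, the vocabulary distribution is calculated by

$$
P_{\mathrm{vocab},t} = \mathrm{softmax}\left(W_{d2v} \tilde{h}^d_t + b_{d2v}\right)
\tag{9}
$$

When $t > 1$, the decoder hidden state $h^d_{t+1}$ is updated by

$$
h^d_{t+1} = \mathrm{LSTM}\left(h^d_t, E_{y_t} \oplus \tilde{h}^d_t\right)
\tag{10}
$$

where the input is the concatenation of $E_{y_t}$ and $\tilde{h}^d_t$.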
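A single attention read-out, from alignment scores to the vocabulary distribution, can be sketched as follows. The sketch assumes the ‘dot’ score for brevity and a linear attention hidden state; the names `attention_readout`, `W_z`, `b_z`, `W_d2v` and `b_d2v` are illustrative, and random weights stand in for trained parameters.

```python
import numpy as np


def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()


def attention_readout(enc_states, h_dec, W_z, b_z, W_d2v, b_d2v):
    """enc_states: (J, H) encoder states; h_dec: (H,) current decoder state."""
    scores = enc_states @ h_dec                   # 'dot' alignment scores, shape (J,)
    alpha = softmax(scores)                       # attention distribution over source tokens
    context = alpha @ enc_states                  # weighted sum of encoder states, shape (H,)
    h_tilde = W_z @ np.concatenate([context, h_dec]) + b_z   # attention hidden state
    p_vocab = softmax(W_d2v @ h_tilde + b_d2v)    # distribution over the vocabulary
    return alpha, context, h_tilde, p_vocab


rng = np.random.default_rng(0)
J, H, V = 4, 3, 6                                 # source length, hidden size, vocabulary size
enc_states = rng.normal(size=(J, H))
alpha, context, h_tilde, p_vocab = attention_readout(
    enc_states, rng.normal(size=H),
    rng.normal(size=(H, 2 * H)), np.zeros(H),
    rng.normal(size=(V, H)), np.zeros(V),
)
print(alpha.round(3), p_vocab.round(3))
```

For the next decoding step, the attention hidden state would be concatenated with the embedding of the emitted token and fed to the decoder LSTM.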

II-C Pointing/Copying Mechanism

The pointing/copying mechanism [35] represents a class of approaches that generate target tokens by directly copying them from input sequences based on their attention weights. It can be naturally applied to abstractive text summarization since summaries and articles can share the same vocabulary [12]. More importantly, it is capable of dealing with out-of-vocabulary (OOV) words [14, 36, 37, 12]. A variety of studies have shown a boost in performance after incorporating the pointing/copying mechanism into the seq2seq framework [12, 17, 18]. In this section, we review several alternatives of this mechanism for abstractive text summarization.

II-C1 Pointer Softmax [37]

The basic architecture of pointer softmax is described as follows. It consists of three fundamental components: a short-list softmax, a location softmax and a switching network. At decoding step $t$, a short-list softmax $P_{vocab,t}$ calculated by Eq. (9) is used to predict target tokens in the vocabulary. The location softmax gives locations of tokens that will be copied from the source article to the target based on the attention weights $\alpha^e_t$. With these two components, a switching network is designed to determine whether to predict a token from the vocabulary or copy one from the source article if it is an OOV token. The switching network is a multilayer perceptron (MLP) with a sigmoid activation function, which estimates the probability $p_{gen,t}$ of generating tokens from the vocabulary based on the context vector $z^e_t$ and hidden state $h^d_t$ with

$$p_{gen,t} = \sigma(W_{s,z} z^e_t + W_{s,h} h^d_t + b_s),$$

where $p_{gen,t}$ is a scalar and $\sigma(a) = \frac{1}{1 + \exp(-a)}$ is a sigmoid activation function. The final probability of producing the target token $y_t$ is given by the concatenation of the vectors $p_{gen,t} P_{vocab,t}$ and $(1 - p_{gen,t}) \alpha^e_t$.

II-C2 Switching Generator-Pointer [14]

Similar to the switching network in pointer softmax [37], the switching generator-pointer is also equipped with a ‘switch’, which determines whether to generate a token from the vocabulary or point to one in the source article at each decoding step. The switch is explicitly modeled by

$$p_{gen,t} = \sigma(W_{s,z} z^e_t + W_{s,h} h^d_t + W_{s,E} E_{y_{t-1}} + b_s). \tag{12}$$

If the switch is turned on, the decoder produces a word from the vocabulary with the distribution $P_{vocab,t}$ (see Eq. (9)). Otherwise, the decoder generates a pointer based on the attention distribution $\alpha^e_t$ (see Eq. (5)), i.e., $p_j = \arg\max_{j \in \{1, \dots, J\}} \alpha^e_{tj}$, where $p_j$ is the position of the token in the source article. When a pointer is activated, the embedding of the pointed token will be used as an input for the next decoding step.

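II-C3 CopyNet [36]

CopyNet has a differentiable network architecture and can be easily trained in an end-to-end manner. In this framework, the probability of generating a target token is a combination of the probabilities of two modes, i.e., generate-mode and copy-mode. First, CopyNet represents unique tokens in the vocabulary and source sequence by $\mathcal{V}$ and $\mathcal{X}$, respectively, and builds an extended vocabulary $\mathcal{V}_{ext} = \mathcal{V} \cup \mathcal{X} \cup \{\text{<unk>}\}$. Then, the vocabulary distribution over the extended vocabulary is calculated by

$$P_{\mathcal{V}_{ext}}(y_t) = p_{gen}(y_t) + p_{copy}(y_t),$$

where $p_{gen}$ and $p_{copy}$ are also defined on $\mathcal{V}_{ext}$, i.e.,

$$p_{gen}(y_t) = \begin{cases} \frac{1}{Z} \exp \psi_{gen}(y_t) & y_t \in \mathcal{V} \cup \{\text{<unk>}\} \\ 0 & \text{otherwise} \end{cases}$$

$$p_{copy}(y_t) = \begin{cases} \sum_{j: x_j = y_t} \frac{1}{Z} \exp \psi_{copy}(x_j) & y_t \in \mathcal{X} \\ 0 & \text{otherwise} \end{cases}$$

Here, $Z$ is a normalization factor shared by both the above equations. $\psi_{gen}(y_t)$ is the component for token $y_t$ of the output-layer scores, calculated with

$$\psi_{gen} = W_{d2v} \tilde{h}^d_t + b_{d2v}.$$

$\psi_{copy}(x_j)$ is obtained by Eq. (6).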

Fig. 4: The pointer-generator network.

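II-C4 Pointer-Generator Network [12]

The pointer-generator network also has a differentiable network architecture (see Fig. 4). Similar to CopyNet [36], the vocabulary distribution over an extended vocabulary $\mathcal{V}_{ext}$ is calculated by

$$P_{\mathcal{V}_{ext}}(y_t) = p_{gen,t} P_g(y_t) + (1 - p_{gen,t}) P_c(y_t),$$

where $p_{gen,t}$ is obtained by Eq. (12). The vocabulary distribution $P_g(y_t)$ and attention distribution $P_c(y_t)$ are defined as follows:

$$P_g(y_t) = \begin{cases} P_{vocab,t}(y_t) & y_t \in \mathcal{V} \cup \{\text{<unk>}\} \\ 0 & \text{otherwise} \end{cases}$$

$$P_c(y_t) = \begin{cases} \sum_{j: x_j = y_t} \alpha^e_{tj} & y_t \in \mathcal{X} \\ 0 & \text{otherwise} \end{cases}$$

The pointer-generator network has been used as the base model for many abstractive text summarization models (see Table I and II). Finally, it should be noted that $p_{gen,t} \in (0, 1)$ in CopyNet and the pointer-generator network can be viewed as a “soft-switch” to choose between generation and copying, which is different from the “hard-switch” (i.e., $p_{gen,t} \in \{0, 1\}$) in pointer softmax and the switching generator-pointer [37, 14, 17].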
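To make the soft-switch concrete, the following sketch mixes a generation distribution and a copy distribution over an extended vocabulary, in the spirit of the pointer-generator network. The dictionary-based representation, the function name `pointer_generator_dist`, and all numbers are illustrative assumptions rather than the implementation of [12].

```python
def pointer_generator_dist(p_gen, p_vocab, attn, source_tokens):
    """Mix generation and copy probabilities over the extended vocabulary.

    p_gen:         scalar soft-switch in (0, 1)
    p_vocab:       dict token -> probability over the fixed vocabulary
    attn:          attention weights over source positions (sum to 1)
    source_tokens: source tokens aligned with attn
    """
    # Generation mode: scale the fixed-vocabulary distribution by p_gen.
    dist = {tok: p_gen * p for tok, p in p_vocab.items()}
    # Copy mode: add attention mass for every position holding the token,
    # which also creates entries for out-of-vocabulary source tokens.
    for tok, a in zip(source_tokens, attn):
        dist[tok] = dist.get(tok, 0.0) + (1.0 - p_gen) * a
    return dist


# Toy example: "mbappe" is out of vocabulary and occurs twice in the source.
p_vocab = {"france": 0.5, "beat": 0.3, "<unk>": 0.2}
source_tokens = ["mbappe", "france", "mbappe"]
attn = [0.6, 0.3, 0.1]
dist = pointer_generator_dist(0.75, p_vocab, attn, source_tokens)
```

Here the OOV token "mbappe" ends up with probability 0.25 × (0.6 + 0.1) = 0.175 even though the fixed vocabulary assigns it no mass, and the mixed distribution still sums to one.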

II-D Repetition Handling

One of the critical challenges for attention-based seq2seq models is that the generated sequences contain repetitions, since the attention mechanism tends to ignore the past alignment information [105, 106]. For summarization and headline generation tasks, model-generated summaries suffer from both word-level and sentence-level repetitions. The latter is specific to summaries which consist of several sentences [14, 12, 17], such as those in the CNN/Daily Mail dataset [14] and the Newsroom dataset [100]. In this section, we review several approaches that have been proposed to overcome the repetition problem.

II-D1 Temporal Attention [14, 17]

The temporal attention method was originally proposed to deal with the attention deficiency problem in neural machine translation (NMT) [106]. Nallapati et al. [14] have found that it can also overcome the problem of repetition when generating multi-sentence summaries, since it prevents the model from attending to the same parts of a source article by tracking the past attention weights. More formally, given the attention score $s^e_{tj}$ in Eq. (6), we can first define a temporal attention score as [17]:

$$s^{temp}_{tj} = \begin{cases} \exp(s^e_{tj}) & \text{if } t = 1 \\ \frac{\exp(s^e_{tj})}{\sum_{k=1}^{t-1} \exp(s^e_{kj})} & \text{otherwise} \end{cases} \tag{20}$$

Then, the attention distribution is calculated with

$$\alpha^{temp}_{tj} = \frac{s^{temp}_{tj}}{\sum_{k=1}^{J} s^{temp}_{tk}}.$$

Given the attention distribution, the context vector (see Eq. (7)) is rewritten as

$$z^e_t = \sum_{j=1}^{J} \alpha^{temp}_{tj} h^e_j.$$

It can be seen from Eq. (20) that, at each decoding step, the input tokens which have been highly attended will have a lower attention score via the normalization in the time dimension. As a result, the decoder will not repeatedly attend to the same part of the source article.
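The effect of the normalization in time can be seen in a small sketch. It assumes raw alignment scores are already available as a list of lists (one row per decoding step); the function name `temporal_attention` and the toy scores are hypothetical.

```python
import math


def temporal_attention(raw_scores):
    """Compute temporal attention weights.

    raw_scores[t][j] is the alignment score between decoding step t and
    source position j. Returns one attention distribution per step.
    """
    num_src = len(raw_scores[0])
    past = [0.0] * num_src  # running sum of exp(score) over earlier steps
    all_weights = []
    for t, scores in enumerate(raw_scores):
        exps = [math.exp(s) for s in scores]
        if t == 0:
            temp = exps
        else:
            # Penalize positions that were strongly attended in the past.
            temp = [e / p for e, p in zip(exps, past)]
        total = sum(temp)
        all_weights.append([v / total for v in temp])
        past = [p + e for p, e in zip(past, exps)]
    return all_weights


# The raw scores favour position 0 at both steps.
raw = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
weights = temporal_attention(raw)
```

Although the raw scores are identical at both steps, the second step spreads attention uniformly because position 0 was already heavily attended at the first step.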

II-D2 Intra-decoder Attention [17]

Intra-decoder attention is another technique to handle the repetition problem for long-sequence generation. Compared to regular attention-based models, it allows a decoder to not only attend to tokens in a source article but also keep track of the previously decoded tokens in a summary, so that the decoder will not repeatedly produce the same information.

For $t > 1$, intra-decoder attention scores, denoted by $s^d_{t\tau}$, can be calculated in the same manner as the attention score $s^e_{tj}$ (we have to replace $h^e_j$ with $h^d_\tau$ in Eq. (6), where $\tau \in \{1, \dots, t-1\}$). Then, the attention weight for each token is expressed as

$$\alpha^d_{t\tau} = \frac{\exp(s^d_{t\tau})}{\sum_{k=1}^{t-1} \exp(s^d_{tk})}.$$

With the attention distribution, we can calculate the decoder-side context vector by taking a linear combination of the decoder hidden states, i.e., $h^d_1, \dots, h^d_{t-1}$, as

$$z^d_t = \sum_{\tau=1}^{t-1} \alpha^d_{t\tau} h^d_\tau.$$

The decoder-side and encoder-side context vectors will both be used to calculate the vocabulary distribution.

II-D3 Coverage [12]

The coverage model was first proposed for the NMT task [105] to address the problem that the standard attention mechanism tends to ignore the past alignment information. Recently, See et al. [12] introduced the coverage mechanism to the abstractive text summarization task. In their model, they first defined a coverage vector $u^e_t$ as the sum of attention distributions of the previous decoding steps, i.e.,

$$u^e_{tj} = \sum_{k=1}^{t-1} \alpha^e_{kj}.$$

Thus, it contains the accumulated attention information on each token in the source article during the previous decoding steps. The coverage vector will then be used as an additional input to calculate the attention score

$$s^e_{tj} = (v_{align})^\top \tanh\left(W_{align}(h^e_j \oplus h^d_t \oplus u^e_{tj}) + b_{align}\right).$$

As a result, the attention at the current decoding time-step is aware of the attention during the previous decoding steps. Moreover, they defined a novel coverage loss to ensure that the decoder does not repeatedly attend to the same locations when generating multi-sentence summaries. Here, the coverage loss is defined as

$$covloss_t = \sum_{j} \min(\alpha^e_{tj}, u^e_{tj}),$$

which is upper bounded by $1$, since $\sum_j \min(\alpha^e_{tj}, u^e_{tj}) \leq \sum_j \alpha^e_{tj} = 1$.
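As a quick illustration of how the coverage loss penalizes repeated attention, the sketch below accumulates a coverage vector over decoding steps and evaluates the per-step loss. The function name `coverage_losses` and the attention values are made up for the example.

```python
def coverage_losses(attn_steps):
    """Return the coverage loss at every decoding step.

    attn_steps[t][j] is the attention weight on source position j at step t.
    The coverage vector is the sum of attention distributions from all
    previous steps, so it is all zeros at the first step.
    """
    num_src = len(attn_steps[0])
    coverage = [0.0] * num_src
    losses = []
    for attn in attn_steps:
        losses.append(sum(min(a, c) for a, c in zip(attn, coverage)))
        coverage = [c + a for c, a in zip(coverage, attn)]
    return losses


# Attending to the same position twice versus moving on to a new position.
repeat = coverage_losses([[0.9, 0.1], [0.9, 0.1]])
shift = coverage_losses([[0.9, 0.1], [0.1, 0.9]])
```

Repeating the same attention distribution reaches the upper bound of 1 at the second step, while shifting attention to the other position yields a loss of only 0.2.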

II-D4 Distraction [107]

The coverage mechanism has also been used in [107] (where it is known as distraction) for the document summarization task. In addition to the distraction mechanism over the attention, they also proposed a distraction mechanism over the encoder context vectors. Both mechanisms are used to prevent the model from attending to certain regions of the source article repeatedly. Formally, given the context vector $z^e_t$ at the current decoding step and all historical context vectors $(z^e_1, z^e_2, \dots, z^e_{t-1})$ (see Eq. (7)), the distracted context vector is defined as

$$z^{e,dist}_t = \tanh\left(W_{dist,z} z^e_t - W_{hist,z} \sum_{k=1}^{t-1} z^e_k\right),$$

where both $W_{dist,z}$ and $W_{hist,z}$ are diagonal parameter matrices.

II-E Improving Encoded Representations

Although LSTM and bi-directional LSTM encoders (GRU and bi-directional GRU encoders are also often seen in abstractive summarization papers) have been commonly used in the seq2seq models for abstractive text summarization [14, 12, 17], representations of the source articles are still believed to be sub-optimal. In this section, we review some approaches that aim to improve the encoding process.

Fig. 5: An illustration of the selective encoder.

II-E1 Selective Encoding [74]

The selective encoding model was proposed for the abstractive sentence summarization task [74]. Built upon an attention-based encoder-decoder framework, it introduces a selective gate network into the encoder for the purpose of distilling salient information from source articles. A second-layer representation, namely, the distilled representation, of a source article is constructed over the representation of the first LSTM layer (a bi-directional GRU encoder in the original work). Formally, the distilled representation of each token in the source article is defined as

$$h^e_{gate,j} = g_{gate,j} \odot h^e_j,$$

where $g_{gate,j}$ denotes the selective gate for token $x_j$ and is calculated as follows:

$$g_{gate,j} = \sigma\left(W_{gate} h^e_{state} + U_{gate} h^e_j + b_{gate}\right),$$

where $h^e_{state} = \overrightarrow{h}^e_J \oplus \overleftarrow{h}^e_1$. The distilled representations are then used for decoding. Such a gate network can control the information flow from an encoder to a decoder and can also select salient information. Therefore, it boosts the performance of the sentence summarization task [74].

II-E2 Read-Again Encoding [38]

Intuitively, the read-again mechanism is motivated by human readers who read an article several times before writing a summary. To simulate this cognitive process, a read-again encoder reads a source article twice and outputs two-level representations. In the first read, an LSTM encodes tokens and the article as $(h^{e,1}_1, h^{e,1}_2, \dots, h^{e,1}_J)$ and $h^{e,1}_J$, respectively. In the second read, we use another LSTM to encode the source text based on the outputs of the first read. Formally, the encoder hidden state of the second read is updated by

$$h^{e,2}_j = \mathrm{LSTM}\left(h^{e,2}_{j-1}, E_{x_j} \oplus h^{e,1}_j \oplus h^{e,1}_J\right).$$

The hidden states of the second read will be passed into the decoder for summary generation.

Fig. 6: An illustration of the read-again encoder.

II-F Improving Decoder

II-F1 Embedding Weight Sharing [17]

Sharing the embedding weights with the decoder is a practical approach that can boost the performance since it allows us to reuse the semantic and syntactic information in an embedding matrix during summary generation [108, 17]. Suppose the embedding matrix is represented by $W_{emb}$; we can formulate the matrix used in the summary generation (see Eq. (9)) as follows:

$$W_{d2v} = \tanh(W_{emb} \cdot W_{proj}).$$

By sharing model weights, the number of parameters is significantly smaller than in a standard model, since the number of parameters for $W_{proj}$ is $|E_{y_t}| \times |\tilde{h}^d_t|$, while that for $W_{d2v}$ is $|\tilde{h}^d_t| \times |\mathcal{V}|$, where $|h|$ represents the dimension of a vector $h$ and $|\mathcal{V}|$ denotes the size of the vocabulary.
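A back-of-the-envelope count shows the scale of the saving. The sketch assumes the output matrix is built from a $|\mathcal{V}| \times d_{emb}$ embedding matrix and a $d_{emb} \times d_{hid}$ projection, and the sizes below (50,000 tokens, 128-dimensional embeddings, 256-dimensional hidden states) are hypothetical values chosen only for illustration.

```python
def output_layer_params(vocab_size, emb_dim, hidden_dim, share_embeddings):
    """Count the new weights needed for the hidden-to-vocabulary matrix."""
    if share_embeddings:
        # Only the projection is learned; the embedding matrix is reused.
        return emb_dim * hidden_dim
    # Without sharing, a full hidden-to-vocabulary matrix is learned.
    return hidden_dim * vocab_size


shared = output_layer_params(50000, 128, 256, share_embeddings=True)
unshared = output_layer_params(50000, 128, 256, share_embeddings=False)
```

Under these assumed sizes, the shared variant learns 32,768 new weights instead of 12,800,000.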

II-F2 Deliberation [76]

When a human writes a summary, they usually first create a draft and then polish it based on the global context. Inspired by this polishing process, Xia et al. [76] proposed a deliberation network for sequence generation tasks. A deliberation network can have more than one decoder (there are two decoders in [76]). The first one is similar to the decoder in the basic seq2seq model described in Fig. 2. Let us denote the encoder hidden states by $(h^e_1, h^e_2, \dots, h^e_J)$ and the first-pass decoder hidden states by $(h^{d,1}_1, h^{d,1}_2, \dots, h^{d,1}_{T_1})$. During decoding, the second-pass decoder, which is used to polish the draft written in the first pass, attends to both the encoder and the first-pass decoder. Therefore, we obtain two context vectors $z^e_t = \sum_{j=1}^{J} \alpha^e_{tj} h^e_j$ and $z^{d,1}_t = \sum_{k=1}^{T_1} \alpha^{d,1}_{tk} h^{d,1}_k$ at time step $t$, where $\alpha^e_{tj}$ and $\alpha^{d,1}_{tk}$ are attention weights. As we can see, the two context vectors capture global information of the encoded article and the sequence generated by the first-pass decoder. The second-pass decoder will take them as input and update the next hidden state with

$$h^{d,2}_{t+1} = \mathrm{LSTM}\left(h^{d,2}_t, E_{y_t} \oplus z^e_t \oplus z^{d,1}_t\right).$$

Finally, the vocabulary distribution at decoding step $t$ is calculated by

$$P_{vocab,t} = \mathrm{softmax}\left(W_{d2v}(h^{d,2}_t \oplus z^e_t \oplus z^{d,1}_t) + b_{d2v}\right).$$

The deliberation network has also boosted the performance of seq2seq models in NMT and abstractive text summarization tasks.

II-F3 Deep Recurrent Generative Decoder (DRGD) [77]

Conventional encoder-decoder models calculate hidden states and attention weights in an entirely deterministic fashion, which limits the capability of representations and results in low quality summaries. Incorporating variational auto-encoders (VAEs) [78, 79] into the encoder-decoder framework provides a practical solution for this problem. Inspired by the variational RNN proposed in [109] to model highly structured sequential data, Li et al. [77] introduced a seq2seq model with DRGD that aims to capture latent structure information of summaries and improve the summarization quality. This model employs GRU as the basic recurrent model for both encoder and decoder. However, to be consistent with this survey paper, we will explain their ideas using LSTM instead.

There are two LSTM layers to calculate the decoder hidden state $h^d_t$. At decoding step $t$, the first layer hidden state is updated by $h^{d,1}_t = \mathrm{LSTM}_1(h^{d,1}_{t-1}, E_{y_{t-1}})$. Then, the attention weights $\alpha^e_{tj}$ and the context vector $z^e_t$ are calculated with the encoder hidden states $h^e$ and the first layer decoder hidden state $h^{d,1}_t$ using Eqs. (5), (6) and (7). For the second layer, the hidden state is updated with $h^{d,2}_t = \mathrm{LSTM}_2(h^{d,2}_{t-1}, E_{y_{t-1}} \oplus z^e_t)$. Finally, the decoder hidden state is obtained by $h^d_t = h^{d,1}_t \oplus h^{d,2}_t$, where $h^d_t$ is also referred to as the deterministic hidden state.

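A VAE is incorporated into the decoder to capture latent structure information of summaries, which is represented by a multivariate Gaussian distribution. By using the reparameterization trick [79, 110], latent variables can be first expressed as

$$\xi_t = \mu_t + \eta_t \odot \varepsilon,$$

where the noise variable $\varepsilon \sim \mathcal{N}(0, I)$, and the Gaussian parameters $\mu_t$ and $\eta_t$ in the network are calculated by

$$\mu_t = W_{vae,\mu} h^{vae}_t + b_{vae,\mu}, \qquad \log(\eta_t^2) = W_{vae,\eta} h^{vae}_t + b_{vae,\eta},$$

where $h^{vae}_t$ is a hidden vector of the encoding process of the VAE and is defined as

$$h^{vae}_t = \sigma\left(W_{vae}(E_{y_{t-1}} \oplus \xi_{t-1} \oplus h^d_{t-1}) + b_{vae}\right).$$

With the latent structure variables $\xi_t$, the output hidden states $\tilde{h}^d_t$ can be formulated as

$$\tilde{h}^d_t = \tanh\left(W_{\xi}(\xi_t \oplus h^d_t) + b_{\xi}\right).$$

Finally, the vocabulary distribution is calculated by

$$P_{vocab,t} = \mathrm{softmax}(W_{d2v} \tilde{h}^d_t + b_{d2v}).$$

We primarily focused on the network structure of DRGD in this section. The details of VAE and its derivations can be found in [78, 77, 79, 110]. In DRGD, the VAE is incorporated into the decoder of a seq2seq model; more recent works have also used VAEs in the attention layer [111] and for the sentence compression task [40].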
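The reparameterization trick mentioned above can be sketched in a few lines. This is a generic illustration rather than the DRGD implementation: it assumes a diagonal Gaussian parameterized by a mean and a log-variance vector, and the function name `sample_latent` and the toy parameters are hypothetical.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw a latent vector as mean + std * noise, with noise ~ N(0, I).

    Sampling the noise separately keeps the mean and log-variance inside a
    deterministic expression, which is what allows gradients to flow
    through them in an actual neural network.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]


rng = random.Random(0)
mu = [1.0, -2.0]
log_var = [0.0, math.log(0.25)]  # standard deviations 1.0 and 0.5
samples = [sample_latent(mu, log_var, rng) for _ in range(20000)]

mean0 = sum(s[0] for s in samples) / len(samples)
mean1 = sum(s[1] for s in samples) / len(samples)
std1 = math.sqrt(sum((s[1] - mean1) ** 2 for s in samples) / len(samples))
```

With enough samples, the empirical mean and standard deviation approach the chosen Gaussian parameters, which is a simple sanity check for the sampling step.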

II-G Summarizing Long Document

Compared with sentence summarization, abstractive summarization for very long documents has been relatively less investigated. Recently, attention-based seq2seq models with a pointing/copying mechanism have shown their power in summarizing long documents with 400 and 800 tokens [12, 17]. However, the performance improvement is primarily attributed to copying and repetition/redundancy-avoidance techniques [14, 12, 17]. For very long documents, we need to consider several important factors to generate high quality summaries [75], such as saliency, fluency, coherence and novelty. Usually, seq2seq models combined with the beam search decoding algorithm can generate fluent and human-readable sentences. In this section, we review models that aim to improve the performance of long document summarization from the perspective of saliency.

Seq2seq models for long document summarization usually consist of an encoder with a hierarchical architecture, which is used to capture the hierarchical structure of the source documents. The top-level salient information includes the important sentences [14, 75], chunks of texts [55], sections [84], and paragraphs [18], while the lower-level salient information represents keywords. Hereafter, we will use the term ‘chunk’ to represent the top-level information. Fig. 7 shows the neural network structure of a hierarchical encoder, which first uses a word encoder to encode tokens in a chunk for the chunk representation, and then uses a chunk encoder to encode the chunks in a document for the document representation. In this paper, we only consider the single-layer forward LSTM for both word and chunk encoders (the deep communicating agents model [18], which requires multiple layers of bi-directional LSTM, falls outside the scope of this survey).

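Suppose the hidden states of chunk $i$ and word $j$ in this chunk are represented by $h^{chk}_i$ and $h^{wd}_{ij}$. At decoding step $t$, we can calculate the word-level attention weight $\alpha^{wd}_{tij}$ for the current decoder hidden state $h^d_t$ as follows:

$$\alpha^{wd}_{tij} = \frac{\exp(s^{wd}_{tij})}{\sum_{k,l} \exp(s^{wd}_{tkl})}.$$

At the same time, we can also calculate the chunk-level attention weight $\alpha^{chk}_{ti}$ as follows:

$$\alpha^{chk}_{ti} = \frac{\exp(s^{chk}_{ti})}{\sum_{k} \exp(s^{chk}_{tk})},$$

where both alignment scores $s^{wd}_{tij} = s(h^{wd}_{ij}, h^d_t)$ and $s^{chk}_{ti} = s(h^{chk}_i, h^d_t)$ can be calculated using Eq. (6). In this section, we will review four different models that are based on the hierarchical encoder for the task of long document text summarization.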

Fig. 7: A hierarchical encoder which first encodes tokens for the chunk representations and then encodes chunks for the document representation.

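II-G1 Hierarchical Attention [14]

The intuition behind hierarchical attention is that words in less important chunks should be less attended. Therefore, with the chunk-level attention distribution $\alpha^{chk}_t$ and word-level attention distribution $\alpha^{wd}_t$, we first calculate the re-scaled word-level attention distribution by

$$\alpha^{scale}_{tij} = \frac{\alpha^{chk}_{ti} \alpha^{wd}_{tij}}{\sum_{k,l} \alpha^{chk}_{tk} \alpha^{wd}_{tkl}}.$$

This re-scaled attention will then be used to calculate the context vector using Eq. (7), i.e.,

$$z^e_t = \sum_{i,j} \alpha^{scale}_{tij} h^{wd}_{ij}.$$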


It should be noted that such a hierarchical attention framework is different from the hierarchical attention network proposed in [112], where the chunk representation is obtained using

$$h^{chk}_i = \sum_{j} \alpha^{wd}_{ij} h^{wd}_{ij}$$

instead of the last hidden state of the word-encoder.
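The re-scaling step of hierarchical attention can be sketched as follows, assuming the chunk-level and word-level attention weights have already been computed. The nested-list layout, the function name `rescale_attention`, and the toy weights are illustrative assumptions.

```python
def rescale_attention(chunk_attn, word_attn):
    """Re-scale word-level attention by chunk-level attention.

    chunk_attn[i]   attention weight of chunk i (sums to 1 over chunks)
    word_attn[i][j] attention weight of word j in chunk i
                    (sums to 1 over all words in the document)
    """
    scaled = [[c * w for w in words]
              for c, words in zip(chunk_attn, word_attn)]
    total = sum(sum(row) for row in scaled)
    # Renormalize so that the result is again a distribution over words.
    return [[v / total for v in row] for row in scaled]


# Two chunks with two words each; word-level attention is uniform.
chunk_attn = [0.8, 0.2]
word_attn = [[0.25, 0.25], [0.25, 0.25]]
scaled = rescale_attention(chunk_attn, word_attn)
```

Even though the word-level attention is uniform here, words in the first chunk end up with four times the weight of words in the second chunk (0.4 versus 0.1), reflecting the chunk-level importance.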

II-G2 Discourse-Aware Attention [84]

The idea of the discourse-aware attention is similar to that of the hierarchical attention given by Eq. (43). The main difference between these two attention models is that the re-scaled attention distribution in the discourse-aware attention is calculated by

$$\alpha^{scale}_{tij} = \frac{\exp(\alpha^{chk}_{ti} s^{wd}_{tij})}{\sum_{k,l} \exp(\alpha^{chk}_{tk} s^{wd}_{tkl})}.$$

II-G3 Coarse-to-Fine Attention [55]

The coarse-to-fine (C2F) attention was proposed for computational efficiency. Similar to the hierarchical attention [14], the proposed model also has both chunk-level attention and word-level attention. However, instead of using word-level hidden states in all chunks for calculating the context vector, the C2F attention method first samples a chunk $i$ from the chunk-level attention distribution, and then calculates the context vector using

$$z^e_t = \sum_{j} \alpha^{wd}_{tij} h^{wd}_{ij}.$$

At test time, the stochastic sampling of the chunks will be replaced by a greedy search.

II-G4 Graph-based Attention [75]

The aforementioned hierarchical attention mechanism implicitly captures the chunk-level salient information, where the importance of a chunk is determined solely by its attention weight. In contrast, the graph-based attention framework allows us to calculate the saliency scores explicitly using the PageRank algorithm [85, 113] on a graph whose vertices and edges are chunks of texts and their similarities, respectively. Formally, at decoding time-step $t$, saliency scores for all input chunks are obtained by

$$f^t = (1 - \lambda)\left(I - \lambda W_{adj} D_{adj}^{-1}\right)^{-1} \chi_T,$$

where the adjacency matrix $W_{adj}$ (similarity of chunks) is calculated by

$$W_{adj}(i, j) = (h^{chk}_i)^\top W_{par} h^{chk}_j.$$

$D_{adj}$ is a diagonal matrix with its $(i, i)$-element equal to the sum of the $i$-th column of $W_{adj}$. $\lambda$ is a damping factor. The vector $\chi_T$ is defined as

$$\chi_{T,i} = \begin{cases} \frac{1}{|T|} & i \in T \\ 0 & \text{otherwise} \end{cases}$$

where $T$ is a topic (see [113, 75] for more details). Finally, the graph-based attention distribution over a chunk can be obtained by

$$\alpha^{chk}_{ti} = \frac{\max(f^t_i - f^{t-1}_i, 0)}{\sum_{k} \max(f^t_k - f^{t-1}_k, 0)},$$

where $f^0$ is initialized with $0$. It can be seen that the graph-based attention mechanism will focus on chunks that rank higher than in the previous decoding step, i.e., $f^t_i > f^{t-1}_i$. Therefore, it provides an efficient way to select salient information from source documents.
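The saliency computation can be approximated without a matrix inverse by iterating the PageRank update until it converges. The sketch below is a generic topic-sensitive PageRank under stated assumptions, not the implementation of [75]: it assumes non-negative similarities, uses power iteration in place of the closed form, and all names (`saliency_scores`, `graph_attention`) and values are hypothetical.

```python
def saliency_scores(sim, topic, damping=0.85, iters=200):
    """Topic-sensitive PageRank scores by power iteration.

    sim[i][j] non-negative similarity between chunks i and j
    topic     set of node indices that receive the restart mass
    """
    n = len(sim)
    col_sums = [sum(sim[i][j] for i in range(n)) for j in range(n)]
    restart = [1.0 / len(topic) if i in topic else 0.0 for i in range(n)]
    f = list(restart)
    for _ in range(iters):
        # One update: propagate scores along column-normalized similarities,
        # then mix in the restart distribution.
        f = [damping * sum(sim[i][j] / col_sums[j] * f[j] for j in range(n))
             + (1.0 - damping) * restart[i]
             for i in range(n)]
    return f


def graph_attention(f_now, f_prev):
    """Attend only to chunks whose saliency increased since the last step."""
    gains = [max(a - b, 0.0) for a, b in zip(f_now, f_prev)]
    total = sum(gains)
    return [g / total for g in gains] if total > 0 else gains


sim = [[1.0, 0.5, 0.1],
       [0.5, 1.0, 0.4],
       [0.1, 0.4, 1.0]]
f = saliency_scores(sim, topic={0})
attn = graph_attention(f, [0.0, 0.0, 0.0])
```

The fixed point of this iteration satisfies the same linear system as the closed-form expression, so for a well-behaved similarity matrix both routes give the same scores up to numerical tolerance.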

Year Reference Highlights Framework Training Optimizer DUC Gigaword CNN/DM Others Metrics
Rush et al. [15] Attention Based Summarization (ABS) Bag-of-words, Convolution, Attention Neural Network Language Model (NNLM) XENT SGD - - ROUGE
2015 Lopyrev et al. [114] Simple Attention LSTM LSTM XENT RMSProp - - - BLEU
Ranzato et al. [46] Sequence-level Training Elman, LSTM Elman, LSTM XENT, DAD, E2E, MIXER SGD - - - ROUGE, BLEU
Chopra et al. [26] Recurrent Attentive Summarizer Convolution Encoder, Attentive Encoder Elman, LSTM XENT SGD - - ROUGE
Nallapati et al. [14] Switch Generator-Pointer, Temporal-Attention, Hierarchical-Attention RNN, Feature-rich Encoder RNN XENT Adadelta - ROUGE
Miao et al. [40] Auto-encoding Sentence Compression, Forced-Attention Sentence Compression, Pointer Network Encoder Compressor Decoder XENT+RL Adam - - - ROUGE
2016 Chen et al. [107] Distraction GRU GRU XENT Adadelta - - - CNN, LCSTS ROUGE
Gulcehre et al. [37] Pointer softmax GRU GRU XENT Adadelta - - - ROUGE
Gu et al. [36] CopyNet GRU GRU XENT SGD - - - LCSTS ROUGE
Zeng et al. [38] Read-again, Copy Mechanism LSTM/GRU/Hierarchical read-again encoder LSTM XENT SGD - - ROUGE
Li et al. [93] Diverse Beam Decoding LSTM LSTM RL SGD - - - - ROUGE
Takase et al. [115] Abstract Meaning Representation (AMR) based on ABS. Attention-based AMR encoder NNLM XENT SGD - - ROUGE
See et al. [12] Pointer-Generator Network, Coverage LSTM LSTM XENT Adadelta - - - ROUGE, METEOR
Paulus et al. [17] A Deep Reinforced Model, Intra-temporal and Intra-decoder Attention, Weight Sharing LSTM LSTM XENT + RL Adam - - NYT ROUGE, Human
Zhou et al. [74] Selective Encoding, Abstractive Sentence Summarization GRU GRU XENT SGD - MSR-ATC ROUGE
Xia et al. [76] Deliberation Networks LSTM LSTM XENT Adadelta - - - ROUGE
Nema et al. [116] Query-based, Diversity based Attention GRU query encoder, document encoder GRU XENT Adam - - - Debate-pedia ROUGE
Tan et al. [75] Graph-based Attention Hierarchical Encoder LSTM XENT Adam - - CNN, DailyMail ROUGE
Ling et al. [55] Coarse-to-fine Attention LSTM LSTM RL SGD - - - ROUGE, PPL
2017 Zhang et al. [56] Sentence Simplification, Reinforcement Learning LSTM LSTM RL Adam - - - Newsela, WikiSmall, WikiLarge BLEU, FKGL, SARI
Li et al. [77] Deep Recurrent Generative Decoder (DRGD) GRU GRU, VAE XENT, VAE Adadelta - LCSTS ROUGE
Liu et al. [117] Adversarial Training Pointer-Generator Network GAN Adadelta - - - ROUGE, Human
Pasunuru et al. [86] Multi-Task with Entailment Generation LSTM document encoder and premise Encoder LSTM Summary and Entailment Decoder Hybrid-Objective Adam - SNLI ROUGE, METEOR, BLEU, CIDEr-D
Gehring et al. [67] Convolutional Seq2seq, Position Embeddings, Gated Linear Unit, Multi-step Attention CNN CNN XENT Adam - - ROUGE
Fan et al. [72] Convolutional Seq2seq, Controllable CNN CNN XENT Adam - - ROUGE, Human
TABLE I: An overview of different seq2seq models for the neural abstractive text summarization (2015-2017).
Year Reference Highlights Framework Training Optimizer DUC Gigaword CNN/DM Others Metrics
Celikyilmaz et al. [18] Deep Communicating Agents, Semantic Cohesion Loss LSTM LSTM Hybrid-Objective Adam - - NYT ROUGE, Human
Chen et al. [59] Reinforce-Selected Sentence Rewriting LSTM Encoder Extractor Abstractor XENT + RL SGD - - ROUGE, Human
Hsu et al. [80] Abstraction + Extraction, Inconsistency Loss Extractor: GRU. Abstractor: Pointer-generator Network Hybrid-Objective + RL Adadelta - - - ROUGE, Human
Li et al. [58] Actor-Critic GRU GRU RL Adadelta - LCSTS ROUGE
Li et al. [82] Abstraction + Extraction, Key Information Guide Network (KIGN) KIGN: LSTM. Framework: Pointer-Generator Network XENT Adadelta - - - ROUGE
Lin et al. [118] Global Encoding, Convolutional Gated Unit LSTM LSTM XENT Adam - - LCSTS ROUGE
Pasunuru et al. [57] Multi-Reward Optimization for RL: ROUGE, Saliency and Entailment. LSTM LSTM RL Adam - SNLI, MultiNLI, SQuAD ROUGE, Human
Song et al. [41] Structured-Infused Copy Mechanisms Pointer-Generator Network Hybrid-Objective Adam - - - ROUGE, Human
2018 Cohan et al. [84] Discourse Aware Attention Hierarchical RNN LSTM Encoder LSTM XENT Adagrad - - - PubMed, arXiv ROUGE
Guo et al. [87] Multi-Task Summarization with Entailment and Question Generation Multi-Task Encoder-Decoder Framework Hybrid-Objective Adam SQuAD, SNLI ROUGE, METEOR
Cibils et al. [95] Diverse Beam Search, Plagiarism and Extraction Scores Pointer-Generator Network XENT Adagrad - - - ROUGE
Wang et al. [119] Topic Aware Attention CNN CNN RL - - LCSTS ROUGE
Kryściński et al. [120] Improve Abstraction LSTM Encoder Decoder: Contextual Model and Language Model XENT + RL Asynchronous Gradient Descent Optimizer - - - ROUGE, Novel n-gram Test, Human
Gehrmann et al. [43] Bottom-up Attention, Abstraction + Extraction Pointer-Generator Network Hybrid-Objective Adagrad - - NYT ROUGE, Novel
Zhang et al. [121] Learning to Summarize Radiology Findings Pointer-Generator Network + Background Encoder XENT Adam - - - Radiology Reports ROUGE
Jiang et al. [42] Closed-book Training Pointer-Generator Network + Closed-book Decoder Hybrid-Objective + RL Adam - - ROUGE, METEOR
Chung et al. [122] Main Pointer Generator Pointer-Generator Network + Document Encoder XENT Adadelta - - - ROUGE
Chen et al. [123] Iterative Text Summarization GRU encoder, GRU decoder, iterative unit Hybrid-Objective Adam - - ROUGE
TABLE II: An overview of different seq2seq models for the neural abstractive text summarization (2018).

II-H Extraction + Abstraction

Extractive summarization approaches usually show better performance compared to abstractive approaches [12, 14, 13], especially with respect to ROUGE measures. One of the advantages of the extractive approaches is that they can summarize source articles by extracting salient snippets and sentences directly from these documents [81], while abstractive approaches rely on a word-level attention mechanism to determine the most relevant words to the target words at each decoding step. In this section, we review several studies that have attempted to improve the performance of abstractive summarization models by combining them with extractive models.

II-H1 Extractor + Pointer-Generator Network [80]

This model proposes a unified framework that tries to leverage the sentence-level salient information from an extractive model and incorporate it into an abstractive model (a pointer-generator network). More formally, inspired by the hierarchical attention mechanism [14], they replaced the attention distribution $\alpha^e_t$ in the abstractive model with a scaled version $\alpha^{scale}_t$, where the attention weights are expressed as follows:

$$\alpha^{scale}_{tj} = \frac{\alpha^e_{tj} \beta_{tj}}{\sum_{k} \alpha^e_{tk} \beta_{tk}}.$$

Here, $\beta_{tj}$ is the sentence-level salient score of the sentence at word position $j$ and decoding step $t$. Different from [14], the salient scores (sentence-level attention weights) are obtained from another deep neural network known as the extractor [80].

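During training, in addition to the cross-entropy and coverage losses used in the pointer-generator network, this paper also proposed two other losses, i.e., extractor loss and inconsistency loss. The extractor loss is used to train the extractor and is defined as follows:

$$L_{ext} = -\frac{1}{N} \sum_{n=1}^{N} \left(g_n \log \beta_n + (1 - g_n) \log(1 - \beta_n)\right),$$

where $\beta_n$ is the extractor's salient score for the $n$-th sentence, $g_n$ is the ground truth label for the $n$-th sentence and $N$ is the total number of sentences. The inconsistency loss is expressed as

$$L_{inc} = -\frac{1}{T} \sum_{t=1}^{T} \log\left(\frac{1}{|\mathcal{K}|} \sum_{j \in \mathcal{K}} \alpha^e_{tj} \beta_{tj}\right),$$

where $\beta_{tj}$ is the salient score of the sentence that contains word position $j$, $\mathcal{K}$ is the set of the top-$K$ attended words and $T$ is the total number of words in a summary. Intuitively, the inconsistency loss is used to ensure that the sentence-level attentions in the extractive model and word-level attentions in the abstractive model are consistent with each other. In other words, when word-level attention weights are high, the corresponding sentence-level attention weights should also be high.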
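The intuition behind the inconsistency loss can be checked numerically. The sketch below assumes that each word position maps to exactly one sentence and that the sentence-level scores do not change across decoding steps; the function name `inconsistency_loss` and the toy numbers are hypothetical.

```python
import math


def inconsistency_loss(word_attn, sent_scores, word_to_sent, top_k):
    """Average negative log agreement between word and sentence attention.

    word_attn[t][j] word-level attention at decoding step t
    sent_scores[n]  sentence-level salient score of sentence n
    word_to_sent[j] index of the sentence that contains word position j
    """
    total = 0.0
    for attn in word_attn:
        # Indices of the top-k attended word positions at this step.
        top = sorted(range(len(attn)), key=lambda j: attn[j],
                     reverse=True)[:top_k]
        agreement = sum(attn[j] * sent_scores[word_to_sent[j]]
                        for j in top) / len(top)
        total += -math.log(agreement)
    return total / len(word_attn)


word_to_sent = [0, 0, 1, 1]            # two sentences with two words each
word_attn = [[0.4, 0.4, 0.1, 0.1]]     # one decoding step
consistent = inconsistency_loss(word_attn, [0.9, 0.1], word_to_sent, top_k=2)
inconsistent = inconsistency_loss(word_attn, [0.1, 0.9], word_to_sent, top_k=2)
```

When the extractor assigns a high score to the sentence whose words are most attended, the loss is small; when it favours the other sentence, the loss grows.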

II-H2 Key-Information Guide Network (KIGN) [82]

This approach uses a guiding generation mechanism that leverages the key (salient) information, i.e., keywords, to guide the decoding process. This is a two-step procedure. First, keywords are extracted from source articles using the TextRank algorithm [83]. Second, a KIGN encodes the key information and incorporates it into the decoder to guide the generation of summaries. Technically speaking, we can use a bi-directional LSTM to encode the key information, and the output vector is the concatenation of hidden states, i.e., $h^{key} = \overleftarrow{h}^{key}_1 \oplus \overrightarrow{h}^{key}_N$, where $N$ is the length of the key information sequence. Then, the alignment mechanism is modified as

$$s^e_{tj} = (v_{align})^\top \tanh\left(W^e_{align} h^e_j + W^d_{align} h^d_t + W^{key}_{align} h^{key}\right).$$

Similarly, the soft-switch in the pointer-generator network is calculated using

$$p_{gen,t} = \sigma\left(W_{s,z} z^e_t + W_{s,h} h^d_t + W_{s,key} h^{key} + b_s\right).$$


II-H3 Reinforce-Selected Sentence Rewriting [59]

Most models introduced in this survey are built upon the encoder-decoder framework [14, 12, 17], in which the encoder reads source articles and turns them into vector representations, and the decoder takes the encoded vectors as input and generates summaries. Unlike these models, the reinforce-selected sentence rewriting model [59] consists of two seq2seq models. The first one is an extractive model (extractor) which is designed to extract salient sentences from a source article, while the second is an abstractive model (abstractor) which paraphrases and compresses the extracted sentences into a short summary. The abstractor network is a standard attention-based seq2seq model with the copying mechanism for handling OOV words. For the extractor network, an encoder first uses a CNN to encode tokens and obtains representations of sentences, and then it uses an LSTM to encode the sentences and represent a source document. With the sentence-level representations, the decoder (another LSTM) is designed to recurrently extract salient sentences from the document using the pointing mechanism [35]. This model has achieved state-of-the-art performance on the CNN/Daily Mail dataset and was demonstrated to be computationally more efficient than the pointer-generator network [12].

III Training Strategies

In this section, we review different strategies to train the seq2seq models for abstractive text summarization. As discussed in [46], there are two categories of training methodologies, i.e., word-level and sequence-level training. The commonly used teacher forcing algorithm [44, 49] and cross-entropy training [48, 124] belong to the first category, while different RL-based algorithms [46, 53, 52] fall into the second. We now discuss the basic ideas of different training algorithms and their applications to seq2seq models for the text summarization. A comprehensive survey of deep RL for seq2seq models can be found in [47].

III-A Word-Level Training

The word-level training for language models represents methodologies that try to optimize predictions of the next token [46]. For example, in abstractive text summarization, given a source article $x$, a seq2seq model generates a summary $y = (y_1, y_2, \dots, y_T)$ with the probability $P_\theta(y|x)$, where $\theta$ represents model parameters (e.g., weights $W$ and bias $b$). In a neural language model [48], this probability can be expanded to

$$P_\theta(y|x) = \prod_{t=1}^{T} P_\theta(y_t | y_{<t}, x),$$

where each multiplier $P_\theta(y_t | y_{<t}, x)$, known as likelihood, is a conditional probability of the next token $y_t$ given all previous ones, denoted by $y_{<t} = (y_1, y_2, \dots, y_{t-1})$. Intuitively, the text generation process can be described as follows. Starting with a special token ‘SOS’ (start of sequence), the model generates one token $y_t$ at a time with the probability $P_\theta(y_t | y_{<t}, x) = P_{vocab,t}(y_t)$. This token can be obtained by a sampling method or a greedy search, i.e., $y_t = \arg\max_{y_t} P_{vocab,t}$ (see Fig. 8). The generated token will then be fed into the next decoding step. The generation is stopped when the model outputs the ‘EOS’ (end of sequence) token or when the length reaches a user-defined maximum threshold. In this section, we review different approaches for learning model parameters, i.e., $\theta$. We will start with the commonly used end-to-end training approach, i.e., cross-entropy training, and then move on to two different methods for avoiding the problem of exposure bias.
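The generation loop with greedy search can be sketched as follows. The "model" here is a hypothetical lookup table that conditions only on the last token, standing in for a trained decoder; the function names and probabilities are invented for illustration.

```python
def greedy_decode(next_token_probs, max_len, sos="SOS", eos="EOS"):
    """Generate tokens one at a time, always picking the most probable one.

    next_token_probs(prefix) must return a dict mapping candidate tokens to
    probabilities; this callable is a stand-in for a trained seq2seq decoder.
    """
    prefix = [sos]
    while len(prefix) - 1 < max_len:
        probs = next_token_probs(prefix)
        token = max(probs, key=probs.get)
        prefix.append(token)  # the generated token feeds the next step
        if token == eos:
            break
    return prefix[1:]


# Toy next-token distributions keyed by the previous token.
TABLE = {
    "SOS": {"france": 0.7, "argentina": 0.3},
    "france": {"beat": 0.6, "EOS": 0.4},
    "beat": {"argentina": 0.8, "EOS": 0.2},
    "argentina": {"EOS": 0.9, "beat": 0.1},
}


def toy_model(prefix):
    return TABLE[prefix[-1]]


summary = greedy_decode(toy_model, max_len=10)
truncated = greedy_decode(toy_model, max_len=2)
```

Generation stops either when ‘EOS’ is produced or when the maximum length is reached, mirroring the two stopping conditions described above.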

Fig. 8: Generation process with a greedy search.

III-A1 Cross-Entropy Training (XENT) [46]

To learn model parameters $\theta$, XENT maximizes the log-likelihood of observed sequences (ground-truth) $\hat{y} = (\hat{y}_1, \hat{y}_2, \dots, \hat{y}_T)$, i.e.,

$$\log P_\theta(\hat{y}|x) = \sum_{t=1}^{T} \log P_\theta(\hat{y}_t | \hat{y}_{<t}, x),$$

which is equivalent to minimizing the cross entropy (XE) loss,

$$loss_{XE} = -\log P_\theta(\hat{y}|x).$$

We show this training strategy in Fig. 9. The algorithm is also known as the teacher forcing algorithm [44, 49]. During training, it uses observed tokens (ground-truth) as input and aims to improve the probability of the next observed token at each decoding step. However, during testing, it relies on predicted tokens from the previous decoding step. This is the major difference between training and testing (see Fig. 8 and Fig. 9). Since the predicted tokens may not be the observed ones, this discrepancy accumulates over time and thus yields summaries that are very different from ground-truth summaries. This problem is known as exposure bias [46, 45, 44].
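A minimal sketch of the cross-entropy loss under teacher forcing is given below. As before, the "model" is a hypothetical lookup table conditioned on the last token only, and the probabilities are invented for illustration.

```python
import math


def xent_loss(next_token_probs, target, sos="SOS"):
    """Negative log-likelihood of a ground-truth sequence.

    At every step the ground-truth prefix is fed to the model (teacher
    forcing), regardless of what the model itself would have predicted.
    """
    prefix = [sos]
    loss = 0.0
    for gold in target:
        probs = next_token_probs(prefix)
        loss -= math.log(probs[gold])
        prefix.append(gold)  # feed the observed token, not the prediction
    return loss


# Toy next-token distributions keyed by the previous token.
TABLE = {
    "SOS": {"france": 0.7, "argentina": 0.3},
    "france": {"beat": 0.6, "EOS": 0.4},
    "beat": {"argentina": 0.8, "EOS": 0.2},
    "argentina": {"EOS": 0.9, "beat": 0.1},
}


def toy_model(prefix):
    return TABLE[prefix[-1]]


likely = xent_loss(toy_model, ["france", "beat", "argentina", "EOS"])
unlikely = xent_loss(toy_model, ["argentina", "beat", "EOS"])
```

The loss of the first target equals the negative log of 0.7 × 0.6 × 0.8 × 0.9, and the less probable target sequence receives a larger loss.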

Fig. 9: Training with the teacher forcing algorithm.

III-A2 Scheduled Sampling [46, 45, 44]

The scheduled sampling algorithm, also known as Data As Demonstrator (DAD) [46, 45], has been proposed to solve the exposure bias problem. As shown in Fig. 10, during training, the input at each decoding step comes from a sampler which can decide whether it is a model-generated token $y_t$ from the last step or an observed token $\hat{y}_t$ from training data. The sampling is based on a Bernoulli distribution

$$P_{dad}(y) = p_{dad}^{\mathbb{I}(y = \hat{y}_t)} \cdot (1 - p_{dad})^{\mathbb{I}(y = y_t)},$$

where $p_{dad}$ is the probability of using a token from training data and $\mathbb{I}(\cdot)$ is a binary indicator function. In the scheduled sampling algorithm, $p_{dad}$ is an annealing/scheduling function and decreases with training time from $1$ to $0$. As suggested by Bengio et al. [44], the scheduling function can take different forms, e.g.,

$$p_{dad} = \begin{cases} 1 - \alpha k & \text{linear} \\ \alpha^k & \text{exponential} \\ \frac{\alpha}{\alpha + \exp(k / \alpha)} & \text{inverse sigmoid} \end{cases}$$

where $k$ is the training step and $\alpha$ is a parameter that guarantees $p_{dad} \in [0, 1]$. This strategy is often referred to as a curriculum learning algorithm