In the past few years, neural abstractive text summarization with sequence-to-sequence (seq2seq) models has gained a lot of popularity. Many interesting techniques have been proposed to improve seq2seq models, making them capable of handling different challenges, such as saliency, fluency and human readability, and of generating high-quality summaries. Generally speaking, most of these techniques differ in one of these three categories: network structure, parameter inference, and decoding/generation. There are also other concerns, such as efficiency and parallelism for training a model. In this paper, we provide a comprehensive literature and technical survey on different seq2seq models for abstractive text summarization from the viewpoint of network structures, training strategies, and summary generation algorithms. Many models were first proposed for language modeling and generation tasks, such as machine translation, and later applied to abstractive text summarization. Therefore, we also provide a brief review of these models. As part of this survey, we also develop an open source library, namely the Neural Abstractive Text Summarizer (NATS) toolkit, for abstractive text summarization. An extensive set of experiments has been conducted on the widely used CNN/Daily Mail dataset to examine the effectiveness of several different neural network components. Finally, we benchmark two models implemented in NATS on two recently released datasets, i.e., Newsroom and Bytecup.
Neural Abstractive Text Summarization with Sequence-to-Sequence Models
Learning Framework for Neural Abstractive Text Summarization
In the modern era of big data, retrieving useful information from a large number of textual documents is a challenging task due to the unprecedented growth in the availability of blogs, news articles, and reports. Automatic text summarization provides an effective solution for summarizing these documents. The task of text summarization is to condense long documents into short summaries while preserving the important information and meaning of the documents [1, 2]. With such short summaries, the text content can be retrieved, processed and digested effectively and efficiently.
Generally speaking, there are two ways to do text summarization: extractive and abstractive. A method is considered to be extractive if words, phrases, and sentences in the summaries are selected from the source articles [4, 5, 6, 2, 7, 8, 9, 10]. Extractive methods are relatively simple and can produce grammatically correct sentences. The generated summaries usually retain the salient information of the source articles and match human-written summaries well [5, 11, 12, 13]. On the other hand, abstractive text summarization has attracted much attention since it is capable of generating novel words using language generation models grounded on representations of source documents [14, 15]. Thus, abstractive methods have a strong potential of producing high-quality summaries that are verbally innovative and can also easily incorporate external knowledge. In this category, many deep neural network based models have achieved better performance in terms of the commonly used evaluation measures (such as the ROUGE score) compared to traditional extractive approaches [17, 18]. In this paper, we primarily focus on the recent advances of sequence-to-sequence (seq2seq) models for the task of abstractive text summarization.
Seq2seq models have been successfully applied to a variety of natural language processing (NLP) tasks, such as machine translation [21, 22, 23, 24, 25], headline generation [15, 26, 27], text summarization [12, 14], and speech recognition [28, 29, 30]. Inspired by the success of neural machine translation (NMT), Rush et al. first introduced a neural attention seq2seq model with an attention based encoder and a neural network language model (NNLM) decoder to the abstractive sentence summarization task, which achieved a significant performance improvement over conventional methods. Chopra et al. further extended this model by replacing the feed-forward NNLM with a recurrent neural network (RNN). The model is also equipped with a convolutional attention-based encoder and an RNN (Elman or LSTM) decoder, and outperforms other state-of-the-art models on a commonly used benchmark dataset, i.e., the Gigaword corpus. Nallapati et al. introduced several novel elements to the RNN encoder-decoder architecture to address critical problems in abstractive text summarization, including (i) a feature-rich encoder to capture keywords, (ii) a switching generator-pointer to model out-of-vocabulary (OOV) words, and (iii) hierarchical attention to capture hierarchical document structures. They also established benchmarks for these models on the CNN/Daily Mail dataset [33, 34], which consists of pairs of news articles and multi-sentence highlights (summaries). Before this dataset was introduced, many abstractive text summarization models concentrated on compressing short documents to single sentence summaries [15, 26]. For the task of summarizing long documents into multi-sentence summaries, these models have several shortcomings: 1) They cannot accurately reproduce the salient information of source documents. 2) They cannot efficiently handle OOV words. 3) They tend to suffer from word- and sentence-level repetitions and generate unnatural summaries. To tackle the first two challenges, See et al. proposed a pointer-generator network that implicitly combines abstraction with extraction. This pointer-generator architecture can copy words from source texts via a pointer and generate novel words from a vocabulary via a generator. With the pointing/copying mechanism [35, 36, 37, 38, 39, 40, 41], factual information can be reproduced accurately and OOV words can also be handled in the summaries. Many subsequent studies that achieved state-of-the-art performance have also demonstrated the effectiveness of the pointing/copying mechanism [17, 18, 42, 43]. The third problem has been addressed by the coverage mechanism, intra-temporal and intra-decoder attention mechanisms, and some other heuristic approaches, like forcing a decoder to never output the same trigram more than once during testing.
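The trigram heuristic can be illustrated with a minimal Python sketch. It is not taken from any of the cited systems: the function names `has_repeated_trigram` and `block_repeats` and the toy candidate scores are assumptions made for illustration. At each decoding step, any candidate token that would complete a trigram already present in the partial summary is discarded.

```python
def has_repeated_trigram(tokens, candidate):
    """Return True if appending `candidate` would repeat an existing trigram."""
    if len(tokens) < 2:
        return False
    new_trigram = (tokens[-2], tokens[-1], candidate)
    # All trigrams already present in the partial summary.
    seen = {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}
    return new_trigram in seen


def block_repeats(tokens, scored_candidates):
    """Drop candidates that would create a repeated trigram.

    scored_candidates: list of (token, log_prob) pairs for the next step
    (hypothetical decoder output).
    """
    return [(tok, lp) for tok, lp in scored_candidates
            if not has_repeated_trigram(tokens, tok)]


prefix = ["the", "cat", "sat", "on", "the", "cat"]
candidates = [("sat", -0.1), ("slept", -1.2)]
allowed = block_repeats(prefix, candidates)  # "sat" would repeat "the cat sat"
```

In a beam search decoder, the same filter would typically be applied to every hypothesis before the beams are ranked.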
There are two other non-trivial issues with the current seq2seq framework, i.e., exposure bias and inconsistency of training and testing measurements [44, 45, 46, 47]. Based on the neural probabilistic language model, seq2seq models are usually trained by maximizing the likelihood of ground-truth tokens given their previous ground-truth tokens and hidden states (the Teacher Forcing algorithm [44, 49], see Fig. 9). However, at testing time (see Fig. 8), previous ground-truth tokens are unknown, and they are replaced with tokens generated by the model itself. Since the generated tokens have never been exposed to the decoder during training, decoding errors can accumulate quickly during sequence generation. This is known as exposure bias. The other issue is the mismatch of measurements: generated sequences are evaluated with metrics such as ROUGE [16] and BLEU scores, which are inconsistent with the log-likelihood function (cross-entropy loss) used in the training phase. These problems are alleviated by curriculum learning and reinforcement learning (RL) approaches.
Bengio et al. proposed a curriculum learning approach, known as scheduled sampling, to slowly change the input of the decoder from ground-truth tokens to model generated ones. Thus, the proposed meta-algorithm bridges the gap between training and testing. It is a practical solution for avoiding exposure bias. Ranzato et al. proposed a sequence level training algorithm, called MIXER (Mixed Incremental Cross-Entropy Reinforce), which consists of cross-entropy training, REINFORCE and curriculum learning. The REINFORCE algorithm can make use of any user-defined task-specific reward (e.g., non-differentiable evaluation metrics). Therefore, combined with curriculum learning, the proposed model is capable of addressing both issues of seq2seq models. However, REINFORCE suffers from the high variance of gradient estimators and instability during training [52, 53, 22]. Bahdanau et al. proposed an actor-critic based RL method which has relatively lower variance for gradient estimators. In the actor-critic method, an additional critic network is trained to compute value functions given the policy from the actor network (a seq2seq model), and the actor network is trained based on the estimated value functions (assumed to be exact) from the critic network. On the other hand, Rennie et al. introduced a self-critical sequence training (SCST) method which has a lower variance compared to the REINFORCE algorithm and does not need the second critic network.
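To make the self-critical idea concrete, the following Python sketch computes a per-sequence loss in which the reward of the greedy-decoded output serves as the baseline for a sampled output. This is a simplified illustration under stated assumptions: the function name `scst_loss` and the reward values are hypothetical, and a real implementation would obtain the log-probabilities from the decoder and the rewards from a metric such as ROUGE.

```python
def scst_loss(sample_log_probs, sample_reward, greedy_reward):
    """Self-critical loss for one sampled sequence.

    sample_log_probs: log-probabilities of the sampled tokens.
    sample_reward:    reward of the sampled sequence (e.g., a ROUGE score).
    greedy_reward:    reward of the greedy-decoded sequence, used as baseline.
    """
    advantage = sample_reward - greedy_reward
    # Minimizing this raises the likelihood of samples that beat the baseline
    # and lowers the likelihood of samples that fall below it.
    return -advantage * sum(sample_log_probs)


# Hypothetical values: the sample scores higher than the greedy output.
loss = scst_loss([-0.5, -1.0, -0.5], sample_reward=0.40, greedy_reward=0.30)
```

Because the baseline comes from the model's own greedy output, no separate critic network has to be trained, which is the property highlighted above.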
RL algorithms for training seq2seq models have achieved success in a variety of language generation tasks, such as image captioning, machine translation, and dialogue generation. Specific to abstractive text summarization, Ling et al. introduced a coarse-to-fine attention framework for the purpose of summarizing long documents. Their model parameters were learned with the REINFORCE algorithm. Zhang et al. used the REINFORCE algorithm and the curriculum learning strategy for the sentence simplification task. Paulus et al. first applied the self-critic policy gradient algorithm to training their seq2seq model with the copying mechanism and obtained state-of-the-art performance in terms of ROUGE scores. They proposed a mixed objective function that combines the RL loss with the traditional cross-entropy loss. Thus, their method can both leverage the non-differentiable evaluation metrics and improve the readability. Celikyilmaz et al. introduced a novel deep communicating agents method for abstractive summarization, where they also adopted the RL loss in their objective function. Pasunuru et al. applied the self-critic policy gradient algorithm to train the pointer-generator network. They also introduced two novel rewards (i.e., saliency and entailment rewards) in addition to the ROUGE metric to keep the generated summaries salient and logically entailed. Li et al. proposed a training framework based on the actor-critic method, where the actor network is an attention-based seq2seq model, and the critic network consists of a maximum likelihood estimator and a global summary quality estimator that is used to distinguish the generated and ground-truth summaries via a neural network binary classifier. Chen et al. proposed a compression-paraphrase multi-step procedure for abstractive text summarization, which first extracts salient sentences from documents and then rewrites them. In their model, they used an advantage actor-critic algorithm to optimize the sentence extractor for a better extraction strategy. Keneshloo et al. conducted a comprehensive summary of various RL methods and their applications in training seq2seq models for different NLP tasks. They also implemented these RL algorithms in an open source library (https://github.com/yaserkl/RLSeq2Seq/) constructed using the pointer-generator network as the base model.
Most of the prevalent seq2seq models that have attained state-of-the-art performance for sequence modeling and language generation tasks are encoder-decoder models based on RNNs, especially long short-term memory (LSTM) and gated recurrent unit (GRU) networks [19, 23]. Standard RNN models are difficult to train due to the vanishing and exploding gradients problems. LSTM is a solution for the vanishing gradients problem, but it does not address the exploding gradients issue, which is instead handled with a gradient norm clipping strategy. Another critical problem of RNN based models is the computation constraint for long sequences due to their inherently sequential dependence. In other words, the current hidden state in an RNN is a function of previous hidden states. Because of this dependence, an RNN cannot be parallelized within a sequence along the time-step dimension (see Fig. 2) during training and evaluation, and hence training becomes a major challenge for long sequences due to the computation time and memory constraints of GPUs.
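Gradient norm clipping can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in any particular library: the function name `clip_by_global_norm`, the nested-list representation of gradients, and the threshold of 1.0 are assumptions. If the joint L2 norm of all gradients exceeds a threshold, every gradient is rescaled by the same factor so that the norm equals the threshold.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradient vectors so that their joint L2 norm is at most max_norm.

    grads: list of gradient vectors (lists of floats), one per parameter.
    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm or total == 0.0:
        return [list(vec) for vec in grads], total
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads], total


grads = [[3.0, 4.0], [0.0, 12.0]]  # joint norm is sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
```

The direction of the update is preserved; only its magnitude is bounded, which keeps a single exploding step from destabilizing training.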
Recently, it has been found that convolutional neural network (CNN) based encoder-decoder models have the potential to alleviate the aforementioned problem, since they have better performance in terms of the following three considerations [65, 66, 67]. 1) A model can be parallelized during training and evaluation. 2) The computational complexity of the model is linear with respect to the length of sequences. 3) The model has short paths between pairs of input and output tokens, so that it can propagate gradient signals more efficiently. Kalchbrenner et al. introduced the ByteNet model, which adopts a one-dimensional convolutional neural network of fixed depth for both the encoder and the decoder. The decoder CNN is stacked on top of the hidden representation of the encoder CNN, which ensures a shorter path between input and output. The proposed ByteNet model has achieved state-of-the-art performance on a character-level machine translation task with parallelism and linear-time computational complexity. Bradbury et al. proposed a quasi-recurrent neural network (QRNN) encoder-decoder architecture, where both encoder and decoder are composed of convolutional layers and so-called ‘dynamic average pooling’ layers [70, 66]. The convolutional layers allow computations to be completely parallel across both mini-batch and sequence time-step dimensions, and they require less time than an LSTM even though the sequential dependence is still present in the pooling layers. This framework has been demonstrated to be effective by outperforming LSTM-based models on a character-level machine translation task with a significantly higher computational speed. Recently, Gehring et al. [71, 67, 72] attempted to build CNN based seq2seq models and apply them to large-scale benchmark datasets for sequence modeling. They first proposed a convolutional encoder model, in which the encoder is composed of a succession of convolutional layers, and demonstrated its strong performance for machine translation. They further constructed a convolutional seq2seq architecture by replacing the LSTM decoder with a CNN decoder and bringing in several novel elements, including gated linear units and multi-step attention. The model also allows the computations of all network elements to be parallelized, so training and decoding can be much faster than with RNN models. It also achieved state-of-the-art performance on several machine translation benchmark datasets. Vaswani et al. further constructed a novel network architecture, namely the Transformer, which only depends on feed-forward networks and the attention mechanism. It has achieved state-of-the-art performance in the machine translation task with significantly less training time. Currently, the ConvS2S model has been applied to abstractive document summarization and outperforms the pointer-generator network on the CNN/Daily Mail dataset.
So far, we have primarily focused on the pointer-generator network, training neural networks with RL algorithms, and CNN based seq2seq architectures. There are many other studies that aim to improve the performance of seq2seq models for the task of abstractive text summarization from different perspectives and to broaden their applications.
The first way to boost the performance of seq2seq models is to design better network structures. Zhou et al. introduced an information filter, namely, a selective gate network between the encoder and decoder. This model can control the information flow from the encoder to the decoder by constructing a second level representation of the source texts with the gate network. Zeng et al. introduced a read-again mechanism to improve the quality of the representations of the source texts. Tan et al. built a graph ranking model upon a hierarchical encoder-decoder framework, which enables the model to capture the salient information of the source documents and generate accurate, fluent and non-redundant summaries. Xia et al. proposed a deliberation network that passes through the decoding process multiple times (deliberation process) to polish the sequences generated by the previous decoding process. Li et al. incorporated a sequence of variational auto-encoders [78, 79] into the decoder to capture the latent structure of the generated summaries.
Another way to improve abstractive text summarization is to make use of the salient information from the extraction process. Hsu et al. proposed a unified framework that takes advantage of both extractive and abstractive summarization using a novel attention mechanism, which is a combination of sentence-level attention (based on extractive summarization) and word-level attention (based on the pointer-generator network), inspired by the intuition that words in less attended sentences should have lower attention scores. Chen et al. introduced a multi-step procedure, namely compression-paraphrase, for abstractive summarization, which first extracts salient sentences from documents and then rewrites them in order to get final summaries. Li et al. introduced a guiding generation model, where the keywords in source texts are first retrieved with an extractive model. Then, a guide network is applied to encode them to obtain the key information representations that will guide the summary generation process.
Compared to short articles and texts with moderate lengths, there are many challenges that arise in long documents, such as difficulty in capturing the salient information. Nallapati et al. proposed a hierarchical attention model to capture hierarchical structures of long documents. To make models scale up to very long sequences, Ling et al. introduced a coarse-to-fine attention mechanism, which hierarchically reads and attends to long documents (a document is split into many chunks of text). By stochastically selecting chunks of text during training, this approach can scale linearly with the number of chunks instead of the number of tokens. Cohan et al. proposed a discourse-aware attention model which has a similar idea to that of a hierarchical attention model. Their model was applied to two large-scale datasets of scientific papers, i.e., the arXiv and PubMed datasets. Tan et al. introduced a graph-based attention model which is built upon a hierarchical encoder-decoder framework where the PageRank algorithm was used to calculate saliency scores of sentences.
Multi-task learning has become a promising research direction for this problem since it allows seq2seq models to handle different tasks. Pasunuru et al. introduced a multi-task learning framework, which incorporates knowledge from an entailment generation task into the abstractive text summarization task by sharing decoder parameters. They further proposed a novel framework that is composed of two auxiliary tasks, i.e., question generation and entailment generation, to improve their model's ability to capture saliency and entailment for abstractive text summarization. In their model, different tasks share several encoder, decoder and attention layers. McCann et al. introduced the Natural Language Decathlon (decaNLP, https://github.com/salesforce/decaNLP), a challenge that spans ten different tasks, including question-answering, machine translation, summarization, and so on. They also proposed a multitask question answering network that can jointly learn all tasks without task-specific modules or parameters, since all tasks are mapped to the same framework of question-answering over a given context.
Beam search algorithms have been commonly used in the decoding of different language generation tasks [22, 12]. However, the generated candidate sequences usually lack diversity. In other words, the top-$K$ candidates are nearly identical, where $K$ is the size of the beam. Li et al. replaced the log-likelihood objective function in the neural probabilistic language model with Maximum Mutual Information (MMI) in their neural conversation models to remedy the problem. This idea has also been applied to neural machine translation (NMT) to model the bi-directional dependency of source and target texts. They further proposed a simple yet fast decoding algorithm that can generate diverse candidates and has shown performance improvement on the abstractive text summarization task. Vijayakumar et al. proposed generating diverse outputs by optimizing for a diversity-augmented objective function. Their method, referred to as the Diverse Beam Search (DBS) algorithm, has been applied to image captioning, machine translation, and visual question-generation tasks. Cibils et al. introduced a meta-algorithm that first uses DBS to generate summaries and then picks candidates according to maximal marginal relevance, under the assumption that the most useful candidates should be close to the source document and far away from each other. The proposed algorithm has boosted the performance of the pointer-generator network on the CNN/Daily Mail dataset.
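For readers unfamiliar with the baseline that these diversity-oriented methods modify, here is a minimal Python sketch of plain beam search. It is an illustration under stated assumptions, not the decoder of any cited system: `beam_search`, `toy_step` and the bigram table `TABLE` are hypothetical, and the toy model conditions only on the last token.

```python
import math


def beam_search(step_fn, bos, eos, beam_size, max_len):
    """Return hypotheses as (tokens, log_prob) pairs, best first.

    step_fn(tokens) -> dict mapping each possible next token to its log-prob.
    """
    beams = [([bos], 0.0)]
    finished = []
    for _ in range(max_len):
        expanded = []
        for tokens, score in beams:
            for tok, lp in step_fn(tokens).items():
                expanded.append((tokens + [tok], score + lp))
        expanded.sort(key=lambda h: h[1], reverse=True)
        beams = []
        # Keep the best `beam_size` expansions; completed ones leave the beam.
        for tokens, score in expanded[:beam_size]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # unfinished hypotheses if max_len was reached
    return sorted(finished, key=lambda h: h[1], reverse=True)


# Hypothetical bigram model used only for this example.
TABLE = {
    "<s>": {"france": math.log(0.6), "argentina": math.log(0.4)},
    "france": {"wins": math.log(0.7), "beats": math.log(0.3)},
    "argentina": {"loses": math.log(1.0)},
    "wins": {"</s>": 0.0},
    "beats": {"</s>": 0.0},
    "loses": {"</s>": 0.0},
}


def toy_step(tokens):
    return TABLE[tokens[-1]]


results = beam_search(toy_step, "<s>", "</s>", beam_size=2, max_len=4)
```

Diversity-oriented variants change how `expanded` is ranked, for example by penalizing hypotheses that are too similar to ones already selected.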
Although many research papers have been published in the area of neural abstractive text summarization, there are few survey papers [97, 98, 99] that provide a comprehensive study. In this paper, we systematically review current advances of seq2seq models for the abstractive text summarization task from various perspectives, including network structures, training strategies, and sequence generation. In addition to a literature survey, we also implemented some of these methods in an open-source library, namely NATS (https://github.com/tshi04/NATS). An extensive set of experiments has been conducted on various benchmark text summarization datasets in order to examine the importance of different network components. The main contributions of this paper can be summarized as follows:
Provide a comprehensive literature survey of current advances of seq2seq models with an emphasis on the abstractive text summarization.
Conduct a detailed review of the techniques used to tackle different challenges in RNN encoder-decoder architectures.
Review different strategies for training seq2seq models and approaches for generating summaries.
Provide an open-source library, which implements some of these models, and systematically investigate the effects of different network elements on the summarization performance.
The rest of this paper is organized as follows: An overall taxonomy of topics on seq2seq models for neural abstractive text summarization is shown in Fig. 1. A comprehensive list of papers published to date on the topic of neural abstractive text summarization is summarized in Tables I and II. In Section II, we introduce the basic seq2seq framework along with its extensions, including the attention mechanism, the pointing/copying mechanism, repetition handling, improving the encoder or decoder, summarizing long documents and combining with extractive models. Section III summarizes different training strategies, including word-level training methods, such as cross-entropy training, and sentence-level training with RL algorithms. In Section IV, we discuss generating summaries using the beam search algorithm and various other diverse beam decoding algorithms. Section V briefly introduces the convolutional seq2seq model and its application to abstractive text summarization. In Section VI, we present details of our implementations and discuss our experimental results on the CNN/Daily Mail, Newsroom, and Bytecup (https://www.biendata.com/competition/bytecup2018/) datasets. We conclude this survey in Section VII.
In this section, we review different encoder-decoder models for the neural abstractive text summarization. We will start with the basic RNN seq2seq framework and attention mechanism. Then, we will describe more advanced network structures that can handle different challenges in the text summarization, such as repetition and out-of-vocabulary (OOV) words. We will highlight various existing problems and proposed solutions.
A vanilla seq2seq framework for abstractive summarization is composed of an encoder and a decoder. The encoder reads a source article, denoted by $x = (x_1, x_2, \dots, x_J)$, and transforms it to hidden states $h^e = (h^e_1, h^e_2, \dots, h^e_J)$, while the decoder takes these hidden states as the context input and outputs a summary $y = (y_1, y_2, \dots, y_T)$. Here, $x_i$ and $y_j$ are one-hot representations of the tokens in the source article and summary, respectively. We use $J$ and $T$ to represent the number of tokens (document length) of the original source document and the summary, respectively. A summarization task is defined as inferring a summary $y$ from a given source article $x$ using seq2seq models.
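Encoders and decoders can be feed-forward networks, CNNs [71, 67] or RNNs. RNN architectures, especially long short-term memory (LSTM) and gated recurrent unit (GRU), have been most widely adopted for seq2seq models. Fig. 2 shows a basic RNN seq2seq model with a bi-directional LSTM encoder and an LSTM decoder. The bi-directional LSTM is considered since it usually gives better document representations compared to a forward LSTM. The encoder reads a sequence of input tokens $x$ and turns them into a sequence of hidden states $h^e$ with the following updating algorithm:

$$
\begin{aligned}
i_t &= \sigma(W_{ii} E_{x_t} + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{if} E_{x_t} + W_{hf} h_{t-1} + b_f) \\
o_t &= \sigma(W_{io} E_{x_t} + W_{ho} h_{t-1} + b_o) \\
g_t &= \tanh(W_{ig} E_{x_t} + W_{hg} h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\tag{1}
$$

where the weight matrices $W$ and bias vectors $b$ are learnable parameters (in the rest of this paper, we will use $W$ (weights) and $b$ (bias) to represent the model parameters), $E_{x_t}$ denotes the word embedding of token $x_t$, and $c_t$ represents the cell states. Both $h_0$ and $c_0$ are initialized to $0$. For the bi-directional LSTM, the input sequence is encoded as $\overrightarrow{h}^e$ and $\overleftarrow{h}^e$, where the right and left arrows denote the forward and backward temporal dependencies, respectively. The superscript $e$ is the shortcut notation used to indicate that it is for the encoder. During decoding, the decoder takes the encoded representations of the source article (i.e., hidden and cell states $\overrightarrow{h}^e_J$, $\overleftarrow{h}^e_1$, $\overrightarrow{c}^e_J$, $\overleftarrow{c}^e_1$) as the input and generates the summary $y$. In a simple encoder-decoder model, the encoded vectors are used to initialize the hidden and cell states of the LSTM decoder. For example, we can initialize them as follows:

$$
h^d_0 = \tanh\left(W_{e2d} (\overrightarrow{h}^e_J \oplus \overleftarrow{h}^e_1) + b_{e2d}\right), \qquad c^d_0 = \overrightarrow{c}^e_J \oplus \overleftarrow{c}^e_1.
\tag{2}
$$

Here, the superscript $d$ denotes the decoder and $\oplus$ is a concatenation operator. At each decoding step, we first update the hidden state $h^d_t$ conditioned on the previous hidden states and input tokens, i.e.,

$$
h^d_t = \mathrm{LSTM}(h^d_{t-1}, E_{y_{t-1}}).
\tag{3}
$$

Hereafter, we will not explicitly express the cell states in the input and output of the LSTM, since only hidden states are passed to other parts of the model. Then, the vocabulary distribution can be calculated as follows:

$$
P_{vocab,t} = \mathrm{softmax}(W_{d2v} h^d_t + b_{d2v}),
\tag{4}
$$

where $P_{vocab,t}$ is a vector whose dimension is the size of the vocabulary $V$, and $\mathrm{softmax}(v)_k = \exp(v_k) / \sum_{\tau} \exp(v_\tau)$ for each element $v_k$ of a vector $v$. Therefore, the probability of generating the target token $w$ in the vocabulary $V$ is denoted as $P_{vocab,t}(w)$.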
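The softmax step that turns decoder scores into a vocabulary distribution can be sketched in plain Python. The vocabulary and the score values below are made up for illustration; in a real model the scores would come from a learned linear projection of the decoder hidden state.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["<unk>", "france", "beat", "argentina"]  # hypothetical vocabulary
logits = [0.0, 2.0, 1.0, 0.0]                     # hypothetical projected scores
p_vocab = softmax(logits)

# Greedy choice: the token with the highest probability.
best = vocab[max(range(len(vocab)), key=p_vocab.__getitem__)]
```

The resulting list has one probability per vocabulary entry and sums to one, so the probability of a given token is simply the entry at its index.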
This LSTM based encoder-decoder framework was the foundation of many neural abstractive text summarization models [14, 12, 17]. However, there are many problems with this model. For example, the encoder is not well trained via back propagation through time [101, 102], since the paths from the encoder to the output are relatively long, which limits the propagation of gradient signals. The accuracy and human-readability of generated summaries are also very low, with many OOV words and repetitions (in the rest of this paper, we will use unk, i.e., unknown words, to denote OOV words). The rest of this section will discuss different models that were proposed in the literature to resolve these issues and produce better quality summaries.
The attention mechanism has achieved great success and is commonly used in seq2seq models for different natural language processing (NLP) tasks, such as machine translation [23, 21], image captioning, and neural abstractive text summarization [14, 12, 17]. In an attention based encoder-decoder architecture (shown in Fig. 3), the decoder not only takes the encoded representations (i.e., final hidden and cell states) of the source article as input, but also selectively focuses on parts of the article at each decoding step. For example, suppose we want to compress the source input “Kylian Mbappe scored two goals in four second-half minutes to send France into the World Cup quarter-finals with a thrilling 4-3 win over Argentina on Saturday.” (taken from https://timesofindia.indiatimes.com/sports/football/fifa-world-cup/france-vs-argentina-live-score-fifa-world-cup-2018/articleshow/64807463.cms) to its short version “France beat Argentina 4-3 to enter quarter-finals.” When generating the token “beat”, the decoder may need to attend to “a thrilling 4-3 win” more than to other parts of the text. This attention can be achieved by an alignment mechanism, which first computes the attention distribution over the source tokens and then lets the decoder know where to attend to produce a target token. In the encoder-decoder framework depicted in Fig. 2 and 3, given all the hidden states of the encoder, i.e., $h^e = (h^e_1, h^e_2, \dots, h^e_J)$ (where $h^e_j$ for the bi-directional LSTM is defined as the concatenation of $\overrightarrow{h}^e_j$ and $\overleftarrow{h}^e_j$), and the current decoder hidden state $h^d_t$, the attention distribution $\alpha^e_t$ over the source tokens is calculated as follows:

$$
\alpha^e_{tj} = \frac{\exp(s^e_{tj})}{\sum_{k=1}^{J} \exp(s^e_{tk})},
\tag{5}
$$

where the alignment score $s^e_{tj} = s(h^e_j, h^d_t)$ is obtained by the content-based score function, which has three alternatives:

$$
s(h^e_j, h^d_t) =
\begin{cases}
(h^e_j)^\top h^d_t & \text{dot} \\
(h^e_j)^\top W_{align} h^d_t & \text{general} \\
(v_{align})^\top \tanh\left(W_{align} (h^e_j \oplus h^d_t) + b_{align}\right) & \text{concat}
\end{cases}
\tag{6}
$$

It should be noted that the numbers of additional parameters for the ‘dot’, ‘general’ and ‘concat’ approaches are $0$, $|h^e_j| \times |h^d_t|$, and $(|h^e_j| + |h^d_t|) \times |v_{align}| + 2|v_{align}|$, respectively. Here $|\cdot|$ represents the dimension of a vector. ‘General’ and ‘concat’ are commonly used score functions in abstractive text summarization [12, 17]. One of the drawbacks of the ‘dot’ method is that it requires $h^e_j$ and $h^d_t$ to have the same dimension.
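As a quick arithmetic check of how the three score functions differ in size, the Python sketch below counts their additional parameters. It assumes the usual forms of the score functions: ‘dot’ has no parameters, ‘general’ uses one matrix between the encoder and decoder states, and ‘concat’ uses a matrix over the concatenated states plus a bias vector and a projection vector. The function name and the dimensions are hypothetical.

```python
def score_param_counts(enc_dim, dec_dim, align_dim):
    """Number of additional parameters for each alignment score function.

    enc_dim:   dimension of an encoder hidden state.
    dec_dim:   dimension of a decoder hidden state.
    align_dim: dimension of the projection vector used by 'concat'.
    """
    return {
        "dot": 0,                      # plain inner product, no parameters
        "general": enc_dim * dec_dim,  # one bilinear weight matrix
        # weight matrix over the concatenation, plus bias and projection vector
        "concat": (enc_dim + dec_dim) * align_dim + 2 * align_dim,
    }


counts = score_param_counts(enc_dim=512, dec_dim=256, align_dim=128)
```

With these example sizes, ‘dot’ would not even be applicable, since the encoder and decoder states have different dimensions.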
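With the attention distribution, we can naturally define the source side context vector for the target word as

$$
z^e_t = \sum_{j=1}^{J} \alpha^e_{tj} h^e_j.
\tag{7}
$$

Together with the current decoder hidden state $h^d_t$, we get the attention hidden state

$$
\tilde{h}^d_t = W_z (z^e_t \oplus h^d_t) + b_z.
\tag{8}
$$

Finally, the vocabulary distribution is calculated by

$$
P_{vocab,t} = \mathrm{softmax}(W_{d2v} \tilde{h}^d_t + b_{d2v}).
\tag{9}
$$

When $t > 1$, the decoder hidden state $h^d_t$ is updated by

$$
h^d_t = \mathrm{LSTM}(h^d_{t-1}, E_{y_{t-1}} \oplus \tilde{h}^d_{t-1}),
\tag{10}
$$

where the input is the concatenation of $E_{y_{t-1}}$ and $\tilde{h}^d_{t-1}$.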
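The attention distribution and the context vector can be computed in a few lines of Python. This sketch assumes the simplest (‘dot’) score function and uses small made-up vectors; the function name `attention` and the example states are hypothetical and stand in for encoder and decoder hidden states produced by a trained model.

```python
import math


def attention(enc_states, dec_state):
    """Dot-score attention: returns (weights, context vector).

    enc_states: list of encoder hidden states (equal-length lists of floats).
    dec_state:  current decoder hidden state (same length as each encoder state).
    """
    # Alignment scores: inner product of each encoder state with the decoder state.
    scores = [sum(h * d for h, d in zip(h_j, dec_state)) for h_j in enc_states]
    # Softmax over source positions (shifted by the maximum for stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the encoder states.
    dim = len(enc_states[0])
    context = [sum(w * h_j[k] for w, h_j in zip(weights, enc_states))
               for k in range(dim)]
    return weights, context


enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec_state = [0.0, 2.0]
weights, context = attention(enc_states, dec_state)
```

Here the second and third source positions receive equal, larger weights because their states align better with the decoder state, and the context vector leans towards them accordingly.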
The pointing/copying mechanism represents a class of approaches that generate target tokens by directly copying them from input sequences based on their attention weights. It can be naturally applied to abstractive text summarization since summaries and articles can share the same vocabulary. More importantly, it is capable of dealing with out-of-vocabulary (OOV) words [14, 36, 37, 12]. A variety of studies have shown a boost in performance after incorporating the pointing/copying mechanism into the seq2seq framework [12, 17, 18]. In this section, we review several alternatives of this mechanism for abstractive text summarization.
The basic architecture of pointer softmax is described as follows. It consists of three fundamental components: short-list softmax, location softmax and switching network. At decoding step $t$, a short-list softmax $P_{vocab,t}$ calculated by Eq. (9) is used to predict target tokens in the vocabulary. The location softmax gives locations of tokens that will be copied from the source article $x$ to the target $y_t$ based on attention weights $\alpha^e_t$. With these two components, a switching network is designed to determine whether to predict a token from the vocabulary or copy one from the source article if it is an OOV token. The switching network is a multilayer perceptron (MLP) with a sigmoid activation function, which estimates the probability $p_{gen,t}$ of generating tokens from the vocabulary based on the context vector $z^e_t$ and hidden state $h^d_t$ with

$$
p_{gen,t} = \sigma(W_{s,z} z^e_t + W_{s,h} h^d_t + b_s),
\tag{11}
$$

where $p_{gen,t}$ is a scalar and $\sigma(a) = \frac{1}{1 + \exp(-a)}$ is a sigmoid activation function. The final probability of producing the target token $y_t$ is given by the concatenation of vectors $p_{gen,t} P_{vocab,t}$ and $(1 - p_{gen,t}) \alpha^e_t$.
Similar to the switching network in pointer softmax, the switching generator-pointer is also equipped with a ‘switch’, which determines whether to generate a token from the vocabulary or point to one in the source article at each decoding step. The switch is explicitly modeled by
$$p_{gen,t} = \sigma(W_{s,z} z^e_t + W_{s,h} h^d_t + W_{s,E} E_{y_{t-1}} + b_s), \tag{12}$$
where $E_{y_{t-1}}$ is the embedding of the previous target token.
If the switch is turned on, the decoder produces a word from the vocabulary with the distribution $P_{vocab,t}$ (see Eq. (9)). Otherwise, the decoder generates a pointer based on the attention distribution $\alpha^e_t$ (see Eq. (5)), i.e., $p_j = \alpha^e_{tj}$, where $j$ is the position of the token in the source article. When a pointer is activated, the embedding of the pointed token will be used as an input for the next decoding step.
CopyNet has a differentiable network architecture and can be easily trained in an end-to-end manner. In this framework, the probability of generating a target token is a combination of the probabilities of two modes, i.e. generate-mode and copy-mode. First, CopyNet represents unique tokens in the vocabulary and source sequence by and , respectively, and builds an extended vocabulary . Then, the vocabulary distribution over the extended vocabulary is calculated by
where and are also defined on , i.e.,
Here, is a normalization factor shared by both the above equations. is calculated with
is obtained by Eq. (6).
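In the pointer-generator network, the probability of producing the target token $y_t$ is $P(y_t) = p_{gen,t} P_g(y_t) + (1 - p_{gen,t}) P_c(y_t)$, where $p_{gen,t}$ is obtained by Eq. (12). The vocabulary distribution $P_g(y_t)$ and the attention distribution $P_c(y_t)$ are defined as follows:
$$P_g(y_t) = \begin{cases} P_{vocab,t}(y_t) & \text{if } y_t \text{ is in the vocabulary} \\ 0 & \text{otherwise} \end{cases} \qquad P_c(y_t) = \begin{cases} \sum_{j: x_j = y_t} \alpha^e_{tj} & \text{if } y_t \text{ is in the source article} \\ 0 & \text{otherwise} \end{cases}$$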
The pointer-generator network has been used as the base model for many abstractive text summarization models (see Tables I and II). Finally, it should be noted that $p_{gen,t} \in (0, 1)$ in CopyNet and the pointer-generator network can be viewed as a “soft-switch” that chooses between generation and copying, which is different from the “hard-switch” (i.e., $p_{gen,t} \in \{0, 1\}$) in pointer softmax and the switching generator-pointer [37, 14, 17].
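To make the soft-switch concrete, the following sketch mixes a generation distribution and a copy distribution over an extended vocabulary. It is a minimal illustration under stated assumptions: the function name `final_distribution`, the toy vocabulary and the value of the switch are hypothetical, and the copy probability of a token is taken to be the sum of the attention weights over all source positions where it occurs.

```python
def final_distribution(p_gen, p_vocab, attention, source_tokens):
    """Mix generation and copy distributions over an extended vocabulary.

    p_vocab: dict mapping each vocabulary token to its generation probability
    attention: attention weights, one per source position
    source_tokens: source tokens aligned with `attention`
    """
    final = {tok: p_gen * p for tok, p in p_vocab.items()}
    for tok, a in zip(source_tokens, attention):
        # A token's copy probability sums the attention over all its positions.
        final[tok] = final.get(tok, 0.0) + (1.0 - p_gen) * a
    return final

p_vocab = {"the": 0.5, "cat": 0.3, "<unk>": 0.2}
source_tokens = ["the", "zyx", "cat", "zyx"]  # "zyx" is out of vocabulary
attention = [0.1, 0.4, 0.2, 0.3]

soft = final_distribution(0.6, p_vocab, attention, source_tokens)
hard = final_distribution(1.0, p_vocab, attention, source_tokens)
```

With the soft switch, the OOV token "zyx" receives probability mass through copying, whereas setting the switch to 1 (a hard decision to generate) leaves it with zero probability.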
One of the critical challenges for attention-based seq2seq models is that the generated sequences contain repetitions, since the attention mechanism tends to ignore past alignment information [105, 106]. For summarization and headline generation tasks, model-generated summaries suffer from both word-level and sentence-level repetitions. The latter is specific to summaries that consist of several sentences [14, 12, 17], such as those in the CNN/Daily Mail dataset and the Newsroom dataset. In this section, we review several approaches that have been proposed to overcome the repetition problem.
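The temporal attention method was originally proposed to deal with the attention deficiency problem in neural machine translation (NMT). Nallapati et al. found that it can also overcome the problem of repetition when generating multi-sentence summaries, since it prevents the model from attending to the same parts of a source article by tracking the past attention weights. More formally, given the attention score $s^e_{tj}$ in Eq. (6), we can first define a temporal attention score as
$$s^{temp}_{tj} = \begin{cases} \exp(s^e_{tj}) & \text{if } t = 1 \\ \dfrac{\exp(s^e_{tj})}{\sum_{k=1}^{t-1} \exp(s^e_{kj})} & \text{otherwise.} \end{cases} \tag{20}$$
Then, the attention distribution is calculated with
$$\alpha^{temp}_{tj} = \frac{s^{temp}_{tj}}{\sum_{k=1}^{J} s^{temp}_{tk}}.$$
Given the attention distribution, the context vector (see Eq. (7)) is rewritten as
$$z^e_t = \sum_{j=1}^{J} \alpha^{temp}_{tj} h^e_j.$$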
It can be seen from Eq. (20) that, at each decoding step, the input tokens which have been highly attended will have a lower attention score via the normalization in time dimension. As a result, the decoder will not repeatedly attend the same part of the source article.
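The effect of normalizing in the time dimension can be checked with a short sketch. It assumes that the raw alignment scores of all decoding steps so far are available as a list of lists; the function name `temporal_attention` and the toy scores are hypothetical.

```python
import math

def temporal_attention(scores_history):
    """Attention distribution at the latest decoding step.

    scores_history[t][j] is the raw alignment score between decoding step t
    and source position j. The exponentiated score of each source position is
    divided by the sum of its exponentiated scores at all earlier steps.
    """
    t = len(scores_history) - 1
    current = scores_history[t]
    temporal = []
    for j, s in enumerate(current):
        if t == 0:
            temporal.append(math.exp(s))
        else:
            past = sum(math.exp(scores_history[k][j]) for k in range(t))
            temporal.append(math.exp(s) / past)
    total = sum(temporal)
    return [v / total for v in temporal]

# The raw scores favour source position 0 at both decoding steps.
scores_history = [[2.0, 0.0], [2.0, 0.0]]
first = temporal_attention(scores_history[:1])
second = temporal_attention(scores_history)
```

In this toy case the first step attends mostly to position 0, while at the second step the same raw scores yield a uniform distribution, because position 0 has already been highly attended.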
Intra-decoder attention is another technique to handle the repetition problem for long-sequence generations. Compared to the regular attention based models, it allows a decoder to not only attend tokens in a source article but also keep track of the previously decoded tokens in a summary, so that the decoder will not repeatedly produce the same information.
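For $t > 1$, intra-decoder attention scores, denoted by $s^d_{t\tau}$, can be calculated in the same manner as the attention score $s^e_{tj}$ (footnote 10: we have to replace $h^e_j$ with $h^d_\tau$ in Eq. (6), where $\tau \in \{1, \dots, t-1\}$). Then, the attention weight for each token is expressed as
$$\alpha^d_{t\tau} = \frac{\exp(s^d_{t\tau})}{\sum_{k=1}^{t-1} \exp(s^d_{tk})}.$$
With the attention distribution, we can calculate the decoder-side context vector by taking a linear combination of the decoder hidden states, i.e., $h^d_1, \dots, h^d_{t-1}$, as
$$z^d_t = \sum_{\tau=1}^{t-1} \alpha^d_{t\tau} h^d_\tau.$$
The decoder-side and encoder-side context vectors are both used to calculate the vocabulary distribution.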
The coverage model was first proposed for the NMT task to address the problem that the standard attention mechanism tends to ignore past alignment information. Recently, See et al. introduced the coverage mechanism to the abstractive text summarization task. In their model, they first defined a coverage vector $u^e_t$ as the sum of the attention distributions of the previous decoding steps, i.e.,
$$u^e_{tj} = \sum_{k=1}^{t-1} \alpha^e_{kj}.$$
Thus, it contains the accumulated attention information on each token in the source article during the previous decoding steps. The coverage vector is then used as an additional input to calculate the attention score
$$s^e_{tj} = v^\top \tanh\big(W_{align}(h^e_j \oplus h^d_t \oplus u^e_{tj}) + b_{align}\big).$$
As a result, the attention at the current decoding time-step is aware of the attention during the previous decoding steps. Moreover, they defined a novel coverage loss to ensure that the decoder does not repeatedly attend to the same locations when generating multi-sentence summaries. Here, the coverage loss is defined as
$$L^{cov}_t = \sum_{j} \min(\alpha^e_{tj}, u^e_{tj}),$$
which is upper bounded by $1$, since $\sum_j \alpha^e_{tj} = 1$.
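The coverage vector and coverage loss can be sketched in a few lines. This is an illustration with toy attention distributions rather than the implementation of See et al.; the function names `coverage_vector` and `coverage_loss` are hypothetical.

```python
def coverage_vector(attention_history, num_source):
    """Sum of the attention distributions of all previous decoding steps."""
    cov = [0.0] * num_source
    for alpha in attention_history:
        for j, a in enumerate(alpha):
            cov[j] += a
    return cov

def coverage_loss(alpha, cov):
    """Sum over source positions of min(attention, coverage).

    Each term is at most the attention weight, so the loss cannot exceed
    the sum of the attention weights, which is 1.
    """
    return sum(min(a, c) for a, c in zip(alpha, cov))

# Two previous decoding steps that both focused on source position 0.
history = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
cov = coverage_vector(history, 3)

repeat_loss = coverage_loss([0.8, 0.1, 0.1], cov)  # attends to position 0 again
fresh_loss = coverage_loss([0.0, 0.1, 0.9], cov)   # moves on to position 2
```

Attending again to an already covered position reaches the upper bound of the loss, while shifting attention to a less covered position is penalized much less.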
The coverage mechanism has also been used in  (known as distraction) for the document summarization task. In addition to the distraction mechanism over the attention, they also proposed a distraction mechanism over the encoder context vectors. Both mechanisms are used to prevent the model from attending certain regions of the source article repeatedly. Formally, given the context vector at current decoding step and all historical context vectors (see Eq. (7)), the distracted context vector is defined as
where both and are diagonal parameter matrices.
Although LSTM and bi-directional LSTM encoders (GRU and bi-directional GRU encoders are also often seen in abstractive summarization papers) have been commonly used in seq2seq models for abstractive text summarization [14, 12, 17], representations of the source articles are still believed to be sub-optimal. In this section, we review some approaches that aim to improve the encoding process.
The selective encoding model was proposed for the abstractive sentence summarization task. Built upon an attention-based encoder-decoder framework, it introduces a selective gate network into the encoder for the purpose of distilling salient information from source articles. A second-layer representation, namely the distilled representation, of a source article is constructed over the representation of the first LSTM layer (a bi-directional GRU encoder in the original work). Formally, the distilled representation of each token in the source article is defined as
where denotes the selective gate for token and is calculated as follows:
where . The distilled representations are then used for the decoding. Such a gate network can control information flow from an encoder to a decoder and can also select salient information, therefore, it boosts the performance of the sentence summarization task .
Intuitively, read-again mechanism is motivated by human readers who read an article several times before writing a summary. To simulate this cognitive process, a read-again encoder reads a source article twice and outputs two-level representations. In the first read, an LSTM encodes tokens and the article as and , respectively. In the second read, we use another LSTM to encode the source text based on the outputs of the first read. Formally, the encoder hidden state of the second read is updated by
The hidden states of the second read will be passed into decoders for summary generation.
Sharing the embedding weights with the decoder is a practical approach that can boost performance, since it allows us to reuse the semantic and syntactic information in the embedding matrix during summary generation [108, 17]. Suppose the embedding matrix is represented by $W_{emb}$; then we can formulate the matrix used in summary generation (see Eq. (9)) as follows:
$$W_{d2v} = \tanh(W_{emb} \cdot W_{proj}).$$
By sharing model weights, the number of parameters is significantly smaller than in a standard model, since the number of parameters for $W_{proj}$ is $|E| \times |\tilde{h}^d|$, while that for $W_{d2v}$ is $|V| \times |\tilde{h}^d|$, where $|E|$ and $|\tilde{h}^d|$ represent the dimensions of an embedding vector and of the attention hidden state, respectively, and $|V|$ denotes the size of the vocabulary.
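A quick calculation illustrates the size of the saving. The vocabulary size, embedding dimension and hidden dimension below are assumed values chosen only for illustration, and the function name `output_layer_params` is hypothetical; biases are ignored.

```python
def output_layer_params(vocab_size, emb_dim, hidden_dim, share_weights):
    """Number of trainable weights in the output projection (bias ignored)."""
    if share_weights:
        # Only the projection matrix is new; the embedding matrix is reused.
        return emb_dim * hidden_dim
    # A standard output layer maps the hidden state directly to the vocabulary.
    return vocab_size * hidden_dim

standard = output_layer_params(50000, 128, 256, share_weights=False)
shared = output_layer_params(50000, 128, 256, share_weights=True)
ratio = standard / shared
```

Under these assumed sizes, the standard output layer has 12,800,000 weights, while the shared variant adds only 32,768.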
When a human writes a summary, they usually first create a draft and then polish it based on the global context. Inspired by this polishing process, Xia et al. proposed a deliberation network for sequence generation tasks. A deliberation network can have more than one decoder (there are two decoders in the original paper). The first one is similar to the decoder in the basic seq2seq model described in Fig. 2. During decoding, the second-pass decoder, which is used to polish the draft written in the first pass, attends to both the encoder hidden states and the first-pass decoder hidden states. Therefore, at each time step we obtain two context vectors, one as an attention-weighted sum of the encoder hidden states and the other as an attention-weighted sum of the first-pass decoder hidden states. These two context vectors capture global information of the encoded article and of the sequence generated by the first-pass decoder. The second-pass decoder will take them as input and update the next hidden states with
Finally, the vocabulary distribution at decoding step is calculated by
The deliberation network has also boosted the performance of seq2seq models in NMT and abstractive text summarization tasks.
Conventional encoder-decoder models calculate hidden states and attention weights in an entirely deterministic fashion, which limits the capability of representations and results in low-quality summaries. Incorporating variational auto-encoders (VAEs) [78, 79] into the encoder-decoder framework provides a practical solution to this problem. Inspired by the variational RNN, which was proposed to model highly structured sequential data, Li et al. introduced a seq2seq model with a deep recurrent generative decoder (DRGD) that aims to capture latent structure information of summaries and improve the summarization quality. This model employs GRU as the basic recurrent unit for both the encoder and the decoder. However, to be consistent with the rest of this survey, we will explain their ideas using LSTM instead.
There are two LSTM layers to calculate the decoder hidden state . At the decoding step , the first layer hidden state is updated by . Then, the attention weights and the context vector are calculated with the encoder hidden state and the first layer decoder hidden state using Eqs. (5), (6) and (7). For the second layer, the hidden state is updated with . Finally, the decoder hidden state is obtained by , where is also referred to as the deterministic hidden state.
A VAE is incorporated into the decoder to capture latent structure information of summaries, which is represented by a multivariate Gaussian distribution. By using the reparameterization trick [79, 110], the latent variables can first be expressed as
$$z_t = \mu_t + \sigma_t \otimes \varepsilon,$$
where the noise variable $\varepsilon \sim \mathcal{N}(0, I)$, and the Gaussian parameters $\mu_t$ and $\sigma_t$ in the network are calculated by
where is a hidden vector of the encoding process of the VAE and defined as
With the latent structure variables , the output hidden states can be formulated as
Finally, the vocabulary distribution is calculated by
We primarily focused on the network structure of DRGD in this section. The details of VAE and its derivations can be found in [78, 77, 79, 110]. In DRGD, the VAE is incorporated into the decoder of a seq2seq model. More recent works have also used VAEs in the attention layer and for the sentence compression task.
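The reparameterization trick used for the latent variables can be sketched as follows. This is only an illustration of the sampling step with fixed Gaussian parameters; in DRGD these parameters are produced by the network, which is not modeled here. The function name `reparameterize` and the toy values are hypothetical.

```python
import random

def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps element-wise, with eps ~ N(0, I).

    The randomness is isolated in eps, so z is a deterministic function of
    mu and sigma, which is what allows gradients to flow to both parameters.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

rng = random.Random(0)
mu, sigma = [1.0, -2.0], [0.5, 0.1]

samples = [reparameterize(mu, sigma, rng) for _ in range(10000)]
sample_mean = [sum(s[i] for s in samples) / len(samples) for i in range(2)]

# With zero standard deviation the latent variable collapses to the mean.
deterministic = reparameterize(mu, [0.0, 0.0], rng)
```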
Compared with sentence summarization, abstractive summarization of very long documents has been less investigated. Recently, attention-based seq2seq models with the pointing/copying mechanism have shown their power in summarizing long documents with 400 and 800 tokens [12, 17]. However, the performance improvement is primarily attributed to copying and repetition/redundancy-avoiding techniques [14, 12, 17]. For very long documents, we need to consider several important factors to generate high-quality summaries, such as saliency, fluency, coherence and novelty. Usually, seq2seq models combined with the beam search decoding algorithm can generate fluent and human-readable sentences. In this section, we review models that aim to improve the performance of long document summarization from the perspective of saliency.
Seq2seq models for long document summarization usually consist of an encoder with a hierarchical architecture, which is used to capture the hierarchical structure of the source documents. The top-level salient information includes the important sentences [14, 75], chunks of text, sections, and paragraphs, while the lower-level salient information represents keywords. Hereafter, we will use the term ‘chunk’ to represent the top-level information. Fig. 7 shows the neural network structure of a hierarchical encoder, which first uses a word encoder to encode the tokens in a chunk for the chunk representation, and then uses a chunk encoder to encode the chunks in a document for the document representation. In this paper, we only consider the single-layer forward LSTM for both word and chunk encoders (the deep communicating agents model, which requires multiple layers of bi-directional LSTM, falls outside the scope of this survey).
Suppose, the hidden states of chunk and word in this chunk are represented by and . At decoding step , we can calculate word-level attention weight for the current decoder hidden state as follows:
At the same time, we can also calculate chunk-level attention weight as follows:
where both alignment scores and can be calculated using Eq. (6). In this section, we will review four different models that are based on the hierarchical encoder for the task of long document text summarization.
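The intuition behind hierarchical attention is that words in less important chunks should receive less attention. Therefore, with the chunk-level attention distribution $\alpha^{chk}_{tk}$ and the word-level attention distribution $\alpha^{wd}_{tkj}$, we first calculate the re-scaled word-level attention distribution by
$$\alpha^{scale}_{tkj} = \frac{\alpha^{chk}_{tk}\, \alpha^{wd}_{tkj}}{\sum_{k', j'} \alpha^{chk}_{tk'}\, \alpha^{wd}_{tk'j'}}.$$
This re-scaled attention will then be used to calculate the context vector using Eq. (7), i.e.,
$$z^e_t = \sum_{k, j} \alpha^{scale}_{tkj}\, h^{wd}_{kj},$$
where $h^{wd}_{kj}$ is the hidden state of word $j$ in chunk $k$.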
It should be noted that such hierarchical attention framework is different from the hierarchical attention network proposed in , where the chunk representation is obtained using
instead of the last hidden state of the word-encoder.
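The re-scaling of word-level attention by chunk-level attention can be sketched as follows. The sketch assumes that each word-level weight is multiplied by the weight of its chunk and the products are normalized over all words in the document; the function name `rescale_attention` and the toy weights are hypothetical.

```python
def rescale_attention(chunk_attn, word_attn):
    """Re-scale word-level attention by chunk-level attention.

    chunk_attn[k] is the attention weight of chunk k and word_attn[k][j] is
    the attention weight of word j in chunk k. The result is normalized over
    all words in the document.
    """
    products = [
        [chunk_attn[k] * w for w in words] for k, words in enumerate(word_attn)
    ]
    total = sum(sum(row) for row in products)
    return [[p / total for p in row] for row in products]

# Two chunks with two words each; the first chunk is far more important.
chunk_attn = [0.9, 0.1]
word_attn = [[0.5, 0.5], [0.5, 0.5]]
scaled = rescale_attention(chunk_attn, word_attn)
```

Although all words have the same word-level weight in this toy case, the words in the less important chunk end up with much smaller re-scaled weights.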
The idea of the discourse-aware attention is similar to that of the hierarchical attention given by Eq. (43). The main difference between these two attention models is that the re-scaled attention distribution in the discourse-aware attention is calculated by
The coarse-to-fine (C2F) attention was proposed for computational efficiency. Similar to the hierarchical attention , the proposed model also has both chunk-level attention and word-level attention. However, instead of using word-level hidden states in all chunks for calculating the context vector, the C2F attention method first samples a chunk from the chunk-level attention distribution, and then calculates the context vector using
At test time, the stochastic sampling of chunks is replaced by a greedy search.
The aforementioned hierarchical attention mechanism implicitly captures the chunk-level salient information, where the importance of a chunk is determined solely by its attention weight. In contrast, the graph-based attention framework allows us to calculate the saliency scores explicitly using the PageRank algorithm [85, 113] on a graph whose vertices and edges are chunks of text and their similarities, respectively. Formally, at decoding time-step $t$, saliency scores for all input chunks are obtained by
where the adjacency matrix (similarity of chunks) is calculated by
is a diagonal matrix with its -element equal to the sum of the column of . is a damping factor. The vector is defined as
where is initialized with . It can be seen that the graph-based attention mechanism will focus on chunks that rank higher than the previous decoding step, i.e., . Therefore, it provides an efficient way to select salient information from source documents.
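As a rough illustration of how saliency scores can be computed on a chunk similarity graph, the sketch below runs a generic PageRank power iteration. It is not the exact formulation of the graph-based attention model: it assumes non-negative similarities, a uniform prior over chunks and a column-normalized similarity matrix, and the function name `pagerank_saliency`, the damping value and the toy graph are all hypothetical.

```python
def pagerank_saliency(similarity, damping=0.85, iterations=100):
    """Generic PageRank by power iteration on a chunk similarity graph.

    similarity[i][j] is a non-negative similarity between chunks i and j.
    Each column is normalized by its sum, and the prior over chunks is
    assumed to be uniform.
    """
    n = len(similarity)
    col_sums = [sum(similarity[i][j] for i in range(n)) for j in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            incoming = sum(
                similarity[i][j] / col_sums[j] * scores[j]
                for j in range(n)
                if col_sums[j] > 0
            )
            updated.append((1.0 - damping) / n + damping * incoming)
        scores = updated
    return scores

# Chunk 0 is similar to both other chunks, which are not similar to each other.
similarity = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
saliency = pagerank_saliency(similarity)
```

In this toy graph the central chunk obtains the highest saliency score, and the two peripheral chunks obtain equal scores.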
|Year||Reference||Highlights||Framework||Training||Optimizer||DUC||Gigaword||CNN/DM||Others||Metrics|
|2015||Rush et al.||Attention Based Summarization (ABS)||Bag-of-words, Convolution, Attention Neural Network Language Model (NNLM)||XENT||SGD||✓||✓||-||-||ROUGE|
|2015||Lopyrev et al.||Simple Attention||LSTM LSTM||XENT||RMSProp||-||✓||-||-||BLEU|
|2015||Ranzato et al.||Sequence-level Training||Elman, LSTM Elman, LSTM||XENT, DAD, E2E, MIXER||SGD||-||✓||-||-||ROUGE, BLEU|
|2016||Chopra et al.||Recurrent Attentive Summarizer||Convolution Encoder, Attentive Encoder Elman, LSTM||XENT||SGD||✓||✓||-||-||ROUGE|
|2016||Nallapati et al.||Switch Generator-Pointer, Temporal-Attention, Hierarchical-Attention||RNN, Feature-rich Encoder RNN||XENT||Adadelta||✓||✓||✓||-||ROUGE|
|2016||Miao et al.||Auto-encoding Sentence Compression, Forced-Attention Sentence Compression, Pointer Network||Encoder Compressor Decoder||XENT+RL||Adam||-||✓||-||-||ROUGE|
|2016||Chen et al.||Distraction||GRU GRU||XENT||Adadelta||-||-||-||CNN, LCSTS||ROUGE|
|2016||Gulcehre et al.||Pointer softmax||GRU GRU||XENT||Adadelta||-||✓||-||-||ROUGE|
|2016||Gu et al.||CopyNet||GRU GRU||XENT||SGD||-||-||-||LCSTS||ROUGE|
|2016||Zeng et al.||Read-again, Copy Mechanism||LSTM/GRU/Hierarchical read-again encoder LSTM||XENT||SGD||✓||✓||-||-||ROUGE|
|2016||Li et al.||Diverse Beam Decoding||LSTM LSTM||RL||SGD||-||-||-||-||ROUGE|
|2016||Takase et al.||Abstract Meaning Representation (AMR) based on ABS||Attention-based AMR encoder NNLM||XENT||SGD||✓||✓||-||-||ROUGE|
|2017||See et al.||Pointer-Generator Network, Coverage||LSTM LSTM||XENT||Adadelta||-||-||✓||-||ROUGE, METEOR|
|2017||Paulus et al.||A Deep Reinforced Model, Intra-temporal and Intra-decoder Attention, Weight Sharing||LSTM LSTM||XENT + RL||Adam||-||-||✓||NYT||ROUGE, Human|
|2017||Zhou et al.||Selective Encoding, Abstractive Sentence Summarization||GRU GRU||XENT||SGD||✓||✓||-||MSR-ATC||ROUGE|
|2017||Xia et al.||Deliberation Networks||LSTM LSTM||XENT||Adadelta||-||✓||-||-||ROUGE|
|2017||Nema et al.||Query-based, Diversity based Attention||GRU query encoder, document encoder GRU||XENT||Adam||-||-||-||Debate-pedia||ROUGE|
|2017||Tan et al.||Graph-based Attention||Hierarchical Encoder LSTM||XENT||Adam||-||-||✓||CNN, DailyMail||ROUGE|
|2017||Ling et al.||Coarse-to-fine Attention||LSTM LSTM||RL||SGD||-||-||✓||-||ROUGE, PPL|
|2017||Zhang et al.||Sentence Simplification, Reinforcement Learning||LSTM LSTM||RL||Adam||-||-||-||Newsela, WikiSmall, WikiLarge||BLEU, FKGL, SARI|
|2017||Li et al.||Deep Recurrent Generative Decoder (DRGD)||GRU GRU, VAE||XENT, VAE||Adadelta||✓||✓||-||LCSTS||ROUGE|
|2017||Liu et al.||Adversarial Training||Pointer-Generator Network||GAN||Adadelta||-||-||✓||-||ROUGE, Human|
|2017||Pasunuru et al.||Multi-Task with Entailment Generation||LSTM document encoder and premise Encoder LSTM Summary and Entailment Decoder||Hybrid-Objective||Adam||✓||✓||-||SNLI||ROUGE, METEOR, BLEU, CIDEr-D|
|2017||Gehring et al.||Convolutional Seq2seq, Position Embeddings, Gated Linear Unit, Multi-step Attention||CNN CNN||XENT||Adam||✓||✓||-||-||ROUGE|
|2017||Fan et al.||Convolutional Seq2seq, Controllable||CNN CNN||XENT||Adam||✓||-||✓||-||ROUGE, Human|
|2018||Celikyilmaz et al.||Deep Communicating Agents, Semantic Cohesion Loss||LSTM LSTM||Hybrid-Objective||Adam||-||-||✓||NYT||ROUGE, Human|
|2018||Chen et al.||Reinforce-Selected Sentence Rewriting||LSTM Encoder Extractor Abstractor||XENT + RL||SGD||✓||-||✓||-||ROUGE, Human|
|2018||Hsu et al.||Abstraction + Extraction, Inconsistency Loss||Extractor: GRU. Abstractor: Pointer-generator Network||Hybrid-Objective + RL||Adadelta||-||-||✓||-||ROUGE, Human|
|2018||Li et al.||Actor-Critic||GRU GRU||RL||Adadelta||✓||✓||-||LCSTS||ROUGE|
|2018||Li et al.||Abstraction + Extraction, Key Information Guide Network (KIGN)||KIGN: LSTM. Framework: Pointer-Generator Network||XENT||Adadelta||-||-||✓||-||ROUGE|
|2018||Lin et al.||Global Encoding, Convolutional Gated Unit||LSTM LSTM||XENT||Adam||-||✓||-||LCSTS||ROUGE|
|2018||Pasunuru et al.||Multi-Reward Optimization for RL: ROUGE, Saliency and Entailment||LSTM LSTM||RL||Adam||✓||-||✓||SNLI, MultiNLI, SQuAD||ROUGE, Human|
|2018||Song et al.||Structured-Infused Copy Mechanisms||Pointer-Generator Network||Hybrid-Objective||Adam||-||✓||-||-||ROUGE, Human|
|2018||Cohan et al.||Discourse Aware Attention||Hierarchical RNN LSTM Encoder LSTM||XENT||Adagrad||-||-||-||PubMed, arXiv||ROUGE|
|2018||Guo et al.||Multi-Task Summarization with Entailment and Question Generation||Multi-Task Encoder-Decoder Framework||Hybrid-Objective||Adam||✓||✓||✓||SQuAD, SNLI||ROUGE, METEOR|
|2018||Cibils et al.||Diverse Beam Search, Plagiarism and Extraction Scores||Pointer-Generator Network||XENT||Adagrad||-||-||✓||-||ROUGE|
|2018||Wang et al.||Topic Aware Attention||CNN CNN||RL||-||✓||✓||-||LCSTS||ROUGE|
|2018||Kryściński et al.||Improve Abstraction||LSTM Encoder Decoder: Contextual Model and Language Model||XENT + RL||Asynchronous Gradient Descent Optimizer||-||-||✓||-||ROUGE, Novel n-gram Test, Human|
|2018||Gehrmann et al.||Bottom-up Attention, Abstraction + Extraction||Pointer-Generator Network||Hybrid-Objective||Adagrad||-||-||✓||NYT||ROUGE, Novel|
|2018||Zhang et al.||Learning to Summarize Radiology Findings||Pointer-Generator Network + Background Encoder||XENT||Adam||-||-||-||Radiology Reports||ROUGE|
|2018||Jiang et al.||Closed-book Training||Pointer-Generator Network + Closed-book Decoder||Hybrid-Objective + RL||Adam||✓||-||✓||-||ROUGE, METEOR|
|2018||Chung et al.||Main Pointer Generator||Pointer-Generator Network + Document Encoder||XENT||Adadelta||-||-||✓||-||ROUGE|
|2018||Chen et al.||Iterative Text Summarization||GRU encoder, GRU decoder, iterative unit||Hybrid-Objective||Adam||✓||-||✓||-||ROUGE|
Extractive summarization approaches usually show better performance than abstractive approaches [12, 14, 13], especially with respect to ROUGE measures. One of the advantages of extractive approaches is that they can summarize source articles by extracting salient snippets and sentences directly from these documents, while abstractive approaches rely on a word-level attention mechanism to determine the most relevant words to the target words at each decoding step. In this section, we review several studies that have attempted to improve the performance of abstractive summarization by combining abstractive models with extractive models.
This model proposes a unified framework that tries to leverage sentence-level salient information from an extractive model and incorporate it into an abstractive model (a pointer-generator network). More formally, inspired by the hierarchical attention mechanism, they replaced the attention distribution $\alpha^{wd}_{t}$ in the abstractive model with a scaled version $\alpha^{scale}_{t}$, where the attention weights are expressed as follows:
$$\alpha^{scale}_{tj} = \frac{\alpha^{wd}_{tj}\, \alpha^{sen}_{tj}}{\sum_{k} \alpha^{wd}_{tk}\, \alpha^{sen}_{tk}}.$$
Here, $\alpha^{sen}_{tj}$ is the sentence-level salient score of the sentence at word position $j$ and decoding step $t$. Different from the hierarchical attention mechanism, the salient scores (sentence-level attention weights) are obtained from another deep neural network known as the extractor.
During training, in addition to the cross-entropy and coverage losses used in the pointer-generator network, this paper also proposed two other losses, i.e., the extractor loss and the inconsistency loss. The extractor loss is used to train the extractor and is defined as follows:
$$L_{ext} = -\frac{1}{N} \sum_{n=1}^{N} \big( g_n \log \beta_n + (1 - g_n) \log (1 - \beta_n) \big),$$
where $\beta_n$ is the salient score that the extractor assigns to the $n$-th sentence, $g_n$ is the ground truth label for the $n$-th sentence and $N$ is the total number of sentences. The inconsistency loss is expressed as
$$L_{inc} = -\frac{1}{T} \sum_{t=1}^{T} \log \Big( \frac{1}{|\mathcal{K}|} \sum_{j \in \mathcal{K}} \alpha^{wd}_{tj}\, \alpha^{sen}_{tj} \Big),$$
where $\mathcal{K}$ is the set of the top-$K$ attended words and $T$ is the total number of words in a summary. Intuitively, the inconsistency loss is used to ensure that the sentence-level attentions in the extractive model and the word-level attentions in the abstractive model are consistent with each other. In other words, when word-level attention weights are high, the corresponding sentence-level attention weights should also be high.
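The intuition behind the inconsistency loss can be illustrated with a small sketch. It assumes that sentence-level scores do not change across decoding steps and that the loss averages, over decoding steps, the negative log of the mean product of word-level and sentence-level weights for the top-K attended words. The function name `inconsistency_loss` and all toy values are hypothetical.

```python
import math

def inconsistency_loss(word_attn, sent_score, word_to_sent, top_k):
    """Penalize disagreement between word-level and sentence-level attention.

    word_attn[t][j]: word-level attention at decoding step t
    sent_score[n]: salient score of sentence n
    word_to_sent[j]: index of the sentence that contains word j
    """
    total = 0.0
    for alpha in word_attn:
        # Indices of the top_k most attended words at this decoding step.
        top = sorted(range(len(alpha)), key=lambda j: alpha[j], reverse=True)[:top_k]
        agreement = sum(alpha[j] * sent_score[word_to_sent[j]] for j in top) / top_k
        total += -math.log(agreement)
    return total / len(word_attn)

word_to_sent = [0, 0, 1, 1]               # two sentences with two words each
word_attn = [[0.6, 0.3, 0.05, 0.05]]      # one decoding step, focused on sentence 0

consistent = inconsistency_loss(word_attn, [0.9, 0.1], word_to_sent, top_k=2)
inconsistent = inconsistency_loss(word_attn, [0.1, 0.9], word_to_sent, top_k=2)
```

When the extractor assigns a high score to the sentence that the word-level attention focuses on, the loss is small; when it favours the other sentence, the loss is much larger.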
This approach uses a guiding generation mechanism that leverages the key (salient) information, i.e., keywords, to guide decoding process. This is a two-step procedure. First, keywords are extracted from source articles using the TextRank algorithm . Second, a KIGN encodes the key information and incorporates them into the decoder to guide the generation of summaries. Technically speaking, we can use a bi-directional LSTM to encode the key information and the output vector is the concatenation of hidden states, i.e., , where is the length of the key information sequence. Then, the alignment mechanism is modified as
Similarly, the soft-switch in the pointer-generator network is calculated using
Most models introduced in this survey are built upon the encoder-decoder framework [14, 12, 17], in which the encoder reads source articles and turns them into vector representations, and the decoder takes the encoded vectors as input and generates summaries. Unlike these models, the reinforce-selected sentence rewriting model consists of two seq2seq models. The first one is an extractive model (extractor), which is designed to extract salient sentences from a source article, while the second is an abstractive model (abstractor), which paraphrases and compresses the extracted sentences into a short summary. The abstractor network is a standard attention-based seq2seq model with the copying mechanism for handling OOV words. For the extractor network, an encoder first uses a CNN to encode tokens and obtain representations of sentences, and then uses an LSTM to encode the sentences and represent a source document. With the sentence-level representations, the decoder (another LSTM) is designed to recurrently extract salient sentences from the document using the pointing mechanism. This model achieved state-of-the-art performance on the CNN/Daily Mail dataset and was demonstrated to be computationally more efficient than the pointer-generator network.
In this section, we review different strategies to train the seq2seq models for abstractive text summarization. As discussed in , there are two categories of training methodologies, i.e., word-level and sequence-level training. The commonly used teacher forcing algorithm [44, 49] and cross-entropy training [48, 124] belong to the first category, while different RL-based algorithms [46, 53, 52] fall into the second. We now discuss the basic ideas of different training algorithms and their applications to seq2seq models for the text summarization. A comprehensive survey of deep RL for seq2seq models can be found in .
Word-level training for language models represents methodologies that try to optimize predictions of the next token. For example, in abstractive text summarization, given a source article $x$, a seq2seq model generates a summary $y = (y_1, \dots, y_T)$ with the probability $P_\theta(y|x)$, where $\theta$ represents the model parameters (e.g., weights $W$ and biases $b$). In a neural language model, this probability can be expanded to
$$P_\theta(y|x) = \prod_{t=1}^{T} P_\theta(y_t | y_{<t}, x),$$
where each multiplier $P_\theta(y_t | y_{<t}, x)$, known as the likelihood, is the conditional probability of the next token $y_t$ given all previous ones, denoted by $y_{<t} = (y_1, \dots, y_{t-1})$. Intuitively, the text generation process can be described as follows. Starting with a special token ‘SOS’ (start of sequence), the model generates one token $y_t$ at a time with the probability $P_\theta(y_t | y_{<t}, x)$. This token can be obtained by a sampling method or a greedy search, i.e., $y_t = \arg\max_{y_t} P_\theta(y_t | y_{<t}, x)$ (see Fig. 8). The generated token will then be fed into the next decoding step. The generation is stopped when the model outputs the ‘EOS’ (end of sequence) token or when the length reaches a user-defined maximum threshold. In this section, we review different approaches for learning the model parameters, i.e., $\theta$. We will start with the commonly used end-to-end training approach, i.e., cross-entropy training, and then move on to two different methods for avoiding the problem of exposure bias.
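The generation process described above can be sketched with a toy model. The sketch assumes a hypothetical model interface, `next_token_probs(prefix)`, that returns a dictionary of next-token probabilities; `toy_model` is a hand-written stand-in that only looks at the last token and is not a trained seq2seq model.

```python
def greedy_decode(next_token_probs, max_len, sos="SOS", eos="EOS"):
    """Generate tokens one at a time, always picking the most probable one."""
    prefix = [sos]
    while len(prefix) - 1 < max_len:
        probs = next_token_probs(prefix)
        token = max(probs, key=probs.get)
        prefix.append(token)  # the generated token is fed into the next step
        if token == eos:
            break
    return prefix[1:]

def sequence_probability(next_token_probs, tokens, sos="SOS"):
    """Chain rule: product of the conditional next-token probabilities."""
    prefix, prob = [sos], 1.0
    for tok in tokens:
        prob *= next_token_probs(prefix)[tok]
        prefix.append(tok)
    return prob

def toy_model(prefix):
    """Hypothetical stand-in for a decoder: depends only on the last token."""
    table = {
        "SOS": {"a": 0.6, "b": 0.3, "EOS": 0.1},
        "a": {"b": 0.7, "a": 0.2, "EOS": 0.1},
        "b": {"EOS": 0.8, "a": 0.1, "b": 0.1},
    }
    return table[prefix[-1]]

summary = greedy_decode(toy_model, max_len=10)
truncated = greedy_decode(toy_model, max_len=2)
probability = sequence_probability(toy_model, summary)
```

Decoding stops either when ‘EOS’ is produced or when the maximum length is reached, which is why the second call returns a sequence without ‘EOS’.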
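To learn the model parameters $\theta$, XENT maximizes the log-likelihood of the observed (ground-truth) sequences $\hat{y} = (\hat{y}_1, \dots, \hat{y}_T)$, i.e.,
$$\log P_\theta(\hat{y} | x) = \sum_{t=1}^{T} \log P_\theta(\hat{y}_t | \hat{y}_{<t}, x),$$
which is equivalent to minimizing the cross entropy (XE) loss,
$$\mathcal{L}_{XE} = -\sum_{t=1}^{T} \log P_\theta(\hat{y}_t | \hat{y}_{<t}, x).$$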
We show this training strategy in Fig. 9. The algorithm is also known as the teacher forcing algorithm [44, 49]. During training, it uses observed tokens (ground-truth) as input and aims to improve the probability of the next observed token at each decoding step. However, during testing, it relies on predicted tokens from the previous decoding step. This is the major difference between training and testing (see Fig. 8 and Fig. 9). Since the predicted tokens may not be the observed ones, this discrepancy will be accumulated over time and thus yields summaries that are very different from ground-truth summaries. This problem is known as exposure bias [46, 45, 44].
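The following sketch computes the cross-entropy loss of one reference summary under teacher forcing. As an assumption for illustration, the model is again a hypothetical function `next_token_probs(prefix)` returning next-token probabilities, and `toy_model` is a hand-written stand-in rather than a trained network.

```python
import math

def xent_loss(next_token_probs, reference, sos="SOS"):
    """Negative log-likelihood of the reference tokens under teacher forcing."""
    prefix, loss = [sos], 0.0
    for gold in reference:
        loss -= math.log(next_token_probs(prefix)[gold])
        # Teacher forcing: the observed token, not the model's own prediction,
        # is fed into the next decoding step.
        prefix.append(gold)
    return loss

def toy_model(prefix):
    """Hypothetical stand-in for a decoder: depends only on the last token."""
    table = {
        "SOS": {"a": 0.6, "b": 0.3, "EOS": 0.1},
        "a": {"b": 0.7, "a": 0.2, "EOS": 0.1},
        "b": {"EOS": 0.8, "a": 0.1, "b": 0.1},
    }
    return table[prefix[-1]]

likely_loss = xent_loss(toy_model, ["a", "b", "EOS"])
unlikely_loss = xent_loss(toy_model, ["b", "EOS"])
```

Note that the loss is always evaluated on ground-truth prefixes, whereas at test time the prefixes are the model's own predictions; this mismatch is the source of the exposure bias discussed above.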
The scheduled sampling algorithm, also known as Data As Demonstrator (DAD) [46, 45], has been proposed to solve the exposure bias problem. As shown in Fig. 10, during training, the input at each decoding step comes from a sampler which decides whether it is a model-generated token from the last step or an observed token from the training data. The sampling is based on a Bernoulli distribution
where is the probability of using a token from training data and is a binary indicator function. In the scheduled sampling algorithm, is an annealing/scheduling function and decreases with training time from to . As suggested by Bengio et al. , scheduling function can take different forms, e.g.,
where is training step and is a parameter that guarantees . This strategy is often referred to as a curriculum learning algorithm