Abstractive sentence summarization aims at generating concise and informative summaries based on the core meaning of source sentences. Previous endeavors tackle the problem through either rule-based methods Dorr et al. (2003) or statistical models trained on relatively small-scale training corpora Banko et al. (2000). Following its success in machine translation Sutskever et al. (2014); Bahdanau et al. (2015), the sequence-to-sequence framework has also been applied to the abstractive sentence summarization task using large-scale sentence summary corpora Rush et al. (2015); Chopra et al. (2016); Nallapati et al. (2016), obtaining better performance than the traditional methods.
One central component in state-of-the-art sequence to sequence models is the use of attention for building connections between the source sequence and target words, so that a more informed decision can be made for generating a target word by considering the most relevant parts of the source sequence Bahdanau et al. (2015); Vaswani et al. (2017). For abstractive sentence summarization, such attention mechanisms can be useful for selecting the most salient words for a short summary, while filtering the negative influence of redundant parts.
We consider improving abstractive summarization quality by enhancing target-to-source attention. In particular, we adopt a contrastive mechanism that encourages the contribution from the conventional attention, which attends to relevant parts of the source sentence, while penalizing the contribution from an opponent attention, which attends to irrelevant or less relevant parts. Contrastive attention was first proposed in computer vision Song et al. (2018a), where it is used for person re-identification by attending to person and background regions contrastively. To our knowledge, we are the first to use contrastive attention for NLP and to deploy it in the sequence-to-sequence framework.
In particular, we take Transformer Vaswani et al. (2017) as the baseline summarization model, and enhance it with a proponent attention module and an opponent attention module. The former acts as the conventional attention mechanism, while the latter can be regarded as a dual module to the former, with similar weight calculation structure, but using a novel softmin function to discourage contributions from irrelevant or less relevant words.
To our knowledge, we are the first to investigate Transformer as a sequence-to-sequence summarizer. Results on three benchmark datasets show that it gives highly competitive accuracies compared with RNN and CNN alternatives. When equipped with the proposed contrastive attention mechanism, our Transformer model achieves the best reported results on all three datasets. The visualization of attentions shows that, with the contrastive attention mechanism, our attention is more focused on relevant parts than that of the baseline. We release our code at XXX.
2 Related Work
Automatic summarization has been investigated in two main paradigms: the extractive method and the abstractive method. The former extracts important pieces of the source document and concatenates them sequentially Jing and McKeown (2000); Knight and Marcu (2000); Neto et al. (2002), while the latter grasps the core meaning of the source text and restates it in a short text as an abstractive summary Banko et al. (2000); Rush et al. (2015). In this paper, we focus on abstractive summarization, and especially on abstractive sentence summarization.
Previous work deals with the abstractive sentence summarization task by using rule-based methods Dorr et al. (2003), statistical methods that use a source-summary parallel corpus to train a machine translation model Banko et al. (2000), or a syntax-based transduction model Cohn and Lapata (2008); Woodsend et al. (2010).
In recent years, the sequence-to-sequence neural framework has become predominant on this task, encoding long source texts and decoding them into short summaries with the help of the attention mechanism. RNN is the most commonly adopted and extensively explored architecture Chopra et al. (2016); Nallapati et al. (2016); Li et al. (2017). A CNN-based architecture was recently employed by Gehring et al. (2017) using ConvS2S, which applies CNN on both the encoder and the decoder. Later, Wang et al. (2018) build upon ConvS2S with topic word embedding and encoding, and train the system with reinforcement learning.
The work most related to our contrastive attention mechanism is in the field of computer vision. Song et al. (2018a) first propose the contrastive attention mechanism for person re-identification. In their work, based on a pre-provided person and background segmentation, the two regions are contrastively attended so that they can be easily discriminated. In comparison, we apply the contrastive attention mechanism to sentence-level summarization by contrastively attending to relevant parts and to irrelevant or less relevant parts. Furthermore, we propose a novel combination of softmax and softmin functions to train the attention mechanism, which differs from Song et al. (2018a), who use a mean squared error loss for attention training.
Other explorations with respect to the characteristics of the abstractive summarization task include the copying mechanism that copies words from source sequences for composing summaries Gu et al. (2016); Gulcehre et al. (2016); Song et al. (2018b), the selection mechanism that elaborately selects important parts of source sentences Zhou et al. (2017); Lin et al. (2018), the distraction mechanism that avoids repeated attention on the same area Chen et al. (2016), and sequence-level training that avoids the exposure bias of teacher forcing methods Ayana et al. (2016); Li et al. (2018); Edunov et al. (2018). Such methods are built on conventional attention, and are orthogonal to our proposed contrastive attention mechanism.
We use two categories of attention for summary generation. One is the conventional attention that attends to relevant parts of source sentence, the other is the opponent attention that contrarily attends to irrelevant or less relevant parts. Both categories of attention output probability distributions over summary words, which are jointly optimized by encouraging the contribution from the conventional attention and discouraging the contribution from the opponent attention.
Figure LABEL:fig:overall illustrates the overall networks. We use Transformer architecture as our basis, upon which we build the contrastive attention mechanism. The left part is the original Transformer. We derive the opponent attention from the conventional attention which is the encoder-decoder attention of the original Transformer, and stack several layers on top of the opponent attention as shown in the right part of Figure LABEL:fig:overall. Both parts contribute to the summary generation by producing probability distributions over the target vocabulary, respectively. The left part outputs the conventional probability based on the conventional attention as the original Transformer does, while the right part outputs the opponent probability based on the opponent attention. The two probabilities in Figure LABEL:fig:overall are jointly optimized in a novel way as explained in Section 3.3.
3.1 Transformer for Abstractive Sentence Summarization
Transformer is an attention-network-based sequence-to-sequence architecture Vaswani et al. (2017), which encodes the source text into hidden vectors and decodes them into the target text based on the source-side information and the target generation history. In contrast to the RNN-based and CNN-based architectures, both the encoder and the decoder of Transformer adopt attention as their main function.
Let $x$ and $y$ denote the source sentence and its summary, respectively. Transformer is trained to maximize the probability of $y$ given $x$: $\prod_{i} P_c(y_i \mid y_{<i}, x)$, where $P_c(y_i \mid y_{<i}, x)$ is the conventional probability of the current summary word $y_i$ given the source sentence and the summary generation history. $P_c$ is computed based on the attention mechanism and the stacked deep layers as shown in the left part of Figure LABEL:fig:overall.
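Scaled dot-product attention is applied in Transformer:

$$\mathrm{attention}(q, k, v) = \mathrm{softmax}\left(\frac{q k^{T}}{\sqrt{d_k}}\right) v \qquad (1)$$

where $q$, $k$, and $v$ denote the query vector, key vectors, and value vectors, respectively, and $d_k$ denotes the dimension of one vector of $k$. The softmax function outputs the attention weights distributed over $v$. $\mathrm{attention}(q, k, v)$ is a weighted sum of the elements of $v$, and represents the current context information.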
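As an illustration of the scaled dot-product attention described above, the following is a minimal NumPy sketch for a single query vector. The function names, array shapes, and random inputs are assumptions made for the example and are not taken from the released system.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q: (d_k,), k: (n, d_k), v: (n, d_v). Returns (context, weights)."""
    d_k = k.shape[-1]
    weights = softmax(k @ q / np.sqrt(d_k))  # one weight per source position
    context = weights @ v                    # weighted sum of the value vectors
    return context, weights

# Toy inputs (hypothetical sizes): 5 source positions, d_k = 4, d_v = 3.
rng = np.random.default_rng(0)
q = rng.normal(size=4)
k = rng.normal(size=(5, 4))
v = rng.normal(size=(5, 3))
context, weights = scaled_dot_product_attention(q, k, v)
```

The returned weights form a probability distribution over the source positions, and the context vector is the corresponding weighted sum of the value vectors.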
We focus on the encoder-decoder attention, which builds the connection between source and target by informing the decoder which area of the source text should be attended to. Specifically, in the encoder-decoder attention, $q$ is the single vector coming from the current position of the decoder, and $k$ and $v$ are the same sequence of vectors, namely the outputs of the encoder at all source positions. The softmax function distributes the attention weights over the source positions.
The attentions in Transformer adopt the multi-head implementation, in which each head computes attention as in Equation (1) but with smaller $q$, $k$, and $v$ whose dimension is $1/h$ times their original dimension. The attentions from the $h$ heads are concatenated and linearly projected to compose the final attention. In this way, multi-head attention provides multiple views of attention behavior, which is beneficial for the final performance.
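The multi-head computation can be sketched as follows. This is a minimal NumPy illustration under assumptions: the input projection matrices `w_q`, `w_k`, `w_v` follow the standard Transformer formulation of Vaswani et al. (2017), and all names and sizes are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(q, k, v, w_q, w_k, w_v, w_out, h):
    """q: (d,), k and v: (n, d); all projection matrices are (d, d)."""
    d = q.shape[-1]
    d_head = d // h
    # Project, then split the model dimension into h heads of size d / h.
    qh = (q @ w_q).reshape(h, d_head)                          # (h, d_head)
    kh = (k @ w_k).reshape(-1, h, d_head).transpose(1, 0, 2)   # (h, n, d_head)
    vh = (v @ w_v).reshape(-1, h, d_head).transpose(1, 0, 2)   # (h, n, d_head)
    scores = np.einsum("hd,hnd->hn", qh, kh) / np.sqrt(d_head)
    weights = softmax(scores)                                  # (h, n)
    heads = np.einsum("hn,hnd->hd", weights, vh)               # (h, d_head)
    # Concatenate the heads and apply the final linear projection.
    return heads.reshape(d) @ w_out, weights

rng = np.random.default_rng(0)
d, n, h = 8, 5, 2  # hypothetical toy sizes
q, k, v = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))
w_q, w_k, w_v, w_out = (rng.normal(size=(d, d)) for _ in range(4))
output, head_weights = multi_head_attention(q, k, v, w_q, w_k, w_v, w_out, h)
```

Each row of `head_weights` is the attention distribution of one head over the source positions, which is the per-head quantity that the opponent attention is later derived from.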
The “N” plates in Figure LABEL:fig:overall stand for the stacked N identical layers. On the source side, each of the stacked N layers contains two sublayers: the self-attention mechanism and the fully connected feed-forward network. Each sublayer employs a residual connection that adds the sublayer input to its output, and layer normalization is then applied to the result of the residual connection.
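The residual connection followed by layer normalization can be sketched as below. This is a simplified illustration: the learned gain and bias of layer normalization are omitted, the ReLU feed-forward form follows the standard Transformer, and the toy dimensions and random weights are assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize the last axis to zero mean and unit variance (no learned gain or bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise feed-forward network with a ReLU in the inner layer."""
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def residual_sublayer(x, sublayer_fn):
    """Add the sublayer input to its output, then apply layer normalization."""
    return layer_norm(x + sublayer_fn(x))

rng = np.random.default_rng(0)
d_model, d_inner = 4, 8  # hypothetical toy sizes
w1, b1 = rng.normal(size=(d_model, d_inner)), np.zeros(d_inner)
w2, b2 = rng.normal(size=(d_inner, d_model)), np.zeros(d_model)
x = rng.normal(size=(3, d_model))  # three source positions
out = residual_sublayer(x, lambda t: feed_forward(t, w1, b1, w2, b2))
```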
On the target summary side, each layer contains an additional sublayer of the encoder-decoder attention between the self-attention sublayer and the feed-forward sublayer. At the top of the decoder, the softmax layer is applied to convert the decoder output to summary word generation probabilities.
3.2 Contrastive Attention Mechanism
3.2.1 Opponent Attention
As illustrated in Figure LABEL:fig:overall, the opponent attention is derived from the conventional encoder-decoder attention. Since multi-head attention is employed in Transformer, there are $N \times h$ heads in total in the conventional encoder-decoder attention, where $N$ denotes the number of layers and $h$ denotes the number of heads in each layer. These heads exhibit diverse attention behaviors, which poses the challenge of determining which head to derive the opponent attention from, so that it attends to irrelevant or less relevant parts.
Figure LABEL:fig:heads illustrates the attention weights of two sampled heads. The attention weights in (a) reflect the word-level relevance relation between the source sentence and the target summary well, while the attention weights in (b) do not. We find that this behavior is a fixed characteristic of each head. For example, head (a) always exhibits the relevance relation across different sentences and different runs. Based on the heatmaps of all heads for a few sentences, we choose the head that corresponds well to the relevance relation between source and target to derive the opponent attention. (Footnote 1: Given manual alignments between source and target of sampled sentence-summary pairs, we select the head that has the lowest alignment error rate (AER) of its attention weights.)
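The head selection by alignment error rate can be sketched as follows. This is an illustration under assumptions, not the authors' script: attention weights are turned into alignment links by taking the argmax source position for each target word, and the gold alignment is a single set of links (so the sure and possible sets of the usual AER definition coincide). All function names and the toy data are hypothetical.

```python
import numpy as np

def attention_to_links(weights):
    """weights: (target_len, source_len). Link each target word to its argmax source word."""
    return {(t, int(np.argmax(row))) for t, row in enumerate(weights)}

def alignment_error_rate(predicted, gold):
    """AER with a single set of gold links (sure == possible)."""
    if not predicted and not gold:
        return 0.0
    return 1.0 - 2.0 * len(predicted & gold) / (len(predicted) + len(gold))

def select_head(head_weights, gold):
    """head_weights: dict mapping (layer, head) to an attention matrix."""
    scores = {name: alignment_error_rate(attention_to_links(w), gold)
              for name, w in head_weights.items()}
    return min(scores, key=scores.get), scores

# Toy example: two target words, three source words.
gold = {(0, 1), (1, 2)}
head_weights = {
    (3, 5): np.array([[0.1, 0.8, 0.1], [0.1, 0.2, 0.7]]),  # follows the gold links
    (1, 5): np.array([[0.6, 0.2, 0.2], [0.5, 0.3, 0.2]]),  # ignores them
}
best, scores = select_head(head_weights, gold)
```

In practice the error rate would be accumulated over all sampled sentence-summary pairs rather than computed on a single pair.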
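Specifically, let $\alpha_c$ denote the conventional encoder-decoder attention weights of the head which is used for deriving the opponent attention:

$$\alpha_c = \mathrm{softmax}\left(\frac{q k^{T}}{\sqrt{d_k}}\right) \qquad (2)$$

where $q$ and $k$ are from the same head as $\alpha_c$. Let $\alpha_o$ denote the opponent attention weights. They are obtained by applying the opponent function to $\alpha_c$, followed by the softmax function:

$$\alpha_o = \mathrm{softmax}(\mathrm{opponent}(\alpha_c)) \qquad (3)$$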
The opponent function in Equation (3) performs a masking operation: it finds the maximum weight in $\alpha_c$ and replaces it with negative infinity, so that the subsequent softmax function outputs zero at that position. As a result, the maximum weight in $\alpha_c$ becomes zero in $\alpha_o$. In this way, the most relevant part of the source sequence, which receives the maximum attention in the conventional attention weights $\alpha_c$, is masked and neglected in $\alpha_o$. Instead, the remaining less relevant or irrelevant parts are extracted into $\alpha_o$ for the subsequent contrastive training and decoding.
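The masking operation can be sketched in a few lines of NumPy. The function names and the toy weights are illustrative assumptions; the sketch applies the mask to a single vector of conventional attention weights and then renormalizes with softmax, as described above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def opponent(alpha_c):
    """Replace the maximum conventional attention weight with negative infinity."""
    masked = alpha_c.astype(float).copy()
    masked[np.argmax(alpha_c)] = -np.inf
    return masked

def opponent_attention_weights(alpha_c):
    """Mask the most relevant position, then renormalize over the rest."""
    return softmax(opponent(alpha_c))

alpha_c = np.array([0.70, 0.15, 0.10, 0.05])  # hypothetical conventional weights
alpha_o = opponent_attention_weights(alpha_c)
```

The masked position receives exactly zero weight, and the remaining positions keep their relative order. The sketch assumes at least two source positions; with a single position every entry would be masked and the softmax would be undefined.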
We also tried other methods to calculate the opponent attention weights, such as the formulation of Song et al. (2018a) with an added softmax (Footnote 2: Song et al. (2018a) directly let $\alpha_o = 1 - \alpha_c$ in extracting background features for person re-identification in computer vision. We have to add the softmax function since the attention weights must be normalized to one in the sequence-to-sequence framework.) or another function that aims to make $\alpha_o$ contrary to $\alpha_c$, but they underperform the masking opponent function on all benchmark datasets. We therefore present only the masking opponent function in the following sections.
After $\alpha_o$ is obtained via Equation (3), the opponent attention is $A_o = \alpha_o v$, where $v$ is from the same head as the $q$ and $k$ used in computing $\alpha_c$.
Compared to the conventional attention $A_c$, which summarizes the current relevant context, $A_o$ summarizes the current irrelevant or less relevant context. They constitute a contrastive pair, and contribute together to the final summary word generation.
3.2.2 Opponent Probability
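The opponent probability $P_o$ is computed by stacking several layers on top of $A_o$, with a softmin layer at the end, as shown in the right part of Figure LABEL:fig:overall. In particular,

$$z_1 = \mathrm{LayerNorm}(A_o) \qquad (4)$$
$$z_2 = \mathrm{FeedForward}(z_1) \qquad (5)$$
$$z_3 = \mathrm{LayerNorm}(z_1 + z_2) \qquad (6)$$
$$P_o(y_i \mid y_{<i}, x) = \mathrm{softmin}(W z_3) \qquad (7)$$

where $W$ is the matrix of the linear projection sublayer.
$A_o$ contributes to $P_o$ via Equations (4)-(7) step by step. The LayerNorm and FeedForward layers with residual connection are similar to those of the original Transformer, while a novel softmin function is introduced at the end to invert the contribution from $A_o$:

$$\mathrm{softmin}(u_i) = \frac{e^{-u_i}}{\sum_{j} e^{-u_j}} \qquad (8)$$

where $u = W z_3$, i.e., the input vector to the softmin function in Equation (7). Softmin normalizes $u$ so that the scores of all words in the summary vocabulary sum to one. The bigger $u_i$ is, the smaller $\mathrm{softmin}(u_i)$ is.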
Softmin functions contrarily to softmax. As a result, when we try to maximize $P_o(y_i \mid y_{<i}, x)$, where $y_i$ is the gold summary word, we effectively search for an appropriate $A_o$ that generates the lowest score $u_g$, where $u$ is the input vector to the softmin function and $g$ is the index of $y_i$ in $u$. This means that the more irrelevant $A_o$ is to the summary, the lower $u_g$ can be, resulting in a higher $P_o$.
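The inverse behavior of softmin relative to softmax can be checked with a small NumPy sketch. The function names and the toy score vector `logits` are assumptions for illustration; the vector stands in for the input to the softmin layer.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def softmin(logits):
    """softmin(u)_i = exp(-u_i) / sum_j exp(-u_j), computed stably."""
    neg = -logits
    e = np.exp(neg - np.max(neg))
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0, 3.0])  # hypothetical scores over a 4-word vocabulary
p_o = softmin(logits)
```

The word with the lowest score receives the highest probability, so maximizing the softmin probability of the gold word pushes its score down. Softmin is equivalent to applying softmax to the negated scores.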
3.3 Training and Decoding
During training, we jointly maximize the conventional probability $P_c$ and the opponent probability $P_o$:

$$J = \log P_c(y_i \mid y_{<i}, x) + \lambda \log P_o(y_i \mid y_{<i}, x) \qquad (9)$$

where $\lambda$ is the balancing weight. The conventional probability is computed as in the original Transformer, based on the feed-forward, linear projection, and softmax sublayers stacked over the conventional attention, as illustrated in the left part of Figure LABEL:fig:overall. The opponent probability is based on similar sublayers stacked over the opponent attention, but with softmin as the last sublayer, as illustrated in the right part of Figure LABEL:fig:overall.
Due to the contrary properties of softmax and softmin, jointly maximizing $P_c$ and $P_o$ actually maximizes the contribution from the conventional attention to summary word generation, while at the same time minimizing the contribution from the opponent attention. (Footnote 3: We also tried replacing softmin in Equation (7) with softmax, and correspondingly setting the training objective to maximizing $\log P_c - \lambda \log P_o$, but this method failed to train because $P_o$ becomes too small during training, which results in a negative infinity value of $\log P_o$ that hampers the training. In comparison, softmin and the training objective of Equation (9) do not have this problem, enabling effective training of the proposed network.) In other words, the training objective is to let the relevant part attended to by $A_c$ contribute more to the summarization, while letting the irrelevant or less relevant parts attended to by $A_o$ contribute less.
During decoding, we search for the summary that maximizes Equation (9) in the beam search process.
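The joint objective can be sketched for a single decoding step as follows. This is an illustration under assumptions: the probability vectors are toy values over a three-word vocabulary, the weights passed as `lam` are hypothetical, and a real system would accumulate these scores along beam search hypotheses.

```python
import numpy as np

def joint_log_score(p_c, p_o, lam):
    """Per-word score: log P_c + lam * log P_o."""
    return np.log(p_c) + lam * np.log(p_o)

def training_loss(p_c_gold, p_o_gold, lam):
    """Negative joint log-likelihood summed over the gold summary words."""
    return -np.sum(joint_log_score(p_c_gold, p_o_gold, lam))

# Toy distributions over a three-word vocabulary at one decoding step.
p_c = np.array([0.60, 0.30, 0.10])  # conventional probability
p_o = np.array([0.20, 0.70, 0.10])  # opponent probability
best_without_opponent = int(np.argmax(joint_log_score(p_c, p_o, lam=0.0)))
best_with_opponent = int(np.argmax(joint_log_score(p_c, p_o, lam=1.0)))
```

With the weight set to zero the choice follows the conventional probability alone; with a weight of one the opponent probability changes which word scores highest in this toy example.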
We conduct experiments on abstractive sentence summarization benchmark datasets to demonstrate the effectiveness of the proposed contrastive attention mechanism.
In this paper, we evaluate our proposed method on three abstractive text summarization benchmark datasets. First, we use the annotated Gigaword corpus and preprocess it identically to Rush et al. Rush et al. (2015), which results in around 3.8M training samples, 190K validation samples and 1951 test samples for evaluation. The source-summary pairs are formed through pairing the first sentence of each article with its headline. We use DUC-2004 as another English data set only for testing in our experiments. It contains 500 documents, each containing four human-generated reference summaries. The length of the summary is capped at 75 bytes. The last data set we used is a large corpus of Chinese short text summarization (LCSTS) Hu et al. (2015), which is collected from the Chinese microblogging website Sina Weibo. We follow the data split of the original paper, with 2.4M source-summary pairs from the first part of the corpus for training, 725 pairs from the last part with high annotation score for testing.
Table 1: ROUGE scores of related work on the English evaluation sets (Gigaword: full-length F-1; DUC2004: recall).
| System | Gigaword R-1 | Gigaword R-2 | Gigaword R-L | DUC2004 R-1 | DUC2004 R-2 | DUC2004 R-L |
|---|---|---|---|---|---|---|
| ABS Rush et al. (2015) | 29.55 | 11.32 | 26.42 | 26.55 | 7.06 | 22.05 |
| ABS+ Rush et al. (2015) | 29.76 | 11.88 | 26.96 | 28.18 | 8.49 | 23.81 |
| RAS-Elman Chopra et al. (2016) | 33.78 | 15.97 | 31.15 | 28.97 | 8.26 | 24.06 |
| words-lvt5k-1sent Nallapati et al. (2016) | 35.30 | 16.64 | 32.62 | 28.61 | 9.42 | 25.24 |
| SEASS Zhou et al. (2017) | 36.15 | 17.54 | 33.63 | 29.21 | 9.56 | 25.51 |
| RNN Ayana et al. (2016) | 36.54 | 16.59 | 33.44 | 30.41 | 10.87 | 26.79 |
| Actor-Critic Li et al. (2018) | 36.05 | 17.35 | 33.49 | 29.41 | 9.84 | 25.85 |
| StructuredLoss Edunov et al. (2018) | 36.70 | 17.88 | 34.29 | - | - | - |
| DRGD Li et al. (2017) | 36.27 | 17.57 | 33.62 | 31.79 | 10.75 | 27.48 |
| ConvS2S Gehring et al. (2017) | 35.88 | 17.48 | 33.29 | 30.44 | 10.84 | 26.90 |
| ConvS2S Wang et al. (2018) | 36.92 | 18.29 | 34.58 | 31.15 | 10.85 | 27.68 |
| FactAware Cao et al. (2018) | 37.27 | 17.65 | 34.24 | - | - | - |
4.2 Experimental Setup
We employ Transformer as our basis architecture (Footnote 4: https://github.com/pytorch/fairseq). Six layers are stacked in both the encoder and the decoder, and the dimensions of the embedding vectors and all hidden vectors are set to 512. The inner layer of the feed-forward sublayer has a dimensionality of 2048. We use eight heads in the multi-head attention. The source embedding, the target embedding, and the linear sublayer are shared in our experiments. Byte-pair encoding is employed in the English experiments with a shared source-target vocabulary of about 32k tokens Sennrich et al. (2015).
Regarding the contrastive attention mechanism, the opponent attention is derived from the head whose attention is most synchronous to the word alignments of the source-summary pair. In our experiments, we select the fifth head of the third layer for deriving the opponent attention in the English experiments, and the second head of the third layer in the Chinese experiments. All dimensions in the contrastive architecture are set to 64. The $\lambda$ in Equation (9) is tuned on the development set in each experiment.
During training, we use the Adam optimizer with $\beta_1$ = 0.9, $\beta_2$ = 0.98, $\epsilon$ = 10. The initial learning rate is 0.0005. The inverse square root schedule is applied for initial warm-up and annealing Vaswani et al. (2017). We use a dropout rate of 0.3 on all datasets.
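A common form of the inverse square root schedule is sketched below: linear warm-up to the peak learning rate, followed by decay proportional to the inverse square root of the update number. The peak of 0.0005 comes from the text; the warm-up length of 4000 updates and the starting rate of 1e-7 are assumptions chosen for illustration, since the text does not state them.

```python
import math

def inverse_sqrt_lr(step, peak_lr=0.0005, warmup_steps=4000, init_lr=1e-7):
    """Learning rate at a given update (warmup_steps and init_lr are assumed values)."""
    if step < warmup_steps:
        # Linear warm-up from init_lr to peak_lr.
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    # Annealing: decay with the inverse square root of the update number.
    return peak_lr * math.sqrt(warmup_steps / step)

schedule = [inverse_sqrt_lr(s) for s in (0, 2000, 4000, 16000)]
```

Under these assumed settings the rate reaches its peak at update 4000 and is halved by update 16000.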
During evaluation, we employ ROUGE Lin (2004) as our evaluation metric. Since the standard ROUGE package is designed to evaluate English summarization systems, we follow the method of Hu et al. (2015) and map Chinese words into numerical IDs in order to evaluate the performance on the Chinese data set.
4.3.1 English Results
The experimental results on the English evaluation sets are listed in Table 1. We report the full-length F-1 scores of ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L) on the evaluation set of the annotated Gigaword, and report the recall-based scores of R-1, R-2, and R-L on the evaluation set of DUC2004, following the setting of previous work.
The results of our work are shown at the bottom of Table 1. The performances of related work are reported in the upper part of Table 1 for comparison. ABS and ABS+ are the pioneering works on using neural models for abstractive text summarization. RAS-Elman extends ABS/ABS+ with an attentive CNN encoder. words-lvt5k-1sent uses a large vocabulary and linguistic features such as POS and NER tags. RNN, Actor-Critic, and StructuredLoss are sequence-level training methods that overcome the problems of the usual teacher-forcing methods. DRGD uses a recurrent latent random model to improve summarization quality. FactAware generates summary words conditioned on both the source text and the fact descriptions extracted from OpenIE or dependencies. Besides the above RNN-based related works, the CNN-based ConvS2S architectures of Gehring et al. (2017) and Wang et al. (2018) are included for comparison.
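To make the difference between the recall-based and F-1 scores concrete, the following is a simplified sketch of ROUGE-N for a single reference. It is not the official ROUGE package: it omits stemming, multiple references, and the byte-length cap, and the function names are hypothetical. The token lists are taken from one of the example summaries in this paper.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N against a single reference (no stemming)."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f1": f1}

reference = "hoyer wins singles title".split()
candidate = "hoyer wins men 's singles title".split()
scores = rouge_n(candidate, reference, n=1)
```

Because the inputs are plain token lists, the same function works on sequences of numerical IDs, which is how the Chinese outputs are scored after the word-to-ID mapping.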
Table 1 shows that we build a strong baseline using Transformer alone which obtains the state-of-the-art performance on Gigaword evaluation set, and obtains comparable performance to the state-of-the-art on DUC2004. When we introduce the contrastive attention mechanism into Transformer, it significantly improves the performance of Transformer, and greatly advances the state-of-the-art on both Gigaword evaluation set and DUC2004, as shown in the row of “Transformer+Contrastive Attention”.
Table 2: ROUGE scores of related work on LCSTS.
| System | R-1 | R-2 | R-L |
|---|---|---|---|
| RNN context Hu et al. (2015) | 29.90 | 17.40 | 27.20 |
| CopyNet Gu et al. (2016) | 34.40 | 21.60 | 31.30 |
| RNN Ayana et al. (2016) | 38.20 | 25.20 | 35.40 |
| RNN Chen et al. (2016) | 35.20 | 22.60 | 32.50 |
| DRGD Li et al. (2017) | 36.99 | 24.15 | 34.21 |
| Actor-Critic Li et al. (2018) | 37.51 | 24.68 | 35.02 |
| Global Lin et al. (2018) | 39.40 | 26.90 | 36.50 |
4.3.2 Chinese Results
Table 2 presents the evaluation results on LCSTS. The upper rows list the performances of the related works, the bottom rows list the performances of our Transformer baseline and the integration of the contrastive attention mechanism into Transformer. We only take character sequences as source-summary pairs and evaluate the performance based on reference characters for strict comparison to the related works.
Table 2 shows that Transformer also sets a strong baseline on LCSTS that surpasses the performances of the previous works. When Transformer is equipped with our proposed contrastive attention mechanism, the performance is significantly improved and drastically advances the state-of-the-art on LCSTS.
5 Analysis and Discussion
5.1 Effect of the Contrastive Attention Mechanism on Attentions
Figure LABEL:fig:att_results shows the attention weights before and after using the contrastive attention mechanism. We depict the averaged attention weights of all heads in one layer in Figure LABEL:fig:att_resultsa and LABEL:fig:att_resultsb to study how it contributes to the conventional probability computation, and depict the opponent attention weights in Figure LABEL:fig:att_resultsc to study its contribution to the opponent probability. Since we select the fifth head of the third layer to derive the opponent attention in English experiment, the studies are carried out on the third layer.
Figure LABEL:fig:att_resultsa is from the baseline Transformer, Figure LABEL:fig:att_resultsb is from “Transformer + ContrastiveAttention”. We can see that “Transformer + ContrastiveAttention” is more focused on the source parts that are most relevant to the summary than the baseline Transformer, which scatters attention weights on summary word neighbors or even functional words such as “-lrb-” and “the”. “Transformer + ContrastiveAttention” cancels such scattered attentions by using the contrastive attention mechanism.
Figure LABEL:fig:att_resultsc depicts the opponent attention weights. They are optimized during training to generate the lowest score, which is fed into softmin to obtain the highest opponent probability $P_o$. The more irrelevant to the summary word the opponent attention is, the lower the score can be, thus resulting in a higher $P_o$. Figure LABEL:fig:att_resultsc shows that the attentions are formed over irrelevant parts with varied weights as the result of maximizing $P_o$ during training.
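| Table 3 (upper rows) | Gigaword R-1 | Gigaword R-2 | Gigaword R-L | DUC2004 R-1 | DUC2004 R-2 | DUC2004 R-L |
|---|---|---|---|---|---|---|
| mask maximum weight | 38.72 | 19.09 | 35.82 | 32.22 | 11.04 | 27.59 |
| mask top-2 weights | 38.17 | 19.15 | 35.51 | 31.87 | 10.94 | 27.41 |
| mask top-3 weights | 38.36 | 19.11 | 35.56 | 31.67 | 10.37 | 27.31 |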
|Src:press freedom in algeria remains at risk despite the release on wednesday of prominent newspaper editor mohamed UNK after a two-year prison sentence , human rights organizations said .|
|Ref:algerian press freedom at risk despite editor ’s release UNK picture|
|Transformer:press freedom remains at risk in algeria rights groups say|
|Transformer+ContrastiveAtt:press freedom remains at risk despite release of algerian editor|
|Src:denmark ’s poul-erik hoyer completed his hat-trick of men ’s singles badminton titles at the european championships , winning the final here on saturday|
|Ref:hoyer wins singles title|
|Transformer:hoyer completes hat-trick|
|Transformer+ContrastiveAtt:hoyer wins men ’s singles title|
|Src:french bank credit agricole launched on tuesday a public cash offer to buy the ## percent of emporiki bank it does not already own , in a bid valuing the greek group at #.# billion euros ( #.# billion dollars ) .|
|Ref:credit agricole announces #.#-billion-euro bid for greek bank emporiki|
|Transformer:credit agricole launches public cash offer for greek bank|
|Transformer+ContrastiveAtt:french bank credit agricole bids #.# billion euros for greek bank|
5.2 Effect of the Opponent Probability in Decoding
We study the contribution of the opponent probability $P_o$ by dropping it during decoding to see if it hurts the performance. Table 4 shows that dropping $P_o$ significantly harms the performance of “Transformer + ContrastiveAtt”. The performance difference between the model without $P_o$ and the baseline Transformer is marginal, indicating that adding the opponent probability is key to achieving the performance improvement.
5.3 Explorations on Deriving the Opponent Attention
Masking More Attention Weights for Deriving the Opponent Attention
In Section 3.2.1, we mask the most salient word, which has the maximum weight in $\alpha_c$, to derive the opponent attention. In this subsection, we experiment with masking more weights of $\alpha_c$ in two ways: 1) masking the top-$k$ weights, and 2) dynamic masking. In the dynamic masking method, we first sort the weights in descending order, then keep masking pairs of neighboring weights until the ratio between the two neighbors exceeds a threshold. The threshold is set to 1.02 based on training and tuning on the development set.
The upper rows of Table 3 present the performance comparison between masking the maximum weight and masking more weights. Masking only the maximum weight performs better, indicating that it leaves more irrelevant or less relevant words for computing the opponent probability $P_o$, which is more reliable than computing it from the fewer words that remain after masking more weights.
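The two masking variants can be sketched as follows. The dynamic variant reflects one possible reading of the description, stated here as an assumption: walking down the sorted weights, the next weight is also masked as long as the ratio between the current weight and the next one does not exceed the threshold. Function names and toy weights are hypothetical.

```python
import numpy as np

def top_k_mask(alpha_c, k):
    """Replace the k largest attention weights with negative infinity."""
    masked = alpha_c.astype(float).copy()
    masked[np.argsort(alpha_c)[::-1][:k]] = -np.inf
    return masked

def dynamic_mask(alpha_c, threshold=1.02):
    """Mask the maximum weight, then keep masking near-tied neighbors (assumed reading)."""
    order = np.argsort(alpha_c)[::-1]  # positions sorted by descending weight
    masked = alpha_c.astype(float).copy()
    masked[order[0]] = -np.inf
    for prev, cur in zip(order[:-1], order[1:]):
        if alpha_c[prev] / alpha_c[cur] > threshold:
            break  # the gap between the two neighbors is large enough: stop masking
        masked[cur] = -np.inf
    return masked

alpha_c = np.array([0.30, 0.295, 0.25, 0.155])  # hypothetical conventional weights
top2 = top_k_mask(alpha_c, 2)
dynamic = dynamic_mask(alpha_c)
```

The sketch assumes strictly positive weights, as produced by softmax, and does not guard against the case where every position ends up masked.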
Selecting Non-synchronous Head or Averaged Head for Deriving the Opponent Attention
As explained in Section 3.2.1, the opponent attention is derived from the head that is most synchronous to the word alignments between source sentence and summary. We denote it “synchronous head”. We also explored deriving the opponent attention from the fifth head of the first layer, which is non-synchronous to the word alignments as illustrated in Figure LABEL:fig:headsb. Its result is presented in the “non-synchronous head” row. In addition, the attention weights averaged on all heads of the third layer are used to derive the opponent attention. We denote it “averaged head”.
As shown in the middle part of Table 3, both “non-synchronous head” and “averaged head” underperform “synchronous head”. “non-synchronous head” performs worst, and is even worse than the Transformer baseline on Gigaword. This indicates that it is better to compose the opponent attention from irrelevant parts that can be easily located in the synchronous head. “averaged head” performs slightly worse than “synchronous head”, and is also slower because it involves all heads.
5.4 Qualitative Study
Table 5 shows the qualitative results. The highlights in the baseline Transformer output mark the incorrect areas extracted by the baseline system. In contrast, the highlights in the Transformer+ContrastiveAtt output show that correct contents are extracted, since the contrastive system distinguishes relevant parts from irrelevant parts on the source side and makes attending to the correct areas easier.
We proposed a contrastive attention mechanism for abstractive sentence summarization, using both the conventional attention that attends to the relevant parts of the source sentence, and a novel opponent attention that attends to irrelevant or less relevant parts for the summary word generation. Both categories of the attention constitute a contrastive pair, and we encourage contribution from the conventional attention and penalize contribution from the opponent attention through joint training. Using Transformer as a strong baseline, experiments on three benchmark data sets show that the proposed contrastive attention mechanism significantly improves the performance, advancing the state-of-the-art performance for the task.
The authors would like to thank the anonymous reviewers for the helpful comments. This work was supported by National Key R&D Program of China (Grant No. 2016YFE0132100), National Natural Science Foundation of China (Grant No. 61525205, 61673289).
- Ayana et al. (2016). Neural headline generation with sentence-wise optimization. Computing Research Repository arXiv:1604.01904. Note: version 2.
- Bahdanau et al. (2015). Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations.
- Banko et al. (2000). Headline generation based on statistical translation. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pp. 318–325.
- Cao et al. (2018). Faithful to the original: fact aware neural abstractive summarization. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Chen et al. (2016). Distraction-based neural networks for modeling document. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 2754–2760.
- Chopra et al. (2016). Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 93–98.
- Cohn and Lapata (2008). Sentence compression beyond word deletion. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1, pp. 137–144.
- Dorr et al. (2003). Hedge trimmer: a parse-and-trim approach to headline generation. In Proceedings of the HLT-NAACL 03 on Text summarization workshop-Volume 5, pp. 1–8.
- Edunov et al. (2018). Classical structured prediction losses for sequence to sequence learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 355–364.
- Gehring et al. (2017). Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 1243–1252.
- Gu et al. (2016). Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
- Gulcehre et al. (2016). Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
- Hu et al. (2015). LCSTS: a large scale Chinese short text summarization dataset. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1967–1972.
- Jing and McKeown (2000). Cut and paste based text summarization. In Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference, pp. 178–185.
- Knight and Marcu (2000). Statistics-based summarization - step one: sentence compression. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pp. 703–710.
- Li et al. (2018). Actor-critic based training framework for abstractive summarization. Computing Research Repository arXiv:1803.11070.
- Li et al. (2017). Deep recurrent generative decoder for abstractive text summarization. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2091–2100.
- Lin (2004). ROUGE: a package for automatic evaluation of summaries. In Proc of the ACL-04 Workshop on Text Summarization Branches Out.
- Lin et al. (2018). Global encoding for abstractive summarization. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pp. 163–169.
- Nallapati et al. (2016). Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pp. 280–290.
- Neto et al. (2002). Automatic text summarization using a machine learning approach. In Brazilian Symposium on Artificial Intelligence, pp. 205–215.
- Rush et al. (2015). A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 379–389.
- Sennrich et al. (2015). Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
- Song et al. (2018a). Mask-guided contrastive attention model for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1179–1188.
- Song et al. (2018b). Structure-infused copy mechanisms for abstractive summarization. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1717–1729.
- Sutskever et al. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112.
- Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- Wang et al. (2018). A reinforced topic-aware convolutional sequence-to-sequence model for abstractive text summarization. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pp. 4453–4460.
- Woodsend et al. (2010). Title generation with quasi-synchronous grammar. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 513–523.
- Zhou et al. (2017). Selective encoding for abstractive sentence summarization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1095–1104.