Text summarization is an important task in natural language processing. It aims to capture the key idea of a document and generate a headline or a summary. Previous social media text summarization systems (Rush et al., 2015; Hu et al., 2015) are mainly based on abstractive text summarization. Most of them belong to the family of encoder-decoder models, which have proven effective in many tasks such as machine translation (Cho et al., 2014; Bahdanau et al., 2015).
However, these models sometimes mistakenly select noise words from irrelevant sentences as part of a summary. Figure 1 gives an example of noise words generated by a state-of-the-art encoder-decoder model ("RNN-context"). Unlike translation, which requires encoding all of the input to ensure accuracy, summarization aims to extract only the most important information. Furthermore, only a small number of sentences convey the key information, while the rest are usually uninformative. These unrelated sentences make it hard for encoder-decoder models to extract the key information.
To address this issue, we propose a novel method that learns a sentence weight distribution to encourage models to focus on key sentences and ignore unimportant ones. In our approach, we first design a multi-layer perceptron to predict sentence weights. Then, since ROUGE is a popular evaluation criterion for summarization, we estimate the gold sentence weights of the training data from the ROUGE scores between each sentence and the reference summary. During training, an end-to-end optimization method minimizes the gap between the predicted sentence weights and the estimated sentence weights.
We conduct experiments on a large-scale social media dataset, and the results show that our method outperforms competitive baselines. Moreover, our method is not tied to any specific neural network; it can be extended to any sequence-to-sequence model.
2 Related Work
Summarization approaches can be divided into two typical categories: extractive summarization (Radev et al., 2004; Aliguliyev, 2009; Woodsend and Lapata, 2010; Ferreira et al., 2013; Cheng and Lapata, 2016) and abstractive summarization (Knight and Marcu, 2002; Bing et al., 2015; Rush et al., 2015; Hu et al., 2015; Gu et al., 2016). In extractive summarization, most systems select several sentences from a document to form a summary or a headline. In abstractive summarization, most systems encode a document into an abstract representation and then generate the words of a summary one by one. Most social media summarization systems belong to abstractive text summarization. Generally speaking, extractive summarization achieves better performance than abstractive summarization on long and medium-length documents. However, extractive summarization is not suitable for social media texts, which are very short and full of noise.
Neural abstractive text summarization is a newly proposed approach and has become a hot research topic in recent years. Unlike traditional summarization systems, which consist of many small sub-components that are tuned separately (Knight and Marcu, 2002; Erkan and Radev, 2004; Moawad and Aref, 2012), neural abstractive text summarization attempts to build and train a single, large neural network that reads a document and outputs a correct summary. Rush et al. (2015) first introduced the encoder-decoder framework with the attention mechanism to abstractive text summarization. Bing et al. (2015) proposed an abstraction-based multi-document summarization framework which can construct new sentences by exploring syntactic units that are more fine-grained than sentences. Gu et al. (2016) proposed a copy mechanism to address the problem of unknown words. Nallapati et al. (2016) proposed several novel models to address critical problems in summarization.
3 Proposed Model
Section 3.1 describes in detail how to estimate the sentence weight distribution. Section 3.2 describes how to generate the representation of the sentence weight distribution. Section 3.3 shows how to incorporate the estimated and the predicted sentence weights during training.
3.1 Estimating Sentence Weight Distribution
Assume we are provided with a summary $y$ and a document $x = (s_1, s_2, \ldots, s_n)$, where $n$ is the number of sentences. The first step of our method is to compute the distribution of sentence weights for the training data as

$$q_i = \frac{r_i}{Z}$$

where $Z$ is computed as

$$Z = \sum_{j=1}^{n} r_j$$

and $r_i$ is computed as

$$r_i = \mathrm{ROUGE}(s_i, y)$$

where ROUGE is the evaluation metric used to judge the quality of predicted summaries.
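The estimation step can be sketched in a few lines of Python. This is a minimal sketch, assuming a unigram-recall stand-in for the full ROUGE metric; the function and variable names are illustrative, not from the paper:

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """Unigram-overlap recall: a simplified stand-in for ROUGE."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())  # multiset intersection
    return overlap / max(sum(ref.values()), 1)

def estimate_sentence_weights(sentences, summary):
    """Normalize per-sentence ROUGE scores into a weight distribution q."""
    scores = [rouge1_recall(s, summary) for s in sentences]
    total = sum(scores)
    if total == 0:  # no overlap at all: fall back to a uniform distribution
        return [1.0 / len(sentences)] * len(sentences)
    return [r / total for r in scores]
```

A sentence that shares many words with the reference summary receives a high weight, while an unrelated sentence receives a weight near zero.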
3.2 Representation of Sentence Weight Distribution
In our model, we first produce a sentence weight distribution over all sentences. The computation is based on the sentence embeddings and the position embeddings of the sentences as

$$p_i = \mathrm{MLP}([e_i; g_i])$$

where $[e_i; g_i]$ denotes the vector concatenation of the sentence embedding $e_i$ and the position embedding $g_i$, and MLP refers to a multi-layer perceptron. $e_i$ is produced as

$$e_i = \frac{1}{|I(i)|} \sum_{k \in I(i)} w_k$$

where $I(i)$ returns all indexes of the words which belong to the $i$-th sentence and $w_k$ is the embedding of the $k$-th word. Then, the new output of the encoder is

$$h' = (h'_1, h'_2, \ldots, h'_T)$$

where $T$ is the number of hidden states and $h'_t$ is computed as

$$h'_t = p_{s(t)} \cdot h_t$$

where $s(t)$ returns the index of the sentence which the $t$-th word belongs to, and $h_t$ is the output of the RNN or Bi-LSTM used in the encoder. Then, the new encoder output $h'$ is delivered to the decoder, which produces a summary.
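The weighting scheme can be sketched in a few lines of NumPy. All dimensions, variable names, and the single-layer MLP below are illustrative assumptions rather than the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only).
d_word, d_pos, d_hidden, n_sent, n_words = 8, 4, 16, 3, 9
sent_of_word = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])  # s(t): sentence of word t

word_emb = rng.normal(size=(n_words, d_word))  # word embeddings w_k
pos_emb = rng.normal(size=(n_sent, d_pos))     # sentence-position embeddings g_i
H = rng.normal(size=(n_words, d_hidden))       # encoder hidden states h_t

# Sentence embedding e_i: average of the embeddings of the words in sentence i.
sent_emb = np.stack(
    [word_emb[sent_of_word == i].mean(axis=0) for i in range(n_sent)]
)

# A one-layer stand-in for the MLP over [e_i; g_i], followed by a softmax
# so that the predicted weights form a distribution p.
W = rng.normal(size=(d_word + d_pos, 1))
logits = np.concatenate([sent_emb, pos_emb], axis=1) @ W
p = np.exp(logits) / np.exp(logits).sum()

# Scale each hidden state by the weight of the sentence its word belongs to.
H_weighted = p[sent_of_word, 0:1] * H
```

The decoder then attends over `H_weighted` instead of `H`, so hidden states from low-weight sentences contribute less to the generated summary.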
Given the model parameters $\theta$, an input text $x$, the corresponding summary $y$, and the sentence weight distribution $q$ (described in Section 3.1), the loss function is

$$L = -\frac{1}{m} \sum_{k=1}^{m} \left( \log P(y^{(k)} \mid x^{(k)}; \theta) + \lambda \log P(q^{(k)} \mid x^{(k)}; \theta) \right)$$

where $m$ is the batch size, $P(y^{(k)} \mid x^{(k)}; \theta)$ is the conditional probability of the output words given the source text, $p$ is the vector of predicted sentence weights, $P(q^{(k)} \mid x^{(k)}; \theta)$ is the conditional probability of the estimated sentence weights $q$ (described in Section 3.1) given the source text, computed from $p$, and $\lambda$ balances the two terms.
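A combined objective of this kind can be sketched as follows. This is a minimal sketch: the helper name is ours, and the squared-error gap between predicted and estimated sentence weights is one plausible, illustrative choice of distance, not necessarily the paper's exact formulation:

```python
import numpy as np

def summarization_loss(word_logprobs, p_pred, q_gold, lam=0.01):
    """Negative log-likelihood of the summary words, plus a penalty on
    the gap between predicted (p) and estimated (q) sentence weights."""
    nll = -np.sum(word_logprobs)  # -sum_t log P(y_t | x, y_<t)
    gap = np.sum((np.asarray(p_pred) - np.asarray(q_gold)) ** 2)
    return nll + lam * gap
```

Both terms are differentiable with respect to the model parameters, so the whole model can be trained end-to-end with SGD.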
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
| RNN (Hu et al., 2015) | 21.5 | 8.9 | 18.6 |
4 Experiments
In this section, we evaluate our proposed approach on a social media dataset and report the performance of the models. Furthermore, we use a case study to illustrate the improvement achieved by our approach.
4.1 Dataset
We use the Large-scale Chinese Short Text Summarization dataset (LCSTS) provided by Hu et al. (2015). This dataset is constructed from Sina Weibo, a famous Chinese social media platform where many popular Chinese media outlets and organizations post news and information. Based on statistics of the training set, we set the maximum number of sentences to 20 and the maximum length of a sentence to 150 in this paper.
4.2 Experimental Settings
Following previous works and experimental results on the development set, we set the hyper-parameters as follows. The character embedding dimension is 400 and the size of the hidden state is 512. The trade-off parameter that balances the summary loss and the sentence-weight loss is set to 0.01. All word embeddings are initialized randomly. We use a 1-layer encoder and a 1-layer decoder in this paper.
We use the minibatch stochastic gradient descent (SGD) algorithm to train our model. Each gradient is computed using a minibatch of 32 (document, summary) pairs. The best validation accuracy is reached after 12k batches, which requires around 2 days of training. For evaluation, we use the ROUGE metric proposed by Lin and Hovy (2003). Unlike BLEU, which aggregates various n-gram matches into one score, ROUGE has several versions for different match lengths: ROUGE-1, ROUGE-2 and ROUGE-L. Experiments are performed on a commodity 64-bit Dell Precision T7910 workstation with one 3.0 GHz 16-core CPU, RAM and one Titan X GPU.
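As an illustration of the ROUGE-L variant, here is a minimal sketch of LCS-based recall; the function names are ours, and the official metric additionally defines precision and an F-measure:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_recall(candidate, reference):
    """ROUGE-L recall: LCS length divided by the reference length."""
    return lcs_length(candidate, reference) / max(len(reference), 1)
```

Because the LCS need not be contiguous, ROUGE-L rewards in-order word matches even when other words are generated in between.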
4.3 Baselines
We do not limit our method to a specific neural network; it can be extended to any sequence-to-sequence model. In this paper, we evaluate our method against two types of baselines.
RNN We denote RNN as the basic sequence-to-sequence model, with a bi-LSTM encoder and a bi-LSTM decoder. It is a widely used framework.
RNN-context RNN-context is a sequence-to-sequence framework with the attention mechanism.
4.4 Results and Discussions
We compare our approach with the baselines, RNN and RNN-context. The main results are shown in Table 1. Our approach achieves consistent ROUGE improvements over both baselines. In particular, our method (SWD) outperforms RNN-context by almost 2 ROUGE-1 points.
Finally, we give an example summary, shown in Figure 2. This example, introduced in Section 1, illustrates the negative influence of unimportant words on RNN-context's ability to extract key information. RNN-context selects some unimportant words for the summary, such as "some fundings from the central government" (shown in blue). In contrast, the output of our method contains key words (shown in pink), such as "Fan Gang" and "the rate of China economy growth slows down". This example shows the effectiveness of our model in handling noisy documents that are full of irrelevant words.
5 Conclusions
In this paper, we propose a novel method that learns a sentence weight distribution to improve the performance of abstractive summarization. The goal is to make models focus on important sentences and ignore irrelevant ones. The results on a large-scale Chinese social media dataset show that our approach outperforms competitive baselines. We also give an example showing that the summary produced by our method is more relevant to the gold summary. Moreover, our method can be extended to any sequence-to-sequence model. Word-based seq2seq systems are potentially helpful for this task, because words can incorporate more meaningful information. In the future, we will try several word segmentation methods (Sun et al., 2009, 2012; Xu and Sun, 2016; Xu et al., 2017) to improve the system.
References
- Aliguliyev (2009) Ramiz M Aliguliyev. 2009. A new sentence similarity measure and sentence based extractive technique for automatic text summarization. Expert Systems with Applications 36(4):7764–7772.
- Bing et al. (2015) Lidong Bing, Piji Li, Yi Liao, Wai Lam, Weiwei Guo, and Rebecca Passonneau. 2015. Abstractive multi-document summarization via phrase selection and merging. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Beijing, China, pages 1587–1597.
- Cheng and Lapata (2016) Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 484–494. http://www.aclweb.org/anthology/P16-1046.
- Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pages 1724–1734.
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
- Erkan and Radev (2004) Günes Erkan and Dragomir R Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research 22:457–479.
- Ferreira et al. (2013) Rafael Ferreira, Luciano de Souza Cabral, Rafael Dueire Lins, Gabriel Pereira e Silva, Fred Freitas, George DC Cavalcanti, Rinaldo Lima, Steven J Simske, and Luciano Favaro. 2013. Assessing sentence scoring techniques for extractive text summarization. Expert systems with applications 40(14):5755–5764.
- Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O. K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.
- Hu et al. (2015) Baotian Hu, Qingcai Chen, and Fangze Zhu. 2015. LCSTS: A large scale chinese short text summarization dataset. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015. pages 1967–1972.
- Knight and Marcu (2002) Kevin Knight and Daniel Marcu. 2002. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence 139(1):91–107.
- Lin and Hovy (2003) Chin-Yew Lin and Eduard H. Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2003, Edmonton, Canada, May 27 - June 1, 2003.
- Ma and Sun (2017) Shuming Ma and Xu Sun. 2017. A semantic relevance based neural network for text summarization and text simplification. CoRR abs/1710.02318.
- Ma et al. (2017) Shuming Ma, Xu Sun, Jingjing Xu, Houfeng Wang, Wenjie Li, and Qi Su. 2017. Improving semantic relevance for sequence-to-sequence learning of chinese social media text summarization. In ACL’17.
- Moawad and Aref (2012) Ibrahim F Moawad and Mostafa Aref. 2012. Semantic graph reduction approach for abstractive text summarization. In Computer Engineering & Systems (ICCES), 2012 Seventh International Conference on. IEEE, pages 132–138.
- Nallapati et al. (2016) Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çaglar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, August 11-12, 2016. pages 280–290.
- Radev et al. (2004) Dragomir R Radev, Timothy Allison, Sasha Blair-Goldensohn, John Blitzer, Arda Celebi, Stanko Dimitrov, Elliott Drabek, Ali Hakim, Wai Lam, Danyu Liu, et al. 2004. Mead-a platform for multidocument multilingual text summarization. In LREC.
- Rush et al. (2015) Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In EMNLP. The Association for Computational Linguistics, pages 379–389.
- Sun et al. (2014) Xu Sun, Wenjie Li, Houfeng Wang, and Qin Lu. 2014. Feature-frequency-adaptive on-line training for fast and accurate natural language processing. Computational Linguistics 40(3):563–586.
- Sun et al. (2012) Xu Sun, Houfeng Wang, and Wenjie Li. 2012. Fast online training with frequency-adaptive learning rates for chinese word segmentation and new word detection. In ACL’12. pages 253–262.
- Sun et al. (2017) Xu Sun, Bingzhen Wei, Xuancheng Ren, and Shuming Ma. 2017. Label embedding network: Learning label representation for soft training of deep networks. CoRR abs/1710.10393.
- Sun et al. (2009) Xu Sun, Yaozhong Zhang, Takuya Matsuzaki, Yoshimasa Tsuruoka, and Jun’ichi Tsujii. 2009. A discriminative latent variable chinese segmenter with hybrid word/character information. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, May 31 - June 5, 2009, Boulder, Colorado, USA. pages 56–64.
- Sun et al. (2013) Xu Sun, Yaozhong Zhang, Takuya Matsuzaki, Yoshimasa Tsuruoka, and Jun’ichi Tsujii. 2013. Probabilistic chinese word segmentation with non-local information and stochastic training. Inf. Process. Manage. 49(3):626–636.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. pages 3104–3112.
- Woodsend and Lapata (2010) Kristian Woodsend and Mirella Lapata. 2010. Automatic generation of story highlights. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pages 565–574.
- Xu et al. (2017) Jingjing Xu, Shuming Ma, Yi Zhang, Bingzhen Wei, Xiaoyan Cai, and Xu Sun. 2017. Transfer learning for low-resource chinese word segmentation with a novel neural network. In The Conference on Natural Language Processing and Chinese Computing.
- Xu and Sun (2016) Jingjing Xu and Xu Sun. 2016. Dependency-based gated recursive neural network for chinese word segmentation. In ACL’16. pages 567–572.