Improving Social Media Text Summarization by Learning Sentence Weight Distribution

by Jingjing Xu, et al.
Peking University

Recently, encoder-decoder models have been widely used in social media text summarization. However, these models sometimes mistakenly select noise words from irrelevant sentences as part of a summary, which degrades performance. To suppress irrelevant sentences and focus on key information, we propose an effective approach that learns a sentence weight distribution. In our model, we build a multi-layer perceptron to predict sentence weights. During training, we use the ROUGE score to estimate the gold sentence weights, and minimize the gap between the estimated weights and the predicted weights. In this way, we encourage our model to focus on the key sentences, which are highly relevant to the summary. Experimental results show that our approach outperforms baselines on a large-scale social media corpus.




1 Introduction

Figure 1: Illustration of the negative influence of noise words. “RNN-context” is the basic encoder-decoder model with the attention mechanism. The gold summary (shown in blue) states that China's economic growth rate is slowing down and that investment in real estate is certainly better than in 2012. However, these key words are absent from the output of “RNN-context”, while words (shown in pink) from irrelevant sentences are selected as part of the summary.

Text summarization is an important task in natural language processing. It aims to understand the key idea of a document and generate a headline or a summary. Previous social media text summarization systems (Rush et al., 2015; Hu et al., 2015) are mainly based on abstractive text summarization. Most of them belong to the family of encoder-decoder models, which have proven effective in many tasks such as machine translation (Cho et al., 2014; Bahdanau et al., 2015).

However, these models sometimes mistakenly select noise words from irrelevant sentences as part of a summary. Figure 1 gives an example of noise words generated by a state-of-the-art encoder-decoder model (“RNN-context”). Unlike translation, which requires encoding all information to ensure accuracy, summarization tries to extract only the most important information. Furthermore, only a small fraction of the sentences convey the key information, while the remaining sentences are usually irrelevant. These unrelated sentences therefore make it hard for encoder-decoder models to extract the key information.

To address this issue, we propose a novel method that learns a sentence weight distribution to encourage models to focus on key sentences and ignore unimportant ones. In our approach, we first design a multi-layer perceptron to predict sentence weights. Then, considering that ROUGE is a popular evaluation criterion for summarization, we estimate the gold sentence weights of the training data by the ROUGE scores between sentences and summaries. During training, we use an end-to-end optimization method that minimizes the gap between predicted sentence weights and estimated sentence weights.

We conduct experiments on a large-scale social media dataset. Experimental results show that our method outperforms competitive baselines. Moreover, our method is not limited to any specific neural network; it can be extended to any sequence-to-sequence model.

2 Related Work

Summarization approaches can be divided into two typical categories: extractive summarization (Radev et al., 2004; Aliguliyev, 2009; Woodsend and Lapata, 2010; Ferreira et al., 2013; Cheng and Lapata, 2016) and abstractive summarization (Knight and Marcu, 2002; Bing et al., 2015; Rush et al., 2015; Hu et al., 2015; Gu et al., 2016). In extractive summarization, most works select several sentences from a document as a summary or a headline. In abstractive summarization, most works encode a document into an abstract representation and then generate the words of a summary one by one. Most social media summarization systems belong to abstractive text summarization. Generally speaking, extractive summarization achieves better performance than abstractive summarization on long and regular documents. However, extractive summarization is not suitable for social media texts, which are very short and full of noise.

Neural abstractive text summarization is a recently proposed method and has become a hot research topic in recent years. Unlike traditional summarization systems, which consist of many small sub-components that are tuned separately (Knight and Marcu, 2002; Erkan and Radev, 2004; Moawad and Aref, 2012), neural abstractive text summarization attempts to build and train a single, large neural network that reads a document and outputs a correct summary. Rush et al. (2015) first introduced the encoder-decoder framework with the attention mechanism to abstractive text summarization. Bing et al. (2015) proposed an abstraction-based multi-document summarization framework which can construct new sentences by exploring syntactic units that are more fine-grained than sentences. Gu et al. (2016) proposed a copy mechanism to address the problem of unknown words. Nallapati et al. (2016) proposed several novel models to address critical problems in summarization.

3 Proposed Model

Our method is based on the basic encoder-decoder framework proposed by Bahdanau et al. (2015) and Sutskever et al. (2014).

Section 3.1 introduces how to estimate sentence weight distribution in detail. Section 3.2 describes how to generate the representation of sentence weight distribution. Section 3.3 shows how to incorporate estimated sentence weights and predicted sentence weights in training.

3.1 Estimating Sentence Weight Distribution

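Assume we are provided with a summary $y$ and a document $D = \{s_1, s_2, \ldots, s_n\}$, where $n$ is the number of sentences. The first step of our method is to compute the distribution of sentence weights for the training data as

$$p_e = (w_1, w_2, \ldots, w_n),$$

where each weight $w_i$ is obtained by normalizing the sentence scores so that the weights form a distribution, and the score $E_i$ of sentence $s_i$ is computed as

$$E_i = \mathrm{ROUGE}(s_i, y),$$

where ROUGE is the evaluation metric used to judge the quality of predicted summaries.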
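The estimation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: `unigram_f1` is a simplified stand-in for ROUGE (clipped unigram overlap, F1), the softmax normalization is an assumed choice for turning scores into a distribution, and the function names and the toy English document are hypothetical. For Chinese text the tokens would typically be characters.

```python
import math
from collections import Counter


def unigram_f1(candidate, reference):
    """Simplified stand-in for ROUGE-1: F1 over clipped unigram overlap."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def estimate_sentence_weights(sentences, summary):
    """Score every sentence against the summary, then normalize the scores.

    The softmax used here is an assumption; any normalization that yields
    a distribution over sentences would fit the description in the text.
    """
    scores = [unigram_f1(sentence, summary) for sentence in sentences]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


document = [
    "growth of the economy slows down".split(),
    "the weather was pleasant on that day".split(),
]
summary = "economy growth slows down".split()
weights = estimate_sentence_weights(document, summary)
print(weights)  # the first sentence receives the larger weight
```

The sentence that shares words with the summary receives the larger weight, which is the signal the model is later trained to reproduce.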

3.2 Representation of Sentence Weight Distribution

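In our model, we first produce a sentence weight distribution over all sentences. The computation is based on the sentence embeddings and the position embeddings of sentences as

$$\hat{w}_i = \mathrm{MLP}(e_i \oplus q_i),$$

where $\oplus$ denotes vector concatenation of the sentence embedding $e_i$ and the position embedding $q_i$; MLP refers to a multi-layer perceptron. The sentence embedding $e_i$ is produced from the representations of the words indexed by $\mathrm{set}(i)$, where $\mathrm{set}(i)$ returns all indexes of words which belong to sentence $i$. Then, the new output of the encoder is

$$h' = \{h'_1, h'_2, \ldots, h'_m\},$$

where $m$ is the number of hidden states and $h'_j$ is computed as

$$h'_j = w(j)\, h_j,$$

where $w(j)$ returns the weight of the sentence to which $h_j$ belongs, and $h_j$ is the output of the RNN or Bi-LSTM used in the encoder. Then, the new encoder output $h'$ is delivered to the decoder, which produces a summary.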
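The following pure-Python sketch illustrates the data flow described above: pool word-level hidden states into sentence embeddings, concatenate them with position embeddings, score each sentence with an MLP, and rescale every hidden state by the weight of its sentence. Several details are assumptions made for illustration only: mean pooling, a single tanh hidden layer, a softmax over the sentence scores, random parameters, and all function and variable names.

```python
import math
import random


def mean_pool(vectors):
    """Average equal-length vectors (assumed pooling for sentence embeddings)."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def mlp_score(x, w1, b1, w2, b2):
    """One tanh hidden layer followed by a scalar output."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_encoder(hidden_states, sentence_of, position_emb, params):
    """Predict one weight per sentence and scale each hidden state by it.

    sentence_of[j] is the index of the sentence that word j belongs to.
    """
    n_sentences = max(sentence_of) + 1
    scores = []
    for i in range(n_sentences):
        members = [h for h, s in zip(hidden_states, sentence_of) if s == i]
        features = mean_pool(members) + position_emb[i]  # list concatenation
        scores.append(mlp_score(features, *params))
    weights = softmax(scores)
    new_states = [
        [weights[s] * value for value in h]
        for h, s in zip(hidden_states, sentence_of)
    ]
    return weights, new_states


random.seed(0)
dim, pos_dim, hid = 4, 2, 3
hidden_states = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]
sentence_of = [0, 0, 1, 1, 1]  # two sentences with 2 and 3 words
position_emb = [[random.uniform(-1, 1) for _ in range(pos_dim)] for _ in range(2)]
params = (
    [[random.uniform(-1, 1) for _ in range(dim + pos_dim)] for _ in range(hid)],
    [0.0] * hid,
    [random.uniform(-1, 1) for _ in range(hid)],
    0.0,
)
weights, new_states = reweight_encoder(hidden_states, sentence_of, position_emb, params)
```

In a real model the MLP parameters would be trained jointly with the encoder and decoder rather than sampled at random.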

Figure 2: Comparisons of the predicted summaries between RNN-context and our method. The predicted summary (shown in blue) of RNN-context comes from an unimportant sentence. In contrast, the predicted summary of our method covers some key words (shown in purple).

3.3 Training

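Given the model parameter $\theta$ and an input text $x$, a corresponding summary $y$ and the sentence weight distribution $p_e$ (described in Section 3.1), the loss function is the sum of two terms averaged over the batch: the negative log-likelihood of the summary words, and a term weighted by a parameter $\lambda$ that measures the gap between the predicted and the estimated sentence weights. Here $N$ is the batch size, $p(y_t \mid x; \theta)$ is the conditional probability of the output word $y_t$ given the source text $x$, $\hat{w}_i$ is the predicted sentence weight, and $p_e(w_i \mid x)$ is the conditional probability of the sentence weight (described in Section 3.1) given the source text $x$.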
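A minimal sketch of such a two-part objective is given below. It is an assumption-laden illustration rather than the exact loss of the paper: the gap between estimated and predicted weights is measured here with cross-entropy, the weighting factor `lam` is assumed to play the role of the parameter set to 0.01 in the experiments, and the toy probabilities are invented for the example.

```python
import math


def summary_nll(word_probs):
    """Negative log-likelihood of the gold summary words."""
    return -sum(math.log(p) for p in word_probs)


def weight_gap(estimated, predicted):
    """Cross-entropy between estimated and predicted weights (assumed gap measure)."""
    return -sum(e * math.log(p) for e, p in zip(estimated, predicted))


def batch_loss(batch, lam=0.01):
    """Average over the batch of: summary NLL + lam * sentence-weight gap."""
    total = 0.0
    for word_probs, estimated, predicted in batch:
        total += summary_nll(word_probs) + lam * weight_gap(estimated, predicted)
    return total / len(batch)


# Each item: (probabilities assigned to the gold summary words,
#             estimated sentence weights, predicted sentence weights)
batch = [
    ([0.5, 0.25], [0.7, 0.3], [0.6, 0.4]),
    ([0.8], [0.5, 0.5], [0.5, 0.5]),
]
loss = batch_loss(batch)
print(loss)
```

With cross-entropy as the gap, the second term is smallest when the predicted distribution equals the estimated one, so minimizing it pulls the predicted weights toward the ROUGE-based estimates.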

Train        Development   Test
2,400,591    10,666        1,106
Table 1: Details of the LCSTS dataset. Sizes are given in number of (short text, summary) pairs.

Models                          R-1    R-2    R-L
RNN (Hu et al., 2015)           21.5   8.9    18.6
+SWD                            24.1   10.3   21.1
RNN-context (Hu et al., 2015)   29.9   17.4   27.2
+SWD                            32.0   19.0   29.4
Table 2: ROUGE scores (R-1: ROUGE-1; R-2: ROUGE-2; R-L: ROUGE-L) of the trained models computed on the test and development sets. “RNN” and “RNN-context” are two baselines. We refer to our method as SWD.

4 Experiments

In this section, we evaluate our proposed approach on a social media dataset and report the performance of the models. Furthermore, we use a case to illustrate the improvement achieved by our approach.

4.1 Dataset

We use the large-scale Chinese short text summarization dataset (LCSTS), which is provided by Hu et al. (2015). This dataset is constructed from Sina Weibo, a well-known Chinese social media platform where many popular Chinese media outlets and organizations post news and information. Based on statistics of the training set, we set the maximum number of sentences to 20 and the maximum length of a sentence to 150 in this paper.
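The two limits can be applied with a small preprocessing helper such as the one below. This is a hypothetical sketch: the sentence-splitting rule (split after common Chinese and Western sentence-final punctuation) and the choice of characters as the length unit are assumptions, since the text does not specify them.

```python
import re

MAX_SENTENCES = 20         # maximum number of sentences per document
MAX_SENTENCE_LENGTH = 150  # maximum sentence length, assumed to be in characters


def split_and_truncate(text, max_sentences=MAX_SENTENCES, max_length=MAX_SENTENCE_LENGTH):
    """Split text after sentence-final punctuation, then cap count and length.

    Returns a list of sentences, each represented as a list of characters.
    """
    pieces = re.split(r"(?<=[。！？!?.])", text)
    sentences = [piece.strip() for piece in pieces if piece.strip()]
    return [list(sentence)[:max_length] for sentence in sentences[:max_sentences]]


example = split_and_truncate("今天天气好。我们去公园！")
print(len(example))  # 2
```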

4.2 Experimental Settings

Based on previous work and on experimental results on the development set, we set the hyper-parameters as follows. The character embedding dimension is 400 and the size of the hidden state is 512. The parameter λ is 0.01. All word embeddings are initialized randomly. We use a 1-layer encoder and a 1-layer decoder in this paper.

We use the minibatch stochastic gradient descent (SGD) algorithm to train our model. Each gradient is computed using a minibatch of 32 (document, summary) pairs. The best validation accuracy is reached after 12k batches, which requires around 2 days of training. For evaluation, we use the ROUGE metric proposed by Lin and Hovy (2003). Unlike BLEU, which includes various n-gram matches, ROUGE has several versions for different match lengths: ROUGE-1, ROUGE-2 and ROUGE-L. Experiments are performed on a commodity 64-bit Dell Precision T7910 workstation with one 3.0 GHz 16-core CPU, RAM and one Titan X GPU.
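To make the three variants concrete, the sketch below computes simplified recall-only versions of ROUGE-N (n-gram overlap) and ROUGE-L (longest common subsequence) for a single candidate and reference. It is an illustration only: official ROUGE scoring offers further options, and the function names and the toy English sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_recall(candidate, reference, n):
    """Fraction of reference n-grams that also occur in the candidate (clipped)."""
    ref_counts = Counter(ngrams(reference, n))
    if not ref_counts:
        return 0.0
    cand_counts = Counter(ngrams(candidate, n))
    overlap = sum((cand_counts & ref_counts).values())
    return overlap / sum(ref_counts.values())


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """LCS length divided by the reference length."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0


reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
r1 = rouge_n_recall(candidate, reference, 1)
r2 = rouge_n_recall(candidate, reference, 2)
rl = rouge_l_recall(candidate, reference)
```

ROUGE-1 and ROUGE-2 reward shared unigrams and bigrams, while ROUGE-L rewards words that appear in the same order even when they are not adjacent.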

4.3 Models

Our method is not limited to a specific neural network; it can be extended to any sequence-to-sequence model. In this paper, we evaluate our method on two types of baselines.

RNN: We denote by RNN the basic sequence-to-sequence model, with a bi-LSTM encoder and a bi-LSTM decoder. It is a widely used framework.

RNN-context: RNN-context is a sequence-to-sequence framework with the attention mechanism.

4.4 Results and Discussions

We compare our approach with baselines, including RNN and RNN-context. The main results are shown in Table 2. It can be seen that our approach achieves ROUGE improvements over both baselines. In particular, SWD outperforms RNN-context by about 2 ROUGE-1 points (32.0 versus 29.9).
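The per-metric gains can be read off Table 2 directly; the short snippet below simply recomputes them from the reported scores (the dictionary layout and names are illustrative).

```python
# ROUGE-1, ROUGE-2, ROUGE-L scores as reported in Table 2
baseline = {"RNN": (21.5, 8.9, 18.6), "RNN-context": (29.9, 17.4, 27.2)}
with_swd = {"RNN": (24.1, 10.3, 21.1), "RNN-context": (32.0, 19.0, 29.4)}


def gains(before, after):
    """Absolute improvement per metric, rounded to one decimal place."""
    return tuple(round(a - b, 1) for b, a in zip(before, after))


improvements = {name: gains(baseline[name], with_swd[name]) for name in baseline}
for name, (r1, r2, rl) in improvements.items():
    print(f"{name}: R-1 +{r1}, R-2 +{r2}, R-L +{rl}")
# RNN: R-1 +2.6, R-2 +1.4, R-L +2.5
# RNN-context: R-1 +2.1, R-2 +1.6, R-L +2.2
```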

Finally, we give an example summary in Figure 2. This example, first illustrated in Section 1, shows the negative influence of unimportant words on the extraction of key information by RNN-context. RNN-context chooses some unimportant words as the summary, like “some fundings from the central government” (shown in blue). In contrast, the outputs of our method contain some key words (shown in pink), like “Fan Gang” and “the rate of China economy growth slows down”. This example shows the effectiveness of our model in handling noisy documents that contain many irrelevant words.

5 Conclusion

In this paper, we propose a novel method that learns a sentence weight distribution to improve the performance of abstractive summarization. The goal is to make models focus on important sentences and ignore irrelevant ones. The results on a large-scale Chinese social media dataset show that our approach outperforms competitive baselines. We also give an example which shows that the summary produced by our method is more relevant to the gold summary. In addition, our method can be extended to any sequence-to-sequence model. Word-based seq2seq systems are potentially helpful for this task, because words can carry more meaningful information. In the future, we will try several word segmentation methods (Sun et al., 2009, 2012; Xu and Sun, 2016; Xu et al., 2017) to improve the system.


References

  • Aliguliyev (2009) Ramiz M Aliguliyev. 2009. A new sentence similarity measure and sentence based extractive technique for automatic text summarization. Expert Systems with Applications 36(4):7764–7772.
  • Bing et al. (2015) Lidong Bing, Piji Li, Yi Liao, Wai Lam, Weiwei Guo, and Rebecca Passonneau. 2015. Abstractive multi-document summarization via phrase selection and merging. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Beijing, China, pages 1587–1597.
  • Cheng and Lapata (2016) Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 484–494.
  • Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pages 1724–1734.
  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
  • Erkan and Radev (2004) Günes Erkan and Dragomir R Radev. 2004. Lexrank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research.
  • Ferreira et al. (2013) Rafael Ferreira, Luciano de Souza Cabral, Rafael Dueire Lins, Gabriel Pereira e Silva, Fred Freitas, George DC Cavalcanti, Rinaldo Lima, Steven J Simske, and Luciano Favaro. 2013. Assessing sentence scoring techniques for extractive text summarization. Expert systems with applications 40(14):5755–5764.
  • Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O. K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.
  • Hu et al. (2015) Baotian Hu, Qingcai Chen, and Fangze Zhu. 2015. LCSTS: A large scale chinese short text summarization dataset. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015. pages 1967–1972.
  • Knight and Marcu (2002) Kevin Knight and Daniel Marcu. 2002. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence 139(1):91–107.
  • Lin and Hovy (2003) Chin-Yew Lin and Eduard H. Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2003, Edmonton, Canada, May 27 - June 1, 2003.
  • Ma and Sun (2017) Shuming Ma and Xu Sun. 2017. A semantic relevance based neural network for text summarization and text simplification. CoRR abs/1710.02318.
  • Ma et al. (2017) Shuming Ma, Xu Sun, Jingjing Xu, Houfeng Wang, Wenjie Li, and Qi Su. 2017. Improving semantic relevance for sequence-to-sequence learning of chinese social media text summarization. In ACL’17.
  • Moawad and Aref (2012) Ibrahim F Moawad and Mostafa Aref. 2012. Semantic graph reduction approach for abstractive text summarization. In Computer Engineering & Systems (ICCES), 2012 Seventh International Conference on. IEEE, pages 132–138.
  • Nallapati et al. (2016) Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çaglar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, August 11-12, 2016. pages 280–290.
  • Radev et al. (2004) Dragomir R Radev, Timothy Allison, Sasha Blair-Goldensohn, John Blitzer, Arda Celebi, Stanko Dimitrov, Elliott Drabek, Ali Hakim, Wai Lam, Danyu Liu, et al. 2004. Mead-a platform for multidocument multilingual text summarization. In LREC.
  • Rush et al. (2015) Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In EMNLP. The Association for Computational Linguistics, pages 379–389.
  • Sun et al. (2014) Xu Sun, Wenjie Li, Houfeng Wang, and Qin Lu. 2014. Feature-frequency-adaptive on-line training for fast and accurate natural language processing. Computational Linguistics 40(3):563–586.
  • Sun et al. (2012) Xu Sun, Houfeng Wang, and Wenjie Li. 2012. Fast online training with frequency-adaptive learning rates for chinese word segmentation and new word detection. In ACL’12. pages 253–262.
  • Sun et al. (2017) Xu Sun, Bingzhen Wei, Xuancheng Ren, and Shuming Ma. 2017. Label embedding network: Learning label representation for soft training of deep networks. CoRR abs/1710.10393.
  • Sun et al. (2009) Xu Sun, Yaozhong Zhang, Takuya Matsuzaki, Yoshimasa Tsuruoka, and Jun’ichi Tsujii. 2009. A discriminative latent variable chinese segmenter with hybrid word/character information. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, May 31 - June 5, 2009, Boulder, Colorado, USA. pages 56–64.
  • Sun et al. (2013) Xu Sun, Yaozhong Zhang, Takuya Matsuzaki, Yoshimasa Tsuruoka, and Jun’ichi Tsujii. 2013. Probabilistic chinese word segmentation with non-local information and stochastic training. Inf. Process. Manage. 49(3):626–636.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. pages 3104–3112.
  • Woodsend and Lapata (2010) Kristian Woodsend and Mirella Lapata. 2010. Automatic generation of story highlights. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pages 565–574.
  • Xu et al. (2017) Jingjing Xu, Shuming Ma, Yi Zhang, Bingzhen Wei, Xiaoyan Cai, and Xu Sun. 2017. Transfer learning for low-resource chinese word segmentation with a novel neural network. In The Conference on Natural Language Processing and Chinese Computing.
  • Xu and Sun (2016) Jingjing Xu and Xu Sun. 2016. Dependency-based gated recursive neural network for chinese word segmentation. In ACL’16. pages 567–572.