Regularizing Output Distribution of Abstractive Chinese Social Media Text Summarization for Improved Semantic Consistency

05/10/2018 · Bingzhen Wei, et al. · Peking University

Abstractive text summarization is a highly difficult problem, and the sequence-to-sequence model has shown success in improving the performance on the task. However, the generated summaries are often semantically inconsistent with the source content. In such cases, the model selects words that are semantically unrelated to the source content as the most probable output. The problem can be attributed to heuristically constructed training data, where summaries can be unrelated to the source content, thus containing semantically unrelated words and spurious word correspondence. In this paper, we propose a regularization approach for the sequence-to-sequence model that makes use of what the model has already learned to regularize the learning objective and alleviate the effect of the problem. In addition, we propose a practical human evaluation method to address the problem that existing automatic evaluation methods do not properly assess semantic consistency with the source content. Experimental results demonstrate the effectiveness of the proposed approach, which outperforms almost all the existing models. In particular, the proposed approach improves semantic consistency by 4% in terms of human evaluation.


1. Introduction

Abstractive text summarization is an important text generation task. With the application of the sequence-to-sequence model and the publication of large-scale datasets, the quality of automatically generated summaries has been greatly improved (McAuley2013; lcsts; abs; ras; ibmsummarization; distraction; copynet; See2017; DRGD). However, the semantic consistency of the automatically generated summaries is still far from satisfactory.

The commonly used large-scale datasets for deep learning models are constructed from naturally annotated data with heuristic rules (lcsts; ras; ibmsummarization). The summaries are not written specifically for the source content, which suggests that the provided summary may not be semantically consistent with the source content. For example, the dataset for Chinese social media text summarization, LCSTS, contains more than 20% text-summary pairs that are not related, according to statistics of the manually checked data (lcsts).

Source content: 最终,在港交所拒绝阿里巴巴集团同股不同权的“合伙人制度”股权架构后,阿里巴巴集团被迫与它的前伙伴挥手告别,转身一头投入美国证券交易委员会(SEC)的怀抱。(下图为阿里巴巴帝国图)
Translation: In the end, after the Hong Kong Stock Exchange rejected the Alibaba Group's “partnership” equity structure, in which the same shares carry different voting rights, the Alibaba Group was forced to wave goodbye to its former partner and threw itself into the arms of the U.S. Securities and Exchange Commission (SEC). (The picture below shows the Alibaba empire.)
Table 1. Example of semantic inconsistency in the LCSTS dataset. In this example, the reference summary cannot be concluded from the source content, because the semantics of the summary is not contained in the source text. In short, the semantics of “benefits” cannot be concluded from the source content.

Table 1 shows an example of semantic inconsistency. Typically, the reference summary contains extra information that cannot be understood from the source content. It is hard to infer such a summary from the source content, even for a human. Due to the inconsistency, the system cannot extract enough information from the source text, and it would be hard for the model to learn to generate the summary accordingly. The model has to encode the spurious correspondence between the summary and the source content by memorization. However, this kind of correspondence is superficial and is not actually needed for generating reasonable summaries. Moreover, it is harmful to generating semantically consistent summaries, because unrelated information is modeled. For example, the word “利益” (benefits) in the summary is not related to the source content. Thus, it has to be remembered by the model together with the source content. However, this correspondence is spurious, because the word “利益” is not related to any word in the source content. In the following, we refer to this problem as spurious correspondence, caused by the semantically inconsistent data.

In this work, we aim to alleviate the impact of the semantic inconsistency of the current dataset. Based on the sequence-to-sequence model, we propose a regularization method to heuristically slow down the learning of the spurious correspondence, so that the unrelated information in the dataset is less represented by the model. We incorporate a new soft training target to achieve this goal. At each output time step in training, in addition to the gold reference word, the current output also targets a softened output word distribution that regularizes the current output word distribution. In this way, a more robust correspondence between the source content and the output words can be learned, and potentially, the output summary will be more semantically consistent.

To obtain the softened output word distribution, we propose two methods based on the sequence-to-sequence model.

  • The first one uses the output layer of the decoder to generate the distribution but with a higher temperature when using softmax normalization. It keeps the relative order of the possible output words but guides the model to keep a smaller discriminative margin. For spurious correspondence, across different examples, the output distribution is more likely to be different, so no effective discriminative margin will be established. For true correspondence, across different examples, the output distribution is more likely to be the same, so a margin can be gradually established.

  • The second one introduces an additional output layer to generate the distribution. Analogous to multi-task learning, the additional output layer provides an alternative view of the data, so that it can regularize the output distribution more effectively, because the less stable information, i.e., the spurious correspondence learned by the model itself, is represented differently by the two output layers. Besides, the relative order can also be regularized in this method.

A more detailed explanation is given in Section 2.

Another problem for abstractive text summarization is that the system summary cannot be easily evaluated automatically. ROUGE (rouge) is widely used for summarization evaluation. However, as ROUGE is designed for extractive text summarization, it cannot deal with summary paraphrasing in abstractive text summarization. Besides, as ROUGE is based on the reference, it requires high-quality reference summaries for a reasonable evaluation, which are also lacking in the existing dataset for Chinese social media text summarization. We argue that for the proper evaluation of text generation tasks, human evaluation cannot be avoided. We propose a simple and practical human evaluation method for text summarization, in which the summary is evaluated against the source content instead of the reference. It handles both the problem of paraphrasing and the lack of high-quality references.

The contributions of this work are summarized as follows:

  • We propose an approach to regularize the output word distribution, so that the semantic inconsistency exhibited in the training data, e.g., words not related to the source content, is underrepresented in the model. We add a cross-entropy-based regularization term to the overall loss. We also propose two methods to obtain the soft target distribution for regularization. The results demonstrate the effectiveness of the proposed approach, which outperforms almost all the existing systems. In particular, the semantic consistency is improved by 4% in terms of human evaluation. We also conduct an analysis of the effect of the proposed method on the output summaries and the output label distributions, showing that the improved consistency results from the regularized output distribution.

  • We propose a simple human evaluation method to assess the semantic consistency of the generated summary with the source content. This kind of evaluation is absent in existing work on text summarization. In the proposed human evaluation, the summary is evaluated against the source content rather than the reference summary, so that it can better measure the consistency between the generated summary and the source content when a high-quality reference is not available.

2. Proposed Method

Based on the fact that the spurious correspondence is not stable and its realization in the model is prone to change, we propose to alleviate the issue heuristically by regularization. We use the cross-entropy with an annealed output distribution as the regularization term in the loss, so that small fluctuations in the distribution are suppressed and a more robust and stable correspondence is learned. By correspondence, we mean the relation between (a) the current output and (b) the source content and the partially generated output. Furthermore, we propose to use an additional output layer to generate the annealed output distribution. Due to the same fact, the two output layers differ more in the words that only superficially co-occur, so that the output distribution can be better regularized.

Figure 1. Illustration of the proposed methods. Left: Self-Train. Right: Dual-Train.

2.1. Regularizing the Neural Network with Annealed Distribution

Typically, in the training of the sequence-to-sequence model, only the one-hot hard target is used in the cross-entropy based loss function. For an example in the training set, the loss of an output vector is [1]

$J = -\sum_{i=1}^{L} y_i \log p_i$   (1)

where $p$ is the output vector, $y$ is the one-hot hard target vector, and $L$ is the number of labels. However, as $y$ is a one-hot vector, all the elements are zero except the one representing the correct label. Hence, the loss becomes

$J = -\log p_j$   (2)

where $j$ is the index of the correct label. The loss is then summed over the output sentence and across the minibatch, and used as the source error signal in the backpropagation.

[1] For the convenience of description, we omit the related trainable parameters of the model.
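To make the reduction from Eq. (1) to Eq. (2) concrete, the following NumPy snippet is a minimal numerical check; the vocabulary size and the scores are arbitrary toy values, not taken from the model.

import numpy as np

logits = np.array([2.0, 0.5, -1.0, 0.3, 1.2])   # arbitrary unnormalized scores over 5 labels
p = np.exp(logits) / np.exp(logits).sum()       # output vector p (softmax)
j = 0                                           # index of the correct label
y = np.zeros_like(p)
y[j] = 1.0                                      # one-hot hard target y

loss_eq1 = -np.sum(y * np.log(p))               # Eq. (1): full cross-entropy sum
loss_eq2 = -np.log(p[j])                        # Eq. (2): only the correct label survives
assert np.isclose(loss_eq1, loss_eq2)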

The hard target could cause several problems in training. Soft training methods use a soft target distribution to provide a generalized error signal for training. For the summarization task, a straightforward way is to use the current output vector as the soft target $\hat{p}$ (a copy of $p$ treated as a constant), which contains the knowledge learned by the current model, i.e., the correspondence between the source content and the current output word:

$J_{\text{soft}} = -\sum_{i=1}^{L} \hat{p}_i \log p_i$   (3)

Then, the two losses are combined as the new loss function:

$J' = -\log p_j - \alpha \sum_{i=1}^{L} \hat{p}_i \log p_i$   (4)

where $j$ is the index of the true label and $\alpha$ is the strength of the soft training loss. We refer to this approach as Self-Train (the left part of Figure 1).

The output of the model can be seen as a refined supervisory signal for the learning of the model. The added loss promotes the learning of a more stable correspondence. The output not only learns from the one-hot distribution but also from the distribution generated by the model itself. However, during training, the output of the neural network can become too close to the one-hot distribution. To solve this, we instead use the softened output distribution as the soft target $\hat{p}$. We apply the softmax with temperature $\tau$, which is computed by [2]

$\hat{p}_i = \dfrac{\exp(z_i / \tau)}{\sum_{l=1}^{L} \exp(z_l / \tau)}$   (5)

This transformation keeps the relative order of the labels, and a higher temperature makes the output distribution more even.

[2] We use simplified notation, where $z_i$ and $z_l$ denote the unnormalized outputs, i.e., the outputs before the softmax operation.

The key motivation is that if the model is still not confident about how to generate the current output word under the supervision of the reference summary, the correspondence may be spurious and the reference output is unlikely to be inferable from the source content. It makes no sense to force the model to learn such a correspondence. The regularization follows this motivation: in such a case, the error signal is less significant compared to the one-hot target. In the case where the model is extremely confident about how to generate the current output, the annealed distribution resembles the one-hot target, and the regularization has little effect. In all, we make use of the model itself to identify the spurious correspondence and then regularize the output distribution accordingly.
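The following PyTorch sketch summarizes the Self-Train objective of this subsection (Eqs. (2)-(5)). It is an illustrative implementation under our own naming, not released code; only the roles of the temperature $\tau$ and the soft-loss strength $\alpha$ are taken from the text.

import torch
import torch.nn.functional as F

def self_train_loss(logits, target, tau, alpha):
    # logits: (batch, vocab) unnormalized decoder outputs at one time step
    # target: (batch,) indices of the reference words
    # Hard part, Eq. (2): cross-entropy against the one-hot reference.
    hard = F.cross_entropy(logits, target)
    # Annealed soft target, Eq. (5): the model's own distribution, softened by
    # temperature tau and detached so that no gradient flows into the target.
    soft_target = F.softmax(logits.detach() / tau, dim=-1)
    # Soft part, Eq. (3): cross-entropy between the annealed distribution and
    # the current output distribution.
    soft = -(soft_target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    # Combined loss, Eq. (4).
    return hard + alpha * soft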

2.2. Dual Output Layers

However, the aforementioned method regularizes the output word distribution based only on what the model has already learned, and the relative order of the output words is kept. Such self-dependency may not be desirable for regularization. It would be better if more of the spurious correspondence could be identified.

In this paper, we further propose to obtain the soft target from a different view of the model, so that different knowledge of the dataset can be used to mitigate the overfitting problem. An additional output layer is introduced to generate the soft target. The two output layers share the same hidden representation but have independent parameters, so they could learn different knowledge of the data. We refer to this approach as Dual-Train. For clarity, the original output layer is denoted by $O_A$ and the new output layer by $O_B$. Their outputs are denoted by $p^A$ and $p^B$, respectively.

The output layer $O_A$ acts as the original output layer. We apply soft training using the output from $O_B$ to this output layer to increase its ability to generalize. Suppose the correct label is $j$. The target of the output $p^A$ includes both the one-hot distribution and the distribution generated from $O_B$:

$J_A = -\log p^A_j - \alpha \sum_{i=1}^{L} \hat{p}^B_i \log p^A_i$   (6)

where $\hat{p}^B$ denotes the annealed distribution computed from the output of $O_B$ as in Eq. (5).

The new output layer $O_B$ is trained normally using the original hard target. This output layer is not used in prediction; its only purpose is to generate the soft target to facilitate the soft training of $O_A$. Suppose the correct label is $j$. The target of the output $p^B$ includes only the one-hot distribution:

$J_B = -\log p^B_j$   (7)

Because of the random initialization of the parameters in the output layers, $O_A$ and $O_B$ could learn different things. The diversified knowledge is helpful when dealing with the spurious correspondence in the data. It can also be seen as an online kind of ensemble method: several different instances of the same model are softly aggregated into one to make the prediction. The right part of Figure 1 shows the architecture of the proposed Dual-Train method.
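The sketch below illustrates Dual-Train under the same assumptions as the previous sketch: two output layers (here plain nn.Linear modules, which is our choice) share the decoder hidden state; $O_B$ is trained with the hard target only (Eq. (7)), and its annealed output serves as the soft target for $O_A$ (Eq. (6)). Only $O_A$ is used at prediction time.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualOutputLayers(nn.Module):
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.o_a = nn.Linear(hidden_size, vocab_size)  # original output layer O_A (used in prediction)
        self.o_b = nn.Linear(hidden_size, vocab_size)  # additional output layer O_B (soft-target provider)

    def forward(self, hidden):
        # hidden: (batch, hidden_size) decoder state shared by both layers
        return self.o_a(hidden), self.o_b(hidden)

def dual_train_loss(logits_a, logits_b, target, tau, alpha):
    # O_B: hard target only, Eq. (7).
    loss_b = F.cross_entropy(logits_b, target)
    # O_A: hard target plus the annealed distribution from O_B, Eq. (6).
    soft_target = F.softmax(logits_b.detach() / tau, dim=-1)
    soft = -(soft_target * F.log_softmax(logits_a, dim=-1)).sum(dim=-1).mean()
    loss_a = F.cross_entropy(logits_a, target) + alpha * soft
    return loss_a + loss_b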

3. Experiments

We evaluate the proposed approach on the Chinese social media text summarization task, based on the sequence-to-sequence model. We also analyze the output text and the output label distribution of the models, showing the power of the proposed approach. Finally, we show the cases where the correspondences learned by the proposed approach are still problematic, which can be explained based on the approach we adopt.

3.1. Dataset

The Large-Scale Chinese Short Text Summarization Dataset (LCSTS) is constructed by (lcsts). The dataset consists of more than 2.4 million text-summary pairs in total, constructed from the famous Chinese microblogging service Weibo (http://weibo.com). The whole dataset is split into three parts, with 2,400,591 pairs in PART I for training, 10,666 pairs in PART II for validation, and 1,106 pairs in PART III for testing. The authors of the dataset have manually annotated the relevance scores, ranging from 1 to 5, of the text-summary pairs in PART II and PART III. They suggested that only pairs with scores no less than three should be used for evaluation, which leaves 8,685 pairs in PART II and 725 pairs in PART III. From the statistics of PART II and PART III, we can see that more than 20% of the pairs are dropped to maintain semantic quality. It indicates that the training set, which has not been manually annotated and checked, contains a huge quantity of unrelated text-summary pairs.
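As a small illustration of the filtering described above, the sketch below keeps only pairs with a relevance score of at least 3. The dictionary field names are our assumption; LCSTS itself is distributed in a tag-based text format that would need to be parsed first.

def keep_for_evaluation(pairs, min_score=3):
    # pairs: list of dicts with hypothetical keys "text", "summary", "score"
    # (the human-annotated relevance score, 1-5, exists only in PART II/III).
    return [p for p in pairs if p["score"] >= min_score]

# After filtering, PART II should contain 8,685 pairs and PART III 725 pairs.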

3.2. Experimental Settings

We use the sequence-to-sequence model (seq2seq) with attention (attention; stanfordattention; mapattention; supervisedattention) as the Baseline. Both the encoder and the decoder are based on a single-layer LSTM (Hochreiter1997). The word embedding size is 400, and the hidden state size of the LSTM unit is 500. We conduct experiments at the word level. To convert the character sequences into word sequences, we use Jieba (https://pypi.python.org/pypi/jieba/) to segment the words, the same as in existing work (lcsts; copynet).

Self-Train and Dual-Train are implemented based on the baseline model, with two more hyper-parameters: the temperature $\tau$ and the soft training strength $\alpha$. We use a very simple setting for all tasks, keeping $\tau$ and $\alpha$ fixed. We pre-train the model without applying the soft training objective for 5 epochs out of a total of 10 epochs. We use the Adam optimizer (Kingma2014) for all tasks, with its default settings. In testing, we use beam search to generate the summaries, and the beam size is set to 5. We report the test results at the epoch that achieves the best score on the development set.
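For concreteness, the following PyTorch sketch mirrors the configuration described above (word embeddings of size 400, single-layer LSTM encoder and decoder with hidden size 500, Adam with default settings, beam size 5). The attention mechanism, the decoder, and the training loop are omitted, and the values of $\tau$ and $\alpha$ are deliberately left as required arguments of the loss functions above, since they are not restated in this extract.

import torch.nn as nn
import torch.optim as optim

EMBED_SIZE, HIDDEN_SIZE, BEAM_SIZE = 400, 500, 5

class Encoder(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMBED_SIZE)
        self.lstm = nn.LSTM(EMBED_SIZE, HIDDEN_SIZE, num_layers=1, batch_first=True)

    def forward(self, word_ids):
        # word_ids: (batch, src_len) indices of Jieba-segmented words
        return self.lstm(self.embed(word_ids))

def make_optimizer(model):
    # PyTorch defaults match the Adam paper: lr=1e-3, betas=(0.9, 0.999), eps=1e-8.
    return optim.Adam(model.parameters())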

3.3. Evaluation Protocol

For text summarization, a common automatic evaluation method is ROUGE (rouge). The generated summary is evaluated against the reference summary, based on unigram recall (ROUGE-1), bigram recall (ROUGE-2), and recall of the longest common subsequence (ROUGE-L). To facilitate comparison with the existing systems, we adopt ROUGE as the automatic evaluation method. ROUGE is calculated at the character level, following previous work (lcsts).

However, for abstractive text summarization, ROUGE is sub-optimal and cannot assess the semantic consistency between the summary and the source content, especially when there is only one reference for a piece of text. The reason is that the same content may be expressed in different ways with different focuses, and simple word matching cannot recognize paraphrasing. This is the case for all of the existing large-scale datasets. Besides, as mentioned above, ROUGE is calculated at the character level in Chinese text summarization, which in practice makes the metric favor character-level models. In Chinese, a word, not a character, is the smallest semantic element that can be uttered in isolation. In the extreme case, the generated text could be completely unintelligible, yet the characters could still match. In theory, calculating ROUGE at the word level could alleviate the problem. However, word segmentation is also a non-trivial task for Chinese: there are many different segmentation conventions, which produce different ROUGE scores. We argue that it is not acceptable to introduce additional systematic bias into automatic evaluation, and automatic evaluation for semantics-related tasks can only serve as a reference.
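To illustrate why character-level matching can be misleading, the toy snippet below computes a simple ROUGE-1 recall over characters and over Jieba-segmented words for a scrambled candidate. The sentences are invented for illustration, and this is a simplified recall computation, not the official ROUGE implementation.

from collections import Counter
import jieba  # https://pypi.python.org/pypi/jieba/

def rouge1_recall(candidate_units, reference_units):
    # Clipped unigram overlap divided by the number of reference unigrams.
    cand, ref = Counter(candidate_units), Counter(reference_units)
    overlap = sum(min(cand[u], ref[u]) for u in ref)
    return overlap / max(len(reference_units), 1)

reference = "阿里巴巴赴美上市"
candidate = "巴巴阿里上市赴美"  # scrambled and barely intelligible

print(rouge1_recall(list(candidate), list(reference)))              # 1.0 at the character level
print(rouge1_recall(jieba.lcut(candidate), jieba.lcut(reference)))  # typically lower at the word level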

To avoid these deficiencies, we propose a simple human evaluation method to assess semantic consistency. Each summary candidate is evaluated against the source text rather than the reference. If the candidate is irrelevant to the text, incorrect with respect to the text, or not understandable, the candidate is labeled bad; otherwise, it is labeled good. Then, we can compute the accuracy, i.e., the proportion of good summaries. The proposed evaluation is simple and straightforward, and it focuses on the relevance between the summary and the text. Semantic consistency should be the major consideration when putting text summarization methods into practice, but the current automatic methods cannot judge it properly. For detailed guidelines of the human evaluation, please refer to the appendix.

In the human evaluation, the text-summary pairs are dispatched to two human annotators who are native speakers of Chinese. As in our setting each summary is evaluated against the source content, the number of pairs that need to be manually evaluated is four times the number of pairs in the test set, because we compare four systems in total. To decrease the workload and, at the same time, get an estimate of the annotation quality, we adopt the following procedure. We first randomly select 100 pairs from the validation set for the two human annotators to evaluate. Each pair is annotated twice, and the inter-annotator agreement is checked. We find that under this protocol, the inter-annotator agreement is quite high. In the evaluation of the test set, each pair is annotated only once to accelerate the evaluation. To further maintain consistency, summaries of the same source content are not distributed to different annotators.
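A sketch of the bookkeeping behind this protocol is given below: accuracy is the fraction of candidates labeled good, and raw agreement is computed on the 100 doubly annotated calibration pairs. The label encoding is our assumption; the paper does not specify how agreement is quantified.

def accuracy(labels):
    # labels: list of "good"/"bad" judgments for one system on the test set.
    return sum(1 for label in labels if label == "good") / len(labels)

def raw_agreement(labels_a, labels_b):
    # Fraction of the doubly annotated calibration pairs on which the two
    # annotators assign the same label.
    assert len(labels_a) == len(labels_b)
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)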

3.4. Experimental Results

Methods      # Good   # Total   Accuracy
Reference    673      725       92.8%
Baseline     360      725       49.6%
Self-Train   316      725       43.6%
Dual-Train   389      725       53.6%
Table 2. Results of the human evaluation, showing how many summaries are semantically consistent with their source content. Each generated summary is evaluated directly against its source content.

First, we present the results of the human evaluation, which focuses on the semantic consistency of the summary with its source content. We evaluate the systems implemented by us as well as the reference. We cannot conduct human evaluation for the existing systems from other work, because their output summaries are not available to us. Besides, the baseline system we implemented is very competitive in terms of ROUGE and achieves better performance than almost all the existing systems. The results are listed in Table 2. It is surprising to see that the accuracy of the reference summaries does not reach 100%. This means that the test set still contains text-summary pairs of poor quality, even after removing the pairs with relevance scores lower than 3 as suggested by the authors of the dataset. As we can see, Dual-Train improves the accuracy by 4%. Due to the rigorous definition of being good, this means that 4% more of the summaries are semantically consistent with their source content. However, Self-Train shows a performance drop compared to the baseline. After investigating its generated summaries, we find that the major reason is that the generated summaries are often grammatically incomplete and stop too early, although the generated part is indeed more related to the source content. Because of the definition of being good, the improved relevance does not make up for the loss of intelligibility.

Methods ROUGE-1 ROUGE-2 ROUGE-L
RNN-context (lcsts) 29.9 17.4 27.2
SRB (MaEA2017) 33.3 20.0 30.1
CopyNet (copynet) 35.0 22.3 32.0
RNN-distract (distraction) 35.2 22.6 32.5
DRGD (DRGD) 37.0 24.1 34.2
Baseline (Ours) 35.3 23.4 33.0
Self-Train (Ours) 35.3 23.3 32.6
Dual-Train (Ours) 36.2 24.3 33.8
Table 3. Comparisons with the existing models in terms of ROUGE metrics.

Then, we compare the automatic evaluation results in Table 3. As we can see, only applying soft training without adaptation (Self-Train) hurts the performance. With the additional output layer (Dual-Train), the performance is greatly improved over the baseline. Moreover, with the proposed method, the simple baseline model is second only to the best of the state-of-the-art models and even surpasses it in ROUGE-2. It is promising that applying the proposed method to the state-of-the-art model could also improve its performance.

The automatic evaluation is done on the original test set to facilitate comparison with existing work. However, a more reasonable setting would be to exclude the 52 test instances that are found to be bad in the human evaluation, because the quality of the automatic evaluation depends on the reference summary. As the existing methods do not provide their test outputs, it is non-trivial to reproduce all their results at the same reported performance. Nonetheless, this does not change the fact that ROUGE cannot handle the issues in abstractive text summarization properly.

3.5. Experimental Analysis

To examine the effect of the proposed method and reveal how it improves consistency, we compare the output of the baseline with that of Dual-Train, based on both the output text and the output label distribution. We also conduct error analysis to discover room for improvement.

3.5.1. Analysis of the Output Text

Table 4. Examples of the summaries generated by the baseline and Dual-Train from the test set. As we can see, the summaries generated by the proposed method are much better than the ones generated by the baseline, and are even more informative and precise than the references.
