ROUGE is the de facto criterion for summarization research [3, 1, 7, 2]. Despite its wide use, previous work [14, 11, 10, 16] has agreed on its two major drawbacks: 1) it favors lexical similarity, not semantic similarity, and 2) it requires a reference summary.
The first drawback makes ROUGE unfit when heavy rewriting happens, which is not rare, especially in abstractive/generative summarization [16, 3, 1]. Some attempts introduce word semantics into ROUGE, including replacing exact word matching with the dot product of word embeddings [14]. But bounded by the n-gram-based framework of ROUGE, such word-level fixes cannot be effective, because the meaning of a sentence is not defined only by the words comprising it.
The second drawback significantly limits summarization research because reference summaries are expensive to obtain. Unlike many other supervised learning problems, where labeling means assigning numbers or categorical tags, summarization requires a human annotator to write a substantial amount of text. Current summarization research is limited to a handful of datasets, which are predominantly in the news domain. With ROUGE, it is hard to expand the study to new domains, or even new datasets, that do not come with reference summaries.
To tackle the two drawbacks rooted in the design of ROUGE, we hypothesize that it is possible to assess the quality of a summary through a semantic document-summary comparison. As the first step toward this goal, this paper studies the feasibility of predicting how well a given summary matches a given document. To instantiate this idea, we train end-to-end models in a supervised fashion by leveraging recent advances in sentence embedding, which demonstrate that semantically similar sentences are close in the embedding space. This can help the trained model focus on high-level semantic similarity rather than lexical similarity.
Such a supervised learning task is not trivial because no dataset can be used directly: existing datasets contain no mismatching summaries, only human-composed reference summaries that match their documents well. We hence develop two negative sample generation approaches to prepare two datasets, by swapping and mutating summaries, respectively.
Experimental results show that our methods can tell whether a summary matches a given document with 96.2% accuracy, or tell how much a summary is mutated with irrelevant words with a correlation coefficient over 0.95. Additional cross-domain analyses show that models trained with our approach keep their performance on domains not seen in training.
In summary, our contributions are as follows:
- a feasibility study on semantics-based summary quality assessment without using reference summaries,
- two methods for generating negative samples from existing datasets, and
- extensive and promising empirical evidence, especially the correlation between the output of our model and human evaluation scores for machine-generated summaries in TAC2010.
2.1 Model Architecture
We formulate the problem as a supervised learning problem. Formally, given an input document, and a candidate summary, the goal is to compute a score about the quality of the summary w.r.t. the input document.
As depicted in Figure 1, our model has two stages. First, both the document and the summary are transformed into a vector representation. Then a neural network is trained to estimate a summary quality score from the vector representation. Our study covers two approaches to convert text into its semantic vector representation, detailed as follows.
The first approach, based on sentence embedding, views the document and the summary as two sequences of sentences. A vector representing both of them is then the concatenation of the sentence embeddings of all sentences in them, where each sentence embedding is produced by a sentence encoder Emb. We employed two sentence encoders, Google’s Universal Sentence Encoder (USE) [4] and Facebook’s InferSent Sentence Encoder [5], to study the impact of sentence embedding on the task.
The second approach, based on BERT [6], views the document and the summary as two sequences of tokens. The tokens, plus two control tokens, [CLS] and [SEP], are then concatenated into a single sequence before being fed into the BERT network. The output of BERT corresponding to the special token [CLS] can be regarded as the representation of both the document and the summary.
Then, the vector is fed into a neural network that predicts the score. For the vector constructed using sentence encoders, three standard neural networks, namely a fully-connected (FC) network, a convolutional neural network (CNN), and a long short-term memory (LSTM) network, are chosen as examples to prove the concept of our framework. In the last two networks, in line with common practice, there is an FC network in the last stage. For the BERT-based document-summary representation, only the FC network is used. Details of the networks are provided in Section 3.
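The BERT-based input described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the layout [CLS] document [SEP] summary [SEP] is an assumption borrowed from BERT's standard sentence-pair format, and the function names `build_pair_sequence` and `segment_ids` are hypothetical.

```python
from typing import List

CLS, SEP = "[CLS]", "[SEP]"


def build_pair_sequence(doc_tokens: List[str], summary_tokens: List[str]) -> List[str]:
    """Concatenate document and summary tokens with the two control tokens.

    Assumed layout (standard BERT sentence-pair format):
    [CLS] document tokens [SEP] summary tokens [SEP]
    """
    return [CLS] + list(doc_tokens) + [SEP] + list(summary_tokens) + [SEP]


def segment_ids(doc_tokens: List[str], summary_tokens: List[str]) -> List[int]:
    """Segment 0 covers [CLS], the document and its [SEP]; segment 1 covers the rest."""
    return [0] * (len(doc_tokens) + 2) + [1] * (len(summary_tokens) + 1)


if __name__ == "__main__":
    doc = ["the", "cat", "sat"]
    summ = ["cat", "sat"]
    print(build_pair_sequence(doc, summ))
    print(segment_ids(doc, summ))
```

The vector that BERT outputs at the position of [CLS] (index 0 in this layout) would then be the joint document-summary representation passed to the FC network.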
2.2 Negative Sample Generation
To train such a supervised model, besides summaries that match documents well, we also need summaries that match documents poorly or not at all. However, existing summarization datasets contain only the former, namely the human-composed reference summaries. So we introduce two approaches, random mutation and cross pairing, to create negative samples.
Prior work [20] uses pairwise preferences to obtain ground truth by asking human annotators to select a preferred summary out of two provided. We go one step further by using an unsupervised method to automatically obtain negative samples.
Random mutation, illustrated in Figure 2, alters the content of a reference summary by 1) adding random tokens drawn from the vocabulary at random locations, 2) deleting random tokens, or 3) replacing random tokens with random words. The complement of the percentage of mutation is the score to be predicted by the model. For example, if 30% of the tokens in a reference summary are deleted, then the model is expected to predict 0.7 when the corresponding document and the mutated summary are fed into the model. In particular, when no mutation is applied, i.e., for the original reference summary, the label is 1. Therefore, the model to be trained is a regression model, and Mean Squared Error (MSE) is chosen as the loss function.
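The three mutation variants and their labels can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the number of mutated tokens is taken as the rounded product of the ratio and the summary length, the label is one minus the realized fraction, and the names `mutate`, `vocab`, and `rng` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def mutate(tokens: Sequence[str], ratio: float, mode: str,
           vocab: Sequence[str], rng: random.Random) -> Tuple[List[str], float]:
    """Return (mutated summary tokens, regression label).

    mode is one of "add", "delete", "replace".
    The label is the complement of the fraction of mutated tokens.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    n = len(tokens)
    k = round(ratio * n)  # assumed rounding rule for the mutation count
    out = list(tokens)
    if mode == "add":
        for _ in range(k):
            # insert a random vocabulary token at a random location
            out.insert(rng.randint(0, len(out)), rng.choice(vocab))
    elif mode == "delete":
        drop = set(rng.sample(range(n), k))
        out = [t for i, t in enumerate(out) if i not in drop]
    elif mode == "replace":
        for i in rng.sample(range(n), k):
            out[i] = rng.choice(vocab)
    else:
        raise ValueError("unknown mode: " + mode)
    label = 1.0 - k / n if n else 1.0
    return out, label


if __name__ == "__main__":
    rng = random.Random(0)
    summary = "the court ruled in favor of the plaintiff on monday".split()
    vocab = ["banana", "orbit", "violin", "harbor"]
    for mode in ("add", "delete", "replace"):
        print(mode, mutate(summary, 0.3, mode, vocab, rng))
```

With a 10-token summary and a ratio of 0.3, each variant touches 3 tokens and yields a label of 0.7, matching the example in the text.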
One caveat of such word-level mutation is that the structure of the text may be destroyed, producing ill-formed data. Hence, we introduce the second approach, cross-pairing.
Another concern is that the amount of mutation only approximates the quality of a summary; for example, replacing words with synonyms does not change the meaning much. However, this is unlikely to be a problem for two reasons. First, if a semantically important word is altered, e.g., from “I am happy” to “I am not happy”, the sentence encoder can capture such changes. Second, the chance that a mutated word retains the original meaning is very low due to the randomness.
Cross pairing resembles the way negative sampling in word embedding training [12] generates fake context words. Given a document and its reference summary, we create negative data by pairing the document with reference summaries of other documents. We assign the label 0 to such document-summary pairs, and the label 1 to any original pair of document and (reference) summary. This turns the problem into a binary classification problem. We use binary cross-entropy as the loss function.
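Cross pairing can be sketched in a few lines. The sketch below is illustrative only: it assumes one negative per document, drawn uniformly from the reference summaries of the other documents, and the function name `cross_pair` is hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def cross_pair(docs: Sequence[str], summaries: Sequence[str],
               rng: random.Random) -> List[Tuple[str, str, int]]:
    """Build (document, summary, label) triples.

    Label 1: the document with its own reference summary.
    Label 0: the document with the reference summary of another document.
    """
    if len(docs) != len(summaries):
        raise ValueError("docs and summaries must be aligned")
    n = len(docs)
    if n < 2:
        raise ValueError("need at least two documents to cross pair")
    samples = []
    for i in range(n):
        samples.append((docs[i], summaries[i], 1))
        # pick an index j != i uniformly at random
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1
        samples.append((docs[i], summaries[j], 0))
    return samples


if __name__ == "__main__":
    docs = ["doc A", "doc B", "doc C"]
    sums = ["sum A", "sum B", "sum C"]
    for triple in cross_pair(docs, sums, random.Random(0)):
        print(triple)
```

Each document contributes one positive and one negative sample, so the resulting binary classification data is balanced.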
We evaluate our approach on three widely used summarization datasets:
- CNN/DailyMail [9, 13],
- Newsroom [8], and
- Big-patent [17].
The first two datasets belong to the news article domain, while the third is formed from patent documents and their abstracts.
Results on data prepared by the different methods introduced in Section 2.2 are reported separately. For each dataset, we randomly pick 30,000 samples and report the result on each dataset individually. Later, we also study how well a model trained on one dataset performs on another.
For each data preparation method, we generate one fake sample per article, so each dataset yields a total of 60,000 samples. Data is split 80%/10%/10% into training, validation, and test sets. The splitting procedure also ensures that no article in the test set appears in the training set.
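One way to obtain such a leak-free split is to split article identifiers first and assign samples afterwards, as sketched below. The sketch is an assumption about how the constraint could be enforced, not a description of the authors' procedure; `split_by_article` and the `(article_id, payload)` sample layout are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[int, str]  # (article_id, payload), a hypothetical layout


def split_by_article(samples: List[Sample],
                     rng: random.Random) -> Dict[str, List[Sample]]:
    """Split 80/10/10 at the article level so that the real and the fake
    sample of one article always land in the same partition."""
    ids = sorted({article_id for article_id, _ in samples})
    rng.shuffle(ids)
    n = len(ids)
    n_train, n_val = n * 8 // 10, n // 10
    train_ids = set(ids[:n_train])
    val_ids = set(ids[n_train:n_train + n_val])
    split: Dict[str, List[Sample]] = {"train": [], "val": [], "test": []}
    for sample in samples:
        if sample[0] in train_ids:
            split["train"].append(sample)
        elif sample[0] in val_ids:
            split["val"].append(sample)
        else:
            split["test"].append(sample)
    return split


if __name__ == "__main__":
    data = [(i, kind) for i in range(10) for kind in ("real", "fake")]
    parts = split_by_article(data, random.Random(0))
    print({name: len(rows) for name, rows in parts.items()})
```

Splitting by identifier rather than by sample is what keeps a test article's fake sample from appearing in training.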
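To study the impact of different text sequence encoding schemes on our task, we evaluate the results on four text sequence encoder settings:
- USE with the Deep Averaging Network encoder (USE-DAN) [4],
- USE with the Transformer encoder (USE-Trans) [4],
- InferSent [5], and
- BERT [6].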
For the first 3 sentence encoders, we apply padding to both documents and summaries to unify the input dimensions. The dimensions are set to the length that covers 80% of the data, or equivalently, 47 sentences for documents and 3 sentences for summaries. For BERT, we limit the total number of tokens to 512. Should a pair of document and summary together exceed or fall short of 512 tokens, we truncate or pad the document and summary in parallel to meet the dimension requirement.
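The length rule for the sentence encoders can be sketched as follows. The sketch assumes that "80% of the data" means the smallest length that is at least as long as 80% of the samples; the helper names `length_covering` and `pad_or_truncate` and the padding symbol are hypothetical.

```python
from typing import List, Sequence


def length_covering(lengths: Sequence[int], percent: int = 80) -> int:
    """Smallest observed length that covers `percent` percent of the samples."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths)
    k = (len(ordered) * percent + 99) // 100  # integer ceiling of n * percent / 100
    return ordered[max(k, 1) - 1]


def pad_or_truncate(seq: Sequence[str], target_len: int, pad: str) -> List[str]:
    """Cut the sequence to target_len, or fill it up with the padding symbol."""
    return (list(seq) + [pad] * target_len)[:target_len]


if __name__ == "__main__":
    summary_lengths = [1, 2, 2, 3, 3, 3, 3, 3, 5, 9]  # sentences per summary (toy data)
    limit = length_covering(summary_lengths)
    print(limit)
    print(pad_or_truncate(["s1", "s2"], limit, "<pad>"))
```

Applied to sentence counts, the same two helpers would give the document and summary limits; applied to word counts, they would give the limits used for the GloVe baseline.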
As a baseline, we also test a model (denoted as “GloVe” in Table 1) built on top of word embeddings using GloVe [15]. We use the pretrained 100-dimensional GloVe embedding matrix. The padding/truncation lengths are set to 1091 and 55 words for documents and summaries, respectively, according to the 80% rule above.
As mentioned in Section 2.1, embeddings are fed into 3 kinds of networks: a fully-connected network (FC-only), a CNN, and an LSTM. The FC-only network has a single hidden layer of 128 neurons fully connected to the flattened input embeddings and a single output node. The CNN stacks a 2D convolutional layer, a max-pooling layer, and lastly a fully-connected layer resulting in a single output neuron. We use 128 kernels whose size is determined by the dimension of the word/sentence embeddings, so that convolutions do not cross word/sentence boundaries. The filter size of the max-pooling layer equals the dimension of the output of the convolutional layer. The max-pooling layer output is fully connected to the single output neuron. In the LSTM setting, we use one layer of 25 and 128 LSTM units for word and sentence models, respectively. These units are fully connected to a single output neuron. For all architectures, we use the RMSProp optimizer with NA01 learning rate. Early stopping determines the number of epochs: training stops when the validation loss has not improved for three epochs.
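The early-stopping rule can be made concrete with a small sketch. It assumes a patience of three consecutive epochs without a new best validation loss; the function name `early_stop` is hypothetical and the loss values are made up for illustration.

```python
from typing import Sequence, Tuple


def early_stop(val_losses: Sequence[float], patience: int = 3) -> Tuple[int, int]:
    """Return (epoch at which training stops, epoch with the best validation loss).

    Epochs are counted from 1. Training stops once `patience` consecutive
    epochs have passed without improving on the best validation loss.
    """
    best = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


if __name__ == "__main__":
    losses = [1.0, 0.8, 0.9, 0.85, 0.81, 0.7]  # illustrative values only
    print(early_stop(losses))
```

In this toy run the best loss is reached at epoch 2 and training halts after epoch 5, so the later improvement at epoch 6 is never seen; that is the usual trade-off of a short patience.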
As for the BERT model, we use the pre-trained 12-layer Transformer as our model. Each token of the input sequence, including [CLS], is encoded by BERT into an embedding vector of dimension 768. The FC-only network for the BERT-based model also has a single hidden layer of 128 neurons, fully connected to the [CLS] embedding, and a single output node.
3.3 Base Model Performance on Cross-pairing
The results of our base models on samples generated using cross pairing are given in Table 1. Each row corresponds to one network architecture, and each column corresponds to one text-to-vector scheme.
Table 1: Cross pairing, accuracy (%).
First of all, the framework is highly successful at telling how well a summary matches a document. On predicting whether a document-summary pair is true using data prepared by cross pairing, the accuracy reaches 96.2% when using InferSent as the encoder and FC-only as the network, and 98.5% when using BERT as the encoder. These encouraging results are obtained with very little hyperparameter tuning, pointing in a positive direction for evaluating a candidate summary without a reference summary. We expect that more thorough hyperparameter optimization can improve the performance further.
Second, sentence embedding models consistently outperform word embedding ones. In particular, InferSent and USE-Trans models achieve 96.2% and 93.5% accuracy, respectively, while GloVe-based models reach at most 72.2%.
The fact that sentence-embedding-based models overwhelmingly outperform word-embedding-based ones suggests that earlier fixes that introduce word embeddings into ROUGE [14] are not sufficient, and that it is better to develop a new metric other than ROUGE. As discussed earlier, the meaning of a sentence is defined not solely by the words comprising it but also by its structure. This finding aligns well with other studies showing that naively combining word embeddings cannot capture sentence similarities [10].
We are further interested in the differences between the two sentence encoder families, USE and InferSent, on the task. InferSent-based models lead USE-based models on the data prepared by cross pairing, but their positions are flipped on the data prepared by mutation. This suggests that InferSent might be sensitive to sentence structure, which can be destroyed by mutating words.
Lastly, USE-Trans and BERT-based models consistently outperform USE-DAN-based ones, suggesting that the Transformer architecture [18] might help distinguish informative sentences from those intended for fluency or for attracting readers. Specifically, the pre-trained Transformer BERT shows the best performance among all the competing methods. This suggests that the BERT model may capture some fundamental linguistic semantics in pre-training.
3.4 Base Model Performance on Mutation
We proceed to evaluate our model performance on mutation in Table 2, where the three variants of mutation are denoted as mutation-add, mutation-delete, and mutation-replace.
The results are similar to those on the cross-paired datasets and are, in general, encouraging: the framework can tell how well a summary matches a document. For example, on predicting how much the reference summary is mutated, the correlation coefficient (in %) can be up to 95.5/93.0/96.9 (for mutation-add, mutation-delete, and mutation-replace, respectively) when using USE-Trans + LSTM, and up to 95.3/94.6/98.4 when using BERT. Second, sentence embedding models consistently outperform word embedding ones, except on mutation-delete data, where all models perform comparably. Lastly, USE-Trans and BERT-based models are generally better than USE-DAN and GloVe, with the pre-trained BERT showing the best performance. This suggests that the Transformer architecture, especially a pre-trained one such as BERT, may capture some fundamental linguistic semantics.
Table 2: Mutation, Pearson correlation coefficient (PCC).
3.5 Cross-domain Analysis
By cross-domain analysis, we mean that the training and test data are from two different domains (e.g., news articles vs. patents, or two different kinds of news articles). The domain on which the model is trained is called the source domain while the domain on which the model is tested is called the target domain. A good summary assessment model is expected to have a consistent performance across domains, even on text from a domain that differs from those used in training.
Because domain transferability is not a focus of this paper, we use only samples generated by cross pairing in this part. The only model used here is the BERT-based one.
Cross-domain performance. Using the CNN/DailyMail dataset as the target domain, we test the transferability from different source domains. Here we use the Big-patent [17] and Newsroom [8] datasets as the source domains.
As shown in Table 3, the performance on CNN/DailyMail drops when training on other domains such as Big-patent and Newsroom, showing that a domain difference exists. However, the drop is very small for both Big-patent and Newsroom, which indicates that the proposed method transfers relatively well across domains.
Cross-domain performance w.r.t. domain size. We continue to examine the effect of target domain size on the cross-domain performance. We sample data from the CNN/DailyMail dataset with different sample sizes, 30k, 100k, and 300k, to test the transferability from another domain and from itself.
As shown in Table 4, for in-domain training (CNN/DM as the source), the performance increases slightly with the data size. For transfer (Newsroom as the source), the performance is relatively stable across sizes. Encouragingly, the domain difference is not large: transfer from Newsroom to CNN/DailyMail works well in all cases.
| Target domain size | 30k | 100k | 300k |
| Source: CNN/DM itself | 98.5 | 98.6 | 99.2 |
Hence we can conclude that a model built with our approach has consistent performance across domains and across different domain sizes, and is thus suitable for handling summaries from various sources.
3.6 Alignment with human evaluation
The last, but probably the most exciting, part of the evaluation is on how well the scores from our models correlate or align with human evaluation or judgment of summaries.
3.6.1 Data and setup
Because there is no released single-document summarization dataset that includes human evaluation of the summaries, we use the data from the TAC2010 guided summarization competition (https://tac.nist.gov/2010/Summarization/Guided-Summ.2010.guidelines.html), a multi-document summarization task, as an approximation. The human evaluation results from TAC2010 are distributed by NIST.
In the TAC2010 guided summarization task, there are 43 machine summarizers and 4 human summarizers. Given a document set, consisting of 10 news articles about the same event, each summarizer generates a summary. Because there are 46 document sets and 47 summarizers, we have a total of 46 × 47 = 2,162 summaries, each paired with the 10 articles of its document set.
For each generated summary, a group of human evaluators scores it on multiple aspects, resulting in 4 scores: the Pyramid score, the modified score, the linguistic quality, and the overall score. The meaning of the scores can be found in the TAC2010 dataset. Because the Pyramid score is not available for human-composed summaries and is also based on overlaps, this part of the study focuses on the last 3 scores.
Given a document and a corresponding summary (machine-generated or human-composed), our framework produces a score for the pair. Because a document set has 10 articles that all correspond to the same summary, we aggregate the 10 per-article scores into a single score for the summary, and compute its correlation with the 3 human scores. Our model is trained using samples generated from 30k CNN/DailyMail data only.
Because the BERT-based model has the best performance in the previous experiments, we use it in this last part of the experiment. The results are given in Table 5 (Pearson’s correlation coefficient) and Table 6 (Spearman’s correlation coefficient).
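The evaluation protocol can be sketched in plain Python. Two points are assumptions rather than facts from the text: the per-article scores of a document set are aggregated by their mean (only one plausible choice), and ties in Spearman's coefficient are handled with average ranks. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def set_level_score(per_article_scores: Sequence[float]) -> float:
    """Aggregate the scores of one summary against each article (assumed: mean)."""
    return sum(per_article_scores) / len(per_article_scores)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's coefficient: Pearson's coefficient computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    model_scores = [set_level_score(s) for s in ([0.9, 0.8], [0.4, 0.6], [0.1, 0.3])]
    human_scores = [4.0, 3.0, 1.0]  # toy values, not TAC2010 data
    print(pearson(model_scores, human_scores), spearman(model_scores, human_scores))
```

Pearson's coefficient rewards a linear relation between model and human scores, whereas Spearman's only checks that the two rankings agree, which is why the paper reports both.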
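Table 5: Pearson’s correlation coefficients with human scores.
| Sample generated by | Modified | Linguistic | Overall |
| mix (all 4 above) | 0.4496 | 0.2420 | 0.4018 |
| mix on machines’ | 0.5191 | 0.2357 | 0.4274 |

Table 6: Spearman’s correlation coefficients with human scores.
| Sample generated by | Modified | Linguistic | Overall |
| mix (all 4 above) | 0.5024 | 0.1511 | 0.3691 |
| mix on machines’ | 0.5387 | 0.1269 | 0.3844 |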
Among the 4 methods to generate samples, cross pairing produces the samples that train the model that best aligns with human evaluation scores, reaching 0.3426 (Pearson’s) and 0.3993 (Spearman’s) with the modified score. We think this is a promising result given the limited amount of data (30k news articles) used in training.
Models trained with the 3 kinds of mutation-generated samples align very poorly with human evaluation scores, especially the deletion-based one. Using samples generated by deletion, the Pearson’s and Spearman’s correlation coefficients between our model’s score and human evaluation scores are below 0.1. One possible reason is that a model trained on deletion-generated samples is heavily influenced by the length of the summary rather than by its information or writing.
Among three human evaluation scores, our model has the lowest correlation with linguistic quality. This might indicate that current sentence or document embedding approaches focus on the semantics but not the writing styles.
If we mix the samples generated in all 4 methods together to train a new model, the model’s alignment with human evaluation can be boosted significantly. For example, the Pearson’s and Spearman’s correlation coefficients between the mixed model and modified score are as high as 0.4496 and 0.5024, respectively.
Lastly, we are particularly interested in testing our approach on machine-generated summaries in TAC2010 because an important use of our approach is to judge automated summarizers. On machine-generated summaries, our mixed model’s Pearson’s and Spearman’s correlation coefficients with modified score are 0.5191 and 0.5387, respectively, and those with overall score are 0.4274 and 0.3844, respectively.
In conclusion, we think that our approach achieves very promising initial results given the small amount of training data and the small number of training epochs.
3.6.3 Comparison with ROUGE
We then compare our best model with the ROUGE metrics, the de facto standard in summarization research, to see whether our model or ROUGE aligns better with human evaluation. Because ROUGE is a set of metrics, we pick 4 in this part of the study: ROUGE-1, ROUGE-2, ROUGE-4, and ROUGE-W-1.2. They measure the overlap between a candidate summary and the reference summaries in terms of n-grams and weighted longest common subsequences. The ROUGE scores are computed by NIST and distributed in the TAC2010 dataset. They are available for machine-generated summaries only.
The results in Pearson’s and Spearman’s correlation coefficients are given in Tables 7 and 8, respectively. The suffixes P, R, and F after each ROUGE metric denote Precision, Recall, and F1 score, respectively. In terms of Pearson’s correlation coefficient, our best model outperforms ROUGE-4 on the modified score and the overall score, e.g., 0.5191 vs. 0.4765 and 0.4274 vs. 0.3513 in Table 7. In most cases, our method is slightly behind ROUGE-1, ROUGE-2, and ROUGE-W-1.2, with a gap below 0.1. The gap narrows further on linguistic quality, where our model performs almost on par with ROUGE-2. Similar results can be observed for Spearman’s correlation coefficient as well.
Therefore, although our approach does not yet surpass ROUGE in terms of correlation with human evaluation of summary quality, the initial result is still promising: our approach performs close to ROUGE after training on a small amount of data for a small number of epochs.
In this paper, we propose an end-to-end approach that can potentially assess summary quality by its semantic similarity to the input document, without needing a reference summary. Two methods to prepare negative samples for training such end-to-end models are developed. Extensive experiments under various settings, including different neural network architectures, show that our approach can consistently and accurately tell whether, or how much, a summary is about the input document. Cross-domain analyses further show that a model trained with our approach can be used to judge summary quality in an unseen domain. Finally, our model shows moderate correlation with human evaluation of summaries, with a performance close to or equal to that of the ROUGE metrics. Our approach is a step toward designing better metrics to supplement the widely used, lexical-based ROUGE.
[1] (2018) Entity commonsense representation for neural abstractive summarization. In NAACL-HLT, Vol. 1, pp. 697–707.
[2] (2018) Retrieve, rerank and rewrite: soft template based neural summarization. In ACL, Vol. 1, pp. 152–161.
[3] (2018) Deep communicating agents for abstractive summarization. In NAACL-HLT, Vol. 1, pp. 1662–1675.
[4] (2018) Universal sentence encoder. arXiv preprint arXiv:1803.11175.
[5] (2017) Supervised learning of universal sentence representations from natural language inference data. In EMNLP, pp. 670–680.
[6] (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
[7] (2018) Unsupervised semantic abstractive summarization. In Proceedings of ACL 2018, Student Research Workshop, pp. 74–83.
[8] (2018) Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).
[9] (2015) Teaching machines to read and comprehend. In NeurIPS, pp. 1693–1701.
[10] (2016) How not to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2122–2132.
[11] (2008) Correlation between ROUGE and human evaluation of extractive meeting summaries. In ACL, pp. 201–204.
[12] (2013) Distributed representations of words and phrases and their compositionality. In NeurIPS, pp. 3111–3119.
[13] (2016) Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 280–290.
[14] (2015) Better summarization evaluation with word embeddings for ROUGE. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1925–1930.
[15] (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
[16] (2018) Unsupervised abstractive meeting summarization with multi-sentence compression and budgeted submodular maximization. In ACL, pp. 664–674.
[17] (2019) BIGPATENT: a large-scale dataset for abstractive and coherent summarization.
[18] (2017) Attention is all you need. In NeurIPS, pp. 5998–6008.
[19] (2018) Learning semantic textual similarity from conversations. In Proceedings of The Third Workshop on Representation Learning for NLP, pp. 164–174.
[20] (2018) Estimating summary quality with pairwise preferences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1687–1696.