The task of text summarization is to generate a summary that conveys all the salient information of an original document. There are two strategies for this type of summarization. With the extractive approach, the most salient sentences are extracted from the source and compiled into a summary [zhong2019searching, wang2019self, xiao2019extractive]. The second approach is abstractive, with which a paraphrased summary is generated from the source [zhang2018abstractiveness, guo2018soft, wenbo2019concept]. The generated summary may not contain the same words that appear in the source document. Therefore, measuring factual alignment between the generated summary and source document is important [kryscinski2019neural].
Most summarization models are evaluated using Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [lin-2004-rouge], which measures n-gram overlaps between generated and reference summaries. ROUGE has been shown to correlate highly with manual evaluation methods, such as pyramid [nenkova2007pyramid] and TAC AESOP [owczarzak2011overview]. However, Louis and Nenkova [louis2013automatically] showed that the correlation decreased significantly when only one reference summary was provided. Additionally, considering the process by which a person manually summarizes a document, ROUGE is limited because it does not reflect the semantic relationship between generated and reference summaries. For example, when a person summarizes a document, they tend to use implicit words rather than the explicit words from the original document. As the ROUGE score is computed based on n-gram overlap, the score can be low even if two words have the same meaning. Table 1 shows an example of the ROUGE limitation when applied to Korean summarization. This tendency is particularly prevalent in Korean, which, unlike English, is an agglutinative language that combines various morphemes into a word to express several meanings and grammatical functions. In this process, complex morphological variations can occur. Therefore, relying on ROUGE scores alone can produce inaccurate evaluation results.
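The n-gram limitation described above can be illustrated with a minimal sketch. The function below is a simplified ROUGE-N recall (whitespace tokenization, a single reference, no stemming), not the official ROUGE implementation, and the English sentences are hypothetical stand-ins for the Korean example in Table 1.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Simplified ROUGE-N recall of `candidate` against one `reference`."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # A reference n-gram is matched at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


# Hypothetical sentences: the paraphrase keeps the meaning but changes the words.
reference = "the company launched a new television"
verbatim = "the company launched a new television"
paraphrase = "the firm released a novel tv"

print("verbatim   ROUGE-1 recall:", rouge_n_recall(verbatim, reference, n=1))
print("paraphrase ROUGE-1 recall:", rouge_n_recall(paraphrase, reference, n=1))
print("paraphrase ROUGE-2 recall:", rouge_n_recall(paraphrase, reference, n=2))
```

In this toy case only the function words "the" and "a" overlap, so the paraphrase receives a low unigram score and shares no bigrams with the reference, even though a human reader would judge the two sentences equivalent.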
To overcome this limitation, an evaluation method that considers the semantic information of both the generated and reference summaries is required. It is also important to examine the factual consistency between the generated summary and the source document, because the generated summary may contain false information. Each person summarizes information differently, and it is difficult to reach agreement even after cross-checking [kryscinski2019neural]. Therefore, the source document should also be considered alongside the generated and reference summaries.
In this study, we propose metrics for evaluating a summarization model that consider both the source document and reference summary together with the generated summary (see Table 1). Our contributions can be summarized as follows:
We propose evaluation metrics that can be applied to a summarization model using deep semantic information.
We propose methods to improve the correlation between the proposed evaluation metrics and human judgment.
Via extensive evaluation, we demonstrate that the correlation with human judgment is significantly higher for our proposed evaluation metrics than for ROUGE scores.
2 Related Work
Evaluation methods for text summarization are divided into two strategies: manual and automatic. Manual evaluation is expensive and difficult [nenkova2004evaluating, passonneau2013automated]. Several studies have been conducted to develop automatic methods that facilitate fast and low-cost evaluations. There are two types of automatic evaluation methods: extrinsic and intrinsic. An extrinsic automatic method evaluates a summarization model based on how it affects the completion of downstream tasks, such as judging document relevance [dorr2004extrinsic]. An intrinsic automatic method evaluates summary quality via a property analysis or by calculating its similarity to a manually generated summary. Intrinsic methods include the pyramid method [nenkova2007pyramid], the basic-elements method [hovy2006automated], and ROUGE [lin2004rouge]. The pyramid method inspects various human-made summaries and creates summary content units, each with a scoring weight. The basic-elements method is similar to the pyramid method. ROUGE evaluates the lexical overlap between the candidate and reference summaries.
As the ROUGE score is computed based on n-gram overlap, it does not account for synonymous words or phrases. Many approaches have been proposed to overcome this limitation. ParaEval [zhou2006paraeval], ROUGE-WE [ng2015better], ROUGE 2.0 [ganesan2018rouge], and ROUGE-G [shafieibavani2018graph] extend ROUGE to support synonymous constructs. ParaEval uses a matching method based on paraphrase tables. ROUGE-WE uses a lexical matching method with a semantic similarity measure and the cosine distances between tokens. ROUGE 2.0 uses WordNet as a synonym dictionary and computes token overlaps with all synonyms of matched words. ROUGE-G uses lexical and semantic matching from WordNet. These approaches have limitations because they require hand-crafted lexical and synonym dictionaries, which are particularly difficult to construct for Korean. Our research differs in three respects. 1) We propose a method that evaluates the generated summary by considering the source document as well as the reference summary. 2) Our evaluation model is robust to out-of-vocabulary (OOV) words because it leverages a pre-trained neural network (SBERT) that uses the byte pair encoding (BPE) [gage1994new] tokenization method, which is learned without supervision. Because Korean is an agglutinative language, this property is very important. 3) Our evaluation model can be further trained to capture more contextualized information from both the reference summary and the document.
Text summarization models can be divided into abstractive, extractive, and hybrid. Abstractive models reword phrases and create summaries having novel phrases constructed from the original document. Recent text summarization approaches have leveraged multi-task and multi-reward training [jiang2018closed, paulus2017deep, pasunuru2018multi, guo2018soft], attention-with-copying mechanisms [tan2017abstractive, see2017get, cohan2018discourse], and unsupervised training strategies [schumann2018unsupervised, chu2018unsupervised]. The extractive method extracts the most suitable sentences (or words) from the source document and copies them directly into the summary. Many researchers [neto2002automatic, colmenares2015heads, filippova2013overcoming] have utilized domain expertise to develop heuristics for refining summary texts. Recently, neural text summarization models have been proposed that are trained to predict whether a span of text should be included in the summary [nallapati2016classify, narayan2017neural, xu2019neural, liu2019comes]. Reinforcement learning-based summarization models have also been proposed to directly optimize models [wu2018learning, dong2018banditsum, narayan2018ranking]. The hybrid approach uses both abstractive and extractive methods. With this approach, the summarization process is divided into two phases: content selection and paraphrasing [gehrmann2018bottom, hsu2018unified, chen2018fast, liu2018generating].
From Table 1, we can observe the importance of considering both the document and reference summary together for proper evaluation of the summarization model. In Subsection 3.1, we first propose a method for evaluating the generated summary against the reference summary to reflect deep semantic meaning. Next, we propose a method for evaluating the generated summary against the original document and reference summary together. The reference-document-aware evaluation metric model can be further trained to capture more contextualized information from both the reference summary and the document (Subsection 3.2).
3.1 Reference and Document Aware Semantic Evaluation
Let us define the generated summary from the summarization model as $y_p = [w_1, \ldots, w_n]$ and the reference summary as $y_r = [w_1, \ldots, w_m]$, where $w_i$ indicates each word. Then, each summary representation, $v_p$ and $v_r$, can be constructed using sentence-embedding methods. Neural-based sentence-embedding methods have been broadly studied. Conneau et al. [conneau2017supervised] trained a Siamese bidirectional long short-term memory model with a max-pooling strategy on the Stanford Natural Language Inference (SNLI) corpus [bowman2015large] and the Multi-Genre Natural Language Inference (NLI) dataset [williams2017broad]. Cer et al. [cer2018universal] proposed the universal sentence encoder, which trains a transformer on the SNLI dataset. Reimers and Gurevych [reimers2019sentence] recently proposed Sentence-BERT (SBERT), which leverages a pre-trained BERT [devlin2018bert], is trained with a combination of the SNLI and Multi-Genre NLI datasets, and shows state-of-the-art sentence-embedding performance. SBERT is suitable for semantic similarity searches and showed faster inference speeds than previous state-of-the-art approaches, including BERT, RoBERTa [liu2019roberta], and the universal sentence encoder.
We leverage a pre-trained SBERT to construct summary representations. Each word representation, $e_i$, is obtained from SBERT as
$$e_i = \mathrm{SBERT}(w_i) \tag{1}$$
Subsequently, mean-pooling is performed to construct $v_p$ as
$$v_{p,j} = \frac{1}{n} \sum_{i=1}^{n} e_{i,j} \tag{2}$$
where $j$ represents an index of a word-embedding dimension, and $n$ represents the length of $y_p$. $v_r$ can also be obtained in the same manner.
The semantic similarity score, $s(p, r)$, between $v_p$ and $v_r$ can be obtained as follows:
$$s(p, r) = \cos(v_p, v_r) = \frac{v_p^{\top} v_r}{\lVert v_p \rVert \, \lVert v_r \rVert} \tag{3}$$
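The mean-pooling and similarity steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the token vectors are hypothetical 3-dimensional stand-ins for SBERT word representations, cosine similarity is assumed as the similarity function, and the names `mean_pool`, `cosine_similarity`, `v_p`, and `v_r` are chosen here for illustration only.

```python
import math


def mean_pool(token_embeddings):
    """Average a list of token vectors dimension by dimension."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[j] for vec in token_embeddings) / n for j in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical token embeddings (a real SBERT model would produce these).
generated_tokens = [[1.0, 0.0, 1.0], [3.0, 0.0, 1.0]]
reference_tokens = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]

v_p = mean_pool(generated_tokens)   # pooled generated-summary vector
v_r = mean_pool(reference_tokens)   # pooled reference-summary vector
s_pr = cosine_similarity(v_p, v_r)  # similarity between the two summaries

print("v_p =", v_p)
print("v_r =", v_r)
print("s(p, r) =", round(s_pr, 4))
```

Because the score depends only on the direction of the pooled vectors, two summaries of different lengths can still receive a high similarity when their averaged representations point the same way.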
Recall that it is important to consider factual consistency with the source document, and, given the same document, the method of summarizing important information varies from person to person [owczarzak2012assessing, kryscinski2019neural]. Therefore, the source document should also be considered with the generated summary when evaluating the summarization model.
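Given a document, $D = [w_1, \ldots, w_k]$, the document representation, $v_d$, can be obtained using Eqs. (1) and (2). Thus, the similarity score between the generated-summary representation $v_p$ and $v_d$ can be defined as
$$s(p, d) = \cos(v_p, v_d) = \frac{v_p^{\top} v_d}{\lVert v_p \rVert \, \lVert v_d \rVert} \tag{4}$$
Given a reference and source document, the reference-document-aware semantic score (RDASS) of the generated summary is defined by averaging the reference similarity $s(p, r)$ and the document similarity $s(p, d)$:
$$\mathrm{RDASS} = \frac{s(p, r) + s(p, d)}{2} \tag{5}$$
We also experimented with sum, max, and min operations between $s(p, r)$ and $s(p, d)$, but averaging the two scores yielded the highest correlation with human judgment.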
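As a rough sketch of how the averaged score and the alternative combinations mentioned above could be computed, the snippet below takes three already-pooled vectors and combines the two similarity scores. The 2-dimensional vectors, the use of cosine similarity, and the `combine` parameter are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rdass(v_p, v_r, v_d, combine="mean"):
    """Combine the reference and document similarities of a generated summary.

    "mean" is the averaged score; "sum", "max", and "min" are the
    alternative operations mentioned in the text.
    """
    s_pr = cosine_similarity(v_p, v_r)  # generated vs. reference summary
    s_pd = cosine_similarity(v_p, v_d)  # generated summary vs. document
    if combine == "mean":
        return (s_pr + s_pd) / 2.0
    if combine == "sum":
        return s_pr + s_pd
    if combine == "max":
        return max(s_pr, s_pd)
    if combine == "min":
        return min(s_pr, s_pd)
    raise ValueError(f"unknown combine mode: {combine}")


# Hypothetical embeddings chosen so that the scores are easy to verify by hand.
v_p = [1.0, 0.0]  # generated summary
v_r = [1.0, 0.0]  # reference summary: same direction as v_p
v_d = [0.0, 1.0]  # document: orthogonal to v_p

for mode in ("mean", "sum", "max", "min"):
    print(mode, rdass(v_p, v_r, v_d, combine=mode))
```

Averaging keeps the combined score on the same scale as each individual similarity, which makes it directly comparable with the two component scores.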
3.2 Fine-tuning SBERT with the Abstractive Summarization Model
SBERT is a trainable metric model. Thus, it can be further trained to capture more contextualized information about the reference summary and source document. We propose a fine-tuning method for SBERT that uses the abstractive summarization model.
Most neural approaches for abstractive summarization are based on an encoder–decoder architecture [see2017get]. Formally, given a document, $D = [w_1, \ldots, w_k]$, the objective is to generate a summary, $y_p = [w_1, \ldots, w_n]$, from a hidden representation, $h_p = [h_1, \ldots, h_n]$. The hidden representation is the output vector of the decoder. We leverage the hidden representation of the decoder to fine-tune the SBERT.
Following [reimers2019sentence], we adopt a triplet objective to fine-tune the SBERT. Given an anchor $h_p$, a positive reference representation $v_r^{+}$, a negative representation $v_r^{-}$, and a Euclidean distance $d(\cdot, \cdot)$, the triplet objective for generated and reference summaries is then defined as
$$J(p, r) = \max\left(d(h_p, v_r^{+}) - d(h_p, v_r^{-}) + \epsilon,\ 0\right) \tag{6}$$
where $\epsilon$ represents a margin that ensures $h_p$ is closer to $v_r^{+}$ than $v_r^{-}$. We set $\epsilon$ as . Similarly, the triplet objective for generated summary and document can be defined as
$$J(p, d) = \max\left(d(h_p, v_d^{+}) - d(h_p, v_d^{-}) + \epsilon,\ 0\right) \tag{7}$$
Thus, the final objective for SBERT is to minimize the combination of the two triplet objectives:
$$J = J(p, r) + J(p, d) \tag{8}$$
The objective function, $J$, of SBERT is jointly optimized with the abstractive summarization objective. Usually, the negative log-likelihood (NLL) objective between the generated and reference summaries is used for abstractive summarization [see2017get, narayan2018don]. We refer to the SBERT fine-tuned with the abstractive summarization model as “FWA-SBERT.”
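The combined triplet objective can be sketched on plain vectors as follows. This is an illustration under assumptions, not the authors' training code: the anchor is treated as a single pooled vector, the function and argument names are hypothetical, and the margin is passed as a parameter whose value in the example (1.0) is illustrative only.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin):
    """Hinge loss that pushes the anchor closer to `positive` than to `negative`."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


def combined_objective(h_p, v_r_pos, v_r_neg, v_d_pos, v_d_neg, margin):
    """Sum of the reference-side and document-side triplet losses."""
    j_pr = triplet_loss(h_p, v_r_pos, v_r_neg, margin)  # generated vs. reference
    j_pd = triplet_loss(h_p, v_d_pos, v_d_neg, margin)  # generated vs. document
    return j_pr + j_pd


# Hypothetical 2-dimensional vectors; the margin value is illustrative.
anchor = [0.0, 0.0]
near = [3.0, 4.0]   # distance 5 from the anchor
far = [6.0, 8.0]    # distance 10 from the anchor

print(triplet_loss(anchor, near, far, margin=1.0))  # positive already closer
print(triplet_loss(anchor, far, near, margin=1.0))  # positive farther away
print(combined_objective(anchor, near, far, far, near, margin=1.0))
```

The loss is zero once the positive example is closer to the anchor than the negative example by at least the margin, so only triplets that violate this ordering contribute to training.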
4 Experimental Setup
We trained and evaluated our models using the Korean Daum/News dataset (https://media.daum.net/), comprising 10 topics, such as politics, economy, international, culture, information technology, and others. From this, we extracted 3 million news articles. The number of articles for training, validating, and testing was , , and respectively. We refer to this dataset as “Daum/News.” We used Daum/News so that we could fully understand the content of the articles and conduct a proper evaluation. The dataset contains articles from 143 newspapers, each having different summary styles, and the effectiveness of the proposed methods is exemplified using it. Therefore, we expect that our research can be applied to different languages.
4.2 Summarization Model
We adopted the abstractive summarization model of [liu2019text] (https://github.com/nlpyang/PreSumm). Liu and Lapata [liu2019text] leveraged pre-trained BERT as an encoder and a six-layered transformer as a decoder, showing state-of-the-art results on the CNN/DailyMail [hermann2015teaching], New York Times [sandhaus2008new], and XSum [narayan2018don] datasets. We configured all settings according to [liu2019text], except that we leveraged the BERT pre-trained on a Korean dataset (Subsection 4.3) instead of english-bert-base-uncased. We trained the abstractive summarization model on the Korean Daum/News dataset.
To leverage SBERT, we first pre-trained BERT (bert-base-uncased) on a Korean dataset, comprising sentences and documents, including Wiki, Sejong corpus, and web documents. Next, we trained SBERT with classification and regression objectives on NLI [bowman2015large, williams2017broad] and the semantic textual similarity (STS) benchmark (STSb) [cer2017semeval]. Because the NLI and STSb datasets are in English, we leveraged the Korean NLI and STS datasets (https://github.com/kakaobrain/KorNLUDatasets) [ham2020kornli], which were translated using Kakao Machine Translator (https://translate.kakao.com). Evaluation of the STS benchmark test dataset was conducted, showing an Spearman’s rank correlation result. Subsequently, the pre-trained SBERT model was fine-tuned with the abstractive summarization model to capture more contextualized information of the reference summary and source document with a generated summary (Subsection 3.2). All training was conducted on the Kakao Brain Cloud with 4 Tesla V100 graphics processing units.
4.4 Human Judgment
To demonstrate the effectiveness of the reference-document-aware semantic metric, we evaluated its correlation with human judgment. Following [kryscinski2019neural], we asked annotators to score relevance, consistency, and fluency. Relevance represents the degree of appropriateness of the document, consistency represents the degree of factualness, and fluency represents the degree of the quality of the generated summary. Additionally, human avg represents the average value of the scores for the three indicators. Given a document, reference summary, and generated summary, each annotator scored in the range of to points for each evaluation indicator (i.e., relevance, consistency, fluency). The human judgment was conducted by judges having a PhD (3 judges) or an MS (3 judges) degree in computer science. The averaged human score of relevance was , consistency was , and fluency was for sampled summaries from the Korean Daum/News test dataset.
In this section, we first report the performance of the summarization model using the ROUGE and proposed evaluation metrics (Subsection 3.1). Next, we report how the proposed evaluation metrics correlated to human judgment. We also report the correlation of the proposed evaluation metrics to ROUGE to show that the proposed methods complement ROUGE. Finally, through qualitative evaluation, we demonstrate the limitations of ROUGE and the superiority of the proposed evaluation metrics.
5.1 Performance of the Summarization Model
The abstractive summarization model is based on the neural architecture of [liu2019text]. We trained the summarization model on the Daum/News dataset. To evaluate the summarization model, we used ROUGE and the proposed evaluation metrics. The fine-tuned FWA-SBERT was then used to compute the proposed semantic scores (the reference similarity $s(p, r)$, the document similarity $s(p, d)$, and RDASS). Table 2 shows the performance of the summarization model with baseline methods (Reference Summary, Lead-1, and Lead-3) on the Daum/News dataset.
We set the reference summary as the upper bound. In the case of the reference summary, the reporter tends to use implicit words when summarizing the document, so the $s(p, d)$ score (similarity with the document) is relatively low compared to the Lead baselines. However, because the $s(p, r)$ score (similarity with the reference summary) is 1.00, the reference summary shows the highest RDASS score. For Lead-1, $s(p, r)$ is higher than $s(p, d)$, and for Lead-3, $s(p, d)$ is higher than $s(p, r)$. The reason is that Lead-3 contains more sentences from the document, so its similarity with the reference summary is lower, but its similarity with the document is higher. The ROUGE scores of the Lead baselines are relatively low compared with those reported in studies conducted on English datasets [kryscinski2019neural]. The reason is that, in Korean, the same semantic meaning can be expressed in different surface forms because of the agglutinative nature of the language. A detailed example of this is described in Table 5 below. However, it can be seen that the RDASS score of the Lead baselines is similar to that of the reference summary. Through this, we can confirm that the proposed evaluation method reflects the semantic meaning of the reference summary and document well. The model of [liu2019text] shows higher similarity with the reference summary than the Lead baselines, but since it is a generation model, it does not extract sentences from the document as the Lead baselines do. As a result, it shows a relatively low $s(p, d)$ score. We describe how these results are correlated with human judgment in the next section.
5.2 Correlation with Human Judgment
Figures (0(a)) and (0(b)) show the Pearson correlation and Kendall rank correlation, respectively, of the proposed evaluation metrics with human judgment on the 200 sampled summaries. The Pearson correlation measures whether two variables are linearly related, where 1 indicates a perfect positive linear correlation and -1 indicates a perfect negative linear correlation. The Kendall rank correlation measures the rank correlation of two variables, where 1 indicates that the two rankings agree completely and -1 indicates that they are completely reversed. Both correlation measures are widely used in summarization tasks to analyze correlation with human judgment.
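The two correlation measures can be computed as in the following minimal sketch. It uses only the standard library; the Kendall function implements the simple tau-a variant without tie correction, which is an assumption, since the text does not state which variant was used. The metric and human scores are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical scores for four summaries.
metric_scores = [0.2, 0.4, 0.6, 0.8]
human_scores = [1.0, 2.0, 3.0, 4.0]

print("Pearson:", round(pearson(metric_scores, human_scores), 4))
print("Kendall:", kendall_tau_a(metric_scores, human_scores))
print("Kendall (reversed):", kendall_tau_a(metric_scores, human_scores[::-1]))
```

Pearson is sensitive to the actual score values, whereas Kendall depends only on how the two score lists order the summaries, which is why reporting both gives complementary views of agreement with human judgment.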
In the Pearson correlation matrix, the correlation with human judgment was significantly higher for the proposed evaluation metrics than for ROUGE scores. Additionally, in the Kendall rank matrix, the proposed evaluation metrics showed a higher correlation with human judgment than did the ROUGE scores. Among the proposed evaluation metrics, showed higher performance than and RDASS showed the highest correlation with human judgment. These results indicate that the proposed evaluation metrics can reflect deep semantic meaning, overcoming the limitations of ROUGE, which is based on n-gram overlap.
| Sentence Representation | Relevance | Consistency | Fluency | Human Avg |
To demonstrate the effectiveness of fine-tuning SBERT with an abstractive summarization model, we set baseline methods depending on which sentence representation methods to use for the proposed methods (Subsection 3.1) as follows:
Multilingual Universal Sentence Encoder (MUSE): MUSE [yang2019multilingual] is a multilingual sentence encoder that embeds text from 16 languages into a single semantic space using multi-task learning. This model was trained on more than 1 billion question-answer pairs and showed competitive state-of-the-art results on semantic retrieval [gillick2018end], bitext retrieval [ziemski2016united], and retrieval question-answering [yang2019multilingual].
Pre-trained SBERT: We only leveraged pre-trained SBERT without fine-tuning. We refer to this as “P-SBERT.”
Table 3 shows the performance comparison depending on which sentence representation was used. P-SBERT shows a higher correlation coefficient with human judgment than MUSE. Overall, FWA-SBERT showed the highest correlation with human judgment.
Through quantitative evaluation, we demonstrated that the proposed evaluation metrics had a high correlation with human judgment and that the method of fine-tuning SBERT improved the performance of the proposed evaluation metrics.
We also examined how the evaluation metrics correlated with each other. As shown in Table 4, there was a high correlation among the ROUGE metrics. However, the proposed evaluation metrics had a relatively low correlation with ROUGE. This indicates that, in our case, the proposed evaluation metrics reflected semantic meaning that ROUGE could not. Thus, they complement the ROUGE metrics.
5.3 Qualitative Analysis
In this section, through qualitative analysis, we demonstrate the effectiveness of our evaluation metrics. Table 5 shows ROUGE, RDASS and human evaluation results for the generated summaries for the two articles.
In article-1, the generated summary “On the 30th birthday of Messi, he had a good time with his family” has the same semantic meaning as the reference summary “Messi’s 30th birthday with his wife and son”. However, since a sentence with the same semantic meaning can be expressed in various ways in Korean, which is an agglutinative language, the ROUGE score is low while the human evaluation scores are high. Likewise, the generated summary “Samsung Electronics launches new ‘qled tv’ in Brazil” in article-2 has the same semantic meaning as the reference summary “Samsung Electronics launches ‘qled tv’ in Brazil, the largest market in Latin America”. The generated summary in both articles is correct, but the ROUGE score is low. On the other hand, the RDASS score is higher, indicating that the generated summary is correct.
In this paper, we pointed out the limitation of the widely used ROUGE evaluation metric when applied to Korean summarization. Since Korean is an agglutinative language, a generated summary having the same semantic meaning as the reference summary can be expressed in various ways. Therefore, relying only on the ROUGE metric can produce inaccurate evaluation results. To overcome this limitation, we proposed the RDASS (Reference and Document Aware Semantic Score) evaluation metric. RDASS can reflect deep semantic relationships among the generated summary, the reference summary, and the document. Through extensive evaluations, we demonstrated that the correlation with human judgment is higher for the proposed evaluation metric (RDASS) than for ROUGE scores. In future work, we will demonstrate the effectiveness of the proposed method on English summarization datasets.