Reference and Document Aware Semantic Evaluation Methods for Korean Language Summarization

04/29/2020 · Dongyub Lee, et al. · Kakao Corp.

Text summarization refers to the process of generating a shorter form of text from a source document while preserving its salient information. Recently, many models for text summarization have been proposed, and most of them are evaluated using recall-oriented understudy for gisting evaluation (ROUGE) scores. However, because ROUGE scores are computed from n-gram overlap, they do not reflect the semantic correspondence between generated and reference summaries. Because Korean is an agglutinative language that combines various morphemes into a word to express several meanings, ROUGE is not suitable for Korean summarization. In this paper, we propose evaluation metrics that reflect the semantic meaning of both the reference summary and the original document: the Reference and Document Aware Semantic Score (RDASS). We then propose a method for improving the correlation of the metrics with human judgment. Evaluation results show that the correlation with human judgment is significantly higher for our evaluation metrics than for ROUGE scores.


1 Introduction

The task of text summarization is to generate a summary that conveys all the salient information of an original document. There are two strategies for this task. With the extractive approach, the most noticeable key sentences are extracted from the source and compiled into a summary [zhong2019searching, wang2019self, xiao2019extractive]. The second approach is abstractive, in which a paraphrased summary is generated from the source [zhang2018abstractiveness, guo2018soft, wenbo2019concept]. The generated summary may not contain the same words that appear in the source document. Therefore, measuring factual alignment between the generated summary and the source document is important [kryscinski2019neural].

Most summarization models are evaluated using recall-oriented understudy for gisting evaluation (ROUGE) [lin2004rouge], which measures n-gram overlap between generated and reference summaries. ROUGE has been shown to correlate highly with manual evaluation methods such as Pyramid [nenkova2007pyramid] and TAC AESOP [owczarzak2011overview]. However, Louis and Nenkova [louis2013automatically] showed that this correlation decreases significantly when only one reference summary is provided. Moreover, considering how a person manually summarizes a document, ROUGE is limited because it does not reflect the semantic correspondence between generated and reference summaries: when people summarize a document, they tend to use implicit words rather than always reusing the explicit words of the original. Because the ROUGE score is computed from n-gram overlap, even two words with the same semantic meaning can yield a low score. Table 1 shows an example of this limitation of ROUGE when applied to Korean summarization. The problem is particularly prevalent in Korean which, unlike English, is an agglutinative language that combines various morphemes into a word to express several meanings and grammatical functions, so complex morphological variations can occur. Therefore, relying on ROUGE scores produces inaccurate results.

Article: ‘슬기로운 의사생활’이 또다시 최고 시청률을 경신하며 고공행진을 이어갔다. 26일 방송된 tvN 2020 목요 스페셜 ‘슬기로운 의사생활’ 3회는 케이블, IPTV, 위성을 통합한 유료플랫폼에서 가구 평균 8.6%, 최고 10%의 시청률을 기록했다. 3주 연속 시청률 상승세다.
(The TV program "Sage Doctor Life" again broke its own record for highest viewer ratings. The third episode of the tvN 2020 Thursday Special "Sage Doctor Life," aired on the 26th, recorded an average household rating of 8.6% and a peak of 10% on the paid platform incorporating cable, IPTV, and satellite. The ratings have risen for three consecutive weeks.)
Reference Summary: ‘슬기로운 의사생활’ 최고 시청률 10% 돌파… 3회 연속 상승
("Sage Doctor Life" breaks its all-time-high 10% viewer rating, rising for 3 consecutive episodes.)
Wrong Candidate: ‘슬기로운 의사생활’ 최저 시청률 10% 돌파… 3회 연속 하락
("Sage Doctor Life" reaches its lowest viewer rating of 10%, falling 3 times in a row.)
ROUGE scores with Reference Summary (R-1/R-2/R-L): 0.78 / 0.63 / 0.78
Ours (RDASS): 0.44
Correct Candidate: ‘슬기로운 의사생활’ 최고 시청률 경신… 3주 연속 상승
("Sage Doctor Life" sets a new record for its highest viewer ratings, rising for 3 consecutive weeks.)
ROUGE scores with Reference Summary (R-1/R-2/R-L): 0.71 / 0.53 / 0.71
Ours (RDASS): 0.56

Table 1: An example showing the limitations of ROUGE in Korean summarization. The incorrectly generated summary has a high ROUGE score but the opposite semantic meaning. Text areas marked in blue and red in the original serve as indicators for distinguishing the factualness of the semantic comparisons reflected by our metric.
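
To make the n-gram-overlap computation concrete, the snippet below scores the wrong candidate from Table 1 against the reference summary with the open-source rouge-score package. This is a minimal sketch, not the paper's setup: the paper does not name its ROUGE implementation or tokenization (morpheme-level tokenization is common for Korean), and the whitespace tokenizer here is an assumption, so the numbers will not exactly reproduce Table 1.

    from rouge_score import rouge_scorer

    class WhitespaceTokenizer:
        # rouge-score's default tokenizer keeps only alphanumeric tokens and
        # would drop Korean characters, so we split on whitespace instead
        # (an assumption; the paper does not specify its tokenization).
        def tokenize(self, text):
            return text.split()

    scorer = rouge_scorer.RougeScorer(
        ["rouge1", "rouge2", "rougeL"], tokenizer=WhitespaceTokenizer()
    )
    reference = "'슬기로운 의사생활' 최고 시청률 10% 돌파… 3회 연속 상승"
    wrong_candidate = "'슬기로운 의사생활' 최저 시청률 10% 돌파… 3회 연속 하락"

    # High n-gram overlap yields a high score despite the opposite meaning.
    for name, s in scorer.score(reference, wrong_candidate).items():
        print(name, round(s.fmeasure, 2))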

To overcome this limitation, an evaluation method that considers the semantic information of both the generated and reference summaries is required. It is also important to examine the factuality of the generated summary with respect to the source document, because the generated summary may contain false information. Each person summarizes information differently, and it is difficult to reach agreement even after cross-checking [kryscinski2019neural]. Therefore, the source document should be considered together with the generated and reference summaries.

In this study, we propose metrics for evaluating a summarization model that consider both the source document and reference summary together with the generated summary (see Table 1). Our contributions can be summarized as follows:

  • We propose evaluation metrics that evaluate a summarization model using deep semantic information.

  • We propose methods to improve the correlation between the proposed evaluation metrics and human judgment.

  • Via extensive evaluation, we demonstrate that the correlation with human judgment is significantly higher for our proposed evaluation metrics than for ROUGE scores.

2 Related Work

Evaluation methods for text summarization divide into two strategies: manual and automatic. Manual evaluation is expensive and difficult [nenkova2004evaluating, passonneau2013automated]. Several studies have therefore developed automatic methods that facilitate fast, low-cost evaluation. There are two types of automatic evaluation: extrinsic and intrinsic. Extrinsic methods evaluate a summarization model by how it affects the completion of downstream tasks, such as judging document relevance [dorr2004extrinsic]. Intrinsic methods evaluate quality via a property analysis or by calculating the similarity of the output to a manually generated summary. Intrinsic methods include the Pyramid method [nenkova2007pyramid], the basic-elements method [hovy2006automated], and ROUGE [lin2004rouge]. The Pyramid method inspects various human-made summaries and creates summary content units, each with a scoring weight. The basic-elements method is similar to the Pyramid method. ROUGE evaluates the lexical overlap between the candidate and reference summaries.

Because the ROUGE score is computed from n-gram overlap, it does not account for synonymous words or phrases. Many approaches have been proposed to overcome this limitation: ParaEval [zhou2006paraeval], ROUGE-WE [ng2015better], ROUGE 2.0 [ganesan2018rouge], and ROUGE-G [shafieibavani2018graph] extend ROUGE to support synonymous constructs. ParaEval uses a matching method based on paraphrase tables. ROUGE-WE uses a lexical matching method with a semantic similarity measure based on the cosine distance between tokens. ROUGE 2.0 uses WordNet as a synonym dictionary and computes token overlaps with all synonyms of matched words. ROUGE-G uses lexical and semantic matching from WordNet. These approaches are limited because they require hand-crafted lexical and synonym dictionaries, which are particularly difficult to construct for Korean. Our research differs in three respects: 1) we propose a method that evaluates the generated summary by considering the document as well as the reference summary; 2) our evaluation model is robust to out-of-vocabulary (OOV) words because it leverages a pre-trained neural network (SBERT) built on the byte pair encoding (BPE) [gage1994new] tokenization method learned without supervision, which is especially important given that Korean is an agglutinative language; and 3) our evaluation model can be further trained to capture more contextualized information about both the reference summary and the document.

Text summarization models can be divided into abstractive, extractive, and hybrid approaches. Abstractive models reword phrases and create summaries with novel phrases constructed from the original document. Recent abstractive approaches have leveraged multi-task and multi-reward training [jiang2018closed, paulus2017deep, pasunuru2018multi, guo2018soft], attention-with-copying mechanisms [tan2017abstractive, see2017get, cohan2018discourse], and unsupervised training strategies [schumann2018unsupervised, chu2018unsupervised]. Extractive methods extract the most suitable sentences (or words) from the source document and copy them directly into the summary. Many researchers [neto2002automatic, colmenares2015heads, filippova2013overcoming] have utilized domain expertise to develop heuristics for refining summary texts. Recently, neural text summarization models have been proposed that train a model to predict whether a span of text should be included in the summary [nallapati2016classify, narayan2017neural, xu2019neural, liu2019comes]. Reinforcement learning-based summarization models have also been proposed to directly optimize models [wu2018learning, dong2018banditsum, narayan2018ranking]. The hybrid approach uses both abstractive and extractive methods, dividing the summarization process into two phases: content selection and paraphrasing [gehrmann2018bottom, hsu2018unified, chen2018fast, liu2018generating].

3 Methodology

From Table 1, we can observe the importance of considering both the document and the reference summary for proper evaluation of a summarization model. In Subsection 3.1, we propose a method for evaluating the generated summary against the reference summary in a way that reflects deep semantic meaning, and then a method for evaluating the generated summary against the original document and reference summary together. The resulting reference-and-document-aware evaluation model can be further trained to capture more contextualized information from both the reference summary and the document (Subsection 3.2).

3.1 Reference and Document Aware Semantic Evaluation

Let us define the generated summary from the summarization model as $y' = \{w'_1, w'_2, \ldots, w'_n\}$ and the reference summary as $y = \{w_1, w_2, \ldots, w_m\}$, where each $w$ indicates a word. Then, the summary representations $v_{y'}$ and $v_y$ can be constructed using sentence-embedding methods. Neural sentence-embedding methods have been studied broadly. Conneau et al. [conneau2017supervised] trained a Siamese bidirectional long short-term memory model with a max-pooling strategy on the Stanford Natural Language Inference (SNLI) corpus [bowman2015large] and the Multi-Genre Natural Language Inference (NLI) dataset [williams2017broad]. Cer et al. [cer2018universal] proposed the universal sentence encoder, training a transformer on the SNLI dataset. Reimers and Gurevych [reimers2019sentence] recently proposed Sentence-BERT (SBERT), which leverages a pre-trained BERT [devlin2018bert] trained on a combination of the SNLI and Multi-Genre NLI datasets and shows state-of-the-art sentence-embedding performance. SBERT is suitable for semantic similarity search and shows faster inference than previous state-of-the-art approaches, including BERT, RoBERTa [liu2019roberta], and the universal sentence encoder.

We leverage a pre-trained SBERT to construct the summary representations. Each word representation $e_{w_i}$ is obtained from SBERT as

$e_{w_i} = \mathrm{SBERT}(w_i)$  (1)

Subsequently, mean-pooling is performed to construct $v_y$ as

$v_y^{(k)} = \frac{1}{m} \sum_{i=1}^{m} e_{w_i}^{(k)}$  (2)

where $k$ represents an index of the word-embedding dimension and $m$ represents the length of $y$. $v_{y'}$ can be obtained in the same manner.

The semantic similarity score $s(y, y')$ between $v_y$ and $v_{y'}$ can be obtained as follows:

$s(y, y') = \frac{v_y \cdot v_{y'}}{\lVert v_y \rVert \, \lVert v_{y'} \rVert}$  (3)
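
As a concrete illustration of Eqs. (1)-(3), the following numpy sketch mean-pools token embeddings into summary vectors and compares them with cosine similarity. The encoder output is simulated with random vectors; the shapes and the stand-in embeddings are illustrative assumptions, not the paper's actual interface.

    import numpy as np

    def mean_pool(token_embeddings: np.ndarray) -> np.ndarray:
        # Eq. (2): average over tokens for every embedding dimension k.
        return token_embeddings.mean(axis=0)

    def cosine(u: np.ndarray, v: np.ndarray) -> float:
        # Eq. (3): s = (u . v) / (||u|| ||v||).
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Toy stand-ins for SBERT token embeddings of a reference summary y
    # (12 tokens) and a generated summary y' (9 tokens), 768-dim (BERT-base).
    rng = np.random.default_rng(0)
    e_y, e_yp = rng.normal(size=(12, 768)), rng.normal(size=(9, 768))

    v_y, v_yp = mean_pool(e_y), mean_pool(e_yp)
    s_y_yp = cosine(v_y, v_yp)  # the semantic similarity score s(y, y')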

Recall that it is important to consider factual consistency with the source document, and, given the same document, the method of summarizing important information varies from person to person [owczarzak2012assessing, kryscinski2019neural]. Therefore, the source document should also be considered with the generated summary when evaluating the summarization model.

Given a document $d$, the document representation $v_d$ can be obtained using Eqs. (1) and (2). The similarity score between $v_d$ and $v_{y'}$ can then be defined as

$s(d, y') = \frac{v_d \cdot v_{y'}}{\lVert v_d \rVert \, \lVert v_{y'} \rVert}$  (4)

Given the reference summary and the source document, the reference-and-document-aware semantic score (RDASS) of the generated summary is defined by averaging $s(y, y')$ and $s(d, y')$:

$\mathrm{RDASS} = \frac{s(y, y') + s(d, y')}{2}$  (5)

We also experimented with sum, max, and min operations between $s(y, y')$ and $s(d, y')$, but averaging the two scores yields the highest correlation with human judgment.
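
Putting Eqs. (3)-(5) together, RDASS can be computed with any SBERT-style encoder. The sketch below uses the sentence-transformers library; the model name is a placeholder (the paper uses an SBERT pre-trained on Korean corpora, which is not publicly named here), so treat this as an assumption-laden sketch rather than the authors' implementation.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Placeholder checkpoint; the paper uses a Korean-pretrained SBERT instead.
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    def rdass(generated: str, reference: str, document: str) -> float:
        # One encoder pass for y', y, and d; Eqs. (1)-(2) happen inside SBERT.
        v_gen, v_ref, v_doc = model.encode([generated, reference, document])
        def cos(u, v):
            return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
        # Eq. (5): average of s(y, y') and s(d, y').
        return 0.5 * (cos(v_gen, v_ref) + cos(v_gen, v_doc))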

3.2 Fine-tuning SBERT with the Abstractive Summarization Model

SBERT is a trainable metric model. Thus, it can be further trained to capture more contextualized information about the reference summary and source document. We propose a fine-tuning method for SBERT that uses the abstractive summarization model.

Most neural approaches to abstractive summarization are based on an encoder-decoder architecture [see2017get]. Formally, given a document $d$, the objective is to generate a summary $y' = \{w'_1, w'_2, \ldots, w'_n\}$ from a hidden representation $h = \{h_1, h_2, \ldots, h_n\}$, where each $h_i$ is an output vector of the decoder. We leverage this hidden representation of the decoder to fine-tune SBERT.

Following [reimers2019sentence], we adopt a triplet objective to fine-tune SBERT. Given an anchor $v_{y'}$ (the representation of the generated summary), a positive reference representation $v_y$, a negative representation $v_{\bar{y}}$, and the Euclidean distance $\lVert \cdot \rVert$, the triplet objective for the generated and reference summaries is defined as

$L_{ref} = \max\left( \lVert v_{y'} - v_y \rVert - \lVert v_{y'} - v_{\bar{y}} \rVert + \epsilon,\ 0 \right)$  (6)

where $\epsilon$ represents a margin that ensures $v_y$ is closer to $v_{y'}$ than $v_{\bar{y}}$ is. We set the margin $\epsilon$ to 1, as in [reimers2019sentence]. Similarly, with a negative document representation $v_{\bar{d}}$, the triplet objective for the generated summary and document can be defined as

$L_{doc} = \max\left( \lVert v_{y'} - v_d \rVert - \lVert v_{y'} - v_{\bar{d}} \rVert + \epsilon,\ 0 \right)$  (7)

Thus, the final objective for SBERT is to minimize the combination of the two triplet objectives:

$L_{SBERT} = L_{ref} + L_{doc}$  (8)

The objective function $L_{SBERT}$ of SBERT is jointly optimized with the abstractive summarization objective; usually, the negative log-likelihood (NLL) objective between the generated and reference summaries is used for abstractive summarization [see2017get, narayan2018don]. We refer to the SBERT fine-tuned with the abstractive summarization model as "FWA-SBERT."
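
A minimal PyTorch sketch of the combined triplet objective in Eqs. (6)-(8) is shown below. How the negative representations are drawn (here, generic v_ref_neg and v_doc_neg tensors, e.g., sampled from other articles) is an assumption; in training, this loss would be added to the NLL summarization objective.

    import torch

    def triplet(anchor: torch.Tensor, positive: torch.Tensor,
                negative: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
        # max(||a - p|| - ||a - n|| + margin, 0) with Euclidean distance.
        d_pos = torch.norm(anchor - positive, dim=-1)
        d_neg = torch.norm(anchor - negative, dim=-1)
        return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

    def sbert_loss(v_gen, v_ref, v_ref_neg, v_doc, v_doc_neg):
        loss_ref = triplet(v_gen, v_ref, v_ref_neg)  # Eq. (6)
        loss_doc = triplet(v_gen, v_doc, v_doc_neg)  # Eq. (7)
        return loss_ref + loss_doc                   # Eq. (8), added to NLL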

4 Experimental Setup

4.1 Dataset

We trained and evaluated our models using the Korean Daum/News dataset (https://media.daum.net/), which comprises 10 topics, such as politics, economy, international, culture, and information technology. From this source, we extracted 3 million news articles and split them into training, validation, and test sets. We refer to this dataset as "Daum/News." We used a Korean dataset so that the annotators could fully understand the content of the articles and conduct a proper evaluation. The dataset contains articles from 143 newspapers, each with its own summary style, and the proposed methods are effective across these styles; we therefore expect that our research can be applied to other languages as well.

4.2 Summarization Model

We adopted the abstractive summarization model of Liu and Lapata [liu2019text] (https://github.com/nlpyang/PreSumm), which leverages a pre-trained BERT as the encoder and a six-layer transformer as the decoder and shows state-of-the-art results on the Cable News Network/DailyMail [hermann2015teaching], New York Times [sandhaus2008new], and XSum [narayan2018don] datasets. We followed the settings of [liu2019text], except that we leveraged a BERT pre-trained on a Korean dataset (Subsection 4.3) instead of english-bert-base-uncased, and trained the abstractive summarization model on the Daum/News dataset.

4.3 SBERT

To leverage SBERT, we first pre-trained BERT (bert-base-uncased configuration) on a Korean dataset comprising sentences and documents from Wiki, the Sejong corpus, and web documents. Next, we trained SBERT with classification and regression objectives on NLI [bowman2015large, williams2017broad] and the semantic textual similarity (STS) benchmark (STSb) [cer2017semeval]. Because the NLI and STSb datasets are in English, we leveraged the Korean NLI and STS datasets (https://github.com/kakaobrain/KorNLUDatasets) [ham2020kornli], which were translated using the Kakao Machine Translator (https://translate.kakao.com). We evaluated the model on the STS benchmark test set using Spearman's rank correlation. Subsequently, the pre-trained SBERT model was fine-tuned with the abstractive summarization model to capture more contextualized information about the reference summary and source document together with the generated summary (Subsection 3.2). All training was conducted on the Kakao Brain Cloud with 4 Tesla V100 graphics processing units.
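
For reference, the regression (STS) objective described above maps onto the standard sentence-transformers training loop roughly as follows. The checkpoint path and training pair are placeholders, and the NLI classification objective would use losses.SoftmaxLoss analogously; this is a sketch of the library's usage, not the authors' exact training script.

    from torch.utils.data import DataLoader
    from sentence_transformers import (SentenceTransformer, InputExample,
                                       losses, models)

    # Build SBERT from a Korean-pretrained BERT checkpoint (placeholder path).
    word_model = models.Transformer("path/to/korean-bert")
    pooling = models.Pooling(word_model.get_word_embedding_dimension())
    model = SentenceTransformer(modules=[word_model, pooling])

    # One KorSTS-style pair; gold similarity (0-5) is rescaled to [0, 1].
    train_examples = [
        InputExample(texts=["한 남자가 기타를 친다.", "남자가 악기를 연주한다."],
                     label=0.8),
    ]
    loader = DataLoader(train_examples, shuffle=True, batch_size=16)
    loss = losses.CosineSimilarityLoss(model)  # regression objective for STS

    model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)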

4.4 Human Judgment

To demonstrate the effectiveness of the reference-and-document-aware semantic metric, we evaluated its correlation with human judgment. Following [kryscinski2019neural], we asked annotators to score relevance, consistency, and fluency. Relevance represents the degree of appropriateness of the summary for the document, consistency represents the degree of factualness, and fluency represents the quality of the generated summary; human avg denotes the average of the three scores. Given a document, a reference summary, and a generated summary, each annotator scored each evaluation indicator (i.e., relevance, consistency, fluency) on a fixed point scale. The judgments were made by six judges holding a PhD (3 judges) or an MS (3 judges) degree in computer science, over 200 summaries sampled from the Korean Daum/News test dataset.

5 Results

In this section, we first report the performance of the summarization model using ROUGE and the proposed evaluation metrics (Subsection 3.1). Next, we report how the proposed evaluation metrics correlate with human judgment, along with their correlation with ROUGE, to show that the proposed methods complement ROUGE. Finally, through qualitative evaluation, we demonstrate the limitations of ROUGE and the advantages of the proposed evaluation metrics.

5.1 Performance of the Summarization Model

Model                    | s(y, y') | s(d, y') | RDASS | ROUGE-1 | ROUGE-2 | ROUGE-L
Reference Summary        | 1.00     | 0.55     | 0.78  | 1.00    | 1.00    | 1.00
Lead-1                   | 0.71     | 0.64     | 0.68  | 0.13    | 0.03    | 0.13
Lead-3                   | 0.66     | 0.79     | 0.73  | 0.07    | 0.01    | 0.07
BERTSUMABS [liu2019text] | 0.83     | 0.46     | 0.65  | 0.35    | 0.15    | 0.35

Table 2: Performance of the summarization model on the Daum/News dataset.

The abstractive summarization model is based on the neural architecture of [liu2019text]. We trained the summarization model on the Daum/News dataset and evaluated it with ROUGE and the proposed evaluation metrics. The fine-tuned FWA-SBERT was used to compute the proposed semantic scores (s(y, y'), s(d, y'), and RDASS). Table 2 shows the performance of the summarization model and the baseline methods (Reference Summary, Lead-1, and Lead-3) on the Daum/News dataset.

We set the reference summary as the upper bound. For the reference summary, reporters tend to use implicit words when summarizing a document, so s(d, y') is relatively low compared with the Lead baselines. However, because s(y, y') is 1.00, the reference summary shows the highest RDASS score. For Lead-1, s(y, y') is higher than s(d, y'), whereas for Lead-3, s(d, y') is higher than s(y, y'). The reason is that Lead-3 contains more sentences from the document, so its similarity with the reference summary decreases while its similarity with the document increases. The ROUGE performance of the lead baselines is relatively low compared with results reported for English datasets [kryscinski2019neural]. This is because, in Korean, the same semantic meaning can be expressed in different surface forms owing to the agglutinative nature of the language; a detailed example is given in Table 5 below. Nevertheless, the RDASS scores of the lead baselines are similar to that of the reference summary, confirming that the proposed evaluation method reflects the semantic meaning of the reference summary and document well. The model of [liu2019text] shows higher similarity with the reference summary than the Lead baselines, but because it is a generation model, it does not extract sentences directly from the document as the Lead baselines do; as a result, its document similarity score is relatively low. We describe how these results correlate with human judgment in the next section.

5.2 Correlation with Human Judgment

Figure 1: Pearson correlations and Kendall rank correlations of the proposed evaluation metrics with human judgment. (a) Pearson correlations; (b) Kendall rank correlations.

Figures 1(a) and 1(b) show the Pearson correlation and Kendall rank correlation, respectively, of the proposed evaluation metrics with human judgment on the 200 sampled summaries. The Pearson correlation measures whether two variables are linearly related, where 1 indicates perfect positive linear correlation and -1 perfect negative linear correlation. The Kendall rank correlation measures the ordinal association of two variables, where 1 indicates that the rankings agree and -1 that they are reversed. Both correlation measures are widely used in summarization tasks to analyze correlation with human judgment.
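
Concretely, each correlation in Figure 1 can be computed from paired per-summary scores, for example with scipy; the arrays below are illustrative stand-ins, not the study's data.

    from scipy.stats import pearsonr, kendalltau

    metric_scores = [0.44, 0.56, 0.81, 0.71]  # e.g., RDASS per sampled summary
    human_scores  = [2.0, 3.5, 4.3, 4.5]      # e.g., averaged judge ratings

    r, _ = pearsonr(metric_scores, human_scores)
    tau, _ = kendalltau(metric_scores, human_scores)
    print(f"Pearson r = {r:.2f}, Kendall tau = {tau:.2f}")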

In the Pearson correlation matrix, the correlation with human judgment was significantly higher for the proposed evaluation metrics than for the ROUGE scores. Likewise, in the Kendall rank matrix, the proposed evaluation metrics showed higher correlation with human judgment than the ROUGE scores did. Among the proposed evaluation metrics, s(y, y') showed higher correlation than s(d, y'), and RDASS showed the highest correlation with human judgment. These results indicate that the proposed evaluation metrics reflect deep semantic meaning, overcoming the limitations of ROUGE, which is based on n-gram overlap.

Sentence Representation | Metric   | Relevance (P/K) | Consistency (P/K) | Fluency (P/K) | Human Avg (P/K)
MUSE                    | s(y, y') | 0.29 / 0.19     | 0.18 / 0.10       | 0.22 / 0.08   | 0.25 / 0.13
                        | s(d, y') | 0.09 / 0.05     | 0.13 / 0.06       | 0.15 / 0.04   | 0.13 / 0.06
                        | RDASS    | 0.29 / 0.19     | 0.24 / 0.12       | 0.23 / 0.09   | 0.28 / 0.14
P-SBERT                 | s(y, y') | 0.34 / 0.22     | 0.27 / 0.17       | 0.25 / 0.09   | 0.32 / 0.18
                        | s(d, y') | 0.24 / 0.13     | 0.27 / 0.15       | 0.22 / 0.09   | 0.27 / 0.15
                        | RDASS    | 0.37 / 0.22     | 0.34 / 0.20       | 0.29 / 0.11   | 0.37 / 0.21
FWA-SBERT               | s(y, y') | 0.35 / 0.24     | 0.28 / 0.17       | 0.25 / 0.10   | 0.32 / 0.19
                        | s(d, y') | 0.26 / 0.13     | 0.28 / 0.15       | 0.24 / 0.09   | 0.29 / 0.15
                        | RDASS    | 0.39 / 0.24     | 0.36 / 0.21       | 0.29 / 0.12   | 0.38 / 0.22

Table 3: Performance comparison depending on which sentence representation was used (P = Pearson, K = Kendall).

To demonstrate the effectiveness of fine-tuning SBERT with the abstractive summarization model, we set baseline methods that differ in the sentence representation used for the proposed metrics (Subsection 3.1), as follows:

Multilingual Universal Sentence Encoder (MUSE): MUSE [yang2019multilingual] is a multilingual sentence encoder that embeds text from 16 languages into a single semantic space using multi-task learning. The model was trained on more than 1 billion question-answer pairs and showed competitive state-of-the-art results on semantic retrieval [gillick2018end], bitext retrieval [ziemski2016united], and retrieval question-answering [yang2019multilingual].

Pre-trained SBERT: We only leveraged pre-trained SBERT without fine-tuning. We refer to this as “P-SBERT.”

Table 3 shows the performance comparison depending on which sentence representation was used. P-SBERT shows a higher correlation with human judgment than MUSE. Overall, FWA-SBERT showed the closest correlation with human judgment.

Through quantitative evaluation, we demonstrated that the proposed evaluation metrics had a high correlation with human judgment and that the method of fine-tuning SBERT improved the performance of the proposed evaluation metrics.

We also experimented to understand how the evaluation metrics correlate with each other. As shown in Table 4, there is a high correlation among the ROUGE metrics. However, the proposed evaluation metrics have a relatively low correlation with ROUGE. This indicates that the proposed evaluation metrics capture semantic meaning that ROUGE cannot, and thus complement the ROUGE metrics.

         | ROUGE-1 | ROUGE-2 | ROUGE-L | s(y, y') | s(d, y') | RDASS
ROUGE-1  | 1.00    | 0.84    | 0.99    | 0.64     | 0.16     | 0.54
ROUGE-2  |         | 1.00    | 0.85    | 0.52     | 0.09     | 0.45
ROUGE-L  |         |         | 1.00    | 0.63     | 0.17     | 0.53
s(y, y') |         |         |         | 1.00     | 0.32     | 0.77
s(d, y') |         |         |         |          | 1.00     | 0.69
RDASS    |         |         |         |          |          | 1.00
Table 4: Pearson correlations of ROUGE and the proposed evaluation metrics.

5.3 Qualitative Analysis

In this section, through qualitative analysis, we demonstrate the effectiveness of our evaluation metrics. Table 5 shows ROUGE, RDASS, and human evaluation results for the summaries generated for two articles.

Article-1: 리오넬 메시(30·fc바르셀로나)가 자신의 서른 번째 생일을 가족과 함께 오붓하게 보냈다. 지난 24일 만 서른 살이 된 메시는 자신의 인스타그램에 집에서 가족들과 함께 보낸 생일상을 찍은 사진을 올렸다. 메시는 오랜 그의 여자친구이자, 이제 아내가 되는 안토넬라 로쿠조(29), 아들 티아고가 함께 다정하게 사진을 찍었다.
(Lionel Messi (30, FC Barcelona) spent his thirtieth birthday quietly with his family. Messi, who turned thirty on the 24th, posted on Instagram a picture of the birthday table he shared with his family at home. Messi took an affectionate photo together with his longtime girlfriend and wife-to-be Antonella Roccuzzo (29) and his son Thiago.)
Reference Summary: 메시가 30번째 생일 함께한 이는 아내와 아들
(Messi spent his 30th birthday with his wife and son.)
Generated Summary: 메시 30번째 생일, 가족과 함께 오붓하게 보내
(On his 30th birthday, Messi had a good time with his family.)
ROUGE (R-1/R-2/R-L): 0.14 / 0.00 / 0.14
RDASS: 0.81
Human Evaluation (relevance/consistency/fluency): 4.4 / 4.2 / 4.2

Article-2: 삼성전자는 19일(현지시간) 브라질 상파울루의 팔라시오 탕가라 호텔에서 ‘QLED TV 론칭 이벤트’를 열고 2017년형 QLED TV 라인업을 선보였다고 23일 밝혔다. 4월 중남미에서는 처음으로 멕시코에서 QLED TV를 출시한 뒤 파나마, 콜롬비아 등으로 확대하다 이번에 중남미 최대 시장인 브라질에 제품을 출시한 것이다. 브라질은 전체 중남미 TV 시장의 40%(금액 기준) 이상을 차지할 정도로 중요한 TV 시장이다. 올해 1∼4월 브라질 TV 시장은 작년 같은 기간보다 13%(수량 기준) 성장했고, 특히 프리미엄 TV 시장인 UHD(초고화질) TV는 작년보다 50% 이상 시장이 커졌다. 특히 삼성전자는 브라질 UHD TV 시장에서 올해 1∼4월 56%(수량 기준) 점유율로 압도적 1위를 차지했다.
(Samsung Electronics announced on the 23rd that it held a 'QLED TV launching event' at the Palacio Tangara Hotel in Sao Paulo, Brazil on the 19th (local time) and introduced its 2017 QLED TV lineup. In April it launched the QLED TV in Mexico, its first launch in Latin America, then expanded to Panama, Colombia, and elsewhere, and has now released the product in Brazil, the largest market in Latin America. Brazil is an important TV market, accounting for more than 40% (by value) of the total Latin American TV market. From January to April this year, the Brazilian TV market grew 13% (by volume) over the same period last year, and the premium UHD (ultra-high-definition) TV segment in particular grew more than 50%. Samsung Electronics took the dominant first-place position in the Brazilian UHD TV market with a 56% share (by volume) from January to April this year.)
Reference Summary: 삼성전자, 중남미 최대 시장 브라질에 qled tv 론칭
(Samsung Electronics launches the 'QLED TV' in Brazil, the largest market in Latin America.)
Generated Summary: 삼성전자, 브라질서 ‘qled tv’ 신제품 출시
(Samsung Electronics launches a new 'QLED TV' in Brazil.)
ROUGE (R-1/R-2/R-L): 0.14 / 0.00 / 0.14
RDASS: 0.71
Human Evaluation (relevance/consistency/fluency): 4.6 / 4.4 / 4.4

Table 5: Example articles from the Daum/News test dataset, with ROUGE, RDASS, and human evaluation results for the generated summaries.

In article-1, the generated summary "On his 30th birthday, Messi had a good time with his family" has the same semantic meaning as the reference summary "Messi spent his 30th birthday with his wife and son." However, because a sentence with the same semantic meaning can be expressed in various ways in Korean, an agglutinative language, the ROUGE score is low while the human evaluation scores are high. Likewise, the generated summary "Samsung Electronics launches a new 'QLED TV' in Brazil" in article-2 has the same semantic meaning as the reference summary "Samsung Electronics launches the 'QLED TV' in Brazil, the largest market in Latin America." The generated summary is correct in both articles, yet the ROUGE score is low. In contrast, RDASS assigns a high score, indicating that the generated summary is correct.

6 Conclusion

In this paper, we pointed out the limitations of the widely used ROUGE evaluation metric when applied to Korean summarization. Because Korean is an agglutinative language, a generated summary with the same semantic meaning as the reference summary can be expressed in various ways; therefore, relying only on the ROUGE metric can produce inaccurate evaluation results. To overcome this limitation, we proposed the RDASS (Reference and Document Aware Semantic Score) evaluation metric, which reflects the deep semantic relationships among the generated summary, the reference summary, and the document. Through extensive evaluations, we demonstrated that the correlation with human judgment is higher for the proposed evaluation metric (RDASS) than for ROUGE scores. In future work, we will demonstrate the effectiveness of the proposed method on English summarization datasets.

References