InfoLM: A New Metric to Evaluate Summarization & Data2Text Generation

12/02/2021 · by Pierre Colombo, et al. · Télécom Paris

Assessing the quality of natural language generation systems through human annotation is very expensive. Additionally, human annotation campaigns are time-consuming and include non-reusable human labour. In practice, researchers rely on automatic metrics as a proxy for quality. In the last decade, many string-based metrics (e.g., BLEU) have been introduced. However, such metrics usually rely on exact matches and thus do not robustly handle synonyms. In this paper, we introduce InfoLM, a family of untrained metrics that can be viewed as string-based metrics that address the aforementioned flaws thanks to a pre-trained masked language model. This family of metrics also makes use of information measures, allowing InfoLM to be adapted to various evaluation criteria. Using direct assessment, we demonstrate that InfoLM achieves statistically significant improvements, with correlation gains of over 10 points in many configurations, on both summarization and data2text generation.


1 Introduction

A plethora of natural language processing (NLP) applications perform text-to-text transformation mellish1998evaluation; belz2006comparing; specia2018quality. Given an input, these systems are required to produce an output text that is coherent, readable and informative. Due to both high annotation costs and time, researchers tend to rely on automatic evaluation to compare the outputs of such systems. Reference-based automatic evaluation relies on comparing a candidate text produced by the NLG system with one or several reference texts ('gold standard') created by a human annotator. Generic automatic evaluation of NLG is a major challenge as it requires building a metric that evaluates the similarity between a candidate and one or several gold-standard reference texts. However, the definition of success criteria is task-specific: for example, the evaluation of text summarization focuses on content, coherence, grammaticality, conciseness and readability mani2001automatic, whereas machine translation focuses on the fidelity, fluency and adequacy of the translation hovy1999toward, and data2text generation gardent2017creating considers criteria such as data coverage, correctness and text structure.

Automatic text evaluation metrics fall into two categories: metrics that are trained to maximise their correlation with human annotations (e.g., RUSE shimanaka2018ruse, BEER stanojevic-simaan-2014-beer, BLEND ma2017blend) and untrained metrics (e.g., BLEU bleu, ROUGE lin-2004-rouge, BERTSCORE zhang2019bertscore, DepthScore staerman2021depth, BaryScore colombo2021automatic, MOVERSCORE zhao2019moverscore, Word Mover Distance kusner2015wmd). In this work, we focus on untrained metrics, as trained metrics may not generalize well to new data (existing labelled corpora are of small size). Two categories of untrained metrics can be distinguished: word- or character-based metrics that compute a score from string representations, and embedding-based metrics that rely on continuous representations. String-based metrics (e.g., BLEU) often fail to robustly match paraphrases reiter2009investigation as they mainly focus on the surface form, as opposed to embedding-based metrics relying on continuous representations.

In this paper, we introduce InfoLM, a family of new untrained metrics to evaluate text summarization and data2text generation. At the highest level, the key components of InfoLM include: (1) a pre-trained masked language model (PMLM) used to compute two discrete probability distributions over the vocabulary, representing the probability of observing each token of the vocabulary given the candidate and the reference sentence, respectively; and (2) a contrast function used to measure the dissimilarity between the aforementioned probability distributions. InfoLM differs from existing BERT-based metrics (e.g., BERTSCORE, MOVERSCORE) as it directly relies on the PMLM, which outputs discrete probability distributions. Thus InfoLM neither requires arbitrarily selecting one or several specific layers (e.g., BERTSCORE relies on the 9th layer for bert-base-uncased), nor involves selecting arbitrary aggregation techniques (e.g., the power mean ruckle2018concatenated for MOVERSCORE). As InfoLM relies on statistics over tokens, it can also be seen as a string-based metric. However, it does not suffer from the common pitfalls of string-based metrics (e.g., synonyms, the need for an exact string match) as the PMLM also allows one to assign a high score to paraphrases and to capture distant dependencies.
Contributions. Our contributions are summarized below:
(1) A set of novel metrics to automatically evaluate summarization and data2text generation. In this work, we introduce InfoLM, which overcomes the common pitfalls of string matching metrics and requires neither selecting a layer nor relying on an arbitrary aggregation function. InfoLM combines a pre-trained masked language model and a contrast function, denoted by I, between two discrete probability distributions. We explore the use of different choices of contrast functions, such as α- and AB-divergences, ℓp distances or the Fisher-Rao distance.
(2) Tasks. First, we demonstrate on both summarization and data2text generation that InfoLM is better suited than concurrent metrics. The comparison is conducted using multiple correlation measures with human judgment, both at the text and at the system level. Second, we dissect InfoLM to better understand the relative importance of each component (e.g., calibration, sensitivity to the choice of information measure).

2 Problem statement and related work

In this section, we start by introducing notations and formalizing the problem of evaluating both text generation systems and evaluation metrics. Then, we identify and present the most relevant related work and the existing approaches for the studied tasks.

2.1 Problem statement

NLG evaluation. We are given a dataset where x_i is the i-th reference text, y_i^j is the i-th candidate text generated by the j-th NLG system, N is the number of texts in the dataset and S

the number of systems available. The reference x_i

is composed of M tokens (e.g., words or subwords) and the candidate y_i^j is composed of L tokens (the reference and candidate texts can be composed of several sentences, as is the case in summarization). The set of tokens (vocabulary) is denoted by Ω, and Ω* denotes the set of possible texts. h(x_i, y_i^j) is the score assigned by a human annotator to the candidate text y_i^j when comparing it with the reference text x_i. We aim at building an evaluation metric f such that f(x_i, y_i^j) ≈ h(x_i, y_i^j).

Evaluating evaluation metrics. To assess the relevance of an evaluation metric f, correlation with human judgment is considered to be one of the most important criteria koehn2009statistical; specia2010machine; chatzikoumi2020evaluate. The debate on the relative merits of different correlations for the evaluation of automatic metrics is ongoing, but classical correlation measures are the Pearson leusch2003novel, Spearman melamed2003precision and Kendall kendall1938new coefficients. Two meta-evaluation strategies are commonly used: (1) text-level correlation and (2) system-level correlation. Formally, the text-level correlation with respect to system j is computed as

K_text(f) = K( [f(x_i, y_i^j)]_{1<=i<=N}, [h(x_i, y_i^j)]_{1<=i<=N} ),    (1)

where the two arguments are the vectors composed of the scores assigned by the automatic metric and the human, respectively, and K is the chosen correlation measure (e.g., Pearson, Kendall or Spearman). Similarly, the system-level correlation is obtained by

K_sys(f) = K( [ (1/N) Σ_{i=1}^{N} f(x_i, y_i^j) ]_{1<=j<=S}, [ (1/N) Σ_{i=1}^{N} h(x_i, y_i^j) ]_{1<=j<=S} ),    (2)

where the arguments are the vectors composed of the averaged scores assigned by f and the human, respectively. For the significance analysis, we follow graham2014testing and use a Williams test to validate significant improvements for dependent correlations steiger1980tests.
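To make the two meta-evaluation strategies concrete, here is a minimal sketch (our own illustration, not the authors' evaluation code; the array shapes and the use of scipy.stats are assumptions) of how text-level and system-level correlations can be computed from metric and human scores:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

def text_level_corr(metric_scores_j, human_scores_j, corr=pearsonr):
    """Eq. (1): correlate the per-text scores of one system with the human scores."""
    return corr(metric_scores_j, human_scores_j)[0]

def system_level_corr(metric_scores, human_scores, corr=pearsonr):
    """Eq. (2): correlate per-system averages; inputs have shape (num_systems, num_texts)."""
    return corr(metric_scores.mean(axis=1), human_scores.mean(axis=1))[0]

# Toy example: 4 systems evaluated on 20 texts.
rng = np.random.default_rng(0)
human = rng.random((4, 20))
metric = human + 0.2 * rng.random((4, 20))  # a metric loosely agreeing with humans
print(text_level_corr(metric[0], human[0], kendalltau))
print(system_level_corr(metric, human, spearmanr))
```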

2.2 Existing metrics

String-based metrics

Two types of string-based metrics exist: N-gram matching metrics and edit distance-based metrics. N-gram matching metrics count the number of N-grams shared by the candidate text and the reference text; the three most-used metrics are BLEU, ROUGE and METEOR banerjee2005meteor. If no N-gram is shared between the candidate and the reference, these metrics fail to produce meaningful scores. The second category gathers edit distance-based metrics, which count the number of basic operations (e.g., edit, delete, insert) needed to transform one text into the other as a proxy for semantic equivalence. Variants include TER snover2006ter, CDER leusch-etal-2006-cder, EED stanchev2019eed and CHARACTER wang2016character. Edit distance-based metrics do not handle synonyms and focus on the surface form. InfoLM can be seen as string-based but does not suffer from the aforementioned matching problem and can handle synonyms as it relies on a PMLM.
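As a point of comparison, such n-gram matching scores can be computed with off-the-shelf packages. The snippet below (a usage sketch assuming the sacrebleu and rouge-score packages; neither the packages nor the example sentences are prescribed by the paper) illustrates how surface-form metrics penalize a paraphrase that shares few n-grams with the reference:

```python
import sacrebleu
from rouge_score import rouge_scorer

reference = "the cat sat on the mat"
paraphrase = "a feline was sitting on the rug"  # same meaning, almost no shared n-grams

bleu = sacrebleu.sentence_bleu(paraphrase, [reference])
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, paraphrase)

print(bleu.score)                # low BLEU despite semantic equivalence
print(rouge["rouge1"].fmeasure)  # low ROUGE-1 F-measure for the same reason
```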

Embedding-based metrics

Another class of metrics relies on word embeddings. These metrics either use static word embeddings such as word2vec word2vec or contextualized embeddings such as ELMO elmo, BERT bert and its variants sanh2019distilbert; liu2019roberta. Among the most popular metrics, we can mention MOVERSCORE, BERTSCORE, WMD kusner2015wmd and WMDO chow-etal-2019-wmdo. Different from these approaches, InfoLM relies on a language model and works with discrete probability distributions instead of continuous representations.

Learning-based metrics

Various trained metrics have been proposed, such as BEER, BLEND, RUSE and CIDER vedantam2015cider. These methods rely on train/dev/test sets composed of human evaluations. InfoLM does not require any training step and relies on a frozen PMLM.

PMLM as a metric.

To the best of our knowledge, using a PMLM (i.e., without further training) as a reference-based automatic metric remains overlooked. The closest use we found relies on autoregressive models, such as GPT-2 radford2019language, to compute the perplexity of the generated sentence and assess its fluency. Researchers have mainly focused on using the learnt embeddings of the PMLM. However, finding a reliable layer aggregation mechanism remains an open question (e.g., BERTSCORE arbitrarily selects a layer based on a chosen dataset, MOVERSCORE uses the last 5 layers). InfoLM addresses this aggregation issue by relying directly on the output distribution of the PMLM.

2.3 Masked language modelling

Language models based on masked language pre-training objectives bert; liu2019roberta are trained to reconstruct the original tokens from a corrupted version of an input text by minimizing a cross-entropy loss. The corrupted context corresponds to a "local view" of the sentence. To ensure a fair comparison, we do not use existing alternatives (e.g., GPT-2-based models radford2019language), as concurrent works zhang2019bertscore; zhao2019moverscore rely on PMLMs.

3 InfoLM

In this section, we first introduce a novel family of metrics called InfoLM and then detail the different components of these novel metrics.
Notations. We denote by θ the parameters of the PMLM, by T its temperature, and by I an information measure (see Section 3.3) used to quantify the similarity between two discrete distributions.

3.1 Motivations & Definitions

A PMLM has learnt the empirical distribution of a large text corpus. Given a text x, the corrupted context with a mask at position j is denoted x_{-j}; the LM predicts a distribution p(· | x_{-j}) over the vocabulary given the masked context. As an example, given a suitably masked input sentence, a pretrained model could place high probabilities on the tokens "food" and "meal" and a low probability on "the". It is worth noting that p(· | x_{-j}) represents the probability of observing each token of the vocabulary given the masked input x_{-j}.
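As an illustration of this prediction step, the following sketch (our own minimal example rather than the authors' implementation; the choice of bert-base-uncased, the example sentence and the temperature argument are assumptions) uses the HuggingFace transformers API to obtain p(· | x_{-j}) for a masked position j:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL).eval()

def masked_distribution(tokens, j, temperature=1.0):
    """Distribution over the vocabulary for position j, with token j masked out."""
    ids = tokenizer.convert_tokens_to_ids(tokens)
    ids[j] = tokenizer.mask_token_id
    input_ids = torch.tensor([tokenizer.build_inputs_with_special_tokens(ids)])
    with torch.no_grad():
        logits = model(input_ids).logits[0, j + 1]  # +1 offset for the [CLS] token
    return torch.softmax(logits / temperature, dim=-1)

tokens = tokenizer.tokenize("i ordered some food at the restaurant")
p = masked_distribution(tokens, 3)  # distribution for the masked token "food"
print(tokenizer.convert_ids_to_tokens(p.topk(3).indices.tolist()))
```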

Definition 3.1 (Equivalence for masked contexts).

Given ε > 0, two masked contexts x_{-j} and y_{-k}, obtained from input texts x and y with masks at positions j and k respectively, are said to be equivalent (denoted x_{-j} ~ y_{-k}) if the two predicted discrete distributions given by the PMLM, namely p(· | x_{-j}) and p(· | y_{-k}), are similar. Formally, x_{-j} ~ y_{-k} if I(p(· | x_{-j}), p(· | y_{-k})) ≤ ε.

Remark.

We have the intuition that two similar sentences will share several pairs of equivalent masked contexts. At this point, we make no claim on the relationship between equivalence and the masked context similarity.

In this work, we make the hypothesis that two similar sentences will share multiple equivalent masked contexts. However, the pairwise comparison of all pairs of individual masked contexts is prohibitively expensive (M × L comparisons) when considering long texts. Motivated by efficiency, we instead propose to work with two "global views" of the sentences, which are well-formed probability distributions obtained through the aggregation of the individual PMLM predictions. The aggregated probabilities for x and y are denoted p(· | x) and p(· | y), respectively.

Definition 3.2 (Similarity for texts).

Given ε > 0, two texts x and y are said to be similar (denoted x ≈ y) if I(p(· | x), p(· | y)) ≤ ε, where p(· | x) and p(· | y) denote the aggregated individual masked-context predictions.

3.2 InfoLM

Overview

InfoLM uses the notion of similarity given in Definition 3.2. Given a reference text x together with a candidate text y, InfoLM successively masks each token position of both x and y to obtain individual masked contexts. Relying on a PMLM, InfoLM predicts one distribution for each individual masked context. The resulting distributions are then averaged (we refer to this operation as a "bag of distributions") to obtain p(· | x) and p(· | y). The final step compares the two well-formed discrete probability distributions p(· | x) and p(· | y) through a measure of information I. InfoLM writes as:

InfoLM(x, y) = I( p(· | x), p(· | y) ).    (3)

Remark.

It is worth emphasizing that p(· | x) and p(· | y) are two well-formed discrete probability distributions. They represent the probability of observing each token of the vocabulary given the candidate and the reference sentence, respectively.

Aggregation Procedure

Rare tokens can be more indicative of text similarity than common tokens banerjee2005meteor. Thus, for the aggregation of the individual masked contexts, we propose to compute a weighted "bag of distributions" where the weights are normalized measures of the importance of each token. In practice, p(· | x) and p(· | y) write as:

p(· | x) = Σ_{k=1}^{M} w_k p(· | x_{-k})    and    p(· | y) = Σ_{k=1}^{L} v_k p(· | y_{-k}),

where w_k and v_k are measures of the importance of the k-th token in the reference and candidate text respectively, satisfying Σ_k w_k = Σ_k v_k = 1. These weights are computed using inverse document frequency (IDF) scores determined at the corpus level zhao2019moverscore; kusner2015wmd.
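A minimal sketch of this weighted aggregation (reusing masked_distribution from the previous snippet; the exact IDF formula and smoothing below are our assumptions, not necessarily the variant used in the paper):

```python
import math

def idf_weights(tokens, doc_freq, num_docs):
    """Normalized IDF importance weights for the tokens of one text."""
    raw = [math.log((1 + num_docs) / (1 + doc_freq.get(t, 0))) for t in tokens]
    total = sum(raw)
    return [r / total for r in raw]  # weights sum to one

def bag_of_distributions(tokens, weights, temperature=1.0):
    """Weighted 'bag of distributions': IDF-weighted average of per-position predictions."""
    dists = [masked_distribution(tokens, j, temperature) for j in range(len(tokens))]
    return sum(w * d for w, d in zip(weights, dists))  # still a probability vector
```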

LM Calibration.

Modern deep neural networks are overconfident guo2017calibration. Several techniques have been proposed to re-calibrate language models (e.g., entropy rates pmlr-v119-braverman20a, temperature scaling platt1999probabilistic, contextual calibration zhao2021calibrate). Here, motivated by simplicity and speed, we choose to study how calibration affects InfoLM by relying on temperature scaling: as the temperature tends to zero, one token receives all the probability mass, and as it tends to infinity, the predicted distribution becomes uniform.
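A toy illustration of these two temperature limits (our own example; the logits are arbitrary):

```python
import torch

logits = torch.tensor([4.0, 2.0, 0.5, 0.1])  # toy PMLM logits over a 4-token vocabulary

for T in (0.05, 1.0, 50.0):
    print(T, torch.softmax(logits / T, dim=-1))
# T -> 0 concentrates all the mass on the most likely token,
# while T -> infinity makes the distribution uniform.
```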

3.3 Information measures

In this work, we focus on comparing a pair of discrete probability distributions through information measures (see basseville2013divergence for an exhaustive study). We rely on two types of information measures: divergences and distances. A divergence is a measure of dissimilarity that is always positive and equals zero if (and only if) the two considered distributions are identical. We call a distance a function that is symmetric, positive, satisfies the triangle inequality and equals zero if (and only if) the two considered distributions are identical. We will use information measures that either belong to the Csiszár f-divergences or are distances.

Divergences: α-divergence (csiszar1967information), γ-divergence (fujisawa2008robust), AB-divergence (cichocki2011generalized).
Distances: ℓ1 distance, ℓ2 distance, ℓ∞ distance, Fisher-Rao distance.
Table 1: Divergences (upper group) and distances (lower group) between two positive measures, with their definition domains (see regli2018alpha). Summation indices are omitted.

Divergence measures

Various divergence measures have been proposed for a large variety of applications basseville2013divergence. The full expressions of the studied divergences can be found in Table 1. We focus here on three families of divergences: α-divergences, γ-divergences and AB-divergences. Note that there exist other families of divergences, such as the Bregman divergences bregman1967relaxation, the β-divergences basu1998robust and the Chernoff divergence chernoff1952measure.
α-Divergences. This family was introduced by renyi1961measures and is a special case of the f-divergences csiszar1967information. α-divergences are widely used in variational inference li2016r and are closely related to, but distinct from, the Rényi divergences. From Table 1 we note two special cases of the α-divergences: (i) the Kullback-Leibler (KL) divergence is recovered in the limit α → 1; (ii) the (squared) Hellinger distance hellinger1909neue follows by choosing α = 1/2. For this family, the parameter α weights the relative influence of the two distributions.
γ-Divergences. This family was introduced by fujisawa2008robust as a scale-invariant modification of the robust β-divergences. (In our setting we work with normalised distributions, so scale invariance is not a mandatory property; it is nevertheless worth mentioning, as its absence could cause practical issues when optimising our metric.) For these divergences, the parameter controls the importance given to elements of small probability (e.g., outliers in some scenarios, tokens with low probability in our case): depending on its value, the importance of large probabilities is reduced, which gives more weight to the outliers. Special cases include the KL divergence and an ℓ2-type distance, recovered for particular values of the parameter.
AB-Divergences. The family of AB-divergences is flexible and allows controlling the mass coverage and the robustness. cichocki2011generalized propose to use these divergences; as can be seen in Table 1, they have two parameters, α and β, which allow tuning the mass coverage and the robustness independently. The previous families are recovered for particular choices of (α, β).
From information divergences to discriminations. For our application, we would like to produce a metric between two texts regardless of the source (system or human). Thus, we are interested in symmetric divergences; such divergences are called discriminations. To obtain a discrimination, two tricks are commonly applied: Jeffreys' symmetrization, which averages I(p, q) and I(q, p), or Jensen's symmetrization, which averages the divergences of p and q to their mixture (p + q)/2. We choose Jeffreys' symmetrization as it does not require computing the mixture distribution. The measure symmetrized with Jeffreys' symmetrization is denoted by Ī.
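For concreteness, here is a small sketch of two of these measures and of Jeffreys' symmetrization for discrete probability vectors (our own implementation of textbook formulas; the particular α-divergence normalization below is an assumption and may differ from the one used in the paper):

```python
import torch

EPS = 1e-12  # numerical floor to avoid log(0) and division by zero

def kl_divergence(p, q):
    """KL(p || q) between two discrete probability vectors."""
    p, q = p.clamp_min(EPS), q.clamp_min(EPS)
    return torch.sum(p * (p.log() - q.log()))

def alpha_divergence(p, q, alpha=0.5):
    """A standard alpha-divergence; alpha = 0.5 is proportional to the squared Hellinger distance."""
    p, q = p.clamp_min(EPS), q.clamp_min(EPS)
    return (1.0 - torch.sum(p.pow(alpha) * q.pow(1.0 - alpha))) / (alpha * (1.0 - alpha))

def jeffreys(measure, p, q, **kwargs):
    """Jeffreys' symmetrization: average of I(p, q) and I(q, p)."""
    return 0.5 * (measure(p, q, **kwargs) + measure(q, p, **kwargs))
```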

Distances

ℓp distances. The ℓp distances can be used to measure the similarity between two distributions; we restrict ourselves to p ∈ {1, 2, ∞}.
Fisher-Rao distance. The Fisher-Rao distance represents the geodesic distance rao1987differential between two distributions. Interestingly, this distance remains overlooked in the ML community but has recently been used to achieve robustness against adversarial attacks marine_rao.
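A corresponding sketch for the distance-based measures (again our own implementation of standard formulas; for categorical distributions the Fisher-Rao geodesic distance reduces to twice the arccosine of the Bhattacharyya coefficient):

```python
import torch

def lp_distance(p, q, order=2):
    """l_p distance between two discrete probability vectors (order can be float('inf'))."""
    return torch.linalg.vector_norm(p - q, ord=order)

def fisher_rao(p, q):
    """Fisher-Rao (geodesic) distance between two categorical distributions."""
    bc = torch.sum(torch.sqrt(p * q)).clamp(max=1.0)  # Bhattacharyya coefficient
    return 2.0 * torch.arccos(bc)
```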

Connection with string matching metrics.

We adopt the following notation: p_x = p(· | x) and p_y = p(· | y). First, let us consider two texts such that x ≈ y. InfoLM with measure I is close to zero if p_x ≈ p_y. It means that all tokens that are likely (according to the PMLM) when considering x are also likely when considering y. For string matching metrics, this corresponds to a perfect match between x and y. Second, let us consider two dissimilar texts and a measure of information that relies on the product of p_x and p_y (e.g., Fisher-Rao). In this case, the product is close to zero, thus all tokens that are likely when considering x are unlikely when considering y (and the converse is true as well). For string matching metrics, this corresponds to no match among the sub-strings of x and y.
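These two limiting regimes can be checked numerically with the measures sketched above (a toy three-token vocabulary, not data from the paper):

```python
import torch

identical = torch.tensor([0.7, 0.2, 0.1])
disjoint_a = torch.tensor([0.5, 0.5, 0.0])
disjoint_b = torch.tensor([0.0, 0.0, 1.0])

print(fisher_rao(identical, identical).item())    # 0.0: the "perfect match" regime
print(fisher_rao(disjoint_a, disjoint_b).item())  # pi, the maximum: no shared likely tokens
```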

4 Experimental Frameworks

In this section, we describe our experimental setting. We present the tasks and the baseline metrics used for each task.

4.1 Text summarization

Text summarization aims at compressing long texts into fluent, short sentences that preserve the salient information.
Datasets. To compare the different metrics, previous work bhandari2020re; fabbri2020summeval relies either on the TAC datasets dang2008overview; mcnamee2009overview or on new summarization datasets extracted from CNN/DailyMail nallapati2016abstractive. As pointed out by peyrard2019studying; bhandari2020re, the TAC datasets are old and contain flaws (e.g., the systems used to generate the summaries were of poor quality), so we choose to work with the newly assembled CNN/DailyMail dataset proposed in bhandari2020re. This dataset gathers 11,490 summaries, and annotations are carried out using the pyramid method nenkova2004evaluating.
Metrics. For text summarization, perhaps the best-known metrics are ROUGE and its extensions ng2015better, or METEOR and its variants denkowski2014meteor. Recently, a new set of metrics (e.g., BERTSCORE, MOVERSCORE) has been applied to text summarization.

4.2 Data2Text generation

Prior works mainly rely on two task-oriented dialogue datasets (i.e., BAGEL mairesse2010phrase and SFHOTEL wen2015semantically). As the sentences generated in these datasets are unlikely to be representative of the progress of recent NLG systems, we instead rely on a different dataset coming from the WebNLG2020 challenge gardent2017creating. Given the following example of triples: (John_Blaha birthDate 1942_08_26) (John_Blaha birthPlace San_Antonio) (John_E_Blaha job Pilot), the goal is to generate "John Blaha, born in San Antonio on 1942-08-26, worked as a pilot."
Annotations. The WebNLG task is evaluated by human annotators along five different axes: (1) Data coverage: are all the descriptions present in the data included in the text? (2) Relevance: does the text contain only predicates found in the data? (3) Correctness: are the predicates found in the data correctly mentioned and adequately introduced? (4) Text structure: is the produced text well-structured, grammatically correct and written in acceptable English? (5) Fluency: does the text progress naturally? Is it easy to understand? Is it a coherent whole?
Metrics. For this task, the organisers rely on untrained metrics (e.g., BLEU, METEOR, TER, BERTSCORE) to compare the performance of the candidate systems. Thus, we will focus on system-level correlation.

5 Numerical Results

Figure 9: Results of the correlation between metrics and human judgments on the CNN dataset (panels (a)-(h): abstractive/extractive system outputs at the text and system level). The first row reports correlation as measured by the Pearson coefficient and the second row focuses on the Kendall coefficient. In this experiment, parameters are optimized for each criterion. The results with the Spearman coefficient can be found in Section B.5.

In this section, we study the performance of InfoLM on both text summarization and data2text generation.

5.1 Results on Text Summarization

General Analysis. Figure 35 gathers the results of the correlation study between the scores produced by the different metrics and human judgement (i.e., the pyramid score). We reproduce the results from bhandari2020re. We observe a different behaviour depending on the type of systems to be evaluated (e.g., abstractive or extractive) and the chosen correlation coefficient. We observe that InfoLM, with its best-performing information measures, outperforms other BERT-based metrics such as MOVERSCORE or BERTSCORE (it is worth noting that both of these metrics perform poorly at the text and system level when considering outputs from extractive systems). InfoLM also largely outperforms n-gram matching metrics (e.g., the ROUGE metrics) on all datasets when measuring correlation with the Kendall coefficient, and in almost all configurations (except abstractive outputs at the system level) when using the Pearson coefficient. It is worth noting the overall good performance of the parameter-free Fisher-Rao distance.
Choice of information geometric measure for InfoLM. In Figure 35, we can distinguish two groups of measures with different global behaviours. First, we notice that the most selective measures lead to poor performance in many configurations; their occasional good performance is surprising since such measures essentially compare only the single most likely token, and, as the output produced by the PMLM is sparse, this corresponds to a token that is likely in one sentence and not likely at all in the other. The second group gathers the remaining divergences and the Fisher-Rao distance and achieves the best performance overall. Several of these measures achieve similar performance, suggesting that the additional flexibility (e.g., robustness to outliers) they introduce is not decisive for our task; this observation is strengthened by the lower performance of the most outlier-robust divergences. The remaining differences can be attributed to the flexibility introduced by the α parameter (which controls the relative importance of the likelihood ratio) and can be interpreted in our case as the ability to control the importance attributed to less likely words jalalzai2020heavy.
Takeaways. The best-performing metric is obtained with the AB-divergence. The Fisher-Rao distance achieves good performance in many scenarios and has the advantage of being parameter-free.

5.2 Results on Data2Text

Global Analysis. Table 3 gathers the results of the correlation analysis of the metrics with human judgements along the five different axes. We observe that the five annotation criteria are not independent: text structure and fluency are strongly correlated with each other. Additionally, all metrics achieve similar results when the correlation is computed on these two criteria. We observe that the best-performing group of metrics is based on InfoLM, followed by metrics based on continuous representations from BERT (i.e., MOVERSCORE and BERTSCORE), followed by N-gram matching metrics. Regarding correctness, data coverage and relevance, we observe that the best InfoLM variants achieve the best results for almost all correlation coefficients. On data coverage, InfoLM achieves substantial correlation gains compared to both BERT-based and N-gram matching metrics. Regarding fluency and text structure, the Fisher-Rao distance works best and slightly outperforms the second-best performing metric, namely BERTSCORE.
Takeaways. Similar to summarisation, we observe very low correlations for the most selective measures. We also observe that the more outlier-robust divergences achieve lower results than the other families, suggesting that, as noticed for summarisation, robustness to unlikely words (i.e., outliers) is less relevant for our task.

Correctness Data Coverage Fluency Relevance Text Structure
Metric
Correct 100.0 100.0 100.0 97.6 85.2 73.3 80.0 81.1 61.6 99.1 89.7 75.0 80.1 80.8 60.0
DataC 85.2 97.6 73.3 100.0 100.0 100.0 71.8 51.7 38.3 96.0 93.8 81.6 71.6 51.4 36.6
Fluency 81.1 80.0 61.6 71.8 51.7 38.3 100.0 100.0 100.0 77.0 61.4 46.6 99.5 99.7 98.3
Relev 89.7 99.1 75.0 96.0 93.8 81.6 77.0 61.4 46.6 100.0 100.0 100.0 77.2 61.1 45.0
TextS 80.8 80.1 60.0 71.6 51.4 36.6 99.5 99.7 98.3 77.2 61.1 45.0 100.0 100.0 100.0
88.8 89.3 76.6 81.8 82.6 70.0 86.6 92.0 76.6 89.8 87.9 73.3 86.6 91.4 75.0
88.8 89.3 76.6 81.8 82.6 70.0 86.6 92.0 76.6 89.8 87.9 73.3 86.6 91.4 75.0
81.4 50.0 71.6 48.4 79.7 65.0 44.8 84.7 76.6 49.3 72.3 60.0 48.0 83.8 75.0
75.2 33.8 61.6 32.4 53.8 40.0 22.7 83.5 73.3 32.2 57.9 45.0 25.6 83.2 71.6
89.7 86.0 75.0 78.7 70.5 51.6 93.3 95.7 85.3 87.6 84.4 70.0 92.4 93.8 81.6
JS 79.4 81.1 70.0 69.3 75.5 60.0 89.4 91.4 75.0 81.7 70.5 60.0 91.9 91.1 73.3
BertS 85.5 83.4 73.3 74.7 68.2 53.3 92.3 95.5 85.0 83.3 79.4 65.0 91.9 95.0 83.3
MoverS 84.1 84.1 73.3 78.7 66.2 53.3 91.2 92.1 78.3 82.1 77.4 65.0 90.1 91.4 76.3
BLEU 77.6 66.3 60.0 55.7 50.2 36.6 89.4 90.5 78.3 63.0 65.2 51.6 88.5 89.1 76.6
R-1 80.6 65.0 65.0 61.1 59.6 48.3 76.5 76.3 60.3 64.3 69.2 56.7 75.9 77.5 58.3
METEOR 86.5 66.3 70.0 77.3 50.2 46.6 86.7 90.5 78.3 82.1 65.2 58.6 86.2 89.1 76.6
TER 79.6 78.3 58.0 69.7 58.2 38.0 89.1 93.5 80.0 75.0 70.2 77.6 89.5 91.1 78.6
Table 2: Correlation at the system level with human judgement along five different axes (correctness, data coverage, fluency, relevance and text structure) for the WebNLG task. The best results within each group are underlined; the overall best results are in bold.

5.3 Further Analysis

Correlation between metrics

In this experiment, we complete our global analysis by comparing the scores obtained by the different metrics with each other. We want to understand how different our metric is from other metrics and how the choice of information geometric measure affects the predictions. Figure 10 gathers the results of the experiment. We observe a high correlation between the InfoLM variants that consider the product of p(· | x) and p(· | y) (e.g., the Fisher-Rao distance). Interestingly, we observe a lower correlation between these variants and BERTSCORE or N-gram matching metrics (e.g., ROUGE), whereas BERTSCORE achieves a stronger correlation with ROUGE.
Takeaways. Through the correlation analysis in Figure 10, we observe the impact of the different geometries on InfoLM predictions. The analysis shows that the predictions of InfoLM obtained with these measures are highly correlated with each other and, as illustrated by the previous experiments, achieve high correlation with human judgement, which we believe validates our approach. It is worth noting that the Fisher-Rao distance requires no tuning as it is parameter-free.

Figure 10: Pearson correlation at the system level between metrics when considering abstractive system outputs.

Score Distributions

In Figure 11, we study the distribution of text scores of different metrics on abstractive summaries. The ideal metric would mimic the human score distribution and be able to distinguish between good- and bad-quality summaries. The results show that ROUGE and BLEU struggle to distinguish between good-quality and low-quality summaries, which has also been reported in peyrard2019studying. We observe that several InfoLM measures are able to make the distinction. Interestingly, as the ℓp distances become more selective (i.e., as they focus on a single token), they struggle to distinguish low- from high-scoring summaries.

Takeaways. InfoLM, when combined with well-chosen information measures, is able to distinguish high-scoring from low-scoring summaries.

Figure 11: Distribution of text scores when considering abstractive system outputs; the human score is the pyramid score.

Temperature Calibration

To study the impact of calibration, we choose to work at the system level and report in Figure 12 the correlations achieved, as measured by the different coefficients. We limit our study to the Fisher-Rao distance as it is parameter-free and is among the best-performing measures for InfoLM. Due to space constraints, we report the results on extractive systems only (results for abstractive systems can be found in Figure 15).
Takeaways. Fisher-Rao only considers the product of the two aggregated distributions; thus, as the temperature increases and the predicted probabilities of the PMLM become more uniform, more tokens are taken into account and the aggregated distributions become richer in terms of covered tokens. It is worth noting that when changing the temperature we observe a smooth change in correlation, with an optimal intermediate temperature. This suggests that InfoLM benefits from a PMLM that is not too selective (i.e., from avoiding very low temperatures). For a specific application, the temperature of InfoLM can be tuned to improve correlation, and InfoLM will likely benefit from a well-calibrated PMLM.

Figure 12: Impact of Calibration on system-level correlation for summarization.

6 Summary and Concluding Remarks

In this work, we presented InfoLM, a metric that does not require training and is among the first to compute the similarity between two discrete probability distributions over the vocabulary (which makes it similar to string-based metrics) while also leveraging recent advances in language modelling thanks to a PMLM. Our experiments on both summarization and data2text generation demonstrate the validity of our approach. Among the available contrast measures, the Fisher-Rao distance is parameter-free and thus easy to use in practice, while the AB-divergence achieves better results but requires selecting α and β. Future work includes extending our metrics to new tasks such as translation, image captioning and paraphrase detection, as well as investigating other information geometric measures (e.g., the Wasserstein distance staerman2021ot, or other types of divergences and mutual information colombo2021novel) to obtain better correlation with different attributes.

7 Acknowledgments

This work was also granted access to the HPC resources of IDRIS under the allocation 2021-AP010611665 as well as under the project 2021-101838 made by GENCI.

References

Appendix A Additional Experimental Results

In this section, we report the additional experiments we conducted. Due to space constraints, we did not report the sensitivity analysis (see Section A.2) in the main paper.

a.1 Role of calibration

The complete results on the role of calibration can be found in Figure 15. We notice a similar behaviour for InfoLM on both extractive and abstractive systems: the performance varies smoothly with the temperature and exhibits a single maximum.

Figure 15 (panels: (a) abstractive, (b) extractive): Impact of calibration on system-level correlation (Pearson, Spearman or Kendall) for CNN. The chosen measure is Fisher-Rao as it is parameter-free. Calibration is varied via temperature scaling with a single temperature.

a.2 Choice of α and β

In this experiment, we aim at quantifying the sensitivity of the AB-divergence to the choice of α and β. Figure 16 gathers the results of the analysis. We observe that changes in one of the two parameters induce a stronger change in the metric than changes in the other, and that the variation with respect to both parameters is smooth. Interestingly, the optimal values differ between abstractive and extractive systems, suggesting that for evaluating abstractive systems the metric should focus on tokens that are probable for both the candidate and the reference text, whereas for extractive systems the attention should be focused on tokens that are likely in only one of the two texts.
Takeaways. The optimal parameter values differ between abstractive and extractive systems, and good parameter combinations achieve consistently high performance across the different correlation coefficients.

Figure 16: Impact of changes in α and β for the AB-divergence. System-level correlation, as measured by Pearson or Spearman, is presented for abstractive (first column) and extractive (second column) systems.

a.3 Statistical analysis

Automatic metrics are used in the WebNLG challenge to compare the different systems. To evaluate whether the observed improvements in correlation are significant, we report the results of Williams' significance test in Figure 22.
Takeaways: (i) regarding correctness and relevance, InfoLM is a suitable choice that is significantly better than the other metrics; (ii) regarding text structure, InfoLM is significantly better and compares favourably against all metrics except MOVERSCORE for automatic fluency evaluation; (iii) regarding data coverage, METEOR achieves good results; however, a significant difference is only observed with respect to BERTSCORE.

Figure 22 (panels (a)-(e): correctness, data coverage, fluency, relevance, text structure): Results of Williams' significance test; the tested hypothesis is "is the increase in correlation significant?". For clarity and due to space constraints, the p-values are truncated and multiplied by 100. Only p-values below the significance threshold are displayed.

a.4 Complete results on Summarization

We gather in Figure 35 the complete results on summarization. Due to space constraints, we did not report all the Spearman correlation coefficients in the main paper. It is worth noting that our observations hold: the best-performing metric is obtained with the AB-divergence, and the Fisher-Rao distance achieves good performance in many scenarios while having the advantage of being parameter-free.

a.5 Complete results on Data2text Generation

We gather in Table 3 the complete results on data2text generation. Due to space constraints, we did not report all the baselines in the main paper.

Correctness Data Coverage Fluency Relevance Text Structure
Metric
Correct 100.0 100.0 100.0 97.6 85.2 73.3 80.0 81.1 61.6 99.1 89.7 75.0 80.1 80.8 60.0
DataC 85.2 97.6 73.3 100.0 100.0 100.0 71.8 51.7 38.3 96.0 93.8 81.6 71.6 51.4 36.6
Fluency 81.1 80.0 61.6 71.8 51.7 38.3 100.0 100.0 100.0 77.0 61.4 46.6 99.5 99.7 98.3
Relev 89.7 99.1 75.0 96.0 93.8 81.6 77.0 61.4 46.6 100.0 100.0 100.0 77.2 61.1 45.0
TextS 80.8 80.1 60.0 71.6 51.4 36.6 99.5 99.7 98.3 77.2 61.1 45.0 100.0 100.0 100.0
88.8 89.3 76.6 81.8 82.6 70.0 86.6 92.0 76.6 89.8 87.9 73.3 86.6 91.4 75.0
88.8 89.3 76.6 81.8 82.6 70.0 86.6 92.0 76.6 89.8 87.9 73.3 86.6 91.4 75.0
81.4 50.0 71.6 48.4 79.7 65.0 44.8 84.7 76.6 49.3 72.3 60.0 48.0 83.8 75.0
75.2 33.8 61.6 32.4 53.8 40.0 22.7 83.5 73.3 32.2 57.9 45.0 25.6 83.2 71.6
67.0 21.9 56.6 21.6 37.9 33.3 11.9 75.2 58.3 20.1 43.8 38.3 14.8 75.5 60.0
63.2 33.0 46.6 30.4 36.4 26.6 67.6 65.0 46.6 29.1 49.1 35.0 67.2 65.2 46.6
89.7 86.0 75.0 78.7 70.5 51.6 93.3 95.7 85.3 87.6 84.4 70.0 92.4 93.8 81.6
JS 79.4 81.1 70.0 69.3 75.5 60.0 89.4 91.4 75.0 81.7 70.5 60.0 91.9 91.1 73.3
BertS 85.5 83.4 73.3 74.7 68.2 53.3 92.3 95.5 85.0 83.3 79.4 65.0 91.9 95.0 83.3
MoverS 84.1 84.1 73.3 78.7 66.2 53.3 91.2 92.1 78.3 82.1 77.4 65.0 90.1 91.4 76.3
BLEU 77.6 66.3 60.0 55.7 50.2 36.6 89.4 90.5 78.3 63.0 65.2 51.6 88.5 89.1 76.6
R-1 80.6 65.0 65.0 61.1 59.6 48.3 76.5 76.3 60.3 64.3 69.2 56.7 75.9 77.5 58.3
R-2 73.6 63.3 58.3 54.7 43.1 35.0 86.4 81.9 63.4 62.0 60.8 46.7 86.5 80.5 61.7
R-WE 60.9 73.4 60.0 40.2 58.2 40.1 61.4 84.7 61.3 49.9 64.1 48.3 60.2 85.9 60.0
METEOR 86.5 66.3 70.0 77.3 50.2 46.6 86.7 90.5 78.3 82.1 65.2 58.6 86.2 89.1 76.6
TER 79.6 78.3 58.0 69.7 58.2 38.0 89.1 93.5 80.0 75.0 70.2 77.6 89.5 91.1 78.6
Table 3: Correlation at the system level with human judgement along five different axes (correctness, data coverage, fluency, relevance and text structure) for the WebNLG task. The best results within each group are underlined; the overall best results are in bold.

a.6 Complete Results on Score Distribution

We gather in Figure 38 the score distribution on the summarization dataset. It is worth noting that for these experiments we have scaled the divergences between 0 and 1 and we have considered 1 - InfoLM.

Appendix B Additional details on the datasets

b.1 Summarization

The summarization dataset CNN can be found in https://github.com/neulab/REALSumm and no preprocessing has been applied. Specifically, the summaries have been generated using 14 abstractive systems zhong2020extractive; wang2020heterogeneous; zhong2019searching; liu2019text; zhou2018neural; narayan2018ranking; dong2019unified; kedzie2018content; zhou2018neural and 11 extractive systems see2017get; chen2018fast; raffel2019exploring; gehrmann-etal-2018-bottom; dong2019unified; liu2019text; lewis2019bart; yoon2020learning.
Pyramid Score. The pyramid score, inspired by work in reading comprehension (see beck1991revising), is a scoring procedure for assessing the semantic content of summaries. The manual method is based on the concept of Summary Content Units (SCUs), which group sentences from different summaries if they carry the same content.

b.2 Data2text

The goal of the WebNLG challenge is to develop efficient knowledge base verbalizers gardent2017creating; perez2016building (i.e., generation algorithms that can verbalise knowledge base fragments) and thus handle the complex interactions that can occur during the micro-planning phase when generating sentences ferreira2018enriching. Details on the WebNLG2020 task are given in ferreira20202020. The dataset is composed of generated sentences coming from 15 different systems using various approaches, such as symbolic approaches or neural-based systems. All data are freely available at https://gitlab.com/shimorina/webnlg-dataset/-/tree/master/release_v3.0 and no preprocessing has been applied. The dataset uses the RDF format, which is widely used in many datasets such as LinkedGeoData (http://linkedgeodata.org/About), FOAF (http://www.foaf-project.org/) or MusicBrainz (https://musicbrainz.org/); the WebNLG data are extracted from DBpedia auer2007dbpedia.

b.3 Limitations

We have evaluated our metrics on text-only datasets. Future work will also investigate the robustness of the metric to different types of text (e.g., spoken text dinkar2020importance; chapuis2021code; chapuis2020hierarchical), the extension to multimodal settings colombo2021improving; garcia2019token and other types of tasks (e.g., related to dialog colombo2021beam, affect-driven text generation colombo2019affect; witon2018disney and intent generation mehri2020example with dialog acts colombo2020guiding).

b.4 Metric Choices

Several revised versions of BLEU doddington2002automatic; galley2015deltableu and METEOR denkowski2014meteor; guo-hu-2019-meteor have been proposed in recent years. For our implementation we choose to use SACREBLEU sacrebleu. Similarly, a plethora of ROUGE extensions have been proposed (ganesan2018rouge; shafieibavani2018graph), but in our work we choose to work with ROUGE-1, ROUGE-2 and ROUGE-WE. For the BERT-based metrics, we choose to rely on the most popular ones, although several alternatives exist (e.g., SENTENCE-MOVER clark-etal-2019-sentence).

Figure 35: Results of the correlation between metrics and human judgement on the CNN dataset (panels (a)-(l): abstractive/extractive system outputs at the text and system level).
Figure 38: Score distributions considering a larger number of metrics than in Figure 11 ((a) abstractive and (b) extractive summarization systems).

b.5 Parameters Choices

For summarization (see Figure 35) we choose to work with the following parameters:

  • The temperature is set .

  • For , we choose for extractive and for abstractive.

  • For , we choose .

  • For , we choose and for extractive and and for abstractive.

For data2text generation (see Section A.5) we choose to work with the following parameters:

  • The temperature is set .

  • For , we choose

  • For , we choose

  • For , we choose

Takeaways. Although the parameters can be tuned to achieve better results, it is worth noting that the parameter-free Fisher-Rao distance achieves good results when used within InfoLM.

Appendix C Implementation details

c.1 Algorithm details

A complete algorithm for InfoLM is given in Algorithm 1.

1: Input: candidate text y of length L, reference text x of length M, measure of information I
2: for j = 1 to M do                         # compute p(· | x)
3:     mask position j of x and compute p(· | x_{-j}) with the PMLM
4: end for
5: for k = 1 to L do                         # compute p(· | y)
6:     mask position k of y and compute p(· | y_{-k}) with the PMLM
7: end for
8: aggregate the per-position distributions with the IDF weights to obtain p(· | x) and p(· | y)
9: Output: InfoLM(x, y) = I(p(· | x), p(· | y))
Algorithm 1 InfoLM
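Putting the pieces together, here is a runnable sketch of Algorithm 1 that reuses the helpers defined in the earlier snippets (tokenizer, masked_distribution, idf_weights, bag_of_distributions, fisher_rao); the uniform-weight fallback used when no corpus statistics are supplied is our simplification of the IDF weighting of Section 3.2:

```python
def infolm(reference, candidate, measure=fisher_rao, temperature=1.0,
           doc_freq=None, num_docs=0):
    """InfoLM(x, y) = I(p(.|x), p(.|y)); cf. Eq. (3) and Algorithm 1."""
    aggregated = []
    for text in (reference, candidate):
        tokens = tokenizer.tokenize(text)
        if doc_freq is not None:
            weights = idf_weights(tokens, doc_freq, num_docs)
        else:
            weights = [1.0 / len(tokens)] * len(tokens)  # uniform fallback weights
        aggregated.append(bag_of_distributions(tokens, weights, temperature))
    p_bar_x, p_bar_y = aggregated
    return measure(p_bar_x, p_bar_y).item()

print(infolm("the cat sat on the mat", "a cat was sitting on the mat"))
print(infolm("the cat sat on the mat", "quarterly revenue grew by ten percent"))
# Lower values indicate more similar texts for divergence/distance-type measures.
```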

c.2 Computational resources

For all the experiments, we use an NVIDIA Tesla P100 GPU to compute the BERT-based metrics; the running time is less than an hour. For the metrics based on string matching, we use a single CPU and the running time is also less than an hour.

c.3 Libraries

For this project, we relied on several open-source libraries. We thank the contributors for open-sourcing their libraries.

c.4 Negative Results

We tried removing stop words and other pre-processing techniques. Cleaning the candidate and gold reference texts brought little improvement, which might be attributed to BERT.