1 Introduction
A plethora of natural language processing (NLP) applications perform text-to-text transformation
mellish1998evaluation; belz2006comparing; specia2018quality. Given an input, these systems are required to produce an output text that is coherent, readable and informative. Due to high annotation costs and time, researchers tend to rely on automatic evaluation to compare the outputs of such systems. Reference-based automatic evaluation relies on comparing a candidate text produced by the NLG system with one or several reference texts (‘gold standard’) created by a human annotator. Generic automatic evaluation of NLG is a major challenge, as it requires building a metric that evaluates the similarity between a candidate and one or several gold-standard reference texts. However, the definition of success criteria is task-specific: for example, evaluation of text summarization focuses on content, coherence, grammaticality, conciseness, and readability
mani2001automatic, whereas machine translation focuses on fidelity, fluency and adequacy of the translation hovy1999toward, and data2text generation gardent2017creating considers criteria such as data coverage, correctness and text structure.
Automatic text evaluation metrics fall into two categories: metrics that are trained to maximise their correlation with human annotations (e.g., RUSE shimanaka2018ruse, BEER stanojevicsimaan2014beer, BLEND ma2017blend) and untrained metrics (e.g., BLEU bleu, ROUGE lin2004rouge, BERTSCORE zhang2019bertscore, DepthScore staerman2021depth, BaryScore colombo2021automatic, MOVERSCORE zhao2019moverscore and Word Mover Distance kusner2015wmd). In this work, we focus on untrained metrics, as trained metrics may not generalize well to new data (existing labelled corpora are of small size). Two categories of untrained metrics can be distinguished: word- or character-based metrics that compute a score based on the string representation, and embedding-based metrics that rely on a continuous representation. String-based metrics (e.g., BLEU) often fail to robustly match paraphrases reiter2009investigation as they mainly focus on the surface form, as opposed to embedding-based metrics relying on continuous representations.
In this paper, we introduce InfoLM, a family of new untrained metrics to evaluate text summarization and data2text generation. At the highest level, InfoLM's key components include: (1) a pre-trained masked language model (PMLM
) that is used to compute two discrete probability distributions over the vocabulary. They represent the probability of observing each token of the vocabulary given the candidate and the reference sentence, respectively. (2) A contrast function
that is used to measure the dissimilarity between the aforementioned probability distributions. InfoLM differs from existing BERT-based metrics (e.g., BERTSCORE, MOVERSCORE) as it directly relies on the PMLM, which outputs discrete probability distributions. Thus, InfoLM neither requires arbitrarily selecting one or several specific layers (e.g., BERTSCORE relies on the 9th layer of bert-base-uncased), nor involves selecting an arbitrary aggregation technique (e.g., the power mean ruckle2018concatenated for MOVERSCORE). As InfoLM relies on statistics over tokens, it can also be seen as a string-based metric. However, it does not suffer from the common pitfalls of string-based metrics (e.g., synonyms, need for an exact string match), as the PMLM also allows one to assign a high score to paraphrases and to capture distant dependencies.
Contributions. Our contributions are summarized below:
(1) Novel metrics. We introduce InfoLM, a set of metrics to automatically evaluate summarization and data2text generation. InfoLM overcomes the common pitfalls of string-matching metrics and requires neither selecting a layer nor relying on an arbitrary aggregation function. It combines a pre-trained model and a contrast function between two discrete probability distributions. We explore different choices of contrast functions such as divergences, Lp distances and the Fisher-Rao distance.
(2) Tasks. First, we demonstrate on both summarization and data2text generation that InfoLM is better suited than competing metrics, through a comparison using multiple measures of correlation with human judgment at both the text and system level. Second, we dissect InfoLM to better understand the relative importance of each component (e.g., calibration, sensitivity to the choice of information measure).
2 Problem statement and related work
In this section, we first introduce notation and formulate the problem of evaluating both text generation systems and evaluation metrics. We then present the most relevant related work and existing approaches for the studied tasks.
2.1 Problem statement
NLG evaluation. We are given a dataset $\mathcal{D} = \{(x^i, y^{i,j})\}$, where $x^i$ is the $i$-th reference text, $y^{i,j}$ is the candidate text produced for the $i$-th input by the $j$-th NLG system, $N$ is the number of texts in the dataset and $S$ is the number of systems available. A reference text $x$ is composed of $M$ tokens (e.g., words or subwords) and a candidate text $y$ is composed of $L$ tokens (both the reference and the candidate can be composed of several sentences, as is the case in summarization). The set of tokens (vocabulary) is denoted by $\Omega$ and the set of possible texts by $\mathcal{X}$. We denote by $h(x, y)$ the score assigned by a human annotator to the candidate text $y$ when comparing it with the reference text $x$. We aim at building an evaluation metric $f$ such that $f(x, y)$ is close to $h(x, y)$.
Evaluating evaluation metrics. To assess the relevance of an evaluation metric $f$, correlation with human judgment is considered one of the most important criteria koehn2009statistical; specia2010machine; chatzikoumi2020evaluate. The debate on the relative merits of different correlations for the evaluation of automatic metrics is ongoing, but classical correlation measures are the Pearson leusch2003novel, Spearman melamed2003precision and Kendall kendall1938new coefficients. Two meta-evaluation strategies are commonly used: (1) text-level correlation and (2) system-level correlation. Formally, the text-level correlation is computed as follows:
$$\mathcal{C}_{\text{text}} = K\big(F_{\text{text}}, H_{\text{text}}\big), \qquad (1)$$
where $F_{\text{text}} = [f(x^i, y^{i,j})]_{i,j}$ and $H_{\text{text}} = [h(x^i, y^{i,j})]_{i,j}$ are the vectors composed of the scores assigned to each candidate text by the automatic metric and by the human annotator, respectively, and $K$ is the chosen correlation measure (e.g., Pearson, Kendall or Spearman). Similarly, the system-level correlation is obtained by
$$\mathcal{C}_{\text{sys}} = K\big(F_{\text{sys}}, H_{\text{sys}}\big), \qquad (2)$$
where $F_{\text{sys}}$ and $H_{\text{sys}}$ are the vectors composed of the per-system averaged scores assigned by $f$ and by the human annotator, respectively. For the significance analysis, we follow graham2014testing and use the Williams test to validate a significant improvement for dependent correlations steiger1980tests.
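As an illustration, the sketch below computes both meta-evaluation strategies with SciPy; the data layout (nested lists indexed by system and then text) and the exact pooling choices are assumptions made for the sake of the example, not the implementation used in our experiments.

# Sketch of the meta-evaluation protocol (assumed layout: metric_scores[j][i] and
# human_scores[j][i] hold the scores assigned to system j on text i).
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr


def text_level_correlation(metric_scores, human_scores, corr=pearsonr):
    # Correlation between the scores assigned to every (system, text) pair (Eq. 1).
    f_text = np.concatenate(metric_scores)
    h_text = np.concatenate(human_scores)
    return corr(f_text, h_text)[0]


def system_level_correlation(metric_scores, human_scores, corr=pearsonr):
    # Correlation between the per-system averaged scores (Eq. 2).
    f_sys = [np.mean(scores) for scores in metric_scores]
    h_sys = [np.mean(scores) for scores in human_scores]
    return corr(f_sys, h_sys)[0]

The same functions can be called with spearmanr or kendalltau to obtain the other coefficients.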
2.2 Existing metrics
String-based metrics
Two types of string-based metrics exist: n-gram matching and edit distance-based metrics. N-gram matching metrics count the number of n-grams in common between the candidate text and the reference text. The three most-used metrics are BLEU, ROUGE and METEOR banerjee2005meteor. If no n-gram is in common between the candidate and the reference, these metrics fail to produce meaningful scores. The second category gathers edit distance-based metrics, which count the number of basic operations such as edits, deletions and insertions to measure semantic equivalence. Variants include TER snover2006ter, CDER leuschetal2006cder, EED stanchev2019eed and CHARACTER wang2016character. Edit distance-based metrics do not handle synonyms and focus on the surface form. InfoLM can be seen as string-based but does not suffer from the aforementioned matching problem and can handle synonyms, as it relies on a PMLM.
Embedding-based metrics
Another class of metrics relies on word embeddings. These metrics either use static word embeddings such as word2vec word2vec or contextualized embeddings such as ELMO elmo, BERT bert and its variants sanh2019distilbert; liu2019roberta. Among the most popular metrics, we can mention MOVERSCORE, BERTSCORE, WMD kusner2015wmd and WMDO chowetal2019wmdo. Differently from these approaches, InfoLM relies on a language model and works with discrete probability distributions instead of continuous representations.
Learning-based metrics
Various trained metrics have been proposed, such as BEER, BLEND, RUSE and CIDER vedantam2015cider. These methods rely on train/dev/test sets composed of human evaluations. InfoLM does not require any training step and relies on a frozen PMLM.
PMLM as a metric.
To the best of our knowledge, using a PMLM (i.e., without further training) as a reference-based automatic metric remains overlooked. The closest use we found relies on autoregressive models, such as GPT-2 radford2019language, to compute the perplexity of the generated sentence and assess its fluency. Researchers have mainly focused on using the learnt embeddings of the PMLM. However, finding a reliable layer aggregation mechanism remains an open question (e.g., BERTSCORE arbitrarily selects a layer based on a chosen dataset, MOVERSCORE uses the last 5 layers). InfoLM addresses this aggregation issue by relying directly on the output distributions of the PMLM.
2.3 Masked language modelling
Language models based on masked language pretraining objectives bert; liu2019roberta aim at reconstructing a corrupted version of an input text by minimizing a cross-entropy loss. This corrupted context corresponds to a “local view” of the sentence. To ensure a fair comparison, we do not use existing alternatives (e.g., GPT-2-based models radford2019language), as concurrent works zhang2019bertscore; zhao2019moverscore rely on a PMLM.
3 InfoLM
In this section, we first introduce a novel family of metrics called InfoLM and then detail its different components.
Notations. We denote by $\theta$ the parameters of the PMLM, by $T$ its temperature and by $\mathcal{I}$ an information measure (see Section 3.3) which quantifies the similarity between two discrete distributions.
3.1 Motivations & Definitions
A PMLM has learnt the empirical distribution of a large text corpus. Given a text $x$, the corrupted context obtained by masking the token at position $j$ is denoted $x^{\setminus j}$; the LM predicts a distribution $p(\cdot \mid x^{\setminus j})$ over the vocabulary given the masked context. As an example, for a suitable masked input sentence, a pretrained model could place high probabilities on the tokens “food” and “meal” and a low probability on “the”. It is worth noting that $p(\cdot \mid x^{\setminus j})$ represents the probability of observing each token of the vocabulary given the masked input $x^{\setminus j}$.
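To make this concrete, the snippet below queries a PMLM at a masked position using the HuggingFace fill-mask pipeline; the example sentence is ours, and the actual top predictions and scores depend on the chosen model.

# Illustrative query of a PMLM at a masked position (hypothetical example sentence).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("I really enjoyed the [MASK] at that restaurant.", top_k=5):
    print(prediction["token_str"], round(prediction["score"], 3))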
Definition 3.1 (Equivalence for masked contexts).
Given $\epsilon > 0$, two masked contexts $x^{\setminus j}$ and $y^{\setminus k}$, from input texts $x$ and $y$, with masks at positions $j$ and $k$ respectively, are equivalent (denoted $x^{\setminus j} \sim y^{\setminus k}$) if the two predicted discrete distributions given by the PMLM, namely $p(\cdot \mid x^{\setminus j})$ and $p(\cdot \mid y^{\setminus k})$, are similar. Formally, $x^{\setminus j} \sim y^{\setminus k}$ if $\mathcal{I}\big(p(\cdot \mid x^{\setminus j}), p(\cdot \mid y^{\setminus k})\big) \leq \epsilon$.
Remark.
We have the intuition that two similar sentences will share several pairs of equivalent masked contexts. At this point, we make no claim on the relationship between equivalence and the masked context similarity.
In this work, we make the hypothesis that two similar sentences will share multiple equivalent masked contexts. However, pairwise comparison of all the individual masked contexts is prohibitively expensive ($M \times L$ comparisons) when considering long texts. Motivated by efficiency, we instead propose to work with two “global views” of the sentences that are well-formed probability distributions and are obtained through the aggregation of the individual PMLM predictions. The aggregated probabilities for $x$ and $y$ are denoted $\bar{p}_x$ and $\bar{p}_y$, respectively.
Definition 3.2 (Similarity for texts).
Given $\epsilon > 0$, two texts $x$ and $y$ are said to be similar (denoted $x \sim y$) if $\mathcal{I}(\bar{p}_x, \bar{p}_y) \leq \epsilon$, where $\bar{p}_x$ and $\bar{p}_y$ denote the aggregated individual masked context predictions.
3.2 InfoLM
Overview
InfoLM uses the notion of similarity given in Definition 3.2. Given a reference text $x$ together with a candidate text $y$, InfoLM successively masks each token position of both $x$ and $y$ to obtain the individual masked contexts. Relying on a PMLM, InfoLM predicts one distribution for each individual masked context. The resulting distributions are then averaged (we refer to this operation as a “bag of distributions”) to obtain $\bar{p}_x$ and $\bar{p}_y$. The final step involves comparing the two well-formed discrete probability distributions $\bar{p}_x$ and $\bar{p}_y$ through a measure of information $\mathcal{I}$. InfoLM writes as:
$$\text{InfoLM}(x, y) = \mathcal{I}\big(\bar{p}_x, \bar{p}_y\big). \qquad (3)$$
Remark.
It is worth emphasizing that $\bar{p}_y$ and $\bar{p}_x$ are two well-formed discrete probability distributions. They represent the probability of observing each token of the vocabulary given the candidate and the reference sentence, respectively.
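A minimal sketch of this computation is given below. It assumes a HuggingFace bert-base-uncased model and uniform aggregation weights (the IDF-weighted variant is described next), and it omits batching and other efficiency considerations, so it should be read as an illustration rather than our exact implementation.

# Minimal sketch of InfoLM's "bag of distributions" (uniform weights; illustrative only).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()


def bag_of_distributions(text, temperature=1.0):
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    distributions = []
    for j in range(1, len(ids) - 1):               # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[j] = tokenizer.mask_token_id        # mask position j
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, j]
        distributions.append(torch.softmax(logits / temperature, dim=-1))
    return torch.stack(distributions).mean(dim=0)  # averaged distribution over the vocabulary


def infolm(candidate, reference, contrast_fn):
    # contrast_fn is any information measure between two discrete distributions.
    return contrast_fn(bag_of_distributions(candidate), bag_of_distributions(reference))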
Aggregation Procedure
Rare tokens can be more indicative of text similarity than common tokens banerjee2005meteor. Thus, for the aggregation of the individual masked contexts, we propose to compute a weighted “bag of distributions” where the weights are normalized measures of the importance of each token. In practice, $\bar{p}_x$ and $\bar{p}_y$ write as:
$$\bar{p}_x = \sum_{k=1}^{M} \omega_k \, p(\cdot \mid x^{\setminus k}), \qquad \bar{p}_y = \sum_{k=1}^{L} \nu_k \, p(\cdot \mid y^{\setminus k}),$$
where $\nu_k$ and $\omega_k$ are measures of the importance of the $k$-th token in the candidate and the reference text, respectively, satisfying $\sum_{k=1}^{L} \nu_k = \sum_{k=1}^{M} \omega_k = 1$. These weights are computed using the inverse document frequency (IDF) scores determined at the corpus level zhao2019moverscore; kusner2015wmd.
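A possible implementation of this weighting is sketched below; the helper names are hypothetical, the IDF formula is one standard smoothed variant, and dists is the list of per-position distributions produced as in the previous sketch.

# Sketch of the IDF-weighted aggregation (hypothetical helpers, illustrative only).
import torch


def idf_weights(corpus_token_ids, vocab_size):
    # Smoothed inverse document frequency computed at the corpus level.
    n_docs = len(corpus_token_ids)
    doc_freq = torch.zeros(vocab_size)
    for doc in corpus_token_ids:
        for token_id in set(doc):
            doc_freq[token_id] += 1
    return torch.log((1.0 + n_docs) / (1.0 + doc_freq))


def weighted_bag(dists, token_ids, idf):
    # dists: per-position distributions of a text; token_ids: its token indices.
    weights = idf[torch.tensor(token_ids)]
    weights = weights / weights.sum()              # normalised importance weights
    return (weights.unsqueeze(1) * torch.stack(dists)).sum(dim=0)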
LM Calibration.
Modern deep neural networks are overconfident guo2017calibration. To recalibrate language models, several techniques have been proposed (e.g., entropy rates pmlrv119braverman20a, temperature scaling platt1999probabilistic, contextual calibration zhao2021calibrate). Here, we choose to study how calibration affects InfoLM by relying on temperature scaling (when $T \to 0$, one token receives all the probability mass; when $T \to \infty$, the distribution becomes uniform), motivated by its simplicity and speed.
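The illustrative snippet below shows the effect of the temperature on a softmax output: small temperatures concentrate the probability mass on the most likely token, while large temperatures flatten the distribution towards uniform (the logits are made up for the example).

# Effect of temperature scaling on a softmax distribution (made-up logits).
import torch

logits = torch.tensor([4.0, 2.0, 0.5, -1.0])
for temperature in (0.1, 1.0, 10.0, 100.0):
    probs = torch.softmax(logits / temperature, dim=-1)
    print(temperature, [round(float(p), 3) for p in probs])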
3.3 Information measures
In this work, we focus on comparing a pair of discrete probability distributions through information measures (see basseville2013divergence for an exhaustive study). We rely on two types of information measures: divergences and distances. A divergence is a measure of dissimilarity that is always non-negative and equal to zero if (and only if) the two considered distributions are identical. We call a distance a function that is symmetric, non-negative, satisfies the triangle inequality and is equal to zero if (and only if) the two considered distributions are identical. We use information measures that either belong to the Csiszár divergences or are distances.
Table 1: Name, notation, domain and closed-form expression of the information measures considered in this work: the divergence families discussed below and the $L_1$, $L_2$, $L_\infty$ and Fisher-Rao distances (the detailed expressions are not reproduced here).
Divergence measures
Various divergence measures have been proposed for a large variety of applications basseville2013divergence. The full expressions of the studied divergences can be found in Table 1. We focus here on three families of divergences: $\alpha$-divergences, $\gamma$-divergences and AB-divergences. Note that there exist other families of divergences, such as the Bregman divergences bregman1967relaxation, the $\beta$-divergences basu1998robust and the Chernoff divergence chernoff1952measure.
$\alpha$-Divergences.
These divergences were introduced by renyi1961measures and are a special case of the Csiszár divergences csiszar1967information. They are widely used in variational inference li2016r and are closely related to Rényi divergences but are not a special case thereof.
From Table 1 we note special cases of $\alpha$-divergences: (i) the Kullback-Leibler (KL) divergence is recovered by letting $\alpha \to 1$; (ii) the Hellinger distance hellinger1909neue follows by choosing $\alpha = 1/2$.
For this family, $\alpha$ weights the influence of the ratio between the two distributions.
$\gamma$-Divergences. This divergence was introduced by fujisawa2008robust as a scale-invariant modification of the robust $\beta$-divergences. (In our setting we work with normalised distributions, so scale invariance is not a mandatory property; it is nevertheless worth mentioning, as its absence could cause practical issues when optimising our metric.) For the $\gamma$-divergences, the parameter $\gamma$ controls the importance given to elements of small probability (e.g., outliers in some scenarios, tokens with low probability in our case): depending on its value, the importance of large probabilities is reduced, which gives more weight to the outliers. Special cases include the $L_2$ distance and the KL divergence (recovered in the limit $\gamma \to 0$).
AB-Divergences. The family of AB-divergences is flexible and allows one to control both the mass coverage and the robustness. cichocki2011generalized propose these divergences; as can be seen in Table 1, they have two parameters $\alpha$ and $\beta$, which allow tuning the mass coverage and the robustness independently. Particular choices of $(\alpha, \beta)$ recover, e.g., the $\alpha$- and $\beta$-divergences.
From information divergences to discriminations. For our application, we would like to produce a metric between two texts regardless of the source (system or human); thus we are interested in symmetric divergences, which are called discriminations. To obtain a discrimination, two tricks are commonly applied: Jeffrey's symmetrization, which averages $\mathcal{I}(p, q)$ and $\mathcal{I}(q, p)$, or Jensen's symmetrization, which averages $\mathcal{I}\big(p, \tfrac{p+q}{2}\big)$ and $\mathcal{I}\big(q, \tfrac{p+q}{2}\big)$. We choose Jeffrey's symmetrization as it does not require computing the mixture $\tfrac{p+q}{2}$; the divergences used in the remainder are symmetrized in this way.
Distances
$L_p$ distances. The $L_p$ distances can be used to measure the similarity between two distributions; we restrict ourselves to $p \in \{1, 2, \infty\}$.
Fisher-Rao distance. The Fisher-Rao distance is the geodesic distance rao1987differential between two distributions. Interestingly, this distance remains overlooked in the ML community but has recently been used to achieve robustness against adversarial attacks marine_rao.
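The sketch below gives possible implementations of a few of these measures for two discrete distributions p and q (1-D arrays summing to one); the formulas follow standard textbook definitions and may differ from the exact parametrisations of Table 1.

# Possible implementations of a few contrast measures between discrete distributions
# p and q (1-D numpy arrays that sum to one); standard definitions, not necessarily
# the exact parametrisations used in Table 1.
import numpy as np

EPS = 1e-12


def kl(p, q):
    return float(np.sum(p * np.log((p + EPS) / (q + EPS))))


def jeffrey_symmetrization(divergence, p, q):
    # Jeffrey-style symmetrization: average of div(p, q) and div(q, p).
    return 0.5 * (divergence(p, q) + divergence(q, p))


def lp_distance(p, q, order=2):
    # L1, L2 or L-infinity distance (pass order=np.inf for the latter).
    return float(np.linalg.norm(p - q, ord=order))


def fisher_rao(p, q):
    # Geodesic (Rao) distance between two categorical distributions:
    # 2 * arccos of the Bhattacharyya coefficient.
    bc = np.clip(np.sum(np.sqrt(p * q)), 0.0, 1.0)
    return float(2.0 * np.arccos(bc))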
Connection with string matching metrics.
First, let us consider two texts $x$ and $y$ such that $x \sim y$. InfoLM is then close to $0$, since $\bar{p}_x \approx \bar{p}_y$: all tokens that are likely (according to the PMLM) when considering $x$ are also likely when considering $y$. For string matching metrics, this corresponds to a perfect match between $x$ and $y$. Second, let us consider $x$ and $y$ such that $x \not\sim y$ (dissimilar texts) and a measure of information that relies on the product of $\bar{p}_x$ and $\bar{p}_y$ (e.g., Fisher-Rao). In this case, the product is close to zero for every token: all tokens that are likely when considering $x$ are unlikely when considering $y$ (and conversely). For string matching metrics, this corresponds to no match among the substrings of $x$ and $y$.
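A toy numerical illustration of this argument, with a made-up three-token vocabulary: identical aggregated distributions give a Fisher-Rao distance of zero (the analogue of a perfect string match), while distributions with disjoint supports give the maximal value.

# Toy illustration of the string-matching analogy (made-up three-token vocabulary).
import numpy as np


def fisher_rao(p, q):
    return 2.0 * np.arccos(np.clip(np.sum(np.sqrt(p * q)), 0.0, 1.0))

p_same = np.array([0.7, 0.2, 0.1])
p_disjoint = np.array([0.9, 0.1, 0.0])
q_disjoint = np.array([0.0, 0.0, 1.0])

print(fisher_rao(p_same, p_same))          # 0.0 -> analogue of a perfect match
print(fisher_rao(p_disjoint, q_disjoint))  # pi  -> no token is likely under both texts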
4 Experimental Frameworks
In this section, we describe our experimental setting: we present the tasks and the baseline metrics used for each task.
4.1 Text summarization
Text summarization aims at compressing long texts into fluent, short sentences that preserve the salient information.
Datasets. To compare the different metrics, previous work bhandari2020re; fabbri2020summeval relies either on the TAC datasets dang2008overview; mcnamee2009overview or on new summarization datasets extracted from CNN/DailyMail nallapati2016abstractive. As pointed out by peyrard2019studying; bhandari2020re, the TAC datasets are old and contain flaws (e.g., the systems used to generate the summaries were of poor quality); we therefore choose to work with the newly assembled CNN/DailyMail dataset proposed in bhandari2020re. This dataset gathers 11,490 summaries, and annotations are carried out using the pyramid method nenkova2004evaluating.
Metrics. For text summarization, perhaps the best-known metrics are ROUGE and its extensions ng2015better, or METEOR and its variants denkowski2014meteor. Recently, a new set of metrics (e.g., BERTSCORE, MOVERSCORE) has been applied to text summarization.
4.2 Data2Text generation
Prior works mainly rely on two task-oriented dialogue datasets (i.e., BAGEL mairesse2010phrase and SFHOTEL wen2015semantically). As the sentences generated in these datasets are unlikely to be representative of the progress of recent NLG systems, we instead rely on a different dataset coming from the WebNLG 2020 challenge gardent2017creating. Given the following example of triples: (John_Blaha birthDate 1942_08_26) (John_Blaha birthPlace San_Antonio) (John_E_Blaha job Pilot), the goal is to generate “John Blaha, born in San Antonio on 1942-08-26, worked as a pilot.”
Annotations. The WebNLG task is evaluated by human annotators along five different axes: (1) Data coverage: are all the descriptions presented in the data included in the text? (2) Relevance: does the text contain only predicates found in the data? (3) Correctness: are predicates found in the data correctly mentioned and adequately introduced? (4) Text structure: is the produced text well-structured, grammatically correct and written in acceptable English? (5) Fluency: does the text progress naturally? Is it easy to understand? Is it a coherent whole?
Metrics. For this task, the organisers rely on untrained metrics (e.g., BLEU, METEOR, TER, BERTSCORE) to compare the performance of the candidate systems. Thus, we focus on system-level correlation.
5 Numerical Results
In this section, we study the performance of InfoLM on both text summarization and data2text generation.
5.1 Results on Text Summarization
General Analysis. Figure 35 gathers the results of the correlation study between the scores produced by the different metrics and human judgement (i.e., the pyramid score). We are able to reproduce the results from bhandari2020re. We observe a different behaviour depending on the type of systems to be evaluated (e.g., abstractive or extractive) and the chosen correlation coefficient. InfoLM with its best-performing information measures outperforms other BERT-based metrics such as MOVERSCORE or BERTSCORE (it is worth noting that both of these metrics perform poorly at the text and system level when considering outputs from extractive systems). It also largely outperforms n-gram matching metrics (e.g., the ROUGE metrics) on all datasets when measuring correlation with the Kendall coefficient, and in almost all configurations (except abstractive outputs at the system level) when using the Pearson coefficient. It is worth noting the overall good performance of the parameter-free Fisher-Rao distance.
Choice of information geometric measure for InfoLM. In Figure 35, we can observe two groups of measures with different global behaviours. The first group leads to poor performance in many configurations; within this group, the good performance of the $L_\infty$ distance in some configurations is surprising, as it is extremely selective (it only retains the maximum over the vocabulary). Since the output produced by the PMLM is sparse, this maximum corresponds to a single word that is likely in one sentence and not likely at all in the other. The second group gathers the remaining measures and achieves the best performance overall. Within this group, two measures achieve similar performance, suggesting that the flexibility (e.g., robustness to outliers) introduced by the additional parameter is not useful in our task; this observation is strengthened by the lower performance of the more robust variant. The remaining difference in results is due to the flexibility introduced by the parameter controlling the relative importance of the ratio between the two distributions, which can be interpreted in our case as the ability to control the importance attributed to less likely words jalalzai2020heavy.
Takeaways. The best-performing metric is obtained with one of the parametric divergence families. The Fisher-Rao distance achieves good performance in many scenarios and has the advantage of being parameter-free.
5.2 Results on Data2Text
Global Analysis. Table 3 gathers the results of the correlation analysis of the metrics with human judgements along the five different axes. We observe that the five annotation criteria are not independent: text structure and fluency achieve a strong correlation with each other. Additionally, all metrics achieve similar results when the correlation is computed on these two criteria. We observe that the best-performing group of metrics is based on InfoLM, followed by metrics based on continuous BERT representations (i.e., MOVERSCORE and BERTSCORE), followed by n-gram matching metrics.
Regarding correctness, data coverage and relevance, we observe that the best-performing InfoLM measures achieve the best results on almost all correlation coefficients. On data coverage, InfoLM achieves a substantial improvement in correlation compared to both BERT-based and n-gram matching metrics. Regarding fluency and text structure, the Fisher-Rao distance works best and slightly outperforms the second-best performing metric, namely BERTSCORE.
Takeaways. Similarly to summarisation, we observe very low correlations for some measures. We also observe that the most robust divergence family achieves lower results than the other two, suggesting that, as noticed for summarisation, robustness to unlikely words (i.e., outliers) is less relevant for our task.
Correctness  Data Coverage  Fluency  Relevance  Text Structure  
Metric  
Correct  100.0  100.0  100.0  97.6  85.2  73.3  80.0  81.1  61.6  99.1  89.7  75.0  80.1  80.8  60.0 
DataC  85.2  97.6  73.3  100.0  100.0  100.0  71.8  51.7  38.3  96.0  93.8  81.6  71.6  51.4  36.6 
Fluency  81.1  80.0  61.6  71.8  51.7  38.3  100.0  100.0  100.0  77.0  61.4  46.6  99.5  99.7  98.3 
Relev  89.7  99.1  75.0  96.0  93.8  81.6  77.0  61.4  46.6  100.0  100.0  100.0  77.2  61.1  45.0 
TextS  80.8  80.1  60.0  71.6  51.4  36.6  99.5  99.7  98.3  77.2  61.1  45.0  100.0  100.0  100.0 
88.8  89.3  76.6  81.8  82.6  70.0  86.6  92.0  76.6  89.8  87.9  73.3  86.6  91.4  75.0  
88.8  89.3  76.6  81.8  82.6  70.0  86.6  92.0  76.6  89.8  87.9  73.3  86.6  91.4  75.0  
81.4  50.0  71.6  48.4  79.7  65.0  44.8  84.7  76.6  49.3  72.3  60.0  48.0  83.8  75.0  
75.2  33.8  61.6  32.4  53.8  40.0  22.7  83.5  73.3  32.2  57.9  45.0  25.6  83.2  71.6  
89.7  86.0  75.0  78.7  70.5  51.6  93.3  95.7  85.3  87.6  84.4  70.0  92.4  93.8  81.6  
JS  79.4  81.1  70.0  69.3  75.5  60.0  89.4  91.4  75.0  81.7  70.5  60.0  91.9  91.1  73.3 
BertS  85.5  83.4  73.3  74.7  68.2  53.3  92.3  95.5  85.0  83.3  79.4  65.0  91.9  95.0  83.3 
MoverS  84.1  84.1  73.3  78.7  66.2  53.3  91.2  92.1  78.3  82.1  77.4  65.0  90.1  91.4  76.3 
BLEU  77.6  66.3  60.0  55.7  50.2  36.6  89.4  90.5  78.3  63.0  65.2  51.6  88.5  89.1  76.6 
R1  80.6  65.0  65.0  61.1  59.6  48.3  76.5  76.3  60.3  64.3  69.2  56.7  75.9  77.5  58.3 
METEOR  86.5  66.3  70.0  77.3  50.2  46.6  86.7  90.5  78.3  82.1  65.2  58.6  86.2  89.1  76.6 
TER  79.6  78.3  58.0  69.7  58.2  38.0  89.1  93.5  80.0  75.0  70.2  77.6  89.5  91.1  78.6 
5.3 Further Analysis
Correlation between metrics
In this experiment, we complete our global analysis by comparing the scores obtained by the different metrics with each other. We want to understand how different our metric is from the other metrics and how the choice of information geometric measure affects the predictions. Figure 10 gathers the results of the experiment. We observe a high correlation between the InfoLM variants that consider the product of the two aggregated distributions. Interestingly, we observe a lower correlation between these variants and BERTSCORE or n-gram matching metrics (e.g., ROUGE), whereas BERTSCORE achieves a stronger correlation with ROUGE.
Takeaways. Through the correlation analysis in Figure 10, we observe the impact of the different geometries on InfoLM predictions. The analysis shows that the predictions of InfoLM under these measures are highly correlated with each other and, as illustrated by the previous experiments, achieve high correlation with human judgement, which we believe validates our approach. It is worth noting that the Fisher-Rao distance requires no tuning, as it is parameter-free.
Score Distributions
In Figure 11, we study the text-level score distributions of the different metrics on abstractive summaries. The ideal metric would mimic the human score distribution and be able to distinguish between good- and bad-quality summaries. The results show that ROUGE and BLEU struggle to distinguish between good-quality and low-quality summaries, which had already been reported in peyrard2019studying. We observe that several InfoLM measures are able to make the distinction. Interestingly, as $p$ increases and the $L_p$ distances become more selective (i.e., they focus on one word only), the distances struggle to distinguish low- from high-scoring summaries.
Takeaways. InfoLM, when combined with suitable information measures, is able to distinguish high-scoring from low-scoring summaries.
Temperature Calibration
To study the impact of calibration, we choose to work at the system level and report in Figure 12 the correlations achieved, as measured by the different coefficients. We limit our study to the Fisher-Rao distance as it is a parameter-free metric and is among the best-performing measures of InfoLM. Due to space constraints, we report the results on extractive systems only (complete results for both types of systems can be found in Figure 15).
Takeaways. Fisher-Rao only considers the product of the two distributions; thus, as the temperature $T$ increases and the predicted probabilities of the PMLM become more uniform, more words are considered and the aggregated distributions become richer in terms of covered words. It is worth noting that when changing the temperature we observe a smooth change in correlation, with an optimum reached at an intermediate temperature. This suggests that InfoLM benefits from a PMLM that is not too selective (the low-temperature case). For a specific application, the temperature of InfoLM can be tuned to improve correlation, and InfoLM will likely benefit from a well-calibrated PMLM.
6 Summary and Concluding Remarks
In this work, we presented InfoLM, which does not require training and is among the first metrics to compute the similarity between two discrete probability distributions over the vocabulary (in the spirit of string-based metrics) while also leveraging recent advances in language modelling through a PMLM. Our experiments on both summarization and data2text generation demonstrate the validity of our approach. Among the available contrast measures, the Fisher-Rao distance is parameter-free and thus easy to use in practice, while the parametric divergences achieve better results but require selecting their parameters. Future work includes extending our metrics to new tasks such as translation, image captioning and paraphrase detection, as well as investigating other information geometric measures (e.g., the Wasserstein distance staerman2021ot, other types of divergences, and mutual information colombo2021novel) to obtain better correlation with different attributes.
7 Acknowledgments
This work was also granted access to the HPC resources of IDRIS under the allocation 2021AP010611665 as well as under the project 2021101838 made by GENCI.
References
Appendix A Additional Experimental Results
In this section, we report the additional experiments we conducted. Because of space constraints, we did not report the sensitivity analysis (see Section A.2) in the main paper.
A.1 Role of calibration
The complete results on the role of calibration can be found in Figure 15. We notice a similar behaviour for InfoLM on both extractive and abstractive systems: there is a smooth variation in performance, with a single maximum reached at an intermediate temperature.
A.2 Choice of the divergence parameters
In this experiment, we aim at quantifying the sensitivity of InfoLM to the choice of the two divergence parameters. Figure 16 gathers the results of the analysis. We observe that a change in one of the two parameters induces a stronger change in the metric than a change in the other, and that lower values of that parameter give better results. We also note that the variation with respect to both parameters is smooth. Interestingly, for abstractive systems a low value of the other parameter should be chosen, whereas for extractive systems a higher value is better. This suggests that for evaluating abstractive systems the metric should focus on words that are probable for both the candidate and the reference text, whereas for extractive systems the attention should be focused on words that are likely in only one of the two texts.
Takeaways. Low values of one parameter lead to better results, and the optimal value of the other differs between abstractive and extractive systems. Good parameter combinations achieve consistently high performance across the different correlation coefficients.
A.3 Statistical analysis
Automatic metrics are used in the WebNLG challenge to compare the different systems. To evaluate whether the observed improvement in correlation is significant, we report the results of William’s Significance test in Figure 22.
Takeaways: (i) regarding correctness and relevance, InfoLM is a suitable choice that is significantly better than the other metrics; (ii) regarding text structure, InfoLM is significantly better and compares favourably against all metrics except MOVERSCORE for automatic fluency evaluation; (iii) regarding data coverage, METEOR achieves good results; however, a significant difference is only observed with respect to BERTSCORE.
A.4 Complete results on summarization
We gather in Figure 35 the complete results on summarization. Due to space constraints, we did not report all the Spearman correlation coefficients in the main paper. It is worth noting that our observations hold: the best-performing metric is obtained with one of the parametric divergence families, and the Fisher-Rao distance achieves good performance in many scenarios while having the advantage of being parameter-free.
A.5 Complete results on data2text generation
We gather in Table 3 the complete results on data2text generation. Due to space constraints, we did not report all the baselines in the main paper.
Correctness  Data Coverage  Fluency  Relevance  Text Structure  
Metric  
Correct  100.0  100.0  100.0  97.6  85.2  73.3  80.0  81.1  61.6  99.1  89.7  75.0  80.1  80.8  60.0 
DataC  85.2  97.6  73.3  100.0  100.0  100.0  71.8  51.7  38.3  96.0  93.8  81.6  71.6  51.4  36.6 
Fluency  81.1  80.0  61.6  71.8  51.7  38.3  100.0  100.0  100.0  77.0  61.4  46.6  99.5  99.7  98.3 
Relev  89.7  99.1  75.0  96.0  93.8  81.6  77.0  61.4  46.6  100.0  100.0  100.0  77.2  61.1  45.0 
TextS  80.8  80.1  60.0  71.6  51.4  36.6  99.5  99.7  98.3  77.2  61.1  45.0  100.0  100.0  100.0 
88.8  89.3  76.6  81.8  82.6  70.0  86.6  92.0  76.6  89.8  87.9  73.3  86.6  91.4  75.0  
88.8  89.3  76.6  81.8  82.6  70.0  86.6  92.0  76.6  89.8  87.9  73.3  86.6  91.4  75.0  
81.4  50.0  71.6  48.4  79.7  65.0  44.8  84.7  76.6  49.3  72.3  60.0  48.0  83.8  75.0  
75.2  33.8  61.6  32.4  53.8  40.0  22.7  83.5  73.3  32.2  57.9  45.0  25.6  83.2  71.6  
67.0  21.9  56.6  21.6  37.9  33.3  11.9  75.2  58.3  20.1  43.8  38.3  14.8  75.5  60.0  
63.2  33.0  46.6  30.4  36.4  26.6  67.6  65.0  46.6  29.1  49.1  35.0  67.2  65.2  46.6  
89.7  86.0  75.0  78.7  70.5  51.6  93.3  95.7  85.3  87.6  84.4  70.0  92.4  93.8  81.6  
JS  79.4  81.1  70.0  69.3  75.5  60.0  89.4  91.4  75.0  81.7  70.5  60.0  91.9  91.1  73.3 
BertS  85.5  83.4  73.3  74.7  68.2  53.3  92.3  95.5  85.0  83.3  79.4  65.0  91.9  95.0  83.3 
MoverS  84.1  84.1  73.3  78.7  66.2  53.3  91.2  92.1  78.3  82.1  77.4  65.0  90.1  91.4  76.3 
BLEU  77.6  66.3  60.0  55.7  50.2  36.6  89.4  90.5  78.3  63.0  65.2  51.6  88.5  89.1  76.6 
R1  80.6  65.0  65.0  61.1  59.6  48.3  76.5  76.3  60.3  64.3  69.2  56.7  75.9  77.5  58.3 
R2  73.6  63.3  58.3  54.7  43.1  35.0  86.4  81.9  63.4  62.0  60.8  46.7  86.5  80.5  61.7 
RWE  60.9  73.4  60.0  40.2  58.2  40.1  61.4  84.7  61.3  49.9  64.1  48.3  60.2  85.9  60.0 
METEOR  86.5  66.3  70.0  77.3  50.2  46.6  86.7  90.5  78.3  82.1  65.2  58.6  86.2  89.1  76.6 
TER  79.6  78.3  58.0  69.7  58.2  38.0  89.1  93.5  80.0  75.0  70.2  77.6  89.5  91.1  78.6 
A.6 Complete results on score distributions
We gather in Figure 38 the score distributions on the summarization dataset. It is worth noting that for these experiments we have scaled the divergences between 0 and 1 and considered 1 - InfoLM.
Appendix B Additional details on the datasets
B.1 Summarization
The CNN/DailyMail summarization dataset can be found at https://github.com/neulab/REALSumm and no preprocessing has been applied. Specifically, the summaries have been generated using 14 abstractive systems zhong2020extractive; wang2020heterogeneous; zhong2019searching; liu2019text; zhou2018neural; narayan2018ranking; dong2019unified; kedzie2018content; zhou2018neural and 11 extractive systems see2017get; chen2018fast; raffel2019exploring; gehrmannetal2018bottom; dong2019unified; liu2019text; lewis2019bart; yoon2020learning.
Pyramid Score. The pyramid score, which was inspired by work in reading comprehension (see beck1991revising), is a scoring procedure to assess semantic content in the scope of summarization. The manual method is based on the concept of Summary Content Units (SCUs), which group sentences from different summaries if they carry the same content.
B.2 Data2text
The goal of the WebNLG challenge is to develop efficient knowledge base verbalizers gardent2017creating; perez2016building (i.e., generation algorithms that can verbalise knowledge base fragments) and thus handle the complex interactions that can occur during the micro-planning phase when generating sentences ferreira2018enriching. Details on the WebNLG 2020 task are given in ferreira20202020. The dataset is composed of generated sentences coming from 15 different systems using various approaches, such as symbolic approaches or neural-based systems. All data are freely available at https://gitlab.com/shimorina/webnlg-dataset/-/tree/master/release_v3.0 and no preprocessing has been applied. The dataset uses the RDF format, which is widely used for many datasets such as LinkedGeoData http://linkedgeodata.org/About, FOAF http://www.foaf-project.org/ or MusicBrainz https://musicbrainz.org/, and the triples are extracted from DBpedia auer2007dbpedia.
B.3 Limitations
We have evaluated our metrics on text-only datasets. Future work will also investigate the robustness of the metric to the type of text (e.g., spoken text dinkar2020importance; chapuis2021code; chapuis2020hierarchical), the extension to multimodal settings colombo2021improving; garcia2019token, and other types of tasks (e.g., dialogue colombo2021beam, affect-driven text generation colombo2019affect; witon2018disney, and intent generation mehri2020example with dialogue acts colombo2020guiding).
B.4 Metric Choices
Several revised versions of BLEU doddington2002automatic; galley2015deltableu and METEOR denkowski2014meteor; guohu2019meteor have been proposed in recent years. For our implementation we choose to use SACREBLEU sacrebleu. Similarly, a plethora of ROUGE extensions have been proposed (ganesan2018rouge; shafieibavani2018graph), but in our work we choose to work with ROUGE-1, ROUGE-2 and ROUGE-WE. For the BERT-based metrics, we choose to rely on the most popular ones, although several alternatives exist (e.g., SENTENCEMOVER clarketal2019sentence).
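For reference, a minimal SacreBLEU call is sketched below; the candidate and reference strings simply reuse the WebNLG example from Section 4.2 for illustration.

# Minimal SacreBLEU usage (illustrative candidate and reference).
import sacrebleu

candidates = ["John Blaha, born in San Antonio, worked as a pilot."]
references = [["John Blaha, born in San Antonio on 1942-08-26, worked as a pilot."]]
print(sacrebleu.corpus_bleu(candidates, references).score)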
B.5 Parameter Choices
For summarization (see Figure 35) we choose to work with the following parameters:
- the temperature of the PMLM;
- the parameter of each single-parameter divergence, with separate values for extractive and abstractive systems where beneficial;
- the two parameters of the AB-divergence, with one pair of values for extractive systems and another for abstractive systems.
For data2text generation (see Section A.5) we choose to work with the following parameters:
- the temperature of the PMLM;
- the parameter of each single-parameter divergence;
- the two parameters of the AB-divergence.
Takeaways. Although the parameters can be tuned to achieve better results, it is worth noting that the parameter-free Fisher-Rao distance achieves good results when used within InfoLM.
Appendix C Implementation details
C.1 Algorithm details
A complete algorithm for InfoLM is given in Algorithm 1.
C.2 Computational resources
For all the experiments we use an NVIDIA Tesla P100 GPU to compute the BERT-based metrics; the running time is less than an hour. For metrics based on string matching, we use a single CPU and the running time is also less than an hour.
C.3 Libraries
Among the libraries used for this project, we can cite:

Transformers from hugging_face.

SummEval, which can be found at https://github.com/Yale-LILY/SummEval and was proposed in fabbri2020summeval, for the CNN/DailyMail dataset.

RealSumm which can be found here https://github.com/neulab/REALSumm and has been proposed in bhandari2020re for some metrics.

Pytorch pytorch for the GPU support.

The implementation of MOVERSCORE can be found at https://github.com/AIPHES/emnlp19-moverscore.

The implementation of BERTSCORE can be found at https://github.com/Tiiiger/bert_score.

SacreBLEU for BLEU implementation sacrebleu.

The Williams test implementation is taken from the author's code and is available at https://github.com/ygraham/nlp-williams.
We thank the contributors for open-sourcing their libraries.
C.4 Negative Results
We tried removing stop words and applying other preprocessing techniques. The little improvement observed when cleaning the candidate and gold reference texts might be attributed to BERT.