Evaluation metrics play a central role in the machine learning community. They direct the efforts of the research community and are used to define state-of-the-art models. In machine translation and summarization, the two most common metrics used for evaluating similarity between candidate and reference texts are BLEU (papineni2002bleu) and ROUGE (lin2004rouge). Both approaches rely on counting the n-grams in the candidate text that match n-grams in the reference text. BLEU is precision focused while ROUGE is recall focused. These metrics have serious limitations and have already been criticized by the academic community. In this work we formulate three criticisms of BLEU and ROUGE, establish criteria that a sound metric should satisfy, and propose concrete ways to use recent advances in NLP to design data-driven metrics that address the weaknesses found in BLEU and ROUGE.
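The n-gram counting described above can be sketched as follows. This is a minimal illustration, not the full BLEU or ROUGE definition: it assumes whitespace tokenization and lowercasing, and it leaves out BLEU's brevity penalty and geometric mean over n-gram orders as well as the ROUGE variants. The function names and the example pair are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(candidate, reference, n=1):
    """Return (precision, recall) of clipped n-gram matches.

    Precision divides the matches by the candidate n-gram count (BLEU-style);
    recall divides them by the reference n-gram count (ROUGE-style).
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    matches = sum((cand & ref).values())
    precision = matches / max(sum(cand.values()), 1)
    recall = matches / max(sum(ref.values()), 1)
    return precision, recall


# Hypothetical example pair, not taken from the paper.
precision, recall = ngram_overlap("the cat sat", "the cat sat on the mat", n=1)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this example every candidate word appears in the reference, so precision is 1.0, while only half of the reference words are covered, so recall is 0.5.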
2 Related Work
2.1 BLEU, ROUGE and n-gram matching approaches
BLEU (Bilingual Evaluation Understudy) (papineni2002bleu) and ROUGE (lin2004rouge) have been used to evaluate many NLP tasks for almost two decades. The general acceptance of these methods depends on many factors, including their simplicity and intuitive interpretability. Yet the main factor is the claim that they correlate highly with human judgment (papineni2002bleu). This claim has been criticised extensively in the literature, and the shortcomings of these methods have been widely studied. Reiter (reiter2018structured), in his structured review of BLEU, finds a low correlation between BLEU and human judgment. Callison et al (callison2006re) examine BLEU in the context of machine translation and find that BLEU correlates with human judgment on neither adequacy (whether the hypothesis sentence adequately captures the meaning of the reference sentence) nor fluency (the quality of language in a sentence). Sulem et al (sulem2018bleu) examine BLEU in the context of text simplification on grammaticality, meaning preservation and simplicity, and report that BLEU has very low or in some cases negative correlation with human judgment. Considering these results, it is a natural step to pursue new avenues for natural language evaluation, and with the advent of deep learning, using neural networks for this task is a promising step forward.
2.2 Transformers, BERT and GPT
Language modeling has become an important NLP technique thanks to its applicability to various NLP tasks, as explained in Radford et al (radford2019language). There are two leading architectures for language modeling: Recurrent Neural Networks (RNNs) (mikolov2010recurrent) and Transformers (vaswani2017attention). RNNs process the input tokens (words or characters) one by one through time to learn the relationships between them, whereas Transformers receive a segment of tokens and learn the dependencies between them using an attention mechanism.
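To make the contrast with step-by-step processing concrete, the sketch below computes scaled dot-product self-attention over a whole segment at once. It is a simplified illustration under stated assumptions: the query, key and value projections are taken to be the identity (real Transformers learn them and use several heads), and the token vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections.

    Each output is a weighted average of all input vectors, so every token
    sees the whole segment at once instead of one step at a time.
    """
    scale = math.sqrt(len(vectors[0]))
    outputs, all_weights = [], []
    for query in vectors:
        weights = softmax([dot(query, key) / scale for key in vectors])
        mixed = [sum(w * v[d] for w, v in zip(weights, vectors))
                 for d in range(len(query))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 2-dimensional token vectors for a three-token segment.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each row of `weights` sums to 1 and says how much one token draws on every other token in the segment.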
2.3 Model-based metrics
While BLEU and ROUGE are defined in a discrete space, new evaluation metrics can be defined in the continuous space of learned embeddings. BERTscore (zhang2019bertscore) uses word embeddings and cosine similarity to create a score array, and uses greedy matching to maximize the similarity score. Sentence Mover’s Similarity (clark2019sentence) uses a mover's distance (the Wasserstein distance) between sentence embeddings generated by averaging the word embeddings in a sentence. Both of these methods report stronger correlations with human judgment and better results than BLEU and ROUGE. While they use word embeddings (mikolov2013distributed) to map sentences into a continuous space, they still use distance metrics to evaluate those sentences. BLEND (ma2017blend), by contrast, uses an SVM to combine different existing evaluation metrics. Another proposed evaluation method is RUSE (shimanaka2018ruse), which embeds both sentences separately and pools them to a given size, then uses a pre-trained MLP to predict on different tasks. This quality estimator metric is then proposed for use in language evaluation. Our proposed methodology is to take neural language evaluation beyond architecture specifications: we propose a framework in which an evaluator's success can be determined.
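The cosine-similarity array and greedy matching attributed to BERTscore above can be sketched as follows. This is an illustration of the matching step only, under stated assumptions: the 2-dimensional vectors are hypothetical stand-ins for token embeddings produced by a pretrained model, and the function names are not taken from any library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_score(cand_vecs, ref_vecs):
    """Return (precision, recall, f1) from greedy cosine matching."""
    # Similarity array: rows are candidate tokens, columns are reference tokens.
    sim = [[cosine(c, r) for r in ref_vecs] for c in cand_vecs]
    # Each candidate token is matched to its most similar reference token.
    precision = sum(max(row) for row in sim) / len(cand_vecs)
    # Each reference token is matched to its most similar candidate token.
    recall = sum(max(row[j] for row in sim)
                 for j in range(len(ref_vecs))) / len(ref_vecs)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical embeddings; a real system would obtain them from a model.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
precision, recall, f1 = greedy_match_score(candidate, reference)
```

Unlike exact n-gram matching, a token with no identical counterpart still contributes a partial score through its nearest neighbour in embedding space.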
3 Challenges with BLEU and ROUGE
In this part, we discuss three significant limitations of BLEU and ROUGE. These metrics can assign high scores to semantically opposite translations/summaries, low scores to semantically related translations/summaries, and high scores to unintelligible translations/summaries.
3.1 High score, opposite meanings
Suppose that we have a reference summary s1. By adding a few negation terms to s1, one can create a summary s2 that is semantically opposite to s1 yet has a high BLEU/ROUGE score.
3.2 Low score, similar meanings
In addition to being insensitive to negation, BLEU and ROUGE can give low scores to sentences with equivalent meaning. If s2 is a paraphrase of s1, the meaning will be the same; however, the overlap between words in s1 and s2 will not necessarily be significant.
3.3 High score, unintelligible sentences
A third weakness of BLEU and ROUGE is that, in their simplest implementations, they are insensitive to word permutation and can give very high scores to unintelligible sentences. Let s1 be "On a morning, I saw a man running in the street." and s2 be "On morning a, I saw the running a man street". s2 is not an intelligible sentence. The unigram versions of ROUGE and BLEU will give these two sentences a score of 1.
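The insensitivity to word order can be checked directly on the two sentences above. The sketch below computes clipped n-gram precision only, assuming lowercasing and punctuation removal; it leaves aside the brevity penalty and recall, and the helper names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic word tokens only (punctuation dropped)."""
    return re.findall(r"[a-z]+", text.lower())


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference."""
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand, ref = grams(tokenize(candidate)), grams(tokenize(reference))
    # Counter intersection clips each match at the reference count.
    return sum((cand & ref).values()) / sum(cand.values())


s1 = "On a morning, I saw a man running in the street."
s2 = "On morning a, I saw the running a man street"

unigram = clipped_precision(s2, s1, n=1)
bigram = clipped_precision(s2, s1, n=2)
print(unigram, bigram)
```

Every word of s2 occurs in s1, so the unigram precision is 1.0, whereas only 2 of the 9 bigrams in s2 survive the shuffling. Higher-order n-grams therefore penalize the permutation, which is why the weakness is most visible in the simplest unigram variants.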
3.4.1 Experiments with carefully crafted sentences
To illustrate our argument, let’s consider the following pairs of sentences:
In Pair 1, s1 is "For the past two decades, the translation and summarization communities have used ROUGE and BLEU and these metrics have shown to be robust to criticism" and s2 is "For the past two decades, the translation and summarization communities have used ROUGE and BLEU and these metrics have shown not to be robust to criticism". They differ only by the negation added in s2.
In Pair 2, s1 is "On a morning, I saw a man running in the street." and s2 is "In the early hours of the day, I observed one gentleman jogging along the road". s2 is a paraphrase of s1.
| Pair | BLEU | ROUGE | Semantic similarity score |
|---|---|---|---|
| Pair 1 (Opposite sentences) | 0.90 | 0.975 | 1.65/5 |
| Pair 2 (Paraphrase) | 0.173 | 0.1875 | 4.35/5 |
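The direction of these results can be reproduced with a simple unigram overlap on the two pairs. The sketch below is not intended to recover the exact values in the table, which depend on the BLEU and ROUGE settings the authors used (not stated here); it assumes lowercasing and punctuation removal, and the function names are hypothetical.

```python
import re
from collections import Counter


def unigram_counts(text):
    """Lowercased word counts with punctuation removed."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def unigram_precision_recall(reference, hypothesis):
    """Clipped unigram precision and recall of hypothesis against reference."""
    ref, hyp = unigram_counts(reference), unigram_counts(hypothesis)
    matches = sum((ref & hyp).values())
    return matches / sum(hyp.values()), matches / sum(ref.values())


pairs = {
    "Pair 1 (negation)": (
        "For the past two decades, the translation and summarization "
        "communities have used ROUGE and BLEU and these metrics have shown "
        "to be robust to criticism",
        "For the past two decades, the translation and summarization "
        "communities have used ROUGE and BLEU and these metrics have shown "
        "not to be robust to criticism",
    ),
    "Pair 2 (paraphrase)": (
        "On a morning, I saw a man running in the street.",
        "In the early hours of the day, I observed one gentleman jogging "
        "along the road",
    ),
}

results = {name: unigram_precision_recall(s1, s2)
           for name, (s1, s2) in pairs.items()}
for name, (precision, recall) in results.items():
    print(f"{name}: precision={precision:.3f} recall={recall:.3f}")
```

The negated pair shares all 25 reference words (recall 1.0, precision 25/26), while the paraphrase shares only three words with its reference (precision 3/15, recall 3/11), even though its meaning is preserved.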
3.4.2 Semantic similarity experiments
To go beyond carefully crafted sentences, we assessed how well BLEU and ROUGE correlate with human judgment of similarity between pairs of paraphrased sentences, and compared their performance to a RoBERTa model fine-tuned for semantic similarity (Table 2).
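Measuring agreement with human judgment amounts to computing a correlation coefficient between metric scores and human ratings over the same sentence pairs. The sketch below uses the Pearson coefficient, which is an assumption since the text does not say which coefficient was used; the scores are hypothetical and are not results from this work.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical numbers for illustration only.
human_scores = [4.5, 1.0, 3.0, 2.0, 5.0]        # human ratings on a 0 to 5 scale
metric_scores = [0.30, 0.80, 0.40, 0.70, 0.20]  # scores from some automatic metric
r = pearson(human_scores, metric_scores)
print(f"r = {r:.3f}")
```

A coefficient near 1 means the metric ranks pairs as humans do; a value near zero or below, as in this made-up case, means the metric is uninformative or misleading about similarity.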
4 Towards a robust data-driven approach
4.1 Metric Scorecard
In designing new evaluation metrics for comparing reference summaries/translations to hypothesis ones, we established first-principles criteria for what a good evaluator should do. First, it should be highly correlated with human judgment of similarity. Second, it should be able to distinguish sentences that are in logical contradiction, logically unrelated, or in logical agreement. Third, a robust evaluator should be able to identify unintelligible sentences. Finally, a good evaluation metric should not give high scores to semantically distant sentences or low scores to semantically related sentences.
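One way to make the scorecard concrete is to record, for each reference/hypothesis pair, a similarity score, a logical relation and an intelligibility judgment, and then flag combinations that the criteria above rule out. The sketch below is purely illustrative: the class, the function and the thresholds of 4.0 and 2.0 are hypothetical choices and are not part of the proposed method.

```python
from dataclasses import dataclass


@dataclass
class ScorecardEntry:
    """Hypothetical record for one reference/hypothesis comparison."""
    similarity: float    # semantic similarity on a 0 to 5 scale
    relation: str        # "entailment", "neutral" or "contradiction"
    intelligible: bool   # whether the hypothesis is an acceptable sentence


def flag_problems(entry, high=4.0, low=2.0):
    """List the scorecard criteria violated by an entry.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    problems = []
    if entry.similarity >= high and entry.relation == "contradiction":
        problems.append("high score for contradictory sentences")
    if entry.similarity >= high and not entry.intelligible:
        problems.append("high score for an unintelligible sentence")
    if entry.similarity <= low and entry.relation == "entailment":
        problems.append("low score for sentences in agreement")
    return problems


# Hypothetical entry: a high similarity score assigned to a negated summary.
print(flag_problems(ScorecardEntry(4.6, "contradiction", True)))
```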
4.2 Implementing metrics satisfying scorecard
4.2.1 Semantic Similarity
Starting from the RoBERTa large pre-trained model (liu2019roberta), we fine-tune it to predict sentence similarity on the STS-B benchmark dataset. Given two sentences of text, s1 and s2, the system needs to compute how similar s1 and s2 are, returning a similarity score between 0 and 5. The dataset comprises naturally occurring pairs of sentences drawn from several domains and genres, annotated by crowdsourcing. The benchmark comprises 8628 sentence pairs, with 5749 pairs in the training set, 1500 in the development set and 1379 in the test set.
4.2.2 Logical Equivalence
For logical inference, we start with a pretrained RoBERTa (liu2019roberta) model and fine-tune it using the Multi-Genre Natural Language Inference Corpus (Williams et al., 2018), a crowdsourced collection of sentence pairs with textual entailment annotations. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis, contradicts the hypothesis, or neither (neutral). The training set includes 393k sentence pairs, the development set 20k and the test set 20k. The accuracy of the pre-trained model on the development set is 0.9060.
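The three-way prediction described above is typically read off a classifier's raw scores with a softmax followed by an argmax. The sketch below shows only that final step, under stated assumptions: the label order is hypothetical (each model checkpoint defines its own index-to-label mapping) and the logits are made-up values standing in for the output of a fine-tuned model.

```python
import math

# Assumed label order; real checkpoints define their own mapping.
LABELS = ("contradiction", "neutral", "entailment")


def classify(logits):
    """Turn three raw class scores into (label, probabilities) via softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Index of the most probable class.
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


# Hypothetical logits for one premise/hypothesis pair.
label, probs = classify([2.1, 0.3, -1.2])
print(label, [round(p, 3) for p in probs])
```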
4.2.3 Sentence Intelligibility
We start with a pretrained RoBERTa (liu2019roberta) model and fine-tune it using the Corpus of Linguistic Acceptability (CoLA). It consists of expert English sentence acceptability judgments drawn from 22 books. Each example is a single string of English words annotated with whether it is a grammatically possible sentence of English. The training set for CoLA has 10k sentences and the development set includes 1k sentences. The current model achieves 67.8 percent accuracy.
4.2.4 Rationale for Language Models
The overall rationale for using language models fine-tuned for specific aspects of the scorecard is that recent work has shown that language models are unsupervised multitask learners (radford2019language) and can rediscover the classical NLP pipeline. By fine-tuning them on a specific task, we make them pay attention to the correct level of abstraction corresponding to the scorecard.
In this work, we have shown three main limitations of BLEU and ROUGE and proposed a path forward, outlining why and how state-of-the-art language models can be used as summary evaluators. Future work includes extending the proposed scorecard, updating the models that best match the scorecard criteria, and assessing published summarization models using that scorecard.