BERTScore for Language Generation
We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of looking for exact matches, we compute token similarity using contextualized BERT embeddings. We evaluate on several machine translation and image captioning benchmarks and show that BERTScore correlates better with human judgments than existing metrics, often significantly outperforming even task-specific supervised metrics.
Automatic evaluation of natural language generation, for example in machine translation and caption generation, requires comparing candidate sentences to annotated references. The goal is to evaluate the semantic equivalence of the candidates and references. However, common methods rely on surface-form similarity only. For example, Bleu (bleu), the most common machine translation metric, simply counts n-gram overlap between the candidate and the annotated reference. While this provides a simple and general measure, it fails to capture much of the lexical and compositional diversity of natural language.
In this paper, we focus on sentence-level generation evaluation and introduce BERTScore, an evaluation metric based on pre-trained BERT contextual embeddings (bert). BERTScore computes the similarity between two sentences as a weighted aggregation of cosine similarities between their tokens.
BERTScore addresses three common pitfalls of n-gram-based methods (meteor). First, n-gram-based methods use exact string matching (e.g., in Bleu) or a cascade of matching heuristics (e.g., in Meteor (meteor)), and fail to robustly match paraphrases. For example, given the reference people like foreign cars, metrics like Bleu or Meteor incorrectly give a higher score to people like visiting places abroad than to consumers prefer imported cars. This is because the metrics fail to correctly identify paraphrased words, which leads to underestimation of performance when semantically-correct phrases are penalized for differing from the surface form of the reference sentence. In contrast to string matching, we compute cosine similarity using contextualized token embeddings, which have been shown to be effective for paraphrase detection (bert). Second, n-gram-based methods do not distinguish between tokens that are important or unimportant to the sentence meaning. For example, given the reference a child is playing, both the child is playing and a child is singing receive the same Bleu score. This often leads to performance overestimation, especially for models with strong language models that correctly generate function words. Instead of treating all tokens equally, we introduce a simple importance weighting scheme to emphasize words of higher significance to the sentence meaning. Finally, n-gram models fail to capture distant dependencies and penalize semantically-critical ordering changes (Isozaki10:autoeval). For example, given a small window, Bleu will only mildly penalize the swapping of cause and effect (e.g., A because B instead of B because A), especially when the arguments A and B are long phrases. In contrast, contextualized embeddings are trained to effectively capture distant dependencies and ordering in all the involved token embeddings.
We experiment with BERTScore on machine translation and image captioning tasks using multiple systems, correlating BERTScore and related metrics with available human judgments. Our experiments demonstrate that BERTScore correlates highly with human evaluations of the quality of machine translation and image captioning systems. In machine translation, BERTScore correlates better with segment-level human judgment than existing metrics on the common WMT17 benchmark (wmt17em), including outperforming metrics learned specifically for this dataset. We also show that BERTScore is well correlated with human annotators for image captioning, surpassing Spice (spice), a popular task-specific metric, on the twelve systems participating in the 2015 COCO Captioning Challenge (coco). Finally, we test the robustness of BERTScore on the adversarial paraphrase dataset PAWS (paws), and show that it is more robust to adversarial examples than other metrics. BERTScore is available at github.com/Tiiiger/bert_score.
Natural language text generation is commonly evaluated against annotated reference sentences. Given a reference sentence x and a candidate sentence x̂, a generation evaluation metric is a function that maps x and x̂ to a real number. The goal is to assign higher scores to sentences that are preferred by human judgments. Existing metrics can be broadly categorized into n-gram matching metrics, embedding-based metrics, and learned metrics.
The most commonly used metrics for text generation count the number of n-grams that occur in both the reference x and the candidate x̂. In general, the higher the n-gram order, the better the metric captures word order, but the metric also becomes more restrictive and constrained to the exact form of the reference.
Formally, let $S_x^n$ and $S_{\hat{x}}^n$ be the lists of $n$-grams in the reference and candidate sentences. The number of $n$-gram matches is
$$\mathrm{Match}_n = \sum_{w \in S_{\hat{x}}^n} \mathbb{I}\left[w \in S_x^n\right],$$
where $\mathbb{I}[\cdot]$ is an indicator function. The exact-match precision and recall are
$$P_n = \frac{\mathrm{Match}_n}{|S_{\hat{x}}^n|}, \qquad R_n = \frac{\mathrm{Match}_n}{|S_x^n|}.$$
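The exact-match counting above can be sketched in a few lines of Python (a minimal illustration; the function names are ours, not from any metric implementation):

```python
def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def exact_match_pr(reference, candidate, n):
    """Exact n-gram precision and recall: count candidate n-grams
    that also occur in the reference, then normalize by the number
    of candidate (precision) or reference (recall) n-grams."""
    ref_set = set(ngrams(reference, n))
    cand_ngrams = ngrams(candidate, n)
    matches = sum(1 for g in cand_ngrams if g in ref_set)
    precision = matches / len(cand_ngrams)
    recall = matches / len(ngrams(reference, n))
    return precision, recall
```

On the running example from the introduction, both the child is playing and a child is singing match three of four reference unigrams, illustrating how exact matching ignores which tokens matter.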
Several popular metrics build upon one or both of these exact matching scores.
The most widely used metric in machine translation is Bleu (bleu), which includes three modifications to the exact n-gram precision. First, each n-gram in the reference can be matched at most once. For example, if the reference is the sooner the better and the candidate is the the the, only two of the candidate's three words are matched for n = 1, instead of all three. Second, Bleu is designed as a corpus-level metric, where a set of reference-candidate pairs is evaluated as a group: the number of exact matches is accumulated over all pairs and divided by the total number of n-grams in all candidate sentences. Finally, Bleu introduces a brevity penalty that applies when this total number of n-grams across all candidate sentences is low. Typically, Bleu is computed for several values of n (e.g., n = 1, ..., 4), which are averaged geometrically. A smoothed variant, SentBleu (moses), is computed at the sentence level. In contrast to Bleu, BERTScore is not restricted to a maximum n-gram length, but instead relies on contextualized embeddings that are able to capture dependencies of unbounded length.
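A minimal sketch of these three modifications (clipped matching, corpus-level accumulation, and the brevity penalty combined with a geometric mean over n), assuming whitespace-tokenized input and omitting the smoothing used by SentBleu:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(pairs, max_n=4):
    """Simplified corpus-level Bleu over (reference, candidate)
    token-list pairs."""
    log_prec_sum = 0.0
    cand_len = sum(len(c) for _, c in pairs)
    ref_len = sum(len(r) for r, _ in pairs)
    for n in range(1, max_n + 1):
        matches, total = 0, 0
        for ref, cand in pairs:
            ref_c, cand_c = ngram_counts(ref, n), ngram_counts(cand, n)
            # Clipping: each reference n-gram is matched at most as many
            # times as it occurs in the reference.
            matches += sum(min(c, ref_c[g]) for g, c in cand_c.items())
            total += sum(cand_c.values())
        if matches == 0:
            return 0.0  # unsmoothed: any zero precision zeroes the score
        log_prec_sum += math.log(matches / total)
    # Brevity penalty for short candidate corpora.
    bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / max(cand_len, 1))
    return bp * math.exp(log_prec_sum / max_n)
```

For the example above, clipping caps the matches of the the the against the sooner the better at two, since the occurs only twice in the reference.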
Meteor (meteor) computes unigram precision and recall while allowing backing-off from exact unigram matching to matching word stems, synonyms, and paraphrases. For example, running may match run if no exact match is possible. These non-exact matches rely on an external stemmer, a synonym lexicon, and a paraphrase table. The computation uses beam search to minimize the number of matched chunks (consecutive unigram matches). Meteor 1.5 (meteor1.5) distinguishes between content and function words and weighs their importance differently. It also applies importance weighting to the different matching types, including exact unigrams, stems, synonyms, and paraphrases. These parameters are tuned to maximize correlation with human judgments. Because Meteor requires external resources, only five languages are supported with the full feature set, including either synonym or paraphrase matching, and eleven are partially supported. Similar to Meteor, BERTScore is designed to allow relaxed matches. But instead of relying on external resources, BERTScore takes advantage of BERT embeddings, which are trained on large amounts of raw text and can easily be created for new languages. BERTScore also incorporates importance weighting, which is estimated from simple corpus statistics. In contrast to Meteor 1.5, BERTScore does not require any tuning to maximize correlation with human judgments.
NIST (nist) is a revised version of Bleu that weighs each n-gram differently and introduces an alternative brevity penalty. chrF (chrF) compares character n-grams in the reference and candidate sentences. chrF++ (chrF++) extends chrF to include word bigram matching. PER (per), WER (wer), CDER (cder), TER (ter), and TERp (TER-Plus) are metrics based on edit distance. CIDEr (cider) is an image captioning metric that computes the cosine similarity between tf-idf-weighted n-grams. Finally, Rouge (rouge) is a commonly used metric for summarization evaluation. Rouge-n (rouge) computes the n-gram recall (usually with n = 2), while Rouge-L is a variant of the unigram recall with the numerator replaced by the length of the longest common subsequence.
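For illustration, the longest common subsequence underlying the Rouge-L variant can be computed with textbook dynamic programming (a sketch, not the official Rouge toolkit):

```python
def lcs_length(ref, cand):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, c in enumerate(cand, 1):
            # Extend the subsequence on a match, otherwise carry the best so far.
            dp[i][j] = dp[i-1][j-1] + 1 if r == c else max(dp[i-1][j], dp[i][j-1])
    return dp[len(ref)][len(cand)]

def rouge_l_recall(ref, cand):
    # Rouge-L: unigram recall with the match count replaced by the LCS length.
    return lcs_length(ref, cand) / len(ref)
```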
Word embeddings (word2vec; glove; fasttext; dai2017mixture; athiwaratkun2018probabilistic) are dense representations of tokens learned by optimizing objectives that follow the distributional hypothesis, where similar words are encouraged to be close to one another in the learned space. This property has been studied for generation evaluation. MEANT 2.0 (meant2) uses pre-trained word embeddings to compute lexical similarity and exploits shallow semantic parses to evaluate structural similarity. task-dialog-eval explore using average-pooling and max-pooling over word embeddings to construct sentence-level representations, which are used to compute the cosine similarity between the reference and candidate sentences. rus2012comparison also study greedy word embedding matching. In contrast to these methods, we use contextualized embeddings, which capture the specific use of a token in a sentence and, potentially, sequence information. We do not use external tools to generate linguistic structures, so our approach is relatively easy to apply to new languages. Our token-level computation allows us to visualize the matching and to weigh tokens differently according to their importance.
Learning-based metrics are usually trained to optimize correlation with human judgments. BEER (beer) uses a regression model based on character n-grams and word bigrams. BLEND (blend) employs SVM regression to combine 29 existing metrics for English. RUSE (ruse) uses a multi-layer perceptron regressor on top of three pre-trained sentence embedding models. All of these methods require human judgments as supervision, which are necessary for each dataset and costly to obtain. These models also risk poor generalization to new domains, even within a known language and task. Instead of regressing on human judgment scores, leic train a neural model that takes an image and a caption as inputs and predicts whether the caption is human-generated. One potential risk of this approach is that it is optimized against existing models and may generalize poorly to new models. In contrast, the parameters of the BERT model underlying BERTScore are not optimized for any specific evaluation task. We also do not require access to images, providing an approach that applies to both text-only and multi-modal tasks.
Given a reference sentence x and a candidate sentence x̂, we use contextual embeddings to represent the tokens and compute a weighted matching using cosine similarity and inverse document frequency (idf) scores.
We use BERT contextual embeddings to represent the tokens in the input sentences x and x̂. In contrast to word embeddings (word2vec; glove), contextual embeddings (elmo; bert) give the same word different vector representations in different sentences. BERT uses a Transformer encoder (transformer) trained on masked language modeling and next-sentence prediction tasks. Pre-trained BERT embeddings have been shown to benefit various NLP tasks, including natural language inference, sentiment analysis, paraphrase detection, question answering, and named entity recognition (bert), as well as summarization (bert-summarization), contextual emotion detection (bert-emotion), citation recommendation (bert-citation), and document retrieval (bert-retrieval). The BERT model tokenizes the input text into a sequence of word pieces (google16), where unknown words are split into several commonly observed sequences of characters. The contextualized embedding for each word piece is computed by repeatedly applying self-attention and nonlinear transformations in an alternating fashion. This process generates multiple layers of embedded representations. Following initial experiments, we use the ninth layer of the model. This is consistent with recent findings showing that the intermediate BERT layers may yield more semantically meaningful contextual embeddings than the final layer (alternate; linguistic-context). Appendix A studies the effect of layer choice. Given a reference x tokenized into word pieces, BERT generates a sequence of vectors ⟨x_1, ..., x_k⟩. Similarly, we map the tokenized candidate x̂ to ⟨x̂_1, ..., x̂_l⟩.
These vector representations enable a soft measure of similarity instead of exact string matching (bleu) or heuristic matching (meteor). We measure the quality of matching a reference word piece $x_i$ and a candidate word piece $\hat{x}_j$ using their cosine similarity: $x_i^\top \hat{x}_j / (\lVert x_i \rVert \, \lVert \hat{x}_j \rVert)$.
We use pre-normalized vectors, which reduces this calculation to the inner product $x_i^\top \hat{x}_j$.
While this measure compares word pieces in isolation from the rest of the two sentences, the embeddings themselves inherit their dependence on context from the BERT model.
Previous work on similarity measures demonstrated that rare words can be more indicative of sentence similarity than common words (meteor; cider). We incorporate importance weighting using inverse document frequency (idf) scores computed from the reference sentences in the test corpus. Given $M$ reference sentences $\{x^{(i)}\}_{i=1}^{M}$, the idf of a word piece $w$ is
$$\mathrm{idf}(w) = -\log \frac{1}{M} \sum_{i=1}^{M} \mathbb{I}\left[w \in x^{(i)}\right].$$
We do not use the full tf-idf measure because we process single sentences, where the term frequency (tf) is likely 1. Because we use the reference sentences, the idf scores remain the same for all systems evaluated on a specific test set. For word pieces that do not appear in the reference sentences, we apply plus-one smoothing.
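A sketch of the idf computation; the exact form of the plus-one smoothing is our assumption (adding one to both the document count and the number of sentences), and the released implementation may differ:

```python
import math

def idf_weights(reference_corpus):
    """Build an idf function from a list of tokenized reference sentences:
    idf(w) = -log of the (smoothed) fraction of sentences containing w."""
    M = len(reference_corpus)
    df = {}
    for sent in reference_corpus:
        for w in set(sent):  # count each word piece once per sentence
            df[w] = df.get(w, 0) + 1
    def idf(w):
        # Plus-one smoothing keeps unseen word pieces finite (assumed form).
        return -math.log((df.get(w, 0) + 1) / (M + 1))
    return idf
```

Note that a word piece appearing in every reference sentence receives weight zero, while rare and unseen pieces receive the largest weights.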
The complete score matches each word piece in $x$ to a word piece in $\hat{x}$ to compute recall, and each word piece in $\hat{x}$ to a word piece in $x$ to compute precision; the two are combined into an F1 measure. We use greedy matching to maximize the matching similarity score, where each token is matched to the most similar token in the other sentence. In contrast to Bleu (Section 2), we identify matches using idf-weighted cosine similarity, which allows for approximate matching. For a reference $x$ and candidate $\hat{x}$, the recall, precision, and F1 scores are:
$$R_{\mathrm{BERT}} = \frac{\sum_{x_i \in x} \mathrm{idf}(x_i) \max_{\hat{x}_j \in \hat{x}} x_i^\top \hat{x}_j}{\sum_{x_i \in x} \mathrm{idf}(x_i)}, \qquad P_{\mathrm{BERT}} = \frac{\sum_{\hat{x}_j \in \hat{x}} \mathrm{idf}(\hat{x}_j) \max_{x_i \in x} x_i^\top \hat{x}_j}{\sum_{\hat{x}_j \in \hat{x}} \mathrm{idf}(\hat{x}_j)}, \qquad F_{\mathrm{BERT}} = 2\,\frac{P_{\mathrm{BERT}} \cdot R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}.$$
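Putting the pieces together, greedy matching with idf weighting can be sketched as follows (a pure-Python illustration that operates on precomputed token vectors; producing the BERT embeddings themselves is outside the scope of the sketch):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def greedy_bertscore(ref_vecs, cand_vecs, ref_idf, cand_idf):
    """Recall, precision, and F1 from token embedding vectors.
    ref_vecs/cand_vecs: lists of vectors; ref_idf/cand_idf: matching
    lists of idf weights."""
    # Recall: match each reference token to its most similar candidate token.
    r = sum(w * max(cosine(x, y) for y in cand_vecs)
            for x, w in zip(ref_vecs, ref_idf)) / sum(ref_idf)
    # Precision: match each candidate token to its most similar reference token.
    p = sum(w * max(cosine(y, x) for x in ref_vecs)
            for y, w in zip(cand_vecs, cand_idf)) / sum(cand_idf)
    f1 = 2 * p * r / (p + r)
    return p, r, f1
```

With pre-normalized vectors, `cosine` reduces to a plain inner product, matching the simplification described above.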
We evaluate our approach on machine translation and image captioning. We focus on correlations with human judgments.
We use the uncased English model for English tasks, the Chinese model for Chinese tasks, and the cased multilingual model for other languages. Appendix A shows the effect of the BERT model choice.
We use the WMT17 metric evaluation dataset (wmt17em), which contains translation system outputs, gold reference translations, and two types of human judgment scores. Segment-level human judgments assign a score to each pair of output and reference. System-level human judgments associate each system with a single score based on all output-reference pairs in the test set. Metric quality is evaluated using the absolute Pearson correlation with human judgments. We compute system-level scores by averaging BERTScore over all system outputs. WMT17 includes translations from English to Czech, German, Finnish, Latvian, Russian, and Turkish, and from the same set of languages to English. We compare the performance of several popular metrics: Bleu (bleu), CDER (cder), and TER (ter). We also compare our correlations with state-of-the-art metrics, including METEOR++ (meteor++), chrF++ (chrF++), BEER (beer), BLEND (blend), and RUSE (ruse).
We use the human judgments of twelve submission entries from the COCO 2015 Captioning Challenge. Each participating system generates a caption for each image in the COCO validation set (coco), and each image has approximately five reference captions. Following leic, we compute the Pearson correlation with two system-level metrics: M1, the percentage of captions evaluated as better than or equal to human captions, and M2, the percentage of captions indistinguishable from human captions. We compute BERTScore with multiple references by scoring the candidate against each available reference and returning the highest score. We compare BERTScore to four task-agnostic metrics: BLEU (bleu), METEOR (meteor), ROUGE-L (rouge), and CIDEr (cider). We also compare with two task-specific metrics: SPICE (spice) and LEIC (leic). SPICE is computed using the similarity of the scene graphs parsed from the reference and candidate captions. LEIC uses a critique network that takes an image and a caption and outputs a proxy score, predicting whether the caption was written by a human.
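The multi-reference scheme described above (score against every available reference and keep the maximum) is metric-agnostic and can be sketched as:

```python
def multi_ref_score(score_fn, references, candidate):
    """Score a candidate against several references with any
    single-reference metric score_fn(reference, candidate),
    returning the best score."""
    return max(score_fn(ref, candidate) for ref in references)
```

Any of the single-reference metrics sketched earlier can be plugged in as `score_fn`.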
Tables 1 and 2 show segment-level and system-level correlations on to-English translations, and Table 3 shows system-level correlations on from-English translations. Across most language pairs, BERTScore shows the highest correlation with human judgments at both the segment and system level. While the recall and precision measures alternate as the best measure across languages, their combination into the F1 measure performs reliably across the different settings. BERTScore shows better correlation than RUSE, a supervised metric trained on WMT16 and WMT15 human judgment data. We also observe that idf weighting generally leads to better correlation.
Table 4 shows correlation results for the COCO Captioning Challenge. BERTScore outperforms all task-agnostic baselines by large margins. Image captioning presents a challenging evaluation scenario, and metrics based on strict n-gram matching, including Bleu and Rouge, correlate weakly with human judgments. idf importance weighting shows significant benefits for this task, which suggests people attribute higher importance to content words. Finally, LEIC (leic) remains highly competitive and outperforms all other methods. LEIC stands out from the other metrics in two ways. First, it is trained on the COCO data and is optimized for the task of distinguishing between human and generated captions. Second, it has access to the images, while all other methods observe the text only.
|Case||Sentences (x: reference, x̂: candidate)||Ranks (out of 560)|
|SentBleu||1.||x: According to opinion in Hungary, Serbia is “a safe third country”. x̂: According to Hungarian view, Serbia is a “safe third country.”||Human: 23, BERTScore: 100|
||2.||x: At same time Kingfisher is closing 60 B&Q outlets across the country x̂: At the same time, Kingfisher will close 60 B & Q stores nationwide||Human: 38, BERTScore: 201|
||3.||x: Construction took six months. x̂: Has taken six months of construction.||Human: 243, BERTScore: 230|
||4.||x: Authorities are quickly repairing the fence. x̂: Authorities are about to repair the fence fast.||Human: 205, BERTScore: 193|
||5.||x: Hewlett-Packard to cut up to 30,000 jobs x̂: Hewlett-Packard will reduce jobs up to 30.000||Human: 119, BERTScore: 168|
|SentBleu||6.||x: In their view the human dignity of the man had been violated. x̂: Look at the human dignity of the man injured.||Human: 500, BERTScore: 523|
||7.||x: A good prank is funny, but takes moments to reverse. x̂: A good prank is funny, but it takes only moments before he becomes a boomerang.||Human: –, BERTScore: 487|
||8.||x: For example when he steered a shot from Ideye over the crossbar in the 56th minute. x̂: So, for example, when he steered a shot of Ideye over the latte (56th).||Human: 516, BERTScore: 498|
||9.||x: I will put the pressure on them and onus on them to make a decision. x̂: I will exert the pressure on it and her urge to make a decision.||Human: 507, BERTScore: 460|
||10.||x: Contrary to initial fears, however, the wound was not serious. x̂: Contrary to initial fears, he remained without a serious Blessur.||Human: 462, BERTScore: 481|
We study BERTScore and SentBleu using failure cases of reference and candidate pairs from WMT16 German-to-English (wmt16em). We rank all 560 pairs by their human score, BERTScore, or SentBleu score, from most similar to least similar. Ideally, the ranks assigned by BERTScore and SentBleu should be similar to the rank assigned by the human score.
Table 5 shows examples where BERTScore and SentBleu disagree about the example's ranking by a large margin. We observe that BERTScore effectively captures synonyms and changes in word order. For example, in the first pair, the reference and candidate sentences are almost identical, except that the candidate replaces opinion in Hungary with Hungarian view and swaps the order of the opening quotation mark and the word a. While BERTScore ranks the pair relatively high, SentBleu ranks the pair as dissimilar, possibly because it cannot match the synonyms and is sensitive to the small changes in word order. The fifth pair shows a set of changes that preserve the semantic meaning: replacing to cut with will reduce and swapping the order of 30,000 and jobs. BERTScore ranks the candidate translation similarly to the human judgment, whereas SentBleu ranks it much lower. We also see that SentBleu potentially over-rewards n-gram overlap, even when phrases are used very differently. In the sixth pair, both the candidate and the reference contain the human dignity of the man, yet the two sentences convey very different meanings. BERTScore agrees with the human judgment and assigns a low rank to the pair. In contrast, SentBleu considers the pair relatively similar because the reference and candidate sentences have significant word overlap.
Because BERTScore relies on explicit alignments, it is easy to visualize the word matching to better understand the resulting score. Figure 2 visualizes the BERTScore matching of two pairs from Table 5. The coloring in the figure visualizes the amount each token contributes to the overall score, combining both the idf score and the cosine similarity. In both examples, function words such as are, to, and of contribute less to the overall similarity score.
|Setting||Model||QQP||PAWS_QQP|
|Trained on QQP (supervised)||DecAtt||0.939*||0.263|
|Trained on QQP + PAWS_QQP (supervised)||DecAtt||–||0.511|
We test the robustness of BERTScore using adversarial paraphrase classification. We use the Quora Question Pair corpus (QQP) and the Paraphrase Adversaries from Word Scrambling dataset (PAWS; paws). Both datasets contain pairs of sentences labeled to indicate whether they are paraphrases or not. Positive examples in QQP are real duplicated questions, while negative examples are generated from related, but different, questions. Sentence pairs in PAWS are generated through word swapping. For example, in PAWS, Flights from New York to Florida may be changed to Flights from Florida to New York, and a good classifier should identify that these two sentences are not paraphrases. PAWS consists of two parts: PAWS_QQP, which is based on the QQP data, and PAWS_Wiki. Table 6 shows the area under the ROC curve for existing models and automatic metrics.
We observe that supervised classifiers trained on QQP perform even worse than random guessing on PAWS_QQP, i.e., these models judge the adversarial examples more likely to be paraphrases. When adversarial examples are provided during training, state-of-the-art models like DIIN (diin) and fine-tuned BERT are able to identify the adversarial examples, but their performance still decreases significantly from their performance on QQP.
We study the effectiveness of automatic metrics for paraphrase detection without any training data. We use the PAWS_QQP development set, which contains 667 sentence pairs. For QQP, we use the first 5,000 pairs from the training set, because the test labels are not available. We treat the first sentence as the reference and the second sentence as the candidate, and expect pairs with higher scores to be more likely paraphrases. Most metrics perform decently on QQP but show a significant performance drop on PAWS_QQP, falling almost to chance performance. This suggests these metrics fail to distinguish the harder adversarial examples. In contrast, the performance of BERTScore drops only slightly, demonstrating that it is more robust than the other metrics.
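For reference, the area under the ROC curve used in this evaluation can be computed directly from metric scores and paraphrase labels via the Mann-Whitney statistic (a sketch assuming binary 0/1 labels):

```python
def roc_auc(scores, labels):
    """Area under the ROC curve: the probability that a randomly chosen
    positive pair scores higher than a randomly chosen negative pair,
    with ties counted as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to chance performance, and values below 0.5 indicate that a metric systematically ranks adversarial non-paraphrases above true paraphrases.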
We propose BERTScore, a new metric for evaluating generated text against gold standard references. Our experiments on common benchmarks demonstrate that BERTScore achieves better correlation than common metrics, such as Bleu or Meteor. Our analysis illustrates the potential of BERTScore to resolve some of the limitations of these commonly used metrics, especially on challenging adversarial examples. BERTScore is purposely designed to be simple, interpretable, task agnostic, and easy to use. The code for BERTScore is available at github.com/Tiiiger/bert_score.
This research is supported in part by grants from the National Science Foundation (III-1618134, III-1526012, IIS1149882, IIS-1724282, and TRIPODS-1740822), the Office of Naval Research DOD (N00014-17-1-2175), and the Bill and Melinda Gates Foundation, SAP, Zillow and Facebook Research. We thank Graham Neubig, Tianze Shi, Yin Cui, and Guandao Yang for their insightful comments.
In Section 5, we report the human correlation of BERTScore computed with the uncased BERT_BASE model. Here we investigate the potential improvement from using different BERT models on the WMT16 German-to-English data (wmt16em). In Table 7, we report the average segment-level human correlation of BERTScore computed with BERT_BASE and BERT_LARGE. As expected, BERTScore computed with the BERT_LARGE model correlates better with human judgment. However, the improvement is marginal and appears less appealing once we consider the computational overhead of BERT_LARGE. Therefore, in our opinion, BERT_BASE suffices.
Since BERT models are pre-trained on different domains, we hypothesize that using a more domain-specific model would improve correlation with human judgment. On the WMT16 English-to-Chinese translation data, we compute BERTScore with BERT_MULTI, a general-domain multilingual BERT model trained on 104 languages, and with BERT_CHINESE, which is trained solely on Chinese data. The results are presented in Table 8. As hypothesized, BERTScore computed with BERT_CHINESE shows a significant performance increase. We therefore expect further improvement from more domain-specific BERT models and advise practitioners to use domain-specific models when available.
As suggested by previous studies (elmo; alternate), selecting a good layer, or a good combination of layers, of hidden representations is important. In designing BERTScore, we use the WMT16 segment-level human judgment data as a development set to guide our representation choice. In Figure 3, we plot the human correlation of BERTScore across different layers of BERT models on the WMT16 German-to-English translation task. Across three different BERT models, we identify a common trend: BERTScore computed with representations from intermediate layers tends to work better. In practice, we use the 9th layer of the BERT_BASE model.
|Task||Model||Bleu||P_BERT||R_BERT||F_BERT|
|WMT14 En-De||ConvS2S (gehring2017convs2s)||0.266||0.8323||0.8311||0.8312|
|WMT14 En-Fr||ConvS2S (gehring2017convs2s)||0.408||0.8749||0.8693||0.8718|
|IWSLT14 De-En||Transformer-iwslt (ott2019fairseq)||0.347||0.7903||0.7764||0.7820|
Table 9 shows the Bleu scores and BERTScores of pre-trained machine translation models on the WMT14 English-to-German, WMT14 English-to-French, and IWSLT14 German-to-English tasks. We used publicly available pre-trained models from Tensor2Tensor (tensor2tensor) [1] and fairseq (ott2019fairseq) [2]. Since no pre-trained Transformer model on IWSLT has been released, we trained our own using the fairseq library.
[1] Code available at https://github.com/tensorflow/tensor2tensor; pre-trained model available at gs://tensor2tensor-checkpoints/transformer_ende_test.
[2] Code and pre-trained models available at https://github.com/pytorch/fairseq.
We use the multilingual cased model (hash code: bert-base-multilingual-cased_L9_version=0.1.0) for the English-to-German and English-to-French pairs, and the English uncased model (hash code: bert-base-uncased_L9_version=0.1.0) for the German-to-English pair. Interestingly, the gap between a DynamicConv (wu2018pay) trained only on WMT16 and a Transformer (snmt) trained on WMT16 and Paracrawl (about 30 times more training data) becomes larger when evaluated with BERTScore rather than Bleu.