Layer or Representation Space: What makes BERT-based Evaluation Metrics Robust?

09/06/2022
by Doan Nam Long Vu et al.

The evaluation of recent embedding-based evaluation metrics for text generation is primarily based on measuring their correlation with human evaluations on standard benchmarks. However, these benchmarks are mostly from similar domains to those used for pretraining word embeddings. This raises concerns about the (lack of) generalization of embedding-based metrics to new and noisy domains that contain a different vocabulary than the pretraining data. In this paper, we examine the robustness of BERTScore, one of the most popular embedding-based metrics for text generation. We show that (a) an embedding-based metric that has the highest correlation with human evaluations on a standard benchmark can have the lowest correlation if the amount of input noise or unknown tokens increases, (b) taking embeddings from the first layer of pretrained models improves the robustness of all metrics, and (c) the highest robustness is achieved when using character-level embeddings, instead of token-based embeddings, from the first layer of the pretrained model.
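Finding (b) concerns which layer the token embeddings are taken from. As a minimal NumPy sketch of the BERTScore idea (not the authors' released code): `bertscore_f1` is a hypothetical helper, and the random arrays below stand in for token embeddings extracted from a chosen layer of a pretrained encoder, e.g. the first layer, which the paper finds more robust to input noise.

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    """Greedy-matching F1 over token embeddings, in the style of BERTScore.

    cand_emb, ref_emb: arrays of shape (n_tokens, dim), token embeddings
    taken from one layer of a pretrained model.
    """
    # L2-normalize rows so dot products become cosine similarities
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T                       # (n_cand, n_ref) cosine matrix
    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    return 2 * precision * recall / (precision + recall)

# Toy check with random stand-in embeddings: identical inputs score 1.0
rng = np.random.default_rng(0)
cand = rng.normal(size=(5, 16))
print(bertscore_f1(cand, cand))  # -> 1.0
```

In practice the layer choice is a parameter of the metric implementation (the released bert-score package exposes it as a `num_layers` argument, if memory serves), so the paper's recommendation amounts to setting that choice to the first layer.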


Related research

- SEScore2: Retrieval Augmented Pretraining for Text Generation Evaluation (12/19/2022)
  Is it possible to leverage large scale raw and raw parallel corpora to b...
- Improving Text Generation Evaluation with Batch Centering and Tempered Word Mover Distance (10/13/2020)
  Recent advances in automatic evaluation metrics for text have shown that...
- Uncontrolled Lexical Exposure Leads to Overestimation of Compositional Generalization in Pretrained Models (12/21/2022)
  Human linguistic capacity is often characterized by compositionality and...
- Improving Dialog Evaluation with a Multi-reference Adversarial Dataset and Large Scale Pretraining (09/23/2020)
  There is an increasing focus on model-based dialog evaluation metrics su...
- Concept2vec: Metrics for Evaluating Quality of Embeddings for Ontological Concepts (03/12/2018)
  Although there is an emerging trend towards generating embeddings for pr...
- Cluster-based Evaluation of Automatically Generated Text (05/31/2022)
  While probabilistic language generators have improved dramatically over ...
- Generative Pretraining for Paraphrase Evaluation (07/17/2021)
  We introduce ParaBLEU, a paraphrase representation learning model and ev...
