A Better Way to Do Masked Language Model Scoring

05/17/2023
by Carina Kauf et al.

Estimating the log-likelihood of a given sentence under an autoregressive language model is straightforward: one can simply apply the chain rule and sum the log-likelihood values for each successive token. For masked language models (MLMs), however, there is no direct way to estimate a sentence's log-likelihood. To address this issue, Salazar et al. (2020) propose estimating sentence pseudo-log-likelihood (PLL) scores, computed by successively masking each sentence token, retrieving its score using the rest of the sentence as context, and summing the resulting values. Here, we demonstrate that the original PLL method yields inflated scores for out-of-vocabulary words and propose an adapted metric, in which we mask not only the target token but also all within-word tokens to the right of the target. We show that our adapted metric (PLL-word-l2r) outperforms both the original PLL metric and a PLL metric in which all within-word tokens are masked: it better satisfies theoretical desiderata and correlates better with scores from autoregressive models. Finally, we show that the choice of metric affects even tightly controlled, minimal-pair evaluation benchmarks (such as BLiMP), underscoring the importance of selecting an appropriate scoring metric for evaluating MLM properties.
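The difference between the scoring variants comes down to which positions are masked when each target token is scored. As a minimal sketch (not the authors' implementation), the hypothetical helper below takes a `word_ids` list, in the style of subword tokenizers where tokens of the same word share an id, and returns the mask set per target position for the original PLL metric, the proposed PLL-word-l2r metric, and the whole-word-masking variant:

```python
def pll_mask_sets(word_ids, variant="original"):
    """Return, for each target token position t, the sorted list of
    positions to mask before scoring token t with the MLM.

    word_ids[i] is the word index of token i; subword tokens belonging
    to the same word share an id (e.g. ["so", "##uven", "##ir"] -> [0, 0, 0]).
    """
    n = len(word_ids)
    masks = []
    for t in range(n):
        if variant == "original":
            # Salazar et al. (2020): mask only the target token.
            masked = {t}
        elif variant == "word_l2r":
            # PLL-word-l2r: mask the target plus all within-word
            # tokens to its right.
            masked = {j for j in range(t, n) if word_ids[j] == word_ids[t]}
        elif variant == "whole_word":
            # Mask every token of the target's word.
            masked = {j for j in range(n) if word_ids[j] == word_ids[t]}
        else:
            raise ValueError(f"unknown variant: {variant}")
        masks.append(sorted(masked))
    return masks


# A three-subword word, e.g. "souvenir" -> ["so", "##uven", "##ir"]
word_ids = [0, 0, 0]
print(pll_mask_sets(word_ids, "original"))   # [[0], [1], [2]]
print(pll_mask_sets(word_ids, "word_l2r"))   # [[0, 1, 2], [1, 2], [2]]
print(pll_mask_sets(word_ids, "whole_word")) # [[0, 1, 2], [0, 1, 2], [0, 1, 2]]
```

The sentence score would then sum, over target positions, the MLM's log-probability of the target token given the sentence with those positions masked. The example makes the out-of-vocabulary inflation visible: under the original metric, "##uven" is predicted with "so" and "##ir" still in context, so each piece of a rare word is scored almost trivially; PLL-word-l2r hides the not-yet-scored pieces of the word.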


Related research

10/31/2019  Pseudolikelihood Reranking with Masked Language Models
We rerank with scores from pretrained masked language models like BERT t...

04/23/2018  Spell Once, Summon Anywhere: A Two-Level Open-Vocabulary Language Model
We show how to deploy recurrent neural networks within a hierarchical Ba...

06/30/2023  Should you marginalize over possible tokenizations?
Autoregressive language models (LMs) map token sequences to probabilitie...

02/28/2022  Rethinking and Refining the Distinct Metric
Distinct is a widely used automatic metric for evaluating the diversity ...

10/19/2022  Enrichment Score: a better quantitative metric for evaluating the enrichment capacity of molecular docking models
The standard quantitative metric for evaluating enrichment capacity know...

08/25/2023  Assessing Keyness using Permutation Tests
We propose a resampling-based approach for assessing keyness in corpus l...

12/08/2019  Cost-Sensitive Training for Autoregressive Models
Training autoregressive models to better predict under the test metric, ...
