Should you marginalize over possible tokenizations?

06/30/2023
by Nadezhda Chirkova et al.

Autoregressive language models (LMs) map token sequences to probabilities. The usual practice for computing the probability of any character string (e.g., an English sentence) is to first transform it into a sequence of tokens that is scored by the model. However, there are exponentially many token sequences that represent any given string. To truly compute the probability of a string one should marginalize over all tokenizations, which is typically intractable. Here, we analyze whether the practice of ignoring the marginalization is justified. To this end, we devise an importance-sampling-based algorithm that allows us to compute estimates of the marginal probabilities and compare them to the default procedure in a range of state-of-the-art models and datasets. Our results show that the gap in log-likelihood is no larger than 0.5% in most cases, but that it becomes more pronounced for data with long complex words.
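The quantity being compared is the log-probability of the single canonical tokenization versus the marginal log-probability summed over all tokenizations of the same string. The sketch below illustrates the idea on a toy setup, not the paper's actual models or exact algorithm: a made-up subword vocabulary, a toy token-level scorer, exact enumeration of tokenizations (feasible only for short strings), and a simple importance-sampling estimate of the marginal using a uniform next-token proposal. All names (VOCAB, toy_token_logprob, the greedy default tokenizer, the proposal) are illustrative assumptions.

```python
"""Illustrative sketch: marginal string probability over tokenizations
under a toy autoregressive 'LM', compared with (a) the probability of a
single default tokenization and (b) an importance-sampling estimate.
Everything here (vocabulary, scores, proposal) is made up for illustration."""
import math
import random

# Toy subword vocabulary. Assumption: every character of the demo string is
# also a single-character token, so at least one tokenization always exists.
VOCAB = ["un", "believ", "able", "a", "b", "e", "i", "l", "n", "u", "v"]

def toy_token_logprob(token, prefix):
    """Toy scorer: longer tokens get higher probability. A real LM would
    condition on the prefix; this toy ignores it."""
    weights = {t: math.exp(len(t)) for t in VOCAB}
    z = sum(weights.values())
    return math.log(weights[token] / z)

def seq_logprob(tokens):
    """Log-probability of a token sequence under the toy scorer."""
    return sum(toy_token_logprob(tok, tokens[:i]) for i, tok in enumerate(tokens))

def all_tokenizations(s):
    """Enumerate every segmentation of s into vocabulary tokens (exponential)."""
    if not s:
        yield []
        return
    for tok in VOCAB:
        if s.startswith(tok):
            for rest in all_tokenizations(s[len(tok):]):
                yield [tok] + rest

def default_tokenization(s):
    """Greedy longest-match segmentation, standing in for the tokenizer's
    single 'canonical' output."""
    tokens = []
    while s:
        tok = max((t for t in VOCAB if s.startswith(t)), key=len)
        tokens.append(tok)
        s = s[len(tok):]
    return tokens

def sample_tokenization(s, rng):
    """Proposal q: at each position pick a matching token uniformly at random.
    Returns the sampled tokenization and its log-probability under q."""
    tokens, logq = [], 0.0
    while s:
        candidates = [t for t in VOCAB if s.startswith(t)]
        tok = rng.choice(candidates)
        logq += math.log(1.0 / len(candidates))
        tokens.append(tok)
        s = s[len(tok):]
    return tokens, logq

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

if __name__ == "__main__":
    s = "unbelievable"
    rng = random.Random(0)

    # Exact marginal: log sum_t p(t), feasible only for short strings.
    exact = logsumexp([seq_logprob(t) for t in all_tokenizations(s)])

    # Default practice: score only the canonical tokenization.
    default = seq_logprob(default_tokenization(s))

    # Importance sampling: marginal = E_{t~q}[p(t)/q(t)], estimated by the
    # sample mean of the ratios over draws from the proposal q.
    ratios = []
    for _ in range(2000):
        tokens, logq = sample_tokenization(s, rng)
        ratios.append(seq_logprob(tokens) - logq)
    estimate = logsumexp(ratios) - math.log(len(ratios))

    print(f"default tokenization log p: {default:.4f}")
    print(f"exact marginal log p:       {exact:.4f}")
    print(f"IS estimate of marginal:    {estimate:.4f}")
```

On this toy example the importance-sampling estimate approaches the exact marginal as the number of samples grows, and both are at least as large as the default single-tokenization score, since the canonical tokenization contributes only one non-negative term to the marginal sum.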
