Tokenization and the Noiseless Channel

06/29/2023
by Vilém Zouhar, et al.

Subword tokenization is a key part of many NLP pipelines. However, little is known about why some tokenizer and hyperparameter combinations lead to better downstream model performance than others. We propose that good tokenizers lead to efficient channel usage, where the channel is the means by which some input is conveyed to the model and efficiency can be quantified in information-theoretic terms as the ratio of the Shannon entropy to the maximum possible entropy of the token distribution. Yet, an optimal encoding according to Shannon entropy assigns extremely long codes to low-frequency tokens and very short codes to high-frequency tokens. Defining efficiency in terms of Rényi entropy, on the other hand, penalizes distributions with either very high- or very low-frequency tokens. In machine translation, we find that across multiple tokenizers, the Rényi entropy with α = 2.5 has a very strong correlation with BLEU: 0.78, compared with just -0.32 for compressed length.
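The efficiency ratio described in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the abstract: the token distribution is the unigram distribution estimated from a tokenized corpus, the maximum possible entropy is taken as the log of the number of observed token types (the entropy of a uniform distribution over them), and the same normalization is reused for Rényi entropy. The function names are hypothetical and the full paper's definition may differ in detail.

```python
import math
from collections import Counter


def token_distribution(tokens):
    """Unigram probabilities estimated from a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [c / total for c in counts.values()]


def shannon_entropy(probs):
    """H(p) = -sum_i p_i log p_i, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def renyi_entropy(probs, alpha):
    """H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha); alpha = 1 is the Shannon limit."""
    if alpha == 1:
        return shannon_entropy(probs)
    return math.log(sum(p ** alpha for p in probs if p > 0)) / (1 - alpha)


def efficiency(probs, alpha=1.0):
    """Entropy divided by log(vocabulary size).

    Assumption: the vocabulary is the set of observed token types,
    so the maximum entropy is log(len(probs)).
    """
    n = len(probs)
    if n < 2:
        return 0.0  # a one-type vocabulary carries no information
    return renyi_entropy(probs, alpha) / math.log(n)


# Toy corpora (illustrative only): one balanced, one dominated by a single token.
balanced = token_distribution(["a", "b", "c", "d"])
skewed = token_distribution(["a"] * 97 + ["b", "c", "d"])

shannon_eff_skewed = efficiency(skewed, alpha=1.0)
renyi_eff_skewed = efficiency(skewed, alpha=2.5)
```

In this sketch a uniform distribution reaches an efficiency of 1 for any α, while the skewed corpus scores lower, and lower still under α = 2.5 than under the Shannon case, because Rényi entropy is non-increasing in α.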

