Byte Pair Encoding is Suboptimal for Language Model Pretraining

04/07/2020
by Kaj Bostrom, et al.

The success of pretrained transformer language models in natural language processing has led to a wide range of different pretraining setups. These models employ a variety of subword tokenization methods, most notably byte pair encoding (BPE) (Sennrich et al., 2016; Gage, 1994), the WordPiece method (Schuster and Nakajima, 2012), and unigram language modeling (Kudo, 2018), to segment text. However, to the best of our knowledge, the literature does not contain a direct evaluation of the impact of tokenization on language model pretraining. First, we analyze differences between BPE and unigram LM tokenization, and find that the unigram LM method is able to recover subword units that more strongly align with underlying morphology, in addition to avoiding several shortcomings of BPE stemming from its greedy construction procedure. We then compare the fine-tuned task performance of identical transformer masked language models pretrained with these tokenizations. Across downstream tasks, we find that the unigram LM tokenization method consistently matches or outperforms BPE. We hope that developers of future pretrained language models will consider adopting the unigram LM method over the more common BPE.
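The contrast drawn in the abstract can be reproduced in miniature with the SentencePiece library, which implements both BPE and unigram LM tokenization. The sketch below is illustrative only: the corpus file, vocabulary size, and example words are placeholders and not the paper's actual pretraining configuration.

    import sentencepiece as spm

    # Train a BPE model and a unigram LM model on the same (placeholder) corpus.
    for model_type in ("bpe", "unigram"):
        spm.SentencePieceTrainer.train(
            input="corpus.txt",        # placeholder corpus file
            model_prefix=model_type,   # writes bpe.model / unigram.model
            vocab_size=8000,           # illustrative vocabulary size
            model_type=model_type,
        )

    bpe = spm.SentencePieceProcessor(model_file="bpe.model")
    uni = spm.SentencePieceProcessor(model_file="unigram.model")

    # Compare how the two tokenizers segment a few morphologically complex words.
    for word in ["unaffordable", "pretraining", "hopefulness"]:
        print(word, bpe.encode(word, out_type=str), uni.encode(word, out_type=str))

In a comparison along these lines, the paper's observation is that unigram LM segmentations tend to fall at morpheme-like boundaries (a prefix plus a stem, for instance), while BPE's greedy merge procedure more often yields frequency-driven fragments that cut across them.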

