Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens

08/25/2021
by Itay Itzhak, et al.

Standard pretrained language models operate on sequences of subword tokens without direct access to the characters that compose each token's string representation. We probe the embedding layer of pretrained language models and show that models learn the internal character composition of whole-word and subword tokens to a surprising extent, without ever seeing the characters coupled with the tokens. Our results show that the embedding layer of RoBERTa holds enough information to accurately spell up to a third of the vocabulary and to reach high average character n-gram overlap across all token types. We further test whether enriching subword models with additional character information can improve language modeling, and observe that this method has a learning curve nearly identical to that of training without spelling-based enrichment. Overall, our results suggest that language modeling objectives incentivize the model to implicitly learn some notion of spelling, and that explicitly teaching the model how to spell does not enhance its performance on such tasks.
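The probing setup described in the abstract can be pictured with a short sketch along the following lines. This is hypothetical illustration code, not the authors' released implementation: it freezes RoBERTa's input-embedding matrix, trains a small character-level GRU decoder to generate each token's characters from its embedding vector, and reports exact-match spelling accuracy on held-out vocabulary entries. The probe architecture, hyperparameters, and token filtering are assumptions made for the example.

```python
# Minimal spelling-probe sketch (illustrative, not the paper's exact probe):
# decode a token's characters from its frozen RoBERTa embedding vector.
import random
import torch
import torch.nn as nn
from transformers import RobertaTokenizer, RobertaModel

tok = RobertaTokenizer.from_pretrained("roberta-base")
emb = RobertaModel.from_pretrained("roberta-base").get_input_embeddings().weight.detach()

# Pair vocabulary ids with their character strings (drop the BPE space marker "Ġ").
tokens = [(i, t.lstrip("Ġ")) for i, t in enumerate(tok.convert_ids_to_tokens(list(range(len(tok)))))]
tokens = [(i, s) for i, s in tokens if s.isalpha()]          # keep alphabetic tokens only
chars = sorted({c for _, s in tokens for c in s})
PAD, BOS, EOS = 0, 1, 2
c2i = {c: i + 3 for i, c in enumerate(chars)}

random.seed(0)
random.shuffle(tokens)
split = int(0.8 * len(tokens))
train, test = tokens[:split], tokens[split:]                  # held-out tokens test the embeddings, not memorization

class SpellingProbe(nn.Module):
    """Condition a GRU character decoder on a single (frozen) token embedding."""
    def __init__(self, d_emb, d_hid=512, n_chars=len(c2i) + 3):
        super().__init__()
        self.init = nn.Linear(d_emb, d_hid)
        self.char_emb = nn.Embedding(n_chars, d_hid, padding_idx=PAD)
        self.gru = nn.GRU(d_hid, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid, n_chars)

    def forward(self, tok_vecs, char_ids):
        h0 = torch.tanh(self.init(tok_vecs)).unsqueeze(0)     # (1, B, H) initial state from the token embedding
        out, _ = self.gru(self.char_emb(char_ids), h0)
        return self.out(out)

def batchify(pairs, max_len=20):
    ids = torch.tensor([i for i, _ in pairs])
    seqs = [[BOS] + [c2i[c] for c in s][:max_len] + [EOS] for _, s in pairs]
    L = max(len(s) for s in seqs)
    seqs = torch.tensor([s + [PAD] * (L - len(s)) for s in seqs])
    return emb[ids], seqs

probe = SpellingProbe(emb.size(1))
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)

for epoch in range(5):
    random.shuffle(train)
    for b in range(0, len(train), 256):
        vecs, seqs = batchify(train[b:b + 256])
        logits = probe(vecs, seqs[:, :-1])                    # teacher forcing
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), seqs[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()

# Greedy decoding on held-out tokens; an exact string match counts as a correct spelling.
i2c = {v: k for k, v in c2i.items()}
correct = 0
with torch.no_grad():
    for tid, s in test:
        h = torch.tanh(probe.init(emb[tid].unsqueeze(0))).unsqueeze(0)
        cur, out = torch.tensor([[BOS]]), []
        for _ in range(25):
            o, h = probe.gru(probe.char_emb(cur), h)
            cur = probe.out(o[:, -1]).argmax(-1, keepdim=True)
            if cur.item() == EOS:
                break
            out.append(i2c.get(cur.item(), ""))
        correct += ("".join(out) == s)
print(f"held-out exact-match spelling accuracy: {correct / len(test):.2%}")
```

Holding out part of the vocabulary is what makes this a probe of information stored in the embeddings rather than of memorization by the decoder; character n-gram overlap between the decoded and true strings could be reported in the same loop as a softer metric.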
