What do tokens know about their characters and how do they know it?

06/06/2022
by Ayush Kaushal, et al.

Pre-trained language models (PLMs) that use subword tokenization schemes can succeed at a variety of language tasks that require character-level information, despite lacking explicit access to the character composition of tokens. Here, studying a range of models (e.g., GPT-J, BERT, RoBERTa, GloVe), we probe what word pieces encode about character-level information by training classifiers to predict the presence or absence of a particular alphabetical character in a token, based on its embedding (e.g., probing whether the model embedding for "cat" encodes that it contains the character "a"). We find that these models robustly encode character-level information and, in general, larger models perform better at the task. We show that these results generalize to characters from non-Latin alphabets (Arabic, Devanagari, and Cyrillic). Then, through a series of experiments and analyses, we investigate the mechanisms through which PLMs acquire English-language character information during training and argue that this knowledge is acquired through multiple phenomena, including a systematic relationship between particular characters and particular parts of speech, as well as natural variability in the tokenization of related strings.
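As a rough illustration of the probing setup described in the abstract (not the authors' actual code), the sketch below trains a per-character binary classifier on frozen token embeddings. The vocabulary and embedding matrix here are hypothetical stand-ins; a real experiment would load them from a pre-trained model's tokenizer and embedding layer.

```python
# Rough sketch of the character-presence probing setup (hypothetical data:
# a random matrix stands in for real GPT-J / BERT / RoBERTa / GloVe token
# embeddings, and the tiny vocabulary below is illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical vocabulary and embedding table; real experiments would load
# these from a pre-trained model's tokenizer and embedding layer.
vocab = ["cat", "dog", "house", "tree", "apple", "stone", "river", "cloud",
         "paper", "token", "model", "glass", "north", "light", "grain", "mouse"]
embeddings = rng.normal(size=(len(vocab), 300))  # 300-d stand-in vectors

# Binary label per token: does it contain the target character?
target_char = "a"
labels = np.array([int(target_char in tok) for tok in vocab])

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.25, random_state=0, stratify=labels
)

# One linear probe per character, trained on the frozen embeddings.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Probe accuracy for '{target_char}': {probe.score(X_test, y_test):.2f}")
```

With random embeddings this probe should hover near chance; the paper's finding is that embeddings from real PLMs support well-above-chance accuracy at the same task.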


