Inducing Character-level Structure in Subword-based Language Models with Type-level Interchange Intervention Training

12/19/2022
by Jing Huang, et al.

Language tasks involving character-level manipulations (e.g., spelling correction, many word games) are challenging for models based on subword tokenization. To address this, we adapt the interchange intervention training method of Geiger et al. (2021) to operate on type-level variables over characters. This allows us to encode robust, position-independent character-level information in the internal representations of subword-based models. We additionally introduce a suite of character-level tasks that systematically vary in their dependence on meaning and sequence-level context. While simple character-level tokenization approaches still perform best on purely form-based tasks like string reversal, our method is superior for more complex tasks that blend form, meaning, and context, such as spelling correction in context and word search games. Our approach also leads to subword-based models with human-interpretable internal representations of characters.
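The core mechanism named here, an interchange intervention, can be pictured as an activation swap between two forward passes: activations from a "source" run are patched into a "base" run, and the model is trained so the patched run produces the counterfactual output implied by the swapped character-level variable. The sketch below is a minimal, hypothetical illustration of one such training step, assuming a Hugging Face GPT-2-style model; the function name, the intervened layer and positions, and the frozen source run are illustrative assumptions, not the authors' implementation (full IIT also backpropagates through the source run and aligns type-level character variables, which this sketch simplifies away).

```python
import torch
import torch.nn.functional as F

def interchange_intervention_loss(model, base_ids, source_ids,
                                  counterfactual_labels, layer, positions):
    """Hypothetical single IIT-style step: swap block-`layer` activations at
    `positions` from a source run into a base run, then supervise the
    patched run with counterfactual labels."""
    # Cache the source run's hidden states at the output of block `layer`.
    # (hidden_states[0] is the embedding output, so block `layer` is layer+1.)
    with torch.no_grad():  # simplification: real IIT also trains the source run
        source_hidden = model(
            source_ids, output_hidden_states=True
        ).hidden_states[layer + 1]

    # Forward hook that overwrites the base run's activations at the
    # intervened token positions with the cached source activations.
    def swap_hook(module, inputs, output):
        hidden = output[0].clone()
        hidden[:, positions, :] = source_hidden[:, positions, :]
        return (hidden,) + output[1:]

    handle = model.transformer.h[layer].register_forward_hook(swap_hook)
    try:
        logits = model(base_ids).logits  # patched (counterfactual) forward pass
    finally:
        handle.remove()

    # Train the subword model so the patched run predicts the output the
    # high-level causal model assigns after swapping the character variable.
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           counterfactual_labels.view(-1))
```

Minimizing this loss pushes the chosen hidden positions to causally encode the swapped character variable, which is what yields the position-independent, interpretable character representations the abstract describes.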


research · 06/06/2022
What do tokens know about their characters and how do they know it?
Pre-trained language models (PLMs) that use subword tokenization schemes...

research · 03/15/2022
Signal in Noise: Exploring Meaning Encoded in Random Character Sequences with Character-Aware Language Models
Natural language processing models learn word representations based on t...

research · 05/05/2023
Adapting Transformer Language Models for Predictive Typing in Brain-Computer Interfaces
Brain-computer interfaces (BCI) are an important mode of alternative and...

research · 11/06/2020
Understanding Pure Character-Based Neural Machine Translation: The Case of Translating Finnish into English
Recent work has shown that deeper character-based neural machine transla...

research · 09/02/2017
Patterns versus Characters in Subword-aware Neural Language Modeling
Words in some natural languages can have a composite structure. Elements...

research · 02/01/2021
Inducing Meaningful Units from Character Sequences with Slot Attention
Characters do not convey meaning, but sequences of characters do. We pro...

research · 10/26/2020
PowerTransformer: Unsupervised Controllable Revision for Biased Language Correction
Unconscious biases continue to be prevalent in modern text and media, ca...
