Learning to Look Inside: Augmenting Token-Based Encoders with Character-Level Information

08/01/2021
by Yuval Pinter, et al.

Commonly-used transformer language models depend on a tokenization schema that fixes a subword vocabulary prior to pre-training, which is then applied to all downstream tasks regardless of domain shift, novel word formations, or other sources of vocabulary mismatch. Recent work has shown that "token-free" models can be trained directly on characters or bytes, but training these models from scratch requires substantial computational resources, and adopting them means discarding the many domain-specific models already trained on tokens. In this paper, we present XRayEmb, a method for retrofitting existing token-based models with character-level information. XRayEmb is composed of a character-level "encoder" that computes vector representations of character sequences, and a generative component that decodes from the internal representation back to a character sequence. We show that incorporating XRayEmb's learned vectors into sequences of pre-trained token embeddings improves performance on both autoregressive and masked pre-trained transformer architectures, and on both sequence-level and sequence tagging tasks, particularly on non-standard English text.
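To make the retrofitting idea concrete, here is a minimal PyTorch sketch of the encoder half of such a scheme. Everything in it is an assumption for illustration: the class and function names are hypothetical, a bidirectional GRU stands in for whatever character encoder the paper actually uses, and the generative (decoding) component described in the abstract is omitted. The key point it shows is that a word's character sequence is mapped to a vector of the same width as the pre-trained token embeddings, so it can be spliced into the existing embedding sequence.

```python
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    """Hypothetical character-level encoder: maps a word's character
    sequence to one vector in the token-embedding space."""

    def __init__(self, n_chars=256, char_dim=64, hidden_dim=256, out_dim=768):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.rnn = nn.GRU(char_dim, hidden_dim,
                          batch_first=True, bidirectional=True)
        # Project to the width of the pre-trained token embeddings.
        self.proj = nn.Linear(2 * hidden_dim, out_dim)

    def forward(self, char_ids):
        # char_ids: (batch, max_word_len) integer character IDs
        embedded = self.char_emb(char_ids)
        _, h = self.rnn(embedded)            # h: (2, batch, hidden_dim)
        h = torch.cat([h[0], h[1]], dim=-1)  # concatenate both directions
        return self.proj(h)                  # (batch, out_dim)

def splice_char_vectors(token_embs, char_vecs, positions):
    """Replace token embeddings at `positions` (e.g., words the subword
    vocabulary handles poorly) with character-derived vectors."""
    out = token_embs.clone()
    out[positions] = char_vecs
    return out
```

In use, one would pick the positions where the fixed subword vocabulary mismatches the input (for instance, words shattered into many subword pieces, or non-standard spellings), run those words through the character encoder, and feed the spliced embedding sequence to the pre-trained transformer; the actual selection criterion and training procedure are the paper's, not this sketch's.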


