CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

03/11/2021
by Jonathan H. Clark, et al.

Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy with soft inductive biases in place of hard token boundaries. To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by >= 1 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.
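The core architectural idea from the abstract, embedding raw characters and downsampling the sequence before a deep transformer stack, can be sketched roughly as follows. This is a minimal illustrative sketch in PyTorch, not the authors' implementation: the hash-bucket count, downsampling rate, hidden size, and layer count are assumptions, and the hashing, local attention, and upsampling details of the actual model are omitted.

```python
# Minimal sketch of a character-level encoder with downsampling before a deep
# transformer stack, loosely following the idea described in the abstract.
# All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class CharDownsampleEncoder(nn.Module):
    def __init__(self, hidden=768, downsample_rate=4,
                 num_hash_buckets=16384, deep_layers=12, heads=12):
        super().__init__()
        # Vocabulary-free input: map Unicode code points into a fixed number of
        # hash buckets instead of a learned subword vocabulary.
        self.char_embed = nn.Embedding(num_hash_buckets, hidden)
        # Strided convolution shortens the character sequence by downsample_rate
        # so the expensive deep stack runs over far fewer positions.
        self.downsample = nn.Conv1d(hidden, hidden,
                                    kernel_size=downsample_rate,
                                    stride=downsample_rate)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                           batch_first=True)
        self.deep_stack = nn.TransformerEncoder(layer, num_layers=deep_layers)

    def forward(self, codepoints):
        # codepoints: (batch, char_len) integer Unicode code points
        buckets = codepoints % self.char_embed.num_embeddings
        x = self.char_embed(buckets)                       # (batch, char_len, hidden)
        x = self.downsample(x.transpose(1, 2)).transpose(1, 2)  # (batch, char_len / r, hidden)
        return self.deep_stack(x)                          # contextualized, downsampled positions

# Usage: encode a short string directly from its code points, no tokenizer needed.
model = CharDownsampleEncoder()
ids = torch.tensor([[ord(c) for c in "tokenization-free"]])
out = model(ids)  # shape: (1, ceil-free floor of len/4, 768)
```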
