A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning

by Md Mofijul Islam, et al.

Subword tokenization is a commonly used input pre-processing step in most recent NLP models. However, it limits the models' ability to leverage end-to-end task learning. Its frequency-based vocabulary creation compromises tokenization in low-resource languages, leading models to produce suboptimal representations. Additionally, the dependency on a fixed vocabulary limits the subword models' adaptability across languages and domains. In this work, we propose a vocabulary-free neural tokenizer by distilling segmentation information from heuristic-based subword tokenization. We pre-train our character-based tokenizer by processing unique words from a multilingual corpus, thereby substantially increasing word diversity across languages. Unlike subword methods, which rely on predefined and fixed vocabularies, our tokenizer allows end-to-end task learning, resulting in optimal task-specific tokenization. The experimental results show that replacing the subword tokenizer with our neural tokenizer consistently improves performance on multilingual (NLI) and code-switching (sentiment analysis) tasks, with larger gains in low-resource languages. Additionally, our neural tokenizer exhibits robust performance on downstream tasks when adversarial noise is present (typos and misspellings), further increasing the initial improvements over statistical subword tokenizers.
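To make the idea of distilling segmentation information from a subword tokenizer more concrete, the sketch below turns a subword split of a word into per-character boundary labels, the kind of target a character-based tokenizer could be trained to predict. The B/I labeling scheme, the function names `boundary_labels` and `segment`, and the example split of "tokenizer" are illustrative assumptions; the abstract does not specify how the authors encode segmentation.

```python
from typing import List, Tuple


def boundary_labels(word: str, pieces: List[str]) -> List[Tuple[str, str]]:
    """Label each character 'B' if it begins a subword piece, else 'I'.

    Assumption: `pieces` is the output of some heuristic subword tokenizer
    (for example BPE) with any continuation markers already stripped.
    """
    if "".join(pieces) != word:
        raise ValueError("pieces must concatenate to the original word")
    labels = []
    for piece in pieces:
        for i, ch in enumerate(piece):
            labels.append((ch, "B" if i == 0 else "I"))
    return labels


def segment(labeled_chars: List[Tuple[str, str]]) -> List[str]:
    """Rebuild subword pieces from per-character B/I labels."""
    pieces: List[str] = []
    for ch, tag in labeled_chars:
        if tag == "B" or not pieces:
            pieces.append(ch)   # start a new piece
        else:
            pieces[-1] += ch    # extend the current piece
    return pieces


# Hypothetical subword split used only for illustration.
example = boundary_labels("tokenizer", ["token", "izer"])
tags = "".join(tag for _, tag in example)
restored = segment(example)
```

Because the labels are attached to characters rather than to entries in a fixed vocabulary, a model trained on such targets can still emit a segmentation for unseen or misspelled words, which is consistent with the adaptability and noise-robustness claims in the abstract.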




