A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning

04/22/2022
by   Md Mofijul Islam, et al.

Subword tokenization is a commonly used input pre-processing step in most recent NLP models. However, it limits the models' ability to leverage end-to-end task learning. Its frequency-based vocabulary creation compromises tokenization in low-resource languages, leading models to produce suboptimal representations. Additionally, the dependency on a fixed vocabulary limits the subword models' adaptability across languages and domains. In this work, we propose a vocabulary-free neural tokenizer that distills segmentation information from heuristic-based subword tokenization. We pre-train our character-based tokenizer on unique words drawn from a multilingual corpus, thereby substantially increasing word diversity across languages. Unlike the predefined, fixed vocabularies of subword methods, our tokenizer allows end-to-end task learning, resulting in optimal task-specific tokenization. The experimental results show that replacing the subword tokenizer with our neural tokenizer consistently improves performance on multilingual (NLI) and code-switching (sentiment analysis) tasks, with larger gains in low-resource languages. Additionally, our neural tokenizer exhibits robust downstream performance in the presence of adversarial noise (typos and misspellings), further widening its initial improvements over statistical subword tokenizers.
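To make the distillation idea concrete, here is a minimal sketch of how a heuristic subword segmentation can be converted into per-character boundary labels — the kind of supervision a character-level neural tokenizer could be trained to reproduce. The helper function and label scheme are illustrative assumptions, not the paper's exact training setup.

```python
def boundary_labels(word, subwords):
    """Turn a heuristic subword segmentation into per-character
    boundary labels: 1 marks the first character of a segment,
    0 marks a continuation. These labels are a plausible
    distillation target for a character-based tokenizer.
    (Hypothetical helper; the paper may use a different scheme.)"""
    assert "".join(subwords) == word, "segments must reconstruct the word"
    labels = []
    for piece in subwords:
        labels.append(1)                     # segment start
        labels.extend([0] * (len(piece) - 1))  # segment continuation
    return labels


# "unhappiness" segmented by a heuristic tokenizer as un / happi / ness
print(boundary_labels("unhappiness", ["un", "happi", "ness"]))
# -> [1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
```

A character-level model trained on such labels needs no fixed vocabulary: at inference it predicts a boundary probability for every character position, so segmentation remains differentiable and can be fine-tuned end-to-end with the downstream task.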
