Improving Pre-Trained Multilingual Models with Vocabulary Expansion

09/26/2019
by Hai Wang, et al.

Recently, pre-trained language models have achieved remarkable success in a broad range of natural language processing tasks. However, in a multilingual setting, it is extremely resource-consuming to pre-train a deep language model over large-scale corpora for each language. Instead of exhaustively pre-training monolingual language models independently, an alternative solution is to pre-train a powerful multilingual deep language model over large-scale corpora in hundreds of languages. However, the vocabulary size for each language in such a model is relatively small, especially for low-resource languages. This limitation inevitably hinders the performance of these multilingual models on tasks such as sequence labeling, wherein in-depth token-level or sentence-level understanding is essential. In this paper, inspired by previous methods designed for monolingual settings, we investigate two approaches (i.e., joint mapping and mixture mapping) based on the pre-trained multilingual model BERT to address the out-of-vocabulary (OOV) problem on a variety of tasks, including part-of-speech tagging, named entity recognition, machine translation quality estimation, and machine reading comprehension. Experimental results show that using mixture mapping is more promising. To the best of our knowledge, this is the first work that attempts to address and discuss the OOV issue in multilingual settings.
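The abstract only names the two approaches, so the following is a minimal, non-authoritative sketch of the mixture-mapping idea: an out-of-vocabulary word's embedding is built as a weighted mixture of the embeddings of existing in-vocabulary "anchor" words. The function name, the top-k truncation, and the softmax weighting below are illustrative assumptions, not the paper's exact recipe; the anchor alignment and the source of the external embeddings are likewise assumed.

```python
import numpy as np

def mixture_mapping_embedding(oov_vec, anchor_vecs, bert_embeddings, top_k=5):
    """Sketch of mixture mapping for one OOV word.

    oov_vec         : external embedding of the OOV word, shape (d_ext,)
    anchor_vecs     : external embeddings of in-vocabulary anchor words,
                      shape (V, d_ext), row-aligned with bert_embeddings
    bert_embeddings : BERT input embeddings of the same anchor words,
                      shape (V, d_bert)
    Returns a (d_bert,) vector that can be appended to BERT's embedding matrix.
    """
    # Cosine similarity between the OOV word and every anchor word,
    # computed in the external embedding space.
    norm_oov = oov_vec / np.linalg.norm(oov_vec)
    norm_anchors = anchor_vecs / np.linalg.norm(anchor_vecs, axis=1, keepdims=True)
    sims = norm_anchors @ norm_oov                      # shape (V,)

    # Keep the top-k most similar anchors and turn their similarities
    # into mixture weights with a softmax (an assumed weighting scheme).
    top = np.argsort(sims)[-top_k:]
    weights = np.exp(sims[top]) / np.exp(sims[top]).sum()

    # The OOV embedding is the weighted average of the anchors' BERT embeddings.
    return weights @ bert_embeddings[top]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    oov = rng.normal(size=300)                # external embedding of one OOV word
    anchors = rng.normal(size=(1000, 300))    # external embeddings of anchor words
    bert = rng.normal(size=(1000, 768))       # BERT embeddings of the same anchors
    print(mixture_mapping_embedding(oov, anchors, bert).shape)  # (768,)
```

Joint mapping, by contrast, would map the external embedding into BERT's space with a single learned transformation rather than mixing anchor embeddings; the paper reports that the mixture variant is the more promising of the two.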


Related research

05/15/2022 · TiBERT: Tibetan Pre-trained Language Model
  The pre-trained language model is trained on large-scale unlabeled text ...

05/17/2019 · Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language
  The paper introduces methods of adaptation of multilingual masked langua...

01/25/2023 · XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models
  Large multilingual language models typically rely on a single vocabulary...

08/10/2020 · KR-BERT: A Small-Scale Korean-Specific Language Model
  Since the appearance of BERT, recent works including XLNet and RoBERTa u...

08/28/2020 · Knowledge Efficient Deep Learning for Natural Language Processing
  Deep learning has become the workhorse for a wide range of natural langu...

12/01/2015 · Multilingual Language Processing From Bytes
  We describe an LSTM-based model which we call Byte-to-Span (BTS) that re...

08/28/2018 · A Unified Multilingual Handwriting Recognition System using multigrams sub-lexical units
  We address the design of a unified multilingual system for handwriting r...
