Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages

04/20/2023
by Verena Blaschke, et al.

One of the challenges of finetuning pretrained language models (PLMs) is that their tokenizers are optimized for the language(s) they were pretrained on, but brittle when confronted with previously unseen variation in the data. This can, for instance, be observed when finetuning PLMs on one language and evaluating them on data in a closely related language variety with no standardized orthography. Despite the high linguistic similarity, the tokenization no longer corresponds to meaningful representations of the target data, leading to low performance in, e.g., part-of-speech tagging. In this work, we finetune PLMs on seven languages from three different families and analyze their zero-shot performance on closely related, non-standardized varieties. We consider different measures of the divergence between the tokenization of the source and target data, and the ways they can be adjusted by manipulating the tokenization during the finetuning step. Overall, we find that the similarity between the percentage of words that get split into subwords in the source and target data (the split word ratio difference) is the strongest predictor of model performance on the target data.
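To make the split word ratio concrete: it is the fraction of whitespace-separated words that the tokenizer splits into more than one subword, computed separately on the source and target corpora; the paper's strongest performance predictor is how close these two fractions are. The sketch below is a minimal illustration assuming a HuggingFace tokenizer; the model name, function, and placeholder sentences are assumptions for illustration, not taken from the paper.

```python
from transformers import AutoTokenizer

# Any multilingual PLM tokenizer would do; mBERT is used here as an assumed example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def split_word_ratio(sentences, tokenizer):
    """Fraction of whitespace-separated words split into more than one subword."""
    split, total = 0, 0
    for sentence in sentences:
        for word in sentence.split():
            total += 1
            if len(tokenizer.tokenize(word)) > 1:
                split += 1
    return split / total if total else 0.0

# Placeholder data: standardized finetuning (source) sentences vs.
# non-standardized zero-shot (target) sentences.
source_sentences = ["this is a standardized source sentence ."]
target_sentences = ["dis iz a non-standardised target sentence ."]

# Absolute difference between the two ratios: the quantity the paper finds
# most predictive of zero-shot tagging performance on the target variety.
diff = abs(split_word_ratio(source_sentences, tokenizer)
           - split_word_ratio(target_sentences, tokenizer))
print(f"split word ratio difference: {diff:.3f}")
```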


Related research

09/14/2021
Improving Zero-shot Cross-lingual Transfer between Closely Related Languages by injecting Character-level Noise
Cross-lingual transfer between a high-resource language and its dialects...

07/12/2022
Zero-shot Cross-lingual Transfer is Under-specified Optimization
Pretrained multilingual encoders enable zero-shot cross-lingual transfer...

03/30/2023
Fine-Tuning BERT with Character-Level Noise for Zero-Shot Transfer to Dialects and Closely-Related Languages
In this work, we induce character-level noise in various forms when fine...

05/23/2023
MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African Languages
In this paper, we present MasakhaPOS, the largest part-of-speech (POS) d...

10/12/2019
Acquisition of Inflectional Morphology in Artificial Neural Networks With Prior Knowledge
How does knowledge of one language's morphology influence learning of in...

06/14/2021
Modeling Profanity and Hate Speech in Social Media with Semantic Subspaces
Hate speech and profanity detection suffer from data sparsity, especiall...
