Subword Pooling Makes a Difference

02/22/2021
by   Judit Acs, et al.

Contextual word representations have become standard in modern natural language processing systems. These models use subword tokenization to handle large vocabularies and unknown words. Using such systems at the word level requires a way of pooling the multiple subwords that correspond to a single word. In this paper we investigate how the choice of subword pooling affects downstream performance on three tasks: morphological probing, POS tagging and NER, in 9 typologically diverse languages. We compare the pooling strategies in two massively multilingual models, mBERT and XLM-RoBERTa. For morphological tasks, the widely used "choose the first subword" strategy is the worst, and the best results are obtained by using attention over the subwords. For POS tagging, both of these strategies perform poorly, and the best choice is to use a small LSTM over the subwords. The same strategy works best for NER, and we show that mBERT is better than XLM-RoBERTa in all 9 languages. We publicly release all code, data and the full result tables at <https://github.com/juditacs/subword-choice>.
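
To make the idea of subword pooling concrete, here is a minimal sketch in plain Python of two of the strategies named above: taking the first subword, and attention over the subwords. It is an illustration under stated assumptions, not the authors' implementation. The function names, the toy vectors, and the fixed `query` vector (standing in for parameters that would normally be learned) are all hypothetical. The LSTM-based pooling is omitted because it needs a trained recurrent network.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def group_by_word(subword_vectors: Sequence[Vector],
                  word_ids: Sequence[int]) -> List[List[Vector]]:
    """Collect the subword vectors that belong to each word, in word order."""
    groups: Dict[int, List[Vector]] = {}
    for vec, wid in zip(subword_vectors, word_ids):
        groups.setdefault(wid, []).append(vec)
    return [groups[wid] for wid in sorted(groups)]


def pool_first(group: Sequence[Vector]) -> Vector:
    """Represent the word by its first subword only."""
    return list(group[0])


def pool_attention(group: Sequence[Vector], query: Vector) -> Vector:
    """Softmax-weighted sum of the subword vectors.

    Assumption: the score of a subword is its dot product with `query`,
    a hypothetical stand-in for learned attention parameters.
    """
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in group]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(group[0])
    return [sum(w * vec[i] for w, vec in zip(weights, group))
            for i in range(dim)]


# Toy input: three subword vectors; the first two belong to word 0.
subword_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
word_ids = [0, 0, 1]

groups = group_by_word(subword_vectors, word_ids)
first_pooled = [pool_first(g) for g in groups]
attn_pooled = [pool_attention(g, query=[0.0, 0.0]) for g in groups]
```

With an all-zero `query`, every subword gets the same score, so the attention weights are uniform and the result is the average of the subword vectors. A non-zero query shifts weight toward particular subwords, which is the extra flexibility that first-subword pooling lacks.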


