CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models

05/23/2023
by Benjamin Minixhofer, et al.

While many languages possess processes of joining two or more words to create compound words, previous studies have typically been limited to languages with excessively productive compound formation (e.g., German, Dutch), and there is no public dataset containing compound and non-compound words across a large number of languages. In this work, we systematically study decompounding, the task of splitting compound words into their constituents, at a wide scale. We first address the data gap by introducing a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary. We then use this dataset to evaluate an array of Large Language Models (LLMs) on the decompounding task. We find that LLMs perform poorly, especially on words which are tokenized unfavorably by subword tokenization. We thus introduce a novel methodology to train dedicated models for decompounding. The proposed two-stage procedure relies on a fully self-supervised objective in the first stage, while the second, supervised learning stage optionally fine-tunes the model on the annotated Wiktionary data. Our self-supervised models outperform the prior best unsupervised decompounding models by 13.9% accuracy. Our fine-tuned models outperform all prior (language-specific) decompounding tools. Furthermore, we use our models to leverage decompounding during the creation of a subword tokenizer, which we refer to as CompoundPiece. CompoundPiece tokenizes compound words more favorably on average, leading to improved performance on decompounding over an otherwise equivalent model using SentencePiece tokenization.
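
For readers unfamiliar with the task setup, the sketch below illustrates what decompounding evaluation looks like: each word is mapped to a list of constituents (non-compound words are left unsplit), and predictions are scored by exact match against gold splits. The example words, the `decompound` stub, and the metric here are illustrative assumptions for exposition, not the paper's released models or code.

```python
# Minimal sketch of the decompounding task and an exact-match evaluation.
# A trained model replaces `decompound`; this stub simply returns the word unsplit.

def decompound(word: str) -> list[str]:
    """Placeholder decompounder. A real model would map
    e.g. 'bookshelf' -> ['book', 'shelf']."""
    return [word]

def exact_match_accuracy(predictions: list[list[str]],
                         gold: list[list[str]]) -> float:
    """Fraction of words whose predicted constituent list equals the gold split.
    Non-compound words count as correct only if left unsplit."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Hypothetical evaluation set: (word, gold constituents).
eval_set = [
    ("bookshelf", ["book", "shelf"]),
    ("toothbrush", ["tooth", "brush"]),
    ("table", ["table"]),  # non-compound word, must stay whole
]

preds = [decompound(word) for word, _ in eval_set]
golds = [constituents for _, constituents in eval_set]
print(f"exact-match accuracy: {exact_match_accuracy(preds, golds):.2f}")
```

In this setup the trivial no-split baseline is only correct on non-compound words, which is why unfavorable subword tokenization of compounds, as discussed in the abstract, directly hurts decompounding performance.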

