Subword Segmental Language Modelling for Nguni Languages

10/12/2022
by Francois Meyer, et al.

Subwords have become the standard units of text in NLP, enabling efficient open-vocabulary models. With algorithms like byte-pair encoding (BPE), subword segmentation is viewed as a preprocessing step applied to the corpus before training. This can lead to sub-optimal segmentations for low-resource languages with complex morphologies. We propose a subword segmental language model (SSLM) that learns how to segment words while being trained for autoregressive language modelling. By unifying subword segmentation and language modelling, our model learns subwords that optimise LM performance. We train our model on the four Nguni languages of South Africa. These are low-resource agglutinative languages, so subword information is critical. As an LM, SSLM outperforms existing approaches such as BPE-based models on average across the four languages. Furthermore, it outperforms standard subword segmenters on unsupervised morphological segmentation. We also train our model as a word-level sequence model, resulting in an unsupervised morphological segmenter that outperforms existing methods by a large margin for all four languages. Our results show that learning subword segmentation is an effective alternative to existing subword segmenters, enabling the model to discover morpheme-like subwords that improve its LM capabilities.
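
The abstract does not spell out how segmentation is learned during training. A common formulation in segmental language models, assumed here for illustration only, is to treat the segmentation of a word as a latent variable: training sums the probability over all segmentations, and the highest-scoring one is read off afterwards. The Python sketch below shows that dynamic programme on a toy example. The lexicon, its probabilities, the helper names and the `max_len` limit are all hypothetical stand-ins; in a real model a neural network conditioned on the preceding context would supply the segment scores.

```python
import math

# Hypothetical toy segment probabilities (illustrative values only).
# A trained model would produce these scores from context instead.
TOY_LEXICON = {
    "ngi": 0.2, "ya": 0.2, "bong": 0.1, "a": 0.1,
    "bonga": 0.005, "ngiya": 0.01,
}


def logsumexp(a, b):
    """Numerically stable log(exp(a) + exp(b)), handling -inf inputs."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def segment_log_prob(segment, lexicon):
    """Log-probability of one segment; -inf if the toy lexicon lacks it."""
    return math.log(lexicon[segment]) if segment in lexicon else -math.inf


def marginal_and_best(word, lexicon, max_len=5):
    """Return (log marginal over all segmentations, best segmentation)."""
    n = len(word)
    alpha = [-math.inf] * (n + 1)  # alpha[i]: log-sum over segmentations of word[:i]
    best = [-math.inf] * (n + 1)   # best[i]: log-prob of the best segmentation of word[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last segment on the best path
    alpha[0] = best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            lp = segment_log_prob(word[start:end], lexicon)
            if lp == -math.inf:
                continue
            alpha[end] = logsumexp(alpha[end], alpha[start] + lp)
            if best[start] + lp > best[end]:
                best[end] = best[start] + lp
                back[end] = start
    # Follow back-pointers from the end of the word to recover the best path.
    segments, i = [], n
    while i > 0:
        segments.append(word[back[i]:i])
        i = back[i]
    return alpha[n], segments[::-1]


log_marginal, best_segments = marginal_and_best("ngiyabonga", TOY_LEXICON)
print(best_segments)
```

With a BPE-style pipeline, the segmentation is fixed before the language model sees any data. In a formulation like the one sketched here, the marginal likelihood is the training objective, so the segment scores and the language model are optimised together.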


