Canonical and Surface Morphological Segmentation for Nguni Languages

04/01/2021
by Tumi Moeng, et al.

Morphological segmentation involves decomposing words into morphemes, the smallest meaning-bearing units of language. This is an important NLP task for morphologically-rich agglutinative languages such as the Southern African Nguni language group. In this paper, we investigate supervised and unsupervised models for two variants of morphological segmentation: canonical and surface segmentation. We train sequence-to-sequence models for canonical segmentation, where the underlying morphemes may not be equal to the surface form of the word, and Conditional Random Fields (CRFs) for surface segmentation. Transformers outperform LSTMs with attention on canonical segmentation, obtaining an average F1 score of 72.5%. Feature-based CRFs outperform bidirectional LSTM-CRFs on surface segmentation, obtaining an average F1 score of 97.1%. In the unsupervised setting, an entropy-based approach using a character-level LSTM language model fails to outperform a Morfessor baseline, while on some of the languages neither approach performs much better than a random baseline. We hope that the high performance of the supervised segmentation models will help to facilitate the development of better NLP tools for Nguni languages.
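To make the distinction between the two task variants concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' code: the English word "achievability" is a stock example from the canonical segmentation literature rather than Nguni data from this paper, the helper names `is_surface_segmentation` and `morpheme_f1` are hypothetical, and the multiset-overlap F1 shown is only one plausible scoring scheme, which may differ from the metric used in the paper.

```python
from collections import Counter


def is_surface_segmentation(word, morphemes):
    """Return True if the morphemes concatenate back to the word.

    A surface segmentation only inserts boundaries into the word, so this
    always holds. A canonical segmentation may restore underlying forms,
    so it can fail this check.
    """
    return "".join(morphemes) == word


def morpheme_f1(predicted, gold):
    """Multiset precision/recall/F1 over morphemes (an assumed metric)."""
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative English example, not taken from the paper's data.
word = "achievability"
surface = ["achiev", "abil", "ity"]      # substrings of the word
canonical = ["achieve", "able", "ity"]   # underlying morphemes

surface_ok = is_surface_segmentation(word, surface)
canonical_ok = is_surface_segmentation(word, canonical)
score = morpheme_f1(surface, canonical)
```

Here `surface_ok` is true and `canonical_ok` is false, which is why canonical segmentation is usually treated as a sequence-to-sequence problem while surface segmentation can be cast as sequence labelling over characters.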

