BPE vs. Morphological Segmentation: A Case Study on Machine Translation of Four Polysynthetic Languages

03/16/2022
by Manuel Mager, et al.

Morphologically rich polysynthetic languages present a challenge for NLP systems due to data sparsity; a common strategy for handling this issue is subword segmentation. We investigate a wide variety of supervised and unsupervised morphological segmentation methods for four polysynthetic languages: Nahuatl, Raramuri, Shipibo-Konibo, and Wixarika. We then compare these morphologically inspired segmentation methods against byte-pair encoding (BPE) as input for machine translation (MT) when translating to and from Spanish. We show that, for all language pairs except Nahuatl, an unsupervised morphological segmentation algorithm consistently outperforms BPE and that, although supervised methods achieve better segmentation scores, they underperform in MT. Finally, we contribute two new morphological segmentation datasets for Raramuri and Shipibo-Konibo, and a parallel corpus for Raramuri–Spanish.
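For readers unfamiliar with the BPE baseline mentioned above, the following is a minimal sketch of how BPE merge operations can be learned and applied. It is not the implementation used in the paper: the function names (learn_bpe, merge_pair, segment), the "</w>" end-of-word marker, and the tie-breaking rule are assumptions made for illustration, and the input is a toy word list rather than data from any of the four languages.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of word tokens."""
    # Each word starts as a sequence of characters plus an end-of-word marker
    # (the marker string is an assumption of this sketch).
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by comparing the pairs
        # themselves so the result is deterministic (an illustrative choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the new merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Because the merges are chosen purely by pair frequency, the resulting subword boundaries need not coincide with morpheme boundaries, which is the contrast with morphological segmentation that the abstract draws.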

research
06/30/2021

What Can Unsupervised Machine Translation Contribute to High-Resource Language Pairs?

Whereas existing literature on unsupervised machine translation (MT) foc...
research
10/05/2022

Revisiting Syllables in Language Modelling and their Application on Low-Resource Machine Translation

Language modelling and machine translation tasks mostly use subword or c...
research
04/01/2021

Canonical and Surface Morphological Segmentation for Nguni Languages

Morphological Segmentation involves decomposing words into morphemes, th...
research
10/11/2022

Exploring Segmentation Approaches for Neural Machine Translation of Code-Switched Egyptian Arabic-English Text

Data sparsity is one of the main challenges posed by Code-switching (CS)...
research
09/02/2021

How Suitable Are Subword Segmentation Strategies for Translating Non-Concatenative Morphology?

Data-driven subword segmentation has become the default strategy for ope...
research
05/05/2017

Building Morphological Chains for Agglutinative Languages

In this paper, we build morphological chains for agglutinative languages...
research
05/06/2022

Quantifying Synthesis and Fusion and their Impact on Machine Translation

Theoretical work in morphological typology offers the possibility of mea...
