Text normalization for endangered languages: the case of Ligurian

06/16/2022
by Stefano Lusito, et al.

Text normalization is a crucial technology for low-resource languages which lack rigid spelling conventions. Low-resource text normalization has so far relied upon hand-crafted rules, which are perceived to be more data efficient than neural methods. In this paper we examine the case of text normalization for Ligurian, an endangered Romance language. We collect 4,394 Ligurian sentences paired with their normalized versions, as well as the first monolingual corpus for Ligurian. We show that, in spite of the small amounts of data available, a compact transformer-based model can be trained to achieve very low error rates by the use of backtranslation and appropriate tokenization. Our datasets are released to the public.
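The backtranslation step mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration of the general technique, not the authors' pipeline. It assumes the monolingual text is already in the normalized spelling, and that some reverse model mapping normalized text back to unnormalized text is available. The names `backtranslate`, `build_training_set`, and `reverse_model`, and the English placeholder sentences, are all hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

# A training pair: (unnormalized source, normalized target).
Pair = Tuple[str, str]


def backtranslate(
    monolingual_normalized: Iterable[str],
    reverse_model: Callable[[str], str],
) -> List[Pair]:
    """Create synthetic pairs from target-side text only.

    `reverse_model` is a hypothetical stand-in for a model trained in the
    opposite direction (normalized -> unnormalized).
    """
    pairs: List[Pair] = []
    for target in monolingual_normalized:
        synthetic_source = reverse_model(target)
        pairs.append((synthetic_source, target))
    return pairs


def build_training_set(
    real_pairs: Iterable[Pair],
    monolingual_normalized: Iterable[str],
    reverse_model: Callable[[str], str],
) -> List[Pair]:
    """Concatenate the hand-paired data with the synthetic pairs."""
    return list(real_pairs) + backtranslate(monolingual_normalized, reverse_model)


if __name__ == "__main__":
    # Placeholder data; lowercasing is a toy stand-in for a real reverse model.
    real = [("exampel one", "example one")]
    mono = ["Second Example"]
    for source, target in build_training_set(real, mono, str.lower):
        print(f"{source!r} -> {target!r}")
```

In practice the reverse model would itself be trained on the paired sentences, so the synthetic sources reflect spelling variation actually observed in the data rather than an arbitrary transformation.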


