Noisy Parallel Data Alignment

01/23/2023
by Ruoyu Xie, et al.

An ongoing challenge in natural language processing is that its major advances disproportionately favor resource-rich languages, leaving a significant number of under-resourced languages behind. Because the resources required to train and evaluate models are lacking, most modern language technologies are either nonexistent or unreliable for processing endangered, local, and non-standardized languages. Optical character recognition (OCR) is often used to convert endangered language documents into machine-readable data. However, such OCR output is typically noisy, and most word alignment models are not built to work under such noisy conditions. In this work, we study existing word-level alignment models under noisy settings and aim to make them more robust to noisy data. Our noise simulation and structural biasing method, tested on multiple language pairs, reduces the alignment error rate of a state-of-the-art neural-based alignment model by up to 59.6%.
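
For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of alignment error rate (AER) in its commonly used form, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the predicted alignment, S the sure gold links, and P the possible gold links (with S contained in P). The function name `alignment_error_rate` and the toy link sets are illustrative assumptions, not taken from the paper, and the paper's exact evaluation setup may differ.

```python
def alignment_error_rate(predicted, sure, possible=None):
    """Compute AER = 1 - (|A & S| + |A & P|) / (|A| + |S|).

    Each alignment is a collection of (source_index, target_index) pairs.
    Sure links are always treated as possible links as well.
    """
    a = set(predicted)
    s = set(sure)
    p = s | set(possible or ())  # enforce S being a subset of P
    if not a and not s:
        return 0.0  # nothing predicted and nothing expected: no error
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


# Toy data (hypothetical, for illustration only).
gold_sure = {(0, 0), (1, 1), (2, 2)}
gold_possible = {(2, 3)}

clean_prediction = {(0, 0), (1, 1), (2, 2)}
noisy_prediction = {(0, 0), (1, 2), (2, 3)}  # e.g. links shifted by OCR noise

aer_clean = alignment_error_rate(clean_prediction, gold_sure, gold_possible)
aer_noisy = alignment_error_rate(noisy_prediction, gold_sure, gold_possible)
print(f"clean AER: {aer_clean:.3f}")
print(f"noisy AER: {aer_noisy:.3f}")
```

Lower values are better: a prediction that recovers every sure link and adds nothing outside the possible set scores 0, while misplaced links raise the score.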

