Empirical Evaluation of Sequence-to-Sequence Models for Word Discovery in Low-resource Settings

06/29/2019
by Marcely Zanon Boito, et al.

Since Bahdanau et al. [1] first introduced attention for neural machine translation, most sequence-to-sequence models have made use of attention mechanisms [2, 3, 4]. While these mechanisms produce soft-alignment matrices that can be interpreted as alignments between target and source languages, we lack metrics to quantify their quality, so it remains unclear which approach produces the best alignments. This paper presents an empirical evaluation of three main sequence-to-sequence models (CNN, RNN and Transformer-based) for word discovery from unsegmented phoneme sequences. This task consists of aligning word sequences in a source language with phoneme sequences in a target language and inferring from this alignment a word segmentation on the target side [5]. Evaluating word segmentation quality can be seen as an extrinsic evaluation of the soft-alignment matrices produced during training. Our experiments in a low-resource scenario on Mboshi and English (both aligned to French) show that RNNs surprisingly outperform CNNs and Transformers for this task. Our results are confirmed by an intrinsic evaluation of alignment quality using Average Normalized Entropy (ANE). Lastly, we improve our best word discovery model with an alignment entropy confidence measure that accumulates ANE over all occurrences of a given alignment pair in the collection.
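
To make the two evaluations concrete, here is a minimal Python sketch of how a soft-alignment matrix might be turned into a segmentation and an entropy score. The abstract does not give formulas, so the sketch rests on assumptions: each matrix row holds one target phoneme's attention distribution over the source words, a word boundary is placed wherever the most-attended source word changes between consecutive phonemes, and ANE is the mean over rows of the entropy computed with the logarithm taken in base "number of source words". The function names are hypothetical, not the authors' code.

```python
import math


def normalized_entropy(row):
    """Entropy of one attention row, with log base len(row) so it lies in [0, 1].

    Assumption: the row is a probability distribution over source words.
    """
    n = len(row)
    if n < 2:
        return 0.0  # a single source word leaves no uncertainty
    return -sum(p * math.log(p, n) for p in row if p > 0.0)


def average_normalized_entropy(matrix):
    """Mean normalized entropy over all target phonemes (assumed ANE definition)."""
    return sum(normalized_entropy(row) for row in matrix) / len(matrix)


def segment(phonemes, matrix):
    """Group consecutive phonemes whose most-attended source word is the same.

    Assumption: a boundary is inserted whenever the argmax source word changes.
    """
    best = [max(range(len(row)), key=row.__getitem__) for row in matrix]
    words, current = [], [phonemes[0]]
    for i in range(1, len(phonemes)):
        if best[i] != best[i - 1]:
            words.append("".join(current))
            current = []
        current.append(phonemes[i])
    words.append("".join(current))
    return words


# Toy example: four phonemes aligned to two source words.
toy_matrix = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.3, 0.7],
]
toy_words = segment(list("abcd"), toy_matrix)
toy_ane = average_normalized_entropy(toy_matrix)
```

Under these assumptions, a lower ANE means sharper, more confident alignments, which is the sense in which it could serve as an intrinsic measure alongside the extrinsic segmentation score.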
