Multilingual bottleneck features for subword modeling in zero-resource languages

03/23/2018
by Enno Hermann, et al.

How can we effectively develop speech technology for languages where no transcribed data is available? Many existing approaches use no annotated resources at all, yet it makes sense to leverage information from large annotated corpora in other languages, for example in the form of multilingual bottleneck features (BNFs) obtained from a supervised speech recognition system. In this work, we evaluate the benefits of BNFs for subword modeling (feature extraction) in six unseen languages on a word discrimination task. First we establish a strong unsupervised baseline by combining two existing methods: vocal tract length normalisation (VTLN) and the correspondence autoencoder (cAE). We then show that BNFs trained on a single language already beat this baseline; including up to 10 languages results in additional improvements which cannot be matched by just adding more data from a single language. Finally, we show that the cAE can improve further on the BNFs if high-quality same-word pairs are available.
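
The abstract does not say how the word discrimination task is scored. As a rough illustration only, one plausible way to compare two spoken-word segments is to align their frame-level feature sequences (for example BNFs) with dynamic time warping (DTW) over a frame-wise cosine distance, then treat a low alignment cost as evidence that the two segments are the same word. The function names, the cosine distance, and the length normalisation below are assumptions made for this sketch, not details taken from the paper.

```python
import numpy as np


def cosine_distance_matrix(a, b):
    """Frame-wise cosine distances between sequences a (n, d) and b (m, d)."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_unit @ b_unit.T


def dtw_distance(a, b):
    """Length-normalised DTW alignment cost between two feature sequences.

    Assumption for this sketch: the accumulated cost is divided by n + m
    so that segments of different durations are comparable.
    """
    cost = cosine_distance_matrix(a, b)
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Allowed steps: advance in a, advance in b, or advance in both.
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev
    return acc[n, m] / (n + m)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    word = rng.normal(size=(20, 39))          # hypothetical 39-dim features
    slower = np.repeat(word, 2, axis=0)       # same "word", spoken more slowly
    other = rng.normal(size=(25, 39))         # unrelated segment
    print("same word, different speed:", dtw_distance(word, slower))
    print("different segments:", dtw_distance(word, other))
```

Under this kind of scoring, a better feature representation is one that gives lower costs to same-word pairs than to different-word pairs across speakers.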

research
10/31/2020

Multilingual Bottleneck Features for Improving ASR Performance of Code-Switched Speech in Under-Resourced Languages

In this work, we explore the benefits of using multilingual bottleneck f...
research
11/09/2018

Multilingual and Unsupervised Subword Modeling for Zero-Resource Languages

Unsupervised subword modeling aims to learn low-level representations of...
research
02/06/2020

Multilingual acoustic word embedding models for processing zero-resource languages

Acoustic word embeddings are fixed-dimensional representations of variab...
research
02/13/2018

A Short Survey on Sense-Annotated Corpora for Diverse Languages and Resources

With the advancement of research in word sense disambiguation and deep l...
research
11/14/2018

Almost Zero-Resource ASR-free Keyword Spotting using Multilingual Bottleneck Features and Correspondence Autoencoders

We compare features for dynamic time warping based keyword spotting in a...
research
07/23/2018

ASR-free CNN-DTW keyword spotting using multilingual bottleneck features for almost zero-resource languages

We consider multilingual bottleneck features (BNFs) for nearly zero-reso...
research
09/27/2016

AP16-OL7: A Multilingual Database for Oriental Languages and A Language Recognition Baseline

We present the AP16-OL7 database which was released as the training and ...
