Unsupervised feature learning for speech using correspondence and Siamese networks

03/28/2020
by Petri-Johan Last, et al.

In zero-resource settings where transcribed speech audio is unavailable, unsupervised feature learning is essential for downstream speech processing tasks. Here we compare two recent methods for frame-level acoustic feature learning. For both methods, unsupervised term discovery is used to find pairs of word examples of the same unknown type. Dynamic programming is then used to align the feature frames between each word pair, serving as weak top-down supervision for the two models. For the correspondence autoencoder (CAE), matching frames are presented as input-output pairs. The Triamese network uses a contrastive loss to reduce the distance between frames of the same predicted word type while increasing the distance between negative examples. For the first time, these feature extractors are compared on the same discrimination tasks using the same weak supervision pairs. We find that, on the two datasets considered here, the CAE outperforms the Triamese network. However, we show that a new hybrid correspondence-Triamese approach (CTriamese) consistently outperforms both the CAE and Triamese models in terms of average precision and ABX error rates on both English and Xitsonga evaluation data.
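The contrastive objective described above can be illustrated with a minimal sketch. This is not the paper's implementation; the distance function, margin value, and toy frame vectors below are all assumptions chosen for illustration. The loss pulls aligned frames (from a discovered matching word pair) together and pushes a negative-example frame at least a margin further away:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature frames."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, same, diff, margin=0.5):
    """Hinge-style contrastive loss: zero once the negative frame is
    at least `margin` further from the anchor than the matching frame."""
    return max(0.0, margin + euclidean(anchor, same) - euclidean(anchor, diff))

# Toy 3-dimensional "frames" (hypothetical values, not real features).
anchor = [0.0, 1.0, 0.0]
same = [0.1, 0.9, 0.0]    # aligned frame from a matching word pair
diff = [1.0, 0.0, 1.0]    # frame from a different discovered word

print(triplet_loss(anchor, same, diff))   # loss is 0.0: margin already satisfied
print(triplet_loss(anchor, diff, same))   # swapped roles give a positive loss
```

In a real model the frames would first pass through a shared embedding network, and the loss gradient would update its weights; the CAE instead trains on the same aligned pairs by reconstructing one frame from the other.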

Related research

- Truly unsupervised acoustic word embeddings using weak top-down constraints in encoder-decoder models (11/01/2018): We investigate unsupervised models that can map a variable-duration spee...
- Acoustic word embeddings for zero-resource languages using self-supervised contrastive learning and multilingual adaptation (03/19/2021): Acoustic word embeddings (AWEs) are fixed-dimensional representations of...
- A comparison of self-supervised speech representations as input features for unsupervised acoustic word embeddings (12/14/2020): Many speech processing tasks involve measuring the acoustic similarity b...
- Unsupervised neural and Bayesian models for zero-resource speech processing (01/03/2017): In settings where only unlabelled speech data is available, zero-resourc...
- A Correspondence Variational Autoencoder for Unsupervised Acoustic Word Embeddings (12/03/2020): We propose a new unsupervised model for mapping a variable-duration spee...
- A segmental framework for fully-unsupervised large-vocabulary speech recognition (06/22/2016): Zero-resource speech technology is a growing research area that aims to ...
- Enhanced Factored Three-Way Restricted Boltzmann Machines for Speech Detection (11/01/2016): In this letter, we propose enhanced factored three way restricted Boltzm...
