A comparison of self-supervised speech representations as input features for unsupervised acoustic word embeddings

12/14/2020
by   Lisa van Staden, et al.

Many speech processing tasks involve measuring the acoustic similarity between speech segments. Acoustic word embeddings (AWEs) allow for efficient comparisons by mapping speech segments of arbitrary duration to fixed-dimensional vectors. For zero-resource speech processing, where unlabelled speech is the only available resource, some of the best AWE approaches rely on weak top-down constraints in the form of automatically discovered word-like segments. Rather than learning embeddings at the segment level, another line of zero-resource research has looked at representation learning at the short-time frame level. Recent approaches include self-supervised predictive coding and correspondence autoencoder (CAE) models. In this paper we consider whether these frame-level features are beneficial when used as inputs for training an unsupervised AWE model. We compare frame-level features from contrastive predictive coding (CPC), autoregressive predictive coding, and a CAE to conventional MFCCs. These are used as inputs to a recurrent CAE-based AWE model. In a word discrimination task on English and Xitsonga data, all three representation learning approaches outperform MFCCs, with CPC consistently showing the biggest improvement. In cross-lingual experiments we find that CPC features trained on English can also be transferred to Xitsonga.
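
To make the segment-level part of this pipeline concrete, the sketch below shows a minimal recurrent correspondence autoencoder AWE model in PyTorch. It is an illustrative reconstruction based on the abstract, not the authors' released code: the class name, layer sizes, and training step are assumptions. An RNN encoder maps a variable-length sequence of frame-level features (MFCCs, CPC, APC, or CAE features) to a fixed-dimensional embedding, and an RNN decoder is trained to reconstruct a different instance of the same automatically discovered word.

```python
# Minimal sketch of a recurrent correspondence autoencoder (CAE-RNN) acoustic
# word embedding model. Illustrative reconstruction based on the abstract
# above, not the authors' code; names, dimensions and the training step are
# assumptions.

import torch
import torch.nn as nn


class CAERNN(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, embed_dim: int):
        super().__init__()
        self.encoder = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.to_embedding = nn.Linear(hidden_dim, embed_dim)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_frames = nn.Linear(hidden_dim, input_dim)

    def embed(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, input_dim) -> fixed-dimensional embedding
        _, hidden = self.encoder(frames)
        return self.to_embedding(hidden[-1])

    def forward(self, frames: torch.Tensor, target_len: int) -> torch.Tensor:
        # Encode one word segment, then decode target_len frames conditioned
        # on the embedding at every decoder step.
        z = self.embed(frames)
        z_rep = z.unsqueeze(1).repeat(1, target_len, 1)
        output, _ = self.decoder(z_rep)
        return self.to_frames(output)


# Toy correspondence training step: reconstruct segment_b from segment_a,
# where the two segments are instances of the same word type found by an
# unsupervised term discovery system (the weak top-down constraints).
model = CAERNN(input_dim=13, hidden_dim=256, embed_dim=128)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

segment_a = torch.randn(1, 60, 13)  # 60 frames of 13-dim features (e.g. MFCCs)
segment_b = torch.randn(1, 55, 13)  # 55 frames, same discovered word type

optimiser.zero_grad()
prediction = model(segment_a, target_len=segment_b.shape[1])
loss = nn.functional.mse_loss(prediction, segment_b)
loss.backward()
optimiser.step()

embedding = model.embed(segment_a)  # acoustic word embedding, shape (1, 128)
```

At test time only the encoder is used: each segment is mapped to its embedding, and distances between embeddings can then be compared, for example in the word discrimination task mentioned above.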


Related research

03/19/2021 · Acoustic word embeddings for zero-resource languages using self-supervised contrastive learning and multilingual adaptation
Acoustic word embeddings (AWEs) are fixed-dimensional representations of...

11/01/2018 · Truly unsupervised acoustic word embeddings using weak top-down constraints in encoder-decoder models
We investigate unsupervised models that can map a variable-duration spee...

12/03/2020 · A Correspondence Variational Autoencoder for Unsupervised Acoustic Word Embeddings
We propose a new unsupervised model for mapping a variable-duration spee...

10/28/2022 · Analyzing Acoustic Word Embeddings from Pre-trained Self-supervised Speech Models
Given the strong results of self-supervised models on various tasks, the...

06/03/2021 · Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation
Automatic detection of phoneme or word-like units is one of the core obj...

03/28/2020 · Unsupervised feature learning for speech using correspondence and Siamese networks
In zero-resource settings where transcribed speech audio is unavailable,...

07/26/2020 · Self-Expressing Autoencoders for Unsupervised Spoken Term Discovery
Unsupervised spoken term discovery consists of two tasks: finding the ac...
