Variable-rate hierarchical CPC leads to acoustic unit discovery in speech

06/05/2022
by Santiago Cuervo, et al.

The success of deep learning comes from its ability to capture the hierarchical structure of data by learning high-level representations defined in terms of low-level ones. In this paper, we explore self-supervised learning of hierarchical representations of speech by applying multiple levels of Contrastive Predictive Coding (CPC). We observe that simply stacking two CPC models does not yield significant improvements over single-level architectures. Inspired by the fact that speech is often described as a sequence of discrete units unevenly distributed in time, we propose a model in which the output of a low-level CPC module is non-uniformly downsampled to directly minimize the loss of a high-level CPC module. The high-level module additionally enforces a prior of separability and discreteness on its representations: successive high-level representations are pushed apart through focused negative sampling, and the prediction targets are quantized. Accounting for the structure of the speech signal improves upon single-level CPC features and enhances the disentanglement of the learned representations, as measured by downstream speech recognition tasks, while yielding a meaningful segmentation of the signal that closely resembles phone boundaries.
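The two core ingredients the abstract describes, a contrastive (InfoNCE-style) prediction loss and non-uniform downsampling of low-level frames into segment-level units, can be sketched minimally as below. This is an illustrative sketch under our own assumptions (numpy instead of a deep-learning framework, dot-product scoring, mean-pooling within segments), not the authors' implementation; the function names `info_nce` and `boundary_downsample` are hypothetical.

```python
import numpy as np

def info_nce(context, target, negatives):
    """Contrastive loss used in CPC (sketch): score the true future
    representation against negative samples and return the negative
    log-probability of the positive under a softmax over all scores.

    context:   (d,)   context/prediction vector
    target:    (d,)   positive future representation
    negatives: (n, d) negative samples (in the paper's high-level module,
                      these would be drawn to include nearby frames,
                      i.e. "focused negative sampling")
    """
    scores = np.concatenate([[context @ target], negatives @ context])
    m = scores.max()                                   # for numerical stability
    log_softmax = scores - (m + np.log(np.exp(scores - m).sum()))
    return -log_softmax[0]                             # NLL of the positive

def boundary_downsample(frames, boundaries):
    """Non-uniform downsampling (sketch): pool low-level frames into one
    high-level vector per segment, given predicted segment boundaries.

    frames:     (T, d) low-level CPC outputs
    boundaries: indices where a new segment starts, e.g. [3, 7]
    returns:    (num_segments, d) one pooled vector per discovered unit
    """
    segments = np.split(frames, boundaries)
    return np.stack([seg.mean(axis=0) for seg in segments if len(seg)])
```

With phone-like boundaries, `boundary_downsample` turns a fixed-rate frame sequence into a variable-rate sequence of unit representations, which the high-level CPC module would then consume; quantizing the pooled targets (e.g. with a learned codebook) is the remaining ingredient the paper adds.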


