Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations

05/14/2023
by Weiwei Lin, et al.

Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition (ASR) and proved to be extremely useful in low-label-resource settings. However, the success of SSL models has yet to transfer to utterance-level tasks such as speaker, emotion, and language recognition, which still require supervised fine-tuning of the SSL models to obtain good performance. We argue that the problem is caused by the lack of disentangled representations and an utterance-level learning objective for these tasks. Inspired by how HuBERT uses clustering to discover hidden acoustic units, we formulate a factor analysis (FA) model that uses the discovered hidden acoustic units to align the SSL features. The underlying utterance-level representations are disentangled from the content of speech using probabilistic inference on the aligned features. Furthermore, the variational lower bound derived from the FA model provides an utterance-level objective, allowing error gradients to be backpropagated to the Transformer layers to learn highly discriminative acoustic units. When used in conjunction with HuBERT's masked prediction training, our models outperform the current best model, WavLM, on all utterance-level non-semantic tasks on the SUPERB benchmark with only 20% of parameters.
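
The abstract does not spell out the FA model's equations, but the pipeline it describes (align frame-level SSL features to discovered acoustic units, then infer an utterance-level latent by probabilistic inference) closely mirrors classical i-vector-style factor analysis. The sketch below is a minimal illustration under that reading, not the authors' implementation; the function name, the per-unit loading matrices T, and all other identifiers are hypothetical.

```python
import numpy as np

def utterance_factor_posterior(X, gamma, mu, inv_sigma, T):
    """Infer an utterance-level latent factor from frame features
    aligned to discovered acoustic units (i-vector-style FA sketch).

    X         : (N, D) frame-level features from an SSL model
    gamma     : (N, C) soft alignment of each frame to C acoustic units
    mu        : (C, D) per-unit feature means
    inv_sigma : (C, D) per-unit inverse diagonal covariances
    T         : (C, D, Q) per-unit factor loading matrices
    Returns the (Q,) posterior mean and (Q, Q) posterior precision.
    """
    C, D, Q = T.shape
    # Zeroth-order statistics: soft frame counts per unit.
    N_c = gamma.sum(axis=0)                   # (C,)
    # First-order statistics: centered, alignment-weighted feature sums.
    F_c = gamma.T @ X - N_c[:, None] * mu     # (C, D)
    # Accumulate the posterior precision and the linear term over units.
    L = np.eye(Q)
    b = np.zeros(Q)
    for c in range(C):
        TS = T[c].T * inv_sigma[c]            # (Q, D) = T_c^T Sigma_c^{-1}
        L += N_c[c] * (TS @ T[c])
        b += TS @ F_c[c]
    # Posterior mean of the utterance-level factor.
    w = np.linalg.solve(L, b)
    return w, L
```

In this reading, the soft alignments gamma would come from the posteriors over HuBERT's clustered acoustic units, and the variational lower bound built from this posterior supplies the utterance-level loss whose gradients are backpropagated into the Transformer layers.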
