Disentangling Prosody Representations with Unsupervised Speech Reconstruction

12/14/2022
by   Leyuan Qu, et al.

Human speech can be characterized by different components, including semantic content, speaker identity, and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker identity in Automatic Speech Recognition (ASR) and speaker verification tasks, respectively. However, extracting prosodic information remains an open and challenging research question because of the intrinsic entanglement of different attributes, such as timbre and rhythm, and because of the need for unsupervised training schemes to achieve robust, large-scale, speaker-independent ASR. This paper addresses the disentanglement of emotional prosody from speech based on unsupervised reconstruction. Specifically, we identify, design, implement, and integrate three crucial components in our proposed speech reconstruction model, Prosody2Vec: (1) a unit encoder that transforms speech signals into discrete units representing semantic content, (2) a pretrained speaker verification model that generates speaker identity embeddings, and (3) a trainable prosody encoder that learns prosody representations. We first pretrain the Prosody2Vec representations on unlabelled emotional speech corpora, then fine-tune the model on specific datasets to perform Speech Emotion Recognition (SER) and Emotional Voice Conversion (EVC) tasks. Both objective and subjective evaluations on the EVC task suggest that Prosody2Vec effectively captures general prosodic features that can be smoothly transferred to other emotional speech. In addition, our SER experiments on the IEMOCAP dataset reveal that the prosody features learned by Prosody2Vec are complementary to, and beneficial for, widely used speech pretraining models, and surpass state-of-the-art methods when Prosody2Vec is combined with HuBERT representations. Audio samples can be found on our demo website.
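To illustrate the shape of the reconstruction pipeline described above, here is a minimal toy sketch of how the three representations might be combined per frame before decoding. All dimensions, the embedding tables, and the linear "decoder" stub are hypothetical placeholders for illustration; they are not the paper's actual architecture or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- assumptions, not values from the paper.
N_UNITS, D_UNIT = 100, 16   # discrete-unit vocabulary size and embedding size
D_SPK, D_PROS = 8, 4        # speaker and prosody embedding sizes
T = 50                      # number of frames in the utterance

# (1) Unit encoder output: a sequence of discrete unit IDs (semantic content),
#     looked up in an embedding table.
unit_ids = rng.integers(0, N_UNITS, size=T)
unit_table = rng.normal(size=(N_UNITS, D_UNIT))
content = unit_table[unit_ids]                      # shape (T, D_UNIT)

# (2) Pretrained speaker verification model output: one fixed identity
#     embedding per utterance (frozen during training).
speaker = rng.normal(size=(D_SPK,))

# (3) Trainable prosody encoder output: here a single utterance-level
#     prosody vector, for simplicity.
prosody = rng.normal(size=(D_PROS,))

# Broadcast the utterance-level embeddings across frames, concatenate with
# the content sequence, and decode back to acoustic features. A single
# random linear map stands in for the real decoder network.
features = np.concatenate(
    [content,
     np.tile(speaker, (T, 1)),
     np.tile(prosody, (T, 1))], axis=1)             # (T, D_UNIT+D_SPK+D_PROS)

D_OUT = 80                                          # e.g. mel-spectrogram bins
decoder_W = rng.normal(size=(features.shape[1], D_OUT))
mel = features @ decoder_W                          # reconstructed frames (T, D_OUT)
```

Because content, speaker, and prosody enter the decoder as separate factors, swapping in the prosody vector of a different utterance at inference time is what would enable emotional voice conversion in a model of this form.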


