Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations

10/15/2022
by Themos Stafylakis, et al.

Self-supervised learning of speech representations from large amounts of unlabeled data has enabled state-of-the-art results in several speech processing tasks. These representations are typically aggregated across time with descriptive statistics, in particular the first- and second-order statistics of the representation coefficients. In this paper, we examine an alternative way of extracting speaker and emotion information from self-supervised models, based on the correlations between the coefficients of the representations, which we refer to as correlation pooling. We show improvements over mean pooling and further gains when the pooling methods are combined via fusion. The code is available at github.com/Lamomal/s3prl_correlation.
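To make the contrast between the two pooling strategies concrete, here is a minimal NumPy sketch. It is not taken from the linked repository: the function names, the (T, D) frame layout, the use of Pearson correlation across time, and keeping only the upper triangle of the correlation matrix are all assumptions made for illustration.

```python
import numpy as np


def mean_pooling(frames: np.ndarray) -> np.ndarray:
    """Average a (T, D) sequence of frame-level features over time -> (D,)."""
    return frames.mean(axis=0)


def correlation_pooling(frames: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Pool a (T, D) sequence into its channel-wise correlations.

    Illustrative assumption: each of the D channels is standardized over
    time, the D x D Pearson correlation matrix is computed, and only the
    strictly upper triangle is kept, giving D * (D - 1) / 2 values.
    """
    centered = frames - frames.mean(axis=0, keepdims=True)
    std = centered.std(axis=0, keepdims=True)
    normalized = centered / (std + eps)  # eps guards against constant channels
    corr = normalized.T @ normalized / frames.shape[0]
    upper = np.triu_indices(corr.shape[0], k=1)  # drop diagonal and duplicates
    return corr[upper]


# Hypothetical input: 50 frames of a 4-dimensional representation.
rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 4))
print(mean_pooling(frames).shape)         # one value per channel
print(correlation_pooling(frames).shape)  # one value per channel pair
```

Under these assumptions the mean-pooled vector grows linearly with the number of channels, while the correlation-pooled vector grows quadratically, so a practical system would likely reduce the channel dimension before pooling.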


