Non-Contrastive Self-Supervised Learning of Utterance-Level Speech Representations

08/10/2022
by   Jaejin Cho, et al.

Considering the abundance of unlabeled speech data and the high labeling costs, unsupervised learning methods can be essential for better system development. Among the most successful methods are contrastive self-supervised methods, which require negative sampling: sampling alternative samples to contrast with the current sample (the anchor). However, without labels it is hard to ensure that all the negative samples belong to classes different from the anchor's class. This paper applies a non-contrastive self-supervised learning method to an unlabeled speech corpus to learn utterance-level embeddings. We use DIstillation with NO labels (DINO), proposed in computer vision, and adapt it to the speech domain. Unlike the contrastive methods, DINO does not require negative sampling. These embeddings were evaluated on speaker verification and emotion recognition. In speaker verification, the unsupervised DINO embedding with cosine scoring provided 4.38% EER on the VoxCeleb1 test trial list, outperforming the best contrastive self-supervised method by 40% relative in EER. An iterative pseudo-labeling training pipeline, not requiring speaker labels, further improved the EER to 1.89%. In emotion recognition, the DINO embedding performed 60.87, 79.21, and 56.98% in micro-F1 score on IEMOCAP, Crema-D, and MSP-Podcast, respectively. The results imply the generality of the DINO embedding across different speech applications.
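
For illustration, below is a minimal PyTorch-style sketch of the two ideas the abstract relies on: the DINO objective (teacher-student distillation with centering, sharpening, and stop-gradient, with no negative samples) and cosine scoring of utterance-level embeddings for verification trials. Function names, temperatures, and shapes are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def dino_loss(student_out, teacher_out, center,
                  student_temp=0.1, teacher_temp=0.04):
        """Cross-entropy between sharpened teacher targets and student predictions.

        student_out, teacher_out: (batch, dim) projection-head outputs for two
        augmented views (e.g., different crops of the same utterance).
        center: running mean of teacher outputs, used to help avoid collapse.
        """
        # Teacher targets: centered, sharpened, and detached (stop-gradient).
        teacher_probs = F.softmax((teacher_out - center) / teacher_temp, dim=-1).detach()
        # Student predictions at a higher temperature.
        student_logprobs = F.log_softmax(student_out / student_temp, dim=-1)
        # No negative samples: the loss only pulls the student toward the
        # teacher's distribution for views of the same utterance.
        return -(teacher_probs * student_logprobs).sum(dim=-1).mean()

    # The teacher's weights are typically an exponential moving average (EMA)
    # of the student's weights, and the center is an EMA of teacher outputs,
    # e.g. center = m * center + (1 - m) * teacher_out.mean(dim=0).

    def cosine_score(emb_enroll, emb_test):
        # Cosine scoring for speaker verification: the trial score is the
        # cosine similarity between the two utterance-level embeddings.
        return F.cosine_similarity(emb_enroll, emb_test, dim=-1)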

research  08/10/2022
Non-Contrastive Self-supervised Learning for Utterance-Level Information Extraction from Speech
In recent studies, self-supervised pre-trained models tend to outperform...

research  08/15/2022
C3-DINO: Joint Contrastive and Non-contrastive Self-Supervised Learning for Speaker Verification
Self-supervised learning (SSL) has drawn an increased attention in the f...

research  10/27/2022
Self-Supervised Training of Speaker Encoder with Multi-Modal Diverse Positive Pairs
We study a novel neural architecture and its training strategies of spea...

research  09/28/2021
The JHU submission to VoxSRC-21: Track 3
This technical report describes Johns Hopkins University speaker recogni...

research  10/22/2020
Unsupervised Representation Learning for Speaker Recognition via Contrastive Equilibrium Learning
In this paper, we propose a simple but powerful unsupervised learning me...

research  08/01/2023
Self-Supervised Contrastive BERT Fine-tuning for Fusion-based Reviewed-Item Retrieval
As natural language interfaces enable users to express increasingly comp...
