DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning

05/17/2023
by Alexander H. Liu, et al.

In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR), which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and yield a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network. We show that DinoSR surpasses the previous state of the art on several downstream tasks, and we provide a detailed analysis of the model and the learned discrete units. The source code will be made available after the anonymity period.
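To make the teacher-student loop described above concrete, here is a minimal PyTorch sketch of one training step: a frozen EMA teacher embeds the clean input, an EMA-updated codebook assigns those embeddings to discrete units, and the student is trained to predict the unit at each masked position. The class names (`Encoder`, `DinoSRSketch`), the single codebook, the zero-masking scheme, and all hyperparameters are illustrative assumptions for this sketch, not the paper's actual architecture or configuration.

```python
# Minimal sketch of a DinoSR-style training step (illustrative, not the
# paper's implementation): EMA teacher + online clustering + masked
# prediction by the student.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy stand-in for the transformer encoder over audio features."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)

class DinoSRSketch(nn.Module):
    def __init__(self, dim=64, codebook_size=256, ema_decay=0.999):
        super().__init__()
        self.student = Encoder(dim)
        self.teacher = copy.deepcopy(self.student)  # EMA copy, never trained by gradients
        for p in self.teacher.parameters():
            p.requires_grad = False
        self.ema_decay = ema_decay
        # Online clustering: codebook of centroids updated by EMA, no gradients.
        self.register_buffer("codebook", torch.randn(codebook_size, dim))
        self.head = nn.Linear(dim, codebook_size)   # student predicts cluster ids

    @torch.no_grad()
    def _update_teacher(self):
        for ps, pt in zip(self.student.parameters(), self.teacher.parameters()):
            pt.mul_(self.ema_decay).add_(ps, alpha=1 - self.ema_decay)

    @torch.no_grad()
    def _cluster(self, z):
        # Assign each teacher embedding to its nearest centroid (the
        # machine-discovered units), then nudge the selected centroids
        # toward the embeddings they attracted.
        ids = torch.cdist(z, self.codebook).argmin(dim=-1)
        for i in ids.unique():
            mean = z[ids == i].mean(dim=0)
            self.codebook[i].mul_(self.ema_decay).add_(mean, alpha=1 - self.ema_decay)
        return ids

    def forward(self, x, mask):
        # Teacher sees the clean input; student sees the masked input
        # (zeroing masked frames here; the real model uses a mask token).
        with torch.no_grad():
            targets = self._cluster(self.teacher(x)[mask])
        logits = self.head(self.student(x * (~mask).unsqueeze(-1).float())[mask])
        return F.cross_entropy(logits, targets)

model = DinoSRSketch()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(2, 50, 64)        # (batch, frames, feature dim)
mask = torch.rand(2, 50) < 0.5    # positions to mask
opt.zero_grad()
loss = model(x, mask)
loss.backward()
opt.step()
model._update_teacher()           # EMA update after each student step
```

Note that the paper clusters multiple teacher layers with per-layer codebooks; this sketch collapses that to a single codebook to keep the mechanics visible.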

Related research

10/08/2022 · CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning
Speech is the surface form of a finite set of phonetic units, which can ...

12/06/2022 · Self-Supervised Audio-Visual Speech Representations Learning By Multimodal Self-Distillation
In this work, we present a novel method, named AV2vec, for learning audi...

07/06/2023 · On-Device Constrained Self-Supervised Speech Representation Learning for Keyword Spotting via Knowledge Distillation
Large self-supervised models are effective feature extractors, but their...

10/29/2022 · Application of Knowledge Distillation to Multi-task Speech Representation Learning
Model architectures such as wav2vec 2.0 and HuBERT have been proposed to...

11/23/2021 · Domain-Agnostic Clustering with Self-Distillation
Recent advancements in self-supervised learning have reduced the gap bet...

06/14/2021 · HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
Self-supervised approaches for speech representation learning are challe...

07/29/2022 · Global-Local Self-Distillation for Visual Representation Learning
The downstream accuracy of self-supervised methods is tightly linked to ...
