DeepAI AI Chat
Log In Sign Up

The Ability of Self-Supervised Speech Models for Audio Representations

09/26/2022
by   Tung-yu Wu, et al.
National Taiwan University
0

Self-supervised learning (SSL) speech models have achieved unprecedented success in speech representation learning, but some questions regarding their representation ability remain unanswered. This paper addresses two of them: (1) Can SSL speech models deal with non-speech audio?; (2) Would different SSL speech models have insights into diverse aspects of audio features? To answer the two questions, we conduct extensive experiments on abundant speech and non-speech audio datasets to evaluate the representation ability of currently state-of-the-art SSL speech models, which are wav2vec 2.0 and HuBERT in this paper. These experiments are carried out during NeurIPS 2021 HEAR Challenge as a standard evaluation pipeline provided by competition officials. Results show that (1) SSL speech models could extract meaningful features of a wide range of non-speech audio, while they may also fail on certain types of datasets; (2) different SSL speech models have insights into different aspects of audio features. The two conclusions provide a foundation for the ensemble of representation models. We further propose an ensemble framework to fuse speech representation models' embeddings. Our framework outperforms state-of-the-art SSL speech/audio models and has generally superior performance on abundant datasets compared with other teams in HEAR Challenge. Our code is available at https://github.com/tony10101105/HEAR-2021-NeurIPS-Challenge – NTU-GURA.

READ FULL TEXT

page 1

page 2

page 3

page 4

09/28/2022

Audio Barlow Twins: Self-Supervised Audio Representation Learning

The Barlow Twins self-supervised learning objective requires neither neg...
03/22/2021

Self-paced ensemble learning for speech and audio classification

Combining multiple machine learning models into an ensemble is known to ...
06/24/2022

BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping

Methods for extracting audio and speech features have been studied since...
10/16/2022

SUPERB @ SLT 2022: Challenge on Generalization and Efficiency of Self-Supervised Speech Representation Learning

We present the SUPERB challenge at SLT 2022, which aims at learning self...
03/05/2023

A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS

Recent work has explored using self-supervised learning (SSL) speech rep...
01/02/2021

What all do audio transformer models hear? Probing Acoustic Representations for Language Delivery and its Structure

In recent times, BERT based transformer models have become an inseparabl...