Opening the Black Box of wav2vec Feature Encoder

10/27/2022
by Kwanghee Choi, et al.

Self-supervised models, namely wav2vec and its variants, have shown promising results in various downstream tasks in the speech domain. However, their inner workings are poorly understood, calling for in-depth analyses of what the models learn. In this paper, we concentrate on the convolutional feature encoder, whose latent space is often speculated to represent discrete acoustic units. To analyze the embedding space in a reductive manner, we feed the model synthesized audio signals, each a summation of simple sine waves. Through extensive experiments, we conclude that various information is embedded inside the feature encoder representations: (1) fundamental frequency, (2) formants, and (3) amplitude, packed with (4) sufficient temporal detail. Further, the information incorporated inside the latent representations is analogous to spectrograms but with a fundamental difference: latent representations construct a metric space so that closer representations imply acoustic similarity.
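
The probing inputs described above, sums of simple sine waves, can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' code: the function name `synthesize`, the 16 kHz sample rate, and the example frequencies and amplitudes are all assumptions chosen for the demonstration.

```python
import numpy as np

# Assumed sample rate (Hz); not stated in the abstract.
SAMPLE_RATE = 16_000


def synthesize(freqs_hz, amps, duration_s=1.0, sample_rate=SAMPLE_RATE):
    """Return a float32 waveform that is the sum of sine waves.

    freqs_hz and amps are equal-length sequences giving the frequency (Hz)
    and peak amplitude of each sinusoidal component.
    """
    if len(freqs_hz) != len(amps):
        raise ValueError("freqs_hz and amps must have the same length")
    n_samples = int(round(duration_s * sample_rate))
    t = np.arange(n_samples) / sample_rate              # time axis in seconds
    freqs = np.asarray(freqs_hz, dtype=np.float64)[:, None]
    weights = np.asarray(amps, dtype=np.float64)[:, None]
    # One row per component, then sum the rows into a single waveform.
    components = weights * np.sin(2.0 * np.pi * freqs * t[None, :])
    return components.sum(axis=0).astype(np.float32)


# Hypothetical probe: a 200 Hz fundamental with two weaker harmonics.
signal = synthesize([200.0, 400.0, 600.0], [0.5, 0.25, 0.125])
```

Varying one factor at a time (the fundamental, the relative strength of the harmonics, or the overall amplitude) while holding the others fixed gives controlled inputs whose encoder representations can then be compared.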
