Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders

10/25/2019
by   Andy T. Liu, et al.

We present Mockingjay, a new speech representation learning approach in which bidirectional Transformer encoders are pre-trained on a large amount of unlabeled speech. Previous speech representation methods learn by conditioning on past frames and predicting information about future frames, whereas Mockingjay is designed to predict the current frame by jointly conditioning on both past and future contexts. Mockingjay representations improve performance on a wide range of downstream tasks, including phoneme classification, speaker recognition, and sentiment classification on spoken content, outperforming other approaches. Mockingjay is empirically powerful and can be fine-tuned with downstream models; with only 2 epochs of fine-tuning we further improve performance dramatically. In a low-resource setting with only 0.1% of labeled data, we outperform the result of Mel-features that uses all 100% of the labeled data.
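The pre-training idea described above, masking a fraction of input frames and reconstructing them from a bidirectional Transformer's full (past and future) context, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation; the class name, model sizes, and masking rate are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class MaskedFrameModel(nn.Module):
    """Hypothetical sketch of Mockingjay-style masked frame prediction:
    zero out a subset of acoustic frames, encode with a bidirectional
    Transformer (full self-attention sees past AND future frames),
    and regress the original frames at the masked positions."""

    def __init__(self, n_mels=80, d_model=256, n_layers=3, n_heads=4):
        super().__init__()
        self.proj_in = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.proj_out = nn.Linear(d_model, n_mels)

    def forward(self, frames, mask):
        # frames: (batch, time, n_mels); mask: (batch, time) bool
        x = self.proj_in(frames)
        x = x.masked_fill(mask.unsqueeze(-1), 0.0)  # hide masked frames
        h = self.encoder(x)                         # bidirectional context
        return self.proj_out(h)                     # reconstructed frames

batch, time, n_mels = 2, 100, 80
frames = torch.randn(batch, time, n_mels)           # stand-in Mel features
mask = torch.rand(batch, time) < 0.15               # mask ~15% of frames
model = MaskedFrameModel(n_mels)
pred = model(frames, mask)
# L1 reconstruction loss computed only at the masked positions
loss = nn.functional.l1_loss(pred[mask], frames[mask])
```

Because the self-attention is unrestricted (no causal mask), every prediction conditions jointly on both past and future frames; this is the key difference from the autoregressive objectives the abstract contrasts against.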


Related research:

- 10/25/2022: Improving Speech Representation Learning via Speech-level and Phoneme-level Masking Approach. Recovering the masked speech frames is widely applied in speech represen...
- 07/12/2020: TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech. We introduce a self-supervised speech pre-training method called TERA, w...
- 04/05/2019: An Unsupervised Autoregressive Model for Speech Representation Learning. This paper proposes a novel unsupervised autoregressive neural model for...
- 07/25/2023: Speech representation learning: Learning bidirectional encoders with single-view, multi-view, and multi-task methods. This thesis focuses on representation learning for sequence data over ti...
- 04/11/2020: Improved Speech Representations with Multi-Target Autoregressive Predictive Coding. Training objectives based on predictive coding have recently been shown ...
- 10/12/2022: One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks. Current multimodal models, aimed at solving Vision and Language (V+L) ta...
- 05/22/2023: Efficient Large-Scale Vision Representation Learning. In this article, we present our approach to single-modality vision repre...
