TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech

07/12/2020
by Andy T. Liu, et al.

We introduce a self-supervised speech pre-training method called TERA, which stands for Transformer Encoder Representations from Alteration. Recent approaches often learn through a single auxiliary task such as contrastive prediction, autoregressive prediction, or masked reconstruction. Unlike these approaches, we use a multi-target auxiliary task to pre-train Transformer Encoders on a large amount of unlabeled speech. The model learns by reconstructing acoustic frames from their altered counterparts, where a stochastic policy alters the input along three dimensions: time, channel, and magnitude. TERA can be used to extract speech representations or be fine-tuned jointly with downstream models. We evaluate TERA on several downstream tasks, including phoneme classification, speaker recognition, and speech recognition. TERA achieves strong performance on these tasks, improving over surface features and outperforming previous methods. Our experiments show that alteration along different dimensions leads the model to encode distinct aspects of speech. We also explore different knowledge transfer methods for incorporating the pre-trained model into downstream models. Furthermore, we show that the proposed method transfers easily to a dataset not used in pre-training.
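To make the multi-target auxiliary task concrete, the sketch below applies the three alteration dimensions to a log-mel spectrogram: contiguous spans of frames are masked (time), a block of frequency bins is masked (channel), and Gaussian noise is occasionally added (magnitude). The function name, default values, and exact masking scheme are illustrative assumptions rather than the paper's published configuration; the unaltered input serves as the reconstruction target.

import numpy as np

def alter_spectrogram(spec, time_mask_ratio=0.15, time_block=7,
                      channel_block=8, noise_prob=0.1, rng=None):
    # Stochastically alter a (frames, channels) log-mel spectrogram
    # along time, channel, and magnitude. Names and defaults here are
    # illustrative, not TERA's published hyperparameters.
    if rng is None:
        rng = np.random.default_rng()
    altered = spec.copy()
    n_frames, n_channels = spec.shape

    # Time alteration: zero out contiguous blocks of frames until
    # roughly time_mask_ratio of all frames are masked.
    masked = 0
    while masked < int(n_frames * time_mask_ratio):
        start = rng.integers(0, max(1, n_frames - time_block))
        altered[start:start + time_block, :] = 0.0
        masked += time_block

    # Channel alteration: zero out one random block of frequency
    # bins across all frames.
    c0 = rng.integers(0, max(1, n_channels - channel_block))
    altered[:, c0:c0 + channel_block] = 0.0

    # Magnitude alteration: occasionally perturb the whole input
    # with Gaussian noise.
    if rng.random() < noise_prob:
        altered = altered + rng.normal(0.0, 0.2, size=altered.shape)

    # Pre-training reconstructs the clean spectrogram from the
    # altered view, e.g. with an L1 loss on the masked regions.
    return altered, spec

For example, altered, target = alter_spectrogram(np.random.randn(400, 80)) produces one altered view and its reconstruction target for an utterance of 400 frames with 80 mel bins.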
