TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech

07/12/2020
by Andy T. Liu, et al.

We introduce a self-supervised speech pre-training method called TERA, which stands for Transformer Encoder Representations from Alteration. Recent approaches often learn through a single auxiliary task such as contrastive prediction, autoregressive prediction, or masked reconstruction. Unlike these approaches, we use a multi-target auxiliary task to pre-train Transformer Encoders on a large amount of unlabeled speech. The model learns by reconstructing acoustic frames from their altered counterparts, where a stochastic policy alters the input along three dimensions: time, channel, and magnitude. TERA can be used to extract speech representations or be fine-tuned jointly with downstream models. We evaluate TERA on several downstream tasks, including phoneme classification, speaker recognition, and speech recognition. TERA achieves strong performance on these tasks, improving over surface features and outperforming previous methods. Our experiments show that alteration along different dimensions leads the model to encode distinct aspects of speech. We also explore different knowledge transfer methods for incorporating the pre-trained model into downstream models. Furthermore, we show that the proposed method transfers easily to a dataset not used in pre-training.
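To make the multi-target auxiliary task concrete, the sketch below applies the three alteration dimensions to a log-mel spectrogram: contiguous spans of frames are masked (time), a block of frequency bins is masked (channel), and Gaussian noise is occasionally added (magnitude). The function name, default values, and exact masking scheme are illustrative assumptions rather than the paper's published configuration; the unaltered input serves as the reconstruction target.

import numpy as np

def alter_spectrogram(spec, time_mask_ratio=0.15, time_block=7,
                      channel_block=8, noise_prob=0.1, rng=None):
    # Stochastically alter a (frames, channels) log-mel spectrogram
    # along time, channel, and magnitude. Names and defaults here are
    # illustrative, not TERA's published hyperparameters.
    if rng is None:
        rng = np.random.default_rng()
    altered = spec.copy()
    n_frames, n_channels = spec.shape

    # Time alteration: zero out contiguous blocks of frames until
    # roughly time_mask_ratio of all frames are masked.
    masked = 0
    while masked < int(n_frames * time_mask_ratio):
        start = rng.integers(0, max(1, n_frames - time_block))
        altered[start:start + time_block, :] = 0.0
        masked += time_block

    # Channel alteration: zero out one random block of frequency
    # bins across all frames.
    c0 = rng.integers(0, max(1, n_channels - channel_block))
    altered[:, c0:c0 + channel_block] = 0.0

    # Magnitude alteration: occasionally perturb the whole input
    # with Gaussian noise.
    if rng.random() < noise_prob:
        altered = altered + rng.normal(0.0, 0.2, size=altered.shape)

    # Pre-training reconstructs the clean spectrogram from the
    # altered view, e.g. with an L1 loss on the masked regions.
    return altered, spec

For example, altered, target = alter_spectrogram(np.random.randn(400, 80)) produces one altered view and its reconstruction target for an utterance of 400 frames with 80 mel bins.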
