DeepAI AI Chat
Log In Sign Up

Channel-Aware Pretraining of Joint Encoder-Decoder Self-Supervised Model for Telephonic-Speech ASR

by   Vrunda N. Sukhadia, et al.

This paper proposes a novel technique to obtain better downstream ASR performance from a joint encoder-decoder self-supervised model when trained with speech pooled from two different channels (narrow and wide band). The joint encoder-decoder self-supervised model extends the HuBERT model with a Transformer decoder. HuBERT performs clustering of features and predicts the class of every input frame. In simple pooling, which is our baseline, there is no way to identify the channel information. To incorporate channel information, we have proposed non-overlapping cluster IDs for speech from different channels. Our method gives a relative improvement of   5 encoder-decoder self-supervised model built with simple pooling of data, which serves as our baseline.


Joint Encoder-Decoder Self-Supervised Pre-training for ASR

Self-supervised learning (SSL) has shown tremendous success in various s...

AVATAR submission to the Ego4D AV Transcription Challenge

In this report, we describe our submission to the Ego4D AudioVisual (AV)...

Self-Supervised Encoder for Fault Prediction in Electrochemical Cells

Predicting faults before they occur helps to avoid potential safety haza...

MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining

In this paper, we present a model pretraining technique, named MaskOCR, ...

Energy-Inspired Self-Supervised Pretraining for Vision Models

Motivated by the fact that forward and backward passes of a deep network...

Watermarking Images in Self-Supervised Latent Spaces

We revisit watermarking techniques based on pre-trained deep networks, i...

MAE-AST: Masked Autoencoding Audio Spectrogram Transformer

In this paper, we propose a simple yet powerful improvement over the rec...