Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data

by Yu Kang, et al.

Multimodal pre-training for audio and text has recently proven effective and has significantly improved performance on many downstream speech understanding tasks. However, state-of-the-art audio-text pre-training models work well only when given large amounts of parallel audio-and-text data, which poses a challenge for the many languages that are rich in unimodal corpora but lack parallel cross-modal corpora. In this paper, we investigate whether an audio-text multimodal model can be pre-trained with extremely low-resource parallel data plus extra non-parallel unimodal data. Our pre-training framework consists of the following components: (1) Intra-modal Denoising Auto-Encoding (IDAE), which reconstructs input text (audio) representations from a noisy version of themselves. (2) Cross-modal Denoising Auto-Encoding (CDAE), which is pre-trained to reconstruct the input text (audio) given both a noisy version of the input text (audio) and the corresponding translated noisy audio features (text embeddings). (3) Iterative Denoising Process (IDP), which iteratively translates raw audio (text), together with the corresponding text embeddings (audio features) translated in the previous iteration, into new, less noisy text embeddings (audio features). We adapt a dual cross-modal Transformer as our backbone model, which consists of two unimodal encoders for IDAE and two cross-modal encoders for CDAE and IDP. On multiple downstream speech understanding tasks, our method achieves performance comparable to that of a model pre-trained on fully parallel data, demonstrating the great potential of the proposed method. Our code is available at: <>.
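
The data flow behind IDAE and IDP can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say which noise function is used, so random masking is an assumption here, and the names `MASK`, `add_noise`, `idae_pair`, `iterative_denoise`, and the `translate` callable are all hypothetical placeholders for the real encoders.

```python
import random
from typing import Callable, List, Sequence

MASK = "<mask>"  # hypothetical mask symbol (assumption, not from the paper)


def add_noise(items: Sequence[str], mask_prob: float, rng: random.Random) -> List[str]:
    """Corrupt a sequence by replacing each item with MASK with probability mask_prob.

    Random masking is an assumed noise type chosen only for illustration.
    """
    return [MASK if rng.random() < mask_prob else item for item in items]


def idae_pair(items: Sequence[str], mask_prob: float, rng: random.Random):
    """Build a (noisy_input, clean_target) pair for an intra-modal denoising objective."""
    return add_noise(items, mask_prob, rng), list(items)


def iterative_denoise(raw: list, translate: Callable[[list, list], list], steps: int) -> list:
    """IDP-style loop: re-translate the raw input together with the previous estimate.

    `translate` stands in for a cross-modal encoder; here it is any callable
    that maps (raw input, previous estimate) to a new estimate.
    """
    estimate: list = []  # no cross-modal estimate exists before the first iteration
    for _ in range(steps):
        estimate = translate(raw, estimate)
    return estimate
```

In this sketch, a model trained on `idae_pair` outputs would learn to map the noisy input back to the clean target, while `iterative_denoise` only shows the control flow in which each iteration consumes the previous iteration's translation.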



Using Large Pre-Trained Models with Cross-Modal Attention for Multi-Modal Emotion Recognition

Recently, self-supervised pre-training has shown significant improvement...

AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation

Direct speech-to-speech translation (S2ST) aims to convert speech from o...

Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages

Neural text-to-speech (TTS) models can synthesize natural human speech w...

Efficient Purely Convolutional Text Encoding

In this work, we focus on a lightweight convolutional architecture that ...

Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning

Despite the recent developments in the field of cross-modal retrieval, t...

Investigating Pre-trained Audio Encoders in the Low-Resource Condition

Pre-trained speech encoders have been central to pushing state-of-the-ar...

Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching

Despite surprising performance on zero-shot transfer, pre-training a lar...
