CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations

09/01/2021
by Hang Li, et al.

Existing task-specific predictive approaches for audio-language problems focus on building complicated late-fusion mechanisms. However, these models tend to overfit when labels are limited and generalize poorly. In this paper, we present a Cross-modal Transformer for Audio-and-Language, i.e., CTAL, which aims to learn the intra-modality and inter-modality connections between audio and language through two proxy tasks on a large amount of audio-and-language pairs: masked language modeling and masked cross-modal acoustic modeling. After fine-tuning our pre-trained model on multiple downstream audio-and-language tasks, we observe significant improvements across various tasks, such as emotion classification, sentiment analysis, and speaker verification. On this basis, we further propose a specially-designed fusion mechanism for the fine-tuning phase, which allows our pre-trained model to achieve better performance. Lastly, we present detailed ablation studies showing that both our novel cross-modality fusion component and our audio-language pre-training methods contribute significantly to the promising results.
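The two proxy tasks described above can be viewed as one joint objective: a cross-entropy loss over masked text tokens plus a reconstruction loss over masked audio frames. The sketch below is illustrative only, not the authors' implementation — the masking strategy, the L1 reconstruction choice, and the weighting parameter `alpha` are all assumptions.

```python
import numpy as np

def masked_lm_loss(logits, targets, mask):
    """Masked language modeling: cross-entropy over masked text positions only.

    logits: (seq_len, vocab), targets: (seq_len,), mask: (seq_len,) bool.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_ll = log_probs[np.arange(len(targets)), targets]
    return -(token_ll * mask).sum() / max(mask.sum(), 1)

def masked_acoustic_loss(pred_frames, true_frames, mask):
    """Masked acoustic modeling: reconstruct masked audio frames (L1 assumed).

    pred_frames, true_frames: (n_frames, feat_dim), mask: (n_frames,) bool.
    """
    err = np.abs(pred_frames - true_frames).mean(axis=-1)
    return (err * mask).sum() / max(mask.sum(), 1)

def pretraining_loss(text_logits, text_targets, text_mask,
                     audio_pred, audio_true, audio_mask, alpha=1.0):
    """Joint pre-training objective: MLM loss + alpha * acoustic loss.

    alpha is a hypothetical weighting hyperparameter, not from the paper.
    """
    return (masked_lm_loss(text_logits, text_targets, text_mask)
            + alpha * masked_acoustic_loss(audio_pred, audio_true, audio_mask))
```

In practice both losses would be computed from the cross-modal Transformer's outputs, so each masked position is predicted with context from both modalities; that cross-attention is what ties the two tasks together.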


