CALM: Contrastive Aligned Audio-Language Multirate and Multimodal Representations

02/08/2022, by Vin Sachidananda et al.

Deriving multimodal representations of audio and lexical inputs is a central problem in Natural Language Understanding (NLU). In this paper, we present Contrastive Aligned Audio-Language Multirate and Multimodal Representations (CALM), an approach for learning multimodal representations using the contrastive and multirate information inherent in audio and lexical inputs. The proposed model aligns acoustic and lexical information in the input embedding space of a pretrained language-only contextual embedding model. By aligning audio representations to pretrained language representations and exploiting contrastive information between acoustic inputs, CALM bootstraps audio embeddings competitive with existing audio representation models in only a few hours of training time. Operationally, audio spectrograms are processed as linearized patches by a Spectral Transformer (SpecTran), which is trained with a Contrastive Audio-Language Pretraining objective to align audio and language from similar queries. The resulting acoustic and lexical token representations are then input to a multimodal transformer, which incorporates utterance-level context and produces the proposed CALM representations. We show that these pretrained embeddings can subsequently be used in multimodal supervised tasks, and we demonstrate the benefits of the proposed pretraining steps with respect to both the alignment of the two embedding spaces and the multirate nature of the pretraining. Our system shows a 10-25% improvement over existing emotion recognition systems, including state-of-the-art three-modality systems, under various evaluation objectives.
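As a rough illustration of the two pretraining components named above, the sketch below pairs a patch-based spectrogram encoder with a CLIP-style symmetric contrastive loss that pulls paired audio and text embeddings together. This is a minimal PyTorch sketch based only on the abstract: the module sizes, the SpecTran architecture details, and the contrastive_alignment_loss helper are illustrative assumptions, not the authors' released implementation.

# Minimal sketch of contrastive audio-language alignment, assuming a
# ViT-style patch encoder for spectrograms and a symmetric InfoNCE loss.
# All names, dimensions, and hyperparameters here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpecTran(nn.Module):
    """Spectral Transformer: encodes linearized spectrogram patches."""
    def __init__(self, patch_size=16, dim=256, depth=4, heads=4):
        super().__init__()
        # Non-overlapping patches, linearized by a strided convolution.
        self.patch = nn.Conv2d(1, dim, kernel_size=patch_size, stride=patch_size)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # pooled-summary token

    def forward(self, spec):                              # spec: (B, 1, n_mels, T)
        x = self.patch(spec).flatten(2).transpose(1, 2)   # (B, num_patches, dim)
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1)
        return self.encoder(x)[:, 0]                      # pooled audio embedding

def contrastive_alignment_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matched audio/text rows in a batch are positives."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                        # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Usage: align SpecTran outputs to pretrained language-model embeddings.
audio_encoder = SpecTran()
spec = torch.randn(8, 1, 64, 256)       # batch of log-mel spectrograms (assumed shape)
text_emb = torch.randn(8, 256)          # stand-in for pretrained text embeddings
loss = contrastive_alignment_loss(audio_encoder(spec), text_emb)

In the paper's pipeline, the audio embeddings produced this way and the lexical token embeddings would then be fed jointly to the multimodal transformer; the sketch covers only the contrastive alignment stage.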
