CALM: Contrastive Aligned Audio-Language Multirate and Multimodal Representations

by Vin Sachidananda, et al.

Deriving multimodal representations of audio and lexical inputs is a central problem in Natural Language Understanding (NLU). In this paper, we present Contrastive Aligned Audio-Language Multirate and Multimodal Representations (CALM), an approach for learning multimodal representations using contrastive and multirate information inherent in audio and lexical inputs. The proposed model aligns acoustic and lexical information in the input embedding space of a pretrained language-only contextual embedding model. By aligning audio representations to pretrained language representations and utilizing contrastive information between acoustic inputs, CALM is able to bootstrap audio embeddings competitive with existing audio representation models in only a few hours of training time. Operationally, audio spectrograms are processed using linearized patches through a Spectral Transformer (SpecTran), which is trained using a Contrastive Audio-Language Pretraining objective to align audio and language from similar queries. Subsequently, the derived acoustic and lexical token representations are input into a multimodal transformer to incorporate utterance-level context and derive the proposed CALM representations. We show that these pretrained embeddings can subsequently be used in multimodal supervised tasks and demonstrate the benefits of the proposed pretraining steps in terms of the alignment of the two embedding spaces and the multirate nature of the pretraining. Our system shows a 10-25% improvement over existing emotion recognition systems, including state-of-the-art three-modality systems, under various evaluation objectives.
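The two operational steps named in the abstract, linearizing spectrogram patches and aligning audio with language contrastively, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the patch sizes, the temperature value, and the symmetric InfoNCE form of the loss are all assumptions chosen for illustration, and the abstract does not specify the exact form of the Contrastive Audio-Language Pretraining objective.

```python
import numpy as np

def linearize_patches(spec, patch_f, patch_t):
    """Split a (freq, time) spectrogram into non-overlapping patches
    and flatten each patch into a vector (hypothetical helper)."""
    n_f = spec.shape[0] // patch_f
    n_t = spec.shape[1] // patch_t
    # Drop any remainder that does not fill a whole patch.
    spec = spec[: n_f * patch_f, : n_t * patch_t]
    patches = spec.reshape(n_f, patch_f, n_t, patch_t).transpose(0, 2, 1, 3)
    return patches.reshape(n_f * n_t, patch_f * patch_t)

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_alignment_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumed stand-in for the paper's
    objective): row i of audio_emb is paired with row i of text_emb."""
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature
    idx = np.arange(len(a))
    loss_audio_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_audio = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_audio_to_text + loss_text_to_audio)

# Toy usage with random data standing in for real features.
rng = np.random.default_rng(0)
spec = rng.standard_normal((64, 100))          # 64 frequency bins, 100 frames
patches = linearize_patches(spec, patch_f=16, patch_t=10)
emb = rng.standard_normal((8, 32))             # 8 paired utterance embeddings
loss = contrastive_alignment_loss(emb, emb)
```

In a full system the flattened patches would be projected and passed through a transformer encoder before the loss is computed; the sketch skips that step to keep the alignment idea visible.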




Multimodal Embeddings from Language Models

Word embeddings such as ELMo have recently been shown to model word sema...

End-to-End Speech Recognition and Disfluency Removal with Acoustic Language Model Pretraining

The SOTA in transcription of disfluent and conversational speech has in ...

Multimodal and Multi-view Models for Emotion Recognition

Studies on emotion recognition (ER) show that combining lexical and acou...

AudioFormer: Audio Transformer learns audio feature representations from discrete acoustic codes

We propose a method named AudioFormer, which learns audio feature represe...

IPA-CLIP: Integrating Phonetic Priors into Vision and Language Pretraining

Recently, large-scale Vision and Language (V&L) pretraining has become t...

MAST: Multiscale Audio Spectrogram Transformers

We present Multiscale Audio Spectrogram Transformer (MAST) for audio cla...

APE: Aligning Pretrained Encoders to Quickly Learn Aligned Multimodal Representations

Recent advances in learning aligned multimodal representations have been...
