Cross-stitched Multi-modal Encoders

04/20/2022
by Karan Singla, et al.

In this paper, we propose a novel architecture for multi-modal speech and text input. We combine pretrained speech and text encoders using multi-headed cross-modal attention and jointly fine-tune on the target problem. The resulting architecture can be used for continuous token-level classification or utterance-level prediction on simultaneous text and speech, and its encoder efficiently captures both acoustic-prosodic and lexical information. We compare the benefits of multi-headed attention-based fusion for multi-modal utterance-level classification against a simple concatenation of pre-pooled, modality-specific representations. Our model architecture is compact, resource-efficient, and can be trained on a single consumer GPU card.
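To make the fusion step concrete, the sketch below shows one plausible way to cross-attend pretrained text and speech encoder states with multi-headed attention and then pool them for an utterance-level prediction. The dimensions, head count, pooling choice, and class names are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Hypothetical sketch of cross-modal multi-headed attention fusion.

    Queries from one modality attend over keys/values from the other, so each
    stream is enriched with information from its counterpart. Dimensions and
    head count are illustrative only.
    """

    def __init__(self, text_dim: int = 768, speech_dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Text tokens attend to speech frames (queries: text, keys/values: speech).
        self.text_to_speech_attn = nn.MultiheadAttention(
            embed_dim=text_dim, kdim=speech_dim, vdim=speech_dim,
            num_heads=num_heads, batch_first=True)
        # Speech frames attend to text tokens (queries: speech, keys/values: text).
        self.speech_to_text_attn = nn.MultiheadAttention(
            embed_dim=speech_dim, kdim=text_dim, vdim=text_dim,
            num_heads=num_heads, batch_first=True)

    def forward(self, text_states: torch.Tensor, speech_states: torch.Tensor):
        # text_states:   (batch, text_len, text_dim) from a pretrained text encoder
        # speech_states: (batch, speech_len, speech_dim) from a pretrained speech encoder
        text_fused, _ = self.text_to_speech_attn(text_states, speech_states, speech_states)
        speech_fused, _ = self.speech_to_text_attn(speech_states, text_states, text_states)
        return text_fused, speech_fused


if __name__ == "__main__":
    fusion = CrossModalFusion()
    text = torch.randn(2, 20, 768)     # e.g. BERT-style token states (assumed size)
    speech = torch.randn(2, 100, 768)  # e.g. wav2vec2-style frame states (assumed size)
    t_fused, s_fused = fusion(text, speech)

    # Utterance-level vector: mean-pool each fused stream and concatenate.
    # The paper's baseline would instead concatenate pre-pooled,
    # modality-specific representations without cross-attention.
    utterance_vec = torch.cat([t_fused.mean(dim=1), s_fused.mean(dim=1)], dim=-1)
    print(utterance_vec.shape)  # torch.Size([2, 1536])
```

The token-level variant would skip the pooling step and feed the per-token fused states to a classification head directly.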
