Cross-stitched Multi-modal Encoders

by Karan Singla, et al.

In this paper, we propose a novel architecture for multi-modal speech and text input. We combine pretrained speech and text encoders using multi-headed cross-modal attention and jointly fine-tune them on the target problem. The resulting architecture supports either continuous token-level classification or utterance-level prediction over simultaneous text and speech, and the encoder efficiently captures both acoustic-prosodic and lexical information. For multi-modal utterance-level classification, we compare multi-headed attention-based fusion against a simple concatenation of pre-pooled, modality-specific representations. Our model architecture is compact and resource-efficient, and it can be trained on a single consumer GPU card.
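The two fusion strategies contrasted in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say which modality supplies the queries, so the sketch assumes text tokens act as queries over speech frames. Random arrays stand in for the outputs of the pretrained encoders, and the names `cross_modal_attention`, `concat_pooled`, and `params`, along with all dimensions, are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text, speech, params, num_heads):
    """Multi-headed cross-attention with a residual connection.

    text:   (T_text, d) token-level text encoder outputs (queries, assumed)
    speech: (T_speech, d) frame-level speech encoder outputs (keys/values, assumed)
    Returns the fused (T_text, d) representation and the attention weights.
    """
    t_text, d = text.shape
    d_head = d // num_heads

    def split_heads(x):
        # (T, d) -> (heads, T, d_head)
        return x.reshape(x.shape[0], num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(text @ params["Wq"])
    k = split_heads(speech @ params["Wk"])
    v = split_heads(speech @ params["Wv"])

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, T_text, T_speech)
    weights = softmax(scores, axis=-1)
    context = weights @ v                                 # (heads, T_text, d_head)
    context = context.transpose(1, 0, 2).reshape(t_text, d)
    return text + context @ params["Wo"], weights

def concat_pooled(text, speech):
    # Baseline: mean-pool each modality separately, then concatenate.
    return np.concatenate([text.mean(axis=0), speech.mean(axis=0)])

rng = np.random.default_rng(0)
d, num_heads = 16, 4
text = rng.standard_normal((5, d))     # stand-in for 5 text tokens
speech = rng.standard_normal((40, d))  # stand-in for 40 speech frames
params = {name: 0.1 * rng.standard_normal((d, d))
          for name in ("Wq", "Wk", "Wv", "Wo")}

fused, weights = cross_modal_attention(text, speech, params, num_heads)
utterance_vec = fused.mean(axis=0)          # pooled for utterance-level prediction
baseline_vec = concat_pooled(text, speech)  # concatenation baseline

print(fused.shape, utterance_vec.shape, baseline_vec.shape)
```

In this sketch, `fused` keeps one vector per text token, so it could feed a token-level classifier directly, while `utterance_vec` pools it for utterance-level prediction. The concatenation baseline pools each modality before combining them, so it never sees the token-to-frame alignment that the attention weights encode.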


OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation

In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross...

Using Large Pre-Trained Models with Cross-Modal Attention for Multi-Modal Emotion Recognition

Recently, self-supervised pre-training has shown significant improvement...

Analysis of Joint Speech-Text Embeddings for Semantic Matching

Embeddings play an important role in many recent end-to-end solutions fo...

Detecting Hate Speech in Multi-modal Memes

In the past few years, there has been a surge of interest in multi-modal...

Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval

Current state-of-the-art approaches to cross-modal retrieval process tex...

Towards Multi-Modal Sarcasm Detection via Hierarchical Congruity Modeling with Knowledge Enhancement

Sarcasm is a linguistic phenomenon indicating a discrepancy between lite...

Perception of prosodic variation for speech synthesis using an unsupervised discrete representation of F0

In English, prosody adds a broad range of information to segment sequenc...