Analysis of Joint Speech-Text Embeddings for Semantic Matching

04/04/2022
by   Muhammad Huzaifah, et al.

Embeddings play an important role in many recent end-to-end solutions for language processing problems involving more than one data modality. Although there has been some effort to understand the properties of single-modality embedding spaces, particularly that of text, their cross-modal counterparts are less understood. In this work, we study a joint speech-text embedding space trained for semantic matching by minimizing the distance between paired utterance and transcription inputs. This is done with dual encoders in a teacher-student setup, where a pretrained language model acts as the teacher and a transformer-based speech encoder as the student. We extend our method to incorporate automatic speech recognition through both pretraining and multitask scenarios and find that both approaches improve semantic matching. We use multiple techniques to analyze and evaluate the cross-modal semantic alignment of the embeddings: a quantitative retrieval accuracy metric, zero-shot classification to investigate generalizability, and probing of the encoders to observe the extent of knowledge transfer from one modality to the other.
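The core objective described above, minimizing the distance between paired utterance and transcription embeddings, can be sketched as follows. This is an illustrative toy example, not the paper's implementation: the function names, the choice of cosine distance, and the fixed embeddings are all assumptions for demonstration. In practice the teacher (text) embedding would come from a frozen pretrained language model and the student (speech) embedding from a trainable speech encoder.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def distance_loss(teacher_batch, student_batch):
    """Mean cosine distance over paired (transcription, utterance) embeddings.

    teacher_batch: embeddings from the frozen text teacher.
    student_batch: embeddings from the trainable speech student.
    """
    return float(np.mean([cosine_distance(t, s)
                          for t, s in zip(teacher_batch, student_batch)]))

# Toy batch: two pairs of 4-dimensional embeddings. The student embeddings
# are close to (but not identical to) their paired teacher embeddings.
teacher = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0]])
student = np.array([[0.9, 0.1, 0.0, 0.0],
                    [0.1, 0.9, 0.0, 0.0]])

loss = distance_loss(teacher, student)
```

Gradient descent on such a loss pulls each speech embedding toward its paired text embedding, which is what makes cross-modal retrieval by nearest-neighbor search possible afterward.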


