Cross-Modal Discrete Representation Learning

06/10/2021
by   Alexander H. Liu, et al.

Recent advances in representation learning have demonstrated an ability to represent information from different modalities such as video, text, and audio in a single high-level embedding vector. In this work, we present a self-supervised learning framework that learns a representation capturing finer levels of granularity across modalities, such as concepts or events represented by visual objects or spoken words. Our framework relies on a discretized embedding space, created via vector quantization, that is shared across modalities. Beyond the shared embedding space, we propose a Cross-Modal Code Matching objective that forces the representations from different views (modalities) to have similar distributions over the discrete embedding space, such that cross-modal object/action localization can be performed without direct supervision. In our experiments, we show that the proposed discretized multi-modal fine-grained representation (e.g., pixel/word/frame) can complement high-level summary representations (e.g., video/sentence/waveform) for improved performance on cross-modal retrieval tasks. We also observe that the discretized representation uses individual clusters to represent the same semantic concept across modalities.
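The two core ideas in the abstract, a single codebook shared by all modalities and a code-matching objective that aligns each modality's distribution over those codes, can be sketched in a few lines. The following is a hypothetical NumPy illustration, not the paper's implementation: the codebook, the `quantize`/`code_distribution` helpers, and the symmetric-KL `code_matching_loss` are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# One codebook shared across modalities: 8 discrete codes, 4-dim embeddings.
codebook = rng.normal(size=(8, 4))

def quantize(features, codebook):
    """Vector quantization: map each feature to its nearest codebook entry."""
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)  # one discrete code index per feature

def code_distribution(features, codebook, temperature=1.0):
    """Soft assignment of features to codes (softmax over negative distances),
    averaged over the sequence to get one distribution per modality."""
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    logits = -dists / temperature
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    return probs.mean(axis=0)

def code_matching_loss(p, q, eps=1e-9):
    """Symmetric KL divergence between two modalities' code distributions
    (stands in for the paper's Cross-Modal Code Matching objective)."""
    kl = lambda a, b: float((a * np.log((a + eps) / (b + eps))).sum())
    return 0.5 * (kl(p, q) + kl(q, p))

# Toy "visual" and "audio" sequences describing the same three concepts:
# both are noisy copies of codes {0, 1, 2}, in different orders.
visual = codebook[[0, 1, 2]] + 0.05 * rng.normal(size=(3, 4))
audio = codebook[[0, 2, 1]] + 0.05 * rng.normal(size=(3, 4))

loss = code_matching_loss(
    code_distribution(visual, codebook),
    code_distribution(audio, codebook),
)
```

Because both sequences quantize to the same set of codes, their code distributions nearly coincide and the loss is close to zero; minimizing it during training pushes paired inputs from different modalities toward the same clusters, which is what lets one codebook entry stand for one semantic concept across modalities.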


