Joint Wasserstein Autoencoders for Aligning Multimodal Embeddings

09/14/2019
by   Shweta Mahajan, et al.

One of the key challenges in learning joint embeddings of multiple modalities, e.g. of images and text, is to ensure coherent cross-modal semantics that generalize across datasets. We propose to address this through joint Gaussian regularization of the latent representations. Building on Wasserstein autoencoders (WAEs) to encode the input in each domain, we constrain the latent embeddings to be similar to a Gaussian prior that is shared across the two domains, ensuring compatible continuity of the encoded semantic representations of images and texts. Semantic alignment is achieved through supervision from matching image-text pairs. To show the benefits of our semi-supervised representation, we apply it to cross-modal retrieval and phrase localization. We not only achieve state-of-the-art accuracy, but also significantly better generalization across datasets, owing to the semantic continuity of the latent space.
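
To make the structure of such an objective concrete, here is a minimal NumPy sketch of a joint loss with three parts: per-modality reconstruction, regularization of both latent codes toward one shared Gaussian prior, and alignment of matching image-text pairs. Everything in it is an assumption for illustration rather than the authors' implementation: the encoders and decoders are plain linear maps, the regularizer is an RBF-kernel MMD penalty (one of the regularizers used with WAEs; the abstract does not say which one the paper uses), the alignment term is a squared distance between paired codes, and the names `rbf_mmd2`, `joint_loss`, `params`, `lam` and `mu` are hypothetical.

```python
import numpy as np


def rbf_mmd2(x, y, sigma=1.0):
    """Biased estimate of the squared MMD between sample sets x and y (RBF kernel)."""
    def kernel(a, b):
        # Pairwise squared Euclidean distances between rows of a and rows of b.
        sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
        return np.exp(-sq / (2.0 * sigma**2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()


def joint_loss(img, txt, params, rng, lam=1.0, mu=1.0):
    """Illustrative joint objective; row i of img and row i of txt form a matching pair.

    params holds hypothetical linear encoder/decoder matrices, standing in for
    the deep networks a real model would use.
    """
    z_img = img @ params["enc_img"]  # image latent codes
    z_txt = txt @ params["enc_txt"]  # text latent codes

    # Reconstruction error of each modality from its own latent code.
    rec = (np.mean((z_img @ params["dec_img"] - img) ** 2)
           + np.mean((z_txt @ params["dec_txt"] - txt) ** 2))

    # Both latent distributions are pulled toward the SAME standard Gaussian prior.
    prior = rng.standard_normal(z_img.shape)
    reg = rbf_mmd2(z_img, prior, sigma=1.0) + rbf_mmd2(z_txt, prior, sigma=1.0)

    # Supervised alignment: matching pairs should land close together (assumed form).
    align = np.mean(np.sum((z_img - z_txt) ** 2, axis=1))

    return rec + lam * reg + mu * align
```

In this sketch `lam` weights the shared-prior regularizer and `mu` weights the pair supervision. Setting `mu` to zero leaves two autoencoders that share a prior but have no cross-modal correspondence, which is why the pair term is needed for semantic alignment.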

research
11/14/2019

HUSE: Hierarchical Universal Semantic Embeddings

There is a recent surge of interest in cross-modal representation learni...
research
09/18/2019

Deep Latent Space Learning for Cross-modal Mapping of Audio and Visual Signals

We propose a novel deep training algorithm for joint representation of a...
research
02/07/2022

Unsupervised physics-informed disentanglement of multimodal data for high-throughput scientific discovery

We introduce physics-informed multimodal autoencoders (PIMA) - a variati...
research
09/03/2019

Do Cross Modal Systems Leverage Semantic Relationships?

Current cross-modal retrieval systems are evaluated using R@K measure wh...
research
01/14/2019

Learning Shared Semantic Space with Correlation Alignment for Cross-modal Event Retrieval

In this paper, we propose to learn shared semantic space with correlatio...
research
08/02/2021

Efficient Deep Feature Calibration for Cross-Modal Joint Embedding Learning

This paper introduces a two-phase deep feature calibration framework for...
research
10/22/2021

Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering

This paper introduces a two-phase deep feature engineering framework for...
