Learning Disentangled Latent Factors from Paired Data in Cross-Modal Retrieval: An Implicit Identifiable VAE Approach

12/01/2020
by   Minyoung Kim, et al.

We address the problem of learning the underlying disentangled latent factors shared between paired bi-modal data in cross-modal retrieval. We assume that the data in both modalities are complex, structured, and high-dimensional (e.g., image and text), a setting in which conventional deep auto-encoding latent variable models such as the Variational Autoencoder (VAE) often struggle to train an accurate decoder or to synthesize realistic data. A suboptimally trained decoder can harm the model's ability to identify the true factors. In this paper we propose the implicit decoder, a novel idea that completely removes the ambient data decoding module from a latent variable model through implicit encoder inversion, achieved by Jacobian regularization of the low-dimensional embedding function. Motivated by the recent Identifiable VAE (IVAE) model, we modify it to incorporate the query modality data as a conditioning auxiliary input, which allows us to prove that the true parameters of the model can be identified under some regularity conditions. Tested on various datasets where the true factors are fully or partially available, our model identifies the factors accurately and significantly outperforms conventional encoder-decoder latent variable models. We also test our model on Recipe1M, a large-scale food image/recipe dataset, where the factors learned by our approach closely coincide with the most pronounced, widely agreed-upon food factors, including savoriness, wateriness, and greenness.
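
To make the phrase "Jacobian regularization of the low-dimensional embedding function" more concrete, the sketch below computes one generic Jacobian penalty: the squared Frobenius norm of an encoder's Jacobian, estimated by central finite differences. This is an illustration only. The abstract does not state the form of the regularizer, so the penalty, the toy one-layer `encoder`, and the names `numerical_jacobian` and `jacobian_penalty` are all assumptions, not the authors' method.

```python
import math

def encoder(x, W):
    """Toy embedding function (hypothetical stand-in): one tanh layer, z = tanh(W x)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

def numerical_jacobian(f, x, eps=1e-5):
    """Central-difference estimate of the Jacobian of f at x.

    Returns a list of rows with J[j][i] = d f_j / d x_i.
    """
    out_dim = len(f(x))
    J = [[0.0] * len(x) for _ in range(out_dim)]
    for i in range(len(x)):
        plus = list(x)
        plus[i] += eps
        minus = list(x)
        minus[i] -= eps
        f_plus, f_minus = f(plus), f(minus)
        for j in range(out_dim):
            J[j][i] = (f_plus[j] - f_minus[j]) / (2.0 * eps)
    return J

def jacobian_penalty(f, x):
    """Squared Frobenius norm of the Jacobian of f at x (an assumed, generic penalty)."""
    return sum(v * v for row in numerical_jacobian(f, x) for v in row)

# Illustrative values: a 3-dimensional input mapped to a 2-dimensional embedding.
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.4]]
x = [1.0, 0.5, -1.5]
penalty = jacobian_penalty(lambda v: encoder(v, W), x)
```

In a training loop such a term would typically be added to the main objective with a weight, and the Jacobian would be obtained by automatic differentiation rather than finite differences; finite differences are used here only to keep the sketch dependency-free.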


