MCEN: Bridging Cross-Modal Gap between Cooking Recipes and Dish Images with Latent Variable Model

04/02/2020
by   Han Fu, et al.

Driven by growing concern about diet and health, food computing has attracted considerable attention from both industry and the research community. One of the most popular research topics in this domain is food retrieval, owing to its influence on health-oriented applications. In this paper, we focus on the task of cross-modal retrieval between food images and cooking recipes. We present the Modality-Consistent Embedding Network (MCEN), which learns modality-invariant representations by projecting images and texts into the same embedding space. To capture the latent alignments between modalities, we incorporate stochastic latent variables that explicitly exploit the interactions between textual and visual features. Importantly, our method learns the cross-modal alignments during training but computes the embeddings of each modality independently at inference time for the sake of efficiency. Extensive experiments show that the proposed MCEN outperforms all existing approaches on the benchmark Recipe1M dataset while requiring less computational cost.
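The inference-time setup described in the abstract, where each modality is embedded independently and retrieval happens in a shared space, can be illustrated with a minimal sketch. This is not the MCEN model: the embeddings below are hypothetical random stand-ins for the outputs of an image encoder and a recipe encoder, and cosine similarity is assumed as the ranking score, which the abstract does not specify.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def rank_candidates(query_emb, candidate_embs):
    """Return candidate indices ordered from most to least similar to the query."""
    query = l2_normalize(np.asarray(query_emb, dtype=float)[None, :])[0]
    candidates = l2_normalize(np.asarray(candidate_embs, dtype=float))
    sims = candidates @ query          # cosine similarity per candidate
    return np.argsort(-sims)           # descending similarity


# Hypothetical stand-ins for encoder outputs (assumption, not real MCEN features).
# Recipe embeddings can be computed once and cached, because they do not depend
# on the image query at inference time.
rng = np.random.default_rng(0)
recipe_embs = rng.normal(size=(5, 8))
# Simulate well-aligned image embeddings: each image lies close to its paired recipe.
image_embs = recipe_embs + 0.01 * rng.normal(size=(5, 8))

# Image-to-recipe retrieval: rank all recipes for the image at index 2.
ranking = rank_candidates(image_embs[2], recipe_embs)
```

Because neither embedding depends on the other modality, a gallery of candidates can be encoded offline and each query costs one encoder pass plus a matrix-vector product. This is the efficiency property the abstract attributes to computing embeddings independently at inference.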


