Learning TFIDF Enhanced Joint Embedding for Recipe-Image Cross-Modal Retrieval Service

08/02/2021
by Zhongwei Xie, et al.

It is widely acknowledged that learning joint embeddings of recipes and images is challenging due to the diverse composition and deformation of ingredients during cooking. We present a Multi-modal Semantics enhanced Joint Embedding approach (MSJE) for learning a common feature space between the two modalities (text and image), with the ultimate goal of providing high-performance cross-modal retrieval services. Our MSJE approach has three unique features. First, we extract TFIDF features from the title, ingredients, and cooking instructions of each recipe. By combining LSTM-learned features with TFIDF features to determine the significance of word sequences, we encode a recipe into a TFIDF-weighted vector that captures significant key terms and how they are used in the corresponding cooking instructions. Second, we combine the recipe TFIDF feature with the recipe sequence feature extracted through a two-stage LSTM network, which is effective in capturing the unique relationship between a recipe and its associated image(s). Third, we further incorporate TFIDF-enhanced category semantics to improve the mapping of the image modality and to regulate the similarity loss function during the iterative learning of the cross-modal joint embedding. Experiments on the benchmark Recipe1M dataset show that the proposed approach outperforms the state-of-the-art approaches.
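To make the first two ideas concrete, here is a minimal sketch of a TFIDF-enhanced recipe encoder in PyTorch and scikit-learn. This is not the authors' implementation: the class name RecipeEncoder, the toy corpus, and all dimensions (300-d word embeddings, 512-d LSTM hidden state, 1024-d joint space, 2048 TFIDF terms) are illustrative assumptions; it only shows the general pattern of fusing a TFIDF vector with an LSTM sequence feature before projecting into the joint embedding space.

```python
# Minimal sketch (not the authors' code): encode a recipe by fusing a TFIDF
# vector over its title/ingredients/instructions with an LSTM sequence
# feature, then projecting into a joint embedding space shared with images.
import torch
import torch.nn as nn
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy recipes: title + ingredients + instructions concatenated into one string.
corpus = [
    "chocolate cake: flour sugar cocoa eggs. mix dry ingredients, bake at 350F.",
    "tomato soup: tomatoes onion garlic stock. simmer 20 minutes, blend.",
]

# 1) TFIDF feature capturing significant key terms of each recipe.
vectorizer = TfidfVectorizer(max_features=2048)
tfidf = torch.tensor(vectorizer.fit_transform(corpus).toarray(), dtype=torch.float32)

class RecipeEncoder(nn.Module):
    """Fuses the LSTM sequence feature with the TFIDF feature, then projects
    the result into the joint embedding space (all sizes are assumptions)."""
    def __init__(self, vocab_size, tfidf_dim, embed_dim=300, hidden=512, joint_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden + tfidf_dim, joint_dim)

    def forward(self, token_ids, tfidf_vec):
        _, (h_n, _) = self.lstm(self.embed(token_ids))   # 2) sequence feature
        fused = torch.cat([h_n[-1], tfidf_vec], dim=1)   # concatenate both cues
        return nn.functional.normalize(self.proj(fused), dim=1)

# Toy forward pass with dummy token ids standing in for a real tokenizer.
enc = RecipeEncoder(vocab_size=10000, tfidf_dim=tfidf.size(1))
tokens = torch.randint(0, 10000, (2, 30))
recipe_emb = enc(tokens, tfidf)
print(recipe_emb.shape)  # torch.Size([2, 1024])
```

In a full pipeline, the resulting recipe embedding would be trained against an image-encoder output with a similarity loss (regulated by the TFIDF-enhanced category semantics described above) so that matching recipe-image pairs land close together in the joint space.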


Related research:

08/09/2021
Learning Joint Embedding with Modality Alignments for Cross-Modal Retrieval of Recipes and Food Images
This paper presents a three-tier modality alignment approach to learning...

11/30/2022
Improving Cross-Modal Retrieval with Set of Diverse Embeddings
Cross-modal retrieval across image and text modalities is a challenging ...

11/13/2021
Memotion Analysis through the Lens of Joint Embedding
Joint embedding (JE) is a way to encode multi-modal data into a vector s...

08/02/2021
Efficient Deep Feature Calibration for Cross-Modal Joint Embedding Learning
This paper introduces a two-phase deep feature calibration framework for...

10/22/2021
Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering
This paper introduces a two-phase deep feature engineering framework for...

10/10/2022
Semantically Enhanced Hard Negatives for Cross-modal Information Retrieval
Visual Semantic Embedding (VSE) aims to extract the semantics of images ...

02/04/2017
Simple to Complex Cross-modal Learning to Rank
The heterogeneity-gap between different modalities brings a significant ...
