Transformer-based Cross-Modal Recipe Embeddings with Large Batch Training

05/10/2022
by Jing Yang, et al.

In this paper, we present a cross-modal recipe retrieval framework, Transformer-based Network for Large Batch Training (TNLBT), which is inspired by ACME (Adversarial Cross-Modal Embedding) and H-T (Hierarchical Transformer). TNLBT performs cross-modal retrieval while also generating images from recipe embeddings. We apply a Hierarchical Transformer-based recipe text encoder, a Vision Transformer (ViT)-based recipe image encoder, and an adversarial network architecture to enable better cross-modal embedding learning for recipe texts and images. In addition, we use self-supervised learning to exploit the rich information in recipe texts that have no corresponding images. Since contrastive learning can benefit from a larger batch size according to recent literature on self-supervised learning, we adopt a large batch size during training and validate its effectiveness. In the experiments, the proposed framework significantly outperformed the current state-of-the-art frameworks in both cross-modal recipe retrieval and image generation tasks on the Recipe1M benchmark. This is the first work to confirm the effectiveness of large batch training on cross-modal recipe embeddings.
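To illustrate why batch size matters for the contrastive objective, below is a minimal PyTorch sketch of a bidirectional InfoNCE-style loss over paired image/recipe embeddings: every other pair in the batch serves as a negative, so a larger batch directly supplies more negatives per update. This is a generic sketch, not the paper's exact objective (TNLBT combines several losses, including adversarial terms); the names image_emb, recipe_emb, and temperature are illustrative assumptions.

```python
# Hypothetical sketch of a batch-wise cross-modal contrastive loss.
# Row i of image_emb is assumed to be paired with row i of recipe_emb.
import torch
import torch.nn.functional as F

def cross_modal_infonce(image_emb: torch.Tensor,
                        recipe_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """image_emb, recipe_emb: (B, D) embeddings from the two encoders."""
    image_emb = F.normalize(image_emb, dim=-1)
    recipe_emb = F.normalize(recipe_emb, dim=-1)
    # (B, B) cosine-similarity logits; a larger B yields more in-batch negatives.
    logits = image_emb @ recipe_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric objective: retrieve the recipe given the image, and vice versa.
    loss_i2r = F.cross_entropy(logits, targets)
    loss_r2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2r + loss_r2i)

# Usage with a large batch (e.g. B = 768), assuming hypothetical encoders:
# loss = cross_modal_infonce(image_encoder(images), recipe_encoder(recipes))
```

With this formulation, doubling the batch size doubles the number of negatives each pair is contrasted against, which is the intuition behind the large-batch training the paper validates.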
