Structured Vision-Language Pretraining for Computational Cooking

12/08/2022
by Mustafa Shukor, et al.

Vision-Language Pretraining (VLP) and foundation models have been the go-to recipe for achieving SoTA performance on general benchmarks. However, leveraging these powerful techniques for more complex vision-language tasks, such as cooking applications with more structured input data, remains little investigated. In this work, we propose to leverage these techniques for structured-text-based computational cuisine tasks. Our strategy, dubbed VLPCook (Structured Vision-Language Pretraining for Computational Cooking), first transforms existing image-text pairs into image and structured-text pairs. This allows us to pretrain our VLPCook model with VLP objectives adapted to the structured data of the resulting datasets, and then finetune it on downstream computational cooking tasks. During finetuning, we also enrich the visual encoder, leveraging pretrained foundation models (e.g., CLIP) to provide local and global textual context. VLPCook outperforms the current SoTA by a significant margin (+3.3 absolute Recall@1 improvement) on cross-modal food retrieval on the large Recipe1M dataset. Finally, we conduct further experiments to validate the importance of VLP, especially on the Recipe1M+ dataset. The code will be made publicly available.
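
To make the setup described above concrete, the following is a minimal, self-contained sketch (in PyTorch) of the kind of cross-modal training the abstract refers to: an image encoder and a structured-text encoder (title, ingredients, and instructions handled as separate fields) aligned in a shared embedding space with a symmetric contrastive objective for image-recipe retrieval. The module names, architectures, dimensions, and loss below are illustrative assumptions, not the authors' implementation.

# Minimal sketch of contrastive image / structured-recipe alignment.
# All architectures and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StructuredTextEncoder(nn.Module):
    """Encodes the three structured recipe fields and fuses them."""

    def __init__(self, vocab_size=10000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.field_encoder = nn.GRU(dim, dim, batch_first=True)
        self.fuse = nn.Linear(3 * dim, dim)  # title + ingredients + instructions

    def encode_field(self, tokens):
        _, h = self.field_encoder(self.embed(tokens))
        return h.squeeze(0)

    def forward(self, title, ingredients, instructions):
        fields = [self.encode_field(t) for t in (title, ingredients, instructions)]
        return F.normalize(self.fuse(torch.cat(fields, dim=-1)), dim=-1)


class ImageEncoder(nn.Module):
    """Tiny CNN stand-in for the visual backbone."""

    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64, dim)

    def forward(self, images):
        return F.normalize(self.proj(self.backbone(images)), dim=-1)


def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over matching image/recipe pairs in the batch."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


if __name__ == "__main__":
    img_enc, txt_enc = ImageEncoder(), StructuredTextEncoder()
    images = torch.randn(4, 3, 64, 64)                # dummy batch of food images
    title = torch.randint(0, 10000, (4, 8))           # dummy token ids per field
    ingredients = torch.randint(0, 10000, (4, 32))
    instructions = torch.randint(0, 10000, (4, 64))
    loss = contrastive_loss(img_enc(images), txt_enc(title, ingredients, instructions))
    print(f"contrastive loss: {loss.item():.4f}")

In the paper's full pipeline, the structured-text fields would come from the transformed image and structured-text pairs, and the visual encoder would additionally be enriched with local and global textual context from a pretrained model such as CLIP; the sketch above only illustrates the shared-space retrieval objective.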


Related research

08/29/2022
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment
Vision and Language Pretraining has become the prevalent approach for ta...

04/22/2022
Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks
Cross-modal encoders for vision-language (VL) tasks are often pretrained...

09/15/2022
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks
This paper presents OmniVL, a new foundation model to support both image...

12/13/2022
CREPE: Can Vision-Language Foundation Models Reason Compositionally?
A fundamental characteristic common to both human vision and natural lan...

03/06/2023
IPA-CLIP: Integrating Phonetic Priors into Vision and Language Pretraining
Recently, large-scale Vision and Language (V&L) pretraining has become t...

06/12/2023
Sticker820K: Empowering Interactive Retrieval with Stickers
Stickers have become a ubiquitous part of modern-day communication, conv...

07/24/2023
Towards a Visual-Language Foundation Model for Computational Pathology
The accelerated adoption of digital pathology and advances in deep learn...
