Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic

11/29/2021
by Yoad Tewel, et al.

Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences. While such models provide a powerful score for matching and for subsequent zero-shot tasks, they cannot generate a caption for an image. In this work, we repurpose such models to generate descriptive text for an image at inference time, without any further training or tuning step. This is done by combining the visual-semantic model with a large language model, benefiting from the knowledge in both web-scale models. The resulting captions are much less restrictive than those obtained by supervised captioning methods. Moreover, because the method is zero-shot, it is extremely flexible, and we demonstrate its ability to perform image arithmetic in which the inputs can be either images or text and the output is a sentence. This enables novel high-level vision capabilities such as comparing two images or solving visual analogy tests.
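The image arithmetic mentioned above can be illustrated with a minimal sketch. It assumes that images and sentences are embedded in one shared space, as a contrastive matching model would do, and it replaces real embeddings with hand-made three-dimensional toy vectors. The function names (`arithmetic_query`, `rank_candidates`) and the candidate captions are hypothetical, and ranking a fixed list of candidates stands in for language-model-guided generation; this is not the authors' implementation.

```python
import numpy as np

def normalize(v):
    """Scale vectors to unit length along the last axis."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def arithmetic_query(positives, negatives=()):
    """Add the positive embeddings, subtract the negative ones, renormalize."""
    q = sum(normalize(p) for p in positives)
    for n in negatives:
        q = q - normalize(n)
    return normalize(q)

def rank_candidates(query, candidate_embeddings):
    """Order candidates by cosine similarity to the query, best first."""
    sims = normalize(candidate_embeddings) @ query
    order = np.argsort(-sims)
    return order, sims

# Toy embeddings (assumption): axes loosely mean [royalty, male, female].
# In practice these would come from the image and text encoders.
image_king = [1.0, 1.0, 0.0]
image_man = [0.0, 1.0, 0.0]
text_woman = [0.0, 0.0, 1.0]

# Hypothetical candidate sentences with toy embeddings.
captions = ["a queen", "a king", "a man"]
caption_embeddings = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

# Query: image_king - image_man + text_woman, mixing image and text inputs.
query = arithmetic_query([image_king, text_woman], negatives=[image_man])
order, sims = rank_candidates(query, caption_embeddings)
print(captions[order[0]])
```

Renormalizing after the arithmetic keeps the query on the unit sphere, so the dot product with each normalized candidate is a cosine similarity, the same kind of score a contrastive matching model uses.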


research
07/22/2022

Zero-Shot Video Captioning with Evolving Pseudo-Tokens

We introduce a zero-shot video captioning method that employs two frozen...
research
02/09/2023

Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning

Augmenting pretrained language models (LMs) with a vision encoder (e.g.,...
research
11/25/2022

ComCLIP: Training-Free Compositional Image and Text Matching

Contrastive Language-Image Pretraining (CLIP) has demonstrated great zer...
research
07/26/2022

NewsStories: Illustrating articles with visual summaries

Recent self-supervised approaches have used large-scale image-text datas...
research
12/04/2021

Emojich – zero-shot emoji generation using Russian language: a technical report

This technical report presents a text-to-image neural network "Emojich" ...
research
04/13/2023

What does CLIP know about a red circle? Visual prompt engineering for VLMs

Large-scale Vision-Language Models, such as CLIP, learn powerful image-t...
