Embedding Arithmetic for Text-driven Image Transformation

12/06/2021
by Guillaume Couairon, et al.

Latent text representations exhibit geometric regularities, such as the famous analogy: queen is to king what woman is to man. Such structured semantic relations have not been demonstrated on image representations. Recent work aiming to bridge this semantic gap embeds images and text into a shared multimodal space, which makes it possible to transfer text-defined transformations to the image modality. We introduce the SIMAT dataset to evaluate the task of text-driven image transformation. SIMAT contains 6k images and 18k "transformation queries" that aim at either replacing scene elements or changing their pairwise relationships. The goal is to retrieve an image consistent with the (source image, transformation) query. We use an image/text matching oracle (OSCAR) to assess whether the image transformation is successful. The SIMAT dataset will be publicly available. We use SIMAT to show that vanilla CLIP multimodal embeddings are not well suited to text-driven image transformation, but that simple fine-tuning on the COCO dataset can bring dramatic improvements. We also study whether it is beneficial to leverage the geometric properties of pretrained universal sentence encoders (FastText, LASER and LaBSE).
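To make the retrieval task concrete, here is a minimal sketch of embedding arithmetic in a shared image/text space. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `scale` parameter, and the three-dimensional toy vectors are all hypothetical, and in practice the embeddings would come from a multimodal encoder such as CLIP. The idea is to shift the source image embedding by the difference between two text embeddings, then return the database image closest to the shifted point by cosine similarity.

```python
import math


def normalize(vec):
    """Return vec scaled to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def transform_query(image_emb, source_text_emb, target_text_emb, scale=1.0):
    """Shift an image embedding along the text direction (target - source).

    `scale` is a hypothetical knob controlling how strongly the text
    delta is applied; it is an assumption of this sketch.
    """
    delta = [t - s for t, s in zip(target_text_emb, source_text_emb)]
    shifted = [i + scale * d for i, d in zip(image_emb, delta)]
    return normalize(shifted)


def retrieve(query_emb, database):
    """Return the key of the database embedding most similar to query_emb."""
    query = normalize(query_emb)

    def cosine(name):
        candidate = normalize(database[name])
        return sum(q * c for q, c in zip(query, candidate))

    return max(database, key=cosine)


# Toy 3-D space (assumed for illustration): axes stand for "dog", "cat", "sofa".
text_dog = [1.0, 0.0, 0.0]
text_cat = [0.0, 1.0, 0.0]
image_dog_on_sofa = [1.0, 0.0, 1.0]

database = {
    "dog_on_sofa": [1.0, 0.0, 1.0],
    "cat_on_sofa": [0.0, 1.0, 1.0],
    "cat_on_grass": [0.0, 1.0, 0.0],
}

# Query: take the "dog on sofa" image and replace "dog" with "cat".
query = transform_query(image_dog_on_sofa, text_dog, text_cat)
best_match = retrieve(query, database)
print(best_match)
```

In this toy setup the shifted query coincides with the "cat_on_sofa" vector, so that image is retrieved. With real embeddings the match is only approximate, which is why an external oracle is needed to judge whether the retrieved image actually reflects the requested transformation.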

