Expressing Visual Relationships via Language

06/18/2019
by   Hao Tan, et al.
8

Describing images with text is a fundamental problem in vision-language research. Current studies in this domain mostly focus on single image captioning. However, in various real applications (e.g., image editing, difference interpretation, and retrieval), generating relational captions for two images, can also be very useful. This important problem has not been explored mostly due to lack of datasets and effective models. To push forward the research in this direction, we first introduce a new language-guided image editing dataset that contains a large number of real image pairs with corresponding editing instructions. We then propose a new relational speaker model based on an encoder-decoder architecture with static relational attention and sequential multi-head attention. We also extend the model with dynamic relational attention, which calculates visual alignment while decoding. Our models are evaluated on our newly collected and two public datasets consisting of image pairs annotated with relationship sentences. Experimental results, based on both automatic and human evaluation, demonstrate that our model outperforms all baselines and existing methods on all the datasets.

READ FULL TEXT

page 3

page 9

research
02/03/2021

L2C: Describing Visual Differences Needs Semantic Understanding of Individuals

Recent advances in language and vision push forward the research of capt...
research
07/02/2020

Fact-based Text Editing

We propose a novel text editing task, referred to as fact-based text edi...
research
10/14/2019

Tell-the-difference: Fine-grained Visual Descriptor via a Discriminating Referee

In this paper, we investigate a novel problem of telling the difference ...
research
03/06/2020

Show, Edit and Tell: A Framework for Editing Image Captions

Most image captioning frameworks generate captions directly from images,...
research
12/13/2021

MAGIC: Multimodal relAtional Graph adversarIal inferenCe for Diverse and Unpaired Text-based Image Captioning

Text-based image captioning (TextCap) requires simultaneous comprehensio...
research
04/14/2022

Optimal quadratic binding for relational reasoning in vector symbolic neural architectures

Binding operation is fundamental to many cognitive processes, such as co...
research
06/15/2019

Generating Diverse and Informative Natural Language Fashion Feedback

Recent advances in multi-modal vision and language tasks enable a new se...

Please sign up or login with your details

Forgot password? Click here to reset