Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation

10/20/2022
by Yu Zhao, et al.

Image-to-text tasks, such as open-ended image captioning and controllable image description, have received extensive attention for decades. Here, we advance this line of work by presenting Visual Spatial Description (VSD), a new perspective on image-to-text generation oriented toward spatial semantics. Given an image and two objects inside it, VSD aims to produce a single description that focuses on the spatial relationship between the two objects. Accordingly, we manually annotate a dataset to facilitate investigation of the newly introduced task and build several benchmark encoder-decoder models using VL-BART and VL-T5 as backbones. In addition, we investigate pipeline and joint end-to-end architectures for incorporating visual spatial relationship classification (VSRC) information into our model. Finally, we conduct experiments on our benchmark dataset to evaluate all our models. Results show that our models produce accurate and human-like spatial-oriented text descriptions. Meanwhile, VSRC has great potential for VSD, and the joint end-to-end architecture is the better choice for their integration. We make the dataset and code public for research purposes.
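
To make the task setup concrete, the following is a minimal sketch of what a VSD input might look like and how a VSRC label could be injected in a pipeline-style setup. It is an illustration under stated assumptions, not the authors' implementation: the class names (`ObjectRegion`, `VSDExample`), the `build_prompt` helper, the prompt wording, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


# Hypothetical types: names and fields are illustrative, not taken from the paper's code.
@dataclass
class ObjectRegion:
    label: str
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class VSDExample:
    image_id: str
    first: ObjectRegion
    second: ObjectRegion


def build_prompt(example: VSDExample, relation: Optional[str] = None) -> str:
    """Build the text side of an encoder input for a VSD model.

    relation=None corresponds to a setup without an explicit VSRC label.
    Passing a relation mimics a pipeline setup, where a separate classifier
    predicts the spatial relation first and its label is fed to the generator.
    """
    parts = [
        f"describe spatial relation: {example.first.label} ; {example.second.label}"
    ]
    if relation is not None:
        parts.append(f"relation: {relation}")
    return " | ".join(parts)


# Made-up sample: two objects in one image.
example = VSDExample(
    image_id="img_001",
    first=ObjectRegion("man", (10, 20, 110, 220)),
    second=ObjectRegion("horse", (90, 60, 300, 240)),
)

print(build_prompt(example))
print(build_prompt(example, relation="next to"))
```

In a joint end-to-end setup, by contrast, the relation would not be supplied as an external string; the model would be trained on classification and generation together, which this sketch does not attempt to show.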
