Aligning where to see and what to tell: image caption with region-based attention and scene factorization

06/20/2015
by Junqi Jin, et al.

Recent progress on automatic generation of image captions has shown that it is possible to describe the most salient information conveyed by images with accurate and meaningful sentences. In this paper, we propose an image caption system that exploits the parallel structure between images and sentences. In our model, the process of generating the next word, given the previously generated ones, is aligned with the visual perception experience, where attention shifting among visual regions imposes a thread of visual ordering. This alignment characterizes the flow of "abstract meaning", encoding what is semantically shared by both the visual scene and the text description. Our system also makes another novel modeling contribution by introducing scene-specific contexts that capture higher-level semantic information encoded in an image. These contexts adapt the language model for word generation to specific scene types. We benchmark our system and compare against published results on several popular datasets. We show that using either region-based attention or scene-specific contexts improves over systems without those components. Furthermore, combining these two modeling ingredients achieves state-of-the-art performance.
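To make the two modeling ideas concrete, the sketch below illustrates (1) soft attention over image regions at each word-generation step and (2) a scene-specific context that adapts the word distribution to the scene type. This is not the authors' implementation; all dimensions, parameter names, and the particular mixing scheme (scene-weighted output biases) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical dimensions.
num_regions, region_dim = 14, 512   # per-region visual features (e.g. from a CNN)
hidden_dim, vocab_size = 256, 1000
num_scenes = 8                      # number of scene types used for factorization

# Inputs: region features and a scene posterior p(scene | image) for the whole image.
regions = rng.standard_normal((num_regions, region_dim))
scene_probs = softmax(rng.standard_normal(num_scenes))

# Parameters (randomly initialised for the sketch).
W_att = rng.standard_normal((hidden_dim, region_dim)) * 0.01
W_out = rng.standard_normal((hidden_dim + region_dim, vocab_size)) * 0.01
# One output bias per scene type: the scene context shifts word probabilities.
scene_bias = rng.standard_normal((num_scenes, vocab_size)) * 0.01

def decode_step(h):
    """One word-generation step with region-based attention and scene context."""
    # (1) Region-based attention: score each region against the decoder state,
    #     then form the visual context as an attention-weighted sum of regions.
    scores = regions @ (W_att.T @ h)            # (num_regions,)
    alpha = softmax(scores)                     # attention over regions
    visual_ctx = alpha @ regions                # (region_dim,)

    # (2) Scene factorization: mix scene-specific output biases by p(scene | image),
    #     adapting the language model's word distribution to the scene type.
    bias = scene_probs @ scene_bias             # (vocab_size,)

    logits = np.concatenate([h, visual_ctx]) @ W_out + bias
    return softmax(logits), alpha

h = rng.standard_normal(hidden_dim)             # stand-in for an RNN decoder state
word_probs, alpha = decode_step(h)
print("attended region:", alpha.argmax(), "| most likely word id:", word_probs.argmax())
```

In a full caption decoder, `decode_step` would be called once per output word, with `h` updated by a recurrent network conditioned on the previously generated word, so the attention weights shift across regions as the sentence unfolds.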


