Aligning Visual Regions and Textual Concepts: Learning Fine-Grained Image Representations for Image Captioning

05/15/2019
by Fenglin Liu, et al.

In image-grounded text generation, fine-grained representations of the image are considered to be of paramount importance. Most current systems incorporate visual features and textual concepts as a sketch of an image. However, plainly inferred representations are usually undesirable in that they are composed of separate components, the relations of which are elusive. In this work, we aim at representing an image with a set of integrated visual regions and corresponding textual concepts. To this end, we build the Mutual Iterative Attention (MIA) module, which integrates correlated visual features and textual concepts, respectively, by aligning the two modalities. We evaluate the proposed approach on the COCO dataset for image captioning. Extensive experiments show that the refined image representations boost the baseline models by up to 12% in terms of CIDEr, demonstrating that our approach is effective and generalizes well to a wide range of models.


Related research

01/20/2023 · Visual Semantic Relatedness Dataset for Image Captioning
Modern image captioning system relies heavily on extracting knowledge fr...

09/12/2023 · Towards Visual Taxonomy Expansion
Taxonomy expansion task is essential in organizing the ever-increasing v...

06/06/2019 · Context-Aware Visual Policy Network for Fine-Grained Image Captioning
With the maturity of visual detection techniques, we are more ambitious ...

06/21/2020 · Improving Image Captioning with Better Use of Captions
Image captioning is a multimodal problem that has drawn extensive attent...

02/23/2021 · Enhanced Modality Transition for Image Captioning
Image captioning model is a cross-modality knowledge discovery task, whi...

09/27/2016 · House price estimation from visual and textual features
Most existing automatic house price estimation systems rely only on some...

06/27/2018 · Learning Visually-Grounded Semantics from Contrastive Adversarial Samples
We study the problem of grounding distributional representations of text...
