Improving Image Captioning with Better Use of Captions

06/21/2020
by Zhan Shi, et al.

Image captioning is a multimodal problem that has drawn extensive attention in both the natural language processing and computer vision communities. In this paper, we present a novel image captioning architecture that better exploits the semantics available in captions and leverages them to enhance both image representation and caption generation. Our model first constructs caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning. The representation is then enhanced with the textual and visual features of neighbouring and contextual nodes. During generation, the model further incorporates visual relationships via multi-task learning that jointly predicts word and object/predicate tag sequences. We perform extensive experiments on the MSCOCO dataset, showing that the proposed framework significantly outperforms the baselines and achieves state-of-the-art performance under a wide range of evaluation metrics.
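As a rough illustration of the joint decoding objective described in the abstract, the sketch below shows one way a decoder can predict both caption words and object/predicate tags at each step and combine the two losses. This is not the authors' released code: the module names, feature dimensions, and the weighting hyperparameter `lambda_tag` are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class MultiTaskDecoder(nn.Module):
    """Illustrative decoder with a word head and an object/predicate tag head."""
    def __init__(self, vocab_size, tag_size, feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.word_head = nn.Linear(hidden_dim, vocab_size)  # next caption word
        self.tag_head = nn.Linear(hidden_dim, tag_size)     # object/predicate tag

    def forward(self, img_feat, words):
        # img_feat: (B, feat_dim) pooled, graph-enhanced image representation
        # words:    (B, T) ground-truth caption tokens (teacher forcing)
        B, T = words.shape
        h = img_feat.new_zeros(B, self.lstm.hidden_size)
        c = img_feat.new_zeros(B, self.lstm.hidden_size)
        word_logits, tag_logits = [], []
        for t in range(T):
            x = torch.cat([self.embed(words[:, t]), img_feat], dim=-1)
            h, c = self.lstm(x, (h, c))
            word_logits.append(self.word_head(h))
            tag_logits.append(self.tag_head(h))
        return torch.stack(word_logits, dim=1), torch.stack(tag_logits, dim=1)


def multitask_loss(word_logits, tag_logits, word_targets, tag_targets, lambda_tag=0.5):
    """Caption cross-entropy plus a weighted tag-sequence cross-entropy."""
    ce = nn.CrossEntropyLoss()
    loss_word = ce(word_logits.flatten(0, 1), word_targets.flatten())
    loss_tag = ce(tag_logits.flatten(0, 1), tag_targets.flatten())
    return loss_word + lambda_tag * loss_tag
```

A full system would condition on the relationship graph through attention rather than a single pooled feature; the sketch only illustrates how the word and tag prediction tasks can share one decoder under a joint objective.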

Related research

12/12/2016
Text-guided Attention Model for Image Captioning
Visual attention plays an important role to understand images and demons...

05/18/2021
Dependent Multi-Task Learning with Causal Intervention for Image Captioning
Recent work for image captioning mainly followed an extract-then-generat...

08/14/2019
HorNet: A Hierarchical Offshoot Recurrent Network for Improving Person Re-ID via Image Captioning
Person re-identification (re-ID) aims to recognize a person-of-interest ...

02/11/2022
Bench-Marking And Improving Arabic Automatic Image Captioning Through The Use Of Multi-Task Learning Paradigm
The continuous increase in the use of social media and the visual conten...

06/07/2019
Figure Captioning with Reasoning and Sequence-Level Training
Figures, such as bar charts, pie charts, and line plots, are widely used...

05/15/2019
Aligning Visual Regions and Textual Concepts: Learning Fine-Grained Image Representations for Image Captioning
In image-grounded text generation, fine-grained representations of the i...

02/15/2020
MRRC: Multiple Role Representation Crossover Interpretation for Image Captioning With R-CNN Feature Distribution Composition (FDC)
While image captioning through machines requires structured learning and...
