Exploring Explicit and Implicit Visual Relationships for Image Captioning

05/06/2021
by   Zeliang Song, et al.
0

Image captioning is one of the most challenging tasks in AI, which aims to automatically generate textual sentences for an image. Recent methods for image captioning follow encoder-decoder framework that transforms the sequence of salient regions in an image into natural language descriptions. However, these models usually lack the comprehensive understanding of the contextual interactions reflected on various visual relationships between objects. In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning. Explicitly, we build semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate local neighbors' information. Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers (Region BERT) without extra relational annotations. To evaluate the effectiveness and superiority of our proposed method, we conduct extensive experiments on Microsoft COCO benchmark and achieve remarkable improvements compared with strong baselines.

READ FULL TEXT

page 1

page 2

page 3

page 4

page 5

page 6

research
09/19/2018

Exploring Visual Relationship for Image Captioning

It is always well believed that modeling relationships between objects w...
research
06/04/2019

Relational Reasoning using Prior Knowledge for Visual Captioning

Exploiting relationships among objects has achieved remarkable progress ...
research
06/14/2019

Image Captioning: Transforming Objects into Words

Image captioning models typically follow an encoder-decoder architecture...
research
09/09/2019

Hierarchy Parsing for Image Captioning

It is always well believed that parsing an image into constituent visual...
research
08/05/2021

Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning

Existing image captioning methods just focus on understanding the relati...
research
08/06/2019

Aligning Linguistic Words and Visual Semantic Units for Image Captioning

Image captioning attempts to generate a sentence composed of several lin...
research
02/28/2020

Exploring and Distilling Cross-Modal Information for Image Captioning

Recently, attention-based encoder-decoder models have been used extensiv...

Please sign up or login with your details

Forgot password? Click here to reset