Better Understanding Hierarchical Visual Relationship for Image Caption

12/04/2019
by   Zheng-cong Fei, et al.
0

The Convolutional Neural Network (CNN) has been the dominant image feature extractor in computer vision for years. However, it fails to get the relationship between images/objects and their hierarchical interactions which can be helpful for representing and describing an image. In this paper, we propose a new design for image caption under a general encoder-decoder framework. It takes into account the hierarchical interactions between different abstraction levels of visual information in the images and their bounding-boxes. Specifically, we present CNN plus Graph Convolutional Network (GCN) architecture that novelly integrates both semantic and spatial visual relationships into image encoder. The representations of regions in an image and the connections between images are refined by leveraging graph structure through GCN. With the learned multi-level features, our model capitalizes on the Transformer-based decoder for description generation. We conduct experiments on the COCO image captioning dataset. Evaluations show that our proposed model outperforms the previous state-of-the-art models in the task of image caption, leading to a better performance in terms of all evaluation metrics.

READ FULL TEXT

page 2

page 8

research
09/19/2018

Exploring Visual Relationship for Image Captioning

It is always well believed that modeling relationships between objects w...
research
09/02/2019

HiCoRe: Visual Hierarchical Context-Reasoning

Reasoning about images/objects and their hierarchical interactions is a ...
research
09/09/2019

Hierarchy Parsing for Image Captioning

It is always well believed that parsing an image into constituent visual...
research
01/25/2022

Main Product Detection with Graph Networks for Fashion

Computer vision has established a foothold in the online fashion retail ...
research
08/05/2021

Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning

Existing image captioning methods just focus on understanding the relati...
research
06/06/2021

TabularNet: A Neural Network Architecture for Understanding Semantic Structures of Tabular Data

Tabular data are ubiquitous for the widespread applications of tables an...
research
07/27/2022

End-to-end Graph-constrained Vectorized Floorplan Generation with Panoptic Refinement

The automatic generation of floorplans given user inputs has great poten...

Please sign up or login with your details

Forgot password? Click here to reset