Hierarchy Parsing for Image Captioning

09/09/2019
by   Ting Yao, et al.
0

It is always well believed that parsing an image into constituent visual patterns would be helpful for understanding and representing an image. Nevertheless, there has not been evidence in support of the idea on describing an image with a natural-language utterance. In this paper, we introduce a new design to model a hierarchy from instance level (segmentation), region level (detection) to the whole image to delve into a thorough image understanding for captioning. Specifically, we present a HIerarchy Parsing (HIP) architecture that novelly integrates hierarchical structure into image encoder. Technically, an image decomposes into a set of regions and some of the regions are resolved into finer ones. Each region then regresses to an instance, i.e., foreground of the region. Such process naturally builds a hierarchal tree. A tree-structured Long Short-Term Memory (Tree-LSTM) network is then employed to interpret the hierarchal structure and enhance all the instance-level, region-level and image-level features. Our HIP is appealing in view that it is pluggable to any neural captioning models. Extensive experiments on COCO image captioning dataset demonstrate the superiority of HIP. More remarkably, HIP plus a top-down attention-based LSTM decoder increases CIDEr-D performance from 120.1 to 127.2 region-level features from HIP with semantic relation learnt through Graph Convolutional Networks (GCN), CIDEr-D is boosted up to 130.6

READ FULL TEXT

page 1

page 3

page 8

research
09/19/2018

Exploring Visual Relationship for Image Captioning

It is always well believed that modeling relationships between objects w...
research
11/05/2016

Boosting Image Captioning with Attributes

Automatically describing an image with a natural language has been an em...
research
05/06/2021

Exploring Explicit and Implicit Visual Relationships for Image Captioning

Image captioning is one of the most challenging tasks in AI, which aims ...
research
12/04/2019

Better Understanding Hierarchical Visual Relationship for Image Caption

The Convolutional Neural Network (CNN) has been the dominant image featu...
research
08/01/2019

Convolutional Auto-encoding of Sentence Topics for Image Paragraph Generation

Image paragraph generation is the task of producing a coherent story (us...
research
11/24/2017

Convolutional Image Captioning

Image captioning is an important but challenging task, applicable to vir...
research
10/09/2019

Semantic-aware Image Deblurring

Image deblurring has achieved exciting progress in recent years. However...

Please sign up or login with your details

Forgot password? Click here to reset