SG2Caps: Revisiting Scene Graphs for Image Captioning

02/09/2021
by   Subarna Tripathi, et al.

The mainstream image captioning models rely on Convolutional Neural Network (CNN) image features with an additional attention to salient regions and objects to generate captions via recurrent models. Recently, scene graph representations of images have been used to augment captioning models so as to leverage their structural semantics, such as object entities, relationships and attributes. Several studies have noted that naive use of scene graphs from a black-box scene graph generator harms image captioning performance, and scene graph-based captioning models have to incur the overhead of explicit use of image features to generate decent captions. Addressing these challenges, we propose a framework, SG2Caps, that utilizes only the scene graph labels for competitive image captioning performance. The basic idea is to close the semantic gap between two scene graphs - one derived from the input image and the other one from its caption. In order to achieve this, we leverage the spatial location of objects and the Human-Object-Interaction (HOI) labels as an additional HOI graph. Our framework outperforms existing scene graph-only captioning models by a large margin (CIDEr score of 110 vs 71), indicating scene graphs as a promising representation for image captioning. Direct utilization of the scene graph labels avoids expensive graph convolutions over high-dimensional CNN features resulting in 49
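
As an illustration of the kind of input the abstract describes, the following is a minimal sketch of a label-only scene graph that also carries object locations and a separate set of HOI edges. It is not the authors' implementation: the class names (SceneObject, Relation, SceneGraph), the helper functions, and the choice to normalize boxes by image size are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneObject:
    """A detected entity: class label, pixel box, optional attribute labels."""
    label: str
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    attributes: List[str] = field(default_factory=list)


@dataclass
class Relation:
    """A directed, labeled edge between two objects, referenced by index."""
    subject: int
    predicate: str
    obj: int


@dataclass
class SceneGraph:
    """Hypothetical container: labels and boxes only, no CNN features."""
    width: int
    height: int
    objects: List[SceneObject]
    relations: List[Relation] = field(default_factory=list)
    hoi: List[Relation] = field(default_factory=list)  # human-object interactions


def normalized_box(graph: SceneGraph, index: int) -> Tuple[float, float, float, float]:
    """Scale a pixel box to [0, 1] so location is comparable across image sizes."""
    x1, y1, x2, y2 = graph.objects[index].box
    return (x1 / graph.width, y1 / graph.height,
            x2 / graph.width, y2 / graph.height)


def label_triples(graph: SceneGraph) -> List[Tuple[str, str, str]]:
    """Flatten relation and HOI edges into (subject, predicate, object) labels."""
    edges = graph.relations + graph.hoi
    return [(graph.objects[e.subject].label, e.predicate, graph.objects[e.obj].label)
            for e in edges]


# Toy example with made-up detections for a 640x480 image.
example = SceneGraph(
    width=640,
    height=480,
    objects=[
        SceneObject("person", (64, 48, 320, 432), ["young"]),
        SceneObject("horse", (160, 120, 600, 460), ["brown"]),
    ],
    relations=[Relation(0, "near", 1)],
    hoi=[Relation(0, "ride", 1)],
)
```

In this toy graph, label_triples(example) yields one generic relation and one HOI edge over the same pair of objects, which is the sort of extra interaction signal the abstract attributes to the HOI graph.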


research
09/25/2020

Are scene graphs good enough to improve Image Captioning?

Many top-performing image captioning models rely solely on object featur...
research
11/22/2019

TPsgtR: Neural-Symbolic Tensor Product Scene-Graph-Triplet Representation for Image Captioning

Image captioning can be improved if the structure of the graphical repre...
research
07/31/2017

Scene Graph Generation from Objects, Phrases and Region Captions

Object detection, scene graph generation and region captioning, which ar...
research
03/30/2016

Dense Image Representation with Spatial Pyramid VLAD Coding of CNN for Locally Robust Captioning

The workflow of extracting features from images using convolutional neur...
research
07/23/2020

Comprehensive Image Captioning via Scene Graph Decomposition

We address the challenging problem of image captioning by revisiting the...
research
11/21/2018

An Interpretable Model for Scene Graph Generation

We propose an efficient and interpretable scene graph generator. We cons...
research
02/07/2018

Generating Triples with Adversarial Networks for Scene Graph Construction

Driven by successes in deep learning, computer vision research has begun...
