ReFormer: The Relational Transformer for Image Captioning

07/29/2021
by Xuewen Yang, et al.

Image captioning has been shown to achieve better performance when scene graphs are used to represent the relations between objects in the image. Current captioning encoders generally use a Graph Convolutional Net (GCN) to represent the relation information and merge it with the object region features via concatenation or convolution to obtain the final input for sentence decoding. However, the GCN-based encoders in existing methods are less effective for captioning for two reasons. First, using image captioning as the objective (i.e., Maximum Likelihood Estimation) rather than a relation-centric loss cannot fully exploit the potential of the encoder. Second, using a pre-trained model instead of the encoder itself to extract the relationships is inflexible and does not contribute to the explainability of the model. To improve the quality of image captioning, we propose a novel architecture, ReFormer, a RElational transFORMER that generates features with relation information embedded and explicitly expresses the pair-wise relationships between objects in the image. ReFormer incorporates the objective of scene graph generation with that of image captioning using a single modified Transformer model. This design allows ReFormer to generate not only better image captions, with the benefit of extracting strong relational image features, but also scene graphs that explicitly describe the pair-wise relationships. Experiments on publicly available datasets show that our model significantly outperforms state-of-the-art methods on image captioning and scene graph generation.
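The abstract describes ReFormer as coupling a caption-generation objective with a scene-graph-generation objective inside one modified Transformer. The sketch below is a minimal, hedged illustration of that idea in PyTorch, not the paper's implementation: a shared encoder over object region features feeds both a caption decoder (trained with token-level MLE) and a pair-wise relation head, and the two losses are summed. All class names, tensor shapes, hyperparameters, and the loss weight alpha are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of a joint captioning + scene-graph
# objective over a shared Transformer encoder. Names, sizes, and the loss
# weight are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationalTransformerSketch(nn.Module):
    def __init__(self, feat_dim=2048, d_model=512, vocab_size=10000,
                 num_relations=51, num_layers=3, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True),
            num_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True),
            num_layers)
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.word_out = nn.Linear(d_model, vocab_size)        # caption head
        self.rel_out = nn.Linear(2 * d_model, num_relations)  # relation head

    def forward(self, regions, captions):
        # regions: (B, N, feat_dim) object region features
        # captions: (B, T) token ids used for teacher forcing
        mem = self.encoder(self.proj(regions))  # relation-aware region features
        T = captions.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf"),
                                       device=captions.device), diagonal=1)
        dec = self.decoder(self.word_emb(captions), mem, tgt_mask=causal)
        word_logits = self.word_out(dec)        # (B, T, vocab_size)

        # Pair-wise predicate logits for every ordered region pair (i, j).
        B, N, D = mem.shape
        subj = mem.unsqueeze(2).expand(B, N, N, D)
        obj = mem.unsqueeze(1).expand(B, N, N, D)
        rel_logits = self.rel_out(torch.cat([subj, obj], dim=-1))  # (B, N, N, num_relations)
        return word_logits, rel_logits


def joint_loss(word_logits, caption_targets, rel_logits, rel_targets, alpha=1.0):
    """Caption MLE plus a relation-centric loss; alpha is a guessed weight."""
    cap_loss = F.cross_entropy(word_logits.flatten(0, 1), caption_targets.flatten())
    rel_loss = F.cross_entropy(rel_logits.flatten(0, 2), rel_targets.flatten())
    return cap_loss + alpha * rel_loss
```

Under this setup the same encoder is optimized by both losses, which is the abstract's stated mechanism for making region features relation-aware; the actual ReFormer architecture and loss weighting may differ.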

Related research

09/25/2020: Are scene graphs good enough to improve Image Captioning?
Many top-performing image captioning models rely solely on object featur...

09/19/2018: Exploring Visual Relationship for Image Captioning
It is always well believed that modeling relationships between objects w...

12/02/2021: Object-Centric Unsupervised Image Captioning
Training an image captioning model in an unsupervised manner without uti...

08/05/2021: Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning
Existing image captioning methods just focus on understanding the relati...

07/23/2020: Comprehensive Image Captioning via Scene Graph Decomposition
We address the challenging problem of image captioning by revisiting the...

11/22/2019: TPsgtR: Neural-Symbolic Tensor Product Scene-Graph-Triplet Representation for Image Captioning
Image captioning can be improved if the structure of the graphical repre...
