Image Captioning in the Transformer Age

04/15/2022
by   Yang Xu, et al.

Image captioning (IC) has achieved astonishing advances by incorporating various techniques into the CNN-RNN encoder-decoder architecture. However, since CNNs and RNNs do not share a basic network component, such a heterogeneous pipeline is hard to train end-to-end, and the visual encoder learns nothing from the caption supervision. This drawback has inspired researchers to develop a homogeneous architecture that facilitates end-to-end training; the Transformer is an ideal candidate, having proven its huge potential in both the vision and language domains, and can therefore serve as the basic component of both the visual encoder and the language decoder in an IC pipeline. Meanwhile, self-supervised learning unleashes the power of the Transformer architecture: a large-scale pre-trained model can be generalized to various tasks, including IC. The success of these large-scale models may seem to diminish the importance of the single IC task. However, we demonstrate that IC retains its specific significance in this age by analyzing the connections between IC and several popular self-supervised learning paradigms. Due to the page limit, we refer only to highly important papers in this short survey; more related works can be found at https://github.com/SjokerLily/awesome-image-captioning.
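The "shared basic component" the abstract refers to is self-attention: the same operation processes both image patch embeddings in the encoder and word embeddings in the decoder, which is what makes the pipeline homogeneous and end-to-end trainable. As a minimal, framework-free sketch (pure Python on small lists, not the authors' implementation), scaled dot-product attention looks like this:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q, K, V are lists of vectors (lists of floats). Each query attends
    over all keys; the output for a query is the attention-weighted
    average of the value vectors. The same routine serves a visual
    encoder (queries/keys/values from patch embeddings) and a language
    decoder (from token embeddings), which is the homogeneity argument.
    """
    d = len(K[0])  # key dimension, used for the 1/sqrt(d) scaling
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out
```

For example, a query aligned with the first key attends mostly to the first value vector, so the output is a convex combination of the values weighted toward it. In a full Transformer IC model, stacks of this operation (plus feed-forward layers) form both the encoder and the decoder, with the decoder additionally cross-attending from caption tokens to the encoded image.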

