End-to-End Transformer Based Model for Image Captioning

03/29/2022
by Yiyu Wang, et al.

CNN-LSTM based architectures have played an important role in image captioning, but, limited in training efficiency and expressive ability, researchers have turned to CNN-Transformer based models and achieved great success. Meanwhile, almost all recent works adopt Faster R-CNN as the backbone encoder to extract region-level features from given images. However, Faster R-CNN requires pre-training on an additional dataset, which splits the image captioning task into two stages and limits its potential applications. In this paper, we build a pure Transformer-based model, which integrates image captioning into one stage and realizes end-to-end training. First, we adopt SwinTransformer in place of Faster R-CNN as the backbone encoder to extract grid-level features from given images. Then, following the Transformer architecture, we build a refining encoder and a decoder: the refining encoder refines the grid features by capturing the intra-relationships between them, and the decoder decodes the refined features into captions word by word. Furthermore, to increase the interaction between multi-modal (vision and language) features and enhance modeling capability, we compute the mean pooling of the grid features as a global feature, introduce it into the refining encoder to be refined together with the grid features, and add a pre-fusion process between the refined global feature and the generated words in the decoder. To validate the effectiveness of our proposed model, we conduct experiments on the MSCOCO dataset. Compared to existing published works, our model achieves new state-of-the-art performance: a CIDEr score of 138.2 (ensemble of 4 models) on the `Karpathy' offline test split, and 136.0 (c5) and 138.3 (c40) on the official online test server. Trained models and source code will be released.
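The global-feature and pre-fusion ideas from the abstract can be illustrated with a minimal numpy sketch. This is not the authors' implementation; the feature sizes, the single-head attention without learned projections, and the additive pre-fusion are simplifying assumptions made here for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Single-head scaled dot-product self-attention with learned projections
    # omitted (Q = K = V = tokens), capturing intra-relationships among tokens.
    d_k = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d_k)
    return softmax(scores) @ tokens

# Hypothetical sizes: a 12x12 grid of 768-d features from the backbone encoder.
grid = np.random.randn(144, 768)

# Global feature = mean pooling of the grid features.
g = grid.mean(axis=0, keepdims=True)                        # (1, 768)

# Refining encoder: refine the global and grid features together.
refined = self_attention(np.concatenate([g, grid], axis=0)) # (145, 768)
refined_global, refined_grid = refined[:1], refined[1:]

# Decoder pre-fusion: combine the refined global feature with each generated
# word embedding before the decoder cross-attends to the grid features.
words = np.random.randn(10, 768)       # embeddings of already-generated words
fused_words = words + refined_global   # broadcast add over the word axis
```

In a full model each step would use learned projections and multiple layers; the sketch only shows where the global feature enters the two stages.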

Related research

- 09/11/2021: Bornon: Bengali Image Captioning with Transformer-based Deep learning approach
  Image captioning using Encoder-Decoder based approach where CNN is used ...

- 01/26/2021: CPTR: Full Transformer Network for Image Captioning
  In this paper, we consider the image captioning task from a new sequence...

- 07/20/2022: GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features
  Current state-of-the-art methods for image captioning employ region-base...

- 12/09/2021: Injecting Semantic Concepts into End-to-End Image Captioning
  Tremendous progress has been made in recent years in developing better i...

- 01/06/2022: Compact Bidirectional Transformer for Image Captioning
  Most current image captioning models typically generate captions from le...

- 02/13/2023: Towards Local Visual Modeling for Image Captioning
  In this paper, we study the local visual modeling with grid features for...

- 04/15/2022: Image Captioning In the Transformer Age
  Image Captioning (IC) has achieved astonishing developments by incorpora...
