Efficient Modeling of Future Context for Image Captioning

07/22/2022
by   Zhengcong Fei, et al.

Existing approaches to image captioning usually generate the sentence word by word from left to right, conditioned only on local context: the given image and the previously generated words. Many studies have attempted to exploit global information during decoding, e.g., via iterative refinement, yet how to incorporate future context both effectively and efficiently remains under-explored. To address this issue, and inspired by the observation that Non-Autoregressive Image Captioning (NAIC) can leverage bidirectional relations through a modified mask operation, we aim to graft this advantage onto the conventional Autoregressive Image Captioning (AIC) model while preserving inference efficiency, i.e., without extra time cost. Specifically, the AIC and NAIC models are first trained jointly with a shared visual encoder, forcing the encoder to carry sufficient and valid future context; the AIC model is then encouraged to capture the causal dynamics of cross-layer interaction from the NAIC model on its unconfident words, following a teacher-student paradigm optimized with a distribution-calibration training objective. Empirical evidence demonstrates that our proposed approach clearly surpasses state-of-the-art baselines in both automatic metrics and human evaluations on the MS COCO benchmark. The source code is available at: https://github.com/feizc/Future-Caption.
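
The two training stages summarized above can be sketched in a few lines of PyTorch. The sketch below is only an illustration of the idea, not the released implementation: the module names (visual_encoder, ar_decoder, nar_decoder), the confidence threshold, and the KL-based calibration term are assumptions made for concreteness; see the linked repository for the authors' actual code.

    import torch
    import torch.nn.functional as F

    def joint_training_step(visual_encoder, ar_decoder, nar_decoder, images, captions):
        """Stage 1 (assumed form): train the AIC and NAIC decoders with a shared
        visual encoder, pushing the encoder to carry future (right-side) context."""
        feats = visual_encoder(images)               # shared visual features
        ar_logits = ar_decoder(feats, captions)      # left-to-right (causal) decoding
        nar_logits = nar_decoder(feats, captions)    # masked, bidirectional decoding
        loss_ar = F.cross_entropy(ar_logits.transpose(1, 2), captions)
        loss_nar = F.cross_entropy(nar_logits.transpose(1, 2), captions)
        return loss_ar + loss_nar

    def calibration_step(visual_encoder, ar_decoder, nar_decoder, images, captions,
                         threshold=0.5, temperature=1.0):
        """Stage 2 (assumed form): on words where the AIC student is unconfident,
        pull its output distribution toward the NAIC teacher's distribution."""
        with torch.no_grad():
            feats = visual_encoder(images)
            teacher_prob = F.softmax(nar_decoder(feats, captions) / temperature, dim=-1)
        student_logits = ar_decoder(feats, captions)
        student_logp = F.log_softmax(student_logits / temperature, dim=-1)

        # Confidence = probability the student assigns to the reference word.
        conf = student_logp.exp().gather(-1, captions.unsqueeze(-1)).squeeze(-1)
        unconfident = (conf < threshold).float()     # 1 where teacher guidance applies

        kl = F.kl_div(student_logp, teacher_prob, reduction="none").sum(-1)
        ce = F.cross_entropy(student_logits.transpose(1, 2), captions, reduction="none")
        return (ce + unconfident * kl).mean()

In this reading, stage 1 shapes the shared encoder so its features already reflect both-side context, and stage 2 transfers that future-aware signal into the autoregressive decoder only where its own predictions are weak, so inference remains a standard left-to-right pass with no extra cost.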
