Stack-Captioning: Coarse-to-Fine Learning for Image Captioning

by   Jiuxiang Gu, et al.

The existing image captioning approaches typically train a one-stage sentence decoder, which is difficult to generate rich fine-grained descriptions. On the other hand, multi-stage image caption model is hard to train due to the vanishing gradient problem. In this paper, we propose a coarse-to-fine multi-stage prediction framework for image captioning, composed of multiple decoders each of which operates on the output of the previous stage, producing increasingly refined image descriptions. Our proposed learning approach addresses the difficulty of vanishing gradients during training by providing a learning objective function that enforces intermediate supervisions. Particularly, we optimize our model with a reinforcement learning approach which utilizes the output of each intermediate decoder's test-time inference algorithm as well as the output of its preceding decoder to normalize the rewards, which simultaneously solves the well-known exposure bias problem and the loss-evaluation mismatch problem. We extensively evaluate the proposed approach on MSCOCO and show that our approach can achieve the state-of-the-art performance.


Improved Bengali Image Captioning via deep convolutional neural network based encoder-decoder model

Image Captioning is an arduous task of producing syntactically and seman...

Learning Compact Reward for Image Captioning

Adversarial learning has shown its advances in generating natural and di...

Self-critical Sequence Training for Image Captioning

Recently it has been shown that policy-gradient methods for reinforcemen...

Convolutional Pose Machines

Pose Machines provide a sequential prediction framework for learning ric...

Improving Image Captioning with Conditional Generative Adversarial Nets

In this paper, we propose a novel conditional generative adversarial net...

Aligning Visual Regions and Textual Concepts: Learning Fine-Grained Image Representations for Image Captioning

In image-grounded text generation, fine-grained representations of the i...

Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards

Generating accurate descriptions for online fashion items is important n...