Context-Aware Visual Policy Network for Sequence-Level Image Captioning

08/16/2018
by Daqing Liu, et al.

Many vision-language tasks can be reduced to the problem of sequence prediction for natural language output. In particular, recent advances in image captioning use deep reinforcement learning (RL) to alleviate the "exposure bias" of training: the ground-truth subsequence is exposed at every prediction step, which introduces a bias at test time, when only the predicted subsequence is available. However, existing RL-based image captioning methods focus only on the language policy and not the visual policy (e.g., visual attention), and thus fail to capture the visual context that is crucial for compositional reasoning such as visual relationships (e.g., "man riding horse") and comparisons (e.g., "smaller cat"). To fill this gap, we propose a Context-Aware Visual Policy network (CAVP) for sequence-level image captioning. At every time step, CAVP explicitly accounts for the previous visual attentions as the context, and then decides whether that context is helpful for generating the current word given the current visual attention. Compared with traditional visual attention, which fixates on a single image region at each step, CAVP can attend to complex visual compositions over time. The whole image captioning model --- CAVP and its subsequent language policy network --- can be efficiently optimized end-to-end using an actor-critic policy gradient method with respect to any caption evaluation metric. We demonstrate the effectiveness of CAVP with state-of-the-art performance on the MS-COCO offline split and online server, using various metrics and sensible visualizations of the qualitative visual context. The code is available at https://github.com/daqingliu/CAVP
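The per-step decision the abstract describes — attend to the current image regions, attend separately over the accumulated past attentions, then decide how much of that context to use — can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the authors' implementation: the function names, the additive-style scoring, and the sigmoid gate are all assumptions made for clarity.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, features, W):
    """Score each feature row against the query via a bilinear form
    and return the attention-weighted sum of the features."""
    scores = features @ W @ query        # one score per region/context entry
    alpha = softmax(scores)              # attention weights, sum to 1
    return alpha @ features              # weighted combination, shape (feat_dim,)

def cavp_step(h_t, region_feats, prev_attentions, Wv, Wc, wg):
    """One context-aware decoding step (illustrative):
    1) attend over raw region features -> current visual attention v_t;
    2) attend over previously attended vectors -> visual context c_t;
    3) a scalar sigmoid gate decides how much context to mix in."""
    v_t = attend(h_t, region_feats, Wv)
    if prev_attentions:
        c_t = attend(h_t, np.stack(prev_attentions), Wc)
    else:
        c_t = np.zeros_like(v_t)         # no context at the first step
    g = 1.0 / (1.0 + np.exp(-(wg @ np.concatenate([v_t, c_t]))))
    out = v_t + g * c_t                  # gated fusion of attention and context
    prev_attentions.append(v_t)          # grow the context memory over time
    return out, prev_attentions
```

Calling `cavp_step` once per generated word threads the list of past attentions through the decoding loop, which is what lets later steps reason over compositions (e.g., both the "man" and the "horse" regions) rather than a single fixation.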

Related research

- Context-Aware Visual Policy Network for Fine-Grained Image Captioning (06/06/2019)
- Image Captioning with Context-Aware Auxiliary Guidance (12/10/2020)
- simNet: Stepwise Image-Topic Merging Network for Generating Detailed and Comprehensive Image Captions (08/27/2018)
- ExpansionNet v2: Block Static Expansion in fast end to end training for Image Captioning (08/13/2022)
- Improving Reinforcement Learning Based Image Captioning with Natural Language Prior (09/13/2018)
- Self-critical n-step Training for Image Captioning (04/15/2019)
- Off-Policy Self-Critical Training for Transformer in Visual Paragraph Generation (06/21/2020)
