Variational Transformer: A Framework Beyond the Trade-off between Accuracy and Diversity for Image Captioning

05/28/2022
by Longzhen Yang, et al.

Accuracy and diversity are two essential, measurable properties of natural and semantically correct captions. Many efforts have enhanced one of them at the expense of the other because of the trade-off between them. However, such a compromise does not constitute progress: reduced diversity makes the captioner a repeater, and reduced accuracy makes it a fake advisor. In this work, we present a novel Variational Transformer framework that improves accuracy and diversity simultaneously. To ensure accuracy, we introduce the "Invisible Information Prior" along with the "Auto-selectable GMM" to instruct the encoder to learn the precise language information and object relations in different scenes. To ensure diversity, we propose the "Range-Median Reward" baseline to retain more diverse candidates with higher rewards during the RL-based training process. Experiments show that our method improves accuracy (CIDEr) and diversity (self-CIDEr) simultaneously, by up to 1.1 and 4.8 percent respectively, compared with the baseline. Our method also outperforms others under a newly proposed measure of the trade-off gap, by at least 3.55 percent.
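
The abstract does not define the "Range-Median Reward" baseline, so the following is only a rough sketch of the general idea that the choice of baseline in RL-based caption training decides which sampled candidates receive a positive learning signal. The function names, the reward values, and the three baselines shown (mean, median, mid-range) are illustrative assumptions and are not the paper's formulation.

```python
from statistics import median

# Illustrative only: these baselines are generic choices, NOT the paper's
# "Range-Median Reward" definition, which the abstract does not specify.

def mean_baseline(rewards):
    """Average reward over the sampled captions."""
    return sum(rewards) / len(rewards)

def median_baseline(rewards):
    """Median reward over the sampled captions."""
    return median(rewards)

def midrange_baseline(rewards):
    """Midpoint between the lowest and highest sampled reward."""
    return (min(rewards) + max(rewards)) / 2

def advantages(rewards, baseline):
    """Policy-gradient advantage: reward minus baseline, per caption."""
    return [r - baseline for r in rewards]

def count_reinforced(rewards, baseline_fn):
    """Number of sampled captions that get a positive advantage."""
    b = baseline_fn(rewards)
    return sum(1 for a in advantages(rewards, b) if a > 0)

# Hypothetical CIDEr-style rewards for four sampled captions of one image.
rewards = [0.2, 0.4, 0.5, 1.3]

for fn in (mean_baseline, median_baseline, midrange_baseline):
    print(fn.__name__, count_reinforced(rewards, fn))
```

With a skewed reward distribution like this one, a median-style baseline leaves more candidates with a positive advantage than the mean does, which is one plausible way a baseline could retain more diverse candidates during training.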

research
03/28/2019

Describing like humans: on diversity in image captioning

Recently, the state-of-the-art models for image captioning have overtake...
research
08/02/2023

ADS-Cap: A Framework for Accurate and Diverse Stylized Captioning with Unpaired Stylistic Corpora

Generating visually grounded image captions with specific linguistic sty...
research
02/27/2020

Analysis of diversity-accuracy tradeoff in image captioning

We investigate the effect of different model architectures, training obj...
research
08/14/2019

Towards Diverse and Accurate Image Captions via Reinforcing Determinantal Point Process

Although significant progress has been made in the field of automatic im...
research
01/04/2022

Variational Stacked Local Attention Networks for Diverse Video Captioning

While describing Spatio-temporal events in natural language, video capti...
research
02/22/2023

Feasible Recourse Plan via Diverse Interpolation

Explaining algorithmic decisions and recommending actionable feedback is...
