Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation

06/23/2023
by   Zihao Yue, et al.
0

Image captioning aims to describe visual content in natural language. As 'a picture is worth a thousand words', there could be various correct descriptions for an image. However, with maximum likelihood estimation as the training objective, the captioning model is penalized whenever its prediction mismatches with the label. For instance, when the model predicts a word expressing richer semantics than the label, it will be penalized and optimized to prefer more concise expressions, referred to as conciseness optimization. In contrast, predictions that are more concise than labels lead to richness optimization. Such conflicting optimization directions could eventually result in the model generating general descriptions. In this work, we introduce Semipermeable MaxImum Likelihood Estimation (SMILE), which allows richness optimization while blocking conciseness optimization, thus encouraging the model to generate longer captions with more details. Extensive experiments on two mainstream image captioning datasets MSCOCO and Flickr30K demonstrate that SMILE significantly enhances the descriptiveness of generated captions. We further provide in-depth investigations to facilitate a better understanding of how SMILE works.

READ FULL TEXT

page 1

page 9

research
12/01/2016

Improved Image Captioning via Policy Gradient optimization of SPIDEr

Current image captioning methods are usually trained via (penalized) max...
research
01/30/2018

Image Captioning at Will: A Versatile Scheme for Effectively Injecting Sentiments into Image Descriptions

Automatic image captioning has recently approached human-level performan...
research
08/06/2020

Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards

Generating accurate descriptions for online fashion items is important n...
research
06/07/2021

Counterfactual Maximum Likelihood Estimation for Training Deep Networks

Although deep learning models have driven state-of-the-art performance o...
research
03/14/2019

Dense Relational Captioning: Triple-Stream Networks for Relationship-Based Captioning

Our goal in this work is to train an image captioning model that generat...
research
05/19/2017

Softmax Q-Distribution Estimation for Structured Prediction: A Theoretical Interpretation for RAML

Reward augmented maximum likelihood (RAML), a simple and effective learn...
research
04/16/2021

Concadia: Tackling image accessibility with context

Images have become an integral part of online media. This has enhanced s...

Please sign up or login with your details

Forgot password? Click here to reset