Cooperative image captioning

07/26/2019
by   Gilad Vered, et al.
0

When describing images with natural language, the descriptions can be made more informative if tuned using downstream tasks. This is often achieved by training two networks: a "speaker network" that generates sentences given an image, and a "listener network" that uses them to perform a task. Unfortunately, training multiple networks jointly to communicate to achieve a joint task, faces two major challenges. First, the descriptions generated by a speaker network are discrete and stochastic, making optimization very hard and inefficient. Second, joint training usually causes the vocabulary used during communication to drift and diverge from natural language. We describe an approach that addresses both challenges. We first develop a new effective optimization based on partial-sampling from a multinomial distribution combined with straight-through gradient updates, which we name PSST for Partial-Sampling Straight-Through. Second, we show that the generated descriptions can be kept close to natural by constraining them to be similar to human descriptions. Together, this approach creates descriptions that are both more discriminative and more natural than previous approaches. Evaluations on the standard COCO benchmark show that PSST Multinomial dramatically improve the recall@10 from 60 human evaluations show that it also increases naturalness while keeping the discriminative power of generated captions.

READ FULL TEXT

page 12

page 19

research
11/17/2020

Structural and Functional Decomposition for Personality Image Captioning in a Communication Game

Personality image captioning (PIC) aims to describe an image with a natu...
research
06/26/2023

Semi-Supervised Image Captioning with CLIP

Image captioning, a fundamental task in vision-language understanding, s...
research
12/21/2020

Alleviating Noisy Data in Image Captioning with Cooperative Distillation

Image captioning systems have made substantial progress, largely due to ...
research
04/04/2023

Cross-Domain Image Captioning with Discriminative Finetuning

Neural captioners are typically trained to mimic human-generated referen...
research
01/11/2017

Context-aware Captions from Context-agnostic Supervision

We introduce an inference technique to produce discriminative context-aw...
research
04/15/2021

Multitasking Inhibits Semantic Drift

When intelligent agents communicate to accomplish shared goals, how do t...
research
10/08/2021

Accessible Visualization via Natural Language Descriptions: A Four-Level Model of Semantic Content

Natural language descriptions sometimes accompany visualizations to bett...

Please sign up or login with your details

Forgot password? Click here to reset