Learning to Collocate Neural Modules for Image Captioning

04/18/2019
by Xu Yang, et al.

We do not speak word by word from scratch; our brain first sketches a pattern such as "sth do sth at someplace" and then fills in the detailed descriptions. To endow existing encoder-decoder image captioners with such human-like reasoning, we propose a novel framework, learning to Collocate Neural Modules (CNM), which generates the "inner pattern" connecting the visual encoder and the language decoder. Unlike the widely used neural module networks in visual Q&A, where the language (i.e., the question) is fully observable, CNM for captioning is more challenging because the language is being generated and is thus only partially observable. To this end, we make the following technical contributions to CNM training: 1) a compact module design: one module for function words and three for visual content words (e.g., nouns, adjectives, and verbs); 2) soft module fusion and multi-step module execution, which robustify visual reasoning under partial observation; and 3) a linguistic loss that keeps the module controller faithful to part-of-speech collocations (e.g., an adjective precedes a noun). Extensive experiments on the challenging MS-COCO image captioning benchmark validate the effectiveness of our CNM image captioner. In particular, CNM achieves a new state-of-the-art 127.9 CIDEr-D on the Karpathy split and a single-model 126.0 c40 on the official server. CNM is also robust to scarce training data: e.g., when trained with only one sentence per image, CNM halves the performance loss compared to a strong baseline.
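To make the second and third contributions concrete, the PyTorch sketch below shows, under our own simplifying assumptions, how soft module fusion and the linguistic loss could fit together: a controller predicts a soft distribution over four modules, the module outputs are fused by those weights, and the weights are supervised with the part-of-speech class of the current word. The names here (SoftModuleFusion, linguistic_loss, hidden_dim, the stand-in MLP modules) are illustrative assumptions, not the authors' released implementation, whose content modules additionally attend over visual features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftModuleFusion(nn.Module):
    """Run several neural modules in parallel and fuse their outputs with
    soft, controller-predicted weights, keeping module selection
    differentiable while the sentence is only partially generated."""

    def __init__(self, hidden_dim: int, num_modules: int = 4):
        super().__init__()
        # One sub-network per module. Identical small MLPs stand in for
        # the paper's specialised function/noun/adjective/verb modules.
        self.neural_modules = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
            for _ in range(num_modules)
        ])
        # The controller predicts a soft distribution over modules from
        # the decoder's current hidden state.
        self.controller = nn.Linear(hidden_dim, num_modules)

    def forward(self, h: torch.Tensor):
        weights = F.softmax(self.controller(h), dim=-1)               # (B, M)
        outputs = torch.stack(
            [m(h) for m in self.neural_modules], dim=1)               # (B, M, D)
        fused = (weights.unsqueeze(-1) * outputs).sum(dim=1)          # (B, D)
        return fused, weights


def linguistic_loss(weights: torch.Tensor, pos_tags: torch.Tensor) -> torch.Tensor:
    """Penalise controller weights that disagree with the part-of-speech
    class of the ground-truth word (0=function, 1=noun, 2=adj, 3=verb)."""
    return F.nll_loss(torch.log(weights + 1e-8), pos_tags)


# Toy usage: fuse module outputs for a batch of 8 decoding steps.
fuser = SoftModuleFusion(hidden_dim=512)
h = torch.randn(8, 512)                        # decoder hidden states
fused, w = fuser(h)
loss = linguistic_loss(w, torch.randint(0, 4, (8,)))
```

The soft (weighted) fusion, rather than a hard argmax selection, is what keeps the controller trainable end-to-end: while the caption is only partially generated, the correct module is uncertain, so committing to a single module would make the choice non-differentiable and brittle.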


