Exploring Discrete Diffusion Models for Image Captioning

11/21/2022
by   Zixin Zhu, et al.

Image captioning is typically realized by an auto-regressive method that decodes text tokens one by one. We present a diffusion-based captioning model, dubbed DDCap, to allow more decoding flexibility. Unlike image generation, where the output is continuous and redundant with a fixed length, the text in image captions is categorical and short, with varied lengths. Therefore, naively applying the discrete diffusion model to text decoding does not work well, as shown in our experiments. To address the performance gap, we propose several key techniques, including best-first inference, a concentrated attention mask, text length prediction, and image-free training. On COCO without additional caption pre-training, DDCap achieves a CIDEr score of 117.8, which is 5.0 higher than the auto-regressive baseline with the same architecture in the controlled setting. On a caption infilling task, it also achieves a CIDEr score 26.8 higher than the auto-regressive baseline (230.3 vs. 203.5). With 4M vision-language pre-training images and the base-sized model, we reach a CIDEr score of 125.1 on COCO, which is competitive with the best well-developed auto-regressive frameworks. The code is available at https://github.com/buxiangzhiren/DDCap.
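
The abstract names best-first inference without describing it. The following is a minimal sketch of one plausible reading: at each denoising step, the most confident predictions among the still-masked positions are committed and left untouched afterwards. It is an illustration under stated assumptions, not the DDCap implementation. The names `best_first_decode`, `toy_predict`, and `MASK`, the even per-step commit schedule, and the toy confidences are all hypothetical. The caption length is passed in directly here, whereas the abstract describes a separate text length prediction step.

```python
import math
from typing import Callable, List, Tuple

MASK = "[MASK]"  # hypothetical placeholder for a not-yet-decoded position

# A predictor maps the current partially decoded caption to one
# (token, confidence) pair per position. In a real system this would be
# the image-conditioned denoising network; here it is only an interface.
Predictor = Callable[[List[str]], List[Tuple[str, float]]]


def best_first_decode(
    predict: Predictor, length: int, steps: int
) -> Tuple[List[str], List[int]]:
    """Confidence-ordered decoding sketch (assumed schedule, not the paper's).

    Returns the decoded tokens and the order in which positions were fixed.
    """
    tokens = [MASK] * length
    fixed = [False] * length
    order: List[int] = []
    for step in range(steps):
        remaining = [i for i in range(length) if not fixed[i]]
        if not remaining:
            break
        preds = predict(tokens)
        # Assumption: spread the remaining positions evenly over the
        # remaining steps, so the final step commits everything left.
        k = math.ceil(len(remaining) / (steps - step))
        # Commit the k most confident positions; they are never revisited.
        best = sorted(remaining, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in best:
            tokens[i] = preds[i][0]
            fixed[i] = True
            order.append(i)
    return tokens, order


def toy_predict(tokens: List[str]) -> List[Tuple[str, float]]:
    """Toy stand-in that ignores its input and returns fixed guesses."""
    target = ["a", "dog", "on", "grass"]
    confidence = [0.9, 0.6, 0.8, 0.7]
    return list(zip(target, confidence))


tokens, order = best_first_decode(toy_predict, length=4, steps=2)
print(tokens, order)
```

With the toy predictor, the first step fixes the two most confident positions and the second step fills in the rest, so decoding order follows model confidence instead of left-to-right position. That ordering freedom is the kind of decoding flexibility the abstract contrasts with auto-regressive decoding.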


