Universal Captioner: Inducing Content-Style Separation in Vision-and-Language Model Training

11/24/2021
by   Marcella Cornia, et al.

While captioning models have obtained compelling results in describing natural images, there is a growing effort to increase their capability of handling real-world concepts. In this paper, we address the task of generating fluent descriptions by training on a non-uniform combination of data sources, containing both human-annotated and automatically collected captions. To this end, we propose a model which induces a separation between content and descriptive style by incorporating stylistic parameters and keywords extracted from large-scale multi-modal models as pivotal data. In terms of visual features, our model avoids the need for object detectors and employs grid-like features together with a single prompt language modeling objective. Experimentally, we consistently outperform existing methods in terms of caption quality and capability of describing out-of-domain concepts. Finally, our model obtains a new state of the art on both COCO and nocaps.
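The content-style separation described above can be sketched as a prompt-construction step: a style indicator (e.g., human-annotated vs. web-collected source) and keywords extracted by a multi-modal model are prepended to the target caption, and the full sequence is trained with a single language-modeling objective. The following is a minimal, hypothetical illustration; the token format and names are assumptions, not the paper's actual scheme.

```python
# Hypothetical sketch: build a training sequence that separates content
# (keywords) from descriptive style (a style tag). The special tokens
# <style:...>, <kw>, and <cap> are illustrative placeholders.

def build_prompt(style: str, keywords: list[str], caption: str) -> str:
    """Concatenate a style tag, keyword pivots, and the target caption."""
    if style not in {"human", "web"}:
        raise ValueError("style must be 'human' or 'web'")
    kw = " ".join(keywords)
    return f"<style:{style}> <kw> {kw} </kw> <cap> {caption} </cap>"

# Example: keywords could come from a large-scale multi-modal model
# (e.g., retrieved tags for the image); here they are hard-coded.
print(build_prompt("human", ["dog", "frisbee"], "a dog catching a frisbee"))
# → <style:human> <kw> dog frisbee </kw> <cap> a dog catching a frisbee </cap>
```

At inference time, fixing the style tag to the human-annotated setting would steer generation toward fluent, human-like descriptions while still exploiting content learned from web-scale data.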


