Dense Captioning with Joint Inference and Visual Context

11/21/2016
by Linjie Yang, et al.

Dense captioning is an emerging computer vision task for understanding images through dense language descriptions. The goal is to densely detect visual concepts (e.g., objects, object parts, and interactions between them) in images, labeling each with a short descriptive phrase. We identify two key challenges that must be properly addressed when tackling this problem. First, the dense visual concept annotations in each image are associated with highly overlapping target regions, making accurate localization of each concept difficult. Second, the large number of visual concepts makes it hard to recognize each of them by appearance alone. We propose a new model pipeline based on two novel ideas, joint inference and context fusion, to alleviate these two challenges. We design our model architecture methodically and thoroughly evaluate architectural variations. Our final model, compact and efficient, achieves state-of-the-art accuracy on Visual Genome for dense captioning, a relative gain of 73% over the previous best algorithm. Qualitative experiments further reveal the semantic capabilities of our model in dense captioning.
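The abstract names context fusion without specifying the fusion operator (the paper evaluates several variants). As a minimal, hypothetical PyTorch sketch of the general idea, the code below concatenates a per-region feature with a global image-context feature and feeds the fused vector to an LSTM phrase decoder at every step. The names (ContextFusionCaptioner, region_feat, context_feat) and all dimensions are illustrative assumptions, not the paper's actual architecture.

import torch
import torch.nn as nn

class ContextFusionCaptioner(nn.Module):
    """Toy dense-captioning head: fuse a region feature with a global
    image-context feature, then decode a phrase with an LSTM.
    Concatenation is only one of several possible fusion operators."""

    def __init__(self, feat_dim=512, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Project the concatenated (region, context) pair back to feat_dim.
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim + feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, region_feat, context_feat, tokens):
        # region_feat, context_feat: (B, feat_dim); tokens: (B, T) word ids.
        fused = torch.relu(self.fuse(torch.cat([region_feat, context_feat], dim=1)))
        # Feed the fused visual feature to the decoder at every step.
        T = tokens.size(1)
        visual = fused.unsqueeze(1).expand(-1, T, -1)
        x = torch.cat([self.embed(tokens), visual], dim=2)
        h, _ = self.lstm(x)
        return self.out(h)  # (B, T, vocab_size) next-word logits

# Usage: score next-word logits for 4 regions and 6-token phrases.
model = ContextFusionCaptioner()
logits = model(torch.randn(4, 512), torch.randn(4, 512),
               torch.randint(0, 10000, (4, 6)))
print(logits.shape)  # torch.Size([4, 6, 10000])

A multiplicative or gated fusion could be swapped in for the concatenation above; the point is only that each region's phrase is decoded from a feature that mixes local appearance with global context.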


research  11/24/2015
DenseCap: Fully Convolutional Localization Networks for Dense Captioning
We introduce the dense captioning task, which requires a computer vision...

research  04/13/2022
Semantic-Aware Pretraining for Dense Video Captioning
This report describes the details of our approach for the event dense-ca...

research  04/25/2015
Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images
In this paper, we address the task of learning novel visual concepts, an...

research  12/01/2022
GRiT: A Generative Region-to-text Transformer for Object Understanding
This paper presents a Generative RegIon-to-Text transformer, GRiT, for o...

research  04/22/2022
Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds
Dense captioning in 3D point clouds is an emerging vision-and-language t...

research  11/14/2015
Oracle performance for visual captioning
The task of associating images and videos with a natural language descri...

research  05/03/2023
Visual Transformation Telling
In this paper, we propose a new visual reasoning task, called Visual Tra...
