DenseCap: Fully Convolutional Localization Networks for Dense Captioning

11/24/2015
by Justin Johnson, et al.

We introduce the dense captioning task, which requires a computer vision system to both localize and describe salient regions in images in natural language. The dense captioning task generalizes object detection when the descriptions consist of a single word, and image captioning when one predicted region covers the full image. To address the localization and description tasks jointly, we propose a Fully Convolutional Localization Network (FCLN) architecture that processes an image with a single, efficient forward pass, requires no external region proposals, and can be trained end-to-end with a single round of optimization. The architecture is composed of a Convolutional Network, a novel dense localization layer, and a Recurrent Neural Network language model that generates the label sequences. We evaluate our network on the Visual Genome dataset, which comprises 94,000 images and 4,100,000 region-grounded captions. We observe both speed and accuracy improvements over baselines based on current state-of-the-art approaches in both generation and retrieval settings.
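The three-stage pipeline described above (convolutional backbone, dense localization layer, recurrent language model) can be sketched as a toy forward pass. This is a minimal illustration with random weights and numpy only, not the paper's implementation: the real model uses a VGG-16 backbone, learned region proposals with bilinear interpolation, and a trained LSTM, all of which are replaced here by hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_features(image):
    # Stand-in for the convolutional backbone: downsample the input
    # image by 16x and emit C feature channels (the paper uses VGG-16).
    H, W, _ = image.shape
    C = 8
    return rng.standard_normal((H // 16, W // 16, C))

def localization_layer(feats, num_regions=3, out_size=4):
    # Toy dense localization layer: sample region boxes on the feature
    # grid and extract a fixed-size feature per region via nearest-
    # neighbour sampling (the paper uses bilinear interpolation,
    # which keeps this step differentiable).
    Hf, Wf, _ = feats.shape
    boxes, region_feats = [], []
    for _ in range(num_regions):
        y0, x0 = rng.integers(0, Hf - 1), rng.integers(0, Wf - 1)
        y1, x1 = rng.integers(y0 + 1, Hf), rng.integers(x0 + 1, Wf)
        boxes.append((y0, x0, y1, x1))
        ys = np.linspace(y0, y1, out_size).round().astype(int)
        xs = np.linspace(x0, x1, out_size).round().astype(int)
        region_feats.append(feats[np.ix_(ys, xs)])
    return np.array(boxes), np.stack(region_feats)

def rnn_caption(region_feat, vocab, max_len=5):
    # Stand-in language model: one recurrent matrix applied to the
    # pooled region code, with greedy argmax decoding over the vocab.
    code = region_feat.mean(axis=(0, 1))
    Wh = rng.standard_normal((code.size, code.size))
    Wo = rng.standard_normal((code.size, len(vocab)))
    h, words = np.tanh(code), []
    for _ in range(max_len):
        h = np.tanh(h @ Wh)
        words.append(vocab[int(np.argmax(h @ Wo))])
    return " ".join(words)

# Single forward pass: image -> features -> regions -> captions.
image = rng.standard_normal((128, 128, 3))
feats = conv_features(image)
boxes, region_feats = localization_layer(feats)
vocab = ["a", "cat", "dog", "on", "grass"]
captions = [rnn_caption(f, vocab) for f in region_feats]
print(boxes.shape, region_feats.shape, len(captions))
```

The point of the sketch is the data flow: every region shares the same backbone features, so localization and description cost a single convolutional pass rather than one pass per proposal.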


Related research

- Jointly Localizing and Describing Events for Dense Video Captioning (04/23/2018)
- Dense Captioning with Joint Inference and Visual Context (11/21/2016)
- Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries (04/12/2017)
- Context and Attribute Grounded Dense Captioning (04/02/2019)
- Deep Interactive Region Segmentation and Captioning (07/26/2017)
- CapDet: Unifying Dense Captioning and Open-World Detection Pretraining (03/04/2023)
- Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning (09/06/2023)
