End-to-End 3D Dense Captioning with Vote2Cap-DETR

01/06/2023
by Sijin Chen, et al.

3D dense captioning aims to generate multiple captions localized with their associated object regions. Existing methods follow a sophisticated “detect-then-describe” pipeline equipped with numerous hand-crafted components. However, these hand-crafted components yield suboptimal performance given the cluttered spatial and class distributions of objects across different scenes. In this paper, we propose Vote2Cap-DETR, a simple yet effective transformer framework built on the recently popular DEtection TRansformer (DETR). Compared with prior art, our framework has several appealing advantages: 1) Without resorting to numerous hand-crafted components, our method is based on a full transformer encoder-decoder architecture with a learnable vote-query-driven object decoder and a caption decoder that produces the dense captions in a set-prediction manner. 2) In contrast to the two-stage scheme, our method performs detection and captioning in one stage. 3) Without bells and whistles, extensive experiments on two commonly used datasets, ScanRefer and Nr3D, demonstrate that Vote2Cap-DETR surpasses current state-of-the-art methods by 11.13% and 7.11% in CIDEr@0.5IoU, respectively. Code will be released soon.
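The CIDEr@0.5IoU figures above combine caption quality with localization quality: a caption only counts when its predicted box overlaps the ground-truth box by at least the stated IoU threshold. The sketch below illustrates that gating idea under stated assumptions and is not the authors' evaluation code. It assumes axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)`, assumes each ground-truth object has already been paired with one predicted box (or `None`), and assumes a per-object caption score is already available. The names `box_iou_3d` and `metric_at_iou` are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# Assumed box layout: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned 3D box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            # No overlap along this axis means no overlap at all.
            return 0.0
        inter *= hi - lo
    union = box_volume(a) + box_volume(b) - inter
    return inter / union


def metric_at_iou(
    caption_scores: Sequence[float],
    pred_boxes: Sequence[Optional[Box]],
    gt_boxes: Sequence[Box],
    threshold: float = 0.5,
) -> float:
    """Average caption score over ground-truth objects, where a score is
    kept only if the paired predicted box reaches the IoU threshold.
    Objects with no prediction (None) or low overlap contribute zero."""
    if not gt_boxes:
        return 0.0
    total = 0.0
    for score, pred, gt in zip(caption_scores, pred_boxes, gt_boxes):
        if pred is not None and box_iou_3d(pred, gt) >= threshold:
            total += score
    return total / len(gt_boxes)


if __name__ == "__main__":
    gt = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
    good = (0.0, 0.0, 0.0, 2.0, 2.0, 1.0)  # overlaps half of gt
    poor = (1.0, 1.0, 1.0, 3.0, 3.0, 3.0)  # overlaps only one corner
    print(metric_at_iou([1.0, 0.8], [good, poor], [gt, gt]))
```

Dividing by the number of ground-truth objects rather than the number of accepted boxes means that missed or poorly localized objects lower the score, so a model cannot improve the metric by describing only the objects it localizes well.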


research
08/17/2021

End-to-End Dense Video Captioning with Parallel Decoding

Dense video captioning aims to generate multiple associated captions wit...
research
09/06/2023

Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning

3D dense captioning requires a model to translate its understanding of a...
research
05/23/2021

End-to-End Video Object Detection with Spatial-Temporal Transformers

Recently, DETR and Deformable DETR have been proposed to eliminate the n...
research
04/22/2022

Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds

Dense captioning in 3D point clouds is an emerging vision-and-language t...
research
04/03/2018

End-to-End Dense Video Captioning with Masked Transformer

Dense video captioning aims to generate text descriptions for all events...
research
03/10/2022

MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes

3D dense captioning is a recently-proposed novel task, where point cloud...
research
07/18/2022

Temporal Lift Pooling for Continuous Sign Language Recognition

Pooling methods are necessities for modern neural networks for increasin...
