MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes

03/10/2022
by Yang Jiao, et al.

3D dense captioning is a recently proposed task in which point clouds carry more geometric information than their 2D counterparts. It is also more challenging, however, due to the higher complexity and wider variety of inter-object relations. Existing methods treat such relations only as by-products of object feature learning in graphs, without encoding them explicitly, which leads to sub-optimal results. In this paper, aiming to improve 3D dense captioning by capturing and utilizing the complex relations in a 3D scene, we propose MORE, a Multi-Order RElation mining model, to support generating more descriptive and comprehensive captions. Technically, MORE encodes object relations in a progressive manner, since complex relations can be deduced from a limited number of basic ones. We first devise a novel Spatial Layout Graph Convolution (SLGC), which semantically encodes several first-order relations as edges of a graph constructed over 3D object proposals. Next, from the resulting graph, we extract multiple triplets that encapsulate basic first-order relations as the basic unit, and construct several Object-centric Triplet Attention Graphs (OTAG) to infer multi-order relations for every target object. The updated node features from OTAG are aggregated and fed into the caption decoder to provide abundant relational cues, so that captions covering diverse relations with context objects can be generated. Extensive experiments on the Scan2Cap dataset demonstrate the effectiveness of MORE and its components, and show that it outperforms the current state-of-the-art method.
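To make the object-centric triplet idea concrete, here is a minimal sketch of attention over (target, relation, context) triplets. It is not the paper's implementation: the feature shapes, the mean fusion of subject/relation/object features, and the dot-product scoring are all illustrative assumptions; the pairwise relation embeddings stand in for edges produced by a spatial-layout graph such as SLGC.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def object_centric_triplet_attention(feats, rel_emb, target):
    """Aggregate (target, relation, other) triplets for one target object.

    feats:   (N, D) object proposal features (hypothetical shapes)
    rel_emb: (N, N, D) pairwise relation embeddings, e.g. from a
             spatial-layout graph over the proposals (assumption)
    target:  index of the object being described
    """
    N, D = feats.shape
    triplets = []
    for j in range(N):
        if j == target:
            continue
        # triplet feature: subject + relation + object, mean-fused
        # (a simplification; any learned fusion could be used here)
        triplets.append((feats[target] + rel_emb[target, j] + feats[j]) / 3.0)
    T = np.stack(triplets)                      # (N-1, D)
    scores = T @ feats[target] / np.sqrt(D)     # target feature as the query
    attn = softmax(scores)                      # weight each triplet
    return feats[target] + attn @ T             # residual aggregation
```

The attended output can then be fed to a caption decoder as a relation-aware feature for the target object; in the full model, several such graphs would be combined to reach higher-order relations.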

Related research:

- 10/08/2020 - Dense Relational Image Captioning via Multi-task Triple-Stream Networks
- 04/22/2022 - Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds
- 10/08/2022 - Contextual Modeling for 3D Dense Captioning on Point Clouds
- 07/22/2017 - OBJ2TEXT: Generating Visually Descriptive Language from Object Layouts
- 03/26/2023 - SEM-POS: Grammatically and Semantically Correct Video Captioning
- 01/06/2023 - End-to-End 3D Dense Captioning with Vote2Cap-DETR
- 11/05/2022 - Semantic Metadata Extraction from Dense Video Captioning
