X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning

by Zhihao Yuan, et al.

3D dense captioning aims to describe individual objects in natural language within 3D scenes, where the scenes are usually represented as RGB-D scans or point clouds. However, by exploiting only single-modal information, e.g., point clouds, previous approaches fail to produce faithful descriptions. Although aggregating 2D features into point clouds can be beneficial, it introduces an extra computational burden, especially in the inference phase. In this study, we investigate cross-modal knowledge transfer using a Transformer for 3D dense captioning, X-Trans2Cap, which effectively boosts single-modal 3D captioning through knowledge distillation in a teacher-student framework. In practice, during training, the teacher network exploits an auxiliary 2D modality and guides the student network, which takes only point clouds as input, through feature consistency constraints. Owing to a well-designed cross-modal feature fusion module and feature alignment during training, X-Trans2Cap readily acquires the rich appearance information embedded in 2D images. Thus, a more faithful caption can be generated using only point clouds at inference time. Qualitative and quantitative results confirm that X-Trans2Cap outperforms the previous state-of-the-art by a large margin, i.e., about +21 and +16 absolute CIDEr points on the ScanRefer and Nr3D datasets, respectively.
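The teacher-student scheme described above can be summarized as a training objective that combines the usual captioning loss with a feature-consistency term aligning the 3D-only student with the 2D+3D teacher. The sketch below is a minimal illustration of that idea in plain Python; the function names, shapes, and the simple MSE consistency term are assumptions for exposition, not the authors' actual implementation.

```python
# Hypothetical sketch of feature-consistency distillation: a teacher
# that sees fused 2D+3D features guides a student that sees only
# point-cloud features. All names here are illustrative assumptions.

def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(student_feat, teacher_feat, caption_loss, alpha=0.5):
    """Total training loss: captioning loss plus a feature-consistency
    term that pulls the student's 3D features toward the teacher's
    cross-modal (2D+3D) features. `alpha` weights the consistency term."""
    consistency = mse(student_feat, teacher_feat)
    return caption_loss + alpha * consistency
```

At inference, only the student branch is executed, so the 2D modality and the consistency term incur no runtime cost.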




Let Images Give You More: Point Cloud Cross-Modal Training for Shape Analysis

Although recent point cloud analysis achieves impressive progress, the p...

Knowledge as Priors: Cross-Modal Knowledge Generalization for Datasets without Superior Knowledge

Cross-modal knowledge distillation deals with transferring knowledge fro...

PersonalTailor: Personalizing 2D Pattern Design from 3D Garment Point Clouds

Garment pattern design aims to convert a 3D garment to the corresponding...

Cross-modal and Cross-domain Knowledge Transfer for Label-free 3D Segmentation

Current state-of-the-art point cloud-based perception methods usually re...

Text to Point Cloud Localization with Relation-Enhanced Transformer

Automatically localizing a position based on a few natural language inst...

CSDN: Cross-modal Shape-transfer Dual-refinement Network for Point Cloud Completion

How will you repair a physical object with some missings? You may imagin...

Contrastive Learning of Features between Images and LiDAR

Image and Point Clouds provide different information for robots. Finding...
