PointCMC: Cross-Modal Multi-Scale Correspondences Learning for Point Cloud Understanding

Several self-supervised cross-modal learning approaches have recently demonstrated the potential of image signals for enhancing point cloud representations. However, how to directly model cross-modal local and global correspondences in a self-supervised fashion remains an open question. To address it, we propose PointCMC, a novel cross-modal method that models multi-scale correspondences across modalities for self-supervised point cloud representation learning. In particular, PointCMC is composed of: (1) a local-to-local (L2L) module that learns local correspondences through optimized cross-modal local geometric features, (2) a local-to-global (L2G) module that learns the correspondences between local and global features across modalities via local-global discrimination, and (3) a global-to-global (G2G) module that leverages an auxiliary global contrastive loss between the point cloud and the image to learn high-level semantic correspondences. Extensive experimental results show that our approach outperforms existing state-of-the-art methods on various downstream tasks such as 3D object classification and segmentation. Code will be made publicly available upon acceptance.
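
The abstract does not give the exact form of the global contrastive loss used in the G2G module. As a rough illustration only, the sketch below assumes a symmetric InfoNCE objective of the kind commonly used in cross-modal contrastive learning: the global feature of each point cloud is pulled toward the global feature of its paired image and pushed away from the other images in the batch, and vice versa. The function names, the temperature value of 0.1, and the toy features are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, targets, temperature=0.1):
    """Mean InfoNCE loss where anchors[i] is the positive pair of targets[i].

    All other targets in the batch act as negatives for anchors[i].
    """
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, a_i in enumerate(a):
        # Temperature-scaled cosine similarity to every target in the batch.
        logits = [sum(x * y for x, y in zip(a_i, t_j)) / temperature for t_j in t]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the matching pair.
        total += log_denom - logits[i]
    return total / len(a)


def global_contrastive_loss(point_feats, image_feats, temperature=0.1):
    """Assumed symmetric loss: point-to-image and image-to-point terms averaged."""
    return 0.5 * (
        info_nce(point_feats, image_feats, temperature)
        + info_nce(image_feats, point_feats, temperature)
    )


# Toy global features (hypothetical): two objects, two-dimensional embeddings.
point_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[1.0, 0.0], [0.0, 1.0]]   # each image matches its point cloud
swapped_images = [[0.0, 1.0], [1.0, 0.0]]   # pairs are mismatched

loss_aligned = global_contrastive_loss(point_feats, aligned_images)
loss_swapped = global_contrastive_loss(point_feats, swapped_images)
```

With these toy inputs, the loss for correctly paired features is much smaller than for mismatched pairs, which is the behaviour a global contrastive objective is meant to reward. In practice the features would come from a point cloud encoder and an image encoder, neither of which is specified here.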



CrossPoint: Self-Supervised Cross-Modal Contrastive Learning for 3D Point Cloud Understanding

Manual annotation of large-scale point cloud dataset for varying tasks s...

DeepMapping2: Self-Supervised Large-Scale LiDAR Map Optimization

LiDAR mapping is important yet challenging in self-driving and mobile ro...

Self-supervised Modal and View Invariant Feature Learning

Most of the existing self-supervised feature learning methods for 3D dat...

Cross-modal Learning for Image-Guided Point Cloud Shape Completion

In this paper we explore the recent topic of point cloud completion, gui...

Open-Vocabulary 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning

Current point-cloud detection methods have difficulty detecting the open...

Cross-modal Learning of Graph Representations using Radar Point Cloud for Long-Range Gesture Recognition

Gesture recognition is one of the most intuitive ways of interaction and...

P4Contrast: Contrastive Learning with Pairs of Point-Pixel Pairs for RGB-D Scene Understanding

Self-supervised representation learning is a critical problem in compute...