PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection

03/14/2023
by   Anthony Chen, et al.
0

Masked Autoencoders learn strong visual representations and achieve state-of-the-art results in several independent modalities, yet very few works have addressed their capabilities in multi-modality settings. In this work, we focus on point cloud and RGB image data, two modalities that are often presented together in the real world, and explore their meaningful interactions. To improve upon the cross-modal synergy in existing works, we propose PiMAE, a self-supervised pre-training framework that promotes 3D and 2D interaction through three aspects. Specifically, we first notice the importance of masking strategies between the two sources and utilize a projection module to complementarily align the mask and visible tokens of the two modalities. Then, we utilize a well-crafted two-branch MAE pipeline with a novel shared decoder to promote cross-modality interaction in the mask tokens. Finally, we design a unique cross-modal reconstruction module to enhance representation learning for both modalities. Through extensive experiments performed on large-scale RGB-D scene understanding benchmarks (SUN RGB-D and ScannetV2), we discover it is nontrivial to interactively learn point-image features, where we greatly improve multiple 3D detectors, 2D detectors, and few-shot classifiers by 2.9 https://github.com/BLVLab/PiMAE.

READ FULL TEXT

page 3

page 4

page 6

page 7

page 14

page 15

research
02/27/2023

Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training

Masked Autoencoders (MAE) have shown promising performance in self-super...
research
11/22/2022

PointCMC: Cross-Modal Multi-Scale Correspondences Learning for Point Cloud Understanding

Some self-supervised cross-modal learning approaches have recently demon...
research
01/29/2021

Self-Supervised Representation Learning for RGB-D Salient Object Detection

Existing CNNs-Based RGB-D Salient Object Detection (SOD) networks are al...
research
10/09/2022

Let Images Give You More:Point Cloud Cross-Modal Training for Shape Analysis

Although recent point cloud analysis achieves impressive progress, the p...
research
12/13/2022

Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders

Pre-training by numerous image data has become de-facto for robust 2D re...
research
12/24/2020

P4Contrast: Contrastive Learning with Pairs of Point-Pixel Pairs for RGB-D Scene Understanding

Self-supervised representation learning is a critical problem in compute...
research
12/08/2020

Cross-Modal Collaborative Representation Learning and a Large-Scale RGBT Benchmark for Crowd Counting

Crowd counting is a fundamental yet challenging problem, which desires r...

Please sign up or login with your details

Forgot password? Click here to reset