Mask3D: Pre-training 2D Vision Transformers by Learning Masked 3D Priors

02/28/2023
by Ji Hou, et al.

Current popular backbones in computer vision, such as Vision Transformers (ViT) and ResNets, are trained to perceive the world from 2D images. To help 2D backbones understand 3D structural priors more effectively, we propose Mask3D, which leverages existing large-scale RGB-D data in self-supervised pre-training to embed these 3D priors into 2D learned feature representations. In contrast to traditional 3D contrastive learning paradigms, which require 3D reconstructions or multi-view correspondences, our approach is simple: we formulate a pretext reconstruction task by masking RGB and depth patches in individual RGB-D frames. We demonstrate that Mask3D is particularly effective in embedding 3D priors into the powerful 2D ViT backbone, enabling improved representation learning for various scene understanding tasks, such as semantic segmentation, instance segmentation, and object detection. Experiments show that Mask3D notably outperforms existing self-supervised 3D pre-training approaches on ScanNet, NYUv2, and Cityscapes image understanding tasks, with an improvement of +6.5 on semantic segmentation.
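
The masking step behind the pretext task can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names (random_patch_mask, apply_patch_mask, masked_l2), the 16-pixel patch size, the 0.75 mask ratio, the independent masks for RGB and depth, and the plain L2 loss over masked pixels are all assumptions chosen for illustration, and the ViT encoder and decoder are omitted entirely.

```python
import numpy as np


def random_patch_mask(height, width, patch_size, mask_ratio, rng):
    """Return a boolean (H/ps, W/ps) grid; True marks a masked patch."""
    grid_h, grid_w = height // patch_size, width // patch_size
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    flat = np.zeros(num_patches, dtype=bool)
    # Pick the masked patch indices uniformly without replacement.
    flat[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return flat.reshape(grid_h, grid_w)


def apply_patch_mask(frame, patch_mask, patch_size):
    """Zero out masked patches of an (H, W) or (H, W, C) array."""
    # Expand the patch grid to pixel resolution.
    pixel_mask = patch_mask.repeat(patch_size, axis=0).repeat(patch_size, axis=1)
    masked = frame.copy()
    masked[pixel_mask] = 0
    return masked, pixel_mask


def masked_l2(prediction, target, pixel_mask):
    """Mean squared error over masked pixels only (assumed loss form)."""
    diff = (prediction - target)[pixel_mask]
    return float(np.mean(diff ** 2))


# Hypothetical single RGB-D frame with random content.
rng = np.random.default_rng(0)
rgb = rng.random((224, 224, 3))
depth = rng.random((224, 224))

# Assumption: RGB and depth are masked independently at the same ratio.
rgb_mask = random_patch_mask(224, 224, 16, 0.75, rng)
depth_mask = random_patch_mask(224, 224, 16, 0.75, rng)
masked_rgb, rgb_pixels = apply_patch_mask(rgb, rgb_mask, 16)
masked_depth, depth_pixels = apply_patch_mask(depth, depth_mask, 16)
```

In a full pipeline, a network would take masked_rgb and masked_depth as input and be trained to reconstruct the hidden content; masked_l2 shows one way such a reconstruction error could be restricted to the masked region.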

research
02/06/2020

RGB-based Semantic Segmentation Using Self-Supervised Depth Pre-Training

Although well-known large-scale datasets, such as ImageNet, have driven ...
research
09/18/2017

Matterport3D: Learning from RGB-D Data in Indoor Environments

Access to large, diverse RGB-D datasets is critical for training RGB-D s...
research
06/12/2023

Unmasking Deepfakes: Masked Autoencoding Spatiotemporal Transformers for Enhanced Video Forgery Detection

We present a novel approach for the detection of deepfake videos using a...
research
04/22/2021

Pri3D: Can 3D Priors Help 2D Representation Learning?

Recent advances in 3D perception have shown impressive progress in under...
research
12/06/2021

4DContrast: Contrastive Learning with Dynamic Correspondences for 3D Scene Understanding

We present a new approach to instill 4D dynamic object priors into learn...
research
05/19/2022

Masked Image Modeling with Denoising Contrast

Since the development of self-supervised visual representation learning ...
research
08/03/2022

Learning Prior Feature and Attention Enhanced Image Inpainting

Many recent inpainting works have achieved impressive results by leverag...
