CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets

02/13/2023
by Jiange Yang, et al.

Current RGB-D scene recognition approaches often train two standalone backbones for the RGB and depth modalities, both initialized from the same Places or ImageNet pre-training. As a result, the pre-trained depth network remains biased toward RGB-based models, which may lead to a suboptimal solution. In this paper, we present a single-model self-supervised hybrid pre-training framework for the RGB and depth modalities, termed CoMAE. CoMAE uses a curriculum learning strategy to unify two popular self-supervised representation learning algorithms: contrastive learning and masked image modeling. Specifically, we first build a patch-level alignment task to pre-train a single encoder shared by the two modalities via cross-modal contrastive learning. The pre-trained contrastive encoder is then passed to a multi-modal masked autoencoder, which captures finer contextual features from a generative perspective. In addition, our single-model design requires no fusion module and is therefore flexible and robust enough to generalize to unimodal scenarios in both the training and testing phases. Extensive experiments on the SUN RGB-D and NYUDv2 datasets demonstrate the effectiveness of CoMAE for RGB and depth representation learning. Our results also show that CoMAE is a data-efficient representation learner: although we pre-train only on these small-scale, unlabeled training sets, our models remain competitive with state-of-the-art methods that rely on extra large-scale, supervised RGB pre-training. Code will be released at https://github.com/MCG-NJU/CoMAE.
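The abstract describes a two-stage curriculum: patch-level cross-modal contrastive alignment of a single shared encoder, followed by multi-modal masked autoencoding warm-started from that encoder. Below is a minimal PyTorch sketch of that idea, not the authors' implementation (see the linked repository for the official code). All names here (SharedEncoder, patch_infonce, MaskedAutoencoder) are illustrative assumptions; the depth input is assumed to be rendered as a 3-channel HHA image, and the masking step follows a simplified SimMIM style (mask tokens at the input) rather than MAE's visible-patch-only encoding.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoder(nn.Module):
    # A single ViT-style encoder shared by RGB and depth (HHA) inputs.
    def __init__(self, img_size=224, patch=16, dim=384, depth=6, heads=6):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        num_patches = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                                   # x: (B, 3, H, W)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.blocks(x + self.pos)

def patch_infonce(rgb_tokens, depth_tokens, tau=0.07):
    # Stage 1: patch-level cross-modal contrast. RGB and depth tokens at the
    # same spatial location are positives; all other patches act as negatives.
    q = F.normalize(rgb_tokens.flatten(0, 1), dim=-1)       # (B*N, dim)
    k = F.normalize(depth_tokens.flatten(0, 1), dim=-1)
    logits = q @ k.t() / tau
    target = torch.arange(q.size(0), device=q.device)       # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, target)
                  + F.cross_entropy(logits.t(), target))

class MaskedAutoencoder(nn.Module):
    # Stage 2: masked reconstruction on top of the stage-1 encoder. Masked patch
    # embeddings are replaced by a learned token before the transformer runs.
    def __init__(self, encoder, dim=384, patch=16):
        super().__init__()
        self.encoder = encoder                               # warm-started from stage 1
        self.patch = patch
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.Linear(dim, patch * patch * 3)     # per-patch pixel prediction

    def forward(self, img, mask_ratio=0.75):
        x = self.encoder.patch_embed(img).flatten(2).transpose(1, 2)  # (B, N, dim)
        B, N, D = x.shape
        mask = torch.rand(B, N, device=img.device) < mask_ratio
        x = torch.where(mask[..., None], self.mask_token.expand(B, N, D), x)
        x = self.encoder.blocks(x + self.encoder.pos)        # context fills masked slots
        pred = self.decoder(x)                               # (B, N, patch*patch*3)
        target = F.unfold(img, self.patch, stride=self.patch).transpose(1, 2)
        return F.mse_loss(pred[mask], target[mask])          # loss on masked patches only

# Toy usage on random unlabeled RGB-D pairs:
rgb, hha = torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224)
enc = SharedEncoder()
stage1_loss = patch_infonce(enc(rgb), enc(hha))              # contrastive curriculum step
mae = MaskedAutoencoder(enc)
stage2_loss = mae(rgb) + mae(hha)                            # generative refinement step

The property this sketch tries to mirror is that one encoder is shared across both modalities and both stages, so at fine-tuning time it can consume RGB alone, depth alone, or both, without any fusion module.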


Related research

Self-Supervised Representation Learning for RGB-D Salient Object Detection (01/29/2021)
Existing CNNs-based RGB-D Salient Object Detection (SOD) networks are al...

MultiMAE: Multi-modal Multi-task Masked Autoencoders (04/04/2022)
We propose a pre-training strategy called Multi-modal Multi-task Masked ...

Self-Supervised Modality-Aware Multiple Granularity Pre-Training for RGB-Infrared Person Re-Identification (12/12/2021)
While RGB-Infrared cross-modality person re-identification (RGB-IR ReID)...

ViFiCon: Vision and Wireless Association Via Self-Supervised Contrastive Learning (10/11/2022)
We introduce ViFiCon, a self-supervised contrastive learning scheme whic...

P4Contrast: Contrastive Learning with Pairs of Point-Pixel Pairs for RGB-D Scene Understanding (12/24/2020)
Self-supervised representation learning is a critical problem in compute...

Event Camera Data Pre-training (01/05/2023)
This paper proposes a pre-trained neural network for handling event came...

A Closer Look at Invariances in Self-supervised Pre-training for 3D Vision (07/11/2022)
Self-supervised pre-training for 3D vision has drawn increasing research...
