Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning?

12/16/2022
by   Runpei Dong, et al.
1

The success of deep learning heavily relies on large-scale data with comprehensive labels, which is more expensive and time-consuming to fetch in 3D compared to 2D images or natural languages. This promotes the potential of utilizing models pretrained with data more than 3D as teachers for cross-modal knowledge transferring. In this paper, we revisit masked modeling in a unified fashion of knowledge distillation, and we show that foundational Transformers pretrained with 2D images or natural languages can help self-supervised 3D representation learning through training Autoencoders as Cross-Modal Teachers (ACT). The pretrained Transformers are transferred as cross-modal 3D teachers using discrete variational autoencoding self-supervision, during which the Transformers are frozen with prompt tuning for better knowledge inheritance. The latent features encoded by the 3D teachers are used as the target of masked point modeling, wherein the dark knowledge is distilled to the 3D Transformer students as foundational geometry understanding. Our ACT pretrained 3D learner achieves state-of-the-art generalization capacity across various downstream benchmarks, e.g., 88.21 released at https://github.com/RunpeiDong/ACT.

READ FULL TEXT

page 4

page 19

research
04/15/2022

XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding

Transformer-based models are widely used in natural language understandi...
research
07/26/2021

Don't Sweep your Learning Rate under the Rug: A Closer Look at Cross-modal Transfer of Pretrained Transformers

Self-supervised pre-training of large-scale transformer models on text c...
research
03/24/2021

Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning

Cross-modal recipe retrieval has recently gained substantial attention d...
research
04/16/2023

Robust Cross-Modal Knowledge Distillation for Unconstrained Videos

Cross-modal distillation has been widely used to transfer knowledge acro...
research
10/20/2021

VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal Retrieval

Cross-model retrieval has emerged as one of the most important upgrades ...
research
07/07/2023

URL: A Representation Learning Benchmark for Transferable Uncertainty Estimates

Representation learning has significantly driven the field to develop pr...
research
07/05/2023

Interactive Image Segmentation with Cross-Modality Vision Transformers

Interactive image segmentation aims to segment the target from the backg...

Please sign up or login with your details

Forgot password? Click here to reset