Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer?

09/15/2022
by Yi Wang, et al.

Vision Transformers (ViTs) have proven effective in solving 2D image understanding tasks by training over large-scale image datasets; meanwhile, as a somewhat separate track, they have also been used to model the 3D visual world, such as voxels or point clouds. However, with the growing hope that transformers can become the "universal" modeling tool for heterogeneous data, ViTs for 2D and 3D tasks have so far adopted vastly different architecture designs that are hardly transferable. That invites an (over-)ambitious question: can we close the gap between the 2D and 3D ViT architectures? As a pilot study, this paper demonstrates the appealing promise of understanding the 3D visual world using a standard 2D ViT architecture, with only minimal customization at the input and output levels and without redesigning the pipeline. To build a 3D ViT from its 2D sibling, we "inflate" the patch embedding and token sequence, accompanied by new positional encoding mechanisms designed to match the 3D data geometry. The resultant "minimalist" 3D ViT, named Simple3D-Former, performs surprisingly robustly on popular 3D tasks such as object classification, point cloud segmentation, and indoor scene detection, compared to highly customized 3D-specific designs. It can hence act as a strong baseline for new 3D ViTs. Moreover, we note that pursuing a unified 2D-3D ViT design has practical relevance beyond mere scientific curiosity. Specifically, we demonstrate that Simple3D-Former naturally enables exploiting the wealth of pre-trained weights from large-scale realistic 2D images (e.g., ImageNet), which can be plugged in to enhance 3D task performance "for free".
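The "inflation" of a 2D patch embedding into 3D can be sketched as follows. This is a hypothetical illustration in the spirit of I3D-style weight inflation: the function name `inflate_patch_embed`, the kernel shapes, and the replicate-and-rescale scheme are assumptions for exposition, not the paper's exact implementation.

```python
import numpy as np

def inflate_patch_embed(w2d: np.ndarray, depth: int) -> np.ndarray:
    """Inflate a 2D patch-embedding kernel of shape (C_out, C_in, k, k)
    into a 3D kernel of shape (C_out, C_in, depth, k, k).

    The 2D weights are replicated along the new depth axis and rescaled
    by 1/depth, so a constant input produces the same activation
    magnitude before and after inflation (I3D-style initialization).
    """
    # Insert a depth axis, replicate the pretrained 2D kernel along it,
    # then rescale so the sum over depth recovers the original kernel.
    return np.repeat(w2d[:, :, None, :, :], depth, axis=2) / depth

# Example: inflate an ImageNet-pretrained 16x16 patch embedding
# (8 output channels, 3 input channels) to cover 4 voxel slices.
w2d = np.random.randn(8, 3, 16, 16)
w3d = inflate_patch_embed(w2d, depth=4)
print(w3d.shape)  # (8, 3, 4, 16, 16)
```

Summing the inflated kernel over its depth axis recovers the original 2D weights, which is why pre-trained 2D checkpoints transfer "for free" under this scheme.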


research 06/08/2021
Image2Point: 3D Point-Cloud Understanding with Pretrained 2D ConvNets
3D point-clouds and 2D images are different visual representations of th...

research 02/28/2023
Applying Plain Transformers to Real-World Point Clouds
Due to the lack of inductive bias, transformer-based models usually requ...

research 03/29/2021
CvT: Introducing Convolutions to Vision Transformers
We present in this paper a new architecture, named Convolutional vision ...

research 11/29/2021
Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling
We present Point-BERT, a new paradigm for learning Transformers to gener...

research 04/19/2022
Multimodal Token Fusion for Vision Transformers
Many adaptations of transformers have emerged to address the single-moda...

research 02/22/2021
Do We Really Need Explicit Position Encodings for Vision Transformers?
Almost all visual transformers such as ViT or DeiT rely on predefined po...

research 04/08/2022
Vision Transformers for Single Image Dehazing
Image dehazing is a representative low-level vision task that estimates ...
