Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space

by Jinghuan Shang, et al.

Humans are remarkably flexible in understanding viewpoint changes because the visual cortex supports the perception of 3D structure. In contrast, most computer vision models that learn visual representations from a pool of 2D images often fail to generalize to novel camera viewpoints. Recently, vision architectures have shifted toward convolution-free visual Transformers, which operate on tokens derived from image patches. However, neither these Transformers nor 2D convolutional networks perform explicit operations to learn viewpoint-agnostic representations for visual understanding. To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of visual tokens and leverages it to learn viewpoint-agnostic representations. The key elements of 3DTRL are a pseudo-depth estimator and a learned camera matrix, which impose geometric transformations on the tokens and enable 3DTRL to recover the 3D positional information of tokens from 2D patches. In practice, 3DTRL is easily plugged into a Transformer. Our experiments demonstrate the effectiveness of 3DTRL across many vision tasks, including image classification, multi-view video alignment, and action recognition. Models with 3DTRL outperform their backbone Transformers on all tasks with minimal added computation. Our project page is at https://www3.cs.stonybrook.edu/~jishang/3dtrl/3dtrl.html
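To make the mechanism concrete, here is a minimal numpy sketch of a 3DTRL-style layer as the abstract describes it: a pseudo-depth head assigns each token a scalar depth, the 2D patch center is back-projected through a pinhole model, and a learned camera matrix transforms the result before it is embedded back into the token. All function names, shapes, and the specific pinhole parameterization are our assumptions for illustration, not the authors' implementation.

```python
import numpy as np

# Hypothetical sketch of a 3DTRL-like layer; shapes and names are assumptions.

def pseudo_depth(tokens, w, b):
    """Per-token scalar depth from a linear head; softplus keeps depth > 0."""
    z = tokens @ w + b                       # (N,)
    return np.log1p(np.exp(z))               # softplus

def backproject(uv, z, f=1.0, c=0.5):
    """Pinhole back-projection of normalized 2D patch centers to camera space."""
    x = z * (uv[:, 0] - c) / f
    y = z * (uv[:, 1] - c) / f
    return np.stack([x, y, z], axis=1)       # (N, 3)

def tdtrl(tokens, uv, w_d, b_d, T, W_emb):
    """tokens: (N, D) visual tokens; uv: (N, 2) patch centers in [0, 1];
    T: (4, 4) learned camera transform; W_emb: (3, D) positional embedding."""
    z = pseudo_depth(tokens, w_d, b_d)       # estimate pseudo-depth per token
    p_cam = backproject(uv, z)               # recover 3D points in camera frame
    p_h = np.concatenate([p_cam, np.ones((len(p_cam), 1))], axis=1)
    p_world = (p_h @ T.T)[:, :3]             # apply the learned camera matrix
    return tokens + p_world @ W_emb          # inject 3D positional information
```

Because the layer only adds a 3D positional term to the existing tokens, it can sit between any two Transformer blocks without changing token shapes, which matches the paper's claim that 3DTRL is easily plugged into a Transformer.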




