
Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space

Humans are remarkably flexible in understanding viewpoint changes because the visual cortex supports the perception of 3D structure. In contrast, most computer vision models that learn visual representations from a pool of 2D images often fail to generalize over novel camera viewpoints. Recently, vision architectures have shifted towards convolution-free architectures, visual Transformers, which operate on tokens derived from image patches. However, neither these Transformers nor 2D convolutional networks perform explicit operations to learn viewpoint-agnostic representations for visual understanding. To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of the visual tokens and leverages it for learning viewpoint-agnostic representations. The key elements of 3DTRL include a pseudo-depth estimator and a learned camera matrix that impose geometric transformations on the tokens. These enable 3DTRL to recover the 3D positional information of the tokens from 2D patches. In practice, 3DTRL is easily plugged into a Transformer. Our experiments demonstrate the effectiveness of 3DTRL in many vision tasks including image classification, multi-view video alignment, and action recognition. Models with 3DTRL outperform their backbone Transformers in all these tasks with minimal added computation. Our project page is at jishang/3dtrl/3dtrl.html





1 Introduction

Over the past few years, computer vision models have developed rapidly, from CNNs vgg16; resnet; inception to now Transformers dosovitskiy2020vit; deit; tnt. With these models, we can now accurately classify objects in an image, align image frames among video pairs, classify actions in videos, and more. Despite their success, many of these models neglect that the world is 3D and do not extend beyond the XY image plane meshrcnn. While humans can readily estimate the 3D structure of a scene from the 2D pixels of an image, most existing vision models operating on 2D images do not take the 3D structure of the world into consideration. This is one of the reasons why humans are able to recognize objects in images and actions in videos regardless of their viewpoint, while vision models often fail to generalize over novel viewpoints NPL_2021_CVPR; viewclr; meshrcnn.

Consequently, in this paper, we develop an approach to learn viewpoint-agnostic representations for a robust understanding of visual data. Naive solutions to obtaining viewpoint-agnostic representations would be either supervising the model with densely annotated 3D data, or learning representations from large-scale 2D datasets with samples encompassing different viewpoints. Given that such high-quality data are expensive to acquire and hard to scale, an approach with higher sample efficiency that requires no 3D supervision is desired.

To this end, we propose a 3D Token Representation Layer (3DTRL), incorporating 3D camera transformations into the recent successful visual Transformers dosovitskiy2020vit; deit; crossvit; liu2021Swin. 3DTRL first recovers camera-centered 3D coordinates of each token by depth estimation. Then 3DTRL estimates a camera matrix to transform these camera-centered coordinates to a 3D world space. In this world space, 3D locations of the tokens are absolute and view-invariant, which contain important information for learning viewpoint-agnostic representations. Therefore, 3DTRL incorporates such 3D positional information in the Transformer backbone in the form of 3D positional embeddings, and generates output tokens with 3D information. Unlike visual Transformers only relying on 2D positional embedding, models with 3DTRL are more compliant with learning viewpoint-agnostic representations.

We conduct extensive experiments on various vision tasks to confirm the effectiveness of 3DTRL. Our 3DTRL outperforms the Transformer backbones on 3 image classification, 5 multi-view video alignment, and 2 multi-view action recognition datasets in their respective tasks. Moreover, 3DTRL is a lightweight, plug-and-play module that achieves the above improvements with minimal overhead (2% computation and 4% parameters).

In summary, we present a learnable, differentiable layer 3DTRL that efficiently and effectively learns viewpoint-agnostic representations.

2 Background: Pinhole Camera Model

3DTRL is based on the standard pinhole camera model widely used in computer vision. Thus, we first briefly review the pinhole camera model. In the homogeneous coordinate system, given a point with world coordinate $P_w = (x_w, y_w, z_w, 1)^\top$, a camera projects a pixel at $(u, v)$ on an image by:

$$z_c \begin{bmatrix} u & v & 1 \end{bmatrix}^\top = K \, [R \,|\, T] \, P_w, \quad (1)$$

where $K$ is the intrinsic matrix and $[R \,|\, T]$ is the extrinsic matrix. $K$ is further represented by

$$K = \begin{bmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix}, \quad (2)$$

where $f$ is the focal length and $(c_x, c_y)$ are the pixel coordinates of the principal point. In this work, we explore visual understanding in a multi-view setting, thus aiming at learning viewpoint-agnostic representations. In this setting, a scene may be captured with different cameras positioned at non-identical locations and viewing angles. Here, the world coordinate $P_w$ is the same across all the cameras while the pixel projection $(u, v)$ differs across cameras. We focus on how to estimate the world coordinates $P_w$ from their corresponding image pixels at $(u, v)$. Estimating $P_w$ from $(u, v)$ involves two transformations that correspond to the inverse of $K$ and $[R \,|\, T]$, which might not be known beforehand. Thus, we estimate them from image patches instead, which is a key procedure in 3DTRL.
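The forward projection above can be traced numerically. The sketch below is illustrative only: it uses a hypothetical camera (identity rotation, zero translation, made-up point and focal length) and implements the projection $p_c = R p_w + T$ followed by the perspective division, with the principal point placed at the image center.

```python
def project(pw, R, T, f):
    """Pinhole projection of a 3D world point to a pixel.

    Camera-centered coordinates: p_c = R @ p_w + T (the extrinsics),
    then perspective division with focal length f (the intrinsics),
    assuming the principal point sits at the image center (0, 0).
    """
    pc = [sum(R[i][j] * pw[j] for j in range(3)) + T[i] for i in range(3)]
    u = f * pc[0] / pc[2]
    v = f * pc[1] / pc[2]
    return u, v, pc[2]  # pixel coordinates and depth z_c

# Hypothetical identity camera: world and camera frames coincide.
R_id = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u, v, z = project([1.0, 2.0, 4.0], R_id, [0.0, 0.0, 0.0], f=2.0)
```

Recovering the world point from $(u, v)$ alone requires inverting both transformations, which is exactly what 3DTRL must estimate.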

3 3D Token Representation Layer (3DTRL)

Figure 1: Overview of proposed 3DTRL. Left: 3DTRL is a module inserted in between Transformer layers. Right: 3DTRL has three parts, Pseudo-Depth Estimator, Camera Parameter Estimator, and 3D Positional Embedding layer. Within Pseudo-Depth Estimator, we first estimate depth of each token and then calculate 3D coordinates from 2D locations and depth.

In this section, we detail how 3DTRL estimates 3D positional information and how it is integrated into a Transformer. We first introduce 3DTRL for image models, and then adapt it to video models.

3.1 Overview

3DTRL is a simple yet effective plug-in module that can be inserted in between the layers of visual Transformers (Figure 1 left). Given a set of input tokens $\{z_i\}_{i=1}^{N}$, 3DTRL returns a set of tokens with 3D information. The number of tokens and their dimensionality are kept unchanged. Within the module (Figure 1 right), 3DTRL first performs 3D estimation using two estimators: (1) a pseudo-depth estimator and (2) a camera parameter estimator. Then the tokens are associated with their recovered world 3D coordinates, obtained by transforming the estimated depth with the estimated camera matrix. Finally, 3DTRL generates 3D positional embeddings from these world coordinates and combines them with the input to produce the output of 3DTRL.

Because 3DTRL is inserted in between Transformer layers, it implicitly leverages the Transformer layers before it as part of the 3D estimators, and the layers after it as the actual 3D feature encoder (Figure 1 left). This avoids adding a large number of parameters while still producing reasonable estimations. We empirically find that placing 3DTRL at a shallow-to-medium layer of the network yields better results (see Section 4.3), which reflects a trade-off between model capacity for estimation and for 3D feature encoding.
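Since 3DTRL preserves token count and dimensionality, plugging it in amounts to splicing one extra token-to-token module into the layer stack. The pure-Python sketch below uses toy stand-in layers and a hypothetical `add_one` update in place of 3DTRL; it only illustrates that the shape-preserving insertion leaves the rest of the pipeline untouched.

```python
def build_with_plugin(layers, module, insert_at):
    """Insert a shape-preserving module between backbone layers at a given depth."""
    stack = layers[:insert_at] + [module] + layers[insert_at:]

    def forward(tokens):
        for layer in stack:
            tokens = layer(tokens)
        return tokens

    return forward

# Toy stand-ins: each "layer" maps a token list to a token list of the same size.
identity_layer = lambda toks: toks
add_one = lambda toks: [t + 1 for t in toks]  # hypothetical stand-in for 3DTRL

model = build_with_plugin([identity_layer] * 12, add_one, insert_at=4)
out = model([0, 0, 0])  # same number of tokens in and out
```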

3.2 3D Estimation of the Input Tokens

3DTRL first estimates the camera-centered 3D coordinates of each token via depth estimation, as well as a camera matrix shared by all the tokens of an image. Then, the camera-centered 3D coordinates are transformed to the “world” coordinates by the camera matrix.

Pseudo-depth Estimation.

Given input tokens $\{z_i\}$, 3DTRL first performs pseudo-depth estimation for each token $z_i$. The pseudo-depth estimator is a function $f_d$ that outputs the depth $d_i = f_d(z_i)$ of each token individually. We implement $f_d$ using a 2-layer MLP. We call this pseudo-depth estimation since it is similar to depth estimation from monocular images but operates at a very coarse scale, given that each token corresponds to an image patch rather than a single pixel in a Transformer.

After pseudo-depth estimation, 3DTRL transforms the pseudo-depth map to camera-centered 3D coordinates. Recall that in ViT dosovitskiy2020vit, an image is decomposed into patches of size $p \times p$, where $p$ is the size of each image patch, and the tokens are obtained from a linear projection of these image patches. Thus, each token is initially associated with a 2D location on the image plane, denoted as $(u_i, v_i)$. By depth estimation, 3DTRL associates each token with one more value $d_i$. Based on the pinhole camera model explained in Section 2, 3DTRL transforms $(u_i, v_i, d_i)$ to a camera-centered 3D coordinate $c_i = (x_i^c, y_i^c, z_i^c)$ by:

$$x_i^c = \frac{u_i d_i}{f}, \qquad y_i^c = \frac{v_i d_i}{f}, \qquad z_i^c = d_i. \quad (3)$$
Since we perform the aforementioned estimation purely from monocular images and the camera intrinsic matrix is unknown, we simply set the focal length $f$ to a constant hyperparameter. To define the coordinate system of the 2D image plane, we set the center of the original image as the origin for convenience, so that the image plane and the camera coordinate system share the same origin. We use the center of each image patch to represent its associated coordinates.

We believe that this depth-based 3D coordinate estimation best leverages the known 2D geometry, which is beneficial for later representation learning. We later confirm this in our ablation study (Section 4.3), in which we compare it against a variant that directly estimates the 3D coordinates $(x^c, y^c, z^c)$.
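The patch-center convention and the depth-based lifting described above can be sketched as follows. This is a minimal illustration with hypothetical image and patch sizes, a constant focal length, and made-up depth values; it is not the paper's implementation.

```python
def patch_centers(img_size, patch_size):
    """2D centers (u, v) of non-overlapping patches, image center as origin."""
    n = img_size // patch_size          # patches per side
    half = img_size / 2.0
    centers = []
    for row in range(n):
        for col in range(n):
            u = col * patch_size + patch_size / 2.0 - half
            v = row * patch_size + patch_size / 2.0 - half
            centers.append((u, v))
    return centers

def backproject(centers, depths, f):
    """Lift each token's center (u, v) and pseudo-depth d to camera coordinates:
    x_c = u*d/f, y_c = v*d/f, z_c = d."""
    return [(u * d / f, v * d / f, d) for (u, v), d in zip(centers, depths)]

# A toy 4x4 image with 2x2 patches -> 4 tokens; depths and f are made up.
centers = patch_centers(4, 2)
points = backproject(centers, [2.0, 2.0, 2.0, 2.0], f=1.0)
```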

Camera Parameter Estimation.

The camera parameters are required to transform the estimated camera-centered 3D coordinates to the world coordinate system. These camera parameters are estimated jointly from all input tokens $\{z_i\}$. This involves the estimation of two matrices, a rotation matrix $R$ and a translation matrix $T$, through an estimator $f_c$. We implement $f_c$ using an MLP. Specifically, we use a shared MLP stem to aggregate all the tokens into an intermediate representation. Then, we use two separate fully connected heads to estimate the parameters in $R$ and $T$ respectively. We denote the camera parameter estimator as a whole: $(R, T) = f_c(\{z_i\})$. To ensure $R$ is mathematically valid, we first estimate the three values corresponding to the yaw, pitch, and roll angles of the camera pose, and then convert them into a rotation matrix.

Transform to World Coordinates.

Now, with the estimated camera parameters, 3DTRL transforms the estimated camera-centered coordinates into the world space, a 3D space where the 3D coordinates of the tokens are absolute and viewpoint-invariant. Following the pinhole camera model, we recover the world coordinate $w_i$ of each token by

$$w_i = R^{-1}(c_i - T). \quad (4)$$
3.3 Incorporating 3D Positional Information in Transformers

The last step of 3DTRL is to leverage the estimated 3D positional information in the Transformer backbone. For this, we adopt the typical technique of incorporating positional embeddings that is already used in Transformers dosovitskiy2020vit; attention. In contrast to the 2D positional embedding in ViTs dosovitskiy2020vit, 3DTRL learns a 3D embedding function $g$ to transform the estimated world coordinates $w_i$ into positional embeddings $p_i = g(w_i)$. This 3D embedding function is implemented using a two-layer MLP. Then, the obtained 3D positional embedding is incorporated in the Transformer backbone by combining it with the token representations. The outcome is the final token representations $\hat{z}_i$:

$$\hat{z}_i = z_i + g(w_i). \quad (5)$$
After 3D embedding, the resultant token representations are associated with a 3D space, thus enabling the remaining Transformer layers to encode viewpoint-agnostic token representations. Consequently, the estimation of pseudo-depth and camera parameters, together with the geometric transformations applied based on them, introduces an inductive bias into the Transformer. We ablate other ways of incorporating the 3D positional information of the tokens in Section 4.3.
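The embed-and-add step can be traced with a tiny hand-weighted MLP. All weights and dimensions below are made up purely to show the shapes (embedding dimension 2, hidden width 2); the real embedding function is learned.

```python
def g(coord, W1, b1, W2, b2):
    """A 2-layer MLP mapping a 3D world coordinate to a positional embedding."""
    h = [max(0.0, sum(W1[i][j] * coord[j] for j in range(3)) + b1[i])  # ReLU
         for i in range(len(b1))]
    return [sum(W2[i][j] * h[j] for j in range(len(h))) + b2[i]
            for i in range(len(b2))]

def combine(tokens, coords, embed):
    """Add each token's 3D positional embedding: z_hat_i = z_i + g(w_i)."""
    return [[t + p for t, p in zip(tok, embed(c))]
            for tok, c in zip(tokens, coords)]

# Hand-picked toy weights: the MLP just passes through the first two coordinates.
W1, b1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.0, 0.0]
W2, b2 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
embed = lambda c: g(c, W1, b1, W2, b2)
out = combine([[0.5, 0.5]], [(1.0, 2.0, 3.0)], embed)
```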

3.4 3DTRL in Video Models

Notably, 3DTRL can also be easily generalized to video models. For video models, the input to 3DTRL is a set of spatio-temporal tokens corresponding to a video clip containing $T$ frames, where $\{z_i^{(t)}\}$ are the tokens from the $t$-th frame. We simply extend our module to operate on an additional time dimension, where depth estimation and 3D positional embedding are done for each spatio-temporal token individually. However, we propose two strategies for the camera parameter estimation in 3DTRL for videos: (a) Divided-Temporal (DT) and (b) Joint-Temporal (JT) estimation, introduced below. In the (a) DT strategy, we estimate one set of camera parameters per input frame in a dissociated manner, and thus estimate a total of $T$ camera matrices for the entire video. In the (b) JT strategy, we estimate only one camera from all $T$ frames. This camera is shared across all spatio-temporal tokens associated with the 3D space. The underlying hypothesis is that the camera pose and location do not change during the video clip. JT could be helpful to properly constrain the model in scenarios without camera movement, but it does not generalize to scenarios where the subject of interest moves frequently within the field-of-view. By default, we use the DT strategy. Later, in Section 4.5, we show a comparison of the DT and JT strategies in different scenarios.
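The difference between the two strategies is only where the camera estimator is applied: per frame (DT) or once over the whole clip (JT). In the sketch below, mean-pooling stands in for the shared MLP stem and all numbers are hypothetical; only the shapes matter.

```python
def mean_pool(tokens):
    """Aggregate a list of token vectors into one summary vector
    (a stand-in for the camera estimator's shared MLP stem)."""
    n, d = len(tokens), len(tokens[0])
    return [sum(t[i] for t in tokens) / n for i in range(d)]

def estimate_cameras(video_tokens, estimator, strategy="DT"):
    """DT: one camera per frame (T estimates). JT: a single camera for the clip."""
    if strategy == "DT":
        return [estimator(frame) for frame in video_tokens]
    flat = [tok for frame in video_tokens for tok in frame]
    return [estimator(flat)]

# A clip with T = 3 frames, N = 2 tokens per frame, d = 2 dims (made-up values).
clip = [[[0.0, 0.0], [2.0, 2.0]],
        [[1.0, 1.0], [1.0, 1.0]],
        [[4.0, 0.0], [0.0, 4.0]]]
dt = estimate_cameras(clip, mean_pool, "DT")  # 3 per-frame estimates
jt = estimate_cameras(clip, mean_pool, "JT")  # 1 shared estimate
```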

4 Experiments

We conduct extensive experiments to demonstrate the efficacy of viewpoint-agnostic representations learned by 3DTRL in multiple vision tasks: (i) image classification, (ii) multi-view video alignment, and (iii) video action classification.

4.1 Image Classification

In order to validate the power of 3DTRL, we first evaluate ViT dosovitskiy2020vit with 3DTRL on the image classification task using the CIFAR-10 cifar, CIFAR-100 cifar, and ImageNet-1K imagenet datasets.


We use the training recipe of DeiT deit for training our baseline Vision Transformer model on the CIFAR and ImageNet datasets from scratch. We performed ablations to find an optimal location in ViTs where 3DTRL should be plugged in (Section 4.3). Thus, in all our experiments we place 3DTRL after 4 Transformer layers, unless otherwise stated. The configuration of our DeiT-T, DeiT-S, and DeiT-B is identical to that mentioned in deit. All our Transformer models are trained for 50 and 300 epochs for CIFAR and ImageNet respectively. Further training details and hyper-parameters can be found in the Appendix.

Table 1: Top-1 classification accuracy (%) on CIFAR-10, CIFAR-100, ImageNet-1K (IN-1K), and viewpoint-perturbed IN-1K (IN-1K-perturbed). We also report the number of parameters (#params) and computation (in MACs). Note that the MACs are reported w.r.t. IN-1K samples. All the models are trained from scratch.

Method | #params | MACs | CIFAR-10 | CIFAR-100 | IN-1K | IN-1K-perturbed
DeiT-T | 5.72M | 1.08G | 74.1 | 51.3 | 73.4 | 61.3
 +3DTRL | 5.95M | 1.10G | 78.8 (+4.7) | 53.7 (+2.4) | 73.6 (+0.2) | 64.6 (+3.3)
DeiT-S | 22.1M | 4.24G | 77.2 | 54.6 | 79.4 | 71.1
 +3DTRL | 23.0M | 4.33G | 80.7 (+3.5) | 61.5 (+6.9) | 79.7 (+0.3) | 72.7 (+1.6)
DeiT-B | 86.6M | 16.7G | 76.6 | 51.9 | 81.0 | 70.6
 +3DTRL | 90.1M | 17.2G | 82.8 (+6.2) | 61.8 (+9.9) | 81.2 (+0.2) | 74.7 (+4.1)

Figure 2: Top: original IN-1K samples. Bottom: viewpoint-perturbed IN-1K samples.


From Table 1, 3DTRL with all DeiT variants shows consistent performance improvements over the baseline on the CIFAR datasets, with only 2% computation overhead and 4% more parameters. Despite fewer training samples, 3DTRL significantly outperforms the baseline on CIFAR, showing its strong generalizability when limited training data are available. We argue that multi-view data is not available in abundance, especially in domains with limited data (such as the medical domain), so 3DTRL, with its ability to learn viewpoint-agnostic representations, will be instrumental in such domains. We also find that the performance improvement on ImageNet is smaller than that on CIFAR. This is because ImageNet has limited viewpoints in both the training and validation splits, which reduces the benefit of geometry-aware transformations for learning viewpoint-agnostic representations. In contrast, CIFAR samples present more diverse camera viewpoints, so CIFAR is a more suitable benchmark for testing the quality of learned viewpoint-agnostic representations.

Robustness on Viewpoint-perturbed Data

In order to emphasize the need for learning viewpoint-agnostic representations, we further construct a viewpoint-perturbed ImageNet-1K validation set by applying random perspective transformations to images. Example images are shown in Figure 2. We note that perspective transformation on these static images is not equivalent to real viewpoint changes, because the zero-paddings and interpolations do not happen in real-world cameras. Nonetheless, we believe it is a meaningful experiment, as real image datasets with multi-view settings are lacking in the vision community. Table 1 shows that the performance improvement with 3DTRL on the perturbed data is greater than that on the original data. This demonstrates the robustness of 3DTRL to perspective-transformed data, thanks to the learned viewpoint-agnostic representations.

4.2 Multi-view Video Alignment

Video alignment hadji2021representation; dwibedi2019temporal; misra2016shuffle is the task of learning a frame-to-frame mapping between video pairs with close semantic embeddings. In particular, we consider a multi-view setting in which we align videos capturing the same event from different viewpoints. Thus, video pairs from the same event are temporally synchronized.

Figure 3: Examples from video alignment datasets. Each dataset has synchronized videos of at least 2 viewpoints. All datasets except Pouring have one ego-centric viewpoint, highlighted in red boxes. More details are available in the Appendix.


We use 5 multi-view datasets from a wide range of environments: Minecraft (MC) – video game, Pick, Can, and Lift from robot simulators (PyBullet coumans2019 and Robomimic robomimic2021), and Pouring from real-world human actions sermanet2018time. Example video frames are provided in Figure 3. Each dataset contains synchronized videos from multiple cameras (viewpoints). There is one ego-centric camera per dataset except for Pouring; these ego-centric cameras continuously move with the subject, which makes the alignment task more challenging. Detailed dataset statistics are available in the Appendix.


We follow common video alignment methods sermanet2018time to train an encoder that outputs frame-wise embeddings. We again use DeiT deit as the baseline model and apply 3DTRL to it, similar to image classification. This gives us a visual encoder backbone. During training, we use the time-contrastive loss sermanet2018time to encourage temporally close embeddings to be similar and temporally far-away embeddings to be apart. Then, we obtain alignments via nearest neighbors, such that an embedding from video 1 is paired with its nearest neighbor in video 2 in the embedding space; similarly, each embedding from video 2 is paired with its nearest neighbor in video 1. We denote the baseline method as DeiT+TCN and ours as +3DTRL. We use ImageNet-1K pre-trained weights for experiments on Pouring, but train from scratch for the other datasets, considering that simulation environments are out of the real-world distribution.
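The nearest-neighbor pairing step can be sketched directly in embedding space. The snippet below is a generic illustration with hypothetical 1-D embeddings, not the paper's code:

```python
def nearest_neighbor_alignment(emb1, emb2):
    """Pair each frame embedding of video 1 with the index of its nearest
    embedding in video 2 (squared Euclidean distance)."""
    dist2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(emb2)), key=lambda j: dist2(e, emb2[j]))
            for e in emb1]

# Hypothetical embeddings for two 3-frame videos.
pairs = nearest_neighbor_alignment([[0.0], [1.0], [2.0]],
                                   [[2.1], [0.1], [0.9]])
```

A well-trained encoder should make these nearest neighbors land on temporally corresponding frames, which is exactly what the metrics below quantify.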


We evaluate the alignment by three metrics: Alignment Error sermanet2018time, Cycle Error, and Kendall’s Tau kendall1938new. Let the alignment pairs between two videos be $(i, \hat{j}_i)$, where frame $i$ of video 1 is matched to frame $\hat{j}_i$ of video 2. In brief, Alignment Error measures the temporal mismatch $|i - \hat{j}_i|$. Cycle Error is based on cycle-consistency wang2019learning; dwibedi2019temporal, where two pairs $(i, \hat{j}_i)$ (from video 1 to video 2) and $(\hat{j}_i, \hat{k}_i)$ (back from video 2 to video 1) are called consistent when $i = \hat{k}_i$. Thus, Cycle Error measures the inconsistency based on the distance $|i - \hat{k}_i|$. Kendall’s Tau ($\tau$) measures the ordering of pairs. Given a pair of embeddings from video 1 with indices $(i, j)$ and their corresponding nearest neighbors from video 2 with indices $(\hat{j}_i, \hat{j}_j)$, the index tuple is concordant when $i < j$ and $\hat{j}_i < \hat{j}_j$, or $i > j$ and $\hat{j}_i > \hat{j}_j$; otherwise the tuple is discordant. Kendall’s Tau computes the ratio of concordant and discordant tuples over all pairs of frames. Let $N$ be the number of frames in a video; then the three metrics are:

$$\text{Align. Err.} = \frac{1}{N}\sum_{i=1}^{N} \frac{|i - \hat{j}_i|}{N}, \qquad \text{Cycle Err.} = \frac{1}{N}\sum_{i=1}^{N} \frac{|i - \hat{k}_i|}{N}, \qquad \tau = \frac{\#\text{concordant} - \#\text{discordant}}{N(N-1)/2}. \quad (6)$$
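Under the definitions above, the three metrics can be sketched in a few lines. This is an illustrative implementation; the exact normalizations in the original formulation may differ slightly.

```python
def alignment_error(js, n):
    """Mean normalized temporal mismatch |i - j_i| / n over nearest-neighbor
    indices js (js[i] is frame i's match in the other video)."""
    return sum(abs(i - j) for i, j in enumerate(js)) / (len(js) * n)

def cycle_error(j12, j21, n):
    """Mean normalized cycle inconsistency: mapping frame i into video 2 and
    back should return i, so measure |i - j21[j12[i]]| / n."""
    return sum(abs(i - j21[j12[i]]) for i in range(n)) / (n * n)

def kendalls_tau(js):
    """(#concordant - #discordant) over all index pairs; 1 means the
    nearest-neighbor indices preserve temporal order perfectly."""
    n = len(js)
    conc = sum(1 for a in range(n) for b in range(a + 1, n) if js[a] < js[b])
    disc = sum(1 for a in range(n) for b in range(a + 1, n) if js[a] > js[b])
    return (conc - disc) / (n * (n - 1) / 2)
```

For a perfect alignment of an $N$-frame pair, both errors are 0 and $\tau$ is 1; a fully reversed alignment gives $\tau = -1$.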

We establish two evaluation protocols: (a) Seen and (b) Unseen. In Seen, we train and test models on videos from all cameras. In Unseen, however, we hold out several cameras for testing, which is a representative scenario for validating the effectiveness of 3DTRL. Details of the experimental settings are provided in the Appendix.

Figure 4: Results on video alignment in the Seen and Unseen protocols. ↑ indicates that a higher metric is better and ↓ the opposite. Blue bars are for DeiT+TCN without 3DTRL and green bars are with 3DTRL. 3DTRL outperforms the baseline consistently in both settings. We note that Unseen is not applicable to the Pouring dataset because only two cameras are available in this dataset.


Figure 4 illustrates the evaluation results of the 2 viewpoint settings over 5 datasets, compared to the DeiT+TCN baseline. 3DTRL outperforms the baseline consistently across all datasets. In particular, 3DTRL improves over the baseline by a large margin on Pouring and MC, corroborating that 3DTRL adapts to diverse unseen viewpoints. The improvements of 3DTRL on Pick and MC also suggest strong generalizability when learning from smaller datasets. With enough data (Lift and Can), 3DTRL still outperforms, but the gap is relatively small. When evaluating in the Unseen setting, both methods suffer a performance drop. However, 3DTRL still outperforms on Pick, MC, and Can, which suggests that the representations learned by 3DTRL are able to generalize over novel viewpoints. MC has the largest viewpoint diversity, so it is hard for both models to obtain reasonable alignment results in the unseen setting. Sample alignment results are provided in Figure 6.

Table 2: Video alignment results compared with SOTA methods. Values are alignment errors.

Method | Backbone | Input | Pouring | Pick | MC
TCN sermanet2018time | CNN | 1 frame | 0.180 | 0.273 | 0.286
Disentanglement shang2021disentangle | CNN | 1 frame | - | 0.155 | 0.233
mfTCN dwibedi2018learning | 3DCNN | 8 frames | 0.143 | - | -
mfTCN dwibedi2018learning | 3DCNN | 32 frames | 0.088 | - | -
DeiT deit+TCN | Transformer | 1 frame | 0.155 | 0.216 | 0.292
 +3DTRL | Transformer | 1 frame | 0.080 | 0.116 | 0.202

Figure 5: Alignment error w.r.t. the number of training videos. The red dashed line indicates that 3DTRL using 10 videos outperforms the baseline model using 45 videos.
Figure 6: Visualization of alignment results in Can dataset. The example shown is aligning the first-person (egocentric) video to the given third-person video reference (1st row). We compare model with 3DTRL (Ours, 3rd row) against the ground truth (FPV GT, 2nd row) and the baseline model (DeiT+TCN, 4th row). Red borders highlight the frames where 3DTRL aligns frames better compared to the baseline.

In Table 2, we further compare 3DTRL with previous methods on certain datasets. Note that Disentanglement shang2021disentangle uses extra losses, whereas the others use the time-contrastive loss only. We find that 3DTRL with only single-frame input is able to surpass the strong baselines set by models using extra losses shang2021disentangle or multiple input frames dwibedi2018learning. We also vary the number of training videos in the Pouring dataset, and the results in Figure 5 show that 3DTRL benefits from more data. Meanwhile, 3DTRL can outperform the baseline while using only 22% of the data the baseline used.

4.3 Ablation Studies

We conduct our ablation studies on image models mostly using CIFAR-10 and CIFAR-100 for image classification and Pick for multi-view video alignment.

Table 3: Ablation study results. For CIFAR, we test models based on Tiny (T), Small (S), and Base (B) backbones (DeiT) and report accuracy (%). For Pick, we only test the Base model and report alignment error.

Method | CIFAR-10 (T/S/B) | CIFAR-100 (T/S/B) | Pick
DeiT | 74.1 / 77.2 / 76.6 | 51.3 / 54.6 / 51.9 | 0.216
DeiT + MLP | 74.2 / 77.2 / 76.5 | 47.9 / 54.7 / 53.4 | 0.130
DeiT + 3DTRL | 78.8 / 80.7 / 82.8 | 53.7 / 61.5 / 61.8 | 0.116
Replace: Depth | 76.7 / 78.2 / 77.4 | 48.3 / 54.1 / 52.6 | 0.134
Replace: Token Embed Concat. | 80.7 / 83.7 / 84.9 | 53.4 / 61.8 / 60.2 | 0.133

Figure 7: Results of inserting 3DTRL at different locations.

MLP vs. 3DTRL.

3DTRL is implemented with several MLPs and the required geometric transforms in between. In this experiment, we replace 3DTRL with a similar number of fully-connected layers with a residual connection, to have comparable parameters and computation to 3DTRL. Results are provided in Table 3. We find that the MLP implementation is only comparable to the baseline performance despite the increase in parameters and computation. Thus, we confirm that the geometric transformations imposed on the token representations are the key to making 3DTRL effective.

Token Coordinates Estimation.

In this ablation, we show how estimating only depth differs from directly estimating all 3 coordinates $(x, y, z)$ in 3DTRL. The second-to-last row of Table 3 shows the results. We find that depth estimation is better because it uses the precise 2D image coordinates when recovering tokens in 3D space, whereas direct estimation regresses the 3D coordinates without any geometric constraints. Also, we find that directly estimating $(x, y, z)$ hampers performance more in the image classification task than in video alignment. This is because estimating 3D coordinates is harder when the training samples come from unconstrained scenarios. In contrast, video pairs in an alignment dataset share the same scene captured from different view angles, which facilitates the recovery of tokens in 3D space.

How to incorporate 3D positional information in Transformers?

By default, we use Equation 5 to incorporate the 3D positional information in the Transformer backbone through a learned positional embedding. In this experiment, we instead directly infuse the estimated 3D world coordinates into the token representation by concatenating them along the channel axis. The resulting $(D+3)$-dimensional feature is then projected back to $D$ dimensions by an MLP. We keep parameters and computation comparable to the default 3DTRL. We test this 3DTRL variant (Embed Concat.) and present the results in Table 3. We find that the concatenation variant outperforms the default variant of 3DTRL on CIFAR-10, but yields comparable or worse results on CIFAR-100 and Pick. This observation substantiates the instability of using raw 3D coordinates. In comparison, the use of 3D positional embeddings generalizes to more challenging and diverse scenarios.

Where should we have 3DTRL?

We vary the location of 3DTRL to empirically study the optimal location of 3DTRL in a 12-layer DeiT-T backbone. In Figure 7, we find that inserting 3DTRL at the earlier layers (after 1/3 of the network) yields the best performance consistently on both datasets.

How many 3DTRLs should be used?

We explore the possibility of using multiple 3DTRLs in a Transformer backbone. This usage essentially estimates 3D information from different levels of features and injects such information back into the backbone. We test several DeiT-T variants using multiple 3DTRLs on CIFAR-10, and the results are shown in Table 4. We find that using multiple 3DTRLs further increases performance in general compared to using only one 3DTRL. Specifically, inserting 3DTRLs at layers 4, 6, and 8 yields the best result among all the strategies we explored. This experiment empirically shows that multiple 3DTRLs can potentially benefit the model.

3DTRL Location(s) | N/A (DeiT baseline) | 4 | 4, 6, 8 | 4, 4, 4 | 2, 4, 6, 8
CIFAR-10 | 74.1 | 78.8 | 79.5 | 79.3 | 79.1
Table 4: CIFAR-10 performance when using multiple 3DTRLs based on DeiT-T.

Regularization effect of 3DTRL.

Mixup zhang2017mixup and CutMix yun2019cutmix are commonly used image augmentation methods in training image models, which mix two images to create an augmented image for training. Such techniques, along with other augmentations, provide diverse image samples so that the image model is regularized and avoids over-fitting. We hypothesize that Mixup & CutMix could damage structures in the original images, which may cause inaccurate pseudo-depth estimation in 3DTRL and thus hamper the training procedure. Therefore, we conduct an experiment on ImageNet-1K in which we disable the Mixup & CutMix augmentations. We train the baseline DeiT and DeiT+3DTRL from scratch, and compare the results on both the validation set and the perturbed set. In Table 5, we find that the classification accuracy for both the DeiT-T baseline and 3DTRL improves after disabling Mixup & CutMix, while 3DTRL still outperforms the baseline by 0.2%.

However, on the viewpoint-perturbed set, the baseline DeiT-T drops significantly (-6.8%) compared to 3DTRL (-1.6%). This shows the strong regularization effect of 3DTRL, which promotes generalization over different samples.

Model ImageNet-1K ImageNet-1K-Perturbed
DeiT-T 73.4 61.3
DeiT-T + Mixup & CutMix Disabled 74.2 54.5
DeiT-T+3DTRL 73.6 64.6
DeiT-T+3DTRL + Mixup & CutMix Disabled 74.4 63.0
Table 5: ImageNet-1K and ImageNet-1K-Perturbed results when Mixup & Cutmix are disabled.

4.4 Qualitative Visualizations

Although previous experiments have shown that 3DTRL is beneficial for different vision tasks, in this section, we visualize some qualitative results for a deeper understanding of 3DTRL’s effectiveness. The qualitative results include visualizations of estimated pseudo-depth maps (Figure 8), estimated camera positions (Figure 9), and estimated 3D world locations of image patches (Figure 10), which come from intermediate steps of 3DTRL. For these analyses, we use DeiT-T+3DTRL trained on either ImageNet-1K or the video alignment dataset Can, and the visualizations are performed on validation samples. In Figure 8, we observe a fine separation of the object-of-interest and the background in most of the pseudo-depth maps. Thus, these predicted pseudo-depth maps are sufficient to recover the tokens corresponding to the object-of-interest in 3D space. In Figure 9, we find that the estimated camera positions approximately reflect the true robot motion, or resemble plausible human viewing angles. Furthermore, in Figure 10, we show the image patches in their recovered 3D locations, where patches of the object-of-interest usually have larger $z$ values (pseudo-depth) and the background tokens have smaller values. For visual clarity, the estimated pseudo-depth is plotted along the vertical axis. Thus, the patches with larger values are closer to the camera and vice-versa. All these visualizations demonstrate that the estimations in 3DTRL are reasonable and sufficient for providing 3D information of the image space.

Figure 8: Top: ImageNet-1K validation samples (left half) and their viewpoint-perturbed versions (right half). Bottom: Estimated pseudo-depth maps, interpolated from the token resolution to the image resolution for better readability. Depth increases from blueish to yellowish regions.
Figure 9: Visualization of image samples (left, with colored boundaries) and estimated camera positions in a 3D space (right). The color of the boundary of each image corresponds to the camera estimated from that image. Top: Images are from a video clip captured by an egocentric (eye-in-hand) camera on a robot arm in the Can environment, ordered by timestep from left to right. The estimated camera positions approximately reflect the motion of the robot arm, which moves towards the right and down. Bottom: Samples from ImageNet-1K. The estimated camera pose of the second image (yellow boundary) is at a roughly head-up view, and the rest are at a top-down view. These estimated cameras approximately align with human perception.
Figure 10: Visualization of image patches at their estimated 3D world locations. The center of the horizontal plane (at the bottom) is the origin of the 3D space, and the vertical axis carries the estimated pseudo-depth. Patches corresponding to the object-of-interest usually have larger pseudo-depth values and are localized "farther" from the origin, while most background patches have smaller values and lie "closer" to the horizontal plane.

4.5 3DTRL for Video Representation Learning

Method                   Strategy   Smarthome (CV2)           Smarthome (CS)            NTU (CV)
                                    Acc          mPA          Acc          mPA          Acc
TimeSformer timesformer     -       59.4         27.5         75.7         56.1         86.4
+ 3DTRL                    DT       62.9 (+3.5)  34.0 (+6.5)  76.1 (+0.4)  57.0 (+0.9)  87.9 (+1.5)
+ 3DTRL                    JT       58.6 (-0.8)  30.9 (+3.4)  76.2 (+0.5)  57.2 (+1.1)  87.9 (+1.5)
                              Kinetics-400 pre-trained
TimeSformer timesformer     -       69.3         37.5         77.2         57.7         87.7
+ 3DTRL w/o K400           DT       69.5 (+0.2)  39.2 (+1.7)  77.5 (+0.3)  58.9 (+1.2)  88.8 (+1.1)
+ 3DTRL w/ K400            DT       71.9 (+2.6)  41.7 (+4.2)  77.8 (+0.6)  61.0 (+2.3)  88.6 (+0.9)
+ 3DTRL w/o K400           JT       66.6 (-2.7)  35.0 (-2.5)  77.0 (-0.2)  58.6 (+0.9)  88.6 (+0.9)
+ 3DTRL w/ K400            JT       68.2 (-0.9)  37.1 (-0.4)  77.0 (-0.2)  59.9 (+2.2)  87.7 (+0.0)
Table 6: Results on action recognition on Smarthome (CV2 & CS protocols) and NTU (CV protocol). Acc indicates classification accuracy (%) and mPA indicates mean per-class accuracy. 3DTRL is implemented with two strategies: Divided-Temporal (DT) and Joint-Temporal (JT). In methods using Kinetics-400 (K400) pre-training, the TimeSformer backbone is always initialized with pre-trained weights; 3DTRL w/o and w/ K400 denote that 3DTRL is randomly initialized and initialized from pre-trained weights, respectively.

In this section, we illustrate how 3DTRL can be adapted to video models. Since we aim at learning viewpoint-agnostic representations, a natural choice is to validate the effectiveness of 3DTRL on video datasets with multi-camera setups and cross-view evaluation. Consequently, we conduct our experiments on two multi-view action recognition datasets: Toyota Smarthome smarthome (Smarthome) and NTU-RGB+D NTU_RGB+D (NTU) for the task of action classification. For evaluation on Smarthome, we follow the Cross-View 2 (CV2) and Cross-Subject (CS) protocols proposed in smarthome, whereas on NTU, we follow the Cross-View (CV) protocol proposed in NTU_RGB+D. In cross-view protocols, the model is trained on one set of cameras and tested on a different set of cameras. Similarly, in the cross-subject protocol, the model is trained and tested on different sets of subjects. Note that for the Smarthome dataset, we report action classification results with two metrics: (i) classification accuracy (Acc) and (ii) mean per-class accuracy (mPA). More details on the NTU and Smarthome datasets are provided in the appendix.
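The cross-view protocol described above amounts to partitioning samples by camera id so that no camera appears in both splits. A minimal sketch of that partition (the record fields and camera ids here are hypothetical; the actual splits follow the official protocol lists in smarthome and NTU_RGB+D):

```python
def cross_view_split(samples, train_cameras):
    """Partition (video, camera, label) records for cross-view evaluation.

    Videos recorded by `train_cameras` form the training set; videos from
    all remaining cameras form the test set, so no camera is shared.
    """
    train = [s for s in samples if s["camera"] in train_cameras]
    test = [s for s in samples if s["camera"] not in train_cameras]
    return train, test

# Hypothetical records for illustration.
samples = [
    {"video": "v0", "camera": 1, "label": "drink"},
    {"video": "v1", "camera": 2, "label": "walk"},
    {"video": "v2", "camera": 3, "label": "sit"},
]
train, test = cross_view_split(samples, train_cameras={1, 2})
```

The cross-subject split is identical in shape, keyed on subject ids instead of camera ids.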

Network architecture & Training/Testing

TimeSformer timesformer is a straightforward extension of ViT dosovitskiy2020vit to videos that operates on spatio-temporal tokens, so 3DTRL can be easily deployed in TimeSformer as well. We use clips of size , with frames sampled at a rate of 1/32. We use a ViT-B encoder with patch size . As in our previous experimental settings, we place 3DTRL after 4 Transformer blocks in TimeSformer. We adopt the training strategies of timesformer and train all video models for 15 epochs. During inference, we sample one temporal clip from the entire video on NTU and 10 temporal clips on Smarthome. We take 3 spatial crops (top-left, center, bottom-right) from each temporal clip and obtain the final prediction by averaging the scores over all crops.
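The inference protocol above (several temporal clips, 3 spatial crops each, scores averaged) can be sketched as follows; `score_fn` is a placeholder for one forward pass of the model on a single (clip, crop) view:

```python
import numpy as np

def multi_view_inference(score_fn, video, num_clips,
                         crops=("top-left", "center", "bottom-right")):
    """Average per-class scores over temporal clips and spatial crops."""
    scores = []
    for clip_idx in range(num_clips):
        for crop in crops:
            # score_fn returns per-class scores for one (clip, crop) view.
            scores.append(score_fn(video, clip_idx, crop))
    return np.mean(scores, axis=0)

# Toy score function: scores depend only on the clip index.
toy_score_fn = lambda video, clip_idx, crop: np.array([float(clip_idx), 1.0])
avg = multi_view_inference(toy_score_fn, video=None, num_clips=3)
```

With 10 clips and 3 crops, Smarthome predictions average 30 views per video.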


In Table 6, we present the action classification results on Smarthome and NTU with 3DTRL plugged into TimeSformer. We report results for two variants of 3DTRL, implemented with the Divided-Temporal (DT) and Joint-Temporal (JT) strategies discussed in Section 3.4. We first present results for models without any pre-training, and then for models with Kinetics-400 kinetics (K400) pre-training.

3DTRL can easily take advantage of pre-trained weights because it does not change the underlying backbone Transformer – it is simply added between blocks. In Table 6, we present results for two fine-tuning scenarios: (a) 3DTRL w/o K400 and (b) 3DTRL w/ K400. In scenario (a), TimeSformer is initialized with K400 pre-trained weights while the parameters of 3DTRL are left randomly initialized; during fine-tuning, all model parameters including those of 3DTRL are trained. In scenario (b), all parameters of TimeSformer and 3DTRL are pre-trained on K400 together and then fine-tuned on the respective datasets.
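Because 3DTRL only adds parameters between backbone blocks, scenario (a) amounts to loading the checkpoint for the keys it covers and leaving the remaining (3DTRL) keys at their random initialization. A sketch of that filtering logic with plain dictionaries (the parameter names are hypothetical; in PyTorch this is typically done with `load_state_dict(..., strict=False)`):

```python
def load_backbone_only(model_state, pretrained_state):
    """Copy pre-trained values for keys shared with the checkpoint; keys
    absent from the checkpoint (e.g. 3DTRL parameters) keep their
    randomly initialized values."""
    loaded, skipped = {}, []
    for key, value in model_state.items():
        if key in pretrained_state:
            loaded[key] = pretrained_state[key]
        else:
            loaded[key] = value
            skipped.append(key)
    return loaded, skipped

# Hypothetical parameter names for illustration.
model_state = {"blocks.0.attn.w": 0.1, "3dtrl.depth.w": 0.2}
pretrained = {"blocks.0.attn.w": 9.9}
state, skipped = load_backbone_only(model_state, pretrained)
```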

We find that most variants of 3DTRL outperform the baseline TimeSformer. The exception is 3DTRL implemented with the JT strategy, which under-performs the baseline on Smarthome in most of the CV2 experiments. We find that the Joint-Temporal strategy is particularly effective for video models when a large amount of training data is available, for example in the Smarthome CS protocol and on NTU, compared to Smarthome CV2. We note a similar observation for the pre-training strategy of 3DTRL.

Our experiments show that although 3DTRL improves over the baseline under the different fine-tuning strategies, the improvement is more significant when 3DTRL is pre-trained on K400. However, when large-scale training samples are available, as in NTU, 3DTRL does not require K400 pre-training. To sum up, 3DTRL can be seen as a crucial ingredient for learning viewpoint-agnostic video representations.

5 Related Work

There has been remarkable progress in visual understanding with the shift from CNNs lecun1995convolutional; vgg16; resnet; inception to visual Transformers dosovitskiy2020vit. Transformers have shown substantial improvements over CNNs in image dosovitskiy2020vit; liu2021Swin; crossvit; kahatapitiya2021swat; deit; tnt; yuan2021tokens analysis and video timesformer; svn; ryoo2021tokenlearner; arnab2021vivit; liu2021videoswin understanding tasks due to their flexibility in learning global relations among visual tokens. Other studies combine CNNs with Transformer architectures to leverage the strengths of both structures dai2021coatnet; starformer; liu2022convnet; graham2021levit. In addition, Transformers have been shown to be effective in learning 3D representations pointtransformer. However, these architectural advances have not addressed the problem of learning viewpoint-agnostic representations. Viewpoint-agnostic representation learning is drawing increasing attention in the vision community due to its wide range of downstream applications such as 3D object detection rukhovich2022imvoxelnet, video alignment dwibedi2019temporal; gao2022fine; chen2022frame, action recognition sigurdsson2018charades; sigurdsson2018actor, pose estimation haque2016towards; sun2020view, robot learning sermanet2018time; shang2021disentangle; hsu2022vision; jangir2022look; Stadie2017ThirdPersonIL, and other tasks.

There is a broad line of work that directly utilizes 3D information such as depth haque2016towards, pose das2020vpn; das2021vpn++, and point clouds piergiovanni20214d; robert2022learning, or in some cases derives 3D structure from paired 2D inputs vijayanarasimhan2017sfm. However, these methods rely on the availability of multi-modal data that is hard to acquire, and are therefore not scalable.

Consequently, other studies have focused on learning 3D perception of the input visual signal in order to generalize the learned representation to novel viewpoints. This is done by imposing explicit geometric transform operations in CNNs stn; NPL_2021_CVPR; yan2016perspective; cao2021monoscene; rukhovich2022imvoxelnet, without the requirement of any 3D supervision. In contrast to these existing works, our Transformer-based 3DTRL imposes geometric transformations on visual tokens to recover their representation in a 3D space. To the best of our knowledge, 3DTRL is the first of its kind to learn a 3D positional embedding associated with the visual tokens for viewpoint-agnostic representation learning in different image and video related tasks.

6 Conclusion

In this work, we have presented 3DTRL, a plug-and-play module for visual Transformers that leverages 3D geometric information to learn viewpoint-agnostic representations. Through pseudo-depth estimation and learned camera parameters, 3DTRL recovers the positional information of tokens in a 3D space. Through our extensive experiments, we confirm that 3DTRL is effective in a variety of visual understanding tasks including image classification, multi-view video alignment, and cross-view action recognition, while adding minimal parameters and computation overhead.

7 Acknowledgment

We thank members of the Robotics Lab at Stony Brook University for valuable discussions. This work was supported by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Ministry of Science and ICT (No. 2018-0-00205, Development of Core Technology of Robot Task-Intelligence for Improvement of Labor Condition). This work was also supported by the National Science Foundation (IIS-2104404 and CNS-2104416).



Appendix A Implementation Details

3DTRL is easily inserted into Transformers. The components of 3DTRL are implemented by several MLPs with the required geometric transformations in between. We keep the hidden dimension of the MLPs the same as the embedding dimension of the Transformer backbone (Tiny=192, Small=384, Base=768). We provide PyTorch-style pseudo-code for inserting 3DTRL into a Transformer (Algorithm 1) and for the details of 3DTRL (Algorithm 2). We use an image Transformer as the example and omit operations on the CLS token for simplicity. The full implementation, including the video model, is provided in the supplementary files.

# Use 3DTRL with a Transformer backbone
class Transformer_with_3DTRL:
    def __init__(self, config):
        # Initialize a Transformer backbone and 3DTRL
        self.backbone = Transformer(config)
        self.trl3d = TRL3D(config)  # 3DTRL ("3DTRL" is not a valid Python identifier)
        # Index of the Transformer block before which 3DTRL is inserted
        self.trl3d_location = config.trl3d_location

    def forward(self, tokens):
        for i, block in enumerate(self.backbone.blocks):
            # Tokens go through 3DTRL at the desired insertion location
            if i == self.trl3d_location:
                tokens = self.trl3d(tokens)
            # Tokens go through backbone layers
            tokens = block(tokens)
        return tokens
Algorithm 1 PyTorch-style pseudo-code for using 3DTRL in a Transformer
class TRL3D:  # 3DTRL
    def __init__(self, config):
        # 2D coordinates on the image plane
        self.u, self.v = make_2d_coordinates()
        # Pseudo-depth estimator
        self.depth_estimator = nn.Sequential(
            nn.Linear(config.embed_dim, config.embed_dim),
            nn.Linear(config.embed_dim, 1))
        # Camera parameter estimator, including a stem and two heads
        self.camera_estimator_stem = nn.Sequential(
            nn.Linear(config.embed_dim, config.embed_dim),
            nn.Linear(config.embed_dim, config.embed_dim),
            nn.Linear(config.embed_dim, 32),
            nn.Linear(32, 32))
        # Heads for rotation and translation matrices
        self.rotation_head = nn.Linear(32, 3)
        self.translation_head = nn.Linear(32, 3)
        # 3D positional embedding layer
        self.pos_embedding_3d = nn.Sequential(
            nn.Linear(3, config.embed_dim),
            nn.Linear(config.embed_dim, config.embed_dim))

    def forward(self, tokens):
        # Depth estimation
        depth = self.depth_estimator(tokens)
        camera_centered_coords = uvd_to_xyz(self.u, self.v, depth)
        # Camera estimation
        interm_rep = self.camera_estimator_stem(tokens)
        rot = self.rotation_head(interm_rep)
        trans = self.translation_head(interm_rep)
        rot = make_rotation_matrix(rot)
        # Transformation from camera-centered to world space
        world_coords = transform(camera_centered_coords, rot, trans)
        # Convert world coordinates to 3D positional embeddings
        pos_embed_3d = self.pos_embedding_3d(world_coords)
        # Generate output tokens
        return tokens + pos_embed_3d
Algorithm 2 PyTorch-style pseudo-code for 3DTRL
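Algorithm 2 references `uvd_to_xyz`, `make_rotation_matrix`, and `transform` without defining them. Below is a NumPy sketch of plausible implementations: a pinhole back-projection with an assumed unit focal length, an axis-angle rotation via Rodrigues' formula, and a rigid transform. These are illustrative assumptions; the actual parameterizations in our implementation may differ.

```python
import numpy as np

def uvd_to_xyz(u, v, d, focal=1.0):
    """Back-project pixel coordinates (u, v) with pseudo-depth d into
    camera-centered 3D coordinates, assuming a pinhole camera with the
    principal point at the origin."""
    x = u * d / focal
    y = v * d / focal
    return np.stack([x, y, d], axis=-1)  # shape (N, 3)

def make_rotation_matrix(r):
    """Axis-angle vector r of shape (3,) -> 3x3 rotation matrix
    (Rodrigues' formula)."""
    theta = np.linalg.norm(r)
    if theta < 1e-8:
        return np.eye(3)
    k = r / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def transform(points, rot, trans):
    """Map camera-centered points into the world frame: X_w = R X_c + t."""
    return points @ rot.T + trans
```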

Appendix B Settings for Image Classification


For the task of image classification, we provide a thorough evaluation on three popular image datasets: CIFAR-10 cifar, CIFAR-100 cifar, and ImageNet imagenet. CIFAR-10/100 each consist of 50k training and 10k test images, and ImageNet has 1.3M training and 50k validation images.

Training Configurations

We follow the configurations introduced in DeiT deit. We provide a copy of configurations here in Table 7 (CIFAR) and Table 8 (ImageNet-1K) for reference. We use 4 NVIDIA Tesla V100s to train models with Tiny, Small and Base backbones on ImageNet-1K for 22 hours, 3 days and 5 days respectively.

Input Size 32×32
Patch Size 2×2
Batch Size 128
Optimizer AdamW
Optimizer Epsilon 1.0e-06
Momentum β1, β2 = 0.9, 0.999
layer-wise lr decay 0.75
Weight Decay 0.05
Gradient Clip None
Learning Rate Schedule Cosine
Learning Rate 1e-3
Warmup LR 1.0e-6
Min LR 1e-6
Epochs 50
Warmup Epochs 5
Decay Rate 0.988
drop path 0.1
Exponential Moving Average (EMA) True
EMA Decay 0.9999
Random Resize & Crop Scale and Ratio (0.08, 1.0), (0.67, 1.5)
Random Flip Horizontal 0.5; Vertical 0.0
Color Jittering None
Auto-augmentation rand-m15-n2-mstd1.0-inc1
Mixup True
Cutmix True

Mixup, Cutmix Probability 0.8, 1.0
Mixup Mode Batch
Label Smoothing 0.1
Table 7: CIFAR Training Settings
Input Size 224×224
Crop Ratio 0.9
Batch Size 512
Optimizer AdamW
Optimizer Epsilon 1.0e-06
Momentum 0.9
Weight Decay 0.3
Gradient Clip 1.0
Learning Rate Schedule Cosine
Learning Rate 1.5e-3
Warmup LR 1.0e-6
Min LR 1.0e-5
Epochs 300
Decay Epochs 1.0
Warmup Epochs 15
Cooldown Epochs 10
Patience Epochs 10
Decay Rate 0.988
Exponential Moving Average (EMA) True
EMA Decay 0.99992
Random Resize & Crop Scale and Ratio (0.08, 1.0), (0.67, 1.5)
Random Flip Horizontal 0.5; Vertical 0.0
Color Jittering 0.4
Auto-augmentation rand-m15-n2-mstd1.0-inc1
Mixup True
Cutmix True
Mixup, Cutmix Probability 0.5, 0.5
Mixup Mode Batch
Label Smoothing 0.1
Table 8: ImageNet-1K Training Settings deit
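The learning-rate schedule in Tables 7 and 8 (linear warmup to the base LR, then cosine decay to a minimum LR) can be written as a pure function of the epoch. This is a generic illustration of the schedule shape, not the exact timm implementation used by deit:

```python
import math

def cosine_lr(epoch, base_lr, warmup_lr, min_lr, warmup_epochs, total_epochs):
    """Linear warmup from warmup_lr to base_lr, then cosine decay to min_lr."""
    if epoch < warmup_epochs:
        frac = epoch / warmup_epochs
        return warmup_lr + frac * (base_lr - warmup_lr)
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# ImageNet-1K values from Table 8.
lr_start = cosine_lr(0, 1.5e-3, 1e-6, 1e-5, warmup_epochs=15, total_epochs=300)
lr_end = cosine_lr(300, 1.5e-3, 1e-6, 1e-5, warmup_epochs=15, total_epochs=300)
```

At epoch 0 the schedule returns the warmup LR (1e-6) and at the final epoch it reaches the minimum LR (1e-5).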

Appendix C Settings for Video Alignment


We provide the statistics of the 5 datasets used for video alignment in Table 9. In general, datasets with fewer training videos, more (and more diverse) viewpoints, and longer videos are harder to align. We will also provide a copy of the used/converted datasets upon publication.

Dataset # Training/Validation/Test Videos # Viewpoints Average Frames/Video
Pouring 45 / 10 / 14 2 266
MC 4 / 2 / 2 9 66
Pick 10 / 5 / 5 10 60
Can 200 / 50 / 50 5 38
Lift 200 / 50 / 50 5 20
Table 9: Statistics of multi-view datasets used for video alignment.

Training Configurations

The training settings for video alignment are listed in Table 10. The settings are the same for all datasets and all methods for fair comparison. GPU hours required for training vary across datasets, depending on dataset size and early stopping (convergence). In total, we use approximately 24 hours to fully train on all 5 datasets using an NVIDIA RTX A5000.

Positive Window of TCN Loss 3 frames in MC, Pick, Pouring; 2 frames in Can and Lift
Learning Rate 1e-6
Batch Size 1
Optimizer Adam
Gradient Clip 10.0
Early Stopping 10 epochs
Random Seed 42
Augmentations None
Table 10: Training Settings for Video Alignment
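The "positive window" in Table 10 controls which frames count as positives for the time-contrastive (TCN) loss: an anchor frame is paired with a temporally close frame (within the window, drawn from another view) as the positive, and a temporally distant frame as the negative. A hedged sketch of that sampling (the function and its interface are illustrative, not our exact data loader):

```python
import random

def sample_tcn_triplet(num_frames, anchor_t, window, rng=random):
    """Return (anchor, positive, negative) frame indices for TCN training.

    The positive lies within `window` steps of the anchor (in practice taken
    from a second camera view); the negative lies outside the window.
    """
    lo = max(0, anchor_t - window)
    hi = min(num_frames - 1, anchor_t + window)
    positive = rng.randint(lo, hi)
    negatives = [t for t in range(num_frames) if abs(t - anchor_t) > window]
    negative = rng.choice(negatives)
    return anchor_t, positive, negative

a, p, n = sample_tcn_triplet(num_frames=60, anchor_t=30, window=3)
```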

Appendix D Settings for Video Representation Learning


Our dataset choices are based on multi-camera setups in order to enable cross-view evaluation. Therefore, we evaluate the effectiveness of 3DTRL on two multi-view datasets, Toyota Smarthome smarthome and NTU-RGB+D NTU_RGB+D. We also use Kinetics-400 kinetics to pre-train the video backbone before plugging in 3DTRL.

Toyota-Smarthome (Smarthome) is a recent ADL dataset recorded in an apartment where 18 older subjects carry out tasks of daily living during a day. The dataset contains 16.1k video clips, 7 different camera views, and 31 complex activities performed in a natural way without strong prior instructions. For evaluation on this dataset, we follow the cross-subject (CS) and cross-view (CV2) protocols proposed in smarthome. We ignore the CV1 protocol due to limited training samples.

NTU RGB+D (NTU) is acquired with a Kinect v2 camera and consists of 56880 video samples with 60 activity classes. The activities were performed by 40 subjects and recorded from 80 viewpoints. For each frame, the dataset provides RGB, depth and a 25-joint skeleton of each subject in the frame. For evaluation, we follow the two protocols proposed in NTU_RGB+D: cross-subject (CS) and cross-view (CV).

Kinetics-400 (K400) is a large-scale dataset with ∼240k training, 20k validation, and 35k testing videos in 400 human action categories. However, this dataset does not pose the viewpoint challenges we address in this paper, so we use it only for pre-training, as in earlier studies.

Training Configurations

The training settings for action recognition on both datasets follow the configurations provided in timesformer. We train all video models on 4 RTX 8000 GPUs with a batch size of 4 per GPU, and perform gradient accumulation to reach an effective batch size of 64. Similar to timesformer, we train our video models with an SGD optimizer with momentum and weight decay.
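Gradient accumulation reaches the effective batch of 64 by summing gradients of micro-batch losses scaled by 1/accum_steps before each optimizer step. The arithmetic can be checked with a plain NumPy linear-regression gradient (a sketch of the bookkeeping only, not the actual training loop):

```python
import numpy as np

def mse_grad(w, x, y):
    """Gradient of the mean squared error 0.5*mean((x*w - y)^2) w.r.t. scalar w."""
    return np.mean((x * w - y) * x)

rng = np.random.default_rng(0)
x, y = rng.normal(size=64), rng.normal(size=64)
w = 0.5

# Full-batch gradient (effective batch size 64).
full_grad = mse_grad(w, x, y)

# Accumulated gradient: 16 micro-batches of 4, each scaled by 1/16.
accum_steps, micro = 16, 4
accum_grad = 0.0
for i in range(accum_steps):
    xb = x[i * micro:(i + 1) * micro]
    yb = y[i * micro:(i + 1) * micro]
    accum_grad += mse_grad(w, xb, yb) / accum_steps
```

Because each micro-batch gradient is a mean over 4 samples and there are 16 equally weighted micro-batches, the accumulated gradient equals the full-batch gradient.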