TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation

08/12/2021 ∙ by Jinyu Yang, et al. ∙ 0

Unsupervised domain adaptation (UDA) aims to transfer the knowledge learnt from a labeled source domain to an unlabeled target domain. Previous work is mainly built upon convolutional neural networks (CNNs) to learn domain-invariant representations. Despite the rapid growth in applications of Vision Transformer (ViT) to vision tasks, the capability of ViT in adapting cross-domain knowledge remains unexplored in the literature. To fill this gap, this paper first comprehensively investigates the transferability of ViT on a variety of domain adaptation tasks. Surprisingly, ViT demonstrates superior transferability over its CNN-based counterparts by a large margin, and its performance can be further improved by incorporating adversarial adaptation. Nevertheless, directly using CNN-based adaptation strategies fails to take advantage of ViT's intrinsic merits (e.g., attention mechanism and sequential image representation), which play an important role in knowledge transfer. To remedy this, we propose a unified framework, namely Transferable Vision Transformer (TVT), to fully exploit the transferability of ViT for domain adaptation. Specifically, we devise a novel and effective unit, which we term Transferability Adaption Module (TAM). By injecting learned transferabilities into attention blocks, TAM compels ViT to focus on both transferable and discriminative features. Besides, we leverage discriminative clustering to enhance feature diversity and separation, which are undermined during adversarial domain alignment. To verify its versatility, we perform extensive studies of TVT on four benchmarks, and the experimental results demonstrate that TVT attains significant improvements compared to existing state-of-the-art UDA methods.





Deep neural networks (DNNs) demonstrate unprecedented achievements on various machine learning problems and applications. However, such impressive performance heavily relies on massive amounts of labeled data, which require considerable time and labor to collect. Therefore, it is desirable to train models that can leverage rich labeled data from a different but related domain and generalize well on target domains with no or limited labeled examples. Unfortunately, the canonical supervised-learning paradigm suffers from the domain shift issue, which poses a major challenge in adapting models across domains. This motivates the research on unsupervised domain adaptation (UDA) Wang and Deng (2018), which is a special scenario of transfer learning Pan and Yang (2009). The key idea of UDA is to project data points of the labeled source domain and the unlabeled target domain into a common feature space, such that the projected features are both discriminative (semantically meaningful) and domain-invariant and, in turn, generalize well to bridge the domain gap. To achieve this goal, various methods have been proposed in the past decades, among which adversarial adaptation has become the dominant technique. It attempts to align cross-domain representations by minimizing an adversarial loss through a domain discriminator Ganin et al. (2016); Tzeng et al. (2017); Long et al. (2017a).

Recently, Vision Transformer (ViT) Dosovitskiy et al. (2020) has received increasing attention in the vision community. Different from CNNs, which act on local receptive fields of the given image, ViT models long-range dependencies among visual features across the entire image through the global self-attention mechanism. Specifically, in ViT each image is split into a sequence of fixed-size non-overlapping patches, which are then linearly embedded and combined with position embeddings. To be consistent with the NLP paradigm, a class token is prepended to the patch tokens, serving as the representation of the whole image. Then, those sequential embeddings are fed into a stack of transformer layers to learn the desired visual representations. Due to its advantages in global context modeling, ViT has obtained excellent results on various vision tasks, such as image classification Dosovitskiy et al. (2020), object detection Carion et al. (2020); Wang et al. (2021), segmentation Zheng et al. (2020); Liu et al. (2021), and video understanding Girdhar et al. (2019); Neimark et al. (2021).
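The tokenization step described above can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function names, the random weights, and the embedding width of 64 are assumptions made for illustration (ViT-B uses 768), and position embeddings are added element-wise, as in the original ViT.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed(image, patch_size, w_embed, cls_token, pos_embed):
    """Linearly embed patches, prepend the class token, add position embeddings."""
    tokens = patchify(image, patch_size) @ w_embed           # (N, D)
    tokens = np.concatenate([cls_token[None, :], tokens])    # (N + 1, D)
    return tokens + pos_embed                                # (N + 1, D)

rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))
dim = 64  # illustrative width, assumed for this sketch
w_embed = rng.normal(size=(16 * 16 * 3, dim)) * 0.02
cls_token = np.zeros(dim)
pos_embed = rng.normal(size=(14 * 14 + 1, dim)) * 0.02
sequence = embed(image, 16, w_embed, cls_token, pos_embed)
print(sequence.shape)  # (197, 64): 196 patch tokens plus one class token
```

A 224×224 image with 16×16 patches yields 14×14 = 196 patch tokens, so the transformer receives a sequence of 197 tokens once the class token is prepended.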

Although ViT is becoming increasingly popular, two important questions related to domain adaptation remain unanswered. First, how transferable is ViT across different domains, compared to its CNN counterparts? As ViT is convolution-free and lacks some inductive biases inherent to CNNs (e.g., locality and translation equivariance), it relies on large-scale pre-training to trump inductive bias. Such a training prerequisite, along with the learned global attention, may provide ViT with an outstanding capability in domain transfer, yet this hypothesis has not been investigated. The second question is: how can we properly improve ViT in adapting to different domains? One intuitive approach is to directly apply an adversarial discriminator to the class tokens to perform adversarial alignment, where the state of a class token represents the entire image. However, cross-domain alignment of such global features assumes that all regions or aspects of the image have equal transferability and discriminative potential, which is not always tenable. For instance, background regions can be more easily aligned across domains, while foreground regions are more discriminative. In other words, some discriminative features may lack transferability, and some transferable features may not contribute much to the downstream task (e.g., classification). Therefore, in order to properly enhance the transferability of ViT, it is critical to identify fine-grained features that are both transferable and discriminative.

In this paper, we present our answers to the two aforementioned questions. To fill the gap in understanding ViT’s transferability, we first conduct a comprehensive study of vanilla ViT Dosovitskiy et al. (2020) on public UDA benchmarks. As expected, our experimental results demonstrate that ViT is more transferable than its strong CNN-based counterparts, which can be partially explained by its global context modeling and large-scale pre-training. Besides, we observe further improvements by applying an adversarial discriminator to the class tokens of ViT, which only aligns global representations. However, such a strategy rests on the oversimplified assumption of uniform transferability and ignores the inherent properties of ViT that are beneficial for domain adaptation: i) sequential patch tokens give us free access to fine-grained features; ii) the self-attention mechanism in the transformer naturally works as a discriminative probe. In light of this, we propose a unified UDA framework that makes full use of ViT’s inherent merits. We name it Transferable Vision Transformer (TVT).

The key idea of our method is to retain both transferable and discriminative features which are essential in knowledge adaptation. To achieve this goal, we first introduce the novel Transferability Adaption Module (TAM) built upon a conventional transformer. TAM uses a patch-level domain discriminator to measure the transferabilities of patch tokens, and injects learned transferabilities into the multi-head self-attention block of a transformer. On one hand, the attention weights of patch tokens in the self-attention block are used to determine their semantic importance, i.e., the features with larger attention are more discriminative yet without transferability guarantees. On the other hand, as patch tokens can be regarded as fine-grained representations of an image, the higher transferability of a token means the local features are more transferable across domains though not necessarily discriminative. By simply replacing the last transformer of ViT with a plug-and-play TAM, we could drive ViT to focus on both transferable and discriminative features.

Since our method performs adversarial adaptation that forces the learned features of two domains to be similar, one underlying side-effect is that the discriminative information of target domain might be destroyed during feature alignment. To address this problem, we design a Discriminative Clustering Module (DCM) inspired by the clustering assumption. The motivation is to enforce the individual target prediction close to one-hot encoding (well separated) and the global target prediction to be uniformly distributed (global diverse), such that the learnt target-domain representation could retain maximum discriminative information about the input values.

Contributions of this paper are summarized as follows:

  • As far as we know, we are the first to investigate the capability of ViT in transferring knowledge on the domain adaptation task. We believe this work gives good insights for understanding and exploring ViT’s transferability when applied to various vision tasks.

  • We propose TAM, which leverages the intrinsic characteristics of ViT so that our method can capture both transferable and discriminative features for domain adaptation. Moreover, we adopt the discriminative clustering assumption to alleviate the loss of discriminative information during adversarial alignment.

  • Without any bells and whistles, our method sets a new competitive baseline across several public UDA benchmarks.

Related Work

Unsupervised Domain Adaptation

Transfer learning aims to learn transferable knowledge that is generalizable across different domains with different distributions Pan and Yang (2009); Ying et al. (2018). This is built upon the evidence that feature representations in machine learning models, especially in deep neural networks, are transferable Yosinski et al. (2014). The main challenge of transfer learning is to reduce the domain shift or the discrepancy of the marginal probability distributions across domains Wang and Deng (2018). In the past decades, various methods have been proposed to address one canonical transfer learning problem, i.e., unsupervised domain adaptation (UDA), where no labels are available for the target domain. For instance, DDC Tzeng et al. (2014) attempted to learn domain-invariant features by minimizing the Maximum Mean Discrepancy (MMD) Borgwardt et al. (2006) between two domains. Long et al. further improved DDC by embedding hidden representations of all task-specific layers in a reproducing kernel Hilbert space and used a multiple-kernel variant of MMD to measure the domain distance Long et al. (2015). Long et al. also proposed to align joint distributions of multiple domain-specific layers across domains through a joint maximum mean discrepancy metric Long et al. (2017b). Another line of effort was inspired by the success of adversarial learning Goodfellow et al. (2014). By introducing a domain discriminator and modeling domain adaptation as a minimax problem Ganin et al. (2016); Tzeng et al. (2017); Long et al. (2017a), an encoder is trained to generate domain-invariant features by deceiving a discriminator that tries to distinguish features of the source domain from those of the target domain.
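To make the MMD criterion mentioned above concrete, the following is a minimal sketch of a biased estimate of the squared MMD with a single Gaussian kernel. It is not the DDC or DAN implementation: the function names and the `bandwidth` parameter are assumptions for illustration, and the multiple-kernel variant is not shown.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dist = (np.sum(a**2, axis=1)[:, None]
               + np.sum(b**2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))

def mmd2(source, target, bandwidth=1.0):
    """Biased estimate of the squared MMD between two feature sets."""
    k_ss = rbf_kernel(source, source, bandwidth).mean()
    k_tt = rbf_kernel(target, target, bandwidth).mean()
    k_st = rbf_kernel(source, target, bandwidth).mean()
    return k_ss + k_tt - 2.0 * k_st

rng = np.random.default_rng(0)
source_features = rng.normal(size=(100, 5))
print(mmd2(source_features, source_features + 0.1))  # small shift, small MMD
print(mmd2(source_features, source_features + 3.0))  # large shift, larger MMD
```

Minimizing such a quantity over the feature extractor pulls the source and target feature distributions together, which is the role MMD plays in the methods cited above.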

It is noteworthy that all of these methods completely or partially used CNNs as the fundamental block LeCun et al. (1998); Krizhevsky, Sutskever, and Hinton (2012); He et al. (2016). By contrast, our method explores ViT Dosovitskiy et al. (2020) to tackle the UDA problem, as we believe ViT has better potential and capability in domain adaptation owing to some of its properties. Although previous UDA methods (e.g., adversarial learning) are able to improve vanilla ViT to some extent, they were not designed for transformer-based models, and therefore cannot leverage ViT’s inherent ability to provide attention information and fine-grained representations. In contrast, our method is designed around the nature of ViT and can effectively leverage the transferability and discrimination of each feature for knowledge transfer, and thus has a better chance of fully exploiting the adaptation power of ViT.

Vision Transformer

Transformers Vaswani et al. (2017) were first proposed in the NLP field and have demonstrated record-breaking performance on various language tasks, e.g., text classification and machine translation Devlin et al. (2018); Beltagy, Peters, and Cohan (2020); Zhou et al. (2020). Much of this impressive achievement is attributed to their power to capture long-range dependencies through the attention mechanism. Spurred by this, some recent studies attempted to integrate attention into CNNs to augment feature maps, aiming to provide the capability of modeling heterogeneous interactions Wang et al. (2018); Bello et al. (2019); Hu et al. (2018). Another pioneering line of work on completely convolution-free architectures is Vision Transformer (ViT), which applies transformers to a sequence of fixed-size non-overlapping image patches. Different from CNNs, which rely on image-specific inductive biases (e.g., locality and translation equivariance), ViT benefits from large-scale pre-training data and global context modeling. One such method Dosovitskiy et al. (2020), known for its simplicity and accuracy/compute trade-off, competes favorably against CNNs on the classification task and lays the foundation for applying transformers to different vision tasks. ViT and its variants have proved their wide applicability in object detection Carion et al. (2020); Zhu et al. (2020); Wang et al. (2021), segmentation Zheng et al. (2020); Wang et al. (2020), and video understanding Girdhar et al. (2019); Neimark et al. (2021), etc.

Despite the success of ViT on different vision tasks, to the best of our knowledge, neither its transferability nor the design of UDA methods with ViT has been previously discussed in the literature. To this end, we focus in this paper on investigating ViT’s capability in transferring knowledge across different domains. We propose a novel UDA framework tailored for ViT by exploring its intrinsic merits and demonstrate its superiority over existing methods.


Adversarial Learning UDA

We consider the image classification task in UDA, where a labeled source domain $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ with $n_s$ examples and an unlabeled target domain $D_t = \{x_j^t\}_{j=1}^{n_t}$ with $n_t$ examples are given. The goal of UDA is to learn features that are both discriminative and invariant to the domain discrepancy, and in turn guarantee accurate prediction on the unlabeled target data. Here, a common practice is to jointly perform feature learning, domain adaptation, and classifier learning by optimizing the following loss function:

$$\mathcal{L}_{clc}(x^s, y^s) + \alpha \mathcal{L}_{dis}(x^s, x^t)$$

where $\mathcal{L}_{clc}$ is the supervised classification loss, $\mathcal{L}_{dis}$ is a transfer loss with various possible implementations, and $\alpha$ is used to control the importance of $\mathcal{L}_{dis}$. One of the most commonly used $\mathcal{L}_{dis}$ is the adversarial loss, which encourages a domain-invariant feature space through a domain discriminator Ganin et al. (2016).

Self-attention Mechanism

The main building block of ViT is Multi-head Self-Attention (MSA), which is used in the transformer to capture long-range dependencies Vaswani et al. (2017). Specifically, MSA concatenates multiple scaled dot-product attention (SA for short) modules, where each SA module takes a set of queries ($Q$), keys ($K$), and values ($V$) as inputs. In order to learn dependencies between distinct positions, SA computes the dot products of the query with all keys, and applies a softmax function to obtain the weights on the values:

$$\mathrm{SA}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) V$$

where $d$ is the dimension of $Q$ and $K$. With $\mathrm{SA}(Q, K, V)$, MSA is defined as:

$$\mathrm{MSA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_k) W^O, \qquad \mathrm{head}_i = \mathrm{SA}(Q W_i^Q, K W_i^K, V W_i^V)$$

where $W_i^Q$, $W_i^K$, $W_i^V$ are projections of different heads, and $W^O$ is another mapping function. Intuitively, using multiple heads allows MSA to jointly attend to information from different representation subspaces at different positions.
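The two definitions above (scaled dot-product attention, then several heads concatenated and mapped by an output projection) can be written as a short NumPy sketch. The function names, tensor shapes, and random weights are assumptions for illustration only, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v):
    """Scaled dot-product attention; returns the output and the weights."""
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))
    return weights @ v, weights

def multi_head_self_attention(x, w_q, w_k, w_v, w_o):
    """x: (N, D); w_q, w_k, w_v: (heads, D, d_head); w_o: (heads * d_head, D)."""
    heads = [self_attention(x @ wq, x @ wk, x @ wv)[0]
             for wq, wk, wv in zip(w_q, w_k, w_v)]
    return np.concatenate(heads, axis=-1) @ w_o

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))                 # 5 tokens of width 8
w_q, w_k, w_v = (rng.normal(size=(2, 8, 4)) for _ in range(3))
w_o = rng.normal(size=(8, 8))
print(multi_head_self_attention(tokens, w_q, w_k, w_v, w_o).shape)  # (5, 8)
```

Each row of `weights` sums to one, so every output token is a convex combination of the value vectors within a head.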


In this section, we first investigate ViT’s ability in knowledge transfer on various adaptation tasks. After that, we make an early attempt to improve ViT’s transferability by incorporating adversarial learning. Finally, we introduce our method, named Transferable Vision Transformer (TVT), which consists of two new adaptation modules that further improve ViT’s capability for cross-domain adaptation.

ViT’s Transferability

To the best of our knowledge, the transferability of ViT has not been studied in the literature before, although ViT and its variants have shown great success in various vision tasks. To probe ViT’s capability of domain adaptation, we choose the vanilla ViT Dosovitskiy et al. (2020) as the backbone in all of our studies, owing to its simplicity and popularity. We train vanilla ViT with labeled source data only and assess its transferability by the classification accuracy on target data. As mentioned above, CNN-based approaches have dominated UDA research in the past decades and demonstrated great success. Therefore, we compare vanilla ViT with CNN-based architectures, including LeNet LeCun et al. (1998), AlexNet Krizhevsky, Sutskever, and Hinton (2012), and ResNet He et al. (2016). All experiments are performed on well-established benchmarks with standard evaluation protocols.

Take the results on the Office-31 dataset as an example. As shown in Table 2, Source Only ViT obtains an impressive classification accuracy of 89.45%, which is much better than its strong CNN opponents AlexNet (70.1%) and ResNet (76.1%). A similar phenomenon can be observed in the other benchmark results, where ViT competes favorably against, if not better than, the other state-of-the-art CNN backbones, as shown in Tables 1, 3, and 4. Surprisingly, Source Only ViT even outperforms strong CNN-based UDA approaches without any bells and whistles. For instance, it achieves an average accuracy of 78.74% on the Office-Home dataset (Table 3), beating all CNN-based UDA methods. Compared to SHOT Liang, Hu, and Feng (2020), recognized as the best UDA model to date, Source Only ViT obtains a 7% absolute accuracy boost, a big step in pushing the frontier of UDA research. This evidence supports our hypothesis that ViT is more transferable, which is partially explained by its large-scale pre-training and global context modeling. However, as observed in Table 1, a large gap still exists between the Source Only and Target Only models (88.3% vs 99.22%), which indicates room for further improvement of ViT’s transferability.

ViT w/ Adversarial Adaptation: Baseline

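We first investigate how ViT benefits from adversarial adaptation Ganin et al. (2016), which is widely used in CNN-based UDA methods. We follow the typical adversarial adaptation fashion that employs an encoder $G_f$ for feature learning, a classifier $G_c$ for classification, and a domain discriminator $G_d$ for global feature alignment. Here, $G_f$ is implemented as ViT and $G_d$ is applied to the output state of the class tokens of the source and target images. To accomplish domain knowledge adaptation, $G_f$ and $G_d$ play a minimax game: $G_f$ learns domain-invariant features to deceive $G_d$, while $G_d$ distinguishes source-domain features from those of the target domain. The objective can be formulated as:

$$\mathcal{L}_{clc}(x^s, y^s) = \frac{1}{n_s} \sum_{x_i \in D_s} \mathcal{L}_{ce}\big(G_c(G_f(x_i^s)), y_i^s\big)$$

$$\mathcal{L}_{dis}(x^s, x^t) = -\frac{1}{n} \sum_{x_i \in D} \mathcal{L}_{ce}\big(G_d(G_f(x_i^*)), y_i^d\big)$$

where $n = n_s + n_t$ is the total number of source and target examples, $D = D_s \cup D_t$ is the union of the two domains, $\mathcal{L}_{ce}$ is the cross-entropy loss, the superscript $*$ can be either $s$ or $t$ to denote a source or a target domain, and $y^d$ denotes the domain label (i.e., $y^d = 1$ is source, $y^d = 0$ is target).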
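As an illustration of the domain-discrimination term, the sketch below computes the mean binary cross-entropy of a discriminator whose output is read as the probability that a feature comes from the source domain (label 1 for source, 0 for target, as stated above). The function names are hypothetical, and the sign convention for the encoder is an assumption of this sketch: the discriminator is trained to lower this value while the encoder is trained to raise it.

```python
import math

def binary_cross_entropy(prob_source, domain_label, eps=1e-12):
    """Cross-entropy for one sample; prob_source is P(domain = source)."""
    p = min(max(prob_source, eps), 1.0 - eps)
    return -(domain_label * math.log(p) + (1 - domain_label) * math.log(1.0 - p))

def discriminator_loss(source_probs, target_probs):
    """Mean cross-entropy over source (label 1) and target (label 0) samples."""
    losses = [binary_cross_entropy(p, 1) for p in source_probs]
    losses += [binary_cross_entropy(p, 0) for p in target_probs]
    return sum(losses) / len(losses)

def encoder_transfer_loss(source_probs, target_probs):
    """Assumed encoder-side term: the negated discriminator loss."""
    return -discriminator_loss(source_probs, target_probs)

# A confident discriminator has low loss; a fooled one sits at log(2).
print(discriminator_loss([0.99, 0.98], [0.01, 0.03]))
print(discriminator_loss([0.5, 0.5], [0.5, 0.5]))
```

In practice this minimax game is often implemented with a gradient reversal layer, which is not shown here.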

We denote ViT with adversarial adaptation as our Baseline. As shown in Tables 1, 2, 3, and 4, Baseline shows 7.8%, 0.78%, 1.56%, and 3.21% absolute accuracy improvements over vanilla ViT on the four benchmarks, respectively. Those results reveal that global feature alignment with a domain discriminator helps ViT’s transferability. However, compared with the digit recognition task, Baseline achieves limited improvements on object recognition, which is more complicated and challenging. We attribute this observation to the fact that simply applying global adversarial alignment cannot exploit ViT’s full transferable power, since it fails to consider two key factors: (i) not all regions/features are equally transferable or discriminative, and for effective knowledge transfer it is essential to focus on features that are both; (ii) ViT naturally provides fine-grained features, since its forward pass operates on sequential tokens, and the attention weights in the transformer convey the discriminative potential of patch tokens. To address these challenges and fully leverage the merits of ViT, we further propose a new UDA framework named Transferable Vision Transformer (TVT).

Figure 1: An overview of the proposed TVT framework. As in ViT, both source and target images are split into fixed-size patches which are then linearly mapped and embedded with positional information. The generated patches are fed into a transformer encoder whose last layer is replaced by Transferability Adaptation Module (TAM). Feature learning, adversarial domain adaptation and classification are accomplished by ViT-akin backbone, two domain discriminators (on patch-level and global-level), Discriminative Clustering Module (DCM) and the MLP-based classifier.

Transferable Vision Transformer (TVT)

An overview of TVT is shown in Figure 1, which contains two main modules: (i) a Transferability Adaptation Module (TAM) and (ii) a Discriminative Clustering Module (DCM). These two modules are highly interrelated and play a complementary role in transferring knowledge for ViT-based architectures. TAM encourages the output state of class token to focus on both transferable and semantic meaningful features, and DCM enforces the aligned features of target-domain samples to be clustered with large margins. As a consequence, the features learnt by TVT are discriminative in classification and transferable across domains as well. We detail each module in what follows.

Transferability Adaptation Module

As shown in Figure 1, we introduce the Transferability Adaptation Module (TAM), which explicitly considers the intrinsic merits of ViT, i.e., attention mechanisms and sequential patch tokens. As the patch tokens are regarded as local features of an image, they correspond to different image regions or capture different visual aspects as fine-grained representations of an image. Assuming that patch tokens differ in semantic importance and transferability, TAM aims at assigning different weights to those tokens, to encourage the learned image representation, i.e., the output state of the class token, to attend to patch tokens that are both transferable and discriminative. While the self-attention weights in ViT could be employed as discriminative weights, one major hurdle here is that the transferability of each patch token is not available. To bypass this difficulty, we adopt a patch-level domain discriminator $D_l$ that matches cross-domain local features Pei et al. (2018); Wang et al. (2019) by optimizing:

$$\mathcal{L}_{pat}(x^s, x^t) = -\frac{1}{nR} \sum_{x_i \in D} \sum_{r=1}^{R} \mathcal{L}_{ce}\big(D_l(G_f(x_{ir}^*)), y_{ir}^d\big)$$

where $R$ is the number of patches, $f_{ir} = G_f(x_{ir}^*)$ is the feature of the $r$-th patch of the $i$-th image, and $D_l(f_{ir})$ is the probability of this region belonging to the source domain. During adversarial learning, $D_l$ tries to assign 1 to a source-domain patch and 0 to a target-domain one, while $G_f$ combats such circumstances. Conceptually, a patch that can easily deceive $D_l$ (e.g., $D_l$ is around 0.5) is more transferable across domains and should be given a higher transferability. We therefore use $T(f_{ir}) = H(D_l(f_{ir}))$ to measure the transferability of the $r$-th token of the $i$-th image, where $H(\cdot)$ is the standard entropy function. Another explanation of the transferability is that, by assigning weights to different patches, it disentangles an image into common-space representations and domain-specific representations, while the passing paths of domain-specific features are softly suppressed.
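The entropy-based transferability score can be sketched in a few lines. The base-2 logarithm, which keeps each score in [0, 1], and the function names are assumptions of this sketch and are not taken from the paper.

```python
import math

def binary_entropy(p, eps=1e-12):
    """Entropy in bits of a Bernoulli(p) variable; lies in [0, 1]."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def patch_transferability(source_probs):
    """Map per-patch discriminator outputs P(source) to transferability scores."""
    return [binary_entropy(p) for p in source_probs]

# A patch the discriminator cannot place (0.5) scores highest;
# patches it classifies confidently (0.99 or 0.02) score near zero.
print(patch_transferability([0.5, 0.99, 0.02]))
```

Patches with scores near one are those whose domain the discriminator cannot tell, which is the sense in which they are treated as transferable.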

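We then convert the conventional MSA into the transferable MSA (T-MSA) by transferability adaptation, i.e., injecting the learned transferabilities into the attention weights of the class token. Our T-MSA is built upon the transferable self-attention (TSA) block, which is formally defined as:

$$\mathrm{TSA}(q, K, V) = \mathrm{softmax}\left(\frac{qK^T}{\sqrt{d}}\right) \odot [1; T(K_{patch})] \, V$$

where $q$ is the query of the class token, $K_{patch}$ is the key of the patch tokens, $\odot$ is the Hadamard product, and $[;]$ is the concatenation operation. Here, $\mathrm{softmax}(qK^T/\sqrt{d})$ and $[1; T(K_{patch})]$ indicate the discrimination (semantic importance) and the transferability of each patch token, respectively. To jointly attend to the transferabilities of different representation subspaces and of different locations, we define T-MSA as:

$$\mathrm{T\text{-}MSA}(q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_k) W^O, \qquad \mathrm{head}_i = \mathrm{TSA}(q W_i^q, K W_i^K, V W_i^V)$$

Taken together, we get the TAM as follows:

$$\hat{z}^l = \mathrm{T\text{-}MSA}(\mathrm{LN}(z^{l-1})) + z^{l-1}, \qquad z^l = \mathrm{MLP}(\mathrm{LN}(\hat{z}^l)) + \hat{z}^l$$

where LN denotes layer normalization and MLP a multi-layer perceptron, as in a standard transformer layer. We only apply TAM to the last transformer layer, where patch features are spatially non-local and carry higher-level semantics. By this means, TAM focuses on fine-grained features that are transferable across domains and discriminative for classification. So we have $l = L$, where $L$ is the total number of transformer layers in ViT.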
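A single-head version of the class-token attention described above can be sketched as follows: the usual softmax attention of the class-token query over all tokens is multiplied element-wise by a weight vector that is 1 for the class token itself and the transferability score for each patch token. The names, shapes, and the absence of any renormalization after the reweighting are assumptions of this sketch.

```python
import numpy as np

def class_token_attention(q_cls, keys):
    """Softmax attention of the class-token query over all N + 1 tokens."""
    d = q_cls.shape[-1]
    scores = keys @ q_cls / np.sqrt(d)
    scores = scores - scores.max()
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def transferable_self_attention(q_cls, keys, values, patch_transferability):
    """q_cls: (d,); keys: (N + 1, d) and values: (N + 1, d_v), class token first;
    patch_transferability: (N,) scores for the patch tokens."""
    attention = class_token_attention(q_cls, keys)
    weights = np.concatenate([[1.0], patch_transferability])  # class token keeps weight 1
    return (attention * weights) @ values

rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 4))      # one class token followed by four patch tokens
values = rng.normal(size=(5, 3))
q_cls = rng.normal(size=4)
scores = np.array([1.0, 0.2, 0.9, 0.0])
print(transferable_self_attention(q_cls, keys, values, scores))
```

With all scores equal to one the block reduces to ordinary attention for the class token, and a patch with score zero contributes nothing to the class-token output.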

Discriminative Clustering Module

Towards the challenging problem of learning a probabilistic discriminative classifier with unlabeled target data, it is desirable to minimize the expected classification error on the target domain. However, cross-domain feature alignment through TAM, which forces the two domains to be similar, may destroy the discriminative information of the learned representation if no semantic constraints on the target domain are introduced. As shown in Figure 2, although the target features are indistinguishable from the source features, they are scattered, which limits their discriminative power. To address this limitation, we are inspired by the assumptions that: (i) the target predictions $p^t$ are expected to retain as much information about the target inputs $x^t$ as possible Bridle, Heading, and MacKay (1992); and (ii) the decision boundary should not cross high-density regions, but instead lie in low-density regions, which is also known as the cluster assumption Chapelle and Zien (2005). Fortunately, these two assumptions can be met by maximizing the mutual information between the empirical distribution on the target inputs and the induced target label distribution Gomes, Krause, and Perona (2010); Shi and Sha (2012); Hu et al. (2017), which can be formally defined as:

$$I(p^t; x^t) = H(\bar{p}^t) - \frac{1}{n_t} \sum_{j=1}^{n_t} H(p_j^t) = -\sum_{k=1}^{K} \bar{p}_k^t \log \bar{p}_k^t + \frac{1}{n_t} \sum_{j=1}^{n_t} \sum_{k=1}^{K} p_{jk}^t \log p_{jk}^t$$

where $p_j^t = \mathrm{softmax}(G_c(G_f(x_j^t)))$, $\bar{p}^t = \mathbb{E}_{x^t}[p^t]$, and $K$ is the number of classes. Note that maximizing $-\frac{1}{n_t} \sum_{j=1}^{n_t} H(p_j^t)$ pushes the target predictions close to one-hot encodings, so the cluster assumption is satisfied. To ensure global diversity, we also maximize $H(\bar{p}^t)$ to avoid every target sample being assigned to the same class. With $I(p^t; x^t)$, our model is encouraged to learn tightly clustered target features with a uniform distribution, such that the discriminative information in the target domain is retained.
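The clustering objective can be illustrated with a small pure-Python sketch that computes the entropy of the mean prediction minus the mean entropy of the individual predictions. The function names and the toy probability vectors are assumptions for illustration only.

```python
import math

def entropy(p, eps=1e-12):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(pk * math.log(pk + eps) for pk in p)

def mutual_information(predictions):
    """Entropy of the mean prediction minus the mean per-sample entropy."""
    n = len(predictions)
    num_classes = len(predictions[0])
    mean_pred = [sum(p[c] for p in predictions) / n for c in range(num_classes)]
    return entropy(mean_pred) - sum(entropy(p) for p in predictions) / n

confident_and_diverse = [[1.0, 0.0], [0.0, 1.0]]   # one-hot, classes balanced
collapsed = [[1.0, 0.0], [1.0, 0.0]]               # every sample in one class
uncertain = [[0.5, 0.5], [0.5, 0.5]]               # no sample is separated
for name, preds in [("diverse", confident_and_diverse),
                    ("collapsed", collapsed),
                    ("uncertain", uncertain)]:
    print(name, round(mutual_information(preds), 4))
```

Only the first case scores high: the collapsed case fails the diversity term and the uncertain case fails the one-hot term, which is why both parts of the objective are needed.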

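To summarize, the objective function of TVT is:

$$\mathcal{L}_{clc}(x^s, y^s) + \alpha \mathcal{L}_{dis}(x^s, x^t) + \beta \mathcal{L}_{pat}(x^s, x^t) - \gamma I(p^t; x^t)$$

where $\mathcal{L}_{clc}$ is the classification loss, $\mathcal{L}_{dis}$ is the global adversarial loss, $\mathcal{L}_{pat}$ is the patch-level adversarial loss, $I(p^t; x^t)$ is the mutual information on the target domain, and $\alpha$, $\beta$, and $\gamma$ are hyper-parameters.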


To verify the effectiveness of our model, we conduct comprehensive studies on commonly used benchmarks and present experimental comparisons against state-of-the-art UDA methods as shown below.


Digits is a UDA benchmark on digit classification. We follow the same setting as previous work to perform adaptations on MNIST LeCun et al. (1998), USPS, and Street View House Numbers (SVHN) Netzer et al. (2011). For each source-target domain pair, we train our model using the training sets of each domain, and perform evaluations on the standard test set of the target domain.


Office-31 Saenko et al. (2010) contains 4,652 images of 31 categories, which were collected from three domains: Amazon (A), DSLR (D), and Webcam (W). The Amazon (A) images were downloaded from amazon.com, while the DSLR (D) and Webcam (W) images were photographed in an office environment with a digital SLR camera and a web camera, respectively.


Office-Home Venkateswara et al. (2017) consists of images from four different domains: Artistic images (Ar), Clip Art (Cl), Product images (Pr), and Real-World images (Rw). A total of 65 categories are covered within each domain.


VisDA-2017 Peng et al. (2017) is a synthesis-to-real object recognition task used for the 2018 VisDA challenge. It covers 12 categories. The source domain contains 152,397 synthetic 2D renderings generated from different angles and under different lighting conditions, while the target domain contains 55,388 real-world images.

Baseline Methods

We compare with RevGrad Ganin and Lempitsky (2015); Ganin et al. (2016), ADDA Tzeng et al. (2017), SHOT Liang, Hu, and Feng (2020), CDAN Long et al. (2017a), CyCADA Hoffman et al. (2018), MCD Saito et al. (2018), DDC Tzeng et al. (2014), DAN Long et al. (2015), JAN Long et al. (2017b), PFAN Chen et al. (2019), TADA Wang et al. (2019), ALDA Chen et al. (2020), TAT Liu et al. (2019), and DTA Lee et al. (2019), under the closed-set setting where the source and the target domain share the same label space. We use the results reported in their original papers for fair comparison. For each type of backbone, we report its lower-bound performance, denoted as Source Only, meaning the models are trained with source data only. For digit recognition, we also show the Target Only results as the upper-bound performance, which is obtained by both training and testing on the labeled target data. Baseline denotes vanilla ViT with adversarial adaptation Ganin et al. (2016).

Implementation Details

ViT-Base with a 16×16 input patch size (ViT-B/16) Dosovitskiy et al. (2020), pre-trained on ImageNet Deng et al. (2009), is used as our backbone. The transformer encoder of ViT-B/16 contains 12 transformer layers in total. We train all ViT-based models using the mini-batch Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9. We initialize the learning rate as 0 and increase it linearly over the first 500 training steps, after which it is decreased with a cosine decay strategy. The only exception concerns D→A and W→A on the Office-31 dataset, which use a separate setting.
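The learning-rate schedule described above (start at zero, warm up linearly for 500 steps, then decay with a cosine) can be sketched as a plain function of the step index. The parameter names `peak_lr` and `total_steps` are hypothetical, and the values used below are placeholders, not the paper's settings.

```python
import math

def learning_rate(step, peak_lr, warmup_steps=500, total_steps=10000):
    """Linear warm-up from 0 to peak_lr, then cosine decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Placeholder peak value of 0.1, chosen only to show the shape of the curve.
for step in (0, 250, 500, 5250, 10000):
    print(step, learning_rate(step, peak_lr=0.1))
```

The warm-up avoids large updates to the pre-trained weights at the start of fine-tuning, and the cosine phase lowers the rate smoothly toward zero by the final step.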

Algorithm S→M U→M M→U Avg

Source Only
LeNet 67.1 69.6 82.2 73.0

73.9 73.0 77.1 74.7

76.0 90.1 89.4 85.2

89.6 96.8 91.9 92.8

90.4 96.5 95.6 94.2

89.2 98.0 95.6 94.3

96.2 94.1 94.2 94.8

Target Only
99.4 99.4 98.0 98.9

Source Only
ViT 88.58 88.23 73.09 88.30

Baseline 92.70 98.60 97.01 96.10

TVT 99.01 99.38 98.21 98.87

Target Only
99.70 99.70 98.26 99.22
Table 1: Performance comparison on Digits dataset.
Algorithm A→W D→W W→D A→D D→A W→A Avg

Source Only
AlexNet 61.6 95.4 99.0 63.8 51.1 49.8 70.1

61.8 95.0 98.5 64.4 52.1 52.2 70.6

68.5 96.0 99.0 67.0 54.0 53.1 72.9

73.0 96.4 99.2 72.3 53.4 51.2 74.3

75.2 96.6 99.6 72.8 57.5 56.3 76.3

78.3 97.2 100.0 76.3 57.3 57.3 77.7

83.0 99.0 99.9 76.3 63.3 60.8 80.4

Source Only
ResNet 68.4 96.7 99.3 68.9 62.5 60.7 76.1

75.6 96.0 98.2 76.5 62.2 61.5 78.3

80.5 97.1 99.6 78.6 63.6 62.8 80.4

82.0 96.9 99.1 79.7 68.2 67.4 82.2

86.0 96.7 99.7 85.1 69.2 70.7 84.6

94.1 98.6 100.0 92.9 71.0 69.3 87.7

94.3 98.7 99.8 91.6 72.9 73.0 88.4

92.5 99.3 100.0 93.2 73.1 72.1 88.4

90.1 98.4 99.9 94.0 74.7 74.3 88.6

95.6 97.7 100.0 94.0 72.2 72.5 88.7

Source Only
ViT 89.18 98.87 100.0 88.76 80.09 79.77 89.45

Baseline 91.57 98.99 100.0 90.56 80.16 80.12 90.23

TVT 96.35 99.37 100.0 96.39 84.91 86.05 93.85
Table 2: Performance comparison on Office-31 dataset.
Algorithm Ar→Cl Ar→Pr Ar→Rw Cl→Ar Cl→Pr Cl→Rw Pr→Ar Pr→Cl Pr→Rw Rw→Ar Rw→Cl Rw→Pr Avg

Source Only
AlexNet 26.4 32.6 41.3 22.1 41.7 42.1 20.5 20.3 51.1 31.0 27.9 54.9 34.3

31.7 43.2 55.1 33.8 48.6 50.8 30.1 35.1 57.7 44.6 39.3 63.7 44.5

36.4 45.2 54.7 35.2 51.8 55.1 31.6 39.7 59.3 45.7 46.4 65.9 47.3

35.5 46.1 57.7 36.4 53.3 54.5 33.4 40.3 60.1 45.9 47.4 67.9 48.2

Source Only
ResNet 34.9 50.0 58.0 37.4 41.9 46.2 38.5 31.2 60.4 53.9 41.2 59.9 46.1

43.6 57.0 67.9 45.8 56.5 60.4 44.0 43.6 67.7 63.1 51.5 74.3 56.3

45.6 59.3 70.1 47.0 58.5 60.9 46.1 43.7 68.5 63.2 51.8 76.8 57.6

45.9 61.2 68.9 50.4 59.7 61.0 45.8 43.4 70.3 63.9 52.4 76.8 58.3

50.7 70.6 76.0 57.6 70.0 70.0 57.4 50.9 77.3 70.9 56.7 81.6 65.8

51.6 69.5 75.4 59.4 69.5 68.6 59.5 50.5 76.8 70.9 56.6 81.6 65.8

53.7 70.1 76.4 60.2 72.6 71.5 56.8 51.9 77.1 70.2 56.3 82.1 66.6

53.1 72.3 77.2 59.1 71.2 72.1 59.7 53.1 78.4 72.4 60.0 82.9 67.6

SHOT 57.1 78.1 81.5 68.0 78.2 78.1 67.4 54.9 82.2 73.3 58.8 84.3 71.8

Source Only
ViT 66.16 84.28 86.64 77.92 83.28 84.32 75.98 62.73 88.66 80.10 66.19 88.65 78.74

Baseline 71.94 80.67 86.67 79.93 80.38 83.52 76.89 70.93 88.27 83.02 72.91 88.44 80.30

TVT 74.89 86.82 89.47 82.78 87.95 88.27 79.81 71.94 90.13 85.46 74.62 90.56 83.56

Table 3: Performance comparison on Office-Home dataset.
Algorithm plane bcycl bus car house knife mcycl person plant sktbrd train truck Avg

Source Only
ResNet 55.1 53.3 61.9 59.1 80.6 17.9 79.7 31.2 81.0 26.5 73.5 8.5 52.4

81.9 77.7 82.8 44.3 81.2 29.5 65.1 28.6 51.9 54.6 82.8 7.8 57.4

87.0 60.9 83.7 64.0 88.9 79.6 84.7 76.9 88.6 40.3 83.0 25.8 71.9

93.8 74.1 82.4 69.4 90.6 87.2 89.0 67.6 93.4 76.1 87.7 22.2 77.8

93.7 82.2 85.6 83.8 93.0 81.0 90.7 82.1 95.1 78.1 86.4 32.1 81.5

94.3 88.5 80.1 57.3 93.1 94.9 80.7 80.3 91.5 89.1 86.3 58.2 82.9

Source Only
ViT 98.16 72.98 82.52 62.00 97.34 63.52 96.46 29.80 68.74 86.72 96.74 23.65 73.22

94.60 81.55 81.81 69.85 93.54 69.93 88.60 50.45 86.79 88.47 91.45 20.10 76.43

92.92 85.58 77.51 60.48 93.60 98.17 89.35 76.40 93.56 92.02 91.69 55.73 83.92
Table 4: Performance comparison on VisDA-2017 dataset.

Results of Digit Recognition

For the digit recognition task, we evaluate on SVHN→MNIST, USPS→MNIST, and MNIST→USPS, following the standard UDA evaluation protocol. As shown in Table 1, TVT obtains the best mean accuracy on each task and outperforms prior work in terms of average classification accuracy. TVT also outperforms Baseline (+2.7%), owing to the contribution of the proposed TAM and DCM. In particular, TVT achieves results comparable to the Target Only model, indicating that the domain shift problem is well alleviated.

Results of Object Recognition

For the object recognition task, Office-31, Office-Home, and VisDA-2017 are used for evaluation. As shown in Tables 2, 3, and 4, TVT sets new benchmark results on all three datasets. On the medium-sized Office-Home dataset (Table 3), we achieve a significant improvement over the best prior UDA method (83.56% vs 71.8%). Results on the large-scale VisDA-2017 dataset (Table 4) show that we not only achieve a higher average accuracy, but also compete favorably against ALDA and SHOT, which rely on pseudo labels. We believe that training with pseudo labels would give TVT an extra accuracy gain, but this is outside our current scope. Note that DTA also enforces the cluster assumption to learn discriminative features, but it fails to encourage global diversity, which may lead to a degenerate solution where every point is assigned to the same class. Besides, TVT surpasses both Source Only and Baseline, revealing its effectiveness in transferring domain knowledge by (i) capturing fine-grained features that are both transferable and discriminative and (ii) retaining discriminative information while searching for domain-invariant representations. This is also evidenced by the t-SNE visualization of learned features in Figure 2. TAM effectively aligns source and target domain features by exploiting local feature transferability. However, the target features are not well separated, because target labels are absent during training and the discriminative information is destroyed by adversarial alignment. This problem is alleviated by DCM, which assumes that data points should be classified with a large margin, as illustrated in Figure 2 (D).
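
The contrast between confident per-sample predictions and global diversity can be illustrated with a small sketch. It assumes an information-maximization style objective (marginal entropy of the batch minus mean per-sample entropy); the function names `entropy` and `info_max_objective` and the toy probabilities are illustrative and are not taken from the TVT implementation.

```python
import math

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(q * math.log(q + eps) for q in p)

def info_max_objective(probs):
    """Marginal entropy minus mean conditional entropy for a batch.

    probs: list of per-sample class-probability lists (each sums to 1).
    Assumed form of the objective, for illustration only. Higher values mean
    predictions are confident per sample AND spread across classes overall.
    """
    n = len(probs)
    num_classes = len(probs[0])
    # Average prediction over the batch: captures global diversity.
    marginal = [sum(p[k] for p in probs) / n for k in range(num_classes)]
    # Average per-sample entropy: low when each prediction is confident.
    conditional = sum(entropy(p) for p in probs) / n
    return entropy(marginal) - conditional

# Degenerate case: every sample is confidently assigned to the same class.
collapsed = [[0.98, 0.01, 0.01]] * 3
# Desired case: confident predictions that cover all classes.
diverse = [[0.98, 0.01, 0.01], [0.01, 0.98, 0.01], [0.01, 0.01, 0.98]]

print(info_max_objective(collapsed))
print(info_max_objective(diverse))
```

Minimizing only the per-sample entropy would score the two toy batches identically; the marginal-entropy term is what separates them, which is why a diversity term guards against the degenerate single-class solution.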

Ablation Study

To quantify the individual contributions of TAM and DCM to the knowledge transferability of ViT, we conduct the ablation study reported in Table 5. Compared to Source Only, TAM consistently improves the classification accuracy, with an average boost of 4.82%, indicating the significance of capturing features that are both transferable and discriminative. The performance is further improved by incorporating DCM, confirming the necessity of retaining the discriminative information of the learned representation. It is noteworthy that DCM brings the largest improvement on the large-scale synthetic-to-real VisDA-2017 dataset. We suspect that the large domain gap in VisDA-2017 (synthetic 2D renderings to natural images) is the leading reason, since simply aligning two domains with a large domain shift results in a poorly structured feature space. This challenge, however, can be largely addressed by DCM, which retains discriminative information based on a cluster assumption.

Methods Digits Office-31 Office-Home VisDA-2017 Avg
Source Only 88.30 89.45 78.74 73.22 82.43
+TAM 97.20 91.21 81.30 79.30 87.25
+DCM 98.87 93.85 83.56 83.92 90.05
Table 5: Ablation study of each module.
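
The averages and gains discussed in the ablation study can be rechecked directly from the Table 5 entries. The short script below is only an arithmetic check; the variable names are illustrative.

```python
# Per-benchmark accuracies copied from Table 5.
benchmarks = ["Digits", "Office-31", "Office-Home", "VisDA-2017"]
table5 = {
    "Source Only": [88.30, 89.45, 78.74, 73.22],
    "+TAM": [97.20, 91.21, 81.30, 79.30],
    "+DCM": [98.87, 93.85, 83.56, 83.92],
}

# Row averages, rounded to two decimals as in the table.
averages = {name: round(sum(vals) / len(vals), 2) for name, vals in table5.items()}

# Average gain of TAM over Source Only.
tam_boost = round(averages["+TAM"] - averages["Source Only"], 2)

# Per-benchmark gain from adding DCM on top of TAM.
dcm_gain = {
    b: round(d - t, 2)
    for b, t, d in zip(benchmarks, table5["+TAM"], table5["+DCM"])
}
largest_dcm_gain = max(dcm_gain, key=dcm_gain.get)

print(averages)
print(tam_boost)
print(dcm_gain, largest_dcm_gain)
```

The recomputed row averages match the Avg column, the TAM gain matches the 4.82% quoted in the text, and the largest DCM gain falls on VisDA-2017.
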
Attention Visualization

We visualize the attention map of the class token in TAM to verify that our model attends to local features that are both transferable and discriminative. Without loss of generality, we randomly sample target-domain images from the VisDA-2017 dataset for comparison. As shown in Figure 3, our method captures more accurate regions than Source Only and Baseline. For instance, to recognize the person in the top-left image, Source Only mainly focuses on the woman's shoulder, which is discriminative yet not highly transferable. Moving beyond the shoulder region, Baseline also attends to faces and hands, which generalize well across domains. Our method, in contrast, ignores the shoulder and highlights only those regions that are both important for classification and transferable. By leveraging the intrinsic attention mechanism and the fine-grained features captured by sequential patches, our method improves the capability of ViT to transfer domain knowledge.
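
A class-token attention map of this kind can be produced roughly as follows. This is a minimal sketch under stated assumptions, not the procedure used for Figure 3: it assumes the class token sits at index 0 of the token sequence, that the patches form a square grid, and that heads are simply averaged. The function name `class_token_attention_map` and the toy input are hypothetical.

```python
import math

def class_token_attention_map(attn_heads):
    """Turn class-token attention into a normalized 2D patch heat map.

    attn_heads: list over heads of (1+N) x (1+N) attention matrices
    (nested lists), where token 0 is assumed to be the class token.
    Returns a side x side grid with values scaled to [0, 1].
    """
    num_heads = len(attn_heads)
    # Class-token row of each head, keeping only the patch columns.
    cls_rows = [head[0][1:] for head in attn_heads]
    n = len(cls_rows[0])
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("this sketch assumes a square patch grid")
    # Average over heads (an assumed aggregation choice).
    mean_attn = [sum(row[i] for row in cls_rows) / num_heads for i in range(n)]
    # Min-max normalize so the hottest patch maps to 1.0.
    lo, hi = min(mean_attn), max(mean_attn)
    scale = (hi - lo) or 1.0
    norm = [(a - lo) / scale for a in mean_attn]
    return [norm[r * side:(r + 1) * side] for r in range(side)]

# Toy example: one head, four patches (2 x 2 grid); only row 0 is used.
uniform = [0.2] * 5
toy_head = [[0.2, 0.1, 0.4, 0.1, 0.2], uniform, uniform, uniform, uniform]
grid = class_token_attention_map([toy_head])
print(grid)
```

The resulting grid can be upsampled to the image resolution and overlaid as a heat map, where hotter colors correspond to higher attention.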

Figure 2: t-SNE visualization of VisDA-2017 dataset, where red and blue points indicate the source (synthetic rendering) and the target (real images) domain, respectively.


In this paper, we perform the first-of-its-kind investigation of ViT's transferability in the UDA task and observe that ViT is more transferable than its CNN counterparts. To further improve the power of ViT in transferring domain knowledge, we propose TVT, which explicitly exploits the intrinsic merits of the transformer architecture. Specifically, TVT captures features in the given image that are both transferable and discriminative, and retains the discriminative information of the learnt domain-invariant representations. Experimental results on widely used benchmarks show that TVT outperforms prior UDA methods by a large margin.

Figure 3: Attention map visualization of person, truck, and bicycle in VisDA-2017 dataset. The hotter the color, the higher the attention.