Drop to Adapt: Learning Discriminative Features for Unsupervised Domain Adaptation

10/12/2019 ∙ by Seungmin Lee, et al.

Recent works on domain adaptation exploit adversarial training to obtain domain-invariant feature representations from the joint learning of feature extractor and domain discriminator networks. However, domain adversarial methods yield suboptimal performance since they attempt to match the distributions among the domains without considering the task at hand. We propose Drop to Adapt (DTA), which leverages adversarial dropout to learn strongly discriminative features by enforcing the cluster assumption. Accordingly, we design objective functions to support robust domain adaptation. We demonstrate the efficacy of the proposed method in various experiments and achieve consistent improvements in both image classification and semantic segmentation tasks. Our source code is available at https://github.com/postBG/DTA.pytorch.




Code Repositories


Official implementation of Drop to Adapt: Learning Discriminative Features for Unsupervised Domain Adaptation, to be presented at ICCV 2019.


1 Introduction

(a) Before adaptation (b) Adapted model (c) AdD on feature extractor (d) AdD on classifier

Figure 1: We illustrate the domain adaptation process with adversarial dropout (AdD). We depict the source and target domains as solid and dashed lines, respectively. Decision boundary of a model only trained on the source domain easily violates the cluster assumption in that it passes through target feature-dense regions (a). We can apply AdD on both the feature extractor (c) and classifier (d). When AdD is used on the feature extractor, the decision boundary is pushed away from feature dense regions. On the contrary, AdD on the classifier pushes features away from the decision boundary. Eventually, our domain adapted model draws a robust decision boundary that avoids clusters (b).

Deep neural networks (DNNs) have shown exceptional performance on various visual recognition tasks using large-scale datasets [8, 22, 13]. Training a DNN model begins with curating data and its associated labels. In general, the annotation process is expensive and time-consuming. Moreover, in some cases we are unable to collect appropriate data at all, for example when events are rarely encountered or involve dangerous situations. Hence, researchers [33, 35, 10, 36] are turning to synthetic data generated in simulation environments, where labels can be produced effortlessly for a wide range of scenarios.

To take full advantage of synthetic datasets, domain adaptation has become an active research area. In the domain adaptation setting, we leverage rich annotations on a source domain to achieve strong performance on a target domain, even though annotations there are scarce or absent. Nevertheless, a model trained only on the source domain provides disappointing outcomes when the target domain shows inherently different characteristics. This issue is known as domain shift and is one of the main reasons for performance drops on the target domain. Therefore, we propose a novel method that can reduce the domain shift for domain adaptation.

In this paper, we tackle unsupervised domain adaptation (UDA), where the target domain is completely unlabelled. Recent works have proposed to align source and target domain distributions through domain adversarial training [11, 45, 12]. These methods employ an auxiliary domain discriminator to obtain domain-invariant feature representation. The main assumption in domain adversarial training is that if the feature representation is domain-invariant, a classifier trained on the source domain’s features will operate on the target domain as well. However, the weaknesses of domain adversarial methods have been pointed out in [38, 37, 41]. Since the domain discriminator simply aligns source and target features without considering the class labels, it is likely that the resulting features will not only be domain-invariant, but also non-discriminative with respect to class labels. Consequently, it is hard to reach the optimal performance on classification.

Our approach is based on the cluster assumption, which states that decision boundaries should be placed in low density regions in the feature space [5]. Without model adaptation, the feature extractor generates indiscriminate features for unseen data from the target domain, and the classifier may draw decision boundaries that pass through feature-dense regions on the target domain. Thus, we learn a domain adapted model by pushing the decision boundary away from the target domain’s features. Our method, Drop to Adapt (DTA), employs adversarial dropout [31] to enforce the cluster assumption on the target domain. More precisely, to support various tasks, we introduce element-wise and channel-wise adversarial dropout operations for fully-connected and convolutional layers, respectively. Fig. 1 gives an overview of our method, and we design the associated loss functions in Section 3.

We summarize our contributions as follows: 1) We propose a generalized framework in UDA, which is built upon adversarial dropout [31]. Our implementation supports both convolutional and fully connected layers; 2) We test on various domain adaptation benchmarks for image classification, and achieve competitive results compared to state-of-the-art methods; and 3) We extend the proposed method to a semantic segmentation task in UDA, where we perform adaptation from the simulation to real-world environments.

2 Related Work

Domain adaptation has been studied extensively. Ben-David et al. [1, 2] examined various divergence metrics between two domains, and defined an upper bound for the target domain error. Based on these studies, image-translation methods minimize the discrepancy between the two domains at an image-level [43, 52, 3].

On the other hand, feature alignment methods have attempted to match feature distributions between the source and target domains [11, 45, 24]. In particular, Ganin et al. [11] proposed a domain adversarial training method that aims to generate domain-invariant features by deceiving a domain discriminator. Many recent works use domain adversarial training as a key component in their adaptation procedure [12, 4, 15, 41, 32, 48, 47]. However, the domain classifier cannot consider class labels; thus, the generated features tend to be sub-optimal for classification.

To overcome the weaknesses of domain adversarial training, more recent works directly deal with the relationship between the decision boundary and feature representations based on the cluster assumption [5]. Several works [26, 9, 41] exploit semi-supervised learning for domain adaptation. In addition, MCD [38] and ADR [37] use a minimax training method to push target feature distributions away from the decision boundary, where both methods are composed of the feature extractor and the classifiers. More precisely, in [37], two different classifiers are sampled via stochastic dropout. Then, for the same target data sample, the classifiers are updated to maximize the discrepancy between the two predictions. Lastly, the feature extractor is updated multiple times to minimize this discrepancy. The minimax training process leaves the classifier in a noise sensitive state. Therefore, it must be newly trained for optimal performance.

Though our work is partly inspired by ADR, the proposed method is more efficient and simpler to train than prior work [37, 38]. Instead of updating the classifier to maximize the discrepancy, we employ adversarial dropout [31] on the classifier to achieve a similar effect. Furthermore, this adversarial dropout can be applied to the feature extractor as well. Without the need for a minimax training scheme, DTA has a straightforward and reliable adaptation process.


Dropout is a simple yet effective regularization method that randomly drops a fraction of the neurons during the training process [42]. According to Srivastava et al. [42], dropout has the effect of ensembling multiple subsets of a network. Park et al. [30] spotlighted the efficacy of dropout on convolutional layers. Tompson et al. [44] pointed out that activations of convolutional layers are usually surrounded by similar activations within the same feature map; thus, dropping individual neurons does not have a strong effect in convolutional layers. Instead, they proposed spatial dropout, which drops entire feature maps instead of individual neurons. Building on spatial dropout, Hou et al. [16] proposed a weighted channel dropout that uses variable drop rates for individual channels, where the drop rates depend on the channel’s averaged activation value. The weighted channel dropout is only applied to deep layers of the network, where activations are known to have high specificity [51, 50, 49]. Similarly, for channel-wise adversarial dropout, we remove entire feature maps in an adversarial way.

3 Proposed Method

3.1 Unsupervised Domain Adaptation

We first define the unsupervised domain adaptation (UDA) problem in general, along with the notation relevant to our work. In the UDA setting, we use data from two distinct domains: the source domain $\mathcal{S} = \{\mathbf{x}_s, y_s\}$ and the target domain $\mathcal{T} = \{\mathbf{x}_t\}$. A data point $\mathbf{x}_s$ from the source domain has an associated label $y_s$, whereas a data point $\mathbf{x}_t$ from the target domain has no paired ground-truth label. We employ a feature extractor $f(\mathbf{x}; \mathbf{m}_f)$, where $\mathbf{m}_f$ represents a dropout mask which can be applied at an arbitrary layer of the feature extractor. The feature extractor takes a data point $\mathbf{x}$ from either domain and creates a latent vector, which is fed into a classifier $c(\cdot\,; \mathbf{m}_c)$. The classifier applies a dropout mask $\mathbf{m}_c$ at an arbitrary layer. We denote the entire neural network as the composition of the feature extractor and the classifier: $h(\mathbf{x}; \mathbf{m}_f, \mathbf{m}_c) = c(f(\mathbf{x}; \mathbf{m}_f); \mathbf{m}_c)$.

3.2 Adversarial Dropout

We leverage a non-stochastic dropout mechanism, Adversarial Dropout (AdD) [31], for unsupervised domain adaptation. Adversarial dropout was originally proposed as an effective regularization method for supervised and semi-supervised learning. More specifically, Park et al. [31] define two types of Adversarial Dropout: Supervised Adversarial Dropout (SAdD) and Virtual Adversarial Dropout (VAdD). With access to ground truth labels, SAdD is used to maximize the divergence between a model’s prediction and the ground truth label. Without labels, on the other hand, VAdD is used to maximize the divergence between two independent predictions for the same input. Due to the lack of target domain labels, SAdD cannot be employed for our purpose. Thus, we exclusively work with VAdD, which is referred to as AdD for the sake of convenience.

AdD provides a simple and efficient mechanism of generating two divergent predictions for an input. Ultimately, our goal is to enforce the cluster assumption on target data by minimizing the divergence between predictions. To this end, we introduce element-wise AdD (EAdD) and propose its variant, channel-wise AdD (CAdD).

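We first define a dropout mask $\mathbf{m}$ applied to an intermediate layer of a network $h$. For simplicity, we decompose the network into two consecutive sub-networks, $h_l$ and $h_u$, split at the layer where dropout is applied, such that:

$$h(\mathbf{x}; \mathbf{m}) = h_u(\mathbf{m} \odot h_l(\mathbf{x})), \tag{1}$$

where $\odot$ represents element-wise multiplication. Note that $\mathbf{m}$ has the same dimensions as the output of $h_l$.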

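Let $D[p, p']$ measure the divergence between two distributions $p$ and $p'$. Then, the divergence between the predictions of $h$ with different dropout masks, $\mathbf{m}^s$ and $\mathbf{m}$, is defined as:

$$D\left[h(\mathbf{x}; \mathbf{m}^s), h(\mathbf{x}; \mathbf{m})\right] = D\left[h_u(\mathbf{m}^s \odot h_l(\mathbf{x})), h_u(\mathbf{m} \odot h_l(\mathbf{x}))\right]. \tag{2}$$
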
3.2.1 Element-wise Adversarial Dropout

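The element-wise adversarial dropout (EAdD) mask $\mathbf{m}^{adv}$ is defined with respect to a stochastic dropout mask $\mathbf{m}^s$ as:

$$\mathbf{m}^{adv} = \operatorname*{argmax}_{\mathbf{m}} D\left[h(\mathbf{x}; \mathbf{m}^s), h(\mathbf{x}; \mathbf{m})\right], \quad \text{where } \|\mathbf{m}^s - \mathbf{m}\| \le \delta_e L, \tag{3}$$

where $L$ denotes the dimension of $\mathbf{m}$, and $\delta_e$ is a hyperparameter that controls the perturbation magnitude with respect to $\mathbf{m}^s$. The objective is to find a minimally modified adversarial mask $\mathbf{m}^{adv}$ that maximizes the output divergence between two independent forward passes of $h$.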

To find $\mathbf{m}^{adv}$, Park et al. [31] optimize a 0/1 knapsack problem with appropriate relaxations in the process. Their optimization process can be simplified into the following steps. First, an impact value is approximated for each element in $\mathbf{m}$, which is directly proportional to the element’s contribution to increasing the divergence. When the impact value is negative, the element has a decreasing effect on the divergence. Then, without breaching the boundary condition, the elements of $\mathbf{m}$ are adjusted to maximize the divergence.
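The two steps described here can be sketched as a greedy selection. This is a simplified illustration and not the exact procedure of Park et al. [31]: the impact values are assumed to be precomputed and passed in, the function name and arguments are hypothetical, and the boundary condition is modeled as a budget of at most floor(delta * L) flipped mask entries.

```python
def greedy_adversarial_mask(sampled_mask, impact, delta):
    """Flip up to floor(delta * L) entries of a 0/1 mask to raise the divergence.

    sampled_mask: list of 0/1 values (the stochastic dropout mask).
    impact: list of floats; a positive value means keeping that unit active
            increases the divergence, a negative value means it decreases it.
    delta: fraction of the mask that may be changed (assumed boundary condition).
    """
    length = len(sampled_mask)
    budget = int(delta * length)

    # Gain of flipping entry i: switching a dropped unit on adds its impact,
    # switching an active unit off removes its impact.
    candidates = []
    for i, (m, v) in enumerate(zip(sampled_mask, impact)):
        gain = v if m == 0 else -v
        if gain > 0:
            candidates.append((gain, i))

    # Spend the budget on the most beneficial flips first.
    candidates.sort(reverse=True)
    adversarial_mask = list(sampled_mask)
    for _, i in candidates[:budget]:
        adversarial_mask[i] = 1 - adversarial_mask[i]
    return adversarial_mask


if __name__ == "__main__":
    mask = [1, 1, 0, 0]
    impact = [-0.5, 0.2, 0.9, -0.1]
    print(greedy_adversarial_mask(mask, impact, delta=0.5))
```

In this toy call, only entries whose flip would raise the divergence are considered, and the budget limits how many of them are actually changed.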

(a) Element-wise AdD (EAdD)
(b) Channel-wise AdD (CAdD)
Figure 2: Comparison of EAdD and CAdD. EAdD drops units individually, regardless of spatial correlation. CAdD, on the other hand, drops entire feature maps, making it more suitable for convolutional layers.

3.2.2 Channel-wise Adversarial Dropout

To use DTA in a wider range of tasks, we extend EAdD to convolutional layers. In these layers, however, standard dropout is relatively ineffective due to the strong spatial correlation between individual activations of a feature map [44]. EAdD suffers from the same issue when naively applied to convolutional layers.

Hence, we formulate CAdD, which adversarially drops entire feature maps rather than individual activations. While the general procedure is similar to that of EAdD, we impose certain constraints on the mask to represent spatial dropout [44]. Fig. 2 highlights the difference between EAdD and CAdD.

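Consider the activation of a convolutional layer, $h_l(\mathbf{x}) \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ denote the channel, height, and width dimensions of the activation, respectively. We define a channel-wise dropout mask $\mathbf{m} \in \mathbb{R}^{C \times H \times W}$, with the following constraints:

$$\mathbf{m}(i) = \mathbf{0} \ \text{or} \ \mathbf{1}, \quad \forall i \in \{1, \dots, C\}. \tag{4}$$

Here, $\mathbf{m}(i) \in \mathbb{R}^{H \times W}$ corresponds to the $i$-th activation map of $h_l(\mathbf{x})$, $\mathbf{0} \in \mathbb{R}^{H \times W}$ denotes a matrix of zeros, and $\mathbf{1} \in \mathbb{R}^{H \times W}$ denotes a matrix of ones. Then, the channel-wise adversarial dropout mask $\mathbf{m}^{adv}$ is defined as:

$$\mathbf{m}^{adv} = \operatorname*{argmax}_{\mathbf{m}} D\left[h(\mathbf{x}; \mathbf{m}^s), h(\mathbf{x}; \mathbf{m})\right], \quad \text{where } \frac{1}{HW} \sum_{i=1}^{C} \|\mathbf{m}^s(i) - \mathbf{m}(i)\| \le \delta_c C. \tag{5}$$

As before, $\delta_c$ is the hyperparameter that controls the degree of the perturbation.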

The process of finding the channel-wise adversarial dropout mask is similar to that of element-wise adversarial dropout. For CAdD, however, the impact value is approximated for each activation map of the layer output, due to the constraints in Eq. (4). We provide further details about the approximation in Appendix A of our supplementary material.
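As a small illustration of the constraint in Eq. (4), the sketch below builds a channel-wise mask from one keep/drop decision per channel and measures how many channels differ between two such masks. The helper names are hypothetical, and the use of an L1 distance normalized by H·W is an assumption made for the example; the text does not state which norm is used.

```python
def expand_channel_mask(keep, height, width):
    """Turn per-channel 0/1 decisions into a C x H x W mask.

    Every entry of channel i equals keep[i], so each map is all zeros or all ones.
    """
    return [[[k] * width for _ in range(height)] for k in keep]


def changed_channels(mask_a, mask_b):
    """Normalized L1 distance between two channel-wise masks.

    Because each map is constant, dividing the L1 distance by H * W
    counts the number of channels whose decision differs.
    """
    height = len(mask_a[0])
    width = len(mask_a[0][0])
    total = 0
    for map_a, map_b in zip(mask_a, mask_b):
        for row_a, row_b in zip(map_a, map_b):
            total += sum(abs(a - b) for a, b in zip(row_a, row_b))
    return total / (height * width)


if __name__ == "__main__":
    sampled = expand_channel_mask([1, 0, 1], height=2, width=2)
    adversarial = expand_channel_mask([0, 0, 1], height=2, width=2)
    print(changed_channels(sampled, adversarial))
```

Working with one decision per channel keeps the search space at C binary variables instead of C·H·W, which is why the impact value only needs to be estimated per activation map.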

3.3 Drop to Adapt

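Unlike prior work [38, 37], the proposed algorithm leverages a unified objective function to optimize all network parameters. The overall loss function is defined as a weighted sum of four objective functions:

$$\mathcal{L}(\mathcal{S}, \mathcal{T}) = \mathcal{L}_T(\mathcal{S}) + \lambda_1 \mathcal{L}_{DTA}(\mathcal{T}) + \lambda_2 \mathcal{L}_E(\mathcal{T}) + \lambda_3 \mathcal{L}_V(\mathcal{T}), \tag{6}$$

where $\mathcal{L}_T$, $\mathcal{L}_{DTA}$, $\mathcal{L}_E$, and $\mathcal{L}_V$ represent the task-specific, domain adaptation, entropy minimization, and Virtual Adversarial Training (VAT) [28] objectives, respectively. The associated hyperparameters, $\lambda_1$, $\lambda_2$, and $\lambda_3$, control the relative importance of the terms.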

Task-specific objective.

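We define the task-specific objective function $\mathcal{L}_T(\mathcal{S})$ on the source domain $\mathcal{S}$. In practice, this objective function can be replaced according to the given task. As an example, we present the cross entropy, which is widely used for classification:

$$\mathcal{L}_T(\mathcal{S}) = -\mathbb{E}_{(\mathbf{x}_s, y_s) \sim \mathcal{S}}\left[\mathbf{y}_s^{\top} \log h(\mathbf{x}_s)\right], \tag{7}$$

where $\mathbf{y}_s$ is the one-hot encoded vector of $y_s$.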


Domain adaptation objective.

The main component of our method is the domain adaptation objective. It consists of two parts, which act on the feature extractor $f$ and the classifier $c$, respectively:

$$\mathcal{L}_{DTA}(\mathcal{T}) = \mathcal{L}_{fDTA}(\mathcal{T}) + \mathcal{L}_{cDTA}(\mathcal{T}). \tag{8}$$

We aim to minimize the divergence between two predicted distributions for an input $\mathbf{x}$: one with a random dropout mask $\mathbf{m}^s$ and another with an adversarial dropout mask $\mathbf{m}^{adv}$. Among the various divergence measures, we choose the Kullback-Leibler (KL) divergence in this work. Assuming that the feature extractor consists of convolutional layers, we employ channel-wise adversarial dropout for $\mathcal{L}_{fDTA}$:

$$\mathcal{L}_{fDTA}(\mathcal{T}) = \mathbb{E}_{\mathbf{x}_t \sim \mathcal{T}}\left[D\left[h(\mathbf{x}_t; \mathbf{m}_f^s), h(\mathbf{x}_t; \mathbf{m}_f^{adv})\right]\right] = \mathbb{E}_{\mathbf{x}_t \sim \mathcal{T}}\left[D_{KL}\left[h(\mathbf{x}_t; \mathbf{m}_f^s) \,\|\, h(\mathbf{x}_t; \mathbf{m}_f^{adv})\right]\right]. \tag{9}$$

We illustrate the effects of $\mathcal{L}_{fDTA}$ in Fig. 1(c). Initially, the decision boundary crosses high density regions in the feature space (Fig. 1(a)), which violates the cluster assumption. By applying adversarial dropout on the feature extractor, we cause certain features to cross the decision boundary (Fig. 1(c), left). Then, to enforce consistent predictions, the model parameters are updated to push the decision boundary away from these features (Fig. 1(c), right).

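Similarly, we apply AdD to the classifier, where the classifier is defined as a series of fully connected layers. Thus, we perform element-wise adversarial dropout and compute the divergence:

$$\mathcal{L}_{cDTA}(\mathcal{T}) = \mathbb{E}_{\mathbf{x}_t \sim \mathcal{T}}\left[D\left[h(\mathbf{x}_t; \mathbf{m}_c^s), h(\mathbf{x}_t; \mathbf{m}_c^{adv})\right]\right] = \mathbb{E}_{\mathbf{x}_t \sim \mathcal{T}}\left[D_{KL}\left[h(\mathbf{x}_t; \mathbf{m}_c^s) \,\|\, h(\mathbf{x}_t; \mathbf{m}_c^{adv})\right]\right]. \tag{10}$$

When adversarial dropout is applied on the classifier, we determine the most volatile areas in the feature space. These volatile regions are in the vicinity of the decision boundary, and predictions in these regions can be changed even by a small perturbation (Fig. 1(d), left). Therefore, minimizing $\mathcal{L}_{cDTA}$ lets the features avoid falling into such volatile regions (Fig. 1(d), right).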
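To make the consistency term concrete, here is a minimal sketch of the KL divergence between two class-probability vectors, standing in for the predictions obtained with a random mask and with an adversarial mask. The probability values are invented for illustration, and the small epsilon is an assumption added only for numerical safety.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL divergence D_KL[p || q] between two discrete distributions."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


if __name__ == "__main__":
    # Hypothetical predictions for one target sample.
    with_random_mask = [0.7, 0.2, 0.1]
    with_adversarial_mask = [0.3, 0.4, 0.3]

    # A sample far from the decision boundary would give nearly identical
    # predictions and a divergence close to zero; here the adversarial
    # mask changes the prediction, so the penalty is clearly positive.
    print(kl_divergence(with_random_mask, with_adversarial_mask))
    print(kl_divergence(with_random_mask, with_random_mask))
```

Averaging this quantity over a batch of target samples gives an estimate of the expectation in the objective.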

Entropy minimization objective.

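We introduce the entropy minimization objective to enforce the cluster assumption further. This loss penalizes target samples for being close to the decision boundary, and thus causes the model to learn more discriminative features:

$$\mathcal{L}_E(\mathcal{T}) = -\mathbb{E}_{\mathbf{x}_t \sim \mathcal{T}}\left[h(\mathbf{x}_t)^{\top} \log h(\mathbf{x}_t)\right]. \tag{11}$$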
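A short numeric sketch shows why this term favors confident predictions: the entropy of a predicted class distribution is largest when the model is undecided and zero when all mass sits on one class. The example distributions are made up for illustration.

```python
import math


def prediction_entropy(probs):
    """Shannon entropy (in nats) of a discrete class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    near_boundary = [0.5, 0.5]    # undecided between two classes
    confident = [0.95, 0.05]      # far from the decision boundary
    print(prediction_entropy(near_boundary))
    print(prediction_entropy(confident))
```

Minimizing the average of this quantity over target samples therefore discourages predictions that sit near the decision boundary.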

VAT objective.

Lastly, we exploit VAT, which adversarially perturbs the target data at the input level. The VAT minimization objective is defined as:

$$\mathcal{L}_V(\mathcal{T}) = \mathbb{E}_{\mathbf{x}_t \sim \mathcal{T}}\left[\max_{\|\mathbf{r}\|_2 \le \epsilon} D_{KL}\left[h(\mathbf{x}_t) \,\|\, h(\mathbf{x}_t + \mathbf{r})\right]\right], \tag{12}$$

where $\mathbf{r}$ represents the virtual adversarial perturbation on input $\mathbf{x}_t$. While DTA and VAT are similarly motivated, they regularize the network with different forms of perturbations: network parameter perturbations (DTA) and input perturbations (VAT). Thus, VAT provides an orthogonal regularization to DTA, leading to complementary effects.

Interpretation of DTA.

Fig. 3 visualizes the effects of adversarial dropout using Grad-CAM [40], which accentuates the most discriminative regions for a prediction. As a baseline, we present Grad-CAM visualizations of a model trained only on the source domain (SO, see Fig. 3(b)). We apply AdD on the source only model (SO + AdD, see Fig. 3(c)), and see that semantically meaningful areas are deactivated. In contrast, our domain adapted model (DTA, see Fig. 3(d)) stays relatively unaffected by AdD, as it keeps seeing the same discriminative regions (see Fig. 3(e)) regardless of AdD. The visualizations imply that AdD promotes activations on more hidden units, and leads to a robust decision boundary across the domains.

(a) Input (b) SO (c) SO+AdD (d) DTA (e) DTA+AdD
Figure 3: Effect of adversarial dropout. We visualize class activation maps on target domain images using Grad-CAM [40]. Adversarial dropout (c) effectively deactivates semantically meaningful regions for a prediction compared to its baseline model only trained on the source domain (b). Our domain adapted model (DTA) produces reasonable predictions (d), even when some of its units are eliminated by AdD (e).

4 Experimental Results

Method SVHN→MNIST MNIST→USPS USPS→MNIST STL→CIFAR CIFAR→STL
Source only (Ours) 76.5 96.3 76.9 60.1 78.2
SE* [9] 98.6 98.1 97.3 74.2 79.7
VADA [41] 94.5 - - 73.5 80.0
DIRT-T [41] 99.4 - - 75.5 -
Co-DA [20] 98.3 - - 76.4 81.1
Co-DA+DIRT-T [20] 99.4 - - 76.3 -
Ours 99.4 99.5 99.1 72.8 82.6
Target only (Ours) 99.6 97.8 99.6 90.4 70.0
Table 1: Results of experiment on small image datasets. *We compare with the MT+CT+TF for SE.

In this section, we evaluate the proposed method on small and large DA benchmarks. To demonstrate the generality of our model, we conduct the experiments in two major recognition tasks: classification and segmentation. In each experiment, we select one domain as the source domain, and another as the target domain. We denote “Source only” as the target domain performance of a model trained on the source domain, and “Target only” as that of a model trained on the target domain. These two serve as baselines for the lower and upper bound performance in domain adaptation. Unlike French et al. [9], we neither tune a set of data augmentation schemes nor report performance with ensemble predictions. Rather, all evaluation results are based on the same data augmentation strategy with a single model prediction.

aero. bike bus car horse knife moto. person plant sktb. train truck avg.
Source Only 46.2 27.6 31.4 78.1 71.8 1.3 71.7 14.3 63.5 31.0 93.7 3.2 50.8
DAN [24] 87.1 63.0 76.5 42.0 90.3 42.9 85.9 53.1 49.7 36.3 85.8 20.7 61.1
DANN [11] 81.9 77.7 82.8 44.3 81.2 29.5 65.1 28.6 51.9 54.6 82.8 7.8 57.4
MCD [38] 87.0 60.9 83.7 64.0 88.9 79.6 84.7 76.9 88.6 40.3 83.0 25.8 71.9
ADR [37] 87.8 79.5 83.7 65.3 92.3 61.8 88.9 73.2 87.8 60.0 85.5 32.3 74.8
Ours 93.7 82.2 85.6 83.8 93.0 81.0 90.7 82.1 95.1 78.1 86.4 32.1 81.5
Table 2: Results on VisDA-2017 classification using ResNet-101.

4.1 DA on Small Datasets

To evaluate the influence of the DTA model, we first perform experiments on small datasets. We use MNIST [21], USPS [17], and Street View House Numbers (SVHN) [29] for adaptation on digit recognition. For object recognition, we use CIFAR10 (CIFAR) [19] and STL10 (STL) [6]. For fair comparison against recent state-of-the-art methods such as Self-Ensembling (SE) [9], VADA [41], and DIRT-T [41], we conduct experiments on the same network architecture as in SE. Note that while VADA/DIRT-T use a slightly different architecture, the total number of parameters is comparable. The results can be found in Table 1, and a full list of hyperparameter settings can be found in Appendix B.

SVHN → MNIST.

SVHN and MNIST are two digit classification datasets with a drastic distributional shift between the two. While MNIST consists of grayscale handwritten digit images, SVHN consists of colored images of street house numbers. Since MNIST has a significantly lower image dimensionality than SVHN, we resize the MNIST images to the 32 × 32 dimension of SVHN, with three channels. When the proposed DTA is applied, our approach demonstrates a significant improvement over previous works, and achieves a performance similar to the “Target only” performance on MNIST.

MNIST ↔ USPS.

MNIST and USPS contain grayscale images, so the domain shift between these two datasets is relatively small compared to that of the SVHN → MNIST setting. In both adaptation directions, we achieve an accuracy close to the performance of fully supervised learning on the target domain. In fact, we obtain higher accuracy on USPS when adapting from MNIST than when training directly on USPS. This is because the USPS training set is relatively small, allowing us to achieve improved performance by adapting from MNIST, using DTA.

CIFAR ↔ STL.

CIFAR and STL are 10-class object recognition datasets with colored images. We remove the non-overlapping classes and redefine the task as a 9-class classification task. Furthermore, we downscale the 96 × 96 image dimension of STL to match the 32 × 32 dimension of CIFAR. In the CIFAR → STL setting, our method’s performance surpasses others by a comfortable margin. For the same reasons presented in the MNIST → USPS setting, our adapted model outperforms the target only model on this dataset pair. In STL → CIFAR, however, our method is slightly weaker than competing methods. This is because STL is a very small dataset, with only 500 labeled training images per class. Since DTA regularizes the decision boundary of the model, the inherent assumption is that the model can achieve low generalization error on the source domain. This assumption holds in most cases, but breaks down when STL is the source domain.

To summarize, we achieve a substantial margin of improvement over the source only model across all domain configurations. In four of the five configurations, our method outperforms the recent state-of-the-art results. Next, we evaluate our method on more practical settings that embody real-life domain adaptation scenarios.







road sidewalk building wall fence pole t-light t-sign veg. terrain sky person rider car truck bus train mbike bike mIoU
Source only 25.3 13.7 56.8 2.7 17.2 21.2 20.0 8.7 75.3 11.2 72.0 45.7 4.9 42.2 14.2 20.2 0.4 19.5 0.0 24.8
DANN 72.4 19.1 73.0 3.9 9.3 17.3 13.1 5.5 71.0 20.1 62.2 32.6 5.2 68.4 12.1 9.9 0.0 5.8 0.0 26.4
ADR 87.8 15.6 77.4 20.6 9.7 19.0 19.9 7.7 82.0 31.5 74.3 43.5 9.0 77.8 17.5 27.7 1.8 9.7 0.0 33.3
Ours 88.8 36.9 76.9 20.9 15.4 19.6 21.8 7.9 82.9 26.7 76.1 51.7 9.4 76.1 22.4 28.9 1.7 15.2 0.0 35.8
Table 3: Results on GTA → Cityscapes, using a modified FCN with ResNet-50 as the base network.
(a) Source Only (b) DTA
Figure 4: t-SNE. t-SNE visualization of VisDA-2017 classification dataset using ResNet-101, before and after adaptation with DTA. t-SNE hyperparameters are consistent in both visualizations.
Method avg.
Source Only (Ours) 45.6
DAN [24] 53.0
RTN [26] 53.6
DANN [11] 55.0
JAN-A [27] 61.6
GTA [39] 69.5
SimNet [34] 69.6
CDAN-E [25] 70.0
Ours 76.2
SE* [9] 82.8
Table 4: Results on VisDA-2017 classification using ResNet-50. *SE report ensemble of multiple predictions. All other methods, including ours, report the average achieved by a single prediction.

4.2 DA on Large Datasets

We apply our method to adaptation on large-scale, large-image datasets. In particular, we evaluate on VisDA-2017 [33] image classification and VisDA-2017 image segmentation tasks.


The VisDA-2017 image classification task is a 12-class domain adaptation problem. The source domain consists of 152,397 synthetic images, where 3D CAD models are rendered under various conditions. The target domain consists of 55,388 real images taken from the MS-COCO dataset [22]. Since the objective is to learn from labeled synthetic images and correctly predict the class of real images, this dataset has been frequently used in many domain adaptation works [24, 12, 38, 37, 9]. For fair comparison with recent works, we follow the protocol of ADR [37] in our experiments. Specifically, we apply EAdD after the second fully connected layer, and CAdD within the last convolutional layer of the ResNet-50 [14] and ResNet-101 models. Both models are initialized with weights from an ImageNet [8] pre-trained model. For more details on implementation, we refer our readers to Appendix B.

The per-class adaptation performance with a ResNet-101 backbone can be found in Table 2. The table shows that our proposed method surpasses previous methods by a large margin. Note that all methods in this table use the same ResNet-101 backbone. Compared to the performance of a source only model, we achieve a 30.7 percentage point improvement (a 60.4% relative improvement) in average accuracy. Furthermore, DTA improves over the source only model in every category except “train”, and among the adaptation methods it achieves the best per-class performance in all classes except the “truck” class, where it falls behind ADR by a mere 0.2 points. Although our source only model is slightly lower than that of both MCD [38] and ADR, our proposed method effectively generalizes a model from the source to the target domain, with stronger adaptation performance than MCD and ADR by margins of 9.6 and 6.7 points, respectively.
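The improvement figures quoted here follow directly from the average accuracies in Table 2. The short check below recomputes them; the helper name is hypothetical.

```python
def improvement(baseline, adapted):
    """Return the absolute gain (points) and relative gain (percent)."""
    absolute = adapted - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


if __name__ == "__main__":
    # Average accuracies from Table 2 (ResNet-101).
    source_only, mcd, adr, ours = 50.8, 71.9, 74.8, 81.5

    absolute, relative = improvement(source_only, ours)
    print(f"{absolute:.1f} points, {relative:.1f}% relative")  # 30.7 points, 60.4% relative
    print(f"{ours - mcd:.1f} over MCD, {ours - adr:.1f} over ADR")  # 9.6 over MCD, 6.7 over ADR
```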

In Table 4, we show that DTA can be applied successfully to a different backbone network. As with ResNet-101, our model outperforms recent methods and demonstrates a significant improvement over the source only model. While SE reports the best overall performance, we do not consider it to be comparable to the other methods, including ours, because the reported accuracy is the result of 16 ensembled predictions.

For qualitative analysis, Figure 4 visualizes the feature representations of VisDA-2017 classification with t-SNE [46]. The source only model shows strong clustering of the source domain’s synthetic image samples (blue), but fails to have similar influence on the target domain’s real image samples (red). During training, DTA constantly enforces the clustering of target samples by stimulating the feature representations and decision boundary of the model. Therefore, we can clearly see an improved separation of target features with DTA, resulting in the best performance in VisDA-2017.


To further demonstrate our method’s applicability to real-world adaptation settings, we evaluate DTA in the challenging VisDA-2017 semantic segmentation task. For the source domain, we use the synthetic GTA5 [35] dataset, which consists of 24,966 labeled images. As the target domain, we use the real-world Cityscapes [7] dataset, consisting of 5,000 images. Both datasets are evaluated on the same set of 19 classes, with the mean Intersection-over-Union (mIoU) metric. For fair comparison with recent methods [12, 37], we follow the procedure of ADR and use a modified version of Fully Convolutional Networks (FCN) [23] on a ResNet-50 backbone. We apply CAdD within the last convolutional layer of ResNet-50.
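For readers unfamiliar with the metric, the sketch below computes mIoU from a pixel-level confusion matrix. It is a generic illustration with a toy two-class matrix, not the evaluation script used for the reported numbers; skipping classes that never appear in either the ground truth or the prediction is a convention assumed here.

```python
def mean_iou(confusion):
    """Mean Intersection-over-Union from a square confusion matrix.

    confusion[i][j] counts pixels whose ground-truth class is i
    and whose predicted class is j.
    """
    num_classes = len(confusion)
    ious = []
    for k in range(num_classes):
        true_positive = confusion[k][k]
        false_negative = sum(confusion[k]) - true_positive
        false_positive = sum(confusion[r][k] for r in range(num_classes)) - true_positive
        union = true_positive + false_positive + false_negative
        if union > 0:  # skip classes absent from both ground truth and prediction
            ious.append(true_positive / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Toy example: class 0 has IoU 3/5, class 1 has IoU 5/7.
    toy_confusion = [[3, 1],
                     [1, 5]]
    print(mean_iou(toy_confusion))
```

Because every class contributes equally to the mean, rare classes with an IoU near zero pull the score down noticeably.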

We report our results in Table 3, alongside results of existing methods. Our method clearly improves upon the mIoU of not only the source only model, but also competing methods. Even with the same training procedure and settings as in the classification experiments, DTA is extremely effective at adapting the most common classes in the dataset. This conclusion is supported in Figure 5, where we display examples of input images, ground truths, and the corresponding outputs of source only and DTA model. While the source only predictions are erroneous in most classes, DTA’s predictions are relatively clean and accurate.

(a) Input (b) Ground Truth (c) Source Only (d) DTA
Figure 5: Semantic segmentation. Qualitative results of the semantic segmentation task on GTA → Cityscapes, before and after adaptation with DTA. We use a modified FCN architecture with ResNet-50 as the base model.

5 Discussion

Methods aero. bike bus car horse knife moto. person plant sktb. train truck avg.
ResNet-50
Source Only 54.2 27.7 17.6 57.1 48.4 4.0 86.4 11.0 69.1 15.6 95.7 7.3 46.0
VAT 83.1 62.5 70.5 53.0 81.8 13.2 89.9 74.4 88.5 41.1 89.0 38.2 67.1
fDTA 88.8 58.2 82.8 82.3 90.4 0.1 92.8 77.3 94.2 78.5 86.9 0.2 72.5
fDTA + VAT 91.3 66.3 77.7 77.5 91.0 13.1 92.6 83.0 94.2 58.0 85.9 12.0 73.1
cDTA 92.4 72.9 75.1 72.6 92.8 7.4 90.8 82.1 95.0 66.6 87.8 31.6 74.7
cDTA + VAT 90.0 72.7 83.7 79.3 92.0 6.8 91.4 82.6 92.2 70.4 86.3 22.9 75.4
cDTA + fDTA 88.2 68.8 87.2 82.8 92.3 5.8 89.4 78.4 95.5 74.8 82.4 16.1 75.0
Ours 93.1 70.5 83.8 87.0 92.3 3.3 91.9 86.4 93.1 71.0 82.0 15.3 76.2
ResNet-101
Source Only 46.2 27.6 31.4 78.1 71.8 1.4 71.6 14.3 63.5 31.0 93.7 3.2 50.8
VAT 90.1 43.9 83.9 85.6 90.9 1.4 95.0 78.6 93.8 57.9 86.2 13.4 73.2
fDTA 89.1 75.5 84.6 87.2 92.3 72.9 89.7 78.5 91.8 39.5 84.1 10.8 76.4
fDTA + VAT 93.0 84.8 81.8 78.1 93.2 70.1 88.8 82.0 94.0 81.5 87.4 39.6 80.5
cDTA 91.8 81.5 78.7 67.0 91.3 71.6 85.3 76.9 93.5 72.5 86.7 44.1 77.0
cDTA + VAT 93.8 86.1 82.9 78.3 92.2 83.9 88.2 80.6 94.1 82.2 88.0 40.0 81.2
cDTA + fDTA 91.7 77.7 78.8 75.2 91.0 73.2 88.4 78.8 93.2 56.6 88.7 35.6 77.4
Ours 93.7 82.2 85.6 83.8 93.0 81.0 90.7 82.0 95.1 78.1 86.4 32.1 81.5
Table 5: Ablation Studies on VisDA-2017 Classification Dataset

Although the proposed DTA shows significant improvements on multiple visual tasks, we would like to understand the role of each component in DTA and how their combination operates in practice. We perform a series of ablation experiments and present the results in Table 5. All ablations are conducted on the VisDA-2017 image classification dataset. To verify effectiveness and generality, we use ResNet-50 and ResNet-101 models for all experiments in this ablation. The modified ResNet-based models consist of the original convolutional layers with EAdD after the second fully connected layer, and CAdD within the last convolutional layer. The entropy loss term in Eq. (11) is applied in all ablations except the “Source Only” setting.

To assess whether each module of DTA (VAT, fDTA, cDTA) plays an important role in the performance, we first experiment with individual modules. Overall, all three modules improve the performance over a source only model. We observe that the three components contribute to the accuracy of each category differently. In ResNet-101, while fDTA has a great impact on the “knife” category, VAT significantly boosts the performance of the “skateboard” class. Theoretically, VAT [28] can be seen as regularization by perturbing the input image, while the proposed methods can be seen as perturbations on the feature space of the model. Accordingly, the two combinations (fDTA + VAT) and (cDTA + VAT) show increased performance compared to the individually regularized models (73.2% (VAT) / 77.0% (cDTA) → 81.2% in ResNet-101, 67.1% (VAT) / 72.5% (fDTA) → 73.1% in ResNet-50). These results suggest that it is beneficial to use VAT [28] with the proposed method. More specifically, both methods exhibit complementary effects for adaptation on a large domain shift. This advantage can also be observed in the comparison of fDTA + cDTA to the final version of the proposed method (VAT + fDTA + cDTA). One interesting point is that all these trends are mostly maintained in both backbone models; the only difference is the size of the margin between the performance of the source only and individual models. From this fact, we conclude that the proposed method can act as a general regularization technique for adaptation, regardless of the model’s capacity.

6 Conclusion

We presented a simple yet effective method for unsupervised domain adaptation under large domain shifts. With the two proposed adversarial dropout modules, EAdD and CAdD, we enforced the cluster assumption on the target domain. The proposed methods are easily integrated into existing deep learning architectures. Through extensive experiments on various small and large datasets, we demonstrated the effectiveness of the proposed method on two domain adaptation tasks; in all cases, we achieved significant improvements over the source-only model and the state-of-the-art results.


This work was supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. R7117-16-0164, Development of wide area driving environment awareness and cooperative driving technology which are based on V2X wireless communication).


  • [1] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan (2010) A theory of learning from different domains. Mach. Learn. Cited by: §2.
  • [2] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira (2006) Analysis of representations for domain adaptation. In NIPS, Cited by: §2.
  • [3] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan (2017) Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR, Cited by: §2.
  • [4] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan (2016) Domain separation networks. In NIPS, Cited by: §2.
  • [5] O. Chapelle and A. Zien (2005) Semi-supervised classification by low density separation. In AISTATS, Cited by: §1, §2.
  • [6] A. Coates, H. Lee, and A. Ng (2011) An analysis of single layer networks in unsupervised feature learning. In AISTATS, Cited by: §4.1.
  • [7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR, Cited by: §4.2.
  • [8] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR, Cited by: §1, §4.2.
  • [9] G. French, M. Mackiewicz, and M. Fisher (2018) Self-ensembling for visual domain adaptation. In ICLR, Cited by: §2, §4.1, §4.2, Table 1, Table 4, §4.
  • [10] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig (2016) Virtual worlds as proxy for multi-object tracking analysis. In CVPR, Cited by: §1.
  • [11] Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In ICML, Cited by: §1, §2, Table 2, Table 4.
  • [12] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. JMLR. Cited by: §1, §2, §4.2, §4.2.
  • [13] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. IJRR. Cited by: §1.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §4.2.
  • [15] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell (2018) CyCADA: Cycle-consistent adversarial domain adaptation. In ICML, Cited by: §2.
  • [16] S. Hou and Z. Wang (2019) Weighted channel dropout for regularization of deep convolutional neural network. In AAAI, Cited by: §2.
  • [17] J. J. Hull (1994) A database for handwritten text recognition research. IEEE TPAMI. Cited by: §4.1.
  • [18] D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In ICLR, Cited by: Appendix B.
  • [19] A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Note: Tech Report Cited by: §4.1.
  • [20] A. Kumar, P. Sattigeri, K. Wadhawan, L. Karlinsky, R. Feris, B. Freeman, and G. Wornell (2018) Co-regularized alignment for unsupervised domain adaptation. In NIPS, Cited by: Table 1.
  • [21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proc. IEEE. Cited by: §4.1.
  • [22] T. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár (2014) Microsoft coco: common objects in context. In ECCV, Cited by: §1, §4.2.
  • [23] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR, Cited by: Appendix B, §4.2.
  • [24] M. Long, Y. Cao, J. Wang, and M. I. Jordan (2015) Learning transferable features with deep adaptation networks. In ICML, Cited by: §2, §4.2, Table 2, Table 4.
  • [25] M. Long, Z. Cao, J. Wang, and M. I. Jordan (2018) Conditional adversarial domain adaptation. In NIPS, Cited by: Table 4.
  • [26] M. Long, H. Zhu, J. Wang, and M. I. Jordan (2016) Unsupervised domain adaptation with residual transfer networks. In NIPS, Cited by: §2, Table 4.
  • [27] M. Long, H. Zhu, J. Wang, and M. I. Jordan (2017) Deep transfer learning with joint adaptation networks. In ICML, Cited by: Table 4.
  • [28] T. Miyato, S.-i. Maeda, M. Koyama, and S. Ishii (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE TPAMI. Cited by: §3.3, §5.
  • [29] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, Cited by: §4.1.
  • [30] S. Park and N. Kwak (2016) Analysis on the dropout effect in convolutional neural networks. ACCV. Cited by: §2.
  • [31] S. Park, J. Park, S. Shin, and I. Moon (2018) Adversarial dropout for supervised and semi-supervised learning. In AAAI, Cited by: Appendix A, §1, §1, §2, §3.2.1, §3.2.
  • [32] Z. Pei, Z. Cao, M. Long, and J. Wang (2018) Multi-adversarial domain adaptation. In AAAI, Cited by: §2.
  • [33] X. Peng, B. Usman, N. Kaushik, J. Hoffman, D. Wang, and K. Saenko (2017) VisDA: the visual domain adaptation challenge. External Links: arXiv:1710.06924 Cited by: §1, §4.2.
  • [34] P. O. Pinheiro (2018) Unsupervised domain adaptation with similarity learning. In CVPR, Cited by: Table 4.
  • [35] S. R. Richter, V. Vineet, S. Roth, and V. Koltun (2016) Playing for data: Ground truth from computer games. In ECCV, Cited by: §1, §4.2.
  • [36] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. Lopez (2016) The SYNTHIA Dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, Cited by: §1.
  • [37] K. Saito, Y. Ushiku, T. Harada, and K. Saenko (2018) Adversarial dropout regularization. In ICLR, Cited by: §1, §2, §2, §3.3, §4.2, §4.2, Table 2.
  • [38] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada (2018) Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, Cited by: §1, §2, §2, §3.3, §4.2, §4.2, Table 2.
  • [39] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa (2018) Generate to adapt: aligning domains using generative adversarial networks. In CVPR, Cited by: Table 4.
  • [40] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In IEEE ICCV, Cited by: Figure A-1, Figure 3, §3.3.
  • [41] R. Shu, H. Bui, H. Narui, and S. Ermon (2018) A DIRT-T approach to unsupervised domain adaptation. In ICLR, Cited by: Appendix B, §1, §2, §2, §4.1, Table 1.
  • [42] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. JMLR. Cited by: §2.
  • [43] Y. Taigman, A. Polyak, and L. Wolf (2017) Unsupervised cross-domain image generation. ICLR. Cited by: §2.
  • [44] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler (2015) Efficient object localization using convolutional networks. In CVPR, Cited by: §2, §3.2.2, §3.2.2.
  • [45] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In CVPR, Cited by: §1, §2.
  • [46] L. van der Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of Machine Learning Research. Cited by: §4.2.
  • [47] R. Volpi, P. Morerio, S. Savarese, and V. Murino (2018) Adversarial feature augmentation for unsupervised domain adaptation. In CVPR, Cited by: §2.
  • [48] X. Wang, L. Li, W. Ye, M. Long, and J. Wang (2019) Transferable attention for domain adaptation. In AAAI, Cited by: §2.
  • [49] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson (2014) How transferable are features in deep neural networks?. In NIPS, Cited by: §2.
  • [50] M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In ECCV, Cited by: §2.
  • [51] X. Zhang, H. Xiong, W. Zhou, W. Lin, and Q. Tian (2016) Picking deep filter responses for fine-grained image recognition. In CVPR, Cited by: §2.
  • [52] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, Cited by: §2.

Appendix A Approximation of Channel-wise Adversarial Dropout

Without loss of generality, the dropout mask is vectorized to Similarly, and represent vectorized forms of and , respectively. After vectorization of , we refer to the elements of with a set of indices , and impose the channel-wise dropout constraints as follows:


Let denote as the divergence between two outputs using different dropout masks for convenience sake. Assuming is a differentiable function with respect to , it can be approximated by a first-order Taylor expansion:

This equation shows that the Jacobian is proportional to the divergence. In other words,


We now see that the elements of the Jacobian correspond to the impact values, which indicate the contribution of each activation to the divergence metric. Thus, for the given Jacobian, we can systematically modify the elements of the dropout mask to maximize the divergence. However, due to the channel-wise dropout constraint in Eq. (a-13), we cannot modify each element individually. Instead, we reformulate the above relationship as:


The impact value of the -th activation map in can be defined as:


Consequently, after computing the impact values, we solve a 0/1 knapsack problem, as proposed in [31], subject to the constraints in Eq. (a-13).
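
The selection step described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the released implementation: the function name `channelwise_adversarial_mask`, the per-channel binary mask representation, the linearization around the starting mask, and the greedy solver are all assumptions made here. Greedy selection is used because every channel flip costs the same, so taking the flips with the largest positive first-order gain solves this equal-weight knapsack exactly.

```python
def channelwise_adversarial_mask(jacobian, init_mask, delta_c):
    """Pick a channel-wise dropout mask that increases the divergence (sketch).

    jacobian  : nested list of shape C x H x W; gradient of the divergence
                with respect to the element-wise mask (assumed to be
                evaluated at the starting mask).
    init_mask : list of C values in {0, 1}; 1 keeps a channel, 0 drops it.
    delta_c   : fraction of channels allowed to differ from init_mask
                (hypothetical name for the channel-wise budget).
    """
    num_channels = len(init_mask)

    # Impact value of a channel: sum of the element-wise gradients over
    # all spatial positions of that activation map.
    impact = [sum(sum(row) for row in channel) for channel in jacobian]

    # Flipping channel i changes its mask value by (1 - 2 * m_i):
    # +1 if it is currently dropped, -1 if it is currently kept.
    # Under the first-order approximation the divergence changes by
    # that amount times the channel's impact value.
    gain = [(1 - 2 * m) * s for m, s in zip(init_mask, impact)]

    # Budget: at most delta_c * C channels may change.
    budget = int(delta_c * num_channels)

    # Equal-weight knapsack: take the largest positive gains first.
    order = sorted(range(num_channels), key=lambda i: gain[i], reverse=True)
    mask = list(init_mask)
    for i in order[:budget]:
        if gain[i] <= 0:
            break  # remaining flips would not increase the divergence
        mask[i] = 1 - mask[i]
    return mask


# Toy example: 4 channels, each with a 1 x 2 activation map.
toy_jacobian = [[[1.0, 2.0]], [[-1.0, -1.0]], [[0.25, 0.25]], [[-0.5, -0.5]]]
toy_mask = channelwise_adversarial_mask(toy_jacobian, [1, 1, 0, 0], delta_c=0.5)
```

In the toy example, dropping the second channel and restoring the third are the only two flips with positive gain, and both fit within the budget of two channels.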

Appendix B Implementation Details

Training with DTA Loss

We apply a ramp-up factor to the DTA loss function to stabilize the training process. Instead of directly modulating the weight term, we gradually increase the perturbation magnitudes, which determine the number of hidden units to be eliminated. This allows us to regulate the consistency term and to train the network to be robust to the various levels of perturbation generated by adversarial dropout. We update the ramp-up factors with the following schedule:


where represents the ramp-up period, and denotes the ramp-up factor at the current epoch . Finally, the perturbation magnitude is defined as:


where denotes the maximum level of perturbation. In practice, the same ramp-up period is applied for both and .
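
To make the ramp-up idea concrete, the following is a minimal sketch that assumes a linear ramp-up and assumes the perturbation magnitude is the ramp-up factor multiplied by the maximum perturbation level. Both choices, as well as the names `ramp_up_factor`, `perturbation_magnitude`, and `delta_max`, are assumptions made for illustration; the schedule actually used may have a different shape.

```python
def ramp_up_factor(epoch, ramp_up_period):
    """Ramp-up factor in [0, 1] at the given epoch (linear schedule assumed)."""
    if ramp_up_period <= 0:
        # No ramp-up: use the full perturbation from the start.
        return 1.0
    return min(1.0, max(0.0, epoch / ramp_up_period))


def perturbation_magnitude(epoch, ramp_up_period, delta_max):
    """Current perturbation magnitude, scaled up towards delta_max."""
    return ramp_up_factor(epoch, ramp_up_period) * delta_max


# Example: halfway through an 80-epoch ramp-up with a maximum level of 0.1.
halfway = perturbation_magnitude(40, 80, 0.1)
```

Because only the perturbation magnitude is scaled, the loss weight itself stays fixed and the adversarial dropout mask is simply allowed to differ more from the sampled mask as training progresses.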


Table A-1 presents the hyperparameters used in our experiments. We followed a hyperparameter search protocol similar to that of Shu et al. [41], in which we sample a very small subset of labels from the target domain training set. For each objective function, we limit the hyperparameter search to a predefined set of values: , , , , , and . Furthermore, we provide the remaining network training parameters for each experimental setup.

Small dataset.

All small dataset experiments were trained for 90 epochs using the Adam optimizer [18] with an initial learning rate of 0.001, decayed by a factor of 0.1 every 30 epochs.

Large dataset.

We conducted the VisDA-2017 classification experiments on ResNet-50 and ResNet-101. We trained the networks for 20 epochs using Stochastic Gradient Descent (SGD) with a momentum of 0.9 and an initial learning rate of 0.001, which decays by a factor of 0.1 after the 10th epoch.
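
The two learning-rate schedules described above (small and large datasets) follow the same step-decay pattern, which can be sketched as below. The helper name `step_decay_lr`, the zero-based epoch indexing, and the exact milestone epochs (30 and 60 for the small datasets, 10 for VisDA-2017) are assumptions derived from the description, not taken from the released code.

```python
def step_decay_lr(epoch, base_lr, gamma, milestones):
    """Learning rate at a zero-based epoch under milestone-based step decay."""
    # Count how many milestones have already been reached.
    num_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * (gamma ** num_decays)


# Small datasets: 90 epochs, decay by 0.1 every 30 epochs.
small_schedule = [step_decay_lr(e, 0.001, 0.1, [30, 60]) for e in range(90)]

# VisDA-2017: 20 epochs, decay by 0.1 after the 10th epoch.
large_schedule = [step_decay_lr(e, 0.001, 0.1, [10]) for e in range(20)]
```

In PyTorch, `torch.optim.lr_scheduler.MultiStepLR` provides the same milestone-based decay and could be attached to the Adam or SGD optimizer instead of computing the rate by hand.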

Semantic segmentation.

The semantic segmentation model for domain adaptation from GTA5 to Cityscapes was trained for 5 epochs using SGD with a momentum of 0.9. Since FCN [23] has no fully-connected layers, was automatically set to 0. In addition, we used the maximum value from the beginning because the task-specific objective was dominant in the early stages of training. In this experiment, we turned off the VAT objective, which hinders learning of the segmentation task.

Experiment                  Backbone
Small dataset
SVHN → MNIST                9 Conv+1 FC     2   0.01   0.1   0.1   0.05   80   3.5
MNIST → USPS                3 Conv+2 FC     2   0.01   0     0.1   0.05   80   0
USPS → MNIST                3 Conv+2 FC     2   0.01   0.1   0.1   0.05   80   3.5
STL → CIFAR                 9 Conv+1 FC     2   0.01   0.1   0     0.05   60   3.5
CIFAR → STL                 9 Conv+1 FC     2   0.01   0     0     0.05   80   0
Large dataset
VisDA-2017 Classification   ResNet-50       2   0.02   0.2   0.1   0.01   20   15
VisDA-2017 Classification   ResNet-101      2   0.02   0.2   0.1   0.01   30   15
Semantic segmentation
GTA5 → Cityscapes           ResNet-50 FCN   2   0.01   0     0     0.02   1    0
Table A-1: Hyperparameters

Appendix C Additional GradCAM visualizations

In Figure A-1, we provide additional GradCAM visualizations to highlight the effects of adversarial dropout.

(a) Input (b) SO (c) SO+AdD (d) DTA (e) DTA+AdD
Figure A-1: Effect of adversarial dropout, visualized by GradCAM [40].