Compound Domain Adaptation in an Open World

09/08/2019 ∙ by Ziwei Liu, et al.

Existing works on domain adaptation often assume clear boundaries between source and target domains. Despite giving rise to a clean problem formalization, such form falls short of simulating the real world where domains are compounded of interleaving and confounding factors, blurring the domain boundaries. In this work, we opt for a different problem, dubbed open compound domain adaptation (OCDA), for studying the techniques of training domain-robust models in a more realistic setting. OCDA considers a compound (unlabeled) target domain which mixes several major factors (e.g., backgrounds, lighting conditions, etc.), along with a labeled training set, in the training stage and new open domains during inference. The compound target domain can be seen as a combination of multiple traditional target domains each with its own idiosyncrasy. To tackle OCDA, we propose a class-confusion loss to disentangle the domain-dominant factors out of the data and then use them to schedule a curriculum domain adaptation strategy. Moreover, we use a memory-augmented neural network architecture to increase the network's capacity for handling previously unseen domains. Extensive experiments on digit classification, facial expression recognition, semantic segmentation, and reinforcement learning verify the effectiveness of our approach.




1 Introduction

Modern visual recognition systems (e.g., classification, segmentation, and navigation) pride themselves on compelling performance on in-domain data, i.e., training and test data drawn from the same underlying distribution. However, in real-world deployment they can be easily crippled when the test data come from different domains (e.g., the same class of objects but with backgrounds, poses, or appearances that differ from the training set) [34, 40]. It is highly desirable to train models in a way that makes them robust to domain gaps between training and inference. To this end, domain adaptation has emerged as an appealing framework [34, 28, 11] for mitigating domain mismatch.

Existing works on domain adaptation often assume clear boundaries between different domains [11, 7, 42, 24, 35]. Despite giving rise to a clean problem formalization, this assumption falls short of simulating the real world, where domains are compounded of interleaving and confounding factors. Indeed, numerous factors can jointly contribute to the variance of the data, making it hard to find proper boundaries that separate the data into discrete domains.

In this work, we opt for a different problem formalism, dubbed open compound domain adaptation (OCDA), for studying the techniques of training domain-robust models in a more realistic and challenging setting. As Figure 1 illustrates, OCDA considers learning a model from a well-labeled dataset (the source domain) as well as a big unlabeled dataset which could differ from the source domain due to various factors. We call this unlabeled set the compound target domain to stress the fact that it mixes several major factors underlying the data. In other words, it could be a combination of multiple traditional “target domains” in the existing works, where each domain is distinctive in one or two major factors. For example, the five domains in the classic digits datasets (SVHN [27], MNIST [19], MNIST-M [6], USPS [17], and SynNum [6]) mainly differentiate themselves from each other by the backgrounds and/or text fonts.

At the inference stage, OCDA tests the model on not only the compound target domain but also new open domains, which are unseen in the training stage.

We propose a three-pronged approach to tackling OCDA. The main idea hinges on curriculum domain adaptation [46, 5, 47]. We schedule the (unlabeled) examples of the compound target domain according to their “gaps” to the labeled source domain, so that we solve the easy domain adaptation problem first, followed by increasingly harder target-domain examples.

Unlike the existing curriculum adaptation methods [46, 5, 47] that rely on the holistic sample difficulty, we schedule the curriculum by using finer-grained “domain distances” between the source domain and each target domain example. At the beginning of the training, a deep neural network learns class-discriminativeness from the labeled source domain and, meanwhile, domain-invariance by a domain-confusion loss over the “easy” target examples. After a certain number of epochs, the network finds almost no difference between the “easy” target examples and the source domain, and we then feed it with “harder” target examples which are further away from the source domain. Gradually, the network becomes both domain-invariant and class-discriminative.

The above training algorithm is the first prong of our approach. We enhance it by another two prongs for effectively constructing the curriculum and for increasing the network’s capacity in handling new open domains, respectively. The last feature is vital for tackling previously unseen domains during the inference stage of OCDA.

In order to construct the curriculum for the compound target-domain examples, we first extract domain-focused factors out of the data and then rank the target examples according to their distances to the source domain by using these factors. We conjecture that such factors do not contribute to, but rather corrupt, the ideal class-discriminative representations. Hence, we propose a class-confusion loss to distill the domain-focused factors. This loss takes the conventional form of a cross-entropy and yet is supervised by randomized class labels.

In order to prepare the network for new open domains arising during inference, we propose to increase the model’s capacity by a memory module. The memory augments the representation of an input, enabling the network to dynamically balance between the original information contained in the input and the memory. If the input is close to the training examples, probably the representation extracted from itself can already lead to accurate classification. In contrast, the memory can step in and play a more important role, if the input is too far away from the source domain, making it an otherwise confusing example for the network. As a result, this memory-augmented network is more capable of dealing with the open domains than the vanilla networks.

Our contribution is three-fold. 1) We formalize a more realistic research problem than traditional domain adaptation, dubbed open compound domain adaptation (OCDA), for studying the techniques of learning deep neural networks that are robust to domain mismatches. 2) We propose a novel three-pronged approach to OCDA, including a training algorithm inspired by curriculum domain adaptation, a class-confusion loss for distilling domain-focused factors from the data, and a memory-augmented neural network to better handle new open domains. 3) We carefully design several benchmarks on classification, recognition, segmentation, and reinforcement learning and conduct comprehensive experiments to evaluate our approach under the OCDA setting. We will release the code, models, and datasets to facilitate future research.

Figure 2: Overview of disentangling domain-focused factors and curriculum domain adaptation. We propose to separate the factors that underlie the data and contribute to the domain idiosyncrasies out of those controlling the samples’ belonging to classes. It is achieved by a class-confusion algorithm in an unsupervised manner. Such domain-focused factors allow us to effectively construct the curriculum for domain-robust learning.

2 Related Works

In the following, we briefly review some sub-fields (Table 1) in domain adaptation, which are closely related to the open compound domain adaptation (OCDA) problem.

Unsupervised Domain Adaptation. In the past few years, researchers have devoted enormous efforts to the field of unsupervised domain adaptation [34, 40], whose goal is to retain the recognition accuracy in new domains where ground truth annotations are inaccessible. Representative techniques include latent distribution alignment [11], back-propagation [6], gradient reversal [7], adversarial discrimination [42], joint maximum mean discrepancy [24], cycle consistency [15], and maximum classifier discrepancy [35]. Though promising results have been achieved, this traditional domain adaptation direction focuses on the “one source domain, one target domain” setting, which is suboptimal for more complicated scenarios where multiple target domains are present.

Multi-Target Domain Adaptation. Recent work explores extending unsupervised domain adaptation to multi-target [10, 8, 44] or continuous [1, 12, 26] scenarios, whose objective is to perform well on multiple target domains when only the source domain has class labels. The developed methods usually require domain labels (e.g., the $i$-th testing sample belongs to the $j$-th target domain) as inputs, but this assumption rarely holds in real-world scenarios. Here we take one step further and tackle the compound domain adaptation problem, where both category labels and domain labels in the test set are unavailable.

Domain Generalized/Agnostic Learning. Domain generalization [21, 20] and domain agnostic learning [32, 4] aim to learn universal representations that can be applied in a domain-invariant manner. Since these methods focus on learning semantic representations that are invariant to the domain shift, they largely neglect the latent structures inside the target domains. In this work, we explicitly model the latent structures inside the compound domains by leveraging the learned domain-dominant factors for curriculum scheduling and dynamic adaptation.

3 Our Approach

In this section, we describe our approach to the open compound domain adaptation (OCDA) problem. Figures 2 and 3 present the overall work flow, which comprises three major components: 1) disentangling domain-focused factors from the data with only source domain class labels; 2) curriculum domain adaptation with a scheduled training set; and 3) a memory-enhanced neural network for handling new domains by weighing the contribution of the memory. In what follows, we describe them in detail.

Figure 3: Overview of memory-enhanced deep neural network. We propose to enhance the vanilla network by a memory module, which allows knowledge transfer from the source domain so that the network can dynamically balance the input-conveyed information and the memory-transferred knowledge, so as to more flexibly handle the previously unseen domains.

3.1 Disentangling Domain-Focused Factors

We propose to separate the factors that underlie the data and contribute to the domain idiosyncrasies out of those controlling the examples’ class-discriminativeness. Such domain-focused factors enable us to effectively construct the curriculum in the next section.

Concretely, we first train a deep neural network based classifier using the labeled source domain. Denote by $E_{class}(\cdot)$ the encoder up to the second-to-the-last layer and by $\Phi_{class}(\cdot)$ the classifier. This encoder mainly captures the class-discriminative representations of the data.

We assume that all the factors left over by this class-discriminative encoder contribute to the domain idiosyncrasies. We can then extract the domain-focused factors by using another encoder $E_{domain}(\cdot)$ alongside the class encoder $E_{class}(\cdot)$, requiring it to possess the following two properties. One is completeness, $\mathrm{Decoder}(E_{class}(x), E_{domain}(x)) \approx x$, namely, the outputs of the two encoders for any input $x$ provide sufficient information for a decoder to reconstruct the input. The other is orthogonality, meaning that the domain encoder is supposed to share as little mutual information with the class encoder as possible. We leave to the supplementary materials the detailed training algorithm for meeting the first property.

We propose a class-confusion algorithm to make the domain encoder observe the second property. It alternates between the following two sub-problems,

$$\min_{E_{domain}} \; -\sum_i z^i_{random} \log D(E_{domain}(x^i)),$$

$$\min_{D} \; -\sum_i y^i \log D(E_{domain}(x^i)),$$

where the superscript $i$ indexes a data instance, labels are written as one-hot vectors, and $D(\cdot)$ is a classifier that the domain encoder $E_{domain}(\cdot)$ tries to confuse. By the second sub-problem, we train the classifier $D$ with the groundtruth labels $y^i$ of the inputs from the source domain; for the inputs from the target domain, we assign pseudo labels using the classifier we have trained earlier. The learned domain encoder is class-confusing due to $z^i_{random}$, a random label chosen uniformly from the label space. Imagine that we have learned a decent classifier $D$; the first sub-problem then essentially learns the domain encoder such that the input is classified into a random class $z^i_{random}$.

Figure 4 (a) and (b) visualize the examples embedded by the class encoder $E_{class}(\cdot)$ and the domain encoder $E_{domain}(\cdot)$, respectively. The class encoder embeds the examples into clusters, each dominated by a class, while the domain-focused encoder mixes up the examples regardless of their classes.
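The difference between the two sub-problems can be illustrated with a small numerical sketch. It is not the authors' implementation: it assumes the classifier's output on the domain-encoder features is already available as a probability vector per example, and all names (`cross_entropy`, `class_confusion_losses`) are hypothetical. It only shows that the two cross-entropy terms share the same predictions and differ in their supervision.

```python
import math
import random


def cross_entropy(probs, label):
    """Negative log-likelihood of `label` under the probability vector `probs`."""
    return -math.log(max(probs[label], 1e-12))


def class_confusion_losses(pred_probs, labels, num_classes, rng):
    """Return (encoder_loss, classifier_loss), each averaged over the batch.

    pred_probs: per-example class probabilities predicted from the
                domain-encoder features (assumed to be given here).
    labels:     groundtruth labels (source) or pseudo labels (target).
    """
    encoder_loss = 0.0
    classifier_loss = 0.0
    for probs, y in zip(pred_probs, labels):
        z_random = rng.randrange(num_classes)            # label drawn uniformly at random
        encoder_loss += cross_entropy(probs, z_random)   # first sub-problem
        classifier_loss += cross_entropy(probs, y)       # second sub-problem
    n = len(labels)
    return encoder_loss / n, classifier_loss / n


rng = random.Random(0)
uniform = [[0.25] * 4 for _ in range(3)]  # predictions carrying no class information
enc_loss, clf_loss = class_confusion_losses(uniform, [0, 1, 2], 4, rng)
```

With uniform predictions over four classes, both losses equal log 4. For labels drawn uniformly at random, uniform predictions also minimize the expected cross-entropy, which is why minimizing the first term pushes the domain encoder toward class-uninformative features.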

3.2 Curriculum Domain Adaptation

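We rank the examples of the compound target domain according to their distances to the source domain in order to construct the curriculum for domain adaptation [46]. We compute the “domain gap” between a target example $x^t$ and the source domain by using the domain encoder: $\mathrm{mean}_{s}\, \mathrm{dist}(E_{domain}(x^t), E_{domain}(x^s))$, where $\mathrm{dist}$ is a distance in the domain-encoder feature space and the mean averages over the source domain examples $x^s$.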
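As a minimal sketch of this quantity, the snippet below computes the gap for features that are assumed to have been extracted by the domain encoder already. The Euclidean distance is an assumption made for illustration, and the function names are hypothetical.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def domain_gap(target_feat, source_feats):
    """Mean distance from one target feature to all source-domain features."""
    return sum(l2_distance(target_feat, s) for s in source_feats) / len(source_feats)


source_feats = [[0.0, 0.0], [0.0, 2.0]]          # toy domain-encoder features
gap_near = domain_gap([0.0, 1.0], source_feats)  # target close to the source
gap_far = domain_gap([3.0, 1.0], source_feats)   # target further away
```

A target example with a larger gap is treated as "harder" and is scheduled later in the curriculum.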

Equipped with this ranking list, we train our domain-robust deep neural network by the curriculum domain adaptation strategy. The basic idea is to consider the subset of target examples that are very close to the source domain at the first few epochs. After certain epochs, we include more target examples whose distances to the source domain are moderate. Finally, we incorporate the whole compound target domain. At each stage of the curriculum learning, we learn by minimizing the sum of two losses. One is the cross-entropy loss defined over the labeled source domain, and the other is the domain-confusion loss [42] computed between the source domain and the currently covered target examples. We refer readers to [42] for the details of the domain-confusion loss. Figure 4 (c) illustrates a curriculum used in our experiments.
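The staged schedule can be sketched as follows. The cumulative fractions per stage are hypothetical values chosen for illustration (the text does not specify them), and `build_curriculum` is not a function from the authors' code.

```python
import math


def build_curriculum(gaps, fractions=(0.3, 0.6, 1.0)):
    """Return, for each stage, the indices of the target examples covered so far.

    gaps:      domain gap of every target example to the source domain.
    fractions: assumed cumulative share of the target set used at each stage.
    """
    order = sorted(range(len(gaps)), key=lambda i: gaps[i])  # easy to hard
    stages = []
    for frac in fractions:
        cutoff = max(1, math.ceil(frac * len(order)))
        stages.append(order[:cutoff])
    return stages


gaps = [0.9, 0.1, 0.5, 0.3, 0.7]
stages = build_curriculum(gaps)
# At each stage, the domain-confusion loss would be computed only between the
# source domain and the target examples listed for that stage.
```

Each stage is a superset of the previous one, so the easiest examples stay in the training set while harder ones are added, and the last stage covers the whole compound target domain.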

Figure 4: Visualization of the learned (a) class-discriminative representations, (b) domain-focused representations through t-SNE [25], and (c) a curriculum used in our experiments.

3.3 A Memory-Enhanced Deep Neural Network

Existing domain adaptation methods mostly employ the representation directly extracted from the input for classification (prediction, etc.). This direct representation could fool the classifier if the input is from a new domain that is significantly different from the training domains. In this work, we propose to enhance the vanilla network by a memory module, which allows knowledge transfer from the source domain so that the network can dynamically balance the input-conveyed information and the memory-transferred knowledge, so as to more flexibly handle the previously unseen domains.

Class Memory $M$. We design a class memory module to store the class information from the source domain. Inspired by previous works [38, 30, 23] that perform prototype analysis, we use class centroids to constitute our class memory $M$, with one centroid for each of the $n_{cls}$ object classes.
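A class memory of this kind can be sketched as below, assuming that a centroid is the plain mean of the source-domain features of a class and that the memory is stored with one row per class. Both choices, and the name `class_centroids`, are assumptions for illustration.

```python
def class_centroids(features, labels, num_classes):
    """Average the source features of each class into one centroid per class."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for feat, y in zip(features, labels):
        counts[y] += 1
        for d in range(dim):
            sums[y][d] += feat[d]
    # Classes without any example keep an all-zero centroid.
    return [[s / max(counts[k], 1) for s in sums[k]] for k in range(num_classes)]


features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]  # toy source-domain features
labels = [0, 0, 1]
memory = class_centroids(features, labels, num_classes=2)
```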

Enhancer $v_{enhance}$. For each input image, we construct an enhancer $v_{enhance}$ for it using the memory $M$. The intuition is that this enhancer can augment the input's direct representation $v_{direct}$, benefiting the input especially when it comes from a novel domain. The enhancer transfers knowledge from the source domain to the input by reading the class centroids out of $M$ with weights given by a softmax function. We add this enhancer to the direct representation, balanced by a domain indicator.

Domain Indicator $e_{domain}$. In the presence of open domains, it is desirable to allow the network to dynamically calibrate how much knowledge to transfer from the source domain versus how much to rely on the direct representation of the input. Intuitively, the “domain gap” between an input and the source domain should play a role in the calibration. Hence, we design a domain indicator $e_{domain}$ to incorporate such domain awareness: it is the output of a lightweight network $T(\cdot)$ with tanh activation functions, applied to the gap between the input and the source domain as measured by the domain encoder $E_{domain}(\cdot)$ we learned earlier.

Source-Enhanced Representation $v_{transfer}$. We are now ready to present the final representation of the input, which is dynamically enhanced by the memory:

$$v_{transfer} = v_{direct} + e_{domain} \otimes v_{enhance},$$

which transfers class-discriminative knowledge from the labeled source domain to the input in a dynamic, domain-aware manner. Here $v_{direct}$ is the direct representation, $v_{enhance}$ the enhancer, and $e_{domain}$ the domain indicator; the operation $\otimes$ is an element-wise multiplication. Inspired by cosine classifiers [22, 9], we $\ell_2$-normalize this representation before sending it to the softmax classification layer. Together with the three major components of our approach, this normalization helps mitigate the domain differences, especially when the input is far away from the source domain.
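The following sketch puts the pieces of this subsection together on toy vectors. It is an illustration under stated assumptions rather than the authors' network: the read-out weights are taken to be a softmax over dot-product similarities between the direct representation and each centroid, the domain indicator is passed in as a ready-made gate instead of being produced by a learned tanh network, and the name `enhance` is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def enhance(v_direct, memory, e_domain):
    """Add a gated memory read-out to the direct representation, then normalize.

    memory:   list of class centroids, one row per class.
    e_domain: per-dimension gate in [-1, 1], standing in for the domain indicator.
    """
    # Assumed attention: softmax over dot-product similarity to each centroid.
    weights = softmax([sum(a * b for a, b in zip(v_direct, c)) for c in memory])
    v_enhance = [sum(w * c[d] for w, c in zip(weights, memory))
                 for d in range(len(v_direct))]
    # Element-wise gate, then addition to the direct representation.
    v_transfer = [v + e * h for v, e, h in zip(v_direct, e_domain, v_enhance)]
    norm = math.sqrt(sum(x * x for x in v_transfer)) or 1.0
    return [x / norm for x in v_transfer]  # l2-normalized output


memory = [[2.0, 0.0], [0.0, 2.0]]
v_in_domain = enhance([1.0, 0.0], memory, e_domain=[0.0, 0.0])  # gate closed
v_far = enhance([1.0, 0.0], memory, e_domain=[1.0, 1.0])        # gate open
```

With the gate closed the output is just the normalized direct representation; with the gate open the read-out from the memory is mixed in, which mirrors the intended behaviour for inputs close to and far from the source domain.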

Figure 5: Results of ablation studies about (a) the memory-enhanced embeddings and curriculum domain adaptation and (b) the domain-focused factors disentanglement.


Method      Src.     Compound Domains (C)             Open (O)   Avg.
            SVHN →   MNIST      MNIST-M    USPS       SynNum     C          C+O
ADDA [42]            80.1±0.4   56.8±0.7   64.8±0.3   72.5±1.2   67.2±0.5   68.6±0.7
JAN [24]             65.1±0.1   43.0±0.1   63.5±0.2   85.6±0.0   57.2±0.1   64.3±0.1
MCD [35]             69.6±1.4   48.6±0.5   70.6±0.2   89.8±2.9   62.9±1.0   69.9±1.3
MTDA [8]             84.6±0.3   65.3±0.2   70.0±0.2   -          73.3±0.2   -
DADA [32]            -          -          -          -          -          80.1±0.4
Ours                 93.6±0.2   68.7±0.5   87.5±0.3   88.0±0.8   83.3±0.3   84.5±0.5


Table 2: Performance on the C-Digits benchmark. The methods in gray are designed for multi-target domain adaptation. MTDA uses domain labels, while DADA uses the open domain images.

4 Experiments

Datasets. To facilitate a comprehensive evaluation on various tasks (i.e., classification, segmentation, and navigation), we carefully design four open compound domain adaptation (OCDA) benchmarks: C-Digits, C-Faces, C-Driving, and C-Mazes, respectively.

  1. C-Digits: This benchmark aims to evaluate the classification adaptation ability under different appearances and backgrounds. It is built upon five classic digits datasets (SVHN [27], MNIST [19], MNIST-M [6], USPS [17] and SynNum [6]), where SVHN is used as the source domain, MNIST, MNIST-M, and USPS are mixed as the compound target domain, and SynNum is the open domain. We employ SWIT [39] as an additional open domain for further analysis.

  2. C-Faces: This benchmark aims to evaluate the classification adaptation ability under different camera poses. It is built upon the Multi-PIE dataset [13], where C05 (frontal view) is used as the source domain, C08-C14 (left side view) are combined as the compound target domain, and C19 (right side view) is kept out as the open domain.

  3. C-Driving: This benchmark aims to evaluate the segmentation adaptation ability from simulation to different real driving scenarios. The GTA-5 [33] dataset is adopted as the source domain, while the BDD100K dataset [43] (with different scenarios including “rainy”, “snowy”, “cloudy”, and “overcast”) is taken for the compound and open domains.

  4. C-Mazes: This benchmark aims to evaluate the navigation adaptation ability under different environmental appearances. It is built upon the GridWorld environment [16], where mazes with different colors are used as the source and open domains. Since reinforcement learning often assumes no prior access to the environments, there are no compound domains here.

Network Architectures. To make a fair comparison with previous works [42, 8, 32], the modified LeNet-5 [19] and ResNet-18 [14] are used as the backbone networks for C-Digits and C-Faces, respectively. Following [41, 47, 29], a pre-trained VGG-16 [37] is the backbone network for C-Driving. We additionally test our approach on reinforcement learning using ResNet-18 following [16].

Evaluation Metrics. The C-Digits performance is measured by the digit classification accuracy, while the C-Faces performance is measured by the facial expression classification accuracy. The C-Driving performance is measured by the standard mIoU, while the C-Mazes performance is measured by the average success rate in 300 steps. We evaluate the performance of each method with five runs and report both the mean and standard deviation. Moreover, we report both results of individual domains and the averaged results for a comprehensive analysis.
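The reporting protocol amounts to the small computation below. The per-run accuracies are hypothetical, the use of the sample standard deviation is an assumption (the text does not say which estimator is used), and the function names are illustrative.

```python
import statistics


def summarize_runs(run_accuracies):
    """Mean and sample standard deviation of one metric over repeated runs."""
    return statistics.mean(run_accuracies), statistics.stdev(run_accuracies)


def average_over_domains(per_domain_means):
    """Unweighted mean across domains (assumed for the averaged columns)."""
    return sum(per_domain_means) / len(per_domain_means)


runs = [93.4, 93.6, 93.8, 93.5, 93.7]  # hypothetical accuracies from five runs
mean_acc, std_acc = summarize_runs(runs)

# Per-domain means of "Ours" on the three compound domains in Table 2.
compound_avg = average_over_domains([93.6, 68.7, 87.5])
```

For Table 2, the unweighted mean of the three compound-domain accuracies of our approach rounds to 83.3, which is consistent with the reported compound average; other benchmarks may weight domains differently.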

Comparison Methods. For classification tasks, we choose for comparison state-of-the-art methods in both conventional unsupervised domain adaptation (ADDA [42], JAN [24], MCD [35]) and the recent multi-target domain adaptation methods (MTDA [8] and DADA [32]). Since MTDA [8] and DADA [32] are the most related to our work, we directly contrast our results to the numbers reported in their papers. For the segmentation task, we compare with three state-of-the-art methods, AdaptSeg [41], CBST [47] and IBN-Net [29]. For the reinforcement learning task, we benchmark with SynPo [16], a representative work for adaptation across environments. We apply these methods to the same backbone networks as ours for a fair comparison.


Method      Src.     Compound Domains (C)                        Open (O)   Avg.
            C05 →    C08        C09        C13        C14        C19        C          C+O
ADDA [42]            46.9±0.2   36.4±0.5   39.1±0.3   65.4±0.4   71.8±0.8   47.0±0.4   51.9±0.4
JAN [24]             63.5±0.3   40.6±1.0   83.5±0.4   92.0±0.8   52.5±1.5   69.7±0.6   66.2±0.8
MCD [35]             50.4±0.5   45.8±0.2   77.8±0.1   88.0±0.1   60.4±0.9   65.7±0.2   64.6±0.4
MTDA [8]             49.0±0.2   48.2±0.1   53.1±0.2   84.3±0.1   -          58.7±0.2   -
Ours                 73.3±0.2   55.1±0.4   84.1±0.1   88.9±0.3   72.7±0.6   75.4±0.3   74.8


Table 3: Performance on the C-Faces benchmark. The methods in gray are especially designed for multi-target domain adaptation. MTDA uses domain labels during training.


Method         Source   Compound (C)                  Open (O)  Avg.
               GTA-5 →  Rainy     Snowy     Cloudy    Overcast  C         C+O
Source Only             16.2      18.0      20.9      21.2      18.9      19.1
AdaptSeg [41]           20.2      21.2      23.8      25.1      22.1      22.5
CBST [47]               21.3      20.6      23.9      24.7      22.2      22.6
IBN-Net [29]            20.6      21.9      26.1      25.5      22.8      23.5
Ours                    22.0      22.9      27.0      27.9      24.5      25.0

Method         Source   Open (O)                                Avg.
               M0 →     M1        M2        M3        M4        O
MTL                     0±0       30±5      75±0      65±5      42.5±2.5
MLP [16]                5±5       45±10     75±5      80±10     51.2±7.5
SynPo [16]              5±5       30±20     80±5      30±5      36.3±8.8
SynPo+A.                0±5       40±10     95±5      45±5      45.0±6.3
Ours                    80±2.5    75±10     85±5      90±5      82.5±5.6


Table 4: Performance on the C-Driving (left) and C-Mazes benchmarks (right). “SynPo+A.” indicates that we equip SynPo with proper color augmentation during training.

4.1 Ablation Study

Effectiveness of the Domain-Focused Factors Disentanglement. Here we verify that the domain-focused factors disentanglement helps discover the latent structures in the compound target domain. It is probed by the domain identification rate within the $k$-nearest neighbors found by different encodings. Figure 5 (b) shows that features produced by our disentanglement have a much higher identification rate (95%) than the counterparts without disentanglement (65%).
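One plausible way to compute such a probe is sketched below: for every example, find its $k$ nearest neighbors in the feature space and measure the share that comes from the same domain. The exact protocol behind Figure 5 (b) is not spelled out in the text, so the definition, the Euclidean distance, and the name `domain_identification_rate` are assumptions.

```python
import math


def domain_identification_rate(features, domain_ids, k):
    """Average share of each example's k nearest neighbors from its own domain."""
    total = 0.0
    for i, query in enumerate(features):
        dists = [(math.dist(query, other), j)
                 for j, other in enumerate(features) if j != i]
        neighbours = [j for _, j in sorted(dists)[:k]]
        same = sum(domain_ids[j] == domain_ids[i] for j in neighbours)
        total += same / k
    return total / len(features)


# Toy features forming two well-separated domains.
features = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
            [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
domain_ids = [0, 0, 0, 1, 1, 1]
rate = domain_identification_rate(features, domain_ids, k=2)
```

A rate close to 1 indicates that the encoding groups examples by domain, which is the behaviour expected from the domain-focused features.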

Effectiveness of the Curriculum Domain Adaptation. Figure 5 (a) also reveals that, in the compound domain, the curriculum training contributes more to the performance on USPS than to those on MNIST and MNIST-M. On the other hand, we can observe from Figure 4 that USPS is the furthest target domain from the source domain SVHN. This implies that curriculum domain adaptation makes it easier to adapt to distant target domains through an easy-to-hard adaptation schedule.

Effectiveness of Memory-Enhanced Representations. Recall that the memory-enhanced representations consist of two main components: the enhancer coming from the memory and the domain indicator. From Figure 5 (a), we observe that the class enhancer leads to large improvements on all target domains. It is because the enhancer from the memory transfers useful semantic concepts to the input of any domain. Another observation is that the domain indicator is the most effective on the open domain (“SynNum”), probably because it helps dynamically calibrate the representations by leveraging domain relations.

4.2 Comparison Results

C-Digits. Table 2 shows the comparison performances of different methods. We have the following observations. Firstly, ADDA [42] and JAN [24] boost the performance on the compound domain by enforcing global distribution alignment. However, they also sacrifice the performance on the open domain since there is no built-in mechanism for handling any new domains, “overfitting” the model to the seen domains. Secondly, MCD [35] improves the results on the open domain, but its accuracy degrades on the compound target domain. Maximizing the classifier discrepancy increases the robustness to the open domain; however, it also fails to capture the fine-grained latent structure in the compound target domain. Lastly, compared to other multi-target domain adaptation methods (MTDA [8] and DADA [32]), our approach discovers domain structures and performs domain-aware knowledge transfer, achieving substantial advantages on all the test domains.

C-Faces. Similar observations can be made on the C-Faces benchmark as shown in Table 3. Since face representations are inherently hierarchical, JAN [24] demonstrates competitive results on C14 due to its layer-wise transferring strategy. Under the domain shift with different camera poses, our approach still consistently outperforms other alternatives for both the compound and open domains.

C-Driving. We compare with state-of-the-art semantic segmentation adaptation methods such as AdaptSeg [41], CBST [47], and IBN-Net [29]. All methods are tested under real-world driving scenarios in the BDD100K dataset [43]. As shown in Table 4 (left), our approach has clear advantages on both the compound domain (24.5 versus at most 22.8 mIoU on average for the other methods) and the open domain (27.9 versus at most 25.5). We show detailed per-class accuracies in the supplementary material.

C-Mazes. To directly compare with SynPo [16], we also evaluate on the GridWorld environments they provided. The task in this benchmark is to learn navigation policies that can successfully collect all the treasures in the given mazes. Existing reinforcement learning methods suffer from environmental changes, which we simulate here as changes in the appearances of the mazes. The final results are listed in Table 4 (right). Our approach transfers visual knowledge among navigation experiences and raises the average success rate on the open domains to 82.5, compared with at most 51.2 for the prior arts.

Figure 6: (a) Qualitative results comparison of semantic segmentation on the source domain (S), the compound target domain (C), and the open domain (O). (b) Illustrations of the 5 different domains in the C-Mazes benchmark.

4.3 Further Analyses

Robustness to the Complexity of the Compound Target Domain. We control the complexity of the compound target domain by varying the number of traditional target domains / datasets in it. Here we gradually increase the constituting domains from a single target domain (i.e., MNIST) to two, and eventually three (i.e., MNIST + MNIST-M + USPS). From Figure 7 (a), we observe that as the number of datasets increases, our approach only undergoes a moderate performance drop. The learned curriculum enables gradual knowledge transfer that is capable of coping with complex structures in the compound target domain.

Figure 7: Further analysis on the (a) robustness to the complexity of the compound target domain and (b) robustness to the number of open domains.

Robustness to the Number of Open Domains. The performance change w.r.t. the number of open domains is demonstrated in Figure 7 (b). Here we include two new digits datasets, USPS-M (crafted in a similar way as MNIST-M) and SWIT [39], as the additional open domains. Compared to JAN [24] and MCD [35], our approach is more resilient to the various numbers of open domains. The domain indicator module in our framework helps dynamically calibrate the embedding, thus enhancing the robustness to open domains. Figure 8 presents the t-SNE visualization comparison between the obtained embeddings of JAN [24], MCD [35], and our approach.

Figure 8: t-SNE visualization of the obtained embeddings. “M”, “MM” and “U” stand for MNIST, MNIST-M, and USPS, respectively, while “SN”, “UM” and “SW” stand for SynNum, USPS-M, and SWIT, respectively.

5 Conclusion

In this work, we have formalized a more realistic research problem, called open compound domain adaptation (OCDA), for studying techniques of learning deep neural networks that are robust to domain mismatches. We also proposed a novel three-pronged approach to OCDA, including a training algorithm inspired by curriculum domain adaptation, a class-confusion loss for disentangling domain-dominant factors from the class-discriminative factors, and a memory-augmented neural network to better deal with the new open domains during inference. Several benchmarks on classification, recognition, segmentation, and reinforcement learning have been designed to facilitate future research towards domain-robust learning.


  • [1] A. Bobu, E. Tzeng, J. Hoffman, and T. Darrell (2018) Adapting to continuously shifting domains. ICLR Workshop. Cited by: §2.
  • [2] Z. Cao, L. Ma, M. Long, and J. Wang (2018) Partial adversarial domain adaptation. In ECCV, Cited by: Appendix A.
  • [3] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2015) Semantic image segmentation with deep convolutional nets and fully connected crfs. ICLR. Cited by: §C.2.
  • [4] Z. Chen, J. Zhuang, X. Liang, and L. Lin (2019) Blending-target domain adaptation by adversarial meta-adaptation networks. In CVPR, Cited by: §2.
  • [5] D. Dai, C. Sakaridis, S. Hecker, and L. Van Gool (2019)

    Model adaptation with synthetic and real data for semantic dense foggy scene understanding

    IJCV. Cited by: §1, §1.
  • [6] Y. Ganin and V. Lempitsky (2015)

    Unsupervised domain adaptation by backpropagation

    In ICML, Cited by: §C.1, §1, §2, item 1.
  • [7] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. JMLR. Cited by: §1, §2.
  • [8] B. Gholami, P. Sahu, O. Rudovic, K. Bousmalis, and V. Pavlovic (2018) Unsupervised multi-target domain adaptation: an information theoretic approach. arXiv preprint arXiv:1810.11547. Cited by: §2, Table 2, §4.2, Table 3, §4, §4.
  • [9] S. Gidaris and N. Komodakis (2018) Dynamic few-shot visual learning without forgetting. In CVPR, Cited by: §3.3.
  • [10] B. Gong, K. Grauman, and F. Sha (2013) Reshaping visual datasets for domain adaptation. In NIPS, Cited by: §2.
  • [11] B. Gong, Y. Shi, F. Sha, and K. Grauman (2012) Geodesic flow kernel for unsupervised domain adaptation. In CVPR, Cited by: §1, §1, §2.
  • [12] R. Gong, W. Li, Y. Chen, and L. Van Gool (2019) DLOW: domain flow for adaptation and generalization. In CVPR, Cited by: §2.
  • [13] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker (2008) Multi-pie. In 2008 8th IEEE International Conference on Automatic Face Gesture Recognition, Cited by: §C.1, item 2.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: Appendix B, §C.2, §4.
  • [15] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell (2018) Cycada: cycle-consistent adversarial domain adaptation. In ICML, Cited by: §C.1, §C.2, §2.
  • [16] H. Hu, L. Chen, B. Gong, and F. Sha (2018) Synthesized policies for transfer and adaptation across tasks and environments. In NIPS, Cited by: §C.1, §C.2, Figure 15, item 4, §4.2, Table 4, §4, §4.
  • [17] J. J. Hull (1994) A database for handwritten text recognition research. TPAMI. Cited by: §C.1, §1, item 1.
  • [18] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. ICLR. Cited by: §C.2.
  • [19] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE. Cited by: §C.1, §C.2, §1, item 1, §4.
  • [20] D. Li, Y. Yang, Y. Song, and T. M. Hospedales (2018) Learning to generalize: meta-learning for domain generalization. In AAAI, Cited by: §2.
  • [21] H. Li, S. Jialin Pan, S. Wang, and A. C. Kot (2018) Domain generalization with adversarial feature learning. In CVPR, Cited by: §2.
  • [22] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017) Sphereface: deep hypersphere embedding for face recognition. In CVPR, Cited by: §3.3.
  • [23] Z. Liu, Z. Miao, X. Zhan, J. Wang, B. Gong, and S. X. Yu (2019) Large-scale long-tailed recognition in an open world. In CVPR, Cited by: §3.3.
  • [24] M. Long, H. Zhu, J. Wang, and M. I. Jordan (2017) Deep transfer learning with joint adaptation networks. In ICML, Cited by: Figure 13, Figure 14, Figure 16, Figure 17, Appendix E, §1, §2, Table 2, §4.2, §4.2, §4.3, Table 3, §4.
  • [25] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. JMLR. Cited by: Figure 4.
  • [26] M. Mancini, S. R. Bulo, B. Caputo, and E. Ricci (2019) AdaGraph: unifying predictive and continuous domain adaptation through graphs. In CVPR, Cited by: §2.
  • [27] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y Ng (2011) Reading digits in natural images with unsupervised feature learning. NIPS. Cited by: §C.1, §1, item 1.
  • [28] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang (2010) Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks. Cited by: §1.
  • [29] X. Pan, P. Luo, J. Shi, and X. Tang (2018) Two at once: enhancing learning and generalization capacities via ibn-net. In ECCV, Cited by: §C.2, Table 8, §4.2, Table 4, §4, §4.
  • [30] Y. Pan, T. Yao, Y. Li, Y. Wang, C. Ngo, and T. Mei (2019) Transferrable prototypical networks for unsupervised domain adaptation. In CVPR, Cited by: §3.3.
  • [31] P. Panareda Busto and J. Gall (2017) Open set domain adaptation. In ICCV, Cited by: Appendix A.
  • [32] X. Peng, Z. Huang, X. Sun, and K. Saenko (2019) Domain agnostic learning with disentangled representations. In ICML, Cited by: §2, Table 2, §4.2, §4, §4.
  • [33] S. R. Richter, V. Vineet, S. Roth, and V. Koltun (2016) Playing for data: Ground truth from computer games. In ECCV, Cited by: §C.1, item 3.
  • [34] K. Saenko, B. Kulis, M. Fritz, and T. Darrell (2010) Adapting visual category models to new domains. In ECCV, Cited by: §1, §2.
  • [35] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada (2018) Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, Cited by: Figure 13, Figure 14, Figure 16, Figure 17, Appendix E, §1, §2, Table 2, §4.2, §4.3, Table 3, §4.
  • [36] K. Saito, S. Yamamoto, Y. Ushiku, and T. Harada (2018) Open set domain adaptation by backpropagation. In ECCV, Cited by: Appendix A.
  • [37] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: Appendix B, §4.
  • [38] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In NIPS, Cited by: §3.3.
  • [39] Switzerland handwritten digits dataset. Note: 2019-03-15 Cited by: item 1, §4.3.
  • [40] A. Torralba and A. A. Efros (2011) Unbiased look at dataset bias. In CVPR, Cited by: §1, §2.
  • [41] Y. Tsai, W. Hung, S. Schulter, K. Sohn, M. Yang, and M. Chandraker (2018) Learning to adapt structured output space for semantic segmentation. In CVPR, Cited by: §C.2, §C.2, Table 8, Appendix D, §4.2, Table 4, §4, §4.
  • [42] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In CVPR, Cited by: §C.2, §1, §2, §3.2, Table 2, §4.2, Table 3, §4, §4.
  • [43] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell (2018) BDD100K: a diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687. Cited by: §C.1, item 3, §4.2.
  • [44] H. Yu, M. Hu, and S. Chen (2018) Multi-target unsupervised domain adaptation without exactly shared categories. arXiv preprint arXiv:1809.00852. Cited by: §2.
  • [45] J. Zhang, Z. Ding, W. Li, and P. Ogunbona (2018) Importance weighted adversarial nets for partial domain adaptation. In CVPR, Cited by: Appendix A.
  • [46] Y. Zhang, P. David, H. Foroosh, and B. Gong (2019) A curriculum domain adaptation approach to the semantic segmentation of urban scenes. TPAMI. Cited by: §1, §1, §3.2.
  • [47] Y. Zou, Z. Yu, B. Vijaya Kumar, and J. Wang (2018) Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In ECCV, Cited by: §C.2, Table 8, §1, §1, §4.2, Table 4, §4, §4.


In this supplementary material, we provide details omitted in the main text including:

  • Section A: relation to other DA problems (Sec. 2 “Related Works” of the main paper.)

  • Section B: more methodology details (Sec. 3 “Our Approach” of the main paper.)

  • Section C: detailed experimental setup (Sec. 4 “Experiments” of the main paper.)

  • Section D: additional comparison results (Sec. 4.2 “Comparison Results” of the main paper.)

  • Section E: additional visualization of our approach (Sec. 4.3 “Further Analysis” of the main paper.)

Appendix A Relation to Other Domain Adaptation Problems

The human visual system shows a remarkable ability to generalize and see clearly across many different domains, such as on foggy and rainy days. Computer vision systems, on the other hand, have long been haunted by this domain shift issue. Several sub-fields of domain adaptation have been studied to mitigate this challenge.

Open/Partial Set Domain Adaptation. Another line of research tackles the mismatch of categories between the source and target domains, namely open set [31, 36] and partial set [45, 2] domain adaptation. They assume that the target domain contains either (1) new categories that do not appear in the source domain; or (2) only a subset of the categories that appear in the source domain. Both settings concern the “openness” of categories. Instead, in this work we investigate the “openness” of domains, i.e., we assume that there exist unknown domains that are absent from the training phase.

Appendix B More Methodology Details

Notation Summary. We summarize the notations used in the paper in Table 5.


 Notation  Meaning
input image
category label
random category label
class encoder
class classifier
domain encoder
class decoder
class classifier after domain encoder
direct feature
visual memory
class enhancer
domain indicator
network that generates
source-enhanced representation


Table 5: Summary of notations.

Details of Class and Domain Manifold Disentanglement. The detailed disentangling algorithm is shown in Algorithm 1.

, , , , , ,
for k iterations do
     Sample mini-batch .
     Compute pseudo label .
     Update the discriminator with: .
     Prepare random label , and convert it to one-hot vector
     Compute adversarial loss: .
     Compute reconstruction loss: .
     Update the domain encoder with: .
end for
Algorithm 1 Disentangling training. and has been trained using source-domain data, : the decoder, : number of classes, : a constant.
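The "prepare random label" step of Algorithm 1 can be illustrated with a small sketch. It assumes that the random labels are drawn uniformly over the classes, which the text does not state; the function names `one_hot` and `random_one_hot_labels` are hypothetical and not taken from the authors' code.

```python
import random


def one_hot(label, num_classes):
    """Return a list of length num_classes with 1.0 at index `label`."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def random_one_hot_labels(batch_size, num_classes, rng=None):
    """Draw one random class index per sample and one-hot encode it.

    Uniform sampling over classes is an assumption made for this sketch.
    """
    rng = rng or random.Random()
    labels = [rng.randrange(num_classes) for _ in range(batch_size)]
    return labels, [one_hot(y, num_classes) for y in labels]
```

In a full training loop, these one-hot vectors would serve as the targets of the adversarial loss in place of the pseudo labels.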

Time Complexity. Our approach introduces negligible computational overhead to the standard deep neural networks, such as VGG [37] and ResNet [14], since only a lightweight memory module is inserted during inference.


Domains                    Rainy (Compound)   Snowy (Compound)   Cloudy (Compound)   Overcast (Open)   Total
training set w/o label     4855               5307               4535                8143              22840
validation set w/ label    215                242                346                 627               1430

Table 6: Statistics of BDD100K dataset in our Open Compound Domain Adaptation (OCDA) setting.
Figure 9: Examples of the C-Driving benchmark (GTA5 and BDD100K datasets).

Appendix C Experimental Setup

C.1 Open Compound Domain Adaptation (OCDA) Datasets

C-Digits. The C-Digits dataset consists of five digit datasets: SVHN [27], MNIST [19], MNIST-M [6], USPS [17], and SynNum [6]. SVHN is a dataset of street view house numbers. MNIST and USPS are two datasets of hand-written numbers. MNIST-M and SynNum are two datasets of synthetic numbers. We choose SVHN as the source domain, MNIST, MNIST-M, and USPS as target domains that can be accessed during training, and SynNum as the open domain, which is only accessible during testing. Images from all domains are scaled to and converted to RGB format. We follow the original training/testing split of each dataset. The SVHN dataset is balanced across the classes, following [15]. No further pre-processing is applied to the data.

C-Faces. We choose images of 6 different view angles from the Multi-PIE dataset [13]: C051, C080, C090, C130, C140, and C190. Six facial expressions are treated as classification categories: neutral, smile, surprise, squint, disgust, and scream. We use images of view angle C051 as the source domain in our experiment, C080, C090, C130, and C140 as target domains that can be accessed during training, and C190 as the open domain. All the images are aligned according to the face landmarks and cropped around the facial areas into images.

C-Driving. For semantic segmentation of driving scenarios, we adopt the GTA5 [33] dataset as the source domain, and the BDD100K [43] dataset as the compound and open domains. Examples of these datasets are shown in Fig. 9.

GTA5 is a virtual street view dataset generated from Grand Theft Auto V (GTA5). It has 24966 images of resolution . The domain categories and statistics of the BDD100K dataset are provided in Table 6. We use the Rainy, Snowy, Cloudy, and Overcast domains in the training set of BDD100K in our experiments. The Overcast domain is used as the open domain, while the rest are used as the compound domains. Other domains of the original BDD100K dataset, such as Clear and Foggy, are not used. Among these data, a small fraction with annotations is used as the validation set, while the rest are used as the training set for adaptation. The results in our experiments are reported on the validation set.

C-Mazes. This is a reinforcement learning dataset which consists of mazes with different colors. The mazes are generated following [16], where an agent is asked to collect treasure spots from the mazes in different orders. For simplicity, we only use one scene (i.e., a maze with one topology) and two tasks (i.e., two different treasure collection orders) for the experiments. We randomly generated 5 combinations of colors to construct the domains (which are illustrated in Figure 10). The colors of the agent and the five treasures are the same across domains. We use Domain 0 as the source domain, and the remaining 4 domains are target domains.

Figure 10: Illustrations of the 5 different domains in the C-Mazes benchmark.

C.2 Training Details

C-Digits. The backbone model for this experiment is a LeNet-5 [19], following the setup in [15]. There are two stages in the training process. (1) In the first stage, we train the network with the discriminative centroid loss for 100 epochs as a warm start for the memory-enhanced deep neural network. (2) In the second stage, we further fine-tune the networks with curriculum sampling and memory modules on top of the backbone network, where a domain adversarial loss [42] is incorporated and the weights learned in the first stage are copied to two identical but independent networks (source network and target network). Only the weights of the target network are updated during this stage. The centroids of each class (i.e., the constituting elements of the class memory) are calculated at the beginning of this stage, and the classifiers are reinitialized. The model is trained without the discriminative centroid loss in stage 2. Some major hyper-parameters can be found in Table 7.
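The class centroids that make up the class memory can be sketched as follows. This is a minimal illustration that assumes each centroid is the arithmetic mean of the feature vectors of one class; the function name `class_centroids` is hypothetical and the exact aggregation used by the authors may differ.

```python
from collections import defaultdict


def class_centroids(features, labels):
    """Average the feature vectors of each class.

    features: list of equal-length lists of numbers (one per sample).
    labels:   list of integer class labels, aligned with `features`.
    Returns a dict mapping class label -> centroid (list of floats).
    """
    sums = {}
    counts = defaultdict(int)
    for feat, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(feat)
        for i, value in enumerate(feat):
            sums[y][i] += value
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}
```

The same routine would apply to any setting where one center per category (or per action) is needed, given features extracted by a pre-trained network.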

C-Faces. Experiments on the C-Faces dataset are similar to experiments on the C-Digits dataset. However, the backbone model is ResNet18 [14] with random initialization, instead of a LeNet-5. Some major hyper-parameters can also be found in Table 7.


 Dataset  Initial LR.  Epoch  betas for ADAM
C-Digits (stage 1) 1e-4 100 (0.9, 0.999)
C-Digits (stage 2) 1e-5 200 (0.9, 0.999)
C-Faces (stage 1) 1e-4 100 (0.9, 0.999)
C-Faces (stage 2) 1e-5 200 (0.9, 0.999)


Table 7: The major hyper-parameters used in our experiments. “LR.” stands for learning rate.

C-Driving. Our implementation mainly follows [41]. We use the DeepLab-VGG16 [3] model with synchronized batch normalization, and the batch size is set to 8. The initial learning rate is 0.01 and is decreased using the “poly” policy with a power of 0.9. The maximum number of iterations is 40k, and we apply early stopping at 5k iterations to reduce overfitting. The GTA5 and BDD100K images are resized to and for training respectively. For the IBN-Net [29] baseline, we replace the batch normalization layers after the {2, 4, 7}-th convolution layers with instance normalization layers. For the AdaptSeg [41] baseline, we use the Adam optimizer [18] and an initial learning rate of 0.005 for the discriminator.
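For reference, the “poly” learning rate policy is commonly defined as lr = base_lr · (1 − iter / max_iter)^power. The sketch below assumes this common form is the one used here; the function name `poly_lr` is hypothetical.

```python
def poly_lr(base_lr, iteration, max_iterations, power=0.9):
    """Learning rate under the commonly used "poly" decay policy.

    lr = base_lr * (1 - iteration / max_iterations) ** power
    """
    if not 0 <= iteration <= max_iterations:
        raise ValueError("iteration must lie in [0, max_iterations]")
    return base_lr * (1.0 - iteration / max_iterations) ** power
```

With the settings quoted above (initial learning rate 0.01, power 0.9, 40k maximum iterations), this form gives a learning rate of about 0.0089 at the 5k-iteration early-stopping point.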

In our method, we use dynamic transferable embedding and curriculum training in addition to adversarial adaptation [41]. The visual memory here also comprises a set of class centroids, each of which is an aggregation of local features belonging to the same category. Inspired by [47], the curriculum learning procedure is further designed to use the averaged probability confidence of each image as guidance. Specifically, the easier samples, i.e., those with higher confidence, are fed into the model first for adaptation. Since the domain encoder is not available here, we only use the class enhancer for dynamic transferable embedding.
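The confidence-guided curriculum can be sketched as a simple ordering of images. The sketch assumes that the "averaged probability confidence" of an image is the mean, over pixels, of the highest predicted class probability; this definition and the names `image_confidence` and `curriculum_order` are assumptions made for illustration.

```python
def image_confidence(pixel_probs):
    """Mean of the per-pixel maximum class probability.

    pixel_probs: list with one entry per pixel, each a list of class
    probabilities predicted for that pixel.
    """
    return sum(max(p) for p in pixel_probs) / len(pixel_probs)


def curriculum_order(images):
    """Order image names from most to least confident (easy to hard).

    images: dict mapping image name -> pixel_probs (see image_confidence).
    """
    scores = {name: image_confidence(p) for name, p in images.items()}
    return sorted(scores, key=lambda name: scores[name], reverse=True)
```

A training schedule could then feed the images to the adaptation step in the returned order, starting with the most confident ones.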

C-Mazes. The experiments for C-Mazes are also conducted in two stages. In the first stage, we follow the setup in [16] to train a randomly initialized ResNet-18 policy network. We only use one single topology and two different tasks for the experiments. The total number of episodes is set to 16000. The initial learning rate is set to 0.001. Then, in the second stage, we use the pre-trained model to calculate state feature centers for each action and use the memory module to fine-tune the model.

Appendix D Additional Results

Figure 11: Qualitative results comparison of semantic segmentation on the source domain (S), the compound domains (C), and the open domain (O).


Domain Method
Rainy Source only 48.3 3.4 39.7 0.6 12.2 10.1 5.6 5.1 44.3 17.4 65.4 12.1 0.4 34.5 7.2 0.1 0.5 16.2
AdaptSeg [41] 58.6 17.8 46.4 2.1 19.6 15.6 5.0 7.7 55.6 20.7 65.9 17.3 0.0 41.3 7.4 3.1 0.0 20.2
CBST [47] 59.4 13.2 47.2 2.4 12.1 14.1 3.5 8.6 53.8 13.1 80.3 13.7 17.2 49.9 8.9 0.0 6.6 21.3
IBN-Net [29] 58.1 19.5 51.0 4.3 16.9 18.8 4.6 9.2 44.5 11.0 69.9 20.0 0.0 39.9 8.4 15.3 0.0 20.6
Ours 63.0 15.4 54.2 2.5 16.1 16.0 5.6 5.2 54.1 14.9 75.2 18.5 0.0 43.2 9.4 24.6 0.0 22.0
Snowy Source only 50.8 4.7 45.1 5.9 24.0 8.5 10.8 8.7 35.9 9.4 60.5 17.3 0.0 47.7 9.7 3.2 0.7 18.0
AdaptSeg [41] 59.9 13.3 52.7 3.4 15.9 14.2 12.2 7.2 51.0 10.8 72.3 21.9 0.0 55.0 11.3 1.7 0.0 21.2
CBST [47] 59.6 11.8 57.2 2.5 19.3 13.3 7.0 9.6 41.9 7.3 70.5 18.5 0.0 61.7 8.7 1.8 0.2 20.6
IBN-Net [29] 61.3 13.5 57.6 3.3 14.8 17.7 10.9 6.8 39.0 6.9 71.6 22.6 0.0 56.1 13.8 20.4 0.0 21.9
Ours 68.0 10.9 61.0 2.3 23.4 15.8 12.3 6.9 48.1 9.9 74.3 19.5 0.0 58.7 10.0 13.8 0.1 22.9
Cloudy Source only 47.0 8.8 33.6 4.5 20.6 11.4 13.5 8.8 55.4 25.2 78.9 20.3 0.0 53.3 10.7 4.6 0.0 20.9
AdaptSeg [41] 51.8 15.7 46.0 5.4 25.8 18.0 12.0 6.4 64.4 26.4 82.9 24.9 0.0 58.4 10.5 4.4 0.0 23.8
CBST [47] 56.8 21.5 45.9 5.7 19.5 17.2 10.3 8.6 62.2 24.3 89.4 20.0 0.0 58.0 14.6 0.1 0.1 23.9
IBN-Net [29] 60.8 18.1 50.5 8.2 25.6 20.4 12.0 11.3 59.3 24.7 84.8 24.1 12.1 59.3 13.7 9.0 1.2 26.1
Ours 69.3 20.1 55.3 7.3 24.2 18.3 12.0 7.9 64.2 27.4 88.2 24.7 0.0 62.8 13.6 18.2 0.0 27.0
Overcast Source only 46.6 9.5 38.5 2.7 19.8 12.9 9.2 17.5 52.7 19.9 76.8 20.9 1.4 53.8 10.8 8.4 1.8 21.2
AdaptSeg [41] 59.5 24.0 49.4 6.3 23.3 19.8 8.0 14.4 61.5 22.9 74.8 29.9 0.3 59.8 12.8 9.7 0.0 25.1
CBST [47] 58.9 26.8 51.6 6.5 17.8 17.9 5.9 17.9 60.9 21.7 87.9 22.9 0.0 59.9 11.0 2.1 0.2 24.7
IBN-Net [29] 62.9 25.3 55.5 6.5 21.2 22.3 7.2 15.3 53.3 16.5 81.6 31.1 2.4 59.1 10.3 14.2 0.0 25.5
Ours 73.5 26.5 62.5 8.6 24.2 20.2 8.5 15.2 61.2 23.0 86.3 27.3 0.0 64.4 14.3 13.3 0.0 27.9


Table 8: Per-category IoU (%) results on the C-Driving benchmark. (The BDD100K dataset is used as the real-world target domain data.) The “train” and “bicycle” categories are not listed because their results are close to zero.
Figure 12: Ablation study of memory-enhanced neural network and curriculum training on the C-Driving benchmark (GTA5 and BDD100K datasets).

C-Driving. Fig. 11 shows example results on the GTA5 and BDD100K datasets. Our method produces more accurate segmentation results on the compound domains and the open domain compared to “source only” and AdaptSeg [41]. Per-category results are provided in Table 8, and an ablation study of dynamic transferable embedding and curriculum training is shown in Fig. 12.
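For readers unfamiliar with the metric, per-category IoU is conventionally computed from a pixel-level confusion matrix as TP / (TP + FP + FN). The following sketch uses this standard definition; the function names are hypothetical and this is not the authors' evaluation code.

```python
def per_class_iou(confusion):
    """Per-class IoU in percent from a square confusion matrix.

    confusion[i][j] = number of pixels with ground-truth class i
    that were predicted as class j.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        # A class that never occurs and is never predicted has no defined IoU.
        ious.append(100.0 * tp / denom if denom else float("nan"))
    return ious


def mean_iou(ious):
    """Average the defined (non-NaN) per-class IoU values."""
    valid = [v for v in ious if v == v]  # NaN is not equal to itself
    return sum(valid) / len(valid)
```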

Appendix E More Visualization

t-SNE Visualization. Here we show t-SNE visualizations of the learned dynamic transferable embedding on the C-Digits, C-Faces, and C-Mazes testing data (Figures 13-15) for various methods. On the C-Mazes benchmark, the dynamic transferable embeddings are the state features of each action. Our approach generally learns a more discriminative feature space thanks to the proposed disentanglement and memory modules.

Figure 13: t-SNE of the C-Digits features of MCD [35], JAN [24], and our approach.
Figure 14: t-SNE of the C-Faces expression features of MCD [35], JAN [24], and our approach.
Figure 15: t-SNE of the C-Mazes action features of MLP [16], MTL, SynPo [16], and our approach.

Confusion Matrices. Here we show visualizations of the class confusion matrices on the C-Digits and C-Faces testing data in Figure 16 and Figure 17. Compared to MCD [35] and JAN [24], our method achieves better per-class accuracies.
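As a reminder of how such matrices relate to per-class accuracy, the sketch below builds a confusion matrix from label lists and reads the accuracy of each class off its diagonal. It follows the standard definitions; the function names are hypothetical and the labels in any usage would be illustrative only.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count matrix m where m[i][j] is the number of samples with
    true class i that were predicted as class j."""
    m = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        m[t][p] += 1
    return m


def per_class_accuracy(confusion):
    """Fraction of each true class that was predicted correctly
    (diagonal entry divided by the row sum)."""
    accs = []
    for c, row in enumerate(confusion):
        total = sum(row)
        accs.append(row[c] / total if total else float("nan"))
    return accs
```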

Figure 16: Confusion matrix visualizations on the C-Digits benchmark for MCD [35], JAN [24], and our approach.
Figure 17: Confusion matrix visualizations on the C-Faces benchmark for MCD [35], JAN [24], and our approach.