Adaptation Across Extreme Variations using Unlabeled Domain Bridges

06/05/2019 · Shuyang Dai, et al.

We tackle an unsupervised domain adaptation problem in which the domain discrepancy between the labeled source and unlabeled target domains is large due to many factors of inter- and intra-domain variation. While deep domain adaptation methods reduce this discrepancy directly, they are difficult to apply when the domains differ significantly. In this work, we propose to decompose the domain discrepancy into multiple smaller, and thus easier to minimize, discrepancies by introducing unlabeled bridging domains that connect the source and target domains. We realize our proposal through an extension of the domain adversarial neural network with multiple discriminators, each of which accounts for reducing the discrepancy between an unlabeled (bridge or target) domain and the mix of all precedent domains, including the source. We validate the effectiveness of our method on several adaptation tasks, including object recognition and semantic segmentation.


1 Introduction

Figure 1: Unsupervised domain adaptation is challenging when the target domain differs significantly from the source domain due to many convoluted factors of variation. We introduce bridging domains composed of unlabeled images that share some factors with the source (e.g., lighting) and others with the target domain (e.g., viewpoint, image resolution). Bridging domains allow the model to learn individual factors of variation one by one, making the adaptation from the source to the eventual target domain easier.

With advances in supervised deep learning, many vision problems have seen tremendous improvements [30, 47, 49, 22, 16, 43, 42, 33, 32, 46, 7]. While this success is driven by several factors, such as improved deep learning architectures [22, 25] and optimization techniques [13, 29, 27], it strongly depends on the existence of large-scale labeled training data [12]. Unfortunately, such a dataset may not be available for every application domain. This demands new ways of transferring knowledge from existing labeled data to individual target applications, potentially with access to large-scale unlabeled data from the application domain.

Unsupervised domain adaptation (UDA) [3, 2] has been proposed to improve the generalization ability of classifiers using unlabeled data from the target domain. Deep domain adaptation, which realizes UDA in a deep learning framework, has been successful in several vision tasks [54, 15, 8, 26, 24, 23, 51]. The core idea is to reduce a discrepancy metric between the two domains, measured by a domain discriminator [15] or an MMD kernel [54] at certain representations of the deep network. Ideally, the discriminator should learn the transformation mechanisms between the two domains. However, it can be difficult to model such dynamics when many factors of inter- and intra-domain variation are involved in transforming the source domain into the target domain.

In this paper, we aim to solve unsupervised domain adaptation challenges in which the domain discrepancy is large due to many factors of variation across the source and target domains. Figure 1 provides an illustrative example of adapting from labeled images of cars from the internet to recognizing cars in a surveillance application at night. Two dominant factors, perspective and illumination, make this a difficult adaptation task. As a step towards solving such problems, we introduce unlabeled domain bridges whose factors of variation are partially shared with the source domain, while the others are in common with the target domain. As in Figure 1, the domain on the bottom left shares a consistent lighting condition (day) with the source, while its viewpoint is similar to that of the target domain. We note that there can be multiple bridging domains, such as the one on the bottom right, whose lighting intensity lies between that of the first bridging domain and the target domain. (In contrast to traditional UDA, we use external knowledge to determine bridging domains. However, we use the term "unsupervised" to emphasize that task labels are not used for the bridging and target domains.)

To utilize unlabeled bridging domains, we propose to extend the domain adversarial neural network [15] with multiple domain discriminators, each of which accounts for learning and reducing the discrepancy between an unlabeled (bridging or target) domain and the mix of all precedent domains. We justify our learning framework by deriving a bound on the target error that contains the source error and a list of discrepancies between each unlabeled domain and the mix of its precedent domains, including the source. While not directly comparable to the original bound, it captures the intuition that judicious choices of bridge domains should not introduce large discrepancies. We hypothesize that the decomposition of a single large discrepancy into multiple small ones leads to a series of easier optimization problems, culminating in a better alignment of the source and target domains. We illustrate this intuition on a variant of the two moons dataset in Figure 2.

While works on unsupervised discovery of latent domains exist [20, 17, 18], the choice of bridging domains remains a hard, unsolved problem. Our supplementary material exploits discriminator scores to demonstrate initial results, but we defer a detailed study to future work. Instead, we focus on the complementary and also unsolved problem of devising adversarial formulations that exploit given bridging domains. We observe that such domain information is often easily available in practice; for example, image metadata such as timestamps, geo-tags, and calibration parameters suffice to inform about illumination, weather, or perspective.

We validate our framework on several vision tasks, including object classification and semantic segmentation. We conduct experiments on digit recognition from MNIST to SVHN using MNIST-M as a bridging domain. We perform extensive experiments on recognizing cars from surveillance cameras at night using web images as the source and surveillance images of day and evening as bridging domains. To verify the generality of our proposal, we evaluate semantic segmentation performance on Foggy Cityscapes [45] using GTA5 [44] as the labeled source and Cityscapes [9] as an unlabeled bridging domain.

We summarize our contributions as follows:

  • A novel UDA framework to deal with adaptation problems with extreme inter- and intra-domain variations in the target domain using unlabeled bridging domains.

  • An extension of DANN [15] with multiple discriminators, each of which accounts for reducing the discrepancy between unlabeled (bridging, target) domain and all precedent domains combined.

  • An extensive empirical validation on vision tasks, including object recognition and semantic segmentation, with single and multiple bridging domains, demonstrating the effectiveness of our proposed approach.

2 Related Work

Unsupervised Domain Adaptation. One challenge is how to properly reduce the discrepancy across domains [39, 40]. Specifically, an appropriate metric is required to measure the difference between domains [3, 2]. Recent works use kernel-based methods such as maximum mean discrepancy (MMD) [34, 52, 54, 35] and optimal transport (OT) [10, 11] to measure the domain difference in feature space. Others adopt the idea of adversarial training [15, 48, 5], inspired by the generative adversarial network (GAN) [19, 41]. This training procedure makes the feature representations indistinguishable between the source and target domains, aligning the two. One example of adversarial training for UDA is the domain adversarial neural network (DANN) [15], which trains a discriminator to distinguish domains while also learning an extractor to fool the discriminator by providing domain-invariant features. Different from DANN, adversarial discriminative domain adaptation [53] first encodes source data into feature space using source labels and then uses adversarial training to obtain target features that match the source features.

Multiple Domains. In [56, 37], models are proposed for multiple-source UDA problems based on domain adversarial learning. While the intuition is to utilize the extra source domains that are available, the adaptation process in practice favors the source domain most closely related to the target domain [37]. Our method shares a similar high-level idea with [37] in that relevant domains should guide the adaptation. In contrast, we utilize unlabeled bridging domains that share factors of variation with both the source and target domains, aligning the two domains alongside the bridging domain. Similar to us, the benefit of intermediate domains in guiding transfer learning is shown in [50], but in the context of semi-supervised label propagation, which requires labeled data from the target domains.

3 Method

We first define notation and review the domain adversarial neural network (DANN) [15]. We then introduce our proposed domain adaptation framework, built on top of DANN, which utilizes unlabeled bridging domains to enhance adaptation performance when the source and target domains differ significantly due to many factors of variation.

Notation. Denote $\mathcal{S}$ and $\mathcal{T}$ as the source and target domains, respectively. The output label $y$ has $K$ categories. The model is composed of three parts: 1) the feature extractor $F$ with parameter $\theta_F$ that maps the input data $x$ into a $d$-dimensional feature vector $F(x)$; 2) the domain discriminator $D$ with parameter $\theta_D$ that tells whether the feature vector is from the source or the target domain; and 3) the classifier $C$ with parameter $\theta_C$ that gives a predicted label $\hat{y} = C(F(x))$.

3.1 Recap: Domain Adversarial Neural Network

The domain adversarial neural network transfers a classifier learned on the labeled source domain to the unlabeled target domain by learning domain-invariant features. It is realized by first learning the domain-related information and then leveraging it to remove domain-specific cues from the features extracted from the input. DANN uses a domain discriminator $D$ to control the amount of domain-related information in the extracted feature. First, $D$ itself needs to be well trained to tell the difference between the source and target domains. The discriminator is updated by maximizing the following:

$$\mathcal{L}_{D} = \mathbb{E}_{x \sim \mathcal{S}}\big[\log D(F(x))\big] + \mathbb{E}_{x \sim \mathcal{T}}\big[\log\big(1 - D(F(x))\big)\big] \qquad (1)$$

In comparison, the feature extractor wants to confuse the discriminator so as to remove any domain-specific information. Moreover, to make sure the extracted feature is task-relevant, $F$ is trained to generate features that can be correctly classified by the classifier $C$, which is trained by minimizing the following:

$$\mathcal{L}_{C} = \mathbb{E}_{(x, y) \sim \mathcal{S}}\big[-\log C_{y}(F(x))\big] \qquad (2)$$

and the learning objective for the feature extractor is as follows:

$$\mathcal{L}_{F} = \mathcal{L}_{C} - \lambda \mathcal{L}_{D} \qquad (3)$$

While [15] introduces a gradient reversal layer to jointly train all parameters, we perform the alternating updates of GANs [19] between $\theta_D$ and $(\theta_F, \theta_C)$ in our implementation. (While we present the "minimax" formulation for ease of presentation, our implementation is based on the "non-saturating" formulation [19, 14].) In addition, we adopt multi-level adversarial training, whose adversarial losses are incurred at multiple layers of intermediate features [36] or output spaces [35, 51], in the form of an entropy minimization objective [21, 35].
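To make the alternating update concrete, the sketch below gives a minimal PyTorch-style training step for objectives (1)-(3), assuming F, C, and D are the feature extractor, classifier, and discriminator modules with their own optimizers; all names and the single weight lam are illustrative, not the paper's released code.

import torch
import torch.nn.functional as nnf

def dann_step(F, C, D, opt_D, opt_FC, x_s, y_s, x_t, lam=1.0):
    # --- Discriminator step: maximize (1), i.e., classify source vs. target.
    with torch.no_grad():                       # freeze features for the D update
        f_s, f_t = F(x_s), F(x_t)
    logit_s, logit_t = D(f_s), D(f_t)
    loss_D = nnf.binary_cross_entropy_with_logits(logit_s, torch.ones_like(logit_s)) \
           + nnf.binary_cross_entropy_with_logits(logit_t, torch.zeros_like(logit_t))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # --- Feature/classifier step: minimize the task loss (2) plus a
    #     non-saturating adversarial term that pushes target features toward
    #     the "source" label, replacing the minimax term in (3).
    f_s, f_t = F(x_s), F(x_t)
    loss_C = nnf.cross_entropy(C(f_s), y_s)
    logit_t = D(f_t)
    loss_adv = nnf.binary_cross_entropy_with_logits(logit_t, torch.ones_like(logit_t))
    opt_FC.zero_grad(); (loss_C + lam * loss_adv).backward(); opt_FC.step()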

3.2 Challenge in Domain Adversarial Learning

While deep domain adaptation algorithms are realized in different forms [52, 54, 15, 48, 5], their theoretical motivation comes from the seminal work of [3]. In short, the associated theorem states that the target-domain task error is bounded by the source error and the domain discrepancy:

$$\epsilon_{\mathcal{T}}(h) \le \epsilon_{\mathcal{S}}(h) + \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{S}, \mathcal{T}) + \lambda \qquad (4)$$

where $h \in \mathcal{H}$ is a hypothesis, $\lambda$ is the combined error of the ideal joint hypothesis, and the $\mathcal{H}\Delta\mathcal{H}$-divergence is written as:

$$d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{S}, \mathcal{T}) = 2 \sup_{h, h' \in \mathcal{H}} \Big| \Pr_{x \sim \mathcal{S}}\big[h(x) \neq h'(x)\big] - \Pr_{x \sim \mathcal{T}}\big[h(x) \neq h'(x)\big] \Big|$$

Adversarial or MMD losses are used to minimize the second term on the RHS of (4) to obtain a tighter bound. We focus on the adversarial loss, where the domain discrepancy is learned by a discriminator parameterized by a neural network. While this provides flexibility in the types of discrepancy it can learn, it is challenging to learn the right transformation that maps the source domain to the target domain when the two domains are far apart.

To motivate, consider a variant of the two moons dataset whose data points are translated to the right by an amount proportional to the rotation angle, as in Figure 2. The source domain is centered at the origin, while the target domain is moved to the right after being rotated by 90°, and is given without labels. Adapting from source to target directly is extremely difficult due to the significant change. Moreover, there are many ways to generate the same unlabeled target data points (e.g., rotating counterclockwise instead of clockwise, as in the bottom of Figure 2). In such a case, knowing what happens in the middle of the entire transformation process from the source to the target domain is critical, as these intermediate data points, even if unlabeled, can guide learning algorithms to easily disentangle transformation factors (e.g., clockwise rotation and translation to the right) from task-relevant factors.
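For concreteness, the translated two-moons domains could be generated as in the short sketch below; it uses scikit-learn's make_moons, and the noise level, sample count, and translation scale are illustrative choices rather than the paper's exact settings.

import numpy as np
from sklearn.datasets import make_moons

def make_domain(angle_deg, n=300, shift_per_deg=0.02, seed=0):
    # Rotate the centered two moons clockwise by angle_deg and translate
    # them to the right proportionally to the rotation angle.
    X, y = make_moons(n_samples=n, noise=0.05, random_state=seed)
    X = X - X.mean(axis=0)              # center at the origin
    a = np.deg2rad(-angle_deg)          # negative sign -> clockwise rotation
    R = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    X = X @ R.T
    X[:, 0] += shift_per_deg * angle_deg
    return X, y

# Labeled source at 0 degrees, unlabeled bridges at 30/60, unlabeled target at 90.
(Xs, ys), (Xb1, _), (Xb2, _), (Xt, _) = [make_domain(a) for a in (0, 30, 60, 90)]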

Figure 2: Translated two moons. The inter-twinning moons (left) are the source domain. The two moons are rotated by 90° and translated to the right to generate the target domain (right, in gray). Given only unlabeled data, one cannot tell whether the data was rotated clockwise or counterclockwise, whereas it becomes apparent if unlabeled data points in the middle of the transformation from source to target are provided.
Figure 3: The learning framework with labeled source, unlabeled target, and unlabeled bridging domains for our extension of DANN using multiple discriminators. The model is composed of a shared feature extractor $F$, a classifier $C$ trained using labeled source examples, and two domain discriminators $D_1$ and $D_2$. $D_1$ learns the discrepancy between the source and the unlabeled bridging domain, while $D_2$ learns the discrepancy between the mix of source and bridging domains and the target domain.

3.3 Adaptation with Bridging Domain

Motivated by Section 3.2, we introduce additional sets of unlabeled examples, which we call bridging domains, that reside on the transformation pathway from the labeled source to the unlabeled target domain and guide the adaptation process.

DANN with a Single Bridging Domain
Besides $\mathcal{S}$ and $\mathcal{T}$, we denote by $\mathcal{B}$ a bridging domain. Our framework is composed of a feature extractor $F$ and a classifier $C$ trained using the classification loss in (2). Unlike DANN, which directly aligns $\mathcal{S}$ and $\mathcal{T}$, we decompose the adaptation into two steps. First, $\mathcal{S}$ and $\mathcal{B}$ are aligned. This is an easier task than the direct adaptation in DANN, since there are fewer discriminating factors between $\mathcal{S}$ and $\mathcal{B}$. Second, we adapt $\mathcal{T}$ to the union of $\mathcal{S}$ and $\mathcal{B}$. This task is similarly easier, since it only needs to discover the remaining factors between $\mathcal{T}$ and $\mathcal{S} \cup \mathcal{B}$, as some factors were already found in the previous step. To accommodate the two adaptation steps, we use two binary domain discriminators: $D_1$ for learning the discrepancy between $\mathcal{S}$ and $\mathcal{B}$, and $D_2$ for learning the discrepancy between $\mathcal{S} \cup \mathcal{B}$ and $\mathcal{T}$. Finally, this is realized with the following objectives:

$$\mathcal{L}_{D_1} = \mathbb{E}_{x \sim \mathcal{S}}\big[\log D_1(F(x))\big] + \mathbb{E}_{x \sim \mathcal{B}}\big[\log\big(1 - D_1(F(x))\big)\big] \qquad (5)$$

$$\mathcal{L}_{D_2} = \mathbb{E}_{x \sim \mathcal{S} \cup \mathcal{B}}\big[\log D_2(F(x))\big] + \mathbb{E}_{x \sim \mathcal{T}}\big[\log\big(1 - D_2(F(x))\big)\big] \qquad (6)$$

Both $\mathcal{L}_{D_1}$ and $\mathcal{L}_{D_2}$ are maximized to update their respective discriminator parameters $\theta_{D_1}$ and $\theta_{D_2}$. Once $D_1$ and $D_2$ are updated, we update the classifier $C$ using (2) and the feature extractor $F$ to confuse both discriminators as follows:

$$\mathcal{L}_{F} = \mathcal{L}_{C} - \lambda_1 \mathcal{L}_{D_1} - \lambda_2 \mathcal{L}_{D_2} \qquad (7)$$

with two hyperparameters $\lambda_1$ and $\lambda_2$ to adjust the strength of the adversarial losses. We alternate updates between $(\theta_{D_1}, \theta_{D_2})$ and $(\theta_F, \theta_C)$. The proposed framework is visualized in Figure 3.
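A minimal PyTorch-style sketch of objectives (5)-(7) follows, assuming binary discriminators D1 and D2 that output logits; the helper names and the loss bookkeeping are ours, not the paper's code.

import torch
import torch.nn.functional as nnf

def bce(logits, target_is_one):
    t = torch.ones_like(logits) if target_is_one else torch.zeros_like(logits)
    return nnf.binary_cross_entropy_with_logits(logits, t)

def discriminator_losses(F, D1, D2, x_s, x_b, x_t):
    # (5): D1 separates source from bridge; (6): D2 separates the
    # source+bridge mixture from the target. Features are detached so
    # only the discriminators receive gradients here.
    f_s, f_b, f_t = F(x_s).detach(), F(x_b).detach(), F(x_t).detach()
    l_d1 = bce(D1(f_s), True) + bce(D1(f_b), False)
    f_mix = torch.cat([f_s, f_b], dim=0)
    l_d2 = bce(D2(f_mix), True) + bce(D2(f_t), False)
    return l_d1, l_d2

def extractor_loss(F, C, D1, D2, x_s, y_s, x_b, x_t, lam1, lam2):
    # (7): classification loss plus two adversarial terms that push
    # bridge features toward the source and target features toward the mix.
    f_s, f_b, f_t = F(x_s), F(x_b), F(x_t)
    l_cls = nnf.cross_entropy(C(f_s), y_s)
    l_adv1 = bce(D1(f_b), True)   # bridge should look like source to D1
    l_adv2 = bce(D2(f_t), True)   # target should look like the mix to D2
    return l_cls + lam1 * l_adv1 + lam2 * l_adv2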

Theoretical Insights
To provide insights on how our learning objectives are constructed, we derive a bound on the target error while taking into account the unlabeled bridging domain as follows:

$$\epsilon_{\mathcal{T}}(h) \le \epsilon_{\mathcal{S}}(h) + \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{S}, \mathcal{B}) + \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{S} \cup \mathcal{B}, \mathcal{T}) + \lambda' \qquad (8)$$

where $h \in \mathcal{H}$ is a hypothesis, $\mathcal{S} \cup \mathcal{B}$ denotes the mixture of the source and bridging domains, and $\lambda'$ collects the combined errors of the ideal joint hypotheses across the source, bridging, and target domains. Note that $\lambda'$ plays the role of $\lambda$ in (4), making (8) similar to (4). The derivation is in the Supplementary Material.

The implications of (8) are two-fold. First, to keep the bound tight, we need to make sure that both domain discrepancies are small; this motivates the design of our proposed adversarial learning framework discussed earlier. More importantly, we argue that the individual components of the decomposed discrepancy are much easier to optimize than the single term in (4) when the bridging domain is chosen properly. A positive effect of learning from decomposed discrepancies is the easier disentanglement of domain-specific factors from task-relevant factors, as explained in Section 3.2.
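One way to see why a judicious bridge helps: the $\mathcal{H}\Delta\mathcal{H}$-divergence satisfies a triangle inequality, so inserting a bridge $\mathcal{B}$ splits the single source-to-target gap into two legs, each of which can be made small when $\mathcal{B}$ truly lies between the two domains; a sketch under the standard definitions of [2]:

$$d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{S}, \mathcal{T}) \;\le\; d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{S}, \mathcal{B}) + d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{B}, \mathcal{T})$$

Conversely, a poorly chosen bridge can make the sum on the right larger than the direct gap, which is consistent with the bridging-domain ablation in Section 4.3.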

Discussion on Bridging Domain Selection
The assumption on the existence of an unlabeled bridging domain may seem strong. However, there are many real-world problems where bridging domains can be naturally defined with available side information. As demonstrated later in Section 4.3, the illumination condition of surveillance images can be obtained from mean pixel values. In addition, lighting or weather conditions may also be obtained from accessible camera metadata.

When no side information is available, unsupervised discovery of latent domains [50, 20, 17], a tool complementary to our framework for constructing bridging domains, could be a possible solution. Following the idea of [48], we also conduct an experiment on unsupervised bridging domain construction based on the discriminator score. The discovered bridging domains prove effective when applied to our proposed learning framework. While we present our initial results in this direction in the Supplementary Material, we leave a more thorough investigation as future work.

DANN with Multiple Bridging Domains
Lastly, our framework can be extended to the case where multiple unlabeled bridging domains exist, which is desirable for spanning larger discrepancies between the source and target domains. To formalize, we denote $\mathcal{S}$ and $\mathcal{T}$ as the source and target, and $\mathcal{B}_1, \ldots, \mathcal{B}_N$ as unlabeled bridging domains, with $\mathcal{B}_1$ closer to the source than $\mathcal{B}_N$. We introduce $N{+}1$ domain discriminators $D_1, \ldots, D_{N+1}$, each of which is trained by maximizing the following objective:

$$\mathcal{L}_{D_i} = \mathbb{E}_{x \sim \mathcal{S} \cup \mathcal{B}_1 \cup \cdots \cup \mathcal{B}_{i-1}}\big[\log D_i(F(x))\big] + \mathbb{E}_{x \sim \mathcal{B}_i}\big[\log\big(1 - D_i(F(x))\big)\big], \quad i = 1, \ldots, N{+}1 \qquad (9)$$

where we let $\mathcal{B}_{N+1} = \mathcal{T}$ for notational convenience,

and the learning objective for $F$ and $C$ is given as follows:

$$\mathcal{L}_{F} = \mathcal{L}_{C} - \sum_{i=1}^{N+1} \lambda_i \mathcal{L}_{D_i} \qquad (10)$$

The training procedure is summarized in Algorithm 1.

1: Input: source domain data $\{x^s\}$ with labels $\{y^s\}$, bridging domain data $\{x^{b_1}\}, \ldots, \{x^{b_N}\}$, target domain data $\{x^t\}$, batch size $m$, and maximum iteration $T$.
2: for $t = 1, \ldots, T$ do
3:     Sample a batch of source data $\{x^s, y^s\}$,
4:     bridging data $\{x^{b_1}\}, \ldots, \{x^{b_N}\}$,
5:     and target data $\{x^t\}$;
6:     Extract features $F(x)$ for all domains;
7:     Update discriminators $D_1, \ldots, D_{N+1}$ by maximizing (9);
8:     Update classifier $C$ by minimizing (10);
9:     Update feature extractor $F$ by minimizing (10);
10: end for
Algorithm 1 Model training procedure.
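Algorithm 1 could be written compactly as below; this is a sketch, not the released implementation, assuming each unlabeled domain has its own data loader, that the loaders are given in source-to-target order with the target last, and that opt_D covers all discriminator parameters while opt_FC covers F and C.

import itertools
import torch
import torch.nn.functional as nnf

def bce(logits, target_is_one):
    t = torch.ones_like(logits) if target_is_one else torch.zeros_like(logits)
    return nnf.binary_cross_entropy_with_logits(logits, t)

def train(F, C, Ds, src_loader, unl_loaders, opt_D, opt_FC, lams, max_iter):
    # Ds[i] separates the mix of all precedent domains from unlabeled
    # domain i (bridges first, target last), as in (9).
    src_it = itertools.cycle(src_loader)
    unl_its = [itertools.cycle(l) for l in unl_loaders]
    for step in range(max_iter):
        x_s, y_s = next(src_it)
        xs_unl = [next(it) for it in unl_its]

        # Discriminator updates, objective (9), on detached features.
        f_s = F(x_s).detach()
        f_unl = [F(x).detach() for x in xs_unl]
        d_loss = 0.0
        for i, D in enumerate(Ds):
            f_mix = torch.cat([f_s] + f_unl[:i], dim=0)  # all precedent domains
            d_loss = d_loss + bce(D(f_mix), True) + bce(D(f_unl[i]), False)
        opt_D.zero_grad(); d_loss.backward(); opt_D.step()

        # Classifier and feature extractor update, objective (10).
        f_s = F(x_s)
        f_unl = [F(x) for x in xs_unl]
        loss = nnf.cross_entropy(C(f_s), y_s)
        for i, (D, lam) in enumerate(zip(Ds, lams)):
            loss = loss + lam * bce(D(f_unl[i]), True)   # fool D_i
        opt_FC.zero_grad(); loss.backward(); opt_FC.step()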

4 Experiments

We evaluate our methods on three adaptation tasks: digit classification, object recognition, and semantic scene segmentation. For the recognition task, we use the Comprehensive Cars (CompCars) [55] dataset to recognize car models in the surveillance domain at night using labeled images from the web domain. For the scene segmentation task, synthetic images from the GTA5 dataset [44] serve as the source domain, and the task is to adapt to Foggy Cityscapes [45].

4.1 Toy Experiment with Two Moons

We start with experiments on a synthetic dataset. Created for a binary classification problem, the inter-twinning moons 2D dataset suits our setting if we consider differently rotated versions of the standard two entangled moons as different domains. In this experiment, we consider a hard adaptation from the original data (0°) to data rotated by 90° (clockwise or counter-clockwise), while intermediate rotations such as 30° and 60° can be considered as bridging domains. Moreover, as discussed in Section 3.2, the domains do not share the same centers but are translated proportionally to the rotation angle. We follow the same network architecture as in [15], with one hidden layer followed by a sigmoid non-linearity. The performance is summarized in Table 1. One observation is that when the distant 90° domain is involved as the target, the source domain accuracy is sacrificed considerably, which may be due to the limited network capacity. While direct source-to-target adaptation (0°→90°) achieves only 56.98 on the target, which is almost a random guess, the proposed method clearly demonstrates its effectiveness, achieving 86.97 on the target domain with both bridging domains (0°→30°→60°→90°). Another interesting observation is that the 90° domain is well adapted when the 60° domain is involved, i.e., a bridging domain closer to the target domain (80.67), while the one closer to the source domain (30°) is not as effective (60.98).

Model            | 0°         | 30°        | 60°        | 90°
0°→90°           | 80.88±1.71 | –          | –          | 56.98±4.47
0°→30°→90°       | 87.23±3.64 | 95.66±4.18 | –          | 60.98±7.41
0°→60°→90°       | 79.19±1.21 | –          | 89.66±3.61 | 80.67±9.47
0°→30°→60°→90°   | 78.75±1.56 | 82.33±8.71 | 87.33±3.83 | 86.97±2.17

Table 1: Average classification accuracy (%) on the test set of each domain. Results for the baseline and different bridging domain combinations are included.
Figure 4: Sample images of different digit datasets.
Model                        | MNIST-M | SVHN
MNIST→SVHN                   | –       | 71.02
MNIST→{MNIST-M ∪ SVHN}       | 96.27   | 78.07
MNIST→MNIST-M→SVHN (ours)    | 97.07   | 81.28

Table 2: Accuracy on MNIST-M and SVHN test sets averaged over 10 runs. We report the performance of the standard DANN (MNIST→SVHN), the DANN model using the mixture of unlabeled domains as a single target (MNIST→{MNIST-M ∪ SVHN}), and our proposed model (MNIST→MNIST-M→SVHN).

4.2 Digit Classification

Different digit datasets are considered as separate domains. MNIST [31] provides a large number of hand-written digit images in grayscale. SVHN [38] contains colored digit images of house numbers from street view. MNIST-M [15] enriches MNIST by compositing digits over randomly selected colored image patches from BSD500 [1] as background. Figure 4 provides samples of each domain. We consider adaptation from labeled MNIST to unlabeled SVHN, while using MNIST-M as an unlabeled bridging domain. Given the differences between MNIST and SVHN, MNIST-M appears to be an appropriate bridging domain: its foreground digits resemble MNIST, while its color statistics resemble SVHN.

We compare our model with the baseline model, i.e., a standard DANN from source to target without bridging domain. A DANN model that adapts to the mixture of bridge and target domains as a single target is included for comparison. We present results in Table 2. When the bridging domain is involved, the average accuracy on SVHN (target) significantly improves upon the baseline model. Moreover, our proposed model achieves higher performance than the model with mixture of unlabeled domains, demonstrating benefits from the bridging domain.

4.3 Recognizing Cars in SV Domain at Night

Figure 5: Sample images of CompCars surveillance (SV) domain from light (SV1) to dark (SV5) illumination conditions.
Figure 6: t-SNE plots of CompCars web and SV domains from light (SV1) to dark (SV5) illumination conditions using baseline features. From left to right, there is less overlap between the web (blue) and SV data points. Best viewed digitally in color.

Dataset and Experimental Setting
The CompCars dataset provides two sets of images: 1) web-nature images collected from car forums, public websites, and search engines, and 2) surveillance-nature images collected from surveillance cameras. The dataset is composed of web images spanning a large set of car models and SV images spanning a smaller set, with the categories of the SV set being a subset of the categories of the web set. We consider a set of adaptation problems from labeled web images to unlabeled surveillance (SV) images. The task is very challenging, as SV images exhibit perspective and illumination variations different from web images.

We use the illumination condition as a proxy for adaptation difficulty and partition the SV set into SV1–5 based on the illumination condition of each image. (We compute the mean pixel intensity and sort/threshold images to construct SV1–5 with roughly equal sizes. In practice, the illumination condition may be obtained from metadata, such as the recorded time.) SV1 contains the brightest images and is thus the easiest domain for adaptation, whereas SV5 contains the darkest ones and is thus the hardest. We visualize samples from SV1–5 in Figure 5. Beyond visual inspection of the images, we also confirm the domain discrepancy between the web and SV1–5 domains through the t-SNE plots in Figure 6.
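The construction in the parenthetical above could be realized with a few lines; the sketch below assumes images are loaded with PIL and converted to grayscale before averaging, and the equal-sized binning mirrors the description.

import numpy as np
from PIL import Image

def partition_by_illumination(paths, n_bins=5):
    # Sort surveillance images by mean pixel intensity and split them into
    # n_bins roughly equal-sized sub-domains (bin 0 = brightest ~ SV1,
    # bin n_bins-1 = darkest ~ SV5).
    means = np.array([np.asarray(Image.open(p).convert('L')).mean() for p in paths])
    order = np.argsort(-means)                     # brightest first
    return [list(np.asarray(paths)[chunk]) for chunk in np.array_split(order, n_bins)]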

Finally, we present two experimental protocols. First, we evaluate an adaptation task from web to SV night (SV4–5) using SV day (SV1–3) as a single domain bridge. We demonstrate the difficulty of adaptation when the two domains are far from each other, and show the importance of the bridging domain and the effectiveness of our adaptation method. Second, we adapt to the extreme SV domain (SV5) using different combinations of one or multiple bridging domains (SV1–4) and characterize the properties of an effective bridging domain.

Models, Training and Evaluation
ImageNet-pretrained ResNet-18 [22] fine-tuned on the web domain is used as our baseline. We attach a linear classifier on top of the 512-dim feature and binary discriminators parameterized by 3-layer MLPs for domain adversarial learning. Adam stochastic optimization [29] is used for 500 epochs. We perform supervised model selection [5, 35] using 2 images per class from the SV4–5 domains. We report recognition accuracy on the SV test sets of the respective target domain definition.
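A sketch of these heads is given below, assuming torchvision's ResNet-18, whose pooled feature is 512-dim; the 431-way classifier and the 320-wide hidden layer follow Table S7, while stacking the hidden layer twice to obtain a 3-layer MLP per the text above is our reading, not a confirmed detail.

import torch.nn as nn
from torchvision.models import resnet18

backbone = resnet18(pretrained=True)   # ImageNet-pretrained, as in the text
backbone.fc = nn.Identity()            # expose the 512-dim pooled feature

classifier = nn.Linear(512, 431)       # linear classifier over car models

def make_discriminator(hidden=320):
    # Binary domain discriminator: a small MLP on the 512-dim feature.
    # Hidden width follows Table S7; the depth here is an assumption.
    return nn.Sequential(
        nn.Linear(512, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )

discriminators = [make_discriminator() for _ in range(2)]  # e.g., D1 and D2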

Evaluation with a Single Bridging Domain
We demonstrate the difficulty of adaptation when domains are far apart and show that the performance of adversarial DA can be enhanced using bridging domains. In particular, night images (SV4–5) are considered as the unlabeled target domain and day images (SV1–3) as the unlabeled bridging domain. We compare the following models in Table 3: the baseline model trained on labeled web images, DANN from source to target (Web→SV4–5), from source to the mixture of bridge and target (Web→SV1–5), and the proposed model from source to bridge to target (Web→SV1–3→SV4–5).

Model              | SV1–3      | SV4–5
Web (source only)  | 72.67      | 19.87
Web→SV4–5          | 68.90±1.28 | 49.83±0.70
Web→SV4→SV5        | 74.03±0.71 | 61.37±0.30
Web→SV1–5          | 83.29±0.14 | 77.84±0.34
Web→SV1–3→SV4–5    | 82.83±0.40 | 78.78±0.33

Table 3: Accuracy and standard error over 5 runs on SV test sets for models without (Web→SV4–5) and with (Web→SV4→SV5, Web→SV1–3→SV4–5) bridging domains. For comparison, we report the accuracy of the baseline and of a model using the mixture of bridge and target domains as a single target domain (Web→SV1–5).

While the DANN adapted to the target domain (Web→SV4–5) improves upon the baseline model, the performance is still far from adequate compared to that on day images. By introducing the unlabeled bridging domain, we observe a significant improvement in accuracy on the target domain, achieving 77.84 using a standard DANN adapted to the mixture of bridging and target domains and 78.78 using our proposed method.

To better understand the advantage of our proposed training scheme, we monitor the accuracy on the SV1–3 and SV4–5 validation sets for the standard (Web→SV1–5) and proposed (Web→SV1–3→SV4–5) models and plot the curves in Figure 7. Interestingly, while both models show fast convergence on day images (Figure 7(a)), we observe large fluctuations in the performance on night images with naive adversarial training (Figure 7(b)). On the contrary, our method attains stable performance on night images early in training. This implies that it is important to know the curriculum [4] (i.e., the adaptation difficulty), and our proposed method, with multiple discriminators and a bridge-to-target adaptation objective, effectively utilizes such information during training.

Our hypothesis becomes even more apparent when both bridging and target domains are far from the source domain. We conduct an additional experiment using SV4 as the bridge domain and SV5 as the target domain and compare with the naively trained model (Web→SV4–5). As in Table 3, the proposed model (Web→SV4→SV5) outperforms the DANN by a large margin (from 49.83 to 61.37 on the SV4–5 test set). This is because it is difficult to determine the adaptation curriculum when both domains are far away from the source domain (see Figure 6), unlike the previous experiment, where there is a sufficient amount of day images fairly close to the source domain for the discriminator to figure out the curriculum.

(a) Validation accuracy on SV1–3 (day)
(b) Validation accuracy on SV4–5 (night)
Figure 7: Accuracy curves of models over training epoch.

Which is a Good Bridging Domain?
We perform an ablation study to characterize the properties of a good bridging domain. Specifically, we would like to answer which is the more useful bridging domain: the one closer to the source domain or the one closer to the target domain. To this end, we compare two models, namely Web→SV1→SV5 and Web→SV4→SV5. Note that SV4 is more similar to the target domain (SV5) in terms of visual attributes (e.g., perspective, illumination) than SV1.

The results are summarized in Table 4. We observe much higher accuracy on the target domain (SV5) for the model using SV1 as a bridging domain (69.69) than for the one using SV4 (58.40). We believe that the optimization of the adversarial losses for the second model is more difficult than for the first, as SV4 is farther from the web domain than SV1. This implies that a good bridging domain decomposes the domain discrepancy between the source and target domains so that the decomposed discrepancy losses can be easily optimized.

Model                      | SV5
Web→SV5                    | 37.83±0.51
Web→SV4→SV5                | 58.40±0.60
Web→SV1→SV5                | 69.69±0.99
Web→SV3→SV4→SV5            | 74.01±0.52
Web→SV2→SV3→SV4→SV5        | 75.15±0.18
Web→SV1→SV2→SV3→SV4→SV5    | 75.47±0.20

Table 4: Accuracy and standard error over 5 runs on the SV5 test set for models with different bridging domain configurations.

Evaluation with Multiple Bridging Domains
Our theoretical motivation suggests that, if we have many bridging domains such that the discrepancy between any two neighboring domains is small, we can also reduce the generalization error between the source and target domains.

We test our hypothesis through experiments that adapt to SV5 with different bridging domain configurations. Specifically, bridging domains are included one by one from SV4 to SV1, finally reaching adaptation with four bridging domains. As in Table 4, DANN fails at adaptation without a domain bridge (Web→SV5). While including SV4 as part of the target domain raises the adaptation difficulty, using it as a bridging domain (Web→SV4→SV5) greatly improves the performance on the SV5 test set. Including SV3 as an additional bridging domain (Web→SV3→SV4→SV5) brings further improvement, confirming our hypothesis. While adding SV2 and SV1 as bridging domains leads to extra improvement, the margin is not as large as for SV3 and SV4. The reason is that SV3 is already about as close to the web domain as SV1 or SV2 (see Figures 5 and 6), so there is little benefit in introducing additional bridges to further decompose the domain discrepancy loss.

Figure 8: Sample images and annotation for GTA5, Cityscapes, and Foggy Cityscapes with different foggy levels. Larger number indicates heavier fog.

4.4 Foggy Scene Segmentation

Dataset and Experimental Setting
We use the GTA5 dataset [44], a synthetic dataset of street-view images containing 24,966 images, as the labeled source domain. Unlike previous works [24, 51], we adapt to Foggy Cityscapes [45], a derivative of the real scene images of Cityscapes [9] with a fog simulation, and use Cityscapes as well as Foggy Cityscapes with lighter fog levels as unlabeled bridging domains. Sample images of the three datasets and the corresponding annotations are shown in Figure 8.

The task is to categorize each pixel into one of 19 semantic categories on the test set images of Foggy Cityscapes at the heaviest fog level. We consider several models for comparison: the traditional DANN (GTA5→Foggy), one with one bridging domain (GTA5→Cityscapes→Foggy), and one with two (GTA5→Cityscapes→Foggy (light)→Foggy). One consideration is that we partition the training images of Cityscapes equally across the unlabeled domains to prevent the algorithm from finding exact correspondences between images from different unlabeled domains. Such instance-level correspondence across domains arises because Foggy Cityscapes is a derivative of Cityscapes, but it is a fairly strong assumption and unlikely to hold in the real world.

Evaluation on Semantic Segmentation
We utilize the adaptation method of [51] as our base model, which reduces the domain discrepancy between two domains in the structured output space. The discriminator is attached on top of the segmentation output generated by a DeepLab-v2 [6] engine with a ResNet-101 [22] encoder. The same discriminator architecture is used for the multiple adversarial losses in our framework.
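As a sketch of the base model's output-space alignment, the discriminator below operates on the class-probability map rather than on intermediate features, in the spirit of [51]; the channel widths, strides, and LeakyReLU slope are illustrative, not the exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as nnf

class OutputSpaceDiscriminator(nn.Module):
    # Fully-convolutional discriminator on the C-class softmax output map,
    # producing patch-wise real/fake logits.
    def __init__(self, num_classes=19, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_classes, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch * 2, 1, 4, stride=2, padding=1),
        )
    def forward(self, seg_logits):
        return self.net(nnf.softmax(seg_logits, dim=1))

def adv_loss(D, seg_logits, label_as_source):
    # Binary adversarial loss on the discriminator's patch-wise logits.
    logits = D(seg_logits)
    target = torch.ones_like(logits) if label_as_source else torch.zeros_like(logits)
    return nnf.binary_cross_entropy_with_logits(logits, target)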

Model                                  | # images per domain | mIoU
GTA5 (source only)                     | –           | 27.5
GTA5→Foggy                             | 1/2 split   | 33.08±0.32
GTA5→{Cityscapes ∪ Foggy}              | 1/2 split   | 32.62±0.58
GTA5→Cityscapes→Foggy                  | 1/2 split   | 34.82±0.39
GTA5→Foggy                             | 1/3 split   | 33.16±0.35
GTA5→Cityscapes→Foggy                  | 1/3 split   | 34.13±0.78
GTA5→Cityscapes→Foggy (light)→Foggy    | 1/3 split   | 35.31±0.06

Table 5: mIoU and standard error over 5 runs on the Foggy Cityscapes (heaviest fog) test set. We partition the unlabeled training data into halves for the models in rows 2–4 and into thirds for the models in rows 5–7 (for the experiments with two bridging domains), so each unlabeled domain uses a disjoint subset of images.

We first conduct experiments with one bridging domain to adapt to Foggy Cityscapes at the heaviest fog level. We construct two partitions of the unlabeled data for Cityscapes and Foggy Cityscapes. Mean intersection-over-union (mIoU) averaged over 5 runs, using a different partition for each run, is reported in rows 2–4 of Table 5. While a significant improvement in mIoU is observed by directly adapting to the target domain (GTA5→Foggy), our framework using the bridging domain further enhances performance on the final target domain from 33.08 to 34.82. Similarly to Section 4.3, we also compare with the domain-adapted model that merges Cityscapes with the target domain (GTA5→{Cityscapes ∪ Foggy}). Interestingly, its performance on the target domain is not even as good as that of the model directly adapting to the target domain. Our results indicate that naively merging two domains with different properties may be suboptimal for adversarial adaptation.

Next, we experiment with two bridging domains by introducing Foggy Cityscapes at a lighter fog level as a bridge between Cityscapes and the heaviest fog level. The setting is similar, but we use a third of the images for each unlabeled domain. Rows 5–7 of Table 5 validate our hypothesis that additional bridging domains are beneficial, improving mIoU from 33.16 without a bridge to 35.31 with two bridges. While using the same number of overall unlabeled images during training, we also observe a benefit of using two bridging domains (7th row, 35.31) over one (4th row, 34.82). Overall, our experimental results verify that the proposed domain adaptation framework allows gradual distribution alignment through the help of bridging domains, even in the more challenging case of semantic segmentation.

5 Conclusions

This paper aims to simplify adaptation problems with extreme domain variations using unlabeled bridging domains. A novel framework based on DANN is developed by introducing additional discriminators that account for the decomposition of the source-to-target discrepancy into many smaller ones. Several adaptation tasks in computer vision are considered, demonstrating the effectiveness of our framework with bridging domains. A variant of our method that automatically decomposes the target domain into a sequence of bridges is an interesting future direction.

References

  • [1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence, 33(5):898–916, 2011.
  • [2] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine learning, 79(1):151–175, 2010.
  • [3] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In NIPS, 2007.
  • [4] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, pages 41–48. ACM, 2009.
  • [5] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In NIPS, 2016.
  • [6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. CoRR, abs/1606.00915, 2016.
  • [7] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2018.
  • [8] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool. Domain adaptive faster r-cnn for object detection in the wild. In CVPR, pages 3339–3348, 2018.
  • [9] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
  • [10] N. Courty, R. Flamary, A. Habrard, and A. Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems, pages 3730–3739, 2017.
  • [11] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. arXiv preprint arXiv:1507.00504, 2015.
  • [12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR. IEEE, 2009.
  • [13] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
  • [14] W. Fedus, M. Rosca, B. Lakshminarayanan, A. M. Dai, S. Mohamed, and I. Goodfellow. Many paths to equilibrium: Gans do not need to decrease adivergence at every step. In ICLR, 2018.
  • [15] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.
  • [16] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
  • [17] B. Gong, K. Grauman, and F. Sha. Reshaping visual datasets for domain adaptation. In Advances in Neural Information Processing Systems, pages 1286–1294, 2013.
  • [18] B. Gong, K. Grauman, and F. Sha. Learning kernels for unsupervised domain adaptation with applications to visual object recognition. International Journal of Computer Vision, 109(1-2):3–27, 2014.
  • [19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
  • [20] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In 2011 international conference on computer vision, pages 999–1006. IEEE, 2011.
  • [21] Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In NIPS, 2005.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [23] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. Cycada: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017.
  • [24] J. Hoffman, D. Wang, F. Yu, and T. Darrell. Fcns in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649, 2016.
  • [25] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In CVPR, June 2018.
  • [26] N. Inoue, R. Furuta, T. Yamasaki, and K. Aizawa. Cross-domain weakly-supervised object detection through progressive domain adaptation. In CVPR, June 2018.
  • [27] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • [28] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • [29] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [30] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • [31] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [32] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature Pyramid Networks for Object Detection. In CVPR, 2017.
  • [33] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal Loss for Dense Object Detection. In ICCV, 2017.
  • [34] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu. Transfer feature learning with joint distribution adaptation. In ICCV, 2013.
  • [35] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Unsupervised domain adaptation with residual transfer networks. In NIPS, 2016.
  • [36] Z. Luo, Y. Zou, J. Hoffman, and L. Fei-Fei. Label efficient learning of transferable representations across domains and tasks. In NIPS, 2017.
  • [37] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation with multiple sources. In Advances in neural information processing systems, pages 1041–1048, 2009.
  • [38] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop, 2011.
  • [39] S. J. Pan, Q. Yang, et al. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
  • [40] Z. Pei, Z. Cao, M. Long, and J. Wang. Multi-adversarial domain adaptation. In AAAI Conference on Artificial Intelligence, 2018.
  • [41] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
  • [42] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You Only Look Once: Unified, Real-Time Object Detection. In CVPR, 2016.
  • [43] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NIPS, 2015.
  • [44] S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In European Conference on Computer Vision, pages 102–118. Springer, 2016.
  • [45] C. Sakaridis, D. Dai, and L. Van Gool. Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision, 126(9):973–992, 2018.
  • [46] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. PAMI, 39(4):640–651, 2017.
  • [47] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [48] K. Sohn, S. Liu, G. Zhong, X. Yu, M.-H. Yang, and M. Chandraker. Unsupervised domain adaptation for face recognition in unlabeled videos. In ICCV, 2017.
  • [49] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [50] B. Tan, Y. Song, E. Zhong, and Q. Yang. Transitive transfer learning. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1155–1164. ACM, 2015.
  • [51] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker. Learning to adapt structured output space for semantic segmentation. In CVPR, June 2018.
  • [52] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, 2015.
  • [53] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, page 4, 2017.
  • [54] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
  • [55] L. Yang, P. Luo, C. Change Loy, and X. Tang. A large-scale car dataset for fine-grained categorization and verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3973–3981, 2015.
  • [56] H. Zhao, S. Zhang, G. Wu, J. P. Costeira, J. M. Moura, and G. J. Gordon. Multiple source domain adaptation with adversarial training of neural networks. arXiv preprint arXiv:1705.09684, 2017.

Appendix S1 Proof of (8)

With a labeled source and unlabeled bridging domains, an empirical minimizer $\hat{h}$ of the source error, and a weight vector $\alpha$, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, the target error can be bounded as follows:

(S11)

where is a hypothesis and

(S12)
(S13)
(S14)
Proof.

For clarity of presentation, we use domains and their underlying distributions interchangeably. Let $\alpha$ denote a weight vector over domains, to be specified later. We begin by bounding the target error by the mixture error and the divergence as follows:

(S15)

The second term of RHS in (S15) is further bounded as follows:

(S16)
(S17)

and the divergence term is bounded as follows:

(S18)
(S19)

Plugging (S17) and (S19) into (S15), we get the following:

(S20)
(S21)

Assuming suitable choices of the weights, the RHS of (S21) can be written as follows:

(S22)

where the quantities above follow the definitions in (S12)–(S14). We further assume that $h = \hat{h}$, the empirical minimizer of the source error.

Now we are left with bounding the source error by the empirical source error and the theoretical minimum errors of the target and bridging domains. This is done using Lemma 6 in [2] as follows:

(S23)
(S24)
(S25)
(S26)
(S27)
(S28)
(S29)
(S30)

where the second inequality follows from the definitions above and the sixth follows by adding a non-negative term to the RHS; the divergence terms are introduced in the second and third inequalities using Lemma 6. Finally, plugging (S30) into (S22), we get the following:

(S31)
(S32)

where the last inequality holds for any admissible choice of the weight vector $\alpha$.

Appendix S2 Digit Classification

The model architecture is listed below.

Generator (classifier): input feature → MLP, output 10
Discriminator: input feature → MLP, output 128, ReLU → MLP, output 2
Feature extractor: input image → conv 32, ReLU, stride 1 → conv 32, ReLU, stride 1, max pool 2 → conv 64, ReLU, stride 1 → conv 64, ReLU, stride 1, max pool 2 → conv 128, ReLU, stride 1 → conv 128, ReLU, stride 1, max pool 2 → reshape, MLP output feature

Table S6: Architecture for the digit classification experiment.
Figure S9: Validation accuracy over training epochs of our proposed domain adaptation framework with bridging domains. SV5 is used for validation set.

Appendix S3 Recognizing Cars in SV Domain at Night

The model architecture is listed in Table S7. Additional experimental results on Web→SVk:5 are shown in Figure S9, where, e.g., SV3:5 denotes SV3→SV4→SV5.

Generator (classifier): input feature → MLP, output 431
Discriminator: input feature → MLP, output 320, ReLU → MLP, output 2
Feature extractor (ResNet-18): input image → conv 64, ReLU, stride 2, max pool 2 → ResNet block, output 64 → ResNet block, output 128 → ResNet block, output 256 → ResNet block, output 512 → ResNet block, output 512 → output feature

Table S7: Architecture for the car recognition experiment.
(a) Performance on SV4–5 based on supervised bridging domain discovery using ground truth lighting conditions.
(b) Performance on SV4–5 based on unsupervised bridging domain discovery using discriminator score of pretrained DANN.
Figure S10: Validation accuracy over training epochs of our proposed domain adaptation framework with bridging domains. (Left) ground truth label or (right) discriminator score are used for bridging domain discovery.

Appendix S4 Unsupervised Discovery of Bridging Domains

While works on unsupervised discovery of latent domains exist [20, 17, 18], the choice of bridging domains remains a hard, unsolved problem. In this section, we present our initial effort and preliminary results in this direction. Specifically, we exploit the discriminator score of a pretrained DANN model to quantify the closeness of each image in the target domain to the source domain. This approach [48] intuitively makes sense: the discriminator is trained to distinguish source from target, so target-domain images predicted as source are likely more similar to source images and thus qualify as a bridging domain. Unfortunately, this is not necessarily true, since DANN is trained adversarially and the discriminator at convergence should not be able to distinguish source from target images [19]. Figure S13 illustrates this behavior: we visualize images from the surveillance domain ordered by the discriminator score of a pretrained DANN model at epoch 150 (close to convergence), from highest (top left) to lowest (bottom right). As we can see, the images are not ordered from day to night; rather, they appear in random order. On the other hand, the discriminator of the pretrained DANN model at epoch 10 is more discriminative in separating day and night images, as shown in Figure S11.

Based on this intuition and the visual inspection, we propose to construct bridging domains based on the discriminator score of the pretrained DANN model at epoch 10 (i.e., an early-stopped model). By ranking the discriminator scores, we evenly split the unlabeled target data into n sub-domains, denoted U_1, …, U_n, where U_1 has the highest discriminator score and U_n the lowest. We then apply our proposed framework with U_n as the target and the rest as bridging domains. Results are shown in Figure S10(b). Note that we use SV4–5 for validation and testing so that the results are comparable with those reported in the main paper. Table S8 provides results based on discriminator scores of the pretrained DANN model at epoch 10. The performance of our framework using unsupervised bridging domain discovery is highly competitive with those using the ground-truth lighting condition to construct bridging domains. Moreover, as in Figure S10, our proposed framework with discovered bridging domains (Figure S10(b)) demonstrates a much more stable training curve compared to the baseline DANN model (Figure S10(a), Web→SV1–5). We also evaluate the performance of our proposed adaptation framework with discovered bridging domains using pretrained DANN models at epochs 50 and 150. Using Web→2-way (78.62) as a reference, the results are 76.31 and 67.46, respectively. This confirms our observation in Figures S11, S12, and S13 that our framework is most effective when the bridging domains are retrieved by the discriminator of an early-stopped DANN.
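The n-way split described above could be implemented as below; disc_scores is assumed to hold the early-stopped discriminator's source-likelihood for each unlabeled target image, and all names are illustrative.

import numpy as np

def split_by_score(target_paths, disc_scores, n=3):
    # Evenly split the unlabeled target data into n sub-domains U_1..U_n by
    # the pretrained (early-stopped) DANN discriminator's source-likelihood.
    # U_1 scores highest (most source-like) and serves as the first bridge;
    # U_n scores lowest and is treated as the eventual target.
    order = np.argsort(-np.asarray(disc_scores))   # most source-like first
    chunks = np.array_split(order, n)
    return [list(np.asarray(target_paths)[c]) for c in chunks]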

While our initial exploration of unsupervised latent domain discovery demonstrates its effectiveness, it is still at an early stage for the following reasons: 1) the proposed approach requires an additional model selection stage (i.e., early stopping) to find an effective discriminator, 2) the proposed approach needs to be evaluated on other adaptation tasks to prove its generality, and 3) extension to other vision tasks such as semantic segmentation is not straightforward due to the different forms of discriminator outputs (e.g., the state-of-the-art domain adaptation method for semantic segmentation [51] uses a PatchGAN discriminator [28] that generates multiple discriminator outputs for a single input image). For these reasons, we defer a detailed study to future work.

Model                      | SV4–5      | Model      | SV4–5
Web→SV4–5                  | 49.83±0.70 | Web→2-way  | 78.62±0.39
Web→SV1–5                  | 77.84±0.34 | Web→3-way  | 78.29±0.28
Web→SV1–3→SV4–5            | 78.78±0.33 | Web→4-way  | 77.93±0.43
Web→SV1→SV2→SV3→SV4→SV5    | 77.86±0.18 | Web→5-way  | 79.13±0.32

Table S8: Accuracy and standard error over 5 runs on the SV4–5 test set for models with different bridging domain configurations. Models in the left column use the ground truth for bridging domain construction, while those in the right column use the proposed unsupervised bridging domain discovery method. Web→n-way means that we evenly split the unlabeled data into n domains based on their discriminator scores from the pretrained DANN at epoch 10.
Figure S11: Pretrained DANN after 10 epochs.
Figure S12: Pretrained DANN after 50 epochs.
Figure S13: Pretrained DANN after 150 epochs.