Joint Pixel and Feature-level Domain Adaptation in the Wild

by   Luan Tran, et al.

Recent developments in deep domain adaptation have allowed knowledge transfer from a labeled source domain to an unlabeled target domain at the level of intermediate features or input pixels. We propose that advantages may be derived by combining them, in the form of different insights that lead to a novel design and complementary properties that result in better performance. At the feature level, inspired by insights from semi-supervised learning in a domain adversarial neural network, we propose a novel regularization in the form of domain adversarial entropy minimization. Next, we posit that insights from computer vision are more amenable to injection at the pixel level and specifically address the key challenge of adaptation across different semantic levels. In particular, we use 3D geometry and image synthesization based on a generalized appearance flow to preserve identity across higher-level pose transformations, while using an attribute-conditioned CycleGAN to translate a single source into multiple target images that differ in lower-level properties such as lighting. We validate on a novel problem of car recognition in unlabeled surveillance images using labeled images from the web, handling explicitly specified, nameable factors of variation through pixel-level and implicit, unspecified factors through feature-level adaptation. Extensive experiments achieve state-of-the-art results, demonstrating the effectiveness of complementing feature and pixel-level information via our proposed domain adaptation method.






1 Introduction

Deep learning has made an enormous impact on many applications in computer vision such as generic object recognition [25, 47, 52, 18], fine-grained categorization [58, 24, 44], object detection [29, 30, 31, 41, 42] and semantic segmentation [6, 45]. Much of its success is attributed to the availability of large-scale labeled training data [10, 16]. However, such data is rarely available in practical scenarios: since annotation is expensive, most data remains unlabeled. Consider the problem of car recognition from surveillance images, where factors such as camera angle, distance, lighting or weather conditions differ across locations. It is not feasible to exhaustively annotate all these images. Meanwhile, abundant labeled data already exists in the web domain [24, 61, 13], but with very different image characteristics that preclude direct transfer of discriminative CNN-based classifiers. For instance, web images might be from catalog magazines with professional lighting and ground-level camera poses, while surveillance images can originate from cameras atop traffic lights in challenging lighting and weather conditions.

Figure 1: Our framework for unsupervised domain adaptation at multiple semantic levels: for feature-level, we bring insights from semi-supervised learning to obtain highly discriminative domain-invariant representations; at pixel-level, we leverage complementary domain-specific vision insights e.g. geometry and attributes.

Unsupervised domain adaptation (UDA) is a promising tool for overcoming the lack of labeled training data in target domains. The goal of UDA is to transfer a classifier from the source to the target domain. Several approaches aim to match distributions between source and target domains at different levels of representation, such as the feature [57, 56, 12, 48] or pixel level [53, 46, 64, 3]. Certain adaptation challenges are better handled in the feature space, but feature-level DA is a black-box algorithm for which adding domain-specific insights during adaptation is more difficult than in pixel space. Pixel space, on the other hand, is much higher-dimensional than feature space, and the optimization problem is significantly under-determined without proper constraints. How to combine the two effectively has been an open challenge. Here we address this challenge by leveraging complementary tools that are inherently better suited to each level (Figure 1).

Specifically, we posit that feature-level DA is more amenable to techniques from semi-supervised learning (SSL), while pixel-level DA allows domain-specific insights from computer vision. In Section 3, we present our feature-level DA method that improves upon the state-of-the-art domain adversarial neural network (DANN) [12]. We extend an SSL algorithm [43] to feature-level DA and show its equivalence to DANN (Figure 3). This opens the door to tools such as entropy minimization [15]; accordingly, we propose a domain adversarial entropy regularization that achieves highly discriminative and domain-invariant representations.

A challenge for pixel-level adaptation is to simultaneously transform source image properties at higher and lower semantic levels. We argue that proper incorporation of vision insights is needed to tackle this. In Section 4, we present pixel-level adaptation by image translation and synthesization that can make use of vision concepts to deal with different factors of variation, such as photometric or perspective transformations (Figure 2). (Note that our framework is unsupervised DA in the sense that we do not require recognition labels from the target domain for training, but it requires additional annotations to inject insights from vision concepts.) To handle low-level transformations, we propose an attribute-conditioned CycleGAN that extends [64] to generate multiple target images with different attributes. To handle high-level identity-preserving pose transformations, we propose to use a warping-based image synthesization, such as appearance flow [63]. To overcome semantic gaps between synthetic and real images, we propose a generalization of appearance flow with sparse keypoints as a domain bridge.

In Section 5, we evaluate our framework on surveillance images of the Comprehensive Cars (CompCars) dataset [61] by defining an experimental protocol with web images as the labeled source domain and surveillance images as the unlabeled target domain. We propose to handle explicitly specified, nameable factors of variation such as pose and lighting through pixel-level DA, while other non-specified nuisance factors are handled by feature-level DA. We achieve 84.69% top-1 accuracy (Table 1), reducing the top-1 error from 45.02% to 15.31% relative to a model trained only on the source domain. We present ablation studies that demonstrate the importance of each adaptation component by extensively evaluating various mixes of components. We further validate the effectiveness of our proposed feature-level DA methods on standard UDA benchmarks, achieving state-of-the-art error rates on three out of four experimental settings.

In summary, the main contributions of this work are:


  • A novel UDA method to adapt at multiple levels from pixel to feature, with insights for each type of adaptation.

  • For feature-level DA, a connection of DANN to a semi-supervised variant, motivating a novel regularization through domain adversarial entropy minimization.

  • For pixel-level DA, an attribute-conditioned CycleGAN to translate a single source image into multiple target images with different low-level attributes, along with a warping-based image synthesization for identity preservation across pose translations.

  • Improved generalization of appearance flow from synthetic to real domains using keypoints as domain bridges.

  • A new experimental protocol on car recognition in surveillance domain, with detailed analysis of various modules and efficacy of our proposed UDA framework.

  • State-of-the-art performance on standard UDA benchmarks with our feature-level DA methods.

2 Related Work

Unsupervised Domain Adaptation. Following theoretical developments of domain adaptation [2, 1], a major challenge for UDA problems is to define a proper metric measuring the domain difference. The maximum mean discrepancy [32, 57, 11, 56, 51], which measures the difference based on kernels, and more recently the domain adversarial neural network [12, 4, 3, 48], which measures the difference using a discriminator, have been successful. Noticing a similarity in problem settings between UDA and SSL, there have been attempts to better transfer the classifier by aligning domains beyond matching global statistics. For example, an additional loss such as entropy regularization [15] has been used together with the domain adversarial loss for efficient label transfer [33, 34]. We establish a connection between the two frameworks by augmenting an instance of SSL based on the generative adversarial network (GAN) [49, 43, 9] and further extend it to combine these techniques in a unified framework.

Image-to-Image Translation. Hertzmann et al. [19] employ a non-parametric texture model on a single input-output image pair to create image filter effects. With the successes of GAN on image generation [14, 40], conditional variants of GAN [36] have been successfully adopted for image-to-image translation problems in both paired [21] and unpaired [46, 53, 64] training settings. Our model extends the work of [64] for image translation in unpaired settings to generate multiple outputs using a control variable [59].

Combining feature and pixel levels. A combination of pixel- and feature-level adaptation has also been attempted in the contemporary work of [20]; however, we differ in a few important ways. Specifically, we go further in using insights from semi-supervised learning that allow novel regularizations for feature-level adaptation, while exploiting 3D geometry and attribute-based conditioning in GANs to simultaneously handle high-level pose and low-level lighting variations, respectively. Our experiments include a detailed study of the complementary benefits, as well as the effectiveness of various adaptation modules. While [20] considers problems such as semantic segmentation, we study a car recognition problem that highlights the need for adaptation across various levels. Finally, we also obtain state-of-the-art results on standard UDA benchmarks.

Figure 2: Overview of our framework on car model recognition for surveillance images using labeled web data and unlabeled surveillance data. Images captured by surveillance cameras have a different distribution from web images in terms of nameable factors of variation, such as viewpoint (e.g., camera elevation) or lighting conditions (e.g., day or night) as well as other nuisance factors. We cast the problem into an unsupervised domain adaptation framework by integrating pixel-level adaptation for perspective and photometric transformations, with feature-level adaptation for other nuisance factors.

Perspective Transformation. Previous works [60, 26, 54] propose encoder-decoder networks to directly generate output images of the target viewpoint. Adversarial training for perspective transformation [55, 62] has demonstrated good performance on disentangling viewpoint from other appearance factors, but there are still concept switches in unpaired settings. Instead of learning the output distribution, [63, 39] proposed a warping-based viewpoint synthesization by estimating a dense pixel-level flow field called appearance flow. We extend this approach to improve generalization to real images using representations that are invariant across synthetic and real domains, such as 2D keypoints.

3 Feature-level Domain Adaptation

We introduce a domain adversarial neural network with a semi-supervised learning (SSL) objective (DANN-SS, Figure 3(b)), motivated by the SSL formulation of generative adversarial networks [43]. We show its equivalence to the standard DANN objective [12] up to a difference in discriminator parameterization and discuss an advantage of the proposed formulation (Section 3.2). Through this connection to SSL, Section 3.3 proposes domain adversarial entropy minimization (DANN-EM), which incorporates the idea of entropy minimization [15] into the domain adversarial learning framework.

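Notation. Let $\mathcal{X}_S$ and $\mathcal{X}_T$ be the source and target datasets and $\mathcal{Y} = \{1, \dots, N\}$ be the set of class labels. Let $f$ be the feature extraction function, e.g., a CNN, with parameters $\theta_f$ that takes an input $x$ and maps it into a $K$-dimensional vector.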

3.1 Domain Adversarial Training for UDA

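We first revisit the formulation of standard DANN [12]. The goal of domain adversarial training is to make the feature representations of the two domains indistinguishable so that the classifier learned from labeled source data can be transferred to the target domain. This can be achieved through a domain discriminator $D$ that tells whether the representations from the two domains are still distinguishable, while the feature extractor $f$ is trained to fool $D$ and to classify the source data correctly. The learning objective is written as:

$$
\begin{aligned}
\max_{\theta_c} \; & \big\{ \mathcal{L}_C = \mathbb{E}_{\mathcal{X}_S} \log C(f(x), y) \big\} \\
\max_{\theta_d} \; & \big\{ \mathcal{L}_D = \mathbb{E}_{\mathcal{X}_S} \log \big(1 - D(f(x))\big) + \mathbb{E}_{\mathcal{X}_T} \log D(f(x)) \big\} \\
\max_{\theta_f} \; & \big\{ \mathcal{L}_F = \mathcal{L}_C + \lambda \, \mathbb{E}_{\mathcal{X}_T} \log \big(1 - D(f(x))\big) \big\}
\end{aligned}
\qquad (1)
$$

where $D(f(x))$ is the probability that the discriminator assigns to $x$ coming from the target domain, $C$ is a class score function that outputs the probability of data $x$ being a class $y$ among $N$ categories, i.e., $C(f(x), y) = P(y \mid f(x); \theta_c)$, and $\lambda$ balances between classification and domain discrimination. The parameters $\{\theta_c, \theta_d, \theta_f\}$ are updated in turn using stochastic gradient descent.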
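As a rough illustration of how the three DANN objectives relate, the following NumPy sketch evaluates them on precomputed network outputs. It assumes the standard DANN setup of [12] in which the discriminator outputs the probability that a feature comes from the target domain; the function name `dann_objectives` and its arguments are hypothetical and not taken from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dann_objectives(cls_logits_src, y_src, dom_logit_src, dom_logit_tgt, lam=0.1):
    """Return (L_C, L_D, L_F); each is maximized by its own parameter group.

    Assumption: sigmoid(dom_logit) is the discriminator's probability of "target".
    """
    p = softmax(cls_logits_src)
    # Classification term: mean log-probability of the true class on source data.
    l_c = np.mean(np.log(p[np.arange(len(y_src)), y_src]))
    d_src = sigmoid(dom_logit_src)
    d_tgt = sigmoid(dom_logit_tgt)
    # Discriminator term: source should score low, target should score high.
    l_d = np.mean(np.log(1.0 - d_src)) + np.mean(np.log(d_tgt))
    # Feature-extractor term: classify source well and make target look like source.
    l_f = l_c + lam * np.mean(np.log(1.0 - d_tgt))
    return l_c, l_d, l_f
```

In an actual training loop the three parameter groups would be updated in turn with gradients of these quantities; the sketch only evaluates the objectives.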

3.2 Semi-Supervised Learning as DA Training

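We note that the problem setup of UDA is the same as that of SSL when we remove the notion of domains and consider two sets of training data, one labeled and the other unlabeled. With this in mind, we propose a new objective for UDA called DANN-SS, motivated by the semi-supervised learning of generative adversarial networks [43], whose learning objective is formulated as follows:

$$
\begin{aligned}
\max_{\theta_c} \; & \big\{ \mathcal{L}_{\tilde{C}} = \mathbb{E}_{\mathcal{X}_S} \log \tilde{C}(y) + \mathbb{E}_{\mathcal{X}_T} \log \tilde{C}(N+1) \big\} \\
\max_{\theta_f} \; & \big\{ \mathcal{L}_{\tilde{F}} = \mathbb{E}_{\mathcal{X}_S} \log \tilde{C}(y) + \lambda \, \mathbb{E}_{\mathcal{X}_T} \log \big(1 - \tilde{C}(N+1)\big) \big\}
\end{aligned}
\qquad (2)
$$

where we omit $f(x)$ from $\tilde{C}(f(x), y)$ for clarity. The major difference is that the discriminator $D$ no longer exists; instead, the classifier $\tilde{C}$ has one more output entry than $C$ so that it performs both classification and domain discrimination. The auxiliary class $N+1$ is reserved for the target domain, and the class score function $\tilde{C}$ outputs probabilities over $N+1$ classes. The conditional score function $\tilde{C}(y \mid \mathcal{Y})$, restricted to the classes in $\mathcal{Y}$, is written as follows:

$$
\tilde{C}(y \mid \mathcal{Y}) = \frac{\tilde{C}(y)}{1 - \tilde{C}(N+1)} \qquad (3)
$$
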
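Relation to DANN. It is straightforward to see that $\mathcal{L}_{\tilde{F}}$ plays the role of the DANN objective $\mathcal{L}_F$, since the first term of $\mathcal{L}_{\tilde{F}}$ is the classification score on the source data and the second term is the log-probability of the target data not being classified as the target class. We show $\mathcal{L}_{\tilde{C}} = \mathcal{L}_C + \mathcal{L}_D$, the sum of the DANN classifier and discriminator objectives, by reformulating $\mathcal{L}_{\tilde{C}}$ using Eq (3):

$$
\begin{aligned}
\mathcal{L}_{\tilde{C}} &= \mathbb{E}_{\mathcal{X}_S} \log \Big[ \tilde{C}(y \mid \mathcal{Y}) \big(1 - \tilde{C}(N+1)\big) \Big] + \mathbb{E}_{\mathcal{X}_T} \log \tilde{C}(N+1) \\
&= \underbrace{\mathbb{E}_{\mathcal{X}_S} \log \tilde{C}(y \mid \mathcal{Y})}_{\mathcal{L}_C} + \underbrace{\mathbb{E}_{\mathcal{X}_S} \log \big(1 - \tilde{C}(N+1)\big) + \mathbb{E}_{\mathcal{X}_T} \log \tilde{C}(N+1)}_{\mathcal{L}_D}
\end{aligned}
$$

Now it is clear that the two formulations are equivalent under the constraints $D(f(x)) = \tilde{C}(N+1)$ on the domain classification score and $C(f(x), y) = \tilde{C}(y \mid \mathcal{Y})$ on the class score.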

Although the difference appears subtle, we believe that a joint parameterization of classifier and discriminator has a benefit over separate parameterization. For example, DANN-SS optimizes the decision boundary between the source and target domains while looking at the classification boundaries, so they can be better regularized. Moreover, features may become more classifiable, i.e., lie in a low-entropy region of the classifier, after the feature extractor is updated. As will be discussed in Section 5, we also observe a significant empirical advantage of the DANN-SS parameterization over that of DANN.
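The equivalence between the joint and the separate parameterization can be checked numerically. The sketch below assumes an (N+1)-way softmax whose last entry is the target class, derives a discriminator score and an N-way conditional class score from it, and compares the joint classifier objective with the sum of the two induced DANN objectives. All function names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def dann_ss_classifier_objective(p_src, y_src, p_tgt):
    """Joint objective: source samples to their class, target samples to class N+1."""
    n = len(y_src)
    return np.mean(np.log(p_src[np.arange(n), y_src])) + np.mean(np.log(p_tgt[:, -1]))

def induced_dann_objectives(p_src, y_src, p_tgt):
    """Split the (N+1)-way scores into a discriminator and an N-way classifier."""
    d_src, d_tgt = p_src[:, -1], p_tgt[:, -1]        # probability of "target"
    c_src = p_src[:, :-1] / (1.0 - d_src)[:, None]   # conditional N-way scores
    n = len(y_src)
    l_c = np.mean(np.log(c_src[np.arange(n), y_src]))
    l_d = np.mean(np.log(1.0 - d_src)) + np.mean(np.log(d_tgt))
    return l_c, l_d

rng = np.random.default_rng(0)
N = 5
p_src = softmax(rng.normal(size=(8, N + 1)))
p_tgt = softmax(rng.normal(size=(6, N + 1)))
y_src = rng.integers(0, N, size=8)

l_joint = dann_ss_classifier_objective(p_src, y_src, p_tgt)
l_c, l_d = induced_dann_objectives(p_src, y_src, p_tgt)
print(np.isclose(l_joint, l_c + l_d))
```

The identity holds per sample because the log of a source class score splits into the log of the conditional score plus the log-probability of not being the target class.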

3.3 Domain-Adversarial Entropy Regularization

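A reparameterization of DANN provides a new insight into how to adapt to the target domain by taking the classifier into account. However, the adversarial objective of $\mathcal{L}_{\tilde{F}}$ assumes no prior on the classification distribution for the target data. Instead, we can strengthen it by enforcing the class prediction distribution to be “peaky” while avoiding peakiness at the target class. To this end, we incorporate the idea of entropy minimization [15] into DANN-SS as follows:

$$
\begin{aligned}
\max_{\theta_c} \; & \big\{ \mathcal{L}_{\tilde{C}} = \mathbb{E}_{\mathcal{X}_S} \log \tilde{C}(y) + \mathbb{E}_{\mathcal{X}_T} \log \tilde{C}(N+1) \big\} \\
\max_{\theta_f} \; & \Big\{ \mathcal{L}_{\tilde{F}} = \mathbb{E}_{\mathcal{X}_S} \log \tilde{C}(y) + \lambda \, \mathbb{E}_{\mathcal{X}_T} \sum_{i=1}^{N} \tilde{C}(i) \log \tilde{C}(i) \Big\}
\end{aligned}
$$

Note that we parameterize the classifier in the same way as DANN-SS and it is trained with the same objective. In contrast, the second term of $\mathcal{L}_{\tilde{F}}$ is changed from $\log (1 - \tilde{C}(N+1))$. To better understand it, we investigate the second term of $\mathcal{L}_{\tilde{F}}$ in more detail by reorganizing it with the conditional score of Eq (3) as follows:

$$
\sum_{i=1}^{N} \tilde{C}(i) \log \tilde{C}(i) = \big(1 - \tilde{C}(N+1)\big) \Big[ \sum_{i=1}^{N} \tilde{C}(i \mid \mathcal{Y}) \log \tilde{C}(i \mid \mathcal{Y}) + \log \big(1 - \tilde{C}(N+1)\big) \Big] \qquad (4)
$$
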
The maximum of Eq (4) is obtained when 1) $\tilde{C}(i) = 1$ for some $i \le N$ or 2) $\tilde{C}(N+1) = 1$. However, the second solution is opposite to our intention, and we prevent it by adding the adversarial objective of DANN-SS, $\log (1 - \tilde{C}(N+1))$, to the target-domain term of $\mathcal{L}_{\tilde{F}}$ (Eq (5)).

We provide more analysis on Eq (4) and (5) in the supplementary material, simply noting here that the DANN-EM formulation lends significant empirical benefits in Section 5.
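The decomposition of the entropy term and the role of the extra adversarial term can be illustrated numerically. In the sketch below, `neg_entropy_first_n` is the sum of p log p over the N source classes and `decomposed` is the same quantity written through the conditional distribution. The combined target term in `dann_em_target_term`, including its weight `gamma`, is an assumption made for illustration only; the exact weighting used in the paper is not recoverable from this text.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def neg_entropy_first_n(p):
    """Sum of p_i * log(p_i) over the first N classes; p has shape (B, N+1)."""
    q = p[:, :-1]
    return np.sum(q * np.log(q), axis=1)

def decomposed(p):
    """Same quantity, written with the conditional scores and the target score."""
    t = p[:, -1]                          # probability of the target class N+1
    c = p[:, :-1] / (1.0 - t)[:, None]    # conditional distribution over N classes
    return (1.0 - t) * (np.sum(c * np.log(c), axis=1) + np.log(1.0 - t))

def dann_em_target_term(p_tgt, gamma=1.0):
    """ASSUMED form: entropy term plus gamma times the DANN-SS adversarial term."""
    return np.mean(neg_entropy_first_n(p_tgt) + gamma * np.log(1.0 - p_tgt[:, -1]))

rng = np.random.default_rng(0)
p = softmax(rng.normal(size=(4, 6)))
print(np.allclose(neg_entropy_first_n(p), decomposed(p)))

# Peaky at a source class (intended) vs. peaky at the target class (degenerate).
p_peaky = np.array([[1 - 2e-6, 1e-6, 1e-6]])
p_degenerate = np.array([[1e-6, 1e-6, 1 - 2e-6]])
print(dann_em_target_term(p_peaky) > dann_em_target_term(p_degenerate))
```

Both peaky cases drive the entropy term alone close to its maximum of zero, which is why an additional term that penalizes mass on the target class is needed to separate them.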

Figure 3: (a) Standard DANN (baseline) with an $N$-way model classifier and a binary domain discriminator and (b) the proposed DANN-SS with an $(N+1)$-way classifier used for both model classification and domain discrimination. The CNN and classifiers are updated in turn (dotted boxes) while fixing the others (solid boxes).

4 Pixel-level Cross-Domain Image Translation

As is common for neural nets, DANN is a black-box algorithm and adding an extra domain-specific insight is difficult. On the other hand, certain challenges in domain adaptation can be better handled in image space. In this section, we introduce additional tools for domain adaptation at multiple semantic levels to deal with different factors of variation, such as photometric or perspective transformation. To achieve this, we propose significant extensions to prior works on CycleGAN [64] and appearance flows [63]. We describe with an illustrative application of car recognition in surveillance domain where the only labeled data is from web domain. The pipeline of our example is in Figure 2.

4.1 Photometric Transformation by CycleGAN

As can be seen in Figure 2, images from the surveillance domain have color statistics that differ from those of web images, as they may be acquired outdoors at different times with significant lighting variations. CycleGAN [64] is a promising tool for image translation: it disentangles low-level statistics from geometric structure through cycle consistency and can generate an image with the same geometric structure but in the style of the target domain. A limitation, however, is that it generates a single output when there could be multiple output styles. Thus, we propose an attribute-conditioned CycleGAN (AC-CycleGAN) that generates diverse output images with the same geometric structure by incorporating a conditioning variable into the CycleGAN’s generators.

Let $\mathcal{A}$ be the set of attributes in the target domain $\mathcal{X}_T$, e.g., day or night. We learn a generator $G$ that translates a source image $x$ into a target image $G(x, a)$ with a certain style $a \in \mathcal{A}$ by fooling an attribute-specific discriminator $D_a$, where $\mathcal{X}_T^a$ is the set of target images with attribute $a$. The generator and the discriminators are trained adversarially.

We use multiple discriminators to prevent competition between different attribute configurations, but it is also feasible to use a single discriminator with a multi-way domain classification loss [53]. Also, for a small number of attribute configurations, one can afford a separate generator per attribute without sharing parameters. (Empirically, using two separate generators for day and night performs better than a single generator; please see the supplementary material.)

Since the optimization problem is underdetermined, we add constraints such as cycle consistency [64], for which we define an inverse generator that maps generated output images back to the source domain. We also use a PatchGAN [21, 64] for the discriminators, which make real-or-fake decisions from local patches, and a UNet [21] for the generators; each of these contributes to preserving the geometric structure of an input image.
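To make the conditioning and the cycle constraint concrete, here is a minimal NumPy sketch. It assumes that the attribute is injected by appending one-hot attribute planes to the generator input, which is a common conditioning scheme but is not confirmed by this section, and that cycle consistency is measured with an L1 distance as in CycleGAN [64]. The function names are hypothetical.

```python
import numpy as np

def condition_on_attribute(image, attr_index, num_attrs):
    """Append one-hot attribute planes to an (H, W, C) image.

    Assumption: conditioning by channel concatenation; the generator would then
    take an input with C + num_attrs channels.
    """
    h, w, _ = image.shape
    planes = np.zeros((h, w, num_attrs), dtype=image.dtype)
    planes[:, :, attr_index] = 1.0
    return np.concatenate([image, planes], axis=2)

def cycle_consistency_loss(x, x_reconstructed):
    """Mean absolute difference between a source image and its reconstruction."""
    return float(np.mean(np.abs(x - x_reconstructed)))

# Example: a 4x4 RGB image conditioned on attribute 1 of {0: day, 1: night}.
image = np.zeros((4, 4, 3), dtype=np.float32)
conditioned = condition_on_attribute(image, attr_index=1, num_attrs=2)
print(conditioned.shape)
```

With this scheme, one source image yields one conditioned input per attribute, so a single generator can produce a day version and a night version of the same car.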

4.2 Perspective Transformation

Besides color differences, we also observe significant differences in camera perspective (Figure 2). In this section, we deal with perspective transformations using an image warping based on a pixel-level dense flow called appearance flow (AF) [63]. Specifically, we propose to improve the generalization of AF estimation network (AFNet) trained on 3D CAD rendered images to real images by utilizing a robust representation across domains, i.e. 2D keypoints.

Appearance Flow. Zhou et al. [63] propose to estimate a dense flow of pixels between two images of different viewpoints. Once estimated, synthesization is done by simply reorganizing pixels using bilinear sampling [22] as follows:

$$
I_t^{i,j} = \sum_{(h,w) \in \mathcal{N}} I_s^{h,w} \, \max\big(0, 1 - |F_y^{i,j} - h|\big) \, \max\big(0, 1 - |F_x^{i,j} - w|\big) \qquad (6)
$$

where $I_s$ is an input image, $I_t$ is an output image, $F = (F_x, F_y)$ is a learned pixel-level flow field in horizontal and vertical directions called appearance flow, and $\mathcal{N}$ denotes the 4-pixel neighborhood of $(F_x^{i,j}, F_y^{i,j})$. Unlike neural network based image synthesization methods [54], AF-based transformation may have a better chance of preserving object identity, since all pixels of an output image come from the input image and no new information, such as learned priors in a decoder network, is introduced.
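A minimal NumPy sketch of bilinear sampling with a flow field is shown below. It assumes that the flow stores, for every output pixel, the absolute (x, y) source coordinates to sample from, and that coordinates outside the image are clamped to the border; both conventions are assumptions for this illustration. The function name `warp_bilinear` is hypothetical.

```python
import numpy as np

def warp_bilinear(src, flow_x, flow_y):
    """Sample src (H, W) or (H, W, C) at float coordinates given by the flow.

    flow_x, flow_y: (H, W) arrays of source column / row coordinates (assumed absolute).
    """
    H, W = src.shape[:2]
    x = np.clip(flow_x, 0, W - 1)
    y = np.clip(flow_y, 0, H - 1)
    # Integer corners of the 4-pixel neighborhood around each sampling location.
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    # Fractional offsets act as the interpolation weights.
    wx = x - x0
    wy = y - y0
    if src.ndim == 3:
        wx = wx[..., None]
        wy = wy[..., None]
    top = src[y0, x0] * (1 - wx) + src[y0, x1] * wx
    bottom = src[y1, x0] * (1 - wx) + src[y1, x1] * wx
    return top * (1 - wy) + bottom * wy

src = np.arange(12, dtype=float).reshape(3, 4)
ys, xs = np.meshgrid(np.arange(3), np.arange(4), indexing="ij")
identity = warp_bilinear(src, xs.astype(float), ys.astype(float))
shifted = warp_bilinear(src, xs + 1.0, ys.astype(float))  # sample one pixel to the right
```

Every output value is a convex combination of at most four input pixels, which is the property the text relies on when arguing that no new appearance information is introduced.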

A limitation of AF estimation network (AFNet), however, is that it requires image pairs with perspective being the only factor of variation. Since it is infeasible to collect such a highly controlled dataset of real images at large-scale, rendered images from 3D CAD models are used for the training of AFNet, but this incurs a generalization issue when applied to the real images at test time.

Keypoint-based Robust Estimation of AF. To make the AFNet generalizable to real images, we propose to use an invariant representation between rendered and real images, such as object keypoints. Although sparse, we argue that, for objects like cars, 2D keypoints contain sufficient information to reconstruct rough geometry of the entire object. Besides, keypoint estimation can be done robustly across synthetic and real domains even when the keypoint localization network is trained only on the synthetic data [28]. To this end, we propose a 2D keypoint-based AFNet (KFNet) that takes estimated 2D keypoints and the target viewpoint as the input pair, to generate flow fields for synthesization.

The training of KFNet can be done using rendered images, and the trained KFNet can be readily applied to real images after 2D keypoint estimation. Learning AF from a sparse keypoint representation is, however, more difficult than learning from an image. To leverage the pretrained AFNet, which produces a robust AF representation for rendered images, we propose to learn the KFNet by distilling knowledge from the AFNet. The learning objective combines a distillation term on the flow with a reconstruction term on the synthesized image, where $F_{KF}$ is the appearance flow estimated by KFNet, $F_{AF}$ is that estimated by AFNet, and $\hat{I}_t$ is the image predicted from $I_s$ using $F_{KF}$ based on Eq (6). The training framework by distillation is visualized in Figure 4.
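One way such a distillation objective could be written is sketched below. The L1 distances and the weight `w_img` are assumptions made for illustration, since the exact norms and weights are not recoverable from this text. `warp_fn` stands for any flow-based sampler that maps a source image and a flow to a predicted image.

```python
import numpy as np

def kfnet_distillation_loss(flow_kf, flow_af, src_img, tgt_img, warp_fn, w_img=1.0):
    """Hypothetical KFNet training loss.

    flow_kf: flow predicted by the keypoint-based network (student).
    flow_af: flow predicted by the pretrained image-based network (teacher).
    warp_fn: callable (image, flow) -> predicted image.
    """
    # Distillation term: the student flow should match the teacher flow.
    flow_term = np.mean(np.abs(flow_kf - flow_af))
    # Reconstruction term: warping the source with the student flow should give the target view.
    predicted = warp_fn(src_img, flow_kf)
    image_term = np.mean(np.abs(predicted - tgt_img))
    return float(flow_term + w_img * image_term)
```

The teacher flow gives a dense training signal at every pixel, which is one plausible reason distillation helps when the student only sees sparse keypoints.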

Figure 4: Training framework of keypoint-based appearance flow network (KFNet) by distilling knowledge from pretrained AFNet.
ID   Perspective  Photometric  Feature-level DA  Overall        Day            Night
                                                 Top-1  Top-5   Top-1  Top-5   Top-1  Top-5
M1   Baseline (web only)                         54.98  69.62   72.67  86.71   19.87  35.68
M2   Supervised (web + SV)                       98.63  99.65   98.92  99.70   98.05  99.54
M3   AF           -            -                 59.83  74.09   76.23  89.41   27.28  43.68
M4   KF           -            -                 61.89  75.21   77.72  90.01   30.47  45.85
M5   MKF          -            -                 64.49  78.51   78.47  90.27   36.73  55.15
M6   -            CycleGAN     -                 62.65  79.50   77.72  90.40   38.72  57.87
M7   -            AC-CycleGAN  -                 67.65  82.11   79.17  90.88   44.79  64.69
M8   -            -            DANN              59.40  74.06   74.71  87.70   29.00  46.99
M9   -            -            DANN-SS           67.49  80.33   80.38  92.40   41.89  56.36
M10  -            -            DANN-EM           72.40  81.63   82.48  90.55   52.38  63.91
M11  MKF          CycleGAN     -                 71.15  84.90   81.51  92.05   50.59  70.71
M12  MKF          AC-CycleGAN  -                 77.38  89.66   81.76  91.72   68.69  85.56
M13  -            AC-CycleGAN  DANN              69.06  83.25   78.26  90.21   50.80  69.44
M14  -            AC-CycleGAN  DANN-SS           74.50  88.36   81.77  93.16   60.07  78.84
M15  -            AC-CycleGAN  DANN-EM           81.44  90.37   84.11  91.89   76.14  87.35
M16  MKF          -            DANN              64.96  79.95   77.75  90.10   39.56  59.82
M17  MKF          -            DANN-SS           74.79  86.76   82.52  92.42   59.44  75.53
M18  MKF          -            DANN-EM           83.63  91.13   85.65  92.05   79.61  89.31
M19  MKF          AC-CycleGAN  DANN              76.10  89.60   80.93  92.17   66.52  84.49
M20  MKF          AC-CycleGAN  DANN-SS           83.80  93.02   85.49  93.50   80.45  92.06
M21  MKF          AC-CycleGAN  DANN-EM           84.69  92.51   86.10  92.90   81.89  91.73

Table 1: Car model recognition accuracy (%) on the CompCars Surveillance dataset of our proposed recognition system with different combinations of components: perspective transformation (Section 4.2), photometric transformation (Section 4.1), and feature-level domain adaptation (Section 3). A dash marks a component that is not used. For perspective transformation we consider pixel-based (AF), keypoint-based (KF) and keypoint-based with mask (MKF) variants; for photometric transformation, the naive and attribute-conditioned CycleGAN; and for feature-level DA, DANN, DANN-SS and DANN-EM.

5 Experimental Results

In this section, we report empirical results of our proposed framework on the car model recognition in surveillance domain. In addition, we validate the effectiveness of our proposed feature-level DA methods in comparison to state-of-the-art methods on the standard UDA benchmark.

5.1 Recognizing Cars in Surveillance Domain

Dataset. The Comprehensive Cars (CompCars) dataset [61] offers two sets of data, one collected from the web domain and the other from the surveillance domain. Sample images from each domain are shown in Figure 2. The web-domain images cover a set of car-model categories, and the surveillance-domain images, only part of which are labeled, cover a smaller subset of these categories. In both cases, a fixed portion of the images is used for training.

For data preprocessing, we crop and scale web images to a fixed size using the provided bounding boxes while maintaining the aspect ratio. Since they are already cropped, surveillance images are simply scaled to a fixed size. To train AFNet and KFNet, we render images from 3D CAD models from ShapeNet [5]. Noting that the major perspective variation between web and surveillance images is camera elevation, we render each model at several elevation levels while adding a few azimuth variations.

Evaluation. We report top-1 and top-5 accuracy on test images from the surveillance dataset. To make a prediction, the softmax scores are averaged between the center crop and its horizontal flip. Noticing a huge accuracy drop on images taken at night, we also report classification accuracy for the individual subsets of day and night.
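The evaluation protocol can be sketched as follows. The code assumes image batches of shape (B, H, W, C) that are already center-cropped, and a hypothetical `score_fn` that maps such a batch to class logits; both are assumptions for this illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def predict_with_flip(score_fn, images):
    """Average softmax scores of each image and its horizontal flip."""
    probs = softmax(score_fn(images))
    probs_flipped = softmax(score_fn(images[:, :, ::-1]))  # flip along the width axis
    return 0.5 * (probs + probs_flipped)

def topk_accuracy(probs, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    topk = np.argsort(-probs, axis=1)[:, :k]
    return float(np.mean((topk == labels[:, None]).any(axis=1)))

probs = np.array([[0.1, 0.7, 0.2],
                  [0.5, 0.3, 0.2],
                  [0.2, 0.3, 0.5]])
labels = np.array([1, 1, 0])
top1 = topk_accuracy(probs, labels, k=1)
top2 = topk_accuracy(probs, labels, k=2)
```

Reporting the same metric separately on the day and night subsets only requires masking `probs` and `labels` by the corresponding attribute before calling `topk_accuracy`.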

We extensively evaluate the performance of the models by combining the three proposed adaptation components in different ways and study the impact of each component in Section 5.2. Note that we consider images synthesized through AF or CycleGAN as labeled; we therefore incorporate them into the source domain and train the classification networks accordingly. The ImageNet-pretrained ResNet-18 [18] fine-tuned on the web dataset is our baseline network. The CNN and the classifier of all evaluated models are initialized with those of the baseline network. For the DANN-SS and DANN-EM models, we initialize the classifier by adding one column to its weight matrix, with parameters set to the mean of the other columns, which we find works well in practice. The discriminator of DANN is randomly initialized. Due to space constraints, we leave further implementation details to the supplementary material.
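The classifier initialization for the extra target class can be sketched in a few lines. The weight layout (one column per class) follows the description in the text; extending the bias with the mean of the existing biases is an assumption for this illustration, and the function name is hypothetical.

```python
import numpy as np

def extend_classifier(weights, bias):
    """Add one output class to a linear classifier.

    weights: (K, N) matrix with one column per class.
    bias:    (N,) vector (handling of the bias is an assumption).
    Returns arrays of shape (K, N + 1) and (N + 1,).
    """
    new_column = weights.mean(axis=1, keepdims=True)  # mean of the existing columns
    new_weights = np.concatenate([weights, new_column], axis=1)
    new_bias = np.concatenate([bias, [bias.mean()]])
    return new_weights, new_bias

W = np.arange(6, dtype=float).reshape(2, 3)
b = np.array([0.0, 3.0, 6.0])
W_ext, b_ext = extend_classifier(W, b)
```

Starting the new column at the mean of the others gives the target class a logit equal to the average source-class logit, so no source class is favored or suppressed at the start of adaptation.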

Performance. The summary results are given in Table 1. Although it achieves state-of-the-art performance on the test set of the web dataset compared with [61], the baseline network generalizes poorly to surveillance images, resulting in only 54.98% top-1 accuracy (M1). Comparing this to the performance of the model trained with target domain supervision (98.63%, M2) gives a sense of how different the two domains are. Our proposed framework, M21, achieves 84.69% top-1 accuracy, which reduces the top-1 error of M1 from 45.02% to 15.31%. We also verify the importance of individual components by training models after removing one component at a time. We achieve 81.44% without perspective transformation (M15), 83.63% without photometric transformation (M18), and 77.38% without feature-level DA (M12). Also, feature-level DA (M10) improves the most compared to the other adaptation methods when used alone (M5, M7).
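The error reduction can be recomputed directly from the top-1 accuracies in Table 1. The helper below is illustrative; it expresses the reduction relative to the baseline error, which is one common way to report it.

```python
def relative_error_reduction(acc_baseline, acc_new):
    """Relative reduction of the error rate, in percent, given accuracies in percent."""
    err_baseline = 100.0 - acc_baseline
    err_new = 100.0 - acc_new
    return 100.0 * (err_baseline - err_new) / err_baseline

# Top-1 accuracies of M1 (web-only baseline) and M21 (full framework) from Table 1.
reduction = relative_error_reduction(54.98, 84.69)
print(round(reduction, 2))  # about 66 percent of the baseline error is removed
```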

Figure 5: Visualization of synthesized images by perspective and photometric transformations. For each web image from the CompCars dataset, we show perspective-transformed images at four camera elevation angles using AFNet and the 2D keypoint-based AFNet (KFNet), and with an estimated foreground mask (MKF). Then, we show images photometrically transformed into day/night by AC-CycleGAN.

5.2 Ablation Study and Discussion

In this section, we provide more in-depth analysis on the impact of each component of our system.

5.2.1 Analysis on Feature-level DA

We observe a clear trend that our feature-level DA methods significantly improve upon baselines without any feature-level DA or with standard DANN (e.g., {M1, M8–M10} or {M12, M19–21}), and DANN-EM further improves the performance over DANN-SS. One observation is that the top-5 accuracy of M21 is slightly worse than that of M20. This is a somewhat expected side effect of entropy minimization, since it strengthens the confidence of the most confident class while suppressing the confidence of the other classes.

Figure 6: Entropy of DANN (with different $\lambda$) and DANN-SS.

We look deeper into the relation between DANN and DANN-SS by monitoring the entropy of the classification distribution for the target data. Although the entropy is not directly correlated with the recognition accuracy, we can relate it to the classifiability of samples, since it tells how far the samples are from the decision boundary towards the class centers. We report the entropy of M8 with different values of the domain-adversarial weight $\lambda$ and that of M9 in Figure 6. Starting from the same entropy, the DANN-SS model reduces the entropy faster and converges at a much lower value than DANN, while DANN shows the behavior of mode collapse when $\lambda$ is large. Even with a carefully tuned $\lambda$, DANN does not reach the entropy level of DANN-SS.
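The monitored quantity can be computed as the mean Shannon entropy of the predicted class distribution over a batch of target samples. The sketch below is illustrative; which distribution is fed in (for example, the conditional distribution over the source classes for DANN-SS) is an assumption left to the caller.

```python
import numpy as np

def mean_entropy(probs, eps=1e-12):
    """Mean entropy (in nats) of row-wise probability distributions."""
    return float(-np.mean(np.sum(probs * np.log(probs + eps), axis=1)))

uniform = np.full((3, 4), 0.25)   # samples on the decision boundary of all classes
confident = np.eye(4)[:3]         # samples far from every decision boundary
print(mean_entropy(uniform), mean_entropy(confident))
```

Low values indicate target samples far from the decision boundaries, which is the sense in which the text relates entropy to classifiability.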

5.2.2 Complementarity of Components

As our system consists of three components that all aim at reducing the gap between domains, it is important to understand the complementarity of these components. We first observe that any combination of two brings significant performance improvement (e.g., M12, M15, M18 vs. M5, M7, M10). Specifically, combining perspective transformation (M5) and feature-level DA (M10) brings the most significant improvement (M18), reaching 83.63% top-1 accuracy, an increment of 11.23 points over M10.

However, combining AC-CycleGAN (M7) and feature-level DA (M10) brings less improvement even though AC-CycleGAN alone yields a larger improvement than perspective transformation alone. We think that the factors of variation captured by feature-level DA are more complementary to those captured by perspective transformation than to those captured by AC-CycleGAN. For example, the perspective difference between web and surveillance domains is difficult to learn using standard network architectures; we need additional domain-specific tools like appearance flow. In contrast, the contrast in low-level statistics between domains, such as color, might be easier to learn for either CycleGAN or DANNs. Indeed, color is one major factor of variation, since web images tend to be taken during the day, whereas surveillance images include both day and night. Consequently, although AC-CycleGAN shows more gain on night images when used alone (44.79% for M7 vs. 36.73% for M5 in night top-1 accuracy), its gain becomes less significant when combined with feature-level DA (76.14%, M15) than that of the perspective transformation module combined with feature-level DA (79.61%, M18).

5.2.3 Perspective Transformation for CycleGAN

The success of CycleGAN [64] on unpaired image-to-image translation is attributed to several factors, such as autoencoding using the cycle consistency loss, the patch-based discriminator, and a network architecture like UNet with skip connections. However, we find that applying CycleGAN naively to translate images from the web to the surveillance domain is not effective due to the perspective difference. As shown in Figure 7(a), CycleGAN maintains the geometric structure of an input, but does not adapt to the perspective of the surveillance domain.

When we relax some constraints of CycleGAN (relaxed CycleGAN), such as the receptive field size of the PatchGAN or the skip connections, we start to see some perspective change in the third row of Figure 7(a), but we lose many details that are crucial for recognition tasks. Our proposed approach solves this challenge in two steps, first by explicitly disentangling the perspective variation using appearance flow and second by dealing with other variation factors (lighting and low-level color transformation) using CycleGAN. This results in high-quality synthesization of images from web to surveillance as in Figure 5 and significantly improved recognition performance (from 67.65% for M7 to 77.38% for M12 in top-1 accuracy).

Figure 7: Web to surveillance image translation of CycleGAN (a) with relaxed constraints and (b) with attribute interpolation. (a) Comparison of CycleGANs with different constraints. (b) Attribute interpolation.

5.2.4 Diverse Outputs with Attribute Interpolation

The AC-CycleGAN produces multiple outputs from a single input by changing the latent code configuration. In our case, this yields at least twice as many synthesized labeled images as CycleGAN, which directly translates into improved recognition performance (M12, vs. M11, ). Furthermore, continuous interpolation of the latent code [59] generates a continuous change in certain factors of variation, which in our case is the lighting condition. Figure 7(b) shows images generated with attribute interpolation. We observe gradual changes, both global (e.g., color tone) and local (e.g., pixel intensity of headlights), from day to night while object identities are preserved.
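The interpolation of the latent code can be pictured as a linear blend between two attribute codes. The sketch below is illustrative only: the two-dimensional one-hot codes for day and night and the function name `interpolate_attribute` are assumptions, not details taken from the paper. Each blended code would be fed to the generator in place of a one-hot code.

```python
def interpolate_attribute(code_a, code_b, num_steps):
    """Return num_steps codes that blend linearly from code_a to code_b."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    codes = []
    for i in range(num_steps):
        t = i / (num_steps - 1)  # 0.0 at code_a, 1.0 at code_b
        codes.append([(1.0 - t) * a + t * b for a, b in zip(code_a, code_b)])
    return codes


DAY = [1.0, 0.0]    # assumed one-hot code for "day"
NIGHT = [0.0, 1.0]  # assumed one-hot code for "night"
codes = interpolate_attribute(DAY, NIGHT, num_steps=5)
```

With five steps, the middle code weights day and night equally, which would correspond to an intermediate lighting condition.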

5.2.5 Analysis on 2D Keypoint-based AF Estimation

AFNet is trained on rendered images of 3D CAD models, so a performance degradation is expected when it is applied to real images. On the other hand, our KFNet uses 2D keypoints as input, and it is questionable whether estimating a per-pixel dense flow from sparse keypoints is even feasible. To demonstrate the feasibility, we compare the pixel-level L1 reconstruction error between rendered images and perspective transformed images at four elevations ( to ) using AFNet and KFNet. We obtained error with KFNet and this is comparable with error with AFNet. (Footnote 5: We normalize pixel values to be in .)
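As a rough illustration of the metric used in this comparison, a pixel-level L1 reconstruction error can be computed as the mean absolute difference between two images. This is a minimal sketch, not the authors' evaluation code: the nested-list image layout and the function name `mean_l1_error` are assumptions, and pixel values are assumed to be normalized to a common range beforehand.

```python
def mean_l1_error(image_a, image_b):
    """Mean absolute per-pixel difference between two equally sized images.

    Images are nested lists indexed as image[row][col][channel]; values are
    assumed to be already normalized to a common range.
    """
    total, count = 0.0, 0
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same height")
    for row_a, row_b in zip(image_a, image_b):
        if len(row_a) != len(row_b):
            raise ValueError("images must have the same width")
        for px_a, px_b in zip(row_a, row_b):
            for va, vb in zip(px_a, px_b):
                total += abs(va - vb)
                count += 1
    if count == 0:
        raise ValueError("images must not be empty")
    return total / count


# Toy 1x2 RGB images (made-up values, not data from the paper).
rendered = [[[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]]
warped = [[[0.0, 0.5, 0.0], [1.0, 1.0, 0.5]]]
error = mean_l1_error(rendered, warped)
```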

Since we are not aware of any dataset with paired images of cars with elevation difference, we resort to qualitative comparison. We visualize transformed images of real data from CompCars dataset with AFNet and KFNet in Figure 5. AFNet struggles to generalize on real images and generates distorted images with incorrect target elevation. Although sparse, 2D keypoints are more robust to domain shift from synthetic to real and are sufficient to preserve the object geometry while correctly transforming to target perspective. Finally, better recognition performance of the network trained with source and the perspective transformed images on the surveillance domain ( from M3 to M4) implies the superiority of the proposed KFNet-based perspective transformation.

5.3 Evaluation of DANNs on UDA Benchmark

In addition to our observation from Table 1 (e.g., M12, M19–M21), we further validate the effectiveness of our feature-level DA methods against the baseline and other methods on the standard UDA benchmark. We conduct experiments on the 4 settings proposed by [12]. We evaluate the performance of the models using shallow [12, 4] and deep [17] network architectures. As suggested by [4], we use ( for GTSRB dataset) randomly selected labeled target domain images for validation. Details of the experimental protocol are in the supplementary material. Unsupervised model selection in UDA, i.e., model selection without using labeled examples from the target data, remains an open question and we leave it for future work.

The results are provided in Table 2. DANN-SS outperforms both DANN with gradient reversal [12] and DANN on the domain separation network [4]. Note that we focus on feature-level DA in this experiment and do not combine it with any pixel-level adaptation as in [4, 3, 20]. Moreover, DANN-EM improves upon DANN-SS, achieving state-of-the-art results on 3 settings out of 4. Specifically, for the Synth. Digits→SVHN setting we observe an even lower error () than that of a model trained with full supervision on the target domain () reported by [17]. We obtain this by utilizing extra unlabeled data from the SVHN dataset [38], which suggests that our method benefits from large-scale unlabeled data from the target domain.

| Method | Network | MMM | SS | SM | SG |
| --- | --- | --- | --- | --- | --- |
| RevGrad [12] | shallow | 23.33 | 8.91 | 26.15 | 11.35 |
| DSN [4] | shallow | 16.80 | 8.80 | 17.30 | 6.90 |
| ADA [17] | deep | 10.47 | 8.14 | 2.40 | 2.34 |
| source only | shallow | 31.72 | 12.78 | 31.61 | 4.37 |
| DANN | shallow | 11.48 | 6.67 | 7.66 | 2.67 |
| DANN-SS | shallow | 9.59 | 6.68 | 5.93 | 1.55 |
| DANN-EM | shallow | 9.42 | 5.69 | 5.19 | 1.46 |
| source only | deep | 32.10 | 12.95 | 36.26 | 5.47 |
| DANN | deep | 2.02 | 7.76 | 11.30 | 2.72 |
| DANN-SS | deep | 2.04 | 5.53 | 3.77 | 1.30 |
| DANN-EM | deep | 1.94 | 4.92 | 3.51 | 1.05 |

Table 2: Evaluation on UDA benchmark, such as MNIST [27] to MNIST-M [12] (MMM), Synth. Digits [12] to SVHN (SS), SVHN to MNIST (SM), or Synth. Signs [37] to GTSRB [50] (SG). Experiments are executed 10 times with different random seeds and the mean test set error rate is reported. The best performer and the ones within standard error are bold-faced.

6 Conclusion

Observing that certain adaptation challenges are better handled in feature space and others in pixel space, we propose an unsupervised domain adaptation framework that leverages complementary tools better suited to each type of adaptation challenge. The importance and complementarity of each component are demonstrated through extensive experiments on a novel application of car recognition in the surveillance domain. We also demonstrate state-of-the-art performance on the standard UDA benchmark with our proposed feature-level DA methods.


  • [1] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine learning, 79(1):151–175, 2010.
  • [2] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In NIPS, 2007.
  • [3] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR, July 2017.
  • [4] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In NIPS, 2016.
  • [5] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago, 2015.
  • [6] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. PAMI, PP(99):1–1, 2017.
  • [7] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.
  • [8] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
  • [9] Z. Dai, Z. Yang, F. Yang, W. W. Cohen, and R. Salakhutdinov. Good semi-supervised learning that requires a bad gan. In NIPS, 2017.
  • [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR. IEEE, 2009.
  • [11] B. Fernando, T. Tommasi, and T. Tuytelaars. Joint cross-domain classification and subspace learning for unsupervised adaptation. Pattern Recognition Letters, 65:60–66, 2015.
  • [12] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.
  • [13] T. Gebru, J. Krause, J. Deng, and L. Fei-Fei. Scalable annotation of fine-grained categories without experts. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pages 1877–1881. ACM, 2017.
  • [14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
  • [15] Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In NIPS, 2005.
  • [16] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In ECCV, 2016.
  • [17] P. Haeusser, T. Frerix, A. Mordvintsev, and D. Cremers. Associative domain adaptation. In ICCV, 2017.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [19] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image analogies. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 327–340. ACM, 2001.
  • [20] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. Cycada: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017.
  • [21] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • [22] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NIPS, 2015.
  • [23] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [24] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3d object representations for fine-grained categorization. In CVPR Workshop, 2013.
  • [25] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • [26] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In NIPS, 2015.
  • [27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [28] C. Li, M. Z. Zia, Q.-H. Tran, X. Yu, G. D. Hager, and M. Chandraker. Deep supervision with shape concepts for occlusion-aware 3d object parsing. In CVPR, 2017.
  • [29] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature Pyramid Networks for Object Detection. In CVPR, 2017.
  • [30] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal Loss for Dense Object Detection. In ICCV, 2017.
  • [31] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single Shot MultiBox Detector. In ECCV, 2016.
  • [32] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu. Transfer feature learning with joint distribution adaptation. In ICCV, 2013.
  • [33] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Unsupervised domain adaptation with residual transfer networks. In NIPS, 2016.
  • [34] Z. Luo, Y. Zou, J. Hoffman, and L. Fei-Fei. Label efficient learning of transferable representations across domains and tasks. In NIPS, 2017.
  • [35] X. Mao, Q. Li, H. Xie, R. Y. Lau, and Z. Wang. Multi-class generative adversarial networks with the l2 loss function. arXiv preprint arXiv:1611.04076, 2016.
  • [36] M. Mirza and S. Osindero. Conditional generative adversarial nets. In NIPS Workshop, 2014.
  • [37] B. Moiseev, A. Konev, A. Chigorin, and A. Konushin. Evaluation of traffic sign recognition methods trained on synthetically generated data. In International Conference on Advanced Concepts for Intelligent Vision Systems, pages 576–583. Springer, 2013.
  • [38] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop, 2011.
  • [39] E. Park, J. Yang, E. Yumer, D. Ceylan, and A. C. Berg. Transformation-grounded image generation network for novel 3d view synthesis. In CVPR, 2017.
  • [40] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
  • [41] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You Only Look Once: Unified, Real-Time Object Detection. In CVPR, 2016.
  • [42] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NIPS, 2015.
  • [43] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
  • [44] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015.
  • [45] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. PAMI, 39(4):640–651, 2017.
  • [46] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, 2017.
  • [47] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [48] K. Sohn, S. Liu, G. Zhong, X. Yu, M.-H. Yang, and M. Chandraker. Unsupervised domain adaptation for face recognition in unlabeled videos. In ICCV, 2017.
  • [49] J. T. Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. In ICLR, 2015.
  • [50] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. The german traffic sign recognition benchmark: a multi-class classification competition. In IJCNN, 2011.
  • [51] B. Sun and K. Saenko. Deep coral: Correlation alignment for deep domain adaptation. In ECCV Workshop, 2016.
  • [52] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [53] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. In ICLR, 2017.
  • [54] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Multi-view 3d models from single images with a convolutional network. In ECCV, 2016.
  • [55] L. Tran, X. Yin, and X. Liu. Disentangled representation learning gan for pose-invariant face recognition. In CVPR, July 2017.
  • [56] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, 2015.
  • [57] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
  • [58] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset, 2011.
  • [59] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. In ECCV, 2016.
  • [60] J. Yang, S. E. Reed, M.-H. Yang, and H. Lee. Weakly-supervised disentangling with recurrent transformations for 3d view synthesis. In NIPS, 2015.
  • [61] L. Yang, P. Luo, C. Change Loy, and X. Tang. A large-scale car dataset for fine-grained categorization and verification. In CVPR, 2015.
  • [62] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker. Towards large-pose face frontalization in the wild. In ICCV, 2017.
  • [63] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros. View synthesis by appearance flow. In ECCV, 2016.
  • [64] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.

Appendix A Discussion on Optimal Solutions of Eq (4) and (5)

We start the section with the following proposition:

Proposition A.1.

The following equation has a maximum value of when such that .


Let be an entropy of distribution . Firstly, let’s consider the case where . Then, by definition of entropy, has a maximum value of when for some . Now, let’s consider . Again, by definition, has its minimum value of regardless of and the optimization problem becomes


By L’Hospital’s rule, the expression is bounded by from right as approaches . ∎

Now we show how to prevent being an optimal solution by adding an adversarial term for .

Proposition A.2.

The following equation has a maximum value of when such that .


Let’s assume . By definition of an entropy, . Then, the optimization problem becomes


where . When , it goes to negative infinity and the maximum is obtained when . ∎

In Figure 8, we provide several plots for for different values to get a better sense. As we see in Figure 8(a), Eq (4) has a maximum value of at and . When , the output at becomes negative infinity, and therefore we can prevent being an optimal solution.

Figure 8: Plots for with different values. x-axes are from to . For the ease of comparison, we scale the output values to be in .

Appendix B Implementation Details

In this section, we describe the implementation details of the individual components. All components are implemented in Torch [8].


B.1 Appearance Flow Estimation Networks (AFNet)

AFNet has an encoder-decoder structure, which is visualized in Figure 9. AFNet takes a source image and a target viewpoint as input: an image of size is fed to a convolutional encoder to produce a -dimensional vector, which is concatenated with a -dimensional vector generated from the latent code via the viewpoint encoder.

The -dimensional concatenated vector is fed to the decoder, which is constructed with fractionally-strided convolution layers, to generate a flow representation of size . Finally, the source image is warped via appearance flow based on bilinear sampling [22, 63] to predict a target image. All convolution layers use filters, while the filters of the fractionally-strided convolution layers have size . AFNet is trained using the Adam optimizer [23] with a learning rate of and a batch size of .
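To make the warping step concrete, the sketch below applies bilinear sampling to a single-channel image given a flow field. It is a simplified illustration under stated assumptions rather than the AFNet implementation: the flow is assumed to store absolute source coordinates (x, y) for every target pixel, coordinates are clamped at the image border, and the function names are hypothetical.

```python
import math


def bilinear_sample(image, x, y):
    """Sample a single-channel image (list of rows) at real-valued (x, y)."""
    height, width = len(image), len(image[0])
    # Clamp coordinates to the image border (an assumed boundary policy).
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0  # fractional offsets inside the pixel cell
    top = (1.0 - wx) * image[y0][x0] + wx * image[y0][x1]
    bottom = (1.0 - wx) * image[y1][x0] + wx * image[y1][x1]
    return (1.0 - wy) * top + wy * bottom


def warp_with_flow(image, flow):
    """Build the target image; flow[r][c] holds the (x, y) source location."""
    return [[bilinear_sample(image, x, y) for (x, y) in row] for row in flow]


source = [[0.0, 1.0],
          [2.0, 3.0]]
flow = [[(0.5, 0.0), (1.0, 1.0)],
        [(0.0, 0.5), (0.5, 0.5)]]
warped = warp_with_flow(source, flow)
```

Because each output pixel is a weighted average of four neighboring source pixels, the operation is differentiable with respect to the flow, which is what allows a flow-predicting network to be trained end to end.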

The KFNet architecture is inherited from AFNet and shares its decoder architecture and viewpoint encoder. To accommodate sparse keypoints as input, the entire image encoder is replaced by a keypoint encoder consisting of two fully connected layers with and output neurons, respectively. KFNet is trained to optimize Eq (4.2) with . Other hyperparameters such as the learning rate are the same as those used for AFNet training.

Figure 9: AFNet architecture. AFNet receives source image and the target perspective (e.g., 4-dimensional one hot vector for elevation from ) as input and generates the flow field to synthesize image through bilinear sampling.

B.2 Attribute-conditioned CycleGAN (AC-CycleGAN)

The network architectures of the generators and discriminators are illustrated in Figure 11(a). Images of size are used as the input and output of the generators and discriminators. The UNet architecture [21] is used for both generators and , and the attribute code is fed in the middle of the generator network. The patchGAN discriminator [21], which generates a -dimensional output for real/fake discrimination, is used. The discriminator follows the conditional GAN [36], where takes the attribute code as an additional input alongside the real or generated images. (Footnote 7: In our experiments, we maintain two sets of parameters for day and night attribute configurations in a similar manner to AC-CycleGAN with unshared generators in Figure 11(b).) One can consider a multi-way discriminator [53, 7] that discriminates not only between real and generated images but also between different attribute configurations, but we did not find it effective in our experiments.

We train all networks using the Adam optimizer with a learning rate of and a batch size of . In addition, we adopt two techniques from recent works to stabilize the training procedure. First, we replace the negative log likelihood objective of the discriminator by a least square loss [35, 64]. Second, we adopt the historical buffer strategy [46], which updates the discriminator not only using images generated by the current generator but also using images generated in previous updates. We maintain an image buffer that stores the previously generated images for each generator and randomly select images from the buffer to update the discriminator.
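The two stabilization techniques can be sketched as follows. This is a minimal illustration and not the authors' Torch code: the buffer capacity, the 0.5 swap probability, and the names `lsgan_discriminator_loss` and `ImageHistoryBuffer` are assumptions chosen for the example, and discriminator scores are represented as plain lists of floats.

```python
import random


def lsgan_discriminator_loss(real_scores, fake_scores):
    """Least-squares discriminator loss: real scores are pushed toward 1,
    scores of generated images toward 0."""
    real_term = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake_term = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return 0.5 * (real_term + fake_term)


class ImageHistoryBuffer:
    """Keeps previously generated images and mixes them into the
    discriminator update alongside freshly generated ones."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity  # assumed hyperparameter
        self.images = []
        self.rng = rng or random.Random()

    def query(self, new_image):
        """Store new_image and return the image to show the discriminator."""
        if len(self.images) < self.capacity:
            # Buffer not full yet: store the fresh image and use it directly.
            self.images.append(new_image)
            return new_image
        if self.rng.random() < 0.5:  # assumed swap probability
            # Return a stored image and keep the new one in its place.
            idx = self.rng.randrange(self.capacity)
            old_image = self.images[idx]
            self.images[idx] = new_image
            return old_image
        return new_image
```

One buffer per generator would be kept, matching the description above; showing the discriminator older samples is intended to reduce oscillation when the generator changes quickly.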

B.3 Domain Adversarial Neural Networks (DANN)

In our experiments on the CompCars dataset with the domain adversarial neural network (DANN) and our proposed DANN-SS and DANN-EM, an ImageNet-pretrained ResNet-18 [18] fine-tuned on the web dataset is used as the baseline network. The dimensions of the last two fully-connected layers are and , respectively. For standard DANN, we append a discriminator with two fully-connected layers that maps the -dimensional embedding into and , followed by a softmax layer for binary classification. (Footnote 9: For fair comparison to DANN-SS and DANN-EM, a shallower network that directly connects the -dimensional embedding vector to a -dimensional output was also tested, but no significant performance difference was observed.) For DANN-SS and DANN-EM, the weight matrix of the last fully-connected layer is augmented to produce a -dimensional output, where the additional slice of the weight vector is initialized by averaging the previous weight vectors, i.e., .
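The initialization of the extra output can be illustrated with a small sketch. It assumes, purely for illustration, that the classifier weights are stored as one row per class; the function name `augment_with_mean_row` is hypothetical, and a bias term, if present, would presumably be extended in the same way.

```python
def augment_with_mean_row(weights):
    """Append one extra output row equal to the mean of the existing rows.

    weights is a list of N rows (one weight vector per class). A new list
    with N + 1 rows is returned and the input is left untouched.
    """
    if not weights:
        raise ValueError("weights must contain at least one row")
    num_rows = len(weights)
    dim = len(weights[0])
    mean_row = [sum(row[j] for row in weights) / num_rows for j in range(dim)]
    return [list(row) for row in weights] + [mean_row]


# Toy classifier with two classes and two-dimensional embeddings.
classifier = [[1.0, 2.0],
              [3.0, 4.0]]
augmented = augment_with_mean_row(classifier)
```

Starting the additional output from the average of the existing class vectors gives it a score of the same scale as the pretrained classes, rather than a random one.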

All models are trained by updating the classifier/discriminator and CNN parameters in turn without additional techniques such as skipping updates based on training statistics. Adam optimizer is used for training with the learning rate of , which is equivalent to the final learning rate of the fine-tuned model on CompCars web dataset. For DANN and DANN-SS, we tune that balances between classification loss and domain adversarial loss for updating CNN parameters. In addition to , in Eq (5) is considered for model selection of DANN-EM models. Note that, since DANN-SS and DANN-EM share the classifier for both model classification and domain discrimination, we need to give additional care when training the classifier by introducing the hyper parameter :


When , highlights the domain discrimination aspect too much, resulting in a significantly amplified target domain classification loss compared to the individual model classification loss, which severely damages the model classification ability for both source and target domain data. On the other hand, when is too small, no domain difference information is learned through the classifier. In our experiments on car model recognition, we set , which is small enough not to interfere with the source domain model classification but still captures a reasonable amount of domain difference. In practice, we find that is a good starting point for the hyperparameter search for , where is the number of classes, which in our case . We report other hyperparameters such as and in Table 3.

AC-CycleGAN ,
Both ,
Table 3: Model selection on and . for DANN-SS and DANN-EM is set to .

The model selection, also known as hyper parameter tuning, is done in a supervised way using a validation set of size from the target domain for all cases. The performance is reported after 5 runs with different train/validation splits. As discussed in [4], the supervised model selection can provide a performance upper bound for the proposed domain adaptation technique and can be used to compare different methods when they are evaluated in the same setting. Unsupervised model selection (i.e., model selection without labeled target domain validation set) is an open research problem and we leave it for future research.

| ID | Perspective | Photometric | Feature-level DA | Top-1 | Top-5 | Day Top-1 | Day Top-5 | Night Top-1 | Night Top-5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M7 | - | AC-CycleGAN | - | 67.65 | 82.11 | 79.17 | 90.88 | 44.79 | 64.69 |
| M12 | MKF | AC-CycleGAN | - | 77.38 | 89.66 | 81.76 | 91.72 | 68.69 | 85.56 |
| M13 | - | AC-CycleGAN | DANN | 69.06 | 83.25 | 78.26 | 90.21 | 50.80 | 69.44 |
| M14 | - | AC-CycleGAN | DANN-SS | 74.50 | 88.36 | 81.77 | 93.16 | 60.07 | 78.84 |
| M15 | - | AC-CycleGAN | DANN-EM | 81.44 | 90.37 | 84.11 | 91.89 | 76.14 | 87.35 |
| M19 | MKF | AC-CycleGAN | DANN | 76.10 | 89.60 | 80.93 | 92.17 | 66.52 | 84.49 |
| M20 | MKF | AC-CycleGAN | DANN-SS | 83.80 | 93.02 | 85.49 | 93.50 | 80.45 | 92.06 |
| M21 | MKF | AC-CycleGAN | DANN-EM | 84.69 | 92.51 | 86.10 | 92.90 | 81.89 | 91.73 |
| M22 | - | AC-CycleGAN | - | 70.91 | 84.37 | 79.54 | 90.32 | 53.78 | 72.57 |
| M23 | MKF | AC-CycleGAN | - | 76.54 | 89.43 | 81.84 | 92.60 | 66.03 | 83.14 |
| M24 | - | AC-CycleGAN | DANN | 73.31 | 86.35 | 80.32 | 91.17 | 59.40 | 76.78 |
| M25 | - | AC-CycleGAN | DANN-SS | 78.72 | 90.64 | 82.48 | 92.55 | 71.26 | 86.84 |
| M26 | - | AC-CycleGAN | DANN-EM | 83.13 | 91.01 | 84.67 | 91.75 | 80.07 | 89.54 |
| M27 | MKF | AC-CycleGAN | DANN | 76.04 | 88.98 | 81.66 | 92.06 | 64.88 | 82.86 |
| M28 | MKF | AC-CycleGAN | DANN-SS | 83.94 | 93.44 | 85.73 | 93.98 | 80.39 | 92.37 |
| M29 | MKF | AC-CycleGAN | DANN-EM | 85.90 | 92.43 | 87.25 | 92.80 | 83.24 | 91.69 |

Table 4: Comparison between AC-CycleGANs with shared and unshared parameters across generators and discriminators with car model recognition accuracy on CompCars Surveillance dataset.

Appendix C AC-CycleGAN with Unshared Parameters

Because we have a small number of attribute configurations for lighting (e.g., day and night), it is affordable to use generator networks with unshared parameters. This is equivalent to having one generator for each lighting condition, with the attribute code acting as a switch that selects the output matching the attribute condition for the inverse generator or discriminator. The network architecture is illustrated in Figure 11(b). Consequently, each CycleGAN can be trained independently if we further assume unshared networks for the discriminator and inverse generator. We conduct experiments on AC-CycleGAN with unshared parameters for all generators and discriminators and report the car model recognition accuracy in Table 4. We observe some improvement in recognition accuracy with unshared models for most experimental settings; for example, M29 achieves top-1 accuracy, which reduces the error by from M21. We also observe reduced top-5 accuracy with DANN-EM (M29) in comparison to DANN-SS (M28), which is consistent with what we observed in Section 5.2.1 when comparing M20 and M21.

We visualize in Figures 12 and 13 the photometrically transformed images produced by AC-CycleGAN with both shared and unshared parameters. Apart from a slight performance improvement for AC-CycleGAN with unshared parameters, we do not observe a significant qualitative difference compared to AC-CycleGAN with shared parameters. Ultimately, we believe that the model with shared parameters is more promising for further investigation, considering its extensibility to a large number of attribute configurations as well as other interesting properties such as continuous interpolation between attribute configurations.

Appendix D Details of Section 5.3: Evaluation of DANNs on UDA Benchmark

As suggested by [12], we evaluate our proposed feature-level adaptation methods, i.e., DANN-SS and DANN-EM on four UDA benchmark tasks of digits and traffic signs. The summary of four tasks is given as follows:

  1. MNIST→MNIST-M: In this task we use MNIST [27] as the source domain and MNIST-M [12] as the target domain. MNIST-M is a variation of MNIST in which foreground digits undergo a color transformation on top of natural background images. Following [17], we augment the source data by inverting pixel values from 0 to 255 and vice versa, thus doubling the volume of source data. Overall, labeled source images, unlabeled target images for training, labeled target images for validation, labeled target images for testing are used.

  2. Synth. Digits→SVHN: In this task we use labeled synthesized digits [12] as the source domain to recognize digits in the street view house number dataset (SVHN) [38]. Unlike other works, we use extra unlabeled images of the SVHN dataset to train the adaptation model. Overall, labeled source images, unlabeled target images for training, labeled target images for validation, labeled target images for testing are used.

  3. SVHN→MNIST: In this task we use SVHN as the source domain and MNIST as the target domain. Overall, labeled source images, unlabeled target images for training, labeled target images for validation, labeled target images for testing are used.

  4. Synth. Signs→GTSRB: In this task we recognize traffic signs from the German traffic sign recognition benchmark (GTSRB) [50] by adapting from labeled synthesized images [37]. In total, labeled source images, unlabeled target images for training, labeled target images for validation, labeled target images for testing are used. Unlike the aforementioned tasks, which are 10-way classification with images as input, this task is 43-way classification and the input images are of size .
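The source augmentation described for the first task, inverting pixel values so that 0 becomes 255 and vice versa, can be sketched as below. The nested-list grayscale layout with integer values in 0..255 and the function names are assumptions made for the example.

```python
def invert_image(image):
    """Invert an 8-bit grayscale image given as a list of rows."""
    return [[255 - value for value in row] for row in image]


def augment_with_inversion(images, labels):
    """Return the original images followed by their inverted copies,
    with labels repeated accordingly (doubling the data volume)."""
    augmented_images = list(images) + [invert_image(img) for img in images]
    augmented_labels = list(labels) + list(labels)
    return augmented_images, augmented_labels


# Toy example with a single 2x2 "digit" labeled 7 (made-up data).
images = [[[0, 255], [10, 200]]]
labels = [7]
aug_images, aug_labels = augment_with_inversion(images, labels)
```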

For data preprocessing, we apply channel-wise mean and standard deviation normalization per example, i.e.,


where and .
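A per-example, channel-wise normalization of this kind can be sketched as follows. Since the exact formula is not legible in this copy, the details here are assumptions: statistics are computed over all pixels of each channel of a single image, the population standard deviation is used, and a small `eps` guards against division by zero. The layout `image[channel][row][col]` and the function name are also hypothetical.

```python
import math


def normalize_per_example(image, eps=1e-8):
    """Normalize each channel of one image to zero mean and unit variance.

    image[c] is the list of rows for channel c; statistics are computed
    separately for every channel of this single example.
    """
    normalized = []
    for channel in image:
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        std = math.sqrt(variance)
        normalized.append(
            [[(v - mean) / (std + eps) for v in row] for row in channel]
        )
    return normalized


# Toy single-channel 2x2 image (made-up values).
example = [[[0.0, 2.0], [4.0, 6.0]]]
result = normalize_per_example(example)
```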

Figure 10: Shallow and deep network architectures for experiments in Section 5.3: (a) shallow (MMM), (b) shallow (SS, SM), (c) shallow (SG), (d) deep (all). ReLU activation is applied after each convolutional and fully-connected layer except for the last fully-connected layer connected to the classifier or discriminator.

D.1 Network Architecture

We evaluate the performance with shallow (e.g., 2 or 3 convolution layers) and deep (6 convolution layers) [17] network architectures. We provide details of the network architectures in Figure 10. Our shallow networks are inspired by [12] and share the same convolution and pooling architecture, but the classifier and discriminator architectures are slightly different. We empirically found that our proposed classifier/discriminator architectures significantly improve the performance of both the baseline DANN and our proposed DANN-SS and DANN-EM over the exact architectures of [12], as shown in Table 5.

D.2 Model Selection

We report several important hyperparameters, such as , or , of each model in Table 7. Note that we also use to weigh the contribution of real and fake examples when updating discriminator of DANN, which turns out to be important for some datasets and network architectures:


As mentioned in Section D, the model selection is done in a supervised way using a subset of labeled target images. We randomly sample or (for synthetic sign to GTSRB adaptation task) examples from the labeled target validation set of each task and report the mean test set error and standard error averaged over 10 trials with different random seeds in Table 5. In addition, we also report the best performer among 10 models chosen via cross-validation in the parentheses. The best hyperparameters are reported in Table 7.
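For reference, the mean and standard error over repeated trials can be computed as in the sketch below. It assumes the standard error of the mean with the sample (n - 1) variance, which the text does not specify, and the error rates in the example are made-up numbers rather than results from the paper.

```python
import math


def mean_and_standard_error(values):
    """Return the mean and the standard error of the mean of a list."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two trials are needed")
    mean = sum(values) / n
    sample_variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(sample_variance / n)


# Hypothetical test error rates (%) from four random seeds.
trial_errors = [2.0, 4.0, 4.0, 6.0]
mean_error, standard_error = mean_and_standard_error(trial_errors)
```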

| Method | Network | MMM | SS | SM | SG |
| --- | --- | --- | --- | --- | --- |
| RevGrad [12] | shallow | 23.33 | 8.91 | 26.15 | 11.35 |
| DSN [4] | shallow | 16.80 | 8.80 | 17.30 | 6.90 |
| ADA [17] | deep | 10.47 | 8.14 | 2.40 | 2.34 |
| source only | shallow | 31.61 (30.55) | 12.64 (12.45) | 31.58 (28.76) | 4.24 (4.23) |
| DANN [12] | shallow | 11.83 (11.75) | - | 31.20 (29.06) | 4.42 (4.24) |
| DANN | shallow | 11.72 (10.32) | 6.59 (6.66) | 7.60 (4.91) | 2.73 (2.19) |
| DANN-SS | shallow | 9.47 (8.66) | 6.63 (6.78) | 5.81 (2.37) | 1.55 (1.48) |
| DANN-EM | shallow | 9.53 (9.08) | 5.67 (5.48) | 5.26 (2.36) | 1.37 (1.28) |
| source only | deep | 32.09 (28.10) | 13.28 (13.59) | 36.18 (32.77) | 5.47 (5.15) |
| DANN | deep | 1.92 (1.71) | 7.74 (6.93) | 11.29 (9.78) | 2.58 (2.27) |
| DANN-SS | deep | 2.02 (1.99) | 5.50 (5.11) | 3.70 (2.71) | 1.25 (1.21) |
| DANN-EM | deep | 1.95 (1.79) | 4.82 (4.69) | 3.55 (2.54) | 1.10 (0.98) |

Table 5: Evaluation on UDA benchmark, such as MNIST [27] to MNIST-M [12] (MMM), Synth. Digits [12] to SVHN (SS), SVHN to MNIST (SM), or Synth. Signs [37] to GTSRB [50] (SG) using a single validation set. Experiments are executed 10 times with different random seeds and the mean test set error and standard error are reported. We also report in parentheses the test set error of the best performer chosen via cross-validation among the models from 10 different random seeds. For each network architecture, the best performer and the ones within standard error are bold-faced.

| Method | # val. set | Network | MMM | SS | SM | SG |
| --- | --- | --- | --- | --- | --- | --- |
| source only | | shallow | 31.72 (30.64) | 12.78 (12.33) | 31.61 (27.23) | 4.37 (4.07) |
| DANN [12] | | shallow | 11.80 (11.57) | - | 31.01 (26.96) | 4.64 (4.51) |
| DANN | | shallow | 11.48 (10.38) | 6.67 (6.63) | 7.66 (4.82) | 2.67 (2.52) |
| DANN-SS | | shallow | 9.59 (8.94) | 6.68 (6.44) | 5.93 (2.48) | 1.55 (1.39) |
| DANN-EM | | shallow | 9.42 (8.58) | 5.69 (5.46) | 5.19 (2.42) | 1.46 (1.44) |
| source only | | deep | 32.10 (27.46) | 12.95 (11.94) | 36.26 (32.88) | 5.47 (5.09) |
| DANN | | deep | 2.02 (1.83) | 7.76 (7.11) | 11.30 (10.02) | 2.72 (2.51) |
| DANN-SS | | deep | 2.04 (1.96) | 5.53 (5.46) | 3.77 (3.01) | 1.30 (1.24) |
| DANN-EM | | deep | 1.94 (1.82) | 4.92 (4.82) | 3.51 (2.85) | 1.05 (0.84) |
| source only | | deep | 33.68 (32.23) | 13.98 (13.91) | 40.88 (37.69) | 6.27 (6.03) |
| DANN | | deep | 5.03 (4.96) | 9.86 (9.91) | 16.75 (16.35) | 4.08 (4.09) |
| DANN-SS | | deep | 4.04 (4.04) | 6.78 (6.78) | 8.43 (8.44) | 1.77 (1.77) |
| DANN-EM | | deep | 3.58 (3.58) | 5.81 (5.76) | 4.94 (4.95) | 1.55 (1.53) |

Table 6: Evaluation on UDA benchmark, such as MNIST [27] to MNIST-M [12] (MMM), Synth. Digits [12] to SVHN (SS), SVHN to MNIST (SM), or Synth. Signs [37] to GTSRB [50] (SG). Experiments are executed 10 times with different random seeds for each validation set; the average of the mean test set error over 10 randomly sampled validation sets and the standard error of the mean test set error across the 10 validation sets are reported. We also report in parentheses the average, over the 10 validation sets, of the test set error of the best performer chosen via cross-validation among the models from 10 different random seeds. We also report the performance when cross-validated with even fewer labeled target examples, such as for MMM, SS, SM tasks or for SG task. For each network architecture and size of labeled validation set, the best performer and the ones within standard error are bold-faced.

Method network MMM SS SM SG
DANN shallow 0.003 0.3 0.1 0.3 0.3 0.01 0.03 0.3
DANN-SS 0.003 0.3 1 0.3 3 0.003 0.01 1
DANN-EM 0.001 0.3 0.003 0.03 3 0.3 1 0.03 0.1 0.03 1 0.001
DANN deep 1 0.3 1 0.3 0.1 0.03 0.03 0.3
DANN-SS 0.3 1 0.01 3 0.1 0.3 0.03 1
DANN-EM 1 0.3 0.01 0.01 3 0.3 0.1 0.3 0.03 0.03 1 0.03
Table 7: Optimal hyperparameters of . Besides these hyperparameters, the learning rates of shallow models are set to and those of deep models are set to except for the models in SS task ().

Unlike the typical supervised learning protocol for these tasks, where a large number of labeled target examples are available for validation, we limit the validation set to a small size. Although the effect is not significant at these validation set sizes, we still observe some performance variation depending on which validation examples are selected, so reporting the performance with only a single validation set could be misleading. Therefore, we repeat the experiments 10 times with different selections of the validation set. For each validation set, we compute the mean test set error over 10 random trials with different seeds, and we report the mean and standard deviation of this quantity over the 10 validation sets in Table 6. Except for the MMM task with the deep network architecture, the proposed DANN-EM outperforms the baseline DANN model by a significant margin. We also conduct the same experiments with significantly fewer labeled target domain examples for validation, using one validation set size for the SG task and another for the rest, and report the results in Table 6. We observe increased variance in test set error rates across different validation sets and increased error rates, but the proposed DANN-EM still demonstrates a significant performance improvement over the baseline DANN and DANN-SS models. (When multiple models achieve the same validation error, we report the mean of the test set error rates of those models.)
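The reporting protocol described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' evaluation script: the function names `best_by_validation` and `summarize`, the nested-list layout (one row per validation set, one entry per random seed), and the toy numbers are all hypothetical. Ties in validation error are resolved by averaging the test errors of the tied models, following the tie rule stated above.

```python
import math
import statistics


def best_by_validation(val_err, test_err):
    """Test error of the model with the lowest validation error.

    If several models tie on validation error, return the mean of
    their test errors.
    """
    best = min(val_err)
    tied = [t for v, t in zip(val_err, test_err) if v == best]
    return statistics.fmean(tied)


def summarize(val_err, test_err):
    """Aggregate results over validation sets (hypothetical layout).

    val_err[k][s]:  validation error of seed s on validation set k.
    test_err[k][s]: test error of the same model.
    """
    # Mean test error over seeds, computed separately per validation set.
    per_set_mean = [statistics.fmean(row) for row in test_err]
    # Test error of the validation-selected model, per validation set.
    per_set_best = [best_by_validation(v, t) for v, t in zip(val_err, test_err)]
    std = statistics.stdev(per_set_mean)  # spread across validation sets
    return {
        "mean": statistics.fmean(per_set_mean),
        "std": std,
        "sem": std / math.sqrt(len(per_set_mean)),
        "best": statistics.fmean(per_set_best),
    }


# Toy example: 2 validation sets x 3 seeds (made-up numbers, in percent).
toy_val = [[0.10, 0.08, 0.08], [0.09, 0.07, 0.11]]
toy_test = [[5.0, 4.0, 6.0], [5.5, 4.5, 6.5]]
summary = summarize(toy_val, toy_test)
```

With real results, `val_err` and `test_err` would each hold 10 rows of 10 entries. The sketch returns both the standard deviation and the standard error across validation sets, since the text and the caption of Table 6 name different spread statistics.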

(a) Attribute-conditioned CycleGAN
(b) Attribute-conditioned CycleGAN with unshared parameters
Figure 11: Network architecture comparison between AC-CycleGANs with shared and unshared parameters across generators and discriminators.
Figure 12: Visualization of synthesized images by photometric transformations using AC-CycleGANs with shared and unshared parameters.
Figure 13: Visualization of synthesized images by photometric transformations using AC-CycleGANs with shared and unshared parameters for web images with different yaw angles from .