In recent years, with the appearance of Convolutional Neural Networks (CNNs), many classification-based challenges have been tackled with extremely high accuracy. These powerful CNN architectures, like AlexNet [alexnet] and ResNet [resnet], are capable of efficiently extracting low-level and high-level features with the guidance of labeled data. However, because of domain shift, models trained on a specific domain perform worse when transferred to another domain. This problem is especially significant when labeled data are unavailable on the target domain. Thus, how to use the unlabeled data from the target domain to bridge the domain discrepancy is the main issue in the field of domain adaptation [survey].
*The corresponding author is Bingbing Ni.
Beginning with the work on gradient reversal [gradient_reverse], a group of domain adaptation methods based on adversarial learning have been proposed. In this subfield, a domain discriminator is introduced to judge the domain attribute on the feature or image level. In order to fool this domain discriminator, the extracted features should be domain-invariant, which is the basic motivation of adversarial domain adaptation. However, just like most variants of Generative Adversarial Networks (GANs) [gan], the domain discriminator is only guided by hard label information and rarely explores the intrinsic structure of the data distribution. Namely, each data point that shifts between the two domains should be judged by a latent soft label rather than a hard one.
In this paper, inspired by VAE-GAN [vae-gan], we develop a generative-adversarial-based framework to simultaneously train the classification network and generate auxiliary source-like images from learned embeddings of both domains. On the basis of this framework, domain mixup on the pixel and feature level is proposed to alleviate two existing drawbacks. (1) As shown in Figure 1, we instruct the domain discriminator to explore the intrinsic structure of the source and target distributions by applying a triplet loss with a flexible margin and domain classification with mixed images and corresponding soft labels (i.e., the mixup ratio), which provides abundant intermediate states between the two separate domains. (2) In order to expand the search range in the latent space, linear interpolations of source and target features are exploited. Since the subsequent nonlinear neural network can easily destroy the linearly mixed information, the extracted features of mixed images are not used for augmentation directly. This operation leads to a more continuous domain-invariant latent distribution, which benefits the performance on the target domain when the data distribution oscillates in the test phase.
We evaluate the image recognition performance of our approach on three benchmarks with different extents of domain shift. Experiments demonstrate the effectiveness of our approach, and we achieve state-of-the-art results in most settings. The contributions of our work are summarized as follows:
We design an adversarial training framework which maps both domains to a common latent distribution and efficiently transfers the knowledge learned on the supervised domain to its unsupervised counterpart.
Domain mixup on the pixel and feature level, accompanied by well-designed soft domain labels, is proposed to improve the generalization ability of models. This method strengthens the feature extractor and yields a domain discriminator that judges a sample's difference relative to the two domains with refined scores.
We extensively evaluate our approach under different settings, and it achieves superior results even when the domain shift is large and the data distribution is complex.
Domain adaptation is a frequently used technique to promote the generalization ability of models trained on a single domain in many Computer Vision tasks. In this section, we describe existing domain adaptation methods and compare our approach with them.
The transferability of deep neural networks is demonstrated in [deep_transfer_ability]. Maximum Mean Discrepancy (MMD) [kernel_two_sample, ddc] is a way to measure the similarity of two distributions. Weighted Domain Adaptation Network (WDAN) [wdan] defines the weighted MMD with class conditional distributions on both domains. The multiple kernel version of MMD (MK-MMD) is explored in [Long2015] to define the distance between two distributions. In addition, specific deep neural networks are constructed to restrict the domain-invariance of top layers by aligning the second-order statistics [deep_coral].
Adversarial Training [domain_adversarial, domain_adversarial_journal] is another way to transfer domain information. RevGrad [gradient_reverse] designs a double branched architecture for object classification and domain classification respectively. Adversarial Discriminative Domain Adaptation (ADDA) [adda] trains two feature extractors for source and target domains respectively, and produces embeddings fooling the discriminator. Other works optimize the performance on target domain by capturing complex multimode structures [multi-adversarial, cdan], exploring task-specific decision boundaries [max_discrepancy, adversarial_dropout, joint_pixel_feature], aligning the attention regions [adversarial_attention] and applying structure-aware alignment [gcan]. In addition, the label information of target domain is explored in recent works [collaborative, semantic, gcan].
Another group of methods performs adaptation by applying an adversarial loss on the pixel level. Source domain images are adapted as if they were drawn from the target domain using generative adversarial networks in [pixel_level, co-gan], and the generated samples expand the training set. Furthermore, image generation and training of the task-specific classifier are accomplished simultaneously in [deep_reconstruction, generate_to_adapt]. Cycle-consistency is also considered in [cycada] to enforce the consistency of relevant semantics.
Comparison with existing GAN-based approaches. Although former works use GANs as a means of data augmentation [pixel_level, co-gan] or to produce domain-adaptive gradient information [deep_reconstruction, generate_to_adapt], they may be trapped in the mismatch between generated data and assigned hard labels. We further explore the usage of domain mixup on the pixel and feature level to enhance the robustness of adaptation models. On the one hand, pixel-level mixup prompts the domain discriminator to uncover the intrinsic structure of the source and target distributions. On the other hand, feature-level mixup facilitates a more continuous feature distribution in the latent space with low domain shift.
Adversarial Domain Adaptation with Domain Mixup
In unsupervised domain adaptation, a source domain dataset with labeled samples and a target domain dataset with unlabeled samples are available. It is assumed that source samples $x^s$ obey the source distribution $P_s$, and target samples $x^t$ obey the target distribution $P_t$. In addition, both domains share the same label space $\mathcal{Y} = \{1, 2, \dots, N\}$, where $N$ is the number of classes.
Framework of DM-ADA
In this work, a variant of VAE-GAN [vae-gan] is applied to the domain adaptation task. Figure 2 presents an overview of the whole framework. There are three kinds of inputs: source domain images, target domain images and mixup images obtained by pixel-wise addition of source and target images. Just as in a conventional variational autoencoder [vae], an encoder maps inputs from the source and target domains to the standard Gaussian distribution. For every sample, a mean vector $\mu$ and a standard deviation vector $\sigma$ serve as the feature embedding. On the feature level, the feature embeddings of the two domains are also linearly mixed to produce mixup features. After that, the framework is split into two branches. In one branch, the embedding of the source domain is used for $N$-way object classification by the classifier. In the other branch, the source and target domains are aligned on the category level by enforcing the decoded images to be source-like and to preserve the class information of the inputs. Details are given in the following parts.
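Domain mixup on two levels. To explore the internal structure of data from the two domains, source domain images $x^s$ and target domain images $x^t$ are linearly interpolated [mixup] to produce mixup images $x^m$ and corresponding soft domain labels $l^m_{dom}$ as follows:
$$x^m = \lambda x^s + (1 - \lambda) x^t,$$
$$l^m_{dom} = \lambda l^s_{dom} + (1 - \lambda) l^t_{dom} = \lambda,$$
where $\lambda \in [0, 1]$ is the mixup ratio, and $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$, in which $\alpha$ is constantly set as 2.0 in all experiments. $l^s_{dom}$ and $l^t_{dom}$ represent the domain labels of source and target data, which are manually set as 1 and 0.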
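Inputs of the source and target domains are then embedded to $(\mu^s, \sigma^s)$ and $(\mu^t, \sigma^t)$ in the latent space by a shared encoder. In order to yield a more continuous domain-invariant latent distribution, the two domains' embeddings are linearly mixed to produce the mixup feature embedding $(\mu^m, \sigma^m)$:
$$\mu^m = \lambda \mu^s + (1 - \lambda) \mu^t,$$
$$\sigma^m = \lambda \sigma^s + (1 - \lambda) \sigma^t,$$
where $\lambda$ equals the mixup ratio used in pixel-level mixup.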
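The two mixup operations described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: images and embeddings are represented as flat lists of floats, and the function names (`sample_mixup_ratio`, `mix`, `domain_mixup`) are hypothetical rather than taken from the authors' code. The same ratio is reused on the pixel and feature level, and the soft domain label reduces to the ratio itself because the source and target domain labels are 1 and 0.

```python
import random


def sample_mixup_ratio(alpha=2.0, rng=random):
    """Draw a mixup ratio from Beta(alpha, alpha); alpha = 2.0 as stated in the text."""
    return rng.betavariate(alpha, alpha)


def mix(a, b, lam):
    """Element-wise linear interpolation lam * a + (1 - lam) * b."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def domain_mixup(x_s, x_t, mu_s, mu_t, sigma_s, sigma_t, lam):
    """Apply pixel-level and feature-level mixup with one shared ratio."""
    x_m = mix(x_s, x_t, lam)              # mixed image (pixel level)
    soft_label = lam * 1.0 + (1.0 - lam) * 0.0  # source label 1, target label 0
    mu_m = mix(mu_s, mu_t, lam)           # mixed mean vector (feature level)
    sigma_m = mix(sigma_s, sigma_t, lam)  # mixed standard deviation vector
    return x_m, soft_label, mu_m, sigma_m
```

In a real pipeline the lists would be image and embedding tensors, with one ratio drawn per iteration or per sample; which of the two the authors use is not stated here.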
Restricting the encoder with a prior. Just as in a conventional VAE [vae], the encoder is regularized by a standard Gaussian prior over the latent distribution. The objective is to narrow the Kullback-Leibler divergence between the posterior and the prior:
$$\mathcal{L}_{KL} = D_{KL}\big(\mathcal{N}(\mu, \sigma) \,\|\, \mathcal{N}(0, I)\big),$$
where $\mu$ and $\sigma$ are the encoded mean and standard deviation of source and target images.
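For a diagonal Gaussian posterior and a standard Gaussian prior, this divergence has the well-known closed form $\frac{1}{2}\sum_i (\mu_i^2 + \sigma_i^2 - 1 - \log \sigma_i^2)$. The sketch below evaluates it for one sample. The function name `kl_to_standard_normal` is hypothetical, and the assumption that the encoder outputs a per-dimension standard deviation (rather than, say, a log-variance) follows the wording of the text, not a confirmed implementation detail.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for one embedding."""
    if len(mu) != len(sigma):
        raise ValueError("mu and sigma must have the same length")
    total = 0.0
    for m, s in zip(mu, sigma):
        if s <= 0.0:
            raise ValueError("standard deviations must be positive")
        # Per-dimension term: 0.5 * (mu^2 + sigma^2 - 1 - log(sigma^2))
        total += 0.5 * (m * m + s * s - 1.0 - 2.0 * math.log(s))
    return total
```

The value is zero exactly when the embedding already matches the prior, and grows as the mean drifts from zero or the standard deviation drifts from one.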
Supervised training for classifier. The classifier is optimized with cross entropy loss defined on source domain, and the objective is as follows:
where $[\cdot, \cdot]$ denotes concatenation. It is worth noting that the classifier cannot be replaced by the object classification branch of the discriminator, since the adapted features are passed directly only to the classifier, which enhances the classifier's performance on the target domain.
Decoding latent codes. Before the generation phase, we first define the one-hot object class label and a one-dimensional uncertainty compensation for both domains and mixup features as below:
where and are on the -th position of and to indicate the known class label for both features respectively. For all features derived from target domain or mixup procedure, since the class labels remain uncertain, is set as a compensation to normalize the sum of vector and to 1. After that, decoder predicts the auxiliary generated images as below:
where is the noise vector randomly sampled from standard Gaussian distribution.
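The label construction described in this paragraph can be illustrated as follows. The sketch builds a class vector plus a one-dimensional compensation so that their entries always sum to 1. The function name and the `known_weight` argument are hypothetical: a weight of 1 corresponds to a source feature with a known class, and a missing class corresponds to a target feature. Assigning a mixup feature a fractional weight (for example the mixup ratio) on the source class is an assumption made here for illustration only.

```python
def class_label_with_compensation(num_classes, known_class=None, known_weight=1.0):
    """Return (class_vector, compensation) whose entries sum to 1."""
    if known_class is None:
        known_weight = 0.0  # nothing is known about the class
    elif not 0 <= known_class < num_classes:
        raise ValueError("known_class is out of range")
    if not 0.0 <= known_weight <= 1.0:
        raise ValueError("known_weight must lie in [0, 1]")
    label = [0.0] * num_classes
    if known_class is not None:
        label[known_class] = known_weight
    # The compensation absorbs the probability mass left uncertain.
    return label, 1.0 - known_weight
```

For a source sample of class 2 out of 4 this yields the one-hot vector and a compensation of 0; for a target sample it yields an all-zero class vector and a compensation of 1.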
Adversarial domain alignment. Compared with previous adversarial-learning-based methods [gradient_reverse, adda, multi-adversarial], we constrain domain-invariance not only on the source and target domains, but also on the intermediate representations between the two domains. The min-max optimization objectives on different domains are defined as follows:
where is the domain classification branch of . During training process, the mixup features can well be mapped to somewhere in-between source and target domain on pixel level, and it is more proper to assign them with scores between 0 and 1. Domain classification loss is utilized to guide domain discriminator output such soft scores:
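One common way to train a discriminator toward such soft scores is binary cross entropy with a soft target. The sketch below is an assumption about the form of the loss, not the paper's exact objective; the name `soft_domain_bce` and the clipping constant `eps` are hypothetical.

```python
import math


def soft_domain_bce(score, soft_label, eps=1e-12):
    """Binary cross entropy between a discriminator score and a soft domain label.

    score:      discriminator output in [0, 1] (1 = source, 0 = target)
    soft_label: target value in [0, 1], e.g. the mixup ratio of a mixed image
    """
    if not 0.0 <= soft_label <= 1.0:
        raise ValueError("soft_label must lie in [0, 1]")
    # Clip the score to avoid log(0).
    p = min(max(score, eps), 1.0 - eps)
    return -(soft_label * math.log(p) + (1.0 - soft_label) * math.log(1.0 - p))
```

For a fixed soft label, this loss is smallest when the score equals the label, so a mixed image with ratio 0.7 pulls the discriminator output toward 0.7 instead of toward either hard label.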
We further introduce a triplet loss to constrain mixup samples' distance to the source and target domains, which makes the domain discriminator converge more easily:
where is the feature extractor of , and
denotes the hinge loss function;, when , and , otherwise. Considering that samples with more source or target domain components should have larger difference with the counterpart domain, a flexible margin is used.
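The idea of a triplet loss with a flexible margin can be sketched as follows. The mixed sample acts as the anchor; the domain that contributes more to it is the positive and the other domain is the negative. The margin function used here, $|2\lambda - 1|$, is an assumption chosen only because it grows as the mixed sample moves toward either domain; the exact margin in the paper is not recoverable from this text. All function names are hypothetical, and features are plain lists of floats.

```python
import math


def hinge(a):
    """Hinge function [a]_+ = max(a, 0)."""
    return max(a, 0.0)


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def flexible_margin(lam):
    """Assumed margin: zero at lam = 0.5, largest when lam is near 0 or 1."""
    return abs(2.0 * lam - 1.0)


def mixup_triplet_loss(f_m, f_s, f_t, lam):
    """Triplet loss on discriminator features of mixed, source and target samples."""
    d_source = euclidean(f_m, f_s)
    d_target = euclidean(f_m, f_t)
    margin = flexible_margin(lam)
    if lam >= 0.5:
        # Mostly source content: stay closer to source than to target.
        return hinge(d_source - d_target + margin)
    # Mostly target content: stay closer to target than to source.
    return hinge(d_target - d_source + margin)
```

With source feature `[0.0]`, target feature `[4.0]` and ratio 0.75, a mixed feature at `[1.0]` incurs no loss, while one at `[3.0]` is penalized because it sits nearer the target.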
Category-level domain alignment. In order to ensure the identical categories’ features of two domains are mapped nearby in the latent space, classification loss and are introduced to ensure the class-consistency between decoded images and inputs:
where is the object classification branch of , and
is the pseudo label estimated by the classifier. To eliminate falsely labeled samples that harm domain adaptation, we filter out samples whose classification confidence is below a certain threshold. Considering that the domain discrepancy is gradually reduced during training, the threshold is adaptively adjusted following the strategy in [collaborative].
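The confidence-based filtering of pseudo labels can be illustrated with a short sketch. It assumes the classifier output for each target sample is a list of class probabilities, and it takes the threshold as a plain argument; the adaptive threshold schedule of [collaborative] is not reproduced here. The function name `select_pseudo_labels` is hypothetical.

```python
def select_pseudo_labels(probabilities, threshold):
    """Keep target samples whose top class probability reaches the threshold.

    Returns a list of (sample_index, pseudo_label) pairs.
    """
    selected = []
    for index, probs in enumerate(probabilities):
        confidence = max(probs)
        if confidence >= threshold:
            # The pseudo label is the most probable class.
            selected.append((index, probs.index(confidence)))
    return selected
```

Only the selected pairs would then contribute to the classification loss on target-derived generated images; samples below the threshold are skipped in that iteration.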
The proposed iterative training procedure is summarized in Algorithm 1. In each iteration, the input source and target samples are first mixed on the pixel level to instruct the domain discriminator to output soft labels. After the samples of the two domains are mapped to the latent space, their embeddings are mixed to produce mixup features. The images generated from these feature embeddings are constrained to be source-like and to preserve the inputs' class information, so that the latent distribution is encouraged to be domain-invariant and discriminative. In all experiments, we set the Beta distribution parameter $\alpha$ of the mixup ratio to 2.0, since domain mixup cannot effectively explore the linear space between the two domains when the value of $\alpha$ is small; more analysis of $\alpha$ can be found in the supplementary material. The loss weights are hyper-parameters that trade off among losses with different orders of magnitude. According to the sensitivity analysis presented later, the adaptation performance of our approach is not very sensitive to their values, and these hyper-parameters share the same values across different tasks.
Pixel-level domain mixup. The work of [mixup] proposes the mixup vicinal distribution as a way to encourage the model to behave linearly in-between training examples. Another work [autoencoder_interpolation] improves the continuity of interpolation in latent space and benefits downstream tasks. In adversarial domain adaptation, we would also like to lead the domain discriminator to behave linearly between the source and target domains. As a result, the domain discriminator gains the capacity to accurately judge generated images that oscillate between the two domains. In our implementation, such a discriminator is trained with pairs of a linearly mixed image and its corresponding soft label, where the mixed image simulates an oscillation mode between the two domains and the soft label provides the guidance. Combined with feature-level mixup, pixel-level mixup can further narrow the domain discrepancy, which is shown in the ablation study below.
Feature-level domain mixup. Existing works attempt to map the source and target domains to a common latent distribution, but limited data cannot guarantee that most parts of the latent space are domain-invariant. In order to yield a more continuous domain-invariant latent distribution, the mixup features of the two domains are exploited.
We use an intuitive example to illustrate the effectiveness of domain continuity on aligning source and target domains. As shown in Figure 3, the biased test sample may be misclassified without the constraint of domain continuity. However, through adding the mixup feature embedding to the training process, the latent codes between the same class of two domains should also be domain-invariant, which forms the intra-class clusters and . Thus the decision boundary is refined, and the biased samples in these clusters can be classified correctly.
In this section, we first introduce the experimental setup. Then, the classification performance on three domain adaptation benchmarks is presented. Finally, an ablation study and a sensitivity analysis are conducted for the proposed approach.
| MN → US (p) | MN → US (f) | US → MN | SV → MN |
| Method | A → W | D → W | W → D | A → D | D → A | W → A | Average |
| AlexNet (source only) [alexnet] | 68.8 |
In this part, we describe the network architectures and hyper-parameters of different tasks. Our approach is implemented with the PyTorch deep learning framework [pytorch].
Digits experiments. In this part of the experiments, we construct four subnetworks with train-from-scratch architectures following [generate_to_adapt]. Four Adam optimizers with a base learning rate of 0.0004 are utilized to optimize these submodels for 100 epochs. The two loss-weighting hyper-parameters are set to 0.1 and 0.01 respectively, and their values are constant in all experiments. All input images of the encoder and discriminator are resized to the same resolution.
Office experiments. For the encoder, the last layer of AlexNet [alexnet] is replaced with two parallel fully connected layers, each producing a 256-dimensional vector, and the former layers are initialized with the model pretrained on ImageNet [imagenet]. The encoder is fine-tuned with a base learning rate of 0.0001 for 100 epochs, and the base learning rate of the other three submodels is set to 0.001. The inputs of the encoder and discriminator are resized to their respective input resolutions.
VisDA experiments. ResNet-101 [resnet] serves as the base architecture, and it is initialized with the model pretrained on ImageNet [imagenet]. The learning rate setting is the same as that in the Office experiments, and the results are reported after 20 epochs of training. The inputs of the encoder and discriminator are resized to their respective input resolutions.
Classification on Digits Datasets
Dataset. In this set of experiments, three digits datasets are used: MNIST [mnist], USPS [usps] and Street View House Numbers (SVHN) [svhn]. Each dataset contains ten classes corresponding to the digits 0 to 9. Four settings are used for measurement: MN → US (p): sampling 2000 images from MNIST and 1800 images from USPS; MN → US (f) and US → MN: using the full training sets of MNIST and USPS; SV → MN: using the full training sets of SVHN and MNIST.
Results. Table 1 presents the results of our approach in comparison with other adaptation approaches on the digits datasets. For the source only test, we use the same encoder and classifier architectures as the ones used in our approach. The reported results are averaged over five independent runs with random initialization. Our approach achieves state-of-the-art performance on all four settings. In particular, it outperforms former GAN-based approaches [pixel_level, deep_reconstruction, generate_to_adapt], which illustrates the effectiveness of the proposed architecture in aligning the source and target domains.
| ResNet-101 (source only) [resnet] | 52.4 |
Classification on Office-31
Dataset. Office-31 [office] is a standard domain adaptation benchmark commonly used in previous research. Three distinct domains, Amazon (A), Webcam (W) and DSLR (D), make up the whole Office-31 dataset. Each domain contains the same 31 classes of office supplies. All transfer tasks among the three domains are used for evaluation.
Results. Table 2 reports the performance of our method compared with other works. The results of AlexNet trained with only source domain data serve as the lower bound. Our approach obtains the best performance in three of the four hard cases: A → W, W → A and A → D. For the two easier cases, W → D and D → W, our approach achieves accuracy higher than 99.5% and ranks among the top two. Given that the number of samples per class is limited in the Office-31 dataset, our approach manages to improve the performance by providing augmented samples and features.
Classification on VisDA-2017
Dataset. The VisDA-2017 [visda] challenge proposes a large-scale dataset for visual domain adaptation. The training domain is composed of synthetic renderings of 3D models. The validation domain is made up of photo-realistic images drawn from MSCOCO [coco]. Both domains contain the same 12 classes of objects.
Results. Table 3 reports the results on the VisDA-2017 cross-domain classification dataset. The ResNet-101 model pretrained on ImageNet acts as the baseline. Our approach achieves the highest accuracy among all adaptation approaches, and exceeds the baseline by a large margin. When a large domain shift exists, such as transferring from synthetic objects to real images in this task, we think that the triplet loss and soft labels play a critical role in uncovering intermediate states between the two domains.
Metrics. Two metrics are employed. (1) $\mathcal{A}$-distance [A_distance, A_distance_2] serves as a measure of cross-domain discrepancy. Given the extracted features of the two domains, an SVM classifier is trained to distinguish source domain features from target domain features, and its generalization error is denoted as $\epsilon$. Then the $\mathcal{A}$-distance can be calculated as $d_{\mathcal{A}} = 2(1 - 2\epsilon)$. (2) Classification accuracy on the target domain serves as a measure of task-specific performance. In this part of the experiments, both metrics are evaluated on the task A → W.
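The first metric is a simple function of the domain classifier's error, as the following sketch shows. It uses the standard proxy form $2(1 - 2\epsilon)$ and takes the held-out predictions of any domain classifier as input; training the SVM itself is outside the scope of this sketch, and both function names are hypothetical.

```python
def domain_classifier_error(predicted, actual):
    """Fraction of held-out samples whose domain was predicted incorrectly."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)


def proxy_a_distance(generalization_error):
    """Proxy A-distance 2 * (1 - 2 * error) from a domain classifier's error."""
    if not 0.0 <= generalization_error <= 1.0:
        raise ValueError("error must lie in [0, 1]")
    return 2.0 * (1.0 - 2.0 * generalization_error)
```

An error of 0.5 (the classifier cannot tell the domains apart) gives a distance of 0, while an error of 0 (perfectly separable domains) gives the maximum distance of 2, so smaller values indicate better aligned features.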
Effect of pixel-level and feature-level mixup. Table 4 examines the effectiveness of pixel-level mixup (PM) and feature-level mixup (FM). The first row only uses the images and feature embeddings from the two domains for training, and it serves as the baseline. In the fourth row, feature-level mixup achieves notable improvement compared with the baseline, since the domain-invariant latent space is encouraged to be more continuous in this configuration. In the fifth row, pixel-level mixup further enhances the model's performance by guiding the discriminator to output soft scores between 0 and 1, which means it is an essential auxiliary scheme for feature-level mixup. In Figure 5, compared with the traditional 0/1 discriminator, our discriminator leads to more source-like generated images, which means the domain discrepancy can be further narrowed via pixel-level mixup.
Effect of triplet loss. In Table 4, we evaluate another key component, i.e., the triplet loss (Tri). In the third and sixth rows, it can also be observed that the model's performance is improved after adding the triplet loss to the discriminator's training process, since this loss eases the convergence of the domain discriminator. We further utilize t-SNE [tsne] to visualize the feature distribution of the target domain on the task SV → MN. As shown in Figure 4, the features of different classes are separated most clearly in the full model, i.e., with domain mixup on two levels and the triplet loss.
Effect of the discriminator's classification branch and pseudo target labels. In order to conduct category-aware alignment between the source and target domains, the classification branch of the discriminator and pseudo target labels are employed, and their effectiveness is examined in Table 5. After appending this branch, classification accuracy increases by 1.7%, since it encourages generated images to preserve the class information contained in the inputs, which makes domain adaptation operate on the same categories of the two domains. On this basis, pseudo target labels introduce the discriminative information of the target domain to the adaptation process and bring the model's performance to the state of the art.
In this section, we discuss our approach's sensitivity to the two hyper-parameters that trade off among losses with different orders of magnitude. Four hard-to-transfer tasks of the Office-31 dataset are used for evaluation. In Figure 6, it can be observed that the transfer performance is not sensitive to variations of the two hyper-parameters near 0.1 and 0.01, respectively. Consequently, we can set them to 0.1 and 0.01 for all tasks, and the transfer performance should be satisfactory.
In this paper, we address the problem of unsupervised domain adaptation. A GAN-based architecture is constructed to transfer knowledge from the source domain to the target domain. In order to facilitate a more continuous domain-invariant latent space and fully utilize the inter-domain information, we propose domain mixup on the pixel and feature level. Extensive experiments on adaptation tasks with different extents of domain shift and data complexity demonstrate the superior performance of our approach.
This work was supported by National Science Foundation of China (61976137, U1611461). This work was also supported by SJTU-BIGO Joint Research Fund, and CCF-Tencent Open Fund.