Unsupervised Domain Adaptation: An Adaptive Feature Norm Approach

11/19/2018 · by Ruijia Xu, et al. · Sun Yat-sen University

Unsupervised domain adaptation aims to mitigate the domain shift when transferring knowledge from a supervised source domain to an unsupervised target domain. Adversarial feature alignment has been successfully explored to minimize the domain discrepancy. However, existing methods usually struggle to optimize mixed learning objectives and are vulnerable to negative transfer when the two domains do not share an identical label space. In this paper, we empirically reveal that the erratic discrimination of the target domain is mainly reflected in its much lower feature norm with respect to that of the source domain. We present a non-parametric Adaptive Feature Norm (AFN) approach, which is independent of the association between the label spaces of the two domains. We demonstrate that adapting the feature norms of the source and target domains to reach equilibrium over a large range of values can result in significant domain transfer gains. Without bells and whistles but a few lines of code, our method largely lifts the discrimination of the target domain (23.7% over Source Only on VisDA2017) and achieves the new state of the art under the vanilla setting. Furthermore, as our approach does not require deliberately aligning the feature distributions, it is robust to negative transfer and outperforms the existing approaches under the partial setting by an extremely large margin (9.8% on Office-Home and 14.1% on VisDA2017). Code is available at https://github.com/jihanyang/AFN. We are responsible for the reproducibility of our method.


1 Introduction

Unsupervised Domain Adaptation (DA) enables transferring knowledge from a label-rich source domain to a related but different target domain without any manual annotations. The core challenge of DA is to tackle the domain shift caused by the different characteristics of the two distributions. Vanilla DA assumes that the source and target domains share an identical label space. Emerging literature diverts attention to relaxing this constraint; one such scenario is well known as partial DA [5, 4]. However, it is not trivial to directly migrate existing methodology from the vanilla setting, since it is vulnerable to the negative transfer effect in the disjoint label space.

Figure 1: Feature visualization of source and target samples from VisDA2017 based on the Source Only model. This visualization technique is widely used in the field of face recognition [34, 17] to characterize the feature embedding under a softmax-related objective. Unlike t-SNE [23], in which the size of empty space does not reflect the similarity between two clusters, this visualization map lets us interpret the inter-class and intra-class variances. As illustrated, target samples tend to collide in the low-norm (i.e., small-radius) regions and are vulnerable to misclassification due to small angular variations. Best viewed in color.

Extensive literature builds on the theoretical analysis of the target-domain generalization upper bound introduced by [1], which states that the expected target risk is bounded by three terms: (i) the expected error on source domain samples; (ii) the statistical distance between the two domains; (iii) a constant that depends on the hypothesis complexity, the sample sizes and the expected error of the two domains under the ideal labeling function. Existing works rarely explore the first term, since the source domain is supervised, or the third term, which is considered sufficiently low, as otherwise domain adaptation cannot succeed [2]. The optimization of the upper bound thus focuses on minimizing the statistical distance, which takes two sets of observations as input and decides whether they are generated from the same distribution.

There exists a wide range of alternative variants for the instantiation of the statistical distance [1, 19, 28, 7, 8]. [1] theoretically introduces the H-divergence and HΔH-divergence for non-conservative DA. To the best of our knowledge, [8] is the first work to empirically measure the H-divergence by a parametric domain discriminator and adversarially align the features via reverse gradient backpropagation. [31] further proposes an asymmetric adversarial learning mechanism with a GAN-based loss, and [3] presents Maximum Mean Discrepancy (MMD) for biological data integration, which is used to align kernel embeddings of deep features by [19, 21, 32]. Although these groundbreaking efforts have brought great progress to the field of DA, they still suffer from the following dilemmas that should not be overlooked: i) A common strategy for DA is to mix and optimize multiple learning objectives, and it is not always easy to reach an optimal solution. Besides, the optimization of adversarial learning is not fully developed and struggles to learn a diverse transformation between different distributions [37], which limits its potential on large-scale and hard transfer tasks. ii) Existing literature presents solid model-perspective demonstrations or proofs of the efficiency of transfer algorithms, but sheds little light on feature-perspective analysis. Feature maps and embeddings are continually updated matrices, just like model weights; insights from this perspective facilitate the design of data-driven transfer algorithms. iii) As existing methods pursue transferable and indistinguishable feature representations by aligning the two domains, they are vulnerable to negative transfer when the source and target domains do not share an identical label space.
To overcome these issues, we step forward to unveil the mystery of the model degradation on the target domain from the perspective of feature norm. Our insights are motivated by the following intuitions: i) As illustrated in Fig. 1, target samples tend to collide in the low-norm regions of the Source Only model, which results in erratic discrimination near the decision boundary. ii) The smaller-norm-less-informative assumption lays the foundation for several works in the field of model compression, suggesting that parameters or features with smaller norms play a less informative role during inference [35]. For unsupervised DA, the lack of informative annotations in the target domain may result in lower feature norms. Like two sides of a coin, in contrast to model compression that prunes unnecessary computational elements and paths, we impose large norms upon the task-specific feature embeddings to facilitate more transferable computation. iii) Another research direction of domain transfer is to explore advanced network architectures instead of cross-domain adaptation. For example, ResNet-101 [13] generalizes better than ResNet-50. More stacked residual blocks in conv4 (i.e., layer3 in the PyTorch implementation) facilitate not only gradient norm preservation in the backward path [36] but also direct information flow in the forward path [14], which is partially reflected in the larger-norm feature maps output by conv4. Hopefully a top-down large-norm constraint can be exploited to simulate this bottom-up layer-by-layer stacking effect.

Specifically, we present a non-parametric Adaptive Feature Norm (AFN) approach, which is independent of the association between the label spaces of the two domains and is thus robust to the negative transfer effect. We first introduce an algorithm called Hard Adaptive Feature Norm (HAFN), which applies the ℓ2-norm function to the feature embedding encoded by the deep neural network and aligns the corresponding expectations of the source and target domains to a shared large scalar. Despite its simplicity, HAFN, combined with the source domain classification loss, largely promotes the domain transfer gain. Furthermore, we perform the feature norm adaptation in a self-incremental manner for each individual example and introduce an improved version called Instance Adaptive Feature Norm (IAFN). IAFN achieves more transferable feature embeddings, benefiting from a controllable feature norm loss term and a larger range of norm scalars. Finally, we prove that standard Dropout [30] is ℓ1-preserved and modify it to preserve the ℓ2 feature norm to meet our goal. We summarize our contributions as follows:

i) We present a novel learning paradigm for unsupervised DA from the perspective of adaptive feature norm. Our approach is data-driven, fairly easy to optimize and can be implemented in a few lines of code in current deep learning frameworks.

ii) Despite its simplicity, our method has appealing universality and achieves the new state of the art under the vanilla setting on a wide range of visual domain adaptation benchmarks (VisDA2017, Office-31 and ImageCLEF-DA).

iii) Our model avoids explicit feature distribution alignment and is thus robust to negative transfer, surpassing the state of the art under the partial setting by an extremely large margin (9.8% on Office-Home, 14.1% on VisDA2017).

2 Related Work

We highlight the works most relevant to our paper. The theoretical analysis of the upper bound of target-domain generalization in [1] serves as a cornerstone for the majority of domain adaptation algorithms. Existing methods mainly explore transferable structure by minimizing the statistical distance between the two domains and can be organized into different categories by varied criteria. From the perspective of statistical distance, Maximum Mean Discrepancy (MMD) [3] based methods [19, 21, 32] obtain a transferable classifier by minimizing the MMD of kernel embeddings from both the source and target domains. [1] introduces the H- and HΔH-divergence for two distributions, which are embedded into deep models by [7, 8] and [28] respectively. From the view of methodology, the kernel trick is usually employed in MMD based methods [19, 21] to obtain a rich yet restrictive function class. Recently, inspired by the great progress of GANs [11], adversarial learning has been successfully explored to entangle feature distributions from different domains. [7, 8] add a parametric subnetwork as a domain discriminator and adversarially align the features via reverse gradient backpropagation. Tzeng et al. [31] instead facilitate the adversarial alignment with a GAN-based loss in an asymmetric manner. [28] performs a minimax game between the feature generator and two-branch classifiers to minimize the HΔH-divergence. Beyond minimizing statistical distance, other methods are introduced to better fit the target-specific structure. [27] applies tri-training to assign pseudo labels and obtain target-discriminative representations. [9] proposes to add a target data reconstruction objective and [29] refines the decision boundary based on the cluster assumption.

Vanilla DA assumes that the source and target domains share an identical label space. [4, 5] further opened up the partial setting, where the source label space subsumes the target label space, in line with the emerging demand to transfer models from an off-the-shelf large domain (e.g., ImageNet [6]) to an unknown small domain. However, it is not trivial to directly migrate existing methods from the vanilla setting, as they may suffer from the notorious negative transfer. Cao et al. [5] attempted to alleviate this issue by down-weighting samples in the disjoint label space and promoting positive transfer via adversarial feature alignment in the shared label space. Label Efficient Learning [22] can be customized for the partial setting by extending entropy minimization over the cross-domain pairwise similarity.

Figure 2: The overall framework of the proposed Adaptive Feature Norm (AFN) approach. The backbone network G refers to the general feature extraction module. F is employed as the task-specific classifier with l layers, each of which is organized in the FC-BN-ReLU-Dropout order. In particular, we denote the first l−1 layers of F as the bottleneck Ff, which is considered to generate the task-specific feature embedding. During each iteration, we apply feature norm adaptation upon the bottleneck embedding along with the initial source classification loss as our supervised target. For the Hard version, the mean feature norms of source and target samples are adapted to a shared large scalar R. For the Instance version, we enlarge the feature norm with respect to each individual example by an incremental residual scalar Δr. Away from the low-norm regions after the adaptation, target examples can be correctly classified without any supervision. Best viewed in color.

3 Method

Preliminaries We are given a source domain Ds = {(x_i^s, y_i^s)} of n_s labeled samples attached to |Cs| categories and a target domain Dt = {x_j^t} of n_t unlabeled samples attached to |Ct| categories. Domain adaptation occurs when the inherent distributions p and q corresponding to the source and target domains in the shared label space are different but similar enough [2] for the transfer to make sense. Unsupervised DA considers the scenario in which we have no access to any labeled target examples.

Vanilla Setting refers to the standard unsupervised domain adaptation that has been extensively explored. In this setting, the source and target domains share the identical label space, i.e., Cs = Ct.

Partial Setting refers to partial domain adaptation, where the source label space subsumes the target label space, i.e., Cs ⊃ Ct. Adversarial learning based methods designed for the vanilla setting are vulnerable to the negative transfer effect in the disjoint label space Cs \ Ct.

Our method is independent of the association between the label spaces of the two domains. Following the definition of MMD, and since the norm is a nonnegative-valued scalar function, we instantiate the function class H as the ℓ2-norm composed with the deep network and define the Maximum Mean Feature Norm Discrepancy (MMFND) between the source and target domains as follows:

\mathrm{MMFND}[\mathcal{H}, X_s, X_t] := \sup_{h \in \mathcal{H}} \Big( \frac{1}{n_s} \sum_{x_i \in X_s} h(x_i) - \frac{1}{n_t} \sum_{x_j \in X_t} h(x_j) \Big) \quad (1)

where H = {h : h = ‖·‖₂ ∘ Ff ∘ G}, i.e., h(x) = ‖Ff(G(x))‖₂. G and F correspond to the backbone network and the classifier in our framework. G is regarded as a general feature extraction module inherited from a prevailing neural network architecture such as ResNet [13]. F represents a task-specific classifier with l fully-connected layers, and F_i denotes the i-th layer operation of F. Following [5, 21], we call the first l−1 layers of the classifier the bottleneck and denote it as Ff. We consider Ff ∘ G as generating the task-specific feature embeddings. Based on the vector of logits computed by F, we further calculate the class probabilities by applying the softmax function and denote the result as p(y|x).
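With the function class fixed to the ℓ2-norm of the bottleneck embedding, the supremum above reduces to the gap between the two mean feature norms. The following NumPy sketch (our own illustration with our own names, not the released code; in practice h(x) would be computed as the norm of Ff(G(x)) by the network) estimates the empirical MMFND from two batches of embeddings:

```python
import numpy as np

def mmfnd(src_feat, tgt_feat):
    # With H fixed to the l2-norm of the embedding, the supremum in
    # Eq. (1) reduces to the gap between the two mean feature norms.
    norm_s = np.linalg.norm(src_feat, axis=1).mean()
    norm_t = np.linalg.norm(tgt_feat, axis=1).mean()
    return abs(norm_s - norm_t)
```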

Inspired by the great success of Generative Adversarial Networks (GANs) [11], we could optimize this upper bound in a two-player adversarial manner. Specifically, replacing the sup in Equation (1) with a max operator over the classifier and applying a min operator with respect to the backbone, we obtain

\min_{G} \max_{Ff} \Big( \frac{1}{n_s} \sum_{x_i \in X_s} h(x_i) - \frac{1}{n_t} \sum_{x_j \in X_t} h(x_j) \Big) \quad (2)
0:  Source domain X_s. Target domain X_t. Backbone network G. Classifier F.
1:  for k = 1 to iter do
2:     Randomly sample m examples x^s, x^t from X_s and X_t respectively;
3:     Calculate the classification loss L_y on the source batch by Equation (4);
4:     Calculate the feature norm loss L_norm on both batches by Equation (3) with radius R;
5:     Calculate L = L_y + λ · L_norm;
6:     Backward L;
7:     Update G and F.
8:  end for
9:  return G and F
Algorithm 1 Mini-batch Learning Algorithm for HAFN. m, R, iter and λ denote the batch size, equilibrium radius, number of training iterations and the balance factor respectively.

However, this process lacks explicit interpretability for the adversarial behavior and may lead to a random walk in the feature space in our case, failing to adapt the source and target samples at the semantic level.
Hard Adaptive Feature Norm (HAFN) Recall that the ℓ2-norm of a vector can be interpreted as the radius from the hypersphere origin to the vector point. We can therefore construct an equilibrium at a shared large radius R to bridge the gap between the source and target domains and obtain the feature norm objective as follows:

L_{norm} = L_d\Big( \frac{1}{n_s} \sum_{x_i \in X_s} h(x_i),\, R \Big) + L_d\Big( \frac{1}{n_t} \sum_{x_j \in X_t} h(x_j),\, R \Big) \quad (3)

where L_d(·,·) denotes the ℓ2-distance.

Equation (3) restricts the mean feature norms of the source and target domains to converge to the shared radius R, and thus minimizes the MMFND. In contrast to adversarial feature alignment, which is usually unstable and lacks fully developed knowledge to guide the achievement of a satisfying equilibrium state, our objective is easy to optimize given the explicit intermediate variable R. Complemented with the supervised source-domain classification loss

L_y = -\frac{1}{n_s} \sum_{(x_i, y_i) \in D_s} \log p(y_i \mid x_i) \quad (4)

we finally obtain the learning objective as follows:

\min_{G, F}\; L_y + \lambda \, L_{norm} \quad (5)

λ is a factor that balances the two objectives. Although aligning the mean norms of the source and target feature embeddings appears to be a restrictive but not fully sufficient condition for successful domain transfer, HAFN shows that by satisfying this constraint we can mitigate the domain shift to a great extent and achieve comparable or even superior performance against the state of the art. The pipeline of mini-batch learning for HAFN is presented in Algorithm 1.
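As a concrete illustration, the HAFN objective of Equations (3)–(5) can be sketched in a few lines of NumPy. This is a framework-agnostic mock-up under our own names and illustrative hyper-parameter values, not the released PyTorch code:

```python
import numpy as np

def hafn_loss(src_feat, tgt_feat, src_labels, src_logits, R=25.0, lam=0.05):
    """HAFN objective: source cross-entropy plus the hard norm penalty.

    src_feat / tgt_feat: bottleneck embeddings, shape (m, d).
    src_logits: classifier outputs for the source batch, (m, num_classes).
    R, lam: equilibrium radius and balance factor (illustrative values).
    """
    # Eq. (3): l2-distance between each domain's mean feature norm and R.
    norm_s = np.linalg.norm(src_feat, axis=1).mean()
    norm_t = np.linalg.norm(tgt_feat, axis=1).mean()
    l_norm = (norm_s - R) ** 2 + (norm_t - R) ** 2
    # Eq. (4): cross-entropy on the labeled source samples.
    z = src_logits - src_logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    l_cls = -np.log(p[np.arange(len(src_labels)), src_labels] + 1e-12).mean()
    # Eq. (5): combined objective.
    return l_cls + lam * l_norm
```

When both batch-mean norms already sit at the radius R, the norm penalty vanishes and only the source classification loss remains.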
Instance Adaptive Feature Norm (IAFN) In HAFN, since a large R is preferred for a more transferable classifier, the gradients of the second term tend to dominate the backpropagation in the early training phase. Empirically, we can incorporate the target samples after semantically classifying the source domain. Beyond that, we further introduce an advanced algorithm called Instance Adaptive Feature Norm (IAFN). Specifically, we modify Equation (3) as follows:

L_{norm} = \frac{1}{n_s + n_t} \sum_{x \in X_s \cup X_t} L_d\big( h(x; \theta_0) + \Delta r,\; h(x; \theta) \big) \quad (6)

where θ₀ denotes the network parameters of the previous iteration, treated as constants, and Δr represents the residual feature norm. Observe that in IAFN we enlarge the feature norm with respect to individual examples, taking the variations among different mini-batches into consideration. Finally, we obtain the learning objective for the IAFN algorithm as follows:

\min_{G, F}\; L_y + \lambda \, L_{norm} \quad (7)

with L_{norm} given by Equation (6).
Figure 3: Sample images respectively selected from VisDA2017, Office-Home, Office-31 and ImageCLEF-DA datasets.
0:  Source domain X_s. Target domain X_t. Backbone network G. Classifier F.
1:  for k = 1 to iter do
2:     Randomly sample m examples x^s, x^t from X_s and X_t respectively;
3:     Calculate h(x^s; θ₀) and h(x^t; θ₀) with the previous-iteration parameters;
4:     Calculate h(x^s; θ) and h(x^t; θ) with the current parameters;
5:     Calculate the classification loss L_y on the source batch by Equation (4);
6:     Calculate the feature norm loss L_norm by Equation (6) with residual Δr;
7:     Calculate L = L_y + λ · L_norm;
8:     Backward L;
9:     Update G and F.
10:  end for
11:  return G and F
Algorithm 2 Mini-batch Learning Algorithm for IAFN. m, iter, Δr and λ denote the batch size, number of training iterations, residual feature norm and the balance factor respectively.

It is noteworthy that we can terminate the learning process via early stopping, empirically before the norm objective comes to dominate the classification error. Although IAFN does not explicitly align the feature norm expectations across the two distributions, its self-incremental manner shows appealing superiority over HAFN in the two-objective optimization, since the magnitude of the second loss term is controllable through the residual feature norm Δr. Besides, IAFN considers variations at the instance level, avoiding the limited size and varied statistics of mini-batch estimates. These properties enable IAFN to obtain more transferable feature embeddings by producing larger feature norms. The pipeline of mini-batch learning for IAFN is presented in Algorithm 2.
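The per-instance norm term of Equation (6) admits an equally small sketch. Again this is a NumPy mock-up with our own names; in the real implementation the previous-iteration norms are detached from the gradient graph:

```python
import numpy as np

def iafn_norm_loss(feat, feat_prev, delta_r=1.0):
    """IAFN norm term: push each example's current norm toward its
    previous-iteration norm plus the residual delta_r (Eq. 6 sketch)."""
    target = np.linalg.norm(feat_prev, axis=1) + delta_r  # h(x; theta_0) + dr
    current = np.linalg.norm(feat, axis=1)                # h(x; theta)
    return ((current - target) ** 2).mean()
```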

Remarks The equilibrium radius R in HAFN cannot be infinitely large, since the gradient produced by the feature norm objective in this hard setting may eventually explode. Likewise, the learning process in IAFN cannot be endless, as the expected source-domain risk in the upper bound must not be ignored.

ℓ2-preserved Dropout In this part, we first prove that standard dropout is ℓ1-preserved and then modify it to preserve the ℓ2 feature norm. Dropout is a widely used regularization technique for deep neural networks [30, 15]. Given a d-dimensional input x, at the training phase we randomly zero each element x_k with probability p via samples a_k drawn from the Bernoulli distribution:

\hat{x}_k = a_k x_k, \quad a_k \sim \mathrm{Bernoulli}(1 - p) \quad (8)

To compute an identity function in expectation at evaluation time, the outputs are further scaled by a factor of 1/(1−p), giving

\hat{x}_k = \frac{1}{1 - p}\, a_k x_k \quad (9)

which implicitly preserves the ℓ1-norm of the outputs between the training and testing phases, since a_k and x_k are independent:

\mathbb{E}[|\hat{x}_k|] = \frac{1}{1 - p}\, \mathbb{E}[a_k]\, |x_k| = |x_k| \quad (10)

Because we are in pursuit of an adaptive ℓ2 feature norm, we instead scale the output by a factor of 1/\sqrt{1-p} and obtain

\hat{x}_k = \frac{1}{\sqrt{1 - p}}\, a_k x_k \quad (11)

which satisfies

\mathbb{E}[\hat{x}_k^2] = \frac{1}{1 - p}\, \mathbb{E}[a_k^2]\, x_k^2 = x_k^2 \quad (12)
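A minimal NumPy sketch of the modified dropout (our own illustration; the paper's version lives inside the network layers) together with a Monte Carlo check of Equation (12):

```python
import numpy as np

def l2_preserved_dropout(x, p=0.5, rng=None):
    """Dropout scaled by 1/sqrt(1-p) so that E[x_hat_k^2] = x_k^2 (Eq. 11-12)."""
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) > p          # a_k ~ Bernoulli(1 - p)
    return mask * x / np.sqrt(1.0 - p)

# Monte Carlo check: the squared entries (hence the l2-norm) are preserved
# in expectation, unlike the standard 1/(1-p) scaling, which preserves the
# l1-norm instead.
rng = np.random.default_rng(0)
x = np.ones(4)
trials = np.stack([l2_preserved_dropout(x, 0.5, rng) ** 2 for _ in range(20000)])
assert np.allclose(trials.mean(axis=0), x ** 2, atol=0.05)
```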

4 Experiment

4.1 Setup

We conduct extensive experiments on four widely-used visual DA benchmarks, as shown in Fig. 3. We compare our approach against the state-of-the-art deep learning methods.

Method plane bcycl bus car horse knife mcycl person plant sktbrd train truck Avg Overall
Source Only [13] 55.1 53.3 61.9 59.1 80.6 17.9 79.7 31.2 81.0 26.5 73.5 8.5 52.4 -
MMD [19] 87.1 63.0 76.5 42.0 90.3 42.9 85.9 53.1 49.7 36.3 85.8 20.7 61.1 -
DANN [8] 81.9 77.7 82.8 44.3 81.2 29.5 65.1 28.6 51.9 54.6 82.8 7.8 57.4 -
MCD [28] 87.0 60.9 83.7 64.0 88.9 79.6 84.7 76.9 88.6 40.3 83.0 25.8 71.9 -
HAFN 92.7±0.7 55.4±4.1 82.4±2.6 70.9±1.2 93.2±0.9 71.2±3.9 90.8±0.5 78.2±1.3 89.1±0.7 50.2±2.4 88.9±0.8 24.5±0.5 73.9 74.2
IAFN 93.6±0.2 61.3±4.0 84.1±0.5 70.6±2.2 94.1±0.5 79.0±4.1 91.8±0.5 79.6±1.3 89.9±0.7 55.6±3.4 89.0±0.3 24.4±2.9 76.1 75.6
Table 1: Accuracy (%) on VisDA2017 under vanilla setting (ResNet-101)
Method A→W D→W W→D A→D D→A W→A Avg
ResNet [13] 68.4±0.2 96.7±0.1 99.3±0.1 68.9±0.2 62.5±0.3 60.7±0.3 76.1
TCA [24] 72.7±0.0 96.7±0.0 99.6±0.0 74.1±0.0 61.7±0.0 60.9±0.0 77.6
GFK [10] 72.8±0.0 95.0±0.0 98.2±0.0 74.5±0.0 63.4±0.0 61.0±0.0 77.5
DDC [32] 75.6±0.2 96.0±0.2 98.2±0.1 76.5±0.3 62.2±0.4 61.5±0.5 78.3
DAN [19] 80.5±0.4 97.1±0.2 99.6±0.1 78.6±0.2 63.6±0.3 62.8±0.2 80.4
RTN [20] 84.5±0.2 96.8±0.1 99.4±0.1 77.5±0.3 66.2±0.2 64.8±0.3 81.6
RevGrad [7] 82.0±0.4 96.9±0.2 99.1±0.1 79.7±0.4 68.2±0.4 67.4±0.5 82.2
ADDA [31] 86.2±0.5 96.2±0.3 98.4±0.3 77.8±0.3 69.5±0.4 68.9±0.5 82.9
JAN [21] 85.4±0.3 97.4±0.2 99.8±0.2 84.7±0.3 68.6±0.3 70.0±0.4 84.3
HAFN 83.4±0.7 98.3±0.1 99.7±0.1 84.4±0.7 69.4±0.5 68.5±0.3 83.9
IAFN 88.8±0.4 98.4±0.0 99.8±0.0 87.7±1.3 69.8±0.4 69.7±0.2 85.7
IAFN+ENT 90.1±0.8 98.6±0.2 99.8±0.0 90.7±0.5 73.0±0.2 70.2±0.3 87.1
Table 2: Accuracy (%) on Office-31 under vanilla setting (ResNet-50)

VisDA2017 [25] is an emerging large-scale dataset aimed at closing the synthetic-to-real domain gap. For the object recognition task, the dataset contains over 280K images across 12 object categories. The source domain has 152,397 synthetic images generated by rendering 3D models. The target domain has 55,388 real object images collected from Microsoft COCO [16]. Under the partial setting, we follow [5] to choose (in alphabetic order) the first 6 categories as target categories and perform the Synthetic-12 → Real-6 task. Experiments on VisDA2017 demonstrate the efficiency of our approach on a large-scale sample size and a significant synthetic-to-real domain shift.

Office-Home [33] collects images of everyday objects to form four domains: Artistic images (Ar), Clipart images (Cl), Product images (Pr) and Real-World images (Rw). Each domain contains 65 object categories, amounting to around 15,500 images in total. Under the partial DA setting, we follow [5] to choose (in alphabetic order) the first 25 categories as target categories. We conduct all 12 transfer tasks (Ar→Cl, Ar→Pr, …, Rw→Pr) on this dataset.

Office-31 [26] is a traditional benchmark for visual domain adaptation, comprising 31 office environment categories and a total of 4,652 images from three domains: Amazon (A), DSLR (D) and Webcam (W), referring to online website images, digital SLR camera images and web camera images respectively. We conduct all 6 transfer tasks (A→D, A→W, …, W→D) on this dataset.

ImageCLEF-DA dates from the ImageCLEF 2014 domain adaptation challenge (http://imageclef.org/2014/adaptation) and consists of 12 common categories shared by Caltech-256 (C), ImageNet ILSVRC2012 (I) and Pascal VOC 2012 (P). It is considered a balanced benchmark, as each domain contains the same number of images per category. We conduct all 6 transfer tasks (C→I, C→P, …, P→I) on this dataset.

Method I→P P→I I→C C→I C→P P→C Avg
ResNet [13] 74.8±0.3 83.9±0.1 91.5±0.3 78.0±0.2 65.5±0.3 91.2±0.3 80.7
DAN [19] 74.5±0.4 82.2±0.2 92.8±0.2 86.3±0.4 69.2±0.4 89.8±0.4 82.5
RevGrad [7] 75.0±0.6 86.0±0.3 96.2±0.4 87.0±0.5 74.3±0.5 91.5±0.6 85.0
RTN [20] 74.6±0.3 85.8±0.1 94.3±0.1 85.9±0.3 71.7±0.3 91.2±0.4 83.9
JAN [21] 76.8±0.4 88.0±0.2 94.7±0.2 89.5±0.3 74.2±0.3 91.7±0.3 85.8
HAFN 76.9±0.4 89.0±0.4 94.4±0.1 89.6±0.6 74.9±0.2 92.9±0.1 86.3
IAFN 78.0±0.4 91.7±0.5 96.2±0.1 91.1±0.3 77.0±0.5 94.7±0.3 88.1
IAFN+ENT 79.3±0.1 93.3±0.4 96.3±0.4 91.7±0.0 77.6±0.1 95.3±0.1 88.9
Table 3: Accuracy (%) on ImageCLEF-DA under vanilla setting (ResNet-50)
Method Ar→Cl Ar→Pr Ar→Rw Cl→Ar Cl→Pr Cl→Rw Pr→Ar Pr→Cl Pr→Rw Rw→Ar Rw→Cl Rw→Pr Avg
ResNet [13] 38.57 60.78 75.21 39.94 48.12 52.90 49.68 30.91 70.79 65.38 41.79 70.42 53.71
DAN [19] 44.36 61.79 74.49 41.78 45.21 54.11 46.92 38.14 68.42 64.37 45.37 68.85 54.48
DANN [8] 44.89 54.06 68.97 36.27 34.34 45.22 44.08 38.03 68.69 52.98 34.68 46.50 47.39
RTN [20] 49.37 64.33 76.19 47.56 51.74 57.67 50.38 41.45 75.53 70.17 51.82 74.78 59.25
PADA [5] 51.95 67.00 78.74 52.16 53.78 59.03 52.61 43.22 78.79 73.73 56.60 77.09 62.06
HAFN 53.35±0.44 72.66±0.53 80.84±0.50 64.16±0.48 65.34±0.30 71.07±1.04 66.08±0.68 51.64±0.42 78.26±0.51 72.45±0.13 55.28±0.37 79.02±0.19 67.51
IAFN 58.93±0.50 76.25±0.33 81.42±0.27 70.43±0.46 72.97±1.39 77.78±0.52 72.36±0.31 55.34±0.46 80.40±0.78 75.81±0.37 60.42±0.83 79.92±0.20 71.83
Table 4: Accuracy (%) on Office-Home under partial setting (ResNet-50)

Protocol We follow the standard protocol [19, 28, 7, 5, 21] in both vanilla and partial settings and use all labeled source samples and all unlabeled target images belonging to the corresponding target label space. Under the vanilla setting, we compare our approach with conventional shallow models and state-of-the-art deep learning methods, including Transfer Component Analysis (TCA) [24], Geodesic Flow Kernel (GFK) [10], ResNet [13], Deep Domain Confusion (DDC) [32], Deep Adaptation Network (DAN) [19], Reverse Gradient (RevGrad) [7] or Domain Adversarial Neural Network (DANN) [8], Residual Transfer Network (RTN) [20], Joint Adaptation Network (JAN) [21], Maximum Classifier Discrepancy (MCD) [28] and Adversarial Discriminative Domain Adaptation (ADDA) [31]. Under the partial setting, we compare against ResNet, DAN, RevGrad, RTN and Partial Adversarial Domain Adaptation (PADA) [5].

Implementation Details We conducted our experiments on the PyTorch (https://pytorch.org/) platform, which is also adopted by most emerging methods. For fair comparison, our backbone network, e.g., ResNet-50 or ResNet-101, is identical to that of the competing approaches and is likewise fine-tuned from the ImageNet [6] pre-trained model. In pursuit of versatility, we adopted a unified set of hyper-parameters (the balance factor λ and the radius R in HAFN, and λ and the residual Δr in IAFN) throughout Office-Home, Office-31 and ImageCLEF-DA under both vanilla and partial settings. Since the synthetic domain in VisDA2017 is easy to converge on, we applied slightly smaller λ and Δr equal to 0.01 and 0.3 respectively. We did NOT tailor different parameters for each subtask within a benchmark, for the sake of versatility, although doing so usually leads to performance gains [5]. Following [28], we applied two fully-connected layers, each with 1000 neurons, as the bottleneck of the classifier on VisDA2017. Considering the small sample size of the other benchmarks, we reduced the bottleneck to one fully-connected layer. We used a mini-batch SGD optimizer with the same learning rate for both the backbone network and the classifier on ALL benchmarks, mainly following the implementation (https://github.com/mil-tokyo/MCD_DA) of [28]. We used a center crop (224 × 224) of the resized image (256 × 256) for the reported accuracy, set a unified maximum number of epochs throughout ALL transfer tasks within each benchmark, and reported the accuracy at the endpoint as in [28]. We did NOT search for the best model among the whole range of checkpoints. We repeated each experiment three times and report the average accuracy with the standard deviation.

Method Synthetic-12 → Real-6
ResNet [13] 45.26
DAN [19] 47.60
DANN [8] 51.01
RTN [20] 50.04
PADA [5] 53.53
HAFN 65.06
IAFN 67.65
Table 5: Accuracy (%) on VisDA2017 under partial setting (ResNet-50)
(a) Source Only on VisDA2017
(b) IAFN on VisDA2017
(c) Sample Size
(d) Embedding Dimension
Figure 4: (a) and (b) correspond to t-SNE embedding visualization on Source Only and IAFN models respectively. The triangle and star markers represent the source and target samples respectively. Different colors indicate different categories. (c) The accuracy with regard to various unlabeled target sample size. (d) Accuracy by varying feature embedding dimension. Best viewed in color.

4.2 Results

Results under the vanilla setting for VisDA2017, Office-31 and ImageCLEF-DA are reported in Tables 1, 2 and 3 respectively. Accuracies of the compared approaches are directly cited from the corresponding papers. As illustrated, our models significantly outperform the existing methods throughout all benchmarks, with IAFN being the top-performing variant.

Existing works on VisDA2017 prefer the mean of per-class accuracy for evaluation. We follow the same criterion and include the overall accuracy for comparison. Results on VisDA2017 reveal some interesting observations: i) Adversarial learning based methods like DANN obtain poor performance on this extremely hard and large-scale benchmark, suffering from the difficulty of optimization. In contrast, the encouraging result of our approach demonstrates its ability to optimize over a large volume of samples and bridge the significant synthetic-to-real domain gap. ii) It is noteworthy that some existing methods incorporate an additional class balance loss to align the target samples in a balanced way. Although this trick further improves accuracy, it does not generalize to other datasets and leads to a more complex multi-objective learning problem. Our model yields superior accuracy on most categories and is robust to the imbalance issue without extra heuristic operations. iii) Our models are lightweight compared with DANN, which has an extra parametric domain discriminator, and MCD, which needs another branch of classifiers.

As indicated in Tables 2 and 3, our methods achieve the new state of the art on these two standard benchmarks. For Office-31, HAFN yields comparable performance, while IAFN improves the accuracy on most transfer tasks, especially hard ones like A→W, A→D and D→A. For ImageCLEF-DA, we surpass the state of the art on all 6 balanced transfer tasks. In the Analysis section, we further conduct a case study to demonstrate that our approach is complementary to other DA techniques on these two datasets.

Results on Office-Home and VisDA2017 under the partial setting are reported in Tables 4 and 5 respectively. As illustrated, our methods obtain significant improvements over the comparison approaches, with a 9.8% gain on Office-Home and a 14.1% gain on VisDA2017. Plain adversarial learning based methods like DANN seriously suffer from the negative transfer effect and perform even worse than the Source Only model. PADA applies a heuristic way to find the outlier source classes and thus avoids aligning the whole source and target domains. Although PADA outperforms the plain adversarial version as well as MMD based methods, its weighting mechanism is still an approximate strategy and the negative transfer effect cannot be fully erased. Besides, PADA needs to adjust different hyper-parameters for each subtask within the same dataset, which greatly compromises the flexibility and versatility of the method. As our approach avoids explicitly aligning the feature distributions, it is robust to negative transfer and can be directly applied to the partial setting without any heuristic weighting mechanism.

5 Analysis

Feature Visualization: Although demonstrating the efficiency of a DA algorithm via t-SNE [23] embeddings is considered over-interpreted (https://interpretablevision.github.io/), we still follow the de facto practice to provide an intuitive understanding. We randomly select 2000 samples across the 12 categories from the source and target domains respectively in VisDA2017 and visualize their bottleneck features by t-SNE. As shown in Fig. 4(a), the ResNet features of target domain samples collide into a mess because of the extremely large synthetic-to-real domain gap. After adaptation, as illustrated in Fig. 4(b), our method succeeds in separating target domain samples and aligning them better to the corresponding source domain clusters.

Sample Size of Target Domain: In this part, we empirically demonstrate that our approach is data-driven, that is, increasing the number of unlabeled target domain samples can further boost the recognition performance, which reveals an appealing capability in practice. For adversarial learning based methods, by contrast, a larger volume of unlabeled target samples does not necessarily yield a better transformation, since the mixed objectives become harder to optimize. Specifically, we shuffle the target domain of VisDA2017 and sequentially take the top 25%, 50%, 75% and 100% of the dataset. We train and evaluate our approach on each of these four subsets. As illustrated in Fig. 4(c), as the sample size gradually increases, the classification accuracy on the target domain grows accordingly. This suggests that as more unlabeled target domain samples are involved in the feature norm adaptation, a more transferable classifier with respect to the target domain can be obtained.
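The nested-subset construction above can be sketched in a few lines; `target_paths` and the shuffle seed are hypothetical stand-ins for the actual VisDA2017 file list and experimental setup:

```python
import random

def make_target_subsets(target_paths, fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Shuffle the unlabeled target list once, then take the top-k% prefixes
    so that smaller subsets are strictly contained in larger ones."""
    paths = list(target_paths)
    random.Random(seed).shuffle(paths)
    return {f: paths[: int(len(paths) * f)] for f in fractions}

subsets = make_target_subsets([f"img_{i}.jpg" for i in range(100)])
# The 25% subset is a prefix of the 50% subset, so accuracies are comparable
# across subsets rather than confounded by different sample draws.
assert subsets[0.25] == subsets[0.5][:25]
```

Taking prefixes of a single shuffle (rather than reshuffling per fraction) keeps the four training sets nested, which is what makes the monotone accuracy trend in Fig. 4(c) attributable to sample size alone.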

Complementary with Other Methods: In this part, we demonstrate that our approach can cooperate with other DA techniques. In view of the limited space, we exploit ENTropy minimization (ENT) [12], a low-density separation technique, for demonstration. ENT is widely employed in domain adaptation [20, 29, 18] to encourage the decision boundary to pass through the low-density regions of the target domain by minimizing the conditional entropy of the target domain samples. We conduct this case study on the Office-31 and ImageCLEF-DA datasets and report the accuracy in Table 2 and Table 3 respectively. As illustrated, with ENT fitting the target-specific structure, we further boost the recognition performance and obtain gains of 1.4% and 0.8% on Office-31 and ImageCLEF-DA respectively.
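For reference, the conditional entropy term minimized by ENT is the mean Shannon entropy of the per-sample class posteriors; a minimal numpy sketch (in training this is applied to the network's softmax outputs on target samples, which is not shown here):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def conditional_entropy(logits, eps=1e-8):
    """Mean entropy of per-sample class posteriors. Minimizing this pushes
    predictions toward confident one-hot outputs, i.e. it moves the decision
    boundary into low-density regions of the target domain."""
    p = softmax(logits)
    return float(-(p * np.log(p + eps)).sum(axis=1).mean())

confident = np.array([[10.0, 0.0, 0.0]])   # near one-hot prediction
uncertain = np.array([[1.0, 1.0, 1.0]])    # uniform prediction
assert conditional_entropy(confident) < conditional_entropy(uncertain)
```

A uniform posterior attains the maximum entropy log(C) for C classes, so the term is bounded and can simply be added to the classification loss with a weighting coefficient.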

Sensitivity of Embedding Dimension: We investigate the sensitivity of the embedding dimension of the bottleneck layer, as it plays a significant role in calculating the norm. We conduct this case study on both the VisDA2017 and Office-31 datasets. For Office-31, we select the A→W transfer task for demonstration. We report the mean accuracy with standard deviation for embedding dimensions in {500, 1000, 1500, 2000}. As illustrated in Fig. 4(d), the accuracy remains almost at the same level, peaking slightly when the embedding dimension is set to 1000, implying that our approach is not sensitive to the feature space dimension over the chosen range.
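One reason this sensitivity check matters: for i.i.d. features the expected L2 norm grows roughly like the square root of the dimension, so a norm-based objective could in principle behave differently at each embedding size. A small numpy illustration of that scaling (illustrative only; trained bottleneck features are not i.i.d. Gaussian):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (500, 1000, 1500, 2000):
    feats = rng.standard_normal((4096, d))        # surrogate bottleneck features
    mean_norm = np.linalg.norm(feats, axis=1).mean()
    # For N(0,1) entries the expected norm is close to sqrt(d).
    assert abs(mean_norm - np.sqrt(d)) / np.sqrt(d) < 0.05
```

Since the raw norm scale shifts with the dimension, the flat accuracy curve in Fig. 4(d) indicates the adaptation is robust to that scale rather than tuned to one particular dimension.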

6 Conclusion

This paper presented a novel learning paradigm for unsupervised DA based on feature norm adaptation. We empirically argued that the erratic discrimination of the target domain is mainly reflected in its much smaller feature norms. To tackle this issue, we designed HAFN to increase and match the mean feature norms of the source and target domains, and further proposed IAFN to enlarge the feature norm of each individual sample. Both algorithms are independent of the association between the label spaces of the two domains. Our method is non-parametric, data-driven and easy to optimize. Extensive empirical evaluation on visual DA benchmarks under both the vanilla and partial settings demonstrates the superiority of our algorithms over existing methods.
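The two objectives summarized above can be sketched in a few lines of numpy. This is a hedged illustration, not the paper's implementation: the shared radius `R` and per-sample enlargement `delta_r` are assumed hyper-parameter names, and the real method computes these losses on live network activations with gradients.

```python
import numpy as np

def hafn_loss(src_feats, tgt_feats, R=25.0):
    """HAFN sketch: drive the mean feature norms of both domains
    toward a shared scalar R, so the two domains reach equilibrium."""
    src_norm = np.linalg.norm(src_feats, axis=1).mean()
    tgt_norm = np.linalg.norm(tgt_feats, axis=1).mean()
    return float((src_norm - R) ** 2 + (tgt_norm - R) ** 2)

def iafn_loss(feats, prev_norms, delta_r=1.0):
    """IAFN sketch: encourage each individual sample's norm to exceed
    its value at the previous step by delta_r, enlarging norms
    progressively instead of fixing a single target radius."""
    norms = np.linalg.norm(feats, axis=1)
    return float(((norms - (prev_norms + delta_r)) ** 2).mean())
```

Either term is added to the source classification loss; no domain discriminator or explicit distribution alignment is involved, which is what makes the approach insensitive to mismatched label spaces.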

References

  • [1] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
  • [2] S. Ben-David, T. Lu, T. Luu, and D. Pál. Impossibility theorems for domain adaptation. In International Conference on Artificial Intelligence and Statistics, pages 129–136, 2010.
  • [3] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.
  • [4] Z. Cao, M. Long, J. Wang, and M. I. Jordan. Partial transfer learning with selective adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [5] Z. Cao, L. Ma, M. Long, and J. Wang. Partial adversarial domain adaptation. In European Conference on Computer Vision, pages 139–155. Springer, 2018.
  • [6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
  • [7] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189, 2015.
  • [8] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
  • [9] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li. Deep reconstruction-classification networks for unsupervised domain adaptation. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
  • [10] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2066–2073. IEEE, 2012.
  • [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [12] Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, pages 529–536, 2005.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016.
  • [15] X. Li, S. Chen, X. Hu, and J. Yang. Understanding the disharmony between dropout and batch normalization by variance shift. arXiv preprint arXiv:1801.05134, 2018.
  • [16] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [17] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 1, 2017.
  • [18] M. Long, Y. Cao, Z. Cao, J. Wang, and M. I. Jordan. Transferable representation learning with deep adaptation networks. IEEE transactions on pattern analysis and machine intelligence, 2018.
  • [19] M. Long, Y. Cao, J. Wang, and M. Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97–105, 2015.
  • [20] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems, pages 136–144, 2016.
  • [21] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning, pages 2208–2217, 2017.
  • [22] Z. Luo, Y. Zou, J. Hoffman, and L. F. Fei-Fei. Label efficient learning of transferable representations acrosss domains and tasks. In Advances in Neural Information Processing Systems, pages 165–177, 2017.
  • [23] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
  • [24] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011.
  • [25] X. Peng, B. Usman, N. Kaushik, J. Hoffman, D. Wang, and K. Saenko. Visda: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924, 2017.
  • [26] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In European conference on computer vision, pages 213–226. Springer, 2010.
  • [27] K. Saito, Y. Ushiku, and T. Harada. Asymmetric tri-training for unsupervised domain adaptation. In International Conference on Machine Learning, pages 2988–2997, 2017.
  • [28] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [29] R. Shu, H. H. Bui, H. Narui, and S. Ermon. A dirt-t approach to unsupervised domain adaptation. arXiv preprint arXiv:1802.08735, 2018.
  • [30] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • [31] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, page 4, 2017.
  • [32] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
  • [33] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan. Deep hashing network for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5385–5394. IEEE, 2017.
  • [34] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515. Springer, 2016.
  • [35] J. Ye, X. Lu, Z. Lin, and J. Z. Wang. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. arXiv preprint arXiv:1802.00124, 2018.
  • [36] A. Zaeemzadeh, N. Rahnavard, and M. Shah. Norm-preservation: Why residual networks can become extremely deep? arXiv preprint arXiv:1805.07477, 2018.
  • [37] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision, 2017.