This paper focuses on unsupervised domain adaptation (UDA) for the visual classification task. We aim to adapt the knowledge from a source network, trained on the source domain data, to the training of a target network, which will be used for making predictions in the target domain. Note that in UDA the target domain is unlabeled. The increasing popularity of UDA arises from the fact that the performance of a model trained on one domain may degrade heavily on another when the underlying data distributions differ.
In the UDA community, many deep learning methods attempt to minimize the discrepancy across domains on the top layers of the neural network, such as the fully connected layers, either by explicitly imposing penalty terms [26, 16, 17, 23] or in an adversarial way [7, 25, 24]. While the modifications at the fully connected layers can in principle be back-propagated, their effect may decay after a few layers, especially when gradient explosion or vanishing takes place. Consequently, the convolutional layers may be under-constrained. However, the domain discrepancy may emerge from the very start, in the convolutional layers, which makes any adjustment purely at the tail of the network less effective.
We investigate the domain discrepancy of the convolutional layers by visualizing their attention mechanisms. In essence, the attention mechanism is emphasized as a key ingredient of CNNs by a number of studies [22, 29, 32, 20, 28, 27, 30]. Zagoruyko et al. find that model performance is highly correlated with the attention mechanism: a stronger model exhibits better-aligned attention than a weaker one. In Fig. 1, given networks trained on labeled data from the source and target domains respectively, we observe distinct attention patterns exhibited by the convolutional layers for the same target domain image. The attention mechanism degenerates when the source network is directly applied to the target domain data, which may exert a negative influence on classification performance. Therefore, this paper expects the attention of the convolutional layers to be invariant to the domain shift.
Based on the above discussion, this paper takes the domain discrepancy of the convolutional layers directly into account by aligning the attention of the target network with that of the source network. Our assumption is that no matter how the domain varies, the discriminative parts of an image should be insensitive to changes of image style. Previous discrepancy measures (e.g., MMD and JMMD), which work effectively on high-level semantic representations, cannot be trivially transferred to measuring the attention discrepancy of the convolutional layers, where low-level structural information is critical. In this paper, we propose using CycleGAN to build data correspondences across domains, i.e., translating data from one domain to the other without modifying the underlying content. Then, for the paired samples (e.g., a real source (or target) image and its synthetic target (or source) counterpart), we explicitly penalize the distances between the attention maps of the source and the target networks.
Additionally, we train our target network with real and synthetic data from both the source and target domains. For the source domain data and their translations, we impose the cross-entropy loss between the predictions and the ground-truth labels. For the target domain data and their translations into the source domain, due to the lack of ground-truth labels, we make use of their underlying category distributions, which provide insight into the target data. In a nutshell, we adopt modified Expectation-Maximization (EM) steps to maximize the likelihood of target domain images and update the model. Training iterations improve both the label posterior estimates and the discriminative ability of the model.
Our contributions are summarized below,
We propose a deep attention alignment method which allows the target network to mimic the attention of the source network. Taking advantage of the pairing nature of CycleGAN, no additional supervision is needed.
We propose using EM algorithm to exploit the unlabeled target data to update the network. Several modifications are made to stabilize training and improve the adaptation performance.
Our method outperforms the state of the art in all six transfer tasks, achieving a +2.6% improvement on average on the real-world domain adaptation dataset Office-31.
2 Related Work
Unsupervised domain adaptation. Various methods have been proposed for unsupervised domain adaptation [26, 16, 7, 17]. Many works try to make the representations at the tail of neural networks invariant across domains. Tzeng et al.  propose a kind of domain confusion loss to encourage the network to learn both semantically meaningful and domain invariant representations. Similarly, Long et al.  minimize the MMD distance of the fully-connected activations between source and target domain while sharing the convolutional features. Ganin et al. 
enable the network to learn domain-invariant representations in an adversarial way by adding a domain classifier and back-propagating reversed gradients. JAN penalizes the JMMD over multiple fully-connected layers to minimize the domain discrepancy coming from both the data distribution and the label distribution. Further, JAN-A , a variant of JAN, trains the network in an adversarial way with JMMD as the domain adversary. DSN  explicitly models domain-specific features to help improve the network's ability to learn domain-invariant features. Associative domain adaptation (ADA)  reinforces associations across domains directly in embedding space to extract statistically domain-invariant and class-discriminative features. Few works pay attention to the domain shift originating in the convolutional layers. In this paper, we observe that the attention mechanism is not preserved when a model trained on the source domain is directly applied to the target domain. To alleviate this problem, we constrain the training of the convolutional layers by imposing an attention alignment penalty across domains.
Attention of CNNs. There exist many ways to define and visualize the attention mechanisms learned by CNNs. Zeiler & Fergus  project certain features back onto the image through a network called a “deconvnet”, which shares the same weights as the original feed-forward network. Simonyan et al.  propose using the gradient of the class score w.r.t. the input image to visualize the CNN. Class activation maps (CAMs), proposed by , aim to visualize the class-discriminative image regions used by a CNN. Grad-CAM  combines the gradient-based attention method with CAM, obtaining class-discriminative attention maps without modifying the original network structure as  does.
Zagoruyko et al.  define attention as a set of spatial maps indicating which areas the network focuses on to perform a certain task. The attention maps can also be defined w.r.t. various layers of the network so that they capture low-, mid-, and high-level representation information. They propose that the attention mechanism should be a kind of knowledge transferred across different network architectures. Zagoruyko et al.  align the attention across different architectures for exactly the same image during training, aiming to transfer the knowledge from a large model to a smaller one. Different from , our method aligns attention across data domains, where images are unpaired, and aims to promote model adaptation performance.
Unpaired image-to-image translation.
Unpaired image-to-image translation aims to train a model that maps image samples across domains in the absence of pairing information. It can be realized through GANs that pair real source (or target) and synthetic target (or source) images [15, 21, 33, 12, 14, 2, 11, 19]. Generating synthetic images can be beneficial for various vision tasks [18, 31, 6, 5]. In this paper, we concentrate on maximizing the utility of the given paired real and synthetic samples, and we choose CycleGAN  to perform such adversarial data pairing.
3 Deep Adversarial Attention Alignment
Our framework is illustrated in Fig. 2. We train a source CNN which guides the attention alignment of the target CNN whose convolutional layers have the same architecture as the source network. The target CNN is trained with a mixture of real and synthetic images from both source and target domains. For source and synthetic target domain data, we have ground-truth labels and use them to train the target network with cross-entropy loss. On the other hand, for the target and synthetic source domain data, due to the lack of ground-truth labels, we optimize the target network through an EM algorithm.
3.1 Adversarial Data Pairing
We use CycleGAN to translate samples in the source domain S to the target domain T, and vice versa. The underlying assumption for obtaining meaningful translations is that there exist some relationships between the two domains. For unsupervised domain adaptation, the objects of interest across domains belong to the same set of categories, so it is possible to use CycleGAN to map a sample in the source domain to the target domain while maintaining the underlying object of interest.
The Generative Adversarial Network (GAN) aims to generate synthetic images which are indistinguishable from real samples through an adversarial loss,

L_GAN(G_{S→T}, D_T) = E_{x_t∼T}[log D_T(x_t)] + E_{x_s∼S}[log(1 − D_T(G_{S→T}(x_s)))],    (1)

where x_s and x_t are sampled from the source domain S and the target domain T, respectively. The generator G_{S→T} mapping S to T strives to make its synthetic outputs indistinguishable from real target samples for the domain discriminator D_T.
Because the training data across domains are unpaired, the translation from the source domain to the target domain is highly under-constrained. CycleGAN couples the adversarial training of this mapping with its inverse, i.e., the mappings G_{S→T}: S → T and G_{T→S}: T → S are learned concurrently. Moreover, it introduces a cycle-consistency loss to regularize the training,

L_cyc(G_{S→T}, G_{T→S}) = E_{x_s∼S}[||G_{T→S}(G_{S→T}(x_s)) − x_s||_1] + E_{x_t∼T}[||G_{S→T}(G_{T→S}(x_t)) − x_t||_1].    (2)
Formally, the full objective for CycleGAN is

L_CycleGAN = L_GAN(G_{S→T}, D_T) + L_GAN(G_{T→S}, D_S) + λ L_cyc(G_{S→T}, G_{T→S}),    (3)

where the constant λ controls the strength of the cycle-consistency loss. Through CycleGAN, we are able to translate an image in the source domain to the target domain in the context of our visual domain adaptation tasks (Fig. 3).
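As a minimal numerical sketch (not the authors' implementation; the generator functions here are placeholders for CycleGAN's CNN generators), the cycle-consistency term above can be computed as:

```python
import numpy as np

def cycle_consistency_loss(G_st, G_ts, xs_batch, xt_batch):
    """L1 cycle-consistency: a sample should survive the round trip
    S -> T -> S (and T -> S -> T). G_st and G_ts are placeholder callables."""
    loss_s = np.mean(np.abs(G_ts(G_st(xs_batch)) - xs_batch))
    loss_t = np.mean(np.abs(G_st(G_ts(xt_batch)) - xt_batch))
    return loss_s + loss_t

# Toy check: with identity "generators" the round trip is perfect, so loss is 0.
identity = lambda x: x
xs = np.random.rand(4, 3, 32, 32)
xt = np.random.rand(4, 3, 32, 32)
print(cycle_consistency_loss(identity, identity, xs, xt))  # -> 0.0
```

In the real model the generators are trained jointly with the two discriminators, and this term is weighted by λ in the full objective.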
As illustrated in Fig. 1, the target model pays too much attention to the irrelevant background or to less discriminative parts of the objects of interest. This attention misalignment degrades the model's performance. In this paper, we propose to use the style-translated images as natural image correspondences to guide the attention mechanism of the target model to mimic that of the source model, as detailed in Section 3.2.
3.2 Attention Alignment
Based on the paired images, we propose imposing an attention alignment penalty to reduce the discrepancy of attention maps across domains. Specifically, we represent attention as a set of spatial maps, one per convolutional layer l. For an input x of a CNN, let the feature maps of layer l, with C_l channels, be denoted F_l(x). Then, the attention map of layer l is defined as

A_l(x) = Σ_{c=1}^{C_l} |F_{l,c}(x)|^2,    (4)

where F_{l,c} denotes the c-th channel of the feature maps. The operations in Eq. (4) are all element-wise. Alternative ways to represent the attention maps include Σ_c |F_{l,c}| and max_c |F_{l,c}|^p, etc. We adopt Eq. (4) to emphasize the salient parts of the feature maps.
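As a minimal sketch (pure NumPy, not the paper's code), the Eq. (4)-style attention statistic collapses a layer's feature maps into a single spatial map by summing squared channel activations:

```python
import numpy as np

def attention_map(F):
    """Collapse feature maps F of shape (C, H, W) into an (H, W) attention map
    by summing squared activations over the channel axis."""
    return np.sum(np.abs(F) ** 2, axis=0)

F = np.random.randn(64, 7, 7)  # e.g. one conv layer's output for one image
A = attention_map(F)
assert A.shape == (7, 7)
assert np.all(A >= 0)  # squaring makes every spatial location non-negative
```

Squaring (rather than taking absolute values) amplifies the most salient spatial locations, which is why the paper prefers Eq. (4) over the alternatives.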
We propose using the source network to guide the attention alignment of the target network, as illustrated in Fig. 2. We penalize the distance between the vectorized attention maps of the source and the target networks to minimize their discrepancy. In order to make the attention mechanism invariant to the domain shift, we train the target network with a mixture of real and synthetic data from both source and target domains.
Formally, the attention alignment penalty can be formulated as

L_AT = Σ_l ( ||ã_l^S(x_s) − ã_l^T(x_{s→t})||_2 + ||ã_l^S(x_{t→s}) − ã_l^T(x_t)||_2 ),    (5)

where the subscript l denotes the layer, ã_l = vec(A_l)/||vec(A_l)||_2 is the vectorized and normalized attention map, and the superscripts S and T indicate whether the map is produced by the source network or the target network. Here x_s and x_t are real source and real target domain data, respectively, and the synthetic target data x_{s→t} and synthetic source data x_{t→s} satisfy x_{s→t} = G_{S→T}(x_s) and x_{t→s} = G_{T→S}(x_t), respectively.

Through Eq. (5), the distances between attention maps of the paired images (i.e., (x_s, x_{s→t}) and (x_{t→s}, x_t)) are minimized. Moreover, we additionally penalize the distance between the attention maps produced when the same input passes through the two different networks. The attention alignment penalty allows the attention mechanism to be gradually adapted to the target domain, which makes the attention mechanism of the target network invariant to the domain shift.
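The per-layer distance between two attention maps can be sketched as follows (a common formulation for attention transfer: vectorize, ℓ2-normalize, then take the Euclidean distance; this is a sketch, not necessarily the paper's exact normalization):

```python
import numpy as np

def attention_distance(a_src, a_tgt, eps=1e-8):
    """Distance between two attention maps: vectorize each map,
    L2-normalize it, and take the Euclidean distance between the two
    unit vectors. eps guards against division by zero."""
    v1 = a_src.ravel()
    v1 = v1 / (np.linalg.norm(v1) + eps)
    v2 = a_tgt.ravel()
    v2 = v2 / (np.linalg.norm(v2) + eps)
    return np.linalg.norm(v1 - v2)

a = np.random.rand(7, 7)
assert attention_distance(a, a) < 1e-6        # identical maps -> ~0
assert attention_distance(a, 3.0 * a) < 1e-6  # normalization removes scale
```

Normalizing before comparing makes the penalty sensitive to *where* the network looks rather than to the overall magnitude of the activations.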
Discussion. In minimizing the discrepancy across domains, our method shares similar ideas with DAN  and JAN . The difference is that our method works on the convolutional layers, where critical structural information is captured and aligned across domains; in comparison, DAN and JAN focus on the FC layers, where high-level semantic information is considered. Another notable difference is that our method deals with image-level differences through CycleGAN data pairing, whereas DAN and JAN consider the discrepancy of feature distributions.
In DAN and JAN, the MMD and JMMD criteria are adopted, respectively, to measure the discrepancy of feature distributions across domains. Technically, MMD and JMMD can also be used as attention discrepancy measures. However, as shown in the experiments, MMD and JMMD yield inferior performance to the per-pair distance enabled by adversarial data pairing in our method. The reason is that MMD and JMMD are distribution distance estimators: they map the attention maps into a Reproducing Kernel Hilbert Space (RKHS) and lose the structural information, so they are not suitable for measuring the attention discrepancy across domains.
3.3 Training with EM
To make full use of the available data (labeled and unlabeled), we train the target-domain model with a mixture of real and synthetic data from both source and target domains, as illustrated in Fig. 2. For the source and its translated synthetic target domain data, we compute the cross-entropy loss between the predictions and the ground-truth labels to back-propagate gradients through the target network. The cross-entropy loss for the source and the corresponding synthetic target domain data can be formulated as

L_CE = − Σ_{(x_s, y_s)} [ log p(y_s | x_s; θ) + log p(y_s | x_{s→t}; θ) ],    (6)

where y_s denotes the label for the source sample x_s and its translated synthetic target sample x_{s→t}. The probability p(y | x; θ) is given by the y-th output of the target network with parameters θ for the input image x.
For the unlabeled target data, due to the lack of labels, we employ the EM algorithm to optimize the target network. The EM algorithm alternates between two steps: the (E)xpectation step and the (M)aximization step. The objective is to maximize the log-likelihood of the target data samples,

L(θ) = Σ_{x_t} log p(x_t; θ).    (7)
In image classification, our prior is that the target data samples belong to different categories. We choose the underlying category of each sample as the hidden variable, and the algorithm is depicted as follows (we omit the sample subscript and the target domain superscript for description simplicity).
(i) The Expectation step. We first estimate the posterior over categories through

p(y | x; θ̂) = p(x | y; θ̂) p(y) / p(x),    (8)

where the distribution p(x | y; θ̂) is modeled by the target network and θ̂ denotes the parameters of the target-domain CNN at the last training step. We adopt uniform distributions for p(y) (i.e., assuming the occurrence probabilities of all categories are the same) and p(x) (i.e., assuming all possible image instantiations are distributed uniformly in the manifold of the image gallery). In this manner, p(y | x; θ̂) = C · p(x | y; θ̂), where C is a constant.
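Concretely, with uniform p(y) and p(x) the posterior reduces to the network's own (normalized) class scores. A minimal sketch of this E-step, where `logits` is a hypothetical (N, K) score matrix from the network:

```python
import numpy as np

def e_step_posterior(logits):
    """E-step sketch: with uniform p(y) and p(x), the posterior p(y|x) is
    just the network's class distribution, here a softmax over the logits
    (one row per unlabeled target sample)."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

P = e_step_posterior(np.array([[2.0, 0.0, 0.0],
                               [0.0, 0.0, 5.0]]))
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a valid distribution
assert P[1].argmax() == 2               # a confident sample gives a peaked posterior
```

These posteriors then serve as fixed soft targets during the M-step.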
(ii) The Maximization step. Based on the computed posterior p(y | x; θ̂), our objective is to update θ to improve the lower bound of Eq. (7),

max_θ Σ_y p(y | x; θ̂) log [ p(x | y; θ) p(y) / p(y | x; θ̂) ].    (9)

Note that we omit p(y) because we assume y is subject to the uniform distribution, which is irrelevant to θ. Also, because p(y | x; θ̂) is fixed w.r.t. θ, Eq. (9) is equivalent to

max_θ Σ_y p(y | x; θ̂) log p(x | y; θ).    (10)
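The M-step objective in Eq. (10) amounts to a soft cross-entropy between the fixed E-step posterior and the current network outputs. A minimal sketch (NumPy, not the paper's implementation):

```python
import numpy as np

def m_step_loss(posterior_fixed, log_probs_current):
    """M-step sketch: minimize the negative expected log-likelihood
    -sum_y q(y|x) * log p_theta(x|y), with q fixed from the E-step.
    Both inputs are (N, K): rows are samples, columns are classes."""
    return -np.mean(np.sum(posterior_fixed * log_probs_current, axis=1))

q = np.array([[1.0, 0.0]])             # E-step says class 0 with certainty
good = np.log(np.array([[0.9, 0.1]]))  # current model agrees
bad = np.log(np.array([[0.1, 0.9]]))   # current model disagrees
assert m_step_loss(q, good) < m_step_loss(q, bad)
```

Gradient descent on this loss (w.r.t. the network parameters producing `log_probs_current`) raises the likelihood of target samples under their estimated categories.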
Moreover, we propose to improve the effectiveness and stability of the above EM steps in three aspects.
A) Asynchronous update of θ̂. We adopt an independent network to estimate p(y | x; θ̂) and update θ̂ asynchronously, i.e., the estimation network synchronizes its parameters with the target network every N steps: θ̂ ← θ. In this manner, we avoid too-frequent updates of p(y | x; θ̂) and make the training process much more stable.
B) Filtering inaccurate estimates. Because the estimate of p(y | x; θ̂) is not perfectly accurate, we set a threshold and discard the samples whose maximum value of p(y | x; θ̂) over all categories y is lower than it.
C) Re-initializing the learning rate schedule after each update of θ̂. To help the target network adapt quickly to each new update of the distribution p(y | x; θ̂), we re-initialize the learning rate schedule after each update of θ̂.
Note that for the synthetic source data x_{t→s}, we can also apply the modified EM steps for training. Because G_{T→S} is a deterministic mapping, we assume p(y | x_{t→s}; θ̂) = p(y | x_t; θ̂).
To summarize, when using the EM algorithm to update the target network with the target data and the synthetic source data, we first compute the posterior p(y | x_t; θ̂) through the estimation network, which synchronizes with the target network every N steps. Then we minimize the loss

L_EM = − Σ_{x ∈ {x_t, x_{t→s}}} Σ_y p(y | x_t; θ̂) log p(x | y; θ).
In our experiment, we show that these modifications yield consistent improvement over the basic EM algorithm.
3.4 Deep Adversarial Attention Alignment
Based on the above discussion, the full objective for training the target network can be formulated as

L = L_CE + L_EM + β L_AT,

where β determines the strength of the attention alignment penalty term L_AT.
Discussion. Our approach mainly consists of two parts: attention alignment and EM training. On the one hand, attention alignment is crucial for the success of EM training. EM alone imposes no constraint that the estimated hidden variable carries the semantic meaning aligned with the ground-truth labels, i.e., there may exist label shift, or the data may be clustered in an undesirable way. Training with labeled data (e.g., source and synthetic target data) and synchronizing θ̂ with the target network alleviates this issue. In addition, attention alignment further regularizes the training process by encouraging the network to focus on the desired discriminative information.
On the other hand, EM benefits attention alignment by providing label distribution estimates for the target data. EM approximately guides the attention of the target network to fit the target domain statistics, while attention alignment regularizes the attention of the target network to stay close to that of the source network. These two seemingly adversarial counterparts cooperate to give the target network an attention mechanism that is invariant to the domain shift.
Note that both parts are promoted by the use of adversarial data pairing which provides natural image correspondences to perform attention alignment. Thus our method is named “deep adversarial attention alignment”.
Datasets. We use the following two UDA datasets for image classification.
1) Digit datasets from MNIST  (60,000 training + 10,000 test images) to MNIST-M  (59,001 training + 90,001 test images). MNIST and MNIST-M are treated as the source domain and target domain, respectively. The images of MNIST-M are created by combining MNIST digits with the patches randomly extracted from color photos of BSDS500  as their background.
2) Office-31 is a standard benchmark for real-world domain adaptation tasks. It consists of 4,110 images of 31 categories, collected from three distinct domains: 1) images from the Amazon website (Amazon domain), 2) images taken by a web camera (Webcam domain), and 3) images taken by a digital SLR camera (DSLR domain) under different settings. The dataset is imbalanced across domains, with 2,817 images in the Amazon (A) domain, 795 images in the Webcam (W) domain, and 498 images in the DSLR (D) domain. We evaluate our algorithm on six transfer tasks across these three domains: A→W, D→W, W→D, A→D, D→A, and W→A.
Competing methods. We compare our method with representative and state-of-the-art approaches, including RevGrad , JAN , JAN-A , DSN  and ADA , which minimize the domain discrepancy on the FC layers of a CNN. We compare against the results reported in their published papers under an identical evaluation setting. For the task MNIST→MNIST-M, we also compare with PixelDA , a state-of-the-art method on this task. Both CycleGAN and PixelDA transfer the source style to the target domain without heavily modifying the content. Therefore, PixelDA is an alternative way to generate paired images across domains and is compatible with our framework. We emphasize that a model capable of generating more genuine paired images will probably lead to higher accuracy with our method; investigation in this direction can proceed in parallel and reaches beyond the scope of this paper.
4.2 Implementation details
The source network is trained on the MNIST training set and, once trained, is fixed to guide the training of the target network. The target and source networks are made up of four convolutional layers, where the first three perform feature extraction and the last one acts as a classifier. We align the attention between the source and target networks for the three feature-extraction layers.
For Office-31, we adopt the ResNet-50 architecture to perform the adaptation tasks, and we start from the model pre-trained on ImageNet. We first fine-tune the model on the source domain data and then fix it. The source model is then used to guide the attention alignment of the target network. The target network starts from the fine-tuned model and is gradually trained to adapt to the target domain data. We penalize the distances of the attention maps w.r.t. all the convolutional layers except the first one.
Detailed settings of training are demonstrated in the supplementary material.
MNIST→MNIST-M. The classification results of transferring MNIST to MNIST-M are presented in Table 1. We arrive at four observations. First, our method outperforms a series of representative domain adaptation methods (e.g., RevGrad, DSN, ADA) by a large margin, all of which minimize the domain discrepancy at the FC layers of neural networks. Moreover, we achieve accuracy (95.6%) competitive with the state-of-the-art result (98.2%) reported by PixelDA. Note that, technically, PixelDA is compatible with our method and could be adopted to improve the accuracy of our model; we will investigate this in the future. Second, we observe that the accuracy of the source network drops heavily when transferred to the target domain (from 99.3% on the source test set to 45.6% on the target test set), which indicates a significant domain shift from MNIST to MNIST-M. Third, the distribution of the synthetic target data is much closer to the real target data than to the real source data: training with synthetic target data improves performance over the source network by about +30%. Finally, training with a mixture of source and synthetic target data is beneficial for learning domain-invariant features, and improves the adaptation performance by +3.5% over the model trained with synthetic target data only.
Table 1 demonstrates that our EM training algorithm is an effective way to exploit unlabeled target domain data. Moreover, imposing the attention alignment penalty always leads to noticeable improvement.
| Method | Train Data | Accuracy (%) |
| --- | --- | --- |
| Ours (w/o L_AT) | S + S→T + T + T→S | 93.5 |
| Ours (w/ L_AT) | S + S→T + T + T→S | 95.6 |
Office-31. The classification results based on ResNet-50 are shown in Table 2. Under an identical evaluation setting, we compare our method with previous transfer methods and with variants of our method. We draw three major conclusions.
| Method | Train Data | A→W | D→W | W→D | A→D | D→A | W→A | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet-50 | S | 68.4 ± 0.2 | 96.7 ± 0.1 | 99.3 ± 0.1 | 68.9 ± 0.2 | 62.5 ± 0.3 | 60.7 ± 0.3 | 76.1 |
| RevGrad | S + T | 82.0 ± 0.4 | 96.9 ± 0.2 | 99.1 ± 0.1 | 79.7 ± 0.4 | 68.2 ± 0.4 | 67.4 ± 0.5 | 82.2 |
| JAN | S + T | 85.4 ± 0.3 | 97.4 ± 0.2 | 99.8 ± 0.2 | 84.7 ± 0.3 | 68.6 ± 0.3 | 70.0 ± 0.4 | 84.3 |
| JAN-A | S + T | 86.0 ± 0.4 | 96.7 ± 0.3 | 99.7 ± 0.1 | 85.1 ± 0.4 | 69.2 ± 0.4 | 70.7 ± 0.5 | 84.6 |
| ResNet-50 | S→T | 81.1 ± 0.2 | 98.5 ± 0.2 | 99.8 ± 0.0 | 83.3 ± 0.3 | 61.0 ± 0.2 | 60.2 ± 0.3 | 80.6 |
| ResNet-50 | S + S→T | 81.9 ± 0.2 | 98.5 ± 0.2 | 99.8 ± 0.0 | 83.7 ± 0.3 | 66.5 ± 0.2 | 64.8 ± 0.3 | 82.5 |
| Ours (w/o L_AT) | S→T + T | 86.2 ± 0.2 | 99.3 ± 0.1 | 100 ± 0.0 | 86.5 ± 0.6 | 69.9 ± 0.6 | 70.2 ± 0.2 | 85.4 |
| Ours (w/ L_AT) | S→T + T | 86.8 ± 0.2 | 99.3 ± 0.1 | 100 ± 0.0 | 87.2 ± 0.5 | 71.7 ± 0.5 | 71.8 ± 0.1 | 86.1 |
| Ours (w/o L_AT) | S + S→T + T + T→S | 87.1 ± 0.3 | 99.3 ± 0.1 | 100 ± 0.0 | 87.1 ± 0.2 | 72.3 ± 0.2 | 72.2 ± 0.2 | 86.3 |
| Ours (w/ L_AT) | S + S→T + T + T→S | 86.8 ± 0.2 | 99.3 ± 0.1 | 100 ± 0.0 | 88.8 ± 0.4 | 74.3 ± 0.2 | 73.9 ± 0.2 | 87.2 |
| Method | Train Data | A→W | A→D | D→A | W→A | Average |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet-50 | S | 68.4 ± 0.2 | 68.9 ± 0.2 | 62.5 ± 0.3 | 60.7 ± 0.3 | 65.1 |
| EM-A | S + S→T + T + T→S | 68.6 ± 0.3 | 73.5 ± 0.3 | 62.7 ± 0.3 | 52.8 ± 0.3 | 64.4 |
| EM-A + L_AT | S + S→T + T + T→S | 80.4 ± 0.2 | 79.1 ± 0.2 | 66.4 ± 0.2 | 58.4 ± 0.2 | 71.1 |
| EM-C | S + S→T + T + T→S | 86.4 ± 0.3 | 87.0 ± 0.3 | 69.5 ± 0.3 | 71.4 ± 0.3 | 78.6 |
| EM-C + L_AT | S + S→T + T + T→S | 86.2 ± 0.2 | 86.6 ± 0.3 | 71.8 ± 0.3 | 73.7 ± 0.2 | 79.6 |
| EM-B | S + S→T + T + T→S | very low | very low | very low | very low | very low |
| EM-B + L_AT | S + S→T + T + T→S | very low | very low | very low | very low | very low |
| Ours (w/o L_AT) | S + S→T + T + T→S | 87.1 ± 0.3 | 87.1 ± 0.2 | 72.3 ± 0.2 | 72.2 ± 0.2 | 79.7 |
| Ours (w/ L_AT) | S + S→T + T + T→S | 86.8 ± 0.2 | 88.8 ± 0.4 | 74.3 ± 0.2 | 73.9 ± 0.2 | 80.9 |
| Measure | A→W | A→D | D→A | W→A | Average |
| --- | --- | --- | --- | --- | --- |
| ℓ1-norm | very low | very low | very low | very low | very low |
First, from Table 2, it can be seen that our method outperforms the state of the art in all the transfer tasks by a large margin. The improvement is larger on harder transfer tasks, where the source domain is substantially different from, and has much less data than, the target domain, e.g., D→A and W→A. Specifically, we improve over the state-of-the-art result by +2.6% on average, and by +5.1% on the difficult transfer task D→A.
Second, we also compare our method with and without the adversarial attention alignment loss L_AT. Although for easy transfer tasks the performance of the two variants is comparable, on the harder tasks we observe obvious improvement brought by adversarial attention alignment: e.g., training with adversarial attention alignment outperforms training without it by +2.0% on the task D→A and by +1.7% on the task W→A. This implies that adversarial attention alignment helps reduce the discrepancy across domains and regularizes the training of the target model.
Third, we validate that augmenting with synthetic target data to facilitate target network training brings a significant accuracy improvement over the source network. This indicates that the discrepancy between synthetic and real target data is much smaller. We also notice that the network trained with real and synthetic data from both domains is much more accurate than the one trained purely with real and synthetic target data. This verifies that the knowledge shared by the source domain can be sufficiently uncovered by our framework to improve the target network performance.
Fig. 4 illustrates how the attention alignment penalty L_AT changes during training with and without this penalty imposed. Without attention alignment, the discrepancy of the attention maps between the source and target networks is significantly larger and increases as training goes on. The accuracy improvement brought by adding L_AT to the objective can be attributed to the much smaller discrepancy of attention maps between the source and the target models, i.e., better-aligned attention mechanisms. The test accuracy curves on the target domain for the tasks D→A and W→A are also drawn in Fig. 4. It can be seen that the test accuracy steadily increases, and the model with L_AT converges much faster than that without any attention alignment.
Visualization of the attention maps of our method is provided in Fig. 1. We observe that through attention alignment, the attention maps of the target network adapt well to the target domain images, and are even better than those of the target model trained on labeled target images.
4.4 Ablation Study
Table 3 compares the accuracy of different EM variants. We conduct ablation studies by removing one component from the system at a time (the three components defined in Section 3.3). For each variant of EM, we also evaluate the effect of imposing L_AT by comparing training with and without it. By comparing the performance of EM-A, EM-B, EM-C and the full method, we find that the three modifications all contribute considerably to the system. Among them, filtering the noisy estimates is the most important factor. We also notice that for EM-A and EM-C, training with L_AT leads to a significant improvement, implying that attention alignment is an effective way to improve the adaptation performance.
4.5 Comparing Different Attention Discrepancy Measures
In this section, we compare different methods for measuring the attention discrepancy across domains, as discussed in Section 3.2. This paper uses the ℓ2 distance; the compared measures include the ℓ1 distance, MMD  and JMMD . Results are presented in Table 4.

We find that our method achieves the best results among the four measures. The ℓ1 distance fails to train a workable network because it is misled by the noise in the attention maps. Our method outperforms MMD/JMMD by a large margin because it preserves the structural information, as discussed in Section 3.2.
In this paper, we make two contributions to the UDA community. First, at the convolutional layers, we propose to align the attention maps of the source and target networks so that the knowledge of the source network is better adapted to the target one. Second, from an EM perspective, we maximize the likelihood of the unlabeled target data, which enables the target network to leverage more training data for better domain adaptation. Both contributions benefit from the unsupervised image correspondences provided by CycleGAN. Experiments demonstrate that the two contributions both have positive effects on system performance, and that they cooperate to achieve competitive or even state-of-the-art results on two benchmark datasets.
Acknowledgement. We acknowledge the Data to Decisions CRC (D2D CRC) and Cooperative Research Centres Programme for funding the research.
-  Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence 33(5), 898–916 (2011)
-  Bousmalis, K., Silberman, N., Dohan, D., Erhan, D., Krishnan, D.: Unsupervised pixel-level domain adaptation with generative adversarial networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
-  Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., Erhan, D.: Domain separation networks. In: Advances in Neural Information Processing Systems. pp. 343–351 (2016)
-  Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 248–255 (2009)
-  Ding, M., Fan, G.: Multilayer joint gait-pose manifolds for human gait motion modeling. IEEE Trans. Cybernetics 45(11), 2413–2424 (2015)
-  Dong, X., Yan, Y., Ouyang, W., Yang, Y.: Style aggregated network for facial landmark detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 379–388 (June 2018)
-  Haeusser, P., Frerix, T., Mordvintsev, A., Cremers, D.: Associative domain adaptation. In: International Conference on Computer Vision (ICCV). vol. 2, p. 6 (2017)
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
-  He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European Conference on Computer Vision. pp. 630–645. Springer (2016)
-  Hoffman, J., Tzeng, E., Park, T., Zhu, J.Y., Isola, P., Saenko, K., Efros, A.A., Darrell, T.: Cycada: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213 (2017)
-  Kim, T., Cha, M., Kim, H., Lee, J., Kim, J.: Learning to discover cross-domain relations with generative adversarial networks. In: International Conference on Machine Learning (2017)
-  LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
-  Liu, M.Y., Breuel, T., Kautz, J.: Unsupervised image-to-image translation networks. In: Advances in Neural Information Processing Systems. pp. 700–708 (2017)
-  Liu, M.Y., Tuzel, O.: Coupled generative adversarial networks. In: Advances in neural information processing systems. pp. 469–477 (2016)
-  Long, M., Cao, Y., Wang, J., Jordan, M.: Learning transferable features with deep adaptation networks. In: International Conference on Machine Learning. pp. 97–105 (2015)
-  Long, M., Wang, J., Jordan, M.I.: Deep transfer learning with joint adaptation networks. In: ICML (2017)
-  Luc, P., Couprie, C., Chintala, S., Verbeek, J.: Semantic segmentation using adversarial networks. In: NIPS Workshop on Adversarial Training (2016)
-  Russo, P., Carlucci, F.M., Tommasi, T., Caputo, B.: From source to target and back: symmetric bi-directional adaptive gan. arXiv preprint arXiv:1705.08824 (2017)
-  Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: ICCV. pp. 618–626 (2017)
-  Shrivastava, A., Pfister, T., Tuzel, O., Susskind, J., Wang, W., Webb, R.: Learning from simulated and unsupervised images through adversarial training. In: CVPR (2017)
-  Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)
-  Sun, B., Saenko, K.: Deep coral: Correlation alignment for deep domain adaptation. In: Computer Vision–ECCV 2016 Workshops. pp. 443–450. Springer (2016)
-  Tzeng, E., Hoffman, J., Darrell, T., Saenko, K.: Simultaneous deep transfer across domains and tasks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4068–4076 (2015)
-  Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. In: Computer Vision and Pattern Recognition (CVPR) (2017)
-  Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., Darrell, T.: Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474 (2014)
-  Wei, Y., Feng, J., Liang, X., Cheng, M.M., Zhao, Y., Yan, S.: Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In: IEEE CVPR (2017)
-  Zagoruyko, S., Komodakis, N.: Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In: ICLR (2017)
-  Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: European conference on computer vision. pp. 818–833. Springer (2014)
-  Zhang, X., Wei, Y., Feng, J., Yang, Y., Huang, T.: Adversarial complementary learning for weakly supervised object localization. In: IEEE CVPR (2018)
-  Zheng, Z., Zheng, L., Yang, Y.: Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In: Proceedings of the IEEE International Conference on Computer Vision (2017)
-  Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2921–2929 (2016)
-  Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Computer Vision (ICCV), 2017 IEEE International Conference on (2017)
6 Training Details
MNIST → MNIST-M We adopt Adam to update our network, and the initial learning rate is set to 0.001. For a mini-batch of input data, we fix the proportions of real source data, synthetic target data, real target data, and synthetic source data as , , , and , respectively, throughout the experiment. For EM training, we set the threshold so that the network is learned with all the source and synthetic target data before the first update of . We then set the threshold afterwards.
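The mixed mini-batch construction above can be sketched as follows. The actual proportions are elided in the text, so the equal split used here is only a placeholder, and `compose_batch` is a hypothetical helper name:

```python
import random

def compose_batch(real_src, syn_tgt, real_tgt, syn_src,
                  batch_size=32, proportions=(0.25, 0.25, 0.25, 0.25)):
    """Draw one mini-batch mixing four data pools at fixed proportions.

    NOTE: the equal 0.25 split is a placeholder; the proportions used in
    the paper are not recoverable from the text.
    """
    pools = (real_src, syn_tgt, real_tgt, syn_src)
    counts = [int(round(batch_size * p)) for p in proportions]
    # absorb rounding drift so the batch has exactly batch_size samples
    counts[0] += batch_size - sum(counts)
    batch = []
    for pool, n in zip(pools, counts):
        batch.extend(random.sample(pool, n))
    random.shuffle(batch)
    return batch
```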
Office-31 We follow the same learning rate schedule adopted in throughout our experiment, except that we re-initialize the learning rate schedule after each update of the posterior estimation network (see Section 3.3 of the main text). For a mini-batch of input data, we fix the proportions of real source data, synthetic target data, real target data, and synthetic source data as , , , and , respectively, throughout our experiment. The threshold for EM training is set as . We choose through validation, following the same protocol as .
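A restartable learning rate schedule of this kind can be sketched as below. The cited schedule is elided in the text, so this assumes the inverse-decay form lr = lr0 / (1 + α·p)^β that is common in UDA work; the class and parameter values are illustrative only:

```python
def inverse_decay_lr(step, lr0=0.01, alpha=10.0, beta=0.75, total_steps=10000):
    """Inverse-decay schedule lr = lr0 / (1 + alpha * p) ** beta, where
    p in [0, 1] is training progress. ASSUMPTION: the paper's actual
    schedule is cited but elided; this is a common stand-in."""
    p = step / total_steps
    return lr0 / (1.0 + alpha * p) ** beta

class ResettableSchedule:
    """Restart the schedule whenever the posterior estimation network
    is updated, as described in the text."""
    def __init__(self, total_steps=10000):
        self.start = 0
        self.total = total_steps

    def reset(self, current_step):
        # called right after each posterior-network update
        self.start = current_step

    def lr(self, current_step):
        return inverse_decay_lr(current_step - self.start,
                                total_steps=self.total)
```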
In the Office-31 experiment, we do not penalize the distances between attention maps w.r.t. the first convolutional layer and the max-pooling layers of ResNet-50, for three reasons: 1) the attention of the first convolutional layer focuses on low-level details and is easily affected by noise; 2) the max-pooling layers have no parameters (their outputs are fully determined by the preceding convolutional layer), so it is unnecessary to additionally align their attention; 3) we empirically find that ignoring these layers during attention alignment causes no loss of accuracy while being more computationally efficient.
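A minimal sketch of attention alignment with skipped layers is given below. It assumes Eq. (4) aggregates each C × H × W feature tensor by summing squared activations over channels and then L2-normalising the flattened map (a common definition, e.g. in attention transfer); `attention_loss` and the `skip` argument are illustrative names, not the paper's API:

```python
import numpy as np

def attention_map(feature):
    """Collapse a C x H x W feature tensor into a normalised attention
    vector by summing squared activations over channels (assumed form
    of Eq. (4)), then L2-normalising the flattened map."""
    a = np.sum(feature ** 2, axis=0).flatten()
    return a / (np.linalg.norm(a) + 1e-12)

def attention_loss(src_feats, tgt_feats, skip=()):
    """Sum of L2 distances between paired attention maps, skipping the
    layer indices in `skip` (e.g. the first conv and max-pool layers)."""
    loss = 0.0
    for i, (fs, ft) in enumerate(zip(src_feats, tgt_feats)):
        if i in skip:
            continue
        loss += np.linalg.norm(attention_map(fs) - attention_map(ft))
    return loss
```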
7 Impact of Hyper-parameters
We investigate the impact of (i.e., the filtering threshold in EM) and (i.e., the strength of the attention alignment penalty) on the classification accuracy of the target model. The results are shown in Fig. A5.
Within a range, a larger leads to better accuracy, while as grows, the accuracy of the model first increases and then slightly decreases. For both and , we observe that over a wide range our method outperforms the previous state-of-the-art method by a large margin, which implies the superiority of our method.
| Variant | A → W | A → D | D → A | W → A | Average |
8 Comparison with Different Variants of Attention
We conduct experiments to verify the effectiveness of the attention defined by Eq. (4) of the main text. The comparison results are summarized in Table A5. Note that the attention mechanism defined in our method aggregates the feature maps along the channel dimension using the -norm; the - and -norm aggregation methods are compared in Table A5. We also compare our method with directly aligning the feature maps without any aggregation (denoted as "FM" in Table A5). We find that aligning the proposed attention performs considerably better than aligning the other variants, which verifies the effectiveness of the attention defined by Eq. (4).
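The aggregation variants compared above can be sketched as one small function. The exact norms are elided in the text, so this assumes the standard squared-activation (L2-style) and absolute-value (L1-style) channel sums, with "fm" keeping the raw feature maps; `aggregate` is an illustrative name:

```python
import numpy as np

def aggregate(feature, mode="l2"):
    """Channel-aggregation variants of the kind compared in Table A5,
    applied to a C x H x W feature tensor.

    'l2': sum of squared activations over channels (assumed Eq. (4) form)
    'l1': sum of absolute activations over channels
    'fm': no aggregation -- raw feature maps are aligned directly
    """
    if mode == "l2":
        return np.sum(feature ** 2, axis=0)
    if mode == "l1":
        return np.sum(np.abs(feature), axis=0)
    if mode == "fm":
        return feature
    raise ValueError(f"unknown aggregation mode: {mode}")
```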