Mutual Learning Network for Multi-Source Domain Adaptation

03/29/2020 ∙ by Zhenpeng Li, et al.

Early Unsupervised Domain Adaptation (UDA) methods have mostly assumed the setting of a single source domain, where all the labeled source data come from the same distribution. However, in practice the labeled data can come from multiple source domains with different distributions. In such scenarios, the single source domain adaptation methods can fail due to the existence of domain shifts across different source domains, and multi-source domain adaptation methods need to be designed. In this paper, we propose a novel multi-source domain adaptation method, the Mutual Learning Network for Multi-Source Domain Adaptation (ML-MSDA). Under the framework of mutual learning, the proposed method pairs the target domain with each single source domain to train a conditional adversarial domain adaptation network as a branch network, while taking the pair of the combined multi-source domain and the target domain to train a conditional adversarial adaptation network as the guidance network. The multiple branch networks are aligned with the guidance network to achieve mutual learning by enforcing JS-divergence regularization over their prediction probability distributions on the corresponding target data. We conduct extensive experiments on multiple multi-source domain adaptation benchmark datasets. The results show the proposed ML-MSDA method outperforms the comparison methods and achieves state-of-the-art performance.


1 Introduction

Deep neural networks have produced great advances for many computer vision tasks, including classification, detection and segmentation. Such success nevertheless depends on the availability of large amounts of labeled training data under the standard supervised learning setting. However, the labels are typically expensive and time-consuming to produce through manual effort. Domain adaptation aims to reduce the annotation cost by exploiting existing labeled data in auxiliary source domains. As the data in source domains can be collected with different equipment or in different environments, they may exhibit different distributions from the target domain data. Hence the main challenge of domain adaptation is to bridge the distribution divergence across domains and effectively transfer knowledge from the source domains to train prediction models in the target domain. A widely studied domain adaptation setting is unsupervised domain adaptation (UDA), where data in the source domain are labeled and data in the target domain are entirely unlabeled.

Figure 1: (a) Single source unsupervised domain adaptation (UDA) setting, where the source domain data all come from the same distribution. (b) Multi-source domain adaptation (MSDA) setting, where the source data are from different domains and hence have different distributions.

Early UDA methods assume the source domain data all come from the same source and have the same distribution, as shown in Figure 1(a). In practice, it is much easier to collect labeled data from multiple source domains with different distributions, as shown in Figure 1(b). For example, we can collect source domain data from live action movies, cartoons, hand-drawn pictures, etc. Exploiting data from multiple source domains has the potential of transferring more useful information to the target domain, and can be more beneficial in practical applications. Some recent multi-source domain adaptation (MSDA) methods have used shared feature extractors for different source domains [27, 20, 31]. The works in [27, 20] make predictions in the target domain by using weighted combinations of multiple source domain results, while the work in [31] trains a classifier for all source and target domains, but back-propagates only the minimum cross-domain training error among all source domains. These methods however fail to handle the distribution divergence between different source domains. In addition, it is difficult for these methods to bridge gaps between the target domain and the multiple source domains simultaneously, and negative optimization and transfer may occur [27]. Therefore, how to balance the distribution differences between source-source and source-target domains is key to developing effective MSDA methods.

In this paper, we propose a new approach for multi-source domain adaptation, namely the Mutual Learning Network for Multi-Source Domain Adaptation (ML-MSDA). As the multiple source domains have different distributions, ML-MSDA trains one separate conditional adversarial adaptation network, referred to as a branch network, to align each source domain with the target domain. In addition, it also trains a conditional adversarial adaptation network to align the combined source domain with the target domain, which is referred to as the guidance network. The guidance and branch networks share weights in the first few feature extraction layers, while the remaining layers are branch specific. We then propose to perform guidance network centered prediction alignment by enforcing JS-divergence regularization over the prediction probability distributions of target samples between the guidance network and each branch network, so that all networks can learn from each other and make similar predictions in the target domain. Such a mutual learning structure is expected to gather domain-specific information from each single source domain through the branch networks and gather complementary common information through the guidance network, aiming to improve both the information adaptation efficacy across domains and the robustness of network training.

The contribution of this paper is threefold. First, we propose a novel mutual learning network architecture for multi-source domain adaptation, which enables guidance network centered information sharing in the multi-source domain setting. Second, we develop a novel dual alignment mechanism at both the feature and prediction levels: conditional adversarial feature alignment across each pair of source and target domains, and centered prediction alignment between each branch network and the guidance network. Third, we conduct experiments on multiple benchmark datasets and demonstrate the superiority of the proposed method over the state-of-the-art UDA and MSDA methods.

2 Related Work

Unsupervised Domain Adaptation with Single Source Domain. Unsupervised domain adaptation (UDA) addresses the problem of exploiting labeled data from a source domain to train prediction models for a target domain where all the data instances are unlabeled. UDA research has mostly focused on the single source domain setting, where the labeled source data are collected from the same source and hence have the same distribution. The key to solving UDA problems lies in eliminating or mitigating the domain shift between the source and target domains. Many works have exploited distribution distance metrics, such as Maximum Mean Discrepancy (MMD) and Kullback-Leibler (KL) divergence, to reduce the gaps between the statistical distributions of the source and target domains [16, 28, 15, 25, 23]. Some recent works have adopted an adversarial learning based DA mechanism [7, 12], which aligns the feature distributions through a minimax adversarial game between the feature extractor and the domain discriminator. Following the adversarial mechanism, the networks can learn domain-invariant features across the source and target domains, and generate source or target data [12, 1, 11]. In [14], the authors further adopted conditional adversarial learning for unsupervised domain adaptation. The teacher-student (T-S) learning mechanism has also been used for unsupervised domain adaptation [5, 18, 10]. In [5], the teacher network is updated as an exponential moving average of the student network, while the prediction difference on unlabeled data between the student and teacher networks is penalized. In [18, 10], the teacher network is trained on the source domain and the student network is trained on the target domain, while the teacher network is used to "teach" the student network on unlabeled parallel data that connect the two domains.

Multiple Source Unsupervised Domain Adaptation. In practice, we can obtain labeled training data from multiple source domains with different distributions. Directly applying single domain UDA methods cannot work well in this case, as they fail to address the differences between the multiple source domains. To address the domain shift between the multiple source domains, FA [3] concatenates the extra features of each source domain to induce properties shared between each source domain and the target domain. A-SVM [29] ensembles the multiple source-specific classifiers. The Domain Adaptation Machine [4] integrates domain-related regularization terms to train a set of source classifiers and makes the target classifier share similar decision values with them. CP-MDA [2] computes weight values for the classifier of each source domain and uses conditional distributions to combine them. DCTN [27] deploys a domain discriminator and a category classifier for each source domain and uses the loss of each discriminator to calculate the weight of each classifier. M3SDA [20] utilizes moment matching to directly match all distributions of the source and target domains. MDAN [31] uses adversarial adaptation to induce invariant features for all pairs of source-target domains. Different from these related works, our proposed approach ML-MSDA introduces a new mutual learning network architecture that has one guidance network and multiple branch networks. It exploits each source domain for domain adaptation in both a domain-specific manner (through the branch networks) and a domain-ensemble manner (through the guidance network).

Figure 2: The framework of the proposed Mutual Learning network. For N source domains, it has N branch networks and one guidance network (the bottom one). For each branch network, the corresponding source domain data and the target domain data are used as inputs. The combined multi-source domain data and the target domain data are used as inputs for the guidance network. All these subnetworks have the same structure with three components: feature extractor, domain discriminator, and category classifier. The classification losses $\mathcal{L}_{cls}$ and $\mathcal{L}_{ent}$ and the adversarial alignment loss $\mathcal{L}_{adv}$ are applied to each subnetwork. A prediction misalignment loss $\mathcal{L}_{js}$ is applied between each branch network and the guidance network.

3 Mutual Learning Network for MSDA

We consider the following multi-source domain adaptation setting. Assume we have $N$ source domains $\{\mathcal{S}_j\}_{j=1}^{N}$ and one target domain $\mathcal{T}$. The multiple source domains and the target domain all have different input distributions. For each source domain, all the input images are labeled, such that $\mathcal{S}_j = \{(x_i^{s_j}, y_i^{s_j})\}_{i=1}^{n_{s_j}}$, where $x_i^{s_j}$ denotes the $i$-th input image and $y_i^{s_j}$ denotes the corresponding label indicator vector. For the target domain, the labels of the images are unavailable, such that $\mathcal{T} = \{x_i^t\}_{i=1}^{n_t}$.

In this section, we present a novel mutual learning network model for MSDA. The proposed approach is termed the Mutual Learning network for Multi-Source Domain Adaptation (ML-MSDA). The framework of ML-MSDA is presented in Figure 2. In this learning framework, we aim to exploit both the domain-specific adaptation information from each source domain and the combined adaptation information in the multiple source domains. We build $N+1$ subnetworks for domain adaptation. The first $N$ subnetworks perform domain adaptation from each corresponding single source domain to the target domain, while the $(N+1)$-th subnetwork performs domain adaptation from the combined multi-source domain to the target domain. As the combined multi-source domain contains more information than each single domain, it can reinforce the common information shared across the multiple source domains. We hence use the $(N+1)$-th subnetwork as a guidance network and the first $N$ subnetworks as branch networks in our proposed mutual learning framework.

For each branch network, the corresponding source domain data and the target domain data are used as inputs. The combined multi-source domain data and the target domain data are used as inputs for the guidance network. All these subnetworks have the same structure with three components: feature extractor $F_j$, domain discriminator $D_j$, and category classifier $C_j$. The parameters of the first few layers in the feature extractors are shared across all the subnetworks to enable common low-level feature extraction, while the remaining layers are separated to capture source-domain specific information. For each subnetwork, the input data first go through the feature extractor to produce high-level features. Source domain dependent conditional adversarial feature alignment is then conducted to align the feature distributions between each specific source domain (or the combined source domain) and the target domain, using a separate domain discriminator as an adversary under an adversarial loss $\mathcal{L}_{adv}^j$. The classifiers predict the class labels of the input samples based on the aligned features with classification losses $\mathcal{L}_{cls}$ and $\mathcal{L}_{ent}$, while mutual learning is conducted by enforcing prediction distribution alignment between each branch network and the guidance network on corresponding samples through a prediction misalignment loss $\mathcal{L}_{js}$. Below we present these loss terms.
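To make the architecture concrete, below is a minimal PyTorch sketch of one adaptation subnetwork and the $N+1$ subnetwork layout. The class names, layer sizes, and the toy digit-sized trunk are our own illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Subnetwork(nn.Module):
    """One adaptation subnetwork: feature extractor F_j, classifier C_j, discriminator D_j."""
    def __init__(self, shared_trunk, feat_dim=256, num_classes=10):
        super().__init__()
        self.shared = shared_trunk                        # low-level layers shared by all subnetworks
        self.private = nn.Sequential(                     # branch-specific high-level layers
            nn.Linear(feat_dim, feat_dim), nn.ReLU())
        self.classifier = nn.Linear(feat_dim, num_classes)        # category classifier C_j
        self.discriminator = nn.Sequential(                       # domain discriminator D_j,
            nn.Linear(feat_dim * num_classes, 100), nn.ReLU(),    # fed the conditioned feature T(f, p)
            nn.Linear(100, 1))

    def forward(self, x):
        f = self.private(self.shared(x))                  # high-level feature f = F_j(x)
        return f, self.classifier(f)                      # logits; p = softmax(logits) = C_j(F_j(x))

# N branch subnetworks plus one guidance subnetwork, all sharing the same low-level trunk.
N = 4
shared = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU())  # toy trunk for 28x28 digits
subnets = nn.ModuleList([Subnetwork(shared) for _ in range(N + 1)])
```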

3.1 Conditional Adversarial Feature Alignment

We propose to deploy conditional adversarial domain adaptation to align the feature distributions between the source domain and the target domain and induce domain-invariant features. As stated above, all the adaptation subnetworks share the same structure; hence the conditional adversarial feature alignment is conducted in the same manner for different subnetworks. The fundamental difference is that different subnetworks use different source domain data as input, and the adversarial alignment results will be source domain dependent. Here we take the $j$-th subnetwork as an example to present the conditional adversarial feature alignment adopted in the proposed model.

The main idea of adversarial domain adaptation is to adopt the adversarial learning principle of generative adversarial networks in the domain adaptation setting by introducing an adversarial domain discriminator [7]. For the $j$-th subnetwork, this implies playing a minimax game between the feature extractor $F_j$ and the domain discriminator $D_j$, where $D_j$ tries to maximally distinguish the source domain data from the target domain data and $F_j$ tries to maximally deceive the discriminator.

Moreover, although we would like to reduce the domain divergence, it is also important to preserve the discriminability of the induced features for the final classification task. We hence take the classifier's label prediction results into account and perform conditional adversarial domain adaptation with the following adversarial loss:

$$\mathcal{L}_{adv}^j = \mathbb{E}_{x \sim \mathcal{S}_j}\big[\log D_j(T(f, p))\big] + \mathbb{E}_{x \sim \mathcal{T}}\big[\log\big(1 - D_j(T(f, p))\big)\big] \qquad (1)$$

where $f = F_j(x)$ is the extracted feature and $p$ is the prediction probability vector produced by the classifier on image $x$, such that

$$p = C_j(F_j(x)). \qquad (2)$$

For a $K$-class classification problem, $p$ will be a length-$K$ vector with each entry indicating the probability of $x$ belonging to the corresponding class category. $T(f, p)$ denotes the conditioning strategy function. For simplicity, one can use a simple concatenation $T(f, p) = [f, p]$. In this work, we used the multilinear conditioning function $T(f, p) = f \otimes p$ proposed in [14], as it can capture the cross covariance between feature representations and classifier predictions to help preserve the discriminability of the features.
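As an illustration, here is a hedged PyTorch sketch of this conditional adversarial loss. The binary cross-entropy form and the gradient-reversal layer are common implementation choices for the minimax of Eq. (1), not necessarily the authors' exact recipe, and all function names are ours.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def multilinear_map(f, p):
    """Multilinear conditioning T(f, p) = f (outer product) p: (B, d) x (B, K) -> (B, d*K)."""
    return torch.bmm(f.unsqueeze(2), p.unsqueeze(1)).flatten(1)

def conditional_adv_loss(discriminator, f_src, p_src, f_tgt, p_tgt):
    """BCE form of Eq. (1): minimizing this trains D to separate source (label 1) from
    target (label 0), while the reversed gradient trains F to fool the discriminator."""
    h_src = multilinear_map(GradReverse.apply(f_src), p_src.detach())  # p detached for simplicity
    h_tgt = multilinear_map(GradReverse.apply(f_tgt), p_tgt.detach())
    logit_s, logit_t = discriminator(h_src), discriminator(h_tgt)
    return (F.binary_cross_entropy_with_logits(logit_s, torch.ones_like(logit_s)) +
            F.binary_cross_entropy_with_logits(logit_t, torch.zeros_like(logit_t)))
```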

Finally, the overall adversarial loss from all the subnetworks can be computed as:

$$\mathcal{L}_{adv} = \sum_{j=1}^{N+1} \mathcal{L}_{adv}^j \qquad (3)$$

3.2 Semi-Supervised Prediction Loss

Following the structure of ML-MSDA in Figure 2, the extracted domain-invariant features in each subnetwork serve as input to the classifier $C_j$. For the labeled images from the source domain, we can use the supervised cross-entropy loss to perform training:

$$\mathcal{L}_{cls}^j = -\,\mathbb{E}_{(x, y) \sim \mathcal{S}_j}\big[\, y^\top \log C_j(F_j(x)) \,\big] \qquad (4)$$

For the unlabeled data from the target domain, we use an unsupervised entropy loss to include them in the classifier training:

$$\mathcal{L}_{ent}^j = -\,\mathbb{E}_{x \sim \mathcal{T}}\big[\, C_j(F_j(x))^\top \log C_j(F_j(x)) \,\big] \qquad (5)$$

The assumption is that if the source and target domains are well aligned, the classifier trained on the labeled source images should be able to make confident predictions on the target domain images, and hence have small prediction entropy values. Therefore we expect this entropy loss to help bridge the domain divergence and induce discriminative features.
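A minimal sketch of these two prediction losses, assuming the classifier outputs raw logits and the source labels are given as class indices:

```python
import torch
import torch.nn.functional as F

def cls_loss(logits_src, labels_src):
    """Eq. (4): supervised cross-entropy on labeled source images."""
    return F.cross_entropy(logits_src, labels_src)

def ent_loss(logits_tgt, eps=1e-8):
    """Eq. (5): mean entropy of the classifier's predictions on unlabeled target images."""
    p = torch.softmax(logits_tgt, dim=1)
    return -(p * torch.log(p + eps)).sum(dim=1).mean()
```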

3.3 Guidance Network Centered Mutual Learning

With the adversarial feature alignment in each branch network, the target domain is aligned with each source domain separately. Due to the existence of domain shift among the multiple source domains, the domain-invariant features extracted and the consequent classifier trained in one subnetwork will differ from those in another subnetwork. Nevertheless, under effective domain adaptation, the divergence between each subnetwork's prediction results on the target domain data and the true labels should be small. Since all subnetworks share the same target domain, this implies the prediction results of all the subnetworks in the target domain should be consistent. Under this assumption, in order to improve the generalization performance of the model and increase the robustness of network training, we propose to conduct mutual learning over all the subnetworks by minimizing their prediction inconsistency in the shared target domain.

As the guidance network uses data from all the source domains as a combined domain, it contains more transferable information than each branch network. Hence we propose to enforce prediction consistency by aligning each branch network with the guidance network in terms of the predicted label distribution for each target instance. Specifically, we can use the Kullback-Leibler (KL) divergence to align the predicted label probability vector for each target domain instance from the $j$-th branch network with the predicted label probability vector for the same instance from the guidance network; that is,

$$D_{KL}\big(p_i^j \,\|\, p_i^{N+1}\big) = \sum_{c=1}^{K} p_i^j(c) \log \frac{p_i^j(c)}{p_i^{N+1}(c)} \qquad (6)$$

where $p_i^j$ and $p_i^{N+1}$ are the predicted label probability vectors for the $i$-th instance in the target domain produced by the $j$-th branch network and the guidance network respectively. Since the KL divergence metric is asymmetric, we use a symmetric Jensen-Shannon (JS) divergence loss [30] instead, which leads to the following overall prediction inconsistency loss:

$$\mathcal{L}_{js} = \sum_{j=1}^{N} \mathbb{E}_{x_i \sim \mathcal{T}} \Big[ \tfrac{1}{2} D_{KL}\big(p_i^j \,\|\, m_i^j\big) + \tfrac{1}{2} D_{KL}\big(p_i^{N+1} \,\|\, m_i^j\big) \Big], \qquad m_i^j = \tfrac{1}{2}\big(p_i^j + p_i^{N+1}\big) \qquad (7)$$

This loss regularizes the prediction inconsistency on the target domain instances across the multiple subnetworks.
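The JS regularizer between one branch network and the guidance network can be sketched as follows; the epsilon term is our own numerical-stability assumption.

```python
import torch

def js_loss(p_branch, p_guide, eps=1e-8):
    """Symmetric JS divergence of Eq. (7) between two batches of prediction probabilities."""
    m = 0.5 * (p_branch + p_guide)                 # midpoint distribution m
    kl = lambda a, b: (a * (torch.log(a + eps) - torch.log(b + eps))).sum(dim=1)
    return (0.5 * (kl(p_branch, m) + kl(p_guide, m))).mean()
```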

3.4 Overall Learning Problem and Prediction

By integrating the prediction losses, the adversarial feature alignment loss, and the prediction inconsistency loss together, we have the following overall adversarial learning problem:

$$\min_{\mathcal{F}, \mathcal{C}} \max_{\mathcal{D}} \; \sum_{j=1}^{N+1} \big( \mathcal{L}_{cls}^j + \alpha\, \mathcal{L}_{ent}^j \big) + \beta\, \mathcal{L}_{adv} + \gamma\, \mathcal{L}_{js} \qquad (8)$$

where $\alpha$, $\beta$ and $\gamma$ are trade-off hyperparameters; $\mathcal{F}$, $\mathcal{C}$ and $\mathcal{D}$ denote the sets of feature extractors, classifiers and domain discriminators respectively. This training problem can be solved using standard stochastic gradient descent algorithms by performing min-max adversarial updates.
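Putting the pieces together, one illustrative training step for Eq. (8) might look as follows, reusing the loss sketches above. Since the gradient-reversal layer inside conditional_adv_loss realizes the min-max, a single backward pass updates all components; this update scheme is our own simplification.

```python
import torch

def train_step(subnets, src_batches, src_labels, x_tgt, alpha, beta, gamma, optimizer):
    """One SGD step on Eq. (8) over all N+1 subnetworks (sketch, not the authors' code)."""
    N = len(subnets) - 1
    x_comb, y_comb = torch.cat(src_batches), torch.cat(src_labels)   # combined source domain
    total, target_probs = 0.0, []
    for j, net in enumerate(subnets):
        # branch j sees source domain j; the last (guidance) network sees all sources
        xs, ys = (src_batches[j], src_labels[j]) if j < N else (x_comb, y_comb)
        f_s, logit_s = net(xs)
        f_t, logit_t = net(x_tgt)
        total = total + cls_loss(logit_s, ys) + alpha * ent_loss(logit_t)     # Eqs. (4)-(5)
        total = total + beta * conditional_adv_loss(net.discriminator, f_s,   # Eqs. (1), (3)
                                                    logit_s.softmax(1), f_t, logit_t.softmax(1))
        target_probs.append(logit_t.softmax(1))
    for j in range(N):                               # guidance-centered mutual learning, Eq. (7)
        total = total + gamma * js_loss(target_probs[j], target_probs[N])
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return float(total)
```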

After training, we obtain $N+1$ classifiers from the model. We use these classifiers to predict the labels of the unlabeled target domain instances in a guidance network centered ensemble manner. For the $i$-th instance in the target domain, the ensemble prediction probability result is:

$$\hat{p}_i = \frac{1}{2}\Big( p_i^{N+1} + \frac{1}{N} \sum_{j=1}^{N} p_i^j \Big) \qquad (9)$$

where the prediction from the guidance network is given equal weight to the average prediction result of the $N$ branch networks.
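A sketch of this guidance-centered ensemble inference, under the subnetwork interface assumed in the earlier sketches:

```python
import torch

@torch.no_grad()
def ensemble_predict(subnets, x_tgt):
    """Eq. (9): the guidance prediction gets equal weight to the mean branch prediction."""
    probs = [net(x_tgt)[1].softmax(dim=1) for net in subnets]  # forward returns (feature, logits)
    branch_avg = torch.stack(probs[:-1]).mean(dim=0)           # average over the N branch networks
    p_hat = 0.5 * (probs[-1] + branch_avg)                     # guidance network is probs[-1]
    return p_hat.argmax(dim=1)
```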

Standards | Models | mt,up,sv,sy→mm | mm,up,sv,sy→mt | mm,mt,sv,sy→up | mm,mt,up,sy→sv | mm,mt,up,sv→sy | Avg
--- | --- | --- | --- | --- | --- | --- | ---
Source Combine | Source Only | 63.70±0.83 | 92.30±0.91 | 90.71±0.54 | 71.51±0.75 | 83.44±0.79 | 80.33±0.76
 | DAN [13] | 67.87±0.75 | 97.50±0.62 | 93.49±0.85 | 67.80±0.84 | 86.93±0.93 | 82.72±0.79
 | DANN [6] | 70.81±0.94 | 97.90±0.83 | 93.47±0.79 | 68.50±0.85 | 87.37±0.68 | 83.61±0.82
Multi-Source | Source Only | 63.37±0.74 | 90.50±0.83 | 88.71±0.89 | 63.54±0.93 | 82.44±0.65 | 77.71±0.81
 | DAN [13] | 63.78±0.71 | 96.31±0.54 | 94.24±0.87 | 62.45±0.72 | 85.43±0.77 | 80.44±0.72
 | CORAL [22] | 62.53±0.69 | 97.21±0.83 | 93.45±0.82 | 64.40±0.72 | 82.77±0.69 | 80.07±0.75
 | DANN [6] | 71.30±0.56 | 97.60±0.75 | 92.33±0.85 | 63.48±0.79 | 85.34±0.84 | 82.01±0.76
 | JAN [16] | 65.88±0.68 | 97.21±0.73 | 95.42±0.77 | 75.27±0.71 | 86.55±0.64 | 84.07±0.71
 | ADDA [24] | 71.57±0.52 | 97.89±0.84 | 92.83±0.74 | 75.48±0.48 | 86.45±0.62 | 84.84±0.64
 | DCTN [27] | 70.53±1.24 | 96.23±0.82 | 92.81±0.27 | 77.61±0.41 | 86.77±0.48 | 84.79±0.27
 | MEDA [26] | 71.31±0.75 | 96.47±0.78 | 97.01±0.82 | 78.45±0.77 | 84.62±0.79 | 85.60±0.78
 | MCD [21] | 72.50±0.67 | 96.21±0.81 | 95.33±0.74 | 78.89±0.78 | 87.47±0.65 | 86.10±0.73
 | M3SDA [20] | 69.76±0.86 | 98.58±0.47 | 95.23±0.79 | 78.56±0.95 | 87.56±0.53 | 86.13±0.64
 | M3SDA-β [20] | 72.82±1.13 | 98.43±0.68 | 96.14±0.81 | 81.32±0.86 | 89.58±0.56 | 87.65±0.75
 | ML-MSDA (ours) | 96.62±0.15 | 99.37±0.06 | 98.29±0.13 | 70.27±0.64 | 88.52±1.29 | 90.68±0.46

Table 1: Test results on Digit Recognition. The average classification accuracy of the proposed approach is 90.68%, which is 3.03% higher than the best comparison method.

4 Experiments

To investigate the effectiveness of the proposed approach, we conducted experiments on three well-known multi-source domain adaptation benchmark datasets: the Digit-five dataset, the Office-Caltech10 dataset and the DomainNet dataset. We compared the proposed ML-MSDA with state-of-the-art UDA and MSDA methods, and report the comparison results in this section.

Implementation Details. The experiments are conducted using PyTorch. For the proposed ML-MSDA, we set the trade-off hyperparameters ($\alpha$, $\beta$, $\gamma$) as (5, 5, 0.5) respectively. We define one pass of training over all samples of the combined-source domain as an epoch. The learning rate is set as 0.01 for the first 10 epochs and as 0.001 for the following 10 epochs. After the first 20 epochs, the learning rate is set as 0.0001. In the experiments on Digit Recognition, each batch is composed of 256 samples. On Office-Caltech10 and DomainNet, we set the batch size as 20 due to the large size of the images.
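For reference, this stepwise learning-rate schedule can be expressed with a standard PyTorch scheduler; the placeholder model and plain SGD below are our assumptions, as the paper does not state the optimizer.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # placeholder for the full ML-MSDA model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # 0.01 for epochs 0-9
# Decay by 10x at epochs 10 and 20: 0.001 for epochs 10-19, 0.0001 afterwards.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10, 20], gamma=0.1)

for epoch in range(30):
    ...  # one epoch = one pass over all samples of the combined-source domain
    scheduler.step()
```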

4.1 Experiments on Digit Recognition

The Digit Recognition (Digit-five) dataset consists of 10 classes of digit images sampled from five different datasets: mt (MNIST) [9], mm (MNIST-M) [9], sv (SVHN), up (USPS), and sy (Synthetic Digits) [6], which form five domains. Following the previous multi-source domain adaptation studies M3SDA [20] and DCTN [27], we randomly chose 25,000 images for training and 9,000 for testing in MNIST, MNIST-M, and SVHN. For the small datasets USPS and Synthetic Digits, we used all their training and testing samples. With these five datasets, five domain adaptation tasks are naturally formed by selecting one dataset as the target domain and using the others as the source domains in turn.

We compared the proposed ML-MSDA method with two state-of-the-art MSDA approaches, Moment Matching for Multi-Source Domain Adaptation (M3SDA) [20] and Deep Cocktail Network (DCTN) [27]. In addition, we also compared with a number of UDA methods, including Deep Adaptation Network (DAN) [13], Domain Adversarial Neural Network (DANN) [6], Correlation Alignment (CORAL) [22], Joint Adaptation Network (JAN) [16], Adversarial Discriminative Domain Adaptation (ADDA) [24], Manifold Embedded Distribution Alignment (MEDA) [26], and Maximum Classifier Discrepancy (MCD) [21]. Following the experimental protocol of previous multi-source DA studies, for these single-source UDA methods we recorded the averages of their multiple single-source domain adaptation results under the multi-source setting.

Following the backbone network setting of [20], we used three conv layers and two fc layers as the feature extractor and a single fc layer as the category classifier. As the model is small, we did not use weight sharing across the different branches. The same backbone network was used in all the experiments. We repeated each experiment five times, and report the mean and standard deviation of the test accuracy in the target domain.

The comparison results are reported in Table 1. We can see ML-MSDA outperforms all the other methods on three out of the five domain adaptation tasks. The average test accuracy of the proposed ML-MSDA method across the five domain adaptation tasks is 90.68%, which outperforms the best alternative multi-source domain adaptation method, M3SDA-β, and all the other comparison methods with notable performance gains. These results suggest the proposed mutual learning network model is very effective.

4.2 Experiments on Office-Caltech10

The Office-Caltech10 dataset [8] is collected from four different domains: A (Amazon), C (Caltech), W (Webcam) and D (DSLR). It consists of 10 object categories, and the four domains contain 958, 295, 157, and 1,123 images, respectively. On this dataset, four domain adaptation tasks are constructed by using one domain as the target domain in turn and the others as source domains.

We compared the results produced by the proposed ML-MSDA method with those of a number of state-of-the-art domain adaptation methods, including DAN [13], DCTN [27], JAN [16], MEDA [26], MCD [21] and M3SDA [20]. For fair comparison, we used ResNet101 pre-trained on ImageNet as the backbone network in all the experiments. For ML-MSDA, the weights of the conv1, conv2 and conv3 stages are shared among the guidance network and all branch networks, while each network trains its conv4 and conv5 stages separately.

The comparison results on Office-Caltech10 are reported in Table 2. We can see that on this small dataset, all domain adaptation methods work very well. Nevertheless, our proposed ML-MSDA consistently outperforms all other methods and achieves a 97.6% average accuracy.

Standards | Models | A,C,D→W | A,C,W→D | A,D,W→C | C,D,W→A | Avg
--- | --- | --- | --- | --- | --- | ---
Source Combine | Source only | 99 | 98.3 | 87.8 | 86.1 | 92.8
 | DAN [13] | 99.3 | 98.2 | 89.7 | 94.8 | 95.5
Multi-Source | Source only | 99.1 | 98.2 | 85.4 | 88.7 | 92.9
 | DAN [13] | 99.5 | 99.1 | 89.2 | 91.6 | 94.8
 | DCTN [27] | 99.4 | 99 | 90.02 | 92.7 | 95.3
 | JAN [16] | 99.4 | 99.4 | 91.2 | 91.8 | 95.5
 | MEDA [26] | 99.3 | 99.2 | 91.4 | 92.9 | 95.7
 | MCD [21] | 99.5 | 99.1 | 91.5 | 92.1 | 95.6
 | M3SDA [20] | 99.4 | 99.2 | 91.5 | 94.1 | 96.1
 | M3SDA-β [20] | 99.5 | 99.2 | 92.2 | 94.5 | 96.4
 | ML-MSDA (ours) | 100 | 100 | 94.7 | 95.7 | 97.6

Table 2: Results on Office-Caltech10. The average classification accuracy of the proposed approach is 97.6%, which is 1.2% higher than the best comparison result.
 | clp | inf | pnt | qdr | rel | skt | Total
--- | --- | --- | --- | --- | --- | --- | ---
Train | 34,019 | 37,087 | 52,867 | 120,750 | 122,563 | 49,115 | 416,401
Test | 14,818 | 16,114 | 22,892 | 51,750 | 52,764 | 21,271 | 179,609
Total | 48,837 | 53,201 | 75,759 | 172,500 | 175,327 | 70,386 | 596,010
Per-Class | 141 | 154 | 219 | 500 | 508 | 204 | 1,728

Table 3: Details of the DomainNet dataset. The train/test split ratio is 70%/30%.
Standards | Models | →clp | →inf | →pnt | →qdr | →rel | →skt | Avg
--- | --- | --- | --- | --- | --- | --- | --- | ---
Single Best | Source Only | 39.6±0.58 | 8.2±0.75 | 33.9±0.62 | 11.8±0.69 | 41.6±0.84 | 23.1±0.72 | 26.4±0.70
 | DAN [13] | 39.1±0.51 | 11.4±0.81 | 33.3±0.62 | 16.2±0.38 | 42.1±0.73 | 29.7±0.93 | 28.6±0.63
 | RTN [15] | 35.3±0.73 | 10.7±0.61 | 31.7±0.82 | 13.1±0.68 | 40.6±0.55 | 26.6±0.78 | 26.3±0.70
 | JAN [16] | 35.3±0.71 | 9.1±0.63 | 32.5±0.65 | 14.3±0.62 | 43.1±0.78 | 25.7±0.61 | 26.7±0.67
 | DANN [6] | 37.9±0.69 | 11.4±0.91 | 33.9±0.60 | 13.7±0.56 | 41.5±0.67 | 28.6±0.63 | 27.8±0.68
 | ADDA [24] | 39.5±0.81 | 14.5±0.69 | 29.1±0.78 | 14.9±0.54 | 41.9±0.82 | 30.7±0.68 | 28.4±0.72
 | SE [5] | 31.7±0.70 | 12.9±0.58 | 19.9±0.75 | 7.7±0.44 | 33.4±0.56 | 26.3±0.50 | 22.0±0.66
 | MCD [21] | 42.6±0.32 | 19.6±0.76 | 42.6±0.98 | 3.8±0.64 | 50.5±0.43 | 33.8±0.89 | 32.2±0.66
Source Combine | Source only | 47.6±0.52 | 13.0±0.41 | 38.1±0.45 | 13.3±0.39 | 51.9±0.85 | 33.7±0.54 | 32.9±0.54
 | DAN [13] | 45.4±0.49 | 12.8±0.86 | 36.2±0.58 | 15.3±0.37 | 48.6±0.72 | 34.0±0.54 | 32.1±0.59
 | RTN [15] | 44.2±0.57 | 12.6±0.73 | 35.3±0.59 | 14.6±0.76 | 48.4±0.67 | 31.7±0.73 | 31.1±0.68
 | JAN [16] | 40.9±0.43 | 11.1±0.61 | 35.4±0.50 | 12.1±0.67 | 45.8±0.59 | 32.3±0.63 | 29.6±0.57
 | DANN [6] | 45.5±0.59 | 13.1±0.72 | 37.0±0.69 | 13.2±0.77 | 48.9±0.65 | 31.8±0.62 | 32.6±0.68
 | ADDA [24] | 47.5±0.76 | 11.4±0.67 | 36.7±0.53 | 14.7±0.50 | 49.1±0.82 | 33.5±0.49 | 32.2±0.63
 | SE [5] | 24.7±0.32 | 3.9±0.47 | 12.7±0.35 | 7.1±0.46 | 22.8±0.51 | 9.1±0.49 | 16.1±0.43
 | MCD [21] | 54.3±0.64 | 22.1±0.70 | 45.7±0.63 | 7.6±0.49 | 58.4±0.65 | 43.5±0.57 | 38.5±0.61
Multi-Source | DCTN [27] | 48.6±0.73 | 23.5±0.59 | 48.8±0.63 | 7.2±0.46 | 53.5±0.56 | 47.3±0.47 | 38.2±0.57
 | M3SDA [20] | 57.2±0.98 | 24.2±1.21 | 51.6±0.44 | 5.2±0.45 | 61.6±0.89 | 49.6±0.56 | 41.5±0.74
 | M3SDA-β [20] | 58.6±0.53 | 26.0±0.89 | 52.3±0.55 | 6.3±0.58 | 62.7±0.51 | 49.5±0.76 | 42.6±0.64
 | ML-MSDA (ours) | 61.4±0.79 | 26.2±0.41 | 51.9±0.20 | 19.1±0.31 | 57.0±1.04 | 50.3±0.67 | 44.3±0.57
Oracle Results | AlexNet | 65.5±0.56 | 27.7±0.34 | 57.6±0.49 | 68.0±0.55 | 72.8±0.67 | 56.3±0.59 | 58.0±0.53
 | ResNet101 | 69.3±0.37 | 34.5±0.42 | 66.3±0.67 | 66.8±0.51 | 80.0±0.59 | 60.7±0.48 | 63.0±0.51
 | ResNet152 | 71.0±0.63 | 36.1±0.61 | 68.1±0.49 | 69.1±0.52 | 81.3±0.49 | 65.2±0.57 | 65.1±0.55

Table 4: Results on the DomainNet dataset; each target column →X uses the remaining five domains as source domains (e.g., →clp denotes inf,pnt,qdr,rel,skt→clp). The proposed ML-MSDA produced the best average accuracy, 44.3%, among the domain adaptation methods.
Models | mt,up,sv,sy→mm | D,W,C→A | clp,inf,qdr,rel,skt→pnt | inf,pnt,qdr,rel,skt→clp
--- | --- | --- | --- | ---
ML-w/o condition-adv | 92.1 | 95.5 | 50.5 | 58.3
ML-w/o $\mathcal{L}_{ent}$ | 94.5 | 95.4 | 44.3 | 51.1
ML-w/o $\mathcal{L}_{js}$ | 91.9 | 94.1 | 43.3 | 56.9
ML-guidance-inf | 95.7 | 95.8 | 51.0 | 58.7
ML-branch-average-inf | 96.0 | 95.4 | 41.3 | 48.4
ML-MSDA (full) | 96.6 | 95.7 | 51.9 | 61.0

Table 5: Ablation study. Comparison of the proposed approach with its five variants.
Figure 3: The t-SNE visualization on Digit Recognition. The red, yellow, green, black and blue points are from domains mt, up, sv, sy and mm respectively. We used domains mt, up, sv, sy as source domains and mm as the target domain.
Figure 4: The t-SNE visualization on Office-Caltech10. The red, yellow, green and blue points represent data from domains D, W, C, A respectively. We used D, W, C as source domains and A as the target domain.

4.3 Experiments on DomainNet

The DomainNet dataset was introduced in [20]; it consists of six domains, namely clp (Clipart), inf (Infograph), pnt (Painting), qdr (Quickdraw), skt (Sketch) and rel (Real). Each domain has 345 classes of common objects. As shown in Table 3, there are 596,010 instances in the dataset in total, with about 1,728 instances per class. In our experiments, we chose 70% of each domain for training and 30% for testing. Benefiting from its large scale and wide variety, the DomainNet dataset overcomes the benchmark saturation issue of earlier domain adaptation datasets, which is of great significance to the study of domain adaptation.

We used the same comparison methods as in [20], including DAN [13], RTN [15], JAN [16], DANN [6], ADDA [24], SE [5], DCTN [27] and MCD [21]. Following the same setting as in [20], we used AlexNet as the backbone for DAN, JAN, DANN and RTN, and ResNet-101 as the backbone for M3SDA, DCTN, ADDA and MCD, while the backbone of SE is ResNet-152. Same as M3SDA, our proposed method uses ResNet-101 as the backbone. In our method, the weights of the conv1, conv2, conv3 and conv4 stages of all networks are shared.

The comparison results are reported in Table 4. From the table we can see that the average accuracy of our proposed method over the six multi-source domain adaptation tasks is 44.3%, which is 1.7% higher than the best result produced by the comparison methods. Moreover, it is worth noting that on the task clp,inf,pnt,rel,skt→qdr, our proposed method outperforms the other MSDA methods and the single-source DA methods with notable performance gains. The work of [20] explains that the reason multi-source methods perform poorly on this task is negative transfer [19]. This suggests our proposed method can alleviate the problem of negative transfer.

4.4 Further Analysis

Feature Visualization. In our experiments on Digit Recognition and Office-Caltech10, we visualized the feature distributions produced by the proposed ML-MSDA method to validate its efficacy. For comparison, we also visualized the results of the source-only baseline method, and a variant of ML-MSDA in which the JS-divergence is replaced by the KL-divergence.

For easy observation, we show the distribution of each source domain separately together with the target domain. Fig. 3 and Fig. 4 show the t-SNE [17] visualizations of mt,up,sv,sy→mm and D,W,C→A respectively. We can see that for the proposed full approach the points of the target domain are closely centered around the clusters of the source domains. This suggests our method can induce more transferable and discriminative features for the target domain.

Ablation Study. To further validate the efficacy of the proposed mutual learning network and investigate the contributions of its different components, we conducted an ablation study to compare the proposed full approach ML-MSDA with five of its variants: (1) ML-w/o condition-adv. This variant replaces the conditional adversarial feature alignment with standard adversarial feature alignment by dropping the prediction probability vector $p$ from $T(f, p)$. (2) ML-w/o $\mathcal{L}_{ent}$. This variant drops the unsupervised entropy loss $\mathcal{L}_{ent}$ from ML-MSDA. (3) ML-w/o $\mathcal{L}_{js}$. This variant drops the prediction inconsistency loss term $\mathcal{L}_{js}$, the mutual learning term, from ML-MSDA. (4) ML-guidance-inf. This variant performs training in the same way as ML-MSDA, but uses only the guidance network for inference in the testing phase. (5) ML-branch-average-inf. This variant performs training in the same way as ML-MSDA, but drops the guidance network and uses the average of the branch networks for inference in the testing phase. The comparison is conducted on four of the previously used multi-source domain adaptation tasks and the results are reported in Table 5. We can see all the variants produced inferior results compared with the full ML-MSDA, which suggests the components investigated, i.e., the entropy loss, the conditional adversary, the mutual learning regularization, and the ensemble inference, are non-trivial for the proposed approach. In particular, the variant ML-w/o $\mathcal{L}_{js}$ leads to a remarkable performance degradation, which suggests the mutual learning regularization term is very important for the proposed ML-MSDA. Moreover, the results also show that it is beneficial to use both the guidance network and the branch networks in the testing phase.

5 Conclusion

In this paper, we proposed a novel mutual learning network, ML-MSDA, for multi-source unsupervised domain adaptation. It builds one adversarial adaptation branch network for each source-target domain pair and a guidance adversarial adaptation network for the combined multi-source and target domain pair. A mutual learning strategy is deployed to train these subnetworks simultaneously by enforcing prediction consistency between the branch networks and the guidance network in the target domain. We conducted experiments on a number of benchmark datasets, where the proposed ML-MSDA demonstrated superior performance over the state-of-the-art comparison methods.

References

  • [1] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan (2017) Unsupervised pixel-level domain adaptation with generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [2] R. Chattopadhyay, J. Ye, S. Panchanathan, W. Fan, and I. Davidson (2011) Multi-source domain adaptation and its application to early detection of fatigue. In SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD).
  • [3] H. Daume III (2007) Frustratingly easy domain adaptation. In Annual Meeting of the Association of Computational Linguistics (ACL).
  • [4] L. Duan, D. Xu, and I. W. Tsang (2012) Domain adaptation from multiple sources: a domain-dependent regularization approach. IEEE Transactions on Neural Networks and Learning Systems (TNNLS) 23 (3), pp. 504–518.
  • [5] G. French, M. Mackiewicz, and M. Fisher (2018) Self-ensembling for visual domain adaptation. In International Conference on Learning Representations (ICLR).
  • [6] Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (ICML).
  • [7] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. Journal of Machine Learning Research (JMLR) 17 (Jan), pp. 2096–2030.
  • [8] B. Gong, Y. Shi, F. Sha, and K. Grauman (2012) Geodesic flow kernel for unsupervised domain adaptation. In IEEE International Conference on Computer Vision (ICCV).
  • [9] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
  • [10] J. Li, M. L. Seltzer, X. Wang, R. Zhao, and Y. Gong (2017) Large-scale domain adaptation via teacher-student learning. In Conference of the International Speech Communication Association (Interspeech).
  • [11] Y. Li, N. Wang, J. Liu, and X. Hou (2017) Demystifying neural style transfer. In International Joint Conference on Artificial Intelligence (IJCAI).
  • [12] M. Liu and O. Tuzel (2016) Coupled generative adversarial networks. In International Conference on Neural Information Processing Systems (NIPS).
  • [13] M. Long, Y. Cao, J. Wang, and M. I. Jordan (2015) Learning transferable features with deep adaptation networks. In International Conference on Machine Learning (ICML).
  • [14] M. Long, Z. Cao, J. Wang, and M. I. Jordan (2018) Conditional adversarial domain adaptation. In International Conference on Neural Information Processing Systems (NIPS).
  • [15] M. Long, H. Zhu, J. Wang, and M. I. Jordan (2016) Unsupervised domain adaptation with residual transfer networks. In International Conference on Neural Information Processing Systems (NIPS).
  • [16] M. Long, H. Zhu, J. Wang, and M. I. Jordan (2017) Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning (ICML).
  • [17] L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research (JMLR) 9 (Nov), pp. 2579–2605.
  • [18] V. Manohar, P. Ghahremani, D. Povey, and S. Khudanpur (2018) A teacher-student learning approach for unsupervised domain adaptation of sequence-trained ASR models. In IEEE Spoken Language Technology Workshop (SLT).
  • [19] S. J. Pan and Q. Yang (2009) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering (TKDE) 22 (10), pp. 1345–1359.
  • [20] X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang (2019) Moment matching for multi-source domain adaptation. In IEEE International Conference on Computer Vision (ICCV).
  • [21] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada (2018) Maximum classifier discrepancy for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3723–3732.
  • [22] B. Sun, J. Feng, and K. Saenko (2016) Return of frustratingly easy domain adaptation. In AAAI Conference on Artificial Intelligence (AAAI).
  • [23] B. Sun and K. Saenko (2016) Deep CORAL: correlation alignment for deep domain adaptation. In European Conference on Computer Vision (ECCV).
  • [24] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [25] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell (2014) Deep domain confusion: maximizing for domain invariance. arXiv preprint arXiv:1412.3474.
  • [26] J. Wang, W. Feng, Y. Chen, H. Yu, M. Huang, and P. S. Yu (2018) Visual domain adaptation with manifold embedded distribution alignment. In ACM International Conference on Multimedia (ACM MM).
  • [27] R. Xu, Z. Chen, W. Zuo, J. Yan, and L. Lin (2018) Deep cocktail network: multi-source unsupervised domain adaptation with category shift. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [28] H. Yan, Y. Ding, P. Li, Q. Wang, Y. Xu, and W. Zuo (2017) Mind the class weight bias: weighted maximum mean discrepancy for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [29] J. Yang, R. Yan, and A. G. Hauptmann (2007) Cross-domain video concept detection using adaptive SVMs. In ACM International Conference on Multimedia (ACM MM).
  • [30] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu (2018) Deep mutual learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [31] H. Zhao, S. Zhang, G. Wu, J. P. Costeira, J. M. Moura, and G. J. Gordon (2018) Adversarial multiple source domain adaptation. In International Conference on Neural Information Processing Systems (NIPS).