How does the Combined Risk Affect the Performance of Unsupervised Domain Adaptation Approaches?

12/30/2020 ∙ by Li Zhong, et al. ∙ University of Technology Sydney ∙ Tsinghua University

Unsupervised domain adaptation (UDA) aims to train a target classifier with labeled samples from the source domain and unlabeled samples from the target domain. Classical UDA learning bounds show that target risk is upper bounded by three terms: source risk, distribution discrepancy, and combined risk. Based on the assumption that the combined risk is a small fixed value, methods based on this bound train a target classifier by only minimizing estimators of the source risk and the distribution discrepancy. However, the combined risk may increase when minimizing both estimators, which makes the target risk uncontrollable. Hence the target classifier cannot achieve ideal performance if we fail to control the combined risk. To control the combined risk, the key challenge takes root in the unavailability of the labeled samples in the target domain. To address this key challenge, we propose a method named E-MixNet. E-MixNet employs enhanced mixup, a generic vicinal distribution, on the labeled source samples and pseudo-labeled target samples to calculate a proxy of the combined risk. Experiments show that the proxy can effectively curb the increase of the combined risk when minimizing the source risk and distribution discrepancy. Furthermore, we show that if the proxy of the combined risk is added into loss functions of four representative UDA methods, their performance is also improved.


Introduction

Figure 1: The values of the combined risk and accuracy on the task C → P on Image-CLEF. The left figure shows the value of the combined risk. The right figure shows the accuracy of the task. Blue line: ignore the optimization of the combined risk. Green line: optimize the combined risk by source samples and target samples with high confidence. Orange line: optimize the combined risk by the proxy formulated by mixup. Purple line: optimize the proxy formulated by e-mixup.

Domain Adaptation (DA) aims to train a target-domain classifier with samples from source and target domains Lu et al. (2015). When the labels of samples in the target domain are unavailable, DA is known as unsupervised DA (UDA) Zhong et al. (2020); Fang et al. (2020), which has been applied to address diverse real-world problems, such as computer vision Zhang et al. (2020c); Dong et al. (2019, 2020b), natural language processing Lee and Jha (2019); Guo, Pasunuru, and Bansal (2020), and recommender systems Zhang et al. (2017); Yu, Wang, and Yuan (2019); Lu et al. (2020).

Significant theoretical advances have been achieved in UDA. Pioneering theoretical work was proposed by Ben-David et al. Ben-David et al. (2007). This work shows that the target risk is upper bounded by three terms: source risk, marginal distribution discrepancy, and combined risk. This earliest learning bound has been extended from many perspectives, such as considering more surrogate loss functions Zhang et al. (2019a) or distributional discrepancies Mohri and Medina (2012); Shen et al. (2018) (see Redko et al. (2020) for a survey). Recently, Zhang et al. Zhang et al. (2019a) proposed a new distributional discrepancy termed Margin Disparity Discrepancy and developed a tighter and more practical UDA learning bound.

The UDA learning bounds proposed by Ben-David et al. (2007, 2010) and the recent UDA learning bounds proposed by Shen et al. (2018); Xu et al. (2020); Zhang et al. (2020b) consist of three terms: source risk, marginal distribution discrepancy, and combined risk. Minimizing the source risk aims to obtain a source-domain classifier, and minimizing the distribution discrepancy aims to learn domain-invariant features so that the source-domain classifier can perform well on the target domain. The combined risk embodies the adaptability between the source and target domains Ben-David et al. (2010). In particular, when the hypothesis space is fixed, the combined risk is a constant.

Based on the UDA learning bounds where the combined risk is assumed to be a small constant, many existing UDA methods focus on learning domain-invariant features Fang et al. (2019); Dong et al. (2020c, a); Liu et al. (2019) by minimizing the estimators of the source risk and the distribution discrepancy. In the learned feature space, the source and target distributions are similar while the source-domain classifier is required to achieve a small error. The generalization error of the source-domain classifier is then expected to be small in the target domain.

However, the combined risk may increase when learning the domain-invariant features, and this increase may degrade the performance of the source-domain classifier in the target domain. As shown in Figure 1, we calculate the value of the combined risk and the accuracy on a real-world UDA task (see the green line). The performance worsens as the combined risk increases. Zhao et al. Zhao et al. (2019) also pointed out that the increase of the combined risk causes the failure of the source-domain classifier on the target domain.

To investigate how the combined risk affects the performance on the domain-invariant features, we rethink and develop the UDA learning bounds by introducing feature transformations. In the new bound (see Eq. (5)), the combined risk is a function of the feature transformation rather than a constant (compared to the bounds in Ben-David et al. (2010)). We also reveal that the combined risk is deeply related to the conditional distribution discrepancy (see Theorem 3). Theorem 3 shows that the conditional distribution discrepancy increases when the combined risk increases. Hence, it is hard to achieve satisfactory target-domain accuracy if we only focus on learning domain-invariant features and neglect to control the combined risk.

To estimate the combined risk, the key challenge takes root in the unavailability of labeled samples in the target domain. A simple solution is to leverage the pseudo labels with high confidence in the target domain to estimate the combined risk. However, since samples with high confidence are insufficient, the value of the combined risk may still increase (see the green line in Figure 1). Inspired by semi-supervised learning methods, an advanced solution is to directly use the mixup technique to augment pseudo-labeled target samples, which can slightly help us estimate the combined risk better than the simple solution (see the orange line in Figure 1).

However, the target-domain pseudo labels provided by the source-domain classifier may be inaccurate due to the discrepancy between domains, which means that mixup may not perform well with inaccurate labels. To mitigate the issue, we propose enhanced mixup (e-mixup) as a substitute for mixup to compute a proxy of the combined risk. The purple line in Figure 1 shows that the proxy based on e-mixup can significantly boost the performance. Details of the proxy are given in the Motivation section.

To this end, we design a novel UDA method referred to as E-MixNet. E-MixNet learns the target-domain classifier by simultaneously minimizing the source risk, the marginal distribution discrepancy, and the proxy of the combined risk. By minimizing the proxy, we effectively curb the increase of the combined risk and thus control the conditional distribution discrepancy between the two domains.

We conduct experiments on three public datasets (Office-31, Office-Home, and Image-CLEF) and compare E-MixNet with a series of existing state-of-the-art methods. Furthermore, we introduce the proxy of the combined risk into four representative UDA methods (i.e., DAN Long et al. (2015), DANN Ganin et al. (2016), CDAN Long et al. (2018), SymNets Zhang et al. (2019b)). Experiments show that E-MixNet can outperform all baselines, and the four representative methods can achieve better performance if the proxy of the combined risk is added into their loss functions.

Problem Setting and Concepts

In this section, we first give the definition of UDA and then introduce some important concepts used in this paper.

Let $\mathcal{X}$ be a feature space and $\mathcal{Y}$ be the label space, where each label $\mathbf{y} \in \mathcal{Y}$ is a one-hot vector whose $c$-th coordinate is $1$ and whose other coordinates are $0$.

Definition 1 (Domains in UDA).

Given random variables $X^s, Y^s$ and $X^t, Y^t$ taking values in $\mathcal{X} \times \mathcal{Y}$, the source and target domains are the joint distributions $P(X^s, Y^s)$ and $Q(X^t, Y^t)$, with $P \neq Q$.

Then, we propose the UDA problem as follows.

Problem 1 (UDA).

Given independent and identically distributed (i.i.d.) labeled samples $S$ drawn from the source domain $P$ and i.i.d. unlabeled samples $T$ drawn from the target marginal distribution $Q_X$, the aim of UDA is to train a classifier $h$ with $S$ and $T$ such that $h$ accurately classifies target data drawn from $Q$.

Given a loss function $\ell$ and any scoring functions $f, f'$ from a function space $\mathcal{F}$, the source risk, target risk, and classifier discrepancy are

$$\varepsilon_P(f) = \mathbb{E}_{(x,y)\sim P}\,\ell(f(x), y), \quad \varepsilon_Q(f) = \mathbb{E}_{(x,y)\sim Q}\,\ell(f(x), y), \quad \varepsilon_{P_X}(f, f') = \mathbb{E}_{x \sim P_X}\,\ell(f(x), f'(x)).$$

Lastly, we define the disparity discrepancy based on double losses, which will be used to design our method.

Definition 2 (Double Loss Disparity Discrepancy).

Given distributions $P_X, Q_X$ over some feature space $\mathcal{X}$, two losses $\ell, \ell'$, a hypothesis space $\mathcal{H}$, and any scoring function $f$, the double loss disparity discrepancy is

$$d^{f}_{\ell,\ell'}(P_X, Q_X) = \sup_{h \in \mathcal{H}} \big| \varepsilon^{\ell}_{P_X}(h, f) - \varepsilon^{\ell'}_{Q_X}(h, f) \big|. \quad (1)$$

When both losses are the margin loss Zhang et al. (2019a), the double loss disparity discrepancy reduces to the Margin Disparity Discrepancy Zhang et al. (2019a).

Compared with the classical discrepancy distance Mansour, Mohri, and Rostamizadeh (2009):

$$\mathrm{disc}_\ell(P_X, Q_X) = \sup_{h, h' \in \mathcal{H}} \big| \varepsilon_{P_X}(h, h') - \varepsilon_{Q_X}(h, h') \big|, \quad (2)$$

the double loss disparity discrepancy is tighter and more flexible, since its supremum is taken over a single hypothesis (with the other function fixed) rather than over pairs.

Theorem 1 (DA Learning Bound).

Given a loss $\ell$ satisfying the triangle inequality and a hypothesis space $\mathcal{H}$, then for any $h \in \mathcal{H}$, we have

$$\varepsilon_Q(h) \le \varepsilon_P(h) + \mathrm{disc}_\ell(P_X, Q_X) + \lambda^*, \qquad \lambda^* = \min_{h' \in \mathcal{H}} \big(\varepsilon_P(h') + \varepsilon_Q(h')\big),$$

where $\mathrm{disc}_\ell$ is the discrepancy distance defined in Eq. (2) and $\lambda^*$ is known as the combined risk.

In Theorem 1, when the hypothesis space and the loss are fixed, the combined risk is a fixed constant. Note that, under certain assumptions, the target risk can be upper bounded only by the first two terms Gong et al. (2016); Zhang et al. (2020a), which is also a promising research direction.

Theoretical Analysis

Here we introduce our main theoretical results. All proofs can be found at https://github.com/zhonglii/E-MixNet.

Rethinking DA Learning Bound

Many existing UDA methods Wang and Breckon (2020); Zou et al. (2019); Tang and Jia (2020) learn a suitable feature transformation $g$ such that the discrepancy between the transformed source and target distributions is reduced. By introducing the transformation $g$ into the classical DA learning bound, we discover that the combined risk is not a fixed value.

Theorem 2.

Given a loss $\ell$ satisfying the triangle inequality, a transformation space $\mathcal{G}$ and a hypothesis space $\mathcal{H}$, then for any $g \in \mathcal{G}$ and $h \in \mathcal{H}$,

$$\varepsilon_Q(h \circ g) \le \varepsilon_P(h \circ g) + \mathrm{disc}_\ell(P_{g(X)}, Q_{g(X)}) + \lambda(g),$$

where $\mathrm{disc}_\ell$ is the discrepancy distance defined in Eq. (2) and

$$\lambda(g) = \min_{h' \in \mathcal{H}} \big(\varepsilon_P(h' \circ g) + \varepsilon_Q(h' \circ g)\big) \quad (3)$$

is known as the combined risk.

According to Theorem 2, it is not enough to minimize the source risk and the distribution discrepancy by seeking the optimal classifier and transformation from the spaces $\mathcal{H}$ and $\mathcal{G}$, because we cannot guarantee that the value of the combined risk stays small during the training process.

For convenience, we define

(4)

hence, .

Meaning of Combined Risk

To further understand the meaning of the combined risk, we prove the following theorem.

Theorem 3.

Given a symmetric loss satisfying the triangle inequality, a feature transformation , a hypothesis space , and

then

where is defined in Eq. (3), known as the approximation error and is the discrepancy distance defined in Eq. (2).

Theorem 3 implies that the combined risk is deeply related to the optimal classifier discrepancy, which can be regarded as the conditional distribution discrepancy between the two domains. If the combined risk increases, the conditional distribution discrepancy becomes larger.

Double Loss DA Learning Bound

Note that there exist methods, such as MDD Zhang et al. (2019a), whose source and target losses are different. To understand these UDA methods and bridge the gap between theory and algorithms, we extend the classical DA learning bound to this more general scenario.

Theorem 4.

Given losses $\ell$ and $\ell'$ satisfying the triangle inequality, a transformation space $\mathcal{G}$ and a hypothesis space $\mathcal{H}$, for any $g \in \mathcal{G}$ and $h \in \mathcal{H}$, the target risk is bounded by

(5)

where the discrepancy term is the double loss disparity discrepancy defined in Eq. (1) and the last term is the combined risk:

(6)
(7)

In Theorem 4, the condition that $\ell$ and $\ell'$ satisfy the triangle inequality can be replaced by a weaker condition. If we set $\ell'$ as the margin loss, the losses do not satisfy the triangle inequality, but they do satisfy the weaker condition.

Proposed Method: E-MixNet

Here we introduce the motivation and details of our method.

Motivation

Theorem 3 has shown that the combined risk is related to the conditional distribution discrepancy: as the combined risk increases, so does the conditional distribution discrepancy. Hence, ignoring the combined risk may negatively impact the target-domain accuracy. Figure 1 (blue line) verifies this observation.

To control the combined risk, we consider the following problem.

(8)

where the objective is defined in Eq. (4). Eq. (8) shows that we can control the combined risk by minimizing this term. However, it is prohibitive to optimize the combined risk directly, since labeled target samples are indispensable for estimating it.

To alleviate the above issue, a simple method is to use the target pseudo labels with high confidence to estimate the combined risk. Given the source samples and the target samples with high confidence, the empirical form of the combined risk can be computed on these samples. However, the combined risk may still increase, as shown in Figure 1 (green line). The reason may be that the target samples whose pseudo labels have high confidence are insufficient.
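As an illustration of this simple solution, selecting high-confidence pseudo-labeled target samples might look like the sketch below; the threshold value and function names are illustrative assumptions, not from the paper.

```python
import numpy as np

def select_confident(probs, threshold=0.95):
    """Keep target samples whose max predicted class probability exceeds
    `threshold`; return their indices and hard pseudo labels."""
    probs = np.asarray(probs, dtype=float)
    conf = probs.max(axis=1)                # confidence of each prediction
    idx = np.where(conf >= threshold)[0]    # indices of confident samples
    return idx, probs[idx].argmax(axis=1)   # hard pseudo labels

# Example: 3 target samples, 2 classes; only the first is confident enough.
idx, pseudo = select_confident([[0.98, 0.02], [0.60, 0.40], [0.30, 0.70]])
```

With a high threshold, few samples survive the filter, which is exactly the insufficiency the paper points out.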

Inspired by semi-supervised learning, an advanced solution is to use the mixup technique Zhang et al. (2018) to augment pseudo-labeled target samples. Mixup produces new samples by a convex combination: given any two samples $(x_i, y_i)$ and $(x_j, y_j)$, it forms $\tilde{x} = \lambda x_i + (1-\lambda) x_j$ and $\tilde{y} = \lambda y_i + (1-\lambda) y_j$, where $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ and $\alpha$ is a hyper-parameter. Zhang et al. Zhang et al. (2018) have shown that mixup not only reduces memorization of adversarial samples but also performs better than Empirical Risk Minimization (ERM) Vapnik and Chervonenkis (2015). By applying mixup to the target samples with high confidence, new samples are produced, and we then propose a proxy of the combined risk as follows:

The aforementioned issue can be mitigated, since mixup can be regarded as a data augmentation Zhang et al. (2018). However, the target-domain pseudo labels provided by the source-domain classifier may be inaccurate due to the discrepancy between domains, which causes that mixup may not perform well with inaccurate labels. We propose enhanced mixup (e-mixup) to substitute the mixup to compute the proxy. E-mixup introduces the pure true-labeled source-samples to mitigate the issue caused by bad pseudo labels.

Furthermore, to increase the diversity of new samples, e-mixup produces each new sample from two distant samples, i.e., a pair whose distance is expected to be large. Compared with the ordinary mixup technique (i.e., producing new samples from randomly selected samples), e-mixup can produce new samples more effectively. We also verify that e-mixup can further boost the performance (see Table 5). Details of e-mixup are shown in Algorithm 1. Corresponding to the double-loss setting, with the samples produced by e-mixup, the proxy of the combined risk (defined in Eq. (7)) is

(9)

The purple line in Figure 1 and ablation study show that e-mixup can further boost performance.

Algorithm

The optimization of the combined risk plays a crucial role in UDA. Accordingly, we propose a method based on the aforementioned analyses to address UDA more effectively.

Figure 2: The network architecture of applying the proxy of the combined risk. The left figure is the general model for adding the proxy into existing UDA models. The right figure is a specific model based on double loss disparity discrepancy.

Objective Function

Input: samples $\{(x_i, y_i)\}_{i=1}^{n}$.
Parameter: $\alpha$, the number of classes $K$.
Output: new samples $\{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^{n}$.
1:  for $i = 1$ to $n$ do
2:     $c \leftarrow \arg\max_{k} y_{i,k}$   % $y_{i,c}$ is the $c$-th coordinate value of the vector $y_i$
3:     Select one distant sample from the samples whose label is $c$ and denote it by $(x_j, y_j)$
4:     $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$;  $\tilde{x}_i \leftarrow \lambda x_i + (1-\lambda) x_j$,  $\tilde{y}_i \leftarrow \lambda y_i + (1-\lambda) y_j$
5:  end for
Algorithm 1 e-mixup
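A sketch of e-mixup under our reading of Algorithm 1: each pseudo-labeled target sample is mixed with a distant, true-labeled source sample of the same class. The Euclidean-distance criterion and function names are our assumptions; the convex combination follows ordinary mixup.

```python
import numpy as np

def e_mixup(xs, ys, xt, yt_pseudo, alpha=0.6, rng=None):
    """Mix each pseudo-labeled target sample with the most distant source
    sample sharing its (pseudo) class, so mixing partners are true-labeled
    and diverse. A sketch, not the paper's exact procedure."""
    rng = rng or np.random.default_rng(0)
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    xt, yt = np.asarray(xt, float), np.asarray(yt_pseudo, float)
    new_x, new_y = [], []
    for x, y in zip(xt, yt):
        c = int(y.argmax())                        # pseudo class of this target sample
        cand = np.where(ys.argmax(axis=1) == c)[0]
        if len(cand) == 0:                         # no source sample of this class
            continue
        far = cand[np.argmax(np.linalg.norm(xs[cand] - x, axis=1))]
        lam = rng.beta(alpha, alpha)               # mixing weight ~ Beta(alpha, alpha)
        new_x.append(lam * x + (1 - lam) * xs[far])
        new_y.append(lam * y + (1 - lam) * ys[far])
    return np.array(new_x), np.array(new_y)
```

Choosing the most distant in-class partner is one way to realize "two distant samples"; a softmax over distances would be a stochastic alternative.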

According to the theoretical bound in Eq. (5), we need to solve the following problem

where the discrepancy term is the double loss disparity discrepancy defined in Eq. (1) and the proxy of the combined risk is defined in Eq. (9).

Minimizing the double loss disparity discrepancy is a minimax game, since the double loss disparity discrepancy is defined as a supremum over the hypothesis space. Thus, we revise the above problem as follows:

(10)

where the coefficients are trade-off parameters that make our model more flexible,

(11)

To solve problem (10), we construct a deep model. The network architecture is shown in Fig. 2(b); it consists of a generator, a discriminator, and two classifiers. Next, we introduce the details of our method.

We use the standard cross-entropy as the source loss and the modified cross-entropy Goodfellow et al. (2014); Zhang et al. (2019a) as the target loss.

For any scoring functions ,

(12)

where $\sigma$ is the softmax function: for any $z \in \mathbb{R}^{K}$,

$$\sigma_k(z) = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)},$$

and

(13)

where $f_k$ is the $k$-th coordinate function of the scoring function $f$.
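A minimal sketch of the two losses, assuming the standard cross-entropy for the source and, for the target, the modified cross-entropy in the style of Goodfellow et al. (2014) ($\log(1 - p_c)$ on the predicted class); the exact form of Eq. (13) may differ, and the names are ours.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def source_loss(scores, label):
    """Standard cross-entropy: -log softmax probability of the true class."""
    return -math.log(softmax(scores)[label])

def target_loss(scores, pseudo_label):
    """Modified cross-entropy: log(1 - p_c) for the (pseudo) class c.
    An assumed reconstruction of Eq. (13), not a verbatim reproduction."""
    return math.log(1.0 - softmax(scores)[pseudo_label])
```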

Source risk. Given the source samples , then

(14)

where is the label corresponding to one-hot vector .

Double loss disparity discrepancy. Given the source and target samples , then

(15)

where are defined in Eq. (12).

Combined risk. As discussed in the Motivation section, the combined risk cannot be optimized directly. To mitigate this problem, we use the proxy in Eq. (9) as a substitute.

Further, motivated by Berthelot et al. (2019), we use the mean square error (MSE) to calculate the proxy of the combined risk because, unlike the cross-entropy loss, MSE is bounded and less sensitive to incorrect predictions. Denote the output of e-mixup by the mixed samples and their soft labels. Then the proxy is calculated by

(16)
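A minimal sketch of this MSE-based proxy, assuming it averages the squared difference between the classifier's predicted probabilities on e-mixup samples and the mixed soft labels; function names are ours.

```python
import numpy as np

def mse_proxy(pred_probs, mixed_labels):
    """MSE between predicted class probabilities on e-mixup samples and
    their mixed soft labels; bounded per sample (at most 2 for one-hot
    vectors) and less sensitive to wrong pseudo labels than cross-entropy."""
    p = np.asarray(pred_probs, dtype=float)
    q = np.asarray(mixed_labels, dtype=float)
    return float(np.mean(np.sum((p - q) ** 2, axis=1)))
```

A perfect prediction gives a proxy of 0, while a fully flipped one-hot prediction gives 2, so a wrong pseudo label contributes a bounded penalty rather than the unbounded one cross-entropy would.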
Input: source and target samples.
Parameter: learning rate, batch size, the number of iterations, network parameters.
Output: the predicted target labels.
1:  Initialize the network parameters
2:  for each iteration do
3:     Fetch a source minibatch
4:     Fetch a target minibatch
5:     Calculate the source risk (Eq. (14))
6:     Calculate the double loss disparity discrepancy (Eq. (15))
7:     Obtain highly confident target samples predicted by the classifier on the target minibatch
8:     Merge the source minibatch with the confident target samples
9:     Produce new samples by e-mixup (Algorithm 1)
10:    Calculate the proxy of the combined risk (Eq. (16))
11:    Update the network parameters according to Eq. (17)
12:  end for
Algorithm 2 The training procedure of E-MixNet
Method A→W D→W W→D A→D D→A W→A Avg
ResNet-50 He et al. (2016) 68.4±0.2 96.7±0.1 99.3±0.1 68.9±0.2 62.5±0.3 60.7±0.3 76.1
DAN Long et al. (2015) 80.5±0.4 97.1±0.2 99.6±0.1 78.6±0.2 63.6±0.3 62.8±0.2 80.4
RTN Long et al. (2016) 84.5±0.2 96.8±0.1 99.4±0.1 77.5±0.3 66.2±0.2 64.8±0.3 81.6
DANN Ganin et al. (2016) 82.0±0.4 96.9±0.2 99.1±0.1 79.7±0.4 68.2±0.4 67.4±0.5 82.2
ADDA Tzeng et al. (2017) 86.2±0.5 96.2±0.3 98.4±0.3 77.8±0.3 69.5±0.4 68.9±0.5 82.9
JAN Long et al. (2013) 86.0±0.4 96.7±0.3 99.7±0.1 85.1±0.4 69.2±0.3 70.7±0.5 84.6
MADA Pei et al. (2018) 90.0±0.1 97.4±0.1 99.6±0.1 87.8±0.2 70.3±0.3 66.4±0.3 85.2
SimNet Pinheiro (2018) 88.6±0.5 98.2±0.2 99.7±0.2 85.3±0.3 73.4±0.8 71.8±0.6 86.2
MCD Saito et al. (2018) 89.6±0.2 98.5±0.1 100.0±0.0 91.3±0.2 69.6±0.1 70.8±0.3 86.6
CDAN+E Long et al. (2018) 94.1±0.1 98.6±0.1 100.0±0.0 92.9±0.2 71.0±0.3 69.3±0.3 87.7
SymNets Zhang et al. (2019b) 90.8±0.1 98.8±0.3 100.0±0.0 93.9±0.5 74.6±0.6 72.5±0.5 88.4
MDD Zhang et al. (2019a) 94.5±0.3 98.4±0.1 100.0±0.0 93.5±0.2 74.6±0.3 72.2±0.1 88.9
E-MixNet 93.0±0.3 99.0±0.1 100.0±0.0 95.6±0.2 78.9±0.5 74.7±0.7 90.2
Table 1: Results on Office-31 (ResNet-50)
Method I→P P→I I→C C→I C→P P→C Avg
ResNet-50 He et al. (2016) 74.8±0.3 83.9±0.1 91.5±0.3 78.0±0.2 65.5±0.3 91.2±0.3 80.7
DAN Long et al. (2015) 74.5±0.4 82.2±0.2 92.8±0.2 86.3±0.4 69.2±0.4 89.8±0.4 82.5
DANN Ganin et al. (2016) 75.0±0.6 86.0±0.3 96.2±0.4 87.0±0.5 74.3±0.5 91.5±0.6 85.0
JAN Long et al. (2013) 76.8±0.4 88.0±0.2 94.7±0.2 89.5±0.3 74.2±0.3 91.7±0.3 85.8
MADA Pei et al. (2018) 75.0±0.3 87.9±0.2 96.0±0.3 88.8±0.3 75.2±0.2 92.2±0.3 85.8
CDAN+E Long et al. (2018) 77.7±0.3 90.7±0.2 97.7±0.3 91.3±0.3 74.2±0.2 94.3±0.3 87.7
SymNets Zhang et al. (2019b) 80.2±0.3 93.6±0.2 97.0±0.3 93.4±0.3 78.7±0.3 96.4±0.1 89.9
E-MixNet 80.5±0.4 96.0±0.1 97.7±0.3 95.2±0.4 79.9±0.2 97.0±0.3 91.0
Table 2: Results on Image-CLEF (ResNet-50)

Training Procedure

Finally, the UDA problem can be solved by the following minimax game.

(17)

The training procedure is shown in Algorithm 2.

Experiments

Method A→C A→P A→R C→A C→P C→R P→A P→C P→R R→A R→C R→P Avg
ResNet-50 He et al. (2016)  34.9  50.0  58.0  37.4  41.9  46.2  38.5  31.2  60.4  53.9  41.2  59.9 46.1
DAN Long et al. (2015)  54.0  68.6  75.9  56.4  66.0  67.9  57.1  50.3  74.7  68.8  55.8  80.6 64.7
DANN Ganin et al. (2016)  44.1  66.5  74.6  57.9  62.0  67.2  55.7  40.9  73.5  67.5  47.9  77.7 61.3
JAN Long et al. (2013)  45.9  61.2  68.9  50.4  59.7  61.0  45.8  43.4  70.3  63.9  52.4  76.8 58.3
CDAN+E Long et al. (2018)  47.0  69.4  75.8  61.0  68.8  70.8  60.2  47.1  77.9  70.8  51.4  81.7 65.2
SymNets Zhang et al. (2019b)  46.0  73.8  78.2  64.1  69.7  74.2  63.2  48.9  80.0  74.0  51.6  82.9 67.2
MDD Zhang et al. (2019a)  54.9  73.7  77.8  60.0  71.4  71.8  61.2  53.6  78.1  72.5  60.2  82.3 68.1
E-MixNet  57.7  76.6  79.8  63.6  74.1  75.0  63.4  56.4  79.7  72.8  62.4  85.5 70.6
Table 3: Results on Office-Home (ResNet-50)
Method A→C A→P A→R C→A C→P C→R P→A P→C P→R R→A R→C R→P Avg
DAN Long et al. (2015)  54.0  68.6  75.9  56.4  66.0  67.9  57.1  50.3  74.7  68.8  55.8  80.6 64.7
DAN+  57.0  71.0  77.9  59.9  72.6  70.1  58.1  57.1  77.3  72.7  64.7  84.6 68.6
DANN Ganin et al. (2016)  44.1  66.5  74.6  57.9  62.0  67.2  55.7  40.9  73.5  67.5  47.9  77.7 61.3
DANN+  50.9  69.6  77.8  61.9  70.7  71.6  60.0  49.5  78.4  71.8  55.7  83.7 66.8
CDAN+E Long et al. (2018)  47.0  69.4  75.8  61.0  68.8  70.8  60.2  47.1  77.9  70.8  51.4  81.7 65.2
CDAN+E+  49.5  70.1  77.8  64.3  71.3  74.2  61.6  50.6  80.0  73.5  56.6  84.1 67.8
SymNets Zhang et al. (2019b)  46.0  73.8  78.2  64.1  69.7  74.2  63.2  48.9  80.0  74.0  51.6  82.9 67.2
SymNets+  48.8  74.7  79.7  64.9  72.5  75.6  63.9  47.0  80.8  73.9  52.4  83.9 68.2
Table 4: The results of combination experiments on Office-Home (ResNet-50)
s t m e I→P P→I I→C C→I C→P P→C Avg
80.2 94.2 96.7 94.7 79.2 95.5 90.1
79.9 92.2 97.7 93.8 79.4 96.5 89.9
79.7 93.7 97.5 94.5 79.7 96.2 90.2
79.4 95.0 97.8 94.8 81.4 96.5 90.8
80.5 96.0 97.7 95.2 79.9 97.0 91.0
Table 5: Ablation experiments on Image-CLEF

We evaluate E-MixNet on three public datasets and compare it with several existing state-of-the-art methods. Code will be available at https://github.com/zhonglii/E-MixNet.

Datasets

Three common UDA datasets are used to evaluate the efficacy of E-MixNet.

Office-31 Saenko et al. (2010) is an object recognition dataset consisting of three domains with a slight discrepancy: amazon (A), dslr (D), and webcam (W). Each domain contains 31 kinds of objects, so there are six domain adaptation tasks on Office-31: A → D, A → W, D → A, D → W, W → A, W → D.

Office-Home Venkateswara et al. (2017) is an object recognition dataset containing four domains with a more obvious domain discrepancy than Office-31: Artistic (A), Clipart (C), Product (P), and Real-World (R). Each domain contains 65 kinds of objects, so there are 12 domain adaptation tasks on Office-Home: A → C, A → P, A → R, …, R → P.

ImageCLEF-DA (http://imageclef.org/2014/adaptation/) is a dataset organized by selecting the 12 common classes shared by three public datasets (domains): Caltech-256 (C), ImageNet ILSVRC 2012 (I), and Pascal VOC 2012 (P). We permute all three domains and build six transfer tasks: I → P, P → I, I → C, C → I, C → P, P → C.

Experimental Setup

Following the standard protocol for unsupervised domain adaptation in Ganin et al. (2016); Long et al. (2018), all labeled source samples and unlabeled target samples are used in the training process, and we report the average classification accuracy over three random experiments. The coefficient in Eq. (15) is selected from {2, 4, 8}: it is set to 2 for Office-Home, 4 for Office-31, and 8 for Image-CLEF.

ResNet-50 He et al. (2016) pretrained on ImageNet is employed as the backbone network (the generator G). The discriminator and the two classifiers each consist of two fully connected layers with 1024 hidden units. A gradient reversal layer between G and the discriminator is employed for adversarial training. The algorithm is implemented in PyTorch. Mini-batch stochastic gradient descent with momentum 0.9 is employed as the optimizer, and the learning rate is annealed during training, with the training progress increasing linearly from 0 to 1. We follow Zhang et al. (2019a) to employ a progressive strategy for the trade-off coefficient, whose rate parameter is set to 0.1. The α in e-mixup is set to 0.6 in all experiments.
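The annealed learning rate above can be sketched as follows, assuming the common DANN-style schedule lr₀ / (1 + α·p)^β with its usual defaults α = 10, β = 0.75, since the exact constants are not given here.

```python
def lr_schedule(lr0, p, alpha=10.0, beta=0.75):
    """DANN-style annealed learning rate (Ganin et al., 2016):
    lr0 / (1 + alpha * p) ** beta, where the training progress p grows
    linearly from 0 to 1. alpha and beta are assumed common defaults."""
    return lr0 / (1.0 + alpha * p) ** beta

start = lr_schedule(0.01, 0.0)  # rate at the beginning of training
end = lr_schedule(0.01, 1.0)    # decayed rate at the end of training
```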

Results

The results on Office-31 are reported in Table 1. E-MixNet achieves the best average result and exceeds the baselines on 4 of 6 tasks. Compared to the competitive baseline MDD, E-MixNet surpasses it by 4.3% on the difficult task D → A.

The results on Image-CLEF are reported in Table 2. E-MixNet significantly outperforms the baselines on 5 of 6 tasks. On the hard task C → P, E-MixNet surpasses the competitive baseline SymNets by 2.7%.

The results on Office-Home are reported in Table 3. Although Office-Home is a challenging dataset, E-MixNet still achieves better performance than all the baselines on 9 of 12 tasks. On the difficult tasks A → C, P → A, and R → C, E-MixNet has significant advantages.

To further verify the efficacy of the proposed proxy of the combined risk, we add the proxy into the loss functions of four representative UDA methods. As shown in Fig. 2(a), we add a new classifier, identical to the classifier in the original method, to formulate the proxy of the combined risk. The results are shown in Table 4. All four methods achieve better performance after optimizing the proxy. It is worth noting that DANN obtains a 5.5% increase. These experiments adequately demonstrate that the combined risk plays a crucial role for methods that aim to learn a domain-invariant representation, and that the proxy can indeed curb the increase of the combined risk.

Ablation Study and Parameter Analysis

Ablation Study. To further verify the efficacy of the proxy of the combined risk calculated by mixup and e-mixup respectively, ablation experiments are shown in Table 5, where s indicates that the source samples are introduced to augment the target samples, t indicates augmenting the target samples, m denotes mixup, and e denotes e-mixup. Table 5 shows that E-MixNet achieves the best performance, which further demonstrates that the proxy effectively controls the combined risk.

Parameter analysis. Here we study how the trade-off parameter affects the performance, and compare mean square error (MSE) with cross-entropy for the proxy of the combined risk. Firstly, as shown in Fig. 3(a), a relatively larger value obtains better performance and faster convergence. Secondly, since mixup operates between two samples, the accuracy of the target samples' pseudo labels matters greatly; to guard against such noisy labels, MSE is employed in place of cross-entropy. As shown in Fig. 3(b), MSE obtains more stable and better performance. Furthermore, the A-distance is an important indicator of the distribution discrepancy, defined as $d_A = 2(1 - 2\epsilon)$, where $\epsilon$ is the test error of a domain classifier. As shown in Fig. 3(c), E-MixNet achieves better adaptation, implying the efficacy of the proposed proxy.
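The A-distance described above is a one-line computation; a sketch (the function name is ours):

```python
def a_distance(domain_clf_error):
    """Proxy A-distance (Ben-David et al., 2007): 2 * (1 - 2 * eps),
    where eps is the test error of a classifier trained to distinguish
    source features from target features. eps = 0.5 (domains
    indistinguishable) gives 0; eps = 0 (perfectly separable) gives 2."""
    return 2.0 * (1.0 - 2.0 * domain_clf_error)
```

A smaller value therefore indicates better-aligned feature distributions.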

Figure 3: The impact of the trade-off parameter is shown in (a). The impact of the loss function used for the proxy of the combined risk is shown in (b). The comparison of A-distance is shown in (c).

Conclusion

Though numerous UDA methods have been proposed and have achieved significant success, the issue caused by the combined risk has not been brought to the forefront, and none of the existing methods solve it. Theorem 3 reveals that the combined risk is deeply related to the conditional distribution discrepancy and plays a crucial role in transfer performance. We therefore propose a method termed E-MixNet, which employs enhanced mixup to calculate a proxy of the combined risk. Experiments show that our method achieves performance comparable to existing state-of-the-art methods, and that the performance of four representative methods can be boosted by adding the proxy into their loss functions.

Acknowledgments

The work presented in this paper was supported by the Australian Research Council (ARC) under DP170101632 and FL190100149. The first author particularly thanks the support of UTS-AAII during his visit.

References

  • Ben-David et al. (2010) Ben-David, S.; Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.; and Vaughan, J. W. 2010. A theory of learning from different domains. Machine learning 79(1-2): 151–175.
  • Ben-David et al. (2007) Ben-David, S.; Blitzer, J.; Crammer, K.; and Pereira, F. 2007. Analysis of representations for domain adaptation. In NeurIPS, 137–144.
  • Berthelot et al. (2019) Berthelot, D.; Carlini, N.; Goodfellow, I.; Papernot, N.; Oliver, A.; and Raffel, C. A. 2019. Mixmatch: A holistic approach to semi-supervised learning. In NeurIPS, 5049–5059.
  • Dong et al. (2019) Dong, J.; Cong, Y.; Sun, G.; and Hou, D. 2019. Semantic-Transferable Weakly-Supervised Endoscopic Lesions Segmentation. In ICCV, 10711–10720.
  • Dong et al. (2020a) Dong, J.; Cong, Y.; Sun, G.; Liu, Y.; and Xu, X. 2020a. CSCL: Critical Semantic-Consistent Learning for Unsupervised Domain Adaptation. In Vedaldi, A.; Bischof, H.; Brox, T.; and Frahm, J.-M., eds., ECCV, 745–762. Cham: Springer International Publishing. ISBN 978-3-030-58598-3.
  • Dong et al. (2020b) Dong, J.; Cong, Y.; Sun, G.; Yang, Y.; Xu, X.; and Ding, Z. 2020b. Weakly-Supervised Cross-Domain Adaptation for Endoscopic Lesions Segmentation. IEEE Transactions on Circuits and Systems for Video Technology 1–1. doi:10.1109/TCSVT.2020.3016058.
  • Dong et al. (2020c) Dong, J.; Cong, Y.; Sun, G.; Zhong, B.; and Xu, X. 2020c. What Can Be Transferred: Unsupervised Domain Adaptation for Endoscopic Lesions Segmentation. In CVPR, 4022–4031.
  • Fang et al. (2020) Fang, Z.; Lu, J.; Liu, F.; Xuan, J.; and Zhang, G. 2020. Open set domain adaptation: Theoretical bound and algorithm. IEEE Transactions on Neural Networks and Learning Systems.
  • Fang et al. (2019) Fang, Z.; Lu, J.; Liu, F.; and Zhang, G. 2019. Unsupervised domain adaptation with sphere retracting transformation. In 2019 International Joint Conference on Neural Networks (IJCNN), 1–8. IEEE.
  • Ganin et al. (2016) Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. 2016. Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17: 2096–2030.
  • Gong et al. (2016) Gong, M.; Zhang, K.; Liu, T.; Tao, D.; Glymour, C.; and Schölkopf, B. 2016. Domain adaptation with conditional transferable components. In ICML, 2839–2848.
  • Goodfellow et al. (2014) Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative Adversarial Nets. In NeurIPS, 2672–2680. Curran Associates, Inc.
  • Guo, Pasunuru, and Bansal (2020) Guo, H.; Pasunuru, R.; and Bansal, M. 2020. Multi-Source Domain Adaptation for Text Classification via DistanceNet-Bandits. In AAAI, 7830–7838.
  • He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
  • Lee and Jha (2019) Lee, S.; and Jha, R. 2019. Zero-shot adaptive transfer for conversational language understanding. In AAAI, volume 33, 6642–6649.
  • Liu et al. (2019) Liu, F.; Lu, J.; Han, B.; Niu, G.; Zhang, G.; and Sugiyama, M. 2019. Butterfly: A panacea for all difficulties in wildly unsupervised domain adaptation. arXiv preprint arXiv:1905.07720 .
  • Long et al. (2015) Long, M.; Cao, Y.; Wang, J.; and Jordan, M. I. 2015. Learning transferable features with deep adaptation networks. In ICML, 97–105.
  • Long et al. (2018) Long, M.; Cao, Z.; Wang, J.; and Jordan, M. I. 2018. Conditional adversarial domain adaptation. In NeurIPS, 1640–1650.
  • Long et al. (2013) Long, M.; Wang, J.; Ding, G.; Sun, J.; and Yu, P. S. 2013. Transfer feature learning with joint distribution adaptation. In ICCV, 2200–2207.
  • Long et al. (2016) Long, M.; Zhu, H.; Wang, J.; and Jordan, M. I. 2016. Unsupervised domain adaptation with residual transfer networks. In NeurIPS, 136–144.
  • Lu et al. (2015) Lu, J.; Behbood, V.; Hao, P.; Zuo, H.; Xue, S.; and Zhang, G. 2015. Transfer learning using computational intelligence: A survey. Knowledge-Based Systems 80: 14–23.
  • Lu et al. (2020) Lu, W.; Yu, Y.; Chang, Y.; Wang, Z.; Li, C.; and Yuan, B. 2020. A Dual Input-aware Factorization Machine for CTR Prediction. In Proceedings of the 29th International Joint Conference on Artificial Intelligence.
  • Mansour, Mohri, and Rostamizadeh (2009) Mansour, Y.; Mohri, M.; and Rostamizadeh, A. 2009. Domain Adaptation: Learning Bounds and Algorithms. In COLT.
  • Mohri and Medina (2012) Mohri, M.; and Medina, A. M. 2012. New analysis and algorithm for learning with drifting distributions. In ALT, 124–138. Springer.
  • Pei et al. (2018) Pei, Z.; Cao, Z.; Long, M.; and Wang, J. 2018. Multi-adversarial domain adaptation. arXiv preprint arXiv:1809.02176.
  • Pinheiro (2018) Pinheiro, P. O. 2018. Unsupervised domain adaptation with similarity learning. In CVPR, 8004–8013.
  • Redko et al. (2020) Redko, I.; Morvant, E.; Habrard, A.; Sebban, M.; and Bennani, Y. 2020. A survey on domain adaptation theory. arXiv preprint arXiv:2004.11829 .
  • Saenko et al. (2010) Saenko, K.; Kulis, B.; Fritz, M.; and Darrell, T. 2010. Adapting visual category models to new domains. In ECCV, 213–226. Springer.
  • Saito et al. (2018) Saito, K.; Watanabe, K.; Ushiku, Y.; and Harada, T. 2018. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, 3723–3732.
  • Shen et al. (2018) Shen, J.; Qu, Y.; Zhang, W.; and Yu, Y. 2018. Wasserstein distance guided representation learning for domain adaptation. In AAAI.
  • Tang and Jia (2020) Tang, H.; and Jia, K. 2020. Discriminative Adversarial Domain Adaptation. In AAAI, 5940–5947.
  • Tzeng et al. (2017) Tzeng, E.; Hoffman, J.; Saenko, K.; and Darrell, T. 2017. Adversarial discriminative domain adaptation. In CVPR, 7167–7176.
  • Vapnik and Chervonenkis (2015) Vapnik, V. N.; and Chervonenkis, A. Y. 2015. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of complexity, 11–30. Springer.
  • Venkateswara et al. (2017) Venkateswara, H.; Eusebio, J.; Chakraborty, S.; and Panchanathan, S. 2017. Deep hashing network for unsupervised domain adaptation. In CVPR, 5018–5027.
  • Wang and Breckon (2020) Wang, Q.; and Breckon, T. P. 2020. Unsupervised Domain Adaptation via Structured Prediction Based Selective Pseudo-Labeling. In AAAI, 6243–6250. AAAI Press.
  • Xu et al. (2020) Xu, M.; Zhang, J.; Ni, B.; Li, T.; Wang, C.; Tian, Q.; and Zhang, W. 2020. Adversarial Domain Adaptation with Domain Mixup. In AAAI, 6502–6509. AAAI Press.
  • Yu, Wang, and Yuan (2019) Yu, Y.; Wang, Z.; and Yuan, B. 2019. An Input-aware Factorization Machine for Sparse Prediction. In IJCAI, 1466–1472.
  • Zhang et al. (2018) Zhang, H.; Cisse, M.; Dauphin, Y. N.; and Lopez-Paz, D. 2018. mixup: Beyond Empirical Risk Minimization. In ICLR.
  • Zhang et al. (2020a) Zhang, K.; Gong, M.; Stojanov, P.; Huang, B.; Liu, Q.; and Glymour, C. 2020a. Domain Adaptation As a Problem of Inference on Graphical Models. In NeurIPS.
  • Zhang et al. (2017) Zhang, Q.; Wu, D.; Lu, J.; Liu, F.; and Zhang, G. 2017. A cross-domain recommender system with consistent information transfer. Decision Support Systems 104: 49–63.
  • Zhang et al. (2020b) Zhang, Y.; Deng, B.; Tang, H.; Zhang, L.; and Jia, K. 2020b. Unsupervised multi-class domain adaptation: Theory, algorithms, and practice. arXiv preprint arXiv:2002.08681.
  • Zhang et al. (2020c) Zhang, Y.; Liu, F.; Fang, Z.; Yuan, B.; Zhang, G.; and Lu, J. 2020c. Clarinet: A One-step Approach Towards Budget-friendly Unsupervised Domain Adaptation. arXiv preprint arXiv:2007.14612.
  • Zhang et al. (2019a) Zhang, Y.; Liu, T.; Long, M.; and Jordan, M. 2019a. Bridging Theory and Algorithm for Domain Adaptation. In Chaudhuri, K.; and Salakhutdinov, R., eds., ICML, volume 97 of PMLR, 7404–7413. PMLR.
  • Zhang et al. (2019b) Zhang, Y.; Tang, H.; Jia, K.; and Tan, M. 2019b. Domain-symmetric networks for adversarial domain adaptation. In CVPR, 5031–5040.
  • Zhao et al. (2019) Zhao, H.; des Combes, R. T.; Zhang, K.; and Gordon, G. 2019. On Learning Invariant Representation for Domain Adaptation. In ICML.
  • Zhong et al. (2020) Zhong, L.; Fang, Z.; Liu, F.; Yuan, B.; Zhang, G.; and Lu, J. 2020. Bridging the Theoretical Bound and Deep Algorithms for Open Set Domain Adaptation. arXiv preprint arXiv:2006.13022.
  • Zou et al. (2019) Zou, H.; Zhou, Y.; Yang, J.; Liu, H.; Das, H. P.; and Spanos, C. J. 2019. Consensus adversarial domain adaptation. In AAAI, volume 33, 5997–6004.