1 Introduction
In the classical transfer learning setting, given data points from the target domain, we aim to learn a function to predict the labels using labeled data from a different but related source domain. Let $X$ and $Y$ be the variables of features and labels, respectively. In contrast to standard supervised learning, the joint distributions $P^S_{XY}$ and $P^T_{XY}$ are different across domains. For example, in medical data analysis, health record data collected from patients of different age groups or hospital locations often vary [27]. Transferring invariant knowledge from a domain (e.g., an age group or a location) with a large number of observations to another with scarce labeled data is desirable since it is laborious to obtain high-quality labels for clinical data [5]. Similarly, in the indoor WiFi localization problem [39], the signal distributions received by different phone models are also different. To avoid labeling data for all phone models, it is essential to transfer knowledge from one phone model with sufficient labeled data to another. In this kind of problem, transfer learning techniques can improve the generalization ability of models learned from the source domain by correcting domain mismatches.

Depending on the assumptions made about how the joint distribution shifts across domains, several transfer learning scenarios have been studied. (1) Covariate shift is a traditional scenario where the marginal distribution $P_X$ changes but the conditional distribution $P_{Y|X}$ stays the same. In this situation, several methods have been proposed to correct the shift in $P_X$; for instance, importance reweighting [9] and invariant representation [18]. (2) Model shift [37] assumes that the marginal distribution $P_X$ and the conditional distribution $P_{Y|X}$ change independently. In this case, successful transfer requires $Y$ to be continuous, the change in $P_{Y|X}$ to be smooth, and some labeled data to be available in the target domain. (3) Target shift [40] assumes that the marginal distribution $P_Y$ shifts while $P_{X|Y}$ stays the same. In this scenario, $P_X$ and $P_{Y|X}$ will change dependently because their changes are caused by the change in $P_Y$. (4) Generalized target shift [40] assumes that $P_Y$ and $P_{X|Y}$ change independently across domains, causing $P_X$ and $P_{Y|X}$ to change dependently. An interpretation of the difference between these scenarios from a causal standpoint was also provided in [31].
The aforementioned transfer learning methods extract invariant knowledge across different domains based on a strong assumption: the source domain labels are “clean”. However, this assumption is often violated in practice, because accurately labeling a training set tends to be expensive, time-consuming, and sometimes impossible. For example, in medical data analysis, noisy labels are often inevitable due to the subjectivity of domain experts, insufficient discriminative information, and digitalization errors [30]. In computer vision, to reduce expensive human supervision, we often prefer directly transferring knowledge from easily obtainable but imperfectly labeled source data, such as webly-labeled data or machine-labeled data, to target data [16].

Therefore, in this paper, we consider the transfer learning setting in which the observed labels in the source domain are noisy. The noise is assumed to be random and the flip rates are class-conditional (abbreviated as CCN [22]), which is a widely employed label noise model in the machine learning community. The issue is that, since we have no access to the true source distribution when the labels are noisy, it might be problematic to directly apply existing transfer learning methods to correct the mismatches between the noisy source domain and the target domain.
As expected, except for the covariate shift scenario, in which correcting the shift in $P_X$ does not require label information, we can show that label noise can adversely affect most existing transfer learning methods in different scenarios. Taking target shift as an example, in order to correct the shift in $P_Y$, the labeled data points in the source domain are required to estimate the class ratio between $P^T_Y$ and $P^S_Y$. However, in the presence of label noise, it is unclear whether the class ratio can be estimated from noisy data. Another example is generalized target shift, where $P_Y$ and $P_{X|Y}$ change independently. In this scenario, in addition to the possibly wrong estimate of $P^T_Y$, the estimates of invariant representations would be inaccurate because label noise provides wrong information for matching distributions across domains while learning the representations. Label noise also affects learning in the model shift scenario, but we will not consider this case because we are concerned with discrete labels and with the setting in which there is no label in the target domain.
To address this issue, we propose a labelnoise robust transfer learning method in the generalized target shift scenario which is prevalent in transfer learning. To deal with the noisy labels in source domain, we propose a novel method to denoise conditional invariant components. Our method can provably identify the changes in clean distribution , and simultaneously extracts the conditional invariant representations which have similar across domains. Specifically, we construct a new distribution which is marginalized from the weighted noisy source distribution . Here, we denote as the distributions associated with label noise. By matching the distributions and , the conditional invariant components and are identifiable from the noisy source data and unlabeled target data. Moreover, in our denoising conditional invariant component framework, we can also theoretically ensure the convergence of the estimate of label distribution in target domain.
To verify the effectiveness of the proposed method, we conduct comprehensive experiments on both synthetic and real-world data. The performance is evaluated on classification problems. For a fair comparison, after extracting invariant representations using transfer learning methods, we train the robust classifier by employing the forward method in [24]. Compared with the state-of-the-art transfer learning methods, our method achieves superior performance. This also indicates that the proposed method is able to transfer invariant knowledge across different domains when label noise is present.
2 Related Work
2.1 Classification with Label Noise
Learning with noisy labels in classification has been widely studied [20, 36]. These methods can be coarsely grouped into four categories: unbiased losses [22], label-noise robust losses [12], label noise cleansing [1], and label noise fitting [34, 24]. Similar to many of these methods, we exploit a transition matrix to statistically model the label noise. However, the problem considered in this paper is more challenging because the clean source domain distribution is not assumed to be identical to the target domain distribution. In contrast to classification with label noise, our method can learn transferable knowledge across different domains, where both $P_Y$ and $P_{X|Y}$ may change and the labels of the source data are corrupted. Reports on general results obtained in this setting are scarce.
2.2 Traditional Generalized Target Shift Methods
Existing methods to address generalized target shift usually assume that there exists a transformation , e.g., locationscale transformation [40, 6], such that the conditional distribution is invariant across domains. In this paper, we also assume that the conditional invariant components (CICs) exist. We aim to find a transformation such that as in [6] and to estimate . However, we are given only samples drawn from the distribution and the noisy distribution , which makes the problem more challenging.
Note that our work is not a simple combination of traditional generalized target shift methods and robust classifiers. As aforementioned, simple combination of transfer learning and labelnoise robust classifier overlooks that the knowledge transfer process can be affected by label noise, which thus produces unreliable results. In the setting where only noisy source data and unlabeled target data are available, learning becomes pretty challenging. This is because without clean label in both domains, no direct information is available to ensure the matching of conditional densities such that can be learned. Moreover, it is challenging to estimate as briefly discussed in the introduction. If is known, the estimation of is essentially a mixture proportion estimation problem which will be analyzed in the following section. Even if we have the sample from the mixture , the samples of component distributions from source domain are noisy. We cannot obtain correct by using methods [6, 10]. Therefore, we proposed a novel denoising conditional invariant component framework. It is able to identify and conditional invariant components from the noisy source data and unlabeled target data.
In this paper, the simple combinations of transfer learning methods with robust classifiers are included as baselines in our experiments. We show that our method clearly outperforms these baselines, verifying the superiority of the proposed method in extracting invariant knowledge across different domains.
3 The Effects of Label Noise
In this section, we examine the effects of label noise in four different transfer learning scenarios, namely 1) covariate shift, 2) model shift, 3) target shift, and 4) generalized target shift. From a causal perspective, 1) and 2) assume that causes , indicating that and contain no information about each other [32]. In transfer learning, the causal relation implies that changes in are independent of changes in . If the change in is large, then it is difficult to correct the shift in because we often have no or scarce labels in the target domain. On the contrary, 3) and 4) assume that is the cause for , implying that changes in and are independent, while changes in and depend on each other. Figure 1 represents the causal relations between variables in transfer learning using selection diagram defined in [25]. Here, although the noisy label is usually generated after is observed, we exploit the causal model according to the assumption that flip rates are independent of features, which is widely employed in the label noise setting [22, 24, 33]. The effects of label noise in different scenarios are also summarized as follows:
Covariate shift. In covariate shift [9, 39], label noise has no effects on the correction of shift in . However, after correcting the shift in , one needs to take the effects of label noise into account when training a classifier on the source domain [22, 17].
Model shift. In the model shift scenario [37], since and change independently, we can correct them separately. Similar to covariate shift, correcting is not affected by label noise. However, correcting shift in requires matching and , which can be seriously harmed by label noise. In this scenario, since a small number of clean labels are assumed to be available in the target domain, is often assumed to change smoothly across domains to reduce the estimation error. The smoothness constraint can reduce the effects of label noise to some extent if one directly matches and .
Target shift. In target shift scenario [10, 40], it is required that . The changes in are often corrected by matching the marginal distribution of the reweighted source domain and the target domain , where and is the class number.
In the presence of label noise, however, we only have access to and in the source domain. As shown in Figure 2, becomes a mixture of and and is no longer identical to . In this case, directly applying the methods in [40, 10] on the noisy data will lead to wrong estimate of . Specifically, if we directly employ [40, 10] on noisy data, we need to estimate the mixture proportions for the model . But the estimated proportions are very likely to be different from those in the mixture model . Here, .
Suppose
and the label noise is symmetric, i.e., the probability of the labels flipping to each other is the same. Then, it is easy to derive that
is the same with . Therefore, as the sample size , the estimated density ratio also approaches, a vector of ones, which is a trivial solution, resulting in
. However, in most conditions, is often different from and label noise is asymmetric, which often leads to a wrong estimate of . Thus, we can see the adverse effects of label noise on target shift.
Generalized target shift. In generalized target shift [40, 6], also changes across domains, but it changes independently of . A widely employed approach is learning conditional invariant components that satisfy . Under the assumption of conditional invariant components, many works jointly learn and by matching and , which naturally requires the information of and .
However, in the setting of label noise, similar to target shift, the estimate of invariant components and the label distribution will be inaccurate if we directly use the noisy source distribution to correct distribution shift. For example, even though we assume that is successfully learned, the estimate of can be incorrect as that in target shift. A wrong estimate of can in turn adversely influence the learning of invariant representations in the joint optimization framework [6].
In conclusion, label noise is harmful for transferring invariant knowledge and correcting distribution shift in most transfer learning scenarios. We aim to reduce these adverse effects of label noise in the following sections.
4 Label-Noise Robust Transfer Learning
In this paper, we study a new transfer learning setting in which (1) both distributions and change across different domains; (2) and we are given noisily labeled source data and unlabeled target data. Specifically, denoting as a noisy label, we have access to only “noisy” observations in the source domain and unlabeled data in the target domain. Here, we consider the classconditional label noise. The generation of noisy labels is stochastically modeled via a transition probability , i.e., the flip rate from clean label to noisy label . All these transition probabilities are summarized into a transition matrix , where .
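The class-conditional noise model described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the orientation `Q[i, j] = P(noisy = j | clean = i)`, the two-class matrix, and the helper names `flip_labels` and `empirical_transition` are hypothetical choices made for this example.

```python
import numpy as np

def flip_labels(y, Q, rng):
    """Sample noisy labels; Q[i, j] is assumed to be P(noisy = j | clean = i)."""
    Q = np.asarray(Q, dtype=float)
    if not np.allclose(Q.sum(axis=1), 1.0):
        raise ValueError("each row of Q must sum to 1")
    # The flip depends only on the clean label, not on the features.
    return np.array([rng.choice(Q.shape[1], p=Q[label]) for label in y])

def empirical_transition(y, y_noisy, num_classes):
    """Row-normalised count matrix estimating P(noisy = j | clean = i)."""
    counts = np.zeros((num_classes, num_classes))
    np.add.at(counts, (y, y_noisy), 1.0)
    return counts / counts.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
Q = np.array([[0.8, 0.2],      # hypothetical asymmetric flip rates
              [0.3, 0.7]])
y_clean = rng.integers(0, 2, size=5000)
y_noisy = flip_labels(y_clean, Q, rng)
Q_hat = empirical_transition(y_clean, y_noisy, num_classes=2)
```

With enough samples, `Q_hat` should be close to `Q`. In the setting of this paper the clean labels are not observed, so such a direct count is not available and the transition matrix has to be estimated by other means.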
In many transfer learning methods, invariant representation learning and label shift correction are critical for transferring knowledge from the source domain to the target domain. For example, learning domain-invariant representations is a widely used principle for semantic segmentation [26, 8] and classification [9, 7]. Thus, in this new setting, we also aim to learn the invariant representations and the label distribution in the target domain such that the changes in $P_Y$ and $P_{X|Y}$ can be corrected and the effects of label noise can be alleviated.
In the following subsections, we first study how to provably identify invariant representations across different domains and correct the distribution shift in the general target shift scenario with label noise. Then, an importance reweighting framework is introduced for classification problem. Both linear and deep models are finally presented for transfer learning with label noise.
4.1 Denoising Conditional Invariant Components
In the label noise setting, learning invariant representations and is very challenging due to the fact that we can only observe the noisy labels but have no information of clean label in the source domain. To address this issue, we first introduce a specific conditional invariant representation to ensure this problem being tractable. That is, we assume that for every dimensional data , there exists a transformation satisfying
(4.1) 
where are known as conditional invariant components (CICs) [6] across different domains.
Since label noise makes existing transfer learning methods ineffective, we propose a novel method to denoise the conditional invariant components. We find that if the information of label noise model is available, a unique relationship between and can be built, which, in turn, is a clue for us to identify .
We observe that label noise does not affect the distribution of . Then, intuitively, if we marginalize out the variable of the noisy labels, we may achieve Eq. (4.1) by matching the marginal distribution . But we need some nontrivial strategies to make it possible. Specifically, we first construct a new distribution , which is marginalized from the reweighted distribution as follows,
(4.2) 
where are the weights for noisy labels. Note that, in the rest of this paper, when no ambiguity occurs, we use as the variable for both “clean” and “noisy” labels; otherwise, both and are used as variables for “clean” and “noisy” label, respectively.
Then, under mild conditions, by matching the distribution with the new distribution , we can provably identify the invariant components :
Theorem 1.
Suppose the transformation satisfies that are linearly independent, and that the elements in the set are linearly independent. Then, if , we have ; and , where .
Please see the proof of Theorem 1 in Appendix A. Note that the linearly independent property is a weak assumption which has been widely used as the basic condition for class ratio estimation [6] and mixture proportion estimation [38].
Let and . According to Theorem 1, we have . In label noise, we often assume that is usually diagonally dominant and invertible. Then, the relationship between and is uniquely determined, as well as the relationship between and . In this case, if is known and these two marginal distributions are successfully matched, we can (1) identify the conditional invariant components; (2) and learn which indicates that the changes in the distribution is also identifiable. In practice, the transition matrix is not available, but we can usually estimate it by methods in [17, 24].
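To make the role of an invertible transition matrix concrete, the sketch below shows only the elementary law-of-total-probability relation between clean and noisy class priors, and its inversion. It is not the statement of Theorem 1; the orientation `Q[i, j] = P(noisy = j | clean = i)` and the numbers are assumptions for illustration.

```python
import numpy as np

def noisy_prior(clean_prior, Q):
    """P(noisy = j) = sum_i P(noisy = j | clean = i) * P(clean = i)."""
    return np.asarray(Q, dtype=float).T @ np.asarray(clean_prior, dtype=float)

def clean_prior_from_noisy(noisy, Q):
    """Invert the relation above; this requires Q to be invertible."""
    return np.linalg.solve(np.asarray(Q, dtype=float).T,
                           np.asarray(noisy, dtype=float))

Q = np.array([[0.8, 0.2],      # hypothetical, diagonally dominant
              [0.3, 0.7]])
p_clean = np.array([0.4, 0.6])
p_noisy = noisy_prior(p_clean, Q)
p_back = clean_prior_from_noisy(p_noisy, Q)
```

In this example the noisy prior is balanced although the clean prior is not, which is one way to see why class ratios computed directly from noisy labels can be misleading.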
4.1.1 Denoising MMD Loss
To enforce the matching between and , we employ the kernel mean matching of these two distributions and minimize the squared maximum mean discrepancy (MMD) loss:
(4.3) 
where is a kernel mapping. According to Eq. (4.2), we have
Therefore, minimizing Eq. (4.3) is equivalent to minimizing
In practice, we can only observe the corruptly labeled source data and the unlabeled target data . Therefore, we approximate the expected kernel mean values by the empirical ones:
(4.4) 
where ; denotes the matrix of the invariant representations.
However, Eq. (4.4) is not explicitly formulated w.r.t. . If we directly optimize Eq. (4.4) w.r.t. , it will result in incorrect that violates the fact that should be the same for the same . It is thus impossible to identify .
Therefore, we need to reparameterize the formulation by applying the relationship between and in Theorem 1, i.e., . It is also easy to derive that . Given estimated and , we can construct the vectors . If , , define the matrix , where the th row of is . Let . Then, is an estimate of .
The denoising MMD loss now can be reparametrized as
(4.5) 
where and are the kernel matrix of and , respectively; is the cross kernel matrix. In this paper, the Gaussian kernel, i.e., is applied, where is the bandwidth.
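As a rough illustration of the kernel mean matching idea in this subsection, the following NumPy sketch computes a biased empirical estimate of the squared MMD between a reweighted source sample and a target sample under a Gaussian kernel. It is not the exact parameterization of Eq. (4.5): the function names, the per-sample weight vector `weights`, and the $2\sigma^2$ bandwidth convention are assumptions made for this example.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Kernel matrix with entries exp(-||a - b||^2 / (2 * sigma^2))."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def weighted_mmd2(Xs, Xt, weights, sigma):
    """Squared MMD between the weighted source sample and the target sample."""
    m, n = len(Xs), len(Xt)
    w = np.asarray(weights, dtype=float)
    Ks = gaussian_kernel(Xs, Xs, sigma)    # source-source kernel matrix
    Kt = gaussian_kernel(Xt, Xt, sigma)    # target-target kernel matrix
    Kts = gaussian_kernel(Xt, Xs, sigma)   # cross kernel matrix
    return (w @ Ks @ w) / m**2 - 2.0 * (Kts @ w).sum() / (m * n) + Kt.sum() / n**2

rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(200, 2))
Xt_same = rng.normal(0.0, 1.0, size=(200, 2))
Xt_shift = rng.normal(2.0, 1.0, size=(200, 2))
ones = np.ones(len(Xs))
mmd_same = weighted_mmd2(Xs, Xt_same, ones, sigma=1.0)
mmd_shift = weighted_mmd2(Xs, Xt_shift, ones, sigma=1.0)
```

With uniform weights this reduces to the usual biased MMD estimate, so `mmd_shift` should be clearly larger than `mmd_same`. In the method described here the weights would instead be tied to the noisy labels through the transition matrix.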
4.2 Importance Reweighting
Since the denoising MMD loss can provably identify conditional invariant components and correct label shift, we can now learn labelnoise robust classifiers. In this classification problem, we aim to learn a hypothesis function from the noisy source data that can generalize well on the target data. Ideally, minimizes the expected loss , where
is the loss function;
are the conditional invariant components of .
In practice, we often assume that can predict [29, 24] and predicts the label. Here, is the th entry of . To facilitate the learning of , we first imagine that the target domain has the same label noise model as the source domain. Note that this does not necessarily imply that label noise really exists in the target domain because, in our setting, we have no label information for the target data at all. We can see that the minimizer is also assumed to be able to predict . If the classifier is found and is invertible, we can obtain according to the following relationship:
(4.6) 
Thus, the problem remains to learn , which can be obtained by exploiting the importance reweighting strategy:
Since is constructed from by using the same transition matrix and , we can easily have and thus
where . In practice, only the training sample is observable, we thus minimize the empirical loss,
(4.7) 
to find the approximated classifier .
Instead of separately finding by minimizing Eq. (4.7) and transiting to according to Eq. (4.6), in this paper, we employ the forward strategy proposed in [24]; that is, we directly minimize the following risk,
(4.8) 
As we know, by minimizing the empirical risk in Eq. (4.8), can approximately predict . Then, according to Eq. (4.6), can finally approximately predict .
Note that, in practice, the ratio is also unknown. But can be empirically estimated from the noisy source data, and is estimated by our denoising MMD loss, can also be computed according to the relationship similar to Eq. (4.6). In this way, can be obtained.
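The combination of importance reweighting with the forward strategy can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a softmax model, per-sample importance weights that are already available, and the orientation `Q[i, j] = P(noisy = j | clean = i)`; the function names are hypothetical and the code is not claimed to reproduce Eq. (4.8) term by term.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)   # stabilise the exponent
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward_weighted_ce(logits, y_noisy, Q, weights):
    """Importance-weighted cross-entropy on noisy labels after forward correction."""
    p_clean = softmax(logits)     # model estimate of P(clean label | x)
    p_noisy = p_clean @ Q         # P(noisy = j | x) = sum_i P(clean = i | x) Q[i, j]
    picked = p_noisy[np.arange(len(y_noisy)), y_noisy]
    return float(np.mean(weights * -np.log(picked + 1e-12)))

logits = np.array([[2.0, 0.0],
                   [0.0, 2.0]])
y_obs = np.array([0, 1])                  # observed (possibly noisy) labels
Q = np.array([[0.8, 0.2],
              [0.3, 0.7]])                # hypothetical transition matrix
w = np.ones(2)                            # hypothetical importance weights
loss_noisy_model = forward_weighted_ce(logits, y_obs, Q, w)
loss_identity = forward_weighted_ce(logits, y_obs, np.eye(2), w)
```

With an identity transition matrix and unit weights the loss reduces to the ordinary cross-entropy, which gives a simple sanity check for an implementation.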
4.3 The Overall Models
We are now ready to introduce the proposed models. In order to extract conditional invariant components, the transformation varies from linear ones to nonlinear ones depending on the complexity of input data space. We accordingly propose the following two representative transfer learning models.
4.3.1 Linear Model
Linear model is a twostage model in which we first identify invariant representations and and then train the classifier according to the importance reweighting framework. In linear model, . To avoid the trivial solution, is constrained to be orthogonal. Then, according to Eq. (4.5), we have
Note that even though the objective function has similar form with that in [6], it is essentially different. This is because in this objective function, the source data is noisily labeled and is carefully designed to relate and such that conditional invariant components and can be identified from noisy source data and unlabeled target data.
The alternating optimization method is applied to update and . Specifically, we apply the conjugate gradient algorithm on the Grassmann manifold to optimize , and use the quadratic programming to optimize . After identifying the invariant representations and by solving above problem, we can then use them to train a classifier for the target data by minimizing Eq. (4.8).
4.3.2 Deep Model
Besides the two-stage linear model, we also propose an end-to-end learning model incorporating deep neural networks, which have been proven to be effective at extracting invariant knowledge across different domains [18, 19]. Here, we modify the conventional deep neural network for classification, e.g., AlexNet [13], in two aspects: (1) because the domain discrepancy becomes larger for the features in higher layers, we impose the denoising MMD loss on a higher layer for extracting the invariant representations; (2) to learn a classifier robust to label noise, we also add the forward procedure [24] before the cross-entropy (CE) loss as in Eq. (4.8). The structure is shown in Figure 3.
Specifically, let be the responses of the th hidden layer, be the parameters in the th to th layers, and be the total number of layers in the deep neural network. Suppose that we impose the denoising MMD loss on the features in the th layer; that is, . Then, the denoising MMD loss is
(4.9) 
where is the matrix of the responses of the th layer.
Denote as the softmax output w.r.t. the input (see Figure 3). According to Eq. (4.8), the loss for classification is
(4.10) 
where if ; denotes the th column of . Together with the regularization (e.g., norm) of the parameters, our final model becomes
(4.11)  
s.t. 
where and are the tradeoff parameters of denoising MMD loss and regularization, respectively. Again, by minimizing Eq. (4.11), if approximates , then approximates . We can then successfully learn the classifier for the target data.
4.4 Convergence Analysis
In this subsection, we study the convergence rates of the estimators to the true label noise rates and optimal class priors. The convergence rate for estimating the label noise rates has been well studied under the “anchor set” condition that for any there exist in the domain of such that and , which is likely to hold in practice. For example, two estimators with convergence guarantees have been proposed in [17] and [33], respectively. Recently, [28] exploited the “anchor set” condition in Hilbert space and designed estimators that can converge to the true label noise rates with an order of . Some work based on a weaker assumption, i.e., the linear independence assumption, has also been proposed to estimate label noise, with a guaranteed fast convergence rate [38]. Therefore, we mainly focus on the convergence analysis of estimating class ratios.
In order to analyze the convergence rate of the estimated class prior to the optimal in the presence of label noise, we first abuse the training samples and as i.i.d. variables, respectively. Abuse as the parameters related to the transformation and
We analyze the convergence rate by deriving an upper bound for with fixed and .
Theorem 2.
Given learned and , let the induced RKHS be universal and upper bounded that for all in the source and target domains, and let the entries of be bounded that for all . , with probability at least , we have
(4.12) 
See the proof of Theorem 2 in Appendix B. Although the bound in Theorem 2 involves two fixed parameters, the result is informative if and are given or and quickly converge to and , respectively. From previous analyses, we know that fast convergence rates for estimating the label noise rate are guaranteed. However, the convergence of to is not guaranteed because the objective function is nonconvex w.r.t. . How to identify the transferable components should be further studied.
5 Experiments
To show the robustness of our method to label noise, we conduct comprehensive evaluations on both simulated and real data. We first compare our method, denoising conditional invariant components (abbr. as DCIC hereafter), with CIC [6] on identifying the changes in given noisy observations. The effectiveness of the linear and deep models is then verified on both the synthetic and real data. We compare DCIC with the domain invariant projection (DIP) method [2], transfer component analysis (TCA) [23], Deep Adaptation Networks (DAN) [18] and CIC [6]. In all experiments, the bandwidth of the Gaussian kernel is set to be the median value of the pairwise distances between all raw features (linear model) or between all the extracted invariant features (deep model).
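The median bandwidth rule mentioned above can be written as a short helper. This is a sketch only: the function name is hypothetical, and excluding self-distances and counting each unordered pair once is an assumption, since the text does not specify these details.

```python
import numpy as np

def median_bandwidth(X):
    """Median of the pairwise Euclidean distances between distinct rows of X."""
    X = np.asarray(X, dtype=float)
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs**2).sum(axis=-1))
    iu = np.triu_indices(len(X), k=1)   # each unordered pair once, no self-distances
    return float(np.median(dists[iu]))

# Toy check: the pairwise distances are 5, 10 and 5, so the median is 5.
X_toy = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
sigma_toy = median_bandwidth(X_toy)
```

For the linear model the rows of `X` would be the raw features, and for the deep model the extracted features; the dense distance matrix used here is only practical for moderate sample sizes.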
5.1 Synthetic Data
We study the performance of the linear DCIC model in two situations: (a) the estimation of the class ratio in the target shift (TarS) scenario given the true flip rates (i.e., transition probabilities); and (b) the evaluation of the extracted invariant components in the generalized target shift (GeTarS) scenario, with various class ratios and different label flip rates. In all experiments, the flip rates are estimated using the method proposed in [17]. We repeat the experiments 20 times and report the average performance.
We generate the binary classification training and test data from a 2dimensional mixture of Gaussians [6], i.e., where the mean parameters
are sampled from the uniform distribution
and the covariance matrices are sampled from the Wishart distribution . The class labels are the cluster indices. Under TarS, remains the same. We only change the class priors across domains. Under GeTarS, we apply location and scale transformations on the features to generate target domain data. To get the noisy observations, we randomly flip the clean labels in the source domain with the same transition probability .

First, we verify that with corrupted labels, the proposed DCIC can almost recover the correct class ratio under TarS. We set the source class prior to 0.5. The target domain class prior varies from 0.1 to 0.9 with step 0.1. The corresponding class ratio varies from 0.2 to 1.8 with step 0.2. Then, we compare the proposed method with CIC [6] on finding the true class ratio with noisy labels in the source domain. We evaluate the performance by using the class ratio estimation error , where is the estimated class ratio vector. Figure 4(a) shows that DCIC can find solutions close to the true for various class ratios. In this experiment, given large label noise (), estimated by CIC is close to the true one only when is close to 0, 1, and 2. The estimation of CIC is accurate at because we set the class prior to 0.5 in the clean source domain, which happens to make . If , then , the estimated will be wrong (see Section 3). CIC gives accurate results when is close to 0 or 2 because the target domain collapses to a single class, rendering the estimated results trivially right. Figure 4(b) shows the superiority of the proposed method over CIC at different levels of label noise. When , CIC finds incorrect solutions. However, our method can find a good solution even when is close to 0.5. Figure 4(c) shows that the estimate of improves as the sample size gets larger.
Second, under GeTarS, we evaluate whether our method can discover the invariant representations given the noisy source data and unlabeled target data. In these experiments, we fix the sample size to 500, and the class prior to 0.5. We use classification accuracies to measure the performance. The results in Figure 5 show that our method is more robust to the label noise than DIP, TCA, and CIC.
5.2 Real Data
 | Softmax | TCA | DIP | CIC | DCIC
t1 → t2 | 60.73 ± 0.66 | 70.80 ± 1.66 | 71.40 ± 0.83 | 75.50 ± 1.02 | 79.28 ± 0.56
t1 → t3 | 55.20 ± 1.22 | 67.43 ± 0.55 | 64.65 ± 0.32 | 69.05 ± 0.28 | 70.75 ± 0.91
t2 → t3 | 54.38 ± 2.01 | 63.58 ± 1.33 | 66.71 ± 2.63 | 70.92 ± 3.86 | 77.28 ± 2.87
hallway1 | 40.81 ± 12.05 | 42.78 ± 7.69 | 44.31 ± 8.34 | 51.83 ± 8.73 |
hallway2 | 27.98 ± 10.28 | 43.68 ± 11.07 | 44.61 ± 5.94 | 43.96 ± 6.20 | 60.50 ± 8.68
hallway3 | 24.94 ± 9.89 | 31.44 ± 5.47 | 33.50 ± 2.58 | 32.00 ± 3.88 | 33.89 ± 5.94
Table 1: Classification accuracies and their standard deviations for the WiFi localization dataset.
WiFi Localization Dataset. We further compare our linear DCIC model with DIP, TCA, and CIC on the crossdomain indoor WiFi localization dataset [39]. The problem is to learn the function between signals and locations . Here, we view it as a classification problem, where each location space is assigned with a discrete label. In the prediction stage, the label is then converted to the location information. We resample the training set to simulate the changes in . To ensure that the class ratio is not a vector of all ones, we resample the source training examples. We randomly select classes and let their class ratios be 2.5. For the other classes, we set their to be equal. The flip rate from one label to another is set to .
We first learn the linear transformation () and extract the invariant components. A neural network with one hidden layer is then trained by minimizing Eq. (4.7), and the classifier for the signals in the target domain is obtained according to Eq. (4.8). The output layer is a softmax with the cross-entropy loss. The activation function in the hidden layer is the Rectified Linear Unit (ReLU). The number of neurons in the hidden layer is set to 800. During training, the learning rate is fixed to 0.1. After training, as in [6], we report the percentage of examples on which the difference between the predicted and true locations is within 3 meters. Here, we train a neural network with the raw features as the baseline. All the experiments are repeated 10 times and the average performance is reported. In Table 1, the three upper rows present the transfer across different time periods , and , where . The lower part shows the transfer across different devices, where . The results show that DCIC transfers the invariant knowledge better than the other methods.
In the lower part of the table, the input features in the two domains are too complex for the invariant components to be well identified by a simple linear transformation, which results in degraded performance. Therefore, for data with complex features, we introduce our deep denoising models to extract invariant components and to correct the shift. The experiments on deep models are shown in the following subsections.






FT+Forward  58.12 ± 0.32  61.02 ± 0.90  59.27 ± 1.51  65.90 ± 0.65
FT+Forward  54.93 ± 2.23  60.80 ± 0.49  56.97 ± 1.36  65.51 ± 3.07
DAN+Forward  59.34 ± 5.43  64.68 ± 1.07  62.82 ± 1.15  67.05 ± 0.77
DAN+Forward  54.76 ± 1.62  63.87 ± 0.84  61.28 ± 1.44  65.70 ± 1.24
CIC  65.23 ± 2.63  58.09 ± 2.17  66.70 ± 1.31  61.02 ± 3.96
CIC+Forward  65.37 ± 2.49  63.35 ± 4.43  66.84 ± 3.62  68.45 ± 0.91
CIC+Forward  64.18 ± 1.49  62.78 ± 2.92  63.42 ± 0.99  67.99 ± 1.30
DCIC+Forward  69.94 ± 2.25  68.77 ± 2.34  72.33 ± 2.15  70.80 ± 1.59
DCIC+Forward  68.50 ± 0.37  66.78 ± 1.53  69.29 ± 4.07  70.47 ± 2.29
MNIST-USPS. The USPS dataset is a handwritten digit dataset with ten classes (0 to 9); it contains 7,291 training images and 2,007 test images of size 16×16, which are rescaled to 28×28. MNIST shares the same 10 digit classes and consists of 60,000 training images and 10,000 test images of size 28×28. In our experiments, these two datasets are resampled to construct transfer learning datasets in which the class priors vary across domains. For MNIST, we assume that the class priors are unbalanced: for the first 5 classes, the class prior is set to 0.04, and for the remaining 5 classes, it is set to 0.16. For USPS, the class priors are balanced; that is, the class prior is set to 0.1 for each class. According to these class priors, we sample 5,000 images from each of the MNIST and USPS datasets to construct the new dataset mnist2usps. We switch the source/target pair to get another dataset, usps2mnist. As in [24], in the source data, label noise flips between similar digits: , , , with the transition probability or . After the noisy data are obtained, we hold out 10 percent of the source data as a validation set. The LeNet [14] structure in Caffe’s [11] MNIST tutorial is employed to train the model from scratch. Our denoising MMD loss is imposed on the first fully connected layer. In all experiments, regularization is applied and we set and . The batch sizes for both source and target data are set to 100. The initial learning rate is , and it is decayed exponentially according to , where is the index of the current iteration. Each experiment is repeated 5 times.
Here, DCIC is compared with a baseline that fine-tunes on the source data only (FT), DAN, and CIC. These methods are combined with the forward procedure of [24] to reduce the effects of label noise, and are denoted with the suffix “Forward (resp. )” when the true (resp. estimated) transition matrix is used. The results are shown in Table 2. When label noise is present, CIC-based methods cannot correctly estimate the class ratios, which adversely affects the identification of the invariant components; they thus perform worse than the DAN-based methods in some cases. The latter, however, ignore the change of in different domains. In contrast, our method often gives better estimates of the class ratios and can effectively identify the invariant components, which leads to higher performance.
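To make the role of the transition matrix concrete, the following is a minimal sketch of a forward-corrected cross-entropy loss. It assumes that the forward procedure maps the predicted clean-label posterior through the transition matrix before the cross-entropy is evaluated on the noisy label. The names `forward_loss` and `T`, and the convention that `T[i][j]` is the probability of observing noisy label `j` given clean label `i`, are assumptions made for this illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_loss(logits, noisy_label, T):
    """Cross-entropy on the noisy label after pushing the predicted
    clean-label posterior through the transition matrix T (assumed
    convention: T[i][j] = P(noisy = j | clean = i))."""
    p_clean = softmax(logits)
    # Predicted probability of the observed noisy label.
    q = sum(T[i][noisy_label] * p_clean[i] for i in range(len(p_clean)))
    return -math.log(q)

# Illustrative use: two classes, class 1 flips to class 0 with probability 0.3.
T = [[1.0, 0.0],
     [0.3, 0.7]]
loss = forward_loss([0.0, 0.0], noisy_label=0, T=T)
```

When `T` is the identity matrix, this reduces to the ordinary softmax cross-entropy, which corresponds to training without any noise correction.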
VLCS. The VLCS dataset [35] consists of images from five common classes (“bird”, “car”, “chair”, “dog”, and “person”) drawn from four datasets: Pascal VOC 2007 (V), LabelMe (L), Caltech (C), and SUN09 (S). For each of these four datasets, we first randomly select at most 300 images per class to construct a new dataset. We then construct the transfer learning datasets using the leave-one-domain-out evaluation strategy. For example, in “VLS2C”, the source data is the combination of the new Pascal VOC 2007, LabelMe, and SUN09 datasets, and the target data is the new Caltech dataset. In each source dataset, the labels flip from “person” to “car”, “chair” to “person”, and “dog” to “person” with the probability . We leave of the source data as the validation set. Each experiment is repeated 5 times.
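The asymmetric label noise described above can be written as a transition matrix. The sketch below is an illustration under stated assumptions: the flip probability is not recoverable from this section, so `rho` is a hypothetical parameter, and the class ordering, helper names, and sampling routine are likewise chosen only for the example.

```python
import random

CLASSES = ["bird", "car", "chair", "dog", "person"]
# Flips stated in the text: source class -> class it may be mislabeled as.
FLIPS = {"person": "car", "chair": "person", "dog": "person"}

def build_transition_matrix(rho):
    """Return T with T[i][j] = P(noisy = j | clean = i) for flip probability
    rho (hypothetical value; the paper's setting is not given here)."""
    idx = {c: i for i, c in enumerate(CLASSES)}
    k = len(CLASSES)
    T = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for src, dst in FLIPS.items():
        T[idx[src]][idx[src]] = 1.0 - rho
        T[idx[src]][idx[dst]] = rho
    return T

def flip_labels(labels, T, seed=0):
    """Sample a noisy label for each clean label from the matching row of T."""
    rng = random.Random(seed)
    k = len(T)
    return [rng.choices(range(k), weights=T[y], k=1)[0] for y in labels]

# Illustrative use with an assumed flip probability of 0.4.
T = build_transition_matrix(0.4)
noisy = flip_labels([0, 1, 2, 3, 4] * 20, T)
```

Rows for “bird” and “car” remain one-hot, since the text lists no flips out of those classes.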
In these experiments, a pretrained AlexNet [13] model is fine-tuned on the source data with the parameters of the conv1-conv3 layers frozen. We impose our denoising MMD loss on the fc7 layer. The batch sizes for both source and target data are 32. The initial learning rate is 0.001 and is decayed exponentially according to . The results are shown in Table 3. Our proposed method also improves on the compared baselines, which indicates that the proposed model can effectively correct the shift across domains even when label noise is present.
Method  VLS2C  LCS2V  VLC2S  VCS2L
FT+Forward  85.88 ± 2.17  62.07 ± 0.86  59.40 ± 1.37  49.34 ± 1.39
FT+Forward  78.62 ± 4.36  59.49 ± 0.50  57.09 ± 1.81  49.14 ± 1.39
DAN+Forward  87.66 ± 2.37  64.37 ± 2.07  59.54 ± 0.83  51.07 ± 1.26
DAN+Forward  84.69 ± 0.24  58.64 ± 1.91  57.51 ± 1.25  50.41 ± 1.20
CIC  75.15 ± 6.23  54.69 ± 0.96  53.61 ± 2.35  49.30 ± 0.48
CIC+Forward  86.83 ± 2.53  64.22 ± 0.27  60.36 ± 0.36  51.76 ± 0.82
CIC+Forward  85.69 ± 1.76  59.80 ± 0.47  57.65 ± 0.60  50.33 ± 0.31
DCIC+Forward  91.60 ± 0.51  65.67 ± 0.37  61.79 ± 0.77  52.47 ± 0.50
DCIC+Forward  87.28 ± 1.18  63.35 ± 0.37  58.88 ± 0.74  51.60 ± 1.48
5.3 Discussions
5.3.1 Convergence analysis
In order to verify the effectiveness of the proposed method to estimate , in Figure 6 (a), we show the convergence of the estimation errors