Label Propagation with Augmented Anchors (A^2LP)
A^2LP for short; ECCV 2020 spotlight; investigating SSL principles for UDA problems.
Motivated by the problem relatedness between unsupervised domain adaptation (UDA) and semi-supervised learning (SSL), many state-of-the-art UDA methods adopt SSL principles (e.g., the cluster assumption) as their learning ingredients. However, they tend to overlook the very domain-shift nature of UDA. In this work, we take a step further to study the proper extensions of SSL techniques for UDA. Taking the algorithm of label propagation (LP) as an example, we analyze the challenges of adopting LP in UDA and theoretically characterize the conditions of affinity graph/matrix construction needed to achieve better propagation of true labels to unlabeled instances. Our analysis suggests a new algorithm of Label Propagation with Augmented Anchors (A^2LP), which could potentially improve LP via generation of unlabeled virtual instances (i.e., the augmented anchors) with high-confidence label predictions. To make the proposed A^2LP useful for UDA, we propose empirical schemes to generate such virtual instances. The proposed schemes also tackle the domain-shift challenge of UDA by alternating between pseudo labeling via A^2LP and domain-invariant feature learning. Experiments show that such a simple SSL extension improves over representative UDA methods of domain-invariant feature learning, and could empower two state-of-the-art methods on benchmark UDA datasets. Our results show the value of further investigating SSL techniques for UDA problems.
As a specific setting of transfer learning [32], unsupervised domain adaptation (UDA) aims to predict labels of given instances on a target domain, by learning classification models assisted with labeled data on a source domain whose distribution differs from the target one. Impressive results have been achieved by learning domain-invariant features [43, 27, 45], especially the recent ones based on adversarial training of deep networks [12, 42, 36, 47]. These methods are primarily motivated by the classical UDA theories [4, 3, 30, 46] that specify the success conditions of domain adaptation, where domain divergences induced by the hypothesis space of classifiers are typically involved.
While a main focus of these methods is on designing algorithms to learn domain-invariant features, they largely overlook a property that UDA shares with the related problem of semi-supervised learning (SSL) — both UDA and SSL argue for a principle that the (unlabeled) instances of interest satisfy basic assumptions (e.g., the cluster assumption [6]), although in SSL the unlabeled instances follow the same distribution as that of the labeled ones. Given the advantages of SSL methods over models trained with labeled data only [5], it is natural to apply SSL techniques to the domain-invariant features learned by seminal UDA methods [27, 12] so as to boost the performance further; we note, however, that ideal domain alignment can hardly be achieved in practice. Although state-of-the-art results have already been achieved by combining vanilla SSL techniques with domain-invariant feature learning [17, 9, 25, 29, 47, 18], these methods typically neglect the fact that SSL methods are designed for data of a single domain, and their direct use on data with shifted distributions (e.g., in UDA tasks) could result in deteriorated performance.
To this end, we investigate how to extend SSL techniques for UDA problems, taking the SSL method of label propagation (LP) [48] as an example. When there exists such a shift of distributions, edges of an LP graph constructed from affinity relations of data instances could be of low reliability, thus preventing its direct use in UDA problems. To tackle the issue, we analyze in this paper the conditions of the affinity graph (and the corresponding affinity matrix) for better propagation of true labels to unlabeled instances. Our analysis suggests a new algorithm of Label Propagation with Augmented Anchors (A^2LP), which could potentially improve LP via generation of unlabeled virtual instances (i.e., the augmented anchors) with high-confidence label predictions. To make the proposed A^2LP particularly useful for UDA, we generate such virtual instances via a weighted combination of unlabeled target instances, using weights computed from the entropy of their propagated soft cluster assignments, considering that instances of low entropy are more confident in terms of their predicted labels. We iterate between the steps of (1) using A^2LP to obtain pseudo labels of target instances, and (2) learning domain-invariant features with the obtained pseudo-labeled target instances and labeled source ones, where the second step, in turn, improves the quality of pseudo labels of target instances. Experiments on benchmark UDA datasets show that our proposed A^2LP significantly improves over the LP algorithm, and alternating steps of A^2LP and domain-invariant feature learning give state-of-the-art results. We finally summarize our main contributions as follows.

Motivated by the relatedness between SSL and UDA, we study in this paper the technical challenge that prevents the direct use of graph-based SSL methods in UDA problems. We analyze the conditions of the affinity graph/matrix construction for better propagation of true labels to unlabeled instances, which suggests a new algorithm of A^2LP. A^2LP could potentially improve LP via generation of unlabeled virtual instances (i.e., the augmented anchors) with high-confidence label predictions.
To make the proposed A^2LP useful for UDA, we generate virtual instances as augmented anchors via a weighted combination of unlabeled target instances, where the weights are computed based on the entropy of propagated soft cluster assignments of target instances. Our A^2LP-based UDA method alternates between obtaining pseudo labels of target instances via A^2LP, and using the obtained pseudo-labeled target instances, together with the labeled source ones, to learn domain-invariant features. The second step is expected to enhance the quality of pseudo labels of target instances.
We conduct careful ablation studies to investigate the influence of graph structure on the results of A^2LP. Empirical evidence on benchmark UDA datasets shows that our proposed A^2LP significantly improves over the original LP, and the alternating steps of A^2LP and domain-invariant feature learning give state-of-the-art results, confirming the value of further investigating SSL techniques for UDA problems. The code is available at https://github.com/YBZh/Label-Propagation-with-Augmented-Anchors .
In this section, we briefly review the UDA methods, especially those [35, 47, 29, 38, 23, 11, 17, 9, 25] involving SSL principles as their learning ingredients, and the recent works [51, 39, 48, 2, 7, 21] on the LP technique.
Motivated by the theoretical bounds proposed in [4, 3, 46], the dominant UDA methods aim at minimizing the discrepancy between the two domains, which is measured by various statistical distances, such as Maximum Mean Discrepancy (MMD) [27], the Jensen-Shannon divergence [12], and the Wasserstein distance [37]. They assume that once the domain discrepancy is minimized, a classifier trained on source data only can also perform well on the target ones. Given the advantages of SSL methods over models trained with labeled data only [5], it is natural to apply SSL techniques on domain-invariant features to boost the results further. Recently, state-of-the-art results have been achieved by incorporating SSL principles into UDA, although the respective papers may not have emphasized this point explicitly. Based on the cluster assumption, entropy regularization [13] is adopted in UDA methods [38, 29, 47, 23] to encourage low-density separation of category decision boundaries, typically in conjunction with virtual adversarial training [31] to incorporate a locally-Lipschitz constraint. The vanilla LP method [52] is adopted in [17, 9, 25] together with the learning of domain-invariant features. Based on the mean teacher model of [41], a self-ensembling (SE) algorithm [11] is proposed to penalize the prediction differences between student and teacher networks for the same input target instance. Inspired by tri-training [50], three task classifiers are asymmetrically used in [35]. However, taking the comparable LP-based methods [17, 9, 25] as an example, they adopt the vanilla LP algorithm directly with no consideration of the domain-shift nature of UDA. By contrast, we analyze the challenges of adopting LP in UDA, theoretically characterize the conditions of potential improvement (cf. Proposition 1), and accordingly propose an algorithmic extension of LP for UDA. Such a simple algorithmic extension improves the results dramatically on benchmark UDA datasets.
The LP algorithm is based on a graph whose nodes are data instances (labeled and unlabeled), and edges indicate the similarities between instances. The labels of labeled data can propagate through the edges in order to label all nodes. Following the above principle, a series of LP algorithms [51, 39, 48] and the graph regularization methods [2, 7, 21] have been proposed for the SSL problems. Recently, Iscen et al. [19] revisit the LP algorithm for SSL problems with the iterative strategy of pseudo labeling and network retraining. Liu et al. [26] study the LP algorithm for few-shot learning. Unlike them, we investigate the LP algorithm for UDA problems and alleviate the performance deterioration brought by domain shift via the introduction of virtual instances as well as the domain-invariant feature learning.
Given data sets $\mathcal{X}_L = \{x_i\}_{i=1}^{l}$ and $\mathcal{X}_U = \{x_i\}_{i=l+1}^{n}$ with each $x_i \in \mathbb{R}^{d}$, the first $l$ instances have labels $\{y_i\}_{i=1}^{l}$ with each $y_i \in \{1, \ldots, K\}$, and the remaining $u = n - l$ instances are unlabeled. We also write them collectively as $\mathcal{X} = \mathcal{X}_L \cup \mathcal{X}_U$. The goal of both SSL and UDA is to predict the labels of the unlabeled instances in $\mathcal{X}_U$ (we formulate in this paper both SSL and UDA under the transductive learning setting [5]). In UDA, the labeled data in $\mathcal{X}_L$ and the unlabeled data in $\mathcal{X}_U$ are drawn from two different distributions, the source one $P_s$ and the target one $P_t$. Differently, in SSL, the source and target distributions are assumed to be the same, i.e., $P_s = P_t$.
We denote $f_{\theta} = f_{\theta_c} \circ f_{\theta_f}$ as the mapping function parameterized by $\theta = \{\theta_f, \theta_c\}$, where $\theta_f$ indicates the parameters of a feature extractor $f_{\theta_f}$ and $\theta_c$ indicates the parameters of a classifier $f_{\theta_c}$. Let $\mathcal{F}$ denote the set of $n \times K$ probability matrices. A matrix $F \in \mathcal{F}$ corresponds to a classification on the dataset $\mathcal{X}$ by labeling each instance $x_i$ with the label $\hat{y}_i = \arg\max_{k \leq K} F_{ik}$. Each row $F_i$ indicates the classification probabilities of the instance $x_i$ over the $K$ classes.
The general goal of SSL can be stated as finding $\theta$ by minimizing the following meta objective:

$$\min_{\theta}\ \mathcal{L}\big(\theta; \mathcal{X}_L\big) + \lambda\, \Omega\big(\theta; \mathcal{X}\big), \tag{1}$$

where $\mathcal{L}$ represents the supervised loss term that applies only to the labeled data, and $\Omega$ is the regularizer with $\lambda$ as a trade-off parameter. The purpose of the regularizer is to make the learned decision satisfy the underlying assumptions of SSL, including the smoothness, cluster, and manifold assumptions [5].
For SSL methods based on the cluster assumption (e.g., low-density separation [6]), their regularizers are concerned with unlabeled data only. As such, they are more amenable to use in UDA problems, since the domain shift is not an issue to be taken into account. A prominent example of an SSL regularizer is entropy minimization (EM) [13], whose use in UDA can be instantiated as:

$$\min_{\theta}\ \frac{1}{l}\sum_{i=1}^{l}\mathcal{L}\big(f_{\theta}(x_i), y_i\big) \;+\; \frac{\lambda}{u}\sum_{i=l+1}^{n} H\big(f_{\theta}(x_i)\big), \tag{2}$$
where $\mathcal{L}$ represents a typical loss function (e.g., the cross-entropy loss) and $H(\cdot)$ denotes the entropy of the predicted class probabilities. Objectives similar to (2) are widely used in UDA methods, together with other useful ingredients such as adversarial learning of aligned features [29, 47, 15, 38].
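As a concrete illustration, a minimal PyTorch sketch of such an entropy regularizer on unlabeled target predictions could look as follows (function and variable names are ours, not from any released code):

```python
import torch
import torch.nn.functional as F

def entropy_minimization_loss(target_logits: torch.Tensor) -> torch.Tensor:
    """Mean Shannon entropy of predicted class posteriors on unlabeled target
    instances; minimizing it pushes decision boundaries into low-density
    regions, per the cluster assumption."""
    p = F.softmax(target_logits, dim=1)           # (u, K) class probabilities
    log_p = F.log_softmax(target_logits, dim=1)   # numerically stable log-probs
    return -(p * log_p).sum(dim=1).mean()

# Toy usage: combine with a supervised loss on labeled source data, cf. Equ. (2).
source_logits = torch.randn(8, 5)
source_labels = torch.randint(0, 5, (8,))
target_logits = torch.randn(16, 5)
lam = 0.1  # trade-off weight
loss = F.cross_entropy(source_logits, source_labels) + lam * entropy_minimization_loss(target_logits)
```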
Different from the above EM-like methods, the graph-based SSL methods that are based on local (and global) smoothness rely on the geometry of the data, and thus their regularizers are concerned with both labeled and unlabeled instances. The key of graph-based methods is to build a graph whose nodes are data instances (labeled and unlabeled) and whose edges represent similarities between instances. Such a graph is represented by the affinity matrix $W \in \mathbb{R}^{n \times n}$, whose elements $W_{ij}$ are non-negative pairwise similarities between instances $x_i$ and $x_j$. Here, we choose the LP algorithm [48] as an instance for exploiting the advantages of graph-based SSL methods. Denote $Y \in \{0, 1\}^{n \times K}$ as the label matrix with $Y_{ik} = 1$ if $x_i$ is labeled as $y_i = k$ and $Y_{ik} = 0$ otherwise. The goal of LP is to find an $F \in \mathcal{F}$ by minimizing
$$\mathcal{Q}(F) = \frac{1}{2}\sum_{i,j=1}^{n} W_{ij}\left\|\frac{F_i}{\sqrt{D_{ii}}} - \frac{F_j}{\sqrt{D_{jj}}}\right\|^2 + \mu\sum_{i=1}^{n}\big\|F_i - Y_i\big\|^2, \tag{3}$$

and then the resulting probability matrix is given by $F^{*} = \arg\min_{F \in \mathcal{F}} \mathcal{Q}(F)$, where $D$ is a diagonal matrix with its $(i, i)$-element equal to the sum of the $i$-th row of $W$, and $\mu > 0$ is a trade-off parameter. From the above optimization objective, we can easily see that a good affinity matrix $W$ is the key success factor of the LP algorithm. So, the straightforward question is: what makes a good affinity matrix? For the analysis, we assume the true label of each data instance $x_i$ is $\tilde{y}_i$; then the regularizer of the objective (3) can be decomposed as:
$$\sum_{i,j=1}^{n} W_{ij}\left\|\frac{F_i}{\sqrt{D_{ii}}} - \frac{F_j}{\sqrt{D_{jj}}}\right\|^2 = \sum_{i,j:\,\tilde{y}_i = \tilde{y}_j} W_{ij}\left\|\frac{F_i}{\sqrt{D_{ii}}} - \frac{F_j}{\sqrt{D_{jj}}}\right\|^2 + \sum_{i,j:\,\tilde{y}_i \neq \tilde{y}_j} W_{ij}\left\|\frac{F_i}{\sqrt{D_{ii}}} - \frac{F_j}{\sqrt{D_{jj}}}\right\|^2. \tag{4}$$

Obviously, a good affinity matrix should make its elements $W_{ij}$ as large as possible if instances $x_i$ and $x_j$ are of the same class, and at the same time make those of different-class pairs as small as possible. It is therefore rather easy to construct such a good affinity matrix in the SSL setting, where all data are drawn from the same underlying distribution. However, in UDA, due to the domain shift between labeled and unlabeled data, the values of same-class elements between labeled and unlabeled instances would be significantly reduced, which prevents the direct use of LP in UDA problems, as illustrated in Figure 1.
In this section, we first analyze conditions of the corresponding affinity matrix for better propagation of true labels to unlabeled instances, which motivate our proposed A^2LP algorithm. Let $acc$ be the classification accuracy in $\mathcal{X}_U$ achieved by the solution of the LP objective (Equ. (3)), i.e.,

$$acc = \frac{1}{u}\sum_{i=l+1}^{n} \mathbb{1}\big(\hat{y}_i = \tilde{y}_i\big), \tag{5}$$

where $\hat{y}_i = \arg\max_{k \leq K} F^{*}_{ik}$ with $F^{*}$ the solution of Equ. (3).
Proposition 1. Assume the data satisfy the ideal cluster assumption, i.e., $W_{ij} = 0$ for all pairs with $\tilde{y}_i \neq \tilde{y}_j$. Enhancing one zero-valued element $W_{ij}$ between a data instance $x_i$ (labeled or unlabeled) and a labeled instance $x_j$ to a positive number, where $\tilde{y}_i = \tilde{y}_j$, makes the accuracy (5) non-decreasing; it increases under the condition that $x_i$ is an unlabeled instance that was originally misclassified, i.e., $\hat{y}_i \neq \tilde{y}_i$.
The proof of Proposition 1 can be found in the appendices.
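To make the intuition behind Proposition 1 concrete, the following toy sketch (our own construction, not from the paper) solves the closed-form LP of [48] on a tiny class-pure graph, then enhances one zero-valued same-class similarity between an unlabeled instance and a labeled one, and checks that the accuracy on unlabeled data does not decrease:

```python
import numpy as np

def lp_closed_form(W, Y, alpha=0.5):
    """Closed-form label propagation F = (I - alpha*S)^{-1} Y,
    with S = D^{-1/2} W D^{-1/2} (Zhou et al. [48])."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.linalg.solve(np.eye(len(W)) - alpha * S, Y)

# 6 instances, 2 classes; nodes 0 and 1 are labeled, the rest unlabeled.
# True classes: [0, 1, 0, 1, 1, 1]; nodes 4 and 5 form an island with no
# path to any labeled instance, so LP initially misclassifies them.
true_y = np.array([0, 1, 0, 1, 1, 1])
W = np.zeros((6, 6))
for i, j in [(0, 2), (1, 3), (4, 5)]:   # only same-class edges (cluster assumption)
    W[i, j] = W[j, i] = 1.0
Y = np.zeros((6, 2)); Y[0, 0] = 1.0; Y[1, 1] = 1.0

acc_before = (lp_closed_form(W, Y)[2:].argmax(1) == true_y[2:]).mean()

# Enhance one zero-valued similarity between unlabeled node 4 and the
# labeled node 1 of the same class, as in Proposition 1.
W[1, 4] = W[4, 1] = 1.0
acc_after = (lp_closed_form(W, Y)[2:].argmax(1) == true_y[2:]).mean()

print(acc_before, acc_after)   # 0.5 -> 1.0: accuracy does not decrease
```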
Based on the above analysis, we propose the algorithm of Label Propagation with Augmented Anchors (A^2LP), as illustrated in Fig. 2. We detail the A^2LP method as follows.
We construct the feature set $\mathcal{V} = \{\mathbf{v}_i\}_{i=1}^{n}$, where $\mathbf{v}_i = f_{\theta_f}(x_i)$. The affinity matrix $A \in \mathbb{R}^{n \times n}$ is constructed with elements:

$$A_{ij} = \begin{cases} s(\mathbf{v}_i, \mathbf{v}_j), & \text{if } i \neq j \text{ and } \mathbf{v}_j \in \mathrm{NN}_k(\mathbf{v}_i), \\ 0, & \text{otherwise}, \end{cases} \tag{6}$$

where $s(\cdot, \cdot)$ measures the non-negative similarity between $\mathbf{v}_i$ and $\mathbf{v}_j$, and $\mathrm{NN}_k(\mathbf{v}_i)$ denotes the set of $k$ nearest neighbors of $\mathbf{v}_i$ in $\mathcal{V}$. Then, we symmetrize $A$ to obtain a symmetric, non-negative adjacency matrix $W$ with zero diagonal.
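A brute-force sketch of this construction (the paper uses NN-Descent for large datasets, and the exact symmetrization rule may differ from ours) is:

```python
import numpy as np

def knn_affinity(V, k=10):
    """Sparse k-NN affinity matrix with cosine similarities (cf. Equ. (6)),
    symmetrized into a non-negative adjacency matrix with zero diagonal."""
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)   # L2-normalize features
    sim = np.clip(Vn @ Vn.T, 0.0, None)                 # non-negative cosine similarity
    np.fill_diagonal(sim, 0.0)                          # no self-loops
    A = np.zeros_like(sim)
    for i in range(len(V)):
        nn = np.argsort(-sim[i])[:k]                    # k nearest neighbors of v_i
        A[i, nn] = sim[i, nn]
    W = np.maximum(A, A.T)                              # one possible symmetrization
    return W

# Toy usage on random features (n instances, d-dimensional).
W = knn_affinity(np.random.randn(50, 8), k=5)
```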
The closed-form solution of the objective (Equ. (3)) of the LP algorithm is given by [48] as
$$F^{*} = (I - \alpha \mathcal{S})^{-1} Y, \tag{7}$$

where $\mathcal{S} = D^{-1/2} W D^{-1/2}$, $I$ is an identity matrix, and $\alpha \in (0, 1)$ is a hyper-parameter related to the trade-off $\mu$ in Equ. (3). Suggested by Remark 1, we generate virtual instances via a weighted combination of unlabeled target instances, using weights computed from the entropy of their propagated soft cluster assignments, considering that instances of low entropy are more confident in terms of their predicted labels. In particular, we first obtain the soft predictions and pseudo labels $\hat{y}_i$ of unlabeled target instances by solving Equ. (7), and then we assign a weight to each unlabeled instance by
$$w_i = 1 - \frac{H\big(F^{*}_{i}\big)}{\log K}, \tag{8}$$

where $H(\cdot)$ is the entropy function applied to the (normalized) row $F^{*}_{i}$. We have $w_i \in [0, 1]$ since $H(F^{*}_{i}) \in [0, \log K]$. The virtual instance of each class $k \in \{1, \ldots, K\}$ can then be calculated as:

$$\mathbf{v}^{k} = \frac{\sum_{i=l+1}^{n} \mathbb{1}(\hat{y}_i = k)\, w_i\, \mathbf{v}_i}{\sum_{i=l+1}^{n} \mathbb{1}(\hat{y}_i = k)\, w_i}, \tag{9}$$

where $\mathbb{1}(\cdot)$ is the indicator function. The virtual instances generated by Equ. (9) are relatively robust to label noise, and their neighbors are likely to be instances of the same label due to the underlying cluster assumption.
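A small sketch of the weighting and anchor generation described by Equ. (8) and (9) (variable names and numerical safeguards are ours):

```python
import numpy as np

def generate_virtual_anchors(F_unlabeled, V_unlabeled, num_classes):
    """Entropy-weighted, class-wise combination of unlabeled target features
    (cf. Equ. (8)-(9)); returns one virtual anchor per (pseudo) class."""
    row_sum = np.clip(F_unlabeled.sum(axis=1, keepdims=True), 1e-12, None)
    P = F_unlabeled / row_sum                                    # soft cluster assignments
    H = -(P * np.log(np.clip(P, 1e-12, None))).sum(axis=1)      # per-instance entropy
    w = 1.0 - H / np.log(num_classes)                            # weights in [0, 1], Equ. (8)
    pseudo = P.argmax(axis=1)                                     # pseudo labels
    anchors = np.zeros((num_classes, V_unlabeled.shape[1]))
    for k in range(num_classes):                                  # Equ. (9), one anchor per class
        mask = pseudo == k
        if mask.any():
            wk = w[mask]
            anchors[k] = (wk[:, None] * V_unlabeled[mask]).sum(0) / max(wk.sum(), 1e-12)
    return anchors, pseudo

# Toy usage: 100 unlabeled instances with 64-d features and 5 classes.
F_u = np.abs(np.random.randn(100, 5)) + 1e-3
V_u = np.random.randn(100, 64)
anchors, pseudo = generate_virtual_anchors(F_u, V_u, num_classes=5)
```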
Then, we iterate between the steps of (1) augmenting the feature set $\mathcal{V}$ and label matrix $Y$ with the generated virtual instances, and (2) generating new virtual instances by the LP algorithm based on the updated feature set $\mathcal{V}$ and label matrix $Y$. The updating strategies of the feature set and label matrix are as follows:
$$\mathcal{V} \leftarrow \mathcal{V} \cup \{\mathbf{v}^{k}\}_{k=1}^{K}, \qquad Y \leftarrow \big[\,Y^{\top},\; I_{K}\,\big]^{\top}, \tag{10}$$

where $I_K$ is the $K \times K$ identity matrix, i.e., the virtual instance of class $k$ is labeled as class $k$.
The iterative steps empirically converge within a few iterations, as illustrated in Sec. 5.1. The implementation of our A^2LP is summarized in Algorithm 1 (lines 2–10).
Input:
Labeled data: $\mathcal{X}_L$ with labels $\{y_i\}_{i=1}^{l}$
Unlabeled data: $\mathcal{X}_U$
Model parameters: $\theta = \{\theta_f, \theta_c\}$
Procedure: iterate A^2LP pseudo labeling (cf. Equ. (6)–(10)) and domain-invariant feature learning; a sketch of the pseudo-labeling loop is given below.
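Putting the pieces together, a minimal sketch of the A^2LP pseudo-labeling loop (roughly lines 2–10 of Algorithm 1) might look as follows; it reuses knn_affinity, lp_closed_form, and generate_virtual_anchors from the sketches above, and details such as the stopping test, whether anchors accumulate across rounds, and the hyper-parameter values are our own illustrative choices:

```python
import numpy as np

def a2lp(V_l, y_l, V_u, num_classes, num_iters=10, alpha=0.75, k=10):
    """Iterative A^2LP pseudo labeling on fixed features.
    V_l: labeled (source) features, y_l: integer labels, V_u: unlabeled features."""
    V_anchor = V_l.copy()                          # labeled anchors (source features)
    Y_anchor = np.eye(num_classes)[y_l]            # one-hot labels of anchors
    pseudo = None
    for _ in range(num_iters):
        V_all = np.concatenate([V_anchor, V_u], axis=0)
        Y_all = np.concatenate([Y_anchor, np.zeros((len(V_u), num_classes))], axis=0)
        W = knn_affinity(V_all, k=k)               # Equ. (6)
        F = lp_closed_form(W, Y_all, alpha=alpha)  # Equ. (7)
        F_u = F[len(V_anchor):]
        new_pseudo = F_u.argmax(axis=1)
        if pseudo is not None and (new_pseudo == pseudo).all():
            break                                   # pseudo labels stabilized
        pseudo = new_pseudo
        anchors, _ = generate_virtual_anchors(F_u, V_u, num_classes)   # Equ. (8)-(9)
        V_anchor = np.concatenate([V_l, anchors], axis=0)              # Equ. (10): augment anchors
        Y_anchor = np.concatenate([np.eye(num_classes)[y_l], np.eye(num_classes)], axis=0)
    return pseudo
```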
Although our proposed A^2LP can largely alleviate the performance degradation of applying LP to UDA tasks via the introduction of virtual instances, learning domain-invariant features across labeled source data and unlabeled target data remains fundamentally important, especially when the domain shift is unexpectedly large. To illustrate the advantage of our proposed A^2LP in generating high-quality pseudo labels of unlabeled data, and to justify the efficacy of the alternating steps of pseudo labeling via SSL methods and domain-invariant feature learning, we empower state-of-the-art UDA methods [44, 22] by replacing their pseudo label generators with our A^2LP, keeping other settings unchanged. Empirical results in Sec. 5.2 testify to the efficacy of our A^2LP.
Computation of our proposed algorithm is dominated by constructing the affinity matrix (6) via the $k$-nearest neighbor graph and by computing the closed-form solution (7). Brute-force implementations of both are computationally expensive for datasets with large numbers of instances. Fortunately, the complexity of the full affinity matrix construction of the $k$-nearest neighbor graph can be largely improved via NN-Descent [10], giving rise to an almost linear empirical complexity in the number of instances. Given that the matrix $(I - \alpha \mathcal{S})$ is positive-definite, the label predictions (7) can be obtained by solving the following linear system with the conjugate gradient (CG) method [16, 53]:

$$(I - \alpha \mathcal{S})\, F = Y, \tag{11}$$

which is known to be faster than computing the closed-form solution (7) directly. Empirical results in the appendices show that such accelerating strategies significantly reduce the time consumption while hardly incurring any performance penalty.
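A sketch of this accelerated solve with SciPy's sparse conjugate gradient (solver settings and names are illustrative, not the paper's):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def lp_conjugate_gradient(W, Y, alpha=0.75, maxiter=50):
    """Solve (I - alpha * S) F = Y column by column with conjugate gradient
    (cf. Equ. (11)), where S = D^{-1/2} W D^{-1/2} and W is a sparse affinity matrix."""
    W = sp.csr_matrix(W)
    d = np.asarray(W.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = d_inv_sqrt @ W @ d_inv_sqrt
    A = sp.eye(W.shape[0], format="csr") - alpha * S   # positive-definite system matrix
    F = np.zeros(Y.shape, dtype=float)
    for k in range(Y.shape[1]):                        # one linear solve per class
        F[:, k], _ = cg(A, Y[:, k], maxiter=maxiter)
    return F
```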
Office-31 [34] is a standard UDA dataset including three diverse domains: Amazon (A), collected from the Amazon website; Webcam (W), captured by a web camera; and DSLR (D), captured by a digital SLR camera. The three domains share 31 object categories. ImageCLEF-DA [1] is a balanced dataset containing three domains: Caltech-256 (C), ImageNet ILSVRC 2012 (I), and Pascal VOC 2012 (P). There are 12 categories and 600 images in each domain. VisDA-2017 [33] is a dataset with a large domain shift from synthetic data (Syn.) to real images (Real), containing about 280K images across 12 categories.

We implement our A^2LP based on PyTorch. We adopt a ResNet [14] pre-trained on the ImageNet dataset [8], excluding the last fully connected (FC) layer, as the feature extractor $f_{\theta_f}$. In the alternating training step, we fine-tune the feature extractor and train a classifier $f_{\theta_c}$ of one FC layer from scratch. We update all parameters by stochastic gradient descent with a momentum of 0.9, and the learning rate of the classifier is set higher than that of the feature extractor. We employ the learning rate annealing strategy of [12], following $\eta_p = \eta_0 / (1 + \gamma p)^{\beta}$, where $p$ is the training progress linearly changing from 0 to 1 and $\gamma$, $\beta$ are set as in [12]. Following [22], we use different initial learning rates for the Office-31 [34] and ImageCLEF-DA [1] datasets and for the VisDA-2017 dataset. We adopt the cosine similarity, i.e., $s(\mathbf{v}_i, \mathbf{v}_j) = \frac{\mathbf{v}_i^{\top}\mathbf{v}_j}{\|\mathbf{v}_i\|\,\|\mathbf{v}_j\|}$, to construct the affinity matrix (6), and compare it with two other alternatives in Sec. 5.1. We empirically set the value of $\alpha$ in Equ. (7) separately for the VisDA-2017 dataset and for the datasets of Office-31 and ImageCLEF-DA. We use all labeled source data and all unlabeled target data in the training process, following the standard protocols for UDA [12, 27]. For each adaptation task, we report the average classification accuracy and the standard error over three random runs.
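For concreteness, a small sketch of this annealing schedule; the default values $\gamma = 10$ and $\beta = 0.75$ are the common choices of [12] and are assumed here rather than taken from the paper:

```python
def annealed_lr(lr0: float, p: float, gamma: float = 10.0, beta: float = 0.75) -> float:
    """Learning-rate annealing of [12]: lr_p = lr0 / (1 + gamma * p) ** beta,
    where p in [0, 1] is the training progress."""
    return lr0 / (1.0 + gamma * p) ** beta

# Example: decay over training; the classifier typically uses a larger base
# rate than the ImageNet pre-trained feature extractor.
for step, total in [(0, 1000), (500, 1000), (1000, 1000)]:
    print(step, annealed_lr(0.001, step / total))
```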
In this section, we conduct ablative experiments on the C→P task of the ImageCLEF-DA dataset to analyze the influence of graph structures on the results of A^2LP. To study the impact of similarity measurements, we construct the affinity matrix with two other alternative similarity measurements, namely the Euclidean distance-based similarity introduced in [49] and the scalar product-based similarity adopted in [20]. We also set $k$ (the number of nearest neighbors in Equ. (6)) to different values to investigate its influence. Results are illustrated in Figure 4. We empirically observe that results with the cosine similarity are consistently better than those with the two alternatives. We attribute the advantage of the cosine similarity to the adopted FC layer-based classifier, where the cosine similarities between features and classifier weights dominate the category predictions. The results of A^2LP with the affinity matrix constructed by the cosine similarity are stable under a wide range of $k$. Results with the full affinity matrix (i.e., connecting all instance pairs) are generally lower than those with the $k$-nearest neighbor graph. We empirically set $k$ accordingly in the experiments for the Office-31 and ImageCLEF-DA datasets, and use a larger $k$ for the VisDA-2017 dataset, where the number of instances is considerably larger.
In this section, we observe the behaviors of A^2LP on UDA and SSL tasks. The goal of the experiment is to observe the results with augmented virtual instances in LP. For the labeled data, we randomly sample a fixed number of instances per class from the synthetic image set of the VisDA-2017 dataset. For the SSL task, we randomly sample the same number of additional instances per class from the synthetic image set of the VisDA-2017 dataset as the unlabeled data, whereas for the UDA task the unlabeled data are sampled randomly from each class of the real image set. We denote the constructed UDA task as VisDA-2017-Small for ease of use. The mean prediction accuracy over all unlabeled instances is reported. To give insights into the different results, we report the percentage of connection weight (PoW) of same-class pairs between labeled and unlabeled data in the constructed $k$-nearest neighbor graph, computed with the ground-truth labels of unlabeled data as $\mathrm{PoW} = S_{same} / S_{all}$, where $S_{same}$ is the sum of similarities of same-class pairs between labeled and unlabeled data in the affinity matrix $A$ and $S_{all}$ is the sum of all similarities between labeled and unlabeled data.
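A sketch of how such a PoW statistic can be computed (we assume instances are ordered with labeled data first; this ordering and the function name are ours):

```python
import numpy as np

def percent_of_weight(A, labels_l, labels_u):
    """Percent of connection weight (PoW): the share of affinity mass between
    labeled and unlabeled instances that falls on same-class pairs, computed
    with ground-truth labels of unlabeled data (for analysis only)."""
    l, u = len(labels_l), len(labels_u)
    cross = A[:l, l:l + u]                             # labeled-to-unlabeled block
    same = labels_l[:, None] == labels_u[None, :]      # same-class indicator
    total = cross.sum()
    return float((cross * same).sum() / total) if total > 0 else 0.0
```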
The results are illustrated in Fig. 4. In the UDA task, the initial PoW (i.e., N=1) is too low to enable the labels of labeled data to propagate to all the unlabeled target data. As the A^2LP proceeds, the labeled data are augmented with virtual instances with true labels, whose neighbors involve unlabeled instances sharing the same label. Thus the PoW increases, leading to more accurate predictions on unlabeled data. In the SSL task, cluster centers of labeled and unlabeled data are positioned close to each other and (statistically) in relatively dense regions, since all data follow the same distribution. Instances close to cluster centers, including the nearest neighbors of virtual instances, are expected to be classified correctly by the LP algorithm, leading to unchanged results as the A^2LP proceeds. These observations corroborate Proposition 1 and verify the efficacy of our proposed virtual instance generation strategy for UDA.
We investigate the influence of the noise level of label predictions on the results of A^2LP. As illustrated in Table 1, A^2LP is fairly robust to label noise: as the noise level increases, results of A^2LP degrade gracefully, and fall below that of the vanilla LP only when the noise level reaches 60% or more.
Noise level (%) | 0 | 10 | 30 | 50 | 60 | 70 | 80 | 100 | Vanilla LP
---|---|---|---|---|---|---|---|---|---
Acc. (%) of A^2LP | 92.8 | 92.8 | 92.3 | 91.8 | 91.0 | 90.7 | 90.3 | 90.0 | 91.2
We propose a degenerated variant of A^2LP by representing the entire labeled source data with a few representative surrogate instances in the A^2LP process, which largely alleviates the computation cost of the LP algorithm. More specifically, we replace the features of source data with the source category centers, keeping their category labels (only line 2 of Algorithm 1 is updated accordingly). As illustrated in Table 3, the result of the A^2LP variant is only slightly lower than that of A^2LP on the VisDA-2017-Small task. Note that we adopt the A^2LP variant only in tasks involving the entire VisDA-2017 dataset, unless otherwise specified.
Methods | A^2LP | A^2LP variant
---|---|---
Acc. (%) | 79.3 | 77.9
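A minimal sketch of this degenerated variant, which replaces source features by per-class centers (assuming every class appears in the labeled source data; names are ours):

```python
import numpy as np

def source_class_centers(V_src, y_src, num_classes):
    """Degenerated A^2LP variant: represent all labeled source data by their
    per-class feature centers (one surrogate anchor per class), keeping labels."""
    centers = np.stack([V_src[y_src == k].mean(axis=0) for k in range(num_classes)])
    labels = np.arange(num_classes)   # the center of class k keeps label k
    return centers, labels
```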
Methods | A→W | W→A
---|---|---
A^2LP | 87.7 | 75.9
A^2LP ($w_i = 1, \forall i$) | 87.4 | 75.4
We investigate the effect of the entropy-based instance weights in the reliable virtual instance generation (Equ. (9)) of A^2LP in this section. As illustrated in Table 3, A^2LP improves over A^2LP ($w_i = 1, \forall i$), where all unlabeled instances are weighted equally, supporting that instances of low entropy are more confident in terms of their predicted labels.
Methods | A→W | D→W | W→D | A→D | D→A | W→A | Avg.
---|---|---|---|---|---|---|---
Source Only | 68.4±0.2 | 96.7±0.1 | 99.3±0.1 | 68.9±0.2 | 62.5±0.3 | 60.7±0.3 | 76.1
DAN [27] | 80.5±0.4 | 97.1±0.2 | 99.6±0.1 | 78.6±0.2 | 63.6±0.3 | 62.8±0.2 | 80.4
DANN [12] | 82.0±0.4 | 96.9±0.2 | 99.1±0.1 | 79.7±0.4 | 68.2±0.4 | 67.4±0.5 | 82.2
CDAN+E [28] | 94.1±0.1 | 98.6±0.1 | 100.0±0.0 | 92.9±0.2 | 71.0±0.3 | 69.3±0.3 | 87.7
SymNets [47] | 90.8±0.1 | 98.8±0.3 | 100.0±0.0 | 93.9±0.5 | 74.6±0.6 | 72.5±0.5 | 88.4
DADA [40] | 92.3±0.1 | 99.2±0.1 | 100.0±0.0 | 93.9±0.2 | 74.4±0.1 | 74.2±0.1 | 89.0
CAN [22] | 94.5±0.3 | 99.1±0.2 | 99.8±0.2 | 95.0±0.3 | 78.0±0.3 | 77.0±0.3 | 90.6
LP | 81.1 | 96.8 | 99.0 | 82.3 | 71.6 | 73.1 | 84.0
A^2LP (ours) | 87.7 | 98.1 | 99.0 | 87.8 | 75.8 | 75.9 | 87.4
MSTN (reproduced) | 92.7±0.5 | 98.5±0.2 | 99.8±0.2 | 89.9±0.3 | 74.6±0.3 | 75.2±0.5 | 88.5
empowered by A^2LP | 93.1±0.2 | 98.5±0.1 | 99.8±0.2 | 94.0±0.2 | 76.5±0.3 | 76.7±0.3 | 89.8
CAN (reproduced) | 94.0±0.5 | 98.5±0.1 | 99.7±0.1 | 94.8±0.4 | 78.1±0.2 | 76.7±0.3 | 90.3
empowered by A^2LP | 93.4±0.3 | 98.8±0.1 | 100.0±0.0 | 96.1±0.1 | 78.1±0.1 | 77.6±0.1 | 90.7
Methods | I→P | P→I | I→C | C→I | C→P | P→C | Avg.
---|---|---|---|---|---|---|---
Source Only | 74.8±0.3 | 83.9±0.1 | 91.5±0.3 | 78.0±0.2 | 65.5±0.3 | 91.2±0.3 | 80.7
DAN [27] | 74.5±0.4 | 82.2±0.2 | 92.8±0.2 | 86.3±0.4 | 69.2±0.4 | 89.8±0.4 | 82.5
DANN [12] | 75.0±0.6 | 86.0±0.3 | 96.2±0.4 | 87.0±0.5 | 74.3±0.5 | 91.5±0.6 | 85.0
CDAN+E [28] | 77.7±0.3 | 90.7±0.2 | 97.7±0.3 | 91.3±0.3 | 74.2±0.2 | 94.3±0.3 | 87.7
SymNets [47] | 80.2±0.3 | 93.6±0.2 | 97.0±0.3 | 93.4±0.3 | 78.7±0.3 | 96.4±0.1 | 89.9
LP | 77.1 | 89.2 | 93.0 | 87.5 | 69.8 | 91.2 | 84.6
A^2LP (ours) | 79.3 | 91.8 | 96.3 | 91.7 | 78.1 | 96.0 | 88.9
MSTN (reproduced) | 78.3±0.2 | 92.5±0.3 | 96.5±0.2 | 91.1±0.1 | 76.3±0.3 | 94.6±0.4 | 88.2
empowered by A^2LP | 79.6±0.3 | 92.7±0.3 | 96.7±0.1 | 92.5±0.2 | 78.9±0.2 | 96.0±0.1 | 89.4
CAN (reproduced) | 78.5±0.3 | 93.0±0.3 | 97.3±0.2 | 91.0±0.3 | 77.2±0.2 | 97.0±0.2 | 89.0
empowered by A^2LP | 79.8±0.2 | 94.3±0.3 | 97.7±0.2 | 93.0±0.3 | 79.9±0.1 | 96.9±0.2 | 90.3
We report the classification results on the Office-31 [34], ImageCLEF-DA [1], and VisDA-2017 [33] datasets in Table 4, Table 5, and Table 6, respectively. Results of other methods are either directly quoted from their original papers, if available, or from [28, 24]. Compared to classical methods [12, 27] aiming at domain-invariant feature learning, the vanilla LP generally achieves better results via the graph-based SSL principle, certifying the efficacy of SSL principles in UDA tasks. Our A^2LP improves over LP significantly on all three UDA benchmarks, justifying the efficacy of the introduction of virtual instances for UDA. Additionally, we reproduce the state-of-the-art UDA methods of Moving Semantic Transfer Network (MSTN) [44] and Contrastive Adaptation Network (CAN) [22] with their released codes (https://github.com/Mid-Push/Moving-Semantic-Transfer-Network and https://github.com/kgl-prml/Contrastive-Adaptation-Network-for-Unsupervised-Domain-Adaptation); by replacing the pseudo label generators of MSTN and CAN with our A^2LP, we improve their results noticeably and achieve the new state of the art, testifying to the effectiveness of the combination of A^2LP and domain-invariant feature learning.
Methods | Acc. (%) based on ResNet-50 | Acc. (%) based on ResNet-101
---|---|---|
Source Only | 45.6 | 50.8 |
DAN [27] | 53.0 | 61.1 |
DANN [12] | 55.0 | 57.4 |
MCD [36] | – | 71.9 |
CDAN+E [28] | 70.0 | – |
LPJT [25] | – | 74.0 |
DADA [40] | – | 79.8 |
Lee et al. [24] | 76.2 | 81.5 |
CAN [22] | – | 87.2 |
LP | 69.8 | 73.9 |
A^2LP (ours) | 78.7 | 82.7
MSTN (reproduced) | 71.9 | 75.2 |
empowered by A^2LP | 81.5 | 83.7
CAN (reproduced) | 85.6 | 87.2 |
empowered by A^2LP | 86.5 | 87.6
Motivated by the relatedness of problem definitions between UDA and SSL, we study the use of SSL principles in UDA, especially the graph-based LP algorithm. We analyze the conditions of the affinity graph/matrix for achieving better propagation of true labels to unlabeled instances, and accordingly propose a new algorithm of A^2LP, which potentially improves LP via the generation of unlabeled virtual instances. An empirical scheme of virtual instance generation is particularly proposed for UDA via a weighted combination of unlabeled target instances. By iteratively using A^2LP to obtain high-quality pseudo labels of target instances and learning domain-invariant features involving the obtained pseudo-labeled target instances, a new state of the art is achieved on three benchmark datasets, confirming the value of further investigating SSL techniques for UDA problems.
Acknowledgment. This work is supported in part by the Guangdong R&D key project of China (Grant No.: 2019B010155001), the National Natural Science Foundation of China (Grant No.: 61771201), and the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (Grant No.: 2017ZT07X183). Correspondence to Kui Jia (email: kuijia@scut.edu.cn).
Belkin, M., Matveeva, I., Niyogi, P.: Regularization and semi-supervised learning on large graphs. In: International Conference on Computational Learning Theory, pp. 624–638. Springer (2004)
Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., Vaughan, J.W.: A theory of learning from different domains. Machine Learning 79(1-2), 151–175 (2010)
Delalleau, O., Bengio, Y., Roux, N.L.: Efficient non-parametric function induction in semi-supervised learning. In: Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (2005), http://www.gatsby.ucl.ac.uk/aistats/fullpapers/204.pdf
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., Lempitsky, V.: Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17(1), 2096–2030 (2016)
He, R., Lee, W.S., Ng, H.T., Dahlmeier, D.: Adaptive semi-supervised learning for cross-domain sentiment classification. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3467–3476 (2018)
Tarvainen, A., Valpola, H.: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In: Advances in Neural Information Processing Systems, pp. 1195–1204 (2017)
Proof. For any unlabeled instance $x_i$, we define a boolean value $c_i$ such that $c_i = 1$ iff there exists a sequence of instances $\{x_i, x_{m_1}, \ldots, x_{m_t}, x_j\}$ ending at a labeled instance $x_j$ with $\tilde{y}_j = \tilde{y}_i$ such that the product of their pair-wise similarities is positive, i.e., $W_{i m_1} W_{m_1 m_2} \cdots W_{m_t j} > 0$, and $c_i = 0$ otherwise. Then, under the assumption made in Proposition 1, solving Equation (7) results in $\hat{y}_i = \tilde{y}_i$ for those $x_i$ with $c_i = 1$, since only labels of the same class can propagate to $x_i$, and results in an all-zero row $F^{*}_{i}$ (i.e., an arbitrary prediction) for all $x_i \in \mathcal{X}_U$ with $c_i = 0$; therefore we have

$$acc = \frac{1}{u}\Big(\sum_{i:\,c_i = 1} 1 + \sum_{i:\,c_i = 0} \mathbb{1}\big(\hat{y}_i = \tilde{y}_i\big)\Big) \;\geq\; \frac{1}{u}\sum_{i=l+1}^{n} c_i. \tag{A.1}$$

Obviously, enhancing the zero-valued similarity $W_{ij}$ between a data instance $x_i$ (labeled or unlabeled) and a labeled instance $x_j$, where $\tilde{y}_i = \tilde{y}_j$, to a positive number leads to non-decreasing values of all $c_i$, and therefore a non-decreasing value of the accuracy (5). In particular, if $\tilde{y}_i = \tilde{y}_j$ and originally $\hat{y}_i \neq \tilde{y}_i$ (which implies $c_i = 0$), the prediction of $x_i$ changes from the original incorrect one to $\hat{y}_i = \tilde{y}_i$, and thus the value of (5) increases. ∎
We investigate different values of $\alpha$ (of Equ. (7)) in A^2LP. As illustrated in Table 7, the results are stable under a wide range of $\alpha$ (i.e., 0.1–0.75).
Values of $\alpha$ | 0.1 | 0.25 | 0.4 | 0.5 | 0.6 | 0.75 | 0.9 | 2.0
---|---|---|---|---|---|---|---|---
Acc. (%) of A^2LP | 95.7 | 96.2 | 96.2 | 96.0 | 96.0 | 95.8 | 94.3 | 16.8
To make the proposed methods applicable to datasets with large numbers of instances, we improve the dominating computations of our methods by adopting NN-Descent [10] to construct the $k$-nearest neighbor graph (6) and the conjugate gradient [16, 53] to acquire the label predictions (11). As illustrated in Table 8, NN-Descent [10] substantially accelerates the brute-force construction of the affinity matrix, and the conjugate gradient-based solution (11) is considerably faster than the closed-form solution (7) on the VisDA-2017-Small task, while the classification results drop negligibly (in fact, there is no drop at the reported precision).
The full classification results on the VisDA-2017 dataset are illustrated in Table 9.
Methods | aero. | bike | bus | car | horse | knife | moto. | person | plant | sktb. | train | truck | Avg.
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Results based on a 50-layer ResNet | |||||||||||||
LP | 91.4 | 81.4 | 73.3 | 71.8 | 94.7 | 60.8 | 87.4 | 62.2 | 87.8 | 19.1 | 86.2 | 20.9 | 69.8 |
A^2LP | 95.5 | 82.8 | 77.9 | 70.0 | 95.2 | 95.9 | 86.6 | 65.3 | 87.4 | 42.8 | 86.4 | 53.1 | 78.7
MSTN (reproduced) | 86.9 | 73.2 | 76.8 | 67.2 | 80.7 | 78.8 | 71.9 | 65.1 | 74.8 | 76.2 | 85.6 | 25.6 | 71.9 |
empowered by A^2LP | 96.1 | 83.5 | 78.3 | 70.8 | 95.7 | 96.3 | 87.1 | 66.4 | 87.4 | 76.4 | 86.7 | 53.8 | 81.5
CAN (reproduced) | 94.5 | 85.4 | 81.9 | 72.3 | 96.7 | 94.9 | 88.3 | 78.4 | 96.3 | 94.7 | 86.2 | 57.3 | 85.6 |
empowered by A^2LP | 96.3 | 86.2 | 81.4 | 71.7 | 97.1 | 96.8 | 89.7 | 79.1 | 96.1 | 95.4 | 88.6 | 59.1 | 86.5
Results based on a 101-layer ResNet | |||||||||||||
LP | 89.6 | 80.6 | 65.4 | 72.9 | 92.7 | 74.0 | 84.2 | 72.8 | 87.9 | 48.4 | 84.6 | 33.0 | 73.9 |
A^2LP | 96.0 | 82.9 | 82.2 | 68.9 | 95.8 | 96.0 | 87.8 | 66.5 | 89.6 | 85.2 | 88.4 | 53.2 | 82.7
MSTN (reproduced) | 90.5 | 73.0 | 70.2 | 58.9 | 84.9 | 77.0 | 84.5 | 79.3 | 89.6 | 69.6 | 89.4 | 36.0 | 75.2 |
empowered by A^2LP | 96.4 | 84.1 | 82.4 | 70.1 | 96.1 | 96.6 | 88.2 | 67.7 | 91.5 | 87.5 | 89.9 | 54.0 | 83.7
CAN (reproduced) | 97.0 | 87.2 | 82.5 | 74.3 | 97.8 | 96.2 | 90.8 | 80.7 | 96.6 | 96.3 | 87.5 | 59.9 | 87.2 |
empowered by A^2LP | 97.5 | 86.9 | 83.1 | 74.2 | 98.0 | 97.4 | 90.5 | 80.9 | 96.9 | 96.5 | 89.0 | 60.1 | 87.6