1 Introduction
The recent success of supervised deep learning is built upon two crucial cornerstones: that the training and test data are drawn from an identical distribution, and that representative labeled data are available for training. However, in real-world applications, labeled data drawn from the same distribution as the test data are usually unavailable. Domain adaptation (Quiñonero-Candela et al., 2009; Saenko et al., 2010) suggests a way to overcome this challenge by transferring the knowledge of labeled data from a source domain to a target domain. Without further assumptions, such transfer of information is impossible. Existing theoretical works have investigated suitable assumptions that can provide learning guarantees. Many of these works are based on the covariate shift assumption (Heckman, 1979; Shimodaira, 2000), which states that the conditional distribution of the labels given the input is invariant across domains, i.e., $p_S(y \mid x) = p_T(y \mid x)$. Traditional approaches usually utilize this assumption by further assuming that the source domain covers the support of the target domain. In this setting, importance weighting (Shimodaira, 2000; Cortes et al., 2010, 2015; Zadrozny, 2004) can be used to transfer information from source to target with theoretical guarantees. However, the assumption of covered support rarely holds in practice.
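To make the importance weighting idea concrete, the following minimal sketch (our own illustration, not from the original text; it assumes the density ratios are given rather than estimated) shows the reweighted risk estimator used under covered-support covariate shift:

```python
import numpy as np

def importance_weighted_risk(losses, density_ratios):
    """Estimate the target-domain risk from labeled source samples.

    losses:         per-example losses evaluated on source samples.
    density_ratios: values of p_target(x) / p_source(x) at those samples
                    (assumed known here; in practice they are estimated,
                    and the source support must cover the target support).
    """
    losses = np.asarray(losses, dtype=float)
    density_ratios = np.asarray(density_ratios, dtype=float)
    # Reweighting makes the source-sample average an unbiased estimate
    # of the target risk.
    return float(np.mean(density_ratios * losses))

# Toy example: the target distribution up-weights the second source example.
risk = importance_weighted_risk([0.2, 1.0], [0.5, 1.5])  # (0.1 + 1.5) / 2 = 0.8
```

When the target places mass outside the source support, no valid density ratio exists, which is exactly the failure mode motivating the out-of-support methods discussed next.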
In the seminal works of Ben-David et al. (2010); Ganin et al. (2016), the authors introduced a theory that enables generalization to out-of-support samples via distribution matching. They showed that the risk on the target domain can be bounded by the sum of two terms: a) the risk on the source domain plus a discrepancy between the source and target domains, and b) the optimal joint risk that a function in the hypothesis class can achieve. Inspired by this bound, numerous domain-adversarial algorithms that aim to match the distributions of the source and target domains in feature space have been proposed (Ajakan et al., 2014; Long et al., 2015; Ganin et al., 2016). These methods show encouraging empirical performance on transferring information between domains with different styles, e.g., from colorized photos to grayscale photos. However, the theory of distribution matching can be violated, since only two of the terms in the bound are optimized by the algorithms while the remaining term can be arbitrarily large (Zhao et al., 2019a; Wu et al., 2019; Li et al., 2020). In practice, forcing the representation distributions of the two domains to match may also fail in some settings. As an example, Li et al. (2020) give empirical evidence of this failure on datasets with subpopulation shift: in a classification task between vehicle and person, subpopulation shift happens when, e.g., the source vehicle class contains cars while the target vehicle class contains motorcycles.
In real-world applications, subpopulation shift is pervasive, and often occurs in a fine-grained manner. The source domain will inevitably fail to capture the diversity of the target domain, and models will encounter unseen subpopulations in the target domain, e.g., unexpected weather conditions for self-driving or different diagnostic setups in medical applications (Santurkar et al., 2021). The lack of theoretical understanding of subpopulation shift motivates us to study the following question:
How to provably transfer from source to target domain
under subpopulation shift using unlabeled data?
To address this question, we develop a general framework of domain adaptation in which we have a supervision signal on the source domain (through a teacher classifier that has nontrivial performance on the source domain but is allowed to be entirely wrong on the target domain; see Assumption 1(a) and Figure 1) and unlabeled data on both the source and target domains. The key of the analysis is to show that the supervision signal can be propagated to the unlabeled data. To do so, we partition data from both domains into subpopulations and leverage a simple but realistic expansion assumption (Definition 2.1) proposed in Wei et al. (2021) on the subpopulations. We then prove that by minimizing a consistency regularization term (Miyato et al., 2018; Shu et al., 2018; Xie et al., 2020) on unlabeled data from both domains, plus a 0-1 consistency loss with the supervision signal (i.e., the teacher classifier) on the source domain, the supervision signal can not only be propagated from the subpopulations of the source domain to the subpopulations of the target domain but can also refine the prediction on the source domain. In Theorems 2.1 and 2.2, we give bounds on the test performance on the target domain. Using off-the-shelf generalization bounds, we also obtain end-to-end finite-sample guarantees for neural networks in Section 2.3.

In Section 3, we extend our theoretical framework to a more general setting with source-to-target transfer based on an additional unlabeled dataset. As long as the subpopulation components of the unlabeled dataset satisfy the expansion property and cover both the source and target subpopulation components, one can provably propagate label information from source to target through the unlabeled data distribution (Theorems 3.1 and 3.2). As corollaries, we immediately obtain learning guarantees for both semi-supervised learning and unsupervised domain adaptation. The results can also be applied to various settings such as domain generalization; see Figure 2.

We implement the popular consistency-based semi-supervised learning algorithm FixMatch (Sohn et al., 2020) on the subpopulation shift task from BREEDS (Santurkar et al., 2021), and compare it with popular distributional matching methods (Ganin et al., 2016; Zhang et al., 2019). Results show that the consistency-based method outperforms the distributional matching methods by over 8%, partially verifying our theory on the subpopulation shift problem. We also show that combining distributional matching methods with the consistency-based algorithm can improve performance over distributional matching methods alone on classic unsupervised domain adaptation datasets such as Office-31 (Saenko et al., 2010) and Office-Home (Venkateswara et al., 2017).

In summary, our contributions are: 1) we introduce a theoretical framework of learning under subpopulation shift through label propagation; 2) we provide accuracy guarantees on the target domain for a consistency-based algorithm using a fine-grained analysis under the expansion assumption (Wei et al., 2021); 3) we provide a generalized label propagation framework that readily includes several settings, e.g., semi-supervised learning, domain generalization, etc.
1.1 Related work
We review additional literature on domain adaptation, its variants, and consistency regularization, followed by a discussion of how our contributions differ from those of Wei et al. (2021).
For the less challenging setting of covariate shift where the source domain covers the target domain's support, prior work on importance weighting focuses on estimating the density ratio (Lin et al., 2002; Zadrozny, 2004) through kernel mean matching (Huang et al., 2006; Gretton et al., 2007; Zhang et al., 2013; Shimodaira, 2000) and standard divergence minimization paradigms (Sugiyama et al., 2008, 2012; Uehara et al., 2016; Menon and Ong, 2016; Kanamori et al., 2011). For out-of-support domain adaptation, recent works investigate approaches that match the source and target distributions in representation space (Glorot et al., 2011; Ajakan et al., 2014; Long et al., 2015; Ganin et al., 2016). Practical methods involve designing domain adversarial objectives (Tzeng et al., 2017; Long et al., 2017a; Hong et al., 2018; He and Zhang, 2019; Xie et al., 2019; Zhu et al., 2019) or different types of discrepancy minimization (Long et al., 2015; Lee et al., 2019; Roy et al., 2019; Chen et al., 2020a). Another line of work explores self-training or gradual domain adaptation (Gopalan et al., 2011; Gong et al., 2012; Glorot et al., 2011; Kumar et al., 2020). For instance, Chen et al. (2020c) demonstrate that self-training tends to learn robust features in a specific probabilistic setting.

Variants of domain adaptation have been extensively studied. For instance, weakly-supervised domain adaptation considers the case where the labels in the source domain can be noisy (Shu et al., 2019; Liu et al., 2019); multi-source domain adaptation adapts from multiple source domains (Xu et al., 2018; Zhao et al., 2018); domain generalization also allows access to multiple training environments, but seeks out-of-distribution generalization without prior knowledge of the target domain (Ghifary et al., 2015; Li et al., 2018; Arjovsky et al., 2019).
The idea of consistency regularization has been used in many settings. Miyato et al. (2018); Qiao et al. (2018); Xie et al. (2020) enforce consistency with respect to adversarial examples or data augmentations for semi-supervised learning. Shu et al. (2019) combines domain adversarial training with consistency regularization for unsupervised domain adaptation. Recent work on self-supervised learning also leverages the consistency between two aggressive data augmentations to learn meaningful features (Chen et al., 2020b; Grill et al., 2020; Caron et al., 2020).
Most closely related to our work is Wei et al. (2021), which introduces a simple but realistic "expansion" assumption to analyze label propagation: a low-probability subset of the data must expand to a neighborhood with larger probability relative to the subset. Under this assumption, the authors show learning guarantees for unsupervised learning and semi-supervised learning.

The focus of Wei et al. (2021) is not on domain adaptation, though their theorems directly apply. This leads to several drawbacks that we now discuss. Notably, in the analysis of Wei et al. (2021) for unsupervised domain adaptation, the population test risk is bounded using the population risk of a pseudolabeler on the target domain. The pseudolabeler is obtained via training with labeled data on the source domain. For domain adaptation, we do not expect such a pseudolabeler to be directly informative when applied to the target domain, especially when the distribution shift is severe. In contrast, our theorems do not rely on a good pseudolabeler on the target domain. Instead, we prove that with only supervision on the source domain, the population risk on the target domain can converge to zero as the value of the consistency regularizer of the ground truth classifier decreases (Theorems 2.1 and 2.2). In addition, Wei et al. (2021) assumes that the probability mass of each class as a whole satisfies the expansion assumption. However, each class may consist of several disjoint subpopulations; for instance, the dog class may have different breeds as its subpopulations. This setting differs from the concrete example of the Gaussian mixture model shown in Wei et al. (2021), where the data of each class concentrate following a Gaussian distribution. In this paper, we instead make a more realistic use of the expansion assumption by assuming the expansion property on the subpopulations of each class (Assumption 1). Behind this relaxation is a fine-grained analysis of the probability mass's expansion property, which may be of independent interest.

2 Label Propagation in Domain Adaptation
In this section, we consider label propagation for unsupervised domain adaptation. We assume the distributions’ structure can be characterized by a specific subpopulation shift with the expansion property. In Section 2.1, we introduce the setting, including the algorithm and assumptions. In Section 2.2, we present the main theorem on bounding the target error. In Section 2.3, we provide an endtoend guarantee on the generalization error of adapting a deep neural network to the target distribution with finite data. In Section 2.4 we provide a proof sketch for the theorems.
2.1 Setting
We consider a multiclass classification problem. Let and be the source and target distributions, respectively, and suppose we wish to find a classifier that performs well on the target distribution. Suppose we have a teacher classifier on the source. The teacher can be obtained by training on labeled data from the source (standard unsupervised domain adaptation), by training on a small subset of labeled data, or by direct transfer from some other trained classifier, etc. In all, the teacher classifier represents all the label information we know (and is allowed to have errors). Our goal is to transfer the information in the teacher onto the target using only unlabeled data.
Our setting for subpopulation shift is formulated in the following assumption.
Assumption 1.
Assume the source and target distributions have the following structure: , , where for . We assume the ground truth class for is consistent (constant), which is denoted as . We abuse the notation to let , also denote the conditional distribution (probability measure) of on the set respectively. In addition, we make the following canonical assumptions:

(a) The teacher classifier on is informative of the ground truth class by a margin , that is,

(b) On each component, the ratio of the population under domain shift is upper-bounded by a constant , i.e.,
Following Wei et al. (2021), we make use of a consistency regularization method, i.e., we expect the predictions to be stable under a suitable set of input transformations $\mathcal{T}(x)$. The regularizer of $f$ on the mixed probability measure $\bar{P}$ is defined as

$R(f) = \mathbb{E}_{x \sim \bar{P}}\left[\max_{x' \in \mathcal{T}(x)} \mathbf{1}\{f(x') \neq f(x)\}\right],$
and a low regularizer value implies that, with high probability, the prediction is constant within the transformation set. Prior work using consistency regularization for self-training on unlabeled data includes Miyato et al. (2018), where the transformation set can be understood as a distance-based neighborhood, and Adel et al. (2017); Xie et al. (2020), where it can be understood as a set of data augmentations. In general, the transformation set combines a small-radius neighborhood under some distance function with a class of data augmentation functions. (Footnote 1: In this paper, consistency regularization, the expansion property, and label propagation can also be understood as operating in a representation space, as long as the transformations are composed with some feature map.)
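As a finite-sample illustration of this regularizer (our own sketch, with a finite list of transformation functions standing in for the paper's transformation set), one can count the fraction of unlabeled samples whose prediction changes under some transformation:

```python
def consistency_regularizer(f, samples, transforms):
    """Fraction of samples whose prediction changes under some transformation.

    f:          classifier mapping an input to a label.
    samples:    unlabeled inputs drawn from the mixed distribution.
    transforms: functions approximating the transformation set (e.g., data
                augmentations); a hypothetical stand-in for the paper's set.
    """
    inconsistent = 0
    for x in samples:
        # A sample contributes if any transformation flips its prediction.
        if any(f(t(x)) != f(x) for t in transforms):
            inconsistent += 1
    return inconsistent / len(samples)

# Toy 1-D example: a threshold classifier and small shifts as transforms.
f = lambda x: int(x > 0)
transforms = [lambda x: x + 0.1, lambda x: x - 0.1]
# Only the point near the threshold is inconsistent.
value = consistency_regularizer(f, [-1.0, 0.05, 1.0], transforms)  # 1/3
```

A small value indicates the classifier's decision regions are stable under the chosen perturbations, which is what the propagation argument exploits.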
The set of transformations is used in the following expansion property. First, we define the neighborhood function as $\mathcal{N}(x) = \{x' \mid \mathcal{T}(x) \cap \mathcal{T}(x') \neq \emptyset\}$, and the neighborhood of a set $A$ as $\mathcal{N}(A) = \bigcup_{x \in A} \mathcal{N}(x)$.
The expansion property on the mixed distribution is defined as follows:
Definition 2.1 (Expansion (Wei et al., 2021)).

(1) (Multiplicative Expansion) We say satisfies multiplicative expansion for some constant , , if for any and any subset with , we have .

(2) (Constant Expansion) We say satisfies constant expansion for some constant , if for any set with and , we have .
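The multiplicative form can be made concrete on a toy discrete distribution. The sketch below (our own illustration; the neighborhood map is a hypothetical stand-in for the neighborhoods induced by the transformation set) brute-forces the worst-case ratio of neighborhood mass to subset mass over small subsets, i.e., the best multiplicative expansion factor the toy distribution supports:

```python
from itertools import combinations

def multiplicative_expansion_factor(probs, neighbors):
    """Smallest ratio P(N(A)) / P(A) over subsets A with P(A) <= 1/2.

    probs:     dict point -> probability mass (a toy discrete stand-in
               for one subpopulation component).
    neighbors: dict point -> set of points reachable via transformations,
               so N(A) is A together with the neighbors of its points.
    """
    points = list(probs)
    worst = float("inf")
    for r in range(1, len(points) + 1):
        for subset in combinations(points, r):
            p_a = sum(probs[x] for x in subset)
            if p_a == 0 or p_a > 0.5:
                continue  # expansion is only required for small subsets
            grown = set(subset)
            for x in subset:
                grown |= neighbors[x]
            p_n = sum(probs[x] for x in grown)
            worst = min(worst, p_n / p_a)
    return worst

# A 3-point chain: every small subset expands into at least one neighbor.
probs = {"a": 0.25, "b": 0.5, "c": 0.25}
neighbors = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
factor = multiplicative_expansion_factor(probs, neighbors)  # 2.0
```

Intuitively, a factor strictly greater than 1 means no small region of the component can be "sealed off" from the rest, which is what lets labels propagate.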
The expansion property implicitly states that the source and target components are close to each other and regularly shaped. Through the regularizer, the label can "propagate" between them. (Footnote 2: Note that our model for subpopulation shift allows an arbitrarily fine-grained decomposition, which makes the expansion property more realistic. In image classification, one can take, for example, one subpopulation as "Poodles eating dog food" vs. another as "Labradors eating meat" (all under the dog class), which is a rather typical form of shift in a real dataset. The representations of such subpopulations can turn out quite close after certain data augmentations and perturbations.) One can keep in mind the specific example of Figure 1, where the source and target components form a single connected component.
Finally, let be a function class of the learning model. We consider the realizable case when the ground truth function . We assume that the consistency error of the ground truth function is small, and use a constant to represent an upper bound: . We find the classifier with the following algorithm:
(1) 
where the first term is the 0-1 loss on the source domain, which encourages the learned classifier to be aligned with the teacher on the source domain.
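An empirical sketch of this objective (our own illustration; it uses a soft penalty with a hypothetical trade-off weight `lam`, whereas the analysis treats consistency as strongly enforced):

```python
def uda_objective(f, source_x, teacher, mixed_x, transforms, lam=1.0):
    """Empirical surrogate of objective (1).

    The first term is the 0-1 disagreement with the teacher on source
    samples; the second is the consistency regularizer on unlabeled
    samples from both domains.
    """
    teacher_fit = sum(f(x) != teacher(x) for x in source_x) / len(source_x)
    inconsistency = sum(
        any(f(t(x)) != f(x) for t in transforms) for x in mixed_x
    ) / len(mixed_x)
    return teacher_fit + lam * inconsistency

# Toy 1-D example with threshold classifiers and small shifts as transforms.
f = lambda x: int(x > 0)
teacher = lambda x: int(x > 0.5)
transforms = [lambda x: x + 0.1, lambda x: x - 0.1]
obj = uda_objective(f, [0.2, 0.8, -1.0], teacher, [-1.0, 1.0], transforms)
```

In practice one would minimize a differentiable surrogate of both terms over a model class rather than evaluate the 0-1 quantities directly.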
Our main theorems will be formulated using either multiplicative expansion or constant expansion. (Footnote 3: Wei et al. (2021) contains several examples and illustrations of the expansion property; e.g., the Gaussian mixture example satisfies multiplicative expansion. The radius of the transformation set is much smaller than the norm of a typical example, so our model, which requires only a separation between components that keeps the ground-truth consistency error small, is much weaker than a typical notion of "clustering".)
2.2 Main Theorem
With the above preparations, we are ready to establish bounds on the target error .
Theorem 2.1 (Bound on Target Error with Multiplicative Expansion).
Suppose Assumption 1 holds and satisfies multiplicative expansion. Then the classifier obtained by (1) satisfies
Theorem 2.2 (Bound on Target Error with Constant Expansion).
Suppose Assumption 1 holds and satisfies constant expansion. Then the classifier obtained by (1) satisfies
We make the following remarks on the main results, and also highlight the differences from directly applying Wei et al. (2021) to domain adaptation.
Remark 1.
The theorems state that as long as the ground-truth consistency error is small enough, the classifier obtained by (1) can achieve near-zero error. This result does not rely on the teacher being close to zero error; it only requires the teacher to have a positive margin. As a result, the learned classifier can improve upon the teacher (including on the source domain, as the proofs of the theorems show), in the sense that its error converges to zero as the ground-truth consistency error vanishes, regardless of the error of the teacher. Under multiplicative expansion, Wei et al. (2021) attain a bound that explicitly depends on the accuracy of the teacher on the target domain. Our improvement comes from the algorithmic change in Equation (1): we strongly enforce consistency rather than balancing consistency against fitting the teacher classifier.
Remark 2.
We do not impose any lower bound on the measure of the components , which is much more general and realistic. From the proofs, one may see that we allow some components to be entirely mislabeled, but in the end, the total measure of such components will be bounded. Directly applying Wei et al. (2021) would require a stringent lower bound on the measure of each .
Remark 3.
We only require expansion with respect to the individual components , instead of the entire class (Wei et al., 2021), which is a weaker requirement.
2.3 Finite Sample Guarantee for Deep Neural Networks
In this section, we leverage existing generalization bounds to prove an end-to-end guarantee for training a deep neural network with finite samples from the source and target. The results indicate that if the ground-truth classifier is realizable by a neural network with a large robust margin, then the total error can be small.
For simplicity, let there be equal numbers of i.i.d. samples from the source and target distributions, with the corresponding empirical distributions denoted accordingly. In order to upper-bound the losses, we apply the notion of all-layer margin (Wei and Ma, 2019), which measures the stability of the neural net to simultaneous perturbations of each hidden layer. (Footnote 4: Though other notions of margin can also work, this one helps us leverage the results from Wei et al. (2021).) We first cite the useful results from Wei et al. (2021). Suppose the classifier is induced by a neural network with the given weight matrices, and the maximum width of any layer is bounded. (Footnote 5: Similarly, the ground truth network induces the ground truth classifier.) Let the all-layer margin be defined at each input for its label. (Footnote 6: For now, we only use the margin when the prediction is correct, so that we can upper-bound the 0-1 loss with the corresponding margin loss. One can refer to the detailed definition in Appendix B or in Wei and Ma (2019).) We also define the robust margin, which takes the minimum of the all-layer margin over the transformation set. We state the following results.
Proposition 2.1 (Theorem C.3 from Wei et al. (2021)).
For any , with probability ,
where hides polylogarithmic factors in and .
Proposition 2.2 (Theorem 3.7 from Wei et al. (2021)).
For any , with probability ,
To ensure generalization, we replace the loss functions with the corresponding margin losses in the algorithm and solve
(2) 
where . Based on these preparations, we are ready to state the final bound.
Theorem 2.3.
(a) Under multiplicative expansion on we have
(b) Under constant expansion on we have
where
Remark 4.
Note that the first term in is small if is small, and as , the bounds can be close to and can be close to , which gives us the bounds in Section 2.2.
Similar to the argument in Wei et al. (2021), it is worth noting that our required sample complexity does not depend exponentially on the dimension. This is in stark contrast to classic nonparametric methods for unknown "clusters" of samples, where the sample complexity suffers from the curse of dimensionality of the input space.
2.4 Proof Sketch for Theorem 2.1 and 2.2
To prove the theorems, we first introduce some concepts and notations.
A point is called robust w.r.t. and if for any in , . Denote
which is called the robust set of . Let
for , , and they form a partition of the set . Denote
which is the majority class label of in the robust set on . We also call
and the minority robust set of . In addition, let
and be the minority set of , which is a superset of the minority robust set.
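These constructions can be traced on a toy discrete component. The sketch below (our own illustration, with a finite list of transformations standing in for the transformation set) computes the robust set, the majority robust label, and the mass of the minority set for one component:

```python
def minority_set_mass(f, component, transforms):
    """Compute the majority robust label and the minority-set mass.

    component:  list of (x, prob) pairs, a discrete stand-in for one
                subpopulation under the mixed distribution.
    transforms: a finite stand-in for the transformation set.
    The minority set consists of robust points whose prediction differs
    from the majority robust label, together with all non-robust points
    (the superset described above).
    """
    robust, non_robust_mass = [], 0.0
    for x, p in component:
        if all(f(t(x)) == f(x) for t in transforms):
            robust.append((x, p))
        else:
            non_robust_mass += p
    # Majority label among robust points, weighted by probability mass.
    mass_by_label = {}
    for x, p in robust:
        mass_by_label[f(x)] = mass_by_label.get(f(x), 0.0) + p
    majority = max(mass_by_label, key=mass_by_label.get)
    minority_robust = sum(p for x, p in robust if f(x) != majority)
    return majority, minority_robust + non_robust_mass

# Toy component: one point near the decision boundary is not robust.
f = lambda x: int(x > 0)
transforms = [lambda x: x + 0.1, lambda x: x - 0.1]
component = [(-1.0, 0.3), (0.05, 0.2), (1.0, 0.5)]
majority, minority_mass = minority_set_mass(f, component, transforms)
```

The lemmas that follow bound exactly this kind of minority mass via the expansion property.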
The expansion property can be used to control the total population of the minority set.
Lemma 2.1 (Upper Bound of Minority Set).
For the classifier obtained by (1), can be bounded as follows:
(a) Under multiplicative expansion, we have .
(b) Under constant expansion, we have .
Based on the bound on the minority set, our next lemma says that on most subpopulation components, the inconsistency between and is no greater than the error of plus a margin . Specifically, define
and we have the following result.
Lemma 2.2 (Upper Bound on the Inconsistent Components ).
Suppose , then
Based on the above results, we are ready to bound the target error .
Lemma 2.3 (Bounding the Target Error).
Suppose . Let
for in , so that . Then we can separately bound
(a)
(b)
so that the combination gives
Specifically, Lemma 2.3(a) is obtained by directly applying Lemma 2.2, and Lemma 2.3(b) is proved via a fine-grained analysis of the minority set.
Finally, we can plug in from Lemma 2.1 and the desired main results are obtained.
3 Label Propagation in Generalized Subpopulation Shift
In this section, we show that the preceding label propagation algorithm applies to a much more general setting than standard unsupervised domain adaptation. In short, as long as we perform consistency regularization on an unlabeled dataset that covers both the teacher classifier's domain and the target domain, we can perform label propagation through the subpopulations of the unlabeled data.
Specifically, we still let the source distribution be the one on which we have a teacher, and the target distribution be the one on which we want to perform well. The difference is that we now have a "covering" distribution (Assumption 2(c)) from which we only use unlabeled data, and the expansion property is assumed to hold on this covering distribution.
Assumption 2.
Assume the distributions are of the following structure: , , , where for , and . Again, assume the ground truth class for is consistent (constant), denoted . We abuse the notation to let , , also denote the conditional distribution of on the set respectively. We also make the following assumptions, with an additional (c) that says “covers” .
(a)(b): Same as Assumption 1(a)(b).
(c) There exists a constant such that the measure , are bounded by . That is, for any ,
The regularizer now becomes
One can see that the main difference is that we have replaced the mixture distribution from the previous domain adaptation setting with a general covering distribution. Indeed, we assume expansion on the covering distribution and can establish bounds on the target error.
Definition 3.1 (Expansion on ).
(1) We say satisfies multiplicative expansion for some constant , , if for any and any subset with , we have .
(2) We say satisfies constant expansion for some constant , if for any set with and , we have .
Theorem 3.1 (Bound on Target Error with Multiplicative Expansion, Generalized).
Theorem 3.2 (Bound on Target Error with Constant Expansion, Generalized).
By choosing special cases of the covering structure, we naturally obtain the following settings, which correspond to the models shown in Figure 2.

- Semi-supervised learning or self-supervised denoising (Figure 2(b)). When , the framework becomes the degenerate version of learning a classifier from a teacher in a single domain. The teacher can be a pseudolabeler in semi-supervised learning, or some other pretrained classifier in self-supervised denoising. Our results improve upon Wei et al. (2021) in this case, as discussed in Remarks 1 and 2.

- Domain expansion (Figure 2(c)). When , this becomes a problem in between semi-supervised learning and domain adaptation, which we call domain expansion. That is, the source is a sub-distribution of the distribution on which we need to perform well. Frequently, we have a large unlabeled dataset while the labeled data cover only a specific part of it.

- Domain extrapolation (Figure 2(d)). When the source and target do not satisfy expansion by themselves, e.g., they are not connected through the transformation set but are connected through the covering distribution, we can still obtain small error on the target. We term this kind of task domain extrapolation: we have a small source and a small target distribution that are not easy to relate directly, but label information can propagate through a third, larger unlabeled dataset.

- Multi-source domain adaptation or domain generalization (Figure 2(e)). We have multiple source domains and take the covering distribution to be their union (average measure). Learning is guaranteed if, in the input space or some representation space, this union successfully "covers" the target distribution in multi-source domain adaptation, or the test distribution in domain generalization. As the framework suggests, we also do not require all source domains to be labeled, depending on the specific structure.
The general label propagation framework proposed in this section is widely applicable in many practical scenarios, and exploring it further would be interesting future work. The full proofs of the theorems in this section are in Appendix A.
4 Experiments
In this section, we first conduct experiments on a dataset constructed to simulate natural subpopulation shift. We then turn to classic unsupervised domain adaptation datasets, combining distributional matching methods with the consistency-based label propagation method.
4.1 Subpopulation Shift Dataset
We empirically verify that label propagation via consistency regularization works well for subpopulation shift tasks. Toward this goal, we constructed an unsupervised domain adaptation (UDA) task using the challenging ENTITY-30 task from BREEDS (Santurkar et al., 2021), and directly adapted FixMatch (Sohn et al., 2020), an existing consistency regularization method for semi-supervised learning, to the subpopulation shift task. The main idea of FixMatch is to optimize the supervised loss on weak augmentations of source samples, plus a consistency regularization term that encourages the classifier's prediction on a strong augmentation of a sample to match its prediction on a weak augmentation of the same sample. (Footnote 7: Empirically, FixMatch also incorporates a self-training technique that takes the hard label of the prediction on weak augmentations. We also use Distribution Alignment (Berthelot et al., 2019), mentioned in Section 2.5 of the FixMatch paper.) In contrast to semi-supervised learning, where the supports of unlabeled data and labeled data are inherently the same, in subpopulation shift problems the support sets of different domains are disjoint. To enable label propagation, we therefore need a good feature map so that propagation can happen in the feature space. We thus make use of the feature map learned by the self-supervised learning algorithm SwAV (Caron et al., 2020), which simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations of the same image. This representation has two merits: first, it encourages subpopulations with similar representations to cluster in the feature space; second, it enforces augmented samples to be close in the feature space. We expect that subclasses from the same superclass will be assigned to the same cluster and thus enjoy the expansion property to a certain extent in the feature space.
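The FixMatch consistency term described above can be sketched as follows (a simplified, hypothetical version of the loss; the actual implementation in Sohn et al. (2020) differs in details such as augmentation pipelines and batching):

```python
import numpy as np

def fixmatch_unlabeled_loss(probs_weak, probs_strong, tau=0.95):
    """Simplified FixMatch consistency term.

    For each unlabeled example, take the hard pseudo-label from the
    weak-augmentation prediction and, when its confidence exceeds tau,
    apply cross-entropy to the strong-augmentation prediction.

    probs_weak, probs_strong: (n, num_classes) predicted probabilities
    for weak and strong augmentations of the same n examples.
    """
    probs_weak = np.asarray(probs_weak, dtype=float)
    probs_strong = np.asarray(probs_strong, dtype=float)
    conf = probs_weak.max(axis=1)          # pseudo-label confidence
    pseudo = probs_weak.argmax(axis=1)     # hard pseudo-labels
    mask = conf >= tau                     # keep only confident examples
    if not mask.any():
        return 0.0
    picked = probs_strong[mask, pseudo[mask]]
    return float(np.mean(-np.log(picked + 1e-12)))

# Toy batch: only the first example is confident enough to contribute.
loss = fixmatch_unlabeled_loss([[0.99, 0.01], [0.6, 0.4]],
                               [[0.9, 0.1], [0.5, 0.5]])
```

This term plays the role of the consistency regularizer in our analysis, with weak/strong augmentation pairs standing in for the transformation set.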
We defer the detailed experimental settings to Appendix C and report the results here.
Table 1: Accuracy (%) on the ENTITY-30 subpopulation shift task (mean ± std).

Method                        | Source Acc   | Target Acc
Train on Source               | 91.91 ± 0.23 | 56.73 ± 0.32
DANN (Ganin et al., 2016)     | 92.81 ± 0.50 | 61.03 ± 4.63
MDD (Zhang et al., 2019)      | 92.67 ± 0.54 | 63.95 ± 0.28
FixMatch (Sohn et al., 2020)  | 90.87 ± 0.15 | 72.60 ± 0.51
We compare the performance of the FixMatch adaptation with popular distributional matching methods, i.e., DANN (Ganin et al., 2016) and MDD (Zhang et al., 2019). (Footnote 8: We use the implementation from Junguang Jiang (2020), which shows that MDD has the best performance among the evaluated methods.) For a fair comparison, all models are fine-tuned from the SwAV representation. As shown in Table 1, the FixMatch adaptation improves upon the baseline that only trains on the source domain by more than 15 points on the target domain. FixMatch also outperforms the distributional matching methods by more than 8 points. The results suggest that, unlike previous distributional-matching-based methods, consistency-regularization-based methods are preferable on domain adaptation tasks that exhibit subpopulation shift. This is also aligned with our theoretical findings.
4.2 Classic Unsupervised Domain Adaptation Datasets
In this section, we conduct experiments on classic unsupervised domain adaptation datasets, i.e., Office-31 (Saenko et al., 2010) and Office-Home (Venkateswara et al., 2017), where the source and target domains mainly differ in style, e.g., artistic images versus real-world images. Distributional matching methods seek to learn an invariant representation that removes confounding information such as style. Since the feature distributions of different domains are encouraged to match, the supports of different domains in the feature space overlap, which enables label propagation. In addition, subpopulation shift from source to target may remain even after the styles are unified in the feature space. This inspires us to combine distributional matching methods with label propagation.
As a preliminary attempt, we directly combine MDD (Zhang et al., 2019) and FixMatch (Sohn et al., 2020) to see whether there is a gain over MDD alone. Specifically, we first train models using MDD on the two classic unsupervised domain adaptation datasets, Office-31 and Office-Home. We then fine-tune the learned models using FixMatch (with the Distribution Alignment extension described in the previous subsection). The results in Tables 2 and 3 confirm that fine-tuning with FixMatch can improve the performance of MDD models. The detailed experimental settings can be found in Appendix C.
Table 2: Accuracy (%) on Office-31 (mean ± std).

Method        | A→W          | D→W          | W→D         | A→D          | D→A          | W→A          | Average
MDD           | 94.97 ± 0.70 | 98.78 ± 0.07 | 100.0 ± 0.0 | 92.77 ± 0.72 | 75.64 ± 1.53 | 72.82 ± 0.52 | 89.16
MDD+FixMatch  | 95.47 ± 0.95 | 98.32 ± 0.19 | 100.0 ± 0.0 | 93.71 ± 0.23 | 76.64 ± 1.91 | 74.93 ± 1.15 | 89.84
Table 3: Accuracy (%) on Office-Home (mean ± std).

Method        | Ar→Cl      | Ar→Pr      | Ar→Rw      | Cl→Ar      | Cl→Pr      | Cl→Rw      | Pr→Ar      | Pr→Cl      | Pr→Rw      | Rw→Ar      | Rw→Cl      | Rw→Pr      | Average
MDD           | 54.9 ± 0.7 | 74.0 ± 0.3 | 77.7 ± 0.3 | 60.6 ± 0.4 | 70.9 ± 0.7 | 72.1 ± 0.6 | 60.7 ± 0.8 | 53.0 ± 1.0 | 78.0 ± 0.2 | 71.8 ± 0.4 | 59.6 ± 0.4 | 82.9 ± 0.3 | 68.0
MDD+FixMatch  | 55.1 ± 0.9 | 74.7 ± 0.8 | 78.7 ± 0.5 | 63.2 ± 1.3 | 74.1 ± 1.8 | 75.3 ± 0.1 | 63.0 ± 0.6 | 53.0 ± 0.6 | 80.8 ± 0.4 | 73.4 ± 0.1 | 59.4 ± 0.7 | 84.0 ± 0.5 | 69.6
5 Conclusion
In this work, we introduced a new theoretical framework of learning under subpopulation shift through label propagation, providing new insights into solving domain adaptation tasks. We provided accuracy guarantees on the target domain for a consistency-regularization-based algorithm using a fine-grained analysis under the expansion assumption. Our generalized label propagation framework in Section 3 subsumes the previous domain adaptation setting and also suggests an interesting direction for future work.
References

Adel et al. (2017)
Adel, T., Zhao, H., and Wong, A. (2017).
Unsupervised domain adaptation with a relaxed covariate shift
assumption.
In
Proceedings of the AAAI Conference on Artificial Intelligence
, volume 31.  Ahuja et al. (2020) Ahuja, K., Shanmugam, K., Varshney, K., and Dhurandhar, A. (2020). Invariant risk minimization games. In International Conference on Machine Learning, pages 145–155. PMLR.
 Ajakan et al. (2014) Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., and Marchand, M. (2014). Domain-adversarial neural networks. arXiv preprint arXiv:1412.4446.
 Arjovsky et al. (2019) Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. (2019). Invariant risk minimization. arXiv preprint arXiv:1907.02893.
 Becker et al. (2013) Becker, C. J., Christoudias, C. M., and Fua, P. (2013). Nonlinear domain adaptation with boosting. In Neural Information Processing Systems (NIPS), number CONF.
 Ben-David et al. (2010) Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. (2010). A theory of learning from different domains. Machine Learning, 79(1-2):151–175.
 Berthelot et al. (2019) Berthelot, D., Carlini, N., Cubuk, E. D., Kurakin, A., Sohn, K., Zhang, H., and Raffel, C. (2019). ReMixMatch: Semi-supervised learning with distribution matching and augmentation anchoring. In International Conference on Learning Representations.
 Caron et al. (2020) Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., and Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33.

Chen et al. (2020a) Chen, C., Fu, Z., Chen, Z., Jin, S., Cheng, Z., Jin, X., and Hua, X.-S. (2020a). HoMM: Higher-order moment matching for unsupervised domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3422–3429.
 Chen et al. (2020b) Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020b). A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR.
 Chen et al. (2020c) Chen, Y., Wei, C., Kumar, A., and Ma, T. (2020c). Self-training avoids using spurious features under domain shift. arXiv preprint arXiv:2006.10032.
 Cortes et al. (2010) Cortes, C., Mansour, Y., and Mohri, M. (2010). Learning bounds for importance weighting. In Advances in neural information processing systems, pages 442–450.
 Cortes et al. (2015) Cortes, C., Mohri, M., and Muñoz Medina, A. (2015). Adaptation algorithm and theory based on generalized discrepancy. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 169–178.
 Ganin et al. (2016) Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. (2016). Domainadversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030.

Ghifary et al. (2015) Ghifary, M., Kleijn, W. B., Zhang, M., and Balduzzi, D. (2015). Domain generalization for object recognition with multi-task autoencoders. In Proceedings of the IEEE International Conference on Computer Vision, pages 2551–2559.
 Glorot et al. (2011) Glorot, X., Bordes, A., and Bengio, Y. (2011). Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML.

Gong et al. (2012) Gong, B., Shi, Y., Sha, F., and Grauman, K. (2012). Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2066–2073. IEEE.
 Gopalan et al. (2011) Gopalan, R., Li, R., and Chellappa, R. (2011). Domain adaptation for object recognition: An unsupervised approach. In 2011 International Conference on Computer Vision, pages 999–1006. IEEE.
 Gretton et al. (2007) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., and Smola, A. J. (2007). A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems, pages 513–520.
 Grill et al. (2020) Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., Doersch, C., Pires, B. A., Guo, Z. D., Azar, M. G., et al. (2020). Bootstrap your own latent: A new approach to selfsupervised learning. arXiv preprint arXiv:2006.07733.
 Gulrajani and Lopez-Paz (2020) Gulrajani, I. and Lopez-Paz, D. (2020). In search of lost domain generalization. arXiv preprint arXiv:2007.01434.
 He and Zhang (2019) He, Z. and Zhang, L. (2019). Multi-adversarial Faster-RCNN for unrestricted object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6668–6677.
 Heckman (1979) Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica: Journal of the econometric society, pages 153–161.
 Hoffman et al. (2018) Hoffman, J., Tzeng, E., Park, T., Zhu, J.-Y., Isola, P., Saenko, K., Efros, A., and Darrell, T. (2018). CyCADA: Cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning, pages 1989–1998. PMLR.

Hong et al. (2018) Hong, W., Wang, Z., Yang, M., and Yuan, J. (2018). Conditional generative adversarial network for structured domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1335–1344.
 Huang et al. (2006) Huang, J., Gretton, A., Borgwardt, K., Schölkopf, B., and Smola, A. (2006). Correcting sample selection bias by unlabeled data. Advances in Neural Information Processing Systems, 19:601–608.
 Javed et al. (2020) Javed, K., White, M., and Bengio, Y. (2020). Learning causal models online. arXiv preprint arXiv:2006.07461.
 Jhuo et al. (2012) Jhuo, I.-H., Liu, D., Lee, D., and Chang, S.-F. (2012). Robust visual domain adaptation with low-rank reconstruction. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2168–2175. IEEE.
 Jiang et al. (2020) Jiang, J., Fu, B., and Long, M. (2020). Transfer-Learning-Library. https://github.com/thuml/TransferLearningLibrary.
 Kanamori et al. (2011) Kanamori, T., Suzuki, T., and Sugiyama, M. (2011). f-divergence estimation and two-sample homogeneity test under semiparametric density-ratio models. IEEE Transactions on Information Theory, 58(2):708–720.
 Krueger et al. (2020) Krueger, D., Caballero, E., Jacobsen, J.-H., Zhang, A., Binas, J., Priol, R. L., and Courville, A. (2020). Out-of-distribution generalization via risk extrapolation (REx). arXiv preprint arXiv:2003.00688.
 Kumar et al. (2020) Kumar, A., Ma, T., and Liang, P. (2020). Understanding selftraining for gradual domain adaptation. arXiv preprint arXiv:2002.11361.
 Lee et al. (2019) Lee, C.-Y., Batra, T., Baig, M. H., and Ulbricht, D. (2019). Sliced Wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10285–10295.
 Li et al. (2020) Li, B., Wang, Y., Che, T., Zhang, S., Zhao, S., Xu, P., Zhou, W., Bengio, Y., and Keutzer, K. (2020). Rethinking distributional matching based domain adaptation. arXiv preprint arXiv:2006.13352.
 Li et al. (2018) Li, D., Yang, Y., Song, Y.-Z., and Hospedales, T. (2018). Learning to generalize: Meta-learning for domain generalization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
 Lin et al. (2002) Lin, Y., Lee, Y., and Wahba, G. (2002). Support vector machines for classification in nonstandard situations. Machine learning, 46(1):191–202.
 Liu et al. (2019) Liu, F., Lu, J., Han, B., Niu, G., Zhang, G., and Sugiyama, M. (2019). Butterfly: A panacea for all difficulties in wildly unsupervised domain adaptation. arXiv preprint arXiv:1905.07720.
 Long et al. (2015) Long, M., Cao, Y., Wang, J., and Jordan, M. (2015). Learning transferable features with deep adaptation networks. In International conference on machine learning, pages 97–105. PMLR.
 Long et al. (2017a) Long, M., Cao, Z., Wang, J., and Jordan, M. I. (2017a). Conditional adversarial domain adaptation. arXiv preprint arXiv:1705.10667.
 Long et al. (2017b) Long, M., Zhu, H., Wang, J., and Jordan, M. I. (2017b). Deep transfer learning with joint adaptation networks. In International conference on machine learning, pages 2208–2217. PMLR.
 Menon and Ong (2016) Menon, A. and Ong, C. S. (2016). Linking losses for density ratio and class-probability estimation. In International Conference on Machine Learning, pages 304–313. PMLR.
 Mitrovic et al. (2020) Mitrovic, J., McWilliams, B., Walker, J., Buesing, L., and Blundell, C. (2020). Representation learning via invariant causal mechanisms. arXiv preprint arXiv:2010.07922.
 Miyato et al. (2018) Miyato, T., Maeda, S.-i., Koyama, M., and Ishii, S. (2018). Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1979–1993.
 Parascandolo et al. (2020) Parascandolo, G., Neitz, A., Orvieto, A., Gresele, L., and Schölkopf, B. (2020). Learning explanations that are hard to vary. arXiv preprint arXiv:2009.00329.
 Pei et al. (2018) Pei, Z., Cao, Z., Long, M., and Wang, J. (2018). Multi-adversarial domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
 Qiao et al. (2018) Qiao, S., Shen, W., Zhang, Z., Wang, B., and Yuille, A. (2018). Deep co-training for semi-supervised image recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pages 135–152.
 QuioneroCandela et al. (2009) QuioneroCandela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. (2009). Dataset shift in machine learning.
 Roy et al. (2019) Roy, S., Siarohin, A., Sangineto, E., Bulo, S. R., Sebe, N., and Ricci, E. (2019). Unsupervised domain adaptation using feature-whitening and consensus loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9471–9480.
 Saenko et al. (2010) Saenko, K., Kulis, B., Fritz, M., and Darrell, T. (2010). Adapting visual category models to new domains. In European conference on computer vision, pages 213–226. Springer.
 Sagawa et al. (2019) Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. (2019). Distributionally robust neural networks for group shifts: On the importance of regularization for worstcase generalization. arXiv preprint arXiv:1911.08731.
 Santurkar et al. (2021) Santurkar, S., Tsipras, D., and Madry, A. (2021). {BREEDS}: Benchmarks for subpopulation shift. In International Conference on Learning Representations.
 Shimodaira (2000) Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244.
 Shu et al. (2018) Shu, R., Bui, H. H., Narui, H., and Ermon, S. (2018). A DIRT-T approach to unsupervised domain adaptation. arXiv preprint arXiv:1802.08735.
 Shu et al. (2019) Shu, Y., Cao, Z., Long, M., and Wang, J. (2019). Transferable curriculum for weakly-supervised domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4951–4958.
 Sohn et al. (2020) Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C. A., Cubuk, E. D., Kurakin, A., and Li, C.L. (2020). Fixmatch: Simplifying semisupervised learning with consistency and confidence. Advances in Neural Information Processing Systems, 33.
 Sugiyama et al. (2012) Sugiyama, M., Suzuki, T., and Kanamori, T. (2012). Density-ratio matching under the Bregman divergence: a unified framework of density-ratio estimation. Annals of the Institute of Statistical Mathematics, 64(5):1009–1044.
 Sugiyama et al. (2008) Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bünau, P., and Kawanabe, M. (2008). Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699–746.
 Tzeng et al. (2017) Tzeng, E., Hoffman, J., Saenko, K., and Darrell, T. (2017). Adversarial discriminative domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7167–7176.
 Uehara et al. (2016) Uehara, M., Sato, I., Suzuki, M., Nakayama, K., and Matsuo, Y. (2016). Generative adversarial nets from a density ratio estimation perspective. arXiv preprint arXiv:1610.02920.
 Venkateswara et al. (2017) Venkateswara, H., Eusebio, J., Chakraborty, S., and Panchanathan, S. (2017). Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5018–5027.
 Wei and Ma (2019) Wei, C. and Ma, T. (2019). Improved sample complexities for deep networks and robust classification via an all-layer margin. arXiv preprint arXiv:1910.04284.
 Wei et al. (2021) Wei, C., Shen, K., Chen, Y., and Ma, T. (2021). Theoretical analysis of self-training with deep networks on unlabeled data. In International Conference on Learning Representations.
 Wu et al. (2019) Wu, Y., Winston, E., Kaushik, D., and Lipton, Z. (2019). Domain adaptation with asymmetrically-relaxed distribution alignment. In International Conference on Machine Learning, pages 6872–6881. PMLR.
 Xie et al. (2020) Xie, Q., Dai, Z., Hovy, E., Luong, T., and Le, Q. (2020). Unsupervised data augmentation for consistency training. Advances in Neural Information Processing Systems, 33.
 Xie et al. (2019) Xie, R., Yu, F., Wang, J., Wang, Y., and Zhang, L. (2019). Multi-level domain adaptive learning for cross-domain detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops.
 Xu et al. (2021) Xu, K., Zhang, M., Li, J., Du, S. S., Kawarabayashi, K.-i., and Jegelka, S. (2021). How neural networks extrapolate: From feedforward to graph neural networks. In International Conference on Learning Representations.
 Xu et al. (2018) Xu, R., Chen, Z., Zuo, W., Yan, J., and Lin, L. (2018). Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3964–3973.
 Zadrozny (2004) Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. In Proceedings of the Twenty-First International Conference on Machine Learning, page 114.
 Zhang et al. (2013) Zhang, K., Schölkopf, B., Muandet, K., and Wang, Z. (2013). Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pages 819–827.
 Zhang (2019) Zhang, L. (2019). Transfer adaptation learning: A decade survey. arXiv preprint arXiv:1903.04687.
 Zhang et al. (2019) Zhang, Y., Liu, T., Long, M., and Jordan, M. (2019). Bridging theory and algorithm for domain adaptation. In International Conference on Machine Learning, pages 7404–7413. PMLR.
 Zhao et al. (2019a) Zhao, H., Combes, R. T. d., Zhang, K., and Gordon, G. J. (2019a). On learning invariant representation for domain adaptation. arXiv preprint arXiv:1901.09453.
 Zhao et al. (2020a) Zhao, H., Dan, C., Aragam, B., Jaakkola, T. S., Gordon, G. J., and Ravikumar, P. (2020a). Fundamental limits and tradeoffs in invariant representation learning. arXiv preprint arXiv:2012.10713.
 Zhao et al. (2018) Zhao, H., Zhang, S., Wu, G., Moura, J. M., Costeira, J. P., and Gordon, G. J. (2018). Adversarial multiple source domain adaptation. Advances in neural information processing systems, 31:8559–8570.
 Zhao et al. (2019b) Zhao, S., Li, B., Yue, X., Gu, Y., Xu, P., Hu, R., Chai, H., and Keutzer, K. (2019b). Multi-source domain adaptation for semantic segmentation. arXiv preprint arXiv:1910.12181.
 Zhao et al. (2020b) Zhao, S., Yue, X., Zhang, S., Li, B., Zhao, H., Wu, B., Krishna, R., Gonzalez, J. E., Sangiovanni-Vincentelli, A. L., Seshia, S. A., et al. (2020b). A review of single-source deep unsupervised visual domain adaptation. IEEE Transactions on Neural Networks and Learning Systems.
 Zhu et al. (2019) Zhu, X., Pang, J., Yang, C., Shi, J., and Lin, D. (2019). Adapting object detectors via selective crossdomain alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 687–696.
 Zhuang et al. (2020) Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., and He, Q. (2020). A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1):43–76.
Appendix A Proofs of Theorems 2.1, 2.2, 3.1, and 3.2
Note that in Section 3, by taking , in Assumption 2(c) we have . By plugging in , Theorems 2.1 and 2.2 immediately become corollaries of Theorems 3.1 and 3.2. Therefore, we only provide full proofs for Theorems 3.1 and 3.2 here.
First, similar to Section 2.4, we give a proof sketch for Theorems 3.1 and 3.2, which includes the corresponding definitions and lemmas for this generalized setting.
A.1 Proof Sketch for Theorems 3.1 and 3.2
To prove the theorems, we first introduce some concepts and notations.
A point is called robust w.r.t. and if for any in , . Denote
which is called the robust set of . Let
for , , and they form a partition of the set . Denote
which is the majority class label of in the robust set on . We also call
and the minority robust set of . In addition, let
and be the minority set of , which is a superset of the minority robust set.
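To make the definition concrete, the following sketch computes the robust set on a finite sample: a point is robust when the classifier assigns every point in its neighborhood (e.g., its augmentations or perturbations) the same label as the point itself. The function name and the discrete neighborhood map are our own illustrative assumptions, not notation from the theorems.

```python
def robust_set(f, points, neighbors):
    """Indices of points robust w.r.t. classifier f and a neighborhood map:
    point i is robust if f gives all of its neighbors the same label as f
    gives the point itself."""
    return [
        i for i, x in enumerate(points)
        if all(f(n) == f(x) for n in neighbors[i])
    ]
```

Within each subpopulation component, the majority label over this robust set then defines which robust points count as the minority robust set.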
The expansion property can be used to control the total population of the minority set.
Lemma A.1 (Upper Bound of Minority Set).
For the classifier obtained by (1), can be bounded as follows:
(a) Under multiplicative expansion, we have .
(b) Under constant expansion, we have .
Based on the bound on the minority set, our next lemma says that on most subpopulation components, the inconsistency between and is no greater than the error of plus a margin . Specifically, define
and we have the following result
Lemma A.2 (Upper Bound on the Inconsistent Components ).
Suppose , then
Based on the above results, we are ready to bound the target error .
Lemma A.3 (Bounding the Target Error).
Suppose . Let
for in , so that . Then we can separately bound
(a)
(b)
so that the combination gives
A.2 Proof of Lemma A.1
Proof.
We first prove the constant expansion case. The probability functions in this lemma are all w.r.t. the distribution , so we omit this subscript. We also use for in this lemma.
The robust minority set is . In order to apply the expansion property, we partition into two halves:
Lemma A.4 (Partition of ).
There exists a partition of the set into and such that the corresponding partition (, ) satisfies and .
Proof.
Starting from , we add elements one at a time into or while keeping the properties and maintained. We prove that for any , either or holds, so we can repeat the process until and form a partition of . In fact, since
we know that either or is no more than , and Lemma A.4 is proved.
∎
Let , . Based on Lemma A.4, we know that either , or satisfies the requirement for constant expansion. Hence,
(3) 
On the other hand, by the definition of the robust set, we know that for , and the points in and are all non-robust. Since the total measure of non-robust points is by definition, we know that
(4) 
Combining (3) and (4), we know that under , it must hold that , or else (3) and (4) would contradict each other. In all, this means that holds in any case.
Similarly, we know that also holds. Therefore, .
Since only consists of non-robust points, we know that
which is the desired result (a).
For the multiplicative expansion case, it is easy to verify that