A Theory of Label Propagation for Subpopulation Shift

February 22, 2021 · Tianle Cai et al.

One of the central problems in machine learning is domain adaptation. Unlike past theoretical work, we consider a new model of subpopulation shift in the input or representation space. In this work, we propose a provably effective framework for domain adaptation based on label propagation. In our analysis, we use a simple but realistic "expansion" assumption, proposed in Wei et al. (2021). Using a teacher classifier trained on the source domain, our algorithm not only propagates label information to the target domain but also improves upon the teacher. By leveraging existing generalization bounds, we also obtain end-to-end finite-sample guarantees on the entire algorithm. In addition, we extend our theoretical framework to a more general setting of source-to-target transfer based on a third, unlabeled dataset, which can be easily applied in various learning scenarios.


1 Introduction

The recent success of supervised deep learning is built upon two crucial cornerstones: that the training and test data are drawn from an identical distribution, and that representative labeled data are available for training. However, in real-world applications, labeled data drawn from the same distribution as the test data are usually unavailable. Domain adaptation (Quionero-Candela et al., 2009; Saenko et al., 2010) suggests a way to overcome this challenge by transferring the knowledge of labeled data from a source domain to the target domain.

Without further assumptions, transferring information across domains is not possible. Existing theoretical works have investigated suitable assumptions that can provide learning guarantees. Many of these works are based on the covariate shift assumption (Heckman, 1979; Shimodaira, 2000), which states that the conditional distribution of the labels given the input is invariant across domains. Traditional approaches usually utilize this assumption by further assuming that the source domain covers the support of the target domain. In this setting, importance weighting (Shimodaira, 2000; Cortes et al., 2010, 2015; Zadrozny, 2004) can be used to transfer information from source to target with theoretical guarantees. However, the assumption of covered support rarely holds in practice.
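To make the importance-weighting idea concrete, here is a minimal sketch under the covered-support assumption; the density ratio is assumed known for illustration (in practice it must be estimated, e.g., via the density-ratio estimation methods cited in Section 1.1), and all names are illustrative.

```python
import numpy as np

def importance_weighted_risk(loss_on_source, density_ratio):
    """Estimate the target risk E_Q[loss] from source samples via the
    identity E_Q[loss] = E_P[(q/p) * loss], valid only when the source
    support covers the target support."""
    return np.mean(density_ratio * loss_on_source)

# Toy usage (illustrative): source P = N(0, 1), target Q = N(0.5, 1), and we
# estimate E_Q[x^2] using only samples drawn from P.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200_000)                           # x ~ P
ratio = np.exp(-(x - 0.5) ** 2 / 2.0) / np.exp(-(x ** 2) / 2.0)  # q(x) / p(x)
print(importance_weighted_risk(x ** 2, ratio))                   # ~1.25 = 0.5^2 + 1
```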

In the seminal works of Ben-David et al. (2010) and Ganin et al. (2016), the authors introduced a theory that enables generalization to out-of-support samples via distribution matching. They showed that the risk on the target domain can be bounded by the sum of two terms: (a) the risk on the source domain plus a discrepancy between the source and target domains, and (b) the optimal joint risk that a function in the hypothesis class can achieve. Inspired by this bound, numerous domain-adversarial algorithms aimed at matching the distributions of the source and target domains in the feature space have been proposed (Ajakan et al., 2014; Long et al., 2015; Ganin et al., 2016). These methods show encouraging empirical performance when transferring information between domains with different styles, e.g., from colorized photos to gray-scale photos. However, the guarantee from distribution matching can be violated, since the algorithms optimize only some of the terms in the bound while the optimal joint risk can become arbitrarily large (Zhao et al., 2019a; Wu et al., 2019; Li et al., 2020). In practice, forcing the representation distributions of the two domains to match may also fail in some settings. As an example, Li et al. (2020) gives empirical evidence of this failure on datasets with subpopulation shift, describing a classification task between vehicle and person in which the vehicle class comprises different subpopulations (different vehicle types) in the source and target domains.

In real-world applications, subpopulation shift is pervasive, and often in a fine-grained manner. The source domain will inevitably fail to capture the diversity of the target domain, and models will encounter unseen subpopulations in the target domain, e.g., unexpected weather conditions for self-driving or different diagnostic setups in medical applications (Santurkar et al., 2021). The lack of theoretical understanding of subpopulation shift motivates us to study the following question:

How can we provably transfer from the source to the target domain under subpopulation shift using unlabeled data?

Figure 1: A toy illustration of our framework for label propagation on subpopulations, formalized in Section 2. Although the formal definition (Assumption 1) involves a neighborhood function and possibly a representation space, one can understand it through the above toy model: a collection of source and target subpopulation pairs, where each pair forms a regular connected component. The consistency loss measures the mass of the non-robust set of the classifier, which contains the points whose predictions are inconsistent within a small neighborhood. Our main theorems (Theorems 2.1 and 2.2) state that, starting from a teacher with information on the source data, consistency regularization (regularizing on unlabeled data) results in the propagation of label information, thereby yielding a good classifier on the target domain, which may also improve upon the accuracy of the teacher on the source domain.

To address this question, we develop a general framework of domain adaptation in which we have a supervision signal on the source domain, given by a teacher classifier that has non-trivial performance on the source domain but is allowed to be entirely wrong on the target domain (see Assumption 1(a) and Figure 1), together with unlabeled data from both the source and target domains. The key step of the analysis is to show that the supervision signal can be propagated to the unlabeled data. To do so, we partition data from both domains into subpopulations and leverage a simple but realistic expansion assumption (Definition 2.1), proposed in Wei et al. (2021), on the subpopulations. We then prove that by minimizing a consistency regularization term (Miyato et al., 2018; Shu et al., 2018; Xie et al., 2020) on unlabeled data from both domains, plus a 0-1 consistency loss with the supervision signal (i.e., the teacher classifier) on the source domain, the supervision signal not only propagates from the subpopulations of the source domain to the subpopulations of the target domain but also refines the predictions on the source domain. In Theorems 2.1 and 2.2, we give bounds on the test performance on the target domain. Using off-the-shelf generalization bounds, we also obtain end-to-end finite-sample guarantees for neural networks in Section 2.3.

In Section 3, we extend our theoretical framework to a more general setting of source-to-target transfer based on an additional unlabeled dataset. As long as the subpopulation components of the unlabeled dataset satisfy the expansion property and cover both the source and target subpopulation components, one can provably propagate label information from source to target through the unlabeled data distribution (Theorems 3.1 and 3.2). As corollaries, we immediately obtain learning guarantees for both semi-supervised learning and unsupervised domain adaptation. The results can also be applied to various settings such as domain generalization; see Figure 2.

We implement the popular consistency-based semi-supervised learning algorithm FixMatch (Sohn et al., 2020) on the subpopulation shift task from BREEDS (Santurkar et al., 2021), and compare it with popular distributional matching methods (Ganin et al., 2016; Zhang et al., 2019). Results show that the consistency-based method outperforms distributional matching methods by over 8% in target accuracy (Table 1), partially verifying our theory on the subpopulation shift problem. We also show that combining distributional matching methods with a consistency-based algorithm can improve performance over distributional matching alone on classic unsupervised domain adaptation datasets such as Office-31 (Saenko et al., 2010) and Office-Home (Venkateswara et al., 2017).

In summary, our contributions are: 1) We introduce a theoretical framework of learning under subpopulation shift through label propagation; 2) We provide accuracy guarantees on the target domain for a consistency-based algorithm using a fine-grained analysis under the expansion assumption (Wei et al., 2021); 3) We provide a generalized label propagation framework that easily includes several settings, e.g., semi-supervised learning, domain generalization, etc.

1.1 Related work

We review further literature on domain adaptation, its variants, and consistency regularization, followed by a discussion of how our contributions differ from Wei et al. (2021).

For the less challenging setting of covariate shift where the source domain covers the target domain's support, prior work on importance weighting focuses on estimating the density ratio (Lin et al., 2002; Zadrozny, 2004) through kernel mean matching (Huang et al., 2006; Gretton et al., 2007; Zhang et al., 2013; Shimodaira, 2000) and standard divergence minimization paradigms (Sugiyama et al., 2008, 2012; Uehara et al., 2016; Menon and Ong, 2016; Kanamori et al., 2011). For out-of-support domain adaptation, recent work investigates approaches to match the source and target distributions in representation space (Glorot et al., 2011; Ajakan et al., 2014; Long et al., 2015; Ganin et al., 2016). Practical methods involve designing domain-adversarial objectives (Tzeng et al., 2017; Long et al., 2017a; Hong et al., 2018; He and Zhang, 2019; Xie et al., 2019; Zhu et al., 2019) or different types of discrepancy minimization (Long et al., 2015; Lee et al., 2019; Roy et al., 2019; Chen et al., 2020a). Another line of work explores self-training and gradual domain adaptation (Gopalan et al., 2011; Gong et al., 2012; Glorot et al., 2011; Kumar et al., 2020). For instance, Chen et al. (2020c) demonstrates that self-training tends to learn robust features in a specific probabilistic setting.

Variants of domain adaptation have been extensively studied. For instance, weakly-supervised domain adaptation considers the case where the labels in the source domain can be noisy (Shu et al., 2019; Liu et al., 2019); multi-source domain adaptation adapts from multiple source domains (Xu et al., 2018; Zhao et al., 2018); domain generalization also allows access to multiple training environments, but seeks out-of-distribution generalization without prior knowledge of the target domain (Ghifary et al., 2015; Li et al., 2018; Arjovsky et al., 2019).

The idea of consistency regularization has been used in many settings. Miyato et al. (2018); Qiao et al. (2018); Xie et al. (2020) enforce consistency with respect to adversarial examples or data augmentations for semi-supervised learning. Shu et al. (2019) combines domain adversarial training with consistency regularization for unsupervised domain adaptation. Recent work on self-supervised learning also leverages the consistency between two aggressive data augmentations to learn meaningful features (Chen et al., 2020b; Grill et al., 2020; Caron et al., 2020).

Most closely related to our work is Wei et al. (2021), which introduces a simple but realistic "expansion" assumption for analyzing label propagation: a low-probability subset of the data must expand to a neighborhood with larger probability relative to the subset. Under this assumption, the authors show learning guarantees for unsupervised learning and semi-supervised learning.

The focus of Wei et al. (2021) is not on domain adaptation, though the theorems directly apply. This leads to several drawbacks that we now discuss. Notably, in the analysis of Wei et al. (2021) for unsupervised domain adaptation, the population test risk is bounded using the population risk of a pseudo-labeler on the target domain. The pseudo-labeler is obtained via training with labeled data on the source domain. For domain adaptation, we do not expect such a pseudo-labeler to be directly informative when applied to the target domain, especially when the distribution shift is severe. In contrast, our theorem does not rely on a good pseudo-labeler on the target domain. Instead, we prove that with supervision on the source domain only, the population risk on the target domain converges to zero as the value of the consistency regularizer of the ground-truth classifier decreases (Theorems 2.1 and 2.2). In addition, Wei et al. (2021) assumes that the probability mass of each class as a whole satisfies the expansion assumption. However, each class may consist of several disjoint subpopulations. For instance, the dog class may have different breeds as its subpopulations. This setting differs from the concrete example of the Gaussian mixture model shown in Wei et al. (2021), where the data of each class concentrate following a Gaussian distribution. In this paper, we instead make a more realistic use of the expansion assumption by assuming the expansion property on the subpopulations of each class (Assumption 1). Behind this relaxation is a fine-grained analysis of the expansion property of the probability mass, which may be of independent interest.

2 Label Propagation in Domain Adaptation

In this section, we consider label propagation for unsupervised domain adaptation. We assume the distributions’ structure can be characterized by a specific subpopulation shift with the expansion property. In Section 2.1, we introduce the setting, including the algorithm and assumptions. In Section 2.2, we present the main theorem on bounding the target error. In Section 2.3, we provide an end-to-end guarantee on the generalization error of adapting a deep neural network to the target distribution with finite data. In Section 2.4 we provide a proof sketch for the theorems.

2.1 Setting

We consider a multi-class classification problem with K classes over an input space X. Let P and Q be the source and target distributions on X respectively, and we wish to find a classifier f that performs well on Q. Suppose we have a teacher classifier g on P. The teacher can be obtained by training on labeled data from P (standard unsupervised domain adaptation), by training on a small subset of labeled data from P, by direct transfer from some other trained classifier, etc. In all cases, the teacher classifier represents all the label information we know (and is allowed to have errors). Our goal is to transfer the information in g onto Q using only unlabeled data.

Our setting for subpopulation shift is formulated in the following assumption.

Assumption 1.

Assume the source and target distributions have the following structure: P = ∑_{i∈[m]} w_i P_i and Q = ∑_{i∈[m]} v_i Q_i, where the mixture weights w_i, v_i ≥ 0 sum to one and P_i, Q_i are the subpopulation components. We assume the ground truth class is constant on each pair of components P_i, Q_i, and denote it y_i. We abuse notation and let P_i, Q_i also denote the conditional distributions (probability measures) of P, Q on the corresponding components. In addition, we make the following canonical assumptions:

  1. The teacher classifier g on P is informative of the ground truth class by a margin γ > 0; that is, on each source component P_i, the fraction of points on which g predicts the ground truth class y_i exceeds the fraction predicted as any other single class by at least γ.

  2. On each component, the ratio of the population under domain shift is upper-bounded by a constant κ, i.e., v_i ≤ κ · w_i for every i ∈ [m].

Following Wei et al. (2021), we make use of a consistency regularization method, i.e., we expect the predictions to be stable under a suitable set of input transformations B(x). The regularizer of f on the mixed probability measure (P + Q)/2 is defined as

R(f) := E_{x ∼ (P+Q)/2} [ 1{ ∃ x′ ∈ B(x) such that f(x′) ≠ f(x) } ],

and a low regularizer value implies that the predicted labels are, with high probability, constant within B(x). Prior work on using consistency regularization for unlabeled self-training includes Miyato et al. (2018), where B(x) can be understood as a distance-based neighborhood set, and Adel et al. (2017); Xie et al. (2020), where B(x) can be understood as a set of data augmentations. In general, B(x) takes the form B(x) = { x′ : ∃ A ∈ 𝒜 such that d(x′, A(x)) ≤ r } for a small number r, some distance function d(·, ·), and a class 𝒜 of data augmentation functions. (In this paper, consistency regularization, the expansion property, and label propagation can also be understood as happening in a representation space, as long as the distance in B(x) is computed as d(φ(x′), φ(A(x))) for some feature map φ.)
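In practice, R(f) is estimated on finite unlabeled samples with a finite set of transformations standing in for B(x). The sketch below is a minimal PyTorch version of this estimator; the `augmentations` list and the 0-1 (argmax) criterion are illustrative assumptions, and training-time implementations typically use a differentiable surrogate such as a KL divergence between the two predictions.

```python
import torch

def consistency_regularizer(model, unlabeled_x, augmentations):
    """Monte-Carlo estimate of R(f): the fraction of unlabeled points whose
    predicted label flips under some transformation approximating B(x)."""
    with torch.no_grad():
        base_pred = model(unlabeled_x).argmax(dim=1)
        flipped = torch.zeros_like(base_pred, dtype=torch.bool)
        for aug in augmentations:  # finite stand-in for the set B(x)
            flipped |= model(aug(unlabeled_x)).argmax(dim=1) != base_pred
    return flipped.float().mean()
```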

The set B is used in the following expansion property. First, we define the neighborhood function as

N(x) := { x′ : B(x) ∩ B(x′) ≠ ∅ },

and the neighborhood of a set S as

N(S) := ∪_{x ∈ S} N(x).

The expansion property on the mixed distribution is defined as follows:

Definition 2.1 (Expansion (Wei et al., 2021)).

  1. (Multiplicative Expansion) Write P̄_i := (P_i + Q_i)/2. We say {P_i, Q_i}_{i∈[m]} satisfies (a, c)-multiplicative expansion for some constants a ∈ (0, 1) and c > 1 if, for any i ∈ [m] and any subset S with P̄_i(S) ≤ a, we have P̄_i(N(S)) ≥ min{ c · P̄_i(S), 1 }.

  2. (Constant Expansion) We say {P_i, Q_i}_{i∈[m]} satisfies q-constant expansion for some constant q ∈ (0, 1) if, for any i ∈ [m] and any subset S with P̄_i(S) ≥ q and P̄_i(S) ≤ 1/2, we have P̄_i(N(S) \ S) ≥ min{ q, P̄_i(S) }.

The expansion property implicitly states that P_i and Q_i are close to each other and regularly shaped. Through the regularizer R(f), the label can "propagate" from P_i to Q_i. (Note that our model for subpopulation shift allows an arbitrarily fine-grained decomposition into components, which makes the expansion property more realistic. In image classification, one can take P_i as "Poodles eating dog food" vs. Q_i as "Labradors eating meat", both under the dog class, which is a rather typical form of shift in a real dataset; the representations of such subpopulations can turn out quite close after certain data augmentations and perturbations as in B(x).) One can keep in mind the specific example of Figure 1, where each pair P_i, Q_i forms a single connected component.
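As a sanity check on the definition, one can estimate both sides of the multiplicative-expansion inequality by sampling. The toy below does this for a single one-dimensional Gaussian component with r-ball transformations, under which N(S) contains the 2r-dilation of S; all numbers are illustrative.

```python
import numpy as np

# One component P_i = N(0, 1) on the line; B(x) is the r-ball around x, so
# N(S) contains the dilation of S by 2r (two overlapping balls suffice).
rng = np.random.default_rng(0)
samples = rng.normal(size=1_000_000)
r = 0.25

def measure(lo, hi):
    """Monte-Carlo estimate of the component's mass on [lo, hi]."""
    return np.mean((samples >= lo) & (samples <= hi))

p_S = measure(1.0, 1.5)                    # a small slice of the component
p_NS = measure(1.0 - 2 * r, 1.5 + 2 * r)   # lower bound on the mass of N(S)
print(p_S, p_NS, p_NS / p_S)               # expansion factor c is about 3 here
```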

Finally, let F be a function class for the learning model. We consider the realizable case, where the ground truth classifier f* ∈ F. We assume that the consistency error of the ground truth classifier is small, and use a constant μ as an upper bound: R(f*) ≤ μ. We find the classifier with the following algorithm:

f̂ := argmin_{f ∈ F}  L_{0-1}(f, g; P) + R(f),    (1)

where L_{0-1}(f, g; P) := E_{x∼P}[ 1{ f(x) ≠ g(x) } ] is the 0-1 loss on the source domain, which encourages f to be aligned with the teacher g on the source domain.
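For intuition, a differentiable surrogate of objective (1) can be written in a few lines: cross-entropy against the teacher's hard labels stands in for the 0-1 source loss, and a KL-based consistency term stands in for R(f). This is a hedged sketch, not the exact objective analyzed above (the theory uses the 0-1 versions of both terms), and `augmentations` is an assumed finite stand-in for B(x).

```python
import torch
import torch.nn.functional as F

def surrogate_objective(model, teacher, source_x, unlabeled_x, augmentations):
    """Differentiable surrogate of (1): fit the teacher on the source domain
    plus consistency regularization on unlabeled data from both domains."""
    with torch.no_grad():
        teacher_labels = teacher(source_x).argmax(dim=1)  # teacher's hard labels
    fit_loss = F.cross_entropy(model(source_x), teacher_labels)

    base_probs = F.softmax(model(unlabeled_x), dim=1).detach()
    cons_loss = 0.0
    for aug in augmentations:
        log_probs = F.log_softmax(model(aug(unlabeled_x)), dim=1)
        cons_loss = cons_loss + F.kl_div(log_probs, base_probs,
                                         reduction="batchmean")
    return fit_loss + cons_loss
```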

Our main theorems will be formulated using (1/2, c)-multiplicative expansion or q-constant expansion. (Wei et al. (2021) contains several examples and illustrations of the expansion property; e.g., the Gaussian mixture example satisfies multiplicative expansion. The radius r in B(x) is much smaller than the norm of a typical example, so our model, which only requires a separation on the order of r between components to make μ small, is much weaker than a typical notion of "clustering".)

2.2 Main Theorem

With the above preparations, we are ready to establish bounds on the target error Err_Q(f̂) := Pr_{x∼Q}[ f̂(x) ≠ f*(x) ].

Theorem 2.1 (Bound on Target Error with Multiplicative Expansion).

Suppose Assumption 1 holds and {P_i, Q_i}_{i∈[m]} satisfies (1/2, c)-multiplicative expansion. Then the classifier f̂ obtained by (1) satisfies:

Theorem 2.2 (Bound on Target Error with Constant Expansion).

Suppose Assumption 1 holds and {P_i, Q_i}_{i∈[m]} satisfies q-constant expansion. Then the classifier f̂ obtained by (1) satisfies:

We make the following remarks on the main results, and also highlight the differences from directly applying Wei et al. (2021) to domain adaptation.

Remark 1.

The theorems state that as long as the ground truth consistency error (equivalently, its upper bound μ) is small enough, the classifier f̂ can achieve near-zero error. This result does not rely on the teacher being close to zero error; it only requires the teacher to have a positive margin γ. As a result, the classifier f̂ can improve upon g (including on P, as the proofs of the theorems show), in the sense that the error of f̂ converges to zero as μ → 0, regardless of the error of g. This improvement is due to the algorithmic change in Equation (1), which strongly enforces label propagation. Under multiplicative expansion, Wei et al. (2021) attain a bound that explicitly depends on the accuracy of the teacher on the target domain. Our improvement comes from strongly enforcing consistency rather than balancing consistency against fitting the teacher classifier.

Remark 2.

We do not impose any lower bound on the measure of the components P_i, Q_i, which is much more general and realistic. From the proofs, one may see that we allow some components to be entirely mislabeled, but in the end, the total measure of such components is bounded. Directly applying Wei et al. (2021) would require a stringent lower bound on the measure of each component.

Remark 3.

We only require expansion with respect to the individual components P_i, Q_i, instead of the entire class as in Wei et al. (2021), which is a weaker requirement.

The proofs work essentially because the expansion property turns local consistency into a form of global consistency. The proof sketch is in Section 2.4, and the full proof is in Appendix A.

2.3 Finite Sample Guarantee for Deep Neural Networks

In this section, we leverage existing generalization bounds to prove an end-to-end guarantee on training a deep neural network with finite samples from P and Q. The results indicate that if the ground-truth classifier is realizable by a neural network with a large robust margin, then the total error can be small.

For simplicity, let there be n i.i.d. data points each from P and Q (a total of 2n data points), with the empirical distributions denoted P̂ and Q̂. In order to upper-bound the 0-1 losses with finite data, we apply the notion of the all-layer margin (Wei and Ma, 2019), which measures the stability of the neural net under simultaneous perturbations to each hidden layer. (Other notions of margin could also work; this choice lets us leverage the results of Wei et al. (2021).) We first cite the useful results from Wei et al. (2021). Suppose f(x) = argmax_{j∈[K]} h(x)_j, where h is the neural network with weight matrices W_1, …, W_p (similarly, h* and f* denote the ground truth network and its induced classifier), and let d denote the maximum width of any layer. Let m_h(x, y) denote the all-layer margin at input x for label y. (For now, we only use the fact that the margin is positive exactly when the prediction is correct, so that the 0-1 losses can be upper-bounded by the corresponding margin losses; the detailed definition is in Appendix B and in Wei and Ma (2019).) We also define the robust margin m_B(x) := min_{x′ ∈ B(x)} m_h(x′, f(x)). We state the following results.

Proposition 2.1 (Theorem C.3 from Wei et al. (2021)).

For any δ > 0, with probability at least 1 − δ,

where the Õ(·) notation hides poly-logarithmic factors in n and d.

Proposition 2.2 (Theorem 3.7 from Wei et al. (2021)).

For any δ > 0, with probability at least 1 − δ,
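The robust margin defined above can also be probed empirically. As a simplified, hedged proxy (not the all-layer margin of Wei and Ma (2019), which perturbs hidden layers), the sketch below computes an output-space robust margin by minimizing the classification margin over a finite set of transformations standing in for B(x); all helper names are illustrative.

```python
import torch

def output_margin(logits, y):
    """Logit of the labeled class minus the best other logit (positive iff
    the prediction matches y)."""
    true = logits.gather(1, y[:, None]).squeeze(1)
    rest = logits.clone()
    rest.scatter_(1, y[:, None], float("-inf"))  # mask out the labeled class
    return true - rest.max(dim=1).values

def robust_output_margin(model, x, y, augmentations):
    """Minimum of the output margin over transformations approximating B(x);
    a simplified stand-in for the robust all-layer margin."""
    with torch.no_grad():
        margins = output_margin(model(x), y)
        for aug in augmentations:
            margins = torch.minimum(margins, output_margin(model(aug(x)), y))
    return margins
```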

To ensure generalization, we replace the loss functions in the algorithm with the corresponding margin losses and solve the margin-based analogue of (1), which we refer to as (2). Based on these preparations, we are ready to state the final bound.

Theorem 2.3.

Suppose Assumption 1 holds, and f̂ is returned by (2). With probability at least 1 − δ, we have:

(a) Under (1/2, c)-multiplicative expansion on {P_i, Q_i}_{i∈[m]}, we have

(b) Under q-constant expansion on {P_i, Q_i}_{i∈[m]}, we have

where

Remark 4.

Note that as the sample size n grows, the finite-sample quantities in the bounds approach their population counterparts, and we recover the bounds in Section 2.2.

Similar to the argument in Wei et al. (2021), it is worth noting that the required sample complexity does not depend exponentially on the dimension. This is in stark contrast to classic non-parametric methods for unknown "clusters" of samples, where the sample complexity suffers from the curse of dimensionality of the input space.

The proof of Theorem 2.3 is in Appendix B.

2.4 Proof Sketch for Theorems 2.1 and 2.2

To prove the theorems, we first introduce some concepts and notation.

A point x is called robust w.r.t. f and B if f(x′) = f(x) for every x′ ∈ B(x). Denote by S_B(f) the set of all robust points, which is called the robust set of f. Let R_i be the intersection of S_B(f) with the i-th component, for i ∈ [m]; the sets R_i form a partition of the robust set. Denote by ŷ_i the majority class label that f predicts on the robust set R_i. We also call {x ∈ R_i : f(x) ≠ ŷ_i} the minority robust set of component i. In addition, let M_i be the set of all points in component i with f(x) ≠ ŷ_i, called the minority set of component i, which is a superset of the minority robust set.
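These definitions are easy to compute on finite data. The toy sketch below takes per-sample predictions within one component and an (assumed) neighbor list approximating B, and returns the robust set, the majority label, and the minority set; the names and the tiny example are illustrative.

```python
from collections import Counter

def robust_and_minority(labels, neighbors):
    """labels[i]: predicted class of sample i in one component.
    neighbors[i]: indices of samples approximating B(x_i)."""
    robust = {i for i, nbrs in enumerate(neighbors)
              if all(labels[j] == labels[i] for j in nbrs)}
    majority = Counter(labels[i] for i in robust).most_common(1)[0][0]
    minority = {i for i in range(len(labels)) if labels[i] != majority}
    return robust, majority, minority

labels = [0, 0, 0, 1, 1]
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3]]
print(robust_and_minority(labels, neighbors))
# robust set {0, 1, 4}; majority label 0; minority set {3, 4}, of which the
# robust part {4} is the minority robust set
```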

The expansion property can be used to control the total population of the minority set.

Lemma 2.1 (Upper Bound of Minority Set).

For the classifier f̂ obtained by (1), the total measure of the minority sets can be bounded as follows:

(a) Under (1/2, c)-multiplicative expansion, we have

(b) Under q-constant expansion, we have

Based on the bound on the minority set, our next lemma says that on most subpopulation components, the inconsistency between f̂ and the teacher g is no greater than the error of g plus the margin γ. Specifically, let I denote the set of components on which this inconsistency bound fails, and we have the following result:

Lemma 2.2 (Upper Bound on the Inconsistent Components I).

Suppose , then

Based on the above results, we are ready to bound the target error Err_Q(f̂).

Lemma 2.3 (Bounding the Target Error).

Suppose . Let

for in , so that . Then we can separately bound

(a)

(b)

so that the combination gives

Specifically, Lemma 2.3(a) is obtained by directly using Lemma 2.2, and Lemma 2.3(b) is proved by a fine-grained analysis of the minority set.

Finally, we plug in the bound from Lemma 2.1, and the desired main results are obtained.

3 Label Propagation in Generalized Subpopulation Shift

In this section, we show that the previous label propagation algorithm can be applied to a much more general setting than standard unsupervised domain adaptation. In short, as long as we perform consistency regularization on an unlabeled dataset that covers both the teacher classifier's domain and the target domain, we can perform label propagation through the subpopulations of the unlabeled data.

Specifically, we still let P be the source distribution, on which we have a teacher g, and Q the target distribution. The difference is that we now have a "covering" distribution U (Assumption 2(c)) from which we only use unlabeled data, and the expansion property is assumed to hold on U.

Assumption 2.

Assume the distributions have the following structure: P = ∑_{i∈[m]} w_i P_i, Q = ∑_{i∈[m]} v_i Q_i, and U = ∑_{i∈[m]} u_i U_i, where the mixture weights w_i, v_i, u_i ≥ 0 sum to one. Again, assume the ground truth class is constant on each triple of components P_i, Q_i, U_i, denoted y_i. We abuse notation and let P_i, Q_i, U_i also denote the conditional distributions of P, Q, U on the corresponding components. We also make the following assumptions, with an additional (c) saying that U "covers" P and Q.

(a)(b): Same as Assumption 1(a)(b).

(c) There exists a constant τ such that the measures P_i and Q_i are bounded by τ·U_i. That is, for any i ∈ [m] and any measurable set S, P_i(S) ≤ τ·U_i(S) and Q_i(S) ≤ τ·U_i(S).

The regularizer now becomes

R_U(f) := E_{x∼U} [ 1{ ∃ x′ ∈ B(x) such that f(x′) ≠ f(x) } ].

One can see that the main difference is that we have replaced the mixture (P + Q)/2 from the previous domain adaptation setting with a general distribution U. Indeed, we assume expansion on U and can establish bounds on the target error on Q.

Definition 3.1 (Expansion on U).

(1) We say {U_i}_{i∈[m]} satisfies (a, c)-multiplicative expansion for some constants a ∈ (0, 1) and c > 1 if, for any i ∈ [m] and any subset S with U_i(S) ≤ a, we have U_i(N(S)) ≥ min{ c · U_i(S), 1 }.

(2) We say {U_i}_{i∈[m]} satisfies q-constant expansion for some constant q ∈ (0, 1) if, for any i ∈ [m] and any subset S with U_i(S) ≥ q and U_i(S) ≤ 1/2, we have U_i(N(S) \ S) ≥ min{ q, U_i(S) }.

Theorem 3.1 (Bound on Target Error with Multiplicative Expansion, Generalized).

Suppose Assumption 2 holds and {U_i}_{i∈[m]} satisfies (1/2, c)-multiplicative expansion. Then the classifier f̂ obtained by (1) satisfies:

Theorem 3.2 (Bound on Target Error with Constant Expansion, Generalized).

Suppose Assumption 2 holds and {U_i}_{i∈[m]} satisfies q-constant expansion. Then the classifier f̂ obtained by (1) satisfies:

Choosing special cases of the structure of U, we naturally obtain the following special cases, corresponding to the models shown in Figure 2.

Figure 2: Settings of generalized subpopulation shift in Section 3. The figures only draw one subpopulation for each model.
  1. Unsupervised domain adaptation (Figure 2(a)). When U = (P + Q)/2, we immediately obtain the results in Section 2.2 by plugging in τ = 2. Therefore, Theorems 2.1 and 2.2 are just special cases of Theorems 3.1 and 3.2.

  2. Semi-supervised learning or self-supervised denoising (Figure 2(b)). When P = Q = U, the framework becomes the degenerate version of learning a student classifier from a teacher in a single domain. The teacher g can be a pseudo-labeler in semi-supervised learning or some other pre-trained classifier in self-supervised denoising. Our results improve upon Wei et al. (2021) in this case, as discussed in Remarks 1 and 2.

  3. Domain expansion (Figure 2(c)). When Q = U, this becomes a problem between semi-supervised learning and domain adaptation, and we call it domain expansion. That is, the source P is a sub-distribution of U, on which we need to perform well. Frequently, we have a big unlabeled dataset and the labeled data cover only a specific part of it.

  4. Domain extrapolation (Figure 2(d)). When P and Q do not satisfy expansion by themselves, e.g., they are not connected by B, but they are connected through U, we can still obtain small error on Q. We term this kind of task domain extrapolation: we have a small source and a small target distribution that are not easy to relate directly, but can be related through a third, bigger unlabeled dataset through which label information can propagate.

  5. Multi-source domain adaptation or domain generalization (Figure 2(e)). We have multiple source domains and take P as their union (average measure). Learning is guaranteed if, in the input space or some representation space, U successfully "covers" Q, the target distribution in multi-source domain adaptation or the test distribution in domain generalization. Also, as the framework suggests, we do not require all the source domains to be labeled, depending on the specific structure.

The general label propagation framework proposed in this section is widely applicable in many practical scenarios, and exploring these applications is an interesting direction for future work. The full proofs of the theorems in this section are in Appendix A.

4 Experiments

In this section, we first conduct experiments on a dataset constructed to simulate natural subpopulation shift. We then extend the subpopulation shift perspective to classic unsupervised domain adaptation datasets by combining distributional matching methods with a consistency-based label propagation method.

4.1 Subpopulation Shift Dataset

We empirically verify that label propagation via consistency regularization works well for subpopulation shift tasks. Towards this goal, we construct an unsupervised domain adaptation (UDA) task using the challenging ENTITY-30 task from BREEDS (Santurkar et al., 2021), and directly adapt FixMatch (Sohn et al., 2020), an existing consistency regularization method for semi-supervised learning, to the subpopulation shift task. The main idea of FixMatch is to optimize the supervised loss on weak augmentations of source samples, plus a consistency regularizer that encourages the prediction of the classifier on a strong augmentation of a sample to match the prediction on a weak augmentation of the same sample. (Empirically, FixMatch also incorporates a self-training technique that takes the hard label of the prediction on weak augmentations. We additionally use Distribution Alignment (Berthelot et al., 2019), mentioned in Section 2.5 of the FixMatch paper.) In contrast to semi-supervised learning, where the supports of the unlabeled and labeled data are inherently the same, in subpopulation shift problems the support sets of different domains are disjoint. To enable label propagation, we need a good feature map so that propagation can happen in the feature space. We thus make use of the feature map learned by the self-supervised learning algorithm SwAV (Caron et al., 2020), which simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations of the same image. This representation has two merits: first, it encourages subpopulations with similar representations to cluster in the feature space; second, it enforces augmented samples to be close in the feature space. We expect subclasses from the same superclass to be assigned to the same cluster and thus to enjoy the expansion property to a certain extent in the feature space. We defer the detailed experimental settings to Appendix C and report the results here.
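For reference, the core FixMatch update can be written compactly. The sketch below assumes `weak` and `strong` augmentation callables and a confidence threshold, and omits the Distribution Alignment extension; see Sohn et al. (2020) for the full method.

```python
import torch
import torch.nn.functional as F

def fixmatch_loss(model, source_x, source_y, unlabeled_x,
                  weak, strong, threshold=0.95):
    """Supervised loss on weakly augmented source samples, plus consistency:
    confident predictions on weak augmentations become hard pseudo-labels
    for strong augmentations of the same unlabeled samples."""
    sup = F.cross_entropy(model(weak(source_x)), source_y)

    with torch.no_grad():
        probs = F.softmax(model(weak(unlabeled_x)), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = (conf >= threshold).float()  # keep only confident pseudo-labels

    unsup = (F.cross_entropy(model(strong(unlabeled_x)), pseudo,
                             reduction="none") * mask).mean()
    return sup + unsup
```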

Method                          Source Acc      Target Acc
Train on Source                 91.91 ± 0.23    56.73 ± 0.32
DANN (Ganin et al., 2016)       92.81 ± 0.50    61.03 ± 4.63
MDD (Zhang et al., 2019)        92.67 ± 0.54    63.95 ± 0.28
FixMatch (Sohn et al., 2020)    90.87 ± 0.15    72.60 ± 0.51
Table 1: Comparison of performance on ENTITY-30 (Acc refers to accuracy, measured in percentage).

We compare the performance of the FixMatch adaptation with popular distributional matching methods, i.e., DANN (Ganin et al., 2016) and MDD (Zhang et al., 2019). (We use the implementation from Junguang Jiang (2020), which shows that MDD has the best performance among the evaluated methods.) For a fair comparison, all models are finetuned from the SwAV representation. As shown in Table 1, the adaptation with FixMatch obtains a significant improvement over the baseline that only trains on the source domain, by more than 15 points of accuracy on the target domain. FixMatch also outperforms the distributional matching methods by more than 8 points. The results suggest that, unlike previous distributional matching-based methods, consistency regularization-based methods are preferable on domain adaptation tasks when encountering subpopulation shift. This is also aligned with our theoretical findings.

4.2 Classic Unsupervised Domain Adaptation Datasets

In this section, we conduct experiments on classic unsupervised domain adaptation datasets, i.e., Office-31 (Saenko et al., 2010) and Office-Home (Venkateswara et al., 2017), where the source and target domains mainly differ in style, e.g., artistic images versus real-world images. Distributional matching methods seek to learn an invariant representation that removes confounding information such as style. Since the feature distributions of different domains are encouraged to match, the supports of different domains in the feature space overlap, which enables label propagation. In addition, subpopulation shift from source to target may remain even after the styles are unified in the feature space. This inspires us to combine distributional matching methods with label propagation.

As a preliminary attempt, we directly combine MDD (Zhang et al., 2019) and FixMatch (Sohn et al., 2020) to see if there is a gain over MDD alone. Specifically, we first learn models using MDD on the two classic unsupervised domain adaptation datasets, Office-31 and Office-Home. Then we finetune the learned models using FixMatch (with the Distribution Alignment extension described in the previous subsection). The results in Tables 2 and 3 confirm that finetuning with FixMatch improves the performance of MDD models. The detailed experimental settings can be found in Appendix C.
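Schematically, the combination is just sequential training; the sketch below assumes a `train_mdd` helper and the `fixmatch_loss` from the previous subsection, and is illustrative only.

```python
import torch

def mdd_then_fixmatch(model, source_loader, target_loader,
                      train_mdd, fixmatch_loss, weak, strong,
                      finetune_epochs=10, lr=1e-3):
    """Stage 1: distribution matching (MDD, assumed helper) aligns the domains
    in feature space. Stage 2: finetune with FixMatch-style consistency to
    propagate labels across any remaining subpopulation shift."""
    model = train_mdd(model, source_loader, target_loader)  # stage 1
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(finetune_epochs):                        # stage 2
        for (x_s, y_s), x_t in zip(source_loader, target_loader):
            loss = fixmatch_loss(model, x_s, y_s, x_t, weak, strong)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```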


Method          A→W           D→W           W→D      A→D           D→A           W→A           Average
MDD             94.97 ± 0.70  98.78 ± 0.07  100 ± 0  92.77 ± 0.72  75.64 ± 1.53  72.82 ± 0.52  89.16
MDD+FixMatch    95.47 ± 0.95  98.32 ± 0.19  100 ± 0  93.71 ± 0.23  76.64 ± 1.91  74.93 ± 1.15  89.84
Table 2: Performance of MDD and MDD+FixMatch on the Office-31 dataset.

Method        Ar→Cl       Ar→Pr       Ar→Rw       Cl→Ar       Cl→Pr       Cl→Rw       Pr→Ar       Pr→Cl       Pr→Rw       Rw→Ar       Rw→Cl       Rw→Pr       Average
MDD           54.9 ± 0.7  74.0 ± 0.3  77.7 ± 0.3  60.6 ± 0.4  70.9 ± 0.7  72.1 ± 0.6  60.7 ± 0.8  53.0 ± 1.0  78.0 ± 0.2  71.8 ± 0.4  59.6 ± 0.4  82.9 ± 0.3  68.0
MDD+FixMatch  55.1 ± 0.9  74.7 ± 0.8  78.7 ± 0.5  63.2 ± 1.3  74.1 ± 1.8  75.3 ± 0.1  63.0 ± 0.6  53.0 ± 0.6  80.8 ± 0.4  73.4 ± 0.1  59.4 ± 0.7  84.0 ± 0.5  69.6
Table 3: Performance of MDD and MDD+FixMatch on the Office-Home dataset.

5 Conclusion

In this work, we introduced a new theoretical framework for learning under subpopulation shift through label propagation, providing new insights into solving domain adaptation tasks. We provided accuracy guarantees on the target domain for a consistency regularization-based algorithm using a fine-grained analysis under the expansion assumption. Our generalized label propagation framework in Section 3 subsumes the previous domain adaptation setting and also provides an interesting direction for future work.


Appendix A Proof of Theorems 2.1, 2.2, 3.1, and 3.2

Note that in Section 3, by taking U = (P + Q)/2 (i.e., U_i = (P_i + Q_i)/2), Assumption 2(c) holds with τ = 2. By plugging in this choice of U, Theorems 2.1 and 2.2 immediately become corollaries of Theorems 3.1 and 3.2. Therefore, we only provide a full proof for Theorems 3.1 and 3.2 here.
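Concretely, the covering constant τ = 2 follows from a one-line computation with U_i = (P_i + Q_i)/2, using the notation introduced in Assumption 2:

```latex
% For any measurable set S and any i \in [m]:
P_i(S) \;\le\; P_i(S) + Q_i(S) \;=\; 2\cdot\frac{P_i(S) + Q_i(S)}{2} \;=\; 2\,U_i(S),
\qquad\text{and symmetrically}\qquad Q_i(S) \;\le\; 2\,U_i(S),
% so Assumption 2(c) holds with \tau = 2.
```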

First, similar to Section 2.4, we give a proof sketch for Theorems 3.1 and 3.2, which includes the corresponding definitions and lemmas for the generalized setting.

A.1 Proof Sketch for Theorems 3.1 and 3.2

To prove the theorems, we first introduce some concepts and notation.

A point x is called robust w.r.t. f and B if f(x′) = f(x) for every x′ ∈ B(x). Denote by S_B(f) the set of all robust points, which is called the robust set of f. Let R_i be the intersection of S_B(f) with the i-th component of U, for i ∈ [m]; the sets R_i form a partition of the robust set. Denote by ŷ_i the majority class label that f predicts on the robust set R_i under U. We also call {x ∈ R_i : f(x) ≠ ŷ_i} the minority robust set of component i. In addition, let M_i be the set of all points in component i with f(x) ≠ ŷ_i, called the minority set of component i, which is a superset of the minority robust set.

The expansion property can be used to control the total population of the minority set.

Lemma A.1 (Upper Bound of Minority Set).

For the classifier f̂ obtained by (1), the total measure of the minority sets under U can be bounded as follows:

(a) Under (1/2, c)-multiplicative expansion, we have

(b) Under q-constant expansion, we have

Based on the bound on the minority set, our next lemma says that on most subpopulation components, the inconsistency between f̂ and the teacher g is no greater than the error of g plus the margin γ. Specifically, let I denote the set of components on which this inconsistency bound fails, and we have the following result:

Lemma A.2 (Upper Bound on the Inconsistent Components I).

Suppose , then

Based on the above results, we are ready to bound the target error Err_Q(f̂).

Lemma A.3 (Bounding the Target Error).

Suppose . Let

for in , so that . Then we can separately bound

(a)

(b)

so that the combination gives

Specifically, Lemma A.3(a) is obtained by directly using Lemma A.2, and Lemma A.3(b) is proved by a fine-grained analysis of the minority set.

Finally, we plug in the bound from Lemma A.1, and the desired results in Theorems 3.1 and 3.2 are obtained.

To make the proof complete, we provide detailed proofs of Lemmas A.1, A.2, and A.3 in the following subsections.

A.2 Proof of Lemma A.1

Proof.

We first prove the constant expansion case (b). The probabilities in this lemma are all taken w.r.t. the distribution U, so we omit this subscript.

Consider the robust minority set defined above. In order to apply expansion, we partition it into two halves:

Lemma A.4 (Partition of the Robust Minority Set).

There exists a partition of the set into two parts S_1 and S_2 such that the measures of both parts satisfy the bounds required in the argument below.

Proof.

Starting from two empty parts, we add one element at a time into S_1 or S_2 while keeping both properties. We prove that any element can always be added to at least one of S_1 and S_2 without violating the corresponding property, so we can repeat the process until (S_1, S_2) is a partition of the whole set. In fact, since the measures of the two parts cannot both be large (their sum is bounded by the total measure), at least one of the parts can absorb the next element, and Lemma A.4 is proved.

Let S_1 and S_2 denote the two parts from Lemma A.4. We know that at least one of S_1 and S_2 satisfies the requirement for q-constant expansion. Hence,

(3)

On the other hand, by the definition of the robust set, the points gained by expanding S_1 and S_2 to their neighborhoods are all non-robust. Since the total measure of non-robust points is R_U(f̂) by definition, we know that

(4)

Combining (3) and (4), it must hold that the measure of S_1 is appropriately small, or else (3) and (4) would contradict each other; this holds in any case.

Similarly, the same bound holds for S_2. Therefore, the measure of the robust minority set is bounded.

Since the remainder of the minority set consists only of non-robust points, we obtain the desired result (b).

For the multiplicative expansion case (a), it is easy to verify that