Understanding Self-Training for Gradual Domain Adaptation

by Ananya Kumar, et al.
Stanford University

Machine learning systems must adapt to data distributions that evolve over time, in applications ranging from sensor networks and self-driving car perception modules to brain-machine interfaces. We consider gradual domain adaptation, where the goal is to adapt an initial classifier trained on a source domain given only unlabeled data that shifts gradually in distribution towards a target domain. We prove the first non-vacuous upper bound on the error of self-training with gradual shifts, under settings where directly adapting to the target domain can result in unbounded error. The theoretical analysis leads to algorithmic insights, highlighting that regularization and label sharpening are essential even when we have infinite data, and suggesting that self-training works particularly well for shifts with small Wasserstein-infinity distance. Leveraging the gradual shift structure leads to higher accuracies on a rotating MNIST dataset and a realistic Portraits dataset.




1 Introduction

Machine learning models are typically trained and tested on the same data distribution. However, when a model is deployed in the real world, the data distribution typically evolves over time, leading to a drop in performance. This problem is widespread: sensor measurements drift over time due to sensor aging [1], self-driving car vision modules have to deal with evolving road conditions [2], and neural signals received by brain-machine interfaces change within the span of a day [3]. Repeatedly gathering large sets of labeled examples to retrain the model can be impractical, so we would like to leverage unlabeled examples to adapt the model to maintain high accuracy [3, 4].

In these examples the domain shift does not happen all at once but gradually, yet most domain adaptation methods ignore this gradual structure. Intuitively, smaller shifts are easier to handle, but each shift can incur some error, so more steps can mean more degradation. It is therefore unclear whether leveraging the gradual shift structure is better than directly adapting to the target.

In this paper, we provide the first theoretical analysis showing that gradual domain adaptation provides improvements over the traditional approach of direct domain adaptation. We analyze self-training (also known as pseudolabeling), a method in the semi-supervised learning literature that has led to state-of-the-art results on ImageNet [6] and adversarial robustness on CIFAR-10 [7, 8, 9].

Figure 1: In gradual domain adaptation we are given labeled data from a source domain, and unlabeled data from intermediate domains that shift gradually in distribution towards a target domain. Here, blue = female, red = male, and gray = unlabeled data.

As a concrete example of our setting, the Portraits dataset [10] contains photos of high school seniors taken across many years, labeled by gender (Figure 1). We use the first 2000 images (1905 - 1935) as the source, next 14000 (1935 - 1969) as intermediate domains, and next 2000 images as the target (1969 - 1973). A model trained on labeled examples from the source gets 98% accuracy on held out examples in the same years, but only 75% accuracy on the target domain. Assuming access to unlabeled images from intermediate domains, our goal is to adapt the model to do well on the target domain. Direct adaptation to the target with self-training only improves the accuracy a little, from 75% to 77%.

The gradual self-training algorithm begins with a classifier trained on labeled examples from the source domain (Figure 2(a)). For each successive domain $P_t$, the algorithm generates pseudolabels for unlabeled examples from that domain, and then trains a regularized supervised classifier on the pseudolabeled examples. The intuition, visualized in Figure 2, is that after a single gradual shift most examples are pseudolabeled correctly, so self-training learns a good classifier on the shifted data, whereas the shift from the source to the target can be too large for self-training to correct. We find that gradual self-training on the Portraits dataset improves upon direct target adaptation (77% to 84% accuracy).

(a) t = 0
(b) t = 1
(c) t = 2
(d) t = 3
Figure 2: The source classifier gets 100% accuracy on the source domain (Figure 2(a)), where we have labeled data. But after 3 time steps (Figure 2(d)) the source classifier is stale, classifying most examples incorrectly. At that point we cannot correct the classifier using unlabeled data from the target domain alone, which corresponds to traditional domain adaptation directly to the target. Given unlabeled data in an intermediate domain (Figure 2(b)) where the shift is gradual, the source classifier pseudolabels most points correctly, and self-training learns an accurate classifier (shown in green) that separates the classes. Successively applying self-training learns a good classifier on the target domain (green classifier in Figure 2(d)).

Our results: We analyze gradual domain adaptation in two settings. The key challenge for domain adaptation theory is dealing with source and target domains whose supports do not overlap [11, 12], which is typical in the modern high-dimensional regime. The gradual shift structure inherent in many applications provides us with leverage to handle adapting to target distributions with non-overlapping support.

Our first setting, the margin setting, is distribution-free: we only assume that at every point in time there exists some linear classifier that can classify most of the data correctly with a margin, where the linear classifier may be different at each time step (so this is more general than covariate shift), and that the shifts are small in Wasserstein-infinity distance. A simple example (as in Figure 2) shows that a classifier that gets 100% accuracy can get 0% accuracy after a constant number of time steps. Directly adapting to the final target domain also gets 0% accuracy. Gradual self-training does better, letting us bound the error after $T$ steps: $\mathrm{err}_T \le e^{cT}\left(\alpha_0 + O(1/\sqrt{n})\right)$, where $\alpha_0$ is the error of the classifier on the source domain, and $n$ is the number of unlabeled examples in each intermediate domain. While this bound is exponential in $T$, it is non-vacuous for small $\alpha_0$, and we show that it is tight for gradual self-training.

In the second setting, stronger distributional assumptions allow us to do better: we assume that $P(X \mid Y = y)$ is a $d$-dimensional isotropic Gaussian for each $y$. Here, we show that if we begin with a classifier that is nearly Bayes optimal for the initial distribution, we can recover a classifier that is Bayes optimal for the target distribution with infinite unlabeled data. This is an idealized setting to understand what properties of the data might allow self-training to do better than the exponential bound.

Our theory leads to practical insights, showing that regularization—even in the context of infinite data—and label sharpening are essential for gradual self-training. Without regularization, the accuracy of gradual self-training drops from 84% to 77% on Portraits and 88% to 46% on rotating MNIST. Even when we self-train with more examples, the performance gap between regularized and unregularized models stays the same—unlike in supervised learning where the benefit of regularization diminishes as we get more examples.

Finally, our theory suggests that the gradual shift structure helps when the shift is small in Wasserstein-infinity distance as opposed to other distance metrics like the KL-divergence. For example, one way to interpolate between the source and target domains is to gradually introduce more images from the target, but this shift is large in Wasserstein-infinity distance—we see experimentally that gradual self-training does not help in this setting. We hope this gives practitioners some insight into when gradual self-training can work.

2 Setup

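Gradually shifting distributions: Consider a binary classification task of predicting labels $y \in \{-1, 1\}$ from input features $x \in \mathbb{R}^d$. We have joint distributions over the inputs and labels, $\mathbb{R}^d \times \{-1, 1\}$: $P_0, P_1, \ldots, P_T$, where $P_0$ is the source domain, $P_T$ is the target domain, and $P_1, \ldots, P_{T-1}$ are intermediate domains. We assume the shift is gradual: for some $\epsilon > 0$, $\rho(P_t, P_{t+1}) < \epsilon$ for all $0 \le t < T$, where $\rho(P, Q)$ is some distance function between distributions $P$ and $Q$. We have $n_0$ labeled examples $S_0 = \{(x_i^{(0)}, y_i^{(0)})\}_{i=1}^{n_0}$ sampled independently from the source $P_0$ and $n$ unlabeled examples $S_t = \{x_i^{(t)}\}_{i=1}^{n}$ sampled independently from $P_t$ for each $1 \le t \le T$.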

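Models and objectives: We have a model family $M_\Theta$, where a model $M_\theta : \mathbb{R}^d \to \mathbb{R}$, for each $\theta \in \Theta$, outputs a score representing its confidence that the label $y$ is 1 for the given example. The model’s prediction for an input $x$ is $\mathrm{sign}(M_\theta(x))$, where $\mathrm{sign}(r) = 1$ if $r \ge 0$ and $\mathrm{sign}(r) = -1$ if $r < 0$. We evaluate models on the fraction of times they make a wrong prediction, also known as the 0-1 loss:

$$\mathrm{Err}(\theta, P) = \mathbb{E}_{X, Y \sim P}\left[\mathbb{1}\{\mathrm{sign}(M_\theta(X)) \ne Y\}\right] \tag{1}$$

The goal is to find a classifier $\theta$ that gets high accuracy on the target domain $P_T$, that is, low $\mathrm{Err}(\theta, P_T)$. In an online setting we may care about the accuracy at the current $P_t$ for every time $t$, and our analysis works in this setting as well.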

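Baseline methods: We select a loss function $\ell : \mathbb{R} \times \{-1, 1\} \to \mathbb{R}^{+}$ which takes a prediction and label, and outputs a non-negative loss value, and we begin by training a source model $\theta_0$ that minimizes the loss on labeled data in the source domain:

$$\theta_0 = \arg\min_{\theta \in \Theta} \frac{1}{n_0} \sum_{(x_i, y_i) \in S_0} \ell(M_\theta(x_i), y_i) \tag{2}$$

The non-adaptive baseline is to use $\theta_0$ on the target domain, which incurs error $\mathrm{Err}(\theta_0, P_T)$. Self-training uses unlabeled data to adapt a model. Given a model $\theta$ and unlabeled data $S$, $\mathrm{ST}(\theta, S)$ denotes the output of self-training. Self-training pseudolabels each example in $S$ using $M_\theta$, and then selects a new model $\theta'$ that minimizes the loss on this pseudolabeled dataset. Formally,

$$\mathrm{ST}(\theta, S) = \arg\min_{\theta' \in \Theta} \frac{1}{|S|} \sum_{x_i \in S} \ell\left(M_{\theta'}(x_i), \mathrm{sign}(M_\theta(x_i))\right) \tag{3}$$

Here, self-training uses “hard” labels: we pseudolabel examples as either $-1$ or $1$, based on the output of the classifier, instead of a probabilistic label based on the model’s confidence. We refer to this as label sharpening. In our theoretical analysis, we sometimes want to describe the behavior of self-training when run on infinite unlabeled data from a probability distribution $P$:

$$\mathrm{ST}(\theta, P) = \arg\min_{\theta' \in \Theta} \mathbb{E}_{X \sim P}\left[\ell\left(M_{\theta'}(X), \mathrm{sign}(M_\theta(X))\right)\right] \tag{4}$$

The direct adaptation to target baseline takes the source model $\theta_0$ and self-trains on the target data $S_T$, and is denoted by $\mathrm{ST}(\theta_0, S_T)$. Prior work often chooses to repeat this process of self-training on the target $k$ times, which we denote by $\mathrm{ST}_k(\theta_0, S_T)$.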

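Gradual self-training: In gradual self-training, we self-train on the finite unlabeled examples from each domain successively. That is, for $i \ge 1$, we set:

$$\theta_i = \mathrm{ST}(\theta_{i-1}, S_i) \tag{5}$$

$\mathrm{ST}(\theta_0, (S_1, \ldots, S_T)) = \theta_T$ is the output of gradual self-training, which we evaluate on the target distribution $P_T$.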
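The procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the model is a norm-constrained linear classifier fit by projected subgradient descent on the hinge loss (swapped in here for convenience because it is convex), and the names `fit_linear`, `self_train`, `gradual_self_train`, `make_domain`, and the toy rotating-Gaussian data are all hypothetical.

```python
import numpy as np

def fit_linear(X, y, R=10.0, steps=500, lr=0.1):
    """Projected subgradient descent on the hinge loss over {||w||_2 <= R}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        margins = y * (X @ w + b)
        active = margins < 1.0                 # points violating the margin
        grad_w = -(y[active, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w, b = w - lr * grad_w, b - lr * grad_b
        norm = np.linalg.norm(w)
        if norm > R:                           # project back onto the norm ball
            w *= R / norm
    return w, b

def predict(model, X):
    w, b = model
    return np.where(X @ w + b >= 0, 1, -1)

def self_train(model, X_unlabeled, **fit_kwargs):
    """One step: hard-pseudolabel with the current model, then refit."""
    pseudo = predict(model, X_unlabeled)
    return fit_linear(X_unlabeled, pseudo, **fit_kwargs)

def gradual_self_train(model, domains, **fit_kwargs):
    """Apply self_train to each unlabeled domain in order."""
    for X_t in domains:
        model = self_train(model, X_t, **fit_kwargs)
    return model

def make_domain(angle_deg, n, rng, radius=3.0, noise=0.5):
    """Hypothetical toy domain: class means at +/- radius * (cos a, sin a)."""
    a = np.deg2rad(angle_deg)
    mu = radius * np.array([np.cos(a), np.sin(a)])
    y = np.repeat([1, -1], n // 2)
    X = y[:, None] * mu + noise * rng.standard_normal((n, 2))
    return X, y

rng = np.random.default_rng(0)
X0, y0 = make_domain(0, 200, rng)
source = fit_linear(X0, y0)                    # supervised source model
domains = [make_domain(a, 200, rng) for a in range(15, 151, 15)]
X_T, y_T = domains[-1]                         # target: means rotated by 150 degrees

direct = self_train(source, X_T)               # direct adaptation to the target
gradual = gradual_self_train(source, [X for X, _ in domains])

def accuracy(model):
    return float(np.mean(predict(model, X_T) == y_T))

print("source:", accuracy(source), "direct:", accuracy(direct),
      "gradual:", accuracy(gradual))
```

In this toy setup the labels of the intermediate and target domains are used only for evaluation; the adaptation steps see inputs alone.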

3 Theory for the margin setting

We show that gradual self-training does better than directly adapting to the target, where we assume that at each time step there exists some linear classifier—which can be different at each step—that can classify most of the data correctly with a margin (a standard assumption in learning theory), and that the shifts are small. Our main result (Theorem 3.2) bounds the error of gradual self-training. We show that our analysis is tight for gradual self-training (Example 3.4), and explain why regularization, label sharpening, and the ramp loss, are key to our bounds. Proofs are in Appendix A.

3.1 Assumptions

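Models and losses: We consider regularized linear models that have weights with bounded $\ell_2$ norm: $\Theta_R = \{(w, b) : w \in \mathbb{R}^d, b \in \mathbb{R}, \|w\|_2 \le R\}$ for some fixed $R > 0$. Given $(w, b) \in \Theta_R$, the model’s output is $M_{w,b}(x) = w^\top x + b$.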

We consider margin loss functions such as the hinge and ramp losses. Intuitively, a margin loss encourages a model to classify points correctly and confidently, by keeping correctly classified points far from the decision boundary. We consider the hinge function $h$ and ramp function $r$:

$$h(m) = \max(1 - m, 0) \tag{6}$$
$$r(m) = \min(h(m), 1) \tag{7}$$

The ramp loss is $\ell_r(\hat{y}, y) = r(y \hat{y})$, where $\hat{y} \in \mathbb{R}$ is a model’s prediction, and $y \in \{-1, 1\}$ is the true label. The hinge loss is the standard way to enforce margin, but the ramp loss is more robust towards outliers because it is bounded above, so no single point contributes too much to the loss. We will see that the ramp loss is key to the theoretical guarantees for gradual self-training because of its robustness. We denote the population ramp loss as:

$$L_r(\theta, P) = \mathbb{E}_{X, Y \sim P}\left[\ell_r(M_\theta(X), Y)\right] \tag{8}$$

Given a finite sample $S$, the empirical loss is:

$$L_r(\theta, S) = \frac{1}{|S|} \sum_{(x, y) \in S} \ell_r(M_\theta(x), y) \tag{9}$$
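For concreteness, the two margin functions and the empirical ramp loss of a linear model can be written as below. This sketch assumes the standard forms, hinge h(m) = max(1 - m, 0) and ramp r(m) = min(h(m), 1); the function names are illustrative.

```python
import numpy as np

def hinge(m):
    """Hinge function h(m) = max(1 - m, 0); unbounded as m -> -infinity."""
    return np.maximum(1.0 - m, 0.0)

def ramp(m):
    """Ramp function r(m) = min(h(m), 1); always in [0, 1]."""
    return np.minimum(hinge(m), 1.0)

def empirical_ramp_loss(w, b, X, y):
    """Average ramp loss of the linear model x -> w.x + b on labeled data."""
    margins = y * (X @ w + b)
    return float(np.mean(ramp(margins)))

# A badly misclassified outlier dominates the hinge loss but not the ramp loss.
margins = np.array([-100.0, 0.5, 2.0])
print(hinge(margins), ramp(margins))
```

The outlier with margin -100 contributes 101 to the hinge loss but at most 1 to the ramp loss, which is the robustness property the text relies on.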


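Distributional distance: Our notion of distance is $W_\infty$, the Wasserstein-infinity distance. Intuitively, $W_\infty$ moves points from distribution $P$ to $Q$ by distance at most $\epsilon$ to match the distributions. For ease of exposition we consider the Monge form of $W_\infty$, although the results can be extended to the Kantorovich formulation as well. Formally, given probability measures $P, Q$ on $\mathbb{R}^d$:

$$W_\infty(P, Q) = \inf\left\{ \sup_{x \in \mathbb{R}^d} \|f(x) - x\|_2 \;:\; f : \mathbb{R}^d \to \mathbb{R}^d,\ f_{\#} P = Q \right\} \tag{10}$$

As usual, $\#$ denotes the push-forward of a measure, that is, for every set $A \subseteq \mathbb{R}^d$, $f_{\#} P(A) = P(f^{-1}(A))$.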
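To build intuition for how the Wasserstein-infinity distance differs from total variation, the sketch below computes both for small one-dimensional empirical samples. It assumes two samples of equal size, in which case matching the sorted points is an optimal transport map in one dimension; `w_inf_1d` and `tv_discrete` are illustrative helper names, not part of the paper.

```python
import numpy as np

def w_inf_1d(xs, ys):
    """W-infinity between two equal-size 1-D samples via the sorted matching."""
    xs, ys = np.sort(np.asarray(xs, float)), np.sort(np.asarray(ys, float))
    if xs.shape != ys.shape:
        raise ValueError("samples must have the same size")
    return float(np.max(np.abs(xs - ys)))

def tv_discrete(xs, ys):
    """Total variation between the empirical distributions of two samples."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    support = np.union1d(xs, ys)
    p = np.array([np.mean(xs == v) for v in support])
    q = np.array([np.mean(ys == v) for v in support])
    return 0.5 * float(np.abs(p - q).sum())

source = np.zeros(10)
translated = source + 0.1                      # every point moves a little
mixture = np.array([0.0] * 9 + [5.0])          # one point jumps far away

print(w_inf_1d(source, translated), tv_discrete(source, translated))
print(w_inf_1d(source, mixture), tv_discrete(source, mixture))
```

The small translation has a small Wasserstein-infinity distance but maximal total variation, while replacing a small fraction of points with far-away ones has small total variation but a large Wasserstein-infinity distance.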

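In our case, we require that the conditional distributions do not shift too much. Given joint probability measures $P, Q$ on the inputs and labels $\mathbb{R}^d \times \{-1, 1\}$, the distance is:

$$\rho(P, Q) = \max\left(W_\infty(P_{X \mid Y = 1}, Q_{X \mid Y = 1}),\ W_\infty(P_{X \mid Y = -1}, Q_{X \mid Y = -1})\right) \tag{11}$$

$\alpha^*$-separation assumption: Assume every domain admits a classifier with low loss $\alpha^*$, that is, there exists $\alpha^* \ge 0$ such that for every domain $P_t$, there exists some $\theta_t \in \Theta_R$ with $L_r(\theta_t, P_t) \le \alpha^*$.

Gradual shift assumption: For some $\rho < 1/R$, assume $\rho(P_t, P_{t+1}) \le \rho$ for every consecutive domain, where $1/R$ is the regularization strength of the model class $\Theta_R$. $1/R$ can be interpreted as the geometric margin (distance from decision boundary to data) the model is trying to enforce.

Bounded data assumption: When dealing with finite samples we need a standard regularity condition: we say that $P$ satisfies the bounded data assumption if the data is not too large on average: $\mathbb{E}_{X \sim P}\left[\|X\|_2^2\right] \le B^2$ where $B > 0$.

No label shift assumption: Assume that the fraction of $Y = 1$ labels does not change: $P_t(Y)$ is the same for all $t$.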

3.2 Domain shift: baselines fail

While the distribution shift from $P_t$ to $P_{t+1}$ is small, the distribution shift from the source $P_0$ to the target $P_T$ can be large, as visualized in Figure 2. A classifier that gets 100% accuracy on $P_0$ might classify every example wrong on $P_T$, even if $T \ge 2$. In this case, directly adapting to $P_T$ would not help. The following example formalizes this:

Example 3.1.

Even under the $\alpha^*$-separation, no label shift, gradual shift, and bounded data assumptions, there exist distributions $P_0, P_1, P_2$ and a source model $\theta \in \Theta_R$ that gets $0$ loss on the source ($L_r(\theta, P_0) = 0$), but high loss on the target: $L_r(\theta, P_2) = 1$. Self-training directly on the target does not help: $L_r(\mathrm{ST}(\theta, P_2), P_2) = 1$. This holds true even if every domain is separable, so $\alpha^* = 0$.

Other methods: Our analysis focuses on self-training, but other bounds do not apply in this setting because they either assume that the density ratio between the target and source exists and is not too small [13], or that the source and target are similar enough that we cannot discriminate between them [14].

3.3 Gradual self-training improves error

We show that gradual self-training helps over direct adaptation. For intuition, consider a simple example where $\alpha^* = 0$ and $\theta_0$ classifies every example in $P_0$ correctly with geometric margin $\gamma = 1/R$. If each point shifts by distance $< \gamma$, $\theta_0$ gets every example in the new domain $P_1$ correct. If we had infinite unlabeled data from $P_1$, we can learn a model $\theta_1$ that classifies every example in the new domain $P_1$ correctly with margin $\gamma$ since $\alpha^* = 0$. Repeating the process for $P_2, \ldots, P_T$, we get every example in $P_T$ correct.
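The claim that a shift smaller than the margin preserves correctness follows from the Cauchy-Schwarz inequality. As a worked step, assume a linear model $x \mapsto w^\top x + b$ with $\|w\|_2 \le R$, a point $(x, y)$ classified with functional margin at least 1 (so geometric margin at least $1/R$), and a shifted point $x'$ with $\|x' - x\|_2 \le \rho < 1/R$; these symbols are stated here as assumptions of the sketch:

```latex
\begin{aligned}
y\,(w^\top x' + b)
  &= y\,(w^\top x + b) + y\, w^\top (x' - x) \\
  &\ge 1 - \|w\|_2 \, \|x' - x\|_2 \\
  &\ge 1 - R\rho \;>\; 0 .
\end{aligned}
```

So the old classifier still assigns the correct sign to every shifted point, and the pseudolabels on the new domain are all correct.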

But what happens when we start with a model that has some error, for example because the data cannot be perfectly separated, and have only finite unlabeled samples? We show that self-training still does better than adapting to the target domain directly, or using the non-adaptive source classifier.

The first main result of the paper says that if we have a model that gets low loss and the distribution shifts slightly, self-training gives us a model that does not do too badly on the new distribution.

Theorem 3.2.

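Given $P, Q$ with $\rho(P, Q) = \rho < 1/R$ and marginals on $Y$ are the same so $P(Y) = Q(Y)$. Suppose $P, Q$ satisfy the bounded data assumption, and we have initial model $\theta$, and $n$ unlabeled samples $S$ from $Q$, and we set $\theta' = \mathrm{ST}(\theta, S)$. Then with probability at least $1 - \delta$ over the sampling of $S$, letting $\alpha^* = \min_{\theta^* \in \Theta_R} L_r(\theta^*, Q)$:

$$L_r(\theta', Q) \le \frac{2}{1 - \rho R} L_r(\theta, P) + \alpha^* + \frac{4BR + \sqrt{2 \log(2/\delta)}}{\sqrt{n}} \tag{12}$$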


The proof of this result is in Appendix A, but we give a high level sketch here. There exists some classifier that gets loss $\alpha^*$ on $Q$, so if we had access to $n$ labeled examples from $Q$ then empirical risk minimization gives us a classifier that is accurate on the population: from a Rademacher complexity argument we get a classifier $\theta'$ with loss at most $\alpha^* + O(1/\sqrt{n})$, the second and third terms on the RHS of the bound.

Since we only have unlabeled examples from $Q$, self-training uses $\theta$ to pseudolabel these examples and then trains on this generated dataset. Now, if the distribution shift $\rho$ is small relative to the geometric margin $\gamma = 1/R$, then we can show that the original model labels most examples in the new distribution correctly, that is, $\mathrm{Err}(\theta, Q)$ is small if $L_r(\theta, P)$ is small. Finally, if most examples are labeled correctly we show that because there exists some classifier with low margin loss, self-training will also learn a classifier $\theta'$ with low margin loss $L_r(\theta', Q)$, which completes the proof.

We apply this argument inductively to show that after $T$ time steps, the error of gradual self-training is $\lesssim \exp(cT)\,\alpha_0$ for some constant $c$, if the original error is $\alpha_0$.
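The inductive argument can be illustrated numerically. The sketch below assumes each self-training step obeys a bound of the form a' <= beta * a + c, that is, a constant factor on the previous loss plus an additive term, as in the proof sketch; the values of `beta`, `c`, and `a0` are hypothetical and chosen only to show the exponential growth.

```python
def unrolled_bound(a0, beta, c, T):
    """Iterate the per-step bound a_{t+1} = beta * a_t + c for T steps."""
    a = a0
    for _ in range(T):
        a = beta * a + c
    return a

def closed_form(a0, beta, c, T):
    """Same recursion summed as a geometric series (requires beta != 1)."""
    return beta**T * a0 + c * (beta**T - 1) / (beta - 1)

a0, beta, c = 0.001, 2.5, 0.001   # hypothetical values, not from the paper
for T in (1, 3, 5, 8):
    print(T, unrolled_bound(a0, beta, c, T), closed_form(a0, beta, c, T))
```

With these illustrative numbers the bound stays below 1 for a handful of steps and exceeds 1 soon after, which is why the guarantee is informative only when the initial loss is small and the number of steps is moderate (assuming a loss bounded by 1, such as the ramp loss).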

Corollary 3.3.

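Under the $\alpha^*$-separation, no label shift, gradual shift, and bounded data assumptions, if the source model $\theta_0$ has low loss $\alpha_0 \ge \alpha^*$ on $P_0$ (i.e. $L_r(\theta_0, P_0) \le \alpha_0$) and $\theta$ is the result of gradual self-training: $\theta = \mathrm{ST}(\theta_0, (S_1, \ldots, S_T))$, letting $\beta = \frac{2}{1 - \rho R}$:

$$L_r(\theta, P_T) \le \beta^{T+1}\left(\alpha_0 + \frac{4BR + \sqrt{2 \log(2T/\delta)}}{\sqrt{n}}\right) \tag{13}$$

Corollary 3.3 says that the gradual structure allows some control of the error unlike direct adaptation where the accuracy on the target domain can be 0% if $T \ge 2$. Note that if the classes are separable and we have infinite data, then gradual self-training maintains 0 error.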

Our next example shows that our analysis for gradual self-training in this setting is tight: if we start with a model with loss $\alpha_0$, then the error can in fact increase exponentially even with infinite unlabeled examples. Intuitively, at each step of self-training the loss can increase by a constant factor, which leads to an exponential growth in the error.

Example 3.4.

Even under the -separation, no label shift, gradual shift, and bounded data assumptions, given , for every there exists distributions , and with , but if then . Note that is always in .

This suggests that if we want sub-exponential bounds we either need to make additional assumptions on the data distributions, or devise alternative algorithms to achieve better bounds (which we believe is unlikely).

3.4 Essential ingredients for gradual self-training

In this section, we explain why regularization, label sharpening, and the ramp loss are essential to bounding the error of gradual self-training (Theorem 3.2).

Regularization: Without regularization there is no incentive for the model to change when self-training: if we self-train without regularization, an optimal thing to do is to output the original model. The intuition is that since the model $\theta$ is used to pseudolabel examples, $\theta$ gets every pseudolabeled example correct. The scaled classifier $\alpha\theta$ for large $\alpha$ then gets optimal loss, but $\alpha\theta$ and $\theta$ make the same predictions for every example. We use $\mathrm{ST}'(\theta, S)$ to denote the set of possible $\theta'$ that minimize the loss on the pseudolabeled distribution (Equation (3)):

Example 3.5.

Given a model $\theta \in \mathbb{R}^d$ and unlabeled examples $S$ where for all $x \in S$, $M_\theta(x) \ne 0$, there exists $C \ge 0$ such that for all $\alpha \ge C$, $\alpha\theta \in \mathrm{ST}'(\theta, S)$.

More specific to our setting, our bounds require regularized models because regularized models classify the data correctly with a margin, so even after a mild distribution shift we get most new examples correct. Note that in traditional supervised learning, regularization is usually required when we have few examples for better generalization to the population, whereas in our setting regularization is important for maintaining a margin even with infinite data.
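The scaling argument is easy to check numerically. The sketch below assumes a bias-free linear model and the hinge loss on hard pseudolabels; the random data and the variable names are hypothetical. Rescaling the weights drives the loss on the model's own pseudolabels to zero without changing a single prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))              # unlabeled examples
w = rng.standard_normal(5)                     # current (unregularized) model
pseudo = np.where(X @ w >= 0, 1, -1)           # hard pseudolabels from w itself

def hinge_loss(w_new):
    """Average hinge loss of w_new on the pseudolabeled examples."""
    return float(np.mean(np.maximum(1.0 - pseudo * (X @ w_new), 0.0)))

# Scale so that every pseudolabeled point has margin of about 2.
scale = 2.0 / np.min(np.abs(X @ w))
w_scaled = scale * w

same_predictions = np.array_equal(np.where(X @ w_scaled >= 0, 1, -1), pseudo)
print(hinge_loss(w), hinge_loss(w_scaled), same_predictions)
```

Without a norm constraint, the scaled copy of the old model is already a minimizer, so self-training has no reason to move the decision boundary.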

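Label sharpening: When self-training, we pseudolabel examples as $-1$ or $1$, based on the output of the classifier. Prior work sometimes uses “soft” labels [9], where for each example they assign a probability of the label being $-1$ or $1$, and train using a logistic loss. The loss on the soft-pseudolabeled distribution is defined as:

$$L_{\sigma, \theta}(\theta') = \mathbb{E}_{X \sim P}\left[\ell_\sigma\left(\sigma(M_{\theta'}(X)), \sigma(M_\theta(X))\right)\right] \tag{14}$$

where $\sigma$ is the sigmoid function, and $\ell_\sigma$ is the log loss:

$$\ell_\sigma(p, p') = -p' \log p - (1 - p') \log(1 - p) \tag{15}$$

Self-training then picks $\theta' \in \Theta$ minimizing $L_{\sigma, \theta}(\theta')$. A simple example shows that this form of self-training may never update the parameters because $\theta$ minimizes $L_{\sigma, \theta}$:

Example 3.6.

For all $\theta \in \Theta$, $\theta$ is a minimizer of $L_{\sigma, \theta}$, that is, for all $\theta' \in \Theta$, $L_{\sigma, \theta}(\theta) \le L_{\sigma, \theta}(\theta')$.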

This suggests that we “sharpen” the soft labels to encourage the model to update its parameters. Note that this is true even on finite data: set $P$ to be the empirical distribution.
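A small numerical check makes the soft-label fixed point concrete. The sketch assumes a bias-free linear model with sigmoid outputs and the cross-entropy (log) loss against the old model's probabilities; all names and the random data are hypothetical. The gradient of the soft-label loss vanishes at the old parameters, and perturbing them never lowers the loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_label_loss(theta_new, theta_old, X):
    """Cross-entropy of the new model against soft labels from the old model."""
    p = sigmoid(X @ theta_old)                 # soft pseudolabels
    q = sigmoid(X @ theta_new)                 # new model's probabilities
    return float(-np.mean(p * np.log(q) + (1 - p) * np.log(1 - q)))

def soft_label_grad(theta_new, theta_old, X):
    """Gradient of soft_label_loss with respect to theta_new."""
    p = sigmoid(X @ theta_old)
    q = sigmoid(X @ theta_new)
    return X.T @ (q - p) / len(X)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
theta = rng.standard_normal(3)

grad_at_theta = soft_label_grad(theta, theta, X)
base = soft_label_loss(theta, theta, X)
perturbed = [soft_label_loss(theta + 0.5 * rng.standard_normal(3), theta, X)
             for _ in range(20)]
print(np.linalg.norm(grad_at_theta), base, min(perturbed))
```

Because the soft labels are exactly the old model's own probabilities, the old model already fits them as well as any model can, so a gradient-based learner initialized there does not move.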

Ramp versus hinge loss: We use the ramp loss, but does the more popular hinge loss work? Unfortunately, the next example shows that we cannot control the error of gradual self-training with the hinge loss even if we had infinite examples, so the ramp loss is important for Theorem 3.2.

Example 3.7.

Even under the -separation, no label shift, and gradual shift assumptions, given , there exists distributions and with , but if then ( gets every example in wrong), where we use the hinge loss in self-training.

We only analyzed the statistical effects here—the hinge loss tends to work better in practice because it is much easier to optimize and is convex for linear models.

3.5 Self-training without domain shift

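Example 3.4 showed that when the distribution shifts, the loss of gradual self-training can grow exponentially (though the non-adaptive baseline has unbounded error). Here we show that if we have no distribution shift, the error can only grow linearly: if $P_0 = \cdots = P_T$, given a classifier with loss $\alpha_0$, if we do gradual self-training the loss is at most $\alpha_0 (T + 1)$.

Proposition 3.8.

Given $\alpha_0 > 0$, distributions $P_0 = \cdots = P_T$, and model $\theta_0 \in \Theta_R$ with $L_r(\theta_0, P_0) \le \alpha_0$, $L_r(\theta', P_T) \le \alpha_0 (T + 1)$, where $\theta' = \mathrm{ST}(\theta_0, (P_1, \ldots, P_T))$.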

In Appendix A, we show that self-training can indeed hurt without domain shift: given a classifier with loss on , self-training on can increase the classifier’s loss on to , but here the non-adaptive baseline has error .

4 Theory for the Gaussian setting

In this section we study an idealized Gaussian setting to understand conditions under which self-training can have better than exponential error bounds: we show that if we begin with a good classifier, the distribution shifts are not too large, and we have infinite unlabeled data, then gradual self-training maintains a good classifier.

4.1 Setting

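We assume $P_t(X \mid Y = y)$ is an isotropic Gaussian in $d$ dimensions for each $y \in \{-1, 1\}$. We can shift the data to have mean $0$, so we suppose:

$$P_t(X \mid Y = y) = \mathcal{N}(y \mu_t, \sigma_t^2 I)$$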


Where and for each . As usual, we assume the shifts are gradual: for some , . We assume that the means of the two classes do not get closer than the shift, or else it would be impossible to distinguish between no shift, and the distributions of the two classes swapping: so for all . We assume infinite unlabeled data (access to ) in our analysis.

Given labeled data in the source, we use the objective:


For unlabeled data, self-training performs descent steps on an underlying objective function [15], which we focus on:


We assume is a continuous, non-increasing function which is strictly decreasing on : these are regularity conditions which the hinge, ramp, and logistic losses satisfy. If then  [15].

The algorithm we analyze begins by choosing from labeled data in , and then updates the parameters with unlabeled data from for :


Note that we do not show that self-training actually converges to the constrained minimum of in Equation (19) and prior work only shows that self-training descends on —we leave this optimization analysis to future work.

4.2 Analysis

Let where . Note that minimizes the 0-1 error on . Our main theorem says that if we start with a regularized classifier that is near , which we can learn from labeled data, and the distribution shifts are not too large, then we recover the optimal . The key challenge is that the unlabeled loss in dimensions is non-convex, with multiple local minima, so directly minimizing does not guarantee a solution that minimizes the labeled loss .

Theorem 4.1.

Assuming the Gaussian setting, if , then we recover .

Proving this reduces to proving the single-step case. At each step , if we have a classifier that was close to , then we will recover . We give intuition here and the formal proof in Appendix B.

We first show that if changes by a small amount, the optimal parameters (for the labeled loss) does not change too much. Then since is close to , is not too far away from . The key step in our argument is showing that the unique minimum of the unlabeled loss in the neighborhood of , is —looking for a minimum nearby is important because if we deviate too far we might select other “bad” minima. We consider arbitrary near and construct a pairing of points in , using a convexity argument to show that contributes more to the loss of than .

5 Experiments

Our theory leads to practical insights—we show that regularization and label sharpening are important for gradual self-training, that leveraging the gradual shift structure improves target accuracy, and give intuition for when the gradual shift assumption may not help. We run experiments on three datasets (see Appendix C for more details):

Gaussian: Synthetic dataset where the distribution for each of two classes is a -dimensional Gaussian, where . The means and covariances of each class vary over time. The model gets labeled samples from the source domain, and unlabeled samples from each of intermediate domains. This dataset resembles our Gaussian setting but the covariance matrices are not isotropic, and the number of labeled and unlabeled samples is finite and on the order of the dimension .

Rotating MNIST: Rotating MNIST is a semi-synthetic dataset where we rotate each MNIST image by an angle between 0 and 60 degrees. We split the 50,000 MNIST training set images into a source domain (images rotated between 0 and 5 degrees), intermediate domain (rotations between 5 and 60 degrees), and a target domain (rotations between 55 degrees and 60 degrees). Note that each image is seen at exactly one angle, so the training procedure cannot track a single image across different angles.

Portraits: A real dataset comprising photos of high school seniors across years [10]. The model’s goal is to classify gender. We split the data into a source domain (first 2000 images), intermediate domain (next 14000 images), and target domain (next 2000 images).

5.1 Does the gradual shift assumption help?

Our goal is to see if adapting to the gradual shift sequentially helps compared to directly adapting to the target. We evaluate four methods:

Source: simply train a classifier on the labeled source examples.
Target self-train: repeatedly self-train on the unlabeled target examples, ignoring the intermediate examples.
All self-train: pool all the unlabeled examples from the intermediate and target domains, and repeatedly self-train on this pooled dataset to adapt the initial source classifier.
Gradual self-train: sequentially use self-training on unlabeled data in each successive intermediate domain, and finally self-train on unlabeled data on the target domain, to adapt the initial source classifier.

For the Gaussian and MNIST datasets, we ensured that the target self-train method sees as many unlabeled target examples as gradual self-train sees across all the intermediate examples. Since Portraits is a real dataset we cannot synthesize more examples from the target, so target self-train uses fewer unlabeled examples here.

For rotating MNIST and Portraits we used a 3-layer convolutional network with dropout and batchnorm on the last layer, that was able to achieve accuracy on held out examples in the source domain. For the Gaussian dataset we used a logistic regression classifier with regularization. For each step of self-training, we filter out the 10% of images where the model’s prediction was least confident; Appendix C shows similar findings without this filtering. To account for variance in initialization and optimization, we ran each method 5 times and give confidence intervals. More experimental details are in Appendix C.
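The confidence filtering step can be sketched as follows. The text does not say how confidence is measured, so this example assumes a binary classifier with real-valued scores and uses the absolute score as the confidence; `filter_least_confident` is a hypothetical helper name.

```python
import numpy as np

def filter_least_confident(X, scores, drop_frac=0.10):
    """Drop the drop_frac least confident examples, then hard-pseudolabel the rest.

    Confidence is taken to be |score| (an assumption of this sketch).
    """
    scores = np.asarray(scores, float)
    n_drop = int(round(drop_frac * len(scores)))
    order = np.argsort(np.abs(scores))         # least confident first
    keep = np.sort(order[n_drop:])             # keep the rest, in original order
    pseudolabels = np.where(scores[keep] >= 0, 1, -1)
    return X[keep], pseudolabels, keep

X = np.arange(10).reshape(10, 1)
scores = np.array([0.1, -2.0, 3.0, -0.05, 1.0, -1.5, 0.7, 2.2, -0.9, 0.4])
X_kept, labels, kept_idx = filter_least_confident(X, scores)
print(kept_idx, labels)
```

The retained examples and their hard pseudolabels are then what a self-training step would fit on.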

            Gaussian     Rot MNIST    Portraits
Source      47.7 ± 0.3   31.9 ± 1.7   75.3 ± 1.6
Target ST   49.6 ± 0.0   33.0 ± 2.2   76.9 ± 2.1
All ST      92.5 ± 0.1   38.0 ± 1.6   78.9 ± 3.0
Gradual ST  98.8 ± 0.0   87.9 ± 1.2   83.8 ± 0.8
Table 1: Classification accuracies for gradual self-train (ST) and baselines on 3 datasets, with confidence intervals for the mean over 5 runs. Gradual ST does better than self-training directly on the target or self-training on all the unlabeled data pooled together.

Table 1 shows that leveraging the gradual structure leads to improvements over the baselines on all three datasets.

5.2 Important ingredients for gradual self-training

Our theory suggests that regularization and label sharpening are important for gradual self-training, because without regularization and label sharpening there is no incentive for the model to change (Section 3.4). However, prior work suggests that overparameterized neural networks trained with stochastic gradient methods have strong implicit regularization [16, 17]: in the supervised setting they perform well without explicit regularization even though the number of parameters is much larger than the number of data points. Is this implicit regularization enough for gradual self-training?

In our experiments, we see that even without explicit regularization, or with ‘soft’ probabilistic labels, gradual self-training does slightly better than the non-adaptive source classifier, suggesting that this implicit regularization may have some effect. However, explicit regularization and ‘hard’ labeling give a much larger accuracy boost.

Regularization is important: We repeat the same experiment as Section 5.1, comparing gradual self-training with or without regularization—that is, disabling dropout and batchnorm [18] in the neural network experiments. In both cases, we first train an unregularized model on labeled examples in the source domain. Then, we either turn on regularization during self-training, or keep the model unregularized. We control the original model to be the same in both cases to see if regularization helps in the self-training process, as opposed to in learning a better supervised classifier. Table 2 shows that accuracies are significantly better with regularization, even though unregularized performance is still better than the non-adaptive source classifier.

Soft labeling hurts: We ran the same experiment as Section 5.1, comparing gradual self-training with hard labeling versus using probabilistic labels output by the model. Table 2 shows that accuracies are better with hard labels.

             Gaussian     Rot MNIST    Portraits
Soft Labels  90.5 ± 1.9   44.1 ± 2.3   80.1 ± 1.8
No Reg       84.6 ± 1.1   45.8 ± 2.5   76.5 ± 1.0
Gradual ST   99.3 ± 0.0   83.8 ± 2.5   82.6 ± 0.8
Table 2: Classification accuracies for gradual self-train with explicit regularization and hard labels (Gradual ST), without regularization but with hard labels (No Reg), and with regularization but with soft labels (Soft Labels). Gradual self-train does best with explicit regularization and hard labels, as our theory suggests, even for neural networks with implicit regularization.

Regularization is still important with more data: In supervised learning, the importance of regularization diminishes as we have more training examples—if we had access to infinite data (the population), we don’t need regularization. On the other hand, for gradual domain adaptation, the theory says regularization is needed to adapt to the dataset shift even with infinite data, and predicts that regularization remains important even if we increase the sample size.

To test this hypothesis, we construct a rotating MNIST dataset where we increase the sample sizes. The source domain consists of $N$ images from MNIST, with $N \in \{2000, 5000, 20000\}$ as in the columns of Table 3. Each subsequent domain then consists of these same $N$ images, rotated by a gradually increasing angle. The goal is to get high accuracy on the final domain: these images rotated by 60 degrees. The model does not have to generalize to unseen images, only to seen images at different angles. We compare using regularization versus not using regularization during gradual self-training.

Method    N=2000      N=5000      N=20,000
Source    28.3±1.4    29.9±2.5    33.9±2.6
No Reg    55.7±3.9    53.6±4.0    55.1±3.9
Reg       93.1±0.8    91.7±2.4    87.4±3.1
Table 3: Classification accuracies for gradual self-training on rotating MNIST as we vary the number of samples. Unlike in previous experiments, here the same samples are rotated, so the models do not have to generalize to unseen images, but to seen images at different angles. The gap between regularized and unregularized gradual self-training does not shrink much with more data.

Table 3 shows that regularization is still important here, and the gap between regularized and unregularized gradual self-training does not shrink much with more data.

5.3 When does gradual shift help?

Our theory in Section 3 says that gradual self-training works well if the shift between domains is small in Wasserstein-infinity distance, but it may not be enough for the total variation or KL-divergence between successive domains to be small.
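The contrast between gradual and direct self-training under a small per-step shift can be sketched on a toy problem. This is an illustration under stated assumptions, not the paper's experimental setup: the 2-D Gaussian data, the rotation schedule, and the helper names (`make_domain`, `fit_ridge`, `self_train`) are hypothetical, and closed-form ridge regression stands in for the regularized models used in the paper. Accuracy is measured on the same unlabeled target samples used for self-training.

```python
import numpy as np

def make_domain(angle_deg, n, rng):
    """Two Gaussian blobs whose class means are rotated by angle_deg."""
    theta = np.deg2rad(angle_deg)
    mean = 2.0 * np.array([np.cos(theta), np.sin(theta)])
    y = rng.choice([-1, 1], size=n)
    x = y[:, None] * mean + 0.5 * rng.standard_normal((n, 2))
    return x, y

def fit_ridge(x, y, lam=1.0):
    """Closed-form ridge regression on +/-1 targets (no bias term)."""
    d = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(d), x.T @ y)

def predict(w, x):
    """Predicted label is the sign of the linear output."""
    return np.where(x @ w >= 0, 1, -1)

def self_train(w, x, lam=1.0):
    """One self-training step: hard pseudo-label x with w, then refit."""
    return fit_ridge(x, predict(w, x), lam)

rng = np.random.default_rng(0)
angles = np.arange(0, 151, 15)            # 0, 15, ..., 150 degrees
domains = [make_domain(a, 500, rng) for a in angles]

x_src, y_src = domains[0]
x_tgt, y_tgt = domains[-1]
w_src = fit_ridge(x_src, y_src)           # supervised training on the source

# Gradual self-training: adapt through every intermediate domain in order.
w = w_src
for x_t, _ in domains[1:]:
    w = self_train(w, x_t)
gradual_acc = np.mean(predict(w, x_tgt) == y_tgt)

# Direct adaptation: self-train on the target only, starting from the source model.
w_direct = self_train(w_src, x_tgt)
direct_acc = np.mean(predict(w_direct, x_tgt) == y_tgt)
```

In this toy, each 15-degree step leaves almost all pseudo-labels correct, while jumping straight to 150 degrees flips almost all of them, so direct self-training reinforces the wrong labels.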

To test this, we run an experiment on a modified version of the rotating MNIST dataset. We keep the source and target domains the same as before, but change the intermediate domains. In Table 1 we saw that gradual self-training works well if we have intermediate images rotated by gradually increasing rotation angles. Another type of gradual transformation is to gradually introduce more examples rotated by target-domain angles. That is, in the $i$-th domain, one fraction of the examples are MNIST images rotated by source-domain angles, and the remaining fraction are MNIST images rotated by target-domain angles, with the target-domain fraction growing with $i$. Here the total-variation distance between successive domains is small, but intuitively the Wasserstein distance is large because each image that changes undergoes a large rotation.
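The two kinds of gradual shift can be compared numerically in a simplified 1-D stand-in, where each example is summarized only by its rotation angle. This is a sketch under assumptions: the point-mass angles (0 and 60 degrees), the number of domains `k`, and the helper names are hypothetical simplifications of the image experiment. It uses the fact that in one dimension the sorted (monotone) coupling attains the Wasserstein-infinity distance between equal-size samples.

```python
import numpy as np
from collections import Counter

def w_inf_1d(a, b):
    """W-infinity between two equal-size 1-D samples via the sorted coupling."""
    return float(np.max(np.abs(np.sort(a) - np.sort(b))))

def tv_empirical(a, b):
    """Total-variation distance between two empirical discrete distributions."""
    ca, cb = Counter(a.tolist()), Counter(b.tolist())
    support = set(ca) | set(cb)
    return 0.5 * sum(abs(ca[v] / len(a) - cb[v] / len(b)) for v in support)

n, k = 100, 10  # samples per domain, number of steps (both hypothetical)

def translated(t):
    """Every example is rotated a little more at each step."""
    return np.full(n, 60.0 * t / k)

def mixed(t):
    """A growing fraction t/k of examples jumps straight to 60 degrees."""
    m = n * t // k
    return np.concatenate([np.zeros(n - m), np.full(m, 60.0)])

tv_translate = tv_empirical(translated(0), translated(1))   # 1.0: disjoint supports
winf_translate = w_inf_1d(translated(0), translated(1))     # 6.0: small movement
tv_mix = tv_empirical(mixed(0), mixed(1))                   # about 0.1: little mass changes
winf_mix = w_inf_1d(mixed(0), mixed(1))                     # 60.0: some mass moves far
```

The translated sequence has large total variation but small Wasserstein-infinity distance per step, while the mixture sequence has the reverse.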

As the theory suggests, gradual self-training does not outperform self-training directly on the target in this setting, comparing target accuracies over 5 runs. We hope this gives practitioners some insight into not just the strengths of gradual self-training, but also its limitations.

6 Related work

Self-training is a popular method in semi-supervised learning [19, 20] and domain adaptation [21, 22, 23], and is related to entropy minimization [24]. Theory in semi-supervised learning [25, 26, 27] analyzes when unlabeled data can help, but does not show bounds for particular algorithms. Recent work shows that a robust variant of self-training can mitigate the tradeoff between standard and adversarial accuracy [28]. Related to self-training is co-training [29], which assumes that the input features can be split into two or more views that are conditionally independent given the label.

Unsupervised domain adaptation, where the goal is to directly adapt from a labeled source domain to an unlabeled target domain, is widely studied [30]. The key challenge for domain adaptation theory is when the source and target supports do not overlap [11, 12], which is typical in the modern high-dimensional regime. Importance weighting based methods [31, 32, 13] assume the domains overlap, with bounds depending on the expected density ratios between the source and target. Even if the domains overlap, the density ratio often scales exponentially in the dimension. These methods also assume that $P(Y \mid X)$ is the same for the source and target. The theory of $\mathcal{H}\Delta\mathcal{H}$-divergence [14, 33] gives conditions for when a model trained on the source does well on the target without any adaptation. Empirical methods aim to learn domain invariant representations [34, 35, 36] but there are no theoretical guarantees for these methods [11]. These methods require additional heuristics [37], and work well on some tasks but not others [2, 38]. Our work suggests that the structure from gradual shifts, which appears often in applications, can be a way to build theory and algorithms for regimes where the source and target are very different.

Hoffman et al. [39], Michael et al. [40], Markus et al. [41], Bobu et al. [2] among others propose approaches for gradual domain adaptation. This setting differs from online learning [42], lifelong learning [43], and concept drift [44, 45, 46], since we only have unlabeled data from shifted distributions. To the best of our knowledge, we are the first to develop a theory for gradual domain adaptation, and investigate when and why the gradual structure helps.


The authors would like to thank the Open Philanthropy Project and the Stanford Graduate Fellowship program for funding. This work is also partially supported by the Stanford Data Science Initiative and the Stanford Artificial Intelligence Laboratory.

We are grateful to Stephen Mussman, Robin Jia, Csaba Szepesvari, Shai Ben-David, Lin Yang, Rui Shu, Michael Xie, Aditi Raghunathan, Yining Chen, Colin Wei, Pang Wei Koh, Fereshte Khani, Shengjia Zhao, and Albert Gu for insightful discussions.


Our code is at https://github.com/p-lambda/gradual_domain_adaptation. Code, data, and experiments will be available on CodaLab soon.


  • Vergara et al. [2012] A. Vergara, S. Vembu, T. Ayhan, M. A. Ryan, M. L. Homer, and R. Huerta. Chemical gas sensor drift compensation using classifier ensembles. Sensors and Actuators B: Chemical, 166–167:320–329, 2012.
  • Bobu et al. [2018] A. Bobu, E. Tzeng, J. Hoffman, and T. Darrell. Adapting to continuously shifting domains. In International Conference on Learning Representations Workshop (ICLR), 2018.
  • Farshchian et al. [2019] A. Farshchian, J. A. Gallego, J. P. Cohen, Y. Bengio, L. E. Miller, and S. A. Solla. Adversarial domain adaptation for stable brain-machine interfaces. In International Conference on Learning Representations (ICLR), 2019.
  • Sethi and Kantardzic [2017] T. S. Sethi and M. Kantardzic. On the reliable detection of concept drift from streaming unlabeled data. Expert Systems with Applications, 82:77–99, 2017.
  • Chapelle et al. [2006] O. Chapelle, A. Zien, and B. Scholkopf. Semi-Supervised Learning. MIT Press, 2006.
  • Xie et al. [2020] Q. Xie, M. Luong, E. Hovy, and Q. V. Le. Self-training with noisy student improves imagenet classification. arXiv, 2020.
  • Uesato et al. [2019] J. Uesato, J. Alayrac, P. Huang, R. Stanforth, A. Fawzi, and P. Kohli. Are labels required for improving adversarial robustness? In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • Carmon et al. [2019] Y. Carmon, A. Raghunathan, L. Schmidt, P. Liang, and J. C. Duchi. Unlabeled data improves adversarial robustness. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • Najafi et al. [2019] A. Najafi, S. Maeda, M. Koyama, and T. Miyato. Robustness to adversarial perturbations in learning from incomplete data. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • Ginosar et al. [2017] S. Ginosar, K. Rakelly, S. M. Sachs, B. Yin, C. Lee, P. Krähenbühl, and A. A. Efros. A century of portraits: A visual historical record of american high school yearbooks. IEEE Transactions on Computational Imaging, 3, 2017.
  • Zhao et al. [2019] H. Zhao, R. T. des Combes, K. Zhang, and G. J. Gordon. On learning invariant representation for domain adaptation. In International Conference on Machine Learning (ICML), 2019.
  • Shu et al. [2018] R. Shu, H. H. Bui, H. Narui, and S. Ermon. A DIRT-T approach to unsupervised domain adaptation. In International Conference on Learning Representations (ICLR), 2018.
  • Huang et al. [2006] J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems (NeurIPS), 2006.
  • Ben-David et al. [2010] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1):151–175, 2010.
  • Amini and Gallinari [2003] M. Amini and P. Gallinari. Semi-supervised learning with explicit misclassification modeling. In International Joint Conference on Artificial Intelligence (IJCAI), 2003.
  • Zhang et al. [2017] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR), 2017.
  • Hardt et al. [2016] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning (ICML), pages 1225–1234, 2016.
  • Ioffe and Szegedy [2015] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pages 448–456, 2015.
  • Lee [2013] D. Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshop on Challenges in Representation Learning, 2013.
  • Sohn et al. [2020] K. Sohn, D. Berthelot, C. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. arXiv, 2020.
  • Long et al. [2013] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu. Transfer feature learning with joint distribution adaptation. In International Conference on Computer Vision (ICCV), pages 2200–2207, 2013.
  • Zou et al. [2019] Y. Zou, Z. Yu, X. Liu, B. Kumar, and J. Wang. Confidence regularized self-training. arXiv preprint arXiv:1908.09822, 2019.
  • Inoue et al. [2018] N. Inoue, R. Furuta, T. Yamasaki, and K. Aizawa. Cross-domain weakly-supervised object detection through progressive domain adaptation. In Computer Vision and Pattern Recognition (CVPR), pages 5001–5009, 2018.
  • Grandvalet and Bengio [2005] Y. Grandvalet and Y. Bengio. Entropy regularization. In Semi-Supervised Learning, 2005.
  • Rigollet [2007] P. Rigollet. Generalization error bounds in semi-supervised classification under the cluster assumption. Journal of Machine Learning Research (JMLR), 8:1369–1392, 2007.
  • Singh et al. [2008] A. Singh, R. Nowak, and J. Zhu. Unlabeled data: Now it helps, now it doesn’t. In Advances in Neural Information Processing Systems (NeurIPS), 2008.
  • Ben-David et al. [2008] S. Ben-David, T. Lu, and D. Pal. Does unlabeled data provably help? worst-case analysis of the sample complexity of semi-supervised learning. In Conference on Learning Theory (COLT), 2008.
  • Raghunathan et al. [2020] A. Raghunathan, S. M. Xie, F. Yang, J. C. Duchi, and P. Liang. Understanding and mitigating the tradeoff between robustness and accuracy. arXiv, 2020.
  • Blum and Mitchell [1998] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Conference on Learning Theory (COLT), 1998.
  • Quiñonero-Candela et al. [2009] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset shift in machine learning. The MIT Press, 2009.
  • Shimodaira [2000] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90:227–244, 2000.
  • Sugiyama et al. [2007] M. Sugiyama, M. Krauledat, and K. Muller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research (JMLR), 8:985–1005, 2007.
  • Mansour et al. [2009] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Conference on Learning Theory (COLT), 2009.
  • Tzeng et al. [2014] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
  • Ganin and Lempitsky [2015] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (ICML), pages 1180–1189, 2015.
  • Tzeng et al. [2017] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2017.
  • Hoffman et al. [2018] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. Cycada: Cycle consistent adversarial domain adaptation. In International Conference on Machine Learning (ICML), 2018.
  • Peng et al. [2019] X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang. Moment matching for multi-source domain adaptation. In International Conference on Computer Vision (ICCV), 2019.
  • Hoffman et al. [2014] J. Hoffman, T. Darrell, and K. Saenko. Continuous manifold based adaptation for evolving visual domains. In Computer Vision and Pattern Recognition (CVPR), 2014.
  • Michael et al. [2018] G. Michael, E. Dennis, K. B. Mara, B. Peter, and M. Dorit. Gradual domain adaptation for segmenting whole slide images showing pathological variability. In Image and Signal Processing, 2018.
  • Markus et al. [2018] W. Markus, B. Alex, and P. Ingmar. Incremental adversarial domain adaptation for continually changing environments. In International Conference on Robotics and Automation (ICRA), 2018.
  • Shalev-Shwartz [2007] S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, 2007.
  • Silver et al. [2013] D. L. Silver, Q. Yang, and L. Li. Lifelong machine learning systems: Beyond learning algorithms. In Association for the Advancement of Artificial Intelligence (AAAI), volume 13, 2013.
  • Kramer [1988] A. H. Kramer. Learning despite distribution shift. In Connectionist Models Summer School, 1988.
  • Bartlett [1992] P. L. Bartlett. Learning with a slowly changing distribution. In Conference on Learning Theory (COLT), 1992.
  • Bartlett et al. [1996] P. L. Bartlett, S. Ben-David, and S. R. Kulkarni. Learning changing concepts by exploiting the structure of change. Machine Learning, 41, 1996.
  • Liang [2016] Percy Liang. Statistical learning theory. https://web.stanford.edu/class/cs229t/notes.pdf, 2016.

Appendix A Proofs for Section 3

Restatement of Example 3.1.

Even under the -separation, no label shift, gradual shift, and bounded data assumptions, there exists distributions and a source model that gets loss on the source (), but high loss on the target: . Self-training directly on the target does not help: . This holds true even if every domain is separable, so .


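We construct an example in 2-D, where we consider the set of regularized linear models $\Theta_R$ for a fixed norm bound $R$. Such a classifier is parametrized by $(w, b)$ where $w \in \mathbb{R}^2$ with $\|w\|_2 \le R$, and $b \in \mathbb{R}$. The output of the model is $w^\top x + b$, and the predicted label is $\mathrm{sign}(w^\top x + b)$.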
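The model class and loss used in this example can be sketched as follows. This is a minimal illustration under assumptions: the ramp loss is taken to be the standard form min(1, max(0, 1 - margin)), and the norm bound `R`, the two data points, and the function names are hypothetical rather than the values used in the construction.

```python
import numpy as np

def project_to_ball(w, R):
    """Rescale w so that its L2 norm is at most R (the regularization constraint)."""
    norm = np.linalg.norm(w)
    return w if norm <= R else w * (R / norm)

def model_output(w, b, x):
    """Linear model output w^T x + b for each row of x."""
    return x @ w + b

def predict_label(w, b, x):
    """Predicted label is the sign of the model output."""
    return np.where(model_output(w, b, x) >= 0, 1, -1)

def ramp_loss(w, b, x, y):
    """Mean ramp loss, assumed here to be min(1, max(0, 1 - y * output))."""
    margins = y * model_output(w, b, x)
    return float(np.mean(np.clip(1.0 - margins, 0.0, 1.0)))

# Hypothetical 2-D data: one point per class, separated along the second coordinate.
w = project_to_ball(np.array([0.0, 2.0]), R=1.0)   # becomes [0, 1]
b = 0.0
x = np.array([[0.0, 1.0], [0.0, -1.0]])
y = np.array([1, -1])

loss_same = ramp_loss(w, b, x, y)       # 0.0: both points have margin 1
loss_flipped = ramp_loss(w, b, x, -y)   # 1.0: both points have margin -1
```

The second evaluation mirrors the situation in the example: a classifier with zero ramp loss on one distribution can have ramp loss 1 once the labels of the points it sees are reversed.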

We first define the source distribution :


Consider the source classifier . The classifier classifies all examples correctly, in particular , and . In addition, the ramp loss is , that is:


We now construct distributions and :


Basically, the second-coordinate starts at 1 and decreases over time when the label is , and starts at and increases over time when the label is . We note that .

Now, classifies everything incorrectly in . , and but the corresponding labels in are and respectively. Accordingly, the ramp loss .

Self-training on the target cannot fix the problem. The source classifier gets every target example incorrect, so all the pseudolabels are incorrect. Self-training on these pseudolabels is now a convex optimization problem that attains 0 loss, but any classifier that attains 0 loss on the pseudolabels also gets all the examples incorrect. Note that the max-margin classifier on the source also exhibits the same issue (that is, it can get all the examples wrong after the dataset shift), by a simple extension of this example.

Finally, the classifier , , gets every label correct in all distributions, .

Restatement of Theorem 3.2.

Given with and marginals on are the same so . Suppose satisfy the bounded data assumption, and we have initial model , and unlabeled samples from , and we set . Then with probability at least over the sampling of , letting :


We begin by stating and proving some lemmas that formalize the proof outline in the main paper. The first is a standard lemma that says if we learn a regularized linear classifier from labeled examples drawn from a distribution, then the classifier is almost as good as the optimal regularized linear classifier on that distribution, and it gets closer to optimal as the number of examples increases. We bound the error of the classifier using the Rademacher complexity of regularized linear models.

Lemma A.1.

Given samples from a joint distribution over inputs and labels , and suppose . Let and be the empirical and population minimizers of the ramp loss respectively:


Then with probability at least ,


We begin with a standard bound (see e.g. Theorem 9, page 70 in [47]), where the generalization error on the left is bounded by the Rademacher complexity:


Here, is the composition of the loss with the set of regularized linear models, and is the Rademacher complexity. It now suffices to bound .

We first use Talagrand’s lemma, which says that if is an -Lipschitz function (that is, for all ), then:


In our case, we let , in which case where is the ramp loss. The Lipschitz constant of the ramp loss is 1, so .

Finally, we need to bound , the Rademacher complexity of -regularized linear models. This is a standard argument (e.g. see Theorem 11, page 82 in [47]) and we get:
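One common form of this standard argument is sketched below. The notation is an assumption and may differ from the paper's: $x_1, \dots, x_n$ are i.i.d. samples with $\mathbb{E}\|x\|_2^2 \le B^2$, $\sigma_1, \dots, \sigma_n$ are independent Rademacher signs, and the class is $\{x \mapsto w^\top x : \|w\|_2 \le R\}$, ignoring the bias term for simplicity.

```latex
\begin{align*}
R_n
  &= \mathbb{E}\Big[\sup_{\|w\|_2 \le R} \frac{1}{n}\sum_{i=1}^n \sigma_i\, w^\top x_i\Big]
   = \frac{R}{n}\,\mathbb{E}\Big\|\sum_{i=1}^n \sigma_i x_i\Big\|_2
   && \text{(Cauchy--Schwarz, tight at the supremum)} \\
  &\le \frac{R}{n}\sqrt{\mathbb{E}\Big\|\sum_{i=1}^n \sigma_i x_i\Big\|_2^2}
   && \text{(Jensen)} \\
  &= \frac{R}{n}\sqrt{\sum_{i=1}^n \mathbb{E}\|x_i\|_2^2}
   && \text{(cross terms vanish since } \mathbb{E}[\sigma_i\sigma_j]=0,\ i\ne j) \\
  &\le \frac{R}{n}\sqrt{n B^2} = \frac{BR}{\sqrt{n}}.
\end{align*}
```

Under these assumptions the complexity term decays at rate $1/\sqrt{n}$ and scales linearly in both the data bound $B$ and the norm bound $R$.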


The next lemma shows that the error (0-1 loss) of the classifier is low on the shifted distribution, even though the margin loss may be high. Intuitively, the classifier classifies most points in the original distribution correctly with a large geometric margin, so after a small distribution shift these points are still correctly classified, since the margin acts as a ‘buffer’ protecting us from misclassification.

Lemma A.2.

If , , and the marginals on are the same so , then


Let $(w, b)$ be the weights and bias of the regularized linear model, with $\|w\|_2 \le R$.

Intuitively, if the ramp loss for a regularized linear model is low, then most points are classified correctly with high geometric margin (distance to decision boundary). Formally, we first show (using essentially Markov’s inequality) that the probability of a small margin is controlled by the ramp loss, where we recall that the ramp loss is bounded between 0 and 1:

Here, the inequality on the third line follows because if where , then , from the definition of the ramp loss.
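The Markov-style step can be written out as follows. This is a sketch under assumed notation rather than a transcription of the paper's display: $m = y\,(w^\top x + b)$ is the margin of a labeled point, $r(m) = \min(1, \max(0, 1 - m))$ is the ramp loss, $L_r$ is its expectation, and $\gamma \in [0, 1)$ is a generic margin threshold.

```latex
\begin{align*}
L_r = \mathbb{E}\big[r(m)\big]
  &\ge \mathbb{E}\big[r(m)\,\mathbb{1}\{m \le \gamma\}\big]
   && \text{(since } r \ge 0) \\
  &\ge (1-\gamma)\,\Pr\big(m \le \gamma\big)
   && \text{(since } m \le \gamma \text{ implies } r(m) \ge 1-\gamma) \\
\Longrightarrow\quad
\Pr\big(m \le \gamma\big) &\le \frac{L_r}{1-\gamma}.
\end{align*}
```

So a low ramp loss forces the mass of points with margin at most $\gamma$ to be small, with the bound degrading as $\gamma$ approaches 1.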

This gives us:


The high-level intuition of the next step is that since the shift is small, only points with a small margin can be misclassified after the distribution shift, and since the previous step shows there aren’t too many of these, the error of the classifier on the shifted distribution is small.

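Formally, fix a small slack $\epsilon > 0$, and let $f$ be a mapping that pushes one distribution onto the other (so that the two agree on every measurable set after applying $f$) while moving each point by at most the shift distance plus $\epsilon$. (We need the $\epsilon$ here because a mapping that attains the distance exactly may not exist, although such a mapping does exist if both distributions have densities.) Then we have: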

Where the inequality follows from Cauchy-Schwarz:
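One way to write this step is sketched below, under assumed notation: $f$ is the mapping between the distributions, $\rho$ is the shift distance, $\epsilon$ is the slack, $\|w\|_2 \le R$, and $y \in \{-1, +1\}$.

```latex
\begin{align*}
\big|w^\top f(x) - w^\top x\big|
  &\le \|w\|_2\,\|f(x) - x\|_2
   \le R\,(\rho + \epsilon), \\
\text{so}\quad
y\,(w^\top x + b) > R\,(\rho + \epsilon)
  &\;\Longrightarrow\;
y\,(w^\top f(x) + b) \ge y\,(w^\top x + b) - R\,(\rho + \epsilon) > 0 .
\end{align*}
```

In words, a point whose margin exceeds $R(\rho + \epsilon)$ before the shift is still classified correctly after it is moved by $f$.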

Combining this with Equation (34), this gives us:


Since was arbitrary, by taking the infimum over all , we get:


Which was what we wanted to show.

From the previous lemma, the classifier has low error on the shifted distribution, or in other words it only occasionally mislabels examples from that distribution. The next lemma says that if we minimize the ramp loss on a distribution where the points are only occasionally mislabeled, then we learn a classifier with low (good) ramp loss as well.

Lemma A.3.

Given random variables (defined on the same measure space) with joint distribution , where denotes the distribution over inputs, and denote distinct distributions over labels. If then for any , . Here denotes the distribution where the input is sampled from and then the label is sampled from .


Let . The proof is by algebra, where we recall that