1 Introduction
Machine learning models are typically trained and tested on the same data distribution. However, when a model is deployed in the real world, the data distribution typically evolves over time, leading to a drop in performance. This problem is widespread: sensor measurements drift over time due to sensor aging [1], self-driving car vision modules have to deal with evolving road conditions [2], and neural signals received by brain-machine interfaces change within the span of a day [3]. Repeatedly gathering large sets of labeled examples to retrain the model can be impractical, so we would like to leverage unlabeled examples to adapt the model and maintain high accuracy [3, 4].
In these examples the domain shift does not happen all at once but gradually, although most domain adaptation methods ignore this gradual structure. Intuitively, it is easier to handle smaller shifts, but each shift can incur some error, so more steps can mean more degradation. This makes it unclear whether leveraging the gradual shift structure is better than directly adapting to the target.
In this paper, we provide the first theoretical analysis showing that gradual domain adaptation provides improvements over the traditional approach of direct domain adaptation. We analyze self-training (also known as pseudolabeling), a method in the semi-supervised learning literature [5] that has led to state-of-the-art results on ImageNet [6] and adversarial robustness on CIFAR-10 [7, 8, 9].
As a concrete example of our setting, the Portraits dataset [10] contains photos of high school seniors taken across many years, labeled by gender (Figure 1). We use the first 2000 images (1905–1935) as the source, the next 14000 (1935–1969) as intermediate domains, and the next 2000 images (1969–1973) as the target. A model trained on labeled examples from the source gets 98% accuracy on held out examples in the same years, but only 75% accuracy on the target domain. Assuming access to unlabeled images from intermediate domains, our goal is to adapt the model to do well on the target domain. Direct adaptation to the target with self-training improves the accuracy only slightly, from 75% to 77%.
The gradual self-training algorithm begins with a classifier trained on labeled examples from the source domain (Figure 1(a)). For each successive domain $P_t$, the algorithm generates pseudolabels for unlabeled examples from that domain, and then trains a regularized supervised classifier on the pseudolabeled examples. The intuition, visualized in Figure 2, is that after a single gradual shift, most examples are pseudolabeled correctly, so self-training learns a good classifier on the shifted data, but the shift from the source to the target can be too large for self-training to correct. We find that gradual self-training on the Portraits dataset improves upon direct target adaptation (77% to 84% accuracy).
Our results: We analyze gradual domain adaptation in two settings. The key challenge for domain adaptation theory is dealing with source and target domains whose supports do not overlap [11, 12], which is typical in the modern high-dimensional regime. The gradual shift structure inherent in many applications gives us leverage to adapt to target distributions with non-overlapping support.
Our first setting, the margin setting, is distribution-free: we only assume that at every point in time there exists some linear classifier that can classify most of the data correctly with a margin, where the linear classifier may be different at each time step (so this is more general than covariate shift), and that the shifts are small in Wasserstein-infinity distance. A simple example (as in Figure 2) shows that a classifier that gets 100% accuracy can get 0% accuracy after a constant number of time steps. Directly adapting to the final target domain also gets 0% accuracy. Gradual self-training does better, letting us bound the error after $T$ steps by $e^{cT}\left(\alpha_0 + O(1/\sqrt{n})\right)$ for some constant $c$, where $\alpha_0$ is the error of the classifier on the source domain, and $n$ is the number of unlabeled examples in each intermediate domain. While this bound is exponential in $T$, it is non-vacuous for small $\alpha_0$, and we show that it is tight for gradual self-training.
In the second setting, stronger distributional assumptions allow us to do better: we assume that $P(X \mid Y = y)$ is a $d$-dimensional isotropic Gaussian for each $y$. Here, we show that if we begin with a classifier that is nearly Bayes optimal for the initial distribution, we can recover a classifier that is Bayes optimal for the target distribution with infinite unlabeled data. This is an idealized setting to understand what properties of the data might allow self-training to do better than the exponential bound.
Our theory leads to practical insights, showing that regularization (even in the context of infinite data) and label sharpening are essential for gradual self-training. Without regularization, the accuracy of gradual self-training drops from 84% to 77% on Portraits and 88% to 46% on rotating MNIST. Even when we self-train with more examples, the performance gap between regularized and unregularized models stays the same, unlike in supervised learning where the benefit of regularization diminishes as we get more examples.
Finally, our theory suggests that the gradual shift structure helps when the shift is small in Wasserstein-infinity distance, as opposed to other distance metrics like the KL-divergence. For example, one way to interpolate between the source and target domains is to gradually introduce more images from the target, but this shift is large in Wasserstein-infinity distance, and we see experimentally that gradual self-training does not help in this setting. We hope this gives practitioners some insight into when gradual self-training can work.
2 Setup
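Gradually shifting distributions: Consider a binary classification task of predicting labels $y \in \{-1, 1\}$ from input features $x \in \mathbb{R}^d$. We have joint distributions over the inputs and labels, $\mathbb{R}^d \times \{-1, 1\}$: $P_0, P_1, \ldots, P_T$, where $P_0$ is the source domain, $P_T$ is the target domain, and $P_1, \ldots, P_{T-1}$ are intermediate domains. We assume the shift is gradual: for some $\epsilon > 0$, $\rho(P_t, P_{t+1}) < \epsilon$ for all $0 \leq t < T$, where $\rho(P, Q)$ is some distance function between distributions $P$ and $Q$. We have $n_0$ labeled examples $S_0$ sampled independently from the source $P_0$ and $n$ unlabeled examples $S_t$ sampled independently from $P_t$ for each $1 \leq t \leq T$.
Models and objectives: We have a model family $M_\Theta$, where a model $M_\theta : \mathbb{R}^d \to \mathbb{R}$, for each $\theta \in \Theta$, outputs a score representing its confidence that the label $y$ is 1 for the given example. The model's prediction for an input $x$ is $\mathrm{sign}(M_\theta(x))$, where $\mathrm{sign}(r) = 1$ if $r \geq 0$ and $\mathrm{sign}(r) = -1$ if $r < 0$. We evaluate models on the fraction of times they make a wrong prediction, also known as the 0-1 loss:
$$\mathrm{Err}(\theta, P) = \mathbb{E}_{X, Y \sim P}\left[\mathrm{sign}(M_\theta(X)) \neq Y\right] \tag{1}$$
The goal is to find a classifier $\theta$ that gets high accuracy on the target domain $P_T$, that is, low $\mathrm{Err}(\theta, P_T)$. In an online setting we may care about the accuracy at the current $P_t$ for every time $t$, and our analysis works in this setting as well.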
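Baseline methods: We select a loss function $\ell : \mathbb{R} \times \{-1, 1\} \to \mathbb{R}^+$ which takes a prediction and label, and outputs a non-negative loss value, and we begin by training a source model $\theta_0$ that minimizes the loss on labeled data in the source domain:
$$\theta_0 = \arg\min_{\theta' \in \Theta} \frac{1}{n_0} \sum_{(x_i, y_i) \in S_0} \ell\left(M_{\theta'}(x_i), y_i\right) \tag{2}$$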
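The non-adaptive baseline is to use $\theta_0$ on the target domain, which incurs error $\mathrm{Err}(\theta_0, P_T)$. Self-training uses unlabeled data to adapt a model. Given a model $\theta$ and unlabeled data $S$, $\mathrm{ST}(\theta, S)$ denotes the output of self-training. Self-training pseudolabels each example in $S$ using $M_\theta$, and then selects a new model $\theta'$ that minimizes the loss on this pseudolabeled dataset. Formally,
$$\mathrm{ST}(\theta, S) = \arg\min_{\theta' \in \Theta} \frac{1}{|S|} \sum_{x_i \in S} \ell\left(M_{\theta'}(x_i), \mathrm{sign}(M_\theta(x_i))\right) \tag{3}$$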
Here, self-training uses "hard" labels: we pseudolabel examples as either $-1$ or $1$, based on the output of the classifier, instead of a probabilistic label based on the model's confidence. We refer to this as label sharpening. In our theoretical analysis, we sometimes want to describe the behavior of self-training when run on infinite unlabeled data from a probability distribution $P$:
$$\mathrm{ST}(\theta, P) = \arg\min_{\theta' \in \Theta} \mathbb{E}_{X \sim P}\left[\ell\left(M_{\theta'}(X), \mathrm{sign}(M_\theta(X))\right)\right] \tag{4}$$
The direct adaptation to target baseline takes the source model $\theta_0$ and self-trains on the target data $S_T$, and is denoted by $\mathrm{ST}(\theta_0, S_T)$. Prior work often chooses to repeat this process of self-training on the target $k$ times, which we denote by $\mathrm{ST}_k(\theta_0, S_T)$.
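Gradual self-training: In gradual self-training, we self-train on the finite unlabeled examples from each domain successively. That is, for $i \geq 1$, we set:
$$\theta_i = \mathrm{ST}(\theta_{i-1}, S_i) \tag{5}$$
$\mathrm{ST}(\theta_0, (S_1, \ldots, S_T)) = \theta_T$ is the output of gradual self-training, which we evaluate on the target distribution $P_T$.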
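As an illustration of the procedure in this section, the following is a minimal sketch, not the implementation used in the experiments. It assumes a linear model without a bias term, swaps in a ridge-regularized squared loss for the margin loss so that each training step has a closed form, and uses a toy two-dimensional dataset whose class means rotate by 18 degrees per step. The helper names (`fit_ridge`, `self_train`, `gradual_self_train`, `sample_domain`) are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam=1e-2):
    """Regularized least-squares linear classifier (assumption: no bias term)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * len(X) * np.eye(d), X.T @ y)

def predict(w, X):
    """Hard labels in {-1, +1}; a score of exactly 0 is mapped to +1."""
    return np.where(X @ w >= 0, 1.0, -1.0)

def self_train(w, X_unlabeled, lam=1e-2):
    """One self-training step: pseudolabel with w, then refit on the pseudolabels."""
    return fit_ridge(X_unlabeled, predict(w, X_unlabeled), lam)

def gradual_self_train(w0, domains, lam=1e-2):
    """Apply self-training to each unlabeled domain in order."""
    w = w0
    for X_t in domains:
        w = self_train(w, X_t, lam)
    return w

def sample_domain(rng, angle, n=500, sigma=0.2):
    """Toy domain: class means at +/-(cos angle, sin angle) plus isotropic noise."""
    y = rng.choice([-1.0, 1.0], size=n)
    mu = np.array([np.cos(angle), np.sin(angle)])
    X = y[:, None] * mu + sigma * rng.standard_normal((n, 2))
    return X, y

rng = np.random.default_rng(0)
T = 10
angles = np.linspace(0.0, np.pi, T + 1)   # the target has the two classes swapped
X0, y0 = sample_domain(rng, angles[0])
unlabeled = [sample_domain(rng, a)[0] for a in angles[1:]]
X_target, y_target = sample_domain(rng, angles[-1])

w0 = fit_ridge(X0, y0)                          # source model
w_direct = self_train(w0, unlabeled[-1])        # direct adaptation to the target
w_gradual = gradual_self_train(w0, unlabeled)   # gradual self-training

def accuracy(w):
    return float(np.mean(predict(w, X_target) == y_target))

acc_source, acc_direct, acc_gradual = accuracy(w0), accuracy(w_direct), accuracy(w_gradual)
print(acc_source, acc_direct, acc_gradual)
```

In this toy setting the source model and direct adaptation both end up with the classes swapped on the target, while the gradual variant tracks the rotation.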
3 Theory for the margin setting
We show that gradual self-training does better than directly adapting to the target, where we assume that at each time step there exists some linear classifier (which can be different at each step) that can classify most of the data correctly with a margin (a standard assumption in learning theory), and that the shifts are small. Our main result (Theorem 3.2) bounds the error of gradual self-training. We show that our analysis is tight for gradual self-training (Example 3.4), and explain why regularization, label sharpening, and the ramp loss are key to our bounds. Proofs are in Appendix A.
3.1 Assumptions
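Models and losses: We consider regularized linear models that have weights with bounded $\ell_2$ norm: $\Theta_R = \{(w, b) : w \in \mathbb{R}^d, b \in \mathbb{R}, \|w\|_2 \leq R\}$ for some fixed $R > 0$. Given $(w, b) \in \Theta_R$, the model's output is $M_{w,b}(x) = w^\top x + b$.
We consider margin loss functions such as the hinge and ramp losses. Intuitively, a margin loss encourages a model to classify points correctly and confidently, by keeping correctly classified points far from the decision boundary. We consider the hinge function $h$ and ramp function $r$:
$$h(m) = \max(1 - m, 0) \tag{6}$$
$$r(m) = \min(h(m), 1) \tag{7}$$
The ramp loss is $\ell_r(\hat{y}, y) = r(y \hat{y})$, where $\hat{y} \in \mathbb{R}$ is a model's prediction, and $y \in \{-1, 1\}$ is the true label. The hinge loss is the standard way to enforce margin, but the ramp loss is more robust towards outliers because it is bounded above: no single point contributes too much to the loss. We will see that the ramp loss is key to the theoretical guarantees for gradual self-training because of its robustness. We denote the population ramp loss as:
$$L_r(\theta, P) = \mathbb{E}_{X, Y \sim P}\left[\ell_r(M_\theta(X), Y)\right] \tag{8}$$
Given a finite sample $S$, the empirical loss is:
$$L_r(\theta, S) = \frac{1}{|S|} \sum_{(x, y) \in S} \ell_r(M_\theta(x), y) \tag{9}$$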
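The two margin losses can be sketched in a few lines. The snippet below assumes the standard forms, a hinge of max(1 - m, 0) and a ramp obtained by clipping the hinge at 1, applied to the margin m = y * score. The function names are hypothetical.

```python
import numpy as np

def hinge(m):
    """Hinge function: zero once the margin m reaches 1, linear below that."""
    return np.maximum(1.0 - m, 0.0)

def ramp(m):
    """Ramp function: the hinge clipped at 1, so it is bounded in [0, 1]."""
    return np.minimum(hinge(m), 1.0)

def ramp_loss(score, y):
    """Ramp loss of a real-valued score against a label y in {-1, +1}."""
    return ramp(y * score)

def empirical_ramp_loss(w, b, X, y):
    """Average ramp loss of the linear model x -> w.x + b on a labeled sample."""
    return float(np.mean(ramp_loss(X @ w + b, y)))

margins = np.array([-3.0, 0.0, 0.5, 2.0])
hinge_vals = hinge(margins)  # the outlier at margin -3 contributes 4
ramp_vals = ramp(margins)    # the same outlier contributes at most 1

X_demo = np.array([[1.0], [-1.0]])
y_demo = np.array([1.0, 1.0])
demo_loss = empirical_ramp_loss(np.array([2.0]), 0.0, X_demo, y_demo)
```

The comparison of `hinge_vals` and `ramp_vals` shows the robustness property discussed above: a badly misclassified point dominates the hinge average but not the ramp average.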
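Distributional distance: Our notion of distance is $W_\infty$, the Wasserstein-infinity distance. Intuitively, $W_\infty$ moves points from distribution $P$ to $Q$ by distance at most $\epsilon$ to match the distributions. For ease of exposition we consider the Monge form of $W_\infty$, although the results can be extended to the Kantorovich formulation as well. Formally, given probability measures $P, Q$ on $\mathbb{R}^d$:
$$W_\infty(P, Q) = \inf\left\{ \sup_{x \in \mathbb{R}^d} \|f(x) - x\|_2 \;:\; f : \mathbb{R}^d \to \mathbb{R}^d,\ f_\# P = Q \right\} \tag{10}$$
As usual, $\#$ denotes the push-forward of a measure, that is, for every set $A \subseteq \mathbb{R}^d$, $f_\# P(A) = P(f^{-1}(A))$.
In our case, we require that the conditional distributions do not shift too much. Given joint probability measures $P, Q$ on the inputs and labels $\mathbb{R}^d \times \{-1, 1\}$, the distance is:
$$\rho(P, Q) = \max\left(W_\infty(P_{X \mid Y=1}, Q_{X \mid Y=1}),\ W_\infty(P_{X \mid Y=-1}, Q_{X \mid Y=-1})\right) \tag{11}$$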
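For intuition, the distance can be computed exactly in one special case: two empirical samples of equal size on the real line, where matching the sorted points minimizes the largest displacement. The sketch below covers only that one-dimensional case and is not a general $W_\infty$ solver. The names `w_inf_1d` and `rho_1d` are hypothetical.

```python
import numpy as np

def w_inf_1d(a, b):
    """W-infinity between two equal-size 1-D samples via the sorted matching."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("samples must have the same size")
    return float(np.max(np.abs(a - b)))

def rho_1d(xa, ya, xb, yb):
    """Class-conditional distance: the larger of the two per-class W-infinity values."""
    return max(w_inf_1d(xa[ya == c], xb[yb == c]) for c in (-1, 1))

xa = np.array([-2.0, -1.0, 1.0, 2.0])
ya = np.array([-1, -1, 1, 1])
xb = xa + 0.5   # every point moves by 0.5
yb = ya.copy()
shift = rho_1d(xa, ya, xb, yb)
```

Because the distance is a maximum over points rather than an average, a single point that has to travel far makes the whole distance large.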
$\alpha^*$-separation assumption: Assume every domain admits a classifier with low loss $\alpha^*$, that is, there exists $\alpha^* \geq 0$ such that for every domain $P_t$, there exists some $\theta_t \in \Theta_R$ with $L_r(\theta_t, P_t) \leq \alpha^*$.
Gradual shift assumption: For some $\rho < \frac{1}{R}$, assume $\rho(P_t, P_{t+1}) \leq \rho$ for every consecutive domain, where $\frac{1}{R}$ is the regularization strength of the model class $\Theta_R$. $\frac{1}{R}$ can be interpreted as the geometric margin (distance from decision boundary to data) the model is trying to enforce.
Bounded data assumption: When dealing with finite samples we need a standard regularity condition: we say that $P$ satisfies the bounded data assumption if the data is not too large on average: $\mathbb{E}_{X \sim P}\left[\|X\|_2^2\right] \leq B^2$ where $B > 0$.
No label shift assumption: Assume that the fraction of $Y = 1$ labels does not change: $P_t(Y)$ is the same for all $t$.
3.2 Domain shift: baselines fail
While the distribution shift from $P_t$ to $P_{t+1}$ is small, the distribution shift from the source to the target can be large, as visualized in Figure 2. A classifier that gets 100% accuracy on $P_0$ might classify every example wrong on $P_T$, even when the number of steps $T$ is a small constant. In this case, directly adapting to $P_T$ would not help. The following example formalizes this:
Example 3.1.
Even under the $\alpha^*$-separation, no label shift, gradual shift, and bounded data assumptions, there exist distributions $P_0, P_1, P_2$ and a source model $\theta \in \Theta_R$ that gets 0 loss on the source ($L_r(\theta, P_0) = 0$), but high loss on the target: $L_r(\theta, P_2) = 1$. Self-training directly on the target does not help: $L_r(\mathrm{ST}(\theta, P_2), P_2) = 1$. This holds true even if every domain is separable, so $\alpha^* = 0$.
Other methods: Our analysis focuses on self-training, but other bounds do not apply in this setting because they either assume that the density ratio between the target and source exists and is not too small [13], or that the source and target are similar enough that we cannot discriminate between them [14].
3.3 Gradual selftraining improves error
We show that gradual self-training helps over direct adaptation. For intuition, consider a simple example where $\alpha^* = 0$ and $\theta$ classifies every example in $P_0$ correctly with geometric margin $\gamma = \frac{1}{R}$. If each point shifts by distance $< \gamma$, $\theta$ gets every example in the new domain $P_1$ correct. If we had infinite unlabeled data from $P_1$, we can learn a model $\theta'$ that classifies every example in the new domain $P_1$ correctly with margin $\gamma$ since $\alpha^* = 0$. Repeating the process for $P_2, \ldots, P_T$, we get every example in $P_T$ correct.
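The first step of this intuition can be written out in one line. The sketch below assumes a linear classifier $(w, b)$ with $\|w\|_2 \leq R$ that attains functional margin at least 1 on a labeled point $(x, y)$, and a shifted point $x'$ with $\|x' - x\|_2 \leq \rho < 1/R$; the notation is chosen for illustration.

```latex
\begin{align*}
y\,(w^\top x' + b)
  &= y\,(w^\top x + b) + y\, w^\top (x' - x) \\
  &\geq 1 - \|w\|_2 \, \|x' - x\|_2
     && \text{(margin on $x$, Cauchy--Schwarz)} \\
  &\geq 1 - R\rho \;>\; 0
     && \text{(since $\|w\|_2 \leq R$ and $\rho < 1/R$)}
\end{align*}
```

So the shifted point keeps its original label under the old classifier, which is why its pseudolabel is correct.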
But what happens when we start with a model that has some error, for example because the data cannot be perfectly separated, and have only finite unlabeled samples? We show that self-training still does better than adapting to the target domain directly, or using the non-adaptive source classifier.
The first main result of the paper says that if we have a model that gets low loss and the distribution shifts slightly, self-training gives us a model that does not do too badly on the new distribution.
Theorem 3.2.
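Given $P, Q$ with $\rho(P, Q) = \rho < \frac{1}{R}$, and with the same marginals on $Y$, so $P(Y) = Q(Y)$. Suppose $P, Q$ satisfy the bounded data assumption, and we have an initial model $\theta$ and $n$ unlabeled samples $S$ from $Q$, and we set $\theta' = \mathrm{ST}(\theta, S)$. Then with probability at least $1 - \delta$ over the sampling of $S$, letting $\alpha^* = \min_{\theta^* \in \Theta_R} L_r(\theta^*, Q)$: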
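$$L_r(\theta', Q) \leq \frac{2}{1 - \rho R}\, L_r(\theta, P) + \alpha^* + \frac{4BR + \sqrt{2 \log(2/\delta)}}{\sqrt{n}} \tag{12}$$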
The proof of this result is in Appendix A, but we give a high level sketch here. There exists some classifier that gets low loss $\alpha^*$ on $Q$, so if we had access to $n$ labeled examples from $Q$ then empirical risk minimization gives us a classifier that is accurate on the population: from a Rademacher complexity argument we get a classifier $\theta'$ with loss at most $\alpha^* + O(1/\sqrt{n})$, the second and third term in the RHS of the bound.
Since we only have unlabeled examples from $Q$, self-training uses $\theta$ to pseudolabel these examples and then trains on this generated dataset. Now, if the distribution shift $\rho$ is small relative to the geometric margin $\gamma = \frac{1}{R}$, then we can show that the original model $\theta$ labels most examples in the new distribution $Q$ correctly, that is, $\mathrm{Err}(\theta, Q)$ is small if $L_r(\theta, P)$ is small. Finally, if most examples are labeled correctly we show that because there exists some classifier with low margin loss, self-training will also learn a classifier $\theta'$ with low margin loss $L_r(\theta', Q)$, which completes the proof.
We apply this argument inductively to show that after $T$ time steps, the error of gradual self-training is $\lesssim \exp(cT)\,\alpha_0$ for some constant $c$, if the original error is $\alpha_0$.
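To see where the exponential dependence comes from, it helps to unroll a generic one-step recursion. The sketch below assumes a per-step guarantee of the form $a_{t+1} \leq \beta a_t + c$ with $\beta > 1$, where $a_t$ stands for the loss after step $t$. The symbols $a_t$, $\beta$, and $c$ are placeholders for illustration, not the exact constants of the theorem.

```latex
\begin{align*}
a_T &\leq \beta\, a_{T-1} + c \\
    &\leq \beta^2 a_{T-2} + c\,(1 + \beta) \\
    &\;\;\vdots \\
    &\leq \beta^T a_0 + c \sum_{k=0}^{T-1} \beta^k
     \;=\; \beta^T a_0 + c\,\frac{\beta^T - 1}{\beta - 1}.
\end{align*}
```

Both terms grow like $\beta^T = \exp(T \log \beta)$, so a constant-factor loss increase per step compounds into a bound that is exponential in the number of steps.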
Corollary 3.3.
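Under the $\alpha^*$-separation, no label shift, gradual shift, and bounded data assumptions, if the source model $\theta_0$ has low loss $\alpha_0 \geq \alpha^*$ on $P_0$ (i.e. $L_r(\theta_0, P_0) \leq \alpha_0$) and $\theta$ is the result of gradual self-training: $\theta = \mathrm{ST}(\theta_0, (S_1, \ldots, S_T))$, letting $\beta = \frac{2}{1 - \rho R}$:
$$L_r(\theta, P_T) \leq \beta^{T+1} \left( \alpha_0 + \frac{4BR + \sqrt{2 \log(2T/\delta)}}{\sqrt{n}} \right) \tag{13}$$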
Corollary 3.3 says that the gradual structure allows some control of the error, unlike direct adaptation, where the accuracy on the target domain can be 0% after a constant number of steps. Note that if the classes are separable and we have infinite data, then gradual self-training maintains 0 error.
Our next example shows that our analysis for gradual self-training in this setting is tight: if we start with a model with loss $\alpha_0$, then the error can in fact increase exponentially even with infinite unlabeled examples. Intuitively, at each step of self-training the loss can increase by a constant factor, which leads to an exponential growth in the error.
Example 3.4.
Even under the separation, no label shift, gradual shift, and bounded data assumptions, given , for every there exists distributions , and with , but if then . Note that is always in .
This suggests that if we want subexponential bounds we either need to make additional assumptions on the data distributions, or devise alternative algorithms to achieve better bounds (which we believe is unlikely).
3.4 Essential ingredients for gradual selftraining
In this section, we explain why regularization, label sharpening, and the ramp loss are essential to bounding the error of gradual self-training (Theorem 3.2).
Regularization: Without regularization there is no incentive for the model to change when self-training: if we self-train without regularization, an optimal thing to do is to output the original model. The intuition is that since the model $\theta$ is used to pseudolabel examples, $\theta$ gets every pseudolabeled example correct. The scaled classifier $\alpha\theta$ for large $\alpha$ then gets optimal loss, but $\alpha\theta$ and $\theta$ make the same predictions for every example. We use $\mathrm{ST}'(\theta, S)$ to denote the set of possible $\theta'$ that minimize the loss on the pseudolabeled distribution (Equation (3)):
Example 3.5.
Given a model and unlabeled examples where for all , , there exists such that for all , .
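The scaling argument can be checked numerically. The sketch below uses the hinge loss and a bias-free linear model purely for illustration, with randomly generated inputs. It pseudolabels the data with a weight vector `w` and then evaluates scaled copies of `w` on those pseudolabels.

```python
import numpy as np

def hinge_loss(w, X, y):
    """Average hinge loss of the bias-free linear model x -> w.x."""
    return float(np.mean(np.maximum(1.0 - y * (X @ w), 0.0)))

def hard_labels(w, X):
    """Predictions in {-1, +1}; a score of exactly 0 is mapped to +1."""
    return np.where(X @ w >= 0, 1.0, -1.0)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
w = np.array([1.0, -0.5])
pseudo = hard_labels(w, X)   # labels produced by w itself

# Scaling w up never changes a prediction, but it pushes every margin past 1.
scales = (1.0, 10.0, 1e6)
losses = {s: hinge_loss(s * w, X, pseudo) for s in scales}
same_predictions = all(np.array_equal(hard_labels(s * w, X), pseudo) for s in scales)
```

Without a norm constraint, the optimizer can therefore drive the pseudolabel loss to zero just by rescaling the old model, which is the failure mode that regularization rules out.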
More specific to our setting, our bounds require regularized models because regularized models classify the data correctly with a margin, so even after a mild distribution shift we get most new examples correct. Note that in traditional supervised learning, regularization is usually required when we have few examples for better generalization to the population, whereas in our setting regularization is important for maintaining a margin even with infinite data.
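Label sharpening: When self-training, we pseudolabel examples as $-1$ or $1$, based on the output of the classifier. Prior work sometimes uses "soft" labels [9], where for each example they assign a probability of the label being $-1$ or $1$, and train using a logistic loss. The loss on the soft-pseudolabeled distribution is defined as:
$$L_{\sigma,\theta}(\theta') = \mathbb{E}_{X \sim P}\left[\ell_\sigma\left(\sigma(M_{\theta'}(X)), \sigma(M_\theta(X))\right)\right] \tag{14}$$
where $\sigma$ is the sigmoid function, and $\ell_\sigma$ is the log loss:
$$\ell_\sigma(p, p') = -\left(p' \log p + (1 - p') \log(1 - p)\right) \tag{15}$$
Self-training then picks $\theta'$ minimizing $L_{\sigma,\theta}(\theta')$. A simple example shows that this form of self-training may never update the parameters because $\theta' = \theta$ minimizes $L_{\sigma,\theta}$:
Example 3.6.
For all $\theta \in \Theta$, $\theta$ is a minimizer of $L_{\sigma,\theta}$, that is, for all $\theta' \in \Theta$, $L_{\sigma,\theta}(\theta) \leq L_{\sigma,\theta}(\theta')$.
This suggests that we "sharpen" the soft labels to encourage the model to update its parameters. Note that this is true even on finite data: set $P$ to be the empirical distribution.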
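A small numerical check of this observation, assuming a bias-free logistic model and the cross-entropy between the new model's probabilities and the old model's soft pseudolabels (function names are hypothetical): the gradient with respect to the new parameters vanishes when they equal the old ones.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_label_loss(theta_new, theta_old, X):
    """Cross-entropy of the new model against the old model's soft pseudolabels."""
    p = sigmoid(X @ theta_new)
    q = sigmoid(X @ theta_old)   # soft pseudolabels
    return float(np.mean(-(q * np.log(p) + (1.0 - q) * np.log(1.0 - p))))

def soft_label_grad(theta_new, theta_old, X):
    """Gradient of soft_label_loss with respect to theta_new."""
    return X.T @ (sigmoid(X @ theta_new) - sigmoid(X @ theta_old)) / len(X)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
theta = np.array([0.5, -1.0, 2.0])

grad_at_theta = soft_label_grad(theta, theta, X)    # exactly zero
perturbed = theta + 0.1 * rng.standard_normal(3)
loss_same = soft_label_loss(theta, theta, X)
loss_perturbed = soft_label_loss(perturbed, theta, X)
```

Since the loss is convex in the new parameters for this model, a zero gradient at the old parameters means gradient-based training on soft labels has no reason to move.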
Ramp versus hinge loss: We use the ramp loss, but does the more popular hinge loss work? Unfortunately, the next example shows that we cannot control the error of gradual self-training with the hinge loss even if we had infinite examples, so the ramp loss is important for Theorem 3.2.
Example 3.7.
Even under the separation, no label shift, and gradual shift assumptions, given , there exists distributions and with , but if then ( gets every example in wrong), where we use the hinge loss in selftraining.
We only analyzed the statistical effects here—the hinge loss tends to work better in practice because it is much easier to optimize and is convex for linear models.
3.5 Selftraining without domain shift
Example 3.4 showed that when the distribution shifts, the loss of gradual selftraining can grow exponentially (though the nonadaptive baseline has unbounded error). Here we show that if we have no distribution shift, the error can only grow linearly: if , given a classifier with loss , if we do gradual selftraining the loss is at most .
Proposition 3.8.
Given , distributions , and model with , where
In Appendix A, we show that selftraining can indeed hurt without domain shift: given a classifier with loss on , selftraining on can increase the classifier’s loss on to , but here the nonadaptive baseline has error .
4 Theory for the Gaussian setting
In this section we study an idealized Gaussian setting to understand conditions under which self-training can have better than exponential error bounds: we show that if we begin with a good classifier, the distribution shifts are not too large, and we have infinite unlabeled data, then gradual self-training maintains a good classifier.
4.1 Setting
We assume is an isotropic Gaussian in dimensions for each . We can shift the data to have mean , so we suppose:
(16) 
Where and for each . As usual, we assume the shifts are gradual: for some , . We assume that the means of the two classes do not get closer than the shift, or else it would be impossible to distinguish between no shift, and the distributions of the two classes swapping: so for all . We assume infinite unlabeled data (access to ) in our analysis.
Given labeled data in the source, we use the objective:
(17) 
For unlabeled data, selftraining performs descent steps on an underlying objective function [15], which we focus on:
(18) 
We assume is a continuous, nonincreasing function which is strictly decreasing on : these are regularity conditions which the hinge, ramp, and logistic losses satisfy. If then [15].
The algorithm we analyze begins by choosing from labeled data in , and then updates the parameters with unlabeled data from for :
(19) 
Note that we do not show that self-training actually converges to the constrained minimum in Equation (19), and prior work only shows that self-training descends on this objective. We leave this optimization analysis to future work.
4.2 Analysis
Let where . Note that minimizes the 01 error on . Our main theorem says that if we start with a regularized classifier that is near , which we can learn from labeled data, and the distribution shifts are not too large, then we recover the optimal . The key challenge is that the unlabeled loss in dimensions is nonconvex, with multiple local minima, so directly minimizing does not guarantee a solution that minimizes the labeled loss .
Theorem 4.1.
Assuming the Gaussian setting, if , then we recover .
Proving this reduces to proving the singlestep case. At each step , if we have a classifier that was close to , then we will recover . We give intuition here and the formal proof in Appendix B.
We first show that if changes by a small amount, the optimal parameters (for the labeled loss) does not change too much. Then since is close to , is not too far away from . The key step in our argument is showing that the unique minimum of the unlabeled loss in the neighborhood of , is —looking for a minimum nearby is important because if we deviate too far we might select other “bad” minima. We consider arbitrary near and construct a pairing of points in , using a convexity argument to show that contributes more to the loss of than .
5 Experiments
Our theory leads to practical insights: we show that regularization and label sharpening are important for gradual self-training, that leveraging the gradual shift structure improves target accuracy, and give intuition for when the gradual shift assumption may not help. We run experiments on three datasets (see Appendix C for more details):
Gaussian: Synthetic dataset where the distribution for each of two classes is a dimensional Gaussian, where . The means and covariances of each class vary over time. The model gets labeled samples from the source domain, and unlabeled samples from each of intermediate domains. This dataset resembles our Gaussian setting but the covariance matrices are not isotropic, and the number of labeled and unlabeled samples is finite and on the order of the dimension .
Rotating MNIST: Rotating MNIST is a semi-synthetic dataset where we rotate each MNIST image by an angle between 0 and 60 degrees. We split the 50,000 MNIST training set images into a source domain (images rotated between 0 and 5 degrees), intermediate domain (rotations between 5 and 60 degrees), and a target domain (rotations between 55 degrees and 60 degrees). Note that each image is seen at exactly one angle, so the training procedure cannot track a single image across different angles.
Portraits: A real dataset comprising photos of high school seniors across years [10]. The model’s goal is to classify gender. We split the data into a source domain (first 2000 images), intermediate domain (next 14000 images), and target domain (next 2000 images).
5.1 Does the gradual shift assumption help?
Our goal is to see if adapting to the gradual shift sequentially helps compared to directly adapting to the target. We evaluate four methods:
Source: simply train a classifier on the labeled source examples.
Target self-train: repeatedly self-train on the unlabeled target examples, ignoring the intermediate examples.
All self-train: pool all the unlabeled examples from the intermediate and target domains, and repeatedly self-train on this pooled dataset to adapt the initial source classifier.
Gradual self-train: sequentially use self-training on unlabeled data in each successive intermediate domain, and finally self-train on unlabeled data on the target domain, to adapt the initial source classifier.
For the Gaussian and MNIST datasets, we ensured that the target self-train method sees as many unlabeled target examples as gradual self-train sees across all the intermediate examples. Since Portraits is a real dataset we cannot synthesize more examples from the target, so target self-train uses fewer unlabeled examples here.
For rotating MNIST and Portraits we used a 3-layer convolutional network with dropout and batchnorm on the last layer, that was able to achieve accuracy on held out examples in the source domain. For the Gaussian dataset we used a logistic regression classifier with $\ell_2$ regularization. For each step of self-training, we filter out the 10% of images where the model's prediction was least confident. Appendix C shows similar findings without this filtering. To account for variance in initialization and optimization, we ran each method 5 times and give confidence intervals. More experimental details are in Appendix C.
Method | Gaussian | Rot MNIST | Portraits
Source | 47.7±0.3 | 31.9±1.7 | 75.3±1.6
Target ST | 49.6±0.0 | 33.0±2.2 | 76.9±2.1
All ST | 92.5±0.1 | 38.0±1.6 | 78.9±3.0
Gradual ST | 98.8±0.0 | 87.9±1.2 | 83.8±0.8
Table 1 shows that leveraging the gradual structure leads to improvements over the baselines on all three datasets.
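The confidence filtering step used in these experiments can be sketched as follows. This is an illustrative version that assumes confidence is measured by the absolute value of a real-valued binary score; the exact confidence measure used for the neural network is not specified here, and `keep_most_confident` is a hypothetical helper.

```python
import numpy as np

def keep_most_confident(scores, keep_fraction=0.9):
    """Return sorted indices of the most confident examples.

    Assumption: confidence is |score| for a real-valued binary score.
    """
    scores = np.asarray(scores, dtype=float)
    n_keep = int(round(keep_fraction * len(scores)))
    order = np.argsort(-np.abs(scores), kind="stable")  # most confident first
    return np.sort(order[:n_keep])

scores = np.array([2.0, -0.1, 0.7, -3.0, 0.05, 1.2, -0.9, 0.4, 2.5, -1.5])
kept = keep_most_confident(scores)          # drops the least confident 10%
pseudolabels = np.where(scores[kept] >= 0, 1, -1)
```

Only the kept examples and their pseudolabels would then be passed to the training step of self-training.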
5.2 Important ingredients for gradual selftraining
Our theory suggests that regularization and label sharpening are important for gradual self-training, because without regularization and label sharpening there is no incentive for the model to change (Section 3.4). However, prior work suggests that overparameterized neural networks trained with stochastic gradient methods have strong implicit regularization [16, 17]: in the supervised setting they perform well without explicit regularization even though the number of parameters is much larger than the number of data points. Is this implicit regularization enough for gradual self-training?
In our experiments, we see that even without explicit regularization, or with 'soft' probabilistic labels, gradual self-training does slightly better than the non-adaptive source classifier, suggesting that this implicit regularization may have some effect. However, explicit regularization and 'hard' labeling give a much larger accuracy boost.
Regularization is important: We repeat the same experiment as Section 5.1, comparing gradual self-training with or without regularization (for the unregularized variant, we disable dropout and batchnorm [18] in the neural network experiments). In both cases, we first train an unregularized model on labeled examples in the source domain. Then, we either turn on regularization during self-training, or keep the model unregularized. We control the original model to be the same in both cases to see if regularization helps in the self-training process, as opposed to in learning a better supervised classifier. Table 2 shows that accuracies are significantly better with regularization, even though unregularized performance is still better than the non-adaptive source classifier.
Soft labeling hurts: We ran the same experiment as Section 5.1, comparing gradual self-training with hard labeling versus using probabilistic labels output by the model. Table 2 shows that accuracies are better with hard labels.
Method | Gaussian | Rot MNIST | Portraits
Soft Labels | 90.5±1.9 | 44.1±2.3 | 80.1±1.8
No Reg | 84.6±1.1 | 45.8±2.5 | 76.5±1.0
Gradual ST | 99.3±0.0 | 83.8±2.5 | 82.6±0.8
Regularization is still important with more data: In supervised learning, the importance of regularization diminishes as we have more training examples: if we had access to infinite data (the population), we would not need regularization. On the other hand, for gradual domain adaptation, the theory says regularization is needed to adapt to the dataset shift even with infinite data, and predicts that regularization remains important even if we increase the sample size.
To test this hypothesis, we construct a rotating MNIST dataset where we increase the sample sizes. The source domain $P_0$ consists of $N$ images from MNIST. $P_t$ then consists of these same $N$ images, rotated by an angle that grows with $t$. The goal is to get high accuracy on $P_T$: these images rotated by 60 degrees. The model doesn't have to generalize to unseen images, but to seen images at different angles. We compare using regularization versus not using regularization during gradual self-training.
Method | N=2000 | N=5000 | N=20,000
Source | 28.3±1.4 | 29.9±2.5 | 33.9±2.6
No Reg | 55.7±3.9 | 53.6±4.0 | 55.1±3.9
Reg | 93.1±0.8 | 91.7±2.4 | 87.4±3.1
Table 3 shows that regularization is still important here, and the gap between regularized and unregularized gradual self-training does not shrink much with more data.
5.3 When does gradual shift help?
Our theory in Section 3 says that gradual self-training works well if the shift between domains is small in Wasserstein-infinity distance, but it may not be enough for the total variation or KL-divergence between $P_t$ and $P_{t+1}$ to be small.
To test this, we run an experiment on a modified version of the rotating MNIST dataset. We keep the source and target domains the same as before, but change the intermediate domains. In Table 1 we saw that gradual self-training works well if we have intermediate images rotated by gradually increasing rotation angles. Another type of gradual transformation is to gradually introduce more examples from the target rotation range (55 to 60 degrees). That is, in each intermediate domain, some fraction of the examples are MNIST images rotated by 0 to 5 degrees, and the remaining examples are MNIST images rotated by 55 to 60 degrees, with the share of target-range images growing from one domain to the next. Here the total-variation distance between successive domains is small, but intuitively the Wasserstein distance is large because each image undergoes a large rotation.
As the theory suggests, here gradual selftraining does not outperform directly selftraining on the target—gradual selftraining gets accuracy on the target, while direct adaptation to the target gets over 5 runs. We hope this gives practitioners some insight into not just the strengths of gradual selftraining, but also its limitations.
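The contrast between the two kinds of gradual shift can be reproduced on a one-dimensional toy example. The sketch below is illustrative only: points on the real line stand in for images, and the Wasserstein-infinity distance is computed with the sorted matching, which is valid for equal-size one-dimensional samples. All names and constants are hypothetical.

```python
import numpy as np

def w_inf_1d(a, b):
    """W-infinity between equal-size 1-D samples via the sorted matching."""
    return float(np.max(np.abs(np.sort(a) - np.sort(b))))

def tv_discrete(a, b):
    """Total variation distance between the empirical distributions of a and b."""
    values = np.union1d(a, b)
    pa = np.array([np.mean(a == v) for v in values])
    pb = np.array([np.mean(b == v) for v in values])
    return 0.5 * float(np.sum(np.abs(pa - pb)))

n, steps, shift = 100, 10, 10.0

def mixture_domain(k):
    """A fraction k/steps of the points already sit at the target location."""
    x = np.zeros(n)
    x[: k * n // steps] = shift
    return x

def translated_domain(k):
    """Every point has moved a fraction k/steps of the way to the target."""
    return np.full(n, k * shift / steps)

mix = [(tv_discrete(mixture_domain(k), mixture_domain(k + 1)),
        w_inf_1d(mixture_domain(k), mixture_domain(k + 1))) for k in range(steps)]
move = [(tv_discrete(translated_domain(k), translated_domain(k + 1)),
         w_inf_1d(translated_domain(k), translated_domain(k + 1))) for k in range(steps)]
```

In the mixture sequence each step has small total variation but a Wasserstein-infinity distance equal to the full source-to-target gap, while in the translated sequence each step has maximal total variation but a small Wasserstein-infinity distance.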
6 Related work
Self-training is a popular method in semi-supervised learning [19, 20] and domain adaptation [21, 22, 23], and is related to entropy minimization [24]. Theory in semi-supervised learning [25, 26, 27] analyzes when unlabeled data can help, but does not show bounds for particular algorithms. Recent work shows that a robust variant of self-training can mitigate the tradeoff between standard and adversarial accuracy [28]. Related to self-training is co-training [29], which assumes that the input features can be split into two or more views that are conditionally independent given the label.
Unsupervised domain adaptation, where the goal is to directly adapt from a labeled source domain to an unlabeled target domain, is widely studied [30]. The key challenge for domain adaptation theory is when the source and target supports do not overlap [11, 12], which is typical in the modern high-dimensional regime. Importance weighting based methods [31, 32, 13] assume the domains overlap, with bounds depending on the expected density ratios between the source and target. Even if the domains overlap, the density ratio often scales exponentially in the dimension. These methods also assume that $P(Y \mid X)$ is the same for the source and target. The theory of $\mathcal{H}\Delta\mathcal{H}$-divergence [14, 33] gives conditions for when a model trained on the source does well on the target without any adaptation. Empirical methods aim to learn domain invariant representations [34, 35, 36] but there are no theoretical guarantees for these methods [11]. These methods require additional heuristics [37], and work well on some tasks but not others [2, 38]. Our work suggests that the structure from gradual shifts, which appears often in applications, can be a way to build theory and algorithms for regimes where the source and target are very different.
Hoffman et al. [39], Michael et al. [40], Markus et al. [41], Bobu et al. [2] among others propose approaches for gradual domain adaptation. This setting differs from online learning [42], lifelong learning [43], and concept drift [44, 45, 46], since we only have unlabeled data from shifted distributions. To the best of our knowledge, we are the first to develop a theory for gradual domain adaptation, and investigate when and why the gradual structure helps.
Acknowledgements.
The authors would like to thank the Open Philanthropy Project and the Stanford Graduate Fellowship program for funding. This work is also partially supported by the Stanford Data Science Initiative and the Stanford Artificial Intelligence Laboratory.
We are grateful to Stephen Mussman, Robin Jia, Csaba Szepesvari, Shai BenDavid, Lin Yang, Rui Shu, Michael Xie, Aditi Raghunathan, Yining Chen, Colin Wei, Pang Wei Koh, Fereshte Khani, Shengjia Zhao, and Albert Gu for insightful discussions.
Reproducibility.
Our code is at https://github.com/plambda/gradual_domain_adaptation. Code, data, and experiments will be available on CodaLab soon.
References
 Vergara et al. [2012] A. Vergara, S. Vembu, T. Ayhan, M. A. Ryan, M. L. Homer, and R. Huerta. Chemical gas sensor drift compensation using classifier ensembles. Journal of the American Statistical Association, 1:320–329, 2012.
 Bobu et al. [2018] A. Bobu, E. Tzeng, J. Hoffman, and T. Darrell. Adapting to continuously shifting domains. In International Conference on Learning Representations Workshop (ICLR), 2018.
 Farshchian et al. [2019] A. Farshchian, J. A. Gallego, J. P. Cohen, Y. Bengio, L. E. Miller, and S. A. Solla. Adversarial domain adaptation for stable brain-machine interfaces. In International Conference on Learning Representations (ICLR), 2019.
 Sethi and Kantardzic [2017] T. S. Sethi and M. Kantardzic. On the reliable detection of concept drift from streaming unlabeled data. Expert Systems with Applications, 82:77–99, 2017.
 Chapelle et al. [2006] O. Chapelle, A. Zien, and B. Schölkopf. Semi-Supervised Learning. MIT Press, 2006.
 Xie et al. [2020] Q. Xie, M. Luong, E. Hovy, and Q. V. Le. Self-training with noisy student improves imagenet classification. arXiv, 2020.
 Uesato et al. [2019] J. Uesato, J. Alayrac, P. Huang, R. Stanforth, A. Fawzi, and P. Kohli. Are labels required for improving adversarial robustness? In Advances in Neural Information Processing Systems (NeurIPS), 2019.
 Carmon et al. [2019] Y. Carmon, A. Raghunathan, L. Schmidt, P. Liang, and J. C. Duchi. Unlabeled data improves adversarial robustness. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
 Najafi et al. [2019] A. Najafi, S. Maeda, M. Koyama, and T. Miyato. Robustness to adversarial perturbations in learning from incomplete data. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
 Ginosar et al. [2017] S. Ginosar, K. Rakelly, S. M. Sachs, B. Yin, C. Lee, P. Krähenbühl, and A. A. Efros. A century of portraits: A visual historical record of american high school yearbooks. IEEE Transactions on Computational Imaging, 3, 2017.
 Zhao et al. [2019] H. Zhao, R. T. des Combes, K. Zhang, and G. J. Gordon. On learning invariant representation for domain adaptation. In International Conference on Machine Learning (ICML), 2019.
 Shu et al. [2018] R. Shu, H. H. Bui, H. Narui, and S. Ermon. A DIRT-T approach to unsupervised domain adaptation. In International Conference on Learning Representations (ICLR), 2018.
 Huang et al. [2006] J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems (NeurIPS), 2006.
 Ben-David et al. [2010] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1):151–175, 2010.
 Amini and Gallinari [2003] M. Amini and P. Gallinari. Semi-supervised learning with explicit misclassification modeling. In International Joint Conference on Artificial Intelligence (IJCAI), 2003.
 Zhang et al. [2017] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR), 2017.
 Hardt et al. [2016] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning (ICML), pages 1225–1234, 2016.
 Ioffe and Szegedy [2015] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pages 448–456, 2015.
 Lee [2013] D. Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshop on Challenges in Representation Learning, 2013.
 Sohn et al. [2020] K. Sohn, D. Berthelot, C. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. arXiv, 2020.
 Long et al. [2013] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu. Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2200–2207, 2013.
 Zou et al. [2019] Y. Zou, Z. Yu, X. Liu, B. Kumar, and J. Wang. Confidence regularized self-training. arXiv preprint arXiv:1908.09822, 2019.
 Inoue et al. [2018] N. Inoue, R. Furuta, T. Yamasaki, and K. Aizawa. Cross-domain weakly-supervised object detection through progressive domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5001–5009, 2018.
 Grandvalet and Bengio [2005] Y. Grandvalet and Y. Bengio. Entropy regularization. In Semi-Supervised Learning, 2005.
 Rigollet [2007] P. Rigollet. Generalization error bounds in semi-supervised classification under the cluster assumption. Journal of Machine Learning Research (JMLR), 8:1369–1392, 2007.
 Singh et al. [2008] A. Singh, R. Nowak, and J. Zhu. Unlabeled data: Now it helps, now it doesn’t. In Advances in Neural Information Processing Systems (NeurIPS), 2008.
 Ben-David et al. [2008] S. Ben-David, T. Lu, and D. Pal. Does unlabeled data provably help? Worst-case analysis of the sample complexity of semi-supervised learning. In Conference on Learning Theory (COLT), 2008.
 Raghunathan et al. [2020] A. Raghunathan, S. M. Xie, F. Yang, J. C. Duchi, and P. Liang. Understanding and mitigating the tradeoff between robustness and accuracy. arXiv, 2020.
 Blum and Mitchell [1998] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Conference on Learning Theory (COLT), 1998.
 Quiñonero-Candela et al. [2009] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset shift in machine learning. The MIT Press, 2009.
 Shimodaira [2000] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90:227–244, 2000.
 Sugiyama et al. [2007] M. Sugiyama, M. Krauledat, and K. Muller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research (JMLR), 8:985–1005, 2007.
 Mansour et al. [2009] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Conference on Learning Theory (COLT), 2009.
 Tzeng et al. [2014] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
 Ganin and Lempitsky [2015] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (ICML), pages 1180–1189, 2015.
 Tzeng et al. [2017] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2017.
 Hoffman et al. [2018] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. Cycada: Cycle consistent adversarial domain adaptation. In International Conference on Machine Learning (ICML), 2018.
 Peng et al. [2019] X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang. Moment matching for multi-source domain adaptation. In International Conference on Computer Vision (ICCV), 2019.
 Hoffman et al. [2014] J. Hoffman, T. Darrell, and K. Saenko. Continuous manifold based adaptation for evolving visual domains. In Computer Vision and Pattern Recognition (CVPR), 2014.
 Michael et al. [2018] G. Michael, E. Dennis, K. B. Mara, B. Peter, and M. Dorit. Gradual domain adaptation for segmenting whole slide images showing pathological variability. In Image and Signal Processing, 2018.
 Markus et al. [2018] W. Markus, B. Alex, and P. Ingmar. Incremental adversarial domain adaptation for continually changing environments. In International Conference on Robotics and Automation (ICRA), 2018.
 Shalev-Shwartz [2007] S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, 2007.
 Silver et al. [2013] D. L. Silver, Q. Yang, and L. Li. Lifelong machine learning systems: Beyond learning algorithms. In Association for the Advancement of Artificial Intelligence (AAAI), volume 13, 2013.
 Kramer [1988] A. H. Kramer. Learning despite distribution shift. In Connectionist Models Summer School, 1988.
 Bartlett [1992] P. L. Bartlett. Learning with a slowly changing distribution. In Conference on Learning Theory (COLT), 1992.
 Bartlett et al. [1996] P. L. Bartlett, S. Ben-David, and S. R. Kulkarni. Learning changing concepts by exploiting the structure of change. Machine Learning, 41, 1996.
 Liang [2016] Percy Liang. Statistical learning theory. https://web.stanford.edu/class/cs229t/notes.pdf, 2016.
Appendix A Proofs for Section 3
Restatement of Example 3.1.
Even under the separation, no label shift, gradual shift, and bounded data assumptions, there exist distributions and a source model that gets loss on the source (), but high loss on the target: . Self-training directly on the target does not help: . This holds true even if every domain is separable, so .
Proof.
We construct an example in 2D, where we consider the set of regularized linear models , where . Such a classifier is parametrized by where with , and . The output of the model is , and the predicted label is .
We first define the source distribution :
(20) 
(21) 
Consider the source classifier . The classifier classifies all examples correctly, in particular , and . In addition, the ramp loss is , that is:
(22) 
We now construct distributions and :
(23) 
(24) 
(25) 
(26) 
Basically, the second coordinate starts at 1 and decreases over time when the label is , and starts at and increases over time when the label is . We note that .
Now, classifies everything incorrectly in . , and but the corresponding labels in are and respectively. Accordingly, the ramp loss .
Self-training on cannot fix the problem. gets every example incorrect, so all the pseudolabels are incorrect. In particular, let be the pseudolabels produced using ; we have and . Self-training on this is now a convex optimization problem, which attains 0 loss, for example using the classifier , , but any such classifier also gets all the examples incorrect. Note that the max-margin classifier on the source also exhibits the same issue (that is, it can get all the examples wrong after the dataset shift), by a simple extension of this example.
Finally, the classifier , , gets every label correct in all distributions, .
∎
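The qualitative behavior in this example can be reproduced with a small numerical sketch. The domains below are a hypothetical instantiation chosen for illustration, not the exact construction in the proof: the first coordinate always equals the label, while the second coordinate drifts from the label (source) through zero to the negated label (target). The names `domain`, `w_src`, and `w_good` are assumptions of this sketch.

```python
import numpy as np

def ramp(margin):
    # Ramp loss: 1 for margin <= 0, 0 for margin >= 1, linear in between.
    return np.clip(1.0 - margin, 0.0, 1.0)

def ramp_loss(w, b, X, y):
    return float(np.mean(ramp(y * (X @ w + b))))

def predict(w, b, X):
    return np.where(X @ w + b >= 0, 1, -1)

# Hypothetical 2D domains (an assumption, not the paper's exact distributions).
y = np.array([1, -1])

def domain(t):
    # t = 0: source, t = 1: intermediate, t = 2: target.
    return np.stack([y, (1 - t) * y], axis=1).astype(float)

P0, P1, P2 = domain(0), domain(1), domain(2)

w_src, b_src = np.array([0.0, 1.0]), 0.0    # relies on the drifting coordinate
w_good, b_good = np.array([1.0, 0.0]), 0.0  # relies on the stable coordinate

source_loss = ramp_loss(w_src, b_src, P0, y)
target_loss = ramp_loss(w_src, b_src, P2, y)

# Direct self-training: pseudolabel the target with the source model.
pseudo = predict(w_src, b_src, P2)
# The source model already attains zero ramp loss on the pseudolabeled
# target, so it is one minimizer that self-training may return.
pseudo_fit_loss = ramp_loss(w_src, b_src, P2, pseudo)
target_err_after = float(np.mean(predict(w_src, b_src, P2) != y))

# A classifier using the stable coordinate is correct in every domain.
good_losses = [ramp_loss(w_good, b_good, P, y) for P in (P0, P1, P2)]
```

In this sketch every pseudolabel is the negation of the true label, so any model with zero ramp loss on the pseudolabeled target must predict every true label incorrectly, which mirrors the argument in the proof.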
Restatement of Theorem 3.2.
Given with and marginals on are the same so . Suppose satisfy the bounded data assumption, and we have initial model , and unlabeled samples from , and we set . Then with probability at least over the sampling of , letting :
(27) 
We begin by stating and proving some lemmas that formalize the proof outline in the main paper. The first is a standard lemma that says if we learn a regularized linear classifier from labeled examples from a distribution , then the classifier is almost as good as the optimal regularized linear classifier on , and the classifier gets closer to optimal as increases. We bound the error of the classifier using the Rademacher complexity of regularized linear models .
Lemma A.1.
Given samples from a joint distribution over inputs and labels , and suppose . Let and be the empirical and population minimizers of the ramp loss respectively:
(28) 
(29) 
Then with probability at least ,
(30) 
Proof.
We begin with a standard bound (see e.g. Theorem 9, page 70 in [47]), where the generalization error on the left is bounded by the Rademacher complexity:
(31) 
Here, is the composition of the loss with the set of regularized linear models, and is the Rademacher complexity. It now suffices to bound .
We first use Talagrand’s lemma, which says that if is an L-Lipschitz function (that is, for all ), then:
(32) 
In our case, we let , in which case where is the ramp loss. The Lipschitz constant of the ramp loss is 1, so .
Finally, we need to bound , the Rademacher complexity of regularized linear models. This is a standard argument (e.g. see Theorem 11, page 82 in [47]) and we get:
(33) 
∎
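To make the last step of this proof concrete, the following sketch estimates the empirical Rademacher complexity of norm-bounded linear models by Monte Carlo and compares it with the standard bound BR/√n. It rests on assumptions that are not taken from the paper: the class has no bias term, `R` bounds the weight norm, `B` bounds the input norm, and the constants need not match those in Equation (33).

```python
import numpy as np

def empirical_rademacher_linear(X, R, num_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> w.x : ||w||_2 <= R} (no bias term, an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Each row is one draw of independent uniform +/-1 signs.
    sigma = rng.choice([-1.0, 1.0], size=(num_draws, n))
    # sup over ||w|| <= R of (1/n) sum_i sigma_i w.x_i
    #   = (R / n) * ||sum_i sigma_i x_i||_2
    sup_values = R * np.linalg.norm(sigma @ X, axis=1) / n
    return float(sup_values.mean())

rng = np.random.default_rng(1)
B, R, d = 1.0, 2.0, 5
estimates, bounds = {}, {}
for n in (100, 400, 1600):
    X = rng.normal(size=(n, d))
    # Rescale each row so that its norm is uniform in [0, B].
    X *= B * rng.uniform(size=(n, 1)) / np.linalg.norm(X, axis=1, keepdims=True)
    estimates[n] = empirical_rademacher_linear(X, R)
    bounds[n] = B * R / np.sqrt(n)
```

Under these assumptions the estimate stays below BR/√n and shrinks as n grows, which is the behavior the lemma relies on.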
The next lemma shows that the error (0-1 loss) of is low on , even though the margin loss may be high. Intuitively, classifies most points in correctly with geometric margin , so after a small distribution shift , these points are still correctly classified since the margin acts as a ‘buffer’ protecting us from misclassification.
Lemma A.2.
If , , and the marginals on are the same so , then
Proof.
Let be the weights and bias of the regularized linear model, with .
Intuitively, if the ramp loss for a regularized linear model is low, then most points are classified correctly with high geometric margin (distance to decision boundary). Formally, we first show (essentially by Markov’s inequality) that , where we recall that is the ramp loss, which is bounded between 0 and 1:
Here, the inequality on the third line follows because if where , then , from the definition of the ramp loss.
This gives us:
(34) 
The high-level intuition of the next step is that, since the shift is small, only points with can be misclassified after the distribution shift; by the previous step there are not too many of these, so the error of on is small.
Formally, fix with , and let be a mapping such that for all measurable , , with for (footnote 1: we need the here because a mapping with exactly the distance may not exist, although if and have densities then such a mapping does exist); then we have:
where the inequality follows from Cauchy-Schwarz:
Combining this with Equation (34) gives us:
(35) 
Since was arbitrary, by taking the infimum over all , we get:
(36) 
which is what we wanted to show.
∎
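The two steps of this proof can be checked numerically. The sketch below uses synthetic data and assumes the ramp loss is `clip(1 - margin, 0, 1)`; `rho` is the size of the shift and `R` the weight norm, with `rho * R < 1`. Every point is moved a distance `rho` towards the wrong side of the decision boundary, which is a worst-case shift chosen for illustration and not the paper's general setting. The quantity `bound` is the sketch's own consequence of the two steps and is not claimed to match the constant in the lemma.

```python
import numpy as np

def ramp(margin):
    # Ramp loss: 1 for margin <= 0, 0 for margin >= 1, linear in between.
    return np.clip(1.0 - margin, 0.0, 1.0)

rng = np.random.default_rng(0)
n, d, R, rho = 5000, 3, 2.0, 0.2             # rho * R = 0.4 < 1
w = rng.normal(size=d)
w *= R / np.linalg.norm(w)                   # ||w||_2 = R
b = 0.1

X = rng.normal(size=(n, d))
# Noisy labels, so that the ramp loss on the original data is nonzero.
y = np.where(X @ w + b + 0.5 * rng.normal(size=n) >= 0, 1.0, -1.0)

margins_P = y * (X @ w + b)
ramp_loss_P = ramp(margins_P).mean()

# Step 1 (Markov-type): few points have margin at most rho * R.
small_margin_P = np.mean(margins_P <= rho * R)
bound = ramp_loss_P / (1.0 - rho * R)

# Step 2 (buffer): shift each point by rho against its label direction.
X_shift = X - rho * y[:, None] * w / np.linalg.norm(w)
margins_Q = y * (X_shift @ w + b)            # equals margins_P - rho * ||w||
err_Q = np.mean(margins_Q <= 0)              # ties counted as errors
```

Only points whose original margin is at most `rho * R` end up misclassified after the shift, and the fraction of such points is controlled by the ramp loss on the original data.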
From the previous lemma, has low error on , or in other words only occasionally mislabels examples from . The next lemma says that if we minimize the ramp loss on a distribution where the points are only occasionally mislabeled, then we learn a classifier with low (good) ramp loss as well.
Lemma A.3.
Given random variables (defined on the same measure space) with joint distribution , where denotes the distribution over inputs, and denote distinct distributions over labels. If then for any , . Here denotes the distribution where the input is sampled from and then the label is sampled from .
Proof.
Let . The proof is by algebra, where we recall that is the ramp loss, which is bounded between 0 and 1:
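As a numerical sanity check on the idea behind this lemma, the sketch below flips a random fraction of labels and compares the ramp loss under the clean and the noisy labels for several arbitrary linear models. The data, the flip rate `p`, and the form of the ramp loss are assumptions of this sketch. Because the ramp loss lies between 0 and 1, the per-example losses can differ only where the two labels disagree, so the gap in average loss cannot exceed the disagreement rate.

```python
import numpy as np

def ramp(margin):
    # Ramp loss: 1 for margin <= 0, 0 for margin >= 1, linear in between.
    return np.clip(1.0 - margin, 0.0, 1.0)

rng = np.random.default_rng(0)
n, d, p = 10000, 4, 0.1
X = rng.normal(size=(n, d))
y = np.where(X[:, 0] >= 0, 1.0, -1.0)        # "true" labels
flip = rng.uniform(size=n) < p               # mislabel roughly a fraction p
y_noisy = np.where(flip, -y, y)
disagree = np.mean(y != y_noisy)

gaps = []
for _ in range(20):                          # arbitrary linear models
    w, b = rng.normal(size=d), rng.normal()
    scores = X @ w + b
    loss_clean = ramp(y * scores).mean()
    loss_noisy = ramp(y_noisy * scores).mean()
    gaps.append(abs(loss_clean - loss_noisy))
max_gap = max(gaps)
```

The check holds for every model in the loop, not only for a minimizer, which is why the lemma can be applied to the self-trained classifier.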