1 Introduction
Deep learning has shown strong generalization on various benchmarks when the training and testing data are drawn from the same distribution. However, a model trained on an existing benchmark cannot generalize effectively to a new scenario because of the well-known domain shift problem. Domain shift hinders the deployment of deep learning models in the open environment of the real world. Many Domain Adaptation (DA) methods Ganin et al. (2016); Long et al. (2017); Tzeng et al. (2017); Saito et al. (2018); Su et al. (2020) have been proposed to enable models to generalize from a label-rich source domain to an unlabeled target domain. Most of the domain adaptation literature focuses on adapting from a source domain to one target domain Ganin et al. (2016); Long et al. (2017); Tzeng et al. (2017); Saito et al. (2018); Su et al. (2020); Peng et al. (2019b) or to multiple target domains Gong et al. (2013); Gholami et al. (2020); Liu et al. (2020). However, many machine learning models deployed in the real world are exposed to non-stationary situations where different domains are acquired sequentially and their distributions vary over time, such as deep learning models in autonomous vehicles. When multiple target domains arrive sequentially, existing DA methods collapse because the models quickly adapt to new domains and easily forget knowledge of old target domains. This critical defect, known as catastrophic forgetting, prevents most DA algorithms from being put into practice.
This work studies the problem of continual domain adaptation, in which models are required to continually adapt to new target domains without harming performance on the previously observed domains. Previous continual DA methods Bobu et al. (2018); Mancini et al. (2019) require priors about domain labels to conduct domain adversarial training or to build a graph of inter-domain relationships. However, these priors may not be accessible in practice, e.g., when the target domain is a compound of multiple subdomains. Besides, the existing method of Bobu et al. (2018) applies naive sample replay to avoid catastrophic forgetting, which still suffers from large negative backward transfer.
In this work, we propose gradient regularized contrastive learning (GRCL) to tackle “domain shift” and “catastrophic forgetting” in a unified framework. At the core of GRCL, gradient regularization plays two key roles: (1) it enforces that the gradient of the contrastive loss does not increase the supervised training loss on the source domain, which maintains the discriminative power of source features and in turn improves the target features learned by the contrastive loss; (2) it regularizes the gradient update on the new domain so that it does not increase the classification loss on the domain memory, which enables the model to adapt to an incoming target domain while preserving performance on previously observed domains. Hence GRCL can jointly learn semantically discriminative and domain-invariant features, without relying on priors of domain-related information.
Specifically, we construct a domain memory to store a subset of image samples from each domain. GRCL leverages the contrastive loss to jointly learn image representations with the domain memory and the incoming target domain. Because instances that belong to the same category exhibit similar appearances, the contrastive loss encourages the model to push the target samples, in feature space, towards the samples in the domain memory that belong to the same class. Inputs from different domains that belong to the same category are thus aligned in the feature space, resulting in a domain-invariant image representation. However, simply combining contrastive learning and supervised learning (with source domain data) as a multitask learning objective could hurt the discriminative power of the learned features, due to the conflict between the contrastive loss and the cross-entropy loss. As shown in the purple area of Fig. 1(b), the contrastive loss inevitably pushes some discriminative features towards instance features with less discriminative power, as the contrastive loss only captures visual similarity while ignoring semantics. This problem is empirically verified by the experimental results in Section 4.2.2. To resolve this conflict, we enforce that the gradient of the contrastive loss does not increase the cross-entropy loss on the source domain, which maintains the discriminative power of source features. Since the source instance features act as the anchors for the target instances in contrastive learning, this also improves the quality of the learned target features. To overcome “catastrophic forgetting”, we construct an additional set of constraints, each of which enforces that the classification loss on a domain-specific memory never increases.
To summarize, our contributions are as follows: (1) We propose gradient regularized contrastive learning to jointly learn both discriminative and domain-invariant representations without relying on priors of domain labels. At the core of our method, gradient regularization performs two key functions: maintaining the semantically discriminative power of the learned features and overcoming catastrophic forgetting. (2) Experiments on multiple continual domain adaptation benchmarks demonstrate the strong performance of our method compared to the state of the art.
2 Problem Formulation and Evaluation Metric
Let $\mathcal{D}_S = \{(x_i, y_i)\}$ be the labeled dataset of the source domain, where each example $(x_i, y_i)$ is composed of an image $x_i$ and its label $y_i$. Continual domain adaptation defines a sequence of adaptation tasks $\mathcal{T}_1, \dots, \mathcal{T}_T$. Different domains have a common label space but distinct data distributions. On the $t$-th task $\mathcal{T}_t$, there is an unlabeled target domain dataset $\mathcal{D}_t$. The goal is to learn a label prediction model $f_\theta$ that generalizes well on multiple target domains $\mathcal{D}_1, \dots, \mathcal{D}_T$.
We propose two metrics to evaluate a model adapting over a stream of target domains, namely average accuracy (ACC) and average backward transfer (BWT). After the model finishes training on adaptation task $\mathcal{T}_t$, we evaluate its individual performance on the testing sets of the current and all previously observed domains. Let $R_{i,j}$ denote the test accuracy of the model on domain $j$ after it has finished adapting to domain $i$. We use index $0$ to denote the source domain. ACC and BWT can be calculated as
$$\mathrm{ACC} = \frac{1}{T+1}\sum_{j=0}^{T} R_{T,j}, \qquad \mathrm{BWT} = \frac{1}{T}\sum_{j=0}^{T-1}\left(R_{T,j} - R_{j,j}\right).$$
ACC represents the average performance over all domains when the model has finished the last adaptation task. BWT indicates the influence that adapting to a later domain $i$ has on the performance on a previously observed domain $j < i$. A negative BWT indicates that adapting to a new domain decreases the performance on previous domains. The larger these two metrics, the better the model.
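The two metrics can be illustrated with a short computation. The sketch below assumes the GEM-style definitions of Lopez-Paz and Ranzato (2017), with the source domain at index 0 included in both averages; the function name `acc_and_bwt` and the accuracy matrix `R` are hypothetical and only serve as an illustration.

```python
def acc_and_bwt(R):
    """Compute (ACC, BWT) from an accuracy matrix.

    R[i][j] is the test accuracy on domain j after adapting to domain i.
    Index 0 is the source domain (assumption: it is included in both averages).
    """
    T = len(R) - 1  # number of target domains
    # ACC: mean accuracy over all domains after the last adaptation task.
    acc = sum(R[T][j] for j in range(T + 1)) / (T + 1)
    # BWT: mean change on each earlier domain between the time it was
    # first learned (R[j][j]) and the end of the sequence (R[T][j]).
    bwt = sum(R[T][j] - R[j][j] for j in range(T)) / T
    return acc, bwt


# Made-up accuracies for one source domain and two target domains.
R = [
    [90.0, 0.0, 0.0],
    [88.0, 80.0, 0.0],
    [87.0, 78.0, 70.0],
]
acc, bwt = acc_and_bwt(R)
print(acc, bwt)  # BWT is -2.5 here: the two earlier domains lost 3 and 2 points
```

Entries above the diagonal are never read, so they can be left at any placeholder value.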
3 Method
3.1 Gradient Regularized Contrastive Learning
Contrastive learning Wu et al. (2018); He et al. (2020); Chen et al. (2020a) has recently shown a strong capability for mapping images to an embedding space where similar images are close together and dissimilar images are far apart. Inspired by this, we use the contrastive loss to push each target instance towards the source instances that have a similar appearance. Instances of the same category that look similar but come from different domains are thus aligned in the feature space, resulting in domain-invariant features. Specifically, we first define an episodic memory $\mathcal{M}_t$, which stores a subset of observed images from target domain $\mathcal{D}_t$. When the model is adapting to the $t$-th domain, we have a domain memory $\mathcal{M} = \bigcup_{i<t} \mathcal{M}_i$, which is the union of all past episodic memories.
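Let $f_{\theta_{t-1}}$ be the model after it has finished the adaptation training on domain $\mathcal{D}_{t-1}$. We construct a domain-affiliated feature bank $\mathcal{V} = \{v_i\}$ to store the instance features from both the source and the various target domains (see Fig. 2). Here, $v_i$ is a compact representation of the input $x_i$, which is initialized by $h(f_{\theta_{t-1}}(x_i))$ and normalized so that $\|v_i\|_2 = 1$. We choose $h$ to be a 2-layer MLP that maps the semantic features to a low-dimensional vector. Given a query input $x$ with feature $q = h(f_\theta(x))$, we assume there is a single positive key $k^+$ that matches it. $k^+$ is designed as $h(f_\theta(\mathrm{aug}(x)))$, where $\mathrm{aug}(\cdot)$ denotes the data augmentation. With similarity measured by dot product, we consider the contrastive loss function of InfoNCE Oord et al. (2018):
$$\mathcal{L}_{ctr} = -\log \frac{\exp(q \cdot k^+ / \tau)}{\exp(q \cdot k^+ / \tau) + \sum_{k^-} \exp(q \cdot k^- / \tau)}, \tag{1}$$
where $k^-$ represents a negative key for $q$, and $\tau$ is the temperature. During training, we update the sampled keys with the up-to-date model by $v^{(n+1)} = m\, v^{(n)} + (1 - m)\, h(f_\theta(x))$, where $n$ indicates the training iteration and $m$ is the momentum. The adaptation objective then becomes a multitask combination of the supervised training loss on $\mathcal{D}_S$ and the contrastive learning loss on the feature bank $\mathcal{V}$:
$$\mathcal{L} = \mathcal{L}_{ce}(\mathcal{B}_S) + \lambda\, \mathcal{L}_{ctr}(\mathcal{B}), \tag{2}$$
where $\mathcal{L}_{ce}$ is the cross-entropy loss on the source domain $\mathcal{D}_S$. $\mathcal{B}_S$ and $\mathcal{B}$ denote minibatches of samples from $\mathcal{D}_S$ and from the union $\mathcal{M} \cup \mathcal{D}_t$, respectively. $\lambda$ is the hyperparameter that trades off the two losses.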
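As a concrete illustration of the InfoNCE loss and the momentum update of bank features described above, here is a minimal pure-Python sketch on plain lists. The function names and the default values `tau=0.1` and `m=0.5` are placeholders chosen for the example, not the settings used in the paper, and the update rule `m * old + (1 - m) * new` followed by re-normalization is an assumption in the style of memory-bank methods.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(q, k_pos, k_negs, tau=0.1):
    """InfoNCE loss for one query, one positive key and a list of negative keys."""
    logits = [dot(q, k_pos) / tau] + [dot(q, k) / tau for k in k_negs]
    top = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - logits[0]


def momentum_update(bank_feature, new_feature, m=0.5):
    """Blend a stored bank feature with a fresh feature, then re-normalize."""
    mixed = [m * b + (1.0 - m) * n for b, n in zip(bank_feature, new_feature)]
    return l2_normalize(mixed)


q = l2_normalize([1.0, 0.0])
aligned = info_nce(q, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]], tau=1.0)
misaligned = info_nce(q, [0.0, 1.0], [[1.0, 0.0]], tau=1.0)
print(aligned < misaligned)  # a positive key close to the query gives a lower loss
```

In practice the same computation would run on batched tensors inside an autodiff framework; the list version only makes the arithmetic explicit.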
However, our experiments show that the multitask objective of Eq. 2 brings only marginal improvements on the target domain, as shown in Fig. 1(c). We hypothesize that the problem arises from the conflict between the two objectives, namely the cross-entropy loss and the contrastive loss. As illustrated in the purple area in Fig. 1(b), discriminative instance features can be pushed towards less discriminative instance features because of their similar appearance. This conflict is verified by experiments showing that the contrastive loss degrades model performance on the source domain (see Section 4.2.2), which suggests that it hurts the discriminative power of image features from the source domain. Since the source instance features act as the anchors for the target instances in contrastive learning, the conflict may also have an undesirable influence on the discriminative power of image features from the target domain.
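To resolve the conflict between the cross-entropy loss $\mathcal{L}_{ce}$ and the contrastive loss $\mathcal{L}_{ctr}$, we enforce the regularization that $\mathcal{L}_{ce}$ on the source domain should not increase when minimizing $\mathcal{L}_{ctr}$. The final domain adaptation objective with constraints can then be rephrased as:
$$\min_\theta \; \mathcal{L}_{ce}(\mathcal{B}_S) + \lambda\, \mathcal{L}_{ctr}(\mathcal{B}) \quad \text{s.t.} \quad \langle g_{ctr}, g_{ce} \rangle \ge 0, \tag{3}$$
where $\langle g_{ctr}, g_{ce} \rangle$ is the inner product of the gradients of the losses $\mathcal{L}_{ctr}$ and $\mathcal{L}_{ce}$ w.r.t. the model parameters:
$$\langle g_{ctr}, g_{ce} \rangle = \left\langle \frac{\partial \mathcal{L}_{ctr}}{\partial \theta}, \frac{\partial \mathcal{L}_{ce}}{\partial \theta} \right\rangle \ge 0. \tag{4}$$
Here $\theta$ is initialized with the model trained on the source domain with labels. As illustrated in Fig. 2 (right), if the constraint of Eq. 4 is satisfied, then the parameter update is unlikely to hurt the discriminative power of image features from the source domain, as it does not increase the cross-entropy loss on the source domain. If a violation occurs, we propose to project $g_{ctr}$ onto the closest gradient that satisfies the constraint of Eq. 4. GRCL thus enjoys the benefits of both the semantically discriminative features offered by gradient regularization and the domain-invariant features learned by the contrastive loss.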
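For a single inner-product constraint, the projection onto the closest gradient has a simple closed form: subtract from the contrastive gradient its component along the cross-entropy gradient whenever the two conflict. The following sketch works on flat Python lists; the names `project_single`, `g_ctr` and `g_ce` are illustrative, and the closed form is the standard Euclidean projection onto a half-space rather than code taken from the paper.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_single(g_ctr, g_ce):
    """Project g_ctr onto the half-space {g : <g, g_ce> >= 0}.

    If the constraint already holds, the gradient is returned unchanged.
    Otherwise the component of g_ctr that opposes g_ce is removed, which
    yields the closest vector (in L2 distance) satisfying the constraint.
    """
    inner = dot(g_ctr, g_ce)
    if inner >= 0.0:
        return list(g_ctr)
    scale = inner / dot(g_ce, g_ce)  # negative; g_ce is non-zero since inner < 0
    return [gc - scale * ge for gc, ge in zip(g_ctr, g_ce)]


# Conflicting case: the second coordinate of g_ctr opposes g_ce.
projected = project_single([1.0, -2.0], [0.0, 1.0])
print(projected, dot(projected, [0.0, 1.0]))

# Non-conflicting case: the gradient passes through untouched.
untouched = project_single([1.0, 2.0], [0.0, 1.0])
```

After projection the inner product with `g_ce` is exactly zero in the conflicting case, so a small step along the projected direction does not increase the source loss to first order.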
3.2 Overcoming Catastrophic Forgetting
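In this section, we extend GRCL with additional constraints to overcome "catastrophic forgetting", where each constraint enforces that the classification loss on a domain-specific memory never increases. Mathematically, the constraints can be formulated as:
$$\mathcal{L}_{ce}(f_\theta; \mathcal{M}_i) \le \mathcal{L}_{ce}(f_{\theta_{t-1}}; \mathcal{M}_i), \quad \forall\, i < t, \tag{5}$$
where $\theta_{t-1}$ denotes the model parameters at the end of the adaptation task on $\mathcal{D}_{t-1}$. While Eq. 5 effectively permits positive backward transfer of GRCL, it comes with a heavy computational burden at training time. At each training step, we need to solve the inequality constraints of all episodic memories, which becomes prohibitive when the size of $\mathcal{M}_i$ and the number of adaptation tasks are large. Alternatively, we propose a much more efficient way to approximate Eq. 5 by
$$\mathcal{L}_{ce}(f_\theta; \mathcal{M}) \le \mathcal{L}_{ce}(f_{\theta_{t-1}}; \mathcal{M}), \tag{6}$$
where $\mathcal{M}$ is the domain memory. Instead of computing the loss on each individual old domain, Eq. 6 only computes the loss on a sampled batch of images from the domain memory $\mathcal{M}$. To compute $\mathcal{L}_{ce}(\cdot\,; \mathcal{M})$, we adopt the standard $k$-means clustering algorithm Caron et al. (2018) to generate pseudo labels with a pretrained model obtained from the previous domain adaptation task. Combining the adaptation objective of Eq. 3 and the anti-forgetting objective of Eq. 6, we have the final objective for continual domain adaptation:
$$\min_\theta \; \mathcal{L}_{ce}(\mathcal{B}_S) + \lambda\, \mathcal{L}_{ctr}(\mathcal{B}) \tag{7}$$
$$\text{subject to} \quad \left\langle \frac{\partial \mathcal{L}_{ctr}}{\partial \theta}, \frac{\partial \mathcal{L}_{ce}}{\partial \theta} \right\rangle \ge 0, \qquad \mathcal{L}_{ce}(f_\theta; \mathcal{M}) \le \mathcal{L}_{ce}(f_{\theta_{t-1}}; \mathcal{M}).$$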
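The first constraint helps contrastive learning to learn discriminative features. The second constraint ensures that the average loss on previously observed domains does not increase, which prevents the model from forgetting the knowledge acquired on preceding domains. Mathematically, we want to find the gradient update $\tilde{g}$ satisfying:
$$\min_{\tilde{g}} \; \frac{1}{2}\, \| g - \tilde{g} \|_2^2 \tag{8}$$
$$\text{subject to} \quad \langle \tilde{g}, g_S \rangle \ge 0, \qquad \langle \tilde{g}, g_{\mathcal{M}} \rangle \ge 0,$$
where $g$ is the gradient to be projected, $g_S$ is the gradient of the cross-entropy loss on the source minibatch, and $g_{\mathcal{M}}$ is the gradient computed using a batch of random samples from the domain memory $\mathcal{M}$. Eq. 8 is a quadratic program (QP) on $p$ variables (the number of parameters in the neural network), which can be in the millions for deep learning models. To solve Eq. 8 efficiently, we work in its dual space, which results in a much smaller QP with only $2$ variables:
$$\min_{u} \; \frac{1}{2}\, u^\top G G^\top u + g^\top G^\top u \quad \text{subject to} \quad u \ge 0, \tag{9}$$
where $G = (g_S, g_{\mathcal{M}})^\top \in \mathbb{R}^{2 \times p}$ and we discard the constant term $\frac{1}{2} g^\top g$. The formal proof of Eq. 9 is provided in Appendix B. Once the solution $u^*$ to Eq. 9 is found, we can solve Eq. 8 with the gradient update $\tilde{g} = G^\top u^* + g$. Appendix A summarizes the training protocol of GRCL.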
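Because the dual has only two variables, it can be solved without a general QP solver. The sketch below is one possible way to do this, not the authors' implementation: it assumes the dual form $\min_{u \ge 0} \frac{1}{2} u^\top G G^\top u + g^\top G^\top u$ with $G$ stacking the two constraint gradients as rows, enumerates the stationary point of each face of the non-negative quadrant, and keeps the feasible candidate with the lowest objective. All function and variable names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def solve_dual_2d(a11, a12, a22, b1, b2, eps=1e-12):
    """Minimize 0.5*u^T A u + b^T u over u >= 0 for a 2x2 PSD matrix A."""
    def objective(u1, u2):
        quad = a11 * u1 * u1 + 2.0 * a12 * u1 * u2 + a22 * u2 * u2
        return 0.5 * quad + b1 * u1 + b2 * u2

    candidates = [(0.0, 0.0)]  # both constraints inactive in the dual
    if a11 > eps:              # face u2 = 0
        candidates.append((max(0.0, -b1 / a11), 0.0))
    if a22 > eps:              # face u1 = 0
        candidates.append((0.0, max(0.0, -b2 / a22)))
    det = a11 * a22 - a12 * a12
    if det > eps:              # interior stationary point: A u = -b
        u1 = (-b1 * a22 + b2 * a12) / det
        u2 = (-b2 * a11 + b1 * a12) / det
        if u1 >= 0.0 and u2 >= 0.0:
            candidates.append((u1, u2))
    return min(candidates, key=lambda u: objective(*u))


def project_gradient(g, g_src, g_mem):
    """Closest gradient to g with non-negative inner product with g_src and g_mem."""
    b1, b2 = dot(g_src, g), dot(g_mem, g)
    if b1 >= 0.0 and b2 >= 0.0:
        return list(g)  # no violation, keep the gradient as is
    a11, a12, a22 = dot(g_src, g_src), dot(g_src, g_mem), dot(g_mem, g_mem)
    u1, u2 = solve_dual_2d(a11, a12, a22, b1, b2)
    # Recover the primal solution: g_tilde = g + G^T u
    return [gi + u1 * si + u2 * mi for gi, si, mi in zip(g, g_src, g_mem)]


g_tilde = project_gradient([-1.0, 0.5], [1.0, 0.0], [1.0, 1.0])
print(g_tilde, dot(g_tilde, [1.0, 0.0]), dot(g_tilde, [1.0, 1.0]))
```

The three inner products and the two squared norms are the only quantities that touch the full parameter dimension, so the cost of the projection is dominated by a handful of dot products per step.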
4 Experiments
4.1 Dataset and Methods
Digits
includes five digit datasets (MNIST LeCun et al. (1998), MNIST-M Ganin and Lempitsky (2015), USPS Hull (1994), SynNum Ganin and Lempitsky (2015) and SVHN Netzer et al. (2011)). Each domain has images for training and images for testing. We consider a continual domain adaptation problem of SynNum → MNIST → MNIST-M → USPS → SVHN.
DomainNet
Peng et al. (2019a) is one of the largest domain adaptation datasets, with approximately 0.6 million images distributed among 345 categories. Each domain randomly selects images for training and images for testing. Five different domains from DomainNet are used to build a continual domain adaptation task: Clipart → Real → Infograph → Sketch → Painting.
Office-Caltech
Gong et al. (2012) includes 10 common categories shared by the Office-31 Saenko et al. (2010) and Caltech-256 Griffin et al. (2007) datasets. The Office-31 dataset contains three domains, DSLR, Amazon and WebCam, whose images are collected in different environments. We consider a continual domain adaptation task of DSLR → Amazon → WebCam → Caltech.
We compare GRCL with five alternatives: (1) DANN Ganin et al. (2016), a classic method based on domain adversarial training; (2) MCD Saito et al. (2018), which maximizes the classifier discrepancy to reduce the domain gap; (3) DADA Peng et al. (2019b), which disentangles the domain-specific features from category identity; (4) CUA Bobu et al. (2018), which adopts the adversarial training based method ADDA Tzeng et al. (2017) to reduce the domain shift and a sample replay loss to avoid forgetting; (5) GRA, which replaces the contrastive learning in GRCL with ADDA Tzeng et al. (2017).
For fair comparison with the previous state of the art, we adopt LeNet-5 LeCun et al. (1998) on Digits, ResNet-50 He et al. (2016) on DomainNet, and ResNet-18 He et al. (2016) on the Office-Caltech benchmark. Each domain has the same number of training and testing images. For contrastive learning, we use a batch size of 256, feature update momentum of , number of negatives as 1024, and a training schedule of 240 epochs. The MLP head uses a hidden dimension of . Following Wu et al. (2018); He et al. (2020), the temperature in Eq. 1 is set as . For data augmentation, we use random color jittering, Gaussian blur and random horizontal flip. The image samples in the episodic memory are selected as the model predictions with top-1024 confidence. Among the methods using memory, CUA, GRA and GRCL use exactly the same size of episodic memory for each domain and the same $k$-means algorithm to generate pseudo labels.
4.2 Experimental Results
Fig. 3 shows the results on the three benchmarks. The larger the average accuracy (ACC) and backward transfer (BWT), the better the model. When the model has finished training on the last target domain, we report the ACC over all observed domains. As shown in Fig. 3, GRCL consistently achieves better ACC across the three benchmarks, suggesting that the model trained with GRCL has the best generalization capability across domains. Unsurprisingly, most methods exhibit negative BWT, as catastrophic forgetting exists. The methods using memory (CUA, GRA, GRCL) perform better than the methods without memory (DANN, MCD, DADA), especially on the BWT metric. These results highlight the importance of memory in the continual DA problem.
Among the memory-based methods, GRA and GRCL achieve significantly better BWT on the three benchmarks, suggesting the effectiveness of gradient constraints for combating catastrophic forgetting. GRCL consistently achieves better ACC than GRA across all benchmarks. This is partially because GRCL uses all the samples from the domain memory (which caches samples from all previously observed domains) to jointly learn the representation, while GRA only uses the source domain and the current target domain to learn features. Fig. 4 depicts the evolution of classification accuracy on the first target domain as more domains are observed. GRCL consistently exhibits minimal forgetting and even positive backward transfer on the Office-Caltech benchmark. Table 1 summarizes the detailed results for all methods on the three continual DA benchmarks. Each entry in Table 1 represents the mean and standard deviation of classification accuracy over five runs in the corresponding experiments.
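Table 1: ACC and BWT on the three continual DA benchmarks (mean ± standard deviation over five runs).

| Methods                   | Digits ACC   | Digits BWT   | DomainNet ACC | DomainNet BWT | Office-Caltech ACC | Office-Caltech BWT |
| DANN Ganin et al. (2016)  | 74.56 ± 0.14 | 11.37 ± 0.09 | 30.18 ± 0.13  | 10.27 ± 0.07  | 81.78 ± 0.05       | 8.75 ± 0.07        |
| MCD Saito et al. (2018)   | 76.46 ± 0.24 | 10.90 ± 0.11 | 31.68 ± 0.20  | 10.36 ± 0.15  | 82.63 ± 0.13       | 8.70 ± 0.12        |
| DADA Peng et al. (2019b)  | 77.30 ± 0.19 | 11.40 ± 0.04 | 32.14 ± 0.14  | 8.67 ± 0.09   | 82.05 ± 0.03       | 8.30 ± 0.05        |
| CUA Bobu et al. (2018)    | 82.12 ± 0.18 | 6.10 ± 0.12  | 34.22 ± 0.16  | 5.53 ± 0.14   | 84.83 ± 0.10       | 4.65 ± 0.08        |
| GRA                       | 84.10 ± 0.15 | 0.93 ± 0.10  | 35.84 ± 0.19  | 1.15 ± 0.16   | 86.53 ± 0.11       | 0.03 ± 0.03        |
| GRCL                      | 85.34 ± 0.10 | 1.0 ± 0.03   | 37.74 ± 0.13  | 0.67 ± 0.12   | 87.23 ± 0.06       | 0.05 ± 0.02        |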
4.2.1 Importance of Memory Size and Training Schedule
| memory size    | 256   | 512   | 1024  | 2048  |
| Digits         | 83.00 | 84.12 | 85.34 | 85.41 |
| DomainNet      | 33.28 | 35.75 | 37.74 | 37.83 |

| training epoch | 120   | 180   | 240   | 300   |
| Digits         | 80.10 | 83.46 | 85.34 | 85.38 |
| DomainNet      | 34.80 | 36.50 | 37.74 | 38.16 |
| Office-Caltech | 80.93 | 84.70 | 87.23 | 87.28 |
Table 3 shows the ACC of GRCL under various memory sizes per domain. The ACC benefits from a larger memory size, because a larger domain memory provides more negative samples for contrastive learning, resulting in better self-supervised representation learning. As the Office-Caltech dataset only has 100 training samples per domain, the ablation of memory sizes is not applicable to it.
4.2.2 Importance of Gradient Regularization in GRCL
In this section, we evaluate the effect of gradient regularization in GRCL in the single target domain setting. We compare three different methods: (1) source only, in which the model is trained with supervision on the source domain; (2) Multitask, which uses the multitask objective of Eq. 2; (3) GRCL. The trade-off hyperparameter in Eq. 2 uses the best value obtained via grid search on each target domain. SynNum, Clipart and DSLR are used as the source domain for the Digits, DomainNet and Office-Caltech datasets, respectively. We report the classification accuracy averaged over the different target domains. As shown in Fig. 5, Multitask improves the performance on the target domain over the source only baseline. Because of the conflict between the cross-entropy loss and the contrastive loss, the model trained with Multitask sacrifices performance on the original source domain, which in turn hurts the discriminative power of image features of the target domain. Benefiting from gradient regularization, GRCL enjoys both the domain-invariant feature learning brought by the contrastive loss and the semantically discriminative features provided by the cross-entropy loss, which together help it achieve better classification accuracy on the target domain.
5 Related Works
Continual domain adaptation
Bobu et al. (2018) adopt adversarial training to align different domain distributions and reuse self-labeled samples of previous domains to retain classification performance. Mancini et al. (2019) propose AdaGraph to encode inter-domain relationships with domain metadata (e.g., the viewpoint from which an image was captured), and use it to produce domain-specific BN parameters. However, existing methods have two limitations: (1) Bobu et al. (2018) require domain labels to perform domain adversarial training, and Mancini et al. (2019) require additional metadata as a prior to build the domain relation graph, which may not be accessible in practice. In contrast, GRCL leverages self-supervised learning to learn a domain-invariant representation, which does not require priors of domain-related information. (2) Bobu et al. (2018) use sample replay to avoid catastrophic forgetting. However, simple buffer-replay based methods still suffer from negative transfer, as shown by Lopez-Paz and Ranzato (2017). In contrast, our method explicitly constrains domain adaptation learning on the new domain so that it does not increase the loss on previous domains. Thus it theoretically permits positive backward transfer that the existing method Lopez-Paz and Ranzato (2017) does not support.
Contrastive learning
has been a powerful self-supervised learning method for learning semantic image representations that can be transferred to downstream visual recognition tasks. It uses a contrastive loss to encourage the model to embed similar images closer to each other while keeping dissimilar images apart in the feature space. Different contrastive learning methods vary in the strategy used to generate positive and negative image pairs for each input. For example, Wu et al. (2018) sample the image pairs from a memory bank, He et al. (2020); Chen et al. (2020b) adopt a momentum encoder to generate the image pairs, and Tian et al. (2019); Chen et al. (2020a) produce the image pairs using the current batch of images. In this work, we use contrastive learning to jointly learn a domain-invariant image representation, and gradient regularization is proposed to maintain the discriminative power of the learned representation.
Continual learning
addresses catastrophic forgetting in a sequence of supervised learning tasks. One representative technique uses episodic memory Lopez-Paz and Ranzato (2017); Chaudhry et al. (2019) to store some training samples of old tasks, which are used to overcome catastrophic forgetting via constrained optimization. In contrast to continual learning, which considers a sequence of supervised learning tasks without the domain shift problem, continual domain adaptation aims to solve a sequence of unsupervised domain adaptation tasks under domain shift. Hence continual DA not only requires the learner to overcome catastrophic forgetting, as lifelong learning does, but also needs the learner to adapt to novel target domains with varying data distributions.
6 Conclusion
This work studies the problem of continual DA, which is one major obstacle in the deployment of modern AI models. We propose Gradient Regularized Contrastive Learning (GRCL) to jointly learn both discriminative and domain-invariant representations. At the core of our method, gradient regularization maintains the discriminative power of the features learned by the contrastive loss and overcomes catastrophic forgetting in the continual adaptation process. Our experiments demonstrate the competitive performance of GRCL against the state of the art.
7 Broader Impact
Deep neural networks have achieved great success on many supervised learning tasks, where abundant data annotations are available and the data distribution does not change over time. However, it remains challenging for deep neural networks to adapt to novel domains whose data distributions change over time and for which data annotations may not be available. The positive impact of this work is a method that enables models to continually adapt to environmental changes by leveraging past learning experience. At the same time, this work may increase the risk to data privacy, because more unlabeled data can now be used to improve AI models. Recently, many privacy-preserving training methods have been explored to improve data privacy, such as federated learning and SecureML. In the future, these privacy-preserving methods can be combined with a continual domain adaptation approach to improve AI models while preserving data privacy.
References
[1] Bobu et al. (2018). Adapting to continuously shifting domains. In ICLR Workshop.
[2] Caron et al. (2018). Deep clustering for unsupervised learning of visual features. In ECCV.
[3] Chaudhry et al. (2019). Efficient lifelong learning with A-GEM. In ICLR.
[4] Chen et al. (2020a). A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.
[5] Chen et al. (2020b). Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297.
[6] Ganin and Lempitsky (2015). Unsupervised domain adaptation by backpropagation. In ICML.
[7] Ganin et al. (2016). Domain-adversarial training of neural networks. JMLR.
[8] Gholami et al. (2020). Unsupervised multi-target domain adaptation: an information theoretic approach. TIP.
[9] Gong et al. (2013). Reshaping visual datasets for domain adaptation. In NIPS.
[10] Gong et al. (2012). Geodesic flow kernel for unsupervised domain adaptation. In CVPR.
[11] Griffin et al. (2007). Caltech-256 object category dataset.
[12] He et al. (2020). Momentum contrast for unsupervised visual representation learning. In CVPR.
[13] He et al. (2016). Deep residual learning for image recognition. In CVPR.
[14] Hull (1994). A database for handwritten text recognition research. PAMI.
[15] LeCun et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
[16] Liu et al. (2020). Compound domain adaptation in an open world. In CVPR.
[17] Long et al. (2017). Deep transfer learning with joint adaptation networks. In ICML.
[18] Lopez-Paz and Ranzato (2017). Gradient episodic memory for continual learning. In NIPS.
[19] Mancini et al. (2019). AdaGraph: unifying predictive and continuous domain adaptation through graphs. In CVPR.
[20] Netzer et al. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS Workshop.
[21] Oord et al. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
[22] Peng et al. (2019a). Moment matching for multi-source domain adaptation. In ICCV.
[23] Peng et al. (2019b). Domain agnostic learning with disentangled representations. In ICML.
[24] Saenko et al. (2010). Adapting visual category models to new domains. In ECCV.
[25] Saito et al. (2018). Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR.
[26] Su et al. (2020). Adapting object detectors with conditional domain normalization. arXiv preprint arXiv:2003.07071.
[27] Tian et al. (2019). Contrastive multiview coding. arXiv preprint arXiv:1906.05849.
[28] Tzeng et al. (2017). Adversarial discriminative domain adaptation. In CVPR.
[29] Wu et al. (2018). Unsupervised feature learning via non-parametric instance discrimination. In CVPR.
Appendix A Algorithm of Gradient Regularized Contrastive Learning
Appendix B Quadratic Program of Equation (8)
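Proof. The optimization objective in Equation (8) of the main paper is:
$$\min_{\tilde{g}} \; \frac{1}{2}\, \| g - \tilde{g} \|_2^2 \tag{10}$$
$$\text{subject to} \quad \langle \tilde{g}, g_S \rangle \ge 0, \qquad \langle \tilde{g}, g_{\mathcal{M}} \rangle \ge 0.$$
Replacing $\tilde{g}$ with $z$ and discarding the constant term $\frac{1}{2} g^\top g$, Equation (10) can be rephrased as
$$\min_{z} \; \frac{1}{2}\, z^\top z - g^\top z \tag{11}$$
$$\text{subject to} \quad G z \ge 0,$$
where $G = (g_S, g_{\mathcal{M}})^\top \in \mathbb{R}^{2 \times p}$ ($p$ is the number of parameters in the model). The Lagrangian of Equation (11) can be written as:
$$L(z, u) = \frac{1}{2}\, z^\top z - g^\top z - u^\top G z, \tag{12}$$
where $u \ge 0$. Defining the dual of Equation (12) as:
$$\phi(u) = \min_{z} L(z, u). \tag{13}$$
We can find the value $z^*$ that minimizes $L(z, u)$ by setting the derivative of $L(z, u)$ w.r.t. $z$ to zero:
$$\nabla_z L(z, u) = z - g - G^\top u = 0 \quad \Rightarrow \quad z^* = G^\top u + g. \tag{14}$$
Equation (13) can be simplified by substituting the value of $z^*$:
$$\phi(u) = -\frac{1}{2}\, u^\top G G^\top u - g^\top G^\top u - \frac{1}{2}\, g^\top g. \tag{15}$$
Maximizing $\phi(u)$ over $u \ge 0$ is equivalent to minimizing $-\phi(u)$, so after discarding the constant term $\frac{1}{2} g^\top g$ the Lagrangian dual of Equation (10) becomes
$$\min_{u} \; \frac{1}{2}\, u^\top G G^\top u + g^\top G^\top u \quad \text{subject to} \quad u \ge 0. \tag{16}$$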