The task of supervised learning for classification is based on the assumption that the training data and the testing data are sampled from the same distribution. Thus, supervised learning methods achieve state-of-the-art results when evaluated on popular benchmarks such as ImageNet imagenet. However, when such models are deployed in the real world, they yield sub-optimal results due to the inherent distribution-shift (domain-shift torralba2011unbiased) between the training data and the real-world environment (a.k.a. the target domain). While it is possible to obtain unlabeled samples from the target domain in most cases, the huge cost of data annotation prohibits the creation of a reliable labeled training dataset. To this end, Unsupervised Domain Adaptation (DA) methods have been proposed that aim to transfer knowledge from a labeled "source" dataset to an unlabeled "target" dataset under a domain-shift.
A popular strategy in Unsupervised DA is to learn the task-specific knowledge using supervision from the labeled source dataset, while learning a domain-invariant latent space where the features across the source and the target domains align. Such an alignment is enforced using statistical discrepancy minimization schemes JDA_Dist_Match; gong2012geodesic; pan2010domain_tca; peng2019moment; sun2016deepcoral or via an adversarial objective ganin2016domain; long2018conditional; tzeng2017adversarial; dctn; zhao2018adversarial, or by employing domain-specific transformations dsbn; adabn; featurewhiteningandconsensusloss. This alignment minimizes the domain-shift in the latent space, and improves the target generalization. However, the performance of Single-Source Domain Adaptation (SSDA) methods is usually determined by the choice of the source dataset kundu2020towards.
Recently, Multi-Source Domain Adaptation (MSDA) mansour2012multiple; zhao2020multi has garnered interest wherein multiple labeled source domains are used to transfer the task knowledge to the unlabeled target domain. A common approach mixtureofexperts; peng2019moment; dctn is to learn a shared feature extractor, along with domain-specific classifier modules (Fig. 1a), which yield an ensemble prediction for the target samples. However, an additional challenge in MSDA is to tackle the domain-shift and category-shift dctn between each pair of source-domains (Fig. 1b). To this end, auxiliary losses are enforced encouraging the model to learn domain invariant but class-discriminative representations. Ultimately, an appropriate alignment of all the domains in the latent space peng2019moment improves the generalization on the target domain (Fig. 1b).
In this work, we approach the MSDA problem from a different perspective. Since deep models are known to capture rich transferable representations long2015learning; oquab2014learning; howtransferable, we ask, is an auxiliary feature alignment loss really necessary? The motivation stems from the observation that deep models exhibit a strong inductive bias to implicitly align the latent features under supervision. This is demonstrated in Fig. 2. Following the prior approaches peng2019moment; dctn, we train domain-specific classifiers (Fig. 1b) and observe that the domains do not align in the latent space (Fig. 2a), which calls for an explicit feature alignment loss. However, when we enforce a classifier agreement on the class label for each input instance (Fig. 2b), we find that the domains tend to align, without requiring an explicit alignment loss.
This motivates us to further explore implicit alignment of latent features for MSDA. We aim to leverage the labeled data from multiple source domains, and the multi-classifier setup (Fig. 1a) employed in MSDA to perform alignment, without incorporating auxiliary components such as a domain discriminator dctn; zhao2018adversarial. In contrast to learning domain-specific classifier modules, we enforce an agreement among the classifiers (Fig. 1c) to align the domains in the latent space.
Since the target domain is unlabeled, we resort to the class labels predicted by the model being trained (a.k.a. pseudo-labels lee2013pseudo). The adaptation step encourages the classifiers to agree upon these pseudo-labels, which enables alignment of the target features with the source features that have classifier agreement owing to label supervision. Accordingly, we name the approach Self-supervised Implicit Alignment, abbreviated as SImpAl (pronounced "simple"). We observe that even under category-shift, implicit alignment can be leveraged to align the shared categories, without requiring additional components (e.g. fine-grained alignment cao2018partial1; kundu2020class; mada, adversarial discriminator dctn) or cumbersome training strategies (e.g. to handle arbitrary category-shifts kundu_cvpr_2020; dctn; UDA_2019_CVPR). We also find that classifier agreement can be leveraged as a cue to determine adaptation convergence.
To summarize, we demonstrate successful MSDA by leveraging implicit alignment exhibited by deep classifiers, corroborating the potential for designing simple and effective adaptation algorithms. We conduct extensive evaluation of our approach over five benchmark datasets, with two popular CNN backbone models (ResNet-50, ResNet-101 he2016deep_resnet) and derive insights from the empirical analysis.
2 Related Work
Here, we briefly review the related works and refer the reader to zhao2020multi for an extensive survey.
a) Single-Source Domain Adaptation (SSDA). Motivated by the seminal work by Ben-David et al. ben2010theory; ben2007analysis, a large number of SSDA methods dsbn; NormalAdaptfirstBackprop; ganin2016domain; gong2012geodesic; sta_open_set; long2015learning; long2016unsupervised have been proposed, that aim to learn domain-agnostic but class-discriminative representations. Inspired by the GAN framework goodfellow2014generative, a popular strategy is to employ adversarial learning hoffman2018cycada; adadepth; saito2018open; sankaranarayanan2018generate; tzeng2015simultaneous; tzeng2017adversarial; tzeng2014deep that aims to confuse a domain-discriminator, thereby aligning the latent features of the domains. Saito et al. saito2018maximum formulate an adversarial objective employing classifier discrepancy. In contrast, we aim to study a simpler approach which circumvents the training difficulties encountered in adversarial learning paradigms. Recently, consistency based regularizers chen2019crdoco; kundu2019_um_adapt; murez2018image; adadepth were proposed for domain adaptation. In our work, classifier agreement can be interpreted as a form of consistency at the output space which acts both as an implicit regularizer and as a means to perform latent space alignment for adaptation.
b) Multi-Source Domain Adaptation (MSDA). Several methods mixtureofexperts; peng2019moment; dctn; zhu2019aligning_mfsan learn domain-specific classifier modules and obtain a weighted ensemble prediction for the target samples, motivated by the distribution weighted combining rule hoffman2018algorithms; mansour2009domain; mansour2012multiple. Zhu et al. zhu2019aligning_mfsan employ an alignment loss between each source-target pair in domain-specific feature spaces. In addition, Peng et al. peng2019moment align each pair of source domains using kernel based moment matching and also propose a variant based on adversarial learning saito2018maximum. Xu et al. dctn employ multiple domain discriminators to achieve latent space alignment. In this work, we aim to explore a simple adaptation scheme that leverages implicit alignment in deep models. As a result, our approach is applicable even under category-shift among the source domains, while most prior methods mixtureofexperts; peng2019moment; zhu2019aligning_mfsan consider only a shared category set.
c) Self-training methods. Pseudo-labeling lee2013pseudo is a popular semi-supervised learning approach where "pseudo" class labels are assigned to unlabeled samples, typically using classifier confidence chen2011co; saito_asymmetric_tritraining; dctn; zou2019confidence; zou2018unsupervised or nearest neighbor assignment kundu2020class; pan2019transferrable; saito2020universal; zhang2019category, while the model is retrained using such samples. Confidence thresholding li2019bidirectional; saito_asymmetric_tritraining; dctn is commonly applied to minimize the noise in pseudo-labels. This introduces a sensitive threshold hyperparameter, requiring labeled target samples or domain expertise for precise tuning. Works such as Zou et al. zou2019confidence; zou2018unsupervised, Li et al. li2019bidirectional and Chen et al. chen2019crdoco propose various regularizers to improve pseudo-label predictions. Xu et al. dctn incorporate an adversarial alignment loss to mitigate the performance degradation arising from noisy pseudo-labels. In contrast, we aim to exploit classifier agreement to perform adaptation and improve the reliability of pseudo-labels without incorporating additional hyperparameters.
3 Self-supervised Implicit Alignment (SImpAl)
Notations. Let and denote the input and the output spaces. We consider labeled source domain datasets , where and a single unlabeled target domain dataset . Each source domain has a label-set , and the target label-set is defined as with classes. We learn a deep neural network model having a CNN based feature extractor, and classifier modules . For convenience, we denote the output of the network as a matrix , where represents function composition. The output matrix is obtained by stacking the logits produced by each classifier (see Fig. 3).
Overview. As is conventional in the MSDA methods mixtureofexperts; dctn, the multi-classifier setup is treated as an ensemble of diverse classifiers, and the class probabilities are obtained through a convex combination of each classifier’s prediction. The model is first trained with the categorical cross-entropy loss imposed on the combined data from all source domains. After a "warm-start", we introduce pseudo-labeled target samples into the training process. The adaptation is performed by enforcing the classifiers to agree on these pseudo-labels. We now describe the approach in detail.
3.1 Warm-start with source domains
To adapt the network to the target domain, we use pseudo-labeled target samples. Thus, we first aim to obtain reliable pseudo-labels by training the model on all source domains. We call this the warm-start process, which is performed as follows.
a) Learning with source domains. For each source-domain instance, we obtain the output matrix (see Fig. 3) and define the class probability vector as a convex combination of the probabilities assigned by each classifier,
where each row vector of the output matrix holds the logits of one classifier, and the per-classifier probabilities are obtained with the softmax function. Treating this convex combination as the class probability vector, we minimize the categorical cross-entropy loss using the labeled source samples,
The last term in Eq. 2 represents an upper bound for the categorical cross-entropy loss of the ensemble, and is obtained by applying Jensen’s inequality for convex functions. We consider the formulation in Eq. 2 to drive the classifiers to agree upon the label for each source sample. Thus, the training objective is,
The objective in Eq. 3 is minimized by mini-batch stochastic optimization. Each mini-batch contains an equal number of samples from each source domain. In practice, each classifier is given a distinct random initialization, and is trained with the same set of training samples at each mini-batch. Intuitively, this process gradually enables a higher degree of similarity among the classifiers (Fig. 1c) through an agreement in the predicted class labels for source samples. Note that, both the feature extractor and the multi-classifier module are shared across all source domains. This step provides a warm-start to introduce pseudo-labeled target samples into the training.
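The relation between the ensemble loss and its upper bound can be illustrated with a minimal pure-Python sketch. It assumes uniform weights in the convex combination and a softmax over each classifier's logits; the function names (`ensemble_probs`, `ensemble_ce`, `upper_bound_ce`) are hypothetical and are not taken from the original implementation, which would typically use tensor operations.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one classifier's logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_probs(logit_matrix):
    """Uniform convex combination of per-classifier probabilities (assumption).

    logit_matrix: one row of class logits per classifier.
    """
    probs = [softmax(row) for row in logit_matrix]
    n = len(probs)
    return [sum(p[k] for p in probs) / n for k in range(len(probs[0]))]

def ensemble_ce(logit_matrix, label):
    """Cross-entropy of the ensemble prediction for the true label."""
    return -math.log(ensemble_probs(logit_matrix)[label])

def upper_bound_ce(logit_matrix, label):
    """Mean of the per-classifier cross-entropies.

    By Jensen's inequality (-log is convex) this is never smaller
    than ensemble_ce for the same input.
    """
    losses = [-math.log(softmax(row)[label]) for row in logit_matrix]
    return sum(losses) / len(losses)

# Two classifiers, three classes; the classifiers disagree on this sample.
logits = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
print(ensemble_ce(logits, 0), upper_bound_ce(logits, 0))
```

Minimizing the upper bound penalizes every classifier individually, which is one way to read the statement that this formulation drives the classifiers to agree on the label.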
b) Determining the convergence of warm-start. The next question we address is, how to tell if a model is trained sufficiently for the target domain? Intuitively, we would like to train the model until there is a saturation in the target (pseudo-label) accuracy. However, with unlabeled target samples, measuring the pseudo-label accuracy is not possible. Thus, we propose the classifier agreement as a criterion to determine the convergence. The classifier agreement for an instance is defined as,
where each classifier's predicted class is the index of its highest logit, and the indicator function returns 1 when the condition is true and 0 otherwise. Intuitively, when each classifier predicts the same class for a given sample, we say that the classifiers "agree". Thus, the classifier agreement is 1 when the classifiers agree, and 0 otherwise.
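As an illustration of this definition, the following sketch computes the agreement indicator for one sample from its stacked logits. The names `predicted_class` and `classifier_agreement` are hypothetical, and ties between logits are broken by taking the first maximal index, which is an assumption of this sketch.

```python
def predicted_class(logits):
    """Index of the highest logit (first index wins on ties)."""
    return max(range(len(logits)), key=lambda k: logits[k])

def classifier_agreement(logit_matrix):
    """Return 1 if every classifier predicts the same class, else 0.

    logit_matrix: one row of class logits per classifier.
    """
    preds = [predicted_class(row) for row in logit_matrix]
    return 1 if all(p == preds[0] for p in preds) else 0

# Both classifiers pick class 0 here, so they agree.
print(classifier_agreement([[2.0, 0.5, -1.0], [1.0, 0.2, 0.3]]))
```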
As we shall show in Sec. 4.2, the target pseudo-label accuracy is higher whenever the classifiers agree nguyen2019self; yu2019does. Thus, classifier agreement is used to filter out target samples having a higher degree of noise in pseudo-labels. Further, we estimate the fraction of target samples for which there is an agreement in the class predictions among the classifiers. Thus, we define the target agreement rate as,
We hypothesize that the performance on target samples attains a saturation when the agreement rate converges. Thus, we determine the warm-start interval based on the convergence of the target agreement rate.
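A minimal sketch of the target agreement rate and of one possible way to detect its convergence is given below. The plateau test (`has_converged`, with its `window` and `tol` arguments) is purely an assumption made for illustration; the text does not specify how convergence of the agreement rate is detected.

```python
def agreement_rate(batch_logits):
    """Fraction of samples on which all classifiers predict the same class.

    batch_logits: list of logit matrices, one per target sample,
    each with one row of class logits per classifier.
    """
    def agree(logit_matrix):
        preds = [max(range(len(r)), key=r.__getitem__) for r in logit_matrix]
        return len(set(preds)) == 1
    return sum(agree(m) for m in batch_logits) / len(batch_logits)

def has_converged(history, window=3, tol=0.005):
    """Hypothetical plateau test: the last `window` rates differ by < `tol`."""
    if len(history) < window:
        return False
    recent = history[-window:]
    return max(recent) - min(recent) < tol

# Four target samples, two classifiers, two classes; three samples agree.
batch = [
    [[2.0, 0.0], [1.0, 0.0]],
    [[0.0, 2.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [3.0, 0.0]],
]
print(agreement_rate(batch))
```

Because the rate is computed from predictions alone, it can be monitored on the unlabeled target set throughout training.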
3.2 Introducing target data
After the warm-start, we introduce target samples into the training process. The pseudo-labels are obtained from the classifier predictions as in Eq. 1, i.e. as the class assigned the highest probability by the ensemble.
We consider the following strategy for pseudo-labeling. To begin with, we select only those target samples for which there is a classifier agreement, since the labels are seen to be more accurate for such samples (verified in Sec. 4.2). Thus, we obtain a subset of target samples with classifier agreement. Secondly, inspired by curriculum learning bengio2009curriculum; zou2018unsupervised, we form an easy-to-hard sampling strategy for this subset. For this purpose, we obtain the average classifier margin as a weight for each target instance,
where the two indices correspond to the highest and the second highest logit. Intuitively, the margin measures a form of confidence in prediction. Target samples that are farther from the decision boundaries receive a higher margin (see Fig. 7c for the geometrical interpretation). We show in Sec. 4.2 that, in general, samples with a higher margin are more likely to possess correct pseudo-labels. Thus, target samples are sorted by margin, and are fed to the training pipeline in decreasing order of margin. Finally, the pseudo-labels on this subset are updated periodically, after a fixed number of epochs. With this strategy, we formalize the training objective for adaptation using the target samples as,
After introducing the target samples from , we train on both source and target samples, in alternate mini-batches, i.e. we minimize the objectives in Eq. 3 and Eq. 7 in alternate mini-batches. Finally, the network is trained until the target agreement rate shows convergence. This enables a simple and effective adaptation pipeline using implicit alignment. The algorithm is given in Algo. 1.
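The selection and ordering of target samples described in this section can be sketched as follows. The sketch assumes that the margin is computed per classifier (highest minus second-highest logit) and then averaged, which is one reading of the "average classifier margin"; the function names and the `(sample_id, logit_matrix)` layout are hypothetical.

```python
def agrees(logit_matrix):
    """True if all classifiers predict the same class."""
    preds = [max(range(len(r)), key=r.__getitem__) for r in logit_matrix]
    return len(set(preds)) == 1

def average_margin(logit_matrix):
    """Mean over classifiers of (highest logit - second-highest logit)."""
    margins = []
    for row in logit_matrix:
        top, second = sorted(row, reverse=True)[:2]
        margins.append(top - second)
    return sum(margins) / len(margins)

def curriculum_order(samples):
    """Keep samples with classifier agreement, sorted easy-to-hard.

    samples: list of (sample_id, logit_matrix) pairs.
    Returns sample ids in decreasing order of average margin.
    """
    agreed = [(sid, m) for sid, m in samples if agrees(m)]
    agreed.sort(key=lambda item: average_margin(item[1]), reverse=True)
    return [sid for sid, _ in agreed]

samples = [
    ("a", [[3.0, 1.0, 0.0], [2.0, 1.0, 0.0]]),  # agree, margin 1.5
    ("b", [[3.0, 1.0, 0.0], [0.0, 1.0, 2.0]]),  # disagree, filtered out
    ("c", [[5.0, 0.0, 0.0], [4.0, 0.0, 1.0]]),  # agree, margin 4.0
]
print(curriculum_order(samples))
```

Re-running this selection periodically with the current model corresponds to the pseudo-label update step, since both the agreed subset and the ordering change as training proceeds.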
We present the results of our approach on five standard benchmark datasets: Office-Caltech, ImageCLEF, Office-31, Office-Home and the most challenging large-scale benchmark, DomainNet.
a) Prior Arts. We compare against Deep Domain Confusion (DDC) tzeng2014deep, Deep Adaptation Network (DAN) long2015learning, Deep CORAL (D-CORAL) sun2016deepcoral, Reverse Gradient (RevGrad) NormalAdaptfirstBackprop, Residual Transfer Network (RTN) long2016unsupervised, Joint Adaptation Network (JAN) long2017deepJAN, Maximum Classifier Discrepancy (MCD) saito2018maximum, Manifold Embedded Distribution Alignment (MEDA) wang2018visual, Adversarial Discriminative Domain Adaptation (ADDA) tzeng2017adversarial, Deep Cocktail Network (DCTN) dctn, Moment Matching (MSDA) peng2019moment and Multiple Feature Space Adaptation Network (MFSAN) zhu2019aligning_mfsan. Specifically, DDC, RevGrad, ADDA, MCD, DCTN use an adversarial alignment objective to perform adaptation, RTN learns a residual function to bridge the distribution discrepancy, and DAN, MFSAN, D-CORAL, JAN, MEDA and MSDA employ a kernel based moment matching scheme to align the domains.
b) Evaluation. For ImageCLEF and Office-based datasets, we follow the evaluation protocol in MFSAN zhu2019aligning_mfsan, while for DomainNet, we follow the protocol used in MSDA peng2019moment. Three types of baselines are considered: 1) Single Best (SB) refers to the best single-source transfer results for the target domain, 2) Source Combine (SC) refers to the scenario where all sources are combined into a single source domain to perform SSDA, 3) Multi-Source (MS) refers to the MSDA methods. We report the multi-run statistics (mean and standard deviation) obtained over three different runs.
c) Implementation Details. We implement our approach in PyTorch NEURIPS2019_9015_pytorch. We use the Adam adam optimizer, with learning rate and weight decay for stochastic optimization. The losses in Eq. 3 and Eq. 7 are alternately optimized and the target agreement rate (Eq. 5) is periodically monitored for convergence. We set as the update rate for the target pseudo-labels (line 12 in Algo. 1). The total number of training iterations is decided based on the convergence of the target agreement rate for each dataset. Following prior MSDA approaches zhu2019aligning_mfsan; peng2019moment, we use ResNet-50 (SImpAl$_{50}$) and ResNet-101 (SImpAl$_{101}$) he2016deep_resnet as the CNN backbone.
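The alternation between the source objective (Eq. 3) and the target objective (Eq. 7) can be pictured with the following schedule sketch. It only illustrates the ordering of mini-batches; the name `alternating_schedule`, the use of even steps for source batches, and the cycling of exhausted batch lists are assumptions of this sketch rather than details given in the text.

```python
from itertools import cycle

def alternating_schedule(source_batches, target_batches, num_steps):
    """Yield (domain, batch) pairs, alternating source and target batches.

    Even steps draw a source batch (source objective), odd steps draw a
    pseudo-labeled target batch (target objective). Batch lists are
    cycled when exhausted.
    """
    src, tgt = cycle(source_batches), cycle(target_batches)
    for step in range(num_steps):
        if step % 2 == 0:
            yield "source", next(src)
        else:
            yield "target", next(tgt)

for domain, batch in alternating_schedule(["s0", "s1"], ["t0"], 4):
    print(domain, batch)
```

In a full training loop, each yielded pair would be followed by one optimizer step on the corresponding loss, with the target agreement rate evaluated at regular intervals.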
Type  Method   Clp   Inf   Pnt   Qdr   Rel   Skt   Avg
MS    MSDA     57.2  24.2  51.6   5.2  61.6  49.6  41.5
MS    SImpAl   66.4  26.5  56.6  18.9  68.0  55.5  48.6
We present the results in Table 1. The results for the prior baselines are reported from peng2019moment and zhu2019aligning_mfsan. Due to the limits of space, we present the full comparison table for DomainNet in the Supplementary.
The Office-31 office dataset has 4652 images across Amazon (A), DSLR (D) and Webcam (W) domains having 31 object classes found in an office environment. The ImageCLEF dataset (http://imageclef.org/2014/adaptation) has been created by selecting 12 shared classes among ImageNet (I) imagenet, Caltech-256 (C) griffin2007caltech, Pascal-VOC 2012 (P) Everingham10Pascal_voc, with 600 images per domain. The Office-Caltech gong2012geodesic dataset consists of 2533 images across 10 classes shared between Caltech-256 (C) and the three domains of Office-31 (A, D, W). Office-Home venkateswara2017deep is a more challenging medium-scale dataset containing about 15588 images in 4 domains: Art (Ar), Clipart (Cl), Product (Pr) and Real-World (Rw), sharing 65 categories of objects found in the office and home environments. The DomainNet peng2019moment dataset is the largest and the most challenging benchmark, containing 6 diverse domains, with 345 classes, and around 0.6 million images.
a) Implicit alignment of features. In Fig. 4a, we plot the t-SNE tsne embeddings of the features at the pre-classifier space (output of the feature extractor) for SImpAl. Further, we calculate the Proxy-$\mathcal{A}$ distance ben2010theory, defined as $\hat{d}_{\mathcal{A}} = 2(1 - 2\epsilon)$, where $\epsilon$ is the generalization error of a domain discriminator. In Fig. 4b, we report the Proxy-$\mathcal{A}$ distance across each source-target pair for 3 different models: 1) the warm-start model, trained on the source domains, 2) the model after adaptation using SImpAl, 3) an oracle model employing SImpAl, where the target pseudo-labels are replaced by the ground-truth labels. This shows that adaptation using SImpAl effectively reduces the distribution-shift in the latent space. Further, we also demonstrate implicit alignment under large domain-shifts (such as Quickdraw and Real-world domains on DomainNet), which enables applications such as cross-domain image retrieval on an unlabeled target domain. See Suppl. for further analysis on implicit alignment.
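For concreteness, the Proxy A-distance can be computed from the predictions of a domain discriminator on held-out features, using the standard form 2(1 - 2 * error). The sketch below assumes binary domain labels (0 for source, 1 for target); the function name `proxy_a_distance` and the choice of discriminator are not specified by the text and are assumptions here.

```python
def proxy_a_distance(domain_preds, domain_labels):
    """Proxy A-distance from a domain discriminator's held-out predictions.

    domain_preds / domain_labels: sequences of 0 (source) or 1 (target).
    An error of 0.5 (chance level) gives 0, meaning the discriminator
    cannot tell the domains apart; an error of 0 gives the maximum, 2.
    """
    errors = sum(p != d for p, d in zip(domain_preds, domain_labels))
    eps = errors / len(domain_labels)
    return 2.0 * (1.0 - 2.0 * eps)

labels = [0, 0, 1, 1]
print(proxy_a_distance([0, 0, 1, 1], labels))  # perfectly separable domains
print(proxy_a_distance([0, 1, 0, 1], labels))  # chance-level discriminator
```

A lower value after adaptation is read as better alignment of the two domains in the feature space.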
b) Extension to category-shift. To present a more practical scenario for MSDA, Xu et al. dctn introduced two category-shift settings, overlap and disjoint, where the source domains contain overlapping label-sets and disjoint label-sets respectively. In such scenarios, it is vital to prevent mis-alignment of different classes across the source domains to avoid negative transfer mada. Furthermore, since prior MSDA approaches learn domain-specific classifiers, they require separate mechanisms to obtain class probabilities for the domain-specific and the shared classes dctn. However, our approach remains unmodified under the presence of category-shift; as such, each classifier learns all the target classes, and the computation of the class probabilities (Eq. 1) remains unchanged. Fig. 4c shows that category-shift is a challenging scenario where all methods show performance degradation; however, SImpAl is found to exhibit a relatively lower degradation in the target performance. This is supported by the observation that even under category-shift, only the shared classes align, as shown in Fig. 5. See Suppl. for further analysis.
c) Target Agreement Rate. Fig. 6a shows the trend in the target agreement rate and target performance as training proceeds. We make two observations. Firstly, we find that the agreement rate increases during training, indicating that the target samples migrate into the classifier agreement region in the latent space. This migration is necessary for a successful adaptation since the source domains inherently fall in the classifier agreement region (due to the nature of the source training for warm-start). Secondly, a correspondence between the convergence of the target agreement rate and the target accuracy is seen, which validates our hypothesis that the agreement rate can be used as a cue to determine the training convergence. This result is of interest in Unsupervised Domain Adaptation methods, where the requirement of target labels has been the de facto standard for model selection.
d) Do the classifiers agree on correct pseudo-labels? We also calculate the classifier agreement (and disagreement) for target samples that are pseudo-labeled correctly. Notably, Fig. 7a demonstrates that the classifiers tend to agree on an increasing number of target samples with correct pseudo-label predictions. This motivates the periodic update of the selected target subset (Lines 12-13 in Algo. 1), which captures an increasing number of target samples with correct pseudo-labels, as the adaptation proceeds.
e) How accurate are target pseudo-labels? As described in Sec. 3.2, we use classifier agreement to select target samples with a higher pseudo-label accuracy. In Fig. 6b, we plot the accuracy of pseudo-labels separately for target samples having classifier agreement and disagreement. Clearly, pseudo-labels are more accurate (more reliable) when the classifiers agree. Further, the accuracy on the target samples with agreement is higher than the accuracy on all target samples (orange curve in Fig. 6b). Thus, the use of the agreed subset, with its higher accuracy in pseudo-labels, plays a key role in gradually improving the target performance.
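The comparison plotted in Fig. 6b amounts to splitting pseudo-label accuracy by the agreement flag. A minimal sketch of that bookkeeping follows; it requires ground-truth target labels and is therefore an analysis tool only, not part of the adaptation procedure. The record layout `(agree_flag, pseudo_label, true_label)` and the function name are hypothetical.

```python
def accuracy_by_agreement(records):
    """Pseudo-label accuracy for agreed, disagreed and all target samples.

    records: list of (agree_flag, pseudo_label, true_label) tuples.
    Returns a dict; a group with no samples maps to None.
    """
    def acc(rows):
        if not rows:
            return None
        return sum(p == y for _, p, y in rows) / len(rows)

    agreed = [r for r in records if r[0] == 1]
    disagreed = [r for r in records if r[0] == 0]
    return {"agree": acc(agreed), "disagree": acc(disagreed), "all": acc(records)}

records = [(1, 0, 0), (1, 1, 1), (1, 2, 0), (0, 1, 2), (0, 0, 0)]
print(accuracy_by_agreement(records))
```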
f) Using curriculum for target samples. We form a curriculum for the target samples using the average classifier margin as a weight. Fig. 7c shows the geometrical interpretation of the margin, which measures how far into the agreement region a target sample falls. Thus, the margin can be seen as a measure of the confidence in the prediction. As studied by prior methods mixtureofexperts; ruder2017knowledgeadaptation; sohn2020fixmatch, high confidence predictions are often correct. We show this in Fig. 7b where we plot the precision of target pseudo-labels at various confidence percentiles (in descending order of margin). The accuracy shows a decreasing trend as the margin decreases, validating our hypothesis that the margin yields an easy-to-hard curriculum. Although our framework supports confidence thresholding to further minimize the pseudo-label noise, we do not employ thresholds for the main results (Table 1) as it introduces sensitive hyperparameters. See Supplementary for an empirical analysis with confidence thresholding.
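The precision-at-percentile analysis can be reproduced in a few lines once margins and correctness flags are available. The sketch below ranks samples by margin in descending order and reports the pseudo-label precision within the top p percent; the percentile grid, the rounding rule (ceiling) and the function name are assumptions made for illustration.

```python
import math

def precision_at_percentiles(records, percentiles=(25, 50, 100)):
    """Pseudo-label precision among the top-p% most confident samples.

    records: list of (margin, is_correct) pairs.
    Samples are ranked by margin in descending order.
    """
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    result = {}
    for p in percentiles:
        k = max(1, math.ceil(len(ranked) * p / 100))
        result[p] = sum(1 for _, ok in ranked[:k] if ok) / k
    return result

records = [(4.0, True), (1.0, True), (3.0, True), (2.0, False)]
print(precision_at_percentiles(records))
```

A precision that falls as p grows indicates that high-margin samples are the more reliable ones, which is the property the easy-to-hard ordering relies on.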
In this paper, we demonstrated Self-supervised Implicit Alignment (SImpAl), which serves as a simple method to perform Multi-Source Domain Adaptation (MSDA). We observed that deep models exhibit the potential to implicitly align features under label supervision, even in the presence of domain-shift. We demonstrated three uses of classifier agreement in SImpAl: to obtain pseudo-labeled target samples, to perform latent space alignment and to determine the training convergence. Extensive empirical analysis demonstrates the efficacy of SImpAl for MSDA.
Our work can facilitate the study of simple and effective algorithms for unsupervised domain adaptation. The insights obtained from our study can be used to explain the efficacy of a number of related self-supervised approaches. A potential direction of research is to develop efficient adaptation algorithms that are devoid of sensitive hyperparameters. Exploring SImpAl for scenarios such as Universal Domain Adaptation UDA_2019_CVPR would also be of future interest.
This work presents a simple and effective solution for Multi-Source Domain Adaptation, that has a two-fold positive impact. First, the method is aimed at improving the performance of prediction models by mitigating the bias caused by domain-shift between the training dataset and the test data encountered when deployed in a real-world environment. This is of growing interest in the machine learning community. Secondly, the insights presented in this work facilitate the study of efficient methods to perform domain adaptation, motivating the innovation of, for instance, energy-efficient methods to generalize deep models. While the method shows promising results under domain-shift, one should be cautious of the use of the pseudo-labeling procedure in the presence of adversarial samples, where the pseudo-labels may be less reliable and may result in performance degradation.
This work was supported by a project grant from MeitY (No.4(16)/2019-ITEA), Govt. of India and a WIRIN project. We would also like to thank the anonymous reviewers for their valuable suggestions.