1 Introduction
The task of supervised learning for classification rests on the assumption that the training data and the test data are sampled from the same distribution. Consequently, supervised learning methods achieve state-of-the-art results when evaluated on popular benchmarks such as ImageNet imagenet. However, when such models are deployed in the real world, they yield suboptimal results due to the inherent distribution-shift (domain-shift torralba2011unbiased) between the training data and the real-world environment (a.k.a. the target domain). While it is possible to obtain unlabeled samples from the target domain in most cases, the huge cost of data annotation prohibits the creation of a reliable labeled training dataset. To this end, Unsupervised Domain Adaptation (DA) methods have been proposed that aim to transfer knowledge from a labeled "source" dataset to an unlabeled "target" dataset under a domain-shift.

A popular strategy in Unsupervised DA is to learn the task-specific knowledge using supervision from the labeled source dataset, while learning a domain-invariant latent space where the features of the source and the target domains align. Such an alignment is enforced using statistical discrepancy minimization schemes JDA_Dist_Match; gong2012geodesic; pan2010domain_tca; peng2019moment; sun2016deepcoral, via an adversarial objective ganin2016domain; long2018conditional; tzeng2017adversarial; dctn; zhao2018adversarial, or by employing domain-specific transformations dsbn; adabn; featurewhiteningandconsensusloss. This alignment minimizes the domain-shift in the latent space and improves target generalization. However, the performance of Single-Source Domain Adaptation (SSDA) methods is largely determined by the choice of the source dataset kundu2020towards.
Recently, Multi-Source Domain Adaptation (MSDA) mansour2012multiple; zhao2020multi has garnered interest, wherein multiple labeled source domains are used to transfer the task knowledge to the unlabeled target domain. A common approach mixtureofexperts; peng2019moment; dctn is to learn a shared feature extractor, along with domain-specific classifier modules (Fig. 1a), which yield an ensemble prediction for the target samples. However, an additional challenge in MSDA is to tackle the domain-shift and category-shift dctn between each pair of source domains (Fig. 1b). To this end, auxiliary losses are enforced, encouraging the model to learn domain-invariant but class-discriminative representations. Ultimately, an appropriate alignment of all the domains in the latent space peng2019moment improves the generalization on the target domain (Fig. 1b).
In this work, we approach the MSDA problem from a different perspective. Since deep models are known to capture rich transferable representations long2015learning; oquab2014learning; howtransferable, we ask: is an auxiliary feature alignment loss really necessary? The motivation stems from the observation that deep models exhibit a strong inductive bias to implicitly align the latent features under supervision. This is demonstrated in Fig. 2. Following the prior approaches peng2019moment; dctn, we train domain-specific classifiers (Fig. 1b) and observe that the domains do not align in the latent space (Fig. 2a), which calls for an explicit feature alignment loss. However, when we enforce a classifier agreement on the class label for each input instance (Fig. 2b), we find that the domains tend to align without requiring an explicit alignment loss.
This motivates us to further explore implicit alignment of latent features for MSDA. We aim to leverage the labeled data from multiple source domains and the multi-classifier setup (Fig. 1a) employed in MSDA to perform alignment, without incorporating auxiliary components such as a domain discriminator dctn; zhao2018adversarial. In contrast to learning domain-specific classifier modules, we enforce an agreement among the classifiers (Fig. 1c) to align the domains in the latent space.
Since the target domain is unlabeled, we resort to the class labels predicted by the model being trained (a.k.a. pseudo-labels lee2013pseudo). The adaptation step encourages the classifiers to agree upon these pseudo-labels, which enables alignment of the target features with the source features that have classifier agreement owing to label supervision. Accordingly, we name the approach Self-supervised Implicit Alignment, abbreviated as SImpAl (pronounced "simple"). We observe that even under category-shift, implicit alignment can be leveraged to align the shared categories, without requiring additional components (e.g. fine-grained alignment cao2018partial1; kundu2020class; mada, or an adversarial discriminator dctn) or cumbersome training strategies (e.g. to handle arbitrary category-shifts kundu_cvpr_2020; dctn; UDA_2019_CVPR). We also find that classifier agreement can be leveraged as a cue to determine adaptation convergence.
To summarize, we demonstrate successful MSDA by leveraging the implicit alignment exhibited by deep classifiers, corroborating the potential for designing simple and effective adaptation algorithms. We conduct extensive evaluation of our approach over five benchmark datasets, with two popular CNN backbones (ResNet-50 and ResNet-101 he2016deep_resnet), and derive insights from the empirical analysis.
2 Related Work
Here, we briefly review the related works and refer the reader to zhao2020multi for an extensive survey.
a) Single-Source Domain Adaptation (SSDA). Motivated by the seminal work of Ben-David et al. ben2010theory; ben2007analysis, a large number of SSDA methods dsbn; NormalAdaptfirstBackprop; ganin2016domain; gong2012geodesic; sta_open_set; long2015learning; long2016unsupervised have been proposed that aim to learn domain-agnostic but class-discriminative representations. Inspired by the GAN framework goodfellow2014generative, a popular strategy is to employ adversarial learning hoffman2018cycada; adadepth; saito2018open; sankaranarayanan2018generate; tzeng2015simultaneous; tzeng2017adversarial; tzeng2014deep that aims to confuse a domain discriminator, thereby aligning the latent features of the two domains. Saito et al. saito2018maximum formulate an adversarial objective employing classifier discrepancy. In contrast, we aim to study a simpler approach which circumvents the training difficulties encountered in adversarial learning paradigms. Recently, consistency based regularizers chen2019crdoco; kundu2019_um_adapt; murez2018image; adadepth were proposed for domain adaptation. In our work, classifier agreement can be interpreted as a form of consistency at the output space, which acts both as an implicit regularizer and as a means to perform latent space alignment for adaptation.
b) Multi-Source Domain Adaptation (MSDA). Several methods mixtureofexperts; peng2019moment; dctn; zhu2019aligning_mfsan learn domain-specific classifier modules and obtain a weighted ensemble prediction for the target samples, motivated by the distribution weighted combining rule hoffman2018algorithms; mansour2009domain; mansour2012multiple. Zhu et al. zhu2019aligning_mfsan employ an alignment loss between each source-target pair in domain-specific feature spaces. In addition, Peng et al. peng2019moment align each pair of source domains using kernel based moment matching, and also propose a variant based on adversarial learning saito2018maximum. Xu et al. dctn employ multiple domain discriminators to achieve latent space alignment. In this work, we aim to explore a simple adaptation scheme that leverages implicit alignment in deep models. As a result, our approach is applicable even under category-shift among the source domains, while most prior methods mixtureofexperts; peng2019moment; zhu2019aligning_mfsan consider only a shared category set.

c) Self-training methods. Pseudo-labeling lee2013pseudo
is a popular semi-supervised learning approach in which "pseudo" class labels are assigned to unlabeled samples, typically using classifier confidence chen2011co; saito_asymmetric_tritraining; dctn; zou2019confidence; zou2018unsupervised or nearest neighbor assignment kundu2020class; pan2019transferrable; saito2020universal; zhang2019category, and the model is retrained using such samples. Confidence thresholding li2019bidirectional; saito_asymmetric_tritraining; dctn is commonly applied to minimize the noise in pseudo-labels. This introduces a sensitive threshold hyperparameter, requiring labeled target samples or domain expertise for precise tuning. Works such as Zou et al. zou2019confidence; zou2018unsupervised, Li et al. li2019bidirectional and Chen et al. chen2019crdoco propose various regularizers to improve pseudo-label predictions. Xu et al. dctn incorporate an adversarial alignment loss to mitigate the performance degradation arising from noisy pseudo-labels. In contrast, we aim to exploit classifier agreement to perform adaptation and improve the reliability of pseudo-labels without introducing additional hyperparameters.

3 Self-supervised Implicit Alignment (SImpAl)
Notations. Let $\mathcal{X}$ and $\mathcal{Y}$ denote the input and the output spaces. We consider $n$ labeled source domain datasets $\{\mathcal{D}_{s_j}\}_{j=1}^{n}$, where $\mathcal{D}_{s_j} = \{(\mathbf{x}_i^{s_j}, y_i^{s_j})\}_i$, and a single unlabeled target domain dataset $\mathcal{D}_t = \{\mathbf{x}_i^t\}_i$. Each source domain has a label-set $\mathcal{C}_{s_j}$, and the target label-set is defined as $\mathcal{C}_t = \bigcup_{j=1}^{n} \mathcal{C}_{s_j}$ with $|\mathcal{C}_t| = K$ classes. We learn a deep neural network model having a CNN based feature extractor $F$, and $n$ classifier modules $\{C_j\}_{j=1}^{n}$. For convenience, we denote the output of the network as a matrix $\mathbf{Z} \in \mathbb{R}^{n \times K}$, whose $j$-th row is the logit vector $\mathbf{z}_j = C_j \circ F(\mathbf{x})$, where $\circ$ represents function composition; $\mathbf{Z}$ is obtained by stacking the logits produced by each classifier (see Fig. 3).

Overview. As is conventional in MSDA methods mixtureofexperts; dctn, the multi-classifier setup is treated as an ensemble of diverse classifiers, and the class probabilities are obtained through a convex combination of each classifier's prediction. The model is first trained with the categorical cross-entropy loss imposed on the combined data from all source domains. After this "warm-start", we introduce pseudo-labeled target samples into the training process. The adaptation is performed by enforcing the classifiers to agree on these pseudo-labels. We now describe the approach in detail.
3.1 Warm-start with source domains
To adapt the network to the target domain, we use pseudo-labeled target samples. Thus, we first aim to obtain reliable pseudo-labels by training the model on all source domains. We call this the warm-start process, which is performed as follows.
a) Learning with source domains. For each source-domain instance $(\mathbf{x}, y)$, we obtain the output matrix $\mathbf{Z}$ (see Fig. 3) and define the class probability vector $\hat{\mathbf{y}}$ as a convex combination of the probabilities assigned by each classifier,

$$\hat{\mathbf{y}} = \frac{1}{n} \sum_{j=1}^{n} \sigma(\mathbf{z}_j) \qquad (1)$$

where $\mathbf{z}_j$ represents the $j$-th row vector of the matrix $\mathbf{Z}$ (i.e. the logits of the $j$-th classifier), and $\sigma$ is the softmax function. Treating $\hat{\mathbf{y}}$ as the class probability vector, we minimize the categorical cross-entropy loss ($\ell_{ce}$) using the labeled source samples,
$$\ell_{ce}(\mathbf{x}, y) = -\log\Big(\frac{1}{n}\sum_{j=1}^{n} \sigma(\mathbf{z}_j)_y\Big) \;\leq\; -\frac{1}{n}\sum_{j=1}^{n} \log\,\sigma(\mathbf{z}_j)_y \qquad (2)$$

The last term in Eq. 2 represents an upper bound for the categorical cross-entropy loss of the ensemble, obtained by applying Jensen's inequality for convex functions. We consider this formulation to drive the classifiers to agree upon the label $y$ for $\mathbf{x}$. Thus, the training objective is,

$$\min_{F,\,\{C_j\}_{j=1}^{n}} \;\; \mathbb{E}_{(\mathbf{x},\, y) \,\sim\, \cup_j \mathcal{D}_{s_j}} \Big[-\frac{1}{n}\sum_{j=1}^{n} \log\,\sigma(\mathbf{z}_j)_y\Big] \qquad (3)$$
The objective in Eq. 3 is minimized by mini-batch stochastic optimization. Each mini-batch contains an equal number of samples from each source domain. In practice, each classifier is given a distinct random initialization, and is trained on the same set of training samples in each mini-batch. Intuitively, this process gradually enables a higher degree of similarity among the classifiers (Fig. 1c) through an agreement on the predicted class labels for source samples. Note that both the feature extractor and the multi-classifier module are shared across all source domains. This step provides a warm-start for introducing pseudo-labeled target samples into the training.
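The computation in Eqs. 1-3 can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's actual PyTorch code; `Z` holds one row of logits per classifier, and the equal combination weights follow Eq. 1:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_probs(Z):
    """Eq. 1: equal-weight convex combination of the n classifiers'
    softmax outputs. Z has shape (n, K): one row of logits per classifier."""
    return softmax(Z).mean(axis=0)

def ensemble_ce(Z, y):
    """Left side of Eq. 2: cross-entropy of the ensemble prediction."""
    return -np.log(ensemble_probs(Z)[y])

def warmstart_loss(Z, y):
    """Right side of Eq. 2 (the Jensen upper bound), used as the training
    objective in Eq. 3: the average per-classifier cross-entropy. Minimizing
    it drives every classifier toward the label y, encouraging agreement."""
    return -np.log(softmax(Z)[:, y]).mean()

# Two classifiers over three classes, both favouring class 0.
Z = np.array([[4.0, 1.0, 0.0],
              [3.0, 2.0, 0.0]])
```

By Jensen's inequality, `warmstart_loss(Z, y)` upper-bounds `ensemble_ce(Z, y)` for any logits, with equality when all classifiers produce identical probability vectors; minimizing the bound therefore also pushes the classifiers toward agreement.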
b) Determining the convergence of warm-start. The next question we address is: how do we tell whether the model is trained sufficiently for the target domain? Intuitively, we would like to train the model until the target (pseudo-label) accuracy saturates. However, with unlabeled target samples, measuring the pseudo-label accuracy is out of bounds. Thus, we propose classifier agreement as a criterion to determine convergence. The classifier agreement for an instance $\mathbf{x}$ is defined as,

$$a(\mathbf{x}) = \prod_{j=2}^{n} \mathbb{1}\big[\hat{y}_1 = \hat{y}_j\big] \qquad (4)$$

where $\hat{y}_j = \operatorname{argmax}_c\,(\mathbf{z}_j)_c$, and $\mathbb{1}[\cdot]$ is the indicator function that returns $1$ when the condition is true, else returns $0$. Intuitively, when each classifier predicts the same class for a given sample $\mathbf{x}$, we say that the classifiers "agree". Thus, $a(\mathbf{x}) = 1$ when the classifiers agree, and $a(\mathbf{x}) = 0$ otherwise.
As we shall show in Sec. 4.2, the target pseudo-label accuracy is higher whenever the classifiers agree nguyen2019self; yu2019does. Thus, classifier agreement is used to filter out target samples having a higher degree of noise in their pseudo-labels. Further, we estimate the fraction of target samples for which the classifiers agree on the class prediction, defining the target agreement rate as,

$$\rho = \frac{1}{|\mathcal{D}_t|} \sum_{\mathbf{x} \in \mathcal{D}_t} a(\mathbf{x}) \qquad (5)$$

We hypothesize that the performance on target samples saturates when the agreement rate converges. Thus, we determine the warm-start interval based on the convergence of the agreement rate.
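Since Eq. 4 reduces to comparing per-classifier argmax predictions, and Eq. 5 to averaging the resulting indicators, both can be sketched directly. The helper names below are ours, not the paper's:

```python
import numpy as np

def agreement(Z):
    """Eq. 4: a(x) = 1 iff all n classifiers assign the same argmax class
    to the sample whose logit matrix is Z (shape (n, K))."""
    preds = Z.argmax(axis=-1)            # per-classifier predicted class
    return int((preds == preds[0]).all())

def agreement_rate(logit_matrices):
    """Eq. 5: fraction of target samples whose classifiers agree."""
    return float(np.mean([agreement(Z) for Z in logit_matrices]))

# One sample where both classifiers pick class 0, one where they disagree.
agree_sample    = np.array([[4.0, 1.0], [3.0, 2.0]])
disagree_sample = np.array([[4.0, 1.0], [2.0, 3.0]])
```

In practice the agreement rate would be recomputed periodically over the whole (unlabeled) target set, and training stopped once it stops changing.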
3.2 Introducing target data
After the warm-start, we introduce target samples into the training process. The pseudo-label for a target sample is obtained from the ensemble prediction in Eq. 1, i.e. $\hat{y} = \operatorname{argmax}_c\,\hat{\mathbf{y}}_c$.
We consider the following strategy for pseudo-labeling. To begin with, we select only those target samples for which there is classifier agreement, since the pseudo-labels are seen to be more accurate for such samples (verified in Sec. 4.2). Thus, we obtain a subset $\mathcal{D}_t' = \{\mathbf{x} \in \mathcal{D}_t : a(\mathbf{x}) = 1\}$. Secondly, inspired by curriculum learning bengio2009curriculum; zou2018unsupervised, we form an easy-to-hard sampling strategy for $\mathcal{D}_t'$. For this purpose, we obtain the average classifier margin as a weight $w$ for each target instance,

$$w(\mathbf{x}) = \frac{1}{n} \sum_{j=1}^{n} \big[(\mathbf{z}_j)_{c_1} - (\mathbf{z}_j)_{c_2}\big] \qquad (6)$$

where $c_1$ and $c_2$ correspond to the indices of the highest and the second highest logit of $\mathbf{z}_j$. Intuitively, $w$ measures a form of confidence in the prediction. Target samples that are farther from the decision boundaries receive a higher $w$ (see Fig. 7c for the geometrical interpretation). We show in Sec. 4.2 that, in general, samples with a higher $w$ are more likely to possess correct pseudo-labels. Thus, target samples are sorted by $w$ and fed to the training pipeline in decreasing order of $w$. Finally, the pseudo-labels on $\mathcal{D}_t'$ are updated periodically, at a fixed epoch interval. With this strategy, we formalize the training objective for adaptation using the target samples as,
$$\min_{F,\,\{C_j\}_{j=1}^{n}} \;\; \mathbb{E}_{\mathbf{x} \,\sim\, \mathcal{D}_t'} \Big[-\frac{1}{n}\sum_{j=1}^{n} \log\,\sigma(\mathbf{z}_j)_{\hat{y}}\Big] \qquad (7)$$
After introducing the target samples, we train on both source and target samples in alternate mini-batches, i.e. we minimize the objectives in Eq. 3 and Eq. 7 on alternate mini-batches. Finally, the network is trained until the target agreement rate converges. This yields a simple and effective adaptation pipeline based on implicit alignment. The complete procedure is given in Algo. 1.
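The target-side procedure above, weighting samples by the margin of Eq. 6, sorting them easy-to-hard, and alternating source (Eq. 3) and target (Eq. 7) mini-batches, can be sketched as follows. The step callbacks and batch iterators are hypothetical stand-ins for the actual training pipeline:

```python
import numpy as np

def avg_margin(Z):
    """Eq. 6: mean over the n classifiers of (highest - second highest) logit
    for the sample whose logit matrix is Z (shape (n, K))."""
    top2 = np.sort(Z, axis=-1)[:, -2:]          # two largest logits per row
    return float((top2[:, 1] - top2[:, 0]).mean())

def curriculum_order(logit_matrices):
    """Indices of target samples sorted by decreasing margin (easy-to-hard)."""
    w = np.array([avg_margin(Z) for Z in logit_matrices])
    return list(np.argsort(-w))

def adapt(num_steps, source_batches, target_batches, source_step, target_step):
    """Alternate the source objective (Eq. 3) and the pseudo-label
    objective (Eq. 7) on successive mini-batches."""
    for t in range(num_steps):
        if t % 2 == 0:
            source_step(next(source_batches))   # labeled source mini-batch
        else:
            target_step(next(target_batches))   # pseudo-labeled target mini-batch
```

A real implementation would also re-run pseudo-labeling and re-sort the curriculum at the periodic update interval, and monitor the target agreement rate for convergence.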
4 Experiments
We present the results of our approach on five standard benchmark datasets: Office-Caltech, ImageCLEF, Office-31, Office-Home and the most challenging large-scale benchmark, DomainNet.
a) Prior Arts. We compare against Deep Domain Confusion (DDC) tzeng2014deep, Deep Adaptation Network (DAN) long2015learning, Deep CORAL (DCORAL) sun2016deepcoral, Reverse Gradient (RevGrad) NormalAdaptfirstBackprop, Residual Transfer Network (RTN) long2016unsupervised, Joint Adaptation Network (JAN) long2017deepJAN, Maximum Classifier Discrepancy (MCD) saito2018maximum, Manifold Embedded Distribution Alignment (MEDA) wang2018visual, Adversarial Discriminative Domain Adaptation (ADDA) tzeng2017adversarial, Deep Cocktail Network (DCTN) dctn, Moment Matching (MSDA) peng2019moment and Multiple Feature Space Adaptation Network (MFSAN) zhu2019aligning_mfsan. Specifically, RevGrad, ADDA, MCD and DCTN use an adversarial alignment objective to perform adaptation, RTN learns a residual function to bridge the distribution discrepancy, and DDC, DAN, MFSAN, DCORAL, JAN, MEDA and MSDA employ statistical discrepancy minimization (e.g. kernel based moment matching) to align the domains.
b) Evaluation. For ImageCLEF and the Office-based datasets, we follow the evaluation protocol in MFSAN zhu2019aligning_mfsan, while for DomainNet, we follow the protocol used in MSDA peng2019moment. Three types of baselines are considered: 1) Single Best (SB) refers to the best single-source transfer result for the target domain, 2) Source Combine (SC) refers to the scenario where all sources are combined into a single source domain to perform SSDA, and 3) Multi-Source (MS) refers to the MSDA methods. We report multi-run statistics (mean and standard deviation) obtained over three different runs.
c) Implementation Details. We implement our approach in PyTorch NEURIPS2019_9015_pytorch. We use the Adam adam optimizer for stochastic optimization, with a fixed learning rate and weight decay. The losses in Eq. 3 and Eq. 7 are alternately optimized, and the target agreement rate (Eq. 5) is periodically monitored for convergence. The target pseudo-labels are updated at a fixed rate (line 12 in Algo. 1). The total number of training iterations is decided based on the convergence of the target agreement rate for each dataset. Following prior MSDA approaches zhu2019aligning_mfsan; peng2019moment, we use ResNet-50 (SImpAl$_{50}$) and ResNet-101 (SImpAl$_{101}$) he2016deep_resnet as the CNN backbone.


E. DomainNet (accuracy %)

Method        Clp   Inf   Pnt   Qdr   Rel   Skt   Avg
MSDA (MS)     57.2  24.2  51.6  5.2   61.6  49.6  41.5
SImpAl (MS)   66.4  26.5  56.6  18.9  68.0  55.5  48.6
4.1 Results
We present the results in Table 1. The results for the prior baselines are reported from peng2019moment and zhu2019aligning_mfsan. Due to space limits, we present the full comparison table for DomainNet in the Supplementary.
Office-31 office dataset has 4652 images across the Amazon (A), DSLR (D) and Webcam (W) domains, covering 31 object classes found in an office environment. ImageCLEF (http://imageclef.org/2014/adaptation) dataset was created by selecting 12 classes shared among ImageNet (I) imagenet, Caltech-256 (C) griffin2007caltech and PascalVOC 2012 (P) Everingham10Pascal_voc, with 600 images per domain. Office-Caltech gong2012geodesic dataset consists of 2533 images across 10 classes shared between Caltech-256 (C) and the three domains of Office-31 (A, D, W). Office-Home venkateswara2017deep is a more challenging medium-scale dataset containing about 15588 images in 4 domains, Art (Ar), Clipart (Cl), Product (Pr) and Real-World (Rw), sharing 65 categories of objects found in office and home environments. DomainNet peng2019moment is the largest and most challenging benchmark, containing 6 diverse domains, 345 classes, and around 0.6 million images.
4.2 Analysis
a) Implicit alignment of features. In Fig. 4a, we plot the t-SNE tsne embeddings of the features at the pre-classifier space (output of $F$) for SImpAl. Further, we calculate the Proxy $\mathcal{A}$-distance ben2010theory, defined as $d_{\mathcal{A}} = 2(1 - 2\epsilon)$, where $\epsilon$ is the generalization error of a domain discriminator. In Fig. 4b, we report the $d_{\mathcal{A}}$ value for each source-target pair across 3 different models: 1) the warm-start model, trained only on the source domains, 2) the model after adaptation using SImpAl, and 3) an oracle model employing SImpAl, where the target pseudo-labels are replaced by the ground-truth labels. This shows that adaptation using SImpAl effectively reduces the distribution-shift in the latent space. Further, we also demonstrate implicit alignment under large domain-shifts (such as the Quickdraw and Real domains of DomainNet), which enables applications such as cross-domain image retrieval on an unlabeled target domain. See Suppl. for further analysis on implicit alignment.
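As a concrete reference, the proxy $\mathcal{A}$-distance is a one-line computation from the discriminator's generalization error, following the standard estimator of ben2010theory (the function name is ours):

```python
def proxy_a_distance(discriminator_error):
    """Proxy A-distance d_A = 2 * (1 - 2 * eps), where eps is the
    generalization error of a binary source-vs-target domain discriminator.
    A chance-level discriminator (eps = 0.5) gives d_A = 0, i.e. the two
    feature distributions are indistinguishable in the latent space."""
    return 2.0 * (1.0 - 2.0 * discriminator_error)
```

A lower value after adaptation than after warm-start thus directly quantifies the reduction in latent-space domain-shift reported in Fig. 4b.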
b) Extension to category-shift. To present a more practical scenario for MSDA, dctn introduced two category-shift settings, overlap and disjoint, where the source domains contain overlapping label-sets (i.e. $\mathcal{C}_{s_i} \cap \mathcal{C}_{s_j} \neq \emptyset$, but $\mathcal{C}_{s_i} \neq \mathcal{C}_{s_j}$) and disjoint label-sets ($\mathcal{C}_{s_i} \cap \mathcal{C}_{s_j} = \emptyset$), respectively. In such scenarios, it is vital to prevent misalignment of different classes across the source domains, to avoid negative transfer mada. Furthermore, since prior MSDA approaches learn domain-specific classifiers, they require separate mechanisms to obtain class probabilities for the domain-specific and the shared classes dctn. In contrast, our approach remains unmodified in the presence of category-shift: each classifier learns all the target classes, and the computation of the class probabilities (Eq. 1) remains unchanged. Fig. 4c shows that category-shift is a challenging scenario where all methods exhibit performance degradation; however, SImpAl exhibits a relatively lower degradation in target performance. This is supported by the observation that, even under category-shift, only the shared classes align, as shown in Fig. 5. See Suppl. for further analysis.
c) Target Agreement Rate. Fig. 6a shows the trend of the target agreement rate (Eq. 5) and the target performance as training proceeds. We make two observations. First, the agreement rate increases during training, indicating that the target samples migrate into the classifier agreement region of the latent space. This migration is necessary for successful adaptation, since the source domains inherently fall in the classifier agreement region (owing to the nature of the source training during warm-start). Second, we observe a correspondence between the convergence of the target agreement rate and that of the target accuracy, which validates our hypothesis that the agreement rate can be used as a cue to determine training convergence. This result is of interest for Unsupervised Domain Adaptation, where the requirement of target labels has been the de facto standard for model selection.
d) Do the classifiers agree on correct pseudo-labels? We also measure the classifier agreement (and disagreement) for target samples that are pseudo-labeled correctly. Notably, Fig. 7a demonstrates that the classifiers tend to agree on an increasing number of target samples with correct pseudo-label predictions. This motivates the periodic update of the pseudo-labeled target subset (lines 12-13 in Algo. 1), which captures an increasing number of target samples with correct pseudo-labels as the adaptation proceeds.
e) How accurate are the target pseudo-labels? As described in Sec. 3.2, we use classifier agreement to select target samples with a higher pseudo-label accuracy. In Fig. 6b, we plot the accuracy of pseudo-labels separately for target samples with classifier agreement and with disagreement. Clearly, pseudo-labels are more accurate (more reliable) when the classifiers agree. Further, the accuracy on the target samples with agreement is higher than the accuracy over all target samples (orange curve in Fig. 6b). Thus, the use of the agreement subset, with its higher pseudo-label accuracy, plays a key role in gradually improving the target performance.
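The comparison in Fig. 6b amounts to splitting pseudo-label accuracy by the agreement indicator of Eq. 4; a minimal sketch with illustrative names (a real evaluation would use the model's predictions and held-out ground truth):

```python
import numpy as np

def accuracy_by_agreement(pseudo_labels, true_labels, agree):
    """Pseudo-label accuracy split by classifier agreement (as in Fig. 6b):
    returns (accuracy where a(x) = 1, accuracy where a(x) = 0)."""
    pseudo = np.asarray(pseudo_labels)
    true = np.asarray(true_labels)
    mask = np.asarray(agree).astype(bool)

    def acc(m):
        # Accuracy over the masked subset; NaN if the subset is empty.
        return float((pseudo[m] == true[m]).mean()) if m.any() else float("nan")

    return acc(mask), acc(~mask)
```

Note that the ground-truth labels are used only for this analysis, never during adaptation.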
f) Using a curriculum for target samples. We form a curriculum for the target samples using the average classifier margin (Eq. 6) as a weight. Fig. 7c shows the geometrical interpretation of the margin, which measures how far into the agreement region a target sample falls. Thus, the margin can be seen as a measure of confidence in the prediction. As studied in prior methods mixtureofexperts; ruder2017knowledgeadaptation; sohn2020fixmatch, high confidence predictions are often correct. We show this in Fig. 7b, where we plot the precision of target pseudo-labels at various confidence percentiles (in descending order of the margin). The accuracy shows a decreasing trend with the percentile, validating our hypothesis that the margin yields an easy-to-hard curriculum. Although our framework supports confidence thresholding to further minimize pseudo-label noise, we do not employ thresholds for the main results (Table 1), as they introduce sensitive hyperparameters. See Supplementary for an empirical analysis with confidence thresholding.
5 Conclusion
In this paper, we presented Self-supervised Implicit Alignment (SImpAl), a simple method to perform Multi-Source Domain Adaptation (MSDA). We observed that deep models exhibit the potential to implicitly align features under label supervision, even in the presence of domain-shift. We demonstrated the use of classifier agreement in SImpAl to obtain pseudo-labeled target samples, to perform latent space alignment, and to determine training convergence. Extensive empirical analysis demonstrates the efficacy of SImpAl for MSDA.
Our work can facilitate the study of simple and effective algorithms for unsupervised domain adaptation. The insights obtained from our study can be used to explain the efficacy of a number of related self-supervised approaches. A potential direction of research is to develop efficient adaptation algorithms that are devoid of sensitive hyperparameters. Exploring SImpAl for scenarios such as Universal Domain Adaptation UDA_2019_CVPR would also be of future interest.
Broader Impact
This work presents a simple and effective solution for Multi-Source Domain Adaptation, with a twofold positive impact. First, the method is aimed at improving the performance of prediction models by mitigating the bias caused by domain-shift between the training dataset and the test data encountered when deployed in a real-world environment. This is of growing interest in the machine learning community. Second, the insights presented in this work facilitate the study of efficient methods to perform domain adaptation, motivating the innovation of, for instance, energy-efficient methods to generalize deep models. While the method shows promising results under domain-shift, one should be cautious of the use of the pseudo-labeling procedure in the presence of adversarial samples, where the pseudo-labels may be less reliable and may result in performance degradation.
This work was supported by a project grant from MeitY (No.4(16)/2019ITEA), Govt. of India and a WIRIN project. We would also like to thank the anonymous reviewers for their valuable suggestions.